is R wrong? (boxplot)

Dear Matplotlib gurus,

Following the code that demonstrates the recent(ish) fix for whiskers in boxplots
(https://github.com/matplotlib/matplotlib/pull/1855), I have compared it against
R's boxplot. The descriptions seem to correspond, and all the percentiles are the
same in numpy and R (3.0.1), but R's boxplot seems to have an extended IQR box and
still has an upper whisker (at 9000, which is not within
75% + 1.5*IQR) when it shouldn't:
http://nbviewer.ipython.org/url/www.onerussian.com/tmp/boxplot-Python-vs-R.ipynb

Is R's plot incorrect, or am I missing something (e.g. a documented feature
in R's boxplot) that warrants such a difference?
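
For reference, the whisker rule in question can be sketched as follows. This is only an illustration on made-up numbers (the notebook's data are not reproduced here), and the variable names are illustrative rather than taken from matplotlib or R:

```python
import numpy as np

# Made-up sample with one extreme value (NOT the data from the notebook).
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9000], dtype=float)

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Fences: the whiskers may not extend beyond these limits.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Each whisker ends at the most extreme data point still inside its fence.
whisker_low = data[data >= lower_fence].min()
whisker_high = data[data <= upper_fence].max()

# Anything beyond the whiskers is drawn as an individual flier.
fliers = data[(data < whisker_low) | (data > whisker_high)]

print(f"q1={q1}, median={median}, q3={q3}, iqr={iqr}")
print(f"upper fence={upper_fence}, upper whisker={whisker_high}")
print(f"fliers={fliers}")
```

Under this rule the extreme value ends up among the fliers, and the upper whisker stops at the largest point below the upper fence.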

Thanks in advance

--
Yaroslav O. Halchenko, Ph.D.
http://neuro.debian.net http://www.pymvpa.org http://www.fail2ban.org
Senior Research Associate, Psychological and Brain Sciences Dept.
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
Phone: +1 (603) 646-9834 Fax: +1 (603) 646-1419
WWW: http://www.linkedin.com/in/yarik

Hey Yaroslav,

As the author of the fix and the recent overhaul to boxplots, I can say with certainty that R is wrong! :wink:

More seriously, the main thing I take away from Tukey’s paper about boxplots is that there are many valid ways to draw them. I personally set up the new boxplot functionality to take the most basic boxplot definition very literally. My guess is that R is fudging those rules a bit for the purpose of completeness, or aesthetics, or …(?)

Perhaps one can look at the purpose of boxplots in two different fashions:

  1. Matplotlib: show some of the data and some basic stats

  2. R (I’m guessing): show how the data are /probably/ distributed.

Obviously, I prefer #1. But I’m not going to say that #2 is wrong just yet.

On Sat, Feb 15, 2014 at 5:00 AM, Yaroslav Halchenko <sf@…825…> wrote:


Matplotlib-devel mailing list

Matplotlib-devel@lists.sourceforge.net

https://lists.sourceforge.net/lists/listinfo/matplotlib-devel

Hi Paul,

   As the author of the fix and the recent overhaul to boxplots

Thanks for that!

I can say with certainty that R is wrong! :wink:

phew -- thanks :wink:

   More seriously, the main thing that I take away from Tukey's paper about
   boxplots, is that there are many valid ways to draw them. I personally set
   up the new boxplot functionality to take the most basic boxplot definition
   very literally. My guess is that R is fudging those rules a bit for the
   purpose of completeness, or aesthetics, or ...(?)

well -- I was trying to figure out the reason for the divergence from R's boxplot
help, but so far its description seems to match the definition of the boxplot
used in matplotlib. I guess the next step would be to look "inside"
(running apt-get source r-base now :wink: )

   Perhaps one can look at the purpose of boxplots in two different fashions:
   1) Matplotlib: show some of the data and some basic stats
   2) R (I'm guessing): show how the data are /probably/ distributed.
   Obviously, I prefer #1. But I'm not going to say that #2 is wrong just
   yet.

would you maybe be interested in adopting (or just implementing independently) an
option to e.g. plot the data points? I once shared this one
http://nbviewer.ipython.org/url/www.onerussian.com/tmp/run_plots.ipynb
and the actual code https://gist.github.com/yarikoptic/9023331

I just never got to formalize it into an mpl pull request :-/

On Sat, 15 Feb 2014, Paul Hobson wrote:

Yaroslav,

Those figures look great. Seaborn has some similar functionality (scroll down a bit):

http://nbviewer.ipython.org/github/mwaskom/seaborn/blob/master/examples/plotting_distributions.ipynb#Comparing-distributions:-boxplot-and-violinplot

The main point of the most recent overhaul of boxplots was to allow users to do just what you describe. The methods plt.boxplot and ax.boxplot now do very little on their own: input data are passed to matplotlib.cbook.boxplot_stats, which returns a list of dictionaries of statistics, and then ax.bxp actually does the drawing. All of this is to say that you can write your own function to modify boxplot_stats’ output, or independently generate the list of dictionaries expected by ax.bxp.

The keys of those dictionaries can include:

  • label -> tick label for the boxplot

  • mean -> mean value (can plot as a line or point)

  • med -> median (50th percentile)

  • q1 -> first quartile (25th pctl)

  • q3 -> third quartile (75th pctl)

  • cilo -> lower notch around the median

  • cihi -> upper notch around the median

  • whislo -> end of the lower whisker

  • whishi -> end of the upper whisker

  • fliers -> outliers

Basically, you can set the appropriate values to whatever you want to draw boxplots however you wish (like open/close diagrams for pandas).
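
A minimal sketch of that workflow, assuming made-up data and an arbitrary choice of custom whiskers (5th to 95th percentile). The key names used here (`med`, `whislo`, `whishi`, `fliers`) are the ones returned by `cbook.boxplot_stats` in released matplotlib versions; check the output of your own installation if in doubt:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import cbook

# Made-up data: two datasets stored as the columns of one array.
rng = np.random.default_rng(0)
data = rng.lognormal(size=(50, 2))

# Step 1: let matplotlib compute the default statistics.
stats = cbook.boxplot_stats(data, labels=["A", "B"])

# Step 2: override whatever you like before drawing. Here the whiskers are
# moved to the 5th/95th percentiles and the fliers are recomputed to match.
for column, st in zip(data.T, stats):
    st["whislo"], st["whishi"] = np.percentile(column, [5, 95])
    st["fliers"] = column[(column < st["whislo"]) | (column > st["whishi"])]

# Step 3: hand the (modified) list of dictionaries to ax.bxp for drawing.
fig, ax = plt.subplots()
artists = ax.bxp(stats, showmeans=True)
```

The same `ax.bxp` call accepts dictionaries built entirely by hand, as long as the statistics needed for the requested elements are present.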

Also, the whis kwarg accepted by boxplot and cbook.boxplot_stats can be a float (1.5 by default), a pair of percentiles (like [5, 95]), or one of the strings ‘range’, ‘limits’, or ‘min/max’, each of which extends the whiskers over all of the data.
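
To illustrate how the `whis` setting changes the result, here is a small sketch on made-up numbers. The helper `whisker_summary` is hypothetical (not part of matplotlib), and because the string forms may not be accepted by every matplotlib version, the sketch uses the percentile pair (0, 100) to span all of the data instead:

```python
import numpy as np
from matplotlib import cbook

# Made-up sample with one large value.
data = np.array([2, 4, 4, 5, 6, 7, 8, 9, 40], dtype=float)


def whisker_summary(whis):
    """Return (whislo, whishi, fliers) of `data` for one `whis` setting."""
    (stats,) = cbook.boxplot_stats(data, whis=whis)
    return float(stats["whislo"]), float(stats["whishi"]), stats["fliers"].tolist()


# A float scales the IQR; a pair is read as (low, high) percentiles.
for whis in (1.5, (5, 95), (0, 100)):
    print(whis, whisker_summary(whis))
```

With the percentile pair (0, 100) no point is left outside the whiskers, so the list of fliers comes back empty.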

Since you’re running off of master, you should have access to this new functionality.

Here’s a link to the PR that overhauled ax.boxplot and created ax.bxp:

https://github.com/matplotlib/matplotlib/pull/2643

Looking at it now – it looks like cbook.boxplot_stats’ docstring got cut off. I’ll pull together a PR to fix that soon.

Feel free to hit me up with any other questions!

-paul

On Sat, Feb 15, 2014 at 2:20 PM, Yaroslav Halchenko <sf@…825…> wrote:


   Those figures look great. Seaborn has some similar functionality (scroll
   down a bit):
   http://nbviewer.ipython.org/github/mwaskom/seaborn/blob/master/examples/plotting_distributions.ipynb#Comparing-distributions:-boxplot-and-violinplot

right -- seaborn looks really nice and I am yet to take advantage of it.

BUT that is why we are talking here, on the matplotlib list: seaborn (and
a few others), while aiming to provide high-level convenience specific to
e.g. using pandas as the core data structures, add improvements which
could easily go into stock matplotlib and thus benefit all of the users.
That is why I thought that improving boxplot itself could be of
more generic benefit, while allowing all the dependent projects to take
advantage of it without requiring unnecessary fragmentation (e.g. "use
seaborn for paired plots", which could easily go straight into stock
boxplot operating on arrays).

Even violin plots could probably be done in matplotlib with
some basic density estimator (with a parameter for a custom one) as an
option within the boxplot function itself.

   Since you're running off of master, you should have access to this new
   functionality.

:wink: usually I run off the releases and even more often from releases in
Debian stable. But yes -- I have the master and this new functionality
looks neat -- thanks again. But those few enhancements, such as

- plot actual datapoints with the jitter
- plot pairing lines across boxplots

seem not to be there, and I would consider them worthwhile enhancements
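
As a rough illustration of the second item, pairing lines can already be overlaid by hand on top of a stock boxplot. This is only a sketch with made-up paired measurements; the names `before`, `after` and `pair_lines` are illustrative, and nothing here is an existing boxplot option:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# Made-up paired data: one "before" and one "after" value per subject.
rng = np.random.default_rng(0)
before = rng.normal(10.0, 2.0, size=15)
after = before + rng.normal(1.0, 1.0, size=15)

fig, ax = plt.subplots()
positions = [1, 2]
ax.boxplot([before, after], positions=positions, widths=0.4)

# One thin line per subject connects its two measurements across the boxes.
pair_lines = [
    ax.plot(positions, [b, a], color="gray", alpha=0.5, linewidth=0.8, zorder=1)[0]
    for b, a in zip(before, after)
]

ax.set_xticks(positions)
ax.set_xticklabels(["before", "after"])
```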

   Feel free to hit me up with any other questions!

sorry that I have hit you with what is not really a question above :wink:

On Sat, 15 Feb 2014, Paul Hobson wrote:

As a side note, adding jitter has been discussed before
(https://github.com/matplotlib/matplotlib/issues/2750) in a slightly
different context and the consensus was to _not_ add it to mpl (as it
is a non-deterministic data transformation).

Tom

On Sat, Feb 15, 2014 at 10:45 PM, Yaroslav Halchenko <sf@...825...> wrote:


--
Thomas A Caswell
PhD Candidate University of Chicago
Nagel and Gardel labs
tcaswell@...1038...
jfi.uchicago.edu/~tcaswell
o: 773.702.7204

interesting discussion -- thanks for pointing it out, Tom

well -- for a scatter plot it does make sense to demand that jittering be done
"outside". For a boxplot -- nope. The x-axis (in standard vertical
boxplots) doesn't represent an informative dimension anyway, besides
"grouping", and jitter imho would be only for visualization purposes.
Also, any non-deterministic jitter could be made deterministic and
reproducible by seeding. Since, once again, randomization here would be
added only for visualization purposes, it could e.g. always be produced
by an rng state seeded with 0 :wink:
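
A minimal sketch of that idea, done outside of matplotlib itself: the helper `jittered_positions` is hypothetical, the data are made up, and the fixed default seed of 0 simply follows the suggestion above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np


def jittered_positions(center, n, width=0.15, seed=0):
    """Return n x-coordinates spread around `center`; same seed, same jitter."""
    rng = np.random.default_rng(seed)
    return center + rng.uniform(-width, width, size=n)


# Made-up data for two groups.
data_rng = np.random.default_rng(42)
groups = [data_rng.normal(0, 1, 30), data_rng.normal(1, 2, 30)]

fig, ax = plt.subplots()
positions = [1, 2]
# Fliers are hidden because every data point is drawn explicitly below.
ax.boxplot(groups, positions=positions, showfliers=False)
for pos, values in zip(positions, groups):
    ax.plot(jittered_positions(pos, len(values)), values,
            "o", alpha=0.5, markersize=4)
```

Because the jitter generator is re-created from the same seed on every call, re-running the script reproduces the figure exactly, while the y-values (the actual data) are never altered.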

On Sat, 15 Feb 2014, Thomas A Caswell wrote:


Hey all,

I thought I’d throw out that a tool I’m working on, Plotly, also does box plots with the option to show jittered points. Instead of passing in stats you pass in an array of values.

Here is a notebook with the box plots with jitter: nbviewer.ipython.org/gist/fperez/8930306.

You can also view the mean of the array (the dashed line), +/- 1.5 standard deviations around the median, and the outliers of the set (the hollow points): https://plot.ly/~ChrisPP/49.

More generally, we’re hoping to soon let folks convert matplotlib scripts into a Plotly graph (GitHub Issue). We’d love your advice and thoughts.

Thanks a bunch,

M

On Sun, Feb 16, 2014 at 9:39 PM, Yaroslav Halchenko <sf@…825…> wrote:
