Improvements to boxplots

Looks like my evenings this week (after today) will be open. I was thinking about coding up a potentially major overhaul of axes.Axes.boxplot. Here's a rough outline of what I was thinking:

1) Improve the bootstrapping of the confidence intervals around the median
2) Add support for masked arrays (i.e., let user specify if masked values should be considered or not -- currently they are always considered, IIRC)
3) Improve the calculation of the percentiles to be consistent with SciPy and R.
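
For item 1, one common approach is a percentile bootstrap of the median. Below is a minimal sketch, assuming plain NumPy. The name bootstrap_median_ci and its parameters are hypothetical, not existing matplotlib API, and this is not necessarily how boxplot currently computes its intervals:

```python
import numpy as np

def bootstrap_median_ci(x, n_boot=1000, alpha=0.05, seed=None):
    """Hypothetical helper: percentile-bootstrap CI for the median of x."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float).ravel()
    # Draw n_boot resamples (with replacement), each the same size as x
    idx = rng.integers(0, x.size, size=(n_boot, x.size))
    medians = np.median(x[idx], axis=1)
    # The CI bounds are the alpha/2 and 1 - alpha/2 percentiles of the medians
    lo, hi = np.percentile(medians, [100.0 * alpha / 2.0,
                                     100.0 * (1.0 - alpha / 2.0)])
    return lo, hi
```

Passing a fixed seed makes the interval reproducible, which might be handy for tests.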

#1 seems like something that'll be nice. #2 seems pretty essential to me. The third improvement is something for which I would want y'all's blessing before moving ahead. However, I think it's pretty critical. Compare the 25th and 75th percentiles below:

import numpy as np
import matplotlib.mlab as mlab
import scipy.stats as stats

def comparePercentiles(x):
     # mlab.prctile defaults to the 0th, 25th, 50th, 75th, and 100th percentiles
     mlp = mlab.prctile(x)
     # Compute the same percentiles with SciPy for comparison
     stp = np.array([stats.scoreatpercentile(x, p)
                     for p in (0.0, 25.0, 50.0, 75.0, 100.0)])
     outstring = """
     mlab \t scipy
     -------------
     %0.3f \t %0.3f (0th)
     %0.3f \t %0.3f (25th)
     %0.3f \t %0.3f (50th)
     %0.3f \t %0.3f (75th)
     %0.3f \t %0.3f (100th)
     """ % (mlp[0], stp[0], mlp[1], stp[1], mlp[2], stp[2], mlp[3], stp[3], mlp[4], stp[4])
     print(outstring)

comparePercentiles(x)

    mlab     scipy
    ----------------------
    -1.245   -1.245  (0th)
    -0.950   -0.802  (25th)
    -0.162   -0.162  (50th)
     0.571    0.266  (75th)
     1.067    1.067  (100th)

Copying and pasting the exact same data into R I get:

quantile(x, probs=c(0.0, 0.25, 0.50, 0.75, 1.0))

        0%        25%        50%        75%       100%
-1.2448508 -0.8022337 -0.1617812  0.2661112  1.0666244

Seems like it's clear that something needs to be done. AFAICT, scipy is not listed as a dependency of matplotlib, so it'll probably just be easier to retool mlab.prctile to return values that agree with scipy and R. What do you think? Would this be a welcome contribution?
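
To make the proposal concrete, here is a rough NumPy-only sketch of linear interpolation between order statistics, which is what R's default quantile (type 7) does and what the scipy numbers above appear to follow. The name prctile_linear is hypothetical, and this is only an illustration of the idea, not a drop-in patch for mlab.prctile:

```python
import numpy as np

def prctile_linear(x, p=(0.0, 25.0, 50.0, 75.0, 100.0)):
    """Hypothetical helper: percentiles via linear interpolation."""
    x = np.sort(np.asarray(x, dtype=float).ravel())
    p = np.atleast_1d(np.asarray(p, dtype=float))
    # Fractional 0-based position of each percentile in the sorted data
    pos = p / 100.0 * (x.size - 1)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, x.size - 1)  # clamp at the last element
    frac = pos - lo
    # Interpolate between the two neighbouring order statistics
    return x[lo] * (1.0 - frac) + x[hi] * frac
```

For example, prctile_linear([1, 2, 3, 4]) gives 1.0, 1.75, 2.5, 3.25, 4.0. The sketch assumes x is non-empty and contains no NaNs.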

Thanks,
-Paul Hobson

Support for masked arrays should always be implemented such that masked values are skipped--don't make that optional, please.
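
A minimal sketch of that behavior, assuming the statistics are computed on a plain 1-D array after the masked entries are dropped. The helper name unmasked_values is hypothetical; compressed() is the existing numpy.ma method:

```python
import numpy as np
import numpy.ma as ma

def unmasked_values(x):
    """Hypothetical helper: return only the unmasked data as a 1-D ndarray."""
    # ma.asarray wraps plain arrays too, so both input types are handled
    return ma.asarray(x).compressed()

# The masked outlier (100) is skipped when computing the median
data = ma.masked_array([1.0, 2.0, 3.0, 4.0, 100.0], mask=[0, 0, 0, 0, 1])
median = np.median(unmasked_values(data))
```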

Eric

On 07/06/2010 10:55 AM, PHobson@...814... wrote:

Looks like my evenings this week (after today) will be open. I was thinking about coding up a potentially major overhaul of the axes.Axes.boxplot. Here's a rough outline of what I was thinking:

1) Improve the bootstrapping of the confidence intervals around the median
2) Add support for masked arrays (i.e., let user specify if masked values should be considered or not -- currently they are always considered, IIRC)