Looks like my evenings this week (after today) will be open. I was thinking about coding up a potentially major overhaul of axes.Axes.boxplot. Here's a rough outline of what I was thinking:
1) Improve the bootstrapping of the confidence intervals around the median
2) Add support for masked arrays (i.e., let user specify if masked values should be considered or not -- currently they are always considered, IIRC)
3) Improve the calculation of the percentiles to be consistent with SciPy and R.
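As a rough illustration of items 1 and 2 together, here is a minimal sketch of a percentile bootstrap for the confidence interval around the median that drops masked values first. The function name bootstrap_median_ci and its parameters (n_boot, alpha, seed) are hypothetical and not part of matplotlib; the resampling scheme is one common choice, not necessarily what boxplot should end up using.

```python
import numpy as np

def bootstrap_median_ci(x, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the median (hypothetical helper)."""
    # Drop masked values; plain arrays and lists pass through unchanged.
    data = np.ma.asarray(x).compressed().astype(float)
    n = data.size
    if n == 0:
        raise ValueError("no unmasked data to resample")
    rng = np.random.default_rng(seed)
    # Resample with replacement: one row of indices per bootstrap replicate.
    idx = rng.integers(0, n, size=(n_boot, n))
    medians = np.median(data[idx], axis=1)
    # Take the central (1 - alpha) range of the bootstrap medians.
    lo, hi = np.percentile(medians, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Example: the masked outlier (1e9) never enters the resampling.
sample = np.ma.masked_greater([0.2, -1.1, 0.5, 1e9, 0.9, -0.3, 0.1], 1e6)
ci_low, ci_high = bootstrap_median_ci(sample)
```

Whether masked values are dropped or kept could be exposed as a keyword argument; this sketch always drops them.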
#1 seems like something that'll be nice. #2 seems pretty essential to me. The third improvement is something for which I would want y'all's blessing before moving ahead. However, I think it's pretty critical. See (25th and 75th percentiles) below:
import numpy as np
import matplotlib.mlab as mlab
import scipy.stats as stats

def comparePercentiles(x):
    # mlab.prctile defaults to the 0th, 25th, 50th, 75th, and 100th percentiles
    mlp = mlab.prctile(x)
    # compute the same percentiles with scipy, one at a time
    stp = np.array([stats.scoreatpercentile(x, p)
                    for p in (0.0, 25.0, 50.0, 75.0, 100.0)])
    outstring = """
mlab \t scipy
-------------
%0.3f \t %0.3f (0th)
%0.3f \t %0.3f (25th)
%0.3f \t %0.3f (50th)
%0.3f \t %0.3f (75th)
%0.3f \t %0.3f (100th)
""" % (mlp[0], stp[0], mlp[1], stp[1], mlp[2], stp[2], mlp[3], stp[3], mlp[4], stp[4])
    print(outstring)

# x is the sample data (values not shown here)
comparePercentiles(x)
mlab     scipy
----------------------
-1.245   -1.245  (0th)
-0.950   -0.802  (25th)
-0.162   -0.162  (50th)
 0.571    0.266  (75th)
 1.067    1.067  (100th)
Copying and pasting the exact same data into R I get:
quantile(x, probs=c(0.0, 0.25, 0.50, 0.75, 1.0))
        0%        25%        50%        75%       100%
-1.2448508 -0.8022337 -0.1617812  0.2661112  1.0666244
It seems clear that something needs to be done: mlab.prctile disagrees with both SciPy and R at the 25th and 75th percentiles. AFAICT, SciPy is not listed as a dependency of matplotlib, so it will probably be easier to retool mlab.prctile to return values that agree with SciPy and R. What do you think? Would this be a welcome contribution?
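For concreteness, here is a minimal NumPy-only sketch of the kind of calculation a retooled prctile could use. It assumes the target is linear interpolation between order statistics at position (n - 1) * p / 100, which, as far as I can tell, is what R's default quantile type and scipy's default scoreatpercentile do. The name percentile_linear is hypothetical and is not an existing mlab function.

```python
import numpy as np

def percentile_linear(x, p=(0.0, 25.0, 50.0, 75.0, 100.0)):
    """Percentiles by linear interpolation between order statistics (hypothetical helper)."""
    data = np.sort(np.asarray(x, dtype=float).ravel())
    n = data.size
    if n == 0:
        raise ValueError("cannot compute percentiles of an empty array")
    # Fractional position of each percentile within the sorted data.
    pos = (n - 1) * np.asarray(p, dtype=float) / 100.0
    lower = np.floor(pos).astype(int)
    upper = np.ceil(pos).astype(int)
    frac = pos - lower
    # Weighted average of the two neighbouring order statistics.
    return data[lower] * (1.0 - frac) + data[upper] * frac

# Example with four points: positions 0, 0.75, 1.5, 2.25, 3
print(percentile_linear([1.0, 2.0, 3.0, 4.0]))  # [1.   1.75 2.5  3.25 4.  ]
```

This keeps scipy out of the dependency list while matching np.percentile's default behaviour, which makes it easy to cross-check in a test.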
Thanks,
-Paul Hobson