boxplot notch

Hi,

I've been reading about box plots and examining the source code for boxplot() lately. While there doesn't seem to be a convention for what the notch represents, I can't find any justification for, or text describing, what exactly the MPL notch is. The source code is:

   # get median and quartiles
   q1, med, q3 = mlab.prctile(d,[25,50,75])
   iq = q3 - q1

   notch_max = med + 1.57*iq/np.sqrt(row)
   notch_min = med - 1.57*iq/np.sqrt(row)

Is this code actually calculating a meaningful value? If so, what?

The original commit was r1098, which doesn't offer a useful comment either (only "aaplied several sf patches" ... looking through the SF bug tracker, I couldn't find anything relevant from before the commit date of 2005-03-28).

From the statistics ignoramus in the room, so take this with a grain of salt... I'd write that code as

notch_max = med + (iq/2) * (pi/np.sqrt(row))

and it makes more sense. The notch limits are an estimate of the
interval of the median, which is (one-half, for each up/down) the
q3-q1 range times a normalization factor which is pi/sqrt(n), where
n==row=len(d). The 1/sqrt(n) makes some sense, as it's the usual
statistical error normalization factor. The multiplication by pi, I'm
not so sure, and I can't find that exact formula in any quick stats
reference, but I'm sure someone who actually knows stats can point out
where it comes from.
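To make the comparison concrete, here is a minimal sketch of the notch calculation with the factor written both ways. It assumes `np.percentile` as a stand-in for `mlab.prctile` and uses made-up normal data; `notch_limits` is a hypothetical helper name, not something from the matplotlib source.

```python
import numpy as np

def notch_limits(d, factor=1.57):
    """Return (notch_min, notch_max) following the snippet quoted above."""
    d = np.asarray(d, dtype=float)
    # np.percentile stands in for mlab.prctile (an assumption of this sketch)
    q1, med, q3 = np.percentile(d, [25, 50, 75])
    iq = q3 - q1
    n = d.size  # plays the role of `row` in the original code
    half_width = factor * iq / np.sqrt(n)
    return med - half_width, med + half_width

rng = np.random.default_rng(0)
d = rng.normal(size=200)  # illustrative data only

lo_157, hi_157 = notch_limits(d, factor=1.57)
lo_pi, hi_pi = notch_limits(d, factor=np.pi / 2)

# The two spellings differ only by (pi/2 - 1.57), roughly 0.0008 in the factor.
print(hi_157 - lo_157, hi_pi - lo_pi)
```

Numerically the two forms are nearly indistinguishable, so the data alone cannot say whether 1.57 was meant as pi/2 or as some other constant that happens to round to the same value.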

Note that the code below does:

                if notch_max > q3:
                    notch_max = q3
                if notch_min < q1:
                    notch_min = q1

though MATLAB explicitly states in:

http://www.mathworks.com/access/helpdesk/help/toolbox/stats/boxplot.html

that

"""
Interval endpoints are the extremes of the notches or the centers of
the triangular markers. When the sample size is small, notches may
extend beyond the end of the box.
"""

So it seems to me that the more principled thing to do would be to
leave those notch markers outside the box if they land there, because
that's a warning of the robustness of the estimation. Clipping them to
q1/q3 is effectively hiding a problem...
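A small sketch of the effect described here, assuming `np.percentile` for the quartiles and a hypothetical `clip` flag that mimics the `if notch_max > q3` logic quoted above:

```python
import numpy as np

def notch_limits(d, clip=False):
    """Return (q1, q3, notch_min, notch_max); `clip` mimics the old behaviour."""
    d = np.asarray(d, dtype=float)
    q1, med, q3 = np.percentile(d, [25, 50, 75])
    iq = q3 - q1
    half_width = 1.57 * iq / np.sqrt(d.size)
    notch_min, notch_max = med - half_width, med + half_width
    if clip:
        # the behaviour questioned above: pull the notch back to the box edges
        notch_min = max(notch_min, q1)
        notch_max = min(notch_max, q3)
    return q1, q3, notch_min, notch_max

small = [1.0, 2.0, 3.0, 4.0, 5.0]  # deliberately tiny sample
q1, q3, lo, hi = notch_limits(small)
print(lo < q1, hi > q3)  # True True: the notch extends past both box edges
```

With only five points the unclipped notch runs past both quartiles, which is the visual warning the MATLAB text describes; calling `notch_limits(small, clip=True)` pins it to q1 and q3 and the warning disappears.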

cheers,

f

On Tue, Dec 15, 2009 at 9:57 AM, Andrew Straw <strawman@...36...> wrote:

notch_max = med + 1.57*iq/np.sqrt(row)
notch_min = med - 1.57*iq/np.sqrt(row)

Is this code actually calculating a meaningful value? If so, what?

From the statistics ignoramus in the room, so take this with a grain

Fernando Perez wrote:

Note that the code below does:

                if notch_max > q3:
                    notch_max = q3
                if notch_min < q1:
                    notch_min = q1

though MATLAB explicitly states in:

http://www.mathworks.com/access/helpdesk/help/toolbox/stats/boxplot.html

that

"""
Interval endpoints are the extremes of the notches or the centers of
the triangular markers. When the sample size is small, notches may
extend beyond the end of the box.
"""

So it seems to me that the more principled thing to do would be to
leave those notch markers outside the box if they land there, because
that's a warning of the robustness of the estimation. Clipping them to
q1/q3 is effectively hiding a problem...

I agree. I disabled the boxplot notch shortening in r8040.

(This still leaves open the question of what the notches actually _are_...)

-Andrew

No idea. I'd still leave the code instead written as

notch_max = med + (iq/2) * (pi/np.sqrt(row))

as that's what it appears to be doing (unless 1.57 is *not* pi/2 in
this case). That might help someone spot what formula the factors
come from. Even better would be to get from the original author an
explanation of where those factors come from :) I did look around
some more, couldn't find an answer...

Cheers,

f

On Fri, Dec 18, 2009 at 2:28 PM, Andrew Straw <strawman@...36...> wrote:

(This still leaves open the question of what the notches actually _are_...)

Fernando Perez wrote:

(This still leaves open the question of what the notches actually _are_...)

No idea. I'd still leave the code instead written as

notch_max = med + (iq/2) * (pi/np.sqrt(row))

Further searching turned this up: http://seismo.berkeley.edu/~kirchner/eps_120/Toolkits/Toolkit_01.pdf

It says that

median +/- 1.57 * (iq / sqrt(n)) is the median, plus or minus its standard error.

I can't find any further support for this notion, though.
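One way to probe the "standard error" reading is a quick simulation. The sketch below assumes normally distributed data and uses arbitrary choices for sample size, trial count and seed; it compares the typical notch half-width with the empirical spread of the sample median.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 2000  # illustrative choices, not from the thread

# Each row is one simulated sample of size n from a standard normal.
samples = rng.normal(size=(trials, n))

medians = np.median(samples, axis=1)
q1, q3 = np.percentile(samples, [25, 75], axis=1)
half_widths = 1.57 * (q3 - q1) / np.sqrt(n)

se_median = medians.std(ddof=1)       # empirical standard error of the median
mean_half_width = half_widths.mean()  # typical notch half-width

ratio = mean_half_width / se_median
print(ratio)
```

For normal data, large-sample theory puts the standard error of the median near 1.253*sigma/sqrt(n) and the IQR near 1.349*sigma, so this ratio should come out around 1.7 rather than 1. On that reading the formula describes a band somewhat wider than one standard error, at least under the normality assumption; other distributions would give other ratios.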

I've decided not to use notches on my own plots, so I'm leaving this issue for now...

-Andrew
