boxplot notch

Andrew_Straw5 · December 19, 2009, 5:25am

Pierre GM wrote:

On Dec 18, 2009, at 10:34 PM, Andrew Straw wrote:


Fernando Perez wrote:


On Fri, Dec 18, 2009 at 2:28 PM, Andrew Straw <strawman@...36...> wrote:

(This still leaves open the question of what the notches actually _are_...)

No idea. I'd still rather have the code written as

notch_max = med + (iq/2) * (pi/np.sqrt(row))

Further searching turned this up: http://seismo.berkeley.edu/~kirchner/eps_120/Toolkits/Toolkit_01.pdf

It says that

median +/- 1.57 * (iq / sqrt(n)) is the median, plus or minus its standard error.

I can't find any further support for this notion, though.

It looks like the standard error of the median is 1.253 times the standard error of the mean, i.e. 1.253 * std dev / sqrt(number of observations).
The 1.57 looks like it's 1.253^2, but I wouldn't bet anything on it...
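
A quick numerical check of the constants discussed here, as a sketch only. It assumes normally distributed data in the large-sample limit, where the ratio of the standard error of the median to the standard error of the mean is sqrt(pi/2). It confirms the arithmetic, not that the notch formula is statistically justified.

```python
import math

# Large-sample ratio (normal data) of the standard error of the median
# to the standard error of the mean.
se_ratio = math.sqrt(math.pi / 2)

print(round(se_ratio, 4))       # 1.2533
print(round(se_ratio ** 2, 4))  # 1.5708, which is pi/2
print(round(1.253 ** 2, 4))     # 1.57
```

So 1.253^2 and the pi/2 factor implied by (iq/2) * (pi/np.sqrt(row)) agree to the quoted precision.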

Also, I think that formula is only for normally distributed data. Which, especially if you're using boxplots, medians, and quartiles, may not be a valid assumption.

Maybe we should at least raise a warning when someone uses notch=1. The current implementation seems dubious, at best, IMO.

-Andrew

Andrew_Straw5 · December 21, 2009, 5:20pm

Andrew Straw wrote:

Also, I think that formula is only for normally distributed data. Which,
especially if you're using boxplots, medians, and quartiles, may not be
a valid assumption.

Maybe we should at least raise a warning when someone uses notch=1. The
current implementation seems dubious, at best, IMO.

I read the following reference:

McGill, R., Tukey, J.W., and Larsen, W.A. (1978) "Variations of Boxplots", The American Statistician, 32:12-16.

McGill et al. have an entire section devoted to "Choice of Notch Size",
starting with:

"In notched box plots, one is, of course, faced with the question of how
best to determine the widths of the notches. Many methods, both
classical and non-parametric, might be considered. None will likely be
best in all cases."

They then describe a suggestion based on the Gaussian-based asymptotic
approximation (Kendall and Stuart, 1967) given by

s = 1.25*R / (1.35 * sqrt(N))

And the notch around each median should be M +/- Cs where C is a
constant and R is the interquartile range. It seems any value between
1.386 and 1.96 could be justified depending on the standard deviations,
and they choose C=1.7 empirically as preferable and ultimately give the
full equation for notches to be

M +/- 1.7* (1.25*R / (1.35 * sqrt(N)))

But they end the section with:

"Clearly, a variety of other choices, such as a single less conservative
value (<1.7) or one dependent upon the data (chosen to compromise over
the range of the ratios of the spreads involved), are possible and may
be preferable in certain cases."

The thing not done in this article is to display outliers -- they refer
the reader to "schematic plots" in Tukey's 1977 book titled Exploratory
Data Analysis (Addison-Wesley). In the version of boxplots described in
this paper, the whiskers go to the data extremes.

Andrew_Straw5 · December 21, 2009, 5:29pm

Andrew Straw wrote:

Also, I think that formula is only for normally distributed data. Which,
especially if you're using boxplots, medians, and quartiles, may not be
a valid assumption.

Maybe we should at least raise a warning when someone uses notch=1. The
current implementation seems dubious, at best, IMO.

(I sent the previous version of this email a bit too early -- this is
slightly edited for clarity.)

I read the following reference:

McGill, R., Tukey, J.W., and Larsen, W.A. (1978) "Variations of
Boxplots", The American Statistician, 32:12-16.

McGill et al. have an entire section devoted to "Choice of Notch Size",
starting with:

"In notched box plots, one is, of course, faced with the question of how
best to determine the widths of the notches. Many methods, both
classical and non-parametric, might be considered. None will likely be
best in all cases."

They then describe a suggestion based on the Gaussian-based asymptotic
approximation (Kendall and Stuart, 1967). Here the standard deviation of
the median is given by

s = 1.25*R / (1.35 * sqrt(N))

where R is the interquartile range and N is the number of observations.

Using this value for s, the notch around each median should be M +/- Cs
where C is a constant. To summarize this section of their paper, values
of C between 1.386 and 1.96 could be justified depending on the
standard deviations, and they choose C=1.7 empirically as preferable and
ultimately give the full equation for notches to be

M +/- 1.7* (1.25*R / (1.35 * sqrt(N)))
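
The equation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `mcgill_notch` is hypothetical, the sample values are made up, and the quartiles use the "inclusive" method from the standard library, which may differ from how the paper or matplotlib computes them.

```python
import math
import statistics

def mcgill_notch(data, c=1.7):
    """Return (lower, upper) notch bounds M -/+ C*s, s = 1.25*R / (1.35*sqrt(N))."""
    n = len(data)
    median = statistics.median(data)
    # Quartile convention is an assumption; other definitions shift R slightly.
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1                             # R in the text
    s = 1.25 * iqr / (1.35 * math.sqrt(n))    # approximate std dev of the median
    return median - c * s, median + c * s

# Made-up, right-skewed sample purely for illustration.
sample = [2.1, 2.4, 2.5, 2.9, 3.0, 3.3, 3.8, 4.1, 4.6, 5.2, 6.0, 7.5]
lower, upper = mcgill_notch(sample)

# The combined constant multiplying R / sqrt(N):
print(round(1.7 * 1.25 / 1.35, 3))  # 1.574
```

The combined constant, about 1.574, is close to the 1.57 (pi/2) coefficient discussed earlier in the thread. Because the notch is symmetric about the median by construction, it cannot reflect skew in the data.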

But they end the section with:

"Clearly, a variety of other choices, such as a single less conservative
value (<1.7) or one dependent upon the data (chosen to compromise over
the range of the ratios of the spreads involved), are possible and may
be preferable in certain cases."

The thing not done in this article is to display outliers -- they refer
the reader to "schematic plots" in Tukey's 1977 book titled Exploratory
Data Analysis (Addison-Wesley). In the version of boxplots described in
this paper, the whiskers go to the data extremes.

_Steve.M · January 20, 2010, 10:27pm

I also ran into this problem recently and was disappointed to find that the
notch was based on a normal approximation.
While there are a number of ways to calculate the notch size, it would be
useful to allow the user to supply (either as an optional keyword, or as a
vector input for the notch keyword) their own notch locations.

For example, I have some code that calculates bootstrapped confidence
intervals - in the case of a significantly non-normal distribution this
would be a better way to find the notch boundaries (which will likely not
even be symmetric). While I'm not advocating building other calculations in,
having the option to supply my own notch locations would be immensely
useful. The default should probably remain as is (IMO) but should also be
mentioned in the documentation as being based on that assumption.
I'm happy to submit an update to do just that if it's seen as a good idea.
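
For illustration, bootstrapped notch locations of the kind described above might be computed along these lines. This is a minimal sketch of a percentile bootstrap, not the code referred to in this message and not a matplotlib API; the function name `bootstrap_median_ci`, its parameters, and the sample data are all hypothetical.

```python
import random
import statistics

def bootstrap_median_ci(data, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the median."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(data)
    # Resample with replacement and record the median of each resample.
    medians = sorted(
        statistics.median(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lo_idx = max(0, int(alpha * n_resamples))
    hi_idx = min(n_resamples - 1, int((1.0 - alpha) * n_resamples) - 1)
    return medians[lo_idx], medians[hi_idx]

# Made-up, strongly skewed sample purely for illustration.
skewed = [0.2, 0.3, 0.3, 0.4, 0.5, 0.6, 0.8, 1.1, 1.9, 3.5, 7.2, 15.0]
notch_lo, notch_hi = bootstrap_median_ci(skewed)
print(notch_lo, statistics.median(skewed), notch_hi)
```

Unlike the normal-approximation notch, the two bounds returned here need not be equidistant from the median, which is the asymmetry mentioned above.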

Steve.

--
View this message in context: http://old.nabble.com/boxplot-notch-tp26798967p27249739.html
Sent from the matplotlib - devel mailing list archive at Nabble.com.