boxplot -- how (more)

Virgil_Stokes · August 21, 2012, 2:58pm

In reference to my previous email.

How can I find the outliers (sample points beyond the whiskers) in the data used for the boxplot?

Here is a code snippet that shows how it was used for the timings data (a list of 4 sublists (y1,y2,y3,y4), each containing 400,000 real data values),
   ...
   # Box Plots
   plt.subplot(2,1,2)
   timings = [y1,y2,y3,y4]
   pos = np.array(range(len(timings)))+1
   bp = plt.boxplot( timings, sym='k+', patch_artist=True,
                    positions=pos, notch=1, bootstrap=5000 )

   plt.xlabel('Algorithm')
   plt.ylabel('Execution time (sec)')
   plt.ylim(0.9*ymin,1.1*ymax)

   plt.setp(bp['whiskers'], color='k', linestyle='-' )
   plt.setp(bp['fliers'], markersize=3.0)
   plt.title('Box plots (%4d trials)' %(n))
   plt.show()
   ...

Again my questions:
1) How to get the value of the median?
2) How to find the outliers (outside the whiskers)?
3) How to find the width of the notch?

Jeff_Blackburne1 · August 21, 2012, 3:52pm

Again my questions:
1) How to get the value of the median?

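This is easily calculated from your data. Numpy will even do it for you: np.median(y) for each sublist y in timings, or np.median(timings, axis=1) since the sublists have equal length. (Calling np.median(timings) without an axis pools all four sublists into a single median.)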
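A minimal sketch of the per-group calculation. The short lists below are made-up stand-ins for y1..y4, since the real timing data are not shown in the thread.

```python
import numpy as np

# Hypothetical stand-ins for y1..y4; the real timing data are not shown.
timings = [
    [1.0, 2.0, 3.0, 4.0, 100.0],
    [2.0, 2.5, 3.0, 3.5, 4.0],
    [0.5, 1.5, 2.5, 3.5, 4.5],
    [10.0, 11.0, 12.0, 13.0, 14.0],
]

# One median per sublist, in the same order as the boxes in the plot.
medians = [float(np.median(y)) for y in timings]
print(medians)
```

The list comprehension also works when the sublists have different lengths, which the axis=1 form does not.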

2) How to find the outliers (outside the whiskers)?

From the boxplot documentation: the whiskers extend to the most extreme data point within distance X of the bottom or top of the box, where X is 1.5 times the extent of the box (the interquartile range; 1.5 is the default value of the whis argument). Any points more extreme than that are the outliers. The box itself of course extends from the 25th percentile to the 75th percentile of your data. Again, you can easily calculate these values from your data.
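The rule described above can be sketched directly in numpy. This is an illustration under the default whis=1.5, not matplotlib's own code; the helper name whisker_limits and the sample values are made up for the example.

```python
import numpy as np

def whisker_limits(y, whis=1.5):
    """Return (q1, q3, low_end, high_end) for one sample.

    low_end and high_end are the most extreme data values that still
    lie within whis * IQR of the box, i.e. where the whiskers stop.
    """
    y = np.asarray(y, dtype=float)
    q1, q3 = np.percentile(y, [25, 75])
    iqr = q3 - q1
    inside = y[(y >= q1 - whis * iqr) & (y <= q3 + whis * iqr)]
    return q1, q3, inside.min(), inside.max()

# Hypothetical sample with one obvious outlier.
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 50.0])
q1, q3, low_end, high_end = whisker_limits(y)

# Everything beyond the whisker ends is an outlier (a "flier").
outliers = y[(y < low_end) | (y > high_end)]
print(q1, q3, low_end, high_end, outliers)
```

Run once per sublist of timings to get the outliers for each box.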

3) How to find the width of the notch?

Again, from the docs: with bootstrap=5000, it calculates the width of the notch by bootstrap resampling your data (the timings array) 5000 times and finding the 95% confidence interval of the median, and uses that as the notch width. You can redo that yourself pretty easily. Here is some bootstrap code for you to adapt:
http://mail.scipy.org/pipermail/scipy-user/2009-July/021704.html
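For readers who cannot reach the link, here is a rough sketch of a percentile bootstrap for the median. It is an assumption about how one might redo the calculation by hand, not a copy of matplotlib's internals or of the linked code; the function name, the seed argument, and the sample data are all made up for illustration.

```python
import numpy as np

def bootstrap_median_ci(y, n_boot=5000, seed=0):
    """Percentile-bootstrap 95% confidence interval for the median of y."""
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    medians = np.empty(n_boot)
    # Resample one replicate at a time so memory stays bounded even
    # when y is large (e.g. 400,000 values).
    for i in range(n_boot):
        sample = rng.choice(y, size=len(y), replace=True)
        medians[i] = np.median(sample)
    lo, hi = np.percentile(medians, [2.5, 97.5])
    return lo, hi

# Hypothetical data standing in for one of the timing lists.
rng = np.random.default_rng(1)
y = rng.lognormal(mean=1.25, sigma=1.35, size=200)

lo, hi = bootstrap_median_ci(y, n_boot=1000)
print('median = %f, 95%% CI = (%f, %f), width = %f'
      % (np.median(y), lo, hi, hi - lo))
```

The notch width is then hi - lo. Because resampling is random, the result will differ slightly from what a given boxplot call draws.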

I encourage you to read the documentation! This page is very useful for reference:
http://matplotlib.sourceforge.net/api/pyplot_api.html

-Jeff

_Paul_Hobson1 · August 21, 2012, 3:55pm

Oops. Here's my reply, this time to the whole list.
Virgil, the objects stuffed inside the `bp` dictionary should have
methods to retrieve their values. Let's see:

In [35]: x = np.random.lognormal(mean=1.25, sigma=1.35, size=(37,3))

In [36]: bp = plt.boxplot(x, bootstrap=5000, notch=True)

In [37]: # Question 1
    ...: print('medians')
    ...: for n, median in enumerate(bp['medians']):
    ...:     print('%d: %f' % (n, median.get_ydata()[0]))
    ...:
medians
0: 6.339692
1: 3.449320
2: 4.503706

In [38]: # Question 2
    ...: print('fliers')
    ...: for n in range(0, len(bp['fliers']), 2):
    ...:     print('%d: upper outliers = \t' % (n/2,))
    ...:     print(bp['fliers'][n].get_ydata())
    ...:     print('\n%d: lower outliers = \t' % (n/2,))
    ...:     print(bp['fliers'][n+1].get_ydata())
    ...:     print('\n')
    ...:

In [39]: # Question 3
    ...: print('Confidence Intervals')
    ...: for n, box in enumerate(bp['boxes']):
    ...:     print('%d: lower CI: %f' % (n, box.get_ydata()[2]))
    ...:     print('%d: upper CI: %f' % (n, box.get_ydata()[4]))
    ...:
Confidence Intervals
0: lower CI: 1.760701
0: upper CI: 10.102221
1: lower CI: 1.626386
1: upper CI: 5.601927
2: lower CI: 2.173173

Hope that helps,
-paul

Virgil_Stokes · August 22, 2012, 2:04pm

Yes Jeff,
These are very useful links; however, box plots have a parameter called the "adjacent value" (from the McGill reference),

"The plotted whisker extends to the adjacent value, which is the most extreme data value that is not an outlier."

It seems there should be one for the lower and one for the upper whisker --- how can one get these two values from boxplot?

Also, is there any way to directly get the indices of the outliers?

Jeff_Blackburne1 · August 22, 2012, 3:29pm

It seems there should be one for the lower and one for the upper whisker --- how can one get these two values from boxplot?

Look at bp['whiskers']

For those who got here by searching: bp is the object returned by plt.boxplot()
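A short sketch of reading the whisker ends back from bp. It assumes a vertical boxplot, where bp['whiskers'] holds two Line2D objects per box (lower first, then upper) and each line runs from the box edge to the whisker end. The random data are made up for the example.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: three groups of lognormal samples.
rng = np.random.default_rng(0)
data = [rng.lognormal(mean=1.25, sigma=1.35, size=200) for _ in range(3)]

fig, ax = plt.subplots()
bp = ax.boxplot(data)

adjacent = []
for i in range(len(data)):
    # Each whisker line has two y values: the box edge, then the whisker end.
    lower = bp['whiskers'][2 * i].get_ydata()[1]
    upper = bp['whiskers'][2 * i + 1].get_ydata()[1]
    adjacent.append((lower, upper))
    print('%d: lower adjacent = %f, upper adjacent = %f' % (i, lower, upper))
```

The same values can be computed without plotting by applying the 1.5 x IQR rule to the data directly.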

Also, is there anyway to directly get the indices of the outliers?

Look into np.where()
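One way this might look, as a sketch: build a boolean mask from the 1.5 x IQR rule and pass it to np.where to get positions in the original array. The sample values are made up for illustration.

```python
import numpy as np

# Hypothetical sample with one high and one low outlier.
y = np.array([1.0, 2.0, 50.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, -40.0])

q1, q3 = np.percentile(y, [25, 75])
iqr = q3 - q1

# True wherever a value lies beyond 1.5 * IQR from the box.
is_outlier = (y < q1 - 1.5 * iqr) | (y > q3 + 1.5 * iqr)

# np.where returns a tuple of index arrays; take the first for 1-D data.
outlier_idx = np.where(is_outlier)[0]
print(outlier_idx, y[outlier_idx])
```

For the timings list in the question, this would be repeated for each sublist so that the indices refer to positions within that sublist.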
