inconsistent spacing in histograms

I notice that when the number of bins in a histogram is sparse, the spacing between the bins can be irregular. For example:

http://cl.ly/7e0ad7039873d5446365
http://cl.ly/c7cb20b567722928ac3c

Is there a way of normalizing this? And better, can the default behavior produce something more consistent (i.e. publication-quality)?

Thanks,
Chris

That looks like some bizarre rounding/truncation or something like it.
Can you post an example (can just use made up data) that reproduces
this? I've not seen this before, so I sense it's due to the specific
data types you're passing in.

Ryan

···

On Fri, Oct 22, 2010 at 8:47 AM, Christopher Fonnesbeck <statistics@...2904...> wrote:


--
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma

Here is a very simple example. The data are just a list of integers:

http://dl.dropbox.com/u/233041/histexample.py

and it results in an odd choice of intervals.

(array([863, 775, 0, 271, 0, 67, 23, 0, 0, 1]),
array([ 0. , 0.6, 1.2, 1.8, 2.4, 3. , 3.6, 4.2, 4.8, 5.4, 6. ]),
<a list of 10 Patch objects>)

If there are only 7 possible values in the data, evenly spaced, it should probably not create more than 6 bins by default. I know I can specify bins by hand, but when automated it would be nice to have a more sensible default.
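For instance, a made-up data set shaped like mine (the real values are not important) shows the problem, and passing explicit integer-centered bin edges is the kind of result I have in mind:

```python
import numpy as np

# Made-up integer data with 7 possible values (0..6), standing in for
# the real data set.
data = np.array([0] * 863 + [1] * 775 + [2] * 271 + [3] * 67 + [4] * 23 + [6] * 1)

# Default of 10 bins over [0, 6] gives a bin width of 0.6, so some bins
# fall between the integers and come out empty.
counts_default, edges_default = np.histogram(data, bins=10)
print(counts_default)   # [863 775   0 271   0  67  23   0   0   1]

# Unit-width bins centered on the integers give one bin per value.
edges = np.arange(-0.5, 7.5, 1.0)
counts, _ = np.histogram(data, bins=edges)
print(counts)           # [863 775 271  67  23   0   1]
```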

Thanks,
cf

···

On Oct 22, 2010, at 9:13 AM, Ryan May wrote:


It just defaults to creating 10 bins (identical to numpy.histogram,
which does the work under the hood). If you know how many bins you
want, you can just do:

hist(x, bins=6)

This gives (for your example) the behavior you seem to want. I don't
know of any way that would sensibly choose a number of bins
automatically, but I'd consider a patch that proves me wrong. :)

Ryan

···



I'm moving over from IDL. From that background I used the Coyote
library quite a bit, and there I found:

    binsize = (3.5 * numpy.std(data)) / len(data)**(0.3333)

(from http://www.dfanning.com/programs/histoplot.pro, known as Scott's
choice of bin size for histograms).

From the bin size and the range of the data, you can then figure out
an axis for the histogram.
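As a sketch (with synthetic data standing in for anything real), turning that bin size into histogram bin edges might look like:

```python
import numpy as np

np.random.seed(0)
data = np.random.randn(1000)

# Scott's choice of bin width: 3.5 * sigma / n**(1/3).
binsize = 3.5 * np.std(data) / len(data) ** (1.0 / 3.0)

# From the bin width and the data range, build the axis (bin edges)
# covering all of the data.
nbins = int(np.ceil((data.max() - data.min()) / binsize))
edges = data.min() + binsize * np.arange(nbins + 1)

counts, _ = np.histogram(data, bins=edges)
print(nbins)
```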

Maarten

···

On Fri, 2010-10-22 at 11:12 -0500, Ryan May wrote:
--
KNMI, De Bilt
T: 030 2206 747
E: Maarten.Sneep@...3329...
Room B 2.42

Thanks for that. This actually led me here:
http://en.wikipedia.org/wiki/Histogram which gives a bunch of
different ways to estimate the number of bins or the bin size. It
might be worth implementing one of these in general. Ironically,
though, they wouldn't actually give the original poster the desired
result: the computed bin sizes would lead to lots of bins, many of
them empty because of the integer data. In fact, it seems all of
these methods break down on integer data. I guess you could take the
ceiling of the calculated bin size... anyone have an opinion on
whether calculating binsize/nbins would be a step forward over
leaving the default (of 10) and letting the user calculate if they
like?
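A rough sketch of that idea (Scott's bin size, rounded up to a whole number for integer data; the data and names here are all made up):

```python
import numpy as np

np.random.seed(42)
# Synthetic integer data, like the original report.
data = np.rint(2 * np.random.randn(2000)).astype(np.int32)

# Scott's rule usually gives a fractional width; take the ceiling so
# the bins stay aligned with the integer values (and are at least 1).
scott = 3.5 * np.std(data) / len(data) ** (1.0 / 3.0)
binsize = max(1, int(np.ceil(scott)))

# Half-integer edges so data values sit in bin centers, not on edges.
edges = np.arange(data.min() - 0.5, data.max() + 0.5 + binsize, binsize)
counts, _ = np.histogram(data, bins=edges)
print(binsize, len(counts))
```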

Ryan

···

On Fri, Oct 22, 2010 at 11:31 AM, Maarten Sneep <maarten.sneep@...3329...> wrote:


Integer histograms are a different beast altogether. It is not very hard
to define a natural bin width for integer histograms: 1. The only
sensible alternatives are integer multiples of that.

import numpy as np
import matplotlib.pyplot as plt
data = np.int32(np.rint(200*np.random.randn(10000)))
axis = np.arange(data.min(), data.max()+1)
hist = np.zeros((data.max()-data.min()+1,), dtype=np.int32)
# Unfortunately the shortcut hist[data - data.min()] += 1 does not
# work: fancy-index assignment collapses repeated indices, so each
# count would be incremented at most once. Explicit loop instead:
for item in data:
    hist[item - data.min()] += 1

plt.plot(axis, hist)
plt.show()
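As an aside, numpy's bincount does the same counting without the explicit loop, assuming the offsets are non-negative (which they are after subtracting the minimum):

```python
import numpy as np

np.random.seed(1)
data = np.int32(np.rint(200 * np.random.randn(10000)))

# np.bincount counts occurrences of each non-negative offset in one
# vectorized call, replacing the explicit Python loop.
hist_fast = np.bincount(data - data.min())
axis = np.arange(data.min(), data.max() + 1)
print(hist_fast.sum())  # 10000
```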

This histogram can easily be adapted to any sensible bin size, as this
is the finest possible increment. With floats you have to do things the
hard way because there is no such thing as a natural bin size.

And yes, the np.histogram() function is much faster.

hist2 = np.histogram(data, bins=data.max()-data.min())
plt.plot(hist2[1][0:-1]+0.5, hist2[0])
plt.show()

I don't like putting the data values on the bin boundaries, since in
this case it is very clear what the bins should be.

Yes, this is not so much a hard suggestion as a line of thought.
Treating integer data differently from pseudo-continuous data for
histograms is the natural way, in my view. Scaling (grouping bins)
could be done to ensure that the most populated bin contains
4*ndata/nbins points (yes, this fails for uniformly distributed data).

Maarten

···

On Fri, 2010-10-22 at 13:39 -0500, Ryan May wrote:

