Integer histograms are a different beast altogether. It is not very hard

to define a natural bin width for integer histograms: 1. The only

sensible alternatives are integer multiples of that.

import numpy as np

import matplotlib.pyplot as plt

data = np.int32(np.rint(200*np.random.randn(10000)))

axis = np.arange(data.min(), data.max()+1)

hist = np.zeros((data.max()-data.min()+1,), dtype=np.int32)

# Unfortunately the shortcut hist[data-data.min()] += 1 does not work:

# when an index occurs more than once, its bin is still incremented only once.

# Explicit loop:

for item in data:

    hist[item-data.min()] += 1

plt.plot(axis,hist)

plt.show()
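
As an aside, the explicit loop can probably be avoided. The sketch below is not part of the code above: it assumes that np.add.at (unbuffered in-place addition) and np.bincount are acceptable substitutes, and the names offset, hist_loop, hist_bad, hist_at and hist_bc are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.rint(200 * rng.standard_normal(10000)).astype(np.int32)

# Shift the data so that the smallest value maps to index 0.
offset = data - data.min()
nbins = int(data.max() - data.min() + 1)

# Reference: the explicit loop.
hist_loop = np.zeros(nbins, dtype=np.int64)
for item in offset:
    hist_loop[item] += 1

# The tempting shortcut: repeated indices are only incremented once.
hist_bad = np.zeros(nbins, dtype=np.int64)
hist_bad[offset] += 1

# Unbuffered addition handles repeated indices correctly.
hist_at = np.zeros(nbins, dtype=np.int64)
np.add.at(hist_at, offset, 1)

# bincount counts occurrences of each non-negative integer directly.
hist_bc = np.bincount(offset, minlength=nbins)
```

Both hist_at and hist_bc should match the loop result, while hist_bad never exceeds 1 in any bin.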

This histogram can easily be adapted to any sensible bin size, as this

is the finest possible increment. With floats you have to do things the

hard way because there is no such thing as a natural bin size.
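
To illustrate what adapting to a coarser bin size could look like: a minimal sketch that groups a unit-width histogram into bins that are an integer multiple wide. The function name regroup and its arguments are hypothetical, and np.add.reduceat is just one way to do the grouping.

```python
import numpy as np

def regroup(unit_counts, first_value, width):
    """Group a unit-width integer histogram into bins `width` integers wide.

    unit_counts[i] is the count for the integer first_value + i.
    Returns the smallest integer of each grouped bin and the grouped counts.
    The last bin may cover fewer than `width` integers.
    """
    unit_counts = np.asarray(unit_counts)
    starts = np.arange(0, unit_counts.size, width)
    # Sum unit_counts[starts[i]:starts[i+1]]; the last slice runs to the end.
    counts = np.add.reduceat(unit_counts, starts)
    return first_value + starts, counts

# Seven unit bins for the integers -3..3, grouped three at a time.
left, counts = regroup([1, 2, 3, 4, 5, 6, 7], first_value=-3, width=3)
# left   -> [-3, 0, 3]
# counts -> [6, 15, 7]
```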

And yes, the np.histogram() function is much faster.

hist2 = np.histogram(data, bins=data.max()-data.min())

plt.plot(hist2[1][0:-1]+0.5, hist2[0])

plt.show()

I don't like that this puts the data on the bin boundaries, since for

integer data it is very clear what the bins should be.
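
One way around this, as far as I can tell, is to hand np.histogram explicit edges at the half-integers, so that every integer sits in the middle of its own bin. This is only a sketch; the names edges, counts and centers are chosen for the example. It also avoids the effect that, with an integer number of bins, the last bin is closed on both sides and so collects both data.max()-1 and data.max().

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.rint(200 * rng.standard_normal(10000)).astype(np.int32)

# Edges at min-0.5, min+0.5, ..., max+0.5: one bin per integer value.
edges = np.arange(data.min(), data.max() + 2) - 0.5
counts, edges = np.histogram(data, bins=edges)

# Bin centers are then exactly the integers min..max.
centers = 0.5 * (edges[:-1] + edges[1:])
```

Plotting counts against centers should then give the same picture as the explicit loop.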

Yes, this is not so much a hard suggestion, as it is a line of thought.

Treating integer data for histograms differently from pseudo continuous

data is the natural way in my view. Scaling (grouping bins) could be

done to ensure that the most populated bin contains 4*ndata/nbins points

(yes, this fails for uniformly distributed data).

Maarten


On Fri, 2010-10-22 at 13:39 -0500, Ryan May wrote:

Thanks for that. This actually led me here:

http://en.wikipedia.org/wiki/Histogram which gives a bunch of

different ways to estimate the number of bins/binsize. It might be

worth looking at one of these in general. However, ironically enough,

these wouldn't actually give the original poster the desired

results--the binsizes would lead to lots of bins, many of which would

be empty due to the integer data. In fact, it seems that all of these

methods are going to break down due to integer data. I guess you could

take the ceiling of the calculated binsize...anyone have an opinion on

whether calculating binsize/nbins would be a step forward over leaving

the default (of 10) and letting the user calculate if they like?
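
The "ceiling of the calculated binsize" idea could be sketched roughly as follows. The Freedman-Diaconis rule (bin width 2*IQR/n^(1/3)) is picked here only as one example of the rules listed on that page; the function names integer_bin_width and integer_edges are hypothetical.

```python
import numpy as np

def integer_bin_width(data):
    """Freedman-Diaconis bin width, rounded up to a whole number (at least 1)."""
    data = np.asarray(data)
    q25, q75 = np.percentile(data, [25, 75])
    h = 2.0 * (q75 - q25) / data.size ** (1.0 / 3.0)
    return max(1, int(np.ceil(h)))

def integer_edges(data, width):
    """Half-integer bin edges, `width` integers per bin, covering all data."""
    lo, hi = int(np.min(data)), int(np.max(data))
    # The last edge lands at or beyond hi + 0.5.
    return np.arange(lo, hi + width + 1, width) - 0.5

rng = np.random.default_rng(0)
data = np.rint(200 * rng.standard_normal(10000)).astype(np.int32)

width = integer_bin_width(data)
edges = integer_edges(data, width)
counts, _ = np.histogram(data, bins=edges)
```

Since every bin then spans the same whole number of integers, no bin is empty merely because it fell between two integers.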

--

KNMI, De Bilt

T: 030 2206 747

E: Maarten.Sneep@...3329...

Room B 2.42