Integer equal-width bins for histograms

Antony_Lee1 · May 30, 2014, 10:23pm

Yes, I understand there are alternatives – but I still think a simple, binned histogram is a fairly basic feature.
KDEs are nice but can easily be over-tweaked: if I see one, I certainly want to know how the bandwidth was selected; otherwise it is no better than a histogram, and arguably worse, since the issue is now hidden. And while CDFs (essentially your second suggestion) can be useful, some kinds of data are traditionally represented as histograms, and CDFs would only confuse readers.
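
To make the bandwidth point concrete, here is a minimal hand-rolled Gaussian KDE. It is only a sketch: the function name `gaussian_kde_at` and the sample data are made up for illustration, and real work would normally use a library implementation. The bandwidth is an explicit argument, and changing it decides whether the gap between the two clusters shows up at all.

```python
import math

def gaussian_kde_at(data, x, bandwidth):
    """Gaussian kernel density estimate at point x (hypothetical helper)."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    # Normalisation so that the estimate integrates to 1 over x.
    norm = len(data) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data) / norm

data = [1.0, 1.2, 0.9, 4.0, 4.1]  # made-up sample with two clusters
points = (1.0, 2.5, 4.0)          # inside cluster, in the gap, inside cluster

# Same data, two bandwidths: the narrow one shows the gap, the wide one hides it.
narrow = [gaussian_kde_at(data, x, 0.2) for x in points]
wide = [gaussian_kde_at(data, x, 2.0) for x in points]
```

With the narrow bandwidth the density in the gap is essentially zero; with the wide one the gap is smoothed over, which is why the bandwidth should be reported alongside the plot.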

Antony

2014-05-30 15:11 GMT-07:00 Mark Voorhies <mark.voorhies@…4539…4…>:

On 05/30/2014 08:25 AM, Antony Lee wrote:

I may still need to bin data, e.g. when the data range is “large”, or at least not small compared to the number of data points.

Antony

Two alternatives to histograms that you might consider:

Kernel density estimation (KDE)

This blog post has a good discussion motivating KDE from issues with bin choice in histograms:

http://www.mglerner.com/blog/?p=28

And this follow up explores the various KDE implementations in the “Scientific Python” stack:

http://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/

A rank vs. value plot, e.g.:

plot(sorted(r))

This is horizontal for peaks (lots of copies of similar values) and vertical for tails/gaps,

so it presents the same information as a histogram, but without requiring bin choice.
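
As a sketch of what the rank vs. value plot encodes (the sample data and the helper names `rank_value_pairs` and `plot_rank_value` are invented here): sorting the data and plotting value against rank gives flat runs where a value repeats and jumps where there are gaps.

```python
def rank_value_pairs(r):
    """Return (rank, value) pairs, i.e. the points that plot(sorted(r)) draws."""
    return list(enumerate(sorted(r)))

def plot_rank_value(r):
    """Draw the rank vs. value plot (matplotlib is imported only when drawing)."""
    import matplotlib.pyplot as plt
    plt.plot(sorted(r))
    plt.xlabel("rank")
    plt.ylabel("value")
    plt.show()

r = [7, 2, 2, 2, 2, 15, 3, 2]  # made-up data: many 2s and one outlier at 15
pairs = rank_value_pairs(r)    # flat run of 2s first, then a steep rise to 15
```

Calling `plot_rank_value(r)` draws the figure; no bin choice is involved at any point.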

–Mark

2014-05-30 5:03 GMT-07:00 Yoshi Rokuko <yoshi@…3676…>:

On Thu, 29 May 2014 14:14:52 -0700, Antony Lee <antony.lee@…1016…> wrote:

Hi,

When histogramming integer data, is there an easy way to tell

matplotlib that I want a certain number of bins, and each bin to

cover an equal number of integers (except possibly the last one)?

(in order to avoid having some bins higher than others merely because

they cover more integers) I know I can pass in an explicit bins array

(something like list(range(min, max, (max-min)//n)) + [max]) but I was

hoping for something simpler, like hist(data, nbins=42,

equal_integer_coverage=True). Best,

Antony
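
One way to build the explicit bins array mentioned above is a small helper. This is a sketch, not a matplotlib feature: the name `equal_integer_bins` and its parameters are made up for illustration, and `hist` has no `equal_integer_coverage` option. The helper assumes integer data; it places edges at half-integers so that no value falls on a bin boundary, and it may return fewer than `nbins` bins when the range does not divide evenly.

```python
import math

def equal_integer_bins(lo, hi, nbins):
    """Return bin edges covering the integers lo..hi inclusive (hypothetical helper).

    Every bin covers the same number of consecutive integers, except
    possibly the last one, which covers whatever is left over.
    """
    if hi < lo or nbins < 1:
        raise ValueError("need lo <= hi and nbins >= 1")
    n_values = hi - lo + 1
    width = math.ceil(n_values / nbins)   # integers per full bin
    n_full = n_values // width            # number of full-width bins
    edges = [lo - 0.5 + k * width for k in range(n_full + 1)]
    if n_values % width:                  # leftover integers go in a final, narrower bin
        edges.append(hi + 0.5)
    return edges
```

The result can be passed straight to `hist`, for example `plt.hist(data, bins=equal_integer_bins(min(data), max(data), 42))`.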

Int data is discrete. For discrete variables you don’t need bins, you

don’t estimate the frequency distribution; you know it exactly by

counting.

Of course you could do that with the hist function:

pl.hist(r, np.arange(min(r)-0.5, max(r)+1.5), histtype='step')
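
A slightly fuller version of the one-liner above, as a sketch. It assumes `r` is a sequence of integers and that `pl` refers to `matplotlib.pyplot`; the helper names `unit_bin_edges` and `plot_integer_counts` and the sample data are invented here. The edges run from `min(r) - 0.5` to `max(r) + 0.5` in steps of 1, so each bin is centred on exactly one integer and the bar heights equal the exact counts.

```python
from collections import Counter

def unit_bin_edges(r):
    """Half-integer edges giving one bin per integer from min(r) to max(r)."""
    lo, hi = min(r), max(r)
    return [lo - 0.5 + k for k in range(hi - lo + 2)]

def plot_integer_counts(r):
    """Draw the step histogram (matplotlib is imported only when drawing)."""
    import matplotlib.pyplot as pl
    pl.hist(r, bins=unit_bin_edges(r), histtype="step")
    pl.show()

r = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]  # made-up integer sample
edges = unit_bin_edges(r)              # same values as np.arange(min(r)-0.5, max(r)+1.5)
counts = Counter(r)                    # exact frequencies; these are the bar heights
```

Calling `plot_integer_counts(r)` draws the figure; `counts` gives the same information without plotting.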

Matplotlib-users mailing list

Matplotlib-users@…1735…sourceforge.net

https://lists.sourceforge.net/lists/listinfo/matplotlib-users