hexbin and nans

I just spent a little time getting hexbin to discard nans without
failing in mysterious ways.

I'm sending 2 patches: one will detect nans and raise a suitable
exception. The other will automatically drop the nans and continue with
hexbin. The 2nd seems a little nicer in functionality, but the code is a
little more convoluted and hence more brittle, and so I'm torn as to
which I'd prefer. (The problem with just detecting nans from the
beginning is that it would cause an extra pass through the data in all
cases.)

Anyhow, I've attached them both here for discussion. I'm happy to check
one of these in myself or feel free to do it.

-Andrew

hexbin_nans_raise_exception.patch (585 Bytes)

hexbin_nans_automatic_fix.patch (1.63 KB)

I think the nans should be dropped with no raise or warning, which is
consistent with our handling elsewhere. I don't think your approach,
which if I am reading this correctly assumes that is x or y has nans
then the min or max will be nan, is correct. Eg,

  In [51]: x = np.random.rand(10)

  In [52]: x[4] = np.nan

  In [53]: x.min()
  Out[53]: 0.37072898459621617

  In [54]: x.max()
  Out[54]: 0.79367039009185769

I think min and max in the presence of nans is undefined. But you can
just pay for the extra pass

mask = np.isnan(x) & np.isnan(y)
if mask.any():
     # mask where

JDH

···

On Fri, Jul 18, 2008 at 2:49 PM, Andrew Straw <strawman@...36...> wrote:

I just spent a little time getting hexbin to discard nans without
failing in mysterious ways.

I'm sending 2 patches: one will detect nans and raise a suitable
exception. The other will automatically drop the nans and continue with
hexbin. The 2nd seems a little nicer in functionality, but the code is a
little more convoluted and hence more brittle, and so I'm torn as to
which I'd prefer. (The problem with just detecting nans from the
beginning is that it would cause an extra pass through the data in all
cases.)

John Hunter wrote:

  

I just spent a little time getting hexbin to discard nans without
failing in mysterious ways.

I'm sending 2 patches: one will detect nans and raise a suitable
exception. The other will automatically drop the nans and continue with
hexbin. The 2nd seems a little nicer in functionality, but the code is a
little more convoluted and hence more brittle, and so I'm torn as to
which I'd prefer. (The problem with just detecting nans from the
beginning is that it would cause an extra pass through the data in all
cases.)
    
I think the nans should be dropped with no raise or warning, which is
consistent with our handling elsewhere. I don't think your approach,
which if I am reading this correctly assumes that is x or y has nans
then the min or max will be nan, is correct. Eg,

  In [51]: x = np.random.rand(10)

  In [52]: x[4] = np.nan

  In [53]: x.min()
  Out[53]: 0.37072898459621617

  In [54]: x.max()
  Out[54]: 0.79367039009185769

I think min and max in the presence of nans is undefined.

True -- that slipped past me, since in my use case the nans were always
coming out for both min and max...

  But you can
just pay for the extra pass

mask = np.isnan(x) & np.isnan(y)
if mask.any():
     # mask where

If you're happy with that extra cost, I'll modify
axes.delete_masked_points() so that hexbin and scatter are automatically
filtered in this way. Given the speed complaints we sometimes get, I was
hesitant to add another pass through the data.

-Andrew

···

On Fri, Jul 18, 2008 at 2:49 PM, Andrew Straw <strawman@...36...> wrote:

Happy is the wrong word -- we certainly should make these things as
efficient as reasonable so if your optimization worked it would be
preferable numpy passes are rarely the bottleneck but they are good
to be aware of. We could provide some helper function in nxutils to
do more stuff in a single pass for common use cases (eg a minmax
function to get both the min and the max...)

···

On Fri, Jul 18, 2008 at 4:42 PM, Andrew Straw <strawman@...36...> wrote:

If you're happy with that extra cost, I'll modify
axes.delete_masked_points() so that hexbin and scatter are automatically
filtered in this way. Given the speed complaints we sometimes get, I was
hesitant to add another pass through the data.

John Hunter wrote:

If you're happy with that extra cost, I'll modify
axes.delete_masked_points() so that hexbin and scatter are automatically
filtered in this way. Given the speed complaints we sometimes get, I was
hesitant to add another pass through the data.

masks.extend([~np.isfinite(x) for x in args])

may be the quickest and most general way to do it. I believe ~np.isfinite is both more general and significantly faster than np.isnan.

A kwarg could be used to enable or disable this checking.

Eric

···

On Fri, Jul 18, 2008 at 4:42 PM, Andrew Straw <strawman@...36...> wrote:

Happy is the wrong word -- we certainly should make these things as
efficient as reasonable so if your optimization worked it would be
preferable numpy passes are rarely the bottleneck but they are good
to be aware of. We could provide some helper function in nxutils to
do more stuff in a single pass for common use cases (eg a minmax
function to get both the min and the max...)

-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________
Matplotlib-devel mailing list
Matplotlib-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/matplotlib-devel

Eric Firing wrote:

John Hunter wrote:

If you're happy with that extra cost, I'll modify
axes.delete_masked_points() so that hexbin and scatter are automatically
filtered in this way. Given the speed complaints we sometimes get, I was
hesitant to add another pass through the data.

masks.extend([~np.isfinite(x) for x in args])

may be the quickest and most general way to do it. I believe
~np.isfinite is both more general and significantly faster than np.isnan.

Clever, but it won't work as-is. np.isfinite('b') returns a
NotImplementedType, and a default argument to scatter is c='b', which
gets passed to this function. Anyhow, I implemented your idea with a
check for NotImplementedType and some unit tests in r5791.

A kwarg could be used to enable or disable this checking.

Yes, but since we're grabbing *args, that would mean parsing **kwargs --
so I just implemented it without the option of disabling it. I'm happy
to add this if desired.

Happy is the wrong word -- we certainly should make these things as
efficient as reasonable so if your optimization worked it would be
preferable numpy passes are rarely the bottleneck but they are good
to be aware of. We could provide some helper function in nxutils to
do more stuff in a single pass for common use cases (eg a minmax
function to get both the min and the max...)

Yes, good idea. (In fact I wonder why numpy doesn't have minmax.) And
apologies for suggesting that you might be "happy" with a (minor)
slowdown -- not the best choice of words.

-Andrew

···

On Fri, Jul 18, 2008 at 4:42 PM, Andrew Straw <strawman@...36...> >> wrote:

I've often wondered this myself...

Cheers,

f

···

On Fri, Jul 18, 2008 at 4:17 PM, Andrew Straw <strawman@...36...> wrote:

Yes, good idea. (In fact I wonder why numpy doesn't have minmax.)

Andrew Straw wrote:

may be the quickest and most general way to do it. I believe
~np.isfinite is both more general and significantly faster than np.isnan.

Clever, but it won't work as-is. np.isfinite('b') returns a
NotImplementedType, and a default argument to scatter is c='b', which
gets passed to this function. Anyhow, I implemented your idea with a
check for NotImplementedType and some unit tests in r5791.

Andrew,

I think there were at least two problems with the delete_masked_points function after you added the isfinite check, one of which was left over just about everything, including the unit tests and the docstring (which is not in rst--sorry, maybe I can fix that later). Note that the function is now in cbook.

I hope the combination of the code, the docstring and the unit tests make the intended functionality clear, but it may all still be a bit confusing. In any case, I think the new version does what is needed for scatter, hexbin, and windbarb, and may turn out to be more generally useful.

As a side note, I suspect the check for NotImplementedType is not robust; I can easily imagine isfinite being changed to raise an exception instead. Therefore I did not use that check.

Eric

···

from my earlier implementation, and one new one. I ended up rewriting