How do you use numpy.ma?

Hi matplotters,

As any of you subscribed to the numpy-discussion list will have
probably noticed, there's intense debate going on about how numpy can
do a better job of handling missing data and masked arrays. Part of
the problem is that we aren't actually sure what users need these
features to do. There's one group who just wants R-style "missing
data", and their needs are pretty straightforward -- they just want a
magic value that indicates some data point doesn't actually exist. But
it seems like there's also demand for a more "masked array"-like
feature, similar to the current numpy.ma, where the mask is
non-destructive and easily manipulable. No-one seems clear on how
exactly this should work, though, and there's a lot of disagreement
about what semantics make sense. (If you want more details, there's a
wiki page summarizing some of this[1]).

Since you seem to be the biggest users of numpy.ma, it would be really
helpful if you could explain how you actually use it, so we can make
sure that whatever we do in numpy-land is actually useful to you!

What does matplotlib use masked arrays for? Is it just a convenient
way to keep an array and a boolean mask together in one object, or do
you take advantage of more numpy.ma features? For example, do you
ever:
- unmask values?
- create multiple arrays that share the same storage for their data,
but have different masks? (i.e., creating a new array with new
elements masked, but without actually allocating the memory for a full
array copy)
- use reduction operations on masked arrays? (e.g., np.sum(masked_arr))
- use binary operations on masked arrays? (e.g., masked_arr1 + masked_arr2)
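
To make the four questions above concrete, here is a small sketch of what each operation looks like with the current numpy.ma. The array names and values are invented for illustration and are not taken from matplotlib.

```python
import numpy as np
import numpy.ma as ma

data = np.array([1.0, 2.0, 3.0, 4.0])

# Two masked arrays that share one data buffer but carry different masks.
a = ma.array(data, mask=[False, True, False, False], copy=False)
b = ma.array(data, mask=[False, False, True, False], copy=False)
shares = np.shares_memory(a.data, data) and np.shares_memory(b.data, data)

# Reduction: masked elements are skipped, so this adds 1 + 3 + 4.
total = a.sum()

# Binary operation: the result is masked wherever either operand is masked.
c = a + b
combined_mask = ma.getmaskarray(c)

# Unmasking: clear one mask entry; the original value 2.0 is still stored.
a.mask[1] = False
total_after = a.sum()
```

Unmasking only works here because the mask is non-destructive: the value under the mask was never overwritten.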

And while we're at it, any complaints about how numpy.ma works now,
that a new version might do better?

Thanks in advance,
-- Nathaniel

[1] https://github.com/njsmith/numpy/wiki/NA-discussion-status

Hi Nathaniel,

Unfortunately, I can’t spend much more time on this topic due to my dissertation work. I will allow others to elaborate further, if they wish. But I think I can summarize it a bit.

First, we try our best to respect the multiple ways users can specify missing data as input to our main plotting functions. The most common are NaNs and np.ma masks. Because we try to maintain compatibility with older versions of NumPy, we will have to build some sort of compatibility mechanism that unifies every representation (NaNs, np.ma, NA (or whatever it will be called)) under a single abstraction for internal use. This will probably be np.ma at first, until we can depend on the existence of np.NA.
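
As a rough sketch of what such a unifying step could look like with np.ma as the internal representation: the helper name as_masked is hypothetical and is not matplotlib's actual code. It only handles NaNs and existing np.ma masks, since np.NA does not exist yet.

```python
import numpy as np
import numpy.ma as ma

def as_masked(values):
    """Hypothetical helper: mask NaNs and keep any mask the input already has."""
    data = ma.getdata(values)               # underlying ndarray, no copy for array input
    mask = ma.getmaskarray(values).copy()   # private copy, so the caller's mask is never shared
    if data.dtype.kind == "f":
        mask |= np.isnan(data)              # NaN only makes sense for floating-point data
    return ma.array(data, mask=mask, copy=False)

plain = np.array([1.0, np.nan, 3.0])
already_masked = ma.array([1.0, np.nan, 3.0], mask=[True, False, False])

from_plain = as_masked(plain)
from_masked = as_masked(already_masked)
```

Both inputs come out as masked arrays, so downstream code would only have to deal with one representation.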

Second, with functions that have multiple input arrays (pretty much all of them), a single mask has to be applied to all data (typically a logical_or’ing of the individual masks). Some other functions such as the pcolor family of functions have slightly more complicated mask merging.
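
A minimal sketch of the simple case, assuming two same-shaped inputs x and y (made-up values); the pcolor-style merging is more involved and is not shown.

```python
import numpy as np
import numpy.ma as ma

x = ma.array([0.0, 1.0, 2.0, 3.0], mask=[False, True, False, False])
y = ma.array([5.0, 6.0, 7.0, 8.0], mask=[False, False, False, True])

# A point is dropped if it is masked in either input.
common = np.logical_or(ma.getmaskarray(x), ma.getmaskarray(y))

# Re-wrap the original data with the merged mask; x and y keep their own masks.
x_plot = ma.array(ma.getdata(x), mask=common, copy=False)
y_plot = ma.array(ma.getdata(y), mask=common, copy=False)

# Only the points valid in both inputs remain.
x_valid = x_plot.compressed()
y_valid = y_plot.compressed()
```

ma.getmaskarray is used instead of the .mask attribute so that inputs with no mask at all still produce a full boolean array.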

The most important thing is that we do not modify the user’s data, and we keep copies to a minimum. np.ma works great because we can convert the arrays into masked_arrays without a copy, and the mask-merging process does not modify the user’s input data. I don’t think we were using some of the more advanced features of np.ma, but I can’t be sure of that.
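
For illustration, a small sketch of the no-copy conversion with a plain ndarray as input (the variable names are made up):

```python
import numpy as np
import numpy.ma as ma

user = np.array([1.0, 2.0, 3.0])

# Wrap the user's array without copying its data.
wrapped = ma.array(user, copy=False)
same_buffer = np.shares_memory(wrapped.data, user)

# Masking an element only sets a flag in the mask; the stored value is untouched.
wrapped[1] = ma.masked
```

After this, user still holds 1.0, 2.0, 3.0, while wrapped treats the middle element as missing.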

I guess the tricky case (which probably should be tested for) is when the input array is already a masked array: we need to make sure we aren’t changing the user’s pre-existing mask.
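
A sketch of what such a test might look like. The function mask_negatives is a hypothetical stand-in for any internal routine that adds to a mask; it is not an existing matplotlib function.

```python
import numpy as np
import numpy.ma as ma

def mask_negatives(values):
    """Hypothetical routine: additionally mask negative values."""
    data = ma.getdata(values)
    # `|` allocates a new boolean array, so the input's mask is not written to.
    mask = ma.getmaskarray(values) | (data < 0)
    return ma.array(data, mask=mask, copy=False)

def input_mask_untouched(func, user):
    """Return True if calling func(user) leaves user's mask as it was."""
    before = ma.getmaskarray(user).copy()
    func(user)
    return np.array_equal(before, ma.getmaskarray(user))

user = ma.array([1.0, -2.0, 3.0], mask=[True, False, False])
ok = input_mask_untouched(mask_negatives, user)
result = mask_negatives(user)
```

The important design choice is building a fresh mask rather than editing the input's mask in place.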

Ben Root

On Sun, Nov 6, 2011 at 4:43 PM, Nathaniel Smith <njs@…503…> wrote:
