filled contours, and missing data

Eric_Firing1 · February 20, 2005, 2:23am

John et al.,

I would like to phase in matplotlib to replace Matlab ASAP for plotting physical oceanographic observations, primarily current profile measurements. I (and many other physical oceanographers) primarily use contourf to plot filled contours; I only rarely use line contours. It looks to me like gcntr.c has the necessary functionality--the ability to output polygons enclosing regions between a pair of specified levels. Is someone already working on exposing that functionality in matplotlib, or is it planned?

It appears that gcntr.c also has the ability to handle missing data via setting elements of the reg array to zero, and that this could be exposed fairly easily in the contour method in axes.py by adding "reg" to the set of kwargs. Correct? If so, is this also planned?
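To make the idea concrete, here is a rough sketch of how a point-wise validity mask might be turned into per-zone flags, where a zone is the cell between four neighbouring grid points and is kept only if all four corners are valid. This is not the actual gcntr.c interface (the layout of its reg array is not described in this thread); the function name and the nested-list layout are assumptions made purely for illustration.

```python
def zones_from_point_mask(valid):
    """Return a (ny-1) x (nx-1) grid of 0/1 flags, one per grid cell.

    `valid` is a nested list with nonzero where the data point is good.
    A cell is flagged 1 only when all four of its corner points are good.
    (Hypothetical helper; not part of matplotlib or gcntr.c.)
    """
    ny = len(valid)
    nx = len(valid[0]) if ny else 0
    zones = []
    for j in range(ny - 1):
        row = []
        for i in range(nx - 1):
            corners = (valid[j][i], valid[j][i + 1],
                       valid[j + 1][i], valid[j + 1][i + 1])
            row.append(1 if all(corners) else 0)
        zones.append(row)
    return zones


# A single missing point in one corner of a 3x4 grid removes only the
# one cell that touches it.
valid = [[0, 1, 1, 1],
         [1, 1, 1, 1],
         [1, 1, 1, 1]]
print(zones_from_point_mask(valid))
```

A missing point in the interior would remove all four cells around it, which is the usual cost of masking by points rather than by cells.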

The question of missing data handling in contour plotting brings up the more general issue of how to handle data gaps in plots. For example, the ocean current profiles that I measure using a Doppler profiler extend to varying depths, and sometimes have holes in the middle where there are not enough acoustic scatterers to give a signal. This sort of thing--data gaps--is universal in physical oceanography. One of Matlab's major strengths is the way it handles them, using nan as a bad value flag. Plotting a line with the plot command, the line is broken at each nan; so if there is a hole in the data, the plot shows exactly that. The same for contouring: nans are automatically used as a mask.

Obviously, not everyone needs this kind of automatic handling of data gaps, but I think it would be very useful for many applications, so I hope it can be considered as a possible goal. At the plotting level, collections may make it easier to implement than would have been the case in the early days of matplotlib. At the array manipulation level, the implementation could involve either masked arrays or nans. I would greatly prefer the Matlab-style nan approach, but I don't know whether this would work with Numeric. Maybe in Numeric3? Numarray appears better equipped, with its ieeespecial.py module.

Thanks for the enormous amount of beautiful work you have already done!

Eric

_Perry_Greenfield1 · February 21, 2005, 5:08pm

John et al.,

I would like to phase in matplotlib to replace Matlab ASAP for plotting physical oceanographic observations, primarily current profile measurements. I (and many other physical oceanographers) primarily use contourf to plot filled contours; I only rarely use line contours. It looks to me like gcntr.c has the necessary functionality--the ability to output polygons enclosing regions between a pair of specified levels. Is someone already working on exposing that functionality in matplotlib, or is it planned?

No one (as far as I know) is working on it right now. It is in our plans to add this capability. As you correctly note, the underlying C code can handle it. I'm not sure how long it will be; right now the priority is to finish the contour labeling capability, and the person working on that also has other work competing for her time. I'm guessing that she could start looking at it in a couple of weeks. Of course, if someone wants to help now, that would be great.

It appears that gcntr.c also has the ability to handle missing data via setting elements of the reg array to zero, and that this could be exposed fairly easily in the contour method in axes.py by adding "reg" to the set of kwargs. Correct? If so, is this also planned?

Correct. Yes (it is planned).

The question of missing data handling in contour plotting brings up the more general issue of how to handle data gaps in plots. For example, the ocean current profiles that I measure using a Doppler profiler extend to varying depths, and sometimes have holes in the middle where there are not enough acoustic scatterers to give a signal. This sort of thing--data gaps--is universal in physical oceanography. One of Matlab's major strengths is the way it handles them, using nan as a bad value flag. Plotting a line with the plot command, the line is broken at each nan; so if there is a hole in the data, the plot shows exactly that. The same for contouring: nans are automatically used as a mask.

Obviously, not everyone needs this kind of automatic handling of data gaps, but I think it would be very useful for many applications, so I hope it can be considered as a possible goal. At the plotting level, collections may make it easier to implement than would have been the case in the early days of matplotlib. At the array manipulation level, the implementation could involve either masked arrays or nans. I would greatly prefer the Matlab-style nan approach, but I don't know whether this would work with Numeric. Maybe in Numeric3? Numarray appears better equipped, with its ieeespecial.py module.

I think you touch on the key issue. I think we'd have to figure out how to handle this between Numeric and numarray (and Numeric3 potentially). Would a mask array be a suitable substitute as an interim solution?

Perry

Eric_Firing1 · February 21, 2005, 10:02pm

Perry,

I would like to phase in matplotlib to replace Matlab ASAP for plotting physical oceanographic observations, primarily current profile measurements. I (and many other physical oceanographers) primarily use contourf to plot filled contours; I only rarely use line contours. It looks to me like gcntr.c has the necessary functionality--the ability to output polygons enclosing regions between a pair of specified levels. Is someone already working on exposing that functionality in matplotlib, or is it planned?

No one (as far as I know) is working on it right now. It is in our plans to add this capability. As you correctly note, the underlying C code can handle it. I'm not sure how long it will be; right now the priority is to finish the contour labeling capability, and the person working on that also has other work competing for her time. I'm guessing that she could start looking at it in a couple of weeks. Of course, if someone wants to help now, that would be great.

I have started working on it. I don't know how far I will get; the necessary change to the C extension code was easy, but my first attempt to make a PolyCollection work in place of a LineCollection is failing. I will do a bit more research before asking for help, if necessary. (No promises--I don't have much time to work on this, and it is my first plunge into the innards of matplotlib.)

It appears that gcntr.c also has the ability to handle missing data via setting elements of the reg array to zero, and that this could be exposed fairly easily in the contour method in axes.py by adding "reg" to the set of kwargs. Correct? If so, is this also planned?

Correct. Yes (it is planned).

The question of missing data handling in contour plotting brings up the more general issue of how to handle data gaps in plots. For example, the ocean current profiles that I measure using a Doppler profiler extend to varying depths, and sometimes have holes in the middle where there are not enough acoustic scatterers to give a signal. This sort of thing--data gaps--is universal in physical oceanography. One of Matlab's major strengths is the way it handles them, using nan as a bad value flag. Plotting a line with the plot command, the line is broken at each nan; so if there is a hole in the data, the plot shows exactly that. The same for contouring: nans are automatically used as a mask.

Obviously, not everyone needs this kind of automatic handling of data gaps, but I think it would be very useful for many applications, so I hope it can be considered as a possible goal. At the plotting level, collections may make it easier to implement than would have been the case in the early days of matplotlib. At the array manipulation level, the implementation could involve either masked arrays or nans. I would greatly prefer the Matlab-style nan approach, but I don't know whether this would work with Numeric. Maybe in Numeric3? Numarray appears better equipped, with its ieeespecial.py module.

I think you touch on the key issue. I think we'd have to figure out how to handle this between Numeric and numarray (and Numeric3 potentially). Would a mask array be a suitable substitute as an interim solution?

Are you suggesting something like this? Let each plotting function have a new kwarg, perhaps called "validmask", with the same dimensions as the dependent variable to be plotted, and with nonzero where the variable is valid and 0 where it is missing. The mask would then be used (1) to limit the autoranging tests to the valid data, (2) in the case of line plotting, to break the line up into segments so that a LineCollection would be plotted, (3) in the case of contouring, to set the reg array, (4) for images or pcolors to similarly mask out the invalid regions with white, or transparent, or perhaps some settable color.
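Items (1) and (2) of the proposal can be sketched in a few lines of plain Python. This is only an illustration of the idea, not matplotlib code: the helper names `valid_range` and `split_segments` are made up for this example, and plain lists stand in for Numeric/numarray arrays.

```python
def valid_range(y, validmask):
    """Min and max of y over valid points only, for autoranging.

    Returns None when no point is valid.
    """
    good = [v for v, ok in zip(y, validmask) if ok]
    if not good:
        return None
    return min(good), max(good)


def split_segments(x, y, validmask):
    """Break (x, y) into runs of consecutive valid points.

    Each run is a list of (x, y) tuples, so the result could be drawn
    as separate segments instead of one continuous line.
    """
    segments, current = [], []
    for xi, yi, ok in zip(x, y, validmask):
        if ok:
            current.append((xi, yi))
        elif current:
            # An invalid point closes the run collected so far.
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


x = [0, 1, 2, 3, 4, 5]
y = [1.0, 2.0, -999.0, 4.0, 5.0, 3.0]   # -999.0 stands in for a bad value
mask = [1, 1, 0, 1, 1, 1]               # nonzero = valid, 0 = missing

print(valid_range(y, mask))
print(split_segments(x, y, mask))
```

Note that the bad value never influences the range, and an isolated valid point between two gaps would come out as a one-point segment, which a real implementation would need to decide how to draw.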

This could be implemented in matplotlib in a way that would not depend on any special features, or likely changes, in the Numeric/Numeric3/numarray set.

A numarray user could then use
    def notnan(y):
        return numarray.ieeespecial.mask(y, numarray.ieeespecial.NAN)

and say
    plot(x, y, validmask=notnan(y))

In any case, this "validmask kwarg" solution seems to me like a perfectly good one from a user's standpoint, and a good bridge to the happy day when Numeric/Numeric3/numarray converge or evolve to a single, dominant numerical module with good nan handling built in. (I very much hope such convergence will occur, and the sooner the better.)

Eric

Stephen_Walton1 · February 21, 2005, 10:07pm

Eric Firing wrote:

Are you suggesting something like this? Let each plotting function have a new kwarg, perhaps called "validmask", with the same dimensions as the dependent variable to be plotted, and with nonzero where the variable is valid and 0 where it is missing.

More or less, except that the mask is an attribute (?) of a MaskedArray object. I for one would be in favor of this capability.

Eric_Firing1 · February 22, 2005, 12:00am

Stephen,

Are you suggesting something like this? Let each plotting function have a new kwarg, perhaps called "validmask", with the same dimensions as the dependent variable to be plotted, and with nonzero where the variable is valid and 0 where it is missing.

More or less, except that the mask is an attribute (?) of a MaskedArray object. I for one would be in favor of this capability.

I agree that this is an alternative, but I am not sure that it is better than what I described. It requires all the machinery of the ma/MA module, which looks cumbersome to me. What does it gain? max and min will do the right thing on the masked array input, so one would not have to duplicate this inside matplotlib. It is not hard to duplicate, however. How much more ma/MA functionality would actually be useful?

When it was originally developed, the MaskedArray may have been a good way to get past Numeric's lack of nan-handling. In the long run, however, it seems to me that Python needs a numeric module with good nan-handling (as in Matlab and Octave), and that this will render the Masked Array obsolete. If so, then specifying a mask as a kwarg in matplotlib, and not using MA internally, may be simpler, more robust, and more flexible.

The user would still be free to use MA/ma externally, if desired.

A variation would be to support MA/ma in matplotlib only to the extent of checking for a MaskedArray input, and if it is present, breaking it apart and using the mask as if it had come via the kwarg. One could use either the kwarg or a Masked Array.
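A minimal sketch of that variation follows. It assumes the MA convention that a nonzero mask entry means "masked out", which is the opposite sense of the proposed validmask. The `FakeMaskedArray` class and the `split_input` helper are hypothetical stand-ins so the example runs without MA installed; real code would test for the actual MaskedArray type rather than duck-typing.

```python
class FakeMaskedArray:
    """Hypothetical stand-in for a masked array: values plus a mask
    in which nonzero means "masked out"."""

    def __init__(self, data, mask):
        self.data = list(data)
        self.mask = list(mask)


def split_input(y, validmask=None):
    """Return (values, validmask) from either input style.

    If y looks like a masked array, its mask is inverted to give a
    validmask (nonzero = good). Otherwise the kwarg is used, and it
    defaults to all-valid when omitted.
    """
    if hasattr(y, "mask") and hasattr(y, "data"):
        values = list(y.data)
        return values, [0 if m else 1 for m in y.mask]
    values = list(y)
    if validmask is None:
        validmask = [1] * len(values)
    return values, list(validmask)


# Both call styles end up in the same internal form.
print(split_input(FakeMaskedArray([1.0, 2.0, 3.0], [0, 1, 0])))
print(split_input([1.0, 2.0, 3.0], validmask=[1, 0, 1]))
```

With this kind of front end, the rest of the plotting code would only ever see plain values and a validity mask, whichever way the user supplied them.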

Eric

Stephen_Walton1 · February 22, 2005, 3:05am

Hello,

I agree that this is an alternative, but I am not sure that it is better than what I described. It requires all the machinery of the ma/MA module, which looks cumbersome to me. What does it gain?

It would be more flexible. Instead of having to actually replace data with NaN, you could create a mask which marked data to be ignored for the moment: all negative values, say, or all values with an imaginary part less than 1e-5. Much more flexible. Having said that, I agree that NaN should also be ignored wherever it occurs.

One could use either the kwarg or a Masked Array.

-1 on the kwarg. It seems to me that adding it to every plot command uglifies the interface significantly as well as being more work for John.

Stephen

Eric_Firing1 · February 22, 2005, 4:14am

Perry, John,

Progress! I found that the problem I was having with PolyCollection was this: the vertices argument must be a sequence (list or tuple) of tuples of tuples--if one gives it a list of *lists* of tuples, one gets

[first part of trace omitted]
  File "/usr/lib/python2.3/site-packages/matplotlib/collections.py", line 205, in draw
    self._offsets, self._transOffset)
TypeError: CXX: type error

(The line number was smaller before I put in some debugging print statements.)

I think this fussiness qualifies as a bug; the docstring for PolyCollection says vertices can be a sequence of sequences of tuples. I don't know what the right way to fix it is, however, so I am working around it.
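The workaround is not spelled out here, but one plausible form is to normalize the vertices before handing them to PolyCollection, so that every polygon and every point is a tuple. The helper name below is made up for this sketch.

```python
def as_nested_tuples(polys):
    """Convert a sequence of polygons, each a sequence of (x, y) pairs,
    into a tuple of tuples of tuples."""
    return tuple(tuple(tuple(pt) for pt in poly) for poly in polys)


# A list of lists (of tuples or of lists) comes out as nested tuples.
polys = [[(0, 0), (1, 0), (1, 1)],
         [[2, 2], [3, 2], [3, 3]]]
print(as_nested_tuples(polys))
```

The conversion copies every point, so it costs time on large polygon sets, but it sidesteps the type check in the extension code.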

Having solved that problem, I am getting more optimistic about being able to come up with a usable filled contour capability fairly quickly. Still no promises, though.

All this brings to mind a question that has puzzled me for a long time: why does matplotlib internally use sequences of (x,y) tuples instead of numerix arrays--either a 2-D array, or a pair (or tuple) of 1-D arrays? I would think that running all plotted numbers through the conversion from arrays to Python tuples, and then from there into the native data types for each backend, would incur a big performance penalty when plotting large numbers of points. Not that I am suggesting a redesign--I am just curious.

Eric

Stephen_Walton1 · February 22, 2005, 2:35pm

Stephen Walton wrote, in the context of using masked arrays rather than a keyword argument which would be the mask:

It would be more flexible. Instead of having to actually replace data with NaN, you could create a mask which marked data to be ignored for the moment:

This is, of course, incorrect. Both approaches would allow arbitrary data sets to be masked as needed.

There was a mention over on an astronomy group of the progress being made in masked astronomical images. Here too, the mask "comes along" with the data. Perry et al., does STScI anticipate using numarray/Numeric masked arrays within PyFITS to handle this?

_Perry_Greenfield1 · February 22, 2005, 9:04pm

When we looked at the issue of using NaNs in place of masks or masked arrays, we concluded (well, I did anyway) that while NaNs could be used to replace masks in many instances, they could not be used in all. There are a lot of cases where people want to retain the value being masked (e.g., to do statistics on the rejected values). NaNs as masks only work for float and complex, not ints. So both approaches are useful and needed as far as I can tell.
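Both points can be seen in a small plain-Python sketch. The data values and variable names are invented for illustration; lists stand in for arrays.

```python
import math

speeds = [0.12, 0.95, 0.10, 0.11]   # made-up data; 0.95 is rejected
rejected = [0, 1, 0, 0]             # nonzero = masked out

# With a separate mask, the rejected values are still there to examine.
kept = [v for v, r in zip(speeds, rejected) if not r]
dropped = [v for v, r in zip(speeds, rejected) if r]
mean_kept = sum(kept) / len(kept)
print(mean_kept, dropped)

# Flagging with NaN overwrites the value, so it cannot be recovered.
flagged = [float("nan") if r else v for v, r in zip(speeds, rejected)]
print(math.isnan(flagged[1]))

# Integers have no NaN: the conversion raises instead of producing a flag.
try:
    int(float("nan"))
    int_can_hold_nan = True
except ValueError:
    int_can_hold_nan = False
print(int_can_hold_nan)
```

So a mask keeps the original number next to the decision to ignore it, while a NaN flag replaces the number and is only available for floating-point (and complex) data.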

As far as keyword args go, it seems to me that they would be more convenient in many cases, but as Stephen mentions, may be a fair amount of work (and in essence, they are an attribute of the data, so that may be where they belong).

Perry
