plotting with missing data?

Hi All,

Say I have data that looks like:

date x y z
2008-01-01 10
2008-01-02 21 11
2008-01-02 32 15 5

How can I plot it such that all three lines are plotted, but it's apparent that two of them are missing some data?
(I know I could just sub in zeros for the missing values, but I'd like the point not to be there, not just down the bottom of the graph...)

cheers,

Chris

--
Simplistix - Content Management, Zope & Python Consulting
            - http://www.simplistix.co.uk

Chris,

Use masked arrays. See masked_demo.py in the mpl examples subdirectory.

Eric
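
A minimal sketch of the masked-array approach applied to the three-column table in the question, assuming a current NumPy and Matplotlib. The third date (2008-01-03), the nan placeholders for the empty cells, and the variable names are illustrative assumptions, not something taken from the thread.

```python
import datetime
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Assumption: the third row is 2008-01-03 (the table in the question repeats 2008-01-02).
dates = [datetime.date(2008, 1, d) for d in (1, 2, 3)]
nan = np.nan  # placeholder for the cells that are empty in the table
columns = {
    "x": [10, 21, 32],
    "y": [nan, 11, 15],
    "z": [nan, nan, 5],
}

fig, ax = plt.subplots()
masked = {}
for name, raw in columns.items():
    values = np.array(raw, dtype=float)
    # Mask the placeholders so those points are skipped rather than drawn at zero.
    masked[name] = np.ma.masked_where(np.isnan(values), values)
    ax.plot(dates, masked[name], marker="o", label=name)
ax.legend()
```

The marker is there because a series with a single valid point (z here) has no line segment to draw and would otherwise be invisible.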


Eric Firing wrote:

Chris,

Use masked arrays. See masked_demo.py in the mpl examples subdirectory.

Hi Eric,

I took a look at that, but it uses:

import matplotlib.numerix.npyma as ma

...and matplotlib.numerix isn't listed in the API reference. Where are the docs for this?

Specifically, what I have is an array like so:

['','','',1.1,2.2]

I want to mask the strings out so I don't get ValueErrors raised when I call plot functions with that array.

How should I do that?

cheers,

Chris


Chris Withers wrote:

Eric Firing wrote:

Chris,

Use masked arrays. See masked_demo.py in the mpl examples subdirectory.

Hi Eric,

I took a look at that, but it uses:

import matplotlib.numerix.npyma as ma

...and matplotlib.numerix isn't listed in the API reference. Where are the docs for this?

numerix is obsolete, and numerix.npyma was a temporary method to provide access to either of two masked array implementations. It is probably time for me to remove it from the examples. Substitute

import numpy.ma as ma

The ma module is documented as part of numpy.

Specifically, what I have is an array like so:

['','','',1.1,2.2]

Try something like this:

import numpy.ma as ma
from pylab import *

aa = [3.4, 2.5, '','','',1.1,2.2]
def to_num(arg):
     if arg == '':
         return 9999.0
     return arg

aanum = array([to_num(arg) for arg in aa])
aamasked = ma.masked_where(aanum==9999.0, aanum)
plot(aamasked)
show()

Eric


Eric Firing wrote:

Specifically, what I have is an array like so:

['','','',1.1,2.2]

Try something like this:

import numpy.ma as ma
from pylab import *

aa = [3.4, 2.5, '','','',1.1,2.2]
def to_num(arg):
    if arg == '':
        return 9999.0
    return arg

aanum = array([to_num(arg) for arg in aa])
aamasked = ma.masked_where(aanum==9999.0, aanum)
plot(aamasked)
show()

What I ended up doing was getting my array to look like:

from numpy import nan
aa = [3.4,2.5,nan,nan,nan,1.1,2.2]
values = numpy.array(aa)
values = numpy.ma.masked_equal(values,nan)

I only wish that masked_equal didn't blow up when aa contains datetime objects :-(

cheers,

Chris


Chris Withers wrote:

Eric Firing wrote:

Specifically, what I have is an array like so:

['','','',1.1,2.2]

Try something like this:

import numpy.ma as ma
from pylab import *

aa = [3.4, 2.5, '','','',1.1,2.2]
def to_num(arg):
    if arg == '':
        return 9999.0
    return arg

aanum = array([to_num(arg) for arg in aa])
aamasked = ma.masked_where(aanum==9999.0, aanum)
plot(aamasked)
show()

What I ended up doing was getting my array to look like:

from numpy import nan
aa = [3.4,2.5,nan,nan,nan,1.1,2.2]
values = numpy.array(aa)
values = numpy.ma.masked_equal(values,nan)

This is not doing what you think it is, because an equality comparison with a NaN always returns False:

In [4]:nan == nan
Out[4]:False

You should use numpy.masked_where(numpy.isnan(aa), aa).

In some places in mpl, nans are treated as missing values, but this is not uniformly true, so it is better not to count on it.

Your values array is not actually getting masked at the nans:

In [7]:aa = array([1,nan,2])

In [8]:aa
Out[8]:array([ 1., NaN, 2.])

In [9]:values = ma.masked_equal(aa, nan)

In [10]:values
Out[10]:
masked_array(data = [1.0 nan 2.0],
       mask = [False False False],
       fill_value=1e+20)

Eric


Chris Withers wrote:
> Eric Firing wrote:
You should use numpy.masked_where(numpy.isnan(aa), aa).

or use masked_invalid directly (shortcut to masked_where(isnan(aa) | isinf(aa), aa))

> I only wish that masked_equal didn't blow up when aa contains datetime
> objects :-(

Could you send me an example of the kind of data you're using ?
As it seems you're dealing with series indexed in time, you may want to try
scikits.timeseries, a package Matt Knox and myself implemented for that very
reason.
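
A small sketch comparing the three ways of masking discussed so far, assuming a current NumPy release in which numpy.ma.masked_invalid is available. The sample array is made up for illustration.

```python
import numpy as np

aa = np.array([3.4, 2.5, np.nan, np.inf, 1.1, 2.2])

# masked_equal compares with ==, and nan == nan is False, so nothing gets masked.
by_equal = np.ma.masked_equal(aa, np.nan)

# Explicit test for NaN and infinity.
by_where = np.ma.masked_where(np.isnan(aa) | np.isinf(aa), aa)

# Shortcut that masks NaN and +/-inf in one call.
by_invalid = np.ma.masked_invalid(aa)

print(np.ma.getmaskarray(by_equal))
print(np.ma.getmaskarray(by_where))
print(np.ma.getmaskarray(by_invalid))
```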

On Tuesday 18 March 2008 16:17:08 Eric Firing wrote:

Pierre GM wrote:

Chris Withers wrote:

Eric Firing wrote:

You should use numpy.masked_where(numpy.isnan(aa), aa).

(I meant numpy.ma.masked_where(...))

or use masked_invalid directly (shortcut to masked_where(isnan(aa) | isinf(aa), aa))

I don't see it in numpy.ma, with numpy from svn.

In any case, the fastest method is masked_where(~numpy.isfinite(aa), aa):

In [1]:import numpy

In [2]:xx = numpy.random.rand(10000)

In [3]:xx[xx>0.8] = numpy.nan

In [6]:timeit numpy.ma.masked_where(~numpy.isfinite(xx), xx)
10000 loops, best of 3: 83.9 µs per loop

In [7]:timeit numpy.ma.masked_where(numpy.isnan(xx), xx)
10000 loops, best of 3: 119 µs per loop

In [9]:timeit numpy.ma.masked_where((numpy.isnan(xx)|numpy.isinf(xx)), xx)
1000 loops, best of 3: 260 µs per loop

So, wherever you do have masked_invalid defined, you might want to use the faster implementation with ~isfinite.

Eric
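
For anyone wanting to repeat the comparison outside IPython, here is a rough sketch using the standard-library timeit module. The seed, repeat counts, and labels are arbitrary choices, and the timings (and possibly their ordering) will depend on the NumPy version and machine, so no expected numbers are given.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
xx = rng.random(10000)
xx[xx > 0.8] = np.nan  # knock out roughly 20% of the values

candidates = {
    "~isfinite": lambda: np.ma.masked_where(~np.isfinite(xx), xx),
    "isnan": lambda: np.ma.masked_where(np.isnan(xx), xx),
    "isnan | isinf": lambda: np.ma.masked_where(np.isnan(xx) | np.isinf(xx), xx),
}

timings = {}
for label, func in candidates.items():
    # Best of 3 repeats of 1000 calls, reported per call in microseconds.
    best = min(timeit.repeat(func, number=1000, repeat=3))
    timings[label] = best / 1000 * 1e6
    print(f"{label:>14}: {timings[label]:.1f} µs per call")
```

All three candidates produce the same mask for this data, since it contains nans but no infinities.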


Pierre GM wrote:

Could you send me an example of the kind of data you're using ?

It's basically performance and volume data for a high-volume website.
Unfortunately, the data is gappy in places due to data collection errors in the past...
(it's important the gaps are shown, rather than trying to interpolate them away, however)

As it seems you're dealing with series indexed in time, you may want to try scikits.timeseries, a package Matt Knox and myself implemented for that very reason.

How would this help me here and where can I find out about it?

cheers,

Chris


Eric Firing wrote:

This is not doing what you think it is,

Indeed, I guess I was seeing nans being treated as missing values rather than being masked...

You should use numpy.masked_where(numpy.isnan(aa), aa).

I am now ;-)

However, I'm still running into problems when I try and plot the gappy data on a filled line as follows:

dates = *an array of datetimes*
values = *an array containing data values and a few nans*
values = numpy.ma.masked_where(numpy.isnan(values),values)
xs,ys = mlab.poly_between(dates,0,values)
pylab.fill(xs,ys,'r')

For starters, I get this warning:

numpy\core\ma.py:609: UserWarning: Cannot automatically convert masked array to numeric because data is masked in one or more locations.

...and wherever a NaN occurs in the data, the line is plotted off the top of the axes. I want it to appear at 0 if there's no data. Well, ideally just not appear at all, but I'd settle for appearing at 0...

Any ideas?

cheers,

Chris


import numpy as np
a = ['','','',1.1,2.2]
mask_a = [i == '' for i in a]
b = np.ma.MaskedArray(a, mask=mask_a)
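
One caveat with the snippet above: passing the mixed list straight to NumPy gives an array of strings, not numbers. A possible variant, sketched here under the assumption that the missing entries are always empty strings, builds the mask the same way but swaps in nan before converting, so the data ends up as floats:

```python
import numpy as np

a = ['', '', '', 1.1, 2.2]

# Build the mask from the raw list, before any conversion.
mask_a = [item == '' for item in a]

# Swap the empty strings for nan so the data converts to a float array
# (the mixed list on its own would be converted to a string array).
data = np.array([np.nan if item == '' else item for item in a], dtype=float)

b = np.ma.MaskedArray(data, mask=mask_a)
print(b)
```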


--
giorgio@...1462...
http://www.cafelamarck.it

Chris,

Both with respect to documentation and functionality, what you are encountering is the historical aspect of masked arrays as a tacked-on part of python numeric packages, and of matplotlib. Support and integration are improving, but still far from perfect. A largely new, and substantially different, implementation of masked arrays has been transplanted into numpy since the last release. Similarly, mpl got a heart transplant since the last release, and it has some implications for the way nans and masked arrays are handled. There is lots more room for fundamental work on both numpy masked arrays (e.g., moving core code to pyrex/cython or C to speed them up) and on mpl.

Now with respect to your particular case here, trying to plot a filled line with gaps: poly_between has no notion of masked arrays at present. If it did, how should it behave? At the very least, additional arguments are needed to specify what should happen for fill-type plotting with missing values. If we can come up with a clear description of the behaviors that should be available, then maybe we can provide them in mpl. I would be happy to fix this gap in mpl's handling of gappy data, but I can't make it a priority use of my time right now.

For a quick fix, it sounds like what you need is either a function to break up your data set into gapless chunks, each of which could be plotted by a call to fill, or a function (a variant of poly_between) that would replace the gap regions with top and bottom lines at the same place (the bottom level? the x-axis?) so the whole thing could be plotted in one call to fill, provided the patch outline is suppressed.

I seem to recall someone else with a similar need in the past few months, so maybe someone on the list has a ready-made solution for you.

Eric
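
The first quick fix described above (break the data into gapless chunks and fill each one) could be sketched roughly as follows. This assumes a current NumPy (numpy.ma.clump_unmasked, numpy.ma.masked_invalid) and a Matplotlib recent enough to have Axes.fill_between, which postdates this thread. The helper name gapless_chunks and the sample data are hypothetical.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

def gapless_chunks(x, y):
    """Return (x, y) pairs, one per contiguous run of valid y values."""
    x = np.asarray(x)
    y = np.ma.masked_invalid(y)  # mask nan/inf entries
    # clump_unmasked gives one slice per contiguous unmasked stretch.
    return [(x[sl], y.data[sl]) for sl in np.ma.clump_unmasked(y)]

x = np.arange(7)
y = [3.4, 2.5, np.nan, np.nan, np.nan, 1.1, 2.2]
chunks = gapless_chunks(x, y)

fig, ax = plt.subplots()
for xc, yc in chunks:
    # One filled region per chunk, from the baseline 0 up to the data.
    ax.fill_between(xc, 0, yc, color="r")
```

With this data the helper yields two chunks (indices 0-1 and 5-6), so the plot shows two separate filled regions with a real gap between them.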


Chris,
My 2c:
Your data is indexed in time, right? Your x-axis is a date object? Then use scikits.timeseries
http://scipy.org/scipy/scikits/wiki/TimeSeries
That package was designed to take missing dates/data into account. That way, you can plot your data with the gaps already taken into account: we have written a specific matplotlib interface; you'll find the details by following the link above. I must admit we didn't implement poly_between for timeseries. Most likely, we'd have to implement it for regular masked arrays first, as mentioned by Eric.
What you could do is fill your array with some kind of baseline, such as 0, or your minimum data, or whatever. That's just a quick trick, not a fix.

Eric Firing wrote:

Both with respect to documentation and functionality, what you are encountering is the historical aspect of masked arrays as a tacked-on part of python numeric packages, and of matplotlib.

*sigh* I feel lucky ;-)

Support and integration are improving, but still far from perfect.

I wish I could help, but my knowledge is lacking...

Now with respect to your particular case here, trying to plot a filled line with gaps: poly_between has no notion of masked arrays at present. If it did, how should it behave?

Well, what I actually settled on was just using:

my_masked_array.filled(0)

...to plot with.
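
For reference, a tiny sketch of what filled(0) does, with made-up sample values: it returns an ordinary array in which every masked entry has been replaced by the given fill value, so the gaps end up on the baseline instead of disappearing.

```python
import numpy as np

values = np.array([3.4, 2.5, np.nan, np.nan, np.nan, 1.1, 2.2])
my_masked_array = np.ma.masked_where(np.isnan(values), values)

# filled(0) returns a plain ndarray with each masked entry replaced by 0.
baseline = my_masked_array.filled(0)
print(baseline)
```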

At the very least, additional arguments are needed to specify what should happen for fill-type plotting with missing values.

Indeed, what I personally would have liked was a complete gap where the data is missing, but I guess that would have to return multiple polygons, and I don't know how that would work?

provide them in mpl. I would be happy to fix this gap in mpl's handling of gappy data,

...heh ;-)

but I can't make it a priority use of my time right now.

No, I understand :-)

cheers,

Chris


Giorgio F. Gilestro wrote:

import numpy as np
a = ['','','',1.1,2.2]
mask_a = [i == '' for i in a]
b = np.ma.MaskedArray(a, mask=mask_a)

Not very efficient, though, is it?

cheers,

Chris


Pierre,

I was interested in learning more about TimeSeries, and had a few questions…

Your data is indexed in time, right ? Your x-axis is a date object ?

Just to be clear on the language: “indexed in time” means data for which the x-axis is a series of dates, correct? But I am not sure what is meant by the “x-axis being a date object”; wouldn’t it be an axis object with the values comprising it being date objects? I’m not trying to split hairs, I’m just unclear about the way this is typically described and it would be useful for me to be clear about it.

Then use scikits.timeseries
http://scipy.org/scipy/scikits/wiki/TimeSeries
That package was designed to take missing dates/data into account. That way, you can plot your data with the gaps already taken into account: we have written a specific matplotlib interface, you’ll find the details following the link above.

I’ve looked at the link. Could you explain what TimeSeries does that the mpl modules dates and dateutil don’t do, or when one would use one versus the other?

For my part, I need to simply plot values with dates (and yes with some dates missing no doubt) as the x-axis and am looking for various ways to do it well.

Thank you.

Pierre GM wrote:

Your data is indexed in time, right ? Your x-axis is a date object ? Then use scikits.timeseries
http://scipy.org/scipy/scikits/wiki/TimeSeries

I'm not sure what this is giving me.
The dates are all python datetimes in a list already.
The missing values started off as '', I turned those into nan and then created a ma with the nan's masked.

What more would TimeSeries give me?

the link above. I must admit we didn't implement poly_between for timeseries. Most likely, we'd have to implement it for regular masked arrays first, as mentioned by Eric.

OK.

What you could do is to fill your array with some kind of baseline, such as 0, or your minimum data, or whatever. That's just a quick trick and no fix.

Indeed, that's what I had to do.

I have to admit, I see some interesting things while scanning that wiki page, but nothing that would have helped me...

cheers,

Chris (who might well be missing something...)
