Sparse (NaN) dataset

I’ve been collecting telemetry from my PHEV car. The data comes with a time column that is always filled (no NaNs in it), but the rest of the columns are not ‘polled’ on every poll time (it’s hard to describe, sorry), so one row has only some of the data points, the next has others, and the one after that has yet another set. I think it takes 4 or 5 polls to get back to the original set of columns, but I have so much data (35k rows) that I can’t check it all by hand.

Of course, I’ve been trying to plot all this, but the removal of NaN support back in '08 or before (!!!) means I can only plot with markers, not with lines alone. I can try and mask the data, but since the time column is always not-NaN, I get the wrong graphs. I can call pandas’ fillna, but then I can’t recognize a sequence of equal values from a sequence of NaNs being skipped.

Is there an alternative to turn on ‘skip NaNs’ while drawing lines? Or at least a link to what’s referred to as r4833 in NaN support - #4 by Michael_Droettboom (it’s clearly a SVN or HG revision number)?

I’m not a pandas user, but can’t you use dropna to get a version of the timeseries without the NaN?

I can try and mask the data, but since the time column is always
not-NaN, I get the wrong graphs. I can call pandas’ fillna, but then
I can’t recognize a sequence of equal values from a sequence of NaNs
being skipped.

Can you elaborate on this? I’m assuming you mean something like the
'ffill' mode, which propagates the last non-NaN value forward. Could
you post some example rows with how you’d like to fill them? If you can
characterise this you can write some pure Python to implement your own
fill method. Can you explain what you mean by “recognize a sequence of
equal values from a sequence of NaNs being skipped”? I think you mean
the 'ffill' would replace NaNs with equal prior values, thus obscuring
“real” distinct measurements which were genuinely equal.

What do you want to see in your plot where these NaN gaps exist?

Is there an alternative to turn on ‘skip NaNs’ while drawing lines? Or at least a link to what’s referred to as r4833 in NaN support - #4 by Michael_Droettboom (it’s clearly a SVN or HG revision number)?

Could you plot individual lines in multiple calls onto the same axes?
You might pull out a distinct single-column dataframe per line, and drop
the NaNs from those subsidiary dataframes? Then plot.

Cameron Simpson


On 11Sep2022 12:39, Marcos Dione via Matplotlib wrote:

I don’t want to fill them, just skip them.

time | speed | batt level | temp
0000 |     0 |        100 | NaN
0001 |   NaN |        100 | 18
0002 |   NaN |        100 | NaN
0003 |   NaN |        NaN | 19
0004 |    10 |         99 | NaN

The line should connect the last point before the NaNs to the first point after them. So, with the values above: a line from 0 to 10 for speed, a line that goes 100, 100, 100, 99 for batt level, and one from 18 to 19 for temp, all aligned with the time series as the X axis. Think of it as a scatter plot. It seems this was the behavior 14 years ago.

Again, just make three plots where you use dropna on each column of the original data frame.
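A minimal sketch of that per-column dropna approach, using the example rows posted above. The column names come from the sample table; the `plotted` dictionary and the headless Agg backend are choices made only for this illustration.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

nan = np.nan
# The five sample rows from the post; NaN marks "not polled on this row".
df = pd.DataFrame({
    "time": [0, 1, 2, 3, 4],
    "speed": [0, nan, nan, nan, 10],
    "batt level": [100, 100, 100, nan, 99],
    "temp": [nan, 18, nan, 19, nan],
})

indexed = df.set_index("time")
fig, ax = plt.subplots()
plotted = {}  # kept only so the result can be inspected afterwards
for column in indexed.columns:
    # Drop the rows where this one column is NaN; time stays aligned
    # because it is the index of the resulting Series.
    series = indexed[column].dropna()
    plotted[column] = series
    ax.plot(series.index, series.values, marker="o", label=column)

ax.set_xlabel("time")
ax.legend()
```

Each column is drawn as its own line against its own subset of time stamps, so speed goes straight from 0 to 10 and temp from 18 to 19. Call `plt.show()` or `fig.savefig(...)` afterwards to see the result.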

The revision number is indeed an svn revision, which is now the commit “Handle NaNs in the data (on-the-fly as it's drawn)” (matplotlib/matplotlib@6229098 on GitHub).

The functionality referred to did get merged, but it does not do what you want it to do. When there are nan (or masked data) in the data, we do not know where to draw that point, so we cannot draw a line from the last valid point to the missing data. Thus, we “skip” the invalid point and move (in the Path drawing sense) to the next valid point, which effectively breaks the line. Because you need two sequential valid points to draw a line, your data is going to show up as either very short segments of 2 or 3 points (which, if the time base is very small compared to the total time, may be effectively invisible with the default anti-aliasing as they are shorter than a pixel long) or 1-ended segments (which do not get drawn because you need two sequential points to draw a line).
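To make the “two sequential valid points” rule concrete, here is a small sketch that counts how many segments would survive for each column of the sample rows. It only models the rule as described in this thread; `drawable_segments` is a hypothetical helper, not part of Matplotlib.

```python
import numpy as np

nan = np.nan
speed = np.array([0, nan, nan, nan, 10])
batt = np.array([100, 100, 100, nan, 99])
temp = np.array([nan, 18, nan, 19, nan])

def drawable_segments(values):
    """Count adjacent pairs where both points are valid (non-NaN)."""
    valid = ~np.isnan(values)
    # A segment exists only between a valid point and a valid next point.
    return int(np.sum(valid[:-1] & valid[1:]))

counts = {
    "speed": drawable_segments(speed),
    "batt level": drawable_segments(batt),
    "temp": drawable_segments(temp),
}
print(counts)
```

On the sample rows this counts 0 segments for speed and temp and 2 for batt level, which matches the symptom of lines mostly disappearing when the sparse table is plotted directly.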

I think your options here are either to do 3 plots against the time index where you dropna before passing to Matplotlib (as described above) or to pick some interpolation scheme you like to resample all of the data sets to the same time basis.
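For the second option, one possible scheme (an assumption, not a recommendation from this thread) is linear interpolation against the time index with pandas' `interpolate`. Note that this invents values between real samples, so it is a different thing from skipping the NaNs.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    "time": [0, 1, 2, 3, 4],
    "speed": [0, nan, nan, nan, 10],
    "batt level": [100, 100, 100, nan, 99],
    "temp": [nan, 18, nan, 19, nan],
})

# method="index" interpolates linearly using the numeric time index as x.
# limit_area="inside" only fills gaps between valid samples, so nothing is
# extrapolated before the first or after the last reading of a column.
resampled = df.set_index("time").interpolate(method="index", limit_area="inside")
print(resampled)
```

After this every column shares the same time basis, and NaN remains only before a column's first reading or after its last one.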

And the obvious question is: why is it so? Why not draw a line between a valid point and the next valid point? Can’t it be, for instance, optional? Remember, this is for a scatter-like plot, so ‘skipping’ points should be natural.

How to deal with “invalid” data is a complex subject that depends a lot on the context. A value being nan because (as in your case) there really is no data there and a value being nan because the reading is there but invalid are very different things.

If we were to effectively do dropna inside of Line2D then it would make your case much easier, but it would make the second case much harder, because to get the broken lines you would have to find the breaks and then draw a variable number of lines, one per segment, which would both be annoying to write and carry a (maybe steep) performance cost.

Having Line2D say “I will draw your data by connecting the points pairwise with straight lines; if a point is invalid I won’t draw the segment on either side of it” is a simple and straightforward way to implement this. If you are in case 2 it “just works”, and in case 1 the user will have to drop the invalid points (because only the user can tell which of the nan are of which type).

I think the core of the problem here is that the data you have is not well modeled by a table. It is really N 1D time series of uneven length which cover the same time range but are not synchronously sampled, not a sequence of records (all alike).

I see why the car software does it this way (1 file is simpler than N; it is probably just dumping a bunch of registers on a clock, and clearing them on write, so it does not have to deal with interrupts or push-based logging…), but on reading the data in to work on it, the first thing I would do is split the table by column and then work with a dictionary of (t, v) pairs (or a pd.Series with t as the index), each of which only has nan for truly invalid data, not artifacts of the data acquisition system.
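A rough sketch of that restructuring step, assuming the table has already been loaded into a DataFrame shaped like the sample rows. The name `series_by_name` is made up for this example, and the sketch treats every NaN as an acquisition artifact; readings that are genuinely invalid would have to be marked again by other means.

```python
import numpy as np
import pandas as pd

nan = np.nan
table = pd.DataFrame({
    "time": [0, 1, 2, 3, 4],
    "speed": [0, nan, nan, nan, 10],
    "batt level": [100, 100, 100, nan, 99],
    "temp": [nan, 18, nan, 19, nan],
})

indexed = table.set_index("time")
# One independent time series per signal, each with its own time stamps.
series_by_name = {name: indexed[name].dropna() for name in indexed.columns}

for name, series in series_by_name.items():
    print(name, list(series.index), list(series.values))
```

From here each signal can be plotted, resampled, or analysed on its own without the other columns' polling pattern getting in the way.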

Just for reference, this is real data:


That 5-line segment seems to repeat indefinitely once I start driving.