large data sets and performance

> Hello: We will be dealing with large (> 100,000 but in some
> instances as big as 500,000 points) data sets. They are to
> be plotted, and I would like to use matplotlib.

Are you working with plot/loglog/etc (line data) or
pcolor/hist/scatter/bar (patch data)?

I routinely plot data sets this large. 500,000 data points is a
typical 10 seconds of EEG, which is the application that led me to
write matplotlib. EEG is fairly special: the x axis time is
monotonically increasing and the y axis is smooth. This lets me take
advantage of level of detail subsampling.

If your xdata are sorted, ie like time, the following

  l = plot(blah, blah)
  set(l, 'lod', True)

could be a big win. LOD is "Level of Detail" and if true subsamples
the data according to the pixel width of the output, as you described.
Whether this is appropriate or not depends on the data set of course,
whether it is continuous, and so on. Can you describe your dataset in
more detail, because I would like to add whatever optimizations are
appropriate -- if others can pipe in here too that would help.
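To make the level-of-detail idea concrete, here is a minimal sketch of pixel-width subsampling for sorted x data. This is illustrative only, not matplotlib's internal implementation: it keeps the min and max y within each horizontal pixel bucket, so the drawn line looks the same while touching far fewer points.

```python
import numpy as np

def subsample_lod(x, y, n_pixels):
    """Level-of-detail subsampling for monotonically increasing x.

    Keeps the min and max y value within each pixel-wide bucket, so
    the rendered line is visually unchanged.  (Sketch only -- not
    matplotlib's actual 'lod' implementation.)
    """
    x = np.asarray(x)
    y = np.asarray(y)
    # Assign each point to a horizontal pixel bucket.
    edges = np.linspace(x[0], x[-1], n_pixels + 1)
    idx = np.searchsorted(edges, x, side="right") - 1
    idx = np.clip(idx, 0, n_pixels - 1)
    keep = set()
    for px in range(n_pixels):
        members = np.nonzero(idx == px)[0]
        if members.size == 0:
            continue
        # Keep the extremes of this pixel column.
        keep.add(members[np.argmin(y[members])])
        keep.add(members[np.argmax(y[members])])
    keep = sorted(keep)
    return x[keep], y[keep]
```

With 500,000 points rendered into an 800-pixel-wide axes, this reduces the line to at most 1,600 vertices while preserving every visible excursion.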

Secondly, the standard gdmodule will iterate over the x, y values in a
python loop in gd.py. This is slow for lines with lots of points. I
have a patched gdmodule that I can send you (provide platform info)
that moves this step to the extension module. Potentially a very big
win.
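The cost being described is interpreter overhead, not drawing cost. A rough sketch (the function names here are illustrative, not the real gd.py API): a per-segment Python loop makes one Python-level call for every segment, while a batched primitive crosses the Python/C boundary once and lets the extension iterate natively.

```python
def draw_polyline_per_segment(draw_segment, xs, ys):
    # Slow path: one Python-level call per line segment,
    # so a 500,000-point line pays interpreter overhead
    # half a million times.
    for i in range(len(xs) - 1):
        draw_segment(xs[i], ys[i], xs[i + 1], ys[i + 1])

def draw_polyline_batched(draw_lines, xs, ys):
    # Fast path: hand the whole coordinate list to a single call,
    # which a C extension can then walk in native code.
    draw_lines(list(zip(xs, ys)))
```

The patched gdmodule is effectively the second pattern: the loop over x, y moves out of gd.py and into _gdmodule.c.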

Another possibility: change backends. The GTK backend is
significantly faster than GD. If you want to work off line (ie, draw
to image only and not display to screen ) and are on a linux box, you
can do this with GTK and Xvfb. I'll give you instructions if
interested. In the next release of matplotlib, there will be a libart
paint backend (cross platform) that may be faster than GD. I'm
working on an Agg backend that should be considerably faster than all
the other backends since it does everything in extension code -- we'll
see :-).

JDH

John Hunter writes:

> could be a big win. LOD is "Level of Detail" and if true subsamples
> the data according to the pixel width of the output, as you described.
> Whether this is appropriate or not depends on the data set of course,
> whether it is continuous, and so on. Can you describe your dataset in
> more detail, because I would like to add whatever optimizations are
> appropriate -- if others can pipe in here too that would help.

What I was alluding to was adding a backend primitive that allows
plotting a symbol (patch?) or point for an array of points. The base
implementation would just do a Python loop over the single-point case,
so there is no requirement for a backend to overload this call. But a
backend could do so if it wanted to loop over all points in C. How
flexible to make this is open to discussion (e.g., allowing x and y
scaling factors, as arrays, for the symbol to be plotted, and other
attributes that may vary per point, such as color).
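A minimal sketch of the proposal, with hypothetical names (nothing here is an existing matplotlib API): the base class loops in Python over a single-point primitive, and a backend that can batch simply overrides the array-level method.

```python
class RendererBase:
    """Sketch of the proposed backend primitive (names hypothetical)."""

    def draw_point(self, x, y, color):
        raise NotImplementedError

    def draw_points(self, xs, ys, colors):
        # Default implementation: a plain Python loop over the
        # single-point call, so no backend is *required* to
        # override this.
        for x, y, c in zip(xs, ys, colors):
            self.draw_point(x, y, c)

class FastRenderer(RendererBase):
    """A backend that wants speed overrides draw_points and loops in C
    (simulated here by recording one batch per call)."""

    def __init__(self):
        self.batches = []

    def draw_points(self, xs, ys, colors):
        self.batches.append((list(xs), list(ys), list(colors)))
```

Per-point attribute arrays (size, color, scale) would just become extra parallel arguments to draw_points.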

Perry

Perry:
Currently using connected line plots, but I do not want to limit myself in any way when it comes to presenting data. I am certain that at some point I will use every plot available in the matplotlib arsenal. On a 3.2 GHz P4 with 2 GB RAM, I get ~90 seconds for a 100,000-point data set, ~50 seconds for 50,000, and ~9 seconds for 10,000 (roughly linear). This is way too long for my purposes; I was hoping for more like ~5 seconds for 100,000 points.

John:

> I routinely plot data sets this large. 500,000 data points is a
> typical 10 seconds of EEG, which is the application that led me to
> write matplotlib.

That sounds good!

> If your xdata are sorted, ie like time, the following
>
>   l = plot(blah, blah)
>   set(l, 'lod', True)
>
> could be a big win.

> Whether this is appropriate or not depends on the data set of course,
> whether it is continuous, and so on. Can you describe your dataset in
> more detail, because I would like to add whatever optimizations are
> appropriate -- if others can pipe in here too that would help.

I will mostly be plotting time vs. value(time), but in certain cases I will need plots of other data, and therefore have to look at the worst-case scenario.
I am not exactly sure what you mean by "continuous," since all are discrete data points. The data may not be smooth (misbehaving sensors could give garbage) and may jump all over the place.

> Secondly, the standard gdmodule will iterate over the x, y values in a
> python loop in gd.py. This is slow for lines with lots of points. I
> have a patched gdmodule that I can send you (provide platform info)
> that moves this step to the extension module. Potentially a very big
> win.

Yes, that would be great!
System info:

OS: RedHat 9 (kernel 2.4.20)

gcc version from running 'gcc -v':
Reading specs from /usr/lib/gcc-lib/i386-redhat-linux/3.2.2/specs
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --disable-checking --with-system-zlib --enable-__cxa_atexit --host=i386-redhat-linux
Thread model: posix
gcc version 3.2.2 20030222 (Red Hat Linux 3.2.2-5)

Python: Python 2.2.2 (#1, Feb 24 2003, 19:13:11)

matplotlib: matplotlib-0.50e

gdpython: 0.51 (with modified _gdmodule.c)

gd: gd-2.0.21

> Another possibility: change backends. The GTK backend is
> significantly faster than GD. If you want to work off line (ie, draw
> to image only and not display to screen) and are on a linux box, you
> can do this with GTK and Xvfb. I'll give you instructions if
> interested. In the next release of matplotlib, there will be a libart
> paint backend (cross platform) that may be faster than GD. I'm
> working on an Agg backend that should be considerably faster than all
> the other backends since it does everything in extension code -- we'll
> see :-).

Yes, I am only planning to work offline; I want to be able to pipe the output images to stdout. I am looking for the fastest solution possible.

Thanks again.
Peter

Peter Groszkowski wrote:

> Yes I am only planning to work offline. Want to be able to pipe the
> output images to stdout. I am looking for the fastest solution possible.

Following up on this, I was curious what exactly you meant by
this. A stream of byte values in ascii separated by spaces?
Or the actual binary bytes? If the latter, it wouldn't appear
to be difficult to write a C extension to return the image
as a string, but I'm figuring there is more to it than that
since the representations for the image structure can change
from one backend to another.

Perry

Perry Greenfield wrote:

That was a response to:

> If you want to work off line (ie, draw
> to image only and not display to screen) and are on a linux box, you
> can do this with GTK and Xvfb

Perhaps it was a misinterpretation on my part. All I meant was that I do not need to look at the images through any of the standard tools (i.e., via show()). For my purposes I need to write the image to stdout (not to disk) in its binary form -- which is what I do now after modifying some stuff in backend_gd.py.
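For reference, the stdout step itself is small. A hedged sketch (the thread's actual change lived in a modified backend_gd.py; this is just the generic pattern): write the raw image bytes to the stream, taking care to reach the binary layer so newline translation cannot corrupt the data.

```python
def write_image(stream, image_bytes):
    """Write raw binary image data (e.g. a PNG) to a stream such as
    sys.stdout, for piping into another process.  Sketch only."""
    # On Python 3, text streams expose the underlying binary buffer;
    # writing there avoids newline translation mangling the image.
    out = getattr(stream, "buffer", stream)
    out.write(image_bytes)
    out.flush()
```

Usage would be `write_image(sys.stdout, png_bytes)`, with the consumer reading the image from the pipe.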
