Hello: We will be dealing with large (> 100,000 but in some
> instances as big as 500,000 points) data sets. They are to
> be plotted, and I would like to use matplotlib.
Are you working with plot/loglog/etc (line data) or
pcolor/hist/scatter/bar (patch data)?
I routinely plot data sets this large. 500,000 data points is a
typical 10 seconds of EEG, which is the application that led me to
write matplotlib. EEG is fairly special: the x axis time is
monotonically increasing and the y axis is smooth. This lets me take
advantage of level of detail subsampling.
If your xdata are sorted, ie like time, the following
l = plot(blah, blah)
set(l, 'lod', True)
could be a big win. LOD is "Level of Detail" and if true subsamples
the data according to the pixel width of the output, as you described.
Whether this is appropriate or not depends on the data set of course,
whether it is continuous, and so on. Can you describe your dataset in
more detail, because I would like to add whatever optimizations are
appropriate -- if others can pipe in here too that would help.
Secondly, the standard gdmodule will iterate over the x, y values in a
python loop in gd.py. This is slow for lines with lots of points. I
have a patched gdmodule that I can send you (provide platform info)
that moves this step to the extension module. Potentially a very big
win.
Another possibility: change backends. The GTK backend is
significantly faster than GD. If you want to work off line (ie, draw
to image only and not display to screen ) and are on a linux box, you
can do this with GTK and Xvfb. I'll give you instructions if
interested. In the next release of matplotlib, there will be a libart
paint backend (cross platform) that may be faster than GD. I'm
working on an Agg backend that should be considerably faster than all
the other backends since it does everything in extension code -- we'll
see :-).
JDH