large data sets and performance

John_Hunter · February 11, 2004, 9:02pm

    > Will mostly be plotting time vs value(time) but in certain
    > cases will need plots of other data, and therefore have to
    > look at the worst case scenario. Not exactly sure what you
    > mean by "continuous" since all are discrete data
    > points. The data may not be smooth (could have misbehaving
    > sensors giving garbage) and jump all over the place.

Bad terminology: for x I meant sorted (monotonic) and for y the ideal
case is smooth and not varying too rapidly. Try the lod feature and
see if it works for you.

Perhaps it would be better to extend the LOD functionality, so that
you control the extent of subsampling. Eg, suppose you have 100,000 x
data points but only 1000 pixels of display. Then for every 100 data
points you could set the decimation factor, perhaps as a percentage.
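
A minimal sketch of what a percentage-based decimation could look like, assuming present-day Python 3 and NumPy (the successor to Numeric). The function name `decimate` and the `keep_percent` parameter are made up for illustration and are not part of matplotlib's lod feature:

```python
import numpy as np


def decimate(x, y, keep_percent):
    """Keep roughly keep_percent of the points by taking every step-th sample.

    Hypothetical helper: keep_percent=1.0 keeps 1 point out of every 100.
    """
    if not 0 < keep_percent <= 100:
        raise ValueError("keep_percent must be in (0, 100]")
    x = np.asarray(x)
    y = np.asarray(y)
    # Convert the percentage into a stride, never smaller than 1.
    step = max(1, int(round(100.0 / keep_percent)))
    return x[::step], y[::step]


# 100,000 points reduced to one point per 100, i.e. about one per pixel
# on a 1000 pixel wide display.
x = np.arange(100000)
y = np.sin(x / 5000.0)
xs, ys = decimate(x, y, keep_percent=1.0)
```

Plain striding like this is cheap, but it can skip over narrow spikes, so it suits smooth data better than noisy sensor data.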

More generally, we could implement a LOD base class, and users could supply
their own derived instances to subsample the data however they see fit,
eg, min and max over the 100 points, and so on. By reshaping the
points into a 1000x100 matrix, this could be done in Numeric
efficiently.
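
The reshape trick can be sketched as follows, assuming present-day NumPy in place of Numeric. The function name `minmax_envelope` is hypothetical, and this is only one way to do the per-bin reduction, not an existing matplotlib API:

```python
import numpy as np


def minmax_envelope(y, n_bins):
    """Return per-bin minima and maxima of y, using n_bins equal-sized bins.

    Hypothetical helper: any trailing points that do not fill a whole bin
    are dropped to keep the reshape valid.
    """
    y = np.asarray(y, dtype=float)
    bin_size = len(y) // n_bins
    if bin_size == 0:
        raise ValueError("need at least n_bins data points")
    usable = n_bins * bin_size
    # One row per bin (e.g. 1000 rows of 100 points each).
    blocks = y[:usable].reshape(n_bins, bin_size)
    return blocks.min(axis=1), blocks.max(axis=1)


# 100,000 points with a single bad sensor reading at index 12345.
y = np.arange(100000, dtype=float)
y[12345] = 1e6
lo, hi = minmax_envelope(y, n_bins=1000)
```

Unlike plain striding, the min/max pair keeps the outlier visible: the spike at index 12345 shows up as the maximum of bin 123.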

    >> Secondly, the standard gdmodule will iterate over the x, y
    >> values in a python loop in gd.py. This is slow for lines with
    >> lots of points. I have a patched gdmodule that I can send you
    >> (provide platform info) that moves this step to the extension
    >> module. Potentially a very big win.

    > Yes, that would be great! System info:

Here is the link

http://nitace.bsd.uchicago.edu:8080/files/share/gdmodule-0.52b.tar.gz

You must also upgrade gd to 2.0.22 (alas 2.0.21 is obsolete!) since I
needed the latest version to get this sucker ported to win32.

    >> Another possibility: change backends. The GTK backend is
    >> significantly faster than GD. If you want to work off line
    >> (ie, draw to image only and not display to screen ) and are on
    >> a linux box, you can do this with GTK and Xvfb. I'll give you
    >> instructions if interested. In the next release of matplotlib,
    >> there will be a libart paint backend (cross platform) that may
    >> be faster than GD. I'm working on an Agg backend that should
    >> be considerably faster than all the other backends since it
    >> does everything in extension code -- we'll see

    > Yes I am only planning to work offline. Want to be able to
    > pipe the output images to stdout. I am looking for the
    > fastest solution possible.

I don't know how to write a GTK pixbuf to stdout. I inquired on the
pygtk mailing list, so perhaps we'll learn something soon. To use GTK
in Xvfb, make sure you have Xvfb (X virtual frame buffer) installed
(/usr/X11R6/bin/Xvfb). There is probably an RPM, but I don't
remember.

You then need to start it with something like

XVFB_HOME=/usr/X11R6

$XVFB_HOME/bin/Xvfb :1 -co $XVFB_HOME/lib/X11/rgb -fp $XVFB_HOME/lib/X11/fonts/misc/,$XVFB_HOME/lib/X11/fonts/Speedo/,$XVFB_HOME/lib/X11/fonts/Type1/,$XVFB_HOME/lib/X11/fonts/75dpi/,$XVFB_HOME/lib/X11/fonts/100dpi/ &

And connect your display to it

setenv DISPLAY :1

Now you can use gtk as follows

from matplotlib.matlab import *
from matplotlib.backends.backend_gtk import show_xvfb
def f(t):
    s1 = cos(2*pi*t)
    e1 = exp(-t)
    return multiply(s1,e1)

t1 = arange(0.0, 5.0, 0.1)
t2 = arange(0.0, 5.0, 0.02)
t3 = arange(0.0, 2.0, 0.01)

subplot(211)
plot(t1, f(t1), 'bo', t2, f(t2), 'k')
title('A tale of 2 subplots')
ylabel('Damped oscillation')

subplot(212)
plot(t3, cos(2*pi*t3), 'r--')
xlabel('time (s)')
ylabel('Undamped')

savefig('subplot_demo')
show_xvfb() # not show!

Peter_Groszkowski · February 12, 2004, 12:45am

Thanks for the prompt answers.

   > Bad terminology: for x I meant sorted (monotonic) and for y the ideal
   > case is smooth and not varying too rapidly. Try the lod feature and
   > see if it works for you.

Although the data I'm playing with right now is monotonic (in x), I cannot assume that this will always be the case, and need an efficient solution for all situations.
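
One way to handle both situations would be to test the x data first and only take a subsampling shortcut when it is sorted. A minimal sketch, assuming present-day NumPy; the helper name `is_monotonic` is made up for illustration:

```python
import numpy as np


def is_monotonic(x):
    """Return True if x is sorted in non-decreasing order.

    Hypothetical helper; empty and single-element inputs count as sorted.
    """
    x = np.asarray(x)
    return bool(np.all(np.diff(x) >= 0))


sorted_times = [0.0, 0.5, 0.5, 1.0, 1.5]
scrambled = [0.0, 1.0, 0.5, 1.5]
print(is_monotonic(sorted_times), is_monotonic(scrambled))
```

The check is a single pass over the data, so it costs very little next to the drawing itself.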

The 'lod' option, set as in
l = plot(arange(10000), arange(20000,30000)) # dummy data: 10,000 pairs
set(l, 'lod', True)
does not work for me. It still renders at roughly 1000 points/second.

   >> Secondly, the standard gdmodule will iterate over the x, y
   >> values in a python loop in gd.py. This is slow for lines with
   >> lots of points. I have a patched gdmodule that I can send you
   >> (provide platform info) that moves this step to the extension
   >> module. Potentially a very big win.

   > Yes, that would be great! System info:

   > Here is the link
   >
   > http://nitace.bsd.uchicago.edu:8080/files/share/gdmodule-0.52b.tar.gz
   >
   > You must also upgrade gd to 2.0.22 (alas 2.0.21 is obsolete!) since I
   > needed the latest version to get this sucker ported to win32.

I installed gd 2.0.22 and gdmodule-0.52b (from the link you provided), but there is no change in the times. I'm not sure why; I would expect to notice at least a small difference.

   > I don't know how to write a GTK pixbuf to stdout. I inquired on the
   > pygtk mailing list, so perhaps we'll learn something soon. To use GTK
   > in Xvfb, make sure you have Xvfb (X virtual frame buffer) installed
   > (/usr/X11R6/bin/Xvfb). There is probably an RPM, but I don't
   > remember.

[...]
Installed Xvfb, and ran the little script you included. It complained about:
File "/usr/lib/python2.2/site-packages/matplotlib/backends/backend_gtk.py", line 528, in _quit_after_print_xvfb
if len(manager.drawingArea._printQued): break
AttributeError: FigureManagerGTK instance has no attribute 'drawingArea'

I didn't inquire further because in my case it is crucial to have stdout output. I have to be able to pipe these plots to CGI scripts.
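
For reference, the stdout side of this is independent of the plotting backend. A rough sketch of sending already-rendered image bytes as a CGI response, assuming present-day Python 3; the function name `write_cgi_image` is hypothetical, and how the bytes get produced by a given backend is left open:

```python
import io


def write_cgi_image(image_bytes, stream, content_type="image/png"):
    """Write a minimal CGI response (headers, blank line, body) to a binary stream.

    Hypothetical helper: in a real CGI script the stream would be
    sys.stdout.buffer, so that the image bytes are not mangled as text.
    """
    header = "Content-Type: %s\r\nContent-Length: %d\r\n\r\n" % (
        content_type,
        len(image_bytes),
    )
    stream.write(header.encode("ascii"))
    stream.write(image_bytes)
    stream.flush()


# Demonstration with an in-memory stream and placeholder bytes
# (just the 8-byte PNG signature, not a complete image).
placeholder = b"\x89PNG\r\n\x1a\n"
buf = io.BytesIO()
write_cgi_image(placeholder, buf)
response = buf.getvalue()
```

Anything that can render into a bytes object or a file-like object could feed this, with no temporary file needed.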

If you have any other ideas, please let me know. Can anyone else tell me what kind of performance they're getting doing these 10k, 50k, 100k plots?

Best,
Peter