Efficiently processing large datasets

Hello,

I am using matplotlib in combination with QT4 as part of an audio application I am building. Part of this involves plotting an oscilloscope-style display (time vs intensity) for sets of raw audio data, which I have through necessity converted to arrays of floating point data.

The problem I have is that Pylab/QT slow down an unacceptable amount once the amount of audio is more than a few seconds long. This is not too surprising - typically a minute of stereo audio data will have 44100 * 60 * 2 = 5292000 points. I need to deal with, at minimum, 15 minute clips efficiently.

I have tried downsampling the audio dramatically and although this helps a little, it is not enough for really large data sets. Downsampling also reduces the accuracy of any editing of the audio, so it’s not the ideal solution.

I’ve read about ‘data clipping’ functionality in matplotlib, but can’t seem to get it working - has it been removed?

If so, does anybody have any ideas as to the sort of approach I could take to solve this?

Thanks
Rob

rfwatson wrote:

I have tried downsampling the audio dramatically and although this helps a little, it is not enough for really large data sets. Downsampling also reduces the accuracy of any editing of the audio, so it's not the ideal solution.

Numpy slicing will let you create a subsampled (without interpolation) view on the data that you could send to matplotlib, and still maintain the original data for editing/listening purposes.

For example: audio[::64] will create a view that keeps every 64th data point and skips the 63 in between.

I don't know whether skipping will produce adequate results for you compared with proper downsampling, but it's worth a try.
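A minimal sketch of the slicing idea, assuming the audio is a one-dimensional NumPy array. The test tone, the `step` value and the variable names are made up for illustration:

```python
import numpy as np

# Stand-in for one minute of mono audio at 44.1 kHz (synthetic test tone).
sample_rate = 44100
t = np.arange(sample_rate * 60) / sample_rate
audio = np.sin(2.0 * np.pi * 440.0 * t).astype(np.float32)

step = 64
preview = audio[::step]  # samples 0, 64, 128, ...

# The slice is a view, not a copy: it shares memory with the original,
# so the full-resolution data stays available for editing and playback.
is_view = np.shares_memory(audio, preview)
```

Here `preview` is what would be handed to the plotting call, while `audio` is left untouched. Plain striding does no filtering, so short peaks that fall between the kept samples will not show up in the display.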

I remember that SoundForge (at least a few years ago) used to downsample the data for display purposes and cache that to a file alongside the high-resolution audio. That suggests to me that any sort of downsampling on-the-fly may just be inherently too slow.

Remember, also, that matplotlib is drawing *actual* lines for its plots, which implies "stroking" (generating a polygon from moving an imaginary pen along the ideal line so that it can be filled). I suspect many audio editors take a much simpler approach, by drawing vertical 1-pixel-wide strokes whose height is determined by the average of the data within that pixel. That would be much more efficient for high-sample-rate data than what matplotlib currently does. This is something to think about including in matplotlib for the future, but not something it currently does.
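The per-pixel reduction described above could be sketched roughly as follows. This is only an illustration, not something matplotlib provides: `pixel_column_heights` is a hypothetical helper, and taking the mean of the absolute values (so that positive and negative samples do not cancel out) is an assumption on my part.

```python
import numpy as np

def pixel_column_heights(samples, n_columns):
    """Mean absolute amplitude of the samples that fall in each pixel column."""
    samples = np.asarray(samples, dtype=np.float64)
    per_column = len(samples) // n_columns
    if per_column == 0:
        raise ValueError("fewer samples than pixel columns")
    # Drop the few trailing samples that do not fill a whole column.
    trimmed = samples[: per_column * n_columns]
    return np.abs(trimmed).reshape(n_columns, per_column).mean(axis=1)

# Six samples reduced to three columns of two samples each.
heights = pixel_column_heights(np.array([1.0, -1.0, 0.5, -0.5, 0.0, 0.0]), 3)
```

With one value per pixel column, the amount of drawing depends on the width of the display rather than on the length of the clip.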

I've read about 'data clipping' functionality in matplotlib, but can't seem to get it working - has it been removed?

It should work with any of the Agg backends in 0.98.3 and additionally PDF, PS and SVG in SVN trunk. It should increase the speed, but on the other hand, large data is large data and the system still needs to iterate through all of it to determine which points to ignore.

Hope that helps in some way,
Mike

--
Michael Droettboom
Science Software Branch
Operations and Engineering Division
Space Telescope Science Institute
Operated by AURA for NASA

Do you typically plot a large number of points, only a subset of which
are in your viewport? If so, the "clipped line" demo may be useful to
you:

http://matplotlib.sourceforge.net/examples/pylab_examples/clippedline.py
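In the same spirit as that demo (this is not a copy of it), a rough sketch of picking out only the samples inside the current x-limits before plotting. It assumes the time array is sorted in ascending order; `visible_slice` is a hypothetical helper name:

```python
import numpy as np

def visible_slice(t, y, xmin, xmax):
    """Return the part of (t, y) with xmin <= t <= xmax; t must be ascending."""
    start = np.searchsorted(t, xmin, side="left")
    stop = np.searchsorted(t, xmax, side="right")
    # Basic slices are views, so nothing is copied here.
    return t[start:stop], y[start:stop]

t = np.arange(10.0)
y = 2.0 * t
t_vis, y_vis = visible_slice(t, y, 2.5, 6.0)
```

Because the lookup is a binary search, finding the visible range stays cheap even for long clips. The slicing would need to be redone whenever the view limits change.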

If you are trying to plot a large number of dense points all in the
same viewport, then you will need to decimate the data before
plotting.
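One possible way to decimate, sketched under my own assumptions: keep the minimum and maximum of each block of samples, so that peaks survive the reduction. This is a swapped-in technique, not something prescribed in this thread, and `minmax_decimate` is a hypothetical helper name.

```python
import numpy as np

def minmax_decimate(samples, n_bins):
    """Reduce samples to per-bin (start index, minimum, maximum) arrays."""
    samples = np.asarray(samples)
    per_bin = len(samples) // n_bins
    if per_bin == 0:
        raise ValueError("fewer samples than bins")
    # Trailing samples that do not fill a whole bin are dropped.
    blocks = samples[: per_bin * n_bins].reshape(n_bins, per_bin)
    x = np.arange(n_bins) * per_bin  # index of the first sample in each bin
    return x, blocks.min(axis=1), blocks.max(axis=1)

x, lo, hi = minmax_decimate(np.array([0, 3, -2, 1, 5, 4, -1, 2]), 2)
```

The three arrays could then be drawn as a filled band, for example with `fill_between(x, lo, hi)` on a matplotlib axes, using one bin per horizontal pixel or so.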

JDH

On Tue, Sep 23, 2008 at 5:56 AM, rfwatson <rfwatson@...287...> wrote:

Hello,

I am using matplotlib in combination with QT4 as part of an audio
application I am building. Part of this involves plotting an
oscilloscope-style display (time vs intensity) for sets of raw audio data,
which I have through necessity converted to arrays of floating point data.

The problem I have is that Pylab/QT slow down an unacceptable amount once
the amount of audio is more than a few seconds long. This is not too
surprising - typically a minute of stereo audio data will have 44100 * 60 *
2 = 5292000 points. I need to deal with, at minimum, 15 minute clips
efficiently.