Agg large-dataset optimization

Hi,

I've been plotting data with millions of datapoints, and it takes a long
time for the plot to draw and to rescale with the mouse. I was playing
around with the draw_lines function in the agg backend and sped it up a
bit; maybe others will find it useful. I've attached my optimized version
of _backend_agg.cpp (just replace the old src/_backend_agg.cpp and
recompile).

The main optimization is to combine sequential lines that are parallel
into a single line to be drawn once. Try "plot(rand(100000),'-')" to see
the change. Resizing and zooming should be faster than before.

This comes at the expense of accuracy, since many lines are not drawn. I've
tried to balance accuracy against speed to my liking. The main thing to
tweak is a length threshold, which I have hardcoded to 0.25. This is the
square of the distance in pixels that a point can move in the direction
perpendicular to the set of parallel lines being combined and still allow
the lines to be combined (so a point can move 0.5 pixels to the side before
a new line is started).
If you set it to 0, you should get the same plots as without my changes.
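
To make the idea concrete, here is a rough Python sketch of the greedy
combining as described above (the function and variable names are mine, for
illustration only; the real version works point by point, in C++, inside
draw_lines):

def combine_lines(x, y, tol2=0.25):
    # Greedy sketch: keep extending the current run of segments while
    # each new endpoint stays within sqrt(tol2) pixels, measured
    # perpendicular to the run's direction.  With tol2 = 0, only
    # exactly collinear segments merge, so the plot looks unchanged.
    out = [(x[0], y[0])]
    sx, sy = x[0], y[0]          # start of the current run
    dx, dy = 0.0, 0.0            # vector from run start to run end
    for i in range(1, len(x)):
        px, py = x[i] - sx, y[i] - sy
        d2 = dx * dx + dy * dy
        if d2 == 0.0:
            dx, dy = px, py      # first segment of a run sets its direction
            continue
        cross = px * dy - py * dx
        if cross * cross / d2 <= tol2 and px * dx + py * dy > 0.0:
            dx, dy = px, py      # close enough to parallel: extend the run
        else:
            sx, sy = sx + dx, sy + dy
            out.append((sx, sy)) # flush the run as a single segment
            dx, dy = x[i] - sx, y[i] - sy
    out.append((sx + dx, sy + dy))
    return out

Setting tol2 to something large exaggerates the dropped detail, which is a
handy way to see what kind of artifacts the approximation can introduce.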

I also made it skip any lines that are outside of the drawing area.

Hopefully it's not buggy!

Allan

_backend_agg.cpp (72.3 KB)

John,

Have you considered Allan's version of _backend_agg.cpp (message date was 06/27)? I just tried it, and so far, I like it. I have not found any problems with the backend_driver.py test, and the improvement in speed that Allan notes with the "plot(rand(100000))" example is dramatic--the difference between unusable for panning and zooming with the original backend, and quite responsive and very usable with Allan's version.

I have not tried to understand the algorithm or code.

Allan,

Can you suggest a simple test that graphically illustrates the "loss of accuracy", perhaps a worst case, so we can see whether this is an actual problem?

I am also wondering whether your optimization technique, or some variation of it, could and should be applied in the generation of vector-based printer files: ps, pdf, svg. They can get excessively large and slow with very large datasets, and the ability to trim them with minimal loss of genuine information conveyed to the ultimate viewer of the plot would be very nice.

Eric


Hi,

First, I noticed a bug in the version I sent before. I've attached a new
version in patch form, to be applied from the base directory of a normal
0.90.1 installation with 'patch -p1 <patch'.

One artifact I can clearly see is partly due to skipping lines <
1 pixel in length. The original agg renderer did this too, so it is not
too different from before. My patch makes it worse, though. You can see it
with plot(rand(10000)) again, but zoom out on only the y axis until the
data is almost a straight line from right to left. You should see little
clumps appear as it gets close to a straight line.

Besides this I have not noticed anything. There may be problem cases I
have not thought of, of course. To see the kind of problem that might
occur, you can try changing the distance cutoff from 0.25 to something
huge like 25.0. There you will definitely notice errors.

About doing this for other image formats: I think it is OK for pixel-based
images, but probably not for SVG or vector-based images. I can imagine
someone writing to an SVG so that they can use another program to make a
pretty rendering of it, but if lines are missing this second program does
not have the full data to work with.

Allan

patch (11.1 KB)


Thanks Allan, I just applied this to svn. Some time ago I made an
optimization to Agg's draw_markers to use cached rasters of the
markers, and it made a dramatic difference in speed. So it is nice
to complement that with this draw_lines optimization.

I was curious to see what approach you took to cull the points outside
the window -- just looking at the current point and the last point, and
dropping the segment if both are outside the x and y interval of the
viewport. As you mention in the comment, we could do a line
segment/rectangle intersection test to cull more, but it is not clear
that would be faster, because of the tradeoff between the cost of the
test and the cost of drawing segments we don't need.

I wonder whether you might see some additional speedups if you cached
some of the interval tests. E.g., something like

bool thisoutsidex = (thisx < 0 || thisx > width);
bool thisoutsidey = (thisy < 0 || thisy > height);

if (thisoutsidex && thisoutsidey && lastoutsidex && lastoutsidey) {
  // other stuff as necessary...
  lastoutsidex = thisoutsidex;
  lastoutsidey = thisoutsidey;
  continue;
}
lastoutsidex = thisoutsidex;
lastoutsidey = thisoutsidey;

I suspect the comparisons are pretty cheap so I don't know how much
this will save but it might be worth looking into.

···

On 7/8/07, ahalda@...534... <ahalda@...534...> wrote:

First, I noticed a bug in the version I sent before. I've attached a new
version in patch form, to be applied from the base directory of a normal
0.90.1 installation with 'patch -p1 <patch'.

There is one case where your culling algorithm will fail, I think:

                      view window        x2
                  -----------------------------
                  |                           |
                  |                           |
                  |___________________________|
x1

The segment connecting x1 and x2 will be dropped, but the user should
see a line here, right? If we could do a rectangle/line intersection
test w/o giving up too much in terms of performance, that would be
better I think. That or give users the ability to turn it off. Since
people use mpl to view their scientific data, we should be careful to
make sure they can see it....

One artifact I can clearly see is partly due to skipping lines <
1 pixel in length. The original agg renderer did this too, so it is not
too different from before. My patch makes it worse, though. You can see it
with plot(rand(10000)) again, but zoom out on only the y axis until the
data is almost a straight line from right to left. You should see little
clumps appear as it gets close to a straight line.

Besides this I have not noticed anything. There may be problem cases I
have not thought of, of course. To see the kind of problem that might
occur, you can try changing the distance cutoff from 0.25 to something
huge like 25.0. There you will definitely notice errors.

Do you think it is worth making this a line property so users can
tweak it? There are a couple of ways to get the line property
information into the backend -- the typical way is
to make a corresponding graphics context property and then set it in
the line draw method.
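
For concreteness, a minimal sketch of what that route might look like.
Note that the simplify-threshold names here are hypothetical, made up for
illustration -- they are not existing matplotlib API:

# Hypothetical sketch only -- these method names do not exist in mpl.
class GraphicsContextBase:
    def set_simplify_threshold(self, tol2):
        # squared perpendicular deviation (pixels^2) allowed when
        # combining nearly parallel segments; 0 disables combining
        self._simplify_threshold = tol2

    def get_simplify_threshold(self):
        return getattr(self, '_simplify_threshold', 0.25)

# Line2D.draw() would then copy the line property onto the gc before
# calling the backend's draw_lines(), where the C++ code reads it.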

About doing this for other image formats: I think it is OK for pixel-based
images, but probably not for SVG or vector-based images. I can imagine
someone writing to an SVG so that they can use another program to make a
pretty rendering of it, but if lines are missing this second program does
not have the full data to work with.

I do think it would be nice to support culling points outside the
viewport across backends. A recurring criticism is that our postscript
files are quite large when a plot has lots of points but the zoom is set
to just a subset of them, because we use graphics clipping to hide the
rest while the data remain in the file. As you note, this can be a good
thing for users who want to do post-hoc editing, and culling is tricky
because, for example, a point whose center is outside the viewport could
have an edge vertex inside the viewport. So we would need polygon
intersections to do it right (and even that would not completely handle
polygons with thick edge strokes), but having the option to cull points
whose centers fall outside the viewport, or line segments whose vertices
fall outside the x and y intervals, would be a big win in the most common
use cases.

Since we already support masks in the line class, probably the best
way to do this would be some special-purpose (possibly extension)
code that efficiently sets the mask. Then we would automatically get
support across backends.

JDH

I wonder whether you might see some additional speedups if you cached
some of the interval tests. E.g., something like
....

You could do this, but I don't think you would even notice the difference.
Really, my impression is that the only things that take significant time
are drawing lines, and maybe doing excessive amounts of
multiplication/division. It can't hurt to try, though, if there is time.

The full rectangle test would give a boost in some cases, but I think
those cases are going to be extremely rare, so I didn't bother. All
x-vs-time type data works fine under the simpler test, for example. I'll
give it a try sometime later this week, though, just to see.

···

On Mon, July 9, 2007 10:57 am, John Hunter wrote:

There is one case where your culling algorithm will fail, I think:

                      view window        x2
                  -----------------------------
                  |                           |
                  |                           |
                  |___________________________|
x1

It doesn't fail, as you can see by trying to plot something like that.
Both points need to be on the 'undrawn' side of the same edge of the box.
In your case, the points are on either side of the top edge, the left
edge, and the bottom edge, and both are on the 'drawn' side of the right
edge. They are not both on the undrawn side of any single edge, so the
segment is drawn.
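
In Python terms, the test amounts to something like this (a paraphrase
for illustration; the actual test lives in the C++ patch):

def both_beyond_same_edge(x0, y0, x1, y1, width, height):
    # A segment is culled only when BOTH endpoints lie on the undrawn
    # side of the SAME edge of the view box; a segment that crosses
    # the box (like the x1-to-x2 example above) is never culled.
    return ((x0 < 0 and x1 < 0) or              # both beyond the left edge
            (x0 > width and x1 > width) or      # both beyond the right edge
            (y0 < 0 and y1 < 0) or              # both beyond the bottom edge
            (y0 > height and y1 > height))      # both beyond the top edge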

Do you think it is worth making this a line property so users can
tweak it? There are a couple of ways to get the line property information
into the backend -- the typical way is to make a corresponding graphics
context property and then set it in the line draw method.

I thought of doing this, but I didn't get to it. I originally wrote it
just for myself. If you know how, it seems like a good idea.

I do think it would be nice to support culling points outside the
viewport across backends. A recurring criticism is that our postscript
files are quite large when a plot has lots of points but the zoom is set
to just a subset of them, because we use graphics clipping to hide the
rest while the data remain in the file. As you note, this can be a good
thing for users who want to do post-hoc editing, and culling is tricky
because, for example, a point whose center is outside the viewport could
have an edge vertex inside the viewport. So we would need polygon
intersections to do it right (and even that would not completely handle
polygons with thick edge strokes), but having the option to cull points
whose centers fall outside the viewport, or line segments whose vertices
fall outside the x and y intervals, would be a big win in the most common
use cases.

Dropping parallel lines in the visible area still seems like a bad idea
for SVG, but I can see that culling points outside it could be OK. There
are two things you want to cull: lines between two points, and polygons
drawn around single points.

I don't know exactly how the polygon parts work in matplotlib, but it
seems reasonable that you could find an upper bound on the radius of the
polygon drawn around a point. Add the maximum polygon radius and the
stroke width to get a 'bound' value, and then instead of
(x < 0 || x > width ...) do (x < -bound || x > width + bound ...) to
cull. Just make the bound big enough that you are sure nothing could be
in the viewing area, even if you draw a few more points than you need to.
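
As a sketch of that padded test (the helper name is mine, not actual
matplotlib code):

def marker_outside(x, y, width, height, max_radius, stroke_width):
    # Pad the cull box by the farthest distance any drawn part of the
    # marker polygon can extend from its center, so a marker whose
    # center is just outside the viewport still gets drawn when part
    # of it could poke into view.
    bound = max_radius + stroke_width
    return (x < -bound or x > width + bound or
            y < -bound or y > height + bound)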

For lines between two points, if you use just the simple cull test I am
using, you could do the same thing as for polygons, with a bound that
depends only on the stroke width. Actually, I should really do that in
the agg code. If you use a full box test, you would again need a more
complicated algorithm with extra intersection tests; it could get messy.

Also, I am not sure how and when transforms are applied to the data, but I
can imagine the transform and culling interfering with each other. In the
agg code, the data is transformed to its final pixel units before I do
culling.

Allan

John (and others),

I've made a quick change to the svg backend to cull the data, see attached
patch. I haven't tested it extensively, but it seems OK. It culls both
lines and polygons out of svg output.

About making it general across backends: I've looked a bit at how things
get rendered, and here is what I understand: We can only do the culling in
pixel coordinates, since the height/width of the figure is in pixel units.

The transform from unit to pixel coordinates happens either in
Line2D.draw() (as for SVG) or in the backend's draw function (as for
agg). Therefore we could only generalize the case where the transform is
in Line2D, and even then, it looks simpler to me to do it in the backend,
since that is the only place we conveniently have all the pixel
information we need. Maybe I have misunderstood, though.

Also, I noticed I forgot to comment out a debugging cout from my last agg
patch, near the end of draw_lines. Oops, sorry -- you should fix that in
svn.

Allan

By the way, here's what I used to test the svg patch, using the -dSVG switch:

import pylab
from scipy import *

# one connected line, plus a set of unconnected point markers
pylab.plot(sin(arange(1000)*pi/100.0))
pylab.plot(rand(1000)*2, '.')

# zoom to a subrange so many points fall outside the viewport
# and should be culled from the SVG output
pylab.axis([300, 800, -0.5, 1])
pylab.savefig('test')

svgpatch (2.67 KB)

ahalda@...534... wrote:

John (and others),

I've made a quick change to the svg backend to cull the data, see attached
patch. I haven't tested it extensively, but it seems OK. It culls both
lines and polygons out of svg output.

Allan,

Looks good, and the basic idea should be generally applicable. It can be sped up (probably), and the code made more compact, by using numpy arrays; but this can wait.

About making it general across backends: I've looked a bit at how things
get rendered, and here is what I understand: We can only do the culling in
pixel coordinates, since the height/width of the figure is in pixel units.

The transform from unit to pixel coordinates happens either in
Line2D.draw() (as for SVG) or in the backend's draw function (as for
agg). Therefore we could only generalize the case where the transform is
in Line2D, and even then, it looks simpler to me to do it in the backend,
since that is the only place we conveniently have all the pixel
information we need. Maybe I have misunderstood, though.

The relevant transformation is between data coordinates and pixel coordinates, so provided the transformation is from one rectilinear system to another, it should be possible to do the culling in either system. One way to do it would be by inserting nans in the x and y arrays (I am thinking in terms of numpy array-based algorithms), since nan is already supported in backends as a do-not-draw marker.
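
Here is a minimal numpy sketch of that nan idea, assuming a rectilinear
transform so the test can be applied to the data arrays directly (the
helper name is mine):

import numpy as np

def cull_to_nans(x, y, xmin, xmax, ymin, ymax):
    # Replace out-of-view points with nan, which backends already
    # treat as do-not-draw.  A point adjacent to an in-view point is
    # kept, so segments crossing the view boundary keep both
    # endpoints.  (A segment whose endpoints are both outside but
    # which crosses the box is still lost here -- this is stricter
    # than the per-edge segment test discussed earlier.)
    x = np.array(x, dtype=float)
    y = np.array(y, dtype=float)
    inside = (x >= xmin) & (x <= xmax) & (y >= ymin) & (y <= ymax)
    keep = inside.copy()
    keep[:-1] |= inside[1:]   # keep a point whose successor is inside
    keep[1:] |= inside[:-1]   # keep a point whose predecessor is inside
    x[~keep] = np.nan
    y[~keep] = np.nan
    return x, y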

This brings up the larger question of why we have "newstyle" (do the transformation in the backend) and "oldstyle" (transform before calling the backend) at all, and whether we still need both. My guess is that we don't: by consistently using numpy arrays for x and y, and by using the transformation function that works with these arrays, we should be able to do the transform equally efficiently in the backend or earlier. Making the backends consistent would take a bit of work, and perhaps some culling of backends that are neither widely used nor actively maintained.

Eric