large data sets and performance

John_Hunter · February 11, 2004, 10:28pm

    > What I was alluding to was that if a backend primitive was
    > added that allowed plotting a symbol (patch?) or point for
    > an array of points. The base implementation would just do
    > a python loop over the single point case so there is no
    > requirement for a backend to overload this call. But it
    > could do so if it wanted to loop over all points in C. How
    > flexible to make this is open to discussion (e.g., allowing
    > x, and y scaling factors, as arrays, for the symbol to be
    > plotted, and other attributes that may vary with point such
    > as color)

To make this work in the current design, you'll need more than a new
backend method.

Plot commands like scatter instantiate Artists (Circle) and add them
to the Axes as generic patch instances. On a call to draw, the Axes
instance iterates over all of its patch instances and forwards the
call on to the artists it contains. These, in turn, instantiate gc
instances which contain information like linewidth, facecolor,
edgecolor, alpha, etc. The patch instance also transforms its data
into display units and calls the relevant backend method. Eg, a
Circle instance would call

renderer.draw_arc(gc, x, y, width, ...)

This makes it relatively easy to write a backend since you only have
to worry about 1 coordinate system (display) and don't need to know
anything about the Artist objects (Circle, Line, Rectangle, Text, ...)

The point is that no existing entity knows that the patches in a
collection are all circles, and no one is keeping track of whether they
share a property or not. This buys you total flexibility to set
individual properties, but you pay for it in performance, since you
have to set every property for every object and call render methods
for each one, and so on.
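
A minimal sketch of the per-artist path described above, to make the
cost concrete. The class names, the simplified draw_arc signature and
the fixed scale factor are assumptions for illustration only, not the
actual matplotlib code:

```python
class GC:
    """Hypothetical stand-in for a graphics context (one per artist per draw)."""
    def __init__(self, linewidth, facecolor, edgecolor):
        self.linewidth = linewidth
        self.facecolor = facecolor
        self.edgecolor = edgecolor


class CountingRenderer:
    """Hypothetical backend that only records how often it is called."""
    def __init__(self):
        self.calls = 0

    def draw_arc(self, gc, x, y, width, height):
        self.calls += 1


class Circle:
    def __init__(self, x, y, r, facecolor="b"):
        self.x, self.y, self.r, self.facecolor = x, y, r, facecolor

    def draw(self, renderer, scale=72.0):
        # Every circle builds its own gc and transforms its own data,
        # even when all circles share the same properties.
        gc = GC(linewidth=1.0, facecolor=self.facecolor, edgecolor="k")
        x, y, d = self.x * scale, self.y * scale, 2 * self.r * scale
        renderer.draw_arc(gc, x, y, d, d)


def draw_patches(patches, renderer):
    # Roughly what the Axes does on draw: forward the call to each patch.
    for patch in patches:
        patch.draw(renderer)


renderer = CountingRenderer()
patches = [Circle(i, i, 0.5) for i in range(1000)]
draw_patches(patches, renderer)
print(renderer.calls)  # one gc, one transform, one backend call per circle
```

With 1000 circles that is 1000 gc constructions and 1000 backend
calls, all of them made from Python.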

My first response to this problem was to use a naive container class,
eg Circles, and an appropriate backend method, eg, draw_circles. In
this case, scatter would instantiate a Circles instance with a list of
circles. When Circles was called to render, it would need to create a
sequence of location data and a sequence of gcs

locs = [ (x0, y0, w0, h0), (x1, y1, w1, h1), ...]
gcs = [ circ0.get_gc(), circ1.get_gc(), ...]

and then call

renderer.draw_ellipses( locs, gcs).

This would provide some savings, but probably not dramatic ones. The
backends would need to know how to read the GCs. In backend_agg
extension code, I've implemented the code (in CVS) to read the python
GraphicsContextBase information using the python API.

  _gc_get_linecap
  _gc_get_joinstyle
  _gc_get_color # returns rgb

This is kind of backward, implementing an object in python and then
accessing it at the extension level code using the Python API, but it
does keep as much of the frontend in python as possible, which is
desirable. The point is that for your approach to work and to not
break encapsulation, the backends have to know about the GC.

The discussion above was focused on preserving all the individual
properties of the actors (eg every circle can have its own linewidth,
color, alpha, dash style). But this is rare. Usually, we just want to
vary one or two properties across a large collection, eg, color in
pcolor and size and color in scatter.

Much better is to implement a GraphicsContextCollection, where the
relevant properties can be either individual elements or
len(collection) sequences. If a property is an element, it's
homogeneous across the collection. If it's len(collection), iterate
over it. The CircleCollection, instead of storing individual Circle
instances as I wrote about above, stores just the location and size
data in arrays and a single GraphicsContextCollection.

def scatter(x, y, s, c):

  collection = CircleCollection(x, y, s)
  gc = GraphicsContextCollection()
  gc.set_linewidth(1.0) # a single line width
  gc.set_foreground(c) # a len(x) array of facecolors
  gc.set_edgecolor('k') # a single edgecolor

  collection.set_gc(gc)

  axes.add_collection(collection)
  return collection
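
To illustrate the "single element or len(collection) sequence" rule,
here is one way such a class might resolve a property for item i.
This is only a sketch: the constructor argument n, the get and
is_homogeneous helpers, and the choice to treat only lists and tuples
as sequences are all assumptions, not an existing API.

```python
class GraphicsContextCollection:
    """Hypothetical sketch: each property is one value or a length-n sequence."""

    def __init__(self, n):
        self.n = n
        self._props = {}

    def set_linewidth(self, lw):
        self._set("linewidth", lw)

    def set_foreground(self, c):
        self._set("foreground", c)

    def set_edgecolor(self, c):
        self._set("edgecolor", c)

    def _set(self, name, value):
        # A string such as 'k' is a single value, not a sequence of characters.
        is_seq = isinstance(value, (list, tuple))
        if is_seq and len(value) != self.n:
            raise ValueError(
                f"{name}: expected 1 value or {self.n}, got {len(value)}")
        self._props[name] = (is_seq, value)

    def is_homogeneous(self, name):
        # True when the property is shared by the whole collection.
        return not self._props[name][0]

    def get(self, name, i):
        # Per-item lookup: index into sequences, repeat single values.
        is_seq, value = self._props[name]
        return value[i] if is_seq else value


gc = GraphicsContextCollection(n=3)
gc.set_linewidth(1.0)                 # a single line width
gc.set_foreground(["r", "g", "b"])    # one facecolor per item
gc.set_edgecolor("k")                 # a single edgecolor
print([gc.get("foreground", i) for i in range(3)])
print(gc.get("edgecolor", 2), gc.is_homogeneous("linewidth"))
```

A backend can check is_homogeneous once and set a shared property a
single time instead of per item. Note that this sketch cannot tell a
single RGB tuple from a sequence of three values; a real design would
need an explicit rule for that case.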

And this will be blazingly fast compared to the solution above, since,
for example, you transform the x, y, and s coordinates as numeric
arrays rather than individually. And there is almost no function call
overhead. And as you say, if the backend doesn't implement a
draw_circles method, the CircleCollection can just fall back on
calling the existing methods in a loop.
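
The fallback could be sketched like this. The renderer classes, the
draw_circles signature and the use of plain lists are assumptions made
for the example; a real implementation would hold the data in numeric
arrays and transform them in one step before dispatching.

```python
class LoopOnlyRenderer:
    """Hypothetical backend with only the single-shape primitive."""
    def __init__(self):
        self.calls = []

    def draw_arc(self, gc, x, y, width, height):
        self.calls.append("draw_arc")


class BatchRenderer(LoopOnlyRenderer):
    """Hypothetical backend that also accepts a whole collection at once."""
    def draw_circles(self, gc, xs, ys, sizes):
        self.calls.append("draw_circles")


class CircleCollection:
    def __init__(self, x, y, s):
        self.x, self.y, self.s = list(x), list(y), list(s)
        self.gc = None

    def set_gc(self, gc):
        self.gc = gc

    def draw(self, renderer):
        batch = getattr(renderer, "draw_circles", None)
        if batch is not None:
            # One call for the whole collection; the backend loops itself.
            batch(self.gc, self.x, self.y, self.s)
        else:
            # Fallback: reuse the existing per-shape method in a Python loop.
            for x, y, s in zip(self.x, self.y, self.s):
                renderer.draw_arc(self.gc, x, y, s, s)


collection = CircleCollection([0, 1, 2], [0, 1, 4], [5, 5, 5])
slow, fast = LoopOnlyRenderer(), BatchRenderer()
collection.draw(slow)
collection.draw(fast)
print(len(slow.calls), len(fast.calls))  # 3 backend calls versus 1
```

Existing backends keep working unchanged, and a backend opts in to
the fast path simply by defining the extra method.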

Thoughts?

JDH

_Perry_Greenfield1 · February 11, 2004, 11:02pm

John Hunter writes:

    > What I was alluding to was that if a backend primitive was
    > added that allowed plotting a symbol (patch?) or point for
    > an array of points. The base implementation would just do
    > a python loop over the single point case so there is no
    > requirement for a backend to overload this call. But it
    > could do so if it wanted to loop over all points in C. How
    > flexible to make this is open to discussion (e.g., allowing
    > x, and y scaling factors, as arrays, for the symbol to be
    > plotted, and other attributes that may vary with point such
    > as color)

    > To make this work in the current design, you'll need more than a new
    > backend method.

[much good explanation of why...]

OK, I understand.

    > My first response to this problem was to use a naive container class,
    > eg Circles, and an appropriate backend method, eg, draw_circles. In
    > this case, scatter would instantiate a Circles instance with a list of
    > circles. When Circles was called to render, it would need to create a
    > sequence of location data and a sequence of gcs

[...]
I'd agree that this doesn't seem worth the trouble.

    > Much better is to implement a GraphicsContextCollection, where the
    > relevant properties can be either individual elements or
    > len(collection) sequences. If a property is an element, it's
    > homogeneous across the collection. If it's len(collection), iterate
    > over it. The CircleCollection, instead of storing individual Circle
    > instances as I wrote about above, stores just the location and size
    > data in arrays and a single GraphicsContextCollection.
    >
    > def scatter(x, y, s, c):
    >
    >   collection = CircleCollection(x, y, s)
    >   gc = GraphicsContextCollection()
    >   gc.set_linewidth(1.0) # a single line width
    >   gc.set_foreground(c) # a len(x) array of facecolors
    >   gc.set_edgecolor('k') # a single edgecolor
    >
    >   collection.set_gc(gc)
    >
    >   axes.add_collection(collection)
    >   return collection
    >
    > And this will be blazingly fast compared to the solution above, since,
    > for example, you transform the x, y, and s coordinates as numeric
    > arrays rather than individually. And there is almost no function call
    > overhead. And as you say, if the backend doesn't implement a
    > draw_circles method, the CircleCollection can just fall back on
    > calling the existing methods in a loop.
    >
    > Thoughts?

I like the sound of this approach even more. But I wonder if
it can be made somewhat more generic. This approach (if I read
it correctly) seems to need a backend function for each shape
(perhaps only for circle?). What I was thinking was whether there was
a way to pass it the vectors or path for a symbol (for very often,
many points will share the same shape, if not all the same x,y
scale). Here the circle is a bit of a special case compared to
crosses, error bars, triangles and other symbols that are usually
made up of a few straight lines. In these cases you could pass
the backend the context collection along with the shape
(and perhaps some scaling info if that isn't part of the context).
That way only one backend routine is needed.

I suppose circle and other curved items could be handled with
a Bezier-type call.
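
As a rough sketch of the single-routine idea for straight-line
symbols: the method name draw_markers, its arguments and the draw_lines
primitive are all assumptions made up for this example, and curved
shapes are left out.

```python
# A marker shape as a polyline in marker-local units (hypothetical data).
TRIANGLE = [(0.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (0.0, 1.0)]


class RendererBase:
    def draw_lines(self, gc, points):
        raise NotImplementedError

    def draw_markers(self, gc, path, xs, ys, scales):
        # Default: place a scaled copy of the path at each point, in Python.
        # A backend could override this one method to do the loop in C.
        for x, y, s in zip(xs, ys, scales):
            self.draw_lines(gc, [(x + s * px, y + s * py) for px, py in path])


class RecordingRenderer(RendererBase):
    """Hypothetical backend that just records the polylines it is given."""
    def __init__(self):
        self.polylines = []

    def draw_lines(self, gc, points):
        self.polylines.append(points)


renderer = RecordingRenderer()
renderer.draw_markers(None, TRIANGLE,
                      xs=[10.0, 20.0], ys=[5.0, 5.0], scales=[2.0, 1.0])
print(len(renderer.polylines))   # one polyline per point
print(renderer.polylines[0][0])  # apex of the first, scaled triangle
```

The shape is passed once, and only the positions and scales vary per
point, so crosses, error bars and triangles would all go through the
same routine.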

But perhaps I still misunderstand.

Thanks for your very detailed response.

Perry