low-hanging fruit on the transforms branch

Mike,

I made a quick test and took a quick look, and I certainly see a ripe mango within reach. I don't know what your constraints and strategy are, but I thought I would give you the off-the-cuff idea before I forget what I did.

The test was pcolortest.py, and the kcachegrind input is the .log file.

The problem is the path initializer: it is converting everything to a masked array, which in the vast majority of cases is not needed, and is very costly.

We need to think carefully about the levels of API, and what should be done at which levels. One possibility is that at the level of the path initializer, only ordinary ndarrays should be acceptable--any mask manipulations and compressions should already have been done. This would require a helper function to generate the codes for that case. Another is that the path initializer could get a flag telling it whether to check for masked arrays. And another is that a check for the existence of a mask should be done at the start, and the mask processing done only if there is a mask. Yet another is that if a mask is needed, it could be passed in as an optional 1-D array kwarg. An advantage of this is that the code that calls the path initializer may be in a better position to know what is needed to generate the 1-D mask (that is, a mask for each (x,y) point rather than for x and y separately)--that mask may already be sitting around.
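For concreteness, the "check for a mask at the start" option could look something like the sketch below. The helper name and exact handling are hypothetical, not actual matplotlib code, and it is written against current numpy (`np` rather than the thread's `npy` alias):

```python
import numpy as np
import numpy.ma as ma

def prepare_vertices(vertices):
    """Hypothetical sketch: masked-array processing runs only when a
    mask is actually present; plain ndarrays take the fast path."""
    if ma.isMaskedArray(vertices) and vertices.mask is not ma.nomask:
        # Collapse the (N, 2) mask to one flag per (x, y) point.
        point_mask = np.any(ma.getmaskarray(vertices), axis=1)
        verts = np.asarray(vertices.filled(0.0), dtype=float)
        return verts, point_mask
    # The common case: no masked-array machinery at all.
    return np.asarray(vertices, dtype=float), None

verts, mask = prepare_vertices(np.array([[0.0, 0.0], [1.0, 1.0]]))
print(mask is None)  # True: the cheap branch was taken
```

The per-point mask returned here is exactly the 1-D mask the kwarg option would accept, so the two approaches can share this helper.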

Masked arrays are pretty clunky and slow. The maskedarray implementation by Pierre GM is nicer, more complete, and faster for many operations than numpy.ma, but it still adds a lot of overhead, especially for small arrays. (It needs to have its core in C; so far I have failed dismally in trying to understand how to do that without repeating the bulk of the ndarray code.)

A related point: can you (or is it OK if I do it) change all the "import numpy.ma as ma" or whatever to "from matplotlib.numerix import npyma as ma"? The advantage is that it makes it easy to test the new version with either maskedarray or ma. This should be temporary; I am still hoping and expecting that maskedarray will replace ma in the core numpy distribution.

Eric

pcolortest.py (160 Bytes)

pcolortest.py.log (616 KB)

Eric Firing wrote:

Mike,

I made a quick test and took a quick look, and I certainly see a ripe mango within reach. I don't know what your constraints and strategy are, but I thought I would give you the off-the-cuff idea before I forget what I did.

The test was pcolortest.py, and the kcachegrind input is the .log file.

The problem is the path initializer: it is converting everything to a masked array, which in the vast majority of cases is not needed, and is very costly.

Thanks for finding this. I agree completely. I think that was basically a typo that ended up "working", just suboptimally. The input to the path constructor may be a numpy array, an ma array, or a regular Python sequence. If it's either of the first two, it should be left alone (if there is a mask, it is dealt with later in the constructor), but if it's the latter, it should be converted to a numpy array.

What I meant to type was:

         if not ma.isMaskedArray(vertices):
             vertices = npy.asarray(vertices, npy.float_)

The argument against just "npy.asarray(vertices, npy.float_)" is that the mask needs to be preserved.

If I understand correctly, that will be essentially a no-op when the input is a numpy array, albeit with the overhead of some checks.
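That no-op behavior is easy to verify: `asarray` hands back the very same object when the input is already an ndarray of the requested dtype, so only the checks cost anything. (Shown with modern numpy; the `npy.float_` above corresponds to `np.float64`.)

```python
import numpy as np

a = np.array([[0.0, 1.0], [2.0, 3.0]])

# Already an ndarray of the right dtype: asarray returns the same
# object, so no data is copied.
b = np.asarray(a, dtype=np.float64)
print(b is a)  # True

# A plain Python sequence, by contrast, is converted to a new ndarray.
c = np.asarray([[0, 1], [2, 3]], dtype=np.float64)
print(isinstance(c, np.ndarray))  # True
```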

We need to think carefully about the levels of API, and what should be done at which levels. One possibility is that at the level of the path initializer, only ordinary ndarrays should be acceptable--any mask manipulations and compressions should already have been done. This would require a helper function to generate the codes for that case. Another is that the path initializer could get a flag telling it whether to check for masked arrays. And another is that a check for the existence of a mask should be done at the start, and the mask processing done only if there is a mask.

This option was the intent.

Yet another is that if a mask is needed, it be passed in as an optional 1-D array kwarg. An advantage of this is that the code that calls the path initializer may be in a better position to know what is needed to generate the 1-D mask (that is, a mask for each (x,y) point rather than for x and y separately)--that mask may already be sitting around.

I fear many of these options would significantly complicate the code. One of the driving motivations for the refactoring is to allow transformations to be combined more generally. Think of the case where you have a polar plot with a logarithmic scale on the r-axis (this wasn't ever possible in the trunk). The log scale means that there is potential for negative masked values, but the polar part of the transformation shouldn't have to know or care whether masked values are being passed through. Requiring it to do so would need the same checks currently performed in the Path constructor, but they would be copied all over the code in every kind of new transformation.

FWIW, there already is a deliberate "quarantining" of masked arrays -- it happens where the logical elements of the plot hit the drawing commands of the plot (the Path object). It could have been implemented such that the backends must understand masked arrays and draw accordingly, but it proved to be faster (based on the simple_plot_fps.py benchmark) to convert to a non-masked array with MOVETO codes upfront and reuse that. (Not surprising, given the overhead of masked arrays). This means that masked arrays are not used at all during panning and zooming operations where speed is perhaps the most crucial.
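The quarantining described here -- trading a masked array for an ordinary vertex array plus a per-vertex code array computed once upfront -- can be sketched roughly as follows. The code constants and helper are illustrative stand-ins, not matplotlib's actual Path implementation:

```python
import numpy as np
import numpy.ma as ma

MOVETO, LINETO = 1, 2  # illustrative stand-ins for the real Path codes

def quarantine(vertices):
    # One flag per (x, y) point: True where any coordinate is masked.
    bad = np.any(ma.getmaskarray(vertices), axis=1)
    codes = np.full(len(vertices), LINETO, dtype=np.uint8)
    # A point starts a new subpath if it is first or follows a bad point,
    # so the pen "jumps" over masked regions instead of drawing through.
    starts = np.concatenate(([True], bad[:-1]))
    codes[starts & ~bad] = MOVETO
    # Return a plain ndarray; bad points are reported so a renderer
    # (or an upfront compression pass) can skip them entirely.
    return ma.filled(vertices, 0.0).astype(float), codes, bad

verts = ma.masked_invalid([[0., 0.], [1., 1.], [np.nan, 2.], [3., 3.]])
plain, codes, bad = quarantine(verts)
print(codes.tolist())  # [1, 2, 2, 1]
```

Once this conversion is done, pan/zoom redraws can reuse `plain` and `codes` without ever touching a masked array again, which is the speed win the benchmark showed.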

Masked arrays are pretty clunky and slow. The maskedarray implementation by Pierre GM is nicer, more complete, and faster for many operations than numpy.ma, but it still adds a lot of overhead, especially for small arrays. (It needs to have its core in C; so far I have failed dismally in trying to understand how to do that without repeating the bulk of the ndarray code.)

A related point: can you (or is it OK if I do it) change all the "import numpy.ma as ma" or whatever to "from matplotlib.numerix import npyma as ma"? The advantage is that it makes it easy to test the new version with either maskedarray or ma. This should be temporary; I am still hoping and expecting that maskedarray will replace ma in the core numpy distribution.

That sounds like a very good idea. I'll go ahead and do this (on the branch only).

Cheers,
Mike


--
Michael Droettboom
Science Software Branch
Operations and Engineering Division
Space Telescope Science Institute
Operated by AURA for NASA

Mike,

Thanks for the quick response.

I was wrong as usual: the masked array overhead in your original version of the path initializer was actually small. I misinterpreted the kcachegrind display. Rats! I was hoping for a big gain. It looks like anything that makes a huge number of paths is going to be slow, no matter what we do to try to optimize the path initializer.

A partial solution for pcolor is pcolormesh, using the quadmesh extension code, although that still has a bug. (Paul Kienzle was going to look into it.) Is the quadmesh extension compatible with your transforms branch?

My impression is that the transforms branch is going to be a big step forward, with performance improvements in some areas, at worst minor penalties in others--except for some problems like pcolor that need to be solved.

In order to replace Matlab in my application, a very fast interactive pcolor-type capability is absolutely essential. I think this simply has to be done via extension code, like quadmesh and the image codes. (Pcolor in the trunk isn't fast enough, either.) Unfortunately, I have found those codes hard to understand. Only the regular-grid image code is fully integrated into the trunk, and even it has a long-standing bug revealed by extreme zooming. The irregular-grid image routine might be a big help, but it has never been integrated. I don't remember which bugs it shares with quadmesh and image, if any.

Eric


Eric Firing wrote:

Mike,

Thanks for the quick response.

I was wrong as usual: the masked array overhead in your original version of the path initializer was actually small. I misinterpreted the kcachegrind display. Rats! I was hoping for a big gain. It looks like anything that makes a huge number of paths is going to be slow, no matter what we do to try to optimize the path initializer.

Do you think the overhead is just from Python object creation? On the trunk, pcolor ultimately creates a large nested array, which on the branch is then converted to a Python list of Path objects (each containing two arrays). I can definitely imagine how that would be slower...

A partial solution for pcolor is pcolormesh, using the quadmesh extension code, although that still has a bug. (Paul Kienzle was going to look into it.) Is the quadmesh extension compatible with your transforms branch?

I'm going to say 'no', only because I wasn't aware of it <wink>. Where is that code? Trunk contains a "draw_quad_mesh" method in the Agg backend which has been replaced by creating Path objects in Python and sending them to a more generic draw_path_collection method.

My impression is that the transforms branch is going to be a big step forward, with performance improvements in some areas, at worst minor penalties in others--except for some problems like pcolor that need to be solved.

I hope you're right. In all honesty, I actually haven't seen many performance improvements myself in what I've benchmarked. Certainly there must be some degenerate synthetic benchmark I can devise to prove its superiority <wink>. Seriously, I was focusing on generalization, ease of adding new transforms, and reduction of interactions with extension code -- all of which aren't certain to create performance gains. I hope that through more pervasive use of numpy, however, some of the performance lost can be made up -- and the reduction in lines of code may make it easier to optimize more broadly.

In order to replace Matlab in my application, a very fast interactive pcolor-type capability is absolutely essential. I think this simply has to be done via extension code, like quadmesh and the image codes. (Pcolor in the trunk isn't fast enough, either.) Unfortunately, I have found those codes hard to understand. Only the regular-grid image code is fully integrated into the trunk, and even it has a long-standing bug revealed by extreme zooming. The irregular-grid image routine might be a big help, but it has never been integrated. I don't remember which bugs it shares with quadmesh and image, if any.

We'll definitely have to find better ways of doing this stuff.

It would be really nice if all this could be done independently of the backends. The primary reason for having the Path abstraction is so that backends only need to understand one basic thing. Images (as they are now) are an exception to that for the obvious optimization benefits. I wonder if generating the Path objects in an extension would be a worthwhile compromise to adding pcolor support in all the backends (hard to say what the factor of improvement might be).

If by interactive you mean "panning and zooming" performance, then the branch is already at least twice as fast as the trunk (finally a performance benchmark in favor of the branch!). I converted your example into a benchmark (attached) and get these results:

trunk:

init: 1.39
fps: 0.448591422932

branch:

init: 5.31
fps: 1.08885017422

Of course, if you mean "interactive" as in "updating the data", then, yes, the branch has a long way to go to catch up to the trunk.

Cheers,
Mike


Attaching benchmark.

pcolortest.py (422 Bytes)

Michael Droettboom wrote:

Eric Firing wrote:

Mike,

Thanks for the quick response.

I was wrong as usual: the masked array overhead in your original version of the path initializer was actually small. I misinterpreted the kcachegrind display. Rats! I was hoping for a big gain. It looks like anything that makes a huge number of paths is going to be slow, no matter what we do to try to optimize the path initializer.

Do you think the overhead is just from Python object creation? On the trunk, pcolor ultimately creates a large nested array, which on the branch is then converted to a Python list of Path objects (each containing two arrays). I can definitely imagine how that would be slower...

I think the problem is at least partly the creation of a large number of numpy arrays; this is a known performance problem with numpy. It is also partly Python function-call overhead. In my benchmark, that path initializer gets called a *lot* of times, and every little bit of overhead adds up.
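The many-small-arrays cost is easy to see in isolation. This toy timing (modern numpy; the exact ratio will vary by machine) compares one big allocation against the equivalent number of tiny ones:

```python
import numpy as np
import timeit

n = 10_000

# One (n, 2) array: a single allocation plus one Python call.
one_big = timeit.timeit(lambda: np.zeros((n, 2)), number=20)

# n separate (1, 2) arrays: the same total data, but each creation
# pays Python call and object-construction overhead.
many_small = timeit.timeit(
    lambda: [np.zeros((1, 2)) for _ in range(n)], number=20)

print("big: %.4fs  small: %.4fs  ratio: %.0fx"
      % (one_big, many_small, many_small / one_big))
```

The per-array fixed cost dominates for small arrays, which is why one Path object per pcolor cell hurts so much.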

A partial solution for pcolor is pcolormesh, using the quadmesh extension code, although that still has a bug. (Paul Kienzle was going to look into it.) Is the quadmesh extension compatible with your transforms branch?

I'm going to say 'no', only because I wasn't aware of it <wink>. Where is that code? Trunk contains a "draw_quad_mesh" method in the Agg backend which has been replaced by creating Path objects in Python and sending them to a more generic draw_path_collection method.

That's it--"draw_quad_mesh" in Agg backend.

My impression is that the transforms branch is going to be a big step forward, with performance improvements in some areas, at worst minor penalties in others--except for some problems like pcolor that need to be solved.

I hope you're right. In all honesty, I actually haven't seen many performance improvements myself in what I've benchmarked. Certainly there must be some degenerate synthetic benchmark I can devise to prove its superiority <wink>. Seriously, I was focusing on generalization, ease of adding new transforms, and reduction of interactions with extension code -- all of which aren't certain to create performance gains. I hope that through more pervasive use of numpy, however, some of the performance lost can be made up -- and the reduction in lines of code may make it easier to optimize more broadly.

I think a fair amount of extension code may remain necessary, but your reorganization should make it easier to have a clean interface to the extension code.

In order to replace Matlab in my application, a very fast interactive pcolor-type capability is absolutely essential. I think this simply has to be done via extension code, like quadmesh and the image codes. (Pcolor in the trunk isn't fast enough, either.) Unfortunately, I have found those codes hard to understand. Only the regular-grid image code is fully integrated into the trunk, and even it has a long-standing bug revealed by extreme zooming. The irregular-grid image routine might be a big help, but it has never been integrated. I don't remember which bugs it shares with quadmesh and image, if any.

We'll definitely have to find better ways of doing this stuff.

It would be really nice if all this could be done independently of the backends. The primary reason for having the Path abstraction is so that backends only need to understand one basic thing. Images (as they are now) are an exception to that for the obvious optimization benefits. I wonder if generating the Path objects in an extension would be a worthwhile compromise to adding pcolor support in all the backends (hard to say what the factor of improvement might be).

1) It might indeed be possible to speed up path object generation via fairly simple extension code using the numpy C API. We will have to look more closely to see where the time is being taken.

2) If we take Agg as the standard core for interactive backends, then it is not so crucial to optimize the non-interactive backends, although it would still be nice. Alternatively, it may be that the way to go for all very complex plots on all backends is to use an image for the complex part. Matlab did something like this, with their "zbuffer" versus "painters" renderer choice, but they fouled it up--with zbuffer, everything was bitmapped, including lines and fonts, so the result looked terrible, and the ps files were sometimes too big to print. (I know very little about options for images in svg, ps and pdf.) This was many years ago; I haven't run into this sort of problem with Matlab more recently.
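On point 1, before reaching for the numpy C API it may be worth seeing how far plain vectorization goes. For instance, all of a pcolor grid's quad corners can be built in one batched pass rather than a Python loop per cell -- a sketch, not matplotlib code, with a hypothetical helper name:

```python
import numpy as np

def quad_vertices(x, y):
    # x, y are 1-D cell-edge coordinates (lengths nx+1 and ny+1).
    X, Y = np.meshgrid(x, y)
    # Stack the four corners of every cell into an (ncells, 4, 2) array
    # in a single vectorized pass -- no per-cell Python loop.
    corners = np.stack([
        np.dstack([X[:-1, :-1], Y[:-1, :-1]]),  # lower-left
        np.dstack([X[:-1, 1:],  Y[:-1, 1:]]),   # lower-right
        np.dstack([X[1:, 1:],   Y[1:, 1:]]),    # upper-right
        np.dstack([X[1:, :-1],  Y[1:, :-1]]),   # upper-left
    ], axis=2)
    return corners.reshape(-1, 4, 2)

quads = quad_vertices(np.arange(4.0), np.arange(3.0))
print(quads.shape)  # (6, 4, 2): a 3x2 grid of cells, 4 corners each
```

A batched array like this could be handed to extension code (or a collection-style draw method) in one call, sidestepping the per-path initializer overhead entirely.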

If by interactive you mean "panning and zooming" performance, then the branch is already at least twice as fast as the trunk (finally a performance benchmark in favor of the branch!). I converted your example into a benchmark (attached) and get these results:

trunk:

init: 1.39
fps: 0.448591422932

branch:

init: 5.31
fps: 1.08885017422

Good, this confirms my subjective impression. But for comparison, try the benchmark on the trunk with pcolormesh instead of pcolor.

Of course, if you mean "interactive" as in "updating the data", then, yes, the branch has a long way to go to catch up to the trunk.

Both aspects--data updating and pan/zoom--are important for interactive use.

Eric


Mike,

On my machine, with pcolor from the trunk:
efiring@...340...:~/test$ python pcolortest2.py
init: 2.0
fps: 0.287026406429

And substituting pcolormesh for pcolor:
init: 0.27
fps: 5.48245614035

Now that's more like it!

Using image can be another order of magnitude faster than pcolormesh (but with limitations, of course). I suspect the nonuniform image code is intermediate, but it is a long time since I have tried it.

Eric

Michael Droettboom wrote:


Attaching benchmark.

------------------------------------------------------------------------

from numpy.random import rand
import matplotlib
from matplotlib.pyplot import pcolor, savefig, show, ion, axis, draw, axes
import time

ion()

t = time.clock()
pcolor(rand(1000,100))
print "init: ", time.clock() - t

frames = 25.0
t = time.clock()
for i in xrange(int(frames)):
    part = (1.0 - (i / frames) / 2.0)
    axes().set_ylim((0.0, 1000.0 * part))
    draw()
print "fps:", frames / (time.clock() - t)

# show()