Some remarks/questions about perceived slowness of matplotlib

David,

I have made some changes in svn that address all but one of the points you made:

[....]

if self.clip:
    mask = ma.getmaskorNone(val)
    if mask == None:
        val = ma.array(clip(val.filled(vmax), vmin, vmax))
    else:
        val = ma.array(clip(val.filled(vmax), vmin, vmax),
                       mask=mask)

The real problem here is that I should not have been using getmaskorNone(). In numpy.ma, we need nomask, not None, so we want an ordinary getmask() call. ma.array(...., mask=ma.nomask) is very fast, so the problem goes away.
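For concreteness, here is a small sketch of that fast path using the plain getmask/nomask names (getmaskorNone belongs to the older MA interface and is not used here); the data values are made up:

import numpy as np
import numpy.ma as ma

vmin, vmax = 0.0, 1.0
val = ma.asarray(np.linspace(-0.5, 1.5, 8))    # unmasked data, so getmask returns ma.nomask

mask = ma.getmask(val)                          # ma.nomask here; a boolean array if anything is masked
clipped = np.clip(val.filled(vmax), vmin, vmax)
result = ma.array(clipped, mask=mask)           # passing mask=ma.nomask is the cheap case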

Actually, the problem is in ma.array: with mask set to None, it should not make any difference whether you pass mask=None or no mask argument at all, right?

But it does, because for numpy it needs to be nomask; it does something with None, but whatever it is, it is very slow.

I didn't change ma.array itself, to keep my change as local as possible. Changing only this operation as above gives a speedup from 1.8 s to ~1.0 s for to_rgba, which means calling show goes from ~2.2 s to ~1.4 s. I also changed

result = (val-vmin)/float(vmax-vmin)

to

invcache = 1.0 / (vmax - vmin)
result = (val-vmin) * invcache

which gives a moderate speedup (around 100 ms for an 8000x256 array). Once you make both of those changes, the clip call is by far the most expensive operation in the normalize functor, but the functor is no longer expensive compared to the rest, so that is not where I looked.

This is the one I did not address. I don't understand how this could be making much difference, and some testing using ipython and %prun with 1-line operations showed little difference with variations on this theme. The fastest would appear to be (and logically should be, I think) result = (val-vmin)*(1.0/(vmax-vmin)), but I don't think it makes much difference--it looks to me like maybe 10-20 msec, not 100, on my Pentium M 1.6 GHz. Maybe it is still worthwhile, so I may yet make the change after more careful testing.
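A quick standalone timing sketch along these lines (array size borrowed from the 8000x256 example above; the printed numbers are of course machine-dependent):

import numpy as np
import timeit

val = np.random.rand(8000, 256)
vmin, vmax = 0.0, 1.0

t_div = timeit.timeit(lambda: (val - vmin) / float(vmax - vmin), number=20)
invcache = 1.0 / (vmax - vmin)          # precomputed reciprocal, as in David's change
t_mul = timeit.timeit(lambda: (val - vmin) * invcache, number=20)
print("divide each element: %.3f s   multiply by reciprocal: %.3f s" % (t_div, t_mul))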

For the where calls in the Colormap functor, I was wondering whether they are necessary in all cases: some of those calls seem redundant, and it may be possible to detect that before calling them. Wouldn't that be both easier and faster, at least in this case, than having a fast where?

You hit the nail squarely: where() is the wrong function to use, and I have eliminated it from colors.py. The much faster replacement is putmask, which does as well as direct indexing with a Boolean but works with all three numerical packages. I think that using the fast putmask is better than trying to figure out special cases in which there would be nothing to put, although I could be convinced otherwise.
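A minimal sketch of the substitution, with made-up values standing in for self.N, self._i_under, and self._i_over:

import numpy as np

N, i_under, i_over = 256, 256, 257      # stand-ins for self.N, self._i_under, self._i_over
xa = np.array([-3, 0, 100, 255, 300])

# old approach: each where() call allocates a new array
xa_old = np.where(xa > N - 1, i_over, np.where(xa < 0, i_under, xa))

# new approach: putmask modifies xa in place; over-range is handled first,
# since i_under equals N and would otherwise be caught by the xa > N-1 test
np.putmask(xa, xa > N - 1, i_over)
np.putmask(xa, xa < 0, i_under)
assert (xa == xa_old).all()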

I understand that supporting multiple array backends and masked arrays has cost consequences. But it looks like it may be possible to speed things up for cases where an array has only meaningful values and no mask.

The big gains here were essentially bug fixes--picking the appropriate function (getmask versus getmaskorNone and putmask versus where).

Here is the colors.py diff:

--- trunk/matplotlib/lib/matplotlib/colors.py 2006/12/03 21:54:38 2906
+++ trunk/matplotlib/lib/matplotlib/colors.py 2006/12/14 08:27:04 2923
@@ -30,9 +30,9 @@
 """
 import re

-from numerix import array, arange, take, put, Float, Int, where, \
+from numerix import array, arange, take, put, Float, Int, putmask, \
      zeros, asarray, sort, searchsorted, sometrue, ravel, divide,\
-     ones, typecode, typecodes, alltrue
+     ones, typecode, typecodes, alltrue, clip
 from numerix.mlab import amin, amax
 import numerix.ma as ma
 import numerix as nx
@@ -536,8 +536,9 @@
     lut[0] = y1[0]
     lut[-1] = y0[-1]
     # ensure that the lut is confined to values between 0 and 1 by clipping it
-    lut = where(lut > 1., 1., lut)
-    lut = where(lut < 0., 0., lut)
+    clip(lut, 0.0, 1.0)
+    #lut = where(lut > 1., 1., lut)
+    #lut = where(lut < 0., 0., lut)
     return lut

@@ -588,16 +589,16 @@
             vtype = 'array'
             xma = ma.asarray(X)
             xa = xma.filled(0)
-            mask_bad = ma.getmaskorNone(xma)
+            mask_bad = ma.getmask(xma)
         if typecode(xa) in typecodes['Float']:
-            xa = where(xa == 1.0, 0.9999999, xa) # Tweak so 1.0 is in range.
+            putmask(xa, xa==1.0, 0.9999999) #Treat 1.0 as slightly less than 1.
             xa = (xa * self.N).astype(Int)
-        mask_under = xa < 0
-        mask_over = xa > self.N-1
-        xa = where(mask_under, self._i_under, xa)
-        xa = where(mask_over, self._i_over, xa)
-        if mask_bad is not None: # and sometrue(mask_bad):
-            xa = where(mask_bad, self._i_bad, xa)
+        # Set the over-range indices before the under-range;
+        # otherwise the under-range values get converted to over-range.
+        putmask(xa, xa>self.N-1, self._i_over)
+        putmask(xa, xa<0, self._i_under)
+        if mask_bad is not None and mask_bad.shape == xa.shape:
+            putmask(xa, mask_bad, self._i_bad)
         rgba = take(self._lut, xa)
         if vtype == 'scalar':
             rgba = tuple(rgba[0,:])
@@ -752,7 +753,7 @@
             return 0.*value
         else:
             if clip:
-                mask = ma.getmaskorNone(val)
+                mask = ma.getmask(val)
                 val = ma.array(nx.clip(val.filled(vmax), vmin, vmax),
                                 mask=mask)
             result = (val-vmin)/float(vmax-vmin)
@@ -804,7 +805,7 @@
             return 0.*value
         else:
             if clip:
-                mask = ma.getmaskorNone(val)
+                mask = ma.getmask(val)
                 val = ma.array(nx.clip(val.filled(vmax), vmin, vmax),
                                 mask=mask)
             result = (ma.log(val)-nx.log(vmin))/(nx.log(vmax)-nx.log(vmin))

Eric

Hi. I want to have just horizontal grid lines. Is there any way to do this? Thanks!

Simson Garfinkel wrote:

Hi. I want to have just horizontal grid lines. Is there any way to do this? Thanks!

gca().yaxis.grid(True)
gca().xaxis.grid(False)

Here is the grid method docstring:

     def grid(self, b=None, which='major', **kwargs):
         """
         Set the axis grid on or off; b is a boolean use which =
         'major' | 'minor' to set the grid for major or minor ticks

         if b is None and len(kwargs)==0, toggle the grid state. If
         kwargs are supplied, it is assumed you want the grid on and b
         will be set to True

         kwargs are used to set the line properties of the grids, eg,

           xax.grid(color='r', linestyle='-', linewidth=2)
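For completeness, a small self-contained sketch of that recipe (the data and figure setup below are just an illustration):

from pylab import figure, show
from numpy import arange

fig = figure()
ax = fig.add_subplot(111)
ax.plot(arange(10), arange(10) ** 2)
ax.yaxis.grid(True)    # horizontal grid lines only
ax.xaxis.grid(False)   # no vertical grid lines
show()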

Eric

Looks like I need to read *all* of the docstrings. I wish there were an easy way to search them...

On Dec 15, 2006, at 2:49 AM, Eric Firing wrote:

[...]

Eric Firing wrote:

[...]

Ok, I've installed the latest svn, and there is still one function which is much slower than a direct numpy implementation, so I would like to know whether this is inherent to the multiple-backend nature of matplotlib or not. The Normalize functor uses the clip function, and a direct numpy version would be 3 times faster (giving the show call a 20% speedup in my admittedly limited benchmarks):

if clip:
    mask = ma.getmask(val)
    #val = ma.array(nx.clip(val.filled(vmax), vmin, vmax),
    #               mask=mask)
    def myclip(a, m, M):
        a[a<m] = m
        a[a>M] = M
        return a
    val = ma.array(myclip(val.filled(vmax), vmin, vmax), mask=mask)

I am a bit lost in the matplotlib code as to where clip is implemented (is it in numerix, and as such using the numpy clip function?).

Still, I must confess that all this looks quite good, because it was possible to speed things up quite considerably without too much effort,

cheers,

David

David Cournapeau wrote:
[...]


There is a clip function in all three numeric packages, so a native clip is being used.

If numpy.clip is actually slower than your version, that sounds like a problem with the implementation in numpy. By all logic a single clip function should either be the same (if it is implemented like yours) or faster (if it is a single loop in C-code, as I would expect). This warrants a little more investigation before changing the mpl code. The best thing would be if you could make a simple standalone numpy test case profiling both versions and post the results as a question to the numpy-discussion list. Many such questions in the past have resulted in big speedups in numpy.

One more thought: it is possible that the difference is because myclip operates on the array in place while clip generates a new array. If this is the cause of the difference then changing your last line to "return a.copy()" probably would slow it down to the numpy clip speed or slower.
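A standalone test case along those lines might look like the following; myclip mirrors David's snippet, the array shape matches his example, and the copy in the second timing is there so the in-place version also pays for producing a fresh array, as np.clip does:

import numpy as np
import timeit

a = np.random.rand(8000, 256) * 2.0 - 0.5   # values spilling outside [0, 1]
vmin, vmax = 0.0, 1.0

def myclip(x, m, M):
    x[x < m] = m          # clip in place via boolean assignment
    x[x > M] = M
    return x

t_npclip = timeit.timeit(lambda: np.clip(a, vmin, vmax), number=20)
t_myclip = timeit.timeit(lambda: myclip(a.copy(), vmin, vmax), number=20)
print("np.clip: %.3f s   myclip on a copy: %.3f s" % (t_npclip, t_myclip))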

Eric

Eric Firing wrote:

There is a clip function in all three numeric packages, so a native clip is being used.

If numpy.clip is actually slower than your version, that sounds like a problem with the implementation in numpy. By all logic a single clip function should either be the same (if it is implemented like yours) or faster (if it is a single loop in C-code, as I would expect). This warrants a little more investigation before changing the mpl code. The best thing would be if you could make a simple standalone numpy test case profiling both versions and post the results as a question to the numpy-discussion list. Many such questions in the past have resulted in big speedups in numpy.

I am much more familiar with internal numpy code than matplotlib's, so this is much easier for me, too :)

One more thought: it is possible that the difference is because myclip operates on the array in place while clip generates a new array. If this is the cause of the difference then changing your last line to "return a.copy()" probably would slow it down to the numpy clip speed or slower.

It would be scary if a copy of an 8008x256 array of doubles took 100 ms... Fortunately, it does not; this does not seem to be the problem.

cheers,

David

David Cournapeau wrote:

[...]

Ok, so now, with my clip function, still for an 8000x256 double array: show() after imshow takes around 760 ms. About 3/5 of that is in make_image and 2/5 is in the function blop, which is just a wrapper I added to separate the time spent in axes.py:1043(draw) from image.py:173(draw) inside Axes.draw (file axes.py):

def blop(dsu):
    for zorder, i, a in dsu:
        a.draw(renderer)

blop(dsu)

In make_image, most of the time is spent in to_rgba: almost half of that is in the take call in Colormap.__call__. Almost 200 ms to look up the colors from the indices seems like quite a lot (that is about 280 cycles per pixel on average!). I can reproduce this number with a small numpy test.
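A small numpy-only sketch of that lookup (LUT size and index shape are made up to match the discussion; the timing it prints is illustrative only):

import numpy as np
import timeit

N = 256
lut = np.random.rand(N + 3, 4)                        # RGBA rows plus under/over/bad entries
xa = np.random.randint(0, N + 3, size=(8000, 256))    # colormap indices

t = timeit.timeit(lambda: np.take(lut, xa, axis=0), number=10) / 10
print("take-based RGBA lookup: %.1f ms per call" % (1e3 * t))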

On my laptop (Pentium M, 1.2 GHz), make_image takes almost 85% of the time, which seems to imply that this is where one should focus to improve the speed,
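For reference, one way to get this kind of per-function breakdown without editing axes.py is to profile a draw directly with cProfile; the figure setup below is only an illustration:

import cProfile
import numpy as np
from pylab import figure

fig = figure()
ax = fig.add_subplot(111)
ax.imshow(np.random.rand(8000, 256), aspect='auto')
# run as a top-level script so that `fig` is visible to cProfile.run
cProfile.run('fig.canvas.draw()', sort='cumulative')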

cheers,

David