Patch suggestion

> I've attempted to implement this code myself (see attached patch to
> src/_image.cpp) but I'm not a regular c++ or even c programmer so it's
> fairly likely there will be memory leaks in the code. For a 1024x2048
> array using the GTKAgg backend and with plenty of memory free this
> change results in show() taking <0.7s rather than >4.6s; if there is a
> memory shortage and swapping becomes involved the change is much more
> noticeable. I haven't made any decent Python wrapping code yet - but
> would be happy to do so if someone familiar with c++ could tidy up my
> attachment.

Hi Nicholas,

Thanks for the suggestions and patch. I incorporated frombuffer and
have been testing its performance against fromarray; I see some 2-3x
speedups but nothing like the numbers you are reporting. [Also, I
don't see any detectable memory leaks, so I don't think you have any
worries there.]

Here is the test script I am using - does this look like a fair test?

You can uncomment report_memory on unix-like systems to get a memory
report on each pass through the loop, and switch out fromarray vs
frombuffer to compare your function with mine.

On a related note, below I'm pasting in a representative section of the
code I am currently using in fromarray for MxNx3 and MxNx4 arrays --
any obvious performance gains to be had here, numerix gurus?

Another suggestion for Nicholas -- perhaps you want to support MxN,
MxNx3 and MxNx4 arrays in your frombuffer function?

And a final question -- how are you getting your function into the
matplotlib image pipeline? Did you alter the image.py
AxesImage.set_data function to test whether A is a buffer object? If
so, you might want to post these changes to the codebase as well.
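
For reference, I'd imagine the check looking something like the sketch
below -- hypothetical code only, with illustrative attribute names, not
the actual image.py:

    # Hypothetical sketch of a buffer check in AxesImage.set_data --
    # the attribute names here are illustrative, not current matplotlib code.
    def set_data(self, A):
        if isinstance(A, (str, buffer)):
            # preconverted RGBA bytes: take the fast frombuffer path
            self._A, self._isbuffer = A, True
        else:
            # assume a numerix array: existing fromarray path
            self._A, self._isbuffer = A, False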

// some fromarray code

    // Accept a 2- or 3-dimensional array of doubles; PyArray_FromObject
    // does not force contiguity, hence the stride arithmetic below.
    //PyArrayObject *A = (PyArrayObject *) PyArray_ContiguousFromObject(x.ptr(), PyArray_DOUBLE, 2, 3);
    PyArrayObject *A = (PyArrayObject *) PyArray_FromObject(x.ptr(), PyArray_DOUBLE, 2, 3);

    int rgba = A->dimensions[2] == 4;  // MxNx4 input carries its own alpha
    double r, g, b, alpha;
    int offset = 0;

    for (size_t rownum = 0; rownum < imo->rowsIn; rownum++) {
      for (size_t colnum = 0; colnum < imo->colsIn; colnum++) {
        offset = rownum*A->strides[0] + colnum*A->strides[1];
        r = *(double *)(A->data + offset);
        g = *(double *)(A->data + offset + A->strides[2]);
        b = *(double *)(A->data + offset + 2*A->strides[2]);

        if (rgba)
          alpha = *(double *)(A->data + offset + 3*A->strides[2]);
        else
          alpha = 1.0;

        *buffer++ = int(255*r);      // red
        *buffer++ = int(255*g);      // green
        *buffer++ = int(255*b);      // blue
        *buffer++ = int(255*alpha);  // alpha
      }
    }
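
Incidentally, the main vectorized alternative I'm aware of is to let
numerix do the scaling and byte conversion in a single pass and hand
the result to frombuffer -- a sketch only, assuming the Numeric-style
calls used in the profile script below:

    # Sketch: one vectorized pass instead of the per-pixel C++ loop above.
    # X is assumed to be an MxNx4 float array with values in [0, 1].
    from matplotlib._image import frombuffer
    from matplotlib.numerix import UInt8

    rgba_bytes = (X*255).astype(UInt8).tostring()  # scale, cast and pack at once
    im = frombuffer(rgba_bytes, N, N, 0)           # no per-element work in C++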

## ... and here is the profile script ....

import sys, os, time, gc

from matplotlib._image import fromarray, fromarray2, frombuffer
from matplotlib.numerix.mlab import rand
from matplotlib.numerix import UInt8

def report_memory(i):
    pid = os.getpid()
    a2 = os.popen('ps -p %d -o rss,sz' % pid).readlines()
    print i, ' ', a2[1],
    return int(a2[1].split()[1])

N = 1024
#X2 = rand(N,N)
#X3 = rand(N,N,3)
X4 = rand(N,N,4)

start = time.time()

b4 = (X4*255).astype(UInt8).tostring()
for i in range(50):

    im = fromarray(X4, 0)
    #im = frombuffer(b4, N, N, 0)
    #val = report_memory(i)

end = time.time()
print 'elapsed: %1.3f'%(end-start)

> Thanks for the suggestions and patch. I incorporated frombuffer and
> have been testing its performance against fromarray; I see some 2-3x
> speedups but nothing like the numbers you are reporting. [Also, I
> don't see any detectable memory leaks, so I don't think you have any
> worries there.]

That kind of speed up is probably more realistic - I probably made a
greater number of optimisations to the python buffer code than to the
numerix code (which I only remembered after posting my first message).
Performance gains do seem greater where memory is limited, though, as
the reduced memory consumption means less swapping; in cases where the
reduced memory consumption avoids swapping altogether the gains are, of
course, huge.

> Here is the test script I am using - does this look like a fair test?

It seems to be a fair test - I'd have created the string buffer outside
of the timing (as in reality you wouldn't go through that step) but as
it should be a fairly quick conversion it shouldn't matter too much.
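
Concretely, the adjustment I mean is just hoisting the conversion above
the timer -- a sketch against the script above:

    # Sketch: do the one-off string conversion during set-up so that only
    # the frombuffer calls fall inside the timed region.
    b4 = (X4*255).astype(UInt8).tostring()   # set-up, excluded from timing

    start = time.time()
    for i in range(50):
        im = frombuffer(b4, N, N, 0)
    end = time.time()
    print 'elapsed: %1.3f' % (end-start)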

> Another suggestion for Nicholas -- perhaps you want to support MxN,
> MxNx3 and MxNx4 arrays in your frombuffer function?

I could do - the main reason I don't particularly want to is that a
compiler should be able to optimise a simple for(i; i<j; i++) loop more
easily than a series of nested loops and other instructions. As this
code is only really of use where performance is particularly important
I'd rather keep performance as high as possible - it's easy to generate
buffers in whatever format is necessary.
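
For example, padding an MxNx3 array out to the RGBA bytes frombuffer
expects is only a few lines of Python -- a sketch, assuming
Numeric-style numerix behaviour:

    # Sketch: build the MxNx4 RGBA byte string frombuffer expects from an
    # MxNx3 float array (values in [0, 1]); Numeric-style numerix assumed.
    from matplotlib.numerix import ones, concatenate, Float, UInt8

    alpha = ones(X3.shape[:2] + (1,), Float)    # fully opaque alpha plane
    rgba = concatenate((X3, alpha), 2)          # MxNx3 -> MxNx4
    buf = (rgba*255).astype(UInt8).tostring()   # bytes ready for frombuffer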

My suggestion for improved performance would be to allow the Image class
to work directly on the buffer passed to the generator function - I've
started implementing this in c++. Going with this approach should
improve speed further and again conserve memory for very large images -
and (without making other changes to the Image class) would require that
rgba rather than rgb or intensity was used as an input.

> And a final question -- how are you getting your function into the
> matplotlib image pipeline? Did you alter the image.py
> AxesImage.set_data function to test whether A is a buffer object? If
> so, you might want to post these changes to the codebase as well.

I made some alterations to these functions - but they are currently
limited to my own situation. I will make them general once I've played
around with writable buffers a bit - this will be fairly easy, but I
don't want to spend time writing wrapper code until I'm happy with what
I'm wrapping.

Nicholas

···

On Thu, 2005-04-14 at 16:00 -0500, John Hunter wrote:

> My suggestion for improved performance would be to allow the Image class
> to work directly on the buffer passed to the generator function - I've
> started implementing this in c++. Going with this approach should
> improve speed further and again conserve memory for very large images -
> and (without making other changes to the Image class) would require that
> rgba rather than rgb or intensity was used as an input.

Having tried this (patch attached) I get the following results from
running a slightly modified version of John Hunter's test script
(attached). In the original script it turned out that a significant
amount of time was being taken up with creating some of the test
environment - hence a smaller improvement was being shown.

Running with 1024x1024:
Starting array:
  Array set up: resident set size: 39716, size: 10914
  Tests done: resident set size: 43836, size: 11938
  Elapsed: 9.363
Starting buffer:
  Buffer set up: resident set size: 15192, size: 4823
  Tests done: resident set size: 15200, size: 4824
  Elapsed: 0.690
Fractional improvement: 12.577

Running with 2048x2048:
Starting array:
  Array set up: resident set size: 146276, size: 37592
  Tests done: resident set size: 158572, size: 40664
  Elapsed: 38.544
Starting buffer:
  Buffer set up: resident set size: 39784, size: 10968
  Tests done: resident set size: 39784, size: 10967
  Elapsed: 2.044
Fractional improvement: 17.855

Running with 4096x4096:
Starting array:
  Array set up: resident set size: 564076, size: 142041
  Tests done: resident set size: 613252, size: 154329
  Elapsed: 170.495
Starting buffer:
  Buffer set up: resident set size: 67100, size: 35544
  Tests done: resident set size: 133060, size: 35544
  Elapsed: 8.474
Fractional improvement: 19.120

As you can see - in all cases a big improvement. In the case of large
data sets it's a change from a noticeable delay of 3.4s per plot to the
very usable 0.17s per plot (none of these plots required swapping -
although the set-up functions did). If you don't have your data in the
form of a writable python buffer it's necessary to wrap the input
buffer in a Python array.array to get this performance (a read-only
buffer is still accepted but will be slightly slower). Even with the
overhead of that additional Python function call, wrapping in
array.array is still slightly faster than passing a non-modifiable
buffer directly (I guess array.array implements some more intelligent
buffer creation strategy).
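
For concreteness, the wrapping step is just the following -- a sketch:

    # Sketch: a plain string gives only a read-only buffer; wrapping the
    # bytes in array.array hands frombuffer a writable buffer instead.
    import array
    writable = array.array('B', b4)       # one copy into a mutable byte buffer
    im = frombuffer(writable, N, N, 0)    # fast path: works on it in place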

> I made some alterations to these functions - but they are currently
> limited to my own situation. I will make them general once I've played
> around with writable buffers a bit - this will be fairly easy, but I
> don't want to spend time writing wrapper code until I'm happy with what
> I'm wrapping.

These changes are in the patch I've attached (and they turned out to be
even simpler than I imagined).

Nich

patch (5.9 KB)

test.py (1.24 KB)

···

On Fri, 2005-04-15 at 12:27 +0100, I wrote:

Hi,

I made a suggestion for improving imshow performance for plotting an
image already in byte string form a while ago; some of the results are
currently in CVS. It seems that other changes made to CVS at the same
time or since mean that the floating point buffer source code is now
much faster, and I'd suggest taking anything sourced from my code back
out.

Nick