Improving PDF generation speed

Are there any good suggestions to improve the performance of matplotlib when generating PDFs?

To provide an example:
One of the PDFs I’m generating is 139 pages with about 2 plots on each page. For my current input data set, it takes about 150 seconds from start to finish to generate the PDF. If I comment out the pdf.savefig() call, the total run time drops to ~30 seconds.

I’ve tried the following tricks so far. Some definitely help, but they only shave off about 10-20 seconds:

  • import matplotlib.style as mplstyle; mplstyle.use('fast') – this also seems to increase the PDF size 3x
  • Set a lower DPI on the figure

Any help is appreciated.

I suspect that the fast style is affecting how much the paths are being simplified.

Without code to work against, it is very hard to guess what could improve your runtime. Have you done any profiling to see where the bottlenecks are?
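One way to do that profiling, as a sketch: wrap just the save step in cProfile and look at the cumulative times to see whether font handling, path drawing, or something else dominates.

```python
# Profile a single savefig() call with cProfile to find the hot spots.
import cProfile
import io
import pstats

import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(range(1000))

prof = cProfile.Profile()
prof.enable()
fig.savefig("profiled.pdf")
prof.disable()

out = io.StringIO()
pstats.Stats(prof, stream=out).sort_stats("cumulative").print_stats(10)
print(out.getvalue())  # top-10 functions by cumulative time
```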

Are you creating a new figure / artists every time? If you are making ~250 Axes (what I take “plot” to mean in this context) I expect there to be a lot of versions of the same visualization but with different data. You may get some time back by updating existing artists and then saving, rather than re-creating the same artists every time.

Can you split the generation of each page up into its own process and then use another program to stitch them all together? This would not make anything faster per se but would let you take advantage of the (I assume) more than one core on your machine.

Can you down-sample your data at all before passing it to Matplotlib? You may have a better idea of what the correct resampling is than we do.
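Two simple down-sampling schemes, as a sketch: a plain stride is cheapest but can drop spikes, while keeping the min and max of each bucket preserves extremes at twice the point count.

```python
# Naive down-sampling before plotting. A plain stride can hide spikes;
# min/max per bucket keeps the peaks a stride would drop.
import numpy as np

rng = np.random.default_rng(3)
y = rng.standard_normal(100_000)

step = 100
y_small = y[::step]  # simple stride: 100,000 -> 1,000 points

buckets = y.reshape(-1, step)  # one row per bucket of `step` samples
y_minmax = np.column_stack(
    [buckets.min(axis=1), buckets.max(axis=1)]
).ravel()  # 2,000 points, extremes preserved

print(y_small.size, y_minmax.size)
```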

What version of Python are you using? py310 and py311 include some reasonable speedups at the language level.

Can you try using the core fonts rather than a font we have to subset? I suspect there is also an opportunity for us to do some more caching there…

Thanks for the response!

> Without code to work against, it is very hard to guess what could improve your runtime. Have you done any profiling to see where the bottlenecks are?

The main bottleneck is the call to savefig(). In my original example, just commenting out this one function call brings the total run time from ~150 seconds to ~30 seconds. I didn’t think my code would be too relevant for this bottleneck, but if you think it’d be helpful, I can try to provide a minimal example.

> Are you creating a new figure / artists every time? If you are making ~250 Axes (what I take “plot” to mean in this context) I expect there to be a lot of versions of the same visualization but with different data. You may get some time back by updating existing artists and then saving, rather than re-creating the same artists every time.

Yes, I am. This is one place I could try to optimize, but to my understanding this won’t have any effect on the savefig() calls (please correct me if I’m wrong).

> Can you split the generation of each page up into its own process and then use another program to stitch them all together? This would not make anything faster per se but would let you take advantage of the (I assume) more than one core on your machine.

Possibly. This is probably worth looking into, but may increase the complexity of things quite a bit.

> Can you down-sample your data at all before passing it to Matplotlib? You may have a better idea of what the correct resampling is than we do.

Not easily without potentially losing important information.

> What version of Python are you using? py310 and py311 include some reasonable speedups at the language level.

We’re on 3.7.13 for now. This might be worth looking into, but I don’t want to run into any unexpected behavior.

> Can you try using the core fonts rather than a font we have to subset? I suspect there is also an opportunity for us to do some more caching there…

Will definitely look into this.