Improving interactive plotting speed with large datasets

Hi,

I have used matplotlib for quite a few publication-style figures in the past. It is really great!

Where I work now, however, the main constraint is interactive exploration of large datasets (say, many line plots of over 1 million points each, with axes linked across many subplots, etc.). It looks like matplotlib does not yet handle this well enough to make such interactive work feasible.

For instance, panning such large datasets lags by up to several seconds in most of the backends/renderers.

I’m just wondering what the developers’ point of view would be on how to improve this. For instance, would an OpenGL-based renderer be helpful in this scenario, or is the bottleneck really the abstraction layer of matplotlib itself? It looks like an OpenGL renderer shouldn’t have much of an issue drawing these large datasets, given some constraints.

Another question, speaking about panning. I’m not that familiar with how the interactive plotting pipeline works at this point. Any pointers to where I should start in the codebase? I couldn’t see much in the developer section of the documentation, but maybe I missed something.

Best,
Lukas

For extremely large, but single-artist datasets (e.g. a single line/points plot with millions of points), the renderer should indeed be the bottleneck. (If you instead have (very) many separate artists, then Matplotlib’s abstraction layer is also quite slow, but that shouldn’t be a problem for a single line plot.) As always, this can (should) be confirmed by running the code through a profiler (I am personally partial to py-spy+speedscope, but there are other options out there).
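If installing py-spy is not an option, a rough first check can be done with the standard-library cProfile instead. The following is only a minimal sketch under assumptions that are not from this thread: the Agg backend (so it runs headless), a synthetic sine wave, and the helper name `profile_draw`, which is made up for illustration.

```python
import cProfile
import io
import pstats

import matplotlib

matplotlib.use("Agg")  # assumption: headless raster backend, so no GUI is needed
import matplotlib.pyplot as plt
import numpy as np


def profile_draw(n_points=200_000, top=10):
    """Profile one full canvas draw of a single line with n_points vertices."""
    x = np.linspace(0.0, 1.0, n_points)
    y = np.sin(40 * np.pi * x)  # synthetic data, for illustration only
    fig, ax = plt.subplots()
    ax.plot(x, y)

    profiler = cProfile.Profile()
    profiler.enable()
    fig.canvas.draw()  # the same call an interactive redraw ends up making
    profiler.disable()
    plt.close(fig)

    # Collect the `top` most expensive entries, sorted by cumulative time.
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return out.getvalue()


report = profile_draw()
print(report)
```

If most of the cumulative time sits in the renderer's path-drawing calls, the renderer is the bottleneck; if it is spread over many per-artist `draw` calls, the abstraction layer is the more likely culprit.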

I believe that experimenting with an OpenGL renderer would be a good idea. See https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/backends/backend_template.py for the backend API that needs to be implemented, and (for example) my own https://github.com/matplotlib/mplcairo for an example of a third-party backend.

As for panning, it is implemented at https://github.com/matplotlib/matplotlib/blob/a6109530344f91acce5e68840b7718dc0e439720/lib/matplotlib/backend_bases.py#L3072 (and then just track down the methods called).

Depending on exactly what you are doing, have a look at https://github.com/holoviz/datashader/pull/939 .

If that is not quite what you need, a short-term fix would be a lightly customized Line2D class that overrides draw to do more performant / domain-specific down-sampling / caching. If you hand a line 1M points, we go through all of the points on every draw (even if you only have 200 pixels). It is plausible that you can make stronger assumptions about when it is safe to cache down-sampled data, hold onto a pyramid of down-sampled data, and then at draw time pick which level of the data to push into the base Line2D. Something like


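from matplotlib.lines import Line2D


class SmarterLine2D(Line2D):
    pyramid = None  # built lazily on the first draw

    def draw(self, renderer):
        if self.pyramid is None:
            # build_pyramid is a placeholder for your own down-sampling code
            self.pyramid = build_pyramid(self)

        # current view limits of the parent Axes
        xlims = self.axes.get_xlim()
        ylims = self.axes.get_ylim()

        # push only the level of detail needed for this view into the base class
        x, y = self.pyramid.get_resampled_data(xlims, ylims)
        self.set_data(x, y)
        super().draw(renderer)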

This leaves out all of the details of the pyramid and makes it impossible to update the actual data payload.
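To make the `build_pyramid` and `get_resampled_data` placeholders above a bit more concrete, here is one possible sketch, not a tested or official implementation. It assumes x is sorted in increasing order (as for a time series), uses a min/max envelope per bucket so that spikes survive down-sampling, and ignores `ylims`. The class name `MinMaxPyramid` and the parameters `factor`, `min_buckets` and `max_points` are invented for this example.

```python
import numpy as np


class MinMaxPyramid:
    """Hypothetical helper: min/max envelopes of (x, y) at several step sizes.

    Assumes x is sorted in increasing order, as for a typical time series.
    """

    def __init__(self, x, y, factor=4, min_buckets=256):
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        self.levels = [(x, y)]  # level 0 is the raw data; later levels are coarser
        step = factor
        while len(x) // step >= min_buckets:
            self.levels.append(self._decimate(x, y, step))
            step *= factor

    @staticmethod
    def _decimate(x, y, step):
        starts = np.arange(0, len(x), step)  # first index of each bucket
        y_min = np.minimum.reduceat(y, starts)
        y_max = np.maximum.reduceat(y, starts)
        # Two output points per bucket (min, then max), both at the bucket start.
        xs = np.repeat(x[starts], 2)
        ys = np.column_stack([y_min, y_max]).ravel()
        return xs, ys

    def get_resampled_data(self, xlims, ylims=None, max_points=4000):
        """Return the finest level whose visible slice has at most max_points."""
        lo, hi = sorted(xlims)  # sorted() also copes with inverted axes
        for x, y in self.levels:
            # Pad by one sample per side so the line reaches the view edges.
            i0 = max(np.searchsorted(x, lo, side="left") - 1, 0)
            i1 = min(np.searchsorted(x, hi, side="right") + 1, len(x))
            if i1 - i0 <= max_points:
                break
        # If no level fits, this falls through with the coarsest level.
        return x[i0:i1], y[i0:i1]


def build_pyramid(line):
    """Build a pyramid from anything with a Line2D-style get_data() method."""
    x, y = line.get_data()
    return MinMaxPyramid(x, y)


x = np.arange(1_000_000, dtype=float)
y = np.sin(x / 1000.0)
y[500_000] = 50.0  # a single-sample spike that should stay visible
pyramid = MinMaxPyramid(x, y)
full_x, full_y = pyramid.get_resampled_data((0, 1_000_000))
zoom_x, zoom_y = pyramid.get_resampled_data((100.0, 200.0))
print(len(full_x), full_y.max(), len(zoom_x))
```

Zoomed out, this returns a coarse envelope; zoomed in far enough, it returns the raw samples. Updating the underlying data would still require rebuilding the pyramid.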

There is also a (dormant) project called https://github.com/ChrisBeaumont/mpl-modest-image which does a similar thing for images, and https://github.com/astrofrog/mpl-scatter-density which handles the case of 2D histograms.

The datashader PR looks interesting. It seems like at this point it’s tailored for 2D raster plots though. Does your PR support a timeseries plot? Sorry if that is obvious, but I’m not familiar enough with the datashader API to quickly see how to generate such a plot. A small timeseries example would be handy here.

Most of the cases where I have seen people having performance issues have been with 2D data (the squared scaling means it bites you sooner).

That said, if you think of your time series not as one column with an index, but as two columns (value, time) with an arbitrary index, then with enough points “making the 2D histogram” of that table will look like a line.
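As a rough illustration of that idea in plain NumPy (this is not the datashader API, and the function name `rasterize_timeseries`, the grid size, and the synthetic signal are all assumptions made up for this sketch):

```python
import numpy as np


def rasterize_timeseries(t, values, width=400, height=200):
    """Bin (time, value) pairs into a height x width grid of counts."""
    counts, t_edges, v_edges = np.histogram2d(t, values, bins=(width, height))
    # histogram2d returns shape (width, height); transpose so that rows are
    # value bins and columns are time bins, like an image.
    return counts.T, t_edges, v_edges


rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 1_000_000)
values = np.sin(2 * np.pi * t) + 0.05 * rng.standard_normal(t.size)

grid, t_edges, v_edges = rasterize_timeseries(t, values)
print(grid.shape, int(grid.sum()))
```

The resulting grid has a fixed size no matter how many points went in, so it could be shown with something like `ax.imshow(grid, origin="lower", aspect="auto", extent=(t_edges[0], t_edges[-1], v_edges[0], v_edges[-1]))`. The non-empty cells trace out the curve, and the cost of drawing no longer depends on the number of samples.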