Eric Firing wrote:
Mike, John,
Because path simplification does not work with anything but a continuous line, it is turned off if there are any nans in the path. The result is that if one does this:
import numpy as np
from matplotlib.pyplot import plot, show

xx = np.arange(200000)
yy = np.random.rand(200000)
# plot(xx, yy)    # without the nan, this renders fine
yy[1000] = np.nan
plot(xx, yy)      # with the nan, rendering fails as described below
show()
the plot fails with an incomplete rendering and general unresponsiveness; apparently some mysterious Agg limit is quietly exceeded.
The limit in question is "cell_block_limit" in agg_rasterizer_cells_aa.h. I suspect the relationship between the number of vertices and the number of rasterization cells depends on the nature of the values.
However, if we want to increase the limit: each "cell_block" is 4096 cells of 16 bytes each, and the allocation currently maxes out at 1024 cell blocks, for a total of 67,108,864 bytes. So the question is how much memory should be devoted to rasterization when the data set is this large. I think we could safely quadruple that limit on a lot of modern machines, and raising the maximum won't affect people plotting smaller data sets, since the memory is allocated dynamically anyway. It works for me, but I have 4 GB of RAM here at work.
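For reference, the arithmetic (constant names follow agg_rasterizer_cells_aa.h; the 16-byte cell size is approximate):

# Back-of-envelope check of the rasterizer memory cap.
cells_per_block = 4096     # cells in one "cell_block"
bytes_per_cell = 16        # rough size of one rasterization cell
max_blocks = 1024          # current cell_block_limit

cap = cells_per_block * bytes_per_cell * max_blocks
print(cap)                 # 67108864 bytes == 64 MB
print(cap * 4)             # 268435456 bytes == 256 MB if quadrupled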
With or without the nan, this test case also shows the bizarre slowness of add_line that I asked about in a message yesterday and that has me completely baffled.
lsprofcalltree is my friend!
Both of these are major problems for real-world use.
Do you have any thoughts on timing and strategy for solving this problem? A few weeks ago, when the problem with nans and path simplification turned up, I tried to figure out what was going on and how to fix it, but I did not get very far. I could try again, but as you know I don't get along well with C++.
That simplification code is pretty hairy, particularly because it tries to avoid a copy by doing everything in an iterator/generator way. I think even just supporting MOVETOs there would be tricky, but it is probably the easiest first step.
I am also wondering whether more than straightforward path simplification with nan/moveto might be needed. Suppose there is a nightmarish time series with every third point being bad, so it is essentially a sequence of 2-point line segments. The simplest form of path simplification fix might be to reset the calculation whenever a moveto is encountered, but this would yield no simplification in this case. I assume Agg would still choke. Is there a need for some sort of automatic chunking of the rendering operation in addition to path simplification?
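To make the "reset at each moveto" idea concrete, here is a toy sketch in plain Python (a simple colinearity test, not the actual iterator-based code in the Agg backend), where a nan row plays the role of a moveto:

import numpy as np

def simplify_reset_at_moveto(xy, tol=1e-9):
    # Drop a vertex when it is (nearly) colinear with its neighbors.
    # A nan row acts as a moveto and flushes the current run, so no
    # simplification ever spans a gap.  With every third point bad,
    # runs never reach three points and nothing is ever dropped --
    # exactly the nightmarish worst case.
    out, run = [], []
    for p in xy:
        if np.isnan(p).any():      # moveto: reset the calculation
            out.extend(run)
            out.append(p)
            run = []
            continue
        run.append(p)
        if len(run) >= 3:
            a, b, c = run[-3], run[-2], run[-1]
            cross = ((b[0] - a[0]) * (c[1] - a[1])
                     - (b[1] - a[1]) * (c[0] - a[0]))
            if abs(cross) < tol:   # a, b, c colinear: b is redundant
                del run[-2]
    out.extend(run)
    return np.array(out)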
Chunking is probably something worth looking into (for lines, at least), as it might also reduce memory usage vs. the "increase the cell_block_limit" scenario.
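The chunking would presumably live in the renderer, but the idea can be sketched at the Python level by breaking one long line into several shorter, overlapping ones (the chunk size here is arbitrary):

import numpy as np
import matplotlib.pyplot as plt

def plot_chunked(x, y, chunk=10000, **kwargs):
    # Render one long line as several shorter ones so that no single
    # path blows the rasterizer's cell budget.  The one-point overlap
    # keeps consecutive pieces visually connected.
    ax = plt.gca()
    for i in range(0, len(x) - 1, chunk):
        ax.plot(x[i:i + chunk + 1], y[i:i + chunk + 1], **kwargs)

xx = np.arange(200000)
yy = np.random.rand(200000)
yy[1000] = np.nan
plot_chunked(xx, yy, color='b')   # fix the color so the chunks match
plt.show()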
I also think that for the special case of high-resolution time series data, where x is uniform, there is an opportunity to do something completely different that should be far faster. Audio editors (such as Audacity) draw each column of pixels based on the min/max and/or mean and/or RMS of the values within that column. This makes the rendering extremely fast and simple. See:
http://audacity.sourceforge.net/about/images/audacity-macosx.png
Of course, that would mean writing a bunch of new code, but it shouldn't be incredibly tricky new code. It could convert the time series data to an image and plot that, or to a filled polygon whose vertices are downsampled from the original data. The latter may be nicer for PS/PDF output.
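A rough illustration of the min/max-per-column idea (the 800-pixel width and the fill_between-based drawing are just one possible realization):

import numpy as np
import matplotlib.pyplot as plt

def minmax_envelope(y, width):
    # Keep only the per-pixel-column min and max of a uniformly sampled
    # series; `width` is the plot width in pixels and should be much
    # smaller than len(y).  (Use nanmin/nanmax if gaps are present.)
    n = (len(y) // width) * width        # trim to whole columns
    cols = y[:n].reshape(width, -1)      # one row of samples per column
    return cols.min(axis=1), cols.max(axis=1)

yy = np.random.randn(200000).cumsum()    # stand-in high-resolution series
lo, hi = minmax_envelope(yy, 800)
plt.fill_between(np.arange(800), lo, hi) # the downsampled filled polygon
plt.show()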
Cheers,
Mike
--
Michael Droettboom
Science Software Branch
Operations and Engineering Division
Space Telescope Science Institute
Operated by AURA for NASA