Ensuring stability of figure tests when testing against git matplotlib


I am working on the figure tests for sunpy (and if I make it work, astropy) and have run into a bit of a wall when trying to test against development matplotlib. My objective with this is to be able to get reliable figure comparison tests just using tox (pip) to pin versions, and not have to rely on docker or conda etc to get non-python deps like freetype.

I am making use of the fact that the binary wheels shipped for mpl statically link against freetype, which gives me a stable version of freetype. I am using the version reported by matplotlib.ft2font.__freetype_version__ to check that it’s the version we expect for the reference images. So far this approach seems to be working: the tests seem stable against the reference images.
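For anyone wanting to reproduce the version check, here is a minimal sketch of what I mean. The attribute lives in the matplotlib.ft2font module; the expected version string and the helper name `freetype_matches` are just placeholders I made up for illustration, so substitute whatever your reference images were generated with.

```python
import matplotlib
from matplotlib import ft2font

# Assumption: the FreeType version the reference images were generated with.
EXPECTED_FREETYPE = "2.6.1"


def freetype_matches(expected=EXPECTED_FREETYPE):
    """Return True if the FreeType linked into matplotlib equals `expected`."""
    return ft2font.__freetype_version__ == expected


# Report what is actually installed so CI logs show it on every run.
print("matplotlib", matplotlib.__version__,
      "freetype", ft2font.__freetype_version__)
```

In a test suite this could be used to skip or fail the figure tests early when the versions differ, rather than producing a wall of image mismatches.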

I also tried basically the same approach for the git version of mpl. The default behaviour seems to be that, when installing from a checkout, freetype is downloaded and built into the extension; mpl then reports the same freetype version for all subsequent builds. However, there seem to be frequent shifts in the images, which I would not expect from just movement in mpl itself (see this for an example: https://55079-2165383-gh.circle-artifacts.com/0/.tmp/py37-figure-devdeps/figure_test_images/fig_comparison.html ).

Any advice people could give me on the behaviour of building dev mpl, or other things I might need to look at ensuring aren’t shifting in my test setup would be really helpful.


Some of those failures look like freetype mismatches. Master branch should now always grab 2.6.1 unless you tell it not to, and at least the 3.2.1 linux wheels link against the same. Are you on Windows? Those wheels may bundle a different version of freetype.

Are you using constrained or tight layout in your figures?

Thanks for the response @tacaswell

We are running these on Linux-based CI only. I am not sure, but some of them might be using tight layout; I would be surprised if they all are.

Further investigation on our side has shown we might have had a couple of things wonky with the hashes / comparison figures so we are going to see how it goes in the next little while.

Are you getting things from conda sometimes and from pypi others? The conda-forge packages do not pin freetype back to our testing version.

If possible, I suggest writing tests that assert the contents of the text / data in the returned artists rather than the images. You may also be interested in the work our GSOC student is doing with @anntzer.lee to use the previous commit to generate the baseline for the next commit ( https://github.com/matplotlib/matplotlib/pull/17557 is I think where the work is being centered). The idea is you could check out the master branch, generate a set of test images, and then any local hacking could compare against the locally generated baseline so changes in the underlying libraries (in mpl’s case freetype, in your case mpl + freetype) could be elided.
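To make the first suggestion concrete, here is a rough sketch of a test that checks the artists instead of the rendered pixels. `plot_signal` is a hypothetical stand-in for whatever plotting function you are testing; the labels and data are made up.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed on CI
import matplotlib.pyplot as plt
import numpy as np


def plot_signal(ax, x, y):
    """Hypothetical plotting helper standing in for the function under test."""
    (line,) = ax.plot(x, y, label="signal")
    ax.set_xlabel("time [s]")
    ax.set_title("Example")
    return line


def test_plot_signal_contents():
    x = np.linspace(0, 1, 5)
    y = x ** 2
    fig, ax = plt.subplots()
    line = plot_signal(ax, x, y)
    # Assert on the data and text held by the artists, not on the image.
    np.testing.assert_allclose(line.get_xdata(), x)
    np.testing.assert_allclose(line.get_ydata(), y)
    assert line.get_label() == "signal"
    assert ax.get_xlabel() == "time [s]"
    assert ax.get_title() == "Example"
    plt.close(fig)
```

A test like this does not depend on freetype or on how text is rasterized, so it should not shift when the underlying libraries do. It will not catch purely visual regressions, so it complements image comparison rather than replacing it.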

It seems to have settled down the last few builds, but we are still watching it. It seems that the figure comparisons I shared above were inaccurate and using the wrong reference images. We did have to eliminate an issue with our hashes of the png files by configuring it to not save the mpl version to the metadata.
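For reference, the metadata fix looks roughly like the sketch below. It assumes a matplotlib version where passing None for the "Software" key drops that PNG chunk, which is how I understand the savefig metadata option to behave in recent releases. The helper name `figure_hash` is just for illustration.

```python
import hashlib
import io

import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt


def figure_hash(fig):
    """Return the SHA-256 hex digest of the figure rendered as PNG.

    The "Software" metadata entry (which normally embeds the matplotlib
    version) is removed so the hash does not change on every mpl release.
    """
    buf = io.BytesIO()
    fig.savefig(buf, format="png", metadata={"Software": None})
    return hashlib.sha256(buf.getvalue()).hexdigest()


fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])
digest = figure_hash(fig)
plt.close(fig)
```

The hash still depends on everything that affects the pixels (mpl, freetype, fonts, dpi), so this only removes one source of spurious changes.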

I learned about you having a GSOC project on figure tests a couple of days ago. I would be really interested in seeing if we could make any of that work useful to packages using mpl as well as mpl itself. SunPy currently has a very weird hacked together figure test solution where we hash png files and compare the generated pngs to the hashes we commit into the repo. We mainly ended up doing this to prevent us from having to check in the reference figures, or having to download them when running the tests. What I was planning on doing was making a contribution to pytest-mpl which allows it to optionally compare against hashes over doing image comparison. This would enable better workflows where an external library of reference figures could be updated automatically with CI on the target branch for a PR, while the CI checks on the PR itself would pass as the new hashes would be committed along with the changes. This would allow for the generation of a comparison page like I linked you above so reviewers of the PR could see the changes introduced by the PR.
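As a rough illustration of the hash comparison I have in mind (this is not pytest-mpl's API, just a sketch with made-up names), the committed "hash library" could be a JSON mapping of test name to SHA-256 digest:

```python
import hashlib
import json


def check_against_library(name, png_bytes, library):
    """Compare the SHA-256 of `png_bytes` with the stored hash for `name`.

    Returns a tuple (matches, actual_hash) so a failing test can report
    the new hash to commit alongside the change.
    """
    actual = hashlib.sha256(png_bytes).hexdigest()
    return library.get(name) == actual, actual


# Hypothetical example: in practice the bytes would come from savefig and
# the library would be loaded from a JSON file committed to the repo.
fake_png = b"not really a png, just example bytes"
library = json.loads(
    json.dumps({"test_example": hashlib.sha256(fake_png).hexdigest()})
)

ok, actual = check_against_library("test_example", fake_png, library)
changed, _ = check_against_library("test_example", fake_png + b"!", library)
```

Only the small JSON file lives in the repository; the reference images themselves can be kept elsewhere and regenerated by CI.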

I am interested in the idea of comparison to previous commits, if you have any other documentation on that I could read other than the PR that would be great, before I bombard you with questions :laughing:. My main questions would be around what you see a PR workflow looking like with that system.

Thanks so much for all the help!

The best place to ask these questions is the gitter room we are using:

@SidharthBansal is doing the work (mentored by @anntzer.lee ).