Sample data: a proposal

Andrew Straw <strawman@...36...> writes:

3. To make this work, agree that sample data files are immutable: if a
    new version is needed, it needs to have a new name (and thus the
    examples using it need to be updated). The files have not been
    changed a lot [2], so I don't think this is very much of a burden.

I don't like #3 -- for the same reasons as we want to separate the rest
of the sample data (smaller download, smaller repository, and separation
of code and non-essential data), I think the test comparison images
should be with the sample data. Having to deal with renames in the tests
would be annoying.

If the test data is moved there, I agree that renaming won't work.

But it seems to me that test data is different from sample data used by
examples: when running the tests for a given revision of matplotlib, you
don't want the absolute latest comparison images but the images that
correspond to that particular code revision. You also typically want to
get all of the comparison images for that revision at the same time,
since you're likely to be running the whole test suite. Also, if you are
running the test suite, I think we can assume you can get a checkout of
the test-data repository.

(A git submodule would seem to be a good fit: the main repository would
have a pointer to the appropriate revision of the test-images
repository, and people interested in running the test suite would have
to run "git submodule update" to check it out.)

Two alternative ideas to handle the versioning
issue: A) Add a .py file in the main source repository which is a list of
sample data filenames and checksums. If a sample data file doesn't
exist, or its checksum is wrong, it can be downloaded.

Sounds complicated, and makes older versions unable to run newer
examples.
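
For concreteness, the check in idea A might look roughly like the
sketch below. The manifest name SAMPLE_DATA_CHECKSUMS, the helper
needs_download, and the choice of SHA-256 are all assumptions made for
illustration; nothing like this exists in the repository yet.

```python
import hashlib
from pathlib import Path

# Hypothetical manifest kept in the main source repository:
# sample data filename -> expected SHA-256 hex digest.
SAMPLE_DATA_CHECKSUMS = {
    "foo.dat": hashlib.sha256(b"example contents\n").hexdigest(),
}


def needs_download(filename, data_dir, manifest=SAMPLE_DATA_CHECKSUMS):
    """Return True if the local copy is missing or its checksum is wrong."""
    path = Path(data_dir) / filename
    if not path.is_file():
        return True
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return digest != manifest[filename]
```

A file that is not listed in the manifest raises KeyError here, which
is the situation an older checkout would be in when it meets a newer
example.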

B) The source file could simply state the data version number it
requires, and the sample data itself could be versioned.

That might work. If I understand this correctly, the example code would
call get_sample_data("foo.dat") to get the latest revision or
get_sample_data("foo.dat", 1234) to get a specific one. These would
retrieve URLs like

http://example.com/sample-data/raw/master/foo.dat
http://example.com/sample-data/raw/1234/foo.dat
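
A minimal sketch of that URL mapping, assuming the layout above. The
helper name sample_data_url and the BASE_URL constant are made up for
illustration, and the actual download step is left out.

```python
# Placeholder base URL taken from the example URLs above.
BASE_URL = "http://example.com/sample-data/raw"


def sample_data_url(filename, revision=None):
    """Build the URL for a sample data file.

    With no revision, point at the latest version ("master");
    otherwise point at the given revision of the data repository.
    """
    rev = "master" if revision is None else str(revision)
    return f"{BASE_URL}/{rev}/{filename}"


print(sample_data_url("foo.dat"))
print(sample_data_url("foo.dat", 1234))
```

get_sample_data would then fetch whatever URL this returns, so an
example pinned to revision 1234 keeps getting the same file even after
foo.dat changes on master.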

···

--
Jouni K. Seppänen