Sample data: a proposal

A while ago there was a discussion [1] about how using the
get_sample_data function in building the documentation is a problem for
Debian packagers. Let me see if I understand the goals of
get_sample_data correctly:

* we want to enable users to run examples they find in the gallery
  without downloading extra files;

* we don't want to package all the sample data with matplotlib, either
  because it is too large, or because it changes more often than we
  release new versions.

The current sample data takes about 2.5 megabytes uncompressed, so the
size doesn't look like a real problem, but of course it is desirable
that new examples are usable with old versions unless they need new
features.

The problem that the Debian packagers have with the current system is
(I suppose) that building the documentation requires network access and
is not guaranteed to be repeatable.

Here's what I suggest:

1. Package the sample data in a separate zip file that users can
   download and expand in e.g. ~/.matplotlib/sample_data if they like.
   This file could be released more often than matplotlib, if needed.
   Debian can use this as one source file and package it as a separate
   deb file.

2. Make get_sample_data look first in the place where the zip file could
   have been expanded, and only if the required file is not found, try
   to obtain it from the web. Add an option to disable the network
   access. This is different from what we do now, because now
   get_sample_data always tries to check if there is a newer version
   available, which apparently doesn't work reliably on unconnected
   computers.

3. To make this work, agree that sample data files are immutable: if a
   new version is needed, it needs to have a new name (and thus the
   examples using it need to be updated). The files have not been
   changed a lot [2], so I don't think this is very much of a burden.
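
Points 2 and 3 could be sketched roughly as follows. This is only an
illustration, not the current get_sample_data: the sample_dir,
allow_network and fetch parameters, the download helper and the BASE_URL
placeholder are all assumptions made up for the example.

```python
import os
import urllib.request

# Hypothetical locations: neither the directory nor the URL is settled.
SAMPLE_DIR = os.path.expanduser("~/.matplotlib/sample_data")
BASE_URL = "http://example.invalid/sample_data/"  # placeholder only


def download(fname):
    """Fetch one sample file from the web and return its bytes."""
    with urllib.request.urlopen(BASE_URL + fname) as response:
        return response.read()


def get_sample_data(fname, sample_dir=SAMPLE_DIR, allow_network=True,
                    fetch=download):
    """Return a local path to fname, looking on disk before the web."""
    path = os.path.join(sample_dir, fname)
    if os.path.exists(path):
        # Files are assumed immutable (point 3), so a local copy is
        # always current and no "is there a newer one" check is needed.
        return path
    if not allow_network:
        raise IOError("%s not found in %s and network access is disabled"
                      % (fname, sample_dir))
    data = fetch(fname)
    # Cache the download where the zip file would have been expanded.
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)
    return path
```

A packager would expand the zip file into sample_dir and call the
function with allow_network=False, so that a missing file is an error
instead of a network request.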

What do you think?

Jouni

[1] http://thread.gmane.org/gmane.comp.python.matplotlib.devel/8865
[2] Here is a summary of the changes to each file in sample_data:

=== ./aapl.csv ===

···

------------------------------------------------------------------------
r7379 | jdh2358 | 2009-08-05 18:57:31 +0300 (Wed, 05 Aug 2009)
------------------------------------------------------------------------
r6202 | jdh2358 | 2008-10-15 15:43:41 +0300 (Wed, 15 Oct 2008)
------------------------------------------------------------------------
r4975 | jdh2358 | 2008-02-16 22:58:37 +0200 (Sat, 16 Feb 2008)
------------------------------------------------------------------------
=== ./AAPL.dat ===
------------------------------------------------------------------------
r7388 | jdh2358 | 2009-08-05 20:16:50 +0300 (Wed, 05 Aug 2009)
------------------------------------------------------------------------
=== ./aapl.npy ===
------------------------------------------------------------------------
r7377 | jdh2358 | 2009-08-05 18:52:29 +0300 (Wed, 05 Aug 2009)
------------------------------------------------------------------------
r6203 | jdh2358 | 2008-10-15 18:39:44 +0300 (Wed, 15 Oct 2008)
------------------------------------------------------------------------
=== ./axes_grid/bivariate_normal.npy ===
------------------------------------------------------------------------
r7436 | leejjoon | 2009-08-09 07:34:08 +0300 (Sun, 09 Aug 2009)
------------------------------------------------------------------------
=== ./ct.raw ===
------------------------------------------------------------------------
r7382 | jdh2358 | 2009-08-05 19:21:23 +0300 (Wed, 05 Aug 2009)
------------------------------------------------------------------------
r177 | jdh2358 | 2004-03-13 01:00:12 +0200 (Sat, 13 Mar 2004)
------------------------------------------------------------------------
=== ./data_x_x2_x3.csv ===
------------------------------------------------------------------------
r7382 | jdh2358 | 2009-08-05 19:21:23 +0300 (Wed, 05 Aug 2009)
------------------------------------------------------------------------
r7078 | efiring | 2009-05-03 03:09:06 +0300 (Sun, 03 May 2009)
------------------------------------------------------------------------
=== ./demodata.csv ===
------------------------------------------------------------------------
r7382 | jdh2358 | 2009-08-05 19:21:23 +0300 (Wed, 05 Aug 2009)
------------------------------------------------------------------------
r5100 | jdh2358 | 2008-04-30 22:53:10 +0300 (Wed, 30 Apr 2008)
------------------------------------------------------------------------
=== ./eeg.dat ===
------------------------------------------------------------------------
r7382 | jdh2358 | 2009-08-05 19:21:23 +0300 (Wed, 05 Aug 2009)
------------------------------------------------------------------------
r52 | jdh2358 | 2003-11-02 23:23:21 +0200 (Sun, 02 Nov 2003)
------------------------------------------------------------------------
=== ./embedding_in_wx3.xrc ===
------------------------------------------------------------------------
r7382 | jdh2358 | 2009-08-05 19:21:23 +0300 (Wed, 05 Aug 2009)
------------------------------------------------------------------------
r397 | astraw | 2004-07-10 21:39:48 +0300 (Sat, 10 Jul 2004)
------------------------------------------------------------------------
=== ./goog.npy ===
------------------------------------------------------------------------
r7377 | jdh2358 | 2009-08-05 18:52:29 +0300 (Wed, 05 Aug 2009)
------------------------------------------------------------------------
r6203 | jdh2358 | 2008-10-15 18:39:44 +0300 (Wed, 15 Oct 2008)
------------------------------------------------------------------------
=== ./INTC.dat ===
------------------------------------------------------------------------
r7387 | jdh2358 | 2009-08-05 20:16:00 +0300 (Wed, 05 Aug 2009)
------------------------------------------------------------------------
=== ./lena.jpg ===
------------------------------------------------------------------------
r7382 | jdh2358 | 2009-08-05 19:21:23 +0300 (Wed, 05 Aug 2009)
------------------------------------------------------------------------
r2557 | astraw | 2006-07-12 02:32:31 +0300 (Wed, 12 Jul 2006)
------------------------------------------------------------------------
r2556 | astraw | 2006-07-12 02:28:46 +0300 (Wed, 12 Jul 2006)
------------------------------------------------------------------------
r603 | astraw | 2004-10-19 20:50:03 +0300 (Tue, 19 Oct 2004)
------------------------------------------------------------------------
=== ./lena.png ===
------------------------------------------------------------------------
r7364 | jdh2358 | 2009-08-05 17:36:27 +0300 (Wed, 05 Aug 2009)
------------------------------------------------------------------------
r7327 | jdh2358 | 2009-07-31 21:55:17 +0300 (Fri, 31 Jul 2009)
------------------------------------------------------------------------
=== ./logo2.png ===
------------------------------------------------------------------------
r7382 | jdh2358 | 2009-08-05 19:21:23 +0300 (Wed, 05 Aug 2009)
------------------------------------------------------------------------
r5669 | jdh2358 | 2008-06-24 21:58:41 +0300 (Tue, 24 Jun 2008)
------------------------------------------------------------------------
=== ./membrane.dat ===
------------------------------------------------------------------------
r7382 | jdh2358 | 2009-08-05 19:21:23 +0300 (Wed, 05 Aug 2009)
------------------------------------------------------------------------
r64 | jdh2358 | 2003-11-15 19:05:37 +0200 (Sat, 15 Nov 2003)
------------------------------------------------------------------------
=== ./Minduka_Present_Blue_Pack.png ===
------------------------------------------------------------------------
r7421 | leejjoon | 2009-08-08 04:40:31 +0300 (Sat, 08 Aug 2009)
------------------------------------------------------------------------
=== ./msft.csv ===
------------------------------------------------------------------------
r7382 | jdh2358 | 2009-08-05 19:21:23 +0300 (Wed, 05 Aug 2009)
------------------------------------------------------------------------
r2144 | jdh2358 | 2006-03-14 03:28:43 +0200 (Tue, 14 Mar 2006)
------------------------------------------------------------------------
r86 | jdh2358 | 2003-11-21 19:50:00 +0200 (Fri, 21 Nov 2003)
------------------------------------------------------------------------
=== ./msft_nasdaq.npy ===
------------------------------------------------------------------------
r7377 | jdh2358 | 2009-08-05 18:52:29 +0300 (Wed, 05 Aug 2009)
------------------------------------------------------------------------
r6203 | jdh2358 | 2008-10-15 18:39:44 +0300 (Wed, 15 Oct 2008)
------------------------------------------------------------------------
=== ./s1045.ima ===
------------------------------------------------------------------------
r7382 | jdh2358 | 2009-08-05 19:21:23 +0300 (Wed, 05 Aug 2009)
------------------------------------------------------------------------
r48 | jdh2358 | 2003-11-02 21:43:30 +0200 (Sun, 02 Nov 2003)
------------------------------------------------------------------------
=== ./testdata.csv ===
------------------------------------------------------------------------
r7364 | jdh2358 | 2009-08-05 17:36:27 +0300 (Wed, 05 Aug 2009)
------------------------------------------------------------------------
r7361 | jdh2358 | 2009-08-05 14:39:37 +0300 (Wed, 05 Aug 2009)
------------------------------------------------------------------------
r7360 | jdh2358 | 2009-08-05 14:34:43 +0300 (Wed, 05 Aug 2009)
------------------------------------------------------------------------
=== ./testdir/subdir/testsub.csv ===
------------------------------------------------------------------------
r7368 | jdh2358 | 2009-08-05 17:54:01 +0300 (Wed, 05 Aug 2009)
------------------------------------------------------------------------

--
Jouni K. Seppänen
http://www.iki.fi/jks

A while ago there was a discussion [1] about how using the
get_sample_data function in building the documentation is a problem for
Debian packagers. Let me see if I understand the goals of
get_sample_data correctly:

* we want to enable users to run examples they find in the gallery
   without downloading extra files;

* we don't want to package all the sample data with matplotlib, either
   because it is too large, or because it changes more often than we
   release new versions.
   
* Also, we want to keep the sample data out of the version control
   repository that holds MPL proper, so that downloading the MPL source
   code does not also pull in the sample data. (This is one of the
   sticking points for a move to git.)

Here's what I suggest:

1. Package the sample data in a separate zip file that users can
    download and expand in e.g. ~/.matplotlib/sample_data if they like.
    This file could be released more often than matplotlib, if needed.
    Debian can use this as one source file and package it as a separate
    deb file.

2. Make get_sample_data look first in the place where the zip file could
    have been expanded, and only if the required file is not found, try
    to obtain it from the web. Add an option to disable the network
    access. This is different from what we do now, because now
    get_sample_data always tries to check if there is a newer version
    available, which apparently doesn't work reliably on unconnected
    computers.

3. To make this work, agree that sample data files are immutable: if a
    new version is needed, it needs to have a new name (and thus the
    examples using it need to be updated). The files have not been
    changed a lot [2], so I don't think this is very much of a burden.

What do you think?

#1 and #2 seem reasonable to me.

I don't like #3 -- for the same reasons as we want to separate the rest
of the sample data (smaller download, smaller repository, and separation
of code and non-essential data), I think the test comparison images
should be with the sample data. Having to deal with renames in the tests
would be annoying. Two alternative ideas to handle the versioning issue:

A) Add a .py file in the main source repository which is a list of
   sample data filenames and checksums. If a sample data file doesn't
   exist, or its checksum is wrong, it can be downloaded.

B) The source could simply state the sample data version number it
   requires, and the sample data itself could be versioned.
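
Idea A might look something like the sketch below. The MANIFEST
dictionary, the needs_download helper and the choice of SHA-256 are all
assumptions for illustration; the real list would live in a .py file in
the main repository and could use any checksum.

```python
import hashlib
import os

# Hypothetical manifest: file name -> expected checksum. The single
# entry here is a stand-in, not a real sample data file.
MANIFEST = {
    "hello.txt": hashlib.sha256(b"hello\n").hexdigest(),
}


def file_checksum(path):
    """Return the SHA-256 hex digest of the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large data files need not fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(fname, sample_dir, manifest=MANIFEST):
    """True if fname is missing locally or its checksum is wrong."""
    path = os.path.join(sample_dir, fname)
    if not os.path.exists(path):
        return True
    return file_checksum(path) != manifest[fname]
```

With this, changing a data file only means updating its checksum in the
manifest; the file name and the tests that use it stay the same.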

···

On 09/12/2010 07:10 AM, Jouni K. Seppänen wrote:

I agree with Andrew here -- we don't want to hamstring our ability to
change the data just because some people would rather take a version
in place of the latest version. If we have an rc option

  sampledata.fetch : False

then the sampledata function would only look in the sample data dir,
get the file if available, raise otherwise. If fetch is True, it
would always go to the web first and check for the latest, get it and
cache it. Then the packagers could download the tarball, unpack it,
and not worry about mpl trying to check for a more recent version.
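
In code, the two modes might look roughly like this. It is a sketch
only: the function signature and the download_latest callable are
assumptions, and it simplifies "check for the latest" to downloading
unconditionally whenever fetch is True.

```python
import os


def sampledata(fname, sample_dir, fetch, download_latest):
    """Return a local path to fname, honouring the fetch setting.

    fetch stands in for the proposed sampledata.fetch rc option, and
    download_latest is a hypothetical callable returning the newest
    bytes for fname.
    """
    path = os.path.join(sample_dir, fname)
    if fetch:
        # Web first: get the latest copy and cache it locally.
        data = download_latest(fname)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)
        return path
    # fetch is False: look only in the sample data dir, never the web.
    if os.path.exists(path):
        return path
    raise IOError("%s not found in %s and sampledata.fetch is False"
                  % (fname, sample_dir))
```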

JDH

···

On Sun, Sep 12, 2010 at 10:30 AM, Andrew Straw <strawman@...36...> wrote:

#1 and #2 seem reasonable to me.

I don't like #3 -- for the same reasons as we want to separate the rest