default data sets for matplotlib?

In R, there are many default data sets one can use to both illustrate code and explore the scripting language. Instead of having to fake data, one can pull from meaningful data sets, created in the real world. For example, this one-liner actually produces a plot:

plot(mtcars$hp~mtcars$mpg)

where mtcars refers to a built-in data set taken from Motor Trend Magazine. I don’t believe matplotlib has anything similar. I have started to download some of the R data sets and store them as pickles for my own use. Does anyone else have any interest in creating a repository for these data sets or otherwise sharing them in some way?

Paul

Vincent converted several R datasets back to CSV files that can be easily
loaded from the web with, for example, pandas.
http://vincentarelbundock.github.com/Rdatasets/
The collection is a bit random.
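
As a rough illustration of the pandas route, here is a minimal sketch of the Python counterpart to the R one-liner. It assumes pandas and matplotlib are installed. The few mtcars rows are inlined so the sketch runs offline, and the column name `model` for the car names is an assumption; in practice the argument to `read_csv` would be the path or URL of a CSV file from the collection.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# A few rows of mtcars, inlined so the sketch runs offline.
# The "model" column name is an assumption made for this example.
MTCARS_EXCERPT = io.StringIO(
    "model,mpg,hp\n"
    "Mazda RX4,21.0,110\n"
    "Datsun 710,22.8,93\n"
    "Hornet Sportabout,18.7,175\n"
    "Valiant,18.1,105\n"
)

cars = pd.read_csv(MTCARS_EXCERPT)

# Rough equivalent of R's plot(mtcars$hp ~ mtcars$mpg): hp on y, mpg on x.
fig, ax = plt.subplots()
ax.scatter(cars["mpg"], cars["hp"])
ax.set_xlabel("mpg")
ax.set_ylabel("hp")
```

In an interactive session, dropping the `matplotlib.use("Agg")` line and calling `plt.show()` would display the figure.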

statsmodels has some datasets that we use for examples and tests
http://statsmodels.sourceforge.net/devel/datasets/index.html
We were always a bit slow with adding datasets because we were too
cautious about licensing issues. But R seems to get away with
considering most datasets to be public domain.
We keep adding datasets to statsmodels as we need them for new models.

The machine learning packages like sklearn have packaged the typical
machine learning datasets.

If you are interested, you could join up with statsmodels or with
Vincent to expand on what's available.

Josef

···

On Wed, Sep 26, 2012 at 12:05 AM, Paul Tremblay <paulhtremblay@...287...> wrote:

_______________________________________________
Matplotlib-users mailing list
Matplotlib-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/matplotlib-users

It seems to me like contributing to (rather than duplicating) the work of one of these projects would be a great idea. It would also be nice to add functionality in matplotlib to make it easier to download these things as a one-off -- obviously not exactly the same syntax as with R, but ideally with a single function call.
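
A single-call download helper of the kind described here might look something like the sketch below. This is not an existing matplotlib function: the name `fetch_dataset`, the `cache_dir` parameter, and the hash-based file naming are all hypothetical choices for illustration. It uses only the standard library and keeps a local copy, so a URL is fetched at most once.

```python
import hashlib
import urllib.parse
import urllib.request
from pathlib import Path


def fetch_dataset(url, cache_dir="dataset_cache"):
    """Return a local path for `url`, downloading it only on the first call.

    Hypothetical helper for illustration; not part of matplotlib.
    """
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)

    # Name the cached file after a hash of the URL so different URLs do not
    # collide, and keep the original extension (e.g. ".csv").
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    suffix = Path(urllib.parse.urlparse(url).path).suffix
    target = cache / (digest + suffix)

    if not target.exists():
        with urllib.request.urlopen(url) as response:
            target.write_bytes(response.read())
    return target
```

The returned path can then be handed to whatever reader is convenient, for example `pandas.read_csv(fetch_dataset(url))`. Caching locally also softens the moving-URL problem: once a file has been fetched, later calls no longer depend on the server.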

Mike

···

On 09/26/2012 12:28 AM, josef.pktd@...287... wrote:

We did have such a thing. matplotlib.cbook.get_sample_data(). I think we got rid of it for 1.2.0?

Ben

···

On Wed, Sep 26, 2012 at 9:10 AM, Michael Droettboom <mdroe@…86…> wrote:

I don't know the details, but it looks like in pandas they spent some
time on Python 3 compatibility, in case that was a problem

https://github.com/pydata/pandas/pull/970

Josef

···

On Wed, Sep 26, 2012 at 9:33 AM, Benjamin Root <ben.root@...1304...> wrote:

It was removed because the server side was a moving target and would
constantly break. It was based on pulling files out of the svn (and
later git) repository, and sourceforge and github have had a habit
of changing the urls used to do so. All of the data that was there
was moved into the main repository and is now installed alongside
matplotlib, so get_sample_data() still works. See this PR:

I should have mentioned it earlier, that we do have a very small set
of standard data sets included there – but these other projects
linked to above are much better and more extensive. If we can rely
on them to have static urls over time, I think they are much better
options than anything matplotlib has had in the past.
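
For anyone who wants to see what is bundled, the sketch below lists the installed sample files and resolves one of them through `matplotlib.cbook.get_sample_data()`. It assumes a matplotlib version recent enough to provide `matplotlib.get_data_path()`, and that the sample data sits in a `sample_data` directory under that path; the helper name `list_sample_files` is made up for this example. Which files are present varies between matplotlib versions, so none are named here.

```python
import os

import matplotlib
from matplotlib import cbook


def list_sample_files():
    """Return the names of the sample data files installed with matplotlib."""
    sample_dir = os.path.join(matplotlib.get_data_path(), "sample_data")
    if not os.path.isdir(sample_dir):  # some distributions strip the sample data
        return []
    return sorted(os.listdir(sample_dir))


names = list_sample_files()
print(names)

# Resolve one bundled entry by name. With asfileobj=False the function
# returns the location on disk instead of an open file object.
if names:
    print(cbook.get_sample_data(names[0], asfileobj=False))
```
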
Mike

https://github.com/matplotlib/matplotlib/pull/498

Drawing on the other posts, would it be conceivable to download both the
R data sets and the statsmodels data sets and include them in
site-packages/matplotlib/mpl-data/sample_data/? I understand that
pulling data sets from outside this directory creates problems because
of moving URLs, but why even try a web pull when the data can exist in
a reliable place?

I suppose one might raise reasonable objections to my suggestion. At
any rate, it doesn't seem I can add anything else to either project,
since they both seem complete. I see only one small but significant
problem with the R data sets: they leave out the header of the first
column because of the structure of R data frames. Python needs this
header.
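
One way to patch the missing header before loading a file is sketched below, using only the standard library. The function name and the default column name `rownames` are hypothetical. It covers two layouts: a header row that is one field shorter than the data rows, and a header whose first field is present but empty.

```python
import csv
import io


def add_missing_first_header(text, name="rownames"):
    """Give the first column of CSV `text` a header if it lacks one.

    Hypothetical helper for illustration; `name` is an arbitrary label.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return text

    header = rows[0]
    if len(rows) > 1 and len(header) == len(rows[1]) - 1:
        header = [name] + header        # header is one field short
    elif header and header[0] == "":
        header = [name] + header[1:]    # first header field is empty
    rows[0] = header

    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


print(add_missing_first_header("mpg,hp\nMazda RX4,21.0,110\n"))
```

Files whose header already matches the data rows pass through unchanged.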
Paul

···

On 9/26/12 10:15 AM, Michael Droettboom
wrote:
