On 09/26/2012 12:28 AM, josef.pktd@…287… wrote:
> On Wed, Sep 26, 2012 at 12:05 AM, Paul Tremblay <paulhtremblay@…287…> wrote:
>> In R, there are many default data sets one can use to both illustrate code
>> and explore the scripting language. Instead of having to fake data, one can
>> pull from meaningful data sets, created in the real world. For example, this
>> one-liner actually produces a plot:
>>
>> plot(mtcars$hp~mtcars$mpg)
>>
>> where mtcars refers to a built-in data set taken from Motor Trend Magazine.
>> I don't believe matplotlib has anything similar. I have started to download
>> some of the R data sets and store them as pickles for my own use. Does
>> anyone else have any interest in creating a repository for these data sets
>> or otherwise sharing them in some way?
> Vincent converted several R datasets back to CSV, which can be easily
> loaded from the web with, for example, pandas.
> http://vincentarelbundock.github.com/Rdatasets/
> The collection is a bit random.
>
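For anyone wanting to try this, here is a rough sketch of loading one of the converted CSV files with pandas and reproducing the R one-liner with matplotlib. The base URL and the `csv/<package>/<name>.csv` layout are my assumption about how the Rdatasets site is organized (they are not given in this thread), and the helper names `rdataset_url`, `load_rdataset` and `plot_hp_vs_mpg` are made up for illustration.

```python
# Assumption: the Rdatasets CSV files live under this base URL and layout.
# Neither is stated in the thread; adjust if the site is organized differently.
BASE_URL = "https://vincentarelbundock.github.io/Rdatasets/csv"


def rdataset_url(item, package="datasets"):
    """Build the (assumed) CSV URL for one dataset of an R package."""
    return f"{BASE_URL}/{package}/{item}.csv"


def load_rdataset(item, package="datasets"):
    """Download one dataset into a pandas DataFrame (needs network access)."""
    import pandas as pd  # imported lazily so the URL helper works without pandas

    return pd.read_csv(rdataset_url(item, package))


def plot_hp_vs_mpg(df):
    """Rough equivalent of R's plot(mtcars$hp ~ mtcars$mpg): hp on y, mpg on x."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(df["mpg"], df["hp"])
    ax.set_xlabel("mpg")
    ax.set_ylabel("hp")
    return fig
```

With network access, `plot_hp_vs_mpg(load_rdataset("mtcars"))` should give roughly the same scatter plot as the R command, provided the CSV really has `mpg` and `hp` columns.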
> statsmodels has some datasets that we use for examples and tests
> http://statsmodels.sourceforge.net/devel/datasets/index.html
> We were always a bit slow with adding datasets because we were too
> cautious about licensing issues. But R seems to get away with
> considering most datasets to be public domain.
> We keep adding datasets to statsmodels as we need them for new models.
>
> The machine learning packages like sklearn have packaged the typical
> machine learning datasets.
>
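As a small illustration of the packaged datasets mentioned above, the sketch below pulls one example from each library if it happens to be installed. The choice of `longley` and `iris` is arbitrary on my part, the function name `load_packaged_examples` is hypothetical, and `as_frame=True` assumes a reasonably recent scikit-learn (0.23 or later).

```python
def load_packaged_examples():
    """Return {name: DataFrame} for whichever of the two libraries is installed."""
    frames = {}

    try:
        import statsmodels.api as sm

        # Longley macroeconomic data, shipped with statsmodels.
        frames["statsmodels/longley"] = sm.datasets.longley.load_pandas().data
    except ImportError:
        pass  # statsmodels not installed; skip it

    try:
        from sklearn.datasets import load_iris

        # as_frame=True asks scikit-learn to return pandas objects.
        frames["sklearn/iris"] = load_iris(as_frame=True).frame
    except ImportError:
        pass  # scikit-learn (or pandas) not installed; skip it

    return frames
```

Both datasets ship with the packages themselves, so this needs no network access.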
> If you are interested, you could join up with statsmodels or with
> Vincent to expand on what's available.
>