funky strings in rec2csv

As a followup to the work on floats, I fixed rec2csv to deal with funky strings (strings with commas and quotes). I've checked this is as r4749, but I thought I'd announce here in case someone (John, in particular) was depending on some peculiar aspect of the old implementation. If there is something you depend on, can you update the unittests in unit/mlab_unit.py to check for the behavior you need?

Also, John, I didn't see mlab.FormatString being used anywhere, but I didn't want to change it, either. So I made FormatString2 and used it. But if there's no known use of FormatString, lets kill the original and more FormatString2 into its place.

-Andrew

I am happy to leave it as is, and fix it if I bump into anything.
There was a use case at work that made me quote all the strings with "
but I am not sure what it is right now so I will test with your
version tomorrow and see if I can find any problems.

There is one more thing I'd like to clean up in this code, and that is
the handling of date and datetime. The current implementation will
return datetime regardless of whether the string looks like a date or
a datetime. Thus it would fail a roundtrip tests if you had a date
field. I may make some changes tomorrow to support date or datetime.

What do you think of making checkrows=0 the default for csv2rec? It
is slower, but mostly guaranteed to do the right thing, whereas
checkrows=5 which is the current default can often fail if the first 5
rows match some type but later rows do not, and may lead to confusion
among new users.

JDH

···

On Dec 16, 2007 5:26 PM, Andrew Straw <strawman@...36...> wrote:

As a followup to the work on floats, I fixed rec2csv to deal with funky
strings (strings with commas and quotes). I've checked this is as r4749,
but I thought I'd announce here in case someone (John, in particular)
was depending on some peculiar aspect of the old implementation. If
there is something you depend on, can you update the unittests in
unit/mlab_unit.py to check for the behavior you need?

Also, John, I didn't see mlab.FormatString being used anywhere, but I
didn't want to change it, either. So I made FormatString2 and used it.
But if there's no known use of FormatString, lets kill the original and
more FormatString2 into its place.

OK, I found one problem with repr. When adding support for date vs
datetime to the csv2rec roundtrip, and adding this to the unit test, I
notice that repr is probably not what we want for datetime; str makes
more sense here:

In [1]: import datetime

In [2]: d = datetime.date.today()

In [3]: repr(d)
Out[3]: 'datetime.date(2007, 12, 16)'

In [4]: str(d)
Out[4]: '2007-12-16'

The latter is more natural in the CSV file, and the repr version is
not supported by dateutil, at least not the one we are shipping:

In [5]: import dateutil.parser

In [6]: dateutil.parser.parse(repr(d))

···

On Dec 16, 2007 5:26 PM, Andrew Straw <strawman@...36...> wrote:

As a followup to the work on floats, I fixed rec2csv to deal with funky
strings (strings with commas and quotes). I've checked this is as r4749,
but I thought I'd announce here in case someone (John, in particular)
was depending on some peculiar aspect of the old implementation. If
there is something you depend on, can you update the unittests in
unit/mlab_unit.py to check for the behavior you need?

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)

/Users/jdhunter/python/svn/matplotlib/examples/<ipython console> in <module>()

...snip the rest of the traceback...

In [7]: dateutil.parser.parse(str(d))
Out[7]: datetime.datetime(2007, 12, 16, 0, 0)

So I changed FormatObject to use str, pending further discussion. At
least for my common use cases, the only obj types I have in my record
arrays are dates and datetimes, and I find this to be a pretty
compelling use case since it is the type least likely to be supported
by other persistence methods (tostring and pickle both fail or do not
behave as expected with datetimes in the recarray).

But there is an oddity in the parsing of milliseconds which is causing
the updated unit test to fail; the code below illustrates the problem:

In [3]: import dateutil.parser

In [4]: import datetime

In [5]: s = '2007-12-18 22:29:34.924122'

In [6]: dateutil.parser.parse(s)
Out[6]: datetime.datetime(2007, 12, 18, 22, 29, 34, 924121)

Thoughts?

JDH

> As a followup to the work on floats, I fixed rec2csv to deal with funky
> strings (strings with commas and quotes). I've checked this is as r4749,
> but I thought I'd announce here in case someone (John, in particular)
> was depending on some peculiar aspect of the old implementation. If
> there is something you depend on, can you update the unittests in
> unit/mlab_unit.py to check for the behavior you need?
>
> Also, John, I didn't see mlab.FormatString being used anywhere, but I
> didn't want to change it, either. So I made FormatString2 and used it.
> But if there's no known use of FormatString, lets kill the original and
> more FormatString2 into its place.

I am happy to leave it as is, and fix it if I bump into anything.
There was a use case at work that made me quote all the strings with "
but I am not sure what it is right now so I will test with your
version tomorrow and see if I can find any problems.

There is one more thing I'd like to clean up in this code, and that is
the handling of date and datetime. The current implementation will
return datetime regardless of whether the string looks like a date or
a datetime. Thus it would fail a roundtrip tests if you had a date
field. I may make some changes tomorrow to support date or datetime.

I'm CCing Chris Burns, from Berkeley here, but it would be great if
you post a note of these changes on the numpy list. One of the things
to come out of the sprint was a lot of work on numpy i/o, and that
includes looking into integrating much of the MPL facilities. So it
would be great if you keep the numpy team posted, to make sure they
get your latest code.

Cheers.

···

On Dec 16, 2007 9:10 PM, John Hunter <jdh2358@...149...> wrote:

On Dec 16, 2007 5:26 PM, Andrew Straw <strawman@...36...> wrote:

What do you think of making checkrows=0 the default for csv2rec? It
is slower, but mostly guaranteed to do the right thing, whereas
checkrows=5 which is the current default can often fail if the first 5
rows match some type but later rows do not, and may lead to confusion
among new users.

Thanks for looking at this. From the unittest that you checked in, it appears you've succeeded in getting dates and datetimes to work, modulo the apparent datetime precision buglet you mention below.

My understanding with repr() is that, in general, it should attempt to support lossless copying of data with the eval() function. It does appear that dateutils does this -- repr() produces a string that looks like what you'd use to call the constructor. (Numpy scalars don't follow this idiom, and it's doubtful they could change without breaking a lot of code.) I also agree that for a CSV file, something like the str() on date and datetime objects makes most sense. This is all just to say that the date and datetime stuff looks good to my eye, but you're using that stuff a lot more than me. Finally, from your example, it does look like there's a bug somewhere in the datetime code. Since I don't use them, I leave that to you...

Finally, as far as the checkrows -- I must admit that I wasn't even aware of its existence until you asked the question. (That's the problem with writing code that just works - no one notices it!) Now that you've posed the question, I suppose I'm in favor a default of 0 (which actually means infinity). I suppose no one in their right mind is going to use CSV files to store gigabytes of data, so parsing the whole thing for consistency seems like a worthwhile expenditure of cycles. (And if they're thinking about it, maybe the slowness with dissuade them, or at least cause them to read the docstrings! :slight_smile:

John Hunter wrote:

···

On Dec 16, 2007 5:26 PM, Andrew Straw <strawman@...36...> wrote:

As a followup to the work on floats, I fixed rec2csv to deal with funky
strings (strings with commas and quotes). I've checked this is as r4749,
but I thought I'd announce here in case someone (John, in particular)
was depending on some peculiar aspect of the old implementation. If
there is something you depend on, can you update the unittests in
unit/mlab_unit.py to check for the behavior you need?
    
OK, I found one problem with repr. When adding support for date vs
datetime to the csv2rec roundtrip, and adding this to the unit test, I
notice that repr is probably not what we want for datetime; str makes
more sense here:

In [1]: import datetime

In [2]: d = datetime.date.today()

In [3]: repr(d)
Out[3]: 'datetime.date(2007, 12, 16)'

In [4]: str(d)
Out[4]: '2007-12-16'

The latter is more natural in the CSV file, and the repr version is
not supported by dateutil, at least not the one we are shipping:

In [5]: import dateutil.parser

In [6]: dateutil.parser.parse(repr(d))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)

/Users/jdhunter/python/svn/matplotlib/examples/<ipython console> in <module>()

...snip the rest of the traceback...

In [7]: dateutil.parser.parse(str(d))
Out[7]: datetime.datetime(2007, 12, 16, 0, 0)

So I changed FormatObject to use str, pending further discussion. At
least for my common use cases, the only obj types I have in my record
arrays are dates and datetimes, and I find this to be a pretty
compelling use case since it is the type least likely to be supported
by other persistence methods (tostring and pickle both fail or do not
behave as expected with datetimes in the recarray).

But there is an oddity in the parsing of milliseconds which is causing
the updated unit test to fail; the code below illustrates the problem:

In [3]: import dateutil.parser

In [4]: import datetime

In [5]: s = '2007-12-18 22:29:34.924122'

In [6]: dateutil.parser.parse(s)
Out[6]: datetime.datetime(2007, 12, 18, 22, 29, 34, 924121)

Thoughts?

JDH

-------------------------------------------------------------------------
SF.Net email is sponsored by:
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services
for just about anything Open Source.
http://ad.doubleclick.net/clk;164216239;13503038;w?http://sf.net/marketplace
_______________________________________________
Matplotlib-devel mailing list
Matplotlib-devel@lists.sourceforge.net
matplotlib-devel List Signup and Options