record array and date support

I just added support for native plotting of python date and datetime
objects (you can still use plot_date with date2num conversions, but you
no longer have to). We will continue to convert to floats under the
hood, but the conversion can be handled automagically. I also added
support for loading CSV files (or general space/tab/comma delimited
files) into numpy record arrays, and the type conversions (int, float,
date, etc.) happen automagically. The function assumes there is a
header row, and these strings will be munged to give valid python
attribute names. It inspects the first checkrows lines after the
header to try to infer the datatype of each column and set the
appropriate conversion function. It's not entirely bulletproof, but it
should cover a lot of common use cases.
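
To make the header munging and type inference concrete, here is a minimal standard-library sketch of the idea. It is not the csv2rec implementation (which the thread does not show): the helper names munge_name, infer_converter and load_rows are hypothetical, the date format is an assumption chosen to match the sample file, and the result is a list of tuples rather than a numpy record array.

```python
import csv
import io
import re
from datetime import datetime


def munge_name(header):
    """Turn a header cell into a lowercase, attribute-safe name."""
    name = re.sub(r"[^0-9a-z_]+", "_", header.strip().lower()).strip("_")
    # Identifiers cannot be empty or start with a digit.
    if not name or name[0].isdigit():
        name = "col_" + name
    return name


def parse_date(text):
    # Assumed fixed format for this sketch (matches '19-Sep-03').
    return datetime.strptime(text, "%d-%b-%y")


# Candidate converters, tried from most to least specific.
CONVERTERS = [int, float, parse_date, str]


def infer_converter(samples):
    """Return the first converter that accepts every sample string."""
    for conv in CONVERTERS:
        try:
            for s in samples:
                conv(s)
        except ValueError:
            continue
        return conv
    return str


def load_rows(fileobj, checkrows=5):
    """Read a CSV with a header row; infer one converter per column."""
    reader = csv.reader(fileobj)
    names = [munge_name(h) for h in next(reader)]
    rows = list(reader)
    sample = rows[:checkrows]  # only these rows are inspected
    convs = [infer_converter([r[i] for r in sample])
             for i in range(len(names))]
    data = [tuple(c(v) for c, v in zip(convs, row)) for row in rows]
    return names, data
```

With this sketch, munge_name('Adj. Close*') gives 'adj_close'. Because only the first checkrows rows are inspected, a column that looks like int at the top of the file but holds floats further down would fail during conversion, which is one way such inference can fall short of bulletproof.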

Here is an example (svn only)

  from matplotlib.mlab import csv2rec
  from pylab import figure, show

  a = csv2rec('data/msft.csv')
  fig = figure()
  ax = fig.add_subplot(111)
  ax.plot(a.date, a.adj_close, '-')
  fig.autofmt_xdate()
  show()

The autofmt_xdate call is optional, but it is a new function that does a
few things you usually want in date plots: it turns off tick labels in
the upper subplots (if any), rotates the tick labels on the lowest axes
and right-aligns them, and increases the bottom margin in the subplots
adjust settings to make room for the rotated tick labels.
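
For readers who want to see those steps spelled out, here is a rough hand-written equivalent. It is a sketch of the behaviour described above, not the library's code: the function name format_date_ticks is hypothetical, and the default values for bottom, rotation and ha are assumptions.

```python
def format_date_ticks(fig, bottom=0.2, rotation=30, ha="right"):
    """Approximate the date-plot tidying steps by hand."""
    axes = fig.axes
    if not axes:
        return
    # Treat the axes whose lower edge sits lowest in the figure
    # as the bottom row.
    lowest = min(ax.get_position().y0 for ax in axes)
    for ax in axes:
        on_bottom_row = abs(ax.get_position().y0 - lowest) < 1e-9
        for label in ax.get_xticklabels():
            if on_bottom_row:
                # Bottom row: rotate and right-align the tick labels.
                label.set_rotation(rotation)
                label.set_ha(ha)
            else:
                # Upper rows: hide the x tick labels.
                label.set_visible(False)
    # Leave room below the axes for the rotated labels.
    fig.subplots_adjust(bottom=bottom)
```

In the example above, calling format_date_ticks(fig) in place of fig.autofmt_xdate() should give a broadly similar layout.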

Here is what the dtype looks like from the example above.

  In [3]: !head -3 data/msft.csv
  Date,Open,High,Low,Close,Volume,Adj. Close*
  19-Sep-03,29.76,29.97,29.52,29.96,92433800,29.79
  18-Sep-03,28.49,29.51,28.42,29.50,67268096,29.34

  In [4]: a = csv2rec('data/msft.csv')

  In [5]: a.dtype
  Out[5]: dtype([('date', '|O4'), ('open', '<f8'), ('high', '<f8'),
                 ('low', '<f8'), ('close', '<f8'), ('volume', '<i4'),
                 ('adj_close', '<f8')])

  In [6]: a.date[:2]
  Out[6]: array([2003-09-19 00:00:00, 2003-09-18 00:00:00], dtype=object)

I'll probably add a few performance features to the csv2rec function,
mainly to let you skip columns and supply conversion functions where
desired, because the automatic date parser is pretty slow when it has
to parse date strings. But this is enough to make it useful. Another
useful feature would be support for customizable, type-dependent
NULL value conversion (e.g. convert to numpy.nan for floats,
'0000-00-00' for dates, etc.).
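
As a rough illustration of what such features could look like, the sketch below uses per-column converters, column skipping, and a per-type placeholder for missing values. These features are only proposed in this message, so everything here is hypothetical: make_converter, convert_row, the converters table and the set of strings treated as missing are all assumptions, not part of csv2rec.

```python
from datetime import datetime


def make_converter(func, missing_value=None,
                   missing_strings=("", "NA", "NULL")):
    """Wrap func so designated strings map to a per-type missing value."""
    def convert(text):
        text = text.strip()
        if text in missing_strings:
            return missing_value
        return func(text)
    return convert


# Hypothetical table: column name -> converter. An explicit strptime
# format avoids guessing the date layout for every cell.
converters = {
    "date": make_converter(lambda s: datetime.strptime(s, "%d-%b-%y"),
                           missing_value="0000-00-00"),
    "adj_close": make_converter(float, missing_value=float("nan")),
}


def convert_row(names, row, converters, skip=()):
    """Convert one row of strings, leaving out the skipped columns."""
    out = {}
    for name, cell in zip(names, row):
        if name in skip:
            continue
        # Columns without a registered converter stay as strings.
        out[name] = converters.get(name, str)(cell)
    return out
```

Skipped columns are never converted at all, which is where most of the time saving would come from on wide files.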

Record arrays are your friend; have fun!
JDH

Hi John,
Very, very interesting idea.
Is there a way to attach extra information to the record array columns,
such as the units and/or the desired labels for the resulting plotted lines,
read directly from the CSV file?
Cordially

···

On Thursday, 7 June 2007, John Hunter wrote:

_______________________________________________
Matplotlib-users mailing list
Matplotlib-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/matplotlib-users

--
Lionel Roubeyrie - lroubeyrie@...1068...
Research and maintenance officer
LIMAIR - la Surveillance de l'Air en Limousin
http://www.limair.asso.fr

It could be done, but my goal here is not to create a persistence
layer for record arrays, or a method of describing them or mpl labels,
but rather a way to easily import 3rd party CSV files into numpy
record arrays. I work with a lot of tab/space/ascii delimited files,
and found myself duplicating a lot of code importing them into record
arrays. This function is the distillation of that code. It would be
fairly easy to add designated rows for those who did want to decorate
their CSV files. I think it might be most useful to support a row
that provided a numpy dtype per column, or perhaps the name of a
converter function...

One thing people coming from gnuplot miss is file plotting
functionality. I just added a function to pylab called plotfile which
uses the csv2rec functionality (with autolabeling etc) to plot data
from a file. E.g.,

  plotfile(fname, (0,5,6))

plots columns 5 and 6 against column 0. And

  plotfile(fname, ('date', 'volume', 'adj_close'),
           plotfuncs={'volume': 'bar'})

does the same using the names of the columns, using "plot" for
adj_close (the default) and "bar" for volume (customization from the
plotfuncs dictionary). The column names in either case are used to
create default x and y labels.
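
The column selection and the plotfuncs lookup can be sketched roughly as follows. This is an illustration of the behaviour described above, not pylab's implementation: resolve_columns and plot_columns are hypothetical names, data is assumed to be a mapping from column name to values, and giving each y column its own axes is an assumption.

```python
def resolve_columns(names, cols):
    """Map a mix of integer indices and column names to column names."""
    return [names[c] if isinstance(c, int) else c for c in cols]


def plot_columns(axes_list, data, names, cols, plotfuncs=None):
    """Plot each selected y column against the first selected column.

    axes_list needs one Axes per y column. The method called on each
    Axes defaults to 'plot' and can be overridden per column name.
    """
    plotfuncs = plotfuncs or {}
    xname, *ynames = resolve_columns(names, cols)
    used = []
    for ax, yname in zip(axes_list, ynames):
        funcname = plotfuncs.get(yname, "plot")
        # Look up e.g. ax.plot or ax.bar by name and call it.
        getattr(ax, funcname)(data[xname], data[yname])
        ax.set_ylabel(yname)  # column name doubles as the y label
        used.append(ax)
    if used:
        used[-1].set_xlabel(xname)  # x label on the last axes only
```

Looking the method up by name is what lets a plain dictionary such as {'volume': 'bar'} switch the drawing style per column.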

The 2nd command produces the attached plot. This is just a first
pass, so if people want to see a different interface or have an
opinion what should be returned, or where this function should live
outside of pylab, feel free to comment or commit changes.

JDH

plotfile.png

···

On 6/8/07, Lionel Roubeyrie <lroubeyrie@...1068...> wrote:

Hey John, I'm currently using the csv2rec functionality from matplotlib.mlab in a
script.

Is there a tool or a way to automate plotting of multiple y series contained
in a CSV data file (data in columns, header in the first row, x axis is time,
several y series), where the column header names and the number of
columns vary from one data file to the next?
I particularly want to avoid typing the individual series names by hand. Since
this information is already in the header row for each column of data, it
seems inefficient to type series names for plotting, only to have to
retype them for the next CSV file, which contains different column
header names.

Plotfile came close, but doesn't seem to automatically label individual
series by column header.
E.g. file formats (varying headers and numbers of columns):

file 1
elapsedtime,AS2data,AS45data,SE34data,VB56data

file 2
elapsedtime,AS09data,VB24data
  
John Hunter-4 wrote:

···

--
View this message in context: http://old.nabble.com/record-array-and-date-support-tp11011990p31483567.html
Sent from the matplotlib - users mailing list archive at Nabble.com.

Given a recarray r, r.dtype.names is a tuple containing the column names.

It should be easy to do what you want with a loop over those names.
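
A minimal sketch of such a loop, assuming the first column holds the x values (elapsedtime in the files above). The function name plot_all_series is hypothetical; it only relies on r.dtype.names, indexing the record array by column name, and the usual Axes methods.

```python
def plot_all_series(ax, r, xname=None):
    """Plot every column of record array r against one x column.

    Each line is labeled with its column name, so nothing has to be
    typed by hand when the headers change between files.
    """
    names = r.dtype.names
    if xname is None:
        xname = names[0]  # assumption: first column is the x axis
    for name in names:
        if name == xname:
            continue
        ax.plot(r[xname], r[name], label=name)
    ax.set_xlabel(xname)
    ax.legend()
```

Typical use would be r = csv2rec(fname), then ax = figure().add_subplot(111), plot_all_series(ax, r) and show(). The labels are the names as they appear in r.dtype.names, i.e. after csv2rec has munged the header strings.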

briant100 wrote:

···

That should work.
Many thanks.

butterw wrote:

···
