loading csv data into arrays

per_freem · August 12, 2009, 2:56pm

hi all,

i have tab-separated text files that i would like to parse into arrays
in numpy/scipy. i simply want to be able to read in the data into an
array, and then use indexing to get some of the columns, or some of
the rows, etc. the key thing is that these columns might be strings or
might be numbers. typically, one column is a set of strings and the
others are floats. it's necessary for me to be able to specify whether
the file has a header or not, or what the delimiter is. also, i'd like
to be able to manipulate the array that i read in and then easily
serialize it to a file as csv, again controlling the
delimiters/headers.

from the documentation, it looks like 'csv2rec' (from
matplotlib.mlab) might be the best option, but i am having problems
with it. for example, i use:

data = csv2rec(se_counts, skiprows=1, delimiter='\t')

however, then i cannot access the resulting array 'data' by columns,
it seems. the usual array notation data[:, 0] to access the first
column does not work -- how can i access the columns?

also, the first line of the file in this case was a header. ideally
i'd like to be able to specify that to csv2rec, so that it uses the
tab separated headers in the first line as the field names for the
columns -- is there a way to do this?

finally, how can i then easily serialize the results to a csv file?

any help on this would be greatly appreciated. i am happy to use
options aside from 'csv2rec' -- it's just that this seemed closest to
what i wanted to do, but i might have missed something. thank you.

_Sandro_Tosi2 · August 12, 2009, 3:01pm

numpy's loadtxt()
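
For a file that mixes a string column with float columns, loadtxt needs an explicit structured dtype. A minimal sketch, assuming a hypothetical two-column tab-separated sample (the field names `name` and `value` and the data are made up for illustration):

```python
import io

import numpy as np

# Hypothetical sample: a header line, then a string label and a float value.
sample = "name\tvalue\nM\t72.1\nF\t58.33\n"

table = np.loadtxt(
    io.StringIO(sample),
    delimiter="\t",
    skiprows=1,  # skip the header line
    dtype=[("name", "U10"), ("value", float)],  # assumed field names and types
)

print(table["name"])   # the string column
print(table["value"])  # the float column
```

Columns are then selected by field name rather than with `table[:, 0]`, because the result is a one-dimensional array of records.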

Regards,

--
Sandro Tosi (aka morph, morpheus, matrixhasu)
My website: http://matrixhasu.altervista.org/
Me at Debian: http://wiki.debian.org/SandroTosi

_Ryan_May1 · August 12, 2009, 3:16pm

> numpy's loadtxt()

With numpy 1.3 and newer, there's also numpy.genfromtxt (which should behave very similarly to mlab.csv2rec):

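import numpy as np
from io import StringIO

data = StringIO("""
#gender age weight
M 21 72.100000
F 35 58.330000
M 33 21.99
""")

# names=True takes the field names from the header line;
# dtype=None lets genfromtxt infer a type for each column
arr = np.genfromtxt(data, names=True, dtype=None)

print(arr['gender'])
print(arr['age'])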

Writing this back out to a file in the same format will require a bit more manual (though straightforward) work. There's no simple method that will do it for you. The best one-liner here is:

arr.tofile('test.txt', sep='\n')

cat test.txt
('M', 21, 72.099999999999994)
('F', 35, 58.329999999999998)
('M', 33, 21.989999999999998)
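
The manual route could look roughly like the sketch below, which uses the standard-library csv module to control the delimiter and the header. The helper name `write_delimited` and the hand-built stand-in array are assumptions for illustration, not part of numpy:

```python
import csv
import io

import numpy as np

# Hypothetical stand-in for a structured array such as genfromtxt returns.
arr = np.array(
    [("M", 21, 72.1), ("F", 35, 58.33), ("M", 33, 21.99)],
    dtype=[("gender", "U1"), ("age", "i8"), ("weight", "f8")],
)


def write_delimited(array, fileobj, delimiter="\t", header=True):
    """Write a structured array as delimited text, one record per line."""
    writer = csv.writer(fileobj, delimiter=delimiter, lineterminator="\n")
    if header:
        # The field names of the dtype become the header row.
        writer.writerow(array.dtype.names)
    for record in array:
        # tolist() converts a record into a tuple of plain Python values.
        writer.writerow(record.tolist())


buffer = io.StringIO()  # an open file object works the same way
write_delimited(arr, buffer)
text = buffer.getvalue()
print(text)
```

Passing `delimiter=","` or `header=False` changes the output format without touching the array.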

That should get you going. If it’s not enough, feel free to post a sample of your data file (or a representative example) and I can try to point you further in the right direction.

Ryan

--
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma

per_freem · August 12, 2009, 3:47pm

hi all,

thanks for these comments. i tried loadtxt and genfromtxt and am
having similar problems. my data looks like this:

my;header1 myheader-2_a myheader-2_b
a:5-X:b 3;0;5;0;0;0 3.015
c:6-Y:d 0;0;0;0;0;0 2.5

i simply want to read these in, and have all numbers be read in as
floats and all things that don't look like numbers (in this case the
first and second columns) to be parsed in as strings.

i tried:

data = genfromtxt(myfile, delimiter='\t', dtype=None)

(if i don't specify dtype=None, it reads everything as NaN)

the first problem is that with dtype=None all the entries are parsed
as strings. i'd like to be able to read in the unambiguously numeric
values as numbers (column 3 in this case.)

the second problem is that if i try to use headers as column names using:

data = genfromtxt(myfile, delimiter='\t', dtype=None, names=True)

then it converts my headers into different strings:

data

array([('a:5-X:b', '3;0;5;0;0;0', 3.0150000000000001),
       ('c:6-Y:d', '0;0;0;0;0;0', 2.5)],
      dtype=[('myheader1', '|S7'), ('myheader2_a', '|S11'),
             ('myheader2_b', '<f8')])

i would only like to refer to my headers using this notation:

data['my;header1']

i don't need to be able to write data.headername at all. is there a
way to make genfromtxt not mess with any of the header names, and read
in the numeric values?

thanks very much.


_Ryan_May1 · August 12, 2009, 6:17pm

> i tried:
>
> data = genfromtxt(myfile, delimiter='\t', dtype=None)
>
> (if i don't specify dtype=None, it reads everything as NaN)
>
> the first problem is that with dtype=None all the entries are parsed
> as strings. i'd like to be able to read in the unambiguously numeric
> values as numbers (column 3 in this case.)

That's because you aren't skipping the header row, so genfromtxt infers each column's dtype from data that includes the header names, hence string. If you don't want to use the header names, you can pass skiprows=1 (later NumPy releases use skip_header=1 instead).
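
A small sketch of the difference, assuming a recent NumPy (where the keyword is `skip_header` rather than `skiprows`, and `encoding` is accepted) and an in-memory copy of the sample data with tabs restored:

```python
import io

import numpy as np

# The sample from the question, with tab separators (assumed).
sample = (
    "my;header1\tmyheader-2_a\tmyheader-2_b\n"
    "a:5-X:b\t3;0;5;0;0;0\t3.015\n"
    "c:6-Y:d\t0;0;0;0;0;0\t2.5\n"
)

# Header row treated as data: every column falls back to a string type.
with_header = np.genfromtxt(
    io.StringIO(sample), delimiter="\t", dtype=None, encoding="utf-8"
)

# Header row skipped: the third column is inferred as float.
without_header = np.genfromtxt(
    io.StringIO(sample),
    delimiter="\t",
    dtype=None,
    encoding="utf-8",
    skip_header=1,
)

print(with_header.dtype)
print(without_header.dtype)
print(without_header["f2"])  # default field names are f0, f1, f2
```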

> the second problem is that if i try to use headers as column names using:
>
> data = genfromtxt(myfile, delimiter='\t', dtype=None, names=True)
>
> then it converts my headers into different strings:
>
> data
>
> array([('a:5-X:b', '3;0;5;0;0;0', 3.0150000000000001),
>        ('c:6-Y:d', '0;0;0;0;0;0', 2.5)],
>       dtype=[('myheader1', '|S7'), ('myheader2_a', '|S11'),
>              ('myheader2_b', '<f8')])
>
> i would only like to refer to my headers using this notation:
>
> data['my;header1']

Try adding: deletechars=''

If you really just want the values from specific columns, you can also pass in a list of column numbers (or names) to keep:

arr = np.genfromtxt('test.txt', delimiter='\t', dtype=None, names=True, deletechars='', usecols=[2])
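
To check that the header names survive unchanged, something like the following sketch should work. It assumes a recent NumPy (the `encoding` keyword) and uses an in-memory copy of the sample data with tabs restored instead of a real file:

```python
import io

import numpy as np

# The sample from the question, with tab separators (assumed).
sample = (
    "my;header1\tmyheader-2_a\tmyheader-2_b\n"
    "a:5-X:b\t3;0;5;0;0;0\t3.015\n"
    "c:6-Y:d\t0;0;0;0;0;0\t2.5\n"
)

data = np.genfromtxt(
    io.StringIO(sample),
    delimiter="\t",
    dtype=None,       # infer a type per column
    names=True,       # take field names from the first line
    deletechars="",   # do not strip characters such as ';' or '-'
    encoding="utf-8",
)

print(data.dtype.names)
print(data["my;header1"])    # string column, addressed by its original header
print(data["myheader-2_b"])  # float column
```

Names containing `;` or `-` only work with the `data['...']` notation, not as attributes, which matches what was asked for.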

Ryan
