Working with CSV files using PDL
Ahh, the venerable comma separated variable format, beloved of data scientists.
I grabbed a couple of csv files from Matt Pettis’ csvkit talk to prepare for the datafile that I should be getting my mitts on and tripped and bumped my way through the documentation for PDL::IO::CSV and metaphorically skinned my knees, as you do when you don’t read too carefully.
Too lazy to write a script, I fired up PDL’s REPL, the perldl shell, and did this straight away
$ perldl
pdl> use PDL::IO::CSV ':all'
pdl> $cars = rcsv2D('mtcars_001.csv')
pdl> p $cars
[ big block of data wider than it is tall ]
pdl> p dims $cars
33 12
WHAT? I gave it a 12x32 shaped CSV datafile, not 33x12!
Let’s unpick what went wrong.
My data fell over on its side!
First - the last thing the rcsv2D function does after reading the file is to transpose the data before returning the piddle (yes, if you hadn’t heard it before, they’re called piddles. stop sniggering) This keeps it consistent with how PDL stores its data. Maybe it has something to do with PDL being row-major That doesn’t matter right now. What does matter is that your data is where you can play with it.
Where are all the car names?
Once upon a time all data was numerical, so in this module, if it doesn’t
look like a number
it gets silently converted to 0, unless you use
{text2bad => 1}
in the options in which case it’s converted to a
Bad Value
and will print like this
...
pdl> $cars = rcsv2D('mtcars_001.csv', {text2bad => 1})
pdl> p $cars
[
[ BAD BAD BAD BAD ... ]
[ BAD 21 21 22.8 ... ]
[ BAD 6 6 4 ... ]
...
]
Why the extra line?
How about some verbosity to see what’s happening? Turn on the debug
flag.
pdl> $cars = rcsv2D('mtcars_001.csv', {text2bad => 1, debug => 1})
Initialized size=80000, cols=12, type=double
Reshaping to 33 (final)
Twelve columns, great, but “Reshaping to 33” when there’s only 32 cars?
Ahh, forgot about the header, so our ingest becomes
pdl> $cars = rcsv2D('mtcars_001.csv', {text2bad => 1, header => 1, debug => 1})
Initialized size=80000, cols=12, type=double
Reshaping to 31 (final)
Huh? 31? Bug
in the header function eats the first line after the header. Ouch!
You can get around that with a bit of a hack. Set header
to 0.5 or -1.
Edit: author fixed bug within 36 hours and pushed changes to CPAN. Woohoo! (you can ignore this bit)
pdl> $cars = rcsv2D('mtcars_001.csv', {text2bad => 1, header => -1, debug => 1})
Initialized size=80000, cols=12, type=double
Reshaping to 32 (final)
What’s my name?
Get the column headers with
pdl> p $cars->hdr->{col_name}
this line intentionally left blank i.e. you get a big pile o’ nuthin from this command
well, that was disappointing. Turns out you don’t get column headers for 2D structures, only 1D structures. So how about a list of cars
pdl> @cars = rcsv1D('mtcars_001.csv', { text2bad => 1, header => 1 })
Could not parse the given string at /usr/local/share/perl/5.26.1/PDL/DateTime.pm line 654, <$fh> line 2.
Why? I dunno. Still works, but you can quiet the warning by telling it
the data types you expect
(all double
because I’m lazy. there are some short
s, etc.)
pdl> @cars = rcsv1D('mtcars_001.csv', { text2bad => 1, header => 1, type => double })
pdl> p $cars[0]->hdr->{col_name}
name
pdl> p $cars[11]->hdr->{col_name}
carb
Exclude the name column
You can choose which columns with and array ref like [ 1, 4, 6 ]
Everything is numerical, except name at column 0, so let’s choose everything else
pdl> $cars = rcsv2D('mtcars_001.csv', [1 .. 11], { header => -1, debug => 1, text2bad => 1})
Fetching 2D chunk=40000, reshape=80000, bad=?/1, sep_char=','
Initialized size=80000, cols=11, type=double
Reshaping to 32 (final)
pdl> p $cars
[
[ 21 21 22.8 21.4 ... ]
[ 6 6 4 6 ... ]
...
]
Now there are 11 columns and everything is numerical.
Missing values?
Yes, there’s an option for missing values using empty2bad
pdl> $cars = rcsv2D('mtcars_blank-select_cells.csv', [1 .. 11], { header => -1, debug => 1, empty2bad => 1 })
Fetching 2D chunk=40000, reshape=80000, bad=1/?, sep_char=','
Initialized size=80000, cols=11, type=double
Reshaping to 32 (final)
pdl> p $cars->info
PDL: Double D [32,11]
pdl> p $cars
[
[ 21 21 22.8 21.4 18.7 18.1 14.3 ... ]
[ 6 6 4 BAD 8 6 8 ... ]
[ 160 160 108 258 360 225 360 ... ]
[ 110 110 93 110 175 105 245 ... ]
[ 3.9 3.9 3.85 3.08 3.15 2.76 3.21 ... ]
[ 2.62 2.875 2.32 3.215 3.44 3.46 3.57 ... ]
[16.46 17.02 18.61 19.44 17.02 20.22 15.84 ... ]
[ 0 0 1 1 0 1 BAD ... ]
[ 1 1 1 0 0 0 0 ... ]
[ BAD 4 4 3 3 3 3 ... ]
...
]
Look at those BAD boys :)
That will work with any PDL function that is BadValue aware. Edit: Also, notice how I slipped in the info function in there. Just found it looking through the test suite.
In conclusion, PDL::IO::CSV isn’t the polished tool that deals with all your CSV needs, but it’s good enough to get some numbers into PDL quickly so you can start crunching or clustering right away. Paired with the perldl shell, it’s pretty easy to get that first look at a datafile and figure out where to go from there.
Leave a comment