Working with CSV files using PDL

Ahh, the venerable comma separated variable format, beloved of data scientists.

I grabbed a couple of csv files from Matt Pettis’ csvkit talk to prepare for the datafile that I should be getting my mitts on and tripped and bumped my way through the documentation for PDL::IO::CSV and metaphorically skinned my knees, as you do when you don’t read too carefully.

Too lazy to write a script, I fired up PDL’s REPL, the perldl shell, and did this straight away

$ perldl

pdl> use PDL::IO::CSV ':all'

pdl> $cars = rcsv2D('mtcars_001.csv')

pdl> p $cars

[ big block of data wider than it is tall ]

pdl> p dims $cars                                                                                         

33 12

WHAT? I gave it a 12x32 shaped CSV datafile, not 33x12!

Let’s unpick what went wrong.

My data fell over on its side!

First - the last thing the rcsv2D function does after reading the file is to transpose the data before returning the piddle (yes, if you hadn’t heard it before, they’re called piddles. stop sniggering) This keeps it consistent with how PDL stores its data. Maybe it has something to do with PDL being row-major That doesn’t matter right now. What does matter is that your data is where you can play with it.

Where are all the car names?

Once upon a time all data was numerical, so in this module, if it doesn’t look like a number it gets silently converted to 0, unless you use {text2bad => 1} in the options in which case it’s converted to a Bad Value and will print like this

...
pdl> $cars = rcsv2D('mtcars_001.csv', {text2bad => 1})

pdl> p $cars                                                                                              


[
 [  BAD   BAD   BAD   BAD ... ]
 [  BAD    21    21  22.8 ... ]
 [  BAD     6     6     4 ... ]
 ...
]

Why the extra line?

How about some verbosity to see what’s happening? Turn on the debug flag.

pdl> $cars = rcsv2D('mtcars_001.csv', {text2bad => 1, debug => 1})                                  
Initialized size=80000, cols=12, type=double
Reshaping to 33 (final)

Twelve columns, great, but “Reshaping to 33” when there’s only 32 cars?

Ahh, forgot about the header, so our ingest becomes

pdl> $cars = rcsv2D('mtcars_001.csv', {text2bad => 1, header => 1, debug => 1})                           

Initialized size=80000, cols=12, type=double
Reshaping to 31 (final)

Huh? 31? Bug in the header function eats the first line after the header. Ouch! You can get around that with a bit of a hack. Set header to 0.5 or -1.

Edit: author fixed bug within 36 hours and pushed changes to CPAN. Woohoo! (you can ignore this bit)

pdl> $cars = rcsv2D('mtcars_001.csv', {text2bad => 1, header => -1, debug => 1})

Initialized size=80000, cols=12, type=double
Reshaping to 32 (final)

What’s my name?

Get the column headers with

pdl> p $cars->hdr->{col_name}

this line intentionally left blank i.e. you get a big pile o’ nuthin from this command

well, that was disappointing. Turns out you don’t get column headers for 2D structures, only 1D structures. So how about a list of cars

pdl> @cars = rcsv1D('mtcars_001.csv',  { text2bad => 1, header => 1 })                                    

Could not parse the given string at /usr/local/share/perl/5.26.1/PDL/DateTime.pm line 654, <$fh> line 2.

Why? I dunno. Still works, but you can quiet the warning by telling it the data types you expect (all double because I’m lazy. there are some shorts, etc.)

pdl> @cars = rcsv1D('mtcars_001.csv',  { text2bad => 1, header => 1, type => double })                    


pdl>  p $cars[0]->hdr->{col_name}                                          
name
pdl> p $cars[11]->hdr->{col_name}
carb

Exclude the name column

You can choose which columns with and array ref like [ 1, 4, 6 ] Everything is numerical, except name at column 0, so let’s choose everything else

pdl> $cars = rcsv2D('mtcars_001.csv', [1 .. 11], { header => -1, debug => 1, text2bad => 1})              

Fetching 2D chunk=40000, reshape=80000, bad=?/1, sep_char=','
Initialized size=80000, cols=11, type=double
Reshaping to 32 (final)

pdl> p $cars                                                                                              


[
 [   21    21  22.8  21.4 ... ]
 [    6     6     4     6 ... ]
 ...
]

Now there are 11 columns and everything is numerical.

Missing values?

Yes, there’s an option for missing values using empty2bad

pdl> $cars = rcsv2D('mtcars_blank-select_cells.csv', [1 .. 11], { header => -1, debug => 1, empty2bad => 1 })

Fetching 2D chunk=40000, reshape=80000, bad=1/?, sep_char=','
Initialized size=80000, cols=11, type=double
Reshaping to 32 (final)

pdl> p $cars->info                                                                                        
PDL: Double D [32,11]

pdl> p $cars                                                                                              


[ 
 [   21    21  22.8  21.4  18.7  18.1  14.3 ... ]
 [    6     6     4   BAD     8     6     8 ... ]
 [  160   160   108   258   360   225   360 ... ]
 [  110   110    93   110   175   105   245 ... ]
 [  3.9   3.9  3.85  3.08  3.15  2.76  3.21 ... ]
 [ 2.62 2.875  2.32 3.215  3.44  3.46  3.57 ... ]
 [16.46 17.02 18.61 19.44 17.02 20.22 15.84 ... ]
 [    0     0     1     1     0     1   BAD ... ]
 [    1     1     1     0     0     0     0 ... ]
 [  BAD     4     4     3     3     3     3 ... ]
 ...
]

Look at those BAD boys :)

That will work with any PDL function that is BadValue aware. Edit: Also, notice how I slipped in the info function in there. Just found it looking through the test suite.

In conclusion, PDL::IO::CSV isn’t the polished tool that deals with all your CSV needs, but it’s good enough to get some numbers into PDL quickly so you can start crunching or clustering right away. Paired with the perldl shell, it’s pretty easy to get that first look at a datafile and figure out where to go from there.

Leave a comment

About Enkidu

user-pic I am a Freelance Scientist** and Perl is my Igor.