Two alternate patches for rows-as-hashrefs in Text::CSV_XS

markjugg on 2008-04-03T17:33:12

H.Merijn Brand, the Text::CSV_XS maintainer, has been discussing possibilities for adding parsing rows as hashrefs to that module through this RT ticket.

As fate would have it, our efforts to implement it crossed paths, and we now both have fairly complete but somewhat different patches for the feature. A couple points to get feedback on:

Which do you find clearer for setting the column names to use as the hash keys:

column_names() or hr_keys()

I have already been confused about whether "hr" stood for "header row" or "hashref", so I vote for the former.

The second point, which is currently in neither patch, is "how would you design the interface for automatically setting the column names from the first row of the CSV?"

Parse::CSV uses new( fields => 'auto' ), but involving new() won't work for Text::CSV_XS.

I was thinking of perhaps:

$csv->column_names_from_line($io);

Which would simply mean:

$csv->column_names( $csv->getline($io) );

We would leave it up to documentation to make sure users called this first thing.
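As a rough sketch of what that looks like from the user's side, here is the explicit form using the column_names() and getline_hr() names discussed in this post. The sample data and the in-memory filehandle are assumptions made purely for illustration:

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Illustrative data only; a real script would open a file instead.
my $data = "code,name,price\n1,apple,0.50\n2,pear,0.75\n";
open my $io, '<', \$data or die "Cannot open in-memory file: $!";

my $csv = Text::CSV_XS->new({ binary => 1 });

# Take the column names from the first line. This has to happen
# before any other row is read from the handle.
$csv->column_names($csv->getline($io));

# Every following row now comes back as a hashref keyed by column name.
my @rows;
while (my $hr = $csv->getline_hr($io)) {
    push @rows, $hr;
}
close $io;

print "Price for $_->{name} is $_->{price}\n" for @rows;
```

If the column_names() call is moved after the first getline_hr(), the header line is no longer the next line on the handle, which is the ordering problem the documentation would need to warn about.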

Alternately, you could have a function that stores the current file position, rewinds and reads the first row, and then returns to the current position. That seems more fragile to me, and I can only imagine there are some non-rewindable filehandles out there for which it wouldn't work.
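To make the fragility concrete, here is a minimal sketch of such a rewinding helper. The name header_by_rewinding is hypothetical and is not part of Text::CSV_XS or of either patch; it also assumes the header fits on one physical line. It relies on tell() and seek(), which is exactly what fails on pipes, sockets and other non-seekable handles:

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Hypothetical helper, not part of Text::CSV_XS.
# Returns an arrayref of column names, or nothing if the handle
# cannot be rewound or the first line does not parse.
sub header_by_rewinding {
    my ($csv, $io) = @_;

    my $pos = tell $io;
    return if $pos < 0;           # tell() gives -1 when there is no position

    seek $io, 0, 0 or return;     # fails on non-seekable handles
    my $line = <$io>;             # read the first physical line
    seek $io, $pos, 0 or return;  # go back to where the caller was

    return unless defined $line && $csv->parse($line);
    return [ $csv->fields ];
}

# Illustrative data only.
my $data = "code,name\n1,apple\n2,pear\n";
open my $io, '<', \$data or die "Cannot open in-memory file: $!";
my $csv = Text::CSV_XS->new({ binary => 1 });

my $first  = <$io>;               # caller has already read past the header
my $second = <$io>;
my $names  = header_by_rewinding($csv, $io);
my $third  = <$io>;               # reading resumes where the caller left off
close $io;
```

It works on a regular file or an in-memory handle, but the helper has to return nothing for anything that cannot seek, and the caller then needs a fallback anyway.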

You can leave feedback here and/or in the RT ticket.

Thanks!

Mark

Feedback

Aristotle on 2008-04-03T18:31:31

No contest: column_names.
Definitely no rewinding magic. In fact, I wouldn’t even include a column_names_from_line sugar method, because all it does is save five keystrokes on a rare operation for the price of concealing how that works. And that’s the only reason users might forget about having to call that method first thing.

If you just tell them that they need to set up the column names manually (and here’s an easy way to take them from the first line of the file), then it’s pretty much impossible to get confused about it, because, well, you have to set the column names before hashrefs will work, and if you want the names to come from the file, well, you have to take them from the file, and to take them from the file, well, you have to do that right away if they’re on the first line.

So if you include no magic whatsoever, the interface will be clearer, because the correct way of using it will inescapably suggest itself. And it’s not so much typing that it might matter.

Re:Feedback

markjugg on 2008-04-03T19:15:15
Thanks for the feedback. I'll lean towards not proposing the sugar method for the reasons you give.

column_names

Lecar_red on 2008-04-03T19:25:10

I agree that column_names is much better than the other.

column_names gets my vote!

jj on 2008-04-03T19:33:49

But have you seen the interface that Text::xSV uses?

    use Text::xSV;
    my $csv = new Text::xSV;
    $csv->open_file("foo.csv");
    $csv->read_header();
    while (my $data = $csv->fetchrow_hash) {
        # do stuff...
    }

Personally I quite like the read_header() function.

Re: Text::xSV

markjugg on 2008-04-03T19:58:17
That is a nice interface. If I had found that first I might have just used that instead.

I think because it doesn't have "CSV" in the module name, distribution name or description, it's difficult to find by searching on CPAN. At least it was for me.

I did note that it is not an "XS" module, so would be slower than Text::CSV_XS.

Re:column_names gets my vote!

Ron Savage on 2008-04-03T23:08:50
I used a hashref attribute called column_names in:
http://search.cpan.org/user/rsavage/Text-CSV_PP-Iterator-1.00/
which also discusses various similar modules.

column_names it is ...

merijn on 2008-04-04T08:30:16

OK, column_names () it is.

I liked the idea of hr_** for its double meaning: Hash Ref and Header Row, but that might be professional brain deformation from my side.

While I was designing this, I also had DBI in mind, and the obvious next step to try is bind_columns ().

With the new column_names (), it would be nice to do a DBI like bind_keys () so fields are stored in the same scalar over and over again, instead of creating a new scalar on parsing for every field line after line again.

This *could* mean a big speed gain, but otoh it could also slow down regular parses. If the gain is high enough, compared to the speed loss, this could then be propagated to be the `standard' way of parsing.

BTW public git is http://repo.or.cz/w/Text-CSV_XS.git?a=shortlog

And I'd prefer feature discussions in PerlMonks instead: http://www.perlmonks.org/?node=617577

Tie::Handle::CSV

danboo on 2008-04-04T15:23:47

This is offered as an example of how I did something similar.

In Tie::Handle::CSV I just overloaded 'header' in the constructor. It can be a simple boolean to indicate whether the file has a header, or an array ref to assign the header.

Done, plus Bind_columns. Go fetch 0.40

merijn on 2008-04-07T13:12:22

You've now got it, and I also give you bind_columns!

I value feedback, and the docs could probably use some improvements, like adding the new stuff to the SYNOPSIS.

file: $CPAN/authors/id/H/HM/HMBRAND/Text-CSV_XS-0.40.tgz
size: 85057 bytes
md5: cb8b2af20925b832159f34eed9793666

2008-04-07 0.40 - H.Merijn Brand <h.m.brand@xs4all.nl>

    * Implemented getline_hr () and column_names () RT 34474
      (suggestions accepted from Mark Stosberg)
    * Corrected misspelled variable names in XS
    * Functions are now =head2 type doc entries (Mark Stosberg)
    * Make SetDiag() available to the perl level, so errors can
      be centralized and consistent
    * Integrate the non-XS errors into XS
    * Add t/75_hashref.t
    * Testcase for error 2023 (Michael P Randall)
    * Completely refactored the XS part of parse/getline, which
      is now up to 6% faster. YMMV
    * Completed bind_columns. On straight fetches now up to three
      times as fast as normal fetches (both using getline ())

getline_hr

    The "getline_hr ()" and "column_names ()" methods work together to allow you to have rows returned as hashrefs. You must call "column_names ()" first to declare your column names.

        $csv->column_names (qw( code name price description ));
        $hr = $csv->getline_hr ($io);
        print "Price for $hr->{name} is $hr->{price} EUR\n";

    "getline_hr ()" will croak if called before "column_names ()".

column_names

    Set the keys that will be used in the "getline_hr ()" calls. If no keys (column names) are passed, it’ll return the current setting.

    "column_names ()" accepts a list of scalars (the column names) or a single array_ref, so you can pass "getline ()"

        $csv->column_names ($csv->getline ($io));

    "column_names ()" croaks on invalid arguments.

bind_columns

    Takes a list of references to scalars (max 255) to store the fields fetched "by getline_hr ()" in. When you don’t pass enough references to store the fetched fields in, "getline ()" will fail. If you pass more than there are fields to return, the remaining references are left untouched.

        $csv->bind_columns (\$code, \$name, \$price, \$description);
        while ($csv->getline ()) {
            print "The price of a $name is \x{20ac} $price\n";
        }
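For anyone who wants to try bind_columns right away, here is a small self-contained sketch based on the documentation quoted above. The sample data and the in-memory filehandle are assumptions for illustration, and the handle is passed to getline () explicitly, as in the other examples:

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Illustrative data only, without a header line.
my $data = "1,apple,0.50\n2,pear,0.75\n";
open my $io, '<', \$data or die "Cannot open in-memory file: $!";

my $csv = Text::CSV_XS->new({ binary => 1 });

# The same three scalars are refilled on every row, so no new
# field scalars are created per line.
my ($code, $name, $price);
$csv->bind_columns(\$code, \$name, \$price);

my @seen;
while ($csv->getline($io)) {
    push @seen, "$code:$name:$price";
}
close $io;

print "$_\n" for @seen;
```

Because the bound scalars are overwritten on each call, anything that has to survive past the current row needs to be copied out, as the push above does.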