Fictitious standards?

barbie on 2002-10-25T14:25:20

I was dubious about the Caveats in the POD for Text::CSV, since they give no reference to any standard or to where the author drew his conclusions from.

From MyFileFormats.com I found this CSV definition. Nowhere is it prejudiced against non-US users of the format, so why does Text::CSV insist on:

Allowable characters within a CSV field include 0x09 (tab) and the inclusive range of 0x20 (space) through 0x7E (tilde).

Nowhere in the specification I found (and it wasn't easy to find that!) does it make an assumption about what can be inside a field. As long as it's contained in quotes, it's valid. As it should be.

The reason I'm taking issue with this is that we have a field in our CSV that is a currency field. As we are in the UK, we quite rightly use a £ symbol. Text::CSV spits it out as invalid, even if the field is contained in quotes as the specification states. According to the Text::CSV specification, it also means that no European language characters, other currency symbols (e.g. the Euro or the Yen) or special symbols (e.g. ® or ©) are ever allowed to appear in a CSV file. I wonder if the producers of spreadsheet applications, with the capability of saving to CSV, realise they write out illegal characters?
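
To make the complaint concrete, here is a minimal sketch of a check built only from the character range quoted from the POD above. It is not Text::CSV's actual code: is_allowed_field is a hypothetical helper, and the sketch assumes the £ arrives as the single byte 0xA3 (ISO-8859-1).

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::CSV: accepts only the characters
# listed in the POD caveat, i.e. tab (0x09) and 0x20 (space) to 0x7E (tilde).
sub is_allowed_field {
    my ($field) = @_;
    return $field =~ /\A[\x09\x20-\x7E]*\z/ ? 1 : 0;
}

my $price_usd = '$10.00';
my $price_gbp = "\xA3" . '10.00';    # assumption: 0xA3 is the pound sign in ISO-8859-1

for my $pair ( [ USD => $price_usd ], [ GBP => $price_gbp ] ) {
    my ( $label, $value ) = @$pair;
    printf "%s => %s\n", $label, is_allowed_field($value) ? 'accepted' : 'rejected';
}
```

Under that rule the dollar amount is accepted and the pound amount is rejected, quoted or not, because quoting never enters into the character test.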

Then again Text::CSV is over 5 years old and still at version 0.01! Seeing as the author hasn't written anything else, I wonder if they've disappeared?

Is this another module I'm gonna have to look at and attempt to patch? I seem to be finding a lot of inaccurate or restricted modules of late!

Standards

petdance on 2002-10-25T15:06:03

First, there's not really a "standard" for CSV. It really means whatever someone wants to throw at you. I had a project last year where multiple business partners would send me "CSV" data, and no two were the same. Some quoted every field. Some only quoted fields that needed it. Some escaped double quotes by doubling them. Some used backslashes. It was a mess.

Second, don't use Text::CSV. Use Text::CSV_XS. It's got far more parameters for your tuning enjoyment.

Re:Standards

jdavidb on 2002-10-25T17:21:44

I'm pretty sure Text::CSV_XS is the successor to Text::CSV. It's always a good idea to search CPAN and look for more recent modules.

For even more enjoyment, see if you can make use of DBD::CSV.

Re:Standards

barbie on 2002-10-29T15:29:06
DBD::CSV was my first choice; however, the file we are being sent contains additional record types, which include one or more comment records (a ' as the first character) and one header record (a # as the first character).
Plus it was easier to parse the file directly rather than store it locally, parse it, then delete it.
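
The record types described above can be told apart before any CSV parsing happens. This is only a sketch of that idea: record_type is a hypothetical helper and the sample lines are made up for illustration, not taken from the real file.

```perl
use strict;
use warnings;

# Hypothetical helper: classify one line by its first character,
# following the description above (' = comment, # = header).
sub record_type {
    my ($line) = @_;
    return 'comment' if $line =~ /\A'/;
    return 'header'  if $line =~ /\A#/;
    return 'data';
}

# Made-up sample lines, for illustration only.
my @lines = (
    "' generated by the supplier",
    '#id,description,price',
    '1,"Widget","10.00"',
);

for my $line (@lines) {
    printf "%-7s %s\n", record_type($line), $line;
}
```

Only the lines classified as data would then be handed to the CSV parser.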

Re:Text::CSV_XS

htoug on 2002-10-28T08:11:36
But it still has the weird notion of not allowing us to use our alphabet unless we enter binary mode, which disables any check on characters.
Useful, but I do have a hard time explaining why you have to use binary mode to write non-binary data!

I would love for it to have an eight-bit mode, where control characters are forbidden, i.e. 0x00-0x17, 0x7f-0x97 and 0xff (if I got my ranges right). Of course this would annoy M$-users, who have some printable characters embedded in the high control range (0x80-0x9f).
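
A rough sketch of what such an eight-bit check might look like. It is not an existing Text::CSV_XS option, and is_eight_bit_clean is a hypothetical name. The sketch uses the conventional control ranges (0x00-0x1F, 0x7F and 0x80-0x9F) rather than the ones guessed at above, keeps tab allowed as in the Text::CSV POD, and leaves 0xFF alone on the assumption that the data is ISO-8859-1, where it is a printable letter.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the field holds no control characters.
# Rejected: C0 controls except tab, DEL (0x7F), and C1 controls (0x80-0x9F).
sub is_eight_bit_clean {
    my ($field) = @_;
    return $field =~ /[\x00-\x08\x0A-\x1F\x7F\x80-\x9F]/ ? 0 : 1;
}

my %samples = (
    'pound sign' => "\xA3" . '10.00',    # assumption: ISO-8859-1 bytes
    'bell'       => "ring\x07",
    'C1 control' => "next\x85line",
);

for my $name ( sort keys %samples ) {
    printf "%-10s => %s\n", $name,
        is_eight_bit_clean( $samples{$name} ) ? 'accepted' : 'rejected';
}
```

As noted above, data from Windows code pages that place printable characters in 0x80-0x9F would be rejected by this rule.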

Re:Text::CSV_XS

barbie on 2002-10-29T15:40:13
This was the issue I had with it. Why should I have to switch to binary just to use the extended character set? The fix I did, apart from cleaning up the bizarre nesting and the blank lines that confused the layout of blocks, was to add the following chunk to the _bite() function, just before the last "} else {" line:
    } elsif ($in_quotes) {    # an extended character in quotes...
        $$piece_ref .= substr($$line_ref, 0, 1);
        substr($$line_ref, 0, 1) = '';

Well it does the job for me.
With regard to the control characters, I did think about that, but decided to just allow them. I may change it later though.

Re:Standards

barbie on 2002-10-29T15:26:02
Text::CSV_XS seemed like overkill for what I wanted. I have my own patch to Text::CSV now, which handles extended characters, provided they are contained within quotes.
Your example still follows the standard as I understand it. Fields can have quotes around them, or the quotes can be omitted if the field doesn't contain the quote character or the field separator. The standard way of escaping double quotes is to double them. Much like SQL in that respect.
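
The quoting rule as described above can be sketched in a few lines. This is an illustration only, not code from Text::CSV or Text::CSV_XS: csv_quote_field and csv_join are hypothetical names, and quoting fields that contain line breaks is an extra assumption on top of what is described here.

```perl
use strict;
use warnings;

# Hypothetical helper: quote a field only when it needs it,
# escaping embedded double quotes by doubling them.
sub csv_quote_field {
    my ($field) = @_;
    # Line breaks are included as an extra assumption.
    return $field unless $field =~ /[",\r\n]/;
    ( my $escaped = $field ) =~ s/"/""/g;
    return qq{"$escaped"};
}

# Hypothetical helper: join already-quoted fields with the separator.
sub csv_join {
    return join ',', map { csv_quote_field($_) } @_;
}

print csv_join( 'Widget', 'He said "hi"', 'a,b' ), "\n";
# prints: Widget,"He said ""hi""","a,b"
```

Nothing in this rule looks at which characters the field contains beyond the quote, the separator and line breaks.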

Re:Standards

merijn on 2007-11-05T12:52:00
Text::CSV is quite a bare module, which will be updated *very* soon now.

The new Text::CSV will include a pure perl version of Text::CSV_XS and will itself be just a wrapper. If Text::CSV_XS is installed, it will use it; otherwise, it will use the bundled Text::CSV_PP (or Text::CSV_PurePerl as the snap currently states).

Text::CSV_XS is much faster than the pure-perl version(s).
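
The general "use the XS module if installed, otherwise fall back" pattern can be sketched as below. This is not the code of the new Text::CSV wrapper: first_loadable is a hypothetical helper, and the module names are simply the ones mentioned above.

```perl
use strict;
use warnings;

# Hypothetical helper: return the name of the first module in the list
# that can be loaded, or nothing if none of them can.
sub first_loadable {
    for my $module (@_) {
        ( my $file = "$module.pm" ) =~ s{::}{/}g;
        return $module if eval { require $file; 1 };
    }
    return;
}

my $backend = first_loadable( 'Text::CSV_XS', 'Text::CSV_PP' );
print defined $backend
    ? "Using $backend\n"
    : "No CSV backend installed\n";
```

Which line is printed depends on what is installed locally.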

See also http://www.perlmonks.org/?node=617577