Flat file to XML round trip via XML::Writer and XML:Twig

toma on 2002-12-26T21:32:57

I wrote a translator (described yesterday) that converts a particular flat file format to XML, and another from the XML format back into the flat file.

The flat file has a line-at-a-time format with the first token on a line determining the type of the data on the line. The lines are in a hierarchical data structure, with various first-position tokens specifying the hierarchy.

I made an object that contained an XML::Writer object and a hash of anonymous subs, where the key to the hash is the first-position token. The code in the anonymous subs parsed the line of the flatfile, and then send this data to XML::Writer to create the XML formatted text. I used four types of calls to XML::Writer: emptyTag, startTag, endTag, and within_element.



The emptyTag calls were easiest. No hierarchy, just a single tag with parameters.



The startTag calls open up a section of hierarchy. This is also easy.



The endTag calls were slightly trickier. My code could detect where a piece of hierarchy was supposed to end. To remember what kind of closing tag is needed, the within_element call detects if a particular tag has been opened. This approach wouldn't work for multiple levels of hierarchy, but this format doesn't have that. Other tools with different formats may have this requirement, so this may need to be revisited someday.



Any good translator should make a lossless round-trip with the data, unlike babelfish. I used XML::Twig to process the data and recreate the flat file. I used a hash of TwigHandlers, which called separate subs for each type of tag. I noticed that there is symmetry in the code with the parser and the writer of the data, particularly in the code that has to read the flat file and understand the order of the fields. This same ordering is needed to take the XML field values and put them into the flat file. I was not able to take advantage of this symmetry, so I ended up with code that I feel could be improved somehow. I also ended up with the fields being described in the module documentation, so now I have the order in three places instead of one. Darn!



I used the XPath approach to parse the XML. I had the problem that the flat file data was not available until the closing tags were parsed, so things tended to come out in an order reminiscent of reverse polish notation. I used some local variables to store things so that they could be written out in the correct order once the closing tag was detected. This is analogous and possibly symmetric with the endTag manipulations in the XML writer. Once again, it will cause problems when deeper hierarchy is needed and is an opportunity for removal of redundancy in the code.



The biggest challenge in this project was determining the proper type of calls to use in XML::Twig. There are many to choose from! XML::Writer was much easier. This follows the general principle that it is easier to transmit than to receive.



New Modules and other activities
Installed Spreadsheet::WriteExcel with cpan.
Install okay.
Tried test program from previous version (0.39) It broke compatibility with gnumeric, so I reported the problem to jmcnamara with msg on perlmonks. I hope he fixes it, I really like both WriteExcel and gnumeric.



Installed Math::SnapTo with cpan
Install was okay, except I got an old version so I reinstalled by hand, which worked fine.
Tried a bunch of test cases, I wouldn't use this module - it seems to have many bugs.



Posted on problems with a new snippet. Noted that root cause of rounding problems were caused by typing lots of digits of pi instead of using 4*atan2(1,1).


SAX

Matts on 2002-12-26T22:34:58

For what it's worth, it sounds like what you're doing is a job that SAX is perfect for - reading, transforming, writing. SAX cleanly gives you these layers in a well designed manner. Unfortunately the book you're learning from doesn't cover SAX very well, because it was started before the XML::SAX project was finalised, however there's another book published by New Riders called XML and Perl which is supposed to cover SAX better (I haven't received my copy yet).

Not that you'd re-write all this now you have something that works, but this might be useful for future reference.

Re:SAX

toma on 2002-12-28T10:17:11

I'll give SAX a try. I have no problems with rewriting all my code, especially since I am trying to create an example for others to follow, not just a quick solution for a specific problem.

Perl & XML has a chapter on SAX. I'll start there, and I have ordered the New Riders book. Let me know if you have more recommendations for good SAX tutorials or documentation! Otherwise I'll just start slogging through XML::SAX::Intro and friends.

As I mentioned, I am particularly interested in exploiting the symmetry between reading and writing a quasi-flat (lumpy?) file. Is there a trick that will let me accomplish this? The existance of such a trick would be a light at the end of the tunnel for me.

I would like to be able to create a SAX driver for my non-XML source that can do this. This would enable me to change the code in only one place when I need to fix a bug or adapt to changes in the quasi-flat file format. As my code stands now, I will have to change both the reader and the writer.

Thanks,
-toma

Re:SAX

Matts on 2002-12-28T11:41:20

Unfortunately in SAX readers and writers are distinct too, simply because the operations tend to be very different. So you would have to write separate modules that did both.

However there's nothing stopping you unifying some of the code if it's relevant to do that. You could put functions that you would use for both reading and writing in a separate package.

A good example to look at for readers is Pod::SAX. Also check out XML::Generator::DBI. As far as writers go, there's not much detail on them. I tend to use readers only, and for writing I'll use a formatter like XSLT, but you probably need something a bit more strict with regard to whitespace than XSLT. Luckily SAX is optimised for filtering/writing, so those will be the easy bits.