I wrote a translator (described yesterday) that converts a particular flat file format to XML, and another from the XML format back into the flat file.
The flat file has a line-at-a-time format with the first token on a line determining the type of the data on the line. The lines are in a hierarchical data structure, with various first-position tokens specifying the hierarchy.
I made an object that contained an XML::Writer object and a hash of anonymous subs, where the key to the hash is the first-position token. The code in the anonymous subs parsed the line of the flat file and then sent the data to XML::Writer to create the XML-formatted text. I used four types of calls to XML::Writer: emptyTag, startTag, endTag, and within_element.
The emptyTag calls were easiest. No hierarchy,
just a single tag with parameters.
The startTag calls open up a section of hierarchy.
This is also easy.
The endTag calls were slightly trickier. My code
could detect where a piece of hierarchy was
supposed to end, but it also had to know which
closing tag to emit. For that I used the
within_element call, which reports whether a
particular tag is currently open. This
approach wouldn't work for multiple levels of
hierarchy, but this format doesn't have that.
Other tools with different formats may have
this requirement, so this may need to be
revisited someday.
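
Here is a minimal sketch of the dispatch-table approach described above. The record types (UNITS, NET, PIN), the element names, and the close_net helper are made up for illustration and are not the real file format. It assumes a version of XML::Writer that accepts a string reference for OUTPUT.

```perl
use strict;
use warnings;
use XML::Writer;

# Hypothetical flat file: the first token on each line names the record type.
my @lines = (
    'UNITS mm',
    'NET clk',
    'PIN U1 3',
    'PIN U2 7',
    'NET reset',
    'PIN U1 4',
);

my $xml = '';
my $writer = XML::Writer->new(OUTPUT => \$xml, DATA_MODE => 1, DATA_INDENT => 2);

# Close the open <net> section, if there is one. within_element() reports
# whether a <net> is currently open, which is enough for one level of nesting.
sub close_net {
    $writer->endTag('net') if $writer->within_element('net');
}

# Dispatch table: first-position token => sub that handles the other fields.
my %handler = (
    UNITS => sub {                       # no hierarchy: a single empty tag
        my ($value) = @_;
        close_net();
        $writer->emptyTag('units', value => $value);
    },
    NET => sub {                         # opens a section of hierarchy
        my ($name) = @_;
        close_net();
        $writer->startTag('net', name => $name);
    },
    PIN => sub {                         # lives inside the current <net>
        my ($part, $number) = @_;
        $writer->emptyTag('pin', part => $part, number => $number);
    },
);

$writer->startTag('design');
for my $line (@lines) {
    my ($token, @fields) = split ' ', $line;
    my $sub = $handler{$token} or die "Unknown record type '$token'\n";
    $sub->(@fields);
}
close_net();
$writer->endTag('design');
$writer->end;

print $xml;
```

Because within_element is also true when the named tag is an ancestor of the current element, a format with deeper nesting would probably need an explicit stack of open sections instead.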
Any good translator should make a lossless
round-trip with the data, unlike babelfish.
I used XML::Twig to process the data and recreate
the flat file. I used a hash of TwigHandlers,
which called separate subs for each type of
tag. I noticed that there is symmetry between
the code that parses the data and the code that
writes it, particularly in the code that has to
read the flat file and understand the order of
the fields. This same ordering is needed to
take the XML field values and put them into
the flat file. I was not able to take advantage
of this symmetry, so I ended up with code that
I feel could be improved somehow. I also ended
up with the fields being described in the
module documentation, so now I have the order
in three places instead of one. Darn!
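
One possible way to keep the field order in a single place is a table of field names that both directions consult. This is only a sketch, not what my modules do today: %FIELDS, parse_line, and format_line are hypothetical names, and the PIN and NET layouts are invented.

```perl
use strict;
use warnings;

# Hypothetical record layouts: the only place that states the field order.
my %FIELDS = (
    PIN => [qw(part number)],
    NET => [qw(name)],
);

# Flat line -> (token, hash of named fields), for the flat-to-XML direction.
sub parse_line {
    my ($line) = @_;
    my ($token, @values) = split ' ', $line;
    my $names = $FIELDS{$token} or die "Unknown record type '$token'\n";
    my %record;
    @record{@$names} = @values;          # hash slice pairs names with values
    return ($token, \%record);
}

# (token, hash of named fields) -> flat line, for the XML-to-flat direction.
sub format_line {
    my ($token, $record) = @_;
    my $names = $FIELDS{$token} or die "Unknown record type '$token'\n";
    return join ' ', $token, @{$record}{@$names};
}

my ($token, $record) = parse_line('PIN U1 3');
print format_line($token, $record), "\n";   # prints "PIN U1 3"
```

The same table could feed the writer's attribute list with something like map { $_ => $record->{$_} } @{ $FIELDS{$token} }, and the documentation could be generated from it too.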
I used the XPath approach to parse the XML.
I had the problem that the flat file data was
not available until the closing tags were
parsed, so things tended to come out in an order
reminiscent of reverse polish notation. I used
some local variables to store things so that
they could be written out in the correct order
once the closing tag was detected. This is
analogous to, and possibly symmetric with, the
endTag manipulations in the XML writer. Once
again, it will cause problems when deeper
hierarchy is needed, and it is an opportunity to
remove redundancy from the code.
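
The buffering described above can be sketched roughly like this. The element and attribute names are invented, and the two arrays stand in for the local variables I mentioned. XML::Twig calls a twig handler once the element has been completely parsed, so the handler for a child runs before the handler for its parent.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical XML, in the shape a flat-to-XML pass might produce.
my $xml = <<'XML';
<design>
  <net name="clk">
    <pin part="U1" number="3"/>
    <pin part="U2" number="7"/>
  </net>
</design>
XML

my @flat;      # finished flat-file lines, in output order
my @pending;   # child lines held back until the enclosing </net> is seen

my $twig = XML::Twig->new(
    twig_handlers => {
        # Runs as each <pin/> completes, which is before </net> is reached.
        pin => sub {
            my ($t, $pin) = @_;
            push @pending,
                join ' ', 'PIN', $pin->att('part'), $pin->att('number');
        },
        # Runs at </net>: emit the parent line first, then the held children.
        net => sub {
            my ($t, $net) = @_;
            push @flat, 'NET ' . $net->att('name'), @pending;
            @pending = ();
        },
    },
);
$twig->parse($xml);

print "$_\n" for @flat;
```

A single @pending buffer only copes with one level of nesting. Deeper hierarchy would need a stack of buffers, or start_tag_handlers so the parent line can be written when its opening tag is seen.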
The biggest challenge in this project was
determining the proper type of calls to use
in XML::Twig. There are many to choose from!
XML::Writer was much easier. This follows
the general principle that it is easier to
transmit than to receive.
New Modules and other activities
Installed Spreadsheet::WriteExcel with cpan.
Install okay.
Tried test program from previous version (0.39).
The new version broke compatibility with gnumeric,
so I reported the problem to jmcnamara with a msg
on perlmonks. I hope he fixes it; I really like
both WriteExcel and gnumeric.
Installed Math::SnapTo with cpan.
Install was okay, except I got an old version,
so I reinstalled by hand, which worked fine.
Tried a bunch of test cases. I wouldn't use this
module; it seems to have many bugs.
Posted on the problems with a new snippet. Noted
that the root cause of the rounding problems was
typing lots of digits of pi instead of
using 4*atan2(1,1).
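
For reference, a tiny sketch of the atan2 idiom. The short hand-typed constant is only there for comparison and is not taken from the module.

```perl
use strict;
use warnings;

# atan2(1, 1) is pi/4, so this yields pi to full double precision
# without anyone having to type the digits.
my $pi = 4 * atan2(1, 1);

# A hand-typed constant (illustrative only): easy to truncate or mistype.
my $typed = 3.14159;

printf "computed pi: %.15f\n", $pi;
printf "difference from typed constant: %.3e\n", $pi - $typed;
```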
Re:SAX
toma on 2002-12-28T10:17:11
I'll give SAX a try. I have no problems with rewriting all my code, especially since I am trying to create an example for others to follow, not just a quick solution for a specific problem. Perl & XML has a chapter on SAX. I'll start there, and I have ordered the New Riders book. Let me know if you have more recommendations for good SAX tutorials or documentation! Otherwise I'll just start slogging through XML::SAX::Intro and friends.
As I mentioned, I am particularly interested in exploiting the symmetry between reading and writing a quasi-flat (lumpy?) file. Is there a trick that will let me accomplish this? The existence of such a trick would be a light at the end of the tunnel for me.
I would like to be able to create a SAX driver for my non-XML source that can do this. This would enable me to change the code in only one place when I need to fix a bug or adapt to changes in the quasi-flat file format. As my code stands now, I will have to change both the reader and the writer.
Thanks,
-toma
Re:SAX
Matts on 2002-12-28T11:41:20
Unfortunately, in SAX, readers and writers are distinct too, simply because the operations tend to be very different. So you would have to write separate modules, one for each.
However there's nothing stopping you unifying some of the code if it's relevant to do that. You could put functions that you would use for both reading and writing in a separate package.
A good example to look at for readers is Pod::SAX. Also check out XML::Generator::DBI. As far as writers go, there's not much detail on them. I tend to use readers only, and for writing I'll use a formatter like XSLT, but you probably need something a bit more strict with regard to whitespace than XSLT. Luckily SAX is optimised for filtering/writing, so those will be the easy bits.