XML::Checker::Parser

pudge on 2001-11-04T23:22:25

redsquirrel sends in his tale of XML adventures:

My current project loomed over my cube like a foreboding fortress. My usually focused mind was easily distracted as I thrashed about, trying to come up with a "magical solution" to the overwhelming problem before me.

The problem: I work for a large organization (perhaps part of the problem :). We have numerous web applications that must have a company-wide look-and-feel. I must design a system that can manage all of these applications' HTML templates from a single location.

When I embarked on this quest for the "magical solution," I was advised to look into XML. Immediately, I was comforted by the security of knowing that other Perl hackers had walked this path before. I knew that if I could learn enough XML to get started, a visit to CPAN would surely provide me with the resources I needed.

Once I had a few XML documents under my belt, I headed off to CPAN to pick up some tools. XML::Parser was the first module I came across. Due to my laziness, though, I quickly became disheartened at the number of styles, handlers, and constructor arguments I had to deal with.

I continued down the aisle until I found XML::Simple. This was more like it! A few lines of code produced an XML document packed into a neatly organized Perl object, ready for manipulation! Ahh, laziness has its rewards. My triumph was short-lived. As I started doing some preliminary tests with XML::Simple, I learned about DTD's and "valid" XML documents. I wondered if XML::Simple validated the XML it so nicely parsed. The module lived up to its name and did not.

With a slightly more educated eye, another perusal of CPAN produced the "magical solution" I was looking for (and ironically brought me back to where I had started). XML::Checker::Parser extends XML::Parser, and this time, it didn't scare me quite so much. I held my laziness in check long enough to do some preliminary testing with the less "Simple" interface and was handsomely rewarded with the ability to create a "valid" XML Perl object in just 3 lines:

use XML::Checker::Parser; my $cp = new XML::Checker::Parser(Style => 'Objects'); my $obj = $cp->parsefile($xml_doc);

Now, when I look upon the fortress that is this project, it does not feel as intimidating as it once was. My coding skills alone would be futile against such a task, but utilizing this module has given me access to the powerful skills of its authors: Enno Derksen (XML::Checker), Clark Cooper (XML::Parser v2.x), and Larry Wall (XML::Parser v1.0). Perl's power isn't found in any one individual's abilities, it is found in the collective strength of the multitude of CPAN authors.

A slight warning

Matts on 2001-11-05T08:14:53

Unfortunately XML::Checker::Parser is a bit buggy, and is known to not be a 100% correctly validating parser (I'm not sure what the fringe conditions are - it's changed developers recently, and hasn't been run through a validation suite to test it).

As an alternative, try XML::LibXML's validating mode. It's based on libxml2 which is a validating parser, and known to be good and well tested. Yes, I'm biased, but I will admit there are some bugs in the validating code in XML::LibXML (not in libxml2).

Also, I'm working on XML::SAX::Simple, which will give you all the power of XML::Simple, but built on top of SAX. This will allow you to use any of the SAX parsers out there (e.g. XML::SAX::PurePerl, XML::LibXML::SAX::Generator (i.e. XML::Simple + Validation!), XML::SAX::Expat, XML::Parser::PerlSAX, XML::Generator::DBI, etc) to build XML::Simple objects. Expect a release this week sometime.

another small warning

dito on 2001-11-05T13:56:22

You can't use the Objects style if you have an element named "Characters" or any element has an attribute named "Kids" as they will clash with the builtin naming convention.

It's probably best to use something like XML::XPath which wraps around XML::Parser. This creates a tree of objects with no reserved word problems and also lets you wander around using XPath which is a W3C standard, quite nice to use and knowing it could help you in lots of non-Perl situations too.

Thanks for the warnings

redsquirrel on 2001-11-05T14:24:57

Good to know. I will look into XML::SAX::Simple and XML::XPath. Sounds like they are superior solutions!

Some extra warnings

mir on 2001-11-06T12:26:37

I just think I need to mention that XML::Simple, while easy to use for config files and generally data-oriented XML, cannot be used for XHTML: it does not process mixed-content (<p>this is <b>mixed</b> content</elt>: text and tags mixed).

If you want to process that kind of XML you will have to use either XML::Parser itself (XML::Simple is based on XML::Parser) or XML::Parser::PerlSAX or XML::XPath, or XML::Parser::PurePerl (which would make Matt _really_ happy ;--) or (of course!) XML::Twig.

You can take care of validation by running either XML::Checker::Parser or an external validator in a first step, then using a non-validating module once you know your data is valid.

One last comment about the strength of XML... both Enno Derksen and Clark Cooper seem to be awol :-(

a note about xml validation and libxml2

albatross on 2006-01-05T02:35:25

I was recently tasked with taking invalid xml (XML that is so broken that it was not even valid XML. ie missing closing tags, etc etc etc...). No DTD, no schema, no namespaces. So the search for an XML validator began.

The XML contained critical medical information which is itself our companies cash-cow. Anyway, this stuff was totally ****ed and we needed to fix it so we could improve our current and quite legacy, hypercard, editorial system.

Yes, hypercard in 2006.

Anyway, the point is, that after trying out XML::Xpath, XML::Twig, XML::Simple, and with commercial GUI tools like XML Spy, I finally found XML::LibXML!!!

XML::LibXML is special because it impliments the libxml2 library which is a completely legal XML parser/validator. The module happens to be quick and simple to use.

You can easily use a DTD or schema for validation though get ready for building a DTD (or schema for tons of control). For us a Schema was impossible because the XML was just too wild.

So the problem's solved and the XML is valid and a bit more under control. We can, hopefully soon, move onto a more flexible editorial solution.

By the way XML::Twig doesn't even try to be an XML validator but instead is an AWESOME way to munge through Gigs of XML pretty damn fast. I think I was using XML::Xpath for something smaller and didn't end up using XML::Simple at all (though I'm sure it's very useful).

Thanks to these tools, PERL is just excellent with XML.