xml thuggery

tinman on 2003-11-14T14:45:50

So I thought I knew XML. Bah. Fooling around with the TREC competition data told me otherwise.

The problem was pretty simple, or so I thought. Just parse the TREC sample data (all 3 GB of it), index it, and then build ever more "intelligent" query parsing functionality on top. The first snag in that grand plan was... the TREC XML data fails to parse!

For one, there is no XML header. But more importantly, there is an external (unreferenced) DTD available which declares entities. If I just throw the document at the parser, it barfs because it can't resolve the external entities! Begging and pleading with the Xerces parser didn't help. Nor did using EntityResolver.
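To illustrate the failure in miniature, here is a sketch using Python's standard library (not Xerces or the Perl parsers). The fragment, the tag names, and the `eacute` entity are all made up for the example. Declaring the entity in an internal DTD subset is shown only as one possible workaround; whether it scales to the real DTD is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: it uses an entity that is declared nowhere in the text itself.
FRAGMENT = "<DOC><DOCNO>X-1</DOCNO><TEXT>caf&eacute; society</TEXT></DOC>"

# Hypothetical internal subset that declares the entity inline.
INTERNAL_SUBSET = '<!DOCTYPE DOC [ <!ENTITY eacute "&#233;"> ]>'


def try_parse(xml_text):
    """Return the root element, or None if the parser rejects the input."""
    try:
        return ET.fromstring(xml_text)
    except ET.ParseError:
        return None


bare = try_parse(FRAGMENT)                        # rejected: undefined entity
declared = try_parse(INTERNAL_SUBSET + FRAGMENT)  # accepted: entity is declared
```

The point is only that a conforming parser refuses the bare text, while the same text parses once the entity declaration is visible to it.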

So, I cursed Java and its XML parsers and came back to my trusty Perl roots. The standard XML::Parser said the same thing! Despite methods in the Java parser instance assuring me that external entities CAN be safely ignored, the parser doesn't seem to want to do that. Then I looked at other Perl-based parsers and found XML::LibXML. It specifically has a method that says don't resolve "external_entities". Umm... that didn't seem to work either?

I didn't really want to use an ugly hand-rolled parser solution (because that is going to break at some point, sooner rather than later). So, the only remaining option seems to be to use HTML::TokeParser and find tags. *sigh* And that, sadly, is the only solution that seems to work.

Some days (weeks) it just doesn't seem to pay to get out of bed. With strong winds in York (and ultra cold too *shiver*), this seems to be one of them.


HTML::Parser?

merlyn on 2003-11-14T15:20:09

HTML::Parser can be used in "XML mode", as I show in a recent column. Maybe it'll ignore enough of the uglies to get past them.

Re:HTML::Parser?

tinman on 2003-11-16T16:43:39

Cool, it works! :) HTML::Parser didn't give a hoot about the entities, and yeah, I think I can make this less fragile than my TokeParser attempt... thank you.
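For anyone wanting the general shape of the lenient, tag-event approach, here is a rough sketch. It uses Python's `html.parser` as a stand-in for HTML::Parser, so it is not the Perl code itself. The `DOC`, `DOCNO`, and `TEXT` tags and the `&hyph;` entity are assumptions made up for the example, and `DocCollector` and `collect_docs` are hypothetical names.

```python
from html.parser import HTMLParser


class DocCollector(HTMLParser):
    """Collect the text of each child field inside every <DOC> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.docs = []        # one dict per finished <DOC>
        self._current = None  # the <DOC> being filled, if any
        self._field = None    # the child tag whose text is being read

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <DOC> arrives as "doc".
        if tag == "doc":
            self._current = {}
        elif self._current is not None:
            self._field = tag
            self._current.setdefault(tag, "")

    def handle_endtag(self, tag):
        if tag == "doc" and self._current is not None:
            self.docs.append(self._current)
            self._current = None
        elif tag == self._field:
            self._field = None

    def handle_data(self, data):
        # Text outside a field (e.g. newlines between tags) is dropped.
        if self._current is not None and self._field is not None:
            self._current[self._field] += data


def collect_docs(text):
    parser = DocCollector()
    parser.feed(text)
    parser.close()
    return parser.docs


sample = (
    "<DOC>\n<DOCNO> X-1 </DOCNO>\n"
    "<TEXT>soft&hyph;ware &amp; more</TEXT>\n</DOC>\n"
    "<DOC><DOCNO>X-2</DOCNO><TEXT>b</TEXT></DOC>"
)
docs = collect_docs(sample)
```

In this sketch the unknown entity is passed through as literal text instead of aborting the parse, which is the behaviour that matters here.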

Catalogs

Matts on 2003-11-14T16:33:21

Looks like you want to use Catalogs.

Re:Catalogs

tinman on 2003-11-16T16:48:17

Err, just to clarify something from the link... it's like a cache for externally referenced entities, which will be looked up whenever the parser encounters them?

Where does this fit in with a DTD/Schema (which would normally have the entity declarations)?

UTF-8+names

brianiac on 2003-11-14T17:31:43

You could try Tim Bray's UTF-8+names, though you would likely have to write your own decoder.
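A rough idea of what such a decoder might look like, sketched in Python. This is not checked against the actual proposal; it only illustrates the general idea of expanding named references at decode time. It assumes the HTML 4 entity names from Python's `html.entities`, and `decode_utf8_plus_names` is a hypothetical name.

```python
import re
from html.entities import name2codepoint

# The five names XML itself predefines are left for the XML parser to handle.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

_NAME_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def decode_utf8_plus_names(raw: bytes) -> str:
    """Decode UTF-8 bytes, then expand known named references to characters."""
    text = raw.decode("utf-8")

    def expand(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave it untouched
        return chr(name2codepoint[name])

    return _NAME_REF.sub(expand, text)


decoded = decode_utf8_plus_names(b"caf&eacute; &amp; soft&hyph;ware")
```

Names outside the table, like the made-up `&hyph;` here, are passed through unchanged, so any data-specific entities would still need their own table.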