Parsing HTML headaches...

Matts on 2003-02-28T11:52:49

XML::LibXML has a really great HTML parser in it, and I'm using it to parse HTML emails. The only problem is that my email parser has already decoded any alternate encodings in the email (e.g. GB2312) down to UTF-8. Now when XML::LibXML sees these HTML documents, if they happen to have something like:

<meta http-equiv="Content-Type" content="text/html; charset=GB2312">

in them, then the parser treats them as GB2312. Ugh. If I strip out the META tag it seems to treat them as Latin-1 or something else completely by default. It's all very strange.

And it took me HOURS to figure out that this is what was happening. I eventually found out (this morning, after having worked on it until late into the night) that the only way to get XML::LibXML to always recognise the content as UTF-8 is to say that it's UTF-8 in the META tag. So I actually have to replace the META tag before even getting to XML::LibXML (which seems a bit like parsing it before parsing it, but at least it works). In the end I plumped for this horrible pre-processing regexp:
my $meta = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">';
unless ( $in =~ s/<meta[^>]*charset=[\w-]*[^>]*>/$meta/gi ) {
  unless ( $in =~ s/<head>/<head>$meta/i ) {
    # no <head> at all, so wrap the whole thing in one
    $in =~ s/^(.*)$/<html><head>$meta<\/head><body>$1<\/body><\/html>/s;
  }
}
I think there are probably more unless() blocks I need to add in there, but it has worked on all the emails I've tried it on so far.
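Once the META tag declares UTF-8, the string goes straight to the HTML parser. For context, that end of things looks roughly like this (just a sketch; the recover() call and the eval are assumptions about how to cope with tag soup, not part of the code above):

use XML::LibXML;

my $parser = XML::LibXML->new();
$parser->recover(1);    # assumption: be lenient about broken markup

# $in has already had its META tag rewritten to say UTF-8
my $doc = eval { $parser->parse_html_string($in) };
if ($doc) {
    # e.g. pull out the plain text of the message
    my $text = $doc->documentElement->textContent;
}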

With one exception (of course): MS-HTML generated by MS-Word. This is the most horrible monstrosity you've ever seen. In the end I punted: if I can't parse it properly with XML::LibXML I resort to piping it through lynx -dump. That kinda works even for MS-HTML, and although it'll be slower than the in-process XML::LibXML parsing, it only runs when I can't parse it the fast way.
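The fallback is nothing clever, just a pipe out to lynx. Roughly (a sketch of the idea; the temp file handling and the -force_html flag are assumptions rather than the real code):

use File::Temp qw(tempfile);

sub html_to_text_via_lynx {
    my ($html) = @_;
    # write the HTML out somewhere lynx can read it
    my ($fh, $filename) = tempfile(SUFFIX => '.html', UNLINK => 1);
    print $fh $html;
    close $fh;
    # -dump renders the page and writes the plain text to stdout
    return `lynx -dump -force_html $filename`;
}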

Yes, this is why we should have had XML in the first place. Wish I could go back and fix history. *sigh*.


tidy

gav on 2003-02-28T15:18:45

Tidy has an option to clean up Word HTML which might be handy, especially now there are Perl bindings.
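For what it's worth, with the bindings that might look something like this (a sketch from memory; I'm assuming the HTML::Tidy constructor takes tidy's word-2000 cleanup option directly, so check the docs for the exact option names):

use HTML::Tidy;

# assumption: tidy config options can be passed straight to new();
# word_2000 switches on tidy's Microsoft Word HTML cleanup
my $tidy  = HTML::Tidy->new( { word_2000 => 1, output_xhtml => 1 } );
my $clean = $tidy->clean( $word_html );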

XML is doomed

mstevens on 2003-02-28T18:33:58

The more popular XML is getting, the more it's becoming like HTML.

RSS is the most end-user-visible XML application, and the validity of generated RSS is so bad that a reasonable number of people seem to have started writing non-XML parsers to read it and accept anything...

Re:XML is doomed

sbwoodside on 2003-03-01T07:40:32

Well, it doesn't help when you have something like XHTML, which is supposed to be a gateway drug to XML somehow, except that people write their XHTML in non-validating editors. So the vast majority of XHTML out there isn't XHTML at all, and if it's not XML then it really is pointless to bother. Which is why I support the "XHTML considered harmful" gang.

If more people used XSLT that would improve the situation a lot, since it can only output well-formed XML (in most situations).

These people who are outputting bad RSS ... what tools are they using to create it? It's probably better to blame the tool makers than the individuals who can't be bothered to validate.

simon

Re:XML is doomed

Matts on 2003-03-01T09:13:53

"These people who are outputting bad RSS ... what tools are they using to create it?"

Probably Perl, or PHP or ASP. And not using tools, just using strings.
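Which is a shame, because doing it properly is barely more work than gluing strings together. A minimal sketch with XML::RSS (the URLs here are just placeholders):

use XML::RSS;

my $rss = XML::RSS->new( version => '1.0' );
$rss->channel(
    title       => 'My Journal',
    link        => 'http://example.org/journal/',
    description => 'Journal entries',
);
$rss->add_item(
    title => 'Parsing HTML headaches...',
    link  => 'http://example.org/journal/1.html',
);
print $rss->as_string;    # well-formed XML, entities escaped for you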