XML::LibXML has a really great HTML parser in it, and I'm using it to parse HTML emails. The only problem is that my email parser has already decoded any alternate encodings in the email (e.g. GB2312) down to UTF-8. Now when XML::LibXML sees these HTML documents, if they happen to have:

<meta http-equiv="Content-Type" content="text/html; charset=GB2312">

in them, then the parser treats them as GB2312. Ugh. If I strip out the META tag it seems to treat them as Latin-1 or something else completely by default. It's all very strange. For now I'm working around it like this:

my $meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
unless ( $in =~ s/<meta[^>]*charset=[\w-]*[^>]*>/$meta/gi ) {
    unless ( $in =~ s/<head>/<head>$meta/i ) {
        $in =~ s/<body>/<head>$meta<\/head><body>/i;
    }
}

I think there are probably more unless() blocks I need to add in there, but it has worked on all the emails I've tried it on so far.
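Wrapped up as a runnable sketch (the helper name `force_utf8_meta` is hypothetical, and the patterns assume lowercase, attribute-free <head> and <body> tags):

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution chain from the post:
# rewrite (or inject) a charset META so the HTML parser sees UTF-8.
sub force_utf8_meta {
    my ($in) = @_;
    my $meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

    # Case 1: replace any existing charset META tag.
    unless ( $in =~ s/<meta[^>]*charset=[\w-]*[^>]*>/$meta/gi ) {
        # Case 2: no charset META -- inject one just inside <head>.
        unless ( $in =~ s/<head>/<head>$meta/i ) {
            # Case 3: no <head> at all -- synthesize one before <body>.
            $in =~ s/<body>/<head>$meta<\/head><body>/i;
        }
    }
    return $in;
}

my $html = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=GB2312"></head><body>hi</body></html>';
print force_utf8_meta($html), "\n";
```

The fallback order matters: only when no charset META exists does it touch <head>, and only when there is no <head> does it build one in front of <body>.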
Re:XML is doomed
sbwoodside on 2003-03-01T07:40:32
Well, it doesn't help when you have something like XHTML, which is supposed to be a gateway drug to XML somehow, except that people write their XHTML in non-validating editors, so the vast majority of XHTML out there isn't XHTML at all; and if it's not XML then it really is pointless to bother. Which is why I support the "XHTML considered harmful" gang.
If more people would use XSLT, that would improve the situation a lot, since it can only output well-formed XML (in most situations).
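Part of why XSLT can't emit mismatched tags: the stylesheet is itself XML, and the processor serializes a result tree rather than concatenating strings. A trivial, illustrative stylesheet fragment (hypothetical element names):

```xml
<!-- A literal result element inside a template must itself be
     well-formed XML, and the processor serializes the result tree,
     so an unclosed <item> simply cannot be expressed here. -->
<xsl:template match="entry" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <item>
    <title><xsl:value-of select="title"/></title>
  </item>
</xsl:template>
```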
These people who are outputting bad RSS... what tools are they using to create it? It's probably better to blame the tool makers than the individuals who can't be bothered to validate.
simon

Re:XML is doomed
Matts on 2003-03-01T09:13:53
These people who are outputting bad RSS... what tools are they using to create it?
Probably Perl, or PHP or ASP. And not using tools, just using strings.
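That failure mode is easy to demonstrate: interpolating raw strings into an XML template produces ill-formed output as soon as the data contains markup characters. A minimal sketch (hypothetical data; real code should reach for a proper writer module rather than a hand-rolled escape):

```perl
use strict;
use warnings;

# String-templated "XML": the bare & makes this ill-formed.
my $title = 'Fish & Chips';
my $naive = "<item><title>$title</title></item>";

# Minimal repair: escape the XML-special characters before interpolating.
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;     # must run first, or it re-escapes the others
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}
my $escaped = '<item><title>' . xml_escape($title) . '</title></item>';
print "$naive\n$escaped\n";
```

Nothing in the naive version ever checks well-formedness, which is exactly how bad RSS escapes into the wild.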