A colleague was developing a PowerCenter transfer with an XML file as the source. She kept getting an error that referred to a line number in the file, but there didn't appear to be anything wrong with the indicated line. I ran the file through libxml (via perl), and it reported a different line number for the error. Then the error was obvious: the encoding claimed to be UTF-8, but there were characters such as ë and Ãâ in the file. Changing the declared encoding to ISO-8859-1 seemed to fix it. I'm not sure yet if the supplier of the file will fix it, or if we'll have to fix their tag-soup gunk (there is as yet no perl involved in the process, so I'm not sure if Grant's Rule applies). I went to Google to look for any other info on PowerCenter, XML, and line numbers, and not far from the top of the list were my own posts here on use.perl. Now with this post, I may show up even higher :-)
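For anyone chasing the same kind of error, here is a minimal sketch of one way to find the first line that is not valid UTF-8, using only the core Encode module. The helper name `first_invalid_utf8_line` and the sample data are made up for illustration; this is not what PowerCenter or libxml does internally.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: return the 1-based number of the first line that
# is not valid UTF-8, or undef if every line decodes cleanly.
sub first_invalid_utf8_line {
    my ($bytes) = @_;
    my $n = 0;
    for my $line (split /\n/, $bytes, -1) {
        $n++;
        # FB_CROAK makes decode() die on malformed input;
        # LEAVE_SRC keeps it from modifying $line.
        my $ok = eval { decode('UTF-8', $line, FB_CROAK | LEAVE_SRC); 1 };
        return $n unless $ok;
    }
    return undef;
}

# Sample data (an assumption): line 2 holds a lone 0xEB byte, which is
# "ë" in Latin-1 but not a valid UTF-8 sequence.
my $sample = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<name>No\xEBl</name>\n";
my $bad = first_invalid_utf8_line($sample);
print defined $bad ? "first invalid line: $bad\n" : "all lines valid\n";
```

On a real file you would read the content in raw mode (no `:encoding` layer) so that the bytes reach the check untouched.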
Re:More of the same
runrig on 2007-08-23T20:30:51
Hmm, I am no XML nor encoding wizard, so I wonder why this works without error (the xml file has a utf-8 encoding declaration, and characters above 127 ascii are not encoded):
    my $file = "file.xml";
    open(FH, "<:encoding(iso-8859-1)", $file) or die "Error: $^E";
    my $p = XML::LibXML->new();
    $p->parse_fh(\*FH);
Re:More of the same
Aristotle on 2007-08-23T20:56:17
Because the file is Latin-1-encoded, and if you open it like that, then Perl will decode from Latin-1 to characters as it reads the file, so libxml2 will never actually see Latin-1.
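The decoding step can be seen in isolation with the core Encode module. This is only a sketch with made-up sample bytes; it shows why the same bytes are accepted as Latin-1 but rejected as UTF-8, and makes no claim about what happens inside libxml2.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

my $bytes = "No\xEBl";    # Latin-1 bytes, as stored in the file (sample data)

# Decoding as ISO-8859-1 cannot fail: every byte maps to one character.
my $as_latin1 = decode('ISO-8859-1', $bytes);

# Decoding the same bytes as strict UTF-8 dies, because 0xEB starts a
# multi-byte sequence that the following "l" does not continue.
my $as_utf8 = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };

printf "Latin-1: %d characters, third is U+%04X\n",
    length($as_latin1), ord(substr($as_latin1, 2, 1));
print "UTF-8: ", (defined $as_utf8 ? "decoded" : "rejected"), "\n";
```

This prints `Latin-1: 4 characters, third is U+00EB` followed by `UTF-8: rejected`.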
Re:More of the same
Dom2 on 2007-08-24T19:05:06
Look at using iconv (or its Perl equivalent piconv) or perhaps recode. But to be quite honest, if it's not well formed, send it back and tell them to sort it out. If it's not well formed, it's not XML.
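If the conversion has to happen in Perl rather than on the command line, a rough equivalent of `iconv -f ISO-8859-1 -t UTF-8` can be sketched with `Encode::from_to`. The helper name `transcode` is hypothetical, and the sample string stands in for the file contents.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Hypothetical helper: transcode a byte string from one encoding to another.
sub transcode {
    my ($bytes, $from, $to) = @_;
    # from_to() converts in place and returns undef on failure.
    defined from_to($bytes, $from, $to)
        or die "cannot convert from $from to $to";
    return $bytes;
}

my $utf8 = transcode("No\xEBl", 'ISO-8859-1', 'UTF-8');

# Show the resulting bytes in hex: the single 0xEB byte becomes C3 AB.
print join(' ', map { sprintf '%02X', ord } split //, $utf8), "\n";
```

This prints `4E 6F C3 AB 6C`. After such a conversion the content matches a UTF-8 encoding declaration.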
Re:More of the same
runrig on 2007-08-24T23:39:17
Oh yeah, I know it would be easy to fix on this side if I had to, but I did manage to get correctly encoded files from them. Their UTF-8 encoding is broken, but ISO-8859-1 and US-ASCII work.
Re:More of the same
Dom2 on 2007-08-26T06:53:39
Careful. If it's coming from a Windows system, it's more likely to be in Windows-1252, not ISO-8859-1.
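The two encodings differ only in the byte range 0x80 to 0x9F, where Windows-1252 puts printable characters such as curly quotes and ISO-8859-1 has control codes. A small sketch with the core Encode module, using one sample byte chosen for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $byte = "\x93";    # left double quotation mark in Windows-1252

# The same byte decodes to different characters under the two encodings.
printf "cp1252:     U+%04X\n", ord(decode('cp1252', $byte));
printf "iso-8859-1: U+%04X\n", ord(decode('iso-8859-1', $byte));
```

This prints `U+201C` for cp1252 and `U+0093` (a C1 control character) for iso-8859-1, so a file with bytes in that range is a hint that it is really Windows-1252.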