A couple of months ago, the project at work started running into problems shortly after I integrated an XSLT engine into it. I had a sneaking suspicion that there was going to be a problem, since there were two versions of expat lurking under the hood.
See, there was a hacked version of expat that didn't do entity translation. When you're doing a pass-through parse, sometimes you really do want the entities to remain as they are. For example, it's bad to convert numeric character entities coming from the database as you parse them, right before you spit them out to the browser. The browser can deal with tokens like �, but it might not be able to recognize a multibyte Unicode character as a multibyte Unicode character. So keeping the entities unparsed is sometimes both the right thing to do and the tasty way to do it[*].
So now there's this situation where we have dueling copies of expat integrated into the project. One of the things that depends on expat will be broken, and as luck would have it, the thing that broke is the more important subsystem that has worked reliably for ~5 years (until I integrated that XSLT engine). As expected, my changes broke the build, so I rushed out some hacks to stop the offending parts of expat from colliding. (Basically, some #define Token new_Token hacks for all of the functions that need two definitions.)
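A minimal sketch of that kind of renaming hack, squeezed into one file for illustration. The names parse_token and new_parse_token are hypothetical stand-ins, not real expat functions, and the real hack presumably lived in a header rather than inline like this:

```c
#include <string.h>

/* Stock copy: hypothetical stand-in for the unmodified library function.
   It translates the one entity it knows about. */
const char *parse_token(const char *token)
{
    return strcmp(token, "&amp;") == 0 ? "&" : token;
}

/* From here on, every use of the name parse_token is rewritten by the
   preprocessor, so the patched copy gets a distinct linker symbol. */
#define parse_token new_parse_token

/* Patched copy: compiled as new_parse_token, passes entities through. */
const char *parse_token(const char *token)
{
    return token;
}

/* A caller compiled with the macro in effect binds to new_parse_token. */
const char *passthrough_caller(const char *token)
{
    return parse_token(token);
}

#undef parse_token

/* A caller compiled without the macro binds to the unadorned name. */
const char *stock_caller(const char *token)
{
    return parse_token(token);
}
```

Which definition a caller ends up with depends entirely on whether the #define was visible when that caller was compiled, so an object file built before the macro existed keeps pointing at the old, unadorned symbol.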
Do a clean checkout and a full build, and everything works. Do a make clean and make all on the offending component, and everything works.
Fast forward a couple of months, and things are acting wonky. The other component is now broken in some hard-to-describe circumstances. The only way to fix it is to build the entire project, including a broken XSLT component, then make clean && make install on the XSLT component to get everything working. Very strange indeed.
After about an hour of hacking around to understand how something that simply shouldn't be possible was happening, I found the cause of the problem. The C files that contained the definitions of the redefined symbols were updated from CVS and recompiled. But the C files that contained the invocations were never changed, so they were never recompiled, and the .o files contained stale references to the unadorned function names.
And here was the root cause of the problem. In development, we do a 'make clean' all the time. But in testing, they do a simple 'cvs update && make' instead. So the stale .o files were linked against the new definitions, loading both symbols into the same .so file. Ugh. And, like most Makefiles, this one did not list dependencies of C files on other C files or header files that should trigger a rebuild. Double ugh.
Funny. After years of using a real language, I had nearly forgotten that these kinds of partially-stale configurations were possible.
*: Thank you Quaker Oats for one of the most memorable ad campaigns from the 1980s. ;-)
Will bite you in the end. Trust me. I've just come from a project where we've used hack upon hack upon hack upon hack to ensure that entities get preserved in one state or another. But the trouble is that you've effectively got several layers of character encoding. In our case, we ended up with stuff in the database which contained &amp; et al. Well, in some tables we did. In others we had UTF-8. And the search engine saw character references and turned them into latin-1. Sometimes. So you never really knew what you were going to get back from the database, and what kind of escaping and transcoding it required. I have one particular function which first attempts to encode a string using UTF-8, and if that fails, falls back to latin-1.
Needless to say, I am horrified by all this. And it pretty much could all have been prevented by working on the XML Infoset instead of getting involved with XML's lexical details. That way you would ensure that you have only a single known encoding of input data. You would then know to apply a single set of transformations to get it understood by a browser.
After 4 years of delivering websites where we have attempted to turn Unicode into something simpler for the thick-as-pig-shit browsers, we've gradually come to the conclusion that it's better to spit out UTF-8 regardless. If the user gets funny about characters, tell them to get a better font. Some of the Microsoft core fonts are surprisingly good. They work really well in Firefox. I just wish that the Bitstream Vera ones had such a large character repertoire.
Although you probably know most of it, this tutorial about Unicode and XML is worth a quick read.
Anyway, congratulations on finding that nasty bug!
-Dom
Re:Preserving Entities
ziggy on 2004-08-19T12:55:18
Actually, it sounds like you are dealing with a different problem.
We own the data that we are serving up through this web app. So it's fully normalized by the time it's parsed in this pipeline. The problem is more about keeping the entities that are in there from being converted into UTF-8.
Re:Preserving Entities
Dom2 on 2004-08-19T13:21:21
You're right, I was talking about a semi-different problem, which descended into a large rant. :-)
I'm still curious about the need for character references rather than UTF-8 bytes though. Which browsers were giving you trouble?
-Dom
Re:Preserving Entities
iburrell on 2004-08-19T17:54:24
One solution to this problem is to turn all non-ASCII characters into numeric character references before writing output. This saves having to worry about the encoding of characters in binary.
Re:Preserving Entities
ziggy on 2004-08-19T18:55:28
Right, but if you postprocess the non-ASCII characters, then you're postprocessing the data. The constraint on this project is to: (1) load XML into a buffer, (2) parse that buffer, (3) send chunks of that buffer as-is to prevent extraneous [re-]processing.
Re:Browsers
ziggy on 2004-08-19T13:06:29
Yes, you're right. The browser is where the problem manifests itself, not the cause of the problem.
It's been a while since I looked at AxKit, but I think the (XML) output that is serialized to the browser is properly re-escaped. This is the correct behavior according to the XML character model.
For some reason, this system goes to extreme measures to do as little work as possible in each transaction. That includes passing around raw XML text instead of higher level data structures, and making as few modifications to that buffer as technically possible.
I'm not going to comment as to whether this "performance optimization" is good or bad, useful or not, or even whether I agree with it. But that's the way this system is designed. And that's why it was ripe for this integration bug once a real version of expat was loaded.
Re:Browsers
Matts on 2004-08-19T15:22:37
I think libxslt's HTML renderer just sends raw UTF-8 to the output - it doesn't go to any effort to re-encode it. I'm still not seeing any problems though - maybe things have changed in browser terms, but opinions of what you have to do haven't?
Re:Browsers
ziggy on 2004-08-19T15:38:25
There's probably a charset issue at the root of it all. I saw the problem again the other day -- an entity that came in as &ndash; got parsed into a Unicode character and needed to be output as either &ndash; or &#8211; to render properly. When it came out as a Unicode character, it was unrenderable (the obligatory question mark instead of an ndash). Most likely a Unicode multibyte sequence coming out in iso8859-1 or even ASCII.
The tool chain is old enough that there are likely lots of XML-in-ASCII assumptions under the hood. Mess enough of those up, and you get unrenderable characters in the browser. Thus, you have "bad browsers" and "browser problems" with improperly rendered character entities, and the herculean efforts to bend expat to something that fits local expectations.
Re:Browsers
Dom2 on 2004-08-20T06:53:37
Just a thought, but could you use Encode::encode('us-ascii',$xml,Encode::FB_XMLCREF)?
-Dom
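For a code base without Perl's Encode module, the same idea (ASCII output, with everything else turned into decimal character references) can be sketched by hand in C. This is only an illustration: the names utf8_decode and ascii_with_charrefs are made up, the decoder does not reject overlong forms or surrogates, and it is not what the project in question actually does.

```c
#include <stddef.h>
#include <stdio.h>

/* Decode one UTF-8 sequence from s (n bytes available) into *cp.
   Returns the number of bytes consumed, or 0 if the input is malformed.
   Simplified: overlong forms and surrogates are not rejected. */
static size_t utf8_decode(const unsigned char *s, size_t n, unsigned long *cp)
{
    size_t len, i;

    if (n == 0) return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; len = 4; }
    else return 0;                      /* stray continuation or bad lead byte */

    if (len > n) return 0;              /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return len;
}

/* Copy in[0..in_len) to out, replacing every non-ASCII character with a
   decimal reference such as &#8211;. out holds at most cap bytes including
   the terminating NUL. Returns the output length, or -1 on malformed input
   or if out is too small. */
long ascii_with_charrefs(const char *in, size_t in_len, char *out, size_t cap)
{
    const unsigned char *s = (const unsigned char *)in;
    size_t i = 0, o = 0;

    while (i < in_len) {
        unsigned long cp;
        size_t used = utf8_decode(s + i, in_len - i, &cp);

        if (used == 0) return -1;
        if (cp < 0x80) {
            if (o + 1 >= cap) return -1;        /* need room for char + NUL */
            out[o++] = (char)cp;
        } else {
            int w = snprintf(out + o, cap - o, "&#%lu;", cp);
            if (w < 0 || (size_t)w >= cap - o) return -1;
            o += (size_t)w;
        }
        i += used;
    }
    if (o >= cap) return -1;
    out[o] = '\0';
    return (long)o;
}
```

Note that this is a second pass over the whole buffer, which is exactly the kind of extra per-request work a pass-through design tries to avoid.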
Re:Browsers
ziggy on 2004-08-20T12:41:37
I could, except this project is a mixture of Tcl and C. [*]
The problem came about because the standard version of expat was munged, and linking in a new module with a clean expat broke one or the other of the dependencies. And expat was munged in the first place explicitly to avoid re-escaping previously escaped entities on output once they had been parsed into Unicode characters. (A performance optimization to do as little work as possible, and work at the lowest layer possible, to enable high throughput and therefore scalability.)
This software is positively ancient in dotcom years. The decision to munge character handling was made at least 5 years ago, and it was working fine until the dueling expats raised the issue. If it were up for redesign, we'd surely do things differently now.
*: How do you hire a Tcl programmer in 2004? Put an ad out for a Perl programmer. ;-)
Have you thought about sending a patch back to expat? In some cases I would love to have that option, especially from Perl, where XML::Parser sometimes behaves in a really funny way.