Things Not To Do at 4 pm on Friday on the Internet

chromatic on 2006-03-18T03:09:28

I'm not a full-time developer anymore, but I still know a few things.

Don't ever deploy new code on a production system after lunch.
... and never, ever do this on a Friday afternoon.
Don't parse XML with regular expressions.
Don't treat newlines as significant in XHTML, except within attribute values, where the specification says not to use them.
Don't double-encode text and markup.
Don't add extra formatting to properly-formatted documents.
Don't deploy new code without testing it first on actual data.
Don't tell your users that they have to change the way they publish information in your system because you just made a change at 4 pm on a Friday afternoon that treats newlines as significant in XHTML though nothing else on the Internet does so, tell your users to fix all of their existing data, and then go home before they can say "Uh... that's broken. Why did you do this?"
Don't remove a feature that lets people who know what they're doing disable all of this "helpful" magic and turn on the magic by default if it changes their existing data.
If you ignore everything else, don't claim "Oh, you can still write valid XHTML. Just don't use paragraph tags and let the system put them in automatically."

I really don't have words for this except "BY DEFAULT?", "4 PM?!", and "FRIDAY AFTERNOON?!?!"

XML Regexes

ziggy on 2006-03-20T15:28:18

A very good list of truisms. However, there is the subtlest of subtle flaws in this list -- all categorical statements (including this one) are false. For example:

Don't parse XML with regular expressions.

Sometimes you do want to parse XML with regexes, but only in the most controlled of circumstances. Usually this involves munging huge quantities of data that are very rigidly formatted. If you can fully control the structure of XML inputs, and you tend to be reading inputs line-by-line (or block-by-block, but generally not a whole file at once), then you might be able to get away with parsing XML with regexes.
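As a rough illustration of the controlled case described above, here is a minimal Python sketch. The format is hypothetical: it assumes exactly one `<record .../>` element per line, with attributes always in the same order and always double-quoted. The element name, attribute names, and the `count_by_status` helper are invented for the example.

```python
import re

# Hypothetical rigid format: one <record .../> element per line,
# attributes always in this order and always double-quoted.
RECORD = re.compile(r'^<record id="(\d+)" status="(\w+)"/>$')

def count_by_status(lines):
    """Tally records by status, reading one line at a time."""
    counts = {}
    for line in lines:
        m = RECORD.match(line.strip())
        if m is None:
            continue  # not a record line (e.g. the root element); skip it
        status = m.group(2)
        counts[status] = counts.get(status, 0) + 1
    return counts

sample = [
    '<records>',
    '<record id="1" status="ok"/>',
    '<record id="2" status="failed"/>',
    '<record id="3" status="ok"/>',
    '</records>',
]
print(count_by_status(sample))  # {'ok': 2, 'failed': 1}
```

This only holds up while the producer of the data keeps that exact layout. Reordered attributes, single quotes, or a line break inside the tag would make the pattern silently skip records.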

Tim Bray (yes, that Tim Bray) has spoken on using a similar processing model, given all of the appropriate pre-conditions, performance requirements and needs for code clarity.

As a general rule, XML parsing with regexes is moderately safe for the same kinds of processing where you would use egrep, grep -c, grep -v or their kin on a plain text file. But XML parsing with regexes doesn't work in the general case.

Re:XML Regexes

chromatic on 2006-03-20T18:02:38

In the case to which I allude, I assume that the code (I have not seen it) processes the XHTML line-by-line. This is a problem because the XML specification allows newline characters as valid whitespace characters within tags. This is a big problem because the input comes from arbitrary sources.

Parsing this XHTML without a stack or state machine somewhere is problematic.
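To make the failure concrete, here is a small Python sketch. It is not the code in question (which I have not seen), only an assumed stand-in: a line-by-line regex that looks for `href` values, compared with the standard library's `xml.etree.ElementTree` parser. The sample fragment and both helper names are made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Well-formed XHTML fragment whose <a> start tag spans two lines.
doc = '<p>See <a\n   href="http://example.com/">this</a>.</p>'

HREF = re.compile(r'<a\s+href="([^"]*)"')

def hrefs_line_by_line(text):
    """Apply the regex to each line separately, as a line-based filter would."""
    found = []
    for line in text.split('\n'):
        found.extend(HREF.findall(line))
    return found

def hrefs_with_parser(text):
    """Let a real XML parser keep track of where tags begin and end."""
    root = ET.fromstring(text)
    return [a.get('href') for a in root.iter('a')]

print(hrefs_line_by_line(doc))  # []
print(hrefs_with_parser(doc))   # ['http://example.com/']
```

The line-based version never sees `<a` and `href` on the same line, so it finds nothing, even though the input is perfectly valid.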

Re:XML Regexes

ziggy on 2006-03-20T18:18:14
So the real dictum is Don't parse arbitrary XML with regular expressions.

Yep. No wiggle room on that. That's as hard and fast a rule as don't divide an integer by zero. :-)

Re:XML Regexes

Sidhekin on 2006-03-21T00:50:58
Shouldn't that be don't divide an integer by an arbitrary number?

Re:XML Regexes

ziggy on 2006-03-21T02:46:51
In some numeric towers, dividing floating-point numbers by zero, as in [+-]1.0/0, results in [+-]Inf.

Re:XML Regexes

iburrell on 2006-03-20T19:01:19
It is possible to parse arbitrary XML with regular expressions. However, it can't be done line-by-line because tags can contain newlines. It must be done on the whole file (or have some smart buffering).
There is a paper, http://www.cs.sfu.ca/user/cameron/REX.html, which develops the regex for parsing XML.

Re:XML Regexes

Aristotle on 2006-03-22T11:21:33

Note that those patterns parse simple XML, not XML with namespaces. Parsing XML with namespaces purely using pattern matching is probably possible too, but it’d be a whole hell of a lot harder, and the patterns would be nasty monstrosities, far more so than the manageable beasts from that paper.

Re:XML Regexes

iburrell on 2006-03-22T19:26:29
They will parse XML with namespaces. But they only break the XML into pieces; tags, comments, text, etc. They don't handle the pieces like breaking tags into names and attribute values. They don't handle resolving namespace prefixes into canonical names.
I suspect they aren't suitable for doing interesting operations. They could be used for stuff that works on the chunks, like removing comments.
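A rough Python sketch of that kind of chunk-level processing follows. The pattern is a simplified stand-in and not the expressions from the paper: it assumes no CDATA sections, DOCTYPE declarations, or `>` characters inside attribute values. The `tokenize` and `strip_comments` names are invented for the example.

```python
import re

# Simplified shallow tokenizer (an assumption, NOT the REX patterns):
# ignores CDATA, DOCTYPE, and '>' inside attribute values.
TOKEN = re.compile(r'''
    (?P<comment><!--.*?-->)   # a comment, matched non-greedily
  | (?P<tag><[^>]*>)          # any other markup, up to the next '>'
  | (?P<text>[^<]+)           # a run of character data
''', re.VERBOSE | re.DOTALL)

def tokenize(xml):
    """Split the whole input into (kind, chunk) pairs."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(xml)]

def strip_comments(xml):
    """Rejoin every chunk except the comments."""
    return ''.join(chunk for kind, chunk in tokenize(xml) if kind != 'comment')

src = '<a\n href="x"><!-- note -->hi</a>'
print(tokenize(src))
print(strip_comments(src))
```

Because it runs over the whole input rather than line by line, the tag that contains a newline comes through as a single chunk.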

Re:XML Regexes

Aristotle on 2006-03-22T20:37:05

Well, or building a full-fledged parser on top. That’s not a very large step from there.