I'm not a full-time developer anymore, but I still know a few things.
I really don't have words for this except "BY DEFAULT?", "4 PM?!", and "FRIDAY AFTERNOON?!?!"
A very good list of truisms. However, there is the subtlest of subtle flaws in this list -- all categorical statements (including this one) are false. For example:
Don't parse XML with regular expressions.
Sometimes you do want to parse XML with regexes, but only in the most controlled of circumstances. Usually this involves munging huge quantities of data that are very rigidly formatted. If you can fully control the structure of XML inputs, and you tend to be reading inputs line-by-line (or block-by-block, but generally not a whole file at once), then you might be able to get away with parsing XML with regexes.
Tim Bray (yes, that Tim Bray) has spoken on using a similar processing model, given all of the appropriate pre-conditions, performance requirements and needs for code clarity.
As a general rule, XML parsing with regexes is moderately safe for the same kinds of processing where you would use egrep, grep -c, grep -v or their kin on a plain text file. But XML parsing with regexes doesn't work in the general case.
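A rough sketch of the "controlled circumstances" case, in Python for illustration. The one-element-per-line `<row/>` format, the attribute names and the `parse_rows` helper are all made up for this example; the point is only that the pattern is anchored to a format you fully control and anything else is rejected rather than guessed at.

```python
import re

# Hypothetical rigid format: exactly one <row .../> element per line,
# attributes always in the same order and always double-quoted.
ROW = re.compile(r'^<row id="(\d+)" name="([^"<&]*)"/>$')

def parse_rows(lines):
    """Yield (id, name) for each line; fail loudly on anything unexpected."""
    for lineno, line in enumerate(lines, 1):
        m = ROW.match(line.strip())
        if m is None:
            # Input that deviates from the rigid format is an error here.
            raise ValueError(f"line {lineno}: unexpected input: {line!r}")
        yield int(m.group(1)), m.group(2)

sample = ['<row id="1" name="alpha"/>', '<row id="2" name="beta"/>']
rows = list(parse_rows(sample))
```

The moment the producer reorders attributes, switches quote style or wraps a tag across lines, this raises instead of silently misreading, which is about the best a grep-style approach can offer.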
Re:XML Regexes
chromatic on 2006-03-20T18:02:38
In the case to which I allude, I assume that the code (I have not seen it) processes the XHTML line-by-line. This is a problem because the XML specification allows newline characters as valid whitespace characters within tags. This is a big problem because the input comes from arbitrary sources.
Parsing this XHTML without a stack or state machine somewhere is problematic.
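To make the newline problem concrete, here is a small Python sketch. The sample markup and the `HREF` pattern are invented for illustration and say nothing about the code in question, which I have not seen.

```python
import re

# A start tag may legally contain newlines between its attributes.
xhtml = '<p>See <a\n   href="http://example.com/"\n   class="ext">this</a>.</p>'

# Naive pattern for pulling link targets out of <a> start tags.
HREF = re.compile(r'<a\s[^>]*?href="([^"]*)"')

# Line-by-line: no single line contains both "<a" and "href=".
per_line = [m.group(1)
            for line in xhtml.splitlines()
            for m in HREF.finditer(line)]

# Whole buffer: \s and [^>] both match the newline, so the link is found.
whole = [m.group(1) for m in HREF.finditer(xhtml)]
```

Here `per_line` comes back empty while `whole` finds the link. Matching against the whole buffer only fixes this one failure; comments, CDATA sections and single-quoted attributes would still trip up a pattern this simple.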
Re:XML Regexes
ziggy on 2006-03-20T18:18:14
So the real dictum is "Don't parse arbitrary XML with regular expressions."
Yep. No wiggle room on that. That's as hard and fast a rule as don't divide an integer by zero. :-)
Re:XML Regexes
Sidhekin on 2006-03-21T00:50:58
Shouldn't that be don't divide an integer by an arbitrary number?
Re:XML Regexes
ziggy on 2006-03-21T02:46:51
In some numeric towers, dividing floating point numbers by zero, as in [+-]1.0/0 results in [+-]Inf.
Re:XML Regexes
iburrell on 2006-03-20T19:01:19
It is possible to parse arbitrary XML with regular expressions. However, it can't be done line-by-line because tags can contain newlines. It must be done on the whole file (or have some smart buffering). There is a paper, http://www.cs.sfu.ca/user/cameron/REX.html, which develops the regex for parsing XML.
Re:XML Regexes
Aristotle on 2006-03-22T11:21:33
Note that those patterns parse simple XML, not XML with namespaces. Parsing XML with namespaces purely using pattern matching is probably possible too, but it’d be a whole hell of a lot harder, and the patterns would be nasty monstrosities, far more so than the manageable beasts from that paper.
Re:XML Regexes
iburrell on 2006-03-22T19:26:29
They will parse XML with namespaces. But they only break the XML into pieces: tags, comments, text, etc. They don't process the pieces any further, such as breaking tags into names and attribute values. They don't handle resolving namespace prefixes into canonical names. I suspect they aren't suitable for doing interesting operations. They could be used for stuff that works on the chunks, like removing comments.
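As a sketch of what "works on the chunks" looks like, here is a deliberately simplified shallow tokenizer in Python. This is not the pattern set from the paper: the `TOKEN` regex and the `chunks` and `strip_comments` helpers are my own illustration. It ignores CDATA sections, processing instructions and DOCTYPE declarations, and it assumes attribute values never contain a literal ">".

```python
import re

# Simplified shallow tokenizer (NOT the patterns from the REX paper).
# Alternatives, in order: a comment, any other tag, a run of text.
# re.DOTALL lets a comment span several lines.
TOKEN = re.compile(r'<!--.*?-->|<[^>]*>|[^<]+', re.DOTALL)

def chunks(xml):
    """Split a whole document into comment, tag and text chunks."""
    return TOKEN.findall(xml)

def strip_comments(xml):
    """Drop the comment chunks and glue everything else back together."""
    return ''.join(c for c in chunks(xml) if not c.startswith('<!--'))

doc = '<a>\n<!-- note:\n <b> -->\n<b x="1">hi</b></a>'
cleaned = strip_comments(doc)
```

Note that the `<b>` inside the comment is never mistaken for a tag, because the comment alternative is tried first and the match runs over the whole buffer rather than one line at a time.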
Re:XML Regexes
Aristotle on 2006-03-22T20:37:05
Well, or building a full-fledged parser on top. That’s not a very large step from there.