The XML grammar, and pretty.

TorgoX on 2005-08-06T10:24:12

Dear All,

So a while back, I noted that when I referred to the XML spec (then still in its 1.0 version), I only ever referred to the syntax parts, and I rarely if ever (re-)read the paragraphs.

So I made this, which is basically just the XML spec minus all the paragraphs, leaving behind just the headings and syntax parts. But over time, 1) I got more annoyed by the irony of the nastiness of the 1997-vintage HTML (tables inside tables, etc); and 2) I got tired of looking up the production for Name, and ending up getting to the BaseChar production and seeing rows and rows of goo like "[#x03A3-#x03CE] | [#x03D0-#x03D6] | #x03DA | #x03DC | #x03DE | #x03E0 | [#x03E2-#x03F3]".

So I did a bit of search-and-replacing to get this prettied-up version of it all. Note that the funny characters (half of which are likely not to show up on any given person's display) have some passably informative title="..." attributes set, which appear when you mouse over them.

At some point I may update this for the XML 1.1 spec. But feh (ف), I'm oldskool.


pubid char?!

jjohn on 2005-08-06T15:34:34

Had this label been vetted by third graders, the authors would have been informed of the unfortunate similarity of "pubid" to "pubic." This might have forced them to reconsider that particular label, one hopes.

Unicode character ranges

bart on 2005-08-07T18:43:02

I think that the XML parsers I know of would choke on characters in the range [#x80-#xBF]. That's how I recall it, without trying.

Re:Unicode character ranges

bart on 2005-08-07T18:43:54

I mean [#x80-#x9F], of course... damned hexadecimal.

Re:Unicode character ranges

TorgoX on 2005-08-19T06:54:49

Ya know, I was actually wondering about that too. The XML 1.1 spec somewhat clarifies this.

But regardless, the strategy I've settled on for escaping arbitrary 8-bit data is this: anything fishy, I move up into the E000-E0FF block, as in my ps2xml library.
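
For anyone who wants to see the idea in code, here is a minimal Python sketch of that kind of escaping. It is not taken from ps2xml. The function names and, in particular, the definition of "fishy" (C0 controls other than tab, LF and CR, plus DEL and the C1 range 0x80-0x9F) are assumptions made for illustration. All other bytes are treated as Latin-1 code points.

```python
PUA_BASE = 0xE000  # start of the E000-E0FF block mentioned in the post


def is_fishy(byte: int) -> bool:
    """Assumed definition of "fishy": control bytes that XML handles badly."""
    if byte in (0x09, 0x0A, 0x0D):  # tab, LF, CR are fine in XML 1.0
        return False
    return byte < 0x20 or 0x7F <= byte <= 0x9F


def escape_bytes(data: bytes) -> str:
    """Map each byte to a character, shifting fishy ones to U+E000 + byte."""
    return "".join(
        chr(PUA_BASE + b) if is_fishy(b) else chr(b) for b in data
    )


def unescape_text(text: str) -> bytes:
    """Invert escape_bytes; rejects characters it could not have produced."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if PUA_BASE <= cp <= PUA_BASE + 0xFF:
            out.append(cp - PUA_BASE)  # shifted byte: move it back down
        elif cp <= 0xFF:
            out.append(cp)  # ordinary Latin-1 code point
        else:
            raise ValueError(f"unexpected character U+{cp:04X}")
    return bytes(out)
```

Because U+E000 through U+E0FF all fall inside the XML 1.0 Char production's [#xE000-#xFFFD] range, the escaped string contains only characters a conforming parser must accept, and `unescape_text(escape_bytes(data))` gives back the original bytes.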