XML, where did you go wrong?

ziggy on 2002-01-04T02:21:12

I was updating some slides last night on XPath and XSLT. There were a few things that didn't go as smoothly as I would have liked the last time I used them. But that's just the way of the world.

Unfortunately, each of the XML specifications I've sat down and read comes across as a rat's nest of intertwined references and formalized definitions that generally impede readability, understanding and the general health of forests everywhere. (I should note that when I was reading Michael Kay's 1ed of XSLT from Wrox, I looked wistfully over at K&R, asking why there wasn't a good, thin, concise reference for XSLT. Perhaps that problem has been solved by now...)

Today, I started reading the latest salvo from the W3C: about 700 pages of new working drafts on XSLT 2.0, XPath 2.0 and the related interference from XML Query. XSLT 1.0 was difficult to fathom within the first five reads, and XSLT 2.0 looks worse. I'd like to know exactly why there needs to be an entire section that very verbosely states "namespace processing works as expected in the source and result trees". Catching up on the XML Query discussion on xml-dev today, it sounds like it is more bureaucratic and hopeless than XML Schema. I honestly didn't think that was possible....

It got me thinking. Perhaps it's time to take back XML, starting with a refactoring of the core specifications until they are a coherent whole. Here's a rough cut:

Basic XML - grammar for elements, attributes, PI, comments, etc.; a subset of well-formed documents that do not use DTDs or internal subsets
XML and DTDs - grammar and semantic behavior for DTDs; introduce valid XML and complete the discussion of well-formed documents with internal subsets
XML and Namespaces - build on basic XML and introduce namespaces and namespace processing
XML Content - a refactored XML Infoset discussion so that it's actually readable and understandable
XML Validation - intro to schemas with RELAX NG? XML Schema bits? Full XML Schema (probably not)
XML Processing: Events - SAX2 with the Java centricism expunged; the Java centric bits are discussed in one appendix, the Perl centric bits in another; repeat ad infinitum. (Perhaps each appendix is a separate document?)
XML Processing: Trees - JDOM explained (repeat with neutral discussion up front, appendices for language centric bits). Perhaps lead into DOM2 and point to it directly.
XML Processing: XPaths - XPath 1.0 explained in plain language
XML Transformations: XSLT - XSLT 1.0 explained in plain language
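
To make the Events / Trees / XPaths split concrete, here is a minimal sketch in Python (not one of the languages named above, just a convenient neutral one). It pulls the same data out of one document twice: once from a SAX-style event stream, once from a tree using a simple path expression. The document, the element names and the function names are all made up for illustration, and ElementTree only supports a small subset of XPath 1.0.

```python
import xml.sax
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this sketch.
DOC = "<order><item sku='a1'>pen</item><item sku='b2'>ink</item></order>"


class ItemCollector(xml.sax.ContentHandler):
    """Event model: the parser pushes callbacks, we keep our own state."""

    def __init__(self):
        super().__init__()
        self.skus = []

    def startElement(self, name, attrs):
        # Called once per start tag, in document order.
        if name == "item":
            self.skus.append(attrs["sku"])


def skus_from_events(doc):
    handler = ItemCollector()
    xml.sax.parseString(doc.encode("utf-8"), handler)
    return handler.skus


def skus_from_tree(doc):
    """Tree model: parse everything first, then navigate with a path."""
    root = ET.fromstring(doc)
    return [item.get("sku") for item in root.findall("./item")]


def item_text_by_sku(doc, sku):
    """Path with a predicate, in ElementTree's limited XPath subset."""
    root = ET.fromstring(doc)
    match = root.find("./item[@sku='%s']" % sku)
    return None if match is None else match.text
```

The point of the layering is visible even at this size: both processing models sit on the same basic grammar, and the path expression only makes sense once the tree model is understood.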

From here, CML, SVG, XSL-FO are simply vocabularies to learn. The above list describes the basic semantic behaviors and processing expectations for XML documents. The key goal here is that we start with a solid simple foundation and build upon it clearly. When namespaces come up in XPath or XSLT, they point back directly to the namespace discussion, rather than rehashing it formally and verbosely.

There might be a reason to start with a level zero, introducing the need and driving factors for doing all of this work....

Corp. Inc.

Matts on 2002-01-04T09:36:58

What happened is that a project that started out as a private affair between Jon Bosak and a few friends got turned into the world's next best thing. That would have perhaps been fine if one company, or a few with the same vision, had come on board, but that didn't happen. Microsoft and Netscape came on board. As did some SGML old timers.

But you knew all that already.

I'd take a different tack from yours in revamping things though. I'd say the following few things:

1. Dump DTDs altogether. They are a nasty remnant of SGML and deserve to die.
2. XML Core should be XML 1.0 minus the "Char" restriction (reverse Char so it's an exclusion property, not an inclusion property), plus Namespaces. I know that's controversial, but namespaces are very useful, and optional, and the XML 1.0 spec already makes forward references to what might be namespaces, without going far enough.
3. Add the rest.
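
As a rough illustration of what "reversing" Char could mean, here is a sketch in Python that writes the XML 1.0 Char production both ways: as a list of allowed ranges (how the spec states it) and as a list of forbidden code points. The function names are made up, and the exclusion list below is an assumption chosen to reproduce the XML 1.0 set exactly; a revised core spec would presumably pick a different (shorter) exclusion list.

```python
def is_char_inclusive(cp):
    """XML 1.0 Char as written: an explicit list of allowed ranges."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_char_exclusive(cp):
    """Same set, stated as 'everything except these' (assumed list)."""
    excluded = (
        (cp < 0x20 and cp not in (0x9, 0xA, 0xD))  # C0 controls
        or 0xD800 <= cp <= 0xDFFF                  # surrogate block
        or cp in (0xFFFE, 0xFFFF)                  # non-characters
    )
    return not excluded


def formulations_agree():
    """Check both predicates over every Unicode code point."""
    return all(
        is_char_inclusive(cp) == is_char_exclusive(cp)
        for cp in range(0x110000)
    )
```

Stated as an exclusion, anything not explicitly forbidden is allowed by default, so newly assigned characters need no change to the rule.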

But yes, it's all gone wrong. You'd get along well with Simon St Laurent, who has the same opinions.

Re:Corp. Inc.

ziggy on 2002-01-04T19:20:35

1. Dump DTDs altogether. They are a nasty remnant of SGML and deserve to die.
Not quite yet. They're only 99.44% bad juju. But they're still the only way you can do this:
<!DOCTYPE foo [ <!ENTITY somedoc SYSTEM "somedoc.xml"> <!ATTLIST foo id ID #IMPLIED> ]> <foo id="1"> &somedoc; </foo>
Until there's a core replacement for those two features, DTDs have a modicum of utility. No, I don't believe in XLink.
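
For anyone who wants to see those two features from the parser's side, here is a minimal sketch using Python's expat bindings. The EXTERNAL dict is a hypothetical stand-in for somedoc.xml on disk, and its contents and the function names are invented; the handler names are the ones pyexpat provides.

```python
from xml.parsers import expat

DOC = """<!DOCTYPE foo [
  <!ENTITY somedoc SYSTEM "somedoc.xml">
  <!ATTLIST foo id ID #IMPLIED>
]>
<foo id="1">&somedoc;</foo>"""

# Hypothetical stand-in for the external file; contents are made up.
EXTERNAL = {"somedoc.xml": "<bar>included text</bar>"}


def parse(doc):
    id_attrs = []  # (element, attribute) pairs the DTD declares as ID
    elements = []  # element names in the order the parser reports them

    parser = expat.ParserCreate()

    def on_attlist(elname, attname, type_, default, required):
        # Feature 1: only the DTD tells the parser that 'id' is an ID.
        if type_ == "ID":
            id_attrs.append((elname, attname))

    def on_start(name, attrs):
        elements.append(name)

    def on_external(context, base, system_id, public_id):
        # Feature 2: the entity reference pulls in another document.
        sub = parser.ExternalEntityParserCreate(context)
        sub.Parse(EXTERNAL[system_id], True)
        return 1  # non-zero tells expat the entity was handled

    parser.AttlistDeclHandler = on_attlist
    parser.StartElementHandler = on_start
    parser.ExternalEntityRefHandler = on_external
    parser.Parse(doc, True)
    return id_attrs, elements
```

Without the internal subset, the parser has no way to know that id is anything other than an ordinary attribute, and &somedoc; is simply an undefined entity.
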
Reversing the Char production is a given. The way I laid out Basic/Namespaces/DTDs+Valid was intentional; until such time as XML Processors are required to support namespaces, they're technically optional. However, the place to add namespaces is after the basic grammar, not after the basic grammar + DTDs.
My expectation would be that subsequent rewritten specs (XPath, XSLT) would simply subsume the Basic+Namespaces as a single entity, not XML 2ed with a liberal sprinkling of Namespace Sugar. (Also, the namespaces spec should be extended to address QNames as attribute values, unfortunately)

Re:Corp. Inc.

Matts on 2002-01-05T08:51:04
XInclude covers external parsed entity inclusion. ID values would have to be covered by some schema system. I'd be in favour of using something like a top level <?xml-scheme type="..." href="..."?> for that, similar to the xml-stylesheet PI used for XSLT.
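
A minimal sketch of the XInclude route, using the ElementInclude module from Python's standard library. The FILES dict and the in-memory loader are hypothetical stand-ins for fetching somedoc.xml from disk, and the document contents are made up.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# The xi:include element takes the place of the &somedoc; reference.
HOST = (
    '<foo xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<xi:include href="somedoc.xml"/>'
    "</foo>"
)

# Hypothetical stand-in for the file system.
FILES = {"somedoc.xml": "<bar>included text</bar>"}


def in_memory_loader(href, parse, encoding=None):
    """Loader callback: return an Element for parse='xml', else text."""
    text = FILES[href]
    return ET.fromstring(text) if parse == "xml" else text


def expand(doc):
    root = ET.fromstring(doc)
    # Replaces each xi:include child in place with the loaded element.
    ElementInclude.include(root, loader=in_memory_loader)
    return root
```

No DOCTYPE is needed here: the inclusion is expressed with ordinary elements and a namespace, which is why it fits a DTD-free core.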

What Matt said

darobin on 2002-01-04T17:58:15

In addition to which I think we can clearly separate the schema layer using 1) a way to access the tree in a validation-friendly fashion (this is probably 98% there already), 2) a standard way to link a doc to a schema (either with a specific XLink, or if it really has to be done with a PI as well) and 3) an API to decorate (mostly for typing) trees so that schema languages that can do typing can use any old DOM/SAX supporting that extension.

So what's the next step? I don't know for sure, but I guess it could be about listening to the remaining parts of the W3C that still make sense (there are some here and there) and turning more to OASIS, which seems to make a hell of a lot more sense these days (especially with RELAX NG, which imho beats the hell out of XSD any day). I guess we'll see :-)