XML Schema wrinkles

ziggy on 2002-01-14T15:24:18

It's easy to create an enumerated type in XML Schema, like all 50 states by postal abbreviation or state name.

Creating a type that validates zip codes is a little trickier.

You would think that using something like the totalDigits facet would constrain a type to a five-digit number. Except that there are zip codes that start with leading zeros: 08063 is a valid zip code and must be presented that way, even though an integer with that value will be incorrectly serialized as 8063.
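
As a rough illustration of the leading-zero problem (in Python rather than XML Schema, with variable names invented for this sketch), round-tripping the value through an integer loses the first digit:

```python
# Treating a zip code as a number drops the leading zero.
zip_text = "08063"
as_number = int(zip_text)     # numeric value 8063
round_trip = str(as_number)   # serializes back as "8063", only four characters

# Keeping the value as a string preserves all five characters.
as_string = zip_text
```

The string form survives unchanged, which is why the rest of this post treats a zip code as text.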

So perhaps the solution is to treat a zip code as a string, where each position in the string is in the range '0'..'9'. Atomic datatypes are composable into lists, and we can create a list of five of these '0'..'9' values. Except that lists are a series of values separated by spaces, so we've now constructed a type that will validate a value like this: "0 8 0 6 3", which is clearly incorrect.
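
A minimal sketch of why the list approach goes wrong, assuming a list type behaves roughly like "split on whitespace, then check each item". The function below is a hypothetical stand-in written in Python, not a real schema validator:

```python
def validates_as_digit_list(value: str) -> bool:
    """Hypothetical stand-in for a list of exactly five single-digit items."""
    items = value.split()  # list items are separated by whitespace
    return len(items) == 5 and all(
        len(item) == 1 and item in "0123456789" for item in items
    )

accepts_spaced = validates_as_digit_list("0 8 0 6 3")  # the value we did not want
accepts_real_zip = validates_as_digit_list("08063")    # the value we did want
```

Under that assumption the spaced-out value passes and the real zip code fails, because "08063" is a single item rather than five.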

It seems like the simplest type definition that can work here is a pattern: '\d\d\d\d\d', except that just feels wrong. Regexes feel like an escape hatch on the XML Schema data definition, doubly so here. I must be missing something; there should be a more natural way to construct a simpleType that matches zip codes; either that or the simpleType definition is missing something fundamental in datatype composition.
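
For comparison, here is roughly what the pattern approach checks, sketched with Python's re module instead of a schema processor. The names ZIP_PATTERN and is_zip are made up for this example, and [0-9] is used in place of \d because \d also matches non-ASCII digits in both Python and XML Schema:

```python
import re

# Five ASCII digits, nothing else (illustrative name, not from any schema).
ZIP_PATTERN = re.compile(r"[0-9]{5}")

def is_zip(value: str) -> bool:
    # A schema pattern has to match the whole value, so fullmatch is the
    # closest Python analogue.
    return ZIP_PATTERN.fullmatch(value) is not None
```

With this check "08063" is accepted, while "8063" and "0 8 0 6 3" are both rejected.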

(Theoretically, with enough work, it should be possible to define a datatype that would validate a document if a piece of element content is a syntactically valid C program; clearly with a list of a union of simple token types, we're not quite there yet. I certainly don't want to see that grammar though...)

So how is this for a totally off-the-cuff idea for a schema language for element content:

  • simple strings (unconstrained, just like #PCDATA or xsd:string)
  • simple numbers (like xsd:decimal)
  • simple dates (ISO dates)
  • simple URIs (like xsd:anyURI)
  • simple email addresses (from RFC 822 and its successors)
  • simple namespaces (like xsd:QName)
  • simple XML names (NMTOKEN from the DTD grammar; is this really necessary, or is QName sufficient here?)
  • basic constraints on the above (XSD derived types, such as xsd:short and xsd:unsignedLong)
  • full regular expressions
  • concatenations of simple strings (that may be a way to define xsd:decimal...)
  • BNF grammars for string data (a way to compose email addresses, and define QNames / URIs natively and not punt to the Schema engine...)
  • unions and lists as they exist today
  • directed unions and lists (perhaps as a way to define a simple BNF grammar, or something simpler?)
...all plugged into a RELAX NG structure definition.
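
One possible reading of the "concatenations of simple strings" item, sketched in Python. Everything here (char_range, concat, validates) is a hypothetical helper invented for illustration and not part of XML Schema or RELAX NG:

```python
from typing import Callable, Optional

# A matcher takes (text, position) and returns the position after the
# match, or None if it does not match there.
Matcher = Callable[[str, int], Optional[int]]

def char_range(lo: str, hi: str) -> Matcher:
    """Match a single character between lo and hi, inclusive."""
    def match(text: str, pos: int) -> Optional[int]:
        if pos < len(text) and lo <= text[pos] <= hi:
            return pos + 1
        return None
    return match

def concat(*parts: Matcher) -> Matcher:
    """Match each part in order, with no separators in between."""
    def match(text: str, pos: int) -> Optional[int]:
        current: Optional[int] = pos
        for part in parts:
            current = part(text, current)
            if current is None:
                return None
        return current
    return match

def validates(matcher: Matcher, text: str) -> bool:
    """A value is valid only if the matcher consumes all of it."""
    return matcher(text, 0) == len(text)

digit = char_range("0", "9")
zip_code = concat(digit, digit, digit, digit, digit)
```

Here a zip code is literally "five digits in a row", which accepts "08063" and rejects "0 8 0 6 3" without resorting to a regex or to a space-separated list.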


why not re's?

jmm on 2002-01-14T18:08:01

Sounds like you're going to a lot of work to define something that will be unique and original and will cover some of the things that RE's could be used for.

Considering RE's an escape hatch to be avoided if possible suggests that the alternative is not powerful enough some of the time. Why not just use RE's and forget about defining a brand new incompatible language that isn't as powerful?

Re:why not re's?

ziggy on 2002-01-14T20:16:30

> Sounds like you're going to a lot of work to define something that will be unique and original and will cover some of the things that RE's could be used for.

Actually, that was just thinking out loud. ;-)

You are quite correct in observing that if regexes are an "escape hatch", then it is an indication that the "brand new incompatible language" isn't as powerful. However that doesn't necessarily mean we're dealing with a "brand new incompatible language". :-)

For example, it is much easier to think in terms of a yacc (or even P::RD) grammar if we want to write a parser for C or even a rudimentary subset of Perl. However, the task of writing, maintaining and extending the corresponding regex is quite daunting. So while the regex notation is quite powerful, it isn't always optimized to the problem space (or the grey matter typically used to solve the problem).

One peeve I have with XML Schema is that it is so verbose and difficult to create simple constructs for element content validation. For example, given a validating model for zip codes, and another one for zip+4, it's as simple as creating a union of the two types to validate either one or the other. (Except that now we need 3 named types; urgh...) But each of those types is atomic, and the union itself is atomic. If we had a structured datatype, then XML encoding practice would have us tag each subtype individually since it is not atomic. Alternatively, we could describe each subtype as a simple (dirt simple) regex, and create a union of atomic types. (Or just punt and make a validating regex that understands both types of zip code...)
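
The two options at the end of that paragraph can be sketched like this, again in Python rather than XML Schema. The names are illustrative, and the zip+4 layout (five digits, a hyphen, four digits) is an assumption of this sketch:

```python
import re

# Option 1: two dirt-simple patterns, combined as a union.
ZIP5 = re.compile(r"[0-9]{5}")
ZIP_PLUS4 = re.compile(r"[0-9]{5}-[0-9]{4}")  # assumed zip+4 layout

def is_any_zip(value: str) -> bool:
    # A union accepts the value if any member type accepts it.
    return any(p.fullmatch(value) is not None for p in (ZIP5, ZIP_PLUS4))

# Option 2: punt, and write one regex that understands both forms.
COMBINED = re.compile(r"[0-9]{5}(-[0-9]{4})?")

def is_any_zip_combined(value: str) -> bool:
    return COMBINED.fullmatch(value) is not None
```

Both functions accept the same values here; the difference is only whether the two forms stay separately named or get folded into one expression.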

But what do we do when we want to tag a complex structure as a simple type? For example, a set of email headers, a POD document, etc.? Here we fall down because we can't (er, "aren't supposed to") parse the atom. We could write regexes for these kinds of element content, but they have the same scalability problems that a yacc grammar may simplify.

Is that clearer?