I started working at my new job last Monday. I can't say that it's been a hectic week: this being August, most of France is on holiday. Also, my boss is in Corsica right now, so I don't really have clear short-term assignments.
However, I have spent a lot of time learning about all the technologies that are used there, especially our main product, which is very cool. It's a system that takes a schema for an XML vocabulary and generates specific encoders and decoders for it using automata. Thanks to that, we get encoded XML that beats gzip'd XML by a fair margin in size, in decoder space, in memory usage, and in speed[1].
Key (buzz)words here are MPEG-7, MPEG-21, XML Schema, RDF, SVG, and BiM. At least half of those I didn't need before, so I'm busy sharpening my knowledge of them. It's amazing how things almost instantly appear in a far better light when you have a solid use case for them. I've always hated XML Schema for many reasons, but after looking at what Expway does it seems much less hateable (mind you, the language of the spec is still beyond human understanding, and it does have very severe faults). I'm also finding out that Xerces 2.0 is a very high-quality tool, and I hope that the Perl binding to it soon implements most if not all of its interfaces.
Fun thing of the week relating to XML Schema: on my first day -- in fact, first hour -- of using them seriously (as opposed to just fooling around to see what they're made of) I found a bug in the schema published by the W3C for the two attributes defined in the XML spec! Better still, it's not even a schema bug: the document itself isn't well-formed (it binds the XML namespace to a forbidden prefix)! Now if the XML WG can't use XML properly...
There's more to say, but I'm lazy :)
[1] The principle is simple: since these are specific {en,de}coders instead of generic ones, they perform much better. XML structure is often very predictable, especially if you have a schema handy. If you know that A must contain B and C in that order, you only need to encode the fact that you have an A to encode it all, etc. Of course, the less structure you have, the less efficient it is.
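In case that's too abstract, here is a toy Perl sketch of the idea. The schema, the element names, and the "wire format" below are all invented for illustration; the real encoders work on automata generated from XML Schema, but the principle of encoding only what the schema cannot predict is the same.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy schema, entirely made up: every element has a fixed, required
# sequence of children, or is a text-only leaf.
my %schema = (
    A => [qw(B C)],    # A must contain B then C, in that order
    B => ['#text'],    # B is a text-only leaf
    C => ['#text'],    # so is C
);

# A document that is valid against that schema.
my $doc = {
    name     => 'A',
    children => [
        { name => 'B', text => 'hello' },
        { name => 'C', text => 'world' },
    ],
};

# "Encoding": since the schema fully predicts what sits under an A,
# only the root name and the text leaves need to be emitted.
sub encode {
    my ($node) = @_;
    my @tokens = ($node->{name});                             # says "this is an A"
    push @tokens, map { $_->{text} } @{ $node->{children} };  # the unpredictable parts
    return join "\x00", @tokens;
}

# "Decoding": knowing the schema, the full element structure is rebuilt
# from those few tokens; the B and C tags never hit the wire at all.
sub decode {
    my ($bytes) = @_;
    my ($name, @texts) = split /\x00/, $bytes;
    my @kids = @{ $schema{$name} };
    return {
        name     => $name,
        children => [ map { { name => $kids[$_], text => $texts[$_] } } 0 .. $#kids ],
    };
}

my $wire = encode($doc);
printf "encoded in %d bytes\n", length $wire;
my $back = decode($wire);
printf "round-tripped root: %s with %d children\n",
    $back->{name}, scalar @{ $back->{children} };
```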
Clarification

If so, then in what ways are you getting better performance than gzip'ed XML? Parsing? Serialization? On-disk representation (using a shorthand vocabulary)? Memory consumption? Inquiring minds want to know.
Also, if you can discuss it, what kind of XML documents are you dealing with? Are they document-centric or data-centric? I would imagine there's a lot of optimization that can be applied to a data-centric application, but remain sceptical about optimization of XML Schema+documents. Thanks.
Re:Clarification
darobin on 2002-08-11T20:46:17
Sorry for the hand-waving, I'm not yet fully familiarised with what in that technology is public and what will become public in the year to come :) Part of my job will be to see how we can open source some things, but I need to know a lot more about what falls into what category to be certain I'm not violating NDAs. But I'll try to give you as much satisfaction as possible here :)

If you don't mind, I'll start with the end: document- vs data-centric results. When it comes to compression using this technique, the line isn't all that clear. What is important is structure and its entropy. The more structured (and predictably structured) your XML is, the better the compression works. If you have only structure and a highly predictable schema, the compression levels can be truly silly, in the high 90%s. OTOH, if you have a very loose schema and lots of text content, it reaches gzip levels.
At this point you're probably wondering why I'm drawing a distinction from the usual data vs document meme. Well, think of SVG. Most people are likely to consider a good number of SVG files as documents more than as data, but since the schema for SVG is sufficiently predictable and generally well structured, SVG compresses better than it does with gzip. Similarly, some data formats I've seen (e.g. some configuration files) compressed really poorly because they were too loose. As you'd guess, XHTML doesn't work as well as SVG does. We are, however, working with all sorts of documents, whatever is requested from us.
Yes, we are generating some code from the schema as if it were a form of BNF (only one that can do encoding and decoding).
Advantages over gzip aren't limited to size (though that is in fact a deciding factor). Speed and memory consumption are part of the picture. With gzip, the receiving end has to 1) ungzip and 2) parse the XML. Those two operations aren't exactly free, especially the latter. Add to that the fact that the app probably has to somehow validate the content (at least minimally) to recover from errors, and you have something that's a very large app for a cell phone, or for a cheap pocket radio. In this case, validation is a given, otherwise the document wouldn't have been encoded. And the decoding includes both uncompression and reading of the XML in a single step, because those two things are really one.
We're using XML Schema but we aren't married to it. Other schema solutions could work as well; it just happens that this is what people want. On devices as stupidly short on CPU cycles as those we are targeting, typing is actually a big plus: if you know that something is an int, for instance, you can tell the app that it is, instead of it having to parse the text to find out (which is a costly operation in that context).
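Just to illustrate that last point (a toy Perl example with a made-up element name, not our actual wire format): a value the schema declares as xs:int can cross the wire as a plain binary integer, so the device never has to dig digits out of a string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up illustration: when the schema says a value is an xs:int, it can
# travel as a fixed-size binary integer, and the decoder hands the
# application a number without any text parsing.

# The text route, which is what a generic parser on the device has to do.
my $textual   = '<size>1234567</size>';
my ($as_text) = $textual =~ m{<size>([^<]*)</size>};
my $n_text    = $as_text + 0;          # string-to-number conversion on the device

# The schema-typed route: four bytes on the wire, one unpack, done.
my $wire  = pack 'N', 1234567;         # encoder side: the schema says it's an int
my $n_bin = unpack 'N', $wire;         # decoder side: no text to parse at all

printf "text form: %d bytes, typed form: %d bytes, same value: %s\n",
    length($textual), length($wire), ($n_text == $n_bin ? 'yes' : 'no');
```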
You can probably glean some extra (though a bit dated) info from http://www.expway.tv/bim/bim.html provided you have a way of reading PowerPoint :-/

I hope this clarifies some of the hand-waving :) Don't hesitate to ask if you have more questions.

Re:Clarification
ziggy on 2002-08-11T21:40:04
OK. If I understand you, then you're repeating what the WML group tried to do in the late '90s, but in a much better and more general fashion. It sounds like what you're producing isn't XML, but a compressed bytestream that can be treated as uncompressed, [schema-]valid XML with the appropriate auto-generated automata.

"And the decoding includes both uncompression and reading of the XML in a single step, because those two things are really one."

Neat. :-)

FWIW, at the March 1998 XML Conference in Seattle, a couple of FoRK folks were positing a hypothetical markup language that could replace XML, and called it YML. The general idea is that XML is way too complex; once you get rid of attributes, PI's, DOCTYPEs, and I forget what else, you have a data format that is much more easily parsed and could be sent down the wire in a binary format. Of course, there were a few Microsofties involved, which explains the focus on binary formats and runtime performance. :-)

Re:Clarification
darobin on 2002-08-11T21:52:55
Yes, that's precisely it, with the difference that the MPEG folks learnt from the mistakes in WML ;) The compressed bytestream can indeed be treated as uncompressed because the decoder knows of the schema, and thus the infoset is completely there (for values of completely that match the needs of those apps, not the roundtrippability that, say, an XML editor would require). If the schema says that A must contain B followed by C, you only need to encode the presence of A, which can take as little as one bit, but that bit tells the decoder that it's seeing A (B, C). And the idea is in fact not to get rid of the complexity in XML, because that complexity is useful. I don't think we do PIs yet (the tools aren't 100% complete), but it wouldn't be a problem so long as they can be described in schemas. The only thing we're taking out is the information that's useless when you know what the document must look like.
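To put a very rough number on "as little as one bit" (a hand-wavy model, not our actual bitstream layout, and the content models below are invented): structure only costs bits where the schema leaves the decoder a choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX 'ceil';

# Back-of-the-envelope intuition only: structure bits are spent solely where
# the schema leaves a choice open. A fixed sequence costs nothing beyond its
# parent's presence; a choice among n alternatives costs about ceil(log2 n) bits.
my @models = (
    [ 'A -> (B, C)'              => 1 ],   # fixed sequence, only one way to read it
    [ 'A -> (B | C)'             => 2 ],   # one of two children
    [ 'A -> (B | C | D | E | F)' => 5 ],   # one of five
);

for my $m (@models) {
    my ($model, $n) = @$m;
    my $bits = $n == 1 ? 0 : ceil(log($n) / log(2));
    printf "%-28s %d structure bit(s)\n", $model, $bits;
}
```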