XML tools recommendations, please.

petdance on 2002-12-16T17:48:51

We're doing some work with XML schemas and would like some kind of snazzy schema generator/documentor. We looked at XML Spy, which is nice, but is a Windows-only product, and is $400. I'm more concerned about being tied to Windows than the cost.

Stuff we'd like to have it do:

  • Automagic schema generation from xml
  • Automagic - and good! - documentation generation
  • Visual schema browsing/editing would be nice too
Any suggestions?


Automagic Anything

ziggy on 2002-12-16T18:02:25

I don't know what level of schema complexity you're looking for, but automagic schema generation (and the like) is a holy grail. And it's also a checkbox on a lot of PHB's shopping lists. (Go figure!)

The few XML Schema-centric tools I've seen are proprietary, and mostly work on Windows. The open source support for XML Schema is pretty light because (1) it's a very hard, and overly complex problem that's (2) poorly documented and (3) geared towards large companies making "interchangeable" tools (Oracle, IBM, Microsoft, etc.). If you really want any level of XML Schema support in a cross-platform tool, the best places to look at the moment are Xerxes and libxml, even though neither is fully compliant with XML Schema at the moment...

Re:Automagic Anything

darobin on 2002-12-17T15:27:55

I second what Adam says, XML Schema support is constantly getting better but there is currently no completely reliable implementation (and of course no two have the same bugs). As someone that has worked on the source of several, I know why that's the case. At this point no one knows when implementers will stop hitting awful edge cases in the spec...

XML Spy is probably the laxest validator out there, half of the problems we have stem from schemata received from people that think they're valid because of that piece of junk.

Automagic generation is currently in a sorry state, especially on the simpleType front. We have a viable one (for the kind of information we need from a schema), but it's not slated for release (possibly never). Interestingly enough the other one that gets close to being usable (a good test being infering the XHTML schema) is Microsoft's XSD Inference tool, which last I checked had source available (in C#, which is palatable).

Doc generation is already easier. We have our own doc generator (it's not for sale currently, but I could perhaps arrange something) and there are a number available out there. Search the xml-dev archives for some info on several.

I don't know much about visual editing, I would certainly never use that :)

Re:Automagic Anything

ziggy on 2002-12-18T14:20:49

Automagic generation is currently in a sorry state, especially on the simpleType front.
One thing that I don't understand is how anyone can expect autogeneration of a simpleType to work. At best, I can see autogeneration of a simpleType to identify that the content of an element/attribute is either a URI or plain text. Identifying any of the other facets (numeric attributes, min/max length, regex, etc.) doesn't seem possible in a general tool. For example, how is a schema supposed to auto-generate a simpleType definition for a US Zip code (and Zip+4)? UK/Canadian Postal code? Phone number? These are the kinds of things I want a schema to validate for me, and the kinds of things that I don't expect any generic tool to recognize on my behalf.

Re:Automagic Anything

darobin on 2002-12-18T15:33:57

It normally would depend on a variety of options. Most tools put themselves on the safe side (if there is any such thing in schema inference) by always using xsd:string, but in some you can ask for them to be smarter.

In such cases, it is possible to detect URIs (to a degree), date/time, and number types from the rest rather easily. If strings are what's left, you can refine that in two ways. The first is to produce an enumeration. That's good with a limited number of differences, but it quickly becomes stupid. Guessing when it becomes stupid is unsolvable and left to user decision because whatever arbitrary limit you set it is unlikely to be higher than SVG's schema enumeration of colours, which is nevertheless a legitimate enumeration.

The other approach is to use the strings you see to define a pattern. The techniques used are close to those used to infer schemata. The results are also usually quite ugly, and should be limited by pattern complexity. Zip codes however should be doable (though of course French zip codes would be inferred as ints...).

Another hard problem is that of simplification, ie how do I know that attribute foo on element elt1 has the same type as attribute bar on element elt2. Given the inference set, they may have slightly different types. You can then have a lot of fun trying to generalise with pattern proximity and such stuff ;)

Unfortunately/thankfully, we won't be going that far because as you know what we're interested in is efficient infoset encodings, and two weeks ago we discovered a way to encode schema-less content that is on par with zip (but a *lot* faster) and doesn't require the overhead and complexity of having a schema inferencer on the encoding side, and in-stream compiled schema updates sent to the decoder.

Of course, it'll never tell you that you have a "Zip code", but that's probably an ontology problem (says he, quickly waving the ball away to other AI fields).

To be honest, apart from a very limited class of applications (which requires machine-learning for some reason), I've never met a situation in which schema autogeneration was useful. Apart from trivial cases it's not useful to humans: writing the schema from scratch is likely easier than going through the generated one and fixing it bit by bit, even if you're lucky to have several hundred instances, even if you can read abigail's JAPHs. petdance: can you enlighten us on why you wanted this feature?

XML Spy

ajt on 2002-12-22T22:31:16

We use XML Spy 4.x at work, and it's very good really. Yes it's a Windows product, and yes it uses MSXML by default for many things, and yes it's closed source and expensive, but it's the best GUI tool I've seen.

I've done some schema work with it, but to be honest no more than auto-generating a schema from a source XML file. It did work, and it wasn't such a bad schema. It also does autmatic DTDs and they're not so bad either, but not perfect - you still have to tweak them afterwards.

If you plan to do any XSL-T, then configure it to use an external XSL-T engine, the MSXML it uses by default sucks, I use Xalan normally, but Saxon is supposed to be good too. Normally I use LibXML/LibXSLT inside Perl code though.

I think the new version (5.0) is very fancy and may even offer extra debugging tools!

Good Luck! report back as to what you found.