From YAML to XML through a DTD

potyl on 2008-10-07T21:39:10

The Bratislava PM web site now has an RSS feed. The feed is currently generated from a custom-made YAML file that is transformed into RSS with XML::RSS. This approach is simple and quite flexible, but it has some quirks.

First, it's almost impossible to verify that the YAML file follows the expected template without writing our own validation. For instance, if a feed entry is missing the title, the link or the date, there's no built-in mechanism to inform us of these errors.
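
To give an idea of what such hand-written validation involves, here is a minimal sketch in Python (chosen only for illustration; the site's own tooling is Perl). It assumes the YAML file has already been loaded into a list of dictionaries and that title, link and date are the required keys. Both are assumptions, since the actual template isn't shown here.

```python
# Assumed set of required keys; the real template may require more.
REQUIRED_FIELDS = ("title", "link", "date")


def missing_fields(entry):
    """Return the required fields that are absent or empty in one feed entry."""
    return [name for name in REQUIRED_FIELDS if not entry.get(name)]


def validate_feed(entries):
    """Map the index of each invalid entry to the list of fields it lacks."""
    problems = {}
    for index, entry in enumerate(entries):
        missing = missing_fields(entry)
        if missing:
            problems[index] = missing
    return problems
```

Calling validate_feed on a list whose second entry only has a title reports that entry's index together with the missing link and date. Every new rule (allowed HTML, date format and so on) would need more code of this kind.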

Secondly, the main content of each feed element is allowed to contain HTML. In fact, all the feed items that we have include HTML. Embedding HTML in a YAML file doesn't make the input file very pleasant to read, since it now holds two markup languages. Of course, one can argue that YAML Ain't Markup Language (tm); nevertheless, it is weird to embed HTML in YAML.

Finally, converting YAML to XML seems strange. YAML is mainly used for data structures, configuration files and data serialization. Using it for content manipulation might be pushing it too far.

For this particular context XML seems more appropriate. One advantage is that it can be validated through a DTD, an XML Schema or RELAX NG. HTML and XML can coexist without problems, especially if XHTML is used. And transforming an XML file into an RSS feed can easily be done through XSLT.

Using XML as the input format brings an interesting advantage: thanks to a DTD, not only can we validate each feed entry in the input file, but we can also validate the HTML that's embedded in the feed's description.

By using some clever XML and DTD hacks it's possible to create a custom-made feed that can be validated without too much effort. Let's assume that an RSS feed contains events and that each event has:

  • title
  • link
  • description (can contain HTML)
  • subject
  • creator
  • date
  • id
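
As an illustration, the sketch below shows what an input file with these fields might look like and how it could be turned into RSS items. XSLT would be the natural tool for this; Python's standard library has no XSLT processor, so the sketch swaps in xml.etree.ElementTree instead. The namespace URI behind the ba prefix, the ba:feed and ba:event element names, the sample values and the mapping of fields to RSS elements are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI for the ba: prefix.
BA_NS = "http://example.org/ba"

# Hypothetical input file holding a single event with the seven fields.
SAMPLE = f"""<ba:feed xmlns:ba="{BA_NS}">
  <ba:event>
    <ba:title>October meeting</ba:title>
    <ba:link>http://example.org/meetings/2008-10</ba:link>
    <ba:description>Plain text for now.</ba:description>
    <ba:subject>meeting</ba:subject>
    <ba:creator>someone</ba:creator>
    <ba:date>2008-10-07</ba:date>
    <ba:id>2008-10-meeting</ba:id>
  </ba:event>
</ba:feed>"""

# Assumed mapping from RSS element names to ba: field names.
FIELD_MAP = (
    ("title", "title"),
    ("link", "link"),
    ("description", "description"),
    ("guid", "id"),
)


def event_to_item(event):
    """Build an RSS <item> element from one ba:event element."""
    item = ET.Element("item")
    for rss_name, ba_name in FIELD_MAP:
        value = event.findtext(f"{{{BA_NS}}}{ba_name}", default="")
        ET.SubElement(item, rss_name).text = value
    return item


def feed_to_rss(xml_text):
    """Convert the whole input document into an <rss> element tree."""
    root = ET.fromstring(xml_text)
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    for event in root.iter(f"{{{BA_NS}}}event"):
        channel.append(event_to_item(event))
    return rss
```

ET.tostring(feed_to_rss(SAMPLE), encoding="unicode") then yields the serialized feed. A real feed would also need the channel's own title, link and description, which are left out here.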

The following DTD describes and validates a feed input file:

Although this DTD can be used for simple feed elements, it has a problem: it doesn't allow any HTML inside the element ba:description! This defeats the purpose of replacing YAML with XML. But all is not lost, as this can easily be fixed by importing the XHTML DTD into our DTD and by redefining the element ba:description so that it accepts any HTML tag that a div accepts:

%xhtml;

Thanks to this new DTD the element ba:description can include any HTML element that's allowed within a div element. The DTD takes care of the validation and ensures that only valid HTML is inside the element. For instance, adding the element body to the element ba:description will be rejected by the DTD: even though body is a valid HTML element, it's not allowed within a div.

The element ba:description is declared in our DTD the same way that the element div is in the XHTML DTD. Furthermore, the element is allowed to set the default namespace to XHTML, which makes all child elements of ba:description belong to the XHTML namespace. This is very handy when processing the XML file later on.
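
The effect of the default namespace can be seen with a small sketch, here in Python with xml.etree.ElementTree purely for illustration. The namespace URI bound to the ba prefix and the sample markup are hypothetical; the XHTML namespace URI is the standard one.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Hypothetical namespace URI for the ba: prefix.
BA_NS = "http://example.org/ba"

# ba:description sets the default namespace, so its unprefixed
# descendants are XHTML elements.
SAMPLE = f"""<ba:description xmlns:ba="{BA_NS}" xmlns="{XHTML_NS}">
  <p>The next meeting is on <b>Tuesday</b>.</p>
</ba:description>"""


def descendant_tags(xml_text):
    """Return the namespace-qualified tag of every descendant element."""
    root = ET.fromstring(xml_text)
    return [element.tag for element in root.iter() if element is not root]


if __name__ == "__main__":
    for tag in descendant_tags(SAMPLE):
        print(tag)
```

Both the p and the b element come back qualified with the XHTML namespace, although neither carries a prefix in the source. A later processing step can therefore pick out the embedded HTML by namespace alone.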

It is not difficult to see that the new version of the feed will be generated from an XML file, as using XML is quite advantageous here.


Kwalify & Data::Rx

srezic on 2008-10-07T22:05:37

Are you aware of Kwalify and Data::Rx, both schema languages for data structures?

Re:Kwalify & Data::Rx

potyl on 2008-10-08T06:09:39

No, I wasn't aware that they existed; thanks for the pointers. I went quickly over the documentation and they look quite nice.

However, in our case the main goal was to mix the validation of our own elements with the validation of the ones provided by XHTML. Using Kwalify or Data::Rx would require us to translate (or worse, to rewrite) the XHTML DTD into some other language.

Also, using XML and XHTML has some other advantages that I will describe in future posts.