XML: there's some good after all

aurum on 2008-09-06T12:38:44

Those of you who were at my talk at YAPC might remember my mini-rant against XML. It's annoying to parse; the parsing libraries in Perl are among the more poorly-documented modules I've encountered; it seems in general to be one of those solutions that is over-engineered for any problem I encounter.

Well, last Thursday I spoke to a few guys from the Oxford Text Archive. The first frightening realization that I had to wrap my head around is that, for all the ways I naturally think in Perl, they think in XSLT.

Just...ponder that for a few minutes.

Here all this time I've thought of XML as, well, a "Markup Language". It has its uses, but basically I get uncomfortable with XML at the point where it stops being easily human-readable. It was, to say the least, odd to find a set of people who think of data as the basic building blocks of everything, XML as a way to express those building blocks, and XSLT as a way to manipulate them in whatever way they need. It's like object orientation taken to its most frightening extreme.

So it turns out that the XML spec in question—the TEI guidelines—was thought up by a bunch of people who have taken a lot of feedback from scholars who work with texts of all kinds. There are chapters that could use more revision, sure, but basically the TEI XML spec has been informed by a bunch of people who are dealing with the problems I face and a lot more problems besides. As XML goes, it's a spec that's expressive enough for pretty much everything I might hope to encode about the text.

As it happens, I appreciated that fact already. I'd noticed that the TEI gave me a bunch of things to think about when transcribing a manuscript (abbreviations? marginal notes? catchwords? smaller-scale word corrections? abbreviation characters that appear in the manuscripts but aren't yet in Unicode? It's all there!) that I otherwise would have glossed over or interpreted without transcribing. But I was still thinking of it as a markup language—a standardized way of encoding information that might be useful to someone, someday, but not necessarily relevant to reading the words in the text and deriving enough meaning to compare it to other texts. Useful, to some extent, but not useful enough for my immediate problem (comparing the texts, which can reasonably be done word by word, without any meta-information) for me to bother with very deeply.

Meanwhile, a problem I have talked around in these blog posts but not addressed head-on is that of data representation and storage. I have the information available in each manuscript; the problem I have not solved yet is "How do I represent that data? More importantly, how do I represent the decisions I make about the significance of that data?" It turns out that not only can this be done within the TEI spec, but the spec allows for quite a lot of information (e.g. word divisions, and morphological analysis: the ability to distinguish grammatically significant variations of words) that I've been looking for my own way to encode.

The upshot is, TEI XML makes it very easy and straightforward (well, for some definitions of "easy" and "straightforward"; I'll come back to this, probably in the next post) to mark and represent words, prefixes, suffixes, sectional divisions, marginal notes, and all sorts of stuff that may or may not prove to be significant. All I have to do is parse this information as it is given, rather than making heuristic guesses about how to derive it. I currently feed plaintext strings to my collator; there's no reason I can't feed regularized words based on the XML transcription.
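
To make that concrete, here's a minimal sketch of the sort of thing I mean. The fragment is invented, but the elements are real TEI: <w> with a lemma attribute, and <choice> with <orig>/<reg> for regularized spellings:

    use strict;
    use warnings;
    use XML::LibXML;

    # A made-up scrap of transcription: each <w> carries a lemma, and a
    # spelling variant is recorded with <choice><orig>/<reg>.
    my $fragment = q{
    <p xmlns="http://www.tei-c.org/ns/1.0">
      <w lemma="sanctus"><choice><orig>scs</orig><reg>sanctus</reg></choice></w>
      <w lemma="Nicolaus">Nicholaus</w>
    </p>
    };

    my $doc = XML::LibXML->load_xml( string => $fragment );
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs( tei => 'http://www.tei-c.org/ns/1.0' );

    my @words;
    for my $w ( $xpc->findnodes('//tei:w') ) {
        # Prefer the regularized form where the transcription offers one.
        my ($reg) = $xpc->findnodes( './/tei:reg', $w );
        push @words, $reg ? $reg->textContent : $w->textContent;
    }

    # These regularized forms go to the collator instead of raw plaintext.
    print join( ' ', @words ), "\n";    # sanctus Nicholaus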

Not only does TEI handle manuscript description; it also handles representation of critical editions. As I may have explained before, a critical edition generally presents a base text and an "apparatus", i.e. a specially-formatted block of footnotes, that contains the variations present in all the manuscripts of the text. From a data-representation point of view, the important thing here is that each word can be composed of a "lemma"—the base word—and its different "readings". Viewed that way, even the lemma is optional. A word can be composed of nothing but its variant readings.
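
For the concrete-minded, an apparatus entry of that shape uses TEI's <app>, <lem>, and <rdg> elements. Here's a sketch, with invented readings and witness sigla, that pulls one such entry apart:

    use strict;
    use warnings;
    use XML::LibXML;

    # One apparatus entry: an (optional) lemma plus variant readings, each
    # carrying the sigla of its witnesses. Readings and sigla are invented.
    my $entry = q{
    <app xmlns="http://www.tei-c.org/ns/1.0">
      <lem wit="#A">sanctus</lem>
      <rdg wit="#B #C">sanctissimus</rdg>
      <rdg wit="#D">beatus</rdg>
    </app>
    };

    my $doc = XML::LibXML->load_xml( string => $entry );
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs( tei => 'http://www.tei-c.org/ns/1.0' );

    my $lemma = $xpc->findvalue('/tei:app/tei:lem');    # empty if no lemma
    print "lemma: $lemma\n" if $lemma;
    print "reading: ", $_->textContent, " (wit ", $_->getAttribute('wit'), ")\n"
        for $xpc->findnodes('/tei:app/tei:rdg');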

And this is the first and easiest thing my collator gives me. I make each "row" in my collation output into a list of readings, and write it out according to the TEI spec. When I'm ready to start editing, my program can read that file, present the options to me whenever there's more than one reading, and save my editing decisions back into the XML file. Then I can use pre-existing XSLT files to translate that result into LaTeX and printed text. This is particularly good, because as far as I'm concerned the only "good" form of XSLT is "XSLT that someone else has written and tested."
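
That last step is only a few lines with XML::LibXSLT; a sketch, with both file names standing in for the real things:

    use strict;
    use warnings;
    use XML::LibXML;
    use XML::LibXSLT;

    # Run somebody else's (tested!) stylesheet over the finished edition.
    # Both file names are placeholders.
    my $source = XML::LibXML->load_xml( location => 'edition.xml' );
    my $style  = XML::LibXML->load_xml( location => 'tei-to-latex.xsl' );

    my $stylesheet = XML::LibXSLT->new->parse_stylesheet($style);
    my $result     = $stylesheet->transform($source);
    print $stylesheet->output_string($result);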

In short, other people have already thought about this problem, and I can use the fruits of their labor with only a very small abuse of their intentions. The only real cost is having to bash my head against libxml2.


TEI and XLink

cyocum on 2008-09-06T14:07:17

One of the things that I think is missing from TEI is XLink. I have in mind a project I call the "Critical Edition Browser", which would graphically show the connections between various recensions and copies of a text so that no one text is privileged over any other (a classical critical edition set-up tends to privilege one). Basically, what I would want is two TEI-encoded texts that have XLink arcs to each other in such a way as to show the lemma and stemma between the two (or more) texts. This would obviate the need for special mark-up for critical editions, as it would all be encoded in the XLinks between the two documents. The "browser" would then show these links in such a way as to make the relationships obvious, and allow the scholar to click through and otherwise manipulate the edition on screen.
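
Here's a rough sketch of the kind of link I mean, as an extended XLink; the element names, file names, and word identifiers are all invented, and only the xlink:* attributes come from the XLink spec:

    use strict;
    use warnings;
    use XML::LibXML;

    my $XLINK = 'http://www.w3.org/1999/xlink';

    # An extended XLink tying a word in one TEI text to the corresponding
    # word in another. Element names, file names, and word identifiers
    # are invented.
    my $link = q{
    <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="extended">
      <loc xlink:type="locator" xlink:href="textA.xml#w101" xlink:label="a"/>
      <loc xlink:type="locator" xlink:href="textB.xml#w97"  xlink:label="b"/>
      <arc xlink:type="arc" xlink:from="a" xlink:to="b"/>
    </link>
    };

    my $doc = XML::LibXML->load_xml( string => $link );
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs( xl => $XLINK );

    # Resolve each arc's from/to labels to the locators' hrefs.
    my %href = map { $_->getAttributeNS( $XLINK, 'label' )
                         => $_->getAttributeNS( $XLINK, 'href' ) }
               $xpc->findnodes('//*[@xl:type="locator"]');
    printf "%s -> %s\n",
        $href{ $_->getAttributeNS( $XLINK, 'from' ) },
        $href{ $_->getAttributeNS( $XLINK, 'to' ) }
        for $xpc->findnodes('//*[@xl:type="arc"]');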

Anyway, yeah, TEI definitely allows a greater level of flexibility in encoding a document.

XSLT

Aristotle on 2008-09-06T22:03:08

> The first frightening realization that I had to wrap my head around is that, for all the ways I naturally think in Perl, they think in XSLT.
>
> Just…ponder that for a few minutes.

Nothing bizarre about that at all. :-) I can’t claim to be a decade-of-experience expert in XSLT as I can claim to be in Perl, but I am very good with the language, and I like it a whole lot. The syntax is dreadfully verbose, but at the semantic level – its computation model – it is extremely elegant. You can do things in XSLT with a dozen lines of code that would be terribly cumbersome to express in any XML API in any more general-purpose language. You just have to wrap your head around it the right way (which is difficult because every last XSLT tutorial and introduction, as far as I have seen, is crap).

(Oh, and libxml2 is not at all hard or bad.)

Re:XSLT

aurum on 2008-09-06T23:40:23

I guess the thing I find frustrating about libxml2 is that I want a nice compact way of saying "Get me the one-and-only FOO child element from the one-and-only BAR element of the document." Am I missing something?

It's also possible - moderately likely, even - that I'll convert my parsing to a SAX model, and that will make that particular frustration go away.

However, the real problem I have with libxml2 at the moment is that it doesn't like the TEI RelaxNG files, and I don't know whose fault that is. It means that I can't currently do any programmatic validation; I have to fire up oXygen if I want to check that a doc is valid.
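
For the record, here is roughly what the programmatic check ought to look like when it works; the file names are placeholders, and in my case it's the schema parse itself that dies:

    use strict;
    use warnings;
    use XML::LibXML;

    # What programmatic validation ought to look like; for me it is the
    # schema parse itself that blows up. 'tei_all.rng' stands in for
    # whichever TEI schema you actually use.
    my $doc = XML::LibXML->load_xml( location => 'mydoc.xml' );
    my $rng = eval { XML::LibXML::RelaxNG->new( location => 'tei_all.rng' ) }
        or die "libxml2 choked on the schema: $@";

    eval { $rng->validate($doc) };
    print $@ ? "invalid: $@" : "valid\n";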

Re:XSLT

Aristotle on 2008-09-07T03:34:29

> Am I missing something?

Yes, XPath. Forget the DOM API, and for the most part, SAX as well.
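
Concretely, something like this (a sketch; the file name is made up):

    use strict;
    use warnings;
    use XML::LibXML;

    # "The one-and-only FOO child of the one-and-only BAR" is one XPath
    # expression; the [1] predicates make the "one and only" explicit.
    my $doc   = XML::LibXML->load_xml( location => 'doc.xml' );
    my ($foo) = $doc->findnodes('(//BAR)[1]/FOO[1]');
    print $foo->textContent, "\n" if $foo;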

> However, the real problem I have with libxml2 at the moment is that it doesn’t like the TEI RelaxNG files

Ah, yes. The validation support in libxml2 is not all that great.

Re:XSLT

chromatic on 2008-09-07T00:07:41

I may have asked this before, but is it XSLT you like or XPath? I've never managed to like XSLT, but I do like XPath. The syntax isn't always perfect, but I can't think of improvements.

Re:XSLT

Aristotle on 2008-09-07T03:34:08

Both. XPath isn’t dreadfully verbose; XSLT is. (It would greatly benefit from a non-XML rendition of its syntax, just like RelaxNG has both an XML and a Compact syntax.) But the basic model (recursive node visiting) is a perfect match for XSLT’s job. The apply-templates directive is basically a map with polymorphic callback using XPath-based dispatch. That’s all there is to XSLT.

Of course, most people write for-each-heavy transforms instead, so they gain none of the elegance of this model. They would be better off writing that code in some general-purpose language. The result would still be cumbersome, but the awkwardness would at least not be exacerbated by the language having extremely limited facilities for general-purpose programming.
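
Here is a toy illustration of the difference; everything in it is invented, but note that the stylesheet contains no loops at all: one template per kind of node, and apply-templates does the dispatch.

    use strict;
    use warnings;
    use XML::LibXML;
    use XML::LibXSLT;

    # A toy transform in the apply-templates style: no loops, just one
    # template per kind of node, dispatched by XPath pattern match.
    my $xml = q{<doc><para>plain</para><note>aside</note><para>more</para></doc>};

    my $xsl = q{
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="text"/>
      <xsl:template match="doc"><xsl:apply-templates/></xsl:template>
      <xsl:template match="para">P: <xsl:value-of select="."/><xsl:text>&#10;</xsl:text></xsl:template>
      <xsl:template match="note">N: <xsl:value-of select="."/><xsl:text>&#10;</xsl:text></xsl:template>
    </xsl:stylesheet>
    };

    my $sheet  = XML::LibXSLT->new->parse_stylesheet( XML::LibXML->load_xml( string => $xsl ) );
    my $result = $sheet->transform( XML::LibXML->load_xml( string => $xml ) );
    print $sheet->output_string($result);    # P: plain / N: aside / P: more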

(Do note that I presume EXSLT support, which largely rectifies the least tolerable aspects of the language. Bare XSLT 1.0 is no fun for any tasks but the trivial – too many complexity management tools are missing.)