It's been a while since I've given any sort of status update on my collation project. I've spent most of the past few weeks writing the "conventional" half of my thesis, in which I have to prove that I can talk intelligently about medieval Armenian literature without hiding behind source code.
I have made some progress though. As of a week or two ago, I re-tooled my collation engine to work with plain-text input, trivial TEI input, and TEI input in which each word is marked up with the <w> tag. That last is important, because it means I no longer have to assume that words are whitespace-separated. Now, as long as you provide semantic markup to define "what is a word?", and you provide a canonization function for your script if necessary, the collation engine should be able to handle any text in any script at all.
(The canonization function is meant to, well, canonize the orthographic variants within a script so that the collator will trivially recognize them as the same word. So for Armenian, it means that the letter օ is the same as աւ, and the ligature և is the same as the two letters 'ե'+'ւ', and a few other things. Since I don't want to learn the rules for all human languages, I just leave a place for the user to provide a coderef to do this.)
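To make the idea concrete, here is a minimal sketch in Python of how the two pieces could fit together. It is not the collation engine's actual code: the names `words_from_tei`, `canonize_armenian`, and `collatable_words` are hypothetical, the Armenian rules cover only the two equivalences mentioned above, and it assumes the `<w>` elements live in the standard TEI namespace.

```python
import xml.etree.ElementTree as ET

# Assumption: words are marked up as <w> in the standard TEI namespace.
TEI_W = "{http://www.tei-c.org/ns/1.0}w"


def words_from_tei(xml_string):
    """Return the text of every <w> element, in document order.

    The markup alone decides what counts as a word, so no whitespace
    between words is required.
    """
    root = ET.fromstring(xml_string)
    return ["".join(w.itertext()).strip() for w in root.iter(TEI_W)]


def canonize_armenian(word):
    """Hypothetical, deliberately simplified canonizer for Armenian.

    Only the two equivalences named in the post are handled:
    the ligature և becomes ե + ւ, and the letter օ becomes աւ.
    """
    return word.replace("և", "եւ").replace("օ", "աւ")


def collatable_words(xml_string, canonizer=None):
    """Extract words and, if a canonizer is supplied, apply it to each one."""
    words = words_from_tei(xml_string)
    if canonizer is None:
        return words
    return [canonizer(w) for w in words]


# Two witnesses that differ only in orthography; note the absence of
# whitespace between the <w> elements.
WITNESS_A = (
    '<text xmlns="http://www.tei-c.org/ns/1.0"><body><p>'
    "<w>և</w><w>օր</w>"
    "</p></body></text>"
)
WITNESS_B = (
    '<text xmlns="http://www.tei-c.org/ns/1.0"><body><p>'
    "<w>եւ</w><w>աւր</w>"
    "</p></body></text>"
)

if __name__ == "__main__":
    a = collatable_words(WITNESS_A, canonize_armenian)
    b = collatable_words(WITNESS_B, canonize_armenian)
    print("witnesses agree after canonization:", a == b)
```

Passing the canonizer in as an argument mirrors the "leave a place for the user" design: the word extraction knows nothing about any particular script, and a different language only needs a different function.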
As long as I was re-tooling my code, I also took the opportunity to try this "test-driven development" thing that seems to be all the best-practices rage at the moment. It certainly works to some extent (I have plenty of tests now, and find it very easy to run them every time I change some code), but as the project gets more complex, I'm finding it harder to have the patience to nail down the design and write the tests before I just plunge into the code.
Finally, as a reward for reading this far, I give you a TEI encoding (with commentary; watch carefully) of Bob Dylan's "Subterranean Homesick Blues". Well worth watching.