Today I released the first small piece of the Collation Project. (Yes, I have another research proposal I ought to be writing. Yes, I spent hours today writing documentation and formalizing tests. What's your point?)
This piece addresses the problem of transcribing manuscripts efficiently. It is my weird idea of a markup language for TEI XML. As an added bonus for people who aren't me, it exports a function that takes an existing TEI XML file (well, string), parses it, wraps all the whitespace-separated words in <w/> ("word") tags, and returns the new file. Identifying the words is, after all, step one in efficient word collation.
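For the curious, here is a rough sketch of what that word-wrapping step amounts to. It is written in Python with the standard library purely for illustration and is not the released module's code. The names `wrap_words` and `_wrap_element` are hypothetical, and the sketch assumes un-namespaced input, only looks inside `body` elements, treats anything separated by whitespace as a word, and throws away the original spacing between words.

```python
import xml.etree.ElementTree as ET


def _wrap_element(elem):
    """Wrap each whitespace-separated word under elem in a <w> element."""
    if elem.tag == "w":
        return  # already wrapped; leave it alone

    rebuilt = []

    # Words that come before the first child element.
    for word in (elem.text or "").split():
        w = ET.Element("w")
        w.text = word
        rebuilt.append(w)

    for child in list(elem):
        # Text that follows a child lives in that child's "tail".
        tail_words = (child.tail or "").split()
        child.tail = None
        _wrap_element(child)  # handle words nested inside the child
        elem.remove(child)
        rebuilt.append(child)
        for word in tail_words:
            w = ET.Element("w")
            w.text = word
            rebuilt.append(w)

    elem.text = None
    elem.extend(rebuilt)


def wrap_words(xml_string):
    """Parse an XML string, wrap the words in every <body>, return a string.

    Assumption: elements carry no XML namespace. Real TEI files usually
    declare one, which this sketch does not handle.
    """
    root = ET.fromstring(xml_string)
    for body in root.iter("body"):
        _wrap_element(body)
    return ET.tostring(root, encoding="unicode")


sample = "<TEI><text><body><p>Arma <hi>virumque</hi> cano</p></body></text></TEI>"
print(wrap_words(sample))
```

Running it a second time on its own output should change nothing, since anything already inside a `w` element is skipped. A word split across an element boundary would come out as two words here, which is one of several reasons this is only a sketch.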
This also means that my collator should be able to handle pretty much any language or writing system, as long as the basic unit of meaning that ought to be collated is enclosed within a <w/> tag. When it's done, of course.
This also means that I am going to need a module name for the collator soon. Suggestions?