moving on from collation

aurum on 2008-08-25T23:32:19

It seems that I would much rather talk about software design issues than write this research proposal. Well, I may as well get something useful done.

So far, I have described the design of what I have been calling the "MCE", or the "manuscript collation engine". It works pretty well at this point, and when I run it on a set of transcribed text files, I get back a bunch of arrays full of Word objects, lined up neatly according to similarity and relative placement. Now I just have to use them. This is where I start speculating about what to do next.

I said at some point that I would talk about the structure of a Word object, but really there is little enough to tell. A Word is an object that will keep track of whatever I tell it to remember about a particular word—its matches, its variants, its original or normalized or canonicalized spelling, its punctuation, whether it should be considered the "base" word or a "variant" word or an "error".
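
To make that concrete, here is a minimal sketch of such an object in Perl. The field and method names are only illustrative, not the actual MCE internals:

    package Word;

    use strict;
    use warnings;

    # A hash-based container for whatever the collation and editing
    # passes decide to remember about a word. Field names here are
    # illustrative guesses, not the real MCE internals.
    sub new {
        my ( $class, %args ) = @_;
        my $self = {
            ms_sigil    => $args{ms_sigil},     # which manuscript this word came from
            original    => $args{original},     # the spelling as transcribed
            normalized  => $args{normalized},   # the spelling after normalization
            punctuation => $args{punctuation},  # any attached punctuation
            status      => $args{status} || 'unset',  # 'base', 'variant', or 'error'
            matches     => [],   # Words in other MSS judged identical
            variants    => [],   # Words in other MSS judged variant
        };
        return bless $self, $class;
    }

    # Simple read-only accessors.
    sub ms_sigil   { return $_[0]->{ms_sigil} }
    sub normalized { return $_[0]->{normalized} }

    # Get or set the editorial status of this word.
    sub status {
        my ( $self, $new ) = @_;
        $self->{status} = $new if defined $new;
        return $self->{status};
    }

    1;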

Of course, many of the attributes I might want "remembered" can't actually be detected at collation time. Some of them are editing decisions, and others need the judgment of a human (or a set of rules) that understands things about the Armenian language. It's high time I wrote the editing interface.

(Nomenclature will be the death of this project, incidentally. It's bad enough that the computer world and the critical-text-edition world use the word "collation" differently. Now I want to write a program that, in the terminology of the humanities, ought to be called a "text editor." Great.)

So. I start with a bunch of arrays of words, and the superficial relationships between them. The end result should be a base text, and an apparatus that records the set of variants I have judged worth including. Along the way, I should have to do as little work as possible. This means several things:

  • I need to remember which word goes with which manuscript.
  • I need a way of marking a word as "base", that is, the accepted main reading.
  • I need a hierarchical series of categories of "variant", including but not limited to:
    • Grammatically sensical differences
    • Apparent grammatical errors
    • Orthographic variations
  • I need to be able to "smooth" strings of variant words into a single variant.
  • I need a means of "teaching" the program about my decisions, so that I am never asked more than once about the same orthographic variation (see the sketch after this list).
  • I need a way of saving the decisions I've made.
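
For the last two points, the "teaching" could be as simple as a persistent hash of decisions, keyed on the pair of spellings involved. A minimal sketch, with hypothetical names and Storable as the save format:

    use strict;
    use warnings;
    use Storable qw( store retrieve );

    # A persistent cache of editorial decisions, so the program never
    # asks twice about the same pair of spellings. The file name and
    # key scheme are hypothetical.
    my $cache_file = 'decisions.store';
    my $decisions  = -e $cache_file ? retrieve($cache_file) : {};

    # Key the cache on the two spellings in a fixed order, so that
    # lookup is symmetric.
    sub _key {
        my ( $first, $second ) = sort @_;
        return join "\x{1f}", $first, $second;
    }

    # Returns the recorded category (e.g. 'orthographic'), or undef
    # if this pair has never been seen.
    sub lookup_decision {
        my ( $word_a, $word_b ) = @_;
        return $decisions->{ _key( $word_a, $word_b ) };
    }

    # Record a decision and save immediately, so nothing is lost.
    sub record_decision {
        my ( $word_a, $word_b, $category ) = @_;
        $decisions->{ _key( $word_a, $word_b ) } = $category;
        store( $decisions, $cache_file );
    }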

But those are just the easy things. Two aspects of the problem are particularly tricky. The first is punctuation. The punctuation in Armenian manuscripts is all over the place. Do I mostly disregard it? Do I treat punctuation marks as words in their own right? Do I handle it all case by case, and thereby give myself more work?

The second is the issue of partial-word readings. Remember that a "reading" is a "minimally distinctive unit of sense"; that means that a single word may contain multiple readings. Prefixes can have grammatical effects. For example:

  • աշխարհ (ashkharh): "land", but
  • աշխարհն (ashkharhn): "the land", and
  • յաշխարհն (yashkharhn): "into the land".

The last is especially tricky, as it can be written either as the single word յաշխարհն or as two separate words, ի աշխարհն. If I am standardizing the orthography across manuscripts, I should separate the prefix յ, converting it to the preposition ի; I'll have to split the Word object, and align the resulting pair of Words with the Words in my other arrays. Alignment and word matching are problems I have already solved with the MCE, but this means that the editing program will have to call back into the MCE to re-align the words in question.
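
A sketch of what that split might look like, using the hypothetical Word class above:

    use strict;
    use warnings;
    use utf8;

    # Split a prefixed Armenian word into preposition + bare word,
    # e.g. "յաշխարհն" -> ("ի", "աշխարհն"). Deciding whether a given
    # initial յ really is the prefix is left to a rule set or to the
    # human editor; this only performs the mechanical split.
    sub split_y_prefix {
        my ($word) = @_;
        my $spelling = $word->normalized;
        return ($word) unless $spelling =~ /^յ(.+)$/;

        my $preposition = Word->new(
            ms_sigil   => $word->ms_sigil,
            normalized => 'ի',    # standardized form of the յ- prefix
        );
        my $remainder = Word->new(
            ms_sigil   => $word->ms_sigil,
            normalized => $1,
        );

        # The caller would then hand this stretch of text back to
        # the MCE, so the new pair of Words is re-aligned against
        # the other manuscripts.
        return ( $preposition, $remainder );
    }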

As usual, I've launched into a whole raft of explanation, and not even asked for anyone's opinion on specific questions yet. Maybe tomorrow.


A View from the Humanities

cyocum on 2008-08-26T15:15:40

I just read about your MCE, and if I had had something like that while doing my PhD, it would have been a godsend. Anyway, one thing I wanted to point out, which happens often in medieval Celtic Studies, is the situation where you have two texts that bear the same title, or are very much alike, but are not exactly the same text even allowing for variant spellings. For instance, in the edition I have done, one of the scribes moved two of the lines from the original elsewhere in the poem and filled in those lines with something else that was appropriate for the poem but was not an exact copy. Unfortunately, this is always going to be a problem.

If you have any thoughts on this, I am very happy to hear them.

Handling large variants

aurum on 2008-08-26T16:03:51

I have been concentrating on word-level variants, it's true, because it's easiest for the computer to find meaningful differences when you break the texts down to their smallest meaningful constituent parts. For most Western languages, that's a word.

The text I'm working on also has sentence-long (or paragraph-long, or in one case section-long) additions/deletions appearing in certain texts. (It also has word transpositions, which the MCE can detect, but which I haven't decided how to treat.) As far as the MCE is concerned, a sentence-long addition in one manuscript counts as a bunch of NULs in all the other manuscripts. When I said above that I need to be able to "smooth" strings of variant words into a single variant, this is what I was talking about.
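
To sketch that smoothing: suppose each collated witness is an arrayref of Word-or-undef, with undef standing in for the NUL placeholders. Then finding the candidate variants is just finding the runs of gaps:

    use strict;
    use warnings;

    # Find runs of adjacent gap positions in one collated witness.
    # Each run is one candidate "smoothed" variant: a stretch where
    # this witness has nothing and the others have words.
    sub gap_runs {
        my ($witness) = @_;    # arrayref of Word objects or undef
        my ( @runs, $start );
        for my $i ( 0 .. $#{$witness} ) {
            if ( !defined $witness->[$i] ) {
                $start = $i unless defined $start;    # open a run
            }
            elsif ( defined $start ) {
                push @runs, [ $start, $i - 1 ];       # close the run
                undef $start;
            }
        }
        push @runs, [ $start, $#{$witness} ] if defined $start;
        return @runs;    # list of [ first_index, last_index ] pairs
    }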

In your case, with a couple of verses that have been transplanted in a poem, the MCE would detect and record the differences on the word level. It is the job of the editing program to put those words back together into well-defined variants, and that's the part whose design I'm trying to hammer out now.

To be honest, I think I could get the editing program to tell you two things, separately:
- "text 3 has the two lines $foo / $bar where all the others have $baz / $quux";
- "text 3 has the two lines $random / $other where all the others have $foo / $bar".

It won't automatically make the connection between "$foo / $bar" in the two different spots of the text; for now, I'm leaving that to the human editor.

I *can* envision a feature wherein the user defines a minimum word length for a "substantial" variant, and then for each "substantial" variant the editor program will look for similar lines elsewhere in the text and point them out. (The minimum length setting would be to prevent noise; you don't need the computer showing you where every instance of the word "and" is in a text, for example.) It would still be the user's (that is, the human editor's) job to note a definite correlation, in either the apparatus or the footnotes. Is that the sort of thing you'd be looking for?

Re:Handling large variants

cyocum on 2008-08-28T12:27:03

I have been concentrating on word-level variants, it's true, because it's easiest for the computer to find meaningful differences when you break the texts down to their smallest meaningful constituent parts. For most Western languages, that's a word.

Indeed. I would caution, however, that Celtic languages have initial mutation, such that grammatical meaning is encoded in the lenition or nasalization of the following word.

I *can* envision a feature wherein the user defines a minimum word length for a "substantial" variant, and then for each "substantial" variant the editor program will look for similar lines elsewhere in the text and point them out. (The minimum length setting would be to prevent noise; you don't need the computer showing you where every instance of the word "and" is in a text, for example.) It would still be the user's (that is, the human editor's) job to note a definite correlation, in either the apparatus or the footnotes. Is that the sort of thing you'd be looking for?

Well, for Old and Middle Irish, most of the variations in spelling are in the "Dictionary of the Irish Language based mainly on Old and Middle Irish Sources". The problem is that in Irish you have the d-for-t spelling change, among other changes, as Old and Middle Irish mingle on the page, so word length may not help in this case. It would be easier to define the orthographic differences that may occur in variant spellings. So a tree of variant spellings based on known change patterns would be of greater utility. I could be wrong and misunderstanding you, though.

Re:Handling large variants

aurum on 2008-08-28T12:52:55

Indeed. I would caution, however, that Celtic languages have initial mutation, such that grammatical meaning is encoded in the lenition or nasalization of the following word.

Yes, I should have been clearer: the word is the smallest unit of meaningful difference that the computer can easily detect. Armenian also carries grammatical meaning in suffixes, and in a few prefixes, but for now a human still has to review those.

Well, for Old and Middle Irish, most of the variations in spelling are in the "Dictionary of the Irish Language based mainly on Old and Middle Irish Sources". The problem is that in Irish you have the d-for-t spelling change, among other changes, as Old and Middle Irish mingle on the page, so word length may not help in this case. It would be easier to define the orthographic differences that may occur in variant spellings. So a tree of variant spellings based on known change patterns would be of greater utility. I could be wrong and misunderstanding you, though.

Again I was unclear; sorry about that. I was addressing large variants (e.g. your example of transplanted lines), so by "word length" I meant "number of words in variant" rather than "number of characters in word." So you might want to know if a contiguous set of, say, five words appears elsewhere in the text.
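
A rough sketch of what I have in mind, with exact spelling matches standing in for the MCE's fuzzier word comparison:

    use strict;
    use warnings;

    # Given a "substantial" variant (at least $min_words words), find
    # where a similar stretch occurs elsewhere in the text. Both the
    # variant and the text are arrayrefs of plain strings here; exact
    # matching stands in for the MCE's fuzzier comparison.
    sub find_similar_stretches {
        my ( $variant, $text, $min_words ) = @_;
        return () if @$variant < $min_words;    # too short: skip as noise

        my $len = @$variant;
        my @hits;
        for my $i ( 0 .. @$text - $len ) {
            my $matched = grep { $variant->[$_] eq $text->[ $i + $_ ] } 0 .. $len - 1;
            push @hits, $i if $matched == $len;
        }
        return @hits;    # starting positions of candidate parallels
    }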

Regarding spelling variations, though, that's more or less what I've done. Have a look at Text::WagnerFischer::Armenian sometime. It's a module for calculating edit distances between Armenian words; one thing I defined there was a set of acceptable orthographic variations. I also defined a few of the single-letter prefixes and suffixes that affect the grammatical meaning of a word. What I haven't found a way to do is handle case endings properly. The Wagner-Fischer algorithm is only good for comparing changes letter by letter, and not really good at all for finding multi-letter suffixes.
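
For instance, assuming the module keeps its parent Text::WagnerFischer's interface, where distance() takes two strings and returns an edit cost:

    use strict;
    use warnings;
    use utf8;
    use Text::WagnerFischer::Armenian qw( distance );

    # A known orthographic variation should come out cheaper than an
    # arbitrary letter substitution, so a threshold on the distance
    # can separate "same word, different spelling" from real variants.
    my $cost = distance( 'աշխարհն', 'յաշխարհն' );
    print "edit distance: $cost\n";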

Please do continue to raise these issues! It is important that I generalize as far as possible in the core modules, and not consider Armenian alone.