Text collation engine: design overview

aurum on 2008-08-21T13:49:45

Here I will describe the basic design outline, as it currently stands, for my manuscript text collation engine, a.k.a. the "Armenian Difference Engine." (Later I will ask you to argue with each other about a module name, but not now. Call it the MCE for now.) I welcome, indeed I solicit, feedback and opinions as to how I might do things more cleverly.

This is being posted without external proofreading, so if something isn't clear, please ask!

So what's the problem?

The editor of a historical text begins with some number of manuscript copies of that text. The goal of the exercise is to compare these manuscript texts against one another, note the variations among them, and choose the best "reading" from the alternatives wherever a variation occurs. Until recently, this was done manually; what I am building is a program that will calculate the variations and consult me only when I need to exert editorial control, that is, when I need to choose the best reading.

OK, sure. So how does your program work then?

Each manuscript text is passed in as a filehandle (or as a string) containing a bunch of words. For my purposes, the text I am passing in preserves the punctuation and orthographic variation of the manuscript, but I have expanded the abbreviations. (Eventually I will accept TEI XML and parse that into a string of text; doing so will probably make writing this program easier in the same measure that it makes transcription, and conversion to XML, more difficult.)

The first step is to transform each "text" into words. Words are runs of non-whitespace characters separated by whitespace. (Yes, that means that, right now, I only support scripts that demarcate words with whitespace.) Each word is represented as an MCE::Word object, which I will describe in my next post. Now I have several arrays of Words, where each array represents a manuscript text. In theory, I could create an MCE::Text object with information about the manuscript itself and a chain of words all linked together, but I haven't yet concluded that a simple array is fragile enough to justify the overhead of another OO package. I may later change my mind.
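
In code, the tokenization step is roughly this (the MCE::Word constructor and its arguments are placeholders; the real object is the subject of my next post):

    use strict;
    use warnings;

    sub tokenize_text {
        my ($text) = @_;
        # Words are runs of non-whitespace characters separated by whitespace.
        my @words = map { MCE::Word->new( string => $_ ) }
                    split ' ', $text;
        return \@words;    # one such array per manuscript text
    }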

Now I have two or more arrays, probably all of slightly different lengths. I pull out the string representations of each Word from the first two arrays and pass them to Algorithm::Diff. Diff can return three answers for any given chunk of text (the dispatch on these three cases is sketched in code after the list):

  • The chunks are the same. Pass them through, and link them as being the same word.
  • One of the chunks is zero-length (addition or deletion). Pass the non-zero chunk through, and pad the text which contains the zero-length chunk with empty Words. (Actually the same empty Word, to save space.)
  • The chunks are not the same. Call &align_and_match_words on the two chunks.
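
In rough code, using the object interface to Algorithm::Diff (here link_same and pad_with_nul are shorthand for the first two behaviors, not settled names, and ->string is a stand-in accessor for a Word's string representation):

    use Algorithm::Diff;

    # @text_a and @text_b are the arrays of Word objects for two texts.
    my @strings_a = map { $_->string } @text_a;
    my @strings_b = map { $_->string } @text_b;

    my $diff = Algorithm::Diff->new( \@strings_a, \@strings_b );
    while ( $diff->Next ) {
        # Recover the Word objects for this chunk from the string indices.
        my @chunk_a = @text_a[ $diff->Range(1) ];
        my @chunk_b = @text_b[ $diff->Range(2) ];
        if ( $diff->Same ) {
            link_same( \@chunk_a, \@chunk_b );              # same words
        }
        elsif ( !@chunk_a or !@chunk_b ) {
            pad_with_nul( \@chunk_a, \@chunk_b );           # addition/deletion
        }
        else {
            align_and_match_words( \@chunk_a, \@chunk_b );  # changed chunk
        }
    }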

The &align_and_match_words subroutine takes two (generally relatively short) arrays of Word objects, which may be of varying lengths. It compares each word in one array to each word in the second array to find the "best" match. If, for example, you send representations of these two strings to the subroutine:

This is a short sentence.
This is an even longer sentence with more words.

your result is:

0    1  2  3     4      5        6    7    8
This is a  short NUL    sentence NUL  NUL  NUL
This is an even  longer sentence with more words.

(Note that this is an illustration only; in practice, these two strings would not be sent in toto to the subroutine, because Algorithm::Diff would only report a "changed" chunk for the substrings "a short" and "an even longer.")

The subroutine will detect a fuzzy match between "a" and "an" in column 2, and add the Word object "an" to the list of "fuzzymatch"es attached to the Word object "a". It will find no similarity between the words "short" and "even" in column 3, so it will add the Word object for "even" to the list of variants attached to the Word object "short". It will pad the remaining empty spaces with an empty Word; the empty Word is never linked to anything. All "fuzzymatch" and "variant" linkage relations work from top to bottom; that is, given two texts, the first text always contains the links.
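
The fuzzy-match test itself I will leave aside for now. For illustration only, assume a simple edit-distance threshold (via Text::Levenshtein), with add_fuzzymatch and add_variant standing in for the link lists just described:

    use Text::Levenshtein qw(distance);

    sub link_words {
        my ( $top, $bottom ) = @_;    # two Word objects in the same column
        my ( $s1, $s2 ) = ( $top->string, $bottom->string );
        if ( distance( $s1, $s2 ) <= length($s1) / 3 ) {
            # Similar enough: the topmost word records a fuzzy match.
            $top->add_fuzzymatch($bottom);
        }
        else {
            # No similarity: the topmost word records a variant.
            $top->add_variant($bottom);
        }
    }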

The top-to-bottom association of links becomes important when more than two texts are compared. To begin the next comparison, I call the &generate_base subroutine on the two texts whose comparison has just finished. This subroutine is fairly simple; it looks for the topmost non-NUL word in each column of the arrays it is passed. In our example above, the new base text generated would be

This is a short longer sentence with more words.

Semantically useless, but a good way to generate pegs on which to hang words. The word "a" has a fuzzymatch link to "an", and the word "short" has a variant link to "even". All the identical words that occur in the same column are also linked. This newly generated "base" becomes the topmost text in the next comparison.
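
A sketch of &generate_base itself, with is_nul as a stand-in test for the shared empty Word:

    sub generate_base {
        my @texts = @_;    # two or more aligned arrays of Word objects
        my @base;
        for my $col ( 0 .. $#{ $texts[0] } ) {
            # Take the topmost non-NUL word in this column.
            for my $text (@texts) {
                my $word = $text->[$col];
                next if $word->is_nul;
                push @base, $word;
                last;
            }
        }
        return \@base;
    }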

At the end of the run, then, we have an array of Words to represent each of the manuscript texts we passed in. The arrays are padded with empty (NUL) words where necessary, so that all the arrays are the same length and all their same or similar words are aligned in the same row. If the user calls &generate_base on the complete set of result arrays, he/she will have an array of non-NUL words, each of which contains the appropriate links to the non-NUL words in the manuscripts that come after it. And then the real work can start.
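
In terms of the sketches above, that final call would be no more than this (@all_results is a made-up name for the full set of result arrays):

    # All result arrays are the same length and column-aligned by now.
    my $base = generate_base( @all_results );
    # Each word in @$base carries its fuzzymatch and variant links.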

In the next few posts, I will say more about the concept of a "variant", talk about the structure of Word objects, and discuss the as-yet unsolved problem of punctuation.


order-dependence?

mr_bean on 2008-08-22T09:41:48

It seems like the data structure will become dependent on which sentence you start with first, and that you won't be able to find variants of words which come under the control of earlier-processed counterparts.

By the way, where is the NLP community in Perl? There seems to be some action around Ted Pedersen's WordNet similarity modules, and Dragomir Radev has text-relationship analyzer work in Clair.

But there doesn't seem to be any other ongoing NLP development in Perl.

Re: order-dependence?

aurum on 2008-08-22T12:06:57

It may have some dependence on which text I start with; I may need to look at this in more detail. My assumption is that for a given word row (discounting NULs for the moment), the words will look like this (picking an arbitrary set of words):

Precedense precedence order ordering preceedence

So the entire structure is hanging on the first word, "Precedense". I should link both "precedence" and "preceedence" to that as a fuzzy match. When I find "order", I will link it to "Precedense" as a variant.

...and, yes. The next step is one I left out of my overview. When I find "ordering", before adding it as a variant in its own right, I compare it to each of the existing variants. So I have a structure that looks like:

Precedense:
   FUZZY_MATCH precedence preceedence
   VARIANT order

order:
   FUZZY_MATCH ordering

I think that retains all the information I need, yes?
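
In sketch form, with fuzzy_match standing in for whatever similarity test the engine ends up using, and with hypothetical method names as before:

    sub place_reading {
        my ( $base, $new ) = @_;    # the column's topmost word; a new reading
        if ( fuzzy_match( $base, $new ) ) {
            $base->add_fuzzymatch($new);        # "precedence" under "Precedense"
            return;
        }
        # Check the existing variants before adding a new one.
        for my $variant ( $base->variants ) {
            if ( fuzzy_match( $variant, $new ) ) {
                $variant->add_fuzzymatch($new); # "ordering" under "order"
                return;
            }
        }
        $base->add_variant($new);               # "order" under "Precedense"
    }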

Re NLP: I did a double-take before I realized that you meant this and not this. :) I am not aware of a community around it, no.