It was brought to my attention in a comment on my last post that I didn't do a very good job describing the relationships between words that I create. I'll try to fix this here.
It's really difficult to construct good examples in English, incidentally; we don't have a lot of prefixes or suffixes or case endings, so pretend for the moment that the samples I give in this post are all grammatically valid. (Don't make me break out the lolcat.) That said, given an example set of texts:
    Tara has a lot of books about languages.
    Tara had alot book to do with languages.
    Tera got a lot of book too do with languages.
the collator would line them up thus, as I described previously:
       0     1    2     3    4   5      6      7   8     9
    A) Tara  has  a     lot  of  books  about            languages.
    B) Tara  had  alot           book   to     do  with  languages.
    C) Tera  got  a     lot  of  book   too    do  with  languages.

The base text generated from this would then be:
    Tara has a lot of books about do with languages.

Since each word in the base text is the topmost word in its column, it is this word that holds the linkage information for all the other words in that column. So for this base text we would have:
    Tara      -> FUZZYMATCH: Tera
    has       -> FUZZYMATCH: had
              -> VARIANT: got
    a lot     -> FUZZYMATCH: alot
    of
    books     -> FUZZYMATCH: book
    about     -> VARIANT: to
    do
    with
    languages
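To make the "topmost word wins" rule concrete, here is a rough Python sketch. It is not the collator's actual code. It assumes the alignment has already been done and is stored as equal-length rows with None marking a gap; the names base_text and other_readings are made up for illustration, and putting "alot" in column 2 is my guess at how the alignment falls.

```python
# Aligned texts, one row per text, one cell per column (None = gap).
# Assumption: "alot" sits in column 2 and columns 3-4 are gaps for text B.
rows = [
    ["Tara", "has", "a", "lot", "of", "books", "about", None, None, "languages."],
    ["Tara", "had", "alot", None, None, "book", "to", "do", "with", "languages."],
    ["Tera", "got", "a", "lot", "of", "book", "too", "do", "with", "languages."],
]


def base_text(rows):
    """Return the topmost non-empty word of every column."""
    base = []
    for column in zip(*rows):
        words = [w for w in column if w is not None]
        if words:
            base.append(words[0])
    return base


def other_readings(rows):
    """Map a column index to the distinct words that differ from its base word."""
    result = {}
    for index, column in enumerate(zip(*rows)):
        words = [w for w in column if w is not None]
        distinct = []
        for word in words[1:]:
            if word != words[0] and word not in distinct:
                distinct.append(word)
        if distinct:
            result[index] = distinct
    return result


print(" ".join(base_text(rows)))
# Tara has a lot of books about do with languages.
print(other_readings(rows)[6])
# ['to', 'too']
```

Note that this only collects the words that still need linking to each base word. Deciding whether each one is a FUZZYMATCH or a VARIANT is a separate step.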
This does not, however, list every unique word that appears in every column of the texts above. For that, I also need to record the relationship between "to" and "too" in column 6. When the collator finds "too" and fails to match it with "about", it will look through the list of variants attached to "about", find "to", and add "too" as a FUZZYMATCH for it. So the relevant snippet of data structure becomes:
    about     -> VARIANT: to
                 (to -> FUZZYMATCH: too)
    do
    ...
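The lookup just described (try the base word, then each of its variants, and only then record a new variant) might look something like the following in Python. This is a sketch under assumptions: the Word class, add_reading, and render are hypothetical names, and the fuzzy test is a stand-in built on difflib with an arbitrary 0.6 threshold, since I haven't said here how the real collator decides that two words are "close enough".

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Word:
    text: str
    fuzzy: list = field(default_factory=list)     # near-identical spellings
    variants: list = field(default_factory=list)  # genuinely different words


def is_fuzzy(a, b, threshold=0.6):
    # Stand-in similarity test (an assumption, not the collator's real rule).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def add_reading(base, text):
    """Link a word found in the same column to the base word's structure."""
    candidates = [base] + base.variants
    # Nothing to do if this exact word is already recorded somewhere.
    for word in candidates:
        if text == word.text or any(text == f.text for f in word.fuzzy):
            return
    # Try the base word first, then each variant already attached to it.
    for word in candidates:
        if is_fuzzy(word.text, text):
            word.fuzzy.append(Word(text))
            return
    # No match anywhere: record it as a new variant of the base word.
    base.variants.append(Word(text))


def render(base):
    """Format one base word in the arrow notation used in this post."""
    parts = [base.text]
    parts += [f"-> FUZZYMATCH: {f.text}" for f in base.fuzzy]
    for v in base.variants:
        nested = "".join(f" ({v.text} -> FUZZYMATCH: {f.text})" for f in v.fuzzy)
        parts.append(f"-> VARIANT: {v.text}{nested}")
    return " ".join(parts)


about = Word("about")
for reading in ["to", "too"]:  # column 6 of texts B and C
    add_reading(about, reading)
print(render(about))
# about -> VARIANT: to (to -> FUZZYMATCH: too)
```

The order of the checks matters: "too" only gets compared against "to" after it has failed to match "about", which is why it ends up hanging off the variant rather than the base word.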
I appear to have been waylaid by a cat, and anyway I've taken up a lot of screen space by drawing out data structures, so I'll continue tomorrow.