more on word relationships

aurum on 2008-08-23T23:33:23

It was brought to my attention in a comment on my last post that I didn't do a very good job of describing the relationships I create between words. I'll try to fix this here.

It's really difficult to construct good examples in English, incidentally; we don't have a lot of prefixes or suffixes or case endings, so pretend for the moment that the samples I give in this post are all grammatically valid. (Don't make me break out the lolcat.) That said, given an example set of texts:

Tara has a lot of books about languages.
Tara had alot book to do with languages.
Tera got a lot of book too do with languages.

the collator would line them up thus, as I described previously:

   0    1   2 3    4  5     6     7  8    9
A) Tara has a lot  of books about         languages.
B) Tara had   alot    book  to    do with languages.
C) Tera got a lot  of book  too   do with languages.

The base text generated from this would then be:

Tara has a lot of books about do with languages.
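
As a rough sketch of that step in Python (an illustration, not the collator's actual code): assuming the aligned texts are stored as equal-length rows with None marking an empty column, the base text takes the first word found in each column, reading from the top. The names `rows` and `base_text` are made up for this example.

```python
# Aligned rows from the example above; None marks a gap in that column.
rows = [
    ["Tara", "has", "a",  "lot",  "of", "books", "about", None, None,   "languages."],
    ["Tara", "had", None, "alot", None, "book",  "to",    "do", "with", "languages."],
    ["Tera", "got", "a",  "lot",  "of", "book",  "too",   "do", "with", "languages."],
]

def base_text(rows):
    """Pick, for each column, the first non-empty word reading top to bottom."""
    base = []
    for column in zip(*rows):
        word = next((w for w in column if w is not None), None)
        if word is not None:
            base.append(word)
    return base

print(" ".join(base_text(rows)))
# prints: Tara has a lot of books about do with languages.
```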

Since each word in the base text comes from the topmost text that has a word in that column, it is this word that contains the linkage information for all the other words in the column. So for this base text we would have:

Tara
 -> FUZZYMATCH: Tera
has
 -> FUZZYMATCH: had
 -> VARIANT: got
a
lot
 -> FUZZYMATCH: alot
of
books
 -> FUZZYMATCH: book
about
 -> VARIANT: to
do
with
languages
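
One way to picture this structure in code, as a minimal sketch: each base word carries two lists, one per relationship type. The `Word` class, its field names, and the `render` method are hypothetical, invented here for illustration, and are not the collator's real interface.

```python
from dataclasses import dataclass, field

@dataclass
class Word:
    text: str
    fuzzymatches: list = field(default_factory=list)  # near-identical spellings
    variants: list = field(default_factory=list)      # genuinely different words

    def render(self):
        """Format the word and its links in the listing style used above."""
        lines = [self.text]
        lines += [f" -> FUZZYMATCH: {w.text}" for w in self.fuzzymatches]
        lines += [f" -> VARIANT: {w.text}" for w in self.variants]
        return "\n".join(lines)

# Column 1 of the example: "has" is the base word.
has = Word("has", fuzzymatches=[Word("had")], variants=[Word("got")])
print(has.render())
```

Because linked words are themselves `Word` objects, a variant can carry its own fuzzy matches.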

This does not, however, list every unique word that appears in each column of the texts above. For that, I also need to record the relationship between "to" and "too" in column 6. When the collator finds "too" and fails to match it with "about", it will look through the list of variants attached to "about", find "to", and add "too" as a FUZZYMATCH for it. So the relevant snippet of the data structure becomes:

about
 -> VARIANT: to
(to
 -> FUZZYMATCH: too)
do
...
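
The lookup just described can be sketched as follows. This is an illustration under assumptions, not the collator's code: the `Word` class and the `link` and `is_fuzzy` functions are hypothetical, and Python's `difflib.SequenceMatcher` with a 0.6 threshold is only a stand-in, since nothing above says how a fuzzy match is actually decided.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher

@dataclass
class Word:
    text: str
    fuzzymatches: list = field(default_factory=list)
    variants: list = field(default_factory=list)

def is_fuzzy(a, b, threshold=0.6):
    # Stand-in similarity test; the real collator's criterion may differ.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def add_once(words, text):
    # Avoid recording the same spelling twice.
    if all(w.text != text for w in words):
        words.append(Word(text))

def link(base, new_text):
    """Attach new_text to the base word, or to one of its variants."""
    if new_text == base.text:
        return
    if is_fuzzy(base.text, new_text):
        add_once(base.fuzzymatches, new_text)
        return
    # No match with the base word: try each variant already recorded.
    for variant in base.variants:
        if new_text == variant.text:
            return
        if is_fuzzy(variant.text, new_text):
            add_once(variant.fuzzymatches, new_text)
            return
    # Nothing similar anywhere: record a new variant of the base word.
    base.variants.append(Word(new_text))

about = Word("about")
link(about, "to")    # text B, column 6: becomes a VARIANT of "about"
link(about, "too")   # text C, column 6: becomes a FUZZYMATCH of "to"
```

After these two calls, `about.variants` holds a single entry for "to", whose own `fuzzymatches` list holds "too", mirroring the snippet above.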

I appear to have been waylaid by a cat, and anyway I've taken up a lot of screen space by drawing out data structures, so I'll continue tomorrow.