Storing word-grams

ambs on 2008-06-12T16:57:03

I am in the process of storing word-grams for big texts (read big = more than 3 GB of text files). I want 2-word, 3-word and 4-word tuples, together with their occurrence counts.

When processing these texts (on a cluster) I do not have access to any RDBMS. Well, I have SQLite, Berkeley DB, GDBM and probably other similar tools that I am forgetting about.

As you might guess, the main problem is populating the database. For each word in the corpus I need to check whether it (together with its neighbourhood) already exists in the database. If it does, I increment the counter. If not, I add a new entry.
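For reference, a minimal sketch of that check-and-increment loop with DBI and SQLite (the ngrams table, its schema and the space-joined key are just placeholders; the per-gram INSERT/UPDATE round trips are exactly the part that gets slow):

use strict;
use warnings;
use DBI;

# Hypothetical schema: one row per n-gram with its count.
my $dbh = DBI->connect("dbi:SQLite:dbname=ngrams.db", "", "",
                       { RaiseError => 1, AutoCommit => 0 });
$dbh->do("CREATE TABLE IF NOT EXISTS ngrams (gram TEXT PRIMARY KEY, n INTEGER NOT NULL)");

my $ins = $dbh->prepare("INSERT OR IGNORE INTO ngrams (gram, n) VALUES (?, 0)");
my $upd = $dbh->prepare("UPDATE ngrams SET n = n + 1 WHERE gram = ?");

sub count_gram {
    my ($gram) = @_;       # e.g. "the quick brown"
    $ins->execute($gram);  # create the row if it is not there yet
    $upd->execute($gram);  # then bump its counter
}

# ... loop over the corpus calling count_gram(), committing in batches ...
$dbh->commit;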

Given that I am working on a cluster I can easily split the job into chunks, so that each node processes a different part of the text. At the end I just need to merge the resulting databases.

In my experience SQLite seems to be the fastest tool for this task, but I may be wrong.

So, what would you use for that?

(I know that PerlMonks might be better for questions like this, but I just think that site is completely unusable :( )


Here are some possibilities for you

btilly on 2008-06-13T01:20:54

How big do you expect the final corpus to be? If it will fit in memory, it is honestly fastest to just build an internal hash and dump it. If it won't fit in memory with Perl using it, then an in-memory database (SQLite can do this) is your best bet.
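As a rough sketch of the in-memory route, assuming the corpus arrives already tokenised one word per line on STDIN (the window size and tab-separated output are arbitrary choices here):

use strict;
use warnings;

my $n = 3;                    # n-gram size; run once per size you need
my (@window, %count);

while (my $word = <STDIN>) {  # one token per line
    chomp $word;
    push @window, $word;
    shift @window if @window > $n;
    $count{ join " ", @window }++ if @window == $n;
}

# dump the table: "word1 word2 word3<TAB>count"
for my $gram (keys %count) {
    print "$gram\t$count{$gram}\n";
}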

If it is much bigger than will fit in memory, though, then you should go for a radically different approach: a mergesort on disk. Odds are you won't even need to write this yourself - create a dataset where each line is one tuple, run the Unix utility sort on it, then postprocess the sorted dataset. See how fast it is. Odds are that with your dataset this approach will be fast enough on a single PC for the work you are doing.
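For instance, with one n-gram per line in a file (call it grams.txt - the name is arbitrary), the postprocessing pass only has to collapse adjacent identical lines, something like:

use strict;
use warnings;

# Usage (hypothetical file names):
#   sort grams.txt | perl collapse.pl > counts.txt
# Equal lines arrive adjacent after sorting, so one streaming pass
# is enough to produce the counts.

my ($prev, $count);
while (my $gram = <STDIN>) {
    chomp $gram;
    if (defined $prev && $gram eq $prev) {
        $count++;
    } else {
        print "$prev\t$count\n" if defined $prev;
        ($prev, $count) = ($gram, 1);
    }
}
print "$prev\t$count\n" if defined $prev;

(On most systems "sort grams.txt | uniq -c" does the same job without any Perl at all.)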

If you need to write it yourself, you can beat sort, but only by cheating and making heavy use of the fact that whenever two entries match you can merge them and add their counts. This will reduce your dataset size considerably. I've done it, and even been smart about when I had to write to disk and when I didn't. I wouldn't recommend going that way unless you have to.

If you need it to wind up in a database which supports quick lookups, then you can create a btree with BerkeleyDB and put it in there after you have the sorted dataset. The rule here is that putting sorted data into a btree is fast. Putting unsorted data in is slow, and a hash backed by disk is slow.
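A sketch of that bulk load with the BerkeleyDB module, assuming the sorted, tab-separated counts from the previous step arrive on STDIN (the file name and record format are placeholders):

use strict;
use warnings;
use BerkeleyDB;

# Keys arrive in sorted order, which is what makes the btree load fast.
my $db = BerkeleyDB::Btree->new(
    -Filename => "ngrams.bdb",
    -Flags    => DB_CREATE,
) or die "cannot open ngrams.bdb: $BerkeleyDB::Error";

while (my $line = <STDIN>) {
    chomp $line;
    my ($gram, $count) = split /\t/, $line;   # "w1 w2 w3<TAB>count"
    $db->db_put($gram, $count);
}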

If you want to dig deeper and understand why, consider a hypothetical drive that spins at 5000 RPM and supports a sustained throughput of, say, 20 MB per second. For a random seek the drive, on average, has to spin half-way round. It is spinning 5000 times per minute, so it will do about 10,000 seeks per minute. With a naive algorithm you seek once to fetch each record and once more to update or insert it, so that's about 5000 records per minute that you're processing.

Suppose your records (which in your case are 2-4 words long) average, say, 40 bytes. The disk can pass over 20 MB/s * 60 s = 1200 MB of data in a minute. That's 30 million records. Admittedly merge sorts need multiple passes to finish sorting. To make the numbers work out, let's say that multiple in this case is 30 (that's 15 reading passes and 15 writing passes, which with datasets of this size probably means a poorly implemented merge sort). Even so, the merge sort handles a million records a minute versus 5000.
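The back-of-envelope arithmetic, spelled out (the drive numbers are the hypothetical ones above):

use strict;
use warnings;

my $rpm       = 5000;                 # rotations per minute
my $seeks     = $rpm * 2;             # half a rotation per random seek => 10,000/min
my $naive     = $seeks / 2;           # fetch + write back = 2 seeks per record
my $bytes_min = 20 * 1e6 * 60;        # 20 MB/s sustained, per minute
my $recs_min  = $bytes_min / 40;      # 40-byte records => 30 million scanned/min
my $passes    = 30;                   # 15 read + 15 write passes
my $merge     = $recs_min / $passes;  # ~1 million records/min

printf "naive random access: %d records/minute\n", $naive;
printf "on-disk merge sort:  %d records/minute\n", $merge;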

The moral? Once data lives on disk, you do not want to randomly seek to it. You really, really, really do not want to do that.

another approach

salva on 2008-06-13T09:38:58

1) create a dictionary for the words in the file, assigning an integer to every distinct word

2) map the text file into an array (@words) where every word is replaced by its index.

3) create another array (@offset) containing 0..$#words

4) sort @offset as follows:
@offset = sort {
    for (my $i = 0; ; $i++) {
        return -1 if $a + $i >= @words;
        return  1 if $b + $i >= @words;
        return ($words[$a + $i] <=> $words[$b + $i] or next);
    }
} @offset;

5) now the offsets into @words contained in @offset are sorted so that similar sequences appear consecutively. You only have to scan @offset and count the consecutive repetitions (a runnable sketch of the whole procedure is given after the example below).

For instance:

1) ...

2) @words = (a b b c d a b c)

3) @offset = (0 1 2 3 4 5 6 7)

4) @offset = (
0 # => a b b c d a b c
5 # => a b c
1 # => b b c d a b c
6 # => b c
2 # => b c d a b c
7 # => c
3 # => c d a b c
4 # => d a b c
)

5)
a b b c d a b c
a b c
=> a b = 2
b b c d a b c
=> b b = 1
b c
b c d a b c
=> b c = 2
c
=> c = 1
c d a b c
=> c d = 1
d a b c
=> d a = 1
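Putting the whole recipe together, a sketch might look like this - assuming whitespace-tokenised text on STDIN; the sort block is the one from step 4, and the $n parameter and tab-separated output are just illustrative:

use strict;
use warnings;

my $n = 2;    # n-gram size to count

# 1) dictionary: one integer id per distinct word
# 2) @words: the text as a sequence of those ids
my (%dict, @id2word, @words);
while (my $line = <STDIN>) {
    for my $w (split ' ', $line) {
        unless (exists $dict{$w}) {
            $dict{$w} = @id2word;    # next free id
            push @id2word, $w;       # keep the word around for output
        }
        push @words, $dict{$w};
    }
}

# 3) and 4) offsets into @words, sorted by the suffix starting there
my @offset = sort {
    for (my $i = 0; ; $i++) {
        return -1 if $a + $i >= @words;
        return  1 if $b + $i >= @words;
        return ($words[$a + $i] <=> $words[$b + $i] or next);
    }
} 0 .. $#words;

# 5) scan @offset; equal n-grams are now adjacent
my ($prev, $count);
for my $o (@offset) {
    next if $o + $n > @words;        # suffix too short for an n-gram
    my $gram = join " ", map { $id2word[$_] } @words[$o .. $o + $n - 1];
    if (defined $prev && $gram eq $prev) {
        $count++;
    } else {
        print "$prev\t$count\n" if defined $prev;
        ($prev, $count) = ($gram, 1);
    }
}
print "$prev\t$count\n" if defined $prev;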