Fast, light-weight databases

acme on 2005-01-14T11:25:06

I like general-case solutions, like Plucene and DBM::Deep, but sometimes you just have to get slightly lower-level and do it all yourself. Why, you ask? For speed, of course. I've just redone my recipe website, which has a lot of recipes, is used a lot (mostly by American housewives), and is hosted on a bitty box.

Enter CDB_File, a "Perl extension for access to CDB databases". Now, ignoring the fact that CDB is djbware, it rocks. The files are small, and they are very fast to read. Thus I do lots of processing and indexing in my build scripts, and then suddenly you can search at lightning speed for orange wine and find Orange-wine wafers. Note the related recipes on the right, generated with Search::ContextGraph.
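For anyone who hasn't played with it, the write-then-read cycle looks roughly like this (a minimal sketch with made-up file names and data, not my actual build script):

use CDB_File;

# CDB files are write-once: you build the whole database into a
# temporary file, and finish() atomically renames it into place.
my $cdb = CDB_File->new( 'recipes.cdb', "recipes.cdb.$$" )
    or die "Cannot create CDB: $!";
$cdb->insert( 'orange-wine-wafers' => 'Cream the butter and sugar...' );
$cdb->finish;

# Reading is just a read-only tied hash.
tie my %recipes, 'CDB_File', 'recipes.cdb'
    or die "Cannot tie CDB: $!";
print $recipes{'orange-wine-wafers'};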

Mmmmm, food and Perl. Happy birthday, richardc!


Recipes and American housewives

osfameron on 2005-01-14T13:05:40

In one of the early novels, Michael Dibdin notes Aurelio Zen's dismay at the idea of using a cookery book. Italians cook the food they learnt in their family or regional tradition, and for the Venetian Zen it would be as odd to look in a cookery book for an Umbrian speciality as for a Chinese, or Martian, one. (Contrast this with his American girlfriend, who cooks a baffling variety of recipes from across the world.)

Back to CDB?

Matts on 2005-01-14T13:12:15

I recall a conversation about two years ago in which you decided not to use CDB files for some reason. Why the change of heart?

Re:Back to CDB?

acme on 2005-01-15T11:05:48

The reason at the time was that CDB_File was leaking memory, so I switched to BerkeleyDB. However, I checked out CDB_File again and, lo and behold, it no longer leaks: it does what it says on the tin, and it's nice and fast. After doing some more performance tests, I also found that one of the nice things about CDB over BDB is that my CDB files are 2.5x smaller.

BerkeleyDB

perrin on 2005-01-14T14:24:17

It probably doesn't matter for your recipe site, but BerkeleyDB is typically faster if you use the BTree mode, and faster still if you use method calls instead of the tie interface. There's a benchmark you can play with here. Make sure to look for the update at the bottom of the thread.
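In code, the difference looks something like this (a sketch of the idea, not the benchmark code itself; the file name and data are made up):

use BerkeleyDB;

# Btree mode, driven through method calls rather than a tied hash.
my $db = BerkeleyDB::Btree->new(
    -Filename => 'test_btree.bdb',
    -Flags    => DB_CREATE,
) or die "Cannot open: $BerkeleyDB::Error";

my ( $key, $value ) = ( '42', 'a largish serialised object' );
$db->db_put( $key, $value ) == 0 or die 'db_put failed';

my $fetched;
$db->db_get( $key, $fetched ) == 0 or die 'db_get failed';
# $fetched now holds the value, without tie() magic on every access.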

Re:BerkeleyDB

acme on 2005-01-15T11:30:52

And lo did I do some more benchmarking. Bear in mind that my data has numeric keys and largish objects serialised via Storable. 'deep' is DBM::Deep, 'bdb' is BerkeleyDB::Hash, 'bdb_btree' is BerkeleyDB::Btree with pack "N" as a key filter, and 'cdb' is CDB_File. The first time I run the benchmark:
             Rate      deep       bdb bdb_btree       cdb
deep      0.810/s        --      -49%      -49%      -62%
bdb        1.58/s       96%        --       -1%      -25%
bdb_btree  1.59/s       97%        1%        --      -25%
cdb        2.12/s      162%       34%       33%        --
And when everything is cached in memory, all of them are about the same speed:
              Rate bdb_btree      deep       cdb       bdb
bdb_btree 280082/s        --       -1%       -1%       -1%
deep      282205/s        1%        --       -1%       -1%
cdb       283829/s        1%        1%        --       -0%
bdb       284326/s        2%        1%        0%        --
Another interesting thing to note is the file sizes. The BerkeleyDB files are much larger:
-rw-r--r--    1 acme     acme      7544832 2005-01-15 11:01 test.bdb
-rw-r--r--    1 acme     acme      7532544 2005-01-15 11:01 test_btree.bdb
-rw-r--r--    1 acme     acme      4476905 2005-01-15 11:01 test.deep
-rw-r--r--    1 acme     acme      4421730 2005-01-15 11:01 test.cdb
In summary: because my server isn't very powerful, I reckon CDB is the best way to go. It seems to work well with little memory and produces smaller files too. Any other suggestions?
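For the curious, the benchmark was shaped roughly like this (a reconstruction to show the setup, not the exact script; the data set and sizes are invented):

use Benchmark qw(cmpthese);
use BerkeleyDB;
use Storable qw(freeze thaw);

# Numeric keys, largish objects serialised via Storable.
my %data = map { $_ => freeze( { id => $_, text => 'x' x 1024 } ) } 1 .. 10_000;

# BerkeleyDB::Btree with pack "N" as a key filter, so the numeric
# keys are stored as fixed-width big-endian strings and sort sanely.
my $btree = BerkeleyDB::Btree->new(
    -Filename => 'test_btree.bdb',
    -Flags    => DB_CREATE,
) or die $BerkeleyDB::Error;
$btree->filter_store_key( sub { $_ = pack 'N', $_ } );
$btree->filter_fetch_key( sub { $_ = unpack 'N', $_ } );
$btree->db_put( $_, $data{$_} ) for keys %data;

cmpthese( -10, {
    bdb_btree => sub {
        my $value;
        for my $key ( keys %data ) {
            $btree->db_get( $key, $value );
            my $object = thaw($value);
        }
    },
    # ... matching subs for 'deep', 'bdb' and 'cdb' ...
} );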

Re:BerkeleyDB

perrin on 2005-01-17T18:41:22

I feel a bit silly trying to optimize a recipe search, but... I suspect your use of BerkeleyDB is not optimal. Take a look at this benchmark code for an example of high-speed use of BerkeleyDB.
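The gist of it, from memory (a sketch of the general idea, not that code; the numbers are arbitrary): give the database a decent cache up front and stay with the method-call interface:

use BerkeleyDB;

my $db = BerkeleyDB::Btree->new(
    -Filename  => 'test_btree.bdb',
    -Flags     => DB_CREATE,
    -Cachesize => 64 * 1024 * 1024,   # 64MB; the default cache is tiny
) or die $BerkeleyDB::Error;

# Reuse one scalar across fetches instead of paying tie() overhead.
my $key = 42;
my $value;
$db->db_get( $key, $value ) == 0
    or warn "key not found";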