Extracting books from recipes

acme on 2005-01-20T10:48:23

I did do a little research on large-scale spelling, but all the equations scared me. I'll look at it again in a bit. Instead, I spent a day trying to construct metadata about partially-structured data. OK, what I really did was try to extract book information from the recipes. This way I can link to Amazon using the Amazon Associates program, get 5% in referral fees and be a millionaire by the time I'm 21. *cough*

Now it really was quite hard. Some recipes contained ISBNs. These were fairly easy to extract, and then I could use Amazon Web Services to look up the book title and an image. However, the vast majority of book references were free-form, which is slightly harder. I ended up taking a random sample of a couple of hundred recipes and building a test suite of the correct book references. After a full day of building heuristics, my code came up with mostly the correct results. I could then plug the title into AWS and get a related ISBN.
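For the curious, here is a rough sketch of what the "easy" ISBN step might look like. It is in Python rather than Perl and is not the code used for the recipes; the regular expression, the function names `is_valid_isbn10` and `extract_isbns`, and the sample text are all made up for illustration. It assumes 10-digit ISBNs only: it finds anything that looks like one, strips hyphens and spaces, and keeps it only if the standard ISBN-10 check digit works out.

```python
import re

# Candidate ISBN-10: nine digits plus a final digit or X,
# optionally separated by hyphens or spaces (hypothetical pattern).
ISBN10_CANDIDATE = re.compile(r"\b(?:\d[- ]?){9}[\dXx]\b")


def is_valid_isbn10(isbn):
    """Return True if a bare 10-character string passes the ISBN-10 checksum."""
    if len(isbn) != 10:
        return False
    total = 0
    for position, char in enumerate(isbn):
        if char in "Xx" and position == 9:
            value = 10  # X is only allowed as the check digit
        elif char.isdigit():
            value = int(char)
        else:
            return False
        # Weights run from 10 (first character) down to 1 (check digit).
        total += (10 - position) * value
    return total % 11 == 0


def extract_isbns(text):
    """Return the distinct valid ISBN-10s found in text, in order of appearance."""
    found = []
    for match in ISBN10_CANDIDATE.finditer(text):
        isbn = re.sub(r"[- ]", "", match.group()).upper()
        if is_valid_isbn10(isbn) and isbn not in found:
            found.append(isbn)
    return found


if __name__ == "__main__":
    sample = "Adapted from a book (ISBN 0-596-00027-8). Serves 4."
    print(extract_isbns(sample))
```

The checksum matters because recipes are full of other digit runs (phone numbers, quantities, dates), and most of those fail the mod-11 test. Free-form references like "from my gran's old cookbook" are, of course, a different problem entirely.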

I really do like the Amazon Web Services. I hadn't played with them until now, and they really do expose an awful lot of Amazon's database. Also, notice in the previous links that the result isn't always the exact book, but sometimes a similar one in the same category. OK, so sometimes it goes a little freaky but mostly it works out. It was tough, but I'd say it was a day well spent. And I don't think it's too evil to mention related books.

Whoops, you probably want some Perl content. I used Business::ISBN to validate the ISBNs and Net::Amazon to do lots of AWS queries. Lingua::EN::NamedEntity does a similar thing to my heuristics, but mine are more closely tied to my data. Note that the links go to kobesearch, as search.cpan.org has been almost down for the last week. Time for dimsum!