Gutenberg API

Ovid on 2008-11-12T10:39:02

As far as I can tell from reading the archives and checking their Web site, Project Gutenberg does not appear to have an API. The closest I've found is an RSS feed and an RDF document. These don't really constitute an API, but the latter can be parsed for adding to an SQLite database. Still trying to figure this out, though. Trying to grab one version of their catalog in RDF format:

gutenberg $ tar -xjf catalog.rdf.bz2
tar: This does not look like a tar archive
tar: Skipping to next header
tar: Archive contains obsolescent base-64 headers
tar: Error exit delayed from previous errors

I was able to unzip their .zip version of the same file, but I was disappointed to learn that their Perl examples are rather old and no longer appear to properly parse the data.
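For illustration, here is a minimal sketch of the RDF-to-SQLite step, written in Python rather than Perl. Everything specific in it is an assumption: the sample document is made up, and the element names (pgterms:etext, dc:title, pgterms:friendlytitle), the pgterms namespace URI, and the table layout are guesses that would need checking against the real catalog.rdf.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Assumed namespaces; verify them against the real catalog file.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "pgterms": "http://www.gutenberg.org/rdfterms/",
}

# Tiny made-up sample standing in for catalog.rdf.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:pgterms="http://www.gutenberg.org/rdfterms/">
  <pgterms:etext rdf:ID="etext1">
    <dc:title>Example Title One</dc:title>
    <pgterms:friendlytitle>Example Title One by A. Author</pgterms:friendlytitle>
  </pgterms:etext>
  <pgterms:etext rdf:ID="etext2">
    <dc:title>Example Title Two</dc:title>
    <pgterms:friendlytitle>Example Title Two by B. Writer</pgterms:friendlytitle>
  </pgterms:etext>
</rdf:RDF>"""


def load_catalog(rdf_text, conn):
    """Parse catalog entries and upsert them into a hypothetical 'etext' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS etext "
        "(id TEXT PRIMARY KEY, title TEXT, friendlytitle TEXT)"
    )
    root = ET.fromstring(rdf_text)
    rdf_id = "{%s}ID" % NS["rdf"]  # attribute names use Clark notation
    rows = [
        (
            etext.get(rdf_id),
            etext.findtext("dc:title", default="", namespaces=NS),
            etext.findtext("pgterms:friendlytitle", default="", namespaces=NS),
        )
        for etext in root.findall("pgterms:etext", NS)
    ]
    # INSERT OR REPLACE lets the same function be re-run on a newer catalog.
    conn.executemany("INSERT OR REPLACE INTO etext VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
count = load_catalog(SAMPLE, conn)
print(count, "entries loaded")
```

The real catalog is much larger than this sample, so a streaming parser (ET.iterparse) would probably be a better fit than reading the whole file into memory.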

But why would you care? Because I think I want to make this happen:

gutenberg --read "Art of War"

You know, sometimes I worry about posting neat ideas to use.perl for fear that someone would jump the gun and Just Do It. I realize now that this is foolish for two reasons. First, they Won't Just Do It. Second, if they did, I'd be happy just to have the project done :)

Suggestions welcome. There needs to be an easy way to update the database, track what a user has read, allow them to "bookmark" a book (or better yet, "annotate" a document), etc. I've never used an eReader. I never gave a damn about them, really, because I like the feeling of a book in my hands. Still, this seems worthwhile.

OpenContent Index Web Service

inkdroid on 2008-11-12T11:23:35

You may be interested in IndexData's OpenContent Index Web Service.

The service is somewhat un-RESTful, but it's still a useful API for searching for Gutenberg texts, as well as available titles from the Open Content Alliance and more.

Oh, and a while back I wrote a CPAN module for talking SRU, which is the protocol the service uses. SRU is a little bit like OpenSearch on crack. It's not difficult to craft the URLs yourself, so maybe just using LWP::Simple or something would work better and hide less :-)
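To give an idea of what crafting the URLs yourself looks like, here is a rough sketch in Python. The parameter names (version, operation, query, maximumRecords) are the standard SRU searchRetrieve ones; the base URL is a placeholder, and the CQL index name "title" is an assumption about what the service supports.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql_query, max_records=10):
    """Build an SRU 1.1 searchRetrieve URL for a CQL query."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": cql_query,
        "maximumRecords": max_records,
    }
    return base_url + "?" + urlencode(params)


# Placeholder endpoint, not the real service address.
url = sru_search_url("http://sru.example.org/index", 'title = "art of war"')
print(url)
```

The response is XML, so whatever fetches this URL still needs an XML parser on the other end.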

Not a tar file?

Ed Avis on 2008-11-12T11:40:33

Surely you just need 'bzip2 -d catalog.rdf.bz2' to uncompress the file.

Re:Not a tar file?

Ovid on 2008-11-12T11:49:29

I am such a moron. Thanks :)

Code is data. Data is code.

Aristotle on 2008-11-12T17:04:44

A web site is an API. :-) And newsfeeds are a widely supported subset of that.

If you think otherwise, you’re thinking in terms of implementation, not in terms of interface. The web’s architectural goal is to make it not matter whether the document you receive is served from a static file, generated dynamically from an SQL database, served statically from a store other than the filesystem, or… whatever else. In the end there’s just documents with links you can follow, and that’s all there is.

Re:Code is data. Data is code.

Ovid on 2008-11-12T17:17:12

Except that a documented API at least implies that if it's not static, the designers will at least try to minimize changes (that is, if the designers are aware of the issues involved). A Web site makes no such claims, in general. If they had something on their site which said "go ahead and scrape us, baby, it won't hurt!", then I'd be less worried. They don't say that, so the scraping route is, er, fragile at best.

Re:Code is data. Data is code.

Aristotle on 2008-11-12T17:38:01

Ah, that is what you actually meant. (Stability is not the first thing I associate with the term “API” – loose coupling makes the web work at all.)
The Gutenbergsters should have a mailing list, do they not? Seems like a good idea to ask them if they’re willing to commit to permanent support of whatever they’d be willing to, and to state so publicly.

beyond just finding it

Eric Wilhelm on 2008-11-13T08:29:09

I had looked at packaging Gutenberg texts in dotReader with links to the chapters, etc. IIRC, it was going to be quite a mess at that point, but perhaps their editing and organization have improved.

eBooks on your smartphone

arbingersys on 2009-07-03T17:44:21

This is something I've wanted ever since I've had a PDA (now a Blackberry) given to me by work.

I haven't liked any of the mobile reader options I've seen, and since modern smartphones have a workable browser, I see little reason to build a Java app you have to port to different architectures.

Anyway, I built a simple web app based on the very same catalog.rdf, and it's optimized for mobile browsing, by which I mean it has a very compact and minimal interface.

http://www.arbingersys.com/pgmob

You can search or browse. Because I found that some texts are just too big for the Blackberry browser, the app chunks the text into pieces of about 90K. I'll probably make this more configurable at some point.
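The chunking idea can be sketched roughly like this (Python, for illustration only; this is not the app's actual code). The function name and the choice to break at a newline are assumptions, and the size here counts characters, not bytes.

```python
def chunk_text(text, size=90_000):
    """Split text into pieces of at most `size` characters.

    Prefers to end each piece just after a newline so lines stay intact;
    falls back to a hard cut when no newline is available.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        if end < len(text):
            newline = text.rfind("\n", start, end)
            if newline > start:
                end = newline + 1  # keep the newline with the earlier piece
        chunks.append(text[start:end])
        start = end
    return chunks


sample = "line\n" * 10
pieces = chunk_text(sample, size=12)
print(len(pieces), "pieces")
```

Making the size a parameter is what would let it become configurable later.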

Also, the search is based on 'friendlytitle' in the RDF file, and uses a simple /term1.*term2.*termN/i regex. But it *is* a regex, so you can do more if you know more.
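In other words, the search terms are joined with .* and matched case-insensitively against each title. A rough Python equivalent of that idea (the function name and the sample titles are made up for illustration, and Python's regex dialect differs from Perl's in places):

```python
import re


def title_search(terms, titles):
    """Return titles matching term1.*term2.*termN, ignoring case.

    Terms are deliberately not escaped, so each one may itself be a regex.
    """
    pattern = re.compile(".*".join(terms), re.IGNORECASE)
    return [title for title in titles if pattern.search(title)]


titles = ["The Art of War", "War and Peace", "The Art of Money Getting"]
print(title_search(["art", "war"], titles))
```

Because the terms go in unescaped, something like "^war" works as an anchored search, but the order of the terms matters: "war art" will not find "The Art of War".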

It's written in Perl, of course. :)

James