milking 'em and stringing 'em

pemungkah on 2002-08-29T16:58:20

Someone I know was complaining about a writers' conference search site that had a really awful search interface. Says I, "Oh, well, I can pull all those pages down, parse them, and then build you a little database you can search."

Whoa.

Getting the pages downloaded turned out to be a little tricky. I suspected that something more complicated was going on when I couldn't just do a GET for the URL. Having not read quite enough documentation, I didn't realize how easy it was to add cookie support to LWP::UserAgent, so I futzed around with wsnitch, trying to get it to build so I could watch the HTTP interaction. I got it to build, but then it just segfaulted. Okay, I'll debug that. gdb doesn't work.

Sigh. I diddled around until I had found and fixed the bugs in the Gentoo gdb install, found and fixed the wsnitch problem, and had it all working. Just about that time I figured that I should google for LWP and cookies. D'oh. At least wsnitch and gdb are working.
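
For anyone else who skipped that bit of the documentation, here is a minimal sketch of what cookie support looks like with a current LWP::UserAgent. It is not the script I actually ran; fetch_page is a made-up helper name, and the URL you pass it is whatever site you are scraping.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Passing an empty hash ref as cookie_jar asks LWP to create an
# in-memory HTTP::Cookies jar, so cookies set by one response are
# sent back with the following requests.
my $ua = LWP::UserAgent->new(cookie_jar => {});

# Hypothetical helper: fetch one page and return its body,
# dying with the HTTP status line if the request fails.
sub fetch_page {
    my ($agent, $url) = @_;
    my $response = $agent->get($url);
    die "GET $url failed: ", $response->status_line, "\n"
        unless $response->is_success;
    return $response->decoded_content;
}
```

Calling fetch_page($ua, $url) for the entry page first and the search pages afterward should be enough for a site that only hands out a session cookie. Anything fancier (logins, hidden form fields) would need more than this.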

Anyway, once I got that far, I could read all the pages, but they were full of nasties: tables upon tables, unclosed font tags, and plenty more. I was able to match out some of the boilerplate with regexes, but getting at the actual data required HTML::TreeBuilder and lots of diddling in the debugger.
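
To give a flavor of the TreeBuilder approach, here is a small sketch. The markup is an invented sample in the same spirit as the real pages (nested tables, unclosed font tags), and extract_rows is a hypothetical helper, not the code I ended up with.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Invented sample of sloppy markup: a table nested inside a table,
# with font tags that are never closed.
my $html = <<'HTML';
<table><tr><td>
<table>
<tr><td><font size=2>Example Writers Workshop</td><td>Springfield</td></tr>
<tr><td><font size=2>Another Conference</td><td>Shelbyville</td></tr>
</table>
</td></tr></table>
HTML

# Hypothetical helper: return one array ref of cell texts per data row.
sub extract_rows {
    my ($content) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($content);
    my @rows;
    for my $tr ($tree->look_down(_tag => 'tr')) {
        # Skip layout rows that only exist to wrap a nested table.
        next if $tr->look_down(_tag => 'table');
        my @cells = map { $_->as_trimmed_text } $tr->look_down(_tag => 'td');
        push @rows, \@cells if @cells;
    }
    $tree->delete;    # release the parse tree explicitly
    return @rows;
}
```

The parser closes the dangling font tags on its own, so the code only has to decide which rows are data and which are layout. On the real pages that decision was where most of the debugger time went.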

Next up is using DBI for the very first time to build and query a database. If nothing else, I've learned a lot.