Oh god, please, no.

Ovid on 2008-11-20T16:46:00

Struggling all day with Gutenberg. Someone (not naming them, as I don't have permission) sent me code to let me use Redland for my RDF parsing, and it looks lovely. Too bad Redland doesn't compile for anyone. It didn't compile for me, either.

I put this aside for a bit and tried parsing result pages.

Tried to use the Web::Scraper module to at least pull results from Web pages, but I'm too stupid to figure out its syntax. Learning a new API and CSS selectors while battling strange "don't know what to do with undef" errors proved too much. Embarrassing.
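
For the record, the shape Web::Scraper wants is roughly this (a sketch with invented selectors, not working App::Gutenberg code):

use Web::Scraper;

# $html holds a fetched results page
my $books = scraper {
    # one hash per matching link; 'booklink' is a made-up class name
    process 'li.booklink > a', 'books[]' => { title => 'TEXT', href => '@href' };
};

my $result = $books->scrape($html);    # also accepts a URI object
print "$_->{title}\n" for @{ $result->{books} };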

I thought to use HTML::TableParser for some of this, but it doesn't seem to let me get at the attributes I need.

I thought XPath would be good, but the result pages aren't well-formed XML. Someone mentioned to me that there might be an XPath module which might have an option which might let you parse malformed XML. I didn't follow up on that.

I finally switched to my HTML::TokeParser::Simple module for this. It's not a good fit for this problem. No, scratch that. It's a bad fit for this problem, but it worked.
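
"Bad fit" meaning you walk the token stream by hand. Something like this, roughly (a sketch, not the actual App::Gutenberg code; the tags are illustrative):

use HTML::TokeParser::Simple;

# $html holds a fetched results page (e.g. $mech->content)
my $parser = HTML::TokeParser::Simple->new( string => $html );

my @cells;
while ( my $token = $parser->get_token ) {
    # key off each start tag manually and pull the text that follows
    next unless $token->is_start_tag('td');
    push @cells, $parser->get_trimmed_text('/td');    # text up to the </td>
}

Then I turned back to search. For this, I used WWW::Mechanize. Notice anything, um, crap about these damned results?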

sub search {
    my $self = shift;
    my $mech = WWW::Mechanize->new(
        agent     => 'App::Gutenberg (perl)',
        autocheck => 1,    # die on any failed HTTP request
    );

    $mech->get(App::Gutenberg->search_url);

    # submit whichever of author/title we were given
    $mech->submit_form(
        form_number => 1,
        fields      => {
            'author' => ($self->author || ''),
            'title'  => ($self->title  || ''),
        }
    );

    # the only hint about which kind of results page came back
    # is whether the URI ends in a fragment
    my $uri = $mech->uri;
    if ( $uri =~ /#([[:word:]]+)\z/ ) {
        # you have got to
    }
    else {
        # be kidding me
    }
}

If that URL matches, you're indexing into a list of <li> elements. Otherwise, you're parsing a table. Either way, it's a right pain to get the data you want. Oh, and they're subtly different sets of data, and the criteria for why you get one type of result rather than the other are unclear.
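
Which means the result handling ends up shaped like this (the helper names are invented; this is just the structure):

my @results;
if ( $uri =~ /#([[:word:]]+)\z/ ) {
    # fragment present: the results came back as a list of <li> elements
    @results = parse_list_results( $mech->content );
}
else {
    # no fragment: the results came back as a table
    @results = parse_table_results( $mech->content );
}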

This is why I want to see REST APIs for just about everything today. It's simple. It's straightforward. It doesn't make me cry. Now I know why you don't see Perl command-line clients for Gutenberg. Everything I'm writing is so damned fragile it will break if you look at it funny. *sniff*

Update: it looks like any search with an author returns a list, but all other searches (I've only tested the basic form) return tables.


not XML then?

nicholas on 2008-11-20T18:21:46

"not well-formed XML"

I thought that there is only well-formed XML. Anything that is not is simply not XML. The intent was to avoid the tag soup and Do-What-I-Think-You-Meant heuristics that got us to the HTML we have today.

Hence it sounds like even this so-called "RDF" that they are producing is fundamentally broken, if RDF is XML, and XML is well-formed. Not that this helps you, of course :-(

Re:not XML then?

Ovid on 2008-11-20T18:36:09

The RDF is well-formed; it's the Web site which is not. The RDF was very confusing, though, and I simply don't know it well enough to manually use an XML parser to get all of the data I need.

Re:not XML then?

Aristotle on 2008-11-23T20:26:52

Extracting information from RDF/XML with an XML parser is a fool’s errand. RDF is a graph model, and RDF/XML is merely one (fairly TMTOWTDI-heavy) representation of it. It is possible to design XML documents so that they can also be parsed as RDF, but if the data was modelled in RDF with no consideration given to the XML parsing case, then trying to parse its RDF/XML representation is likely to produce code more analogous to a regex-based HTML scraper than a parser.
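
To make that concrete: with an RDF toolkit you query the graph and never touch the XML. A sketch using RDF::Trine (one of several Perl options; I'm assuming the catalogue uses Dublin Core titles):

use RDF::Trine;

my $model  = RDF::Trine::Model->temporary_model;
my $parser = RDF::Trine::Parser->new('rdfxml');
$parser->parse_file_into_model( 'http://www.gutenberg.org/', 'catalog.rdf', $model );

# ask the graph for every dc:title, however the XML happened to nest it
my $dc_title = RDF::Trine::Node::Resource->new('http://purl.org/dc/elements/1.1/title');
my $iter     = $model->get_statements( undef, $dc_title, undef );
while ( my $statement = $iter->next ) {
    print $statement->object->literal_value, "\n";
}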

Use whatever parsing engine Web::Scraper uses

Corion on 2008-11-20T19:16:24

I've done some work with Web::Scraper, and I found that I mostly give it XPath syntax, which it handles fairly well, even with tag soup.
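
For example (a sketch; the XPath is invented for illustration):

use Web::Scraper;

my $scraper = scraper {
    # a selector starting with '/' is treated as XPath rather than CSS
    process '//table//td/a', 'links[]' => { text => 'TEXT', href => '@href' };
};

my $result = $scraper->scrape($html);    # $html is the tag soup in question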

I have a talk on Web::Scraper online, but it's in German. The hilarious Babelfish translation might provide some shallow entertainment, though.

XPath is definitely what you want

grantm on 2008-11-20T20:07:38

The XML::LibXML module has parse_html_string and parse_html_file methods that can be used to parse any old crappy HTML. It does tend to spew warnings to STDERR whether you want them or not, but you can localise a redirection of STDERR if you don't want them.
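
Something like this (a sketch; the XPath and null-device handling are illustrative):

use XML::LibXML;
use File::Spec;

my $doc = do {
    # libxml2 is noisy about broken HTML; silence it for just this call
    local *STDERR;
    open STDERR, '>', File::Spec->devnull;
    XML::LibXML->new->parse_html_string($html);    # $html is the crappy HTML
};

# once parsed, the full XPath toolbox works on the tag soup
for my $cell ( $doc->findnodes('//table//td') ) {
    print $cell->textContent, "\n";
}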

HTML::TreeBuilder::XPath

srezic on 2008-11-21T00:14:11

With HTML::TreeBuilder::XPath you can run XPath queries against HTML documents.
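
For instance (a sketch; the paths are illustrative):

use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# findvalues returns the text content of every matching node
my @titles = $tree->findvalues('//li/a');
my @hrefs  = map { $_->attr('href') } $tree->findnodes('//li/a');

$tree->delete;    # HTML::Element trees must be freed explicitly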