SF.pm lightning talk

miyagawa on 2007-11-28T08:15:30

So I went down to the SF.pm meeting and gave two lightning talks, one about Web::Scraper and one about takesako-san's neat IMG tag hackery. The talks went well and the other talks were interesting too. Photos are uploaded to Flickr, tagged sf.pm.


good stuff!

Qiang on 2007-12-16T22:31:50

i am just starting some scraping work and W::S works great so far. the doc is lacking though; the examples you posted in past journal entries helped! i have a few questions though:

  1. the example from the doc has:
    process "h3.ens>a",
    where the ens seems to be doing wildcard matching, i.e. matching any class name that contains ens.
  2. the html page contains utf8 characters such as è, which made HTML::Parser complain:
    Parsing of undecoded UTF-8 will give garbage when decoding entities
    HTML::Parser mentioned encoding the data before it gets parsed. but i have no clue how to do that.
  3. what does the result keyword in the DSL do? i took it out of the DSL and it still works fine.

Re:good stuff!

miyagawa on 2007-12-17T01:40:27

1. If you want wildcard matching, you can change the selector expression to something like ".ens>a"

2. Web::Scraper does whatever it can to decode UTF-8 bytes back into Unicode characters, as long as you pass a URI object and the HTML page has a correct Content-Type header. Otherwise you need to fetch the page into a variable and call Encode::decode to get the Unicode characters back.

3. The result keyword specifies which stash variable you want to get back as the result. You can omit it if you want the entire hash.
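
Points 2 and 3 can be sketched roughly as follows. This is only a minimal illustration: the HTML snippet, the stash key name links and the variable names are made up for the example, and the page is assumed to be UTF-8 encoded. It decodes the raw bytes with Encode::decode before handing the string to the scraper, then shows the return value with and without result.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Web::Scraper;

# Hypothetical page content as raw UTF-8 bytes, the way a plain fetch
# into a variable would deliver it. "\xC3\xA8" is U+00E8 encoded as UTF-8.
my $bytes = qq{<html><body><h3 class="ens"><a href="/cafe">caf\xC3\xA8</a></h3></body></html>};

# Decode the bytes into Perl characters before parsing
# (assumes the page really is UTF-8).
my $html = decode('UTF-8', $bytes);

# With result: scrape returns just the named stash variable.
my $links_only = scraper {
    process 'h3.ens>a', 'links[]' => 'TEXT';
    result 'links';
};
my $with_result = $links_only->scrape($html);       # array ref of link texts

# Without result: scrape returns the whole stash as a hash ref.
my $everything = scraper {
    process 'h3.ens>a', 'links[]' => 'TEXT';
};
my $without_result = $everything->scrape($html);    # { links => [ ... ] }

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @$with_result;
print join(', ', keys %$without_result), "\n";
```

When a URI object is passed to scrape instead of a string, the manual decode step should not be needed, provided the server sends a correct Content-Type header.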

Re:good stuff!

Qiang on 2007-12-18T04:24:06

does .ens>a match any class name that contains the string 'ens'? what is the syntax for exact matching on a class name then?

Re:good stuff!

miyagawa on 2007-12-18T04:28:03

No, ".ens>a" does an exact match. In other words, an exact match on the class name. If you want to match partial class names, you might need to do a[@class=~"ens"] or something like that. Read the CSS Selector spec for details.

Re:good stuff!

miyagawa on 2007-12-18T04:32:35

That should be a[class~="ens"].

Re:good stuff!

Aristotle on 2007-12-18T10:01:39

No, actually, “.ens > a” matches an “a” element inside an element of any name with class “ens”, whereas “a[class~="ens"]” wants to see the class on the “a” element itself. The partial-match version would actually be “*[class~="ens"] > a”.
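
A small sketch of the difference between these selectors, using a made-up HTML fragment and the hypothetical stash key hits. It assumes the selector engine behind Web::Scraper accepts the ~= attribute form and the * universal selector as written in this thread.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical markup: the class sits on the h3 in the first case,
# on the a element in the second, and "lens" merely contains "ens".
my $html = q{<html><body>
<h3 class="ens first"><a href="/a">A</a></h3>
<p><a class="ens" href="/b">B</a></p>
<h3 class="lens"><a href="/c">C</a></h3>
</body></html>};

# Helper for this example only: collect the text of every match.
sub texts_for {
    my ($selector) = @_;
    my $s = scraper {
        process $selector, 'hits[]' => 'TEXT';
        result 'hits';
    };
    return $s->scrape($html) || [];
}

# "a" directly inside any element carrying the class "ens"
my $child_of_ens = texts_for('.ens>a');

# "a" elements that carry the class "ens" themselves
my $a_with_ens = texts_for('a[class~="ens"]');

# Same meaning as .ens>a, spelled with the attribute selector
my $spelled_out = texts_for('*[class~="ens"]>a');

print "@$child_of_ens\n";
print "@$a_with_ens\n";
print "@$spelled_out\n";
```

In this sketch neither class form picks up the link under class="lens", because both compare whole whitespace-separated class names rather than substrings.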

Re:good stuff!

miyagawa on 2007-12-18T18:08:49

Eh, i didn't look at the original question very well. The point he didn't get was that class="foo bar" means two classes, foo and bar, and not one class named "foo bar". Anyway.
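
To make the class="foo bar" point concrete, here is a tiny plain-Perl sketch of how a class selector looks at the attribute. The helper has_class is hypothetical, written only for this illustration, and is not part of Web::Scraper.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $name is one of the whitespace-separated
# class names in the value of a class attribute.
sub has_class {
    my ($class_attr, $name) = @_;
    return scalar(grep { $_ eq $name } split(' ', $class_attr));
}

for my $name ('listing', 'first', 'listing first', 'list') {
    printf "%-15s %s\n", "'$name'",
        has_class('listing first', $name) ? 'matches' : 'does not match';
}
```

Each class name is compared as a whole, so a substring such as 'list' does not match, and neither does the combined string 'listing first'.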

Re:good stuff!

Qiang on 2007-12-18T04:50:27

er. my bad. i thought class="listing first" is one class name. it is 'listing' and 'first'.

great module, thanks!