Spot the Bug

gnat on 2003-04-30T00:38:56

I pasted my screenscraping code in a comment. There's a subtle bug in the way it deals with HTML, though. It's nothing to do with LWP, fetching the pages, or the use of HTML::TableContentParser. The bug leads to invalid XML. Can you find the bug?

--Nat


Guessing

Ovid on 2003-04-30T01:00:51

Well, in clean(), we see this:

$text =~ s{<.*?>}{}g;

Not being familiar with HTML::TableContentParser, I can only guess, but you appear to be expecting simple tags. If a tag has an attribute whose value contains a greater-than sign (">"), then this breaks.
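To illustrate the failure mode (with a made-up snippet, not Nat's actual input): the non-greedy match stops at the first ">", even when that ">" sits inside an attribute value, leaving the tail of the tag behind.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input: an attribute value containing ">".
my $text = q{<img alt="x > y" src="pic.png">caption};

# <.*?> matches minimally, so it stops at the ">" inside the
# alt attribute and strips only '<img alt="x >'.
(my $stripped = $text) =~ s{<.*?>}{}g;

print "$stripped\n";   # leaves: ' y" src="pic.png">caption'
```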

Re:Guessing

gnat on 2003-04-30T01:23:28

You're right, this isn't perfect. However, this wasn't the bug that was giving me invalid output. I had no HTML where <.*?> was insufficient.

The problem was that I had a certain character unescaped in the HTML, and my solution to this was naive to say the least...

--Nat

Re: Spot the Bug

kingubu on 2003-04-30T06:37:26

Can you find the bug?

Er... which one? :-)

The one that fails to encode the left angle bracket in what is (presumably) character data, or the one that assumes that &nbsp; is the only built-in character entity that will be found in an HTML document?
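For what it's worth, escaping character data for all five built-in entities takes only a one-pass substitution with a lookup table. This is a minimal sketch (my own, not gnat's clean()); doing it in a single pass avoids double-escaping the "&" that the other replacements introduce.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The five predefined XML entities.
my %entity = (
    '&' => '&amp;', '<' => '&lt;', '>' => '&gt;',
    '"' => '&quot;', "'" => '&apos;',
);

# Escape character data in one pass so already-replaced text
# is never touched again.
sub escape_cdata {
    my $text = shift;
    $text =~ s{([&<>"'])}{$entity{$1}}g;
    return $text;
}

print escape_cdata(q{Black & Decker <rocks> "loudly"}), "\n";
# Black &amp; Decker &lt;rocks&gt; &quot;loudly&quot;
```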

See here for the real rules.

If you intimately know and control the documents being processed, your scraper is naive but workable. I can only hope, however, that you aren't going to offer this as a generic solution. It is not.

Given that there are mature tools available that would convert the dirtiest of HTML into XML and let you operate on *that* to do your extraction, I have to wonder why you'd go after a solution like this in the first place. "Oh, I'll just whack the string" solutions may be fun exercises, but they can only lead markup-n00bs astray and should *not* be used as examples.

BAD GNAT, NO COOKIE!!! ;->

-kip

Re: Spot the Bug

gnat on 2003-04-30T10:12:54

Blah blah :-) Yes, I could be more rigorous with entities. It works for the specific documents I was scraping. The bug I was referring to is a Perl bug, not a design bug.

If I had to convert the HTML to XML and work on that, I'd slit my wrists. For all the haughty condescension about "naive but workable", the key part is "workable". It was easy to write and worked. This isn't a generic solution to extracting information, but it's a very nice specific solution.

--Nat

Solution

gnat on 2003-04-30T10:29:38

The bug is in the clean subroutine. I say:
s{\b&\b}{&amp;}g;
The expectation was that \b matches at word boundaries, so that HTML & CSS would become HTML &amp; CSS. Foolish me. There is no word boundary between a space and an ampersand. I gave up trying to be smart and took out the \b's to make it work. I could have tried capturing and replacing the spaces, but until it breaks and I need to be smart again, I'll continue with my dumb approach.
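The failure is easy to demonstrate: both the space and the "&" are non-word characters, so \b never matches between them. A small sketch (the lookahead guard against double-escaping is my addition, not part of the original clean()):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $title = 'HTML & CSS';

# \b requires a word/non-word transition; space and '&' are both
# non-word, so this substitution never fires.
(my $try = $title) =~ s{\b&\b}{&amp;}g;
print "$try\n";    # unchanged: 'HTML & CSS'

# Dropping the \b's works, but a bare s{&}{&amp;}g would also
# double-escape any entity already present; a negative lookahead
# guards against that (my addition):
(my $fixed = $title) =~ s{&(?!amp;|lt;|gt;|quot;|apos;|#)}{&amp;}g;
print "$fixed\n";  # 'HTML &amp; CSS'
```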

I need to pay more attention to my own advice that \b is nearly useless in the real world: a word boundary occurs between a word character (alphanumerics and underscore) and a non-word character, which is useful for Perl identifiers and not much more. Unless someone else would like to prove me wrong :-) Anyone found a use for \b?

--Nat

Re:Solution

vsergu on 2003-05-04T14:13:46

In real-world text, \b normally works fine for me for matching word boundaries. Of course, I don't think of "&" as a word when I'm doing text searches. I really can't remember having much trouble with it -- it's the same as using most search engines.

Also, I've occasionally used \b in one-off (or even one-liner) HTML manipulations with things like s{</?(font|b|i)\b[^>]*>}{}gi. The \b ensures that I'm getting the whole HTML element name (not matching <br> when looking for <b>, for example), and it matches whether the next character is whitespace or the closing >.
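Running that substitution over a hypothetical snippet shows the point: the \b after the element name lets <b> and </b> match whether the name is followed by ">" or by attributes, while <br> survives because there is no boundary between "b" and "r".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up input mixing the tags the one-liner targets with <br>.
my $html = q{<b>bold</b> <br> <font size="2">hi</font> <i>it</i>};

# \b keeps (font|b|i) from matching inside a longer name like 'br'.
(my $out = $html) =~ s{</?(font|b|i)\b[^>]*>}{}gi;

print "$out\n";   # 'bold <br> hi it'
```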