HTML::TokeParser::Simple 3.1 -- major rewrite

Ovid on 2004-09-19T21:19:46

I've spent much of the morning doing something I've meant to do for a long time: completely rewriting the internals of HTML::TokeParser::Simple. One problem that's long bugged me is that returned tokens were all blessed into one subclass, even though they are clearly different types. The latest version finally rectifies that. Now extending this module to handle special needs should be a piece of cake.

One sign that the module is much cleaner is the lack of "if" statements. Most of them are in the POD, but I did notice a couple in my HTML::TokeParser::Simple::Token::Tag class after I uploaded it. As soon as I saw that, I realized that this class should actually be two classes -- one for end tags and one for start tags. It's interesting how the mere existence of a keyword points out a design problem. Start tags are what most people are really interested in, but overriding this class means overriding behavior of end tags. Silly me. I should fix that, too.
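A rough sketch of the idea, for anyone curious: bless each raw token into a class chosen by its type and let method dispatch replace the "if" statements. The `My::Token::*` class names and the `bless_token` helper are hypothetical and not the module's actual layout; the raw token shapes are the array references that HTML::TokeParser's `get_token` documents.

```perl
use strict;
use warnings;

# Hypothetical class names for illustration only.
package My::Token;
sub is_start_tag { 0 }    # base class answers "no" to every type question
sub is_end_tag   { 0 }
sub is_text      { 0 }

package My::Token::Tag::Start;
our @ISA = ('My::Token');
sub is_start_tag { 1 }
sub get_tag      { $_[0][1] }    # ["S", $tag, $attr, $attrseq, $text]

package My::Token::Tag::End;
our @ISA = ('My::Token');
sub is_end_tag { 1 }
sub get_tag    { $_[0][1] }      # ["E", $tag, $text]

package My::Token::Text;
our @ISA = ('My::Token');
sub is_text { 1 }                # ["T", $text, $is_data]

package main;

# Map the type code in slot 0 of a raw token to a class.
my %class_for = (
    S => 'My::Token::Tag::Start',
    E => 'My::Token::Tag::End',
    T => 'My::Token::Text',
);

# Hypothetical helper: bless a raw token into the matching class,
# falling back to the base class for types not listed above.
sub bless_token {
    my ($token) = @_;
    my $class = $class_for{ $token->[0] } || 'My::Token';
    return bless $token, $class;
}

my @raw = (
    [ S => 'p', {}, [], '<p>' ],
    [ T => 'hello', 0 ],
    [ E => 'p', '</p>' ],
);

for my $token ( map { bless_token($_) } @raw ) {
    printf "%-22s start=%d end=%d text=%d\n",
        ref($token), $token->is_start_tag, $token->is_end_tag, $token->is_text;
}
```

With start and end tags in separate classes, someone who only cares about start tags can subclass that one class without touching end-tag behavior.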


HTML

TorgoX on 2004-09-19T22:33:42

Thanks for all the work you've put into this! HTML::* needs all the smart help it can get.

Re:HTML

Ovid on 2004-09-19T23:41:23

HTML::* needs all the smart help it can get.

And I try to help, too :)

And feeling rather silly about my failure to break start and end tags into their own classes, I went ahead and did it now and just uploaded it. I've made major changes, so I'm sure there are huge bugs, but I'm pleased at how easy the changes are now. That makes 3 releases of this module in two days. I should really be less impetuous.

The 3rd generation

bart on 2004-09-21T00:31:41

You really had me wondering why the big version bump, yesterday. It looks like (though I haven't checked) you implemented the kind of change I was expecting, warranting such a version bump, only in 3.1. Oh, and there's even more of that fresh kind of stuff in 3.11, er, 3.12. Timing goes odd, sometimes.

Re:The 3rd generation

Ovid on 2004-09-21T13:55:47

The big version change was because of the new interface. While still backwards compatible, the new style constructors, the "get_foo" instead of "return_foo" names and a few other odds and ends are why I went with 3.0. From my standpoint, if I kept the interface the same and massively reworked the internals, there's really no justification for a version bump. Would anyone want MS Office 2005 if it had no new features and ran a touch slower? :)

Something is wrong...

bart on 2004-09-25T22:53:18

I tried to install the new version 3.12 on 2 Windows PCs today, and while on one it succeeded, on the other, it failed big time (even locking up my console window, forcing me to restart my computer — but I'm sure that's not your fault... ;-)).

Digging into the problem, I tried a manual install, step by step, and I found that -Mblib adds the lib directories under blib to @INC. But the file HTML/TokeParser/Simple.pm wasn't under blib/lib; instead, it was under lib, a sibling directory. That directory is not added to @INC.

Digging into your test scripts, I find that you do:

chdir 't' if -d 't';
unshift @INC => '../blib/lib';
actually duplicating the effort that blib does by itself. That was the reason for the tests to fail: I added lib to @INC myself, by changing this into
chdir 't' if -d 't';
unshift @INC => '../blib/lib', '../lib';
in all the scripts, and that made all the tests pass.

I actually don't believe that lib directory should even be there. I think it, and all its contents, should be under blib.

But the most striking conclusion to me was the idea that your tests succeeded, simply because you were testing the previous, older install of HTML::TokeParser::Simple, not the new one, the one you were supposed to test...

p.s. Indeed, one of my PCs didn't have an older install.

Re:Something is wrong...

Ovid on 2004-09-25T23:46:35

I'm a bit confused as to why adding '../blib' to @INC would cause things to fail. After running perl Makefile.PL; make, the blib directory is built automatically. Did you skip that step and try to run the tests directly? That would cause things to fail since I added the wrong lib.

Adding '../blib' to @INC is a typo on my part as I generally intend to add '../lib' to @INC to allow me to modify the file directly and have the changes instantly picked up. Further, I can run the tests without even running make. Still, it's a nice catch on your part and I'll have a new version uploaded soon.
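For what it's worth, one way to get that "pick up lib directly" behavior without a chdir is to anchor the path on the script's own location. This is only a sketch, and it assumes the usual layout where the test script sits in t/ and lib/ is a sibling of t/; it is not what the distribution's tests actually do.

```perl
use strict;
use warnings;
use FindBin qw($Bin);    # $Bin is the directory holding the running script
use File::Spec;

# Assumption: this script lives in t/, and lib/ is a sibling of t/.
# "use lib" runs at compile time, so the path is in @INC before any
# later "use HTML::TokeParser::Simple" would be resolved.
use lib File::Spec->catdir( $Bin, File::Spec->updir, 'lib' );

# Show what was added, so it is obvious which copy will be loaded.
print "searching: $_\n" for grep { /lib\z/ } @INC;
```

Printing `$INC{'HTML/TokeParser/Simple.pm'}` after loading the module is a quick way to confirm the tests are exercising the in-tree copy rather than a previously installed one.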

Re:Something is wrong...

bart on 2004-09-26T00:56:49

I don't know any more... I've tried to build it several times over, deleting the blib directory every time, and I don't get the same results all the time. Sometimes the whole of the lib directory is copied to under blib, but sometimes it isn't, and blib/lib/HTML/TokeParser ends up containing only one file: ".exists".

So, what's up... No idea. I think that perhaps the whole make circus occasionally goes haywire. I'll try again later, I've now given up for the day.

Re:Something is wrong...

bart on 2004-09-30T19:16:22

... I'll have a new version uploaded soon.
Couldn't you find any excuse to bump it to version 3.14? That sounds like a nice, geekish version number to aim for... :)

Anyway, I have had the time to update a largish script of mine from HTML::TokeParser to HTML::TokeParser::Simple 3.13. I quite like it. If there's anything I miss, it's the option to extend

$token->is_start_tag            # is it a start tag
$token->is_start_tag($tag)      # is it a start tag of type $tag (string)
$token->is_start_tag($qr_tags)  # is it a start tag matching the regex $qr_tags
to
$token->is_start_tag(@tags)     # is it a start tag matching any of @tags, provided @tags isn't empty
and similar for is_end_tag and is_tag. It'd make testing whether a tag is in a set of tags easier. Now I am using
if($token->is_start_tag and $special{$token->get_tag}) { ...
which seems to be a bit of double work, to me.

The alternative is to generate a regexp out of the word list, which isn't too user friendly either.
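For reference, building that regexp can be wrapped in a small helper. A minimal sketch, assuming a hypothetical `tags_to_regex` function (not part of the module) and assuming tag names should match whole and case-insensitively:

```perl
use strict;
use warnings;

# Hypothetical helper: turn a list of tag names into one regex that
# matches any of them, anchored so 'p' does not also match 'pre'.
sub tags_to_regex {
    my @tags = @_;
    die "need at least one tag\n" unless @tags;
    my $alternation = join '|', map { quotemeta } @tags;
    return qr/\A(?:$alternation)\z/i;
}

my $qr_tags = tags_to_regex(qw(p div table));

for my $tag (qw(p DIV span pre)) {
    printf "%-4s %s\n", $tag, ( $tag =~ $qr_tags ? 'matches' : 'no match' );
}
```

The result could then be passed as `$token->is_start_tag($qr_tags)`, using the regex form listed above.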

Re:Something is wrong...

Ovid on 2004-09-30T19:48:11

That's an interesting idea. I wonder if I should create a new method to deal with this? I've already heavily overloaded this method and overloading methods is not Perl's strong suit :( How about &is_tag_in_list and corresponding start and end methods? The method name could be confusing, though:

if ($token->is_start_tag_in_list) {...}

That suggests that the token is a start tag when, in fact, it may not be. I guess the overloaded method would be better after all :/

The above, incidentally, was a stream of consciousness that allowed me to figure out the interface. I didn't plan to write any of that, I just typed as I was thinking. I guess that's an example of how my mind (doesn't) work :)
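A sketch of what the overloaded version might look like, using a hypothetical stand-in class `My::StartTag` rather than the module's real code. It assumes each argument is either a plain string or a qr// regex, and that an empty argument list behaves like the bare call, which is the caveat bart raised about @tags being empty.

```perl
use strict;
use warnings;

package My::StartTag;    # hypothetical stand-in, not the module's real class

sub new {
    my ( $class, $tag ) = @_;
    return bless { tag => lc $tag }, $class;
}

sub get_tag { $_[0]{tag} }

# No arguments: true, because this object is always a start tag.
# With arguments: true if the tag matches any one of them. Each argument
# is a plain string (compared case-insensitively) or a qr// regex.
sub is_start_tag {
    my ( $self, @wanted ) = @_;
    return 1 unless @wanted;
    my $tag = $self->get_tag;
    for my $want (@wanted) {
        if ( ref $want eq 'Regexp' ) {
            return 1 if $tag =~ $want;
        }
        else {
            return 1 if $tag eq lc $want;
        }
    }
    return 0;
}

package main;

my $token = My::StartTag->new('TABLE');
print "bare call:  ", $token->is_start_tag, "\n";
print "list call:  ", $token->is_start_tag(qw(p table)), "\n";
print "regex call: ", $token->is_start_tag(qr/^t[dhr]$/), "\n";
```

Because an empty list falls through to the bare behavior, a caller passing a possibly empty @tags would still need to check it first.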