HTML-Tree, and checksumming

TorgoX on 2002-08-17T01:50:03

Dear Log,

A little whoopsie of mine in HTML::TreeBuilder basically broke version 3.12, and yet didn't cause any of the HTML-Tree tests to fail. Michael Koehne is a superstar because he spotted this and told me.

So I rushed out a new version today (3.13), with more and smarter tests that should stop things like this from happening again.

Most (but not all) of the new tests take two bits of HTML and make sure that they parse to isomorphic parse trees. Given a wrapper function same, the tests mostly look like this: ok(same( '<ul><li>x<li>y</ul>after' => '<ul><li>x</li><li>y</li></ul>after' ));

One thing that Michael Koehne suggested is ensuring continuity across versions by having tests that basically take a bit of HTML, parse it, dump the parse tree as text, and run a checksum on that text. The test then consists of making sure that the checksum stays the same across different HTML-Tree versions. He suggested MD5 for the checksum algorithm, but I'm hesitant about using it, since that would mean making HTML-Tree depend on the MD5 module. Maybe I'll just make the tests skip on sites that don't have the MD5 module installed. Anyone have other suggestions?
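For illustration only, here is a rough sketch of what a checksum-the-dump test with a skip might look like. It is not the code in HTML-Tree; the $dump string is a made-up stand-in for a real parse-tree dump, dump_checksum is a hypothetical helper name, and it assumes Digest::MD5 (rather than the older MD5 module) as the checksum provider.

```perl
use strict;
use warnings;

# Hypothetical stand-in for "parse the HTML and dump the tree as text".
# A real test would build this string from the actual parse tree.
my $dump = "<html>\n  <body>\n    <ul>\n      <li> \"x\"\n      <li> \"y\"\n";

# Load Digest::MD5 only if it is installed, so the test can skip otherwise.
my $have_md5 = eval { require Digest::MD5; 1 };

# Returns the hex MD5 of the text, or undef when Digest::MD5 is unavailable.
sub dump_checksum {
    my ($text) = @_;
    return unless $have_md5;
    return Digest::MD5::md5_hex($text);
}

my $sum = dump_checksum($dump);
if (defined $sum) {
    # A real test would compare this against a value recorded
    # from a known-good release.
    print "checksum: $sum\n";
} else {
    print "skipped: Digest::MD5 not installed\n";
}
```

The require inside eval keeps the dependency optional: the module is never loaded at compile time, so the file still runs where it is missing.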


Equality checking

ziggy on 2002-08-17T02:08:48

MD5 checks are going to be identical iff the two inputs are identical (for all practical purposes).

If you don't want to put the MD5 of the canonical version in the test case, why not put the stringified Data::Dumper value in the test? No CPAN dependency that way. :-)

Re:Equality checking

koschei on 2002-08-17T04:18:42

Well, you'd have to either eval it back in and do some deep comparison, or use the 5.8ism of Data::Dumper::Sortkeys.

I'd say use the MD5. If they don't previously have it, it will at least mean their CPAN.pm will start using it.
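To make the Sortkeys point concrete, here is a small sketch, assuming a Data::Dumper recent enough to support $Data::Dumper::Sortkeys (the one shipped with 5.8 does). The canonical_dump helper and the sample hashes are invented for illustration; they are not a real HTML-Tree structure.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper: stringify a structure with hash keys in sorted
# order, so that equal structures always produce equal text.
sub canonical_dump {
    my ($data) = @_;
    local $Data::Dumper::Sortkeys = 1;  # emit hash keys in sorted order
    local $Data::Dumper::Indent   = 1;  # fixed, simple indentation
    return Dumper($data);
}

# Two hashes with the same content, built in different key order.
my %first  = (tag => 'li', text => 'x', depth => 2);
my %second = (depth => 2, text => 'x', tag => 'li');

print canonical_dump(\%first) eq canonical_dump(\%second)
    ? "same\n"
    : "different\n";
```

Without Sortkeys, the key order in the output follows Perl's internal hash order, so a plain string comparison of two dumps is not reliable.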

MD5

hfb on 2002-08-17T14:24:38

is a core module in 5.8.0 just in case you hadn't noticed that.

Re:MD5

pudge on 2002-08-21T21:38:18

Just for the sake of pedantry ;) and in case someone doesn't know better, the MD5 module is deprecated, and Digest::MD5 is in 5.8.0.

Checksum

jmcnamara on 2002-08-18T22:38:16

"He suggested MD5 for the checksum algorithm; but I'm hesitant about using it, since that would mean making HTML-Tree have a dependency on the MD5 module."

Why not use the unpack() checksum: $sum = unpack "%32C*", $string;


Re:Checksum

TorgoX on 2002-08-18T23:20:28

Because it doesn't catch transposition:

DB<1> sub csum { unpack "%32C*", $_[0] }

DB<2> x csum "+abc-"
0 382
DB<3> x csum "-abc+"
0 382
DB<4>
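The same check as a standalone script, with Digest::MD5 alongside for comparison (adding MD5 here is my assumption about what a useful contrast looks like, not part of the debugger session above). The "%32C*" template just sums the byte values, so any reordering of the same bytes gives the same result.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Additive checksum: the sum of all byte values, kept to 32 bits.
sub csum { return unpack "%32C*", $_[0] }

my ($x, $y) = ("+abc-", "-abc+");

# Same bytes in a different order give the same sum.
printf "unpack: %d vs %d\n", csum($x), csum($y);

# MD5 depends on byte order, so the two digests differ.
print "md5 differs: ", (md5_hex($x) ne md5_hex($y) ? "yes" : "no"), "\n";
```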