I'm looking into tagsoup at the moment, writing a module that uses its HTML normalization algorithm on top of HTML::Parser. The original code is written in Java, not exactly clear, and full of state transitions that are somewhat obscure. For example, the list of open tags is maintained as a stack (fair enough), and when it's time to process a close tag, the first thing to know is whether we should process it or ignore it. Here's the source, in Java:
//...
Element sp;
for (sp = theStack; sp != null; sp = sp.next()) {
    if (sp.name().equals(name)) break;
}
if (sp == null) return;    // unknown etag, do nothing
//....
Here is the same operation, expressed more naturally in Perl:
## This end tag closes a tag that isn't open. Ignore it.
return unless grep { $_ eq $name } @$stack;
Granted, these are philosophical and stylistic differences. My quibbles could be with the language, the generally accepted idioms for Java programming, or with the author of this code. (Actually, I don't have any issues with the author; just being complete and highlighting the possibilities. ;-)
Regardless of my personal differences, this example highlights the benefit of writing clear, concise code. Using the C-style for loop obscures the intent by micromanaging the problem and focusing on the mechanics. The single statement Perl equivalent nearly hides the mechanics while emphasizing the intent (return unless you find something).
You may look at this and say "So what? It's just a different style preference." In the small, you're somewhat correct. However, I've been looking at this code for some time now, and these little differences accumulate, turning a program that should be easy into something hard to express and hard to understand.
Not the best way to write something that people should understand, and only incidentally for a computer to execute...
PS: Here's the tagsoup algorithm in a nutshell:
In one of his talks (Enterprise Perl?) James Duncan discussed readable code and gave the excellent advice that every loop should be a method. I find myself doing this more with Java than Perl, probably because I hit the mental ceiling for method length with Java's verbosity. So I'd typically translate your example to a method like:
skipUnopenedTags(theStack);
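A rough sketch of what that extraction might look like for the loop in the post. Only name() and next() come from the original snippet; the Element class below is a hypothetical stand-in, and OpenTags and isOpen are assumed names, not TagSoup's API.

```java
// Hypothetical stand-in for TagSoup's Element; only name() and next()
// appear in the snippet from the post.
final class Element {
    private final String name;
    private final Element next;

    Element(String name, Element next) {
        this.name = name;
        this.next = next;
    }

    String name() { return name; }
    Element next() { return next; }
}

final class OpenTags {
    // The loop from the post, wrapped in a method named for its intent:
    // walk the stack from the top and report whether the tag is open.
    static boolean isOpen(Element stack, String name) {
        for (Element sp = stack; sp != null; sp = sp.next()) {
            if (sp.name().equals(name)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Element stack = new Element("li", new Element("ol", new Element("body", null)));
        System.out.println(OpenTags.isOpen(stack, "ol"));    // true
        System.out.println(OpenTags.isOpen(stack, "table")); // false
    }
}
```

With something like this in place, the end-tag handler only needs a one-line guard, and the stack walking stays out of sight.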
One side-effect of Java's not having unless is that I tend to write both 'isSomething' and 'isNotSomething' for readability, especially because an '!' always gets lost:
if ( ! foo.isSomething() ) {... } 
vs.
if ( foo.isNotSomething() ) {... } 
The first example is particularly difficult to read when developers don't use sufficient whitespace, like:
if (! foo.isSomething() ) {... } 
or even the painful:
if (!foo.isSomething()) {... } 
I'm always surprised that I rarely see discussions of whitespace in code readability. Do people consider this a religious issue like where to put braces?
Whitespace
Dom2 on 2004-12-10T08:02:11
I think that might be the case sometimes. I also think that there are people that just don't care. I've asked someone at work why they wrote this:
my$x=$test>5?'foo':'bar';
vs
my $x = $test > 5 ? 'foo' : 'bar';
And they literally didn't see the difference. It's quite depressing.
-Dom
Re:Whitespace
TeeJay on 2004-12-16T11:15:52
Ask them if they are a compiler or a programmer?
(literal paste)return unless $tags{$tag};
If this tag cannot be placed under any tags in the stack of open tags, look for a tag in the stack that can contain this tag's nominal parent. Remember this tag, and repeat the process with this tag's parent.
Do you mean to say that if you find a <LI> tag without an enclosing <UL> or <OL> , you insert one of those?
Re:Please elaborate
ziggy on 2004-12-12T14:57:49
Right.
If <LI> cannot fit within the current open tag (such as <A>), walk up the stack of open tags until you can find a spot where you can open it. If, for example, you find an open <OL>, close all open tags until the <OL> is at the top of the stack. (Presumably, that means you forgot a </LI> somewhere, since they cannot nest.)
If there is no spot in the stack where you can deform it to open a <LI> tag, try to open the sequence <OL> <LI>, and find the closest spot that can handle an open <OL> tag. Repeat the process above until there's a spot to insert the <OL> <LI> sequence.
This algorithm, though surprisingly simple, generally works to fill in the missing bits of tag soup. For example, when a bare <TD> is found, the sequence <TABLE> <TBODY> <TR> <TD> is opened. It also helps to fix up the infamous <B> <I> </B> </I>.
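The steps above can be sketched roughly as follows. This is a simplified illustration, not TagSoup's actual code: the SoupFixer class, its method names, and both lookup tables are invented for the example, and the nesting rules are a toy subset rather than the real HTML schema. End tags are handled in the simplest way (close everything down to the matching open tag, ignore end tags that match nothing).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class SoupFixer {
    // Toy nesting table, invented for this sketch (not the real HTML schema).
    private static final Map<String, Set<String>> CHILDREN = Map.of(
        "body", Set.of("a", "b", "i", "ol", "table"),
        "a", Set.of("b", "i"),
        "b", Set.of("a", "i"),
        "i", Set.of("a", "b"),
        "ol", Set.of("li"),
        "li", Set.of("a", "b", "i", "ol", "table"),
        "table", Set.of("tbody"),
        "tbody", Set.of("tr"),
        "tr", Set.of("td"),
        "td", Set.of("a", "b", "i", "ol", "table"));

    // Nominal parent of each tag; "body" has none, which ends the search.
    private static final Map<String, String> NOMINAL_PARENT = Map.of(
        "a", "body", "b", "body", "i", "body", "ol", "body", "table", "body",
        "li", "ol", "tbody", "table", "tr", "tbody", "td", "tr");

    private final Deque<String> open = new ArrayDeque<>(); // head = innermost tag
    private final List<String> out = new ArrayList<>();

    SoupFixer() {
        open.push("body");
        out.add("<body>");
    }

    // How many open tags must be closed before `tag` fits, or -1 if it fits nowhere.
    private int tagsToClose(String tag) {
        int depth = 0;
        for (String candidate : open) {
            if (CHILDREN.get(candidate).contains(tag)) return depth;
            depth++;
        }
        return -1;
    }

    void startTag(String tag) {
        Deque<String> pending = new ArrayDeque<>(); // tags to open, outermost on top
        for (String t = tag; t != null; t = NOMINAL_PARENT.get(t)) {
            pending.push(t);
            int toClose = tagsToClose(t);
            if (toClose < 0) continue; // no spot for t: retry with its nominal parent
            for (int i = 0; i < toClose; i++) out.add("</" + open.pop() + ">");
            for (String p : pending) {
                open.push(p);
                out.add("<" + p + ">");
            }
            return;
        }
        // Nothing in the stack can host the tag or any of its ancestors: drop it.
    }

    void endTag(String tag) {
        if (!open.contains(tag)) return; // end tag for something not open: ignore
        String closed;
        do {
            closed = open.pop();
            out.add("</" + closed + ">");
        } while (!closed.equals(tag));
    }

    String result() {
        return String.join("", out);
    }

    public static void main(String[] args) {
        SoupFixer fixer = new SoupFixer();
        fixer.startTag("td"); // a bare <td>
        System.out.println(fixer.result()); // <body><table><tbody><tr><td>
    }
}
```

Feeding it the start tags ol, li, a, li closes the dangling a and li before opening the second li, which is the "walk up the stack" case described above.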
Re:Please elaborate
bart on 2004-12-19T01:51:19
I get it.
So, where do you get your hierarchy of allowable nesting of tags from? The data will probably originate in the HTML DTD, but how do you feed it into your code? What form does the data structure take?
Re:Please elaborate
ziggy on 2004-12-19T15:51:38
The hierarchy of nestable tags ultimately originates from the HTML DTD.
John Cowan wrote two schema languages for tag soup - one for the scanner, and one for the tag parser. He has an XSLT stylesheet that converts the HTML Schema into a Java class. I wrote a simpler stylesheet that converts this schema into a hash-of-hashes.
;-) wild
inkdroid on 2004-12-20T17:04:43
can we see the stylesheet?
Re:wild
ziggy on 2004-12-20T17:18:44
Go grab the Tag Soup distribution and poke around.
My "tag soup schema -> perl" stylesheet can be found at http://www.panix.com/user/ziggy/schema.xsl.
The single statement Perl equivalent nearly hides the mechanics while emphasizing the intent (return unless you find something).
In Perl 5.10 we get the smartmatch, so the mechanics will be completely hidden:
return unless $name ~~ @$stack;
In any case, did you ever finish porting the thing to Perl? If not, can I still have the code you have so far?