Java vs. Perl (with a side of tag soup)

ziggy on 2004-12-09T16:53:42

I'm looking into tagsoup at the moment, writing a module that uses its HTML normalization algorithm on top of HTML::Parser. The original code is written in Java, not exactly clear, and full of state transitions that are somewhat obscure. For example, the list of open tags are maintained as a stack (fair enough), and when it's time to process a close tag, the first thing to know is whether we should process it or ignore it. Here's the source, in Java:

//...

Element sp;
for (sp = theStack; sp != null; sp = sp.next()) {
    if (sp.name().equals(name)) break;
}
if (sp == null) return;    // unknown etag, do nothing
//....

Here is the same operation, expressed more naturally in Perl:

## This end tag closes a tag that isn't open.  Ignore it.
return unless grep m/$name/, @$stack;

Granted, these are philosophical and sylistic differences. My quibbles could be with the language, the generally accepted idioms for Java programming, or with the author of this code. (Actually, I don't have any issues with the author; just being complete and highlighting the possibilities. ;-)

Regardless of my personal differences, this example highlights the benefit of writing clear, concise code. Using the C-style for loop obscures the intent by micromanaging the problem and focusing on the mechanics. The single statement Perl equivalent nearly hides the mechanics while emphasizing the intent (return unless you find something).

You may look at this and say So what? It's just a different style preference. In the small, you're somewhat correct. However, I've been looking at this code for some time now, and these little differences accumulate to complexify a program from something that should be easy, and turning it into something hard to express and hard to understand.

Not the best way to write something that people should understand, and only incidentally for a computer to execute...

PS: Here's the tagsoup algorithm in a nutshell:

Use a very liberal scanner to read the input character by character; use a schema driven lexer to find tokens in broken markup
Include a schema for HTML that describes what a tag can contain, and a nominal parent for every tag
Convert the tokens from the lexer into SAX events, adding events as necessary to convert the input into well-formed output; ignore unnecessary tokens when they appear (non-HTML elements, extra close tags)

The real power is in the last bit: the algorithm to add extra events to produce well-formed output. Here's how that works:

Maintain a stack of open tags
For each open tag, look up the stack to find a tag that can contain this tag
If this tag cannot be placed under any tags in the stack of open tags, look for a tag in the stack that contain this tag's nominal parent. Remember this tag, and repeat the process with this tag's parent. Repeat as necessary.
Knowing which open tag will serve as the parent to this sequence of tags to open, unwind the stack (if necessary) to close all open tags until the parent tag is at the top of the tag stack.
Open each tag in the list of tags to open (from step 3). After each tag is opened, reopen any of the preserved tags (from step 4) that this tag can contain.
Close tags as they are found; if close tags are missing, issue the necessary events to produce a well-formed document. Ignore unnecessary close tags.
(Optionally) Ignore all non-HTML tag events.

The algorithm is somewhat simple, can always succeed (or just ignore a tag in woefully broken HTML), and work in a streaming fashion. Unfortunately, this streaming algorithm has no lookahead, so it will deform your input file to create something vaguely approximating your input in a well formed fashion, even if it does not render correctly. For example, a sequence of table, tr, form, td will result a sequence of table, tr, /tr, /table, form, table, tr, td -- not what you started with, but that was broken anyway.

every loop a method

lachoy on 2004-12-09T18:16:50

In one of his talks (Enterprise Perl?) James Duncan discussed readable code and gives the excellent advice that every loop should be a method. I find myself doing this more with Java than Perl, probably because I hit the mental ceiling for method length with Java's verbosity. So I'd typically translate your example to a method like:

skipUnopenedTags(theStack);

One side-effect of Java's not having unless is that I tend to write both 'isSomething' and 'isNotSomething' for readability, especially because an '!' always gets lost:

if ( ! foo.isSomething() ) { ... } vs. if ( foo.isNotSomething() ) { ... }

The first example is particularly difficult to read when developers don't use sufficient whitespace, like:

if (! foo.isSomething() ) { ... }

or even the painful:

if (!foo.isSomething()) { ... }

I'm always surprised that I rarely see discussions of whitespace in code readability. Do people consider this a religious issue like where to put braces?

Whitespace

Dom2 on 2004-12-10T08:02:11
I think that might be the case sometimes. I also think that there are people that just don't care. I've asked someone at work why they wrote this:
my$x=$test>5?'foo':'bar';

vs

my $x = $test > 5 ? 'foo' : 'bar';

And they literally didn't see the difference. It's quite depressing.
-Dom

Re:Whitespace

TeeJay on 2004-12-16T11:15:52
Ask them if they are a compiler or a programmer?

Hash

Juerd on 2004-12-10T10:55:25

In a similar application, I keep a hash of open tags along with a stack. This is a simple ++/-- thing, but it makes the given snippet even more readable:

return unless $tags{$tag};

(literal paste)

Please elaborate

bart on 2004-12-11T22:12:10

Can you please explain, maybe show an example, of what you mean with step 3? It's just a mystery to me now.

If this tag cannot be placed under any tags in the stack of open tags, look for a tag in the stack that contain this tag's nominal parent. Remember this tag, and repeat the process with this tag's parent.

Do you mean to say that if you find a <LI> tag without an enclosing <UL> or <OL> , you insert one of those?

Re:Please elaborate

ziggy on 2004-12-12T14:57:49
Right.
If <LI> cannot fit within the current open tag (such as <A>), walk up the stack of open tags until you can find a spot where you can open it. If, for example, you find an open <OL>, close all open tags until the <OL> is at the top of the stack. (Presumably, that means you forgot a </LI> somewhere, since they cannot nest.)
If there is no spot in the stack where you can deform it to open a <LI> tag, try to open the sequence <OL> <LI>, and find the closest spot that can handle an open <OL> tag. Repeat the process above until there's a spot to insert the <OL> <LI> sequence.
This algorithm, though surprisingly simple, generally works to fill in the missing bits of tag soup. For example, when a bare <TD> is found, the sequence <TABLE> <TBODY> <TR> <TD>. It also helps to fix up the infamous <B> <I> </B> </I>.

Re:Please elaborate

bart on 2004-12-19T01:51:19
I get it.
So, where do you get your hierarchy of allowable nesting of tags from? The data will probably originate in the HTML DTD, but how do you feed it into your code? What form does the data structure take?

Re:Please elaborate

ziggy on 2004-12-19T15:51:38
The hierarchy of nestable tags ultimately originates from the HTML DTD.
John Cowan wrote two schema langagues for tag soup - one for the scanner, and one for the tag parser. He has an XSLT stylesheet that converts the HTML Schema into a Java class. I wrote a simpler stylesheet that converts this schema into a hash-of-hashes. ;-)

wild

inkdroid on 2004-12-20T17:04:43
can we see the stylsheet?

Re:wild

ziggy on 2004-12-20T17:18:44
Go grab the Tag Soup distribution and poke around.
My "tag soup schema -> perl" stylesheet can be found at http://www.panix.com/user/ziggy/schema.xsl.

Perl 6; also, Wither the code?

Aristotle on 2006-10-12T16:06:34

The single statement Perl equivalent nearly hides the mechanics while emphasizing the intent (return unless you find something).

In Perl 5.10 we get the smartmatch, so the mechanics will be completely hidden:

return unless $name ~~ @$stack;

In any case, did you ever finish porting the thing to Perl? If not, can I still have the code you have so far?