Just try this in Java, I dare you ...

Ovid on 2003-07-01T19:19:47

The problem: he had a ton of HTML that was written by MS Excel. As a result, all of the HTML was upper-case and it had MS proprietary style information embedded in the tags. He needed to clean this up, fast. He didn't have Perl on his box, but five minutes later, he accessed a URL that pointed to this script that I wrote. Paste the HTML in the textarea, click submit and it's instantly cleaned.

#!/usr/bin/perl -T
use strict;
use warnings;
use HTML::TokeParser::Simple 2.1;
use CGI qw(:standard);

my $new_html = param('html') ? clean_html() : '';
param(-name => 'html', -value => $new_html);

print header,
    start_html('Clean html'),
    start_form,
    textarea('html', '', 10, 50 ),
    submit,
    end_form,
    end_html;

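sub clean_html {
    my $new_html = '';
    my $html     = param('html');
    my $parser   = HTML::TokeParser::Simple->new(\$html);
    while (my $token = $parser->get_token) {
        # strip the MS style information from opening tags
        $token->delete_attr('style') if $token->is_start_tag;
        # rebuild the tag with lower-case tag and attribute names
        $token->rewrite_tag;
        # append the (possibly rewritten) token text
        $new_html .= $token->as_is;
    }
    return $new_html;
}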
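
To see what the loop does without going through the form, here is a minimal command-line sketch of the same idea. The helper name clean_fragment and the sample tag are made up for illustration; the parser calls are the same ones used in the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper: the same token loop as clean_html(), but taking
# the HTML as an argument instead of reading it from a CGI parameter.
sub clean_fragment {
    my ($html) = @_;
    my $cleaned = '';
    my $parser  = HTML::TokeParser::Simple->new(\$html);
    while (my $token = $parser->get_token) {
        # drop the style attribute from opening tags
        $token->delete_attr('style') if $token->is_start_tag;
        # rebuild tags with lower-case tag and attribute names
        $token->rewrite_tag;
        $cleaned .= $token->as_is;
    }
    return $cleaned;
}

# Made-up sample in the style of Excel's HTML export.
my $sample = q{<TD STYLE="mso-number-format:General" ALIGN="left">42</TD>};
print clean_fragment($sample), "\n";
```

Run against the sample, this should print the cell with the style attribute gone and the tag and attribute names in lower case, while the text inside the cell is left alone.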


Unfair comparison?

jonasbn on 2003-07-02T07:10:39

Hmmm, it is a neat script, and I often have the experience that it does not take many lines to do something in Perl.

But is the comparison to Java completely fair? Your script does not tell the whole story, since HTML::TokeParser::Simple (which does most of the job) is not shown.

How would a similar program look in Java if a class of similar functionality existed in Java?

ovid.Simple.HTML.TokeParser

That would of course mean that we would also have to examine the modules and classes used by either of these, and so on, probably ending up in the actual interpreter/compiler. That makes no sense anyway, so we might as well just compare the outermost implementations.

This brings me back to your script (and my intuitive response on seeing it) - whoa! nice! 8)

Re:Unfair comparison?

jplindstrom on 2003-07-07T17:02:12

I also think this is a very elegant little script.

Given that the problem is to Get the Job Done(tm), I'd say it's relevant whether HTML::TokeParser functionality is available in Java.

Some or all of the functionality may be, but probably in a slightly more verbose version. And probably not as easily found, installed and used. As always, having done similar things before is an important factor (me, I wouldn't have the experience to look at HTML::TokeParser for this problem, but I might find it anyway).

BTW, a cool future direction this script could take, should the need arise, is as an ActiveX object embedded in Excel, maybe as part of an HTML-clean-up macro or something.

How would a Java program do that?