Perl vs Java

davorg on 2002-09-16T14:05:12

You may have seen the article "Can Java technology beat Perl on its home turf with pattern matching in large files?", which has been debated on both #perl and comp.lang.perl.misc today.

One of the biggest criticisms of the article was that the author hasn't published the Perl code that he is comparing his Java with.

I emailed the author (found his email address thru a Google search) and pointed out the unfairness of this comparison. Within half an hour I got a reply from him including the Perl code.

So here it is. Feel free to optimise it.

#!/home/hoffie/bin/perl
@sunIPs=("192\\.9\\.","192\\.18\\.","192\\.29\\.");
@fileext=("\\.gif","\\.jpg","\\.css","\\.GIF","\\.JPG","\\.CSS");
$filename="$ARGV[0]";
open(IN,$filename) || die "cannot open $ARGV[0] for reading: $!";
open(OUT,">$filename.out") || die "cannot open $filename.out for writing: $!";
LINE: while(<IN>) {
     foreach $fileext (@fileext) {
         next LINE if ($_ =~ /$fileext HTTP/);
     }
     foreach $sunIP (@sunIPs) {
         next LINE if ($_ =~ /^$sunIP/);
     }
     print OUT;
}


you can always beat bad perl code

merlyn on 2002-09-16T14:16:16

     foreach $fileext (@fileext) {
         next LINE if ($_ =~ /$fileext HTTP/);
     }
     foreach $sunIP (@sunIPs) {
         next LINE if ($_ =~ /^$sunIP/);
     }
Yeah, it's almost always possible to beat bad Perl written by people who don't understand that regexes need to be compiled.

No "patterns" in his Java...

clintp on 2002-09-16T14:18:01

Except that he's not really pattern matching. He's using Java's index-like method. And he's "unrolled" his loops within the read-loop.

His perl is idiomatic (except for the spurious =~'s) and looks just like any novice would have written it.

If I had enough data I might take a crack at unrolling it and making it quicker. Like any "benchmark" though, the code can *always* be manipulated to favor one over the other.

Compiled regexes should do it

BooK on 2002-09-16T14:44:18

Simple first pass at it: using qr//:

#!/home/hoffie/bin/perl
@sunIPs=("192\\.9\\.","192\\.18\\.","192\\.29\\.");
@fileext=("\\.gif","\\.jpg","\\.css","\\.GIF","\\.JPG","\\.CSS");
$filename="$ARGV[0]";
open(IN,$filename) || die "cannot open $ARGV[0] for reading: $!";
open(OUT,">$filename.out") || die "cannot open $filename.out for writing: $!";

# compile once
$fileext = join '|', @fileext;
$fileext = qr/(?:$fileext) HTTP/;
$sunIPs = join '|', @sunIPs;
$sunIPs = qr/^(?:$sunIPs)/;

LINE: while(<IN>) {
     next LINE if /$fileext/;
     next LINE if /$sunIPs/;
     print OUT;
}

This remains untested, but I'd bet that's faster!

MRE2

acme on 2002-09-16T15:08:31

Well, he's not using Java regexes so it's not a fair comparison. But anyway, I'd like to point out that the second edition of Mastering Regular Expressions is fantastic. It goes into great detail on the new features and relative speeds of the regular expression engines in all the languages, and is generally very cool indeed.

This is trivial

petdance on 2002-09-16T15:12:19

my $filename = shift;
open(IN,$filename) || die "cannot open $filename for reading: $!";
open(OUT,">$filename.out") || die "cannot open $filename.out for writing: $!";

while ( <IN> ) {
    next if /\.(gif|jpg|css|GIF|JPG|CSS) HTTP/;
    next if /192\.(9|18|29)\./;
    print OUT;
}
Not sure about the execution speed of the regexes, but it's a damn sight easier to read.

Re:This is trivial

koschei on 2002-09-16T15:35:28

Tests? Should be /^192 and it should go faster if you use /o on the regexen.

But, yeah, much easier to read, much faster to write, and much better.

Hmm. I think I should ask pudge to make <tt> text a different colour.

Re:This is trivial

petdance on 2002-09-16T15:44:06

it should go faster if you use /o on the regexen.

No it won't. The /o only applies to regexes that are based on variables, as in:

my $pattern = '192\.(whatever)';
if ( $foo =~ /$pattern/o )
That's the ONLY time that /o applies.
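
To make the /o point concrete, here is a small sketch. The match_each helper and the sample address are made up for illustration. It runs the same list of patterns through one match operator, once without /o and once with it. With /o the variable is interpolated the first time the operator runs and then frozen, so later patterns in the list are never looked at.

```perl
use strict;
use warnings;

# Hypothetical helper: test $text against each pattern in turn and
# return a list of 1/0 flags, using /o or not depending on $use_o.
sub match_each {
    my ($text, $use_o, @patterns) = @_;
    my @results;
    for my $pattern (@patterns) {
        if ($use_o) {
            # /o: $pattern is interpolated on the first execution of
            # this match operator only; later values are ignored.
            push @results, ($text =~ /$pattern/o ? 1 : 0);
        }
        else {
            # No /o: the current value of $pattern is used every time.
            push @results, ($text =~ /$pattern/ ? 1 : 0);
        }
    }
    return @results;
}

my @patterns = ('^192\.9\.', '^192\.18\.');
print join(' ', match_each('192.18.1.1', 0, @patterns)), "\n";   # 0 1
print join(' ', match_each('192.18.1.1', 1, @patterns)), "\n";   # 0 0
```

A literal pattern such as /^192\./ has nothing to interpolate, so /o has nothing to freeze there.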

Re:This is trivial

pdcawley on 2002-09-16T15:45:56

Without wishing to wave the golf stick, may I commend
#!perl -pi.out
$_ = '' if /(?:\.(?:gif|jpg|css|GIF|JPG|CSS)[ ]HTTP |
               192\.(?:9|18|29)\.)/x
to the house?

You've got to be kidding

jjohn on 2002-09-16T15:19:36

What underutilized lackey has enough time to worry about turning a program that runs in 283 seconds BUT TAKES 5 MINUTES TO WRITE into a program that runs in 137 seconds BUT TAKES 15-30 minutes to write? If the program could be rewritten so that it runs in under 10 seconds (my attention span), THEN the extra effort *might* be worth it. This program is likely to be run from a batch job, so a hyoo-mon isn't likely to be at the terminal waiting for it to finish.

Java cuts into my beer-drinking time.

Uh...

jhi on 2002-09-16T15:45:41

(As pointed out by many, already...)

(1) The Perl code is really bad. Just replacing the "loop-over-each-line-recompiling-the-regex-each-time" by moving the loop invariant regex to the front of the while speeds things up.
(2) Using qr speeds things up further.
(3) Moving the sunIPs testing before the fileext testing speeds things up further.
(4) Inlining the 192. and HTTP speeds things up. Hey, the Java code inlines those strings.
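
Points (1) to (4) could be combined roughly as follows. This is an untimed sketch, not a measured result: the wanted() name and the variable names are invented here, and how much each step actually helps would have to be checked against real log data.

```perl
use strict;
use warnings;

# (1)/(2): each pattern is built and compiled once, outside the read loop.
# (4): the "192." prefix and the " HTTP" suffix are inlined as literals.
my $sun_ip   = qr/^192\.(?:9|18|29)\./;
my $file_ext = qr/\.(?:gif|jpg|css|GIF|JPG|CSS) HTTP/;

# Hypothetical helper: true if the log line should be written out.
sub wanted {
    my ($line) = @_;
    # (3): the anchored IP test runs first, before the extension test.
    return 0 if $line =~ $sun_ip;
    return 0 if $line =~ $file_ext;
    return 1;
}
```

The read loop then shrinks to something like: while (my $line = <IN>) { print OUT $line if wanted($line) }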

And after all that is done, we're still comparing apples and oranges: the Java code doesn't do regular expressions. If someone has the time, they might want to ape precisely what the Java code is doing, using index() and so forth, and then measure that.
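
A rough idea of what such an index()-based version might look like is below. The Java source isn't reproduced in this thread, so this is only a guess at "what the Java code is doing", assuming plain substring tests. The keep_line() and filter_handle() names are made up for the sketch.

```perl
use strict;
use warnings;

# Plain strings, no regexes: the same filtering data as the original script.
my @sun_prefixes = ('192.9.', '192.18.', '192.29.');
my @skip_tokens  = map { "$_ HTTP" } qw(.gif .jpg .css .GIF .JPG .CSS);

# Hypothetical helper: true if the log line should be kept.
sub keep_line {
    my ($line) = @_;
    for my $prefix (@sun_prefixes) {
        # index() returning 0 means the line starts with the prefix.
        return 0 if index($line, $prefix) == 0;
    }
    for my $token (@skip_tokens) {
        # index() returns -1 when the substring is absent.
        return 0 if index($line, $token) >= 0;
    }
    return 1;
}

# Hypothetical driver: copy the kept lines from one handle to another.
sub filter_handle {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        print {$out} $line if keep_line($line);
    }
    return;
}
```

Calling filter_handle(\*STDIN, \*STDOUT) would turn it into a filter. Whether this beats the regex versions is exactly the thing that would need measuring.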

I hope someone will write a polite expose of all the things that are wrong (*) with this article, and both post it to whatever forum/editors, and the author. Mind, be polite, professional, and helpful.

(*) Let me see...
(a) comparing apples and oranges
(b) the Perl code not published in the article
(c) the Perl code is very bad
(d) the input data not available

I won't comment on the Java code itself; I'll leave that to people who do more Java, except to note that it inlines the filtering data, as opposed to the Perl code, which at least has it cleanly separated into variables.