Java vs. Perl

pudge on 2002-09-16T15:15:16

It seems the older Perl gets, the more willing people are to believe that it sucks, without any reasonable facts. davorg writes "You may have seen the article Can Java technology beat Perl on its home turf with pattern matching in large files? that there has been some debate about on both #perl and comp.lang.perl.misc today. One of the biggest criticisms of the article was that the author hasn't published the Perl code that he is comparing his Java with."

"I emailed the author (found his email address thru a Google search) and pointed out the unfairness of this comparison. Within half an hour I got a reply from him including the Perl code. So here it is. Feel free to optimise it."

#!/home/hoffie/bin/perl
@sunIPs=("192\\.9\\.","192\\.18\\.","192\\.29\\.");
@fileext=("\\.gif","\\.jpg","\\.css","\\.GIF","\\.JPG","\\.CSS");
$filename="$ARGV[0]";
open(IN,$filename) || die "cannot open $ARGV[0] for reading: $!";
open(OUT,">$filename.out") || die "cannot open $filename.out for writing: $!";
LINE: while(<IN>) {
    foreach $fileext (@fileext) {
        next LINE if ($_ =~ /$fileext HTTP/);
    }
    foreach $sunIP (@sunIPs) {
        next LINE if ($_ =~ /^$sunIP/);
    }
    print OUT;
}

loc

gav on 2002-09-16T15:34:59

Ignoring the speed issue, the java code is 100 lines long vs. 15 for the perl. Though the perl code isn't that well written it is both easier to maintain and extend than the java equivalent.

I think it would have been less deceptive if the Java code used regular expressions making it more of a fair test.

Re:loc

gav on 2002-09-16T15:48:03
I'm guessing something like this (untested) would be pretty fast:
if (my $idx = index($_, 'HTTP')) {
    next if $file_ext{substr($_, $idx - 8, 4)};
}
if (substr($_, 0, 4) eq '192.') {
    my $n = substr($_, 4, 2);
    next if $n == 9 || $n == 18 || $n == 29;
}
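A minimal sketch of the same index()/substr() idea, written so it runs under strict. It rests on several assumptions that are not in the post: the lookup hash %skip_ext and the sub name keep_line are hypothetical, the extension is taken to be the four characters immediately before the first " HTTP" in the line, and the sample log lines are made up. Note that index() returns -1 when nothing is found, so the result is compared against a number rather than tested for truth.

```perl
use strict;
use warnings;

# Hypothetical lookup table of extensions to drop (the post does not define one).
my %skip_ext = map { $_ => 1 } qw(.gif .jpg .css .GIF .JPG .CSS);

# Returns 1 if the log line should be kept, 0 if it should be dropped.
sub keep_line {
    my ($line) = @_;

    # index() returns -1 on failure, so check the position explicitly.
    my $idx = index($line, ' HTTP');
    if ($idx >= 4) {
        # The four characters just before " HTTP" are the candidate extension.
        return 0 if $skip_ext{ substr($line, $idx - 4, 4) };
    }

    if (substr($line, 0, 4) eq '192.') {
        # The second octet runs from offset 4 up to the next dot.
        my $dot = index($line, '.', 4);
        if ($dot > 4) {
            my $octet = substr($line, 4, $dot - 4);
            return 0 if $octet eq '9' || $octet eq '18' || $octet eq '29';
        }
    }
    return 1;
}

# Made-up sample lines, only to exercise the sub.
my @sample = (
    '10.1.2.3 - - [16/Sep/2002:10:00:00 -0700] "GET /logo.gif HTTP/1.0" 200 2048',
    '192.18.4.2 - - [16/Sep/2002:10:00:01 -0700] "GET /index.html HTTP/1.0" 200 512',
    '10.1.2.3 - - [16/Sep/2002:10:00:02 -0700] "GET /index.html HTTP/1.0" 200 512',
);
print "$_\n" for grep { keep_line($_) } @sample;
```

Unlike the regex version, this only looks at the first " HTTP" in each line, so the two could disagree on unusual input. Whether it is actually faster would have to be measured.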

Re:loc

Illiad on 2002-09-16T22:23:06
Hrmm....

Awful coding in both examples.

For each potential pattern, he's doing a separate check, be it with indexOf(), or with the regex.

At least in the perl example, the patterns to be skipped are all up at the front of the program, and adding a new exclusion is just a matter of pushing to the arrays.

[code]
my @sunIPs = qw(192\.9\. 192\.18\. 192\.29\.);
my @fileext = qw(\.gif \.GIF \.jpg \.JPG \.css \.CSS);
my $filename = shift;
open...yada..yada...
my $pattern = join('|',@fileext) . "|" . join('|',@sunIPs);

while .... read it in ....

Re:loc

Illiad on 2002-09-16T22:24:54
Gah... silly fookin' IE. Space button submits with a tab at the wrong time... :/

My Kingdom for an edit button...

Re:loc

oneiron on 2002-09-17T09:03:07

Make that one line for Perl :-)
perl -ne'/(?i:gif|jpg|css) HTTP/|/^192\.(9|18|29)\./||print' filename

Right Tool For The Job

jdporter on 2002-09-17T18:04:42
As always, it's important to consider which tool is best for the job at hand. Perl isn't always the best.

Results of my benchmark:

crappie Hoffie perl: 106.4 seconds
reasonably optimal perl: 13.7 seconds
egrep -vi -f hoffie.egrep: 1.1 seconds

where hoffie.egrep contains:
(^(192\.9|192\.18|192\.27))|((\.gif|\.jpg|\.css) HTTP)
The test data was a file of 1,200,000 lines, of which about half hit the regex.

Hypothesis: In any problem where a grep solution is significantly faster than a reasonably optimal perl solution, any comparison to a java solution is meaningless, since this is not "perl's home turf".

Re:Right Tool For The Job

oneiron on 2002-09-18T05:17:40

For cheap thrills, I started a golf thread: Golf thread which includes both a gawk and egrep version. The egrep version was "only" three times faster than the Perl version. :-( To write a 100-line Java program to solve such a trivial problem seems to me like killing an ant with a sledgehammer.
egrep -v '\.(gif|GIF|jpg|JPG|css|CSS) HTTP|^192\.(9|18|29)\.' inf >e
gawk '!/\.(gif|GIF|jpg|JPG|css|CSS) HTTP|^192\.(9|18|29)\./{print}' inf >a
perl -ne'/^192\.(?:9|18|29)\./||/\.(?:gif|GIF|jpg|JPG|css|CSS) HTTP/ or print' inf >p

Re:loc

dr_baggy on 2002-09-18T09:58:27
But the following is even shorter....

or in real perl tradition

perl -pe '$_=undef if /^192\.(9|18|29)\./ || /\.(gif|jpg|css) HTTP/i;' {logfilename} > {outputlog}

OK not quite the same... or

perl -pe '$_=undef if /^192\.(9|18|29)\./ || /\.(gif|jpg|css|GIF|JPG|CSS) HTTP/;' {logfilename} > {outputlog}

is just one line and probably faster...

Note his code is not the same as the Perl code in the ordering it does the ifs - for this size of file this could be appreciable... if GIF, CSS and JPG are very uncommon extensions (less common than the 192 IPs) then the Java will be faster due to the reordering of the if statements... It's also hardcoded ifs, which I think would be faster in Perl than the repeated foreach loops....

But I reckon the above is probably very fast...

3 seconds for the above to parse 88 Mb on a Compaq ES45

Code should be obvious from problem description

jdavidb on 2002-09-16T15:40:30

Why should he publish the code; it's not like there's more than one way to do it in Perl or anything...

Re:Code should be obvious from problem description

m2 on 2002-09-16T15:50:18

Because he is saying that the Java code took this many seconds and the Perl code took this many seconds more. And he is publishing the Java code. In order to be able to reproduce the results, his Perl code has to be made available, too.

Re:Code should be obvious from problem description

jdavidb on 2002-09-16T16:37:28

Um, yes. .... It's a joke, you see. :)

(There's always more than one way to do it in Perl.)

Uhmmm. the java code isn't using regex's

LunaticLeo on 2002-09-16T16:04:34

The java code posted in that link is not using Java Regular expressions; ie java.util.regex. Hence, the equivalent code in perl should not be using Perl Regular expressions, instead using index().

This kind of thing irks me. The author didn't even try to compare apples to apples. He compared fixed string indexing to perl regexes. Further, the code structure was fundamentally different.

What a joke.

It doesn't matter if the Java is using regexes

Ovid on 2002-09-16T20:35:54

"the equivalent code in perl should not be using Perl Regular expressions, instead using index()."

What I think you were saying is that the Java code should have been using regexes, but if you were saying that without regexes (a relatively recent addition to Java), Perl's should have also been excluded, then I would have to disagree. If I am going to compare, say, the performance of C and Java, I can't argue that Java isn't allowed to use OO features because C lacks them. If I use both Perl and Prolog to solve the 'N-Queens' problem, I can't arbitrarily rule that unification is not allowed because Perl doesn't have it. Should we argue that because Java doesn't allow variable interpolation that Perl must use printf?

If, in a comparison of languages, we exclude features because one of the languages doesn't have the appropriate feature, we're reduced down to the lowest common denominator. In fact, it might even be impossible to compare a Logic language to a Procedural one for some problems that both can solve.

When comparing languages, it's important to let those languages use their proper strengths. Otherwise, there is no real comparison beyond whether or not language X can execute a for loop faster than language Y.

Re:It doesn't matter if the Java is using regexes

LunaticLeo on 2002-09-16T22:35:24
In a general language comparison, I would agree with your point. But the article's author threw down the gauntlet of "Pattern Matching" then compared a trivial feature of Java, the string indexer method indexOf(), to the non-trivial regex engine of perl. I believe when comparing specific language features, you should try to keep the comparison as close as possible, OR make the bolder argument that disparate features are required by good idiomatic practice.

Precompile regexes

m2 on 2002-09-16T16:06:47

First thing to say is that the author is comparing substring matches with regex matches. Someone already posted code to convert the Perl version to substring matches.

Second, this code:


@fileext=("\\.gif","\\.jpg","\\.css","\\.GIF","\\.JPG","\\.CSS");
 ...
    foreach $fileext (@fileext) {
        next LINE if ($_ =~ /$fileext HTTP/);
    }

recompiles the regex every time it's evaluated. Something like this is better, methinks:


@fileext=qw( gif jpg css GIF JPG CSS );
$fileextre='\.(?:' . join("|", @fileext) . ') HTTP';
$fileextre=qr/$fileextre/o;
 ...
    next LINE if /$fileextre/;

Which is, IMO, easier to read inside a larger blob of code.
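For reference, a small self-contained sketch of the precompile idea using qr//. The variable and sub names here ($fileext_re, is_static_request) are hypothetical and the sample lines are made up; quotemeta is added only as a precaution in case an extension ever contains a regex metacharacter.

```perl
use strict;
use warnings;

my @fileext = qw(gif jpg css GIF JPG CSS);

# Build the alternation once, outside the read loop.
my $alternatives = join '|', map { quotemeta } @fileext;
my $fileext_re   = qr/\.(?:$alternatives) HTTP/;

# Returns 1 if the line is a request for one of the listed file types.
sub is_static_request {
    my ($line) = @_;
    return $line =~ $fileext_re ? 1 : 0;
}

# Made-up sample lines, only to exercise the sub.
for my $line ('"GET /logo.gif HTTP/1.0"', '"GET /index.html HTTP/1.0"') {
    printf "%-30s %s\n", $line, is_static_request($line) ? 'skip' : 'keep';
}
```

Adding a new extension is then a one-word change to @fileext, and the pattern is compiled once when the qr// is evaluated.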

Re:Precompile regexes

wickline on 2002-09-16T17:08:35
> next LINE if /$fileextre/;

and don't forget the 'o' flag on there

-matt

Re:Precompile regexes

jdporter on 2002-09-16T18:56:48
Hmm. For me (perl 5.6.1), I get slightly slower times using precompiled regexes. Of course, the /o is a nice boost.

Re:Precompile regexes

bart on 2002-09-22T22:08:22
Yeah, me too. Tests have shown that
next if /$patA|$patB/o
is quite a bit slower than
next if /$patA/o || /$patB/o
So I imagine that a faster way could be to construct a Perl expression as a string, and eval it, to compile it and include it in your code.

Also, as long as you don't modify the patterns, it turns out that the /o doesn't gain you too much any more (unlike in the old days), because Perl does some "Is it the same string? Oh, then I won't recompile!" optimisation behind the scenes.

You'll see an immediate slowdown as soon as you start changing the pattern for one regex.
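A rough sketch of the "construct a Perl expression as a string and eval it" idea. The names @patterns, $source and $skip are hypothetical, and the sketch assumes the patterns come from a trusted source and contain no unescaped "/" (string eval on untrusted input would be unsafe). It makes no claim about speed; that would need measuring.

```perl
use strict;
use warnings;

# Patterns for lines to skip; assumed trusted and free of unescaped slashes.
my @patterns = (
    '^192\.(?:9|18|29)\.',
    '\.(?:gif|jpg|css|GIF|JPG|CSS) HTTP',
);

# Turn each pattern into a literal match against the sub's first argument.
my @tests;
push @tests, '$_[0] =~ /' . $_ . '/' for @patterns;

# Source text of an anonymous sub with every pattern hardcoded, joined by ||.
my $source = 'sub { ' . join(' || ', @tests) . ' }';

# Compile the generated source once; $skip is then an ordinary code reference.
my $skip = eval $source or die "could not compile filter: $@";

print "generated: $source\n";
```

The read loop would then call $skip->($_) for each line, so the patterns are literal constants in the compiled code rather than interpolated variables.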

a reply?

jhi on 2002-09-16T16:22:58

As I said over in the original thread in Big Dave's journal, I hope someone will have the time to write a polite, professional and helpful reply that'll be sent both to the author of the article and to whatever Java forum/editors are applicable. Remember: be polite. Being snide, condescending, snotty, aggressive, ironic, or whatever, will do only harm.

From the technical viewpoints:
(1) Perl code not published
(2) input data not published
(3) comparing apples to oranges
(4) the Perl code is very slow for reasons very obvious to all of us

Benchmarks

freddo256 on 2002-09-16T18:45:32

No doubt that guy didn't see this: Regular Expression Matching benchmarks... the whole site is worth it: The Great Computer Language Shootout, as well as The Great Win32 Computer Language Shootout.

freddo

Apples and Oranges

marcel on 2002-09-16T22:57:54

It's not just about speed, though. It also matters how long it takes you to write the program and how maintainable and extensible it is.

For speed I'm offering (untested, just jotted down quickly):

#!/usr/bin/perl -n
for $e (qw/gif jpg css GIF JPG CSS/) {
next if index($_, "$e HTTP") != -1
}
print, next if substr($_, 0, 4) eq '192.';
next if substr($_, 4, 2) eq '9.';
next if substr($_, 4, 3) eq '18.';
next if substr($_, 4, 3) eq '29.';
print

and for elegance (untested again):

#!/usr/bin/perl -n
next if /(?i:gif|jpg|css) HTTP/o;
next if /^192\.(?:9|18|29)\./o;
print;

marcel

Re:Apples and Oranges

marcel on 2002-09-16T23:13:53
errr... this should be

> next if index($_, "$e HTTP") != -1

next LINE if index($_, "$e HTTP") != -1

and

> print, next if substr($_, 0, 4) eq '192.';

print, next if substr($_, 0, 4) ne '192.';

oh, for an edit interface... but you get the idea.

marcel

Lines of code?

Deven on 2002-09-17T14:10:52

As soon as I saw that ~100 line Java program, I immediately wrote a one-line Perl equivalent:

while () { print unless /^(192\.(9|18|29)\.|\.(?i:(gif|jpg|css)) HTTP)/; }

Of course, you could match his argument conventions precisely, but why bother? This form is the normal Perl way to do it, and the author's Perl and Java arguments were already different.

I haven't benchmarked this one-liner, but I bet it's faster than the author's Perl version, and likely faster than the Java code as well. It might be a shade faster to drop the outermost parens, and avoid capturing with (?:...) and avoid an alternation by replacing 9|18|29 with 2?9|18, but these would come at the expense of readability and might not make a noticeable performance impact.

Also, that one line of Perl code took me only about 2 minutes to write (slowed down slightly by double-checking the (?i:...) construct). I'm betting that 100 lines of Java code took quite a bit longer to write and debug, and it's harder to maintain 100 lines of code than one. In real life, I'd probably have used a /i modifier on the end of the regex instead of using the less-readable (?i:...) construct, but I wanted to match exactly what the author was matching to make it a valid comparison.

As far as I'm concerned, Perl's "home turf" includes being efficient for the programmer to write programs. Sometimes, highly-optimized runtime is important, but optimizing programming time is frequently more valuable. Even so, short Perl programs like this often have hard-to-beat runtimes if written well, so you can often have your cake and eat it too...

Re:Lines of code?

koschei on 2002-09-17T14:32:53
You may want to double check the position of your ^

As someone said somewhere (petdance iirc), when making optimized solutions, test. It's something a lot of people seem to not be doing in this thread (either here or in davorg's journal). If you're going to make it more efficient, you might as well make it produce the same results.

At work, I produced a shiny new version of a previous routine. I couldn't really benchmark them though: the previous version processed much less data due to a bug in its implementation and thus naturally went faster =)
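One way to act on "make it produce the same results" is a tiny harness that runs the old and new filter over the same lines and reports any disagreement. This is only a sketch: keep_reference mirrors the loop structure of the originally posted script, keep_candidate is the two-regex form suggested in several replies, all sub names are hypothetical, and the sample lines are made up.

```perl
use strict;
use warnings;

# Reference behaviour: one match per pattern, as in the posted script.
sub keep_reference {
    my ($line) = @_;
    for my $ext (qw(gif jpg css GIF JPG CSS)) {
        return 0 if $line =~ /\.$ext HTTP/;
    }
    for my $net (qw(9 18 29)) {
        return 0 if $line =~ /^192\.$net\./;
    }
    return 1;
}

# Candidate: two regexes with non-capturing alternation.
sub keep_candidate {
    my ($line) = @_;
    return 0 if $line =~ /^192\.(?:9|18|29)\./;
    return 0 if $line =~ /\.(?:gif|jpg|css|GIF|JPG|CSS) HTTP/;
    return 1;
}

# Returns the lines on which the two filters disagree.
sub disagreements {
    my @lines = @_;
    return grep { keep_reference($_) != keep_candidate($_) } @lines;
}

# Made-up sample lines, including edge cases such as mixed case and 192.180.
my @sample = (
    '192.9.1.1 - - "GET /index.html HTTP/1.0" 200 10',
    '192.180.1.1 - - "GET /index.html HTTP/1.0" 200 10',
    '10.0.0.1 - - "GET /logo.GIF HTTP/1.0" 200 10',
    '10.0.0.1 - - "GET /style.Css HTTP/1.0" 200 10',
    '10.0.0.1 - - "GET /192.9.html HTTP/1.0" 200 10',
);

my @bad = disagreements(@sample);
print "filters disagree on: $_\n" for @bad;
print "filters agree on all ", scalar(@sample), " sample lines\n" unless @bad;
```

Running both versions over a slice of the real log and comparing the output files, as was done with the file sizes later in this thread, is the stronger check; a harness like this just catches the obvious slips first.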

Re:Lines of code?

Deven on 2002-09-18T18:56:29
Actually, I wrote it with the ^ inside the parens originally; when I copied it onto here, I dropped the outer parens because they were really redundant. Then I decided to put them back in for readability and mention that they could be dropped instead.

Unfortunately, when I put them back in, I put the opening paren after the ^ (force of habit) when that's not what I meant. My bad!

I also lost the <> operator inside the while (), while we're being pedantic. :-) (I'm pretty sure this was eaten by the Slashcode software.)

To be really pedantic, I should point out that the case-insensitive part matches other capitalizations like "Css" that the original code didn't -- to be completely identical, I should have kept it case-sensitive and listed just the particular variations used in the original code, but I thought case-insensitive was more true to the intent of the original...

The only reason I didn't test it was because I didn't have any available data to run it on, and I didn't feel like constructing some sample test data. I left it as "an exercise for the reader". :-)

But hey, you knew what I meant! The point was that in a minute or two, you can make a one-liner in Perl that does the same as the Java 100-liner, and probably quite a bit faster, too.
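The case-sensitivity difference mentioned above is easy to see with a quick check. A minimal sketch, with hypothetical variable names and made-up request strings: $listed spells out only the six variants from the original script, while $insensitive uses the (?i:...) group.

```perl
use strict;
use warnings;

# Only the six spellings used in the original script.
my $listed      = qr/\.(?:gif|jpg|css|GIF|JPG|CSS) HTTP/;
# Case-insensitive extension, as in the one-liner; " HTTP" stays case-sensitive.
my $insensitive = qr/\.(?i:gif|jpg|css) HTTP/;

# Made-up request strings.
for my $request ('GET /a.css HTTP/1.0', 'GET /a.CSS HTTP/1.0', 'GET /a.Css HTTP/1.0') {
    printf "%-22s listed=%s insensitive=%s\n",
        $request,
        ($request =~ $listed      ? 'yes' : 'no'),
        ($request =~ $insensitive ? 'yes' : 'no');
}
```

Only the mixed-case spelling separates the two patterns, so the output files would differ only if the log actually contains such requests.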

Made my day

bschoate on 2002-09-17T22:25:58

Seriously folks, Java is a nice language and all, but why not use the right tool for the job? As demonstrated most effectively by Professor Hoffmann, it's quite cumbersome to parse text files using Java. Now with Perl, you can do something like this:

perl -ne "print unless /^192\.(9|18|29)\./o||/\.(gif|jpg|css|GIF|JPG|CSS) HTTP/o" < access-log > clean-log

Heck, you could even make it your .sig. Not to mention that it runs faster than the Java version. The regex solution is even a tad bit faster than testing individual values using the index function. Go figure.

Those Sun engineers should find better things to do with their time.

Hoffmann responds...

bschoate on 2002-09-17T22:45:24
I sent John the one-liner and he was nice enough to test it himself. Results follow (emphasis mine)...

From: John Hoffmann
Date: Tue Sep 17 15:32:30 2002 (PDT)
To: Brad Choate
Subject: Re: Java vs. Perl

Brad,

Thanks, you were the second person to write, but the first guy couldn't offer an optimization. Just ran your one liner on 578 Meg file and it took half the time of the java.

%timex perl -ne "print unless /^192\.(9|18|29)\./o || /\.(gif|jpg|css|GIF|JPG|CSS) HTTP/o" < developer.20020916.raw > developer.20020916.perl

real 52.51
user 27.28
sys 6.89

%java LogParse /usr/netgenesis/logs/developer.20020916.raw /usr/netgenesis/logs/developer.20020916.tmp
Processing Time: 107 Seconds

File sizes matched perfectly.

-rw-rw-r-- 1 hoffie 96187583 Sep 17 15:22 developer.20020916.perl
-rw-rw-r-- 1 hoffie 584267226 Sep 17 15:05 developer.20020916.raw
-rw-rw-r-- 1 hoffie 96187583 Sep 17 15:11 developer.20020916.tmp

The java programmer who wrote the LogParse class wants to try JDK 1.4 with regular expressions and the new IO classes to see the result. I'll see what we can do to publish a round two of the optimized Perl and the new Java.

-John

Re:Hoffmann responds...

merlyn on 2002-09-17T23:49:58

> The java programmer who wrote the LogParse class wants to try JDK 1.4 with regular expressions and the new IO classes to see the result. I'll see what we can do to publish a round two of the optimized Perl and the new Java.

Oddly enough, I can't see how that would make it go any faster than Java's non-regex solution. It seems like it would only lose ground!

Re:Hoffmann responds...

oneiron on 2002-09-18T05:23:35

Those two o's above, as in /.../o, seem quite useless because the regex's are constant. Is there a reason for them?

Re:Hoffmann responds...

bschoate on 2002-09-18T12:23:38

I guess not :) Silly me, I thought /o always helped when using the same regex pattern in a loop such as this. And I hadn't thought about specifying the non-capturing syntax, also suggested in this thread. The final result:

perl -ne "print unless /^192\.(?:9|18|29)\./||/\.(?:gif|jpg|css|GIF|JPG|CSS) HTTP/" < input > output

The fastest of all so far... any other improvements?

Re:Hoffmann responds...

dr_baggy on 2002-09-18T14:18:16

Try the following... if you are much more likely to have gif, jpg, css files than local requests, switch the regexps around and try:

perl -ne 'print unless /\.(?:gif|jpg|css|GIF|JPG|CSS) HTTP/||/^192\.(?:9|18|29)\./;' input > output

Re:Hoffmann responds...

java_sucks on 2002-09-18T14:44:55
Wouldn't an immediate retraction be in order, showing the perl one-liner and java 100-liner side by side with corrected timings? (and a note that the benchmark was designed for the purposes of java advocacy).

Re:Made my day

oneiron on 2002-09-18T08:56:54

It is faster still if you use non-capturing parens. i.e. change (9|18|29) to (?:9|18|29), ditto for the parens around gif|jpg etc. And the 'o' modifier should be removed.

Re:Made my day

damien on 2002-09-19T18:12:07
It's much faster still if you don't use alternation in the regex. /foo/ || /bar/ is significantly faster than /(foo|bar)/, since the former will be optimized to a pair of substr matches.
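Claims like this are cheap to check with the core Benchmark module. The sketch below compares a single regex with top-level alternation against two separate matches on the log patterns from this thread. The sub names, the made-up sample lines and the iteration count are all assumptions, and results will vary with the perl version and the data, so no timings are quoted here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up sample lines, repeated to give the regex engine some work.
my @sample = (
    '192.18.4.2 - - [16/Sep/2002:10:00:00 -0700] "GET /index.html HTTP/1.0" 200 512',
    '10.1.2.3 - - [16/Sep/2002:10:00:01 -0700] "GET /logo.gif HTTP/1.0" 200 2048',
    '10.1.2.3 - - [16/Sep/2002:10:00:02 -0700] "GET /docs/page.html HTTP/1.0" 200 4096',
);
my @lines = (@sample) x 500;

# One regex, both conditions joined by top-level alternation.
sub count_kept_alternation {
    my $kept = 0;
    for (@lines) {
        $kept++ unless /^192\.(?:9|18|29)\.|\.(?:gif|jpg|css|GIF|JPG|CSS) HTTP/;
    }
    return $kept;
}

# Two regexes joined by Perl's || operator.
sub count_kept_separate {
    my $kept = 0;
    for (@lines) {
        $kept++ unless /^192\.(?:9|18|29)\./ || /\.(?:gif|jpg|css|GIF|JPG|CSS) HTTP/;
    }
    return $kept;
}

# Run each variant 100 times and print a comparison table.
cmpthese(100, {
    alternation => \&count_kept_alternation,
    separate    => \&count_kept_separate,
});
```

Both subs return the number of lines kept, which gives a quick sanity check that the two forms filter identically before their speeds are compared.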

i own java

zee on 2006-08-24T03:51:57

Everyone knows that java is a piece of shit made by out of an ass handed losers to make money on even worst losers.

You have to be a retard to choose java over perl.