It seems the older Perl gets, the more willing people are to believe that it sucks, without any reasonable facts. davorg writes "You may have seen the article Can Java technology beat Perl on its home turf with pattern matching in large files? that there has been some debate about on both #perl and comp.lang.perl.misc today. One of the biggest criticisms of the article was that the author hasn't published the Perl code that he is comparing his Java with."
"I emailed the author (found his email address thru a Google search) and pointed out the unfairness of this comparison. Within half an hour I got a reply from him including the Perl code. So here it is. Feel free to optimise it."
#!/home/hoffie/bin/perl
@sunIPs=("192\\.9\\.","192\\.18\\.","192\\.29\\.");
@fileext=("\\.gif","\\.jpg","\\.css","\\.GIF","\\.JPG","\\.CSS");
$filename="$ARGV[0]";
open(IN,$filename) || die "cannot open $ARGV[0] for reading: $!";
open(OUT,">$filename.out") || die "cannot open $filename.out for writing: $!";
LINE: while(<IN>) {
foreach $fileext (@fileext) {
next LINE if ($_ =~/$fileext HTTP/);
}
foreach $sunIP (@sunIPs) {
next LINE if ($_ =~/^$sunIP/);
}
print OUT;
}
I think it would have been less deceptive if the Java code used regular expressions making it more of a fair test.
Re:loc
gav on 2002-09-16T15:48:03
I'm guessing something like this (untested) would be pretty fast:
if (my $idx = index($_, 'HTTP')) {
next if $file_ext{substr($_, $idx - 8, 4)};
}
if (substr($_, 0, 4) eq '192.') {
my $n = substr($_, 4, 2);
next if $n == 9 || $n == 18 || $n == 29;
}
Re:loc
Illiad on 2002-09-16T22:23:06
Hrmm....
Awful coding in both examples.
For each potential pattern, he's doing a separate check, be it with indexOf(), or with the regex.
At least in the perl example, the patterns to be skipped are all up at the front of the program, and adding a new exclusion is just a matter of pushing to the arrays.
[code]
my @sunIPs = qw(192\.9\. 192\.18\. 192\.29\.);
my @fileext = qw(\.gif \.GIF \.jpg \.JPG \.css \.CSS);
my $filename = shift;
open...yada..yada...
my $pattern = join('|',@fileext) . "|" . join('|',@sunIPs);
while.... read it in ....
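The post above was cut off, so here is a minimal sketch of where it seems to be heading: one combined pattern instead of a loop per list. As written, the join would lose the line-start anchor for the IPs and the " HTTP" suffix for the extensions, so this sketch adds both back on the assumption that the result should match the original script. The keep_line() name and the sample lines are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Exclusion lists as regex fragments, same as in the post above.
my @sunIPs  = qw(192\.9\. 192\.18\. 192\.29\.);
my @fileext = qw(\.gif \.GIF \.jpg \.JPG \.css \.CSS);

# One alternation: IP prefixes anchored at the start of the line,
# extensions followed by " HTTP" as in the original per-item checks.
my $ips  = join '|', @sunIPs;
my $exts = join '|', @fileext;
my $skip = qr/^(?:$ips)|(?:$exts) HTTP/;

# keep_line() is a hypothetical helper name used only in this sketch.
sub keep_line { return $_[0] !~ $skip }

# Invented sample lines; only the last one should be printed.
my @sample = (
    qq{192.9.1.1 - - "GET /index.html HTTP/1.0"\n},
    qq{10.0.0.1 - - "GET /logo.gif HTTP/1.0"\n},
    qq{10.0.0.1 - - "GET /index.html HTTP/1.0"\n},
);
print grep { keep_line($_) } @sample;
```

Adding a new exclusion is still just a matter of pushing onto one of the two arrays before the pattern is built.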
Re:loc
Illiad on 2002-09-16T22:24:54
Gah... silly fookin' IE. Space button submits with a tab at the wrong time...:/
My Kingdom for an edit button...
Re:loc
oneiron on 2002-09-17T09:03:07
Make that one line for Perl :-)
perl -ne'/(?i:gif|jpg|css) HTTP/|/^192\.(9|18|29)\./||print' filename
Right Tool For The Job
jdporter on 2002-09-17T18:04:42
As always, it's important to consider which tool is best for the job at hand. Perl isn't always the best.
Results of my benchmark:
crappie Hoffie perl: 106.4 seconds
reasonably optimal perl: 13.7 seconds
egrep -vi -f hoffie.egrep: 1.1 seconds
where hoffie.egrep contains:
(^(192\.9|192\.18|192\.27))|((\.gif|\.jpg|\.css) HTTP)
The test data was a file of 1,200,000 lines, of which about half hit the regex.
Hypothesis: In any problem where a grep solution is significantly faster than a reasonably optimal perl solution, any comparison to a java solution is meaningless, since this is not "perl's home turf".
Re:Right Tool For The Job
oneiron on 2002-09-18T05:17:40
For cheap thrills, I started a golf thread, which includes both a gawk and an egrep version. The egrep version was "only" three times faster than the Perl version. :-(
To write a 100-line Java program to solve such a trivial problem seems to me like killing an ant with a sledgehammer.
egrep -v '\.(gif|GIF|jpg|JPG|css|CSS) HTTP|^192\.(9|18|29)\.' inf >e
gawk '!/\.(gif|GIF|jpg|JPG|css|CSS) HTTP|^192\.(9|18|29)\./{print}' inf >a
perl -ne'/^192\.(?:9|18|29)\./||/\.(?:gif|GIF|jpg|JPG|css|CSS) HTTP/ or print' inf >p
Re:loc
dr_baggy on 2002-09-18T09:58:27
But the following is even shorter....
or in real perl tradition
perl -pe '$_=undef if /^192\.(9|18|29)\./ || /\.(gif|jpg|css) HTTP/i;' {logfilename} > {outputlog}
OK not quite the same... or
perl -pe '$_=undef if /^192\.(9|18|29)\./ || /\.(gif|jpg|css|GIF|JPG|CSS) HTTP/;' {logfilename} > {outputlog}
is just one line and probably faster...
Note his code is not the same as the Perl code in the order it does the ifs. For a file of this size that could be appreciable: if GIF, CSS and JPG are very uncommon extensions (less common than the 192 IPs), then the Java will be faster due to the reordering of the if statements. It's also hardcoded ifs, which I think would be faster in Perl than the repeated loops.
But I reckon the above is probably very fast...
3 seconds for the above to parse 88 MB on a Compaq ES45
Why should he publish the code; it's not like there's more than one way to do it in Perl or anything...
Re:Code should be obvious from problem description
m2 on 2002-09-16T15:50:18
Because he is saying that the Java code took this many seconds and the Perl code took this many seconds more. And he is publishing the Java code. In order to be able to reproduce the results, his Perl code has to be made available, too.
Re:Code should be obvious from problem description
jdavidb on 2002-09-16T16:37:28
Um, yes.
.... It's a joke, you see. :) (There's always more than one way to do it in Perl.)
This kind of thing irks me. The author didn't even try to compare apples to apples. He compared fixed string indexing to Perl regexes. Further, the code structure was fundamentally different.
What a joke.
It doesn't matter if the Java is using regexes
Ovid on 2002-09-16T20:35:54
"the equivalent code in perl should not be using Perl Regular expressions, instead using index()."
What I think you were saying is that the Java code should have been using regexes, but if you were saying that without regexes (a relatively recent addition to Java), Perl's should have also been excluded, then I would have to disagree. If I am going to compare, say, the performance of C and Java, I can't argue that Java isn't allowed to use OO features because C lacks them. If I use both Perl and Prolog to solve the 'N-Queens' problem, I can't arbitrarily rule that unification is not allowed because Perl doesn't have it. Should we argue that because Java doesn't allow variable interpolation that Perl must use printf?
If, in a comparison of languages, we exclude features because one of the languages doesn't have the appropriate feature, we're reduced down to the lowest common denominator. In fact, it might even be impossible to compare a Logic language to a Procedural one for some problems that both can solve.
When comparing languages, it's important to let those languages use their proper strengths. Otherwise, there is no real comparison beyond whether or not language X can execute a for loop faster than language Y.
Re:It doesn't matter if the Java is using regexes
LunaticLeo on 2002-09-16T22:35:24
In a general language comparison, I would agree with your point. But the article's author threw down the gauntlet of "Pattern Matching" then compared a trivial feature of Java, the string indexer method indexOf(), to the non-trivial regex engine of Perl. I believe when comparing specific language features, you should try to keep the comparison as close as possible, OR make the bolder argument that disparate features are required by good idiomatic practice.
First thing to say is that the author is comparing substring matches with regex matches. Someone already posted code to convert the Perl version to substring matches.
Second, this code:
@fileext=("\\.gif","\\.jpg","\\.css","\\.GIF","\\.JPG","\\.CSS");
...
foreach $fileext (@fileext) {
next LINE if ($_ =~ /$fileext HTTP/);
}
recompiles the regex every time it's evaluated. Something like this is better, methinks:
@fileext=qw( gif jpg css GIF JPG CSS );
$fileextre='\.(?:' . join("|", @fileext) . ') HTTP';
$fileextre=qr/$fileextre/o;
...
next LINE if /$fileextre/;
Which is, IMO, easier to read inside a larger blob of code.
Re:Precompile regexes
wickline on 2002-09-16T17:08:35
> next LINE if /$fileextre/;
and don't forget the 'o' flag on there
-matt
Re:Precompile regexes
jdporter on 2002-09-16T18:56:48
Hmm. For me (perl 5.6.1), I get slightly slower times using precompiled regexes. Of course, the /o is a nice boost.
Re:Precompile regexes
bart on 2002-09-22T22:08:22
Yeah, me too. Tests have shown that
next if /$patA|$patB/o
is quite a bit slower than
next if /$patA/o || /$patB/o
So I imagine that a faster way could be to construct a Perl expression as a string, and eval it, to compile it and include it in your code.
Also, as long as you don't modify the patterns, it turns out that the /o doesn't gain you too much any more (unlike in the old days), because Perl does some "Is it the same string? Oh, then I won't recompile!" optimisation behind the scenes.
You'll see an immediate slowdown as soon as you start changing the pattern for one regex.
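The string-eval idea above could be sketched roughly as follows. This is only an illustration, not benchmarked: the pattern list mirrors the ones used elsewhere in this thread, $should_skip is a made-up name, and the generated code assumes no pattern contains an unescaped "/".

```perl
use strict;
use warnings;

# Patterns to skip; assumed to contain no unescaped "/" (the delimiter used below).
my @patterns = (
    '^192\.(?:9|18|29)\.',
    '\.(?:gif|jpg|css|GIF|JPG|CSS) HTTP',
);

# Generate source text of the form:
#   sub { $_[0] =~ /A/ || $_[0] =~ /B/ }
my @tests;
push @tests, '$_[0] =~ /' . $_ . '/' for @patterns;
my $code = 'sub { ' . join(' || ', @tests) . ' }';

# String eval compiles the generated code once; inside it every regex is a
# constant, so nothing is interpolated at match time.
my $should_skip = eval $code or die "could not compile generated code: $@";

# Invented sample lines; only the last one should be printed.
my @sample = (
    qq{192.29.1.1 - - "GET /index.html HTTP/1.0"\n},
    qq{10.0.0.1 - - "GET /style.css HTTP/1.0"\n},
    qq{10.0.0.1 - - "GET /index.html HTTP/1.0"\n},
);
for my $line (@sample) {
    print $line unless $should_skip->($line);
}
```

The cost is that the pattern list has to be trusted, since whatever is in it ends up executed as Perl source.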
Re:Apples and Oranges
marcel on 2002-09-16T23:13:53
errr... this should be
> next if index($_, "$e HTTP") != -1
next LINE if index($_, "$e HTTP") != -1
and
> print, next if substr($_, 0, 4) eq '192.';
print, next if substr($_, 0, 4) ne '192.';
oh, for an edit interface... but you get the idea.
marcel
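With both corrections applied, a regex-free version along these lines might look like the sketch below. It is a guess at the overall shape, since the corrected post itself is not shown here; keep_line() and the sample lines are hypothetical, and the second-octet lookup is one possible way to finish the '192.' check.

```perl
use strict;
use warnings;

# Literal extensions (no regex escaping needed for index()).
my @ext = qw(.gif .jpg .css .GIF .JPG .CSS);

# Second octets of the 192.x. networks to drop.
my %sun = (9 => 1, 18 => 1, 29 => 1);

# keep_line() is a hypothetical helper: true if the line should be written out.
sub keep_line {
    my ($line) = @_;

    # Drop requests for static files, using plain substring search.
    for my $e (@ext) {
        return 0 if index($line, "$e HTTP") != -1;
    }

    # Anything not starting with '192.' is kept without further checks.
    return 1 if substr($line, 0, 4) ne '192.';

    # Extract the second octet (text between the first and second dot).
    my $dot = index($line, '.', 4);
    return 1 if $dot == -1;
    return $sun{ substr($line, 4, $dot - 4) } ? 0 : 1;
}

# Invented sample lines; only the last two should be printed.
my @sample = (
    qq{192.18.1.1 - - "GET /index.html HTTP/1.0"\n},
    qq{10.0.0.1 - - "GET /pic.JPG HTTP/1.0"\n},
    qq{192.168.1.1 - - "GET /index.html HTTP/1.0"\n},
    qq{10.0.0.1 - - "GET /index.html HTTP/1.0"\n},
);
print grep { keep_line($_) } @sample;
```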
Re:Lines of code?
koschei on 2002-09-17T14:32:53
You may want to double check the position of your ^
As someone said somewhere (petdance iirc), when making optimized solutions, test. It's something a lot of people seem to not be doing in this thread (either here or in davorg's journal). If you're going to make it more efficient, you might as well make it produce the same results.
At work, I produced a shiny new version of a previous routine. I couldn't really benchmark them though: the previous version processed much less data due to a bug in its implementation and thus naturally went faster =)
Re:Lines of code?
Deven on 2002-09-18T18:56:29
Actually, I wrote it with the ^ inside the parens originally. When I copied it onto here, I dropped the outer parens because they were really redundant. Then I decided to put them back in for readability and mention that they could be dropped instead.
Unfortunately, when I put them back in, I put the opening paren after the ^ (force of habit) when that's not what I meant. My bad!
I also lost the <> operator inside the while (), while we're being pedantic. :-) (I'm pretty sure this was eaten by the Slashcode software.)
To be really pedantic, I should point out that the case-insensitive part matches other capitalizations like "Css" that the original code didn't -- to be completely identical, I should have kept it case-sensitive and listed just the particular variations used in the original code, but I thought case-insensitive was more true to the intent of the original...
The only reason I didn't test it was because I didn't have any available data to run it on, and I didn't feel like constructing some sample test data. I left it as "an exercise for the reader". :-)
But hey, you knew what I meant! The point was that in a minute or two, you can make a one-liner in Perl that does the same as the Java 100-liner, and probably quite a bit faster, too.
Seriously folks, Java is a nice language and all, but why not use the right tool for the job? As demonstrated most effectively by Professor Hoffman, it's quite cumbersome to parse text files using Java. Now with Perl, you can do something like this:
perl -ne "print unless
Heck, you could even make it your
Those Sun engineers should find better things to do with their time.
Hoffmann responds...
bschoate on 2002-09-17T22:45:24
I sent John the one-liner and he was nice enough to test it himself. Results follow (emphasis mine)...
From: John Hoffmann
Date: Tue Sep 17 15:32:30 2002 (PDT)
To: Brad Choate
Subject: Re: Java vs. Perl
Brad,
Thanks, you were the second person to write, but the first guy couldn't offer an
optimization. Just ran your one liner on 578 Meg file and it took half the time
of the java.
%timex perl -ne "print unless /^192\.(9|18|29)\./o || /\.(gif|jpg|css|GIF|JPG|CSS) HTTP/o" < developer.20020916.raw > developer.20020916.perl
real 52.51
user 27.28
sys 6.89
%java LogParse /usr/netgenesis/logs/developer.20020916.raw /usr/netgenesis/logs/developer.20020916.tmp
Processing Time: 107 Seconds
File sizes matched perfectly.
-rw-rw-r-- 1 hoffie 96187583 Sep 17 15:22 developer.20020916.perl
-rw-rw-r-- 1 hoffie 584267226 Sep 17 15:05 developer.20020916.raw
-rw-rw-r-- 1 hoffie 96187583 Sep 17 15:11 developer.20020916.tmp
The java programmer who wrote the LogParse class wants to try JDK 1.4 with
regular expressions and the new IO classes to see the result. I'll see what we
can do to publish a round two of the optimized Perl and the new Java.
-John
Re:Hoffmann responds...
merlyn on 2002-09-17T23:49:58
> The java programmer who wrote the LogParse class wants to try JDK 1.4 with regular expressions and the new IO classes to see the result. I'll see what we can do to publish a round two of the optimized Perl and the new Java.
Oddly enough, I can't see how that would make it go any faster than Java's non-regex solution. It seems like it would only lose ground!
Re:Hoffmann responds...
oneiron on 2002-09-18T05:23:35
Those two o's above, as in /.../o, seem quite useless because the regexes are constant. Is there a reason for them?
Re:Hoffmann responds...
bschoate on 2002-09-18T12:23:38
I guess not :) Silly me, I thought /o always helped when using the same regex pattern in a loop such as this. And I hadn't thought about specifying the non-capturing syntax, also suggested in this thread. The final result:
perl -ne "print unless /^192\.(?:9|18|29)\./||/\.(?:gif|jpg|css|GIF|JPG|CSS) HTTP/" < input > output
The fastest of all so far... any other improvements?
Re:Hoffmann responds...
dr_baggy on 2002-09-18T14:18:16
Try the following... if you are much more likely to have gif, jpg, css files than local hits, switch the regexps around and try:
perl -ne 'print unless /\.(?:gif|jpg|css|GIF|JPG|CSS) HTTP/||/^192\.(?:9|18|29)\./;' input > output
Re:Hoffmann responds...
java_sucks on 2002-09-18T14:44:55
Wouldn't an immediate retraction be in order, showing the perl one-liner and java 100-liner side by side with corrected timings? (and a note that the benchmark was designed for the purposes of java advocacy).
Re:Made my day
oneiron on 2002-09-18T08:56:54
It is faster still if you use non-capturing parens. i.e. change (9|18|29) to (?:9|18|29), ditto for the parens around gif|jpg etc. And the 'o' modifier should be removed.
Re:Made my day
damien on 2002-09-19T18:12:07
It's much faster still if you don't use alternation in the regex. /foo/ || /bar/ is significantly faster than /(foo|bar)/, since the former will be optimized to a pair of substr matches.
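Rather than take that on faith, it is easy enough to measure. Below is a rough sketch using the core Benchmark module; the log lines are synthetic and invented for this example, the sub names are made up, and the outcome will depend on the perl version (5.10 and later optimise alternations of literal strings differently than the perls of 2002), so treat any numbers as specific to your own machine and data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic log lines, invented for this sketch: one third are .gif requests.
my @lines = map {
    ( qq{192.9.0.$_ - - "GET /a$_.html HTTP/1.0"},
      qq{10.0.0.$_ - - "GET /img$_.gif HTTP/1.0"},
      qq{10.0.0.$_ - - "GET /page$_.html HTTP/1.0"} )
} 1 .. 1000;

# One regex with alternation over literal strings.
sub count_alternation {
    my $kept = 0;
    for (@lines) {
        $kept++ unless /\.gif HTTP|\.jpg HTTP|\.css HTTP/;
    }
    return $kept;
}

# Separate literal regexes joined with ||.
sub count_separate {
    my $kept = 0;
    for (@lines) {
        $kept++ unless /\.gif HTTP/ || /\.jpg HTTP/ || /\.css HTTP/;
    }
    return $kept;
}

# Run each variant 200 times over the data and print a comparison table.
cmpthese(200, {
    alternation => \&count_alternation,
    separate    => \&count_separate,
});
```

Both subs return the number of lines kept, which gives a cheap check that the two variants really filter the same lines before comparing their speed.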