Perl, Ruby and Python (Part Deux)

Buck on 2003-01-15T03:48:40

Continuing with my comments made here, here's the source for the Perl and Ruby scripts I used to test with. The Python version is included for ... completeness (I'm being polite, here :).

#!/usr/bin/perl

my $target = shift;
for (my $iter = 0; $iter < 100; $iter++)
{
  my $count = 0;
  open(INDEX, "/usr/ports/INDEX");
  while (<INDEX>)
  {
    chomp;
    my @fields = split/\|/;
    if ($fields[0] =~ m{$target} || $fields[6] =~ m{$target} ||
        $fields[3] =~ m{$target} || $fields[7] =~ m{$target} ||
        $fields[8] =~ m{$target})
    {
       $count++;
    }
  }
  close INDEX;
}


#!/usr/local/bin/ruby
# distribution-name|port-path|installation-prefix|comment| \
# description-file|maintainer|categories|build deps|run deps|www site


target = ARGV[0]
100.times do |iter|
  index = File.new("/usr/ports/INDEX")
  count = 0
  index.each do |line|
    fields = line.chomp.split('|')
    if fields[0] =~ /#{target}/ || fields[6] =~ /#{target}/ || \
        fields[3] =~ /#{target}/ || fields[7] =~ /#{target}/ || \
        fields[8] =~ /#{target}/
      count += 1
    end
  end
  index.close
end


#!/usr/local/bin/python

import sys
import string
import re
target = re.compile(sys.argv[1])

for iter in range(100):
    index = open('/usr/ports/INDEX', 'r')
    count = 0
    line = index.readline()
    while line <> '' :
        fields = line.strip().split('|')
        if target.search(fields[0]) or target.search(fields[6]) \
                 or target.search(fields[3]) or target.search(fields[7]) \
                 or target.search(fields[8]):
             count = count + 1
        line = index.readline()
    index.close()


An example of the data used (from /usr/ports/INDEX in FreeBSD)
# distribution-name|port-path|installation-prefix|comment| \
# description-file|maintainer|categories|build deps|run deps|www site
9e-1.0|/usr/ports/archivers/9e|/usr/local|Explode Plan9 archives|/usr/ports/archivers/9e/pkg-descr|ports@FreeBSD.Org|archivers|||http://www.eecs.harvard.edu/user/wkj/Software/9e/
arc-5.21e.8_1|/usr/ports/archivers/arc|/usr/local|Create & extract files from DOS .ARC files|/usr/ports/archivers/arc/pkg-descr|ache@FreeBSD.org|archivers|||
arj-3.10b|/usr/ports/archivers/arj|/usr/local|Open-source ARJ|/usr/ports/archivers/arj/pkg-descr|kot@premierbank.dp.ua|archivers|autoconf213-2.13.000227_5 expat-1.95.5 gettext-0.11.5_1 gmake-3.80 libiconv-1.8_2 m4-1.4_1||http://arj.sourceforge.net/
bzip-0.21|/usr/ports/archivers/bzip|/usr/local|A block-sorting file compressor|/usr/ports/archivers/bzip/pkg-descr|ports@FreeBSD.org|archivers|||http://www.muraroa.demon.co.uk/
bzip2-1.0.2|/usr/ports/archivers/bzip2|/usr/local|A block-sorting file compressor|/usr/ports/archivers/bzip2/pkg-descr|jharris@widomaker.com|archivers|||http://sources.redhat.com/bzip2/
cabextract-0.6|/usr/ports/archivers/cabextract|/usr/local|A program to extract Microsoft cabinet (.CAB) files|/usr/ports/archivers/cabextract/pkg-descr|sobomax@FreeBSD.org|archivers|libgnugetopt-1.2|libgnugetopt-1.2|http://www.kyz.uklinux.net/cabextract.php3

The average runtimes, minutes and seconds, have been (for running each of the above scripts once):
Perl: 0:35
Python: 0:38
Ruby: 2:29


Eh?

djberg96 on 2003-01-15T05:55:14

I have no idea what you've been smoking, because there's no way you should be getting that kind of discrepency. Either you're using a very old version of ruby, your interpreter is broken, or you've been sniffing glue again.

For my results, I used ruby 1.6.7 and perl 5.8.0 on Mandrake 9. I took the sample text you gave in your journal entry and copied it over and over until I ended up with a 2.4 MB file. I used "bzip-0.21" as the target. Hopefully, I didn't screw up the logic.

I've provided the exact benchmark code that I used and the results of the benchmark module. I've also included the Perl benchmark results and code. Benchmarks first:

djberge>/usr/local/bin/ruby ruby_bench.rb
   user     system      total        real
original:  5.390000   0.100000   5.490000 (5.509422)
optimized:  2.650000   0.080000   2.730000 (2.741779)

Ruby code:

require "benchmark"
include Benchmark

target = "bzip-0.21"

bm do |x|
   x.report("original:"){
      100.times do |iter|
        index = File.new("test.txt")
        count = 0
        index.each do |line|
          fields = line.chomp.split('|')
          if fields[0] =~ /#{target}/ || fields[6] =~ /#{target}/ || \
              fields[3] =~ /#{target}/ || fields[7] =~ /#{target}/ || \
              fields[8] =~ /#{target}/
            count += 1
          end
        end
        index.close
      end
   }
   x.report("optimized:"){
      regex = Regexp.new(target)

      for n in 1..100
         count = 0
         IO.foreach("test.txt"){|line|
            line.chomp.split("|").indices(0,6,3,7,8).each{ |e|
               if regex.match(e)
                  count += 1
               end
            }
         }
      end
   }
end

Perl benchmarks:

djberge>perl perl_bench.pl
Benchmark: timing 1 iterations of original...
  original:  3 wallclock secs ( 2.52 usr +  0.04 sys =  2.56 CPU) @  0.39/s (n=1)

Perl code:

use Benchmark;
$target = "bzip-0.21";

timethese(1,{
   "original" => q{
      for (my $iter = 0; $iter < 100; $iter++)
      {
        my $count = 0;
        open(INDEX, "test.txt") || die "Couldn't open file: $!\n";
        while (<INDEX>)
        {
          chomp;
          my @fields = split/\|/;
          if ($fields[0] =~ m{$target} || $fields[6] =~ m{$target} ||
              $fields[3] =~ m{$target} || $fields[7] =~ m{$target} ||
              $fields[8] =~ m{$target})
          {
             $count++;
          }
        }
        close INDEX;
      }
   }
});

I didn't doing any serious averaging, but a few runs of each benchmark yielded no more than a tenth of a second difference.

Please run this code and let me know what you get

Re:Eh?

djberg96 on 2003-01-15T13:56:18

Oops - those were the benchmarks against the 48k file. Here are the benchmarks against the 2.4mb file:

Ruby:

djberge>/usr/local/bin/ruby ruby_bench.rb
      user     system      total        real
original:270.560000   2.740000 273.300000 (273.258903)
optimized:134.710000   1.890000 136.600000 (136.578120)

Perl:

djberge>perl perl_bench.pl
Benchmark: timing 1 iterations of original...
  original: 129 wallclock secs (127.60 usr +  1.07 sys = 128.67 CPU) @  0.01/s (n=1)

So, at the end of the day Perl is still slightly faster - about what I expected.

Re:Eh?

dreadpiratepeter on 2003-01-15T14:40:48

My reply is actually to the original post, not the first reply, but I couldn't find a link to comment on that.

Anyway, I actually was able to get the benchmark on the Perl test significantly lower by doing 2 things:
   I precompiled the regexp
   I joined the relevant search fields using an (assumedly) unused char (^A) and searched on that.

On my box that put the average from around 37 secs. to around 26 secs. (Using djberg96's benchmark version of the script).

use Benchmark;
use strict;
my $target = "bzip-0.21";
my $p = qr/$target/;

timethese(1,{
   "original" => q{
      for (my $iter = 0; $iter < 100; $iter++)
      {
        my $count = 0;
        open(INDEX, "INDEX") || die "Couldn't open file: $!\n";
        while (<INDEX>)
        {
          chomp;
          my $f = join("\001",(split/\|/)[0,3,6,7,8]);
          if ($f =~ /$p/)
          {
             $count++;
          }
        }
        close INDEX;
      }
   }
});

Re:Eh?

Buck on 2003-01-16T02:25:20

Yep. Precompiling the regex and joining the fields to be searched shaved a couple of seconds off the Perl script.

Thanks.

Re:Eh?

runrig on 2003-01-16T00:00:17

Since the 'optimized' ruby code doesn't short-circuit testing the rest of the fields on a successful match, you could make the perl a little more perlish also:
$count++ if grep m{$target}, (split /\|/) [0,3,6..8]
Or use List::Util::first instead of grep (though it may only be an improvement on bigger arrays).

I'm using perl 5.6.1 and ruby 1.6.8 and getting ruby about twice as slow as perl.

Re:Eh?

djberg96 on 2003-01-16T04:19:02

You're right - I forgot to short circuit. I'm not sure how that helps Ruby's case, though. Add a "break" after the "count += 1" line. It didn't seem to improve performance significantly for me, though.

Re:Eh?

runrig on 2003-01-16T05:34:43

The match only occurs in 1 out of every 6 lines (using your target and the sample data in the top post here), so you'd only see at most about a 16% benefit (if that). If the match occurred early in the string on more lines, there might be more benefit.

I just installed Ruby today, and have been poking through online docs earlier, and couldn't find a 'break' or 'last' statement. Is there such a thing? The best I could come up with was throwing an exception and catching it outside that loop. I still need to get a Ruby book...

Re:Eh?

djberg96 on 2003-01-16T16:07:27

I still need to get a Ruby book...

Visit rubycentral or ruby-doc.

The first link is an online version of Programming Ruby, aka "The Pickaxe". You can still buy that book at the store, if you prefer paper.

Re:Eh?

runrig on 2003-01-17T18:26:02

I had looked through the online book, but couldn't find a 'break' at first. I finally did find 'break', 'next', and 'redo' in the Expressions section; I had been looking in the Iterators section.

I was poking through the bookstore and the only Ruby book there was Sam's "Learn Ruby in 21 days". I can't recommend it, as it had no mention of 'break', 'next', or 'redo', nor the IO.foreach method in your example (and it was a thick book).

Re:Eh?

Buck on 2003-01-16T01:46:08

OK. Tried your version of the Ruby script. My ruby is version 1.6.8 on an Athlon 500mHz system running FreeBSD 4.7-STABLE. Used the /usr/ports/INDEX file that the sample data I originally posted came from; it's 3MB in size. However, I didn't use any language specific benchmark modules; I wanted to compare apples to apples. Anyway, here's the results:

[ayeka:~/portfinder] buck> repeat 5 time ruby pftest2.rb ruby
68.394u 2.119s 1:10.56 99.9% 4+1346k 0+0io 0pf+0w
69.770u 2.258s 1:12.08 99.9% 4+1346k 0+0io 0pf+0w
68.368u 2.199s 1:10.61 99.9% 4+1345k 0+0io 0pf+0w
68.594u 1.949s 1:10.58 99.9% 4+1345k 0+0io 0pf+0w
68.723u 1.969s 1:10.73 99.9% 4+1346k 0+0io 0pf+0w

Also, I've been smoke free since July, 2002. :)

Re:Eh?

djberg96 on 2003-01-16T04:54:29

Ruby:

127.33user 2.35system 2:10.31elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (222major+301minor)pagefaults 0swaps

Perl:

126.28user 1.74system 2:08.32elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (374major+164minor)pagefaults 0swaps

Perhaps it's a NetBSD issue? Seems unlikely, but based on the results you're getting versus what I'm getting, I'd consider it a possibility at least. At least I cut it down to x2 instead of x4!

Please consider posting to the mailing list with this info (ruby-talk@ruby-lang.org).

Re:Eh?

Buck on 2003-01-16T17:11:17

Perhaps it's a NetBSD issue? (Buck: FreeBSD even :) ) Seems unlikely, but based on the results you're getting versus what I'm getting, I'd consider it a possibility at least. At least I cut it down to x2 instead of x4!
Agreed.

Please consider posting to the mailing list with this info (ruby-talk@ruby-lang.org).
I'd like to try these on my TiBook with OSX 10.2.3 first, though I don't expect much of a change. Is the mailing list archived somewhere where I can research before posting anything?

By the way, thanks for looking at this.

Re:Eh?

djberg96 on 2003-01-16T17:52:27

You can find the archives at http://blade.nagaokaut.ac.jp/ruby/ruby-talk/index.shtml

There's a gateway between the mailing list and comp.lang.ruby, so you can search via deja (or your local news serve) and get everything from the mailing list that way.

FreeBSD even :) - Oops. Probably not the first time I made that mistake. Probably won't be the last. :-P