Bioinformatics Benchmarks

chromatic on 2008-02-06T23:49:39

I just caught a link to a programming language benchmark for bioinformatics. Unsurprisingly, the Perl is grotty and the C and C++ and Java implementations beat it handily.

Are there any PDLers who'd like to bring some sanity to the results? (My eyes are going on strike from the C-style nested loops. Make sure you catch the use of bitwise and as a control-flow operator. That has to work only by accident.)
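For anyone wondering why bitwise and is a shaky stand-in for logical and, here is a minimal sketch. It is not the benchmark's code; the truthy helper and the values 2 and 1 are made up for illustration. It shows two differences: & works on bits, so two true values can yield a false result, and & always evaluates both operands while && short-circuits.

```perl
use strict;
use warnings;

my $calls = 0;
sub truthy { $calls++; return $_[0] }   # hypothetical helper that counts its calls

# Two true values whose bits do not overlap: 2 is 0b10, 1 is 0b01.
my ($x, $y) = (2, 1);
my $bitwise = ($x & $y)  ? 'true' : 'false';   # 2 & 1 is 0, so this is 'false'
my $logical = ($x && $y) ? 'true' : 'false';   # both operands are true

# & always evaluates both sides; && stops at the first false value.
$calls = 0;
my $r1 = truthy(0) & truthy(1);
my $bitwise_calls = $calls;

$calls = 0;
my $r2 = truthy(0) && truthy(1);
my $logical_calls = $calls;

print "bitwise: $bitwise ($bitwise_calls calls), logical: $logical ($logical_calls calls)\n";
```

Code that only ever feeds & the 1-or-empty results of comparisons happens to get the right answer, which is presumably the "accident" here.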

substr(A,B,1)

Matts on 2008-02-07T00:27:09

This will be what kills it. Looking for single chars in a string with perl is painful. There's no nice way around it, except perhaps to do a pre-split into an array, but then look at what you've got - an array of SVs - the overhead is huge.

You *might* be able to make it faster with a regexp. But basically doing anything character by character in perl is very slow.

(it's also one of the main reasons that XML::SAX::PurePerl is so slow)
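To make the options concrete, here is a rough sketch of the three ways of getting at single characters mentioned above. The sample string is made up, and nothing here is timed; which one wins depends on the workload, so benchmark before believing any of it.

```perl
use strict;
use warnings;

my $seq = 'GATTACA';   # made-up sample input

# 1. substr in a loop: one function call per character.
my @by_substr = map { substr($seq, $_, 1) } 0 .. length($seq) - 1;

# 2. Pre-split: one pass up front, then plain array indexing,
#    at the cost of one scalar (SV) per character.
my @by_split = split //, $seq;

# 3. Regex: let the engine walk the string and return every capture.
my @by_regex = $seq =~ /(.)/gs;

print join(',', @by_substr), "\n";
print join(',', @by_split),  "\n";
print join(',', @by_regex),  "\n";
```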

Re:substr(A,B,1)

Alias on 2008-02-07T04:56:58

> (it's also one of the main reasons that XML::SAX::PurePerl is so slow)
Ditto with PPI...

I managed to compensate by applying a regex if the character I see suggests I can read ahead a fair way.

You might be able to abuse the regex engine for this though...

s/(.)/something; $1/ge
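A rough sketch of both ideas, with made-up input and no claim that this is how PPI actually does it. The first part uses \G with /gc to read ahead over a whole run of characters when the next one allows it, falling back to a single character otherwise. The second part uses a substitution with /e as a per-character callback, here just to count characters.

```perl
use strict;
use warnings;

# Read-ahead: grab a whole run in one match where possible.
my $text = 'foo = 42;';   # made-up sample input
my @tokens;
pos($text) = 0;
while (pos($text) < length($text)) {
    if    ($text =~ /\G(\w+)/gc) { push @tokens, $1 }   # whole word in one go
    elsif ($text =~ /\G\s+/gc)   { }                     # skip a run of whitespace
    elsif ($text =~ /\G(.)/gcs)  { push @tokens, $1 }   # fall back to one character
}

# "Abusing" s///e: run code for every character, put the character back.
my $seq = 'GATTACA';
my %count;
(my $copy = $seq) =~ s/(.)/$count{$1}++; $1/gse;

print join('|', @tokens), "\n";
print join(' ', map { "$_=$count{$_}" } sort keys %count), "\n";
```

The /c flag matters in the tokenizer: without it a failed match resets pos and the loop starts over from the beginning of the string.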

You can double the speed (more or less)

runrig on 2008-02-07T02:27:02

In the alignment.pl code, nearly all the time is spent in the 'compute f matrix' loop. Pre-splitting the strings into arrays saved a few seconds (and the split itself took nearly no time). Using @_ directly instead of assigning to lexicals in the score and max subroutines (and using the ?: operator to write one-line functions) saved a few more seconds. (The Python version didn't seem quite fair to compare against, since it has named parameters and so saves the assignment.)

There was also a lot of array indexing, so pre-assigning the first level of the multi-dimensional array to temp variables between the i and j loops saved a bit more. All in all (I didn't keep careful track), it went from about 65 seconds to about 37. C and Java still have it beat, but it's a little better now (though a bit less readable).
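Roughly, the changes look like the sketch below. This is not the actual alignment.pl: the scoring values (+1 match, -1 mismatch), the gap penalty passed in by the caller, and the names score, max3, and f_matrix are all assumptions made for illustration.

```perl
use strict;
use warnings;

# One-line subs that read @_ directly instead of copying into lexicals.
# Hypothetical scoring: +1 for a match, -1 for a mismatch.
sub score { $_[0] eq $_[1] ? 1 : -1 }
sub max3  {
    $_[0] > $_[1] ? ($_[0] > $_[2] ? $_[0] : $_[2])
                  : ($_[1] > $_[2] ? $_[1] : $_[2])
}

sub f_matrix {
    my ($s, $t, $gap) = @_;
    my @a = split //, $s;            # pre-split once, outside the loops
    my @b = split //, $t;
    my ($n, $m) = (scalar @a, scalar @b);

    my @f;
    $f[0] = [ map { $_ * $gap } 0 .. $m ];
    for my $i (1 .. $n) {
        my $prev = $f[$i - 1];                # first-level lookups hoisted
        my $cur  = $f[$i] = [ $i * $gap ];    # out of the inner loop
        my $ai   = $a[$i - 1];
        for my $j (1 .. $m) {
            $cur->[$j] = max3(
                $prev->[$j - 1] + score($ai, $b[$j - 1]),   # diagonal
                $prev->[$j]     + $gap,                      # gap in $t
                $cur->[$j - 1]  + $gap,                      # gap in $s
            );
        }
    }
    return \@f;
}

my $f = f_matrix('GATTACA', 'GATTACA', -2);
print "score: $f->[-1][-1]\n";
```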

Re:You can double the speed (more or less)

runrig on 2008-02-07T17:49:19

And I don't know if it saved any time, but replacing the C-style for loops with perly 1..$n style ones made at least that part more readable.

4x

ChrisDolan on 2008-02-08T05:38:30

Interestingly, 97% of the alignment.pl time is spent in the creation of the f-matrix.

At the expense of some readability, I sped alignment.pl up by a factor of four (96 sec to 23 sec on my iMac G5 with perl 5.8.6). You can view my modified code at your peril. The substr was not in fact the biggest cost: I was surprised that changing to m/\G(.)/cg didn't save any time. The biggest win (about a 40% decrease in time) was, of course, unrolling the subroutines into the loop body, which is what the compilers for some of the other languages may be doing. Another win was removing the array dereferences, largely replacing them with running local variables. I also eliminated row 0 and column 0, which saved 2 seconds of initialization time.
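For flavor, here is a small sketch in the same spirit. It is not the modified code mentioned above. It assumes +1/-1 match scoring with a caller-supplied gap penalty, and for brevity it keeps only the previous row rather than the whole matrix, which is a further simplification of my own. The name align_score is made up. It walks one string with m/\G(.)/gc, has the score and max logic written out inline, and carries the diagonal and left cells in running locals.

```perl
use strict;
use warnings;

sub align_score {
    my ($s, $t, $gap) = @_;
    my @b    = split //, $t;
    my $m    = scalar @b;
    my @prev = map { $_ * $gap } 0 .. $m;     # row above the current one

    my $i = 0;
    while ($s =~ /\G(.)/gcs) {                # walk $s one character at a time
        my $ai = $1;
        $i++;
        my $diag = $prev[0];                  # cell up and to the left
        my $left = $prev[0] = $i * $gap;      # cell to the left
        for my $j (1 .. $m) {
            my $up   = $prev[$j];
            # score() and max() written out inline (assumed +1/-1 scoring)
            my $best = $diag + ($ai eq $b[$j - 1] ? 1 : -1);
            $best = $up   + $gap if $up   + $gap > $best;
            $best = $left + $gap if $left + $gap > $best;
            $diag = $up;
            $left = $prev[$j] = $best;        # overwrite the row in place
        }
    }
    return $prev[-1];
}

print align_score('GATTACA', 'GATTACA', -2), "\n";
```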