Software Alchemy

jplindstrom on 2004-04-03T13:40:04

[I wrote this some time ago, but I don't think I t posted it anywhere for some reason. Well, here goes...]

I just reimplemented a C program in Perl. It's a multiple- pop-to-database gateway, i.e. it fetches e-mail messages from a number of POP accounts and enters them into a database. Simple stuff really.

The reason for the reimplementation was that the original program never got really stable. After a year in production and a lot of fixes it still froze sporadically, and when we lost the programmer as a resource we got fed up with restarting the program all the time. Finally the decision was made: don't fix it -- replace it!

Space

Here are some quick stats:

      Language: C                  |      Language: Perl
Source (total): 4237               |Source (total): 1666                               
          Code: 2728 (64%)         |          Code: 867 (52%)                          
                                   |    Code + POD: 1174 (70%)                         
                                   |           POD: 307 (18%) (26% of code + POD lines)
 Comments left: 34 ( 0%) of source | Comments left: 41 ( 2%) of source                 
  Comments EOL: 207 ( 7%) of code  |  Comments EOL: 50 ( 5%) of code                   
    Whitespace: 1509 (35%)         |    Whitespace: 799 (47%)                          
     Size (Kb): 99                 |     Size (Kb): 29

The insteresting metric is (I guess) Code (i.e. lines of actual code, excluding whitespace and docs). 867 / 2728 = 31% of the original code size. This seems a typical Perl/C ratio. The Perl code size includes a bit more functionality, unit tests, documentation, a lot more error checking, and a lot more solid implementation (e.g. it doesn't freeze in the middle of the night :)

The solid implementation is what makes this a useful fire- and-forget program rather than a clunky script. Things fail sensibly. People get alerted when things fall over anyway. Everything is logged. Nothing is lost, ever.

This is what took time to get right. While I probably could have whipped up a quick-n-dirty script in a day or two, that's of no use whatsoever when what you need is a zero- maintenance work horse. (Being the sysadmin, this matters a great deal to me. Time spent getting things sturdy is never wasted in the long run, but you seldom realize it unless you have to take responsibility for keeping the program running.)

Time

And as usual, the measure the Perl community always likes to emphasize is the development time. When interviewing the original programmer about what to actually do, he asked me how much time I estimated for the project:

-"A couple of weeks? Three?"
-"Uh, two to three days I guess"

It ended up taking about three days. I estimate perhaps one more day for real-world tuning over the next week and one more day to incorporate SpamAssasin which is sorely needed.

I had done a lot of similar programming before, including database programming, popping mail and the general "read config file, log everything, do things right and perform an endless number of similar work cycles". This counts for a lot of development speed. Not only do I know what modules to use, I know _how_ to use them and can snip working pieces of code with impunity.

A fair amount of time was spent just figuring out what the program was supposed to do, the original spec being lost and all. Mostly the details of exactly-what-goes-into-which- column and is-that-SP-still-used-and-should-I-call-it? (it turned out I shouldn't)

Quite a lot of time was spent doing spike solutions for the things I had neved done before, like inserting large IMAGE blobs in Sybase (a simple huge insert statement was the final winner :) and finding a suitable module for parsing multipart MIME e-mails. The latter included installing Perl 5.8 on the dev machine in order to use Mail::Box, but I ended up using MIME::Tools. In retrospect, i should have backed off before spending time on installing 5.8 (there's no ActiveState distro yet, and we currently use 5.6.1 on three platforms).

Reuse

So which parts of CPAN made this possible? The original C program implemented POP from the ground up (Badly. When we switched mail server, it broke. Hello maintenance.) and parsed mail manually. It did use libraries for database access and base64 conversion.

This is what I didn't write for this program:

DBI
Data::Dumper
File::Copy
File::Path
Getopt::Long
MD5
MIME::Parser
Mail::POP3Client
Mail::Sender
Net::SMTP
Pod::Usage
Test::More
Time::HiRes
WorkerBee
WorkerBee::Log

The last two modules are part of an application framework i created for writing always-running workhorse programs. Basically utility routines packaged for easy reuse (I got tired of copy-pasting the same stuff after writing ten similar programs). It's not on CPAN.

While the C program is a lot less resource intensive, it doesn't run any faster. Actually it's kind of slow and I was surprised of the increased performance when I let the Perl program loose in the production environment. Zip, zip, done.

Language

What made the code size smaller?

Obviously, Perl shines when it comes to text processing. A 23 line C function implemented this line:

my $id = ($subject =~ /CASE[_ ]ID\D*(\d+)/is) ? $1 : "";

(But I made it a method anyway, to encapsulate it and so I could write a few tests. Altogether the total size of code + POD + tests didn't make it that much smaller, but the code quality is lot better and I wrote the entire thing in a fraction of the time.)

I think most of the savings in code size, except from reusing modules from CPAN, comes from effective Perl idioms. For instance, keeping the code focused and succinct while still having effective error reporting is possible using conditional syntax + eval + die:

#If no case id, create new case
$idCase ||= $self->idCaseNew() or die("Could not create new case for mail ($subject)\n");

#Insert mail in messages
$idMessage = $self->idInsertMessage($oEntity, $idCase) or die("Could not create message ($subject)\n");

Something that should make the Perl code larger is my heavy use of OO and accessor/mutator subs for a lot of things, but I find the approach works well when doing programs larger than a few hundred lines of code. Basically everything but the simplest one-offs.