Codestorm

Purdy on 2006-08-15T14:14:45

I recently had to whip up some code to split a huge (1 GB) mbox file. I KNOW I should be using Maildir, but c'mon, people ... mbox is what Debian does by default, and I don't spend my time sysadmin'ing stuff. Looking around, I couldn't believe others hadn't already done this (perhaps they have and my Google-fu just wasn't adequate). There was a promising git-mailsplit program, but I couldn't find it in Debian.

So I whipped this up - feel free to use or tweak it for your own needs:

#!/usr/bin/perl -wT

# Process:
# 1) cp /var/mail/person /var/mail/person.bak
# 2) Run this script
# 3) chmod/chown the INBOX.GigSplitNN files
#      chown person:users /home/person/INBOX.GigSplit*
#      chmod 0600 /home/person/INBOX.GigSplit*
# 4) mv /var/mail/person /var/mail/person.prerm
# 5) mail the person and see if the /var/mail/person gets setup right
# 6) diff /var/mail/person.bak and /var/mail/person.prerm and put that in /var/mail/person
#      i ended up just tailing the file with the right number of differing lines
#      and >>'ing that into /var/mail/person
#      b/c diff'ing two 1GB files takes WAY too long!

use strict;

open( MBOX, '/var/mail/person.bak' ) || die "Cannot open person.bak: $!";

# go through the mbox file
my $message = '';
my $line_count = 0;
my $message_count = 0;
my $file_base = '/home/person/INBOX.GigSplit';
my $file_i = 1;
my $line_count_limit = 580000; # this ends up with ~40MB files, which are more tolerable
my $need_to_write_init = 1;

while( <MBOX> ) {
    $line_count++;
    if ( /^From / ) {
        if ( length( $message ) > 0 ) {
            $message_count++;
            my $file = $file_base . sprintf( "%02d", $file_i );
            print "Got message # $message_count - appending to $file ...\n";
            if ( $need_to_write_init ) {
                write_initial_msg( $file );
                $need_to_write_init = 0;
            }
            open( SPLIT, ">>$file" ) || die "Cannot append to $file: $!";
            print SPLIT $message;
            close( SPLIT );
            if ( $line_count > $line_count_limit ) {
                print "Line Count exceeded $line_count_limit, so incrementing \$file_i...\n";
                $file_i++;
                $line_count = 0;
                $need_to_write_init = 1;
            }
        }
        $message = $_;
    }
    else {
        $message .= $_;
    }
}

close( MBOX );

print "All done!\n";

sub write_initial_msg {
    my $file = shift;
    open( FILE, ">$file" ) || die "Cannot open $file to put in initial msg: $!";
    print FILE <<"_EOF_";
From MAILER-DAEMON Mon Aug 14 13:00:31 2006
Date: 14 Aug 2006 13:00:31 -0400
From: Mail System Internal Data <MAILER-DAEMON\@mail.example.com>
Subject: DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA
Message-ID: <1155574831\@mail.example.com>
X-IMAP: 1134739889 0000025473
Status: RO

This text is part of the internal format of your mail folder, and is not a real message. It is created automatically by the mail system software. If deleted, important folder data will be lost, and it will be re-created with the data reset to initial values.

_EOF_
    close( FILE );
}


So that will create INBOX.GigSplit01 ... INBOX.GigSplitNN, which my user could manage with Squirrelmail (I had to hack /home/person/.mailboxlist to add those new folders). Since the problem stemmed from a checkbox in her email client keeping old messages on the server and not removing them, she could simply delete a lot of the stuff as redundant and just look at the more recent messages for stuff she missed. Remotely accessing a 1GB mbox file tends to time out. ;)

Yes, I KNOW that could be optimized and probably even done in one line (go for it, golfers!) ... it was something I had to do and it wasn't too painful to run (6700 msgs in 2 minutes).

That's just the way I roll!* ;)

Speaking of coding, Google has their Code Jam going on, but where's the love for Perl? You can program in C++, C#, Java, Python and VB.NET, but not Perl. It probably has to do with what TopCoder supports, but something should really be done to get Perl on that list, for longevity's sake.

Peace,

Jason

* = My new favorite saying


formail

duff on 2006-08-16T19:03:22

From the man page for formail:

DESCRIPTION
              formail is a filter that can be used to force mail into mailbox format,
              perform ‘From ’ escaping, generate auto-replying headers, do simple
              header munging/extracting or split up a mailbox/digest/articles file.
              The mail/mailbox/article contents will be expected on stdin.


Maybe no one has done it in perl because it's already been done in C :-)

Re:formail

Purdy on 2006-08-17T12:50:05

Thanks ... it looks like it could do the trick, but upon closer examination, formail will split it into separate messages, but not chunked mbox files of a specified size. 6700 individual message files vs. 25 mbox files.

- Jason

Last mail not saved

elysch on 2009-11-09T19:20:21

Hi. I discovered that the script only stores a mail when it encounters the beginning of the next one (^From ). So the last e-mail is never stored, since there is no "NEXT" mail. I fixed it by moving:

        if ( length( $message ) > 0 ) {
                $message_count++;
                $file = $file_base . sprintf( "%02d", $file_i );
                #print "Got message # $message_count - appending to $file ...\n";
                if ( $need_to_write_init ) {
                        #write_initial_msg( $file );
                        $need_to_write_init = 0;
                }

                open( SPLIT, ">>$file" ) || die "Cannot append to $file: $!";
                print SPLIT $message;
                close( SPLIT );
        }

to a function called save_msg and calling it where the code was, and ALSO before

close( MBOX );

Note that I'm ignoring variable scope, since I'm in a hurry, but I shouldn't.

Feel free to ask if you want me to post my script.

Ely.
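
For readers who want to see the fix described in this comment spelled out, here is a minimal sketch, not Ely's actual script (which was never posted). It assumes a save_msg helper along the lines Ely describes, wraps the loop in a hypothetical split_mbox function so the variables are properly scoped, and uses lexical filehandles with three-argument open. Like Ely's snippet, it leaves out the "folder internal data" message that write_initial_msg adds in the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: append one buffered message to a chunk file.
# Returns 1 if something was written, 0 if the buffer was empty.
sub save_msg {
    my ( $file, $message ) = @_;
    return 0 unless length $message;
    open( my $out, '>>', $file ) or die "Cannot append to $file: $!";
    print {$out} $message;
    close($out) or die "Cannot close $file: $!";
    return 1;
}

# Hypothetical wrapper: split $mbox into ${file_base}01, ${file_base}02, ...
# and return the number of messages written. A new chunk starts once more
# than $line_limit lines have been read since the previous rollover.
sub split_mbox {
    my ( $mbox, $file_base, $line_limit ) = @_;
    open( my $in, '<', $mbox ) or die "Cannot open $mbox: $!";

    my ( $message, $line_count, $message_count, $file_i ) = ( '', 0, 0, 1 );
    while ( my $line = <$in> ) {
        $line_count++;
        if ( $line =~ /^From / ) {
            # A new message starts here, so flush the previous one.
            my $file = $file_base . sprintf( '%02d', $file_i );
            if ( save_msg( $file, $message ) ) {
                $message_count++;
                if ( $line_count > $line_limit ) {
                    $file_i++;
                    $line_count = 0;
                }
            }
            $message = $line;
        }
        else {
            $message .= $line;
        }
    }
    close($in);

    # The fix: the last message has no following "From " line,
    # so it has to be flushed after the loop ends.
    my $last_file = $file_base . sprintf( '%02d', $file_i );
    $message_count++ if save_msg( $last_file, $message );
    return $message_count;
}
```

With the values from the original post, the call would be split_mbox( '/var/mail/person.bak', '/home/person/INBOX.GigSplit', 580000 ). Because a chunk only rolls over at a message boundary, no message is ever split across two files.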