I recently had to whip up some code to split a huge (1 GB) mbox file. I KNOW I should be using Maildir, but c'mon, people ... mbox is what Debian does by default and I don't spend my time sysadmin'ing stuff. Looking around, I couldn't believe others hadn't already done this (perhaps they have and my Google-fu just wasn't adequate). There was a promising git-mailsplit program, but I couldn't find it in Debian.
So I whipped this up - feel free to use/tweak it:
#!/usr/bin/perl -wT
# Process:
# 1) cp /var/mail/person /var/mail/person.bak
# 2) Run this script
# 3) chmod/chown the INBOX.GigSplitNN files
#      chown person:users /home/person/INBOX.GigSplit*
#      chmod 0600 /home/person/INBOX.GigSplit*
# 4) mv /var/mail/person /var/mail/person.prerm
# 5) mail the person and see if /var/mail/person gets set up right
# 6) diff /var/mail/person.bak and /var/mail/person.prerm and put that in /var/mail/person
#      I ended up just tailing the file with the right number of differing lines
#      and >>'ing that into /var/mail/person
#      b/c diff'ing two 1GB files takes WAY too long!
use strict;

open( my $mbox, '<', '/var/mail/person.bak' ) || die "Cannot open person.bak: $!";

# go through the mbox file
my $message = '';
my $line_count = 0;
my $message_count = 0;
my $file_base = '/home/person/INBOX.GigSplit';
my $file_i = 1;
my $line_count_limit = 580000; # this ends up with ~40MB files, which are more tolerable
my $need_to_write_init = 1;

while ( <$mbox> ) {
    $line_count++;
    if ( /^From / ) {
        # a new message starts here, so write out the one collected so far
        save_msg();
        $message = $_;
    } else {
        $message .= $_;
    }
}
# the last message has no "From " line after it, so flush it explicitly
save_msg();
close( $mbox );
print "All done!\n";

# append the collected message to the current split file
sub save_msg {
    return unless length( $message ) > 0;
    $message_count++;
    my $file = $file_base . sprintf( "%02d", $file_i );
    print "Got message # $message_count - appending to $file ...\n";
    if ( $need_to_write_init ) {
        write_initial_msg( $file );
        $need_to_write_init = 0;
    }
    open( my $split, '>>', $file ) || die "Cannot append to $file: $!";
    print $split $message;
    close( $split );
    if ( $line_count > $line_count_limit ) {
        print "Line Count exceeded $line_count_limit, so incrementing \$file_i...\n";
        $file_i++;
        $line_count = 0;
        $need_to_write_init = 1;
    }
    $message = '';
}

# start a new split file with the folder-internal-data pseudo-message
sub write_initial_msg {
    my $file = shift;
    open( my $fh, '>', $file ) || die "Cannot open $file to put in initial msg: $!";
    print $fh <<"_EOF_";
From MAILER-DAEMON Mon Aug 14 13:00:31 2006
Date: 14 Aug 2006 13:00:31 -0400
From: Mail System Internal Data <MAILER-DAEMON\@mail.example.com>
Subject: DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA
Message-ID: <1155574831\@mail.example.com>
X-IMAP: 1134739889 0000025473
Status: RO

This text is part of the internal format of your mail folder, and is not
a real message. It is created automatically by the mail system software.
If deleted, important folder data will be lost, and it will be re-created
with the data reset to initial values.

_EOF_
    close( $fh );
}
So that will create INBOX.GigSplit01 ... INBOX.GigSplitNN, which my user could manage with SquirrelMail (I had to hack /home/person/.mailboxlist to add those new folders). The problem stemmed from a checkbox in her email client that kept old messages on the server instead of removing them, so she could simply delete a lot of the stuff as redundant and just look at the more recent messages for anything she missed. Remotely accessing a 1GB mbox file tends to time out. ;)
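Step 6 in the script's header comment boils down to "copy whatever arrived after the backup was taken." Since delivery to an mbox only appends, one alternative to diff'ing or counting lines is to seek past the backup's byte size and copy the rest. This is only a sketch of that idea, not the method described in the comment: append_new_mail is a made-up name, and it assumes nothing rewrote the spool between steps 1 and 4 (no deletions, no status-flag updates), so that person.prerm really is person.bak plus new mail.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): append everything in
# $grown that lies past the byte size of $snapshot onto the end of $dest.
# Assumes $grown is $snapshot plus mail appended after the snapshot was taken.
sub append_new_mail {
    my ( $snapshot, $grown, $dest ) = @_;

    my $offset = ( stat( $snapshot ) )[7];
    die "Cannot stat $snapshot: $!" unless defined $offset;

    open( my $in, '<', $grown ) or die "Cannot open $grown: $!";
    binmode( $in );
    seek( $in, $offset, 0 ) or die "Cannot seek in $grown: $!";

    open( my $out, '>>', $dest ) or die "Cannot append to $dest: $!";
    binmode( $out );

    # copy the remainder in 64KB blocks
    my ( $buf, $copied ) = ( '', 0 );
    while ( my $n = read( $in, $buf, 65536 ) ) {
        print {$out} $buf;
        $copied += $n;
    }
    close( $out ) or die "Cannot close $dest: $!";
    close( $in );
    return $copied;    # number of bytes appended
}

# Only runs when three paths are given on the command line, e.g.
#   ./append_new_mail.pl /var/mail/person.bak /var/mail/person.prerm /var/mail/person
if ( @ARGV == 3 ) {
    my $bytes = append_new_mail( @ARGV );
    print "Appended $bytes bytes\n";
}
```

If the two files might have diverged, compare the first few KB and the bytes just before the offset before trusting the result.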
Yes, I KNOW that could be optimized and probably even done in one line (go for it, golfers!) ... it was something I had to do and it wasn't too painful to run (6700 msgs in 2 minutes).
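For anyone who wants to tweak it, here is a minimal sketch of the same chunking idea pulled into a function, so it can be tried on a small in-memory mbox before pointing it at a real spool. The names split_mbox and the $emit callback are made up for illustration; the callback gets the chunk number and the message text, and the caller decides what to do with them (append to a file, keep one handle open per chunk, and so on). Like the script, it treats any line starting with "From " as a message boundary.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read an mbox from the filehandle $in and hand each
# complete message to $emit->( $chunk_no, $message ). A new chunk starts once
# the current one has grown past $max_lines lines. Returns the message count.
sub split_mbox {
    my ( $in, $max_lines, $emit ) = @_;
    my ( $message, $lines, $chunk, $count ) = ( '', 0, 1, 0 );

    my $flush = sub {
        return unless length $message;
        $count++;
        $emit->( $chunk, $message );
        if ( $lines > $max_lines ) {    # chunk is full: start the next one
            $chunk++;
            $lines = 0;
        }
        $message = '';
    };

    while ( my $line = <$in> ) {
        $flush->() if $line =~ /^From /;    # a new message begins
        $message .= $line;
        $lines++;
    }
    $flush->();    # the last message has no "From " line after it
    return $count;
}

# Demo: five 5-line messages, chunk limit of 8 lines
my $mbox = '';
$mbox .= "From user$_\nSubject: test $_\n\nbody $_\n\n" for 1 .. 5;

open( my $in, '<', \$mbox ) or die "Cannot open in-memory mbox: $!";
my %chunks;
my $total = split_mbox( $in, 8, sub { my ( $n, $msg ) = @_; $chunks{$n} .= $msg } );
close( $in );

printf "%d messages in %d chunks\n", $total, scalar keys %chunks;
```

The demo prints "5 messages in 3 chunks": two messages push a chunk past 8 lines, so the fifth message lands alone in the third chunk.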
That's just the way I roll!*
Speaking of coding, Google has their Code Jam going on, but where's the love for Perl? You can program in C++, C#, Java, Python and VB.NET, but not Perl. It probably has to do with what TopCoder supports, but something should really be done to get Perl on that list, for longevity's sake.
Peace,
Jason
* = My new favorite saying
DESCRIPTION
formail is a filter that can be used to force mail into mailbox format,
perform ‘From ’ escaping, generate auto-replying headers, do simple
header munging/extracting or split up a mailbox/digest/articles file.
The mail/mailbox/article contents will be expected on stdin.
Re:formail
Purdy on 2006-08-17T12:50:05
Thanks... it looks like it could do the trick, but on closer examination, formail splits the mailbox into separate messages, not into chunked mbox files of a specified size. That would mean 6700 individual message files vs. 25 mbox files.
- Jason
if ( length( $message ) > 0 ) {
    $message_count++;
    $file = $file_base . sprintf( "%02d", $file_i );
    #print "Got message # $message_count - appending to $file...\n";
    if ( $need_to_write_init ) {
        #write_initial_msg( $file );
        $need_to_write_init = 0;
    }
    open( SPLIT, ">>$file" ) || die "Cannot append to $file: $!";
    print SPLIT $message;
    close( SPLIT );
}
to a function called save_msg and calling it where the code was, and ALSO before
close( MBOX );
Note that I'm ignoring variable scope, since I'm in a hurry, but I shouldn't have.
Feel free to ask if you want me to post my script.
Ely.