Markov Blogger code

jjohn on 2003-09-06T02:06:30

Today's experiement with Markov chains has been really entertaining. I have added a few heuristics to the basic two word markov chain described in The Practice of Programming to handle the vageries of blog entries a bit better. Also, I made walking the chain more random, so that the generated entries are different every time. The markov program presented in the K&P book always begins with the first word found in the data set. This makes for boring blog entries.

The first program presented snarfs blog entries of particular userID from use.perl using the SOAP interface. It creates simple data files that can be easily consumed by the markov chain program.

get_blog

#!/usr/bin/perl --
#
# Get blog entries from use.perl.org of given UID
#

use strict;
use SOAP::Lite;
use Getopt::Std;
use HTTP::Cookies;
use Digest::MD5 'md5_hex';
use Data::Dumper;
use File::Basename;

use constant URI     => "http://use.perl.org/Slash/Journal/SOAP";
use constant PROXY   => "http://use.perl.org/journal.pl";
use constant UID     => 777;
use constant PW      => "s3cr3t";
use constant DOWNLOAD_DIR => './entries';

my $opts = {};
getopts('?hP:U:d:l:u:v', $opts);

if ($opts->{'?'} || $opts->{'h'}) {
  print usage();
  exit;
}

unless ($opts->{'u'}) {
  die "ERROR Missing -u \n", usage(), "\n";
}

# Everything is ready for the SOAP call now
my $cookie_jar = HTTP::Cookies->new;
$cookie_jar->set_cookie(0, 
			user => bakeUserCookie(($opts->{'U'} || UID),
					       ($opts->{'P'} || PW),
					      ),
			"/",
			"use.perl.org");

my $c = SOAP::Lite->uri(URI)->
                    proxy(PROXY, cookie_jar => $cookie_jar);

print "listing entries for UID $opts->{u}\n" if $opts->{v};
my $ret = $c->get_entries($opts->{'u'}, $opts->{'l'});
exit if had_transport_error($ret);
my $rows = $ret->result;

my $tdir = $opts->{d} || DOWNLOAD_DIR;
unless (-d $tdir) {
  print "making download directory ($tdir)\n" if $opts->{v};
  unless (mkdir($tdir, 0700)) {
    die "Can't make target director '$tdir': $!\n";
  }
}

# cut'n'paste code? never
$tdir .= "/$opts->{u}";
unless (-d $tdir) {
  print "making download directory ($tdir)\n" if $opts->{v};
  unless (mkdir($tdir, 0700)) {
    die "Can't make target director '$tdir': $!\n";
  }
}

print "fetching " . @{$rows} . " entries\n" if $opts->{v};
for my $r (@{$rows}) {
  if (-e "$tdir/$r->{id}") {
    print "Already fetched $r->{id}\n" if $opts->{v};
    next;
  }

  print "fetching entry: $r->{id}\n" if $opts->{v};
  my $ret = $c->get_entry($r->{id});
  if (had_transport_error($ret)) {
    warn ("**couldn't fetch entry '$r->{id}'\n") if $opts->{v};
    next;
  }

  my $rec = $ret->result;
  if (open my $out, ">$tdir/$r->{id}") {
    print $out "subject: $rec->{subject}\n";
    print $out "$rec->{body}\n";
    close $out;
  } else {
    warn("**couldn't create file '$tdir/$r->{id}'.  Skipping.\n") 
      if $opts->{v};
    next;
  }
}

print "done\n";

#------------------
# subs
#------------------
sub usage {
  my $base = basename($0);
  return < use.perl password (for AUTH)
  U  use.perl UID (for AUTH)
  d  target directory to hold fetched entries
  l   limit (-1 for all entries)
  u   target UID
  v        verbose messages
EOT
}

sub had_transport_error {
  my ($ret) = @_;

  if ($ret->fault) {
    warn "Oops: ", $ret->faultstring, "\n";
    return 1;
  }

  return;
}

# Thanks Pudge
sub bakeUserCookie {
    my($uid, $passwd) = @_;
    my $cookie = $uid . '::' . md5_hex($passwd);
    $cookie =~ s/(.)/sprintf("%%%02x", ord($1))/ge;
    $cookie =~ s/%/%25/g;
    return $cookie;
}

The really interesting work is done by the next program, which consumes the data and generates a output suitable for posting back to use.perl via the perl scripts I described in my article. Notice that I keep hyperlinks together. Also blockquotes. Some may call this cheating, but I think the effect is more pleasant. I experimented with three work chains, but this wasn't random enough for me. Perhaps I need a larger data set. A future version of this program could use both chains and somehow blend them together for the output. It would be great to see popular three word combinations appear in the generated text, but perhaps I already get that with two word chains. Also note that this script marks my first use of the qr operator. I guess compiled regexes aren't so scary after all.

markov.pl

#!/usr/bin/perl --
# 
# ripped off from _Practice of Programming_, Kernighan & Pike, p. 80
# tweaked about for my nefarious purposes.

use strict;
use Text::Wrap;

use constant CHAIN_SIZE => 3; # or 3
use constant NONWORD => "\n";
use constant DEBUG => $ENV{MARKOV_DEBUG} || 0;

my ($subject, $body) = ({}, {});
my ($s_in, $b_in) = ({},{});

# expand directories as needed
my @tmp;
for my $file (@ARGV) {
  if (-d $file) {
    push @tmp, glob("$file/*");
    warn ("expanding directory '$file'\n") if DEBUG;
  } else {
    push @tmp,$file;
  }
}
@ARGV=();

warn("Randomizing the order of the input files\n") if DEBUG;
# mix up the order of the files a bit
do {
  push(@ARGV, (splice @tmp, rand(@tmp), 1));
  # warn("\@ARGV size: " . @ARGV . " \@tmp size: " . @tmp . "\n") if DEBUG;
} while (@tmp); 

warn("Reading input files\n") if DEBUG;
my (@last_subject_keys, @last_body_keys);
for my $file (@ARGV) {

  while (<>) {
    if (/^subject:/) {
      s/^subject://;
      fill_table(table   => $subject, 
                 state   => $s_in, 
                 line    => $_, 
                 'keys'  => \@last_subject_keys
		);
    } else {
      fill_table(table   => $body, 
                 state   => $s_in, 
                 line    => $_, 
                 'keys'  => \@last_body_keys,
		);
    }
  }
}
end_table($subject, \@last_subject_keys);
end_table($body, \@last_body_keys);

warn("body table: " . (keys %{$body}) . "\n") if DEBUG;

# generate subject
my $subj;
do {
  my $max  = rand(12) + 1;
  $subj = make_chain($subject, $max)
} while (length($subj) > 64);

print "subject: [MarkovBlogger] $subj\n";

my $max  = rand(250) + 120;
print "body: ", make_chain($body, $max), "\n";

#--------------------------------
# subs
#--------------------------------
sub fill_table {
  my (%args) = @_;
  my ($tbl, $in, $line, $keys) = @args{qw(table state line keys)};

  my ($w1,$w2,$w3) = @{$keys};
  unless (defined $w1) {
    $w1 = $w2 = $w3 = NONWORD;
  }

  my $ecode = 'ecode';
  my %delims = ( href       => [ qr!!i ],
                 ecode      => [ qr!<$ecode>!i, qr!!i ],
                 ul         => [ qr!!i, qr!!i ],
                 blockquote => [ qr!!i, qr!!i ],
	       );

 WORD:
  for my $word (split /\s+/, $line) {

    # am I in a special block?
    # can't start a new special until I find the end of previous one
    for my $el (qw(href ecode ul blockquote)) {
      if ($in->{$el}) {
	if ($word =~ /$delims{$el}[0]/) { # end?
	  $word = "$in->{$el} $word";
	  $in->{$el} = "";
	} else {
	  $in->{$el} .= " $word";
	  next WORD;
	}
      } elsif ($word =~ /$delims{$el}[0]/) { # start?
	$in->{$el} = $word;
	next WORD;
      } 
    }

    if (CHAIN_SIZE > 2) {
      push @{$tbl->{$w1}->{$w2}->{$w3}}, $word;
      ($w1, $w2, $w3) = ($w2, $w3, $word); # pull the chain along
    } else {
      push @{$tbl->{$w1}->{$w2}}, $word;
      ($w1, $w2) = ($w2, $word); # pull the chain along
    }
  }

  # assign these keys back into the passed in ref
  return @{$keys} = ($w1, $w2, $w3);
}


sub end_table {
  my ($tbl) = shift(@_);
  my ($w1, $w2, $w3) = @{shift(@_)};
 
  if (CHAIN_SIZE > 2) {
    push @{$tbl->{$w1}->{$w2}->{$w3}}, NONWORD; 
  } else {
    push @{$tbl->{$w1}->{$w2}}, NONWORD;
  }

}

sub make_chain {
  my ($tbl, $size) = @_;

  my $text = "";

  # let's start in a rand point on the chain
  my @w1 = ((keys %{$tbl}), NONWORD);
  my $w1 = $w1[rand(@w1)];

  my @w2 = ((keys %{$tbl->{$w1}}), NONWORD);
  my $w2 = $w2[rand(@w2)];

  my @w3 = (NONWORD);
  if (CHAIN_SIZE > 2) {
    @w3 = ((keys %{$tbl->{$w1}->{$w2}}), NONWORD);
  }
  my $w3 = $w3[rand(@w3)];

  for my $i (0..$size) {
    my $suf;
    if (CHAIN_SIZE > 2) {
      $suf = $tbl->{$w1}->{$w2}->{$w3};
    } else {
      $suf = $tbl->{$w1}->{$w2};
    }

    warn("word1: '$w1'\n\tword2: '$w2'\n") if DEBUG;

    unless (ref $suf) {
      $w1 = $w1[rand(@w1)];
      $w2 = $w2[rand(@w2)];
      $w3 = $w3[rand(@w3)];
      redo;
    }

    my $r = int(rand @{$suf});
    my $t = $suf->[$r];

    if ($t eq NONWORD) {
      warn ("detected the end of the chain (" 
             . (keys %{$tbl})
	     .  ") at $i.  reseting keys\n") if DEBUG;
      $w1 = $w1[rand(@w1)];
      $w2 = $w2[rand(@w2)];
      $w3 = $w3[rand(@w3)];
      next;
    } 

    # there are "unbalanced" braces (close nuff for me)
    if ($t !~ m!\([^\)]*\)!) {  
      $t  =~ s!^\(!!;
      $t =~ s!\)$!!;
      $text .= "$t ";
    }

    if (CHAIN_SIZE > 2) {
      ($w1, $w2, $w3) = ($w2, $w3, $t);
    } else {
      ($w1, $w2) = ($w2, $t);
    }
  }

  # do some goofy clean up
  $text = ucfirst $text;
  chop $text; # final space
 
  # remove stray punctuation
  $text =~ s/[,:]$//;
  if (substr($text, -1, 1) ne '.') {
    $text .= ".";
  }

  $Text::Wrap::columns = 60;
  return wrap("", "", $text);
}

Let's see how annoying this gets. I have a feeling after a week, I'll make this quietly go away. Or expand it into a something truly monstrous. I made several tweaks to the code just trying to post this blog entry.

UPDATE: Thanks to the wonders of Soviet-style revisionism, I have updated this code a bit to remove the uses variables $w1,$w2,$w3 from the main line. Also, I randomize the input file order to make the output less likely to come from the some person's blog. Perhaps I'll bundle this up for CPAN or taskboy.com or something.

Greatness

gizmo_mathboy on 2003-09-06T03:29:04

I bow before the putrescent greatest that is the MarkovBlogger.

There's no filth like autogenerative filth.

Wonderful!

ethan on 2003-09-06T08:30:15

I really like it! Think about it, how much confusion you can create with such a beast. Let him harvest a little in a newsgroup and make postings and see what happens.

The entries created by the MarkovBlogger look very normal and unconspicuous....besides not making any sense, of course!

qr//

jdavidb on 2003-09-09T18:14:38

Isn't qr// wonderful? Sure beats the odd things you can get when you pass regexes around as strings. I just wish lots of old modules would go back and get rid of the idea of treating a string delimited with slashes as a regex and a string without slashes as a literal. Now that we have compiled regexes, that's obsolete.

Of course, I was never one for backward compatibility.