Microsoft Word without Microsoft Word

brian_d_foy on 2002-12-13T06:42:22

Sometimes I need to read Microsoft Word documents, but I do not have Word, or any other word processor. I do not process words really. Indeed, several Word attachments are still in my inbox waiting for me to care enough to extract and download them.

I am a plain text sort of person. I read Word documents with my favorite text editor. I can ignore most of the gibberish and the rest of the content comes out just fine.

I cannot see the images, though. Most of the time this does not matter because the images do not have any information I cannot get from the text. A couple of documents I read last week said "See the formula in Figure N.M", and I really did need to see the formula.

What to do? Buy Word? Yeah, right. Install something that can import Word documents? Too much work. Readjust reality so I do not need to see the formula? Not in this case.

Looking into my bag of tricks I notice Google and Perl. Google tells me Word stores its images as embedded PNG strings. Perl lets me write fancy regular expressions. I think I have a winner.

Why spend money when I can get away with 20 lines of Perl? Read in the data, look for a PNG string, save it to a file, and try again where I left off. Easy peasy.

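use strict;
use warnings;

# Start of the PNG signature, and the IEND chunk type plus its CRC
my $HEADER = "\211PNG";
my $FOOTER = "IEND\xAEB`\x82";

foreach my $file ( @ARGV ) {
    print "Extracting $file\n";
    (my $image_base = $file) =~ s/(.*)\..*/$1/;

    # Slurp the whole document as raw bytes
    open my $in, '<', $file or do { warn "$file: $!"; next };
    binmode $in;
    my $data = do { local $/; <$in> };
    close $in;

    my $count = 0;

    # /s lets . cross newlines, /g picks up where the last match ended
    while( $data =~ m/(\Q$HEADER\E.*?\Q$FOOTER\E)/sg ) {
        my $image = $1;
        $count++;
        my $image_name = "$image_base.$count.png";

        # Warn and move on to the next image if the file will not open
        open my $out, '>', $image_name or do { warn "$image_name: $!"; next };
        binmode $out;

        print "Writing $image_name: ", length($image), " bytes\n";
        print $out $image;
        close $out;
    }
}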
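
Those two marker strings are not magic. A PNG chunk is a four-byte length, a four-byte type, the data, and a four-byte CRC over the type and data. IEND never carries data, so its CRC is always the same four bytes. Here is a small sketch that rebuilds the footer to check it. The sub names iend_marker and png_signature are made up for this illustration, and it assumes Compress::Zlib is available (it ships with Perl 5.10 and later).

```perl
use strict;
use warnings;
use Compress::Zlib qw(crc32);   # assumption: core module, Perl 5.10+

# Hypothetical helper: the IEND chunk type followed by its big-endian CRC-32.
# The CRC covers the chunk type and data; IEND has no data.
sub iend_marker {
    my $type = 'IEND';
    return $type . pack( 'N', crc32($type) );
}

# Hypothetical helper: the full eight-byte PNG file signature.
# The extraction script only matches its first four bytes.
sub png_signature {
    return "\x89PNG\r\n\x1a\n";
}

# Show the marker as hex bytes
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, iend_marker();
print "IEND marker: $hex\n";
```

This should print 49 45 4E 44 AE 42 60 82, which is "IEND" followed by the same four bytes spelled \xAE, B, `, \x82 in $FOOTER.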


The next time I care, I will probably figure out how to extract the captions so I can use those for the file names. I would rather have a name like figure-n.m.png than file.i.png. At the moment I do not care. If I got really fancy I could write a word2html converter.

What's next? Reading Excel files with more, but that's easy with Spreadsheet::ParseExcel.