Extracting from .DOC

jdavidb on 2006-02-28T16:53:36

I have received two revisions (and may ultimately receive more) of a Word document specification (and I use that term loosely). The main part of this document I'm concerned with is a series of tables. I am extracting these tables into an Excel spreadsheet by cutting and pasting. I then process the spreadsheet with a custom Perl program that spits out YAML, SQL DDL, and a couple of other important goodies.

Obviously I'd like to eliminate the cut and paste part of this process. Besides being something I just don't want to do, it is error prone, slow, and difficult to consistently replicate.

Does anyone know of a way I can automate this extraction process? I'm willing to consider any language, if necessary, though of course I prefer Perl. I'm also willing to consider intermediate formats, such as converting to OpenOffice, AbiWord, or whatever. (My Excel spreadsheet is already an intermediate format.) I'd like any such conversions to be automatable as well, but even if I had to convert manually and then extract, it would still shrink the human-driven, error-prone, unreplicable part of this process by at least an order of magnitude.

Incidentally, I have reason to believe that the .DOC I'm receiving was converted by someone else from .PDF. She hasn't shared details with me on what software she used to accomplish that, but I'd like to learn that feat too, if anyone knows. I'd also be interested in learning to program this extraction directly from .PDF, if it's even possible.


Antiword may be of some help

Phred on 2006-02-28T18:11:47

I've used antiword in the past for reading MS Word docs, but I don't know how well it reads tables. You might want to give it a try.

Re:Antiword may be of some help

jdavidb on 2006-02-28T18:28:33

Thank you! It looks like antiword converts to XML and/or DocBook, so maybe I can go that route. It says the support is still experimental, but I'll check it out. Even if it doesn't work today, it may work at some point in the future.

Re:Antiword may be of some help

jdavidb on 2006-02-28T18:35:02

Awesome!!! This is entirely feasible! Thank you!

The tables come out as elements called <informaltable>. I can parse that XML, extract those, and convert them. In fact it looks like this is better than going to Excel, because going to Excel produces several "phantom" blank cells which I have to ignore in my current program.
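
A minimal sketch of that extraction step, using only core Perl. It assumes the tables follow the standard DocBook layout of <row> and <entry> elements inside each <informaltable>, that tables are not nested, and that cells contain no CDATA sections. It is a quick regex scan rather than a real XML parse, so a proper parser such as XML::LibXML or XML::Twig would be the sturdier choice. The names extract_tables and clean_cell are made up for this illustration.

```perl
use strict;
use warnings;

# Return every <informaltable> found in a DocBook string as a list of
# tables; each table is an array ref of rows, each row an array ref of
# cell strings. Assumption: tables are not nested inside each other.
sub extract_tables {
    my ($xml) = @_;

    # Rewrite self-closing <entry/> as an open/close pair so the
    # entry pattern below sees empty cells too.
    $xml =~ s{<entry\b([^>]*)/>}{<entry$1></entry>}g;

    my @tables;
    while ($xml =~ m{<informaltable\b[^>]*>(.*?)</informaltable>}gs) {
        my $table_xml = $1;
        my @rows;
        while ($table_xml =~ m{<row\b[^>]*>(.*?)</row>}gs) {
            my $row_xml = $1;
            my @cells;
            while ($row_xml =~ m{<entry\b[^>]*>(.*?)</entry>}gs) {
                push @cells, clean_cell($1);
            }
            push @rows, \@cells;
        }
        push @tables, \@rows;
    }
    return @tables;
}

# Strip inline markup, decode the five predefined XML entities, and
# collapse runs of whitespace.
sub clean_cell {
    my ($text) = @_;
    $text =~ s/<[^>]+>//g;      # drop inline tags such as <para>
    $text =~ s/&lt;/</g;
    $text =~ s/&gt;/>/g;
    $text =~ s/&quot;/"/g;
    $text =~ s/&apos;/'/g;
    $text =~ s/&amp;/&/g;       # last, so "&amp;lt;" is not decoded twice
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;
    return $text;
}
```

To try it, the DocBook can be produced with something like `antiword -x db spec.doc > spec.xml` (spec.doc is a placeholder file name), read into a string, and passed to extract_tables; each returned table can then be walked row by row and fed to whatever emits the YAML or DDL.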

I'm not sure if I'm going to have to do this specific file again, but there's a good chance I might, and if I do I will attempt to program this process. If I don't for this file, I know I will for another. So at some point there may be a table extractor utility available for everyone to use.

Re:Antiword may be of some help

Phred on 2006-02-28T18:43:05

Glad it's working out for you. I haven't used antiword in over a year but it was very helpful when I needed it.

Re:Antiword may be of some help

Ron Savage on 2006-03-02T05:08:37

See also: http://wvware.sourceforge.net/

Win32::OLE

malte on 2006-02-28T18:44:41

If you're on Win32, Win32::OLE could help you.

Re:Win32::OLE

jdavidb on 2006-02-28T19:38:09

Thanks for the pointer. Maybe I can do this entirely in pure Perl, and drop any intermediate file formats. :)

Win32::OLE + XML/HTML

dami on 2006-03-01T07:47:31

Maybe you could save your doc as XML or HTML, and then parse the result with your favorite XSLT or regex tool. Something along these lines:

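use strict;
use warnings;
use Win32::OLE;

# WdSaveFormat values from the Word object model
use constant wdFormatHTML => 8;
use constant wdFormatXML  => 11;

my ($src_name, $target_name) = @ARGV;    # full paths are safest with Word
my $msword = Win32::OLE->new("Word.Application", "Quit")
    or die "Cannot start Word: " . Win32::OLE->LastError;
my $doc = $msword->Documents->Open($src_name)
    or die "Cannot open $src_name: " . Win32::OLE->LastError;
$doc->SaveAs($target_name, wdFormatXML);
$doc->Close(0);    # 0 = wdDoNotSaveChanges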

Re:Win32::OLE + XML/HTML

jdavidb on 2006-03-01T18:13:19

Thank you for the concrete example. That looks like it may work very well.