License to Process Words

grantm on 2007-09-25T09:06:49

I've just finished converting one MS Word document of about 150 pages to a group of about a dozen HTML pages. What a nightmare!

Last time I had to do this, the output was closer to 100 HTML files with very little formatting so I ended up scripting much of it. I loaded the document into Open Office and saved as ODT. Then I used XPathScript to spit out a series of very plain HTML files. With the site stylesheet applied they looked very smart.

This time, the document structure didn't really lend itself to scripting and there was more formatting that I wanted to preserve (e.g. headings, bullet lists, simple tables). So I used 'Save as Web Page' from Word and then did most of it manually with Vim.

The HTML that Word produced was unspeakably vile: all sorts of illegal constructs (e.g. a <p> inside a <span> inside another <p>!); enormous sections of proprietary markup inside comment markers; kilobytes of unnecessary attributes (align="left" on every <p>); and invented markup tags (e.g. <place> and <placetype>).

I was able to strip out much of the cruft with LibXML/XPath/DOM manipulations and some search and replace regexes in Vim. But the result was still gruesomely awful. Much of it came down to operator error on the part of whoever typed the document:
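For illustration only, here is a rough Python sketch of the kind of search-and-replace pass described above. It is not the LibXML/XPath code actually used; the function name strip_word_cruft and the three patterns are assumptions that target only the cruft named in this post (comment blocks, align="left" attributes, and the invented <place> and <placetype> tags). Regexes are a blunt tool for HTML, so treat it as a first pass before a real parser.

```python
import re

# Hypothetical patterns, covering only the cruft mentioned in the post.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)          # proprietary markup hidden in comments
ALIGN_RE = re.compile(r'\s+align="left"', re.IGNORECASE)   # redundant attribute on every <p>
INVENTED_TAG_RE = re.compile(r"</?(?:place|placetype)\b[^>]*>", re.IGNORECASE)


def strip_word_cruft(html: str) -> str:
    """Remove comment blocks, align="left" attributes and invented tags.

    The text inside the invented tags is kept; only the tags are dropped.
    """
    html = COMMENT_RE.sub("", html)
    html = ALIGN_RE.sub("", html)
    html = INVENTED_TAG_RE.sub("", html)
    return html


if __name__ == "__main__":
    sample = (
        '<p align="left">Visit <place>Wellington</place> '
        "<placetype>City</placetype></p>"
        "<!--[if gte mso 9]><xml>junk</xml><![endif]-->"
    )
    print(strip_word_cruft(sample))
```

This does nothing about the nested <p>/<span> problem, which really needs a DOM-level fix.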

  • every alternate paragraph empty, to provide vertical whitespace
  • strings of empty paragraphs to push content onto a new page
  • 'tables' of data with columns aligned using spaces
  • 'bulleted lists' created by inserting a bullet character at the start of each line
  • vast sections of body text in 'Heading 2' style with the font overridden to give normal looking text
  • most actual headings rendered using bold and/or font changes
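Two of the habits listed above (empty spacer paragraphs and hand-typed bullets) are mechanical enough to fix in code. The following is a minimal Python sketch, assuming the paragraph text has already been extracted into a list of strings; the function name paragraphs_to_html and the choice of bullet characters are hypothetical, not anything from the actual conversion.

```python
import html

# Assumed bullet characters: a real bullet and the middle dot often typed in its place.
BULLETS = ("\u2022", "\u00b7")


def paragraphs_to_html(paragraphs):
    """Drop empty paragraphs and turn fake bulleted lines into a real <ul>."""
    out = []
    in_list = False
    for text in paragraphs:
        text = text.strip()
        if not text:
            continue  # spacer paragraph used only for vertical whitespace
        if text.startswith(BULLETS):
            if not in_list:
                out.append("<ul>")
                in_list = True
            # Drop the typed bullet character and keep the item text.
            out.append("<li>" + html.escape(text[1:].strip()) + "</li>")
        else:
            if in_list:
                out.append("</ul>")
                in_list = False
            out.append("<p>" + html.escape(text) + "</p>")
    if in_list:
        out.append("</ul>")
    return out


if __name__ == "__main__":
    doc = ["Intro", "", "\u2022 one", "", "\u2022 two", "", "", "Outro"]
    print("\n".join(paragraphs_to_html(doc)))
```

The misused 'Heading 2' style and the space-aligned 'tables' are harder, because telling real headings and real columns apart takes human judgement.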

As one of my colleagues commented, people should not be allowed to use a word processor without a license.

It can't all be blamed on users though. The user interfaces of Word and Open Office are absolutely awful. They make it far too easy to do the wrong thing, by jamming the screen full of toolbar buttons and menus. Conversely they make it hard to do the right thing by hiding the style selection tool in amongst all that visual clutter. Each new release over the years seems to have made the problem worse. What these interfaces need is less, not more.


Dreamweaver++

Alias on 2007-09-26T01:21:36

Dreamweaver has this great little menu entry called "Clean up Word HTML".

It's not perfect, but generally when I need to do Word cleanup, I run it through that filter first, then do the rest by hand from there.

HTML Tidy++

ajt on 2007-09-26T11:42:38

There is also HTML-Tidy, which is available as a plug-in to many GUI HTML editors, e.g. HTML-Kit or Quanta+. It's a small C library with a small executable for command-line use, and there is of course a Perl module based on it, HTML::Tidy.
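As a rough sketch of the command-line route, the Python below shells out to the tidy executable. The helper names tidy_command and run_tidy are made up for this example, and the choice of options (word-2000 and drop-proprietary-attributes) is an assumption about what suits Word output; check them against the documentation for your installed version of tidy.

```python
import shutil
import subprocess


def tidy_command(src, dest):
    """Build a tidy command line aimed at Word-generated HTML (assumed options)."""
    return [
        "tidy",
        "-q",                                    # suppress non-essential output
        "--word-2000", "yes",                    # strip Word-specific surplus markup
        "--drop-proprietary-attributes", "yes",  # remove proprietary attributes
        "-o", dest,                              # write cleaned HTML here
        src,
    ]


def run_tidy(src, dest):
    """Run tidy and return its exit status (0 = ok, 1 = warnings, 2 = errors)."""
    if shutil.which("tidy") is None:
        raise RuntimeError("tidy executable not found on PATH")
    return subprocess.run(tidy_command(src, dest)).returncode
```

A call such as run_tidy("word-export.html", "clean.html") would then write the cleaned file; both file names are placeholders.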

As much as I loathe and detest Microsoft products, what scares me even more is that people who love Microsoft have no idea how to use their products at all. It's hardly surprising that most Windows machines are infected with something or that when I receive an MS Office file from someone it's almost quicker to retype it from scratch than it is to use it as is...