Document Conversion

davorg on 2004-08-18T08:19:22

This is just thinking aloud about a discussion that came up in conversation in the office yesterday. I doubt I'll have any time to do anything about it at work so I might play with a few ideas in my spare (?) time.

We're geeks. Specifically most of us are Unix geeks (even the iGeeks are Unix geeks now). And Unix geeks like text files. We don't use a word processor unless someone is holding a gun to our head. We're far happier using our favourite text editor to create POD or DocBook or something like that. I'm sure I don't need to explain the advantages of text files over proprietary binary formats.

But we need to interact with the rest of the company. And the rest of the company like Word documents. The very idea of reading a plain text document fills them with the deepest dread.

That's not a problem. We can create a document in POD, use Pod::DocBook to convert it to DocBook and then use one of the db2foo tools to convert it into something they can read in Word (probably RTF I guess).

But it's not quite that simple. The non-techs like their Word docs to have a certain look. They create templates for different types of document that define fonts, header styles, required sections, watermarked logos and things like that. And my auto-generated RTF file won't have all of that.

Until now I've got round that by creating the RTF file, opening it in OpenOffice and applying the formatting from another document that was created using the template. But this is a soul-destroying (not to mention error-prone) activity.

So what I'm thinking that we need (well, maybe "need" is overstating the case a bit) an application that somehow parses a Word template file, extracts all of the formatting information and builds a file (it's probably DSSSL) that db2rtf can use to create an RTF file containing all the correct formats. Or maybe it just creates an XSLT file that transforms DocBook to the MS Word XML format. Or something like that.

But like I said, I'm just thinking aloud here. Does anyone know if anything like this already exists? Or have any idea on where I might start? Does MS publish the format of its .dot files anywhere?

Or would I be completely wasting my time?


Go via OpenOffice

Matts on 2004-08-18T08:42:28

If you can convert the template to an OO.o template then it's just an XML file (and a fairly well laid out one too - I wrote an article on XML.com about it ages ago). From there it's cake.

Ignore DSSSL and RTF

ziggy on 2004-08-18T14:37:19

I've done a lot of work generating RTF with DSSSL. This is not a path you want to follow. The output is generally good enough for many uses, but does not reach into the more complex formatting you are looking to achieve.

First of all, if you're talking DSSSL, you're talking OpenJade, which is a large package, somewhat difficult to build, and very difficult to troubleshoot unless you have some background in SGML arcana, or wish to acquire it. It's like expending the effort to master Victorian era metallurgy -- wonderful if you want to maintain an old engine, but not the best path to pursue if your goal is to ship cargo today.

Second, the RTF "specification" gets updated about once per release of MSWord. We're up to 1.7 right now. If I had to guess, the RTF backend for OpenJade was done in the era of RTF 1.4 or earlier. So whatever new features got added into RTF since the backend was updated simply cannot be generated from a DSSSL toolchain. I don't know what these features would be, so this is possibly just FUD.

Third, the RTF output that OpenJade generates is a small (but useful) subset of whatever level of RTF it implemented. Plus, there are complications between different interpretations of what a page model should be. For example, I think that DSSSL can handle nested tables, but RTF (or the RTF backend) cannot generate them properly.

And there are many RTF features that are not accessible from the DSSSL side of the fence -- things like annotating a block of text as an RTF style. In the DSSSL world, styles are resolved as the page goes into the back end. So if you have a style called "Heading 1" that renders as 36pt Helvetica, the output is specified as 36pt Helvetica, not symbolically as "the style named 'Heading 1'".

 

So while an RTF generator may be useful, DSSSL isn't the best tool for the job. I'd look at RTF::Writer , which can touch the entire (modern) RTF spec, or be made to do so much more easily. And you don't need to learn/use Scheme to get there. ;-)

simpler format ?

tinman on 2004-08-18T18:12:59

Word (recent incarnations of it, at any rate) do HTML fairly well. They certainly have no problem opening HTML documents. Would that fit ? You could do headers, footers, watermarks etc etc with HTML. Heck, just create a blank document, export to HTML, run HTMLTidy on it and you have your own HTML template to fill in with text.

I was also going to suggest PDF, because a few decent toolchains exist for generating fancy PDF; but editing it would pose a few problems (unless everyone has Acrobat installed or something)

I've used RTF::Writer a bit. It's a very nice module for what it's supposed to do, but if your non-techie types are like most I've encountered, some of their formatting tricks are going to be really hard to reproduce.