Guillaume Laforge

Converting a Word document to HTML

If you’re on Windows and need to do some shell scripting with ActiveX/COM components, Scriptom will certainly help you. Today, my boss asked me whether we could use Scriptom to convert a Word document into an HTML equivalent, so I decided to see if that was possible. To my delight, my little Scriptom module, backed by Jacob, helped me solve this integration problem with only 6 lines of Groovy code!

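import org.codehaus.groovy.scriptom.ActiveXProxy
import java.io.File

// Start Word through COM
word = new ActiveXProxy("Word.Application")
// Open the source document; canonicalPath hands Word an absolute path
word.Documents.Open(new File(args[0]).canonicalPath)
// Save it in the HTML format (8 = wdFormatHTML)
word.ActiveDocument.SaveAs(new File(args[1]).canonicalPath, 8)
// Close Word so that no hidden Word process is left running
word.Quit()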

Now, I just need to launch:

groovy word2html.groovy specification.doc specification.html
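The script assumes that both arguments are always given. As a minimal sketch of slightly more forgiving argument handling, the helper below (resolvePaths is a hypothetical name of mine, not something Scriptom provides) derives the HTML file name from the Word file name when the second argument is missing. It is plain Groovy and does not touch COM at all:

```groovy
// Hypothetical helper, not part of Scriptom: turns the command-line
// arguments into absolute input and output paths.
def resolvePaths(String[] args) {
    if (args.length < 1) {
        throw new IllegalArgumentException(
            "usage: groovy word2html.groovy <input.doc> [output.html]")
    }
    File input = new File(args[0]).canonicalFile
    File output
    if (args.length > 1) {
        // An explicit output file was given
        output = new File(args[1]).canonicalFile
    } else {
        // Default: same directory and base name as the input, .html extension
        String name = input.name
        int dot = name.lastIndexOf('.')
        String base = dot > 0 ? name.substring(0, dot) : name
        output = new File(input.parentFile, base + '.html')
    }
    return [input.path, output.path]
}

// Assumed usage inside word2html.groovy:
//   paths = resolvePaths(args)
//   word.Documents.Open(paths[0])
//   word.ActiveDocument.SaveAs(paths[1], 8)
```

With that in place, launching the script with only specification.doc would write specification.html next to it.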

And I’ve got a nice Word to HTML converter! Well… I know, not that nice. First of all, it’s a Windows-only solution, though that fits my requirements for the platform I’m running on. The other drawback is that the generated HTML is really, really ugly. I wonder why Microsoft can’t produce cleaner output. For the moment, though, I’m happy with this solution.

You probably noticed the magic number 8 passed as the second argument to SaveAs. It’s the HTML format option. The available formats (Word’s WdSaveFormat values) are:

  • 0: wdFormatDocument (no conversion)
  • 1: wdFormatTemplate
  • 2: wdFormatText
  • 3: wdFormatTextLineBreaks
  • 4: wdFormatDOSText
  • 5: wdFormatDOSTextLineBreaks
  • 6: wdFormatRTF
  • 7: wdFormatUnicodeText
  • 8: wdFormatHTML

I haven’t yet figured out how to use these constants directly in Groovy. I’ll have to make Scriptom grok M$’s constants.
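In the meantime, one possible workaround is to spell the constants out yourself. This is only a sketch: the map name wdSaveFormat is my own choice, the values are simply copied from the list above, and nothing here is read from Word or provided by Scriptom.

```groovy
// Hand-written stand-in for Word's save format constants.
// Values copied from the list in this post, not looked up through COM.
def wdSaveFormat = [
    wdFormatDocument:          0,
    wdFormatTemplate:          1,
    wdFormatText:              2,
    wdFormatTextLineBreaks:    3,
    wdFormatDOSText:           4,
    wdFormatDOSTextLineBreaks: 5,
    wdFormatRTF:               6,
    wdFormatUnicodeText:       7,
    wdFormatHTML:              8
]

// Assumed usage in the conversion script, replacing the magic number:
//   word.ActiveDocument.SaveAs(new File(args[1]).canonicalPath, wdSaveFormat.wdFormatHTML)
```

It doesn’t remove the need for proper constant support, but at least the call to SaveAs says what it means.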

This example was tested with groovy-beta-9, Word 2000 and my additional Scriptom module for Groovy (don’t forget to install it if you want to try the sample).