Make Word-to-HTML less sucky?

Mercutio

Fatwah on Western Digital
Joined
Jan 17, 2002
Messages
22,232
Location
I am omnipresent
My life would be much easier if I could find a tool that de-suckifies Word-created HTML pages.

I know such tools exist, and I don't particularly care what OS those tools need, but my searches so far have just turned up tools that remove private information from .doc files.

Anyone know what I'm talking about?
 

Mercutio

Fatwah on Western Digital
Joined
Jan 17, 2002
Messages
22,232
Location
I am omnipresent
Oh my, how that helped. I made this in about five minutes, which is much better than the 40 minutes I probably would've spent screwing around with image placement.
 

timwhit

Hairy Aussie
Joined
Jan 23, 2002
Messages
5,278
Location
Chicago, IL
You might consider something like Dreamweaver if you need a simple WYSIWYG editor.

I assume your document starts as a word document and you want to convert it to an html doc or something similar.
 

Mercutio

Fatwah on Western Digital
Joined
Jan 17, 2002
Messages
22,232
Location
I am omnipresent
I make stuff for my classes in Word, yes (requirement of the job). Extracting all the text from Word is easy enough, but Word's save-as-HTML does awful things with images (and we all know about the stupidity of Microsoft's generated HTML), and re-creating screenshots and the like is annoying to me, so I'm doing a "save as HTML" to get a nice directory full of images from those documents regardless.

I'm weighing the choice between doing a "save as HTML" and stripping out all the stupid stuff from Word's HTML, or copy/pasting things out of the document and spending more time formatting images and layout than I want to spend. I do actual HTML editing with Nvu.

A lot of the things I do make use of text frames in Word, which are obnoxious to recreate with, oh, anything. The fastest conversion I've found for those types of things is Word to PDF (which "fixes" the textboxes either as part of the image they're anchored to or as separate images) to HTML.
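
For the "save as HTML and strip out the stupid stuff" route, here is a minimal sketch in Python. It assumes the clutter looks like typical Word "Save as Web Page" output: conditional comments, `o:`/`v:`/`w:` namespace tags, `class=Mso...` attributes and inline `style` attributes. The function name `clean_word_html` and the sample markup are made up for illustration. Regular expressions on HTML are fragile, so treat this as a starting point rather than a finished cleaner.

```python
import re

def clean_word_html(html: str) -> str:
    """Strip common Word-specific clutter from a saved-as-HTML page (sketch)."""
    # Hidden conditional comments, e.g. <!--[if gte mso 9]> ... <![endif]-->
    html = re.sub(r"<!--\[if.*?<!\[endif\]-->", "", html, flags=re.S | re.I)
    # Revealed conditional markers, e.g. <![if !vml]> and <![endif]>.
    # Only the markers go; the <img> they usually wrap is kept.
    html = re.sub(r"<!\[(?:if[^\]]*|endif)\]>", "", html, flags=re.I)
    # Office namespace tags such as <o:p>, </o:p>, <v:shape ...>, <w:...>
    html = re.sub(r"</?[ovwm]:[^>]*>", "", html, flags=re.I)
    # class=MsoNormal, class="MsoBodyText" and friends
    html = re.sub(r"""\s+class\s*=\s*(?:"Mso[^"]*"|'Mso[^']*'|Mso\w+)""",
                  "", html, flags=re.I)
    # Quoted inline style attributes
    html = re.sub(r"""\s+style\s*=\s*(?:"[^"]*"|'[^']*')""", "", html, flags=re.I)
    return html

if __name__ == "__main__":
    sample = ('<p class=MsoNormal style="margin:0in">'
              '<![if !vml]><img src="doc_files/image001.png"><![endif]>'
              'Hello<o:p></o:p></p>')
    print(clean_word_html(sample))
```

The image references survive, so the directory of images that Word exports can be reused as-is.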
 

sechs

Storage? I am Storage!
Joined
Feb 1, 2003
Messages
4,709
Location
Left Coast
Dreamweaver used to have a filter that cleaned out a lot of the stupid tagging and CSS crud that Word would create. I'm not sure if it's still there.
 

Sol

Storage is cool
Joined
Feb 10, 2002
Messages
960
Location
Cardiff (Wales)
You might find that making your screenshots .gif files will actually give you better results for the same size. If you have a copy of Photoshop, Save for Web is a nice feature that lets you figure out what format and settings will work best for a particular image.

I guess that's a lot more work than just letting Word do it, but you have to do the screen capture in the first place anyway, and if you use Save for Web before inserting the image, Word might keep the format when exporting.
 

Mercutio

Fatwah on Western Digital
Joined
Jan 17, 2002
Messages
22,232
Location
I am omnipresent
The larger issue, Sol, is that I don't want to muck around with Images in the first place. They're useful and probably necessary, but seeing as I find making HTML documents slightly less exciting than a second viewing of the pre-recorded version of "Watching Paint Dry", I'd rather not deal with it.

HTML has comfortably been part of the landscape of documents for a dozen years now, and I'm not terribly keen on futzing about with laying out web pages when it's not even tertiary to my job or hobbies.
 

Handruin

Administrator
Joined
Jan 13, 2002
Messages
13,916
Location
USA
I know this tool fit the bill this time around, but have you considered setting up a wiki on your site (assuming you plan on adding more goodness like this)? You could lock it down so only you can modify it. A wiki would make for a seemingly easier way to publish documents online.
 

Mercutio

Fatwah on Western Digital
Joined
Jan 17, 2002
Messages
22,232
Location
I am omnipresent
I've thought about it, and I think wiki is even one of those things I can one-click install out of Dreamhost's control panel - but in theory, the design I have should be just about perfect for my needs. I'm really not trying to build a community or a dynamic document with any of these things, just put up some information that I've already compiled.

Things I would like to do, that I haven't done, are to figure out a better way to fully translate the Word documents I've made - I have a library of a couple dozen things I've made for classes, from Home Networking to backing up DVDs to assembling a computer, and another set of notes for certification classes (224 pages on Microsoft's 70-270 exam, for instance)... anyway, I have all this stuff in Word, and almost all of it has at least some images, and there's really no good way to actually preserve all the formatting and images, and to post them intact. Beyond that, writing some Perl or something to tack on the tiny extra bit of PHP and HTML that makes up my "template" web page would be fairly straightforward.

That's my frustration. I have a lot of decent, premade content, and I can't stand to just recreate it for the web, and I don't really see a good way to automate fixing it.
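
The "tack on the template" step could look something like the sketch below, written in Python rather than Perl. The actual template is not shown in this thread, so `HEADER`, `FOOTER`, the `header.php`/`footer.php` includes and the function names are all hypothetical placeholders. It also assumes the cleaned page bodies are already saved as UTF-8 `.html` fragments in one folder.

```python
from pathlib import Path

# Hypothetical template pieces; swap in the real PHP/HTML wrapper.
HEADER = """<?php include 'header.php'; ?>
<div class="content">
"""
FOOTER = """</div>
<?php include 'footer.php'; ?>
"""

def wrap_in_template(body_html: str) -> str:
    """Put the (assumed) template header and footer around one page body."""
    return HEADER + body_html.strip() + "\n" + FOOTER

def convert_folder(src: Path, dest: Path) -> list:
    """Wrap every *.html fragment in src and write it to dest as *.php."""
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    for page in sorted(src.glob("*.html")):
        out = dest / (page.stem + ".php")
        # Assumes UTF-8 input; raw Word output is often windows-1252 instead.
        out.write_text(wrap_in_template(page.read_text(encoding="utf-8")),
                       encoding="utf-8")
        written.append(out)
    return written
```

Calling `convert_folder(Path("cleaned"), Path("site"))` would process a whole folder in one go; both folder names are just examples.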
 

Sol

Storage is cool
Joined
Feb 10, 2002
Messages
960
Location
Cardiff (Wales)
A brief look at using OpenOffice suggests that it does a substantially less horrific job of converting Word documents to HTML than Word does. It still doesn't seem quite as readable as hand-written HTML, but it would be worth a try.

I've always found when putting pictures into a Word document that it's been worth the time processing and saving the image separately and then importing it into Word afterwards. Otherwise Word seems to just make everything a slightly customised bitmap or something, and the result is a small document with a couple of pictures that takes up a couple of MB.

Although that has probably been improved more recently, it's been a while since I've tried to put images into Word files at all...
 

Will Rickards

Storage Is My Life
Joined
Jan 23, 2002
Messages
2,012
Location
Here
Website
willrickards.net
Why not just upload the actual word documents?
Or even just pdf versions?
Not everything has to be html.
Especially a spyware removal guide that you probably want to either print or say download and open while going through it.
 

Tannin

Storage? I am Storage!
Joined
Jan 15, 2002
Messages
4,448
Location
Huon Valley, Tasmania
Website
www.redhill.net.au
Really, really, really bad idea, Will. We all know the bad things about PDF files - "PDF" really ought to stand for Printable Document Format, not Portable Document Format, as it is really only any good for documents destined for the printer, not the screen. There are a million studies out there to demonstrate that users hate PDF and actively try to avoid PDF documents where they can. Huge file sizes and slow downloads, horizontal scrolling is almost inevitable if you set the print size to something you can read without glasses, loss of the back and forward navigation buttons, and other stuff I'm too lazy to enumerate just now but can be found in any of the usability docs you'll find easily enough if you go looking. PDF is absolute last resort stuff. (Unless you are operating in the print world, where it works very well.)

As for Word format, you are joking, right?
 

Tannin

Storage? I am Storage!
Joined
Jan 15, 2002
Messages
4,448
Location
Huon Valley, Tasmania
Website
www.redhill.net.au
Merc, after playing with that example you posted in the other thread, it seems to me that if your other documents are broadly similar then the task isn't too difficult. All you need is to have something (I dunno what - possibly just a save-as-plain-text and a text-to-HTML converter followed by a half-dozen regular expressions in your favourite text editor) put bare-bones HTML around the paragraphs and pictures and so on and then tie it into a suitable style sheet. You can probably figure out how to do this once and then rinse, lather and repeat.
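
A rough sketch of that save-as-plain-text approach, in Python. It assumes paragraphs are separated by blank lines and that pictures are marked by hand with a line such as `[image: shot1.png]`. That marker, the `figure` class, the function name `text_to_html` and the default `style.css` are invented for this example and are not anything Word produces.

```python
import html
import re

def text_to_html(text: str, title: str, stylesheet: str = "style.css") -> str:
    """Wrap blank-line-separated paragraphs in bare-bones HTML (sketch)."""
    text = text.replace("\r\n", "\n")  # normalise Windows line endings
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    parts = []
    for block in blocks:
        # Hypothetical convention: "[image: file.png]" alone marks a picture.
        m = re.fullmatch(r"\[image:\s*(\S+)\]", block)
        if m:
            src = html.escape(m.group(1), quote=True)
            parts.append(f'<p class="figure"><img src="{src}" alt=""></p>')
        else:
            # Join wrapped lines and escape <, > and & in the text.
            parts.append("<p>" + html.escape(" ".join(block.split())) + "</p>")
    body = "\n".join(parts)
    return (
        "<!DOCTYPE html>\n<html>\n<head>\n"
        f"<title>{html.escape(title)}</title>\n"
        f'<link rel="stylesheet" href="{html.escape(stylesheet, quote=True)}">\n'
        "</head>\n<body>\n"
        f"{body}\n"
        "</body>\n</html>\n"
    )
```

All the layout decisions then live in the style sheet, so the same function can be rerun over every document.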
 

sechs

Storage? I am Storage!
Joined
Feb 1, 2003
Messages
4,709
Location
Left Coast
Tannin said:
There are a million studies out there to demonstrate that users hate PDF and actively try to avoid PDF documents where they can.

Can you cite a couple hundred of these? I have someone who might be interested.
 

Will Rickards

Storage Is My Life
Joined
Jan 23, 2002
Messages
2,012
Location
Here
Website
willrickards.net
Tannin said:
As for Word format, you are joking, right?

No, not at all. And for this purpose, a reference document, I don't think it is a bad idea. As I see it, here is the situation. Merc has some pretty good documents that he'd like to share. Turning them into semi-decent HTML is turning out to be difficult and too time-consuming for him to even post the content. Honestly, I'd rather see the content first. Maybe some of us web developers around here could work on formatting it and coming up with a stylesheet.
 

Tannin

Storage? I am Storage!
Joined
Jan 15, 2002
Messages
4,448
Location
Huon Valley, Tasmania
Website
www.redhill.net.au
What's the very first thing you teach people about web security? Don't download stuff that could contain a virus! Putting a how-to-be-more-secure document in a potentially virus-infected form (such as MS Word) is directly against everything you are trying to teach people.

In any case, Word documents do not display as well on the screen as a made-for-the-purpose format (such as HTML), and you get all sorts of weird formatting errors. (OK, you won't, because you know Word the way Tea knows bananas, but any normal non-expert human doesn't have your skill.) For example, some friends sent me some Word-format stuff the other day for me to turn into a web page for them. I couldn't make head or tail out of it and had to ring them up to figure out what sort of general look and feel they had in mind. On my screen, it was a total mish-mash. Why? Turns out that they had several fonts installed that I didn't have.

But wait, there is more. No, not steak knives, download size. Word documents are vastly bigger than equivalent HTML documents. Word documents, if you are not careful, expose you to the risk of disclosing information you didn't want to make public. And finally, only people running the appropriate Microsoft software can read them. There is a host of reasons not to use Word, and not one single valid reason to use Word - discounting that the original happens to be in Word format right now. If the original was in Swahili instead of English, would we publish it in Swahili? Hell no. And bear it in mind that for many users, a Word document might as well be in Swahili, 'cause they can't read it either way.

Turn it into HTML? For sure. As you will know from the other thread (something random in the Pub and Brewery) I've already done the hard part, Will, stripping out the Word-generated junk, leaving just the stuff that's useful. Maybe tonight, if I get time, I'll do a bit more, play with the CSS a bit, lay the pictures out, stuff like that. Or perhaps you or Buck or Handy or Tim might take a hand. I'm sure that any of us could do it. (But it would be a good idea, if you or anyone else starts on that, to sing out so that we don't get two people spending the same hour doing the same thing twice.)
 

Handruin

Administrator
Joined
Jan 13, 2002
Messages
13,916
Location
USA
I'd offer to help convert a couple if you want.

My point about the wiki wasn't for you to build a community or to have large amounts of interlinking dynamic content. I meant it more for your own needs, or maybe for a second or third author down the road. This will allow it to grow at your own pace, and allow for it to be web-driven. I find it easier to use a tool that offers a restricted set of formatting features so that you can keep everyone on the same page. A couple dozen documents seems like a decent case for a web-driven tool like a wiki. Especially one that you might refer links to...just my $.02.

If you want, I'd offer to host the wiki. I can try to incorporate it into this site so that we have a means to create documents like yours and other things for our common discussions. I'd even offer to help convert/proof some of them into it. Anyway, the offer is there. If you don't want to, I'll stop pushing. :D
 

Mercutio

Fatwah on Western Digital
Joined
Jan 17, 2002
Messages
22,232
Location
I am omnipresent
Wiki is something to think about, particularly for documents beyond a certain length.

PDF worked pretty well too. The PDF version of the Word source file is about 60% smaller than .DOC, and layout obviously isn't an issue. PDFs can be indexed and searched by search engines - probably a good thing.
On the other hand, I'd call it a nonstandard format since it doesn't display natively in a browser. Not everyone has Acrobat installed. It might not be as big a crime as Flash, but I've known people who wouldn't install it regardless.
 