It's probably well-known (well, one hopes) that outputting a document to HTML in Microsoft Word adds a ton of additional code to the document, for no sane reason. I've been having to work with HTML pages from a handful of people, recently, and they use Word to create them, so... is there any sort of program/converter that'll take a Word-created HTML file and convert it to a more sensible HTML file, without all of Word's extra formatting in the way?
I saw a program in passing, years ago, that claimed to do something like that, but I didn't need it then, and I don't know how well it worked... nor do I have the bookmark now.
I'm just hopeful there's something out there that I can feed the documents to and have it convert them, so I don't have to try to filter things out while I'm editing the files. (I can probably find some editor, somewhere, that'll reprocess it before it saves, but...)
		
	If cameras add ten pounds, why would people want to eat them?
- 
	I think this is all about it... 
 
 http://www.codinghorror.com/blog/archives/000485.html
 
 Thread may have some links to something useful.
- 
	You're more or less screwed, from what I know. Manual clean-up.
 Want my help? Ask here! (not via PM!)
 FAQs: Best Blank Discs • Best TBCs • Best VCRs for capture • Restore VHS
- 
	I've had good results in the past using HTML Tidy. The "bare" and "clean" options are designed for this. From the quick reference:
 
 http://tidy.sourceforge.net/docs/quickref.html
 
 "This option specifies if Tidy should strip out surplus presentational tags and attributes replacing them by style rules and structural markup as appropriate. It works well on the HTML saved by Microsoft Office products."
 
 The last time I had to do this was in the days of Office 2000, so your results may vary.
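
 For a handful of files, the Tidy run can be scripted. The following is a minimal sketch, not a tested recipe: it assumes a `tidy` executable is on the PATH, and the function names `build_tidy_command` and `run_tidy` are made up for illustration. It turns on the "bare" and "clean" options mentioned above, plus Tidy's "word-2000" option, which is an extra that was not suggested in this thread.

```python
import subprocess


def build_tidy_command(src, dst):
    """Return the argument list for one Tidy run over a Word-exported file."""
    return [
        "tidy",
        "-quiet",               # suppress the informational banner
        "--bare", "yes",        # strip Microsoft-specific HTML
        "--clean", "yes",       # replace presentational tags with style rules
        "--word-2000", "yes",   # assumption: extra Word clean-up, not from the thread
        "-output", str(dst),    # write the result here instead of stdout
        str(src),
    ]


def run_tidy(src, dst):
    """Run Tidy once; it exits with 0 (ok), 1 (warnings) or 2 (errors)."""
    result = subprocess.run(build_tidy_command(src, dst))
    if result.returncode >= 2:
        raise RuntimeError(f"tidy reported errors for {src}")
    return result.returncode
```

 Calling `run_tidy("report.htm", "report_clean.htm")` (hypothetical file names) writes the cleaned copy next to the original, so the Word output is never overwritten.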
 
 -drjtech
 They that give up essential liberty to obtain a little temporary safety deserve neither liberty or safety.
 --Benjamin Franklin
- 
	In reply to Ai Haibara: Probably Dreamweaver's "Fix Microsoft HTML".
 
 I just love the way MS HTML has pages and pages of style definitions, then when you get to the actual text, it ignores all those and just has lines of nested FONT codes and other braindead stuff.
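
 The two things complained about here, the style definitions and the nested FONT tags, are regular enough that a few regular expressions can remove them. This is only a rough sketch using the Python standard library; `strip_word_markup` is a made-up name, and it assumes you are happy to lose all of Word's styling (class and style attributes included). Regular expressions are not a real HTML parser, so check the output by eye.

```python
import re

STYLE_BLOCK = re.compile(r"<style\b.*?</style>", re.I | re.S)
FONT_TAG = re.compile(r"</?font\b[^>]*>", re.I)
ANY_TAG = re.compile(r"<[^>]+>")
STYLING_ATTR = re.compile(
    r"""\s+(?:class|style)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""", re.I
)


def strip_word_markup(html):
    """Remove style blocks, FONT tags, and class/style attributes."""
    # Drop the pages of style definitions in one go.
    html = STYLE_BLOCK.sub("", html)
    # Unwrap FONT tags: the tags go, the text inside them stays.
    html = FONT_TAG.sub("", html)
    # Remove class= and style= only inside tags, never in the visible text.
    return ANY_TAG.sub(lambda m: STYLING_ATTR.sub("", m.group(0)), html)


sample = (
    '<style>p.MsoNormal{margin:0}</style>'
    '<p class=MsoNormal style="margin:0">'
    '<font face="Arial"><font size=2>Hi</font></font></p>'
)
print(strip_word_markup(sample))  # prints <p>Hi</p>
```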
- 
	In reply to AlanHK: I remember that it was some sort of standalone program. I just didn't have a need for it at the time, so I didn't bother trying it. I did save the site in my bookmarks, though... I turned on an old system to dig through them, and an older version of this may have been it: http://www.bersoft.com/bwhcu/
 Of course, the current version's page also mentions, at the bottom, using Dreamweaver to clean up the HTML. The reference seems a little dated, though.
 
 I wouldn't be surprised if Word converts far more text to HTML entities than any other program I've used. A single line of text can become a monster paragraph!
 
 In reply to thecoalman: Yeah, I think I've even seen some perl scripts that begin to approach it, too. But the work's actually not crucial, so I'm just seeing if there's some simple executable I can throw a handful of the Word HTML files at, to see if it generates something that's less of a headache to edit. If it were crucial work, I'd most likely open each file in my editor and manually edit everything (as lordsmurf mentions), just to be sure. Either that, or tell everyone I won't accept any HTML output from Word (and wait for the pitchfork-and-torch-bearing mob to form outside my door).
 
 I think I'll try the above utility and some of the standalone options mentioned in the link Chris K posted, and see what they make of one of the files. drjtech - I'll try HTML Tidy as well, though that codinghorror blog entry doesn't seem to think it'll do the trick, as much.
 
 Thanks, everyone.
- 
	I tell people all the time that I won't accept HTML from Word or PageMaker. That's really just tough shit on them. Those are not web creation applications.
- 
	Tell them to send you a text file next time... it will be easier than trying to wade through the mess! :P
- 
	In reply to lordsmurf: Hmm... well, I'll think about it. But with my luck, they'll just switch to OpenOffice's Writer... which probably does about the same thing just to maintain feature parity with Word.
 
 In reply to greymalkin: Sure, make me reconstruct all the formatting.
- 
	In reply to Ai Haibara: It's worth trying to open the MS-HTML file in a browser, or Word, and copy and paste it into a real HTML editor. That should produce better code while preserving the formatting.
 (I think formatted text on the clipboard is basically RTF format.)
- 
	Hmm... that's a thought, too. I'll keep that in mind.