HTML Tidy is a great utility for cleaning up your web pages. TidyCOM.dll wraps
the same functionality in a COM component, so a server-side script can clean up HTML too.
This article explains how to get started with HTML Tidy using ASP on PWS/Windows 98.
- Download TidyCOM.zip from perso.wanadoo.fr/ablavier/TidyCOM/
- Read the Tidy options documentation at www.w3.org/People/Raggett/tidy/
- Extract both regsvr32.exe and TidyCOM.dll (I extracted these to C:\Windows)
- Open a DOS command window
- Register the DLL by typing
regsvr32.exe c:\windows\TidyCOM.dll
You MUST give the full path name for the DLL. To unregister, type
regsvr32.exe /u c:\windows\TidyCOM.dll
- Restart PWS
- Extract the attached zip file to a folder under PWS. I put it under "tidy"
- Look at the file bad.html. It is not exactly bad; it is an MS Word document saved as HTML from MS Word 97.
- Fire up your browser, and point it to http://localhost/tidy/simple.asp
- If no error occurred, open bad.html and good.html and see how Tidy reformatted the page. This run does not clean up the Word 2000 markup.
- simple.asp sets the options within the code, which can be a pain. The next example uses a configuration file instead.
- Point your browser to http://localhost/tidy/useconf.asp. This uses the configuration file tidyconf.txt
- Now compare bad.html and good_2.html. The second file is free of the Word 2000 legacy markup.
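The contents of tidyconf.txt are not reproduced in this article. As a rough sketch of what such a file can look like, the Python script below renders a handful of Tidy options in the "name: value" form that Tidy configuration files use. The particular option set and the helper names render_config and write_config are assumptions chosen for illustration, not the settings shipped in the zip file; check each option against the Tidy option list before using it.

```python
# Hypothetical settings for illustration only; the real tidyconf.txt may differ.
TIDY_OPTIONS = {
    "word-2000": "yes",     # strip the extra markup Word 2000 adds
    "clean": "yes",         # replace presentational tags with style rules
    "output-xhtml": "no",   # stay with plain HTML so 4.0 browsers cope
    "tidy-mark": "no",      # do not add a generator meta tag
    "indent": "auto",
    "wrap": "76",
}


def render_config(options):
    """Return the options as 'name: value' lines, one option per line."""
    return "".join(f"{name}: {value}\n" for name, value in options.items())


def write_config(path, options):
    """Write the rendered options to the file at path."""
    with open(path, "w", encoding="ascii") as handle:
        handle.write(render_config(options))
```

Calling write_config("tidyconf.txt", TIDY_OPTIONS) would produce a six-line file. Keeping the options in one dictionary makes it easy to loosen a setting, such as turning off XHTML output, when older browsers have trouble with the result.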
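Sample ASP Code
<%@ Language=VBScript %>
<%
Option Explicit
Dim oTidy
' Creating Tidy Object
Set oTidy = CreateObject("TidyCOM.TidyObject")
' Setting Tidy Options
' Cleaning up file bad.html to good.html
' (TidyToFile is assumed from the TidyCOM documentation; verify the name against your copy)
oTidy.TidyToFile Server.MapPath("bad.html"), Server.MapPath("good.html")
' Cleaned up. See good.html
Set oTidy = Nothing
%>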
Note: Methods are also available for cleaning up HTML in a string.
Sample Python Code
import win32com.client
objTidy = win32com.client.Dispatch("TidyCOM.TidyObject")
objTidy = None  # release the COM object
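The Python sample above only creates and releases the object. The sketch below shows how the file and string cleanup calls might be wrapped. The method names TidyToFile and TidyMemToMem are assumptions taken from memory of the TidyCOM documentation, not from this article, so check them against the documentation in TidyCOM.zip. The helper names tidy_file, tidy_string and create_tidy are hypothetical.

```python
def tidy_file(tidy, source_path, dest_path):
    """Clean source_path into dest_path using a TidyCOM-style object."""
    # Assumed method name: TidyToFile(source, destination).
    tidy.TidyToFile(source_path, dest_path)


def tidy_string(tidy, html):
    """Return a cleaned copy of the HTML held in a string."""
    # Assumed method name: TidyMemToMem(source_string).
    return tidy.TidyMemToMem(html)


def create_tidy():
    """Create the COM object. Needs Windows, pywin32 and a registered DLL."""
    # Imported here so the helpers above can be loaded on any platform.
    import win32com.client
    return win32com.client.Dispatch("TidyCOM.TidyObject")
```

On a machine where the DLL is registered, tidy_file(create_tidy(), r"c:\tidy\bad.html", r"c:\tidy\good.html") would mirror the ASP sample. The paths here are placeholders.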
- TidyCOM.dll is not re-entrant, so only one instance of it should run at any one time. A web server cannot guarantee that on its own, so you need to write your own queueing system.
- See the HTML Tidy home page for the full option list.
- Writing the configuration file deserves care. If it is done poorly, the output can be made so strictly standards-compliant that 4.0 browsers have trouble with it.
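The queueing requirement in the first note can be met in its simplest form with a lock that lets one caller through at a time. The Python sketch below is a minimal illustration under that assumption. The class name SerializedTidy is hypothetical, and clean_func stands for whatever function actually calls into TidyCOM. A lock like this only serializes threads inside one process; several server processes would need a cross-process mechanism instead.

```python
import threading


class SerializedTidy:
    """Allow only one call into a non-re-entrant cleaner at a time."""

    def __init__(self, clean_func):
        # clean_func: any callable that takes an HTML string and returns one.
        self._clean = clean_func
        self._lock = threading.Lock()

    def clean(self, html):
        # Other callers block here until the current call has finished.
        with self._lock:
            return self._clean(html)
```

Every request handler would share a single SerializedTidy instance, so requests that arrive together are processed one after another rather than concurrently.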