Using wget and the WDG Offline Validator to link check and validate your whole web site
September 18th, 2007
If you have a large web site or web application and care about dead links and valid HTML, checking for them is a real pain. When I ran into this problem I collected a few tools, which I will present here.
The fastest way to check links manually is the LinkChecker Firefox plugin, but it is not so easy to check a whole site for 404s that way. The same goes for validation. For quick manual checking of a single page I recommend installing the Web Developer Toolbar in Firefox and simply pressing Shift + Ctrl + H or Shift + Ctrl + A. But how do you validate all the pages of a site? And how do you do it offline for performance reasons?
My approach is to mirror the whole web site with wget, look at the wget log for dead links and use the WDG Offline Validator to validate the mirrored HTML pages.
Mirror with wget
I assume you use Linux or a Unix-like system, so wget will not be new to you, or you will at least be able to get it for your system. It’s a pretty basic but powerful tool.
First, create a new directory into which wget can download all your pages. Then run this command on the console to fetch your whole site:
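wget --mirror --keep-session-cookies -o wget.log http://www.example.com/
Replace http://www.example.com/ with the address of your own site; it is only a placeholder here. I use the --mirror switch to simply fetch everything. The --keep-session-cookies switch is meant for sites that are dynamically created, like this blog. (According to the wget manual this switch only affects what --save-cookies writes to disk, so it may not be needed on its own.) -o wget.log tells wget to write its output to that file. Make sure your server can withstand the load!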
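If the load on the server is a concern, wget can be told to slow down. The following is only a sketch: the function name mirror_politely and the values for --wait and --limit-rate are assumptions for illustration, not settings from the original command, and should be tuned to your own server.

```shell
#!/bin/sh
# Hypothetical helper: mirror a site while throttling the request rate.
# $1 is the start URL of the site to mirror.
mirror_politely() {
    # --wait=1         pause about one second between requests (assumed value)
    # --random-wait    vary that pause so requests are not perfectly regular
    # --limit-rate     cap the download bandwidth (assumed value)
    wget --mirror \
         --wait=1 \
         --random-wait \
         --limit-rate=200k \
         -o wget.log \
         "$1"
}
```

Called as mirror_politely http://www.example.com/ it writes the same wget.log as the plain command, just more slowly.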
Once wget finishes, you can use less or your favorite editor to search wget.log for 404s and the string error (search case-insensitively, since wget writes ERROR in capitals). This is all you need for link checking.
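Searching the log by hand gets tedious on a big site, so the lookup can be scripted. This is a minimal sketch, assuming the usual wget log layout in which every request starts with a line beginning with "--" that ends in the URL, and a failed request is followed by a line containing "ERROR 404". The function name list_dead_links is made up for this example; check a few lines of your own wget.log before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: print each URL that wget logged as a 404.
# $1 is the path to the wget log file.
list_dead_links() {
    awk '
        /^--/       { url = $NF }   # remember the most recently requested URL
        /ERROR 404/ { print url }   # report it when that request failed
    ' "$1" | sort -u
}
```

Run it as list_dead_links wget.log to get one dead URL per line, with duplicates removed.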
Validation with the WDG Offline Validator
I searched a while for a usable offline validator. The W3C one is a CGI script which needs a running Apache, and you have to do an HTTP POST in order to check a local HTML file. It is basically the same thing as the public W3C Validator. The next one I did not choose is a Windows application called A Real Validator. Its disadvantages are that it costs money and has a slightly dated GUI which cannot filter for invalid pages only, so you have to scroll through hundreds of valid pages to find the invalid ones.
So in the end I use the WDG Offline Validator. You can get it from this site, but the best thing is that Debian and Ubuntu have it available in their package repositories. So you can just type:
sudo apt-get install wdg-html-validator
(On Ubuntu, be sure you have the universe repository in your sources list.)
To validate all the HTML pages wget downloaded, just type:
find . -name "*.html" -exec validate -w {} \; > validation.log
This command finds all the HTML files below your current directory, runs the validate command on each one and writes the results to the validation.log file. While it runs, you can watch validation.log with tail, or you can view it afterwards in whatever editor you like best.
So that is basically all you need to check your whole site for dead links and valid HTML.
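A long validation.log is easier to search when every file's results start with a clear marker. The sketch below is one possible variation, not the original command: the function name validate_tree and the "===" header lines are assumptions, it also picks up files ending in .htm, and it breaks on file names that contain newlines. The validate -w call itself is the one used above.

```shell
#!/bin/sh
# Hypothetical helper: validate every HTML file below a directory and
# print a header line with the file name before each result.
# $1 is the directory to scan (defaults to the current directory).
validate_tree() {
    find "${1:-.}" -type f \( -name '*.html' -o -name '*.htm' \) -print |
    while IFS= read -r file; do
        printf '=== %s ===\n' "$file"   # marker to search for in the log
        validate -w "$file"             # same call as in the find command
    done
}
```

Run it as validate_tree . > validation.log and then search the log for "===" to jump from one file to the next.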