2007-02-17

Use the Web: use Tidy

Once you start web programming, it's only a matter of time before you face the task of fetching and parsing existing content, be it a web page, a feed, or whatever. Since that content is most likely written in a markup language of some sort, it would be great if you could use a generic parser to wade through it. In fact, since you know your input is some XML dialect (say, XHTML, RSS or Atom), it would be great if you could just use a ready-made XML parser and reach the portions you need through XPath expressions or DOM functions. But that is where the grief begins.

Guess what? It's not gonna work.

So the page you're fetching claims to be XHTML Strict, but your parser keeps croaking on you. Why is that happening? Have you done something in a previous life to annoy a deity of some sort? Well, actually, most existing XML parsers are quite picky, and they should be, since there is only a handful of well-formedness rules they expect to be fulfilled. However, for one reason or another, those rules are seldom fulfilled. This especially goes for XHTML, since people tend to take feed validity more seriously.
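To see how little it takes to upset a strict parser, here is a minimal sketch. It assumes PHP 5 with the DOM extension (PHP 4.3 would need the older domxml functions instead), and the helper name is_well_formed_xml() is made up for illustration.

```php
<?php
// Hypothetical helper: returns true only if $markup parses as well-formed XML.
function is_well_formed_xml($markup)
{
    // Collect libxml errors quietly instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $ok  = $doc->loadXML($markup);

    // Discard the collected errors and restore the previous error mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $ok;
}

var_dump(is_well_formed_xml('<p>Hello<br /></p>')); // bool(true)
var_dump(is_well_formed_xml('<p>Hello<br></p>'));   // bool(false): <br> is never closed
```

A browser would render both fragments happily; the XML parser rejects the second one outright, which is exactly the situation described above.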

So, what do you do?

Meet your new best friend: HTML Tidy. It will do all the nasty cleanup and repair work for you and leave you with a usable document. Originally a utility program written by Dave Raggett, it is now maintained by a group of dedicated volunteers on SourceForge. One of their goals was to create a library form of Tidy, to make it easier to incorporate Tidy into other software.

And that they did. For example, Tidy is an integral part of many current applications, it is available as a PECL extension for PHP 4.3.x and PHP 5 from http://pecl.php.net/package/tidy, and there are bindings for many other languages as well.

Using the Tidy PHP extension

There are two flavors of Tidy for PHP: Tidy 1.0 is just for PHP 4.3.x, while Tidy 2.0 is just for PHP 5. This is how you'd use Tidy 1.0 with PHP 4.3:

// ...
// Let's assume you already obtained the page you want to clean up in string $html
// ...
$config = array (
    'ncr' => true,                // allow numeric entities
    'numeric-entities' => true,   // output numeric instead of named entities
    'quote-nbsp' => true,         // quote non-breaking space character
    'fix-uri' => true,            // fix ampersands and such in URIs
    'output-xml' => true,         // output XML; could be XHTML as well, I think
    'char-encoding' => 'utf8'     // use UTF-8 encoding 
);
// Set the options before parsing, so that the ones affecting the parser take effect
foreach ($config as $key => $value) {
    tidy_setopt($key, $value);
}
tidy_parse_string($html);
tidy_clean_repair();
$html = tidy_get_output();
// ...
// Now $html contains cleaned up original page, ready for XML parser
// ...

The most important part here is the $config array. This is where we set up Tidy to make the input string $html parser-friendly. There are a lot of other options for Tidy, but these are the basic ones that should correct almost any page. For a full reference on Tidy options, check out http://tidy.sourceforge.net/docs/quickref.html.
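For PHP 5, the Tidy 2.0 extension accepts the configuration array directly, so the same cleanup might look roughly like this. Treat it as a sketch that assumes the tidy extension is loaded; the wrapper name tidy_to_xml() and the sample input are illustrative only.

```php
<?php
// Hypothetical wrapper around the Tidy 2.0 API (PHP 5 and later).
function tidy_to_xml($html)
{
    $config = array(
        'ncr'              => true,  // allow numeric entities
        'numeric-entities' => true,  // output numeric instead of named entities
        'quote-nbsp'       => true,  // quote non-breaking space character
        'fix-uri'          => true,  // fix ampersands and such in URIs
        'output-xml'       => true,  // output XML
    );

    // The third argument sets the encoding for both input and output.
    return tidy_repair_string($html, $config, 'utf8');
}

if (extension_loaded('tidy')) {
    echo tidy_to_xml('<p>Caf&eacute; &amp; bar<br>second line');
} else {
    echo "tidy extension not available\n";
}
```

Passing the encoding as the third argument stands in for the 'char-encoding' entry used with Tidy 1.0.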

Using the Tidy executable

On the downside, it's possible you don't have the Tidy extension available in your environment. If that's the case, you might be able to use the standalone Tidy executable. To do that, you first need to make a Tidy configuration file. This is an example file, with the same options as in the example above:

ncr: 1                # allow numeric entities
numeric-entities: 1   # output numeric instead of named entities
quote-nbsp: 1         # quote non-breaking space character
fix-uri: 1            # fix ampersands and such in URIs
output-xml: 1         # output XML; could be XHTML as well, I think
char-encoding: utf8   # use UTF-8 encoding

Save that file as tidy.conf in the same directory as your script. Next, in your script, do something along these lines:

// ...
// Let's assume you already obtained the page you want to clean up in string $html
// ...
define('PATH_TO_YOUR_CONFIG_FILE', 'tidy.conf');
// An empty directory argument makes tempnam() fall back to the system temp directory
$filename = tempnam("", "OUT");

$fp = fopen($filename, 'w');
fwrite($fp, $html);
fclose($fp);

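// Quote each argument separately; escapeshellcmd() on the whole command can mangle quoted paths
$cmd = 'tidy -q -config ' . escapeshellarg(realpath(PATH_TO_YOUR_CONFIG_FILE)) . ' ' . escapeshellarg($filename);
$html = shell_exec($cmd);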
unlink ($filename);
// ...
// Now $html contains cleaned up original page, ready for XML parser
// ...

And that should be it. Once again, note that Tidy accepts a lot of (well documented) configuration parameters. For a full list, check out http://tidy.sourceforge.net/docs/quickref.html.
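Once the markup is well-formed, the XPath route mentioned at the start becomes practical. Here is a minimal sketch, assuming PHP 5 with the DOM extension. The $clean string is a hand-written stand-in for Tidy's output, and the function name extract_links() is hypothetical.

```php
<?php
// Hypothetical helper: collect the href of every <a> element in well-formed markup.
function extract_links($xml)
{
    $doc = new DOMDocument();
    if (!$doc->loadXML($xml)) {
        return array(); // still not well-formed; nothing to query
    }

    $xpath = new DOMXPath($doc);
    $links = array();
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $links[] = $anchor->getAttribute('href');
    }
    return $links;
}

// Stand-in for a document that Tidy has already cleaned up.
$clean = '<html><body><p><a href="http://tidy.sourceforge.net/">Tidy</a> and '
       . '<a href="http://pecl.php.net/package/tidy">PECL</a></p></body></html>';

print_r(extract_links($clean));
```

If you switch Tidy to XHTML output, the root element should carry the XHTML namespace, in which case a bare //a will likely match nothing and you would register a prefix with DOMXPath::registerNamespace() first.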