2007-11-01

10 Dos and Don'ts When Using Microformats Parser

MicroformatParser has been used in the real world (i.e., outside my sandbox testing grounds) for some time now, and I've been getting valuable feedback from developers. During that time, some of the most common problems, and some of the best practices to circumvent them, have emerged, and I thought it would be nice to collect them all in one place to share with others.

Dos

Please, do:

... use Tidy

The web is filthy, and you do need something to keep you clean. You can't just assume that you're working with well-formed XML from an external source: 9 out of 10 times the XML parser will choke and your script will croak because of that assumption.

What you can do is try to decrappify the input using Tidy. For a kick-start on using Tidy with PHP, you may want to check out this post as well.
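To give a rough idea of what that cleanup step might look like, here is a minimal sketch. It assumes the Tidy extension is loaded; the function name tidyToXhtml() and the chosen options are mine, for illustration only, and are not part of MicroformatParser.

```php
<?php
/**
 * Try to turn messy HTML into well-formed XHTML before parsing.
 * Returns the repaired markup, or false if Tidy is unavailable or fails.
 * (tidyToXhtml is a hypothetical helper name, not a MicroformatParser API.)
 */
function tidyToXhtml($html, $encoding = 'utf8')
{
    if (!function_exists('tidy_repair_string')) {
        return false; // Tidy extension is not loaded
    }

    $config = array(
        'output-xhtml'     => true, // emit XHTML rather than HTML
        'numeric-entities' => true, // numeric entities instead of named ones like &nbsp;
        'wrap'             => 0,    // do not re-wrap long lines
    );

    $clean = tidy_repair_string($html, $config, $encoding);

    return is_string($clean) ? $clean : false;
}
```

If the function returns false, bail out or fall back to something else rather than handing the raw markup to an XML parser.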

... check your PHP version

For PHP4, everything should just work right out of the box. However, for PHP5 you'll need this script, by Alexandre Alapetite. He's done a great job of wrapping the DOM XML extension API, making it available to PHP5 users.
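One way to make that decision at runtime is sketched below. Treat it as an assumption-laden example: needsDomxmlWrapper() is a made-up helper, and the wrapper's file name and location are guesses you should adjust to wherever you saved the script.

```php
<?php
/**
 * Decide whether the PHP4-style DOM XML functions need the compatibility
 * wrapper: only on PHP5 or later, and only if no domxml_* functions are
 * already defined (loading the wrapper on top of them would redeclare them).
 * (needsDomxmlWrapper is a hypothetical helper name.)
 */
function needsDomxmlWrapper($phpVersion, $hasNativeDomxml)
{
    return version_compare($phpVersion, '5.0.0', '>=') && !$hasNativeDomxml;
}

if (needsDomxmlWrapper(PHP_VERSION, function_exists('domxml_open_mem'))) {
    // Assumed file name and location of the wrapper script.
    $wrapper = dirname(__FILE__) . '/domxml-php4-to-php5.php';
    if (is_file($wrapper)) {
        require_once $wrapper;
    }
}
```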

... check xArray documentation

It may be tempting to just call the toArray() method on the result and work with a familiar datatype. However, xArray is specifically crafted to facilitate working with collections of objects, such as your parsing results. The documentation is included in the package, and you can re-run PhpDocumentor over the source file to get it in a format you prefer. For more info on xArray you can also check out the documentation wiki. It is a work in progress, but some valuable info is already there.

Also, there is a new xArray version on the way (v0.2), which will make handling complex trees of data even easier.

... check if (bool)FALSE is returned

On error, MicroformatParser returns (bool)FALSE instead of an xArray object. So make sure that everything went OK before you try to do anything further with the result:

if ($microformatsResult !== false) ...

... use caching

Fetching the remote page will most likely be the slowest part of your script (if it's not, something is seriously wrong). So, to shorten execution time, implement some sort of caching mechanism to keep remote page fetching to a minimum.
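As a starting point, a simple file-based cache could look something like the sketch below. It is not part of MicroformatParser: fetchCached(), the cache file naming, and the one-hour default are all assumptions of mine, and it relies on file_put_contents() and sys_get_temp_dir(), so it needs PHP 5.2.1 or later.

```php
<?php
/**
 * Return the body of $url, re-fetching only when the cached copy is
 * older than $ttl seconds. $fetcher is any callable that takes a URL and
 * returns a string or false; it defaults to file_get_contents.
 * (fetchCached is a hypothetical helper name.)
 */
function fetchCached($url, $ttl = 3600, $cacheDir = null, $fetcher = 'file_get_contents')
{
    if ($cacheDir === null) {
        $cacheDir = sys_get_temp_dir();
    }
    $cacheFile = $cacheDir . '/mf_cache_' . md5($url) . '.html';

    // Serve from the cache while the file is still fresh.
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
        return file_get_contents($cacheFile);
    }

    $body = call_user_func($fetcher, $url);
    if ($body === false) {
        // Fetch failed: fall back to a stale copy if one exists.
        return is_file($cacheFile) ? file_get_contents($cacheFile) : false;
    }

    file_put_contents($cacheFile, $body, LOCK_EX);

    return $body;
}
```

You would call fetchCached() with the page URL and pass the returned markup on to Tidy and the parser. Falling back to a stale copy when the fetch fails is a design choice; drop it if you'd rather fail loudly.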

... contact me

This isn't really a "best practice" thing, but I think it's still worth keeping in mind. If you find a bug or just keep hitting the wall, don't hesitate to contact me. I'll try to help as much as I can.

Don'ts

There aren't as many of those, but they're just as important. So, please don't:

... assume you're parsing well-formed XHTML

Because it's just not true, most of the time.

... use PHP5 DOM XML extension

As of PHP 5.0, the required DOM XML extension is not bundled with PHP anymore. There is one available from PECL, but you don't want to use that. Thanks to deneme's patience and valuable input, we discovered that you can't just plug it in and expect everything to work. You should keep away from it and use Alexandre Alapetite's solution instead.

... use it for something malicious

I can't really tell you what to do with it, but please don't use it for something bad, like email scraping. Would you like your name and email listed in some new directory handed down to generations of spammers? No, I bet you wouldn't. So don't do it to others, either.

... output invalid XML (XHTML included)

This is not strictly related to MicroformatParser usage, but it's good advice nevertheless. Please, don't do that. The rest of the web will thank you for your effort.