2007-01-13

PHP microformats parser

Microformats parser is a PHP package for extracting microformats data embedded in HTML. The gathered data is stored as an xArray of objects, one for each microformat container found.

Requirements

Microformats parser requires PHP 4.3 with the DOM XML extension. Since the DOM XML extension no longer ships with PHP 5, this used to be a problem, which was solved thanks to Alexandre Alapetite and Ludwig. It is now possible to make the parser work with PHP 5 by following this article.

Microformats parser requires the xArray package, which is not included by default. To use the parser, download the xArray package (you can do it here) and extract the xArray.php file into the lib/ directory of your parser.

Supported microformats

The parser supports most of hCard (missing SOUND), hCalendar, hReview (missing item info; the spec really needs some clarification) and rel elements, according to their respective specifications on the microformats wiki.

Usage

The simplest usage example:

$filename = "http://microformats.org/about/people/";
$html = file_get_contents($filename);
$mfParser = new MicroFormatParser();
$mf = $mfParser->parseSource($html);
if ($mf) $mf->each('
   echo "<h1>".get_class($value)."</h1>";
   var_export($value);
   echo "<hr />";
');

As you can see, the parser expects an HTML string as input. That is because there are many different ways to fetch a page, so you're free to use whichever one works for you. Another reason is that DOM XML expects valid XML, which in our case means an XHTML document. Since many pages out there are nearly valid but not really, really valid, you can use the PHP Tidy functions (if available on your machine) to keep the parser from choking to death.
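The Tidy clean-up mentioned above could look roughly like the following. This is only a sketch: it assumes the Tidy extension that provides tidy_repair_string() is installed, and tidyToXhtml() is a made-up helper name, not part of this package.

```php
<?php
// Sketch: clean up near-valid HTML before handing it to the parser.
// tidyToXhtml() is a hypothetical helper, not part of the package.
function tidyToXhtml($html)
{
    // Without the Tidy extension, pass the markup through unchanged.
    if (!function_exists('tidy_repair_string')) {
        return $html;
    }
    $config = array(
        'output-xhtml'     => true, // emit well-formed XHTML
        'numeric-entities' => true, // write entities such as &nbsp; numerically
        'wrap'             => 0,    // do not re-wrap long lines
    );
    $clean = tidy_repair_string($html, $config, 'utf8');
    // Fall back to the original markup if Tidy returned nothing usable.
    return (is_string($clean) && $clean !== '') ? $clean : $html;
}

$html = tidyToXhtml('<p>Unclosed paragraph<br>with a bare break');
```

The result can then be passed to parseSource() in place of the raw page source.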

The parser returns false on failure, or an xArray object with all of the microformats it finds otherwise. Therefore, it is good practice to always check the result for false before anything else.

A note on xArray object

xArray is modeled after Prototype's Enumerable object, in order to facilitate array manipulation. It takes some getting used to, but allows quite clever stuff. I tried my best to keep the source clean and well-commented, and there are some pre-built docs for it in the docs/ folder. You can re-run phpDocumentor over the sources to generate the output that suits you best. However, if you don't like the way it works, you can always use its toArray() method to get a good ol' PHP array out of it. Here is an example of this:

$filename = "http://microformats.org/about/people/";
$html = file_get_contents($filename);
$mfParser = new MicroFormatParser();
$mf = $mfParser->parseSource($html);
if ($mf) var_export($mf->toArray());
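If you would rather always work with a plain array, the false check and the toArray() call can be wrapped together. The sketch below is an assumption about how one might do that: parseToArray() is a hypothetical helper, and StubParser and StubResult are stand-ins so the snippet runs without the package installed. With the real package you would pass a MicroFormatParser instance instead.

```php
<?php
// Hypothetical helper: always hand back a plain PHP array, even on failure.
// $parser is anything with a parseSource() method, such as MicroFormatParser.
function parseToArray($parser, $html)
{
    $mf = $parser->parseSource($html);
    // The parser signals failure with false, so compare strictly.
    if ($mf === false) {
        return array();
    }
    return $mf->toArray();
}

// Stand-ins for the real classes, for illustration only.
class StubResult
{
    function toArray() { return array('one', 'two'); }
}
class StubParser
{
    // Pretends that an empty document cannot be parsed.
    function parseSource($html) { return $html === '' ? false : new StubResult(); }
}

$ok     = parseToArray(new StubParser(), '<html></html>');
$failed = parseToArray(new StubParser(), '');
```

The calling code can then loop over the result with foreach without checking for false first.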

A bit more advanced usage example

Before you call the parseSource() method, you can configure the parser to extract just the microformats you're after. You do that by passing a hash of options to the parserSetup() method, like this:

$mfParser->parserSetup (array (
   'hcard' => true,
   'hreview' => true,
   'hcalendar' => true,
   'reltag' => true,
));

The parser will fetch all the microformats it finds by default, so the previous code merely spells out the default behavior. However, doing something like this will seriously limit your search (and memory usage ;)):

$mfParser->parserSetup (array (
   'hcard' => true,
   'hreview' => false,
   'hcalendar' => false,
   'reltag' => false,
));

Please note that you have to do this before you call the parseSource() method. So the full example source would be:

$filename = "http://microformats.org/about/people/";
$html = file_get_contents($filename);
$mfParser = new MicroFormatParser();
$mfParser->parserSetup (array (
   'hcard' => true,
   'hreview' => false,
   'hcalendar' => false,
   'reltag' => false,
));
$mf = $mfParser->parseSource($html);
if ($mf) $mf->each('
   echo "<h1>".get_class($value)."</h1>";
   var_export($value);
   echo "<hr />";
');
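The option hashes shown above can also be built by a small helper instead of being typed out each time. This is a minimal sketch: onlyFormats() is a hypothetical name, and it assumes the four keys used in the examples (hcard, hreview, hcalendar, reltag) are the complete set of options.

```php
<?php
// Hypothetical helper: build a parserSetup() options hash that enables
// only the listed formats and switches every other known format off.
function onlyFormats($wanted)
{
    // The four option keys used in the examples; assumed to be all of them.
    $known = array('hcard', 'hreview', 'hcalendar', 'reltag');
    $options = array();
    foreach ($known as $format) {
        $options[$format] = in_array($format, $wanted);
    }
    return $options;
}

$options = onlyFormats(array('hcard'));
// With the package loaded you would continue with:
// $mfParser->parserSetup($options);
// $mf = $mfParser->parseSource($html);
```

As with the hand-written hash, the result has to be passed to parserSetup() before parseSource() is called.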