2007-01-13

PHP microformats parser

Microformats parser is a PHP package for extracting the microformats data embedded in HTML. The gathered data is stored as an xArray of objects, one for each microformat container found.

Requirements

Microformats parser requires PHP 4.3 with the DOM XML extension. The DOM XML extension doesn't ship with PHP 5 anymore, which used to be a problem; it was solved thanks to Alexandre Alapetite and Ludwig. It is now possible to make the parser work with PHP 5 by following this article.

Microformats parser also requires the xArray package, which is not bundled with it. To use the parser, download the xArray package (you can do it here) and extract the xArray.php file into the lib/ directory of your parser.

Supported microformats

The parser supports most of hCard (SOUND is missing), hCalendar, hReview (item info is missing; the spec really needs some clarification) and rel elements, according to their respective specifications on the microformats wiki.

Usage

The simplest usage example:

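// Fetch the page any way you like; the parser only needs the HTML string
$filename = "http://microformats.org/about/people/";
$html = file_get_contents($filename);
$mfParser = new MicroFormatParser();
// parseSource() returns false on failure, or an xArray of microformat objects
$mf = $mfParser->parseSource($html);
// Print every object found; inside the snippet the current object is $value
if ($mf) $mf->each('
   echo "<h1>".get_class($value)."</h1>";
   var_export($value);
   echo "<hr />";
');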

As you can see, the parser expects its input as an HTML string. That is because there are many different ways to fetch a page, so you're free to use whichever one works for you. Another reason is that DOM XML expects well-formed XML: in our case, an XHTML document. Since many pages out there are nearly valid but not quite, you can use the PHP Tidy functions (if available on your machine) to clean up the markup first and keep the parser from choking to death.
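
Here is a minimal sketch of that cleanup step, assuming the Tidy extension is installed. cleanHtmlForParser() is a hypothetical helper name, not part of the parser package, and the 'utf8' encoding is an assumption about the page being parsed. Without Tidy, the helper simply hands back the markup untouched.

```php
<?php
// Hypothetical helper (not part of the parser package):
// normalize markup to XHTML before handing it to the parser.
function cleanHtmlForParser($html)
{
    // Tidy is an optional extension; fall back to the raw input without it.
    if (!function_exists('tidy_repair_string')) {
        return $html;
    }

    $config = array(
        'output-xhtml'     => true, // emit XHTML instead of HTML
        'numeric-entities' => true, // write entities in numeric form, which XML parsers accept
        'wrap'             => 0,    // do not re-wrap long lines
    );

    // Assumption: the page is UTF-8 encoded.
    $clean = tidy_repair_string($html, $config, 'utf8');

    // Keep the original markup if Tidy could not repair it.
    return is_string($clean) ? $clean : $html;
}

// Unclosed <p> and <span> tags like these would normally trip an XML parser.
$dirty = '<div class="vcard"><p><span class="fn">Jane Doe</div>';
$xhtml = cleanHtmlForParser($dirty);
```

The cleaned string can then be passed to parseSource() in place of the raw $html.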

The parser returns false on failure, or an xArray object with all of the microformats it finds otherwise. Therefore, it is good practice to always check the result for false before anything else.
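
One way to make that check hard to forget is to funnel every result through a small helper. The sketch below is only an illustration: microformatsOrEmpty() and the StubResult class are names I made up for it, and the stub merely stands in for the xArray object so the code runs without the package.

```php
<?php
// Hypothetical helper: turn a parseSource() result into a plain array,
// treating a failed parse (false) as "no microformats found".
function microformatsOrEmpty($result)
{
    if ($result === false) {
        return array();
    }
    // toArray() is the xArray method described in this article.
    return $result->toArray();
}

// Stand-in for the xArray object (an assumption, for demonstration only).
class StubResult
{
    var $items = array();

    function toArray()
    {
        return $this->items;
    }
}

$ok = new StubResult();
$ok->items = array('first', 'second');

$fromSuccess = microformatsOrEmpty($ok);    // the stub's items
$fromFailure = microformatsOrEmpty(false);  // an empty array
```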

A note on the xArray object

xArray is modeled after Prototype's Enumerable object, in order to facilitate array manipulation. It takes some getting used to, but allows quite clever stuff. I tried my best to keep the source clean and well-commented, and there are some pre-built docs for it in the docs/ folder. You can re-run phpDocumentor over the sources to generate the output that suits you best. However, if you don't like the way it works, you can always use its toArray() method to get a good ol' PHP array out of it. Here is an example of this:

$filename = "http://microformats.org/about/people/";
$html = file_get_contents($filename);
$mfParser = new MicroFormatParser();
$mf = $mfParser->parseSource($html);
if ($mf) var_export($mf->toArray());
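
Once you have a plain array, ordinary PHP loops are all you need. As a rough sketch, the hypothetical groupByClass() helper below buckets objects by class name, the same get_class() call the first example uses for its headings. DemoCard and DemoEvent are made-up stand-ins, not the parser's real classes.

```php
<?php
// Hypothetical helper: bucket objects from a plain array by class name.
function groupByClass($items)
{
    $groups = array();
    foreach ($items as $item) {
        $groups[get_class($item)][] = $item;
    }
    return $groups;
}

// Made-up stand-ins for the objects the parser would return.
class DemoCard {}
class DemoEvent {}

$card   = new DemoCard();
$items  = array($card, new DemoEvent(), new DemoCard());
$groups = groupByClass($items);

foreach ($groups as $class => $members) {
    echo $class . ': ' . count($members) . "\n";
}
```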

A bit more advanced usage example

Before you call the parseSource() method, you can configure the parser to extract just the microformats you're after. You do that by passing a hash of options to the parserSetup() method, like this:

$mfParser->parserSetup (array (
   'hcard' => true,
   'hreview' => true,
   'hcalendar' => true,
   'reltag' => true,
));

The parser fetches all the microformats it finds by default, so the previous code merely spells out the default behavior. However, doing something like this will seriously limit your search (and memory usage ;)):

$mfParser->parserSetup (array (
   'hcard' => true,
   'hreview' => false,
   'hcalendar' => false,
   'reltag' => false,
));
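
If you switch formats on and off a lot, you could build such a hash from a list of names instead of typing it out. This is just a sketch: onlyFormats() is a hypothetical helper, not part of the package, and it assumes the four keys shown above are the only options.

```php
<?php
// Hypothetical helper: build a parserSetup() options hash that enables
// only the listed formats. The four keys are the ones shown in this article.
function onlyFormats($wanted)
{
    $options = array(
        'hcard'     => false,
        'hreview'   => false,
        'hcalendar' => false,
        'reltag'    => false,
    );
    foreach ($wanted as $name) {
        // Ignore names that are not known option keys.
        if (array_key_exists($name, $options)) {
            $options[$name] = true;
        }
    }
    return $options;
}

$options = onlyFormats(array('hcard', 'hcalendar'));
// With the real parser loaded, this would be followed by:
// $mfParser->parserSetup($options);
```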

Please note that you have to do this before you call the parseSource() method. So the full example source would be:

$filename = "http://microformats.org/about/people/";
$html = file_get_contents($filename);
$mfParser = new MicroFormatParser();
// Configure before parsing: keep hCard, skip everything else
$mfParser->parserSetup (array (
   'hcard' => true,
   'hreview' => false,
   'hcalendar' => false,
   'reltag' => false,
));
$mf = $mfParser->parseSource($html);
// parseSource() returns false on failure, so check before iterating
if ($mf) $mf->each('
   echo "<h1>".get_class($value)."</h1>";
   var_export($value);
   echo "<hr />";
');

4 comments:

  1. Hi,

    I think your sentence has a mistake:

    Since DOM XML extenstion doesn't ship with PHP5 anymore, there was a problem that was solved thanks to Alexandre Alapetite and Ludwig. Now it is possible to make it work with PHP5 by following this article.

    Shouldn't it be "doesn't ship with PHP4 anymore"? Am I wrong?

    Also, I can't run scripts with MicroFormatParser. Its parseSource() method doesn't return an xArray object. I get this error when I run the script:

    Fatal error: Call to a member function length() on a non-object in microformats.php on line 30

    Is there a solution for this?

  2. No, it's like in the article - DOM XML doesn't ship with PHP5. To quote php.net:

    Note: This extension has been moved to the PECL repository and is no longer bundled with PHP as of PHP 5.0.0.

    Note: This extension is no longer marked experimental. It will, however, never be released with PHP 5, and will only be distributed with PHP 4. If you need DOM XML support with PHP 5 you can use the DOM extension. This domxml extension is not compatible with the DOM extension.

    http://www.php.net/manual/en/ref.domxml.php

    As for the problem/error you're getting, let me quote myself:

    The parser returns false on failure, or an xArray object with all of the microformats it finds otherwise. Therefore, it is good practice to always check the result for false before anything else.

    I probably should have emphasized that a bit more in the article. So, always check the result for false before you do anything with it.

    By the way, you're most probably getting false because the page you're parsing isn't well formed, so the DOM XML parser keeps croaking. You can try to compensate for that by running the page through Tidy before it reaches the parser, following the steps from this article:
    http://malatestapunk-stuff.blogspot.com/2007/02/use-web-use-tidy.html

  3. Thanks for your help.

    The Tidy option solves the problem.

    But I have another problem. When MicroFormatParser parses the file,
    the values of the hCard and hCalendar elements are always false. I'm sure that the parsed file contains hCard and hCalendar.

    Is there a solution for that?

    PS: Can I write up this solution on my blog in Turkish?

  4. Hmm... that sounds like a serious problem. Would you please email me the code and the page you're parsing, so I can have a better look?

    And yes, of course you can republish this on your blog. Thanks for spreading the word :)
