Code Notebook: Fetching and Parsing an Atom Feed in PHP

Update: I switched from using cURL directly in the page script to running a fetcher cron job every 15 minutes. This should speed up the average page load, as the feed(s) are no longer fetched on every front page request but are instead read from a local file on disk.

The Python script I use to do this (works, may not be foolproof):

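#!/usr/bin/env python3

# Python 3 version: the old httplib module is now http.client.
import http.client

def grabfeed(host, path, output):
    """Fetch http://host/path and save the body to output on HTTP 200."""
    conn = http.client.HTTPConnection(host)
    try:
        conn.request('GET', path)
        response = conn.getresponse()
        # Only overwrite the local copy when the fetch succeeded.
        if response.status == 200:
            # read() returns bytes, so the file is opened in binary mode.
            with open(output, 'wb') as f:
                f.write(response.read())
    finally:
        conn.close()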

See now? In the old days I would've done this with a shell script and wget(1) (or curl(1)), but now that I'm such a good fanboy, none of that is necessary!
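One possible refinement, which is not part of the script above: because the page reads the file while cron may be rewriting it, the write could go to a temporary file first and then be renamed into place. The sketch below is a minimal illustration under stated assumptions. save_atomically is a hypothetical helper name, and the 0644 permission assumes the web server runs as a different user than the cron job.

```python
import os
import tempfile

def save_atomically(data, output):
    """Write bytes to output so that readers never see a half-written file."""
    directory = os.path.dirname(os.path.abspath(output))
    # The temporary file must live on the same filesystem as the target
    # for the final rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(data)
        # mkstemp creates the file with mode 0600; loosen it so another
        # user (assumed here: the web server) can read the result.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, output)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Inside grabfeed, the open/write pair would then become a single call such as save_atomically(response.read(), output).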

Original post continues below…

Like someone somewhere said, PHP is the new Visual Basic (oddly, that sentence suddenly popped into my head today from I don't know where, and a search returned two hits for the exact query). The fact is that with PHP you can hack stuff together rather quickly, but it may not be the best-suited solution for something more ambitious. Just saying. YMMV. And I'm using it now for those basic tasks, like I used VB years ago on the desktop, so…

So, in a web page I needed to fetch my latest Blogspot posts, parse the Atom XML file and spit out the blog entries to go with the XHTML 1.1 page. I have done this simple thing before in Google App Engine, outputting HTML 4.01 Transitional, but with that non-XML-based standard you don't really have to pay that much attention to well-formedness and stuff like that. The browser will happily give its best-effort rendering of the document it's served. But if you serve a document that is not well-formed as real XHTML (that is, as application/xhtml+xml), the user agent pukes and refuses to render the page, instead outputting an XML parse error message.

There are quite a few things that can go wrong here (and I'm sure I haven't encountered half of them). For example, if your blog post's title or content contains a plain & instead of &amp; (as the key=value pair separator in an HTTP query string, for example), then parsing will fail. An unclosed tag will fail too. A tag inside a tag where it is not allowed (a paragraph inside a paragraph, for example) is still well-formed, so the parser lets it through, but the page no longer validates. And so on.
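To make these failure modes concrete, here is a small sketch using Python's standard xml.etree.ElementTree. The helper name is_well_formed and the wrapping div are made up for the illustration. It only checks well-formedness, so a nesting problem such as a paragraph inside a paragraph passes here even though it would not validate.

```python
import xml.etree.ElementTree as ET

def is_well_formed(fragment):
    """Return True if the fragment parses as XML once wrapped in a root element."""
    try:
        ET.fromstring('<div>' + fragment + '</div>')
        return True
    except ET.ParseError:
        return False

print(is_well_formed('a=1&b=2'))               # False: bare ampersand
print(is_well_formed('a=1&amp;b=2'))           # True: ampersand escaped
print(is_well_formed('<p>unclosed'))           # False: missing </p>
print(is_well_formed('<p>one<p>two</p></p>'))  # True: well-formed, though not valid XHTML
```

A check like this could run on each entry's content before it is echoed into the page, so that one bad post does not take the whole page down.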

The solution below is obviously not the most elegant, but it works™ for me, provided that the content element contains a valid HTML fragment. I didn't want to spend too many hours on this, as I'm going to port it to Python in the near term anyway. For the code to work, you need to have the cURL and DOM extensions installed (note: I don't even know if the word extension is correct here, as I'm unsure if they are external modules loaded at run time or compiled into the PHP binary or what, but never mind). First, the feed is fetched with cURL, failing with an error message if no feed was loaded. After that the feed is parsed, and if all is good, HTML is formed from the updated, title, content and link elements. Notice the particularly inelegant solution for finding the link elements with rel attribute values of alternate and related. The error handling mechanism would need some enhancements, too.

<?php
$feedsrc = 'http://sivuraide.blogspot.com/feeds/posts/default?max-results=20';

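// Fetch the feed over HTTP; give up if no connection is made within 7 seconds.
$curl_handle = curl_init();
curl_setopt($curl_handle, CURLOPT_URL, $feedsrc);
curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 7);
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1); // return the body instead of printing it
curl_setopt($curl_handle, CURLOPT_HEADER, 0);         // leave the HTTP headers out of $data
$data = curl_exec($curl_handle);
curl_close($curl_handle);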

$errornotice = "<p>Could not load blog posts, yarr!</p>";
if(empty($data)) {
  echo $errornotice;
}
else {
  $doc = new DOMDocument();
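  // Suppress libxml errors and warnings; loadXML() returns FALSE if the feed is not well-formed.
  $parsestatus = $doc->loadXML($data, LIBXML_NOERROR | LIBXML_NOWARNING);
  // getElementsByTagName() always returns a DOMNodeList, so test its length, not its truthiness.
  $entries = $parsestatus ? $doc->getElementsByTagName('entry') : NULL;
  if(!$entries || $entries->length == 0) {
      echo $errornotice;
  }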
  else {
    foreach($entries as $entry) {
      $updated = $entry->getElementsByTagName('updated')->item(0)->nodeValue;
      $updated = substr($updated, 0, strpos($updated, 'T'));
      $title = htmlspecialchars($entry->getElementsByTagName('title')->item(0)->nodeValue);
      $content = $entry->getElementsByTagName('content')->item(0)->nodeValue;
      $related = NULL;
      $alternate = NULL;
      // Pick out the links by their rel attribute; if several match, the last one wins.
      $links = $entry->getElementsByTagName('link');
      foreach($links as $link) {
        if($link->getAttribute('rel') === 'related') {
          $related = htmlspecialchars($link->getAttribute('href'));
        }
        else if($link->getAttribute('rel') === 'alternate') {
          $alternate = htmlspecialchars($link->getAttribute('href'));
        }
      }
      echo "\t<p class=\"date\">$updated</p>\n";
      if($alternate) {
        echo "\t<h3><a href=\"" . $alternate . "\">$title</a></h3>\n";
      }
      else {
        echo "\t<h3>$title</h3>\n";
      }
      echo "\t$content\n";
      if($related) {
        echo "\t<p><a href=\"" . $related . "\">$related</a></p>\n";
      }
      echo "<hr/>\n\n";
    }
  }
}
?>
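
For comparison, here is a rough sketch of how the parse-and-render half might look in Python with the standard library's xml.etree.ElementTree. It is only an illustration under stated assumptions, not the Python port mentioned earlier: the names render_entries and link_href are invented for this example, and unlike the PHP loop, which keeps the last matching link, the path expression here picks the first link with a given rel value.

```python
import xml.etree.ElementTree as ET
from html import escape

ATOM = '{http://www.w3.org/2005/Atom}'  # Atom namespace in ElementTree notation

def link_href(entry, rel):
    """Return the href of the first <link> with the given rel, or None."""
    link = entry.find("%slink[@rel='%s']" % (ATOM, rel))
    return link.get('href') if link is not None else None

def render_entries(feed_xml):
    """Build an HTML fragment from Atom XML; return None if parsing fails."""
    try:
        root = ET.fromstring(feed_xml)
    except ET.ParseError:
        return None
    parts = []
    for entry in root.findall(ATOM + 'entry'):
        # Keep only the date part of a timestamp like 2008-11-20T10:00:00Z.
        date = escape(entry.findtext(ATOM + 'updated', default='').split('T')[0])
        title = escape(entry.findtext(ATOM + 'title', default=''))
        # The content is passed through as-is, so it must already be valid markup.
        content = entry.findtext(ATOM + 'content', default='')
        alternate = link_href(entry, 'alternate')
        related = link_href(entry, 'related')
        parts.append('\t<p class="date">%s</p>\n' % date)
        if alternate:
            parts.append('\t<h3><a href="%s">%s</a></h3>\n' % (escape(alternate), title))
        else:
            parts.append('\t<h3>%s</h3>\n' % title)
        parts.append('\t%s\n' % content)
        if related:
            parts.append('\t<p><a href="%s">%s</a></p>\n' % (escape(related), escape(related)))
        parts.append('<hr/>\n\n')
    return ''.join(parts)
```

A caller would treat both None (parse failure) and an empty string (no entries) as the "could not load blog posts" case.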

2008-11-20
