Third Person

Nick Dunn is a lead front-end developer at Airlock, a digital creative agency in Shoreditch, London. He is passionate about accessibility, user experience and code-indenting. He recently played “Forget Myself” by Elbow.

First Person

I like to bookmark things that interest me. I also like to plan things, take photos of things I see, and tap my feet to a groove. I sometimes write tiny messages, and longer ones too. I have an online CV and a collection of links that are and aren't about me.

Archiving XML with Symphony

27th November 2008, 24 comments

Symphony is an incredibly powerful CMS, but its ability to cache dynamic XML feeds is lacking. In this article I explain a simple method of parsing a third-party XML feed and archiving its contents in native Symphony sections.

Over the last six months or so, the Symphony content management system has become my framework of choice. The reasons are numerous and deserve their own blog post entirely; but the foremost reason is that Symphony allows a developer to define their own data structures, query them, and have the output served as well-structured XML to be formatted into XHTML using XSLT views.

With such a focus on XML, it made sense to provide a “Dynamic XML” data source type that pulls an XML feed directly into the CMS. While this native functionality is useful for grabbing a feed of the most recent Flickr photos or Twitter posts, I have found it insufficient in two respects:

  • if the XML feed is unavailable or invalid at the time of caching, the data source will fail
  • content in the feed is not archived — when content is removed from the remote feed, we can no longer access it

In order to solve these problems (as alluded to on the Overture forums) I have used a method of saving items from the remote XML feeds into native Symphony Sections, so that they are archived and can be queried as native entries.

In the following example I’ll explain how to archive your Twitter timeline into Symphony.

Section

The data is to be stored in a Section, so begin by creating a Section with the same data structure as the feed. In my example I have created a “Tweets” section with the following fields:

  • ID (Textfield)
  • Message (Textarea)
  • Date (Date)

Figure 1. Tweets section

Event

Create the event “Save Tweets” from the Tweets Section just created. Make sure the “Allow Multiple” option is selected, so we can save more than one tweet with a single request.
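With “Allow Multiple” enabled, the event expects each entry’s fields under a numeric index, such as fields[0][id] and fields[1][id]. As a minimal sketch, assuming field handles of id, message and date and some made-up sample values, PHP’s http_build_query() can produce a request body of that shape from a nested array:

```php
<?php
// Hypothetical sample entries; the keys assume field handles "id", "message" and "date".
$entries = array(
    array('id' => '1001', 'message' => 'First tweet & more', 'date' => '18 August 2008 09:38'),
    array('id' => '1002', 'message' => 'Second tweet', 'date' => '18 August 2008 10:05'),
);

// The action key assumes the event handle is "save-tweets", as used in this article.
$body = http_build_query(array(
    'action' => array('save-tweets' => 'Submit'),
    'fields' => $entries,
));

// parse_str() decodes the body the way PHP would populate $_POST on the receiving page.
parse_str($body, $decoded);
echo $decoded['fields'][1]['id'], "\n"; // ID of the second entry
```

Because http_build_query() URL-encodes every key and value, characters such as “&” or spaces inside a message cannot break the request apart.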

Figure 2. Save Tweets events

Data Source

When saving tweets to the Tweets section, we need to be able to check whether the tweet already exists in our cache. Thankfully the Twitter API assigns a unique ID to each tweet that we can use. If an RSS feed is being used, the permalink or guid elements are suitable alternatives.

Once the Twitter feed has been loaded, we need to compare the XML against existing entries in Symphony. Therefore create the data source “Cached Tweets” to select the most recent tweets. We don’t need the full entry, just the ID element.
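One way to picture the comparison is as a set lookup: collect the cached IDs once, then test each incoming ID against them. The sketch below uses an inline sample in place of the real data source output; the cached-tweets/entry/id shape and the ID values are assumptions for illustration, so check them against what your own page emits.

```php
<?php
// Assumed sample of the data source XML; real output may carry extra attributes.
$cached_xml = '<cached-tweets>'
    . '<entry><id>1001</id></entry>'
    . '<entry><id>1002</id></entry>'
    . '</cached-tweets>';

$doc = new DOMDocument();
$doc->loadXML($cached_xml);

// Collect every cached ID as an array key so each lookup is a simple isset().
$xpath = new DOMXPath($doc);
$cached_ids = array();
foreach ($xpath->query('//entry/id') as $node) {
    $cached_ids[trim($node->nodeValue)] = true;
}

// Hypothetical IDs from the remote feed; only the unseen ones are kept.
$incoming_ids = array('1002', '1003');
$new_ids = array();
foreach ($incoming_ids as $id) {
    if (!isset($cached_ids[$id])) {
        $new_ids[] = $id;
    }
}

print_r($new_ids); // only 1003 is new
```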

Figure 3. Cached Tweets data source

Page

Create the page “save-tweets” and attach the Save Tweets event and Cached Tweets data source to it. We need a smidgen of XSLT to output the data source as XML. This should do it:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="data">
    <xsl:copy-of select="cached-tweets" />  
</xsl:template>

</xsl:stylesheet>

Figure 4. Save Tweets page

The end result is a page that outputs the latest tweets as XML, and that also acts as a form handler to accept POST requests to add new tweets. All that remains is the brains of the operation — some custom PHP goodness.

The script is self-explanatory, but here’s the outline:

  1. Fetch the XML feed of cached tweets from Symphony (created above)
  2. Fetch the latest tweets from the Twitter API
  3. Iterate through each status update in the Twitter feed
  4. Check whether an entry containing the same tweet ID exists in the Symphony XML
  5. If no entry exists, build multi-dimensional post variables to send to our page

cron.save-tweets.php

<?php

$page = "http://yourdomain.com/save-tweets/"; // URL of the Symphony page that outputs the cached tweets and has the Save Tweets event attached
$twitter_username = "username"; // screen name (e.g. twitter.com/username)
$new_tweets = array();

// Get the most recent cached tweets from Symphony
// load() is an instance method; calling it statically fails on PHP 8
$symphony_feed = new DOMDocument();
$symphony_feed->load($page);
$symphony_tweets = $symphony_feed->getElementsByTagName("entry");

// Get the most recent tweets from Twitter
$twitter_tweets = new DOMDocument();
$twitter_tweets->load("http://twitter.com/statuses/user_timeline.xml?id=$twitter_username");

// Loop through the Twitter tweets
foreach ($twitter_tweets->getElementsByTagName("status") as $tweet) {

    try {

        // Get tweet information from the XML, format date
        $tweet_id = $tweet->getElementsByTagName("id")->item(0)->nodeValue;
        $tweet_text = $tweet->getElementsByTagName("text")->item(0)->nodeValue;
        $tweet_date = date("d F Y H:i", strtotime($tweet->getElementsByTagName("created_at")->item(0)->nodeValue)); // Mon Aug 18 09:38:05 +0000 2008

        // Query the cached tweets XML using the tweet-id field to see whether the entry already exists in Symphony
        if ($symphony_tweets->length == 0) {
            $cached_tweet = 0;
        } else {
            $xpath = new DomXPath($symphony_feed);
            $cached_tweet = $xpath->evaluate("count(//entry[id='" . $tweet_id . "'])");
        }

        // If the count() was zero, it is a new tweet so add to the new tweets array
        if ($cached_tweet == 0) {
            array_push($new_tweets, array($tweet_id, $tweet_text, $tweet_date));
        }

    } catch (Exception $ex) {
        var_dump($ex);
    }
}

// Set up initial POST variables
$post = "MAX_FILE_SIZE=104857600&action[save-tweets]=Submit";

// Build post variables for each new tweet
for($i=0; $i < count($new_tweets); $i++) {
    $post .= "&fields[$i][id]=" . urlencode($new_tweets[$i][0]);
    $post .= "&fields[$i][message]=" . urlencode($new_tweets[$i][1]);
    $post .= "&fields[$i][date]=" . urlencode($new_tweets[$i][2]); // the date contains spaces
}

// if there are new tweets, send a single POST request to the page
if (count($new_tweets) > 0) {
    $save_tweets = curl_init();
    curl_setopt($save_tweets, CURLOPT_URL, $page);
    curl_setopt($save_tweets, CURLOPT_POST, 1);
    curl_setopt($save_tweets, CURLOPT_POSTFIELDS, $post);
    curl_setopt($save_tweets, CURLOPT_RETURNTRANSFER, TRUE);
    curl_exec($save_tweets);
    curl_close($save_tweets);
}

?>

All that remains is to schedule this PHP script to run as a cron task on your server (how often depends on how much of a Twitter-aholic you are!).

There you have it — a relatively simple method to archive data feeds into Symphony. Imagine the possibilities of archiving references to all of your Flickr photos, or the ability to mine years of Last.fm plays. Thanks to Symphony’s entirely flexible structure, you can recreate almost any data schema and with some rudimentary PHP to perform basic XML DOM manipulation, you end up with a pretty powerful system.

There have been 24 comments.

michael-e 01 December 2008 11:20

Thank you very much for this article! You have indeed saved me a lot of work figuring this out.

Brian Drum 06 December 2008 17:33

I really appreciate the write-up. This is exactly the direction I wanted to take my Symphony install, and I’ve already adapted the technique for Flickr.

It’s worth mentioning that the maximum number of results from your cached data source should be greater than or equal to the number of results you will get from your API response, or you will get duplicates.

Nick 06 December 2008 17:49

Michael, Brian, you’re most welcome. That’s a good point Brian, one I forgot to mention.

Josh N. 10 December 2008 21:50

Thanks for this! I think it’ll be really handy.

Anders Thoresson 11 December 2008 10:50

@Brian Drum: Would you mind posting your Flickr-adaptation over at Overture?

Brian Drum 13 December 2008 13:58

@Anders Thoresson: I definitely intend to – I‘ll try to get to it later this week. I had a meltdown with my local development install of Symphony, and I’d like to get that figured out first.

Skyler Richter 15 December 2008 17:07

I have modified the cron.save-tweets.php to pull and archive Delicious bookmarks but can’t post all the code here… email me so I can give you the code.

Anders Thoresson 16 December 2008 11:13

@nickdunn: In the loop setup, where the post is created out of all new tweets, you have

for($i=0; $i < count($new_jaiks) - 1; $i++)

But if there is just one new tweet, it won’t be added to the database?

for($i=0; $i < count($new_jaiks); $i++) works for me. Am I missing something?

@Brian Drum: Great!

Nick 16 December 2008 21:00

@Anders: you’re quite right. I meant to use the “<=” instead of simply “<” but your fix works also. I have amended the code in the post. Ta :-)

Anders Thoresson 22 December 2008 10:11

The usefulness of this is really big. Just adapted it to Delicious: I’m both importing the links one by one to my local database and creating a new blog post in the category “Read tips” with all the new ones together.

Nick 22 December 2008 12:16

Great! On one client site of ours we are caching twenty Flickr accounts, Bebo blogs, Twitter feeds and Vimeo video feeds (about 80 feeds in total) into various Sections and it works really well.

I’d be interested to see what other applications people find for this method.

Anders Thoresson 29 December 2008 19:48

Nick, would you mind posting a link to that site?

Nick 30 December 2008 01:19

@Anders: I believe we are under NDA so can’t disclose the site and its underlying technology. I’ll request permission from the powers above and see what they say.

Nick Dunn 05 January 2009 16:49

Anders, the site in question is Battlefront. For each of the twenty official Campaigners we have a Bebo Blog, Flickr profile, Vimeo video account and a Twitter stream. Some also have a Blogger profile. These are polled at regular intervals (either parsing RSS feeds, API XML or HTML scraping) and the content inserted into the appropriate Symphony sections. The website editor is then alerted to new content and is able to moderate, approve/decline and push the content live.

Anders Thoresson 08 January 2009 14:34

Nice. Thanks.

Fazal 04 February 2009 22:31

Hi Nick, Great write up!

I’ve been using Symphony for a few years now and also found this frustrating.

I’m more of a UI developer than backend. I understand that this script is written for Twitter, but do you think there is scope to turn this into a generic “cron” plugin for Symphony? i.e. add parameters via the admin interface as opposed to editing the script itself.

I ask because I’ve been playing with the idea of Lifestreams, Yahoo Pipes etc but would prefer to use Symphony as my tool to manage this information.

Thanks

David Martin 10 February 2009 21:45

Awesomeness Nick. This helps out a ton. I was having some problems getting this to work on Media Temple: by default, Media Temple restricts the DOMDocument::load() method on their (GS) due to code injection fears. Instead of overriding the default PHP setup for Media Temple, I updated your code to grab the XML using cURL and use DOMDocument::loadXML() from a string instead of a URL. Works great.
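A rough sketch of the approach David describes, assuming the cURL and DOM extensions are available. The helper names fetch_xml_string() and parse_xml_string() are made up for this example, and the timeout value is arbitrary.

```php
<?php
// Fetch a URL with cURL and return the response body, or false on failure.
function fetch_xml_string($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // arbitrary timeout in seconds
    return curl_exec($ch);
}

// Parse an XML string into a DOMDocument, or return null if it is empty or malformed.
function parse_xml_string($xml) {
    if (!is_string($xml) || $xml === '') {
        return null;
    }
    $previous = libxml_use_internal_errors(true); // collect parse errors quietly
    $doc = new DOMDocument();
    $ok = $doc->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok ? $doc : null;
}

// In the script this would stand in for the direct load() call, for example:
// $symphony_feed = parse_xml_string(fetch_xml_string($page));
$sample = parse_xml_string('<statuses><status><id>42</id></status></statuses>');
echo $sample->getElementsByTagName('id')->item(0)->nodeValue, "\n";
```

Returning null for a bad response also gives the script a chance to skip a run when the remote feed is unavailable or invalid, rather than failing part-way through.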

Brian Zerangue 02 March 2009 14:31

Nick -

This is awesome! I’m wondering about your Battlefront setup. Would you mind sharing what happens if, once content is archived in Symphony, you make changes to the text and publish, and changes are then made at the original source? Does that overwrite the changes that you would have made?

Again thank you for sharing this! Really appreciate your willingness to post this. Very, very helpful!

BZ

Nick 02 March 2009 14:53

@David: thanks for the tip. We’ve run this on (GS) without issue but perhaps your instance is locked-down hard.

@Brian: you’re welcome! Because of the nature of the site, for legal reasons all third party content is moderated prior to going live. Entries are inserted with a select box set to “pending”. A moderator reviews and batch updates entries, modifying the content if necessary.

The content isn’t sent back to the third party service. This was never a requirement of the site, and I would struggle to think of a scenario in which this would be useful. It is technically possible, since Symphony provides delegates in the backend to which an extension can subscribe; executing some custom code to publish back to Twitter, Flickr et al.

I should also note that the data is only cached once. If a photo description changes on Flickr and we have already pulled the original into Symphony, the updated description won’t be re-pulled back in. This was for moderation purposes and simplicity more than anything.

Brian Zerangue 02 March 2009 18:41

Thanks Nick! That is very helpful. I work for a church in Dallas, Texas and we are moving all of our web stuff over to Symphony. We have third-party calendar software from which we generate an XML feed listing events. The issue is that one item’s status might change from a “live” event to “cancelled.” I was checking to see whether it would check back to see if the status changed (such as a cancellation).

Thank you again for posting this tutorial. It has been so helpful!!!

BZ

Josh N. 17 March 2009 18:52

The Cron Job panel at Dreamhost tripped me up with this, until I found this wiki entry on how to run PHP in a cron.

Josh N. 24 March 2009 15:25

Did anyone else get a TON of duplicate entries a week ago? I think all my lost tweets came back and then didn’t match up with the IDs in my cached tweets DS, so they got imported over and over again. Any ideas on how to stop that from happening again? Should I up my cached tweets from 20 to 50 or so?

spacecowboyian 29 May 2009 16:24

I keep going around and around with this. I’ve modified it to pull Flickr and that’s working great; you can see the $post output here: http://modwaxdev.futurehat.com/flickr-cron.php

Is there anything wrong in that string? I am very new to POST and I’m not sure what I’m doing wrong. If there is nothing wrong with it then could it be my page?

I’ve been over everything that I could have mistyped about a million times but there is still likely something I have missed. What other info is needed to help me diagnose this problem?

Also, am I killing myself for no reason? Is there now an extension for this that I don’t know about?

Thanks,

Ian

