Archiving XML with Symphony

Symphony is an incredibly powerful CMS, but its ability to cache dynamic XML feeds is lacking. In this article I explain a simple method of parsing a third-party XML feed and archiving its contents in native Symphony sections.

Over the last six months or so, the Symphony content management system has become my framework of choice. The reasons are numerous and deserve a blog post of their own, but the foremost is that Symphony allows a developer to define their own data structures, query them, and have the output served as well-structured XML to be formatted into XHTML using XSLT views.

With such a focus on XML, it made sense to provide a "Dynamic XML" data source type that pulls an XML feed directly into the CMS. While this native functionality is useful for grabbing a feed of most recent Flickr photos or Twitter posts, I have found it insufficient in several respects:

To solve these problems (as alluded to on the Overture forums), I save items from the remote XML feeds into native Symphony Sections, so that they are archived and can be queried as native entries.

In the following example I'll explain how to archive your Twitter timeline into Symphony.

Section

The data is to be stored in a Section, so begin by creating a Section with the same data structure as the feed. In my example I have created a "Tweets" section with the following fields:

Figure 1. Tweets section

Event

Create the event "Save Tweets" from the Tweets Section just created. Make sure the "Allow Multiple" option is selected, so we can save more than one tweet with a single request.
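
With "Allow Multiple" enabled, each entry in the request sits under its own numeric index in `fields`. Here is a minimal sketch of what such a request body might look like, built with PHP's `http_build_query()`. The field handles (`id`, `message`, `date`) and the sample values are assumptions for illustration and should match whatever your own section defines:

```php
<?php

// Hypothetical sample entries; the handles (id, message, date) are assumed
// to match the fields of the Tweets section.
$entries = array(
    array('id' => '1001', 'message' => 'First tweet', 'date' => '18 August 2008 09:38'),
    array('id' => '1002', 'message' => 'Second tweet', 'date' => '18 August 2008 10:02'),
);

// Each entry gets its own numeric index under "fields"; the action key
// names the event to trigger.
$body = http_build_query(array(
    'action' => array('save-tweets' => 'Submit'),
    'fields' => $entries,
), '', '&');

// Decoded only for readability; the encoded string is what gets sent.
echo urldecode($body), "\n";
```

Letting `http_build_query()` do the encoding means spaces and ampersands in a tweet cannot break the request.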

Figure 2. Save Tweets events

Data Source

When saving tweets to the Tweets section, we need to be able to check whether the tweet already exists in our cache. Thankfully the Twitter API assigns a unique ID to each tweet that we can use. If an RSS feed is being used, the permalink or guid elements are suitable alternatives.

Once the Twitter feed has been loaded, we need to compare its XML against the existing entries in Symphony. To make that possible, create the data source "Cached Tweets" to select the most recent tweets. We don't need the full entry, just the ID element.
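
The comparison itself comes down to asking whether any cached entry carries a given ID. Here is a minimal sketch using `DOMXPath` on an in-memory document. The XML is a made-up stand-in for the data source output, the helper name `is_cached()` is hypothetical, and the sketch assumes IDs are purely numeric:

```php
<?php

// Made-up stand-in for the XML a "Cached Tweets" data source might output.
$cached_xml = '<cached-tweets>'
    . '<entry><id>1001</id></entry>'
    . '<entry><id>1002</id></entry>'
    . '</cached-tweets>';

$cached = new DOMDocument();
$cached->loadXML($cached_xml);
$xpath = new DOMXPath($cached);

// Hypothetical helper: true when an entry with this ID is already cached.
function is_cached(DOMXPath $xpath, $tweet_id)
{
    // Keep digits only (assumes numeric IDs) so the value is safe to place
    // inside the XPath string literal.
    $safe_id = preg_replace('/\D/', '', (string) $tweet_id);
    return $xpath->evaluate("count(//entry[id='" . $safe_id . "'])") > 0;
}

var_dump(is_cached($xpath, '1001')); // ID present in the sample XML
var_dump(is_cached($xpath, '1003')); // ID absent from the sample XML
```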

Figure 3. Cached Tweets data source

Page

Create the page "save-tweets" and attach the Save Tweets event and Cached Tweets data source to it. We need a smidgen of XSLT to output the data source as XML. This should do it:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="data">
    <xsl:copy-of select="cached-tweets" />  
</xsl:template>

</xsl:stylesheet>
Figure 4. Save Tweets page

The end result is a page that outputs the latest tweets as XML, and that also acts as a form handler to accept POST requests to add new tweets. All that remains is the brains of the operation — some custom PHP goodness.

The script is self-explanatory, but here's the outline:

  1. Fetch the XML feed of cached tweets from Symphony (created above)
  2. Fetch the latest tweets from the Twitter API
  3. Iterate through each status update in the Twitter feed
  4. Check whether an entry containing the same tweet ID exists in the Symphony XML
  5. If no entry exists, build multi-dimensional post variables to send to our page

cron.save-tweets.php

<?php

$page = "http://yourdomain.com/save-tweets/"; // URL of the Symphony page that outputs the latest cached tweets and has the Save Tweets event attached
$twitter_username = "username"; // screen name (e.g. twitter.com/username)
$new_tweets = array();

// Get the most recent cached tweets from Symphony
// (load() is an instance method; calling it statically fails on PHP 8)
$symphony_feed = new DOMDocument();
$symphony_feed->load($page);
$symphony_tweets = $symphony_feed->getElementsByTagName("entry");

// Get the most recent tweets from Twitter
$twitter_tweets = new DOMDocument();
$twitter_tweets->load("http://twitter.com/statuses/user_timeline.xml?id=$twitter_username");

// Loop through the Twitter tweets
foreach ($twitter_tweets->getElementsByTagName("status") as $tweet) {

    try {

        // Get tweet information from the XML, format date
        $tweet_id = $tweet->getElementsByTagName("id")->item(0)->nodeValue;
        $tweet_text = $tweet->getElementsByTagName("text")->item(0)->nodeValue;
        $tweet_date = date("d F Y H:i", strtotime($tweet->getElementsByTagName("created_at")->item(0)->nodeValue)); // Mon Aug 18 09:38:05 +0000 2008

        // Query the cached tweets XML using the tweet-id field to see whether the entry already exists in Symphony
        if ($symphony_tweets->length == 0) {
            $cached_tweet = 0;
        } else {
            $xpath = new DOMXPath($symphony_feed);
            $cached_tweet = $xpath->evaluate("count(//entry[id='" . $tweet_id . "'])");
        }

        // If the count() was zero, it is a new tweet so add to the new tweets array
        if ($cached_tweet == 0) {
            array_push($new_tweets, array($tweet_id, $tweet_text, $tweet_date));
        }

    } catch (Exception $ex) {
        var_dump($ex);
    }
}

// Set up initial POST variables
$post = "MAX_FILE_SIZE=104857600&action[save-tweets]=Submit";

// Build post variables for each new tweet
// (every value is URL-encoded so spaces and ampersands survive the request)
for ($i = 0; $i < count($new_tweets); $i++) {
    $post .= "&fields[$i][id]=" . urlencode($new_tweets[$i][0]);
    $post .= "&fields[$i][message]=" . urlencode($new_tweets[$i][1]);
    $post .= "&fields[$i][date]=" . urlencode($new_tweets[$i][2]);
}

// if there are new tweets, send a single POST request to the page
if (count($new_tweets) > 0) {
    $save_tweets = curl_init();   
    curl_setopt($save_tweets, CURLOPT_URL, $page);
    curl_setopt($save_tweets, CURLOPT_POST, 1);
    curl_setopt($save_tweets, CURLOPT_POSTFIELDS, $post);
    curl_setopt($save_tweets, CURLOPT_RETURNTRANSFER, TRUE);
    curl_exec ($save_tweets);
    curl_close ($save_tweets);  
}

?>

All that is left is to schedule this PHP script to run as a cron task on your server (how often depends on how much of a Twitter-aholic you are!).

There you have it: a relatively simple method to archive data feeds into Symphony. Imagine the possibilities of archiving references to all of your Flickr photos, or the ability to mine years of Last.fm plays. Thanks to Symphony's entirely flexible structure, you can recreate almost any data schema, and with some rudimentary PHP to perform basic XML DOM manipulation you end up with a pretty powerful system.

Comments

michael-e

Thank you very much for this article! You have indeed saved me a lot of work figuring this out.

Brian Drum

I really appreciate the write-up. This is exactly the direction I wanted to take my Symphony install, and I’ve already adapted the technique for Flickr.

It’s worth mentioning that the maximum number of results from your cached data source should be greater than or equal to the number of results you will get from your API response, or you will get duplicates.

Nick

Michael, Brian, you’re most welcome. That’s a good point Brian, one I forgot to mention.

Anders Thoresson

@Brian Drum: Would you mind posting your Flickr-adaptation over at Overture?

Brian Drum

@Anders Thoresson: I definitely intend to – I’ll try to get to it later this week. I had a meltdown with my local development install of Symphony, and I’d like to get that figured out first.

Skyler Richter

I have modified the cron.save-tweets.php to pull and archive delicious bookmarks but cant post all the code here… email me so i can give you the code..

Anders Thoresson

@nickdunn: In the loop setup, where the post is created out of all new tweets, you have

for($i=0; $i < count($new_jaiks) - 1; $i++)

But if there is just one new tweet, it wont be added to the database?

for($i=0; $i < count($new_jaiks); $i++)

works for me. Am I missing something?

@Brian Drum: Great!

Nick

@Anders: you’re quite right. I meant to use the “<=” instead of simply “<” but your fix works also. I have amended the code in the post. Ta :-)

Anders Thoresson

The usefulness of this is really big. Just adapted it to Delicious: I’m both importing the links one by one to my local database, but also creates a new blogpost in category “Read tips” with all new ones together.

Nick

Great! On one client site of ours we are caching twenty Flickr accounts, Bebo blogs, Twitter feeds and Vimeo video feeds (about 80 feeds in total) into various Sections and it works really well.

I’d be interested to see what other applications people find for this method.

Anders Thoresson

Nick, would you mind posting a link to that site?

Nick

@Anders: I believe we are under NDA so can’t disclose the site and its underlying technology. I’ll request permission from the powers above and see what they say.

Nick Dunn

Anders, the site in question is Battlefront. For each of the twenty official Campaigners we have a Bebo Blog, Flickr profile, Vimeo video account and a Twitter stream. Some also have a Blogger profile. These are polled at regular intervals (either parsing RSS feeds, API XML or HTML scraping) and the content inserted into the appropriate Symphony sections. The website editor is then alerted to new content and is able to moderate, approve/decline and push the content live.

Fazal

Hi Nick, Great write up!

I’ve been using Symphony for a few years now and also found this frustrating.

I’m more of a UI developer than backend. I understand that this script is written for Twitter, but do you think there is scope to turn this into a generic “cron” plugin for Symphony? i.e. add parameters via the admin interface as opposed to editing the script itself.

I ask because I’ve been playing with the idea of Lifestreams, Yahoo Pipes etc but would prefer to use Symphony as my tool to manage this information.

Thanks

David Martin

Awesomeness Nick. This helps out a ton. I was having some problems getting this to work on Media Temple, by default the DOMDocument::load() method is restricted due to code injection fears by Media Temple on their (GS). Instead of overriding the default php setup for Media Temple I updated your code to grab the XML using a curl and use DOMDocument::loadXML() from a string instead of a url. Works great.
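
A minimal sketch of the workaround David describes, assuming the cURL extension is available. The helper names `fetch_url()` and `parse_feed()` are hypothetical and not part of the original script:

```php
<?php

// Hypothetical helper: fetch a URL's body as a string with cURL.
function fetch_url($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // give up after ten seconds
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Hypothetical helper: parse an XML string, returning null when the input
// is empty or not well-formed.
function parse_feed($xml)
{
    // Collect parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $ok = is_string($xml) && $xml !== '' && $doc->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok ? $doc : null;
}

// Usage with a made-up sample string; in the script you would pass
// fetch_url($page) instead.
$doc = parse_feed('<statuses><status><id>1001</id></status></statuses>');
echo $doc->getElementsByTagName('status')->length, "\n";
```

Separating the fetch from the parse also makes it easy to skip a run when the feed comes back empty or malformed.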

Brian Zerangue

Nick -

This is awesome! I’m wondering at your Battlefront setup. Would you mind sharing once archived in Symphony if you make changes to the text and publish, what if changes are made at the original source. Does that overwrite the changes that you would have made?

Again thank you for sharing this! Really appreciate your willingness to post this. Very, very helpful!

BZ

Nick

@David: thanks for the tip. We’ve run this on (GS) without issue but perhaps your instance is locked-down hard.

@Brian: you’re welcome! Because of the nature of the site, for legal reasons all third party content is moderated prior to going live. Entries are inserted with a select box set to “pending”. A moderator reviews and batch updates entries, modifying the content if necessary.

The content isn’t sent back to the third party service. This was never a requirement of the site, and I would struggle to think of a scenario in which this would be useful. It is technically possible, since Symphony provides delegates in the backend to which an extension can subscribe; executing some custom code to publish back to Twitter, Flickr et al.

I should also note that the data is only cached once. If a photo description changes on Flickr and we have already pulled the original into Symphony, the updated description won’t be re-pulled back in. This was for moderation purposes and simplicity more than anything.

Brian Zerangue

Thanks Nick! That is very helpful. I work for a church in Dallas, Texas and we are moving all of our web stuff over to Symphony. We have this third party calendar software that we are generating an XML feed of list of events. The issue is that one items status might change from “live” event to “cancelled.” I was checking to see if it would check back to see if it the status changed (such as a cancellation).

Thank you again for posting this tutorial. It has been so helpful!!!

BZ

Josh N.

Did anyone else get a TON of duplicate entries a week ago? I think all my lost tweets came back and then didn’t match up with the IDs in my cached tweets DS, so they got imported over and over again. Any ideas on how to stop that from happening again? Should I up my cached tweets from 20 to 50 or so?

spacecowboyian

I keep going around and around with this. I’ve modified it to pull flickr and that’s working great you can see the $post output here http://modwaxdev.futurehat.com/flickr-cron.php

Is there anything wrong in that string? I am very new to POST and I’m not sure what I’m doing wrong. If there is nothing wrong with it then could it be my page?

I’ve been over everything that I could have mistyped about a million times but there is still likely something I have missed. What other info is needed to help me diagnose this problem?

Also, am I killing myself for no reason? Is there now an extension for this that I don’t know about?

Thanks,

Ian

Sam

Hi Nick,

Is there not a significant danger of having your database spammed and filled with crap from anyone who can guess the http://yourdomain.com/save-tweets/ page, which is actually listed in your menu.

Or am I missing something obvious?

S

Nick

Sam, yes indeed that is an issue.

There are several things you could do to get around this:

At the simplest level rename the page to something less guessable.

More complex you could amend the PHP to require a known token/secret string to be passed in the querystring such as /save-tweets/?secret=foobar. Inside the PHP check for the value of $_GET['secret'] and do not execute the script if the value is not foobar.

Going one step further one could enable the “Admin Only” filter on the Save Tweets Event. This means that a valid Symphony session or cookie is required to execute the event. Some PHP would be required to spoof a user session with a known UserID from your own installation.
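
The shared-secret option could be sketched like this. The parameter name `secret` and the value `foobar` come from the example above; the helper name `is_authorised()` is hypothetical, and `hash_equals()` needs PHP 5.6 or later:

```php
<?php

// Hypothetical helper: true only when the request carries the expected secret.
function is_authorised(array $query, $expected)
{
    // Treat a missing or non-string value as an empty string.
    $given = isset($query['secret']) && is_string($query['secret']) ? $query['secret'] : '';
    // hash_equals() compares in constant time, which avoids leaking the
    // secret through timing differences.
    return hash_equals($expected, $given);
}

// In the script itself this would be called with $_GET.
if (!is_authorised(array('secret' => 'foobar'), 'foobar')) {
    exit('Forbidden');
}
echo "Authorised\n";
```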

I am planning an updated version of this article, since it has produced so much interest; I will make sure to address security in this update. Cheers!
