RSS freshness...

LTjake on 2003-06-03T01:08:28

As described earlier, I'm working on a project dealing with RSS feeds.

Basically, I have a page which displays RSS feeds. It grabs the feeds from a cache, and the cache is updated by a script which periodically checks whether all of the feeds are fresh.

The hard part is checking whether a feed is fresh without actually downloading it.

There are some syndication rules that can be specified in an RSS feed -- but I find them confusing. The easiest way, I found, was to use some standard HTTP tricks: checking the "Last-Modified" header by sending an "If-Modified-Since" header, or checking the "ETag" by sending an "If-None-Match" header, seems to eliminate a lot of fuss. If only people used those headers more.
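The trick is just echoing back whatever validators the server sent last time. A minimal sketch of building those conditional request headers (the `last_modified`/`etag` field names and the sample values are illustrative, not from any real fetch):

```perl
use strict;
use warnings;

# Build conditional GET headers from whatever validators we saved
# from the last successful fetch. Both are optional: a server may
# send Last-Modified, an ETag, both, or neither.
sub conditional_headers {
    my ($saved) = @_;
    my %h;
    $h{'If-Modified-Since'} = $saved->{last_modified} if $saved->{last_modified};
    $h{'If-None-Match'}     = $saved->{etag}          if $saved->{etag};
    return \%h;
}

my $headers = conditional_headers({
    last_modified => 'Mon, 02 Jun 2003 18:41:48 GMT',
    etag          => '"26c0f3-f31-3edb4834"',
});

# Pass these along with the request (e.g. via LWP); a 304 response
# means the cached copy is still current and no body was re-sent.
print "$_: $headers->{$_}\n" for sort keys %$headers;
```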


RSS Feeds

jbisbee on 2003-06-03T04:22:16

I'm also working on a couple projects dealing with RSS Feeds: POE::Component::RSSAggregator and XML::RSS::Feed. I'd be interested to know what kind of stuff you're doing so that maybe we can collaborate on something.

Re:RSS Feeds

LTjake on 2003-06-03T12:14:27

Hi,

It seems like the two modules you've been working on are both used to poll RSS feeds, passing a name, a URL and a delay between checks. What I've done instead is schedule a script to run every couple of hours and check whether the feeds have been updated.

NOTE: This is all quite beta - things might change.

The data is stored in a simple XML file:

<opt tmpl_path="/rss/" max="5">
  <feed url="http://www.alternation.net/user/cake/cake_news.xml" custom="no" max="4" />
  <feed url="http://use.perl.org/journal.pl?op=display&amp;uid=3294&amp;content_type=rss" />
  <feed url="http://www.oss4lib.org/topics/list.php?func=rss" default="yes" />
  <feed url="http://www.oreillynet.com/meerkat/?_fl=rss10&amp;t=ALL&amp;c=303" />
  <feed url="http://search.cpan.org/rss/search.rss" If_Modified_Since="Mon, 02 Jun 2003 18:41:48 GMT" default="yes" />
  <feed If_None_Match="&quot;26c0f3-f31-3edb4834&quot;" url="http://www.zeldman.com/feed/zeldman.xml" />
  <feed url="file://c:/rss/feeds-local/fm.xml" If_Modified_Since="Wed, 28 May 2003 23:18:44 GMT" />
  <feed url="file://c:/rss/feeds-local/ftdate.xml" If_Modified_Since="Mon, 02 Jun 2003 18:33:19 GMT" default="yes" />
</opt>
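The per-feed max falling back to the top-level max can be sketched like this. The hash layout is an assumption -- roughly what an XML config loader such as XML::Simple would produce for the file above:

```perl
use strict;
use warnings;

# Assumed in-memory shape of the config above: a top-level default
# plus an array of per-feed hashes, some with their own max.
my $config = {
    max  => 5,
    feed => [
        { url => 'http://www.alternation.net/user/cake/cake_news.xml', max => 4 },
        { url => 'http://search.cpan.org/rss/search.rss', default => 'yes' },
    ],
};

# Each feed may override the global item limit.
sub max_items_for {
    my ($config, $feed) = @_;
    return defined $feed->{max} ? $feed->{max} : $config->{max};
}

for my $feed (@{ $config->{feed} }) {
    printf "%s => %d items\n", $feed->{url}, max_items_for($config, $feed);
}
```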

There are a few different options in there. tmpl_path is the path to your templates. max is the default maximum number of items to display for each feed; each feed can also have its own max. The only required param for each feed element is the url. custom="no" marks a "sticky" feed, which the user can never un-select. The default="yes" feeds, along with any sticky feeds, are displayed in the default setup.

The other attributes, like If_Modified_Since, are automatically inserted by the feed-polling script if it gets that information in the returned headers. This allows it to do the conditional GETs described in my initial post.

The basic poll_feeds.pl script looks like this:

  • init cache
  • load config (into a hash)
  • foreach feed in the config:
      • grab it from the cache
      • if it's not cached, download the feed, store it in the cache, and update the config hash if you get any useful headers
      • else, add any conditional GET headers to the request and send it; if you get a 304 back, the feed you have is current, otherwise store the data in the cache and update the config hash if you get any useful headers
  • write the updated config, if needed
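The steps above could be sketched like this. fetch_feed is a stand-in stub for the real conditional GET (it pretends any feed that already has a validator is unchanged), so only the control flow is real here:

```perl
use strict;
use warnings;

# Stub for the real HTTP request: feeds that already carry an
# If_Modified_Since validator come back 304 (unchanged); new feeds
# come back 200 with content and a Last-Modified header to remember.
sub fetch_feed {
    my ($feed) = @_;
    return { status => 304 } if $feed->{If_Modified_Since};
    return {
        status        => 200,
        content       => '<rss>...</rss>',
        last_modified => 'Mon, 02 Jun 2003 18:41:48 GMT',
    };
}

my %cache;   # url => feed body (a real script would persist this)
my $config = {
    feed => [
        { url => 'http://example.com/a.xml' },
        { url => 'http://example.com/b.xml',
          If_Modified_Since => 'Wed, 28 May 2003 23:18:44 GMT' },
    ],
};
my $config_dirty = 0;

for my $feed (@{ $config->{feed} }) {
    my $res = fetch_feed($feed);
    next if $res->{status} == 304;            # cached copy is current
    $cache{ $feed->{url} } = $res->{content};
    if ($res->{last_modified}) {              # remember the validator
        $feed->{If_Modified_Since} = $res->{last_modified};
        $config_dirty = 1;
    }
}
print "config needs writing back out\n" if $config_dirty;
```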

The only problem with that script is that, for any feed without conditional GET data, it has to download the entire feed each time poll_feeds.pl runs. Bummer.

The script that actually displays the feeds is a CGI::App based script.

As a side note, in all of my CGI::App based scripts, I now use this cgiapp_get_query method:

use CGI::Simple;

sub cgiapp_get_query {
    my $self = shift;
    return CGI::Simple->new();
}

I get much better response times with CGI::Simple.

The app is pretty straightforward, except that I have subclassed XML::RSS in order to store a "key" along with each feed. The "key" in this case is the URL of the feed. I then keep all of my subclassed XML::RSS objects in an array, stored as a reference in a CGI::App param.
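The subclass idea is just stashing one extra field on the object. Assuming a hash-based parent class like XML::RSS, it might look like this (a minimal stub parent is used here so the snippet runs on its own; class and method names are made up):

```perl
use strict;
use warnings;

# Stand-in parent so this snippet is self-contained; in the real app
# the parent would be XML::RSS, which is also a hash-based class.
package My::Stub::RSS;
sub new { my $class = shift; return bless { @_ }, $class }

package My::Keyed::RSS;
our @ISA = ('My::Stub::RSS');

sub new {
    my ($class, %args) = @_;
    my $key  = delete $args{key};            # the feed's URL
    my $self = $class->SUPER::new(%args);
    $self->{_key} = $key;
    return $self;
}

sub key { return $_[0]->{_key} }

package main;
my $feed = My::Keyed::RSS->new( key => 'http://example.com/a.xml' );
print $feed->key, "\n";
```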

The app allows the user to customize their experience. The only things stored in the cookie are url => maxitems key-value pairs.
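One simple way to pack those url => maxitems pairs into a single cookie value and back (the delimiter scheme here is entirely an assumption, not from the original app: '|' between pairs and a space inside each pair, on the grounds that URLs never contain either):

```perl
use strict;
use warnings;

# Serialize url => maxitems pairs for storage in one cookie value.
# Assumed format: "url max|url max|...". Works because URLs contain
# neither spaces nor '|'.
sub pack_prefs {
    my (%prefs) = @_;
    return join '|', map { "$_ $prefs{$_}" } sort keys %prefs;
}

sub unpack_prefs {
    my ($cookie) = @_;
    return map { split / /, $_, 2 } split /\|/, $cookie;
}

my $cookie = pack_prefs(
    'http://example.com/a.xml' => 3,
    'http://example.com/b.xml' => 5,
);
print "$cookie\n";

my %prefs = unpack_prefs($cookie);
print "$_ => $prefs{$_}\n" for sort keys %prefs;
```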

Anyway, that's the gist of it, I guess.

HTH

RSS some good some bad....

ajt on 2003-06-03T09:09:40

I've been through the same hoops too. The theory states that the feed should contain good metadata telling you when its components were made and how fresh they are. In practice, given the generally poor quality and frequent incompleteness of feeds, this doesn't work.

I wrote XML::RSS::Tools to handle some of the problems I faced; I mostly use it to transform RSS to HTML via XSLT. It's not perfect, and there are many other ways of attacking the problem, but at least my module prompted others to take over and mostly fix XML::RSS, so some good came of my work.

I've used my module in two ways. In the simple method, I run wget in batch to download the feeds if they are newer than the current version, then I use my module to do all the conversion.

In a second, more sophisticated method, I grab a feed on demand, transform it, store the result in Cache::Cache, and then display it. The next time the page is requested, I check the local cache first (bit of a cheat) before attempting a download. You could extend this to store both the feed and the potentially different templated results if you wanted, but I've not done that.
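That cache-first pattern can be sketched like this, with a plain hash and expiry timestamps standing in for Cache::Cache, and a stub standing in for the download-and-XSLT step (the one-hour TTL is a made-up value):

```perl
use strict;
use warnings;

my %cache;        # url => { html => ..., expires => epoch seconds }
my $ttl = 3600;   # assumed freshness window: one hour

# Stub for "download the feed and transform it to HTML via XSLT".
sub fetch_and_transform {
    my ($url) = @_;
    return "<ul><li>items from $url</li></ul>";
}

sub html_for {
    my ($url) = @_;
    my $hit = $cache{$url};
    # Check the local cache first...
    return $hit->{html} if $hit && $hit->{expires} > time;
    # ...and only fetch and transform on a miss or an expired entry.
    my $html = fetch_and_transform($url);
    $cache{$url} = { html => $html, expires => time + $ttl };
    return $html;
}

print html_for('http://example.com/a.xml'), "\n";  # miss: fetches
print html_for('http://example.com/a.xml'), "\n";  # hit: from cache
```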

Re:RSS some good some bad....

LTjake on 2003-06-03T12:30:36

Hey,

In my reply, here, I state that I store the actual feed in the cache. That may change to storing the rendered output if I really need further optimization -- but that would be premature optimization at this stage.

A while back I was looking at RSS modules for Perl and wasn't thrilled with XML::RSS. However, that quickly changed: it was updated, and now it's part of the core of how my project works. Thanks =)

I submitted a bug report indicating that skipHours and skipDays are used improperly. Although the docs say the methods are broken, the description of the methods contradicts the spec. If that ever gets fixed (which I may end up doing myself), then I might be able to eliminate some of the GETs I do based on that data. But there's also the syndication module to deal with...
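If skipHours ever gets fixed, honouring it in the polling script would be a one-line check: per the RSS 0.91 spec, skipHours lists the GMT hours during which aggregators may skip reading the feed. A sketch (the hour list is hypothetical):

```perl
use strict;
use warnings;

# Returns true if the current GMT hour appears in the feed's
# skipHours list, meaning we can skip this poll entirely.
sub skip_this_hour {
    my ($gmt_hour, @skip_hours) = @_;
    return scalar grep { $_ == $gmt_hour } @skip_hours;
}

my @skip_hours = (0 .. 6);     # hypothetical feed data: skip overnight
my $hour = (gmtime)[2];        # current hour, GMT, 0-23

print skip_this_hour($hour, @skip_hours)
    ? "skip this poll\n"
    : "go ahead and poll\n";
```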