Web Cache Thing

ajt on 2004-02-24T13:40:58

Yesterday I was playing with xmltv feeds. I knocked up a little app that connects to a remote server, grabs a file (in this case XML), stores it in a cache, passes it into a templating engine (XSLT), and spits the result out as a web page.

The code is based on an RSS display tool I previously wrote. It's the same principle, grab something, transform it, and display it. In both cases I don't want to grab the data every time, I'm happy to use a local copy if it's only a few hours old.

I realised that a lot of the code could be abstracted out into a module. The transformation element may be too specialised, but we shall see.

  • The core module is fairly simple: you start with new, passing in the cache settings for your application.
  • Then you ask for a resource with a URI.
    • It looks in the cache to see if it has a local copy that's new enough. The cache would be a plug-in, so you could use something from Cache::Cache, a DBM file, SQLite, or a full-blown SQL DB.
    • If it's not in the cache, or it's expired, then it gets the asset. Plug-ins would support file, http, ftp and so on.
    • The new data is cached.
    • Perhaps the data is transformed via a plug-in.
    • Perhaps the transformed data is stored as well?
  • Your app does what it wants with the data.
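If nothing like this exists, the flow above might sketch out like so in Perl. Everything here is invented for illustration: the package name URI::Cached, the code-ref "plug-ins", and the MD5-of-URI cache key are assumptions, not an existing CPAN API.

```perl
package URI::Cached;    # hypothetical name, nothing on CPAN is implied
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

sub new {
    my ($class, %args) = @_;
    my $self = {
        cache_dir => $args{cache_dir} || '/tmp/uri-cache',
        max_age   => $args{max_age}   || 3600,   # seconds before a copy expires
        fetch     => $args{fetch},               # fetch plug-in (code ref)
        transform => $args{transform},           # optional transform plug-in
    };
    mkdir $self->{cache_dir} unless -d $self->{cache_dir};
    return bless $self, $class;
}

sub get {
    my ($self, $uri) = @_;
    my $file = $self->{cache_dir} . '/' . md5_hex($uri);
    my $data;

    if (-e $file && time - (stat $file)[9] < $self->{max_age}) {
        # Local copy is new enough: use it
        open my $fh, '<', $file or die "read $file: $!";
        local $/;
        $data = <$fh>;
        close $fh;
    }
    else {
        # Not cached, or expired: get the asset and cache the new data
        $data = $self->{fetch}->($uri);
        open my $fh, '>', $file or die "write $file: $!";
        print {$fh} $data;
        close $fh;
    }

    # Optionally transform via a plug-in (XSLT, RSS-to-HTML, whatever)
    $data = $self->{transform}->($data) if $self->{transform};
    return $data;
}

1;
```

Here the "plug-ins" are just code refs; a real module would presumably dispatch on URI scheme for fetching and delegate storage to something like Cache::Cache.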

I'm not trying to re-invent Squid, I just want a simple URI-fetching tool that can cache data, so you can call the app many times without it checking the source data every time.

Does this exist on CPAN already? If so, where? If it doesn't exist already, what should I call it?


LWP

Matts on 2004-02-24T14:53:16

Why not just use LWP's mirror() function for this? Does it have to be as complex as you describe?
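For reference, mirror() writes the response to a local file, and on later calls sends an If-Modified-Since header based on that file's timestamp, so an unchanged page comes back as a 304 and the file is left alone. A small sketch (the wrapper sub, URL, and filename are just examples):

```perl
use strict;
use warnings;
use LWP::UserAgent;

# mirror() sends If-Modified-Since based on the local file's timestamp
# and only rewrites the file when the server returns fresh content.
sub mirror_feed {
    my ($url, $file) = @_;
    my $ua  = LWP::UserAgent->new;
    my $res = $ua->mirror($url, $file);

    if ($res->code == 304) {
        print "Not modified -- local copy is still good\n";
    }
    elsif ($res->is_success) {
        print "Fetched a fresh copy\n";
    }
    else {
        warn "Mirror failed: ", $res->status_line, "\n";
    }
    return $res;
}

# e.g. mirror_feed('http://example.com/listings.xml', 'listings.xml');
```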

Re:LWP

ajt on 2004-02-24T15:57:07

Interesting idea. I knew LWP did some caching, but I thought it was in-memory caching within a single process, not file caching. This is probably enough, certainly for me anyway.

However LWP still has to make an HTTP request to determine if the page has changed, and some feed providers may still dislike the frequency of the requests, even if the responses are only "304 Not Modified".

My other option is to use cron and Wget to obtain the data, and try not to do my data gathering at page-viewing time. As with LWP, Wget can do a conditional GET (if-modified-since), which should be enough not to upset the feed provider. By setting the cron job infrequently enough I can control the load on the data feed, independently of the page requests.
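As a sketch (the path, URL, and schedule are invented for the example), the crontab entry might look like:

```
# Fetch the feed four times a day, on the hour.
# -N (timestamping) only downloads when the remote file is newer than
# the local copy, so an unchanged feed costs a header exchange at most.
0 */6 * * *  wget -q -N -P /var/cache/feeds http://example.com/listings.xml
```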

For the kind of data I'm pulling it's okay if it's a little stale, so I want to avoid going back to the source for a while if I can. Given that I'm not planning to redistribute the feeds (it's for home intranet use only), it's probably a lot of effort for nothing....

Re:LWP

Matts on 2004-02-24T16:44:25

I think if you pull the feed as you're viewing it you're likely to put LESS strain on the remote end than they normally see from a script running 48 times a day. Unless you hit refresh more than 48 times.

Honestly, anyone getting antsy about an individual user refreshing their RSS feed a few times in a few minutes has way too much time on their hands, and doesn't understand the web.

Re:LWP

Dom2 on 2004-02-24T19:00:18

You can probably subclass LWP::UserAgent and override mirror() to make it check for an explicit expiry date (assuming the HTTP headers are stored somewhere). If you're not past the expiry date, you should be able to use the file without revalidation.
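Something along these lines, perhaps. This is an untested sketch that cheats by using the file's mtime and a fixed max_age as a stand-in for a real stored Expires header; the class name and the max_age knob are invented:

```perl
package My::UserAgent;    # hypothetical subclass name
use strict;
use warnings;
use base 'LWP::UserAgent';
use HTTP::Response;

# If the local file is younger than max_age seconds, skip the network
# entirely and synthesise a 304; otherwise fall through to LWP's
# normal If-Modified-Since revalidation.
sub mirror {
    my ($self, $url, $file) = @_;
    my $max_age = $self->{max_age} || 3600;    # seconds

    if (-e $file && time - (stat $file)[9] < $max_age) {
        return HTTP::Response->new(304, 'Using cached copy');
    }
    return $self->SUPER::mirror($url, $file);
}

1;
```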

-Dom

Re:LWP

ajt on 2004-02-24T23:02:42

Good idea, I hadn't thought of that.