Mirroring OpenGuides websites

ivorw on 2005-10-17T10:18:20

In the aftermath of the September 2005 incident, I have been thinking about steps we can take to prevent a recurrence, or at least minimise its impact.

As it is, I have lost quite a number of writeups and updates that I made over the last six months.

I have been looking at decentralising the OpenGuides data by using one or more mirror sites, which hold all the data and are kept up to date. I'm using CPAN mirroring as my model.

As for how to do this: each page has wiki text and metadata. The wiki text can be retrieved using format=raw.

This was recently implemented by hex (cheers mate!) - though you could previously achieve the same result by using action=edit and scraping the HTML response for the form field that holds the text.
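As a rough sketch of what a mirror's fetch step might look like, here is a small helper that builds the request URL and pulls the page down. The script URL, the "id" parameter name and the UTF-8 decoding are my assumptions for illustration; only the format parameter comes from the description above.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def page_url(script_url, page_id, fmt):
    """Build a request URL for one page in the given output format.

    Assumption: the guide takes the page name in an "id" parameter.
    """
    query = urlencode({"id": page_id, "format": fmt})
    return f"{script_url}?{query}"


def fetch_page(script_url, page_id, fmt="raw"):
    """Fetch one page from the source guide and return it as text.

    Assumption: the response is UTF-8; an older guide may need another codec.
    """
    with urlopen(page_url(script_url, page_id, fmt)) as response:
        return response.read().decode("utf-8")
```

Calling fetch_page("http://guide.example/wiki.cgi", "Some Pub") would request the raw wiki text of that (made-up) page from that (made-up) guide.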

The metadata is obtained in RDF/XML format using format=rdf. This has highlighted a number of issues, resulting in several RT bug tickets for OpenGuides. It has also resulted in a CPAN module, OpenGuides::RDF::Reader, which standardises data retrieval from OG by mapping namespace-qualified tag names onto more directly meaningful hash keys. In principle, these translated hash keys match the values stored in the metadata_type column of the metadata table.
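To illustrate the tag-to-key mapping idea (this is not the code or the mapping table of OpenGuides::RDF::Reader, just a sketch in Python), the following turns namespace-qualified elements into plain keys. The three entries in TAG_TO_KEY are picked by me as plausible examples; the real module may use different tags and key names.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#"

# Illustrative mapping only: namespace-qualified tag -> plain key.
# These entries are assumptions, not the module's actual table.
TAG_TO_KEY = {
    f"{{{DC}}}title": "title",
    f"{{{GEO}}}lat": "latitude",
    f"{{{GEO}}}long": "longitude",
}


def parse_metadata(rdf_xml):
    """Return a dict of plain keys to lists of values found in the RDF/XML."""
    record = {}
    for element in ET.fromstring(rdf_xml).iter():
        key = TAG_TO_KEY.get(element.tag)
        # Unmapped tags and empty elements are skipped.
        if key is not None and element.text and element.text.strip():
            record.setdefault(key, []).append(element.text.strip())
    return record
```

Values are kept as lists because a metadata type can in principle occur more than once on a page.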

The idea is that a guide mirror can pull down new and changed pages as it detects them in the RecentChanges RSS feed.
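The change-detection step could be sketched like this. I am assuming each feed item carries a title (the page name) and a date element, and that dates are uniformly formatted ISO 8601 strings so that plain string comparison matches chronological order; last_seen is a hypothetical dict the mirror keeps of page name to the newest date it has already fetched.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def changed_pages(feed_xml, last_seen):
    """Return sorted page names whose feed date is newer than the mirror's record."""
    latest = {}
    for element in ET.fromstring(feed_xml).iter():
        if local_name(element.tag) != "item":
            continue
        fields = {local_name(child.tag): (child.text or "").strip() for child in element}
        title, date = fields.get("title"), fields.get("date")
        if not title or not date:
            continue
        # A page may appear several times in the feed; keep its newest date.
        if date > latest.get(title, ""):
            latest[title] = date
    return sorted(t for t, d in latest.items() if d > last_seen.get(t, ""))
```

The mirror would then fetch the text and metadata for each returned page and update last_seen once the fetch succeeds.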

The guide mirror gives us new possibilities, such as having all OG data on one website. This will allow an aggregated search over all the Guides.

The other aspect of this is that the data pulled from another site comes with a hash key "source", containing the URL the data came from. This will allow implementation of Creative Commons "Attribution", and will allow a future release of OpenGuides to redirect all edit requests to the source website.
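Here is a minimal sketch of how the "source" key might be used on the mirror side. The function names, the attribution wording and the fallback behaviour are all hypothetical; only the "source" key itself comes from the description above.

```python
def with_source(record, source_url):
    """Return a copy of a page record tagged with where it was pulled from."""
    tagged = dict(record)
    tagged["source"] = source_url
    return tagged


def attribution(record):
    """Build a credit line for mirrored pages; empty for locally created pages."""
    source = record.get("source")
    return f"Content mirrored from {source}" if source else ""


def edit_target(record, local_edit_url):
    """Send edits to the source guide when there is one, else edit locally."""
    return record.get("source", local_edit_url)
```

A page created on the mirror itself has no "source" key, so it gets no credit line and is edited in place.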

Exciting stuff! More to come...