CPAN RDF

acme on 2003-10-06T16:56:04

After a lot of discussion about metadata this weekend, I've been playing with RDF and CPAN. Looking at all the distributions by authors whose IDs begin with an 'L', with DBD::SQLite and RDF::Simple, I now have a lot of triples. I've been adding some Dublin Core information. I have lots of information yet to add. So who thinks this is a good idea?

Acme-Colour 2002-04-11T15:54:11 3151 Acme-Colour-0.17 0.17 application/x-gzip authors/id/L/LB/LBROCARD/Acme-Colour-0.17.tar.gz application/x-gzip LBROCARD

It's awful wordy

drhyde on 2003-10-06T18:14:58

It looks like an awfully verbose way of saying some very simple things. And I expect that for it to be useful for users they'll need to do XML voodoo. Which is HARD. I just don't see the point of using an obfuscatory format like RDF/RSS/XML/whatever it's called this week, rather than (eg) the output from Data::Dumper or YAML. Maybe I'm missing something.

Re:It's awful wordy

hfb on 2003-10-06T18:45:14

RDF may be more suitable and appropriate for aggregation of the various metadata files relating to a single distribution. Much of it will be primarily for PAUSE and the indexers like search.cpan and the various tools people already use like cpan.pm so users generally won't ever need to look at the raw metadata unless they really want to.

Re:It's awful wordy

acme on 2003-10-08T09:03:14
I should probably have explained this a little more. I got really confused and all negative about RDF until recently. The main problem is that it's all in XML and that scares everyone, but RDF is really all about triples: subject, predicate, object. It just so happens that the most common serialisation format at the moment is in XML.
So an interesting triple would be "LBROCARD" "is the author of" "Acme-Buffy-1.2". Or, in the RDF fragment about Acme-Buffy-1.2: "<cpan:id>LBROCARD</cpan:id>". Notice the namespace. If I use the "cpan:id" namespace, I have decided what type the object is. There is a set of standard metadata types, such as Dublin Core.
So basically it's all about triples, but with schemas and types specified. It could easily be in YAML, yes, but YAML doesn't have schemas so you'd be having to make guesses about what things are. It's just metadata. Wait for new tools to come out which start using it ;-)
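To make the triple idea concrete, here is a minimal sketch in Python (chosen only for illustration; the tooling discussed in this thread is Perl). The "cpan:..." predicate names and the match() helper are hypothetical, not part of any agreed CPAN vocabulary.

```python
# Triples as plain (subject, predicate, object) tuples.
# The "cpan:..." predicate names are illustrative assumptions.
triples = [
    ("Acme-Buffy-1.2", "cpan:id", "LBROCARD"),
    ("Acme-Buffy-1.2", "cpan:version", "1.2"),
    ("Acme-Colour-0.17", "cpan:id", "LBROCARD"),
    ("Acme-Colour-0.17", "cpan:version", "0.17"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# Every distribution whose cpan:id is LBROCARD
dists_by_lbrocard = [
    s for (s, _, _) in match(triples, predicate="cpan:id", obj="LBROCARD")
]
```

The point of the sketch is that the data model is just a list of three-part statements; the XML is only one way of writing that list down.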

Re:It's awful wordy

drhyde on 2003-10-08T10:09:47
You still have to make guesses about what a <cpan:id> is, surely? At some point, a human has to decide that LBROCARD is the person who wrote Acme::Buffy, and that it's not some other random identifying feature like an ASCII-fied checksum.

Identifying people and things

hex on 2003-10-08T11:37:22

That's what RDF vocabularies are for.

If you stick

xmlns:foaf="http://xmlns.com/foaf/0.1/"

in the RDF declaration, that lets you do something like this:

<cpan:author>
  <cpan:id>EMARTIN</cpan:id>
  <foaf:name>Earle Martin</foaf:name>
  <foaf:mbox_sha1sum>8699ba79a95abf86e0055c133bf5d87ceab921e9</foaf:mbox_sha1sum>
</cpan:author>

Of course, there's going to have to be a CPAN vocabulary to define what all this cpan:foo stuff is. The joy of RDF, though, is being able to build on other people's work. The FOAF project has got a lot of work done on matters of personal identity already, and using it would save a whole lot of wheel reinvention.
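As a rough illustration of what the namespace declarations buy you, the following Python sketch parses such a fragment with the standard library. The FOAF namespace URI is the one given above; the cpan namespace URI is an assumption, since no CPAN vocabulary has been defined.

```python
import xml.etree.ElementTree as ET

CPAN_NS = "http://www.cpan.org/"        # assumption: no CPAN vocabulary exists yet
FOAF_NS = "http://xmlns.com/foaf/0.1/"  # the FOAF namespace quoted in the post

fragment = f"""
<cpan:author xmlns:cpan="{CPAN_NS}" xmlns:foaf="{FOAF_NS}">
  <cpan:id>EMARTIN</cpan:id>
  <foaf:name>Earle Martin</foaf:name>
  <foaf:mbox_sha1sum>8699ba79a95abf86e0055c133bf5d87ceab921e9</foaf:mbox_sha1sum>
</cpan:author>
""".strip()

author = ET.fromstring(fragment)

# The parser expands each prefix to "{namespace-uri}localname", so the
# prefix chosen in the document does not matter, only the URI behind it.
properties = {child.tag: child.text for child in author}

# Look an element up through its namespace rather than its prefix
name = author.findtext("foaf:name", namespaces={"foaf": FOAF_NS})
```

Any consumer that already knows the FOAF namespace can pick out foaf:name without knowing anything else about the surrounding CPAN data.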

Re:Identifying people and things

drhyde on 2003-10-08T11:59:50
Still needs a human to read, parse and understand the fact that <foo> represents a FOO in the real world, and to write the code to handle FOOs correctly. That is, it requires just as much work as understanding what 'author' means in a structure such as:
$VAR1 = {
  'author' => 'Sheerluck Holmes',
  'title' => 'true crimes and how to avoid them'
}
or a YAML equivalent.
Using XML-ish things does not help to define what your data is, regardless of what it says on the bottle of Kool-aid. All it does is define the relationships between them. I suppose what I'm really saying is that I don't understand this fashion for XML and its friends and relations. FOAF and friends would work just as well translated into a lighter-weight human-readable representation.

Re:Identifying people and things

hex on 2003-10-08T12:15:21

Oh, OK, maybe I didn't follow your meaning. I wasn't meaning to imply that using RDF (and in the vocabulary itself, OWL) would actually define what the data is. But yes, isn't that always going to be the case, until we have smart computers? At the moment, the closest thing to "encapsulated meaning" we have is Cyc, and that's a long way off from being the real thing. RDF vocabularies, as you say, are good for defining relationships between things.

I don't think, though, that RDF was ever intended to be human-readable; it needs to be parsed in some way. What kind of application were you thinking of for the metadata?

Re:Identifying people and things

drhyde on 2003-10-08T12:33:04
I always try to either use something that is explicitly designed to be human-readable, like Data::Dumper (with purity and indent style 2) or more recently YAML; or something which cares not about being human-readable, such as Storable or some other binary format. RDF/RSS/XML, because it's ASCII, looks like it's meant to be human-readable, so I try to read it and get irritated.

Re:Identifying people and things

sky on 2003-10-08T14:00:18
RDF can be represented in multiple formats, and there is work done to get it represented in YAML.

rdf/cpan

inkdroid on 2003-10-06T18:27:07

Wow, I really like this idea. Is the idea to serialize CPAN metadata in a similar way to how the Open Directory Project makes their data available? Speaking as an ex-librarian, your use of RDF and Dublin Core is commendable. People in the library and information science communities have been getting all excited about RDF and Dublin Core for years, and it's very cool to see someone putting it to practical use. I bet the semantic web folks would also be very interested to hear about your experiments.

On a somewhat related note: while it's kind of eclectic, the Open Archives Initiative has developed a protocol for sharing large sets of metadata. The OAI-PMH provides a very simple framework for building data providers and data harvesters using a set of 6 verbs over XML/HTTP: Identify(), ListIdentifiers(), GetRecord(), ListMetadataFormats(), ListRecords(), ListSets(). While it might not be of direct use, it could be of interest if you are looking for ideas on how to allow people to update their local copies of CPAN metadata without grabbing the whole lot each time. The OAI-PMH has its roots in the arXiv pre-print server at Los Alamos, and is currently being used by quite a mix of data providers. Oh, and I wrote Net::OAI::Harvester for interacting with repositories :-)
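For a feel of how an incremental update could look, here is a small Python sketch that only builds OAI-PMH request URLs. The endpoint is hypothetical (no CPAN OAI-PMH repository is known to exist) and oai_request() is an illustrative helper; the verb names and the metadataPrefix and from arguments come from the protocol itself.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
BASE_URL = "http://example.org/oai"

VERBS = {
    "Identify", "ListIdentifiers", "GetRecord",
    "ListMetadataFormats", "ListRecords", "ListSets",
}


def oai_request(verb, **params):
    """Build a request URL for one of the six OAI-PMH verbs."""
    if verb not in VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    query = {"verb": verb}
    for key, value in params.items():
        # 'from' is a Python keyword, so callers pass it as from_
        query[key.rstrip("_")] = value
    return f"{BASE_URL}?{urlencode(query)}"


# Ask only for records changed since the last harvest
url = oai_request("ListRecords", metadataPrefix="oai_dc", from_="2003-10-06")
```

A harvester would remember the date of its last run and pass it as the from argument, so each update transfers only what changed.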

Using RDF

ziggy on 2003-10-07T03:13:23

This snippet doesn't look entirely kosher. The urn::filesize and urn::mimetype elements need to be placed into a proper namespace.

The RDF format is rather, um, ugly to behold. It's good for interchange between apps, but greatly obfuscates the meaning for wetware parsers. I think the following is a faithful interpretation of the above example in Notation 3:

@prefix cpan: <http://www.cpan.org/>.
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix misc: <urn:empty>.

<#acmeColour017>
    cpan:dist "Acme-Colour";
    cpan:suffix "authors/id/L/LB/LBROCARD/Acme-Colour-0.17.tar.gz";
    cpan:version "0.17";
    dc:date "2002-04-11T15:54:11";
    dc:format "application/x-gzip";
    dc:identifier <http://search.cpan.org/dist/Acme-Colour-0.17/>;
    dc:publisher <http://www.cpan.org/>;
    dc:type <http://purl.org/dc/dcmitype/Software>;
    misc:filesize "3151";
    misc:mimetype "application/x-gzip";
    .

Here are some important elements that are missing but should be trivial to add:

Author ID
DSLIP values
MD5 Checksum
Module Prerequisites (as determined by META.yml or whatnot)
Minimum Perl version required

Nevertheless, this snippet of RDF is a very good start. Thanks!

Re:Using RDF

acme on 2003-10-08T09:07:50
It was just a fragment, so it had no namespaces. Thanks for the feedback, it does now. Also I added Author ID and MD5 Checksum. More metadata from CPANTS and META.yml to come soon. I used RDF/XML as it was the simplest thing possible at the time and RDF::Simple was, well, simple. Anyway, you can check it out at: http://www.cpan.org/authors/id/L/LB/LBROCARD/cpan.rdf.gz (autrijus is hacking PAUSE so I can replace the file instead of releasing new versions all the time).

RDF/YAML

schuyler on 2003-10-07T18:40:47

First off, XML isn't the only possible serialization of RDF.

Second, and more importantly, I think it's reasonable for CPAN metadata to be stored/provided as YAML... so long as it can be unambiguously mapped to RDF for those applications that need/want it.
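One way to read "unambiguously mapped" is that every key in the YAML has exactly one agreed predicate. The Python sketch below assumes that reading. The PREDICATES table and to_triples() are hypothetical; the cpan: URIs are made up for illustration, while the dc: URIs are the Dublin Core element URIs. The record dict stands in for whatever a YAML loader would return.

```python
# Hypothetical key-to-predicate table. The cpan: URIs are assumptions;
# the dc: URIs are Dublin Core element URIs.
PREDICATES = {
    "dist": "http://www.cpan.org/dist",
    "version": "http://www.cpan.org/version",
    "date": "http://purl.org/dc/elements/1.1/date",
    "format": "http://purl.org/dc/elements/1.1/format",
}


def to_triples(subject, record):
    """Map a flat record to triples, rejecting keys with no agreed predicate."""
    unknown = sorted(set(record) - set(PREDICATES))
    if unknown:
        raise KeyError(f"no predicate defined for: {', '.join(unknown)}")
    return [(subject, PREDICATES[key], value) for key, value in record.items()]


# The kind of structure a YAML loader would hand back
record = {
    "dist": "Acme-Colour",
    "version": "0.17",
    "date": "2002-04-11T15:54:11",
    "format": "application/x-gzip",
}
triples = to_triples("http://search.cpan.org/dist/Acme-Colour-0.17/", record)
```

Refusing unknown keys, rather than guessing a predicate for them, is what keeps the mapping unambiguous in this sketch.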

Re:RDF/YAML

inkdroid on 2003-10-08T14:02:26
I would argue that the world that uses XML/RDF is larger than the world that uses YAML. I have no statistics to back this up, it is just a gut feel. Safety in numbers is not really a good argument, but I guess the main thing is that the data is *available* (thanks Acme), rather than what format it is in.

Re:RDF/YAML

ziggy on 2003-10-08T14:21:54
Actually, I'd argue with equal conviction that CPAN Metadata should be canonically stored in N3.
My cat would argue even more strongly that we should design a database schema and shove all the data into {SQLite|MySQL|PostgreSQL}. Even cats can understand third normal form. ;-)
The one thing we really need is to agree on the triples and the meaning of the assertions that describe CPAN metadata. Everything else is just syntax. Mapping from one syntax or another (or deeming one syntax "preferred") is an exercise for the reader.
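The "everything else is just syntax" point can be sketched in Python by writing one set of triples out in two notations. The cpan namespace URI is an assumption borrowed from the N3 example earlier in the thread, and both serializer functions are illustrative helpers that treat every object as a plain string literal.

```python
import json

DC = "http://purl.org/dc/elements/1.1/"
CPAN = "http://www.cpan.org/"  # assumption, borrowed from the N3 example

subject = "http://search.cpan.org/dist/Acme-Colour-0.17/"
triples = [
    (subject, CPAN + "dist", "Acme-Colour"),
    (subject, CPAN + "version", "0.17"),
    (subject, DC + "format", "application/x-gzip"),
]


def escape(text):
    """Escape backslashes and double quotes inside a literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def to_ntriples(triples):
    """One '<s> <p> "o" .' line per triple, in N-Triples style."""
    return "\n".join(f'<{s}> <{p}> "{escape(o)}" .' for s, p, o in triples)


def to_json(triples):
    """The same assertions grouped by subject, then by predicate."""
    grouped = {}
    for s, p, o in triples:
        grouped.setdefault(s, {}).setdefault(p, []).append(o)
    return json.dumps(grouped, indent=2)


print(to_ntriples(triples))
print(to_json(triples))
```

Both outputs carry the same three assertions; only the agreed predicates give them meaning.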

About

kasei on 2003-10-08T03:21:35

Could you (and would it make sense to) add an rdf:about attribute to the Description tag pointing to the file on either the main CPAN site or the search.cpan.org info page?