LCCO and Black

mdxi on 2005-08-06T19:50:22

Lorcan Dempsey recently wrote about a DDC browser which OCLC has implemented, and it reminded me that I tried to do basically the same thing a couple of years ago with the Library of Congress Classification Outline. His design very literally drills down, starting with the top-level (hundreds) DDC categories and placing the next level under that as the user selects which segment to expand. My design, which never got past the static mockup stage, started with the user picking a top-level LCCO category, which it then exploded in vertical stacks progressing to the right.

This worked pretty nicely with A (General Works), but as soon as I started trying to mock up B (Philosophy, Psychology, Religion), the complexity spiralled and my abortive effort ended up looking like a microphotograph of a CPU, so many shadings and colors were arrayed around each other.

Which brings me back to an old bitch: there are no good classification systems. The LCCO seems nice at first, but the intricacies of generating call letters for it are nigh-impenetrable (I know the Letter Digits Dot Digits Cutter OptionalYear part, but what's the other thing that looks like a secondary Cutter, which sometimes follows the year? And how do I generate Cutters anyway? And are they guaranteed unique, or are collisions allowed? Why isn't this information public knowledge?). On top of this, the LCCO's last overhaul began around 1900 and finished around 1960, resulting in huge swathes of modern works and fields of endeavor being shoved into sub-decimal categorization spaces, while American History gets two whole top-level letters. I'm sure this was fabulous for the actual Library of Congress at the turn of the last century, but it's not so good for dealing with the modern world, where we expect things to be designed for maximal flexibility and expansibility.
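
For what it's worth, the part I do understand is easy enough to sketch in code. Here's a rough Python stab at the Letter Digits Dot Digits Cutter OptionalYear pattern, with the mystery second Cutter bolted on before the year; the regex and the sample call number are simplifications on my part, not an authoritative grammar:

    import re

    # A rough sketch of the call-number grammar described above. Real LC
    # call numbers have more corner cases than this.
    LC_CALL = re.compile(
        r"^(?P<klass>[A-Z]{1,3})"            # classification letters
        r"\s*(?P<number>\d{1,4}(?:\.\d+)?)"  # class number, maybe decimal
        r"\s*\.?(?P<cutter1>[A-Z]\d+)"       # first Cutter
        r"(?:\s*(?P<cutter2>[A-Z]\d+))?"     # the mystery second Cutter
        r"(?:\s+(?P<year>\d{4}))?$"          # optional year
    )

    m = LC_CALL.match("QA76.73.P22 C45 2004")
    print(m.groupdict())
    # {'klass': 'QA', 'number': '76.73', 'cutter1': 'P22',
    #  'cutter2': 'C45', 'year': '2004'}

Parsing is the easy half, of course; the generation rules are the part that stays opaque.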

On the other hand we have the DDC (and UDC, which is DDC but not). The DDC, as most people are completely unaware, is owned by a corporation which charges license fees for its use and has a history of suing people who aren't aware of this. It should therefore be viewed as unsuitable for use in public/volunteer projects (like the one I'll get to in a second).

I have recently become aware of some fringe categorization schemes, such as Bliss, but I don't see them as any more of a win than anything else.

At this point you may think I'm a big fan of "folksonomies" and "tags" and that sort of thing, but I'm not. Very few people have the mindset which allows for things like data integrity, logical categorization, and stripping of spurious/redundant data. Flexibility and expansibility are always good, but mob rule is never the answer, and that's all the folksonomy movement is. I'm down on pure democracy for the same reason the founders of the US government were: most people are so lazy they can't even be trusted to make informed decisions which affect themselves.

I don't have an answer, only complaints. Moving on...

Much has been made (and made and made and made) of the recent interactions between Google and libraries. I'm very much of the opinion that libraries need to hurry the fuck up and actually get their collections online, as opposed to just data about their collections (yes, yes, whatever you're thinking, I know, but seriously people, wasn't the Internet supposed to be something a little better than porn and livejournal?), but I don't think Google is the right way to go about it.

For years I've been very slowly working on planning (and planning and planning) a framework for publishing text collections, but let's not talk about that now; let's talk about the idea which followed that one, which I call Black (all my project names are colors). In a nutshell: it's a combination of RSS and distributed source control.

Imagine that sites have created electronic collections -- not as web pages, but as collection data and content data (preferably with the text bits in small-ish chunks and in transformable, display-device-neutral markup) in databases. Now imagine that you can flip a permissions bit on any document in the collection, and your front-end collection server will inform an upstream meta-collection (Black) server of this. Now imagine networks of these servers, throwing collections data back and forth between each other. Now imagine that this network of "metaservers" is searchable, and that feeds of newly available documents are being served to other sites with their own collections engines in place.
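
To make that first step concrete, here's one possible shape for the notification a collection server might push upstream when the permissions bit gets flipped. Everything here -- the field names, the id scheme, the URL -- is invented for illustration:

    import hashlib
    import json

    # Hypothetical "announce" message from a collection server to its
    # upstream Black metaserver. All field names are guesses.
    doc_text = b"...document content..."
    announce = {
        "type": "announce",
        "doc_id": "siteA:cookbook-0042",             # origin site + local id
        "version": 1,
        "sha1": hashlib.sha1(doc_text).hexdigest(),  # integrity check for fetchers
        "metadata": {
            "title": "Foochow Recipes",
            "subjects": ["cookery", "china"],
        },
        "sources": ["http://siteA.example/black"],   # where copies live so far
    }
    print(json.dumps(announce, indent=2))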

Site B's admins or users see something interesting on the new-docs feed (or turn it up in a search) and send a subscription request to their upstream Black server. A subscription to that document is set up, and a BitTorrent-style copy operation begins, fetching the doc from Site A as well as from any other sites which also have copies.
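
The copy operation itself might look something like this toy round-robin chunk loop; fetch_chunk stands in for whatever transfer mechanism Black would actually use:

    import itertools

    # Toy sketch of the BitTorrent-style copy: the subscriber learns which
    # sites hold the doc, then spreads chunk requests across all of them.
    def fetch_doc(doc_id, sources, n_chunks, fetch_chunk):
        chunks = [None] * n_chunks
        for i, src in zip(range(n_chunks), itertools.cycle(sources)):
            chunks[i] = fetch_chunk(src, doc_id, i)  # pull chunk i from site src
        return b"".join(chunks)

    # Dummy transfer function so the sketch actually runs:
    doc = fetch_doc("siteA:cookbook-0042",
                    ["siteA.example", "siteC.example"],
                    n_chunks=4,
                    fetch_chunk=lambda src, doc_id, i: f"[{i}@{src}]".encode())
    print(doc)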

Three weeks later, Site A makes changes to the doc. When it is republished on their internal system, messages about the new version are passed to the Black servers, and as those messages filter through the network, subscriber sites begin picking up the deltas (the changed and/or new and/or deleted bits).
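
What the deltas look like on the wire is exactly the sort of thing that needs codifying (see problem 2 below), but an ordinary unified diff computed at the origin and relayed through the network would be one obvious starting point:

    import difflib

    # Purely illustrative: the origin diffs the old version against the
    # new one and ships the result to subscribers via the Black network.
    old = "one\ntwo\nthree\n".splitlines(keepends=True)
    new = "one\ntwo point five\nthree\nfour\n".splitlines(keepends=True)
    delta = "".join(difflib.unified_diff(old, new, fromfile="v1", tofile="v2"))
    print(delta)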

Hey look, it's a worldwide, self-replicating, self-updating, fault-tolerant, on-demand library!

Things like allowing subscriber sites to keep full version histories or only the newest versions (yadda yadda yadda) are all fairly simple and eminently do-able. The only real problems are:

  1. Writing the Black server software
  2. Codifying the messaging format (protocols; a first stab at the message vocabulary is sketched after this list)
  3. Ensuring that database schemas are compatible with each other (or that compatibility layers are available)
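
By way of illustration only, here's the message vocabulary the walkthrough above implies, written out as a possible starting point for problem 2. Every class and field name is a guess:

    from dataclasses import dataclass

    @dataclass
    class Announce:    # origin -> metaserver: a doc has been shared
        doc_id: str
        version: int
        sources: list

    @dataclass
    class Subscribe:   # subscriber -> metaserver: start tracking a doc
        doc_id: str
        subscriber: str

    @dataclass
    class Update:      # metaserver -> subscribers: a new version exists
        doc_id: str
        version: int
        delta: str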

(Note that here I am actually thinking of volunteers putting their own collections online, either completely on their own or in small collectives which would probably be focused around certain topics. The whole reason I started thinking about this stuff is because I want to share my old textbook, cookbook, and asian history collections.)

Some people probably think that things like "intellectual property rights" are another issue, but I honestly don't care. My concern is maintaining and communicating human knowledge and our shared body of cultural works, not making sure a multinational corporation is properly compensated for the work of a long-dead author. I am concerned with saving (and making available) the information in all the mouldering, brittle, and disintegrating books around the world, not with making it easier to pirate Harry Potter book 7. A trust metric of some sort might be a good tool to assist human Black server admins, but I really believe that policing the collections is a policy issue and not a technological one.

To all the real librarians who just sat through this: I'd like to apologize if my ignorance and hubris have insulted you.


Institutional repositories

CavLec on 2005-08-08T13:19:54

You might be interested in some of the work going on in academic libraries around "institutional repositories." It's early days yet, so I warn you that existing software frameworks are more than a little crude, but I think in time we'll get somewhere near where you're wanting to go.

The big names are DSpace, Fedora, and EPrints. Give 'em a look, and drop me a line if you have any questions.