Rethinking the CPAN Problem

Ovid on 2008-07-04T09:56:06

Recently on Perl-QA there's been a debate that is both rather lengthy (going on for well over a week) and a touch acrimonious. It's actually been rather civilized and the acrimony has been more in the form of blunt comments, but it boils down to people still complaining about the "CPAN" problem:

The CPAN is huge and hard to evaluate

They weren't really talking about this problem. They were talking about the tools surrounding it. In particular, there were sharp disagreements about the utility of CPANTS, with many people pointing out how things can be taken out of context. For example, both Damian Conway and Mark Jason Dominus have relatively low CPANTS scores and this certainly says something about their distributions. Depending on whether you're experienced in the Perl community or new to it, you could draw drastically different conclusions as to what this "something" is.

Still, when you look at AnnoCPAN, cpanratings, CPAN RT, CPANTS, CPAN testers, the module itself, its competition, documentation, etc., you can quickly get overloaded with the bewildering array of choices (Class:: namespace, anyone?). I think, though, that these attempts to provide more information are good and people are misunderstanding the core problem.

Many people complain that the CPAN is huge and hard to evaluate, but that's not the problem. The CPAN is in the state it is in largely because it reflects the real world. Let's try a little experiment: if you know nothing about content management systems, but you want the "best" one to run use.perl and your boss tells you it must not be slashcode, what do you choose? I'll wait. Tell me how long it takes you to choose the "best".

"But wait!", you protest. "What does 'best' mean?" Is it cost? Is it programming language? Is it ease of use once learnt? Is it ease of learning? Is it ease of setup? How much maintenance will it entail? Who else uses it? How long has it been around? What's its history vis-a-vis security issues? Does it rely on technologies that our IT department won't support? Do we have to change anything internally to use it? And the list goes on and on and on ...

You see, in the real world, when you choose software, whether you've written it yourself or not, you have a set of requirements and you have to evaluate the software against your needs. This is often very hard. CPAN, thus, mimics the real world to a certain extent.

There is an interesting difference, though. As a general rule, CPAN authors and those creating collaborating informational sites are interested in providing you with real information, not marketing spin. This can be a huge win, with the caveat that the author may not know what real information is needed.

What we really need is to incorporate one-stop shopping for these various resources to give us an ability to evaluate them in context. CPAN distributions need "tags" which people should be able to upvote or downvote based on appropriateness. Thus, even though "File::Find::Rule::XPath" might get returned by a search for "xpath", narrowing your search with "xml" would lower the likelihood of getting FFRX, because even if it's been mistagged with "xml", people would downvote said tag and thus reduce its weight.
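To make the tag-voting idea concrete, here is a minimal sketch in Python (chosen only for brevity; nothing here is an existing CPAN service). The `TagIndex` class, its method names, and the vote counts are all invented for illustration. It assumes each tag on a distribution carries a net score of upvotes minus downvotes, and that a distribution matches a search only while every requested tag has a positive net score. A real system would more likely lower a result's rank than drop it outright.

```python
from collections import defaultdict


class TagIndex:
    """Toy in-memory index of community-voted tags (hypothetical)."""

    def __init__(self):
        # votes[distribution][tag] = upvotes minus downvotes
        self.votes = defaultdict(lambda: defaultdict(int))

    def upvote(self, dist, tag):
        self.votes[dist][tag] += 1

    def downvote(self, dist, tag):
        self.votes[dist][tag] -= 1

    def search(self, *tags):
        """Return distributions whose net score is positive for every tag."""
        if not tags:
            return []
        results = []
        for dist, tag_scores in self.votes.items():
            scores = [tag_scores.get(tag, 0) for tag in tags]
            if all(score > 0 for score in scores):
                # The weakest requested tag decides the rank.
                results.append((min(scores), dist))
        # Highest score first, then alphabetical for stable output.
        results.sort(key=lambda pair: (-pair[0], pair[1]))
        return [dist for _, dist in results]


# Invented votes, for illustration only.
index = TagIndex()
for _ in range(3):
    index.upvote("XML::XPath", "xpath")
    index.upvote("XML::XPath", "xml")
index.upvote("File::Find::Rule::XPath", "xpath")
index.upvote("File::Find::Rule::XPath", "xml")    # someone mistags it
index.downvote("File::Find::Rule::XPath", "xml")  # others disagree
index.downvote("File::Find::Rule::XPath", "xml")

print(index.search("xpath"))
print(index.search("xpath", "xml"))
```

With these votes, a search for "xpath" alone still lists both distributions, while adding "xml" leaves only XML::XPath, because the mistagged "xml" on FFRX has been voted below zero.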

With that, we need solid APIs for the other services so that the consumer, once presented with a set of choices, can consider how well a module is maintained, last release date, bug count, annotations, etc. Which information is important? How the heck can I know? It's often subjective, anyway, and you and I could reasonably disagree. Give 'em all the information we easily can, regardless of our personal opinion of its worth.
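As a rough sketch of what "give 'em all the information" could look like, the Python below merges whatever several services know about one distribution into a single report, without scoring or ranking any of it. Everything here is an assumption: the source names, the `build_report` function, the `Some::Module` distribution, and the numbers are invented, and the lookups are plain dictionaries standing in for real service APIs.

```python
from datetime import date

# Stand-ins for the real services. Every name and number is made up;
# a real version would call each service's own API instead.
RT_QUEUE = {"Some::Module": {"open_bugs": 4}}
RELEASES = {"Some::Module": {"last_release": date(2008, 6, 1)}}
RATINGS = {}  # nobody has rated Some::Module yet


def from_table(table):
    """Wrap a dict as a lookup function that raises KeyError when empty."""
    return lambda dist: table[dist]


SOURCES = {
    "rt": from_table(RT_QUEUE),
    "releases": from_table(RELEASES),
    "ratings": from_table(RATINGS),
}


def build_report(dist, sources):
    """Collect what each source knows about `dist`, without judging it."""
    report = {"distribution": dist}
    for source_name, lookup in sources.items():
        try:
            report[source_name] = lookup(dist)
        except LookupError:
            # Keep the gap visible rather than hiding the source.
            report[source_name] = None
    return report


report = build_report("Some::Module", SOURCES)
for key, value in report.items():
    print(key, value)
```

The report deliberately has no overall score. Missing data shows up as `None`, and the reader decides which fields matter.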

CPAN itself is probably not the best place for this (though by default "search.cpan.org" is a poor man's version of what I envision). Instead, CPAN should remain the canonical repository, with a front-end tool on top which allows intelligent searching, filtering, lists of relevant articles and aggregation of all other relevant data that we can think of. It should merge the available information in one spot to ease the burden on the poor author.

This can be done, but whether it will be done is another story. I think, however, that this is what we need. I don't want more restrictions on the CPAN. I don't want to tell people developing external informational tools how they should do their stuff. I want one-stop shopping for most of this information and let me decide what's important. This would ease one of the biggest burdens we have.

Re: Tags

domm on 2008-07-04T10:19:18

Gabor has put some effort into adding tags to CPAN with CPAN::Forum. There even is a tag cloud.

metabase

domm on 2008-07-04T10:23:59

Sounds like we really need the CPAN Metabase... (scroll down to 'Project 2 -- CPAN Metabase for CPAN Testers 2.0')

tags!

jmason on 2008-07-04T10:44:28

I strongly agree, tags are the way to do this -- they're great for this kind of "needle in a haystack" searching.

Another thing: mid-length module descriptions, of a roughly freshmeat-style length (512 chars). This is short enough to fit in one paragraph, and therefore fit loads of modules on a single page, but allows enough text to specify function and purpose with a few synonyms for searchability.

We need evolution, not revolution

petdance on 2008-07-05T02:52:50

For the past three months, we've had a mailing list, rethinking-cpan, for discussing this very problem. Here's the blog post that started it: http://perlbuzz.com/2008/04/rethinking-the-interface-to-cpan.html

I've started a group, rethinking-cpan, for discussing the ideas I've posted here. -- Andy

Every few months, someone comes up with a modest proposal to improve CPAN and its public face. Usually it'll be about "how to make CPAN easier to search". It may be about adding reviews to search.cpan.org, or reorganizing the categories, or any number of relatively easy-to-implement tasks. It'll be a good idea, but it's focused too tightly.

We don't want to "make CPAN easier to search." What we're really trying to do is help with the selection process. We want to help the user find and select the best tool for the job.

It might involve showing the user the bug queue; or a list of reviews; or an average star rating. But ultimately, the goal is to let any person with a given problem find and select a solution.

"I want to parse XML, what should I use?" is a common question. XML::Parser? XML::Simple? XML::Twig? If "parse XML" really means "find a single tag out of a big order file my boss gave me", the answer might well be a regex, no? Perl's mighty CPAN is both blessing and curse. We have 14,966 distributions as I write this, but people say "I can't find what I want." Searching for "XML" is barely a useful exercise.

Success in the real world

Let's take a look at an example outside of the programming world. In my day job, I work for Follett Library Resources and Book Wholesalers, Inc. We are basically the Amazon.com for the school & public library markets, respectively. The key feature of the website is not ordering, but helping librarians decide what books they should buy for their libraries. Imagine you have an elementary school library, and $10,000 in book budget for the year. What books do you buy? Our website is geared to making that happen.

Part of this is technical solutions. We have effective keyword searching, so you can search for "horses" and get books about horses. Part of it is filtering, like "I want books for this grade level, and that have been positively reviewed in at least two journals," in addition to plain ol' keyword searching. Part of it is showing book covers, and reprinting reviews from journals. (If anyone's interested in specifics, let me know and I can probably get you some screenshots and/or guest access.)

BWI takes it even farther. There's an entire department called Collection Development where librarians select books, CDs & DVDs to recommend to the librarians. The recommendations could be based on choices made by the CollDev staff directly. They could be compiled from awards lists (Caldecott, Newbery) or state lists (the Texas Bluebonnet Awards, for example). Whatever the source, they help solve the customer's problem of "I need to buy some books, what's good?"

This is no small part of the business. The websites for the two companies are key differentiators in the marketplace. Specifically, they raise the company's level of service from simply providing an item to purchase to actually helping the customer do her/his job. There's no point in providing access to hundreds of thousands of books, CDs and DVDs if the librarian can't decide what to buy. FLR is the #1 vendor in the market, in large part because of the effectiveness of the website.

Relentless focus on finding the right thing

Take a look at the front of the FLR website. As I write this, the first thing a user sees on the page is "Looking for lists of top titles?" That link leads to a page of lists for users to browse. Award lists, popular series grouped by grade level, top video choices, a list called "Too good to miss," and so on. The entire focus that the user sees is "How can I help you find what you want?"

Compare that with the front page of search.cpan.org. Twenty-six links to the categories that link to modules in the archaic Module List. Go on, tell me what's in "Control Flow Utilities," I dare you. Where do I find my XML modules? Seriously, read through all 26 categories without laughing and/or crying. Where would someone find Template Toolkit? Catalyst? ack? Class::Accessor? That one module that I heard about somewhere that lets me access my Lloyd's bank account programmatically?

Even if you can navigate the categories, it hardly matters. Clicking through to the category list leads to a one-line description like "Another way of exporting symbols." Plus, the majority of modules on CPAN are not registered in the Module List. The Module List is an artifact a decade old that has far outlived its original usefulness.

What can we do?

There have been attempts, some implemented, some not, to do many of these things that FLR & BWI do very effectively. We have CPAN ratings and keyword searching, for example. BWI selects lists of top books, and Shlomi Fish has recently suggested having reviews of categories of modules, which sounds like a great idea. I made a very tentative start on this on perl101.org. But it's not enough.

We need to stop thinking tactical ("Let's have reviews") and start thinking strategic ("How do we get the proper modules/solutions in the hands of the users that want them?") Nothing short of a complete overhaul of the front end of the CPAN will make a dent in this problem. We need a revolution, not evolution, to solve the problem.

Metadata is an old theme

hfb on 2008-07-05T13:57:38

Years ago, I convened that CPAN meeting in London for the sole purpose of getting some sort of Metadata standard together as this is an old problem that could possibly be partially solved with metadata. Unfortunately, not much happened in that vein as it would require a lot of work and cooperation from more than a few people and consensus has never been a strong point of perl people. Andreas, Ken, Leon and Jos were the most interested and may possibly be still....if you're interested in picking up that idea that has lain fallow for too long.

easy one-stop shopping

Eric Wilhelm on 2008-07-06T05:39:29

With webapps, the site maintainer has to be bothered to add every new thing. Further, the overhead of setting up a local copy of a webapp often involves installation and setup of a particular database, web server, etc. Webapps (by nature) don't need to be easily deployed onto a wide variety of systems, so setup usually kills any ambition of trying to add "one simple idea" to a thing -- even if you have the source code.

What if you could download and install a CPAN search+install application on your computer in a couple of minutes and immediately start writing a plugin for it? What if that plugin could access data from a variety of web APIs? (Useful for e.g. a server which has already done a batch run of indexing (search) or extracting information (CPANTS, cpan-testers).) What if you could package and ship that plugin just like any other module on CPAN?

How else are we going to get one-stop shopping without intense centralization? You have to be able to make your own stop.
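One way to picture the plugin idea above is a small registry that a local search application could expose. This is only a sketch in Python under stated assumptions: the `plugin` decorator, the `run_plugins` function, the two example plugins, and the shape of the `dist_info` dictionary are all hypothetical, and a real tool would load plugins from separately installed packages and could have them call web APIs rather than inspect a local dictionary.

```python
# Registry of plugin name -> function; filled in by the decorator below.
PLUGINS = {}


def plugin(name):
    """Register a function as a named plugin (hypothetical interface)."""
    def register(func):
        PLUGINS[name] = func
        return func
    return register


@plugin("abstract-length")
def abstract_length(dist_info):
    # How much description did the author give us to search on?
    return len(dist_info.get("abstract", ""))


@plugin("is-namespaced")
def is_namespaced(dist_info):
    # True when the module name sits inside a namespace such as Some::
    return "::" in dist_info.get("name", "")


def run_plugins(dist_info):
    """Ask every registered plugin about one distribution."""
    return {name: func(dist_info) for name, func in PLUGINS.items()}


# Invented distribution record, for illustration only.
info = {"name": "Some::Module", "abstract": "Does something"}
print(run_plugins(info))
```

Adding "one simple idea" then means writing one decorated function, which is the low barrier to entry the comment is asking for.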