Increase your Schwartz

brian_d_foy on 2002-10-11T08:11:34

What was the line from Spaceballs? "May the Schwartz be with you"?

A while ago Randal measured the effective size of CPAN, although he was more interested in keeping a small CPAN around than tracking its size.

I am a physicist by training. (I am currently reminded of that because Ray Davis won the Nobel Prize this year, and I pulled out my copy of Neutrino Astrophysics to reread his history of the project. If I had not gotten wrapped up in the tech bubble, I would have tried to find work in the neutrino industry.) Experimental physicists, despite their stereotype, are graphers. They make graphs like marketing people make PowerPoint presentations: of everything and nothing, in every shape and form, all day, most days. Occasionally experimental physicists do a little math and maybe an experiment or two. The rest of the time they make graphs, show other people graphs, look at other people's graphs, and then make more graphs. Sometimes journals publish these graphs.

My first two papers in Physical Review C were almost half graphs by area. Strangely, you cannot get these online without some sort of subscription. So much for easy exchange of scientific papers. The first paper, "Correlation between e/d and the P factor" (Phys. Rev. C, 49, 2) had graphs about other graphs. My publications in chemistry journals have few graphs, by comparison.

By training, or indoctrination, I want to make graphs, which is why I spend too much time annoying Martien with GD::Graph patches. The only thing better than making graphs is writing programs to make them for me. The only problem is finding something to graph. It is the difference between drinking because you are an alcoholic and sampling a wine with dinner. Graphaholics just want to make graphs, and sometimes I do that in Perl courses when I talk about Zipf's Law. I do not need to know what the data is to make cool graphs with Zipf's Law.

My favorite print news source, The Economist, has lots of interesting graphs. The last couple of pages are normally taken up with graphs and tables illustrating various trends. I have started to do that with "Perl at a Glance".

Every set of graphs needs some sort of diva fact though. The Economist has The Big Mac Index. Physics has named numbers after various people: Planck, Heisenberg, Curie, and so on. In that vein, The Perl Review created the Schwartz Factor, the ratio of the effective size of CPAN to its real size. It is named after Randal, who started the idea. I use the name with his permission, but naming it was my idea, since he really is not that vain. It is appropriate, though, because I simply use Randal's script to do the work. I will have to find something else to name after Damian.

When I started measuring the size of CPAN, the Schwartz Factor was about 0.1729, meaning that about 5/6 of CPAN was either old versions or the perl distribution. Today the Schwartz Factor is about 0.1732, a significant difference. CPAN has actually shrunk this week. I deleted a lot of old stuff from my directory (including two distributions around 5 MB), so that could be part of it.
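The arithmetic behind those numbers is simple enough to sketch. This is only an illustration in Python, not Randal's script: the function names and the byte counts are made up for the example, and the only thing taken from the text is the definition of the factor as effective size divided by real size.

```python
def schwartz_factor(effective_bytes, real_bytes):
    """Ratio of the effective size of an archive to its real size."""
    if real_bytes <= 0:
        raise ValueError("real size must be positive")
    return effective_bytes / real_bytes


def stale_fraction(factor):
    """Share of the archive that is NOT part of the effective size."""
    return 1.0 - factor


# Hypothetical byte counts, chosen only to give a factor of 0.1729.
factor = schwartz_factor(172_900_000, 1_000_000_000)
print(round(factor, 4))                  # 0.1729
print(round(stale_fraction(factor), 4))  # 0.8271, a bit under 5/6
```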

I have another concept in mind, The Wall, which is the point at which the effective size of CPAN is the same as the real size, or when the Schwartz Factor is 1. You cannot do better than that. It is the least upper bound. You can hit The Wall, but you cannot break through it. At some point, CPAN only had the latest versions, and has been moving away from The Wall since then, with small local oscillations.

I wonder, though, how much closer can we get to The Wall? Can we get to Schwartz Factor 0.25? That is, can about 30% of CPAN disappear? How many old versions of CGI.pm do we need? or Tk? or BioPerl? Each of those is among the top 10 things by size in CPAN, by the way, and I have a graph to prove it. How many little modules could disappear? I deleted about a fifth of my CPAN directory contents just because they were old and buggy (although you can still get them from SourceForge). If every CPAN author, and there are about 2000 or so now, deleted one old file from their CPAN directory, how much closer to The Wall do we get?
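To put a number on "how much has to disappear", here is a rough Python sketch. It assumes that authors delete only old files, so the effective size stays fixed and only the real size shrinks. The function name is hypothetical, and 0.1732 is the current factor quoted above.

```python
def fraction_to_delete(current_factor, target_factor):
    """Fraction of the real size that must be deleted to reach target_factor.

    Assumes the effective size does not change, so
    new_real / old_real = current_factor / target_factor.
    """
    if not 0 < current_factor <= target_factor <= 1:
        raise ValueError("need 0 < current_factor <= target_factor <= 1")
    return 1.0 - current_factor / target_factor


print(round(fraction_to_delete(0.1732, 0.25), 3))  # 0.307
print(round(fraction_to_delete(0.1732, 1.0), 3))   # 0.827, all the way to The Wall
```

Under that assumption, reaching 0.25 from 0.1732 means removing roughly three-tenths of what is on CPAN today.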

If the Schwartz Factor got up to about 0.4 with the current CPAN, it could fit on a single CD again, at least until Hugo decides 5.10 should be twice as big as 5.8.

How can you increase your Schwartz? Clean up your CPAN directory and let's find out.

I have been collecting the data for about a month, and maybe in a couple of years I can have some really fancy graphs with things like "03Q3 vs. 02Q3" in them. Maybe the size of CPAN will start to look like some crazy Bessel function going towards zero, with sharp increases for lots of uploads, then sharp decreases where everyone cleans up their directories. As long as I can graph it, who cares. :)