std-dev

ziggy on 2005-08-10T14:10:30

Yesterday, I was hacking on a script to extract a series of numeric values from a data set. I wanted to understand the data better than just looking at (min | max | average).

So, after I had a list of values, one per line, I followed my first instinct and loaded that file into Excel. (I should mention that I view using spreadsheets as a symptom of a larger problem, not a part of a working solution.) After a while, I realized that I had a lot of data to analyze and a lot of tests to run, and that this was the fast track to weeks of needless agony and make-work.

Excel did have one benefit -- it helped me understand standard deviation a little better. It's a little annoying, though, that Excel's =STDEV() function, the most obvious function to use for calculating standard deviation, actually computes the standard deviation of a sample (dividing by N-1), not the standard deviation of an entire population (dividing by N). Once I refreshed my memory of the concepts involved, it took a while to figure out why the standard formula wasn't agreeing with Excel's results. Sure enough, the =STDEVP() function did match with better than random precision.

I took a quick look on CPAN, but didn't find anything that does standard deviation. I know it's there, but I didn't want to download a huge Math library to calculate a simple function. So I wrote a quick and dirty std-dev instead:

#!/usr/bin/env perl

use strict;
use warnings;
use List::Util qw(sum min max);

chomp(my @values = <>);
my $n = @values;
my $avg = sum(@values)/$n;
# Population standard deviation: root of the mean squared deviation from $avg.
my $std_dev = sqrt(sum(map {($_ - $avg) ** 2} @values) / $n);

print "total   = $n\n";
print "std_dev = $std_dev\n";
print "avg     = $avg\n";
print "min     = ", min(@values), "\n";
print "max     = ", max(@values), "\n";

The hard part was the single line of code to calculate standard deviation. That was translated verbatim from the definition on the Wikipedia page.
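For comparison, here is a minimal sketch of both variants side by side, since the difference between them is what caused the mismatch with Excel. The sub names (sum_sq_dev, stddev_population, stddev_sample) and the sample data are made up for illustration; only List::Util's sum comes from the script above.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Sum of squared deviations from the mean, shared by both variants.
# (Hypothetical helper name, not part of any module.)
sub sum_sq_dev {
    my @v   = @_;
    my $avg = sum(@v) / scalar(@v);
    return sum(map { ($_ - $avg) ** 2 } @v);
}

# Population standard deviation: divide by N (Excel's =STDEVP()).
sub stddev_population {
    my @v = @_;
    return sqrt(sum_sq_dev(@v) / scalar(@v));
}

# Sample standard deviation: divide by N - 1 (Excel's =STDEV()).
# Needs at least two values, otherwise this divides by zero.
sub stddev_sample {
    my @v = @_;
    return sqrt(sum_sq_dev(@v) / (scalar(@v) - 1));
}

# Illustrative data only: mean is 5, squared deviations sum to 32.
my @data = (2, 4, 4, 4, 5, 5, 7, 9);
printf "population = %.4f\n", stddev_population(@data);   # 2.0000
printf "sample     = %.4f\n", stddev_sample(@data);       # 2.1381
```

With only eight values the two results differ noticeably, which is roughly the kind of gap that shows up between =STDEV() and =STDEVP() on a small data set.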

This little script, along with nth, reduced a bunch of time-consuming Excel drudgery into a nearly autonomic piece of analysis. ;-)


Statistics::Descriptive

itub on 2005-08-10T15:02:02

Statistics::Descriptive calculates the standard deviation, among other things. It is the sample standard deviation, however (same as Excel and my pocket calculator). Most people prefer the sample formula because the "entire population" is usually considered to be "infinite". Also, if you have a reasonably large sample, it doesn't really matter whether you divide by N or N-1.
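A rough sketch of how little the divisor matters as N grows: both formulas share the same sum of squared deviations, so the ratio of the sample result to the population result depends only on N. The sub name sample_to_population_ratio is made up for this illustration, and the sketch uses no modules at all.

```perl
use strict;
use warnings;

# Ratio of sample std dev to population std dev for the same N values.
# sqrt(S / (N - 1)) / sqrt(S / N) simplifies to sqrt(N / (N - 1)),
# where S is the shared sum of squared deviations.
sub sample_to_population_ratio {
    my ($n) = @_;
    die "need at least two values\n" if $n < 2;
    return sqrt($n / ($n - 1));
}

printf "N = %4d  ratio = %.4f\n", $_, sample_to_population_ratio($_)
    for (2, 10, 100, 1000);
# N =    2  ratio = 1.4142
# N =   10  ratio = 1.0541
# N =  100  ratio = 1.0050
# N = 1000  ratio = 1.0005
```

By N = 100 the two results agree to within about half a percent, so for a data set of any real size the choice mostly affects the last few digits.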