Functional approach for weighted random picks out of a list

bart on 2007-06-29T00:30:17

I have been trying to find a good way to pick a random item from a list, where items do not have the same probability of being picked: using weighted probabilities.

A way to visualize the problem, which is also a way to tackle it, is by representing each item by books (or planks) of unequal thickness. You make a pile of all the books, measure the total height of the pile, randomly choose a height between zero and that total height, and see which book lies at that height.

But I feel this is a very clumsy way: you have to assign an order to the items and calculate a running total of the heights of all the items that come before each one, just to determine at what height each item lies, and all that administrative work just to pick a single item. If an item is added to or removed from the list, you have to redo the ordering and summing all over again. What a contrast with equally weighted probabilities, where you just have to pick an index number.
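For reference, here is a minimal sketch of that clumsy approach in Perl (items and weights made up for illustration):

# The pile-of-books approach: a running total plus a linear scan.
my @items   = qw(a b c);
my @weights = (1, 2, 3);          # hypothetical weights
my $total   = 0;
$total += $_ for @weights;
my $height  = rand $total;        # a random height in the pile
my $pick;
for my $i (0 .. $#items) {
    $height -= $weights[$i];      # descend past this book
    if ($height < 0) { $pick = $items[$i]; last; }
}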

So I thought, surely there must be a more straightforward, functional way? One that could possibly even work in plain SQL? Because there is such a way to randomly shuffle a list of items with equal probabilities: just add an extra column with a random value (rnd, or rand in Perl, dbms_random.value in Oracle), and sort the rows according to this column. In Oracle:

select dbms_random.value, T.* from mytable T order by 1


All items have an equal chance of winning, i.e. coming out first by getting the lowest random value, irrespective of the distribution of the randomness. This holds provided the successive random values are sufficiently uncorrelated (meaning you can't really predict the next random value from the previous ones, which is actually an illusion with pseudo-random generators) and the distribution is the same for all items: then there is nothing skewing the chance of winning in favor of any of the items, hence they must all have an equal chance of winning.
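The Perl equivalent of that SQL trick, as a minimal sketch: pair every item with a random key, sort on the keys, then strip them again.

my @items    = qw(a b c d e);     # hypothetical items
my @shuffled = map  { $_->[1] }   # strip the random keys again
               sort { $a->[0] <=> $b->[0] }
               map  { [ rand, $_ ] } @items;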

But is there such a way to do the same with weighted probabilities? There must be. So I decided to explore. I am planning on comparing the values of $rand[$i], a weighted random function: a random number whose probability distribution depends on the value of the index $i. So I'm dusting off my math skills. As my confidence in them is not too great (I know I'm bound to make plenty of mistakes), I'll use experimental results to verify the math, so any blatant mistakes should stand out.

Eventually, by pure luck, I did find a formula that works, but not at first. Here's how I found it.

I'll start with just 2 items. What is the probability that item 2 will win? You calculate it like this: take every value X that item 1's random value can possibly have, and determine the probability that the value is X (or, for continuous distributions, that it lies in a small range between X and X+dx, which for small enough values of dx becomes proportional to dx). Next, determine the probability that item 2 beats this value. Multiply the two to get the probability of both happening at the same time, and finally, add up these results for all possible values of X (integrate, for continuous distributions) to cover all possible cases: you end up with the probability of item 2 winning, period.

The common generic representation is a formula
Sum(P(A)*P(B|A))
(the probability of A happening, times the probability of B happening provided A did happen, summed over all possible cases for A) which is, applied here:
Sum(P(x1 between X and X+dx)*P(x2 < X))
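For continuous distributions, that sum becomes an integral. In LaTeX notation, with f_1 the probability density of item 1's random value:

$$P(\text{item 2 wins}) = \int f_1(X)\,P(x_2 < X)\,dX$$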


To work. My first thought was to just multiply (or divide) rand() by its weight:
$rand[$i] = $weight[$i]*rand;
Surely this would result in a weighted probability; only, I had no idea how that probability relates to the weight factors. With just 2 items, the formula is equivalent to:

Item 1 wins if $rand[1]<$rand[2], which is the same result I would get with
$k = $weight[2]/$weight[1];
$rand[1] = rand;
$rand[2] = $k*rand;
With k >= 1 (if this isn't the case, just swap the two items), and with X the value of x1, the chance that x2<X is X/k. Integrating over all possible values of X leads to a probability of item 2 winning of 1/(2*k). This doesn't feel right: with a factor k=2, item 2 has a probability 1/4 of winning, and item 1 has 3/4. That's a ratio of 1/3, not 1/2. Experiments confirm this result.
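Written out in LaTeX notation (the density of x1 is 1 on the interval 0 to 1):

$$P(\text{item 2 wins}) = \int_0^1 \frac{X}{k}\,dX = \frac{1}{2k}$$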

So let's try again for k<=1. There's an asymmetry between values of k above and below 1, and the reason for this asymmetry is clipping: with k<1, there's a threshold for $rand[1] above which item 2 cannot possibly lose, and that threshold value is k. So the probability of item 2 winning, for a given value x of $rand[1], breaks into 2 pieces:
x/k    for 0 <= x <= k
1      for k <= x < 1
Integration over the range 0 to 1 for x yields k**2/(2*k)+(1-k) = 1-k/2. (for k=1/2, this is 3/4)
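The same calculation in LaTeX notation, piece by piece:

$$P(\text{item 2 wins}) = \int_0^k \frac{x}{k}\,dx + \int_k^1 1\,dx = \frac{k}{2} + (1-k) = 1 - \frac{k}{2}$$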

Time to confirm the results:
use Math::Random::MT qw(rand);
my $k = 2;
my @n = ( 0 , 0 );
for my $i (1 .. 1E5) {
    my @rand;
    $rand[0] = rand;
    $rand[1] = $k*rand;
    $n[index_with_min_value(@rand)]++;
}
use Data::Dumper;
print Dumper \@n;

sub index_with_min_value {
    # what index has the lowest value in a list?
    my $ix  = 0;
    my $min = $_[0];
    for my $i (1 .. $#_) {
        if ($_[$i] < $min) {
            $ix  = $i;
            $min = $_[$i];
        }
    }
    return $ix;
}
The result of a test run is:
$VAR1 = [
          75088,
          24912
        ];
I'm using a better random number generator than the one that comes with Perl. The standard random number generator in ActivePerl/Win32 has a meager 15 bits of resolution, so you get at most 32768 different values, and it has a repetition period in the same range. Thus for the above experiment, you'd go through the sequence of all possible random values about 3 times. The generator I chose, an implementation of the Mersenne Twister, has a period of 4.315425E6001, which is like forever, and it has (in this implementation) a resolution of 32 bits. By just importing its rand function, it replaces the built-in rand. So I'm hoping it is better. :)

The results are very skewed in favor of the values with the smaller factor: a sample run with 3 items with weight factors (1, 2, 3) produced the result:
$VAR1 = [
          63890,
          22216,
          13894
        ];
Thus item 3, with a weight factor that's 3 times larger, has about a 5 times smaller likelihood of winning.

So I was thinking, what if I picked a distribution that is free of clipping, one that has no upper limit? A nice candidate distribution is the exponentially damped function, a function that you can see a lot in nature:
P[X > x] = exp(-x)
with X > 0 and x > 0, but with no upper bound.

This function's graph looks pretty much the same wherever you are on the x axis: shifting x merely scales P.

So I wanted to compare results for probability distributions of exp(-x) for item 1, with exp(-k*x) for item 2. (This time, I decided to let the larger value for x win, because P[x < X] doesn't look as nice as a function.)

The probability for x1 (value for item 1) to be between x and x+dx is exp(-x)*dx. The probability of item 2 beating that (having a bigger value) is exp(-k*x). (A larger k results in a faster decaying probability function.)

Integration of ∫ exp(-k*x)*exp(-x) dx over the range 0 to ∞ yields the value 1/(1+k).
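Step by step, in LaTeX notation:

$$\int_0^\infty e^{-x}\,e^{-kx}\,dx = \int_0^\infty e^{-(1+k)x}\,dx = \frac{1}{1+k}$$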

That is nice: with k=2 I get 1/3, and its counterpart, with k=0.5, gets 1/(1.5) = 2/3. Those are the numbers I'm after, as their ratio is 2! But do note that the larger factor yields a smaller probability, so you have to divide by the weight, not multiply.

So how can you get a distribution that's exponential out of a plain uniform distribution? By using a transforming function: x=fn(rand). With $y=rand, which plainly follows a uniform distribution between 0 and 1, and the required equivalence $y==exp(-$x), we end up with the transforming function $x=-log($y). Since the probability that $y<Y (with Y between 0 and 1) is Y, the probability that ($x=-log($y)) > (X=-log(Y)) is Y == exp(-X). (Note the swapping of directions: a smaller $y yields a larger $x.)

This is exactly the result I am after, with just one nitpick: the border cases. rand can become zero, which is a value that log() chokes on; but it'll never be exactly 1, which is a value that log() doesn't mind. So I'm reversing the direction for y, replacing -log(rand) with -log(1-rand). For the rest, nothing changes, as the probability for in-between values remains unchanged.
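In formula form: with y uniform on (0,1) and x = -ln(1-y),

$$P(x > X) = P\bigl(1-y < e^{-X}\bigr) = P\bigl(y > 1-e^{-X}\bigr) = e^{-X}$$

which is exactly the exponential distribution I wanted.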

That's enough theory, so let's confirm these results through experimenting. As item 2 has a weight of 2, and item 3 a weight of 3:
use Math::Random::MT qw(rand);
my @w = (1, 2, 3);
my @n = (0, 0, 0);
for my $i (1 .. 1E5) {
    my @x;
    $x[$_] = -log(1 - rand)/$w[$_] for 0 .. $#w;
    $n[index_with_min_value(@x)]++;
}
use Data::Dumper;
print Dumper \@n;

sub index_with_min_value {
    # what index has the lowest value in a list?
    my $ix  = 0;
    my $min = $_[0];
    for my $i (1 .. $#_) {
        if ($_[$i] < $min) {
            $ix  = $i;
            $min = $_[$i];
        }
    }
    return $ix;
}
Results:
$VAR1 = [
          16621,
          33235,
          50144
        ];
Bingo. As you can see, the second item has twice the chance of winning compared to the first item, with weight factors of 2 vs. 1; and the third item has 3/2 times the chance of winning compared to the second one, with weight factors of 3 vs. 2. It's just perfect.
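To wrap it up, here is the whole trick as one small reusable sub (a sketch, with my own naming):

# Pick an index with probability proportional to its weight,
# using the exponential transformation derived above.
sub weighted_pick_index {
    my @w = @_;
    my ($ix, $min);
    for my $i (0 .. $#w) {
        my $x = -log(1 - rand) / $w[$i];
        ($ix, $min) = ($i, $x) if !defined $min or $x < $min;
    }
    return $ix;    # e.g. weighted_pick_index(1, 2, 3) returns 2 half of the time
}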

And I'm sure you can use this safely in (Oracle) SQL.
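For instance, a sketch of what that might look like in Oracle, assuming a numeric weight column named w; the first row of the ordered set is the weighted winner:

select * from (
    select T.* from mytable T
    order by -ln(1 - dbms_random.value) / T.w
) where rownum = 1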


Seems too complicated

merlyn on 2007-06-29T02:55:04

Compared to the one I posted in 1996. Maybe I'm just missing something.

Triangular probabilities

jdavidb on 2007-06-29T11:49:57

Neat. I've always been interested in weighted probabilities.

I have a system I've used for lots of things where you select from a list where items near the top are more heavily weighted than items near the bottom. After selection, the selected item goes to the bottom of the list for the next round, meaning items that have not been recently selected are more likely to be picked sooner. Basically it's a way of randomly shuffling but giving precedence to trying not to repeat things too soon.

This started out as a stack of CDs that I cycled through to listen to at work: pick a CD, then put it at the bottom and randomly pick another, but weighted so I'd tend to hear something I hadn't heard in a while. Today it's implemented as a private Perl module that works on an array, and in fact I had a program that implemented this through Tie::File, so that I could run it on a text file containing all my choices and the file would keep track of the ordering from run to run.

The mathematics is that the item at the bottom of the list has probability 1 out of T of being selected, the next item up has probability 2 out of T of being selected, next item has probability 3 out of T, etc., all the way to the top of the list, item N, which has probability N out of T of being selected. T = SUM(1..N). T is a triangular number, hence the term "triangular probabilities."
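A minimal sketch of how that selection might look in Perl (my own naming, not the actual private module):

# Pick from a list with triangular weights: the top item ($list->[0])
# has weight N, the bottom item has weight 1; the pick moves to the bottom.
sub pick_triangular {
    my ($list) = @_;
    my $n = @$list;
    my $t = $n * ($n + 1) / 2;    # T = SUM(1..N), a triangular number
    my $r = rand $t;
    for my $i (0 .. $n - 1) {
        my $w = $n - $i;          # weight N at the top, down to 1 at the bottom
        if ($r < $w) {
            my ($pick) = splice @$list, $i, 1;
            push @$list, $pick;   # selected item goes to the bottom
            return $pick;
        }
        $r -= $w;
    }
}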

Code cleanup

Aristotle on 2007-06-29T17:55:14

It would have been easier to follow what’s going on if you had edited the code prior to writing, so as to highlight what you are talking about. I might have written the first program like this:

use Data::Dumper;
use Math::Random::MT qw(rand);

sub min_idx {
    my $idx = 0;
    for my $i ( 1 .. $#_ ) {
        $idx = $_[$idx] > $_[$i] ? $i : $idx;
    }
    return $idx;
}

sub run {
    my ( $arg ) = @_;
    my @n;
    for my $i (1 .. $arg->{count} ) {
        my @d = $arg->{distrib}->( @{ $arg->{weights} } );
        $n[ min_idx @d ]++;
    }
    return @n;
}

print Dumper [ run {
    count   => 1E5,
    weights => [ 1, 2 ],
    distrib => sub { map { $_ * rand } @_ },
} ];

Then instead of repeating the program for the second formula (with slight variations in variable names that obscure your point), you could simply have written that this time, we will be using this:

    count   => 1E5,
    weights => [ 1, 2, 3 ],
    distrib => sub { map { -log( 1 - rand ) / $_ } @_ },

Much easier to see what the difference is.