I have been trying to find a good way to pick a random item from a list where the items do not all have the same probability of being picked: weighted probabilities, in other words.
A way to visualize the problem, which is also a way to tackle it, is by representing each item by books (or planks) of unequal thickness. You make a pile of all the books, measure the total height of the pile, randomly choose a height between zero and that total height, and see which book lies at that height.
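In code, the pile-of-books approach looks something like this (a minimal sketch of my own; the item names and weights are made up):

    # stack the items, then see which one a random height falls into
    my %weight = ( apple => 1, banana => 2, cherry => 3 );
    my @pile   = keys %weight;            # the order is arbitrary
    my $total  = 0;
    $total += $weight{$_} for @pile;      # total height of the pile
    my $height = rand($total);            # random height within the pile
    my $pick;
    for my $item (@pile) {
        $height -= $weight{$item};        # skip past this book
        if($height < 0) {                 # the height falls inside this book
            $pick = $item;
            last;
        }
    }
    print "$pick\n";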
But I feel this is a very clumsy way: you have to assign an order to the items, calculate a running total of the heights of the items that come before each one to determine at what height it lies, and all that just to pick a single item. If an item is added or removed from the list, you have to start over with all the administrative work of ordering and summing. What a contrast with equally weighted probabilities, where you just have to pick an index number.
So I thought, surely there must be a more straightforward, functional way? One that could possibly even work in plain SQL? Because there is such a way to randomly shuffle a list of items with equal probabilities: just add an extra column with a random value (rnd, or rand in Perl, dbms_random.value in Oracle), and sort the rows according to this column. In Oracle:
    select dbms_random.value, T.* from mytable T order by 1
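For reference, the same trick in Perl might look like this (a sketch of mine; @items is whatever list you want shuffled):

    # shuffle by tagging each item with a random number and sorting on it
    my @items    = ('a' .. 'j');
    my @shuffled = map  { $_->[1] }
                   sort { $a->[0] <=> $b->[0] }
                   map  { [ rand, $_ ] } @items;
    print "@shuffled\n";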
What I am looking for is something similar: calculate, for each item, a value $rand[$i] which is a weighted random function: a random number whose probability distribution depends on the value of the index $i.
So I'm dusting off my math skills. As my confidence in them is not too great (I know I'm bound to make plenty of mistakes), I'll use experimental results to verify the calculations, so any blatant mistakes should stand out. The basic tool is the formula for a combined probability:

    Sum(P(A)*P(B|A))

(the probability of A happening, times the probability of B happening provided A did happen, summed over all possible cases for A), which, applied here, is:

    Sum(P(x1 between X and X+dx)*P(x2 < X))
My first idea was simply to multiply the output of rand() by the item's weight:

    $rand[$i] = $weight[$i]*rand;

Surely that would result in a weighted probability; only, I had no idea what the relation to the weight factors would be. With just 2 items, picking the winner comes down to testing

    $rand[1] < $rand[2]

which gives the same result as

    $k = $weight[2]/$weight[1];
    $rand[1] = rand;
    $rand[2] = $k*rand;
With k >= 1 (if this isn't the case, just swap the two items), and with X = x1, the chance that x2 < X is X/k. Integrating X/k over the range 0 to 1 for X leads to a probability of item 2 winning of 1/(2*k). This doesn't feel right: with a factor k = 2, item 2 has a probability of 1/4 of winning, and item 1 has 3/4. That's a ratio of 1/3, not 1/2. Experiments confirm this result.

Now for the case k <= 1. There's an asymmetry between values of k above and below 1, and the reason for this asymmetry is clipping: with k < 1, there's a threshold for $rand[1] above which item 2 cannot possibly lose, and that threshold value is k. So the probability of item 2 winning for a value x of $rand[1] is broken into 2 pieces:
    x/k   for 0 <= x <= k
    1     for k <= x < 1

Integration over the range 0 to 1 for x yields

    k**2/(2*k) + (1-k) = 1 - k/2

(for k = 1/2, this is 3/4).
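Putting the two cases side by side, spelled out as integrals (my own summary of the calculations above):

\[
P(\text{item 2 wins}) =
\begin{cases}
\int_0^1 \frac{x}{k}\,dx = \frac{1}{2k}, & k \ge 1 \\[4pt]
\int_0^k \frac{x}{k}\,dx + \int_k^1 1\,dx = 1 - \frac{k}{2}, & k \le 1
\end{cases}
\]

Either way, the relation between the weight factor k and the win probability is not the simple proportionality I'm after.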
Time for an experiment:

    use Math::Random::MT qw(rand);
    my $k = 2;
    my @n = ( 0, 0 );
    for my $i (1 .. 1E5) {
        my @rand;
        $rand[0] = rand;
        $rand[1] = $k*rand;
        $n[index_with_min_value(@rand)]++;
    }
    use Data::Dumper;
    print Dumper \@n;

with a little helper sub that returns the index of the smallest value in a list:

    sub index_with_min_value {
        # what index has the lowest value in a list?
        my $ix = 0;
        my $min = $_[0];
        for my $i (1 .. $#_) {
            if($_[$i] < $min) {
                $ix = $i;
                $min = $_[$i];
            }
        }
        return $ix;
    }

The result of a test run is:
    $VAR1 = [ 75088, 24912 ];

I'm using a better random number generator than the one that comes with Perl. The standard random number generator in ActivePerl/Win32 has a meager 15 bits of resolution, so you get at most 32000-something different values, and it has a repetition period in the same range; thus for the above experiment, you'd go through the sequence of all the possible random values 3 times. The generator I chose, an implementation of the Mersenne Twister, has a period of 4.315425E6001, which is like forever, and it has (in this implementation) a resolution of 32 bits. By just importing its function rand, it can replace the built-in rand. So I'm hoping it is better. :)

Repeating the experiment with three items and weight factors (1, 2, 3) yields:

    $VAR1 = [ 63890, 22216, 13894 ];

Thus item 3, with a weight factor that's 3 times larger, has about a 5 times smaller likeliness of winning.
So multiplying a plain rand by the weight won't do; the clipping against the upper bound of rand's range is what breaks the proportionality. What I need is a probability distribution that isn't clipped, and the obvious candidate is the exponential distribution:

    P[X > x] = exp(-x)

with X > 0 and x > 0, but with no upper bound. So I'm comparing exp(-x) for item 1 with exp(-k*x) for item 2. (This time, I decided to let the larger value for x win, because P[x < X] doesn't look as nice as a function.) The probability of item 1 ending up between x and x+dx is exp(-x)*dx. The probability of item 2 beating that (having a bigger value) is exp(-k*x). (A larger k results in a faster decaying probability function.) Integrating the product over all x from 0 to infinity yields a probability of item 2 winning of 1/(1+k). That is exactly the relation I'm after: for k = 2, item 2 wins 1/3 of the time and item 1 wins 2/3, a ratio of exactly 1/2.
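Spelled out as an integral (my notation):

\[
P(\text{item 2 wins}) = \int_0^\infty e^{-x}\, e^{-kx}\, dx = \frac{1}{1+k}
\]

and, unlike the clipped linear attempt, this single formula holds for every k > 0, with no case split.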
All that's left is to generate a random number with this exponential distribution, by transforming the uniform output of rand through some function: x = fn(rand). With $y == exp(-$x), we end up with the transforming function $x = -log($y). Since the probability that $y < Y (with Y between 0 and 1) is Y, the probability that ($x = -log($y)) > (X = -log(Y)) is Y == exp(-X). (Note the swapping of the directions: a smaller $y yields a larger $x.) So this is exactly the result I am after, with just one nitpick: the border cases. rand can become zero, which is a value that log() chokes on; but it'll never be exactly 1, which is a value that log() doesn't mind. So I'm reversing the direction for y, and replacing -log(rand) with -log(1-rand). For the rest, nothing changes, as the probability for the in-between values remains unchanged.
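As a quick sanity check of the transform (a snippet of my own, not from the original post), the average of -log(1 - rand) over many draws should come out close to 1, the mean of an exponential distribution with rate 1:

    use Math::Random::MT qw(rand);
    my $sum = 0;
    $sum += -log(1 - rand) for 1 .. 1E5;
    print $sum/1E5, "\n";    # prints a value close to 1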
The final test script, reusing the same index_with_min_value helper as before:

    use Math::Random::MT qw(rand);
    my @w = (1, 2, 3);
    my @n = (0, 0, 0);
    for my $i (1 .. 1E5) {
        my @x;
        $x[$_] = -log(1 - rand)/$w[$_] for 0 .. $#w;
        $n[index_with_min_value(@x)]++;
    }
    use Data::Dumper;
    print Dumper \@n;

Results:
    $VAR1 = [ 16621, 33235, 50144 ];

Bingo. As you can see, the second item has twice the chance of winning compared to the first item, with a weight factor of 2 vs. 1; and the third item has 3/2 times the chance of winning compared to the second one, with weight factors of 3 vs. 2. It's just perfect.
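Packaged for reuse, the whole technique fits in a few lines. Here is a minimal sketch (the sub name and interface are mine, not from the post above); it works because the minimum of independent exponentially distributed values with rates $w[$i] falls on index $i with probability $w[$i] divided by the total weight:

    use Math::Random::MT qw(rand);

    # return a random index, with probability proportional to its weight
    sub pick_weighted {
        my @w = @_;
        my ($pick, $min);
        for my $i (0 .. $#w) {
            my $x = -log(1 - rand) / $w[$i];   # exponential, rate $w[$i]
            ($pick, $min) = ($i, $x) if !defined($min) || $x < $min;
        }
        return $pick;
    }

    # index 2 (weight 3) should come up about half of the time
    my @weight = (1, 2, 3);
    my @count;
    $count[ pick_weighted(@weight) ]++ for 1 .. 1E5;
    print "@count\n";

Note that, unlike the pile-of-books approach, there is no ordering or running total to maintain: adding or removing an item requires no bookkeeping at all.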
Neat. I've always been interested in weighted probabilities.
I have a system I've used for lots of things where you select from a list where items near the top are more heavily weighted than items near the bottom. After selection, the selected item goes to the bottom of the list for the next round, meaning items that have not been recently selected are more likely to be picked sooner. Basically it's a way of randomly shuffling but giving precedence to trying not to repeat things too soon.
This started out as a stack of CDs I used to cycle through at work: pick a CD to listen to, then put it at the bottom and randomly pick another, weighted so I'd tend to hear something I hadn't heard in a while. Today it's implemented as a private Perl module that works on an array; in fact, I had a program that implemented this through Tie::File, so that I could run it on a text file containing all my choices and the file would keep track of the ordering from run to run.
The mathematics is that the item at the bottom of the list has probability 1 out of T of being selected, the next item up has probability 2 out of T of being selected, next item has probability 3 out of T, etc., all the way to the top of the list, item N, which has probability N out of T of being selected. T = SUM(1..N). T is a triangular number, hence the term "triangular probabilities."
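A minimal sketch of how that selection step might be coded (my own illustration, not the poster's actual module), with the heaviest item at index 0 and the bottom of the list at the end of the array:

    # pick from @$list with triangular weights, then move the pick down
    sub pick_triangular {
        my ($list) = @_;
        my $n = @$list;
        my $t = $n * ($n + 1) / 2;       # T = SUM(1..N)
        my $r = int rand $t;             # 0 .. T-1
        my $i = 0;
        while ($r >= $n - $i) {          # item at index $i has weight N-$i
            $r -= $n - $i;
            $i++;
        }
        my ($pick) = splice @$list, $i, 1;
        push @$list, $pick;              # the selected item goes to the bottom
        return $pick;
    }

    my @list = qw(A B C D);
    print pick_triangular(\@list), " -> @list\n" for 1 .. 6;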
It would have been easier to follow what's going on if you had edited the code before posting, so as to highlight what you are talking about. I might have written the first program like this:
    use Data::Dumper;
    use Math::Random::MT qw(rand);

    sub min_idx {
        # index of the smallest value in the argument list
        my $idx = 0;
        for my $i (1 .. $#_) {
            $idx = $_[$idx] > $_[$i] ? $i : $idx;
        }
        return $idx;
    }

    sub run {
        # run $arg->{count} trials: each trial draws one value per weight
        # through $arg->{distrib} and counts which index came out lowest
        my ( $arg ) = @_;
        my @n;
        for my $i (1 .. $arg->{count}) {
            my @d = $arg->{distrib}->( @{ $arg->{weights} } );
            $n[ min_idx @d ]++;
        }
        return @n;
    }

    print Dumper [ run {
        count   => 1E5,
        weights => [ 1, 2 ],
        distrib => sub { map { $_ * rand } @_ },
    } ];
Then, instead of repeating the program for the second formula (with slight variations in variable names that obscure your point), you could simply have written that this time we will be using this:
    count   => 1E5,
    weights => [ 1, 2, 3 ],
    distrib => sub { map { -log( 1 - rand ) / $_ } @_ },
Much easier to see what the difference is.