Improbable vs Statistically Impossible

schwern on 2008-10-03T22:30:57

We've been talking git again on p5p, in particular the commit ids which are SHA-1 signatures of the change (a hashing algorithm). Every time this comes up someone says "but what if they collide?! Hashes sometimes collide!"

They won't collide.

Most Perl programmer's experience with hash collisions comes from Perl's own hashing algorithm. Don't compare Perl's anemic hashing algorithm with a cryptographically secure one. They're not in the same league. They aren't even playing the same sport.

There is improbable and then there is statistically impossible.

To it in perspective with some back of the envelope calculations.

Winning the lottery is improbable. In the Oregon Powerball Lottery (jackpot $20 mil) there are roughly 1 in 50 million different combinations to pick from. There are about 4 million people in Oregon. Not all of them play the lottery, but many buy more than one ticket. And the Powerball often goes for weeks and months without a hit. But with that number of people and those odds you see why people sometimes win the big jackpot.

Getting struck by lightning is improbable. You have a 1 in 2 million chance of being struck by lightning in your lifetime. There are 6 billion people, so a lot of people get hit by lightning.

These are some improbable things. Let's look what what cryptographers consider impossible...

Given 100 EB of data (EB == 2**60 bytes) or 100 MILLION terabytes there is a 0.00000000000001% chance of collision with SHA-1. That's a 1 in 10 trillion chance. This is roughly equivalent having a single undetected failure when writing to tape. 100 EB of data is more change than we could do and still have anything resembling Perl 5... or a computer program... or human output.

Even if we randomly changed every file in perl in every commit, let's say they're 1 meg each and there's 10k of them (which is generous), it would take about 11 billion commits to reach 100 EB of change where we'd have a 1 in 10 trillion chance.

Hard drives fail at a rate of about 1 in 100 per year.

My odds of winning the Oregon Powerball this week are about a million times better.

So if you're worried about a hash collision in git, I'd start investing in lottery tickets and RAID hardware.

For those who still don't get it, let's say there is a collision. What happens? We might lose a single commit. And then we report it. And then a lot of CS guys write a lot of papers about it, because nobody's found a SHA-1 collision in the wild yet. And then we get a visit from the NSA, because SHA-1 is a US crypto standard. And git just switches to something else (probably SHA-256 in which no collisions have been found by anyone) and an even more absurdly improbable collision chance. Unix password files have done this sort of upgrade several times going from crypt to MD5 to (sometimes) SHA-1.

That's ok, by then I will have won the lottery a few thousand times, and Jarkko will have been elected Grand Duke of Finland, and we'll be too busy chatting with all the aliens we contact with SETI.

Chances of a sha1 collision

waltman on 2008-10-04T02:36:48

I asked myself the same question a few months ago when I started using git. Here's the calculation I did:

SHA1 is 160-bit checksum. 2**160 is around 10**48. Then I plugged that into the formula I found on Wikipedia to see how many checksums you'd have to generate to have a 50% chance of a collision. The answer is around 10**24.

To give you an idea of how big that number is, if you generated one git commit a second, it would take 4.5 quadrillion years before you'd have a 50% chance of a collision. That's plenty big enough for me.

Of course this is assuming that the checksums are evenly distributed over the entire 160-bit range. But that's why crypto people make the big bucks.