aboutsummaryrefslogtreecommitdiff
path: root/content/articles/the-birthday-paradox.md
blob: e888c34c363d68907877c2dbf1b78369961e35a8 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
---
slug: the-birthday-paradox
title: "The Birthday Paradox"

# date (optional for articles)
date: "2021-11-11T06:46:00-04:00"

# draft articles are not visible on live site
draft: false

# show on articles page
show: true

# uncomment to show table of contents
toc: true

tables:
  probs:
    css: "is-hoverable"
    cols:
    - id: num_people
      name: Number of People
      tip: Number of people in room.
    - id: probability
      name: Probability
      tip: Probability that at least two people in a room with this many people share a birthday.
    - id: percent
      name: Percent
      tip: Probability represented as a percentage.
      align: right
    rows:
    - num_people: 1
      probability: 0.0
      percent: 0.00%
    - num_people: 2
      probability: 0.00273
      percent: 0.27%
    - num_people: 3
      probability: 0.0082
      percent: 0.82%
    - num_people: 4
      probability: 0.01635
      percent: 1.64%
    - num_people: "..."
    - num_people: 22
      probability: 0.47569
      percent: 47.57%
    - num_people: 23
      probability: 0.50729
      percent: 50.73%
      _css: has-text-weight-bold has-background-info-light
    - num_people: 24
      probability: 0.53834
      percent: 53.83%
      _css: has-text-weight-bold has-background-info-light
    - num_people: 25
      probability: 0.56869
      percent: 56.87%
      _css: has-text-weight-bold has-background-info-light
---

## Introduction

I share my birthday with my mom.  On ~~my~~ her birthday I asked if she
had heard of the following puzzle:

> How many people have to be in a room before the probability that there
> is a shared birthday is greater than 50%?

This is called the [Birthday Problem][bp], and the solution is known as
the *birthday paradox*.  It is an interesting problem because:

* The answer is counterintuitive (hence the name *birthday paradox*).
* The solution relies on elements of [combinatorics][], [set theory][],
  and [probability][].
* The ramifications affect the security of [cryptographic hash
  algorithms][hash].

The next section covers the math that you'll need.  If you're already
familiar with [probability][], feel free to [skip ahead][solving].

## Probability Crash Course

Here is an extremely abbreviated summary of of math that you'll need
from [combinatorics][], [probability][], and [set theory][]:

1. The expression `N!`, where `N` is a non-negative integer, is called a
   [factorial][], and it is means "the product of all the integers from
   `N` down to `1`".  So `5! = 5 × 4 × 3 × 2 × 1 = 120`.  To make our
   lives easier, `0! = 1` by definition.
2. In [probability][], an occurance is called an *event*, and the
   probability of an *event* `A` is denoted `P(A)`.
3. The value of a probability is a real number between `0` and `1`,
   inclusive (written `P(A) ∈ [0, 1]`).  Inclusive means that `0` and
   `1` are valid probabilities, where `0` means "this event is
   impossible" and `1` means "this event will occur with certainty".
4. You can convert a probability to a percentage by multiplying by
   `100` and appending a trailing `%`.  So `P(A) = 0.1` means `10%`,
   `P(A) = 0.2` means `20%`, and so on.
5. The probability of two [independent][] events `A` and `B` both
   occurring is written `P(A ⋂ B)` and read as "the [joint
   probability][] of A and B".  The value is calculated as `P(A) ×
   P(B)`.  This pattern of multiplication continues, so the probability
   of three [independent][] events `A`, `B`, and `C` occurring is `P(A ⋂
   B ⋂ C) = P(A) × P(B) × P(C)`, and so on.  The upside-down U symbols
   represent an [intersection][intersection].
6. The probability of an event `A` **not** occurring is written `P(¬A)`,
   called the *compliment*, and calculated as `1 - P(A)`.  This is
   useful because sometimes it is much easier to calculate the
   probability that an event will not occur, as we'll see below.

Here's what all the chicanery above looks like visually:

{{< figure
  src="/files/articles/birthday-paradox/probability-basics.min.svg"
  class="image"
  width=492
  height=556
  caption="Visual representation of probability basics."
>}}

To calculate a [discrete][] (countable) probability, you sum up all the
matching events, then divide the sum by the total number of possible
events.

For example, if you put one red marble and three blue marbles in a jar
and then randomly choose a single marble out of the jar, then the
probability that you will choose the red marble is:

```sh
P(R) = probability of choosing a red marble
P(R) = number of red marbles / total number of marbles
P(R) = 1/(1 + 3) = 0.25 = 25%
```
&nbsp;

Here's a more complicated example: Let's say you put 3 red marbles and 5
blue marbles in a jar, pick a marble from the jar at random, and then
roll a fair, [6-sided die][die].  What is the probability that you will
pick a red marble from the jar **and** roll a 5 or a 6 on the [die][]?

```sh
P(R) = probability of choosing a red marble
P(R) = number of red marbles / total number of marbles
P(R) = 3/(3 + 5) = 0.375 = 37.5%

P(D) = probability of rolling a 5 or a 6 on the die
P(D) = number of matching die faces / total number of die faces
P(D) = 2/6 = ~0.333 = ~33.3%

P(R  D) = P(R) × P(D)
P(R  D) = 3/8 × 1/3 = 0.125 = 12.5%
```

## Solving the Birthday Problem

Let's get back to finding a solution.

To save time, I'll tell you in up front advance that this is one of
those problems where it much easier to calculate the compliment; that
is, the [probability][] that everyone has a *unique* birthday.

If there is only one person in the room, then the probability that
everyone present has a unique birthday is `1.0` (`100%`).

```sh
P1 = P(unique first birthday) = 365/365
P1 = 1.0 = 100%
```
&nbsp;

If there are two people in the room, the probability that everyone has a
unique birthday can be calculated by multiplying the probability from
the previous step by the probability that the second person's birthday
is not the same as the first person's birthday.  Algebraically:

```sh
P2 = P(unique second birthday)
P2 = P1 × 364/365
P2 = 365/365 × 364/365 = (365 × 364)/(365^2) = 364/365
P2 = ~0.9972 = 99.7%
```
&nbsp;

Add a third person, and a pattern begins to emerge:

```r
P3 = P(unique second birthday  unique third birthday)
P3 = P2 × 363/365
P3 = 365/365 × 364/365 × 363/365 = (365 × 364 × 363)/(365^3)
P3 = ~0.9917 = ~99.2%
```
&nbsp;

Do you see the pattern?

The probability that everyone in a group of `N` people has a unique
birthday is the product of the numbers between `365` and `365 - (N - 1)`
(inclusive), divided by `365{{< sup "N" >}}`.

We can calculate the product of the numbers between `365 - (N - 1)` and
`365` by dividing `N!` by the product of terms we don't care about.  I
don't have a name for this (the equation is the same as a [permutation
without repetition][], but that's just a coincidence), so let's call it
a "truncated factorial":

```sh
TF(X, Y) = product of numbers between X and Y (inclusive) where X  [1, Y)
TF(X, Y) = Y!/(X - 1))!

TF((365 - (N - 1)), 365) = 365!/((365 - (N - 1) - 1))!
TF(366 - N, 365) = 365!/((365 + -N + 1 + -1))!
TF(366 - N, 365) = 365!/(365 - N)!
```
&nbsp;

Combined with the previous equation:

```sh
U(N) = P(N unique birthdays)
U(N) = 365!/((365 - N)! × 365^N)
```
&nbsp;

Then we calculate the compliment of the probability of N unique
birthdays to determine the probability of shared birthdays in a group of
N people:

```sh
S(N) = P(shared birthdays among N people)
S(N) = P(¬U(N)) = 1 - U(N)
S(N) = 1 - 365!/((365 - N)! × 365^N)
```
&nbsp;

At this point you can set `S(N) = 0.5` and solve for `N` to find an
exact solution, but often for discrete probabilities it's easier to
write a script and let a computer do the work for you.

So I wrote a couple of scripts which use the derived algorithm to do the
following:

1. Calculate the probability of shared birthdays in a room containing
   `N` people for `N ∈ [1, 50]`.
2. Exclude intermediate rows in the table to keep it short.
3. Highlight the relevant results (values of `N` where `S(N) > 0.5`).
4. Render the bar chart and table in the next section.

The full script is available in the [companion GitHub
repository][github].  Here is the [Ruby][] code for our derived
equation:

```ruby
# memoize 365!
F365 = 365.factorial

#
# Calculate the probability of at least one shared birthday in a group
# of N people.
#
def shared_birthdays(v)
  1 - F365 / (365 - v).factorial * (365 ** -v)
end
```
&nbsp;

## Solution

Here are the results of the script from the previous section:

{{< figure
  src="/files/articles/birthday-paradox/people-vs-probability.min.svg"
  class="image"
  width=614
  height=461
  caption="Number of People versus Probability of Shared Birthdays."
>}}

{{< table "probs" >}}

So the answer to the birthday problem is:

**If there are 23 or more people in the room, then the probability of at
least one shared birthday is greater than 50%**.

## Implications

The answer is called the *birthday paradox* is because it doesn't feel
intuitive.  When asked, most people guess that a much higher number of
people (50 or 130) are needed before the probability of a shared
birthday is greater than 50%.

The mistake is that we aren't just looking for a person in the room that
shares a birthday with a single person.  Instead, we are looking for
*anyone* who shares a birthday with *anyone else*.  As a result, the
probability of a shared birthday increases much faster than expected.

This has implications for the [collision resistance][] property of
[cryptographic hash functions][hash].  [Collision resistance][] means
that it is hard to find two inputs which hash to the same output.

The birthday paradox means that for all [hash functions][] with an N-bit
output, the probability of finding a collision for a random input is
greater than `0.5` after calculating `2{{< sup "N/2" >}}` hashes.  To
make matters worse, searching for hash collisions is an [embarassingly
parallel][] problem, and programs like [hashcat][] are network-aware and
[GPGPU][]-accelerated.

This is one of the reasons that the output sizes of [cryptographic hash
functions][hash] are much larger than non-cryptographic [hash
functions][].  [SHA-2][] and [BLAKE2][], for example, generate an output
of 256 bits or more, while non-cryptographic [hash functions][]
typically only generate an output of 32 or 64 bits.

## Conclusion

My mom was not amused when I asked if 23 people were present when I was
born.

## Further Reading

* [Companion GitHub repository][github]
* [The Birthday Problem (Wikipedia)][bp]
* [Inkscape][]: Used to create the probability basics figure.
* [matplotlib][]: Used to create the shared birthday probability chart.
* [hugo-shortcode-table][]: Used to create the shared birthday
  probability table.

[bp]: https://en.wikipedia.org/wiki/Birthday_problem
  "The birthday problem."
[combinatorics]: https://en.wikipedia.org/wiki/Combinatorics
  "The mathematics of counting."
[set theory]: https://en.wikipedia.org/wiki/Set_theory
  "The mathematics of sets."
[probability]: https://en.wikipedia.org/wiki/Probability
  "The mathematics of determining how likely an event is to occur."
[hash]: https://en.wikipedia.org/wiki/Cryptographic_hash_function
  "Cryptographic hash algorithm."
[intersection]: https://en.wikipedia.org/wiki/Intersection_(set_theory)
  "Intersection."
[independent]: https://en.wikipedia.org/wiki/Probability#Independent_events
  "Independent events in probability theory."
[discrete]: https://en.wikipedia.org/wiki/Random_variable#Discrete_random_variable
  "A random value with a countable number of possible values."
[joint probability]: https://en.wikipedia.org/wiki/Joint_probability_distribution#Coin_flips
  "Joint probability."
[die]: https://en.wikipedia.org/wiki/Dice
  "Dice."
[factorial]: https://en.wikipedia.org/wiki/Factorial
  "Factorial unary operator."
[ruby]: https://ruby-lang.org/
  "Ruby programming language."
[csv]: https://en.wikipedia.org/wiki/Comma-separated_values
  "Comma-separated values."
[html]: https://en.wikipedia.org/wiki/HTML
  "HyperText Markup Language"
[yaml]: https://en.wikipedia.org/wiki/YAML
  "YAML Ain't a Markup Language"
[hugo-table-shortcode]: https://github.com/pablotron/hugo-shortcode-table
  "Table shortcode for Hugo."
[github]: https://github.com/pablotron/birthday-paradox
  "GitHub repository for scripts used in this article."
[solving]: #solving-the-birthday-problem
  "Solving the Birthday Problem"
[collision resistance]: https://en.wikipedia.org/wiki/Collision_resistance
  "Property of cryptographic hash functions which makes it hard to find two inputs which hash to the same output."
[hash functions]: https://en.wikipedia.org/wiki/Hash_function
  "Hash function"
[embarassingly parallel]: https://en.wikipedia.org/wiki/Embarrassingly_parallel
  "Problems with solutions that can be calculated in parallel."
[sha-2]: https://en.wikipedia.org/wiki/SHA-2
  "SHA-2 cryptographic hash algorithm."
[blake2]: https://en.wikipedia.org/wiki/BLAKE_(hash_function)#BLAKE2
  "BLAKE2 cryptographic hash function."
[hashcat]: https://hashcat.net/hashcat/
  "Password hash cracking tool."
[gpgpu]: https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units
  "General Purpose computation on Graphics Processing Units."
[permutation without repetition]: https://en.wikipedia.org/wiki/Permutation#Permutations_without_repetitions
  "Permutation without repetition."
[inkscape]: https://inkscape.org/
  "Free and open source vector graphics editor."
[matplotlib]: https://matplotlib.org/
  "Python library for creating static, animated, and interactive visualizations."
[hugo-shortcode-table]: https://github.com/pablotron/hugo-shortcode-table
  "Table shortcode for Hugo static site generator."