WEBVTT

00:00:00.790 --> 00:00:03.130
The following content is
provided under a Creative

00:00:03.130 --> 00:00:04.550
Commons license.

00:00:04.550 --> 00:00:06.760
Your support will help
MIT OpenCourseWare

00:00:06.760 --> 00:00:10.850
continue to offer high quality
educational resources for free.

00:00:10.850 --> 00:00:13.390
To make a donation or to
view additional materials

00:00:13.390 --> 00:00:17.320
from hundreds of MIT courses,
visit MIT OpenCourseWare

00:00:17.320 --> 00:00:18.570
at ocw.mit.edu.

00:00:28.762 --> 00:00:31.140
JOHN GUTTAG: So
today, we're going

00:00:31.140 --> 00:00:34.560
to move on to a fairly
different world than the world

00:00:34.560 --> 00:00:36.090
we've been living in.

00:00:36.090 --> 00:00:37.950
And this will be a
world we'll be living in

00:00:37.950 --> 00:00:40.580
for quite a few lectures.

00:00:40.580 --> 00:00:42.990
But before I do that,
I want to get back

00:00:42.990 --> 00:00:47.170
to just finish up something
that Professor Grimson started.

00:00:47.170 --> 00:00:50.520
You may recall he talked
about family trees

00:00:50.520 --> 00:00:52.650
and raised the question,
was it actually

00:00:52.650 --> 00:00:55.890
possible to represent all
ancestral relationships

00:00:55.890 --> 00:00:57.560
as a tree?

00:00:57.560 --> 00:00:59.890
Well, as a counterexample,
I'm sure some of you

00:00:59.890 --> 00:01:03.770
are familiar with Oedipus Rex.

00:01:03.770 --> 00:01:05.269
For those of you
who are not, I'm

00:01:05.269 --> 00:01:07.940
happy to give you a plot summary
at the end of the lecture.

00:01:07.940 --> 00:01:10.880
It's a rather bizarre plot.

00:01:10.880 --> 00:01:16.110
But it was captured in a
wonderful song by Tom Lehrer.

00:01:16.110 --> 00:01:19.160
The short story is Oedipus
ended up marrying his mother

00:01:19.160 --> 00:01:22.430
and having four children.

00:01:22.430 --> 00:01:25.100
And Tom Lehrer, if you've
never heard of Tom Lehrer,

00:01:25.100 --> 00:01:29.660
you're missing one of the
world's funniest songwriters.

00:01:29.660 --> 00:01:32.300
And he had a wonderful
song called "Oedipus Rex,"

00:01:32.300 --> 00:01:38.510
and I recommend this YouTube as
a way to go and listen to it.

00:01:38.510 --> 00:01:44.870
And you can gather from the
quote what the story is about.

00:01:44.870 --> 00:01:46.970
I also recommend the
play, by the way.

00:01:46.970 --> 00:01:50.330
It's really kind of
appalling what goes on,

00:01:50.330 --> 00:01:53.090
but it's beautiful.

00:01:53.090 --> 00:01:57.050
Back to the main topic,
here's the relevant reading--

00:01:59.570 --> 00:02:05.250
a small bit from later in
the book and then chapter 14.

00:02:05.250 --> 00:02:07.260
You may notice that
we're not actually going

00:02:07.260 --> 00:02:09.264
through the book in order.

00:02:09.264 --> 00:02:11.430
And the reason we're not
doing that is because we're

00:02:11.430 --> 00:02:13.440
trying to get you
information you need in time

00:02:13.440 --> 00:02:14.395
to do problem sets.

00:02:18.810 --> 00:02:24.480
So the topic of today is
really uncertainty and the fact

00:02:24.480 --> 00:02:29.550
that the world is really
annoyingly hard to understand.

00:02:32.520 --> 00:02:36.480
This is a signpost
related to 6.0002,

00:02:36.480 --> 00:02:41.170
but we won't go into too
much detail about it.

00:02:41.170 --> 00:02:43.050
We'd rather things were certain.

00:02:43.050 --> 00:02:47.330
But in fact, they
usually are not.

00:02:47.330 --> 00:02:51.710
And this is a place
where 6.0002 diverges

00:02:51.710 --> 00:02:53.930
from the typical
introductory computer science

00:02:53.930 --> 00:02:58.250
course, which focuses on
things that are functional--

00:02:58.250 --> 00:03:02.030
given an input, you always
get the same output.

00:03:02.030 --> 00:03:03.950
It's predictable.

00:03:03.950 --> 00:03:07.760
And we like to do that,
because that's easier to teach.

00:03:07.760 --> 00:03:11.300
But in fact, for reasons
we'll be talking about,

00:03:11.300 --> 00:03:14.210
it's not nearly as
useful if you're

00:03:14.210 --> 00:03:16.580
trying to actually
write computations that

00:03:16.580 --> 00:03:18.860
help you understand the world.

00:03:18.860 --> 00:03:21.110
You have to face
uncertainty head on.

00:03:25.030 --> 00:03:27.860
An analogy is, for many
years, people believed

00:03:27.860 --> 00:03:31.490
in Newtonian mechanics--

00:03:31.490 --> 00:03:34.520
I guess they still
do in 8.01 maybe--

00:03:34.520 --> 00:03:38.030
that every effect has a cause.

00:03:38.030 --> 00:03:40.430
An apple falls from the
tree because of gravity,

00:03:40.430 --> 00:03:42.390
and you know where
it's going to land.

00:03:42.390 --> 00:03:45.080
And the world can be
understood causally.

00:03:45.080 --> 00:03:50.090
And people believed this
really for quite a long time,

00:03:50.090 --> 00:03:54.500
most of history,
until the early part

00:03:54.500 --> 00:03:58.250
of the 20th century, when
the so-called Copenhagen

00:03:58.250 --> 00:04:00.770
doctrine was put forth.

00:04:03.880 --> 00:04:06.670
The doctrine there from
Bohr and Heisenberg,

00:04:06.670 --> 00:04:09.910
two very famous
physicists, was one

00:04:09.910 --> 00:04:13.930
of what they called
causal nondeterminism.

00:04:13.930 --> 00:04:17.709
And their assertion was that
the world at its very most

00:04:17.709 --> 00:04:24.610
fundamental level behaves in
a way that you cannot predict.

00:04:24.610 --> 00:04:28.990
It's OK to make a statement that
x is highly likely to occur,

00:04:28.990 --> 00:04:33.430
almost certain to occur,
but for no case can

00:04:33.430 --> 00:04:36.310
you make a statement
x will occur.

00:04:36.310 --> 00:04:40.360
Nothing has a
probability of one.

00:04:40.360 --> 00:04:43.720
This is hard for us to
imagine today, when we all

00:04:43.720 --> 00:04:45.580
know quantum mechanics.

00:04:45.580 --> 00:04:50.320
But at the turn of the century,
this was a shocking statement.

00:04:50.320 --> 00:04:53.230
And two other very
well-known physicists,

00:04:53.230 --> 00:04:55.900
Albert Einstein and
Schrodinger, basically

00:04:55.900 --> 00:04:57.460
said, no, this is wrong.

00:04:57.460 --> 00:05:00.130
Bohr, Heisenberg,
you guys are idiots.

00:05:00.130 --> 00:05:01.570
It's just not true.

00:05:01.570 --> 00:05:03.670
They probably didn't
call them idiots.

00:05:03.670 --> 00:05:06.730
And this is most exemplified
by Einstein's famous quote

00:05:06.730 --> 00:05:11.230
that "God does not play dice,"
which is indicative of the fact

00:05:11.230 --> 00:05:13.990
that this was actually a
discussion that permeated

00:05:13.990 --> 00:05:19.570
not just the world of physics,
but society in general. People

00:05:19.570 --> 00:05:22.150
really turned it into
literally a religious issue,

00:05:22.150 --> 00:05:24.900
as did Einstein.

00:05:24.900 --> 00:05:26.940
Well, so now we should
ask the question,

00:05:26.940 --> 00:05:28.830
does it really matter?

00:05:28.830 --> 00:05:31.260
And to illustrate
that, I need two coins.

00:05:31.260 --> 00:05:33.900
I forgot to bring
any coins with me.

00:05:33.900 --> 00:05:35.840
Does anyone have a
coin they can lend me?

00:05:35.840 --> 00:05:37.301
AUDIENCE: I have some coins.

00:05:37.301 --> 00:05:39.900
JOHN GUTTAG: All right.

00:05:39.900 --> 00:05:42.300
Now, this is where I see how
much the students trust me.

00:05:42.300 --> 00:05:44.190
Do I get a penny?

00:05:44.190 --> 00:05:46.440
Do I get a silver dollar?

00:05:46.440 --> 00:05:47.460
So what do we got here?

00:05:50.500 --> 00:05:54.600
This is someone who's entrusting
me with quarters, not so bad.

00:05:57.500 --> 00:06:00.149
So we'll take these quarters,
and we'll shake them up,

00:06:00.149 --> 00:06:01.690
and we'll put them
down on the table.

00:06:04.240 --> 00:06:07.000
And now, we'll ask a question--

00:06:07.000 --> 00:06:13.140
do we have two heads, two
tails, or one head and one tail?

00:06:13.140 --> 00:06:17.220
So who thinks we have two heads?

00:06:17.220 --> 00:06:20.370
Who thinks we have two tails?

00:06:20.370 --> 00:06:23.230
Who thinks we have one of each?

00:06:23.230 --> 00:06:26.580
Well, clearly, everyone except
a few people-- for example,

00:06:26.580 --> 00:06:29.730
the Indians fan, who clearly
believes in the counterfactual--

00:06:33.030 --> 00:06:37.080
made the most
probabilistic decision.

00:06:37.080 --> 00:06:40.550
But in fact, there is
no nondeterminism here.

00:06:40.550 --> 00:06:43.040
I know the answer.

00:06:43.040 --> 00:06:47.600
And so in some sense,
it doesn't matter

00:06:47.600 --> 00:06:49.820
whether it's deterministic,
because in fact, it's

00:06:49.820 --> 00:06:52.070
not causally nondeterministic.

00:06:52.070 --> 00:06:58.120
The answer is quite clear,
but you don't know the answer.

00:06:58.120 --> 00:07:03.870
And so whether or not the world
is inherently unpredictable,

00:07:03.870 --> 00:07:08.760
the fact that we never have
complete knowledge of the world

00:07:08.760 --> 00:07:10.770
suggests that we
might as well treat

00:07:10.770 --> 00:07:15.130
it as inherently unpredictable.

00:07:15.130 --> 00:07:19.060
And so this is called
predictive nondeterminism.

00:07:19.060 --> 00:07:21.365
And this really is
what's going to underlie

00:07:21.365 --> 00:07:23.740
pretty much everything else
we're going to be doing here.

00:07:30.370 --> 00:07:34.000
No comments about that?

00:07:34.000 --> 00:07:37.150
I wouldn't do that to you.

00:07:37.150 --> 00:07:39.700
Thank you.

00:07:39.700 --> 00:07:42.260
I know you are wishing to
get interest on the money,

00:07:42.260 --> 00:07:44.140
but you don't get any.

00:07:44.140 --> 00:07:46.060
AUDIENCE: Was it heads or tails?

00:07:51.376 --> 00:07:52.500
JOHN GUTTAG: What was that?

00:07:56.160 --> 00:08:00.660
So when we think about
nondeterminism in computation,

00:08:00.660 --> 00:08:04.150
we use the term
stochastic process.

00:08:04.150 --> 00:08:07.020
And that's any
process that's ongoing

00:08:07.020 --> 00:08:12.180
in which the next state depends
upon the previous states

00:08:12.180 --> 00:08:14.800
and some random element.

00:08:14.800 --> 00:08:18.450
So typically up till now
when we've written code,

00:08:18.450 --> 00:08:20.890
what one line of code
did depended only

00:08:20.890 --> 00:08:23.260
on what the previous
lines of code did.

00:08:23.260 --> 00:08:25.810
There was no randomness.

00:08:25.810 --> 00:08:28.282
Here, we're going
to have randomness.

00:08:28.282 --> 00:08:29.740
And we can see the
difference if we

00:08:29.740 --> 00:08:34.450
look at these two
specifications of rolling a die.

00:08:34.450 --> 00:08:38.320
The first one, returns
an int between 1 and 6,

00:08:38.320 --> 00:08:41.890
is what I'll call
underdetermined.

00:08:41.890 --> 00:08:45.940
By that I mean you can't tell
what it's going to return.

00:08:45.940 --> 00:08:49.540
Maybe it will return a different
number each time you call it,

00:08:49.540 --> 00:08:51.700
but it's not required to.

00:08:51.700 --> 00:08:55.120
Maybe it will return three
every time you call it.

00:08:55.120 --> 00:08:58.690
The second specification
requires randomness.

00:08:58.690 --> 00:09:01.360
It says, it returns a
randomly chosen int.

00:09:01.360 --> 00:09:06.710
So it requires a
stochastic implementation.
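
NOTE A minimal sketch, not the lecture's slide code: the two
specifications contrasted above, written out in Python. The function
names are made up for illustration. Returning 3 on every call
satisfies the first specification but not the second.
```python
import random
def roll_die_underdetermined():
    """Returns an int between 1 and 6."""
    # Legal under this spec: nothing requires the value to vary.
    return 3
def roll_die_stochastic():
    """Returns a randomly chosen int between 1 and 6."""
    # A stochastic implementation: each face is equally likely.
    return random.choice([1, 2, 3, 4, 5, 6])
```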

00:09:06.710 --> 00:09:11.090
Let's look at how we implement
a random process in Python.

00:09:11.090 --> 00:09:15.890
We start by importing
the library random.

00:09:15.890 --> 00:09:17.520
This is not to
say you can import

00:09:17.520 --> 00:09:19.770
any random library you want.

00:09:19.770 --> 00:09:22.530
It's to say you import
the library called random.

00:09:22.530 --> 00:09:23.810
Let me get my pen out of here.

00:09:27.310 --> 00:09:29.230
And we'll use that a lot.

00:09:29.230 --> 00:09:32.590
And then we're going to use
the function in random called

00:09:32.590 --> 00:09:34.940
random.choice.

00:09:34.940 --> 00:09:39.530
It takes as an argument a
sequence, in this case a list,

00:09:39.530 --> 00:09:43.850
and randomly chooses
one member of the list.

00:09:43.850 --> 00:09:46.160
And it chooses it uniformly.

00:09:49.010 --> 00:09:52.930
It's a uniform distribution.

00:09:52.930 --> 00:09:56.860
And what that means is
that it's equally probable

00:09:56.860 --> 00:09:59.650
that it will choose any
number in that list each time

00:09:59.650 --> 00:10:01.690
you call it.

00:10:01.690 --> 00:10:03.820
We'll later look
at distributions

00:10:03.820 --> 00:10:06.700
that are not uniform,
not equally probable,

00:10:06.700 --> 00:10:08.320
where things are weighted.

00:10:08.320 --> 00:10:10.375
But here, it's quite
simple, it's just uniform.

00:10:13.470 --> 00:10:16.980
And then we can test
it using testRoll--

00:10:16.980 --> 00:10:21.930
takes some number n and
rolls the die that many times

00:10:21.930 --> 00:10:24.970
and creates a string
telling us what we got.
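
NOTE A minimal sketch of the code being described, reconstructed from
the spoken description because the slide is not part of this
transcript. testRoll is named in the lecture; the name rollDie, the
function bodies, the default n, and returning the string instead of
printing it are assumptions.
```python
import random
def rollDie():
    """Returns a randomly chosen int between 1 and 6."""
    # random.choice picks one element of the sequence uniformly.
    return random.choice([1, 2, 3, 4, 5, 6])
def testRoll(n=10):
    """Rolls the die n times and returns the rolls as one string."""
    result = ''
    for _ in range(n):
        result += str(rollDie())
    return result
print(testRoll(5))  # five digits, each between 1 and 6
```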

00:10:29.750 --> 00:10:36.300
So let's consider running this
on, say, testRoll of five.

00:10:36.300 --> 00:10:38.680
And we'll ask the
question, if we run it,

00:10:38.680 --> 00:10:43.180
how probable is it that it's
going to return a string

00:10:43.180 --> 00:10:43.960
of five 1's?

00:10:50.100 --> 00:10:51.120
How do we do that?

00:10:51.120 --> 00:10:54.420
Now, how many people
here are either in 6.041

00:10:54.420 --> 00:10:56.670
or have taken 6.041?

00:10:56.670 --> 00:10:59.280
Raise your hand.

00:10:59.280 --> 00:10:59.850
Oh, good.

00:10:59.850 --> 00:11:02.830
So very few of you
know probability.

00:11:02.830 --> 00:11:03.330
That helps.

00:11:06.450 --> 00:11:09.170
So how do we think
about that question?

00:11:09.170 --> 00:11:14.480
Well, probability, to me at
least, is all about counting,

00:11:14.480 --> 00:11:16.740
especially discrete
probability, which

00:11:16.740 --> 00:11:19.900
is what we're looking at here.

00:11:19.900 --> 00:11:23.830
What you do is you start by
counting the number of events

00:11:23.830 --> 00:11:29.710
that have the
property of interest

00:11:29.710 --> 00:11:31.480
and the number of
possible events

00:11:31.480 --> 00:11:32.940
and divide one by the other.

00:11:35.580 --> 00:11:41.430
So if we think about
rolling a die five times,

00:11:41.430 --> 00:11:44.070
we can enumerate all of
the possible outcomes

00:11:44.070 --> 00:11:44.885
of five rolls.

00:11:47.390 --> 00:11:50.870
So if we look at that,
what are the outcomes?

00:11:50.870 --> 00:11:54.150
Well, I could get five 1's.

00:11:54.150 --> 00:12:00.720
I could get four 1's and a 2
or four 1's and a 3, skip a few.

00:12:00.720 --> 00:12:05.040
The next one would be three 1's,
a 2 and a 1, then a 2 and 2,

00:12:05.040 --> 00:12:08.850
and finally, at
the end, all 6's.

00:12:08.850 --> 00:12:13.320
So remember, we
looked before, when

00:12:13.320 --> 00:12:17.160
we were looking at optimization
problems, at binary numbers.

00:12:17.160 --> 00:12:20.670
And we said we can look at all
the possible choices of items

00:12:20.670 --> 00:12:24.460
in the knapsack by a
vector of 0's and 1's.

00:12:24.460 --> 00:12:27.340
We said, how many possible
choices are there?

00:12:27.340 --> 00:12:30.200
Well, it depended on how
many binary numbers you could

00:12:30.200 --> 00:12:32.910
get in that number of digits.

00:12:32.910 --> 00:12:36.410
Well, here we're doing the same
thing, but instead of base 2,

00:12:36.410 --> 00:12:37.710
it's base 6.

00:12:40.590 --> 00:12:45.150
And so the number of possible
outcomes of five rolls

00:12:45.150 --> 00:12:45.990
is quite high.

00:12:48.760 --> 00:12:50.860
How many of those are five 1's?

00:12:50.860 --> 00:12:54.180
Only one of them, right?

00:12:54.180 --> 00:12:58.300
So in order to get the
probability of five 1's, I

00:12:58.300 --> 00:13:00.370
divide 1 by 6 to the fifth.

00:13:03.232 --> 00:13:06.720
Does that make
sense to everybody?

00:13:06.720 --> 00:13:10.565
So in fact, we see
it's highly unlikely.

00:13:10.565 --> 00:13:15.460
The probability of
five 1's is quite small.

00:13:15.460 --> 00:13:17.770
Now, suppose we were to
ask about the probability

00:13:17.770 --> 00:13:19.870
of something else--

00:13:19.870 --> 00:13:27.120
instead of five 1's, say 53421.

00:13:27.120 --> 00:13:31.230
It kind of looks more likely
than five 1's

00:13:31.230 --> 00:13:33.630
in a row, but of
course, it isn't, right?

00:13:33.630 --> 00:13:37.620
Any specific combination
is equally probable.

00:13:37.620 --> 00:13:40.420
And there are a lot of them.

00:13:40.420 --> 00:13:44.920
So all the probability
we're going to think about, we

00:13:44.920 --> 00:13:48.550
could think about this way, as
simply a matter of counting--

00:13:48.550 --> 00:13:51.640
the number of possible events,
the number of events that have

00:13:51.640 --> 00:13:54.970
the property of interest--
in this case being all 1's--

00:13:54.970 --> 00:13:56.680
and then simple division.
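
NOTE A minimal sketch of the counting argument, assuming a fair
six-sided die: enumerate every sequence of five rolls, count the ones
with the property of interest, and divide. Variable and function
names are illustrative.
```python
from itertools import product
faces = '123456'
# Every possible sequence of five rolls, as strings such as '11111' or '53421'.
outcomes = [''.join(rolls) for rolls in product(faces, repeat=5)]
num_outcomes = len(outcomes)  # 6**5 == 7776
def probability(goal):
    """Fraction of the enumerated outcomes that equal goal."""
    return outcomes.count(goal) / num_outcomes
print(num_outcomes)
print(probability('11111'))
print(probability('53421'))  # same value as for '11111'
```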

00:13:59.530 --> 00:14:03.010
Given that framework, there
are three basic facts

00:14:03.010 --> 00:14:07.870
about probability we're
going to be using a lot.

00:14:07.870 --> 00:14:15.980
So one, probabilities
always range from 0 to 1.

00:14:15.980 --> 00:14:17.460
How do we know that?

00:14:17.460 --> 00:14:19.930
Well, we've got a
fraction, right?

00:14:19.930 --> 00:14:25.190
And the denominator is
all possible events.

00:14:25.190 --> 00:14:29.840
The numerator is the subset
of that that's of interest.

00:14:29.840 --> 00:14:35.680
So it has to range from
0 to the denominator.

00:14:35.680 --> 00:14:37.330
And that tells us
that the fraction

00:14:37.330 --> 00:14:40.250
has to range from 0 to 1.

00:14:40.250 --> 00:14:43.430
So 1 says it's always
going to happen, 0 never.

00:14:46.870 --> 00:14:50.290
So if the probability of
an event occurring is p,

00:14:50.290 --> 00:14:54.250
what's the probability
of it not occurring?

00:14:54.250 --> 00:14:57.060
This follows from
the first bullet.

00:14:57.060 --> 00:15:04.050
It's simply going
to be 1 minus p.

00:15:04.050 --> 00:15:07.650
This is a trick that we'll
find we'll use a lot.

00:15:07.650 --> 00:15:09.660
Because it's often
the case when you

00:15:09.660 --> 00:15:13.080
want to compute the probability
of something happening,

00:15:13.080 --> 00:15:16.680
it's easier to compute the
probability of it not happening

00:15:16.680 --> 00:15:18.980
and subtract it from 1.

00:15:18.980 --> 00:15:21.560
And we'll see an example
of that later today.

00:15:24.550 --> 00:15:27.940
Now, here's the biggie.

00:15:27.940 --> 00:15:31.650
When events are
independent of each other,

00:15:31.650 --> 00:15:35.000
the probability of all
of the events occurring

00:15:35.000 --> 00:15:39.380
is equal to the product of
the probabilities of each

00:15:39.380 --> 00:15:40.885
of the events occurring.

00:15:44.280 --> 00:15:53.890
So if the probability of A is
0.5 and the probability of B

00:15:53.890 --> 00:16:01.150
is 0.4, the probability
of A and B is what?

00:16:06.110 --> 00:16:07.670
0.5 times 0.4.

00:16:07.670 --> 00:16:10.680
You guys can figure that out.

00:16:10.680 --> 00:16:14.330
I think that's 0.2.

00:16:14.330 --> 00:16:16.100
So you'd expect
that, that it should

00:16:16.100 --> 00:16:20.390
be much smaller than either of
the first two probabilities.

00:16:20.390 --> 00:16:22.060
This is the most
common rule, it's

00:16:22.060 --> 00:16:24.460
something we use all the
time in probabilities,

00:16:24.460 --> 00:16:28.360
the so-called
multiplicative law.

00:16:28.360 --> 00:16:33.120
We have to be careful
about it, however,

00:16:33.120 --> 00:16:37.470
in that it only holds if
the events are actually

00:16:37.470 --> 00:16:40.570
independent.

00:16:40.570 --> 00:16:44.920
Two events are independent
if the outcome of one

00:16:44.920 --> 00:16:47.110
has no influence on the
outcome of the other.

00:16:50.010 --> 00:16:52.370
So when we roll
the die, we assume

00:16:52.370 --> 00:16:54.350
that the first
roll, the outcome,

00:16:54.350 --> 00:16:55.870
was independent of the--

00:16:55.870 --> 00:16:58.370
or the second roll was
independent of the first roll,

00:16:58.370 --> 00:17:00.910
independent of the fourth roll.

00:17:00.910 --> 00:17:02.560
When we looked at
the two coins, we

00:17:02.560 --> 00:17:05.410
assume that heads and
tails of each coin

00:17:05.410 --> 00:17:08.460
was independent
of the other coin.

00:17:08.460 --> 00:17:10.200
I didn't, for example,
look at one coin

00:17:10.200 --> 00:17:12.304
and make sure that the
other one was different.

00:17:15.700 --> 00:17:19.079
The danger here is
that people often

00:17:19.079 --> 00:17:22.950
compute probabilities assuming
independence when you don't

00:17:22.950 --> 00:17:26.099
actually have independence.

00:17:26.099 --> 00:17:29.470
So let's look at an example.

00:17:29.470 --> 00:17:32.980
For those of you familiar
with American football,

00:17:32.980 --> 00:17:35.800
the New England Patriots
and the Denver Broncos

00:17:35.800 --> 00:17:38.380
are two prominent teams.

00:17:38.380 --> 00:17:40.660
And let's look at
computing the probability

00:17:40.660 --> 00:17:45.690
of whether one of them will
lose on a given Sunday.

00:17:45.690 --> 00:17:48.840
So the Patriots have a
winning percentage of 7 of 8--

00:17:48.840 --> 00:17:51.420
they've won 7 of
their 8 games so far--

00:17:51.420 --> 00:17:54.590
and the Broncos 6 of 8.

00:17:54.590 --> 00:17:57.560
The probability of both
winning next Sunday,

00:17:57.560 --> 00:18:00.860
assuming that this is
indicative of how good they are,

00:18:00.860 --> 00:18:03.470
we can get with the
multiplicative rule.

00:18:03.470 --> 00:18:08.750
So it's 7/8 times 6/8, or 42/64.

00:18:08.750 --> 00:18:12.060
We could simplify that
fraction, I suppose.

00:18:12.060 --> 00:18:14.370
Does that make sense?

00:18:14.370 --> 00:18:17.840
So this is probably a pretty
good estimate of both of them

00:18:17.840 --> 00:18:20.600
winning next Sunday.

00:18:20.600 --> 00:18:24.380
So the probability of at
least one of them losing

00:18:24.380 --> 00:18:27.740
is 1 minus that.

00:18:27.740 --> 00:18:30.430
So here's an example
of why we often use

00:18:30.430 --> 00:18:34.120
the 1 minus rule,
because we could

00:18:34.120 --> 00:18:38.020
compute the probability
of both of them

00:18:38.020 --> 00:18:41.440
winning by simply multiplying.

00:18:41.440 --> 00:18:44.130
And we subtract that from 1.
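
NOTE A minimal sketch of the arithmetic above, assuming the two games
are independent and that the win records are fair estimates of each
team's chance of winning. Variable names are illustrative.
```python
from fractions import Fraction
p_patriots_win = Fraction(7, 8)
p_broncos_win = Fraction(6, 8)
# Multiplicative law (valid only for independent events).
p_both_win = p_patriots_win * p_broncos_win
# Complement rule: "at least one loses" is the opposite of "both win".
p_at_least_one_loses = 1 - p_both_win
print(p_both_win)            # 21/32, the simplified form of 42/64
print(p_at_least_one_loses)  # 11/32, the simplified form of 22/64
```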

00:18:44.130 --> 00:18:47.220
However, what about
Sunday, December 18?

00:18:47.220 --> 00:18:50.440
What's the probability?

00:18:50.440 --> 00:18:53.920
Well, as it happens,
that day the Patriots

00:18:53.920 --> 00:18:55.025
are playing the Broncos.

00:18:58.380 --> 00:19:02.230
So now suddenly, the
outcomes are not independent.

00:19:02.230 --> 00:19:05.550
The probability of
one of them losing

00:19:05.550 --> 00:19:10.470
is influenced by the probability
of the other winning.

00:19:10.470 --> 00:19:13.540
So you would expect
the probability of one

00:19:13.540 --> 00:19:17.989
of them losing is much
closer to 1 than 22/64,

00:19:17.989 --> 00:19:18.780
which is about 1/3.

00:19:21.780 --> 00:19:25.490
So in this case, it's easy.

00:19:25.490 --> 00:19:28.430
But as we'll see, as we
get through the term,

00:19:28.430 --> 00:19:30.560
there are lots of
cases where you

00:19:30.560 --> 00:19:33.950
have to work pretty hard to
understand whether or not two

00:19:33.950 --> 00:19:36.350
events really are independent.

00:19:36.350 --> 00:19:40.410
And if you get it wrong, you
get a totally bogus answer.

00:19:40.410 --> 00:19:45.530
1/3 versus 1 is a
pretty big difference.

00:19:45.530 --> 00:19:49.010
By the way, as it happens,
the probability of the Broncos

00:19:49.010 --> 00:19:50.070
losing is about 1.

00:19:56.190 --> 00:19:58.400
Let's go look at some code.

00:20:01.040 --> 00:20:03.260
And we'll go back to
our dice, because it's

00:20:03.260 --> 00:20:05.420
much easier to
simulate dice games

00:20:05.420 --> 00:20:08.300
than it is to simulate
football games.

00:20:11.510 --> 00:20:13.340
So here it is.

00:20:13.340 --> 00:20:17.030
And we're going to talk
a lot about simulations.

00:20:17.030 --> 00:20:18.980
So here, rather than
rolling the die,

00:20:18.980 --> 00:20:20.480
I've written a program to do it.

00:20:23.980 --> 00:20:27.660
We've already seen the
code for rolling a die.

00:20:27.660 --> 00:20:32.990
And so to run this simulation,
typically what we're doing here

00:20:32.990 --> 00:20:35.480
is I'm giving you the goal--

00:20:35.480 --> 00:20:38.510
for example, are we
going to get five 1's--

00:20:38.510 --> 00:20:41.800
the number of trials--

00:20:41.800 --> 00:20:47.060
each trial, in this case,
will be say of length 5--

00:20:47.060 --> 00:20:48.770
so I'm going to
roll the same die

00:20:48.770 --> 00:20:55.130
five times say 1,000 different
times, and then just some text

00:20:55.130 --> 00:20:57.910
as to what I'm going to print.

00:20:57.910 --> 00:21:01.090
Almost all the
simulations we look at

00:21:01.090 --> 00:21:05.630
are going to start with lines
that look a lot like that.

00:21:05.630 --> 00:21:08.650
We're going to
initialize some variable.

00:21:08.650 --> 00:21:11.755
And then we're going to
run some number of trials.

00:21:16.160 --> 00:21:19.860
So in this case,
we're going to get

00:21:19.860 --> 00:21:21.340
from the length of the goal--

00:21:21.340 --> 00:21:23.790
so if the goal is
five 1's, then we're

00:21:23.790 --> 00:21:26.490
going to roll the dice five
times; if it's 10 runs,

00:21:26.490 --> 00:21:29.830
we'll roll it 10 times.

00:21:29.830 --> 00:21:35.310
So this is essentially
one trial, one attempt.

00:21:38.850 --> 00:21:41.850
And then we'll check
the result. And if it

00:21:41.850 --> 00:21:43.720
has the property we want--

00:21:43.720 --> 00:21:47.460
in this case, it's
equal to the goal--

00:21:47.460 --> 00:21:50.040
then we're going to
increment the total, which

00:21:50.040 --> 00:21:54.380
we initialized up here, by 1.

00:21:54.380 --> 00:21:57.170
So we'll keep track
with just the counting--

00:21:57.170 --> 00:22:01.610
the number of trials that
actually meet the goal.

00:22:01.610 --> 00:22:04.990
And then when we're done,
what we're going to do

00:22:04.990 --> 00:22:08.560
is divide the number
that met the goal

00:22:08.560 --> 00:22:10.870
by the number of trials--

00:22:10.870 --> 00:22:14.170
exactly the counting
argument we just looked at.

00:22:14.170 --> 00:22:19.700
And then we'll print the result.

00:22:19.700 --> 00:22:22.220
Almost every
simulation we look at

00:22:22.220 --> 00:22:24.360
is going to have this structure.

00:22:24.360 --> 00:22:27.680
There'll be an outer loop,
which is the number of trials.

00:22:27.680 --> 00:22:29.870
And then inside-- maybe
it'll have a loop,

00:22:29.870 --> 00:22:32.600
or maybe it won't--
will be a single trial.

00:22:32.600 --> 00:22:33.770
We'll sum up the results.

00:22:33.770 --> 00:22:36.920
And then we'll divide
by the number of trials.
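
NOTE A minimal sketch of the simulation structure just described,
reconstructed from the spoken description rather than copied from the
slide. The names runSim and rollDie, the signature, and returning the
estimate in addition to printing it are assumptions.
```python
import random
def rollDie():
    return random.choice([1, 2, 3, 4, 5, 6])
def runSim(goal, numTrials, txt):
    """Estimates the probability that len(goal) rolls spell out goal."""
    total = 0  # number of trials that met the goal
    for i in range(numTrials):  # outer loop: one pass per trial
        result = ''
        for j in range(len(goal)):  # inner loop: a single trial
            result += str(rollDie())
        if result == goal:
            total += 1
    estimate = total / numTrials
    print('Actual probability of', txt, '=', round(1 / 6**len(goal), 8))
    print('Estimated probability of', txt, '=', round(estimate, 8))
    return estimate
runSim('11111', 1000, '11111')
```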

00:22:36.920 --> 00:22:37.490
Let's run it.

00:22:45.300 --> 00:22:49.650
So a couple of things
are going to go on here.

00:22:49.650 --> 00:22:59.570
If you look at the code as
we've looked at it before,

00:22:59.570 --> 00:23:02.780
what you're seeing is I'm
computing the estimated

00:23:02.780 --> 00:23:05.180
probability by the simulation.

00:23:05.180 --> 00:23:08.270
And I'm comparing it to the
actual probability, which we've

00:23:08.270 --> 00:23:09.590
already seen how to compute.

00:23:12.117 --> 00:23:14.700
So if you look at it, there are
a couple of things to look at.

00:23:17.370 --> 00:23:19.260
The estimated
probability is pretty

00:23:19.260 --> 00:23:24.704
close to the actual
probability but not the same.

00:23:24.704 --> 00:23:26.590
So let's go back
to the PowerPoint.

00:23:31.860 --> 00:23:34.240
Here are the results.

00:23:34.240 --> 00:23:37.680
And there are at least
two questions raised

00:23:37.680 --> 00:23:40.050
by this result.
First of all, how

00:23:40.050 --> 00:23:43.290
did I know that this is
what would get printed?

00:23:43.290 --> 00:23:45.610
Remember, this is random.

00:23:45.610 --> 00:23:48.520
How did I know that the
estimate-- well, there's

00:23:48.520 --> 00:23:51.790
nothing random about
the actual probability.

00:23:51.790 --> 00:23:55.390
But how did I know that
the estimated probability

00:23:55.390 --> 00:23:57.180
would be 0?

00:23:57.180 --> 00:23:58.470
And why did it print it twice?

00:23:58.470 --> 00:24:00.330
Because I messed
up the PowerPoint.

00:24:00.330 --> 00:24:04.140
Any rate, so how do I know
what would get printed?

00:24:04.140 --> 00:24:12.610
Well a confession--
random.choice

00:24:12.610 --> 00:24:14.920
is not actually random.

00:24:14.920 --> 00:24:20.140
In fact, nothing we can do in
a computer is actually random.

00:24:20.140 --> 00:24:23.650
You can prove that it's
impossible to build

00:24:23.650 --> 00:24:28.950
a computer that actually
generates truly random numbers.

00:24:28.950 --> 00:24:32.520
What they do instead
is generate numbers

00:24:32.520 --> 00:24:34.050
that are called pseudorandom.

00:24:42.120 --> 00:24:44.740
How do they do that?

00:24:44.740 --> 00:24:48.930
They have an algorithm that
given one number generates

00:24:48.930 --> 00:24:52.700
the next number in a sequence.

00:24:52.700 --> 00:24:56.375
And they start that
algorithm with a seed.
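
NOTE A minimal sketch of the idea of a seeded generator, not of what
Python actually does: Python's random module uses the Mersenne
Twister, while this is a much simpler linear congruential generator.
The class name TinyLCG is made up, and the constants are the widely
published Numerical Recipes ones.
```python
class TinyLCG:
    """Toy pseudorandom generator: each number is computed from the last."""
    def __init__(self, seed):
        self.state = seed
    def next_int(self):
        # The next state is a fixed function of the current state.
        self.state = (1664525 * self.state + 1013904223) % 2**32
        return self.state
    def roll_die(self):
        # Use the high-order bits, which are better mixed than the low ones.
        return (self.next_int() >> 16) % 6 + 1
a = TinyLCG(seed=0)
b = TinyLCG(seed=0)
# Same seed, same sequence.
print([a.roll_die() for _ in range(5)] == [b.roll_die() for _ in range(5)])  # True
```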

00:25:00.050 --> 00:25:02.630
Now, typically,
they get that seed

00:25:02.630 --> 00:25:05.930
by reading the clock
of the computer.

00:25:05.930 --> 00:25:08.090
So most computers have
a clock that, say,

00:25:08.090 --> 00:25:12.080
keeps track of the number of
microseconds since January 1,

00:25:12.080 --> 00:25:14.174
1970.

00:25:14.174 --> 00:25:15.590
I don't know if
that's still true.

00:25:15.590 --> 00:25:18.590
That's what Unix used to do.

00:25:18.590 --> 00:25:22.070
So the notion is, you
start your program,

00:25:22.070 --> 00:25:26.420
there's no way of knowing how
many microseconds have elapsed.

00:25:26.420 --> 00:25:29.395
And so you're getting a random
number to start the process.

00:25:32.040 --> 00:25:33.660
Since you don't know
where it starts,

00:25:33.660 --> 00:25:34.800
you don't know what
the second number

00:25:34.800 --> 00:25:37.050
is, you don't know what the
third number is, you don't

00:25:37.050 --> 00:25:38.580
know what the fourth number is.

00:25:38.580 --> 00:25:42.570
And so it's predictively
nondeterministic,

00:25:42.570 --> 00:25:46.600
because you don't know what
the seed is going to be.

00:25:46.600 --> 00:25:49.180
Now, you can imagine
that this makes

00:25:49.180 --> 00:25:52.460
programs really hard to debug.

00:25:52.460 --> 00:25:55.850
Every time you run it, something
different could happen.

00:25:55.850 --> 00:25:59.220
Now, we'll see often you want
them to be unpredictable.

00:25:59.220 --> 00:26:02.300
But for now, we want them to
be predictable, makes it easier

00:26:02.300 --> 00:26:04.130
to prepare PowerPoint.

00:26:04.130 --> 00:26:08.635
So what you have is a command.

00:26:13.040 --> 00:26:19.190
You can call random.seed
and give it a value

00:26:19.190 --> 00:26:21.800
and say, I don't want you to
just choose some random seed,

00:26:21.800 --> 00:26:24.890
I want you to use 0 as the seed.

00:26:24.890 --> 00:26:27.530
For the same seed, you
always get the same sequence

00:26:27.530 --> 00:26:30.120
of random values.

00:26:30.120 --> 00:26:33.410
And so what I've done is I
set the seed to be, I think, 0

00:26:33.410 --> 00:26:36.620
in this case, not because
there's anything magic about 0,

00:26:36.620 --> 00:26:38.780
it's just sort of habit.

00:26:38.780 --> 00:26:41.540
But it made it predictable.

00:26:41.540 --> 00:26:43.640
As you write programs
with randomness

00:26:43.640 --> 00:26:45.980
in and when you're debugging
it, you will almost surely

00:26:45.980 --> 00:26:49.550
want to start by setting
random.seed to a value

00:26:49.550 --> 00:26:51.590
so you get the same answer.

00:26:51.590 --> 00:26:54.950
But make sure you debug it with
more than one value of this,

00:26:54.950 --> 00:26:58.320
so you didn't just get
lucky with your seed.

00:26:58.320 --> 00:27:01.460
So that's how I knew
what would get printed.

00:27:01.460 --> 00:27:06.480
The next question is,
why did the simulation

00:27:06.480 --> 00:27:09.670
give me the wrong answer?

00:27:09.670 --> 00:27:14.530
The actual probability
is 0.0001286.

00:27:14.530 --> 00:27:16.630
But it's estimated
a probability of 0.

00:27:19.150 --> 00:27:20.140
Why is it wrong?

00:27:24.200 --> 00:27:27.100
Well, let's think about this.

00:27:27.100 --> 00:27:30.020
I ran 1,000 trials.

00:27:30.020 --> 00:27:32.430
What does it mean to say
the probability is zero?

00:27:32.430 --> 00:27:36.670
It means that I tried it 1,000
times and didn't ever get

00:27:36.670 --> 00:27:39.380
a sequence of five 1's.

00:27:39.380 --> 00:27:44.500
So the numerator of the
division at the bottom was 0.

00:27:44.500 --> 00:27:46.150
Hence, the answer is 0.

00:27:46.150 --> 00:27:47.890
Is this surprising?

00:27:47.890 --> 00:27:49.440
Well, no.

00:27:49.440 --> 00:27:54.200
Because if that's the actual
probability of getting five

00:27:54.200 --> 00:27:58.075
1's, it's not very shocking
that in 1,000 trials

00:27:58.075 --> 00:27:58.825
it never happened.

00:28:02.260 --> 00:28:06.140
It's not a surprising
result. And so we

00:28:06.140 --> 00:28:09.230
have to be careful when we
run these things to understand

00:28:09.230 --> 00:28:14.250
the difference between what's in
this case an actual probability

00:28:14.250 --> 00:28:17.510
and what statisticians
call a sample probability.

00:28:25.530 --> 00:28:28.970
So what we got with
the sample was 0.

00:28:28.970 --> 00:28:32.740
So what's the
obvious thing to do?

00:28:32.740 --> 00:28:35.590
If you're doing a
simulation of an event

00:28:35.590 --> 00:28:39.020
and the event is
pretty rare, you

00:28:39.020 --> 00:28:43.520
want to try it on a very
large number of trials.

00:28:43.520 --> 00:28:45.050
So let's go back to our code.

00:28:51.350 --> 00:28:58.720
And we'll change it to
instead of 1,000, 1,000,000.

00:28:58.720 --> 00:29:01.572
You can see up here, by the
way, where I set the seed.

00:29:01.572 --> 00:29:02.840
And now, let's run it.

00:29:17.760 --> 00:29:19.650
We did a lot better.

00:29:19.650 --> 00:29:22.470
If we look at here our
estimated probability,

00:29:22.470 --> 00:29:25.980
it's 0.000128,
still not quite

00:29:25.980 --> 00:29:30.142
the actual probability
but darn close.

00:29:30.142 --> 00:29:31.600
And maybe if I had
done 10 million,

00:29:31.600 --> 00:29:32.891
it would have been even closer.

00:29:35.610 --> 00:29:38.040
So if you're
writing a simulation

00:29:38.040 --> 00:29:41.130
to compute the
probability of an event

00:29:41.130 --> 00:29:44.040
and the event is
moderately rare,

00:29:44.040 --> 00:29:47.310
then you better
run a lot of trials

00:29:47.310 --> 00:29:51.750
before you believe your
estimated probability.
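
NOTE
A hedged sketch of the experiment described above: estimating the chance
of rolling five 1's in a row with a fair die, first with 1,000 trials and
then with 1,000,000. The function name estimate_five_ones is hypothetical,
and the lecture's own code may be organized differently.
```python
import random
def estimate_five_ones(num_trials, seed=0):
    """Fraction of trials in which five consecutive fair rolls are all 1."""
    random.seed(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(num_trials):
        if all(random.randint(1, 6) == 1 for _ in range(5)):
            hits += 1
    return hits / num_trials
# Closed-form answer: each roll is a 1 with probability 1/6.
actual = (1 / 6) ** 5
print('actual   ', round(actual, 7))
for n in (1000, 1000000):
    print(n, 'trials:', estimate_five_ones(n))
```
With only 1,000 trials the expected number of hits is about 0.13, so an
estimate of exactly 0 is unsurprising.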

00:29:51.750 --> 00:29:55.440
In a week or so, we'll
actually look at that more

00:29:55.440 --> 00:29:57.810
mathematically and
say, what is a lot,

00:29:57.810 --> 00:29:59.130
how do we know what is enough.

00:30:12.110 --> 00:30:13.550
What are the morals here?

00:30:13.550 --> 00:30:15.430
Moral one, I've just told you--

00:30:15.430 --> 00:30:18.950
takes a lot of trials to get a
good estimate of the frequency

00:30:18.950 --> 00:30:21.510
of a rare event.

00:30:21.510 --> 00:30:26.470
Moral two, we should always,
if we're getting an estimated

00:30:26.470 --> 00:30:29.290
probability, know
that, and probably

00:30:29.290 --> 00:30:33.570
say that, and not confuse it
with the actual probability.

00:30:33.570 --> 00:30:36.400
The third moral here
is, it was kind of

00:30:36.400 --> 00:30:38.830
stupid to do a simulation.

00:30:38.830 --> 00:30:42.430
Since it was a very
simple closed-form answer

00:30:42.430 --> 00:30:45.550
that we could compute
that would really tell us

00:30:45.550 --> 00:30:48.220
what the actual
probability is, why even

00:30:48.220 --> 00:30:51.550
bother with the simulation?

00:30:51.550 --> 00:30:53.880
Well, we're going
to see why now,

00:30:53.880 --> 00:30:57.340
because simulations
can be very useful.

00:30:57.340 --> 00:31:00.390
Let's look at another problem.

00:31:00.390 --> 00:31:02.070
This is the famous
birthday problem.

00:31:02.070 --> 00:31:03.660
Some of you have seen it.

00:31:03.660 --> 00:31:06.240
What's the probability of at
least two people in a group

00:31:06.240 --> 00:31:08.770
having the same birthday?

00:31:08.770 --> 00:31:10.600
There's a URL at the bottom.

00:31:10.600 --> 00:31:12.760
That's pointing
to a Google form.

00:31:12.760 --> 00:31:15.940
I'd like please all of you
who have a computing device

00:31:15.940 --> 00:31:20.100
to go to it and fill
out your birthday.

00:31:20.100 --> 00:31:22.942
It's anonymous, so we won't know
how old you are, don't worry.

00:31:22.942 --> 00:31:24.150
Actually, it's only the date.

00:31:24.150 --> 00:31:25.290
It's not the year.

00:31:27.880 --> 00:31:33.870
So suppose there were 367
people in the group, roughly

00:31:33.870 --> 00:31:40.680
the number of people who
took the 6.0001 600 midterm.

00:31:40.680 --> 00:31:44.070
If there are 367 people, what's
the probability of at least two

00:31:44.070 --> 00:31:45.230
of them sharing a birthday?

00:31:49.790 --> 00:31:54.110
One, by something called
the pigeonhole principle.

00:31:54.110 --> 00:31:56.000
You got some number of holes.

00:31:56.000 --> 00:31:57.800
And if you have more
pigeons than holes,

00:31:57.800 --> 00:32:01.430
two pigeons have
to share a hole.

00:32:01.430 --> 00:32:04.040
What about smaller numbers?

00:32:04.040 --> 00:32:07.430
Well, if we make a
simplifying assumption

00:32:07.430 --> 00:32:10.650
that each birthdate
is equally likely,

00:32:10.650 --> 00:32:13.970
then there's actually a nice
closed-form solution for it.

00:32:17.760 --> 00:32:20.730
Again, this is a question
where it's easier

00:32:20.730 --> 00:32:24.210
to compute the opposite
of what you're trying

00:32:24.210 --> 00:32:26.670
to do and subtract it from 1.

00:32:26.670 --> 00:32:32.160
And so this fraction is giving
the probability of two people

00:32:32.160 --> 00:32:35.190
not sharing a birthday.

00:32:35.190 --> 00:32:38.560
The proof that this is right,
it's a little bit elaborate.

00:32:38.560 --> 00:32:42.450
But you can trust
me, it's accurate.

00:32:42.450 --> 00:32:46.150
But it's a formula, and it's
not that complicated a formula.

00:32:46.150 --> 00:32:49.800
So numbers like 366
factorial are big.
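
NOTE
A sketch of the standard closed form under the equal-likelihood assumption,
with 366 possible dates: 1 - 366! / (366**N * (366 - N)!). The function
name actual_shared_prob and the num_days parameter are assumptions made
for illustration. Python integers are arbitrary precision, so the large
factorials are computed exactly before the final division.
```python
import math
def actual_shared_prob(num_people, num_days=366):
    """Probability that at least two of num_people share a birthday,
    assuming every one of num_days dates is equally likely."""
    if num_people > num_days:
        return 1.0  # pigeonhole principle: a shared date is certain
    numerator = math.factorial(num_days)
    denominator = num_days ** num_people * math.factorial(num_days - num_people)
    # The fraction is the chance that all birthdays are different.
    return 1 - numerator / denominator
for n in (10, 20, 40, 100):
    print(n, actual_shared_prob(n))
```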

00:32:55.240 --> 00:32:57.460
So let's approximate a solution.

00:32:57.460 --> 00:33:00.940
We'll write a simulation and
see if we get the same answer

00:33:00.940 --> 00:33:03.920
that that formula gave us.

00:33:03.920 --> 00:33:05.200
So here's the code for that--

00:33:07.810 --> 00:33:09.550
two arguments-- the
number of people

00:33:09.550 --> 00:33:14.780
in the group and the
number that we're asking, do

00:33:14.780 --> 00:33:17.520
they have the same birthday.

00:33:17.520 --> 00:33:21.120
So since I'm assuming for now
that every birthday is equally

00:33:21.120 --> 00:33:26.100
likely, the possible
dates range from 1 to 366,

00:33:26.100 --> 00:33:28.005
because some years
have a February 29.

00:33:31.200 --> 00:33:35.490
I'll keep track of the number
of people born in each date

00:33:35.490 --> 00:33:38.640
by starting with none.

00:33:38.640 --> 00:33:41.470
And then for p in the
range of number of people,

00:33:41.470 --> 00:33:45.240
I'll make a random choice
of the possible dates

00:33:45.240 --> 00:33:49.999
and increment that
element of the list by 1.

00:33:49.999 --> 00:33:51.540
And then at the end,
we can say, look

00:33:51.540 --> 00:33:54.330
at the maximum
number of birthdays

00:33:54.330 --> 00:33:59.560
and see if it's greater than
or equal to the number of same.

00:33:59.560 --> 00:34:01.240
So that tells us that.

00:34:04.490 --> 00:34:07.220
And then we can actually look
at the birthday problem--

00:34:07.220 --> 00:34:09.640
number of people, the number
of same, and, as usual,

00:34:09.640 --> 00:34:10.514
the number of trials.

00:34:13.750 --> 00:34:17.840
So the number of hits is 0 for
t in range number of trials.

00:34:17.840 --> 00:34:21.940
If sameDate is true, then
we'll increment the number

00:34:21.940 --> 00:34:28.590
of hits by 1 and then as usual
divide by the number of trials.
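
NOTE
A minimal sketch of the two functions just described, assuming dates are
numbered 1 to 366 and are equally likely. The snake_case names same_date
and birthday_prob are stand-ins; the slide code may differ in detail.
```python
import random
def same_date(num_people, num_same):
    """True if at least num_same of num_people share a randomly drawn date."""
    possible_dates = range(1, 367)  # 366 dates, allowing for February 29
    counts = [0] * 367              # counts[d] = people born on date d
    for _ in range(num_people):
        counts[random.choice(possible_dates)] += 1
    return max(counts) >= num_same
def birthday_prob(num_people, num_same, num_trials):
    """Estimated probability: fraction of trials in which same_date is true."""
    hits = 0
    for _ in range(num_trials):
        if same_date(num_people, num_same):
            hits += 1
    return hits / num_trials
random.seed(0)
for n in (10, 20, 40, 100):
    print(n, 'people:', birthday_prob(n, 2, 2000))
```
Because num_same is a parameter, asking about three people sharing a
birthday only requires passing 3 instead of 2.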

00:34:28.590 --> 00:34:34.739
And we'll try it for 10,
20, 40, and 100 people.

00:34:37.310 --> 00:34:41.480
And then just, we'll print
the estimated probability

00:34:41.480 --> 00:34:46.429
and the actual
probability computed using

00:34:46.429 --> 00:34:48.320
that formula I showed you.

00:34:48.320 --> 00:34:50.600
I have not shown you,
but I've imported

00:34:50.600 --> 00:34:53.480
a library called
math, because it

00:34:53.480 --> 00:34:55.040
has a factorial implementation.

00:34:55.040 --> 00:34:56.900
It's way faster than
the recursive one

00:34:56.900 --> 00:35:00.270
that we've seen before.

00:35:00.270 --> 00:35:00.880
Let's run it.

00:35:23.920 --> 00:35:25.040
And we'll see what we get.

00:35:25.040 --> 00:35:30.580
So for 10, the estimated
probability is 0.11 now.

00:35:30.580 --> 00:35:36.720
So you can see, the estimates
are really pretty good.

00:35:36.720 --> 00:35:39.450
Once again, we have this
business that for 100,

00:35:39.450 --> 00:35:43.450
we're estimating 1, when the
real answer is point many,

00:35:43.450 --> 00:35:45.150
many 9's.

00:35:45.150 --> 00:35:47.400
But again, this is
sample probability.

00:35:47.400 --> 00:35:53.250
It just means in the number
of trials we did, every one

00:35:53.250 --> 00:35:56.190
for 100 people, there
was a shared birthday.

00:35:56.190 --> 00:35:59.010
This is a number that
usually surprises people,

00:35:59.010 --> 00:36:03.690
as to why with 100 people
the probability is so high.

00:36:03.690 --> 00:36:06.990
But we could work out
the formula and see it.

00:36:06.990 --> 00:36:08.460
And as you can
see, the estimates

00:36:08.460 --> 00:36:10.930
are pretty good
from my simulation.

00:36:20.252 --> 00:36:22.210
Now, we're going to see
why we did a simulation

00:36:22.210 --> 00:36:23.720
in the first place.

00:36:23.720 --> 00:36:27.970
Suppose we want the probability
of three people sharing

00:36:27.970 --> 00:36:29.260
a birthday instead of two.

00:36:34.030 --> 00:36:37.240
It's pretty easy to see how
we changed the simulation.

00:36:37.240 --> 00:36:38.980
I even made a parameter.

00:36:38.980 --> 00:36:42.190
I just changed the
number 2 to number 3.

00:36:42.190 --> 00:36:45.200
The math, on the
other hand, is ugly.

00:36:48.030 --> 00:36:52.190
Why is the math so much
uglier for 3 than for 2?

00:36:52.190 --> 00:36:55.400
Because for 2, the
complementary problem--

00:36:55.400 --> 00:36:58.040
the number we're
subtracting from 1--

00:36:58.040 --> 00:37:03.640
is simply the question of,
are all birthdays different?

00:37:03.640 --> 00:37:08.170
So "did two people share a
birthday" is 1 minus the probability that

00:37:08.170 --> 00:37:11.570
everybody has
a different birthday.

00:37:11.570 --> 00:37:16.250
On the other hand, for 3 people,
the complementary problem is

00:37:16.250 --> 00:37:19.490
a complicated disjunct--
a bunch of ors--

00:37:19.490 --> 00:37:22.190
either all birthdays
are distinct,

00:37:22.190 --> 00:37:26.240
or two people share a birthday
and the rest are distinct,

00:37:26.240 --> 00:37:30.140
or there are two groups of
two people sharing a birthday

00:37:30.140 --> 00:37:31.970
and everything else is distinct.

00:37:31.970 --> 00:37:36.450
So you can see here, there's
a lot of possibilities.

00:37:36.450 --> 00:37:40.800
And so it's 1 minus now a
very complicated formula.

00:37:40.800 --> 00:37:42.840
And in fact, if you try
and look how to do this,

00:37:42.840 --> 00:37:45.450
most people will tell
you don't bother.

00:37:45.450 --> 00:37:48.490
Here's kind of a
good approximation.

00:37:48.490 --> 00:37:50.320
But the math gets very hairy.

00:37:53.040 --> 00:37:57.160
In contrast, changing the
simulation is dead easy.

00:37:57.160 --> 00:37:57.880
We can do that.

00:38:03.808 --> 00:38:06.280
Whoops.

00:38:06.280 --> 00:38:13.650
So if we come over here for
the code, all I have to do

00:38:13.650 --> 00:38:15.075
is change this 2 to a 3.

00:38:25.090 --> 00:38:27.190
And I'm going to leave
in this code, which

00:38:27.190 --> 00:38:31.180
is the wrong code, computing
the actual probability now

00:38:31.180 --> 00:38:35.110
for 2 people sharing rather
than 3, because I want

00:38:35.110 --> 00:38:37.660
to make it easy for you to see
the difference between what

00:38:37.660 --> 00:38:41.260
happens when we look at 3
shared rather than 2 shared.

00:38:53.140 --> 00:38:55.980
And I get invalid syntax.

00:38:55.980 --> 00:38:58.766
That's not good.

00:38:58.766 --> 00:39:00.640
That's what happens when
I type in real time.

00:39:07.970 --> 00:39:10.010
Why do I have invalid syntax?

00:39:10.010 --> 00:39:11.337
AUDIENCE: Line 56.

00:39:11.337 --> 00:39:12.170
JOHN GUTTAG: Pardon.

00:39:12.170 --> 00:39:13.631
AUDIENCE: Line 56.

00:39:13.631 --> 00:39:15.170
JOHN GUTTAG: One person, Anna.

00:39:15.170 --> 00:39:17.660
AUDIENCE: Line 56,
there's a comma.

00:39:17.660 --> 00:39:20.532
JOHN GUTTAG: Oh.

00:39:20.532 --> 00:39:21.490
That's not a good line.

00:39:32.960 --> 00:39:40.410
So now, we see that if we get,
say, to n equals 100, for 2,

00:39:40.410 --> 00:39:42.530
you'll remember, it was 0.99.

00:39:42.530 --> 00:39:46.000
But for 3, it's only 0.63.

00:39:46.000 --> 00:39:49.590
So we see going from two
sharing to three sharing

00:39:49.590 --> 00:39:54.930
gets us a radically different
answer, not surprisingly.

00:39:54.930 --> 00:39:57.240
But we also-- and the real
thing I wanted you to see--

00:39:57.240 --> 00:39:59.310
is how easy it was to
answer this question

00:39:59.310 --> 00:40:01.810
with the simulation.

00:40:01.810 --> 00:40:05.940
And that's a primary
reason we use simulations

00:40:05.940 --> 00:40:09.000
to answer probabilistic
questions rather

00:40:09.000 --> 00:40:11.190
than sitting down with
pencil and paper

00:40:11.190 --> 00:40:14.460
and doing fancy
probability calculations,

00:40:14.460 --> 00:40:19.300
because it's often way
easier to do a simulation.

00:40:19.300 --> 00:40:22.220
We can see that in spades if
we look at the next question.

00:40:26.680 --> 00:40:28.210
Let's think about
this assumption

00:40:28.210 --> 00:40:31.270
that all birthdays
are equally likely.

00:40:31.270 --> 00:40:33.370
Well, as you can
see, this is a chart

00:40:33.370 --> 00:40:38.440
of how common birthdates
are in the US, a heat map.

00:40:38.440 --> 00:40:44.820
And you'll see, for
example, that February 29

00:40:44.820 --> 00:40:47.930
is quite an uncommon birthday.

00:40:47.930 --> 00:40:52.010
So we should probably
treat that differently.

00:40:52.010 --> 00:40:53.480
Somewhat surprisingly,
you'll see

00:40:53.480 --> 00:40:57.160
that July 4 is a very
uncommon birthday as well.

00:40:57.160 --> 00:41:00.410
It's easy to understand
why February 29.

00:41:00.410 --> 00:41:02.570
The only thing I can
figure out for July 4

00:41:02.570 --> 00:41:06.230
is obstetricians don't
like working on holidays.

00:41:06.230 --> 00:41:08.300
And so they induce
labor sometime

00:41:08.300 --> 00:41:10.790
around the 2nd or
the 3rd, so they

00:41:10.790 --> 00:41:14.420
don't have to come to work
on the 4th or the 5th.

00:41:14.420 --> 00:41:15.680
Sounds like a horrible thought.

00:41:15.680 --> 00:41:19.952
But I can't think of any other
explanation for this anomaly.

00:41:19.952 --> 00:41:21.410
You'll probably,
if you look at it,

00:41:21.410 --> 00:41:25.580
see Christmas day is
not so common either.

00:41:25.580 --> 00:41:27.170
So now, the question,
which we can

00:41:27.170 --> 00:41:29.120
answer, since you've
all filled out this form,

00:41:29.120 --> 00:41:32.810
is how exceptional
are MIT students?

00:41:32.810 --> 00:41:35.930
We like to think that you're
different in every respect.

00:41:35.930 --> 00:41:38.960
So are your birthdays
distributed differently

00:41:38.960 --> 00:41:40.830
than other dates?

00:41:40.830 --> 00:41:43.220
Have we got that data?

00:41:43.220 --> 00:41:44.890
So now we'll go look at that.

00:41:49.180 --> 00:41:50.920
We should have a heat
map for you guys.

00:41:53.900 --> 00:41:54.400
This one?

00:41:54.400 --> 00:41:56.850
AUDIENCE: Yep.

00:41:56.850 --> 00:41:59.300
I removed all the February 31.

00:41:59.300 --> 00:42:02.240
Thank you for those submissions.

00:42:02.240 --> 00:42:05.910
[LAUGHTER]

00:42:06.525 --> 00:42:08.750
JOHN GUTTAG: So here it is.

00:42:08.750 --> 00:42:13.310
And we can see that,
well, they don't

00:42:13.310 --> 00:42:17.790
seem to be banded quite as
much in the summer months,

00:42:17.790 --> 00:42:20.370
probably says more about your
parents than it does about you.

00:42:23.030 --> 00:42:26.090
But you can see that,
indeed, we do have--

00:42:26.090 --> 00:42:28.280
wow, we have a day
where there are

00:42:28.280 --> 00:42:30.110
five birthdays, that look like?

00:42:30.110 --> 00:42:30.620
Or no?

00:42:30.620 --> 00:42:32.556
AUDIENCE: February 12.

00:42:32.556 --> 00:42:33.355
JOHN GUTTAG: Wow.

00:42:36.654 --> 00:42:39.070
You want to raise your hand
if you're born on February 12?

00:42:42.388 --> 00:42:45.800
[LAUGHTER]

00:42:46.670 --> 00:42:51.902
So you are exceptional in that
you lie about when you're born.

00:42:51.902 --> 00:42:57.470
But if you hadn't lied, I
think we would have still seen

00:42:57.470 --> 00:42:59.450
the probabilities would hold.

00:42:59.450 --> 00:43:03.155
How many people were
there, do we know?

00:43:03.155 --> 00:43:07.865
AUDIENCE: 146 with
112 unique birthdays.

00:43:07.865 --> 00:43:12.190
JOHN GUTTAG: 146 people,
112 unique birthdays.

00:43:12.190 --> 00:43:16.220
So indeed, the
probability does work.

00:43:26.470 --> 00:43:28.990
So we know you're
exceptional in a funny way.

00:43:28.990 --> 00:43:32.240
Well, you can
imagine how hard it

00:43:32.240 --> 00:43:36.080
would be to adjust the
analytic model to account

00:43:36.080 --> 00:43:40.370
for a weird distribution
of birthdates.

00:43:40.370 --> 00:43:44.900
But again, adjusting the
simulation model is easy.

00:43:44.900 --> 00:43:46.700
I could have gone
back to that heat

00:43:46.700 --> 00:43:49.670
map I showed you of
birthdays in the US

00:43:49.670 --> 00:43:52.550
and gotten a separate
probability for each day,

00:43:52.550 --> 00:43:55.130
but I was too lazy.

00:43:55.130 --> 00:44:01.220
And instead, what I observed
was that we had a few days,

00:44:01.220 --> 00:44:06.950
like February 29, highly
unlikely, and this band

00:44:06.950 --> 00:44:10.040
in the middle of people
who were conceived

00:44:10.040 --> 00:44:13.670
in the late fall
and early winter.

00:44:13.670 --> 00:44:19.950
So what I did is I
duplicated some dates.

00:44:19.950 --> 00:44:25.565
So the 58th day of the year,
February 29, occurs only once.

00:44:28.750 --> 00:44:30.590
The dates before
that, I said, let's

00:44:30.590 --> 00:44:32.730
pretend they occur four times.

00:44:32.730 --> 00:44:34.590
What matters
here is not how often

00:44:34.590 --> 00:44:36.450
they occur but the
relative frequency.

00:44:40.700 --> 00:44:46.000
And then the dates after
that occur four times

00:44:46.000 --> 00:44:49.480
except for the dates in
that band, which is going

00:44:49.480 --> 00:44:52.180
to occur yet more often.

00:44:52.180 --> 00:44:56.000
So now-- and don't worry
about the exact details here--

00:44:56.000 --> 00:44:58.840
but what I'm doing is simply
adjusting the simulation

00:44:58.840 --> 00:45:02.140
to change the probability
of each date getting

00:45:02.140 --> 00:45:04.378
chosen by sameDate.

00:45:07.190 --> 00:45:09.170
And then I can run
the simulation model.

00:45:09.170 --> 00:45:13.450
And, again, with a very
small change to code,

00:45:13.450 --> 00:45:15.900
I've modeled something
that's mathematically

00:45:15.900 --> 00:45:18.360
enormously complex.

00:45:18.360 --> 00:45:22.050
I have no idea how to
actually do this probability

00:45:22.050 --> 00:45:23.670
mathematically.

00:45:23.670 --> 00:45:27.046
But the code is, as you can
see, quite straightforward.
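
NOTE
A hedged sketch of the duplicated-dates idea described above. Repeating a
date in the list makes random.choice pick it proportionally more often.
The weights of 4, the day numbering from 0, and the band range(180, 270)
are assumptions chosen for illustration, not a fit to real birth data.
```python
import random
def weighted_dates(rare_day=58, band=range(180, 270)):
    """List of dates in which repetition encodes relative frequency."""
    dates = 4 * list(range(0, rare_day)) + [rare_day]  # rare day appears once
    dates += 4 * list(range(rare_day + 1, 366))        # ordinary days: 4 copies
    dates += 4 * list(band)                            # band days: 4 more copies
    return dates
def same_date(num_people, num_same, possible_dates):
    counts = [0] * 366
    for _ in range(num_people):
        counts[random.choice(possible_dates)] += 1
    return max(counts) >= num_same
def birthday_prob(num_people, num_same, num_trials, possible_dates):
    hits = sum(same_date(num_people, num_same, possible_dates)
               for _ in range(num_trials))
    return hits / num_trials
random.seed(0)
print('uniform ', birthday_prob(100, 3, 2000, list(range(366))))
print('weighted', birthday_prob(100, 3, 2000, weighted_dates()))
```
Only the list of possible dates changes between the two runs; the
simulation logic itself is untouched.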

00:45:33.850 --> 00:45:35.460
So let's go to that here.

00:45:39.090 --> 00:45:45.450
So what I'm going to do
is comment this one out

00:45:45.450 --> 00:46:02.660
and uncomment this more
complicated set of dates

00:46:02.660 --> 00:46:03.500
and see what we get.

00:46:14.020 --> 00:46:16.240
And again, it changes
quite dramatically.

00:46:16.240 --> 00:46:18.240
You might remember, before
it was around I think

00:46:18.240 --> 00:46:23.460
0.6-something for 100,
and now, it's 0.75.

00:46:23.460 --> 00:46:26.460
So getting away from the notion
that birthdays are uniformly

00:46:26.460 --> 00:46:28.710
distributed to saying
some birthdays are

00:46:28.710 --> 00:46:32.010
more common than others,
again, dramatically changes

00:46:32.010 --> 00:46:34.570
the answer.

00:46:34.570 --> 00:46:36.589
And we can easily look at that.

00:46:43.080 --> 00:46:49.730
So that gets us to the big
topic of simulation models.

00:46:49.730 --> 00:46:52.820
It's a program that
describes a computation that

00:46:52.820 --> 00:46:57.830
provides information about the
possible behaviors of a system.

00:46:57.830 --> 00:47:00.050
I say possible
behaviors, because I'm

00:47:00.050 --> 00:47:02.835
particularly interested
in stochastic systems.

00:47:05.720 --> 00:47:10.350
They're descriptive not
prescriptive in the sense

00:47:10.350 --> 00:47:13.740
that they describe
the possible outcomes.

00:47:13.740 --> 00:47:18.800
They don't tell you how to
achieve possible outcomes.

00:47:18.800 --> 00:47:20.720
This is different
from what we've

00:47:20.720 --> 00:47:22.550
looked at earlier in
the course, where we

00:47:22.550 --> 00:47:25.700
looked at optimization models.

00:47:25.700 --> 00:47:30.440
So an optimization
model is prescriptive.

00:47:30.440 --> 00:47:33.800
It tells you how to
achieve an effect,

00:47:33.800 --> 00:47:38.000
how to get the most value
out of your knapsack,

00:47:38.000 --> 00:47:42.350
how to find the shortest
path from A to B in a graph.

00:47:42.350 --> 00:47:44.750
In contrast, a
simulation model says,

00:47:44.750 --> 00:47:48.170
if I do this,
here's what happens.

00:47:48.170 --> 00:47:52.290
It doesn't tell you how to
make something happen.

00:47:52.290 --> 00:47:53.970
So it's very
different, and it's why

00:47:53.970 --> 00:47:57.390
we need both, why we
need optimization models

00:47:57.390 --> 00:48:00.570
and we need simulation models.

00:48:00.570 --> 00:48:03.750
We have to remember that
a simulation model is only

00:48:03.750 --> 00:48:06.570
an approximation to reality.

00:48:06.570 --> 00:48:10.110
I put in an approximation to
the distribution of birthdates,

00:48:10.110 --> 00:48:12.910
but it wasn't quite right.

00:48:12.910 --> 00:48:16.770
And as the very famous
statistician George Box said,

00:48:16.770 --> 00:48:22.320
"all models are wrong, but
some are actually very useful."

00:48:22.320 --> 00:48:27.930
In the next lecture, we'll look
at a useful class of models.

00:48:27.930 --> 00:48:30.610
When do we use simulations?

00:48:30.610 --> 00:48:33.310
Typically, as we've just
shown, to model systems that

00:48:33.310 --> 00:48:37.180
are mathematically intractable,
like the birthday problem

00:48:37.180 --> 00:48:39.740
we just looked at.

00:48:39.740 --> 00:48:43.130
In other situations, to
extract intermediate results--

00:48:43.130 --> 00:48:47.660
something happens along
the way to the answer.

00:48:47.660 --> 00:48:50.410
And, as I hope you've
seen, simulations

00:48:50.410 --> 00:48:55.480
are used because we can play
what if games by successively

00:48:55.480 --> 00:48:57.340
refining them.

00:48:57.340 --> 00:48:59.230
We started with a
simple simulation

00:48:59.230 --> 00:49:01.960
that assumed that we only
asked the question of, do

00:49:01.960 --> 00:49:04.540
two people share a birthday.

00:49:04.540 --> 00:49:08.080
We showed how we could change
it to ask do three people share

00:49:08.080 --> 00:49:10.020
a birthday.

00:49:10.020 --> 00:49:11.910
We then saw that
we could change it

00:49:11.910 --> 00:49:16.260
to assume a different
distribution of birthdates

00:49:16.260 --> 00:49:18.620
in the group.

00:49:18.620 --> 00:49:20.520
And so we can start
with something simple.

00:49:20.520 --> 00:49:23.310
And we get it ever
more complex

00:49:23.310 --> 00:49:25.415
to answer what-if questions.

00:49:29.510 --> 00:49:32.030
We're going to start
in the next lecture

00:49:32.030 --> 00:49:36.680
by producing a simulation
of a random walk.

00:49:36.680 --> 00:49:38.120
And with that, I'll stop.

00:49:38.120 --> 00:49:40.840
And see you guys soon.