1
00:00:05,115 --> 00:00:06,210
PROFESSOR: Hi.
2
00:00:06,210 --> 00:00:08,920
Today I'd like to introduce
a new module.
3
00:00:08,920 --> 00:00:11,270
In previous modules we've talked
about how to model
4
00:00:11,270 --> 00:00:14,870
particular systems, how to make
predictions about certain
5
00:00:14,870 --> 00:00:18,010
kinds and classifications of
systems, and also how to
6
00:00:18,010 --> 00:00:21,070
design both theoretical
and physical systems.
7
00:00:21,070 --> 00:00:24,040
If our systems are operating in
a deterministic universe,
8
00:00:24,040 --> 00:00:25,200
then we're all set.
9
00:00:25,200 --> 00:00:27,210
We've accounted for all the
things we could possibly
10
00:00:27,210 --> 00:00:28,150
account for.
11
00:00:28,150 --> 00:00:30,110
But if we're making systems that
are going to operate in
12
00:00:30,110 --> 00:00:34,540
the real world, then we need to
be able to deal with some
13
00:00:34,540 --> 00:00:36,600
level of uncertainty.
14
00:00:36,600 --> 00:00:38,430
Today I'm going to talk about
probability, which is the
15
00:00:38,430 --> 00:00:40,320
method by which we're
going to model
16
00:00:40,320 --> 00:00:41,930
uncertainty in our world.
17
00:00:41,930 --> 00:00:43,860
And later we're going to talk
about different strategies we
18
00:00:43,860 --> 00:00:47,260
can take to deal with that
uncertainty to hopefully
19
00:00:47,260 --> 00:00:50,160
increase the level of autonomy
of our systems as they operate
20
00:00:50,160 --> 00:00:51,410
in the real world.
21
00:00:56,000 --> 00:00:58,960
The first thing I need to talk
about is how to properly model
22
00:00:58,960 --> 00:01:01,800
probability, or how to
probably talk about
23
00:01:01,800 --> 00:01:05,239
probability such that we can
use it to talk about
24
00:01:05,239 --> 00:01:07,490
uncertainty.
25
00:01:07,490 --> 00:01:10,410
When you're talking about
probability, typically you'll
26
00:01:10,410 --> 00:01:13,180
end up talking about the sample
space or U. This refers
27
00:01:13,180 --> 00:01:18,020
to your entire universe, where
your universe is the values
28
00:01:18,020 --> 00:01:21,670
that you care about, or the
possible assigned values of
29
00:01:21,670 --> 00:01:23,450
the variables that
you care about.
30
00:01:27,700 --> 00:01:30,070
I'm already talking about
variables, but what I really
31
00:01:30,070 --> 00:01:31,320
mean to say is events.
32
00:01:34,280 --> 00:01:36,550
There are different states that
your universe can take,
33
00:01:36,550 --> 00:01:40,740
and those states can be sort of
exhaustively enumerated or
34
00:01:40,740 --> 00:01:45,720
be atomic, or they
can be factored
35
00:01:45,720 --> 00:01:46,970
into different variables.
36
00:01:49,210 --> 00:01:50,460
Let me elaborate on
this a little.
37
00:01:54,920 --> 00:01:59,140
If I were talking about four
coin flips, if those events
38
00:01:59,140 --> 00:02:02,850
were specified atomically then
I would have to exhaustively
39
00:02:02,850 --> 00:02:07,480
specify all possible sequences
of four coin flips.
40
00:02:07,480 --> 00:02:13,450
4 heads, 3 heads and 2 tail, 2
heads and 2 tails, that sort
41
00:02:13,450 --> 00:02:15,080
of exercise.
42
00:02:15,080 --> 00:02:26,810
Or flipping 4 heads would be
one event, flipping 3 heads
43
00:02:26,810 --> 00:02:31,160
and then 1 tail would be a
separate event, flipping 2
44
00:02:31,160 --> 00:02:35,760
heads 1 tail and another
head would be a
45
00:02:35,760 --> 00:02:39,120
third, distinct event.
46
00:02:39,120 --> 00:02:42,270
If we're talking about a
factored state space, then
47
00:02:42,270 --> 00:02:44,530
we'll have a different variable
representing each one
48
00:02:44,530 --> 00:02:47,110
of these coin flips.
49
00:02:47,110 --> 00:02:52,940
And each one of those variables
will say, here is
50
00:02:52,940 --> 00:02:56,230
the value associated with that
particular sub-event.
51
00:02:56,230 --> 00:02:58,240
And the accumulation of all
those values, or the
52
00:02:58,240 --> 00:03:01,450
specification for all those
values, will end up referring
53
00:03:01,450 --> 00:03:07,460
to the same states that
were addressed
54
00:03:07,460 --> 00:03:08,710
by the atomic events.
55
00:03:12,430 --> 00:03:15,040
Let me talk about variables,
because they're the thing that
56
00:03:15,040 --> 00:03:17,780
allow us to not exhaustively
enumerate every possible event
57
00:03:17,780 --> 00:03:24,450
in the universe, and talk
about our universe in
58
00:03:24,450 --> 00:03:29,290
meaningful ways, or in ways that
they can be effectively
59
00:03:29,290 --> 00:03:30,540
communicated.
60
00:03:32,390 --> 00:03:34,320
And why am I talking about
random variables?
61
00:03:34,320 --> 00:03:35,710
Why aren't they just
variables?
62
00:03:35,710 --> 00:03:37,985
Well, random variables is the
way that you specify the fact
63
00:03:37,985 --> 00:03:40,920
that you're talking about
probabilistic variables.
64
00:03:40,920 --> 00:03:43,940
When you're just dealing with
regular algebra, variables
65
00:03:43,940 --> 00:03:50,230
have some sort of assigned
value, and you're not forced
66
00:03:50,230 --> 00:03:51,720
to be in the space of 0 to 1.
67
00:03:51,720 --> 00:03:54,320
When you're talking about from
0 to 1, when you're talking
68
00:03:54,320 --> 00:03:58,260
about probabilities, you're
forced to be in the spaces of
69
00:03:58,260 --> 00:04:01,860
0 to 1 and forced to remain
within the reals.
70
00:04:05,250 --> 00:04:08,900
If you want to talk about all
the possible assigned values
71
00:04:08,900 --> 00:04:11,520
of that random variable, then
you're talking about the
72
00:04:11,520 --> 00:04:15,950
distribution over that
random variable.
73
00:04:15,950 --> 00:04:22,590
This means that A could be
anything that A could be.
74
00:04:22,590 --> 00:04:25,070
You're talking about the
function that says give me a
75
00:04:25,070 --> 00:04:27,780
value of A, and I will tell you
a probability associated
76
00:04:27,780 --> 00:04:29,990
with that value of A.
77
00:04:29,990 --> 00:04:32,040
If you're talking about the
probability of A being
78
00:04:32,040 --> 00:04:35,370
assigned to a particular value,
then you'll return out
79
00:04:35,370 --> 00:04:37,040
the probability.
80
00:04:37,040 --> 00:04:46,380
If A represents the color of the
shirt I'm wearing, and I
81
00:04:46,380 --> 00:04:48,990
want to know the probability
that A is some value, then I
82
00:04:48,990 --> 00:04:50,750
look at the color of the shirt
I'm wearing and try to
83
00:04:50,750 --> 00:04:52,300
determine whether or
not it's 1 or 0.
84
00:04:52,300 --> 00:04:54,970
Or possibly look at the colors
of shirts that I've worn over
85
00:04:54,970 --> 00:05:03,330
the past year and make some sort
of ratio of the number of
86
00:05:03,330 --> 00:05:05,320
pink shirts I've worn
over the past year.
87
00:05:12,100 --> 00:05:14,660
If I'm dealing with a factored
state space, I'm going to end
88
00:05:14,660 --> 00:05:18,410
up talking about more than
one random variable.
89
00:05:18,410 --> 00:05:20,440
There are two main ways to
talk about more than one
90
00:05:20,440 --> 00:05:24,260
random variable at
the same time.
91
00:05:24,260 --> 00:05:27,690
One is joint probabilities,
where all the random variables
92
00:05:27,690 --> 00:05:30,650
are collectively specified
at the same time.
93
00:05:30,650 --> 00:05:33,210
And the other is conditional
probabilities, where you've
94
00:05:33,210 --> 00:05:37,590
already decided that you
specified some value for one
95
00:05:37,590 --> 00:05:38,670
or more random variables.
96
00:05:38,670 --> 00:05:41,520
And then within that scope
you're going to talk about the
97
00:05:41,520 --> 00:05:43,635
probabilities associated with
other random variables.
98
00:05:48,110 --> 00:05:50,470
I want to demonstrate this
graphically, but there are two
99
00:05:50,470 --> 00:05:52,770
more things that I need
to mention right now.
100
00:05:55,470 --> 00:05:58,080
One is the difference between
frequentist and Bayesian
101
00:05:58,080 --> 00:05:59,330
interpretations of
probability.
102
00:06:02,760 --> 00:06:05,800
Right now they don't seem
particularly relevant, but
103
00:06:05,800 --> 00:06:08,260
people will use these words,
and it's good to know,
104
00:06:08,260 --> 00:06:11,480
approximately, what they're
talking about.
105
00:06:11,480 --> 00:06:16,760
The frequentist interpretation
of probability is more
106
00:06:16,760 --> 00:06:21,280
relevant when you're talking
about actions that happen a
107
00:06:21,280 --> 00:06:22,580
lot of different times.
108
00:06:22,580 --> 00:06:28,930
For instance, how frequently
it rains.
109
00:06:28,930 --> 00:06:31,760
If I say that today is a
Wednesday, and I want to know
110
00:06:31,760 --> 00:06:36,170
the probability that it rains
on Wednesdays, then that
111
00:06:36,170 --> 00:06:38,810
probability is open to
frequentist interpretation
112
00:06:38,810 --> 00:06:41,720
because there are a lot of
Wednesdays, and it's rained a
113
00:06:41,720 --> 00:06:44,580
lot in the universe of
Wednesdays or the space of
114
00:06:44,580 --> 00:06:47,390
possible Wednesdays.
115
00:06:47,390 --> 00:06:49,750
So thinking about the fact that
there's a 70% chance of
116
00:06:49,750 --> 00:06:52,230
rain, or a 30% chance of rain,
or whatever probability of
117
00:06:52,230 --> 00:06:56,220
rain there is on Wednesdays,
makes sense, or is open to
118
00:06:56,220 --> 00:06:57,470
frequentist interpretation.
119
00:07:02,040 --> 00:07:04,410
The other interpretation that
you'll hear talked about with
120
00:07:04,410 --> 00:07:07,120
respect to probabilities is the
Bayesian interpretation.
121
00:07:07,120 --> 00:07:10,390
Bayesian interpretation tends
to be more relevant when
122
00:07:10,390 --> 00:07:12,960
you're talking about spaces
that are more atomic as
123
00:07:12,960 --> 00:07:17,540
opposed to factored, or
represent events that are
124
00:07:17,540 --> 00:07:19,910
specified to the point that it
does not make sense to talk
125
00:07:19,910 --> 00:07:25,540
about them in the frequentist
sense.
126
00:07:25,540 --> 00:07:27,320
When we talk about probabilities
in the Bayesian
127
00:07:27,320 --> 00:07:32,100
interpretation, we frequently
use the term likelihood.
128
00:07:32,100 --> 00:07:34,610
For instance, if I'm talking
about whether or not it's
129
00:07:34,610 --> 00:07:40,980
likely to rain on August 24,
2011 in the afternoon, the
130
00:07:40,980 --> 00:07:45,580
specificity of that event is so
high that at that point it
131
00:07:45,580 --> 00:07:48,440
doesn't make sense to talk
about the frequency of
132
00:07:48,440 --> 00:07:52,610
Wednesday, August 24, 2011
in the afternoons,
133
00:07:52,610 --> 00:07:55,280
at least for now.
134
00:07:55,280 --> 00:07:57,600
At that point we're talking more
about likelihood and less
135
00:07:57,600 --> 00:08:00,280
about frequency.
136
00:08:00,280 --> 00:08:04,185
That event is more conducive
to Bayesian interpretation.
137
00:08:09,540 --> 00:08:11,350
No whirlwind tour of probability
would be complete
138
00:08:11,350 --> 00:08:12,950
without a mention of the three
axioms of probability.
139
00:08:16,200 --> 00:08:19,610
The first axiom is that the
likelihood of the universe
140
00:08:19,610 --> 00:08:23,870
happening is one, or all random
variables are going to
141
00:08:23,870 --> 00:08:27,580
be specified at some point.
142
00:08:27,580 --> 00:08:33,320
Relatedly, the likelihood of
nothing happening is 0.
143
00:08:33,320 --> 00:08:38,640
What these two really do is
establish the boundaries of a
144
00:08:38,640 --> 00:08:40,090
graphical representation
up here.
145
00:08:42,830 --> 00:08:45,680
The other axiom of probability
is that if you're going to be
146
00:08:45,680 --> 00:08:49,100
talking about the union of two
events, or the probability
147
00:08:49,100 --> 00:08:52,150
associated with one or the
other event having an
148
00:08:52,150 --> 00:08:58,450
assignment, then you're talking
about the probability
149
00:08:58,450 --> 00:09:04,140
of one of those variables and
the probability of the other
150
00:09:04,140 --> 00:09:08,880
variable, and then removing
the section that you've
151
00:09:08,880 --> 00:09:10,560
double-counted.
152
00:09:10,560 --> 00:09:12,940
So if I were to attempt to
demonstrate this on this
153
00:09:12,940 --> 00:09:17,400
graph, I would be talking about
the probability of A,
154
00:09:17,400 --> 00:09:20,700
added the probability of B.
And at this point I've
155
00:09:20,700 --> 00:09:22,250
double-counted this section
the middle where
156
00:09:22,250 --> 00:09:23,940
they're both one.
157
00:09:23,940 --> 00:09:25,820
So I would subtract
this exactly once.
158
00:09:25,820 --> 00:09:27,840
And then I'm talking about
the size of this space.
159
00:09:34,640 --> 00:09:36,420
If I'm talking about the
probability of A being equal
160
00:09:36,420 --> 00:09:42,910
to 1, then I'm talking about
the space in which A is
161
00:09:42,910 --> 00:09:47,580
specified to be equal to 1,
divided by the area associated
162
00:09:47,580 --> 00:09:48,830
with my universe.
163
00:10:02,540 --> 00:10:10,220
When I'm talking about joint
distributions, I'm going to
164
00:10:10,220 --> 00:10:14,780
find the space in which both
of these things are true,
165
00:10:14,780 --> 00:10:17,210
scoped to the size of
the universe, or the
166
00:10:17,210 --> 00:10:20,600
entire sample space.
167
00:10:20,600 --> 00:10:25,220
So in this space both A and B
are true, and I'm looking at
168
00:10:25,220 --> 00:10:26,940
it relative to the size
of the universe.
169
00:10:43,800 --> 00:10:47,680
In contrast, when I'm talking
about conditional probability,
170
00:10:47,680 --> 00:10:58,090
or the probability of A given B,
I'm going to look at where
171
00:10:58,090 --> 00:11:03,940
my specifications are true,
scoped to where
172
00:11:03,940 --> 00:11:06,690
my givens are true.
173
00:11:06,690 --> 00:11:13,460
So if I'm already dealing in the
space restricted to B, I'm
174
00:11:13,460 --> 00:11:15,955
just looking at this size,
or this space.
175
00:11:19,660 --> 00:11:22,140
But because I'm scoped to B, I'm
going to end up dividing
176
00:11:22,140 --> 00:11:53,710
by the area of B, instead
of the area of U.
177
00:11:53,710 --> 00:11:55,420
This is the main difference
between the joint and
178
00:11:55,420 --> 00:11:56,450
conditional distributions.
179
00:11:56,450 --> 00:11:59,470
And a lot of people get hung
up on it, which is why I'm
180
00:11:59,470 --> 00:12:01,055
exhaustively walking
through it.
181
00:12:04,870 --> 00:12:06,790
A couple of other things before
we talk about the first
182
00:12:06,790 --> 00:12:09,930
way in which we can use our
models for uncertainty to do
183
00:12:09,930 --> 00:12:12,730
some amount of addressing of the
fact that we're going to
184
00:12:12,730 --> 00:12:14,010
have to deal with uncertainty
in the future.
185
00:12:18,380 --> 00:12:20,710
If we start off with a joint
distribution and we want to
186
00:12:20,710 --> 00:12:22,690
reduce the number of variables
that we're actually talking
187
00:12:22,690 --> 00:12:26,340
about, we can do so by
exhaustively walking through
188
00:12:26,340 --> 00:12:29,980
all the possible assigned values
for the variable that
189
00:12:29,980 --> 00:12:35,600
we want to disregard, and then
summing up the values
190
00:12:35,600 --> 00:12:36,850
appropriately.
191
00:12:40,670 --> 00:12:44,600
An easy example for this is if
we had the joint probabilities
192
00:12:44,600 --> 00:12:46,980
of all the colors of shirts that
I wear and all the colors
193
00:12:46,980 --> 00:12:49,360
of pants that I wear, and we
only wanted to talk about all
194
00:12:49,360 --> 00:12:54,130
the colors of shirts that
I wear, then we could
195
00:12:54,130 --> 00:12:57,900
exhaustively cover all the
different colors of pants that
196
00:12:57,900 --> 00:13:01,870
I wear and accumulate all the
different values of shirts
197
00:13:01,870 --> 00:13:11,140
that I wear simultaneously,
and then collect that
198
00:13:11,140 --> 00:13:12,390
distribution.
199
00:13:15,080 --> 00:13:19,410
Related is the concept of total
probability, which is
200
00:13:19,410 --> 00:13:22,020
the same kind of accumulation.
201
00:13:22,020 --> 00:13:27,480
It just operates in the
conditional space, as opposed
202
00:13:27,480 --> 00:13:30,890
to in the joint space.
203
00:13:30,890 --> 00:13:36,770
So in this case, if I'm already
operating in A given B
204
00:13:36,770 --> 00:13:45,690
land, I have to scope myself
back out to the space of the
205
00:13:45,690 --> 00:13:49,670
universe by then accounting for
the fact that I've only
206
00:13:49,670 --> 00:13:58,290
been operating in the scope of
B. Then exhaustively enumerate
207
00:13:58,290 --> 00:14:00,740
all the possible values of
B, sum the probabilities
208
00:14:00,740 --> 00:14:05,880
associated with those values,
and then I've reduced the
209
00:14:05,880 --> 00:14:07,390
number of dimensions that
I'm talking about.
210
00:14:12,840 --> 00:14:14,435
The final thing we have to talk
about before we move on
211
00:14:14,435 --> 00:14:16,720
to state estimation is
Bayes' evidence.
212
00:14:16,720 --> 00:14:18,840
Or you've probably seen this
demonstrated as Bayes' rule.
213
00:14:25,850 --> 00:14:30,980
If I want to talk about B scoped
to A, and all I have is
214
00:14:30,980 --> 00:14:38,320
A scoped to B, B and A. When
we walked through total
215
00:14:38,320 --> 00:14:42,930
probability, we saw that the
conditional probability
216
00:14:42,930 --> 00:14:45,570
multiplied by the probability
associated with the variable
217
00:14:45,570 --> 00:14:48,310
that we're conditioning on,
is equal to the joint
218
00:14:48,310 --> 00:14:50,710
probability.
219
00:14:50,710 --> 00:14:56,640
When we multiply P of A given
B by P of B, we're going to
220
00:14:56,640 --> 00:15:00,390
end up with the joint
probability associated with
221
00:15:00,390 --> 00:15:01,900
the two variables.
222
00:15:01,900 --> 00:15:05,440
When we then divide back out
by A or scope our joint
223
00:15:05,440 --> 00:15:10,510
probability to A, we end up
talking about conditional
224
00:15:10,510 --> 00:15:13,790
probabilities again, which is
where B given A comes from.
225
00:15:18,280 --> 00:15:20,160
Bayes' evidence, or Bayes'
rule, is the basis for
226
00:15:20,160 --> 00:15:22,810
inference, which is going to be
really important for state
227
00:15:22,810 --> 00:15:24,380
estimation, which we'll
cover next time.