1
00:00:00,000 --> 00:00:00,040
2
00:00:00,040 --> 00:00:02,460
The following content is
provided under a Creative
3
00:00:02,460 --> 00:00:03,870
Commons license.
4
00:00:03,870 --> 00:00:06,910
Your support will help MIT
OpenCourseWare continue to
5
00:00:06,910 --> 00:00:10,560
offer high quality educational
resources for free.
6
00:00:10,560 --> 00:00:13,460
To make a donation or view
additional materials from
7
00:00:13,460 --> 00:00:19,290
hundreds of MIT courses, visit
MIT OpenCourseWare at
8
00:00:19,290 --> 00:00:22,410
ocw.mit.edu
9
00:00:22,410 --> 00:00:25,430
PROFESSOR: So we're going to
finish today our discussion of
10
00:00:25,430 --> 00:00:28,870
Bayesian Inference, which
we started last time.
11
00:00:28,870 --> 00:00:32,960
As you probably saw, there's
not a whole lot of concepts
12
00:00:32,960 --> 00:00:37,370
that we're introducing at this
point in terms of specific
13
00:00:37,370 --> 00:00:39,770
skills of calculating
probabilities.
14
00:00:39,770 --> 00:00:44,040
But, rather, it's more of an
interpretation and setting up
15
00:00:44,040 --> 00:00:45,460
the framework.
16
00:00:45,460 --> 00:00:48,010
So the framework in Bayesian
estimation is that there is
17
00:00:48,010 --> 00:00:52,500
some parameter which is not
known, but we have a prior
18
00:00:52,500 --> 00:00:53,550
distribution on it.
19
00:00:53,550 --> 00:01:00,040
These are beliefs about what
this variable might be, and
20
00:01:00,040 --> 00:01:02,370
then we'll obtain some
measurements.
21
00:01:02,370 --> 00:01:05,410
And the measurements are
affected by the value of that
22
00:01:05,410 --> 00:01:07,560
parameter that we don't know.
23
00:01:07,560 --> 00:01:12,490
And this effect, the fact that
X is affected by Theta, is
24
00:01:12,490 --> 00:01:15,970
captured by introducing a
conditional probability
25
00:01:15,970 --> 00:01:16,660
distribution--
26
00:01:16,660 --> 00:01:19,590
the distribution of X
depends on Theta.
27
00:01:19,590 --> 00:01:22,270
It's a conditional probability
distribution.
28
00:01:22,270 --> 00:01:26,280
So we have formulas for these
two densities, the prior
29
00:01:26,280 --> 00:01:28,330
density and the conditional
density.
30
00:01:28,330 --> 00:01:31,110
And given that we have these,
if we multiply them we can
31
00:01:31,110 --> 00:01:34,000
also get the joint density
of X and Theta.
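As a sketch of the pipeline just described, here is a tiny discrete version in Python. The prior and likelihood numbers are made up for illustration; they are not from the lecture.

```python
# Bayesian setup on a discrete grid: multiply the prior p(theta) by
# the conditional p(x | theta) to get the joint, then normalize over
# theta to get the posterior for an observed x.  Values are made up.
prior = {1: 0.5, 2: 0.5}                    # p(theta)
likelihood = {(0, 1): 0.8, (0, 2): 0.2,     # p(x | theta)
              (1, 1): 0.2, (1, 2): 0.8}

def posterior(x, prior, likelihood):
    """Posterior p(theta | X = x) via joint = prior * conditional."""
    joint = {t: prior[t] * likelihood[(x, t)] for t in prior}
    total = sum(joint.values())
    return {t: j / total for t, j in joint.items()}

post = posterior(0, prior, likelihood)   # p(theta | X = 0)
```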
32
00:01:34,000 --> 00:01:35,940
So we have everything
there is to
33
00:01:35,940 --> 00:01:37,450
know at this point.
34
00:01:37,450 --> 00:01:41,650
And now we observe the random
variable X. Given this random
35
00:01:41,650 --> 00:01:44,400
variable what can we
say about Theta?
36
00:01:44,400 --> 00:01:48,380
Well, what we can do is we
can always calculate the
37
00:01:48,380 --> 00:01:52,600
conditional distribution of
theta given X. And now that we
38
00:01:52,600 --> 00:01:55,990
have the specific value of
X we can plot this as
39
00:01:55,990 --> 00:01:58,650
a function of Theta.
40
00:01:58,650 --> 00:01:59,150
OK.
41
00:01:59,150 --> 00:02:01,380
And this is the complete
answer to a
42
00:02:01,380 --> 00:02:02,990
Bayesian Inference problem.
43
00:02:02,990 --> 00:02:06,130
This posterior distribution
captures everything there is
44
00:02:06,130 --> 00:02:10,240
to say about Theta, that's
what we know about Theta.
45
00:02:10,240 --> 00:02:13,330
Given the X that we have
observed Theta is still
46
00:02:13,330 --> 00:02:15,080
random, it's still unknown.
47
00:02:15,080 --> 00:02:18,270
And it might be here, there,
or there with several
48
00:02:18,270 --> 00:02:19,900
probabilities.
49
00:02:19,900 --> 00:02:22,780
On the other hand, if you want
to report a single value for
50
00:02:22,780 --> 00:02:27,590
Theta then you do
some extra work.
51
00:02:27,590 --> 00:02:31,430
You continue from here, and you
do some data processing on
52
00:02:31,430 --> 00:02:35,360
X. Doing data processing means
that you apply a certain
53
00:02:35,360 --> 00:02:39,000
function on the data,
and this function is
54
00:02:39,000 --> 00:02:40,650
something that you design.
55
00:02:40,650 --> 00:02:42,930
It's the so-called estimator.
56
00:02:42,930 --> 00:02:46,460
And once that function is
applied it outputs an estimate
57
00:02:46,460 --> 00:02:50,760
of Theta, which we
call Theta hat.
58
00:02:50,760 --> 00:02:53,490
So this is sort of the big
picture of what's happening.
59
00:02:53,490 --> 00:02:55,880
Now one thing to keep in mind
is that even though I'm
60
00:02:55,880 --> 00:03:00,450
writing single letters here, in
general Theta or X could be
61
00:03:00,450 --> 00:03:02,030
vector random variables.
62
00:03:02,030 --> 00:03:03,540
So think of this--
63
00:03:03,540 --> 00:03:08,170
it could be a collection
Theta1, Theta2, Theta3.
64
00:03:08,170 --> 00:03:11,570
And maybe we obtained several
measurements, so this X is
65
00:03:11,570 --> 00:03:15,630
really a vector X1,
X2, up to Xn.
66
00:03:15,630 --> 00:03:20,190
All right, so now how do we
choose a Theta to report?
67
00:03:20,190 --> 00:03:21,960
There are various ways
of doing it.
68
00:03:21,960 --> 00:03:25,280
One is to look at the posterior
distribution and
69
00:03:25,280 --> 00:03:29,940
report the value of Theta, at
which the density or the PMF
70
00:03:29,940 --> 00:03:31,990
is highest.
71
00:03:31,990 --> 00:03:35,570
This is called the maximum
a posteriori estimate.
72
00:03:35,570 --> 00:03:38,770
So we pick a value of theta for
which the posteriori is
73
00:03:38,770 --> 00:03:40,990
maximum, and we report it.
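The MAP rule just described can be sketched in a few lines. The posterior values below are hypothetical, purely for illustration.

```python
# MAP estimation sketch: report the theta at which the posterior
# p(theta | x) is highest.  These posterior values are made up.
posterior = {2.0: 0.1, 3.0: 0.25, 4.0: 0.4, 5.0: 0.25}

def map_estimate(posterior):
    """Return the theta with the largest posterior probability."""
    return max(posterior, key=posterior.get)

theta_map = map_estimate(posterior)
```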
74
00:03:40,990 --> 00:03:46,030
An alternative way is to try to
be optimal with respect to
75
00:03:46,030 --> 00:03:48,500
a mean squared error.
76
00:03:48,500 --> 00:03:49,410
So what is this?
77
00:03:49,410 --> 00:03:53,260
If we have a specific estimator,
g, this is the
78
00:03:53,260 --> 00:03:55,880
estimate it's going
to produce.
79
00:03:55,880 --> 00:03:58,300
This is the true value of
Theta, so this is our
80
00:03:58,300 --> 00:03:59,740
estimation error.
81
00:03:59,740 --> 00:04:03,180
We look at the square of the
estimation error, and look at
82
00:04:03,180 --> 00:04:04,180
the average value.
83
00:04:04,180 --> 00:04:07,180
We would like this squared
estimation error to be as
84
00:04:07,180 --> 00:04:08,710
small as possible.
85
00:04:08,710 --> 00:04:12,470
How can we design our estimator
g to make that error
86
00:04:12,470 --> 00:04:13,920
as small as possible?
87
00:04:13,920 --> 00:04:19,490
It turns out that the answer is
to produce, as an estimate,
88
00:04:19,490 --> 00:04:22,660
the conditional expectation
of Theta given X. So the
89
00:04:22,660 --> 00:04:26,600
conditional expectation is the
best estimate that you could
90
00:04:26,600 --> 00:04:30,690
produce if your objective is to
keep the mean squared error
91
00:04:30,690 --> 00:04:32,720
as small as possible.
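This optimality claim is easy to sanity-check numerically: for a fixed posterior, no constant guess beats the posterior mean in mean squared error. A sketch with a small made-up discrete posterior:

```python
# For a fixed discrete posterior, the constant c minimizing
# E[(Theta - c)^2] is the posterior mean.  Values are made up.
posterior = {0.0: 0.2, 1.0: 0.5, 2.0: 0.3}

def mse(c):
    """Mean squared error of reporting the constant c."""
    return sum(p * (theta - c) ** 2 for theta, p in posterior.items())

posterior_mean = sum(p * theta for theta, p in posterior.items())

# The posterior mean is at least as good as a spread of candidates.
candidates = [0.0, 0.5, 1.0, 1.5, 2.0]
best = min(candidates + [posterior_mean], key=mse)
```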
92
00:04:32,720 --> 00:04:35,280
So this statement here is a
statement of what happens on
93
00:04:35,280 --> 00:04:39,950
the average over all Theta's and
all X's that may happen in
94
00:04:39,950 --> 00:04:42,490
our experiment.
95
00:04:42,490 --> 00:04:45,160
The conditional expectation as
an estimator has an even
96
00:04:45,160 --> 00:04:47,750
stronger property.
97
00:04:47,750 --> 00:04:51,490
Not only is it optimal on the
average, but it's also optimal
98
00:04:51,490 --> 00:04:56,130
given that you have made a
specific observation, no
99
00:04:56,130 --> 00:04:57,840
matter what you observe.
100
00:04:57,840 --> 00:05:01,150
Let's say you observe the
specific value for the random
101
00:05:01,150 --> 00:05:05,560
variable X. After that point if
you're asked to produce a
102
00:05:05,560 --> 00:05:11,190
best estimate Theta hat that
minimizes this mean squared
103
00:05:11,190 --> 00:05:14,080
error, your best estimate
would be the conditional
104
00:05:14,080 --> 00:05:18,940
expectation given the specific
value that you have observed.
105
00:05:18,940 --> 00:05:23,150
These two statements say almost
the same thing, but
106
00:05:23,150 --> 00:05:25,650
this one is a bit stronger.
107
00:05:25,650 --> 00:05:30,830
This one tells you no matter
what specific X happens the
108
00:05:30,830 --> 00:05:33,370
conditional expectation
is the best estimate.
109
00:05:33,370 --> 00:05:36,870
This one tells you on the
average, over all X's that may
110
00:05:36,870 --> 00:05:39,050
happen, the conditional
111
00:05:39,050 --> 00:05:42,650
expectation is the best estimator.
112
00:05:42,650 --> 00:05:44,870
Now this is really a consequence
of this.
113
00:05:44,870 --> 00:05:48,510
If the conditional expectation
is best for any specific X,
114
00:05:48,510 --> 00:05:52,750
then it's the best one even when
X is left random and you
115
00:05:52,750 --> 00:05:58,200
are averaging your error
over all possible X's.
116
00:05:58,200 --> 00:06:02,120
OK so now that we know what is
the optimal way of producing
117
00:06:02,120 --> 00:06:05,510
an estimate let's do a
simple example to see
118
00:06:05,510 --> 00:06:07,240
how things work out.
119
00:06:07,240 --> 00:06:10,290
So we have started with an
unknown random variable,
120
00:06:10,290 --> 00:06:15,080
Theta, which is uniformly
distributed between 4 and 10.
121
00:06:15,080 --> 00:06:18,270
And then we have an observation
model that tells
122
00:06:18,270 --> 00:06:22,430
us that given the value of
Theta, X is going to be a
123
00:06:22,430 --> 00:06:24,532
random variable that ranges
between Theta -
124
00:06:24,532 --> 00:06:26,570
1, and Theta + 1.
125
00:06:26,570 --> 00:06:32,550
So think of X as a noisy
measurement of Theta, plus
126
00:06:32,550 --> 00:06:37,600
some noise, which is
between -1, and +1.
127
00:06:37,600 --> 00:06:41,980
So really the model that we are
using here is that X is
128
00:06:41,980 --> 00:06:44,430
equal to Theta plus U --
129
00:06:44,430 --> 00:06:50,500
where U is uniform
on -1, and +1.
131
00:06:52,350 --> 00:06:55,946
So we have the true value of
Theta, but X could be Theta -
132
00:06:55,946 --> 00:07:00,750
1, or it could be all the
way up to Theta + 1.
133
00:07:00,750 --> 00:07:03,770
And the X is uniformly
distributed on that interval.
134
00:07:03,770 --> 00:07:08,060
That's the same as saying that
U is uniformly distributed
135
00:07:08,060 --> 00:07:09,820
over this interval.
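The observation model just described can be simulated directly. This is a minimal sketch of the lecture's uniform example, not part of the lecture itself:

```python
import random

def sample_pair(rng):
    """One draw from the model: Theta ~ Uniform(4, 10),
    then X = Theta + U with noise U ~ Uniform(-1, 1)."""
    theta = rng.uniform(4.0, 10.0)
    x = theta + rng.uniform(-1.0, 1.0)
    return theta, x

rng = random.Random(0)
pairs = [sample_pair(rng) for _ in range(10_000)]
# Every (theta, x) pair lands in the band theta - 1 <= x <= theta + 1.
assert all(abs(x - theta) <= 1.0 for theta, x in pairs)
```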
136
00:07:09,820 --> 00:07:12,780
So now we have all the
information that we need, we
137
00:07:12,780 --> 00:07:15,270
can construct the
joint density.
138
00:07:15,270 --> 00:07:19,020
And the joint density is, of
course, the prior density
139
00:07:19,020 --> 00:07:21,850
times the conditional density.
140
00:07:21,850 --> 00:07:24,540
We know both of these.
141
00:07:24,540 --> 00:07:28,880
Both of these are constants, so
the joint density is also
142
00:07:28,880 --> 00:07:30,150
going to be a constant.
143
00:07:30,150 --> 00:07:34,420
1/6 times 1/2, this
is 1/12.
144
00:07:34,420 --> 00:07:37,700
But it is a constant,
not everywhere.
145
00:07:37,700 --> 00:07:41,280
Only on the range of possible
x's and thetas.
146
00:07:41,280 --> 00:07:46,030
So theta can take any value
between four and ten, so these
147
00:07:46,030 --> 00:07:47,430
are the values of theta.
148
00:07:47,430 --> 00:07:51,990
And for any given value of theta
x can take values from
149
00:07:51,990 --> 00:07:55,690
theta minus one, up
to theta plus one.
150
00:07:55,690 --> 00:08:00,210
So here, if you can imagine, a
line that goes with slope one,
151
00:08:00,210 --> 00:08:08,530
and then x can take that value
of theta plus or minus one.
152
00:08:08,530 --> 00:08:14,720
So this object here, this is
the set of possible x and
153
00:08:14,720 --> 00:08:16,070
theta pairs.
154
00:08:16,070 --> 00:08:21,490
So the density is equal to one
over 12 over this set, and
155
00:08:21,490 --> 00:08:23,640
it's zero everywhere else.
156
00:08:23,640 --> 00:08:28,035
So outside here the density is
zero, the density only applies
157
00:08:28,035 --> 00:08:29,800
on that set.
158
00:08:29,800 --> 00:08:33,110
All right, so now we're
asked to estimate
159
00:08:33,110 --> 00:08:34,890
theta in terms of x.
160
00:08:34,890 --> 00:08:37,500
So we want to build an estimator
which is going to be
161
00:08:37,500 --> 00:08:40,000
a function from the
x's to the thetas.
162
00:08:40,000 --> 00:08:42,909
That's why I chose the axis
this way-- x to be on this
163
00:08:42,909 --> 00:08:44,600
axis, theta on that axis--
164
00:08:44,600 --> 00:08:48,020
Because the estimator we're
building is a function of x.
165
00:08:48,020 --> 00:08:51,070
Based on the observation that
we obtained, we want to
166
00:08:51,070 --> 00:08:51,940
estimate theta.
167
00:08:51,940 --> 00:08:55,680
So we know that the optimal
estimator is the conditional
168
00:08:55,680 --> 00:08:59,360
expectation, given
the value of x.
169
00:08:59,360 --> 00:09:02,160
So what is the conditional
expectation?
170
00:09:02,160 --> 00:09:07,890
If you fix a particular value of
x, let's say in this range.
171
00:09:07,890 --> 00:09:13,240
So this is our x, then what
do we know about theta?
172
00:09:13,240 --> 00:09:18,050
We know that theta lies
in this range.
173
00:09:18,050 --> 00:09:21,670
Theta can only be somewhere
between those two values.
174
00:09:21,670 --> 00:09:24,760
And what kind of distribution
does theta have?
175
00:09:24,760 --> 00:09:28,980
What is the conditional
distribution of theta given x?
176
00:09:28,980 --> 00:09:32,260
Well, remember how we built
conditional distributions from
177
00:09:32,260 --> 00:09:33,410
joint distributions?
178
00:09:33,410 --> 00:09:38,900
The conditional distribution is
just a section of the joint
179
00:09:38,900 --> 00:09:41,640
distribution applied to
the place where we're
180
00:09:41,640 --> 00:09:42,770
conditioning.
181
00:09:42,770 --> 00:09:45,800
So the joint is constant.
182
00:09:45,800 --> 00:09:49,310
So the conditional is also going
to be a constant density
183
00:09:49,310 --> 00:09:50,630
over this interval.
184
00:09:50,630 --> 00:09:53,560
So the posterior distribution
of theta is
185
00:09:53,560 --> 00:09:57,210
uniform over this interval.
186
00:09:57,210 --> 00:10:01,110
So if the posterior of theta is
uniform over that interval,
187
00:10:01,110 --> 00:10:04,900
the expected value of theta is
going to be the midpoint of
188
00:10:04,900 --> 00:10:06,070
that interval.
189
00:10:06,070 --> 00:10:08,880
So the estimate which
you report--
190
00:10:08,880 --> 00:10:10,710
if you observe that x--
191
00:10:10,710 --> 00:10:15,750
is going to be this particular
point here, it's the midpoint.
192
00:10:15,750 --> 00:10:19,140
The same argument goes through
even if you obtain an x
193
00:10:19,140 --> 00:10:22,570
somewhere here.
194
00:10:22,570 --> 00:10:29,540
Given this x, theta
can take a value
195
00:10:29,540 --> 00:10:32,800
between these two values.
196
00:10:32,800 --> 00:10:35,990
Theta is going to have a uniform
distribution over this
197
00:10:35,990 --> 00:10:40,650
interval, and the conditional
expectation of theta given x
198
00:10:40,650 --> 00:10:43,840
is going to be the midpoint
of that interval.
199
00:10:43,840 --> 00:10:50,790
So now if we plot our estimator
by tracing midpoints
200
00:10:50,790 --> 00:10:56,300
in this diagram what you're
going to obtain is a curve
201
00:10:56,300 --> 00:11:01,795
that starts like this, then
it changes slope.
202
00:11:01,795 --> 00:11:04,490
203
00:11:04,490 --> 00:11:07,280
So that it keeps track of the
midpoint, and then it goes
204
00:11:07,280 --> 00:11:09,000
like that again.
205
00:11:09,000 --> 00:11:13,760
So this blue curve here is
our g of x, which is the
206
00:11:13,760 --> 00:11:16,910
conditional expectation of
theta given that x is
207
00:11:16,910 --> 00:11:20,480
equal to little x.
208
00:11:20,480 --> 00:11:26,610
So it's a curve, in our example
it consists of three
209
00:11:26,610 --> 00:11:28,220
straight segments.
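The midpoint rule just traced out can be written down explicitly. This is a sketch of g(x) for this specific example:

```python
def g(x):
    """Midpoint estimator for the uniform example: given X = x, Theta
    is uniform on the overlap of [x - 1, x + 1] with the prior range
    [4, 10], and the LMS estimate is the midpoint of that overlap."""
    lower = max(4.0, x - 1.0)
    upper = min(10.0, x + 1.0)
    return (lower + upper) / 2.0

# Three straight segments: slope 1/2, then slope 1, then slope 1/2.
assert g(3.0) == 4.0     # x = 3 forces Theta = 4
assert g(7.0) == 7.0     # for 5 <= x <= 9, g(x) = x
assert g(11.0) == 10.0   # x = 11 forces Theta = 10
```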
210
00:11:28,220 --> 00:11:30,780
But overall it's non-linear.
211
00:11:30,780 --> 00:11:33,440
It's not a single line
through this diagram.
212
00:11:33,440 --> 00:11:35,670
And that's how things
are in general.
213
00:11:35,670 --> 00:11:39,300
g of x, our optimal estimate has
no reason to be a linear
214
00:11:39,300 --> 00:11:40,460
function of x.
215
00:11:40,460 --> 00:11:42,780
In general it's going to be
some complicated curve.
216
00:11:42,780 --> 00:11:47,350
217
00:11:47,350 --> 00:11:51,170
So how good is our estimate?
218
00:11:51,170 --> 00:11:55,700
I mean you reported your x, your
estimate of theta based
219
00:11:55,700 --> 00:12:00,690
on x, and your boss asks you
what kind of error do you
220
00:12:00,690 --> 00:12:03,350
expect to get?
221
00:12:03,350 --> 00:12:07,010
Having observed the particular
value of x, what you can
222
00:12:07,010 --> 00:12:11,140
report to your boss is what you
think is the mean squared
223
00:12:11,140 --> 00:12:13,040
error is going to be.
224
00:12:13,040 --> 00:12:15,380
We observe the particular
value of x.
225
00:12:15,380 --> 00:12:19,650
So we're conditioning, and we're
living in this universe.
226
00:12:19,650 --> 00:12:22,760
Given that we have made this
observation, this is the true
227
00:12:22,760 --> 00:12:25,840
value of theta, this is the
estimate that we have
228
00:12:25,840 --> 00:12:32,220
produced, this is the expected
squared error, given that we
229
00:12:32,220 --> 00:12:35,740
have made the particular
observation.
230
00:12:35,740 --> 00:12:39,700
Now in this conditional universe
this is the expected
231
00:12:39,700 --> 00:12:42,880
value of theta given x.
232
00:12:42,880 --> 00:12:46,240
So this is the expected value of
this random variable inside
233
00:12:46,240 --> 00:12:47,900
the conditional universe.
234
00:12:47,900 --> 00:12:50,900
So when you take the mean
square of a random variable
235
00:12:50,900 --> 00:12:53,780
minus the expected value, this
is the same thing as the
236
00:12:53,780 --> 00:12:55,840
variance of that random
variable.
237
00:12:55,840 --> 00:12:58,670
Except that it's the
variance inside
238
00:12:58,670 --> 00:13:00,940
the conditional universe.
239
00:13:00,940 --> 00:13:06,230
Having observed x, theta is
still a random variable.
240
00:13:06,230 --> 00:13:09,010
It's distributed according to
the posterior distribution.
241
00:13:09,010 --> 00:13:12,220
Since it's a random variable,
it has a variance.
242
00:13:12,220 --> 00:13:16,060
And that variance is our
mean squared error.
243
00:13:16,060 --> 00:13:20,280
So this is the variance of the
posterior distribution of
244
00:13:20,280 --> 00:13:22,605
Theta given the observation
that we have made.
245
00:13:22,605 --> 00:13:26,688
246
00:13:26,688 --> 00:13:30,180
OK, so what is the variance
in our example?
247
00:13:30,180 --> 00:13:36,270
If X happens to be here, then
Theta is uniform over this
248
00:13:36,270 --> 00:13:41,990
interval, and this interval
has length 2.
249
00:13:41,990 --> 00:13:46,960
Theta is uniformly distributed
over an interval of length 2.
250
00:13:46,960 --> 00:13:49,900
This is the posterior
distribution of Theta.
251
00:13:49,900 --> 00:13:51,410
What is the variance?
252
00:13:51,410 --> 00:13:54,680
Then you remember the formula
for the variance of a uniform
253
00:13:54,680 --> 00:13:59,520
random variable, it is the
length of the interval squared
254
00:13:59,520 --> 00:14:03,590
divided by 12, so this is 1/3.
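That length-squared-over-12 formula actually gives the entire error curve for this example, since the posterior interval length depends on x. A sketch:

```python
def conditional_variance(x):
    """Posterior variance of Theta given X = x in the uniform example:
    Theta | X = x is uniform on an interval of length L, and a uniform
    distribution on an interval of length L has variance L**2 / 12."""
    length = min(10.0, x + 1.0) - max(4.0, x - 1.0)
    return length ** 2 / 12.0

assert conditional_variance(7.0) == 4.0 / 12.0   # full-length interval: 1/3
assert conditional_variance(3.0) == 0.0          # Theta pinned down exactly
assert conditional_variance(4.0) < 1.0 / 3.0     # edge x's are more informative
```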
255
00:14:03,590 --> 00:14:06,060
So the variance of Theta --
256
00:14:06,060 --> 00:14:10,330
the mean squared error-- is
going to be 1/3 whenever this
257
00:14:10,330 --> 00:14:12,430
kind of picture applies.
258
00:14:12,430 --> 00:14:16,460
This picture applies when
X is between 5 and 9.
259
00:14:16,460 --> 00:14:20,100
If X is less than 5, then the
picture is a little different,
260
00:14:20,100 --> 00:14:22,020
and Theta is going
to be uniform
261
00:14:22,020 --> 00:14:24,660
over a smaller interval.
262
00:14:24,660 --> 00:14:26,930
And so the variance of
theta is going to
263
00:14:26,930 --> 00:14:28,770
be smaller as well.
264
00:14:28,770 --> 00:14:31,470
So let's start plotting our
mean squared error.
265
00:14:31,470 --> 00:14:35,930
Between 5 and 9 the variance
of Theta --
266
00:14:35,930 --> 00:14:37,260
the posterior variance--
267
00:14:37,260 --> 00:14:39,090
is 1/3.
268
00:14:39,090 --> 00:14:46,100
Now when the X falls in here
Theta is uniformly distributed
269
00:14:46,100 --> 00:14:48,450
over a smaller interval.
270
00:14:48,450 --> 00:14:50,670
The size of this interval
changes
271
00:14:50,670 --> 00:14:52,800
linearly over that range.
272
00:14:52,800 --> 00:14:59,260
And so when we take the square
size of that interval we get a
273
00:14:59,260 --> 00:15:01,560
quadratic function of
how much we have
274
00:15:01,560 --> 00:15:03,120
moved from that corner.
275
00:15:03,120 --> 00:15:07,140
So at that corner what is
the variance of Theta?
276
00:15:07,140 --> 00:15:11,290
Well if I observe an X that's
equal to 3 then I know with
277
00:15:11,290 --> 00:15:14,810
certainty that Theta
is equal to 4.
278
00:15:14,810 --> 00:15:18,340
Then I'm in very good shape, I
know exactly what Theta is
279
00:15:18,340 --> 00:15:19,240
going to be.
280
00:15:19,240 --> 00:15:22,890
So the variance, in this
case, is going to be 0.
281
00:15:22,890 --> 00:15:26,570
If I observe an X that's a
little larger, then Theta is
282
00:15:26,570 --> 00:15:31,130
now random, takes values on
a little interval, and the
283
00:15:31,130 --> 00:15:35,430
variance of Theta is going to be
proportional to the square
284
00:15:35,430 --> 00:15:37,910
of the length of that
little interval.
285
00:15:37,910 --> 00:15:40,400
So we get a curve that
starts rising
286
00:15:40,400 --> 00:15:42,560
quadratically from here.
287
00:15:42,560 --> 00:15:45,390
It goes up toward 1/3.
288
00:15:45,390 --> 00:15:48,980
At the other end of the picture
the same is true.
289
00:15:48,980 --> 00:15:54,500
If you observe an X which is
11 then Theta can only be
290
00:15:54,500 --> 00:15:57,150
equal to 10.
291
00:15:57,150 --> 00:16:00,720
And so the error in Theta
is equal to 0,
292
00:16:00,720 --> 00:16:02,920
there's 0 error variance.
293
00:16:02,920 --> 00:16:07,360
But as we obtain X's that are
slightly less than 11 then the
294
00:16:07,360 --> 00:16:10,380
mean squared error again
rises quadratically.
295
00:16:10,380 --> 00:16:13,450
So we end up with a
plot like this.
296
00:16:13,450 --> 00:16:17,120
What this plot tells us is that
certain measurements are
297
00:16:17,120 --> 00:16:18,920
better than others.
298
00:16:18,920 --> 00:16:25,270
If you see X
equal to 3, then you're lucky,
299
00:16:25,270 --> 00:16:28,820
because you know exactly
what Theta is.
300
00:16:28,820 --> 00:16:33,830
If you see an X which is equal
to 6 then you're sort of
301
00:16:33,830 --> 00:16:35,800
unlucky, because it
doesn't tell you
302
00:16:35,800 --> 00:16:37,900
Theta with great precision.
303
00:16:37,900 --> 00:16:40,560
Theta could be anywhere
on that interval.
304
00:16:40,560 --> 00:16:42,360
And so the variance
of Theta --
305
00:16:42,360 --> 00:16:44,630
even after you have
observed X --
306
00:16:44,630 --> 00:16:48,470
is a certain number,
1/3 in our case.
307
00:16:48,470 --> 00:16:52,370
So the moral to take out
of that story is
308
00:16:52,370 --> 00:16:56,970
that the error variance--
309
00:16:56,970 --> 00:17:00,380
or the mean squared error--
310
00:17:00,380 --> 00:17:03,350
depends on what particular
observation
311
00:17:03,350 --> 00:17:04,829
you happen to obtain.
312
00:17:04,829 --> 00:17:10,240
Some observations may be very
informative, and once you see
313
00:17:10,240 --> 00:17:13,550
a specific number then you know
exactly what Theta is.
314
00:17:13,550 --> 00:17:15,760
Some observations might
be less informative.
315
00:17:15,760 --> 00:17:18,980
You observe your X, but it could
still leave a lot of
316
00:17:18,980 --> 00:17:20,230
uncertainty about Theta.
317
00:17:20,230 --> 00:17:23,839
318
00:17:23,839 --> 00:17:27,650
So conditional expectations are
really the cornerstone of
319
00:17:27,650 --> 00:17:28,890
Bayesian estimation.
320
00:17:28,890 --> 00:17:31,690
They're particularly
popular
321
00:17:31,690 --> 00:17:33,950
in engineering contexts.
322
00:17:33,950 --> 00:17:38,260
They're used a lot in signal
processing, communications,
323
00:17:38,260 --> 00:17:40,940
control theory, and so on.
324
00:17:40,940 --> 00:17:44,300
So that makes it worth playing
a little bit with their
325
00:17:44,300 --> 00:17:50,450
theoretical properties, and getting
some appreciation of a few
326
00:17:50,450 --> 00:17:53,590
subtleties involved here.
327
00:17:53,590 --> 00:17:57,990
There's no new math, really, in what
we're going to do here.
328
00:17:57,990 --> 00:18:01,290
But it's going to be a good
opportunity to practice
329
00:18:01,290 --> 00:18:05,310
manipulation of conditional
expectations.
330
00:18:05,310 --> 00:18:13,150
So let's look at the expected
value of the estimation error
331
00:18:13,150 --> 00:18:15,330
that we obtained.
332
00:18:15,330 --> 00:18:18,540
So Theta hat is our estimator,
is the conditional
333
00:18:18,540 --> 00:18:19,855
expectation.
334
00:18:19,855 --> 00:18:25,690
Theta hat minus Theta is what
kind of error do we have?
335
00:18:25,690 --> 00:18:29,610
If Theta hat is bigger than
Theta then we have made a
336
00:18:29,610 --> 00:18:31,510
positive error.
337
00:18:31,510 --> 00:18:33,910
If not, if it's on the other
side, we have made a
338
00:18:33,910 --> 00:18:35,290
negative error.
339
00:18:35,290 --> 00:18:39,110
Then it turns out that on the
average the errors cancel each
340
00:18:39,110 --> 00:18:41,030
other out.
341
00:18:41,030 --> 00:18:43,110
So let's do this calculation.
342
00:18:43,110 --> 00:18:50,010
Let's calculate the expected
value of the error given X.
343
00:18:50,010 --> 00:18:54,480
Now by definition the error is
expected value of Theta hat
344
00:18:54,480 --> 00:18:57,850
minus Theta given X.
345
00:18:57,850 --> 00:19:01,090
We use linearity of expectations
to break it up as
346
00:19:01,090 --> 00:19:04,850
expected value of Theta hat
given X minus expected value
347
00:19:04,850 --> 00:19:11,090
of Theta given X.
And now what?
348
00:19:11,090 --> 00:19:18,680
Our estimate is made on the
basis of the data, the X's.
349
00:19:18,680 --> 00:19:23,600
If I tell you X then you
know what Theta hat is.
350
00:19:23,600 --> 00:19:26,490
Remember that the conditional
expectation is a random
351
00:19:26,490 --> 00:19:29,680
variable which is a function
of the random variable, on
352
00:19:29,680 --> 00:19:31,560
which you're conditioning on.
353
00:19:31,560 --> 00:19:35,330
If you know X then you know the
conditional expectation
354
00:19:35,330 --> 00:19:38,390
given X, you know what Theta
hat is going to be.
355
00:19:38,390 --> 00:19:42,910
So Theta hat is a function of
X. If it's a function of X
356
00:19:42,910 --> 00:19:45,910
then once I tell you X
you know what Theta
357
00:19:45,910 --> 00:19:47,460
hat is going to be.
358
00:19:47,460 --> 00:19:49,580
So this conditional expectation
is going to be
359
00:19:49,580 --> 00:19:51,860
Theta hat itself.
360
00:19:51,860 --> 00:19:54,030
Here this is-- just
by definition--
361
00:19:54,030 --> 00:19:59,580
Theta hat, and so we
get equality to 0.
362
00:19:59,580 --> 00:20:04,260
So what we have proved is that
no matter what I have
363
00:20:04,260 --> 00:20:08,970
observed, and given that I have
observed something on the
364
00:20:08,970 --> 00:20:14,050
average my error is
going to be 0.
365
00:20:14,050 --> 00:20:19,960
This is a statement involving
equality of random variables.
366
00:20:19,960 --> 00:20:22,620
Remember that conditional
expectations are random
367
00:20:22,620 --> 00:20:26,970
variables because they depend
on the thing you're
368
00:20:26,970 --> 00:20:28,440
conditioning on.
369
00:20:28,440 --> 00:20:31,630
0 is sort of a trivial
random variable.
370
00:20:31,630 --> 00:20:34,080
This tells you that this random
variable is identically
371
00:20:34,080 --> 00:20:36,390
equal to the 0 random
variable.
372
00:20:36,390 --> 00:20:40,720
More specifically it tells you
that no matter what value for
373
00:20:40,720 --> 00:20:45,120
X you observe, the conditional
expectation of the error is
374
00:20:45,120 --> 00:20:46,410
going to be 0.
375
00:20:46,410 --> 00:20:49,150
And this takes us to this
statement here, which is
376
00:20:49,150 --> 00:20:51,830
an equality between numbers.
377
00:20:51,830 --> 00:20:56,330
No matter what specific value
for capital X you have
378
00:20:56,330 --> 00:21:00,440
observed, your error, on
the average, is going
379
00:21:00,440 --> 00:21:02,420
to be equal to 0.
380
00:21:02,420 --> 00:21:06,730
So this is a less abstract
version of these statements.
381
00:21:06,730 --> 00:21:09,300
This is an equality between
two numbers.
382
00:21:09,300 --> 00:21:15,080
It's true for every value of
X, so it's true in terms of
383
00:21:15,080 --> 00:21:18,550
these random variables being
equal to that random variable.
384
00:21:18,550 --> 00:21:21,170
Because remember according to
our definition this random
385
00:21:21,170 --> 00:21:24,400
variable is the random variable
that takes this
386
00:21:24,400 --> 00:21:27,410
specific value when capital
X happens to be
387
00:21:27,410 --> 00:21:29,410
equal to little x.
388
00:21:29,410 --> 00:21:33,480
Now this doesn't mean that your
error is 0, it only means
389
00:21:33,480 --> 00:21:37,050
that your error is as likely, in
some sense, to fall on the
390
00:21:37,050 --> 00:21:40,040
positive side, as to fall
on the negative side.
391
00:21:40,040 --> 00:21:41,400
So sometimes your error will be
392
00:21:41,400 --> 00:21:42,880
positive, sometimes negative.
393
00:21:42,880 --> 00:21:46,360
And on the average these
things cancel out and
394
00:21:46,360 --> 00:21:48,150
give you a 0
395
00:21:48,150 --> 00:21:49,470
on the average.
396
00:21:49,470 --> 00:21:53,620
So this is a property that's
sometimes given a name: we
397
00:21:53,620 --> 00:21:59,040
say that Theta hat
is unbiased.
398
00:21:59,040 --> 00:22:03,190
So Theta hat, our estimate, does
not have a tendency to be
399
00:22:03,190 --> 00:22:04,180
on the high side.
400
00:22:04,180 --> 00:22:06,920
It does not have a tendency
to be on the low side.
401
00:22:06,920 --> 00:22:10,580
On the average it's
just right.
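The unbiasedness property can be checked by Monte Carlo on the earlier uniform example. The midpoint estimator below is the one derived above; the simulation itself is an illustration, not part of the lecture.

```python
import random

def lms_estimate(x):
    """Conditional expectation of Theta given X = x (uniform example)."""
    return (max(4.0, x - 1.0) + min(10.0, x + 1.0)) / 2.0

rng = random.Random(1)
n = 100_000
total_error = 0.0
for _ in range(n):
    theta = rng.uniform(4.0, 10.0)       # Theta ~ Uniform(4, 10)
    x = theta + rng.uniform(-1.0, 1.0)   # X = Theta + U
    total_error += lms_estimate(x) - theta

mean_error = total_error / n
# Unbiasedness: the errors should average out to (nearly) zero.
assert abs(mean_error) < 0.01
```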
402
00:22:10,580 --> 00:22:14,700
403
00:22:14,700 --> 00:22:18,390
So let's do a little
more playing here.
404
00:22:18,390 --> 00:22:21,790
405
00:22:21,790 --> 00:22:27,690
Let's see how our error is
related to an arbitrary
406
00:22:27,690 --> 00:22:30,270
function of the data.
407
00:22:30,270 --> 00:22:36,960
Let's do this in a conditional
universe and
408
00:22:36,960 --> 00:22:38,210
look at this quantity.
409
00:22:38,210 --> 00:22:45,210
410
00:22:45,210 --> 00:22:47,910
In a conditional universe
where X is known
411
00:22:47,910 --> 00:22:51,060
then h of X is known.
412
00:22:51,060 --> 00:22:54,200
And so you can pull it outside
the expectation.
413
00:22:54,200 --> 00:22:58,010
In the conditional universe
where the value of X is given
414
00:22:58,010 --> 00:23:01,290
this quantity becomes
just a constant.
415
00:23:01,290 --> 00:23:03,250
There's nothing random
about it.
416
00:23:03,250 --> 00:23:06,280
So you can pull it out of
the expectation, and
417
00:23:06,280 --> 00:23:09,840
write things this way.
418
00:23:09,840 --> 00:23:14,090
And we have just calculated
that this quantity is 0.
419
00:23:14,090 --> 00:23:17,390
So this number turns out
to be 0 as well.
420
00:23:17,390 --> 00:23:20,810
421
00:23:20,810 --> 00:23:23,830
Now having done this,
we can take
422
00:23:23,830 --> 00:23:26,110
expectations of both sides.
423
00:23:26,110 --> 00:23:29,530
And now let's use the law of
iterated expectations.
424
00:23:29,530 --> 00:23:33,040
Expectation of a conditional
expectation gives us the
425
00:23:33,040 --> 00:23:42,200
unconditional expectation, and
this is also going to be 0.
426
00:23:42,200 --> 00:23:47,455
So here we use the law of
iterated expectations.
427
00:23:47,455 --> 00:23:54,460
428
00:23:54,460 --> 00:23:55,710
OK.
429
00:23:55,710 --> 00:24:04,510
430
00:24:04,510 --> 00:24:06,290
OK, why are we doing this?
431
00:24:06,290 --> 00:24:09,990
We're doing this because I would
like to calculate the
432
00:24:09,990 --> 00:24:13,940
covariance between Theta
tilde and Theta hat.
433
00:24:13,940 --> 00:24:16,490
That is, we ask the question
-- is there a systematic
434
00:24:16,490 --> 00:24:20,870
relation between the error
and the estimate?
435
00:24:20,870 --> 00:24:30,830
So to calculate the covariance
we use the property that we
436
00:24:30,830 --> 00:24:34,460
can calculate the covariances
by calculating the expected
437
00:24:34,460 --> 00:24:39,520
value of the product minus
the product of
438
00:24:39,520 --> 00:24:40,770
the expected values.
439
00:24:40,770 --> 00:24:48,440
440
00:24:48,440 --> 00:24:50,850
And what do we get?
441
00:24:50,850 --> 00:24:56,080
This is 0, because of
what we just proved.
442
00:24:56,080 --> 00:25:00,980
443
00:25:00,980 --> 00:25:06,160
And this is 0, because of
what we proved earlier.
444
00:25:06,160 --> 00:25:09,740
That the expected value of
the error is equal to 0.
445
00:25:09,740 --> 00:25:12,900
446
00:25:12,900 --> 00:25:27,800
So the covariance between the
error and any function of X is
447
00:25:27,800 --> 00:25:29,470
equal to 0.
448
00:25:29,470 --> 00:25:33,060
Let's apply that to the case where
the function of X we're
449
00:25:33,060 --> 00:25:38,620
considering is Theta
hat itself.
450
00:25:38,620 --> 00:25:43,300
Theta hat is our estimate, it's
a function of X. So this
451
00:25:43,300 --> 00:25:46,845
0 result would still apply,
and we get that this
452
00:25:46,845 --> 00:25:50,570
covariance is equal to 0.
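Both facts, the zero-mean error and the zero correlation between the error and the estimate, can be checked numerically. A minimal sketch, assuming a simple jointly normal model of our own choosing (for which E[Theta | X] = X/2):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Illustrative model (our assumption): Theta ~ N(0,1), X = Theta + W, W ~ N(0,1).
theta = rng.standard_normal(n)
x = theta + rng.standard_normal(n)

# For this jointly normal pair, the conditional expectation is E[Theta | X] = X/2.
theta_hat = x / 2
theta_tilde = theta_hat - theta  # the estimation error

print(theta_tilde.mean())                     # close to 0: the error is unbiased
print(np.cov(theta_tilde, theta_hat)[0, 1])   # close to 0: error uncorrelated with estimate
```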
453
00:25:50,570 --> 00:25:59,100
OK, so that's what we proved.
454
00:25:59,100 --> 00:26:02,720
Let's see, what are the morals
to take out of all this?
455
00:26:02,720 --> 00:26:07,640
First is you should be very
comfortable with this type of
456
00:26:07,640 --> 00:26:10,580
calculation involving
conditional expectations.
457
00:26:10,580 --> 00:26:14,100
The main two things that we're
using are that when you
458
00:26:14,100 --> 00:26:17,630
condition on a random variable
any function of that random
459
00:26:17,630 --> 00:26:21,020
variable becomes a constant,
and can be pulled out of the
460
00:26:21,020 --> 00:26:22,690
conditional expectation.
461
00:26:22,690 --> 00:26:25,460
The other thing that we are
using is the law of iterated
462
00:26:25,460 --> 00:26:29,450
expectations, so these are
the skills involved.
463
00:26:29,450 --> 00:26:32,980
Now on the substance, why is
this result interesting?
464
00:26:32,980 --> 00:26:35,390
This tells us that the error is
465
00:26:35,390 --> 00:26:37,060
uncorrelated with the estimate.
466
00:26:37,060 --> 00:26:39,770
467
00:26:39,770 --> 00:26:42,530
What's a hypothetical situation
where this would
468
00:26:42,530 --> 00:26:44,160
not happen?
469
00:26:44,160 --> 00:26:52,720
Say, whenever Theta hat is positive,
my error tends to be negative.
470
00:26:52,720 --> 00:26:57,000
Suppose that whenever Theta hat
is big then you say oh my
471
00:26:57,000 --> 00:27:00,610
estimate is too big, maybe the
true Theta is on the lower
472
00:27:00,610 --> 00:27:04,470
side, so I expect my error
to be negative.
473
00:27:04,470 --> 00:27:09,230
That would be a situation that
would violate this condition.
474
00:27:09,230 --> 00:27:13,880
This condition tells you that
no matter what Theta hat is,
475
00:27:13,880 --> 00:27:17,110
you don't expect your error to
be on the positive side or on
476
00:27:17,110 --> 00:27:18,030
the negative side.
477
00:27:18,030 --> 00:27:21,630
Your error will still
be 0 on the average.
478
00:27:21,630 --> 00:27:25,780
So if you obtain a very high
estimate this is no reason for
479
00:27:25,780 --> 00:27:29,630
you to suspect that
the true Theta is
480
00:27:29,630 --> 00:27:30,890
lower than your estimate.
481
00:27:30,890 --> 00:27:34,420
If you suspected that the true
Theta was lower than your
482
00:27:34,420 --> 00:27:38,830
estimate you should have
changed your Theta hat.
483
00:27:38,830 --> 00:27:42,580
If you make an estimate and
after obtaining that estimate
484
00:27:42,580 --> 00:27:46,270
you say I think my estimate
is too big, and so
485
00:27:46,270 --> 00:27:47,770
the error is negative.
486
00:27:47,770 --> 00:27:50,730
If you thought that way then
that means that your estimate
487
00:27:50,730 --> 00:27:53,690
is not the optimal one, that
your estimate should have been
488
00:27:53,690 --> 00:27:57,200
corrected to be smaller.
489
00:27:57,200 --> 00:28:00,030
And that would mean that there's
a better estimate than
490
00:28:00,030 --> 00:28:03,060
the one you used, but the
estimate that we are using
491
00:28:03,060 --> 00:28:06,060
here is the optimal one in terms
of mean squared error,
492
00:28:06,060 --> 00:28:08,350
there's no way of
improving it.
493
00:28:08,350 --> 00:28:11,640
And this is really captured
in that statement.
494
00:28:11,640 --> 00:28:14,250
That is, knowing Theta hat
doesn't give you a lot of
495
00:28:14,250 --> 00:28:18,290
information about the error, and
gives you, therefore, no
496
00:28:18,290 --> 00:28:24,430
reason to adjust your estimate
from what it was.
497
00:28:24,430 --> 00:28:29,190
Finally, a consequence
of all this.
498
00:28:29,190 --> 00:28:31,910
This is the definition
of the error.
499
00:28:31,910 --> 00:28:35,770
Send Theta to this side, send
Theta tilde to that side, you
500
00:28:35,770 --> 00:28:36,850
get this relation.
501
00:28:36,850 --> 00:28:41,010
The true parameter is composed
of two quantities.
502
00:28:41,010 --> 00:28:44,940
The estimate, and the
error, which comes
503
00:28:44,940 --> 00:28:46,460
with a minus sign.
504
00:28:46,460 --> 00:28:49,790
These two quantities are
uncorrelated with each other.
505
00:28:49,790 --> 00:28:53,350
Their covariance is 0, and
therefore, the variance of
506
00:28:53,350 --> 00:28:56,330
this is the sum of the variances
of these two
507
00:28:56,330 --> 00:28:57,580
quantities.
508
00:28:57,580 --> 00:29:00,470
509
00:29:00,470 --> 00:29:07,520
So what's an interpretation
of this equality?
510
00:29:07,520 --> 00:29:10,930
There is some inherent
randomness in the random
511
00:29:10,930 --> 00:29:14,540
variable Theta that we're
trying to estimate.
512
00:29:14,540 --> 00:29:19,360
Theta hat tries to estimate it,
tries to get close to it.
513
00:29:19,360 --> 00:29:25,500
And if Theta hat always stays
close to Theta, since Theta is
514
00:29:25,500 --> 00:29:29,260
random Theta hat must also be
quite random, so it has
515
00:29:29,260 --> 00:29:31,170
uncertainty in it.
516
00:29:31,170 --> 00:29:35,270
And the more uncertain Theta
hat is the more it moves
517
00:29:35,270 --> 00:29:36,640
together with Theta.
518
00:29:36,640 --> 00:29:40,860
So the more uncertainty
it removes from Theta.
519
00:29:40,860 --> 00:29:43,900
And this is the remaining
uncertainty in Theta.
520
00:29:43,900 --> 00:29:47,140
The uncertainty that's left
after we've done our
521
00:29:47,140 --> 00:29:48,350
estimation.
522
00:29:48,350 --> 00:29:52,330
So ideally, to have a small
error we want this
523
00:29:52,330 --> 00:29:54,120
quantity to be small.
524
00:29:54,120 --> 00:29:55,820
Which is the same as
saying that this
525
00:29:55,820 --> 00:29:57,740
quantity should be big.
526
00:29:57,740 --> 00:30:02,070
In the ideal case Theta hat
is the same as Theta.
527
00:30:02,070 --> 00:30:04,820
That's the best we
could hope for.
528
00:30:04,820 --> 00:30:09,250
That corresponds to 0 error,
and all the uncertainty in
529
00:30:09,250 --> 00:30:14,230
Theta is absorbed by the
uncertainty in Theta hat.
530
00:30:14,230 --> 00:30:18,960
Interestingly, this relation
here is just another variation
531
00:30:18,960 --> 00:30:21,630
of the law of total variance
that we have seen at some
532
00:30:21,630 --> 00:30:23,880
point in the past.
533
00:30:23,880 --> 00:30:28,570
I will skip that derivation, but
it's an interesting fact,
534
00:30:28,570 --> 00:30:31,430
and it can give you an
alternative interpretation of
535
00:30:31,430 --> 00:30:32,680
the law of total variance.
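The variance decomposition just described can be verified with the same kind of simulation (the normal model below is again our illustrative assumption, not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

# Illustrative model: Theta ~ N(0,1), X = Theta + W with W ~ N(0,1),
# so the optimal estimator is Theta_hat = E[Theta | X] = X/2.
theta = rng.standard_normal(n)
x = theta + rng.standard_normal(n)
theta_hat = x / 2
theta_tilde = theta_hat - theta

# var(Theta) = var(Theta_hat) + var(Theta_tilde), since the two are uncorrelated.
print(theta.var(), theta_hat.var() + theta_tilde.var())
```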
536
00:30:32,680 --> 00:30:36,840
537
00:30:36,840 --> 00:30:40,570
OK, so now let's return
to our example.
538
00:30:40,570 --> 00:30:45,630
In our example we obtained the
optimal estimator, and we saw
539
00:30:45,630 --> 00:30:51,220
that it was a nonlinear curve,
something like this.
540
00:30:51,220 --> 00:30:53,660
I'm exaggerating the corner
a little bit to
541
00:30:53,660 --> 00:30:55,350
show that it's nonlinear.
542
00:30:55,350 --> 00:30:57,400
This is the optimal estimator.
543
00:30:57,400 --> 00:31:01,070
It's a nonlinear function
of X --
544
00:31:01,070 --> 00:31:05,200
nonlinear generally
means complicated.
545
00:31:05,200 --> 00:31:09,020
Sometimes the conditional
expectation is really hard to
546
00:31:09,020 --> 00:31:12,320
compute, because whenever you
have to compute expectations
547
00:31:12,320 --> 00:31:17,270
you need to do some integrals.
548
00:31:17,270 --> 00:31:19,880
And if you have many random
variables involved it might
549
00:31:19,880 --> 00:31:23,160
correspond to a
multi-dimensional integration.
550
00:31:23,160 --> 00:31:24,370
We don't like this.
551
00:31:24,370 --> 00:31:27,370
Can we come up, maybe,
with a simpler way
552
00:31:27,370 --> 00:31:29,200
of estimating Theta?
553
00:31:29,200 --> 00:31:32,580
Of coming up with a point
estimate which still has some
554
00:31:32,580 --> 00:31:34,350
nice properties, it
has some good
555
00:31:34,350 --> 00:31:37,120
motivation, but is simpler.
556
00:31:37,120 --> 00:31:38,630
What does simpler mean?
557
00:31:38,630 --> 00:31:40,920
Perhaps linear.
558
00:31:40,920 --> 00:31:45,570
Let's put ourselves in a
straitjacket and restrict
559
00:31:45,570 --> 00:31:50,260
ourselves to estimators that
are of this form.
560
00:31:50,260 --> 00:31:53,280
My estimate is constrained
to be a linear
561
00:31:53,280 --> 00:31:54,930
function of the X's.
562
00:31:54,930 --> 00:31:59,320
So my estimator is going to be
a curve, a linear curve.
563
00:31:59,320 --> 00:32:03,450
It could be this, it could be
that, maybe it would want to
564
00:32:03,450 --> 00:32:06,350
be something like this.
565
00:32:06,350 --> 00:32:10,540
I want to choose the best
possible linear function.
566
00:32:10,540 --> 00:32:11,490
What does that mean?
567
00:32:11,490 --> 00:32:15,570
It means that I write my
Theta hat in this form.
568
00:32:15,570 --> 00:32:20,750
If I fix a certain a and b I
have fixed the functional form
569
00:32:20,750 --> 00:32:23,940
of my estimator, and this
is the corresponding
570
00:32:23,940 --> 00:32:25,360
mean squared error.
571
00:32:25,360 --> 00:32:28,210
That's the error between the
true parameter and the
572
00:32:28,210 --> 00:32:31,130
estimate of that parameter, we
take the square of this.
573
00:32:31,130 --> 00:32:33,730
574
00:32:33,730 --> 00:32:38,350
And now the optimal linear
estimator is defined as one
575
00:32:38,350 --> 00:32:42,210
for which this mean squared
error is smallest possible
576
00:32:42,210 --> 00:32:45,600
over all choices of a and b.
577
00:32:45,600 --> 00:32:48,260
So we want to minimize
this expression
578
00:32:48,260 --> 00:32:52,030
over all a's and b's.
579
00:32:52,030 --> 00:32:55,650
How do we do this
minimization?
580
00:32:55,650 --> 00:32:58,910
Well this is a square,
you can expand it.
581
00:32:58,910 --> 00:33:02,040
Write down all the terms in the
expansion of the square.
582
00:33:02,040 --> 00:33:03,810
So you're going to get
the term expected
583
00:33:03,810 --> 00:33:05,400
value of Theta squared.
584
00:33:05,400 --> 00:33:07,380
You're going to get
another term--
585
00:33:07,380 --> 00:33:11,010
a squared expected value of X
squared, another term which is
586
00:33:11,010 --> 00:33:13,340
b squared, and then you're
going to get to
587
00:33:13,340 --> 00:33:16,620
various cross terms.
588
00:33:16,620 --> 00:33:22,050
What you have here is really a
quadratic function of a and b.
589
00:33:22,050 --> 00:33:25,030
So think of this quantity that
we're minimizing as some
590
00:33:25,030 --> 00:33:28,920
function h of a and b, and it
happens to be quadratic.
591
00:33:28,920 --> 00:33:32,500
592
00:33:32,500 --> 00:33:35,280
How do we minimize a
quadratic function?
593
00:33:35,280 --> 00:33:38,890
We set the derivative of this
function with respect to a and
594
00:33:38,890 --> 00:33:42,940
b to 0, and then
do the algebra.
595
00:33:42,940 --> 00:33:48,000
After you do the algebra you
find that the best choice for
596
00:33:48,000 --> 00:33:54,380
a is this one, so this is the
coefficient next to X. This is
597
00:33:54,380 --> 00:33:55,630
the optimal a.
598
00:33:55,630 --> 00:33:59,560
599
00:33:59,560 --> 00:34:03,660
And the optimal b corresponds
to the constant terms.
600
00:34:03,660 --> 00:34:08,770
So this term and this times that
together are the optimal
601
00:34:08,770 --> 00:34:11,090
choices of b.
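Putting the two pieces together, the best linear estimator is Theta hat = E[Theta] + (cov(Theta, X) / var(X)) (X - E[X]). A small sketch that recovers a and b from sample moments; the uniform model here is a hypothetical choice of ours:

```python
import numpy as np

def linear_lms(theta_samples, x_samples):
    """Best linear estimator Theta_hat = a*X + b from moment estimates:
    a = cov(Theta, X) / var(X), b = E[Theta] - a * E[X]."""
    a = np.cov(theta_samples, x_samples)[0, 1] / np.var(x_samples)
    b = theta_samples.mean() - a * x_samples.mean()
    return a, b

rng = np.random.default_rng(2)
theta = rng.uniform(4, 10, 500_000)        # a uniform prior (illustrative)
x = theta + rng.uniform(-1, 1, 500_000)    # noisy observation of Theta
a, b = linear_lms(theta, x)
# The slope is close to var(Theta)/(var(Theta)+var(W)),
# since cov(Theta, X) = var(Theta) when the noise is independent.
print(a, b)
```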
602
00:34:11,090 --> 00:34:15,590
So the algebra itself is
not very interesting.
603
00:34:15,590 --> 00:34:19,210
What is really interesting is
the nature of the result that
604
00:34:19,210 --> 00:34:21,179
we get here.
605
00:34:21,179 --> 00:34:26,260
If we were to plot the result on
this particular example you
606
00:34:26,260 --> 00:34:32,280
would get the curve that's
something like this.
607
00:34:32,280 --> 00:34:36,949
608
00:34:36,949 --> 00:34:40,710
It goes through the middle
of this diagram
609
00:34:40,710 --> 00:34:43,080
and is a little slanted.
610
00:34:43,080 --> 00:34:48,639
In this example, X and Theta
are positively correlated.
611
00:34:48,639 --> 00:34:51,190
Bigger values of X generally
correspond to
612
00:34:51,190 --> 00:34:53,139
bigger values of Theta.
613
00:34:53,139 --> 00:34:56,310
So in this example the
covariance between X and Theta
614
00:34:56,310 --> 00:35:05,530
is positive, and so our estimate
can be interpreted in
615
00:35:05,530 --> 00:35:09,110
the following way: The expected
value of Theta is the
616
00:35:09,110 --> 00:35:13,130
estimate that you would come up
with if you didn't have any
617
00:35:13,130 --> 00:35:15,960
information about Theta.
618
00:35:15,960 --> 00:35:19,590
If you don't make any
observations this is the best
619
00:35:19,590 --> 00:35:22,270
way of estimating Theta.
620
00:35:22,270 --> 00:35:26,190
But I have made an observation,
X, and I need to
621
00:35:26,190 --> 00:35:27,920
take it into account.
622
00:35:27,920 --> 00:35:32,360
I look at this difference, which
is the piece of news
623
00:35:32,360 --> 00:35:34,380
contained in X.
624
00:35:34,380 --> 00:35:37,870
That's what X should
be on the average.
625
00:35:37,870 --> 00:35:41,910
If I observe an X which is
bigger than what I expected it
626
00:35:41,910 --> 00:35:46,830
to be, and since X and Theta
are positively correlated,
627
00:35:46,830 --> 00:35:51,070
this tells me that Theta should
also be bigger than its
628
00:35:51,070 --> 00:35:52,690
average value.
629
00:35:52,690 --> 00:35:57,180
Whenever I see an X that's
larger than its average value
630
00:35:57,180 --> 00:36:00,230
this gives me an indication
that theta should also
631
00:36:00,230 --> 00:36:04,480
probably be larger than
its average value.
632
00:36:04,480 --> 00:36:08,040
And so I'm taking that
difference and multiplying it
633
00:36:08,040 --> 00:36:10,240
by a positive coefficient.
634
00:36:10,240 --> 00:36:12,360
And that's what gives
me a curve here that
635
00:36:12,360 --> 00:36:14,880
has a positive slope.
636
00:36:14,880 --> 00:36:17,780
So this increment--
637
00:36:17,780 --> 00:36:21,750
the new information contained
in X as compared to the
638
00:36:21,750 --> 00:36:25,950
average value we expected
a priori, that increment allows
639
00:36:25,950 --> 00:36:30,780
us to make a correction to our
prior estimate of Theta, and
640
00:36:30,780 --> 00:36:34,780
the amount of that correction is
guided by the covariance of
641
00:36:34,780 --> 00:36:36,260
X with Theta.
642
00:36:36,260 --> 00:36:39,670
If the covariance of X with
Theta were 0, that would mean
643
00:36:39,670 --> 00:36:43,050
there's no systematic relation
between the two, and in that
644
00:36:43,050 --> 00:36:46,380
case obtaining some information
from X doesn't
645
00:36:46,380 --> 00:36:51,010
give us a guide as to how to
change the estimates of Theta.
646
00:36:51,010 --> 00:36:53,870
If that were 0, we would
just stay with
647
00:36:53,870 --> 00:36:55,050
this particular estimate.
648
00:36:55,050 --> 00:36:57,090
We're not able to make
a correction.
649
00:36:57,090 --> 00:37:00,810
But when there's a non zero
covariance between X and Theta
650
00:37:00,810 --> 00:37:04,620
that covariance works as a
guide for us to obtain a
651
00:37:04,620 --> 00:37:08,130
better estimate of Theta.
652
00:37:08,130 --> 00:37:12,270
653
00:37:12,270 --> 00:37:15,220
How about the resulting
mean squared error?
654
00:37:15,220 --> 00:37:18,690
In this context it turns out that
there's a very nice formula
655
00:37:18,690 --> 00:37:21,360
for the mean squared
error obtained from
656
00:37:21,360 --> 00:37:24,780
the best linear estimate.
657
00:37:24,780 --> 00:37:27,900
What's the story here?
658
00:37:27,900 --> 00:37:31,210
The mean squared error that we
have has something to do with
659
00:37:31,210 --> 00:37:35,450
the variance of the original
random variable.
660
00:37:35,450 --> 00:37:38,710
The more uncertain our original
random variable is,
661
00:37:38,710 --> 00:37:41,670
the more error we're
going to make.
662
00:37:41,670 --> 00:37:45,590
On the other hand, when the two
variables are correlated
663
00:37:45,590 --> 00:37:48,370
we exploit that correlation
to improve our estimate.
664
00:37:48,370 --> 00:37:52,100
665
00:37:52,100 --> 00:37:54,650
This rho here is the correlation
coefficient
666
00:37:54,650 --> 00:37:56,730
between the two random
variables.
667
00:37:56,730 --> 00:37:59,720
When this correlation
coefficient is larger this
668
00:37:59,720 --> 00:38:01,780
factor here becomes smaller.
669
00:38:01,780 --> 00:38:04,660
And our mean squared error
becomes smaller.
670
00:38:04,660 --> 00:38:07,560
So think of the two
extreme cases.
671
00:38:07,560 --> 00:38:11,270
One extreme case is when
rho is equal to 1 --
672
00:38:11,270 --> 00:38:14,200
so X and Theta are perfectly
correlated.
673
00:38:14,200 --> 00:38:18,420
When they're perfectly
correlated once I know X then
674
00:38:18,420 --> 00:38:20,310
I also know Theta.
675
00:38:20,310 --> 00:38:23,580
And the two random variables
are linearly related.
676
00:38:23,580 --> 00:38:27,080
In that case, my estimate is
right on the target, and the
677
00:38:27,080 --> 00:38:30,860
mean squared error
is going to be 0.
678
00:38:30,860 --> 00:38:34,810
The other extreme case is
if rho is equal to 0.
679
00:38:34,810 --> 00:38:37,590
The two random variables
are uncorrelated.
680
00:38:37,590 --> 00:38:41,740
In that case the measurement
does not help me estimate
681
00:38:41,740 --> 00:38:45,390
Theta, and the uncertainty
that's left--
682
00:38:45,390 --> 00:38:46,970
the mean squared error--
683
00:38:46,970 --> 00:38:49,830
is just the original
variance of Theta.
684
00:38:49,830 --> 00:38:53,750
So the uncertainty in Theta
does not get reduced.
685
00:38:53,750 --> 00:38:54,670
So moral--
686
00:38:54,670 --> 00:38:59,710
the estimation error is a
reduced version of the
687
00:38:59,710 --> 00:39:03,660
original amount of uncertainty
in the random variable Theta,
688
00:39:03,660 --> 00:39:08,280
and the larger the correlation
between those two random
689
00:39:08,280 --> 00:39:12,620
variables, the better we can
remove uncertainty from the
690
00:39:12,620 --> 00:39:13,970
original random variable.
691
00:39:13,970 --> 00:39:17,320
692
00:39:17,320 --> 00:39:21,200
I didn't derive this formula,
but it's just a matter of
693
00:39:21,200 --> 00:39:22,430
algebraic manipulations.
694
00:39:22,430 --> 00:39:25,770
We have a formula for
Theta hat, subtract
695
00:39:25,770 --> 00:39:27,620
Theta from that formula.
696
00:39:27,620 --> 00:39:30,640
Take square, take expectations,
and do a few
697
00:39:30,640 --> 00:39:33,750
lines of algebra that you can
read in the text, and you end
698
00:39:33,750 --> 00:39:35,915
up with this really neat
and clean formula.
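The formula being referred to is E[(Theta hat - Theta)^2] = (1 - rho^2) var(Theta). A numerical check, using an arbitrary joint distribution of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
theta = rng.exponential(2.0, n)        # any prior works for this identity
x = theta + rng.normal(0, 1.5, n)      # noisy measurement

# Best linear estimator from sample moments.
a = np.cov(theta, x)[0, 1] / x.var()
b = theta.mean() - a * x.mean()
mse = np.mean((a * x + b - theta) ** 2)

# Compare with (1 - rho^2) * var(Theta).
rho = np.corrcoef(theta, x)[0, 1]
print(mse, (1 - rho**2) * theta.var())
```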
699
00:39:35,915 --> 00:39:38,650
700
00:39:38,650 --> 00:39:42,360
Now I mentioned in the beginning
of the lecture that
701
00:39:42,360 --> 00:39:45,220
we can do inference with Theta's
and X's not just being
702
00:39:45,220 --> 00:39:48,970
single numbers, but they could
be vector random variables.
703
00:39:48,970 --> 00:39:52,100
So for example we might have
multiple data that gives us
704
00:39:52,100 --> 00:39:56,710
information about Theta.
705
00:39:56,710 --> 00:40:00,240
There are no vectors here, so
this discussion was for the
706
00:40:00,240 --> 00:40:04,460
case where Theta and X were just
scalar, one-dimensional
707
00:40:04,460 --> 00:40:05,350
quantities.
708
00:40:05,350 --> 00:40:08,060
What do we do if we have
multiple data?
709
00:40:08,060 --> 00:40:11,990
Suppose that Theta is still a
scalar, it's one dimensional,
710
00:40:11,990 --> 00:40:14,710
but we make several
observations.
711
00:40:14,710 --> 00:40:17,050
And on the basis of these
observations we want to
712
00:40:17,050 --> 00:40:20,080
estimate Theta.
713
00:40:20,080 --> 00:40:24,650
The optimal least mean squares
estimator would be again the
714
00:40:24,650 --> 00:40:28,830
conditional expectation of
Theta given X. That's the
715
00:40:28,830 --> 00:40:30,130
optimal one.
716
00:40:30,130 --> 00:40:36,330
And in this case X is a
vector, so the general
717
00:40:36,330 --> 00:40:40,650
estimator we would use
would be this one.
718
00:40:40,650 --> 00:40:44,050
But if we want to keep things
simple and we want our
719
00:40:44,050 --> 00:40:47,300
estimator to have a simple
functional form we might
720
00:40:47,300 --> 00:40:51,870
restrict to estimators that are
linear functions of the data.
721
00:40:51,870 --> 00:40:53,800
And then the story is
exactly the same as
722
00:40:53,800 --> 00:40:57,010
we discussed before.
723
00:40:57,010 --> 00:41:00,460
I constrained myself to
estimating Theta using a
724
00:41:00,460 --> 00:41:05,880
linear function of the data,
so my signal processing box
725
00:41:05,880 --> 00:41:07,830
just applies a linear
function.
726
00:41:07,830 --> 00:41:11,145
And I'm looking for the best
coefficients, the coefficients
727
00:41:11,145 --> 00:41:13,490
that are going to result
in the least
728
00:41:13,490 --> 00:41:15,990
possible squared error.
729
00:41:15,990 --> 00:41:19,780
This is my squared error, this
is (my estimate minus the
730
00:41:19,780 --> 00:41:22,110
thing I'm trying to estimate)
squared, and
731
00:41:22,110 --> 00:41:24,100
then taking the average.
732
00:41:24,100 --> 00:41:25,330
How do we do this?
733
00:41:25,330 --> 00:41:26,580
Same story as before.
734
00:41:26,580 --> 00:41:29,510
735
00:41:29,510 --> 00:41:32,500
The X's and the Theta's get
averaged out because we have
736
00:41:32,500 --> 00:41:33,430
an expectation.
737
00:41:33,430 --> 00:41:36,830
Whatever is left is just a
function of the coefficients
738
00:41:36,830 --> 00:41:38,760
of the a's and of b's.
739
00:41:38,760 --> 00:41:42,110
As before it turns out to
be a quadratic function.
740
00:41:42,110 --> 00:41:46,580
Then we set the derivatives of
this function of a's and b's
741
00:41:46,580 --> 00:41:50,000
with respect to the
coefficients, we set it to 0.
742
00:41:50,000 --> 00:41:54,340
And this gives us a system
of linear equations.
743
00:41:54,340 --> 00:41:56,780
It's a system of linear
equations that's satisfied by
744
00:41:56,780 --> 00:41:57,730
those coefficients.
745
00:41:57,730 --> 00:42:00,860
It's a linear system because
this is a quadratic function
746
00:42:00,860 --> 00:42:03,950
of those coefficients.
747
00:42:03,950 --> 00:42:10,410
So to get closed-form formulas
in this particular case one
748
00:42:10,410 --> 00:42:13,180
would need to introduce vectors,
and matrices, and
749
00:42:13,180 --> 00:42:15,330
matrix inverses and so on.
750
00:42:15,330 --> 00:42:18,570
The particular formulas are not
so much what interests us
751
00:42:18,570 --> 00:42:22,950
here, rather, the interesting
thing is that this is simply
752
00:42:22,950 --> 00:42:27,120
done just using straightforward
solvers of
753
00:42:27,120 --> 00:42:29,240
linear equations.
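Concretely, the linear system in question is the set of normal equations: the coefficient vector a solves cov(X) a = cov(X, Theta), and b matches the means. A sketch using a standard solver, with an illustrative two-measurement model of our own:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
theta = rng.normal(0.0, 2.0, n)
# Two noisy measurements of the same Theta (our illustrative model).
X = np.stack([theta + rng.normal(0, 1, n),
              theta + rng.normal(0, 3, n)])

# Normal equations: cov(X) a = cov(X, Theta); b makes the means match.
C = np.cov(X)  # 2x2 covariance matrix of the data
c = np.array([np.cov(X[0], theta)[0, 1], np.cov(X[1], theta)[0, 1]])
a = np.linalg.solve(C, c)
b = theta.mean() - a @ X.mean(axis=1)

theta_hat = a @ X + b  # the best linear estimate based on both measurements
print(a, b)
```

Note that the less noisy first measurement receives the larger coefficient, as one would expect.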
754
00:42:29,240 --> 00:42:32,470
The only thing you need to do
is to write down the correct
755
00:42:32,470 --> 00:42:35,280
coefficients of those linear
equations.
756
00:42:35,280 --> 00:42:37,440
And the typical coefficient
that you would
757
00:42:37,440 --> 00:42:39,240
get would be what?
758
00:42:39,240 --> 00:42:42,480
Let's say a typical
equation would be --
759
00:42:42,480 --> 00:42:44,190
let's take a typical
term of this
760
00:42:44,190 --> 00:42:45,680
quadratic one you expanded.
761
00:42:45,680 --> 00:42:51,470
You're going to get the terms
such as a1x1 times a2x2.
762
00:42:51,470 --> 00:42:55,680
When you take expectations
you're left with a1a2 times
763
00:42:55,680 --> 00:42:58,210
expected value of x1x2.
764
00:42:58,210 --> 00:43:02,030
765
00:43:02,030 --> 00:43:06,700
So this would involve terms such
as a1 squared expected
766
00:43:06,700 --> 00:43:08,520
value of x1 squared.
767
00:43:08,520 --> 00:43:14,760
You would get terms such as
a1a2, expected value of x1x2,
768
00:43:14,760 --> 00:43:20,120
and a lot of other terms
here as well.
769
00:43:20,120 --> 00:43:23,600
So you get something that's
quadratic in your
770
00:43:23,600 --> 00:43:24,890
coefficients.
771
00:43:24,890 --> 00:43:30,490
And the constants that show up
in your system of equations
772
00:43:30,490 --> 00:43:33,790
are things that have to do with
the expected values of
773
00:43:33,790 --> 00:43:37,070
squares of your random
variables, or products of your
774
00:43:37,070 --> 00:43:39,130
random variables.
775
00:43:39,130 --> 00:43:43,060
To write down the numerical
values for these the only
776
00:43:43,060 --> 00:43:46,330
thing you need to know are the
means and variances of your
777
00:43:46,330 --> 00:43:47,570
random variables.
778
00:43:47,570 --> 00:43:50,360
If you know the mean and
variance then you know what
779
00:43:50,360 --> 00:43:51,760
this thing is.
780
00:43:51,760 --> 00:43:54,950
And if you know the covariances
as well then you
781
00:43:54,950 --> 00:43:57,250
know what this thing is.
782
00:43:57,250 --> 00:44:02,080
So in order to find the optimal
linear estimator in
783
00:44:02,080 --> 00:44:06,870
the case of multiple data you do
not need to know the entire
784
00:44:06,870 --> 00:44:09,230
probability distribution
of the random
785
00:44:09,230 --> 00:44:11,050
variables that are involved.
786
00:44:11,050 --> 00:44:14,690
You only need to know your
means and covariances.
787
00:44:14,690 --> 00:44:18,670
These are the only quantities
that affect the construction
788
00:44:18,670 --> 00:44:20,570
of your optimal estimator.
789
00:44:20,570 --> 00:44:23,840
We could see this already
in this formula.
790
00:44:23,840 --> 00:44:29,650
The form of my optimal estimator
is completely
791
00:44:29,650 --> 00:44:34,100
determined once I know the
means, variances, and
792
00:44:34,100 --> 00:44:37,970
covariance of the random
variables in my model.
793
00:44:37,970 --> 00:44:44,410
I do not need to know the
detailed distribution of the
794
00:44:44,410 --> 00:44:46,570
random variables that
are involved here.
795
00:44:46,570 --> 00:44:51,690
796
00:44:51,690 --> 00:44:55,110
So as I said in general, you
find the form of the optimal
797
00:44:55,110 --> 00:44:59,550
estimator by using a linear
equation solver.
798
00:44:59,550 --> 00:45:01,890
There are special examples
in which you can
799
00:45:01,890 --> 00:45:05,210
get closed-form solutions.
800
00:45:05,210 --> 00:45:10,090
The nicest simplest estimation
problem one can think of is
801
00:45:10,090 --> 00:45:11,120
the following--
802
00:45:11,120 --> 00:45:14,870
you have some uncertain
parameter, and you make
803
00:45:14,870 --> 00:45:17,790
multiple measurements
of that parameter in
804
00:45:17,790 --> 00:45:19,950
the presence of noise.
805
00:45:19,950 --> 00:45:22,520
So the Wi's are noises.
806
00:45:22,520 --> 00:45:25,130
i corresponds to your
i-th experiment.
807
00:45:25,130 --> 00:45:27,810
So this is the most common
situation that you encounter
808
00:45:27,810 --> 00:45:28,490
in the lab.
809
00:45:28,490 --> 00:45:31,240
If you are dealing with some
process, you're trying to
810
00:45:31,240 --> 00:45:34,110
measure something you measure
it over and over.
811
00:45:34,110 --> 00:45:37,030
Each time your measurement
has some random error.
812
00:45:37,030 --> 00:45:40,360
And then you need to take all
your measurements together and
813
00:45:40,360 --> 00:45:43,550
come up with a single
estimate.
814
00:45:43,550 --> 00:45:48,320
So the noises are assumed to be
independent of each other,
815
00:45:48,320 --> 00:45:50,010
and also to be independent
from the
816
00:45:50,010 --> 00:45:52,090
value of the true parameter.
817
00:45:52,090 --> 00:45:55,010
Without loss of generality we
can assume that the noises
818
00:45:55,010 --> 00:45:58,890
have 0 mean and they have
some variances that we
819
00:45:58,890 --> 00:46:00,340
assume to be known.
820
00:46:00,340 --> 00:46:03,180
Theta itself has a prior
distribution with a certain
821
00:46:03,180 --> 00:46:05,670
mean and a certain variance.
822
00:46:05,670 --> 00:46:07,610
So the form of the
optimal linear
823
00:46:07,610 --> 00:46:10,940
estimator is really nice.
824
00:46:10,940 --> 00:46:14,930
Well maybe you cannot see it
right away because this looks
825
00:46:14,930 --> 00:46:18,580
messy, but what is it really?
826
00:46:18,580 --> 00:46:24,590
It's a linear combination of
the X's and the prior mean.
827
00:46:24,590 --> 00:46:28,560
And it's actually a weighted
average of the X's and the
828
00:46:28,560 --> 00:46:30,250
prior mean.
829
00:46:30,250 --> 00:46:33,570
Here we collect all of
the coefficients that
830
00:46:33,570 --> 00:46:35,920
we have at the top.
831
00:46:35,920 --> 00:46:42,060
So the whole thing is basically
a weighted average.
832
00:46:42,060 --> 00:46:46,460
833
00:46:46,460 --> 00:46:51,110
1/(sigma_i-squared) is the
weight that we give to Xi, and
834
00:46:51,110 --> 00:46:54,710
in the denominator we have the
sum of all of the weights.
835
00:46:54,710 --> 00:46:59,260
So in the end we're dealing
with a weighted average.
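That weighted average can be written directly in code. A minimal sketch (the function and variable names are ours):

```python
import numpy as np

def linear_estimate(mu, var0, x, noise_vars):
    """Best linear estimate of Theta from X_i = Theta + W_i:
    a weighted average of the prior mean mu (weight 1/var0)
    and each measurement x[i] (weight 1/noise_vars[i])."""
    weights = np.concatenate(([1 / var0], 1 / np.asarray(noise_vars, dtype=float)))
    values = np.concatenate(([mu], np.asarray(x, dtype=float)))
    return np.average(values, weights=weights)

# Prior mean 5 with variance 4; two measurements with noise variances 1 and 2.
# The noisier second measurement gets half the weight of the first.
print(linear_estimate(5.0, 4.0, [6.0, 7.0], [1.0, 2.0]))
```

As noted in the lecture, if mu and all the X_i equal 1, the weighted average is 1 as well.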
836
00:46:59,260 --> 00:47:03,760
If mu was equal to 1, and all
the Xi's were equal to 1 then
837
00:47:03,760 --> 00:47:06,790
our estimate would also
be equal to 1.
838
00:47:06,790 --> 00:47:10,670
Now the form of the weights that
we have is interesting.
839
00:47:10,670 --> 00:47:16,050
Any given data point is
weighted inversely
840
00:47:16,050 --> 00:47:17,820
proportional to the variance.
841
00:47:17,820 --> 00:47:20,270
What does that say?
842
00:47:20,270 --> 00:47:26,920
If my i-th data point has a lot
of variance, if Wi is very
843
00:47:26,920 --> 00:47:32,900
noisy then Xi is not very
useful, is not very reliable.
844
00:47:32,900 --> 00:47:36,840
So I'm giving it
a small weight.
845
00:47:36,840 --> 00:47:41,870
Large variance, a lot of error
in my Xi means that I should
846
00:47:41,870 --> 00:47:44,200
give it a smaller weight.
847
00:47:44,200 --> 00:47:47,920
If two data points have the
same variance, they're of
848
00:47:47,920 --> 00:47:50,140
comparable quality,
then I'm going to
849
00:47:50,140 --> 00:47:51,950
give them equal weight.
850
00:47:51,950 --> 00:47:56,200
The other interesting thing is
that the prior mean is treated
851
00:47:56,200 --> 00:47:58,300
the same way as the X's.
852
00:47:58,300 --> 00:48:03,050
So it's treated as an additional
observation.
853
00:48:03,050 --> 00:48:07,100
So we're taking a weighted
average of the prior mean and
854
00:48:07,100 --> 00:48:09,850
of the measurements that
we are making.
855
00:48:09,850 --> 00:48:13,380
The formula looks as if the
prior mean was just another
856
00:48:13,380 --> 00:48:14,210
data point.
857
00:48:14,210 --> 00:48:17,440
So that's the way of thinking
about Bayesian estimation.
858
00:48:17,440 --> 00:48:20,270
You have your real data points,
the X's that you
859
00:48:20,270 --> 00:48:23,430
observe, you also had some
prior information.
860
00:48:23,430 --> 00:48:27,470
This plays a role similar
to a data point.
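The weighted average just described can be sketched numerically. This is a minimal sketch with made-up values for the prior mean, prior variance, and observations (none of these numbers are from the lecture):

```python
import numpy as np

# Hypothetical numbers for illustration only: prior mean mu with
# prior variance sigma0_sq, plus noisy observations x[i] whose
# noise variances are sigma_sq[i].
mu, sigma0_sq = 1.0, 4.0
x = np.array([1.2, 0.8, 1.1])
sigma_sq = np.array([1.0, 2.0, 0.5])

# The prior mean is treated as one more data point; each term is
# weighted by the inverse of its variance, and the denominator is
# the sum of all the weights.
weights = np.concatenate(([1.0 / sigma0_sq], 1.0 / sigma_sq))
values = np.concatenate(([mu], x))
theta_hat = np.sum(weights * values) / np.sum(weights)
```

With these numbers the noisiest observation (the one with variance 2.0) gets the smallest weight among the data points, and if mu and all the Xi's were set to 1 the estimate would come out to exactly 1, matching the remarks above.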
861
00:48:27,470 --> 00:48:31,580
It's interesting to note that if all
random variables are normal in
862
00:48:31,580 --> 00:48:35,230
this model, this optimal linear
estimator happens to be
863
00:48:35,230 --> 00:48:36,950
also the conditional
expectation.
864
00:48:36,950 --> 00:48:40,000
That's the nice thing about
normal random variables that
865
00:48:40,000 --> 00:48:42,770
conditional expectations
turn out to be linear.
866
00:48:42,770 --> 00:48:46,920
So the optimal estimate and the
optimal linear estimate
867
00:48:46,920 --> 00:48:48,560
turn out to be the same.
868
00:48:48,560 --> 00:48:51,050
And that gives us another
interpretation of linear
869
00:48:51,050 --> 00:48:52,100
estimation.
870
00:48:52,100 --> 00:48:54,660
Linear estimation is essentially
the same as
871
00:48:54,660 --> 00:48:58,970
pretending that all random
variables are normal.
872
00:48:58,970 --> 00:49:02,040
So that's a side point.
873
00:49:02,040 --> 00:49:04,230
Now I'd like to close
with a comment.
874
00:49:04,230 --> 00:49:08,370
875
00:49:08,370 --> 00:49:11,760
You do your measurements and
you estimate Theta on the
876
00:49:11,760 --> 00:49:17,040
basis of X. Suppose that instead
you have a measuring
877
00:49:17,040 --> 00:49:20,970
device that measures X-cubed
instead of measuring X, and
878
00:49:20,970 --> 00:49:23,350
you want to estimate Theta.
879
00:49:23,350 --> 00:49:26,760
Are you going to get a
different estimate?
880
00:49:26,760 --> 00:49:31,790
Well, X and X-cubed contain
the same information.
881
00:49:31,790 --> 00:49:34,730
Telling you X is the
same as telling you
882
00:49:34,730 --> 00:49:36,640
the value of X-cubed.
883
00:49:36,640 --> 00:49:40,660
So the posterior distribution
of Theta given X is the same
884
00:49:40,660 --> 00:49:44,160
as the posterior distribution
of Theta given X-cubed.
885
00:49:44,160 --> 00:49:47,450
And so the means of these
posterior distributions are
886
00:49:47,450 --> 00:49:49,390
going to be the same.
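This invariance can be checked by simulation. The following is a toy sketch, using my own assumptions (Theta normal, additive normal noise, nothing specified in the lecture): conditioning on X falling in an interval is the same event as X-cubed falling in the cubed interval, so the estimated posterior means of Theta agree.

```python
import numpy as np

# Toy model (my own choice): Theta ~ N(0,1), X = Theta + noise.
rng = np.random.default_rng(1)
theta = rng.normal(size=100_000)
x = theta + rng.normal(size=100_000)

# {0.9 < X < 1.1} and {0.9^3 < X^3 < 1.1^3} are the same event,
# so both masks select the same samples of Theta.
mean_given_x = theta[(x > 0.9) & (x < 1.1)].mean()
mean_given_x3 = theta[(x**3 > 0.9**3) & (x**3 < 1.1**3)].mean()
```

The two conditional means come out identical (up to floating-point rounding at the interval boundaries), illustrating that X and X-cubed carry the same information about Theta.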
887
00:49:49,390 --> 00:49:52,850
So applying transformations to
your data does not
888
00:49:52,850 --> 00:49:57,370
matter if you're doing optimal
least squares estimation.
889
00:49:57,370 --> 00:50:00,100
On the other hand, if you
restrict yourself to doing
890
00:50:00,100 --> 00:50:05,540
linear estimation then using a
linear function of X is not
891
00:50:05,540 --> 00:50:09,720
the same as using a linear
function of X-cubed.
892
00:50:09,720 --> 00:50:14,720
So this is a linear estimator,
but where the data are the
893
00:50:14,720 --> 00:50:19,250
X-cubes, and we have a linear
function of the data.
894
00:50:19,250 --> 00:50:23,690
So this means that when you're
using linear estimation you
895
00:50:23,690 --> 00:50:28,040
have some choices to make:
linear in what?
896
00:50:28,040 --> 00:50:32,290
Sometimes you want to plot your
data on an ordinary
897
00:50:32,290 --> 00:50:35,090
scale and try to plot
a line through them.
898
00:50:35,090 --> 00:50:38,360
Sometimes you plot your data
on a logarithmic scale, and
899
00:50:38,360 --> 00:50:40,480
try to plot a line
through them.
900
00:50:40,480 --> 00:50:42,390
Which scale is the
appropriate one?
901
00:50:42,390 --> 00:50:44,510
Here it would be
a cubic scale.
902
00:50:44,510 --> 00:50:46,830
And you have to think about
your particular model to
903
00:50:46,830 --> 00:50:51,180
decide which version would be
a more appropriate one.
904
00:50:51,180 --> 00:50:55,830
Finally when we have multiple
data sometimes these multiple
905
00:50:55,830 --> 00:50:59,910
data might contain the
same information.
906
00:50:59,910 --> 00:51:02,800
So X is one data point,
X-squared is another data
907
00:51:02,800 --> 00:51:05,610
point, X-cubed is another
data point.
908
00:51:05,610 --> 00:51:08,540
The three of them contain the
same information, but you can
909
00:51:08,540 --> 00:51:11,480
try to form a linear
function of them.
910
00:51:11,480 --> 00:51:14,380
And then you obtain a linear
estimator that has a more
911
00:51:14,380 --> 00:51:16,930
general form as a
function of X.
912
00:51:16,930 --> 00:51:22,130
So if you want to estimate your
Theta as a cubic function
913
00:51:22,130 --> 00:51:26,330
of X, for example, you can set
up a linear estimation model
914
00:51:26,330 --> 00:51:29,480
of this particular form and find
the optimal coefficients,
915
00:51:29,480 --> 00:51:32,900
the a's and the b's.
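One way to sketch this setup: simulate (Theta, X) pairs and fit the coefficients of a + b1·X + b2·X² + b3·X³ by least squares. The model and numbers below are my own toy assumptions, not from the lecture:

```python
import numpy as np

# Toy model (my own choice): Theta ~ N(0,1), X = Theta + small noise.
rng = np.random.default_rng(0)
theta = rng.normal(size=1000)
x = theta + 0.1 * rng.normal(size=1000)

# Treat X, X^2, X^3 as three "data points" and find the estimator
# that is linear in them: a + b1*x + b2*x**2 + b3*x**3.
A = np.column_stack([np.ones_like(x), x, x**2, x**3])
coef, *_ = np.linalg.lstsq(A, theta, rcond=None)
theta_hat = A @ coef
```

The fitted estimator is linear in the features (1, X, X², X³) but cubic as a function of X itself, which is exactly the more general form mentioned above.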
916
00:51:32,900 --> 00:51:35,700
All right, so the last slide
just gives you the big picture
917
00:51:35,700 --> 00:51:39,330
of what's happening in Bayesian
Inference, it's for
918
00:51:39,330 --> 00:51:40,330
you to ponder.
919
00:51:40,330 --> 00:51:41,930
Basically we talked about three
920
00:51:41,930 --> 00:51:43,470
possible estimation methods.
921
00:51:43,470 --> 00:51:48,300
Maximum a posteriori, mean squared
error estimation, and
922
00:51:48,300 --> 00:51:51,070
linear mean squared error
estimation, or least squares
923
00:51:51,070 --> 00:51:52,290
estimation.
924
00:51:52,290 --> 00:51:54,410
And there's a number of standard
examples that you
925
00:51:54,410 --> 00:51:57,130
will be seeing over and over in
the recitations, tutorial,
926
00:51:57,130 --> 00:52:00,950
homework, and so on, perhaps
on exams even.
927
00:52:00,950 --> 00:52:05,630
Where we take some nice priors
on some unknown parameter, we
928
00:52:05,630 --> 00:52:09,410
take some nice models for the
noise or the observations, and
929
00:52:09,410 --> 00:52:11,880
then you need to work out
posterior distributions and the
930
00:52:11,880 --> 00:52:13,570
various estimates and
compare them.
931
00:52:13,570 --> 00:52:15,040