1
00:00:00,000 --> 00:00:00,040
2
00:00:00,040 --> 00:00:02,460
The following content is
provided under a Creative
3
00:00:02,460 --> 00:00:03,870
Commons license.
4
00:00:03,870 --> 00:00:06,910
Your support will help MIT
OpenCourseWare continue to
5
00:00:06,910 --> 00:00:08,700
offer high-quality educational
6
00:00:08,700 --> 00:00:10,560
resources for free.
7
00:00:10,560 --> 00:00:13,460
To make a donation or view
additional materials from
8
00:00:13,460 --> 00:00:19,290
hundreds of MIT courses, visit
MIT OpenCourseWare at
9
00:00:19,290 --> 00:00:20,540
ocw.mit.edu.
10
00:00:20,540 --> 00:00:22,200
11
00:00:22,200 --> 00:00:24,920
PROFESSOR: So for the last three
lectures we're going to
12
00:00:24,920 --> 00:00:28,200
talk about classical statistics,
the way statistics
13
00:00:28,200 --> 00:00:32,340
can be done if you don't want to
assume a prior distribution
14
00:00:32,340 --> 00:00:34,800
on the unknown parameters.
15
00:00:34,800 --> 00:00:38,290
Today we're going to focus
mostly on the estimation side
16
00:00:38,290 --> 00:00:41,910
and leave hypothesis testing
for the next two lectures.
17
00:00:41,910 --> 00:00:46,700
So there is one generic
method that one can use to
18
00:00:46,700 --> 00:00:50,850
carry out parameter estimation,
that's the maximum
19
00:00:50,850 --> 00:00:51,850
likelihood method.
20
00:00:51,850 --> 00:00:53,990
We're going to define
what it is.
21
00:00:53,990 --> 00:00:58,200
Then we will look at the most
common estimation problem
22
00:00:58,200 --> 00:01:00,620
there is, which is to estimate
the mean of a given
23
00:01:00,620 --> 00:01:02,110
distribution.
24
00:01:02,110 --> 00:01:05,540
And we're going to talk about
confidence intervals, which
25
00:01:05,540 --> 00:01:09,130
refers to providing an
interval around your
26
00:01:09,130 --> 00:01:13,330
estimates, which has some
properties of the kind that
27
00:01:13,330 --> 00:01:17,640
the parameter is highly likely
to be inside that interval,
28
00:01:17,640 --> 00:01:20,040
but we will be careful about
how to interpret that
29
00:01:20,040 --> 00:01:22,220
particular statement.
30
00:01:22,220 --> 00:01:22,345
Okay.
31
00:01:22,345 --> 00:01:25,920
So the big framework first.
32
00:01:25,920 --> 00:01:29,120
The picture is almost the same
as the one that we had in the
33
00:01:29,120 --> 00:01:31,130
case of Bayesian statistics.
34
00:01:31,130 --> 00:01:33,570
We have some unknown
parameter.
35
00:01:33,570 --> 00:01:35,510
And we have a measuring
device.
36
00:01:35,510 --> 00:01:38,150
There is some noise,
some randomness.
37
00:01:38,150 --> 00:01:42,560
And we get an observation, X,
whose distribution depends on
38
00:01:42,560 --> 00:01:44,560
the value of the parameter.
39
00:01:44,560 --> 00:01:47,850
However, the big change from the
Bayesian setting is that
40
00:01:47,850 --> 00:01:50,840
here, this parameter
is just a number.
41
00:01:50,840 --> 00:01:53,200
It's not modeled as
a random variable.
42
00:01:53,200 --> 00:01:55,900
It does not have a probability
distribution.
43
00:01:55,900 --> 00:01:57,460
There's nothing random
about it.
44
00:01:57,460 --> 00:01:58,720
It's a constant.
45
00:01:58,720 --> 00:02:02,360
It just happens that we don't
know what that constant is.
46
00:02:02,360 --> 00:02:05,970
And in particular, this
probability distribution here,
47
00:02:05,970 --> 00:02:10,350
the distribution of X,
depends on Theta.
48
00:02:10,350 --> 00:02:13,900
But this is not a conditional
distribution in the usual
49
00:02:13,900 --> 00:02:15,450
sense of the word.
50
00:02:15,450 --> 00:02:18,480
Conditional distributions were
defined when we had two random
51
00:02:18,480 --> 00:02:21,800
variables and we condition one
random variable on the other.
52
00:02:21,800 --> 00:02:25,890
And we used the bar to separate
the X from the Theta.
53
00:02:25,890 --> 00:02:27,870
To make the point that this
is not a conditional
54
00:02:27,870 --> 00:02:29,840
distribution, we use a
different notation.
55
00:02:29,840 --> 00:02:31,730
We put a semicolon here.
56
00:02:31,730 --> 00:02:35,760
And what this is meant to say is
that X has a distribution.
57
00:02:35,760 --> 00:02:39,640
That distribution has
a certain parameter.
58
00:02:39,640 --> 00:02:42,240
And we don't know what
that parameter is.
59
00:02:42,240 --> 00:02:46,270
So for example, this might be
a normal distribution, with
60
00:02:46,270 --> 00:02:49,070
variance 1 but a mean Theta.
61
00:02:49,070 --> 00:02:50,560
We don't know what Theta is.
62
00:02:50,560 --> 00:02:52,980
And we want to estimate it.
63
00:02:52,980 --> 00:02:55,970
Now once we have this setting,
then your job is to design
64
00:02:55,970 --> 00:02:57,560
this box, the estimator.
65
00:02:57,560 --> 00:03:00,620
The estimator is some data
processing box that takes the
66
00:03:00,620 --> 00:03:03,950
measurements and produces
an estimate
67
00:03:03,950 --> 00:03:06,300
of the unknown parameter.
68
00:03:06,300 --> 00:03:11,950
Now the notation that's used
here is as if X and Theta were
69
00:03:11,950 --> 00:03:13,640
one-dimensional quantities.
70
00:03:13,640 --> 00:03:16,610
But actually, everything we
say remains valid if you
71
00:03:16,610 --> 00:03:20,090
interpret X and Theta as
vectors of parameters.
72
00:03:20,090 --> 00:03:22,180
So for example, you
may obtain several
73
00:03:22,180 --> 00:03:25,050
measurements, X1 up to Xn.
74
00:03:25,050 --> 00:03:27,980
And there may be several unknown
parameters in the
75
00:03:27,980 --> 00:03:30,260
background.
76
00:03:30,260 --> 00:03:34,200
Once more, we do not have, and
we do not want to assume, a
77
00:03:34,200 --> 00:03:35,780
prior distribution on Theta.
78
00:03:35,780 --> 00:03:37,070
It's a constant.
79
00:03:37,070 --> 00:03:39,040
And if you want to think
mathematically about this
80
00:03:39,040 --> 00:03:41,510
situation, it's as if you
have many different
81
00:03:41,510 --> 00:03:43,340
probabilistic models.
82
00:03:43,340 --> 00:03:46,360
So a normal with this mean or
a normal with that mean or a
83
00:03:46,360 --> 00:03:49,020
normal with that mean, these
are alternative candidate
84
00:03:49,020 --> 00:03:50,700
probabilistic models.
85
00:03:50,700 --> 00:03:55,080
And we want to try to make a
decision about which one is
86
00:03:55,080 --> 00:03:56,420
the correct model.
87
00:03:56,420 --> 00:03:59,480
In some cases, we have to choose
just between a small
88
00:03:59,480 --> 00:04:00,390
number of models.
89
00:04:00,390 --> 00:04:03,400
For example, you have a coin
with an unknown bias.
90
00:04:03,400 --> 00:04:06,410
The bias is either 1/2 or 3/4.
91
00:04:06,410 --> 00:04:08,650
You're going to flip the
coin a few times.
92
00:04:08,650 --> 00:04:13,150
And you try to decide whether
the true bias is this one or
93
00:04:13,150 --> 00:04:14,150
is that one.
94
00:04:14,150 --> 00:04:17,610
So in this case, we have two
specific, alternative
95
00:04:17,610 --> 00:04:20,800
probabilistic models from which
we want to distinguish.
96
00:04:20,800 --> 00:04:25,000
But sometimes things are a
little more complicated.
97
00:04:25,000 --> 00:04:26,940
For example, you have a coin.
98
00:04:26,940 --> 00:04:30,940
And you have one hypothesis
that my coin is unbiased.
99
00:04:30,940 --> 00:04:34,650
And the other hypothesis is
that my coin is biased.
100
00:04:34,650 --> 00:04:36,040
And you do your experiments.
101
00:04:36,040 --> 00:04:40,840
And you want to come up with a
decision that decides whether
102
00:04:40,840 --> 00:04:43,970
this is true or this
one is true.
103
00:04:43,970 --> 00:04:46,630
In this case, we're not
dealing with just two
104
00:04:46,630 --> 00:04:48,710
alternative probabilistic
models.
105
00:04:48,710 --> 00:04:51,540
This one is a specific
model for the coin.
106
00:04:51,540 --> 00:04:54,230
But this one actually
corresponds to lots of
107
00:04:54,230 --> 00:04:56,890
possible, alternative
coin models.
108
00:04:56,890 --> 00:05:00,420
So this includes the model where
Theta is 0.6, the model
109
00:05:00,420 --> 00:05:03,860
where Theta is 0.7, Theta
is 0.8, and so on.
110
00:05:03,860 --> 00:05:07,350
So we're trying to discriminate
between one model
111
00:05:07,350 --> 00:05:09,510
and lots of alternative
models.
112
00:05:09,510 --> 00:05:11,560
How does one go about this?
113
00:05:11,560 --> 00:05:14,750
Well, there's some systematic
ways that one can approach
114
00:05:14,750 --> 00:05:16,120
problems of this kind.
115
00:05:16,120 --> 00:05:19,850
And we will start talking
about these next time.
116
00:05:19,850 --> 00:05:22,380
So today, we're going to focus
on estimation problems.
117
00:05:22,380 --> 00:05:27,080
In estimation problems, theta is
a quantity, which is a real
118
00:05:27,080 --> 00:05:29,070
number, a continuous
parameter.
119
00:05:29,070 --> 00:05:33,730
We want to design this box, so
what we get out of this box is
120
00:05:33,730 --> 00:05:34,280
an estimate.
121
00:05:34,280 --> 00:05:37,900
Now notice that this estimate
here is a random variable.
122
00:05:37,900 --> 00:05:42,000
Even though theta is
deterministic, this is random,
123
00:05:42,000 --> 00:05:45,110
because it's a function of
the data that we observe.
124
00:05:45,110 --> 00:05:46,360
The data are random.
125
00:05:46,360 --> 00:05:49,155
We're applying a function
to the data to
126
00:05:49,155 --> 00:05:50,270
construct our estimate.
127
00:05:50,270 --> 00:05:52,850
So, since it's a function of
random variables, it's a
128
00:05:52,850 --> 00:05:54,630
random variable itself.
129
00:05:54,630 --> 00:05:57,940
The distribution of Theta hat
depends on the distribution of
130
00:05:57,940 --> 00:06:01,280
X. The distribution of X
is affected by Theta.
131
00:06:01,280 --> 00:06:03,650
So in the end, the distribution
of your estimate
132
00:06:03,650 --> 00:06:08,290
Theta hat will also be affected
by whatever Theta
133
00:06:08,290 --> 00:06:09,920
happens to be.
134
00:06:09,920 --> 00:06:12,950
Our general objective, when
designing estimators, is that
135
00:06:12,950 --> 00:06:17,390
we want to get, in the end, an
error, an estimation error,
136
00:06:17,390 --> 00:06:19,070
which is not too large.
137
00:06:19,070 --> 00:06:21,500
But we'll have to make
that specific.
138
00:06:21,500 --> 00:06:24,720
Again, what exactly do
we mean by that?
139
00:06:24,720 --> 00:06:27,170
So how do we go about
this problem?
140
00:06:27,170 --> 00:06:29,670
141
00:06:29,670 --> 00:06:40,150
One general approach is to pick
a Theta, under which the
142
00:06:40,150 --> 00:06:44,590
data that we observe, that
is, the X's, are most
143
00:06:44,590 --> 00:06:47,180
likely to have occurred.
144
00:06:47,180 --> 00:06:52,700
So I observe X. For any given
Theta, I can calculate this
145
00:06:52,700 --> 00:06:56,630
quantity, which tells me, under
this particular Theta,
146
00:06:56,630 --> 00:07:00,670
the X that you observed had this
probability of occurring.
147
00:07:00,670 --> 00:07:03,270
Under that Theta, the X that
you observe had that
148
00:07:03,270 --> 00:07:04,770
probability of occurring.
149
00:07:04,770 --> 00:07:08,580
You just choose that Theta,
which makes the data that you
150
00:07:08,580 --> 00:07:12,700
observed most likely.
151
00:07:12,700 --> 00:07:15,810
It's interesting to compare
this maximum likelihood
152
00:07:15,810 --> 00:07:19,120
estimate with the estimates that
you would have, if you
153
00:07:19,120 --> 00:07:22,050
were in a Bayesian setting,
and you were using maximum
154
00:07:22,050 --> 00:07:25,010
a posteriori probability
estimation.
155
00:07:25,010 --> 00:07:31,650
In the Bayesian setting, what
we do is, given the data, we
156
00:07:31,650 --> 00:07:34,350
use the prior distribution
on Theta.
157
00:07:34,350 --> 00:07:41,660
And we calculate the posterior
distribution of Theta given X.
158
00:07:41,660 --> 00:07:44,350
Notice that this is sort
of the opposite from
159
00:07:44,350 --> 00:07:46,040
what we have here.
160
00:07:46,040 --> 00:07:49,180
This is the probability of X
for a particular value of
161
00:07:49,180 --> 00:07:51,780
Theta, whereas this is the
probability of Theta for a
162
00:07:51,780 --> 00:07:55,380
particular X. So it's the
opposite type of conditioning.
163
00:07:55,380 --> 00:07:58,240
In the Bayesian setting, Theta
is a random variable.
164
00:07:58,240 --> 00:07:59,890
So we can talk about
the probability
165
00:07:59,890 --> 00:08:01,570
distribution of Theta.
166
00:08:01,570 --> 00:08:04,740
So how do these two compare,
except for this syntactic
167
00:08:04,740 --> 00:08:08,160
difference that the order of the
X's and Theta's is reversed?
168
00:08:08,160 --> 00:08:11,410
Let's write down, in full
detail, what this posterior
169
00:08:11,410 --> 00:08:13,280
distribution of Theta is.
170
00:08:13,280 --> 00:08:17,390
By the Bayes rule, this
conditional distribution is
171
00:08:17,390 --> 00:08:20,430
obtained from the prior, and the
model of the measurement
172
00:08:20,430 --> 00:08:21,850
process that we have.
173
00:08:21,850 --> 00:08:24,510
And we get to this expression.
174
00:08:24,510 --> 00:08:29,520
So in Bayesian estimation, we
want to find the most likely
175
00:08:29,520 --> 00:08:30,870
value of Theta.
176
00:08:30,870 --> 00:08:33,070
And we need to maximize
this quantity over
177
00:08:33,070 --> 00:08:34,539
all possible Theta's.
178
00:08:34,539 --> 00:08:38,210
First thing to notice is that
the denominator is a constant.
179
00:08:38,210 --> 00:08:40,220
It does not involve Theta.
180
00:08:40,220 --> 00:08:43,250
So when you maximize this
quantity, you don't care about
181
00:08:43,250 --> 00:08:44,520
the denominator.
182
00:08:44,520 --> 00:08:47,800
You just want to maximize
the numerator.
183
00:08:47,800 --> 00:08:52,310
Now, here, things start to look
a little more similar.
184
00:08:52,310 --> 00:08:56,530
And they would be exactly of
the same kind, if that term
185
00:08:56,530 --> 00:08:59,890
here was absent, if the
prior was absent.
186
00:08:59,890 --> 00:09:03,860
The two are going to become
the same if that prior was
187
00:09:03,860 --> 00:09:05,830
just a constant.
188
00:09:05,830 --> 00:09:10,160
So if that prior is a constant,
then maximum
189
00:09:10,160 --> 00:09:13,720
likelihood estimation takes
exactly the same form as
190
00:09:13,720 --> 00:09:17,360
Bayesian maximum posterior
probability estimation.
191
00:09:17,360 --> 00:09:21,230
So you can give this particular
interpretation of
192
00:09:21,230 --> 00:09:22,680
maximum likelihood estimation.
193
00:09:22,680 --> 00:09:27,400
Maximum likelihood estimation
is essentially what you have
194
00:09:27,400 --> 00:09:31,380
done, if you were in a Bayesian
world, and you had
195
00:09:31,380 --> 00:09:35,400
assumed a prior on the Theta's
that's uniform, all the
196
00:09:35,400 --> 00:09:37,030
Theta's being equally likely.
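As an illustrative aside (not part of the lecture), this equivalence is easy to check numerically. In the sketch below, the grid of candidate coin biases and the observed counts are made-up values:

```python
# Illustrative sketch: with a uniform prior over a finite grid of
# candidate coin biases, the MAP estimate coincides with the maximum
# likelihood estimate, because the constant prior drops out of the
# maximization.
from math import comb

def likelihood(theta, heads, flips):
    """P(data; theta) under a binomial coin model with bias theta."""
    return comb(flips, heads) * theta**heads * (1 - theta)**(flips - heads)

candidates = [i / 10 for i in range(11)]   # candidate biases 0.0 .. 1.0
heads, flips = 7, 10                       # hypothetical observed data

# Maximum likelihood: maximize P(data; theta) over the candidates.
mle = max(candidates, key=lambda t: likelihood(t, heads, flips))

# MAP with a uniform prior: the posterior is proportional to
# prior * likelihood, and a constant prior does not change the argmax.
prior = 1 / len(candidates)
map_estimate = max(candidates, key=lambda t: prior * likelihood(t, heads, flips))
```

Both maximizations pick the same candidate bias, since they differ only by the constant factor `prior`.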
197
00:09:37,030 --> 00:09:42,620
198
00:09:42,620 --> 00:09:42,725
Okay.
199
00:09:42,725 --> 00:09:45,770
So let's look at a
simple example.
200
00:09:45,770 --> 00:09:48,510
Suppose that the Xi's are
independent, identically
201
00:09:48,510 --> 00:09:50,770
distributed random
variables, with a
202
00:09:50,770 --> 00:09:52,690
certain parameter Theta.
203
00:09:52,690 --> 00:09:55,910
So the distribution of each
one of the Xi's is this
204
00:09:55,910 --> 00:09:57,950
particular term.
205
00:09:57,950 --> 00:09:59,840
So Theta is one-dimensional.
206
00:09:59,840 --> 00:10:01,280
It's a one-dimensional
parameter.
207
00:10:01,280 --> 00:10:03,180
But we have several data.
208
00:10:03,180 --> 00:10:07,020
We write down the formula
for the probability of a
209
00:10:07,020 --> 00:10:12,360
particular X vector, given a
particular value of Theta.
210
00:10:12,360 --> 00:10:14,950
But again, when I use the word
"given" here, it's not in
211
00:10:14,950 --> 00:10:16,080
the conditioning sense.
212
00:10:16,080 --> 00:10:18,770
It's the value of the
density for a
213
00:10:18,770 --> 00:10:21,710
particular choice of Theta.
214
00:10:21,710 --> 00:10:24,890
Here, I wrote down, I defined
maximum likelihood estimation
215
00:10:24,890 --> 00:10:26,190
in terms of PMFs.
216
00:10:26,190 --> 00:10:28,050
That's what you would
do if the X's were
217
00:10:28,050 --> 00:10:29,950
discrete random variables.
218
00:10:29,950 --> 00:10:32,770
Here, the X's are continuous
random variables, so instead
219
00:10:32,770 --> 00:10:36,220
of the PMF, we use
the PDF.
220
00:10:36,220 --> 00:10:39,530
So this definition here
generalizes to the case of
221
00:10:39,530 --> 00:10:40,900
continuous random variables.
222
00:10:40,900 --> 00:10:44,620
And you use f's instead of
p's, our usual recipe.
223
00:10:44,620 --> 00:10:47,560
So the maximum likelihood
estimate is defined.
224
00:10:47,560 --> 00:10:51,880
Now, since the Xi's are
independent, the joint density
225
00:10:51,880 --> 00:10:54,410
of all the X's together
is the product of
226
00:10:54,410 --> 00:10:57,680
the individual densities.
227
00:10:57,680 --> 00:10:59,170
So you look at this quantity.
228
00:10:59,170 --> 00:11:03,310
This is the density or sort of
probability of observing a
229
00:11:03,310 --> 00:11:05,340
particular sequence of X's.
230
00:11:05,340 --> 00:11:08,230
And we ask the question, what's
the value of Theta that
231
00:11:08,230 --> 00:11:10,940
makes the X's that we
observe most likely?
232
00:11:10,940 --> 00:11:13,160
So we want to carry out
this maximization.
233
00:11:13,160 --> 00:11:17,430
Now this maximization is just
a calculational problem.
234
00:11:17,430 --> 00:11:19,920
We're going to do this
maximization by taking the
235
00:11:19,920 --> 00:11:21,910
logarithm of this expression.
236
00:11:21,910 --> 00:11:23,880
Maximizing an expression
is the same as
237
00:11:23,880 --> 00:11:25,790
maximizing the logarithm.
238
00:11:25,790 --> 00:11:28,790
So the logarithm of this
expression, the logarithm of a
239
00:11:28,790 --> 00:11:31,290
product is the sum of
the logarithms.
240
00:11:31,290 --> 00:11:34,390
You get contributions from
this Theta term.
241
00:11:34,390 --> 00:11:37,660
There's n of these, so we
get an n log Theta.
242
00:11:37,660 --> 00:11:40,430
And then we have the sum of the
logarithms of these terms.
243
00:11:40,430 --> 00:11:43,060
It gives us minus Theta.
244
00:11:43,060 --> 00:11:45,020
And then the sum of the X's.
245
00:11:45,020 --> 00:11:47,060
So we need to maximize
this expression
246
00:11:47,060 --> 00:11:48,630
with respect to Theta.
247
00:11:48,630 --> 00:11:51,130
The way to do this maximization
is you take the
248
00:11:51,130 --> 00:11:53,320
derivative, with respect
to Theta.
249
00:11:53,320 --> 00:11:58,520
And you set n over Theta equal
to the sum of the X's.
250
00:11:58,520 --> 00:12:00,280
And then you solve for Theta.
251
00:12:00,280 --> 00:12:02,040
And you find that the
maximum likelihood
252
00:12:02,040 --> 00:12:04,680
estimate is this quantity.
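As a sketch of this result (illustrative code, not part of the lecture; the sample size and true rate below are arbitrary choices):

```python
# Sketch of the derivation above: maximizing
#   n*log(theta) - theta*sum(x)
# over theta gives theta_hat = n / sum(x), the reciprocal of the
# sample mean.
import random

def exponential_mle(xs):
    """Maximum likelihood estimate of the rate theta from samples xs."""
    return len(xs) / sum(xs)

random.seed(0)
true_theta = 2.0          # hypothetical true rate
samples = [random.expovariate(true_theta) for _ in range(10_000)]
theta_hat = exponential_mle(samples)   # lands near the true rate
```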
253
00:12:04,680 --> 00:12:13,160
Which sort of makes sense,
because this is the reciprocal
254
00:12:13,160 --> 00:12:16,700
of the sample mean of X's.
255
00:12:16,700 --> 00:12:19,520
Theta, for an exponential
distribution, we know,
256
00:12:19,520 --> 00:12:23,380
is 1 over the mean of
the distribution.
257
00:12:23,380 --> 00:12:26,960
So it looks like a reasonable
estimate.
258
00:12:26,960 --> 00:12:29,570
So in any case, this is the
estimates that the maximum
259
00:12:29,570 --> 00:12:33,420
likelihood estimation procedure
tells us that we
260
00:12:33,420 --> 00:12:35,780
should report.
261
00:12:35,780 --> 00:12:39,790
This formula here, of course,
tells you what to do if you
262
00:12:39,790 --> 00:12:42,640
have already observed
specific numbers.
263
00:12:42,640 --> 00:12:46,020
If you have observed specific
numbers, then you report this
264
00:12:46,020 --> 00:12:49,110
particular number as your
estimate of Theta.
265
00:12:49,110 --> 00:12:52,000
If you want to describe your
estimation procedure more
266
00:12:52,000 --> 00:12:55,900
abstractly, what you have
constructed is an estimator,
267
00:12:55,900 --> 00:12:59,690
which is a box that takes in
the random variables, capital
268
00:12:59,690 --> 00:13:05,430
X1 up to Capital Xn, and
produces out your estimate,
269
00:13:05,430 --> 00:13:07,440
which is also a random
variable.
270
00:13:07,440 --> 00:13:10,760
Because it's a function of these
random variables and is
271
00:13:10,760 --> 00:13:14,750
denoted by an upper case Theta,
to indicate that this
272
00:13:14,750 --> 00:13:17,470
is now a random variable.
273
00:13:17,470 --> 00:13:21,040
So this is an equality
about numbers.
274
00:13:21,040 --> 00:13:23,860
This is a description of the
general procedure, which is an
275
00:13:23,860 --> 00:13:25,745
equality between two
random variables.
276
00:13:25,745 --> 00:13:28,360
277
00:13:28,360 --> 00:13:31,920
And this gives you the more
abstract view of what we're
278
00:13:31,920 --> 00:13:35,040
doing here.
279
00:13:35,040 --> 00:13:35,352
All right.
280
00:13:35,352 --> 00:13:37,970
So what can we tell about
our estimate?
281
00:13:37,970 --> 00:13:40,090
Is it good or is it bad?
282
00:13:40,090 --> 00:13:42,960
So we should look at this
particular random variable and
283
00:13:42,960 --> 00:13:46,220
talk about the statistical
properties that it has.
284
00:13:46,220 --> 00:13:49,930
What we would like is this
random variable to be close to
285
00:13:49,930 --> 00:13:55,810
the true value of Theta, with
high probability, no matter
286
00:13:55,810 --> 00:13:59,470
what Theta is, since we don't
know what Theta is.
287
00:13:59,470 --> 00:14:01,400
Let's be a little
more specific about the
288
00:14:01,400 --> 00:14:05,100
properties that we want.
289
00:14:05,100 --> 00:14:08,470
So we cook up the estimator
somehow.
290
00:14:08,470 --> 00:14:11,850
So this estimator corresponds,
again, to a box that takes
291
00:14:11,850 --> 00:14:15,400
data in, the capital X's,
and produces an
292
00:14:15,400 --> 00:14:17,470
estimate Theta hat.
293
00:14:17,470 --> 00:14:18,710
This estimate is random.
294
00:14:18,710 --> 00:14:23,070
Sometimes it will be above
the true value of Theta.
295
00:14:23,070 --> 00:14:25,660
Sometimes it will be below.
296
00:14:25,660 --> 00:14:30,220
Ideally, we would like it to not
have a systematic error,
297
00:14:30,220 --> 00:14:32,810
on the positive side or
the negative side.
298
00:14:32,810 --> 00:14:37,310
So a reasonable wish to have,
for a good estimator, is that,
299
00:14:37,310 --> 00:14:41,700
on the average, it gives
you the correct value.
300
00:14:41,700 --> 00:14:45,850
Now here, let's be a little more
specific about what that
301
00:14:45,850 --> 00:14:47,740
expectation is.
302
00:14:47,740 --> 00:14:51,270
This is an expectation, with
respect to the probability
303
00:14:51,270 --> 00:14:54,240
distribution of Theta hat.
304
00:14:54,240 --> 00:14:58,780
The probability distribution
of Theta hat is affected by
305
00:14:58,780 --> 00:15:01,410
the probability distribution
of the X's.
306
00:15:01,410 --> 00:15:03,760
Because Theta hat is a
function of the X's.
307
00:15:03,760 --> 00:15:05,930
And the probability distribution
of the X's is
308
00:15:05,930 --> 00:15:09,220
affected by the true
value of Theta.
309
00:15:09,220 --> 00:15:13,710
So depending on which one is the
true value of Theta, this
310
00:15:13,710 --> 00:15:16,650
is going to be a different
expectation.
311
00:15:16,650 --> 00:15:20,830
So if you were to write this
expectation out in more
312
00:15:20,830 --> 00:15:25,840
detail, it would look
something like this.
313
00:15:25,840 --> 00:15:28,690
You need to write down
the probability
314
00:15:28,690 --> 00:15:30,260
distribution of Theta hat.
315
00:15:30,260 --> 00:15:32,890
316
00:15:32,890 --> 00:15:36,470
And this is going to
be some function.
317
00:15:36,470 --> 00:15:41,200
But this function depends on the
true Theta, is affected by
318
00:15:41,200 --> 00:15:42,800
the true Theta.
319
00:15:42,800 --> 00:15:48,300
And then you integrate this
with respect to Theta hat.
320
00:15:48,300 --> 00:15:49,430
What's the point here?
321
00:15:49,430 --> 00:15:53,280
Again, Theta hat is a
function of the X's.
322
00:15:53,280 --> 00:15:57,000
So the density of Theta
hat is affected by the
323
00:15:57,000 --> 00:15:58,400
density of the X's.
324
00:15:58,400 --> 00:16:00,730
The density of the X's
is affected by the
325
00:16:00,730 --> 00:16:02,380
true value of Theta.
326
00:16:02,380 --> 00:16:05,420
So the distribution of Theta
hat is affected by
327
00:16:05,420 --> 00:16:07,680
the value of Theta.
328
00:16:07,680 --> 00:16:10,500
Another way to put it is, as
I mentioned a few minutes
329
00:16:10,500 --> 00:16:14,550
ago, in this business, it's
as if we are considering
330
00:16:14,550 --> 00:16:17,880
different possible probabilistic
models, one
331
00:16:17,880 --> 00:16:20,470
probabilistic model for
each choice of Theta.
332
00:16:20,470 --> 00:16:22,560
And we're trying to guess
which one of these
333
00:16:22,560 --> 00:16:25,200
probabilistic models
is the true one.
334
00:16:25,200 --> 00:16:28,420
One way of emphasizing the
fact that this expression
335
00:16:28,420 --> 00:16:31,780
depends on the true Theta is
to put a little subscript
336
00:16:31,780 --> 00:16:36,790
here, expectation, under the
particular value of the
337
00:16:36,790 --> 00:16:38,300
parameter Theta.
338
00:16:38,300 --> 00:16:42,450
So depending on what value the
true parameter Theta takes,
339
00:16:42,450 --> 00:16:45,000
this expectation will have
a different value.
340
00:16:45,000 --> 00:16:49,730
And what we would like is that
no matter what the true value
341
00:16:49,730 --> 00:16:55,300
is, that our estimate will not
have a bias on the positive or
342
00:16:55,300 --> 00:16:57,140
the negative sides.
343
00:16:57,140 --> 00:17:00,150
So this is a property
that's desirable.
344
00:17:00,150 --> 00:17:02,160
Is it always going to be true?
345
00:17:02,160 --> 00:17:05,218
Not necessarily, it depends on
what estimator we construct.
346
00:17:05,218 --> 00:17:09,160
347
00:17:09,160 --> 00:17:12,400
Is it true for our exponential
example?
348
00:17:12,400 --> 00:17:14,770
Unfortunately not, the estimate
that we have in the
349
00:17:14,770 --> 00:17:18,300
exponential example turns
out to be biased.
350
00:17:18,300 --> 00:17:22,900
And one extreme way of seeing
this is to consider the case
351
00:17:22,900 --> 00:17:25,160
where our sample size is 1.
352
00:17:25,160 --> 00:17:27,020
We're trying to estimate
Theta.
353
00:17:27,020 --> 00:17:30,370
And the estimator from the
previous slide, in that case,
354
00:17:30,370 --> 00:17:33,410
is just 1/X1.
355
00:17:33,410 --> 00:17:37,990
Now X1 has a fair amount of
density in the vicinity of 0,
356
00:17:37,990 --> 00:17:41,360
which means that 1/X1 has
significant probability of
357
00:17:41,360 --> 00:17:42,810
being very large.
358
00:17:42,810 --> 00:17:46,140
And if you do the calculation,
this ultimately makes the
359
00:17:46,140 --> 00:17:49,170
expected value of 1/X1
to be infinite.
360
00:17:49,170 --> 00:17:52,870
Now infinity is definitely
not the correct value.
361
00:17:52,870 --> 00:17:56,330
So our estimate is
biased upwards.
362
00:17:56,330 --> 00:18:00,130
And it's actually biased
a lot upwards.
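This upward bias can be illustrated numerically (an illustrative sketch, not part of the lecture). For Exponential(theta) samples one can show that E[n / sum(X)] = n*theta/(n - 1) for n >= 2, so with n = 5 the estimator overshoots the true theta by a factor of 5/4:

```python
# Monte Carlo check that the exponential MLE theta_hat = n / sum(X)
# is biased upward for small sample sizes.  The known closed form is
# E[theta_hat] = n*theta/(n-1) for n >= 2, i.e. 1.25*theta when n = 5.
import random

random.seed(1)
true_theta = 1.0          # hypothetical true rate
n, trials = 5, 200_000

total = 0.0
for _ in range(trials):
    xs = [random.expovariate(true_theta) for _ in range(n)]
    total += n / sum(xs)  # the MLE from one sample of size n

avg = total / trials      # empirical average of theta_hat
# avg ends up near 5/4 = 1.25, noticeably above the true value 1.0
```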
363
00:18:00,130 --> 00:18:01,800
So that's how things are.
364
00:18:01,800 --> 00:18:06,690
Maximum likelihood estimates,
in general, will be biased.
365
00:18:06,690 --> 00:18:10,870
But under some conditions,
they will turn out to be
366
00:18:10,870 --> 00:18:12,780
asymptotically unbiased.
367
00:18:12,780 --> 00:18:16,810
That is, as you get more and
more data, as your X vector is
368
00:18:16,810 --> 00:18:21,750
longer and longer, with
independent data, the estimate
369
00:18:21,750 --> 00:18:25,010
that you're going to have, the
expected value of your
370
00:18:25,010 --> 00:18:26,860
estimator is going
to get closer and
371
00:18:26,860 --> 00:18:28,370
closer to the true value.
372
00:18:28,370 --> 00:18:31,330
So you do have some nice
asymptotic properties, but
373
00:18:31,330 --> 00:18:34,000
we're not going to prove
anything like this.
374
00:18:34,000 --> 00:18:37,680
Speaking of asymptotic
properties, in general, what
375
00:18:37,680 --> 00:18:40,950
we would like to have is that,
as you collect more and more
376
00:18:40,950 --> 00:18:46,550
data, you get the correct
answer, in some sense.
377
00:18:46,550 --> 00:18:49,360
And the sense that we're going
to use here is the limiting
378
00:18:49,360 --> 00:18:52,560
sense of convergence in
probability, since this is the
379
00:18:52,560 --> 00:18:55,270
only notion of convergence of
random variables that we have
380
00:18:55,270 --> 00:18:56,540
in our hands.
381
00:18:56,540 --> 00:18:59,600
This is similar to what
we had in the pollster
382
00:18:59,600 --> 00:19:01,180
problem, for example.
383
00:19:01,180 --> 00:19:04,900
If we had a bigger and bigger
sample size, we could be more
384
00:19:04,900 --> 00:19:08,360
and more confident that the
estimate that we obtained is
385
00:19:08,360 --> 00:19:11,970
close to the unknown true
parameter of the distribution
386
00:19:11,970 --> 00:19:13,320
that we have.
387
00:19:13,320 --> 00:19:16,420
So this is a desirable
property.
388
00:19:16,420 --> 00:19:20,720
If you have an infinitely large
amount of data, you
389
00:19:20,720 --> 00:19:25,070
should be able to estimate
an unknown parameter
390
00:19:25,070 --> 00:19:26,890
more or less exactly.
391
00:19:26,890 --> 00:19:32,280
So this is a desirable
of estimators.
392
00:19:32,280 --> 00:19:35,560
It turns out that maximum
likelihood estimation, given
393
00:19:35,560 --> 00:19:39,330
independent data, does have
this property, under mild
394
00:19:39,330 --> 00:19:40,280
conditions.
395
00:19:40,280 --> 00:19:43,100
So maximum likelihood
estimation, in this respect,
396
00:19:43,100 --> 00:19:45,180
is a good approach.
397
00:19:45,180 --> 00:19:48,520
So let's see, do we have this
consistency property in our
398
00:19:48,520 --> 00:19:50,150
exponential example?
399
00:19:50,150 --> 00:19:56,560
In our exponential example, we
used this quantity to estimate
400
00:19:56,560 --> 00:19:59,040
the unknown parameter Theta.
401
00:19:59,040 --> 00:20:01,000
What properties does
this quantity have
402
00:20:01,000 --> 00:20:03,160
as n goes to infinity?
403
00:20:03,160 --> 00:20:06,580
Well this quantity is the
reciprocal of that quantity up
404
00:20:06,580 --> 00:20:09,890
here, which is the
sample mean.
405
00:20:09,890 --> 00:20:12,950
We know from the weak law of
large numbers, that the sample
406
00:20:12,950 --> 00:20:16,350
mean converges to
the expectation.
407
00:20:16,350 --> 00:20:19,250
So this property here
comes from the weak
408
00:20:19,250 --> 00:20:21,660
law of large numbers.
409
00:20:21,660 --> 00:20:24,670
In probability, this quantity
converges to the expected
410
00:20:24,670 --> 00:20:29,830
value, which, for exponential
distributions, is 1/Theta.
411
00:20:29,830 --> 00:20:33,460
Now, if something converges to
something, then the reciprocal
412
00:20:33,460 --> 00:20:37,680
of that should converge to
the reciprocal of that.
413
00:20:37,680 --> 00:20:41,520
That's a property that's
certainly correct for numbers.
414
00:20:41,520 --> 00:20:44,000
But we're not talking about
convergence of numbers.
415
00:20:44,000 --> 00:20:46,420
We're talking about convergence
in probability,
416
00:20:46,420 --> 00:20:48,820
which is a more complicated
notion.
417
00:20:48,820 --> 00:20:52,370
Fortunately, it turns out that
the same thing is true, when
418
00:20:52,370 --> 00:20:54,660
we deal with convergence
in probability.
419
00:20:54,660 --> 00:20:58,690
One can show, although we will
not bother doing this, that
420
00:20:58,690 --> 00:21:01,840
indeed, the reciprocal of this,
which is our estimate,
421
00:21:01,840 --> 00:21:05,880
converges in probability to
the reciprocal of that.
422
00:21:05,880 --> 00:21:08,880
And that reciprocal is the
true parameter Theta.
423
00:21:08,880 --> 00:21:11,570
So for this particular
exponential example, we do
424
00:21:11,570 --> 00:21:15,250
have the desirable property,
that as the number of data
425
00:21:15,250 --> 00:21:18,230
becomes larger and larger,
the estimate that we have
426
00:21:18,230 --> 00:21:20,970
constructed will get closer
and closer to the true
427
00:21:20,970 --> 00:21:22,510
parameter value.
428
00:21:22,510 --> 00:21:27,050
And this is true no matter
what Theta is.
429
00:21:27,050 --> 00:21:30,130
No matter what the true
parameter Theta is, we're
430
00:21:30,130 --> 00:21:33,240
going to get close to it as
we collect more data.
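The convergence just described can be checked with a small simulation (a sketch, not part of the lecture; the value theta = 2 is an arbitrary choice): the maximum likelihood estimate for the exponential parameter is the reciprocal of the sample mean, and it approaches the true theta as n grows.

```python
import random

random.seed(0)
theta = 2.0  # true (unknown, in practice) exponential parameter

# The ML estimate is the reciprocal of the sample mean.
for n in (100, 10_000, 1_000_000):
    samples = [random.expovariate(theta) for _ in range(n)]
    sample_mean = sum(samples) / n   # converges to 1/theta by the WLLN
    theta_hat = 1 / sample_mean      # so this converges to theta
    print(n, theta_hat)
```

For n = 1,000,000 the estimate lands within a few thousandths of the true value.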
431
00:21:33,240 --> 00:21:35,780
432
00:21:35,780 --> 00:21:35,950
Okay.
433
00:21:35,950 --> 00:21:39,100
So these are two rough
qualitative properties that
434
00:21:39,100 --> 00:21:42,350
would be nice to have.
435
00:21:42,350 --> 00:21:47,340
If you want to get a little
more quantitative, you can
436
00:21:47,340 --> 00:21:50,210
start looking at the mean
squared error that your
437
00:21:50,210 --> 00:21:52,000
estimator gives.
438
00:21:52,000 --> 00:21:56,600
Now, once more, the comment I
was making up there applies.
439
00:21:56,600 --> 00:22:00,540
Namely, that this expectation
here is an expectation with
440
00:22:00,540 --> 00:22:04,600
respect to the probability
distribution of Theta hat that
441
00:22:04,600 --> 00:22:07,280
corresponds to a particular
value of little theta.
442
00:22:07,280 --> 00:22:09,840
So fix a little theta.
443
00:22:09,840 --> 00:22:11,910
Write down this expression.
444
00:22:11,910 --> 00:22:14,550
Look at the probability
distribution of Theta hat,
445
00:22:14,550 --> 00:22:16,380
under that little theta.
446
00:22:16,380 --> 00:22:18,220
And do this calculation.
447
00:22:18,220 --> 00:22:20,610
You're going to get some
quantity that depends on the
448
00:22:20,610 --> 00:22:21,860
little theta.
449
00:22:21,860 --> 00:22:24,200
450
00:22:24,200 --> 00:22:28,450
And so all quantities in this
equality here should be
451
00:22:28,450 --> 00:22:33,360
interpreted as quantities under
that particular value of
452
00:22:33,360 --> 00:22:34,490
little theta.
453
00:22:34,490 --> 00:22:38,640
So if you wanted to make this
more explicit, you could start
454
00:22:38,640 --> 00:22:41,870
throwing little subscripts
everywhere in those
455
00:22:41,870 --> 00:22:44,430
expressions.
456
00:22:44,430 --> 00:22:49,190
And let's see what those
expressions tell us.
457
00:22:49,190 --> 00:22:55,430
The expected value of the square of
a random variable, we know
458
00:22:55,430 --> 00:22:59,210
that it's always equal to the
variance of this random
459
00:22:59,210 --> 00:23:03,790
variable, plus the expectation
of the
460
00:23:03,790 --> 00:23:05,860
random variable squared.
461
00:23:05,860 --> 00:23:08,465
That is, the expectation of that
random variable, squared.
462
00:23:08,465 --> 00:23:12,020
463
00:23:12,020 --> 00:23:17,030
This equality here is just our
familiar formula, that the
464
00:23:17,030 --> 00:23:23,250
expected value of X squared is
the variance of X plus the
465
00:23:23,250 --> 00:23:26,350
expected value of X squared.
466
00:23:26,350 --> 00:23:30,040
So we apply this formula
to X equal to
467
00:23:30,040 --> 00:23:34,024
Theta hat minus Theta.
468
00:23:34,024 --> 00:23:37,180
469
00:23:37,180 --> 00:23:41,220
Now, remember that, in this
classical setting, theta is
470
00:23:41,220 --> 00:23:42,140
just a constant.
471
00:23:42,140 --> 00:23:43,450
We have fixed Theta.
472
00:23:43,450 --> 00:23:45,850
We want to calculate the
variance of this quantity,
473
00:23:45,850 --> 00:23:47,760
under that particular Theta.
474
00:23:47,760 --> 00:23:51,000
When you add or subtract a
constant to a random variable,
475
00:23:51,000 --> 00:23:54,070
the variance doesn't change.
476
00:23:54,070 --> 00:23:56,860
This is the same as the variance
of our estimator.
477
00:23:56,860 --> 00:24:00,300
And what we've got here is
the bias of our estimate.
478
00:24:00,300 --> 00:24:02,580
It tells us, on the average,
whether we
479
00:24:02,580 --> 00:24:04,470
fall above or below.
480
00:24:04,470 --> 00:24:06,850
And we're taking the bias
to be b squared.
481
00:24:06,850 --> 00:24:10,110
If we have an unbiased
estimator, the bias
482
00:24:10,110 --> 00:24:13,690
term will be 0.
483
00:24:13,690 --> 00:24:18,250
So ideally we want Theta hat
to be very close to Theta.
484
00:24:18,250 --> 00:24:21,840
And since Theta is a constant,
if that happens, the variance
485
00:24:21,840 --> 00:24:25,650
of Theta hat would
be very small.
486
00:24:25,650 --> 00:24:26,870
So Theta is a constant.
487
00:24:26,870 --> 00:24:30,180
If Theta hat has a distribution
that's
488
00:24:30,180 --> 00:24:33,610
concentrated just around
little theta, then Theta hat
489
00:24:33,610 --> 00:24:35,250
would have a small variance.
490
00:24:35,250 --> 00:24:37,690
So this is one desire
that we have.
491
00:24:37,690 --> 00:24:39,740
We're going to have
a small variance.
492
00:24:39,740 --> 00:24:43,710
But we also want to have a small
bias at the same time.
493
00:24:43,710 --> 00:24:47,370
So the general form of the mean
squared error has two
494
00:24:47,370 --> 00:24:48,240
contributions.
495
00:24:48,240 --> 00:24:50,530
One is the variance
of our estimator.
496
00:24:50,530 --> 00:24:52,350
The other is the bias.
497
00:24:52,350 --> 00:24:54,990
And one usually wants to design
an estimator that
498
00:24:54,990 --> 00:24:58,900
simultaneously keeps both
of these terms small.
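The decomposition just stated, mean squared error = variance + bias squared, can be verified numerically (a sketch, not from the lecture; the shrinkage factor 0.9 is an arbitrary way to manufacture a biased estimator):

```python
import random
import statistics

random.seed(1)
theta, sigma, n, trials = 5.0, 2.0, 50, 20_000

# A deliberately biased estimator: shrink the sample mean toward zero.
estimates = []
for _ in range(trials):
    xs = [random.gauss(theta, sigma) for _ in range(n)]
    estimates.append(0.9 * sum(xs) / n)

mse = sum((e - theta) ** 2 for e in estimates) / trials
var = statistics.pvariance(estimates)          # variance of the estimator
bias = sum(estimates) / trials - theta         # systematic error
print(mse, var + bias ** 2)                    # the two agree
```

With empirical quantities this is an algebraic identity, so the two printed numbers match to floating-point precision.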
499
00:24:58,900 --> 00:25:03,250
So here's an estimation method
that would do very well with
500
00:25:03,250 --> 00:25:05,080
respect to this term,
but badly with
501
00:25:05,080 --> 00:25:06,680
respect to that term.
502
00:25:06,680 --> 00:25:09,410
So suppose that my distribution
is, let's say,
503
00:25:09,410 --> 00:25:13,700
normal with an unknown mean
Theta and variance 1.
504
00:25:13,700 --> 00:25:17,580
And I use as my estimator
something very dumb.
505
00:25:17,580 --> 00:25:23,330
I always produce an estimate
that says my estimate is 100.
506
00:25:23,330 --> 00:25:26,430
So I'm just ignoring the
data and reporting 100.
507
00:25:26,430 --> 00:25:27,750
What does this do?
508
00:25:27,750 --> 00:25:30,950
The variance of my
estimator is 0.
509
00:25:30,950 --> 00:25:33,690
There's no randomness in the
estimate that I report.
510
00:25:33,690 --> 00:25:37,020
But the bias is going
to be pretty bad.
511
00:25:37,020 --> 00:25:44,180
The bias is going to be Theta
hat, which is 100 minus the
512
00:25:44,180 --> 00:25:46,770
true value of Theta.
513
00:25:46,770 --> 00:25:50,340
And for some Theta's, my bias
is going to be horrible.
514
00:25:50,340 --> 00:25:54,600
If my true Theta happens
to be 0, my bias
515
00:25:54,600 --> 00:25:56,200
squared is a huge term.
516
00:25:56,200 --> 00:25:57,810
And I get a large error.
517
00:25:57,810 --> 00:26:00,220
So what's the moral
of this example?
518
00:26:00,220 --> 00:26:03,700
There are ways of making that
variance very small, but, in
519
00:26:03,700 --> 00:26:07,360
those cases, you pay a
price in the bias.
520
00:26:07,360 --> 00:26:10,340
So you want to do something a
little more delicate, where
521
00:26:10,340 --> 00:26:14,640
you try to keep both terms
small at the same time.
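The "always report 100" example can be made concrete with a short simulation (a sketch under the lecture's setup, normal data with variance 1; the trial counts are arbitrary): the constant estimator has zero variance, but its bias term dominates whenever the true theta is far from 100.

```python
import random

random.seed(2)
n, trials = 25, 10_000

def mse(theta, estimator):
    """Empirical mean squared error of an estimator, for a fixed true theta."""
    errs = []
    for _ in range(trials):
        xs = [random.gauss(theta, 1.0) for _ in range(n)]  # N(theta, 1) data
        errs.append((estimator(xs) - theta) ** 2)
    return sum(errs) / trials

def always_100(xs):
    return 100.0                 # zero variance, possibly huge bias

def sample_mean(xs):
    return sum(xs) / len(xs)     # unbiased, variance 1/n

# Near theta = 100 the constant estimator looks unbeatable ...
print(mse(100.0, always_100), mse(100.0, sample_mean))
# ... but at theta = 0 its bias squared, (100 - 0)^2, is disastrous.
print(mse(0.0, always_100), mse(0.0, sample_mean))
```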
522
00:26:14,640 --> 00:26:16,720
So these types of considerations
become
523
00:26:16,720 --> 00:26:20,280
important when you start to try
to design sophisticated
524
00:26:20,280 --> 00:26:22,840
estimators for more complicated
problems.
525
00:26:22,840 --> 00:26:24,800
But we will not do this
in this class.
526
00:26:24,800 --> 00:26:26,720
This belongs to further
classes on
527
00:26:26,720 --> 00:26:28,750
statistics and inference.
528
00:26:28,750 --> 00:26:31,960
For this class, for parameter
estimation, we will basically
529
00:26:31,960 --> 00:26:34,400
stick to two very
simple methods.
530
00:26:34,400 --> 00:26:37,930
One is the maximum likelihood
method we've just discussed.
531
00:26:37,930 --> 00:26:41,300
And the other method is what you
would do if you were still
532
00:26:41,300 --> 00:26:44,010
in high school and didn't
know any probability.
533
00:26:44,010 --> 00:26:46,610
You get data.
534
00:26:46,610 --> 00:26:50,430
And these data come from
some distribution
535
00:26:50,430 --> 00:26:51,850
with an unknown mean.
536
00:26:51,850 --> 00:26:53,930
And you want to estimate
the unknown mean.
537
00:26:53,930 --> 00:26:54,810
What would you do?
538
00:26:54,810 --> 00:26:57,990
You would just take those data
and average them out.
539
00:26:57,990 --> 00:27:00,440
So let's make this a little
more specific.
540
00:27:00,440 --> 00:27:04,770
We have X's that come from
a given distribution.
541
00:27:04,770 --> 00:27:07,775
We know the general form of
the distribution, perhaps.
542
00:27:07,775 --> 00:27:10,570
543
00:27:10,570 --> 00:27:15,180
We do know, perhaps, the
variance of that distribution,
544
00:27:15,180 --> 00:27:17,050
or, perhaps, we don't know it.
545
00:27:17,050 --> 00:27:19,030
But we do not know the mean.
546
00:27:19,030 --> 00:27:22,700
And we want to estimate the
mean of that distribution.
547
00:27:22,700 --> 00:27:25,370
Now, we can rewrite
this situation.
548
00:27:25,370 --> 00:27:27,710
We can represent it in
a different form.
549
00:27:27,710 --> 00:27:30,120
The Xi's are equal to Theta.
550
00:27:30,120 --> 00:27:31,380
This is the mean.
551
00:27:31,380 --> 00:27:34,310
Plus a 0 mean random
variable, that you
552
00:27:34,310 --> 00:27:36,000
can think of as noise.
553
00:27:36,000 --> 00:27:39,380
So this corresponds to the usual
situation you would have
554
00:27:39,380 --> 00:27:41,950
in a lab, where you
go and try to
555
00:27:41,950 --> 00:27:43,870
measure an unknown quantity.
556
00:27:43,870 --> 00:27:45,260
You get lots of measurements.
557
00:27:45,260 --> 00:27:49,490
But each time that you measure
them, your measurements have
558
00:27:49,490 --> 00:27:51,920
some extra noise in there.
559
00:27:51,920 --> 00:27:54,510
And you want to kind of
get rid of that noise.
560
00:27:54,510 --> 00:27:57,860
The way to try to get rid of
the measurement noise is to
561
00:27:57,860 --> 00:28:01,170
collect lots of data and
average them out.
562
00:28:01,170 --> 00:28:02,930
This is the sample mean.
563
00:28:02,930 --> 00:28:07,380
And this is a very, very
reasonable way of trying to
564
00:28:07,380 --> 00:28:10,130
estimate the unknown
mean of the X's.
565
00:28:10,130 --> 00:28:12,700
So this is the sample mean.
566
00:28:12,700 --> 00:28:17,840
It's a reasonable, plausible,
in general, pretty good
567
00:28:17,840 --> 00:28:22,390
estimator of the unknown mean
of a certain distribution.
568
00:28:22,390 --> 00:28:26,910
We can apply this estimator
without really knowing a lot
569
00:28:26,910 --> 00:28:28,810
about the distribution
of the X's.
570
00:28:28,810 --> 00:28:31,010
Actually, we don't need to
know anything about the
571
00:28:31,010 --> 00:28:32,320
distribution.
572
00:28:32,320 --> 00:28:35,840
We can still apply it, because
the variance, for example,
573
00:28:35,840 --> 00:28:37,130
does not show up here.
574
00:28:37,130 --> 00:28:38,660
We don't need to know
the variance to
575
00:28:38,660 --> 00:28:40,520
calculate that quantity.
576
00:28:40,520 --> 00:28:43,520
Does this estimator have
good properties?
577
00:28:43,520 --> 00:28:45,110
Yes, it does.
578
00:28:45,110 --> 00:28:48,110
What's the expected value
of the sample mean?
579
00:28:48,110 --> 00:28:51,910
The expectation of this is
the expectation of this
580
00:28:51,910 --> 00:28:53,600
sum divided by n.
581
00:28:53,600 --> 00:28:56,410
The expected value for each
one of the X's is Theta.
582
00:28:56,410 --> 00:28:58,290
So the expected value
of the sample mean
583
00:28:58,290 --> 00:29:00,010
is just Theta itself.
584
00:29:00,010 --> 00:29:03,310
So our estimator is unbiased.
585
00:29:03,310 --> 00:29:06,410
No matter what Theta is, our
estimator does not have a
586
00:29:06,410 --> 00:29:11,130
systematic error in
either direction.
587
00:29:11,130 --> 00:29:13,870
Furthermore, the weak law of
large numbers tells us that
588
00:29:13,870 --> 00:29:18,140
this quantity converges to the
true parameter in probability.
589
00:29:18,140 --> 00:29:20,700
So it's a consistent
estimator.
590
00:29:20,700 --> 00:29:21,920
This is good.
591
00:29:21,920 --> 00:29:26,740
And if you want to calculate
the mean squared error
592
00:29:26,740 --> 00:29:28,780
corresponding to
this estimator.
593
00:29:28,780 --> 00:29:31,550
Remember how we defined the
mean squared error?
594
00:29:31,550 --> 00:29:35,300
It's this quantity.
595
00:29:35,300 --> 00:29:38,680
Then it's a calculation that we
have done a fair number of
596
00:29:38,680 --> 00:29:40,080
times by now.
597
00:29:40,080 --> 00:29:43,640
The mean squared error is the
variance of the distribution
598
00:29:43,640 --> 00:29:46,000
of the X's divided by n.
599
00:29:46,000 --> 00:29:49,370
So as we get more and more data,
the mean squared error
600
00:29:49,370 --> 00:29:52,170
goes down to 0.
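That the mean squared error of the sample mean equals sigma squared over n can be seen in a small simulation (a sketch, not from the lecture; theta, sigma, and the sample sizes are arbitrary choices):

```python
import random

random.seed(3)
theta, sigma, trials = 2.37, 1.5, 40_000

for n in (10, 40):
    est = []
    for _ in range(trials):
        xs = [random.gauss(theta, sigma) for _ in range(n)]
        est.append(sum(xs) / n)                 # the sample mean
    mse = sum((e - theta) ** 2 for e in est) / trials
    print(n, mse, sigma ** 2 / n)               # empirical MSE vs. sigma^2/n
```

Quadrupling n cuts the mean squared error by a factor of four, as the formula predicts.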
601
00:29:52,170 --> 00:29:56,420
In some examples, it turns out
that the sample mean is also
602
00:29:56,420 --> 00:29:58,930
the same as the maximum
likelihood estimate.
603
00:29:58,930 --> 00:30:02,790
For example, if the X's are
coming from a normal
604
00:30:02,790 --> 00:30:07,700
distribution, you can write down
the likelihood, do the
605
00:30:07,700 --> 00:30:10,240
maximization with respect to
Theta, you'll find that the
606
00:30:10,240 --> 00:30:15,190
maximum likelihood estimate is
the same as the sample mean.
607
00:30:15,190 --> 00:30:18,730
In other cases, the sample mean
will be different from
608
00:30:18,730 --> 00:30:20,850
the maximum likelihood.
609
00:30:20,850 --> 00:30:23,990
And then you have a choice
about which one of the
610
00:30:23,990 --> 00:30:24,860
two you would use.
611
00:30:24,860 --> 00:30:27,890
Probably, in most reasonable
situations, you would just use
612
00:30:27,890 --> 00:30:31,460
the sample mean, because it's
simple, easy to compute, and
613
00:30:31,460 --> 00:30:33,830
has nice properties.
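For the normal case just mentioned, the maximization can be carried out in closed form. A sketch of that calculation (not done in the lecture), assuming X_1, ..., X_n are i.i.d. normal with mean theta and known variance sigma squared:

```latex
\log f_X(x;\theta) = -\frac{n}{2}\log\!\left(2\pi\sigma^2\right)
  - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\theta)^2,
\qquad
\frac{d}{d\theta}\log f_X(x;\theta)
  = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-\theta) = 0
\;\;\Longrightarrow\;\;
\hat\theta_{\mathrm{ML}} = \frac{1}{n}\sum_{i=1}^{n}x_i .
```

So maximizing the likelihood recovers exactly the sample mean.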
614
00:30:33,830 --> 00:30:33,936
All right.
615
00:30:33,936 --> 00:30:35,110
So you go to your boss.
616
00:30:35,110 --> 00:30:38,120
And you report and say,
OK, I did all my
617
00:30:38,120 --> 00:30:39,910
experiments in the lab.
618
00:30:39,910 --> 00:30:49,820
And the average value that I got
is a certain number, 2.37.
619
00:30:49,820 --> 00:30:52,490
So is that informative
to your boss?
620
00:30:52,490 --> 00:30:55,470
Well your boss would like to
know how much they can trust
621
00:30:55,470 --> 00:30:58,280
this number, 2.37.
622
00:30:58,280 --> 00:31:00,630
Well, I know that the true
value is not going to be
623
00:31:00,630 --> 00:31:02,270
exactly that.
624
00:31:02,270 --> 00:31:07,410
But how close should it be?
625
00:31:07,410 --> 00:31:09,820
So give me a range of
what you think are
626
00:31:09,820 --> 00:31:12,080
possible values of Theta.
627
00:31:12,080 --> 00:31:16,220
So the situation is like this.
628
00:31:16,220 --> 00:31:20,370
So suppose that we observe X's
that are coming from a certain
629
00:31:20,370 --> 00:31:22,070
distribution.
630
00:31:22,070 --> 00:31:24,230
And we're trying to
estimate the mean.
631
00:31:24,230 --> 00:31:25,480
We get our data.
632
00:31:25,480 --> 00:31:27,880
633
00:31:27,880 --> 00:31:32,090
Maybe our data looks something
like this.
634
00:31:32,090 --> 00:31:34,090
You calculate the mean.
635
00:31:34,090 --> 00:31:36,140
You find the sample mean.
636
00:31:36,140 --> 00:31:40,120
So let's suppose that the sample
mean is a number, for
637
00:31:40,120 --> 00:31:45,570
some reason we take to be 2.37.
638
00:31:45,570 --> 00:31:48,300
But you want to convey something
to your boss about
639
00:31:48,300 --> 00:31:51,450
how spread out these
data were.
640
00:31:51,450 --> 00:31:56,690
So the boss asks you to give
him or her some kind of
641
00:31:56,690 --> 00:32:05,340
interval in which Theta, the
true parameter, might lie.
642
00:32:05,340 --> 00:32:07,540
So the boss asked you
for an interval.
643
00:32:07,540 --> 00:32:11,740
So what you do is you end up
reporting an interval.
644
00:32:11,740 --> 00:32:14,990
And you somehow use the data
that you have seen to
645
00:32:14,990 --> 00:32:17,580
construct this interval.
646
00:32:17,580 --> 00:32:19,900
And you report to your
boss also the
647
00:32:19,900 --> 00:32:21,420
endpoints of this interval.
648
00:32:21,420 --> 00:32:24,020
Let's give names to
these endpoints,
649
00:32:24,020 --> 00:32:27,710
Theta_n- and Theta_n+.
650
00:32:27,710 --> 00:32:31,000
The n's here just play the role
of keeping track of how
651
00:32:31,000 --> 00:32:33,000
many data we're using.
652
00:32:33,000 --> 00:32:39,320
So what you report to your boss
is this interval as well.
653
00:32:39,320 --> 00:32:42,340
Are these Theta's here, the
endpoints of the interval,
654
00:32:42,340 --> 00:32:44,220
lowercase or uppercase?
655
00:32:44,220 --> 00:32:45,750
What should they be?
656
00:32:45,750 --> 00:32:48,180
Well you construct these
intervals after
657
00:32:48,180 --> 00:32:49,430
you see your data.
658
00:32:49,430 --> 00:32:53,830
You take the data into account
to construct your interval.
659
00:32:53,830 --> 00:32:57,020
So these definitely should
depend on the data.
660
00:32:57,020 --> 00:32:59,460
And therefore they are
random variables.
661
00:32:59,460 --> 00:33:03,240
Same thing with your estimator,
in general, it's
662
00:33:03,240 --> 00:33:05,120
going to be a random variable.
663
00:33:05,120 --> 00:33:07,930
Although, when you go and report
numbers to your boss,
664
00:33:07,930 --> 00:33:10,580
you give the specific
realizations of the random
665
00:33:10,580 --> 00:33:15,450
variables, given the
data that you got.
666
00:33:15,450 --> 00:33:21,500
So instead of having
just a single box
667
00:33:21,500 --> 00:33:25,050
that produces estimates.
668
00:33:25,050 --> 00:33:29,540
So our previous picture was that
you have your estimator
669
00:33:29,540 --> 00:33:34,130
that takes X's and produces
Theta hats.
670
00:33:34,130 --> 00:33:40,960
Now our box will also be
producing Theta hats minus and
671
00:33:40,960 --> 00:33:42,570
Theta hats plus.
672
00:33:42,570 --> 00:33:45,180
It's going to produce
an interval as well.
673
00:33:45,180 --> 00:33:48,670
The X's are random, therefore
these quantities are random.
674
00:33:48,670 --> 00:33:52,340
Once you go and do the
experiment and obtain your
675
00:33:52,340 --> 00:33:55,930
data, then your data
will be some
676
00:33:55,930 --> 00:33:58,810
lowercase x, specific numbers.
677
00:33:58,810 --> 00:34:00,950
And then your estimates
and estimator
678
00:34:00,950 --> 00:34:05,110
become also lower case.
679
00:34:05,110 --> 00:34:08,010
What would we like this
interval to do?
680
00:34:08,010 --> 00:34:11,760
We would like it to be highly
likely to contain the true
681
00:34:11,760 --> 00:34:13,810
value of the parameter.
682
00:34:13,810 --> 00:34:17,800
So we might impose some specs
of the following kind.
683
00:34:17,800 --> 00:34:19,170
I pick a number, alpha.
684
00:34:19,170 --> 00:34:21,170
Usually that alpha,
think of it as a
685
00:34:21,170 --> 00:34:23,050
probability of a large error.
686
00:34:23,050 --> 00:34:27,449
Typical value of alpha might
be 0.05, in which case this
687
00:34:27,449 --> 00:34:30,360
number here is 0.95.
688
00:34:30,360 --> 00:34:33,989
And you're given specs that
say something like this.
689
00:34:33,989 --> 00:34:41,110
I would like, with probability
at least 0.95, this to happen,
690
00:34:41,110 --> 00:34:44,739
which says that the true
parameter lies inside the
691
00:34:44,739 --> 00:34:47,100
confidence interval.
692
00:34:47,100 --> 00:34:50,840
Now let's try to interpret
this statement.
693
00:34:50,840 --> 00:34:53,560
Suppose that you did the
experiment, and that you ended
694
00:34:53,560 --> 00:34:56,230
up reporting to your boss
a confidence interval
695
00:34:56,230 --> 00:35:01,520
from 1.97 to 2.56.
696
00:35:01,520 --> 00:35:03,170
That's what you report
to your boss.
697
00:35:03,170 --> 00:35:06,790
698
00:35:06,790 --> 00:35:08,300
And suppose that the confidence
699
00:35:08,300 --> 00:35:10,280
interval has this property.
700
00:35:10,280 --> 00:35:16,400
Can you go to your boss and say,
with probability 95%, the
701
00:35:16,400 --> 00:35:20,090
true value of Theta is between
these two numbers?
702
00:35:20,090 --> 00:35:22,630
Is that a meaningful
statement?
703
00:35:22,630 --> 00:35:26,100
So the statement is, the
tentative statement is, with
704
00:35:26,100 --> 00:35:30,200
probability 95%, the true
value of Theta is
705
00:35:30,200 --> 00:35:34,930
between 1.97 and 2.56.
706
00:35:34,930 --> 00:35:38,910
Well, what is random
in that statement?
707
00:35:38,910 --> 00:35:40,460
There's nothing random.
708
00:35:40,460 --> 00:35:43,070
The true value of theta
is a constant.
709
00:35:43,070 --> 00:35:44,720
1.97 is a number.
710
00:35:44,720 --> 00:35:46,740
2.56 is a number.
711
00:35:46,740 --> 00:35:52,960
So it doesn't make any sense to
talk about the probability
712
00:35:52,960 --> 00:35:54,920
that theta is in
this interval.
713
00:35:54,920 --> 00:35:57,540
Either theta happens to be
in that interval, or it
714
00:35:57,540 --> 00:35:58,760
happens to not be.
715
00:35:58,760 --> 00:36:01,560
But there are no probabilities
associated with this.
716
00:36:01,560 --> 00:36:04,700
Because theta is not random.
717
00:36:04,700 --> 00:36:06,690
Syntactically, you
can see this.
718
00:36:06,690 --> 00:36:09,210
Because theta here
is a lower case.
719
00:36:09,210 --> 00:36:11,930
So what kind of probabilities
are we talking about here?
720
00:36:11,930 --> 00:36:13,460
Where's the randomness?
721
00:36:13,460 --> 00:36:15,880
Well the random thing
is the interval.
722
00:36:15,880 --> 00:36:17,560
It's not theta.
723
00:36:17,560 --> 00:36:21,090
So the statement that is being
made here is that the
724
00:36:21,090 --> 00:36:24,290
interval, that's being
constructed by our procedure,
725
00:36:24,290 --> 00:36:28,410
should have the property that,
with probability 95%, it's
726
00:36:28,410 --> 00:36:33,280
going to fall on top of the
true value of theta.
727
00:36:33,280 --> 00:36:37,680
So the right way of interpreting
what the 95%
728
00:36:37,680 --> 00:36:42,270
confidence interval is, is
something like the following.
729
00:36:42,270 --> 00:36:45,390
We have the true value of theta
that we don't know.
730
00:36:45,390 --> 00:36:46,750
I get data.
731
00:36:46,750 --> 00:36:50,150
Based on the data, I construct
a confidence interval.
732
00:36:50,150 --> 00:36:51,950
I get my confidence interval.
733
00:36:51,950 --> 00:36:52,790
I got lucky.
734
00:36:52,790 --> 00:36:54,850
And the true value of
theta is in here.
735
00:36:54,850 --> 00:36:57,790
Next day, I do the same
experiment, take my data,
736
00:36:57,790 --> 00:37:00,500
construct a confidence
interval.
737
00:37:00,500 --> 00:37:04,040
And I get this confidence
interval, lucky once more.
738
00:37:04,040 --> 00:37:06,320
Next day I get data.
739
00:37:06,320 --> 00:37:09,620
I use my data to come up with
an estimate of theta and the
740
00:37:09,620 --> 00:37:10,660
confidence interval.
741
00:37:10,660 --> 00:37:12,340
That day, I was unlucky.
742
00:37:12,340 --> 00:37:15,000
And I got a confidence
interval out there.
743
00:37:15,000 --> 00:37:20,890
What the requirement here is, is
that 95% of the days, where
744
00:37:20,890 --> 00:37:25,270
we use this certain procedure
for constructing confidence
745
00:37:25,270 --> 00:37:29,180
intervals, 95% of those days,
we will be lucky.
746
00:37:29,180 --> 00:37:33,750
And we will capture the correct
value of theta by your
747
00:37:33,750 --> 00:37:35,160
confidence interval.
748
00:37:35,160 --> 00:37:39,390
So it's a statement about the
distribution of these random
749
00:37:39,390 --> 00:37:42,820
confidence intervals, how likely
are they to fall on top
750
00:37:42,820 --> 00:37:45,210
of the true theta, as opposed
to how likely
751
00:37:45,210 --> 00:37:47,060
they are to fall outside.
752
00:37:47,060 --> 00:37:50,770
So it's a statement about
probabilities associated with
753
00:37:50,770 --> 00:37:52,380
a confidence interval.
754
00:37:52,380 --> 00:37:55,080
They're not probabilities about
theta, because theta,
755
00:37:55,080 --> 00:37:58,370
itself, is not random.
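The "95% of the days we are lucky" interpretation can be simulated directly (a sketch, not from the lecture; theta, sigma, and the number of repetitions are arbitrary, and sigma is assumed known here): repeat the experiment many times, build the interval each time, and count how often it falls on top of the true theta.

```python
import random
from statistics import NormalDist

random.seed(4)
theta, sigma, n, days = 2.0, 1.0, 100, 2_000
z = NormalDist().inv_cdf(0.975)   # the familiar 1.96

lucky = 0
for _ in range(days):
    xs = [random.gauss(theta, sigma) for _ in range(n)]
    theta_hat = sum(xs) / n
    half = z * sigma / n ** 0.5   # half-width of the interval
    if theta_hat - half <= theta <= theta_hat + half:
        lucky += 1                # the interval captured the true theta

print(lucky / days)               # close to 0.95
```

The fraction of lucky days hovers around 0.95; the probability statement is about the random interval, not about theta.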
756
00:37:58,370 --> 00:38:02,080
So this is what the confidence
interval is, in general, and
757
00:38:02,080 --> 00:38:03,470
how we interpret it.
758
00:38:03,470 --> 00:38:07,470
How do we construct a 95%
confidence interval?
759
00:38:07,470 --> 00:38:09,320
Let's go through this
exercise, in
760
00:38:09,320 --> 00:38:10,980
a particular example.
761
00:38:10,980 --> 00:38:13,970
The calculations are exactly the
same as the ones that you
762
00:38:13,970 --> 00:38:17,770
did when we talked about laws
of large numbers and the
763
00:38:17,770 --> 00:38:19,240
central limit theorem.
764
00:38:19,240 --> 00:38:22,600
So there's nothing new
calculationally but it's,
765
00:38:22,600 --> 00:38:25,440
perhaps, new in terms of the
language that we use and the
766
00:38:25,440 --> 00:38:26,800
interpretation.
767
00:38:26,800 --> 00:38:30,890
So we got our sample mean
from some distribution.
768
00:38:30,890 --> 00:38:34,650
And we would like to calculate
a 95% confidence interval.
769
00:38:34,650 --> 00:38:39,590
770
00:38:39,590 --> 00:38:42,650
We know from the normal tables,
that the standard
771
00:38:42,650 --> 00:38:54,011
normal has 2.5% on the tail,
that's after 1.96.
772
00:38:54,011 --> 00:38:58,060
Yes, by this time,
the number 1.96
773
00:38:58,060 --> 00:39:00,600
should be pretty familiar.
774
00:39:00,600 --> 00:39:05,880
So if this probability
here is 2.5%, this
775
00:39:05,880 --> 00:39:09,510
number here is 1.96.
776
00:39:09,510 --> 00:39:12,310
Now look at this random
variable here.
777
00:39:12,310 --> 00:39:15,000
This is the sample mean
778
00:39:15,000 --> 00:39:17,950
minus the true mean,
normalized by the usual
779
00:39:17,950 --> 00:39:18,940
normalizing factor.
780
00:39:18,940 --> 00:39:22,090
By the central limit theorem,
this is approximately normal.
781
00:39:22,090 --> 00:39:26,790
So it has probability 0.95
of being less than 1.96.
782
00:39:26,790 --> 00:39:31,050
Now take this event here
and rewrite it.
783
00:39:31,050 --> 00:39:36,240
This is the event, well, that
Theta hat minus theta is
784
00:39:36,240 --> 00:39:40,350
bigger than this number and
smaller than that number.
785
00:39:40,350 --> 00:39:45,650
This event here is equivalent
to that event here.
786
00:39:45,650 --> 00:39:50,670
And so this suggests a way of
constructing our 95%
787
00:39:50,670 --> 00:39:52,130
confidence interval.
788
00:39:52,130 --> 00:39:56,330
I'm going to report the
interval, which gives this as
789
00:39:56,330 --> 00:40:00,350
the lower end of the confidence
interval, and gives
790
00:40:00,350 --> 00:40:05,720
this as the upper end of
the confidence interval.
791
00:40:05,720 --> 00:40:09,180
In other words, at the end of
the experiment, we report the
792
00:40:09,180 --> 00:40:12,170
sample mean, which
is our estimate.
793
00:40:12,170 --> 00:40:14,230
And we report also, an interval
794
00:40:14,230 --> 00:40:16,080
around the sample mean.
795
00:40:16,080 --> 00:40:20,510
And this is our 95% confidence
interval.
796
00:40:20,510 --> 00:40:22,800
The confidence interval becomes
797
00:40:22,800 --> 00:40:26,050
smaller, when n is larger.
798
00:40:26,050 --> 00:40:28,950
In some sense, we're more
certain that we're doing a
799
00:40:28,950 --> 00:40:32,390
good estimation job, so we can
have a small interval and
800
00:40:32,390 --> 00:40:36,000
still be quite confident that
our interval captures the true
801
00:40:36,000 --> 00:40:37,520
value of the parameter.
802
00:40:37,520 --> 00:40:41,890
Also, if our data have very
little noise, when you have
803
00:40:41,890 --> 00:40:45,060
more accurate measurements,
you're more confident that
804
00:40:45,060 --> 00:40:47,220
your estimate is pretty good.
805
00:40:47,220 --> 00:40:51,120
And that results in a smaller
confidence interval, smaller
806
00:40:51,120 --> 00:40:52,610
length of the confidence
interval.
807
00:40:52,610 --> 00:40:56,040
And still you have 95%
probability of capturing the
808
00:40:56,040 --> 00:40:57,650
true value of theta.
809
00:40:57,650 --> 00:41:01,660
So we did this exercise by
taking 95% confidence
810
00:41:01,660 --> 00:41:04,010
intervals and the corresponding
value from the
811
00:41:04,010 --> 00:41:06,670
normal tables, which is 1.96.
812
00:41:06,670 --> 00:41:11,390
Of course, you can do it more
generally, if you set your
813
00:41:11,390 --> 00:41:13,730
alpha to be some other number.
814
00:41:13,730 --> 00:41:16,590
Again, you look at the
normal tables.
815
00:41:16,590 --> 00:41:20,460
And you find the value here,
so that the tail has
816
00:41:20,460 --> 00:41:22,640
probability alpha over 2.
817
00:41:22,640 --> 00:41:26,790
And instead of using these 1.96,
you use whatever number
818
00:41:26,790 --> 00:41:31,380
you get from the
normal tables.
819
00:41:31,380 --> 00:41:33,520
And this tells you
how to construct
820
00:41:33,520 --> 00:41:36,680
a confidence interval.
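Looking up the table value for a general alpha can be sketched in one line (not part of the lecture; `statistics.NormalDist` plays the role of the normal tables):

```python
from statistics import NormalDist

def z_value(alpha):
    # z such that the standard normal tail beyond z has probability alpha/2
    return NormalDist().inv_cdf(1 - alpha / 2)

print(round(z_value(0.05), 2))   # 1.96, as in the lecture
print(round(z_value(0.01), 2))   # 2.58, for a 99% confidence interval
```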
821
00:41:36,680 --> 00:41:42,060
Well, to be exact, this
is not necessarily a
822
00:41:42,060 --> 00:41:44,640
95% confidence interval.
823
00:41:44,640 --> 00:41:47,540
It's approximately a 95%
confidence interval.
824
00:41:47,540 --> 00:41:48,950
Why is this?
825
00:41:48,950 --> 00:41:51,060
Because we've done
an approximation.
826
00:41:51,060 --> 00:41:53,890
We have used the central
limit theorem.
827
00:41:53,890 --> 00:41:59,990
So it might turn out to be a
95.5% confidence interval
828
00:41:59,990 --> 00:42:03,220
instead of 95%, because
our calculations are
829
00:42:03,220 --> 00:42:04,740
not entirely accurate.
830
00:42:04,740 --> 00:42:08,230
But for reasonable values of
n, using the central limit
831
00:42:08,230 --> 00:42:10,190
theorem is a good
approximation.
832
00:42:10,190 --> 00:42:13,330
And that's what people
almost always do.
833
00:42:13,330 --> 00:42:17,350
So just take the value from
the normal tables.
834
00:42:17,350 --> 00:42:18,600
Okay, except for one catch.
835
00:42:18,600 --> 00:42:22,830
836
00:42:22,830 --> 00:42:24,590
I used the data.
837
00:42:24,590 --> 00:42:26,440
I obtained my estimate.
838
00:42:26,440 --> 00:42:29,830
And I want to go to my boss and
report this theta minus
839
00:42:29,830 --> 00:42:33,010
and theta hat, which is the
confidence interval.
840
00:42:33,010 --> 00:42:35,720
What's the difficulty?
841
00:42:35,720 --> 00:42:37,540
I know what n is.
842
00:42:37,540 --> 00:42:40,790
But I don't know what sigma
is, in general.
843
00:42:40,790 --> 00:42:44,750
So if I don't know sigma,
what am I going to do?
844
00:42:44,750 --> 00:42:48,980
Here, there's a few options
for what you can do.
845
00:42:48,980 --> 00:42:52,910
And the first option is familiar
from what we did when
846
00:42:52,910 --> 00:42:55,020
we talked about the
pollster problem.
847
00:42:55,020 --> 00:42:58,480
We don't know what sigma is,
but maybe we have an upper
848
00:42:58,480 --> 00:43:00,030
bound on sigma.
849
00:43:00,030 --> 00:43:03,540
For example, if the Xi's are
Bernoulli random variables, we
850
00:43:03,540 --> 00:43:06,910
have seen that the standard
deviation is at most 1/2.
851
00:43:06,910 --> 00:43:10,220
So use the most conservative
value for sigma.
852
00:43:10,220 --> 00:43:13,520
Using the most conservative
value means that you take
853
00:43:13,520 --> 00:43:17,890
bigger confidence intervals
than necessary.
854
00:43:17,890 --> 00:43:20,780
So that's one option.
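The conservative option can be sketched as follows (not from the lecture; the poll counts are made-up numbers): for Bernoulli data, use the worst-case standard deviation of 1/2 in the interval formula.

```python
def conservative_ci(xs):
    """95% interval for a Bernoulli mean using the bound sigma <= 1/2."""
    n = len(xs)
    theta_hat = sum(xs) / n
    half = 1.96 * 0.5 / n ** 0.5   # most conservative value for sigma
    return theta_hat - half, theta_hat + half

# e.g. 540 yes-answers out of 1000 people polled
lo, hi = conservative_ci([1] * 540 + [0] * 460)
print(lo, hi)   # roughly 0.509 to 0.571
```

The price of the bound is an interval somewhat wider than necessary whenever the true theta is far from 1/2.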
855
00:43:20,780 --> 00:43:25,480
Another option is to try to
estimate sigma from the data.
856
00:43:25,480 --> 00:43:27,630
How do you do this estimation?
857
00:43:27,630 --> 00:43:31,140
In special cases, for special
types of distributions, you
858
00:43:31,140 --> 00:43:34,180
can think of heuristic ways
of doing this estimation.
859
00:43:34,180 --> 00:43:38,390
For example, in the case of
Bernoulli random variables, we
860
00:43:38,390 --> 00:43:42,420
know that the true value of
sigma, the standard deviation
861
00:43:42,420 --> 00:43:45,120
of a Bernoulli random variable,
is the square root
862
00:43:45,120 --> 00:43:47,670
of theta times (1 minus theta),
where theta is
863
00:43:47,670 --> 00:43:50,290
the mean of the Bernoulli.
864
00:43:50,290 --> 00:43:51,900
Try to use this formula.
865
00:43:51,900 --> 00:43:54,140
But theta is the thing we're
trying to estimate in the
866
00:43:54,140 --> 00:43:54,760
first place.
867
00:43:54,760 --> 00:43:55,880
We don't know it.
868
00:43:55,880 --> 00:43:57,150
What do we do?
869
00:43:57,150 --> 00:44:00,850
Well, we have an estimate for
theta, the estimate produced
870
00:44:00,850 --> 00:44:04,195
by our estimation procedure,
the sample mean.
871
00:44:04,195 --> 00:44:05,670
So I obtain my data.
872
00:44:05,670 --> 00:44:06,540
I get my data.
873
00:44:06,540 --> 00:44:09,030
I produce the estimate
theta hat.
874
00:44:09,030 --> 00:44:10,740
It's an estimate of the mean.
875
00:44:10,740 --> 00:44:14,770
Use that estimate in this
formula to come up with an
876
00:44:14,770 --> 00:44:17,290
estimate of my standard
deviation.
877
00:44:17,290 --> 00:44:20,210
And then use that standard
deviation, in the construction
878
00:44:20,210 --> 00:44:22,510
of the confidence interval,
pretending
879
00:44:22,510 --> 00:44:24,180
that this is correct.
880
00:44:24,180 --> 00:44:29,050
Well, if the number of data points is
large, then we know, from
881
00:44:29,050 --> 00:44:31,870
the law of large numbers, that
theta hat is a pretty good
882
00:44:31,870 --> 00:44:33,130
estimate of theta.
883
00:44:33,130 --> 00:44:36,670
So sigma hat is going to be a
pretty good estimate of sigma.
884
00:44:36,670 --> 00:44:42,380
So we're not making large errors
by using this approach.
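This second option, plugging the estimate theta hat into the Bernoulli formula for sigma, can be sketched in the same style. Again an illustrative Python sketch with a made-up function name:

```python
import math

# Option 2 (Bernoulli case): plug theta_hat into the formula
# sigma = sqrt(theta * (1 - theta)) to get sigma_hat, then use
# sigma_hat in the interval, pretending it is the true sigma.
def plugin_bernoulli_ci(theta_hat, n, z=1.96):
    sigma_hat = math.sqrt(theta_hat * (1 - theta_hat))
    half_width = z * sigma_hat / math.sqrt(n)
    return theta_hat - half_width, theta_hat + half_width

lo, hi = plugin_bernoulli_ci(theta_hat=0.3, n=400)
# sigma_hat = sqrt(0.21) ~ 0.458, which is below the
# conservative value 1/2, so this interval is narrower
```

For a large number of data points, the law of large numbers says theta hat is close to theta, so sigma hat is close to sigma and the resulting interval is close to the one you would build with the true sigma.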
885
00:44:42,380 --> 00:44:47,980
So in this scenario here, things
were simple, because we
886
00:44:47,980 --> 00:44:49,890
had an analytical formula.
887
00:44:49,890 --> 00:44:52,210
Sigma was determined by theta.
888
00:44:52,210 --> 00:44:54,420
So we could come up
with a quick and
889
00:44:54,420 --> 00:44:57,340
dirty estimate of sigma.
890
00:44:57,340 --> 00:45:00,940
In general, if you do not have
any nice formulas of this
891
00:45:00,940 --> 00:45:03,000
kind, what could you do?
892
00:45:03,000 --> 00:45:04,920
Well, you still need
to come up with an
893
00:45:04,920 --> 00:45:07,110
estimate of sigma somehow.
894
00:45:07,110 --> 00:45:08,950
What is a generic method for
895
00:45:08,950 --> 00:45:11,300
estimating a standard deviation?
896
00:45:11,300 --> 00:45:14,440
Equivalently, what could be a
generic method for estimating
897
00:45:14,440 --> 00:45:16,920
a variance?
898
00:45:16,920 --> 00:45:19,360
Well the variance is
an expected value
899
00:45:19,360 --> 00:45:20,940
of some random variable.
900
00:45:20,940 --> 00:45:25,610
The variance is the mean of the
random variable inside of
901
00:45:25,610 --> 00:45:28,200
those brackets.
902
00:45:28,200 --> 00:45:33,160
How does one estimate the mean
of some random variable?
903
00:45:33,160 --> 00:45:36,140
You obtain lots of measurements
of that random
904
00:45:36,140 --> 00:45:40,210
variable and average them out.
905
00:45:40,210 --> 00:45:45,170
So this would be a reasonable
way of estimating the variance
906
00:45:45,170 --> 00:45:47,310
of a distribution.
907
00:45:47,310 --> 00:45:50,590
And again, the weak law of large
numbers tells us that
908
00:45:50,590 --> 00:45:55,370
this average converges to the
expected value of this, which
909
00:45:55,370 --> 00:45:58,590
is just the variance of
the distribution.
910
00:45:58,590 --> 00:46:01,700
So we got a nice and
consistent way
911
00:46:01,700 --> 00:46:03,940
of estimating variances.
912
00:46:03,940 --> 00:46:08,100
But now, we seem to be getting
in a vicious circle here,
913
00:46:08,100 --> 00:46:10,580
because to estimate
the variance, we
914
00:46:10,580 --> 00:46:12,910
need to know the mean.
915
00:46:12,910 --> 00:46:16,075
And the mean is something we're
trying to estimate in
916
00:46:16,075 --> 00:46:18,250
the first place.
917
00:46:18,250 --> 00:46:18,400
Okay.
918
00:46:18,400 --> 00:46:20,880
But we do have an estimate
of the mean.
919
00:46:20,880 --> 00:46:24,640
So a reasonable approximation,
once more, is to plug in,
920
00:46:24,640 --> 00:46:27,620
here, since we don't
know the mean, the
921
00:46:27,620 --> 00:46:29,270
estimate of the mean.
922
00:46:29,270 --> 00:46:32,370
And so you get that expression,
but with a theta
923
00:46:32,370 --> 00:46:35,130
hat instead of theta itself.
924
00:46:35,130 --> 00:46:37,980
And this is another
reasonable way of
925
00:46:37,980 --> 00:46:40,180
estimating the variance.
926
00:46:40,180 --> 00:46:42,940
It does have the same
consistency properties.
927
00:46:42,940 --> 00:46:44,050
Why?
928
00:46:44,050 --> 00:46:51,100
When n is large, this is going
to behave the same as that,
929
00:46:51,100 --> 00:46:53,640
because theta hat converges
to theta.
930
00:46:53,640 --> 00:46:57,890
And when n is large, this is
approximately the same as
931
00:46:57,890 --> 00:46:58,820
sigma squared.
932
00:46:58,820 --> 00:47:02,220
So for a large n, this quantity
also converges to
933
00:47:02,220 --> 00:47:03,350
sigma squared.
934
00:47:03,350 --> 00:47:05,500
And we have a consistent
estimate of
935
00:47:05,500 --> 00:47:07,000
the variance as well.
936
00:47:07,000 --> 00:47:09,490
And we can take that consistent
estimate and use it
937
00:47:09,490 --> 00:47:12,360
back in the construction
of confidence interval.
938
00:47:12,360 --> 00:47:16,310
One little detail, here,
we're dividing by n.
939
00:47:16,310 --> 00:47:19,590
Here, we're dividing by n-1.
940
00:47:19,590 --> 00:47:21,050
Why do we do this?
941
00:47:21,050 --> 00:47:24,630
Well, it turns out that's what
you need to do for these
942
00:47:24,630 --> 00:47:28,590
estimates to be unbiased
estimates of the variance.
943
00:47:28,590 --> 00:47:32,080
One has to do a little bit of
a calculation, and one finds
944
00:47:32,080 --> 00:47:36,650
that that's the factor that you
need to have here in order
945
00:47:36,650 --> 00:47:37,770
to be unbiased.
946
00:47:37,770 --> 00:47:42,280
Of course, if you get 100 data
points, whether you divide by
947
00:47:42,280 --> 00:47:46,070
100 or divide by 99, it's
going to make only a tiny
948
00:47:46,070 --> 00:47:48,620
difference in your estimate
of your variance.
949
00:47:48,620 --> 00:47:50,740
So it's going to make only
a tiny difference in your
950
00:47:50,740 --> 00:47:52,670
estimate of the standard
deviation.
951
00:47:52,670 --> 00:47:54,180
It's not a big deal.
952
00:47:54,180 --> 00:47:56,550
And it doesn't really matter.
953
00:47:56,550 --> 00:48:00,720
But if you want to show off
about your deeper knowledge of
954
00:48:00,720 --> 00:48:06,810
statistics, you throw in the
1 over n-1 factor in there.
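The generic procedure just described, estimate the mean first, then average squared deviations from that estimate with the 1 over n-1 factor, can be sketched as follows. This is an illustrative Python sketch with made-up names and data:

```python
# Generic, distribution-free variance estimate: first form the
# sample mean theta_hat, then average the squared deviations
# (Xi - theta_hat)^2, dividing by n - 1 so the variance
# estimate is unbiased.
def sample_mean_and_std(xs):
    n = len(xs)
    theta_hat = sum(xs) / n
    var_hat = sum((x - theta_hat) ** 2 for x in xs) / (n - 1)
    return theta_hat, var_hat ** 0.5

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
theta_hat, sigma_hat = sample_mean_and_std(xs)
# theta_hat = 5.0; squared deviations sum to 32,
# so var_hat = 32 / 7
```

The resulting sigma hat can then be used in place of sigma in the confidence interval, just as in the Bernoulli case.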
955
00:48:06,810 --> 00:48:11,350
So now one basically needs to
put together this story here,
956
00:48:11,350 --> 00:48:15,260
how you estimate the variance.
957
00:48:15,260 --> 00:48:18,370
You first estimate
the sample mean.
958
00:48:18,370 --> 00:48:21,010
And then you do some extra
work to come up with a
959
00:48:21,010 --> 00:48:23,020
reasonable estimate of
the variance and
960
00:48:23,020 --> 00:48:24,640
the standard deviation.
961
00:48:24,640 --> 00:48:27,510
And then you use your estimate,
of the standard
962
00:48:27,510 --> 00:48:32,960
deviation, to come up with a
confidence interval, which has
963
00:48:32,960 --> 00:48:35,150
these two endpoints.
964
00:48:35,150 --> 00:48:39,130
In doing this procedure, there are
basically a number of
965
00:48:39,130 --> 00:48:41,810
approximations that
are involved.
966
00:48:41,810 --> 00:48:43,570
There are two types
of approximations.
967
00:48:43,570 --> 00:48:46,170
One approximation is that we're
pretending that the
968
00:48:46,170 --> 00:48:48,720
sample mean has a normal
distribution.
969
00:48:48,720 --> 00:48:51,080
That's something we're justified
to do, by the
970
00:48:51,080 --> 00:48:52,470
central limit theorem.
971
00:48:52,470 --> 00:48:53,550
But it's not exact.
972
00:48:53,550 --> 00:48:54,910
It's an approximation.
973
00:48:54,910 --> 00:48:58,080
And the second approximation
that comes in is that, instead
974
00:48:58,080 --> 00:49:01,260
of using the correct standard
deviation, in general, you
975
00:49:01,260 --> 00:49:04,850
will have to use some
approximation of
976
00:49:04,850 --> 00:49:06,100
the standard deviation.
977
00:49:06,100 --> 00:49:08,390
978
00:49:08,390 --> 00:49:11,200
Okay so you will be getting a
little bit of practice with
979
00:49:11,200 --> 00:49:14,550
these concepts in recitation
and tutorial.
980
00:49:14,550 --> 00:49:18,070
And we will move on to
new topics next week.
981
00:49:18,070 --> 00:49:20,930
But the material that's going
to be covered in the final
982
00:49:20,930 --> 00:49:23,570
exam is only up to this point.
983
00:49:23,570 --> 00:49:28,220
So next week is just
general education.
984
00:49:28,220 --> 00:49:30,550
Hopefully useful, but it's
not in the exam.
985
00:49:30,550 --> 00:49:31,800