1
00:00:00,000 --> 00:00:00,040
2
00:00:00,040 --> 00:00:02,460
The following content is
provided under a Creative
3
00:00:02,460 --> 00:00:03,870
Commons license.
4
00:00:03,870 --> 00:00:06,910
Your support will help MIT
OpenCourseWare continue to
5
00:00:06,910 --> 00:00:10,560
offer high quality educational
resources for free.
6
00:00:10,560 --> 00:00:13,460
To make a donation or view
additional materials from
7
00:00:13,460 --> 00:00:19,290
hundreds of MIT courses, visit
MIT OpenCourseWare at
8
00:00:19,290 --> 00:00:21,732
ocw.mit.edu.
9
00:00:21,732 --> 00:00:24,170
JOHN TSITSIKLIS: And we're going
to continue today with
10
00:00:24,170 --> 00:00:26,820
our discussion of classical
statistics.
11
00:00:26,820 --> 00:00:29,290
We'll start with a quick review
of what we discussed
12
00:00:29,290 --> 00:00:34,680
last time, and then talk about
two topics that cover a lot of
13
00:00:34,680 --> 00:00:37,740
statistics that are happening
in the real world.
14
00:00:37,740 --> 00:00:39,510
So two basic methods.
15
00:00:39,510 --> 00:00:43,730
One is the method of linear
regression, and the other one
16
00:00:43,730 --> 00:00:46,500
covers the basic methods and
tools used to
17
00:00:46,500 --> 00:00:49,540
do hypothesis testing.
18
00:00:49,540 --> 00:00:53,970
OK, so these two are topics
that any scientifically
19
00:00:53,970 --> 00:00:57,170
literate person should
know something about.
20
00:00:57,170 --> 00:00:59,570
So we're going to introduce
the basic ideas
21
00:00:59,570 --> 00:01:01,860
and concepts involved.
22
00:01:01,860 --> 00:01:07,580
So in classical statistics we
essentially have a
23
00:01:07,580 --> 00:01:11,250
family of possible models
about the world.
24
00:01:11,250 --> 00:01:15,190
So the world is the random
variable that we observe, and
25
00:01:15,190 --> 00:01:19,370
we have a model for it, but
actually not just one model,
26
00:01:19,370 --> 00:01:20,960
several candidate models.
27
00:01:20,960 --> 00:01:24,380
And each candidate model
corresponds to a different
28
00:01:24,380 --> 00:01:28,070
value of a parameter theta
that we do not know.
29
00:01:28,070 --> 00:01:32,275
So in contrast to Bayesian
statistics, this theta is
30
00:01:32,275 --> 00:01:35,540
assumed to be a constant
that we do not know.
31
00:01:35,540 --> 00:01:38,190
It is not modeled as a random
variable, there's no
32
00:01:38,190 --> 00:01:40,480
probabilities associated
with theta.
33
00:01:40,480 --> 00:01:43,380
We only have probabilities
about the X's.
34
00:01:43,380 --> 00:01:47,320
So in this context what is a
reasonable way of choosing a
35
00:01:47,320 --> 00:01:49,350
value for the parameter?
36
00:01:49,350 --> 00:01:53,470
One general approach is the
maximum likelihood approach,
37
00:01:53,470 --> 00:01:56,090
which chooses the
theta for which
38
00:01:56,090 --> 00:01:58,630
this quantity is largest.
39
00:01:58,630 --> 00:02:00,690
So what does that mean
intuitively?
40
00:02:00,690 --> 00:02:04,550
I'm trying to find the value of
theta under which the data
41
00:02:04,550 --> 00:02:08,970
that I observe are most likely
to have occurred.
42
00:02:08,970 --> 00:02:11,470
So the thinking is
essentially as follows.
43
00:02:11,470 --> 00:02:13,970
Let's say I have to choose
between two choices of theta.
44
00:02:13,970 --> 00:02:16,520
Under this theta the
X that I observed
45
00:02:16,520 --> 00:02:17,940
would be very unlikely.
46
00:02:17,940 --> 00:02:21,350
Under that theta the X that I
observed would have a decent
47
00:02:21,350 --> 00:02:22,830
probability of occurring.
48
00:02:22,830 --> 00:02:28,340
So I choose the latter as
my estimate of theta.
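[As an aside not in the lecture itself, this choose-the-likelier-theta idea can be sketched in a few lines of Python; the coin-flip data and the candidate thetas below are hypothetical.]

```python
import math

# Illustrative sketch (hypothetical data): among candidate thetas, pick
# the one under which the observed i.i.d. Bernoulli(theta) coin flips
# are most likely.
def log_likelihood(theta, flips):
    # log of P(flips; theta): sum of the log-probabilities of each flip
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in flips)

flips = [1, 1, 0, 1, 1, 0, 1, 1]       # 6 heads, 2 tails
candidates = [0.25, 0.5, 0.75]
theta_hat = max(candidates, key=lambda t: log_likelihood(t, flips))
print(theta_hat)  # 0.75, the candidate closest to the sample frequency 6/8
```

[The winning candidate is the one closest to the empirical frequency of heads, which is exactly what maximum likelihood gives for Bernoulli data.]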
49
00:02:28,340 --> 00:02:31,200
It's interesting to do the
comparison with the Bayesian
50
00:02:31,200 --> 00:02:34,110
approach, which we discussed
last time. In the Bayesian
51
00:02:34,110 --> 00:02:38,430
approach we also maximize over
theta, but we maximize a
52
00:02:38,430 --> 00:02:43,220
quantity in which the relation
between X's and thetas runs the
53
00:02:43,220 --> 00:02:44,520
opposite way.
54
00:02:44,520 --> 00:02:47,500
Here in the Bayesian world,
Theta is a random variable.
55
00:02:47,500 --> 00:02:48,980
So it has a distribution.
56
00:02:48,980 --> 00:02:53,030
Once we observe the data, it has
a posterior distribution,
57
00:02:53,030 --> 00:02:56,480
and we find the value of Theta,
which is most likely
58
00:02:56,480 --> 00:02:59,250
under the posterior
distribution.
59
00:02:59,250 --> 00:03:03,090
As we discussed last time when
you do this maximization now
60
00:03:03,090 --> 00:03:05,750
the posterior distribution is
given by this expression.
61
00:03:05,750 --> 00:03:09,760
The denominator doesn't matter,
and if you were to
62
00:03:09,760 --> 00:03:12,790
take a prior, which is flat--
63
00:03:12,790 --> 00:03:16,210
that is a constant independent
of Theta, then that
64
00:03:16,210 --> 00:03:17,640
term would go away.
65
00:03:17,640 --> 00:03:19,360
And syntactically,
at least, the two
66
00:03:19,360 --> 00:03:21,970
approaches look the same.
67
00:03:21,970 --> 00:03:28,170
So syntactically, or formally,
maximum likelihood estimation
68
00:03:28,170 --> 00:03:32,225
is the same as Bayesian
estimation in which you assume
69
00:03:32,225 --> 00:03:36,090
a prior which is flat, so that
all possible values of Theta
70
00:03:36,090 --> 00:03:37,570
are equally likely.
71
00:03:37,570 --> 00:03:40,790
Philosophically, however,
they're very different things.
72
00:03:40,790 --> 00:03:44,150
Here I'm picking the most
likely value of Theta.
73
00:03:44,150 --> 00:03:47,140
Here I'm picking the value of
Theta under which the observed
74
00:03:47,140 --> 00:03:51,050
data would have been more
likely to occur.
75
00:03:51,050 --> 00:03:53,590
So maximum likelihood estimation
is a general
76
00:03:53,590 --> 00:03:57,820
purpose method, so it's applied
all over the place in
77
00:03:57,820 --> 00:04:02,220
many, many different types
of estimation problems.
78
00:04:02,220 --> 00:04:05,100
There is a special kind of
estimation problem in which
79
00:04:05,100 --> 00:04:08,040
you may forget about maximum
likelihood estimation, and
80
00:04:08,040 --> 00:04:12,700
come up with an estimate in
a straightforward way.
81
00:04:12,700 --> 00:04:15,680
And this is the case where
you're trying to estimate the
82
00:04:15,680 --> 00:04:22,390
mean of the distribution of X,
where X is a random variable.
83
00:04:22,390 --> 00:04:25,140
You observe several independent
identically
84
00:04:25,140 --> 00:04:30,020
distributed random variables
X1 up to Xn.
85
00:04:30,020 --> 00:04:32,880
All of them have the same
distribution as this X.
86
00:04:32,880 --> 00:04:34,600
So they have a common mean.
87
00:04:34,600 --> 00:04:37,020
We do not know the mean; we
want to estimate it.
88
00:04:37,020 --> 00:04:40,560
What is more natural than just
taking the average of the
89
00:04:40,560 --> 00:04:42,470
values that we have observed?
90
00:04:42,470 --> 00:04:46,150
So you generate lots of X's,
take the average of them, and
91
00:04:46,150 --> 00:04:50,560
you expect that this is going to
be a reasonable estimate of
92
00:04:50,560 --> 00:04:53,420
the true mean of that
random variable.
93
00:04:53,420 --> 00:04:56,290
And indeed we know from the weak
law of large numbers that
94
00:04:56,290 --> 00:05:00,790
this estimate converges in
probability to the true mean
95
00:05:00,790 --> 00:05:02,680
of the random variable.
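[The weak law of large numbers can be watched numerically; this simulation is an aside with hypothetical Uniform(0, 1) data, whose true mean is 0.5.]

```python
import random

# Illustrative sketch: the sample mean of many i.i.d. draws settles
# near the true mean, as the weak law of large numbers predicts.
random.seed(0)  # fixed seed so the run is reproducible

def sample_mean(n):
    return sum(random.random() for _ in range(n)) / n

estimate = sample_mean(100_000)
print(estimate)  # very close to the true mean of 0.5
```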
96
00:05:02,680 --> 00:05:04,870
The other thing that we talked
about last time is that
97
00:05:04,870 --> 00:05:07,770
besides giving a point estimate
we may want to also
98
00:05:07,770 --> 00:05:13,530
give an interval that tells us
something about where we might
99
00:05:13,530 --> 00:05:16,170
believe theta to lie.
100
00:05:16,170 --> 00:05:21,950
A 1-alpha confidence
interval is an interval
101
00:05:21,950 --> 00:05:24,200
generated based on the data.
102
00:05:24,200 --> 00:05:26,860
So it's an interval from this
value to that value.
103
00:05:26,860 --> 00:05:30,120
These values are written with
capital letters because
104
00:05:30,120 --> 00:05:32,390
they're random, because they
depend on the data
105
00:05:32,390 --> 00:05:33,870
that we have seen.
106
00:05:33,870 --> 00:05:36,740
And this gives us an interval,
and we would like this
107
00:05:36,740 --> 00:05:40,600
interval to have the property
that theta is inside that
108
00:05:40,600 --> 00:05:42,830
interval with high
probability.
109
00:05:42,830 --> 00:05:46,805
So typically we would take
1-alpha to be a quantity such
110
00:05:46,805 --> 00:05:49,780
as 95% for example.
111
00:05:49,780 --> 00:05:54,340
In which case we have a 95%
confidence interval.
112
00:05:54,340 --> 00:05:56,980
As we discussed last time it's
important to have the right
113
00:05:56,980 --> 00:06:00,730
interpretation of what
95% means.
114
00:06:00,730 --> 00:06:04,640
What it does not mean
is the following--
115
00:06:04,640 --> 00:06:09,800
the unknown value has 95%
probability of being
116
00:06:09,800 --> 00:06:12,450
in the interval that
we have generated.
117
00:06:12,450 --> 00:06:14,550
That's because the unknown
value is not a random
118
00:06:14,550 --> 00:06:15,910
variable, it's a constant.
119
00:06:15,910 --> 00:06:18,930
Once we generate the interval
either it's inside or it's
120
00:06:18,930 --> 00:06:22,500
outside, but there's no
probabilities involved.
121
00:06:22,500 --> 00:06:26,415
Rather the probabilities are
to be interpreted over the
122
00:06:26,415 --> 00:06:28,590
random interval itself.
123
00:06:28,590 --> 00:06:31,730
What a statement like this
says is that if I have a
124
00:06:31,730 --> 00:06:37,060
procedure for generating 95%
confidence intervals, then
125
00:06:37,060 --> 00:06:40,800
whenever I use that procedure
I'm going to get a random
126
00:06:40,800 --> 00:06:44,260
interval, and it's going to
have 95% probability of
127
00:06:44,260 --> 00:06:48,270
capturing the true
value of theta.
128
00:06:48,270 --> 00:06:53,010
So most of the time when I use
this particular procedure for
129
00:06:53,010 --> 00:06:56,170
generating confidence intervals
the true theta will
130
00:06:56,170 --> 00:06:59,440
happen to lie inside that
confidence interval with
131
00:06:59,440 --> 00:07:01,190
probability 95%.
132
00:07:01,190 --> 00:07:04,230
So the randomness in this
statement is with respect to
133
00:07:04,230 --> 00:07:09,190
my confidence interval, it's
not with respect to theta,
134
00:07:09,190 --> 00:07:11,880
because theta is not random.
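[This procedure-level interpretation can be checked by simulation; the sketch below is an aside with hypothetical parameters, not part of the lecture.]

```python
import math
import random

# Sketch: generate data from a KNOWN fixed mean many times, build a 95%
# confidence interval each time, and count how often the random interval
# captures the fixed true value.  The coverage frequency should be
# close to 95%.
random.seed(1)
true_mean, sigma, n = 5.0, 1.0, 50
z = 1.96  # 97.5th percentile of the standard normal

def interval_covers_truth():
    data = [random.gauss(true_mean, sigma) for _ in range(n)]
    m = sum(data) / n
    half = z * sigma / math.sqrt(n)
    return m - half <= true_mean <= m + half

coverage = sum(interval_covers_truth() for _ in range(2000)) / 2000
print(coverage)  # close to 0.95
```

[The randomness here is in the intervals, not in true_mean, which stays fixed throughout.]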
135
00:07:11,880 --> 00:07:14,710
How does one construct
confidence intervals?
136
00:07:14,710 --> 00:07:17,500
There's various ways of going
about it, but in the case
137
00:07:17,500 --> 00:07:20,330
where we're dealing with the
estimation of the mean of a
138
00:07:20,330 --> 00:07:23,790
random variable doing this is
straightforward using the
139
00:07:23,790 --> 00:07:25,680
central limit theorem.
140
00:07:25,680 --> 00:07:31,440
Basically we take our estimated
mean, that's the
141
00:07:31,440 --> 00:07:35,910
sample mean, and we take a
symmetric interval to the left
142
00:07:35,910 --> 00:07:38,220
and to the right of
the sample mean.
143
00:07:38,220 --> 00:07:42,340
And we choose the width of that
interval by looking at
144
00:07:42,340 --> 00:07:43,680
the normal tables.
145
00:07:43,680 --> 00:07:50,180
So if this quantity, 1-alpha is
95%, we're going to
146
00:07:50,180 --> 00:07:55,790
look at the 97.5 percentile of
the normal distribution.
147
00:07:55,790 --> 00:07:59,910
Find the constant number that
corresponds to that value from
148
00:07:59,910 --> 00:08:02,790
the normal tables, and construct
the confidence
149
00:08:02,790 --> 00:08:07,350
intervals according
to this formula.
150
00:08:07,350 --> 00:08:10,810
So that gives you a pretty
mechanical way of going about
151
00:08:10,810 --> 00:08:13,250
constructing confidence
intervals when you're
152
00:08:13,250 --> 00:08:15,270
estimating the sample mean.
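[The mechanical recipe just described can be sketched as follows; this is an aside, and the sample values and sigma are hypothetical.]

```python
import math

# Sketch of the recipe above: a 1 - alpha = 95% confidence interval for
# the mean is the sample mean plus/minus z * sigma / sqrt(n), where
# z = 1.96 is the 97.5th percentile from the standard normal table.
def confidence_interval(data, sigma, z=1.96):
    n = len(data)
    mean = sum(data) / n
    half_width = z * sigma / math.sqrt(n)
    return mean - half_width, mean + half_width

data = [2.1, 1.9, 2.4, 2.0, 1.8, 2.2, 2.3, 1.7]  # hypothetical measurements
low, high = confidence_interval(data, sigma=0.25)
print(low, high)  # interval centered at the sample mean 2.05
```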
153
00:08:15,270 --> 00:08:18,650
So constructing confidence
intervals in this way involves
154
00:08:18,650 --> 00:08:19,630
an approximation.
155
00:08:19,630 --> 00:08:22,230
The approximation is the
central limit theorem.
156
00:08:22,230 --> 00:08:24,490
We are pretending that
the sample mean is a
157
00:08:24,490 --> 00:08:26,400
normal random variable.
158
00:08:26,400 --> 00:08:30,110
Which is, more or less,
right when n is large.
159
00:08:30,110 --> 00:08:32,780
That's what the central limit
theorem tells us.
160
00:08:32,780 --> 00:08:36,429
And sometimes we may need to
do some extra approximation
161
00:08:36,429 --> 00:08:39,480
work, because quite often
we do not know the
162
00:08:39,480 --> 00:08:41,030
true value of sigma.
163
00:08:41,030 --> 00:08:43,559
So we need to do some work
either to estimate
164
00:08:43,559 --> 00:08:45,360
sigma from the data.
165
00:08:45,360 --> 00:08:48,520
So sigma is, of course, the
standard deviation of the X's.
166
00:08:48,520 --> 00:08:51,410
We may want to estimate it from
the data, or we may have
167
00:08:51,410 --> 00:08:54,450
an upper bound on sigma, and we
just use that upper bound.
168
00:08:54,450 --> 00:08:57,430
169
00:08:57,430 --> 00:09:02,520
So now let's move on
to a new topic.
170
00:09:02,520 --> 00:09:09,420
A lot of statistics in the
real world are of the
171
00:09:09,420 --> 00:09:12,540
following flavor.
172
00:09:12,540 --> 00:09:16,820
So suppose that X is the SAT
score of a student in high
173
00:09:16,820 --> 00:09:23,620
school, and Y is the MIT GPA
of that same student.
174
00:09:23,620 --> 00:09:27,570
So you expect that there is a
relation between these two.
175
00:09:27,570 --> 00:09:31,240
So you go and collect data for
different students, and you
176
00:09:31,240 --> 00:09:35,470
record for a typical student
this would be their SAT score,
177
00:09:35,470 --> 00:09:37,700
that could be their MIT GPA.
178
00:09:37,700 --> 00:09:43,720
And you plot all this data
on an (X,Y) diagram.
179
00:09:43,720 --> 00:09:48,240
Now it's reasonable to believe
that there is some systematic
180
00:09:48,240 --> 00:09:49,940
relation between the two.
181
00:09:49,940 --> 00:09:54,650
So people who had higher SAT
scores in high school may have
182
00:09:54,650 --> 00:09:57,110
higher GPA in college.
183
00:09:57,110 --> 00:10:00,310
Well that may or may
not be true.
184
00:10:00,310 --> 00:10:05,270
You want to construct a model of
this kind, and see to what
185
00:10:05,270 --> 00:10:08,330
extent a relation of
this type is true.
186
00:10:08,330 --> 00:10:15,560
So you might hypothesize that
the real world is described by
187
00:10:15,560 --> 00:10:17,390
a model of this kind.
188
00:10:17,390 --> 00:10:22,730
That there is a linear relation
between the SAT
189
00:10:22,730 --> 00:10:27,710
score, and the college GPA.
190
00:10:27,710 --> 00:10:30,560
So it's a linear relation with
some parameters, theta0 and
191
00:10:30,560 --> 00:10:33,060
theta1 that we do not know.
192
00:10:33,060 --> 00:10:37,460
So we assume a linear relation
for the data, and depending on
193
00:10:37,460 --> 00:10:41,690
the choices of theta0 and theta1
it could be a different
194
00:10:41,690 --> 00:10:43,530
line through those data.
195
00:10:43,530 --> 00:10:47,670
Now we would like to find the
best model of this kind to
196
00:10:47,670 --> 00:10:49,230
explain the data.
197
00:10:49,230 --> 00:10:52,260
Of course there's going
to be some randomness.
198
00:10:52,260 --> 00:10:55,370
So in general it's going to be
impossible to find a line that
199
00:10:55,370 --> 00:10:57,780
goes through all of
the data points.
200
00:10:57,780 --> 00:11:04,020
So let's try to find the best
line that comes closest to
201
00:11:04,020 --> 00:11:05,810
explaining those data.
202
00:11:05,810 --> 00:11:08,520
And here's how we go about it.
203
00:11:08,520 --> 00:11:13,100
Suppose we try some particular
values of theta0 and theta1.
204
00:11:13,100 --> 00:11:15,750
These give us a certain line.
205
00:11:15,750 --> 00:11:20,760
Given that line, we can
make predictions.
206
00:11:20,760 --> 00:11:24,470
For a student who had this x,
the model that we have would
207
00:11:24,470 --> 00:11:27,580
predict that y would
be this value.
208
00:11:27,580 --> 00:11:32,150
The actual y is something else,
and so this quantity is
209
00:11:32,150 --> 00:11:37,660
the error that our model would
make in predicting the y of
210
00:11:37,660 --> 00:11:39,580
that particular student.
211
00:11:39,580 --> 00:11:43,350
We would like to choose a line
for which the predictions are
212
00:11:43,350 --> 00:11:45,110
as good as possible.
213
00:11:45,110 --> 00:11:47,790
And what do we mean by
as good as possible?
214
00:11:47,790 --> 00:11:51,150
As our criteria we're going
to take the following.
215
00:11:51,150 --> 00:11:54,070
We are going to look at the
prediction error that our
216
00:11:54,070 --> 00:11:56,310
model makes for each
particular student.
217
00:11:56,310 --> 00:12:01,050
Take the square of that, and
then add them up over all of
218
00:12:01,050 --> 00:12:02,580
our data points.
219
00:12:02,580 --> 00:12:06,140
So what we're looking at is
the sum of this quantity
220
00:12:06,140 --> 00:12:08,270
squared, that quantity squared,
that quantity
221
00:12:08,270 --> 00:12:09,570
squared, and so on.
222
00:12:09,570 --> 00:12:13,220
We add all of these squares, and
we would like to find the
223
00:12:13,220 --> 00:12:17,500
line for which the sum of
these squared prediction
224
00:12:17,500 --> 00:12:20,910
errors are as small
as possible.
225
00:12:20,910 --> 00:12:23,950
So that's the procedure.
226
00:12:23,950 --> 00:12:27,100
We have our data, the
X's and the Y's.
227
00:12:27,100 --> 00:12:31,340
And we're going to find theta's
the best model of this
228
00:12:31,340 --> 00:12:35,580
type, the best possible model,
by minimizing this sum of
229
00:12:35,580 --> 00:12:38,010
squared errors.
230
00:12:38,010 --> 00:12:41,020
So that's a method that one
could pull out of the hat and
231
00:12:41,020 --> 00:12:44,120
say OK, that's how I'm going
to build my model.
232
00:12:44,120 --> 00:12:46,730
And it sounds pretty
reasonable.
233
00:12:46,730 --> 00:12:49,530
And it sounds pretty reasonable
even if you don't
234
00:12:49,530 --> 00:12:51,660
know anything about
probability.
235
00:12:51,660 --> 00:12:55,340
But does it have some
probabilistic justification?
236
00:12:55,340 --> 00:12:59,280
It turns out that yes, you can
motivate this method with
237
00:12:59,280 --> 00:13:03,100
probabilistic considerations
under certain assumptions.
238
00:13:03,100 --> 00:13:07,360
So let's make a probabilistic
model that's going to lead us
239
00:13:07,360 --> 00:13:10,600
to this particular way of
estimating the parameters.
240
00:13:10,600 --> 00:13:12,920
So here's a probabilistic
model.
241
00:13:12,920 --> 00:13:18,090
I pick a student who had
a specific SAT score.
242
00:13:18,090 --> 00:13:21,190
And that could be done at
random, but also could be done
243
00:13:21,190 --> 00:13:22,330
in a systematic way.
244
00:13:22,330 --> 00:13:25,240
That is, I pick a student who
had an SAT of 600, a student
245
00:13:25,240 --> 00:13:33,170
of 610 all the way to 1,400
or 1,600, whatever the
246
00:13:33,170 --> 00:13:34,670
right number is.
247
00:13:34,670 --> 00:13:36,320
I pick all those students.
248
00:13:36,320 --> 00:13:40,370
And I assume that for a student
of this kind there's a
249
00:13:40,370 --> 00:13:44,500
true model that tells me that
their GPA is going to be a
250
00:13:44,500 --> 00:13:48,580
random variable, which is
something predicted by their
251
00:13:48,580 --> 00:13:52,690
SAT score plus some randomness,
some random noise.
252
00:13:52,690 --> 00:13:56,400
And I model that random noise
by independent normal random
253
00:13:56,400 --> 00:14:00,710
variables with 0 mean and
a certain variance.
254
00:14:00,710 --> 00:14:04,470
So this is a specific
probabilistic model, and now I
255
00:14:04,470 --> 00:14:09,010
can think about doing maximum
likelihood estimation for this
256
00:14:09,010 --> 00:14:10,980
particular model.
257
00:14:10,980 --> 00:14:14,490
So to do maximum likelihood
estimation here I need to
258
00:14:14,490 --> 00:14:19,830
write down the likelihood of the
y's that I have observed.
259
00:14:19,830 --> 00:14:23,380
What's the likelihood of the
y's that I have observed?
260
00:14:23,380 --> 00:14:28,425
Well, a particular w has a
likelihood of the form e to
261
00:14:28,425 --> 00:14:33,030
the minus w squared over
(2 sigma-squared).
262
00:14:33,030 --> 00:14:37,070
That's the likelihood
of a particular w.
263
00:14:37,070 --> 00:14:40,310
The probability, or the
likelihood of observing a
264
00:14:40,310 --> 00:14:43,990
particular value of y, that's
the same as the likelihood
265
00:14:43,990 --> 00:14:49,020
that w takes a value of y
minus this, minus that.
266
00:14:49,020 --> 00:14:52,850
So the likelihood of the
y's is of this form.
267
00:14:52,850 --> 00:14:57,360
Think of this as just being
the w_i-squared.
268
00:14:57,360 --> 00:15:01,370
So this is the density --
269
00:15:01,370 --> 00:15:06,060
and if we have multiple data you
multiply the likelihoods
270
00:15:06,060 --> 00:15:07,660
of the different y's.
271
00:15:07,660 --> 00:15:12,090
So you have to write something
like this.
272
00:15:12,090 --> 00:15:16,390
Since the w's are independent
that means that the y's are
273
00:15:16,390 --> 00:15:17,910
also independent.
274
00:15:17,910 --> 00:15:21,410
The likelihood of a y vector
is the product of the
275
00:15:21,410 --> 00:15:24,240
likelihoods of the
individual y's.
276
00:15:24,240 --> 00:15:27,800
The likelihood of every
individual y is of this form.
277
00:15:27,800 --> 00:15:33,050
Where w is y_i minus these
two quantities.
278
00:15:33,050 --> 00:15:36,000
So this is the form that the
likelihood function is going
279
00:15:36,000 --> 00:15:38,880
to take under this
particular model.
280
00:15:38,880 --> 00:15:42,260
And under the maximum likelihood
methodology we want
281
00:15:42,260 --> 00:15:49,170
to maximize this quantity with
respect to theta0 and theta1.
282
00:15:49,170 --> 00:15:56,930
Now to do this maximization you
might as well consider the
283
00:15:56,930 --> 00:16:00,990
logarithm and maximize the
logarithm, which is just the
284
00:16:00,990 --> 00:16:02,730
exponent up here.
285
00:16:02,730 --> 00:16:05,750
Maximizing this exponent because
we have a minus sign
286
00:16:05,750 --> 00:16:08,900
is the same as minimizing
the exponent
287
00:16:08,900 --> 00:16:10,840
without the minus sign.
288
00:16:10,840 --> 00:16:12,840
Sigma squared is a constant.
289
00:16:12,840 --> 00:16:17,970
So what you end up doing is
minimizing this quantity here,
290
00:16:17,970 --> 00:16:20,120
which is the same as
what we had in our
291
00:16:20,120 --> 00:16:23,640
linear regression methods.
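[The equivalence just derived, that maximizing the normal likelihood is the same as minimizing the sum of squared errors, can be verified numerically; the data points and candidate lines below are hypothetical.]

```python
import math

# Sketch: under i.i.d. normal noise the log-likelihood of a candidate
# line (theta0, theta1) is a constant minus SSE / (2 sigma^2), so the
# line minimizing the sum of squared errors also maximizes likelihood.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.2, 3.9, 6.1, 8.0]

def sse(theta0, theta1):
    return sum((y - theta0 - theta1 * x) ** 2 for x, y in zip(xs, ys))

def log_likelihood(theta0, theta1, sigma=1.0):
    const = -0.5 * len(xs) * math.log(2 * math.pi * sigma ** 2)
    return const - sse(theta0, theta1) / (2 * sigma ** 2)

candidate_lines = [(0.0, 2.0), (1.0, 1.5), (0.5, 1.8)]
best_by_sse = min(candidate_lines, key=lambda p: sse(*p))
best_by_likelihood = max(candidate_lines, key=lambda p: log_likelihood(*p))
print(best_by_sse == best_by_likelihood)  # True: the two criteria agree
```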
292
00:16:23,640 --> 00:16:29,400
So in conclusion you might
choose to do linear regression
293
00:16:29,400 --> 00:16:34,490
in this particular way,
just because it looks
294
00:16:34,490 --> 00:16:36,210
reasonable or plausible.
295
00:16:36,210 --> 00:16:41,050
Or you might interpret what
you're doing as maximum
296
00:16:41,050 --> 00:16:45,220
likelihood estimation, in which
you assume a model of
297
00:16:45,220 --> 00:16:49,520
this kind where the noise
terms are normal random
298
00:16:49,520 --> 00:16:51,970
variables with the same
distribution --
299
00:16:51,970 --> 00:16:54,540
independent identically
distributed.
300
00:16:54,540 --> 00:17:01,320
So linear regression implicitly
makes an assumption
301
00:17:01,320 --> 00:17:02,840
of this kind.
302
00:17:02,840 --> 00:17:07,380
It's doing maximum likelihood
estimation as if the world was
303
00:17:07,380 --> 00:17:11,000
really described by a model of
this form, and with the W's
304
00:17:11,000 --> 00:17:12,560
being random variables.
305
00:17:12,560 --> 00:17:17,920
So this gives us at least some
justification that this
306
00:17:17,920 --> 00:17:21,800
particular approach to fitting
lines to data is not so
307
00:17:21,800 --> 00:17:25,579
arbitrary, but it has
a sound footing.
308
00:17:25,579 --> 00:17:30,530
OK so then once you accept this
formulation as being a
309
00:17:30,530 --> 00:17:32,920
reasonable one what's
the next step?
310
00:17:32,920 --> 00:17:37,760
The next step is to see how to
carry out this minimization.
311
00:17:37,760 --> 00:17:42,220
This is not a very difficult
minimization to do.
312
00:17:42,220 --> 00:17:48,260
The way it's done is by setting
the derivatives of
313
00:17:48,260 --> 00:17:50,930
this expression to 0.
314
00:17:50,930 --> 00:17:54,500
Now because this is a quadratic
function of theta0
315
00:17:54,500 --> 00:17:55,410
and theta1--
316
00:17:55,410 --> 00:17:57,270
when you take the derivatives
with respect
317
00:17:57,270 --> 00:17:58,940
to theta0 and theta1--
318
00:17:58,940 --> 00:18:03,250
you get linear functions
of theta0 and theta1.
319
00:18:03,250 --> 00:18:08,010
And you end up solving a system
of linear equations in
320
00:18:08,010 --> 00:18:09,630
theta0 and theta1.
321
00:18:09,630 --> 00:18:15,660
And it turns out that there's
very nice and simple formulas
322
00:18:15,660 --> 00:18:18,950
for the optimal estimates
of the parameters in
323
00:18:18,950 --> 00:18:20,510
terms of the data.
324
00:18:20,510 --> 00:18:23,910
And the formulas
are these ones.
325
00:18:23,910 --> 00:18:28,130
I said that these are nice
and simple formulas.
326
00:18:28,130 --> 00:18:29,800
Let's see why.
327
00:18:29,800 --> 00:18:31,270
How can we interpret them?
328
00:18:31,270 --> 00:18:34,050
329
00:18:34,050 --> 00:18:42,250
So suppose that the world is
described by a model of this
330
00:18:42,250 --> 00:18:48,990
kind, where the X's and Y's
are random variables.
331
00:18:48,990 --> 00:18:53,920
And where W is a noise term
that's independent of X. So
332
00:18:53,920 --> 00:18:57,750
we're assuming that a linear
model is indeed true, but not
333
00:18:57,750 --> 00:18:58,530
exactly true.
334
00:18:58,530 --> 00:19:01,790
There's always some noise
associated with any particular
335
00:19:01,790 --> 00:19:04,980
data point that we obtain.
336
00:19:04,980 --> 00:19:10,880
So if a model of this kind is
true, and the W's have 0 mean
337
00:19:10,880 --> 00:19:15,370
then we have that the expected
value of Y would be theta0
338
00:19:15,370 --> 00:19:23,570
plus theta1 expected value of
X. And because W has 0 mean
339
00:19:23,570 --> 00:19:26,200
there's no extra term.
340
00:19:26,200 --> 00:19:31,660
So in particular, theta0 would
be equal to expected value of
341
00:19:31,660 --> 00:19:37,380
Y minus theta1 expected
value of X.
342
00:19:37,380 --> 00:19:40,660
So let's use this equation
to try to come up with a
343
00:19:40,660 --> 00:19:44,060
reasonable estimate of theta0.
344
00:19:44,060 --> 00:19:47,220
I do not know the expected
value of Y, but I
345
00:19:47,220 --> 00:19:48,430
can estimate it.
346
00:19:48,430 --> 00:19:49,820
How do I estimate it?
347
00:19:49,820 --> 00:19:53,460
I look at the average of all the
y's that I have obtained.
348
00:19:53,460 --> 00:19:57,320
So I replace this: I estimate
it with the average of the
349
00:19:57,320 --> 00:19:59,940
data I have seen.
350
00:19:59,940 --> 00:20:02,430
Here, similarly with the X's.
351
00:20:02,430 --> 00:20:06,820
I might not know the expected
value of X's, but I have data
352
00:20:06,820 --> 00:20:08,520
points for the x's.
353
00:20:08,520 --> 00:20:13,070
I look at the average of all my
data points, I come up with
354
00:20:13,070 --> 00:20:16,380
an estimate of this
expectation.
355
00:20:16,380 --> 00:20:21,390
Now I don't know what theta1 is,
but my procedure is going
356
00:20:21,390 --> 00:20:25,320
to generate an estimate of
theta1 called theta1 hat.
357
00:20:25,320 --> 00:20:29,230
And once I have this estimate,
then a reasonable person would
358
00:20:29,230 --> 00:20:33,400
estimate theta0 in this
particular way.
359
00:20:33,400 --> 00:20:37,320
So that's how my estimate
of theta0 is going to be
360
00:20:37,320 --> 00:20:38,490
constructed.
361
00:20:38,490 --> 00:20:41,420
It's this formula here.
362
00:20:41,420 --> 00:20:44,700
We have not yet addressed the
harder question, which is how
363
00:20:44,700 --> 00:20:47,490
to estimate theta1 in
the first place.
364
00:20:47,490 --> 00:20:50,830
So to estimate theta0 I assumed
that I already had an
365
00:20:50,830 --> 00:20:52,180
estimate for a theta1.
366
00:20:52,180 --> 00:20:55,090
367
00:20:55,090 --> 00:21:02,060
OK, the right formula for the
estimate of theta1 happens to
368
00:21:02,060 --> 00:21:03,140
be this one.
369
00:21:03,140 --> 00:21:08,632
It looks messy, but let's
try to interpret it.
370
00:21:08,632 --> 00:21:12,970
What I'm going to do is
take this model; for
371
00:21:12,970 --> 00:21:18,340
simplicity, let's assume that
the random variables
372
00:21:18,340 --> 00:21:19,590
have 0 means.
373
00:21:19,590 --> 00:21:22,940
374
00:21:22,940 --> 00:21:28,800
And see how we might
375
00:21:28,800 --> 00:21:30,960
try to estimate theta1.
376
00:21:30,960 --> 00:21:36,270
Let's multiply both sides of
this equation by X. So we get
377
00:21:36,270 --> 00:21:48,470
Y times X equals
theta0 times X plus theta1
378
00:21:48,470 --> 00:21:54,530
times X-squared, plus X times
W. And now take expectations
379
00:21:54,530 --> 00:21:56,420
of both sides.
380
00:21:56,420 --> 00:22:00,160
If I have 0 mean random
variables the expected value
381
00:22:00,160 --> 00:22:07,210
of Y times X is just the
covariance of X with Y.
382
00:22:07,210 --> 00:22:10,640
I have assumed that my random
variables have 0 means, so the
383
00:22:10,640 --> 00:22:13,680
expectation of this is 0.
384
00:22:13,680 --> 00:22:17,970
This one is going to be the
variance of X, so I have
385
00:22:17,970 --> 00:22:23,260
theta1 times variance of X. And
since I'm assuming that my
386
00:22:23,260 --> 00:22:26,990
random variables have 0 mean,
and I'm also assuming that W
387
00:22:26,990 --> 00:22:32,250
is independent of X, this last
term also has 0 mean.
388
00:22:32,250 --> 00:22:39,280
So under such a probabilistic
model this equation is true.
389
00:22:39,280 --> 00:22:43,620
If we knew the variance and the
covariance then we would
390
00:22:43,620 --> 00:22:45,930
know the value of theta1.
391
00:22:45,930 --> 00:22:49,080
But we only have data, we do
not necessarily know the
392
00:22:49,080 --> 00:22:53,070
variance and the covariance,
but we can estimate it.
393
00:22:53,070 --> 00:22:55,885
What's a reasonable estimate
of the variance?
394
00:22:55,885 --> 00:22:59,390
A reasonable estimate of the
variance is this quantity here
395
00:22:59,390 --> 00:23:03,195
divided by n, and the reasonable
estimate of the
396
00:23:03,195 --> 00:23:06,730
covariance is that numerator
divided by n.
397
00:23:06,730 --> 00:23:09,410
398
00:23:09,410 --> 00:23:11,510
So this is my estimate
of the mean.
399
00:23:11,510 --> 00:23:15,390
I'm looking at the squared
distances from the mean, and I
400
00:23:15,390 --> 00:23:18,740
average them over lots
and lots of data.
401
00:23:18,740 --> 00:23:23,990
This is the most reasonable way
of estimating the variance
402
00:23:23,990 --> 00:23:26,070
of our distribution.
403
00:23:26,070 --> 00:23:31,400
And similarly the expected value
of this quantity is the
404
00:23:31,400 --> 00:23:35,020
covariance of X with Y, and then
we have lots and lots of
405
00:23:35,020 --> 00:23:35,830
data points.
406
00:23:35,830 --> 00:23:38,895
This quantity here is going to
be a very good estimate of the
407
00:23:38,895 --> 00:23:40,140
covariance.
408
00:23:40,140 --> 00:23:44,820
So basically what this
formula does is--
409
00:23:44,820 --> 00:23:46,520
one way of thinking about it--
410
00:23:46,520 --> 00:23:50,870
is that it starts from this
relation which is true
411
00:23:50,870 --> 00:23:57,230
exactly, but estimates the
covariance and the variance on
412
00:23:57,230 --> 00:24:00,820
the basis of the data, and then
uses these estimates to
413
00:24:00,820 --> 00:24:05,770
come up with an estimate
of theta1.
414
00:24:05,770 --> 00:24:09,890
So this gives us a probabilistic
interpretation
415
00:24:09,890 --> 00:24:13,620
of the formulas that we have for
the way that the estimates
416
00:24:13,620 --> 00:24:14,990
are constructed.
417
00:24:14,990 --> 00:24:19,560
If you're willing to assume that
this is the true model of
418
00:24:19,560 --> 00:24:22,640
the world, the structure of the
true model of the world,
419
00:24:22,640 --> 00:24:24,460
except that you do not
know the means,
420
00:24:24,460 --> 00:24:27,590
variances, and covariances.
421
00:24:27,590 --> 00:24:33,010
Then this is a natural way of
estimating those unknown
422
00:24:33,010 --> 00:24:34,260
parameters.
423
00:24:34,260 --> 00:24:36,770
424
00:24:36,770 --> 00:24:39,800
All right, so we have a
closed-form formula, we can
425
00:24:39,800 --> 00:24:43,620
apply it whenever
we have data.
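The closed-form formula just described can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code; the function name and data are mine. It estimates theta1 as the sample covariance of X and Y divided by the sample variance of X, and theta0 from the sample means:

```python
def fit_line(xs, ys):
    """Least-squares line: theta1 = sample cov(X, Y) / sample var(X)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample covariance and variance (dividing by n, as in the lecture).
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    var_x = sum((x - x_bar) ** 2 for x in xs) / n
    theta1 = cov_xy / var_x
    theta0 = y_bar - theta1 * x_bar
    return theta0, theta1

# On data generated exactly from Y = 1 + 2X, the estimates recover the line.
theta0, theta1 = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```
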
426
00:24:43,620 --> 00:24:47,810
Now linear regression is a
subject on which there are
427
00:24:47,810 --> 00:24:51,520
whole courses taught and whole
books written.
428
00:24:51,520 --> 00:24:54,560
And the reason for that is that
there's a lot more that
429
00:24:54,560 --> 00:24:58,840
you can bring into the topic,
and many ways that you can
430
00:24:58,840 --> 00:25:02,350
elaborate on the simple solution
that we got for the
431
00:25:02,350 --> 00:25:05,880
case of two parameters and only
two random variables.
432
00:25:05,880 --> 00:25:09,550
So let me give you a little bit
of the flavor of the
433
00:25:09,550 --> 00:25:12,950
topics that come up when you
start looking into linear
434
00:25:12,950 --> 00:25:14,200
regression in more depth.
435
00:25:14,200 --> 00:25:16,840
436
00:25:16,840 --> 00:25:24,390
So in our discussions so far
we used a linear model in
437
00:25:24,390 --> 00:25:28,370
which we're trying to explain
the values of one variable in
438
00:25:28,370 --> 00:25:30,860
terms of the values of
another variable.
439
00:25:30,860 --> 00:25:35,010
We're trying to explain GPAs
in terms of SAT scores, or
440
00:25:35,010 --> 00:25:39,640
we're trying to predict GPAs
in terms of SAT scores.
441
00:25:39,640 --> 00:25:47,910
But maybe your GPA is affected
by several factors.
442
00:25:47,910 --> 00:25:56,380
For example maybe your GPA is
affected by your SAT score,
443
00:25:56,380 --> 00:26:01,820
also the income of your family,
the years of education
444
00:26:01,820 --> 00:26:06,720
of your grandmother, and many
other factors like that.
445
00:26:06,720 --> 00:26:11,970
So you might write down a model
in which I believe that
446
00:26:11,970 --> 00:26:17,820
GPA has a relation, which is a
linear function of all these
447
00:26:17,820 --> 00:26:20,520
other variables that
I mentioned.
448
00:26:20,520 --> 00:26:24,350
So perhaps you have a theory of
what determines performance
449
00:26:24,350 --> 00:26:29,540
at college, and you want to
build a model of that type.
450
00:26:29,540 --> 00:26:31,460
How do we go about
it in this case?
451
00:26:31,460 --> 00:26:33,830
Well, again we collect
the data points.
452
00:26:33,830 --> 00:26:37,980
We look at the i-th student,
who has a college GPA.
453
00:26:37,980 --> 00:26:42,090
We record their SAT score,
their family income, and
454
00:26:42,090 --> 00:26:45,010
grandmother's years
of education.
455
00:26:45,010 --> 00:26:50,390
So this is one data point that
is for one particular student.
456
00:26:50,390 --> 00:26:52,580
We postulate the model
of this form.
457
00:26:52,580 --> 00:26:56,160
For the i-th student this would
be the mistake that our
458
00:26:56,160 --> 00:26:59,940
model makes if we have chosen
specific values for those
459
00:26:59,940 --> 00:27:01,070
parameters.
460
00:27:01,070 --> 00:27:05,450
And then we go and choose the
parameters that are going to
461
00:27:05,450 --> 00:27:07,950
give us, again, the
smallest possible
462
00:27:07,950 --> 00:27:10,000
sum of squared errors.
463
00:27:10,000 --> 00:27:12,360
So philosophically it's exactly
the same as what we
464
00:27:12,360 --> 00:27:15,700
were discussing before, except
that now we're including
465
00:27:15,700 --> 00:27:19,560
multiple explanatory variables
in our model instead of a
466
00:27:19,560 --> 00:27:22,600
single explanatory variable.
467
00:27:22,600 --> 00:27:24,070
So that's the formulation.
468
00:27:24,070 --> 00:27:26,070
What do you do next?
469
00:27:26,070 --> 00:27:29,420
Well, to do this minimization
you're going to take
470
00:27:29,420 --> 00:27:32,750
derivatives once you have your
data, you have a function of
471
00:27:32,750 --> 00:27:34,310
these three parameters.
472
00:27:34,310 --> 00:27:37,190
You take the derivative with
respect to the parameter, set
473
00:27:37,190 --> 00:27:39,170
the derivative equal
to 0, you get the
474
00:27:39,170 --> 00:27:41,060
system of linear equations.
475
00:27:41,060 --> 00:27:43,450
You throw that system of
linear equations to the
476
00:27:43,450 --> 00:27:46,260
computer, and you get numerical
values for the
477
00:27:46,260 --> 00:27:48,060
optimal parameters.
478
00:27:48,060 --> 00:27:52,130
There are no nice closed-form
formulas of the type that we
479
00:27:52,130 --> 00:27:54,510
had in the previous slide
when you're dealing
480
00:27:54,510 --> 00:27:56,230
with multiple variables.
481
00:27:56,230 --> 00:28:02,240
Unless you're willing to go
into matrix notation.
482
00:28:02,240 --> 00:28:04,760
In that case you can again
write down closed-form
483
00:28:04,760 --> 00:28:07,290
formulas, but they will be a
little less intuitive than
484
00:28:07,290 --> 00:28:09,210
what we had before.
485
00:28:09,210 --> 00:28:13,550
But the moral of the story is
that numerically this is a
486
00:28:13,550 --> 00:28:16,480
procedure that's very easy.
487
00:28:16,480 --> 00:28:18,780
It's a problem, an optimization
problem that the
488
00:28:18,780 --> 00:28:20,680
computer can solve for you.
489
00:28:20,680 --> 00:28:23,290
And it can solve it for
you very quickly.
490
00:28:23,290 --> 00:28:25,470
Because all that it involves
is solving a
491
00:28:25,470 --> 00:28:26,720
system of linear equations.
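The procedure just described can be sketched numerically. The student data below are invented for illustration; the point is that setting the derivatives of the sum of squared errors to zero gives a linear system, which the computer solves (here via NumPy's least-squares routine):

```python
import numpy as np

# Hypothetical data: one row per student:
# SAT score, family income, grandmother's years of education.
X = np.array([[1400.0, 50.0, 12.0],
              [1200.0, 40.0, 10.0],
              [1500.0, 80.0, 16.0],
              [1300.0, 60.0, 14.0],
              [1100.0, 30.0,  8.0]])
y = np.array([3.9, 3.1, 4.0, 3.5, 2.8])  # college GPAs

# Add a column of ones so theta0 acts as the intercept.
A = np.column_stack([np.ones(len(X)), X])

# Setting derivatives to zero yields the normal equations
# A^T A theta = A^T y; lstsq solves them numerically.
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ theta
```

At the least-squares solution the residuals are orthogonal to every explanatory variable, which is exactly the "derivative equals zero" condition.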
492
00:28:26,720 --> 00:28:29,590
493
00:28:29,590 --> 00:28:34,270
Now when you choose your
explanatory variables you may
494
00:28:34,270 --> 00:28:37,940
have some choices.
495
00:28:37,940 --> 00:28:43,550
One person may think that your
GPA a has something to do with
496
00:28:43,550 --> 00:28:45,340
your SAT score.
497
00:28:45,340 --> 00:28:48,480
Some other person may think that
your GPA has something to
498
00:28:48,480 --> 00:28:51,800
do with the square of
your SAT score.
499
00:28:51,800 --> 00:28:55,380
And that other person may
want to try to build a
500
00:28:55,380 --> 00:28:58,840
model of this kind.
501
00:28:58,840 --> 00:29:01,550
Now when would you want
to do this?
502
00:29:01,550 --> 00:29:07,830
Suppose that the data that
you have looks like this.
503
00:29:07,830 --> 00:29:12,177
504
00:29:12,177 --> 00:29:15,740
If the data looks like this then
you might be tempted to
505
00:29:15,740 --> 00:29:20,710
say well a linear model does
not look right, but maybe a
506
00:29:20,710 --> 00:29:25,650
quadratic model will give me
a better fit for the data.
507
00:29:25,650 --> 00:29:30,690
So if you want to fit a
quadratic model to the data
508
00:29:30,690 --> 00:29:35,550
then what you do is you take
X-squared as your explanatory
509
00:29:35,550 --> 00:29:42,520
variable instead of X, and you
build a model of this kind.
510
00:29:42,520 --> 00:29:45,910
There's nothing really different
in models of this
511
00:29:45,910 --> 00:29:48,830
kind compared to models
of that kind.
512
00:29:48,830 --> 00:29:54,700
They are still linear models
because we have theta's
513
00:29:54,700 --> 00:29:57,630
showing up in a linear
fashion.
514
00:29:57,630 --> 00:30:00,460
What you take as your
explanatory variables, whether
515
00:30:00,460 --> 00:30:02,870
it's X, whether it's X-squared,
or whether it's
516
00:30:02,870 --> 00:30:05,390
some other function
that you chose,
517
00:30:05,390 --> 00:30:09,590
some general function h of X,
it doesn't make a difference.
518
00:30:09,590 --> 00:30:14,470
So think of h of X as being
your new X. So you can
519
00:30:14,470 --> 00:30:17,620
formulate the problem exactly
the same way, except that
520
00:30:17,620 --> 00:30:21,035
instead of using X's you
choose h of X's.
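A sketch of the same idea with h(X) = X squared as the explanatory variable. The data here are artificial (exactly quadratic) just to show that the model stays linear in the thetas even though h is nonlinear in X:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = x ** 2  # data that happen to be exactly quadratic in X

# Use h(x) = x^2 as the explanatory variable; the model
# Y = theta0 + theta1 * h(X) is still linear in the thetas.
A = np.column_stack([np.ones_like(x), x ** 2])
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
```
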
521
00:30:21,035 --> 00:30:23,610
522
00:30:23,610 --> 00:30:26,540
So it's basically a question
do I want to build a model
523
00:30:26,540 --> 00:30:31,390
that explains Y's based on the
values of X, or do I want to
524
00:30:31,390 --> 00:30:35,190
build a model that explains Y's
on the basis of the values
525
00:30:35,190 --> 00:30:38,970
of h of X. Which is the
right variable to use?
526
00:30:38,970 --> 00:30:42,160
And with this picture here,
we see that it can make a
527
00:30:42,160 --> 00:30:43,160
difference.
528
00:30:43,160 --> 00:30:47,070
A linear model in X might be
a poor fit, but a quadratic
529
00:30:47,070 --> 00:30:49,660
model might give us
a better fit.
530
00:30:49,660 --> 00:30:55,450
So this brings to the topic of
how to choose your functions h
531
00:30:55,450 --> 00:30:59,480
of X if you're dealing with
a real world problem.
532
00:30:59,480 --> 00:31:03,080
So in a real world problem
you're just given X's and Y's.
533
00:31:03,080 --> 00:31:05,990
And you have the freedom
of building models of
534
00:31:05,990 --> 00:31:07,120
any kind you want.
535
00:31:07,120 --> 00:31:11,330
You have the freedom of choosing
a function h of X of
536
00:31:11,330 --> 00:31:13,130
any type that you want.
537
00:31:13,130 --> 00:31:14,980
So this turns out to be a quite
538
00:31:14,980 --> 00:31:18,800
difficult and tricky topic.
539
00:31:18,800 --> 00:31:22,630
Because you may be tempted
to overdo it.
540
00:31:22,630 --> 00:31:28,450
For example, I got my 10 data
points, and I could say OK,
541
00:31:28,450 --> 00:31:35,660
I'm going to choose
an h of X, and
542
00:31:35,660 --> 00:31:40,300
actually multiple h's of X
to do a multiple linear
543
00:31:40,300 --> 00:31:45,030
regression in which I'm going to
build a model that uses a
544
00:31:45,030 --> 00:31:47,600
10th degree polynomial.
545
00:31:47,600 --> 00:31:51,160
If I choose to fit my data with
a 10th degree polynomial
546
00:31:51,160 --> 00:31:54,680
I'm going to fit my data
perfectly, but I may obtain a
547
00:31:54,680 --> 00:31:58,530
model that does something like
this, and goes through all my
548
00:31:58,530 --> 00:31:59,930
data points.
549
00:31:59,930 --> 00:32:03,830
So I can make my prediction
errors extremely small if I
550
00:32:03,830 --> 00:32:08,820
use lots of parameters, and
if I choose my h functions
551
00:32:08,820 --> 00:32:09,930
appropriately.
552
00:32:09,930 --> 00:32:11,800
But clearly this would
be garbage.
553
00:32:11,800 --> 00:32:15,270
If you get those data points,
and you say here's my model
554
00:32:15,270 --> 00:32:16,420
that explains them.
555
00:32:16,420 --> 00:32:21,320
That has a polynomial going up
and down, then you're probably
556
00:32:21,320 --> 00:32:22,900
doing something wrong.
557
00:32:22,900 --> 00:32:26,180
So choosing how complicated
those functions,
558
00:32:26,180 --> 00:32:27,900
the h's, should be,
559
00:32:27,900 --> 00:32:32,020
and how many explanatory
variables to use is a very
560
00:32:32,020 --> 00:32:36,560
delicate and deep topic on which
there's deep theory that
561
00:32:36,560 --> 00:32:39,910
tells you what you should do,
and what you shouldn't do.
562
00:32:39,910 --> 00:32:43,830
But the main thing that one
should avoid doing is having
563
00:32:43,830 --> 00:32:46,620
too many parameters in
your model when you
564
00:32:46,620 --> 00:32:48,900
have too few data.
565
00:32:48,900 --> 00:32:52,350
So if you only have 10 data
points, you shouldn't have 10
566
00:32:52,350 --> 00:32:53,350
free parameters.
567
00:32:53,350 --> 00:32:56,150
With 10 free parameters you will
be able to fit your data
568
00:32:56,150 --> 00:33:00,760
perfectly, but you wouldn't be
able to really rely on the
569
00:33:00,760 --> 00:33:02,010
results that you are seeing.
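The overfitting danger can be seen numerically. A sketch with synthetic data: with 10 points, a degree-9 polynomial already has 10 free parameters and fits the training data essentially exactly, while a 2-parameter line leaves visible residuals but is far more trustworthy:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
y = 2.0 * x + rng.normal(scale=0.1, size=10)  # roughly linear data

wild = np.polyfit(x, y, 9)  # 10 free parameters for 10 data points
line = np.polyfit(x, y, 1)  # 2 free parameters

# The high-degree fit passes (numerically) through every point...
train_err = np.max(np.abs(np.polyval(wild, x) - y))
# ...but it can swing wildly between the data points, so its tiny
# training error says nothing about how reliable the model is.
```
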
570
00:33:02,010 --> 00:33:06,050
571
00:33:06,050 --> 00:33:12,630
OK, now in practice, when people
run linear regressions
572
00:33:12,630 --> 00:33:15,410
they do not just give
point estimates for
573
00:33:15,410 --> 00:33:17,370
the parameters theta.
574
00:33:17,370 --> 00:33:20,300
But similar to what we did for
the case of estimating the
575
00:33:20,300 --> 00:33:23,790
mean of a random variable you
might want to give confidence
576
00:33:23,790 --> 00:33:27,200
intervals that sort of tell you
how much randomness there
577
00:33:27,200 --> 00:33:30,730
is when you estimate each one of
the particular parameters.
578
00:33:30,730 --> 00:33:33,960
There are formulas for building
confidence intervals
579
00:33:33,960 --> 00:33:36,230
for the estimates
of the theta's.
580
00:33:36,230 --> 00:33:38,520
We're not going to look
at them, it would
581
00:33:38,520 --> 00:33:39,990
take too much time.
582
00:33:39,990 --> 00:33:44,600
Also you might want to estimate
the variance in the
583
00:33:44,600 --> 00:33:47,400
noise that you have
in your model.
584
00:33:47,400 --> 00:33:52,540
That is if you are pretending
that your true model is of the
585
00:33:52,540 --> 00:33:57,026
kind we were discussing before,
namely Y equals theta1
586
00:33:57,026 --> 00:34:02,190
times X plus W, and W has a
variance sigma squared.
587
00:34:02,190 --> 00:34:05,170
You might want to estimate this,
because it tells you
588
00:34:05,170 --> 00:34:09,199
something about the model, and
this is called standard error.
589
00:34:09,199 --> 00:34:11,929
It puts a limit on how
good predictions
590
00:34:11,929 --> 00:34:14,730
your model can make.
591
00:34:14,730 --> 00:34:18,170
Even if you have the correct
theta0 and theta1, and
592
00:34:18,170 --> 00:34:22,179
somebody tells you X you can
make a prediction about Y, but
593
00:34:22,179 --> 00:34:24,710
that prediction will
not be accurate.
594
00:34:24,710 --> 00:34:26,739
Because there's this additional
randomness.
595
00:34:26,739 --> 00:34:29,699
And if that additional
randomness is big, then your
596
00:34:29,699 --> 00:34:33,810
predictions will also have a
substantial error in them.
597
00:34:33,810 --> 00:34:38,300
There's another quantity that
gets reported usually.
598
00:34:38,300 --> 00:34:41,400
This is part of the computer
output that you get when you
599
00:34:41,400 --> 00:34:45,500
use a statistical package which
is called R-square.
600
00:34:45,500 --> 00:34:49,920
And it's a measure of the
explanatory power of the model
601
00:34:49,920 --> 00:34:52,469
that you have built
linear regression.
602
00:34:52,469 --> 00:34:55,650
Using linear regression.
603
00:34:55,650 --> 00:35:01,030
Instead of defining R-square
exactly, let me give you a
604
00:35:01,030 --> 00:35:05,170
sort of analogous quantity
that's involved.
605
00:35:05,170 --> 00:35:08,030
After you do your linear
regression you can look at the
606
00:35:08,030 --> 00:35:10,600
following quantity.
607
00:35:10,600 --> 00:35:15,720
You look at the variance of Y,
which is something that you
608
00:35:15,720 --> 00:35:17,400
can estimate from data.
609
00:35:17,400 --> 00:35:23,370
This is how much randomness
there is in Y. And compare it
610
00:35:23,370 --> 00:35:28,090
with the randomness that you
have in Y, but conditioned on
611
00:35:28,090 --> 00:35:35,840
X. So this quantity tells
me if I knew X how much
612
00:35:35,840 --> 00:35:39,820
randomness would there
still be in my Y?
613
00:35:39,820 --> 00:35:43,650
So if I know X, I have more
information, so Y is more
614
00:35:43,650 --> 00:35:44,390
constrained.
615
00:35:44,390 --> 00:35:48,640
There's less randomness in Y.
This is the randomness in Y if
616
00:35:48,640 --> 00:35:50,790
I don't know anything about X.
617
00:35:50,790 --> 00:35:54,855
So naturally this quantity would
be less than 1, and if
618
00:35:54,855 --> 00:35:58,830
this quantity is small it would
mean that whenever I
619
00:35:58,830 --> 00:36:03,320
know X then Y is very
well known.
620
00:36:03,320 --> 00:36:07,440
Which essentially tells me that
knowing X allows me to
621
00:36:07,440 --> 00:36:12,370
make very good predictions about
Y. Knowing X means that
622
00:36:12,370 --> 00:36:17,390
I'm explaining away most
of the randomness in Y.
623
00:36:17,390 --> 00:36:22,590
So if you read a statistical
study that uses linear
624
00:36:22,590 --> 00:36:29,730
regression you might encounter
statements of the form 60% of
625
00:36:29,730 --> 00:36:36,140
a student's GPA is explained
by the family income.
626
00:36:36,140 --> 00:36:40,600
If you read statements of
this kind, they really refer
627
00:36:40,600 --> 00:36:43,160
to quantities of this kind.
628
00:36:43,160 --> 00:36:47,820
Out of the total variance in Y,
how much variance is left
629
00:36:47,820 --> 00:36:50,060
after we build our model?
630
00:36:50,060 --> 00:36:56,490
So if only 40% of the variance
of Y is left after we build
631
00:36:56,490 --> 00:37:00,700
our model, that means that
X explains 60% of the
632
00:37:00,700 --> 00:37:02,510
variations in Y's.
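The "fraction of variance explained" idea can be sketched as follows, on synthetic data: fit the line, then compare the variance left in the residuals with the total variance of Y. For a single explanatory variable this quantity coincides with the square of the correlation coefficient, which is one way R-square is usually defined:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(size=200)  # signal plus unit-variance noise

# Closed-form simple linear regression.
theta1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
theta0 = y.mean() - theta1 * x.mean()
residual = y - (theta0 + theta1 * x)

# Share of the variance of Y still left after conditioning on X,
# and the complementary "explained" share reported as R-square.
unexplained = np.var(residual) / np.var(y)
r_squared = 1.0 - unexplained
```

With a true slope of 3 and unit noise, roughly 9/10 of the variance of Y is explained by X here.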
633
00:37:02,510 --> 00:37:06,570
So the idea is that
randomness in Y is
634
00:37:06,570 --> 00:37:09,560
caused by multiple sources.
635
00:37:09,560 --> 00:37:12,025
Our explanatory variable
and random noise.
636
00:37:12,025 --> 00:37:15,610
And we ask the question what
percentage of the total
637
00:37:15,610 --> 00:37:19,940
randomness in Y is explained by
638
00:37:19,940 --> 00:37:23,030
variations in the X parameter?
639
00:37:23,030 --> 00:37:26,860
And how much of the total
randomness in Y is attributed
640
00:37:26,860 --> 00:37:30,390
just to random effects?
641
00:37:30,390 --> 00:37:34,050
So if you have a model that
explains most of the variation
642
00:37:34,050 --> 00:37:37,710
in Y then you can think that
you have a good model that
643
00:37:37,710 --> 00:37:42,550
tells you something useful
about the real world.
644
00:37:42,550 --> 00:37:45,990
Now there's lots of things that
can go wrong when you use
645
00:37:45,990 --> 00:37:50,670
linear regression, and there's
many pitfalls.
646
00:37:50,670 --> 00:37:56,440
One pitfall happens when you
have this situation that's
647
00:37:56,440 --> 00:37:58,300
called heteroskedasticity.
648
00:37:58,300 --> 00:38:01,020
So suppose your data
are of this kind.
649
00:38:01,020 --> 00:38:06,550
650
00:38:06,550 --> 00:38:09,330
So what's happening here?
651
00:38:09,330 --> 00:38:17,640
You seem to have a linear model,
but when X is small you
652
00:38:17,640 --> 00:38:19,200
have a very good model.
653
00:38:19,200 --> 00:38:23,830
So this means that W has a small
variance when X is here.
654
00:38:23,830 --> 00:38:26,760
On the other hand, when X is
there you have a lot of
655
00:38:26,760 --> 00:38:27,970
randomness.
656
00:38:27,970 --> 00:38:32,080
This would be a situation
in which the W's are not
657
00:38:32,080 --> 00:38:35,840
identically distributed, but
the variance of the W's, of
658
00:38:35,840 --> 00:38:40,360
the noise, has something
to do with the X's.
659
00:38:40,360 --> 00:38:43,720
So in different regions of our
x-space we have different
660
00:38:43,720 --> 00:38:45,260
amounts of noise.
661
00:38:45,260 --> 00:38:47,615
What will go wrong in
this situation?
662
00:38:47,615 --> 00:38:51,290
Since we're trying to minimize
sum of squared errors, we're
663
00:38:51,290 --> 00:38:54,080
really paying attention
to the biggest errors.
664
00:38:54,080 --> 00:38:57,010
Which will mean that we are
going to pay attention to
665
00:38:57,010 --> 00:38:59,690
these data points, because
that's where the big errors
666
00:38:59,690 --> 00:39:01,130
are going to be.
667
00:39:01,130 --> 00:39:04,250
So the linear regression
formulas will end up building
668
00:39:04,250 --> 00:39:09,110
a model based on these data,
which are the most noisy ones.
669
00:39:09,110 --> 00:39:14,810
Instead of those data that are
nicely stacked in order.
670
00:39:14,810 --> 00:39:17,410
Clearly that's not the
right thing to do.
671
00:39:17,410 --> 00:39:21,500
So you need to change something,
and use the fact
672
00:39:21,500 --> 00:39:25,800
that the variance of W changes
with the X's, and there are
673
00:39:25,800 --> 00:39:27,770
ways of dealing with it.
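One standard way of "using the fact that the variance of W changes with the X's" (the lecture does not spell out which one) is weighted least squares: each data point is weighted by the inverse of its noise variance, so the noisy region stops dominating the fit. A sketch with made-up data, assuming the noise standard deviation grows with X:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(1.0, 10.0, 100)
sigma = 0.1 * x                       # noise std grows with X (heteroskedastic)
y = 2.0 * x + rng.normal(scale=sigma)

A = np.column_stack([np.ones_like(x), x])
w = 1.0 / sigma ** 2                  # weight = precision of each point

# Weighted normal equations: minimize sum_i w_i * (y_i - a_i . theta)^2,
# i.e. solve (A^T W A) theta = A^T W y.
Aw = A * w[:, None]
theta = np.linalg.solve(A.T @ Aw, Aw.T @ y)
```
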
674
00:39:27,770 --> 00:39:31,280
It's something that one needs
to be careful about.
675
00:39:31,280 --> 00:39:34,580
Another possibility of getting
into trouble is if you're
676
00:39:34,580 --> 00:39:38,550
using multiple explanatory
variables that are very
677
00:39:38,550 --> 00:39:41,330
closely related to each other.
678
00:39:41,330 --> 00:39:47,500
So for example, suppose that I
tried to predict your GPA by
679
00:39:47,500 --> 00:39:54,100
looking at your SAT the first
time that you took it plus
680
00:39:54,100 --> 00:39:58,290
your SAT the second time that
you took your SATs.
681
00:39:58,290 --> 00:40:00,470
I'm assuming that almost
everyone takes the
682
00:40:00,470 --> 00:40:02,450
SAT more than once.
683
00:40:02,450 --> 00:40:05,630
So suppose that you had
a model of this kind.
684
00:40:05,630 --> 00:40:09,380
Well, SAT on your first try and
SAT on your second try are
685
00:40:09,380 --> 00:40:12,480
very likely to be
fairly close.
686
00:40:12,480 --> 00:40:17,570
And you could think of coming
up with estimates in which
687
00:40:17,570 --> 00:40:19,390
this is ignored.
688
00:40:19,390 --> 00:40:22,780
And you build a model based on
this, or an alternative model
689
00:40:22,780 --> 00:40:25,810
in which this term is ignored,
and you make predictions based
690
00:40:25,810 --> 00:40:27,430
on the second SAT.
691
00:40:27,430 --> 00:40:31,840
And both models are likely to be
essentially as good as the
692
00:40:31,840 --> 00:40:34,430
other one, because these
two quantities are
693
00:40:34,430 --> 00:40:36,630
essentially the same.
694
00:40:36,630 --> 00:40:41,440
So in that case, your theta's
that you estimate are going to
695
00:40:41,440 --> 00:40:44,880
be very sensitive to little
details of the data.
696
00:40:44,880 --> 00:40:48,560
You have your data,
and your data tell
697
00:40:48,560 --> 00:40:52,170
you that this coefficient
is big and that
698
00:40:52,170 --> 00:40:52,760
coefficient is small.
699
00:40:52,760 --> 00:40:56,060
You change your data just a
tiny bit, and your theta's
700
00:40:56,060 --> 00:40:57,720
would drastically change.
701
00:40:57,720 --> 00:41:00,750
So this is a case in which you
have multiple explanatory
702
00:41:00,750 --> 00:41:04,110
variables, but they're redundant
in the sense that
703
00:41:04,110 --> 00:41:07,300
they're very closely related
to each other, and perhaps
704
00:41:07,300 --> 00:41:08,830
with a linear relation.
705
00:41:08,830 --> 00:41:11,980
So one must be careful about the
situation, and do special
706
00:41:11,980 --> 00:41:15,940
tests to make sure that
this doesn't happen.
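One common numerical check for this situation (a sketch; the SAT data are invented) is the condition number of the data matrix: when two explanatory variables are nearly linearly related, the condition number blows up, signaling that the estimated thetas will be extremely sensitive to small changes in the data:

```python
import numpy as np

rng = np.random.default_rng(3)
sat1 = rng.normal(1300.0, 100.0, size=50)
sat2 = sat1 + rng.normal(scale=1.0, size=50)  # second SAT nearly equals the first

# Intercept column plus the two nearly redundant explanatory variables.
A = np.column_stack([np.ones(50), sat1, sat2])

# A huge condition number flags near-collinearity: tiny data changes
# can drastically change the fitted coefficients.
cond = np.linalg.cond(A)
```
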
707
00:41:15,940 --> 00:41:20,900
Finally the biggest and most
common blunder is that you run
708
00:41:20,900 --> 00:41:24,910
your linear regression, you
get your linear model, and
709
00:41:24,910 --> 00:41:26,760
then you say oh, OK.
710
00:41:26,760 --> 00:41:33,340
Y is caused by X according to
this particular formula.
711
00:41:33,340 --> 00:41:36,940
Well, all that we did was to
identify a linear relation
712
00:41:36,940 --> 00:41:40,120
between X and Y. This doesn't
tell us anything.
713
00:41:40,120 --> 00:41:44,130
Whether it's Y that causes X, or
whether it's X that causes
714
00:41:44,130 --> 00:41:48,850
Y, or maybe both X and Y are
caused by some other variable
715
00:41:48,850 --> 00:41:51,110
that we didn't think about.
716
00:41:51,110 --> 00:41:56,800
So building a good linear model
that has small errors
717
00:41:56,800 --> 00:42:00,980
does not tell us anything about
causal relations between
718
00:42:00,980 --> 00:42:02,320
the two variables.
719
00:42:02,320 --> 00:42:05,210
It only tells us that there's
a close association between
720
00:42:05,210 --> 00:42:06,010
the two variables.
721
00:42:06,010 --> 00:42:10,370
If you know one you can make
predictions about the other.
722
00:42:10,370 --> 00:42:13,370
But it doesn't tell you anything
about the underlying
723
00:42:13,370 --> 00:42:18,120
physics, that there's some
physical mechanism that
724
00:42:18,120 --> 00:42:22,310
introduces the relation between
those variables.
725
00:42:22,310 --> 00:42:26,430
OK, that's it about
linear regression.
726
00:42:26,430 --> 00:42:30,510
Let us start the next topic,
which is hypothesis testing.
727
00:42:30,510 --> 00:42:35,140
And we're going to continue
with it next time.
728
00:42:35,140 --> 00:42:37,780
So here, instead of trying
to estimate continuous
729
00:42:37,780 --> 00:42:41,920
parameters, we have two
alternative hypotheses about
730
00:42:41,920 --> 00:42:46,550
the distribution of the
X random variable.
731
00:42:46,550 --> 00:42:53,620
So for example our random
variable could be either
732
00:42:53,620 --> 00:42:58,480
distributed according to this
distribution, under H0, or it
733
00:42:58,480 --> 00:43:02,930
might be distributed according
to this distribution under H1.
734
00:43:02,930 --> 00:43:06,230
And we want to make a decision
which distribution is the
735
00:43:06,230 --> 00:43:07,990
correct one?
736
00:43:07,990 --> 00:43:10,850
So we're given those two
distributions, and some common
737
00:43:10,850 --> 00:43:14,290
terminologies that one of them
is the null hypothesis--
738
00:43:14,290 --> 00:43:16,600
sort of the default hypothesis,
and we have some
739
00:43:16,600 --> 00:43:18,290
alternative hypotheses--
740
00:43:18,290 --> 00:43:20,560
and we want to check whether
this one is true,
741
00:43:20,560 --> 00:43:21,950
or that one is true.
742
00:43:21,950 --> 00:43:24,500
So you obtain a data
point, and you
743
00:43:24,500 --> 00:43:26,060
want to make a decision.
744
00:43:26,060 --> 00:43:28,820
In this picture what would
a reasonable person
745
00:43:28,820 --> 00:43:30,650
do to make a decision?
746
00:43:30,650 --> 00:43:35,500
They would probably choose a
certain threshold, Xi, and
747
00:43:35,500 --> 00:43:43,540
decide that H1 is true if your
data falls in this interval.
748
00:43:43,540 --> 00:43:49,590
And decide that H0 is true
if you fall on the other side.
749
00:43:49,590 --> 00:43:51,660
So that would be a
reasonable way of
750
00:43:51,660 --> 00:43:54,100
approaching the problem.
751
00:43:54,100 --> 00:43:59,160
More generally you take the set
of all possible X's, and
752
00:43:59,160 --> 00:44:03,050
you divide the set of possible
X's into two regions.
753
00:44:03,050 --> 00:44:11,110
One is the rejection region,
in which you decide H1,
754
00:44:11,110 --> 00:44:13,170
or you reject H0.
755
00:44:13,170 --> 00:44:15,760
756
00:44:15,760 --> 00:44:21,640
And the complement of that
region is where you decide H0.
757
00:44:21,640 --> 00:44:25,210
So this is the x-space
of your data.
758
00:44:25,210 --> 00:44:28,350
In this example here, X
was one-dimensional.
759
00:44:28,350 --> 00:44:31,770
But in general X is going to
be a vector, where all the
760
00:44:31,770 --> 00:44:34,790
possible data vectors that
you can get, they're
761
00:44:34,790 --> 00:44:36,600
divided into two types.
762
00:44:36,600 --> 00:44:40,400
If it falls in this set you'd
make one decision.
763
00:44:40,400 --> 00:44:43,770
If it falls in that set, you
make the other decision.
764
00:44:43,770 --> 00:44:47,380
OK, so how would you
characterize the performance
765
00:44:47,380 --> 00:44:49,690
of the particular way of
making a decision?
766
00:44:49,690 --> 00:44:53,000
Suppose I chose my threshold.
767
00:44:53,000 --> 00:44:57,960
I may make mistakes of
two possible types.
768
00:44:57,960 --> 00:45:03,360
Perhaps H0 is true, but my data
happens to fall here.
769
00:45:03,360 --> 00:45:07,560
In which case I make a mistake,
and this would be a
770
00:45:07,560 --> 00:45:10,730
false rejection of H0.
771
00:45:10,730 --> 00:45:15,070
If my data falls here
I reject H0.
772
00:45:15,070 --> 00:45:16,890
I decide H1.
773
00:45:16,890 --> 00:45:19,510
Whereas H0 was true.
774
00:45:19,510 --> 00:45:21,690
The probability of
this happening?
775
00:45:21,690 --> 00:45:24,890
Let's call it alpha.
776
00:45:24,890 --> 00:45:28,040
But there's another kind of
error that can be made.
777
00:45:28,040 --> 00:45:32,810
Suppose that H1 was true, but by
accident my data happens to
778
00:45:32,810 --> 00:45:34,250
fall on that side.
779
00:45:34,250 --> 00:45:36,610
Then I'm going to make
an error again.
780
00:45:36,610 --> 00:45:40,540
I'm going to decide H0 even
though H1 was true.
781
00:45:40,540 --> 00:45:42,570
How likely is this to occur?
782
00:45:42,570 --> 00:45:46,420
This would be the area under
this curve here.
783
00:45:46,420 --> 00:45:50,600
And that's the other type of
error that can be made, and
784
00:45:50,600 --> 00:45:55,400
beta is the probability of this
particular type of error.
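The two error probabilities can be computed directly once concrete distributions are chosen. A sketch assuming H0: X ~ N(0, 1) and H1: X ~ N(2, 1), with the threshold rule "decide H1 when x exceeds xi" (these distributions and the threshold are my choices, not the lecture's):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

mu0, mu1, xi = 0.0, 2.0, 1.0

alpha = 1.0 - phi(xi - mu0)  # P(decide H1 | H0): N(mu0,1) area above xi
beta = phi(xi - mu1)         # P(decide H0 | H1): N(mu1,1) area below xi

# Moving the threshold to the right shrinks alpha but grows beta:
# the trade-off described in the lecture.
alpha2 = 1.0 - phi(1.5 - mu0)
beta2 = phi(1.5 - mu1)
```

With xi halfway between the two means, the two error probabilities are equal by symmetry.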
785
00:45:55,400 --> 00:45:57,550
Both of these are errors.
786
00:45:57,550 --> 00:45:59,640
Alpha is the probability
of error of one kind.
787
00:45:59,640 --> 00:46:02,110
Beta is the probability of an
error of the other kind.
788
00:46:02,110 --> 00:46:03,510
You would like the
probabilities
789
00:46:03,510 --> 00:46:05,050
of error to be small.
790
00:46:05,050 --> 00:46:07,550
So you would like to
make both alpha and
791
00:46:07,550 --> 00:46:09,780
beta as small as possible.
792
00:46:09,780 --> 00:46:13,300
Unfortunately that's not
possible, there's a trade-off.
793
00:46:13,300 --> 00:46:17,540
If I move my threshold
this way, then alpha becomes
794
00:46:17,540 --> 00:46:20,760
smaller, but beta
becomes bigger.
795
00:46:20,760 --> 00:46:22,770
So there's a trade-off.
796
00:46:22,770 --> 00:46:29,350
If I make my rejection region
smaller one kind of error is
797
00:46:29,350 --> 00:46:31,880
less likely, but the
other kind of error
798
00:46:31,880 --> 00:46:34,670
becomes more likely.
799
00:46:34,670 --> 00:46:38,050
So we got this trade-off.
800
00:46:38,050 --> 00:46:39,620
So what do we do about it?
801
00:46:39,620 --> 00:46:41,570
How do we move systematically?
802
00:46:41,570 --> 00:46:45,680
How do we come up with
rejection regions?
803
00:46:45,680 --> 00:46:48,900
Well, what the theory basically
tells you is it
804
00:46:48,900 --> 00:46:53,200
tells you how you should
create those regions.
805
00:46:53,200 --> 00:46:57,860
But it doesn't tell
you exactly how.
806
00:46:57,860 --> 00:47:00,970
It tells you the general
shape of those regions.
807
00:47:00,970 --> 00:47:05,120
For example here, the theory
tells us that the right
808
00:47:05,120 --> 00:47:07,430
thing to do would be to put
the threshold and make
809
00:47:07,430 --> 00:47:10,910
decisions one way to the right,
one way to the left.
810
00:47:10,910 --> 00:47:12,830
But it might not necessarily
tell us
811
00:47:12,830 --> 00:47:15,020
where to put the threshold.
812
00:47:15,020 --> 00:47:18,890
Still, it's useful enough to
know that the way to make a
813
00:47:18,890 --> 00:47:20,960
good decision would
be in terms of
814
00:47:20,960 --> 00:47:22,400
a particular threshold.
815
00:47:22,400 --> 00:47:24,770
Let me make this
more specific.
816
00:47:24,770 --> 00:47:27,380
We can take our inspiration
from the solution of the
817
00:47:27,380 --> 00:47:29,820
hypothesis testing problem
that we had in
818
00:47:29,820 --> 00:47:31,370
the Bayesian case.
819
00:47:31,370 --> 00:47:34,130
In the Bayesian case we just
pick the hypothesis which is
820
00:47:34,130 --> 00:47:37,480
more likely given the data.
821
00:47:37,480 --> 00:47:40,080
The posterior probabilities
produced using the Bayes
822
00:47:40,080 --> 00:47:42,770
rule are written
this way.
823
00:47:42,770 --> 00:47:45,240
And this term is the
same as that term.
824
00:47:45,240 --> 00:47:49,500
They cancel out, then let me
collect terms here and there.
825
00:47:49,500 --> 00:47:52,370
826
00:47:52,370 --> 00:47:54,030
I get an expression here.
827
00:47:54,030 --> 00:47:56,090
I think the version you
have in your handout
828
00:47:56,090 --> 00:47:57,340
is the correct one.
829
00:47:57,340 --> 00:47:59,810
830
00:47:59,810 --> 00:48:02,082
The one on the slide was
not the correct one, so
831
00:48:02,082 --> 00:48:03,730
I'm fixing it here.
832
00:48:03,730 --> 00:48:06,920
OK, so this is the form of how
you make decisions in the
833
00:48:06,920 --> 00:48:08,720
Bayesian case.
834
00:48:08,720 --> 00:48:10,620
What you do in the Bayesian
case is you
835
00:48:10,620 --> 00:48:13,270
calculate this ratio.
836
00:48:13,270 --> 00:48:17,110
Let's call it the likelihood
ratio.
837
00:48:17,110 --> 00:48:20,770
And compare that ratio
to a threshold.
838
00:48:20,770 --> 00:48:22,916
And the threshold that you
should be using in the
839
00:48:22,916 --> 00:48:25,240
Bayesian case has something
to do with the prior
840
00:48:25,240 --> 00:48:28,000
probabilities of the
two hypotheses.
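[As an aside, not part of the lecture: the Bayesian comparison just described can be sketched in a few lines. Assume, purely for illustration, that X is normal with mean 0 under H0 and mean 2 under H1, both with unit variance; after the common normalizing term cancels, picking the more likely posterior amounts to comparing the likelihood ratio to the ratio of priors.]

```python
import math

def normal_pdf(x, mean, std=1.0):
    # Density of a normal distribution at x.
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def bayesian_decision(x, prior_h0, mean0=0.0, mean1=2.0):
    # Pick the hypothesis with the larger posterior probability.
    # With the normalizing term cancelled, this compares the
    # likelihood ratio to the ratio of prior probabilities.
    ratio = normal_pdf(x, mean1) / normal_pdf(x, mean0)
    threshold = prior_h0 / (1 - prior_h0)  # P(H0) / P(H1)
    return "H1" if ratio > threshold else "H0"

# With a strong enough prior on H0, the same observation
# can flip the decision.
print(bayesian_decision(1.2, prior_h0=0.5))
print(bayesian_decision(1.2, prior_h0=0.9))
```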
841
00:48:28,000 --> 00:48:31,840
In the non-Bayesian case we do
not have prior probabilities,
842
00:48:31,840 --> 00:48:34,690
so we do not know how to
set this threshold.
843
00:48:34,690 --> 00:48:38,350
But what we're going to do is we're
going to keep this particular
844
00:48:38,350 --> 00:48:42,690
structure anyway, and maybe use
some other considerations
845
00:48:42,690 --> 00:48:44,480
to pick the threshold.
846
00:48:44,480 --> 00:48:51,030
So we're going to use what is
called a likelihood ratio test,
847
00:48:51,030 --> 00:48:54,260
in which we
calculate a quantity of this
848
00:48:54,260 --> 00:48:56,830
kind that we call the
likelihood ratio, and compare it
849
00:48:56,830 --> 00:48:58,480
with a threshold.
850
00:48:58,480 --> 00:49:00,530
So what's the interpretation
of this likelihood ratio?
851
00:49:00,530 --> 00:49:03,140
852
00:49:03,140 --> 00:49:04,290
We ask--
853
00:49:04,290 --> 00:49:08,570
the X's that I have observed,
how likely were they to occur
854
00:49:08,570 --> 00:49:10,460
if H1 was true?
855
00:49:10,460 --> 00:49:14,590
And how likely were they to
occur if H0 was true?
856
00:49:14,590 --> 00:49:20,560
This ratio would be big if my
data are plausible, that is,
857
00:49:20,560 --> 00:49:22,400
likely to occur under H1.
858
00:49:22,400 --> 00:49:25,400
But they're very implausible,
extremely unlikely
859
00:49:25,400 --> 00:49:27,380
to occur under H0.
860
00:49:27,380 --> 00:49:30,060
Then my thinking would be: well,
the data that I saw are
861
00:49:30,060 --> 00:49:33,300
extremely unlikely to have
occurred under H0.
862
00:49:33,300 --> 00:49:36,780
So H0 is probably not true.
863
00:49:36,780 --> 00:49:39,820
I'm going to go for
H1 and choose H1.
864
00:49:39,820 --> 00:49:43,920
So when this ratio is big it
tells us that the data that
865
00:49:43,920 --> 00:49:47,720
we're seeing are better
explained if we assume H1 to
866
00:49:47,720 --> 00:49:50,620
be true rather than
H0 to be true.
867
00:49:50,620 --> 00:49:53,970
So I calculate this quantity,
compare it with a threshold,
868
00:49:53,970 --> 00:49:56,200
and that's how I make
my decision.
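[As an aside, not part of the lecture: here is a minimal sketch of that likelihood ratio test for one observation, again assuming for illustration that X is normal with mean 0 under H0 and mean 2 under H1, with unit variance; the threshold value here is arbitrary.]

```python
import math

def normal_pdf(x, mean, std=1.0):
    # Density of a normal distribution at x.
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def likelihood_ratio_test(x, threshold):
    # Likelihood ratio: how well does H1 (mean 2) explain the
    # observed x, relative to H0 (mean 0)?
    ratio = normal_pdf(x, mean=2.0) / normal_pdf(x, mean=0.0)
    # Decide H1 when the ratio exceeds the threshold.
    return "H1" if ratio > threshold else "H0"

print(likelihood_ratio_test(1.8, threshold=1.0))  # data near 2: decide H1
print(likelihood_ratio_test(0.1, threshold=1.0))  # data near 0: decide H0
```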
869
00:49:56,200 --> 00:49:59,360
So in this particular picture,
for example the way it would
870
00:49:59,360 --> 00:50:02,930
go would be: the likelihood ratio
in this picture increases
871
00:50:02,930 --> 00:50:07,230
monotonically with my x. So
comparing the likelihood ratio
872
00:50:07,230 --> 00:50:10,150
to the threshold would be the
same as comparing my x to the
873
00:50:10,150 --> 00:50:12,890
threshold, and we've got
the question of how
874
00:50:12,890 --> 00:50:13,920
to choose the threshold.
875
00:50:13,920 --> 00:50:17,880
The way that the threshold is
chosen is usually done by
876
00:50:17,880 --> 00:50:21,560
fixing one of the two
probabilities of error.
877
00:50:21,560 --> 00:50:26,710
That is, I say, that I want my
error of one particular type
878
00:50:26,710 --> 00:50:30,160
to be a given number,
so I fix this alpha.
879
00:50:30,160 --> 00:50:33,160
And then I try to find where
my threshold should be.
880
00:50:33,160 --> 00:50:36,095
So that this probability there,
the probability out there,
881
00:50:36,095 --> 00:50:39,190
is just equal to alpha.
882
00:50:39,190 --> 00:50:42,050
And then the other probability
of error, beta, will be
883
00:50:42,050 --> 00:50:44,190
whatever it turns out to be.
884
00:50:44,190 --> 00:50:48,140
So somebody picks alpha
ahead of time.
885
00:50:48,140 --> 00:50:52,210
By setting the probability of
a false rejection equal to
886
00:50:52,210 --> 00:50:55,890
alpha, I find where my threshold
is going to be.
887
00:50:55,890 --> 00:50:59,890
I choose my threshold, and that
determines subsequently
888
00:50:59,890 --> 00:51:01,270
the value of beta.
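[As an aside, not part of the lecture: the threshold-picking step can be made concrete with assumed numbers. Suppose again X is normal with mean 0 under H0 and mean 2 under H1, unit variance, and we reject H0 when x exceeds a threshold t. Fixing the false-rejection probability alpha pins down t, and beta then comes out to be whatever it turns out to be.]

```python
from statistics import NormalDist

alpha = 0.05                        # chosen false-rejection probability
h0 = NormalDist(mu=0.0, sigma=1.0)  # distribution of X under H0 (assumed)
h1 = NormalDist(mu=2.0, sigma=1.0)  # distribution of X under H1 (assumed)

# Pick the threshold t so that P(X > t | H0) = alpha.
t = h0.inv_cdf(1 - alpha)

# Beta is then determined: P(X <= t | H1), the probability
# of failing to reject H0 when H1 is true.
beta = h1.cdf(t)

print(f"threshold t = {t:.3f}, beta = {beta:.3f}")
```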
889
00:51:01,270 --> 00:51:07,340
So we're going to continue with
this story next time, and
890
00:51:07,340 --> 00:51:08,590
we'll stop here.
891
00:51:08,590 --> 00:51:49,120