1
00:00:00,000 --> 00:00:00,040
2
00:00:00,040 --> 00:00:02,460
The following content is
provided under a Creative
3
00:00:02,460 --> 00:00:03,870
Commons license.
4
00:00:03,870 --> 00:00:06,910
Your support will help MIT
OpenCourseWare continue to
5
00:00:06,910 --> 00:00:10,560
offer high quality educational
resources for free.
6
00:00:10,560 --> 00:00:13,460
To make a donation or view
additional materials from
7
00:00:13,460 --> 00:00:19,290
hundreds of MIT courses, visit
MIT OpenCourseWare at
8
00:00:19,290 --> 00:00:22,160
ocw.mit.edu
9
00:00:22,160 --> 00:00:26,640
PROFESSOR: OK, if you have not
yet done it, please take a
10
00:00:26,640 --> 00:00:30,510
moment to go through the course
evaluation website and
11
00:00:30,510 --> 00:00:32,880
enter your comments
for the class.
12
00:00:32,880 --> 00:00:36,250
So what we're going to do today
to wrap things up is
13
00:00:36,250 --> 00:00:39,070
we're going to go through
a tour of the world of
14
00:00:39,070 --> 00:00:41,320
hypothesis testing.
15
00:00:41,320 --> 00:00:44,500
See a few examples of hypothesis
tests, starting
16
00:00:44,500 --> 00:00:48,280
from simple ones, such as the
setting that we
17
00:00:48,280 --> 00:00:51,220
discussed last time in which you
just have two hypotheses,
18
00:00:51,220 --> 00:00:53,130
you're trying to choose
between them.
19
00:00:53,130 --> 00:00:56,040
But also look at more
complicated situations in
20
00:00:56,040 --> 00:01:00,600
which you have one
basic hypothesis.
21
00:01:00,600 --> 00:01:03,720
Let's say that you have a fair
coin and you want to test it
22
00:01:03,720 --> 00:01:06,450
against the hypotheses that
your coin is not fair, but
23
00:01:06,450 --> 00:01:09,770
that alternative hypothesis is
really lots of different
24
00:01:09,770 --> 00:01:11,000
hypotheses.
25
00:01:11,000 --> 00:01:12,310
So is my coin fair?
26
00:01:12,310 --> 00:01:13,700
Is my die fair?
27
00:01:13,700 --> 00:01:15,980
Do I have the correct
distribution for a random
28
00:01:15,980 --> 00:01:17,510
variable, and so on.
29
00:01:17,510 --> 00:01:20,960
And I'm going to end up with a
few general comments about
30
00:01:20,960 --> 00:01:23,190
this whole business.
31
00:01:23,190 --> 00:01:28,370
So the setting in simple
hypothesis testing problems is
32
00:01:28,370 --> 00:01:28,990
the following--
33
00:01:28,990 --> 00:01:33,610
we have two possible models,
and this is the classical
34
00:01:33,610 --> 00:01:36,680
world so we do not have any
prior probabilities on the two
35
00:01:36,680 --> 00:01:37,850
hypotheses.
36
00:01:37,850 --> 00:01:41,340
Usually we want to think of
these hypotheses as not being
37
00:01:41,340 --> 00:01:44,730
completely symmetrical, but
rather one is the default
38
00:01:44,730 --> 00:01:48,180
hypothesis, and usually it's
referred to as the null
39
00:01:48,180 --> 00:01:49,630
hypothesis.
40
00:01:49,630 --> 00:01:53,400
And you want to check whether
the null hypothesis is true,
41
00:01:53,400 --> 00:01:57,170
whether things are normal as you
would have expected them
42
00:01:57,170 --> 00:02:00,900
to be, or whether it turns out
to be false, in which case an
43
00:02:00,900 --> 00:02:03,750
alternative hypothesis
would be correct.
44
00:02:03,750 --> 00:02:05,710
So how does one go about it?
45
00:02:05,710 --> 00:02:08,919
46
00:02:08,919 --> 00:02:12,720
No matter what approach you use,
in the end you're going
47
00:02:12,720 --> 00:02:14,220
to end up doing the following.
48
00:02:14,220 --> 00:02:17,620
You have the space of all possible
observations that you
49
00:02:17,620 --> 00:02:18,980
may obtain.
50
00:02:18,980 --> 00:02:21,630
So when you do the experiment
you're going to get an X
51
00:02:21,630 --> 00:02:25,050
vector, a vector of data
that's somewhere.
52
00:02:25,050 --> 00:02:27,760
And for some vectors you're
going to decide that you
53
00:02:27,760 --> 00:02:31,410
accept H0, and for some vectors
you decide that you reject H0 and
54
00:02:31,410 --> 00:02:33,160
you accept H1.
55
00:02:33,160 --> 00:02:37,100
So what you will end up doing
is that you're going to have
56
00:02:37,100 --> 00:02:42,130
some division of the space of
all X's into two parts, and
57
00:02:42,130 --> 00:02:45,660
one part is the rejection
region, and one part is the
58
00:02:45,660 --> 00:02:47,050
acceptance region.
59
00:02:47,050 --> 00:02:50,440
So if you fall in here you
accept H0, if you fall here
60
00:02:50,440 --> 00:02:53,240
you'd reject H0.
61
00:02:53,240 --> 00:02:57,750
So to design a hypothesis test
basically you need to come up
62
00:02:57,750 --> 00:03:03,360
with a division of your X
space into two pieces.
63
00:03:03,360 --> 00:03:08,770
So figuring out how to do
this involves two elements.
64
00:03:08,770 --> 00:03:12,640
One element is to decide what
kind of shape do I want for my
65
00:03:12,640 --> 00:03:14,740
dividing curve?
66
00:03:14,740 --> 00:03:18,240
And having chosen the shape of
the dividing curve, where
67
00:03:18,240 --> 00:03:20,540
exactly do I put it?
68
00:03:20,540 --> 00:03:23,980
So if you were to cut this
space using, let's say, a
69
00:03:23,980 --> 00:03:27,360
straight cut you might put it
here, or you might put it
70
00:03:27,360 --> 00:03:28,930
there, or you might
put it there.
71
00:03:28,930 --> 00:03:31,730
Where exactly are you
going to put it?
72
00:03:31,730 --> 00:03:33,530
So let's look at those
two steps.
73
00:03:33,530 --> 00:03:38,700
The first issue is to decide
the general shape of your
74
00:03:38,700 --> 00:03:43,440
rejection region, which is the
structure of your test.
75
00:03:43,440 --> 00:03:47,420
And the way this is done for the
case of two hypotheses is
76
00:03:47,420 --> 00:03:52,050
by writing down the likelihood
ratio between the two
77
00:03:52,050 --> 00:03:52,840
hypotheses.
78
00:03:52,840 --> 00:03:56,860
So let's call that quantity l of
X. It's something that you
79
00:03:56,860 --> 00:04:00,280
can compute given the
data that you have.
80
00:04:00,280 --> 00:04:04,660
A high value of l of X basically
means that this
81
00:04:04,660 --> 00:04:08,140
probability here tends to be
bigger than this probability.
82
00:04:08,140 --> 00:04:12,150
It means that the data that you
have seen are quite likely
83
00:04:12,150 --> 00:04:15,650
to have occurred under H1,
but less likely to have
84
00:04:15,650 --> 00:04:18,399
occurred under H0.
85
00:04:18,399 --> 00:04:22,360
So if you see data that are
more plausible, can be
86
00:04:22,360 --> 00:04:26,630
better explained, under H1, then
this ratio is big, and
87
00:04:26,630 --> 00:04:31,030
you're going to choose in favor
of H1 or reject H0.
88
00:04:31,030 --> 00:04:32,950
That's what you do if you
have discrete data.
89
00:04:32,950 --> 00:04:34,380
You use the PMFs.
90
00:04:34,380 --> 00:04:37,450
If you have densities, in the
case of continuous data, again
91
00:04:37,450 --> 00:04:42,740
you consider the ratio
of the two densities.
92
00:04:42,740 --> 00:04:47,250
So a big l of X is evidence
that your data are more
93
00:04:47,250 --> 00:04:51,570
compatible with H1
rather than H0.
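[Editor's note: to make the likelihood ratio concrete, here is a minimal sketch in Python for one illustrative choice of models -- i.i.d. normal data with unit variance, mean 0 under H0 and mean 1 under H1. These particular densities and the function name are assumptions for the example, not something fixed by the lecture.]

```python
from statistics import NormalDist

def likelihood_ratio(xs, mu0=0.0, mu1=1.0, sigma=1.0):
    """l(X) = f(X; H1) / f(X; H0) for i.i.d. normal data with a
    known, shared variance (illustrative model assumption).
    Large values mean the data are better explained under H1."""
    f0 = NormalDist(mu0, sigma)
    f1 = NormalDist(mu1, sigma)
    ratio = 1.0
    for x in xs:
        # Independent observations: the joint density is a product,
        # so the ratio of joint densities is a product of ratios.
        ratio *= f1.pdf(x) / f0.pdf(x)
    return ratio
```

Data clustered near the H1 mean drive l(X) above 1; data near the H0 mean drive it below 1, matching the discussion above.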
94
00:04:51,570 --> 00:04:59,140
Once you accept this kind of
structure then your decision
95
00:04:59,140 --> 00:05:02,920
is really made in terms
of that single number.
96
00:05:02,920 --> 00:05:06,270
That is, you had your data that
was some kind of vector,
97
00:05:06,270 --> 00:05:09,930
and you condense your data
into a single number-- a
98
00:05:09,930 --> 00:05:12,080
statistic as it's called--
99
00:05:12,080 --> 00:05:15,150
in this case the likelihood
ratio, and you put the
100
00:05:15,150 --> 00:05:19,880
dividing point somewhere
here, call it Xi.
101
00:05:19,880 --> 00:05:22,600
And in this region you
accept H1, in this
102
00:05:22,600 --> 00:05:25,940
region you accept H0.
103
00:05:25,940 --> 00:05:30,410
So by committing ourselves to
using the likelihood ratio in
104
00:05:30,410 --> 00:05:33,650
order to carry out the test
we have gone from this
105
00:05:33,650 --> 00:05:38,030
complicated picture of finding a
dividing line in x-space, to
106
00:05:38,030 --> 00:05:42,860
a simpler problem of just
finding a dividing point on
107
00:05:42,860 --> 00:05:45,280
the real line.
108
00:05:45,280 --> 00:05:46,960
OK, how are we going?
109
00:05:46,960 --> 00:05:51,290
So what's left to do is to
choose this threshold, Xi.
110
00:05:51,290 --> 00:05:53,920
Or as it's called, the
critical value,
111
00:05:53,920 --> 00:05:56,560
for making our decision.
112
00:05:56,560 --> 00:06:01,930
And you can place it anywhere,
but one way of deciding where
113
00:06:01,930 --> 00:06:03,240
to place it is the following--
114
00:06:03,240 --> 00:06:07,740
look at the distribution of this
random variable, l of X.
115
00:06:07,740 --> 00:06:11,760
It has a certain
under H0, and it
116
00:06:11,760 --> 00:06:16,210
has some other distribution
under H1.
117
00:06:16,210 --> 00:06:19,650
If I put my threshold here,
here's what's going to happen.
118
00:06:19,650 --> 00:06:24,360
When H0 is true, there is this
much probability that I'm
119
00:06:24,360 --> 00:06:27,360
going to end up making an
incorrect decision.
120
00:06:27,360 --> 00:06:31,000
If H0 is true there's still a
probability that my likelihood
121
00:06:31,000 --> 00:06:35,100
ratio will be bigger than Xi,
and that's the probability of
122
00:06:35,100 --> 00:06:38,590
making an incorrect decision
of this particular type.
123
00:06:38,590 --> 00:06:42,720
That is of making a false
rejection of H0.
124
00:06:42,720 --> 00:06:46,330
Usually one sets this
probability to a certain
125
00:06:46,330 --> 00:06:48,230
number, alpha.
126
00:06:48,230 --> 00:06:51,770
For example, alpha being 5%.
127
00:06:51,770 --> 00:06:55,680
And once you decide that you
want this to be 5%, that
128
00:06:55,680 --> 00:07:00,630
determines where this number
Xi is going to be.
129
00:07:00,630 --> 00:07:07,340
So the idea here is that I'm
going to reject H0 if the data
130
00:07:07,340 --> 00:07:12,350
that I have seen are quite
incompatible with H0 --
131
00:07:12,350 --> 00:07:16,860
if they're quite unlikely to
have occurred under H0.
132
00:07:16,860 --> 00:07:19,690
And I take this level, 5%.
133
00:07:19,690 --> 00:07:25,670
So I see my data and then I say
well if H0 was true, the
134
00:07:25,670 --> 00:07:29,380
probability that I would have
seen data of this kind would
135
00:07:29,380 --> 00:07:31,390
be less than 5%.
136
00:07:31,390 --> 00:07:35,390
Given that I saw those data,
that suggests that H0 is not
137
00:07:35,390 --> 00:07:37,860
true, and I end up
rejecting H0.
138
00:07:37,860 --> 00:07:40,770
139
00:07:40,770 --> 00:07:44,150
Now of course there's the
other type of error
140
00:07:44,150 --> 00:07:45,190
probability.
141
00:07:45,190 --> 00:07:50,550
If I put my threshold here, if
H1 is true but my likelihood
142
00:07:50,550 --> 00:07:53,470
ratio falls here I'm going
to make a mistake of
143
00:07:53,470 --> 00:07:55,250
the opposite kind.
144
00:07:55,250 --> 00:07:59,780
H1 is true, but my likelihood
ratio turned out to be small,
145
00:07:59,780 --> 00:08:02,370
and I decided in favor of H0.
146
00:08:02,370 --> 00:08:05,680
This is an error of the other
kind, this probability of
147
00:08:05,680 --> 00:08:08,030
error we call beta.
148
00:08:08,030 --> 00:08:10,070
And you can see that
there's a trade-off
149
00:08:10,070 --> 00:08:12,300
between alpha and beta.
150
00:08:12,300 --> 00:08:15,710
If you move your threshold this
way alpha becomes smaller,
151
00:08:15,710 --> 00:08:18,320
but beta becomes larger.
152
00:08:18,320 --> 00:08:22,120
And the general picture of
your trade-off, depending on
153
00:08:22,120 --> 00:08:25,970
where you put your threshold
is as follows--
154
00:08:25,970 --> 00:08:31,370
you can make this beta to be 0
if you put your threshold out
155
00:08:31,370 --> 00:08:34,809
here, but in that case you are
certain that you're going to
156
00:08:34,809 --> 00:08:37,000
make a mistake of the
opposite kind.
157
00:08:37,000 --> 00:08:42,360
So beta equals 0, alpha equals
1 is one possibility.
158
00:08:42,360 --> 00:08:46,420
Beta equals 1, alpha equals 0
is the other possibility if
159
00:08:46,420 --> 00:08:49,620
you send your threshold
completely to the other side.
160
00:08:49,620 --> 00:08:51,950
And in general you're going
to get a trade-off
161
00:08:51,950 --> 00:08:54,930
curve of some sort.
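[Editor's note: this trade-off curve is easy to see numerically. A small sketch, under the purely illustrative assumption that the scalar test statistic is N(0,1) under H0 and N(1,1) under H1, with H0 rejected when the statistic exceeds a threshold t.]

```python
from statistics import NormalDist

# Illustrative assumption: the test statistic is N(0,1) under H0
# and N(1,1) under H1; we reject H0 when the statistic exceeds t.
H0_DIST = NormalDist(0.0, 1.0)
H1_DIST = NormalDist(1.0, 1.0)

def error_probs(t):
    """Return (alpha, beta) for rejection threshold t."""
    alpha = 1.0 - H0_DIST.cdf(t)  # false rejection of H0
    beta = H1_DIST.cdf(t)         # false acceptance of H0
    return alpha, beta
```

Sliding t to the right drives alpha down and beta up, tracing out the trade-off curve; sending t to minus or plus infinity gives the (alpha, beta) = (1, 0) and (0, 1) endpoints mentioned in the lecture.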
162
00:08:54,930 --> 00:08:58,720
And if you want to use a
specific value of alpha, for
163
00:08:58,720 --> 00:09:04,030
example alpha being 0.05, then
that's going to determine for
164
00:09:04,030 --> 00:09:07,820
you the probability beta.
165
00:09:07,820 --> 00:09:11,410
Now there's a general, and quite
important theorem in
166
00:09:11,410 --> 00:09:13,640
statistics, which we
are not proving.
167
00:09:13,640 --> 00:09:17,500
And which tells us that when we
use likelihood ratio tests
168
00:09:17,500 --> 00:09:21,670
we get the best possible
trade-off curve.
169
00:09:21,670 --> 00:09:26,720
You could think of other ways
of making your decisions.
170
00:09:26,720 --> 00:09:30,780
Other ways of cutting off your
x-space into a rejection and
171
00:09:30,780 --> 00:09:32,090
acceptance region.
172
00:09:32,090 --> 00:09:36,050
But any other way that you do
it is going to end up with
173
00:09:36,050 --> 00:09:39,900
some probabilities of error
that are going to be above
174
00:09:39,900 --> 00:09:41,990
this particular curve.
175
00:09:41,990 --> 00:09:46,570
So the likelihood ratio test
turns out to give you the best
176
00:09:46,570 --> 00:09:49,200
possible way of dealing
with this trade-off
177
00:09:49,200 --> 00:09:50,750
between alpha and beta.
178
00:09:50,750 --> 00:09:54,090
We cannot minimize alpha and
beta simultaneously, there's a
179
00:09:54,090 --> 00:09:56,280
trade-off between them.
180
00:09:56,280 --> 00:10:02,420
But at least we would like to
have a test that deals with
181
00:10:02,420 --> 00:10:04,380
this trade-off in the
best possible way.
182
00:10:04,380 --> 00:10:07,770
For a given value of alpha we
want to have the smallest
183
00:10:07,770 --> 00:10:09,490
possible value of beta.
184
00:10:09,490 --> 00:10:13,900
And the theorem is that the
likelihood ratio tests do have
185
00:10:13,900 --> 00:10:15,240
this optimality property.
186
00:10:15,240 --> 00:10:18,270
For a given value of alpha they
minimize the probability
187
00:10:18,270 --> 00:10:20,610
of error of the other kind.
188
00:10:20,610 --> 00:10:23,380
So let's make all these concrete
and look at a
189
00:10:23,380 --> 00:10:24,680
simple example.
190
00:10:24,680 --> 00:10:27,980
We have two normal
distributions
191
00:10:27,980 --> 00:10:29,610
with different means.
192
00:10:29,610 --> 00:10:32,850
So under H0 you have
a mean of 0.
193
00:10:32,850 --> 00:10:36,790
Under H1 you have a mean of 1.
194
00:10:36,790 --> 00:10:40,810
You get your data, you actually
get several data
195
00:10:40,810 --> 00:10:43,770
drawn from one of the
two distributions.
196
00:10:43,770 --> 00:10:45,560
And you want to make a
decision, which one
197
00:10:45,560 --> 00:10:47,050
of the two is true?
198
00:10:47,050 --> 00:10:50,400
So what you do is you write
down the likelihood ratio.
199
00:10:50,400 --> 00:10:54,730
The density for a vector of
data, if that vector was
200
00:10:54,730 --> 00:10:57,490
generated according to H0 --
201
00:10:57,490 --> 00:11:00,470
which is this one, and the
density if it was generated
202
00:11:00,470 --> 00:11:02,810
according to H1.
203
00:11:02,810 --> 00:11:06,510
Since we have multiple data the
density of a vector is the
204
00:11:06,510 --> 00:11:09,830
product of the densities of
the individual elements.
205
00:11:09,830 --> 00:11:11,800
Since we're dealing with
normals we have those
206
00:11:11,800 --> 00:11:13,500
exponential factors.
207
00:11:13,500 --> 00:11:15,550
A product of exponentials
gives us an
208
00:11:15,550 --> 00:11:17,340
exponential of the sum.
209
00:11:17,340 --> 00:11:20,170
I'll spare you the details, but
this is the form of the
210
00:11:20,170 --> 00:11:21,230
likelihood ratio.
211
00:11:21,230 --> 00:11:23,960
The likelihood ratio test
tells us that we should
212
00:11:23,960 --> 00:11:28,360
calculate this quantity after we
get our data, and compare
213
00:11:28,360 --> 00:11:30,750
with a threshold.
214
00:11:30,750 --> 00:11:35,340
Now you can do some algebra
here, and simplify.
215
00:11:35,340 --> 00:11:39,150
And by tracing down the
inequalities, taking
216
00:11:39,150 --> 00:11:41,840
logarithms of both
sides, and so on.
217
00:11:41,840 --> 00:11:47,350
One comes to the conclusion that
using a test that has a
218
00:11:47,350 --> 00:11:52,150
threshold on this ratio is
equivalent to calculating this
219
00:11:52,150 --> 00:11:56,920
quantity, and comparing
it with a threshold.
220
00:11:56,920 --> 00:12:01,220
Basically this quantity here is
monotonic in that quantity.
221
00:12:01,220 --> 00:12:04,510
This being larger than the
threshold is equivalent to
222
00:12:04,510 --> 00:12:07,400
this being larger than
the threshold.
223
00:12:07,400 --> 00:12:10,310
So this tells us the general
structure of the likelihood
224
00:12:10,310 --> 00:12:12,770
ratio test in this
particular case.
225
00:12:12,770 --> 00:12:15,640
And it's nice because it tells
us that we can make our
226
00:12:15,640 --> 00:12:20,340
decisions by looking at this
simple summary of the data.
227
00:12:20,340 --> 00:12:23,810
This quantity, this summary of
the data on the basis of which
228
00:12:23,810 --> 00:12:29,130
we make our decision is
called a statistic.
229
00:12:29,130 --> 00:12:32,850
So you take your data, which is
a multi-dimensional vector,
230
00:12:32,850 --> 00:12:37,850
and you condense it to a single
number, and then you
231
00:12:37,850 --> 00:12:40,630
make a decision on the
basis of that number.
232
00:12:40,630 --> 00:12:42,750
So this is the structure
of the test.
233
00:12:42,750 --> 00:12:47,430
If I get a large sum of Xi's
this is evidence in favor of
234
00:12:47,430 --> 00:12:50,430
H1 because here the
mean is larger.
235
00:12:50,430 --> 00:12:54,990
And so I'm going to decide in
favor of H1 or reject H0 if
236
00:12:54,990 --> 00:12:56,650
the sum is bigger than
the threshold.
237
00:12:56,650 --> 00:12:58,750
How do I choose my threshold?
238
00:12:58,750 --> 00:13:01,080
Well I would like to choose
my threshold so that the
239
00:13:01,080 --> 00:13:04,990
probability of an incorrect
decision when H0 is true -- the
240
00:13:04,990 --> 00:13:09,980
probability of a false
rejection -- equals
241
00:13:09,980 --> 00:13:10,890
a certain number.
242
00:13:10,890 --> 00:13:14,400
Alpha, such as for
example 5%.
243
00:13:14,400 --> 00:13:19,210
So you're given here
that this is 5%.
244
00:13:19,210 --> 00:13:20,660
You know the distribution
of this random
245
00:13:20,660 --> 00:13:22,240
variable, it's normal.
246
00:13:22,240 --> 00:13:24,980
And you want to find the
threshold value that makes
247
00:13:24,980 --> 00:13:26,430
this to be true.
248
00:13:26,430 --> 00:13:28,300
So this is a type of problem
that you have
249
00:13:28,300 --> 00:13:29,360
seen several times.
250
00:13:29,360 --> 00:13:32,910
You go to the normal tables,
and you figure it out.
251
00:13:32,910 --> 00:13:35,790
So the sum of the Xi's has some
252
00:13:35,790 --> 00:13:38,160
distribution, it's normal.
253
00:13:38,160 --> 00:13:41,090
So that's the distribution
of the sum of the Xi's.
254
00:13:41,090 --> 00:13:44,620
And you want this probability
here to be alpha.
255
00:13:44,620 --> 00:13:49,520
For this to happen what is the
threshold value that makes
256
00:13:49,520 --> 00:13:50,870
this to be true?
257
00:13:50,870 --> 00:13:55,570
So you know how to solve
problems of this kind using
258
00:13:55,570 --> 00:13:58,420
the normal tables.
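[Editor's note: that normal-table lookup can be sketched in code. The unit variance under H0 and the function name are my assumptions for illustration; the transcript only fixes the means 0 and 1.]

```python
from statistics import NormalDist

def sum_critical_value(n, alpha=0.05, sigma=1.0):
    """Threshold gamma with P(X1 + ... + Xn > gamma) = alpha when
    the Xi are i.i.d. Normal(0, sigma^2) under H0.  The sum is then
    Normal(0, n * sigma^2), so gamma is its (1 - alpha) quantile --
    the same number you would read off the normal tables."""
    return NormalDist(0.0, sigma * n ** 0.5).inv_cdf(1.0 - alpha)
```

For n = 25 and alpha = 0.05 this gives about 5 x 1.645, roughly 8.22: reject H0 when the sum of the observations exceeds that value.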
259
00:13:58,420 --> 00:14:02,730
A slightly different example is
one in which you have two
260
00:14:02,730 --> 00:14:05,900
normal distributions that
have the same mean --
261
00:14:05,900 --> 00:14:07,580
let's take it to be 0 --
262
00:14:07,580 --> 00:14:10,580
but they have a different
variance.
263
00:14:10,580 --> 00:14:15,080
So it's sort of natural that
here, if your X's that you see
264
00:14:15,080 --> 00:14:19,880
are kind of big on either side,
you would choose H1.
265
00:14:19,880 --> 00:14:23,500
If your X's are near 0 then
that's evidence for the
266
00:14:23,500 --> 00:14:27,120
smaller variance, and you
would choose H0.
267
00:14:27,120 --> 00:14:30,740
So to proceed formally you again
write down the form
268
00:14:30,740 --> 00:14:33,190
of the likelihood ratio.
269
00:14:33,190 --> 00:14:39,780
So again the density of an X
vector under H0 is this one.
270
00:14:39,780 --> 00:14:41,680
It's the product of
the densities of
271
00:14:41,680 --> 00:14:43,410
each one of the Xi's.
272
00:14:43,410 --> 00:14:47,030
Product of normal densities
gives you a product of
273
00:14:47,030 --> 00:14:50,180
exponentials, which is
exponential of the sum, and
274
00:14:50,180 --> 00:14:52,070
that's the expression
that you get.
275
00:14:52,070 --> 00:14:54,560
Under the other hypothesis
the only thing that
276
00:14:54,560 --> 00:14:56,530
changes is the variance.
277
00:14:56,530 --> 00:14:59,800
And the variance, in the normal
distribution, shows up
278
00:14:59,800 --> 00:15:02,970
here in the denominator
of the exponent.
279
00:15:02,970 --> 00:15:04,560
So you put it there.
280
00:15:04,560 --> 00:15:07,390
So this is the general structure
of the likelihood
281
00:15:07,390 --> 00:15:08,650
ratio test.
282
00:15:08,650 --> 00:15:10,400
And now you do some algebra.
283
00:15:10,400 --> 00:15:14,110
These terms are constants, so
comparing this ratio to a
284
00:15:14,110 --> 00:15:17,190
constant is the same as just
comparing the ratio of the
285
00:15:17,190 --> 00:15:19,050
exponentials to a constant.
286
00:15:19,050 --> 00:15:23,710
Then you take logarithms, you
want to compare the logarithm
287
00:15:23,710 --> 00:15:25,650
of this thing to a constant.
288
00:15:25,650 --> 00:15:28,210
You do a little bit of algebra,
and in the end you
289
00:15:28,210 --> 00:15:32,180
find that the structure of the
test is to reject H0 if the
290
00:15:32,180 --> 00:15:37,740
sum of the squares of the Xi's
is bigger than the threshold.
291
00:15:37,740 --> 00:15:41,360
So by committing to a likelihood
ratio test you are
292
00:15:41,360 --> 00:15:45,060
told that you should be making
your decision according to
293
00:15:45,060 --> 00:15:46,940
a rule of this type.
294
00:15:46,940 --> 00:15:51,450
So this fixes the shape or the
structure of the decision
295
00:15:51,450 --> 00:15:53,670
region, of the rejection
region.
296
00:15:53,670 --> 00:15:56,660
And the only thing that's left,
once more, is to pick
297
00:15:56,660 --> 00:16:00,190
this threshold in order to have
the property that the
298
00:16:00,190 --> 00:16:05,340
probability of a false rejection
is equal to, say, 5%.
299
00:16:05,340 --> 00:16:09,490
So that's the probability that
H0 is true, but the sum of the
300
00:16:09,490 --> 00:16:11,450
squares accidentally
happens to be
301
00:16:11,450 --> 00:16:13,080
bigger than my threshold.
302
00:16:13,080 --> 00:16:17,330
In which case I end
up deciding H1.
303
00:16:17,330 --> 00:16:21,570
How do I find the value
of Xi prime?
304
00:16:21,570 --> 00:16:25,150
Well what I need to do is to
look at the picture, more or
305
00:16:25,150 --> 00:16:29,100
less of this kind, but now
I need to look at the
306
00:16:29,100 --> 00:16:32,870
distribution of the sum
of the Xi's squared.
307
00:16:32,870 --> 00:16:36,190
Actually the sum of the Xi's
squared is a non-negative
308
00:16:36,190 --> 00:16:37,580
random variable.
309
00:16:37,580 --> 00:16:40,280
So it's going to have a
distribution that's
310
00:16:40,280 --> 00:16:44,910
something like this.
311
00:16:44,910 --> 00:16:50,540
I look at that distribution, and
once more I want this tail
312
00:16:50,540 --> 00:16:54,300
probability to be alpha, and
that determines where my
313
00:16:54,300 --> 00:16:56,370
threshold is going to be.
314
00:16:56,370 --> 00:17:00,595
So that's again a simple
exercise provided that you
315
00:17:00,595 --> 00:17:03,650
know the distribution
of this quantity.
316
00:17:03,650 --> 00:17:05,540
Do you know it?
317
00:17:05,540 --> 00:17:08,980
Well we don't really know it,
we have not dealt with this
318
00:17:08,980 --> 00:17:11,859
particular distribution
in this class.
319
00:17:11,859 --> 00:17:15,730
But in principle you should be
able to find what it is.
320
00:17:15,730 --> 00:17:18,459
It's a derived distribution
problem.
321
00:17:18,459 --> 00:17:22,920
You know the distribution
of Xi, it's normal.
322
00:17:22,920 --> 00:17:26,410
Therefore, by solving a derived
distribution problem
323
00:17:26,410 --> 00:17:30,400
you can find the distribution
of Xi squared.
324
00:17:30,400 --> 00:17:34,180
And the Xi squared's are
independent of each other,
325
00:17:34,180 --> 00:17:36,400
because the Xi's are
independent.
326
00:17:36,400 --> 00:17:39,190
So you want to find the
distribution of the sum of
327
00:17:39,190 --> 00:17:41,750
random variables with
known distributions.
328
00:17:41,750 --> 00:17:44,410
And since they're independent,
in principle, you can do this
329
00:17:44,410 --> 00:17:46,470
using the convolution formula.
330
00:17:46,470 --> 00:17:49,720
So in principle, and if you're
patient enough, you will be
331
00:17:49,720 --> 00:17:52,830
able to find the distribution
of this random variable.
332
00:17:52,830 --> 00:17:57,430
And then you plot it or tabulate
it, and find where
333
00:17:57,430 --> 00:18:02,870
exactly is the 95th percentile
of that distribution, and that
334
00:18:02,870 --> 00:18:05,290
determines your threshold.
335
00:18:05,290 --> 00:18:08,310
So this distribution actually
turns out to have a nice and
336
00:18:08,310 --> 00:18:11,000
simple closed-form formula.
337
00:18:11,000 --> 00:18:13,740
Because this is a pretty common
test, people have
338
00:18:13,740 --> 00:18:15,220
tabulated that distribution.
339
00:18:15,220 --> 00:18:17,370
It's called the chi-square
distribution.
340
00:18:17,370 --> 00:18:19,512
There are tables available
for it.
341
00:18:19,512 --> 00:18:23,390
And you look up in the tables,
you find the 95th percentile
342
00:18:23,390 --> 00:18:25,900
of the distribution,
and this way you
343
00:18:25,900 --> 00:18:28,280
determine your threshold.
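[Editor's note: if you did not have chi-square tables, the derived-distribution step can be approximated by simulation, in the spirit of the "in principle you can find the distribution" remark above. A sketch -- the function name and sample sizes are mine -- that estimates the (1 - alpha) percentile of the sum of squares of n standard normals.]

```python
import random

def sum_sq_critical_value(n, alpha=0.05, trials=100_000, seed=0):
    """Monte Carlo estimate of the (1 - alpha) percentile of
    X1^2 + ... + Xn^2 for i.i.d. standard normal Xi under H0 --
    i.e. the chi-square critical value with n degrees of freedom."""
    rng = random.Random(seed)
    samples = sorted(
        sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))
        for _ in range(trials)
    )
    # The empirical (1 - alpha) quantile of the simulated sums.
    return samples[int((1.0 - alpha) * trials)]
```

For n = 1 the estimate should land near 3.84 (the square of 1.96), matching the tabulated chi-square value; the tables simply save you this computation.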
344
00:18:28,280 --> 00:18:31,140
So what's the moral
of the story?
345
00:18:31,140 --> 00:18:34,800
The structure of the likelihood
ratio test tells
346
00:18:34,800 --> 00:18:40,470
you what kind of decision region
you're going to have.
347
00:18:40,470 --> 00:18:42,880
It tells you that for this
particular test you should be
348
00:18:42,880 --> 00:18:46,360
using the sum of the Xi
squared's as your statistic,
349
00:18:46,360 --> 00:18:48,460
as the basis for making
your decision.
350
00:18:48,460 --> 00:18:51,840
And then you need to solve a
derived distribution problem
351
00:18:51,840 --> 00:18:53,110
to find the probability
352
00:18:53,110 --> 00:18:55,500
distribution of your statistic.
353
00:18:55,500 --> 00:19:00,290
Find the distribution of this
quantity under H0, and
354
00:19:00,290 --> 00:19:03,000
finally, based on that
distribution, after you have
355
00:19:03,000 --> 00:19:05,330
derived it, then determine
your threshold.
356
00:19:05,330 --> 00:19:08,240
357
00:19:08,240 --> 00:19:10,360
So now let's move
on to a somewhat
358
00:19:10,360 --> 00:19:13,090
more complicated situation.
359
00:19:13,090 --> 00:19:18,090
You have a coin, and you
are told that I tried
360
00:19:18,090 --> 00:19:21,040
to make a fair coin.
361
00:19:21,040 --> 00:19:22,450
Is it fair?
362
00:19:22,450 --> 00:19:25,200
So you have the hypothesis,
which is the default--
363
00:19:25,200 --> 00:19:26,320
the null hypothesis--
364
00:19:26,320 --> 00:19:27,890
that the coin is fair.
365
00:19:27,890 --> 00:19:29,690
But maybe it isn't.
366
00:19:29,690 --> 00:19:31,880
So you have the alternative
hypothesis that
367
00:19:31,880 --> 00:19:34,030
your coin is not fair.
368
00:19:34,030 --> 00:19:36,690
Now what's different in this
context is that your
369
00:19:36,690 --> 00:19:41,830
alternative hypothesis is not
just one specific hypothesis.
370
00:19:41,830 --> 00:19:45,990
Your alternative hypothesis
consists of many alternatives.
371
00:19:45,990 --> 00:19:49,270
It includes the hypothesis
that p is 0.6.
372
00:19:49,270 --> 00:19:53,930
It includes the hypothesis
that p is 0.51.
373
00:19:53,930 --> 00:19:58,850
It includes the hypothesis that
p is 0.48, and so on.
374
00:19:58,850 --> 00:20:05,030
So you're testing this
hypothesis versus all this
375
00:20:05,030 --> 00:20:08,070
family of alternative
hypotheses.
376
00:20:08,070 --> 00:20:11,080
What you will end up doing is
essentially the following--
377
00:20:11,080 --> 00:20:12,480
you get some data.
378
00:20:12,480 --> 00:20:15,080
That is, you flip the coin
a number of times.
379
00:20:15,080 --> 00:20:17,640
Let's say you flip
it 1,000 times.
380
00:20:17,640 --> 00:20:20,290
You observe some outcome.
381
00:20:20,290 --> 00:20:24,580
Let's say you saw 472 heads.
382
00:20:24,580 --> 00:20:31,650
And you ask the question: if
this hypothesis is true, is
383
00:20:31,650 --> 00:20:35,790
this value really possible
under that hypothesis?
384
00:20:35,790 --> 00:20:39,450
Or would it be very much
of an outlier?
385
00:20:39,450 --> 00:20:44,220
If it looks like an extreme
outlier under this hypothesis
386
00:20:44,220 --> 00:20:47,780
then I reject it, and I accept
the alternative.
387
00:20:47,780 --> 00:20:50,800
If this number turns out to be
something within the range
388
00:20:50,800 --> 00:20:56,690
that you would have expected
then you keep, or accept your
389
00:20:56,690 --> 00:20:59,080
null hypothesis.
390
00:20:59,080 --> 00:21:03,200
OK so what does it mean to
be an outlier or not?
391
00:21:03,200 --> 00:21:05,430
First you take your data,
and you condense
392
00:21:05,430 --> 00:21:07,220
them to a single number.
393
00:21:07,220 --> 00:21:10,240
So your detailed data actually
would have been a sequence of
394
00:21:10,240 --> 00:21:12,440
heads/tails, heads/tails
and all that.
395
00:21:12,440 --> 00:21:16,370
Any reasonable person would tell
you that you shouldn't
396
00:21:16,370 --> 00:21:19,430
really care about the exact
sequence of heads and tails.
397
00:21:19,430 --> 00:21:22,570
Let's just base our decision on
the number of heads that we
398
00:21:22,570 --> 00:21:24,380
have observed.
399
00:21:24,380 --> 00:21:28,870
So using some kind of reasoning
which could be
400
00:21:28,870 --> 00:21:33,650
mathematical, or intuitive,
or involving artistry--
401
00:21:33,650 --> 00:21:38,400
you pick a one-dimensional, or
scalar summary of the data
402
00:21:38,400 --> 00:21:39,450
that you have seen.
403
00:21:39,450 --> 00:21:42,250
In this case, the summary of the
data is just the number of
404
00:21:42,250 --> 00:21:44,330
heads -- that's a quite
reasonable one.
405
00:21:44,330 --> 00:21:47,880
And so you commit yourself to
make a decision on the basis
406
00:21:47,880 --> 00:21:49,080
of this quantity.
407
00:21:49,080 --> 00:21:52,670
And you ask: the quantity that
I'm seeing, does it look like
408
00:21:52,670 --> 00:21:53,680
an outlier?
409
00:21:53,680 --> 00:21:57,710
Or does it look more
or less OK?
410
00:21:57,710 --> 00:22:00,540
OK, what does it mean
to be an outlier?
411
00:22:00,540 --> 00:22:04,900
You want to choose the shape of
this rejection region, but
412
00:22:04,900 --> 00:22:08,750
on the basis of that
single number s.
413
00:22:08,750 --> 00:22:11,240
And again, the reasonable thing
to do in this context
414
00:22:11,240 --> 00:22:15,170
would be to argue as follows--
if my coin is fair I expect to
415
00:22:15,170 --> 00:22:16,850
see n over 2 heads.
416
00:22:16,850 --> 00:22:18,540
That's the expected value.
417
00:22:18,540 --> 00:22:23,330
If the number of heads I see
is far from the expected
418
00:22:23,330 --> 00:22:26,030
number of heads then I consider
419
00:22:26,030 --> 00:22:27,750
this to be an outlier.
420
00:22:27,750 --> 00:22:30,470
So if this number is bigger
than some threshold Xi,
421
00:22:30,470 --> 00:22:33,600
I consider it to be an outlier,
and then I'm going to
422
00:22:33,600 --> 00:22:36,100
reject my hypothesis.
423
00:22:36,100 --> 00:22:38,930
So we picked our statistic.
424
00:22:38,930 --> 00:22:44,990
We picked the general form of
how we're going to make our
425
00:22:44,990 --> 00:22:50,000
decision, and then we pick a
certain significance, or
426
00:22:50,000 --> 00:22:51,690
confidence level that we want.
427
00:22:51,690 --> 00:22:54,470
Again, this famous 5% number.
428
00:22:54,470 --> 00:22:58,310
And we're going to declare
something to be an outlier if
429
00:22:58,310 --> 00:23:01,380
it lies in the region
that has 5% or less
430
00:23:01,380 --> 00:23:03,270
probability of occurring.
431
00:23:03,270 --> 00:23:07,560
That is I'm picking my rejection
region so that if H0
432
00:23:07,560 --> 00:23:11,870
is true--the default, or
null hypothesis--there's only
433
00:23:11,870 --> 00:23:17,380
5% chance that by accident I
fall there, and that outcome
434
00:23:17,380 --> 00:23:21,540
makes me think that H1
is going to be true.
435
00:23:21,540 --> 00:23:25,690
436
00:23:25,690 --> 00:23:28,770
So now what's left to
do is to pick the
437
00:23:28,770 --> 00:23:30,920
value of this threshold.
438
00:23:30,920 --> 00:23:34,410
This is a calculation
of the usual kind.
439
00:23:34,410 --> 00:23:39,580
I want to pick my threshold,
my Xi number so that the
440
00:23:39,580 --> 00:23:44,150
probability that s is further
from the mean by an amount of
441
00:23:44,150 --> 00:23:47,200
Xi is less than 5%.
442
00:23:47,200 --> 00:23:50,630
Or that the probability
of being inside
443
00:23:50,630 --> 00:23:52,300
the acceptance region--
444
00:23:52,300 --> 00:23:55,240
so that the distance
from the default is
445
00:23:55,240 --> 00:23:56,380
less than my threshold.
446
00:23:56,380 --> 00:23:59,880
I want that to be 95%.
447
00:23:59,880 --> 00:24:04,380
So this is an equality that you
can get using the central
448
00:24:04,380 --> 00:24:06,760
limit theorem and the
normal tables.
449
00:24:06,760 --> 00:24:10,230
There's 95% probability that the
number of heads is going
450
00:24:10,230 --> 00:24:14,920
to be within 31 from
the correct mean.
451
00:24:14,920 --> 00:24:17,910
So the way the exercise is done
of course, is that we
452
00:24:17,910 --> 00:24:20,640
start with this number, 5%.
453
00:24:20,640 --> 00:24:24,410
Which translates to
this number 95%.
454
00:24:24,410 --> 00:24:27,960
And once we have fixed that
number then you ask the
455
00:24:27,960 --> 00:24:34,370
question what number should
we have here to make this
456
00:24:34,370 --> 00:24:36,500
equality true?
457
00:24:36,500 --> 00:24:39,360
It's again a problem
of this kind.
458
00:24:39,360 --> 00:24:42,820
You have a quantity whose
distribution you know.
459
00:24:42,820 --> 00:24:43,950
Why do you know it?
460
00:24:43,950 --> 00:24:46,390
The number of heads by the
central limit theorem is
461
00:24:46,390 --> 00:24:47,970
approximately normal.
462
00:24:47,970 --> 00:24:51,560
So this here talks about the
normal distribution.
463
00:24:51,560 --> 00:24:56,330
You set your alpha to be 5%, and
you ask where should I put
464
00:24:56,330 --> 00:24:59,690
my threshold so that this
probability of being out there
465
00:24:59,690 --> 00:25:01,530
is only 5%?
466
00:25:01,530 --> 00:25:03,750
Now in our particular example
the threshold
467
00:25:03,750 --> 00:25:05,970
turned out to be 31.
468
00:25:05,970 --> 00:25:09,170
This number turned out
to be just 28 away
469
00:25:09,170 --> 00:25:10,960
from the correct mean.
470
00:25:10,960 --> 00:25:14,150
So this distance was less
than the threshold.
471
00:25:14,150 --> 00:25:17,280
So we end up not rejecting H0.
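The threshold calculation described above can be sketched numerically. This is a minimal illustration, assuming n = 1000 coin flips and the normal approximation from the central limit theorem; 1.96 is the standard two-sided 5% critical point.

```python
from math import sqrt
from statistics import NormalDist

n = 1000                     # number of coin flips in the example
alpha = 0.05                 # desired false-rejection probability
sigma = sqrt(n * 0.5 * 0.5)  # std dev of the head count under H0 (fair coin)
z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value, about 1.96

xi = z * sigma               # reject H0 if |S - n/2| exceeds xi
print(round(xi))             # about 31, matching the lecture's threshold
```

With 28 excess heads observed, as in the lecture, the distance from n/2 stays below this threshold, so H0 is not rejected.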
472
00:25:17,280 --> 00:25:20,430
473
00:25:20,430 --> 00:25:23,820
So we have our rejection
region.
474
00:25:23,820 --> 00:25:28,900
The way we designed it is that
when H0 is true there's only a
475
00:25:28,900 --> 00:25:32,960
small chance, 5%, that we get
to data out of there.
476
00:25:32,960 --> 00:25:35,510
Data that we would
call an outlier.
477
00:25:35,510 --> 00:25:39,330
If we see such an outlier
we reject H0.
478
00:25:39,330 --> 00:25:43,930
If what we see is not an outlier
as in this case, where
479
00:25:43,930 --> 00:25:47,090
that distance turned out to
be kind of small, then we
480
00:25:47,090 --> 00:25:50,980
do not reject H0.
481
00:25:50,980 --> 00:25:54,700
An interesting little piece
of language here, people
482
00:25:54,700 --> 00:25:57,490
generally prefer to use
this terminology--
483
00:25:57,490 --> 00:26:01,820
to say that H0 is not rejected
by the data.
484
00:26:01,820 --> 00:26:06,490
Instead of saying that
H0 is accepted.
485
00:26:06,490 --> 00:26:09,260
In some sense they're both
saying the same thing, but the
486
00:26:09,260 --> 00:26:11,940
difference is sort of subtle.
487
00:26:11,940 --> 00:26:17,240
When I say not rejected what I
mean is that I got some data
488
00:26:17,240 --> 00:26:20,560
that are compatible with
my hypothesis.
489
00:26:20,560 --> 00:26:26,470
That is the data that I got do
not falsify the hypothesis
490
00:26:26,470 --> 00:26:29,520
that I had, my null
hypothesis.
491
00:26:29,520 --> 00:26:34,500
So my null hypothesis is still
alive, and may be true.
492
00:26:34,500 --> 00:26:38,700
But from data you can never
really prove that the
493
00:26:38,700 --> 00:26:41,360
hypothesis is correct.
494
00:26:41,360 --> 00:26:46,190
Perhaps my coin is not fair in
some other complicated way.
495
00:26:46,190 --> 00:26:51,660
496
00:26:51,660 --> 00:26:55,980
Perhaps I was just lucky, and
even though my coin is not
497
00:26:55,980 --> 00:26:58,930
fair I ended up with
an outcome that
498
00:26:58,930 --> 00:27:01,270
suggests that it's fair.
499
00:27:01,270 --> 00:27:04,600
Perhaps my coin flips are
not independent as I
500
00:27:04,600 --> 00:27:06,020
assumed in my model.
501
00:27:06,020 --> 00:27:11,860
So there's many ways that my
null hypothesis could be
502
00:27:11,860 --> 00:27:15,010
wrong, and still I got data
that tells me that my
503
00:27:15,010 --> 00:27:16,970
hypothesis is OK.
504
00:27:16,970 --> 00:27:20,980
So this is the general way that
things work in science.
505
00:27:20,980 --> 00:27:24,340
One comes up with a
model or a theory.
506
00:27:24,340 --> 00:27:28,480
This is the default theory, and
we work with that theory
507
00:27:28,480 --> 00:27:31,100
trying to find whether
there are examples
508
00:27:31,100 --> 00:27:32,450
that violate the theory.
509
00:27:32,450 --> 00:27:35,550
If you find data and examples
that violate the theory your
510
00:27:35,550 --> 00:27:38,560
theory is falsified, and you
need to look for a new one.
511
00:27:38,560 --> 00:27:43,090
But when you have your theory,
really no amount of data can
512
00:27:43,090 --> 00:27:45,810
prove that your theory
is correct.
513
00:27:45,810 --> 00:27:49,950
So we have the default theory
that the speed of light is
514
00:27:49,950 --> 00:27:54,620
constant as long as we do not
find any data that runs
515
00:27:54,620 --> 00:27:56,210
counter to it.
516
00:27:56,210 --> 00:27:59,650
We stay with that theory, but
there's no way of really
517
00:27:59,650 --> 00:28:03,710
proving this, no matter how
many experiments we do.
518
00:28:03,710 --> 00:28:06,590
But there could be experiments
that falsify that theory, in
519
00:28:06,590 --> 00:28:10,580
which case we need to
look for a new one.
520
00:28:10,580 --> 00:28:14,450
So there's a bit of an asymmetry
here in how we treat
521
00:28:14,450 --> 00:28:16,510
the alternative hypothesis.
522
00:28:16,510 --> 00:28:22,900
H0 is the default which we'll
accept until we see some
523
00:28:22,900 --> 00:28:25,350
evidence to the contrary.
524
00:28:25,350 --> 00:28:30,170
And if we see some evidence to
the contrary we reject it.
525
00:28:30,170 --> 00:28:33,580
As long as we do not see
evidence to the contrary then
526
00:28:33,580 --> 00:28:35,940
we keep working with it,
but always take it
527
00:28:35,940 --> 00:28:38,200
with a grain of salt.
528
00:28:38,200 --> 00:28:42,210
You can never really prove that
a coin has a bias exactly
529
00:28:42,210 --> 00:28:43,860
equal to 1/2.
530
00:28:43,860 --> 00:28:50,360
Maybe the bias is equal
to 0.50001, so
531
00:28:50,360 --> 00:28:52,440
the bias is not 1/2.
532
00:28:52,440 --> 00:28:56,180
But with an experiment with
1,000 coin tosses you wouldn't
533
00:28:56,180 --> 00:28:59,200
be able to see this effect.
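To see why a bias of 0.50001 is invisible in 1,000 tosses, compare the expected shift in the head count to its standard deviation. A rough back-of-the-envelope sketch:

```python
from math import sqrt

n = 1000
p = 0.50001              # true bias, barely different from 1/2
shift = n * p - n * 0.5  # expected excess heads under the true bias
sigma = sqrt(n * 0.25)   # std dev of the head count, about 15.8

# The expected shift (0.01 heads) is buried in noise of size ~15.8,
# so no test based on 1,000 tosses can detect this bias.
print(shift, sigma)
```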
534
00:28:59,200 --> 00:29:03,750
535
00:29:03,750 --> 00:29:07,870
OK, so that's how you go about
testing about whether your
536
00:29:07,870 --> 00:29:09,120
coin is fair.
537
00:29:09,120 --> 00:29:13,150
You can also think about testing
whether a die is fair.
538
00:29:13,150 --> 00:29:17,130
So for a die the null hypothesis
would be that every
539
00:29:17,130 --> 00:29:21,830
possible result when you roll
the die has equal probability
540
00:29:21,830 --> 00:29:23,860
and equal to 1/6.
541
00:29:23,860 --> 00:29:27,720
And you also make the hypothesis
that your die rolls
542
00:29:27,720 --> 00:29:30,900
are statistically independent
from each other.
543
00:29:30,900 --> 00:29:36,050
So I take my die, I roll it a
number of times, little n, and
544
00:29:36,050 --> 00:29:40,240
I count how many 1's I got, how
many 2's I got, how many
545
00:29:40,240 --> 00:29:43,430
3's I got, and these
are my data.
546
00:29:43,430 --> 00:29:48,400
I count how many times I
observed a specific result in
547
00:29:48,400 --> 00:29:51,660
my die roll that was
equal to some i.
548
00:29:51,660 --> 00:29:53,410
And now I ask the question--
549
00:29:53,410 --> 00:29:58,050
the Ni's that I observed, are
they compatible with my
550
00:29:58,050 --> 00:30:01,000
hypothesis or not?
551
00:30:01,000 --> 00:30:05,560
What does compatible with
my hypothesis mean?
552
00:30:05,560 --> 00:30:12,570
Under the null hypothesis Ni
should be approximately equal,
553
00:30:12,570 --> 00:30:17,750
or is equal in expectation
to N times little Pi.
554
00:30:17,750 --> 00:30:23,170
And in our example this little
Pi is of course 1/6.
555
00:30:23,170 --> 00:30:28,210
So if my die is fair the number
of ones I expect to see
556
00:30:28,210 --> 00:30:31,110
is equal to the number
of rolls times 1/6.
557
00:30:31,110 --> 00:30:35,070
The number of 2's I expect to
see is again that same number.
558
00:30:35,070 --> 00:30:37,970
Of course there's randomness,
so I do not expect to get
559
00:30:37,970 --> 00:30:39,420
exactly that number.
560
00:30:39,420 --> 00:30:42,420
But I can ask how far
away from the
561
00:30:42,420 --> 00:30:45,380
expected values were the Ni's?
562
00:30:45,380 --> 00:30:51,470
If my capital Ni's turn out to be
very different from N/6 this
563
00:30:51,470 --> 00:30:55,110
is evidence that my
die is not fair.
564
00:30:55,110 --> 00:31:01,000
If those numbers turn out to be
close to N times 1/6 then
565
00:31:01,000 --> 00:31:05,180
I'm going to say there's no
evidence that would lead me to
566
00:31:05,180 --> 00:31:06,870
reject this hypothesis.
567
00:31:06,870 --> 00:31:10,850
So this hypothesis
remains alive.
568
00:31:10,850 --> 00:31:16,390
So someone has come up with this
thought that maybe the
569
00:31:16,390 --> 00:31:20,730
right statistic to use, or the
right way of quantifying how
570
00:31:20,730 --> 00:31:23,910
far away are the Ni's from
their mean is to
571
00:31:23,910 --> 00:31:25,590
look at this quantity.
572
00:31:25,590 --> 00:31:29,520
So I'm looking at the expected
value of Ni under the null
573
00:31:29,520 --> 00:31:30,700
hypothesis.
574
00:31:30,700 --> 00:31:34,760
See what I got, take the square
of this, and add it
575
00:31:34,760 --> 00:31:36,040
over all i's.
576
00:31:36,040 --> 00:31:40,930
But also throw in this term
in the denominator.
577
00:31:40,930 --> 00:31:46,010
And why that term is there,
that's a longer story.
578
00:31:46,010 --> 00:31:49,740
One can write down certain
likelihood ratios, do certain
579
00:31:49,740 --> 00:31:53,010
Taylor Series approximations,
and there's a Heuristic
580
00:31:53,010 --> 00:31:58,120
argument that justifies why this
would be a good form for
581
00:31:58,120 --> 00:31:59,810
the test to use.
582
00:31:59,810 --> 00:32:02,660
So there's a certain art that's
involved in this step
583
00:32:02,660 --> 00:32:06,370
that some people somehow decided
that it's a reasonable
584
00:32:06,370 --> 00:32:08,730
thing to do is to calculate.
585
00:32:08,730 --> 00:32:12,300
Once you get your results, you
calculate this one-dimensional
586
00:32:12,300 --> 00:32:16,740
summary of your result, this is
going to be your statistic,
587
00:32:16,740 --> 00:32:19,550
and compare that statistic
to a threshold.
588
00:32:19,550 --> 00:32:21,680
And that's how you make
your decision.
589
00:32:21,680 --> 00:32:27,310
So by this point we have fixed
the type of the rejection
590
00:32:27,310 --> 00:32:29,740
region that we're
going to have.
591
00:32:29,740 --> 00:32:32,780
So we've chosen the qualitative
structure of our
592
00:32:32,780 --> 00:32:36,230
test, and the only thing that's
now left is to choose
593
00:32:36,230 --> 00:32:38,820
the particular threshold
we're going to use.
594
00:32:38,820 --> 00:32:41,550
And the recipe, once
more, is the same.
595
00:32:41,550 --> 00:32:44,840
We want to set our threshold so
that the probability of a
596
00:32:44,840 --> 00:32:47,320
false rejection is 5%.
597
00:32:47,320 --> 00:32:52,040
We want the probability that our
data fall in here to be only
598
00:32:52,040 --> 00:32:55,990
5% when the null hypothesis
is true.
599
00:32:55,990 --> 00:33:01,040
So that's the same as setting
our threshold Xi so that the
600
00:33:01,040 --> 00:33:03,940
probability that our
test statistic is
601
00:33:03,940 --> 00:33:05,960
bigger than that threshold.
602
00:33:05,960 --> 00:33:11,470
We want that probability
to be only 0.05.
603
00:33:11,470 --> 00:33:15,140
So to solve a problem of
this kind what is it
604
00:33:15,140 --> 00:33:16,820
that you need to do?
605
00:33:16,820 --> 00:33:19,490
You need to find the probability
distribution of
606
00:33:19,490 --> 00:33:23,810
capital T. So once more
it's the same picture.
607
00:33:23,810 --> 00:33:26,370
608
00:33:26,370 --> 00:33:32,200
You need to do some calculations
of some sort, and
609
00:33:32,200 --> 00:33:36,550
come up with the distribution
of the random variable T,
610
00:33:36,550 --> 00:33:39,060
where T is defined this way.
611
00:33:39,060 --> 00:33:41,400
You want to find this
distribution
612
00:33:41,400 --> 00:33:43,190
under hypothesis H0.
613
00:33:43,190 --> 00:33:48,820
614
00:33:48,820 --> 00:33:53,780
Once you find what that
distribution is then you can
615
00:33:53,780 --> 00:33:55,480
solve this usual problem.
616
00:33:55,480 --> 00:33:58,470
I want this probability
here to be 5%.
617
00:33:58,470 --> 00:34:01,860
What should my threshold be?
618
00:34:01,860 --> 00:34:03,930
So what does this
boil down to?
619
00:34:03,930 --> 00:34:08,510
Finding the distribution of
capital T is in some sense a
620
00:34:08,510 --> 00:34:13,350
messy, difficult, derived
distribution problem.
621
00:34:13,350 --> 00:34:16,239
From this model we know
the distribution
622
00:34:16,239 --> 00:34:17,489
of the capital Ni's.
623
00:34:17,489 --> 00:34:20,290
624
00:34:20,290 --> 00:34:23,800
And actually we can even write
down the joint distribution of
625
00:34:23,800 --> 00:34:26,840
the capital Ni's.
626
00:34:26,840 --> 00:34:29,690
In fact we can make an
approximation here.
627
00:34:29,690 --> 00:34:33,219
Capital Ni is a binomial
random variable.
628
00:34:33,219 --> 00:34:39,790
Let's say the number of 1's that
I got in little N rolls
629
00:34:39,790 --> 00:34:41,090
of my die.
630
00:34:41,090 --> 00:34:43,300
So that's a binomial
random variable.
631
00:34:43,300 --> 00:34:45,860
When little n is big
this is going to be
632
00:34:45,860 --> 00:34:48,040
approximately normal.
633
00:34:48,040 --> 00:34:52,060
So we have normal random
variables, or approximately
634
00:34:52,060 --> 00:34:54,260
normal minus a constant.
635
00:34:54,260 --> 00:34:55,770
They're still approximately
normal.
636
00:34:55,770 --> 00:35:01,070
We take the squares of these,
scale them so you can solve a
637
00:35:01,070 --> 00:35:03,730
derived distribution problem
to find the distribution of
638
00:35:03,730 --> 00:35:04,930
this quantity.
639
00:35:04,930 --> 00:35:08,550
You can do more work, more
derived distribution work, and
640
00:35:08,550 --> 00:35:12,080
find the distribution of
capital T. So this is a
641
00:35:12,080 --> 00:35:17,500
tedious matter, but because this
test is used quite often,
642
00:35:17,500 --> 00:35:20,080
again people have done
those calculations.
643
00:35:20,080 --> 00:35:23,600
They have found the distribution
of capital T, and
644
00:35:23,600 --> 00:35:25,250
it's available in tables.
645
00:35:25,250 --> 00:35:29,090
And you go to those tables, and
you find the appropriate
646
00:35:29,090 --> 00:35:31,370
threshold for making a decision
of this type.
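As a concrete sketch of this test, the code below uses hypothetical counts from 600 die rolls (illustrative data, not from the lecture); the 5% critical value 11.07 for 5 degrees of freedom is the kind of tabulated number those tables provide.

```python
# Hypothetical counts of each face in n = 600 die rolls (illustrative data)
counts = [95, 108, 92, 110, 99, 96]
n = sum(counts)
expected = n / 6            # n * p_i with p_i = 1/6 under H0

# Pearson statistic: T = sum_i (N_i - n p_i)^2 / (n p_i)
T = sum((c - expected) ** 2 / expected for c in counts)

threshold = 11.07           # tabulated 5% critical value, 5 degrees of freedom
print(T, "reject H0" if T > threshold else "do not reject H0")
```

Here T works out to 2.7, well below the threshold, so these hypothetical data would not lead us to reject the fair-die hypothesis.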
647
00:35:31,370 --> 00:35:36,160
648
00:35:36,160 --> 00:35:40,720
Now to give you a sense of how
complicated hypothesis one
649
00:35:40,720 --> 00:35:47,190
might have to deal with let's
make things one level more
650
00:35:47,190 --> 00:35:48,370
complicated.
651
00:35:48,370 --> 00:35:55,200
So here you can think this X is
a discrete random variable.
652
00:35:55,200 --> 00:35:57,770
This is the outcome
of my roll.
653
00:35:57,770 --> 00:36:02,760
And I had a model in which the
possible values of my discrete
654
00:36:02,760 --> 00:36:06,030
random variables they
have probabilities
655
00:36:06,030 --> 00:36:07,870
all equal to 1/6.
656
00:36:07,870 --> 00:36:13,280
So my null hypothesis here was
a particular PMF for the
657
00:36:13,280 --> 00:36:17,810
random variable capital X. So
another way of phrasing what
658
00:36:17,810 --> 00:36:19,950
happened in this
problem was the
659
00:36:19,950 --> 00:36:24,700
question is my PMF correct?
660
00:36:24,700 --> 00:36:30,580
So this is the PMF of the
result of one die roll.
661
00:36:30,580 --> 00:36:33,950
You're asking the question
is my PMF correct?
662
00:36:33,950 --> 00:36:36,740
Make it more complicated.
663
00:36:36,740 --> 00:36:41,510
How about the question of the
type is my PDF correct when I
664
00:36:41,510 --> 00:36:45,220
have continuous data?
665
00:36:45,220 --> 00:36:50,900
So I have hypothesized that
the probability distribution
666
00:36:50,900 --> 00:36:54,780
that I have is, let's say,
a particular normal.
667
00:36:54,780 --> 00:36:58,990
I get lots of results from
that random variable.
668
00:36:58,990 --> 00:37:04,450
Can I tell whether my results
look like normal or not?
669
00:37:04,450 --> 00:37:06,650
What are some ways of
going about it?
670
00:37:06,650 --> 00:37:09,450
Well, we saw in the previous
slide that there is a
671
00:37:09,450 --> 00:37:13,110
methodology for deciding
if your PMF is correct.
672
00:37:13,110 --> 00:37:19,090
So you could take your normal
results, the data that you got
673
00:37:19,090 --> 00:37:23,200
from your experiment, and
discretize them, and so now
674
00:37:23,200 --> 00:37:25,500
you're dealing with
discrete data.
675
00:37:25,500 --> 00:37:31,200
And sort of use the previous
methodology to solve a
676
00:37:31,200 --> 00:37:34,900
discrete problem of the type
is my PMF correct?
677
00:37:34,900 --> 00:37:41,320
So in practice the way this is
done is that you get all your
678
00:37:41,320 --> 00:37:49,920
data, let's say data points
of this kind.
679
00:37:49,920 --> 00:37:56,400
You split your space into bins,
and you count how many
680
00:37:56,400 --> 00:38:00,190
you have in each bin.
681
00:38:00,190 --> 00:38:07,180
So you get this, and that,
and that, and nothing.
682
00:38:07,180 --> 00:38:10,020
So that's a histogram that
you get from the
683
00:38:10,020 --> 00:38:11,020
data that you have.
684
00:38:11,020 --> 00:38:14,670
Like the very familiar
histograms that you see after
685
00:38:14,670 --> 00:38:16,860
each one of our quizzes.
686
00:38:16,860 --> 00:38:21,760
So you look at this
histogram, and you ask does it
687
00:38:21,760 --> 00:38:24,060
look normal?
688
00:38:24,060 --> 00:38:27,700
OK, we need a systematic
way of going about it.
689
00:38:27,700 --> 00:38:33,140
If it were normal you can
calculate the probability of
690
00:38:33,140 --> 00:38:36,760
falling in this interval.
691
00:38:36,760 --> 00:38:39,120
The probability of falling in
that interval, probability of
692
00:38:39,120 --> 00:38:40,890
falling into that interval.
693
00:38:40,890 --> 00:38:45,480
So you would have expected
values of how many results, or
694
00:38:45,480 --> 00:38:48,210
data points, you would have
in this interval.
695
00:38:48,210 --> 00:38:52,170
And compare these expected
values for each interval with
696
00:38:52,170 --> 00:38:54,830
the actual ones that
you observed.
697
00:38:54,830 --> 00:38:58,290
And then take the sum of
squares, and so on, exactly as
698
00:38:58,290 --> 00:38:59,700
in the previous slide.
699
00:38:59,700 --> 00:39:03,010
And this gives you a way
of going about it.
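A minimal sketch of this binning procedure, assuming simulated standard-normal data and an arbitrary choice of bin edges (both illustrative, not from the lecture):

```python
import random
from statistics import NormalDist

random.seed(1)
n = 500
data = [random.gauss(0, 1) for _ in range(n)]  # simulated sample

edges = [-2.0, -1.0, 0.0, 1.0, 2.0]            # illustrative bin edges
F = NormalDist().cdf                           # hypothesized CDF under H0

# Observed counts: bin k holds the points between edges[k-1] and edges[k]
observed = [0] * (len(edges) + 1)
for x in data:
    observed[sum(x > e for e in edges)] += 1

# Expected counts under H0: n times the probability of each bin
probs = ([F(edges[0])]
         + [F(b) - F(a) for a, b in zip(edges, edges[1:])]
         + [1 - F(edges[-1])])
expected = [n * p for p in probs]

# Same sum-of-squares statistic as in the discrete case
T = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(sum(observed), round(sum(probs), 6))     # sanity checks: 500 and 1.0
```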
700
00:39:03,010 --> 00:39:07,060
701
00:39:07,060 --> 00:39:09,710
This is a little messy.
702
00:39:09,710 --> 00:39:14,530
It gets hard to do because you
have the difficult decision of
703
00:39:14,530 --> 00:39:19,180
how do you choose
the bin size?
704
00:39:19,180 --> 00:39:22,430
If you take your bins to be very
narrow you would get lots
705
00:39:22,430 --> 00:39:25,680
of bins with 0's, and a few
bins that only have one
706
00:39:25,680 --> 00:39:26,840
outcome in them.
707
00:39:26,840 --> 00:39:29,120
It probably wouldn't
feel right.
708
00:39:29,120 --> 00:39:32,110
If you choose your bins to be
very wide then you're losing a
709
00:39:32,110 --> 00:39:33,680
lot of information.
710
00:39:33,680 --> 00:39:39,240
Is there some way of making a
test without creating bins?
711
00:39:39,240 --> 00:39:43,330
This is just to illustrate
the clever ideas of what
712
00:39:43,330 --> 00:39:45,640
statisticians have
thought about.
713
00:39:45,640 --> 00:39:51,960
And here's a really cute way of
going about a test, whether
714
00:39:51,960 --> 00:39:53,750
my distribution is
correct or not.
715
00:39:53,750 --> 00:39:56,980
716
00:39:56,980 --> 00:40:00,790
Here we're essentially
plotting a PMF, or an
717
00:40:00,790 --> 00:40:02,630
approximation of a PDF.
718
00:40:02,630 --> 00:40:06,040
And we ask does it look like
the PDF we assumed?
719
00:40:06,040 --> 00:40:09,930
Instead of working with PDFs
let's work with cumulative
720
00:40:09,930 --> 00:40:11,800
distribution functions.
721
00:40:11,800 --> 00:40:13,840
So how does this go?
722
00:40:13,840 --> 00:40:20,160
The true normal distribution
that I have hypothesized, the
723
00:40:20,160 --> 00:40:22,310
density that I'm hypothesizing--
my null
724
00:40:22,310 --> 00:40:23,350
hypothesis--
725
00:40:23,350 --> 00:40:26,950
has a certain CDF
that I can plot.
726
00:40:26,950 --> 00:40:36,820
So suppose that my hypothesis
H0 is that the X's
727
00:40:36,820 --> 00:40:42,630
are standard normals, and I
plot the CDF of the standard
728
00:40:42,630 --> 00:40:46,360
normal, which is the sort of
continuous looking curve here.
729
00:40:46,360 --> 00:40:53,310
Now I get my data, and I
plot the empirical CDF.
730
00:40:53,310 --> 00:40:54,930
What's the empirical CDF?
731
00:40:54,930 --> 00:40:59,830
In the empirical CDF you ask the
question what fraction of
732
00:40:59,830 --> 00:41:02,940
the data fell below 0?
733
00:41:02,940 --> 00:41:04,450
You get a number.
734
00:41:04,450 --> 00:41:07,920
What fraction of my
data fell below 1?
735
00:41:07,920 --> 00:41:08,730
I get a number.
736
00:41:08,730 --> 00:41:12,590
What fraction of my data fell
below 2, and so on.
737
00:41:12,590 --> 00:41:15,780
So you're talking about
fractions of the data that
738
00:41:15,780 --> 00:41:18,760
fell below each particular
number.
739
00:41:18,760 --> 00:41:21,640
And by plotting those fractions
as a function of
740
00:41:21,640 --> 00:41:26,740
this number you get something
that looks like a CDF.
741
00:41:26,740 --> 00:41:31,670
And it's the CDF suggested
by the data.
742
00:41:31,670 --> 00:41:35,800
Now the fraction of the data
that fall below 0 in my
743
00:41:35,800 --> 00:41:38,530
experiment is--
744
00:41:38,530 --> 00:41:43,280
if my hypothesis were true--
745
00:41:43,280 --> 00:41:46,470
expected to be 1/2.
746
00:41:46,470 --> 00:41:49,280
1/2 is the value of
the true CDF.
747
00:41:49,280 --> 00:41:51,730
I look at the fraction
that I got, it's
748
00:41:51,730 --> 00:41:54,470
expected to be that number.
749
00:41:54,470 --> 00:41:56,800
But there's randomness, so
it might be a little
750
00:41:56,800 --> 00:41:58,300
different than that.
751
00:41:58,300 --> 00:42:03,490
For any particular value, the
fraction that I got below a
752
00:42:03,490 --> 00:42:04,350
certain number--
753
00:42:04,350 --> 00:42:09,970
the fraction of data that
were below 2, its
754
00:42:09,970 --> 00:42:15,310
expectation is the probability
of falling below 2, which is
755
00:42:15,310 --> 00:42:16,740
the correct CDF.
756
00:42:16,740 --> 00:42:21,060
So if my hypothesis is true the
empirical CDF that I get
757
00:42:21,060 --> 00:42:24,900
based on data should, when
n is large, be very
758
00:42:24,900 --> 00:42:27,100
close to the true CDF.
759
00:42:27,100 --> 00:42:31,350
So a way of judging whether my
model is correct or not is to
760
00:42:31,350 --> 00:42:38,300
look at the assumed CDF, the
CDF under hypothesis H0.
761
00:42:38,300 --> 00:42:41,880
Look at the CDF that I
constructed based on the data,
762
00:42:41,880 --> 00:42:45,440
and see whether they're
close enough or not.
763
00:42:45,440 --> 00:42:48,150
And by close enough, I mean I'm
going to look at all the
764
00:42:48,150 --> 00:42:52,000
possible X's, and look at the
maximum distance between those
765
00:42:52,000 --> 00:42:53,300
two curves.
766
00:42:53,300 --> 00:42:59,140
And I'm going to have a test
that decides in favor of H0 if
767
00:42:59,140 --> 00:43:03,550
this distance is small,
and in favor of H1 if
768
00:43:03,550 --> 00:43:06,110
this distance is large.
769
00:43:06,110 --> 00:43:07,790
That still leaves me
the problem of
770
00:43:07,790 --> 00:43:09,570
coming up with a threshold.
771
00:43:09,570 --> 00:43:13,180
Where exactly do I
put my threshold?
772
00:43:13,180 --> 00:43:17,230
Because this test is important
enough, and is used frequently
773
00:43:17,230 --> 00:43:20,990
people have made the effort
to try to understand the
774
00:43:20,990 --> 00:43:23,240
probability distribution
of this quite
775
00:43:23,240 --> 00:43:25,280
difficult random variable.
776
00:43:25,280 --> 00:43:28,220
One needs to do lots of
approximations and clever
777
00:43:28,220 --> 00:43:32,550
calculations, but these have
led to values and tabulated
778
00:43:32,550 --> 00:43:34,570
values for the probability
distribution
779
00:43:34,570 --> 00:43:36,210
of this random variable.
780
00:43:36,210 --> 00:43:39,340
And, for example, those
tabulated values tell us that
781
00:43:39,340 --> 00:43:45,030
if we want 5% false rejection
probability, then our
782
00:43:45,030 --> 00:43:48,860
threshold should be 1.36
divided by the
783
00:43:48,860 --> 00:43:50,570
square root of n.
784
00:43:50,570 --> 00:43:53,870
So we know where to put
our threshold for
785
00:43:53,870 --> 00:43:55,280
this particular value.
786
00:43:55,280 --> 00:43:59,680
If we want this particular
error
787
00:43:59,680 --> 00:44:02,380
probability to occur.
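The whole Kolmogorov-Smirnov-style procedure can be sketched as follows; the data here are simulated, and 1.36 divided by the square root of n is the tabulated asymptotic 5% critical value just mentioned.

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(0)
n = 200
data = sorted(random.gauss(0, 1) for _ in range(n))  # simulated sample
F0 = NormalDist().cdf                                # hypothesized CDF under H0

# KS statistic: the largest gap between the empirical CDF (a step
# function jumping by 1/n at each data point) and the hypothesized F0
D = max(max(abs((i + 1) / n - F0(x)), abs(i / n - F0(x)))
        for i, x in enumerate(data))

threshold = 1.36 / sqrt(n)  # tabulated asymptotic 5% critical value
print("reject H0" if D > threshold else "do not reject H0")
```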
788
00:44:02,380 --> 00:44:06,320
So that's about as hard and
sophisticated classical
789
00:44:06,320 --> 00:44:08,070
statistics get.
790
00:44:08,070 --> 00:44:12,920
You want to have tests for
hypotheses that are not so
791
00:44:12,920 --> 00:44:15,910
easy to handle.
792
00:44:15,910 --> 00:44:21,260
People somehow think of
clever ways of doing
793
00:44:21,260 --> 00:44:22,500
tests of this kind.
794
00:44:22,500 --> 00:44:26,970
How to compare the theoretical
predictions
795
00:44:26,970 --> 00:44:29,650
with the
observed data.
796
00:44:29,650 --> 00:44:34,430
Come up with some measure of the
difference between theory
797
00:44:34,430 --> 00:44:38,270
and data, and if that difference
is big, then you
798
00:44:38,270 --> 00:44:39,520
reject your hypothesis.
799
00:44:39,520 --> 00:44:42,340
800
00:44:42,340 --> 00:44:45,640
OK, of course that's not
the end of the field of
801
00:44:45,640 --> 00:44:49,000
statistics, there's
a lot more.
802
00:44:49,000 --> 00:44:52,000
In some ways, as we kept
moving through today's
803
00:44:52,000 --> 00:44:55,240
lecture, the way that we
constructed those rejection
804
00:44:55,240 --> 00:44:57,680
regions was more and
more ad hoc.
805
00:44:57,680 --> 00:45:02,220
I pulled out of a hat a
particular measure of fit
806
00:45:02,220 --> 00:45:04,980
between data and the model.
807
00:45:04,980 --> 00:45:09,470
And I said let's just use
a test based on this.
808
00:45:09,470 --> 00:45:13,890
There are attempts at more or
less systematic ways of coming
809
00:45:13,890 --> 00:45:17,350
up with the general shape of
rejection regions that have at
810
00:45:17,350 --> 00:45:20,540
least some desirable or
favorable theoretical
811
00:45:20,540 --> 00:45:21,790
properties.
812
00:45:21,790 --> 00:45:24,620
813
00:45:24,620 --> 00:45:28,300
Some more specific problems
that people study--
814
00:45:28,300 --> 00:45:31,690
instead of having a test,
is this the correct PDF?
815
00:45:31,690 --> 00:45:33,140
Yes or no.
816
00:45:33,140 --> 00:45:37,670
I just give you data, and I
ask you tell me, give me a
817
00:45:37,670 --> 00:45:41,270
model or a PDF for those data.
818
00:45:41,270 --> 00:45:45,000
OK, methods of this kind
are of many types.
819
00:45:45,000 --> 00:45:50,640
One general method is you form a
histogram, and then you take
820
00:45:50,640 --> 00:45:54,570
your histogram and plot a smooth
line, that kind of fits
821
00:45:54,570 --> 00:45:55,680
the histogram.
822
00:45:55,680 --> 00:45:59,140
This still leaves the question
of how do you choose the bins?
823
00:45:59,140 --> 00:46:00,780
The bin size in your
histograms.
824
00:46:00,780 --> 00:46:02,620
How narrow do you take them?
825
00:46:02,620 --> 00:46:05,920
And that depends on how many
data you have, and there's a
826
00:46:05,920 --> 00:46:09,190
lot of theory that tells you
about the best way of choosing
827
00:46:09,190 --> 00:46:12,890
the bin sizes, and the best
ways of smoothing the data
828
00:46:12,890 --> 00:46:14,640
that you have.
829
00:46:14,640 --> 00:46:18,090
A completely different topic
is in signal processing --
830
00:46:18,090 --> 00:46:20,200
you want to do your inference.
831
00:46:20,200 --> 00:46:22,810
Not only do you want it to be
but you also want it to
832
00:46:22,810 --> 00:46:25,520
be fast in a computational
way.
833
00:46:25,520 --> 00:46:28,010
You get data in real
time, lots of data.
834
00:46:28,010 --> 00:46:31,330
You want to keep processing and
revising your estimates
835
00:46:31,330 --> 00:46:35,220
and your decisions as
they come and go.
836
00:46:35,220 --> 00:46:38,950
Another topic that was briefly
touched upon in the last couple
837
00:46:38,950 --> 00:46:43,010
of lectures is that when you set
up a model, like a linear
838
00:46:43,010 --> 00:46:46,540
regression model, you choose
some explanatory variables,
839
00:46:46,540 --> 00:46:50,230
and you try to predict y from
your X, these variables.
840
00:46:50,230 --> 00:46:52,720
You have a choice of
what to take as
841
00:46:52,720 --> 00:46:55,440
your explanatory variables.
842
00:46:55,440 --> 00:47:02,560
Are there systematic ways of
picking the right X variables
843
00:47:02,560 --> 00:47:04,520
to try to estimate Y?
844
00:47:04,520 --> 00:47:08,360
For example, should I try to
estimate Y on the basis of X?
845
00:47:08,360 --> 00:47:10,320
Or on the basis of X-squared?
846
00:47:10,320 --> 00:47:12,960
How do I decide between
the two?
847
00:47:12,960 --> 00:47:17,000
Finally, the rage these days has
to do with anything big,
848
00:47:17,000 --> 00:47:18,490
high-dimensional.
849
00:47:18,490 --> 00:47:23,410
Complicated models of
complicated things, and tons
850
00:47:23,410 --> 00:47:24,650
and tons of data.
851
00:47:24,650 --> 00:47:27,430
So these days data are
generated everywhere.
852
00:47:27,430 --> 00:47:30,230
The amounts of data
are humongous.
853
00:47:30,230 --> 00:47:33,120
Also, the problems that people
are interested in tend to be
854
00:47:33,120 --> 00:47:35,500
very complicated with
lots of parameters.
855
00:47:35,500 --> 00:47:39,800
So I need specially tailored
methods that can give you good
856
00:47:39,800 --> 00:47:44,220
results, or decent results even
in the face of these huge
857
00:47:44,220 --> 00:47:47,290
amounts of data, and possibly
with computational
858
00:47:47,290 --> 00:47:48,310
constraints.
859
00:47:48,310 --> 00:47:50,720
So with huge amounts of data
you want methods that are
860
00:47:50,720 --> 00:47:56,460
simple but can still deliver
meaningful answers for you.
861
00:47:56,460 --> 00:48:00,170
Now as I mentioned some time
ago, this whole field of
862
00:48:00,170 --> 00:48:03,960
statistics is very different
from the field of probability.
863
00:48:03,960 --> 00:48:06,530
In some sense all that we're
doing in statistics is
864
00:48:06,530 --> 00:48:08,100
probabilistic calculations.
865
00:48:08,100 --> 00:48:10,360
That's what the theory
kind of does.
866
00:48:10,360 --> 00:48:12,870
But there's a big
element of art.
867
00:48:12,870 --> 00:48:16,550
You saw that we chose the shape
of some decision regions
868
00:48:16,550 --> 00:48:19,840
or rejection regions in
a somewhat ad hoc way.
869
00:48:19,840 --> 00:48:21,660
There's even more
basic things.
870
00:48:21,660 --> 00:48:23,260
How do you organize your data?
871
00:48:23,260 --> 00:48:26,690
How do you think about which
hypotheses you would like to
872
00:48:26,690 --> 00:48:28,300
test, and so on.
873
00:48:28,300 --> 00:48:31,710
There's a lot of art that's
involved here, and there's a
874
00:48:31,710 --> 00:48:33,510
lot that can go wrong.
875
00:48:33,510 --> 00:48:36,630
So I'm going to close with a
note that you can take either
876
00:48:36,630 --> 00:48:39,050
as pessimistic or optimistic.
877
00:48:39,050 --> 00:48:42,880
There is a famous paper that
came out a few years ago and
878
00:48:42,880 --> 00:48:46,440
has been cited about
1,000 times or so.
879
00:48:46,440 --> 00:48:50,110
And the title of the paper is
Why Most Published Research
880
00:48:50,110 --> 00:48:51,850
Findings Are False.
881
00:48:51,850 --> 00:48:56,080
And it's actually a very good
argument why, in fields like
882
00:48:56,080 --> 00:48:59,900
psychology or medical
science and so on, a lot of
883
00:48:59,900 --> 00:49:01,160
what you see published--
884
00:49:01,160 --> 00:49:03,410
that yes, this drug
has an effect on
885
00:49:03,410 --> 00:49:05,000
that particular disease--
886
00:49:05,000 --> 00:49:08,030
is actually false, because
people do not do their
887
00:49:08,030 --> 00:49:09,780
statistics correctly.
888
00:49:09,780 --> 00:49:12,130
There's lots of biases
in what people do.
889
00:49:12,130 --> 00:49:16,300
I mean an obvious bias is that
you only published a result
890
00:49:16,300 --> 00:49:19,190
when you see something.
891
00:49:19,190 --> 00:49:22,770
So the null hypothesis is that
the drug doesn't work.
892
00:49:22,770 --> 00:49:26,820
You do your tests, the drug
didn't work, OK, you just go
893
00:49:26,820 --> 00:49:27,960
home and cry.
894
00:49:27,960 --> 00:49:33,380
But if by accident that 5%
event happens -- even though the
895
00:49:33,380 --> 00:49:37,320
drug doesn't work, you got
some outlier data, and it
896
00:49:37,320 --> 00:49:38,760
seemed to be working.
897
00:49:38,760 --> 00:49:40,990
Then you're excited,
you publish it.
898
00:49:40,990 --> 00:49:42,760
So that's clearly a bias.
899
00:49:42,760 --> 00:49:46,980
That gets results to be
published, even though they do
900
00:49:46,980 --> 00:49:50,330
not have a solid foundation
behind them.
901
00:49:50,330 --> 00:49:53,050
Then there's another
thing, OK?
902
00:49:53,050 --> 00:49:55,440
I'm picking my 5%.
903
00:49:55,440 --> 00:49:59,940
So if H0 is true, there's a small
probability that the data will
904
00:49:59,940 --> 00:50:04,160
look like an outlier,
and in that case I
905
00:50:04,160 --> 00:50:06,270
published my result.
906
00:50:06,270 --> 00:50:08,160
OK it's only 5% --
907
00:50:08,160 --> 00:50:10,300
it's not going to happen
too often.
908
00:50:10,300 --> 00:50:15,200
But suppose that I go and do
1,000 different tests?
909
00:50:15,200 --> 00:50:18,540
Test H0 against this hypothesis,
test H0 against
910
00:50:18,540 --> 00:50:22,000
that hypothesis, test H0
against that hypothesis.
911
00:50:22,000 --> 00:50:26,230
Some of these tests, just by
accident might turn out to be
912
00:50:26,230 --> 00:50:29,350
in favor of H1, and
again these are
913
00:50:29,350 --> 00:50:31,170
selected to be published.
914
00:50:31,170 --> 00:50:35,720
So if you do lots and lots of
tests and in each one you have
915
00:50:35,720 --> 00:50:38,980
a 5% probability of error,
when you consider the
916
00:50:38,980 --> 00:50:41,980
collection of all those tests,
actually the probability of
917
00:50:41,980 --> 00:50:46,940
making incorrect inferences
is a lot more than 5%.
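The arithmetic behind this inflation can be checked directly. A minimal sketch, assuming the 1,000 tests are independent under H0 (an independence assumption the lecture does not state):

```python
# Significance level of each individual test, and the number of
# tests, as in the lecture's example.
alpha = 0.05
n_tests = 1000

# Probability of at least one false positive across all the tests,
# assuming the tests are independent under H0.
p_at_least_one = 1 - (1 - alpha) ** n_tests
print(p_at_least_one > 0.999)  # essentially certain: True

# Expected number of tests that "succeed" purely by chance.
print(alpha * n_tests)  # 50.0
```

So even though each individual test errs only 5% of the time, running enough of them all but guarantees some spurious "findings" -- which is the selection effect driving the publication-bias argument.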
918
00:50:46,940 --> 00:50:51,400
One basic principle in being
systematic about such studies
919
00:50:51,400 --> 00:50:55,950
is that you should first pick
your hypothesis that you're
920
00:50:55,950 --> 00:50:59,230
going to test, then get
your data, and do
921
00:50:59,230 --> 00:51:00,880
your hypothesis testing.
922
00:51:00,880 --> 00:51:05,640
What would be wrong is to get
your data, look at them, and
923
00:51:05,640 --> 00:51:08,890
say OK I'm going now to test
for these 100 different
924
00:51:08,890 --> 00:51:13,060
hypotheses, and I'm going to
choose my hypotheses to be the
925
00:51:13,060 --> 00:51:16,580
features that look abnormal
in my data.
926
00:51:16,580 --> 00:51:19,520
Well, given enough data, you
can always find some
927
00:51:19,520 --> 00:51:21,650
abnormalities just by chance.
928
00:51:21,650 --> 00:51:24,380
And if you choose to make
a statistical test--
929
00:51:24,380 --> 00:51:26,710
is this abnormality present?
930
00:51:26,710 --> 00:51:28,090
Yes, it will be present.
931
00:51:28,090 --> 00:51:31,020
Because you first found the
abnormality, and then you
932
00:51:31,020 --> 00:51:32,130
tested for it.
933
00:51:32,130 --> 00:51:35,210
So that's another way that
things can go wrong.
934
00:51:35,210 --> 00:51:37,520
So the moral of this story is
that the world of
935
00:51:37,520 --> 00:51:40,200
probability is really
beautiful and solid: you have
936
00:51:40,200 --> 00:51:40,960
your axioms.
937
00:51:40,960 --> 00:51:44,630
Every question has a unique
answer that by now you can,
938
00:51:44,630 --> 00:51:48,250
all of you, find in a
very reliable way.
939
00:51:48,250 --> 00:51:50,740
Statistics is a dirty and
difficult business.
940
00:51:50,740 --> 00:51:53,010
And that's why the subject
is not over.
941
00:51:53,010 --> 00:51:55,430
And if you're interested in
it, it's worth taking
942
00:51:55,430 --> 00:51:58,920
follow-on courses in
that direction.
943
00:51:58,920 --> 00:52:03,950
OK so good luck in the
final, do well, and have a
944
00:52:03,950 --> 00:52:05,200
nice vacation afterwards.
945
00:52:05,200 --> 00:52:06,260