WEBVTT
00:00:00.530 --> 00:00:02.960
The following content is
provided under a Creative
00:00:02.960 --> 00:00:04.370
Commons license.
00:00:04.370 --> 00:00:07.410
Your support will help MIT
OpenCourseWare continue to
00:00:07.410 --> 00:00:11.060
offer high-quality educational
resources for free.
00:00:11.060 --> 00:00:13.960
To make a donation or view
additional materials from
00:00:13.960 --> 00:00:19.790
hundreds of MIT courses, visit
MIT OpenCourseWare at
00:00:19.790 --> 00:00:21.040
ocw.mit.edu.
00:00:24.000 --> 00:00:26.220
PROFESSOR: OK, I guess we might
as well start a minute
00:00:26.220 --> 00:00:29.830
early since those of you
who are here are here.
00:00:32.580 --> 00:00:36.090
We're coming to the
end of the course.
00:00:36.090 --> 00:00:41.630
We're deep in chapter 7 now
talking about random walks and
00:00:41.630 --> 00:00:44.580
detection theory.
00:00:44.580 --> 00:00:48.450
We'll get into martingales
sometime next week.
00:00:48.450 --> 00:00:52.090
There are four more lectures
after this one.
00:00:52.090 --> 00:00:55.290
The schedule was passed out at
the beginning of the term.
00:00:55.290 --> 00:01:00.050
I don't know how I did it, but
I somehow left off the last
00:01:00.050 --> 00:01:02.860
Wednesday of class.
00:01:02.860 --> 00:01:04.519
The final is going to
be on Wednesday
00:01:04.519 --> 00:01:06.630
morning at the ice rink.
00:01:06.630 --> 00:01:08.260
I don't know what the
ice rink is like.
00:01:08.260 --> 00:01:14.140
It doesn't sound like an ideal
place to take a final, but I
00:01:14.140 --> 00:01:17.340
assume they must have desks
there and all that stuff.
00:01:19.860 --> 00:01:22.700
We will send out a notice
about that.
00:01:22.700 --> 00:01:27.710
This is the last homework set
that you will have to turn in.
00:01:27.710 --> 00:01:34.550
We will probably have another
set of practice problems and
00:01:34.550 --> 00:01:36.900
problems on--
00:01:36.900 --> 00:01:40.870
but not things you
should turn in.
00:01:40.870 --> 00:01:43.450
We will try to get solutions
out on them
00:01:43.450 --> 00:01:44.920
fairly quickly, also.
00:01:44.920 --> 00:01:49.320
So you can do them, but also
look at the answers right
00:01:49.320 --> 00:01:50.710
after you do them.
00:01:50.710 --> 00:01:55.550
OK, so let's get back
to random walks.
00:01:55.550 --> 00:02:00.550
And remember what we were
doing last time.
00:02:00.550 --> 00:02:04.720
A random walk, by definition,
you have a sequence of IID
00:02:04.720 --> 00:02:06.860
random variables.
00:02:06.860 --> 00:02:10.360
You have partial sums of
those random variables.
00:02:10.360 --> 00:02:13.090
S sub n is a sum of
the first n of
00:02:13.090 --> 00:02:15.690
those IID random variables.
00:02:15.690 --> 00:02:21.790
And the sequence of partial sums
S1, S2, S3, and so forth,
00:02:21.790 --> 00:02:25.345
that sequence is called
a random walk.
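[The definition above can be sketched in code. This is a minimal illustrative Python simulation, not something from the lecture; the function name and the uniform step distribution are assumptions for the example.]

```python
import random

def random_walk(n_steps, step_fn, seed=0):
    # Partial sums S_1, ..., S_n of IID steps X_i; the sequence
    # of partial sums is, by definition, the random walk.
    rng = random.Random(seed)
    sums, s = [], 0.0
    for _ in range(n_steps):
        s += step_fn(rng)
        sums.append(s)
    return sums

# Steps uniform on [-1, 1]: mean 0, so the walk diffuses around 0.
walk = random_walk(1000, lambda rng: rng.uniform(-1.0, 1.0))
```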
00:02:25.345 --> 00:02:27.860
And if you graph the random
walk, it's something which
00:02:27.860 --> 00:02:30.670
wanders up and down usually.
00:02:30.670 --> 00:02:34.840
And sometimes, if the mean of X
is positive, it wanders off
00:02:34.840 --> 00:02:36.080
to infinity.
00:02:36.080 --> 00:02:38.520
If the mean of X is negative,
it wanders
00:02:38.520 --> 00:02:40.225
off to minus infinity.
00:02:40.225 --> 00:02:44.960
If the mean of X is 0, it simply
diffuses somewhat as
00:02:44.960 --> 00:02:46.120
time goes on.
00:02:46.120 --> 00:02:49.320
And what we're trying to find
is exactly how these
00:02:49.320 --> 00:02:51.820
things work.
00:02:51.820 --> 00:02:53.670
So our focus here is
going to be on
00:02:53.670 --> 00:02:56.640
threshold-crossing problems.
00:02:56.640 --> 00:03:01.340
Namely, what's the probability
that this random walk is going
00:03:01.340 --> 00:03:08.110
to cross some threshold by or at
some particular value of n?
00:03:08.110 --> 00:03:11.450
If you have two thresholds,
one above and one below,
00:03:11.450 --> 00:03:13.960
what's the probability it's
going to cross the one above?
00:03:13.960 --> 00:03:17.750
What's the probability it's
going to cross the one below?
00:03:17.750 --> 00:03:21.200
And if it crosses one of these,
when does it cross it?
00:03:21.200 --> 00:03:26.740
If it crosses it, how much
of an overshoot is there?
00:03:26.740 --> 00:03:29.380
All of those problems just come
in naturally by looking
00:03:29.380 --> 00:03:32.160
at a sum of IID random
variables.
00:03:32.160 --> 00:03:35.710
But here we're going to be
trying to study them in some
00:03:35.710 --> 00:03:40.500
consistent manner looking at the
thresholds particularly.
00:03:40.500 --> 00:03:45.650
We've talked a little bit
about two particularly
00:03:45.650 --> 00:03:47.030
important applications.
00:03:47.030 --> 00:03:49.960
One is G/G/1 queues.
00:03:49.960 --> 00:03:54.760
And even far more important than
that is this question of
00:03:54.760 --> 00:03:58.920
detection, or making decisions,
or hypothesis
00:03:58.920 --> 00:04:01.880
testing, all of which
are the same thing.
00:04:01.880 --> 00:04:06.340
You remember we did show that
there was at least one
00:04:06.340 --> 00:04:09.980
threshold-crossing problem
that was very, very easy.
00:04:09.980 --> 00:04:14.600
It's the threshold problem where
the underlying random
00:04:14.600 --> 00:04:16.640
variable is binary.
00:04:16.640 --> 00:04:21.010
You either go up by 1 or you
go down by 1 on each step.
00:04:21.010 --> 00:04:24.040
And the question is, what's the
probability that you will
00:04:24.040 --> 00:04:29.570
cross some threshold at
some k greater than 0?
00:04:29.570 --> 00:04:32.590
And it turns out that since
you can only go up 1 each
00:04:32.590 --> 00:04:36.620
time, the probability of getting
up to some point k is
00:04:36.620 --> 00:04:39.390
the probability you
ever got up to 1.
00:04:39.390 --> 00:04:41.910
Given that you got up to 1, it's
the probability that you
00:04:41.910 --> 00:04:43.250
ever got up to 2.
00:04:43.250 --> 00:04:45.260
Given you got up to 2, it's
the probability you
00:04:45.260 --> 00:04:46.480
ever got up to 3.
00:04:46.480 --> 00:04:49.400
That doesn't mean that you
go directly from 2 to 3.
00:04:49.400 --> 00:04:53.160
After you go to 2, you wander
all around, and eventually you
00:04:53.160 --> 00:04:54.510
make it up to 3.
00:04:54.510 --> 00:04:57.920
If you do, then the question is,
do you ever get from 3 to
00:04:57.920 --> 00:04:59.580
4, and so forth.
00:04:59.580 --> 00:05:03.650
And we found that the solution
to that problem was p over 1
00:05:03.650 --> 00:05:08.110
minus p to the k-th power if p
is less than or equal to 1/2.
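[The closed-form answer quoted here can be written directly. This is an illustrative Python sketch, not from the lecture; the function name is an assumption.]

```python
def prob_ever_cross(k, p):
    # Probability that the +/-1 random walk (up with probability
    # p <= 1/2 on each step) ever reaches threshold k > 0:
    # (p / (1 - p)) ** k, the product of k independent level crossings.
    return (p / (1 - p)) ** k

print(prob_ever_cross(1, 0.5))   # 1.0: a zero-mean walk reaches any level
print(prob_ever_cross(3, 0.25))  # (1/3) ** 3, about 0.037
```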
00:05:08.110 --> 00:05:11.990
And we solved that problem, if
you remember, back when we
00:05:11.990 --> 00:05:15.420
were talking about stop when
you're ahead if you're playing
00:05:15.420 --> 00:05:17.960
coin tossing with somebody.
00:05:17.960 --> 00:05:26.180
And so let's go further and
look particularly at this
00:05:26.180 --> 00:05:29.200
problem of detection, and
decisions, and hypothesis
00:05:29.200 --> 00:05:33.720
testing, which is really not a
particularly hard problem.
00:05:33.720 --> 00:05:38.250
But it's made particularly hard
by statisticians who have
00:05:38.250 --> 00:05:44.870
so many special rules, peculiar
cases, and almost
00:05:44.870 --> 00:05:48.500
mythology about making
decisions.
00:05:48.500 --> 00:05:53.200
And you can imagine why because
as long as you talk
00:05:53.200 --> 00:05:56.810
about probability, everybody
knows you're talking about an
00:05:56.810 --> 00:05:58.440
abstraction.
00:05:58.440 --> 00:06:02.010
As soon as you start talking
about making a decision, it
00:06:02.010 --> 00:06:04.910
suddenly becomes real.
00:06:04.910 --> 00:06:07.960
I mean, you look at a
bunch of data and
00:06:07.960 --> 00:06:09.850
you have to do something.
00:06:09.850 --> 00:06:12.210
You look at a bunch of
candidates for a job, you have
00:06:12.210 --> 00:06:13.760
to choose one.
00:06:13.760 --> 00:06:16.650
That's always very difficult
because you might not choose
00:06:16.650 --> 00:06:17.270
the right one.
00:06:17.270 --> 00:06:19.170
You might choose a
very poor one.
00:06:19.170 --> 00:06:21.830
But you have to do your best.
00:06:21.830 --> 00:06:25.260
If you're investing in stocks,
you look at all the statistics
00:06:25.260 --> 00:06:26.500
of everything.
00:06:26.500 --> 00:06:28.060
And finally you say,
that's where I'm
00:06:28.060 --> 00:06:30.210
going to put my money.
00:06:30.210 --> 00:06:32.340
Or if you're looking for a job
you say, that's where I'm
00:06:32.340 --> 00:06:33.960
going to work, and you
hope that that's
00:06:33.960 --> 00:06:35.700
going to work out well.
00:06:35.700 --> 00:06:38.790
There are all these situations
where you can evaluate
00:06:38.790 --> 00:06:42.720
probabilities until you're
sick in the head.
00:06:42.720 --> 00:06:44.810
They don't mean anything.
00:06:44.810 --> 00:06:47.170
It's only when you make a
decision and actually do
00:06:47.170 --> 00:06:50.670
something with it that it
really means something.
00:06:50.670 --> 00:06:53.850
So it becomes important
at this point.
00:06:53.850 --> 00:06:58.240
The model we use for this,
since we're studying
00:06:58.240 --> 00:06:59.880
probability theory--
00:06:59.880 --> 00:07:03.360
well, actually, we're studying
random processes.
00:07:03.360 --> 00:07:06.190
But we're really studying
probability theory.
00:07:06.190 --> 00:07:09.360
You probably noticed
that by now.
00:07:09.360 --> 00:07:13.040
Since we're studying
probability, we study all
00:07:13.040 --> 00:07:16.530
these problems in terms of
a probabilistic model.
00:07:16.530 --> 00:07:20.630
And in the probabilistic model,
there's a discrete and,
00:07:20.630 --> 00:07:25.880
in most cases, binary random
variable, H, which is called
00:07:25.880 --> 00:07:28.340
the hypothesis random
variable.
00:07:28.340 --> 00:07:31.060
The sample values of H,
you might as well
00:07:31.060 --> 00:07:32.900
call them 0 and 1.
00:07:32.900 --> 00:07:36.460
That's the easiest things
to call binary things.
00:07:36.460 --> 00:07:39.590
They're called the alternative
hypotheses.
00:07:39.590 --> 00:07:42.350
They have marginal probabilities
because it's a
00:07:42.350 --> 00:07:44.100
probability model.
00:07:44.100 --> 00:07:45.440
You have a random variable.
00:07:45.440 --> 00:07:49.020
It can only take on the value
0 and 1, so it has to have
00:07:49.020 --> 00:07:51.740
probabilities of
being 0 and 1.
00:07:51.740 --> 00:07:53.690
Along with that, there
are all sorts of
00:07:53.690 --> 00:07:54.880
other random variables.
00:07:54.880 --> 00:07:58.360
The situation might be as
complicated as you want.
00:07:58.360 --> 00:08:00.770
But since we're making
decisions, we're making
00:08:00.770 --> 00:08:04.560
decisions on the basis of some
set of alternatives.
00:08:04.560 --> 00:08:07.700
And here, since we're trying
to talk about random walks,
00:08:07.700 --> 00:08:11.260
and martingales, and things like
that, also we restrict
00:08:11.260 --> 00:08:14.960
our attention to particular
kinds of observations.
00:08:14.960 --> 00:08:17.680
And the particular kind of
observation that we restrict
00:08:17.680 --> 00:08:23.160
attention to here is a sequence
of random variables,
00:08:23.160 --> 00:08:24.900
which we call the observation.
00:08:24.900 --> 00:08:26.060
You observe Y1.
00:08:26.060 --> 00:08:27.000
You observe Y2.
00:08:27.000 --> 00:08:29.460
You observe Y3, and so forth.
00:08:29.460 --> 00:08:33.539
In other words, you observe a
sample value of each of those
00:08:33.539 --> 00:08:34.770
random variables.
00:08:34.770 --> 00:08:36.669
There are a whole sequence
of them.
00:08:36.669 --> 00:08:40.409
And we assume, to make life
simple for ourselves, that
00:08:40.409 --> 00:08:44.920
each of these are independent,
conditional on the hypothesis.
00:08:44.920 --> 00:08:47.250
And they're identically
distributed conditional on the
00:08:47.250 --> 00:08:48.080
hypothesis.
00:08:48.080 --> 00:08:54.150
That's what this says
right here.
00:08:54.150 --> 00:08:57.030
This makes one more assumption,
namely that these
00:08:57.030 --> 00:09:00.040
observations are continuous
random variables.
00:09:00.040 --> 00:09:02.600
That doesn't make much
difference, there are just a
00:09:02.600 --> 00:09:05.790
few peculiarities that
come in if these are
00:09:05.790 --> 00:09:07.780
discrete random variables.
00:09:07.780 --> 00:09:10.410
There are also a few peculiarities
that come in when they're
00:09:10.410 --> 00:09:11.530
continuous.
00:09:11.530 --> 00:09:13.830
And there are a lot of
peculiarities that come in
00:09:13.830 --> 00:09:15.680
when they're absolutely
arbitrary.
00:09:15.680 --> 00:09:18.780
But for the time being, just
imagine each of these are
00:09:18.780 --> 00:09:20.800
continuous random variables.
00:09:20.800 --> 00:09:26.060
So for each value of n, we
look at n observations.
00:09:26.060 --> 00:09:30.010
We can calculate the probability
density that those
00:09:30.010 --> 00:09:35.730
observations would occur
conditional on hypothesis 0.
00:09:35.730 --> 00:09:39.270
We can find the conditional
probability they could occur
00:09:39.270 --> 00:09:41.700
conditional on hypothesis 1.
00:09:41.700 --> 00:09:47.460
And since they're IID, that's
equal to this product here.
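[The product of conditional densities can be sketched in code. A minimal illustrative Python example, not from the lecture; the Gaussian densities and the sample values are assumptions chosen for the illustration.]

```python
import math

def log_likelihood(ys, density):
    # Conditionally IID: f(y_1..y_n | H = l) is the product of the
    # per-observation densities, so the log-likelihood is a sum.
    return sum(math.log(density(y)) for y in ys)

def gaussian_density(mean, var=1.0):
    return lambda y: math.exp(-(y - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

ys = [0.2, -0.5, 1.1]                            # hypothetical observations
ll0 = log_likelihood(ys, gaussian_density(0.0))  # under H = 0: N(0, 1)
ll1 = log_likelihood(ys, gaussian_density(1.0))  # under H = 1: N(1, 1)
```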
00:09:47.460 --> 00:09:52.210
Excuse me, they are not IID,
they are conditionally IID.
00:09:52.210 --> 00:09:54.500
Conditional on the hypothesis.
00:09:54.500 --> 00:09:58.340
Namely, the idea is the world
is one way or the world is
00:09:58.340 --> 00:09:59.270
another way.
00:09:59.270 --> 00:10:01.880
If the world is this way,
then all of these
00:10:01.880 --> 00:10:03.860
observations are IID.
00:10:03.860 --> 00:10:06.340
You're doing the same experiment
again and again and
00:10:06.340 --> 00:10:10.590
again, but it's based on the
same underlying hypothesis.
00:10:10.590 --> 00:10:15.460
Or, the underlying hypothesis
is this over here.
00:10:15.460 --> 00:10:18.470
You make the number of
observations all based on this
00:10:18.470 --> 00:10:22.820
same hypothesis, and you make
as many of these IID
00:10:22.820 --> 00:10:25.330
observations conditional
on that
00:10:25.330 --> 00:10:27.270
observation as you choose.
00:10:27.270 --> 00:10:29.440
And when you're all done,
what do you do?
00:10:29.440 --> 00:10:31.140
You have to make
your decision.
00:10:31.140 --> 00:10:35.050
OK, so this is a very
simple-minded model of this
00:10:35.050 --> 00:10:38.240
very complicated and very
important problem.
00:10:38.240 --> 00:10:41.830
But it's close enough to the
truth that we can get a lot of
00:10:41.830 --> 00:10:44.190
observations from it.
00:10:44.190 --> 00:10:47.750
Now, I spent a lot of time last time
talking about this.
00:10:47.750 --> 00:10:51.810
I'll spend a lot of time this time
talking about it because when
00:10:51.810 --> 00:10:56.130
we use a probability model for
this, when we say that we're
00:10:56.130 --> 00:10:57.740
studying probability theory,
00:10:57.740 --> 00:11:00.610
and therefore we're going to
use probability, we have
00:11:00.610 --> 00:11:04.900
suddenly allied ourselves
completely with people called
00:11:04.900 --> 00:11:09.980
Bayesian statisticians or
Bayesian probabilists.
00:11:09.980 --> 00:11:13.440
And we have gone against, turned
our back on people
00:11:13.440 --> 00:11:17.150
called Non-Bayesians, or
sometimes classical.
00:11:17.150 --> 00:11:19.280
I hate using the word
"classical" because I like the
00:11:19.280 --> 00:11:25.120
word "classics," and it seems
the wrong word for such an unusual
00:11:25.120 --> 00:11:26.630
point of view.
00:11:26.630 --> 00:11:30.090
And the unusual point of view
is that we refuse to take a
00:11:30.090 --> 00:11:31.840
probability model.
00:11:31.840 --> 00:11:35.320
We accept the fact that on all
the observations, all the
00:11:35.320 --> 00:11:37.370
observations are
probabilistic.
00:11:37.370 --> 00:11:40.710
We assume we have a nice model
for them, which makes sense.
00:11:40.710 --> 00:11:42.820
We can do whatever we want
with that model.
00:11:42.820 --> 00:11:44.220
We can change the model.
00:11:44.220 --> 00:11:46.830
We can do whatever we
want with a model.
00:11:46.830 --> 00:11:49.870
But if you once assume that
these two hypotheses that
00:11:49.870 --> 00:11:52.730
you're trying to choose between,
that they have a
00:11:52.730 --> 00:11:57.490
priori probabilities, then
people get very upset about it
00:11:57.490 --> 00:11:59.670
because they say, well,
if you know what the a priori
00:11:59.670 --> 00:12:03.640
probabilities are, why do you
have to do a hypothesis test?
00:12:03.640 --> 00:12:05.980
You already understand
everything there is to know
00:12:05.980 --> 00:12:07.510
about the problem.
00:12:07.510 --> 00:12:08.835
And they feel this
is very strange.
00:12:11.870 --> 00:12:15.820
It's not strange because you
use probability models.
00:12:15.820 --> 00:12:18.540
You use models to try to
understand certain things
00:12:18.540 --> 00:12:19.870
about reality.
00:12:19.870 --> 00:12:21.860
And you assume as many
things as you want to
00:12:21.860 --> 00:12:22.820
assume about it.
00:12:22.820 --> 00:12:26.110
And when you get all done, you
either use all the assumptions
00:12:26.110 --> 00:12:27.320
or you don't use them.
00:12:27.320 --> 00:12:32.230
What we're going to find today
is that when you use this
00:12:32.230 --> 00:12:36.380
assumption of a probability
model, you can answer the
00:12:36.380 --> 00:12:40.310
questions that these classical
statisticians go to great
00:12:40.310 --> 00:12:41.510
pains to answer.
00:12:41.510 --> 00:12:44.670
And you can ask them
very, very simply.
00:12:44.670 --> 00:12:48.160
So that after we assume the a
priori probabilities, we can
00:12:48.160 --> 00:12:52.390
calculate certain things which
don't depend on those a priori
00:12:52.390 --> 00:12:53.860
probabilities.
00:12:53.860 --> 00:12:55.520
And therefore, we
know two things.
00:12:55.520 --> 00:12:58.220
One, we know that if we
did know the a priori
00:12:58.220 --> 00:13:02.290
probabilities, it wouldn't
make any difference.
00:13:02.290 --> 00:13:05.670
And two, we know that if we
can estimate the a priori
00:13:05.670 --> 00:13:08.710
probabilities, it makes a great
deal of difference.
00:13:08.710 --> 00:13:10.380
And three--
00:13:10.380 --> 00:13:13.250
and this is the most
important point--
00:13:13.250 --> 00:13:17.110
you make 100 observations
of something.
00:13:17.110 --> 00:13:20.200
Somebody else says, I don't
believe you, and comes in and
00:13:20.200 --> 00:13:22.530
makes another 100
observations.
00:13:22.530 --> 00:13:25.780
Somebody else makes another
100 observations.
00:13:25.780 --> 00:13:29.580
Now, even if the second person
doesn't believe what the first
00:13:29.580 --> 00:13:34.400
person has done, it doesn't make
sense as a scientist to
00:13:34.400 --> 00:13:38.800
completely eliminate all of
that from consideration.
00:13:38.800 --> 00:13:41.940
Namely, what you would like to
do is say well, since this
00:13:41.940 --> 00:13:46.190
person has found such and
such, the a priori
00:13:46.190 --> 00:13:48.660
probabilities have changed.
00:13:48.660 --> 00:13:53.670
And then I can go on and make
my 100 observations.
00:13:53.670 --> 00:13:57.590
I can either make a hypothesis
test based on my 100
00:13:57.590 --> 00:14:02.090
observations or I can make a
hypothesis test assuming that
00:14:02.090 --> 00:14:04.340
the other person did
their work well.
00:14:04.340 --> 00:14:07.640
I can make it based on all
of these observations.
00:14:07.640 --> 00:14:11.000
If you try to do those
two things in a classical
00:14:11.000 --> 00:14:13.800
formulation, you run into
a lot of trouble.
00:14:13.800 --> 00:14:17.110
If you try to do them in this
probabilistic formulation,
00:14:17.110 --> 00:14:18.810
it's all perfectly
straightforward.
00:14:18.810 --> 00:14:22.220
Because you can either start
out with a model in which
00:14:22.220 --> 00:14:25.490
you're taking 200 observations
or you can start out with a
00:14:25.490 --> 00:14:28.200
model in which you take
100 observations.
00:14:28.200 --> 00:14:30.440
And then suddenly, the
world changes.
00:14:30.440 --> 00:14:33.930
This hypothesis takes on,
perhaps a different value.
00:14:33.930 --> 00:14:35.970
You take another hundred
observations.
00:14:35.970 --> 00:14:39.150
So you do whatever you want
to within a probabilistic
00:14:39.150 --> 00:14:40.510
formulation.
00:14:40.510 --> 00:14:47.250
But the other thing is, all of
you that patiently have lived
00:14:47.250 --> 00:14:51.050
with this idea of studying
probabilistic
00:14:51.050 --> 00:14:53.160
models all term long.
00:14:53.160 --> 00:14:56.480
You might as well keep
on living with it.
00:14:56.480 --> 00:15:00.710
The fact that we're now
interested in making decisions
00:15:00.710 --> 00:15:03.030
should not make you think that
everything you've learned up
00:15:03.030 --> 00:15:06.580
until this point is baloney.
00:15:06.580 --> 00:15:10.050
And to move from here to
a classical statistical
00:15:10.050 --> 00:15:13.520
formulation of the world would
really be saying, I don't
00:15:13.520 --> 00:15:15.400
believe in probability theory.
00:15:15.400 --> 00:15:17.510
It's that bad.
00:15:17.510 --> 00:15:18.760
So here we go.
00:15:23.850 --> 00:15:26.490
I'm sorry, we did that.
00:15:26.490 --> 00:15:28.190
We were there.
00:15:28.190 --> 00:15:33.500
Assume that on the basis of
observing a sample value of
00:15:33.500 --> 00:15:37.490
this sequence of observations,
we have to make a decision
00:15:37.490 --> 00:15:42.060
about H. We have to choose
H equals 0 or H equals 1.
00:15:42.060 --> 00:15:44.990
We have to detect whether
or not H is 1.
00:15:44.990 --> 00:15:48.750
When you do this detection, you
would think in the real
00:15:48.750 --> 00:15:51.010
world that you've detected
something.
00:15:51.010 --> 00:15:54.330
If you've made a decision about
something, that you've
00:15:54.330 --> 00:15:58.770
tested a hypothesis and you
found that which is correct.
00:15:58.770 --> 00:16:00.460
Not at all.
00:16:00.460 --> 00:16:03.470
When you make decisions,
you can make errors.
00:16:03.470 --> 00:16:06.700
And the question of what kinds
of errors you're making is a
00:16:06.700 --> 00:16:11.170
major part of trying
to make decisions.
00:16:11.170 --> 00:16:14.620
I mean, those people who make
decisions and then can't
00:16:14.620 --> 00:16:17.780
believe that they might have
made the wrong decision are
00:16:17.780 --> 00:16:20.210
the worst kind of fools.
00:16:20.210 --> 00:16:21.710
And you see them in politics.
00:16:21.710 --> 00:16:22.900
You see them in business.
00:16:22.900 --> 00:16:24.250
You see them in academia.
00:16:24.250 --> 00:16:27.280
You see them all
over the place.
00:16:27.280 --> 00:16:30.640
When you make a decision and
you've made a mistake, you get
00:16:30.640 --> 00:16:31.520
some more evidence.
00:16:31.520 --> 00:16:33.890
You see that it's a mistake
and you change.
00:16:33.890 --> 00:16:37.790
The whole 19th century
was taken up with--
00:16:37.790 --> 00:16:41.760
I mean, the scientific community
was driven by
00:16:41.760 --> 00:16:43.680
physicists in those days.
00:16:43.680 --> 00:16:47.490
And the idea of Newton's
laws was the most
00:16:47.490 --> 00:16:48.930
sacred thing they had.
00:16:53.410 --> 00:16:55.450
Everybody believed in Newtonian
00:16:55.450 --> 00:16:57.300
mechanics in those days.
00:16:57.300 --> 00:17:01.500
When quantum mechanics came
along, this wasn't just a
00:17:01.500 --> 00:17:04.619
minor perturbation in physics.
00:17:04.619 --> 00:17:07.069
This was a most crucial thing.
00:17:07.069 --> 00:17:10.010
This said, everything we've
known goes out the window.
00:17:10.010 --> 00:17:13.069
We can't rely on anything
anymore.
00:17:13.069 --> 00:17:17.950
But the physicists said, OK,
I guess we made a mistake.
00:17:17.950 --> 00:17:19.390
We'll make new observations.
00:17:19.390 --> 00:17:22.170
We have new observations
that can be made.
00:17:22.170 --> 00:17:25.420
We now see that Newtonian
mechanics works over a certain
00:17:25.420 --> 00:17:27.050
range of things.
00:17:27.050 --> 00:17:29.390
It doesn't work in other
ranges of things.
00:17:29.390 --> 00:17:31.180
And they go on and
find new things.
00:17:31.180 --> 00:17:32.970
That's the same thing
we do here.
00:17:32.970 --> 00:17:34.200
We take these models.
00:17:34.200 --> 00:17:36.820
We evaluate our error
probabilities.
00:17:36.820 --> 00:17:39.510
And evaluating them, we then
say, well, we've got to go on
00:17:39.510 --> 00:17:41.290
and take some more
measurements.
00:17:41.290 --> 00:17:43.200
Or we say we're going
to live with it.
00:17:43.200 --> 00:17:45.800
But we face the fact that there
are errors involved.
00:17:45.800 --> 00:17:50.990
And in doing that, you have to
take a probabilistic model.
00:17:50.990 --> 00:17:53.730
If you don't take a
probabilistic model, it's very
00:17:53.730 --> 00:17:56.800
hard for you to talk honestly
about what error
00:17:56.800 --> 00:17:58.270
probabilities are.
00:17:58.270 --> 00:18:00.336
So both ways--
00:18:00.336 --> 00:18:02.910
well, I'm preaching
and I'm sorry.
00:18:02.910 --> 00:18:10.250
But I've lived for a long time
with many statisticians, many
00:18:10.250 --> 00:18:13.300
of whom get into my own
field and who cause a
00:18:13.300 --> 00:18:16.530
great deal of trouble.
00:18:16.530 --> 00:18:19.850
So the only thing I can do is
urge you all to be cautious
00:18:19.850 --> 00:18:20.500
about this.
00:18:20.500 --> 00:18:22.690
And to think the matter
through on your own.
00:18:22.690 --> 00:18:25.960
I'm not telling you to take
my point of view on it.
00:18:25.960 --> 00:18:28.870
I'm telling you, don't take
other people's point of view
00:18:28.870 --> 00:18:30.120
without thinking it through.
00:18:32.690 --> 00:18:37.270
The probability experiment
here really--
00:18:37.270 --> 00:18:42.530
I mean, every probability model
we view in terms of the
00:18:42.530 --> 00:18:48.350
real world, as you have this set
of probabilities, a set of
00:18:48.350 --> 00:18:49.710
possible events.
00:18:49.710 --> 00:18:51.170
You do the experiment.
00:18:51.170 --> 00:18:53.580
There's one sample point
that comes out.
00:18:53.580 --> 00:18:56.630
And after the one sample point
comes out, then you know what
00:18:56.630 --> 00:18:59.040
the result of the
experiment is.
00:18:59.040 --> 00:19:04.990
Here, the experiment consists
both of what you normally view
00:19:04.990 --> 00:19:06.090
as the experiment.
00:19:06.090 --> 00:19:08.660
Namely, taking the
observations.
00:19:08.660 --> 00:19:13.100
And it also involves a
choice of hypotheses.
00:19:13.100 --> 00:19:16.780
Namely, there's not a correct
hypothesis to start with.
00:19:16.780 --> 00:19:21.870
The experiment involves
God throws his dice.
00:19:21.870 --> 00:19:26.170
Einstein didn't believe that
God threw dice, but I do.
00:19:26.170 --> 00:19:30.260
And after throwing the dice,
one or the other of these
00:19:30.260 --> 00:19:32.980
hypotheses turns
out to be true.
00:19:32.980 --> 00:19:36.650
All of these observations point
to that or they point to
00:19:36.650 --> 00:19:39.290
the other and you
make a decision.
00:19:39.290 --> 00:19:42.660
OK, so the experiment consists
both of choosing the
00:19:42.660 --> 00:19:45.330
hypothesis and on taking
a whole sequence of
00:19:45.330 --> 00:19:46.480
observations.
00:19:46.480 --> 00:19:50.650
Now, the other thing to
not forget in this--
00:19:50.650 --> 00:19:53.040
because you really have to get
this model in your mind or
00:19:53.040 --> 00:19:54.610
you're going to get very
confused with all
00:19:54.610 --> 00:19:56.390
the things we do.
00:19:56.390 --> 00:19:59.280
The experiment consists
of a whole sequence of
00:19:59.280 --> 00:20:03.540
observations, but only one
choice of hypothesis.
00:20:03.540 --> 00:20:06.020
Namely, you do the experiment.
00:20:06.020 --> 00:20:08.650
There's a hypothesis that
occurs, and there's a whole
00:20:08.650 --> 00:20:12.720
sequence of observations which
are all IID conditional on
00:20:12.720 --> 00:20:13.970
that particular hypothesis.
00:20:16.730 --> 00:20:21.960
So that's the model we're
going to be using.
00:20:21.960 --> 00:20:24.980
And now life is quite
simple once we've
00:20:24.980 --> 00:20:27.040
explained the model.
00:20:27.040 --> 00:20:31.020
We can talk about the
probability that H is equal to
00:20:31.020 --> 00:20:34.850
either 0 or 1, conditional
on the
00:20:34.850 --> 00:20:37.750
sample point we've observed.
00:20:37.750 --> 00:20:43.570
It's equal to the a priori
probability of that hypothesis
00:20:43.570 --> 00:20:47.520
times the density of the
observation conditional on the
00:20:47.520 --> 00:20:53.870
hypothesis divided by just
a normalization factor.
00:20:53.870 --> 00:20:57.360
Namely, the overall probability
of that
00:20:57.360 --> 00:21:03.512
observation, period, which is the
sum of probability that 0
00:21:03.512 --> 00:21:07.090
is a correct hypothesis times
this plus probability that 1
00:21:07.090 --> 00:21:12.490
is a correct hypothesis times
the density given 1.
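[Bayes' rule as stated here, with the normalizing denominator written out, can be shown in a few lines. A minimal illustrative Python sketch, not from the lecture; the function name and the sample numbers are assumptions.]

```python
def posterior_0(p0, f0, f1):
    # Bayes: P(H=0 | y) = p0 * f(y|0) / (p0 * f(y|0) + p1 * f(y|1)),
    # where p1 = 1 - p0 and the denominator is the overall
    # probability density of the observation (the normalization).
    return p0 * f0 / (p0 * f0 + (1 - p0) * f1)

print(posterior_0(0.5, 2.0, 1.0))  # 2/3: equal priors, H=0 twice as likely
```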
00:21:12.490 --> 00:21:15.530
This denominator here
is a pain in the
00:21:15.530 --> 00:21:17.790
neck, as you can see.
00:21:17.790 --> 00:21:22.450
But you can avoid ever dealing
with a denominator if you take
00:21:22.450 --> 00:21:29.570
this for H equals 0, divide by
this for H equals 1, and then
00:21:29.570 --> 00:21:34.300
you have this term divided by
this term all divided by this
00:21:34.300 --> 00:21:38.460
term for l equals 1 divided
by the same thing.
00:21:38.460 --> 00:21:43.370
So the ratio, the probability
that H equals 0 given y over
00:21:43.370 --> 00:21:46.390
the probability that
H is 1 given y is
00:21:46.390 --> 00:21:48.850
just this ratio here.
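[The point about the denominator canceling can be made concrete. An illustrative Python sketch, not from the lecture; the function name is an assumption.]

```python
def posterior_ratio(p0, f0, f1):
    # P(H=0 | y) / P(H=1 | y) = (p0 / p1) * (f(y|0) / f(y|1)):
    # the common normalization denominator cancels, leaving the
    # prior ratio times the likelihood ratio.
    return (p0 / (1 - p0)) * (f0 / f1)

print(posterior_ratio(0.5, 2.0, 1.0))  # 2.0: with equal priors, just the likelihood ratio
```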
00:21:48.850 --> 00:21:52.990
Now, what's the probability of
error if we make a decision at
00:21:52.990 --> 00:21:55.470
this point?
00:21:55.470 --> 00:22:01.320
If I've got in this particular
sequence Y, this quantity here
00:22:01.320 --> 00:22:05.310
is, in fact, the probability
that hypothesis 0 is correct
00:22:05.310 --> 00:22:08.020
in the model that
we have chosen.
00:22:08.020 --> 00:22:14.870
So this is the probability that
H is equal to 0 given Y.
00:22:14.870 --> 00:22:19.540
If we select 1 under these
conditions, if we select
00:22:19.540 --> 00:22:22.700
hypothesis 1, if we make a
decision and say, I'm going to
00:22:22.700 --> 00:22:25.740
guess that 1 is the
right decision.
00:22:25.740 --> 00:22:29.070
That means that this
is the probability
00:22:29.070 --> 00:22:30.370
you've made a mistake.
00:22:30.370 --> 00:22:32.810
Because this is the probability
that H is actually
00:22:32.810 --> 00:22:34.330
0 rather than 1.
00:22:34.330 --> 00:22:38.150
This quantity here is the
probability that you've made a
00:22:38.150 --> 00:22:41.705
mistake given that 1 is the
correct hypothesis.
00:22:45.590 --> 00:22:48.240
So here we are sitting
here with these
00:22:48.240 --> 00:22:49.640
probabilities of error.
00:22:49.640 --> 00:22:52.280
We don't have to do any
calculations for them.
00:22:52.280 --> 00:22:55.100
Well, you might have to do a
great deal of calculation to
00:22:55.100 --> 00:22:58.250
calculate this and to
calculate this.
00:22:58.250 --> 00:23:01.630
But otherwise, the whole thing
is just sitting there for you.
00:23:01.630 --> 00:23:03.600
So what do you do if you
want to minimize the
00:23:03.600 --> 00:23:04.870
probability of error?
00:23:11.780 --> 00:23:14.810
This was the probability that
you're going to make an error
00:23:14.810 --> 00:23:16.380
if you choose 1.
00:23:16.380 --> 00:23:19.275
This is the probability of
error if you choose 0.
00:23:21.880 --> 00:23:24.670
If we want to minimize the
probability of error and we
00:23:24.670 --> 00:23:28.205
see the observation Y, we want
to pick the one of these which
00:23:28.205 --> 00:23:29.370
is largest.
00:23:29.370 --> 00:23:33.580
And that's all there is to it.
00:23:33.580 --> 00:23:37.660
This is the decision rule
that minimizes the
00:23:37.660 --> 00:23:41.020
probability of an error.
00:23:41.020 --> 00:23:44.790
It's based on knowing
what P0 and P1 is.
00:23:44.790 --> 00:23:47.600
But otherwise, probability that
H equals l is the correct
00:23:47.600 --> 00:23:51.000
hypothesis given the observation
is probability
00:23:51.000 --> 00:23:55.000
that H equals l given Y. We
maximize the a posteriori
00:23:55.000 --> 00:23:58.080
probability of choosing
correctly by choosing the
00:23:58.080 --> 00:24:02.910
maximum over l of probability
that H equals l given Y.
00:24:02.910 --> 00:24:08.350
This choosing directly,
maximizing the a posteriori
00:24:08.350 --> 00:24:12.510
probability is called the MAP
rule, Maximum A posteriori
00:24:12.510 --> 00:24:14.590
Probability.
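As a rough sketch of the MAP rule just described (the Gaussian likelihoods and all names here are illustrative, not from the lecture):

```python
import math

def gauss(y, mean, var=1.0):
    # Gaussian density, used here as an illustrative likelihood f(y | H).
    return math.exp(-(y - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def map_decision(y, p0, p1, f0, f1):
    # MAP rule: pick the hypothesis with the larger a posteriori
    # probability.  The common factor 1/p(y) cancels, so comparing
    # p0 * f0(y) with p1 * f1(y) is enough.
    return 0 if p0 * f0(y) > p1 * f1(y) else 1

# With equal priors and hypotheses "mean 0" vs "mean 1", an
# observation near 0 is decided as H-hat = 0.
decision = map_decision(0.2, 0.5, 0.5,
                        lambda y: gauss(y, 0.0),
                        lambda y: gauss(y, 1.0))
```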
00:24:14.590 --> 00:24:21.620
You can only solve the MAP
problem if you assume that you
00:24:21.620 --> 00:24:23.340
know P0 and P1.
00:24:23.340 --> 00:24:27.650
We do know P0 and P1 if we've
selected a probability model.
00:24:27.650 --> 00:24:30.400
So when we select this
probability model, we've
00:24:30.400 --> 00:24:35.000
already assumed what these a
priori probabilities are, so
00:24:35.000 --> 00:24:37.390
we now make our observation.
00:24:37.390 --> 00:24:38.320
And after making our
00:24:38.320 --> 00:24:40.210
observation, we make a decision.
00:24:40.210 --> 00:24:44.890
And at that point, we have an a
posteriori probability that
00:24:44.890 --> 00:24:47.260
each of the hypotheses
is correct.
00:24:50.430 --> 00:24:53.350
Anybody has any issues
with this?
00:24:53.350 --> 00:24:55.230
I mean, it looks painfully
simple when you
00:24:55.230 --> 00:24:56.666
look at this way.
00:24:56.666 --> 00:25:01.650
And if it doesn't look painfully
simple, please ask
00:25:01.650 --> 00:25:04.020
now or forever hold your
peace as they say.
00:25:06.830 --> 00:25:07.040
Yeah?
00:25:07.040 --> 00:25:09.375
AUDIENCE: So can you explain
how you get the equation?
00:25:09.375 --> 00:25:11.243
Can you explain how you
get the equation
00:25:11.243 --> 00:25:13.330
on the first line?
00:25:13.330 --> 00:25:15.620
PROFESSOR: On the first
line right up here?
00:25:15.620 --> 00:25:18.170
Yes, I use Bayes' law.
00:25:18.170 --> 00:25:19.620
AUDIENCE: So what is that?
00:25:19.620 --> 00:25:23.600
So that's P of A given B is
equal to P of B given A?
00:25:23.600 --> 00:25:24.310
PROFESSOR: Yes.
00:25:24.310 --> 00:25:27.360
AUDIENCE: I don't quite
see how to--
00:25:27.360 --> 00:25:37.140
P of A given B is equal to P
of B given A times P of A
00:25:37.140 --> 00:25:43.110
divided by P of B.
00:25:43.110 --> 00:25:47.826
If you take this over
there then it's--
00:25:47.826 --> 00:25:51.350
am I stating Bayes' law
in a funny way?
00:25:51.350 --> 00:25:53.576
AUDIENCE: So the thing on
the bottom is P of B?
00:25:53.576 --> 00:25:54.552
OK.
00:25:54.552 --> 00:25:56.016
PROFESSOR: What?
00:25:56.016 --> 00:25:57.266
AUDIENCE: OK, I get it.
00:26:02.330 --> 00:26:04.950
PROFESSOR: I mean, I might
not be explaining it well.
00:26:04.950 --> 00:26:06.200
AUDIENCE: [INAUDIBLE].
00:26:08.120 --> 00:26:12.650
PROFESSOR: Except if you start
out with P of A given B is
00:26:12.650 --> 00:26:17.090
equal to P of B given A times P
of A divided by P of B. This
00:26:17.090 --> 00:26:28.210
quantity here is P of Y. So we
have probability that H equals
00:26:28.210 --> 00:26:34.590
l times probability of Y given
l divided by the probability
00:26:34.590 --> 00:26:36.650
of Y to start with.
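A minimal sketch of Bayes' law as it is being used here (hypothetical names; `priors[l]` stands for the a priori probability of H = l, and `likelihoods[l]` for f(y | H = l)):

```python
def posterior(l, y, priors, likelihoods):
    # Bayes' law: Pr{H = l | Y = y}
    #   = priors[l] * likelihoods[l](y) / p(y),
    # where p(y) sums priors[k] * likelihoods[k](y) over all k.
    p_y = sum(p * f(y) for p, f in zip(priors, likelihoods))
    return priors[l] * likelihoods[l](y) / p_y
```

Note that the denominator p(y) is common to every hypothesis, which is why the MAP rule can compare numerators without ever computing it.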
00:26:36.650 --> 00:26:43.590
OK, so you maximize the a
posteriori probability by
00:26:43.590 --> 00:26:45.180
choosing the maximum of these.
00:26:45.180 --> 00:26:47.650
It's called the MAP rule.
00:26:47.650 --> 00:26:53.550
And it doesn't require you to
calculate this quantity, which
00:26:53.550 --> 00:26:55.000
is sometimes a mess.
00:26:55.000 --> 00:26:58.110
All it requires you to do
is to compare these two
00:26:58.110 --> 00:27:01.410
quantities, which means you
have to compare these two
00:27:01.410 --> 00:27:02.780
quantities.
00:27:02.780 --> 00:27:04.136
AUDIENCE: It's 10 o'clock.
00:27:04.136 --> 00:27:05.110
PROFESSOR: Well, excuse me.
00:27:05.110 --> 00:27:05.250
Yes.
00:27:05.250 --> 00:27:06.500
Yes, I know.
00:27:13.480 --> 00:27:17.930
These things become clearer if
you state them in terms of
00:27:17.930 --> 00:27:20.320
what you call the likelihood
ratio.
00:27:20.320 --> 00:27:23.670
Likelihood ratio only works when
you have two hypotheses.
00:27:23.670 --> 00:27:29.490
When you have two hypotheses,
you call the ratio of one of
00:27:29.490 --> 00:27:34.440
them to the other one the
likelihood ratio.
00:27:34.440 --> 00:27:37.290
Why do I put 0 up here
and 1 down here?
00:27:37.290 --> 00:27:40.400
Absolutely no reason at all,
it's just convention.
00:27:40.400 --> 00:27:42.290
And unfortunately, it's
a convention that
00:27:42.290 --> 00:27:43.920
not everybody follows.
00:27:43.920 --> 00:27:46.440
So some people have one
convention and some people
00:27:46.440 --> 00:27:48.610
have another convention.
00:27:48.610 --> 00:27:51.650
If you want to use the other
convention, just imagine
00:27:51.650 --> 00:27:54.440
switching 0 and 1
in your mind.
00:27:54.440 --> 00:27:58.280
They're both just
binary numbers.
00:27:58.280 --> 00:28:03.140
Then, when you want to look at
this MAP rule, the MAP rule is
00:28:03.140 --> 00:28:06.790
choosing the larger of
these two things,
00:28:06.790 --> 00:28:10.600
which we had back here.
00:28:10.600 --> 00:28:15.620
That's choosing whether this is
larger than this, or vice
00:28:15.620 --> 00:28:20.540
versa, which is choosing whether
this ratio here is
00:28:20.540 --> 00:28:24.630
greater than the ratio
of P1 to P0.
00:28:24.630 --> 00:28:29.170
So that's the same, that's
the same thing.
00:28:29.170 --> 00:28:34.960
So the MAP rule is to calculate
the likelihood ratio
00:28:34.960 --> 00:28:37.530
for this given observation y.
00:28:37.530 --> 00:28:41.930
And if this is greater
than P1 over P0, you
00:28:41.930 --> 00:28:46.200
select H equals 0.
00:28:46.200 --> 00:28:51.910
If it's less than or equal to
P1 over P0, you select H equals 1.
00:28:51.910 --> 00:28:56.320
Why do I put the strict equality
here and the strict
00:28:56.320 --> 00:28:57.880
inequality here?
00:28:57.880 --> 00:29:00.310
Again, no reason whatsoever.
00:29:00.310 --> 00:29:03.490
When you have equality, it
doesn't make any difference
00:29:03.490 --> 00:29:04.880
which you choose.
00:29:04.880 --> 00:29:07.380
So you could flip a coin.
00:29:07.380 --> 00:29:11.150
It's a little easier if you just
say, we're going to do
00:29:11.150 --> 00:29:14.240
this under this condition.
00:29:14.240 --> 00:29:16.770
So we state the
condition this way.
00:29:16.770 --> 00:29:19.540
We calculate the likelihood
ratio.
00:29:19.540 --> 00:29:21.680
We compare it with
a threshold.
00:29:21.680 --> 00:29:24.840
The threshold here
is P1 over P0.
00:29:24.840 --> 00:29:27.320
And then we select something.
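The threshold form of the test can be sketched as follows (assumed names; with eta equal to P1 over P0 this is exactly the MAP rule):

```python
def threshold_test(y, eta, f0, f1):
    # Likelihood ratio Lambda(y) = f(y | H=0) / f(y | H=1), with the
    # 0-hypothesis in the numerator by the convention of the lecture.
    ratio = f0(y) / f1(y)
    # Select H-hat = 0 when Lambda(y) > eta, and H-hat = 1 otherwise;
    # the tie Lambda(y) == eta is arbitrarily assigned to 1, as in
    # the text, since either choice gives the same error probability.
    return 0 if ratio > eta else 1
```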
00:29:27.320 --> 00:29:30.750
Why did I put a little
hat over this?
00:29:30.750 --> 00:29:31.680
AUDIENCE: Estimation.
00:29:31.680 --> 00:29:32.165
PROFESSOR: What?
00:29:32.165 --> 00:29:34.430
AUDIENCE: Because it's
an estimation.
00:29:34.430 --> 00:29:34.530
PROFESSOR: What?
00:29:34.530 --> 00:29:37.150
AUDIENCE: It's an estimation?
00:29:37.150 --> 00:29:38.750
PROFESSOR: Well, it's not
really an estimation.
00:29:38.750 --> 00:29:39.750
It's a detection.
00:29:39.750 --> 00:29:44.690
I mean, estimation you usually
view as being analog.
00:29:44.690 --> 00:29:47.210
Detection you usually view
as being digital.
00:29:47.210 --> 00:29:48.880
And thanks for bringing
that up because it's
00:29:48.880 --> 00:29:50.130
an important point.
00:29:53.700 --> 00:30:00.140
But in this model, H is either
0 or 1 as the result of this
00:30:00.140 --> 00:30:01.140
experiment.
00:30:01.140 --> 00:30:03.900
We don't know which it is.
00:30:03.900 --> 00:30:06.220
This is what we've chosen.
00:30:06.220 --> 00:30:12.320
So H hat is 0 does not mean
that H itself is 0.
00:30:12.320 --> 00:30:14.280
So this is our choice.
00:30:14.280 --> 00:30:17.360
It might be wrong or
it might be right.
00:30:17.360 --> 00:30:21.110
Many decision rules, including
the most common and the most
00:30:21.110 --> 00:30:24.980
sensible, are rules that compare
lambda of y to a fixed
00:30:24.980 --> 00:30:31.170
threshold, say, eta, is P1 over
P0, which is independent
00:30:31.170 --> 00:30:34.090
of y, which is just
a fixed threshold.
00:30:34.090 --> 00:30:37.350
The decision rules then vary
only in the way that you
00:30:37.350 --> 00:30:39.280
choose the threshold.
00:30:39.280 --> 00:30:44.030
Now, what happens as soon
as I call this eta
00:30:44.030 --> 00:30:47.160
instead of P1 over P0?
00:30:47.160 --> 00:30:51.660
My test becomes independent of
these a priori probabilities
00:30:51.660 --> 00:30:55.910
that statisticians have thought
about for so long.
00:30:55.910 --> 00:30:58.800
Namely, after a couple of lines
of fiddling around with
00:30:58.800 --> 00:31:03.840
these things, suddenly all
of that has disappeared.
00:31:03.840 --> 00:31:05.820
We have a threshold test.
00:31:05.820 --> 00:31:09.930
The threshold test says,
take this ratio--
00:31:09.930 --> 00:31:13.720
everybody agrees that there's
such a ratio that exists--
00:31:13.720 --> 00:31:16.400
and compare it with something.
00:31:16.400 --> 00:31:23.900
And if it's bigger than that
something, you choose 0.
00:31:23.900 --> 00:31:27.220
If it's less than that
thing, you choose 1.
00:31:27.220 --> 00:31:29.810
And that's the end of it.
00:31:29.810 --> 00:31:33.670
OK, so we have two questions.
00:31:33.670 --> 00:31:38.580
One, do we always want to use a
threshold test or are there
00:31:38.580 --> 00:31:40.390
cases where we should
use things other
00:31:40.390 --> 00:31:42.560
than a threshold test?
00:31:42.560 --> 00:31:48.010
And the second question is, if
we're going to use a threshold
00:31:48.010 --> 00:31:51.650
test, where should we
set the threshold?
00:31:51.650 --> 00:31:54.430
I mean, there's nothing that
says that you really want to
00:31:54.430 --> 00:31:58.530
minimize the probability
of error.
00:31:58.530 --> 00:32:02.740
I mean, suppose your test
is to see whether--
00:32:02.740 --> 00:32:06.000
I mean, something in
the news today.
00:32:06.000 --> 00:32:09.350
I mean, you'd like to take an
experiment to see whether your
00:32:09.350 --> 00:32:14.410
nuclear plant is going
to explode or not.
00:32:14.410 --> 00:32:16.940
So you come up with
one decision, it's
00:32:16.940 --> 00:32:18.450
not going to explode.
00:32:18.450 --> 00:32:21.580
Or another decision, you
decide it will explode.
00:32:21.580 --> 00:32:24.270
Presumably on the basis of
that decision, you do all
00:32:24.270 --> 00:32:26.340
sorts of things.
00:32:26.340 --> 00:32:30.190
Do you really want to make
it a maximum a posteriori
00:32:30.190 --> 00:32:32.420
probability decision?
00:32:32.420 --> 00:32:33.840
No.
00:32:33.840 --> 00:32:38.900
You recognize that if it's
going to explode, and you
00:32:38.900 --> 00:32:42.030
choose that it's not going to
explode and you don't do
00:32:42.030 --> 00:32:47.380
anything, there is a humongous
cost associated with that.
00:32:47.380 --> 00:32:49.860
If you decide the other way,
there's a pretty large cost
00:32:49.860 --> 00:32:52.180
associated with that also.
00:32:52.180 --> 00:32:54.940
But there's not really much
comparison between the two.
00:32:54.940 --> 00:32:58.360
But anyway, you want to do
something which takes those
00:32:58.360 --> 00:32:59.830
costs into account.
00:32:59.830 --> 00:33:02.170
One of the problems in the
homework does that.
00:33:02.170 --> 00:33:09.510
It's really almost trivial to
readjust this problem, so that
00:33:09.510 --> 00:33:13.860
you set the threshold to
involve the costs also.
00:33:13.860 --> 00:33:19.730
So if you have arbitrary costs
in making errors, then you
00:33:19.730 --> 00:33:21.660
change the threshold
a little bit.
00:33:21.660 --> 00:33:25.800
But you still use a
threshold test.
00:33:25.800 --> 00:33:29.010
There's something called maximum
likelihood that people
00:33:29.010 --> 00:33:32.420
like for making decisions.
00:33:32.420 --> 00:33:35.470
And maximum likelihood
says you calculate
00:33:35.470 --> 00:33:37.040
the likelihood ratio.
00:33:37.040 --> 00:33:39.620
And if the likelihood
ratio is bigger than
00:33:39.620 --> 00:33:41.650
1, you go this way.
00:33:41.650 --> 00:33:46.530
If it's less than 1,
you go this way.
00:33:46.530 --> 00:33:49.780
It's the MAP test if
the two a priori
00:33:49.780 --> 00:33:52.140
probabilities are equal.
00:33:52.140 --> 00:33:55.790
But in many cases, you want to
use it whether or not the a
00:33:55.790 --> 00:33:57.930
priori probabilities
are equal.
00:33:57.930 --> 00:34:00.890
It's a standard test,
and there are many
00:34:00.890 --> 00:34:02.670
reasons for using it.
00:34:02.670 --> 00:34:06.510
Aside from the fact that the a
priori probabilities might be
00:34:06.510 --> 00:34:07.860
chosen that way.
00:34:07.860 --> 00:34:10.870
So anyway, that's one
other choice.
00:34:10.870 --> 00:34:13.070
When we go a little further
today, we'll talk about a
00:34:13.070 --> 00:34:14.659
Neyman-Pearson test.
00:34:14.659 --> 00:34:20.560
The Neyman-Pearson test says,
for some reason or other, I
00:34:20.560 --> 00:34:23.989
want to make sure that the
probability that my nuclear
00:34:23.989 --> 00:34:29.250
plant doesn't blow up is
less than, say, 10
00:34:29.250 --> 00:34:30.570
to the minus fifth.
00:34:30.570 --> 00:34:32.560
Why 10 to the minus fifth?
00:34:32.560 --> 00:34:34.060
Pull it out of the air.
00:34:34.060 --> 00:34:37.389
Maybe 10 to the minus sixth,
that point our probabilities
00:34:37.389 --> 00:34:39.230
don't make much sense anymore.
00:34:39.230 --> 00:34:44.170
But however we choose it, we
choose our test to say, we
00:34:44.170 --> 00:34:47.860
can't make the probability of
error under one hypothesis
00:34:47.860 --> 00:34:52.300
bigger than some certain amount
alpha, then what test
00:34:52.300 --> 00:34:55.400
will minimize the probability
of error under the other
00:34:55.400 --> 00:34:57.360
hypothesis.
00:34:57.360 --> 00:35:00.020
Namely, if I have to get one
thing right, or I have to get
00:35:00.020 --> 00:35:03.610
it right almost all the time,
what's the best I can do on
00:35:03.610 --> 00:35:05.570
the other alternative?
00:35:05.570 --> 00:35:08.060
And that's the Neyman-Pearson
test.
00:35:08.060 --> 00:35:14.640
That is a favorite test among
the non-Bayesians because it
00:35:14.640 --> 00:35:18.260
doesn't involve the a priori
probabilities anymore.
00:35:18.260 --> 00:35:20.200
So it's a nice one
in that way.
00:35:20.200 --> 00:35:23.940
But we'll see, we get
it anyway using
00:35:23.940 --> 00:35:26.750
a probability model.
00:35:26.750 --> 00:35:29.890
OK, let's go back to random
walks just a little bit to see
00:35:29.890 --> 00:35:33.230
why we're doing what
we're doing.
00:35:33.230 --> 00:35:42.560
The logarithm of the threshold
ratio is logarithm of this
00:35:42.560 --> 00:35:44.352
lambda of y.
00:35:44.352 --> 00:35:45.845
I'm taking m observations.
00:35:45.845 --> 00:35:49.380
I'm putting that in explicitly,
is the sum from N
00:35:49.380 --> 00:35:53.810
equals 1 to m of the log of
the individual ratio.
00:35:53.810 --> 00:35:56.640
In other words, when you--
00:35:56.640 --> 00:36:00.960
under hypothesis 0, if I
calculate the probability of
00:36:00.960 --> 00:36:09.210
vector y given H equals 0, I'm
finding the probability of n
00:36:09.210 --> 00:36:11.250
things which are IID.
00:36:11.250 --> 00:36:16.320
So what I'm going to find this
probability density is taking
00:36:16.320 --> 00:36:19.720
the product of the probabilities
of each of the
00:36:19.720 --> 00:36:22.390
observations.
00:36:22.390 --> 00:36:25.530
Most of you know now that
any time you look at a
00:36:25.530 --> 00:36:29.210
probability, which is a product
of observations, what
00:36:29.210 --> 00:36:32.590
you'd really like to do is to
take the logarithm of it.
00:36:32.590 --> 00:36:35.120
So you're talking about a sum
of things rather than a
00:36:35.120 --> 00:36:38.540
product of things because
we all know how to add
00:36:38.540 --> 00:36:40.700
independent random variables.
00:36:40.700 --> 00:36:46.110
So the log of this likelihood
ratio, which is called the log
00:36:46.110 --> 00:36:50.940
likelihood ratio as you might
guess, is just a sum of these
00:36:50.940 --> 00:36:52.600
individual log likelihood ratios.
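For IID observations the joint density factors, so the log likelihood ratio is just a sum; a minimal sketch with assumed names:

```python
import math

def log_likelihood_ratio(ys, f0, f1):
    # Log of the likelihood ratio for a sequence of IID observations:
    # the product of per-observation ratios becomes a sum of logs,
    # which is the random walk discussed in the lecture.
    return sum(math.log(f0(y) / f1(y)) for y in ys)
```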
00:36:52.600 --> 00:36:55.900
If we look at this for each m
greater than or equal to 1,
00:36:55.900 --> 00:36:59.780
then given H equals 0,
it's a random walk.
00:36:59.780 --> 00:37:03.410
And given H equals 1, it's
another random walk.
00:37:03.410 --> 00:37:06.870
It's the same sequence of sample
values in both cases.
00:37:06.870 --> 00:37:10.300
Namely, as an experimentalist,
we're taking these
00:37:10.300 --> 00:37:11.490
observations.
00:37:11.490 --> 00:37:17.230
We don't know whether H equals
0 or H equals 1 is what the
00:37:17.230 --> 00:37:20.480
result of the experiment
is going to be.
00:37:20.480 --> 00:37:23.450
But what we do know is we know
what those values are.
00:37:23.450 --> 00:37:25.870
We can calculate this sum.
00:37:25.870 --> 00:37:38.320
And now, if we condition this
on H equals 0, then this
00:37:38.320 --> 00:37:41.550
quantity, which is fixed,
has a particular
00:37:41.550 --> 00:37:43.490
probability of occurring.
00:37:43.490 --> 00:37:47.040
So this is a random variable
then under the
00:37:47.040 --> 00:37:49.380
hypothesis H equals 0.
00:37:49.380 --> 00:37:53.670
It's a random variable under
the hypothesis H equals 1.
00:37:53.670 --> 00:37:57.590
And this sum of random variables
behaves in a very
00:37:57.590 --> 00:38:01.090
different way under these
two hypotheses.
00:38:01.090 --> 00:38:04.680
What's going to happen is that
under one hypothesis, the
00:38:04.680 --> 00:38:09.980
expected value of this log
likelihood ratio is going to
00:38:09.980 --> 00:38:12.310
linearly increase with n.
00:38:12.310 --> 00:38:16.050
If we look at it under the other
hypothesis, it's going
00:38:16.050 --> 00:38:20.530
to linearly decrease
as we increase n.
00:38:20.530 --> 00:38:24.550
And a nifty test at that point
is to say, as soon as it
00:38:24.550 --> 00:38:28.460
crosses a threshold up here or
a threshold down here, we're
00:38:28.460 --> 00:38:31.360
going to make a decision.
00:38:31.360 --> 00:38:35.320
And that's called a sequential
test in that case because you
00:38:35.320 --> 00:38:38.500
haven't specified ahead of time,
I'm going to take 100
00:38:38.500 --> 00:38:41.270
tests and then make
up my mind.
00:38:41.270 --> 00:38:44.420
You've specified that I'm going
to take as many tests as
00:38:44.420 --> 00:38:48.050
I need to be relatively sure
that I'm getting the right
00:38:48.050 --> 00:38:52.340
decision, which is what
you do in real life.
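A sequential test along these lines could be sketched as follows; the names and thresholds are illustrative, not from the lecture:

```python
import math

def sequential_test(observations, f0, f1, upper, lower):
    # Accumulate the log likelihood ratio one observation at a time
    # and stop as soon as it crosses a threshold: decide H-hat = 0 at
    # the upper threshold, H-hat = 1 at the lower one.
    s = 0.0
    for n, y in enumerate(observations, start=1):
        s += math.log(f0(y) / f1(y))
        if s >= upper:
            return 0, n
        if s <= lower:
            return 1, n
    return None, len(observations)  # ran out of data undecided
```

Unlike a fixed-sample test, the number of observations used is itself random, which is what makes the analysis trickier.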
00:38:52.340 --> 00:38:57.090
I mean, there's nothing fancy
about doing sequential tests.
00:38:57.090 --> 00:39:00.420
Those are the obvious things to
do, except they're a little
00:39:00.420 --> 00:39:05.530
more tricky to talk about using
probability theory.
00:39:05.530 --> 00:39:08.530
But anyway, that's where
we're headed.
00:39:08.530 --> 00:39:14.660
That's why we're talking about
hypothesis testing.
00:39:14.660 --> 00:39:19.090
Because when you look at it in
this formulation, we get a
00:39:19.090 --> 00:39:21.290
random walk.
00:39:21.290 --> 00:39:28.500
And it gives us a nice example
of when you want to use random
00:39:28.500 --> 00:39:33.060
walks crossing a threshold as
a way of making decisions.
00:39:33.060 --> 00:39:36.640
OK, so that's why we're doing
what we're doing.
00:39:41.050 --> 00:39:48.150
Now, let's go back and look at
threshold tests again, and try
00:39:48.150 --> 00:39:55.650
to see how we're going to make
threshold tests, what the
00:39:55.650 --> 00:39:59.460
error probabilities will be,
and try to analyze them a
00:39:59.460 --> 00:40:03.150
little more than just saying,
well, a MAP test does this.
00:40:03.150 --> 00:40:06.420
Because as soon as you see that
a MAP test does this, you
00:40:06.420 --> 00:40:10.030
say, well, suppose I use
some other test.
00:40:10.030 --> 00:40:12.500
And what am I going to
suffer from that?
00:40:12.500 --> 00:40:15.120
What am I going to gain by it?
00:40:15.120 --> 00:40:19.140
So it's worthwhile to, instead
of looking at even just
00:40:19.140 --> 00:40:22.580
threshold tests, to say,
well, let's look at
00:40:22.580 --> 00:40:25.300
any old test at all.
00:40:25.300 --> 00:40:28.745
Now, any test means
the following.
00:40:31.270 --> 00:40:34.150
I have this probability model.
00:40:34.150 --> 00:40:37.730
I've already bludgeoned you into
accepting the fact that's
00:40:37.730 --> 00:40:40.300
the probability model we're
going to be looking at.
00:40:44.390 --> 00:40:46.440
And we have this--
00:40:46.440 --> 00:40:49.120
well, we have the likelihood
ratio, but we don't care about
00:40:49.120 --> 00:40:50.710
that for the moment.
00:40:50.710 --> 00:40:53.390
But we make this observation.
00:40:53.390 --> 00:40:54.640
We got to make a decision.
00:40:57.330 --> 00:41:01.800
And our decision is going
to be either 1 or 0.
00:41:01.800 --> 00:41:03.743
How do we characterize
that mathematically?
00:41:08.300 --> 00:41:11.850
Or how do we calculate it if
we want a computer to make
00:41:11.850 --> 00:41:14.250
that decision for us?
00:41:14.250 --> 00:41:18.690
The systematic way to do it is
for every possible sequence of
00:41:18.690 --> 00:41:24.650
y to say ahead of time to give a
formula, which sequences get
00:41:24.650 --> 00:41:30.010
mapped into 1 and which
sequences get mapped into 0.
00:41:30.010 --> 00:41:36.220
So we're going to call a set A
the set of sample sequences
00:41:36.220 --> 00:41:38.750
that get mapped into
hypothesis 1.
00:41:38.750 --> 00:41:45.620
That's the most general binary
hypothesis test you can do.
00:41:45.620 --> 00:41:48.140
That includes all
possible ways of
00:41:48.140 --> 00:41:50.660
choosing either 1 or 0.
00:41:50.660 --> 00:41:55.320
You're forced to hire somebody
or not hire somebody.
00:41:55.320 --> 00:41:58.500
You can't get them to work for
you for two weeks, and then
00:41:58.500 --> 00:41:59.950
make a decision at that point.
00:41:59.950 --> 00:42:02.060
Well, sometimes you
can in this world.
00:42:02.060 --> 00:42:05.890
But if it's somebody you really
want and other people
00:42:05.890 --> 00:42:09.820
want them, too, then you've got
to decide, I'm going to go
00:42:09.820 --> 00:42:12.150
with this person or I'm not
going to go with them.
00:42:12.150 --> 00:42:18.240
So under all observations that
you've made, you need some way
00:42:18.240 --> 00:42:23.550
to decide which ones make
you go to decision 1.
00:42:23.550 --> 00:42:27.010
Which ones make you
go to decision 0.
00:42:27.010 --> 00:42:30.940
So we will just say arbitrarily,
there's a set A
00:42:30.940 --> 00:42:35.230
of sample sequences that
map into hypothesis 1.
00:42:35.230 --> 00:42:40.070
And the error probability for
each hypothesis using test A
00:42:40.070 --> 00:42:43.140
is given by-- and we'll just
call Q sub 0 of A--
00:42:43.140 --> 00:42:45.880
this is our name for the
error probability.
00:42:53.120 --> 00:42:54.610
Have I twisted this up?
00:42:54.610 --> 00:42:55.860
No.
00:42:58.300 --> 00:43:01.150
Q sub 0 of A is the probability
00:43:01.150 --> 00:43:02.400
that I actually choose--
00:43:05.880 --> 00:43:10.170
it's the probability that I
choose A given that the
00:43:10.170 --> 00:43:12.210
hypothesis is 0.
00:43:12.210 --> 00:43:20.000
Q sub 1 of A is the probability
that I choose 1.
00:43:20.000 --> 00:43:22.275
Blah, let me start
that over again.
00:43:25.400 --> 00:43:33.880
Q0 of A is the probability
that I'm going to choose
00:43:33.880 --> 00:43:37.460
hypothesis 1 given that
hypothesis 0 was the correct
00:43:37.460 --> 00:43:38.410
hypothesis.
00:43:38.410 --> 00:43:43.090
It's the probability that Y is
in A. That means that H hat is
00:43:43.090 --> 00:43:46.980
equal to 1 given that
H is actually 0.
00:43:46.980 --> 00:43:52.510
So that's the probability we
make an error given the
00:43:52.510 --> 00:43:56.190
hypothesis, the correct
hypothesis is 0.
00:43:56.190 --> 00:43:59.880
Q1 of A is the probability of
making an error given that the
00:43:59.880 --> 00:44:02.250
correct hypothesis is 1.
00:44:02.250 --> 00:44:04.870
If I have a priori
probabilities, I'm going back
00:44:04.870 --> 00:44:07.770
to assuming a priori
probabilities again.
00:44:07.770 --> 00:44:10.495
The probability of error is?
00:44:15.340 --> 00:44:21.970
It's P0 times the probability
I make an error given that H
00:44:21.970 --> 00:44:23.190
equals zero.
00:44:23.190 --> 00:44:26.770
plus P1, the a priori probability
of 1, times the probability I make
00:44:26.770 --> 00:44:28.920
an error given 1.
00:44:28.920 --> 00:44:30.260
I add these two up.
00:44:30.260 --> 00:44:33.750
I can write it this way.
00:44:33.750 --> 00:44:35.570
Don't ask for the time being.
00:44:35.570 --> 00:44:42.300
I'll just take the P0 out, so
it's Q0 of A plus P1 over P0
00:44:42.300 --> 00:44:47.920
Q1 of A. So that's what I've
called eta times Q1 of A.
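Numerically, the error probability just described might be checked as in this sketch (names assumed):

```python
def error_probability(p0, p1, q0, q1):
    # Pr{error} = p0 * Q0(A) + p1 * Q1(A).
    return p0 * q0 + p1 * q1

def factored_form(p0, eta, q0, q1):
    # The same quantity with p0 pulled out:
    #   p0 * (Q0(A) + eta * Q1(A)),  where eta = p1 / p0.
    return p0 * (q0 + eta * q1)
```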
00:44:47.920 --> 00:44:52.160
For the threshold test based
on eta, the probability of
00:44:52.160 --> 00:44:55.120
error is the same thing.
00:44:55.120 --> 00:44:59.455
But that A there is an eta.
00:44:59.455 --> 00:45:04.690
I hope you can imagine that
quantity there is an eta.
00:45:04.690 --> 00:45:05.930
This is an eta.
00:45:05.930 --> 00:45:10.710
So it's P0 times Q0 of eta
plus eta times Q1 of eta.
00:45:10.710 --> 00:45:14.840
So the error probability, under
this crazy test that you've
00:45:14.840 --> 00:45:19.180
designed, is P0 times
this quantity.
00:45:19.180 --> 00:45:23.370
Under the MAP test, probability
of error is this
00:45:23.370 --> 00:45:25.710
quantity here.
00:45:25.710 --> 00:45:28.830
What do we know about
the MAP test?
00:45:28.830 --> 00:45:33.300
It minimizes the error
probability under those a
00:45:33.300 --> 00:45:35.160
priori probabilities.
00:45:35.160 --> 00:45:39.720
So what we know about it is
that this quantity is less
00:45:39.720 --> 00:45:43.890
than or equal to
this quantity.
00:45:43.890 --> 00:45:48.630
Take out the P0's and it says
that this quantity is less
00:45:48.630 --> 00:45:50.280
than or equal to
this quantity.
00:45:53.250 --> 00:45:55.920
Pretty simple.
00:45:55.920 --> 00:46:00.320
Let's draw a picture that
shows what that means.
00:46:00.320 --> 00:46:01.724
Here's a result that we have.
00:46:04.690 --> 00:46:09.010
We know because of maximum a
posteriori probability for the
00:46:09.010 --> 00:46:15.990
threshold test that this is less
than or equal to this.
00:46:15.990 --> 00:46:17.970
This is the minimum
error probability.
00:46:17.970 --> 00:46:21.590
This is the error probability
you get with
00:46:21.590 --> 00:46:23.820
whatever test you like.
00:46:23.820 --> 00:46:30.810
So let's draw a picture on a
graph where the probability of
00:46:30.810 --> 00:46:37.810
error given H equals 1 is
on the horizontal axis.
00:46:37.810 --> 00:46:42.680
The probability of error
conditional on H equals 0 is
00:46:42.680 --> 00:46:46.550
on this axis.
00:46:46.550 --> 00:46:53.420
So I can list the probability
of error for the threshold
00:46:53.420 --> 00:46:55.970
test, which sits here.
00:46:55.970 --> 00:46:59.810
I can list the probability of
error for this arbitrary test,
00:46:59.810 --> 00:47:01.380
which sits here.
00:47:01.380 --> 00:47:05.880
And I know that this quantity
is greater than or equal to
00:47:05.880 --> 00:47:06.890
this quantity.
00:47:06.890 --> 00:47:14.400
So the only thing I have to do
now is to sort out using plain
00:47:14.400 --> 00:47:19.560
geometry, why these numbers
are what they are.
00:47:19.560 --> 00:47:26.760
This number here is Q0 of eta
plus eta times Q1 of eta.
00:47:26.760 --> 00:47:30.070
Here's Q1 of eta.
00:47:30.070 --> 00:47:33.630
This distance here
is Q1 of eta.
00:47:33.630 --> 00:47:37.600
We have a line of slope minus
eta there that we've drawn.
00:47:37.600 --> 00:47:42.890
So this point here is, in fact,
Q0 of eta plus eta times
00:47:42.890 --> 00:47:44.620
Q1 of eta .
00:47:44.620 --> 00:47:47.890
That's just plain geometry.
00:47:47.890 --> 00:47:57.880
This point is Q0 of A plus eta
times Q1 of A. Another line of
00:47:57.880 --> 00:48:00.660
slope minus eta.
00:48:00.660 --> 00:48:05.350
What we've shown is that this is
less than or equal to this.
00:48:10.620 --> 00:48:11.990
That's because of
the MAP rule.
00:48:11.990 --> 00:48:14.470
This has to be less than
or equal to that.
00:48:14.470 --> 00:48:16.740
So what have we shown here?
00:48:16.740 --> 00:48:21.260
We've shown that for every test
A you can imagine, when
00:48:21.260 --> 00:48:25.880
you draw that test on this
two-dimensional plot of error
00:48:25.880 --> 00:48:30.710
probability given H equals 1
versus error probability given
00:48:30.710 --> 00:48:33.090
H equals 0.
00:48:33.090 --> 00:48:37.360
Every test in the world lies
Northeast of this line here.
00:48:45.620 --> 00:48:46.050
Yeah?
00:48:46.050 --> 00:48:48.268
AUDIENCE: Can you say
again exactly what
00:48:48.268 --> 00:48:51.280
axis represents what?
00:48:51.280 --> 00:48:54.410
PROFESSOR: This axis here
represents the error
00:48:54.410 --> 00:48:58.200
probability given that H
equals 1 is the correct
00:48:58.200 --> 00:48:59.690
hypothesis.
00:48:59.690 --> 00:49:03.620
This axis is the error
probability given that 0 is
00:49:03.620 --> 00:49:04.870
the correct hypothesis.
00:49:07.420 --> 00:49:11.200
So we've defined Q1 of eta and
Q0 of eta as those two error
00:49:11.200 --> 00:49:12.510
probabilities.
00:49:12.510 --> 00:49:17.860
Using the threshold test, or
using the MAP test where eta
00:49:17.860 --> 00:49:20.470
is equal to P1 over P0.
00:49:20.470 --> 00:49:25.010
And this point here is whatever
it happens to be for
00:49:25.010 --> 00:49:27.300
any test that you
happen to like.
00:49:30.630 --> 00:49:35.390
You might have a supervisor who
wants to hire somebody and
00:49:35.390 --> 00:49:39.190
you view that person is a threat
to yourself, so you've
00:49:39.190 --> 00:49:43.010
taken all your observations and
you then make a decision.
00:49:43.010 --> 00:49:45.380
If the person is any good,
you say, don't hire him.
00:49:45.380 --> 00:49:47.480
If the person is good
you say, hire them.
00:49:47.480 --> 00:49:51.770
So just the opposite of
what you should do.
00:49:51.770 --> 00:49:57.310
But whatever you do, this says
this is less than or equal to
00:49:57.310 --> 00:50:00.910
this because of the MAP rule.
00:50:00.910 --> 00:50:05.680
And therefore, this point lies
up in that direction
00:50:05.680 --> 00:50:06.930
of this line here.
00:50:09.900 --> 00:50:12.630
You can do this for any eta that
you want to do it for.
00:50:15.820 --> 00:50:19.490
So for every eta that we want
to use, we get some value of
00:50:19.490 --> 00:50:22.240
Q0 of eta and Q1 of eta.
00:50:22.240 --> 00:50:25.180
These go along here
in some way.
00:50:25.180 --> 00:50:27.640
You can do the same
argument again.
00:50:27.640 --> 00:50:33.770
For every threshold test, every
point lies Northeast of
00:50:33.770 --> 00:50:37.380
the line of slope minus eta
through that threshold test.
00:50:37.380 --> 00:50:43.200
We get a whole family of curves
when eta is very big,
00:50:43.200 --> 00:50:47.170
the curve of slope minus
eta goes like this.
00:50:47.170 --> 00:50:50.260
When eta is very small,
it goes like this.
00:50:54.050 --> 00:50:58.000
We just think of ourselves
plotting all these curves,
00:50:58.000 --> 00:51:02.930
taking the upper envelope of
them because every test has to
00:51:02.930 --> 00:51:06.770
lie Northeast of every
one of those lines.
00:51:06.770 --> 00:51:11.240
So we take the upper envelope of
all of these lines, and we
00:51:11.240 --> 00:51:15.050
get something that
looks like this.
00:51:15.050 --> 00:51:18.450
We call this the error curve.
00:51:18.450 --> 00:51:23.730
And this is the upper envelope
of the straight lines of slope
00:51:23.730 --> 00:51:28.110
minus eta that go through the
threshold tests at eta.
00:51:31.510 --> 00:51:33.830
You get something else
from that, too.
00:51:33.830 --> 00:51:37.330
This curve is convex.
00:51:37.330 --> 00:51:39.110
Why is the curve convex?
00:51:39.110 --> 00:51:42.380
Well, you might like to take the
second derivative of it,
00:51:42.380 --> 00:51:45.090
but that's a pain in the neck.
00:51:45.090 --> 00:51:52.110
But the fundamental definition
of convexity is that a
00:51:52.110 --> 00:51:55.730
one-dimensional curve is convex
if all of its tangents
00:51:55.730 --> 00:51:58.230
lie underneath the curve.
00:51:58.230 --> 00:52:00.040
That's the way we've
constructed this.
00:52:00.040 --> 00:52:02.870
It's the upper envelope of a
bunch of straight lines.
00:52:02.870 --> 00:52:03.360
Yes?
00:52:03.360 --> 00:52:06.520
AUDIENCE: Can you please
explain, what is u of alpha?
00:52:06.520 --> 00:52:08.930
PROFESSOR: U of alpha
is just what I've
00:52:08.930 --> 00:52:11.870
called this upper envelope.
00:52:11.870 --> 00:52:14.420
This upper envelope
is now a function.
00:52:14.420 --> 00:52:16.110
AUDIENCE: What's
the definition?
00:52:16.110 --> 00:52:16.520
PROFESSOR: What?
00:52:16.520 --> 00:52:17.700
AUDIENCE: What is
the definition?
00:52:17.700 --> 00:52:19.980
PROFESSOR: The definition is
the upper envelope of all
00:52:19.980 --> 00:52:23.235
these straight lines.
00:52:23.235 --> 00:52:24.485
AUDIENCE: For changing eta?
00:52:24.485 --> 00:52:24.870
PROFESSOR: What?
00:52:24.870 --> 00:52:27.490
AUDIENCE: For changing eta?
00:52:27.490 --> 00:52:28.960
PROFESSOR: Yes.
00:52:28.960 --> 00:52:35.540
As eta changes, I get a whole
bunch of these points.
00:52:35.540 --> 00:52:37.940
I got a whole bunch
of these points.
00:52:37.940 --> 00:52:41.480
I take the upper envelope of all
of these straight lines.
00:52:44.200 --> 00:52:47.010
I mean, yes, you'd rather
see an equation.
00:52:47.010 --> 00:52:51.670
But if you see an equation
it's terribly ugly.
00:52:51.670 --> 00:52:55.590
I mean, you can program
a computer to do this
00:52:55.590 --> 00:52:59.310
as easily as you can
program it to
00:52:59.310 --> 00:53:02.800
follow a bunch of equations.
00:53:02.800 --> 00:53:06.230
But anyway, I'm not interested
in actually solving for this
00:53:06.230 --> 00:53:07.480
curve in particular.
00:53:13.420 --> 00:53:16.660
I am particularly interested
in the fact that this upper
00:53:16.660 --> 00:53:22.990
envelope is, in fact, a convex
curve and that the threshold
00:53:22.990 --> 00:53:25.900
tests lie on the curve.
00:53:25.900 --> 00:53:30.690
The other tests lie Northeast
of the curve.
00:53:30.690 --> 00:53:34.280
And that's the reason you want
to use threshold tests.
00:53:34.280 --> 00:53:38.810
And it has nothing to do with a
priori probabilities at all.
00:53:38.810 --> 00:53:41.730
So you see, the thing we've done
is to start out assuming
00:53:41.730 --> 00:53:44.120
a priori probabilities.
00:53:44.120 --> 00:53:49.450
We've derived this neat result
using a priori probabilities.
00:53:49.450 --> 00:53:55.196
But now we have this
error curve.
00:53:55.196 --> 00:53:59.240
Well, to give you a better
definition of what u of alpha
00:53:59.240 --> 00:54:07.620
is, u of alpha is the error
probability under hypothesis 1
00:54:07.620 --> 00:54:12.450
if the error probability under
hypothesis 0 was alpha.
00:54:12.450 --> 00:54:16.160
You pick an error probability
here.
00:54:16.160 --> 00:54:18.660
You go up to that point here.
00:54:18.660 --> 00:54:21.750
There's a threshold
test there.
00:54:21.750 --> 00:54:24.730
You read over there.
00:54:24.730 --> 00:54:28.480
And at that point, you find the
probability of error given
00:54:28.480 --> 00:54:30.426
H equals 1.
00:54:30.426 --> 00:54:31.830
AUDIENCE: How do you know
that the threshold
00:54:31.830 --> 00:54:35.580
tests lie on the curve?
00:54:35.580 --> 00:54:42.640
PROFESSOR: Well, this threshold
test here is
00:54:42.640 --> 00:54:45.220
Southwest of all tests.
00:54:48.420 --> 00:54:53.175
And therefore, it can't lie
above this upper envelope.
00:54:57.300 --> 00:55:00.740
Now, I've cheated you
in one small way.
00:55:00.740 --> 00:55:07.370
If you have a discrete test,
what you're going to wind up
00:55:07.370 --> 00:55:12.580
with is just a finite set of
these possible points here.
00:55:12.580 --> 00:55:15.650
So you're going to wind up with
the upper envelope of a
00:55:15.650 --> 00:55:18.130
finite set of straight lines.
00:55:18.130 --> 00:55:21.770
So the straight line is
actually going to be--
00:55:21.770 --> 00:55:26.120
it's still convex, but it's
piecewise linear.
00:55:26.120 --> 00:55:30.970
And it's piecewise linear, and
the threshold tests are at the
00:55:30.970 --> 00:55:33.320
points of that curve.
00:55:33.320 --> 00:55:36.300
And in between those points,
you don't quite
00:55:36.300 --> 00:55:37.550
know what to do.
00:55:40.890 --> 00:55:44.630
So since you don't quite know
what to do in between those
00:55:44.630 --> 00:55:51.400
points, as far as the maximum a
posteriori probability test
00:55:51.400 --> 00:55:58.030
goes, you can reach any one of
those points, sometimes using
00:55:58.030 --> 00:56:00.890
one test on one corner of--
00:56:00.890 --> 00:56:02.550
I guess it's easier
if I draw it.
00:56:07.140 --> 00:56:10.480
And I didn't want to get into
this particularly because it's
00:56:10.480 --> 00:56:12.280
a little messier.
00:56:18.320 --> 00:56:20.600
So you could have this
kind of curve.
00:56:20.600 --> 00:56:24.550
And the notes talk about
this in detail.
00:56:24.550 --> 00:56:29.370
So the threshold test corresponds
to this point.
00:56:29.370 --> 00:56:33.550
This point says always
decide one.
00:56:33.550 --> 00:56:38.020
Don't pay any attention to the
tests at all, just say I think
00:56:38.020 --> 00:56:40.980
one is the right hypothesis.
00:56:40.980 --> 00:56:44.880
I mean, this is the testing
philosophy of people who don't
00:56:44.880 --> 00:56:46.980
believe in experimentalism.
00:56:46.980 --> 00:56:48.840
They've already made
up their mind.
00:56:48.840 --> 00:56:50.330
They look at the results.
00:56:50.330 --> 00:56:52.200
They say, that's very
interesting.
00:56:52.200 --> 00:56:56.070
And then they say, I'm
going to choose this.
00:56:56.070 --> 00:57:00.944
These other points are our
particular threshold tests.
00:57:04.680 --> 00:57:07.680
If you want to get error
probabilities in the middle
00:57:07.680 --> 00:57:09.420
here, what do you do?
00:57:09.420 --> 00:57:11.500
You use a randomized test.
00:57:11.500 --> 00:57:12.700
Sometimes you use this.
00:57:12.700 --> 00:57:14.150
Sometimes you use this.
00:57:14.150 --> 00:57:17.120
You flip a coin and choose
whichever one of these you
00:57:17.120 --> 00:57:18.370
want to choose.
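As a sketch of that coin flip (the corner points here are made-up numbers for illustration): randomizing between two threshold tests achieves any error pair on the chord between their two points, which is exactly how the straight segments of a piecewise-linear error curve are filled in.

```python
import random

def mixed_errors(errs_a, errs_b, p):
    """Error pair achieved by using test A with probability p and
    test B with probability 1 - p: the convex combination of the
    two tests' (error-given-H0, error-given-H1) pairs."""
    return tuple(p * ea + (1 - p) * eb for ea, eb in zip(errs_a, errs_b))

def randomized_test(test_a, test_b, p, observation):
    """Flip a biased coin, then apply whichever test came up."""
    chosen = test_a if random.random() < p else test_b
    return chosen(observation)

# two hypothetical corner points of a piecewise-linear error curve
corner_a = (0.10, 0.40)
corner_b = (0.30, 0.20)
midpoint = mixed_errors(corner_a, corner_b, 0.5)  # lands on the chord
```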
00:57:20.990 --> 00:57:27.160
So what this says is the
Neyman-Pearson test, which is
00:57:27.160 --> 00:57:36.190
the test that says pick some
alpha, which is the error
00:57:36.190 --> 00:57:39.630
probability under hypothesis
0 that
00:57:39.630 --> 00:57:41.660
you're willing to tolerate.
00:57:41.660 --> 00:57:44.130
So you pick alpha.
00:57:44.130 --> 00:57:48.330
And then it says, minimize the
error probability of the other
00:57:48.330 --> 00:57:51.790
kind, so you read over there.
00:57:51.790 --> 00:57:56.630
And the Neyman-Pearson test,
what it does is it minimizes
00:57:56.630 --> 00:58:02.130
the error probability under
the other hypothesis.
00:58:02.130 --> 00:58:05.530
Now, when this curve is
piecewise linear, the
00:58:05.530 --> 00:58:09.530
Neyman-Pearson test is not a
threshold test, but it's a
00:58:09.530 --> 00:58:11.930
randomized threshold test.
00:58:11.930 --> 00:58:15.300
Sometimes when you're at a point
like this, you have to
00:58:15.300 --> 00:58:17.485
use this test and this
test sometimes.
00:58:20.340 --> 00:58:25.710
For most of the tests that you
deal with, Neyman-Pearson test
00:58:25.710 --> 00:58:28.570
is just the threshold
test that's at
00:58:28.570 --> 00:58:31.180
that particular point.
00:58:34.670 --> 00:58:38.100
Any questions about that?
00:58:38.100 --> 00:58:39.960
This is probably one of these
things you have to think about
00:58:39.960 --> 00:58:40.980
a little bit.
00:58:40.980 --> 00:58:41.510
Yes?
00:58:41.510 --> 00:58:44.330
AUDIENCE: When you say you have
to use this test or this
00:58:44.330 --> 00:58:48.045
test, are you talking about
threshold or are you talking
00:58:48.045 --> 00:58:51.184
about-- because this is always--
it's either H equals
00:58:51.184 --> 00:58:53.846
0 or H equal 1, right?
00:58:53.846 --> 00:58:56.508
What do you mean when you say
you have to randomize between
00:58:56.508 --> 00:58:59.180
the two tests?
00:58:59.180 --> 00:59:00.695
PROFESSOR: I mean threshold tests--
00:59:15.120 --> 00:59:20.290
if I have a finite set of
alternatives, and I'm doing a
00:59:20.290 --> 00:59:24.640
threshold test on that finite
set of alternatives, I only
00:59:24.640 --> 00:59:29.500
have a finite number
of things I can do.
00:59:29.500 --> 00:59:33.230
As I increase the threshold,
I suddenly get to the point
00:59:33.230 --> 00:59:36.750
where this ratio of likelihoods
00:59:36.750 --> 00:59:39.190
includes one more point.
00:59:39.190 --> 00:59:41.500
And then it gets to the point
where it includes one other
00:59:41.500 --> 00:59:43.770
point and so forth.
00:59:43.770 --> 00:59:49.430
So that what happens is that
this upper envelope is just
00:59:49.430 --> 00:59:53.320
the upper envelope of a finite
number of points.
00:59:53.320 --> 00:59:56.980
And this upper envelope of a
finite number of points, the
00:59:56.980 --> 01:00:00.500
threshold tests are just
the corners there.
01:00:00.500 --> 01:00:04.330
So I sometimes have to randomize
between them.
01:00:04.330 --> 01:00:05.880
If you don't like
that, ignore it.
01:00:09.130 --> 01:00:16.450
Because for most tests you deal
with, almost all books on
01:00:16.450 --> 01:00:20.300
statistics that I've ever
seen, it just says the
01:00:20.300 --> 01:00:25.130
Neyman-Pearson test looks at the
threshold curve, at this
01:00:25.130 --> 01:00:26.610
error curve.
01:00:26.610 --> 01:00:29.040
And it chooses accordingly.
01:00:29.040 --> 01:00:31.228
Yes?
01:00:31.228 --> 01:00:36.590
AUDIENCE: You can put the
previous slide back?
01:00:36.590 --> 01:00:42.690
You told us that because
of maximum a posteriori
01:00:42.690 --> 01:00:49.870
probability, if eta is equal to
P0 divided by P1, then the
01:00:49.870 --> 01:00:51.950
probability of error
is minimized.
01:00:51.950 --> 01:00:56.900
And so the errors of the
test A are [INAUDIBLE].
01:00:59.750 --> 01:01:04.738
But if we start changing eta
from 0 to infinity, it doesn't
01:01:04.738 --> 01:01:05.704
have to be anymore.
01:01:05.704 --> 01:01:09.175
[INAUDIBLE], which means
the error is
01:01:09.175 --> 01:01:11.015
not necessarily minimized.
01:01:11.015 --> 01:01:13.170
So the argument doesn't
hold anymore.
01:01:13.170 --> 01:01:17.880
PROFESSOR: As I change eta, I'm
changing P1 and P0 also.
01:01:17.880 --> 01:01:21.760
In other words, now what I'm
doing is I'm saying, let's
01:01:21.760 --> 01:01:27.240
look at this threshold test,
and let's visualize what
01:01:27.240 --> 01:01:32.010
happens as I change the a
priori probabilities.
01:01:32.010 --> 01:01:37.390
So I'm suddenly becoming a
classical statistician instead
01:01:37.390 --> 01:01:40.340
of a Bayesian one.
01:01:40.340 --> 01:01:42.250
But I know what the answers
are from looking at the
01:01:42.250 --> 01:01:43.500
Bayesian case.
01:01:48.370 --> 01:01:53.290
OK, so let's move on.
01:01:57.160 --> 01:02:02.245
I mean, we now sort of see
that these tests--
01:02:05.180 --> 01:02:08.120
well, one thing we've seen is
when you have to make a
01:02:08.120 --> 01:02:13.120
decision under this kind of
probabilistic model we've been
01:02:13.120 --> 01:02:18.070
talking about-- namely, two
hypotheses, IID random
01:02:18.070 --> 01:02:20.153
variable is conditional
on each hypothesis.
01:02:23.420 --> 01:02:26.370
Those hypothesis testing
problems turn
01:02:26.370 --> 01:02:29.350
into random walk problems.
01:02:29.350 --> 01:02:32.580
We also saw that
the G/G/1 queue--
01:02:32.580 --> 01:02:37.040
when I started looking at when
the system becomes empty, and
01:02:37.040 --> 01:02:43.010
how long it takes to start to
fill up again, that problem is
01:02:43.010 --> 01:02:44.880
a random walk problem.
01:02:44.880 --> 01:02:48.000
So now I want to start to ask
the question, what's the
01:02:48.000 --> 01:02:52.470
probability that a random walk
will cross a threshold?
01:02:52.470 --> 01:02:54.700
I'm going to apply the
Chernoff bound to it.
01:02:54.700 --> 01:02:56.010
You remember the
Chernoff bound?
01:02:56.010 --> 01:03:00.410
We talked about it a little
bit back on the
01:03:00.410 --> 01:03:03.180
second week of the term.
01:03:03.180 --> 01:03:06.420
We were talking about the Markov
inequality and the
01:03:06.420 --> 01:03:08.270
Chebyshev inequality.
01:03:08.270 --> 01:03:12.200
And we said that the Chernoff
inequality was the same sort
01:03:12.200 --> 01:03:17.780
of thing, except it was based
on e to the rZ rather than x
01:03:17.780 --> 01:03:20.290
or x squared.
01:03:20.290 --> 01:03:24.620
And we talked a little bit
about its properties.
01:03:24.620 --> 01:03:28.790
The major thing one uses the
Chernoff bound for is to get
01:03:28.790 --> 01:03:33.020
good estimates very, very
far away from the mean.
01:03:33.020 --> 01:03:36.200
In other words, good estimates
of probabilities that are
01:03:36.200 --> 01:03:38.040
very, very small.
01:03:38.040 --> 01:03:41.370
I've grown up using these all
my life because I've been
01:03:41.370 --> 01:03:43.440
concerned with error
probabilities in
01:03:43.440 --> 01:03:46.010
communication systems.
01:03:46.010 --> 01:03:49.630
You typically want error
probabilities that run between
01:03:49.630 --> 01:03:53.420
10 to the minus fifth and
10 to the minus eighth.
01:03:53.420 --> 01:03:58.940
So you want to look at points
which are quite far away.
01:03:58.940 --> 01:04:02.550
I mean, you take a
large number of--
01:04:02.550 --> 01:04:05.230
you take a sum of a large number
of variables, which
01:04:05.230 --> 01:04:09.330
correspond to a code.
01:04:09.330 --> 01:04:12.400
And you look at error
probabilities for this rather
01:04:12.400 --> 01:04:13.790
complicated thing.
01:04:13.790 --> 01:04:16.380
But you're looking very, very
far away from the mean, and
01:04:16.380 --> 01:04:19.620
you're looking at very large
numbers of observations.
01:04:19.620 --> 01:04:25.920
So instead of the kinds of
things where we deal with
01:04:25.920 --> 01:04:28.380
things like the central limit
theorem where you're trying to
01:04:28.380 --> 01:04:31.430
figure out what goes on close
to the mean, here you're
01:04:31.430 --> 01:04:36.170
trying to figure out what goes
on very far from the mean.
01:04:36.170 --> 01:04:40.990
OK, so what the Chernoff bound
says is that the probability
01:04:40.990 --> 01:04:45.820
that a random variable Z is
greater than or equal to some
01:04:45.820 --> 01:04:47.390
constant b.
01:04:47.390 --> 01:04:50.580
We don't even need sums of
random variables here, it's
01:04:50.580 --> 01:04:54.590
just a Chernoff bound is a
bound on the tail of a
01:04:54.590 --> 01:04:55.980
distribution.
01:04:55.980 --> 01:04:59.800
Is less than or equal to the
moment generating function of
01:04:59.800 --> 01:05:01.810
that random variable.
01:05:01.810 --> 01:05:08.090
g sub Z of r is the expected
value of e to the rZ.
01:05:08.090 --> 01:05:10.890
These generating functions,
you can calculate
01:05:10.890 --> 01:05:12.960
them if you want to.
01:05:12.960 --> 01:05:15.390
Times e to the minus rb.
01:05:15.390 --> 01:05:18.550
This is the Markov inequality
for the random
01:05:18.550 --> 01:05:21.750
variable e to the rZ.
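A quick numerical check of this bound, using an exponential random variable as an example of my own (the lecture doesn't compute this case): for Z ~ Exp(1), the MGF is g(r) = 1/(1 - r) for r < 1, and the bound g(r)e^{-rb} always sits above the exact tail e^{-b}.

```python
import math

def chernoff_bound(b, r):
    """Chernoff bound P(Z >= b) <= g(r) * e^{-rb} for Z ~ Exp(1),
    whose MGF g(r) = 1/(1 - r) exists only for r < 1."""
    assert 0 < r < 1
    return math.exp(-r * b) / (1.0 - r)

b = 5.0
exact = math.exp(-b)               # the true tail P(Z >= b)
best = chernoff_bound(b, 1 - 1/b)  # optimizing over r gives r* = 1 - 1/b
```

The optimized bound works out to b·e^{1-b}: the exponent matches the true tail exactly, only the coefficient is loose, which is the "exponentially tight" behavior discussed later.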
01:05:21.750 --> 01:05:26.330
And go back and review
chapter 1.
01:05:26.330 --> 01:05:29.770
I think it's section
1.43 or something.
01:05:29.770 --> 01:05:34.180
It's the section that deals with
the Markov inequality,
01:05:34.180 --> 01:05:40.970
the Chebyshev inequality,
and the Chernoff bound.
01:05:40.970 --> 01:05:43.880
And as I told you once when we
talked about these things,
01:05:43.880 --> 01:05:45.620
Chernoff is still
alive and well.
01:05:45.620 --> 01:05:47.840
He's a statistician
at Harvard.
01:05:47.840 --> 01:05:51.480
He was somewhat embarrassed by
this inequality becoming so
01:05:51.480 --> 01:05:55.620
famous because he did it as sort
of a throw-off thing in a
01:05:55.620 --> 01:05:59.250
paper where he was trying to
do something which was much
01:05:59.250 --> 01:06:02.290
more mathematically
sophisticated.
01:06:02.290 --> 01:06:05.440
And now the poor guy is only
known for this thing that he
01:06:05.440 --> 01:06:06.690
views as being trivial.
01:06:11.360 --> 01:06:14.220
But what the bound says is the
probability of Z is greater
01:06:14.220 --> 01:06:17.380
than or equal to b is
this inequality.
01:06:17.380 --> 01:06:20.840
Strangely enough, the
probability that Z is less
01:06:20.840 --> 01:06:25.710
than or equal to b is bounded
by the same inequality.
01:06:25.710 --> 01:06:28.980
But one of them, r
is bigger than 0.
01:06:28.980 --> 01:06:33.220
And the other one,
r is less than 0.
01:06:33.220 --> 01:06:35.230
And you have to go back and
read that section to
01:06:35.230 --> 01:06:37.800
understand why.
01:06:37.800 --> 01:06:40.560
Now, this is most useful when
it's applied to a sum of
01:06:40.560 --> 01:06:42.270
random variables.
01:06:42.270 --> 01:06:46.670
I don't know of any applications
for it otherwise.
01:06:46.670 --> 01:06:50.580
So if the moment-generating
function--
01:06:50.580 --> 01:06:52.870
oh, incidentally, also.
01:06:52.870 --> 01:06:56.380
When most people talk about
moment-generating functions,
01:06:56.380 --> 01:06:59.650
and certainly when people talked
about moment-generating
01:06:59.650 --> 01:07:04.640
functions before the 1950s or
so, what they were always
01:07:04.640 --> 01:07:08.830
interested in is the fact that
if you take derivatives of the
01:07:08.830 --> 01:07:12.540
moment-generating functions,
you generate the moments of
01:07:12.540 --> 01:07:14.980
the random variable.
01:07:14.980 --> 01:07:17.860
If you take the derivative of
this with respect to r,
01:07:17.860 --> 01:07:22.970
evaluate it at r equals 0, you
get the expected value of Z.
01:07:22.970 --> 01:07:26.610
If you take the second
derivative evaluated at r
01:07:26.610 --> 01:07:30.720
equals 0, you get the
expected value of Z
01:07:30.720 --> 01:07:32.700
squared, and so forth.
01:07:32.700 --> 01:07:36.810
You can see that by just taking
the derivative of that.
01:07:36.810 --> 01:07:38.580
Here, we're looking
at something else.
01:07:38.580 --> 01:07:42.200
We're not looking at what goes
on around r equals 0.
01:07:42.200 --> 01:07:45.640
We're trying to figure out what
goes on way on the far
01:07:45.640 --> 01:07:48.760
tails of these distributions.
01:07:48.760 --> 01:07:56.860
So if gX of r is the expected
value of e to the rX, then the
expected value of e to the r Sn--
01:07:56.860 --> 01:07:59.380
Sn is the sum of these
random variables--
01:07:59.380 --> 01:08:04.590
is the expected value of the
product of e to the rXi.
01:08:04.590 --> 01:08:07.300
Namely, it's e to the r
01:08:07.300 --> 01:08:09.150
sum of Xi.
01:08:09.150 --> 01:08:11.020
So that turns into a product.
01:08:11.020 --> 01:08:15.520
The expected value of a product
of a finite number of
01:08:15.520 --> 01:08:19.319
terms is the product of
the expected value.
01:08:19.319 --> 01:08:23.460
So it's gX of r to
the n-th power.
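The product rule is easy to verify numerically. Here's a sketch with Bernoulli(p) summands (my own choice of example): the MGF of S_n computed directly from the binomial PMF matches g_X(r) raised to the n-th power.

```python
import math

def g_bernoulli(r, p):
    """MGF of a single Bernoulli(p): E[e^{rX}] = 1 - p + p e^r."""
    return 1 - p + p * math.exp(r)

def g_binomial_direct(r, n, p):
    """MGF of S_n = X_1 + ... + X_n computed from the binomial PMF."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) * math.exp(r * k)
               for k in range(n + 1))

n, p, r = 10, 0.3, 0.7
assert abs(g_bernoulli(r, p)**n - g_binomial_direct(r, n, p)) < 1e-9
```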
01:08:23.460 --> 01:08:27.200
So if I want to write this, now
I'm applying the Chernoff
01:08:27.200 --> 01:08:30.020
bound to the random
variable S sub n.
01:08:30.020 --> 01:08:32.880
What's the probability that S
sub n is greater than or equal
01:08:32.880 --> 01:08:34.840
to n times a?
01:08:34.840 --> 01:08:39.000
It's gX to the n of r times
e to the minus rna.
01:08:39.000 --> 01:08:41.260
That's what the Chernoff
bound says.
01:08:41.260 --> 01:08:46.640
This is the Chernoff bound over
on the other side of the
01:08:46.640 --> 01:08:49.240
distribution.
01:08:49.240 --> 01:08:54.020
This only makes sense and has
interesting values when a is
01:08:54.020 --> 01:08:56.990
bigger than the mean or when
a is less than the mean.
01:08:56.990 --> 01:09:01.210
And when r is greater than 0 for
this one and less than 0
01:09:01.210 --> 01:09:02.460
for this one.
01:09:07.370 --> 01:09:10.640
Now, this is easier to
interpret and it's
01:09:10.640 --> 01:09:13.729
easier to work with.
01:09:13.729 --> 01:09:20.819
If you take that product of
terms gX of r to the n-th
01:09:20.819 --> 01:09:27.020
power and you visualize the
logarithm of g sub X of r.
01:09:27.020 --> 01:09:31.850
Visualize the logarithm of g
sub X of r, then you get this
01:09:31.850 --> 01:09:33.114
quantity up here.
01:09:41.340 --> 01:09:44.529
You get the probability that Sn
is greater than or equal to
01:09:44.529 --> 01:09:51.350
na is this e to the n times
gamma x of r minus ra.
01:09:51.350 --> 01:09:57.600
Gamma is the logarithm of the
moment-generating function.
01:09:57.600 --> 01:10:00.710
The logarithm of the
moment-generating function is
01:10:00.710 --> 01:10:02.980
always called the
semi-invariant
01:10:02.980 --> 01:10:04.980
moment-generating function.
01:10:04.980 --> 01:10:07.620
The name is, again, because
people were originally
01:10:07.620 --> 01:10:10.570
interested in the
moment-generating properties
01:10:10.570 --> 01:10:12.480
of these random variables.
01:10:12.480 --> 01:10:17.060
If you sit down and take
the derivatives, I can
01:10:17.060 --> 01:10:19.080
probably do it here.
01:10:19.080 --> 01:10:21.195
It's simple enough that
I won't get confused.
01:10:26.640 --> 01:10:37.140
The derivative with respect to
r of the logarithm of g of r
01:10:37.140 --> 01:10:44.810
is the first derivative of g of
r divided by g of r.
01:10:44.810 --> 01:10:52.890
And the second derivative
is then the
01:10:52.890 --> 01:10:55.660
natural log of g of r.
01:10:55.660 --> 01:11:00.120
Taking the derivative of that is
equal to g double prime of
01:11:00.120 --> 01:11:06.000
r over g of r squared.
01:11:06.000 --> 01:11:09.300
Tell me if I'm making a mistake
here because I usually
01:11:09.300 --> 01:11:11.360
do when I do this.
01:11:11.360 --> 01:11:19.690
Minus g of r and g prime of r.
01:11:22.950 --> 01:11:35.770
Probably divided by
this squared.
01:11:35.770 --> 01:11:37.020
Let's see.
01:11:37.020 --> 01:11:38.470
Is this right?
01:11:38.470 --> 01:11:41.810
Who can take derivatives here?
01:11:41.810 --> 01:11:43.620
AUDIENCE: First term doesn't
have a square in it.
01:11:43.620 --> 01:11:43.970
PROFESSOR: What?
01:11:43.970 --> 01:11:45.875
AUDIENCE: First term doesn't
have a square in the
01:11:45.875 --> 01:11:47.150
denominator.
01:11:47.150 --> 01:11:49.780
PROFESSOR: First term?
01:11:49.780 --> 01:11:51.610
Yeah.
01:11:51.610 --> 01:11:53.280
Oh, the first thing doesn't
have a square.
01:11:53.280 --> 01:11:54.375
No, you're right.
01:11:54.375 --> 01:11:56.350
AUDIENCE: Second one
doesn't have--
01:11:56.350 --> 01:11:59.400
PROFESSOR: And the second
one, let's see.
01:11:59.400 --> 01:12:00.650
We have--
01:12:03.850 --> 01:12:06.230
we just have g prime
of r squared
01:12:06.230 --> 01:12:08.150
divided by g of r squared.
01:12:08.150 --> 01:12:12.930
And we evaluate this
at r equals 0.
01:12:12.930 --> 01:12:14.930
This term becomes 1.
01:12:14.930 --> 01:12:17.340
This term becomes 1.
01:12:17.340 --> 01:12:22.760
This term becomes the second
moment x squared bar.
01:12:22.760 --> 01:12:26.300
And this term becomes
x bar squared.
01:12:26.300 --> 01:12:32.030
And this whole thing becomes the
variance of
01:12:32.030 --> 01:12:37.980
the random variable rather
than the second moment.
01:12:37.980 --> 01:12:43.100
All of these terms might be
wrong, but this term is right.
01:12:43.100 --> 01:12:47.010
And I'm sure all of you can
rewrite that and evaluate it
01:12:47.010 --> 01:12:48.190
at r equals 0.
01:12:48.190 --> 01:12:50.240
So that's why it's called
the semi-invariant
01:12:50.240 --> 01:12:52.280
moment-generating function.
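Those two facts -- gamma prime of 0 is the mean, gamma double prime of 0 is the variance -- can be checked with finite differences. The example distribution (exponential with rate 2) is my own; the identities hold for any distribution whose MGF exists around r equals 0.

```python
import math

def gamma_exp(r, lam=2.0):
    """Semi-invariant MGF of an Exp(lam) variable:
    gamma(r) = ln(lam / (lam - r)), defined for r < lam."""
    return math.log(lam / (lam - r))

h = 1e-4
# central first difference approximates gamma'(0) = E[X] = 1/lam
mean_est = (gamma_exp(h) - gamma_exp(-h)) / (2 * h)
# central second difference approximates gamma''(0) = Var[X] = 1/lam^2
var_est = (gamma_exp(h) - 2 * gamma_exp(0.0) + gamma_exp(-h)) / h**2
```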
01:12:52.280 --> 01:12:55.610
It doesn't make any difference
for what we're interested in.
01:12:55.610 --> 01:12:59.550
The thing that we're interested
in is that this
01:12:59.550 --> 01:13:00.810
exponent here--
01:13:03.520 --> 01:13:07.490
as you visualize doing this
experiment and taking
01:13:07.490 --> 01:13:15.000
additional observations, what
happens is the probability
01:13:15.000 --> 01:13:19.480
that you exceed na--
01:13:19.480 --> 01:13:25.310
that the n-th sum exceeds n
times some fixed quantity a is
01:13:25.310 --> 01:13:26.950
going down exponentially
with n.
01:13:29.450 --> 01:13:32.100
Now, is this bound any good?
01:13:35.150 --> 01:13:39.970
Well, if you optimize it over
r, it's essentially
01:13:39.970 --> 01:13:41.730
exponentially tight.
01:13:41.730 --> 01:13:45.210
So, in fact, it is good.
01:13:45.210 --> 01:13:48.030
What does it mean to be
exponentially tight?
01:13:48.030 --> 01:13:50.680
That's what I don't want
to define carefully.
01:13:50.680 --> 01:13:53.540
There's a theorem in the notes
that says what exponentially
01:13:53.540 --> 01:13:54.850
tight means.
01:13:54.850 --> 01:13:58.250
And it takes you half an hour to
read it because it's being
01:13:58.250 --> 01:14:00.110
stated very carefully.
01:14:00.110 --> 01:14:08.430
What it says essentially is that
if I take this quantity
01:14:08.430 --> 01:14:14.510
here and I subtract--
01:14:14.510 --> 01:14:16.710
I add an epsilon to it.
01:14:16.710 --> 01:14:22.510
Namely, e to the n times this
quantity minus epsilon.
01:14:22.510 --> 01:14:25.600
So I have an e to the
minus n epsilon, see
01:14:25.600 --> 01:14:26.720
it sitting in there?
01:14:26.720 --> 01:14:31.170
When I take this exponent and I
reduce it just a little bit,
01:14:31.170 --> 01:14:33.120
I get a bound that isn't true.
01:14:33.120 --> 01:14:35.850
This is greater than
or equal to the
01:14:35.850 --> 01:14:37.860
quantity with an epsilon.
01:14:37.860 --> 01:14:40.490
In other words, you can't make
an exponent that's any
01:14:40.490 --> 01:14:42.350
smaller than this.
01:14:42.350 --> 01:14:45.690
You can take coefficients and
play with them, but you can't
01:14:45.690 --> 01:14:48.750
make the exponent any smaller.
01:14:48.750 --> 01:14:55.490
OK, all of these things you
can do them by pictures.
01:14:55.490 --> 01:14:58.560
I know many of you don't like
doing things by pictures.
01:14:58.560 --> 01:15:02.060
I keep doing them by pictures
because I keep trying to
01:15:02.060 --> 01:15:05.820
convince you that pictures
are more rigorous
01:15:05.820 --> 01:15:07.750
than equations are.
01:15:07.750 --> 01:15:10.850
At least, many times.
01:15:10.850 --> 01:15:13.690
If you want to show that
something is convex, you try
01:15:13.690 --> 01:15:17.800
to show that the second
derivative is positive.
01:15:17.800 --> 01:15:20.640
That works sometimes and it
doesn't work sometimes.
01:15:20.640 --> 01:15:23.430
I mean, it works as a function
is continuous and has a
01:15:23.430 --> 01:15:25.450
continuous first derivative.
01:15:25.450 --> 01:15:27.850
It doesn't work otherwise.
01:15:27.850 --> 01:15:33.280
When you start taking tangents
of the curve, and you say the
01:15:33.280 --> 01:15:40.560
upper envelope of the tangents
to the curve all lie below the
01:15:40.560 --> 01:15:42.640
function, then it
works perfectly.
01:15:42.640 --> 01:15:44.800
That's what a convex function
is by definition.
01:15:48.210 --> 01:15:49.810
How do we derive
all this stuff?
01:15:52.350 --> 01:15:56.320
What we're trying to
do is to find--
01:15:56.320 --> 01:16:04.990
I mean, this inequality here is
true for all r, for all r
01:16:04.990 --> 01:16:09.710
greater than 0 so long as a is
greater than the mean of X.
01:16:09.710 --> 01:16:12.970
It's true for all r for which
this moment-generating
01:16:12.970 --> 01:16:15.020
function exists.
01:16:15.020 --> 01:16:18.400
Moment-generating functions can
sometimes blow up, so they
01:16:18.400 --> 01:16:21.060
don't exist everywhere.
01:16:21.060 --> 01:16:22.270
So it's true wherever the
01:16:22.270 --> 01:16:25.140
moment-generating function exists.
01:16:25.140 --> 01:16:29.890
So we like to find the r for
which this bound is tightest.
01:16:29.890 --> 01:16:33.240
So what I'm going to do is draw
a picture and show you
01:16:33.240 --> 01:16:37.160
where it's tightest in
terms of the picture.
01:16:37.160 --> 01:16:40.380
What I've drawn here is
the semi-invariant
01:16:40.380 --> 01:16:43.240
moment-generating function.
01:16:43.240 --> 01:16:47.580
Why didn't I put that down?
01:16:47.580 --> 01:16:51.130
This is gamma of r.
01:16:51.130 --> 01:16:55.630
Gamma of r at 0, it's the log
of the moment-generating
01:16:55.630 --> 01:16:58.500
function at 0, which is 0.
01:17:01.090 --> 01:17:03.000
It's convex.
01:17:03.000 --> 01:17:06.190
You take its second
derivative.
01:17:06.190 --> 01:17:09.180
Its second derivative at r
equals 0 is pretty easy.
01:17:09.180 --> 01:17:12.250
Its second derivative of other
values or r you have to
01:17:12.250 --> 01:17:13.500
struggle with it.
01:17:16.010 --> 01:17:20.200
But when you struggle a little
bit, it is convex.
01:17:20.200 --> 01:17:23.770
If you've got a curve that goes
down like this, then it
01:17:23.770 --> 01:17:25.800
goes back up again.
01:17:25.800 --> 01:17:28.100
Sometimes goes off
towards infinity.
01:17:28.100 --> 01:17:30.950
Might do whatever
it wants to do.
01:17:30.950 --> 01:17:34.790
Sometimes at a certain value
of r, it stops existing.
01:17:34.790 --> 01:17:37.750
Suppose I take the
simplest random
01:17:37.750 --> 01:17:39.800
variable you know about.
01:17:39.800 --> 01:17:43.440
You only know two simple
random variables.
01:17:43.440 --> 01:17:46.030
One of them is a binary
random variable.
01:17:46.030 --> 01:17:49.420
The other one's an exponential
random variable.
01:17:49.420 --> 01:17:54.330
Suppose I take the exponential
random variable with density
01:17:54.330 --> 01:17:59.020
alpha times e to the minus
alpha X. Where does this
01:17:59.020 --> 01:18:02.580
moment-generating
function exist?
01:18:02.580 --> 01:18:16.220
You take alpha and I multiply
it by e to the rX when I
01:18:16.220 --> 01:18:17.470
integrate it.
01:18:20.940 --> 01:18:22.190
Where does this exist?
01:18:25.100 --> 01:18:27.110
I mean, don't bother
to integrate it.
01:18:31.910 --> 01:18:36.630
If r is bigger than alpha, this
exponent is bigger than
01:18:36.630 --> 01:18:37.890
this exponent.
01:18:37.890 --> 01:18:40.110
And this thing takes off
towards infinity.
01:18:40.110 --> 01:18:43.715
If r is less than alpha, the
whole thing goes to 0.
01:18:49.220 --> 01:18:59.020
gX of r exists for r less
than alpha in this case.
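In symbols, integrating alpha e^{-alpha x} against e^{rx} converges exactly when r < alpha and gives alpha/(alpha - r). A crude numerical integration (my own check, with arbitrary parameter values) agrees with the closed form:

```python
import math

def mgf_exp_closed(r, a=1.5):
    """MGF of the density a*e^{-a x}, x >= 0.  The integral converges
    only for r < a, where it equals a / (a - r)."""
    assert r < a, "MGF blows up for r >= a"
    return a / (a - r)

def mgf_exp_numeric(r, a=1.5, T=100.0, n=200000):
    """Crude trapezoidal integral of a*e^{-a x} * e^{r x} on [0, T]."""
    h = T / n
    f = lambda x: a * math.exp((r - a) * x)
    s = 0.5 * (f(0.0) + f(T)) + sum(f(i * h) for i in range(1, n))
    return s * h
```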
01:19:01.620 --> 01:19:06.290
And in general, if you look at
a moment-generating function,
01:19:06.290 --> 01:19:11.140
if the tail of that distribution
function is going
01:19:11.140 --> 01:19:14.880
to 0 exponentially, you find the
rate at which it's going
01:19:14.880 --> 01:19:16.980
to 0 exponentially.
01:19:16.980 --> 01:19:18.650
And that's where the
moment-generating
01:19:18.650 --> 01:19:21.810
function cuts off.
01:19:21.810 --> 01:19:23.630
It has to cut off.
01:19:23.630 --> 01:19:27.070
You can't show a result like
this, which says something is
01:19:27.070 --> 01:19:30.710
going to 0, faster than it could
possibly be going to 0.
01:19:33.600 --> 01:19:35.460
So we have to have that
kind of result.
01:19:35.460 --> 01:19:37.760
But anyway, we draw
this curve.
01:19:37.760 --> 01:19:40.350
This is gamma sub X of r.
01:19:40.350 --> 01:19:46.520
And then we say, how do we
graphically minimize gamma of
01:19:46.520 --> 01:19:50.900
r minus r times a?
01:19:50.900 --> 01:19:57.140
Well, what I do because I've
done this before and I know
01:19:57.140 --> 01:19:59.090
how to do it--
01:19:59.090 --> 01:20:01.580
I mean, it's not the kind of
thing where if you sat down
01:20:01.580 --> 01:20:05.670
you would immediately
settle on this.
01:20:05.670 --> 01:20:09.710
I look at some particular
value of r.
01:20:09.710 --> 01:20:16.370
If I take a line of slope gamma
prime of r, that's a
01:20:16.370 --> 01:20:19.920
tangent to this curve because
this curve is convex.
01:20:19.920 --> 01:20:24.380
So if I take a line through here
of this slope and I look
01:20:24.380 --> 01:20:29.210
at where this line hits here,
where does it hit?
01:20:29.210 --> 01:20:34.320
It hits at gamma sub X of
r, this point here,
01:20:34.320 --> 01:20:38.230
minus r times gamma prime of r--
01:20:41.880 --> 01:20:43.130
oh.
01:20:50.220 --> 01:20:55.100
Well, what I've done is I've
already optimized the problem.
01:20:55.100 --> 01:20:58.600
I'm trying to find the
probability that Sn is greater
01:20:58.600 --> 01:20:59.950
than or equal to na.
01:20:59.950 --> 01:21:03.870
I'm trying to minimize this
exponent here, gamma
01:21:03.870 --> 01:21:06.890
X of r minus ra.
01:21:06.890 --> 01:21:10.420
Unfortunately, I really start
out by taking the derivative
01:21:10.420 --> 01:21:13.260
of that and setting it equal to
0, which is what you would
01:21:13.260 --> 01:21:15.120
all do, too.
01:21:15.120 --> 01:21:19.330
When I set the derivative of
this equal to 0, I get gamma
01:21:19.330 --> 01:21:26.010
prime of r minus a is equal to
0, which is what this says.
01:21:26.010 --> 01:21:31.540
So then we take a line of slope
a, where gamma prime of r0 equals a.
01:21:31.540 --> 01:21:34.130
It's tangent at this
point here.
01:21:34.130 --> 01:21:37.500
You look at this point over here
and you get the minimum
01:21:37.500 --> 01:21:41.290
value of gamma X of r minus ra,
namely gamma X of r0 minus r0 a.
01:21:45.370 --> 01:21:49.920
So what this says is when you
vary a, you can go through
01:21:49.920 --> 01:21:57.440
this minimization by tilting
this tangent line around.
01:21:57.440 --> 01:22:02.180
I mean, a determines the slope
of this line here.
01:22:02.180 --> 01:22:06.930
If I use a smaller value of
a, the slope is smaller.
01:22:06.930 --> 01:22:08.470
It hits in here.
01:22:08.470 --> 01:22:14.220
If I take a larger value of a,
it comes in further down and
01:22:14.220 --> 01:22:15.500
the exponent gets bigger.
01:22:15.500 --> 01:22:16.840
That's not surprising.
01:22:16.840 --> 01:22:19.700
I want to find out the
probability that S sub n is
01:22:19.700 --> 01:22:21.670
greater than or equal to a.
01:22:21.670 --> 01:22:26.350
As I increase a, I expect this
exponent to keep going down as
01:22:26.350 --> 01:22:29.390
I make a bigger and bigger
because it's harder and harder
01:22:29.390 --> 01:22:33.700
for it to be greater
than or equal to a.
01:22:33.700 --> 01:22:37.570
So anyway, when you optimize
this, you get something
01:22:37.570 --> 01:22:39.720
exponentially tight.
01:22:39.720 --> 01:22:42.540
And this is what
it's equal to.
01:22:42.540 --> 01:22:46.560
And I would recommend that you
go back and read the section
01:22:46.560 --> 01:22:50.100
of chapter 1, which goes through
all of this in a
01:22:50.100 --> 01:22:51.350
little more detail.
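[The tangent argument can be checked numerically. Here is a sketch of my own, not the lecture's code, for the exponential density alpha e^{-alpha X} from earlier: the semi-invariant MGF is gamma(r) = ln(alpha/(alpha - r)), and a brute-force grid minimization of gamma(r) - r*a lands on the tangent-condition solution gamma prime of r0 = 1/(alpha - r0) = a.]

```python
import math

alpha = 2.0     # rate of the Exp(alpha) variable; its mean is 1/alpha
a = 1.5         # per-step threshold, chosen larger than the mean 1/alpha

def gamma(r):
    # semi-invariant moment-generating function, defined only for r < alpha
    return math.log(alpha / (alpha - r))

# Brute-force grid minimization of gamma(r) - r*a over 0 < r < alpha.
grid = (i * alpha / 100_000 for i in range(1, 100_000))
best_r, best_val = min(((r, gamma(r) - r * a) for r in grid), key=lambda t: t[1])

# Tangent condition gamma'(r0) = 1/(alpha - r0) = a  gives  r0 = alpha - 1/a.
r0 = alpha - 1 / a
val0 = gamma(r0) - r0 * a
print(best_r, r0)        # grid optimum vs. tangent-condition solution
print(best_val, val0)    # the optimized exponent, negative since a > 1/alpha
```

[Increasing a steepens the tangent line and drives the optimized exponent further negative, which is the "harder and harder to exceed a" behavior described above.]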
01:22:56.820 --> 01:23:00.600
Let me go past that.
01:23:00.600 --> 01:23:03.800
Don't want to talk about that.
01:23:03.800 --> 01:23:09.640
Well, when I do this
optimization, if what I'm
01:23:09.640 --> 01:23:13.210
looking at is the probability
that S sub n is greater than
01:23:13.210 --> 01:23:17.000
or equal to some alpha rather
than n times a when I do
01:23:17.000 --> 01:23:20.120
this optimization and I'm
looking at what happens at
01:23:20.120 --> 01:23:24.170
different values of n, it turns
out that when n is very
01:23:24.170 --> 01:23:31.770
big, you get something which
is tangent there.
01:23:31.770 --> 01:23:36.340
As n gets smaller, you get these
tangents that come down
01:23:36.340 --> 01:23:38.620
that comes in to there,
and then it starts
01:23:38.620 --> 01:23:40.330
going back out again.
01:23:40.330 --> 01:23:47.800
This e to the minus alpha r star is
the tightest the bound ever gets.
01:23:47.800 --> 01:23:54.390
That's the n at which errors
in the hypothesis testing
01:23:54.390 --> 01:23:57.400
usually occur.
01:23:57.400 --> 01:23:59.120
It's the point at which--
01:23:59.120 --> 01:24:02.740
it's the n for which Sn greater
than or equal to alpha
01:24:02.740 --> 01:24:06.240
is most likely to occur.
01:24:06.240 --> 01:24:12.500
And if you evaluate that for
our friendly binary case
01:24:12.500 --> 01:24:18.290
again, X equals 1 or X equals
minus 1, what you find when
01:24:18.290 --> 01:24:25.060
you evaluate that point alpha r
star is that r star is equal
01:24:25.060 --> 01:24:29.595
to log 1 minus P over P. And our
bound on the probability of the union
01:24:29.595 --> 01:24:33.830
of the events Sn greater than or equal
to alpha is approximately e to
01:24:33.830 --> 01:24:38.030
the minus alpha r star, which
is 1 minus P over P
01:24:38.030 --> 01:24:40.690
to the minus alpha.
01:24:40.690 --> 01:24:43.600
I mean, why do I torture
you with this?
01:24:43.600 --> 01:24:46.570
Because we solved this problem
at the beginning of the
01:24:46.570 --> 01:24:47.760
lecture, remember?
01:24:47.760 --> 01:24:54.540
The probability that the sum
S sub n for this binary
01:24:54.540 --> 01:24:59.710
experiment is greater than or
equal to k is equal to 1 minus
01:24:59.710 --> 01:25:02.130
P over P to the minus k.
01:25:02.130 --> 01:25:04.640
That's what it's equal
to exactly.
01:25:04.640 --> 01:25:09.960
When I go through all of this
Chernoff bound stuff, I get
01:25:09.960 --> 01:25:11.790
the same answer.
01:25:11.790 --> 01:25:14.950
Now, this is a much harder way
to do it, but this is a
01:25:14.950 --> 01:25:16.240
general way of doing it.
01:25:16.240 --> 01:25:18.340
And that's a very specialized
way of doing it.
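[A small simulation, my own check rather than anything from the lecture, for the binary case with p = 0.4: the exact crossing probability (p/(1-p))^k, the Chernoff form e^{-k r*} with r* = log((1-p)/p), and a Monte Carlo estimate of the chance the walk ever reaches k all agree.]

```python
import math
import random

p, k = 0.4, 3     # P(X = 1) = p < 1/2, so the walk drifts down; threshold k

exact = (p / (1 - p)) ** k        # ((1-p)/p)^(-k): exact crossing probability
r_star = math.log((1 - p) / p)    # positive root of gamma(r) = 0 for this walk
chernoff = math.exp(-k * r_star)  # e^{-k r*}: the optimized exponential bound

# Monte Carlo: fraction of walks that ever reach k within a long horizon.
# With negative drift, crossings after the horizon are vanishingly rare.
rng = random.Random(1)
trials, horizon, hits = 20_000, 200, 0
for _ in range(trials):
    s = 0
    for _ in range(horizon):
        s += 1 if rng.random() < p else -1
        if s >= k:
            hits += 1
            break
print(exact, chernoff, hits / trials)  # exact and chernoff coincide; the
                                       # simulated fraction lands near them
```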
01:25:18.340 --> 01:25:20.250
So we'll talk more about
this next time.