WEBVTT
00:00:00.040 --> 00:00:02.460
The following content is
provided under a Creative
00:00:02.460 --> 00:00:03.870
Commons license.
00:00:03.870 --> 00:00:06.910
Your support will help MIT
OpenCourseWare continue to
00:00:06.910 --> 00:00:10.560
offer high quality educational
resources for free.
00:00:10.560 --> 00:00:13.460
To make a donation or view
additional materials from
00:00:13.460 --> 00:00:19.290
hundreds of MIT courses, visit
MIT OpenCourseWare at
00:00:19.290 --> 00:00:21.708
ocw.mit.edu.
00:00:21.708 --> 00:00:25.380
PROFESSOR: It involves real
phenomena out there.
00:00:25.380 --> 00:00:28.960
So we have real stuff
that happens.
00:00:28.960 --> 00:00:33.630
So it might be an arrival
process to a bank that we're
00:00:33.630 --> 00:00:35.790
trying to model.
00:00:35.790 --> 00:00:38.230
This is a reality, but
this is what we have
00:00:38.230 --> 00:00:39.660
been doing so far.
00:00:39.660 --> 00:00:41.910
We have been playing
with models of
00:00:41.910 --> 00:00:43.770
probabilistic phenomena.
00:00:43.770 --> 00:00:46.730
And somehow we need to
tie the two together.
00:00:46.730 --> 00:00:50.930
The way these are tied is that
we observe the real world and
00:00:50.930 --> 00:00:53.530
this gives us data.
00:00:53.530 --> 00:00:58.590
And then based on these data, we
try to come up with a model
00:00:58.590 --> 00:01:01.930
of what exactly is going on.
00:01:01.930 --> 00:01:05.290
For example, for an arrival
process, you might ask the
00:01:05.290 --> 00:01:08.680
modeling question: is my arrival
process Poisson or is
00:01:08.680 --> 00:01:10.300
it something different?
00:01:10.300 --> 00:01:14.630
If it is Poisson, what is the
rate of the arrival process?
00:01:14.630 --> 00:01:17.460
Once you come up with your model
and you come up with the
00:01:17.460 --> 00:01:21.710
parameters of the model, then
you can use it to make
00:01:21.710 --> 00:01:27.520
predictions about reality or to
figure out certain hidden
00:01:27.520 --> 00:01:31.890
things, certain hidden aspects
of reality, that you do not
00:01:31.890 --> 00:01:35.560
observe directly, but you try
to infer what they are.
00:01:35.560 --> 00:01:38.900
So that's where the usefulness
of the model comes in.
00:01:38.900 --> 00:01:43.330
Now this field is of course
tremendously useful.
00:01:43.330 --> 00:01:46.650
And it shows up pretty
much everywhere.
00:01:46.650 --> 00:01:50.000
So we talked about the polling
examples in the
00:01:50.000 --> 00:01:51.280
last couple of lectures.
00:01:51.280 --> 00:01:53.520
This is, of course, a
real application.
00:01:53.520 --> 00:01:57.525
You sample and on the basis of
the sample that you have, you
00:01:57.525 --> 00:02:00.400
try to make some inferences
about, let's say, the
00:02:00.400 --> 00:02:03.060
preferences in a given
population.
00:02:03.060 --> 00:02:06.230
Let's say in the medical field,
you want to try whether
00:02:06.230 --> 00:02:08.919
a certain drug makes a
difference or not.
00:02:08.919 --> 00:02:14.380
So people would do medical
trials, get some results, and
00:02:14.380 --> 00:02:17.640
then from the data somehow you
need to make sense of them and
00:02:17.640 --> 00:02:18.530
make a decision.
00:02:18.530 --> 00:02:21.360
Is the new drug useful
or is it not?
00:02:21.360 --> 00:02:23.460
How do we go systematically
about
00:02:23.460 --> 00:02:24.710
questions of this type?
00:02:27.770 --> 00:02:32.170
A sexier, more recent topic,
there's this famous Netflix
00:02:32.170 --> 00:02:37.510
competition where Netflix gives
you a huge table of
00:02:37.510 --> 00:02:41.450
movies and people.
00:02:41.450 --> 00:02:45.860
And people have rated the
movies, but not everyone has
00:02:45.860 --> 00:02:47.850
watched all of the
movies in there.
00:02:47.850 --> 00:02:49.460
You have some of the ratings.
00:02:49.460 --> 00:02:53.250
For example, this person gave a
4 to that particular movie.
00:02:53.250 --> 00:02:56.300
So you get the table that's
partially filled.
00:02:56.300 --> 00:02:58.300
And Netflix asks
you to make
00:02:58.300 --> 00:02:59.860
recommendations to people.
00:02:59.860 --> 00:03:02.410
So this means trying to guess.
00:03:02.410 --> 00:03:06.100
This person here, how much
would they like this
00:03:06.100 --> 00:03:07.610
particular movie?
00:03:07.610 --> 00:03:11.130
And you can start thinking,
well, maybe this person has
00:03:11.130 --> 00:03:14.860
given ratings somewhat similar
to those of another person.
00:03:14.860 --> 00:03:18.440
And if that other person has
also seen that movie, maybe
00:03:18.440 --> 00:03:21.290
the rating of that other
person is relevant.
00:03:21.290 --> 00:03:24.230
But of course it's a lot more
complicated than that.
00:03:24.230 --> 00:03:26.650
And this has been a serious
competition where people have
00:03:26.650 --> 00:03:30.230
been using every heavyweight
machinery that there is in
00:03:30.230 --> 00:03:32.540
statistics, trying to
come up with good
00:03:32.540 --> 00:03:35.140
recommendation systems.
00:03:35.140 --> 00:03:37.870
Then other people, of
course, are trying to analyze
00:03:37.870 --> 00:03:39.010
financial data.
00:03:39.010 --> 00:03:43.680
Somebody gives you the sequence
of the values, let's
00:03:43.680 --> 00:03:45.840
say of the S&P index.
00:03:45.840 --> 00:03:47.850
You look at something like this
00:03:47.850 --> 00:03:49.770
and you can ask questions.
00:03:49.770 --> 00:03:55.030
How do I model these data using
any of the models that
00:03:55.030 --> 00:03:57.060
we have in our bag of tools?
00:03:57.060 --> 00:04:00.230
How can I make predictions about
what's going to happen
00:04:00.230 --> 00:04:03.310
afterwards, and so on?
00:04:03.310 --> 00:04:09.700
On the engineering side,
anywhere you have noise,
00:04:09.700 --> 00:04:11.590
inference comes in.
00:04:11.590 --> 00:04:13.810
Signal processing, in
some sense, is just
00:04:13.810 --> 00:04:14.960
an inference problem.
00:04:14.960 --> 00:04:18.730
You observe signals that are
noisy and you try to figure
00:04:18.730 --> 00:04:21.750
out exactly what's happening
out there or what kind of
00:04:21.750 --> 00:04:24.130
signal has been sent.
00:04:24.130 --> 00:04:28.830
Maybe the beginning of the field
could be traced a few
00:04:28.830 --> 00:04:32.060
hundred years ago where people
would observe, make
00:04:32.060 --> 00:04:35.420
astronomical observations
of the position of the
00:04:35.420 --> 00:04:37.550
planets in the sky.
00:04:37.550 --> 00:04:41.130
They would have some beliefs
that perhaps the orbits of
00:04:41.130 --> 00:04:44.070
planets are ellipses.
00:04:44.070 --> 00:04:47.840
Or if it's a comet, maybe it's
a parabola, hyperbola, don't
00:04:47.840 --> 00:04:48.640
know what it is.
00:04:48.640 --> 00:04:51.320
But they would have
a model of that.
00:04:51.320 --> 00:04:53.840
But, of course, astronomical
measurements would not be
00:04:53.840 --> 00:04:55.300
perfectly exact.
00:04:55.300 --> 00:05:00.690
And they would try to find the
curve that fits these data.
00:05:00.690 --> 00:05:05.580
How do you go about choosing
this particular curve on the
00:05:05.580 --> 00:05:07.960
basis of noisy data and
try to do it in a
00:05:07.960 --> 00:05:11.274
somewhat principled way?
00:05:11.274 --> 00:05:13.890
OK, so questions of this
type-- clearly the
00:05:13.890 --> 00:05:17.100
applications are all
over the place.
00:05:17.100 --> 00:05:20.830
But how is this related
conceptually with what we have
00:05:20.830 --> 00:05:22.480
been doing so far?
00:05:22.480 --> 00:05:25.960
What's the relation between the
field of inference and the
00:05:25.960 --> 00:05:28.130
field of probability
as we have been
00:05:28.130 --> 00:05:30.650
practicing until now?
00:05:30.650 --> 00:05:33.620
Well, mathematically speaking,
what's going to happen in the
00:05:33.620 --> 00:05:38.780
next few lectures could be just
exercises or homework
00:05:38.780 --> 00:05:44.880
problems in the class, based
on what we have done so far.
00:05:44.880 --> 00:05:48.560
That means you're not going
to get any new facts about
00:05:48.560 --> 00:05:50.200
probability theory.
00:05:50.200 --> 00:05:53.930
Everything we're going to do
will be simple applications of
00:05:53.930 --> 00:05:57.110
things that you already
do know.
00:05:57.110 --> 00:06:00.140
So in some sense, statistics
and inference is just an
00:06:00.140 --> 00:06:02.780
applied exercise
in probability.
00:06:02.780 --> 00:06:08.310
But actually, things are
not that simple in
00:06:08.310 --> 00:06:09.550
the following sense.
00:06:09.550 --> 00:06:12.510
If you get a probability
problem,
00:06:12.510 --> 00:06:14.040
there's a correct answer.
00:06:14.040 --> 00:06:15.450
There's a correct solution.
00:06:15.450 --> 00:06:18.170
And that correct solution
is unique.
00:06:18.170 --> 00:06:20.550
There's no ambiguity.
00:06:20.550 --> 00:06:23.380
The theory of probability has
clearly defined rules.
00:06:23.380 --> 00:06:24.570
These are the axioms.
00:06:24.570 --> 00:06:27.550
You're given some information
about probability
00:06:27.550 --> 00:06:28.280
distributions.
00:06:28.280 --> 00:06:31.000
You're asked to calculate
certain other things.
00:06:31.000 --> 00:06:32.190
There's no ambiguity.
00:06:32.190 --> 00:06:34.230
Answers are always unique.
00:06:34.230 --> 00:06:39.180
In statistical questions, it's
no longer the case that the
00:06:39.180 --> 00:06:41.420
question has a unique answer.
00:06:41.420 --> 00:06:44.990
If I give you data and I ask
you what's the best way of
00:06:44.990 --> 00:06:49.710
estimating the motion of that
planet, reasonable people can
00:06:49.710 --> 00:06:53.370
come up with different
methods.
00:06:53.370 --> 00:06:56.790
And reasonable people will try
to argue that my method has
00:06:56.790 --> 00:07:00.140
these desirable properties but
somebody else may say, here's
00:07:00.140 --> 00:07:03.740
another method that has certain
desirable properties.
00:07:03.740 --> 00:07:08.220
And it's not clear what
the best method is.
00:07:08.220 --> 00:07:11.330
So it's good to have some
understanding of what the
00:07:11.330 --> 00:07:16.910
issues are and to know at least
what is the general
00:07:16.910 --> 00:07:20.150
class of methods that one tries
to consider, how does
00:07:20.150 --> 00:07:22.380
one go about such problems.
00:07:22.380 --> 00:07:24.360
So we're going to see
lots and lots of
00:07:24.360 --> 00:07:25.880
different inference methods.
00:07:25.880 --> 00:07:27.350
We're not going to tell
you that one is
00:07:27.350 --> 00:07:28.730
better than the other.
00:07:28.730 --> 00:07:30.940
But it's important to understand
what are the
00:07:30.940 --> 00:07:33.980
concepts behind those
different methods.
00:07:33.980 --> 00:07:38.710
And finally, statistics can
be misused really badly.
00:07:38.710 --> 00:07:41.870
That is, one can come up with
methods that you think are
00:07:41.870 --> 00:07:48.650
sound, but in fact they're
not quite that.
00:07:48.650 --> 00:07:52.830
I will bring some examples next
time and talk a little
00:07:52.830 --> 00:07:54.290
more about this.
00:07:54.290 --> 00:07:58.540
So, let's say you have
some data, you want to make
00:07:58.540 --> 00:08:02.590
some inference from them, what
many people will do is to go
00:08:02.590 --> 00:08:06.340
to Wikipedia, find a statistical
test that they
00:08:06.340 --> 00:08:08.990
think applies to that
situation, plug in numbers,
00:08:08.990 --> 00:08:10.880
and present results.
00:08:10.880 --> 00:08:14.220
Are the conclusions that they
get really justified or are
00:08:14.220 --> 00:08:16.400
they misusing statistical
methods?
00:08:16.400 --> 00:08:20.520
Well, too many people actually
do misuse statistics and
00:08:20.520 --> 00:08:24.530
conclusions that people
get are often false.
00:08:24.530 --> 00:08:29.840
So it's important to, besides
just being able to copy
00:08:29.840 --> 00:08:32.600
statistical tests and use them,
to understand what are
00:08:32.600 --> 00:08:35.860
the assumptions behind the
different methods and what
00:08:35.860 --> 00:08:40.559
kind of guarantees they
have, if any.
00:08:40.559 --> 00:08:44.420
All right, so we'll try to do a
quick tour through the field
00:08:44.420 --> 00:08:47.600
of inference in this lecture and
the next few lectures that
00:08:47.600 --> 00:08:51.700
we have left this semester and
try to highlight at the very
00:08:51.700 --> 00:08:53.940
high level the main concepts,
skills, and
00:08:53.940 --> 00:08:56.990
techniques that come in.
00:08:56.990 --> 00:08:59.840
Let's start with some
generalities and some general
00:08:59.840 --> 00:09:01.090
statements.
00:09:03.090 --> 00:09:07.090
One first statement is that
statistics or inference
00:09:07.090 --> 00:09:11.800
problems come up in very
different guises.
00:09:11.800 --> 00:09:16.490
And they may look as if they are
of very different forms.
00:09:16.490 --> 00:09:20.190
Although, at some fundamental
level, the basic issues turn
00:09:20.190 --> 00:09:23.320
out to be always pretty
much the same.
00:09:23.320 --> 00:09:27.880
So let's look at this example.
00:09:27.880 --> 00:09:31.420
There's an unknown signal
that's being sent.
00:09:31.420 --> 00:09:35.840
It's sent through some medium,
and that medium just takes the
00:09:35.840 --> 00:09:39.180
signal and multiplies it
by a certain number.
00:09:39.180 --> 00:09:41.340
So you can think of
somebody shouting.
00:09:41.340 --> 00:09:42.920
There's the air out there.
00:09:42.920 --> 00:09:46.420
What you shouted will be
attenuated through the air
00:09:46.420 --> 00:09:48.040
until it gets to a receiver.
00:09:48.040 --> 00:09:51.730
And that receiver then observes
this, but together
00:09:51.730 --> 00:09:53.110
with some random noise.
00:09:56.040 --> 00:10:00.390
Here I meant S. S is the signal
that's being sent.
00:10:00.390 --> 00:10:06.280
And what you observe is an X.
00:10:06.280 --> 00:10:09.240
You observe X, so what kind
of inference problems
00:10:09.240 --> 00:10:11.240
could we have here?
00:10:11.240 --> 00:10:15.400
In some cases, you want to build
a model of the physical
00:10:15.400 --> 00:10:17.450
phenomenon that you're
dealing with.
00:10:17.450 --> 00:10:21.180
So for example, you don't know
the attenuation of your signal
00:10:21.180 --> 00:10:25.190
and you try to find out what
this number is based on the
00:10:25.190 --> 00:10:26.980
observations that you have.
00:10:26.980 --> 00:10:30.240
So the way this is done in
engineering systems is that
00:10:30.240 --> 00:10:35.020
you design a certain signal, you
know what it is, you shout
00:10:35.020 --> 00:10:39.560
a particular word, and then
the receiver listens.
00:10:39.560 --> 00:10:43.460
And based on the intensity of
the signal that they get, they
00:10:43.460 --> 00:10:48.380
try to make a guess about A. So
you don't know A, but you
00:10:48.380 --> 00:10:52.460
know S. And by observing X,
you get some information
00:10:52.460 --> 00:10:54.270
about what A is.
00:10:54.270 --> 00:10:57.810
So in this case, you're trying
to build a model of the medium
00:10:57.810 --> 00:11:01.170
through which your signal
is propagating.
00:11:01.170 --> 00:11:04.600
So sometimes one would call
problems of this kind, let's
00:11:04.600 --> 00:11:07.990
say, system identification.
00:11:07.990 --> 00:11:11.980
In a different version of an
inference problem that comes
00:11:11.980 --> 00:11:15.300
with this picture, you've
done your modeling.
00:11:15.300 --> 00:11:18.160
You know your A. You know the
medium through which the
00:11:18.160 --> 00:11:22.330
signal is going, but it's
a communication system.
00:11:22.330 --> 00:11:24.190
This person is trying
to communicate
00:11:24.190 --> 00:11:26.140
something to that person.
00:11:26.140 --> 00:11:30.250
So you send the signal S, but
that person receives a noisy
00:11:30.250 --> 00:11:35.430
version of S. So that person
tries to reconstruct S based
00:11:35.430 --> 00:11:36.930
on X.
00:11:36.930 --> 00:11:42.210
So in both cases, we have a
linear relation between X and
00:11:42.210 --> 00:11:43.490
the unknown quantity.
00:11:43.490 --> 00:11:47.360
In one version, A is the unknown
and we know S. In the
00:11:47.360 --> 00:11:51.670
other version, A is known,
and so we try to infer S.
00:11:51.670 --> 00:11:54.300
Mathematically, you can see that
this is essentially the
00:11:54.300 --> 00:11:57.060
same kind of problem
in both cases.
00:11:57.060 --> 00:12:03.590
Although, the kind of practical
problem that you're
00:12:03.590 --> 00:12:07.580
trying to solve is a
little different.
00:12:07.580 --> 00:12:11.880
So we will not be making any
distinctions between problems
00:12:11.880 --> 00:12:15.940
of the model building type as
opposed to problems where you
00:12:15.940 --> 00:12:19.260
try to estimate some unknown
signal and so on.
00:12:19.260 --> 00:12:22.400
Because conceptually, the tools
that one uses for both
00:12:22.400 --> 00:12:26.850
types of problems are
essentially the same.
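As a rough sketch of the two versions of this problem, suppose the picture corresponds to X = a*S + W, where W is zero-mean Gaussian noise. The Gaussian noise, the particular numbers, and the function names below are assumptions made for illustration, not something specified in the lecture. With a known pilot signal S, we guess a by least squares; with a known a, we guess S by undoing the scaling.

```python
import random


def observe(a, s, noise_std, rng):
    """One noisy observation X = a*S + W, with W zero-mean Gaussian (assumed)."""
    return a * s + rng.gauss(0.0, noise_std)


def estimate_gain(pilot, observations):
    """Model-building version: S is known, a is unknown.

    Least-squares guess of a from pairs (S_i, X_i).
    """
    num = sum(s * x for s, x in zip(pilot, observations))
    den = sum(s * s for s in pilot)
    return num / den


def estimate_signal(a, x):
    """Signal-recovery version: a is known, S is unknown.

    A simple guess that just undoes the scaling and ignores the noise.
    """
    return x / a


rng = random.Random(0)            # fixed seed so the run is repeatable
true_a = 0.5                      # hypothetical attenuation of the medium
pilot = [1.0, -1.0, 2.0, -2.0] * 250
xs = [observe(true_a, s, 0.1, rng) for s in pilot]

a_hat = estimate_gain(pilot, xs)                                # guess of a
s_hat = estimate_signal(a_hat, observe(true_a, 3.0, 0.1, rng))  # guess of S
print(a_hat, s_hat)
```

Both guesses are functions of the noisy observations, so both are themselves random; neither is claimed here to be the best possible method.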
00:12:26.850 --> 00:12:30.430
OK, next a very useful
classification
00:12:30.430 --> 00:12:31.680
of inference problems--
00:12:34.170 --> 00:12:37.760
the unknown quantity that you're
trying to estimate
00:12:37.760 --> 00:12:40.770
could be either a discrete
one that takes a
00:12:40.770 --> 00:12:43.040
small number of values.
00:12:43.040 --> 00:12:45.605
So this could be discrete
problems, such as the airplane
00:12:45.605 --> 00:12:48.080
radar problem we encountered
back a long
00:12:48.080 --> 00:12:50.120
time ago in this class.
00:12:50.120 --> 00:12:52.120
So there's two possibilities--
00:12:52.120 --> 00:12:55.450
an airplane is out there or an
airplane is not out there.
00:12:55.450 --> 00:12:57.050
And you're trying to
make a decision
00:12:57.050 --> 00:12:58.940
between these two options.
00:12:58.940 --> 00:13:01.570
Or you can have other problems
where you have, let's say,
00:13:01.570 --> 00:13:03.380
four possible options.
00:13:03.380 --> 00:13:05.970
You don't know which one is
true, but you get data and you
00:13:05.970 --> 00:13:09.040
try to figure out which
one is true.
00:13:09.040 --> 00:13:12.050
In problems of this kind,
usually you want to make a
00:13:12.050 --> 00:13:14.050
decision based on your data.
00:13:14.050 --> 00:13:17.000
And you're interested in the
probability of making a
00:13:17.000 --> 00:13:18.040
correct decision.
00:13:18.040 --> 00:13:19.430
You would like that
probability to
00:13:19.430 --> 00:13:21.830
be as high as possible.
00:13:21.830 --> 00:13:24.000
Estimation problems are
a little different.
00:13:24.000 --> 00:13:28.540
Here you have some continuous
quantity that's not known.
00:13:28.540 --> 00:13:31.860
And you try to make a good
guess of that quantity.
00:13:31.860 --> 00:13:36.050
And you would like your guess to
be as close as possible to
00:13:36.050 --> 00:13:37.310
the true quantity.
00:13:37.310 --> 00:13:40.270
So the polling problem
was of this type.
00:13:40.270 --> 00:13:44.720
There was an unknown fraction
f of the population that had
00:13:44.720 --> 00:13:45.870
some property.
00:13:45.870 --> 00:13:50.040
And you try to estimate f as
accurately as you can.
00:13:50.040 --> 00:13:53.420
So the distinction here is that
usually here the unknown
00:13:53.420 --> 00:13:56.440
quantity takes on a discrete
set of values.
00:13:56.440 --> 00:13:57.890
Here the unknown quantity
takes a
00:13:57.890 --> 00:14:00.030
continuous set of values.
00:14:00.030 --> 00:14:02.980
Here we're interested in the
probability of error.
00:14:02.980 --> 00:14:07.400
Here we're interested in
the size of the error.
00:14:07.400 --> 00:14:11.000
Broadly speaking, most inference
problems fall either
00:14:11.000 --> 00:14:13.940
in this category or
in that category.
00:14:13.940 --> 00:14:17.230
Although, if you want to
complicate life, you can also
00:14:17.230 --> 00:14:20.250
think of or construct problems
where both of these aspects
00:14:20.250 --> 00:14:24.410
are simultaneously present.
00:14:24.410 --> 00:14:28.530
OK, finally since we're in
classification mode, there is
00:14:28.530 --> 00:14:33.670
a very big, important dichotomy
in how one goes
00:14:33.670 --> 00:14:35.940
about inference problems.
00:14:35.940 --> 00:14:39.150
And here there's two
fundamentally different
00:14:39.150 --> 00:14:46.070
philosophical points of view,
which is how do we model the
00:14:46.070 --> 00:14:50.270
quantity that is unknown?
00:14:50.270 --> 00:14:54.530
In one approach, you say there's
a certain quantity
00:14:54.530 --> 00:14:57.590
that has a definite value.
00:14:57.590 --> 00:15:00.010
It just happens that
we don't know it.
00:15:00.010 --> 00:15:01.320
But it's a number.
00:15:01.320 --> 00:15:03.290
There's nothing random
about it.
00:15:03.290 --> 00:15:05.945
So think of trying to estimate
some physical quantity.
00:15:10.630 --> 00:15:13.350
You're making measurements, you
try to estimate the mass
00:15:13.350 --> 00:15:15.820
of an electron, which
is a sort of
00:15:15.820 --> 00:15:18.270
universal physical constant.
00:15:18.270 --> 00:15:20.320
There's nothing random
about it.
00:15:20.320 --> 00:15:22.340
It's a fixed number.
00:15:22.340 --> 00:15:29.120
You get data, because you have
some measuring apparatus.
00:15:29.120 --> 00:15:33.020
And with that measuring apparatus,
the results
00:15:33.020 --> 00:15:37.160
that you get are affected by the
true mass of the electron,
00:15:37.160 --> 00:15:39.340
but there's also some noise.
00:15:39.340 --> 00:15:42.200
You take the data out of your
measuring apparatus and you
00:15:42.200 --> 00:15:44.465
try to come up with
some estimate of
00:15:44.465 --> 00:15:47.220
that quantity theta.
00:15:47.220 --> 00:15:49.760
So this is definitely a
legitimate picture, but the
00:15:49.760 --> 00:15:52.370
important thing in this picture
is that this theta is
00:15:52.370 --> 00:15:54.570
written as lowercase.
00:15:54.570 --> 00:15:58.110
And that's to make the point
that it's a real number, not a
00:15:58.110 --> 00:16:00.900
random variable.
00:16:00.900 --> 00:16:03.230
There's a different
philosophical approach which
00:16:03.230 --> 00:16:08.180
says, well, anything that I
don't know I should model it
00:16:08.180 --> 00:16:10.190
as a random variable.
00:16:10.190 --> 00:16:11.130
Yes, I know.
00:16:11.130 --> 00:16:14.500
The mass of the electron
is not really random.
00:16:14.500 --> 00:16:15.690
It's a constant.
00:16:15.690 --> 00:16:17.920
But I don't know what it is.
00:16:17.920 --> 00:16:22.510
I have some vague sense,
perhaps, of what it is, perhaps
00:16:22.510 --> 00:16:24.290
because of the experiments
that some other
00:16:24.290 --> 00:16:25.940
people carried out.
00:16:25.940 --> 00:16:30.560
So perhaps I have a prior
distribution on the possible
00:16:30.560 --> 00:16:32.160
values of Theta.
00:16:32.160 --> 00:16:34.990
And that prior distribution
doesn't mean that nature
00:16:34.990 --> 00:16:39.320
is random, but it's more of a
subjective description of my
00:16:39.320 --> 00:16:44.570
subjective beliefs of where do
I think this constant number
00:16:44.570 --> 00:16:46.200
happens to be.
00:16:46.200 --> 00:16:50.140
So even though it's not truly
random, I model my initial
00:16:50.140 --> 00:16:52.600
beliefs before the experiment
starts
00:16:52.600 --> 00:16:55.790
in terms of a prior
distribution. I view it as a
00:16:55.790 --> 00:16:57.470
random variable.
00:16:57.470 --> 00:17:01.850
Then I observe another related
random variable through some
00:17:01.850 --> 00:17:02.930
measuring apparatus.
00:17:02.930 --> 00:17:05.920
And then I use this again
to create an estimate.
00:17:08.819 --> 00:17:12.069
So these two pictures
philosophically are very
00:17:12.069 --> 00:17:13.589
different from each other.
00:17:13.589 --> 00:17:17.130
Here we treat the unknown
quantities as unknown numbers.
00:17:17.130 --> 00:17:20.589
Here we treat them as
random variables.
00:17:20.589 --> 00:17:24.829
When we treat them as random
variables, then we know pretty
00:17:24.829 --> 00:17:27.109
much already what we
should be doing.
00:17:27.109 --> 00:17:29.470
We should just use
the Bayes rule.
00:17:29.470 --> 00:17:31.850
Based on X, find
the conditional
00:17:31.850 --> 00:17:33.670
distribution of Theta.
00:17:33.670 --> 00:17:37.520
And that's what we will be doing
mostly over this lecture
00:17:37.520 --> 00:17:40.010
and the next lecture.
00:17:40.010 --> 00:17:44.660
Now in both cases, what you end
up getting at the end is
00:17:44.660 --> 00:17:47.240
an estimate.
00:17:47.240 --> 00:17:52.120
But actually, that estimate,
what kind of object is it?
00:17:52.120 --> 00:17:55.170
It's a random variable
in both cases.
00:17:55.170 --> 00:17:56.000
Why?
00:17:56.000 --> 00:17:58.130
Even in this case where
theta was a
00:17:58.130 --> 00:18:01.060
constant, my data are random.
00:18:01.060 --> 00:18:02.860
I do my data processing.
00:18:02.860 --> 00:18:06.050
So I calculate a function
of the data, the
00:18:06.050 --> 00:18:07.580
data are random variables.
00:18:07.580 --> 00:18:11.390
So out here we output something
which is a function
00:18:11.390 --> 00:18:12.770
of a random variable.
00:18:12.770 --> 00:18:15.830
So this quantity here
will be also random.
00:18:15.830 --> 00:18:18.400
It's affected by the noise and
the experiment that I have
00:18:18.400 --> 00:18:19.650
been doing.
00:18:19.650 --> 00:18:22.330
That's why these estimators
will be denoted
00:18:22.330 --> 00:18:24.920
by uppercase Thetas.
00:18:24.920 --> 00:18:26.740
And we will be using hats.
00:18:26.740 --> 00:18:29.030
Hat, usually in estimation,
means
00:18:29.030 --> 00:18:32.990
an estimate of something.
00:18:32.990 --> 00:18:35.380
All right, so this is
the big picture.
00:18:35.380 --> 00:18:38.690
We're going to start with
the Bayesian version.
00:18:38.690 --> 00:18:42.830
And then the last few lectures
we're going to talk about the
00:18:42.830 --> 00:18:45.690
non-Bayesian version or
the classical one.
00:18:45.690 --> 00:18:48.610
By the way, I should say that
statisticians have been
00:18:48.610 --> 00:18:52.500
debating fiercely for 100 years
whether the right way to
00:18:52.500 --> 00:18:56.030
approach statistics is to go
the classical way or the
00:18:56.030 --> 00:18:57.420
Bayesian way.
00:18:57.420 --> 00:19:00.530
And there have been tides going
back and forth between
00:19:00.530 --> 00:19:02.260
the two sides.
00:19:02.260 --> 00:19:05.330
These days, Bayesian methods
tend to become a little more
00:19:05.330 --> 00:19:07.320
popular for various reasons.
00:19:07.320 --> 00:19:11.730
We're going to come back
to this later.
00:19:11.730 --> 00:19:14.610
All right, so in Bayesian
estimation, what we've got in our
00:19:14.610 --> 00:19:16.610
hands is Bayes rule.
00:19:16.610 --> 00:19:19.380
And if you have Bayes rule,
there's not a lot
00:19:19.380 --> 00:19:21.340
that's left to do.
00:19:21.340 --> 00:19:24.190
We have different forms of the
Bayes rule, depending on
00:19:24.190 --> 00:19:27.920
whether we're dealing with
discrete data and discrete
00:19:27.920 --> 00:19:32.310
quantities to estimate, or
continuous data, and so on.
00:19:32.310 --> 00:19:36.020
In the hypothesis testing
problem, the unknown quantity
00:19:36.020 --> 00:19:38.210
Theta is discrete.
00:19:38.210 --> 00:19:42.890
So in both cases here,
we have a P of Theta.
00:19:42.890 --> 00:19:45.530
We obtain data, the X's.
00:19:45.530 --> 00:19:49.040
And on the basis of the X that
we observe, we can calculate
00:19:49.040 --> 00:19:53.340
the posterior distribution
of Theta, given the data.
00:19:53.340 --> 00:19:59.840
So to use Bayesian inference,
what do we start with?
00:19:59.840 --> 00:20:03.160
We start with some priors.
00:20:03.160 --> 00:20:05.910
These are our initial
beliefs about what
00:20:05.910 --> 00:20:07.890
Theta might be.
00:20:07.890 --> 00:20:10.440
That's before we do
the experiment.
00:20:10.440 --> 00:20:13.840
We have a model of the
experimental apparatus.
00:20:17.520 --> 00:20:21.550
And the model of the
experimental apparatus tells
00:20:21.550 --> 00:20:28.040
us if this Theta is true, I'm
going to see X's of that kind.
00:20:28.040 --> 00:20:31.480
If that other Theta is true, I'm
going to see X's that
00:20:31.480 --> 00:20:33.130
are somewhere else.
00:20:33.130 --> 00:20:35.200
That models my apparatus.
00:20:35.200 --> 00:20:39.150
And based on that knowledge,
I have these
00:20:39.150 --> 00:20:41.975
two functions in my hands. We
have already seen that if you
00:20:41.975 --> 00:20:44.760
know those two functions, you
can also calculate the
00:20:44.760 --> 00:20:46.550
denominator here.
00:20:46.550 --> 00:20:50.900
So all of these functions are
available, so you can compute,
00:20:50.900 --> 00:20:54.170
you can find a formula for
this function as well.
00:20:54.170 --> 00:20:58.780
And as soon as you observe the
data, the X's, you plug in
00:20:58.780 --> 00:21:02.220
here the numerical value
of those X's.
00:21:02.220 --> 00:21:04.720
And you get a function
of Theta.
00:21:04.720 --> 00:21:07.870
And this is the posterior
distribution of Theta, given
00:21:07.870 --> 00:21:09.680
the data that you have seen.
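Here is a minimal sketch of that recipe for a discrete Theta and discrete data, loosely in the spirit of the airplane radar problem. The prior, the model of the apparatus, and all of the numbers are made up for illustration; only the structure (prior times likelihood, divided by the denominator) comes from the Bayes rule itself.

```python
def posterior(prior, likelihood, x):
    """Posterior PMF of Theta given the observed value x.

    prior[theta]         : P(Theta = theta), our belief before the experiment
    likelihood[theta][x] : P(X = x | Theta = theta), the model of the apparatus
    """
    # Numerator of the Bayes rule for every candidate theta.
    joint = {theta: prior[theta] * likelihood[theta][x] for theta in prior}
    # Denominator: P(X = x), obtained by summing over all theta.
    evidence = sum(joint.values())
    return {theta: value / evidence for theta, value in joint.items()}


# Hypothetical numbers, chosen only to make the arithmetic concrete.
prior = {"plane": 0.05, "no plane": 0.95}
likelihood = {
    "plane": {"blip": 0.99, "quiet": 0.01},
    "no plane": {"blip": 0.10, "quiet": 0.90},
}

post = posterior(prior, likelihood, "blip")
print(post)
```

Once the observed value is plugged in, what remains is a function of Theta alone, and its values sum to 1.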
00:21:09.680 --> 00:21:11.930
So you've already done
a fair number of
00:21:11.930 --> 00:21:13.760
exercises of this kind.
00:21:13.760 --> 00:21:17.320
So we will not say more about this.
00:21:17.320 --> 00:21:20.470
And there's a similar formula as
you know for the case where
00:21:20.470 --> 00:21:22.460
we have continuous data.
00:21:22.460 --> 00:21:25.140
If the X's are continuous random
variables, then the
00:21:25.140 --> 00:21:28.620
formula is the same, except
that X's are described by
00:21:28.620 --> 00:21:31.630
densities instead of being
described by probability
00:21:31.630 --> 00:21:32.880
mass functions.
00:21:35.170 --> 00:21:40.200
OK, now if Theta is continuous,
then we're dealing
00:21:40.200 --> 00:21:42.160
with estimation problems.
00:21:42.160 --> 00:21:44.880
But the story is once
more the same.
00:21:44.880 --> 00:21:47.920
You're going to use the Bayes
rule to come up with the
00:21:47.920 --> 00:21:51.090
posterior density of Theta,
given the data
00:21:51.090 --> 00:21:53.300
that you have observed.
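For a continuous Theta, the same computation can be sketched numerically on a grid. The example below borrows the polling setting: f is the unknown fraction and we observe k "yes" answers out of n. The uniform prior on [0, 1], the grid approximation, and the numbers are assumptions chosen for illustration.

```python
def grid_posterior(k, n, grid_size=1001):
    """Approximate posterior density of f on an evenly spaced grid.

    Assumes a uniform prior on [0, 1] and k successes in n independent trials,
    so the posterior is proportional to f**k * (1 - f)**(n - k).
    """
    step = 1.0 / (grid_size - 1)
    grid = [i * step for i in range(grid_size)]
    # Prior (constant) times likelihood, up to a constant factor.
    unnormalized = [f ** k * (1.0 - f) ** (n - k) for f in grid]
    # Denominator of the Bayes rule, approximated by a Riemann sum.
    total = sum(unnormalized) * step
    density = [u / total for u in unnormalized]
    return grid, density


grid, density = grid_posterior(k=7, n=10)
# Grid point where the posterior density is largest.
mode = grid[max(range(len(grid)), key=density.__getitem__)]
print(mode)
```

With a flat prior, the peak of the posterior sits at the observed fraction k/n; a different prior would move it.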
00:21:53.300 --> 00:21:57.250
Now just for the sake of the
example, let's come back to
00:21:57.250 --> 00:21:58.900
this picture here.
00:21:58.900 --> 00:22:03.490
Suppose that something is flying
in the air, and maybe
00:22:03.490 --> 00:22:07.800
this is just an object in the
air close to the Earth.
00:22:07.800 --> 00:22:10.820
So because of gravity, the
trajectory that it's going to
00:22:10.820 --> 00:22:15.170
follow is going to
be a parabola.
00:22:15.170 --> 00:22:18.014
So this is the general equation
of a parabola.
00:22:18.014 --> 00:22:23.450
Zt is the position of my
object at time t.
00:22:26.310 --> 00:22:29.500
But I don't know exactly
which parabola it is.
00:22:29.500 --> 00:22:32.690
So the parameters of the
parabola are unknown
00:22:32.690 --> 00:22:34.040
quantities.
00:22:34.040 --> 00:22:37.710
What I can do is to go and
measure the position of my
00:22:37.710 --> 00:22:41.880
object at different times.
00:22:41.880 --> 00:22:44.575
But unfortunately, my
measurements are noisy.
00:22:47.380 --> 00:22:51.070
What I want to do is to model
the motion of my object.
00:22:51.070 --> 00:22:56.260
So I guess in the picture, the
axis would be t going this way
00:22:56.260 --> 00:22:59.980
and Z going this way.
00:22:59.980 --> 00:23:02.470
And on the basis of the
data that I get,
00:23:02.470 --> 00:23:05.020
these are my X's.
00:23:05.020 --> 00:23:07.390
I want to figure
out the Thetas.
00:23:07.390 --> 00:23:09.570
That is, I want to figure
out the exact
00:23:09.570 --> 00:23:11.840
equation of this parabola.
00:23:11.840 --> 00:23:14.940
Now if somebody gives you
probability distributions for
00:23:14.940 --> 00:23:18.490
Theta, these would
be your priors.
00:23:18.490 --> 00:23:19.840
So this is given.
00:23:23.200 --> 00:23:26.200
We need the conditional
distribution of the X's given
00:23:26.200 --> 00:23:27.360
the Thetas.
00:23:27.360 --> 00:23:30.870
Well, we have the conditional
distribution of Z, given the
00:23:30.870 --> 00:23:32.920
Thetas from this equation.
00:23:32.920 --> 00:23:36.040
And then by playing with this
equation, you can also find
00:23:36.040 --> 00:23:42.460
how is X distributed if Theta
takes a particular value.
00:23:42.460 --> 00:23:46.420
So you do have all of the
densities that you might need.
00:23:46.420 --> 00:23:48.790
And you can apply
the Bayes rule.
00:23:48.790 --> 00:23:53.620
And at the end, your end result
would be a formula for
00:23:53.620 --> 00:23:57.270
the distribution of Theta,
given the X
00:23:57.270 --> 00:23:59.130
that you have observed--
00:23:59.130 --> 00:24:03.000
except for one sort of
complication, or to make things
00:24:03.000 --> 00:24:04.470
more interesting.
00:24:04.470 --> 00:24:07.680
Instead of these X's and Theta's
being single random
00:24:07.680 --> 00:24:11.070
variables that we have here,
typically those X's and
00:24:11.070 --> 00:24:13.400
Theta's will be
multi-dimensional random
00:24:13.400 --> 00:24:16.490
variables or will correspond
to multiple ones.
00:24:16.490 --> 00:24:19.920
So this little Theta here
actually stands for a triplet
00:24:19.920 --> 00:24:22.880
of Theta0, Theta1, and Theta2.
00:24:22.880 --> 00:24:26.820
And that X here stands for
the entire sequence of X's
00:24:26.820 --> 00:24:28.410
that we have observed.
00:24:28.410 --> 00:24:31.060
So in reality, the object that
you're going to get at the
00:24:31.060 --> 00:24:35.900
end after inference is done is
a function that you plug in
00:24:35.900 --> 00:24:39.430
the values of the data and you
get the function of the
00:24:39.430 --> 00:24:43.240
Theta's that tells you the
relative likelihoods of
00:24:43.240 --> 00:24:46.780
different Theta triplets.
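A minimal sketch of that "function of the Thetas" for the parabola example. Everything numeric here is an assumption made for illustration: the form Z_t = Theta0 + Theta1 t + Theta2 t^2, Gaussian measurement noise with a known standard deviation, a uniform prior over a small hypothetical set of candidate triplets, and the made-up observations.

```python
import math
from itertools import product

# Hypothetical noisy position measurements X_t taken at these times.
times = [0.0, 1.0, 2.0, 3.0]
observations = [1.1, 2.3, 3.05, 2.6]
noise_std = 0.5  # assumed standard deviation of the Gaussian measurement noise

# Hypothetical candidate values for (Theta0, Theta1, Theta2); uniform prior over triplets.
candidates = list(product([0.0, 1.0], [1.0, 2.0], [-1.0, -0.5]))

def parabola(theta, t):
    """Position at time t for the triplet theta = (theta0, theta1, theta2)."""
    theta0, theta1, theta2 = theta
    return theta0 + theta1 * t + theta2 * t * t

def likelihood(theta):
    """Density of the whole observation sequence given theta, up to a constant factor."""
    squared_error = sum(
        (x - parabola(theta, t)) ** 2 for t, x in zip(times, observations)
    )
    return math.exp(-squared_error / (2.0 * noise_std ** 2))

# Bayes rule: posterior is proportional to prior times likelihood, then normalized.
prior = 1.0 / len(candidates)
unnormalized = {theta: prior * likelihood(theta) for theta in candidates}
total = sum(unnormalized.values())
posterior = {theta: weight / total for theta, weight in unnormalized.items()}

# Print the triplets from most to least likely given the data.
for theta, probability in sorted(posterior.items(), key=lambda item: -item[1]):
    print(theta, round(probability, 4))
```

The dictionary `posterior` plays the role of the end product described above: plug in the data, and you get the relative likelihoods of the different Theta triplets.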
00:24:46.780 --> 00:24:49.760
So what I'm saying is that this
is no harder than the
00:24:49.760 --> 00:24:53.720
problems that you have dealt
with so far, except perhaps
00:24:53.720 --> 00:24:56.020
for the complication that
usually, in interesting
00:24:56.020 --> 00:24:57.490
inference problems,
00:24:57.490 --> 00:25:01.940
your Theta's and X's are often
vectors of random
00:25:01.940 --> 00:25:05.490
variables instead of individual
random variables.
00:25:05.490 --> 00:25:09.630
Now if you are to do estimation
in a case where you
00:25:09.630 --> 00:25:13.520
have discrete data, again the
situation is no different.
00:25:13.520 --> 00:25:17.020
We still have a Bayes rule of
the same kind, except that
00:25:17.020 --> 00:25:19.540
densities get replaced
by PMF's.
00:25:19.540 --> 00:25:23.680
If X is discrete, you put a P
here instead of putting an f.
00:25:23.680 --> 00:25:27.990
So an example of an estimation
problem with discrete data is
00:25:27.990 --> 00:25:29.740
similar to the polling
problem.
00:25:29.740 --> 00:25:31.600
You have a coin.
00:25:31.600 --> 00:25:33.500
It has an unknown
parameter Theta.
00:25:33.500 --> 00:25:35.230
This is the probability
of obtaining heads.
00:25:35.230 --> 00:25:37.410
You flip the coin many times.
00:25:37.410 --> 00:25:41.560
What can you tell me about
the true value of Theta?
00:25:41.560 --> 00:25:46.200
A classical statistician, at
this point, would say, OK, I'm
00:25:46.200 --> 00:25:48.900
going to use an estimator,
the most reasonable
00:25:48.900 --> 00:25:50.950
one, which is this.
00:25:50.950 --> 00:25:54.200
How many heads did I
obtain in n trials?
00:25:54.200 --> 00:25:56.440
Divide by the total
number of trials.
00:25:56.440 --> 00:26:00.700
This is my estimate of
the bias of my coin.
00:26:00.700 --> 00:26:02.860
And then the classical
statistician would continue
00:26:02.860 --> 00:26:07.610
from here and try to prove some
properties and argue that
00:26:07.610 --> 00:26:10.030
this estimate is a good one.
00:26:10.030 --> 00:26:12.850
For example, we have the weak
law of large numbers that
00:26:12.850 --> 00:26:15.630
tells us that this particular
estimate converges in
00:26:15.630 --> 00:26:17.990
probability to the
true parameter.
00:26:17.990 --> 00:26:21.000
This is a kind of guarantee
that's useful to have.
00:26:21.000 --> 00:26:23.410
And the classical statistician
would pretty much close the
00:26:23.410 --> 00:26:24.660
subject in this way.
00:26:27.340 --> 00:26:30.160
What would the Bayesian
person do differently?
00:26:30.160 --> 00:26:35.040
The Bayesian person would start
by assuming a prior
00:26:35.040 --> 00:26:37.100
distribution of Theta.
00:26:37.100 --> 00:26:39.820
Instead of treating Theta as
an unknown constant, they
00:26:39.820 --> 00:26:44.340
would say that Theta was picked
randomly or pretend that
00:26:44.340 --> 00:26:47.360
it was picked randomly
and assume a
00:26:47.360 --> 00:26:49.300
distribution on Theta.
00:26:49.300 --> 00:26:54.290
So for example, if you don't
know anything more,
00:26:54.290 --> 00:26:57.510
you might assume that any value
for the bias of the coin
00:26:57.510 --> 00:27:01.460
is as likely as any other value
of the bias of the coin.
00:27:01.460 --> 00:27:04.150
And this gives a probability
distribution
00:27:04.150 --> 00:27:05.720
that's uniform.
00:27:05.720 --> 00:27:09.840
Or if you have a little more
faith in the manufacturing
00:27:09.840 --> 00:27:13.270
process that created that
coin, you might choose your
00:27:13.270 --> 00:27:17.660
prior to be a distribution
that's centered around 1/2 and
00:27:17.660 --> 00:27:21.860
sits fairly narrowly centered
around 1/2.
00:27:21.860 --> 00:27:24.500
That would be a prior
distribution in which you say,
00:27:24.500 --> 00:27:27.920
well, I believe that the
manufacturer tried to make my
00:27:27.920 --> 00:27:29.410
coin to be fair.
00:27:29.410 --> 00:27:33.070
But they often make some
mistakes, so it's going to be,
00:27:33.070 --> 00:27:36.600
I believe, it's approximately
1/2 but not quite.
00:27:36.600 --> 00:27:40.050
So depending on your beliefs,
you would choose an
00:27:40.050 --> 00:27:43.630
appropriate prior for the
distribution of Theta.
00:27:43.630 --> 00:27:48.610
And then you would use the
Bayes rule to find the
00:27:48.610 --> 00:27:52.270
probabilities of different
values of Theta, based on the
00:27:52.270 --> 00:27:53.520
data that you have observed.
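A small sketch of that Bayes rule calculation for the coin. The assumptions are mine, not the lecture's: Theta is restricted to a grid of 101 values, the prior is uniform over that grid, and the data (7 heads in 10 flips) are made up.

```python
def coin_posterior(heads, flips, grid_size=101):
    """Posterior PMF of the coin bias Theta on an evenly spaced grid in [0, 1]."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0 / grid_size] * grid_size  # uniform prior: every bias equally likely
    # Likelihood of the observed sequence: theta^heads * (1 - theta)^(flips - heads).
    unnormalized = [
        p * t ** heads * (1.0 - t) ** (flips - heads)
        for p, t in zip(prior, thetas)
    ]
    total = sum(unnormalized)  # the denominator in the Bayes rule
    return thetas, [u / total for u in unnormalized]

# Hypothetical data: 7 heads in 10 flips.
thetas, posterior = coin_posterior(heads=7, flips=10)
best = max(range(len(thetas)), key=posterior.__getitem__)
print("most likely bias on the grid:", thetas[best])
```

A prior that is narrowly centered around 1/2 would only change the `prior` list; the rest of the calculation stays the same.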
00:27:59.620 --> 00:28:04.640
So no matter which version of
the Bayes rule that you use,
00:28:04.640 --> 00:28:10.540
the end product of the Bayes
rule is going to be either a
00:28:10.540 --> 00:28:14.400
plot of this kind or a
plot of that kind.
00:28:14.400 --> 00:28:16.740
So what am I plotting here?
00:28:16.740 --> 00:28:19.810
This axis is the Theta axis.
00:28:19.810 --> 00:28:23.830
These are the possible values
of the unknown quantity that
00:28:23.830 --> 00:28:26.670
we're trying to estimate.
00:28:26.670 --> 00:28:28.990
In the continuous
case, theta is a
00:28:28.990 --> 00:28:30.800
continuous random variable.
00:28:30.800 --> 00:28:32.560
I obtain my data.
00:28:32.560 --> 00:28:36.430
And I plot for the posterior
probability distribution after
00:28:36.430 --> 00:28:37.940
observing my data.
00:28:37.940 --> 00:28:42.220
And I'm plotting here the
probability density for Theta.
00:28:42.220 --> 00:28:45.500
So this is a plot
of that density.
00:28:45.500 --> 00:28:49.210
In the discrete case, theta can
take finitely many values
00:28:49.210 --> 00:28:51.570
or a discrete set of values.
00:28:51.570 --> 00:28:54.470
And for each one of those
values, I'm telling you how
00:28:54.470 --> 00:28:58.080
likely that value is to be
the correct one, given the
00:28:58.080 --> 00:29:01.040
data that I have observed.
00:29:01.040 --> 00:29:04.990
And in general, what you would
go back to your boss and
00:29:04.990 --> 00:29:08.520
report after you've done all
your inference work would be
00:29:08.520 --> 00:29:10.870
either a plot of this kind
or of that kind.
00:29:10.870 --> 00:29:14.180
So you go to your boss
who asks you, what is
00:29:14.180 --> 00:29:15.190
the value of Theta?
00:29:15.190 --> 00:29:17.490
And you say, well, I only
have limited data.
00:29:17.490 --> 00:29:19.420
I don't know what it is.
00:29:19.420 --> 00:29:22.920
It could be this, with
so much probability.
00:29:22.920 --> 00:29:24.640
There's probability.
00:29:24.640 --> 00:29:27.220
OK, let's throw in some
numbers here.
00:29:27.220 --> 00:29:32.250
There's probability 0.3 that
Theta is this value.
00:29:32.250 --> 00:29:36.100
There's probability 0.2 that
Theta is this value, 0.1 that
00:29:36.100 --> 00:29:39.420
it's this one, 0.1 that it's
this one, 0.2 that it's that
00:29:39.420 --> 00:29:40.830
one, and so on.
00:29:40.830 --> 00:29:44.890
OK, now bosses often want
simple answers.
00:29:44.890 --> 00:29:48.480
They say, OK, you're
talking too much.
00:29:48.480 --> 00:29:51.770
What do you think Theta is?
00:29:51.770 --> 00:29:55.920
And now you're forced
to make a decision.
00:29:55.920 --> 00:30:00.680
If that was the situation and
you have to make a decision,
00:30:00.680 --> 00:30:02.370
how would you make it?
00:30:02.370 --> 00:30:06.880
Well, I'm going to make a
decision that's most likely to
00:30:06.880 --> 00:30:09.120
be correct.
00:30:09.120 --> 00:30:13.060
If I make this decision,
what's going to happen?
00:30:13.060 --> 00:30:17.670
Theta is this value with
probability 0.2, which means
00:30:17.670 --> 00:30:21.150
there's probability 0.8 that
I make an error
00:30:21.150 --> 00:30:23.280
if I make that guess.
00:30:23.280 --> 00:30:29.370
If I make that decision, this
decision has probability 0.3 of
00:30:29.370 --> 00:30:30.750
being the correct one.
00:30:30.750 --> 00:30:34.530
So I have probability
of error 0.7.
00:30:34.530 --> 00:30:38.460
So if you want to just maximize
the probability of
00:30:38.460 --> 00:30:41.730
giving the correct decision, or
if you want to minimize the
00:30:41.730 --> 00:30:44.780
probability of making an
incorrect decision, what
00:30:44.780 --> 00:30:48.790
you're going to choose to report
is that value of Theta
00:30:48.790 --> 00:30:51.450
for which the probability
is highest.
00:30:51.450 --> 00:30:54.230
So in this case, I would
choose to report this
00:30:54.230 --> 00:30:58.210
particular value, the most
likely value of Theta, given
00:30:58.210 --> 00:31:00.120
what I have observed.
00:31:00.120 --> 00:31:04.640
And that value is called the
maximum a posteriori
00:31:04.640 --> 00:31:07.550
probability estimate.
00:31:07.550 --> 00:31:11.550
It's going to be this
one in our case.
00:31:11.550 --> 00:31:16.830
So picking the point in the
posterior PMF that has the
00:31:16.830 --> 00:31:19.040
highest probability.
00:31:19.040 --> 00:31:20.720
That's the reasonable
thing to do.
00:31:20.720 --> 00:31:23.850
This is the optimal thing to do
if you want to minimize the
00:31:23.850 --> 00:31:27.340
probability of an incorrect
inference.
00:31:27.340 --> 00:31:31.400
And that's what people do
usually if they need to report
00:31:31.400 --> 00:31:35.280
a single answer, if they need
to report a single decision.
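The MAP rule for a posterior PMF can be sketched in a few lines. The Theta labels 1 through 6 are hypothetical, and the last probability (0.1) is filled in only so that the numbers mentioned above (0.3, 0.2, 0.1, 0.1, 0.2, "and so on") sum to 1.

```python
# Hypothetical posterior PMF: keys are candidate values of Theta,
# values are their posterior probabilities given the data.
posterior_pmf = {1: 0.3, 2: 0.2, 3: 0.1, 4: 0.1, 5: 0.2, 6: 0.1}

def map_estimate(pmf):
    """Return the value with the highest posterior probability."""
    return max(pmf, key=pmf.get)

theta_map = map_estimate(posterior_pmf)
# Reporting theta_map is wrong exactly when Theta is any other value.
error_probability = 1.0 - posterior_pmf[theta_map]
print(theta_map, error_probability)
```

Any other choice has a smaller probability of being correct, so its error probability is larger; for instance, reporting a value with probability 0.2 gives error probability 0.8.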
00:31:35.280 --> 00:31:39.530
How about in the estimation
context?
00:31:39.530 --> 00:31:43.250
If that's what you know about
Theta, Theta could be around
00:31:43.250 --> 00:31:46.670
here, but there's also some
probability that it is
00:31:46.670 --> 00:31:48.720
around here.
00:31:48.720 --> 00:31:52.380
What's the single answer that
you would give to your boss?
00:31:52.380 --> 00:31:56.310
One option is to use the same
philosophy and say, OK, I'm
00:31:56.310 --> 00:32:00.135
going to find the Theta at which
this posterior density
00:32:00.135 --> 00:32:01.690
is highest.
00:32:01.690 --> 00:32:06.010
So I would pick this point
here and report this
00:32:06.010 --> 00:32:06.920
particular Theta.
00:32:06.920 --> 00:32:11.110
So this would be my Theta,
again, Theta MAP, the Theta
00:32:11.110 --> 00:32:15.290
that has the highest a
posteriori probability, just
00:32:15.290 --> 00:32:19.100
because it corresponds to
the peak of the density.
00:32:19.100 --> 00:32:23.810
But in this context, the
maximum a posteriori
00:32:23.810 --> 00:32:27.120
probability theta was the
one that was most
00:32:27.120 --> 00:32:28.600
likely to be true.
00:32:28.600 --> 00:32:32.460
In the continuous case, you
cannot really say that this is
00:32:32.460 --> 00:32:34.940
the most likely value
of Theta.
00:32:34.940 --> 00:32:38.340
In a continuous setting, any
value of Theta has zero
00:32:38.340 --> 00:32:41.530
probability, so when we
talk about densities.
00:32:41.530 --> 00:32:43.260
So it's not the most likely.
00:32:43.260 --> 00:32:48.240
It's the one for which the
density, so the probabilities
00:32:48.240 --> 00:32:51.820
of that neighborhood,
are highest.
00:32:51.820 --> 00:32:56.390
So the rationale for picking
this particular estimate in
00:32:56.390 --> 00:33:00.050
the continuous case is much
less compelling than the
00:33:00.050 --> 00:33:02.210
rationale that we had in here.
00:33:02.210 --> 00:33:05.590
So in this case, reasonable
people might choose different
00:33:05.590 --> 00:33:07.460
quantities to report.
00:33:07.460 --> 00:33:11.810
And the very popular one would
be to report instead the
00:33:11.810 --> 00:33:13.700
conditional expectation.
00:33:13.700 --> 00:33:15.990
So I don't know quite
what Theta is.
00:33:15.990 --> 00:33:19.600
Given the data that I have,
Theta has this distribution.
00:33:19.600 --> 00:33:23.320
Let me just report the average
over that distribution.
00:33:23.320 --> 00:33:27.090
Let me report the center
of gravity of this figure.
00:33:27.090 --> 00:33:30.340
And in this figure, the center
of gravity would probably be
00:33:30.340 --> 00:33:32.230
somewhere around here.
00:33:32.230 --> 00:33:35.690
And that would be a different
estimate that you
00:33:35.690 --> 00:33:37.520
might choose to report.
00:33:37.520 --> 00:33:40.340
So center of gravity is
something around here.
00:33:40.340 --> 00:33:43.580
And this is a conditional
expectation of Theta, given
00:33:43.580 --> 00:33:46.010
the data that you have.
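To see how the two estimates can differ, here is a rough numerical sketch. The posterior density is a made-up two-bump shape (a mixture of two normal densities), chosen only to resemble a plot with one tall peak and one smaller bump; it is not the density from the lecture.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posterior_density(theta):
    """Hypothetical posterior: tall bump near 2, smaller bump near 6."""
    return 0.7 * normal_pdf(theta, 2.0, 0.5) + 0.3 * normal_pdf(theta, 6.0, 0.5)

# Evaluate the density on a fine grid covering essentially all of its mass.
step = 0.01
grid = [-2.0 + i * step for i in range(1201)]
density = [posterior_density(t) for t in grid]

# MAP estimate: the grid point where the density peaks.
theta_map = grid[max(range(len(grid)), key=density.__getitem__)]
# Conditional expectation: the center of gravity of the density.
theta_mean = sum(t * d for t, d in zip(grid, density)) / sum(density)

print("MAP estimate:", round(theta_map, 2))
print("conditional expectation:", round(theta_mean, 2))
```

With this shape the peak sits near 2, while the center of gravity is pulled toward the second bump and lands near 3.2, a value where the density itself is quite low.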
00:33:46.010 --> 00:33:51.190
So these are two, in some sense,
fairly reasonable ways
00:33:51.190 --> 00:33:53.850
of choosing what to report
to your boss.
00:33:53.850 --> 00:33:55.690
Some people might choose
to report this.
00:33:55.690 --> 00:33:58.630
Some people might choose
to report that.
00:33:58.630 --> 00:34:03.230
And a priori, there's no
compelling reason why one
00:34:03.230 --> 00:34:08.639
would be preferable to the other
one, unless you set some rules
00:34:08.639 --> 00:34:12.350
for the game and you describe
a little more precisely what
00:34:12.350 --> 00:34:14.090
your objectives are.
00:34:14.090 --> 00:34:19.070
But no matter which one you
report, a single answer, a
00:34:19.070 --> 00:34:24.350
point estimate, doesn't really
tell you the whole story.
00:34:24.350 --> 00:34:28.159
There's a lot more information
conveyed by this posterior
00:34:28.159 --> 00:34:31.060
distribution plot than
any single number
00:34:31.060 --> 00:34:32.159
that you might report.
00:34:32.159 --> 00:34:36.510
So in general, you may wish to
convince your boss that it's
00:34:36.510 --> 00:34:40.310
worth their time to look at the
entire plot, because that
00:34:40.310 --> 00:34:43.100
plot sort of covers all
the possibilities.
00:34:43.100 --> 00:34:47.060
It tells your boss most likely
we're in that range, but
00:34:47.060 --> 00:34:51.620
there's also a distinct chance
that our Theta happens to lie
00:34:51.620 --> 00:34:54.080
in that range.
00:34:54.080 --> 00:34:58.400
All right, now let us try to
perhaps differentiate between
00:34:58.400 --> 00:35:02.570
these two and see under what
circumstances this one might
00:35:02.570 --> 00:35:05.530
be the better estimate
to perform.
00:35:05.530 --> 00:35:07.320
Better with respect to what?
00:35:07.320 --> 00:35:08.830
We need some rules.
00:35:08.830 --> 00:35:10.730
So we're going to throw
in some rules.
00:35:14.320 --> 00:35:17.450
As a warm up, we're going to
deal with the problem of
00:35:17.450 --> 00:35:22.000
making an estimation if you
had no information at all,
00:35:22.000 --> 00:35:24.670
except for a prior
distribution.
00:35:24.670 --> 00:35:27.650
So this is a warm up for what's
coming next, which
00:35:27.650 --> 00:35:32.970
would be estimation that takes
into account some information.
00:35:32.970 --> 00:35:34.860
So we have a Theta.
00:35:34.860 --> 00:35:38.500
And because of your subjective
beliefs or models by others,
00:35:38.500 --> 00:35:41.780
you believe that Theta is
uniformly distributed between,
00:35:41.780 --> 00:35:46.250
let's say, 4 and 10.
00:35:46.250 --> 00:35:48.120
You want to come up with
a point estimate.
00:35:51.770 --> 00:35:54.900
Let's try to look
for an estimate.
00:35:54.900 --> 00:35:57.580
Call it c, in this case.
00:35:57.580 --> 00:36:00.090
I want to pick a number
with which to estimate
00:36:00.090 --> 00:36:01.340
the value of Theta.
00:36:04.030 --> 00:36:08.260
I will be interested in the size
of the error that I make.
00:36:08.260 --> 00:36:12.310
And I really dislike large
errors, so I'm going to focus
00:36:12.310 --> 00:36:15.500
on the square of the error
that I make.
00:36:15.500 --> 00:36:19.140
So I pick c.
00:36:19.140 --> 00:36:21.340
Theta has a random value
that I don't know.
00:36:21.340 --> 00:36:25.900
But whatever it is, once it
becomes known, it results into
00:36:25.900 --> 00:36:28.640
a squared error between
what it is and what I
00:36:28.640 --> 00:36:30.660
guessed that it was.
00:36:30.660 --> 00:36:35.770
And I'm interested in making
a small error on the average,
00:36:35.770 --> 00:36:38.170
where the average is taken
with respect to all the
00:36:38.170 --> 00:36:42.350
possible and unknown
values of Theta.
00:36:42.350 --> 00:36:47.220
So the problem, this is a least
squares formulation of
00:36:47.220 --> 00:36:49.240
the problem, where we
try to minimize the
00:36:49.240 --> 00:36:51.150
least squares errors.
00:36:51.150 --> 00:36:53.900
How do you find the optimal c?
00:36:53.900 --> 00:36:57.200
Well, we take that expression
and expand it.
00:37:00.930 --> 00:37:05.650
And it is, using linearity
of expectations--
00:37:05.650 --> 00:37:11.460
the expected value of Theta squared, minus 2c expected
Theta plus c squared--
00:37:11.460 --> 00:37:13.620
that's the quantity that
we want to minimize,
00:37:13.620 --> 00:37:16.670
with respect to c.
00:37:16.670 --> 00:37:19.670
To do the minimization, take the
derivative with respect to
00:37:19.670 --> 00:37:21.950
c and set it to 0.
00:37:21.950 --> 00:37:27.320
So that differentiation gives us
from here minus 2 expected
00:37:27.320 --> 00:37:32.420
value of Theta plus
2c is equal to 0.
00:37:32.420 --> 00:37:36.550
And the answer that you get by
solving this equation is that
00:37:36.550 --> 00:37:39.350
c is the expected
value of Theta.
00:37:39.350 --> 00:37:42.860
So when you do this
optimization, you find that
00:37:42.860 --> 00:37:45.170
the optimal estimate, the
thing you should be
00:37:45.170 --> 00:37:47.970
reporting, is the expected
value of Theta.
00:37:47.970 --> 00:37:51.630
So in this particular example,
you would choose your estimate
00:37:51.630 --> 00:37:55.500
c to be just the middle
of these values,
00:37:55.500 --> 00:37:57.980
which would be 7.
00:38:02.642 --> 00:38:06.640
OK, and in case your
boss asks you, how
00:38:06.640 --> 00:38:08.610
good is your estimate?
00:38:08.610 --> 00:38:11.390
How big is your error
going to be?
00:38:14.910 --> 00:38:19.870
What you could report is the
average size of the estimation
00:38:19.870 --> 00:38:22.570
error that you are making.
00:38:22.570 --> 00:38:26.760
We picked our estimate to be
the expected value of Theta.
00:38:26.760 --> 00:38:29.450
So for this particular way that
I'm choosing to do my
00:38:29.450 --> 00:38:33.610
estimation, this is the mean
squared error that I get.
00:38:33.610 --> 00:38:35.330
And this is a familiar
quantity.
00:38:35.330 --> 00:38:38.370
It's just the variance
of the distribution.
00:38:38.370 --> 00:38:41.890
So the expectation is the
best way to estimate a
00:38:41.890 --> 00:38:45.550
quantity, if you're interested
in the mean squared error.
00:38:45.550 --> 00:38:50.430
And the resulting mean squared
error is the variance itself.
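Written out, the steps of this warm-up are as follows. The closed form for the variance of a uniform distribution, (b - a)^2 / 12, is a standard fact that is not stated in the lecture itself.

```latex
\begin{align*}
\mathbb{E}\big[(\Theta - c)^2\big]
  &= \mathbb{E}[\Theta^2] - 2c\,\mathbb{E}[\Theta] + c^2, \\
\frac{d}{dc}\,\mathbb{E}\big[(\Theta - c)^2\big]
  &= -2\,\mathbb{E}[\Theta] + 2c = 0
  \quad\Longrightarrow\quad c = \mathbb{E}[\Theta], \\
\mathbb{E}\big[(\Theta - \mathbb{E}[\Theta])^2\big]
  &= \operatorname{var}(\Theta). \\[4pt]
\text{For } \Theta \sim \text{Uniform}[4, 10]: \quad
\mathbb{E}[\Theta] &= \frac{4 + 10}{2} = 7, \qquad
\operatorname{var}(\Theta) = \frac{(10 - 4)^2}{12} = 3.
\end{align*}
```

So with no data, the estimate is 7 and the mean squared error is 3.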
00:38:50.430 --> 00:38:56.380
How will this story change if
we now have data as well?
00:38:56.380 --> 00:39:01.290
Now having data means that
we can compute posterior
00:39:01.290 --> 00:39:05.150
distributions or conditional
distributions.
00:39:05.150 --> 00:39:10.400
So we get transported into a new
universe where instead of
00:39:10.400 --> 00:39:14.740
working with the original
distribution of Theta, the
00:39:14.740 --> 00:39:18.860
prior distribution, now we work
with the conditional
00:39:18.860 --> 00:39:22.280
distribution of Theta,
given the data
00:39:22.280 --> 00:39:24.860
that we have observed.
00:39:24.860 --> 00:39:30.430
Now remember our old slogan that
conditional models and
00:39:30.430 --> 00:39:33.570
conditional probabilities are
no different than ordinary
00:39:33.570 --> 00:39:38.880
probabilities, except that we
live now in a new universe
00:39:38.880 --> 00:39:42.690
where the new information has
been taken into account.
00:39:42.690 --> 00:39:47.860
So if you use that philosophy
and you're asked to minimize
00:39:47.860 --> 00:39:53.310
the squared error but now that
you live in a new universe
00:39:53.310 --> 00:39:56.910
where X has been fixed to
something, what would the
00:39:56.910 --> 00:39:59.210
optimal solution be?
00:39:59.210 --> 00:40:03.540
It would again be the
expectation of theta, but
00:40:03.540 --> 00:40:04.730
which expectation?
00:40:04.730 --> 00:40:08.910
It's the expectation which
applies in the new conditional
00:40:08.910 --> 00:40:12.350
universe in which we
live right now.
00:40:12.350 --> 00:40:16.330
So because of what we did
before, by the same
00:40:16.330 --> 00:40:20.330
calculation, we would find that
the optimal estimate is
00:40:20.330 --> 00:40:24.970
the conditional expectation of
Theta given X, the optimal
00:40:24.970 --> 00:40:26.730
estimate that takes
into account the
00:40:26.730 --> 00:40:29.170
information that we have.
00:40:29.170 --> 00:40:33.600
So the conclusion, once you get
your data, if you want to
00:40:33.600 --> 00:40:40.480
minimize the mean squared error,
you should just report
00:40:40.480 --> 00:40:43.870
the conditional expectation of
this unknown quantity based on
00:40:43.870 --> 00:40:46.640
the data that you have.
00:40:46.640 --> 00:40:53.050
So the picture here is that
Theta is unknown.
00:40:53.050 --> 00:41:00.710
You have your apparatus that
creates measurements.
00:41:00.710 --> 00:41:07.880
So this creates an X. You take
an X, and here you have a box
00:41:07.880 --> 00:41:10.203
that does calculations.
00:41:13.490 --> 00:41:18.180
It does calculations and it
spits out the conditional
00:41:18.180 --> 00:41:22.230
expectation of Theta, given the
particular data that you
00:41:22.230 --> 00:41:24.750
have observed.
00:41:24.750 --> 00:41:28.680
And what we have done in this
class so far is, to some
00:41:28.680 --> 00:41:33.450
extent, developing the
computational tools and skills
00:41:33.450 --> 00:41:36.020
to do this particular
calculation--
00:41:36.020 --> 00:41:39.780
how to calculate the posterior
density for Theta and how to
00:41:39.780 --> 00:41:42.750
calculate expectations,
conditional expectations.
00:41:42.750 --> 00:41:45.330
So in principle, we know
how to do this.
00:41:45.330 --> 00:41:50.040
In principle, we can program a
computer to take the data and
00:41:50.040 --> 00:41:51.670
to spit out conditional
expectations.
00:41:56.140 --> 00:42:04.390
Somebody who doesn't think like
us might instead design a
00:42:04.390 --> 00:42:09.940
calculating machine that does
something differently and
00:42:09.940 --> 00:42:16.490
produces some other estimate.
00:42:16.490 --> 00:42:20.000
So we went through this argument
and we decided to
00:42:20.000 --> 00:42:23.110
program our computer to
calculate conditional
00:42:23.110 --> 00:42:24.490
expectations.
00:42:24.490 --> 00:42:28.460
Somebody else came up with some
other crazy idea for how
00:42:28.460 --> 00:42:30.590
to estimate the random
variable.
00:42:30.590 --> 00:42:34.460
They came up with some function
g and the programmed
00:42:34.460 --> 00:42:38.700
it, and they designed a machine
that estimates Theta
00:42:38.700 --> 00:42:43.000
by outputting a certain
g of X.
00:42:43.000 --> 00:42:47.690
That could be an alternative
estimator.
00:42:47.690 --> 00:42:50.280
Which one is better?
00:42:50.280 --> 00:42:56.350
Well, we convinced ourselves
that this is the optimal one
00:42:56.350 --> 00:42:59.780
in a universe where we have
fixed the particular
00:42:59.780 --> 00:43:01.420
value of the data.
00:43:01.420 --> 00:43:06.030
So what we have proved so far
is a relation of this kind.
00:43:06.030 --> 00:43:09.670
In this conditional universe,
the mean squared
00:43:09.670 --> 00:43:11.920
error that I get--
00:43:11.920 --> 00:43:15.170
I'm the one who's using
this estimator--
00:43:15.170 --> 00:43:18.850
is less than or equal to the
mean squared error that this
00:43:18.850 --> 00:43:23.960
person will get, the person
who uses that estimator.
00:43:23.960 --> 00:43:28.040
For any particular value of
the data, I'm going to do
00:43:28.040 --> 00:43:30.190
better than the other person.
00:43:30.190 --> 00:43:32.760
Now the data themselves
are random.
00:43:32.760 --> 00:43:38.050
If I average over all possible
values of the data, I should
00:43:38.050 --> 00:43:40.240
still be better off.
00:43:40.240 --> 00:43:45.120
If I'm better off for any
possible value X, then I
00:43:45.120 --> 00:43:49.140
should be better off on the
average over all possible
00:43:49.140 --> 00:43:50.640
values of X.
00:43:50.640 --> 00:43:55.670
So let us average both sides of
this inequality with respect
00:43:55.670 --> 00:43:58.990
to the probability distribution
of X. If you want
00:43:58.990 --> 00:44:03.350
to do it formally, you can write
this inequality between
00:44:03.350 --> 00:44:06.520
numbers as an inequality between
random variables.
00:44:06.520 --> 00:44:10.240
And it tells us that no matter
what that random variable
00:44:10.240 --> 00:44:14.010
turns out to be, this quantity
is better than that quantity.
00:44:14.010 --> 00:44:17.270
Take expectations of both
sides, and you get this
00:44:17.270 --> 00:44:21.360
inequality between expectations
overall.
00:44:21.360 --> 00:44:29.130
And this last inequality tells
me that the person who's using
00:44:29.130 --> 00:44:34.430
this estimator who produces
estimates according to this
00:44:34.430 --> 00:44:45.090
machine will have a mean squared
estimation error
00:44:45.090 --> 00:44:48.580
that's less than or equal to
the estimation error that's
00:44:48.580 --> 00:44:51.290
produced by the other person.
00:44:51.290 --> 00:44:54.710
In a few words, the conditional
expectation
00:44:54.710 --> 00:44:58.500
estimator is the optimal
estimator.
00:44:58.500 --> 00:45:01.765
It's the ultimate estimating
machine.
00:45:04.430 --> 00:45:08.720
That's how you should solve
estimation problems and report
00:45:08.720 --> 00:45:10.240
a single value.
00:45:10.240 --> 00:45:14.510
If you're forced to report a
single value and if you're
00:45:14.510 --> 00:45:18.060
interested in estimation
errors.
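The comparison between the two estimating machines can be checked exactly on a small discrete example. This is only a sketch under assumptions of my own: Theta is the bias of a coin, uniformly distributed over nine hypothetical values, X is the number of heads in n = 5 flips, and the "other person" uses g(X) = X / n.

```python
from math import comb

n = 5                                         # number of coin flips (assumed)
thetas = [0.1 * k for k in range(1, 10)]      # hypothetical possible biases 0.1 ... 0.9
prior = [1.0 / len(thetas)] * len(thetas)     # uniform prior over those biases

def likelihood(x, theta):
    """Probability of observing x heads in n flips when the bias is theta."""
    return comb(n, x) * theta ** x * (1.0 - theta) ** (n - x)

def conditional_expectation(x):
    """E[Theta | X = x], computed from the Bayes rule."""
    weights = [p * likelihood(x, t) for p, t in zip(prior, thetas)]
    return sum(w * t for w, t in zip(weights, thetas)) / sum(weights)

def mean_squared_error(estimator):
    """E[(Theta - estimator(X))^2], averaged over both Theta and X."""
    return sum(
        p * likelihood(x, t) * (t - estimator(x)) ** 2
        for p, t in zip(prior, thetas)
        for x in range(n + 1)
    )

mse_conditional = mean_squared_error(conditional_expectation)
mse_frequency = mean_squared_error(lambda x: x / n)   # the alternative g(X)
print(mse_conditional, mse_frequency)
```

The first number comes out no larger than the second, and the same holds for any other function g you substitute into `mean_squared_error`.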
00:45:18.060 --> 00:45:24.620
OK, well, we could have told you
that story, of course, a
00:45:24.620 --> 00:45:29.500
month or two ago, this is really
about interpretation --
00:45:29.500 --> 00:45:32.550
about realizing that conditional
expectations have
00:45:32.550 --> 00:45:35.160
a very nice property.
00:45:35.160 --> 00:45:38.180
But other than that, any
probabilistic skills that come
00:45:38.180 --> 00:45:41.180
into this business are just the
probabilistic skills of
00:45:41.180 --> 00:45:44.330
being able to calculate
conditional expectations,
00:45:44.330 --> 00:45:46.750
which you already
know how to do.
00:45:46.750 --> 00:45:51.380
So conclusion, all of optimal
Bayesian estimation just means
00:45:51.380 --> 00:45:54.655
calculating and reporting
conditional expectations.
00:45:54.655 --> 00:45:58.380
Well, if the world were that
simple, then statisticians
00:45:58.380 --> 00:46:02.670
wouldn't be able to find jobs
if life is that simple.
00:46:02.670 --> 00:46:05.690
So real life is not
that simple.
00:46:05.690 --> 00:46:07.540
There are complications.
00:46:07.540 --> 00:46:10.050
And that perhaps makes their
life a little more
00:46:10.050 --> 00:46:11.300
interesting.
00:46:22.010 --> 00:46:25.500
OK, one complication is that we
would deal with vectors
00:46:25.500 --> 00:46:28.580
instead of just single
random variables.
00:46:28.580 --> 00:46:31.830
I use the notation here
as if X was a
00:46:31.830 --> 00:46:33.500
single random variable.
00:46:33.500 --> 00:46:37.710
In real life, you get
several data.
00:46:37.710 --> 00:46:39.520
Does our story change?
00:46:39.520 --> 00:46:41.950
Not really, same argument--
00:46:41.950 --> 00:46:44.410
given all the data that you
have observed, you should
00:46:44.410 --> 00:46:47.660
still report the conditional
expectation of Theta.
00:46:47.660 --> 00:46:51.260
But what kind of work does it
take in order to report this
00:46:51.260 --> 00:46:53.080
conditional expectation?
00:46:53.080 --> 00:46:57.030
One issue is that you need to
cook up a plausible prior
00:46:57.030 --> 00:46:58.810
distribution for Theta.
00:46:58.810 --> 00:46:59.960
How do you do that?
00:46:59.960 --> 00:47:03.570
In a given application, this
is a bit of a judgment call,
00:47:03.570 --> 00:47:05.970
what prior would you
be working with.
00:47:05.970 --> 00:47:08.840
And there's a certain
skill there of not
00:47:08.840 --> 00:47:12.100
making silly choices.
00:47:12.100 --> 00:47:16.690
A more pragmatic, practical
issue is that this is a
00:47:16.690 --> 00:47:21.180
formula that's extremely nice
and compact and simple that
00:47:21.180 --> 00:47:24.560
you can write with
minimal ink.
00:47:24.560 --> 00:47:29.180
But behind it there could
be hidden a huge amount of
00:47:29.180 --> 00:47:31.520
calculation.
00:47:31.520 --> 00:47:34.820
So doing any sort of
calculations that involve
00:47:34.820 --> 00:47:39.640
multiple random variables really
involves calculating
00:47:39.640 --> 00:47:42.240
multi-dimensional integrals.
00:47:42.240 --> 00:47:46.230
And the multi-dimensional
integrals are hard to compute.
00:47:46.230 --> 00:47:50.830
So implementing actually this
calculating machine here may
00:47:50.830 --> 00:47:54.340
not be easy, might be
complicated computationally.
00:47:54.340 --> 00:47:58.250
It's also complicated in terms
of not being able to derive
00:47:58.250 --> 00:47:59.890
intuition about it.
00:47:59.890 --> 00:48:03.680
So perhaps you might want to
have a simpler version, a
00:48:03.680 --> 00:48:07.940
simpler alternative to this
formula that's easier to work
00:48:07.940 --> 00:48:10.950
with and easier to calculate.
00:48:10.950 --> 00:48:13.440
We will be talking about
one such simpler
00:48:13.440 --> 00:48:15.540
alternative next time.
00:48:15.540 --> 00:48:18.570
So again, to conclude, at
the high level, Bayesian
00:48:18.570 --> 00:48:22.330
estimation is very, very simple,
given that you have
00:48:22.330 --> 00:48:24.180
mastered everything that
has happened in
00:48:24.180 --> 00:48:26.370
this course so far.
00:48:26.370 --> 00:48:29.860
There are certain practical
issues and it's also good to
00:48:29.860 --> 00:48:33.590
be familiar with the concepts
and the issues that in
00:48:33.590 --> 00:48:36.620
general, you would prefer to
report the complete posterior
00:48:36.620 --> 00:48:37.360
distribution.
00:48:37.360 --> 00:48:40.890
But if you're forced to report a
point estimate, then there's
00:48:40.890 --> 00:48:43.130
a number of reasonable
ways to do it.
00:48:43.130 --> 00:48:45.690
And perhaps the most reasonable
one is to just
00:48:45.690 --> 00:48:48.220
report the conditional
expectation itself.