WEBVTT

00:00:00.560 --> 00:00:04.310
In this segment we provide a
high-level introduction to

00:00:04.310 --> 00:00:08.090
the conceptual framework of
classical statistics.

00:00:08.090 --> 00:00:11.970
In order to get there, it is
better to start from what we

00:00:11.970 --> 00:00:14.910
already know and then
make a comparison.

00:00:14.910 --> 00:00:19.430
We already know how to make
inferences by just using the

00:00:19.430 --> 00:00:20.560
Bayes rule.

00:00:20.560 --> 00:00:24.370
In this setting, we have an
unknown quantity, theta, which

00:00:24.370 --> 00:00:26.270
we model as a random variable.

00:00:26.270 --> 00:00:29.480
And so in particular, it's going
to have a probability

00:00:29.480 --> 00:00:30.800
distribution.

00:00:30.800 --> 00:00:33.130
And then we make some
observations.

00:00:33.130 --> 00:00:36.310
And those observations are
modeled as random variables.

00:00:36.310 --> 00:00:39.350
And typically we are given the
conditional distribution of

00:00:39.350 --> 00:00:43.190
the observations given
the unknown variable.

00:00:43.190 --> 00:00:47.750
So these two distributions are
the starting points, and then

00:00:47.750 --> 00:00:49.510
we do some calculations.

00:00:49.510 --> 00:00:51.430
And we use the Bayes rule.

00:00:51.430 --> 00:00:54.580
And we find the posterior
distribution of theta given

00:00:54.580 --> 00:00:56.780
the observations.

00:00:56.780 --> 00:00:59.860
And this tells us all that there
is to know about the

00:00:59.860 --> 00:01:02.460
unknown quantity, theta,
given the observations

00:01:02.460 --> 00:01:04.120
that we have made.

00:01:04.120 --> 00:01:08.690
What is important in this
framework is that theta is

00:01:08.690 --> 00:01:11.020
treated as a random variable.

00:01:11.020 --> 00:01:13.930
And so it has a distribution
of its own.

00:01:13.930 --> 00:01:16.100
And that's our starting point.

00:01:16.100 --> 00:01:20.100
These are our prior beliefs
about theta before we obtain

00:01:20.100 --> 00:01:23.150
any observations.

00:01:23.150 --> 00:01:27.910
However, one can think of
situations where theta maybe

00:01:27.910 --> 00:01:30.780
cannot be modeled as
a random variable.

00:01:30.780 --> 00:01:34.940
Suppose that theta is some
universal physical constant.

00:01:34.940 --> 00:01:38.690
For example, the mass
of the electron.

00:01:38.690 --> 00:01:41.590
Does it make sense to think of
that quantity as random?

00:01:41.590 --> 00:01:44.600
And how do we come up with a
probability distribution for

00:01:44.600 --> 00:01:45.961
that quantity?

00:01:45.961 --> 00:01:50.050
One can argue that in certain
situations one should not

00:01:50.050 --> 00:01:55.060
think of unknown quantities as
being random, but rather they

00:01:55.060 --> 00:01:58.610
are just unknown constants.

00:01:58.610 --> 00:02:00.380
They are absolute constants.

00:02:00.380 --> 00:02:05.690
It just happens that we do
not know their value.

00:02:05.690 --> 00:02:09.460
Or there may be other situations
in which even

00:02:09.460 --> 00:02:12.540
though we may think that there
is something random that

00:02:12.540 --> 00:02:16.910
determines theta, we are
reluctant to postulate any

00:02:16.910 --> 00:02:18.190
prior distribution.

00:02:18.190 --> 00:02:21.450
We do not want to impose
any biases.

00:02:21.450 --> 00:02:24.560
And that leads us to the classical
statistical framework

00:02:24.560 --> 00:02:28.970
in which unknown quantities are
treated as constants, not

00:02:28.970 --> 00:02:30.829
as random variables.

00:02:30.829 --> 00:02:33.290
Pictorially, the setting
is as follows.

00:02:33.290 --> 00:02:36.260
There's an unknown quantity
that we wish to estimate.

00:02:36.260 --> 00:02:40.290
And we make some observations,
X. Those

00:02:40.290 --> 00:02:42.490
observations are random.

00:02:42.490 --> 00:02:45.840
And they're drawn according to
a probability distribution.

00:02:45.840 --> 00:02:49.630
And that probability
distribution depends, or

00:02:49.630 --> 00:02:53.660
rather is affected, by that
unknown quantity.

00:02:53.660 --> 00:02:58.090
So for example, for one value of
theta, the distribution of

00:02:58.090 --> 00:03:00.100
the X's might be this one.

00:03:00.100 --> 00:03:03.710
And for another value of theta,
the distribution of the

00:03:03.710 --> 00:03:06.930
X's could be a different one.

00:03:06.930 --> 00:03:10.000
And we're trying to guess
what theta is.

00:03:10.000 --> 00:03:14.250
Which in some ways is the
question, do my data come from

00:03:14.250 --> 00:03:18.740
this distribution or do they
come from that distribution?

00:03:18.740 --> 00:03:24.390
In order to make a choice of
theta, what we do is we take

00:03:24.390 --> 00:03:27.010
the data and we process them.

00:03:27.010 --> 00:03:32.340
And after we process them, we
come up with our estimate--

00:03:32.340 --> 00:03:36.110
or rather estimator.

00:03:36.110 --> 00:03:37.970
What is the estimator?

00:03:37.970 --> 00:03:41.130
We take the data, and
we calculate a

00:03:41.130 --> 00:03:43.150
function of the data.

00:03:43.150 --> 00:03:46.120
That's what it means to
process the data.

00:03:46.120 --> 00:03:49.700
And that function is
our theta hat.

00:03:49.700 --> 00:03:53.210
Now this function, our data
processing mechanism, is what

00:03:53.210 --> 00:03:55.550
we can call an estimator.

00:03:55.550 --> 00:03:59.900
But quite often, or usually,
we also use the same

00:03:59.900 --> 00:04:04.790
terminology to call theta
hat itself an estimator.

00:04:04.790 --> 00:04:08.570
Now notice that theta hat is
a function of the random

00:04:08.570 --> 00:04:14.820
variable X. So theta hat is
actually a random variable.

00:04:14.820 --> 00:04:19.130
And that's why we denote it
with an uppercase theta.

00:04:19.130 --> 00:04:23.640
On the other hand, after we
obtain some concrete data,

00:04:23.640 --> 00:04:26.850
little x, which are the realized
values of the random

00:04:26.850 --> 00:04:31.060
variable capital X, then we can
apply our estimator to

00:04:31.060 --> 00:04:37.790
that particular input, and we
compute a specific value--

00:04:37.790 --> 00:04:40.390
call it theta hat lower case.

00:04:40.390 --> 00:04:45.850
And that quantity we
call an estimate.

00:04:45.850 --> 00:04:48.580
So this is a useful
distinction.

00:04:48.580 --> 00:04:51.590
Always, with random variables,
we want to distinguish between

00:04:51.590 --> 00:04:55.030
the random variable itself
indicated by uppercase letters

00:04:55.030 --> 00:04:57.780
and the values of the random
variable, which are indicated

00:04:57.780 --> 00:04:59.430
with lower case letters.

00:04:59.430 --> 00:05:04.090
Similarly, the estimator
is a random variable.

00:05:04.090 --> 00:05:08.920
It's essentially a description
of how we generate estimates.

00:05:08.920 --> 00:05:12.770
Whereas the realized value,
once we have some specific

00:05:12.770 --> 00:05:14.200
observations at hand--

00:05:14.200 --> 00:05:15.895
that's what we call
an estimate.

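NOTE
Editorial sketch, not part of the lecture. It illustrates the estimator
versus estimate distinction under an assumed Gaussian model with unit
variance. The names sample_mean and draw_data are hypothetical, and
true_theta is fixed only so that data can be simulated.
```python
import random
def sample_mean(xs):
    """Estimator: a rule that maps data to a guess of theta."""
    return sum(xs) / len(xs)
def draw_data(theta, n, rng):
    """Assumed model: n Gaussian observations with mean theta and std 1."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]
rng = random.Random(0)
true_theta = 2.0  # an unknown constant in practice; fixed here to simulate
x = draw_data(true_theta, 50, rng)  # realized data, "little x"
theta_hat = sample_mean(x)  # estimate: one specific number
# Fresh data give different estimates, because the estimator is a random variable.
other_estimates = [sample_mean(draw_data(true_theta, 50, rng)) for _ in range(5)]
print(theta_hat, other_estimates)
```
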
00:05:18.400 --> 00:05:21.280
Now let me continue with
a few comments.

00:05:21.280 --> 00:05:25.780
The picture, or the setting,
that I have here suggests that

00:05:25.780 --> 00:05:29.220
X is just one variable and
theta is one variable.

00:05:29.220 --> 00:05:33.260
But we can have the same
framework, even if X and theta

00:05:33.260 --> 00:05:34.980
are multi-dimensional.

00:05:34.980 --> 00:05:39.900
For example, X might consist of
several random variables.

00:05:39.900 --> 00:05:43.280
And theta may be a parameter
that consists of multiple

00:05:43.280 --> 00:05:45.430
components.

00:05:45.430 --> 00:05:48.900
Now you may notice that this
notation that we're using here

00:05:48.900 --> 00:05:53.920
is a little different from our
traditional notation which was

00:05:53.920 --> 00:05:55.170
of this form.

00:05:58.159 --> 00:06:01.470
In what ways is it different?

00:06:01.470 --> 00:06:04.700
The main difference is
that here, theta is

00:06:04.700 --> 00:06:07.720
not a random variable.

00:06:07.720 --> 00:06:10.660
Theta is just a parameter.

00:06:10.660 --> 00:06:16.230
So what we're dealing with,
here, is just an ordinary--

00:06:16.230 --> 00:06:18.570
not a conditional
distribution.

00:06:18.570 --> 00:06:22.950
It's an ordinary distribution
that happens to involve,

00:06:22.950 --> 00:06:27.420
inside its description,
some parameters theta.

00:06:27.420 --> 00:06:30.810
Just to emphasize the point that
these are not conditional

00:06:30.810 --> 00:06:34.570
probabilities, because theta is
not a random variable, we

00:06:34.570 --> 00:06:38.990
use a semicolon instead
of using a bar.

00:06:38.990 --> 00:06:42.040
And since theta is not a random
variable, we do not

00:06:42.040 --> 00:06:46.970
include it in the subscript down
here when we talk about

00:06:46.970 --> 00:06:48.830
the classical setting.

00:06:48.830 --> 00:06:52.590
The best way to think of the
situation mathematically is

00:06:52.590 --> 00:06:57.140
that we're essentially dealing
with multiple candidate

00:06:57.140 --> 00:06:59.250
models, as in this picture.

00:06:59.250 --> 00:07:02.840
This could be one possible
model of X. This could be

00:07:02.840 --> 00:07:07.320
another possible model of X.
We have one such model for

00:07:07.320 --> 00:07:10.070
each possible value of theta.

00:07:10.070 --> 00:07:15.040
And if, for example, I were to
get data points that sit down

00:07:15.040 --> 00:07:19.770
here, then a reasonable way to
make an inference could be to

00:07:19.770 --> 00:07:23.320
say, these data are extremely
unlikely to have been

00:07:23.320 --> 00:07:25.810
generated according
to this model.

00:07:25.810 --> 00:07:27.790
These data are quite likely
to have been

00:07:27.790 --> 00:07:29.880
generated by this model.

00:07:29.880 --> 00:07:32.570
So I'm going to pick this
particular model.

00:07:32.570 --> 00:07:35.770
So even though we're not
treating theta as a random

00:07:35.770 --> 00:07:39.380
variable, and we do not have the
Bayes rule in our hands--

00:07:39.380 --> 00:07:42.530
we can still see, at least from
this trivial example,

00:07:42.530 --> 00:07:45.200
that there should be a
reasonable way of making

00:07:45.200 --> 00:07:47.390
inferences.

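NOTE
Editorial sketch, not part of the lecture. It mimics the picture with two
candidate models by evaluating p_X(x; theta) for each candidate and picking
the one under which the data are more likely. The Gaussian form, the
candidate values 0 and 5, and the data points are all assumptions made for
illustration, and log_density and pick_model are hypothetical names.
```python
import math
def log_density(x, theta):
    """log p_X(x; theta) for an assumed Gaussian model, mean theta, variance 1."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * (x - theta) ** 2
def pick_model(data, candidates):
    """Return the candidate theta under which the data are most likely."""
    return max(candidates, key=lambda th: sum(log_density(x, th) for x in data))
data = [4.6, 5.3, 4.9, 5.8]  # made-up observations sitting near 5
choice = pick_model(data, [0.0, 5.0])
print(choice)
```
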
00:07:47.390 --> 00:07:50.390
And let me close with some
comments on the different

00:07:50.390 --> 00:07:54.110
types of problems that we may
encounter in classical

00:07:54.110 --> 00:07:55.420
statistics.

00:07:55.420 --> 00:07:58.280
One class of problems consists of
so-called hypothesis testing

00:07:58.280 --> 00:08:01.440
problems in which we're asked
to choose between two

00:08:01.440 --> 00:08:02.710
candidate models.

00:08:02.710 --> 00:08:06.190
So the unknown parameter, as in
this example, can take one

00:08:06.190 --> 00:08:07.540
of two values.

00:08:07.540 --> 00:08:11.050
So think of a machine
that produces coins.

00:08:11.050 --> 00:08:14.470
And coins are either
fair or they have a

00:08:14.470 --> 00:08:16.960
very specific bias.

00:08:16.960 --> 00:08:20.700
You want to flip the coin, maybe
multiple times, and then

00:08:20.700 --> 00:08:22.870
decide whether you're dealing
with a coin of this

00:08:22.870 --> 00:08:25.290
type or of that type.

00:08:25.290 --> 00:08:28.340
There's another type of
hypothesis testing problem

00:08:28.340 --> 00:08:31.000
which is a little more
complicated, for

00:08:31.000 --> 00:08:32.429
example this one.

00:08:32.429 --> 00:08:37.110
We have one hypothesis which
says that my coin is fair,

00:08:37.110 --> 00:08:39.820
versus an alternative
hypothesis in

00:08:39.820 --> 00:08:42.130
which my coin is unfair.

00:08:42.130 --> 00:08:45.240
But notice that this hypothesis
actually includes

00:08:45.240 --> 00:08:46.870
many possible scenarios.

00:08:46.870 --> 00:08:50.300
There are many possible values
of theta under which this

00:08:50.300 --> 00:08:52.550
hypothesis would be true.

00:08:52.550 --> 00:08:56.920
We will not deal with problems
of this kind in this segment,

00:08:56.920 --> 00:08:58.920
or in this lecture sequence.

00:08:58.920 --> 00:09:01.200
Instead we will focus
exclusively

00:09:01.200 --> 00:09:03.280
on estimation problems.

00:09:03.280 --> 00:09:08.170
In estimation problems, the
unknown parameter, theta, is

00:09:08.170 --> 00:09:12.940
either continuous or can take
one of many, many values.

00:09:12.940 --> 00:09:17.340
What we want to do is to
design an estimator--

00:09:17.340 --> 00:09:19.760
a way of processing the data--

00:09:19.760 --> 00:09:24.070
that comes up with estimates
that are good.

00:09:24.070 --> 00:09:26.530
What does it mean that
an estimate is good?

00:09:26.530 --> 00:09:29.530
An estimate would be good if
the resulting value of the

00:09:29.530 --> 00:09:31.120
estimation error--

00:09:31.120 --> 00:09:34.720
that is, the difference between
the estimated value and the

00:09:34.720 --> 00:09:35.500
true value--

00:09:35.500 --> 00:09:38.200
if that difference is small.

00:09:38.200 --> 00:09:40.480
You want to keep
that difference

00:09:40.480 --> 00:09:43.160
small in some sense.

00:09:43.160 --> 00:09:48.010
Well, one needs a criterion for
what it means to be small.

00:09:48.010 --> 00:09:51.600
And whether we want this in
expectation, or with high

00:09:51.600 --> 00:09:53.470
probability, and so on.

00:09:53.470 --> 00:09:56.780
This statement, to keep the
estimation error small, can be

00:09:56.780 --> 00:09:58.870
interpreted in various ways.

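NOTE
Editorial sketch, not part of the lecture. It shows two ways to read "keep
the estimation error small" for the same estimator: small in expectation
(mean squared error) and small with high probability (the fraction of large
errors). The Gaussian model, the threshold 0.5, and all names are
assumptions chosen for illustration.
```python
import random
def sample_mean(xs):
    """Estimator: the average of the observations."""
    return sum(xs) / len(xs)
def simulate_errors(theta, n, trials, rng):
    """Errors theta_hat - theta over repeated, simulated Gaussian data sets."""
    errors = []
    for _ in range(trials):
        data = [rng.gauss(theta, 1.0) for _ in range(n)]
        errors.append(sample_mean(data) - theta)
    return errors
rng = random.Random(1)
errors = simulate_errors(theta=3.0, n=25, trials=2000, rng=rng)
# Criterion 1: error small in expectation (mean squared error).
mse = sum(e * e for e in errors) / len(errors)
# Criterion 2: error small with high probability (fraction exceeding 0.5).
frac_large = sum(abs(e) > 0.5 for e in errors) / len(errors)
print(mse, frac_large)
```
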
00:09:58.870 --> 00:10:02.510
And for that reason,
there's no single approach to

00:10:02.510 --> 00:10:05.590
the problem of designing
a good estimator.

00:10:05.590 --> 00:10:08.040
And this is something that
happens more generally in

00:10:08.040 --> 00:10:09.710
classical statistics.

00:10:09.710 --> 00:10:15.050
Typically problems do not admit
a single best approach.

00:10:15.050 --> 00:10:17.790
They do not admit
unique answers.

00:10:17.790 --> 00:10:21.470
Reasonable people can come up
with different methodologies

00:10:21.470 --> 00:10:23.920
for approaching the
same problem.

00:10:23.920 --> 00:10:27.060
And there is a little bit
of an element of

00:10:27.060 --> 00:10:29.880
art involved here.

00:10:29.880 --> 00:10:33.230
In general, one wants to come
up with reasonable methods

00:10:33.230 --> 00:10:36.120
that will have good
properties.

00:10:36.120 --> 00:10:39.270
And we will see some examples
of what this may mean.

00:10:39.270 --> 00:10:41.870
But again, I'm emphasizing
that there is

00:10:41.870 --> 00:10:45.140
no single best method.

00:10:45.140 --> 00:10:49.260
So whereas the Bayes rule is a
completely unambiguous way of

00:10:49.260 --> 00:10:52.720
making inferences, here, in
the context of classical

00:10:52.720 --> 00:10:57.020
statistics, there will be some
freedom as to what approaches

00:10:57.020 --> 00:10:58.270
one might take.