WEBVTT
00:00:00.080 --> 00:00:02.500
The following content is
provided under a Creative
00:00:02.500 --> 00:00:04.019
Commons license.
00:00:04.019 --> 00:00:06.360
Your support will help
MIT OpenCourseWare
00:00:06.360 --> 00:00:10.730
continue to offer high quality
educational resources for free.
00:00:10.730 --> 00:00:13.340
To make a donation or
view additional materials
00:00:13.340 --> 00:00:17.217
from hundreds of MIT courses,
visit MIT OpenCourseWare
00:00:17.217 --> 00:00:17.842
at ocw.mit.edu.
00:00:22.190 --> 00:00:23.010
PROFESSOR: OK.
00:00:23.010 --> 00:00:25.530
Well, last time I
was lecturing, we
00:00:25.530 --> 00:00:29.380
were talking about
regression analysis.
00:00:29.380 --> 00:00:31.870
And we finished up talking
about estimation methods
00:00:31.870 --> 00:00:34.730
for fitting regression models.
00:00:34.730 --> 00:00:38.670
I want to recap the method
of maximum likelihood,
00:00:38.670 --> 00:00:42.010
because this is really
the primary estimation
00:00:42.010 --> 00:00:45.070
method in statistical
modeling that you start with.
00:00:45.070 --> 00:00:49.946
And so let me just
review where we were.
00:00:49.946 --> 00:00:53.060
We have a normal linear
regression model.
00:00:53.060 --> 00:00:55.100
A dependent variable
y is explained
00:00:55.100 --> 00:00:58.940
by a linear combination
of independent variables
00:00:58.940 --> 00:01:00.710
given by a regression
parameter beta.
00:01:00.710 --> 00:01:03.800
And we assume that there are
errors about all the cases
00:01:03.800 --> 00:01:05.710
which are independent
identically distributed
00:01:05.710 --> 00:01:07.440
normal random variables.
00:01:07.440 --> 00:01:12.120
So because of that relationship,
the dependent variable vector
00:01:12.120 --> 00:01:15.630
y, which is an
n-vector, for n cases,
00:01:15.630 --> 00:01:18.730
is a multivariate
normal random variable.
00:01:18.730 --> 00:01:26.490
Now, the likelihood function is
equal to the density function
00:01:26.490 --> 00:01:28.280
for the data.
00:01:28.280 --> 00:01:32.400
And there's some
ambiguity really
00:01:32.400 --> 00:01:36.000
about how one manipulates
the likelihood function.
00:01:36.000 --> 00:01:38.780
The likelihood function
becomes defined once we've
00:01:38.780 --> 00:01:41.030
observed a sample of data.
00:01:41.030 --> 00:01:45.390
So in this expression for
the likelihood function
00:01:45.390 --> 00:01:47.330
as a function of beta
and sigma squared,
00:01:47.330 --> 00:01:50.800
we're considering evaluating
the probability density
00:01:50.800 --> 00:01:53.830
function for the
data conditional
00:01:53.830 --> 00:01:57.040
on the unknown parameters.
00:01:57.040 --> 00:02:02.540
So if this were simply a
univariate normal distribution
00:02:02.540 --> 00:02:05.160
with some unknown mean
and variance, then
00:02:05.160 --> 00:02:10.880
what we would have is
just a bell curve for mu
00:02:10.880 --> 00:02:13.880
centered around a
single observation y,
00:02:13.880 --> 00:02:15.550
if you look at the
likelihood function
00:02:15.550 --> 00:02:19.150
and how it varies with
the underlying mean
00:02:19.150 --> 00:02:22.950
of the normal distribution.
00:02:22.950 --> 00:02:28.180
So this likelihood
function is-- well,
00:02:28.180 --> 00:02:30.540
the challenge really
in maximum likelihood estimation
00:02:30.540 --> 00:02:34.840
is really calculating
and computing
00:02:34.840 --> 00:02:36.790
the likelihood function.
00:02:36.790 --> 00:02:39.050
And with normal linear
regression models,
00:02:39.050 --> 00:02:40.440
it's very easy.
00:02:40.440 --> 00:02:42.910
Now, the maximum
likelihood estimates
00:02:42.910 --> 00:02:47.490
are those values that
maximize this function.
00:02:47.490 --> 00:02:51.890
And the question is, why
are those good estimates
00:02:51.890 --> 00:02:54.840
of the underlying parameters?
00:02:54.840 --> 00:02:57.760
Well, what those
estimates do is they
00:02:57.760 --> 00:03:03.150
are the parameter values for
which the observed data is
00:03:03.150 --> 00:03:05.030
most likely.
00:03:05.030 --> 00:03:09.170
So we're able to scale
the unknown parameters
00:03:09.170 --> 00:03:14.020
by how likely those parameters
could have generated these data
00:03:14.020 --> 00:03:15.500
values.
00:03:15.500 --> 00:03:19.560
So let's look at the
likelihood function
00:03:19.560 --> 00:03:23.360
for this normal linear
regression model.
00:03:23.360 --> 00:03:28.520
These first two lines here are
highlighting-- the first line
00:03:28.520 --> 00:03:32.470
is highlighting that
our response variable
00:03:32.470 --> 00:03:35.310
values are independent.
00:03:35.310 --> 00:03:36.770
They're conditionally
independent
00:03:36.770 --> 00:03:38.720
given the unknown parameters.
00:03:38.720 --> 00:03:43.180
And so the density of the
full vector of y's is simply
00:03:43.180 --> 00:03:48.290
the product of the density
functions for those components.
00:03:48.290 --> 00:03:52.410
And because this is a normal
linear regression model,
00:03:52.410 --> 00:03:55.350
each of the y_i's is
normally distributed.
00:03:55.350 --> 00:03:57.140
So what's in there
is simply the density
00:03:57.140 --> 00:04:01.330
function of a normal random
variable with mean given
00:04:01.330 --> 00:04:06.960
by the beta-weighted sum of independent
variables for each i,
00:04:06.960 --> 00:04:10.300
case i, given by the
regression parameters.
00:04:10.300 --> 00:04:18.320
And that expression
basically can be expressed
00:04:18.320 --> 00:04:21.630
in matrix form this way.
00:04:21.630 --> 00:04:28.910
And what we have is
the likelihood function
00:04:28.910 --> 00:04:33.160
ends up being a function
of our Q of beta, which
00:04:33.160 --> 00:04:35.610
was our least squares criterion.
00:04:35.610 --> 00:04:39.120
So the least squares
estimation is
00:04:39.120 --> 00:04:42.930
equivalent to maximum likelihood
estimation for the regression
00:04:42.930 --> 00:04:48.510
parameters if we have a normal
linear regression model.
00:04:48.510 --> 00:04:52.200
And there's this
extra term, minus n.
00:04:52.200 --> 00:04:54.820
Well, actually, if we're going
to maximize the likelihood
00:04:54.820 --> 00:04:57.220
function, we can also maximize
the log of the likelihood
00:04:57.220 --> 00:05:00.010
function, because that's
just a monotone function
00:05:00.010 --> 00:05:01.860
of the likelihood.
00:05:01.860 --> 00:05:04.570
And it's easier to maximize the
log of the likelihood function
00:05:04.570 --> 00:05:06.430
which is expressed here.
00:05:06.430 --> 00:05:11.460
And so we're able to
maximize over beta
00:05:11.460 --> 00:05:14.230
by minimizing Q of beta.
00:05:14.230 --> 00:05:18.280
And then we can maximize
over sigma squared
00:05:18.280 --> 00:05:21.800
given our estimate for beta.
00:05:21.800 --> 00:05:25.120
And that's achieved by
taking the derivative
00:05:25.120 --> 00:05:31.170
of the log-likelihood with
respect to sigma squared.
00:05:31.170 --> 00:05:33.150
So we basically have this
first order condition
00:05:33.150 --> 00:05:35.320
that finds the
maximum because things
00:05:35.320 --> 00:05:39.830
are appropriately convex.
00:05:39.830 --> 00:05:45.200
And taking that derivative
and solving for zero,
00:05:45.200 --> 00:05:47.450
we basically get
this expression.
00:05:47.450 --> 00:05:50.380
So this is just
taking the derivative
00:05:50.380 --> 00:05:54.350
of the log-likelihood with
respect to sigma squared.
00:05:54.350 --> 00:05:55.857
And you'll notice
here I'm taking
00:05:55.857 --> 00:05:57.690
the derivative with
respect to sigma squared
00:05:57.690 --> 00:05:59.050
as a parameter, not sigma.
00:06:01.870 --> 00:06:05.380
And that gives us that
the maximum likelihood
00:06:05.380 --> 00:06:10.700
estimate of the error variance
is Q of beta hat over n.
00:06:10.700 --> 00:06:17.090
So this is the sum of the
squared residuals divided by n.
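For reference, the derivation just described can be written out as follows. This is a sketch in the lecture's notation, assuming Q(beta) is the sum of squared residuals (y - X beta)'(y - X beta); the layout is an editorial reconstruction and not a copy of the slide.

```latex
\begin{align*}
L(\beta,\sigma^2)
  &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\!\left(-\frac{(y_i - x_i^{T}\beta)^2}{2\sigma^2}\right)
   = (2\pi\sigma^2)^{-n/2}\exp\!\left(-\frac{Q(\beta)}{2\sigma^2}\right),\\
\log L(\beta,\sigma^2)
  &= -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{Q(\beta)}{2\sigma^2},\\
\frac{\partial \log L}{\partial \sigma^2}\bigg|_{\beta=\hat\beta}
  &= -\frac{n}{2\sigma^2} + \frac{Q(\hat\beta)}{2\sigma^4} = 0
  \quad\Longrightarrow\quad
  \hat\sigma^2_{\mathrm{MLE}} = \frac{Q(\hat\beta)}{n}.
\end{align*}
```

Only the last term of the log-likelihood depends on beta, which is why maximizing over beta is the same as minimizing Q(beta).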
00:06:17.090 --> 00:06:20.940
Now, I emphasize here
that that's biased.
00:06:20.940 --> 00:06:24.612
Who can tell me
why that's biased
00:06:24.612 --> 00:06:25.820
or why it ought to be biased?
00:06:30.554 --> 00:06:31.526
AUDIENCE: [INAUDIBLE].
00:06:35.420 --> 00:06:36.350
PROFESSOR: OK.
00:06:36.350 --> 00:06:42.530
Well, it should be n
minus 1 if we're actually
00:06:42.530 --> 00:06:44.660
estimating one parameter.
00:06:44.660 --> 00:06:54.050
So if the independent variables
were, say, a constant, 1,
00:06:54.050 --> 00:06:57.160
so we're just estimating a
sample from a normal with mean
00:06:57.160 --> 00:07:03.030
beta 1 corresponding to
the column of ones in X,
00:07:03.030 --> 00:07:11.410
then we would have a one
degree of freedom correction
00:07:11.410 --> 00:07:14.120
to the residuals to get
an unbiased estimator.
00:07:14.120 --> 00:07:17.150
But what if we
have p parameters?
00:07:17.150 --> 00:07:18.370
Well, let me ask you this.
00:07:18.370 --> 00:07:23.280
What if we had n parameters
in our regression model?
00:07:23.280 --> 00:07:28.130
What would happen if
we had a full rank n
00:07:28.130 --> 00:07:30.760
independent variable matrix
and n independent observations?
00:07:34.062 --> 00:07:35.690
AUDIENCE: [INAUDIBLE].
00:07:35.690 --> 00:07:38.410
PROFESSOR: Yes, you'd have
an exact fit to the data.
00:07:38.410 --> 00:07:43.560
So this estimate would be 0.
00:07:43.560 --> 00:07:47.500
And so clearly, if
the data do arise
00:07:47.500 --> 00:07:52.059
from a normal linear regression
model, 0 is not unbiased.
00:07:52.059 --> 00:07:53.600
And you need to have
some correction.
00:07:53.600 --> 00:07:58.220
Turns out you need
to divide by n
00:07:58.220 --> 00:08:01.980
minus the rank of the X
matrix, the degrees of freedom
00:08:01.980 --> 00:08:05.630
in the model, to get
an unbiased estimate.
00:08:05.630 --> 00:08:08.610
So this is an important
issue, highlights
00:08:08.610 --> 00:08:11.880
how the more parameters you add
in the model, the more precise
00:08:11.880 --> 00:08:13.760
your fitted values are.
00:08:13.760 --> 00:08:15.840
In a sense, there's
dangers of curve fitting
00:08:15.840 --> 00:08:18.370
which you want to avoid.
00:08:18.370 --> 00:08:25.070
But the maximum likelihood
estimates, in fact, are biased.
00:08:25.070 --> 00:08:27.482
You just have to
be aware of that.
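The bias can be checked numerically. The following is a minimal Python sketch, not taken from the lecture: it assumes a simple linear regression (p = 2) with made-up coefficients, noise variance, and sample size, and averages Q(beta hat)/n and Q(beta hat)/(n - p) over many simulated data sets. The function names `fit_line` and `simulate` are illustrative.

```python
import random


def fit_line(x, y):
    """Least squares fit of y = b0 + b1*x; returns (b0, b1, Q),
    where Q is the residual sum of squares."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    q = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return b0, b1, q


def simulate(n=10, sigma2=4.0, reps=20000, seed=0):
    """Average the two variance estimates over repeated samples.
    All numeric settings here are assumptions chosen for illustration."""
    rng = random.Random(seed)
    x = [float(i) for i in range(n)]
    p = 2  # intercept and slope
    mle_sum = 0.0
    unbiased_sum = 0.0
    for _ in range(reps):
        y = [1.0 + 0.5 * xi + rng.gauss(0.0, sigma2 ** 0.5) for xi in x]
        _, _, q = fit_line(x, y)
        mle_sum += q / n            # maximum likelihood estimate
        unbiased_sum += q / (n - p)  # degrees-of-freedom corrected estimate
    return mle_sum / reps, unbiased_sum / reps


mle_mean, unbiased_mean = simulate()
```

Under these assumptions the first average should settle near sigma squared times (n - p)/n, and the second near sigma squared itself.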
00:08:27.482 --> 00:08:29.190
And when you're using
different software,
00:08:29.190 --> 00:08:30.170
fitting different
models, you need
00:08:30.170 --> 00:08:32.450
to know whether
various corrections are being
00:08:32.450 --> 00:08:33.654
made for bias or not.
00:08:38.370 --> 00:08:41.679
So this solves the
estimation problem
00:08:41.679 --> 00:08:44.790
for normal linear
regression models.
00:08:44.790 --> 00:08:48.310
And when we have normal
linear regression
00:08:48.310 --> 00:08:50.470
models, the theorem we
went through last time--
00:08:50.470 --> 00:08:51.428
this is very important.
00:08:51.428 --> 00:08:54.590
Let me just go back and
highlight that for you.
00:09:02.430 --> 00:09:05.370
This theorem right here.
00:09:05.370 --> 00:09:10.010
This is really a very
important theorem
00:09:10.010 --> 00:09:13.330
indicating what is the
distribution of the least
00:09:13.330 --> 00:09:15.800
squares, now the maximum
likelihood estimates
00:09:15.800 --> 00:09:17.670
of our regression model?
00:09:17.670 --> 00:09:20.750
They are normally distributed.
00:09:20.750 --> 00:09:25.570
And the residuals, sum
of squares, have a chi
00:09:25.570 --> 00:09:28.140
squared distribution
with degrees of freedom
00:09:28.140 --> 00:09:29.910
given by n minus p.
00:09:29.910 --> 00:09:34.770
And we can look at how
much signal to noise
00:09:34.770 --> 00:09:36.490
there is in estimating
our regression
00:09:36.490 --> 00:09:40.590
parameters by calculating a t
statistic, which is take away
00:09:40.590 --> 00:09:45.400
from an estimate its
expected value, its mean,
00:09:45.400 --> 00:09:48.330
and divide through by an
estimate of the variability
00:09:48.330 --> 00:09:50.421
in standard deviation units.
00:09:50.421 --> 00:09:51.920
And that will have
a t distribution.
00:09:51.920 --> 00:09:56.800
So that's a critical
way to assess
00:09:56.800 --> 00:09:59.200
the relevance of different
explanatory variables
00:09:59.200 --> 00:10:00.690
in our model.
00:10:00.690 --> 00:10:06.060
And this approach will apply
with maximum likelihood
00:10:06.060 --> 00:10:08.010
estimation in all
kinds of models
00:10:08.010 --> 00:10:10.510
apart from normal linear
regression models.
00:10:10.510 --> 00:10:13.970
It turns out maximum
likelihood estimates generally
00:10:13.970 --> 00:10:17.880
are asymptotically
normally distributed.
00:10:17.880 --> 00:10:21.630
And so these properties here
will apply for those models
00:10:21.630 --> 00:10:23.020
as well.
00:10:23.020 --> 00:10:27.470
So let's finish up these
notes on estimation
00:10:27.470 --> 00:10:32.590
by talking about
generalized M estimation.
00:10:32.590 --> 00:10:39.020
So what we want to consider is
estimating unknown parameters
00:10:39.020 --> 00:10:44.630
by minimizing some
function, Q of beta,
00:10:44.630 --> 00:10:49.890
which is a sum of evaluations
of another function h,
00:10:49.890 --> 00:10:53.180
evaluated for each of
the individual cases.
00:10:53.180 --> 00:10:59.980
And choosing h to take on
different functional forms
00:10:59.980 --> 00:11:03.120
will define different
kinds of estimators.
00:11:03.120 --> 00:11:08.440
We've seen how when h
is simply the square
00:11:08.440 --> 00:11:13.880
of the case minus its
regression prediction,
00:11:13.880 --> 00:11:18.980
that leads to least squares,
and in fact, maximum likelihood
00:11:18.980 --> 00:11:23.830
estimation, as we saw before.
00:11:23.830 --> 00:11:27.340
Rather than taking the
square of the residual,
00:11:27.340 --> 00:11:29.540
the fitted residual,
we could take simply
00:11:29.540 --> 00:11:33.510
the modulus of that.
00:11:33.510 --> 00:11:36.930
And so that would be the
mean absolute deviation.
00:11:36.930 --> 00:11:39.040
So rather than summing
the squared deviations
00:11:39.040 --> 00:11:42.310
from the mean, we could
sum the absolute deviations
00:11:42.310 --> 00:11:43.780
from the mean.
00:11:43.780 --> 00:11:46.710
Now, from a
mathematical standpoint,
00:11:46.710 --> 00:11:50.530
if we want to solve
for those estimates,
00:11:50.530 --> 00:11:52.450
how would you go
about doing that?
00:11:55.160 --> 00:12:01.950
What methodology would you
use to minimize this function?
00:12:01.950 --> 00:12:04.380
Well, we try and apply
basically the same principles
00:12:04.380 --> 00:12:09.690
of if this is a
convex function, then
00:12:09.690 --> 00:12:12.860
we just want to take derivatives
of that and solve for that
00:12:12.860 --> 00:12:14.110
being equal to 0.
00:12:14.110 --> 00:12:17.080
So what happens when
you take the derivative
00:12:17.080 --> 00:12:21.110
of the modulus of y minus xi
beta with respect to beta?
00:12:24.749 --> 00:12:27.620
AUDIENCE: [INAUDIBLE].
00:12:27.620 --> 00:12:30.780
PROFESSOR: What did you say?
00:12:30.780 --> 00:12:32.890
What did you say?
00:12:32.890 --> 00:12:36.783
AUDIENCE: Yeah, it's
not [INAUDIBLE].
00:12:36.783 --> 00:12:38.908
The first [INAUDIBLE]
derivative is not continuous.
00:12:45.460 --> 00:12:46.610
PROFESSOR: OK.
00:12:46.610 --> 00:12:50.940
Well, this is not
a smooth function.
00:12:50.940 --> 00:13:06.290
But let me just plot x_i beta
here, and y_i minus that.
00:13:06.290 --> 00:13:15.060
Basically, this is going
to be a function that
00:13:15.060 --> 00:13:19.230
has slope 1 when it's positive
and slope minus 1 when
00:13:19.230 --> 00:13:20.450
it's negative.
00:13:20.450 --> 00:13:26.260
And so that will be true,
component-wise, or for the y.
00:13:26.260 --> 00:13:28.850
So what we end up
wanting to do is
00:13:28.850 --> 00:13:31.000
find the value of the
regression estimate
00:13:31.000 --> 00:13:36.680
that minimizes the
sum of the absolute residuals
00:13:36.680 --> 00:13:40.670
for cases that are below the fit plus
the sum for the cases that
00:13:40.670 --> 00:13:43.240
are above the fit given
by the regression line.
00:13:43.240 --> 00:13:45.580
And that solves the problem.
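To make the slope argument concrete, here is a minimal Python sketch for the simplest case, a constant-only model, where the absolute deviation criterion is minimized by the sample median. The data values are invented, and the helper names are illustrative.

```python
import statistics


def sum_abs_dev(b, ys):
    """Sum of absolute deviations of the observations from a constant fit b."""
    return sum(abs(y - b) for y in ys)


def sign_sum(b, ys):
    """Number of observations above b minus the number below b.
    The derivative of sum_abs_dev with respect to b is minus this value
    wherever it exists, so a zero here marks the minimum."""
    return sum((y > b) - (y < b) for y in ys)


ys = [1.0, 2.0, 3.0, 4.0, 100.0]  # one large outlier
med = statistics.median(ys)
avg = statistics.mean(ys)

criterion_at_median = sum_abs_dev(med, ys)
criterion_at_mean = sum_abs_dev(avg, ys)
balance_at_median = sign_sum(med, ys)
```

The median balances the count of cases above and below the fit, so the outlier pulls the least squares answer (the mean) far more than it pulls the absolute deviation answer.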
00:13:45.580 --> 00:13:50.960
Now, with the maximum
likelihood estimation,
00:13:50.960 --> 00:13:55.840
one can plug in minus log the
density of y_i given beta, x_i
00:13:55.840 --> 00:13:57.730
and sigma squared.
00:13:57.730 --> 00:14:04.400
And that function simply sums
to minus the log of the joint density
00:14:04.400 --> 00:14:05.510
for all the data.
00:14:05.510 --> 00:14:08.530
So that works as well.
00:14:08.530 --> 00:14:13.520
With robust M estimators, we can
consider another function chi
00:14:13.520 --> 00:14:18.210
which can be defined to have
good properties with estimates.
00:14:18.210 --> 00:14:21.065
And there's a whole theory
of robust estimation--
00:14:21.065 --> 00:14:23.830
it's very rich-- which
talks about how best
00:14:23.830 --> 00:14:27.400
to specify this chi function.
00:14:27.400 --> 00:14:33.130
Now, one of the problems
with least squares estimation
00:14:33.130 --> 00:14:37.400
is that the squares
of very large values
00:14:37.400 --> 00:14:40.210
are very, very
large in magnitude.
00:14:40.210 --> 00:14:42.740
So there's perhaps
an undue influence
00:14:42.740 --> 00:14:47.650
of very large values, very large
residuals under least squares
00:14:47.650 --> 00:14:49.680
estimation and maximum
likelihood estimation.
00:14:49.680 --> 00:14:53.600
So robust estimators
allow you to control that
00:14:53.600 --> 00:14:57.770
by defining the
function differently.
00:14:57.770 --> 00:15:00.830
Finally, there are
quantile estimators,
00:15:00.830 --> 00:15:07.410
which extend the mean
absolute deviation criterion.
00:15:07.410 --> 00:15:11.220
And so if we consider
the h function
00:15:11.220 --> 00:15:16.270
to be basically a
multiple of the deviation
00:15:16.270 --> 00:15:23.460
if the residual is positive
and a different multiple,
00:15:23.460 --> 00:15:26.810
a complementary multiple if
the deviation, the residual,
00:15:26.810 --> 00:15:30.910
is less than 0,
then by varying tau,
00:15:30.910 --> 00:15:35.230
you end up getting
quantile estimators, where
00:15:35.230 --> 00:15:38.921
the minimizer of the criterion is
the estimate of the tau
00:15:38.921 --> 00:15:39.420
quantile.
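A minimal Python sketch of that h function, again for a constant-only model so the result can be read off directly. It assumes the usual form: tau times the residual when the residual is non-negative, and (1 - tau) times its absolute value otherwise. The data and function names are invented for illustration.

```python
def check_loss(r, tau):
    """tau * r for r >= 0 and (tau - 1) * r for r < 0; both branches are non-negative."""
    return tau * r if r >= 0 else (tau - 1.0) * r


def quantile_fit(ys, tau):
    """Constant b minimizing the summed check loss.
    The objective is piecewise linear and convex with kinks at the data,
    so it is enough to search over the observed values."""
    return min(ys, key=lambda b: sum(check_loss(y - b, tau) for y in ys))


ys = [float(k) for k in range(1, 12)]  # the values 1.0 through 11.0
q50 = quantile_fit(ys, 0.5)
q90 = quantile_fit(ys, 0.9)
```

With tau equal to 0.5 the two multiples are the same and the criterion reduces to the absolute deviation case; moving tau toward 1 pushes the fitted value toward the upper end of the data.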
00:15:47.510 --> 00:15:51.240
So this general
class of M estimators
00:15:51.240 --> 00:15:54.730
encompasses most
estimators that we will
00:15:54.730 --> 00:15:59.020
encounter in fitting models.
00:15:59.020 --> 00:16:03.130
So that finishes the technical
or the mathematical discussion
00:16:03.130 --> 00:16:05.190
of regression analysis.
00:16:05.190 --> 00:16:31.070
Let me highlight for you--
there's a case study that I
00:16:31.070 --> 00:16:34.410
dragged to the desktop here.
00:16:34.410 --> 00:16:37.532
And I wanted to find that.
00:16:37.532 --> 00:16:38.240
Let me find that.
00:16:46.970 --> 00:16:54.300
There's a case study that's been
added to the course website.
00:16:54.300 --> 00:16:58.840
And this first one is on
linear regression models
00:16:58.840 --> 00:17:00.370
for asset pricing.
00:17:00.370 --> 00:17:03.430
And I want you to
read through that just
00:17:03.430 --> 00:17:08.099
to see how it applies to
fitting various simple linear
00:17:08.099 --> 00:17:09.650
regression models.
00:17:09.650 --> 00:17:12.985
And enter full screen.
00:17:17.900 --> 00:17:21.650
This case study begins by
introducing the capital asset
00:17:21.650 --> 00:17:24.670
pricing model, which
basically suggests
00:17:24.670 --> 00:17:28.190
that if you look at the
returns on any stocks
00:17:28.190 --> 00:17:30.720
in an efficient
market, then those
00:17:30.720 --> 00:17:36.830
should depend on the return
of the overall market
00:17:36.830 --> 00:17:40.040
but scaled by how
risky the stock is.
00:17:40.040 --> 00:17:45.170
And so if one looks
at basically what
00:17:45.170 --> 00:17:47.929
the return is on the
stock on the right scale,
00:17:47.929 --> 00:17:49.970
you should have a simple
linear regression model.
00:17:49.970 --> 00:17:54.110
So here, we just look at
a time series for GE stock
00:17:54.110 --> 00:17:55.972
in the S&P 500.
00:17:55.972 --> 00:17:58.180
And the case study guides you
through how you can actually
00:17:58.180 --> 00:18:01.790
collect this data
on the web using R.
00:18:01.790 --> 00:18:06.845
And so the case notes
provide those details.
00:18:09.350 --> 00:18:11.930
There's also the
three-month treasury rate
00:18:11.930 --> 00:18:13.660
which is collected.
00:18:13.660 --> 00:18:16.190
And so if you're
thinking about return
00:18:16.190 --> 00:18:19.540
on the stock versus return
on the index, well, what's
00:18:19.540 --> 00:18:24.940
really of interest is the excess
return over a risk-free rate.
00:18:24.940 --> 00:18:27.450
And in the efficient
markets model,
00:18:27.450 --> 00:18:31.390
basically the excess
return of a stock
00:18:31.390 --> 00:18:34.330
is related to the excess
return of the market as
00:18:34.330 --> 00:18:37.250
given by a linear
regression model.
00:18:37.250 --> 00:18:39.310
So we can fit this model.
00:18:39.310 --> 00:18:46.360
And here's a plot of the excess
returns on a daily basis for GE
00:18:46.360 --> 00:18:47.640
stock versus the market.
00:18:47.640 --> 00:18:52.444
So that looks like a
nice sort of point cloud
00:18:52.444 --> 00:18:54.110
for which a linear
model might fit well.
00:18:54.110 --> 00:18:54.800
And it does.
00:18:59.400 --> 00:19:01.170
Well, there are
regression diagnostics,
00:19:01.170 --> 00:19:05.300
which I'll get to-- well, there
are regression diagnostics
00:19:05.300 --> 00:19:09.110
which are detailed in the
problem set, where we're
00:19:09.110 --> 00:19:12.420
looking at how influential are
individual observations, what's
00:19:12.420 --> 00:19:14.160
their impact on
regression parameters.
00:19:16.680 --> 00:19:20.160
This display here
basically highlights
00:19:20.160 --> 00:19:21.790
with a very simple
linear regression
00:19:21.790 --> 00:19:25.770
model what are the
influential data points.
00:19:25.770 --> 00:19:28.560
And so I've highlighted
in red those values
00:19:28.560 --> 00:19:30.640
which are influential.
00:19:30.640 --> 00:19:34.060
Now, if you look at the
definition of leverage
00:19:34.060 --> 00:19:36.390
in a linear model,
it's very simple.
00:19:36.390 --> 00:19:39.130
In a simple linear model,
those observations that
00:19:39.130 --> 00:19:42.200
are very far from the
mean have large leverage.
00:19:42.200 --> 00:19:46.060
And so you can confirm
that with your answers
00:19:46.060 --> 00:19:48.470
to the problem set.
00:19:48.470 --> 00:19:52.710
This x indicates a
significantly influential point
00:19:52.710 --> 00:19:55.720
in terms of the
regression parameters
00:19:55.720 --> 00:19:57.090
given by Cook's distance.
00:19:57.090 --> 00:19:59.956
And that definition is also
given in the case notes.
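The case notes are not reproduced in this transcript, so the following Python sketch uses the standard textbook formulas as an assumption: for a simple linear regression, leverage is 1/n plus the squared distance of x from its mean divided by the total squared distance, and Cook's distance combines the squared residual with the leverage. The data and the function name are invented.

```python
def leverage_and_cooks(x, y):
    """Leverage h_i and Cook's distance D_i for the fit y = b0 + b1*x.
    Uses h_i = 1/n + (x_i - xbar)^2 / Sxx and
    D_i = e_i^2 * h_i / (p * s^2 * (1 - h_i)^2) with p = 2."""
    n = len(x)
    p = 2  # intercept and slope
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - p)
    lev = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]
    cooks = [e * e * h / (p * s2 * (1.0 - h) ** 2) for e, h in zip(resid, lev)]
    return lev, cooks


# Hypothetical data: the last x value is far from the others.
x = [1.0, 2.0, 3.0, 4.0, 20.0]
y = [1.1, 1.9, 3.2, 3.9, 12.0]
lev, cooks = leverage_and_cooks(x, y)
most_leveraged = lev.index(max(lev))
```

The leverages sum to the number of parameters, and the case farthest from the mean of x gets the largest one, which is the pattern highlighted in the plot.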
00:19:59.956 --> 00:20:00.908
AUDIENCE: [INAUDIBLE].
00:20:04.240 --> 00:20:06.630
PROFESSOR: By computing
the individual
00:20:06.630 --> 00:20:09.930
leverages with a function
that's given here,
00:20:09.930 --> 00:20:13.385
and by selecting out those
that exceed a given magnitude.
00:20:17.870 --> 00:20:20.530
Now, this is a very,
very simple model
00:20:20.530 --> 00:20:23.190
of stocks depending
on one unknown factor,
00:20:23.190 --> 00:20:26.110
the risk factor given by the market.
00:20:26.110 --> 00:20:29.730
In modeling equity
returns, there
00:20:29.730 --> 00:20:33.680
are many different factors that
can have an impact on returns.
00:20:33.680 --> 00:20:36.890
So what I've done
in the case study
00:20:36.890 --> 00:20:48.660
is to look at adding
another factor which is just
00:20:48.660 --> 00:20:51.590
the return on crude oil.
00:20:51.590 --> 00:20:55.210
And so-- I need to go down here.
00:21:04.090 --> 00:21:10.260
So let me highlight
something for you here.
00:21:10.260 --> 00:21:15.220
With GE stock, what would you
expect the impact of, say,
00:21:15.220 --> 00:21:19.260
a high return on crude oil to
be on the return of GE stock?
00:21:19.260 --> 00:21:21.500
Would you expect it to
be positively related
00:21:21.500 --> 00:21:22.730
or negatively related?
00:21:30.910 --> 00:21:31.410
OK.
00:21:34.510 --> 00:21:39.610
Well, GE is a stock that's
just a broad stock invested
00:21:39.610 --> 00:21:41.820
in many different industries.
00:21:41.820 --> 00:21:45.390
And it really reflects the
overall market, to some extent.
00:21:45.390 --> 00:21:48.710
Many years ago,
10, 15 years ago,
00:21:48.710 --> 00:21:51.960
GE represented maybe 3% of
the GNP of the US market.
00:21:51.960 --> 00:21:55.510
So it was really highly related
to how well the market does.
00:21:55.510 --> 00:21:59.700
Now, crude oil is a commodity.
00:21:59.700 --> 00:22:07.010
And oil is used to drive cars,
to fuel energy production.
00:22:07.010 --> 00:22:10.510
So if you have an
increase in oil prices,
00:22:10.510 --> 00:22:13.770
then the cost of essentially
doing business goes up.
00:22:13.770 --> 00:22:18.870
So it is associated with
an inflation factor.
00:22:18.870 --> 00:22:20.380
Prices are rising.
00:22:20.380 --> 00:22:25.730
So if you can see here,
the regression estimate,
00:22:25.730 --> 00:22:29.830
if we add in a factor of
the return on crude oil,
00:22:29.830 --> 00:22:32.120
it's negative 0.03.
00:22:32.120 --> 00:22:36.740
And it has a t value
of minus 3.561.
00:22:36.740 --> 00:22:41.330
So in fact, the market, in
a sense, over this period,
00:22:41.330 --> 00:22:44.600
for this analysis, was not
efficient in explaining
00:22:44.600 --> 00:22:49.730
the return on GE; crude oil
is another independent factor
00:22:49.730 --> 00:22:52.260
that helps explain returns.
00:22:52.260 --> 00:22:55.850
So that's useful to know.
00:22:55.850 --> 00:23:01.430
And if you are clever about
defining and identifying
00:23:01.430 --> 00:23:03.590
and evaluating
different factors,
00:23:03.590 --> 00:23:07.550
you can build
factor asset pricing
00:23:07.550 --> 00:23:11.430
models that are
very, very useful
00:23:11.430 --> 00:23:13.390
for investing and trading.
00:23:13.390 --> 00:23:18.710
Now, as a comparison
to this case study,
00:23:18.710 --> 00:23:26.040
I also applied the same
analysis to Exxon Mobil.
00:23:26.040 --> 00:23:30.330
Now, Exxon Mobil
is an oil company.
00:23:30.330 --> 00:23:35.530
So let me highlight this here.
00:23:35.530 --> 00:23:37.570
We basically are
fitting this model.
00:23:37.570 --> 00:23:39.050
Now let's highlight it.
00:23:43.150 --> 00:23:48.960
Here, if we consider
this two-factor model,
00:23:48.960 --> 00:23:50.650
the regression
parameter corresponding
00:23:50.650 --> 00:23:57.840
to the crude oil factor is
plus 0.13 with a t value of 16.
00:23:57.840 --> 00:24:01.750
So crude oil definitely
has an impact
00:24:01.750 --> 00:24:06.370
on the return of Exxon Mobil,
because it goes up and down
00:24:06.370 --> 00:24:07.065
with oil prices.
00:24:16.300 --> 00:24:19.550
This case study closes
with a scatter plot
00:24:19.550 --> 00:24:22.950
of the independent variables
and highlighting where
00:24:22.950 --> 00:24:25.740
the influential values are.
00:24:25.740 --> 00:24:28.650
And so just in the same way that
with a simple linear regression
00:24:28.650 --> 00:24:32.430
it was those that were far
away from the mean of the data
00:24:32.430 --> 00:24:35.920
were influential, in a
multivariate setting-- here,
00:24:35.920 --> 00:24:38.450
it's bivariate-- the
influential observations
00:24:38.450 --> 00:24:41.240
are those that are very
far away from the centroid.
00:24:41.240 --> 00:24:43.931
And if you look at one of the
problems in the problem set,
00:24:43.931 --> 00:24:45.430
it actually goes
through and you can
00:24:45.430 --> 00:24:48.930
see where these
leverage values are
00:24:48.930 --> 00:24:53.580
and how it indicates influences
associated with the Mahalanobis
00:24:53.580 --> 00:24:56.660
distance of cases
from the centroid
00:24:56.660 --> 00:24:58.820
of the independent variables.
00:24:58.820 --> 00:25:02.010
So if you're a visual
type mathematician as
00:25:02.010 --> 00:25:04.850
opposed to an algebraic
type mathematician,
00:25:04.850 --> 00:25:06.390
I think these
kinds of graphs are
00:25:06.390 --> 00:25:10.970
very helpful in understanding
what is really going on.
00:25:10.970 --> 00:25:16.180
And the degree of influence
is associated with the fact
00:25:16.180 --> 00:25:21.380
that we're basically taking
least squares estimates,
00:25:21.380 --> 00:25:23.560
so we have the quadratic
form associated
00:25:23.560 --> 00:25:24.790
with the overall process.
00:25:28.800 --> 00:25:33.950
There's another
case study that I'll
00:25:33.950 --> 00:25:40.054
be happy to discuss after
class or during office hours.
00:25:40.054 --> 00:25:42.220
I don't think we have time
today during the lecture.
00:25:42.220 --> 00:25:45.650
But it concerns
exchange rate regimes.
00:25:45.650 --> 00:25:51.310
And the second case study
looks at the Chinese yuan,
00:25:51.310 --> 00:25:55.960
which was basically pegged
to the dollar for many years.
00:25:55.960 --> 00:26:00.190
And then I guess through
political influence
00:26:00.190 --> 00:26:02.710
from other countries,
they started
00:26:02.710 --> 00:26:06.172
to let the yuan vary
from the dollar,
00:26:06.172 --> 00:26:08.560
but perhaps pegged
it to some basket
00:26:08.560 --> 00:26:10.690
of securities-- of currencies.
00:26:10.690 --> 00:26:13.540
And so how would you determine
what that basket of currencies
00:26:13.540 --> 00:26:14.039
is?
00:26:14.039 --> 00:26:16.250
Well, there are
regression methods
00:26:16.250 --> 00:26:19.490
that have been
developed by economists
00:26:19.490 --> 00:26:20.650
that help you do that.
00:26:20.650 --> 00:26:23.480
And that case study goes
through the analysis of that.
00:26:23.480 --> 00:26:26.770
So check that out to see how
you can get immediate access
00:26:26.770 --> 00:26:29.750
to currency data and be
fitting these regression models
00:26:29.750 --> 00:26:31.250
and looking at the
different results
00:26:31.250 --> 00:26:32.458
and trying to evaluate those.
00:26:38.720 --> 00:26:48.170
So let's turn now
to the main topic--
00:26:48.170 --> 00:26:54.200
let's see here-- which
is time series analysis.
00:27:01.250 --> 00:27:04.080
Today in the rest
of the lecture,
00:27:04.080 --> 00:27:09.040
I want to talk about univariate
time series analysis.
00:27:09.040 --> 00:27:12.670
And so we're thinking of
basically a random variable
00:27:12.670 --> 00:27:17.720
that is observed over time and
it's a discrete time process.
00:27:17.720 --> 00:27:23.140
And we'll introduce you
to the Wold representation
00:27:23.140 --> 00:27:26.435
theorem and definitions
of stationarity
00:27:26.435 --> 00:27:28.340
and its relationship there.
00:27:28.340 --> 00:27:31.430
Then, look at the classic
models of autoregressive
00:27:31.430 --> 00:27:34.120
moving average models.
00:27:34.120 --> 00:27:36.920
And then extending those
to non-stationarity
00:27:36.920 --> 00:27:40.430
with integrated autoregressive
moving average models.
00:27:40.430 --> 00:27:44.440
And then finally, talk about
estimating stationary models
00:27:44.440 --> 00:27:47.630
and how we test
for stationarity.
00:27:47.630 --> 00:27:54.740
So let's begin from
basically first principles.
00:27:54.740 --> 00:27:59.310
We have a stochastic process,
a discrete time stochastic
00:27:59.310 --> 00:28:04.880
process, X, which consists
of random variables indexed
00:28:04.880 --> 00:28:06.160
by time.
00:28:06.160 --> 00:28:09.110
And we're thinking
now discrete time.
00:28:09.110 --> 00:28:11.820
The stochastic behavior
of this sequence
00:28:11.820 --> 00:28:16.050
is determined by specifying
the density or probability mass
00:28:16.050 --> 00:28:22.220
functions for all finite
collections of time indexes.
00:28:22.220 --> 00:28:26.490
And so if we could specify
all finite-dimensional
00:28:26.490 --> 00:28:28.130
distributions of
this process, we
00:28:28.130 --> 00:28:31.710
would specify this
probability model
00:28:31.710 --> 00:28:35.200
for the stochastic process.
00:28:35.200 --> 00:28:40.500
Now, this stochastic process
is strictly stationary
00:28:40.500 --> 00:28:48.760
if the density function for
any collection of times,
00:28:48.760 --> 00:28:55.780
t_1 through t_m, is equal to
the density function for a tau
00:28:55.780 --> 00:28:57.440
translation of that.
00:28:57.440 --> 00:29:03.000
So the density function for any
finite-dimensional distribution
00:29:03.000 --> 00:29:08.300
is stationary, is constant
under arbitrary translations.
00:29:08.300 --> 00:29:12.620
So that's a very
strong property.
00:29:12.620 --> 00:29:16.620
But it's a reasonable
property to ask for if you're
00:29:16.620 --> 00:29:18.566
doing statistical modeling.
00:29:18.566 --> 00:29:20.940
And what do you want to do
when you're estimating models?
00:29:20.940 --> 00:29:24.080
You want to estimate
things that are constant.
00:29:24.080 --> 00:29:26.570
Constants are nice
things to estimate.
00:29:26.570 --> 00:29:28.520
And parameters of
models are constant.
00:29:28.520 --> 00:29:32.930
So we really want the underlying
structure of the distributions
00:29:32.930 --> 00:29:35.150
to be the same.
00:29:44.960 --> 00:29:47.040
That was strict
stationarity, which
00:29:47.040 --> 00:29:51.510
requires knowledge of
the entire distribution
00:29:51.510 --> 00:29:55.020
of the stochastic process.
00:29:55.020 --> 00:29:57.340
We're now going to introduce
a weaker definition, which
00:29:57.340 --> 00:29:59.660
is covariance stationarity.
00:29:59.660 --> 00:30:02.960
And a covariance
stationary process
00:30:02.960 --> 00:30:08.330
has a constant mean,
mu; a constant variance,
00:30:08.330 --> 00:30:15.630
sigma squared; and a
covariance over increments tau,
00:30:15.630 --> 00:30:20.500
given by a function gamma of
tau, that is also constant.
00:30:20.500 --> 00:30:26.960
Gamma isn't a constant function,
but basically for all t,
00:30:26.960 --> 00:30:31.900
covariance of X_t, X_(t+tau)
is this gamma of tau function.
00:30:31.900 --> 00:30:38.080
And we also can introduce
the autocorrelation function
00:30:38.080 --> 00:30:41.830
of the stochastic
process, rho of tau.
00:30:41.830 --> 00:30:49.120
And so the correlation
of two random variables
00:30:49.120 --> 00:30:52.220
is the covariance of those
random variables divided
00:30:52.220 --> 00:30:57.340
by the square root of the
product of the variances.
00:30:57.340 --> 00:31:00.805
And Choongbum I think
introduced that a bit
00:31:00.805 --> 00:31:02.680
in one of his lectures,
where we were talking
00:31:02.680 --> 00:31:06.890
about the correlation function.
00:31:06.890 --> 00:31:09.810
But essentially, the
correlation function
00:31:09.810 --> 00:31:15.400
is if you standardize the
data or the random variables
00:31:15.400 --> 00:31:17.690
to have mean 0-- so
subtract off the means
00:31:17.690 --> 00:31:21.040
and then divide through by
their standard deviations.
00:31:21.040 --> 00:31:26.410
So those translated variables
have mean 0 and variance 1.
00:31:26.410 --> 00:31:29.482
Then the correlation
coefficient is the covariance
00:31:29.482 --> 00:31:31.315
between those standardized
random variables.
00:31:35.020 --> 00:31:38.810
So this is going to come up
again and again in time series
00:31:38.810 --> 00:31:40.080
analysis.
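As a rough illustration of the autocovariance and autocorrelation functions just described, here is a minimal Python sketch that estimates them from a sample. The function names and the divisor n (rather than n minus tau) are assumptions made for this example, not something stated in the lecture.

```python
import numpy as np

def sample_autocovariance(x, max_lag):
    """Sample autocovariance gamma_hat(tau) for tau = 0..max_lag (divisor n)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    d = x - x.mean()  # subtract off the sample mean
    return np.array([np.dot(d[: n - tau], d[tau:]) / n
                     for tau in range(max_lag + 1)])

def sample_autocorrelation(x, max_lag):
    """Sample autocorrelation rho_hat(tau) = gamma_hat(tau) / gamma_hat(0)."""
    gamma = sample_autocovariance(x, max_lag)
    return gamma / gamma[0]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
rho = sample_autocorrelation(x, 2)
```

Dividing by gamma_hat(0) is the same as standardizing the series first and then taking covariances, which is the description given above.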
00:31:40.080 --> 00:31:42.650
Now, the Wold
representation theorem
00:31:42.650 --> 00:31:47.350
is a very, very powerful theorem
about covariance stationary
00:31:47.350 --> 00:31:47.850
processes.
00:31:51.110 --> 00:31:55.050
It basically states that if
we have a zero-mean covariance
00:31:55.050 --> 00:31:59.750
stationary time
series, then it can
00:31:59.750 --> 00:32:03.520
be decomposed into two
components with a very
00:32:03.520 --> 00:32:06.390
nice structure.
00:32:06.390 --> 00:32:11.430
Basically, X_t can be
decomposed into V_t plus S_t.
00:32:11.430 --> 00:32:18.470
V_t is going to be a linearly
deterministic process, meaning
00:32:18.470 --> 00:32:23.130
that past values of
V_t perfectly predict
00:32:23.130 --> 00:32:24.590
what V_t is going to be.
00:32:24.590 --> 00:32:27.780
So this could be like a
linear trend or some fixed
00:32:27.780 --> 00:32:29.660
function of past values.
00:32:29.660 --> 00:32:32.320
It's basically a
deterministic process.
00:32:32.320 --> 00:32:34.690
So there's nothing
random in V_t.
00:32:34.690 --> 00:32:40.710
It's something that's
fixed, without randomness.
00:32:40.710 --> 00:32:46.510
And S_t is a sum
of coefficients,
00:32:46.510 --> 00:32:56.650
psi_i times eta_(t-i), where
the eta_t's are linearly
00:32:56.650 --> 00:32:58.550
unpredictable white noise.
00:32:58.550 --> 00:33:03.890
So what we have is S_t
is a weighted average
00:33:03.890 --> 00:33:09.850
of white noise with
coefficients given by the psi_i.
00:33:09.850 --> 00:33:16.170
And the coefficients psi_i
are such that psi_0 is 1,
00:33:16.170 --> 00:33:18.830
and the sum of the
squared psi_i's is finite.
00:33:21.340 --> 00:33:26.540
And the white noise
eta_t-- what's white noise?
00:33:26.540 --> 00:33:28.930
It has expectation zero.
00:33:28.930 --> 00:33:35.120
It has variance, given by
sigma squared, that's constant.
00:33:35.120 --> 00:33:39.520
And it has covariance across
different white noise elements
00:33:39.520 --> 00:33:42.490
that's 0 for all t not equal to s.
00:33:42.490 --> 00:33:45.810
So eta_t's are uncorrelated
with themselves,
00:33:45.810 --> 00:33:47.750
and of course, they
are uncorrelated
00:33:47.750 --> 00:33:51.290
with the deterministic process.
00:33:51.290 --> 00:33:58.010
So this is really a very,
very powerful concept.
00:33:58.010 --> 00:34:00.600
If you are modeling
a process and it
00:34:00.600 --> 00:34:05.030
has covariance
stationarity, then there
00:34:05.030 --> 00:34:07.960
exists a representation
like this of the function.
00:34:07.960 --> 00:34:15.750
So it's a very
compelling structure,
00:34:15.750 --> 00:34:20.659
which we'll see how it applies
in different circumstances.
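One consequence of writing S_t as a weighted sum of white noise is that its autocovariance depends only on the lag: gamma(tau) equals sigma squared times the sum over i of psi_i times psi_(i+tau). The sketch below computes that for a finite (truncated) set of weights. The function name and the example weights are hypothetical, chosen only for illustration.

```python
import numpy as np

def ma_autocovariance(psi, sigma2, max_lag):
    """gamma(tau) = sigma2 * sum_i psi[i] * psi[i + tau] for truncated weights."""
    psi = np.asarray(psi, dtype=float)
    m = len(psi)
    return np.array([
        sigma2 * np.dot(psi[: m - tau], psi[tau:]) if tau < m else 0.0
        for tau in range(max_lag + 1)
    ])

psi = [1.0, 0.5, 0.25]          # psi_0 = 1, as the theorem requires
gamma = ma_autocovariance(psi, sigma2=2.0, max_lag=3)
```

With finitely many weights the autocovariance is exactly zero beyond the last lag, which is the moving average cutoff property.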
00:34:20.659 --> 00:34:25.650
Now, before getting into the
definition of autoregressive
00:34:25.650 --> 00:34:28.719
moving average
models, I just want
00:34:28.719 --> 00:34:33.820
to give you an intuitive
understanding of what's going
00:34:33.820 --> 00:34:36.469
on with the Wold decomposition.
00:34:36.469 --> 00:34:41.030
And this, I think,
will help motivate
00:34:41.030 --> 00:34:44.480
why the Wold
decomposition should exist
00:34:44.480 --> 00:34:48.170
from a mathematical standpoint.
00:34:48.170 --> 00:34:53.550
So consider just some
univariate stochastic process,
00:34:53.550 --> 00:34:56.500
some time series X_t
that we want to model.
00:34:56.500 --> 00:35:00.010
And we believe that it's
covariance stationary.
00:35:00.010 --> 00:35:02.850
And so we want to
specify essentially
00:35:02.850 --> 00:35:04.610
the Wold decomposition of that.
00:35:04.610 --> 00:35:07.680
Well, what we could
do is initialize
00:35:07.680 --> 00:35:10.890
a parameter p, the number
of past observations,
00:35:10.890 --> 00:35:15.310
in the linearly
deterministic term.
00:35:15.310 --> 00:35:24.420
And then estimate the linear
projection of X_t on the last p
00:35:24.420 --> 00:35:26.140
lag values.
00:35:26.140 --> 00:35:31.490
And so what I want to do
is consider estimating
00:35:31.490 --> 00:35:36.360
that relationship using
a sample of size n
00:35:36.360 --> 00:35:42.660
with some ending point t_0
less than or equal to T.
00:35:42.660 --> 00:35:50.010
And so we can consider y
values like a response variable
00:35:50.010 --> 00:35:57.760
being given by the successive
values of our time series.
00:35:57.760 --> 00:36:02.550
And so our response variables
y_j can be considered to be
00:36:02.550 --> 00:36:06.040
X_(t_0-n+j).
00:36:06.040 --> 00:36:14.350
And define a y vector and
a Z matrix as follows.
00:36:20.140 --> 00:36:25.890
So we have values of our
stochastic process in y.
00:36:25.890 --> 00:36:29.080
And then our Z matrix,
which is essentially
00:36:29.080 --> 00:36:30.580
a matrix of
independent variables,
00:36:30.580 --> 00:36:36.000
is just the lagged
values of this process.
00:36:36.000 --> 00:36:37.940
So let's apply
ordinary least squares
00:36:37.940 --> 00:36:40.530
to specify the projection.
00:36:40.530 --> 00:36:43.810
This projection matrix
should be familiar now.
00:36:43.810 --> 00:36:49.160
And that basically gives
us a prediction of y hat
00:36:49.160 --> 00:36:51.680
depending on p lags.
00:36:51.680 --> 00:36:54.750
And we can compute the
projection residual
00:36:54.750 --> 00:36:56.080
from that fit.
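The projection step just described can be sketched in Python roughly as follows. This is a minimal version under stated assumptions: the series is treated as zero-mean (so no intercept column), the whole sample is used rather than a window ending at t_0, and the function name lag_projection is made up for this example.

```python
import numpy as np

def lag_projection(x, p):
    """Regress x_t on (x_{t-1}, ..., x_{t-p}) by least squares, no intercept."""
    x = np.asarray(x, dtype=float)
    n = len(x) - p                      # number of usable cases
    y = x[p:]                           # response vector
    # Column j of Z holds lag j+1 of the series, aligned with y
    Z = np.column_stack([x[p - j - 1 : p - j - 1 + n] for j in range(p)])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta                # projection residuals
    return beta, resid

# A noiseless series with x_t = 0.5 * x_{t-1}: the fit recovers 0.5
beta, resid = lag_projection(0.5 ** np.arange(10), p=1)
```

The residuals returned here are what one would go on to examine for remaining structure.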
00:36:59.660 --> 00:37:03.450
Well, we can conduct
time series methods
00:37:03.450 --> 00:37:08.470
to analyze these residuals,
which we'll be introducing here
00:37:08.470 --> 00:37:13.170
in a few minutes, to specify
a moving average model.
00:37:13.170 --> 00:37:16.180
We can then have estimates of
the underlying coefficients
00:37:16.180 --> 00:37:22.700
psi and estimates of
these residuals eta_t.
00:37:22.700 --> 00:37:27.300
And then we can evaluate whether
this is a good model or not.
00:37:27.300 --> 00:37:29.430
What does it mean to be
an appropriate model?
00:37:29.430 --> 00:37:35.250
Well, the residual should
be orthogonal to longer lags
00:37:35.250 --> 00:37:39.550
than t minus s, or
longer lags than p.
00:37:39.550 --> 00:37:42.850
So we basically shouldn't
have any dependence
00:37:42.850 --> 00:37:49.390
of our residuals on lags
of the stochastic process
00:37:49.390 --> 00:37:51.550
that weren't included
in the model.
00:37:51.550 --> 00:37:54.850
Those should be orthogonal.
00:37:54.850 --> 00:38:01.070
And the eta_t hats should be
consistent with white noise.
00:38:01.070 --> 00:38:05.220
So those issues
can be evaluated.
00:38:05.220 --> 00:38:07.620
And if there's
evidence otherwise,
00:38:07.620 --> 00:38:10.720
then we can change the
specification of the model.
00:38:10.720 --> 00:38:13.090
We can add additional lags.
00:38:13.090 --> 00:38:15.870
We can add additional
deterministic variables
00:38:15.870 --> 00:38:21.570
if we can identify
what those might be.
00:38:21.570 --> 00:38:23.260
And proceed with this process.
00:38:23.260 --> 00:38:28.490
But essentially that is
how the Wold decomposition
00:38:28.490 --> 00:38:30.740
could be implemented.
00:38:30.740 --> 00:38:35.250
And theoretically, as
our sample gets large,
00:38:35.250 --> 00:38:42.320
if we're observing this time
series for a long time, then
00:38:42.320 --> 00:38:45.090
well certainly the
limit of the projections
00:38:45.090 --> 00:38:49.110
as p, the number of lags
we include, gets large,
00:38:49.110 --> 00:38:52.380
should be essentially
the projection
00:38:52.380 --> 00:38:55.270
of our data on its history.
00:38:55.270 --> 00:39:00.490
And that, in fact, is the
projection corresponding to,
00:39:00.490 --> 00:39:03.950
defining, the
coefficients psi_i.
00:39:03.950 --> 00:39:09.400
And so in the limit, that
projection will converge
00:39:09.400 --> 00:39:11.320
and it will converge
in the sense
00:39:11.320 --> 00:39:15.070
that the coefficients of
the projection definition
00:39:15.070 --> 00:39:17.320
correspond to the psi_i.
00:39:17.320 --> 00:39:26.600
And now if p going to
infinity is required,
00:39:26.600 --> 00:39:29.510
that means that there's
basically a long term
00:39:29.510 --> 00:39:31.145
dependence in the process.
00:39:34.310 --> 00:39:37.120
Basically, it doesn't
stop at a given lag.
00:39:37.120 --> 00:39:41.410
The dependence
persists over time.
00:39:41.410 --> 00:39:45.580
Then we may require
that p goes to infinity.
00:39:45.580 --> 00:39:47.360
Now, what happens when
p goes to infinity?
00:39:47.360 --> 00:39:50.036
Well, if you let p go
to infinity too quickly,
00:39:50.036 --> 00:39:51.410
you run out of
degrees of freedom
00:39:51.410 --> 00:39:53.520
to estimate your models.
00:39:53.520 --> 00:39:57.220
And so from an
implementation standpoint,
00:39:57.220 --> 00:40:01.340
you need to let p/n
go to 0 so that you
00:40:01.340 --> 00:40:09.180
have essentially more
data than parameters
00:40:09.180 --> 00:40:10.710
that you're estimating.
00:40:10.710 --> 00:40:13.800
And so that is required.
00:40:13.800 --> 00:40:18.860
And in time series
modeling, what we
00:40:18.860 --> 00:40:26.609
look for are models where
finite values of p are required.
00:40:26.609 --> 00:40:28.900
So we're only estimating a
finite number of parameters.
00:40:28.900 --> 00:40:31.920
Or if we have a moving
average model which
00:40:31.920 --> 00:40:35.300
has coefficients that
are infinite in number,
00:40:35.300 --> 00:40:40.430
perhaps those can be defined by
a small number of parameters.
00:40:40.430 --> 00:40:44.552
So we'll be looking for
that kind of feature
00:40:44.552 --> 00:40:45.385
in different models.
00:40:49.230 --> 00:40:52.620
Let's turn to talking
about the lag operator.
00:40:52.620 --> 00:40:56.250
The lag operator is
a fundamental tool
00:40:56.250 --> 00:40:59.430
in time series models.
00:40:59.430 --> 00:41:04.180
We consider the operator L
that shifts a time series back
00:41:04.180 --> 00:41:06.680
by one time increment.
00:41:06.680 --> 00:41:09.210
And applying this
operator recursively,
00:41:09.210 --> 00:41:14.400
we get, if it's operating
0 times, there's no lag,
00:41:14.400 --> 00:41:16.570
one time, there's
one lag, two times,
00:41:16.570 --> 00:41:18.860
two lags-- doing
that iteratively.
00:41:18.860 --> 00:41:22.470
And in thinking of these,
what we're dealing with
00:41:22.470 --> 00:41:26.680
is like a transformation on
infinite dimensional space,
00:41:26.680 --> 00:41:29.150
where it's like
the identity matrix
00:41:29.150 --> 00:41:32.390
sort of shifted by
one element-- or not
00:41:32.390 --> 00:41:35.320
the identity, but an element.
00:41:35.320 --> 00:41:37.290
It's like the identity
matrix shifted
00:41:37.290 --> 00:41:41.520
by one column or two columns.
00:41:41.520 --> 00:41:43.760
So anyway, inverses
of these operators
00:41:43.760 --> 00:41:49.440
are well defined in terms
of what we get from them.
00:41:49.440 --> 00:41:53.470
So we can represent
the Wold representation
00:41:53.470 --> 00:41:58.140
in terms of these lag
operators by saying
00:41:58.140 --> 00:42:03.120
that our stochastic
process X_t is
00:42:03.120 --> 00:42:10.030
equal to V_t plus this
psi of L function,
00:42:10.030 --> 00:42:14.030
basically a
functional of the lag
00:42:14.030 --> 00:42:18.570
operator, which is a potentially
infinite-order polynomial
00:42:18.570 --> 00:42:20.730
of the lags.
00:42:20.730 --> 00:42:23.770
So this notation is
something that you
00:42:23.770 --> 00:42:26.110
need to get very
familiar with if you're
00:42:26.110 --> 00:42:28.520
going to be comfortable with
the different models that
00:42:28.520 --> 00:42:33.840
are introduced with
ARMA and ARIMA models.
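To make the lag operator notation concrete, here is a small Python sketch that applies powers of L, and a polynomial in L, to a finite stretch of a series. The helper names are hypothetical, and marking unavailable lags with NaN is a choice made for this example only.

```python
import numpy as np

def lag(x, k=1):
    """Apply L^k: entry t becomes x[t-k]; the first k entries are NaN."""
    x = np.asarray(x, dtype=float)
    if k == 0:
        return x.copy()
    out = np.full_like(x, np.nan)
    out[k:] = x[:-k]
    return out

def apply_lag_polynomial(coeffs, x):
    """Compute (c_0 + c_1 L + ... + c_q L^q) x, NaN where lags are missing."""
    return sum(c * lag(x, k) for k, c in enumerate(coeffs))

x = np.array([1.0, 2.0, 4.0, 7.0])
dx = apply_lag_polynomial([1.0, -1.0], x)   # (1 - L) x, the first difference
```

Here dx holds the first differences of x after an initial undefined entry, since (1 - L) X_t is X_t minus X_(t-1).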
00:42:33.840 --> 00:42:35.410
Any questions about that?
00:42:42.230 --> 00:42:43.870
Now relating to
this-- let me just
00:42:43.870 --> 00:42:47.550
introduce now, because this
will come up somewhat later.
00:42:47.550 --> 00:42:49.840
But there's the impulse
response function
00:42:49.840 --> 00:42:53.010
of the covariance
stationary process.
00:42:53.010 --> 00:42:58.630
If we have a stochastic process
X_t which is given by this Wold
00:42:58.630 --> 00:43:05.950
representation, then
you can ask yourself
00:43:05.950 --> 00:43:11.320
what happens to the innovation
at time t, which is eta_t,
00:43:11.320 --> 00:43:15.470
how does that affect
the process over time?
00:43:15.470 --> 00:43:21.590
And so, OK, pretend that you are
chairman of the Federal Reserve
00:43:21.590 --> 00:43:22.090
Bank.
00:43:22.090 --> 00:43:29.600
And you're interested in the GNP
or basically economic growth.
00:43:29.600 --> 00:43:33.944
And you're considering
changing interest rates
00:43:33.944 --> 00:43:36.340
to help the economy.
00:43:36.340 --> 00:43:38.630
Well, you'd like to
know what an impact is
00:43:38.630 --> 00:43:42.610
of your change in
this factor, how
00:43:42.610 --> 00:43:47.560
that's going to affect the
variable of interest, perhaps
00:43:47.560 --> 00:43:48.130
GNP.
00:43:48.130 --> 00:43:49.520
Now, in this case,
we're thinking
00:43:49.520 --> 00:43:55.140
of just a simple covariance
stationary stochastic process.
00:43:55.140 --> 00:44:00.165
It's basically a process that
is a random-- a weighted sum,
00:44:00.165 --> 00:44:03.210
a moving average of
innovations eta_t.
00:44:03.210 --> 00:44:06.130
But the question is, basically
any covariance stationary
00:44:06.130 --> 00:44:08.310
process could be
represented in this form.
00:44:08.310 --> 00:44:11.630
And the impulse
response function
00:44:11.630 --> 00:44:15.790
relates to what is
the impact of eta_t.
00:44:15.790 --> 00:44:18.120
What's its impact over time?
00:44:18.120 --> 00:44:21.940
Basically, it affects
the process at time t.
00:44:21.940 --> 00:44:24.360
That, because of the
moving average process,
00:44:24.360 --> 00:44:27.350
it affects it at t plus
1, affects it at t plus 2.
00:44:27.350 --> 00:44:33.810
And so this impulse
response is basically
00:44:33.810 --> 00:44:37.650
the derivative of the
value of the process
00:44:37.650 --> 00:44:44.210
with respect to the j-th previous
innovation, which is given by psi_j.
00:44:44.210 --> 00:44:47.360
So the different
innovations have an impact
00:44:47.360 --> 00:44:51.200
on the current value given by
this impulse response function.
00:44:51.200 --> 00:44:53.200
So looking backward,
that definition
00:44:53.200 --> 00:44:54.920
is pretty well defined.
00:44:54.920 --> 00:44:56.630
But you can also
think about how does
00:44:56.630 --> 00:44:58.620
an impact of the
innovation affect
00:44:58.620 --> 00:45:00.760
the process going forward.
00:45:00.760 --> 00:45:03.430
And the long-run
cumulative response
00:45:03.430 --> 00:45:07.490
is essentially what is the
impact of that innovation
00:45:07.490 --> 00:45:11.350
in the process ultimately?
00:45:11.350 --> 00:45:13.839
And eventually, it's
not going to change
00:45:13.839 --> 00:45:14.880
the value of the process.
00:45:14.880 --> 00:45:18.710
But what is the value to
which the process is moving
00:45:18.710 --> 00:45:20.890
because of that one innovation?
00:45:20.890 --> 00:45:22.630
And so the long run
cumulative response
00:45:22.630 --> 00:45:28.900
is given by basically the
sum of these individual ones.
00:45:28.900 --> 00:45:33.020
And it's given by the
sum of the psi_i's.
00:45:33.020 --> 00:45:37.295
So that's the polynomial of
psi with lag operator, where we
00:45:37.295 --> 00:45:39.010
replace the lag operator by 1.
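As a small numerical illustration of the impulse response and the long-run cumulative response, suppose the weights decay geometrically, psi_j = phi to the j. That choice, the value 0.8, and the truncation at 200 terms are all assumptions for this sketch.

```python
import numpy as np

phi = 0.8                        # hypothetical decay rate for the weights
psi = phi ** np.arange(200)      # psi_j = phi^j, truncated at 200 terms
impulse_response = psi           # effect of eta_t on X_(t+j) is psi_j
long_run = psi.sum()             # psi(1): the polynomial evaluated at L = 1
```

For these weights the sum is a geometric series, so long_run should be very close to 1 / (1 - phi).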
00:45:43.540 --> 00:45:45.570
We'll see this
again when we talk
00:45:45.570 --> 00:45:50.546
about vector
autoregressive processes
00:45:50.546 --> 00:45:51.795
with multivariate time series.
00:45:56.020 --> 00:45:57.860
Now, the Wold
representation, which
00:45:57.860 --> 00:46:00.550
is an infinite-order moving
average, possibly infinite
00:46:00.550 --> 00:46:04.466
order, can have an
autoregressive representation.
00:46:07.940 --> 00:46:17.580
Suppose that there is
another polynomial psi_i
00:46:17.580 --> 00:46:23.240
star of the lags, which we're
going to call psi inverse of L,
00:46:23.240 --> 00:46:29.860
which satisfies the fact if you
multiply that with psi of L,
00:46:29.860 --> 00:46:31.690
you get the identity lag 0.
00:46:31.690 --> 00:46:37.820
Then this psi inverse,
if that exists,
00:46:37.820 --> 00:46:47.060
is basically the
inverse of the psi of L.
00:46:47.060 --> 00:46:50.180
So if we start with psi of
L, if that's invertible,
00:46:50.180 --> 00:46:52.510
then there exists
a psi inverse of L,
00:46:52.510 --> 00:46:55.490
with coefficients psi_i star.
00:46:55.490 --> 00:47:02.130
And one can basically take
our original expression
00:47:02.130 --> 00:47:06.020
for the stochastic process,
which is as this moving average
00:47:06.020 --> 00:47:13.250
of the eta's, and express it
as this essentially moving
00:47:13.250 --> 00:47:16.450
averages of the X's.
00:47:16.450 --> 00:47:20.730
And so we've essentially
inverted the process
00:47:20.730 --> 00:47:27.500
and shown that the
stochastic process can
00:47:27.500 --> 00:47:35.570
be expressed as an infinite
order autoregressive
00:47:35.570 --> 00:47:36.850
representation.
00:47:36.850 --> 00:47:40.760
And so this infinite order
autoregressive representation
00:47:40.760 --> 00:47:43.610
corresponds to that intuitive
understanding of how
00:47:43.610 --> 00:47:46.280
the Wold representation exists.
00:47:46.280 --> 00:47:51.330
And it actually works with the--
the regression coefficients
00:47:51.330 --> 00:47:54.749
in that projection several
slides back correspond
00:47:54.749 --> 00:47:55.790
to this inverse operator.
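The coefficients psi_i star of the inverse can be found term by term, by requiring that the product of psi of L and its inverse has coefficient 1 at lag 0 and coefficient 0 at every other lag. The Python sketch below does that for the first few terms; the function name is hypothetical.

```python
import numpy as np

def invert_lag_polynomial(psi, n_terms):
    """First n_terms coefficients of psi(L)^{-1}, assuming psi[0] != 0."""
    psi = np.asarray(psi, dtype=float)
    inv = np.zeros(n_terms)
    inv[0] = 1.0 / psi[0]
    for j in range(1, n_terms):
        # The coefficient of L^j in psi(L) * inv(L) must vanish
        k_max = min(j, len(psi) - 1)
        acc = sum(psi[k] * inv[j - k] for k in range(1, k_max + 1))
        inv[j] = -acc / psi[0]
    return inv

psi = [1.0, 0.5]                          # psi(L) = 1 + 0.5 L
psi_star = invert_lag_polynomial(psi, 5)  # 1, -0.5, 0.25, -0.125, 0.0625
```

Multiplying the two polynomials back together (for example with np.convolve) gives 1 at lag 0 and zeros at the following lags, up to the truncation point.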
00:47:59.030 --> 00:48:04.160
So let's turn to some
specific time series
00:48:04.160 --> 00:48:07.590
models that are widely used.
00:48:07.590 --> 00:48:11.670
The class of autoregressive
moving average processes
00:48:11.670 --> 00:48:16.100
has this mathematical
definition.
00:48:16.100 --> 00:48:22.360
We define the X_t to be equal
to a linear combination of lags
00:48:22.360 --> 00:48:27.190
of X, going back p
lags, with coefficients
00:48:27.190 --> 00:48:30.210
phi_1 through phi_p.
00:48:30.210 --> 00:48:35.500
And then there are
residuals which
00:48:35.500 --> 00:48:40.720
are expressed in terms of a
q-th order moving average.
00:48:40.720 --> 00:48:45.990
So in this framework, the
eta_t's are white noise.
00:48:45.990 --> 00:48:50.910
And white noise, to reiterate,
has mean 0, constant variance,
00:48:50.910 --> 00:48:53.456
zero covariance between those.
00:48:56.330 --> 00:49:03.470
In this representation, I've
simplified things a little bit
00:49:03.470 --> 00:49:09.400
by subtracting off the
mean from all of the X's.
00:49:09.400 --> 00:49:15.400
And that just makes the formulas
a little bit simpler.
00:49:15.400 --> 00:49:20.370
Now, with lag operators, we
can write this ARMA model
00:49:20.370 --> 00:49:26.810
as phi of L, p-th order
polynomial of lag L given
00:49:26.810 --> 00:49:31.360
with coefficients 1,
minus phi_1 up to minus phi_p,
00:49:31.360 --> 00:49:37.627
and theta of L given
by 1, theta_1, theta_2,
00:49:37.627 --> 00:49:38.210
up to theta_q.
00:49:52.870 --> 00:49:55.840
This is basically
a representation
00:49:55.840 --> 00:49:59.170
of the ARMA time series model.
00:49:59.170 --> 00:50:03.320
Basically, we're
taking a set of lags
00:50:03.320 --> 00:50:09.530
of the values of the stochastic
process up to order p.
00:50:09.530 --> 00:50:11.840
And that's equal to a weighted
average of the eta_t's.
00:50:14.530 --> 00:50:21.600
If we multiply by the inverse
of phi of L, if that exists,
00:50:21.600 --> 00:50:24.010
then we get this
representation here,
00:50:24.010 --> 00:50:26.430
which is simply the
Wold decomposition.
00:50:26.430 --> 00:50:34.150
So the ARMA models basically
have a Wold decomposition
00:50:34.150 --> 00:50:36.970
if this phi of L is invertible.
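The Wold weights of an ARMA model can be computed recursively by matching coefficients in phi of L times psi of L equals theta of L. The sketch below assumes the sign convention X_t = phi_1 X_(t-1) + ... + phi_p X_(t-p) + eta_t + theta_1 eta_(t-1) + ... + theta_q eta_(t-q); the function name is made up for this example.

```python
import numpy as np

def arma_to_wold(phi, theta, n_terms):
    """First n_terms psi weights of an ARMA(p, q) model (psi_0 = 1)."""
    psi = np.zeros(n_terms)
    psi[0] = 1.0
    for j in range(1, n_terms):
        th = theta[j - 1] if j <= len(theta) else 0.0
        ar = sum(phi[k - 1] * psi[j - k]
                 for k in range(1, min(j, len(phi)) + 1))
        psi[j] = th + ar   # psi_j = theta_j + sum_k phi_k psi_(j-k)
    return psi

psi = arma_to_wold(phi=[0.5], theta=[0.4], n_terms=4)  # 1, 0.9, 0.45, 0.225
```

For a stationary model these weights die out, so truncating after enough terms gives a usable approximation to the infinite-order moving average.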
00:50:42.850 --> 00:50:47.120
And we'll explore
these by looking
00:50:47.120 --> 00:50:49.160
at simpler cases
of the ARMA models
00:50:49.160 --> 00:50:51.390
by just focusing on
autoregressive models
00:50:51.390 --> 00:50:53.680
first and then moving
average processes
00:50:53.680 --> 00:50:56.090
second so that
you'll get a better
00:50:56.090 --> 00:51:00.690
feel for how these things are
manipulated and interpreted.
00:51:00.690 --> 00:51:04.540
So let's move on to the p-th
order autoregressive process.
00:51:04.540 --> 00:51:08.750
So we're going to consider
ARMA models that just have
00:51:08.750 --> 00:51:10.100
autoregressive terms in them.
00:51:16.000 --> 00:51:20.300
So we have phi of L X_t
minus mu is equal to eta_t,
00:51:20.300 --> 00:51:21.990
which is white noise.
00:51:21.990 --> 00:51:28.970
So a linear combination of
the series is white noise.
00:51:28.970 --> 00:51:34.730
And X_t follows then a linear
regression model on explanatory
00:51:34.730 --> 00:51:41.330
variables, which are
lags of the process X.
00:51:41.330 --> 00:51:46.760
And this could be expressed
as X_t equal to c plus the sum
00:51:46.760 --> 00:51:50.950
from 1 to p of phi_j X_(t-j),
which is a linear regression
00:51:50.950 --> 00:51:53.700
model with regression
parameters phi_j.
00:51:53.700 --> 00:52:01.390
And c, the constant term, is
equal to mu times phi of 1.
00:52:01.390 --> 00:52:10.920
Now, if you basically take
expectations of the process,
00:52:10.920 --> 00:52:14.360
you basically have
coefficients of mu coming in
00:52:14.360 --> 00:52:15.730
from all the terms.
00:52:15.730 --> 00:52:22.220
And phi of 1 times mu is the
regression coefficient there.
00:52:25.160 --> 00:52:27.320
So with this
autoregressive model,
00:52:27.320 --> 00:52:31.160
we now want to go over what are
the stationarity conditions.
00:52:31.160 --> 00:52:35.020
Certainly, this
autoregressive model
00:52:35.020 --> 00:52:40.790
is one where, well,
a simple random walk
00:52:40.790 --> 00:52:45.520
follows an autoregressive
model but is not stationary.
00:52:45.520 --> 00:52:47.650
We'll highlight that
in a minute as well.
00:52:47.650 --> 00:52:50.410
But if you think about
it, that's true.
00:52:50.410 --> 00:52:55.400
And so stationarity is something
to be understood and evaluated.
00:53:03.160 --> 00:53:08.680
This polynomial
function phi, where
00:53:08.680 --> 00:53:11.630
if we replace the
lag operator L by z,
00:53:11.630 --> 00:53:20.970
a complex variable, the
equation phi of z equal to 0
00:53:20.970 --> 00:53:24.330
is the characteristic
equation associated
00:53:24.330 --> 00:53:27.020
with this autoregressive model.
00:53:27.020 --> 00:53:33.190
And it turns out that we'll
be interested in the roots
00:53:33.190 --> 00:53:36.610
of this characteristic equation.
00:53:36.610 --> 00:53:40.705
Now, if we consider
writing phi of L
00:53:40.705 --> 00:53:44.270
as a function of the
roots of the equation,
00:53:44.270 --> 00:53:49.130
we get this expression
where you'll
00:53:49.130 --> 00:53:51.340
notice if you multiply
all those terms out,
00:53:51.340 --> 00:53:55.730
the 1's all multiply out
together, and you get 1.
00:53:55.730 --> 00:54:00.100
And with the lag operator
L to the p-th power,
00:54:00.100 --> 00:54:03.210
that would be the product
of 1 over lambda_1
00:54:03.210 --> 00:54:06.650
times 1 over lambda_2,
or actually negative 1
00:54:06.650 --> 00:54:09.680
over lambda_1 times
negative 1 over lambda_2,
00:54:09.680 --> 00:54:13.640
and so forth-- negative
1 over lambda_p.
00:54:13.640 --> 00:54:15.820
Basically, if there are
p roots to this equation,
00:54:15.820 --> 00:54:19.420
this is how it would
be written out.
00:54:19.420 --> 00:54:27.070
And the process
X_t is covariance
00:54:27.070 --> 00:54:28.710
stationary if and
only if all the roots
00:54:28.710 --> 00:54:33.630
of this characteristic equation
lie outside the unit circle.
00:54:33.630 --> 00:54:35.880
So what does that mean?
00:54:35.880 --> 00:54:41.240
That means that the norm
modulus of the complex z
00:54:41.240 --> 00:54:42.810
is greater than 1.
00:54:42.810 --> 00:54:45.160
So they're outside
the unit circle
00:54:45.160 --> 00:54:47.150
where it's less
than or equal to 1.
00:54:47.150 --> 00:54:56.810
And the roots, if they are
outside the unit circle,
00:54:56.810 --> 00:55:01.080
then the modulus of the
lambda_j's is greater than 1.
00:55:05.400 --> 00:55:12.160
And if we then consider
taking a complex number
00:55:12.160 --> 00:55:16.010
lambda, basically
the root, and have
00:55:16.010 --> 00:55:20.600
an expression for 1 minus
1 over lambda L inverse,
00:55:20.600 --> 00:55:25.010
we can get this series
expression for that inverse.
00:55:25.010 --> 00:55:34.860
And that series will exist and
be bounded if the lambda_i are
00:55:34.860 --> 00:55:36.430
greater than 1 in magnitude.
00:55:39.210 --> 00:55:46.210
So we can actually compute
an inverse of phi of L
00:55:46.210 --> 00:55:49.610
by taking the inverse
of each of the component
00:55:49.610 --> 00:55:52.240
products in that polynomial.
00:55:52.240 --> 00:55:57.800
So in introductory
time series courses,
00:55:57.800 --> 00:56:00.544
they talk about
stationarity and unit roots,
00:56:00.544 --> 00:56:01.960
but they don't
really get into it,
00:56:01.960 --> 00:56:04.490
because people don't
know complex math,
00:56:04.490 --> 00:56:06.970
don't know about roots.
00:56:06.970 --> 00:56:09.620
So anyway, but this
is just very simply
00:56:09.620 --> 00:56:12.840
how that framework is applied.
00:56:12.840 --> 00:56:17.830
So we have a
polynomial equation,
00:56:17.830 --> 00:56:20.885
the characteristic equation,
whose roots we're looking for.
00:56:20.885 --> 00:56:22.510
Those roots have to
be outside the unit
00:56:22.510 --> 00:56:26.170
circle for stationarity
of the process.
00:56:26.170 --> 00:56:31.870
Well, it's basically
conditions for invertibility
00:56:31.870 --> 00:56:35.100
of the process, of the
autoregressive process.
00:56:35.100 --> 00:56:40.440
And that invertibility renders
the process an infinite-order
00:56:40.440 --> 00:56:42.125
moving average process.
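The root condition can be checked numerically. Here is a minimal Python sketch that assumes the characteristic polynomial phi(z) = 1 - phi_1 z - ... - phi_p z^p and uses NumPy's np.roots; the function names and the small tolerance are choices made for this example.

```python
import numpy as np

def ar_roots(phi):
    """Roots of phi(z) = 1 - phi_1 z - ... - phi_p z^p."""
    phi = np.asarray(phi, dtype=float)
    # np.roots expects coefficients from the highest power down to the constant
    coeffs = np.concatenate((-phi[::-1], [1.0]))
    return np.roots(coeffs)

def is_covariance_stationary(phi, tol=1e-10):
    """True if every root of the characteristic equation has modulus > 1."""
    return bool(np.all(np.abs(ar_roots(phi)) > 1.0 + tol))

stationary_ar1 = is_covariance_stationary([0.5])   # root at z = 2
random_walk = is_covariance_stationary([1.0])      # root at z = 1
```

In this sketch the first call returns True and the second returns False, since a root on the unit circle does not satisfy the condition.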
00:56:46.210 --> 00:56:50.830
So let's go through
these results
00:56:50.830 --> 00:56:52.840
for the autoregressive
process of order one,
00:56:52.840 --> 00:56:56.330
where things-- always start
with the simplest cases
00:56:56.330 --> 00:56:58.420
to understand things.
00:56:58.420 --> 00:57:01.140
The characteristic equation
for this model is just 1
00:57:01.140 --> 00:57:02.820
minus phi z.
00:57:02.820 --> 00:57:03.600
The root is 1/phi.
00:57:06.630 --> 00:57:12.382
So lambda is greater than
1-- if the modulus of lambda
00:57:12.382 --> 00:57:13.840
is greater than 1,
meaning the root
00:57:13.840 --> 00:57:16.990
is outside the unit circle,
then phi is less than 1 in magnitude.
00:57:16.990 --> 00:57:21.160
So for covariance stationarity
of this autoregressive process,
00:57:21.160 --> 00:57:25.877
we need phi to be
less than 1 in magnitude.
00:57:30.090 --> 00:57:31.950
The expected value of X is mu.
00:57:31.950 --> 00:57:36.460
The variance of X
is sigma squared X.
00:57:36.460 --> 00:57:41.130
This has this form, sigma
squared over 1 minus phi squared.
00:57:41.130 --> 00:57:44.960
That expression is
basically obtained
00:57:44.960 --> 00:57:50.110
by looking at the infinite order
moving average representation.
00:57:50.110 --> 00:57:56.760
But notice that if
phi is nonzero,
00:57:56.760 --> 00:58:03.710
then the variance
of X is actually
00:58:03.710 --> 00:58:07.895
greater than the variance
of the innovations.
00:58:10.440 --> 00:58:17.280
And that holds whether phi
is positive or negative.
00:58:17.280 --> 00:58:23.100
So the innovation variance
basically is scaled up a bit
00:58:23.100 --> 00:58:25.010
in the autoregressive process.
00:58:25.010 --> 00:58:27.710
The lag-one covariance is
phi times sigma squared
00:58:27.710 --> 00:58:31.980
X. You'll be going through
this in the problem set.
00:58:31.980 --> 00:58:40.160
And the covariance of X at lag j is phi
to the j power sigma squared X.
00:58:40.160 --> 00:58:43.640
And these expressions can
all be easily evaluated
00:58:43.640 --> 00:58:47.490
by simply writing out the
definition of these covariances
00:58:47.490 --> 00:58:50.000
in terms of the original
model and looking
00:58:50.000 --> 00:58:54.250
at what terms are independent,
cancel out, and that proceeds.
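For reference, the standard AR(1) results are variance sigma squared over (1 minus phi squared) and lag-j covariance phi to the j times that variance, valid when phi is less than 1 in magnitude. The Python sketch below computes them and cross-checks the variance against the moving average weights psi_i = phi to the i. The function name and the numerical values are assumptions for illustration.

```python
import numpy as np

def ar1_autocovariance(phi, sigma2, max_lag):
    """gamma(j) = phi^j * sigma2 / (1 - phi^2), valid for |phi| < 1."""
    if abs(phi) >= 1:
        raise ValueError("requires |phi| < 1")
    var_x = sigma2 / (1.0 - phi ** 2)
    return var_x * phi ** np.arange(max_lag + 1)

gamma = ar1_autocovariance(phi=0.5, sigma2=3.0, max_lag=2)   # 4, 2, 1

# Cross-check: variance from the MA(infinity) weights, sigma2 * sum phi^(2i)
var_from_ma = 3.0 * np.sum(0.5 ** (2 * np.arange(200)))
```

The two variance calculations should agree up to the truncation error of the sum.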
00:59:04.510 --> 00:59:06.800
Let's just go
through these cases.
00:59:06.800 --> 00:59:08.730
Let's show it all here.
00:59:08.730 --> 00:59:16.630
So we have if phi
is between 0 and 1,
00:59:16.630 --> 00:59:20.810
then the process experiences
exponential mean reversion
00:59:20.810 --> 00:59:22.170
to mu.
00:59:22.170 --> 00:59:24.760
So an autoregressive
process with phi between 0
00:59:24.760 --> 00:59:29.490
and 1 corresponds to a
mean-reverting process.
00:59:29.490 --> 00:59:31.830
This process is
actually one that
00:59:31.830 --> 00:59:34.310
has been used theoretically
for interest rate models
00:59:34.310 --> 00:59:36.920
and a lot of theoretical
work in finance.
00:59:36.920 --> 00:59:40.280
The Vasicek model is
actually an example
00:59:40.280 --> 00:59:42.300
of the Ornstein-Uhlenbeck
process,
00:59:42.300 --> 00:59:47.840
which is basically a
mean-reverting Brownian motion.
00:59:47.840 --> 00:59:53.070
And any variables
that exhibit or could
00:59:53.070 --> 00:59:59.950
be thought of as
exhibiting mean reversion,
00:59:59.950 --> 01:00:01.810
this model can be
applied to those
01:00:01.810 --> 01:00:07.470
processes, such as interest rate
spreads or real exchange rates,
01:00:07.470 --> 01:00:11.430
variables where one can
expect that things never
01:00:11.430 --> 01:00:12.790
get too large or too small.
01:00:12.790 --> 01:00:14.440
They come back to some mean.
01:00:14.440 --> 01:00:16.570
Now, the challenge
is, that usually
01:00:16.570 --> 01:00:18.930
may be true over
short periods of time.
01:00:18.930 --> 01:00:21.100
But over very long
periods of time,
01:00:21.100 --> 01:00:23.230
the point to which you're
reverting changes.
01:00:23.230 --> 01:00:26.640
So these models tend to
not have broad application
01:00:26.640 --> 01:00:27.900
over long time ranges.
01:00:27.900 --> 01:00:30.150
You need to adapt.
01:00:30.150 --> 01:00:32.220
Anyway, with the AR
process, we can also
01:00:32.220 --> 01:00:34.020
have negative
values of phi, which
01:00:34.020 --> 01:00:38.460
results in exponential mean
reversion that's oscillating
01:00:38.460 --> 01:00:44.190
in time, because the
autoregressive coefficient
01:00:44.190 --> 01:00:49.180
basically is a negative value.
01:00:49.180 --> 01:00:54.510
And for phi equal to 1, the Wold
decomposition doesn't exist.
01:00:54.510 --> 01:00:57.860
And the process is the
simple random walk.
01:00:57.860 --> 01:01:00.340
So basically, if
phi is equal to 1,
01:01:00.340 --> 01:01:04.480
that means that basically just
changes in value of the process
01:01:04.480 --> 01:01:08.860
are independent and identically
distributed white noise.
01:01:08.860 --> 01:01:11.910
And that's the
random walk process.
01:01:11.910 --> 01:01:15.840
And that process, as was
covered in earlier lectures,
01:01:15.840 --> 01:01:18.780
is non-stationary.
01:01:18.780 --> 01:01:22.790
If phi is greater than 1, then
you have an explosive process,
01:01:22.790 --> 01:01:26.780
because basically the
values are scaling up
01:01:26.780 --> 01:01:31.000
every time increment.
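The four cases can be compared with a small sketch. Assuming the AR(1) form X_t - mu = phi (X_(t-1) - mu) + eta_t, the expected deviation from mu k steps ahead is phi to the k times the current deviation, so the sign and size of phi determine the behavior. The function name and the example values of phi are illustrative choices.

```python
def expected_deviation(phi, initial_deviation, steps):
    """E[X_(t+k) - mu | X_t] for k = 0..steps, which equals phi**k * (X_t - mu)."""
    return [initial_deviation * phi ** k for k in range(steps + 1)]


cases = [
    ("mean reverting", 0.5),   # 0 < phi < 1: decays monotonically toward 0
    ("oscillating", -0.5),     # -1 < phi < 0: decays while alternating sign
    ("random walk", 1.0),      # phi = 1: no reversion at all
    ("explosive", 1.5),        # phi > 1: deviation grows every step
]
for label, phi in cases:
    print(f"{label:>15}: {expected_deviation(phi, 1.0, 4)}")
```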
01:01:31.000 --> 01:01:35.290
So those are features
of the AR(1) model.
01:01:35.290 --> 01:01:42.110
For a general autoregressive
process of order p,
01:01:42.110 --> 01:01:45.850
there's a method-- well, we
can look at the second order
01:01:45.850 --> 01:01:49.590
moments of that process, which
have a very nice structure,
01:01:49.590 --> 01:01:51.840
and then use those to
solve for estimates
01:01:51.840 --> 01:01:56.630
of the ARMA parameters, or
autoregressive parameters.
01:01:56.630 --> 01:02:01.820
And those happen to be
specified by what are called
01:02:01.820 --> 01:02:04.840
the Yule-Walker equations.
01:02:04.840 --> 01:02:07.270
So the Yule-Walker equations
are a standard topic
01:02:07.270 --> 01:02:09.670
in time series analysis.
01:02:09.670 --> 01:02:11.480
What is it?
01:02:11.480 --> 01:02:13.030
What does it correspond to?
01:02:13.030 --> 01:02:16.320
Well, we take our original
autoregressive process
01:02:16.320 --> 01:02:17.470
of order p.
01:02:17.470 --> 01:02:24.400
And we write out the
formulas for the covariance
01:02:24.400 --> 01:02:26.900
at lag j between
two observations.
01:02:26.900 --> 01:02:31.790
So what's the covariance
between X_t and X_(t-j)?
01:02:31.790 --> 01:02:39.820
And that expression is
given by this equation.
01:02:39.820 --> 01:02:43.980
And so this equation for gamma
of j is determined simply
01:02:43.980 --> 01:02:48.700
by evaluating the expectations
where we're taking
01:02:48.700 --> 01:02:53.620
the expectation of X_t in the
autoregressive process times
01:02:53.620 --> 01:02:56.110
X_(t-j) minus mu.
01:02:56.110 --> 01:02:58.540
So just evaluating
those terms, you
01:02:58.540 --> 01:03:02.880
can validate that
this is the equation.
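The slide equation is not in the transcript; in standard notation, and assuming the AR(p) model is written in deviations from mu with white noise innovations eta_t of variance sigma squared, the calculation described here reads roughly as follows.

```latex
\begin{aligned}
X_t - \mu &= \sum_{i=1}^{p} \phi_i \,(X_{t-i} - \mu) + \eta_t, \\
\gamma(j) &= E\big[(X_t - \mu)(X_{t-j} - \mu)\big]
           = \sum_{i=1}^{p} \phi_i \,\gamma(j-i) + \sigma^2\,\mathbf{1}\{j = 0\},
           \qquad j \ge 0 .
\end{aligned}
```

The innovation term survives only at lag 0, because eta_t is uncorrelated with X_(t-j) for j of 1 or more, and gamma(-k) equals gamma(k).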
01:03:02.880 --> 01:03:08.620
If we look at the equations
corresponding to j equals 1--
01:03:08.620 --> 01:03:12.040
so lag 1 up through
lag p-- this is
01:03:12.040 --> 01:03:16.070
what those equations look like.
01:03:16.070 --> 01:03:20.060
Basically, the left-hand side
is gamma_1 through gamma_p.
01:03:20.060 --> 01:03:23.090
The covariance to
lag 1 up to lag p
01:03:23.090 --> 01:03:27.590
is equal to basically
linear functions
01:03:27.590 --> 01:03:29.980
given by the phi of
the other covariances.
01:03:33.570 --> 01:03:37.410
Who can tell me what the
structure is of this matrix?
01:03:37.410 --> 01:03:38.590
It's not a diagonal matrix?
01:03:38.590 --> 01:03:41.817
What kind of matrix is this?
01:03:41.817 --> 01:03:42.900
Math trivia question here.
01:03:48.850 --> 01:03:49.782
It has a special name.
01:03:52.460 --> 01:03:54.600
Anyone?
01:03:54.600 --> 01:03:57.690
It's a Toeplitz matrix.
01:03:57.690 --> 01:04:00.840
Along each off diagonal, the
entries are all the same value.
01:04:00.840 --> 01:04:06.680
And in fact, because of the
symmetry of the covariance,
01:04:06.680 --> 01:04:09.750
basically the gamma of 1 is
equal to gamma of minus 1.
01:04:09.750 --> 01:04:12.680
Gamma of minus 2 is
equal to gamma of plus 2.
01:04:12.680 --> 01:04:14.640
Because of the
covariance stationarity,
01:04:14.640 --> 01:04:16.700
it's actually also symmetric.
01:04:16.700 --> 01:04:22.630
So these equations allow
us to solve for the phis
01:04:22.630 --> 01:04:25.990
so long as we have estimates
of these covariances.
01:04:25.990 --> 01:04:30.510
So if we have a
system of estimates,
01:04:30.510 --> 01:04:33.940
we can plug these in in
an attempt to solve this.
01:04:33.940 --> 01:04:36.770
If they're consistent
estimates of the covariances,
01:04:36.770 --> 01:04:38.530
then there will be a solution.
01:04:38.530 --> 01:04:41.980
And then the 0th
equation, which was not
01:04:41.980 --> 01:04:43.469
part of the series
of equations--
01:04:43.469 --> 01:04:45.510
if you go back and look
at the 0th equation, that
01:04:45.510 --> 01:04:47.920
allows you to get an estimate
for the sigma squared.
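A minimal sketch of that solve, assuming autocovariances gamma(0) through gamma(p) are already available (in practice these would be sample estimates). The function name `yule_walker_solve` is hypothetical, and the AR(2) check uses theoretical autocovariances rather than data.

```python
import numpy as np


def yule_walker_solve(gamma):
    """Solve the Yule-Walker system from autocovariances gamma[0..p].

    Returns (phi, sigma2), where phi has length p.
    """
    gamma = np.asarray(gamma, dtype=float)
    p = len(gamma) - 1
    # Symmetric Toeplitz matrix: entry (i, j) is gamma(|i - j|).
    idx = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    big_gamma = gamma[idx]
    # Equations for lags 1..p: big_gamma @ phi = (gamma(1), ..., gamma(p)).
    phi = np.linalg.solve(big_gamma, gamma[1:])
    # Lag-0 equation: gamma(0) = sum_i phi_i * gamma(i) + sigma^2.
    sigma2 = gamma[0] - phi @ gamma[1:]
    return phi, sigma2


# Check on an AR(2) with known parameters, using its theoretical autocovariances.
phi1, phi2, sigma2_true = 0.5, 0.3, 1.0
rho1 = phi1 / (1 - phi2)
rho2 = phi1 * rho1 + phi2
gamma0 = sigma2_true / (1 - phi1 * rho1 - phi2 * rho2)
gamma_ar2 = [gamma0, rho1 * gamma0, rho2 * gamma0]

phi_hat, sigma2_hat = yule_walker_solve(gamma_ar2)
print(phi_hat, sigma2_hat)  # recovers the AR coefficients and innovation variance
```

Building the matrix from |i - j| makes the Toeplitz and symmetric structure explicit: the same gamma value sits along each diagonal.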
01:04:47.920 --> 01:04:50.920
So these Yule-Walker
equations are the way
01:04:50.920 --> 01:04:54.510
in which many ARMA
models are specified
01:04:54.510 --> 01:05:03.650
in different statistics packages
and in terms of what principles
01:05:03.650 --> 01:05:04.400
are being applied.
01:05:04.400 --> 01:05:09.700
Well, if we're using unbiased
estimates of these parameters,
01:05:09.700 --> 01:05:12.055
then this is applying
what's called
01:05:12.055 --> 01:05:16.250
the method of moments principle
for statistical estimation.
01:05:16.250 --> 01:05:20.600
And with complicated models,
sometimes the likelihood
01:05:20.600 --> 01:05:25.900
functions are very hard
to specify and compute,
01:05:25.900 --> 01:05:29.800
and then to do optimization
over those is even harder.
01:05:29.800 --> 01:05:32.780
It can turn out that
there are relationships
01:05:32.780 --> 01:05:35.840
between the moments of the
random variables, which
01:05:35.840 --> 01:05:38.340
are functions of the
unknown parameters.
01:05:38.340 --> 01:05:42.590
And you can solve for basically
the sample moments equalling
01:05:42.590 --> 01:05:45.940
the theoretical moments
and you apply the method
01:05:45.940 --> 01:05:48.830
of moments estimation method.
01:05:48.830 --> 01:05:54.670
Econometrics is rich with many
applications of that principle.
01:05:57.580 --> 01:06:02.110
The next section goes through
the moving average model.
01:06:05.240 --> 01:06:12.340
Let me highlight this.
01:06:12.340 --> 01:06:16.080
So with an order
q moving average,
01:06:16.080 --> 01:06:19.560
we basically have a polynomial
in the lag operator L,
01:06:19.560 --> 01:06:22.390
which is operated
upon the eta_t's.
01:06:22.390 --> 01:06:25.700
And if you write out
the expectations of X_t,
01:06:25.700 --> 01:06:27.030
you get mu.
01:06:27.030 --> 01:06:28.650
The variance of X_t,
which is gamma 0,
01:06:28.650 --> 01:06:34.470
is sigma squared times 1 plus
the squares of the coefficients
01:06:34.470 --> 01:06:36.360
in the polynomial.
01:06:36.360 --> 01:06:39.920
And so this feature,
this property here is due
01:06:39.920 --> 01:06:44.100
to the fact that we have
uncorrelated innovations
01:06:44.100 --> 01:06:47.060
in the eta_t's.
01:06:47.060 --> 01:06:48.260
The eta_t's are white noise.
01:06:48.260 --> 01:06:52.830
So the only thing that comes
through in the square of X_t
01:06:52.830 --> 01:06:56.020
and the expectation of
that is the squared powers
01:06:56.020 --> 01:07:01.900
of the etas, which
have coefficients
01:07:01.900 --> 01:07:03.860
given by the theta_i squared.
01:07:03.860 --> 01:07:09.170
So these properties are left--
I'll leave you just to verify,
01:07:09.170 --> 01:07:11.142
very straightforward.
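For anyone verifying these properties, here is a small sketch. It assumes the sign convention X_t = mu + eta_t + theta_1 eta_(t-1) + ... + theta_q eta_(t-q), which the transcript does not state, and it also returns the autocovariances at nonzero lags, a standard extension of the variance formula. The function name is illustrative.

```python
def ma_autocovariance(theta, sigma2, lag):
    """gamma(lag) for X_t = mu + eta_t + sum_i theta[i-1] * eta_(t-i)."""
    coeffs = [1.0] + list(theta)  # coefficient on eta_t itself is 1
    lag = abs(lag)
    if lag >= len(coeffs):
        return 0.0  # beyond lag q the two sums share no innovations
    # Only products of coefficients on the same eta survive the expectation.
    return sigma2 * sum(
        coeffs[i] * coeffs[i + lag] for i in range(len(coeffs) - lag)
    )


print(ma_autocovariance([0.5], 1.0, 0))  # sigma2 * (1 + theta_1**2)
print(ma_autocovariance([0.5], 1.0, 1))  # sigma2 * theta_1
print(ma_autocovariance([0.5], 1.0, 2))  # zero beyond lag q
```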
01:07:11.142 --> 01:07:14.430
But let's now turn to the
final minutes of the lecture
01:07:14.430 --> 01:07:20.170
today to accommodating
non-stationary behavior
01:07:20.170 --> 01:07:23.340
in time series.
01:07:23.340 --> 01:07:27.990
The original approach
with time series
01:07:27.990 --> 01:07:32.320
was to focus on
estimation methodologies
01:07:32.320 --> 01:07:34.940
for covariance
stationary processes.
01:07:34.940 --> 01:07:38.440
So if the series is not
covariance stationary,
01:07:38.440 --> 01:07:42.410
then we would want to
do some transformation
01:07:42.410 --> 01:07:48.660
of the data, of the
series,
01:07:48.660 --> 01:07:52.270
so that the resulting
process is stationary.
01:07:52.270 --> 01:07:55.990
And with the
differencing operators,
01:07:55.990 --> 01:08:00.610
delta, Box and Jenkins
advocated removing
01:08:00.610 --> 01:08:03.420
non-stationary trending
behavior, which
01:08:03.420 --> 01:08:06.370
is exhibited often in
economic time series,
01:08:06.370 --> 01:08:09.960
by using a first difference,
maybe a second difference,
01:08:09.960 --> 01:08:12.300
or a k-th order difference.
01:08:12.300 --> 01:08:20.229
So these operators are
defined in this way.
01:08:20.229 --> 01:08:22.960
Basically with the
k-th order operator
01:08:22.960 --> 01:08:25.210
having this
expression here, this
01:08:25.210 --> 01:08:31.189
is the binomial expansion
of a k-th power,
01:08:31.189 --> 01:08:35.970
which can be useful.
01:08:35.970 --> 01:08:40.609
It comes up all the time
in probability theory.
01:08:40.609 --> 01:08:43.609
And if a process has
a linear time trend,
01:08:43.609 --> 01:08:48.390
then delta X_t is going to
have no time trend at all,
01:08:48.390 --> 01:08:51.390
because you're
basically taking out
01:08:51.390 --> 01:08:54.430
that linear component by
taking successive differences.
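The differencing operators can be sketched directly from the binomial expansion. This assumes the k-th difference is (1 - L) to the k-th power applied to X_t, with L the lag operator; the function name `difference` is illustrative, and NumPy's `numpy.diff` is used only as a cross-check.

```python
from math import comb

import numpy as np


def difference(x, k=1):
    """k-th order difference via the binomial expansion of (1 - L)**k."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    out = np.zeros(n - k)
    for j in range(k + 1):
        # Coefficient of L**j is (-1)**j * C(k, j), and L**j x_t = x_(t-j).
        out += (-1) ** j * comb(k, j) * x[k - j : n - j]
    return out


t = np.arange(10)
linear = 2.0 + 3.0 * t
quadratic = t.astype(float) ** 2

print(difference(linear, 1))     # constant: the linear trend is gone
print(difference(quadratic, 2))  # constant after two differences
print(np.allclose(difference(quadratic, 2), np.diff(quadratic, n=2)))
```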
01:08:54.430 --> 01:08:57.014
Sometimes, if you
have a real series
01:08:57.014 --> 01:08:59.430
and you look at it, and
it appears non-stationary,
01:08:59.430 --> 01:09:02.810
you look at first differences,
and those can still
01:09:02.810 --> 01:09:05.649
appear to be growing
over time, in which case
01:09:05.649 --> 01:09:08.810
sometimes the second
difference will result
01:09:08.810 --> 01:09:11.270
in a process with no trend.
01:09:11.270 --> 01:09:14.170
So these are sort of
convenient tricks,
01:09:14.170 --> 01:09:18.250
techniques to render
the series stationary.
01:09:18.250 --> 01:09:21.220
And let's see.
01:09:21.220 --> 01:09:26.960
There's examples here of
linear trend reversion models
01:09:26.960 --> 01:09:32.319
which are rendered
covariance stationary
01:09:32.319 --> 01:09:35.330
under first differencing.
01:09:35.330 --> 01:09:38.689
In this case, this is an
example where you have
01:09:38.689 --> 01:09:41.350
a deterministic time trend.
01:09:41.350 --> 01:09:46.040
But then you have reversion
to the time trend over time.
01:09:46.040 --> 01:09:49.880
So we basically have
eta_t, the error
01:09:49.880 --> 01:09:53.830
about the deterministic trend,
is a first order autoregressive
01:09:53.830 --> 01:09:55.740
process.
01:09:55.740 --> 01:10:00.307
And the moments here
can be derived this way.
01:10:00.307 --> 01:10:01.390
Leave that as an exercise.
01:10:04.230 --> 01:10:09.510
One could also consider
the pure integrated process
01:10:09.510 --> 01:10:16.330
and talk about
stochastic trends.
01:10:16.330 --> 01:10:19.140
And basically,
random walk processes
01:10:19.140 --> 01:10:22.740
are often referred
to in econometrics
01:10:22.740 --> 01:10:25.010
as stochastic trends.
01:10:25.010 --> 01:10:31.610
And you may want to try and
remove those from the data,
01:10:31.610 --> 01:10:33.280
or accommodate them.
01:10:33.280 --> 01:10:40.930
And so the stochastic
trend process is basically
01:10:40.930 --> 01:10:49.630
given by the first difference
X_t is just equal to eta_t.
01:10:49.630 --> 01:10:53.430
And so we have essentially
this random walk
01:10:53.430 --> 01:10:55.830
from a given starting point.
01:10:55.830 --> 01:11:00.650
And it's easy to verify that if
you knew the 0th point, then
01:11:00.650 --> 01:11:04.770
the variance of the t-th time
point would be t sigma squared,
01:11:04.770 --> 01:11:09.000
because we're summing t
independent innovations.
01:11:09.000 --> 01:11:14.475
And the covariance between
time t and time t minus j
01:11:14.475 --> 01:11:17.500
is simply t minus
j sigma squared.
01:11:17.500 --> 01:11:20.860
And the correlation between
those has this form.
01:11:20.860 --> 01:11:23.240
What you can see is that this
definitely depends on time.
01:11:23.240 --> 01:11:26.660
So it's not a
stationary process.
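Written out, and assuming the starting value X_0 is known and the innovations eta_i are uncorrelated with variance sigma squared, the random walk moments described here take the following form. The slide's exact notation is not in the transcript.

```latex
\begin{aligned}
X_t &= X_0 + \sum_{i=1}^{t} \eta_i, \\
\operatorname{Var}(X_t) &= t\,\sigma^2, \\
\operatorname{Cov}(X_t, X_{t-j}) &= (t-j)\,\sigma^2, \qquad 0 \le j < t, \\
\operatorname{Corr}(X_t, X_{t-j}) &= \frac{(t-j)\,\sigma^2}{\sqrt{t\,\sigma^2\,(t-j)\,\sigma^2}}
  = \sqrt{\frac{t-j}{t}} = \sqrt{1 - \frac{j}{t}} .
\end{aligned}
```

For a fixed lag j, the correlation approaches 1 as t grows, so the second moments depend on t and not only on the lag.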
01:11:26.660 --> 01:11:33.880
So this first differencing
results in stationarity.
01:11:33.880 --> 01:11:36.230
And the differenced
process has those features.
01:11:46.847 --> 01:11:47.805
Let's see where we are.
01:11:52.730 --> 01:11:57.380
Final topic for
today is just how
01:11:57.380 --> 01:12:04.630
you incorporate non-stationary
behavior into ARMA processes.
01:12:04.630 --> 01:12:07.680
Well, if you take
first differences
01:12:07.680 --> 01:12:10.340
or second differences
and the resulting process
01:12:10.340 --> 01:12:13.252
is covariance
stationary, then we
01:12:13.252 --> 01:12:15.460
can just incorporate that
differencing into the model
01:12:15.460 --> 01:12:20.490
specification itself, and define
ARIMA models, Autoregressive
01:12:20.490 --> 01:12:23.730
Integrated Moving
Average Processes.
01:12:23.730 --> 01:12:26.000
And so to specify
these models, we
01:12:26.000 --> 01:12:29.290
need to determine the order
of the differencing required
01:12:29.290 --> 01:12:32.990
to remove trends,
deterministic or stochastic,
01:12:32.990 --> 01:12:35.820
and then estimate
the unknown parameters,
01:12:35.820 --> 01:12:38.940
and then apply model
selection criteria.
selection criteria.
01:12:38.940 --> 01:12:43.770
So let me go very
quickly through this
01:12:43.770 --> 01:12:48.600
and come back to it at the
beginning of next time.
01:12:48.600 --> 01:12:51.660
But in specifying the
parameters of these models,
01:12:51.660 --> 01:12:54.410
we can apply maximum
likelihood, again,
01:12:54.410 --> 01:12:59.280
if we assume normality of
these innovations eta_t.
01:12:59.280 --> 01:13:02.260
And we can express
the ARMA model
01:13:02.260 --> 01:13:04.440
in state space
form, which results
01:13:04.440 --> 01:13:07.880
in a form for the
likelihood function, which
01:13:07.880 --> 01:13:12.130
we'll see a few lectures ahead.
01:13:12.130 --> 01:13:15.970
But then we can apply limited
information maximum likelihood,
01:13:15.970 --> 01:13:19.470
where we just condition on the
first observations of the data
01:13:19.470 --> 01:13:22.550
and maximize the likelihood.
01:13:22.550 --> 01:13:27.060
Or not condition on the first
few observations, but also
01:13:27.060 --> 01:13:33.700
use their information as well,
and look at their density
01:13:33.700 --> 01:13:36.640
functions, incorporating
those into the likelihood
01:13:36.640 --> 01:13:41.160
relative to the stationary
distribution for their values.
01:13:41.160 --> 01:13:44.000
And then the issue
becomes, how do we
01:13:44.000 --> 01:13:45.390
choose amongst different models?
01:13:45.390 --> 01:13:48.480
Now, last time we talked about
linear regression models,
01:13:48.480 --> 01:13:50.500
how you'd specify a
given model. Here, we're
01:13:50.500 --> 01:13:53.050
talking about autoregressive,
moving average,
01:13:53.050 --> 01:13:55.000
and even integrated
moving average processes
01:13:55.000 --> 01:13:59.320
and how we specify
those. Well, with the method
01:13:59.320 --> 01:14:06.470
of maximum likelihood,
there are procedures
01:14:06.470 --> 01:14:12.440
which-- there are measures of
how effective a fitted model
01:14:12.440 --> 01:14:16.390
is, given by an
information criterion
01:14:16.390 --> 01:14:21.250
that you would want to minimize
for a given fitted model.
01:14:21.250 --> 01:14:24.719
So we can consider
different sets of models,
01:14:24.719 --> 01:14:26.510
different numbers of
explanatory variables,
01:14:26.510 --> 01:14:29.740
different orders of
autoregressive parameters,
01:14:29.740 --> 01:14:33.100
moving average parameters,
and compute, say,
01:14:33.100 --> 01:14:37.940
the Akaike information criterion
or the Bayes information
01:14:37.940 --> 01:14:39.990
criterion or the
Hannan-Quinn criterion
01:14:39.990 --> 01:14:44.720
as different ways of judging
how good different models are.
01:14:44.720 --> 01:14:47.960
And let me just finish
today by pointing out
01:14:47.960 --> 01:14:52.620
that what these
information criteria are
01:14:52.620 --> 01:14:58.560
is basically a function of the
log likelihood function, which
01:14:58.560 --> 01:15:00.719
is something we're
trying to maximize
01:15:00.719 --> 01:15:02.135
with maximum
likelihood estimates.
01:15:04.870 --> 01:15:08.700
And then adding some penalty
for how many parameters
01:15:08.700 --> 01:15:10.742
we're estimating.
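A minimal sketch of that structure, using the common textbook forms of the three criteria. The lecture slides may scale or normalize them differently, so these formulas are an assumption rather than the course's definitions, and the function name is illustrative.

```python
import math


def information_criteria(log_likelihood, n_params, n_obs):
    """Return (AIC, BIC, HQ) in their common textbook forms; smaller is better."""
    fit = -2.0 * log_likelihood  # shrinks as the maximized likelihood grows
    aic = fit + 2.0 * n_params
    bic = fit + n_params * math.log(n_obs)
    hq = fit + 2.0 * n_params * math.log(math.log(n_obs))
    return aic, bic, hq


# Hypothetical fitted model: log likelihood -100 with 3 parameters on 50 points.
aic, bic, hq = information_criteria(-100.0, 3, 50)
print(aic, bic, hq)
```

All three share the same fit term and differ only in the penalty per parameter: 2 for AIC, log n for BIC, and 2 log log n for Hannan-Quinn.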
01:15:10.742 --> 01:15:12.950
And so what I'd like you to
think about for next time
01:15:12.950 --> 01:15:18.600
is what kind of a penalty
is appropriate for adding
01:15:18.600 --> 01:15:20.300
an extra parameter.
01:15:20.300 --> 01:15:23.640
Like, what evidence is
required to incorporate
01:15:23.640 --> 01:15:28.020
extra parameters, extra
variables, in the model?
01:15:28.020 --> 01:15:31.180
Would it be t statistics
that exceed some threshold
01:15:31.180 --> 01:15:32.760
or some other criteria?
01:15:32.760 --> 01:15:35.940
Turns out that these are
all related to those issues.
01:15:35.940 --> 01:15:39.500
And it's very interesting
how those play out.
01:15:39.500 --> 01:15:45.180
And I'll say that for those
of you who have actually
01:15:45.180 --> 01:15:48.490
seen these before, the
Bayes information criterion
01:15:48.490 --> 01:15:50.400
corresponds to an
assumption that there
01:15:50.400 --> 01:15:54.180
is some finite number of
variables in the model.
01:15:54.180 --> 01:15:57.010
And you know what those are.
01:15:57.010 --> 01:16:00.060
The Hannan-Quinn criterion
says maybe there's
01:16:00.060 --> 01:16:03.760
an infinite number of
variables in the model,
01:16:03.760 --> 01:16:08.810
but you want to be
able to identify those.
01:16:08.810 --> 01:16:12.230
And so anyway, it's a
very challenging problem
01:16:12.230 --> 01:16:13.390
with model selection.
01:16:13.390 --> 01:16:16.900
And these criteria can
be used to specify those.
01:16:16.900 --> 01:16:19.050
So we'll go through
that next time.