1
00:00:00,080 --> 00:00:02,500
The following content is
provided under a Creative
2
00:00:02,500 --> 00:00:04,019
Commons license.
3
00:00:04,019 --> 00:00:06,360
Your support will help
MIT OpenCourseWare
4
00:00:06,360 --> 00:00:10,730
continue to offer high quality
educational resources for free.
5
00:00:10,730 --> 00:00:13,340
To make a donation or
view additional materials
6
00:00:13,340 --> 00:00:17,217
from hundreds of MIT courses,
visit MIT OpenCourseWare
7
00:00:17,217 --> 00:00:17,842
at ocw.mit.edu.
8
00:00:22,190 --> 00:00:23,010
PROFESSOR: OK.
9
00:00:23,010 --> 00:00:25,530
Well, last time I
was lecturing, we
10
00:00:25,530 --> 00:00:29,380
were talking about
regression analysis.
11
00:00:29,380 --> 00:00:31,870
And we finished up talking
about estimation methods
12
00:00:31,870 --> 00:00:34,730
for fitting regression models.
13
00:00:34,730 --> 00:00:38,670
I want to recap the method
of maximum likelihood,
14
00:00:38,670 --> 00:00:42,010
because this is really
the primary estimation
15
00:00:42,010 --> 00:00:45,070
method in statistical
modeling that you start with.
16
00:00:45,070 --> 00:00:49,946
And so let me just
review where we were.
17
00:00:49,946 --> 00:00:53,060
We have a normal linear
regression model.
18
00:00:53,060 --> 00:00:55,100
A dependent variable
y is explained
19
00:00:55,100 --> 00:00:58,940
by a linear combination
of independent variables
20
00:00:58,940 --> 00:01:00,710
given by a regression
parameter beta.
21
00:01:00,710 --> 00:01:03,800
And we assume that there are
errors about all the cases
22
00:01:03,800 --> 00:01:05,710
which are independent
identically distributed
23
00:01:05,710 --> 00:01:07,440
normal random variables.
24
00:01:07,440 --> 00:01:12,120
So because of that relationship,
the dependent variable vector
25
00:01:12,120 --> 00:01:15,630
y, which is an
n-vector, for n cases,
26
00:01:15,630 --> 00:01:18,730
is a multivariate
normal random variable.
27
00:01:18,730 --> 00:01:26,490
Now, the likelihood function is
equal to the density function
28
00:01:26,490 --> 00:01:28,280
for the data.
29
00:01:28,280 --> 00:01:32,400
And there's some
ambiguity really
30
00:01:32,400 --> 00:01:36,000
about how one manipulates
the likelihood function.
31
00:01:36,000 --> 00:01:38,780
The likelihood function
becomes defined once we've
32
00:01:38,780 --> 00:01:41,030
observed a sample of data.
33
00:01:41,030 --> 00:01:45,390
So in this expression for
the likelihood function
34
00:01:45,390 --> 00:01:47,330
as a function of beta
and sigma squared,
35
00:01:47,330 --> 00:01:50,800
we're considering evaluating
the probability density
36
00:01:50,800 --> 00:01:53,830
function for the
data conditional
37
00:01:53,830 --> 00:01:57,040
on the unknown parameters.
38
00:01:57,040 --> 00:02:02,540
So if this were simply a
univariate normal distribution
39
00:02:02,540 --> 00:02:05,160
with some unknown mean
and variance, then
40
00:02:05,160 --> 00:02:10,880
what we would have is
just a bell curve for mu
41
00:02:10,880 --> 00:02:13,880
centered around a
single observation y,
42
00:02:13,880 --> 00:02:15,550
if you look at the
likelihood function
43
00:02:15,550 --> 00:02:19,150
and how it varies with
the underlying mean
44
00:02:19,150 --> 00:02:22,950
of the normal distribution.
45
00:02:22,950 --> 00:02:28,180
So this likelihood
function is-- well,
46
00:02:28,180 --> 00:02:30,540
the challenge really
in maximum likelihood estimation
47
00:02:30,540 --> 00:02:34,840
is really calculating
and computing
48
00:02:34,840 --> 00:02:36,790
the likelihood function.
49
00:02:36,790 --> 00:02:39,050
And with normal linear
regression models,
50
00:02:39,050 --> 00:02:40,440
it's very easy.
51
00:02:40,440 --> 00:02:42,910
Now, the maximum
likelihood estimates
52
00:02:42,910 --> 00:02:47,490
are those values that
maximize this function.
53
00:02:47,490 --> 00:02:51,890
And the question is, why
are those good estimates
54
00:02:51,890 --> 00:02:54,840
of the underlying parameters?
55
00:02:54,840 --> 00:02:57,760
Well, what those
estimates do is they
56
00:02:57,760 --> 00:03:03,150
are the parameter values for
which the observed data is
57
00:03:03,150 --> 00:03:05,030
most likely.
58
00:03:05,030 --> 00:03:09,170
So we're able to scale
the unknown parameters
59
00:03:09,170 --> 00:03:14,020
by how likely those parameters
could have generated these data
60
00:03:14,020 --> 00:03:15,500
values.
61
00:03:15,500 --> 00:03:19,560
So let's look at the
likelihood function
62
00:03:19,560 --> 00:03:23,360
for this normal linear
regression model.
63
00:03:23,360 --> 00:03:28,520
These first two lines here are
highlighting-- the first line
64
00:03:28,520 --> 00:03:32,470
is highlighting that
our response variable
65
00:03:32,470 --> 00:03:35,310
values are independent.
66
00:03:35,310 --> 00:03:36,770
They're conditionally
independent
67
00:03:36,770 --> 00:03:38,720
given the unknown parameters.
68
00:03:38,720 --> 00:03:43,180
And so the density of the
full vector of y's is simply
69
00:03:43,180 --> 00:03:48,290
the product of the density
functions for those components.
70
00:03:48,290 --> 00:03:52,410
And because this is a normal
linear regression model,
71
00:03:52,410 --> 00:03:55,350
each of the y_i's is
normally distributed.
72
00:03:55,350 --> 00:03:57,140
So what's in there
is simply the density
73
00:03:57,140 --> 00:04:01,330
function of a normal random
variable with mean given
74
00:04:01,330 --> 00:04:06,960
by the beta sum of independent
variables for each i,
75
00:04:06,960 --> 00:04:10,300
case i, given by the
regression parameters.
76
00:04:10,300 --> 00:04:18,320
And that expression
basically can be expressed
77
00:04:18,320 --> 00:04:21,630
in matrix form this way.
78
00:04:21,630 --> 00:04:28,910
And what we have is
the likelihood function
79
00:04:28,910 --> 00:04:33,160
ends up being a function
of our Q of beta, which
80
00:04:33,160 --> 00:04:35,610
was our least squares criterion.
81
00:04:35,610 --> 00:04:39,120
So the least squares
estimation is
82
00:04:39,120 --> 00:04:42,930
equivalent to maximum likelihood
estimation for the regression
83
00:04:42,930 --> 00:04:48,510
parameters if we have a normal
linear regression model.
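[The equivalence just stated can be checked numerically. A minimal sketch on simulated data (the course's case studies use R; this illustration and all of its numbers are made up): for y = X beta + eps with i.i.d. normal errors, the least squares solution is also the maximum likelihood estimate of beta.]

```python
import numpy as np

# Sketch on simulated data: for the normal linear model
# y = X beta + eps, eps ~ N(0, sigma^2 I), maximizing the likelihood
# in beta is the same as minimizing Q(beta) = ||y - X beta||^2.
rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # design matrix
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # least squares = MLE of beta
Q = np.sum((y - X @ beta_hat) ** 2)               # residual sum of squares
print(beta_hat, Q)
```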
84
00:04:48,510 --> 00:04:52,200
And there's this
extra term, minus n.
85
00:04:52,200 --> 00:04:54,820
Well, actually, if we're going
to maximize the likelihood
86
00:04:54,820 --> 00:04:57,220
function, we can also maximize
the log of the likelihood
87
00:04:57,220 --> 00:05:00,010
function, because that's
just a monotone function
88
00:05:00,010 --> 00:05:01,860
of the likelihood.
89
00:05:01,860 --> 00:05:04,570
And it's easier to maximize the
log of the likelihood function
90
00:05:04,570 --> 00:05:06,430
which is expressed here.
91
00:05:06,430 --> 00:05:11,460
And so we're able to
maximize over beta
92
00:05:11,460 --> 00:05:14,230
by minimizing Q of beta.
93
00:05:14,230 --> 00:05:18,280
And then we can maximize
over sigma squared
94
00:05:18,280 --> 00:05:21,800
given our estimate for beta.
95
00:05:21,800 --> 00:05:25,120
And that's achieved by
taking the derivative
96
00:05:25,120 --> 00:05:31,170
of the log-likelihood with
respect to sigma squared.
97
00:05:31,170 --> 00:05:33,150
So we basically have this
first order condition
98
00:05:33,150 --> 00:05:35,320
that finds the
maximum because things
99
00:05:35,320 --> 00:05:39,830
are appropriately convex.
100
00:05:39,830 --> 00:05:45,200
And taking that derivative
and solving for zero,
101
00:05:45,200 --> 00:05:47,450
we basically get
this expression.
102
00:05:47,450 --> 00:05:50,380
So this is just
taking the derivative
103
00:05:50,380 --> 00:05:54,350
of the log-likelihood with
respect to sigma squared.
104
00:05:54,350 --> 00:05:55,857
And you'll notice
here I'm taking
105
00:05:55,857 --> 00:05:57,690
the derivative with
respect to sigma squared
106
00:05:57,690 --> 00:05:59,050
as a parameter, not sigma.
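[The derivative just described can be written out; a sketch of the standard calculation, in the lecture's notation with Q(beta) the residual sum of squares:]

```latex
% Log-likelihood of the normal linear regression model:
%   \ell(\beta,\sigma^2) = -\tfrac{n}{2}\log(2\pi\sigma^2)
%                          - \frac{Q(\beta)}{2\sigma^2}
% Differentiate in sigma^2 (treated as the parameter) and set to zero:
\frac{\partial \ell}{\partial \sigma^2}
  = -\frac{n}{2\sigma^2} + \frac{Q(\hat\beta)}{2\sigma^4} = 0
\quad\Longrightarrow\quad
\hat\sigma^2 = \frac{Q(\hat\beta)}{n}
```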
107
00:06:01,870 --> 00:06:05,380
And that gives us that
the maximum likelihood
108
00:06:05,380 --> 00:06:10,700
estimate of the error variance
is Q of beta hat over n.
109
00:06:10,700 --> 00:06:17,090
So this is the sum of the
squared residuals divided by n.
110
00:06:17,090 --> 00:06:20,940
Now, I emphasize here
that that's biased.
111
00:06:20,940 --> 00:06:24,612
Who can tell me
why that's biased
112
00:06:24,612 --> 00:06:25,820
or why it ought to be biased?
113
00:06:30,554 --> 00:06:31,526
AUDIENCE: [INAUDIBLE].
114
00:06:35,420 --> 00:06:36,350
PROFESSOR: OK.
115
00:06:36,350 --> 00:06:42,530
Well, it should be n
minus 1 if we're actually
116
00:06:42,530 --> 00:06:44,660
estimating one parameter.
117
00:06:44,660 --> 00:06:54,050
So if the independent variables
were, say, a constant, 1,
118
00:06:54,050 --> 00:06:57,160
so we're just estimating a
sample from a normal with mean
119
00:06:57,160 --> 00:07:03,030
beta 1 corresponding to
the units vector of the X,
120
00:07:03,030 --> 00:07:11,410
then we would have a one
degree of freedom correction
121
00:07:11,410 --> 00:07:14,120
to the residuals to get
an unbiased estimator.
122
00:07:14,120 --> 00:07:17,150
But what if we
have p parameters?
123
00:07:17,150 --> 00:07:18,370
Well, let me ask you this.
124
00:07:18,370 --> 00:07:23,280
What if we had n parameters
in our regression model?
125
00:07:23,280 --> 00:07:28,130
What would happen if
we had a full rank n
126
00:07:28,130 --> 00:07:30,760
independent variable matrix
and n independent observations?
127
00:07:34,062 --> 00:07:35,690
AUDIENCE: [INAUDIBLE].
128
00:07:35,690 --> 00:07:38,410
PROFESSOR: Yes, you'd have
an exact fit to the data.
129
00:07:38,410 --> 00:07:43,560
So this estimate would be 0.
130
00:07:43,560 --> 00:07:47,500
And so clearly, if
the data do arise
131
00:07:47,500 --> 00:07:52,059
from a normal linear regression
model, 0 is not unbiased.
132
00:07:52,059 --> 00:07:53,600
And you need to have
some correction.
133
00:07:53,600 --> 00:07:58,220
Turns out you need
to divide by n
134
00:07:58,220 --> 00:08:01,980
minus the rank of the X
matrix, the degrees of freedom
135
00:08:01,980 --> 00:08:05,630
in the model, to get
an unbiased estimate.
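[A minimal simulation (assumed setup, not from the course notes) checking this point: dividing Q(beta hat) by n gives a biased estimate of sigma squared, while dividing by n minus p, n minus the rank of X, is unbiased.]

```python
import numpy as np

# Simulation: the ML estimate rss/n is biased downward by the factor
# (n - p)/n; the degrees-of-freedom correction rss/(n - p) is unbiased.
rng = np.random.default_rng(1)
n, p, sigma2 = 50, 4, 4.0
mle_vals, corrected_vals = [], []
for _ in range(2000):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
    y = X @ np.ones(p) + rng.normal(scale=np.sqrt(sigma2), size=n)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta_hat) ** 2)      # Q(beta_hat)
    mle_vals.append(rss / n)                   # ML estimate (biased)
    corrected_vals.append(rss / (n - p))       # unbiased estimator
print(np.mean(mle_vals), np.mean(corrected_vals))  # ~3.68 vs ~4.0
```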
136
00:08:05,630 --> 00:08:08,610
So this is an important
issue, highlights
137
00:08:08,610 --> 00:08:11,880
how the more parameters you add
in the model, the more precise
138
00:08:11,880 --> 00:08:13,760
your fitted values are.
139
00:08:13,760 --> 00:08:15,840
In a sense, there's
dangers of curve fitting
140
00:08:15,840 --> 00:08:18,370
which you want to avoid.
141
00:08:18,370 --> 00:08:25,070
But the maximum likelihood
estimates, in fact, are biased.
142
00:08:25,070 --> 00:08:27,482
You just have to
be aware of that.
143
00:08:27,482 --> 00:08:29,190
And when you're using
different software,
144
00:08:29,190 --> 00:08:30,170
fitting different
models, you need
145
00:08:30,170 --> 00:08:32,450
to know whether there are
various corrections being
146
00:08:32,450 --> 00:08:33,654
made for bias or not.
147
00:08:38,370 --> 00:08:41,679
So this solves the
estimation problem
148
00:08:41,679 --> 00:08:44,790
for normal linear
regression models.
149
00:08:44,790 --> 00:08:48,310
And when we have normal
linear regression
150
00:08:48,310 --> 00:08:50,470
models, the theorem we
went through last time--
151
00:08:50,470 --> 00:08:51,428
this is very important.
152
00:08:51,428 --> 00:08:54,590
Let me just go back and
highlight that for you.
153
00:09:02,430 --> 00:09:05,370
This theorem right here.
154
00:09:05,370 --> 00:09:10,010
This is really a very
important theorem
155
00:09:10,010 --> 00:09:13,330
indicating what is the
distribution of the least
156
00:09:13,330 --> 00:09:15,800
squares, now the maximum
likelihood estimates
157
00:09:15,800 --> 00:09:17,670
of our regression model?
158
00:09:17,670 --> 00:09:20,750
They are normally distributed.
159
00:09:20,750 --> 00:09:25,570
And the residuals, sum
of squares, have a chi
160
00:09:25,570 --> 00:09:28,140
squared distribution
with degrees of freedom
161
00:09:28,140 --> 00:09:29,910
given by n minus p.
162
00:09:29,910 --> 00:09:34,770
And we can look at how
much signal to noise
163
00:09:34,770 --> 00:09:36,490
there is in estimating
our regression
164
00:09:36,490 --> 00:09:40,590
parameters by calculating a t
statistic, which is take away
165
00:09:40,590 --> 00:09:45,400
from an estimate its
expected value, its mean,
166
00:09:45,400 --> 00:09:48,330
and divide through by an
estimate of the variability
167
00:09:48,330 --> 00:09:50,421
in standard deviation units.
168
00:09:50,421 --> 00:09:51,920
And that will have
a t distribution.
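[The t statistic just described can be computed directly; a sketch on simulated data (illustrative names and numbers, not from the lecture notes): subtract from each estimate its hypothesized mean, here 0, and divide by its estimated standard deviation, using the unbiased variance estimate s^2 = RSS / (n - p).]

```python
import numpy as np

# Sketch: t_j = beta_hat_j / SE(beta_hat_j), with
# SE^2 = s^2 * [(X'X)^{-1}]_{jj} and s^2 = RSS / (n - p).
rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = 1.0 + 0.5 * X[:, 1] + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
rss = np.sum((y - X @ beta_hat) ** 2)
s2 = rss / (n - X.shape[1])              # unbiased estimate of sigma^2
se = np.sqrt(s2 * np.diag(XtX_inv))      # standard errors of beta_hat
t_stats = beta_hat / se                  # signal-to-noise per coefficient
print(t_stats)
```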
169
00:09:51,920 --> 00:09:56,800
So that's a critical
way to assess
170
00:09:56,800 --> 00:09:59,200
the relevance of different
explanatory variables
171
00:09:59,200 --> 00:10:00,690
in our model.
172
00:10:00,690 --> 00:10:06,060
And this approach will apply
with maximum likelihood
173
00:10:06,060 --> 00:10:08,010
estimation in all
kinds of models
174
00:10:08,010 --> 00:10:10,510
apart from normal linear
regression models.
175
00:10:10,510 --> 00:10:13,970
It turns out maximum
likelihood estimates generally
176
00:10:13,970 --> 00:10:17,880
are asymptotically
normally distributed.
177
00:10:17,880 --> 00:10:21,630
And so these properties here
will apply for those models
178
00:10:21,630 --> 00:10:23,020
as well.
179
00:10:23,020 --> 00:10:27,470
So let's finish up these
notes on estimation
180
00:10:27,470 --> 00:10:32,590
by talking about
generalized M estimation.
181
00:10:32,590 --> 00:10:39,020
So what we want to consider is
estimating unknown parameters
182
00:10:39,020 --> 00:10:44,630
by minimizing some
function, Q of beta,
183
00:10:44,630 --> 00:10:49,890
which is a sum of evaluations
of another function h,
184
00:10:49,890 --> 00:10:53,180
evaluated for each of
the individual cases.
185
00:10:53,180 --> 00:10:59,980
And choosing h to take on
different functional forms
186
00:10:59,980 --> 00:11:03,120
will define different
kinds of estimators.
187
00:11:03,120 --> 00:11:08,440
We've seen how when h
is simply the square
188
00:11:08,440 --> 00:11:13,880
of the case minus its
regression prediction,
189
00:11:13,880 --> 00:11:18,980
that leads to least squares,
and in fact, maximum likelihood
190
00:11:18,980 --> 00:11:23,830
estimation, as we saw before.
191
00:11:23,830 --> 00:11:27,340
Rather than taking the
square of the residual,
192
00:11:27,340 --> 00:11:29,540
the fitted residual,
we could take simply
193
00:11:29,540 --> 00:11:33,510
the modulus of that.
194
00:11:33,510 --> 00:11:36,930
And so that would be the
mean absolute deviation.
195
00:11:36,930 --> 00:11:39,040
So rather than summing
the squared deviations
196
00:11:39,040 --> 00:11:42,310
from the mean, we could
sum the absolute deviations
197
00:11:42,310 --> 00:11:43,780
from the mean.
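[The contrast between the two choices of h can be made concrete; an illustrative comparison on simulated data (all numbers here are made up): with h(r) = r^2 the minimizing constant is the sample mean, while with h(r) = |r| it is the sample median, which a single gross outlier barely moves.]

```python
import numpy as np

# Grid-search the constant c minimizing each criterion; the squared loss
# is dragged toward the outlier, the absolute loss is not.
rng = np.random.default_rng(4)
y = np.concatenate([rng.normal(size=100), [50.0]])  # one gross outlier
grid = np.linspace(-2, 2, 4001)                      # candidate constants c
sq_loss = [np.sum((y - c) ** 2) for c in grid]
abs_loss = [np.sum(np.abs(y - c)) for c in grid]
c_sq = grid[int(np.argmin(sq_loss))]    # ~ mean(y), pulled by the outlier
c_abs = grid[int(np.argmin(abs_loss))]  # ~ median(y), robust to it
print(c_sq, c_abs)
```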
198
00:11:43,780 --> 00:11:46,710
Now, from a
mathematical standpoint,
199
00:11:46,710 --> 00:11:50,530
if we want to solve
for those estimates,
200
00:11:50,530 --> 00:11:52,450
how would you go
about doing that?
201
00:11:55,160 --> 00:12:01,950
What methodology would you
use to maximize this function?
202
00:12:01,950 --> 00:12:04,380
Well, we try and apply
basically the same principles
203
00:12:04,380 --> 00:12:09,690
of if this is a
convex function, then
204
00:12:09,690 --> 00:12:12,860
we just want to take derivatives
of that and solve for that
205
00:12:12,860 --> 00:12:14,110
being equal to 0.
206
00:12:14,110 --> 00:12:17,080
So what happens when
you take the derivative
207
00:12:17,080 --> 00:12:21,110
of the modulus of y minus xi
beta with respect to beta?
208
00:12:24,749 --> 00:12:27,620
AUDIENCE: [INAUDIBLE].
209
00:12:27,620 --> 00:12:30,780
PROFESSOR: What did you say?
210
00:12:30,780 --> 00:12:32,890
What did you say?
211
00:12:32,890 --> 00:12:36,783
AUDIENCE: Yeah, it's
not [INAUDIBLE].
212
00:12:36,783 --> 00:12:38,908
The first [INAUDIBLE]
derivative is not continuous.
213
00:12:45,460 --> 00:12:46,610
PROFESSOR: OK.
214
00:12:46,610 --> 00:12:50,940
Well, this is not
a smooth function.
215
00:12:50,940 --> 00:13:06,290
But let me just plot x_i beta
here, and y_i minus that.
216
00:13:06,290 --> 00:13:15,060
Basically, this is going
to be a function that
217
00:13:15,060 --> 00:13:19,230
has slope 1 when it's positive
and slope minus 1 when
218
00:13:19,230 --> 00:13:20,450
it's negative.
219
00:13:20,450 --> 00:13:26,260
And so that will be true,
component-wise, or for the y.
220
00:13:26,260 --> 00:13:28,850
So what we end up
wanting to do is
221
00:13:28,850 --> 00:13:31,000
find the value of the
regression estimate
222
00:13:31,000 --> 00:13:36,680
that minimizes the
sum of predictions
223
00:13:36,680 --> 00:13:40,670
that are below the estimate plus
the sum of the predictions that
224
00:13:40,670 --> 00:13:43,240
are above the estimate given
by the regression line.
225
00:13:43,240 --> 00:13:45,580
And that solves the problem.
226
00:13:45,580 --> 00:13:50,960
Now, with the maximum
likelihood estimation,
227
00:13:50,960 --> 00:13:55,840
one can plug in minus log the
density of y_i given beta, x
228
00:13:55,840 --> 00:13:57,730
and sigma_i squared.
229
00:13:57,730 --> 00:14:04,400
And that function simply sums
to the log of the joint density
230
00:14:04,400 --> 00:14:05,510
for all the data.
231
00:14:05,510 --> 00:14:08,530
So that works as well.
232
00:14:08,530 --> 00:14:13,520
With robust M estimators, we can
consider another function chi
233
00:14:13,520 --> 00:14:18,210
which can be defined to have
good properties with estimates.
234
00:14:18,210 --> 00:14:21,065
And there's a whole theory
of robust estimation--
235
00:14:21,065 --> 00:14:23,830
it's very rich-- which
talks about how best
236
00:14:23,830 --> 00:14:27,400
to specify this chi function.
237
00:14:27,400 --> 00:14:33,130
Now, one of the problems
with least squares estimation
238
00:14:33,130 --> 00:14:37,400
is that the squares
of very large values
239
00:14:37,400 --> 00:14:40,210
are very, very
large in magnitude.
240
00:14:40,210 --> 00:14:42,740
So there's perhaps
an undue influence
241
00:14:42,740 --> 00:14:47,650
of very large values, very large
residuals under least squares
242
00:14:47,650 --> 00:14:49,680
estimation and maximum
[INAUDIBLE] estimation.
243
00:14:49,680 --> 00:14:53,600
So robust estimators
allow you to control that
244
00:14:53,600 --> 00:14:57,770
by defining the
function differently.
245
00:14:57,770 --> 00:15:00,830
Finally, there are
quantile estimators,
246
00:15:00,830 --> 00:15:07,410
which extend the mean
absolute deviation criterion.
247
00:15:07,410 --> 00:15:11,220
And so if we consider
the h function
248
00:15:11,220 --> 00:15:16,270
to be basically a
multiple of the deviation
249
00:15:16,270 --> 00:15:23,460
if the residual is positive
and a different multiple,
250
00:15:23,460 --> 00:15:26,810
a complementary multiple if
the deviation, the residual,
251
00:15:26,810 --> 00:15:30,910
is less than 0,
then by varying tau,
252
00:15:30,910 --> 00:15:35,230
you end up getting
quantile estimators, where
253
00:15:35,230 --> 00:15:38,921
what you're doing is minimizing
the estimate of the tau
254
00:15:38,921 --> 00:15:39,420
quantile.
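[The check ("pinball") loss behind quantile estimators can be sketched as follows (an illustration on simulated data, not the course's R code): positive residuals are weighted by tau, negative ones by 1 - tau, and minimizing the summed loss over a constant recovers the empirical tau-quantile.]

```python
import numpy as np

# Check loss: tau * u for u >= 0, (tau - 1) * u for u < 0.
def check_loss(u, tau):
    return np.where(u >= 0, tau * u, (tau - 1.0) * u)

rng = np.random.default_rng(3)
y = rng.normal(size=2001)
tau = 0.25
grid = np.linspace(-3, 3, 6001)                   # candidate constants
losses = [check_loss(y - c, tau).sum() for c in grid]
c_hat = grid[int(np.argmin(losses))]
print(c_hat, np.quantile(y, tau))                 # nearly identical
```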
255
00:15:47,510 --> 00:15:51,240
So this general
class of M estimators
256
00:15:51,240 --> 00:15:54,730
encompasses most
estimators that we will
257
00:15:54,730 --> 00:15:59,020
encounter in fitting models.
258
00:15:59,020 --> 00:16:03,130
So that finishes the technical
or the mathematical discussion
259
00:16:03,130 --> 00:16:05,190
of regression analysis.
260
00:16:05,190 --> 00:16:31,070
Let me highlight for you--
there's a case study that I
261
00:16:31,070 --> 00:16:34,410
dragged to the desktop here.
262
00:16:34,410 --> 00:16:37,532
And I wanted to find that.
263
00:16:37,532 --> 00:16:38,240
Let me find that.
264
00:16:46,970 --> 00:16:54,300
There's a case study that's been
added to the course website.
265
00:16:54,300 --> 00:16:58,840
And this first one is on
linear regression models
266
00:16:58,840 --> 00:17:00,370
for asset pricing.
267
00:17:00,370 --> 00:17:03,430
And I want you to
read through that just
268
00:17:03,430 --> 00:17:08,099
to see how it applies to
fitting various simple linear
269
00:17:08,099 --> 00:17:09,650
regression models.
270
00:17:09,650 --> 00:17:12,985
And enter full screen.
271
00:17:17,900 --> 00:17:21,650
This case study begins by
introducing the capital asset
272
00:17:21,650 --> 00:17:24,670
pricing model, which
basically suggests
273
00:17:24,670 --> 00:17:28,190
that if you look at the
returns on any stocks
274
00:17:28,190 --> 00:17:30,720
in an efficient
market, then those
275
00:17:30,720 --> 00:17:36,830
should depend on the return
of the overall market
276
00:17:36,830 --> 00:17:40,040
but scaled by how
risky the stock is.
277
00:17:40,040 --> 00:17:45,170
And so if one looks
at basically what
278
00:17:45,170 --> 00:17:47,929
the return is on the
stock on the right scale,
279
00:17:47,929 --> 00:17:49,970
you should have a simple
linear regression model.
280
00:17:49,970 --> 00:17:54,110
So here, we just look at
a time series for GE stock
281
00:17:54,110 --> 00:17:55,972
and the S&P 500.
282
00:17:55,972 --> 00:17:58,180
And the case study guides you
through how you can actually
283
00:17:58,180 --> 00:18:01,790
collect this data
on the web using R.
284
00:18:01,790 --> 00:18:06,845
And so the case notes
provide those details.
285
00:18:09,350 --> 00:18:11,930
There's also the
three-month treasury rate
286
00:18:11,930 --> 00:18:13,660
which is collected.
287
00:18:13,660 --> 00:18:16,190
And so if you're
thinking about return
288
00:18:16,190 --> 00:18:19,540
on the stock versus return
on the index, well, what's
289
00:18:19,540 --> 00:18:24,940
really of interest is the excess
return over a risk-free rate.
290
00:18:24,940 --> 00:18:27,450
And the efficient
markets models,
291
00:18:27,450 --> 00:18:31,390
basically the excess
return of a stock
292
00:18:31,390 --> 00:18:34,330
is related to the excess
return of the market as
293
00:18:34,330 --> 00:18:37,250
given by a linear
regression model.
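[The excess-return regression just described can be sketched in code. This is an illustration on simulated returns (the case study pulls real GE, S&P 500, and three-month Treasury data with R; every number below is made up): excess stock return = alpha + beta times excess market return plus noise.]

```python
import numpy as np

# Simulated daily returns; r_f is an assumed constant risk-free rate.
rng = np.random.default_rng(5)
n = 500
r_f = 0.0001
r_mkt = r_f + rng.normal(0.0003, 0.01, n)              # market returns
r_stock = r_f + 1.2 * (r_mkt - r_f) + rng.normal(0, 0.008, n)

x = r_mkt - r_f                                        # excess market return
y = r_stock - r_f                                      # excess stock return
X = np.column_stack([np.ones(n), x])
alpha_hat, beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(alpha_hat, beta_hat)                             # beta_hat near 1.2
```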
294
00:18:37,250 --> 00:18:39,310
So we can fit this model.
295
00:18:39,310 --> 00:18:46,360
And here's a plot of the excess
returns on a daily basis for GE
296
00:18:46,360 --> 00:18:47,640
stock versus the market.
297
00:18:47,640 --> 00:18:52,444
So that looks like a
nice sort of point cloud
298
00:18:52,444 --> 00:18:54,110
for which a linear
model might fit well.
299
00:18:54,110 --> 00:18:54,800
And it does.
300
00:18:59,400 --> 00:19:01,170
Well, there are
regression diagnostics,
301
00:19:01,170 --> 00:19:05,300
which I'll get to-- well, there
are regression diagnostics
302
00:19:05,300 --> 00:19:09,110
which are detailed in the
problem set, where we're
303
00:19:09,110 --> 00:19:12,420
looking at how influential are
individual observations, what's
304
00:19:12,420 --> 00:19:14,160
their impact on
regression parameters.
305
00:19:16,680 --> 00:19:20,160
This display here
basically highlights
306
00:19:20,160 --> 00:19:21,790
with a very simple
linear regression
307
00:19:21,790 --> 00:19:25,770
model what are the
influential data points.
308
00:19:25,770 --> 00:19:28,560
And so I've highlighted
in red those values
309
00:19:28,560 --> 00:19:30,640
which are influential.
310
00:19:30,640 --> 00:19:34,060
Now, if you look at the
definition of leverage
311
00:19:34,060 --> 00:19:36,390
in a linear model,
it's very simple.
312
00:19:36,390 --> 00:19:39,130
In a simple linear model, it's
just those observations that
313
00:19:39,130 --> 00:19:42,200
are very far from the
mean that have large leverage.
314
00:19:42,200 --> 00:19:46,060
And so you can confirm
that with your answers
315
00:19:46,060 --> 00:19:48,470
to the problem set.
316
00:19:48,470 --> 00:19:52,710
This x indicates a
significantly influential point
317
00:19:52,710 --> 00:19:55,720
in terms of the
regression parameters
318
00:19:55,720 --> 00:19:57,090
given by Cook's distance.
319
00:19:57,090 --> 00:19:59,956
And that definition is also
given in the case notes.
320
00:19:59,956 --> 00:20:00,908
AUDIENCE: [INAUDIBLE].
321
00:20:04,240 --> 00:20:06,630
PROFESSOR: By computing
the individual
322
00:20:06,630 --> 00:20:09,930
leverages with a function
that's given here,
323
00:20:09,930 --> 00:20:13,385
and by selecting out those
that exceed a given magnitude.
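[The leverage and Cook's distance computations just mentioned can be sketched as follows (standard formulas on simulated data; the case study itself does this in R, and the flagging cutoff below is a common convention, not from the lecture): leverage h_ii is the i-th diagonal entry of the hat matrix H = X (X'X)^{-1} X', and Cook's distance combines the residual with h_ii to measure each observation's influence.]

```python
import numpy as np

# Hat matrix diagonals (leverages) and Cook's distances for a simple
# linear regression on simulated data.
rng = np.random.default_rng(6)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.0, 1.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                           # leverages; they sum to p
resid = y - H @ y                        # residuals y - y_hat
s2 = np.sum(resid ** 2) / (n - p)
cooks_d = (resid ** 2 / (p * s2)) * h / (1 - h) ** 2
high_leverage = h > 2 * p / n            # common cutoff (an assumption here)
print(h.sum(), high_leverage.sum(), cooks_d.max())
```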
324
00:20:17,870 --> 00:20:20,530
Now, with this very,
very simple model
325
00:20:20,530 --> 00:20:23,190
of stocks depending
on one unknown factor,
326
00:20:23,190 --> 00:20:26,110
risk factor given the market.
327
00:20:26,110 --> 00:20:29,730
In modeling equity
returns, there
328
00:20:29,730 --> 00:20:33,680
are many different factors that
can have an impact on returns.
329
00:20:33,680 --> 00:20:36,890
So what I've done
in the case study
330
00:20:36,890 --> 00:20:48,660
is to look at adding
another factor which is just
331
00:20:48,660 --> 00:20:51,590
the return on crude oil.
332
00:20:51,590 --> 00:20:55,210
And so-- I need to go down here.
333
00:21:04,090 --> 00:21:10,260
So let me highlight
something for you here.
334
00:21:10,260 --> 00:21:15,220
With GE stock, what would you
expect the impact of, say,
335
00:21:15,220 --> 00:21:19,260
a high return on crude oil to
be on the return of GE stock?
336
00:21:19,260 --> 00:21:21,500
Would you expect it to
be positively related
337
00:21:21,500 --> 00:21:22,730
or negatively related?
338
00:21:30,910 --> 00:21:31,410
OK.
339
00:21:34,510 --> 00:21:39,610
Well, GE is a stock that's
just a broad stock invested
340
00:21:39,610 --> 00:21:41,820
in many different industries.
341
00:21:41,820 --> 00:21:45,390
And it really reflects the
overall market, to some extent.
342
00:21:45,390 --> 00:21:48,710
Many years ago,
10, 15 years ago,
343
00:21:48,710 --> 00:21:51,960
GE represented maybe 3% of
the GNP of the US market.
344
00:21:51,960 --> 00:21:55,510
So it was really highly related
to how well the market does.
345
00:21:55,510 --> 00:21:59,700
Now, crude oil is a commodity.
346
00:21:59,700 --> 00:22:07,010
And oil is used to drive cars,
to fuel energy production.
347
00:22:07,010 --> 00:22:10,510
So if you have an
increase in oil prices,
348
00:22:10,510 --> 00:22:13,770
then the cost of essentially
doing business goes up.
349
00:22:13,770 --> 00:22:18,870
So it is associated with
an inflation factor.
350
00:22:18,870 --> 00:22:20,380
Prices are rising.
351
00:22:20,380 --> 00:22:25,730
So you can see here,
the regression estimate,
352
00:22:25,730 --> 00:22:29,830
if we add in a factor of
the return on crude oil,
353
00:22:29,830 --> 00:22:32,120
it's negative 0.03.
354
00:22:32,120 --> 00:22:36,740
And it has a t value
of minus 3.561.
355
00:22:36,740 --> 00:22:41,330
So in fact, the market, in
a sense, over this period,
356
00:22:41,330 --> 00:22:44,600
for this analysis, was not
efficient in explaining
357
00:22:44,600 --> 00:22:49,730
the return on GE; crude oil
is another independent factor
358
00:22:49,730 --> 00:22:52,260
that helps explain returns.
359
00:22:52,260 --> 00:22:55,850
So that's useful to know.
360
00:22:55,850 --> 00:23:01,430
And if you are clever about
defining and identifying
361
00:23:01,430 --> 00:23:03,590
and evaluating
different factors,
362
00:23:03,590 --> 00:23:07,550
you can build
factor asset pricing
363
00:23:07,550 --> 00:23:11,430
models that are
very, very useful
364
00:23:11,430 --> 00:23:13,390
for investing and trading.
365
00:23:13,390 --> 00:23:18,710
Now, as a comparison
to this case study,
366
00:23:18,710 --> 00:23:26,040
I also applied the same
analysis to Exxon Mobil.
367
00:23:26,040 --> 00:23:30,330
Now, Exxon Mobil
is an oil company.
368
00:23:30,330 --> 00:23:35,530
So let me highlight this here.
369
00:23:35,530 --> 00:23:37,570
We basically are
fitting this model.
370
00:23:37,570 --> 00:23:39,050
Now let's highlight it.
371
00:23:43,150 --> 00:23:48,960
Here, if we consider
this two-factor model,
372
00:23:48,960 --> 00:23:50,650
the regression
parameter corresponding
373
00:23:50,650 --> 00:23:57,840
to the crude oil factor is
plus 0.13 with a t value of 16.
374
00:23:57,840 --> 00:24:01,750
So crude oil definitely
has an impact
375
00:24:01,750 --> 00:24:06,370
on the return of Exxon Mobil,
because it goes up and down
376
00:24:06,370 --> 00:24:07,065
with oil prices.
377
00:24:16,300 --> 00:24:19,550
This case study closes
with a scatter plot
378
00:24:19,550 --> 00:24:22,950
of the independent variables
and highlighting where
379
00:24:22,950 --> 00:24:25,740
the influential values are.
380
00:24:25,740 --> 00:24:28,650
And so just in the same way that
with a simple linear regression
381
00:24:28,650 --> 00:24:32,430
it was those that were far
away from the mean of the data
382
00:24:32,430 --> 00:24:35,920
were influential, in a
multivariate setting-- here,
383
00:24:35,920 --> 00:24:38,450
it's bivariate-- the
influential observations
384
00:24:38,450 --> 00:24:41,240
are those that are very
far away from the centroid.
385
00:24:41,240 --> 00:24:43,931
And if you look at one of the
problems in the problem set,
386
00:24:43,931 --> 00:24:45,430
it actually goes
through and you can
387
00:24:45,430 --> 00:24:48,930
see where these
leveraged values are
388
00:24:48,930 --> 00:24:53,580
and how it indicates influences
associated with the Mahalanobis
389
00:24:53,580 --> 00:24:56,660
distance of cases
from the centroid
390
00:24:56,660 --> 00:24:58,820
of the independent variables.
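[The Mahalanobis distance mentioned here can be computed directly; a sketch with the standard formula on simulated data (the problem set works with the real case-study data in R): d_i^2 = (x_i - xbar)' S^{-1} (x_i - xbar), where S is the sample covariance of the independent variables, so high-leverage cases are those far from the centroid.]

```python
import numpy as np

# Squared Mahalanobis distances from the centroid of two predictors.
rng = np.random.default_rng(7)
Z = rng.normal(size=(200, 2))                  # bivariate predictor matrix
xbar = Z.mean(axis=0)
S_inv = np.linalg.inv(np.cov(Z, rowvar=False))
d2 = np.einsum('ij,jk,ik->i', Z - xbar, S_inv, Z - xbar)
print(d2.argmax(), d2.max())                   # farthest point from centroid
```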
391
00:24:58,820 --> 00:25:02,010
So if you're a visual
type mathematician as
392
00:25:02,010 --> 00:25:04,850
opposed to an algebraic
type mathematician,
393
00:25:04,850 --> 00:25:06,390
I think these
kinds of graphs are
394
00:25:06,390 --> 00:25:10,970
very helpful in understanding
what is really going on.
395
00:25:10,970 --> 00:25:16,180
And the degree of influence
is associated with the fact
396
00:25:16,180 --> 00:25:21,380
that we're basically taking
least squares estimates,
397
00:25:21,380 --> 00:25:23,560
so we have the quadratic
form associated
398
00:25:23,560 --> 00:25:24,790
with the overall process.
399
00:25:28,800 --> 00:25:33,950
There's another
case study that I'll
400
00:25:33,950 --> 00:25:40,054
be happy to discuss after
class or during office hours.
401
00:25:40,054 --> 00:25:42,220
I don't think we have time
today during the lecture.
402
00:25:42,220 --> 00:25:45,650
But it concerns
exchange rate regimes.
403
00:25:45,650 --> 00:25:51,310
And the second case study
looks at the Chinese yuan,
404
00:25:51,310 --> 00:25:55,960
which was basically pegged
to the dollar for many years.
405
00:25:55,960 --> 00:26:00,190
And then I guess through
political influence
406
00:26:00,190 --> 00:26:02,710
from other countries,
they started
407
00:26:02,710 --> 00:26:06,172
to let the yuan vary
from the dollar,
408
00:26:06,172 --> 00:26:08,560
but perhaps pegged
it to some basket
409
00:26:08,560 --> 00:26:10,690
of securities-- of currencies.
410
00:26:10,690 --> 00:26:13,540
And so how would you determine
what that basket of currencies
411
00:26:13,540 --> 00:26:14,039
is?
412
00:26:14,039 --> 00:26:16,250
Well, there are
regression methods
413
00:26:16,250 --> 00:26:19,490
that have been
developed by economists
414
00:26:19,490 --> 00:26:20,650
that help you do that.
415
00:26:20,650 --> 00:26:23,480
And that case study goes
through the analysis of that.
416
00:26:23,480 --> 00:26:26,770
So check that out to see how
you can get immediate access
417
00:26:26,770 --> 00:26:29,750
to currency data and be
fitting these regression models
418
00:26:29,750 --> 00:26:31,250
and looking at the
different results
419
00:26:31,250 --> 00:26:32,458
and trying to evaluate those.
420
00:26:38,720 --> 00:26:48,170
So let's turn now
to the main topic--
421
00:26:48,170 --> 00:26:54,200
let's see here-- which
is time series analysis.
422
00:27:01,250 --> 00:27:04,080
Today in the rest
of the lecture,
423
00:27:04,080 --> 00:27:09,040
I want to talk about univariate
time series analysis.
424
00:27:09,040 --> 00:27:12,670
And so we're thinking of
basically a random variable
425
00:27:12,670 --> 00:27:17,720
that is observed over time and
it's a discrete time process.
426
00:27:17,720 --> 00:27:23,140
And we'll introduce you
to the Wold representation
427
00:27:23,140 --> 00:27:26,435
theorem and definitions
of stationarity
428
00:27:26,435 --> 00:27:28,340
and its relationship there.
429
00:27:28,340 --> 00:27:31,430
Then, look at the classic
models of autoregressive
430
00:27:31,430 --> 00:27:34,120
moving average models.
431
00:27:34,120 --> 00:27:36,920
And then extending those
to non-stationarity
432
00:27:36,920 --> 00:27:40,430
with integrated autoregressive
moving average models.
433
00:27:40,430 --> 00:27:44,440
And then finally, talk about
estimating stationary models
434
00:27:44,440 --> 00:27:47,630
and how we test
for stationarity.
435
00:27:47,630 --> 00:27:54,740
So let's begin from
basically first principles.
436
00:27:54,740 --> 00:27:59,310
We have a stochastic process,
a discrete time stochastic
437
00:27:59,310 --> 00:28:04,880
process, X, which consists
of random variables indexed
438
00:28:04,880 --> 00:28:06,160
by time.
439
00:28:06,160 --> 00:28:09,110
And we're thinking
now discrete time.
440
00:28:09,110 --> 00:28:11,820
The stochastic behavior
of this sequence
441
00:28:11,820 --> 00:28:16,050
is determined by specifying
the density or probability mass
442
00:28:16,050 --> 00:28:22,220
functions for all finite
collections of time indexes.
443
00:28:22,220 --> 00:28:26,490
And so if we could specify
all finite-dimensional
444
00:28:26,490 --> 00:28:28,130
distributions of
this process, we
445
00:28:28,130 --> 00:28:31,710
would specify this
probability model
446
00:28:31,710 --> 00:28:35,200
for the stochastic process.
447
00:28:35,200 --> 00:28:40,500
Now, this stochastic process
is strictly stationary
448
00:28:40,500 --> 00:28:48,760
if the density function for
any collection of times,
449
00:28:48,760 --> 00:28:55,780
t_1 through t_m, is equal to
the density function for a tau
450
00:28:55,780 --> 00:28:57,440
translation of that.
451
00:28:57,440 --> 00:29:03,000
So the density function for any
finite-dimensional distribution
452
00:29:03,000 --> 00:29:08,300
is stationary, is constant
under arbitrary translations.
453
00:29:08,300 --> 00:29:12,620
So that's a very
strong property.
454
00:29:12,620 --> 00:29:16,620
But it's a reasonable
property to ask for if you're
455
00:29:16,620 --> 00:29:18,566
doing statistical modeling.
456
00:29:18,566 --> 00:29:20,940
And what do you want to do
when you're estimating models?
457
00:29:20,940 --> 00:29:24,080
You want to estimate
things that are constant.
458
00:29:24,080 --> 00:29:26,570
Constants are nice
things to estimate.
459
00:29:26,570 --> 00:29:28,520
And parameters of
models are constant.
460
00:29:28,520 --> 00:29:32,930
So we really want the underlying
structure of the distributions
461
00:29:32,930 --> 00:29:35,150
to be the same.
462
00:29:44,960 --> 00:29:47,040
That was strict
stationarity, which
463
00:29:47,040 --> 00:29:51,510
requires knowledge of
the entire distribution
464
00:29:51,510 --> 00:29:55,020
of the stochastic process.
465
00:29:55,020 --> 00:29:57,340
We're now going to introduce
a weaker definition, which
466
00:29:57,340 --> 00:29:59,660
is covariance stationarity.
467
00:29:59,660 --> 00:30:02,960
And a covariance
stationary process
468
00:30:02,960 --> 00:30:08,330
has a constant mean,
mu; a constant variance,
469
00:30:08,330 --> 00:30:15,630
sigma squared; and a
covariance over increments tau,
470
00:30:15,630 --> 00:30:20,500
given by a function gamma of
tau, that is also constant.
471
00:30:20,500 --> 00:30:26,960
Gamma isn't a constant function,
but basically for all t,
472
00:30:26,960 --> 00:30:31,900
covariance of X_t, X_(t+tau)
is this gamma of tau function.
473
00:30:31,900 --> 00:30:38,080
And we also can introduce
the autocorrelation function
474
00:30:38,080 --> 00:30:41,830
of the stochastic
process, rho of tau.
475
00:30:41,830 --> 00:30:49,120
And so the correlation
of two random variables
476
00:30:49,120 --> 00:30:52,220
is the covariance of those
random variables divided
477
00:30:52,220 --> 00:30:57,340
by the square root of the
product of the variances.
478
00:30:57,340 --> 00:31:00,805
And Choongbum I think
introduced that a bit
479
00:31:00,805 --> 00:31:02,680
in one of his lectures,
where we were talking
480
00:31:02,680 --> 00:31:06,890
about the correlation function.
481
00:31:06,890 --> 00:31:09,810
But essentially, the
correlation function
482
00:31:09,810 --> 00:31:15,400
is if you standardize the
data or the random variables
483
00:31:15,400 --> 00:31:17,690
to have mean 0-- so
subtract off the means
484
00:31:17,690 --> 00:31:21,040
and then divide through by
their standard deviations.
485
00:31:21,040 --> 00:31:26,410
So those translated variables
have mean 0 and variance 1.
486
00:31:26,410 --> 00:31:29,482
Then the correlation
coefficient is the covariance
487
00:31:29,482 --> 00:31:31,315
between those standardized
random variables.
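As a rough sketch (not from the lecture; the function name and the white-noise example are illustrative assumptions), the sample autocovariance gamma-hat(tau) and autocorrelation rho-hat(tau) just defined can be computed like this:

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation rho-hat(tau) for tau = 0..max_lag.

    Autocovariances gamma-hat(tau) are normalized by gamma-hat(0),
    the sample variance, matching the definition of rho(tau).
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()                    # center at the sample mean
    gamma0 = xc @ xc / n                 # gamma-hat(0), the sample variance
    return np.array([(xc[: n - tau] @ xc[tau:]) / n / gamma0
                     for tau in range(max_lag + 1)])

# For white noise, rho(0) = 1 and rho(tau) should be near 0 for tau > 0.
rng = np.random.default_rng(0)
acf = sample_acf(rng.standard_normal(10_000), max_lag=5)
```

For a covariance stationary process these estimates are the natural plug-in versions of the constants gamma(tau) and rho(tau) described above.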
488
00:31:35,020 --> 00:31:38,810
So this is going to come up
again and again in time series
489
00:31:38,810 --> 00:31:40,080
analysis.
490
00:31:40,080 --> 00:31:42,650
Now, the Wold
representation theorem
491
00:31:42,650 --> 00:31:47,350
is a very, very powerful theorem
about covariance stationary
492
00:31:47,350 --> 00:31:47,850
processes.
493
00:31:51,110 --> 00:31:55,050
It basically states that if
we have a zero-mean covariance
494
00:31:55,050 --> 00:31:59,750
stationary time
series, then it can
495
00:31:59,750 --> 00:32:03,520
be decomposed into two
components with a very
496
00:32:03,520 --> 00:32:06,390
nice structure.
497
00:32:06,390 --> 00:32:11,430
Basically, X_t can be
decomposed into V_t plus S_t.
498
00:32:11,430 --> 00:32:18,470
V_t is going to be a linearly
deterministic process, meaning
499
00:32:18,470 --> 00:32:23,130
that past values of
V_t perfectly predict
500
00:32:23,130 --> 00:32:24,590
what V_t is going to be.
501
00:32:24,590 --> 00:32:27,780
So this could be like a
linear trend or some fixed
502
00:32:27,780 --> 00:32:29,660
function of past values.
503
00:32:29,660 --> 00:32:32,320
It's basically a
deterministic process.
504
00:32:32,320 --> 00:32:34,690
So there's nothing
random in V_t.
505
00:32:34,690 --> 00:32:40,710
It's something that's
fixed, without randomness.
506
00:32:40,710 --> 00:32:46,510
And S_t is a sum
of coefficients,
507
00:32:46,510 --> 00:32:56,650
psi_i times eta_(t-i), where
the eta_t's are linearly
508
00:32:56,650 --> 00:32:58,550
unpredictable white noise.
509
00:32:58,550 --> 00:33:03,890
So what we have is S_t
is a weighted average
510
00:33:03,890 --> 00:33:09,850
of white noise with
coefficients given by the psi_i.
511
00:33:09,850 --> 00:33:16,170
And the coefficients psi_i
are such that psi_0 is 1,
512
00:33:16,170 --> 00:33:18,830
and the sum of the
squared psi_i's is finite.
513
00:33:21,340 --> 00:33:26,540
And the white noise
eta_t-- what's white noise?
514
00:33:26,540 --> 00:33:28,930
It has expectation zero.
515
00:33:28,930 --> 00:33:35,120
It has variance, given by
sigma squared, that's constant.
516
00:33:35,120 --> 00:33:39,520
And it has covariance across
different white noise elements
517
00:33:39,520 --> 00:33:42,490
that's 0 for all t and s.
518
00:33:42,490 --> 00:33:45,810
So eta_t's are uncorrelated
with themselves,
519
00:33:45,810 --> 00:33:47,750
and of course, they
are uncorrelated
520
00:33:47,750 --> 00:33:51,290
with the deterministic process.
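A small simulation may make this concrete (my sketch, not from the lecture; the choice psi_i = 0.5^i is an assumption for illustration): build S_t as a weighted sum of white noise with square-summable weights and check that its variance is sigma squared times the sum of the squared psi_i.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0                                 # white-noise standard deviation
n, n_terms = 100_000, 50
psi = 0.5 ** np.arange(n_terms)             # psi_0 = 1, sum of psi_i^2 finite
eta = sigma * rng.standard_normal(n + n_terms)   # white noise eta_t

# Each S_t is the moving average sum_i psi_i * eta_(t-i), truncated at n_terms.
S = np.convolve(eta, psi, mode="valid")

# For such a linear process, Var(S_t) = sigma^2 * sum_i psi_i^2.
theoretical_var = sigma**2 * (psi @ psi)
```

The sample variance of the simulated path lands close to the theoretical value, which is the square-summability condition doing its work.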
521
00:33:51,290 --> 00:33:58,010
So this is really a very,
very powerful concept.
522
00:33:58,010 --> 00:34:00,600
If you are modeling
a process and it
523
00:34:00,600 --> 00:34:05,030
has covariance
stationarity, then there
524
00:34:05,030 --> 00:34:07,960
exists a representation
like this of the function.
525
00:34:07,960 --> 00:34:15,750
So it's a very
compelling structure,
526
00:34:15,750 --> 00:34:20,659
which we'll see how it applies
in different circumstances.
527
00:34:20,659 --> 00:34:25,650
Now, before getting into the
definition of autoregressive
528
00:34:25,650 --> 00:34:28,719
moving average
models, I just want
529
00:34:28,719 --> 00:34:33,820
to give you an intuitive
understanding of what's going
530
00:34:33,820 --> 00:34:36,469
on with the Wold decomposition.
531
00:34:36,469 --> 00:34:41,030
And this, I think,
will help motivate
532
00:34:41,030 --> 00:34:44,480
why the Wold
decomposition should exist
533
00:34:44,480 --> 00:34:48,170
from a mathematical standpoint.
534
00:34:48,170 --> 00:34:53,550
So consider just some
univariate stochastic process,
535
00:34:53,550 --> 00:34:56,500
some time series X_t
that we want to model.
536
00:34:56,500 --> 00:35:00,010
And we believe that it's
covariance stationary.
537
00:35:00,010 --> 00:35:02,850
And so we want to
specify essentially
538
00:35:02,850 --> 00:35:04,610
the Wold decomposition of that.
539
00:35:04,610 --> 00:35:07,680
Well, what we could
do is initialize
540
00:35:07,680 --> 00:35:10,890
a parameter p, the number
of past observations,
541
00:35:10,890 --> 00:35:15,310
in the linearly
deterministic term.
542
00:35:15,310 --> 00:35:24,420
And then estimate the linear
projection of X_t on the last p
543
00:35:24,420 --> 00:35:26,140
lag values.
544
00:35:26,140 --> 00:35:31,490
And so what I want to do
is consider estimating
545
00:35:31,490 --> 00:35:36,360
that relationship using
a sample of size n
546
00:35:36,360 --> 00:35:42,660
with some ending point t_0
less than or equal to T.
547
00:35:42,660 --> 00:35:50,010
And so we can consider y
values like a response variable
548
00:35:50,010 --> 00:35:57,760
being given by the successive
values of our time series.
549
00:35:57,760 --> 00:36:02,550
And so our response variables
y_j can be considered to be
550
00:36:02,550 --> 00:36:06,040
X_(t_0 - n + j).
551
00:36:06,040 --> 00:36:14,350
And define a y vector and
a Z matrix as follows.
552
00:36:20,140 --> 00:36:25,890
So we have values of our
stochastic process in y.
553
00:36:25,890 --> 00:36:29,080
And then our Z matrix,
which is essentially
554
00:36:29,080 --> 00:36:30,580
a matrix of
independent variables,
555
00:36:30,580 --> 00:36:36,000
is just the lagged
values of this process.
556
00:36:36,000 --> 00:36:37,940
So let's apply
ordinary least squares
557
00:36:37,940 --> 00:36:40,530
to specify the projection.
558
00:36:40,530 --> 00:36:43,810
This projection matrix
should be familiar now.
559
00:36:43,810 --> 00:36:49,160
And that basically gives
us a prediction of y hat
560
00:36:49,160 --> 00:36:51,680
depending on p lags.
561
00:36:51,680 --> 00:36:54,750
And we can compute the
projection residual
562
00:36:54,750 --> 00:36:56,080
from that fit.
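A rough sketch of that projection step in Python (the simulated AR(1) data and parameter choices are my assumptions, not the lecture's): build the lagged design matrix Z, fit by ordinary least squares, and form the projection residuals.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 5_000, 2
x = np.zeros(n)
for t in range(1, n):                        # simulate an AR(1) with phi = 0.6
    x[t] = 0.6 * x[t - 1] + rng.standard_normal()

# Response vector y and lagged design matrix Z: an intercept column
# plus one column per lag 1..p.
y = x[p:]
Z = np.column_stack([np.ones(n - p)] +
                    [x[p - j : n - j] for j in range(1, p + 1)])

beta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)   # OLS projection coefficients
y_hat = Z @ beta_hat                               # projection of y on the lags
resid = y - y_hat                                  # projection residuals
```

On this simulated series the lag-1 coefficient comes out near 0.6, the lag-2 coefficient near 0, and the residuals show essentially no remaining serial correlation, which is exactly the diagnostic described next.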
563
00:36:59,660 --> 00:37:03,450
Well, we can conduct
time series methods
564
00:37:03,450 --> 00:37:08,470
to analyze these residuals,
which we'll be introducing here
565
00:37:08,470 --> 00:37:13,170
in a few minutes, to specify
a moving average model.
566
00:37:13,170 --> 00:37:16,180
We can then have estimates of
the underlying coefficients
567
00:37:16,180 --> 00:37:22,700
psi and estimates of
these residuals eta_t.
568
00:37:22,700 --> 00:37:27,300
And then we can evaluate whether
this is a good model or not.
569
00:37:27,300 --> 00:37:29,430
What does it mean to be
an appropriate model?
570
00:37:29,430 --> 00:37:35,250
Well, the residual should
be orthogonal to longer lags
571
00:37:35,250 --> 00:37:39,550
than t minus s, or
longer lags than p.
572
00:37:39,550 --> 00:37:42,850
So we basically shouldn't
have any dependence
573
00:37:42,850 --> 00:37:49,390
of our residuals on lags
of the stochastic process
574
00:37:49,390 --> 00:37:51,550
that weren't included
in the model.
575
00:37:51,550 --> 00:37:54,850
Those should be orthogonal.
576
00:37:54,850 --> 00:38:01,070
And the eta_t hats should be
consistent with white noise.
577
00:38:01,070 --> 00:38:05,220
So those issues
can be evaluated.
578
00:38:05,220 --> 00:38:07,620
And if there's
evidence otherwise,
579
00:38:07,620 --> 00:38:10,720
then we can change the
specification of the model.
580
00:38:10,720 --> 00:38:13,090
We can add additional lags.
581
00:38:13,090 --> 00:38:15,870
We can add additional
deterministic variables
582
00:38:15,870 --> 00:38:21,570
if we can identify
what those might be.
583
00:38:21,570 --> 00:38:23,260
And proceed with this process.
584
00:38:23,260 --> 00:38:28,490
But essentially that is
how the Wold decomposition
585
00:38:28,490 --> 00:38:30,740
could be implemented.
586
00:38:30,740 --> 00:38:35,250
And theoretically, as
our sample gets large,
587
00:38:35,250 --> 00:38:42,320
if we're observing this time
series for a long time, then
588
00:38:42,320 --> 00:38:45,090
well certainly the
limit of the projections
589
00:38:45,090 --> 00:38:49,110
as p, the number of lags
we include, gets large,
590
00:38:49,110 --> 00:38:52,380
should be essentially
the projection
591
00:38:52,380 --> 00:38:55,270
of our data on its history.
592
00:38:55,270 --> 00:39:00,490
And that, in fact, is the
projection corresponding to,
593
00:39:00,490 --> 00:39:03,950
defining, the
coefficient's psi_i.
594
00:39:03,950 --> 00:39:09,400
And so in the limit, that
projection will converge
595
00:39:09,400 --> 00:39:11,320
and it will converge
in the sense
596
00:39:11,320 --> 00:39:15,070
that the coefficients of
the projection definition
597
00:39:15,070 --> 00:39:17,320
correspond to the psi_i.
598
00:39:17,320 --> 00:39:26,600
Now, if p is required
to go to infinity,
599
00:39:26,600 --> 00:39:29,510
that means that there's
basically a long-term
600
00:39:29,510 --> 00:39:31,145
dependence in the process.
601
00:39:34,310 --> 00:39:37,120
Basically, it doesn't
stop at a given lag.
602
00:39:37,120 --> 00:39:41,410
The dependence
persists over time.
603
00:39:41,410 --> 00:39:45,580
Then we may require
that p goes to infinity.
604
00:39:45,580 --> 00:39:47,360
Now, what happens when
p goes to infinity?
605
00:39:47,360 --> 00:39:50,036
Well, if you let p go
to infinity too quickly,
606
00:39:50,036 --> 00:39:51,410
you run out of
degrees of freedom
607
00:39:51,410 --> 00:39:53,520
to estimate your models.
608
00:39:53,520 --> 00:39:57,220
And so from an
implementation standpoint,
609
00:39:57,220 --> 00:40:01,340
you need to let p/n
go to 0 so that you
610
00:40:01,340 --> 00:40:09,180
have essentially more
data than parameters
611
00:40:09,180 --> 00:40:10,710
that you're estimating.
612
00:40:10,710 --> 00:40:13,800
And so that is required.
613
00:40:13,800 --> 00:40:18,860
And in time series
modeling, what we
614
00:40:18,860 --> 00:40:26,609
look for are models where
finite values of p are required.
615
00:40:26,609 --> 00:40:28,900
So we're only estimating a
finite number of parameters.
616
00:40:28,900 --> 00:40:31,920
Or if we have a moving
average model which
617
00:40:31,920 --> 00:40:35,300
has coefficients that
are infinite in number,
618
00:40:35,300 --> 00:40:40,430
perhaps those can be defined by
a small number of parameters.
619
00:40:40,430 --> 00:40:44,552
So we'll be looking for
that kind of feature
620
00:40:44,552 --> 00:40:45,385
in different models.
621
00:40:49,230 --> 00:40:52,620
Let's turn to talking
about the lag operator.
622
00:40:52,620 --> 00:40:56,250
The lag operator is
a fundamental tool
623
00:40:56,250 --> 00:40:59,430
in time series models.
624
00:40:59,430 --> 00:41:04,180
We consider the operator L
that shifts a time series back
625
00:41:04,180 --> 00:41:06,680
by one time increment.
626
00:41:06,680 --> 00:41:09,210
And applying this
operator recursively,
627
00:41:09,210 --> 00:41:14,400
we get, if it's operating
0 times, there's no lag,
628
00:41:14,400 --> 00:41:16,570
one time, there's
one lag, two times,
629
00:41:16,570 --> 00:41:18,860
two lags-- doing
that iteratively.
630
00:41:18,860 --> 00:41:22,470
And in thinking of these,
what we're dealing with
631
00:41:22,470 --> 00:41:26,680
is like a transformation on
infinite dimensional space,
632
00:41:26,680 --> 00:41:29,150
where it's like
the identity matrix
633
00:41:29,150 --> 00:41:32,390
sort of shifted by
one element-- or not
634
00:41:32,390 --> 00:41:35,320
the identity, but an element.
635
00:41:35,320 --> 00:41:37,290
It's like the identity
matrix shifted
636
00:41:37,290 --> 00:41:41,520
by one column or two columns.
637
00:41:41,520 --> 00:41:43,760
So anyway, inverses
of these operators
638
00:41:43,760 --> 00:41:49,440
are well defined in terms
of what we get from them.
639
00:41:49,440 --> 00:41:53,470
So we can represent
the Wold representation
640
00:41:53,470 --> 00:41:58,140
in terms of these lag
operators by saying
641
00:41:58,140 --> 00:42:03,120
that our stochastic
process X_t is
642
00:42:03,120 --> 00:42:10,030
equal to V_t plus this
psi of L function,
643
00:42:10,030 --> 00:42:14,030
basically a
functional of the lag
644
00:42:14,030 --> 00:42:18,570
operator, which is a potentially
infinite-order polynomial
645
00:42:18,570 --> 00:42:20,730
of the lags.
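On a finite sample the lag operator is just an index shift; here is a minimal sketch (the helper name is mine, not the lecture's) of L^k acting on a series.

```python
import numpy as np

def lag(x, k):
    """Apply L^k to a finite series: out[t] = x[t-k] for k >= 0.

    Entries before time k have no predecessor and are left as NaN.
    """
    x = np.asarray(x, dtype=float)
    if k == 0:
        return x.copy()          # L^0 is the identity
    out = np.full_like(x, np.nan)
    out[k:] = x[:-k]
    return out

x = np.array([10.0, 11.0, 12.0, 13.0])
lag1 = lag(x, 1)                 # series shifted back one time increment
lag2 = lag(x, 2)                 # shifted back two increments
```

Applying the operator recursively gives exactly the behavior described: zero applications is the identity, one application is one lag, and so on.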
646
00:42:20,730 --> 00:42:23,770
So this notation is
something that you
647
00:42:23,770 --> 00:42:26,110
need to get very
familiar with if you're
648
00:42:26,110 --> 00:42:28,520
going to be comfortable with
the different models that
649
00:42:28,520 --> 00:42:33,840
are introduced with
ARMA and ARIMA models.
650
00:42:33,840 --> 00:42:35,410
Any questions about that?
651
00:42:42,230 --> 00:42:43,870
Now relating to
this-- let me just
652
00:42:43,870 --> 00:42:47,550
introduce now, because this
will come up somewhat later.
653
00:42:47,550 --> 00:42:49,840
But there's the impulse
response function
654
00:42:49,840 --> 00:42:53,010
of the covariance
stationary process.
655
00:42:53,010 --> 00:42:58,630
If we have a stochastic process
X_t which is given by this Wold
656
00:42:58,630 --> 00:43:05,950
representation, then
you can ask yourself
657
00:43:05,950 --> 00:43:11,320
what happens to the innovation
at time t, which is eta_t,
658
00:43:11,320 --> 00:43:15,470
how does that affect
the process over time?
659
00:43:15,470 --> 00:43:21,590
And so, OK, pretend that you are
chairman of the Federal Reserve
660
00:43:21,590 --> 00:43:22,090
Bank.
661
00:43:22,090 --> 00:43:29,600
And you're interested in the GNP
or basically economic growth.
662
00:43:29,600 --> 00:43:33,944
And you're considering
changing interest rates
663
00:43:33,944 --> 00:43:36,340
to help the economy.
664
00:43:36,340 --> 00:43:38,630
Well, you'd like to
know what an impact is
665
00:43:38,630 --> 00:43:42,610
of your change in
this factor, how
666
00:43:42,610 --> 00:43:47,560
that's going to affect the
variable of interest, perhaps
667
00:43:47,560 --> 00:43:48,130
GNP.
668
00:43:48,130 --> 00:43:49,520
Now, in this case,
we're thinking
669
00:43:49,520 --> 00:43:55,140
of just a simple covariance
stationary stochastic process.
670
00:43:55,140 --> 00:44:00,165
It's basically a process that
is a random-- a weighted sum,
671
00:44:00,165 --> 00:44:03,210
a moving average of
innovations eta_t.
672
00:44:03,210 --> 00:44:06,130
But the question is, basically
any covariance stationary
673
00:44:06,130 --> 00:44:08,310
process could be
represented in this form.
674
00:44:08,310 --> 00:44:11,630
And the impulse
response function
675
00:44:11,630 --> 00:44:15,790
relates to what is
the impact of eta_t.
676
00:44:15,790 --> 00:44:18,120
What's its impact over time?
677
00:44:18,120 --> 00:44:21,940
Basically, it affects
the process at time t.
678
00:44:21,940 --> 00:44:24,360
That, because of the
moving average process,
679
00:44:24,360 --> 00:44:27,350
it affects it at t plus
1, affects it at t plus 2.
680
00:44:27,350 --> 00:44:33,810
And so this impulse
response is basically
681
00:44:33,810 --> 00:44:37,650
the derivative of the
value of the process
682
00:44:37,650 --> 00:44:44,210
with respect to the j-th previous
innovation is given by psi_j.
683
00:44:44,210 --> 00:44:47,360
So the different
innovations have an impact
684
00:44:47,360 --> 00:44:51,200
on the current value given by
this impulse response function.
685
00:44:51,200 --> 00:44:53,200
So looking backward,
that definition
686
00:44:53,200 --> 00:44:54,920
is pretty well defined.
687
00:44:54,920 --> 00:44:56,630
But you can also
think about how does
688
00:44:56,630 --> 00:44:58,620
an impact of the
innovation affect
689
00:44:58,620 --> 00:45:00,760
the process going forward.
690
00:45:00,760 --> 00:45:03,430
And the long-run
cumulative response
691
00:45:03,430 --> 00:45:07,490
is essentially what is the
impact of that innovation
692
00:45:07,490 --> 00:45:11,350
in the process ultimately?
693
00:45:11,350 --> 00:45:13,839
And eventually, it's
not going to change
694
00:45:13,839 --> 00:45:14,880
the value of the process.
695
00:45:14,880 --> 00:45:18,710
But what is the value to
which the process is moving
696
00:45:18,710 --> 00:45:20,890
because of that one innovation?
697
00:45:20,890 --> 00:45:22,630
And so the long run
cumulative response
698
00:45:22,630 --> 00:45:28,900
is given by basically the
sum of these individual ones.
699
00:45:28,900 --> 00:45:33,020
And it's given by the
sum of the psi_i's.
700
00:45:33,020 --> 00:45:37,295
So that's the polynomial of
psi with lag operator, where we
701
00:45:37,295 --> 00:45:39,010
replace the lag operator by 1.
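Here is a sketch of computing those psi_j weights and the long-run cumulative response for an ARMA model, using the standard psi-weight recursion (the recursion itself is not spelled out in the lecture, and the AR(1) example with phi = 0.5 is an assumed illustration).

```python
import numpy as np

def impulse_response(phi, theta, n_psi):
    """psi-weights of phi(L)(X_t) = theta(L) eta_t via the recursion
    psi_j = theta_j + sum_{k=1}^{p} phi_k * psi_{j-k}, with theta_0 = 1.

    Here phi = [phi_1, ..., phi_p] are the autoregressive coefficients
    in the regression form X_t = phi_1 X_{t-1} + ... + eta_t + ...
    """
    theta_full = [1.0] + list(theta)
    psi = []
    for j in range(n_psi):
        val = theta_full[j] if j < len(theta_full) else 0.0
        for k, phi_k in enumerate(phi, start=1):
            if j - k >= 0:
                val += phi_k * psi[j - k]
        psi.append(val)
    return np.array(psi)

# AR(1) with phi = 0.5: psi_j = 0.5**j, and the long-run cumulative
# response is theta(1)/phi(1) = 1/(1 - 0.5) = 2.
psi = impulse_response(phi=[0.5], theta=[], n_psi=40)
long_run = psi.sum()
```

Summing the psi_j is the same as evaluating the psi polynomial with the lag operator replaced by 1, which is the long-run cumulative response just described.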
702
00:45:43,540 --> 00:45:45,570
We'll see this
again when we talk
703
00:45:45,570 --> 00:45:50,546
about vector
autoregressive processes
704
00:45:50,546 --> 00:45:51,795
with multivariate time series.
705
00:45:56,020 --> 00:45:57,860
Now, the Wold
representation, which
706
00:45:57,860 --> 00:46:00,550
is an infinite-order moving
average, possibly infinite
707
00:46:00,550 --> 00:46:04,466
order, can have an
autoregressive representation.
708
00:46:07,940 --> 00:46:17,580
Suppose that there is
another polynomial psi_i
709
00:46:17,580 --> 00:46:23,240
star of the lags, which we're
going to call psi inverse of L,
710
00:46:23,240 --> 00:46:29,860
which satisfies the fact if you
multiply that with psi of L,
711
00:46:29,860 --> 00:46:31,690
you get the identity, lag 0.
712
00:46:31,690 --> 00:46:37,820
Then this psi inverse,
if that exists,
713
00:46:37,820 --> 00:46:47,060
is basically the
inverse of the psi of L.
714
00:46:47,060 --> 00:46:50,180
So if we start with psi of
L, if that's invertible,
715
00:46:50,180 --> 00:46:52,510
then there exists
a psi inverse of L,
716
00:46:52,510 --> 00:46:55,490
with coefficients psi_i star.
717
00:46:55,490 --> 00:47:02,130
And one can basically take
our original expression
718
00:47:02,130 --> 00:47:06,020
for the stochastic process,
which is as this moving average
719
00:47:06,020 --> 00:47:13,250
of the eta's, and express it
as this essentially moving
720
00:47:13,250 --> 00:47:16,450
averages of the X's.
721
00:47:16,450 --> 00:47:20,730
And so we've essentially
inverted the process
722
00:47:20,730 --> 00:47:27,500
and shown that the
stochastic process can
723
00:47:27,500 --> 00:47:35,570
be expressed as an infinite
order autoregressive
724
00:47:35,570 --> 00:47:36,850
representation.
725
00:47:36,850 --> 00:47:40,760
And so this infinite order
autoregressive representation
726
00:47:40,760 --> 00:47:43,610
corresponds to that intuitive
understanding of how
727
00:47:43,610 --> 00:47:46,280
the Wold representation exists.
728
00:47:46,280 --> 00:47:51,330
And it actually works with the--
the regression coefficients
729
00:47:51,330 --> 00:47:54,749
in that projection several
slides back correspond
730
00:47:54,749 --> 00:47:55,790
to this inverse operator.
731
00:47:59,030 --> 00:48:04,160
So let's turn to some
specific time series
732
00:48:04,160 --> 00:48:07,590
models that are widely used.
733
00:48:07,590 --> 00:48:11,670
The class of autoregressive
moving average processes
734
00:48:11,670 --> 00:48:16,100
has this mathematical
definition.
735
00:48:16,100 --> 00:48:22,360
We define the X_t to be equal
to a linear combination of lags
736
00:48:22,360 --> 00:48:27,190
of X, going back p
lags, with coefficients
737
00:48:27,190 --> 00:48:30,210
phi_1 through phi_p.
738
00:48:30,210 --> 00:48:35,500
And then there are
residuals which
739
00:48:35,500 --> 00:48:40,720
are expressed in terms of a
q-th order moving average.
740
00:48:40,720 --> 00:48:45,990
So in this framework, the
eta_t's are white noise.
741
00:48:45,990 --> 00:48:50,910
And white noise, to reiterate,
has mean 0, constant variance,
742
00:48:50,910 --> 00:48:53,456
zero covariance between those.
743
00:48:56,330 --> 00:49:03,470
In this representation, I've
simplified things a little bit
744
00:49:03,470 --> 00:49:09,400
by subtracting off the
mean from all of the X's.
745
00:49:09,400 --> 00:49:15,400
And that just makes the formulas
a little bit simpler.
746
00:49:15,400 --> 00:49:20,370
Now, with lag operators, we
can write this ARMA model
747
00:49:20,370 --> 00:49:26,810
as phi of L, p-th order
polynomial of lag L given
748
00:49:26,810 --> 00:49:31,360
with coefficients 1,
phi_1 up to phi_p,
749
00:49:31,360 --> 00:49:37,627
and theta of L given
by 1, theta_1, theta_2,
750
00:49:37,627 --> 00:49:38,210
up to theta_q.
751
00:49:52,870 --> 00:49:55,840
This is basically
a representation
752
00:49:55,840 --> 00:49:59,170
of the ARMA time series model.
753
00:49:59,170 --> 00:50:03,320
Basically, we're
taking a set of lags
754
00:50:03,320 --> 00:50:09,530
of the values of the stochastic
process up to order p.
755
00:50:09,530 --> 00:50:11,840
And that's equal to a weighted
average of the eta_t's.
756
00:50:14,530 --> 00:50:21,600
If we multiply by the inverse
of phi of L, if that exists,
757
00:50:21,600 --> 00:50:24,010
then we get this
representation here,
758
00:50:24,010 --> 00:50:26,430
which is simply the
Wold decomposition.
759
00:50:26,430 --> 00:50:34,150
So the ARMA models basically
have a Wold decomposition
760
00:50:34,150 --> 00:50:36,970
if this phi of L is invertible.
761
00:50:42,850 --> 00:50:47,120
And we'll explore
these by looking
762
00:50:47,120 --> 00:50:49,160
at simpler cases
of the ARMA models
763
00:50:49,160 --> 00:50:51,390
by just focusing on
autoregressive models
764
00:50:51,390 --> 00:50:53,680
first and then moving
average processes
765
00:50:53,680 --> 00:50:56,090
second so that
you'll get a better
766
00:50:56,090 --> 00:51:00,690
feel for how these things are
manipulated and interpreted.
767
00:51:00,690 --> 00:51:04,540
So let's move on to the p-th
order autoregressive process.
768
00:51:04,540 --> 00:51:08,750
So we're going to consider
ARMA models that just have
769
00:51:08,750 --> 00:51:10,100
autoregressive terms in them.
770
00:51:16,000 --> 00:51:20,300
So we have phi of L X_t
minus mu is equal to eta_t,
771
00:51:20,300 --> 00:51:21,990
which is white noise.
772
00:51:21,990 --> 00:51:28,970
So a linear combination of
the series is white noise.
773
00:51:28,970 --> 00:51:34,730
And X_t follows then a linear
regression model on explanatory
774
00:51:34,730 --> 00:51:41,330
variables, which are
lags of the process X.
775
00:51:41,330 --> 00:51:46,760
And this could be expressed
as X_t equal to c plus the sum
776
00:51:46,760 --> 00:51:50,950
from 1 to p of phi_j X_(t-j),
which is a linear regression
777
00:51:50,950 --> 00:51:53,700
model with regression
parameters phi_j.
778
00:51:53,700 --> 00:52:01,390
And c, the constant term, is
equal to mu times phi of 1.
779
00:52:01,390 --> 00:52:10,920
Now, if you basically take
expectations of the process,
780
00:52:10,920 --> 00:52:14,360
you basically have
coefficients of mu coming in
781
00:52:14,360 --> 00:52:15,730
from all the terms.
782
00:52:15,730 --> 00:52:22,220
And phi of 1 times mu is the
regression coefficient there.
783
00:52:25,160 --> 00:52:27,320
So with this
autoregressive model,
784
00:52:27,320 --> 00:52:31,160
we now want to go over what are
the stationarity conditions.
785
00:52:31,160 --> 00:52:35,020
Certainly, this
autoregressive model
786
00:52:35,020 --> 00:52:40,790
is one where, well,
a simple random walk
787
00:52:40,790 --> 00:52:45,520
follows an autoregressive
model but is not stationary.
788
00:52:45,520 --> 00:52:47,650
We'll highlight that
in a minute as well.
789
00:52:47,650 --> 00:52:50,410
But if you think
it, that's true.
790
00:52:50,410 --> 00:52:55,400
And so stationarity is something
to be understood and evaluated.
791
00:53:03,160 --> 00:53:08,680
This polynomial
function phi, where
792
00:53:08,680 --> 00:53:11,630
if we replace the
lag operator L by z,
793
00:53:11,630 --> 00:53:20,970
a complex variable, the
equation phi of z equal to 0
794
00:53:20,970 --> 00:53:24,330
is the characteristic
equation associated
795
00:53:24,330 --> 00:53:27,020
with this autoregressive model.
796
00:53:27,020 --> 00:53:33,190
And it turns out that we'll
be interested in the roots
797
00:53:33,190 --> 00:53:36,610
of this characteristic equation.
798
00:53:36,610 --> 00:53:40,705
Now, if we consider
writing phi of L
799
00:53:40,705 --> 00:53:44,270
as a function of the
roots of the equation,
800
00:53:44,270 --> 00:53:49,130
we get this expression
where you'll
801
00:53:49,130 --> 00:53:51,340
notice if you multiply
all those terms out,
802
00:53:51,340 --> 00:53:55,730
the 1's all multiply out
together, and you get 1.
803
00:53:55,730 --> 00:54:00,100
And with the lag operator
L to the p-th power,
804
00:54:00,100 --> 00:54:03,210
that would be the product
of 1 over lambda_1
805
00:54:03,210 --> 00:54:06,650
times 1 over lambda_2,
or actually negative 1
806
00:54:06,650 --> 00:54:09,680
over lambda_1 times
negative 1 over lambda_2,
807
00:54:09,680 --> 00:54:13,640
and so forth-- negative
1 over lambda_p.
808
00:54:13,640 --> 00:54:15,820
Basically, if there are
p roots to this equation,
809
00:54:15,820 --> 00:54:19,420
this is how it would
be written out.
810
00:54:19,420 --> 00:54:27,070
And the process
X_t is covariance
811
00:54:27,070 --> 00:54:28,710
stationary if and
only if all the roots
812
00:54:28,710 --> 00:54:33,630
of this characteristic equation
lie outside the unit circle.
813
00:54:33,630 --> 00:54:35,880
So what does that mean?
814
00:54:35,880 --> 00:54:41,240
That means that the norm
modulus of the complex z
815
00:54:41,240 --> 00:54:42,810
is greater than 1.
816
00:54:42,810 --> 00:54:45,160
So they're outside
the unit circle
817
00:54:45,160 --> 00:54:47,150
not in the region where the modulus
is less than or equal to 1.
818
00:54:47,150 --> 00:54:56,810
And the roots, if they are
outside the unit circle,
819
00:54:56,810 --> 00:55:01,080
then the modulus of the
lambda_j's is greater than 1.
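As a sketch (the function name is mine), this condition can be checked numerically: find the roots of the characteristic polynomial with NumPy and test that every modulus exceeds 1.

```python
import numpy as np

def is_stationary(phi):
    """Check covariance stationarity of an AR(p) with coefficients
    phi = [phi_1, ..., phi_p] by testing that every root of the
    characteristic equation 1 - phi_1 z - ... - phi_p z^p = 0
    lies outside the unit circle.
    """
    # np.roots takes coefficients from the highest power down to the constant.
    coeffs = [-c for c in reversed(phi)] + [1.0]
    roots = np.roots(coeffs)
    return bool(np.all(np.abs(roots) > 1.0))

# AR(1) with phi = 0.5 is stationary (root z = 2);
# a random walk (phi = 1) has a unit root and is not.
```

The random walk case is exactly the example mentioned earlier: it satisfies an autoregressive equation, but its characteristic root sits on the unit circle rather than outside it.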
820
00:55:05,400 --> 00:55:12,160
And if we then consider
taking a complex number
821
00:55:12,160 --> 00:55:16,010
lambda, basically
the root, and have
822
00:55:16,010 --> 00:55:20,600
an expression for 1 minus
1 over lambda L inverse,
823
00:55:20,600 --> 00:55:25,010
we can get this series
expression for that inverse.
824
00:55:25,010 --> 00:55:34,860
And that series will exist and
be bounded if the lambda_i are
825
00:55:34,860 --> 00:55:36,430
greater than 1 in magnitude.
826
00:55:39,210 --> 00:55:46,210
So we can actually compute
an inverse of phi of L
827
00:55:46,210 --> 00:55:49,610
by taking the inverse
of each of the component
828
00:55:49,610 --> 00:55:52,240
products in that polynomial.
829
00:55:52,240 --> 00:55:57,800
So in introductory
time series courses,
830
00:55:57,800 --> 00:56:00,544
they talk about
stationarity and unit roots,
831
00:56:00,544 --> 00:56:01,960
but they don't
really get into it,
832
00:56:01,960 --> 00:56:04,490
because people don't
know complex math,
833
00:56:04,490 --> 00:56:06,970
don't know about roots.
834
00:56:06,970 --> 00:56:09,620
So anyway, but this
is just very simply
835
00:56:09,620 --> 00:56:12,840
how that framework is applied.
836
00:56:12,840 --> 00:56:17,830
So we have a
polynomial equation,
837
00:56:17,830 --> 00:56:20,885
the characteristic equation,
whose roots we're looking for.
838
00:56:20,885 --> 00:56:22,510
Those roots have to
be outside the unit
839
00:56:22,510 --> 00:56:26,170
circle for stationarity
of the process.
840
00:56:26,170 --> 00:56:31,870
Well, it's basically
conditions for invertibility
841
00:56:31,870 --> 00:56:35,100
of the process, of the
autoregressive process.
842
00:56:35,100 --> 00:56:40,440
And that invertibility renders
the process an infinite-order
843
00:56:40,440 --> 00:56:42,125
moving average process.
844
00:56:46,210 --> 00:56:50,830
So let's go through
these results
845
00:56:50,830 --> 00:56:52,840
for the autoregressive
process of order one,
846
00:56:52,840 --> 00:56:56,330
where things-- always start
with the simplest cases
847
00:56:56,330 --> 00:56:58,420
to understand things.
848
00:56:58,420 --> 00:57:01,140
The characteristic equation
for this model is just 1
849
00:57:01,140 --> 00:57:02,820
minus phi z equals 0.
850
00:57:02,820 --> 00:57:03,600
The root is 1/phi.
851
00:57:06,630 --> 00:57:12,382
So if the modulus of lambda
852
00:57:12,382 --> 00:57:13,840
is greater than 1,
meaning the root
853
00:57:13,840 --> 00:57:16,990
is outside the unit circle,
then phi is less than 1.
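This root-based stationarity check can be sketched in a few lines (the function name and the test values are illustrative, not from the lecture):

```python
import numpy as np

def ar_is_stationary(phis):
    """Check covariance stationarity of X_t = mu + phi_1 X_{t-1} + ...
    + phi_p X_{t-p} + eta_t by testing whether every root of the
    characteristic polynomial 1 - phi_1 z - ... - phi_p z^p lies
    outside the unit circle."""
    phis = np.asarray(phis, dtype=float)
    coeffs = np.r_[-phis[::-1], 1.0]   # np.roots wants highest degree first
    return bool(np.all(np.abs(np.roots(coeffs)) > 1.0))

# AR(1): the single root is 1/phi, so the root is outside the unit
# circle exactly when |phi| < 1.
print(ar_is_stationary([0.5]))    # True  (root at 2.0)
print(ar_is_stationary([1.2]))    # False (root at ~0.83, inside)
print(ar_is_stationary([-0.9]))   # True  (oscillating but stationary)
```

The same function handles higher-order AR(p) coefficient vectors, since it works directly with the roots of the characteristic polynomial.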
854
00:57:16,990 --> 00:57:21,160
So for covariance stationarity
of this autoregressive process,
855
00:57:21,160 --> 00:57:25,877
we need the magnitude
of phi to be less than 1.
856
00:57:30,090 --> 00:57:31,950
The expected value of X is mu.
857
00:57:31,950 --> 00:57:36,460
The variance of X
is sigma squared X.
858
00:57:36,460 --> 00:57:41,130
This has this form, sigma
squared over 1 minus phi squared.
859
00:57:41,130 --> 00:57:44,960
That expression is
basically obtained
860
00:57:44,960 --> 00:57:50,110
by looking at the infinite order
moving average representation.
861
00:57:50,110 --> 00:57:56,760
But notice that since
phi squared is less than 1,
862
00:57:56,760 --> 00:58:03,710
the variance
of X is actually
863
00:58:03,710 --> 00:58:07,895
greater than the variance
of the innovations.
864
00:58:10,440 --> 00:58:17,280
And that's true whether phi
is positive or negative.
865
00:58:17,280 --> 00:58:23,100
So the innovation variance
basically is scaled up a bit
866
00:58:23,100 --> 00:58:25,010
in the autoregressive process.
867
00:58:25,010 --> 00:58:27,710
The covariance at lag 1 is
phi times sigma squared
868
00:58:27,710 --> 00:58:31,980
X. You'll be going through
this in the problem set.
869
00:58:31,980 --> 00:58:40,160
And the covariance at lag j is phi
to the j power times sigma squared X.
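A small simulation sketch, with arbitrary parameter values, checking the AR(1) moment formulas just stated, Var(X) = sigma squared over (1 minus phi squared) and gamma(j) = phi to the j times Var(X):

```python
import numpy as np

# Simulation sketch (arbitrary parameter choices) checking the AR(1)
# moment formulas: Var(X) = sigma^2 / (1 - phi^2) and
# gamma(j) = phi^j * Var(X).
rng = np.random.default_rng(0)
phi, sigma, mu = 0.7, 1.0, 0.0
n = 200_000

eta = rng.normal(0.0, sigma, n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = mu + phi * (x[t - 1] - mu) + eta[t]

var_x = x.var()
theory_var = sigma**2 / (1 - phi**2)
gamma1 = np.mean((x[1:] - x.mean()) * (x[:-1] - x.mean()))

print(round(theory_var, 3))       # 1.961
print(round(var_x, 2))            # close to the theoretical value
print(round(gamma1 / var_x, 2))   # sample lag-1 autocorrelation, close to phi
```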
870
00:58:40,160 --> 00:58:43,640
And these expressions can
all be easily evaluated
871
00:58:43,640 --> 00:58:47,490
by simply writing out the
definition of these covariances
872
00:58:47,490 --> 00:58:50,000
in terms of the original
model and looking
873
00:58:50,000 --> 00:58:54,250
at which terms are independent
or cancel out, and proceeding from there.
874
00:59:04,510 --> 00:59:06,800
Let's just go
through these cases.
875
00:59:06,800 --> 00:59:08,730
Let's show it all here.
876
00:59:08,730 --> 00:59:16,630
So if phi
is between 0 and 1,
877
00:59:16,630 --> 00:59:20,810
then the process experiences
exponential mean reversion
878
00:59:20,810 --> 00:59:22,170
to mu.
879
00:59:22,170 --> 00:59:24,760
So an autoregressive
process with phi between 0
880
00:59:24,760 --> 00:59:29,490
and 1 corresponds to a
mean-reverting process.
881
00:59:29,490 --> 00:59:31,830
This process is
actually one that
882
00:59:31,830 --> 00:59:34,310
has been used theoretically
for interest rate models
883
00:59:34,310 --> 00:59:36,920
and a lot of theoretical
work in finance.
884
00:59:36,920 --> 00:59:40,280
The Vasicek model is
actually an example
885
00:59:40,280 --> 00:59:42,300
of the Ornstein-Uhlenbeck
process,
886
00:59:42,300 --> 00:59:47,840
which is basically a
mean-reverting Brownian motion.
887
00:59:47,840 --> 00:59:53,070
And for any variables
that exhibit or could
888
00:59:53,070 --> 00:59:59,950
be thought of as
exhibiting mean reversion,
889
00:59:59,950 --> 01:00:01,810
this model can be
applied to those
890
01:00:01,810 --> 01:00:07,470
processes, such as interest rate
spreads or real exchange rates,
891
01:00:07,470 --> 01:00:11,430
variables where one can
expect that things never
892
01:00:11,430 --> 01:00:12,790
get too large or too small.
893
01:00:12,790 --> 01:00:14,440
They come back to some mean.
894
01:00:14,440 --> 01:00:16,570
Now, the challenge is
that this usually
895
01:00:16,570 --> 01:00:18,930
may be true over
short periods of time.
896
01:00:18,930 --> 01:00:21,100
But over very long
periods of time,
897
01:00:21,100 --> 01:00:23,230
the point to which
you're reverting changes.
898
01:00:23,230 --> 01:00:26,640
So these models tend not
to have broad application
899
01:00:26,640 --> 01:00:27,900
over long time ranges.
900
01:00:27,900 --> 01:00:30,150
You need to adapt.
901
01:00:30,150 --> 01:00:32,220
Anyway, with the AR
process, we can also
902
01:00:32,220 --> 01:00:34,020
have negative
values of phi, which
903
01:00:34,020 --> 01:00:38,460
results in exponential mean
reversion that's oscillating
904
01:00:38,460 --> 01:00:44,190
in time, because the
autoregressive coefficient
905
01:00:44,190 --> 01:00:49,180
basically is a negative value.
906
01:00:49,180 --> 01:00:54,510
And for phi equal to 1, the Wold
decomposition doesn't exist.
907
01:00:54,510 --> 01:00:57,860
And the process is the
simple random walk.
908
01:00:57,860 --> 01:01:00,340
So basically, if
phi is equal to 1,
909
01:01:00,340 --> 01:01:04,480
that means that basically just
changes in value of the process
910
01:01:04,480 --> 01:01:08,860
are independent and identically
distributed white noise.
911
01:01:08,860 --> 01:01:11,910
And that's the
random walk process.
912
01:01:11,910 --> 01:01:15,840
And that process, as was
covered in earlier lectures,
913
01:01:15,840 --> 01:01:18,780
is non-stationary.
914
01:01:18,780 --> 01:01:22,790
If phi is greater than 1, then
you have an explosive process,
915
01:01:22,790 --> 01:01:26,780
because basically the
values are scaling up
916
01:01:26,780 --> 01:01:31,000
every time increment.
917
01:01:31,000 --> 01:01:35,290
So those are features
of the AR(1) model.
918
01:01:35,290 --> 01:01:42,110
For a general autoregressive
process of order p,
919
01:01:42,110 --> 01:01:45,850
there's a method-- well, we
can look at the second order
920
01:01:45,850 --> 01:01:49,590
moments of that process, which
have a very nice structure,
921
01:01:49,590 --> 01:01:51,840
and then use those to
solve for estimates
922
01:01:51,840 --> 01:01:56,630
of the ARMA parameters, or
autoregressive parameters.
923
01:01:56,630 --> 01:02:01,820
And those happen to be
specified by what are called
924
01:02:01,820 --> 01:02:04,840
the Yule-Walker equations.
925
01:02:04,840 --> 01:02:07,270
So the Yule-Walker equations
are a standard topic
926
01:02:07,270 --> 01:02:09,670
in time series analysis.
927
01:02:09,670 --> 01:02:11,480
What are they?
928
01:02:11,480 --> 01:02:13,030
What do they correspond to?
929
01:02:13,030 --> 01:02:16,320
Well, we take our original
autoregressive process
930
01:02:16,320 --> 01:02:17,470
of order p.
931
01:02:17,470 --> 01:02:24,400
And we write out the
formulas for the covariance
932
01:02:24,400 --> 01:02:26,900
at lag j between
two observations.
933
01:02:26,900 --> 01:02:31,790
So what's the covariance
between X_t and X_(t-j)?
934
01:02:31,790 --> 01:02:39,820
And that expression is
given by this equation.
935
01:02:39,820 --> 01:02:43,980
And so this equation for gamma
of j is determined simply
936
01:02:43,980 --> 01:02:48,700
by evaluating the expectations
where we're taking
937
01:02:48,700 --> 01:02:53,620
the expectation of X_t minus mu in
the autoregressive process times
938
01:02:53,620 --> 01:02:56,110
X_(t-j) minus mu.
939
01:02:56,110 --> 01:02:58,540
So just evaluating
those terms, you
940
01:02:58,540 --> 01:03:02,880
can validate that
this is the equation.
941
01:03:02,880 --> 01:03:08,620
If we look at the equations
corresponding to j equals 1--
942
01:03:08,620 --> 01:03:12,040
so lag 1 up through
lag p-- this is
943
01:03:12,040 --> 01:03:16,070
what those equations look like.
944
01:03:16,070 --> 01:03:20,060
Basically, the left-hand side
is gamma_1 through gamma_p.
945
01:03:20,060 --> 01:03:23,090
The covariance to
lag 1 up to lag p
946
01:03:23,090 --> 01:03:27,590
is equal to basically
linear functions
947
01:03:27,590 --> 01:03:29,980
given by the phi of
the other covariances.
948
01:03:33,570 --> 01:03:37,410
Who can tell me what the
structure is of this matrix?
949
01:03:37,410 --> 01:03:38,590
It's not a diagonal matrix.
950
01:03:38,590 --> 01:03:41,817
What kind of matrix is this?
951
01:03:41,817 --> 01:03:42,900
Math trivia question here.
952
01:03:48,850 --> 01:03:49,782
It has a special name.
953
01:03:52,460 --> 01:03:54,600
Anyone?
954
01:03:54,600 --> 01:03:57,690
It's a Toeplitz matrix.
955
01:03:57,690 --> 01:04:00,840
The off diagonals are
all the same value.
956
01:04:00,840 --> 01:04:06,680
And in fact, because of the
symmetry of the covariance,
957
01:04:06,680 --> 01:04:09,750
basically the gamma of 1 is
equal to gamma of minus 1.
958
01:04:09,750 --> 01:04:12,680
Gamma of minus 2 is
equal to gamma of plus 2.
959
01:04:12,680 --> 01:04:14,640
Because of the
covariance stationarity,
960
01:04:14,640 --> 01:04:16,700
it's actually also symmetric.
961
01:04:16,700 --> 01:04:22,630
So these equations allow
us to solve for the phis
962
01:04:22,630 --> 01:04:25,990
so long as we have estimates
of these covariances.
963
01:04:25,990 --> 01:04:30,510
So if we have a
system of estimates,
964
01:04:30,510 --> 01:04:33,940
we can plug these in in
an attempt to solve this.
965
01:04:33,940 --> 01:04:36,770
If they're consistent
estimates of the covariances,
966
01:04:36,770 --> 01:04:38,530
then there will be a solution.
967
01:04:38,530 --> 01:04:41,980
And then the 0th
equation, which was not
968
01:04:41,980 --> 01:04:43,469
part of the series
of equations--
969
01:04:43,469 --> 01:04:45,510
if you go back and look
at the 0th equation, that
970
01:04:45,510 --> 01:04:47,920
allows you to get an estimate
for the sigma squared.
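The procedure just described, sample autocovariances plugged into the Toeplitz system, with the 0th equation backing out sigma squared, can be sketched as follows (the function name and simulated parameter values are illustrative):

```python
import numpy as np

def yule_walker(x, p):
    """Method-of-moments (Yule-Walker) estimates for an AR(p) model.

    Solves the Toeplitz system Gamma * phi = (gamma_1, ..., gamma_p)
    built from sample autocovariances, then uses the 0th equation
    to back out the innovation variance sigma^2."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    gamma = np.array([np.dot(x[: n - j], x[j:]) / n for j in range(p + 1)])
    Gamma = np.array([[gamma[abs(i - j)] for j in range(p)] for i in range(p)])
    phi = np.linalg.solve(Gamma, gamma[1:])
    sigma2 = gamma[0] - phi @ gamma[1:]
    return phi, sigma2

# Hypothetical check: simulate an AR(2) with known coefficients and
# see that the Yule-Walker solution recovers them.
rng = np.random.default_rng(1)
true_phi = np.array([0.5, 0.3])
n = 200_000
x = np.zeros(n)
eta = rng.normal(size=n)
for t in range(2, n):
    x[t] = true_phi[0] * x[t - 1] + true_phi[1] * x[t - 2] + eta[t]

phi_hat, sigma2_hat = yule_walker(x, 2)
print(np.round(phi_hat, 2))    # close to [0.5, 0.3]
print(round(sigma2_hat, 2))    # close to 1.0
```

Note the Toeplitz structure of the matrix, exactly the symmetry discussed above, is what makes the system so cheap to build from the sample autocovariances.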
971
01:04:47,920 --> 01:04:50,920
So these Yule-Walker
equations are the way
972
01:04:50,920 --> 01:04:54,510
in which many ARMA
models are specified
973
01:04:54,510 --> 01:05:03,650
in different statistics packages.
And in terms of what principles
974
01:05:03,650 --> 01:05:04,400
are being applied--
975
01:05:04,400 --> 01:05:09,700
Well, if we're using unbiased
estimates of these parameters,
976
01:05:09,700 --> 01:05:12,055
then this is applying
what's called
977
01:05:12,055 --> 01:05:16,250
the method of moments principle
for statistical estimation.
978
01:05:16,250 --> 01:05:20,600
And with complicated models,
where sometimes the likelihood
979
01:05:20,600 --> 01:05:25,900
functions are very hard
to specify and compute,
980
01:05:25,900 --> 01:05:29,800
and optimization over
those is even harder,
981
01:05:29,800 --> 01:05:32,780
it can turn out that
there are relationships
982
01:05:32,780 --> 01:05:35,840
between the moments of the
random variables, which
983
01:05:35,840 --> 01:05:38,340
are functions of the
unknown parameters.
984
01:05:38,340 --> 01:05:42,590
And you can basically set
the sample moments equal to
985
01:05:42,590 --> 01:05:45,940
the theoretical moments
and solve, applying the method
986
01:05:45,940 --> 01:05:48,830
of moments estimation method.
987
01:05:48,830 --> 01:05:54,670
Econometrics is rich with many
applications of that principle.
988
01:05:57,580 --> 01:06:02,110
The next section goes through
the moving average model.
989
01:06:05,240 --> 01:06:12,340
Let me highlight this.
990
01:06:12,340 --> 01:06:16,080
So with an order
q moving average,
991
01:06:16,080 --> 01:06:19,560
we basically have a polynomial
in the lag operator L,
992
01:06:19,560 --> 01:06:22,390
which operates
on the eta_t's.
993
01:06:22,390 --> 01:06:25,700
And if you write out
the expectations of X_t,
994
01:06:25,700 --> 01:06:27,030
you get mu.
995
01:06:27,030 --> 01:06:28,650
The variance of X_t,
which is gamma 0,
996
01:06:28,650 --> 01:06:34,470
is sigma squared times 1 plus
the sum of squares of the coefficients
997
01:06:34,470 --> 01:06:36,360
in the polynomial.
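A simulation sketch of that MA(q) variance formula, with arbitrary theta values:

```python
import numpy as np

# Sketch (arbitrary theta values): for an MA(2) process
# X_t = mu + eta_t + theta_1 eta_{t-1} + theta_2 eta_{t-2},
# white-noise eta's make all cross terms vanish, so
# Var(X_t) = sigma^2 * (1 + theta_1^2 + theta_2^2).
rng = np.random.default_rng(2)
theta = np.array([0.4, -0.3])
sigma = 1.0
n = 500_000

eta = rng.normal(0.0, sigma, n + len(theta))
# X_t = eta_t + theta_1 eta_{t-1} + theta_2 eta_{t-2}, built by slicing
x = eta[2:] + theta[0] * eta[1:-1] + theta[1] * eta[:-2]

theory = sigma**2 * (1 + np.sum(theta**2))
print(round(theory, 2))     # 1.25
print(round(x.var(), 2))    # close to 1.25
```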
998
01:06:36,360 --> 01:06:39,920
And so this feature,
this property here is due
999
01:06:39,920 --> 01:06:44,100
to the fact that we have
uncorrelated innovations
1000
01:06:44,100 --> 01:06:47,060
in the eta_t's.
1001
01:06:47,060 --> 01:06:48,260
The eta_t's are white noise.
1002
01:06:48,260 --> 01:06:52,830
So the only thing that comes
through in the square of X_t
1003
01:06:52,830 --> 01:06:56,020
and the expectation of
that is the squared powers
1004
01:06:56,020 --> 01:07:01,900
of the etas, which
have coefficients
1005
01:07:01,900 --> 01:07:03,860
given by the theta_i squared.
1006
01:07:03,860 --> 01:07:09,170
So these properties are left--
I'll leave you just to verify,
1007
01:07:09,170 --> 01:07:11,142
very straightforward.
1008
01:07:11,142 --> 01:07:14,430
But let's now turn to the
final minutes of the lecture
1009
01:07:14,430 --> 01:07:20,170
today to accommodating
non-stationary behavior
1010
01:07:20,170 --> 01:07:23,340
in time series.
1011
01:07:23,340 --> 01:07:27,990
The original approaches
with time series
1012
01:07:27,990 --> 01:07:32,320
were to focus
on estimation methodologies
1013
01:07:32,320 --> 01:07:34,940
for covariance
stationary processes.
1014
01:07:34,940 --> 01:07:38,440
So if the series is not
covariance stationary,
1015
01:07:38,440 --> 01:07:42,410
then we would want to
do some transformation
1016
01:07:42,410 --> 01:07:48,660
of the data, of the series,
1017
01:07:48,660 --> 01:07:52,270
so that the resulting
process is stationary.
1018
01:07:52,270 --> 01:07:55,990
And with the
differencing operators,
1019
01:07:55,990 --> 01:08:00,610
delta, Box and Jenkins
advocated removing
1020
01:08:00,610 --> 01:08:03,420
non-stationary trending
behavior, which
1021
01:08:03,420 --> 01:08:06,370
is exhibited often in
economic time series,
1022
01:08:06,370 --> 01:08:09,960
by using a first difference,
maybe a second difference,
1023
01:08:09,960 --> 01:08:12,300
or a k-th order difference.
1024
01:08:12,300 --> 01:08:20,229
So these operators are
defined in this way.
1025
01:08:20,229 --> 01:08:22,960
Basically with the
k-th order operator
1026
01:08:22,960 --> 01:08:25,210
having this
expression here, this
1027
01:08:25,210 --> 01:08:31,189
is the binomial expansion
of a k-th power,
1028
01:08:31,189 --> 01:08:35,970
which can be useful.
1029
01:08:35,970 --> 01:08:40,609
It comes up all the time
in probability theory.
1030
01:08:40,609 --> 01:08:43,609
And if a process has
a linear time trend,
1031
01:08:43,609 --> 01:08:48,390
then delta X_t is going to
have no time trend at all,
1032
01:08:48,390 --> 01:08:51,390
because you're
basically taking out
1033
01:08:51,390 --> 01:08:54,430
that linear component by
taking successive differences.
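A minimal sketch of the differencing trick (the trend coefficients are just illustrative):

```python
import numpy as np

# Sketch: first differencing removes a deterministic linear trend,
# and second differencing removes a quadratic one (illustrative
# trend coefficients).
t = np.arange(100, dtype=float)
linear = 2.0 + 3.0 * t           # linear time trend
quadratic = 1.0 + 0.5 * t**2     # quadratic time trend

d1 = np.diff(linear)             # first difference: the constant slope
print(np.allclose(d1, 3.0))      # True

d2 = np.diff(quadratic, n=2)     # second difference: constant
print(np.allclose(d2, 1.0))      # True (2 times the quadratic coefficient)
```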
1034
01:08:54,430 --> 01:08:57,014
Sometimes, if you
have a real series
1035
01:08:57,014 --> 01:08:59,430
and it appears non-stationary,
you look at first differences,
1036
00:08:59,430 --> 01:09:02,810
and those can still
appear to be growing
1037
01:09:02,810 --> 01:09:05,649
over time, in which case
1038
01:09:05,649 --> 01:09:08,810
sometimes the second
difference will result
1039
01:09:08,810 --> 01:09:11,270
in a process with no trend.
1040
01:09:11,270 --> 01:09:14,170
So these are sort of
convenient tricks,
1041
01:09:14,170 --> 01:09:18,250
techniques to render
the series stationary.
1042
01:09:18,250 --> 01:09:21,220
And let's see.
1043
01:09:21,220 --> 01:09:26,960
There are examples here of
linear trend reversion models
1044
01:09:26,960 --> 01:09:32,319
which are rendered
covariance stationary
1045
01:09:32,319 --> 01:09:35,330
under first differencing.
1046
01:09:35,330 --> 01:09:38,689
In this case, this is an
example where you have
1047
01:09:38,689 --> 01:09:41,350
a deterministic time trend.
1048
01:09:41,350 --> 01:09:46,040
But then you have reversion
to the time trend over time.
1049
01:09:46,040 --> 01:09:49,880
So basically eta_t,
the error
1050
01:09:49,880 --> 01:09:53,830
about the deterministic trend,
is a first order autoregressive
1051
01:09:53,830 --> 01:09:55,740
process.
1052
01:09:55,740 --> 01:10:00,307
And the moments here
can be derived this way.
1053
01:10:00,307 --> 01:10:01,390
I'll leave that as an exercise.
1054
01:10:04,230 --> 01:10:09,510
One could also consider
the pure integrated process
1055
01:10:09,510 --> 01:10:16,330
and talk about
stochastic trends.
1056
01:10:16,330 --> 01:10:19,140
And basically,
random walk processes
1057
01:10:19,140 --> 01:10:22,740
are often referred
to in econometrics
1058
01:10:22,740 --> 01:10:25,010
as stochastic trends.
1059
01:10:25,010 --> 01:10:31,610
And you may want to try and
remove those from the data,
1060
01:10:31,610 --> 01:10:33,280
or accommodate them.
1061
01:10:33,280 --> 01:10:40,930
And so the stochastic
trend process is basically
1062
01:10:40,930 --> 01:10:49,630
given by the first difference
X_t is just equal to eta_t.
1063
01:10:49,630 --> 01:10:53,430
And so we have essentially
this random walk
1064
01:10:53,430 --> 01:10:55,830
from a given starting point.
1065
01:10:55,830 --> 01:11:00,650
And it's easy to verify that if
you knew the 0th point, then
1066
01:11:00,650 --> 01:11:04,770
the variance of the t-th time
point would be t sigma squared,
1067
01:11:04,770 --> 01:11:09,000
because we're summing t
independent innovations.
1068
01:11:09,000 --> 01:11:14,475
And the covariance between
X_t and X_(t-j)
1069
01:11:14,475 --> 01:11:17,500
is simply t minus
j times sigma squared.
1070
01:11:17,500 --> 01:11:20,860
And the correlation between
those has this form.
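A simulation sketch (with an arbitrary sigma) of the variance-growth property just described:

```python
import numpy as np

# Simulation sketch (arbitrary sigma) of the random walk
# X_t = X_{t-1} + eta_t with X_0 = 0: summing t independent
# innovations gives Var(X_t) = t * sigma^2, which grows with t,
# so the process cannot be covariance stationary.
rng = np.random.default_rng(3)
sigma = 1.0
n_paths, T = 20_000, 400

eta = rng.normal(0.0, sigma, size=(n_paths, T))
paths = np.cumsum(eta, axis=1)        # many independent random walk paths

var_at_T = paths[:, -1].var()         # sample Var(X_T) across paths
print(round(T * sigma**2, 1))         # 400.0, the theoretical variance
print(round(var_at_T, 1))             # close to 400
```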
1071
01:11:20,860 --> 01:11:23,240
What you can see is that this
definitely depends on time.
1072
01:11:23,240 --> 01:11:26,660
So it's not a
stationary process.
1073
01:11:26,660 --> 01:11:33,880
So this first differencing
results in stationarity.
1074
01:11:33,880 --> 01:11:36,230
And the differenced
process has those features.
1075
01:11:46,847 --> 01:11:47,805
Let's see where we are.
1076
01:11:52,730 --> 01:11:57,380
Final topic for
today is just how
1077
01:11:57,380 --> 01:12:04,630
you incorporate non-stationary
behavior into ARMA processes.
1078
01:12:04,630 --> 01:12:07,680
Well, if you take
first differences
1079
01:12:07,680 --> 01:12:10,340
or second differences
and the resulting process
1080
01:12:10,340 --> 01:12:13,252
is covariance
stationary, then we
1081
01:12:13,252 --> 01:12:15,460
can just incorporate that
differencing into the model
1082
01:12:15,460 --> 01:12:20,490
specification itself, and define
ARIMA models, Autoregressive
1083
01:12:20,490 --> 01:12:23,730
Integrated Moving
Average Processes.
1084
01:12:23,730 --> 01:12:26,000
And so to specify
these models, we
1085
01:12:26,000 --> 01:12:29,290
need to determine the order
of the differencing required
1086
01:12:29,290 --> 01:12:32,990
to remove trends,
deterministic or stochastic,
1087
01:12:32,990 --> 01:12:35,820
and then estimate
the unknown parameters,
1088
01:12:35,820 --> 01:12:38,940
and then apply model
selection criteria.
1089
01:12:38,940 --> 01:12:43,770
So let me go very
quickly through this
1090
01:12:43,770 --> 01:12:48,600
and come back to it at the
beginning of next time.
1091
01:12:48,600 --> 01:12:51,660
But in specifying the
parameters of these models,
1092
01:12:51,660 --> 01:12:54,410
we can apply maximum
likelihood, again,
1093
01:12:54,410 --> 01:12:59,280
if we assume normality of
these innovations eta_t.
1094
01:12:59,280 --> 01:13:02,260
And we can express
the ARMA model
1095
01:13:02,260 --> 01:13:04,440
in state space
form, which results
1096
01:13:04,440 --> 01:13:07,880
in a form for the
likelihood function, which
1097
01:13:07,880 --> 01:13:12,130
we'll see a few lectures ahead.
1098
01:13:12,130 --> 01:13:15,970
But then we can apply limited
information maximum likelihood,
1099
01:13:15,970 --> 01:13:19,470
where we just condition on the
first observations of the data
1100
01:13:19,470 --> 01:13:22,550
and maximize the likelihood.
1101
01:13:22,550 --> 01:13:27,060
Or not condition on the first
few observations, but also
1102
01:13:27,060 --> 01:13:33,700
use their information as well,
and look at their density
1103
01:13:33,700 --> 01:13:36,640
functions, incorporating
those into the likelihood
1104
01:13:36,640 --> 01:13:41,160
relative to the stationary
distribution for their values.
1105
01:13:41,160 --> 01:13:44,000
And then the issue
becomes, how do we
1106
01:13:44,000 --> 01:13:45,390
choose amongst different models?
1107
01:13:45,390 --> 01:13:48,480
Now, last time we talked about
linear regression models,
1108
01:13:48,480 --> 01:13:50,500
how you'd specify a
given model, here, we're
1109
01:13:50,500 --> 01:13:53,050
talking about autoregressive,
moving average,
1110
01:13:53,050 --> 01:13:55,000
and even integrated
moving average processes
1111
01:13:55,000 --> 01:13:59,320
and how we specify
those. Well, with the method
1112
01:13:59,320 --> 01:14:06,470
of maximum likelihood,
there are procedures
1113
01:14:06,470 --> 01:14:12,440
which-- there are measures of
how effective a fitted model
1114
01:14:12,440 --> 01:14:16,390
is, given by an
information criterion
1115
01:14:16,390 --> 01:14:21,250
that you would want to minimize
for a given fitted model.
1116
01:14:21,250 --> 01:14:24,719
So we can consider
different sets of models,
1117
01:14:24,719 --> 01:14:26,510
different numbers of
explanatory variables,
1118
01:14:26,510 --> 01:14:29,740
different orders of
autoregressive parameters,
1119
01:14:29,740 --> 01:14:33,100
moving average parameters,
and compute, say,
1120
01:14:33,100 --> 01:14:37,940
the Akaike information criterion
or the Bayes information
1121
01:14:37,940 --> 01:14:39,990
criterion or the
Hannan-Quinn criterion
1122
01:14:39,990 --> 01:14:44,720
as different ways of judging
how good different models are.
1123
01:14:44,720 --> 01:14:47,960
And let me just finish
today by pointing out
1124
01:14:47,960 --> 01:14:52,620
that what these
information criteria are
1125
01:14:52,620 --> 01:14:58,560
is basically a function of the
log likelihood function, which
1126
01:14:58,560 --> 01:15:00,719
is something we're
trying to maximize
1127
01:15:00,719 --> 01:15:02,135
with maximum
likelihood estimates.
1128
01:15:04,870 --> 01:15:08,700
And then adding some penalty
for how many parameters
1129
01:15:08,700 --> 01:15:10,742
we're estimating.
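The penalized-likelihood form of these criteria can be sketched as follows; the log likelihood values below are made-up numbers purely to illustrate how the AIC and BIC penalties can disagree about an extra parameter:

```python
import numpy as np

def aic_bic(loglik, k, n):
    """Information criteria as penalized log likelihood (smaller is
    better): AIC = -2 logL + 2k, BIC = -2 logL + k log n, where k is
    the number of estimated parameters and n the sample size."""
    return -2.0 * loglik + 2.0 * k, -2.0 * loglik + k * np.log(n)

# Hypothetical numbers purely for illustration: a richer model raises
# the log likelihood slightly; the penalties decide whether the extra
# parameter is "worth it".
n = 500
ll_small, k_small = -710.0, 2    # e.g. a fitted AR(1)
ll_big, k_big = -708.5, 3        # e.g. a fitted AR(2), slightly better fit

aic_s, bic_s = aic_bic(ll_small, k_small, n)
aic_b, bic_b = aic_bic(ll_big, k_big, n)
print(aic_s, aic_b)    # 1424.0 1423.0 -> AIC prefers the bigger model
print(bic_s < bic_b)   # True -> BIC's log(n) penalty keeps the smaller one
```

This is exactly the question posed above: AIC charges 2 per parameter, while BIC charges log(n), so for n = 500 an extra parameter needs a much bigger likelihood improvement to survive BIC.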
1130
01:15:10,742 --> 01:15:12,950
And so what I'd like you to
think about for next time
1131
01:15:12,950 --> 01:15:18,600
is what kind of a penalty
is appropriate for adding
1132
01:15:18,600 --> 01:15:20,300
an extra parameter.
1133
01:15:20,300 --> 01:15:23,640
Like, what evidence is
required to incorporate
1134
01:15:23,640 --> 01:15:28,020
extra parameters, extra
variables, in the model.
1135
01:15:28,020 --> 01:15:31,180
Would it be t statistics
that exceed some threshold
1136
01:15:31,180 --> 01:15:32,760
or some other criteria?
1137
01:15:32,760 --> 01:15:35,940
Turns out that these are
all related to those issues.
1138
01:15:35,940 --> 01:15:39,500
And it's very interesting
how those play out.
1139
01:15:39,500 --> 01:15:45,180
And I'll say that for those
of you who have actually
1140
01:15:45,180 --> 01:15:48,490
seen these before, the
Bayes information criterion
1141
01:15:48,490 --> 01:15:50,400
corresponds to an
assumption that there
1142
01:15:50,400 --> 01:15:54,180
is some finite number of
variables in the model.
1143
01:15:54,180 --> 01:15:57,010
And you know what those are.
1144
01:15:57,010 --> 01:16:00,060
The Hannan-Quinn criterion
says maybe there's
1145
01:16:00,060 --> 01:16:03,760
an infinite number of
variables in the model,
1146
01:16:03,760 --> 01:16:08,810
but you want to be
able to identify those.
1147
01:16:08,810 --> 01:16:12,230
And so anyway, it's a
very challenging problem
1148
01:16:12,230 --> 01:16:13,390
of model selection.
1149
01:16:13,390 --> 01:16:16,900
And these criteria can
be used to specify those.
1150
01:16:16,900 --> 01:16:19,050
So we'll go through
that next time.