The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: OK. Well, last time I was lecturing, we were talking about regression analysis, and we finished up talking about estimation methods for fitting regression models. I want to recap the method of maximum likelihood, because this is really the primary estimation method in statistical modeling that you start with. So let me just review where we were.

We have a normal linear regression model. A dependent variable y is explained by a linear combination of independent variables given by a regression parameter beta. And we assume that the errors for all the cases are independent, identically distributed normal random variables. Because of that relationship, the dependent variable vector y, which is an n-vector for n cases, is a multivariate normal random variable.

Now, the likelihood function is equal to the density function for the data. And there's some ambiguity about how one manipulates the likelihood function: the likelihood function becomes defined once we've observed a sample of data. So in this expression for the likelihood function as a function of beta and sigma squared, we're evaluating the probability density function for the data conditional on the unknown parameters. If this were simply a univariate normal distribution with some unknown mean and variance, then what we would have is just a bell curve in mu centered around a single observation y, if you look at how the likelihood function varies with the underlying mean of the normal distribution.

So the challenge in maximum likelihood estimation is really calculating and computing the likelihood function.
And with normal linear regression models, it's very easy. Now, the maximum likelihood estimates are those values that maximize this function. And the question is, why are those good estimates of the underlying parameters? Well, those estimates are the parameter values for which the observed data are most likely. So we're able to score the unknown parameters by how likely it is that they could have generated these data values.

So let's look at the likelihood function for this normal linear regression model. The first two lines here highlight that our response variable values are independent--conditionally independent given the unknown parameters. And so the density of the full vector of y's is simply the product of the density functions for its components. And because this is a normal linear regression model, each of the y_i's is normally distributed, so what's in the product is simply the density function of a normal random variable whose mean, for case i, is the linear combination of the independent variables given by the regression parameters. That expression can be written in matrix form this way, and what we have is that the likelihood function ends up being a function of Q(beta), which was our least squares criterion. So least squares estimation is equivalent to maximum likelihood estimation for the regression parameters if we have a normal linear regression model. And there's this extra term, minus n.

Well, actually, if we're going to maximize the likelihood function, we can equivalently maximize the log of the likelihood function, because that's just a monotone function of the likelihood. And it's easier to maximize the log-likelihood, which is expressed here. So we're able to maximize over beta by minimizing Q(beta), and then we can maximize over sigma squared given our estimate for beta.
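As an illustrative aside, here is a minimal R sketch of that equivalence on simulated data (all numbers and names are made up): a direct numerical maximization of the normal log-likelihood gives the same regression estimates as lm(), and the likelihood-based variance estimate is Q(beta hat)/n.

    set.seed(1)
    n <- 200
    x <- rnorm(n)
    y <- 1 + 2 * x + rnorm(n, sd = 0.5)       # true beta = (1, 2), sigma = 0.5

    fit <- lm(y ~ x)                          # least squares

    # Negative log-likelihood of the normal linear regression model,
    # with sigma parameterized on the log scale so it stays positive
    negloglik <- function(par) {
      mu <- par[1] + par[2] * x
      -sum(dnorm(y, mean = mu, sd = exp(par[3]), log = TRUE))
    }
    mle <- optim(c(0, 0, 0), negloglik)$par

    coef(fit); mle[1:2]                       # beta estimates agree (up to optimizer tolerance)
    sum(resid(fit)^2) / n; exp(mle[3])^2      # MLE of sigma^2 is Q(beta.hat)/n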
Maximizing over sigma squared is achieved by taking the derivative of the log-likelihood with respect to sigma squared. So we have a first-order condition that finds the maximum, because things are appropriately convex. Taking that derivative and setting it equal to zero, we get this expression. This is just the derivative of the log-likelihood with respect to sigma squared--and you'll notice I'm taking the derivative with respect to sigma squared as a parameter, not sigma. And that gives us that the maximum likelihood estimate of the error variance is Q(beta hat) over n. So this is the sum of the squared residuals divided by n.

Now, I emphasize here that that's biased. Who can tell me why that's biased, or why it ought to be biased?

AUDIENCE: [INAUDIBLE]

PROFESSOR: OK. Well, the divisor should be n minus 1 if we're actually estimating one parameter. So if the independent variable were, say, a constant, 1--so we're just estimating a sample from a normal with mean beta_1 corresponding to the units vector as the X--then we would need a one-degree-of-freedom correction to the residuals to get an unbiased estimator. But what if we have p parameters? Well, let me ask you this: what if we had n parameters in our regression model? What would happen if we had a full-rank independent variable matrix with n columns and n independent observations?

AUDIENCE: [INAUDIBLE]

PROFESSOR: Yes, you'd have an exact fit to the data, so this estimate would be 0. And clearly, if the data do arise from a normal linear regression model, 0 is not unbiased; you need to have some correction. It turns out you need to divide by n minus the rank of the X matrix--the residual degrees of freedom of the model--to get an unbiased estimate. So this is an important issue. It highlights how the more parameters you add to the model, the more precise your fitted values are. In a sense, there are dangers of curve fitting which you want to avoid.
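A small simulation sketch of that bias (illustrative only, simulated data): with p regression parameters, the average of Q(beta hat)/n comes out near sigma^2 (n - p)/n, while dividing by n - p is approximately unbiased.

    set.seed(2)
    n <- 30; p <- 5; sigma2 <- 4
    X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # fixed design with p columns
    beta <- rep(1, p)

    one.draw <- function() {
      y <- drop(X %*% beta) + rnorm(n, sd = sqrt(sigma2))
      rss <- sum(resid(lm(y ~ X - 1))^2)
      c(mle = rss / n, unbiased = rss / (n - p))
    }
    est <- replicate(5000, one.draw())
    rowMeans(est)            # the /n estimate averages near sigma2*(n - p)/n; the /(n - p) one near sigma2
    sigma2 * (n - p) / n     # about 3.33 here, versus the true sigma2 of 4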
But the maximum likelihood estimates, in fact, are biased. You just have to be aware of that. And when you're using different software and fitting different models, you need to know whether corrections for that bias are being made or not.

So this solves the estimation problem for normal linear regression models. And when we have normal linear regression models, the theorem we went through last time is very important. Let me just go back and highlight that for you.

This theorem right here. This is really a very important theorem indicating the distribution of the least squares--now the maximum likelihood--estimates of our regression model. They are normally distributed. And the residual sum of squares, divided by sigma squared, has a chi-squared distribution with degrees of freedom given by n minus p. And we can look at how much signal to noise there is in estimating our regression parameters by calculating a t-statistic: take an estimate, subtract its expected value, its mean, and divide through by an estimate of its variability in standard deviation units. That will have a t distribution. So that's a critical way to assess the relevance of different explanatory variables in our model.

And this approach will apply with maximum likelihood estimation in all kinds of models apart from normal linear regression models. It turns out that maximum likelihood estimates generally are asymptotically normally distributed, and so these properties will apply for those models as well.

So let's finish up these notes on estimation by talking about generalized M-estimation. What we want to consider is estimating unknown parameters by minimizing some function Q(beta), which is a sum of evaluations of another function h, evaluated for each of the individual cases. Choosing h to take on different functional forms will define different kinds of estimators.
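As an illustrative sketch of that recipe (simulated data, arbitrary coefficients), Q(beta) = sum_i h(y_i - x_i' beta) can be written directly in R with a pluggable h; swapping h between the square and the absolute value gives least squares and least absolute deviations, respectively.

    set.seed(3)
    n <- 200
    x <- rnorm(n)
    y <- 1 + 2 * x + rt(n, df = 3)             # heavy-tailed errors, just for illustration

    # Generic M-estimator: minimize Q(beta) = sum_i h(y_i - x_i' beta)
    m.estimate <- function(h) {
      Q <- function(beta) sum(h(y - beta[1] - beta[2] * x))
      optim(c(0, 0), Q)$par
    }

    m.estimate(function(u) u^2)                # h = square -> least squares
    m.estimate(abs)                            # h = |u|    -> least absolute deviations
    coef(lm(y ~ x))                            # matches the h = square solution above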
We've seen that when h is simply the square of the response minus its regression prediction, that leads to least squares and, in fact, to maximum likelihood estimation, as we saw before.

Rather than taking the square of the fitted residual, we could take simply its modulus, its absolute value. And that gives the mean absolute deviation criterion: rather than summing the squared deviations, we sum the absolute deviations. Now, from a mathematical standpoint, if we want to solve for those estimates, how would you go about doing that? What methodology would you use to minimize this function? Well, we try to apply basically the same principles: if this is a convex function, then we just want to take derivatives and set them equal to 0. So what happens when you take the derivative of the modulus of y_i minus x_i beta with respect to beta?

AUDIENCE: [INAUDIBLE]

PROFESSOR: What did you say?

AUDIENCE: Yeah, it's not [INAUDIBLE]. The first derivative is not continuous.

PROFESSOR: OK. Well, this is not a smooth function. But let me just plot x_i beta here, and y_i minus that. Basically, this is going to be a function that has slope 1 when the residual is positive and slope minus 1 when it's negative, and that is true component-wise across the y's. So what we end up wanting to do is find the value of the regression estimate for which the contributions from cases below the fitted line balance the contributions from cases above it. And that solves the problem.

Now, with maximum likelihood estimation, one can plug in for h minus the log of the density of y_i given beta, x, and sigma_i squared. Those terms sum to minus the log of the joint density for all the data, so minimizing Q is the same as maximizing the likelihood. So that works as well.

With robust M-estimators, we can consider another function, chi, which can be defined so that the estimates have good properties.
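One common choice of chi, used here purely as an illustration (it is not named in the lecture), is the Huber function, which is quadratic for small residuals and linear for large ones, so gross outliers get much less weight than under least squares. A minimal sketch on simulated data:

    set.seed(4)
    n <- 200
    x <- rnorm(n)
    y <- 1 + 2 * x + rnorm(n)
    y[1:5] <- y[1:5] + 15                       # a few gross outliers

    # Huber chi: quadratic near zero, linear in the tails (k is the cutoff)
    huber <- function(u, k = 1.345) ifelse(abs(u) <= k, u^2 / 2, k * abs(u) - k^2 / 2)

    Q <- function(beta) sum(huber(y - beta[1] - beta[2] * x))
    optim(c(0, 0), Q)$par                       # robust M-estimate, close to (1, 2)
    coef(lm(y ~ x))                             # least squares, pulled by the outliers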
And there's a whole theory of robust estimation--it's very rich--which talks about how best to specify this chi function. Now, one of the problems with least squares estimation is that the squares of very large residuals are very, very large in magnitude. So there is perhaps an undue influence of very large residuals under least squares estimation and maximum likelihood estimation. Robust estimators allow you to control that by defining the function differently.

Finally, there are quantile estimators, which extend the mean absolute deviation criterion. If we take the h function to be one multiple of the residual when the residual is positive, and a different, complementary multiple when the residual is less than 0, then by varying tau you end up getting quantile estimators, where what you're doing is estimating the tau quantile. So this general class of M-estimators encompasses most estimators that we will encounter in fitting models.

So that finishes the technical, or mathematical, discussion of regression analysis. Let me highlight for you a case study that's been added to the course website. This first one is on linear regression models for asset pricing, and I want you to read through it just to see how regression applies in fitting various simple linear regression models.

The case study begins by introducing the capital asset pricing model, which basically says that if you look at the returns on any stock in an efficient market, those should depend on the return of the overall market, scaled by how risky the stock is. And so if one looks at the return on the stock on the right scale, you should have a simple linear regression model.
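Here is a stylized sketch of that regression in R, using simulated excess returns rather than the actual GE and S&P 500 data from the case study (the names and numbers are illustrative only): excess stock return regressed on excess market return, with the slope playing the role of the stock's beta.

    set.seed(5)
    n.days <- 750
    mkt.excess   <- rnorm(n.days, mean = 0.0002, sd = 0.01)                   # market excess return
    stock.excess <- 0.0001 + 1.2 * mkt.excess + rnorm(n.days, sd = 0.008)     # stock with "beta" 1.2

    capm.fit <- lm(stock.excess ~ mkt.excess)
    summary(capm.fit)      # slope estimates the stock's beta; its t value measures signal to noise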
In the case study itself, we look at a time series for GE stock and the S&P 500. And the case study walks through how you can actually collect this data on the web using R; the case notes provide those details. There's also the three-month Treasury rate, which is collected, because if you're thinking about the return on the stock versus the return on the index, what's really of interest is the excess return over a risk-free rate. And under the efficient markets model, the excess return of a stock is related to the excess return of the market by a linear regression model. So we can fit this model. Here's a plot of the daily excess returns for GE stock versus the market. That looks like a nice point cloud for which a linear model might fit well, and it does.

Well, there are regression diagnostics, which are detailed in the problem set, where we're looking at how influential individual observations are--what their impact is on the regression parameters. This display highlights, for a very simple linear regression model, which data points are influential. I've highlighted in red those values which are influential. Now, if you look at the definition of leverage in a simple linear model, it's very simple: the observations that are very far from the mean have large leverage. And you can confirm that with your answers to the problem set. This x indicates a significantly influential point in terms of the regression parameters, as measured by Cook's distance. That definition is also given in the case notes.

AUDIENCE: [INAUDIBLE]

PROFESSOR: By computing the individual leverages with a function that's given here, and by selecting out those that exceed a given magnitude.
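A minimal sketch of those two diagnostics in R, again on simulated data: hatvalues() returns the leverages and cooks.distance() returns Cook's distances for a fitted lm object; the cutoffs used below are common rules of thumb rather than values taken from the case study.

    set.seed(6)
    x <- rnorm(200, sd = 0.01)
    y <- 1.2 * x + rnorm(200, sd = 0.008)
    fit <- lm(y ~ x)

    lev <- hatvalues(fit)            # leverage: large when x is far from its mean
    cd  <- cooks.distance(fit)       # influence of each case on the fitted coefficients

    p <- length(coef(fit)); n <- length(y)
    which(lev > 2 * p / n)           # flag high-leverage points (rule of thumb 2p/n)
    which(cd > 4 / n)                # flag influential points (rule of thumb 4/n)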
Now, that is a very, very simple model, with stocks depending on one unknown risk factor given by the market. In modeling equity returns, there are many different factors that can have an impact on returns. So what I've done in the case study is to look at adding another factor, which is just the return on crude oil.

Let me highlight something for you here. With GE stock, what would you expect the impact of, say, a high return on crude oil to be on the return of GE stock? Would you expect it to be positively related or negatively related?

OK. Well, GE is a stock that's invested broadly in many different industries, and it really reflects the overall market to some extent. Many years ago--10, 15 years ago--GE represented maybe 3% of the GNP of the US market, so it was really highly related to how well the market does. Now, crude oil is a commodity, and oil is used to drive cars and to fuel energy production. So if you have an increase in oil prices, then the cost of essentially doing business goes up; it is associated with an inflation factor--prices are rising. And you can see here that if we add in a factor for the return on crude oil, its regression estimate is negative 0.03, with a t value of minus 3.561. So in fact the market, in a sense, over this period and for this analysis, was not efficient in explaining the return on GE; crude oil is another independent factor that helps explain returns. That's useful to know. And if you are clever about defining, identifying, and evaluating different factors, you can build factor asset pricing models that are very, very useful for investing and trading.

Now, as a comparison, the case study also applies the same analysis to Exxon Mobil. Exxon Mobil is an oil company. So let me highlight this: we fit the same two-factor model, and here the regression parameter corresponding to the crude oil factor is plus 0.13, with a t value of 16.
So crude oil definitely has an impact on the return of Exxon Mobil, because Exxon Mobil goes up and down with oil prices.

The case study closes with a scatter plot of the independent variables, highlighting where the influential values are. Just as in the simple linear regression, where the observations far from the mean of the data were influential, in a multivariate setting--here it's bivariate--the influential observations are those that are very far from the centroid. One of the problems in the problem set actually goes through this: you can see where these high-leverage values are, and how leverage is associated with the Mahalanobis distance of cases from the centroid of the independent variables. So if you're a visual type of mathematician, as opposed to an algebraic type of mathematician, I think these kinds of graphs are very helpful in understanding what is really going on. And the degree of influence is tied to the fact that we're taking least squares estimates, so we have the quadratic form associated with the overall process.

There's another case study that I'll be happy to discuss after class or during office hours; I don't think we have time today during the lecture. It concerns exchange rate regimes. This second case study looks at the Chinese yuan, which was basically pegged to the dollar for many years. And then, I guess through political influence from other countries, they started to let the yuan vary from the dollar, but perhaps pegged it to some basket of currencies. So how would you determine what that basket of currencies is? Well, there are regression methods that have been developed by economists that help you do that, and the case study goes through that analysis.
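As background (this is a standard approach in the economics literature, often called a Frankel-Wei regression, not a detail taken from the lecture), one regresses daily log-changes of the target currency, all measured against a neutral numeraire, on the log-changes of the candidate basket currencies; the coefficients estimate the implicit basket weights. A stylized sketch with simulated series:

    set.seed(7)
    n.days <- 500
    # Simulated daily log-returns of candidate currencies versus a numeraire
    usd <- rnorm(n.days, sd = 0.004)
    eur <- rnorm(n.days, sd = 0.005)
    jpy <- rnorm(n.days, sd = 0.005)
    # A hypothetical target currency that is 90% dollar, 10% euro, plus noise
    cny <- 0.9 * usd + 0.1 * eur + rnorm(n.days, sd = 0.0005)

    basket.fit <- lm(cny ~ usd + eur + jpy)
    summary(basket.fit)      # coefficients estimate the implicit basket weights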
So check that out to see how you can get immediate access to currency data, fit these regression models, and look at and evaluate the different results.

So let's turn now to the main topic, which is time series analysis. In the rest of today's lecture, I want to talk about univariate time series analysis. We're thinking of a random variable that is observed over time, as a discrete time process. I'll introduce the Wold representation theorem, the definitions of stationarity, and the relationship between them. Then we'll look at the classic autoregressive moving average models, and then extend those to non-stationary processes with integrated autoregressive moving average models. And finally, we'll talk about estimating stationary models and how we test for stationarity.

So let's begin from first principles. We have a discrete time stochastic process X, which consists of random variables indexed by time. The stochastic behavior of this sequence is determined by specifying the density or probability mass functions for all finite collections of time indexes. So if we could specify all finite-dimensional distributions of this process, we would specify the probability model for the stochastic process.

Now, this stochastic process is strictly stationary if the density function for any collection of times t_1 through t_m is equal to the density function for the same collection translated by tau. So the density function for any finite-dimensional distribution is constant under arbitrary time translations. That's a very strong property, but it's a reasonable property to ask for if you're doing statistical modeling. What do you want to do when you're estimating models? You want to estimate things that are constant. Constants are nice things to estimate, and parameters of models are constant.
So we really want the underlying structure of the distributions to be the same over time.

That was strict stationarity, which requires knowledge of the entire distribution of the stochastic process. We're now going to introduce a weaker definition, which is covariance stationarity. A covariance stationary process has a constant mean, mu; a constant variance, sigma squared; and a covariance between values separated by a lag tau that depends only on tau, given by a function gamma(tau). Gamma isn't a constant function, but for all t, the covariance of X_t and X_(t+tau) is this same gamma(tau).

We can also introduce the autocorrelation function of the stochastic process, rho(tau). The correlation of two random variables is their covariance divided by the square root of the product of their variances. Choongbum introduced that in one of his lectures, when we were talking about correlation. But essentially, if you standardize the random variables to have mean 0 and variance 1--subtract off the means and divide through by the standard deviations--then the correlation coefficient is the covariance between those standardized random variables. This is going to come up again and again in time series analysis.

Now, the Wold representation theorem is a very, very powerful theorem about covariance stationary processes. It states that if we have a zero-mean covariance stationary time series, then it can be decomposed into two components with a very nice structure. Basically, X_t can be decomposed into V_t plus S_t. V_t is a linearly deterministic process, meaning that past values of V_t perfectly predict what V_t is going to be. This could be a linear trend or some fixed function of past values. It's basically a deterministic process, so there's nothing random in V_t.
It's something that's fixed, without randomness. And S_t is a sum of coefficients psi_i times eta_(t-i), where the eta_t's are linearly unpredictable white noise. So S_t is a weighted average of white noise terms with coefficients given by the psi_i, where psi_0 is 1 and the sum of the squared psi_i's is finite.

And the white noise eta_t--what is white noise? It has expectation zero. It has a constant variance, sigma squared. And the covariance between white noise terms at different times t and s is 0. So the eta_t's are uncorrelated with each other, and of course they are uncorrelated with the deterministic process.

This is really a very, very powerful result. If you are modeling a process and it is covariance stationary, then there exists a representation of it in this form. It's a very compelling structure, and we'll see how it applies in different circumstances.

Now, before getting into the definition of autoregressive moving average models, I just want to give you an intuitive understanding of what's going on with the Wold decomposition. This, I think, will help motivate why the Wold decomposition should exist from a mathematical standpoint.

So consider some univariate stochastic process, some time series X_t, that we want to model, and that we believe is covariance stationary. We want to specify, essentially, the Wold decomposition of that. Well, what we could do is initialize a parameter p, the number of past observations to use beyond the linearly deterministic term, and then estimate the linear projection of X_t on its last p lagged values. And I want to consider estimating that relationship using a sample of size n with some ending point t_0 less than or equal to T. So we can consider y values, like a response variable, being given by the successive values of our time series.
And so our response variables y_j can be taken to be X_(t_0 - n + j), and we define a response vector y and a matrix Z accordingly. So we have values of our stochastic process in y, and our Z matrix, which is essentially the matrix of independent variables, contains the lagged values of the process.

So let's apply ordinary least squares to specify the projection. The projection matrix should be familiar by now. That gives us a prediction y hat depending on p lags, and we can compute the projection residuals from that fit.

Well, we can then apply time series methods--which we'll be introducing in a few minutes--to analyze these residuals and specify a moving average model for them. We then have estimates of the underlying coefficients psi and estimates of the innovations eta_t. And then we can evaluate whether this is a good model or not. What does it mean to be an appropriate model? Well, the residuals should be orthogonal to lags longer than p: we shouldn't have any dependence of our residuals on lags of the stochastic process that weren't included in the model. And the estimated eta_t's should be consistent with white noise.

Those issues can be evaluated, and if there's evidence otherwise, then we can change the specification of the model: we can add additional lags, and we can add additional deterministic variables if we can identify what those might be. And we proceed with this process. That, essentially, is how the Wold decomposition could be implemented.

And theoretically, as our sample gets large--if we're observing this time series for a long time--the limit of the projections, as the number of included lags p gets large, should be essentially the projection of the data on its full history. And that, in fact, is the projection corresponding to, and defining, the coefficients psi_i.
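A small sketch of that lag-and-project construction in R (simulated data; embed() is used here to build the matrix of lagged values): regress X_t on its last p lags, then check whether the residuals look like white noise by examining their sample autocorrelations.

    set.seed(8)
    x <- arima.sim(model = list(ar = c(0.6, 0.2)), n = 1000)   # a covariance stationary series

    p <- 5
    Z <- embed(x, p + 1)            # column 1 is X_t; columns 2..(p+1) are X_(t-1), ..., X_(t-p)
    y <- Z[, 1]
    proj <- lm(y ~ Z[, -1])         # linear projection of X_t on its last p lags

    eta.hat <- resid(proj)          # estimated innovations
    acf(eta.hat)                    # should look like white noise if p lags are enough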
And so in the limit, that projection will converge, in the sense that the coefficients of the projection correspond to the psi_i.

Now, letting p go to infinity may be required: it means there is long-term dependence in the process--the dependence doesn't stop at a given lag, it persists over time. So we may need to let p go to infinity. And what happens when p goes to infinity? Well, if you let p go to infinity too quickly, you run out of degrees of freedom to estimate your models. So from an implementation standpoint, you need to let p/n go to 0, so that you have essentially more data than parameters that you're estimating. That is required. And in time series modeling, what we look for are models where finite values of p suffice, so we're only estimating a finite number of parameters--or, if we have a moving average model whose coefficients are infinite in number, perhaps those can be defined by a small number of parameters. We'll be looking for that kind of feature in different models.

Let's turn to the lag operator. The lag operator is a fundamental tool in time series models. We consider the operator L that shifts a time series back by one time increment. Applying this operator recursively: applied zero times, there's no lag; once, one lag; twice, two lags; and so on. In thinking about these, what we're dealing with is like a transformation on an infinite-dimensional space--something like an identity matrix shifted over by one column, or two columns. So inverses of these operators are well defined in terms of what we get from them.
So we can express the Wold representation in terms of these lag operators, writing our stochastic process X_t as V_t plus psi(L) applied to eta_t, where psi(L) is a function of the lag operator--a potentially infinite-order polynomial in the lags. This notation is something you need to get very familiar with if you're going to be comfortable with the models introduced as ARMA and ARIMA models. Any questions about that?

Now, related to this--because it will come up somewhat later--let me introduce the impulse response function of a covariance stationary process. If we have a stochastic process X_t given by this Wold representation, you can ask yourself what happens with the innovation at time t, eta_t: how does it affect the process over time?

So pretend that you are the chairman of the Federal Reserve Bank, you're interested in GNP--basically economic growth--and you're considering changing interest rates to help the economy. Well, you'd like to know what the impact of a change in that factor is: how it's going to affect the variable of interest, perhaps GNP. Now, in this case we're thinking of just a simple covariance stationary stochastic process; it's basically a weighted sum, a moving average, of innovations eta_t. But any covariance stationary process can be represented in this form. And the impulse response function describes the impact of eta_t over time. Basically, eta_t affects the process at time t; because of the moving average structure, it also affects it at t plus 1, at t plus 2, and so on. And the impulse response is the derivative of the value of the process with respect to the j-th previous innovation, which is given by psi_j. So the different innovations have an impact on the current value given by this impulse response function.
So looking backward, that definition is pretty well defined. But you can also think about how an innovation affects the process going forward. The long-run cumulative response is essentially the ultimate impact of that innovation on the process. Eventually a single innovation stops changing the value of the process, but what is the value toward which the process moves because of that one innovation? The long-run cumulative response is given by the sum of these individual responses, the sum of the psi_i's--which is the polynomial psi with the lag operator replaced by 1, that is, psi(1). We'll see this again when we talk about vector autoregressive processes with multivariate time series.

Now, the Wold representation, which is a possibly infinite-order moving average, can have an autoregressive representation. Suppose there is another polynomial in the lags, with coefficients psi_i star, which we'll call psi inverse of L, satisfying the property that multiplying it by psi(L) gives the identity, lag 0. Then this psi inverse, if it exists, is the inverse of psi(L). So if we start with psi(L) and it's invertible, there exists a psi inverse of L with coefficients psi_i star. And one can take our original expression for the stochastic process, a moving average of the eta's, and re-express it as a weighted sum of lagged X's plus the current innovation. So we've essentially inverted the process and shown that the stochastic process can be expressed in an infinite-order autoregressive representation. This corresponds to the intuitive construction of the Wold representation given earlier: the regression coefficients in the projection several slides back correspond to this inverse operator.
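In R, the psi weights of the Wold (infinite-order moving average) form of an ARMA model can be computed with ARMAtoMA(). A small sketch of the impulse response and the long-run cumulative response for one illustrative set of coefficients (these particular values are arbitrary):

    # Impulse response (psi weights) of an ARMA(1,1) with AR coefficient 0.7, MA coefficient 0.2
    psi <- ARMAtoMA(ar = 0.7, ma = 0.2, lag.max = 20)
    psi[1:5]                       # effect of an innovation 1, 2, ... periods later

    # Long-run cumulative response: psi_0 = 1 plus the sum of the remaining psi weights
    1 + sum(ARMAtoMA(ar = 0.7, ma = 0.2, lag.max = 200))
    (1 + 0.2) / (1 - 0.7)          # closed form for this ARMA(1,1), equal to 4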
So let's turn to some specific time series models that are widely used. The class of autoregressive moving average processes has the following mathematical definition. We define X_t to be a linear combination of lags of X, going back p lags, with coefficients phi_1 through phi_p, plus a residual term expressed as a q-th order moving average of white noise. In this framework, the eta_t's are white noise--and white noise, to reiterate, has mean 0, constant variance, and zero covariance between distinct terms.

In this representation, I've simplified things a little by subtracting off the mean from all of the X's, which just makes the formulas a bit simpler. Now, with lag operators, we can write this ARMA model as phi(L) applied to X_t equal to theta(L) applied to eta_t, where phi(L) is the p-th order lag polynomial with coefficients 1, minus phi_1, up to minus phi_p, and theta(L) is the q-th order lag polynomial with coefficients 1, theta_1, theta_2, up to theta_q.

This is a compact representation of the ARMA time series model: a linear combination of lags of the stochastic process, up to order p, is set equal to a weighted average of current and lagged eta_t's. If we multiply by the inverse of phi(L), if that exists, then we get the representation here, which is simply the Wold decomposition. So ARMA models have a Wold decomposition whenever phi(L) is invertible.

And we'll explore these by looking at simpler cases of the ARMA models, focusing on autoregressive models first and moving average processes second, so that you get a better feel for how these things are manipulated and interpreted.

So let's move on to the p-th order autoregressive process. We're going to consider ARMA models that have only autoregressive terms. So we have phi(L) applied to X_t minus mu equal to eta_t, which is white noise. A linear combination of the series is white noise, and X_t then follows a linear regression model on explanatory variables which are lags of the process X.
775 00:51:41,330 --> 00:51:46,760 And this could be expressed as X_t equal to c plus the sum 776 00:51:46,760 --> 00:51:50,950 from 1 to p of phi_j X_(t-j), which is a linear regression 777 00:51:50,950 --> 00:51:53,700 model with regression parameters phi_j. 778 00:51:53,700 --> 00:52:01,390 And c, the constant term, is equal to mu times phi of 1. 779 00:52:01,390 --> 00:52:10,920 Now, if you basically take expectations of the process, 780 00:52:10,920 --> 00:52:14,360 you basically have coefficients of mu coming in 781 00:52:14,360 --> 00:52:15,730 from all the terms. 782 00:52:15,730 --> 00:52:22,220 And phi of 1 times mu is the regression coefficient there. 783 00:52:25,160 --> 00:52:27,320 So with this autoregressive model, 784 00:52:27,320 --> 00:52:31,160 we now want to go over what are the stationarity conditions. 785 00:52:31,160 --> 00:52:35,020 Certainly, this autoregressive model 786 00:52:35,020 --> 00:52:40,790 is one where, well, a simple random walk 787 00:52:40,790 --> 00:52:45,520 follows an autoregressive model but is not stationary. 788 00:52:45,520 --> 00:52:47,650 We'll highlight that in a minute as well. 789 00:52:47,650 --> 00:52:50,410 But if you think it, that's true. 790 00:52:50,410 --> 00:52:55,400 And so stationarity is something to be understood and evaluated. 791 00:53:03,160 --> 00:53:08,680 This polynomial function phi, where 792 00:53:08,680 --> 00:53:11,630 if we replace the lag operator L by z, 793 00:53:11,630 --> 00:53:20,970 a complex variable, the equation phi of z equal to 0 794 00:53:20,970 --> 00:53:24,330 is the characteristic equation associated 795 00:53:24,330 --> 00:53:27,020 with this autoregressive model. 796 00:53:27,020 --> 00:53:33,190 And it turns out that we'll be interested in the roots 797 00:53:33,190 --> 00:53:36,610 of this characteristic equation. 798 00:53:36,610 --> 00:53:40,705 Now, if we consider writing phi of L 799 00:53:40,705 --> 00:53:44,270 as a function of the roots of the equation, 800 00:53:44,270 --> 00:53:49,130 we get this expression where you'll 801 00:53:49,130 --> 00:53:51,340 notice if you multiply all those terms out, 802 00:53:51,340 --> 00:53:55,730 the 1's all multiply out together, and you get 1. 803 00:53:55,730 --> 00:54:00,100 And with the lag operator L to the p-th power, 804 00:54:00,100 --> 00:54:03,210 that would be the product of 1 over lambda_1 805 00:54:03,210 --> 00:54:06,650 times 1 over lambda_2, or actually negative 1 806 00:54:06,650 --> 00:54:09,680 over lambda_1 times negative 1 over lambda_2, 807 00:54:09,680 --> 00:54:13,640 and so forth-- negative 1 over lambda_p. 808 00:54:13,640 --> 00:54:15,820 Basically, if there are p roots to this equation, 809 00:54:15,820 --> 00:54:19,420 this is how it would be written out. 810 00:54:19,420 --> 00:54:27,070 And the process X_t is covariance 811 00:54:27,070 --> 00:54:28,710 stationary if and only if all the roots 812 00:54:28,710 --> 00:54:33,630 of this characteristic equation lie outside the unit circle. 813 00:54:33,630 --> 00:54:35,880 So what does that mean? 814 00:54:35,880 --> 00:54:41,240 That means that the norm modulus of the complex z 815 00:54:41,240 --> 00:54:42,810 is greater than 1. 816 00:54:42,810 --> 00:54:45,160 So they're outside the unit circle 817 00:54:45,160 --> 00:54:47,150 where it's less than or equal to 1. 818 00:54:47,150 --> 00:54:56,810 And the roots, if they are outside the unit circle, 819 00:54:56,810 --> 00:55:01,080 then the modulus of the lambda_j's is greater than 1. 
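A small sketch of that root condition (the function names and coefficient choices are illustrative): write down the characteristic polynomial phi(z) = 1 - phi_1 z - ... - phi_p z^p, find its roots numerically, and declare covariance stationarity only if every root has modulus greater than 1.

```python
import numpy as np

def ar_characteristic_roots(phi):
    """Roots of phi(z) = 1 - phi_1 z - ... - phi_p z^p for an AR(p) model."""
    # numpy.roots expects coefficients ordered from the highest power down to the constant.
    coeffs = np.r_[-np.asarray(phi, float)[::-1], 1.0]
    return np.roots(coeffs)

def is_covariance_stationary(phi):
    """AR(p) is covariance stationary iff every characteristic root lies outside the unit circle."""
    return bool(np.all(np.abs(ar_characteristic_roots(phi)) > 1.0))

print(is_covariance_stationary([0.5]))        # True:  single root at 1/0.5 = 2
print(is_covariance_stationary([1.0]))        # False: unit root, i.e. the random walk
print(is_covariance_stationary([0.6, 0.3]))   # True
print(is_covariance_stationary([0.6, 0.5]))   # False: phi(1) < 0, so a root lies inside (0, 1)
```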
820 00:55:05,400 --> 00:55:12,160 And if we then consider taking a complex number 821 00:55:12,160 --> 00:55:16,010 lambda, basically the root, and have 822 00:55:16,010 --> 00:55:20,600 an expression for 1 minus 1 over lambda L inverse, 823 00:55:20,600 --> 00:55:25,010 we can get this series expression for that inverse. 824 00:55:25,010 --> 00:55:34,860 And that series will exist and be bounded if the lambda_i are 825 00:55:34,860 --> 00:55:36,430 greater than 1 in magnitude. 826 00:55:39,210 --> 00:55:46,210 So we can actually compute an inverse of phi of L 827 00:55:46,210 --> 00:55:49,610 by taking the inverse of each of the component 828 00:55:49,610 --> 00:55:52,240 products in that polynomial. 829 00:55:52,240 --> 00:55:57,800 So in introductory time series courses, 830 00:55:57,800 --> 00:56:00,544 they talk about stationarity and unit roots, 831 00:56:00,544 --> 00:56:01,960 but they don't really get into it, 832 00:56:01,960 --> 00:56:04,490 because it requires a bit of complex analysis 833 00:56:04,490 --> 00:56:06,970 and finding roots of polynomials. 834 00:56:06,970 --> 00:56:09,620 Anyway, this is very simply 835 00:56:09,620 --> 00:56:12,840 how that framework is applied. 836 00:56:12,840 --> 00:56:17,830 So we have a polynomial equation, 837 00:56:17,830 --> 00:56:20,885 the characteristic equation, whose roots we're looking for. 838 00:56:20,885 --> 00:56:22,510 Those roots have to be outside the unit 839 00:56:22,510 --> 00:56:26,170 circle for stationarity of the process. 840 00:56:26,170 --> 00:56:31,870 Well, these are basically conditions for invertibility 841 00:56:31,870 --> 00:56:35,100 of the autoregressive process. 842 00:56:35,100 --> 00:56:40,440 And that invertibility renders the process an infinite-order 843 00:56:40,440 --> 00:56:42,125 moving average process. 844 00:56:46,210 --> 00:56:50,830 So let's go through these results 845 00:56:50,830 --> 00:56:52,840 for the autoregressive process of order one, 846 00:56:52,840 --> 00:56:56,330 because we always start with the simplest case 847 00:56:56,330 --> 00:56:58,420 to understand things. 848 00:56:58,420 --> 00:57:01,140 The characteristic equation for this model is just 1 849 00:57:01,140 --> 00:57:02,820 minus phi z. 850 00:57:02,820 --> 00:57:03,600 The root is 1/phi. 851 00:57:06,630 --> 00:57:12,382 So if the modulus of that root lambda 852 00:57:12,382 --> 00:57:13,840 is greater than 1, meaning the root 853 00:57:13,840 --> 00:57:16,990 is outside the unit circle, then the modulus of phi is less than 1. 854 00:57:16,990 --> 00:57:21,160 So for covariance stationarity of this autoregressive process, 855 00:57:21,160 --> 00:57:25,877 we need the magnitude of phi to be less than 1. 856 00:57:30,090 --> 00:57:31,950 The expected value of X is mu. 857 00:57:31,950 --> 00:57:36,460 The variance of X is sigma squared X. 858 00:57:36,460 --> 00:57:41,130 This has this form, sigma squared over 1 minus phi squared. 859 00:57:41,130 --> 00:57:44,960 That expression is basically obtained 860 00:57:44,960 --> 00:57:50,110 by looking at the infinite order moving average representation. 861 00:57:50,110 --> 00:57:56,760 But notice that whenever phi is nonzero, 862 00:57:56,760 --> 00:58:03,710 the variance of X is actually 863 00:58:03,710 --> 00:58:07,895 greater than the variance of the innovations. 864 00:58:10,440 --> 00:58:17,280 And the closer the magnitude of phi is to 1, the larger that variance gets.
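Those AR(1) moments are easy to spot-check by simulation; here is a sketch (the simulator, parameter values, and sample sizes are illustrative) comparing the sample variance with sigma squared over 1 minus phi squared for both a positive and a negative phi.

```python
import numpy as np

def simulate_ar1(phi, sigma, n, burn=1000, seed=0):
    """Simulate X_t = phi * X_{t-1} + eta_t with eta_t ~ N(0, sigma^2); drop a burn-in."""
    rng = np.random.default_rng(seed)
    eta = rng.normal(0.0, sigma, n + burn)
    x = np.zeros(n + burn)
    for t in range(1, n + burn):
        x[t] = phi * x[t - 1] + eta[t]
    return x[burn:]

sigma = 1.0
for phi in (0.8, -0.8):
    x = simulate_ar1(phi, sigma, n=200_000)
    print(phi, round(x.var(), 3), "theory:", sigma**2 / (1 - phi**2))
# Either sign of phi gives a variance near 2.78, larger than the innovation variance of 1.
```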
865 00:58:17,280 --> 00:58:23,100 So the innovation variance basically is scaled up a bit 866 00:58:23,100 --> 00:58:25,010 in the autoregressive process. 867 00:58:25,010 --> 00:58:27,710 The covariance at lag 1 is phi times sigma squared 868 00:58:27,710 --> 00:58:31,980 X. You'll be going through this in the problem set. 869 00:58:31,980 --> 00:58:40,160 And the covariance of X at lag j is phi to the j power times sigma squared X. 870 00:58:40,160 --> 00:58:43,640 And these expressions can all be easily evaluated 871 00:58:43,640 --> 00:58:47,490 by simply writing out the definition of these covariances 872 00:58:47,490 --> 00:58:50,000 in terms of the original model and looking 873 00:58:50,000 --> 00:58:54,250 at which terms are independent and cancel out. 874 00:59:04,510 --> 00:59:06,800 Let's just go through these cases. 875 00:59:06,800 --> 00:59:08,730 Let's show it all here. 876 00:59:08,730 --> 00:59:16,630 So we have if phi is between 0 and 1, 877 00:59:16,630 --> 00:59:20,810 then the process experiences exponential mean reversion 878 00:59:20,810 --> 00:59:22,170 to mu. 879 00:59:22,170 --> 00:59:24,760 So an autoregressive process with phi between 0 880 00:59:24,760 --> 00:59:29,490 and 1 corresponds to a mean-reverting process. 881 00:59:29,490 --> 00:59:31,830 This process is actually one that 882 00:59:31,830 --> 00:59:34,310 has been used theoretically for interest rate models 883 00:59:34,310 --> 00:59:36,920 and a lot of theoretical work in finance. 884 00:59:36,920 --> 00:59:40,280 The Vasicek model is actually an example 885 00:59:40,280 --> 00:59:42,300 of the Ornstein-Uhlenbeck process, 886 00:59:42,300 --> 00:59:47,840 which is basically a mean-reverting Brownian motion. 887 00:59:47,840 --> 00:59:53,070 And this model can be applied to any variables 888 00:59:53,070 --> 00:59:59,950 that exhibit, or could be thought of as exhibiting, 889 00:59:59,950 --> 01:00:01,810 mean reversion, 890 01:00:01,810 --> 01:00:07,470 such as interest rate spreads or real exchange rates, 891 01:00:07,470 --> 01:00:11,430 variables where one can expect that things never 892 01:00:11,430 --> 01:00:12,790 get too large or too small. 893 01:00:12,790 --> 01:00:14,440 They come back to some mean. 894 01:00:14,440 --> 01:00:16,570 Now, the challenge is that this usually 895 01:00:16,570 --> 01:00:18,930 may be true over short periods of time. 896 01:00:18,930 --> 01:00:21,100 But over very long periods of time, 897 01:00:21,100 --> 01:00:23,230 the point to which you're reverting changes. 898 01:00:23,230 --> 01:00:26,640 So these models tend to not have broad application 899 01:00:26,640 --> 01:00:27,900 over long time ranges. 900 01:00:27,900 --> 01:00:30,150 You need to adapt. 901 01:00:30,150 --> 01:00:32,220 Anyway, with the AR process, we can also 902 01:00:32,220 --> 01:00:34,020 have negative values of phi, which 903 01:00:34,020 --> 01:00:38,460 results in exponential mean reversion that's oscillating 904 01:00:38,460 --> 01:00:44,190 in time, because the autoregressive coefficient 905 01:00:44,190 --> 01:00:49,180 basically is a negative value. 906 01:00:49,180 --> 01:00:54,510 And for phi equal to 1, the Wold decomposition doesn't exist. 907 01:00:54,510 --> 01:00:57,860 And the process is the simple random walk.
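Here is a sketch of those three regimes (the parameter choices and sample size are illustrative): for phi = 0.8 the sample autocorrelations decay like phi to the j, consistent with the lag-j covariance phi to the j times sigma squared X; for phi = -0.8 they decay with alternating sign; and for phi = 1 the simulated series is a random walk whose autocorrelations stay near 1.

```python
import numpy as np

def sample_autocorr(x, lags):
    """Sample autocorrelations of a series at the requested lags."""
    x = np.asarray(x, float) - np.mean(x)
    n, g0 = len(x), np.dot(x, x) / len(x)
    return [np.dot(x[:n - j], x[j:]) / n / g0 for j in lags]

rng = np.random.default_rng(0)
n = 100_000
for phi in (0.8, -0.8, 1.0):
    x = np.zeros(n)
    eta = rng.normal(size=n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + eta[t]
    print(phi, np.round(sample_autocorr(x, [1, 2, 3]), 3))
# phi =  0.8 -> roughly [ 0.8, 0.64,  0.512]  (exponential mean reversion)
# phi = -0.8 -> roughly [-0.8, 0.64, -0.512]  (oscillating mean reversion)
# phi =  1.0 -> values near 1 at every lag    (random walk, not stationary)
```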
908 01:00:57,860 --> 01:01:00,340 So basically, if phi is equal to 1, 909 01:01:00,340 --> 01:01:04,480 that means that basically just changes in value of the process 910 01:01:04,480 --> 01:01:08,860 are independent and identically distributed white noise. 911 01:01:08,860 --> 01:01:11,910 And that's the random walk process. 912 01:01:11,910 --> 01:01:15,840 And that process, as was covered in earlier lectures, 913 01:01:15,840 --> 01:01:18,780 is non-stationary. 914 01:01:18,780 --> 01:01:22,790 If phi is greater than 1, then you have an explosive process, 915 01:01:22,790 --> 01:01:26,780 because basically the values are scaling up 916 01:01:26,780 --> 01:01:31,000 every time increment. 917 01:01:31,000 --> 01:01:35,290 So those are features of the AR(1) model. 918 01:01:35,290 --> 01:01:42,110 For a general autoregressive process of order p, 919 01:01:42,110 --> 01:01:45,850 there's a method-- well, we can look at the second order 920 01:01:45,850 --> 01:01:49,590 moments of that process, which have a very nice structure, 921 01:01:49,590 --> 01:01:51,840 and then use those to solve for estimates 922 01:01:51,840 --> 01:01:56,630 of the ARMA parameters, or autoregressive parameters. 923 01:01:56,630 --> 01:02:01,820 And those happen to be specified by what are called 924 01:02:01,820 --> 01:02:04,840 the Yule-Walker equations. 925 01:02:04,840 --> 01:02:07,270 So the Yule-Walker equations is a standard topic 926 01:02:07,270 --> 01:02:09,670 in time series analysis. 927 01:02:09,670 --> 01:02:11,480 What is it? 928 01:02:11,480 --> 01:02:13,030 What does it correspond to? 929 01:02:13,030 --> 01:02:16,320 Well, we take our original autoregressive process 930 01:02:16,320 --> 01:02:17,470 of order p. 931 01:02:17,470 --> 01:02:24,400 And we write out the formulas for the covariance 932 01:02:24,400 --> 01:02:26,900 at lag j between two observations. 933 01:02:26,900 --> 01:02:31,790 So what's the covariance between X_t and X_(t-j)? 934 01:02:31,790 --> 01:02:39,820 And that expression is given by this equation. 935 01:02:39,820 --> 01:02:43,980 And so this equation for gamma of j is determined simply 936 01:02:43,980 --> 01:02:48,700 by evaluating the expectations where we're taking 937 01:02:48,700 --> 01:02:53,620 the expectation of X_t in the autoregressive process times 938 01:02:53,620 --> 01:02:56,110 the fix X_(t-j) minus mu. 939 01:02:56,110 --> 01:02:58,540 So just evaluating those terms, you 940 01:02:58,540 --> 01:03:02,880 can validate that this is the equation. 941 01:03:02,880 --> 01:03:08,620 If we look at the equations corresponding to j equals 1-- 942 01:03:08,620 --> 01:03:12,040 so lag 1 up through lag p-- this is 943 01:03:12,040 --> 01:03:16,070 what those equations look like. 944 01:03:16,070 --> 01:03:20,060 Basically, the left-hand side is gamma_1 through gamma_p. 945 01:03:20,060 --> 01:03:23,090 The covariance to lag 1 up to lag p 946 01:03:23,090 --> 01:03:27,590 is equal to basically linear functions 947 01:03:27,590 --> 01:03:29,980 given by the phi of the other covariances. 948 01:03:33,570 --> 01:03:37,410 Who can tell me what the structure is of this matrix? 949 01:03:37,410 --> 01:03:38,590 It's not a diagonal matrix? 950 01:03:38,590 --> 01:03:41,817 What kind of matrix is this? 951 01:03:41,817 --> 01:03:42,900 Math trivia question here. 952 01:03:48,850 --> 01:03:49,782 It has a special name. 953 01:03:52,460 --> 01:03:54,600 Anyone? 954 01:03:54,600 --> 01:03:57,690 It's a Toeplitz matrix. 
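To see that Toeplitz structure, and how the equations get used, here is a sketch (the AR(2) coefficients, sample size, and function name are illustrative): build the p-by-p matrix whose (i, j) entry is the autocovariance at lag |i - j|, solve it against the vector of lag 1 through lag p autocovariances for the phi's, and use the lag-0 equation for sigma squared. The printed matrix has gamma_0 on the main diagonal and gamma_1 just off it, which is the structure being asked about.

```python
import numpy as np

def yule_walker(x, p):
    """Method-of-moments AR(p) fit: solve the Toeplitz system Gamma * phi = gamma,
    then recover sigma^2 from the lag-0 equation gamma_0 = phi' gamma + sigma^2."""
    x = np.asarray(x, float) - np.mean(x)
    n = len(x)
    gamma = np.array([np.dot(x[:n - j], x[j:]) / n for j in range(p + 1)])
    Gamma = np.array([[gamma[abs(i - j)] for j in range(p)] for i in range(p)])  # Toeplitz
    phi_hat = np.linalg.solve(Gamma, gamma[1:])
    sigma2_hat = gamma[0] - phi_hat @ gamma[1:]
    return Gamma, phi_hat, sigma2_hat

# Simulated AR(2) with phi = (0.5, 0.3) and unit innovation variance.
rng = np.random.default_rng(1)
x = np.zeros(100_000)
eta = rng.normal(size=len(x))
for t in range(2, len(x)):
    x[t] = 0.5 * x[t - 1] + 0.3 * x[t - 2] + eta[t]

Gamma, phi_hat, sigma2_hat = yule_walker(x, 2)
print(np.round(Gamma, 2))                           # constant along each diagonal
print(np.round(phi_hat, 3), round(sigma2_hat, 3))   # close to [0.5, 0.3] and 1.0
```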
955 01:03:57,690 --> 01:04:00,840 The off diagonals are all the same value. 956 01:04:00,840 --> 01:04:06,680 And in fact, because of the symmetry of the covariance, 957 01:04:06,680 --> 01:04:09,750 basically the gamma of 1 is equal to gamma of minus 1. 958 01:04:09,750 --> 01:04:12,680 Gamma of minus 2 is equal to gamma plus 2. 959 01:04:12,680 --> 01:04:14,640 Because of the covariant stationarity, 960 01:04:14,640 --> 01:04:16,700 it's actually also symmetric. 961 01:04:16,700 --> 01:04:22,630 So these equations allow us to solve for the phis 962 01:04:22,630 --> 01:04:25,990 so long as we have estimates of these covariances. 963 01:04:25,990 --> 01:04:30,510 So if we have a system of estimates, 964 01:04:30,510 --> 01:04:33,940 we can plug these in in an attempt to solve this. 965 01:04:33,940 --> 01:04:36,770 If they're consistent estimates of the covariances, 966 01:04:36,770 --> 01:04:38,530 then there will be a solution. 967 01:04:38,530 --> 01:04:41,980 And then the 0th equation, which was not 968 01:04:41,980 --> 01:04:43,469 part of the series of equations-- 969 01:04:43,469 --> 01:04:45,510 if you go back and look at the 0th equation, that 970 01:04:45,510 --> 01:04:47,920 allows you to get an estimate for the sigma squared. 971 01:04:47,920 --> 01:04:50,920 So these Yule-Walker equations are the way 972 01:04:50,920 --> 01:04:54,510 in which many ARMA models are specified 973 01:04:54,510 --> 01:05:03,650 in different statistics packages and in terms of what principles 974 01:05:03,650 --> 01:05:04,400 are being applied. 975 01:05:04,400 --> 01:05:09,700 Well, if we're using unbiased estimates of these parameters, 976 01:05:09,700 --> 01:05:12,055 then this is applying what's called 977 01:05:12,055 --> 01:05:16,250 the method of moments principle for statistical estimation. 978 01:05:16,250 --> 01:05:20,600 And with complicated models, where sometimes the likelihood 979 01:05:20,600 --> 01:05:25,900 functions are very hard to specify and compute, 980 01:05:25,900 --> 01:05:29,800 and then to do optimization over those is even harder. 981 01:05:29,800 --> 01:05:32,780 It can turn out that there are relationships 982 01:05:32,780 --> 01:05:35,840 between the moments of the random variables, which 983 01:05:35,840 --> 01:05:38,340 are functions of the unknown parameters. 984 01:05:38,340 --> 01:05:42,590 And you can solve for basically the sample moments equalling 985 01:05:42,590 --> 01:05:45,940 the theoretical moments and you apply the method 986 01:05:45,940 --> 01:05:48,830 of moments estimation method. 987 01:05:48,830 --> 01:05:54,670 Econometrics is rich with many applications of that principle. 988 01:05:57,580 --> 01:06:02,110 The next section goes through the moving average model. 989 01:06:05,240 --> 01:06:12,340 Let me highlight this. 990 01:06:12,340 --> 01:06:16,080 So with an order q moving average, 991 01:06:16,080 --> 01:06:19,560 we basically have a polynomial in the lag operator L, 992 01:06:19,560 --> 01:06:22,390 which is operated upon the eta_t's. 993 01:06:22,390 --> 01:06:25,700 And if you write out the expectations of X_t, 994 01:06:25,700 --> 01:06:27,030 you get mu. 995 01:06:27,030 --> 01:06:28,650 The variance of X_t, which is gamma 0, 996 01:06:28,650 --> 01:06:34,470 is sigma squared times 1 plus the squares of the coefficients 997 01:06:34,470 --> 01:06:36,360 in the polynomial. 
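That variance formula, sigma squared times 1 plus the sum of the squared theta coefficients, can be spot-checked directly; a sketch with an illustrative MA(2) and made-up parameter values:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 1.5, 2.0
theta = np.array([0.4, -0.3])
n = 500_000

eta = rng.normal(0.0, sigma, n + len(theta))
# X_t = mu + eta_t + theta_1 * eta_{t-1} + theta_2 * eta_{t-2}
x = mu + eta[2:] + theta[0] * eta[1:-1] + theta[1] * eta[:-2]

print(round(x.mean(), 3), "vs mu =", mu)
print(round(x.var(), 3), "vs theory:", sigma**2 * (1 + np.sum(theta**2)))   # ~ 5.0
```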
998 01:06:36,360 --> 01:06:39,920 And so this feature, this property here is due 999 01:06:39,920 --> 01:06:44,100 to the fact that we have uncorrelated innovations 1000 01:06:44,100 --> 01:06:47,060 in the eta_t's. 1001 01:06:47,060 --> 01:06:48,260 The eta_t's are white noise. 1002 01:06:48,260 --> 01:06:52,830 So the only thing that comes through in the expectation 1003 01:06:52,830 --> 01:06:56,020 of the square of X_t is the squared terms 1004 01:06:56,020 --> 01:07:01,900 in the etas, which have coefficients 1005 01:07:01,900 --> 01:07:03,860 given by the theta_i squared. 1006 01:07:03,860 --> 01:07:09,170 I'll leave these properties for you to verify; 1007 01:07:09,170 --> 01:07:11,142 they're very straightforward. 1008 01:07:11,142 --> 01:07:14,430 But let's now turn in the final minutes of the lecture 1009 01:07:14,430 --> 01:07:20,170 today to accommodating non-stationary behavior 1010 01:07:20,170 --> 01:07:23,340 in time series. 1011 01:07:23,340 --> 01:07:27,990 The original approaches in time series 1012 01:07:27,990 --> 01:07:32,320 were to focus on estimation methodologies 1013 01:07:32,320 --> 01:07:34,940 for covariance stationary processes. 1014 01:07:34,940 --> 01:07:38,440 So if the series is not covariance stationary, 1015 01:07:38,440 --> 01:07:42,410 then we would want to apply some transformation 1016 01:07:42,410 --> 01:07:48,660 to the data, to the series, 1017 01:07:48,660 --> 01:07:52,270 so that the resulting process is stationary. 1018 01:07:52,270 --> 01:07:55,990 And with the differencing operators, 1019 01:07:55,990 --> 01:08:00,610 delta, Box and Jenkins advocated removing 1020 01:08:00,610 --> 01:08:03,420 non-stationary trending behavior, which 1021 01:08:03,420 --> 01:08:06,370 is exhibited often in economic time series, 1022 01:08:06,370 --> 01:08:09,960 by using a first difference, maybe a second difference, 1023 01:08:09,960 --> 01:08:12,300 or a k-th order difference. 1024 01:08:12,300 --> 01:08:20,229 So these operators are defined in this way. 1025 01:08:20,229 --> 01:08:22,960 Basically with the k-th order operator 1026 01:08:22,960 --> 01:08:25,210 having this expression here, this 1027 01:08:25,210 --> 01:08:31,189 is the binomial expansion of a k-th power, 1028 01:08:31,189 --> 01:08:35,970 which can be useful. 1029 01:08:35,970 --> 01:08:40,609 It comes up all the time in probability theory. 1030 01:08:40,609 --> 01:08:43,609 And if a process has a linear time trend, 1031 01:08:43,609 --> 01:08:48,390 then delta X_t is going to have no time trend at all, 1032 01:08:48,390 --> 01:08:51,390 because you're basically taking out 1033 01:08:51,390 --> 01:08:54,430 that linear component by taking successive differences. 1034 01:08:54,430 --> 01:08:57,014 Sometimes, if you have a real series 1035 01:08:57,014 --> 01:08:59,430 that appears non-stationary, 1036 01:08:59,430 --> 01:09:02,810 its first differences can still 1037 01:09:02,810 --> 01:09:05,649 appear to be trending over time, in which case 1038 01:09:05,649 --> 01:09:08,810 sometimes the second difference will result 1039 01:09:08,810 --> 01:09:11,270 in a process with no trend. 1040 01:09:11,270 --> 01:09:14,170 So these are sort of convenient tricks, 1041 01:09:14,170 --> 01:09:18,250 techniques to render the series stationary. 1042 01:09:18,250 --> 01:09:21,220 And let's see.
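Here is a sketch of those differencing operators on an illustrative trend-plus-noise series (the slope, noise level, and series length are made up): one application of delta removes the linear trend, and the k-th order difference matches the binomial-expansion form of (1 - L) to the k.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(3)
t = np.arange(500)
x = 2.0 + 0.05 * t + rng.normal(size=t.size)          # linear time trend plus noise

print(round(np.polyfit(t, x, 1)[0], 4))               # fitted slope ~ 0.05 before differencing
print(round(np.polyfit(t[1:], np.diff(x), 1)[0], 4))  # fitted slope ~ 0 after one difference

# Delta^k X_t = sum_j (-1)^j C(k, j) X_{t-j}: the binomial expansion of (1 - L)^k.
k = 3
w = np.array([(-1) ** j * comb(k, j) for j in range(k + 1)])
manual = np.array([np.dot(w, x[i - k:i + 1][::-1]) for i in range(k, len(x))])
print(np.allclose(manual, np.diff(x, n=k)))           # True
```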
1043 01:09:21,220 --> 01:09:26,960 There's examples here of linear trend reversion models 1044 01:09:26,960 --> 01:09:32,319 which are rendered covariance stationary 1045 01:09:32,319 --> 01:09:35,330 under first differencing. 1046 01:09:35,330 --> 01:09:38,689 In this case, this is an example where you have 1047 01:09:38,689 --> 01:09:41,350 a deterministic time trend. 1048 01:09:41,350 --> 01:09:46,040 But then you have reversion to the time trend over time. 1049 01:09:46,040 --> 01:09:49,880 So we basically have eta_t, the error 1050 01:09:49,880 --> 01:09:53,830 about the deterministic trend, is a first order autoregressive 1051 01:09:53,830 --> 01:09:55,740 process. 1052 01:09:55,740 --> 01:10:00,307 And the moments here can be derived this way. 1053 01:10:00,307 --> 01:10:01,390 Leave that as an exercise. 1054 01:10:04,230 --> 01:10:09,510 One could also consider the pure integrated process 1055 01:10:09,510 --> 01:10:16,330 and talk about stochastic trends. 1056 01:10:16,330 --> 01:10:19,140 And basically, random walk processes 1057 01:10:19,140 --> 01:10:22,740 are often referred to in econometrics 1058 01:10:22,740 --> 01:10:25,010 as stochastic trends. 1059 01:10:25,010 --> 01:10:31,610 And you may want to try and remove those from the data, 1060 01:10:31,610 --> 01:10:33,280 or accommodate them. 1061 01:10:33,280 --> 01:10:40,930 And so the stochastic trend process is basically 1062 01:10:40,930 --> 01:10:49,630 given by the first difference X_t is just equal to eta_t. 1063 01:10:49,630 --> 01:10:53,430 And so we have essentially this random walk 1064 01:10:53,430 --> 01:10:55,830 from a given starting point. 1065 01:10:55,830 --> 01:11:00,650 And it's easy to verify it if you knew the 0th point, then 1066 01:11:00,650 --> 01:11:04,770 the variance of the t-th time point would be t sigma squared, 1067 01:11:04,770 --> 01:11:09,000 because we're summing t independent innovations. 1068 01:11:09,000 --> 01:11:14,475 And the covariance between t and lag t minus j 1069 01:11:14,475 --> 01:11:17,500 is simply t minus j sigma squared. 1070 01:11:17,500 --> 01:11:20,860 And the correlation between those has this form. 1071 01:11:20,860 --> 01:11:23,240 What you can see is that this definitely depends on time. 1072 01:11:23,240 --> 01:11:26,660 So it's not a stationary process. 1073 01:11:26,660 --> 01:11:33,880 So this first differencing results in stationarity. 1074 01:11:33,880 --> 01:11:36,230 And the end difference process has those features. 1075 01:11:46,847 --> 01:11:47,805 Let's see where we are. 1076 01:11:52,730 --> 01:11:57,380 Final topic for today is just how 1077 01:11:57,380 --> 01:12:04,630 you incorporate non-stationary process into ARMA processes. 1078 01:12:04,630 --> 01:12:07,680 Well, if you take first differences 1079 01:12:07,680 --> 01:12:10,340 or second differences and the resulting process 1080 01:12:10,340 --> 01:12:13,252 is covariance stationary, then we 1081 01:12:13,252 --> 01:12:15,460 can just incorporate that differencing into the model 1082 01:12:15,460 --> 01:12:20,490 specification itself, and define ARIMA models, Autoregressive 1083 01:12:20,490 --> 01:12:23,730 Integrated Moving Average Processes. 
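The stochastic-trend moments can be checked by simulating many random-walk paths from X_0 = 0 (a sketch with illustrative path counts and horizon): the cross-path variance at time t grows like t times sigma squared, while first differencing takes you back to a constant-variance white noise.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma, n_paths, T = 1.0, 20_000, 400
eta = rng.normal(0.0, sigma, size=(n_paths, T))
X = np.cumsum(eta, axis=1)                       # random walks started at X_0 = 0

for t in (50, 100, 400):
    print(t, round(X[:, t - 1].var(), 1), "theory:", t * sigma**2)

dX = np.diff(X, axis=1)                          # first differences are just the innovations
print(round(dX.var(), 3), "~ sigma^2 =", sigma**2)
```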
1084 01:12:23,730 --> 01:12:26,000 And so to specify these models, we 1085 01:12:26,000 --> 01:12:29,290 need to determine the order of the differencing required 1086 01:12:29,290 --> 01:12:32,990 to move trends, deterministic or stochastic, 1087 01:12:32,990 --> 01:12:35,820 and then estimating the unknown parameters, 1088 01:12:35,820 --> 01:12:38,940 and then applying model selection criteria. 1089 01:12:38,940 --> 01:12:43,770 So let me go very quickly through this 1090 01:12:43,770 --> 01:12:48,600 and come back to it the beginning of next time. 1091 01:12:48,600 --> 01:12:51,660 But in specifying the parameters of these models, 1092 01:12:51,660 --> 01:12:54,410 we can apply maximum likelihood, again, 1093 01:12:54,410 --> 01:12:59,280 if we assume normality of these innovations eta_t. 1094 01:12:59,280 --> 01:13:02,260 And we can express the ARMA model 1095 01:13:02,260 --> 01:13:04,440 in state space form, which results 1096 01:13:04,440 --> 01:13:07,880 in a form for the likelihood function, which 1097 01:13:07,880 --> 01:13:12,130 we'll see a few lectures ahead. 1098 01:13:12,130 --> 01:13:15,970 But then we can apply limited information maximum likelihood, 1099 01:13:15,970 --> 01:13:19,470 where we just condition on the first observations of the data 1100 01:13:19,470 --> 01:13:22,550 and maximize the likelihood. 1101 01:13:22,550 --> 01:13:27,060 Or not condition on the first few observations, but also 1102 01:13:27,060 --> 01:13:33,700 use their information as well, and look at their density 1103 01:13:33,700 --> 01:13:36,640 functions, incorporating those into the likelihood 1104 01:13:36,640 --> 01:13:41,160 relative to the stationary distribution for their values. 1105 01:13:41,160 --> 01:13:44,000 And then the issue becomes, how do we 1106 01:13:44,000 --> 01:13:45,390 choose amongst different models? 1107 01:13:45,390 --> 01:13:48,480 Now, last time we talked about linear regression models, 1108 01:13:48,480 --> 01:13:50,500 how you'd specify a given model, here, we're 1109 01:13:50,500 --> 01:13:53,050 talking about autoregressive, moving average, 1110 01:13:53,050 --> 01:13:55,000 and even integrated moving average processes 1111 01:13:55,000 --> 01:13:59,320 and how do we specify those, well, with the method 1112 01:13:59,320 --> 01:14:06,470 of maximum likelihood, there are procedures 1113 01:14:06,470 --> 01:14:12,440 which-- there are measures of how effectively a fitted model 1114 01:14:12,440 --> 01:14:16,390 is, given by an information criterion 1115 01:14:16,390 --> 01:14:21,250 that you would want to minimize for a given fitted model. 1116 01:14:21,250 --> 01:14:24,719 So we can consider different sets of models, 1117 01:14:24,719 --> 01:14:26,510 different numbers of explanatory variables, 1118 01:14:26,510 --> 01:14:29,740 different orders of autoregressive parameters, 1119 01:14:29,740 --> 01:14:33,100 moving average parameters, and compute, say, 1120 01:14:33,100 --> 01:14:37,940 the Akaike information criterion or the Bayes information 1121 01:14:37,940 --> 01:14:39,990 criterion or the Hannan-Quinn criterion 1122 01:14:39,990 --> 01:14:44,720 as different ways of judging how good different models are. 
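As a sketch of that model-selection workflow (this assumes the statsmodels package is available, and the simulated series, candidate orders, and variable names are illustrative): fit a few ARIMA orders by maximum likelihood and compare their information criteria, preferring smaller values.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA   # assumes statsmodels is installed

# Simulate an ARIMA(1,1,0): first differences follow an AR(1) with phi = 0.6.
rng = np.random.default_rng(5)
n = 2000
d = np.zeros(n)
eta = rng.normal(size=n)
for t in range(1, n):
    d[t] = 0.6 * d[t - 1] + eta[t]
y = np.cumsum(d)                                 # integrate once: a stochastic trend

for order in [(1, 1, 0), (2, 1, 0), (1, 1, 1), (0, 1, 1)]:
    res = ARIMA(y, order=order).fit()
    print(order, "AIC:", round(res.aic, 1), "BIC:", round(res.bic, 1), "HQIC:", round(res.hqic, 1))
# The true order (1,1,0) should come out at or near the smallest value of each criterion.
```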
1123 01:14:44,720 --> 01:14:47,960 And let me just finish today by pointing out 1124 01:14:47,960 --> 01:14:52,620 that what these information criteria are 1125 01:14:52,620 --> 01:14:58,560 is basically a function of the log likelihood function, which 1126 01:14:58,560 --> 01:15:00,719 is something we're trying to maximize 1127 01:15:00,719 --> 01:15:02,135 with maximum likelihood estimates. 1128 01:15:04,870 --> 01:15:08,700 And then adding some penalty for how many parameters 1129 01:15:08,700 --> 01:15:10,742 we're estimating. 1130 01:15:10,742 --> 01:15:12,950 And so what I'd like you to think about for next time 1131 01:15:12,950 --> 01:15:18,600 is what kind of a penalty is appropriate for adding 1132 01:15:18,600 --> 01:15:20,300 an extra parameter. 1133 01:15:20,300 --> 01:15:23,640 Like, what evidence is required to incorporate 1134 01:15:23,640 --> 01:15:28,020 extra parameters, extra variables, in the model. 1135 01:15:28,020 --> 01:15:31,180 Would it be t statistics that exceeds some threshold 1136 01:15:31,180 --> 01:15:32,760 or some other criteria. 1137 01:15:32,760 --> 01:15:35,940 Turns out that these are all related to those issues. 1138 01:15:35,940 --> 01:15:39,500 And it's very interesting how those play out. 1139 01:15:39,500 --> 01:15:45,180 And I'll say that for those of you who have actually 1140 01:15:45,180 --> 01:15:48,490 seen these before, the Bayes information criterion 1141 01:15:48,490 --> 01:15:50,400 corresponds to an assumption that there 1142 01:15:50,400 --> 01:15:54,180 is some finite number of variables in the model. 1143 01:15:54,180 --> 01:15:57,010 And you know what those are. 1144 01:15:57,010 --> 01:16:00,060 The Hannan-Quinn criterion says maybe there's 1145 01:16:00,060 --> 01:16:03,760 an infinite number of variables in the model, 1146 01:16:03,760 --> 01:16:08,810 but you want to be able to identify those. 1147 01:16:08,810 --> 01:16:12,230 And so anyway, it's a very challenging problem 1148 01:16:12,230 --> 01:16:13,390 with model selection. 1149 01:16:13,390 --> 01:16:16,900 And these criteria can be used to specify those. 1150 01:16:16,900 --> 01:16:19,050 So we'll go through that next time.
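For reference, the standard forms of those criteria can be written down directly (a sketch, with an illustrative log-likelihood value and sample size): each one is minus twice the maximized log likelihood plus a penalty per estimated parameter, and the penalty translates directly into how much the log likelihood must improve before an extra parameter is worth adding.

```python
import numpy as np

def information_criteria(loglik, k, n):
    """AIC, BIC, and Hannan-Quinn for a model with maximized log-likelihood `loglik`,
    k estimated parameters, and n observations; smaller values are preferred."""
    return {"AIC": -2 * loglik + 2 * k,
            "BIC": -2 * loglik + k * np.log(n),
            "HQ": -2 * loglik + 2 * k * np.log(np.log(n))}

n = 1000
print(information_criteria(loglik=-1500.0, k=3, n=n))
# Adding one parameter pays off only if the log likelihood rises by more than:
print("AIC:", 1.0, " BIC:", round(np.log(n) / 2, 2), " HQ:", round(np.log(np.log(n)), 2))
```

Comparing those thresholds is one way to think about the question posed above: BIC demands the most evidence per added parameter, Hannan-Quinn sits in between, and AIC is the most permissive.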