The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: OK. Well, last time I was lecturing, we were talking about regression analysis, and we finished up talking about estimation methods for fitting regression models. I want to recap the method of maximum likelihood, because this is really the primary estimation method in statistical modeling that you start with. So let me just review where we were.

We have a normal linear regression model. A dependent variable y is explained by a linear combination of independent variables given by a regression parameter beta. And we assume that the errors for all the cases are independent, identically distributed normal random variables. Because of that relationship, the dependent variable vector y, which is an n-vector for n cases, is a multivariate normal random variable.

Now, the likelihood function is equal to the density function for the data. And there's some ambiguity about how one manipulates the likelihood function: the likelihood function becomes defined once we've observed a sample of data. So in this expression for the likelihood function as a function of beta and sigma squared, we're evaluating the probability density function for the data conditional on the unknown parameters. If this were simply a univariate normal distribution with some unknown mean and variance, then what we would have is just a bell curve in mu centered around a single observation y, if you look at how the likelihood function varies with the underlying mean of the normal distribution.

So the challenge in maximum likelihood estimation is really calculating and computing the likelihood function.
And with normal linear regression models, it's very easy. Now, the maximum likelihood estimates are those values that maximize this function. And the question is, why are those good estimates of the underlying parameters? Well, those estimates are the parameter values for which the observed data are most likely. So we're able to score the unknown parameters by how likely it is that they could have generated these data values.

So let's look at the likelihood function for this normal linear regression model. The first two lines here highlight that our response variable values are independent--conditionally independent given the unknown parameters. And so the density of the full vector of y's is simply the product of the density functions for its components. And because this is a normal linear regression model, each of the y_i's is normally distributed, so what's in the product is simply the density function of a normal random variable whose mean, for case i, is the linear combination of the independent variables given by the regression parameters. That expression can be written in matrix form this way, and what we have is that the likelihood function ends up being a function of Q(beta), which was our least squares criterion. So least squares estimation is equivalent to maximum likelihood estimation for the regression parameters if we have a normal linear regression model. And there's this extra term, minus n.

Well, actually, if we're going to maximize the likelihood function, we can equivalently maximize the log of the likelihood function, because that's just a monotone function of the likelihood. And it's easier to maximize the log-likelihood, which is expressed here. So we're able to maximize over beta by minimizing Q(beta), and then we can maximize over sigma squared given our estimate for beta.
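As an illustrative aside, here is a minimal R sketch of that equivalence on simulated data (all numbers and names are made up): a direct numerical maximization of the normal log-likelihood gives the same regression estimates as lm(), and the likelihood-based variance estimate is Q(beta hat)/n.

    set.seed(1)
    n <- 200
    x <- rnorm(n)
    y <- 1 + 2 * x + rnorm(n, sd = 0.5)       # true beta = (1, 2), sigma = 0.5

    fit <- lm(y ~ x)                          # least squares

    # Negative log-likelihood of the normal linear regression model,
    # with sigma parameterized on the log scale so it stays positive
    negloglik <- function(par) {
      mu <- par[1] + par[2] * x
      -sum(dnorm(y, mean = mu, sd = exp(par[3]), log = TRUE))
    }
    mle <- optim(c(0, 0, 0), negloglik)$par

    coef(fit); mle[1:2]                       # beta estimates agree (up to optimizer tolerance)
    sum(resid(fit)^2) / n; exp(mle[3])^2      # MLE of sigma^2 is Q(beta.hat)/n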
Maximizing over sigma squared is achieved by taking the derivative of the log-likelihood with respect to sigma squared. So we have a first-order condition that finds the maximum, because things are appropriately convex. Taking that derivative and setting it equal to zero, we get this expression. This is just the derivative of the log-likelihood with respect to sigma squared--and you'll notice I'm taking the derivative with respect to sigma squared as a parameter, not sigma. And that gives us that the maximum likelihood estimate of the error variance is Q(beta hat) over n. So this is the sum of the squared residuals divided by n.

Now, I emphasize here that that's biased. Who can tell me why that's biased, or why it ought to be biased?

AUDIENCE: [INAUDIBLE]

PROFESSOR: OK. Well, the divisor should be n minus 1 if we're actually estimating one parameter. So if the independent variable were, say, a constant, 1--so we're just estimating a sample from a normal with mean beta_1 corresponding to the units vector as the X--then we would need a one-degree-of-freedom correction to the residuals to get an unbiased estimator. But what if we have p parameters? Well, let me ask you this: what if we had n parameters in our regression model? What would happen if we had a full-rank independent variable matrix with n columns and n independent observations?

AUDIENCE: [INAUDIBLE]

PROFESSOR: Yes, you'd have an exact fit to the data, so this estimate would be 0. And clearly, if the data do arise from a normal linear regression model, 0 is not unbiased; you need to have some correction. It turns out you need to divide by n minus the rank of the X matrix--the residual degrees of freedom of the model--to get an unbiased estimate. So this is an important issue. It highlights how the more parameters you add to the model, the more precise your fitted values are. In a sense, there are dangers of curve fitting which you want to avoid.
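A small simulation sketch of that bias (illustrative only, simulated data): with p regression parameters, the average of Q(beta hat)/n comes out near sigma^2 (n - p)/n, while dividing by n - p is approximately unbiased.

    set.seed(2)
    n <- 30; p <- 5; sigma2 <- 4
    X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # fixed design with p columns
    beta <- rep(1, p)

    one.draw <- function() {
      y <- drop(X %*% beta) + rnorm(n, sd = sqrt(sigma2))
      rss <- sum(resid(lm(y ~ X - 1))^2)
      c(mle = rss / n, unbiased = rss / (n - p))
    }
    est <- replicate(5000, one.draw())
    rowMeans(est)            # the /n estimate averages near sigma2*(n - p)/n; the /(n - p) one near sigma2
    sigma2 * (n - p) / n     # about 3.33 here, versus the true sigma2 of 4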
But the maximum likelihood estimates, in fact, are biased. You just have to be aware of that. And when you're using different software and fitting different models, you need to know whether corrections for that bias are being made or not.

So this solves the estimation problem for normal linear regression models. And when we have normal linear regression models, the theorem we went through last time is very important. Let me just go back and highlight that for you.

This theorem right here. This is really a very important theorem indicating the distribution of the least squares--now the maximum likelihood--estimates of our regression model. They are normally distributed. And the residual sum of squares, divided by sigma squared, has a chi-squared distribution with degrees of freedom given by n minus p. And we can look at how much signal to noise there is in estimating our regression parameters by calculating a t-statistic: take an estimate, subtract its expected value, its mean, and divide through by an estimate of its variability in standard deviation units. That will have a t distribution. So that's a critical way to assess the relevance of different explanatory variables in our model.

And this approach will apply with maximum likelihood estimation in all kinds of models apart from normal linear regression models. It turns out that maximum likelihood estimates generally are asymptotically normally distributed, and so these properties will apply for those models as well.

So let's finish up these notes on estimation by talking about generalized M-estimation. What we want to consider is estimating unknown parameters by minimizing some function Q(beta), which is a sum of evaluations of another function h, evaluated for each of the individual cases. Choosing h to take on different functional forms will define different kinds of estimators.
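As an illustrative sketch of that recipe (simulated data, arbitrary coefficients), Q(beta) = sum_i h(y_i - x_i' beta) can be written directly in R with a pluggable h; swapping h between the square and the absolute value gives least squares and least absolute deviations, respectively.

    set.seed(3)
    n <- 200
    x <- rnorm(n)
    y <- 1 + 2 * x + rt(n, df = 3)             # heavy-tailed errors, just for illustration

    # Generic M-estimator: minimize Q(beta) = sum_i h(y_i - x_i' beta)
    m.estimate <- function(h) {
      Q <- function(beta) sum(h(y - beta[1] - beta[2] * x))
      optim(c(0, 0), Q)$par
    }

    m.estimate(function(u) u^2)                # h = square -> least squares
    m.estimate(abs)                            # h = |u|    -> least absolute deviations
    coef(lm(y ~ x))                            # matches the h = square solution above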
We've seen that when h is simply the square of the response minus its regression prediction, that leads to least squares and, in fact, to maximum likelihood estimation, as we saw before.

Rather than taking the square of the fitted residual, we could take simply its modulus, its absolute value. And that gives the mean absolute deviation criterion: rather than summing the squared deviations, we sum the absolute deviations. Now, from a mathematical standpoint, if we want to solve for those estimates, how would you go about doing that? What methodology would you use to minimize this function? Well, we try to apply basically the same principles: if this is a convex function, then we just want to take derivatives and set them equal to 0. So what happens when you take the derivative of the modulus of y_i minus x_i beta with respect to beta?

AUDIENCE: [INAUDIBLE]

PROFESSOR: What did you say?

AUDIENCE: Yeah, it's not [INAUDIBLE]. The first derivative is not continuous.

PROFESSOR: OK. Well, this is not a smooth function. But let me just plot x_i beta here, and y_i minus that. Basically, this is going to be a function that has slope 1 when the residual is positive and slope minus 1 when it's negative, and that is true component-wise across the y's. So what we end up wanting to do is find the value of the regression estimate for which the contributions from cases below the fitted line balance the contributions from cases above it. And that solves the problem.

Now, with maximum likelihood estimation, one can plug in for h minus the log of the density of y_i given beta, x, and sigma_i squared. Those terms sum to minus the log of the joint density for all the data, so minimizing Q is the same as maximizing the likelihood. So that works as well.

With robust M-estimators, we can consider another function, chi, which can be defined so that the estimates have good properties.
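One common choice of chi, used here purely as an illustration (it is not named in the lecture), is the Huber function, which is quadratic for small residuals and linear for large ones, so gross outliers get much less weight than under least squares. A minimal sketch on simulated data:

    set.seed(4)
    n <- 200
    x <- rnorm(n)
    y <- 1 + 2 * x + rnorm(n)
    y[1:5] <- y[1:5] + 15                       # a few gross outliers

    # Huber chi: quadratic near zero, linear in the tails (k is the cutoff)
    huber <- function(u, k = 1.345) ifelse(abs(u) <= k, u^2 / 2, k * abs(u) - k^2 / 2)

    Q <- function(beta) sum(huber(y - beta[1] - beta[2] * x))
    optim(c(0, 0), Q)$par                       # robust M-estimate, close to (1, 2)
    coef(lm(y ~ x))                             # least squares, pulled by the outliers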
And there's a whole theory of robust estimation--it's very rich--which talks about how best to specify this chi function. Now, one of the problems with least squares estimation is that the squares of very large residuals are very, very large in magnitude. So there is perhaps an undue influence of very large residuals under least squares estimation and maximum likelihood estimation. Robust estimators allow you to control that by defining the function differently.

Finally, there are quantile estimators, which extend the mean absolute deviation criterion. If we take the h function to be one multiple of the residual when the residual is positive, and a different, complementary multiple when the residual is less than 0, then by varying tau you end up getting quantile estimators, where what you're doing is estimating the tau quantile. So this general class of M-estimators encompasses most estimators that we will encounter in fitting models.

So that finishes the technical, or mathematical, discussion of regression analysis. Let me highlight for you a case study that's been added to the course website. This first one is on linear regression models for asset pricing, and I want you to read through it just to see how regression applies in fitting various simple linear regression models.

The case study begins by introducing the capital asset pricing model, which basically says that if you look at the returns on any stock in an efficient market, those should depend on the return of the overall market, scaled by how risky the stock is. And so if one looks at the return on the stock on the right scale, you should have a simple linear regression model.
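Here is a stylized sketch of that regression in R, using simulated excess returns rather than the actual GE and S&P 500 data from the case study (the names and numbers are illustrative only): excess stock return regressed on excess market return, with the slope playing the role of the stock's beta.

    set.seed(5)
    n.days <- 750
    mkt.excess   <- rnorm(n.days, mean = 0.0002, sd = 0.01)                   # market excess return
    stock.excess <- 0.0001 + 1.2 * mkt.excess + rnorm(n.days, sd = 0.008)     # stock with "beta" 1.2

    capm.fit <- lm(stock.excess ~ mkt.excess)
    summary(capm.fit)      # slope estimates the stock's beta; its t value measures signal to noise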
In the case study itself, we look at a time series for GE stock and the S&P 500. And the case study walks through how you can actually collect this data on the web using R; the case notes provide those details. There's also the three-month Treasury rate, which is collected, because if you're thinking about the return on the stock versus the return on the index, what's really of interest is the excess return over a risk-free rate. And under the efficient markets model, the excess return of a stock is related to the excess return of the market by a linear regression model. So we can fit this model. Here's a plot of the daily excess returns for GE stock versus the market. That looks like a nice point cloud for which a linear model might fit well, and it does.

Well, there are regression diagnostics, which are detailed in the problem set, where we're looking at how influential individual observations are--what their impact is on the regression parameters. This display highlights, for a very simple linear regression model, which data points are influential. I've highlighted in red those values which are influential. Now, if you look at the definition of leverage in a simple linear model, it's very simple: the observations that are very far from the mean have large leverage. And you can confirm that with your answers to the problem set. This x indicates a significantly influential point in terms of the regression parameters, as measured by Cook's distance. That definition is also given in the case notes.

AUDIENCE: [INAUDIBLE]

PROFESSOR: By computing the individual leverages with a function that's given here, and by selecting out those that exceed a given magnitude.
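A minimal sketch of those two diagnostics in R, again on simulated data: hatvalues() returns the leverages and cooks.distance() returns Cook's distances for a fitted lm object; the cutoffs used below are common rules of thumb rather than values taken from the case study.

    set.seed(6)
    x <- rnorm(200, sd = 0.01)
    y <- 1.2 * x + rnorm(200, sd = 0.008)
    fit <- lm(y ~ x)

    lev <- hatvalues(fit)            # leverage: large when x is far from its mean
    cd  <- cooks.distance(fit)       # influence of each case on the fitted coefficients

    p <- length(coef(fit)); n <- length(y)
    which(lev > 2 * p / n)           # flag high-leverage points (rule of thumb 2p/n)
    which(cd > 4 / n)                # flag influential points (rule of thumb 4/n)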
Now, that is a very, very simple model, with stocks depending on one unknown risk factor given by the market. In modeling equity returns, there are many different factors that can have an impact on returns. So what I've done in the case study is to look at adding another factor, which is just the return on crude oil.

Let me highlight something for you here. With GE stock, what would you expect the impact of, say, a high return on crude oil to be on the return of GE stock? Would you expect it to be positively related or negatively related?

OK. Well, GE is a stock that's invested broadly in many different industries, and it really reflects the overall market to some extent. Many years ago--10, 15 years ago--GE represented maybe 3% of the GNP of the US market, so it was really highly related to how well the market does. Now, crude oil is a commodity, and oil is used to drive cars and to fuel energy production. So if you have an increase in oil prices, then the cost of essentially doing business goes up; it is associated with an inflation factor--prices are rising. And you can see here that if we add in a factor for the return on crude oil, its regression estimate is negative 0.03, with a t value of minus 3.561. So in fact the market, in a sense, over this period and for this analysis, was not efficient in explaining the return on GE; crude oil is another independent factor that helps explain returns. That's useful to know. And if you are clever about defining, identifying, and evaluating different factors, you can build factor asset pricing models that are very, very useful for investing and trading.

Now, as a comparison, the case study also applies the same analysis to Exxon Mobil. Exxon Mobil is an oil company. So let me highlight this: we fit the same two-factor model, and here the regression parameter corresponding to the crude oil factor is plus 0.13, with a t value of 16.
So crude oil definitely has an impact on the return of Exxon Mobil, because Exxon Mobil goes up and down with oil prices.

The case study closes with a scatter plot of the independent variables, highlighting where the influential values are. Just as in the simple linear regression, where the observations far from the mean of the data were influential, in a multivariate setting--here it's bivariate--the influential observations are those that are very far from the centroid. One of the problems in the problem set actually goes through this: you can see where these high-leverage values are, and how leverage is associated with the Mahalanobis distance of cases from the centroid of the independent variables. So if you're a visual type of mathematician, as opposed to an algebraic type of mathematician, I think these kinds of graphs are very helpful in understanding what is really going on. And the degree of influence is tied to the fact that we're taking least squares estimates, so we have the quadratic form associated with the overall process.

There's another case study that I'll be happy to discuss after class or during office hours; I don't think we have time today during the lecture. It concerns exchange rate regimes. This second case study looks at the Chinese yuan, which was basically pegged to the dollar for many years. And then, I guess through political influence from other countries, they started to let the yuan vary from the dollar, but perhaps pegged it to some basket of currencies. So how would you determine what that basket of currencies is? Well, there are regression methods that have been developed by economists that help you do that, and the case study goes through that analysis.
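As background (this is a standard approach in the economics literature, often called a Frankel-Wei regression, not a detail taken from the lecture), one regresses daily log-changes of the target currency, all measured against a neutral numeraire, on the log-changes of the candidate basket currencies; the coefficients estimate the implicit basket weights. A stylized sketch with simulated series:

    set.seed(7)
    n.days <- 500
    # Simulated daily log-returns of candidate currencies versus a numeraire
    usd <- rnorm(n.days, sd = 0.004)
    eur <- rnorm(n.days, sd = 0.005)
    jpy <- rnorm(n.days, sd = 0.005)
    # A hypothetical target currency that is 90% dollar, 10% euro, plus noise
    cny <- 0.9 * usd + 0.1 * eur + rnorm(n.days, sd = 0.0005)

    basket.fit <- lm(cny ~ usd + eur + jpy)
    summary(basket.fit)      # coefficients estimate the implicit basket weights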
So check that out to see how you can get immediate access to currency data, fit these regression models, and look at and evaluate the different results.

So let's turn now to the main topic, which is time series analysis. In the rest of today's lecture, I want to talk about univariate time series analysis. We're thinking of a random variable that is observed over time, as a discrete time process. I'll introduce the Wold representation theorem, the definitions of stationarity, and the relationship between them. Then we'll look at the classic autoregressive moving average models, and then extend those to non-stationary processes with integrated autoregressive moving average models. And finally, we'll talk about estimating stationary models and how we test for stationarity.

So let's begin from first principles. We have a discrete time stochastic process X, which consists of random variables indexed by time. The stochastic behavior of this sequence is determined by specifying the density or probability mass functions for all finite collections of time indexes. So if we could specify all finite-dimensional distributions of this process, we would specify the probability model for the stochastic process.

Now, this stochastic process is strictly stationary if the density function for any collection of times t_1 through t_m is equal to the density function for the same collection translated by tau. So the density function for any finite-dimensional distribution is constant under arbitrary time translations. That's a very strong property, but it's a reasonable property to ask for if you're doing statistical modeling. What do you want to do when you're estimating models? You want to estimate things that are constant. Constants are nice things to estimate, and parameters of models are constant.
So we really want the underlying structure of the distributions to be the same over time.

That was strict stationarity, which requires knowledge of the entire distribution of the stochastic process. We're now going to introduce a weaker definition, which is covariance stationarity. A covariance stationary process has a constant mean, mu; a constant variance, sigma squared; and a covariance between values separated by a lag tau that depends only on tau, given by a function gamma(tau). Gamma isn't a constant function, but for all t, the covariance of X_t and X_(t+tau) is this same gamma(tau).

We can also introduce the autocorrelation function of the stochastic process, rho(tau). The correlation of two random variables is their covariance divided by the square root of the product of their variances. Choongbum introduced that in one of his lectures, when we were talking about correlation. But essentially, if you standardize the random variables to have mean 0 and variance 1--subtract off the means and divide through by the standard deviations--then the correlation coefficient is the covariance between those standardized random variables. This is going to come up again and again in time series analysis.

Now, the Wold representation theorem is a very, very powerful theorem about covariance stationary processes. It states that if we have a zero-mean covariance stationary time series, then it can be decomposed into two components with a very nice structure. Basically, X_t can be decomposed into V_t plus S_t. V_t is a linearly deterministic process, meaning that past values of V_t perfectly predict what V_t is going to be. This could be a linear trend or some fixed function of past values. It's basically a deterministic process, so there's nothing random in V_t.
It's something that's fixed, without randomness. And S_t is a sum of coefficients psi_i times eta_(t-i), where the eta_t's are linearly unpredictable white noise. So S_t is a weighted average of white noise terms with coefficients given by the psi_i, where psi_0 is 1 and the sum of the squared psi_i's is finite.

And the white noise eta_t--what is white noise? It has expectation zero. It has a constant variance, sigma squared. And the covariance between white noise terms at different times t and s is 0. So the eta_t's are uncorrelated with each other, and of course they are uncorrelated with the deterministic process.

This is really a very, very powerful result. If you are modeling a process and it is covariance stationary, then there exists a representation of it in this form. It's a very compelling structure, and we'll see how it applies in different circumstances.

Now, before getting into the definition of autoregressive moving average models, I just want to give you an intuitive understanding of what's going on with the Wold decomposition. This, I think, will help motivate why the Wold decomposition should exist from a mathematical standpoint.

So consider some univariate stochastic process, some time series X_t, that we want to model, and that we believe is covariance stationary. We want to specify, essentially, the Wold decomposition of that. Well, what we could do is initialize a parameter p, the number of past observations to use beyond the linearly deterministic term, and then estimate the linear projection of X_t on its last p lagged values. And I want to consider estimating that relationship using a sample of size n with some ending point t_0 less than or equal to T. So we can consider y values, like a response variable, being given by the successive values of our time series.
And so our response variables y_j can be taken to be X_(t_0 - n + j), and we define a response vector y and a matrix Z accordingly. So we have values of our stochastic process in y, and our Z matrix, which is essentially the matrix of independent variables, contains the lagged values of the process.

So let's apply ordinary least squares to specify the projection. The projection matrix should be familiar by now. That gives us a prediction y hat depending on p lags, and we can compute the projection residuals from that fit.

Well, we can then apply time series methods--which we'll be introducing in a few minutes--to analyze these residuals and specify a moving average model for them. We then have estimates of the underlying coefficients psi and estimates of the innovations eta_t. And then we can evaluate whether this is a good model or not. What does it mean to be an appropriate model? Well, the residuals should be orthogonal to lags longer than p: we shouldn't have any dependence of our residuals on lags of the stochastic process that weren't included in the model. And the estimated eta_t's should be consistent with white noise.

Those issues can be evaluated, and if there's evidence otherwise, then we can change the specification of the model: we can add additional lags, and we can add additional deterministic variables if we can identify what those might be. And we proceed with this process. That, essentially, is how the Wold decomposition could be implemented.

And theoretically, as our sample gets large--if we're observing this time series for a long time--the limit of the projections, as the number of included lags p gets large, should be essentially the projection of the data on its full history. And that, in fact, is the projection corresponding to, and defining, the coefficients psi_i.
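A small sketch of that lag-and-project construction in R (simulated data; embed() is used here to build the matrix of lagged values): regress X_t on its last p lags, then check whether the residuals look like white noise by examining their sample autocorrelations.

    set.seed(8)
    x <- arima.sim(model = list(ar = c(0.6, 0.2)), n = 1000)   # a covariance stationary series

    p <- 5
    Z <- embed(x, p + 1)            # column 1 is X_t; columns 2..(p+1) are X_(t-1), ..., X_(t-p)
    y <- Z[, 1]
    proj <- lm(y ~ Z[, -1])         # linear projection of X_t on its last p lags

    eta.hat <- resid(proj)          # estimated innovations
    acf(eta.hat)                    # should look like white noise if p lags are enough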
And so in the limit, that projection will converge, in the sense that the coefficients of the projection correspond to the psi_i.

Now, letting p go to infinity may be required: it means there is long-term dependence in the process--the dependence doesn't stop at a given lag, it persists over time. So we may need to let p go to infinity. And what happens when p goes to infinity? Well, if you let p go to infinity too quickly, you run out of degrees of freedom to estimate your models. So from an implementation standpoint, you need to let p/n go to 0, so that you have essentially more data than parameters that you're estimating. That is required. And in time series modeling, what we look for are models where finite values of p suffice, so we're only estimating a finite number of parameters--or, if we have a moving average model whose coefficients are infinite in number, perhaps those can be defined by a small number of parameters. We'll be looking for that kind of feature in different models.

Let's turn to the lag operator. The lag operator is a fundamental tool in time series models. We consider the operator L that shifts a time series back by one time increment. Applying this operator recursively: applied zero times, there's no lag; once, one lag; twice, two lags; and so on. In thinking about these, what we're dealing with is like a transformation on an infinite-dimensional space--something like an identity matrix shifted over by one column, or two columns. So inverses of these operators are well defined in terms of what we get from them.
So we can express the Wold representation in terms of these lag operators, writing our stochastic process X_t as V_t plus psi(L) applied to eta_t, where psi(L) is a function of the lag operator--a potentially infinite-order polynomial in the lags. This notation is something you need to get very familiar with if you're going to be comfortable with the models introduced as ARMA and ARIMA models. Any questions about that?

Now, related to this--because it will come up somewhat later--let me introduce the impulse response function of a covariance stationary process. If we have a stochastic process X_t given by this Wold representation, you can ask yourself what happens with the innovation at time t, eta_t: how does it affect the process over time?

So pretend that you are the chairman of the Federal Reserve Bank, you're interested in GNP--basically economic growth--and you're considering changing interest rates to help the economy. Well, you'd like to know what the impact of a change in that factor is: how it's going to affect the variable of interest, perhaps GNP. Now, in this case we're thinking of just a simple covariance stationary stochastic process; it's basically a weighted sum, a moving average, of innovations eta_t. But any covariance stationary process can be represented in this form. And the impulse response function describes the impact of eta_t over time. Basically, eta_t affects the process at time t; because of the moving average structure, it also affects it at t plus 1, at t plus 2, and so on. And the impulse response is the derivative of the value of the process with respect to the j-th previous innovation, which is given by psi_j. So the different innovations have an impact on the current value given by this impulse response function.
So looking backward, that definition is pretty well defined. But you can also think about how an innovation affects the process going forward. The long-run cumulative response is essentially the ultimate impact of that innovation on the process. Eventually a single innovation stops changing the value of the process, but what is the value toward which the process moves because of that one innovation? The long-run cumulative response is given by the sum of these individual responses, the sum of the psi_i's--which is the polynomial psi with the lag operator replaced by 1, that is, psi(1). We'll see this again when we talk about vector autoregressive processes with multivariate time series.

Now, the Wold representation, which is a possibly infinite-order moving average, can have an autoregressive representation. Suppose there is another polynomial in the lags, with coefficients psi_i star, which we'll call psi inverse of L, satisfying the property that multiplying it by psi(L) gives the identity, lag 0. Then this psi inverse, if it exists, is the inverse of psi(L). So if we start with psi(L) and it's invertible, there exists a psi inverse of L with coefficients psi_i star. And one can take our original expression for the stochastic process, a moving average of the eta's, and re-express it as a weighted sum of lagged X's plus the current innovation. So we've essentially inverted the process and shown that the stochastic process can be expressed in an infinite-order autoregressive representation. This corresponds to the intuitive construction of the Wold representation given earlier: the regression coefficients in the projection several slides back correspond to this inverse operator.
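In R, the psi weights of the Wold (infinite-order moving average) form of an ARMA model can be computed with ARMAtoMA(). A small sketch of the impulse response and the long-run cumulative response for one illustrative set of coefficients (these particular values are arbitrary):

    # Impulse response (psi weights) of an ARMA(1,1) with AR coefficient 0.7, MA coefficient 0.2
    psi <- ARMAtoMA(ar = 0.7, ma = 0.2, lag.max = 20)
    psi[1:5]                       # effect of an innovation 1, 2, ... periods later

    # Long-run cumulative response: psi_0 = 1 plus the sum of the remaining psi weights
    1 + sum(ARMAtoMA(ar = 0.7, ma = 0.2, lag.max = 200))
    (1 + 0.2) / (1 - 0.7)          # closed form for this ARMA(1,1), equal to 4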
So let's turn to some specific time series models that are widely used. The class of autoregressive moving average processes has the following mathematical definition. We define X_t to be a linear combination of lags of X, going back p lags, with coefficients phi_1 through phi_p, plus a residual term expressed as a q-th order moving average of white noise. In this framework, the eta_t's are white noise--and white noise, to reiterate, has mean 0, constant variance, and zero covariance between distinct terms.

In this representation, I've simplified things a little by subtracting off the mean from all of the X's, which just makes the formulas a bit simpler. Now, with lag operators, we can write this ARMA model as phi(L) applied to X_t equal to theta(L) applied to eta_t, where phi(L) is the p-th order lag polynomial with coefficients 1, minus phi_1, up to minus phi_p, and theta(L) is the q-th order lag polynomial with coefficients 1, theta_1, theta_2, up to theta_q.

This is a compact representation of the ARMA time series model: a linear combination of lags of the stochastic process, up to order p, is set equal to a weighted average of current and lagged eta_t's. If we multiply by the inverse of phi(L), if that exists, then we get the representation here, which is simply the Wold decomposition. So ARMA models have a Wold decomposition whenever phi(L) is invertible.

And we'll explore these by looking at simpler cases of the ARMA models, focusing on autoregressive models first and moving average processes second, so that you get a better feel for how these things are manipulated and interpreted.

So let's move on to the p-th order autoregressive process. We're going to consider ARMA models that have only autoregressive terms. So we have phi(L) applied to X_t minus mu equal to eta_t, which is white noise. A linear combination of the series is white noise, and X_t then follows a linear regression model on explanatory variables which are lags of the process X.
775 00:51:41,330 --> 00:51:46,760 And this could be expressed as X_t equal to c plus the sum 776 00:51:46,760 --> 00:51:50,950 from 1 to p of phi_j X_(t-j), which is a linear regression 777 00:51:50,950 --> 00:51:53,700 model with regression parameters phi_j. 778 00:51:53,700 --> 00:52:01,390 And c, the constant term, is equal to mu times phi of 1. 779 00:52:01,390 --> 00:52:10,920 Now, if you basically take expectations of the process, 780 00:52:10,920 --> 00:52:14,360 you basically have coefficients of mu coming in 781 00:52:14,360 --> 00:52:15,730 from all the terms. 782 00:52:15,730 --> 00:52:22,220 And phi of 1 times mu is the regression coefficient there. 783 00:52:25,160 --> 00:52:27,320 So with this autoregressive model, 784 00:52:27,320 --> 00:52:31,160 we now want to go over what are the stationarity conditions. 785 00:52:31,160 --> 00:52:35,020 Certainly, this autoregressive model 786 00:52:35,020 --> 00:52:40,790 is one where, well, a simple random walk 787 00:52:40,790 --> 00:52:45,520 follows an autoregressive model but is not stationary. 788 00:52:45,520 --> 00:52:47,650 We'll highlight that in a minute as well. 789 00:52:47,650 --> 00:52:50,410 But if you think it, that's true. 790 00:52:50,410 --> 00:52:55,400 And so stationarity is something to be understood and evaluated. 791 00:53:03,160 --> 00:53:08,680 This polynomial function phi, where 792 00:53:08,680 --> 00:53:11,630 if we replace the lag operator L by z, 793 00:53:11,630 --> 00:53:20,970 a complex variable, the equation phi of z equal to 0 794 00:53:20,970 --> 00:53:24,330 is the characteristic equation associated 795 00:53:24,330 --> 00:53:27,020 with this autoregressive model. 796 00:53:27,020 --> 00:53:33,190 And it turns out that we'll be interested in the roots 797 00:53:33,190 --> 00:53:36,610 of this characteristic equation. 798 00:53:36,610 --> 00:53:40,705 Now, if we consider writing phi of L 799 00:53:40,705 --> 00:53:44,270 as a function of the roots of the equation, 800 00:53:44,270 --> 00:53:49,130 we get this expression where you'll 801 00:53:49,130 --> 00:53:51,340 notice if you multiply all those terms out, 802 00:53:51,340 --> 00:53:55,730 the 1's all multiply out together, and you get 1. 803 00:53:55,730 --> 00:54:00,100 And with the lag operator L to the p-th power, 804 00:54:00,100 --> 00:54:03,210 that would be the product of 1 over lambda_1 805 00:54:03,210 --> 00:54:06,650 times 1 over lambda_2, or actually negative 1 806 00:54:06,650 --> 00:54:09,680 over lambda_1 times negative 1 over lambda_2, 807 00:54:09,680 --> 00:54:13,640 and so forth-- negative 1 over lambda_p. 808 00:54:13,640 --> 00:54:15,820 Basically, if there are p roots to this equation, 809 00:54:15,820 --> 00:54:19,420 this is how it would be written out. 810 00:54:19,420 --> 00:54:27,070 And the process X_t is covariance 811 00:54:27,070 --> 00:54:28,710 stationary if and only if all the roots 812 00:54:28,710 --> 00:54:33,630 of this characteristic equation lie outside the unit circle. 813 00:54:33,630 --> 00:54:35,880 So what does that mean? 814 00:54:35,880 --> 00:54:41,240 That means that the norm modulus of the complex z 815 00:54:41,240 --> 00:54:42,810 is greater than 1. 816 00:54:42,810 --> 00:54:45,160 So they're outside the unit circle 817 00:54:45,160 --> 00:54:47,150 where it's less than or equal to 1. 818 00:54:47,150 --> 00:54:56,810 And the roots, if they are outside the unit circle, 819 00:54:56,810 --> 00:55:01,080 then the modulus of the lambda_j's is greater than 1. 
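A small sketch of that root condition (the function names and coefficient choices are illustrative): write down the characteristic polynomial phi(z) = 1 - phi_1 z - ... - phi_p z^p, find its roots numerically, and declare covariance stationarity only if every root has modulus greater than 1.

```python
import numpy as np

def ar_characteristic_roots(phi):
    """Roots of phi(z) = 1 - phi_1 z - ... - phi_p z^p for an AR(p) model."""
    # numpy.roots expects coefficients ordered from the highest power down to the constant.
    coeffs = np.r_[-np.asarray(phi, float)[::-1], 1.0]
    return np.roots(coeffs)

def is_covariance_stationary(phi):
    """AR(p) is covariance stationary iff every characteristic root lies outside the unit circle."""
    return bool(np.all(np.abs(ar_characteristic_roots(phi)) > 1.0))

print(is_covariance_stationary([0.5]))        # True:  single root at 1/0.5 = 2
print(is_covariance_stationary([1.0]))        # False: unit root, i.e. the random walk
print(is_covariance_stationary([0.6, 0.3]))   # True
print(is_covariance_stationary([0.6, 0.5]))   # False: phi(1) < 0, so a root lies inside (0, 1)
```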
820 00:55:05,400 --> 00:55:12,160 And if we then consider taking a complex number 821 00:55:12,160 --> 00:55:16,010 lambda, basically the root, and have 822 00:55:16,010 --> 00:55:20,600 an expression for 1 minus 1 over lambda L inverse, 823 00:55:20,600 --> 00:55:25,010 we can get this series expression for that inverse. 824 00:55:25,010 --> 00:55:34,860 And that series will exist and be bounded if the lambda_i are 825 00:55:34,860 --> 00:55:36,430 greater than 1 in magnitude. 826 00:55:39,210 --> 00:55:46,210 So we can actually compute an inverse of phi of L 827 00:55:46,210 --> 00:55:49,610 by taking the inverse of each of the component 828 00:55:49,610 --> 00:55:52,240 products in that polynomial. 829 00:55:52,240 --> 00:55:57,800 So in introductory time series courses, 830 00:55:57,800 --> 00:56:00,544 they talk about stationarity and unit roots, 831 00:56:00,544 --> 00:56:01,960 but they don't really get into it, 832 00:56:01,960 --> 00:56:04,490 because it requires a bit of complex analysis 833 00:56:04,490 --> 00:56:06,970 and finding roots of polynomials. 834 00:56:06,970 --> 00:56:09,620 Anyway, this is very simply 835 00:56:09,620 --> 00:56:12,840 how that framework is applied. 836 00:56:12,840 --> 00:56:17,830 So we have a polynomial equation, 837 00:56:17,830 --> 00:56:20,885 the characteristic equation, whose roots we're looking for. 838 00:56:20,885 --> 00:56:22,510 Those roots have to be outside the unit 839 00:56:22,510 --> 00:56:26,170 circle for stationarity of the process. 840 00:56:26,170 --> 00:56:31,870 Well, these are basically conditions for invertibility 841 00:56:31,870 --> 00:56:35,100 of the autoregressive process. 842 00:56:35,100 --> 00:56:40,440 And that invertibility renders the process an infinite-order 843 00:56:40,440 --> 00:56:42,125 moving average process. 844 00:56:46,210 --> 00:56:50,830 So let's go through these results 845 00:56:50,830 --> 00:56:52,840 for the autoregressive process of order one, 846 00:56:52,840 --> 00:56:56,330 because we always start with the simplest case 847 00:56:56,330 --> 00:56:58,420 to understand things. 848 00:56:58,420 --> 00:57:01,140 The characteristic equation for this model is just 1 849 00:57:01,140 --> 00:57:02,820 minus phi z. 850 00:57:02,820 --> 00:57:03,600 The root is 1/phi. 851 00:57:06,630 --> 00:57:12,382 So if the modulus of that root lambda 852 00:57:12,382 --> 00:57:13,840 is greater than 1, meaning the root 853 00:57:13,840 --> 00:57:16,990 is outside the unit circle, then the modulus of phi is less than 1. 854 00:57:16,990 --> 00:57:21,160 So for covariance stationarity of this autoregressive process, 855 00:57:21,160 --> 00:57:25,877 we need the magnitude of phi to be less than 1. 856 00:57:30,090 --> 00:57:31,950 The expected value of X is mu. 857 00:57:31,950 --> 00:57:36,460 The variance of X is sigma squared X. 858 00:57:36,460 --> 00:57:41,130 This has this form, sigma squared over 1 minus phi squared. 859 00:57:41,130 --> 00:57:44,960 That expression is basically obtained 860 00:57:44,960 --> 00:57:50,110 by looking at the infinite order moving average representation. 861 00:57:50,110 --> 00:57:56,760 But notice that whenever phi is nonzero, 862 00:57:56,760 --> 00:58:03,710 the variance of X is actually 863 00:58:03,710 --> 00:58:07,895 greater than the variance of the innovations. 864 00:58:10,440 --> 00:58:17,280 And the closer the magnitude of phi is to 1, the larger that variance gets.
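Those AR(1) moments are easy to spot-check by simulation; here is a sketch (the simulator, parameter values, and sample sizes are illustrative) comparing the sample variance with sigma squared over 1 minus phi squared for both a positive and a negative phi.

```python
import numpy as np

def simulate_ar1(phi, sigma, n, burn=1000, seed=0):
    """Simulate X_t = phi * X_{t-1} + eta_t with eta_t ~ N(0, sigma^2); drop a burn-in."""
    rng = np.random.default_rng(seed)
    eta = rng.normal(0.0, sigma, n + burn)
    x = np.zeros(n + burn)
    for t in range(1, n + burn):
        x[t] = phi * x[t - 1] + eta[t]
    return x[burn:]

sigma = 1.0
for phi in (0.8, -0.8):
    x = simulate_ar1(phi, sigma, n=200_000)
    print(phi, round(x.var(), 3), "theory:", sigma**2 / (1 - phi**2))
# Either sign of phi gives a variance near 2.78, larger than the innovation variance of 1.
```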
865 00:58:17,280 --> 00:58:23,100 So the innovation variance basically is scaled up a bit 866 00:58:23,100 --> 00:58:25,010 in the autoregressive process. 867 00:58:25,010 --> 00:58:27,710 The covariance at lag 1 is phi times sigma squared 868 00:58:27,710 --> 00:58:31,980 X. You'll be going through this in the problem set. 869 00:58:31,980 --> 00:58:40,160 And the covariance of X at lag j is phi to the j power times sigma squared X. 870 00:58:40,160 --> 00:58:43,640 And these expressions can all be easily evaluated 871 00:58:43,640 --> 00:58:47,490 by simply writing out the definition of these covariances 872 00:58:47,490 --> 00:58:50,000 in terms of the original model and looking 873 00:58:50,000 --> 00:58:54,250 at which terms are independent and cancel out. 874 00:59:04,510 --> 00:59:06,800 Let's just go through these cases. 875 00:59:06,800 --> 00:59:08,730 Let's show it all here. 876 00:59:08,730 --> 00:59:16,630 So we have if phi is between 0 and 1, 877 00:59:16,630 --> 00:59:20,810 then the process experiences exponential mean reversion 878 00:59:20,810 --> 00:59:22,170 to mu. 879 00:59:22,170 --> 00:59:24,760 So an autoregressive process with phi between 0 880 00:59:24,760 --> 00:59:29,490 and 1 corresponds to a mean-reverting process. 881 00:59:29,490 --> 00:59:31,830 This process is actually one that 882 00:59:31,830 --> 00:59:34,310 has been used theoretically for interest rate models 883 00:59:34,310 --> 00:59:36,920 and a lot of theoretical work in finance. 884 00:59:36,920 --> 00:59:40,280 The Vasicek model is actually an example 885 00:59:40,280 --> 00:59:42,300 of the Ornstein-Uhlenbeck process, 886 00:59:42,300 --> 00:59:47,840 which is basically a mean-reverting Brownian motion. 887 00:59:47,840 --> 00:59:53,070 And this model can be applied to any variables 888 00:59:53,070 --> 00:59:59,950 that exhibit, or could be thought of as exhibiting, 889 00:59:59,950 --> 01:00:01,810 mean reversion, 890 01:00:01,810 --> 01:00:07,470 such as interest rate spreads or real exchange rates, 891 01:00:07,470 --> 01:00:11,430 variables where one can expect that things never 892 01:00:11,430 --> 01:00:12,790 get too large or too small. 893 01:00:12,790 --> 01:00:14,440 They come back to some mean. 894 01:00:14,440 --> 01:00:16,570 Now, the challenge is that this usually 895 01:00:16,570 --> 01:00:18,930 may be true over short periods of time. 896 01:00:18,930 --> 01:00:21,100 But over very long periods of time, 897 01:00:21,100 --> 01:00:23,230 the point to which you're reverting changes. 898 01:00:23,230 --> 01:00:26,640 So these models tend to not have broad application 899 01:00:26,640 --> 01:00:27,900 over long time ranges. 900 01:00:27,900 --> 01:00:30,150 You need to adapt. 901 01:00:30,150 --> 01:00:32,220 Anyway, with the AR process, we can also 902 01:00:32,220 --> 01:00:34,020 have negative values of phi, which 903 01:00:34,020 --> 01:00:38,460 results in exponential mean reversion that's oscillating 904 01:00:38,460 --> 01:00:44,190 in time, because the autoregressive coefficient 905 01:00:44,190 --> 01:00:49,180 basically is a negative value. 906 01:00:49,180 --> 01:00:54,510 And for phi equal to 1, the Wold decomposition doesn't exist. 907 01:00:54,510 --> 01:00:57,860 And the process is the simple random walk.
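Here is a sketch of those three regimes (the parameter choices and sample size are illustrative): for phi = 0.8 the sample autocorrelations decay like phi to the j, consistent with the lag-j covariance phi to the j times sigma squared X; for phi = -0.8 they decay with alternating sign; and for phi = 1 the simulated series is a random walk whose autocorrelations stay near 1.

```python
import numpy as np

def sample_autocorr(x, lags):
    """Sample autocorrelations of a series at the requested lags."""
    x = np.asarray(x, float) - np.mean(x)
    n, g0 = len(x), np.dot(x, x) / len(x)
    return [np.dot(x[:n - j], x[j:]) / n / g0 for j in lags]

rng = np.random.default_rng(0)
n = 100_000
for phi in (0.8, -0.8, 1.0):
    x = np.zeros(n)
    eta = rng.normal(size=n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + eta[t]
    print(phi, np.round(sample_autocorr(x, [1, 2, 3]), 3))
# phi =  0.8 -> roughly [ 0.8, 0.64,  0.512]  (exponential mean reversion)
# phi = -0.8 -> roughly [-0.8, 0.64, -0.512]  (oscillating mean reversion)
# phi =  1.0 -> values near 1 at every lag    (random walk, not stationary)
```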
908 01:00:57,860 --> 01:01:00,340 So basically, if phi is equal to 1, 909 01:01:00,340 --> 01:01:04,480 that means that basically just changes in value of the process 910 01:01:04,480 --> 01:01:08,860 are independent and identically distributed white noise. 911 01:01:08,860 --> 01:01:11,910 And that's the random walk process. 912 01:01:11,910 --> 01:01:15,840 And that process, as was covered in earlier lectures, 913 01:01:15,840 --> 01:01:18,780 is non-stationary. 914 01:01:18,780 --> 01:01:22,790 If phi is greater than 1, then you have an explosive process, 915 01:01:22,790 --> 01:01:26,780 because basically the values are scaling up 916 01:01:26,780 --> 01:01:31,000 every time increment. 917 01:01:31,000 --> 01:01:35,290 So those are features of the AR(1) model. 918 01:01:35,290 --> 01:01:42,110 For a general autoregressive process of order p, 919 01:01:42,110 --> 01:01:45,850 there's a method-- well, we can look at the second order 920 01:01:45,850 --> 01:01:49,590 moments of that process, which have a very nice structure, 921 01:01:49,590 --> 01:01:51,840 and then use those to solve for estimates 922 01:01:51,840 --> 01:01:56,630 of the ARMA parameters, or autoregressive parameters. 923 01:01:56,630 --> 01:02:01,820 And those happen to be specified by what are called 924 01:02:01,820 --> 01:02:04,840 the Yule-Walker equations. 925 01:02:04,840 --> 01:02:07,270 So the Yule-Walker equations is a standard topic 926 01:02:07,270 --> 01:02:09,670 in time series analysis. 927 01:02:09,670 --> 01:02:11,480 What is it? 928 01:02:11,480 --> 01:02:13,030 What does it correspond to? 929 01:02:13,030 --> 01:02:16,320 Well, we take our original autoregressive process 930 01:02:16,320 --> 01:02:17,470 of order p. 931 01:02:17,470 --> 01:02:24,400 And we write out the formulas for the covariance 932 01:02:24,400 --> 01:02:26,900 at lag j between two observations. 933 01:02:26,900 --> 01:02:31,790 So what's the covariance between X_t and X_(t-j)? 934 01:02:31,790 --> 01:02:39,820 And that expression is given by this equation. 935 01:02:39,820 --> 01:02:43,980 And so this equation for gamma of j is determined simply 936 01:02:43,980 --> 01:02:48,700 by evaluating the expectations where we're taking 937 01:02:48,700 --> 01:02:53,620 the expectation of X_t in the autoregressive process times 938 01:02:53,620 --> 01:02:56,110 the fix X_(t-j) minus mu. 939 01:02:56,110 --> 01:02:58,540 So just evaluating those terms, you 940 01:02:58,540 --> 01:03:02,880 can validate that this is the equation. 941 01:03:02,880 --> 01:03:08,620 If we look at the equations corresponding to j equals 1-- 942 01:03:08,620 --> 01:03:12,040 so lag 1 up through lag p-- this is 943 01:03:12,040 --> 01:03:16,070 what those equations look like. 944 01:03:16,070 --> 01:03:20,060 Basically, the left-hand side is gamma_1 through gamma_p. 945 01:03:20,060 --> 01:03:23,090 The covariance to lag 1 up to lag p 946 01:03:23,090 --> 01:03:27,590 is equal to basically linear functions 947 01:03:27,590 --> 01:03:29,980 given by the phi of the other covariances. 948 01:03:33,570 --> 01:03:37,410 Who can tell me what the structure is of this matrix? 949 01:03:37,410 --> 01:03:38,590 It's not a diagonal matrix? 950 01:03:38,590 --> 01:03:41,817 What kind of matrix is this? 951 01:03:41,817 --> 01:03:42,900 Math trivia question here. 952 01:03:48,850 --> 01:03:49,782 It has a special name. 953 01:03:52,460 --> 01:03:54,600 Anyone? 954 01:03:54,600 --> 01:03:57,690 It's a Toeplitz matrix. 
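To see that Toeplitz structure, and how the equations get used, here is a sketch (the AR(2) coefficients, sample size, and function name are illustrative): build the p-by-p matrix whose (i, j) entry is the autocovariance at lag |i - j|, solve it against the vector of lag 1 through lag p autocovariances for the phi's, and use the lag-0 equation for sigma squared. The printed matrix has gamma_0 on the main diagonal and gamma_1 just off it, which is the structure being asked about.

```python
import numpy as np

def yule_walker(x, p):
    """Method-of-moments AR(p) fit: solve the Toeplitz system Gamma * phi = gamma,
    then recover sigma^2 from the lag-0 equation gamma_0 = phi' gamma + sigma^2."""
    x = np.asarray(x, float) - np.mean(x)
    n = len(x)
    gamma = np.array([np.dot(x[:n - j], x[j:]) / n for j in range(p + 1)])
    Gamma = np.array([[gamma[abs(i - j)] for j in range(p)] for i in range(p)])  # Toeplitz
    phi_hat = np.linalg.solve(Gamma, gamma[1:])
    sigma2_hat = gamma[0] - phi_hat @ gamma[1:]
    return Gamma, phi_hat, sigma2_hat

# Simulated AR(2) with phi = (0.5, 0.3) and unit innovation variance.
rng = np.random.default_rng(1)
x = np.zeros(100_000)
eta = rng.normal(size=len(x))
for t in range(2, len(x)):
    x[t] = 0.5 * x[t - 1] + 0.3 * x[t - 2] + eta[t]

Gamma, phi_hat, sigma2_hat = yule_walker(x, 2)
print(np.round(Gamma, 2))                           # constant along each diagonal
print(np.round(phi_hat, 3), round(sigma2_hat, 3))   # close to [0.5, 0.3] and 1.0
```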
955 01:03:57,690 --> 01:04:00,840 The off diagonals are all the same value. 956 01:04:00,840 --> 01:04:06,680 And in fact, because of the symmetry of the covariance, 957 01:04:06,680 --> 01:04:09,750 basically the gamma of 1 is equal to gamma of minus 1. 958 01:04:09,750 --> 01:04:12,680 Gamma of minus 2 is equal to gamma plus 2. 959 01:04:12,680 --> 01:04:14,640 Because of the covariant stationarity, 960 01:04:14,640 --> 01:04:16,700 it's actually also symmetric. 961 01:04:16,700 --> 01:04:22,630 So these equations allow us to solve for the phis 962 01:04:22,630 --> 01:04:25,990 so long as we have estimates of these covariances. 963 01:04:25,990 --> 01:04:30,510 So if we have a system of estimates, 964 01:04:30,510 --> 01:04:33,940 we can plug these in in an attempt to solve this. 965 01:04:33,940 --> 01:04:36,770 If they're consistent estimates of the covariances, 966 01:04:36,770 --> 01:04:38,530 then there will be a solution. 967 01:04:38,530 --> 01:04:41,980 And then the 0th equation, which was not 968 01:04:41,980 --> 01:04:43,469 part of the series of equations-- 969 01:04:43,469 --> 01:04:45,510 if you go back and look at the 0th equation, that 970 01:04:45,510 --> 01:04:47,920 allows you to get an estimate for the sigma squared. 971 01:04:47,920 --> 01:04:50,920 So these Yule-Walker equations are the way 972 01:04:50,920 --> 01:04:54,510 in which many ARMA models are specified 973 01:04:54,510 --> 01:05:03,650 in different statistics packages and in terms of what principles 974 01:05:03,650 --> 01:05:04,400 are being applied. 975 01:05:04,400 --> 01:05:09,700 Well, if we're using unbiased estimates of these parameters, 976 01:05:09,700 --> 01:05:12,055 then this is applying what's called 977 01:05:12,055 --> 01:05:16,250 the method of moments principle for statistical estimation. 978 01:05:16,250 --> 01:05:20,600 And with complicated models, where sometimes the likelihood 979 01:05:20,600 --> 01:05:25,900 functions are very hard to specify and compute, 980 01:05:25,900 --> 01:05:29,800 and then to do optimization over those is even harder. 981 01:05:29,800 --> 01:05:32,780 It can turn out that there are relationships 982 01:05:32,780 --> 01:05:35,840 between the moments of the random variables, which 983 01:05:35,840 --> 01:05:38,340 are functions of the unknown parameters. 984 01:05:38,340 --> 01:05:42,590 And you can solve for basically the sample moments equalling 985 01:05:42,590 --> 01:05:45,940 the theoretical moments and you apply the method 986 01:05:45,940 --> 01:05:48,830 of moments estimation method. 987 01:05:48,830 --> 01:05:54,670 Econometrics is rich with many applications of that principle. 988 01:05:57,580 --> 01:06:02,110 The next section goes through the moving average model. 989 01:06:05,240 --> 01:06:12,340 Let me highlight this. 990 01:06:12,340 --> 01:06:16,080 So with an order q moving average, 991 01:06:16,080 --> 01:06:19,560 we basically have a polynomial in the lag operator L, 992 01:06:19,560 --> 01:06:22,390 which is operated upon the eta_t's. 993 01:06:22,390 --> 01:06:25,700 And if you write out the expectations of X_t, 994 01:06:25,700 --> 01:06:27,030 you get mu. 995 01:06:27,030 --> 01:06:28,650 The variance of X_t, which is gamma 0, 996 01:06:28,650 --> 01:06:34,470 is sigma squared times 1 plus the squares of the coefficients 997 01:06:34,470 --> 01:06:36,360 in the polynomial. 
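That variance formula, sigma squared times 1 plus the sum of the squared theta coefficients, can be spot-checked directly; a sketch with an illustrative MA(2) and made-up parameter values:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 1.5, 2.0
theta = np.array([0.4, -0.3])
n = 500_000

eta = rng.normal(0.0, sigma, n + len(theta))
# X_t = mu + eta_t + theta_1 * eta_{t-1} + theta_2 * eta_{t-2}
x = mu + eta[2:] + theta[0] * eta[1:-1] + theta[1] * eta[:-2]

print(round(x.mean(), 3), "vs mu =", mu)
print(round(x.var(), 3), "vs theory:", sigma**2 * (1 + np.sum(theta**2)))   # ~ 5.0
```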
998 01:06:36,360 --> 01:06:39,920 And so this feature, this property here is due 999 01:06:39,920 --> 01:06:44,100 to the fact that we have uncorrelated innovations 1000 01:06:44,100 --> 01:06:47,060 in the eta_t's. 1001 01:06:47,060 --> 01:06:48,260 The eta_t's are white noise. 1002 01:06:48,260 --> 01:06:52,830 So the only thing that comes through in the expectation 1003 01:06:52,830 --> 01:06:56,020 of the square of X_t is the squared terms 1004 01:06:56,020 --> 01:07:01,900 in the etas, which have coefficients 1005 01:07:01,900 --> 01:07:03,860 given by the theta_i squared. 1006 01:07:03,860 --> 01:07:09,170 I'll leave these properties for you to verify; 1007 01:07:09,170 --> 01:07:11,142 they're very straightforward. 1008 01:07:11,142 --> 01:07:14,430 But let's now turn in the final minutes of the lecture 1009 01:07:14,430 --> 01:07:20,170 today to accommodating non-stationary behavior 1010 01:07:20,170 --> 01:07:23,340 in time series. 1011 01:07:23,340 --> 01:07:27,990 The original approaches in time series 1012 01:07:27,990 --> 01:07:32,320 were to focus on estimation methodologies 1013 01:07:32,320 --> 01:07:34,940 for covariance stationary processes. 1014 01:07:34,940 --> 01:07:38,440 So if the series is not covariance stationary, 1015 01:07:38,440 --> 01:07:42,410 then we would want to apply some transformation 1016 01:07:42,410 --> 01:07:48,660 to the data, to the series, 1017 01:07:48,660 --> 01:07:52,270 so that the resulting process is stationary. 1018 01:07:52,270 --> 01:07:55,990 And with the differencing operators, 1019 01:07:55,990 --> 01:08:00,610 delta, Box and Jenkins advocated removing 1020 01:08:00,610 --> 01:08:03,420 non-stationary trending behavior, which 1021 01:08:03,420 --> 01:08:06,370 is exhibited often in economic time series, 1022 01:08:06,370 --> 01:08:09,960 by using a first difference, maybe a second difference, 1023 01:08:09,960 --> 01:08:12,300 or a k-th order difference. 1024 01:08:12,300 --> 01:08:20,229 So these operators are defined in this way. 1025 01:08:20,229 --> 01:08:22,960 Basically with the k-th order operator 1026 01:08:22,960 --> 01:08:25,210 having this expression here, this 1027 01:08:25,210 --> 01:08:31,189 is the binomial expansion of a k-th power, 1028 01:08:31,189 --> 01:08:35,970 which can be useful. 1029 01:08:35,970 --> 01:08:40,609 It comes up all the time in probability theory. 1030 01:08:40,609 --> 01:08:43,609 And if a process has a linear time trend, 1031 01:08:43,609 --> 01:08:48,390 then delta X_t is going to have no time trend at all, 1032 01:08:48,390 --> 01:08:51,390 because you're basically taking out 1033 01:08:51,390 --> 01:08:54,430 that linear component by taking successive differences. 1034 01:08:54,430 --> 01:08:57,014 Sometimes, if you have a real series 1035 01:08:57,014 --> 01:08:59,430 that appears non-stationary, 1036 01:08:59,430 --> 01:09:02,810 its first differences can still 1037 01:09:02,810 --> 01:09:05,649 appear to be trending over time, in which case 1038 01:09:05,649 --> 01:09:08,810 sometimes the second difference will result 1039 01:09:08,810 --> 01:09:11,270 in a process with no trend. 1040 01:09:11,270 --> 01:09:14,170 So these are sort of convenient tricks, 1041 01:09:14,170 --> 01:09:18,250 techniques to render the series stationary. 1042 01:09:18,250 --> 01:09:21,220 And let's see.
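Here is a sketch of those differencing operators on an illustrative trend-plus-noise series (the slope, noise level, and series length are made up): one application of delta removes the linear trend, and the k-th order difference matches the binomial-expansion form of (1 - L) to the k.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(3)
t = np.arange(500)
x = 2.0 + 0.05 * t + rng.normal(size=t.size)          # linear time trend plus noise

print(round(np.polyfit(t, x, 1)[0], 4))               # fitted slope ~ 0.05 before differencing
print(round(np.polyfit(t[1:], np.diff(x), 1)[0], 4))  # fitted slope ~ 0 after one difference

# Delta^k X_t = sum_j (-1)^j C(k, j) X_{t-j}: the binomial expansion of (1 - L)^k.
k = 3
w = np.array([(-1) ** j * comb(k, j) for j in range(k + 1)])
manual = np.array([np.dot(w, x[i - k:i + 1][::-1]) for i in range(k, len(x))])
print(np.allclose(manual, np.diff(x, n=k)))           # True
```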
1043 01:09:21,220 --> 01:09:26,960 There's examples here of linear trend reversion models 1044 01:09:26,960 --> 01:09:32,319 which are rendered covariance stationary 1045 01:09:32,319 --> 01:09:35,330 under first differencing. 1046 01:09:35,330 --> 01:09:38,689 In this case, this is an example where you have 1047 01:09:38,689 --> 01:09:41,350 a deterministic time trend. 1048 01:09:41,350 --> 01:09:46,040 But then you have reversion to the time trend over time. 1049 01:09:46,040 --> 01:09:49,880 So we basically have eta_t, the error 1050 01:09:49,880 --> 01:09:53,830 about the deterministic trend, is a first order autoregressive 1051 01:09:53,830 --> 01:09:55,740 process. 1052 01:09:55,740 --> 01:10:00,307 And the moments here can be derived this way. 1053 01:10:00,307 --> 01:10:01,390 Leave that as an exercise. 1054 01:10:04,230 --> 01:10:09,510 One could also consider the pure integrated process 1055 01:10:09,510 --> 01:10:16,330 and talk about stochastic trends. 1056 01:10:16,330 --> 01:10:19,140 And basically, random walk processes 1057 01:10:19,140 --> 01:10:22,740 are often referred to in econometrics 1058 01:10:22,740 --> 01:10:25,010 as stochastic trends. 1059 01:10:25,010 --> 01:10:31,610 And you may want to try and remove those from the data, 1060 01:10:31,610 --> 01:10:33,280 or accommodate them. 1061 01:10:33,280 --> 01:10:40,930 And so the stochastic trend process is basically 1062 01:10:40,930 --> 01:10:49,630 given by the first difference X_t is just equal to eta_t. 1063 01:10:49,630 --> 01:10:53,430 And so we have essentially this random walk 1064 01:10:53,430 --> 01:10:55,830 from a given starting point. 1065 01:10:55,830 --> 01:11:00,650 And it's easy to verify it if you knew the 0th point, then 1066 01:11:00,650 --> 01:11:04,770 the variance of the t-th time point would be t sigma squared, 1067 01:11:04,770 --> 01:11:09,000 because we're summing t independent innovations. 1068 01:11:09,000 --> 01:11:14,475 And the covariance between t and lag t minus j 1069 01:11:14,475 --> 01:11:17,500 is simply t minus j sigma squared. 1070 01:11:17,500 --> 01:11:20,860 And the correlation between those has this form. 1071 01:11:20,860 --> 01:11:23,240 What you can see is that this definitely depends on time. 1072 01:11:23,240 --> 01:11:26,660 So it's not a stationary process. 1073 01:11:26,660 --> 01:11:33,880 So this first differencing results in stationarity. 1074 01:11:33,880 --> 01:11:36,230 And the end difference process has those features. 1075 01:11:46,847 --> 01:11:47,805 Let's see where we are. 1076 01:11:52,730 --> 01:11:57,380 Final topic for today is just how 1077 01:11:57,380 --> 01:12:04,630 you incorporate non-stationary process into ARMA processes. 1078 01:12:04,630 --> 01:12:07,680 Well, if you take first differences 1079 01:12:07,680 --> 01:12:10,340 or second differences and the resulting process 1080 01:12:10,340 --> 01:12:13,252 is covariance stationary, then we 1081 01:12:13,252 --> 01:12:15,460 can just incorporate that differencing into the model 1082 01:12:15,460 --> 01:12:20,490 specification itself, and define ARIMA models, Autoregressive 1083 01:12:20,490 --> 01:12:23,730 Integrated Moving Average Processes. 
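The stochastic-trend moments can be checked by simulating many random-walk paths from X_0 = 0 (a sketch with illustrative path counts and horizon): the cross-path variance at time t grows like t times sigma squared, while first differencing takes you back to a constant-variance white noise.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma, n_paths, T = 1.0, 20_000, 400
eta = rng.normal(0.0, sigma, size=(n_paths, T))
X = np.cumsum(eta, axis=1)                       # random walks started at X_0 = 0

for t in (50, 100, 400):
    print(t, round(X[:, t - 1].var(), 1), "theory:", t * sigma**2)

dX = np.diff(X, axis=1)                          # first differences are just the innovations
print(round(dX.var(), 3), "~ sigma^2 =", sigma**2)
```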
1084 01:12:23,730 --> 01:12:26,000 And so to specify these models, we 1085 01:12:26,000 --> 01:12:29,290 need to determine the order of the differencing required 1086 01:12:29,290 --> 01:12:32,990 to move trends, deterministic or stochastic, 1087 01:12:32,990 --> 01:12:35,820 and then estimating the unknown parameters, 1088 01:12:35,820 --> 01:12:38,940 and then applying model selection criteria. 1089 01:12:38,940 --> 01:12:43,770 So let me go very quickly through this 1090 01:12:43,770 --> 01:12:48,600 and come back to it the beginning of next time. 1091 01:12:48,600 --> 01:12:51,660 But in specifying the parameters of these models, 1092 01:12:51,660 --> 01:12:54,410 we can apply maximum likelihood, again, 1093 01:12:54,410 --> 01:12:59,280 if we assume normality of these innovations eta_t. 1094 01:12:59,280 --> 01:13:02,260 And we can express the ARMA model 1095 01:13:02,260 --> 01:13:04,440 in state space form, which results 1096 01:13:04,440 --> 01:13:07,880 in a form for the likelihood function, which 1097 01:13:07,880 --> 01:13:12,130 we'll see a few lectures ahead. 1098 01:13:12,130 --> 01:13:15,970 But then we can apply limited information maximum likelihood, 1099 01:13:15,970 --> 01:13:19,470 where we just condition on the first observations of the data 1100 01:13:19,470 --> 01:13:22,550 and maximize the likelihood. 1101 01:13:22,550 --> 01:13:27,060 Or not condition on the first few observations, but also 1102 01:13:27,060 --> 01:13:33,700 use their information as well, and look at their density 1103 01:13:33,700 --> 01:13:36,640 functions, incorporating those into the likelihood 1104 01:13:36,640 --> 01:13:41,160 relative to the stationary distribution for their values. 1105 01:13:41,160 --> 01:13:44,000 And then the issue becomes, how do we 1106 01:13:44,000 --> 01:13:45,390 choose amongst different models? 1107 01:13:45,390 --> 01:13:48,480 Now, last time we talked about linear regression models, 1108 01:13:48,480 --> 01:13:50,500 how you'd specify a given model, here, we're 1109 01:13:50,500 --> 01:13:53,050 talking about autoregressive, moving average, 1110 01:13:53,050 --> 01:13:55,000 and even integrated moving average processes 1111 01:13:55,000 --> 01:13:59,320 and how do we specify those, well, with the method 1112 01:13:59,320 --> 01:14:06,470 of maximum likelihood, there are procedures 1113 01:14:06,470 --> 01:14:12,440 which-- there are measures of how effectively a fitted model 1114 01:14:12,440 --> 01:14:16,390 is, given by an information criterion 1115 01:14:16,390 --> 01:14:21,250 that you would want to minimize for a given fitted model. 1116 01:14:21,250 --> 01:14:24,719 So we can consider different sets of models, 1117 01:14:24,719 --> 01:14:26,510 different numbers of explanatory variables, 1118 01:14:26,510 --> 01:14:29,740 different orders of autoregressive parameters, 1119 01:14:29,740 --> 01:14:33,100 moving average parameters, and compute, say, 1120 01:14:33,100 --> 01:14:37,940 the Akaike information criterion or the Bayes information 1121 01:14:37,940 --> 01:14:39,990 criterion or the Hannan-Quinn criterion 1122 01:14:39,990 --> 01:14:44,720 as different ways of judging how good different models are. 
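As a sketch of that model-selection workflow (this assumes the statsmodels package is available, and the simulated series, candidate orders, and variable names are illustrative): fit a few ARIMA orders by maximum likelihood and compare their information criteria, preferring smaller values.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA   # assumes statsmodels is installed

# Simulate an ARIMA(1,1,0): first differences follow an AR(1) with phi = 0.6.
rng = np.random.default_rng(5)
n = 2000
d = np.zeros(n)
eta = rng.normal(size=n)
for t in range(1, n):
    d[t] = 0.6 * d[t - 1] + eta[t]
y = np.cumsum(d)                                 # integrate once: a stochastic trend

for order in [(1, 1, 0), (2, 1, 0), (1, 1, 1), (0, 1, 1)]:
    res = ARIMA(y, order=order).fit()
    print(order, "AIC:", round(res.aic, 1), "BIC:", round(res.bic, 1), "HQIC:", round(res.hqic, 1))
# The true order (1,1,0) should come out at or near the smallest value of each criterion.
```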
1123 01:14:44,720 --> 01:14:47,960 And let me just finish today by pointing out 1124 01:14:47,960 --> 01:14:52,620 that what these information criteria are 1125 01:14:52,620 --> 01:14:58,560 is basically a function of the log likelihood function, which 1126 01:14:58,560 --> 01:15:00,719 is something we're trying to maximize 1127 01:15:00,719 --> 01:15:02,135 with maximum likelihood estimates. 1128 01:15:04,870 --> 01:15:08,700 And then adding some penalty for how many parameters 1129 01:15:08,700 --> 01:15:10,742 we're estimating. 1130 01:15:10,742 --> 01:15:12,950 And so what I'd like you to think about for next time 1131 01:15:12,950 --> 01:15:18,600 is what kind of a penalty is appropriate for adding 1132 01:15:18,600 --> 01:15:20,300 an extra parameter. 1133 01:15:20,300 --> 01:15:23,640 Like, what evidence is required to incorporate 1134 01:15:23,640 --> 01:15:28,020 extra parameters, extra variables, in the model. 1135 01:15:28,020 --> 01:15:31,180 Would it be t statistics that exceeds some threshold 1136 01:15:31,180 --> 01:15:32,760 or some other criteria. 1137 01:15:32,760 --> 01:15:35,940 Turns out that these are all related to those issues. 1138 01:15:35,940 --> 01:15:39,500 And it's very interesting how those play out. 1139 01:15:39,500 --> 01:15:45,180 And I'll say that for those of you who have actually 1140 01:15:45,180 --> 01:15:48,490 seen these before, the Bayes information criterion 1141 01:15:48,490 --> 01:15:50,400 corresponds to an assumption that there 1142 01:15:50,400 --> 01:15:54,180 is some finite number of variables in the model. 1143 01:15:54,180 --> 01:15:57,010 And you know what those are. 1144 01:15:57,010 --> 01:16:00,060 The Hannan-Quinn criterion says maybe there's 1145 01:16:00,060 --> 01:16:03,760 an infinite number of variables in the model, 1146 01:16:03,760 --> 01:16:08,810 but you want to be able to identify those. 1147 01:16:08,810 --> 01:16:12,230 And so anyway, it's a very challenging problem 1148 01:16:12,230 --> 01:16:13,390 with model selection. 1149 01:16:13,390 --> 01:16:16,900 And these criteria can be used to specify those. 1150 01:16:16,900 --> 01:16:19,050 So we'll go through that next time.
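For reference, the standard forms of those criteria can be written down directly (a sketch, with an illustrative log-likelihood value and sample size): each one is minus twice the maximized log likelihood plus a penalty per estimated parameter, and the penalty translates directly into how much the log likelihood must improve before an extra parameter is worth adding.

```python
import numpy as np

def information_criteria(loglik, k, n):
    """AIC, BIC, and Hannan-Quinn for a model with maximized log-likelihood `loglik`,
    k estimated parameters, and n observations; smaller values are preferred."""
    return {"AIC": -2 * loglik + 2 * k,
            "BIC": -2 * loglik + k * np.log(n),
            "HQ": -2 * loglik + 2 * k * np.log(np.log(n))}

n = 1000
print(information_criteria(loglik=-1500.0, k=3, n=n))
# Adding one parameter pays off only if the log likelihood rises by more than:
print("AIC:", 1.0, " BIC:", round(np.log(n) / 2, 2), " HQ:", round(np.log(np.log(n)), 2))
```

Comparing those thresholds is one way to think about the question posed above: BIC demands the most evidence per added parameter, Hannan-Quinn sits in between, and AIC is the most permissive.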