1 00:00:00,060 --> 00:00:02,500 The following content is provided under a Creative 2 00:00:02,500 --> 00:00:04,019 Commons license. 3 00:00:04,019 --> 00:00:06,360 Your support will help MIT OpenCourseWare 4 00:00:06,360 --> 00:00:10,730 continue to offer high quality educational resources for free. 5 00:00:10,730 --> 00:00:13,340 To make a donation or view additional materials 6 00:00:13,340 --> 00:00:17,217 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,217 --> 00:00:17,842 at ocw.mit.edu. 8 00:00:21,490 --> 00:00:24,620 PROFESSOR: Today's topic is regression analysis. 9 00:00:24,620 --> 00:00:28,950 And this subject is one that we're 10 00:00:28,950 --> 00:00:32,610 going to cover today, covering the mathematical and 11 00:00:32,610 --> 00:00:35,620 statistical foundations of regression 12 00:00:35,620 --> 00:00:38,720 and focusing particularly on linear regression. 13 00:00:38,720 --> 00:00:44,840 This methodology is perhaps the most powerful method 14 00:00:44,840 --> 00:00:47,630 in statistical modeling. 15 00:00:47,630 --> 00:00:49,640 And the foundations of it, I think, 16 00:00:49,640 --> 00:00:52,740 are very, very important to understand and master, 17 00:00:52,740 --> 00:00:56,320 and they'll help you in any kind of statistical modeling 18 00:00:56,320 --> 00:01:02,870 exercise you might entertain during or after this course. 19 00:01:02,870 --> 00:01:07,320 And its popularity in finance is very, very high, 20 00:01:07,320 --> 00:01:10,540 but it's also a very popular methodology 21 00:01:10,540 --> 00:01:15,630 in all other disciplines that do applied statistics. 22 00:01:15,630 --> 00:01:23,410 So let's begin with setting up the multiple linear regression 23 00:01:23,410 --> 00:01:24,490 problem. 24 00:01:24,490 --> 00:01:31,760 So we begin with a data set that consists of observations 25 00:01:31,760 --> 00:01:34,640 on different cases, a number of cases. 26 00:01:34,640 --> 00:01:40,110 So we have n cases indexed by i. 27 00:01:40,110 --> 00:01:44,410 And there's a single variable, a dependent variable or response 28 00:01:44,410 --> 00:01:47,970 variable, which is the variable of focus. 29 00:01:47,970 --> 00:01:52,040 And we'll denote that y sub i. 30 00:01:52,040 --> 00:01:55,550 And together with that, for each of the cases, 31 00:01:55,550 --> 00:01:59,640 there are explanatory variables that we might observe. 32 00:01:59,640 --> 00:02:04,330 So the y_i's, the dependent variables, 33 00:02:04,330 --> 00:02:07,670 could be returns on stocks. 34 00:02:07,670 --> 00:02:12,160 The explanatory variables could be underlying characteristics 35 00:02:12,160 --> 00:02:16,410 of those stocks over a given period. 36 00:02:16,410 --> 00:02:21,090 The dependent variable could be the change 37 00:02:21,090 --> 00:02:28,360 in value of an index, the S&P 500 index or the yield rate, 38 00:02:28,360 --> 00:02:30,220 and the explanatory variables can 39 00:02:30,220 --> 00:02:33,390 be various macroeconomic factors or other factors that 40 00:02:33,390 --> 00:02:39,430 might be used to explain how the response variable changes 41 00:02:39,430 --> 00:02:41,510 and takes on its value. 42 00:02:41,510 --> 00:02:44,880 Let's go through various goals of regression analysis. 43 00:02:44,880 --> 00:02:47,780 OK, first it can be to extract or exploit 44 00:02:47,780 --> 00:02:49,790 the relationship between the dependent variable 45 00:02:49,790 --> 00:02:51,720 and the independent variables. 46 00:02:51,720 --> 00:02:55,010 And examples of this are prediction. 
47 00:02:55,010 --> 00:02:57,830 Indeed, in finance that's where I've 48 00:02:57,830 --> 00:03:00,075 used regression analysis most. 49 00:03:00,075 --> 00:03:02,965 We want to predict what's going to happen and take actions 50 00:03:02,965 --> 00:03:05,230 to take advantage of that. 51 00:03:05,230 --> 00:03:07,570 One can also use regression analysis 52 00:03:07,570 --> 00:03:10,940 to talk about causal inference. 53 00:03:10,940 --> 00:03:16,240 What factors are really driving a dependent variable? 54 00:03:16,240 --> 00:03:19,570 And so one can actually test hypotheses 55 00:03:19,570 --> 00:03:23,930 about what are true causal factors underlying 56 00:03:23,930 --> 00:03:27,250 the relationships between the variables. 57 00:03:27,250 --> 00:03:32,430 Another application is for just simple approximation. 58 00:03:32,430 --> 00:03:34,870 As mathematicians, you're all very 59 00:03:34,870 --> 00:03:38,970 familiar with how smooth functions can 60 00:03:38,970 --> 00:03:41,460 be-- that are smooth in the sense of being 61 00:03:41,460 --> 00:03:43,250 differentiable and bounded. 62 00:03:43,250 --> 00:03:46,565 Those can be approximated well by a Taylor series 63 00:03:46,565 --> 00:03:48,690 if you have a function of a single variable or even 64 00:03:48,690 --> 00:03:53,240 a multivariable function. 65 00:03:53,240 --> 00:03:55,840 So one can use regression analysis 66 00:03:55,840 --> 00:04:00,460 to actually approximate functions nicely. 67 00:04:00,460 --> 00:04:04,370 And one can also use regression analysis 68 00:04:04,370 --> 00:04:08,600 to uncover functional relationships 69 00:04:08,600 --> 00:04:10,724 and validate functional relationships 70 00:04:10,724 --> 00:04:11,640 amongst the variables. 71 00:04:14,270 --> 00:04:17,130 So let's set up the general linear model 72 00:04:17,130 --> 00:04:19,570 from a mathematical standpoint to begin with. 73 00:04:19,570 --> 00:04:23,160 In this lecture, OK, we're going to start off 74 00:04:23,160 --> 00:04:28,050 with discussing ordinary least squares, which 75 00:04:28,050 --> 00:04:31,070 is a purely mathematical criterion for how 76 00:04:31,070 --> 00:04:33,580 you specify regression models. 77 00:04:33,580 --> 00:04:38,350 And then we're going to turn to the Gauss-Markov theorem which 78 00:04:38,350 --> 00:04:42,265 incorporates some statistical modeling principles there. 79 00:04:42,265 --> 00:04:45,170 They're essentially weak principles. 80 00:04:45,170 --> 00:04:49,570 And then we will turn to formal models 81 00:04:49,570 --> 00:04:52,160 with normal linear regression models, 82 00:04:52,160 --> 00:04:55,700 and then consider extensions of those to broader classes. 83 00:04:58,250 --> 00:05:00,600 Now we're in the mathematical context. 84 00:05:00,600 --> 00:05:05,180 And a linear model is basically attempting 85 00:05:05,180 --> 00:05:08,780 to model the conditional distribution of the response 86 00:05:08,780 --> 00:05:14,580 variable y_i given the independent variables x_i. 87 00:05:14,580 --> 00:05:21,050 And the conditional distribution of the response variable 88 00:05:21,050 --> 00:05:25,050 is modeled simply as a linear function 89 00:05:25,050 --> 00:05:27,230 of the independent variables. 90 00:05:27,230 --> 00:05:32,880 So the x_i's, x_(i,1) through x_(i,p), 91 00:05:32,880 --> 00:05:36,090 are the key explanatory variables that relate 92 00:05:36,090 --> 00:05:38,340 to the response variables, possibly. 
93 00:05:38,340 --> 00:05:45,850 And the beta_1, beta_2, beta_i, or beta_p, 94 00:05:45,850 --> 00:05:49,480 are the regression parameters which 95 00:05:49,480 --> 00:05:53,190 would be used in defining that linear relationship. 96 00:05:53,190 --> 00:06:03,680 So this relationship has residuals, epsilon_i, 97 00:06:03,680 --> 00:06:07,320 basically where there's uncertainty in the data-- 98 00:06:07,320 --> 00:06:11,100 whether it's either due to a measurement error or modeling 99 00:06:11,100 --> 00:06:14,535 error or underlying stochastic processes that 100 00:06:14,535 --> 00:06:15,980 are driving the error. 101 00:06:15,980 --> 00:06:19,340 This epsilon_i is a residual error variable 102 00:06:19,340 --> 00:06:24,440 that will indicate how this linear relationship varies 103 00:06:24,440 --> 00:06:25,915 across the different n cases. 104 00:06:30,590 --> 00:06:33,520 So OK, how broad are the models? 105 00:06:33,520 --> 00:06:38,080 Well, the models really are very broad. 106 00:06:38,080 --> 00:06:40,020 First of all, polynomial approximation 107 00:06:40,020 --> 00:06:41,505 is indicated here. 108 00:06:41,505 --> 00:06:45,630 It corresponds, essentially, to a truncated Taylor 109 00:06:45,630 --> 00:06:49,170 series approximation to a functional form. 110 00:06:51,730 --> 00:06:56,240 With variables that exhibit cyclical behavior, 111 00:06:56,240 --> 00:07:03,200 Fourier series can be applied in a linear regression context. 112 00:07:03,200 --> 00:07:06,020 How many people in here are familiar with Fourier series? 113 00:07:08,580 --> 00:07:09,870 Almost everybody. 114 00:07:09,870 --> 00:07:13,530 So Fourier series basically provide 115 00:07:13,530 --> 00:07:17,650 a set of basis functions that allow you to closely 116 00:07:17,650 --> 00:07:19,250 approximate most functions. 117 00:07:19,250 --> 00:07:21,540 And certainly with bounded functions 118 00:07:21,540 --> 00:07:24,510 that possibly have a cyclical structure to them, 119 00:07:24,510 --> 00:07:27,120 it provides a complete description. 120 00:07:27,120 --> 00:07:30,940 So we could apply Fourier series here. 121 00:07:30,940 --> 00:07:38,640 Finally, time series regressions where the cases i one through n 122 00:07:38,640 --> 00:07:42,360 are really indexes of different time points can be applied. 123 00:07:42,360 --> 00:07:46,650 And so the independent variables can 124 00:07:46,650 --> 00:07:50,660 be variables that are observable at a given time point 125 00:07:50,660 --> 00:07:51,910 or known at a given time. 126 00:07:51,910 --> 00:07:55,810 So those can include lags of the response variables. 127 00:07:55,810 --> 00:08:00,530 So we'll see actually when we talk about time series 128 00:08:00,530 --> 00:08:02,880 that there's autoregressive time series 129 00:08:02,880 --> 00:08:05,870 models that can be specified. 130 00:08:05,870 --> 00:08:11,410 And those are very broadly applied in finance. 131 00:08:18,900 --> 00:08:23,510 All right, so let's go through what the steps are for fitting 132 00:08:23,510 --> 00:08:24,840 a regression model. 133 00:08:27,680 --> 00:08:31,760 First, one wants to propose a model 134 00:08:31,760 --> 00:08:33,600 in terms of what is it that we have 135 00:08:33,600 --> 00:08:38,727 to identify or be interested in a particular response variable. 136 00:08:38,727 --> 00:08:42,934 And critical here is specifying the scale 137 00:08:42,934 --> 00:08:47,010 of that response variable. 
138 00:08:47,010 --> 00:08:51,280 Choongbum was discussing problems of modeling stock 139 00:08:51,280 --> 00:08:52,450 prices. 140 00:08:52,450 --> 00:08:55,030 If, say, y is the stock price, 141 00:08:55,030 --> 00:09:00,090 well, it may be that it's more appropriate to consider 142 00:09:00,090 --> 00:09:06,320 modeling it on a logarithmic scale than on a linear scale. 143 00:09:06,320 --> 00:09:09,780 Who can tell me why that would be a good idea? 144 00:09:09,780 --> 00:09:11,740 AUDIENCE: Because the changes might 145 00:09:11,740 --> 00:09:13,700 become more percent changes in price 146 00:09:13,700 --> 00:09:16,365 rather than absolute changes in price. 147 00:09:16,365 --> 00:09:17,490 PROFESSOR: Very good, yeah. 148 00:09:17,490 --> 00:09:21,540 So price changes basically on the percentage scale, 149 00:09:21,540 --> 00:09:25,680 which is what log changes measure, may be much better predicted 150 00:09:25,680 --> 00:09:30,410 by knowing the factors than the absolute price level. 151 00:09:30,410 --> 00:09:35,940 OK, and so we have to have a collection 152 00:09:35,940 --> 00:09:40,070 of independent variables to include in the model. 153 00:09:40,070 --> 00:09:42,970 And it's important to think about how 154 00:09:42,970 --> 00:09:44,590 general this setup is. 155 00:09:44,590 --> 00:09:46,450 I mean, the independent variables 156 00:09:46,450 --> 00:09:50,140 can be functions, lagged values of the response variable. 157 00:09:50,140 --> 00:09:52,270 They can be different functional forms 158 00:09:52,270 --> 00:09:53,670 of other independent variables. 159 00:09:53,670 --> 00:09:58,720 So the fact that we're talking about a linear regression model 160 00:09:58,720 --> 00:10:03,590 here is not so limiting in terms of the linearity. 161 00:10:03,590 --> 00:10:06,050 We can really capture a lot of nonlinear behavior 162 00:10:06,050 --> 00:10:07,730 in this framework. 163 00:10:07,730 --> 00:10:11,290 So then third, we need to address the assumptions 164 00:10:11,290 --> 00:10:15,020 about the distribution of the residuals, epsilon, 165 00:10:15,020 --> 00:10:16,820 over the cases. 166 00:10:16,820 --> 00:10:21,150 So that has to be specified. 167 00:10:21,150 --> 00:10:23,610 Once we've set up the model in terms 168 00:10:23,610 --> 00:10:27,160 of identifying the response and the explanatory variables 169 00:10:27,160 --> 00:10:29,700 and the assumptions underlying the distribution 170 00:10:29,700 --> 00:10:35,080 of the residuals, we need to specify a criterion for judging 171 00:10:35,080 --> 00:10:36,550 different estimators. 172 00:10:36,550 --> 00:10:40,840 So given a particular setup, what we want to do 173 00:10:40,840 --> 00:10:48,000 is be able to define a methodology for specifying 174 00:10:48,000 --> 00:10:50,580 the regression parameters so that we can then 175 00:10:50,580 --> 00:10:54,010 use this regression model for prediction 176 00:10:54,010 --> 00:10:56,670 or whatever our purpose is. 177 00:10:56,670 --> 00:11:00,700 So the second thing we want to do 178 00:11:00,700 --> 00:11:03,460 is define a criterion for how we might 179 00:11:03,460 --> 00:11:09,090 judge different estimators of the regression parameters. 180 00:11:09,090 --> 00:11:11,210 We're going to go through several of those. 181 00:11:11,210 --> 00:11:15,200 And you'll see those-- least squares is the first one, 182 00:11:15,200 --> 00:11:17,210 but there are actually more general ones. 
183 00:11:17,210 --> 00:11:19,290 In fact, the last section of this lecture 184 00:11:19,290 --> 00:11:22,930 on generalized estimators will cover those as well. 185 00:11:22,930 --> 00:11:25,740 Third, we need to characterize the best estimator 186 00:11:25,740 --> 00:11:27,600 and apply it to the given data. 187 00:11:27,600 --> 00:11:31,401 So once we choose a criterion for how good 188 00:11:31,401 --> 00:11:32,900 an estimate of regression parameters 189 00:11:32,900 --> 00:11:37,080 is, then we have to have a technology for solving 190 00:11:37,080 --> 00:11:38,060 for that. 191 00:11:38,060 --> 00:11:43,030 And then fourth, we need to check our assumptions. 192 00:11:43,030 --> 00:11:48,770 Now, it's very often the case that at this fourth step, where 193 00:11:48,770 --> 00:11:51,160 you're checking the assumptions that you've made, 194 00:11:51,160 --> 00:11:55,370 you'll discover features of your data or the process 195 00:11:55,370 --> 00:11:58,550 that it's modeling that make you want 196 00:11:58,550 --> 00:12:01,520 to expand upon your assumptions or change your assumptions. 197 00:12:01,520 --> 00:12:06,410 And so checking the assumptions is a critical part 198 00:12:06,410 --> 00:12:08,170 of any modeling process. 199 00:12:08,170 --> 00:12:11,830 And then if necessary, modify the model and assumptions 200 00:12:11,830 --> 00:12:15,260 and repeat this process. 201 00:12:15,260 --> 00:12:18,660 What I can tell you is that this sort 202 00:12:18,660 --> 00:12:21,280 of protocol for how you fit models 203 00:12:21,280 --> 00:12:26,680 is what I've applied many, many times. 204 00:12:26,680 --> 00:12:31,870 And if you are lucky in a particular problem area, 205 00:12:31,870 --> 00:12:34,430 the very simple models will work well 206 00:12:34,430 --> 00:12:37,340 with small changes in assumptions. 207 00:12:37,340 --> 00:12:39,450 But when you get challenging problems, 208 00:12:39,450 --> 00:12:43,900 then this item five of modify the model 209 00:12:43,900 --> 00:12:46,300 and/or assumptions is critical. 210 00:12:46,300 --> 00:12:50,790 And in statistical modeling, my philosophy 211 00:12:50,790 --> 00:12:53,340 is you really want to, as much as possible, 212 00:12:53,340 --> 00:12:55,390 tailor the model to the process you're modeling. 213 00:12:55,390 --> 00:12:59,070 You don't want to fit a square peg in a round hole 214 00:12:59,070 --> 00:13:01,200 and just apply, say, simple linear regression 215 00:13:01,200 --> 00:13:02,510 to everything. 216 00:13:02,510 --> 00:13:06,191 You want to apply it when the assumptions are valid. 217 00:13:06,191 --> 00:13:07,940 If the assumptions aren't valid, maybe you 218 00:13:07,940 --> 00:13:10,680 can change the specification of the problem 219 00:13:10,680 --> 00:13:14,890 so a linear model is still applicable in a changed 220 00:13:14,890 --> 00:13:16,060 framework. 221 00:13:16,060 --> 00:13:18,530 But if not, then you'll want to extend 222 00:13:18,530 --> 00:13:20,250 to other kinds of models. 223 00:13:20,250 --> 00:13:22,120 But what we'll be doing-- or what 224 00:13:22,120 --> 00:13:24,680 you will be doing if you do that-- is basically applying 225 00:13:24,680 --> 00:13:28,120 all the same principles that are developed 226 00:13:28,120 --> 00:13:30,140 in the linear modeling framework. 227 00:13:35,750 --> 00:13:38,320 OK, now let's see. 228 00:13:38,320 --> 00:13:39,860 I wanted to make some comments here 229 00:13:39,860 --> 00:13:43,975 about specifying assumptions for the residual distribution. 
230 00:13:47,600 --> 00:13:49,620 What kind of assumptions might we make? 231 00:13:49,620 --> 00:13:52,840 OK, would anyone like to suggest some assumptions 232 00:13:52,840 --> 00:13:55,640 you might make in a linear regression 233 00:13:55,640 --> 00:13:58,310 model for the residuals? 234 00:13:58,310 --> 00:13:58,980 Yes? 235 00:13:58,980 --> 00:14:00,006 What's your name, by the way? 236 00:14:00,006 --> 00:14:00,700 AUDIENCE: My name is Will. 237 00:14:00,700 --> 00:14:01,533 PROFESSOR: Will, OK. 238 00:14:01,533 --> 00:14:02,455 Will what? 239 00:14:02,455 --> 00:14:02,900 [? AUDIENCE: Ossler. ?] 240 00:14:02,900 --> 00:14:03,460 PROFESSOR: [? Ossler, ?] great. 241 00:14:03,460 --> 00:14:04,910 OK, thank you, Will. 242 00:14:04,910 --> 00:14:06,480 AUDIENCE: It might be-- or we might 243 00:14:06,480 --> 00:14:10,000 want to say that the residual might be normally distributed 244 00:14:10,000 --> 00:14:17,598 and it might not depend too much on what value of the input 245 00:14:17,598 --> 00:14:19,000 variable we'd use. 246 00:14:19,000 --> 00:14:20,540 PROFESSOR: OK. 247 00:14:20,540 --> 00:14:21,090 Anyone else? 248 00:14:23,810 --> 00:14:24,310 OK. 249 00:14:24,310 --> 00:14:27,930 Well, that certainly is an excellent place 250 00:14:27,930 --> 00:14:33,510 to start in terms of starting with a distribution that's 251 00:14:33,510 --> 00:14:35,391 familiar. 252 00:14:35,391 --> 00:14:36,390 Familiar is always good. 253 00:14:36,390 --> 00:14:38,598 Although it's not something that should be necessary, 254 00:14:38,598 --> 00:14:44,210 but we know from some of Choongbum's lecture areas 255 00:14:44,210 --> 00:14:46,470 that Gaussian and normal distributions 256 00:14:46,470 --> 00:14:49,670 arise in many settings where we're 257 00:14:49,670 --> 00:14:53,900 taking basically sums of independent, random variables. 258 00:14:53,900 --> 00:14:58,180 And so it may be that these residuals are like that. 259 00:14:58,180 --> 00:15:04,910 Anyway, a slightly simpler or weaker condition 260 00:15:04,910 --> 00:15:10,110 is to use the Gauss-- what are called in statistics 261 00:15:10,110 --> 00:15:12,400 the Gauss-Markov assumptions. 262 00:15:12,400 --> 00:15:15,340 And these are assumptions where we're only 263 00:15:15,340 --> 00:15:20,390 concerned with the means or averages, statistically, 264 00:15:20,390 --> 00:15:23,210 and the variances of the residuals. 265 00:15:23,210 --> 00:15:26,160 And so we assume that there's zero mean. 266 00:15:26,160 --> 00:15:30,230 So on average, they're not adding a bias up or down 267 00:15:30,230 --> 00:15:32,580 to the dependent variable. 268 00:15:32,580 --> 00:15:36,920 And those have a constant variance. 269 00:15:36,920 --> 00:15:40,830 So the level of uncertainty in our model 270 00:15:40,830 --> 00:15:43,210 doesn't depend on the case. 271 00:15:43,210 --> 00:15:47,660 And so indeed, if errors on the percentage scale 272 00:15:47,660 --> 00:15:51,250 are more appropriate, then one could look at, say, 273 00:15:51,250 --> 00:15:53,970 a time series of prices that you're trying to model. 274 00:15:53,970 --> 00:15:56,970 And it may be that on the log scale, 275 00:15:56,970 --> 00:15:59,160 that constant variance looks much more appropriate 276 00:15:59,160 --> 00:16:02,180 than on the original scale, which would have-- 277 00:16:02,180 --> 00:16:06,660 And then a third attribute of the Gauss-Markov assumptions 278 00:16:06,660 --> 00:16:10,740 is that the residuals are uncorrelated. 
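[In symbols, the three Gauss-Markov conditions on the residuals just listed are: E[epsilon_i] = 0 (zero mean), Var(epsilon_i) = sigma^2 for all i (constant variance), and Cov(epsilon_i, epsilon_j) = 0 for i not equal to j (uncorrelated), for i, j = 1, ..., n, where sigma^2 denotes the common residual variance. This is a restatement of what was said, not an additional assumption.]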
279 00:16:10,740 --> 00:16:16,770 So now uncorrelated does not mean independent 280 00:16:16,770 --> 00:16:18,210 or statistically independent. 281 00:16:18,210 --> 00:16:21,470 So this is a somewhat weak condition, or weaker condition, 282 00:16:21,470 --> 00:16:24,326 than independence of the residuals. 283 00:16:24,326 --> 00:16:26,280 But in the Gauss-Markov setting, we're 284 00:16:26,280 --> 00:16:29,650 just setting up basically a reduced set 285 00:16:29,650 --> 00:16:33,610 of assumptions that we might apply to fit the model. 286 00:16:33,610 --> 00:16:36,620 If we extend upon that, we can then 287 00:16:36,620 --> 00:16:40,040 consider normal linear regression models, 288 00:16:40,040 --> 00:16:44,050 which Will just suggested. 289 00:16:44,050 --> 00:16:48,930 And in this case, those could be assumed to be independent 290 00:16:48,930 --> 00:16:50,440 and identically distributed-- IID 291 00:16:50,440 --> 00:16:56,380 is that notation for that-- with Gaussian or normal with mean 0 292 00:16:56,380 --> 00:16:57,580 and variance sigma squared. 293 00:17:00,590 --> 00:17:03,050 We can extend upon that to consider 294 00:17:03,050 --> 00:17:07,140 generalized Gauss-Markov assumptions where we maintain 295 00:17:07,140 --> 00:17:09,750 still the zero mean for the residuals, 296 00:17:09,750 --> 00:17:15,990 but the general-- we might have a covariance matrix which 297 00:17:15,990 --> 00:17:18,560 does not correspond to independent and identically 298 00:17:18,560 --> 00:17:20,680 distributed random variables. 299 00:17:20,680 --> 00:17:21,760 Now, let's see. 300 00:17:21,760 --> 00:17:25,000 In the discussion of probability theory, 301 00:17:25,000 --> 00:17:29,400 we really haven't talked yet about matrix-valued random 302 00:17:29,400 --> 00:17:31,490 variables, right? 303 00:17:31,490 --> 00:17:34,360 But how many people in the class have 304 00:17:34,360 --> 00:17:37,380 covered matrix-value or vector-valued random variables 305 00:17:37,380 --> 00:17:39,660 before? 306 00:17:39,660 --> 00:17:41,120 OK, just a handful. 307 00:17:41,120 --> 00:17:47,290 Well, a vector-valued random variable, 308 00:17:47,290 --> 00:17:51,090 we think of the values of these n 309 00:17:51,090 --> 00:17:57,200 cases for the dependent variable to be an n-valued, an n-vector 310 00:17:57,200 --> 00:17:59,950 of random variables. 311 00:17:59,950 --> 00:18:05,600 And so we can generalize the variance 312 00:18:05,600 --> 00:18:09,140 of individual random variables to the variance covariance 313 00:18:09,140 --> 00:18:12,870 matrix of the collection. 314 00:18:12,870 --> 00:18:18,490 And so you have a covariance matrix characterizing 315 00:18:18,490 --> 00:18:25,930 the variance of the n-vector which gives us the-- the (i, j) 316 00:18:25,930 --> 00:18:31,640 element gives us the value of the covariance. 317 00:18:31,640 --> 00:18:38,130 All right, let me put the screen up and just 318 00:18:38,130 --> 00:18:41,214 write that on the board so that you're familiar with that. 319 00:18:49,820 --> 00:18:58,270 All right, so we have y_1, y_2, down to y_n, 320 00:18:58,270 --> 00:19:03,070 our n values of our response variable. 321 00:19:03,070 --> 00:19:10,755 And we can basically talk about the expectation 322 00:19:10,755 --> 00:19:18,303 of that being equal to mu_1, mu_2, down to mu_n. 
323 00:19:21,021 --> 00:19:38,270 And the covariance matrix of y_1, y_2, down to y_n 324 00:19:38,270 --> 00:19:46,060 is equal to a matrix with the variance of y_1 325 00:19:46,060 --> 00:19:53,280 in the upper 1, 1 element, and the variance of y_2 in the 2, 326 00:19:53,280 --> 00:20:04,760 2 element, and the variance of y_n in the nth column and nth 327 00:20:04,760 --> 00:20:05,890 row. 328 00:20:05,890 --> 00:20:15,080 And in the (i,j)-th row, (i, j), we have the covariance between 329 00:20:15,080 --> 00:20:18,550 y_i and y_j. 330 00:20:18,550 --> 00:20:24,070 So we're going to use matrices to represent covariances. 331 00:20:24,070 --> 00:20:26,670 And that's something which I want everyone 332 00:20:26,670 --> 00:20:28,900 to get very familiar with because we're 333 00:20:28,900 --> 00:20:31,750 going to assume that we are comfortable with those, 334 00:20:31,750 --> 00:20:38,470 and apply matrix algebra with these kinds of constructs. 335 00:20:38,470 --> 00:20:41,300 So the generalized Gauss-Markov theorem 336 00:20:41,300 --> 00:20:43,960 assumes a general covariance matrix 337 00:20:43,960 --> 00:20:50,400 where you can have nonzero covariances 338 00:20:50,400 --> 00:20:52,260 between the independent variables 339 00:20:52,260 --> 00:20:54,440 or the dependent variables and the residuals. 340 00:20:54,440 --> 00:20:58,450 And those can be correlated. 341 00:20:58,450 --> 00:21:03,160 Now, who can come up with an example 342 00:21:03,160 --> 00:21:11,471 of why the residuals might be correlated in a regression 343 00:21:11,471 --> 00:21:11,970 model? 344 00:21:15,134 --> 00:21:16,490 Dan? 345 00:21:16,490 --> 00:21:17,390 OK. 346 00:21:17,390 --> 00:21:21,100 That's a really good example because it's nonlinear. 347 00:21:21,100 --> 00:21:24,720 If you imagine sort of a simple nonlinear curve 348 00:21:24,720 --> 00:21:27,070 and you try to fit a straight line to it, 349 00:21:27,070 --> 00:21:32,080 then the residuals from that linear fit 350 00:21:32,080 --> 00:21:35,100 are going to be consistently above or below the line 351 00:21:35,100 --> 00:21:37,630 depending on where you are in the nonlinearity, how 352 00:21:37,630 --> 00:21:38,820 it might be fitting. 353 00:21:38,820 --> 00:21:42,410 So that's one example where that could arise. 354 00:21:42,410 --> 00:21:43,415 Any other possibilities? 355 00:21:46,090 --> 00:21:50,380 Well, next week we'll be talking about some time series models. 356 00:21:50,380 --> 00:21:54,650 And there can be time dependence amongst variables 357 00:21:54,650 --> 00:21:57,170 where there are some underlying factors maybe 358 00:21:57,170 --> 00:21:58,850 that are driving the process. 359 00:21:58,850 --> 00:22:02,030 And those ongoing factors can persist 360 00:22:02,030 --> 00:22:05,480 in making the linear relationship 361 00:22:05,480 --> 00:22:11,300 over or under gauge the dependent variable. 362 00:22:11,300 --> 00:22:13,850 So that can happen as well. 363 00:22:13,850 --> 00:22:17,410 All right, yes? 364 00:22:17,410 --> 00:22:19,949 AUDIENCE: The Gauss-Markov is just the diagonal case? 365 00:22:19,949 --> 00:22:22,490 PROFESSOR: Yes, the Gauss-Markov is simply the diagonal case. 366 00:22:22,490 --> 00:22:27,090 And explicitly if we replace y's here by the residuals, 367 00:22:27,090 --> 00:22:30,780 epsilon_1 through epsilon_n, then 368 00:22:30,780 --> 00:22:36,930 that diagonal matrix with a constant diagonal 369 00:22:36,930 --> 00:22:41,948 is the simple Gauss-Markov assumption, yeah. 
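[To make the covariance-matrix notation on the board concrete, here is a minimal NumPy sketch. It is illustrative only: the dimensions, seed, and the particular general covariance matrix are made up, and normal draws are used purely for convenience of simulation, since Gauss-Markov itself only restricts means and covariances.]

```python
import numpy as np

# Contrast the simple Gauss-Markov covariance, sigma^2 * I_n, with a general
# covariance matrix Sigma having unequal variances and nonzero covariances.
rng = np.random.default_rng(0)
n, sigma = 4, 2.0

# Simple Gauss-Markov case: diagonal covariance with a constant diagonal.
cov_gm = sigma**2 * np.eye(n)

# Generalized case: any symmetric positive definite matrix is allowed;
# this particular Sigma is just an example.
A = rng.normal(size=(n, n))
Sigma = A @ A.T + n * np.eye(n)

# Draw many residual vectors from each model; the sample covariance matrices
# recover sigma^2 * I_n and Sigma respectively.
eps_gm = rng.multivariate_normal(np.zeros(n), cov_gm, size=100_000)
eps_gen = rng.multivariate_normal(np.zeros(n), Sigma, size=100_000)
print(np.round(np.cov(eps_gm, rowvar=False), 2))   # close to sigma^2 * I_n
print(np.round(np.cov(eps_gen, rowvar=False), 2))  # close to Sigma
```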
370 00:22:45,270 --> 00:22:48,920 Now, I'm sure it comes as no surprise 371 00:22:48,920 --> 00:22:51,450 that Gaussian distributions don't always fit everything. 372 00:22:51,450 --> 00:22:55,350 And so one needs to get clever with extending 373 00:22:55,350 --> 00:23:01,560 the models to other cases. 374 00:23:01,560 --> 00:23:07,280 And there are-- I know-- Laplace distributions, Pareto 375 00:23:07,280 --> 00:23:09,760 distributions, contaminated normal distributions, 376 00:23:09,760 --> 00:23:14,000 which can be used to fit regression models. 377 00:23:14,000 --> 00:23:22,330 And these general cases really extend the applicability 378 00:23:22,330 --> 00:23:28,610 of regression models to many interesting settings. 379 00:23:28,610 --> 00:23:35,530 So let's turn to specifying the estimator criterion in two. 380 00:23:35,530 --> 00:23:39,590 So how do we judge what's a good estimate of the regression 381 00:23:39,590 --> 00:23:40,580 parameters? 382 00:23:40,580 --> 00:23:45,390 Well, we're going to cover least squares, maximum likelihood, 383 00:23:45,390 --> 00:23:50,990 robust methods, which are contamination resistant. 384 00:23:50,990 --> 00:23:56,370 And other methods exist that we will mention but not 385 00:23:56,370 --> 00:23:57,870 get into really in the lectures, are 386 00:23:57,870 --> 00:24:03,390 Bayes methods and accommodating incomplete or missing data. 387 00:24:03,390 --> 00:24:08,150 Essentially, as your approach to modeling a problem 388 00:24:08,150 --> 00:24:10,120 gets more and more realistic, you 389 00:24:10,120 --> 00:24:14,580 start adding more and more complexity as it's needed. 390 00:24:14,580 --> 00:24:24,450 And certainly issues of-- well, robust methods 391 00:24:24,450 --> 00:24:26,724 is where you assume most of the data 392 00:24:26,724 --> 00:24:28,890 arrives under normal conditions, but once in a while 393 00:24:28,890 --> 00:24:31,710 there may be some problem with the data. 394 00:24:31,710 --> 00:24:34,390 And you don't want your methodology just 395 00:24:34,390 --> 00:24:39,635 to break down if there happens to be some outliers in the data 396 00:24:39,635 --> 00:24:41,090 or contamination. 397 00:24:41,090 --> 00:24:47,200 Bayes methodologies are the technology 398 00:24:47,200 --> 00:24:50,300 for incorporating subjective beliefs 399 00:24:50,300 --> 00:24:54,310 into statistical models. 400 00:24:54,310 --> 00:24:57,190 And I think it's fair to say that probably 401 00:24:57,190 --> 00:25:01,720 all statistical modeling is essentially subjective. 402 00:25:01,720 --> 00:25:05,820 And so if you're going to be good at statistical modeling, 403 00:25:05,820 --> 00:25:08,779 you want to be sure that you're effectively incorporating 404 00:25:08,779 --> 00:25:10,070 subjective information in that. 405 00:25:10,070 --> 00:25:13,400 And so Bayes methodologies are very, very useful, 406 00:25:13,400 --> 00:25:16,460 and indeed pretty much required to engage 407 00:25:16,460 --> 00:25:19,610 in appropriate modeling. 408 00:25:19,610 --> 00:25:22,700 And then finally, accommodate incomplete or missing data. 409 00:25:22,700 --> 00:25:26,920 The world is always sort of cruel in terms of you 410 00:25:26,920 --> 00:25:30,900 often are missing what you think is critical information 411 00:25:30,900 --> 00:25:32,060 to do your analysis. 412 00:25:32,060 --> 00:25:34,520 And so how do you deal with situations 413 00:25:34,520 --> 00:25:39,680 where you have some holes in your data? 
414 00:25:39,680 --> 00:25:44,720 Statistical models provide good methods and tools 415 00:25:44,720 --> 00:25:46,596 for dealing with that situation. 416 00:25:49,160 --> 00:25:49,720 OK. 417 00:25:49,720 --> 00:25:50,960 Then let's see. 418 00:25:50,960 --> 00:25:55,930 On case analyses for checking assumptions, 419 00:25:55,930 --> 00:25:57,540 let me go through this. 420 00:26:00,560 --> 00:26:03,790 Basically when you fit a regression model, 421 00:26:03,790 --> 00:26:07,680 you check assumptions by looking at the residuals, which 422 00:26:07,680 --> 00:26:16,330 are basically the estimates of the epsilons, the deviations 423 00:26:16,330 --> 00:26:22,359 of the dependent variable from its predictions. 424 00:26:22,359 --> 00:26:25,930 And what one wants to do is analyze 425 00:26:25,930 --> 00:26:29,660 these to determine whether our assumptions are appropriate. 426 00:26:29,660 --> 00:26:33,460 OK, with the Gauss-Markov assumptions the question would be, 427 00:26:33,460 --> 00:26:36,170 do these appear to have constant variance? 428 00:26:36,170 --> 00:26:40,270 And it may be that their variance depends on time, 429 00:26:40,270 --> 00:26:42,555 if the i is indexing time. 430 00:26:45,450 --> 00:26:48,190 Residuals might depend on the other variables as well, 431 00:26:48,190 --> 00:26:53,610 and one wants to determine that that isn't the case. 432 00:26:53,610 --> 00:26:57,660 There are also influence diagnostics identifying cases 433 00:26:57,660 --> 00:27:00,330 which are highly influential. 434 00:27:00,330 --> 00:27:05,620 It turns out that when you are building a regression 435 00:27:05,620 --> 00:27:10,560 model with data, you treat all the cases as 436 00:27:10,560 --> 00:27:12,830 if they're equally important. 437 00:27:12,830 --> 00:27:15,390 Well, it may be that certain cases 438 00:27:15,390 --> 00:27:19,540 are really critical to estimating certain factors. 439 00:27:19,540 --> 00:27:25,650 And it may be that much of the inference about how important 440 00:27:25,650 --> 00:27:27,940 a certain factor is, is determined 441 00:27:27,940 --> 00:27:30,550 by a very small number of points. 442 00:27:30,550 --> 00:27:32,875 So even though you have a massive data set 443 00:27:32,875 --> 00:27:34,265 that you're using to fit a model, 444 00:27:34,265 --> 00:27:36,675 it could be that some of the structure 445 00:27:36,675 --> 00:27:39,600 is driven by a very small number of cases. 446 00:27:39,600 --> 00:27:45,100 So influence diagnostics give you a way of analyzing that. 447 00:27:45,100 --> 00:27:50,730 In the problem set for this lecture, 448 00:27:50,730 --> 00:27:53,470 you'll be deriving some influence diagnostics 449 00:27:53,470 --> 00:27:55,380 for linear regression models and seeing how 450 00:27:55,380 --> 00:27:57,790 they're mathematically defined. 451 00:27:57,790 --> 00:28:01,150 And I'll be distributing a case study which 452 00:28:01,150 --> 00:28:04,290 illustrates fitting linear regression 453 00:28:04,290 --> 00:28:06,350 models for asset prices. 454 00:28:06,350 --> 00:28:10,502 And you can see how those play out 455 00:28:10,502 --> 00:28:11,710 with some practical examples. 456 00:28:16,170 --> 00:28:18,525 OK, finally there's outlier detection. 457 00:28:21,210 --> 00:28:25,790 With outliers, it's interesting. 458 00:28:25,790 --> 00:28:33,100 The exceptions in data are often the most interesting. 459 00:28:33,100 --> 00:28:36,560 It's important in modeling to understand 460 00:28:36,560 --> 00:28:40,700 whether certain cases are unusual. 
461 00:28:40,700 --> 00:28:47,600 And sometimes their degree of idiosyncrasy 462 00:28:47,600 --> 00:28:50,580 can be explained away so that one essentially 463 00:28:50,580 --> 00:28:51,710 discards those outliers. 464 00:28:51,710 --> 00:28:54,790 But other times, those idiosyncrasies 465 00:28:54,790 --> 00:28:57,890 lead to extensions of the model. 466 00:28:57,890 --> 00:29:04,345 And so outlier detection can be very important for validating 467 00:29:04,345 --> 00:29:05,980 a model. 468 00:29:05,980 --> 00:29:10,970 OK, so with that introduction to regression, linear regression, 469 00:29:10,970 --> 00:29:14,590 let's talk about ordinary least squares. 470 00:29:14,590 --> 00:29:15,090 Ah. 471 00:29:19,075 --> 00:29:24,360 OK, the least squares criterion is for a given a regression 472 00:29:24,360 --> 00:29:27,120 parameter, beta, which is considered 473 00:29:27,120 --> 00:29:32,160 to be a column vector-- so I'm taking the transpose of a row 474 00:29:32,160 --> 00:29:32,660 vector. 475 00:29:35,920 --> 00:29:39,610 The least squares criterion is to basically take 476 00:29:39,610 --> 00:29:43,170 the sum of square deviations from the actual value 477 00:29:43,170 --> 00:29:46,160 of the response variable from its linear prediction. 478 00:29:46,160 --> 00:29:48,760 So y_i minus y hat i, we're just plugging 479 00:29:48,760 --> 00:29:50,980 in for y hat i the linear function 480 00:29:50,980 --> 00:29:55,470 of the independent variables and the squaring that. 481 00:29:58,270 --> 00:30:01,570 And the ordinary least squares estimate, beta hat, 482 00:30:01,570 --> 00:30:05,490 minimizes this function. 483 00:30:05,490 --> 00:30:11,520 So in order to solve for this, we're going to use matrices. 484 00:30:11,520 --> 00:30:16,627 And so we're going to take the y vector, the vector of n 485 00:30:16,627 --> 00:30:18,002 values of the dependent variable, 486 00:30:18,002 --> 00:30:21,710 or the response variable, and X, the matrix 487 00:30:21,710 --> 00:30:24,210 of values of the independent variable. 488 00:30:24,210 --> 00:30:28,800 It's important in this set up to keep straight 489 00:30:28,800 --> 00:30:32,990 that cases go by rows and columns 490 00:30:32,990 --> 00:30:35,585 go by values of the independent variable. 491 00:30:41,410 --> 00:30:44,450 Boy, this thing is ultra sensitive. 492 00:30:44,450 --> 00:30:44,950 Excuse me. 493 00:30:49,317 --> 00:30:50,650 Do I turn off the touchpad here? 494 00:30:50,650 --> 00:30:53,050 OK. 495 00:30:53,050 --> 00:31:01,990 So we can now define our fitted value, y hat, 496 00:31:01,990 --> 00:31:06,330 to be equal to the matrix x times beta. 497 00:31:06,330 --> 00:31:10,940 And with matrix multiplication, that results in the y hat 1 498 00:31:10,940 --> 00:31:13,350 through y hat n. 499 00:31:13,350 --> 00:31:20,800 And Q of beta can basically be written as y minus X beta 500 00:31:20,800 --> 00:31:23,330 transpose y minus X beta. 501 00:31:23,330 --> 00:31:27,580 So this term here is an n-vector minus the product 502 00:31:27,580 --> 00:31:30,990 of the X matrix times beta, which is another n-vector. 503 00:31:30,990 --> 00:31:33,380 And we're just taking the cross product of that. 504 00:31:37,410 --> 00:31:41,460 And the ordinary least squares estimate for beta 505 00:31:41,460 --> 00:31:47,650 solves the derivative of this criterion equaling 0. 506 00:31:47,650 --> 00:31:53,990 Now, that's in fact true, but who 507 00:31:53,990 --> 00:31:55,400 can tell me why that's true? 508 00:32:00,360 --> 00:32:00,860 Say again? 
509 00:32:00,860 --> 00:32:01,750 AUDIENCE: Is that minimum? 510 00:32:01,750 --> 00:32:02,333 PROFESSOR: OK. 511 00:32:02,333 --> 00:32:04,763 So your name? 512 00:32:04,763 --> 00:32:05,610 AUDIENCE: Seth. 513 00:32:05,610 --> 00:32:06,313 PROFESSOR: Seth? 514 00:32:06,313 --> 00:32:06,812 Seth. 515 00:32:06,812 --> 00:32:07,760 Very good, Seth. 516 00:32:07,760 --> 00:32:08,850 Thanks, Seth. 517 00:32:08,850 --> 00:32:13,200 So if we want to find a minimum of Q, 518 00:32:13,200 --> 00:32:17,120 then that minimum will have, if it's a smooth function, 519 00:32:17,120 --> 00:32:20,950 will have a minimum at slope equals 0. 520 00:32:20,950 --> 00:32:23,405 Now, how do we know whether it's a minimum or not? 521 00:32:23,405 --> 00:32:25,424 It could be a maximum. 522 00:32:25,424 --> 00:32:26,340 AUDIENCE: [INAUDIBLE]? 523 00:32:29,160 --> 00:32:30,430 PROFESSOR: OK, right. 524 00:32:30,430 --> 00:32:34,050 So in fact, this is a-- Q of beta 525 00:32:34,050 --> 00:32:39,160 is a convex function of beta. 526 00:32:39,160 --> 00:32:44,770 And so its second derivative is positive. 527 00:32:44,770 --> 00:32:52,140 And if you basically think about the set-- basically, 528 00:32:52,140 --> 00:32:54,310 this is the first derivative of Q with respect 529 00:32:54,310 --> 00:32:55,680 to beta equaling 0. 530 00:32:55,680 --> 00:32:58,420 If you were to solve for the second derivative of Q 531 00:32:58,420 --> 00:33:03,060 with respect to beta, well, beta is a p-vector. 532 00:33:03,060 --> 00:33:04,560 So the second derivative is actually 533 00:33:04,560 --> 00:33:11,240 a second derivative matrix, and that matrix, 534 00:33:11,240 --> 00:33:12,540 you can solve for it. 535 00:33:12,540 --> 00:33:14,360 It will be X transpose X, which is 536 00:33:14,360 --> 00:33:18,730 a positive definite or semi-definite matrix. 537 00:33:18,730 --> 00:33:23,950 So it basically had a positive derivative there. 538 00:33:23,950 --> 00:33:27,210 So anyway, this ordinary least squares estimates 539 00:33:27,210 --> 00:33:32,370 will solve this d Q of beta by d beta equals 0. 540 00:33:32,370 --> 00:33:36,330 What does d Q beta by d beta_j? 541 00:33:36,330 --> 00:33:40,980 Well, you just take the derivative of this sum. 542 00:33:44,530 --> 00:33:47,390 So we're taking the sum of all these elements. 543 00:33:50,280 --> 00:33:53,640 And if you take the derivative-- well, 544 00:33:53,640 --> 00:33:57,210 OK, the derivative is a linear operator. 545 00:33:57,210 --> 00:34:00,450 So the derivative of a sum is the sum of the derivatives. 546 00:34:00,450 --> 00:34:04,150 So we take the summation out and we take the derivative of each 547 00:34:04,150 --> 00:34:09,929 term, so we get 2 minus x_(i,j), then the thing in square 548 00:34:09,929 --> 00:34:11,399 brackets, y_i minus that. 549 00:34:15,239 --> 00:34:16,139 And what is that? 550 00:34:16,139 --> 00:34:19,090 Well, in matrix notation, if we let 551 00:34:19,090 --> 00:34:22,469 this sort of bold X sub square j denote 552 00:34:22,469 --> 00:34:25,670 the j-th column of the independent variables, 553 00:34:25,670 --> 00:34:28,219 then this is minus 2. 554 00:34:28,219 --> 00:34:35,350 Basically, the j-th column of X transpose times y minus X beta. 555 00:34:35,350 --> 00:34:42,940 So this j-th equation for ordinary least squares 556 00:34:42,940 --> 00:34:46,980 has that representation in terms-- in matrix notation. 
557 00:34:46,980 --> 00:34:51,860 Now if we put that all together, we basically 558 00:34:51,860 --> 00:34:55,400 can define this derivative of Q with respect 559 00:34:55,400 --> 00:34:57,740 to the different regression parameters 560 00:34:57,740 --> 00:35:05,640 as basically the minus twice the j-th column stacked times y 561 00:35:05,640 --> 00:35:10,120 minus X beta, which is simply minus 2 X transpose, y minus X 562 00:35:10,120 --> 00:35:11,040 beta. 563 00:35:11,040 --> 00:35:14,870 And this has to equal 0. 564 00:35:14,870 --> 00:35:19,550 And if we just simplify, taking out the two, 565 00:35:19,550 --> 00:35:22,140 we get this set of equations. 566 00:35:22,140 --> 00:35:25,560 It must be satisfied by the ordinary least squares 567 00:35:25,560 --> 00:35:27,190 estimate, beta. 568 00:35:27,190 --> 00:35:31,960 And that's called the normal equations in books 569 00:35:31,960 --> 00:35:35,270 on regression modeling. 570 00:35:35,270 --> 00:35:37,310 So let's consider how we solve that. 571 00:35:37,310 --> 00:35:43,100 Well, we can re-express that by multiplying through the X 572 00:35:43,100 --> 00:35:46,010 transpose on each of the terms. 573 00:35:46,010 --> 00:35:54,770 And then beta hat basically solves this equation. 574 00:35:54,770 --> 00:35:58,170 And if X transpose X inverse exists, 575 00:35:58,170 --> 00:36:02,385 we get beta hat is equal to X transpose X inverse X 576 00:36:02,385 --> 00:36:04,920 transpose y. 577 00:36:04,920 --> 00:36:11,030 So with matrix algebra, we can actually solve this. 578 00:36:11,030 --> 00:36:14,530 And matrix algebra is going to be 579 00:36:14,530 --> 00:36:16,910 very important to this lecture and other lectures. 580 00:36:16,910 --> 00:36:20,260 So if this stuff is-- if you're a bit rusty on this, 581 00:36:20,260 --> 00:36:22,210 do brush up. 582 00:36:26,370 --> 00:36:29,610 This particular solution for beta hat 583 00:36:29,610 --> 00:36:37,980 assumes that X transpose X inverse exists. 584 00:36:37,980 --> 00:36:42,920 Who can tell me what assumptions do 585 00:36:42,920 --> 00:36:47,290 we need to make for X transpose X to have an inverse? 586 00:36:52,110 --> 00:36:55,910 I'll call you in a second if no one else does. 587 00:36:55,910 --> 00:36:59,528 Somebody just said something. 588 00:36:59,528 --> 00:37:01,370 Someone else. 589 00:37:01,370 --> 00:37:01,870 No? 590 00:37:01,870 --> 00:37:02,030 All right. 591 00:37:02,030 --> 00:37:02,400 OK, Will. 592 00:37:02,400 --> 00:37:03,860 AUDIENCE: So X transpose X inverse 593 00:37:03,860 --> 00:37:06,160 needs to have full rank, which means 594 00:37:06,160 --> 00:37:10,269 that each of the submatrices needs to have [INAUDIBLE] 595 00:37:10,269 --> 00:37:11,750 smaller dimension. 596 00:37:11,750 --> 00:37:15,305 PROFESSOR: OK, so Will said, basically, the matrix X 597 00:37:15,305 --> 00:37:17,250 needs to have full rank. 598 00:37:17,250 --> 00:37:23,990 And so if X has full rank, then-- well, let's see. 599 00:37:23,990 --> 00:37:29,340 If X has full rank, then the singular value decomposition 600 00:37:29,340 --> 00:37:35,990 which was in the very first class can exist. 601 00:37:35,990 --> 00:37:40,330 And you have basically p singular values 602 00:37:40,330 --> 00:37:42,730 that are all non-zero. 
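[A minimal NumPy sketch of the normal-equations solution just derived; the simulated design matrix X, coefficient vector, and noise level are made up for illustration, not from the lecture.]

```python
import numpy as np

# Simulate y = X beta + eps and recover beta by ordinary least squares.
rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# X'X is invertible only if X has full column rank (no column is a linear
# combination of the others).
assert np.linalg.matrix_rank(X) == p

# Normal equations: (X'X) beta_hat = X'y.  Solving the linear system is
# numerically preferable to forming (X'X)^{-1} explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq solves the same least squares problem via the SVD of X,
# which is where the full-rank / nonzero-singular-value condition shows up.
beta_svd, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat, beta_svd)   # both close to beta_true
```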
603 00:37:42,730 --> 00:37:46,100 And X transpose X can be expressed 604 00:37:46,100 --> 00:37:50,610 as sort of a, from the singular value decomposition, 605 00:37:50,610 --> 00:37:53,110 as one of the orthogonal matrices times the square 606 00:37:53,110 --> 00:37:57,145 of the singular values times that same matrix transpose, 607 00:37:57,145 --> 00:37:59,390 if you recall that definition. 608 00:37:59,390 --> 00:38:02,420 So that actually is-- it basically 609 00:38:02,420 --> 00:38:06,590 provides a solution for X transpose X inverse, indeed, 610 00:38:06,590 --> 00:38:08,810 from the singular value decomposition of X. 611 00:38:08,810 --> 00:38:11,910 But what's required is that you have a full rank in X. 612 00:38:11,910 --> 00:38:14,450 And what that means is that you can't have 613 00:38:14,450 --> 00:38:20,120 independent variables that are explained 614 00:38:20,120 --> 00:38:21,950 by other independent variables. 615 00:38:21,950 --> 00:38:29,010 So different columns of X have to be linear-- 616 00:38:29,010 --> 00:38:32,670 or they can't linearly depend on any other columns of X. 617 00:38:32,670 --> 00:38:34,670 Otherwise, you would have reduced rank. 618 00:38:37,460 --> 00:38:44,570 So now if beta hat doesn't have full rank, 619 00:38:44,570 --> 00:38:49,380 then our least squares estimate of beta might be non-unique. 620 00:38:49,380 --> 00:38:53,280 And in fact, it is the case that if you 621 00:38:53,280 --> 00:38:55,600 are really interested in just predicting 622 00:38:55,600 --> 00:38:59,180 values of a dependent variable, then 623 00:38:59,180 --> 00:39:03,440 having non-unique least squares estimates 624 00:39:03,440 --> 00:39:05,530 isn't as much of a problem, because you still 625 00:39:05,530 --> 00:39:07,090 get estimates out of that. 626 00:39:07,090 --> 00:39:11,302 But for now, we want to assume that there's full column rank 627 00:39:11,302 --> 00:39:12,510 in the independent variables. 628 00:39:16,010 --> 00:39:17,510 All right. 629 00:39:17,510 --> 00:39:30,100 Now, if we plug in the value of the solution for the least 630 00:39:30,100 --> 00:39:34,780 squares estimate, we get fitted values 631 00:39:34,780 --> 00:39:41,960 for the response variable, which are simply the matrix X times 632 00:39:41,960 --> 00:39:44,370 beta hat. 633 00:39:44,370 --> 00:39:52,070 And this expression for the fitted values 634 00:39:52,070 --> 00:39:58,570 is basically X times X transpose X inverse X transpose y, 635 00:39:58,570 --> 00:40:01,330 which we can represent as Hy. 636 00:40:01,330 --> 00:40:08,160 Basically, this H matrix in linear models and statistics 637 00:40:08,160 --> 00:40:09,770 is called the hat matrix. 638 00:40:09,770 --> 00:40:12,310 It's basically a projection matrix 639 00:40:12,310 --> 00:40:19,120 that takes the linear vector, or the vector of values 640 00:40:19,120 --> 00:40:24,290 of the response variable, into the fitted values. 641 00:40:24,290 --> 00:40:30,713 So this hat matrix is quite important. 642 00:40:34,640 --> 00:40:37,745 The problem set's going to cover some features, 643 00:40:37,745 --> 00:40:39,695 go into some properties of the hat matrix. 644 00:40:42,790 --> 00:40:49,180 Does anyone want to make any comments about this hat matrix? 645 00:40:49,180 --> 00:40:52,290 It's actually a very special type of matrix. 646 00:40:52,290 --> 00:40:56,230 Does anyone want to point out what that special type is? 647 00:41:00,420 --> 00:41:02,811 It's a projection matrix, OK. 648 00:41:02,811 --> 00:41:03,310 Yeah. 
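[Here is a short NumPy sketch of the hat matrix and its projection properties. X is simulated; the symmetry, idempotence, and trace facts checked below are standard properties of orthogonal projections, noted here without proof, and the use of the diagonal of H as a leverage/influence diagnostic anticipates the problem set mentioned earlier.]

```python
import numpy as np

# Hat matrix H = X (X'X)^{-1} X' for a simulated design matrix.
rng = np.random.default_rng(2)
n, p = 20, 3
X = rng.normal(size=(n, p))
H = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(H, H.T))          # symmetric
print(np.allclose(H @ H, H))        # idempotent: projecting twice = projecting once
print(np.isclose(np.trace(H), p))   # trace equals p, the dimension of the column space

# Fitted values are the projection of y onto the column space of X.
y = rng.normal(size=n)
y_hat = H @ y

# The diagonal entries h_ii ("leverage") are one standard influence
# diagnostic: a large h_ii flags a case with outsized pull on its own fit.
leverage = np.diag(H)
print(leverage.sum())               # the leverages also sum to p
```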
649 00:41:03,310 --> 00:41:08,490 And in linear algebra, projection matrices 650 00:41:08,490 --> 00:41:11,970 have some very special properties. 651 00:41:11,970 --> 00:41:17,400 And it's actually an orthogonal projection matrix. 652 00:41:17,400 --> 00:41:24,030 And so if you're interested in that feature, 653 00:41:24,030 --> 00:41:25,290 you should look into that. 654 00:41:25,290 --> 00:41:30,635 But it's really a very rich set of properties associated 655 00:41:30,635 --> 00:41:31,510 with this hat matrix. 656 00:41:31,510 --> 00:41:36,970 It's an orthogonal projection, and it's-- let's see. 657 00:41:36,970 --> 00:41:38,180 What's it projecting? 658 00:41:38,180 --> 00:41:40,600 It's projecting from n-space into what? 659 00:41:44,570 --> 00:41:45,150 Go ahead. 660 00:41:45,150 --> 00:41:46,396 What's your name? 661 00:41:46,396 --> 00:41:47,062 AUDIENCE: Ethan. 662 00:41:47,062 --> 00:41:47,870 PROFESSOR: Ethan, OK. 663 00:41:47,870 --> 00:41:49,203 AUDIENCE: Into space [INAUDIBLE] 664 00:41:51,250 --> 00:41:52,375 PROFESSOR: Basically, yeah. 665 00:41:52,375 --> 00:41:55,930 It's projecting into the column space of X. 666 00:41:55,930 --> 00:42:00,730 So that's what linear regression is doing. 667 00:42:00,730 --> 00:42:05,220 So in focusing and understanding linear regression, 668 00:42:05,220 --> 00:42:08,620 you can think of, how do we get estimates of this p-vector? 669 00:42:08,620 --> 00:42:11,880 That's all very good and useful, and we'll do a lot of that. 670 00:42:11,880 --> 00:42:13,880 But you can also think of it as, what's 671 00:42:13,880 --> 00:42:15,750 happening in the n-dimensional space? 672 00:42:15,750 --> 00:42:17,660 So you basically are representing 673 00:42:17,660 --> 00:42:21,960 this n-dimensional vector y by its projection 674 00:42:21,960 --> 00:42:23,180 onto the column space. 675 00:42:29,730 --> 00:42:32,920 Now, the residuals are basically the difference 676 00:42:32,920 --> 00:42:38,320 between the response value and the fitted value. 677 00:42:38,320 --> 00:42:43,960 And this can be expressed as y minus y hat, 678 00:42:43,960 --> 00:42:48,370 or I_n minus H times y. 679 00:42:48,370 --> 00:42:58,700 And it turns out that I_n minus H is also a projection matrix, 680 00:42:58,700 --> 00:43:03,950 and it's projecting the data onto the space orthogonal 681 00:43:03,950 --> 00:43:06,680 to the column space of x. 682 00:43:06,680 --> 00:43:11,980 And to show that that's true, if we consider 683 00:43:11,980 --> 00:43:16,420 the normal equations, which are X transpose y minus X beta 684 00:43:16,420 --> 00:43:20,890 hat equaling 0, that basically is X transpose epsilon hat 685 00:43:20,890 --> 00:43:22,310 equals 0. 686 00:43:22,310 --> 00:43:25,410 And so from the normal equations, 687 00:43:25,410 --> 00:43:27,610 we can see that what they mean is 688 00:43:27,610 --> 00:43:33,380 they mean that the residual vector epsilon hat is 689 00:43:33,380 --> 00:43:36,550 orthogonal to each of the columns of X. 690 00:43:36,550 --> 00:43:38,570 You can take any column in X, multiply that 691 00:43:38,570 --> 00:43:42,210 by the residual vector, and get 0 coming out. 692 00:43:42,210 --> 00:43:47,760 So that's a feature of the residuals 693 00:43:47,760 --> 00:43:51,660 as they relate to the independent variables. 694 00:43:51,660 --> 00:43:53,940 OK, all right. 695 00:43:53,940 --> 00:44:00,520 So at this point, we've gone through really not talking 696 00:44:00,520 --> 00:44:03,669 about any statistical properties to specify the betas. 
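[A small numerical check of this orthogonality, with simulated data; the dimensions and tolerances are arbitrary choices for the sketch.]

```python
import numpy as np

# Check that the residual vector is orthogonal to the column space of X,
# which is a direct restatement of the normal equations.
rng = np.random.default_rng(3)
n, p = 50, 4
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - H                  # I_n - H, the complementary projection
resid = M @ y                      # epsilon_hat = y - y_hat

print(np.allclose(M @ M, M))                         # I_n - H is also a projection
print(np.allclose(X.T @ resid, 0.0, atol=1e-8))      # X' epsilon_hat = 0
print(np.allclose((H @ y) @ resid, 0.0, atol=1e-8))  # y_hat is orthogonal to epsilon_hat
```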
697 00:44:03,669 --> 00:44:06,210 All we've done is talked-- we've introduced the least squares 698 00:44:06,210 --> 00:44:09,770 criterion and said, what value of the beta vector 699 00:44:09,770 --> 00:44:13,260 minimizes that least squares criterion? 700 00:44:13,260 --> 00:44:17,470 Let's turn to the Gauss-Markov theorem 701 00:44:17,470 --> 00:44:22,240 and start introducing some statistical properties, 702 00:44:22,240 --> 00:44:24,190 probability properties. 703 00:44:24,190 --> 00:44:28,945 So with our data, y and X-- yes? 704 00:44:28,945 --> 00:44:29,445 Yes. 705 00:44:29,445 --> 00:44:30,361 AUDIENCE: [INAUDIBLE]? 706 00:44:36,110 --> 00:44:37,268 PROFESSOR: That epsilon-- 707 00:44:37,268 --> 00:44:38,184 AUDIENCE: [INAUDIBLE]? 708 00:44:41,480 --> 00:44:42,752 PROFESSOR: OK. 709 00:44:42,752 --> 00:44:43,710 Let me go back to that. 710 00:44:47,830 --> 00:44:53,500 It's that X, the columns of X, and the column 711 00:44:53,500 --> 00:44:59,660 vector of the residual are orthogonal to each other. 712 00:44:59,660 --> 00:45:04,330 So we're not doing a projection onto a null space. 713 00:45:04,330 --> 00:45:11,440 This is just a statement that those values, or those column 714 00:45:11,440 --> 00:45:15,160 vectors, are orthogonal to each other. 715 00:45:15,160 --> 00:45:22,010 And just to recap, the epsilon is a projection of y 716 00:45:22,010 --> 00:45:27,060 onto the space orthogonal to the column space. 717 00:45:27,060 --> 00:45:32,788 And y hat is a projection onto the column space of y. 718 00:45:32,788 --> 00:45:37,405 And these projections are all orthogonal projections, 719 00:45:37,405 --> 00:45:45,740 and so they happen to result in the projected value epsilon hat 720 00:45:45,740 --> 00:45:48,770 must be orthogonal to the column space of X, 721 00:45:48,770 --> 00:45:53,080 if you project it out. 722 00:45:53,080 --> 00:45:54,300 OK? 723 00:45:54,300 --> 00:45:55,130 All right. 724 00:45:55,130 --> 00:46:02,240 So the Gauss-Markov theorem, we have data y and X again. 725 00:46:02,240 --> 00:46:05,810 And now we're going to think of the observed data, 726 00:46:05,810 --> 00:46:08,640 little y_1 through y_n, is actually 727 00:46:08,640 --> 00:46:12,050 an observation of the random vector capital 728 00:46:12,050 --> 00:46:19,610 Y, composed of random variables Y_1 up to Y_n. 729 00:46:19,610 --> 00:46:25,120 And the expectation of this vector 730 00:46:25,120 --> 00:46:28,740 conditional on the values of the independent variables 731 00:46:28,740 --> 00:46:30,680 and their regression parameters given by X, 732 00:46:30,680 --> 00:46:36,490 beta-- so the dependent variable vector 733 00:46:36,490 --> 00:46:40,880 has expectation given by the product 734 00:46:40,880 --> 00:46:43,340 of the independent variables matrix times the regression 735 00:46:43,340 --> 00:46:44,750 parameters. 736 00:46:44,750 --> 00:46:48,425 And the covariance matrix of Y given X and beta 737 00:46:48,425 --> 00:46:50,930 is sigma squared times the identity 738 00:46:50,930 --> 00:46:54,330 matrix, the n-dimensional identity matrix. 739 00:46:54,330 --> 00:46:58,120 So the identity matrix has 1's along the diagonal, 740 00:46:58,120 --> 00:47:00,480 n-dimensional, and 0's off the diagonal. 741 00:47:00,480 --> 00:47:06,980 So the variances of the Y's are the diagonal entries, 742 00:47:06,980 --> 00:47:08,830 those are all the same, sigma squared. 743 00:47:08,830 --> 00:47:13,323 And the covariance between any two are equal to 0 744 00:47:13,323 --> 00:47:13,906 conditionally. 
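[Written out, the assumptions just stated are E[Y | X] = X beta and Cov(Y | X) = sigma^2 I_n; equivalently, Y = X beta + epsilon with E[epsilon | X] = 0 and Cov(epsilon | X) = sigma^2 I_n, where I_n is the n-by-n identity matrix.]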
745 00:47:21,530 --> 00:47:25,260 OK, now the Gauss-Markov theorem. 746 00:47:25,260 --> 00:47:31,790 This is a terrific result in linear models theory. 747 00:47:31,790 --> 00:47:37,790 And it's terrific in terms of the mathematical content of it. 748 00:47:37,790 --> 00:47:43,850 I think it's-- for a math class, it's really a nice theorem 749 00:47:43,850 --> 00:47:51,110 to introduce you to and highlight the power of, I 750 00:47:51,110 --> 00:47:55,570 guess, results that can arise from applying the theory. 751 00:47:55,570 --> 00:48:00,120 And so to set this theorem up, we 752 00:48:00,120 --> 00:48:05,710 want to think about trying to estimate some function 753 00:48:05,710 --> 00:48:07,770 of the regression parameters. 754 00:48:07,770 --> 00:48:14,060 And so OK, our problem is with ordinary least squares-- 755 00:48:14,060 --> 00:48:16,300 it was, how do we specify the regression parameters 756 00:48:16,300 --> 00:48:18,180 beta_1 through beta_p? 757 00:48:18,180 --> 00:48:23,510 Let's consider a general target of interest, 758 00:48:23,510 --> 00:48:26,660 which is a linear combination of the betas. 759 00:48:26,660 --> 00:48:32,090 So we want to predict a parameter theta which 760 00:48:32,090 --> 00:48:37,530 is some linear combination of the regression parameters. 761 00:48:37,530 --> 00:48:42,000 And because that linear combination of the regression 762 00:48:42,000 --> 00:48:49,920 parameters corresponds to the expectation of the response 763 00:48:49,920 --> 00:48:51,840 variable corresponding to a given 764 00:48:51,840 --> 00:48:53,970 row of the independent variables matrix, 765 00:48:53,970 --> 00:48:55,690 this is just a generalization of trying 766 00:48:55,690 --> 00:48:59,240 to estimate the means of the regression model 767 00:48:59,240 --> 00:49:01,410 at different points in the space, 768 00:49:01,410 --> 00:49:06,720 or to be estimating other quantities that might arise. 769 00:49:06,720 --> 00:49:09,340 So this is really a very general kind of thing 770 00:49:09,340 --> 00:49:10,540 to want to estimate. 771 00:49:10,540 --> 00:49:13,560 It certainly is appropriate for predictions. 772 00:49:13,560 --> 00:49:18,530 And if we consider the least squares estimate 773 00:49:18,530 --> 00:49:22,760 by just plugging in beta hat one through beta hat p, solved 774 00:49:22,760 --> 00:49:28,990 by the least squares, well, it turns out 775 00:49:28,990 --> 00:49:36,900 that those are an unbiased estimator of the parameter 776 00:49:36,900 --> 00:49:37,812 theta. 777 00:49:37,812 --> 00:49:39,770 So if we're trying to estimate this combination 778 00:49:39,770 --> 00:49:42,580 of these unknown parameters, you plug in the least squares 779 00:49:42,580 --> 00:49:47,350 estimate, you're going to get an estimator that's unbiased. 780 00:49:47,350 --> 00:49:49,330 Who can tell me what unbiased is? 781 00:49:49,330 --> 00:49:54,770 It's probably going to be a new concept for some people here. 782 00:49:54,770 --> 00:49:56,000 Anyone? 783 00:49:56,000 --> 00:49:59,810 OK, well it's a basic property of estimators 784 00:49:59,810 --> 00:50:04,800 in statistics where the expectation of this statistic 785 00:50:04,800 --> 00:50:06,720 is the true parameter. 786 00:50:06,720 --> 00:50:10,680 So it doesn't, on average, probabilistically, it 787 00:50:10,680 --> 00:50:12,860 doesn't over- or underestimate the value. 788 00:50:12,860 --> 00:50:14,740 So that's what unbiased means. 
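[A Monte Carlo sketch of the unbiasedness claim: for a fixed linear combination theta = c'beta, the plug-in estimator c'beta_hat averages out to theta over repeated samples drawn under the Gauss-Markov assumptions. The design X, the coefficients beta, the combination c, and sigma are all arbitrary choices for the demonstration.]

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 40, 3, 1.0
X = rng.normal(size=(n, p))
beta = np.array([0.5, -1.0, 2.0])
c = np.array([1.0, 1.0, 0.0])
theta = c @ beta                           # target: a linear combination of beta

proj = np.linalg.inv(X.T @ X) @ X.T        # maps y to beta_hat (X held fixed)
estimates = []
for _ in range(20_000):                    # repeated samples under Gauss-Markov
    y = X @ beta + rng.normal(scale=sigma, size=n)
    estimates.append(c @ (proj @ y))       # theta_hat = c' beta_hat, linear in y

print(theta, np.mean(estimates))           # the Monte Carlo average is close to theta
```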
789 00:50:14,740 --> 00:50:16,520 Now, it's also a linear estimator 790 00:50:16,520 --> 00:50:21,270 of theta in terms of this theta hat 791 00:50:21,270 --> 00:50:23,810 being a particular linear combination 792 00:50:23,810 --> 00:50:26,750 of the dependent variables. 793 00:50:26,750 --> 00:50:31,350 So with our original response variable y, 794 00:50:31,350 --> 00:50:37,510 in the case of y_1 through y_n, this theta hat is simply 795 00:50:37,510 --> 00:50:40,530 a linear combination of all the y's. 796 00:50:40,530 --> 00:50:42,560 And now why is that true? 797 00:50:42,560 --> 00:50:49,246 Well, we know that beta hat, from the normal equations, 798 00:50:49,246 --> 00:50:52,490 is solved by X transpose X inverse X transpose y. 799 00:50:52,490 --> 00:50:56,200 So it's a linear transform of the y vector. 800 00:50:56,200 --> 00:50:57,780 So if we take a linear combination 801 00:50:57,780 --> 00:51:00,480 of those components, it's also another linear combination 802 00:51:00,480 --> 00:51:01,490 of the y vector. 803 00:51:01,490 --> 00:51:06,090 So this is a linear function of the underlying-- 804 00:51:06,090 --> 00:51:08,830 of the response variables. 805 00:51:08,830 --> 00:51:12,130 Now, the Gauss-Markov theorem says 806 00:51:12,130 --> 00:51:16,350 that, if the Gauss-Markov assumptions apply, 807 00:51:16,350 --> 00:51:20,230 then the estimator theta hat has the smallest variance 808 00:51:20,230 --> 00:51:25,750 amongst all linear unbiased estimators of theta. 809 00:51:25,750 --> 00:51:28,910 So it actually is the optimal one, 810 00:51:28,910 --> 00:51:32,070 as long as this is our criterion. 811 00:51:32,070 --> 00:51:36,340 And this is really a very powerful result. 812 00:51:36,340 --> 00:51:42,482 And to prove it, it's very easy. 813 00:51:42,482 --> 00:51:45,610 Let's see. 814 00:51:45,610 --> 00:51:47,750 Actually, these notes are going to be distributed. 815 00:51:47,750 --> 00:51:53,890 So I'm going to go through this very, very quickly 816 00:51:53,890 --> 00:51:56,300 and come back to it later if we have more time. 817 00:51:56,300 --> 00:52:01,670 But basically, the argument for the proof here 818 00:52:01,670 --> 00:52:05,750 is you consider another linear estimate which 819 00:52:05,750 --> 00:52:08,410 is also an unbiased estimate. 820 00:52:08,410 --> 00:52:13,540 So let's consider a competitor to the least squares value 821 00:52:13,540 --> 00:52:17,710 and then look at the difference between that estimator 822 00:52:17,710 --> 00:52:20,960 and theta hat. 823 00:52:20,960 --> 00:52:27,880 And so that can be characterized as basically this vector, 824 00:52:27,880 --> 00:52:29,280 f transpose y. 825 00:52:32,010 --> 00:52:35,970 And this difference in the estimates 826 00:52:35,970 --> 00:52:40,030 must have expectation 0. 827 00:52:40,030 --> 00:52:42,790 So basically, if theta tilde is unbiased, 828 00:52:42,790 --> 00:52:46,200 then this expression here is going 829 00:52:46,200 --> 00:52:50,580 to be equal to zero, which means that f-- 830 00:52:50,580 --> 00:52:57,100 the vector that defines the difference 831 00:52:57,100 --> 00:52:59,230 between the two estimators-- 832 00:52:59,230 --> 00:53:02,120 has to be orthogonal to the column space of X. 833 00:53:02,120 --> 00:53:13,350 And with this result, one then uses 834 00:53:13,350 --> 00:53:17,530 this orthogonality of f and d to evaluate 835 00:53:17,530 --> 00:53:20,040 the variance of theta tilde.
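Before continuing with the proof sketch, the claim of the theorem can be checked directly with a small numerical example. The sketch below is not from the lecture; the design matrix, the target combination c, and the competing estimator (a weighted least squares fit with arbitrary weights, which is still linear and unbiased) are all made up. It compares the exact sampling variances of the two estimators of c'beta under the Gauss-Markov assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma2 = 50, 3, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
c = np.array([0.0, 1.0, 1.0])              # target is theta = c' beta

# Ordinary least squares: Var(c' beta_hat) = sigma^2 * c' (X'X)^{-1} c
XtX_inv = np.linalg.inv(X.T @ X)
var_ols = sigma2 * c @ XtX_inv @ c

# A competing linear unbiased estimator: weighted least squares with arbitrary weights W.
# Writing beta_tilde = M y, we have M X = I, so E[c' beta_tilde] = c' beta (unbiased).
W = np.diag(rng.uniform(0.2, 5.0, n))
M = np.linalg.inv(X.T @ W @ X) @ X.T @ W
var_wls = sigma2 * c @ (M @ M.T) @ c       # Cov(y) = sigma^2 I  =>  Var(c' M y) = sigma^2 c' M M' c

print(var_ols, var_wls)                    # var_ols <= var_wls, as the theorem guarantees
```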
836 00:53:20,040 --> 00:53:23,760 And in this proof, the mathematical argument 837 00:53:23,760 --> 00:53:29,535 here is really something-- I should put some asterisks 838 00:53:29,535 --> 00:53:31,140 on a few lines here. 839 00:53:31,140 --> 00:53:35,850 This expression here is actually very important. 840 00:53:35,850 --> 00:53:40,120 We're basically looking at the decomposition 841 00:53:40,120 --> 00:53:43,250 of the variance to be the variance of b 842 00:53:43,250 --> 00:53:46,890 transpose y, which is the variance of the sum 843 00:53:46,890 --> 00:53:49,350 of these two random variables. 844 00:53:49,350 --> 00:53:55,880 So the page before basically defined d and f 845 00:53:55,880 --> 00:53:57,560 such that this is true. 846 00:53:57,560 --> 00:54:02,370 Now when you consider the variance of a sum, 847 00:54:02,370 --> 00:54:05,870 it's not the sum of the variances. 848 00:54:05,870 --> 00:54:09,730 It's the sum of the variances plus twice 849 00:54:09,730 --> 00:54:12,170 the sum of the covariances. 850 00:54:12,170 --> 00:54:18,795 And so when you are calculating variances 851 00:54:18,795 --> 00:54:21,430 of sums of random variables, you have to really keep 852 00:54:21,430 --> 00:54:24,180 track of the covariance terms. 853 00:54:24,180 --> 00:54:26,200 In this case, this argument shows 854 00:54:26,200 --> 00:54:29,320 that the covariance terms are, in fact, 0, 855 00:54:29,320 --> 00:54:32,804 and you get the result popping out. 856 00:54:32,804 --> 00:54:38,890 But that's really a-- in an econometrics class, 857 00:54:38,890 --> 00:54:42,934 they'll talk about BLUE estimates of regression, 858 00:54:42,934 --> 00:54:45,100 or the BLUE property of the least squares estimates. 859 00:54:45,100 --> 00:54:47,260 That's where that comes from. 860 00:54:47,260 --> 00:54:53,090 All right, so let's now consider generalizing from Gauss-Markov 861 00:54:53,090 --> 00:55:04,660 to allow for unequal variances and possibly correlated 862 00:55:04,660 --> 00:55:08,710 nonzero covariances between the components. 863 00:55:08,710 --> 00:55:13,150 And in this case, the regression model 864 00:55:13,150 --> 00:55:14,560 has the same linear set up. 865 00:55:14,560 --> 00:55:16,970 The only difference is the expectation 866 00:55:16,970 --> 00:55:19,730 of the residual vector is still 0. 867 00:55:19,730 --> 00:55:22,920 But the covariance matrix of the residual vector 868 00:55:22,920 --> 00:55:26,330 is sigma squared, a single parameter, 869 00:55:26,330 --> 00:55:29,850 times let's say capital sigma. 870 00:55:29,850 --> 00:55:33,530 And we'll assume here that this capital sigma 871 00:55:33,530 --> 00:55:39,000 matrix is a known n by n positive definite matrix 872 00:55:39,000 --> 00:55:41,204 specifying relative variances and correlations 873 00:55:41,204 --> 00:55:42,245 between the observations. 874 00:55:47,570 --> 00:55:48,070 OK. 875 00:55:51,500 --> 00:55:59,392 Well, in order to solve for regression estimates 876 00:55:59,392 --> 00:56:04,400 under these generalized Gauss-Markov assumptions, 877 00:56:04,400 --> 00:56:08,750 we can transform the data Y, X to Y star 878 00:56:08,750 --> 00:56:13,170 equals sigma to the minus 1/2 y and X 879 00:56:13,170 --> 00:56:17,670 to X star, which is sigma to the minus 1/2 x. 880 00:56:17,670 --> 00:56:24,470 And this model then becomes a model, a linear regression 881 00:56:24,470 --> 00:56:29,640 model, in terms of Y star and X star. 
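Here is a minimal sketch of that whitening transformation, with a made-up known covariance matrix Sigma (an AR(1)-style correlation pattern chosen purely for illustration). Transforming y and X by the symmetric square root Sigma^(-1/2) and then running ordinary least squares on the starred data gives the same answer as the direct generalized least squares formula (X' Sigma^(-1) X)^(-1) X' Sigma^(-1) y, which is the formula discussed next.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -0.5])

# A known positive definite Sigma (AR(1)-style correlations), purely illustrative.
Sigma = 0.7 ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))

# Generate correlated residuals with covariance sigma^2 * Sigma (sigma^2 = 1 here).
L = np.linalg.cholesky(Sigma)
y = X @ beta + L @ rng.normal(size=n)

# Symmetric square root Sigma^{-1/2} via the eigendecomposition.
w, V = np.linalg.eigh(Sigma)
Sigma_inv_half = V @ np.diag(w ** -0.5) @ V.T

# The whitened (starred) data satisfy the original Gauss-Markov assumptions.
X_star, y_star = Sigma_inv_half @ X, Sigma_inv_half @ y
beta_gls = np.linalg.solve(X_star.T @ X_star, X_star.T @ y_star)

# Same answer from the direct formula (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y.
Sigma_inv = np.linalg.inv(Sigma)
beta_direct = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)
print(np.allclose(beta_gls, beta_direct))   # True
```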
882 00:56:29,640 --> 00:56:33,460 We're basically multiplying this regression model by sigma 883 00:56:33,460 --> 00:56:36,320 to the minus 1/2 across. 884 00:56:36,320 --> 00:56:43,700 And epsilon star actually has a covariance matrix 885 00:56:43,700 --> 00:56:45,770 equal to sigma squared times the identity. 886 00:56:45,770 --> 00:56:50,380 So if we just take a linear transformation 887 00:56:50,380 --> 00:56:56,260 of the original data, we get a representation 888 00:56:56,260 --> 00:56:57,890 of the regression model that satisfies 889 00:56:57,890 --> 00:57:00,880 the original Gauss-Markov assumptions. 890 00:57:00,880 --> 00:57:03,410 And what we had to do was basically 891 00:57:03,410 --> 00:57:05,920 do a linear transformation that makes the response 892 00:57:05,920 --> 00:57:10,500 variables all have constant variance and be uncorrelated. 893 00:57:13,980 --> 00:57:19,740 So with that, we then have the least squares estimate of beta 894 00:57:19,740 --> 00:57:23,850 is the least squares, the ordinary least squares, 895 00:57:23,850 --> 00:57:26,250 in terms of Y star and X star. 896 00:57:26,250 --> 00:57:31,910 And so plugging that in, we then have X star transpose X star 897 00:57:31,910 --> 00:57:34,310 inverse X star transpose Y star. 898 00:57:34,310 --> 00:57:37,130 And if you multiply through, that's how the formula changes. 899 00:57:41,600 --> 00:57:45,600 So this formula characterizing the least squares estimate 900 00:57:45,600 --> 00:57:48,850 under this generalized set of assumptions 901 00:57:48,850 --> 00:57:55,010 highlights what you need to do to be 902 00:57:55,010 --> 00:57:56,900 able to apply that theorem. 903 00:57:56,900 --> 00:58:03,140 So with response values that have very large variances, 904 00:58:03,140 --> 00:58:07,540 you basically want to discount those by the sigma inverse. 905 00:58:10,275 --> 00:58:14,280 And that's part of the way in which these generalized least 906 00:58:14,280 --> 00:58:15,940 squares work. 907 00:58:15,940 --> 00:58:17,120 All right. 908 00:58:17,120 --> 00:58:19,960 So now let's turn to distribution theory 909 00:58:19,960 --> 00:58:21,683 for normal regression models. 910 00:58:25,480 --> 00:58:28,820 Let's assume that the residuals are 911 00:58:28,820 --> 00:58:32,175 normals with mean 0 and variance sigma squared. 912 00:58:37,880 --> 00:58:42,360 OK, conditioning on the values of the independent variable, 913 00:58:42,360 --> 00:58:44,240 the Y's, the response variables, are 914 00:58:44,240 --> 00:58:49,932 going to be independent over the index i. 915 00:58:49,932 --> 00:58:51,890 They're not going to be identically distributed 916 00:58:51,890 --> 00:58:54,920 because they have different means, mu_i 917 00:58:54,920 --> 00:58:58,820 for the dependent variable Y_i, but they 918 00:58:58,820 --> 00:59:02,060 will have a constant variance. 919 00:59:02,060 --> 00:59:10,930 And what we can do is basically condition on X, beta, 920 00:59:10,930 --> 00:59:14,320 and sigma squared and then represent 921 00:59:14,320 --> 00:59:20,280 this model in terms of the distribution of the epsilons. 922 00:59:20,280 --> 00:59:23,140 So if we're conditioning on x and beta, 923 00:59:23,140 --> 00:59:27,950 this X beta is a constant, known, we've conditioned on it. 924 00:59:27,950 --> 00:59:31,520 And the remaining uncertainty is in the residual vector, 925 00:59:31,520 --> 00:59:36,590 which is assumed to be all independent 926 00:59:36,590 --> 00:59:39,064 and identically distributed normal random variables. 
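Returning for a moment to the remark above about discounting high-variance observations: in the special case where the capital sigma matrix is diagonal, the generalized least squares estimate is just a weighted least squares fit with weights equal to the reciprocal relative variances. A rough sketch with made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 80
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])

# Heteroskedastic case: Sigma is diagonal, so observation i has variance sigma^2 * v_i.
v = rng.uniform(0.5, 10.0, n)                    # known relative variances (illustrative)
y = X @ beta + rng.normal(scale=np.sqrt(v))      # sigma^2 = 1 for simplicity

# Generalized least squares with diagonal Sigma is weighted least squares
# with weights 1 / v_i: the noisy observations get discounted.
Winv = np.diag(1.0 / v)
beta_wls = np.linalg.solve(X.T @ Winv @ X, X.T @ Winv @ y)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_wls, beta_ols)    # both roughly [1, 2]; the weighted fit is the more efficient one
```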
927 00:59:39,064 --> 00:59:40,480 Now, this is the first time you'll 928 00:59:40,480 --> 00:59:48,400 see this notation, capital N sub little n, for a random vector. 929 00:59:48,400 --> 00:59:53,090 It's a multivariate normal random variable 930 00:59:53,090 --> 00:59:57,540 where you consider an n-vector where each component is 931 00:59:57,540 --> 01:00:00,340 normally distributed, with mean given 932 01:00:00,340 --> 01:00:04,090 by some corresponding mean vector, 933 01:00:04,090 --> 01:00:10,200 and a covariance matrix given by a covariance matrix. 934 01:00:10,200 --> 01:00:16,150 In terms of independent and identically distributed values, 935 01:00:16,150 --> 01:00:21,250 the probability structure here is totally well-defined. 936 01:00:21,250 --> 01:00:24,760 Anyone here who's taken a beginning probability class 937 01:00:24,760 --> 01:00:26,420 knows what the density function is 938 01:00:26,420 --> 01:00:28,340 for this multivariate normal distribution 939 01:00:28,340 --> 01:00:32,180 because it's the product of the independent density 940 01:00:32,180 --> 01:00:34,960 functions for the independent components, 941 01:00:34,960 --> 01:00:37,170 because they're all independent random variables. 942 01:00:37,170 --> 01:00:40,190 So this multivariate normal random vector 943 01:00:40,190 --> 01:00:45,100 has a density function which you can write down, 944 01:00:45,100 --> 01:00:47,266 given your first probability class. 945 01:00:51,090 --> 01:00:54,960 OK, here I'm just highlighting or defining 946 01:00:54,960 --> 01:01:01,670 the mu vector for the means of the cases of the data. 947 01:01:01,670 --> 01:01:06,210 And the covariance matrix sigma is this diagonal matrix. 948 01:01:08,940 --> 01:01:19,450 And so basically sigma_(i,j) is equal to sigma squared times 949 01:01:19,450 --> 01:01:23,790 the Kronecker delta for the (i,j) element. 950 01:01:23,790 --> 01:01:28,940 Now what we want to do is, under the assumptions 951 01:01:28,940 --> 01:01:33,845 of normally distributed residuals, 952 01:01:33,845 --> 01:01:38,700 to solve for the distribution of the least squares estimators. 953 01:01:38,700 --> 01:01:41,550 We want to know, basically, what kind of distribution 954 01:01:41,550 --> 01:01:42,819 does it have? 955 01:01:42,819 --> 01:01:44,360 Because what we want to be able to do 956 01:01:44,360 --> 01:01:46,330 is to determine whether estimates 957 01:01:46,330 --> 01:01:49,020 are particularly large or not. 958 01:01:49,020 --> 01:01:50,850 And maybe there's no structure at all 959 01:01:50,850 --> 01:01:55,080 and the regression parameters are 0 so 960 01:01:55,080 --> 01:01:57,510 that there's no dependence on a given factor. 961 01:01:57,510 --> 01:02:00,400 And we need to be able to judge how significant that is. 962 01:02:00,400 --> 01:02:03,290 So we need to know what the distribution is 963 01:02:03,290 --> 01:02:06,410 of our least squares estimate. 964 01:02:06,410 --> 01:02:09,160 So what we're going to do is apply moment generating 965 01:02:09,160 --> 01:02:11,540 functions to derive the joint distribution of y 966 01:02:11,540 --> 01:02:13,574 and the joint distribution of beta hat. 967 01:02:17,060 --> 01:02:22,560 And so Choongbum introduced the moment generating function 968 01:02:22,560 --> 01:02:26,821 for individual random variables for single-variate random 969 01:02:26,821 --> 01:02:27,320 variables. 
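Before moving on to moment generating functions, here is a quick numerical check of the factorization point just made: the density of an N_n(mu, sigma^2 I) vector is the product of n univariate normal densities. All of the numbers below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma = 5, 0.8
mu = rng.normal(size=n)            # made-up mean vector
y = mu + sigma * rng.normal(size=n)

# Product of the n univariate normal densities...
prod_of_univariate = np.prod(np.exp(-(y - mu) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2))

# ...equals the n-variate normal density with covariance sigma^2 * I.
Sigma = sigma**2 * np.eye(n)
quad = (y - mu) @ np.linalg.solve(Sigma, y - mu)
joint = np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))

print(np.isclose(prod_of_univariate, joint))   # True
```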
970 01:02:27,320 --> 01:02:30,100 For n-variate random variables, we 971 01:02:30,100 --> 01:02:35,190 can define the moment generating function of the Y vector 972 01:02:35,190 --> 01:02:38,990 to be the expectation of e to the t transpose Y. 973 01:02:38,990 --> 01:02:41,840 So t is an argument of the moment generating function. 974 01:02:41,840 --> 01:02:43,660 It's another n-vector. 975 01:02:43,660 --> 01:02:46,660 And it's equal to the expectation of e to the t_1 Y_1 976 01:02:46,660 --> 01:02:48,840 plus t_2 Y_2 up to t_n Y_n. 977 01:02:48,840 --> 01:02:53,930 So this is a very simple definition. 978 01:02:53,930 --> 01:02:58,260 Because of independence, the expectation 979 01:02:58,260 --> 01:03:02,380 of the products, or this exponential sum 980 01:03:02,380 --> 01:03:05,240 is the product of the exponentials. 981 01:03:05,240 --> 01:03:08,840 And so this moment generating function is simply 982 01:03:08,840 --> 01:03:12,250 the product of the moment generating functions for Y_1 983 01:03:12,250 --> 01:03:15,190 up through Y_n. 984 01:03:15,190 --> 01:03:18,180 And I think-- I don't know if it was in the first problem set 985 01:03:18,180 --> 01:03:22,652 or in the first lecture, but e to the t_i mu_i plus a half t_i 986 01:03:22,652 --> 01:03:24,610 squared sigma squared was the moment generating 987 01:03:24,610 --> 01:03:28,310 function for the single univariate 988 01:03:28,310 --> 01:03:30,710 normal random variable, mean mu_i and variance sigma 989 01:03:30,710 --> 01:03:32,340 squared. 990 01:03:32,340 --> 01:03:36,780 And so if we have n of these, we take their product. 991 01:03:36,780 --> 01:03:39,720 And the moment generating function 992 01:03:39,720 --> 01:03:44,700 for y is simply e to the t transpose mu plus 1/2 993 01:03:44,700 --> 01:03:48,090 t transpose sigma t. 994 01:03:48,090 --> 01:03:53,020 And so for this multivariate normal distribution, 995 01:03:53,020 --> 01:03:57,630 this is its moment generating function. 996 01:03:57,630 --> 01:04:06,070 And this happens to be-- the distribution of y 997 01:04:06,070 --> 01:04:10,165 is a multivariate normal with mean mu and covariance matrix 998 01:04:10,165 --> 01:04:11,940 sigma. 999 01:04:11,940 --> 01:04:16,080 So a fact that we're going to use 1000 01:04:16,080 --> 01:04:19,970 is that if we're working with multivariate normal random 1001 01:04:19,970 --> 01:04:23,470 variables, this is the structure of their moment generating 1002 01:04:23,470 --> 01:04:24,520 functions. 1003 01:04:24,520 --> 01:04:26,870 And so if we solve for the moment generation 1004 01:04:26,870 --> 01:04:29,280 function of some other item of interest 1005 01:04:29,280 --> 01:04:31,840 and recognize that it has the same form, 1006 01:04:31,840 --> 01:04:35,237 we can conclude that it's also a multivariate normal random 1007 01:04:35,237 --> 01:04:35,736 variable. 1008 01:04:39,440 --> 01:04:41,330 So let's do that. 1009 01:04:41,330 --> 01:04:44,720 Let's solve for the moment generation 1010 01:04:44,720 --> 01:04:48,120 function of the least squares estimate, beta hat. 1011 01:04:48,120 --> 01:04:50,980 Now rather than dealing with an n-vector, 1012 01:04:50,980 --> 01:04:55,212 we're dealing with a p-vector of the betas, beta hats. 1013 01:04:55,212 --> 01:04:57,170 And this is simply the definition of the moment 1014 01:04:57,170 --> 01:05:00,240 generating function. 
1015 01:05:00,240 --> 01:05:06,380 If we plug in for basically what the functional form is 1016 01:05:06,380 --> 01:05:09,160 for the ordinary least squares estimates 1017 01:05:09,160 --> 01:05:13,960 and how they depend on the underlying Y, then we 1018 01:05:13,960 --> 01:05:20,480 basically-- OK, we have A equal to, essentially, 1019 01:05:20,480 --> 01:05:22,950 the linear projection of Y. That gives us the least squares 1020 01:05:22,950 --> 01:05:24,530 estimate. 1021 01:05:24,530 --> 01:05:28,800 And then we can say that this moment generating 1022 01:05:28,800 --> 01:05:34,723 function for beta hat is equal to the expectation of e 1023 01:05:34,723 --> 01:05:40,650 to the t transpose Y, where little t is A transpose tau. 1024 01:05:40,650 --> 01:05:41,960 Well, we know what this is. 1025 01:05:41,960 --> 01:05:43,543 This is the moment generating function 1026 01:05:43,543 --> 01:05:50,080 of X-- sorry, of Y-- evaluated at the vector little t. 1027 01:05:50,080 --> 01:05:54,620 So we just need to plug in little t, that expression 1028 01:05:54,620 --> 01:05:56,280 A transpose tau. 1029 01:05:56,280 --> 01:05:59,370 So let's do that. 1030 01:05:59,370 --> 01:06:03,580 And you do that and it turns out to be e to the t transpose 1031 01:06:03,580 --> 01:06:06,324 mu plus that. 1032 01:06:06,324 --> 01:06:13,450 And we go through a number of calculations. 1033 01:06:13,450 --> 01:06:16,310 And at the end of the day, we get that the moment generating 1034 01:06:16,310 --> 01:06:20,580 function is just e to the tau transpose beta plus a 1/2 tau 1035 01:06:20,580 --> 01:06:23,740 transpose this matrix tau. 1036 01:06:23,740 --> 01:06:26,050 And that is the moment generation function 1037 01:06:26,050 --> 01:06:28,490 of a multivariate normal. 1038 01:06:28,490 --> 01:06:32,280 So these few lines that you can go through after class 1039 01:06:32,280 --> 01:06:34,320 basically solve for the moment generating 1040 01:06:34,320 --> 01:06:35,960 function of beta hat. 1041 01:06:35,960 --> 01:06:39,456 And because we can recognize this as the MGF 1042 01:06:39,456 --> 01:06:44,960 of a multivariate normal, we know that that's-- beta hat is 1043 01:06:44,960 --> 01:06:48,140 a multivariate normal, with mean the true beta, 1044 01:06:48,140 --> 01:06:52,610 and covariance matrix given by the object in square brackets 1045 01:06:52,610 --> 01:06:54,390 there. 1046 01:06:54,390 --> 01:07:01,590 OK, so this is essentially the conclusion 1047 01:07:01,590 --> 01:07:05,040 of that previous analysis. 1048 01:07:05,040 --> 01:07:08,410 The marginal distribution of each of the beta hats 1049 01:07:08,410 --> 01:07:13,490 is given by beta hat-- by a univariate normal distribution 1050 01:07:13,490 --> 01:07:18,490 with mean beta_j and variance equal to the diagonal. 1051 01:07:18,490 --> 01:07:25,190 Now at this point, saying that is like an assertion. 1052 01:07:25,190 --> 01:07:28,750 But one can actually prove that very easily, 1053 01:07:28,750 --> 01:07:33,290 given this sequence of argument. 1054 01:07:33,290 --> 01:07:36,622 And can anyone tell me why this is true? 1055 01:07:43,080 --> 01:07:44,390 Let me tell you. 1056 01:07:44,390 --> 01:07:47,850 If you consider plugging in the moment generating function, 1057 01:07:47,850 --> 01:07:53,990 the value tau, where only the j-th entry is non-zero, 1058 01:07:53,990 --> 01:07:56,120 then you have the moment generating function 1059 01:07:56,120 --> 01:07:59,310 of the j-th component of beta hat. 
1060 01:07:59,310 --> 01:08:03,880 And that's a Gaussian moment generating function. 1061 01:08:03,880 --> 01:08:08,190 So the marginal distribution of the j-th component is normal. 1062 01:08:08,190 --> 01:08:10,550 So you get that almost for free from 1063 01:08:10,550 --> 01:08:13,680 this multivariate analysis. 1064 01:08:13,680 --> 01:08:18,014 And so there's no hand waving going on in having that result. 1065 01:08:18,014 --> 01:08:20,229 This actually follows directly from the moment 1066 01:08:20,229 --> 01:08:22,760 generating functions. 1067 01:08:22,760 --> 01:08:26,870 OK, let's now turn to another topic. 1068 01:08:26,870 --> 01:08:32,200 Related, but it's the QR decomposition of X. 1069 01:08:32,200 --> 01:08:36,870 So we have-- with our independent variables 1070 01:08:36,870 --> 01:08:41,630 X, we want to express this as a product 1071 01:08:41,630 --> 01:08:46,609 of a matrix Q with orthonormal columns, which is n by p, 1072 01:08:46,609 --> 01:08:51,684 and an upper triangular matrix R. 1073 01:08:51,684 --> 01:08:59,750 So it turns out that any matrix, any n by p matrix, 1074 01:08:59,750 --> 01:09:02,130 can be expressed in this form. 1075 01:09:02,130 --> 01:09:07,260 And we'll quickly show you how that can be accomplished. 1076 01:09:07,260 --> 01:09:10,180 We can accomplish that by conducting 1077 01:09:10,180 --> 01:09:13,479 a Gram-Schmidt orthonormalization 1078 01:09:13,479 --> 01:09:17,580 of the independent variables matrix X. 1079 01:09:17,580 --> 01:09:21,540 And let's see. 1080 01:09:21,540 --> 01:09:26,689 So if we define R, the upper triangular matrix in the QR 1081 01:09:26,689 --> 01:09:31,424 decomposition, to have 0's below the diagonal 1082 01:09:31,424 --> 01:09:35,710 and possibly nonzero values along the diagonal 1083 01:09:35,710 --> 01:09:41,500 and to the right, we're just going to solve for Q and R 1084 01:09:41,500 --> 01:09:45,260 through this Gram-Schmidt process. 1085 01:09:45,260 --> 01:09:50,439 So the first column of X is equal to the first column 1086 01:09:50,439 --> 01:09:57,430 of Q times the first element, the top left corner 1087 01:09:57,430 --> 01:10:00,240 of the matrix R. 1088 01:10:00,240 --> 01:10:08,130 And if we take the inner product of that vector with itself, 1089 01:10:08,130 --> 01:10:14,600 then we get this expression for r_(1,1) squared-- 1090 01:10:14,600 --> 01:10:18,270 we can basically solve for r_(1,1) as the square root 1091 01:10:18,270 --> 01:10:19,590 of this dot product. 1092 01:10:19,590 --> 01:10:22,920 And Q_[1] is simply the first column of X divided by that 1093 01:10:22,920 --> 01:10:23,500 square root. 1094 01:10:23,500 --> 01:10:27,260 So this first column of the Q matrix 1095 01:10:27,260 --> 01:10:32,210 and the first element of R can be solved for right away. 1096 01:10:32,210 --> 01:10:38,100 Then let's solve for the second column of Q 1097 01:10:38,100 --> 01:10:43,310 and the second column of the R matrix. 1098 01:10:43,310 --> 01:10:46,860 Well, X_[2], the second column of the X matrix, 1099 01:10:46,860 --> 01:10:56,090 is the first column of Q times r_(1,2), 1100 01:10:56,090 --> 01:10:58,614 plus the second column of Q times r_(2,2). 1101 01:11:02,250 --> 01:11:09,300 And if we multiply this expression by Q_[1] transpose, 1102 01:11:09,300 --> 01:11:15,720 then we basically get this expression for r_(1,2). 1103 01:11:19,780 --> 01:11:24,240 So we actually have just solved for r_(1,2). 1104 01:11:24,240 --> 01:11:35,910 And Q_[2] is solved for by the arguments given here.
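In practice one rarely carries out the Gram-Schmidt steps by hand. Here is a minimal sketch using numpy's built-in QR routine (in its reduced form, so Q is n by p and R is p by p), checking that X = QR and that the columns of Q are orthonormal, and previewing the least squares and hat-matrix formulas derived just below. The data are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.5, size=n)

# Reduced QR: Q is n x p with orthonormal columns, R is p x p upper triangular.
Q, R = np.linalg.qr(X)                      # 'reduced' mode is the default
print(np.allclose(X, Q @ R))                # True: X = Q R
print(np.allclose(Q.T @ Q, np.eye(p)))      # True: columns of Q are orthonormal

# Least squares via QR: beta_hat = R^{-1} Q' y, same as the normal equations.
beta_qr = np.linalg.solve(R, Q.T @ y)
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(beta_qr, beta_ne))        # True

# The hat matrix H = X (X'X)^{-1} X' simplifies to Q Q'.
H = X @ np.linalg.solve(X.T @ X, X.T)
print(np.allclose(H, Q @ Q.T))              # True
```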
1105 01:11:35,910 --> 01:11:41,070 So basically, we successively are orthogonalizing 1106 01:11:41,070 --> 01:11:43,480 columns of X to the previous columns of X 1107 01:11:43,480 --> 01:11:45,850 through this Gram-Schmidt process. 1108 01:11:45,850 --> 01:11:48,460 And it basically can be repeated through all the columns. 1109 01:11:51,030 --> 01:11:55,410 Now with this QR decomposition, what we get 1110 01:11:55,410 --> 01:12:01,470 is a really nice form for the least squares estimate. 1111 01:12:01,470 --> 01:12:07,080 Basically, it simplifies to the inverse of R times Q transpose 1112 01:12:07,080 --> 01:12:08,220 y. 1113 01:12:08,220 --> 01:12:15,490 And this basically means that you 1114 01:12:15,490 --> 01:12:19,720 can solve for least squares estimates by calculating the QR 1115 01:12:19,720 --> 01:12:22,150 decomposition, which is a very simple linear algebra 1116 01:12:22,150 --> 01:12:25,890 operation, and then just do a couple of matrix products 1117 01:12:25,890 --> 01:12:31,310 to get the-- well, you do have to do a matrix inverse with R 1118 01:12:31,310 --> 01:12:33,730 to get that out. 1119 01:12:33,730 --> 01:12:37,810 And the covariance matrix of beta hat 1120 01:12:37,810 --> 01:12:44,480 is equal to sigma squared X transpose X inverse. 1121 01:12:44,480 --> 01:12:53,330 And in terms of the covariance matrix, what is implicit here 1122 01:12:53,330 --> 01:12:55,870 but you should make explicit in your study, 1123 01:12:55,870 --> 01:13:06,190 is if you consider taking a matrix, R inverse Q transpose 1124 01:13:06,190 --> 01:13:11,160 times y, the only thing that's random there is that y vector, 1125 01:13:11,160 --> 01:13:11,660 OK? 1126 01:13:11,660 --> 01:13:16,720 The covariance of a matrix times a random vector 1127 01:13:16,720 --> 01:13:19,770 is that matrix times the covariance 1128 01:13:19,770 --> 01:13:22,960 of the vector times the transpose of the matrix. 1129 01:13:22,960 --> 01:13:26,770 So if you take a matrix transformation 1130 01:13:26,770 --> 01:13:30,510 of a random vector, then the covariance 1131 01:13:30,510 --> 01:13:33,660 of that transformation has that form. 1132 01:13:33,660 --> 01:13:41,110 So that's where this covariance matrix is coming into play. 1133 01:13:41,110 --> 01:13:44,410 And from the MGF, the moment generating function, 1134 01:13:44,410 --> 01:13:48,660 for the least squares estimate, this basically 1135 01:13:48,660 --> 01:13:50,930 comes out of the moment generating function definition 1136 01:13:50,930 --> 01:13:51,943 as well. 1137 01:13:51,943 --> 01:13:56,480 And if we take X transpose X, plug 1138 01:13:56,480 --> 01:14:01,864 in the QR decomposition, only the R's play out, 1139 01:14:01,864 --> 01:14:03,600 giving you that. 1140 01:14:03,600 --> 01:14:05,870 Now, this also gives us a very nice form 1141 01:14:05,870 --> 01:14:10,450 for the hat matrix, which turns out 1142 01:14:10,450 --> 01:14:13,490 to just be Q times Q transpose. 1143 01:14:13,490 --> 01:14:20,370 So that's a very simple form. 1144 01:14:25,290 --> 01:14:28,160 So now with the distribution theory, 1145 01:14:28,160 --> 01:14:35,190 this next section is going to actually prove 1146 01:14:35,190 --> 01:14:37,970 what's really a fundamental result 1147 01:14:37,970 --> 01:14:40,380 about normal linear regression models. 1148 01:14:40,380 --> 01:14:45,210 And I'm going to go through this somewhat quickly just 1149 01:14:45,210 --> 01:14:49,290 so that we cover what the main ideas are of the theorem. 
1150 01:14:49,290 --> 01:14:53,930 But the details, I think, are very straightforward. 1151 01:14:53,930 --> 01:14:57,250 And these course notes that will be posted online 1152 01:14:57,250 --> 01:14:59,370 will go through the various steps of the analysis. 1153 01:15:03,390 --> 01:15:07,160 OK, so there's an important theorem here 1154 01:15:07,160 --> 01:15:11,660 which is for any matrix A, m by n, 1155 01:15:11,660 --> 01:15:15,860 you consider transforming the random vector y 1156 01:15:15,860 --> 01:15:23,600 by this matrix A. It is also a random normal vector. 1157 01:15:23,600 --> 01:15:26,980 And its distribution is going to have 1158 01:15:26,980 --> 01:15:30,090 a mean and covariance matrix given 1159 01:15:30,090 --> 01:15:35,390 by mu_z and sigma_z, which have this simple expression in terms 1160 01:15:35,390 --> 01:15:38,710 of the matrix A and the underlying means 1161 01:15:38,710 --> 01:15:40,995 and covariances of y. 1162 01:15:44,880 --> 01:15:48,620 OK, earlier we actually applied this theorem 1163 01:15:48,620 --> 01:15:53,260 with A corresponding to the matrix that generates the least 1164 01:15:53,260 --> 01:15:54,920 squares estimates. 1165 01:15:54,920 --> 01:15:57,900 So with A equal to X transpose X inverse, 1166 01:15:57,900 --> 01:16:01,060 we actually previously went through the solution for what's 1167 01:16:01,060 --> 01:16:04,040 the distribution of beta hat. 1168 01:16:04,040 --> 01:16:06,960 And with any other matrix A, we can 1169 01:16:06,960 --> 01:16:09,815 go through the same analysis and get the distribution. 1170 01:16:13,820 --> 01:16:20,690 So if we do that here, well, we can actually 1171 01:16:20,690 --> 01:16:22,890 prove this important theorem, which 1172 01:16:22,890 --> 01:16:27,335 says that with least squares estimates 1173 01:16:27,335 --> 01:16:33,670 of normal linear regression models, our least 1174 01:16:33,670 --> 01:16:40,050 squares estimate beta hat and our residual vector epsilon hat 1175 01:16:40,050 --> 01:16:43,040 are independent random variables. 1176 01:16:43,040 --> 01:16:48,520 So when we construct these statistics, 1177 01:16:48,520 --> 01:16:51,860 they are statistically independent of each other. 1178 01:16:51,860 --> 01:16:57,580 And the distribution of beta hat is multivariate normal. 1179 01:16:57,580 --> 01:17:04,730 The sum of the squared residuals is, in fact, 1180 01:17:04,730 --> 01:17:09,660 a multiple of a chi-squared random variable. 1181 01:17:09,660 --> 01:17:16,010 Now who in here can tell me what a chi-squared random variable 1182 01:17:16,010 --> 01:17:17,405 is? 1183 01:17:17,405 --> 01:17:18,335 Anyone? 1184 01:17:18,335 --> 01:17:19,251 AUDIENCE: [INAUDIBLE]? 1185 01:17:21,590 --> 01:17:23,010 PROFESSOR: Yes, that's right. 1186 01:17:23,010 --> 01:17:25,610 So a chi-squared random variable with one degree of freedom 1187 01:17:25,610 --> 01:17:29,652 is a squared normal zero one random variable. 1188 01:17:29,652 --> 01:17:31,360 A chi-squared with two degrees of freedom 1189 01:17:31,360 --> 01:17:36,420 is the sum of two independent normals, zero one, squared. 1190 01:17:36,420 --> 01:17:43,000 And so the sum of n squared residuals is, in fact, 1191 01:17:43,000 --> 01:17:48,182 an n minus p chi-squared random variable scale it by sigma 1192 01:17:48,182 --> 01:17:49,420 squared. 
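These distributional claims are easy to check by simulation before the t statistics are introduced next. In the rough sketch below (design matrix and parameter values made up, with many data sets simulated from the same X), the empirical covariance of beta hat matches sigma squared times X transpose X inverse, the scaled residual sum of squares averages to about n minus p as a chi-squared with n minus p degrees of freedom should, and the residual sum of squares is essentially uncorrelated with the coefficient estimates, consistent with the independence claim.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, sigma = 30, 3, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -0.5])
XtX_inv = np.linalg.inv(X.T @ X)

betas, ssr = [], []
for _ in range(20000):
    y = X @ beta + rng.normal(scale=sigma, size=n)   # normal linear regression model
    b = XtX_inv @ X.T @ y                            # least squares estimate
    e = y - X @ b                                    # residual vector
    betas.append(b)
    ssr.append(e @ e)
betas, ssr = np.array(betas), np.array(ssr)

print(np.allclose(betas.mean(axis=0), beta, atol=0.05))             # beta hat is unbiased
print(np.allclose(np.cov(betas.T), sigma**2 * XtX_inv, atol=0.05))  # Cov(beta hat) = sigma^2 (X'X)^{-1}
print((ssr / sigma**2).mean())                                      # about n - p = 27
print(np.corrcoef(betas[:, 1], ssr)[0, 1])                          # about 0: consistent with independence
```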
1193 01:17:49,420 --> 01:17:55,870 And for each component j, if we take 1194 01:17:55,870 --> 01:18:01,080 the difference between the least squares estimate beta hat j 1195 01:18:01,080 --> 01:18:04,810 and beta_j and divide through by this estimate 1196 01:18:04,810 --> 01:18:12,320 of the standard deviation of that, then 1197 01:18:12,320 --> 01:18:15,740 that will, in fact, have a t distribution on n minus p 1198 01:18:15,740 --> 01:18:17,280 degrees of freedom. 1199 01:18:17,280 --> 01:18:25,040 And let's see, a t distribution in probability theory 1200 01:18:25,040 --> 01:18:35,684 is the ratio of a standard normal random variable 1201 01:18:35,684 --> 01:18:38,100 to the square root of an independent chi-squared 1202 01:18:38,100 --> 01:18:39,210 random variable divided by its degrees of freedom. 1203 01:18:39,210 --> 01:18:45,780 So basically these properties characterize 1204 01:18:45,780 --> 01:18:50,970 our regression parameter estimates and t statistics 1205 01:18:50,970 --> 01:18:54,780 for those estimates. 1206 01:18:54,780 --> 01:18:59,490 Now, OK, in the course notes, there's 1207 01:18:59,490 --> 01:19:01,670 a moderately long proof. 1208 01:19:01,670 --> 01:19:05,320 But all the details are given, and I'll 1209 01:19:05,320 --> 01:19:08,110 be happy to go through any of those details with people 1210 01:19:08,110 --> 01:19:11,050 during office hours. 1211 01:19:11,050 --> 01:19:19,500 Let me just push on to-- let's see. 1212 01:19:19,500 --> 01:19:23,091 We have maybe two minutes left in the class. 1213 01:19:25,860 --> 01:19:32,440 Let me just talk about maximum likelihood estimation. 1214 01:19:32,440 --> 01:19:37,850 And in fitting models in statistics, 1215 01:19:37,850 --> 01:19:41,030 maximum likelihood estimation comes up again and again. 1216 01:19:41,030 --> 01:19:45,295 And with normal linear regression models, 1217 01:19:45,295 --> 01:19:47,880 it turns out that the ordinary least squares estimates 1218 01:19:47,880 --> 01:19:51,690 are, in fact, the maximum likelihood estimates. 1219 01:19:51,690 --> 01:19:57,570 And what we want to do with maximum likelihood 1220 01:19:57,570 --> 01:20:02,000 is to maximize the likelihood function. 1221 01:20:02,000 --> 01:20:05,640 We define the likelihood function, which 1222 01:20:05,640 --> 01:20:09,290 is the density function for the data given 1223 01:20:09,290 --> 01:20:12,110 the unknown parameters. 1224 01:20:12,110 --> 01:20:16,390 And this density function is simply 1225 01:20:16,390 --> 01:20:18,720 the density function for a multivariate normal random 1226 01:20:18,720 --> 01:20:20,310 variable. 1227 01:20:20,310 --> 01:20:24,630 And the maximum likelihood estimates 1228 01:20:24,630 --> 01:20:28,120 are the estimates of the underlying parameters 1229 01:20:28,120 --> 01:20:31,650 that basically maximize the density function. 1230 01:20:31,650 --> 01:20:34,200 So it's the values of the underlying parameters 1231 01:20:34,200 --> 01:20:37,130 that make the data that was observed the most likely. 1232 01:20:40,520 --> 01:20:49,110 And if you plug in the form of the density function, 1233 01:20:49,110 --> 01:20:53,680 basically we have these independent random variables, 1234 01:20:53,680 --> 01:21:00,540 Y_i, whose densities multiply to give the joint density. 1235 01:21:00,540 --> 01:21:05,466 The likelihood function turns out 1236 01:21:05,466 --> 01:21:11,260 to be basically a function of the least squares criterion.
1237 01:21:11,260 --> 01:21:14,980 So if you fit models by least squares, 1238 01:21:14,980 --> 01:21:18,472 you're consistent with doing something decent in at least 1239 01:21:18,472 --> 01:21:20,180 applying the maximum likelihood principle 1240 01:21:20,180 --> 01:21:23,720 if you had a normal linear regression model. 1241 01:21:23,720 --> 01:21:31,730 And it's useful to know when your statistical estimation 1242 01:21:31,730 --> 01:21:37,230 algorithms are consistent with certain principles 1243 01:21:37,230 --> 01:21:41,620 like maximum likelihood estimation or others. 1244 01:21:41,620 --> 01:21:43,930 So let me, I guess, finish there. 1245 01:21:43,930 --> 01:21:49,100 And next time, I will just talk a little bit 1246 01:21:49,100 --> 01:21:52,780 about generalized M estimators. 1247 01:21:52,780 --> 01:21:55,160 Those provide a class of estimators 1248 01:21:55,160 --> 01:22:04,420 that can be used for finding robust estimates 1249 01:22:04,420 --> 01:22:08,290 and also quantile estimates of regression parameters 1250 01:22:08,290 --> 01:22:12,320 which are very interesting.
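As a closing illustration of the maximum likelihood point: under the normal linear regression model with known sigma squared, the log-likelihood in beta is, up to a constant, minus the least squares criterion divided by 2 sigma squared, so the beta that maximizes it is exactly the ordinary least squares estimate. A small sketch with made-up data and a few arbitrary perturbed candidates:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, sigma = 50, 3, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=sigma, size=n)

def log_likelihood(b, sigma2):
    """Normal linear regression log-likelihood: sum of log N(y_i; x_i'b, sigma2) densities."""
    resid = y - X @ b
    return -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * resid @ resid / sigma2

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# The OLS fit attains the highest log-likelihood among these candidates,
# because maximizing the likelihood in beta means minimizing the sum of squared residuals.
for b in [beta_ols] + [beta_ols + 0.1 * rng.normal(size=p) for _ in range(3)]:
    print(b.round(3), log_likelihood(b, sigma**2))
```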