1 00:00:00,060 --> 00:00:02,500 The following content is provided under a Creative 2 00:00:02,500 --> 00:00:04,019 Commons license. 3 00:00:04,019 --> 00:00:06,360 Your support will help MIT OpenCourseWare 4 00:00:06,360 --> 00:00:10,730 continue to offer high quality educational resources for free. 5 00:00:10,730 --> 00:00:13,340 To make a donation or view additional materials 6 00:00:13,340 --> 00:00:17,217 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,217 --> 00:00:17,842 at ocw.mit.edu. 8 00:00:21,490 --> 00:00:24,620 PROFESSOR: Today's topic is regression analysis. 9 00:00:24,620 --> 00:00:28,950 And this subject is one that we're 10 00:00:28,950 --> 00:00:32,610 going to cover today, covering the mathematical and 11 00:00:32,610 --> 00:00:35,620 statistical foundations of regression 12 00:00:35,620 --> 00:00:38,720 and focusing particularly on linear regression. 13 00:00:38,720 --> 00:00:44,840 This methodology is perhaps the most powerful method 14 00:00:44,840 --> 00:00:47,630 in statistical modeling. 15 00:00:47,630 --> 00:00:49,640 And the foundations of it, I think, 16 00:00:49,640 --> 00:00:52,740 are very, very important to understand and master, 17 00:00:52,740 --> 00:00:56,320 and they'll help you in any kind of statistical modeling 18 00:00:56,320 --> 00:01:02,870 exercise you might entertain during or after this course. 19 00:01:02,870 --> 00:01:07,320 And its popularity in finance is very, very high, 20 00:01:07,320 --> 00:01:10,540 but it's also a very popular methodology 21 00:01:10,540 --> 00:01:15,630 in all other disciplines that do applied statistics. 22 00:01:15,630 --> 00:01:23,410 So let's begin with setting up the multiple linear regression 23 00:01:23,410 --> 00:01:24,490 problem. 24 00:01:24,490 --> 00:01:31,760 So we begin with a data set that consists of observations 25 00:01:31,760 --> 00:01:34,640 on different cases, a number of cases. 26 00:01:34,640 --> 00:01:40,110 So we have n cases indexed by i. 27 00:01:40,110 --> 00:01:44,410 And there's a single variable, a dependent variable or response 28 00:01:44,410 --> 00:01:47,970 variable, which is the variable of focus. 29 00:01:47,970 --> 00:01:52,040 And we'll denote that y sub i. 30 00:01:52,040 --> 00:01:55,550 And together with that, for each of the cases, 31 00:01:55,550 --> 00:01:59,640 there are explanatory variables that we might observe. 32 00:01:59,640 --> 00:02:04,330 So the y_i's, the dependent variables, 33 00:02:04,330 --> 00:02:07,670 could be returns on stocks. 34 00:02:07,670 --> 00:02:12,160 The explanatory variables could be underlying characteristics 35 00:02:12,160 --> 00:02:16,410 of those stocks over a given period. 36 00:02:16,410 --> 00:02:21,090 The dependent variable could be the change 37 00:02:21,090 --> 00:02:28,360 in value of an index, the S&P 500 index or the yield rate, 38 00:02:28,360 --> 00:02:30,220 and the explanatory variables can 39 00:02:30,220 --> 00:02:33,390 be various macroeconomic factors or other factors that 40 00:02:33,390 --> 00:02:39,430 might be used to explain how the response variable changes 41 00:02:39,430 --> 00:02:41,510 and takes on its value. 42 00:02:41,510 --> 00:02:44,880 Let's go through various goals of regression analysis. 43 00:02:44,880 --> 00:02:47,780 OK, first it can be to extract or exploit 44 00:02:47,780 --> 00:02:49,790 the relationship between the dependent variable 45 00:02:49,790 --> 00:02:51,720 and the independent variables. 46 00:02:51,720 --> 00:02:55,010 And examples of this are prediction. 
47 00:02:55,010 --> 00:02:57,830 Indeed, in finance that's where I've 48 00:02:57,830 --> 00:03:00,075 used regression analysis most. 49 00:03:00,075 --> 00:03:02,965 We want to predict what's going to happen and take actions 50 00:03:02,965 --> 00:03:05,230 to take advantage of that. 51 00:03:05,230 --> 00:03:07,570 One can also use regression analysis 52 00:03:07,570 --> 00:03:10,940 to talk about causal inference. 53 00:03:10,940 --> 00:03:16,240 What factors are really driving a dependent variable? 54 00:03:16,240 --> 00:03:19,570 And so one can actually test hypotheses 55 00:03:19,570 --> 00:03:23,930 about what are true causal factors underlying 56 00:03:23,930 --> 00:03:27,250 the relationships between the variables. 57 00:03:27,250 --> 00:03:32,430 Another application is for just simple approximation. 58 00:03:32,430 --> 00:03:34,870 As mathematicians, you're all very 59 00:03:34,870 --> 00:03:38,970 familiar with how smooth functions can 60 00:03:38,970 --> 00:03:41,460 be-- that are smooth in the sense of being 61 00:03:41,460 --> 00:03:43,250 differentiable and bounded. 62 00:03:43,250 --> 00:03:46,565 Those can be approximated well by a Taylor series 63 00:03:46,565 --> 00:03:48,690 if you have a function of a single variable or even 64 00:03:48,690 --> 00:03:53,240 a multivariable function. 65 00:03:53,240 --> 00:03:55,840 So one can use regression analysis 66 00:03:55,840 --> 00:04:00,460 to actually approximate functions nicely. 67 00:04:00,460 --> 00:04:04,370 And one can also use regression analysis 68 00:04:04,370 --> 00:04:08,600 to uncover functional relationships 69 00:04:08,600 --> 00:04:10,724 and validate functional relationships 70 00:04:10,724 --> 00:04:11,640 amongst the variables. 71 00:04:14,270 --> 00:04:17,130 So let's set up the general linear model 72 00:04:17,130 --> 00:04:19,570 from a mathematical standpoint to begin with. 73 00:04:19,570 --> 00:04:23,160 In this lecture, OK, we're going to start off 74 00:04:23,160 --> 00:04:28,050 with discussing ordinary least squares, which 75 00:04:28,050 --> 00:04:31,070 is a purely mathematical criterion for how 76 00:04:31,070 --> 00:04:33,580 you specify regression models. 77 00:04:33,580 --> 00:04:38,350 And then we're going to turn to the Gauss-Markov theorem which 78 00:04:38,350 --> 00:04:42,265 incorporates some statistical modeling principles there. 79 00:04:42,265 --> 00:04:45,170 They're essentially weak principles. 80 00:04:45,170 --> 00:04:49,570 And then we will turn to formal models 81 00:04:49,570 --> 00:04:52,160 with normal linear regression models, 82 00:04:52,160 --> 00:04:55,700 and then consider extensions of those to broader classes. 83 00:04:58,250 --> 00:05:00,600 Now we're in the mathematical context. 84 00:05:00,600 --> 00:05:05,180 And a linear model is basically attempting 85 00:05:05,180 --> 00:05:08,780 to model the conditional distribution of the response 86 00:05:08,780 --> 00:05:14,580 variable y_i given the independent variables x_i. 87 00:05:14,580 --> 00:05:21,050 And the conditional distribution of the response variable 88 00:05:21,050 --> 00:05:25,050 is modeled simply as a linear function 89 00:05:25,050 --> 00:05:27,230 of the independent variables. 90 00:05:27,230 --> 00:05:32,880 So the x_i's, x_(i,1) through x_(i,p), 91 00:05:32,880 --> 00:05:36,090 are the key explanatory variables that relate 92 00:05:36,090 --> 00:05:38,340 to the response variables, possibly. 
93 00:05:38,340 --> 00:05:45,850 And the beta_1, beta_2, beta_i, or beta_p, 94 00:05:45,850 --> 00:05:49,480 are the regression parameters which 95 00:05:49,480 --> 00:05:53,190 would be used in defining that linear relationship. 96 00:05:53,190 --> 00:06:03,680 So this relationship has residuals, epsilon_i, 97 00:06:03,680 --> 00:06:07,320 basically where there's uncertainty in the data-- 98 00:06:07,320 --> 00:06:11,100 whether it's either due to a measurement error or modeling 99 00:06:11,100 --> 00:06:14,535 error or underlying stochastic processes that 100 00:06:14,535 --> 00:06:15,980 are driving the error. 101 00:06:15,980 --> 00:06:19,340 This epsilon_i is a residual error variable 102 00:06:19,340 --> 00:06:24,440 that will indicate how this linear relationship varies 103 00:06:24,440 --> 00:06:25,915 across the different n cases. 104 00:06:30,590 --> 00:06:33,520 So OK, how broad are the models? 105 00:06:33,520 --> 00:06:38,080 Well, the models really are very broad. 106 00:06:38,080 --> 00:06:40,020 First of all, polynomial approximation 107 00:06:40,020 --> 00:06:41,505 is indicated here. 108 00:06:41,505 --> 00:06:45,630 It corresponds, essentially, to a truncated Taylor 109 00:06:45,630 --> 00:06:49,170 series approximation to a functional form. 110 00:06:51,730 --> 00:06:56,240 With variables that exhibit cyclical behavior, 111 00:06:56,240 --> 00:07:03,200 Fourier series can be applied in a linear regression context. 112 00:07:03,200 --> 00:07:06,020 How many people in here are familiar with Fourier series? 113 00:07:08,580 --> 00:07:09,870 Almost everybody. 114 00:07:09,870 --> 00:07:13,530 So Fourier series basically provide 115 00:07:13,530 --> 00:07:17,650 a set of basis functions that allow you to closely 116 00:07:17,650 --> 00:07:19,250 approximate most functions. 117 00:07:19,250 --> 00:07:21,540 And certainly with bounded functions 118 00:07:21,540 --> 00:07:24,510 that possibly have a cyclical structure to them, 119 00:07:24,510 --> 00:07:27,120 it provides a complete description. 120 00:07:27,120 --> 00:07:30,940 So we could apply Fourier series here. 121 00:07:30,940 --> 00:07:38,640 Finally, time series regressions where the cases i one through n 122 00:07:38,640 --> 00:07:42,360 are really indexes of different time points can be applied. 123 00:07:42,360 --> 00:07:46,650 And so the independent variables can 124 00:07:46,650 --> 00:07:50,660 be variables that are observable at a given time point 125 00:07:50,660 --> 00:07:51,910 or known at a given time. 126 00:07:51,910 --> 00:07:55,810 So those can include lags of the response variables. 127 00:07:55,810 --> 00:08:00,530 So we'll see actually when we talk about time series 128 00:08:00,530 --> 00:08:02,880 that there's autoregressive time series 129 00:08:02,880 --> 00:08:05,870 models that can be specified. 130 00:08:05,870 --> 00:08:11,410 And those are very broadly applied in finance. 131 00:08:18,900 --> 00:08:23,510 All right, so let's go through what the steps are for fitting 132 00:08:23,510 --> 00:08:24,840 a regression model. 133 00:08:27,680 --> 00:08:31,760 First, one wants to propose a model 134 00:08:31,760 --> 00:08:33,600 in terms of what is it that we have 135 00:08:33,600 --> 00:08:38,727 to identify or be interested in a particular response variable. 136 00:08:38,727 --> 00:08:42,934 And critical here is specifying the scale 137 00:08:42,934 --> 00:08:47,010 of that response variable. 
138 00:08:47,010 --> 00:08:51,280 Choongbum was discussing problems of modeling stock 139 00:08:51,280 --> 00:08:52,450 prices. 140 00:08:52,450 --> 00:08:55,030 If, say, y is the stock price, 141 00:08:55,030 --> 00:09:00,090 well, it may be that it's more appropriate to consider 142 00:09:00,090 --> 00:09:06,320 modeling it on a logarithmic scale than on a linear scale. 143 00:09:06,320 --> 00:09:09,780 Who can tell me why that would be a good idea? 144 00:09:09,780 --> 00:09:11,740 AUDIENCE: Because the changes might 145 00:09:11,740 --> 00:09:13,700 become more percent changes in price 146 00:09:13,700 --> 00:09:16,365 rather than absolute changes in price. 147 00:09:16,365 --> 00:09:17,490 PROFESSOR: Very good, yeah. 148 00:09:17,490 --> 00:09:21,540 So price changes basically on the percentage scale, 149 00:09:21,540 --> 00:09:25,680 which is what log changes measure, may be much better predicted 150 00:09:25,680 --> 00:09:30,410 by knowing the factors than the absolute price level. 151 00:09:30,410 --> 00:09:35,940 OK, and so we have to have a collection 152 00:09:35,940 --> 00:09:40,070 of independent variables to include in the model. 153 00:09:40,070 --> 00:09:42,970 And it's important to think about how 154 00:09:42,970 --> 00:09:44,590 general this setup is. 155 00:09:44,590 --> 00:09:46,450 I mean, the independent variables 156 00:09:46,450 --> 00:09:50,140 can be functions, lagged values of the response variable. 157 00:09:50,140 --> 00:09:52,270 They can be different functional forms 158 00:09:52,270 --> 00:09:53,670 of other independent variables. 159 00:09:53,670 --> 00:09:58,720 So the fact that we're talking about a linear regression model 160 00:09:58,720 --> 00:10:03,590 here is not so limiting in terms of the linearity. 161 00:10:03,590 --> 00:10:06,050 We can really capture a lot of nonlinear behavior 162 00:10:06,050 --> 00:10:07,730 in this framework. 163 00:10:07,730 --> 00:10:11,290 So then third, we need to address the assumptions 164 00:10:11,290 --> 00:10:15,020 about the distribution of the residuals, epsilon, 165 00:10:15,020 --> 00:10:16,820 over the cases. 166 00:10:16,820 --> 00:10:21,150 So that has to be specified. 167 00:10:21,150 --> 00:10:23,610 Once we've set up the model in terms 168 00:10:23,610 --> 00:10:27,160 of identifying the response and the explanatory variables 169 00:10:27,160 --> 00:10:29,700 and the assumptions underlying the distribution 170 00:10:29,700 --> 00:10:35,080 of the residuals, we need to specify a criterion for judging 171 00:10:35,080 --> 00:10:36,550 different estimators. 172 00:10:36,550 --> 00:10:40,840 So given a particular setup, what we want to do 173 00:10:40,840 --> 00:10:48,000 is be able to define a methodology for specifying 174 00:10:48,000 --> 00:10:50,580 the regression parameters so that we can then 175 00:10:50,580 --> 00:10:54,010 use this regression model for prediction 176 00:10:54,010 --> 00:10:56,670 or whatever our purpose is. 177 00:10:56,670 --> 00:11:00,700 So the second thing we want to do 178 00:11:00,700 --> 00:11:03,460 is define a criterion for how we might 179 00:11:03,460 --> 00:11:09,090 judge different estimators of the regression parameters. 180 00:11:09,090 --> 00:11:11,210 We're going to go through several of those. 181 00:11:11,210 --> 00:11:15,200 And you'll see those-- least squares is the first one, 182 00:11:15,200 --> 00:11:17,210 but there are actually more general ones. 
183 00:11:17,210 --> 00:11:19,290 In fact, the last section of this lecture 184 00:11:19,290 --> 00:11:22,930 on generalized estimators will cover those as well. 185 00:11:22,930 --> 00:11:25,740 Third, we need to characterize the best estimator 186 00:11:25,740 --> 00:11:27,600 and apply it to the given data. 187 00:11:27,600 --> 00:11:31,401 So once we choose a criterion for how good 188 00:11:31,401 --> 00:11:32,900 an estimate of regression parameters 189 00:11:32,900 --> 00:11:37,080 is, then we have to have a technology for solving 190 00:11:37,080 --> 00:11:38,060 for that. 191 00:11:38,060 --> 00:11:43,030 And then fourth, we need to check our assumptions. 192 00:11:43,030 --> 00:11:48,770 Now, it's very often the case that at this fourth step, where 193 00:11:48,770 --> 00:11:51,160 you're checking the assumptions that you've made, 194 00:11:51,160 --> 00:11:55,370 you'll discover features of your data or the process 195 00:11:55,370 --> 00:11:58,550 that it's modeling that make you want 196 00:11:58,550 --> 00:12:01,520 to expand upon your assumptions or change your assumptions. 197 00:12:01,520 --> 00:12:06,410 And so checking the assumptions is a critical part 198 00:12:06,410 --> 00:12:08,170 of any modeling process. 199 00:12:08,170 --> 00:12:11,830 And then if necessary, modify the model and assumptions 200 00:12:11,830 --> 00:12:15,260 and repeat this process. 201 00:12:15,260 --> 00:12:18,660 What I can tell you is that this sort 202 00:12:18,660 --> 00:12:21,280 of protocol for how you fit models 203 00:12:21,280 --> 00:12:26,680 is what I've applied many, many times. 204 00:12:26,680 --> 00:12:31,870 And if you are lucky in a particular problem area, 205 00:12:31,870 --> 00:12:34,430 the very simple models will work well 206 00:12:34,430 --> 00:12:37,340 with small changes in assumptions. 207 00:12:37,340 --> 00:12:39,450 But when you get challenging problems, 208 00:12:39,450 --> 00:12:43,900 then this item five of modify the model 209 00:12:43,900 --> 00:12:46,300 and/or assumptions is critical. 210 00:12:46,300 --> 00:12:50,790 And in statistical modeling, my philosophy 211 00:12:50,790 --> 00:12:53,340 is you really want to, as much as possible, 212 00:12:53,340 --> 00:12:55,390 tailor the model to the process you're modeling. 213 00:12:55,390 --> 00:12:59,070 You don't want to fit a square peg in a round hole 214 00:12:59,070 --> 00:13:01,200 and just apply, say, simple linear regression 215 00:13:01,200 --> 00:13:02,510 to everything. 216 00:13:02,510 --> 00:13:06,191 You want to apply it when the assumptions are valid. 217 00:13:06,191 --> 00:13:07,940 If the assumptions aren't valid, maybe you 218 00:13:07,940 --> 00:13:10,680 can change the specification of the problem 219 00:13:10,680 --> 00:13:14,890 so a linear model is still applicable in a changed 220 00:13:14,890 --> 00:13:16,060 framework. 221 00:13:16,060 --> 00:13:18,530 But if not, then you'll want to extend 222 00:13:18,530 --> 00:13:20,250 to other kinds of models. 223 00:13:20,250 --> 00:13:22,120 But what we'll be doing-- or what 224 00:13:22,120 --> 00:13:24,680 you will be doing if you do that-- is basically applying 225 00:13:24,680 --> 00:13:28,120 all the same principles that are developed 226 00:13:28,120 --> 00:13:30,140 in the linear modeling framework. 227 00:13:35,750 --> 00:13:38,320 OK, now let's see. 228 00:13:38,320 --> 00:13:39,860 I wanted to make some comments here 229 00:13:39,860 --> 00:13:43,975 about specifying assumptions for the residual distribution. 
230 00:13:47,600 --> 00:13:49,620 What kind of assumptions might we make? 231 00:13:49,620 --> 00:13:52,840 OK, would anyone like to suggest some assumptions 232 00:13:52,840 --> 00:13:55,640 you might make in a linear regression 233 00:13:55,640 --> 00:13:58,310 model for the residuals? 234 00:13:58,310 --> 00:13:58,980 Yes? 235 00:13:58,980 --> 00:14:00,006 What's your name, by the way? 236 00:14:00,006 --> 00:14:00,700 AUDIENCE: My name is Will. 237 00:14:00,700 --> 00:14:01,533 PROFESSOR: Will, OK. 238 00:14:01,533 --> 00:14:02,455 Will what? 239 00:14:02,455 --> 00:14:02,900 [? AUDIENCE: Ossler. ?] 240 00:14:02,900 --> 00:14:03,460 PROFESSOR: [? Ossler, ?] great. 241 00:14:03,460 --> 00:14:04,910 OK, thank you, Will. 242 00:14:04,910 --> 00:14:06,480 AUDIENCE: It might be-- or we might 243 00:14:06,480 --> 00:14:10,000 want to say that the residual might be normally distributed 244 00:14:10,000 --> 00:14:17,598 and it might not depend too much on what value of the input 245 00:14:17,598 --> 00:14:19,000 variable we'd use. 246 00:14:19,000 --> 00:14:20,540 PROFESSOR: OK. 247 00:14:20,540 --> 00:14:21,090 Anyone else? 248 00:14:23,810 --> 00:14:24,310 OK. 249 00:14:24,310 --> 00:14:27,930 Well, that certainly is an excellent place 250 00:14:27,930 --> 00:14:33,510 to start in terms of starting with a distribution that's 251 00:14:33,510 --> 00:14:35,391 familiar. 252 00:14:35,391 --> 00:14:36,390 Familiar is always good. 253 00:14:36,390 --> 00:14:38,598 Although it's not something that should be necessary, 254 00:14:38,598 --> 00:14:44,210 but we know from some of Choongbum's lecture areas 255 00:14:44,210 --> 00:14:46,470 that Gaussian and normal distributions 256 00:14:46,470 --> 00:14:49,670 arise in many settings where we're 257 00:14:49,670 --> 00:14:53,900 taking basically sums of independent, random variables. 258 00:14:53,900 --> 00:14:58,180 And so it may be that these residuals are like that. 259 00:14:58,180 --> 00:15:04,910 Anyway, a slightly simpler or weaker condition 260 00:15:04,910 --> 00:15:10,110 is to use the Gauss-- what are called in statistics 261 00:15:10,110 --> 00:15:12,400 the Gauss-Markov assumptions. 262 00:15:12,400 --> 00:15:15,340 And these are assumptions where we're only 263 00:15:15,340 --> 00:15:20,390 concerned with the means or averages, statistically, 264 00:15:20,390 --> 00:15:23,210 and the variances of the residuals. 265 00:15:23,210 --> 00:15:26,160 And so we assume that there's zero mean. 266 00:15:26,160 --> 00:15:30,230 So on average, they're not adding a bias up or down 267 00:15:30,230 --> 00:15:32,580 to the dependent variable. 268 00:15:32,580 --> 00:15:36,920 And those have a constant variance. 269 00:15:36,920 --> 00:15:40,830 So the level of uncertainty in our model 270 00:15:40,830 --> 00:15:43,210 doesn't depend on the case. 271 00:15:43,210 --> 00:15:47,660 And so indeed, if errors on the percentage scale 272 00:15:47,660 --> 00:15:51,250 are more appropriate, then one could look at, say, 273 00:15:51,250 --> 00:15:53,970 a time series of prices that you're trying to model. 274 00:15:53,970 --> 00:15:56,970 And it may be that on the log scale, 275 00:15:56,970 --> 00:15:59,160 that constant variance looks much more appropriate 276 00:15:59,160 --> 00:16:02,180 than on the original scale, which would have-- 277 00:16:02,180 --> 00:16:06,660 And then a third attribute of the Gauss-Markov assumptions 278 00:16:06,660 --> 00:16:10,740 is that the residuals are uncorrelated. 
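[In symbols, the three Gauss-Markov conditions on the residuals just listed are: E[epsilon_i] = 0 (zero mean), Var(epsilon_i) = sigma^2 for all i (constant variance), and Cov(epsilon_i, epsilon_j) = 0 for i not equal to j (uncorrelated), for i, j = 1, ..., n, where sigma^2 denotes the common residual variance. This is a restatement of what was said, not an additional assumption.]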
279 00:16:10,740 --> 00:16:16,770 So now uncorrelated does not mean independent 280 00:16:16,770 --> 00:16:18,210 or statistically independent. 281 00:16:18,210 --> 00:16:21,470 So this is a somewhat weak condition, or weaker condition, 282 00:16:21,470 --> 00:16:24,326 than independence of the residuals. 283 00:16:24,326 --> 00:16:26,280 But in the Gauss-Markov setting, we're 284 00:16:26,280 --> 00:16:29,650 just setting up basically a reduced set 285 00:16:29,650 --> 00:16:33,610 of assumptions that we might apply to fit the model. 286 00:16:33,610 --> 00:16:36,620 If we extend upon that, we can then 287 00:16:36,620 --> 00:16:40,040 consider normal linear regression models, 288 00:16:40,040 --> 00:16:44,050 which Will just suggested. 289 00:16:44,050 --> 00:16:48,930 And in this case, those could be assumed to be independent 290 00:16:48,930 --> 00:16:50,440 and identically distributed-- IID 291 00:16:50,440 --> 00:16:56,380 is that notation for that-- with Gaussian or normal with mean 0 292 00:16:56,380 --> 00:16:57,580 and variance sigma squared. 293 00:17:00,590 --> 00:17:03,050 We can extend upon that to consider 294 00:17:03,050 --> 00:17:07,140 generalized Gauss-Markov assumptions where we maintain 295 00:17:07,140 --> 00:17:09,750 still the zero mean for the residuals, 296 00:17:09,750 --> 00:17:15,990 but the general-- we might have a covariance matrix which 297 00:17:15,990 --> 00:17:18,560 does not correspond to independent and identically 298 00:17:18,560 --> 00:17:20,680 distributed random variables. 299 00:17:20,680 --> 00:17:21,760 Now, let's see. 300 00:17:21,760 --> 00:17:25,000 In the discussion of probability theory, 301 00:17:25,000 --> 00:17:29,400 we really haven't talked yet about matrix-valued random 302 00:17:29,400 --> 00:17:31,490 variables, right? 303 00:17:31,490 --> 00:17:34,360 But how many people in the class have 304 00:17:34,360 --> 00:17:37,380 covered matrix-value or vector-valued random variables 305 00:17:37,380 --> 00:17:39,660 before? 306 00:17:39,660 --> 00:17:41,120 OK, just a handful. 307 00:17:41,120 --> 00:17:47,290 Well, a vector-valued random variable, 308 00:17:47,290 --> 00:17:51,090 we think of the values of these n 309 00:17:51,090 --> 00:17:57,200 cases for the dependent variable to be an n-valued, an n-vector 310 00:17:57,200 --> 00:17:59,950 of random variables. 311 00:17:59,950 --> 00:18:05,600 And so we can generalize the variance 312 00:18:05,600 --> 00:18:09,140 of individual random variables to the variance covariance 313 00:18:09,140 --> 00:18:12,870 matrix of the collection. 314 00:18:12,870 --> 00:18:18,490 And so you have a covariance matrix characterizing 315 00:18:18,490 --> 00:18:25,930 the variance of the n-vector which gives us the-- the (i, j) 316 00:18:25,930 --> 00:18:31,640 element gives us the value of the covariance. 317 00:18:31,640 --> 00:18:38,130 All right, let me put the screen up and just 318 00:18:38,130 --> 00:18:41,214 write that on the board so that you're familiar with that. 319 00:18:49,820 --> 00:18:58,270 All right, so we have y_1, y_2, down to y_n, 320 00:18:58,270 --> 00:19:03,070 our n values of our response variable. 321 00:19:03,070 --> 00:19:10,755 And we can basically talk about the expectation 322 00:19:10,755 --> 00:19:18,303 of that being equal to mu_1, mu_2, down to mu_n. 
323 00:19:21,021 --> 00:19:38,270 And the covariance matrix of y_1, y_2, down to y_n 324 00:19:38,270 --> 00:19:46,060 is equal to a matrix with the variance of y_1 325 00:19:46,060 --> 00:19:53,280 in the upper 1, 1 element, and the variance of y_2 in the 2, 326 00:19:53,280 --> 00:20:04,760 2 element, and the variance of y_n in the nth column and nth 327 00:20:04,760 --> 00:20:05,890 row. 328 00:20:05,890 --> 00:20:15,080 And in the (i,j)-th row, (i, j), we have the covariance between 329 00:20:15,080 --> 00:20:18,550 y_i and y_j. 330 00:20:18,550 --> 00:20:24,070 So we're going to use matrices to represent covariances. 331 00:20:24,070 --> 00:20:26,670 And that's something which I want everyone 332 00:20:26,670 --> 00:20:28,900 to get very familiar with because we're 333 00:20:28,900 --> 00:20:31,750 going to assume that we are comfortable with those, 334 00:20:31,750 --> 00:20:38,470 and apply matrix algebra with these kinds of constructs. 335 00:20:38,470 --> 00:20:41,300 So the generalized Gauss-Markov theorem 336 00:20:41,300 --> 00:20:43,960 assumes a general covariance matrix 337 00:20:43,960 --> 00:20:50,400 where you can have nonzero covariances 338 00:20:50,400 --> 00:20:52,260 between the independent variables 339 00:20:52,260 --> 00:20:54,440 or the dependent variables and the residuals. 340 00:20:54,440 --> 00:20:58,450 And those can be correlated. 341 00:20:58,450 --> 00:21:03,160 Now, who can come up with an example 342 00:21:03,160 --> 00:21:11,471 of why the residuals might be correlated in a regression 343 00:21:11,471 --> 00:21:11,970 model? 344 00:21:15,134 --> 00:21:16,490 Dan? 345 00:21:16,490 --> 00:21:17,390 OK. 346 00:21:17,390 --> 00:21:21,100 That's a really good example because it's nonlinear. 347 00:21:21,100 --> 00:21:24,720 If you imagine sort of a simple nonlinear curve 348 00:21:24,720 --> 00:21:27,070 and you try to fit a straight line to it, 349 00:21:27,070 --> 00:21:32,080 then the residuals from that linear fit 350 00:21:32,080 --> 00:21:35,100 are going to be consistently above or below the line 351 00:21:35,100 --> 00:21:37,630 depending on where you are in the nonlinearity, how 352 00:21:37,630 --> 00:21:38,820 it might be fitting. 353 00:21:38,820 --> 00:21:42,410 So that's one example where that could arise. 354 00:21:42,410 --> 00:21:43,415 Any other possibilities? 355 00:21:46,090 --> 00:21:50,380 Well, next week we'll be talking about some time series models. 356 00:21:50,380 --> 00:21:54,650 And there can be time dependence amongst variables 357 00:21:54,650 --> 00:21:57,170 where there are some underlying factors maybe 358 00:21:57,170 --> 00:21:58,850 that are driving the process. 359 00:21:58,850 --> 00:22:02,030 And those ongoing factors can persist 360 00:22:02,030 --> 00:22:05,480 in making the linear relationship 361 00:22:05,480 --> 00:22:11,300 over or under gauge the dependent variable. 362 00:22:11,300 --> 00:22:13,850 So that can happen as well. 363 00:22:13,850 --> 00:22:17,410 All right, yes? 364 00:22:17,410 --> 00:22:19,949 AUDIENCE: The Gauss-Markov is just the diagonal case? 365 00:22:19,949 --> 00:22:22,490 PROFESSOR: Yes, the Gauss-Markov is simply the diagonal case. 366 00:22:22,490 --> 00:22:27,090 And explicitly if we replace y's here by the residuals, 367 00:22:27,090 --> 00:22:30,780 epsilon_1 through epsilon_n, then 368 00:22:30,780 --> 00:22:36,930 that diagonal matrix with a constant diagonal 369 00:22:36,930 --> 00:22:41,948 is the simple Gauss-Markov assumption, yeah. 
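[To make the covariance-matrix notation on the board concrete, here is a minimal NumPy sketch. It is illustrative only: the dimensions, seed, and the particular general covariance matrix are made up, and normal draws are used purely for convenience of simulation, since Gauss-Markov itself only restricts means and covariances.]

```python
import numpy as np

# Contrast the simple Gauss-Markov covariance, sigma^2 * I_n, with a general
# covariance matrix Sigma having unequal variances and nonzero covariances.
rng = np.random.default_rng(0)
n, sigma = 4, 2.0

# Simple Gauss-Markov case: diagonal covariance with a constant diagonal.
cov_gm = sigma**2 * np.eye(n)

# Generalized case: any symmetric positive definite matrix is allowed;
# this particular Sigma is just an example.
A = rng.normal(size=(n, n))
Sigma = A @ A.T + n * np.eye(n)

# Draw many residual vectors from each model; the sample covariance matrices
# recover sigma^2 * I_n and Sigma respectively.
eps_gm = rng.multivariate_normal(np.zeros(n), cov_gm, size=100_000)
eps_gen = rng.multivariate_normal(np.zeros(n), Sigma, size=100_000)
print(np.round(np.cov(eps_gm, rowvar=False), 2))   # close to sigma^2 * I_n
print(np.round(np.cov(eps_gen, rowvar=False), 2))  # close to Sigma
```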
370 00:22:45,270 --> 00:22:48,920 Now, I'm sure it comes as no surprise 371 00:22:48,920 --> 00:22:51,450 that Gaussian distributions don't always fit everything. 372 00:22:51,450 --> 00:22:55,350 And so one needs to get clever with extending 373 00:22:55,350 --> 00:23:01,560 the models to other cases. 374 00:23:01,560 --> 00:23:07,280 And there are-- I know-- Laplace distributions, Pareto 375 00:23:07,280 --> 00:23:09,760 distributions, contaminated normal distributions, 376 00:23:09,760 --> 00:23:14,000 which can be used to fit regression models. 377 00:23:14,000 --> 00:23:22,330 And these general cases really extend the applicability 378 00:23:22,330 --> 00:23:28,610 of regression models to many interesting settings. 379 00:23:28,610 --> 00:23:35,530 So let's turn to specifying the estimator criterion in two. 380 00:23:35,530 --> 00:23:39,590 So how do we judge what's a good estimate of the regression 381 00:23:39,590 --> 00:23:40,580 parameters? 382 00:23:40,580 --> 00:23:45,390 Well, we're going to cover least squares, maximum likelihood, 383 00:23:45,390 --> 00:23:50,990 robust methods, which are contamination resistant. 384 00:23:50,990 --> 00:23:56,370 And other methods exist that we will mention but not 385 00:23:56,370 --> 00:23:57,870 get into really in the lectures, are 386 00:23:57,870 --> 00:24:03,390 Bayes methods and accommodating incomplete or missing data. 387 00:24:03,390 --> 00:24:08,150 Essentially, as your approach to modeling a problem 388 00:24:08,150 --> 00:24:10,120 gets more and more realistic, you 389 00:24:10,120 --> 00:24:14,580 start adding more and more complexity as it's needed. 390 00:24:14,580 --> 00:24:24,450 And certainly issues of-- well, robust methods 391 00:24:24,450 --> 00:24:26,724 is where you assume most of the data 392 00:24:26,724 --> 00:24:28,890 arrives under normal conditions, but once in a while 393 00:24:28,890 --> 00:24:31,710 there may be some problem with the data. 394 00:24:31,710 --> 00:24:34,390 And you don't want your methodology just 395 00:24:34,390 --> 00:24:39,635 to break down if there happens to be some outliers in the data 396 00:24:39,635 --> 00:24:41,090 or contamination. 397 00:24:41,090 --> 00:24:47,200 Bayes methodologies are the technology 398 00:24:47,200 --> 00:24:50,300 for incorporating subjective beliefs 399 00:24:50,300 --> 00:24:54,310 into statistical models. 400 00:24:54,310 --> 00:24:57,190 And I think it's fair to say that probably 401 00:24:57,190 --> 00:25:01,720 all statistical modeling is essentially subjective. 402 00:25:01,720 --> 00:25:05,820 And so if you're going to be good at statistical modeling, 403 00:25:05,820 --> 00:25:08,779 you want to be sure that you're effectively incorporating 404 00:25:08,779 --> 00:25:10,070 subjective information in that. 405 00:25:10,070 --> 00:25:13,400 And so Bayes methodologies are very, very useful, 406 00:25:13,400 --> 00:25:16,460 and indeed pretty much required to engage 407 00:25:16,460 --> 00:25:19,610 in appropriate modeling. 408 00:25:19,610 --> 00:25:22,700 And then finally, accommodate incomplete or missing data. 409 00:25:22,700 --> 00:25:26,920 The world is always sort of cruel in terms of you 410 00:25:26,920 --> 00:25:30,900 often are missing what you think is critical information 411 00:25:30,900 --> 00:25:32,060 to do your analysis. 412 00:25:32,060 --> 00:25:34,520 And so how do you deal with situations 413 00:25:34,520 --> 00:25:39,680 where you have some holes in your data? 
414 00:25:39,680 --> 00:25:44,720 Statistical models provide good methods and tools 415 00:25:44,720 --> 00:25:46,596 for dealing with that situation. 416 00:25:49,160 --> 00:25:49,720 OK. 417 00:25:49,720 --> 00:25:50,960 Then let's see. 418 00:25:50,960 --> 00:25:55,930 On case analyses for checking assumptions, 419 00:25:55,930 --> 00:25:57,540 let me go through this. 420 00:26:00,560 --> 00:26:03,790 Basically when you fit a regression model, 421 00:26:03,790 --> 00:26:07,680 you check assumptions by looking at the residuals, which 422 00:26:07,680 --> 00:26:16,330 are basically the estimates of the epsilons, the deviations 423 00:26:16,330 --> 00:26:22,359 of the dependent variable from its predictions. 424 00:26:22,359 --> 00:26:25,930 And what one wants to do is analyze 425 00:26:25,930 --> 00:26:29,660 these to determine whether our assumptions are appropriate. 426 00:26:29,660 --> 00:26:33,460 OK, with the Gauss-Markov assumptions the question would be, 427 00:26:33,460 --> 00:26:36,170 do these appear to have constant variance? 428 00:26:36,170 --> 00:26:40,270 And it may be that their variance depends on time, 429 00:26:40,270 --> 00:26:42,555 if the i is indexing time. 430 00:26:45,450 --> 00:26:48,190 Residuals might depend on the other variables as well, 431 00:26:48,190 --> 00:26:53,610 and one wants to determine that that isn't the case. 432 00:26:53,610 --> 00:26:57,660 There are also influence diagnostics identifying cases 433 00:26:57,660 --> 00:27:00,330 which are highly influential. 434 00:27:00,330 --> 00:27:05,620 It turns out that when you are building a regression 435 00:27:05,620 --> 00:27:10,560 model with data, you treat all the cases as 436 00:27:10,560 --> 00:27:12,830 if they're equally important. 437 00:27:12,830 --> 00:27:15,390 Well, it may be that certain cases 438 00:27:15,390 --> 00:27:19,540 are really critical to estimating certain factors. 439 00:27:19,540 --> 00:27:25,650 And it may be that much of the inference about how important 440 00:27:25,650 --> 00:27:27,940 a certain factor is, is determined 441 00:27:27,940 --> 00:27:30,550 by a very small number of points. 442 00:27:30,550 --> 00:27:32,875 So even though you have a massive data set 443 00:27:32,875 --> 00:27:34,265 that you're using to fit a model, 444 00:27:34,265 --> 00:27:36,675 it could be that some of the structure 445 00:27:36,675 --> 00:27:39,600 is driven by a very small number of cases. 446 00:27:39,600 --> 00:27:45,100 So influence diagnostics give you a way of analyzing that. 447 00:27:45,100 --> 00:27:50,730 In the problem set for this lecture, 448 00:27:50,730 --> 00:27:53,470 you'll be deriving some influence diagnostics 449 00:27:53,470 --> 00:27:55,380 for linear regression models and seeing how 450 00:27:55,380 --> 00:27:57,790 they're mathematically defined. 451 00:27:57,790 --> 00:28:01,150 And I'll be distributing a case study which 452 00:28:01,150 --> 00:28:04,290 illustrates fitting linear regression 453 00:28:04,290 --> 00:28:06,350 models for asset prices. 454 00:28:06,350 --> 00:28:10,502 And you can see how those play out 455 00:28:10,502 --> 00:28:11,710 with some practical examples. 456 00:28:16,170 --> 00:28:18,525 OK, finally there's outlier detection. 457 00:28:21,210 --> 00:28:25,790 With outliers, it's interesting. 458 00:28:25,790 --> 00:28:33,100 The exceptions in data are often the most interesting. 459 00:28:33,100 --> 00:28:36,560 It's important in modeling to understand 460 00:28:36,560 --> 00:28:40,700 whether certain cases are unusual. 
461 00:28:40,700 --> 00:28:47,600 And sometimes their degree of idiosyncrasy 462 00:28:47,600 --> 00:28:50,580 can be explained away so that one essentially 463 00:28:50,580 --> 00:28:51,710 discards those outliers. 464 00:28:51,710 --> 00:28:54,790 But other times, those idiosyncrasies 465 00:28:54,790 --> 00:28:57,890 lead to extensions of the model. 466 00:28:57,890 --> 00:29:04,345 And so outlier detection can be very important for validating 467 00:29:04,345 --> 00:29:05,980 a model. 468 00:29:05,980 --> 00:29:10,970 OK, so with that introduction to regression, linear regression, 469 00:29:10,970 --> 00:29:14,590 let's talk about ordinary least squares. 470 00:29:14,590 --> 00:29:15,090 Ah. 471 00:29:19,075 --> 00:29:24,360 OK, the least squares criterion is for a given a regression 472 00:29:24,360 --> 00:29:27,120 parameter, beta, which is considered 473 00:29:27,120 --> 00:29:32,160 to be a column vector-- so I'm taking the transpose of a row 474 00:29:32,160 --> 00:29:32,660 vector. 475 00:29:35,920 --> 00:29:39,610 The least squares criterion is to basically take 476 00:29:39,610 --> 00:29:43,170 the sum of square deviations from the actual value 477 00:29:43,170 --> 00:29:46,160 of the response variable from its linear prediction. 478 00:29:46,160 --> 00:29:48,760 So y_i minus y hat i, we're just plugging 479 00:29:48,760 --> 00:29:50,980 in for y hat i the linear function 480 00:29:50,980 --> 00:29:55,470 of the independent variables and the squaring that. 481 00:29:58,270 --> 00:30:01,570 And the ordinary least squares estimate, beta hat, 482 00:30:01,570 --> 00:30:05,490 minimizes this function. 483 00:30:05,490 --> 00:30:11,520 So in order to solve for this, we're going to use matrices. 484 00:30:11,520 --> 00:30:16,627 And so we're going to take the y vector, the vector of n 485 00:30:16,627 --> 00:30:18,002 values of the dependent variable, 486 00:30:18,002 --> 00:30:21,710 or the response variable, and X, the matrix 487 00:30:21,710 --> 00:30:24,210 of values of the independent variable. 488 00:30:24,210 --> 00:30:28,800 It's important in this set up to keep straight 489 00:30:28,800 --> 00:30:32,990 that cases go by rows and columns 490 00:30:32,990 --> 00:30:35,585 go by values of the independent variable. 491 00:30:41,410 --> 00:30:44,450 Boy, this thing is ultra sensitive. 492 00:30:44,450 --> 00:30:44,950 Excuse me. 493 00:30:49,317 --> 00:30:50,650 Do I turn off the touchpad here? 494 00:30:50,650 --> 00:30:53,050 OK. 495 00:30:53,050 --> 00:31:01,990 So we can now define our fitted value, y hat, 496 00:31:01,990 --> 00:31:06,330 to be equal to the matrix x times beta. 497 00:31:06,330 --> 00:31:10,940 And with matrix multiplication, that results in the y hat 1 498 00:31:10,940 --> 00:31:13,350 through y hat n. 499 00:31:13,350 --> 00:31:20,800 And Q of beta can basically be written as y minus X beta 500 00:31:20,800 --> 00:31:23,330 transpose y minus X beta. 501 00:31:23,330 --> 00:31:27,580 So this term here is an n-vector minus the product 502 00:31:27,580 --> 00:31:30,990 of the X matrix times beta, which is another n-vector. 503 00:31:30,990 --> 00:31:33,380 And we're just taking the cross product of that. 504 00:31:37,410 --> 00:31:41,460 And the ordinary least squares estimate for beta 505 00:31:41,460 --> 00:31:47,650 solves the derivative of this criterion equaling 0. 506 00:31:47,650 --> 00:31:53,990 Now, that's in fact true, but who 507 00:31:53,990 --> 00:31:55,400 can tell me why that's true? 508 00:32:00,360 --> 00:32:00,860 Say again? 
509 00:32:00,860 --> 00:32:01,750 AUDIENCE: Is that minimum? 510 00:32:01,750 --> 00:32:02,333 PROFESSOR: OK. 511 00:32:02,333 --> 00:32:04,763 So your name? 512 00:32:04,763 --> 00:32:05,610 AUDIENCE: Seth. 513 00:32:05,610 --> 00:32:06,313 PROFESSOR: Seth? 514 00:32:06,313 --> 00:32:06,812 Seth. 515 00:32:06,812 --> 00:32:07,760 Very good, Seth. 516 00:32:07,760 --> 00:32:08,850 Thanks, Seth. 517 00:32:08,850 --> 00:32:13,200 So if we want to find a minimum of Q, 518 00:32:13,200 --> 00:32:17,120 then that minimum will have, if it's a smooth function, 519 00:32:17,120 --> 00:32:20,950 will have a minimum at slope equals 0. 520 00:32:20,950 --> 00:32:23,405 Now, how do we know whether it's a minimum or not? 521 00:32:23,405 --> 00:32:25,424 It could be a maximum. 522 00:32:25,424 --> 00:32:26,340 AUDIENCE: [INAUDIBLE]? 523 00:32:29,160 --> 00:32:30,430 PROFESSOR: OK, right. 524 00:32:30,430 --> 00:32:34,050 So in fact, this is a-- Q of beta 525 00:32:34,050 --> 00:32:39,160 is a convex function of beta. 526 00:32:39,160 --> 00:32:44,770 And so its second derivative is positive. 527 00:32:44,770 --> 00:32:52,140 And if you basically think about the set-- basically, 528 00:32:52,140 --> 00:32:54,310 this is the first derivative of Q with respect 529 00:32:54,310 --> 00:32:55,680 to beta equaling 0. 530 00:32:55,680 --> 00:32:58,420 If you were to solve for the second derivative of Q 531 00:32:58,420 --> 00:33:03,060 with respect to beta, well, beta is a p-vector. 532 00:33:03,060 --> 00:33:04,560 So the second derivative is actually 533 00:33:04,560 --> 00:33:11,240 a second derivative matrix, and that matrix, 534 00:33:11,240 --> 00:33:12,540 you can solve for it. 535 00:33:12,540 --> 00:33:14,360 It will be X transpose X, which is 536 00:33:14,360 --> 00:33:18,730 a positive definite or semi-definite matrix. 537 00:33:18,730 --> 00:33:23,950 So it basically had a positive derivative there. 538 00:33:23,950 --> 00:33:27,210 So anyway, this ordinary least squares estimates 539 00:33:27,210 --> 00:33:32,370 will solve this d Q of beta by d beta equals 0. 540 00:33:32,370 --> 00:33:36,330 What does d Q beta by d beta_j? 541 00:33:36,330 --> 00:33:40,980 Well, you just take the derivative of this sum. 542 00:33:44,530 --> 00:33:47,390 So we're taking the sum of all these elements. 543 00:33:50,280 --> 00:33:53,640 And if you take the derivative-- well, 544 00:33:53,640 --> 00:33:57,210 OK, the derivative is a linear operator. 545 00:33:57,210 --> 00:34:00,450 So the derivative of a sum is the sum of the derivatives. 546 00:34:00,450 --> 00:34:04,150 So we take the summation out and we take the derivative of each 547 00:34:04,150 --> 00:34:09,929 term, so we get 2 minus x_(i,j), then the thing in square 548 00:34:09,929 --> 00:34:11,399 brackets, y_i minus that. 549 00:34:15,239 --> 00:34:16,139 And what is that? 550 00:34:16,139 --> 00:34:19,090 Well, in matrix notation, if we let 551 00:34:19,090 --> 00:34:22,469 this sort of bold X sub square j denote 552 00:34:22,469 --> 00:34:25,670 the j-th column of the independent variables, 553 00:34:25,670 --> 00:34:28,219 then this is minus 2. 554 00:34:28,219 --> 00:34:35,350 Basically, the j-th column of X transpose times y minus X beta. 555 00:34:35,350 --> 00:34:42,940 So this j-th equation for ordinary least squares 556 00:34:42,940 --> 00:34:46,980 has that representation in terms-- in matrix notation. 
557 00:34:46,980 --> 00:34:51,860 Now if we put that all together, we basically 558 00:34:51,860 --> 00:34:55,400 can define this derivative of Q with respect 559 00:34:55,400 --> 00:34:57,740 to the different regression parameters 560 00:34:57,740 --> 00:35:05,640 as basically the minus twice the j-th column stacked times y 561 00:35:05,640 --> 00:35:10,120 minus X beta, which is simply minus 2 X transpose, y minus X 562 00:35:10,120 --> 00:35:11,040 beta. 563 00:35:11,040 --> 00:35:14,870 And this has to equal 0. 564 00:35:14,870 --> 00:35:19,550 And if we just simplify, taking out the two, 565 00:35:19,550 --> 00:35:22,140 we get this set of equations. 566 00:35:22,140 --> 00:35:25,560 It must be satisfied by the ordinary least squares 567 00:35:25,560 --> 00:35:27,190 estimate, beta. 568 00:35:27,190 --> 00:35:31,960 And that's called the normal equations in books 569 00:35:31,960 --> 00:35:35,270 on regression modeling. 570 00:35:35,270 --> 00:35:37,310 So let's consider how we solve that. 571 00:35:37,310 --> 00:35:43,100 Well, we can re-express that by multiplying through the X 572 00:35:43,100 --> 00:35:46,010 transpose on each of the terms. 573 00:35:46,010 --> 00:35:54,770 And then beta hat basically solves this equation. 574 00:35:54,770 --> 00:35:58,170 And if X transpose X inverse exists, 575 00:35:58,170 --> 00:36:02,385 we get beta hat is equal to X transpose X inverse X 576 00:36:02,385 --> 00:36:04,920 transpose y. 577 00:36:04,920 --> 00:36:11,030 So with matrix algebra, we can actually solve this. 578 00:36:11,030 --> 00:36:14,530 And matrix algebra is going to be 579 00:36:14,530 --> 00:36:16,910 very important to this lecture and other lectures. 580 00:36:16,910 --> 00:36:20,260 So if this stuff is-- if you're a bit rusty on this, 581 00:36:20,260 --> 00:36:22,210 do brush up. 582 00:36:26,370 --> 00:36:29,610 This particular solution for beta hat 583 00:36:29,610 --> 00:36:37,980 assumes that X transpose X inverse exists. 584 00:36:37,980 --> 00:36:42,920 Who can tell me what assumptions do 585 00:36:42,920 --> 00:36:47,290 we need to make for X transpose X to have an inverse? 586 00:36:52,110 --> 00:36:55,910 I'll call you in a second if no one else does. 587 00:36:55,910 --> 00:36:59,528 Somebody just said something. 588 00:36:59,528 --> 00:37:01,370 Someone else. 589 00:37:01,370 --> 00:37:01,870 No? 590 00:37:01,870 --> 00:37:02,030 All right. 591 00:37:02,030 --> 00:37:02,400 OK, Will. 592 00:37:02,400 --> 00:37:03,860 AUDIENCE: So X transpose X inverse 593 00:37:03,860 --> 00:37:06,160 needs to have full rank, which means 594 00:37:06,160 --> 00:37:10,269 that each of the submatrices needs to have [INAUDIBLE] 595 00:37:10,269 --> 00:37:11,750 smaller dimension. 596 00:37:11,750 --> 00:37:15,305 PROFESSOR: OK, so Will said, basically, the matrix X 597 00:37:15,305 --> 00:37:17,250 needs to have full rank. 598 00:37:17,250 --> 00:37:23,990 And so if X has full rank, then-- well, let's see. 599 00:37:23,990 --> 00:37:29,340 If X has full rank, then the singular value decomposition 600 00:37:29,340 --> 00:37:35,990 which was in the very first class can exist. 601 00:37:35,990 --> 00:37:40,330 And you have basically p singular values 602 00:37:40,330 --> 00:37:42,730 that are all non-zero. 
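[A minimal NumPy sketch of the normal-equations solution just derived; the simulated design matrix X, coefficient vector, and noise level are made up for illustration, not from the lecture.]

```python
import numpy as np

# Simulate y = X beta + eps and recover beta by ordinary least squares.
rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# X'X is invertible only if X has full column rank (no column is a linear
# combination of the others).
assert np.linalg.matrix_rank(X) == p

# Normal equations: (X'X) beta_hat = X'y.  Solving the linear system is
# numerically preferable to forming (X'X)^{-1} explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq solves the same least squares problem via the SVD of X,
# which is where the full-rank / nonzero-singular-value condition shows up.
beta_svd, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat, beta_svd)   # both close to beta_true
```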
603 00:37:42,730 --> 00:37:46,100 And X transpose X can be expressed 604 00:37:46,100 --> 00:37:50,610 as sort of a, from the singular value decomposition, 605 00:37:50,610 --> 00:37:53,110 as one of the orthogonal matrices times the square 606 00:37:53,110 --> 00:37:57,145 of the singular values times that same matrix transpose, 607 00:37:57,145 --> 00:37:59,390 if you recall that definition. 608 00:37:59,390 --> 00:38:02,420 So that actually is-- it basically 609 00:38:02,420 --> 00:38:06,590 provides a solution for X transpose X inverse, indeed, 610 00:38:06,590 --> 00:38:08,810 from the singular value decomposition of X. 611 00:38:08,810 --> 00:38:11,910 But what's required is that you have a full rank in X. 612 00:38:11,910 --> 00:38:14,450 And what that means is that you can't have 613 00:38:14,450 --> 00:38:20,120 independent variables that are explained 614 00:38:20,120 --> 00:38:21,950 by other independent variables. 615 00:38:21,950 --> 00:38:29,010 So different columns of X have to be linear-- 616 00:38:29,010 --> 00:38:32,670 or they can't linearly depend on any other columns of X. 617 00:38:32,670 --> 00:38:34,670 Otherwise, you would have reduced rank. 618 00:38:37,460 --> 00:38:44,570 So now if beta hat doesn't have full rank, 619 00:38:44,570 --> 00:38:49,380 then our least squares estimate of beta might be non-unique. 620 00:38:49,380 --> 00:38:53,280 And in fact, it is the case that if you 621 00:38:53,280 --> 00:38:55,600 are really interested in just predicting 622 00:38:55,600 --> 00:38:59,180 values of a dependent variable, then 623 00:38:59,180 --> 00:39:03,440 having non-unique least squares estimates 624 00:39:03,440 --> 00:39:05,530 isn't as much of a problem, because you still 625 00:39:05,530 --> 00:39:07,090 get estimates out of that. 626 00:39:07,090 --> 00:39:11,302 But for now, we want to assume that there's full column rank 627 00:39:11,302 --> 00:39:12,510 in the independent variables. 628 00:39:16,010 --> 00:39:17,510 All right. 629 00:39:17,510 --> 00:39:30,100 Now, if we plug in the value of the solution for the least 630 00:39:30,100 --> 00:39:34,780 squares estimate, we get fitted values 631 00:39:34,780 --> 00:39:41,960 for the response variable, which are simply the matrix X times 632 00:39:41,960 --> 00:39:44,370 beta hat. 633 00:39:44,370 --> 00:39:52,070 And this expression for the fitted values 634 00:39:52,070 --> 00:39:58,570 is basically X times X transpose X inverse X transpose y, 635 00:39:58,570 --> 00:40:01,330 which we can represent as Hy. 636 00:40:01,330 --> 00:40:08,160 Basically, this H matrix in linear models and statistics 637 00:40:08,160 --> 00:40:09,770 is called the hat matrix. 638 00:40:09,770 --> 00:40:12,310 It's basically a projection matrix 639 00:40:12,310 --> 00:40:19,120 that takes the linear vector, or the vector of values 640 00:40:19,120 --> 00:40:24,290 of the response variable, into the fitted values. 641 00:40:24,290 --> 00:40:30,713 So this hat matrix is quite important. 642 00:40:34,640 --> 00:40:37,745 The problem set's going to cover some features, 643 00:40:37,745 --> 00:40:39,695 go into some properties of the hat matrix. 644 00:40:42,790 --> 00:40:49,180 Does anyone want to make any comments about this hat matrix? 645 00:40:49,180 --> 00:40:52,290 It's actually a very special type of matrix. 646 00:40:52,290 --> 00:40:56,230 Does anyone want to point out what that special type is? 647 00:41:00,420 --> 00:41:02,811 It's a projection matrix, OK. 648 00:41:02,811 --> 00:41:03,310 Yeah. 
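[Here is a short NumPy sketch of the hat matrix and its projection properties. X is simulated; the symmetry, idempotence, and trace facts checked below are standard properties of orthogonal projections, noted here without proof, and the use of the diagonal of H as a leverage/influence diagnostic anticipates the problem set mentioned earlier.]

```python
import numpy as np

# Hat matrix H = X (X'X)^{-1} X' for a simulated design matrix.
rng = np.random.default_rng(2)
n, p = 20, 3
X = rng.normal(size=(n, p))
H = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(H, H.T))          # symmetric
print(np.allclose(H @ H, H))        # idempotent: projecting twice = projecting once
print(np.isclose(np.trace(H), p))   # trace equals p, the dimension of the column space

# Fitted values are the projection of y onto the column space of X.
y = rng.normal(size=n)
y_hat = H @ y

# The diagonal entries h_ii ("leverage") are one standard influence
# diagnostic: a large h_ii flags a case with outsized pull on its own fit.
leverage = np.diag(H)
print(leverage.sum())               # the leverages also sum to p
```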
649 00:41:03,310 --> 00:41:08,490 And in linear algebra, projection matrices 650 00:41:08,490 --> 00:41:11,970 have some very special properties. 651 00:41:11,970 --> 00:41:17,400 And it's actually an orthogonal projection matrix. 652 00:41:17,400 --> 00:41:24,030 And so if you're interested in that feature, 653 00:41:24,030 --> 00:41:25,290 you should look into that. 654 00:41:25,290 --> 00:41:30,635 But it's really a very rich set of properties associated 655 00:41:30,635 --> 00:41:31,510 with this hat matrix. 656 00:41:31,510 --> 00:41:36,970 It's an orthogonal projection, and it's-- let's see. 657 00:41:36,970 --> 00:41:38,180 What's it projecting? 658 00:41:38,180 --> 00:41:40,600 It's projecting from n-space into what? 659 00:41:44,570 --> 00:41:45,150 Go ahead. 660 00:41:45,150 --> 00:41:46,396 What's your name? 661 00:41:46,396 --> 00:41:47,062 AUDIENCE: Ethan. 662 00:41:47,062 --> 00:41:47,870 PROFESSOR: Ethan, OK. 663 00:41:47,870 --> 00:41:49,203 AUDIENCE: Into space [INAUDIBLE] 664 00:41:51,250 --> 00:41:52,375 PROFESSOR: Basically, yeah. 665 00:41:52,375 --> 00:41:55,930 It's projecting into the column space of X. 666 00:41:55,930 --> 00:42:00,730 So that's what linear regression is doing. 667 00:42:00,730 --> 00:42:05,220 So in focusing and understanding linear regression, 668 00:42:05,220 --> 00:42:08,620 you can think of, how do we get estimates of this p-vector? 669 00:42:08,620 --> 00:42:11,880 That's all very good and useful, and we'll do a lot of that. 670 00:42:11,880 --> 00:42:13,880 But you can also think of it as, what's 671 00:42:13,880 --> 00:42:15,750 happening in the n-dimensional space? 672 00:42:15,750 --> 00:42:17,660 So you basically are representing 673 00:42:17,660 --> 00:42:21,960 this n-dimensional vector y by its projection 674 00:42:21,960 --> 00:42:23,180 onto the column space. 675 00:42:29,730 --> 00:42:32,920 Now, the residuals are basically the difference 676 00:42:32,920 --> 00:42:38,320 between the response value and the fitted value. 677 00:42:38,320 --> 00:42:43,960 And this can be expressed as y minus y hat, 678 00:42:43,960 --> 00:42:48,370 or I_n minus H times y. 679 00:42:48,370 --> 00:42:58,700 And it turns out that I_n minus H is also a projection matrix, 680 00:42:58,700 --> 00:43:03,950 and it's projecting the data onto the space orthogonal 681 00:43:03,950 --> 00:43:06,680 to the column space of x. 682 00:43:06,680 --> 00:43:11,980 And to show that that's true, if we consider 683 00:43:11,980 --> 00:43:16,420 the normal equations, which are X transpose y minus X beta 684 00:43:16,420 --> 00:43:20,890 hat equaling 0, that basically is X transpose epsilon hat 685 00:43:20,890 --> 00:43:22,310 equals 0. 686 00:43:22,310 --> 00:43:25,410 And so from the normal equations, 687 00:43:25,410 --> 00:43:27,610 we can see that what they mean is 688 00:43:27,610 --> 00:43:33,380 they mean that the residual vector epsilon hat is 689 00:43:33,380 --> 00:43:36,550 orthogonal to each of the columns of X. 690 00:43:36,550 --> 00:43:38,570 You can take any column in X, multiply that 691 00:43:38,570 --> 00:43:42,210 by the residual vector, and get 0 coming out. 692 00:43:42,210 --> 00:43:47,760 So that's a feature of the residuals 693 00:43:47,760 --> 00:43:51,660 as they relate to the independent variables. 694 00:43:51,660 --> 00:43:53,940 OK, all right. 695 00:43:53,940 --> 00:44:00,520 So at this point, we've gone through really not talking 696 00:44:00,520 --> 00:44:03,669 about any statistical properties to specify the betas. 
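[A small numerical check of this orthogonality, with simulated data; the dimensions and tolerances are arbitrary choices for the sketch.]

```python
import numpy as np

# Check that the residual vector is orthogonal to the column space of X,
# which is a direct restatement of the normal equations.
rng = np.random.default_rng(3)
n, p = 50, 4
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - H                  # I_n - H, the complementary projection
resid = M @ y                      # epsilon_hat = y - y_hat

print(np.allclose(M @ M, M))                         # I_n - H is also a projection
print(np.allclose(X.T @ resid, 0.0, atol=1e-8))      # X' epsilon_hat = 0
print(np.allclose((H @ y) @ resid, 0.0, atol=1e-8))  # y_hat is orthogonal to epsilon_hat
```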
697 00:44:03,669 --> 00:44:06,210 All we've done is talked-- we've introduced the least squares 698 00:44:06,210 --> 00:44:09,770 criterion and said, what value of the beta vector 699 00:44:09,770 --> 00:44:13,260 minimizes that least squares criterion? 700 00:44:13,260 --> 00:44:17,470 Let's turn to the Gauss-Markov theorem 701 00:44:17,470 --> 00:44:22,240 and start introducing some statistical properties, 702 00:44:22,240 --> 00:44:24,190 probability properties. 703 00:44:24,190 --> 00:44:28,945 So with our data, y and X-- yes? 704 00:44:28,945 --> 00:44:29,445 Yes. 705 00:44:29,445 --> 00:44:30,361 AUDIENCE: [INAUDIBLE]? 706 00:44:36,110 --> 00:44:37,268 PROFESSOR: That epsilon-- 707 00:44:37,268 --> 00:44:38,184 AUDIENCE: [INAUDIBLE]? 708 00:44:41,480 --> 00:44:42,752 PROFESSOR: OK. 709 00:44:42,752 --> 00:44:43,710 Let me go back to that. 710 00:44:47,830 --> 00:44:53,500 It's that X, the columns of X, and the column 711 00:44:53,500 --> 00:44:59,660 vector of the residual are orthogonal to each other. 712 00:44:59,660 --> 00:45:04,330 So we're not doing a projection onto a null space. 713 00:45:04,330 --> 00:45:11,440 This is just a statement that those values, or those column 714 00:45:11,440 --> 00:45:15,160 vectors, are orthogonal to each other. 715 00:45:15,160 --> 00:45:22,010 And just to recap, the epsilon is a projection of y 716 00:45:22,010 --> 00:45:27,060 onto the space orthogonal to the column space. 717 00:45:27,060 --> 00:45:32,788 And y hat is a projection onto the column space of y. 718 00:45:32,788 --> 00:45:37,405 And these projections are all orthogonal projections, 719 00:45:37,405 --> 00:45:45,740 and so they happen to result in the projected value epsilon hat 720 00:45:45,740 --> 00:45:48,770 must be orthogonal to the column space of X, 721 00:45:48,770 --> 00:45:53,080 if you project it out. 722 00:45:53,080 --> 00:45:54,300 OK? 723 00:45:54,300 --> 00:45:55,130 All right. 724 00:45:55,130 --> 00:46:02,240 So the Gauss-Markov theorem, we have data y and X again. 725 00:46:02,240 --> 00:46:05,810 And now we're going to think of the observed data, 726 00:46:05,810 --> 00:46:08,640 little y_1 through y_n, is actually 727 00:46:08,640 --> 00:46:12,050 an observation of the random vector capital 728 00:46:12,050 --> 00:46:19,610 Y, composed of random variables Y_1 up to Y_n. 729 00:46:19,610 --> 00:46:25,120 And the expectation of this vector 730 00:46:25,120 --> 00:46:28,740 conditional on the values of the independent variables 731 00:46:28,740 --> 00:46:30,680 and their regression parameters given by X, 732 00:46:30,680 --> 00:46:36,490 beta-- so the dependent variable vector 733 00:46:36,490 --> 00:46:40,880 has expectation given by the product 734 00:46:40,880 --> 00:46:43,340 of the independent variables matrix times the regression 735 00:46:43,340 --> 00:46:44,750 parameters. 736 00:46:44,750 --> 00:46:48,425 And the covariance matrix of Y given X and beta 737 00:46:48,425 --> 00:46:50,930 is sigma squared times the identity 738 00:46:50,930 --> 00:46:54,330 matrix, the n-dimensional identity matrix. 739 00:46:54,330 --> 00:46:58,120 So the identity matrix has 1's along the diagonal, 740 00:46:58,120 --> 00:47:00,480 n-dimensional, and 0's off the diagonal. 741 00:47:00,480 --> 00:47:06,980 So the variances of the Y's are the diagonal entries, 742 00:47:06,980 --> 00:47:08,830 those are all the same, sigma squared. 743 00:47:08,830 --> 00:47:13,323 And the covariance between any two are equal to 0 744 00:47:13,323 --> 00:47:13,906 conditionally. 
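[Written out, the assumptions just stated are E[Y | X] = X beta and Cov(Y | X) = sigma^2 I_n; equivalently, Y = X beta + epsilon with E[epsilon | X] = 0 and Cov(epsilon | X) = sigma^2 I_n, where I_n is the n-by-n identity matrix.]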
745 00:47:21,530 --> 00:47:25,260 OK, now the Gauss-Markov theorem. 746 00:47:25,260 --> 00:47:31,790 This is a terrific result in linear models theory. 747 00:47:31,790 --> 00:47:37,790 And it's terrific in terms of the mathematical content of it. 748 00:47:37,790 --> 00:47:43,850 I think it's-- for a math class, it's really a nice theorem 749 00:47:43,850 --> 00:47:51,110 to introduce you to and highlight the power of, I 750 00:47:51,110 --> 00:47:55,570 guess, results that can arise from applying the theory. 751 00:47:55,570 --> 00:48:00,120 And so to set this theorem up, we 752 00:48:00,120 --> 00:48:05,710 want to think about trying to estimate some function 753 00:48:05,710 --> 00:48:07,770 of the regression parameters. 754 00:48:07,770 --> 00:48:14,060 And so OK, our problem is with ordinary least squares-- 755 00:48:14,060 --> 00:48:16,300 it was, how do we specify the regression parameters 756 00:48:16,300 --> 00:48:18,180 beta_1 through beta_p? 757 00:48:18,180 --> 00:48:23,510 Let's consider a general target of interest, 758 00:48:23,510 --> 00:48:26,660 which is a linear combination of the betas. 759 00:48:26,660 --> 00:48:32,090 So we want to predict a parameter theta which 760 00:48:32,090 --> 00:48:37,530 is some linear combination of the regression parameters. 761 00:48:37,530 --> 00:48:42,000 And because that linear combination of the regression 762 00:48:42,000 --> 00:48:49,920 parameters corresponds to the expectation of the response 763 00:48:49,920 --> 00:48:51,840 variable corresponding to a given 764 00:48:51,840 --> 00:48:53,970 row of the independent variables matrix, 765 00:48:53,970 --> 00:48:55,690 this is just a generalization of trying 766 00:48:55,690 --> 00:48:59,240 to estimate the means of the regression model 767 00:48:59,240 --> 00:49:01,410 at different points in the space, 768 00:49:01,410 --> 00:49:06,720 or to be estimating other quantities that might arise. 769 00:49:06,720 --> 00:49:09,340 So this is really a very general kind of thing 770 00:49:09,340 --> 00:49:10,540 to want to estimate. 771 00:49:10,540 --> 00:49:13,560 It certainly is appropriate for predictions. 772 00:49:13,560 --> 00:49:18,530 And if we consider the least squares estimate 773 00:49:18,530 --> 00:49:22,760 by just plugging in beta hat one through beta hat p, solved 774 00:49:22,760 --> 00:49:28,990 by the least squares, well, it turns out 775 00:49:28,990 --> 00:49:36,900 that those are an unbiased estimator of the parameter 776 00:49:36,900 --> 00:49:37,812 theta. 777 00:49:37,812 --> 00:49:39,770 So if we're trying to estimate this combination 778 00:49:39,770 --> 00:49:42,580 of these unknown parameters, you plug in the least squares 779 00:49:42,580 --> 00:49:47,350 estimate, you're going to get an estimator that's unbiased. 780 00:49:47,350 --> 00:49:49,330 Who can tell me what unbiased is? 781 00:49:49,330 --> 00:49:54,770 It's probably going to be a new concept for some people here. 782 00:49:54,770 --> 00:49:56,000 Anyone? 783 00:49:56,000 --> 00:49:59,810 OK, well it's a basic property of estimators 784 00:49:59,810 --> 00:50:04,800 in statistics where the expectation of this statistic 785 00:50:04,800 --> 00:50:06,720 is the true parameter. 786 00:50:06,720 --> 00:50:10,680 So it doesn't, on average, probabilistically, it 787 00:50:10,680 --> 00:50:12,860 doesn't over- or underestimate the value. 788 00:50:12,860 --> 00:50:14,740 So that's what unbiased means. 
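[A Monte Carlo sketch of the unbiasedness claim: for a fixed linear combination theta = c'beta, the plug-in estimator c'beta_hat averages out to theta over repeated samples drawn under the Gauss-Markov assumptions. The design X, the coefficients beta, the combination c, and sigma are all arbitrary choices for the demonstration.]

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 40, 3, 1.0
X = rng.normal(size=(n, p))
beta = np.array([0.5, -1.0, 2.0])
c = np.array([1.0, 1.0, 0.0])
theta = c @ beta                           # target: a linear combination of beta

proj = np.linalg.inv(X.T @ X) @ X.T        # maps y to beta_hat (X held fixed)
estimates = []
for _ in range(20_000):                    # repeated samples under Gauss-Markov
    y = X @ beta + rng.normal(scale=sigma, size=n)
    estimates.append(c @ (proj @ y))       # theta_hat = c' beta_hat, linear in y

print(theta, np.mean(estimates))           # the Monte Carlo average is close to theta
```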
789 00:50:14,740 --> 00:50:16,520 Now, it's also a linear estimator 790 00:50:16,520 --> 00:50:21,270 of theta in terms of this theta hat 791 00:50:21,270 --> 00:50:23,810 being a particular linear combination 792 00:50:23,810 --> 00:50:26,750 of the dependent variables. 793 00:50:26,750 --> 00:50:31,350 So with our original response variable y, 794 00:50:31,350 --> 00:50:37,510 in the case of y_1 through y_n, this theta hat is simply 795 00:50:37,510 --> 00:50:40,530 a linear combination of all the y's. 796 00:50:40,530 --> 00:50:42,560 And now why is that true? 797 00:50:42,560 --> 00:50:49,246 Well, we know that beta hat, from the normal equations, 798 00:50:49,246 --> 00:50:52,490 is solved by X transpose X inverse X transpose y. 799 00:50:52,490 --> 00:50:56,200 So it's a linear transform of the y vector. 800 00:50:56,200 --> 00:50:57,780 So if we take a linear combination 801 00:50:57,780 --> 00:51:00,480 of those components, it's also another linear combination 802 00:51:00,480 --> 00:51:01,490 of the y vector. 803 00:51:01,490 --> 00:51:06,090 So this is a linear function of the underlying-- 804 00:51:06,090 --> 00:51:08,830 of the response variables. 805 00:51:08,830 --> 00:51:12,130 Now, the Gauss-Markov theorem says 806 00:51:12,130 --> 00:51:16,350 that, if the Gauss-Markov assumptions apply, 807 00:51:16,350 --> 00:51:20,230 then the estimator theta hat has the smallest variance 808 00:51:20,230 --> 00:51:25,750 amongst all linear unbiased estimators of theta. 809 00:51:25,750 --> 00:51:28,910 So it actually is the optimal one, 810 00:51:28,910 --> 00:51:32,070 as long as this is our criterion. 811 00:51:32,070 --> 00:51:36,340 And this is really a very powerful result. 812 00:51:36,340 --> 00:51:42,482 And to prove it, it's very easy. 813 00:51:42,482 --> 00:51:45,610 Let's see. 814 00:51:45,610 --> 00:51:47,750 Actually, these notes are going to be distributed. 815 00:51:47,750 --> 00:51:53,890 So I'm going to go through this very, very quickly 816 00:51:53,890 --> 00:51:56,300 and come back to it later if we have more time. 817 00:51:56,300 --> 00:52:01,670 But basically, the argument for the proof here 818 00:52:01,670 --> 00:52:05,750 is you consider another linear estimate which 819 00:52:05,750 --> 00:52:08,410 is also an unbiased estimate. 820 00:52:08,410 --> 00:52:13,540 So let's consider a competitor to the least squares value 821 00:52:13,540 --> 00:52:17,710 and then look at the difference between that estimator 822 00:52:17,710 --> 00:52:20,960 and theta hat. 823 00:52:20,960 --> 00:52:27,880 And so that can be characterized as basically this vector, 824 00:52:27,880 --> 00:52:29,280 f transpose y. 825 00:52:32,010 --> 00:52:35,970 And this difference in the estimates 826 00:52:35,970 --> 00:52:40,030 must have expectation 0. 827 00:52:40,030 --> 00:52:42,790 So basically, if theta tilde is unbiased, 828 00:52:42,790 --> 00:52:46,200 then this expression here is going 829 00:52:46,200 --> 00:52:50,580 to be equal to zero, which means that f-- 830 00:52:50,580 --> 00:52:57,100 the vector that defines the difference 831 00:52:57,100 --> 00:52:59,230 between the two estimators-- 832 00:52:59,230 --> 00:53:02,120 has to be orthogonal to the column space of X. 833 00:53:02,120 --> 00:53:13,350 And with this result, one then uses 834 00:53:13,350 --> 00:53:17,530 this orthogonality of f and d to evaluate 835 00:53:17,530 --> 00:53:20,040 the variance of theta tilde.
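Before continuing with the proof sketch, the claim of the theorem can be checked directly with a small numerical example. The sketch below is not from the lecture; the design matrix, the target combination c, and the competing estimator (a weighted least squares fit with arbitrary weights, which is still linear and unbiased) are all made up. It compares the exact sampling variances of the two estimators of c'beta under the Gauss-Markov assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma2 = 50, 3, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
c = np.array([0.0, 1.0, 1.0])              # target is theta = c' beta

# Ordinary least squares: Var(c' beta_hat) = sigma^2 * c' (X'X)^{-1} c
XtX_inv = np.linalg.inv(X.T @ X)
var_ols = sigma2 * c @ XtX_inv @ c

# A competing linear unbiased estimator: weighted least squares with arbitrary weights W.
# Writing beta_tilde = M y, we have M X = I, so E[c' beta_tilde] = c' beta (unbiased).
W = np.diag(rng.uniform(0.2, 5.0, n))
M = np.linalg.inv(X.T @ W @ X) @ X.T @ W
var_wls = sigma2 * c @ (M @ M.T) @ c       # Cov(y) = sigma^2 I  =>  Var(c' M y) = sigma^2 c' M M' c

print(var_ols, var_wls)                    # var_ols <= var_wls, as the theorem guarantees
```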
836 00:53:20,040 --> 00:53:23,760 And in this proof, the mathematical argument 837 00:53:23,760 --> 00:53:29,535 here is really something-- I should put some asterisks 838 00:53:29,535 --> 00:53:31,140 on a few lines here. 839 00:53:31,140 --> 00:53:35,850 This expression here is actually very important. 840 00:53:35,850 --> 00:53:40,120 We're basically looking at the decomposition 841 00:53:40,120 --> 00:53:43,250 of the variance to be the variance of b 842 00:53:43,250 --> 00:53:46,890 transpose y, which is the variance of the sum 843 00:53:46,890 --> 00:53:49,350 of these two random variables. 844 00:53:49,350 --> 00:53:55,880 So the page before basically defined d and f 845 00:53:55,880 --> 00:53:57,560 such that this is true. 846 00:53:57,560 --> 00:54:02,370 Now when you consider the variance of a sum, 847 00:54:02,370 --> 00:54:05,870 it's not the sum of the variances. 848 00:54:05,870 --> 00:54:09,730 It's the sum of the variances plus twice 849 00:54:09,730 --> 00:54:12,170 the sum of the covariances. 850 00:54:12,170 --> 00:54:18,795 And so when you are calculating variances 851 00:54:18,795 --> 00:54:21,430 of sums of random variables, you have to really keep 852 00:54:21,430 --> 00:54:24,180 track of the covariance terms. 853 00:54:24,180 --> 00:54:26,200 In this case, this argument shows 854 00:54:26,200 --> 00:54:29,320 that the covariance terms are, in fact, 0, 855 00:54:29,320 --> 00:54:32,804 and you get the result popping out. 856 00:54:32,804 --> 00:54:38,890 But that's really a-- in an econometrics class, 857 00:54:38,890 --> 00:54:42,934 they'll talk about BLUE estimates of regression, 858 00:54:42,934 --> 00:54:45,100 or the BLUE property of the least squares estimates. 859 00:54:45,100 --> 00:54:47,260 That's where that comes from. 860 00:54:47,260 --> 00:54:53,090 All right, so let's now consider generalizing from Gauss-Markov 861 00:54:53,090 --> 00:55:04,660 to allow for unequal variances and possibly correlated 862 00:55:04,660 --> 00:55:08,710 nonzero covariances between the components. 863 00:55:08,710 --> 00:55:13,150 And in this case, the regression model 864 00:55:13,150 --> 00:55:14,560 has the same linear set up. 865 00:55:14,560 --> 00:55:16,970 The only difference is the expectation 866 00:55:16,970 --> 00:55:19,730 of the residual vector is still 0. 867 00:55:19,730 --> 00:55:22,920 But the covariance matrix of the residual vector 868 00:55:22,920 --> 00:55:26,330 is sigma squared, a single parameter, 869 00:55:26,330 --> 00:55:29,850 times let's say capital sigma. 870 00:55:29,850 --> 00:55:33,530 And we'll assume here that this capital sigma 871 00:55:33,530 --> 00:55:39,000 matrix is a known n by n positive definite matrix 872 00:55:39,000 --> 00:55:41,204 specifying relative variances and correlations 873 00:55:41,204 --> 00:55:42,245 between the observations. 874 00:55:47,570 --> 00:55:48,070 OK. 875 00:55:51,500 --> 00:55:59,392 Well, in order to solve for regression estimates 876 00:55:59,392 --> 00:56:04,400 under these generalized Gauss-Markov assumptions, 877 00:56:04,400 --> 00:56:08,750 we can transform the data Y, X to Y star 878 00:56:08,750 --> 00:56:13,170 equals sigma to the minus 1/2 y and X 879 00:56:13,170 --> 00:56:17,670 to X star, which is sigma to the minus 1/2 x. 880 00:56:17,670 --> 00:56:24,470 And this model then becomes a model, a linear regression 881 00:56:24,470 --> 00:56:29,640 model, in terms of Y star and X star. 
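Here is a minimal sketch of that whitening transformation, with a made-up known covariance matrix Sigma (an AR(1)-style correlation pattern chosen purely for illustration). Transforming y and X by the symmetric square root Sigma^(-1/2) and then running ordinary least squares on the starred data gives the same answer as the direct generalized least squares formula (X' Sigma^(-1) X)^(-1) X' Sigma^(-1) y, which is the formula discussed next.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -0.5])

# A known positive definite Sigma (AR(1)-style correlations), purely illustrative.
Sigma = 0.7 ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))

# Generate correlated residuals with covariance sigma^2 * Sigma (sigma^2 = 1 here).
L = np.linalg.cholesky(Sigma)
y = X @ beta + L @ rng.normal(size=n)

# Symmetric square root Sigma^{-1/2} via the eigendecomposition.
w, V = np.linalg.eigh(Sigma)
Sigma_inv_half = V @ np.diag(w ** -0.5) @ V.T

# The whitened (starred) data satisfy the original Gauss-Markov assumptions.
X_star, y_star = Sigma_inv_half @ X, Sigma_inv_half @ y
beta_gls = np.linalg.solve(X_star.T @ X_star, X_star.T @ y_star)

# Same answer from the direct formula (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y.
Sigma_inv = np.linalg.inv(Sigma)
beta_direct = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)
print(np.allclose(beta_gls, beta_direct))   # True
```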
882 00:56:29,640 --> 00:56:33,460 We're basically multiplying this regression model by sigma 883 00:56:33,460 --> 00:56:36,320 to the minus 1/2 across. 884 00:56:36,320 --> 00:56:43,700 And epsilon star actually has a covariance matrix 885 00:56:43,700 --> 00:56:45,770 equal to sigma squared times the identity. 886 00:56:45,770 --> 00:56:50,380 So if we just take a linear transformation 887 00:56:50,380 --> 00:56:56,260 of the original data, we get a representation 888 00:56:56,260 --> 00:56:57,890 of the regression model that satisfies 889 00:56:57,890 --> 00:57:00,880 the original Gauss-Markov assumptions. 890 00:57:00,880 --> 00:57:03,410 And what we had to do was basically 891 00:57:03,410 --> 00:57:05,920 do a linear transformation that makes the response 892 00:57:05,920 --> 00:57:10,500 variables all have constant variance and be uncorrelated. 893 00:57:13,980 --> 00:57:19,740 So with that, we then have the least squares estimate of beta 894 00:57:19,740 --> 00:57:23,850 is the least squares, the ordinary least squares, 895 00:57:23,850 --> 00:57:26,250 in terms of Y star and X star. 896 00:57:26,250 --> 00:57:31,910 And so plugging that in, we then have X star transpose X star 897 00:57:31,910 --> 00:57:34,310 inverse X star transpose Y star. 898 00:57:34,310 --> 00:57:37,130 And if you multiply through, that's how the formula changes. 899 00:57:41,600 --> 00:57:45,600 So this formula characterizing the least squares estimate 900 00:57:45,600 --> 00:57:48,850 under this generalized set of assumptions 901 00:57:48,850 --> 00:57:55,010 highlights what you need to do to be 902 00:57:55,010 --> 00:57:56,900 able to apply that theorem. 903 00:57:56,900 --> 00:58:03,140 So with response values that have very large variances, 904 00:58:03,140 --> 00:58:07,540 you basically want to discount those by the sigma inverse. 905 00:58:10,275 --> 00:58:14,280 And that's part of the way in which these generalized least 906 00:58:14,280 --> 00:58:15,940 squares work. 907 00:58:15,940 --> 00:58:17,120 All right. 908 00:58:17,120 --> 00:58:19,960 So now let's turn to distribution theory 909 00:58:19,960 --> 00:58:21,683 for normal regression models. 910 00:58:25,480 --> 00:58:28,820 Let's assume that the residuals are 911 00:58:28,820 --> 00:58:32,175 normals with mean 0 and variance sigma squared. 912 00:58:37,880 --> 00:58:42,360 OK, conditioning on the values of the independent variable, 913 00:58:42,360 --> 00:58:44,240 the Y's, the response variables, are 914 00:58:44,240 --> 00:58:49,932 going to be independent over the index i. 915 00:58:49,932 --> 00:58:51,890 They're not going to be identically distributed 916 00:58:51,890 --> 00:58:54,920 because they have different means, mu_i 917 00:58:54,920 --> 00:58:58,820 for the dependent variable Y_i, but they 918 00:58:58,820 --> 00:59:02,060 will have a constant variance. 919 00:59:02,060 --> 00:59:10,930 And what we can do is basically condition on X, beta, 920 00:59:10,930 --> 00:59:14,320 and sigma squared and then represent 921 00:59:14,320 --> 00:59:20,280 this model in terms of the distribution of the epsilons. 922 00:59:20,280 --> 00:59:23,140 So if we're conditioning on x and beta, 923 00:59:23,140 --> 00:59:27,950 this X beta is a constant, known, we've conditioned on it. 924 00:59:27,950 --> 00:59:31,520 And the remaining uncertainty is in the residual vector, 925 00:59:31,520 --> 00:59:36,590 which is assumed to be all independent 926 00:59:36,590 --> 00:59:39,064 and identically distributed normal random variables. 
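Returning for a moment to the remark above about discounting high-variance observations: in the special case where the capital sigma matrix is diagonal, the generalized least squares estimate is just a weighted least squares fit with weights equal to the reciprocal relative variances. A rough sketch with made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 80
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])

# Heteroskedastic case: Sigma is diagonal, so observation i has variance sigma^2 * v_i.
v = rng.uniform(0.5, 10.0, n)                    # known relative variances (illustrative)
y = X @ beta + rng.normal(scale=np.sqrt(v))      # sigma^2 = 1 for simplicity

# Generalized least squares with diagonal Sigma is weighted least squares
# with weights 1 / v_i: the noisy observations get discounted.
Winv = np.diag(1.0 / v)
beta_wls = np.linalg.solve(X.T @ Winv @ X, X.T @ Winv @ y)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_wls, beta_ols)    # both roughly [1, 2]; the weighted fit is the more efficient one
```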
927 00:59:39,064 --> 00:59:40,480 Now, this is the first time you'll 928 00:59:40,480 --> 00:59:48,400 see this notation, capital N sub little n, for a random vector. 929 00:59:48,400 --> 00:59:53,090 It's a multivariate normal random variable 930 00:59:53,090 --> 00:59:57,540 where you consider an n-vector where each component is 931 00:59:57,540 --> 01:00:00,340 normally distributed, with mean given 932 01:00:00,340 --> 01:00:04,090 by some corresponding mean vector, 933 01:00:04,090 --> 01:00:10,200 and a covariance matrix given by a covariance matrix. 934 01:00:10,200 --> 01:00:16,150 In terms of independent and identically distributed values, 935 01:00:16,150 --> 01:00:21,250 the probability structure here is totally well-defined. 936 01:00:21,250 --> 01:00:24,760 Anyone here who's taken a beginning probability class 937 01:00:24,760 --> 01:00:26,420 knows what the density function is 938 01:00:26,420 --> 01:00:28,340 for this multivariate normal distribution 939 01:00:28,340 --> 01:00:32,180 because it's the product of the independent density 940 01:00:32,180 --> 01:00:34,960 functions for the independent components, 941 01:00:34,960 --> 01:00:37,170 because they're all independent random variables. 942 01:00:37,170 --> 01:00:40,190 So this multivariate normal random vector 943 01:00:40,190 --> 01:00:45,100 has a density function which you can write down, 944 01:00:45,100 --> 01:00:47,266 given your first probability class. 945 01:00:51,090 --> 01:00:54,960 OK, here I'm just highlighting or defining 946 01:00:54,960 --> 01:01:01,670 the mu vector for the means of the cases of the data. 947 01:01:01,670 --> 01:01:06,210 And the covariance matrix sigma is this diagonal matrix. 948 01:01:08,940 --> 01:01:19,450 And so basically sigma_(i,j) is equal to sigma squared times 949 01:01:19,450 --> 01:01:23,790 the Kronecker delta for the (i,j) element. 950 01:01:23,790 --> 01:01:28,940 Now what we want to do is, under the assumptions 951 01:01:28,940 --> 01:01:33,845 of normally distributed residuals, 952 01:01:33,845 --> 01:01:38,700 to solve for the distribution of the least squares estimators. 953 01:01:38,700 --> 01:01:41,550 We want to know, basically, what kind of distribution 954 01:01:41,550 --> 01:01:42,819 does it have? 955 01:01:42,819 --> 01:01:44,360 Because what we want to be able to do 956 01:01:44,360 --> 01:01:46,330 is to determine whether estimates 957 01:01:46,330 --> 01:01:49,020 are particularly large or not. 958 01:01:49,020 --> 01:01:50,850 And maybe there's no structure at all 959 01:01:50,850 --> 01:01:55,080 and the regression parameters are 0 so 960 01:01:55,080 --> 01:01:57,510 that there's no dependence on a given factor. 961 01:01:57,510 --> 01:02:00,400 And we need to be able to judge how significant that is. 962 01:02:00,400 --> 01:02:03,290 So we need to know what the distribution is 963 01:02:03,290 --> 01:02:06,410 of our least squares estimate. 964 01:02:06,410 --> 01:02:09,160 So what we're going to do is apply moment generating 965 01:02:09,160 --> 01:02:11,540 functions to derive the joint distribution of y 966 01:02:11,540 --> 01:02:13,574 and the joint distribution of beta hat. 967 01:02:17,060 --> 01:02:22,560 And so Choongbum introduced the moment generating function 968 01:02:22,560 --> 01:02:26,821 for individual random variables for single-variate random 969 01:02:26,821 --> 01:02:27,320 variables. 
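Before moving on to moment generating functions, here is a quick numerical check of the factorization point just made: the density of an N_n(mu, sigma^2 I) vector is the product of n univariate normal densities. All of the numbers below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma = 5, 0.8
mu = rng.normal(size=n)            # made-up mean vector
y = mu + sigma * rng.normal(size=n)

# Product of the n univariate normal densities...
prod_of_univariate = np.prod(np.exp(-(y - mu) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2))

# ...equals the n-variate normal density with covariance sigma^2 * I.
Sigma = sigma**2 * np.eye(n)
quad = (y - mu) @ np.linalg.solve(Sigma, y - mu)
joint = np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))

print(np.isclose(prod_of_univariate, joint))   # True
```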
970 01:02:27,320 --> 01:02:30,100 For n-variate random variables, we 971 01:02:30,100 --> 01:02:35,190 can define the moment generating function of the Y vector 972 01:02:35,190 --> 01:02:38,990 to be the expectation of e to the t transpose Y. 973 01:02:38,990 --> 01:02:41,840 So t is an argument of the moment generating function. 974 01:02:41,840 --> 01:02:43,660 It's another n-vector. 975 01:02:43,660 --> 01:02:46,660 And it's equal to the expectation of e to the t_1 Y_1 976 01:02:46,660 --> 01:02:48,840 plus t_2 Y_2 up to t_n Y_n. 977 01:02:48,840 --> 01:02:53,930 So this is a very simple definition. 978 01:02:53,930 --> 01:02:58,260 Because of independence, the expectation 979 01:02:58,260 --> 01:03:02,380 of the products, or this exponential sum 980 01:03:02,380 --> 01:03:05,240 is the product of the exponentials. 981 01:03:05,240 --> 01:03:08,840 And so this moment generating function is simply 982 01:03:08,840 --> 01:03:12,250 the product of the moment generating functions for Y_1 983 01:03:12,250 --> 01:03:15,190 up through Y_n. 984 01:03:15,190 --> 01:03:18,180 And I think-- I don't know if it was in the first problem set 985 01:03:18,180 --> 01:03:22,652 or in the first lecture, but e to the t_i mu_i plus a half t_i 986 01:03:22,652 --> 01:03:24,610 squared sigma squared was the moment generating 987 01:03:24,610 --> 01:03:28,310 function for the single univariate 988 01:03:28,310 --> 01:03:30,710 normal random variable, mean mu_i and variance sigma 989 01:03:30,710 --> 01:03:32,340 squared. 990 01:03:32,340 --> 01:03:36,780 And so if we have n of these, we take their product. 991 01:03:36,780 --> 01:03:39,720 And the moment generating function 992 01:03:39,720 --> 01:03:44,700 for y is simply e to the t transpose mu plus 1/2 993 01:03:44,700 --> 01:03:48,090 t transpose sigma t. 994 01:03:48,090 --> 01:03:53,020 And so for this multivariate normal distribution, 995 01:03:53,020 --> 01:03:57,630 this is its moment generating function. 996 01:03:57,630 --> 01:04:06,070 And this happens to be-- the distribution of y 997 01:04:06,070 --> 01:04:10,165 is a multivariate normal with mean mu and covariance matrix 998 01:04:10,165 --> 01:04:11,940 sigma. 999 01:04:11,940 --> 01:04:16,080 So a fact that we're going to use 1000 01:04:16,080 --> 01:04:19,970 is that if we're working with multivariate normal random 1001 01:04:19,970 --> 01:04:23,470 variables, this is the structure of their moment generating 1002 01:04:23,470 --> 01:04:24,520 functions. 1003 01:04:24,520 --> 01:04:26,870 And so if we solve for the moment generation 1004 01:04:26,870 --> 01:04:29,280 function of some other item of interest 1005 01:04:29,280 --> 01:04:31,840 and recognize that it has the same form, 1006 01:04:31,840 --> 01:04:35,237 we can conclude that it's also a multivariate normal random 1007 01:04:35,237 --> 01:04:35,736 variable. 1008 01:04:39,440 --> 01:04:41,330 So let's do that. 1009 01:04:41,330 --> 01:04:44,720 Let's solve for the moment generation 1010 01:04:44,720 --> 01:04:48,120 function of the least squares estimate, beta hat. 1011 01:04:48,120 --> 01:04:50,980 Now rather than dealing with an n-vector, 1012 01:04:50,980 --> 01:04:55,212 we're dealing with a p-vector of the betas, beta hats. 1013 01:04:55,212 --> 01:04:57,170 And this is simply the definition of the moment 1014 01:04:57,170 --> 01:05:00,240 generating function. 
1015 01:05:00,240 --> 01:05:06,380 If we plug in for basically what the functional form is 1016 01:05:06,380 --> 01:05:09,160 for the ordinary least squares estimates 1017 01:05:09,160 --> 01:05:13,960 and how they depend on the underlying Y, then we 1018 01:05:13,960 --> 01:05:20,480 basically-- OK, we have A equal to, essentially, 1019 01:05:20,480 --> 01:05:22,950 the linear projection of Y. That gives us the least squares 1020 01:05:22,950 --> 01:05:24,530 estimate. 1021 01:05:24,530 --> 01:05:28,800 And then we can say that this moment generating 1022 01:05:28,800 --> 01:05:34,723 function for beta hat is equal to the expectation of e 1023 01:05:34,723 --> 01:05:40,650 to the t transpose Y, where little t is A transpose tau. 1024 01:05:40,650 --> 01:05:41,960 Well, we know what this is. 1025 01:05:41,960 --> 01:05:43,543 This is the moment generating function 1026 01:05:43,543 --> 01:05:50,080 of X-- sorry, of Y-- evaluated at the vector little t. 1027 01:05:50,080 --> 01:05:54,620 So we just need to plug in little t, that expression 1028 01:05:54,620 --> 01:05:56,280 A transpose tau. 1029 01:05:56,280 --> 01:05:59,370 So let's do that. 1030 01:05:59,370 --> 01:06:03,580 And you do that and it turns out to be e to the t transpose 1031 01:06:03,580 --> 01:06:06,324 mu plus that. 1032 01:06:06,324 --> 01:06:13,450 And we go through a number of calculations. 1033 01:06:13,450 --> 01:06:16,310 And at the end of the day, we get that the moment generating 1034 01:06:16,310 --> 01:06:20,580 function is just e to the tau transpose beta plus a 1/2 tau 1035 01:06:20,580 --> 01:06:23,740 transpose this matrix tau. 1036 01:06:23,740 --> 01:06:26,050 And that is the moment generation function 1037 01:06:26,050 --> 01:06:28,490 of a multivariate normal. 1038 01:06:28,490 --> 01:06:32,280 So these few lines that you can go through after class 1039 01:06:32,280 --> 01:06:34,320 basically solve for the moment generating 1040 01:06:34,320 --> 01:06:35,960 function of beta hat. 1041 01:06:35,960 --> 01:06:39,456 And because we can recognize this as the MGF 1042 01:06:39,456 --> 01:06:44,960 of a multivariate normal, we know that that's-- beta hat is 1043 01:06:44,960 --> 01:06:48,140 a multivariate normal, with mean the true beta, 1044 01:06:48,140 --> 01:06:52,610 and covariance matrix given by the object in square brackets 1045 01:06:52,610 --> 01:06:54,390 there. 1046 01:06:54,390 --> 01:07:01,590 OK, so this is essentially the conclusion 1047 01:07:01,590 --> 01:07:05,040 of that previous analysis. 1048 01:07:05,040 --> 01:07:08,410 The marginal distribution of each of the beta hats 1049 01:07:08,410 --> 01:07:13,490 is given by beta hat-- by a univariate normal distribution 1050 01:07:13,490 --> 01:07:18,490 with mean beta_j and variance equal to the diagonal. 1051 01:07:18,490 --> 01:07:25,190 Now at this point, saying that is like an assertion. 1052 01:07:25,190 --> 01:07:28,750 But one can actually prove that very easily, 1053 01:07:28,750 --> 01:07:33,290 given this sequence of argument. 1054 01:07:33,290 --> 01:07:36,622 And can anyone tell me why this is true? 1055 01:07:43,080 --> 01:07:44,390 Let me tell you. 1056 01:07:44,390 --> 01:07:47,850 If you consider plugging in the moment generating function, 1057 01:07:47,850 --> 01:07:53,990 the value tau, where only the j-th entry is non-zero, 1058 01:07:53,990 --> 01:07:56,120 then you have the moment generating function 1059 01:07:56,120 --> 01:07:59,310 of the j-th component of beta hat. 
1060 01:07:59,310 --> 01:08:03,880 And that's a Gaussian moment generating function. 1061 01:08:03,880 --> 01:08:08,190 So the marginal distribution of the j-th component is normal. 1062 01:08:08,190 --> 01:08:10,550 So you get that almost for free from 1063 01:08:10,550 --> 01:08:13,680 this multivariate analysis. 1064 01:08:13,680 --> 01:08:18,014 And so there's no hand waving going on in having that result. 1065 01:08:18,014 --> 01:08:20,229 This actually follows directly from the moment 1066 01:08:20,229 --> 01:08:22,760 generating functions. 1067 01:08:22,760 --> 01:08:26,870 OK, let's now turn to another topic. 1068 01:08:26,870 --> 01:08:32,200 Related, but it's the QR decomposition of X. 1069 01:08:32,200 --> 01:08:36,870 So we have-- with our independent variables 1070 01:08:36,870 --> 01:08:41,630 X, we want to express this as a product 1071 01:08:41,630 --> 01:08:46,609 of a matrix Q with orthonormal columns, which is n by p, 1072 01:08:46,609 --> 01:08:51,684 and an upper triangular matrix R. 1073 01:08:51,684 --> 01:08:59,750 So it turns out that any matrix, any n by p matrix, 1074 01:08:59,750 --> 01:09:02,130 can be expressed in this form. 1075 01:09:02,130 --> 01:09:07,260 And we'll quickly show you how that can be accomplished. 1076 01:09:07,260 --> 01:09:10,180 We can accomplish that by conducting 1077 01:09:10,180 --> 01:09:13,479 a Gram-Schmidt orthonormalization 1078 01:09:13,479 --> 01:09:17,580 of the independent variables matrix X. 1079 01:09:17,580 --> 01:09:21,540 And let's see. 1080 01:09:21,540 --> 01:09:26,689 So if we define R, the upper triangular matrix in the QR 1081 01:09:26,689 --> 01:09:31,424 decomposition, to have 0's below the diagonal 1082 01:09:31,424 --> 01:09:35,710 and possibly nonzero values along the diagonal 1083 01:09:35,710 --> 01:09:41,500 and to the right, we're just going to solve for Q and R 1084 01:09:41,500 --> 01:09:45,260 through this Gram-Schmidt process. 1085 01:09:45,260 --> 01:09:50,439 So the first column of X is equal to the first column 1086 01:09:50,439 --> 01:09:57,430 of Q times the first element, the top left corner 1087 01:09:57,430 --> 01:10:00,240 of the matrix R. 1088 01:10:00,240 --> 01:10:08,130 And if we take the inner product of that vector with itself, 1089 01:10:08,130 --> 01:10:14,600 then we get this expression for r_(1,1) squared-- 1090 01:10:14,600 --> 01:10:18,270 we can basically solve for r_(1,1) as the square root 1091 01:10:18,270 --> 01:10:19,590 of this dot product. 1092 01:10:19,590 --> 01:10:22,920 And Q_[1] is simply the first column of X divided by that 1093 01:10:22,920 --> 01:10:23,500 square root. 1094 01:10:23,500 --> 01:10:27,260 So this first column of the Q matrix 1095 01:10:27,260 --> 01:10:32,210 and the first element of R can be solved for right away. 1096 01:10:32,210 --> 01:10:38,100 Then let's solve for the second column of Q 1097 01:10:38,100 --> 01:10:43,310 and the second column of the R matrix. 1098 01:10:43,310 --> 01:10:46,860 Well, X_[2], the second column of the X matrix, 1099 01:10:46,860 --> 01:10:56,090 is the first column of Q times r_(1,2), 1100 01:10:56,090 --> 01:10:58,614 plus the second column of Q times r_(2,2). 1101 01:11:02,250 --> 01:11:09,300 And if we multiply this expression by Q_[1] transpose, 1102 01:11:09,300 --> 01:11:15,720 then we basically get this expression for r_(1,2). 1103 01:11:19,780 --> 01:11:24,240 So we actually have just solved for r_(1,2). 1104 01:11:24,240 --> 01:11:35,910 And Q_[2] is solved for by the arguments given here.
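In practice one rarely carries out the Gram-Schmidt steps by hand. Here is a minimal sketch using numpy's built-in QR routine (in its reduced form, so Q is n by p and R is p by p), checking that X = QR and that the columns of Q are orthonormal, and previewing the least squares and hat-matrix formulas derived just below. The data are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.5, size=n)

# Reduced QR: Q is n x p with orthonormal columns, R is p x p upper triangular.
Q, R = np.linalg.qr(X)                      # 'reduced' mode is the default
print(np.allclose(X, Q @ R))                # True: X = Q R
print(np.allclose(Q.T @ Q, np.eye(p)))      # True: columns of Q are orthonormal

# Least squares via QR: beta_hat = R^{-1} Q' y, same as the normal equations.
beta_qr = np.linalg.solve(R, Q.T @ y)
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(beta_qr, beta_ne))        # True

# The hat matrix H = X (X'X)^{-1} X' simplifies to Q Q'.
H = X @ np.linalg.solve(X.T @ X, X.T)
print(np.allclose(H, Q @ Q.T))              # True
```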
1105 01:11:35,910 --> 01:11:41,070 So basically, we successively are orthogonalizing 1106 01:11:41,070 --> 01:11:43,480 columns of X to the previous columns of X 1107 01:11:43,480 --> 01:11:45,850 through this Gram-Schmidt process. 1108 01:11:45,850 --> 01:11:48,460 And it basically can be repeated through all the columns. 1109 01:11:51,030 --> 01:11:55,410 Now with this QR decomposition, what we get 1110 01:11:55,410 --> 01:12:01,470 is a really nice form for the least squares estimate. 1111 01:12:01,470 --> 01:12:07,080 Basically, it simplifies to the inverse of R times Q transpose 1112 01:12:07,080 --> 01:12:08,220 y. 1113 01:12:08,220 --> 01:12:15,490 And this basically means that you 1114 01:12:15,490 --> 01:12:19,720 can solve for least squares estimates by calculating the QR 1115 01:12:19,720 --> 01:12:22,150 decomposition, which is a very simple linear algebra 1116 01:12:22,150 --> 01:12:25,890 operation, and then just do a couple of matrix products 1117 01:12:25,890 --> 01:12:31,310 to get the-- well, you do have to do a matrix inverse with R 1118 01:12:31,310 --> 01:12:33,730 to get that out. 1119 01:12:33,730 --> 01:12:37,810 And the covariance matrix of beta hat 1120 01:12:37,810 --> 01:12:44,480 is equal to sigma squared X transpose X inverse. 1121 01:12:44,480 --> 01:12:53,330 And in terms of the covariance matrix, what is implicit here 1122 01:12:53,330 --> 01:12:55,870 but you should make explicit in your study, 1123 01:12:55,870 --> 01:13:06,190 is if you consider taking a matrix, R inverse Q transpose 1124 01:13:06,190 --> 01:13:11,160 times y, the only thing that's random there is that y vector, 1125 01:13:11,160 --> 01:13:11,660 OK? 1126 01:13:11,660 --> 01:13:16,720 The covariance of a matrix times a random vector 1127 01:13:16,720 --> 01:13:19,770 is that matrix times the covariance 1128 01:13:19,770 --> 01:13:22,960 of the vector times the transpose of the matrix. 1129 01:13:22,960 --> 01:13:26,770 So if you take a matrix transformation 1130 01:13:26,770 --> 01:13:30,510 of a random vector, then the covariance 1131 01:13:30,510 --> 01:13:33,660 of that transformation has that form. 1132 01:13:33,660 --> 01:13:41,110 So that's where this covariance matrix is coming into play. 1133 01:13:41,110 --> 01:13:44,410 And from the MGF, the moment generating function, 1134 01:13:44,410 --> 01:13:48,660 for the least squares estimate, this basically 1135 01:13:48,660 --> 01:13:50,930 comes out of the moment generating function definition 1136 01:13:50,930 --> 01:13:51,943 as well. 1137 01:13:51,943 --> 01:13:56,480 And if we take X transpose X, plug 1138 01:13:56,480 --> 01:14:01,864 in the QR decomposition, only the R's play out, 1139 01:14:01,864 --> 01:14:03,600 giving you that. 1140 01:14:03,600 --> 01:14:05,870 Now, this also gives us a very nice form 1141 01:14:05,870 --> 01:14:10,450 for the hat matrix, which turns out 1142 01:14:10,450 --> 01:14:13,490 to just be Q times Q transpose. 1143 01:14:13,490 --> 01:14:20,370 So that's a very simple form. 1144 01:14:25,290 --> 01:14:28,160 So now with the distribution theory, 1145 01:14:28,160 --> 01:14:35,190 this next section is going to actually prove 1146 01:14:35,190 --> 01:14:37,970 what's really a fundamental result 1147 01:14:37,970 --> 01:14:40,380 about normal linear regression models. 1148 01:14:40,380 --> 01:14:45,210 And I'm going to go through this somewhat quickly just 1149 01:14:45,210 --> 01:14:49,290 so that we cover what the main ideas are of the theorem. 
1150 01:14:49,290 --> 01:14:53,930 But the details, I think, are very straightforward. 1151 01:14:53,930 --> 01:14:57,250 And these course notes that will be posted online 1152 01:14:57,250 --> 01:14:59,370 will go through the various steps of the analysis. 1153 01:15:03,390 --> 01:15:07,160 OK, so there's an important theorem here 1154 01:15:07,160 --> 01:15:11,660 which is for any matrix A, m by n, 1155 01:15:11,660 --> 01:15:15,860 you consider transforming the random vector y 1156 01:15:15,860 --> 01:15:23,600 by this matrix A. It is also a random normal vector. 1157 01:15:23,600 --> 01:15:26,980 And its distribution is going to have 1158 01:15:26,980 --> 01:15:30,090 a mean and covariance matrix given 1159 01:15:30,090 --> 01:15:35,390 by mu_z and sigma_z, which have this simple expression in terms 1160 01:15:35,390 --> 01:15:38,710 of the matrix A and the underlying means 1161 01:15:38,710 --> 01:15:40,995 and covariances of y. 1162 01:15:44,880 --> 01:15:48,620 OK, earlier we actually applied this theorem 1163 01:15:48,620 --> 01:15:53,260 with A corresponding to the matrix that generates the least 1164 01:15:53,260 --> 01:15:54,920 squares estimates. 1165 01:15:54,920 --> 01:15:57,900 So with A equal to X transpose X inverse, 1166 01:15:57,900 --> 01:16:01,060 we actually previously went through the solution for what's 1167 01:16:01,060 --> 01:16:04,040 the distribution of beta hat. 1168 01:16:04,040 --> 01:16:06,960 And with any other matrix A, we can 1169 01:16:06,960 --> 01:16:09,815 go through the same analysis and get the distribution. 1170 01:16:13,820 --> 01:16:20,690 So if we do that here, well, we can actually 1171 01:16:20,690 --> 01:16:22,890 prove this important theorem, which 1172 01:16:22,890 --> 01:16:27,335 says that with least squares estimates 1173 01:16:27,335 --> 01:16:33,670 of normal linear regression models, our least 1174 01:16:33,670 --> 01:16:40,050 squares estimate beta hat and our residual vector epsilon hat 1175 01:16:40,050 --> 01:16:43,040 are independent random variables. 1176 01:16:43,040 --> 01:16:48,520 So when we construct these statistics, 1177 01:16:48,520 --> 01:16:51,860 they are statistically independent of each other. 1178 01:16:51,860 --> 01:16:57,580 And the distribution of beta hat is multivariate normal. 1179 01:16:57,580 --> 01:17:04,730 The sum of the squared residuals is, in fact, 1180 01:17:04,730 --> 01:17:09,660 a multiple of a chi-squared random variable. 1181 01:17:09,660 --> 01:17:16,010 Now who in here can tell me what a chi-squared random variable 1182 01:17:16,010 --> 01:17:17,405 is? 1183 01:17:17,405 --> 01:17:18,335 Anyone? 1184 01:17:18,335 --> 01:17:19,251 AUDIENCE: [INAUDIBLE]? 1185 01:17:21,590 --> 01:17:23,010 PROFESSOR: Yes, that's right. 1186 01:17:23,010 --> 01:17:25,610 So a chi-squared random variable with one degree of freedom 1187 01:17:25,610 --> 01:17:29,652 is a squared normal zero one random variable. 1188 01:17:29,652 --> 01:17:31,360 A chi-squared with two degrees of freedom 1189 01:17:31,360 --> 01:17:36,420 is the sum of two independent normals, zero one, squared. 1190 01:17:36,420 --> 01:17:43,000 And so the sum of n squared residuals is, in fact, 1191 01:17:43,000 --> 01:17:48,182 an n minus p chi-squared random variable scale it by sigma 1192 01:17:48,182 --> 01:17:49,420 squared. 
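These distributional claims are easy to check by simulation before the t statistics are introduced next. In the rough sketch below (design matrix and parameter values made up, with many data sets simulated from the same X), the empirical covariance of beta hat matches sigma squared times X transpose X inverse, the scaled residual sum of squares averages to about n minus p as a chi-squared with n minus p degrees of freedom should, and the residual sum of squares is essentially uncorrelated with the coefficient estimates, consistent with the independence claim.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, sigma = 30, 3, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -0.5])
XtX_inv = np.linalg.inv(X.T @ X)

betas, ssr = [], []
for _ in range(20000):
    y = X @ beta + rng.normal(scale=sigma, size=n)   # normal linear regression model
    b = XtX_inv @ X.T @ y                            # least squares estimate
    e = y - X @ b                                    # residual vector
    betas.append(b)
    ssr.append(e @ e)
betas, ssr = np.array(betas), np.array(ssr)

print(np.allclose(betas.mean(axis=0), beta, atol=0.05))             # beta hat is unbiased
print(np.allclose(np.cov(betas.T), sigma**2 * XtX_inv, atol=0.05))  # Cov(beta hat) = sigma^2 (X'X)^{-1}
print((ssr / sigma**2).mean())                                      # about n - p = 27
print(np.corrcoef(betas[:, 1], ssr)[0, 1])                          # about 0: consistent with independence
```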
1193 01:17:49,420 --> 01:17:55,870 And for each component j, if we take 1194 01:17:55,870 --> 01:18:01,080 the difference between the least squares estimate beta hat j 1195 01:18:01,080 --> 01:18:04,810 and beta_j and divide through by this estimate 1196 01:18:04,810 --> 01:18:12,320 of the standard deviation of that, then 1197 01:18:12,320 --> 01:18:15,740 that will, in fact, have a t distribution on n minus p 1198 01:18:15,740 --> 01:18:17,280 degrees of freedom. 1199 01:18:17,280 --> 01:18:25,040 And let's see, a t distribution in probability theory 1200 01:18:25,040 --> 01:18:35,684 is the ratio of a standard normal random variable 1201 01:18:35,684 --> 01:18:38,100 to the square root of an independent chi-squared 1202 01:18:38,100 --> 01:18:39,210 random variable divided by its degrees of freedom. 1203 01:18:39,210 --> 01:18:45,780 So basically these properties characterize 1204 01:18:45,780 --> 01:18:50,970 our regression parameter estimates and t statistics 1205 01:18:50,970 --> 01:18:54,780 for those estimates. 1206 01:18:54,780 --> 01:18:59,490 Now, OK, in the course notes, there's 1207 01:18:59,490 --> 01:19:01,670 a moderately long proof. 1208 01:19:01,670 --> 01:19:05,320 But all the details are given, and I'll 1209 01:19:05,320 --> 01:19:08,110 be happy to go through any of those details with people 1210 01:19:08,110 --> 01:19:11,050 during office hours. 1211 01:19:11,050 --> 01:19:19,500 Let me just push on to-- let's see. 1212 01:19:19,500 --> 01:19:23,091 We have maybe two minutes left in the class. 1213 01:19:25,860 --> 01:19:32,440 Let me just talk about maximum likelihood estimation. 1214 01:19:32,440 --> 01:19:37,850 And in fitting models in statistics, 1215 01:19:37,850 --> 01:19:41,030 maximum likelihood estimation comes up again and again. 1216 01:19:41,030 --> 01:19:45,295 And with normal linear regression models, 1217 01:19:45,295 --> 01:19:47,880 it turns out that the ordinary least squares estimates 1218 01:19:47,880 --> 01:19:51,690 are, in fact, the maximum likelihood estimates. 1219 01:19:51,690 --> 01:19:57,570 And what we want to do with maximum likelihood 1220 01:19:57,570 --> 01:20:02,000 is to maximize the likelihood function. 1221 01:20:02,000 --> 01:20:05,640 We define the likelihood function, which 1222 01:20:05,640 --> 01:20:09,290 is the density function for the data given 1223 01:20:09,290 --> 01:20:12,110 the unknown parameters. 1224 01:20:12,110 --> 01:20:16,390 And this density function is simply 1225 01:20:16,390 --> 01:20:18,720 the density function for a multivariate normal random 1226 01:20:18,720 --> 01:20:20,310 variable. 1227 01:20:20,310 --> 01:20:24,630 And the maximum likelihood estimates 1228 01:20:24,630 --> 01:20:28,120 are the estimates of the underlying parameters 1229 01:20:28,120 --> 01:20:31,650 that basically maximize the density function. 1230 01:20:31,650 --> 01:20:34,200 So it's the values of the underlying parameters 1231 01:20:34,200 --> 01:20:37,130 that make the data that was observed the most likely. 1232 01:20:40,520 --> 01:20:49,110 And if you plug in the form of the density function, 1233 01:20:49,110 --> 01:20:53,680 basically we have these independent random variables, 1234 01:20:53,680 --> 01:21:00,540 Y_i, whose densities multiply to give the joint density. 1235 01:21:00,540 --> 01:21:05,466 The likelihood function turns out 1236 01:21:05,466 --> 01:21:11,260 to be basically a function of the least squares criterion.
1237 01:21:11,260 --> 01:21:14,980 So if you fit models by least squares, 1238 01:21:14,980 --> 01:21:18,472 you're consistent with doing something decent in at least 1239 01:21:18,472 --> 01:21:20,180 applying the maximum likelihood principle 1240 01:21:20,180 --> 01:21:23,720 if you had a normal linear regression model. 1241 01:21:23,720 --> 01:21:31,730 And it's useful to know when your statistical estimation 1242 01:21:31,730 --> 01:21:37,230 algorithms are consistent with certain principles 1243 01:21:37,230 --> 01:21:41,620 like maximum likelihood estimation or others. 1244 01:21:41,620 --> 01:21:43,930 So let me, I guess, finish there. 1245 01:21:43,930 --> 01:21:49,100 And next time, I will just talk a little bit 1246 01:21:49,100 --> 01:21:52,780 about generalized M estimators. 1247 01:21:52,780 --> 01:21:55,160 Those provide a class of estimators 1248 01:21:55,160 --> 01:22:04,420 that can be used for finding robust estimates 1249 01:22:04,420 --> 01:22:08,290 and also quantile estimates of regression parameters 1250 01:22:08,290 --> 01:22:12,320 which are very interesting.
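As a closing illustration of the maximum likelihood point: under the normal linear regression model with known sigma squared, the log-likelihood in beta is, up to a constant, minus the least squares criterion divided by 2 sigma squared, so the beta that maximizes it is exactly the ordinary least squares estimate. A small sketch with made-up data and a few arbitrary perturbed candidates:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, sigma = 50, 3, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=sigma, size=n)

def log_likelihood(b, sigma2):
    """Normal linear regression log-likelihood: sum of log N(y_i; x_i'b, sigma2) densities."""
    resid = y - X @ b
    return -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * resid @ resid / sigma2

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# The OLS fit attains the highest log-likelihood among these candidates,
# because maximizing the likelihood in beta means minimizing the sum of squared residuals.
for b in [beta_ols] + [beta_ols + 0.1 * rng.normal(size=p) for _ in range(3)]:
    print(b.round(3), log_likelihood(b, sigma**2))
```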