The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PHILIPPE RIGOLLET: So yes, before we start: this chapter will not be part of the midterm. Everything else will be, so all the way up to goodness of fit tests. And there will be some practice exams posted in the recitation section of the course that you will be working on. So the recitation tomorrow will be a review session for the midterm. I'll send an announcement by email.

So going back to our estimator: we studied the least squares estimator in the case where we had Gaussian observations. So we had something that looked like this: y = Xβ + ε. This was an equation in Rⁿ, for n observations. And then we wrote the least squares estimator β̂.

And for our purposes from here on, you see that you have this normal distribution, this p-variate Gaussian distribution. That means that, at some point, we made the assumption that ε was the n-dimensional Gaussian N(0, σ²Iₙ), with mean 0 and covariance σ² times the identity of Rⁿ -- which I kept forgetting about last time. I will try not to do that this time.

And so from this, we derived a bunch of properties of this least squares estimator β̂. And in particular, the key thing that everything was built on was that we could write β̂ as the true unknown β plus some multivariate Gaussian that was centered but had a weird covariance structure. So that was definitely p-dimensional, and its covariance was σ²(XᵀX)⁻¹. And the way we derived that was by having at least one cancellation between XᵀX and (XᵀX)⁻¹.
PHILIPPE RIGOLLET: So this is the basis for inference in linear regression. So in a way, that's correct, because what happened is that we used the fact that Xβ̂ -- once we have this β̂ -- is really just the projection of y onto the linear span of the columns of X, the column span of X. And so in particular, those things -- y minus Xβ̂ -- are called residuals. So that's the vector of residuals. What's the dimension of this vector?

AUDIENCE: n by 1.

PHILIPPE RIGOLLET: n by 1. So those things, we can write as ε̂. That's an estimate for this ε, because we just put a hat on β. And from this one, we could actually build an unbiased estimator σ̂² of σ², and that was this guy. And we showed that, indeed, the right normalization for this was n − p, because ‖y − Xβ̂‖², up to the scaling by σ², is actually a chi-squared with n − p degrees of freedom. So that's what we came up with.

And something I told you, which follows from Cochran's theorem -- we did not go into details about this. But essentially, one of them corresponds to the projection onto the linear span of the columns of X, and the other one corresponds to the projection onto the orthogonal complement of this guy, and we're in a Gaussian case -- and things that are orthogonal are actually independent in the Gaussian case. So from a geometric point of view, you can sort of understand everything. You think of your subspace, the linear span of the x's; sometimes you project onto this guy, sometimes you project onto its orthogonal complement. β̂ corresponds to the projection onto the linear span. ε̂ corresponds to the projection onto the orthogonal complement. And those things turn out to be independent, and that's why β̂ is independent of σ̂². So it's really just a statement about two linear spaces being orthogonal to each other.

So we left off on this slide last time.
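As a concrete companion to what's on the board, here is a minimal NumPy sketch of the least squares estimator, the residuals, and the unbiased variance estimator. The numbers (n, p, the true β, σ) are made up for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical example: n observations, p columns, first column all ones.
    n, p = 50, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
    beta_true = np.array([1.0, 2.0, 0.0])
    sigma = 0.5
    y = X @ beta_true + sigma * rng.normal(size=n)

    # Least squares estimator: beta_hat = (X^T X)^{-1} X^T y.
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

    # Residuals eps_hat = y - X beta_hat, and the unbiased estimator
    # sigma_hat^2 = ||y - X beta_hat||^2 / (n - p).
    eps_hat = y - X @ beta_hat
    sigma2_hat = eps_hat @ eps_hat / (n - p)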
PHILIPPE RIGOLLET: And what I claim is that this thing here is actually -- oh, yeah -- the other thing we want to use. So that's good for β̂. But since we don't know what σ² is -- if we knew what σ² is, that would totally be enough for us. But we also need this extra thing: that (n − p)σ̂²/σ² follows a chi-squared with n − p degrees of freedom. And σ̂² is independent of β̂. So that's going to be something we need. So that's useful if σ² is unknown. And again, sometimes it might be known, if you're using some sort of measurement device for which it's written on the side of the box.

So from these two things, we're going to be able to do inference. And inference -- we said there are three pillars to inference. The first one is estimation, and we've been doing that so far. We've constructed this least squares estimator, which happens to be the maximum likelihood estimator in the Gaussian case. The two other things we do in inference are confidence intervals and tests. And we can do confidence intervals -- we're not going to do much, because we're going to talk about their cousin, which are tests. And that's really where the statistical inference comes in.

And here, we're going to be interested in a very specific kind of test for linear regression. And those are tests of the form β_j = 0 -- so the j-th coefficient of β is equal to 0, and that's going to be our null hypothesis -- versus H1, where β_j is, say, not equal to 0. And for the purposes of regression, unless you have lots of domain-specific knowledge, it won't be β_j positive or β_j negative. It's really nonzero that's interesting to you.

So why would I want to do this test? Well, if I expand this thing where I have y = Xβ + ε -- so what happens if I look, for example, at the first coordinates?
So I have that y is actually -- so say, y_i is equal to... Well, that's actually complicated. Let me write it like this:

y_i = β_0 + β_1 x_{i,1} + ... + β_{p−1} x_{i,p−1} + ε_i.

And that's true for all i's. So this first term is β_0 times 1 -- that was our first coordinate. So that's just expanding this -- going back to the scalar form rather than the matrix-vector form. That's what we're doing. When I write y = Xβ + ε, I assume that each of my y's can be represented as a linear combination of the x's, the first one being 1, plus some ε_i. Everybody agrees with this? What does it mean for β_j to be equal to 0? Yeah?

AUDIENCE: That x_j's not important.

PHILIPPE RIGOLLET: Yeah, that x_j doesn't even show up in this thing. So if β_j is equal to 0, that means that, essentially, we can remove the j-th coordinate, x_j, from all observations.

So for example, I'm a banker, and I'm trying to predict some score -- let's call it y -- without the noise. So I'm trying to predict what is going to be your score. And that's something that should be telling me how likely you are to reimburse your loan on time, or do you have late payments. Or actually, maybe these days bankers are looking at how much in late fees they will be collecting from you. Maybe that's what they are more after, rather than making sure that you reimburse everything. So they're trying to maximize this number of late fees. And they collect a bunch of things about you -- definitely your credit score, but maybe your zip code, profession, years of education, family status, a bunch of things. One might be your shoe size. And they want to know -- maybe shoe size is actually a good explanation for how much in fees they're going to be collecting from you.
PHILIPPE RIGOLLET: But as you can imagine, this would be a controversial thing to bring in, and people might want to test whether including shoe size is a good idea. And so they would just look at the j corresponding to shoe size and test whether shoe size should appear or not in this formula. And that's essentially the kind of thing that people are going to do.

Now, if I do genomics and I'm trying to predict the size, the girth, of a pumpkin for a competition based on some available genomic data, then I can test whether gene j, which is called -- I don't know -- pea snap 24 -- they always have these crazy names -- appears or not in this formula. Is the gene pea snap 24 going to be important or not for the size of the final pumpkin?

So those are definitely the important things. And definitely, we want to put β_j not equal to 0 as the alternative, because that's where scientific discovery shows up.

And so to do that -- well, we're in a Gaussian setup, so we know that even if we don't know what σ is, we can actually call for a t-test. So how did we build the t-test in general? Well, before, what we had was something that looked like θ̂ = θ + N(0, σ²/n) -- a Gaussian with something that depended on n, something like this, σ² over n. So that's what it looked like. Now what we have is that β̂ = β + N(0, σ²(XᵀX)⁻¹) -- some Gaussian again, but this time it's p-variate. So it's actually very similar, except that the matrix (XᵀX)⁻¹ is now replacing just this number, 1/n, but it's playing the same role.

So in particular, this implies that for every j from 1 to p -- what is the distribution of β̂_j? Well, β̂_j is actually equal to -- so all I have to do -- this is a system of p equations, and all I have to do is read off the j-th row. So it's telling me here, I'm going to read β̂_j. Here, I'm going to read β_j.
PHILIPPE RIGOLLET: And here, I need to read off: what is the distribution of the j-th coordinate of this guy? So this is a Gaussian vector, so we need to understand what its distribution is.

So how do I do this? Well, the observation that's actually useful for this -- maybe I shouldn't use the word observation in a stats class, so let's call it a claim. The interesting claim is that if I have a vector -- let's call it v -- then v_j = vᵀe_j, where e_j is the vector with 0, 0, 0, and then a 1 on the j-th coordinate, and then 0 elsewhere. That's the j-th coordinate. So that's the j-th vector of the canonical basis of Rᵖ.

So now that I have this form, I can see that, essentially, β̂_j is just e_jᵀ times this N(0, σ²(XᵀX)⁻¹). And now, I know what the distribution of the inner product between a Gaussian and a deterministic vector is. What is it? It's a Gaussian. So all I have to check is: e_jᵀ times N(0, σ²(XᵀX)⁻¹) -- well, what is this equal to in distribution? Well, this is going to be a one-dimensional thing. An inner product is just a real number. So it's going to be some Gaussian. The mean is going to be 0 in inner product with e_j, which is 0. What is the variance of this guy?

We actually used this before, except that e_j was not a vector, but a matrix. So the rule is that vᵀN(μ, Σ) is some N(vᵀμ, vᵀΣv). That's the rule for Gaussian vectors. It's just a property of Gaussian vectors.

So what do we have here? Well, e_j plays the role of v, and σ²(XᵀX)⁻¹ plays the role of Σ. So here, I'm left with e_jᵀ -- let me pull out the σ² here. But this thing is: what happens if I take a matrix, premultiply it by this vector e_j, and postmultiply it by this vector e_j?
PHILIPPE RIGOLLET: I'm claiming that this corresponds to only one single element of this matrix. Which one is it?

AUDIENCE: j.

PHILIPPE RIGOLLET: The j-th diagonal element. So this thing here is nothing but (XᵀX)⁻¹, and then we take its j-th diagonal element, the jj entry. Now, I cannot go any further. (XᵀX)⁻¹ can be a complicated matrix, and I do not know how to express its j-th diagonal element much better than this. Well -- no, actually, I don't. It involves basically all the coefficients. Yeah?

AUDIENCE: [INAUDIBLE] second e_j come from? So I get why e_j transpose [INAUDIBLE]. Where did the--

PHILIPPE RIGOLLET: From this rule? So you always pre- and postmultiply when you talk about the covariance, because if you did not, it would be a vector and not a scalar, for one. But in general, think of v as a matrix. It's still true even if v is a matrix that's compatible with premultiplying by some Gaussian.

Any other question? Yeah?

AUDIENCE: When you say claim, a vector v -- what is vector v?

PHILIPPE RIGOLLET: So for any vector v--

AUDIENCE: OK.

PHILIPPE RIGOLLET: Any other question? So now we've identified that the j-th coefficient of this Gaussian, which I can represent from the claim as e_jᵀ times this guy, is also a Gaussian that's centered. And its variance, now, is σ² times the j-th diagonal element of (XᵀX)⁻¹. So the conclusion is that β̂_j = β_j + N(0, σ²[(XᵀX)⁻¹]_jj). And I'm going to emphasize the fact that now it's one-dimensional, with mean 0 and variance σ² times the jj entry of (XᵀX)⁻¹.

Now, if you look at the last line of the second board and the first line on the first board, those are basically the same thing. β̂_j is my θ̂. β_j is my θ.
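A quick numerical check of this claim, continuing the hypothetical sketch from above: e_jᵀ M e_j picks out exactly the j-th diagonal entry of M, and γ_j = [(XᵀX)⁻¹]_jj is the variance factor of β̂_j.

    # Continuing the sketch above: e_j^T M e_j is the j-th diagonal entry of M,
    # so beta_hat[j] ~ N(beta[j], sigma^2 * gamma_j) with gamma_j = M[j, j].
    M = np.linalg.inv(X.T @ X)

    j = 1
    e_j = np.zeros(p)
    e_j[j] = 1.0
    assert np.isclose(e_j @ M @ e_j, M[j, j])

    gamma_j = M[j, j]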
PHILIPPE RIGOLLET: And the variance σ²/n is now σ² times this jj entry. Now, the inverse suggests that it looks like the inverse of n. So we're going to want to think of those guys as being some sort of 1/n kind of statement.

So from this, the fact that those two things are the same leads us to believe that we are now equipped to perform the task that we're trying to do, because under the null hypothesis, β_j is known -- it's equal to 0 -- so I can remove it. And I have to deal with the σ². If σ² is known, then I can just perform a regular Gaussian test using Gaussian quantiles. And if σ² is unknown, I'm going to basically replace σ by σ̂, and then I'm going to get my t-test.

Actually, for the purposes of your exam, I really suggest that you understand every single word I'm going to be saying now, because this is exactly what you're expected to know from the rest of this course: right now, I'm just going to apply exactly the same technique that we did for single-parameter estimation.

So what we have now is that under H0, β_j is equal to 0. Therefore, β̂_j follows some N(0, σ²γ_j). Just like I do in the slide, I'm going to call this γ_j. So γ_j is this j-th diagonal element of (XᵀX)⁻¹.

So that implies that β̂_j over σ -- oh, was it a square root? Yeah -- β̂_j over σ times the square root of γ_j follows some N(0, 1). So I can form my test, which is to reject if the absolute value of β̂_j divided by σ times the square root of γ_j is larger than what? Can somebody tell me what I want this to be larger than to reject?

AUDIENCE: q alpha.

PHILIPPE RIGOLLET: q_α. Everybody agrees? Of what? Of this guy, where the standard notation is that this is the quantile. Everybody agrees?

AUDIENCE: It's alpha over 2, I think.
375 00:22:02,756 --> 00:22:03,537 I think alpha's-- 376 00:22:03,537 --> 00:22:04,870 PHILIPPE RIGOLLET: Alpha over 2. 377 00:22:04,870 --> 00:22:06,520 So not everybody should be agreeing. 378 00:22:06,520 --> 00:22:08,765 Thank you, you're the first one to disagree with yourself, 379 00:22:08,765 --> 00:22:09,723 which is probably good. 380 00:22:12,111 --> 00:22:14,110 It's alpha over 2 because of the absolute value. 381 00:22:14,110 --> 00:22:15,670 I want to just be away from this guy, 382 00:22:15,670 --> 00:22:17,110 and that's because I have-- 383 00:22:17,110 --> 00:22:19,140 so the alpha over 2-- 384 00:22:19,140 --> 00:22:27,650 the sanity check should be that h1 is beta j not equal to 0. 385 00:22:27,650 --> 00:22:35,010 So that works if sigma is known, because I need to know sigma 386 00:22:35,010 --> 00:22:37,380 to be able to build my test. 387 00:22:37,380 --> 00:22:39,960 So if sigma is unknown, well, I can tell you, use this test, 388 00:22:39,960 --> 00:22:41,550 but you're going to be like, OK, when 389 00:22:41,550 --> 00:22:44,310 I'm going to have to plug in some numbers, 390 00:22:44,310 --> 00:22:45,810 I'm going to be stuck. 391 00:22:49,240 --> 00:22:59,570 But if sigma is unknown, we have sigma hat 392 00:22:59,570 --> 00:23:03,400 squared as an estimator. 393 00:23:03,400 --> 00:23:06,850 So let me write sigma squared here. 394 00:23:06,850 --> 00:23:12,050 So in particular, beta hat divided 395 00:23:12,050 --> 00:23:18,220 by sigma hat squared times square root gamma j-- something 396 00:23:18,220 --> 00:23:19,169 I can compute. 397 00:23:19,169 --> 00:23:20,210 Sorry, that's beta hat j. 398 00:23:23,070 --> 00:23:24,576 I can compute that thing. 399 00:23:24,576 --> 00:23:25,490 Agreed? 400 00:23:25,490 --> 00:23:27,230 Now I have sigma hat j. 401 00:23:27,230 --> 00:23:28,980 What I need to do is to be able to compute 402 00:23:28,980 --> 00:23:32,625 the distribution of this thing. 403 00:23:32,625 --> 00:23:37,880 So I know the distribution of beta hat j over the square root 404 00:23:37,880 --> 00:23:38,410 of gamma j. 405 00:23:38,410 --> 00:23:40,479 That's some Gaussian 0, 1. 406 00:23:40,479 --> 00:23:42,770 I don't know exactly what the distribution of sigma hat 407 00:23:42,770 --> 00:23:46,660 j squared is, but what I know is that that was actually written, 408 00:23:46,660 --> 00:23:54,790 maybe, here is that n minus p sigma hat squared over sigma 409 00:23:54,790 --> 00:23:59,550 squared follows some chi squared with n minus p 410 00:23:59,550 --> 00:24:01,350 degrees of freedom, and that it's actually 411 00:24:01,350 --> 00:24:06,590 independent of beta hat j. 412 00:24:06,590 --> 00:24:08,220 It's independent of beta hat, so it's 413 00:24:08,220 --> 00:24:10,030 independent of each of its coordinates. 414 00:24:10,030 --> 00:24:13,680 That was part of your homework where you had to-- 415 00:24:13,680 --> 00:24:15,900 some of you were confused by the fact that-- 416 00:24:15,900 --> 00:24:18,199 I mean, if you're independent of some big thing, 417 00:24:18,199 --> 00:24:19,740 you're independent of all the smaller 418 00:24:19,740 --> 00:24:20,948 components of this big thing. 419 00:24:20,948 --> 00:24:24,080 That's basically what you need to know. 
PHILIPPE RIGOLLET: And so now I can just write this as -- this is β̂_j divided by -- so now I want to make this guy appear, so it's β̂_j, and then σ̂² over σ² -- the σ̂² over σ² that, times n − p, is my chi-squared -- divided by the square root of γ_j. So that's what I want to see. Yeah?

AUDIENCE: Why do you have to stick the hat in the denominator? Shouldn't it be sigma?

PHILIPPE RIGOLLET: Yeah, so I write this. I decide to write this. I could have put a Mickey Mouse here. It just wouldn't make sense. I just decided to take this thing.

AUDIENCE: OK.

PHILIPPE RIGOLLET: OK. So now -- so I take this guy, and now I'm going to rewrite it as something I want, because if you don't know what σ is -- sorry, that's not sigma -- you mean the square?

AUDIENCE: Yeah.

PHILIPPE RIGOLLET: Oh, thank you. Yes, that's correct. [LAUGHS] OK, so if you don't know what σ is, you replace it by σ̂. That's the most natural thing to do. You just now want to find out what the distribution of this guy is.

So this is not exactly what I had. To be able to get this, I need to divide by σ² -- sorry, I need to--

AUDIENCE: Square root.

PHILIPPE RIGOLLET: I'm sorry.

AUDIENCE: Do we need a square root of the sigma hat [INAUDIBLE]?

PHILIPPE RIGOLLET: That's correct now. And now I have that -- sorry, I should not write it like that. That's not what I want. What I want is this. And to be able to get this guy, what I need is σ over σ̂ -- square root. And then I need to make this thing show up. So I need to have this n − p show up in the denominator. So to be able to get it, I need to multiply the entire thing by the square root of n − p. So this is just a tautology. I just squeezed in what I wanted.
PHILIPPE RIGOLLET: But now this whole thing here is actually of the form β̂_j divided by σ square root of γ_j, and then divided by the square root of σ̂² over σ² -- no, I don't want to divide it by square root of n minus p, sorry. And now it's times n − p divided by n − p.

And what is the distribution of this thing here? So I'm going to keep going here. So the distribution of this thing here is what? Well, this numerator -- what is its distribution?

AUDIENCE: [INAUDIBLE]

PHILIPPE RIGOLLET: Yeah, N(0, 1). It's actually still written over there. So that's our N(0, 1). What is the distribution of this guy? Sorry, I don't think you have color again. So what is the distribution of this guy? This is still written on the board.

AUDIENCE: Chi-squared.

PHILIPPE RIGOLLET: It's the chi-squared that I have right here. So that's a chi-squared with n − p degrees of freedom, divided by n − p. The only thing I need to check is that those two guys are independent, which is also what I have from here.

And so that implies that β̂_j divided by σ̂ square root of γ_j -- what is the distribution of this guy?

[INTERPOSING VOICES]

PHILIPPE RIGOLLET: A t with n − p degrees of freedom. Was that crystal clear for everyone? Was that so simple that it was boring to everyone? OK, good. That's the point at which you should be.

So now that I have this, I can read the quantiles of this guy. So my test becomes -- well, my rejection region: I reject if the absolute value of this new guy exceeds the quantile of order α/2, but this time, of a t with n − p degrees of freedom. And now you can actually see that the only difference between this test and that test, apart from replacing σ by σ̂, is that now I've moved from the quantiles of a Gaussian to the quantiles of a t_{n−p}.
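And in the running sketch, the resulting t-test is (same hypothetical data, γ_j, and α as above):

    # t-test for H0: beta_j = 0 with sigma unknown:
    # beta_hat_j / (sigma_hat * sqrt(gamma_j)) ~ t_{n-p} under H0.
    se_j = np.sqrt(sigma2_hat * gamma_j)
    t_stat = beta_hat[j] / se_j

    q = stats.t.ppf(1 - alpha / 2, df=n - p)   # quantile of order alpha/2 of t_{n-p}
    reject = abs(t_stat) > q
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - p)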
PHILIPPE RIGOLLET: What's actually interesting, from this perspective, is that the t_{n−p}, we know, has heavier tails than the Gaussian, but if the number of degrees of freedom reaches, maybe, 30 or 40, they're virtually the same. And here, the number of degrees of freedom is not given only by n; it's n − p. So if I have more and more parameters to estimate, this will result in heavier and heavier tails, and that's just to account for the fact that it's harder and harder to estimate the variance when I have a lot of parameters. That's basically where it's coming from.

So now let's move on to -- well, I don't know what, because this is not working anymore. So this is the simplest test. And actually, if you run any statistical software for least squares, the output in any of them will look like this. You will have a sequence of rows. And you're going to have an estimate for β_0, an estimate for β_1, et cetera. Here, you're going to have a bunch of things. And on this row, you're going to have the value here -- so that's going to be what's estimated by least squares. And then immediately after, there's going to be the value of this thing -- so let's call it t. And then there's going to be the p-value corresponding to this t. This is something that's just routinely coming out.

Oh -- and then there's, of course, the last column, for people who cannot read numbers, that's really just giving you little stars. They're not stickers, but that's close to it. And that's just saying: well, if I have three stars, I'm very significantly different from 0. If I have two stars, I'm moderately different from 0. And if I have one star, it means -- well, just give me another $1,000 and I will sign that it's actually different from 0.

So that's basically the kind of output. Everybody sees what I mean by that?
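For concreteness, this is the kind of table you get from, say, the statsmodels package in Python (one package choice among many; R's lm gives the same table, stars included):

    import statsmodels.api as sm

    # Reusing the hypothetical X and y from the sketch above; statsmodels
    # expects the intercept column to already be in X, which it is here.
    fit = sm.OLS(y, X).fit()

    # One row per coefficient: the least squares estimate, its standard
    # error, the t statistic, and the p-value for H0: beta_j = 0.
    print(fit.summary())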
PHILIPPE RIGOLLET: So what I mean -- what I'm trying to emphasize here -- is that those things are so routine when you run linear regression, because people stuff in maybe -- even if you have 200 observations, you're going to stuff in maybe 20 variables, p equals 20. That's still a big number to interpret what's going on. And it's nice for you if you can actually trim some fat out.

And so the problem is that when you start doing this, and then this, and then this, and then this -- the probability that you make a mistake in your test, the probability that you erroneously reject the null here, is 5%. Here, it's 5%. Here, it's 5%. Here, it's 5%. And at some point, if things happen with 5% chances and you keep on doing them over and over again, they're going to start to happen.

So you can see that basically what's happening is that you actually have an issue: if you start repeating those tests, you might not be at a 5% error at some point.

And so what do you do to prevent that? If you want to test all those β_j's simultaneously, you have to do what's called the Bonferroni correction. And the Bonferroni correction follows from what's called a union bound. A union bound is actually -- so if you're a computer scientist, you're very familiar with it. If you're a mathematician, that's just, essentially, the third axiom of probability that you've seen: that the probability of the union is less than the sum of the probabilities. That's the union bound. And you, of course, can generalize that to more than two events. And that's exactly what you're doing here.

So let's see how we would want to perform the Bonferroni correction to control the probability that they're all equal to 0 at the same time.

So recall -- if I want to perform this test over there, where I want to test H0: β_j = 0 for all j in some subset S -- so think of S included in {1, ..., p}. You can think of it as being all of {1, ..., p} if you want. It really doesn't matter. S is something that's given to you.
603 00:34:53,960 --> 00:34:55,790 Maybe you want to test the subset of them, 604 00:34:55,790 --> 00:34:57,890 but maybe you want to test all of them. 605 00:34:57,890 --> 00:35:04,540 Versus h1, beta j is not equal to 0 for some j in s. 606 00:35:07,850 --> 00:35:10,610 That's a test that tests all these things at once. 607 00:35:10,610 --> 00:35:13,880 And if you actually look at this table all at once, 608 00:35:13,880 --> 00:35:16,820 implicitly, you're performing this test for all of the rows, 609 00:35:16,820 --> 00:35:19,262 for s equal 1 to p. 610 00:35:19,262 --> 00:35:19,970 You will do that. 611 00:35:19,970 --> 00:35:23,120 Whether you like it or not, you will. 612 00:35:23,120 --> 00:35:27,110 So now let's look at what the probability of type I error 613 00:35:27,110 --> 00:35:28,100 looks like. 614 00:35:28,100 --> 00:35:31,270 So I want the probability of type 1 error, 615 00:35:31,270 --> 00:35:35,370 so that's the probably when h0 is true. 616 00:35:35,370 --> 00:35:41,930 Well, so let me call psi j the indicator that, say, beta j 617 00:35:41,930 --> 00:35:51,330 hat over sigma hat square root gamma j exceeds 618 00:35:51,330 --> 00:35:54,636 q alpha over 2 of tn minus p. 619 00:35:54,636 --> 00:35:56,760 So we know that those are the tests that I perform. 620 00:35:56,760 --> 00:35:59,160 Here, I just add this extra index j 621 00:35:59,160 --> 00:36:02,400 to tell me that I'm actually testing the j-th coefficient. 622 00:36:02,400 --> 00:36:06,490 So what I want is the probability that under the null 623 00:36:06,490 --> 00:36:12,450 so that those are all equal to 0 that beta j's-- 624 00:36:12,450 --> 00:36:16,620 that I will reject to the alternative for one of them. 625 00:36:16,620 --> 00:36:25,510 So that's psi 1 is equal to 1 or psi 2 626 00:36:25,510 --> 00:36:29,120 is equal to 1, all the way to psi-- 627 00:36:29,120 --> 00:36:31,474 well, let's just say that this is the entire thing, 628 00:36:31,474 --> 00:36:32,390 because it's annoying. 629 00:36:36,247 --> 00:36:37,830 I mean, you can check the slide if you 630 00:36:37,830 --> 00:36:39,150 want to do it more generally. 631 00:36:39,150 --> 00:36:44,140 But psi p is equal to-- 632 00:36:44,140 --> 00:36:48,850 or, or-- everybody agrees that this is the probability 633 00:36:48,850 --> 00:36:51,940 of type I error? 634 00:36:51,940 --> 00:36:54,010 So either I reject this one, or this one, 635 00:36:54,010 --> 00:36:55,757 or this one, or this one, or this one. 636 00:36:55,757 --> 00:36:58,090 And that's exactly when I'm going to reject at least one 637 00:36:58,090 --> 00:36:59,580 of them. 638 00:36:59,580 --> 00:37:08,550 So this is the probability of type I error. 639 00:37:08,550 --> 00:37:12,380 And what I want is to keep this guy less than alpha. 640 00:37:15,780 --> 00:37:17,730 But what I know is to control the probability 641 00:37:17,730 --> 00:37:20,190 that this guy is less than alpha, that this guy is 642 00:37:20,190 --> 00:37:22,820 less than alpha, that this guy is less than alpha. 643 00:37:22,820 --> 00:37:26,260 In particular, if all these guys are disjoint, 644 00:37:26,260 --> 00:37:29,530 then this could really be the sum of all these probabilities. 645 00:37:29,530 --> 00:37:42,400 So in the worst case, if psi j equals 1 intersected with psi k 646 00:37:42,400 --> 00:37:46,540 equals 1 is the empty set, so that means 647 00:37:46,540 --> 00:37:47,960 those are called disjoint sets. 648 00:37:51,210 --> 00:37:53,970 You've seen this terminology in probability, right? 
PHILIPPE RIGOLLET: So if those sets are disjoint for all j different from k, then this probability -- well, let me write it as star -- then star is equal to the probability under H0 that ψ_1 is equal to 1, plus ... plus the probability under H0 that ψ_p is equal to 1. Now, if I use this test with this α here, then this probability is equal to α. This probability is also equal to α. So the probability of type I error is actually not equal to α. It's equal to?

AUDIENCE: p alpha.

PHILIPPE RIGOLLET: p times α. So what is the solution here? Well, it's to run those guys not with α, but with α/p. And if I do this, then this guy is equal to α/p, this guy is equal to α/p. And so when I add those things up, I get p times α/p, which is just α.

So all I do is, rather than running each of the tests with probability of error α, I run each test at level α/p. That's actually very stringent. If you think about it for one second: even if you have only 5 variables -- p equals 5 -- and you started out wanting to do your tests at 5%, it forces you to do the test at 1% for each of those variables. If you have 10 variables -- I mean, that starts to be very stringent. So it's going to be harder and harder for you to conclude in favor of the alternative.

Now, one thing I need to tell you is that here I said, if they are disjoint, then those probabilities are equal. But if they are not disjoint, the union bound tells me that the probability of the union is less than the sum of the probabilities. And so now I'm not exactly equal to α, but I'm bounded by α. And that's why people are not super comfortable with the Bonferroni correction: because, in reality, you never think that those tests are going to be giving you completely disjoint things. I mean, why would it be?
PHILIPPE RIGOLLET: Why would it be that if this guy is equal to 1, then all the other ones are equal to 0? Why would that make any sense? So this is definitely conservative, but the problem is that we don't know how to do much better. I mean, we have a formula that tells you the probability of the union as some crazy sum that looks at all the intersections and all these little things. I mean, it's the generalization of: P(A or B) is equal to P(A) plus P(B) minus the probability of the intersection. But if you start doing this for more than two events, it's super complicated. The number of terms grows really fast. But most importantly, even if you go there, you still need to control the probability of the intersections. And those tests are not necessarily independent. If they were independent, then that would be easy: the probability of the intersection would be the product of the probabilities. But those things are super correlated, and so it doesn't really help.

And so we'll see, when we talk about high-dimensional stats towards the end, that there's something called the false discovery rate, which is essentially saying: listen, if I really define my probability of type I error as this -- if I want to make sure that I never make this kind of error -- I'm doomed. This is just not going to happen. But I can revise what my goals are in terms of the errors that I make, and then I will actually be able to do something. And what people look at is the false discovery rate. What we're controlling here is called the family-wise error rate, which is a stronger thing to control.

So this trick, which consists in replacing α by α over the number of times you're going to be performing your test -- or α over the number of terms in your union -- is actually called the Bonferroni correction. And that's something you use when you have what's called -- another keyword here is multiple testing -- when you're trying to do multiple tests simultaneously.
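In the running sketch, the Bonferroni correction for testing all p coordinates at once is a one-line change: run each t-test at level α/p instead of α.

    # Bonferroni: to keep the family-wise error rate below alpha when
    # testing all p coordinates, run each individual test at level alpha/p.
    se = np.sqrt(sigma2_hat * np.diag(M))          # per-coordinate standard errors
    t_stats = beta_hat / se

    q_bonf = stats.t.ppf(1 - alpha / (2 * p), df=n - p)
    reject_bonf = np.abs(t_stats) > q_bonf         # more stringent than q above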
PHILIPPE RIGOLLET: And if S is not all of {1, ..., p} -- well, you just divide by the number of tests that you are actually making. So if S is of size k, for some k less than p, you just divide α by k and not by p, of course. I mean, you can always divide by p, but you're going to make your life harder for no reason.

Any questions about the Bonferroni correction?

So, one thing that is maybe not as obvious as the test of β_j = 0 versus β_j not equal to 0 -- and in particular, what that means is that it's not going to come up as a software output without you even requesting it, because that one is so standard that it's just coming out. But there are other tests that you might think of, that might be more complicated and more tailored to your particular problem. And those tests are of the form G times β is equal to some λ.

So let's see: the test we've just done, β_j = 0 versus β_j not equal to 0, is actually equivalent to e_jᵀβ = 0 versus e_jᵀβ not equal to 0. That was our claim. But now I don't have to stop here. I don't have to multiply by a vector and test if it's equal to 0. I can actually replace this by some general matrix G, and replace this guy by some general vector λ. And I'm not telling you what the dimensions are, because they're general. I can take whatever I want. Take your favorite matrix, as long as the right side of the matrix can multiply β; and λ -- take it to have the number of rows of G. And then you can do that. I can always formulate this test.

What will this test encompass? Well, those are kind of weird tests. So you can think of things like: I want to test if β_2 + β_3 is equal to 0, for example. Maybe I want to test if β_5 − 2β_6 is equal to 23. Well, that's weird. But why would you want to test if β_2 + β_3 is equal to 0? Maybe you don't want to know if the -- you know that the effect of some gene is not 0.
786 00:45:50,720 --> 00:45:54,210 Maybe you know that this gene affects this trait, 787 00:45:54,210 --> 00:45:56,790 but you want to know if the effect of this gene 788 00:45:56,790 --> 00:45:59,262 is canceled by the effect of that gene. 789 00:45:59,262 --> 00:46:00,970 And this is the kind of thing that you're 790 00:46:00,970 --> 00:46:02,178 going to be testing for. 791 00:46:04,470 --> 00:46:06,150 Now, this guy is much more artificial, 792 00:46:06,150 --> 00:46:08,770 and I don't have a bedtime story to tell you around this. 793 00:46:08,770 --> 00:46:13,340 So those things can happen and can be much more complicated. 794 00:46:13,340 --> 00:46:15,180 Now, here, notice that the matrix g 795 00:46:15,180 --> 00:46:18,270 has one row in each of these examples. 796 00:46:18,270 --> 00:46:20,580 But if I want to test if those two things happen 797 00:46:20,580 --> 00:46:25,380 at the same time, then I can actually take a matrix with two rows, one per constraint. 798 00:46:25,380 --> 00:46:27,840 Another matrix that can be useful 799 00:46:27,840 --> 00:46:34,620 is g equals the identity of rp and lambda is equal to 0. 800 00:46:34,620 --> 00:46:39,530 What am I doing here in this case? 801 00:46:39,530 --> 00:46:41,480 What is this test testing? 802 00:46:41,480 --> 00:46:42,280 Sorry, this test. 803 00:46:44,959 --> 00:46:45,458 Yeah? 804 00:46:45,458 --> 00:46:46,820 AUDIENCE: Whether or not beta is 0. 805 00:46:46,820 --> 00:46:49,278 PHILIPPE RIGOLLET: Yeah, we're testing if the entire vector 806 00:46:49,278 --> 00:46:54,120 beta is equal to 0, because g times beta is equal to beta, 807 00:46:54,120 --> 00:46:56,100 and we're asking whether it's equal to 0. 808 00:47:00,375 --> 00:47:04,590 So the thing is, when you want to actually test 809 00:47:04,590 --> 00:47:07,140 if beta is equal to 0, you're actually 810 00:47:07,140 --> 00:47:09,510 testing if your entire model, everything you're 811 00:47:09,510 --> 00:47:12,070 doing in life, is just junk. 812 00:47:12,070 --> 00:47:13,920 This is just telling you, actually, 813 00:47:13,920 --> 00:47:17,090 forget about this y is x beta plus epsilon. 814 00:47:17,090 --> 00:47:18,360 y is really just epsilon. 815 00:47:18,360 --> 00:47:19,200 There's nothing. 816 00:47:19,200 --> 00:47:21,810 There's just some big noise with some big variance, 817 00:47:21,810 --> 00:47:23,950 and there's nothing else. 818 00:47:23,950 --> 00:47:26,860 So it turns out that the statistical software 819 00:47:26,860 --> 00:47:30,970 output that I wrote here spits out an answer to this question. 820 00:47:30,970 --> 00:47:34,480 Just the last line, usually, is doing this test. 821 00:47:34,480 --> 00:47:36,642 Does your model even make sense? 822 00:47:36,642 --> 00:47:39,100 And it's probably for people to check whether they actually 823 00:47:39,100 --> 00:47:41,230 just mixed up their two data sets. 824 00:47:41,230 --> 00:47:43,450 Maybe they're actually trying to predict-- 825 00:47:43,450 --> 00:47:49,190 I don't know-- some credit score from genomic data, 826 00:47:49,190 --> 00:47:51,040 and so they just want to make sure, maybe, that's 827 00:47:51,040 --> 00:47:53,050 not the right thing. 828 00:47:53,050 --> 00:47:56,500 So it turns out that the machinery is exactly the same 829 00:47:56,500 --> 00:47:58,750 as the one we've just seen. 830 00:47:58,750 --> 00:48:00,380 So we actually start from here. 831 00:48:05,542 --> 00:48:06,500 So let me pull this up. 832 00:48:12,930 --> 00:48:15,000 So we start from here.
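[To make those example hypotheses concrete, here is one way they might be encoded as a matrix g and a vector lambda in numpy; the dimension p = 8 and the variable names are invented for illustration, with beta indexed from 1 as on the board.]

import numpy as np

p = 8  # dimension of beta, made up for the example

# one row of g per linear constraint on beta
g1 = np.zeros(p); g1[1] = 1.0; g1[2] = 1.0    # beta_2 + beta_3 = 0
g2 = np.zeros(p); g2[4] = 1.0; g2[5] = -2.0   # beta_5 - 2 beta_6 = 23

# testing both constraints at the same time: stack the rows
G = np.vstack([g1, g2])        # shape (2, p), so two constraints
lam = np.array([0.0, 23.0])    # one entry per row of G

# testing the whole model, beta = 0: g is the identity, lambda is 0
G_all = np.eye(p)
lam_all = np.zeros(p)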
833 00:48:15,000 --> 00:48:18,470 Beta hat was equal to beta plus this guy. 834 00:48:21,780 --> 00:48:23,640 And the first thing we did was to say, well, 835 00:48:23,640 --> 00:48:27,180 beta j is equal to this thing because, well, beta j was 836 00:48:27,180 --> 00:48:29,250 just ej transpose beta. 837 00:48:29,250 --> 00:48:32,616 So rather than taking ej here, let me just take g. 838 00:48:42,280 --> 00:48:45,220 Now, we said that for any vector-- 839 00:48:45,220 --> 00:48:47,840 well, that was trivial. 840 00:48:47,840 --> 00:48:50,350 So the thing we need to know is, what is this thing? 841 00:48:50,350 --> 00:48:55,110 Well, this thing here, what is this guy? 842 00:48:55,110 --> 00:48:59,870 It's also normal and the mean is 0. 843 00:48:59,870 --> 00:49:03,510 Again, that's just using properties of Gaussian vectors. 844 00:49:03,510 --> 00:49:06,430 And what is the covariance matrix? 845 00:49:06,430 --> 00:49:09,290 Let's call this guy sigma so that you can 846 00:49:09,290 --> 00:49:11,660 formulate an answer. 847 00:49:11,660 --> 00:49:14,230 So what is the distribution of-- what 848 00:49:14,230 --> 00:49:18,354 is the covariance of g times some Gaussian 0 sigma? 849 00:49:18,354 --> 00:49:20,290 AUDIENCE: g sigma g transpose. 850 00:49:20,290 --> 00:49:22,500 PHILIPPE RIGOLLET: g sigma g transpose, right? 851 00:49:22,500 --> 00:49:33,895 So that's g, x transpose x inverse, g transpose. 852 00:49:38,650 --> 00:49:41,780 Now, I'm not going to be able to go much farther. 853 00:49:41,780 --> 00:49:44,900 I mean, I made this very acute observation 854 00:49:44,900 --> 00:49:47,790 that ej transpose times a matrix times ej is the j-th diagonal 855 00:49:47,790 --> 00:49:48,290 element. 856 00:49:48,290 --> 00:49:50,450 Now, if I have a general matrix, the price to pay is that I 857 00:49:50,450 --> 00:49:52,949 cannot just shrink this thing any further because I'm trying 858 00:49:52,949 --> 00:49:54,640 to be abstract. 859 00:49:54,640 --> 00:49:56,487 And so I'm almost there. 860 00:49:56,487 --> 00:49:58,070 The only thing that happened last time 861 00:49:58,070 --> 00:50:00,050 is that when this was ej, 862 00:50:00,050 --> 00:50:03,380 we knew that this was equal to 0 under the null. 863 00:50:03,380 --> 00:50:08,790 But under the null, what is this equal to? 864 00:50:12,510 --> 00:50:13,440 AUDIENCE: Lambda. 865 00:50:13,440 --> 00:50:15,106 PHILIPPE RIGOLLET: Lambda, which I know. 866 00:50:15,106 --> 00:50:16,880 I mean, I wrote my thing. 867 00:50:16,880 --> 00:50:19,730 And in the couple instances I just showed you, 868 00:50:19,730 --> 00:50:22,700 including this one over there on top, lambda was equal to 0. 869 00:50:22,700 --> 00:50:24,620 But in general, it can be any lambda. 870 00:50:24,620 --> 00:50:27,890 But what's key about this lambda is that I actually know it. 871 00:50:27,890 --> 00:50:31,940 That's the hypothesis I'm formulating. 872 00:50:31,940 --> 00:50:34,340 So now I'm going to have to be a little more careful when 873 00:50:34,340 --> 00:50:36,650 I want to build the distribution of g beta hat. 874 00:50:36,650 --> 00:50:39,380 I need to actually subtract this lambda. 875 00:50:39,380 --> 00:50:40,970 So now we go from this, and we say, 876 00:50:40,970 --> 00:50:47,040 well, g beta hat minus lambda follows 877 00:50:47,040 --> 00:50:57,730 some n 0, sigma squared g, x transpose x 878 00:50:57,730 --> 00:51:00,660 inverse, g transpose. 879 00:51:04,070 --> 00:51:06,469 So that's true.
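[Collecting the boards into one display, with k denoting the number of rows of g, the claim so far is the following.]

\[
\hat{\beta} \sim \mathcal{N}_p\left(\beta,\ \sigma^2 (X^\top X)^{-1}\right)
\quad\Longrightarrow\quad
G\hat{\beta} - \lambda \sim \mathcal{N}_k\left(0,\ \sigma^2\, G (X^\top X)^{-1} G^\top\right)
\quad \text{under } H_0:\ G\beta = \lambda.
\]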
880 00:51:06,469 --> 00:51:08,510 Let's assume-- let's go straight to the case when 881 00:51:08,510 --> 00:51:10,410 we don't know what sigma is. 882 00:51:10,410 --> 00:51:11,970 So what I'm going to be interested in 883 00:51:11,970 --> 00:51:26,360 is g beta hat minus lambda divided by sigma hat. 884 00:51:26,360 --> 00:51:29,870 And that's going to follow some Gaussian that has this thing, 885 00:51:29,870 --> 00:51:37,660 g, x transpose x inverse, g transpose. 886 00:51:37,660 --> 00:51:40,780 So now, what did I do last time? 887 00:51:40,780 --> 00:51:45,010 So clearly, the quantiles of this distribution 888 00:51:45,010 --> 00:51:48,000 are-- well, OK, what is the size of this distribution? 889 00:51:48,000 --> 00:51:52,848 Well, I need to tell you that g is an-- 890 00:51:52,848 --> 00:51:54,724 what did I take here? 891 00:51:54,724 --> 00:51:57,180 AUDIENCE: 1 divided by sigma, not sigma hat. 892 00:51:57,180 --> 00:51:58,930 PHILIPPE RIGOLLET: Oh, yeah, you're right. 893 00:51:58,930 --> 00:52:00,440 So let me write it like this. 894 00:52:05,750 --> 00:52:15,800 Well, let me write it like this-- 895 00:52:15,800 --> 00:52:17,253 sigma squared over sigma. 896 00:52:21,659 --> 00:52:23,325 So let's forget about the size of g now. 897 00:52:23,325 --> 00:52:25,120 Let's just think of any general g. 898 00:52:27,730 --> 00:52:30,820 When g was a vector, what was nice 899 00:52:30,820 --> 00:52:35,410 is that this guy was just a scalar, just one number. 900 00:52:35,410 --> 00:52:38,012 And so if I wanted to get rid of this on the right-hand side, 901 00:52:38,012 --> 00:52:39,970 all I had to do was to divide it by this thing. 902 00:52:39,970 --> 00:52:41,464 We called it gamma j. 903 00:52:41,464 --> 00:52:43,630 And we just had to divide by square root of gamma j, 904 00:52:43,630 --> 00:52:45,820 and that would be gone. 905 00:52:45,820 --> 00:52:48,450 Now I have a matrix. 906 00:52:48,450 --> 00:52:50,100 So I need to get rid of this matrix 907 00:52:50,100 --> 00:52:55,016 somehow because, clearly, the quantiles of this distribution 908 00:52:55,016 --> 00:52:56,640 are not going to be written in the back 909 00:52:56,640 --> 00:52:59,170 of a book for any value of g and any value of x. 910 00:52:59,170 --> 00:53:01,660 So I need to standardize before I can read anything out 911 00:53:01,660 --> 00:53:03,860 of a table. 912 00:53:03,860 --> 00:53:04,820 So how do we do it? 913 00:53:04,820 --> 00:53:14,880 Well, we just form this guy here. 914 00:53:14,880 --> 00:53:18,770 So what we know is that if-- 915 00:53:18,770 --> 00:53:21,120 so here's the claim, again, another 916 00:53:21,120 --> 00:53:23,520 claim about Gaussian vectors. 917 00:53:23,520 --> 00:53:43,220 If x follows some n 0 sigma, then x transpose sigma inverse x 918 00:53:43,220 --> 00:53:44,596 follows some chi squared. 919 00:53:48,330 --> 00:53:51,930 And here, it's going to depend on what the dimension is here. 920 00:53:51,930 --> 00:53:56,160 So if I make this k by k, a k-dimensional Gaussian vector, 921 00:53:56,160 --> 00:53:57,497 this is chi squared k. 922 00:54:02,467 --> 00:54:04,455 Where have we used that before? 923 00:54:08,928 --> 00:54:09,922 Yeah? 924 00:54:09,922 --> 00:54:10,850 AUDIENCE: Wald's test. 925 00:54:10,850 --> 00:54:13,350 PHILIPPE RIGOLLET: Wald's test, that's exactly what we used. 926 00:54:13,350 --> 00:54:16,480 Wald's test had a chi squared that was showing up.
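[The claim itself takes one line to check: whitening the Gaussian vector turns the quadratic form into a sum of k squared standard Gaussians.]

\[
X \sim \mathcal{N}_k(0, \Sigma),\quad Z = \Sigma^{-1/2} X \sim \mathcal{N}_k(0, I_k)
\quad\Longrightarrow\quad
X^\top \Sigma^{-1} X = Z^\top Z = \sum_{j=1}^{k} Z_j^2 \sim \chi^2_k.
\]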
927 00:54:16,480 --> 00:54:18,430 And the way we made it show up was 928 00:54:18,430 --> 00:54:20,640 by taking the asymptotic variance, 929 00:54:20,640 --> 00:54:24,852 taking its inverse, which, in this framework, was called-- 930 00:54:24,852 --> 00:54:25,710 AUDIENCE: Fisher. 931 00:54:25,710 --> 00:54:27,300 PHILIPPE RIGOLLET: Fisher information. 932 00:54:27,300 --> 00:54:31,410 And then we pre- and post-multiplied by this thing. 933 00:54:31,410 --> 00:54:33,150 So this is the key. 934 00:54:33,150 --> 00:54:35,400 And so now, it tells me exactly, when 935 00:54:35,400 --> 00:54:38,190 I start from this guy that has this multivariate Gaussian 936 00:54:38,190 --> 00:54:40,050 distribution, it tells me how to turn it into something 937 00:54:40,050 --> 00:54:42,720 that has a distribution which is pivotal. 938 00:54:42,720 --> 00:54:45,849 Chi squared k is completely pivotal; it does not depend 939 00:54:45,849 --> 00:54:46,890 on anything I don't know. 940 00:55:03,810 --> 00:55:06,400 The way I go from here is by saying, well, now, 941 00:55:06,400 --> 00:55:13,380 I look at g beta hat minus lambda transpose, 942 00:55:13,380 --> 00:55:15,390 and now I need to look at the inverse 943 00:55:15,390 --> 00:55:16,600 of the matrix over there. 944 00:55:16,600 --> 00:55:29,950 So it's g, x transpose x inverse, g transpose, all inverse, times g beta 945 00:55:29,950 --> 00:55:32,510 hat minus lambda. 946 00:55:35,647 --> 00:55:36,855 This guy is going to follow-- 947 00:55:39,700 --> 00:55:42,891 well, here, I need to actually divide by sigma in this case-- 948 00:55:56,540 --> 00:56:00,560 a chi squared with k degrees of freedom, if g is k times p. 949 00:56:00,560 --> 00:56:04,370 So what I mean here is just that's the same k. 950 00:56:04,370 --> 00:56:07,250 The k that shows up is the number of constraints 951 00:56:07,250 --> 00:56:08,840 that I have in my test. 952 00:56:13,340 --> 00:56:20,690 So now, if I go from here to using sigma hat, 953 00:56:20,690 --> 00:56:23,180 the key thing to observe is that this guy is actually 954 00:56:23,180 --> 00:56:25,100 not a Gaussian. 955 00:56:25,100 --> 00:56:28,410 I'm not going to have a student t-distribution that shows up. 956 00:56:36,290 --> 00:57:03,850 So that implies that if I take the same thing, 957 00:57:03,850 --> 00:57:06,450 so now I just go from sigma to sigma hat, 958 00:57:06,450 --> 00:57:08,140 then this thing is of the form-- 959 00:57:12,620 --> 00:57:17,280 well, this chi squared k divided by the chi squared that shows 960 00:57:17,280 --> 00:57:20,590 up in the denominator of the t-distribution, 961 00:57:20,590 --> 00:57:28,270 which is square root of-- 962 00:57:28,270 --> 00:57:30,060 oh, I should not divide by sigma-- 963 00:57:30,060 --> 00:57:31,510 so this is sigma squared, right? 964 00:57:31,510 --> 00:57:32,567 AUDIENCE: Yeah. 965 00:57:32,567 --> 00:57:34,400 PHILIPPE RIGOLLET: So this is sigma squared. 966 00:57:34,400 --> 00:57:40,550 So this is of the form a chi squared k divided by a chi squared n 967 00:57:40,550 --> 00:57:44,180 minus p divided by n minus p. 968 00:57:44,180 --> 00:57:48,370 So that's the same denominator that I saw in my t-test. 969 00:57:48,370 --> 00:57:49,955 The numerator has changed, though. 970 00:57:49,955 --> 00:57:52,080 The numerator is now this chi squared and no longer 971 00:57:52,080 --> 00:57:52,580 a Gaussian. 972 00:57:55,430 --> 00:58:00,350 But this distribution is actually pivotal, as long 973 00:58:00,350 --> 00:58:02,210 as we can guarantee that there's no hidden 974 00:58:02,210 --> 00:58:08,550 parameter in the correlation between the two chi squareds.
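[Written out in one display, the two ingredients on the board are, under the null,]

\[
\frac{1}{\sigma^2}\,(G\hat{\beta} - \lambda)^\top \left(G (X^\top X)^{-1} G^\top\right)^{-1} (G\hat{\beta} - \lambda) \sim \chi^2_k,
\qquad
\frac{\hat{\sigma}^2}{\sigma^2} \sim \frac{\chi^2_{n-p}}{n-p},
\]

[so replacing sigma squared by sigma hat squared leaves a ratio of a chi squared k and an independent chi squared n minus p over n minus p; the unknown sigma squared cancels.]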
975 00:58:08,550 --> 00:58:13,470 So again, as with all statements of independence in this class, 976 00:58:13,470 --> 00:58:15,930 I will just give it to you for free. 977 00:58:15,930 --> 00:58:20,660 Those two things, I claim-- 978 00:58:20,660 --> 00:58:29,635 so, OK, let's just admit these are independent. 979 00:58:37,370 --> 00:58:38,730 We're almost there. 980 00:58:38,730 --> 00:58:41,627 This could be a distribution that's pivotal. 981 00:58:41,627 --> 00:58:43,960 But there's something that's a little unbalanced about it, 982 00:58:43,960 --> 00:58:46,160 which is that this guy is divided by its number of degrees 983 00:58:46,160 --> 00:58:48,980 of freedom, but this guy is not divided by its number 984 00:58:48,980 --> 00:58:50,670 of degrees of freedom. 985 00:58:50,670 --> 00:58:53,350 And so we just have to make the extra step 986 00:58:53,350 --> 00:58:57,280 that if I divide this guy by k, 987 00:58:57,280 --> 00:59:00,080 then the numerator becomes a chi squared k 988 00:59:00,080 --> 00:59:03,900 divided by k. 989 00:59:03,900 --> 00:59:05,442 And now it looks-- 990 00:59:05,442 --> 00:59:06,900 I mean, it doesn't change anything. 991 00:59:06,900 --> 00:59:09,020 I've just divided by a fixed number. 992 00:59:09,020 --> 00:59:11,200 But it just looks more elegant-- 993 00:59:11,200 --> 00:59:13,650 it's the ratio of two independent chi 994 00:59:13,650 --> 00:59:15,420 squareds that are individually divided 995 00:59:15,420 --> 00:59:16,920 by their numbers of degrees of freedom. 996 00:59:20,840 --> 00:59:31,100 And this has a name, and it's called the Fisher 997 00:59:31,100 --> 00:59:34,190 or F-distribution. 998 00:59:34,190 --> 00:59:40,740 So unlike William Gosset, who was not 999 00:59:40,740 --> 00:59:43,200 allowed to use his own name and used the name Student, 1000 00:59:43,200 --> 00:59:45,000 Fisher was allowed to use his own name, 1001 00:59:45,000 --> 00:59:47,220 and that's called the Fisher distribution. 1002 00:59:47,220 --> 00:59:52,470 And the Fisher distribution now has 2 parameters, 1003 00:59:52,470 --> 00:59:53,910 a pair of degrees of freedom-- 1004 00:59:53,910 --> 00:59:57,180 1 for the numerator and 1 for the denominator. 1005 00:59:57,180 --> 01:00:01,217 So F, for Fisher distribution-- 1006 01:00:07,430 --> 01:00:13,450 so F is equal to the ratio of a chi squared p over p 1007 01:00:13,450 --> 01:00:16,960 and a chi squared q over q. 1008 01:00:16,960 --> 01:00:27,320 So that's F p, q, where the 2 chi squareds are independent. 1009 01:00:32,970 --> 01:00:35,160 Is that clear what I'm defining here? 1010 01:00:35,160 --> 01:00:41,460 So this is basically what plays the role of the t-distribution 1011 01:00:41,460 --> 01:00:43,870 when you're testing more than 1 parameter at a time. 1012 01:00:43,870 --> 01:00:45,630 So you basically replace-- 1013 01:00:45,630 --> 01:00:47,190 the normal that was in the numerator, 1014 01:00:47,190 --> 01:00:49,023 you replace it by a chi squared because you're 1015 01:00:49,023 --> 01:00:51,780 testing if 2 vectors are simultaneously close. 1016 01:00:51,780 --> 01:00:55,340 And the way you do it is by looking at their squared norm. 1017 01:00:55,340 --> 01:00:57,800 And that's how the chi squared shows up. 1018 01:01:00,632 --> 01:01:08,240 Quick remark-- are those things really very different? 1019 01:01:08,240 --> 01:01:12,090 How can I relate a chi squared with a t-distribution?
1020 01:01:12,090 --> 01:01:19,151 Well, if t follows, say, a t-- 1021 01:01:19,151 --> 01:01:20,400 I don't know, let's call it q degrees of freedom. 1022 01:01:24,080 --> 01:01:28,330 So that means that t, let me look at-- 1023 01:01:28,330 --> 01:01:38,200 t is some n 0 1 divided by the square root of a chi 1024 01:01:38,200 --> 01:01:40,650 squared q over q. 1025 01:01:44,820 --> 01:01:48,926 That's the distribution of t. 1026 01:01:48,926 --> 01:01:51,300 So if I look at the square of the-- the distribution of t 1027 01:01:51,300 --> 01:01:53,600 squared-- 1028 01:01:53,600 --> 01:01:55,010 let me put it here-- 1029 01:01:58,300 --> 01:02:06,280 well, that's the square of some n 0 1 divided by a chi squared q over q. 1030 01:02:09,690 --> 01:02:11,900 Agreed? 1031 01:02:11,900 --> 01:02:13,470 I just removed the square root here, 1032 01:02:13,470 --> 01:02:15,810 and I took the square of the Gaussian. 1033 01:02:15,810 --> 01:02:20,030 But what is the distribution of the square of a Gaussian? 1034 01:02:20,030 --> 01:02:21,530 AUDIENCE: Chi squared with 1 degree. 1035 01:02:21,530 --> 01:02:25,140 PHILIPPE RIGOLLET: Chi squared with 1 degree of freedom. 1036 01:02:25,140 --> 01:02:27,284 So this is a chi squared with 1 degree of freedom. 1037 01:02:27,284 --> 01:02:28,700 And in particular, it's also a chi 1038 01:02:28,700 --> 01:02:31,836 squared with 1 degree of freedom divided by 1. 1039 01:02:31,836 --> 01:02:38,860 So t squared, in the end, has an F-distribution with 1 1040 01:02:38,860 --> 01:02:41,300 and q degrees of freedom. 1041 01:02:41,300 --> 01:02:43,589 So those two things are actually very similar. 1042 01:02:43,589 --> 01:02:45,130 The only thing that's going to change 1043 01:02:45,130 --> 01:02:48,280 is that, since we're actually looking at, typically, 1044 01:02:48,280 --> 01:02:51,164 absolute values of t when we do our tests, 1045 01:02:51,164 --> 01:02:52,830 it's going to be exactly the same thing. 1046 01:02:52,830 --> 01:02:54,330 The quantiles of one guy are going 1047 01:02:54,330 --> 01:02:56,496 to be, essentially, the square root of the quantiles 1048 01:02:56,496 --> 01:02:57,310 of the other guy. 1049 01:02:57,310 --> 01:03:00,390 That's all it's going to be. 1050 01:03:00,390 --> 01:03:07,360 So if my test is psi is equal to the indicator 1051 01:03:07,360 --> 01:03:16,010 that the absolute value of t exceeds q alpha over 2 of tq, for example, 1052 01:03:16,010 --> 01:03:19,990 then it's equal to the indicator that t squared 1053 01:03:19,990 --> 01:03:26,030 exceeds the square of q alpha over 2 of tq, 1054 01:03:26,030 --> 01:03:28,770 because I had the absolute value here, 1055 01:03:28,770 --> 01:03:33,110 which is equal to the indicator that t squared is 1056 01:03:33,110 --> 01:03:35,580 greater than q alpha over 2. 1057 01:03:35,580 --> 01:03:37,000 And now this time, it's an F1q. 1058 01:03:39,880 --> 01:03:42,340 So in a way, those two things belong to the same family. 1059 01:03:42,340 --> 01:03:44,680 They really are a natural generalization of each other. 1060 01:03:44,680 --> 01:03:47,310 I mean, at least the F-test is a generalization of the t-test. 1061 01:03:51,230 --> 01:03:54,480 And so now I can perform my test just like it's written here. 1062 01:03:54,480 --> 01:03:56,250 I just form this guy, and then I 1063 01:03:56,250 --> 01:03:58,860 compare it against the quantile of an F-distribution.
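[Here is a minimal numpy/scipy sketch of the whole F-test pipeline under the Gaussian model of the lecture; the function name and interface are invented for illustration.]

import numpy as np
from scipy import stats

def f_test(X, y, G, lam, alpha=0.05):
    # F-test of H0: G beta = lam in the model y = X beta + epsilon,
    # epsilon ~ N(0, sigma^2 I), with X of size n by p and G of size k by p.
    n, p = X.shape
    k = G.shape[0]
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y                # least squares estimator
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - p)        # unbiased variance estimate
    d = G @ beta_hat - lam
    M = G @ XtX_inv @ G.T
    F = (d @ np.linalg.solve(M, d) / k) / sigma2_hat
    # the F-statistic is nonnegative, so we only look at the upper tail
    return F, F > stats.f.ppf(1 - alpha, k, n - p)

[And the square-root relation between the t and F tables is easy to check numerically; the numbers 30 and 0.05 are arbitrary.]

q, alpha = 30, 0.05
t_crit = stats.t.ppf(1 - alpha / 2, q)     # two-sided t quantile
f_crit = stats.f.ppf(1 - alpha, 1, q)      # upper-tail F(1, q) quantile
assert abs(t_crit ** 2 - f_crit) < 1e-8    # t_crit squared equals f_crit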
1064 01:03:58,860 --> 01:04:01,440 Notice, there's no absolute value-- 1065 01:04:01,440 --> 01:04:04,740 oh, yeah, I forgot, this is actually 1066 01:04:04,740 --> 01:04:09,761 q alpha because the F-statistic is already positive. 1067 01:04:09,761 --> 01:04:11,760 So I'm not going to look between left and right, 1068 01:04:11,760 --> 01:04:15,240 I'm just going to look at whether it's too large or not. 1069 01:04:15,240 --> 01:04:18,030 So that's by definition. 1070 01:04:18,030 --> 01:04:19,380 So you can check-- 1071 01:04:19,380 --> 01:04:21,120 if you look at a table for Student 1072 01:04:21,120 --> 01:04:23,025 and you look at a table for F1q, 1073 01:04:23,025 --> 01:04:25,650 you're going to have to move from one column 1074 01:04:25,650 --> 01:04:26,610 to the other because you're going 1075 01:04:26,610 --> 01:04:28,475 to have to move from alpha over 2 to alpha, 1076 01:04:28,475 --> 01:04:31,670 but one is going to be the square root of the other one, 1077 01:04:31,670 --> 01:04:34,370 just like the chi squared is the square of the Gaussian. 1078 01:04:34,370 --> 01:04:36,828 I mean, if you look at the chi squared with 1 degree of freedom, 1079 01:04:36,828 --> 01:04:40,441 you will see the same thing as for the Gaussian. 1080 01:04:47,035 --> 01:04:53,594 So I'm actually going to start with the last one 1081 01:04:53,594 --> 01:04:55,760 because you've been asking a few questions about why 1082 01:04:55,760 --> 01:04:58,450 my design is deterministic. 1083 01:04:58,450 --> 01:04:59,660 So there are many answers. 1084 01:04:59,660 --> 01:05:01,955 Some are philosophical. 1085 01:05:01,955 --> 01:05:04,330 But one that's actually-- well, there's the one that says 1086 01:05:04,330 --> 01:05:07,106 you cannot do anything if you don't condition-- 1087 01:05:07,106 --> 01:05:09,730 if you don't have x, because all of the statements that we made 1088 01:05:09,730 --> 01:05:12,850 here, for example, just the fact that this is chi squared, 1089 01:05:12,850 --> 01:05:15,010 if those guys start to be random variables, 1090 01:05:15,010 --> 01:05:17,010 then it's clearly not going to be a chi squared. 1091 01:05:17,010 --> 01:05:19,000 I mean, it cannot be chi squared both when those guys are 1092 01:05:19,000 --> 01:05:20,624 deterministic and when they are random. 1093 01:05:20,624 --> 01:05:22,100 I mean, things change. 1094 01:05:22,100 --> 01:05:25,060 So that's just maybe [INAUDIBLE] check statement. 1095 01:05:25,060 --> 01:05:27,580 But I think the one that really matters is that-- 1096 01:05:27,580 --> 01:05:30,450 remember when we did the t-test, we 1097 01:05:30,450 --> 01:05:32,195 had this gamma j that showed up. 1098 01:05:32,195 --> 01:05:34,910 Gamma j was playing the role of the variance. 1099 01:05:34,910 --> 01:05:36,904 So here, the variance, you never think of-- 1100 01:05:36,904 --> 01:05:39,070 I mean, we'll talk about this in the Bayesian setup, 1101 01:05:39,070 --> 01:05:41,320 but so far, we haven't thought of the variance 1102 01:05:41,320 --> 01:05:42,390 as a random variable. 1103 01:05:42,390 --> 01:05:45,580 And so here, your x's really are the parameters of your data. 1104 01:05:45,580 --> 01:05:48,110 And the diagonal elements of x transpose x inverse 1105 01:05:48,110 --> 01:05:49,787 actually tell you what the variance is. 1106 01:05:49,787 --> 01:05:52,120 So that's also one reason why you should think of your x's 1107 01:05:52,120 --> 01:05:53,530 as being deterministic numbers.
1108 01:05:53,530 --> 01:05:55,450 They are, in a way, things that change 1109 01:05:55,450 --> 01:05:56,740 the geometry of your problem. 1110 01:05:56,740 --> 01:05:58,450 They just say, oh, let me look at it 1111 01:05:58,450 --> 01:06:01,180 from the perspective of x. 1112 01:06:01,180 --> 01:06:03,070 Actually, for that matter, we didn't really 1113 01:06:03,070 --> 01:06:06,000 spend much time commenting on what 1114 01:06:06,000 --> 01:06:09,730 is the effect of x on gamma. 1115 01:06:09,730 --> 01:06:19,910 So remember, gamma j, so that was the variance parameter. 1116 01:06:19,910 --> 01:06:23,780 So we should try to understand what x's lead to big variance 1117 01:06:23,780 --> 01:06:26,030 and what x's lead to small variance. 1118 01:06:26,030 --> 01:06:28,610 That would be nice. 1119 01:06:28,610 --> 01:06:31,550 Well, if this is the identity matrix-- 1120 01:06:31,550 --> 01:06:35,140 let's say the identity over n, which is the natural thing 1121 01:06:35,140 --> 01:06:38,620 to look at, because we want this thing to scale like 1/n-- 1122 01:06:38,620 --> 01:06:39,820 then this is just 1/n. 1123 01:06:39,820 --> 01:06:41,200 We're back to the original case. 1124 01:06:41,200 --> 01:06:41,700 Yes? 1125 01:06:41,700 --> 01:06:43,200 AUDIENCE: Shouldn't that be inverse? 1126 01:06:43,200 --> 01:06:45,500 PHILIPPE RIGOLLET: Yeah, thank you-- the inverse, yes. 1127 01:06:45,500 --> 01:06:48,590 So if this is the identity, then, well, the inverse 1128 01:06:48,590 --> 01:06:53,180 is-- let's say just this guy here is n times this guy. 1129 01:06:53,180 --> 01:06:56,210 So then the inverse is 1/n. 1130 01:06:56,210 --> 01:06:59,270 So in this case, that means that gamma j is equal to 1/n 1131 01:06:59,270 --> 01:07:02,240 and we're back to the theta hat theta 1132 01:07:02,240 --> 01:07:06,450 case, the basic one-dimensional thing. 1133 01:07:06,450 --> 01:07:11,390 What does it mean for a matrix when I take its-- 1134 01:07:11,390 --> 01:07:13,230 yeah, so that's of dimension p. 1135 01:07:13,230 --> 01:07:15,420 But when I take its transpose-- 1136 01:07:15,420 --> 01:07:17,394 so forget about the scaling by n right now. 1137 01:07:17,394 --> 01:07:19,060 This is just a matter of scaling things. 1138 01:07:19,060 --> 01:07:20,840 I can always multiply my x's so that I 1139 01:07:20,840 --> 01:07:22,584 have this thing that shows up. 1140 01:07:22,584 --> 01:07:24,750 But when I have a matrix, if I look at x transpose x 1141 01:07:24,750 --> 01:07:26,550 and I get something which is the identity, how 1142 01:07:26,550 --> 01:07:27,570 do I call this matrix? 1143 01:07:31,980 --> 01:07:32,970 AUDIENCE: Orthonormal? 1144 01:07:32,970 --> 01:07:34,470 PHILIPPE RIGOLLET: Orthogonal, yeah. 1145 01:07:34,470 --> 01:07:35,790 Orthonormal or orthogonal. 1146 01:07:35,790 --> 01:07:37,919 So you call this thing an orthogonal matrix. 1147 01:07:37,919 --> 01:07:39,960 And when it's an orthogonal matrix, what it means 1148 01:07:39,960 --> 01:07:42,540 is that the-- 1149 01:07:42,540 --> 01:07:46,230 so this matrix here, if you look at the matrix x transpose x, 1150 01:07:46,230 --> 01:07:48,390 the entries of this matrix are the inner products 1151 01:07:48,390 --> 01:07:49,890 between the columns of x. 1152 01:07:49,890 --> 01:07:51,240 That's what's happening. 1153 01:07:51,240 --> 01:07:52,800 You can write it out, and you will see 1154 01:07:52,800 --> 01:07:55,890 that the entries of this matrix are inner products.
1155 01:07:55,890 --> 01:07:59,860 If it's the identity, that means that you get some 1's 1156 01:07:59,860 --> 01:08:03,170 and a bunch of 0's, which means that all the inner products 1157 01:08:03,170 --> 01:08:05,910 between 2 different columns are actually 0. 1158 01:08:05,910 --> 01:08:07,980 What it means is that this matrix x 1159 01:08:07,980 --> 01:08:09,990 gives you an orthonormal basis for your space. 1160 01:08:09,990 --> 01:08:12,100 The columns form an orthonormal basis. 1161 01:08:12,100 --> 01:08:15,680 So they're basically as far from each other as they can be. 1162 01:08:15,680 --> 01:08:20,260 Now, if I start making those guys closer and closer, 1163 01:08:20,260 --> 01:08:21,939 then I'm starting to have some issues. 1164 01:08:21,939 --> 01:08:24,490 x transpose x is not going to be the identity. 1165 01:08:24,490 --> 01:08:27,330 I'm going to start to have some non-0 entries. 1166 01:08:27,330 --> 01:08:32,551 But if they all remain of norm 1, then-- 1167 01:08:32,551 --> 01:08:34,880 oh, sorry, so that's for the inverse. 1168 01:08:34,880 --> 01:08:37,899 So I first start putting some stuff here, which is non-0, 1169 01:08:37,899 --> 01:08:39,550 by taking my x's. 1170 01:08:39,550 --> 01:08:44,269 Rather than having this, I move to this. 1171 01:08:44,269 --> 01:08:46,310 Now I'm going to start seeing some non-0 entries. 1172 01:08:46,310 --> 01:08:49,410 And when I'm going to take the inverse of this matrix, 1173 01:08:49,410 --> 01:08:52,781 the diagonal elements are going to start to blow up. 1174 01:08:52,781 --> 01:08:56,010 Oh, sorry, the diagonals start to become smaller and smaller. 1175 01:08:56,010 --> 01:08:57,399 So when I take the inverse-- 1176 01:08:57,399 --> 01:09:01,399 no, sorry, the diagonal elements are going to blow up. 1177 01:09:01,399 --> 01:09:05,430 And so what it means is that the variance is going to blow up. 1178 01:09:05,430 --> 01:09:06,899 And that's essentially telling you 1179 01:09:06,899 --> 01:09:09,090 that if you get to choose your x's, you 1180 01:09:09,090 --> 01:09:12,582 want to take them as orthogonal as you can. 1181 01:09:12,582 --> 01:09:14,790 But if you don't, then you just have to deal with it, 1182 01:09:14,790 --> 01:09:18,950 and it will have a significant impact on your estimation 1183 01:09:18,950 --> 01:09:19,620 performance. 1184 01:09:19,620 --> 01:09:25,010 And that's also why, routinely, statistical software 1185 01:09:25,010 --> 01:09:26,885 is going to spit out this value here for you. 1186 01:09:26,885 --> 01:09:28,884 And you're going to have-- well, actually the square 1187 01:09:28,884 --> 01:09:30,410 root of this value. 1188 01:09:30,410 --> 01:09:32,440 And it's going to tell you, essentially-- 1189 01:09:32,440 --> 01:09:34,939 you're going to know how much randomness, how much variation 1190 01:09:34,939 --> 01:09:36,952 you have in this particular parameter 1191 01:09:36,952 --> 01:09:37,910 that you're estimating. 1192 01:09:37,910 --> 01:09:41,564 So if gamma j is large, then you're 1193 01:09:41,564 --> 01:09:43,189 going to have wide confidence intervals 1194 01:09:43,189 --> 01:09:45,740 and your tests are not going to reject very much. 1195 01:09:45,740 --> 01:09:47,110 And that's all captured by x. 1196 01:09:47,110 --> 01:09:48,109 That's what's important. 1197 01:09:48,109 --> 01:09:50,927 Everything, all of this, is completely captured by x. 1198 01:09:50,927 --> 01:09:52,760 Then, of course, there was the sigma squared 1199 01:09:52,760 --> 01:09:54,570 that showed up here.
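[Here is a tiny numpy experiment making that blow-up visible; the sample size and the amount of collinearity are made up for illustration.]

import numpy as np

n = 100
rng = np.random.default_rng(0)

def gamma_diag(X):
    # diagonal of (X^T X)^{-1}: the gamma_j's, up to the factor sigma^2
    return np.diag(np.linalg.inv(X.T @ X))

x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)

# near-orthogonal columns: both gamma_j's are on the order of 1/n
print(gamma_diag(np.column_stack([x1, x2])))

# nearly collinear columns: the gamma_j's blow up by orders of magnitude
print(gamma_diag(np.column_stack([x1, x1 + 0.01 * x2])))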
1200 01:09:54,570 --> 01:09:57,155 Actually, it was here, even in the definition of gamma j. 1201 01:09:57,155 --> 01:09:58,430 I forgot it. 1202 01:09:58,430 --> 01:10:00,440 What is the sigma squared police doing? 1203 01:10:00,440 --> 01:10:02,950 And so this thing was here as well, 1204 01:10:02,950 --> 01:10:04,850 and that's just exogenous. 1205 01:10:04,850 --> 01:10:06,269 It comes from the noise itself. 1206 01:10:06,269 --> 01:10:08,810 But there was this huge factor that came from the x's themselves. 1207 01:10:11,680 --> 01:10:13,960 So let's go back, now, to reading 1208 01:10:13,960 --> 01:10:15,320 this list in a linear fashion. 1209 01:10:15,320 --> 01:10:20,680 So I mean, you're MIT students, you've probably 1210 01:10:20,680 --> 01:10:25,480 heard that correlation does not imply causation many times. 1211 01:10:25,480 --> 01:10:27,145 Maybe you don't know what it means. 1212 01:10:27,145 --> 01:10:30,900 If you don't, that's OK, you just have to know the sentence. 1213 01:10:30,900 --> 01:10:32,420 No, what it means is that it's not 1214 01:10:32,420 --> 01:10:35,255 because I decided that something was going to be the x 1215 01:10:35,255 --> 01:10:36,630 and that something else was going 1216 01:10:36,630 --> 01:10:39,640 to be the y, that whatever relationship I'm finding 1217 01:10:39,640 --> 01:10:42,010 means that x implies y. 1218 01:10:42,010 --> 01:10:44,530 For example, even if I do genetics, genomics, 1219 01:10:44,530 --> 01:10:47,230 or whatever, I mean, I implicitly 1220 01:10:47,230 --> 01:10:49,630 assume that my genes are going to have 1221 01:10:49,630 --> 01:10:52,780 an effect on my outside look. 1222 01:10:52,780 --> 01:10:54,310 It could be the opposite. 1223 01:10:54,310 --> 01:10:55,720 I mean, who am I to say? 1224 01:10:55,720 --> 01:10:56,570 I'm not a biologist. 1225 01:10:56,570 --> 01:10:57,111 I don't know. 1226 01:10:57,111 --> 01:10:59,590 I haven't opened a biology book in 20 years. 1227 01:10:59,590 --> 01:11:02,140 So maybe, if I start hitting my head with a hammer, 1228 01:11:02,140 --> 01:11:04,720 I'm going to change my genetic material. 1229 01:11:04,720 --> 01:11:07,140 Probably not, but that's why-- 1230 01:11:07,140 --> 01:11:09,450 but causation definitely does not come from statistics. 1231 01:11:09,450 --> 01:11:11,690 So just know that that's a different thing-- 1232 01:11:11,690 --> 01:11:13,180 it's actually going to-- 1233 01:11:13,180 --> 01:11:14,690 it's not coming from there. 1234 01:11:14,690 --> 01:11:18,410 So actually, I remember, once, I gave an exam to students, 1235 01:11:18,410 --> 01:11:21,685 and there was an old data set on police expenditures, 1236 01:11:21,685 --> 01:11:23,920 I think, in Chicago in the '60s. 1237 01:11:23,920 --> 01:11:27,437 And they were trying to understand-- 1238 01:11:27,437 --> 01:11:28,270 no, it was on crime. 1239 01:11:28,270 --> 01:11:29,650 It was the crime data set. 1240 01:11:29,650 --> 01:11:31,700 And they were trying-- so the y variable was just 1241 01:11:31,700 --> 01:11:34,530 the rate of crime, and the x's were a bunch of things, 1242 01:11:34,530 --> 01:11:36,670 and one of them was police expenditures.
1243 01:11:36,670 --> 01:11:38,200 And if you ran the regression, you 1244 01:11:38,200 --> 01:11:41,050 would find that the coefficient in front of police expenditure 1245 01:11:41,050 --> 01:11:42,700 was a positive number, which means 1246 01:11:42,700 --> 01:11:45,690 that if you increase police expenditures, 1247 01:11:45,690 --> 01:11:48,400 that increases the crime. 1248 01:11:48,400 --> 01:11:52,800 I mean, that's what it means to have a positive coefficient. 1249 01:11:52,800 --> 01:11:55,410 Everybody agrees with this fact? 1250 01:11:55,410 --> 01:11:57,830 If beta j is 10, then it means that if I increase by $1 1251 01:11:57,830 --> 01:12:01,860 my police expenditure, I [INAUDIBLE] by 10 my crime, 1252 01:12:01,860 --> 01:12:04,160 everything else being kept equal. 1253 01:12:04,160 --> 01:12:06,140 Well, there were, I think, about 80% 1254 01:12:06,140 --> 01:12:09,230 of the students that were able to explain to me that if you 1255 01:12:09,230 --> 01:12:11,844 give more money to the police, then 1256 01:12:11,844 --> 01:12:13,010 the crime is going to rise. 1257 01:12:13,010 --> 01:12:14,780 Some people were like, well, the police 1258 01:12:14,780 --> 01:12:16,730 are making too much money, and they 1259 01:12:16,730 --> 01:12:19,264 don't think about their work, and they become lazy. 1260 01:12:19,264 --> 01:12:20,930 And I mean, people were really coming up 1261 01:12:20,930 --> 01:12:22,340 with some crazy things. 1262 01:12:22,340 --> 01:12:26,090 And what it just meant is that, no, it's not causation. 1263 01:12:26,090 --> 01:12:28,030 It's just, if you have more crime, 1264 01:12:28,030 --> 01:12:29,810 you give more money to your police. 1265 01:12:29,810 --> 01:12:31,370 That's what's happening. 1266 01:12:31,370 --> 01:12:33,800 And that's all there is. 1267 01:12:33,800 --> 01:12:35,750 So just be careful when you 1268 01:12:35,750 --> 01:12:38,360 draw some conclusions-- causation is a very important 1269 01:12:38,360 --> 01:12:39,410 thing to keep in mind. 1270 01:12:39,410 --> 01:12:43,280 And in practice, unless you have external reasons 1271 01:12:43,280 --> 01:12:45,680 for causality-- for example, with genetic material 1272 01:12:45,680 --> 01:12:52,040 and physical traits, we agree upon what 1273 01:12:52,040 --> 01:12:54,690 the direction of the arrow of causality is here. 1274 01:12:54,690 --> 01:12:57,845 There are places where you might not. 1275 01:12:57,845 --> 01:12:59,930 Now, finally, the normality of the noise-- 1276 01:12:59,930 --> 01:13:04,340 everything we did today required a Gaussian distribution 1277 01:13:04,340 --> 01:13:05,750 on the noise. 1278 01:13:05,750 --> 01:13:07,541 I mean, it's everywhere. 1279 01:13:07,541 --> 01:13:09,540 There's some Gaussian, there's some chi squared. 1280 01:13:09,540 --> 01:13:11,330 Everything came out of the Gaussian. 1281 01:13:11,330 --> 01:13:13,836 And for that, we needed this basic formula 1282 01:13:13,836 --> 01:13:15,710 for inference, which we derived from the fact 1283 01:13:15,710 --> 01:13:18,610 that the noise was Gaussian itself. 1284 01:13:18,610 --> 01:13:20,860 If we did not have that, the only thing we could write 1285 01:13:20,860 --> 01:13:24,370 is, beta hat is this number, or this vector. 1286 01:13:24,370 --> 01:13:27,980 We would not be able to say, the fluctuations of beta hat 1287 01:13:27,980 --> 01:13:28,615 are this guy. 1288 01:13:28,615 --> 01:13:30,472 We would not be able to do tests.
1289 01:13:30,472 --> 01:13:31,930 We would not be able to build, say, 1290 01:13:31,930 --> 01:13:34,160 confidence regions or anything. 1291 01:13:34,160 --> 01:13:38,150 And so this is an important condition that we need, 1292 01:13:38,150 --> 01:13:40,670 and that's what statistical software assumes by default. 1293 01:13:40,670 --> 01:13:44,870 But we now have a recipe for how to do those tests. 1294 01:13:44,870 --> 01:13:47,060 We can do it either visually, if we really 1295 01:13:47,060 --> 01:13:49,430 want to conclude that, yes, this is Gaussian, 1296 01:13:49,430 --> 01:13:51,350 using our normal Q-Q plots. 1297 01:13:51,350 --> 01:13:54,860 And we can also do it using our favorite tests. 1298 01:13:54,860 --> 01:13:56,750 What test should I be using to test that? 1299 01:14:01,540 --> 01:14:03,771 With two names? 1300 01:14:03,771 --> 01:14:04,270 Yeah? 1301 01:14:04,270 --> 01:14:06,957 AUDIENCE: Normal [INAUDIBLE]. 1302 01:14:06,957 --> 01:14:08,540 PHILIPPE RIGOLLET: Not the 2 Russians. 1303 01:14:08,540 --> 01:14:10,820 So I want a Russian and a Scandinavian person 1304 01:14:10,820 --> 01:14:12,722 for this one. 1305 01:14:12,722 --> 01:14:13,416 What's that? 1306 01:14:13,416 --> 01:14:14,540 AUDIENCE: Lillie-something? 1307 01:14:14,540 --> 01:14:16,290 PHILIPPE RIGOLLET: Yeah, Lillie-something. 1308 01:14:16,290 --> 01:14:18,660 So Kolmogorov Lillie-something test. 1309 01:14:18,660 --> 01:14:23,370 And [LAUGHS] so it's the Kolmogorov-Lilliefors test. 1310 01:14:23,370 --> 01:14:26,670 And because I'm testing if they're Gaussian, and I'm actually 1311 01:14:26,670 --> 01:14:28,140 not really making any assumption-- 1312 01:14:28,140 --> 01:14:30,510 I don't need to know what the variance is. 1313 01:14:30,510 --> 01:14:31,350 The mean is 0. 1314 01:14:31,350 --> 01:14:32,558 We saw that at the beginning. 1315 01:14:32,558 --> 01:14:34,680 It's 0 by construction, so we actually 1316 01:14:34,680 --> 01:14:37,590 don't need to think about the mean being 0 itself. 1317 01:14:37,590 --> 01:14:38,850 This just happens to be 0. 1318 01:14:38,850 --> 01:14:41,340 So we know that it's 0, but the variance, we don't know. 1319 01:14:41,340 --> 01:14:42,900 So we just want to know if it belongs 1320 01:14:42,900 --> 01:14:45,233 to the family of Gaussians, and so we need Kolmogorov- 1321 01:14:45,233 --> 01:14:46,660 Lilliefors for that. 1322 01:14:46,660 --> 01:14:49,650 And that's also one of the things that's spit out by statistical 1323 01:14:49,650 --> 01:14:52,680 software by default. When you run a linear regression, 1324 01:14:52,680 --> 01:14:54,670 actually, it spits out both Kolmogorov-Smirnov 1325 01:14:54,670 --> 01:14:59,118 and Kolmogorov-Lilliefors, probably contributing 1326 01:14:59,118 --> 01:15:01,860 to the widespread use of Kolmogorov-Smirnov when you 1327 01:15:01,860 --> 01:15:03,550 really shouldn't. 1328 01:15:03,550 --> 01:15:08,920 So next time, we will talk about more advanced topics 1329 01:15:08,920 --> 01:15:09,670 on regression. 1330 01:15:09,670 --> 01:15:11,780 But I think I'm going to stop here for today. 1331 01:15:11,780 --> 01:15:14,740 So again, tomorrow, sometime during the day, 1332 01:15:14,740 --> 01:15:16,780 at least before the recitation, you 1333 01:15:16,780 --> 01:15:20,740 will have a list of practice exercises that will be posted. 1334 01:15:20,740 --> 01:15:23,600 And if you go to the optional recitation, 1335 01:15:23,600 --> 01:15:26,190 you will have someone solving them.
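[To close the loop on that last check, here is a minimal sketch of testing the residuals with the Kolmogorov-Lilliefors test, using the lilliefors function from statsmodels; the simulated data and all the names are made up for illustration.]

import numpy as np
from statsmodels.stats.diagnostic import lilliefors

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.standard_normal((n, p))
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + rng.standard_normal(n)     # Gaussian noise

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat

# Kolmogorov-Lilliefors: tests membership in the Gaussian family,
# with the variance left unspecified
stat, p_value = lilliefors(resid, dist='norm')
# a small p-value is evidence against Gaussian noise, in which case
# the t- and F-based inference above is on shaky ground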