The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PHILIPPE RIGOLLET: It's because if I was not, this would be basically the last topic we would ever see. And this is arguably the most important topic in statistics, or at least that's probably the reason why most of you are taking this class. Because regression implies prediction, and prediction is what people are after now, right? You don't need to understand what the model for the financial market is if you have a formula to predict what the stock prices are going to be tomorrow. And regression, in a way, allows us to do that. We'll start with a very simple version of regression, which is linear regression, the most standard one. And then we'll move on to slightly more advanced notions such as nonparametric regression. At least, we're going to see the principles behind it.
And I'll touch upon a little bit of high-dimensional regression, which is what people are doing today. So the goal of regression is to try to predict one variable based on another variable. All right, so here the notation is very important. It's extremely standard; it goes everywhere, essentially. You're trying to explain y as a function of x, which is the usual y = f(x) question. Except that if you look at a calculus class, people tell you y = f(x), and they give you a specific form for f, and then you do something. Here, we're just going to try to estimate what this function f is. And this is why we often call y the explained variable and x the explanatory variable. All right, so we're statisticians, so we start with data. What does our data look like? Well, it looks like a bunch of inputs and outputs to this relationship. So we have a bunch of pairs (xi, yi), and I can do a scatterplot of those guys. Each point here has an x-coordinate, which is xi, and a y-coordinate, which is yi, and here I have a bunch of n points.
And I just draw them like that. Now, the functions we're going to be interested in are often of the form y = a + b*x, OK? And that means this function looks like this. So if I do x and y, this function looks exactly like a line, and clearly those points are not on the line. And it will basically never happen that those points are on a line. There's a famous T-shirt from, I think, UC Berkeley's statistics department, that shows this picture with a line through the points, like the one we're going to see. And it says: oh, statisticians, so many points, and you still managed to miss all of them. And so essentially, we don't believe that this relationship y = a + bx is exactly true, but maybe it's true up to some noise. And that's where the statistics is going to come into play. There's going to be some random noise, and hopefully the noise is going to be spread out evenly, so that we can average it out if we have enough points. And this epsilon here is not necessarily due to randomness.
But again, just like we did modeling in the first place, it essentially accounts for everything we don't understand about this relationship. All right, so for example; give me one second, we'll see an example in a second. But the idea here is that if you have data, and if you believe that it's of the form a + bx plus some noise, you're trying to find the line that explains your data the best, right? In the terminology we've been using before, this would be the most likely line that explains the data. So you can see that we've just added another dimension to our statistical problem. We don't have just x's, we also have y's, and we're trying to find the most likely explanation of the relationship between y and x. All right, and so in practice, the way it's going to look is that we have two parameters to find, the slope b and the intercept a, and given data, the goal is going to be to find the best possible line. All right?
So what we're going to find is not exactly a and b, the ones that actually generated the data, but some estimators of those parameters, a hat and b hat, constructed from the data. All right, so we'll see that more generally, but we're not going to go too much into the details of this. There's actually quite a bit that you can understand if you do what's called univariate regression, when x is a real-valued random variable. So when this happens, this is called univariate regression. And when x is in R^p for p larger than or equal to 2, this is called multivariate regression. OK, and so here we're just trying to explain y = a + bx + epsilon. And there we're going to have something more complicated. We're going to have y = a + b1*x1 + b2*x2 + ... + bp*xp + epsilon, where the coordinates of x are given by x1, ..., xp in R^p. OK, so it's still linear. We still add all the coordinates of x, each with a coefficient in front of it, but it's a bit more complicated than just one coefficient for one coordinate of x, OK?
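[The univariate and multivariate models just described can be sketched numerically. The sketch below simulates data from both; all the specific values (a, b, the noise scale, the sample size) are illustrative assumptions, not numbers from the lecture.]

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # sample size (arbitrary choice for the demo)

# Univariate model: y = a + b*x + epsilon
a, b = 1.0, 2.0                       # illustrative intercept and slope
x = rng.normal(size=n)
y = a + b * x + rng.normal(scale=0.5, size=n)

# Multivariate model: y = a + b1*x1 + ... + bp*xp + epsilon = a + x^T b + epsilon
p = 3
B = np.array([0.5, -1.0, 2.0])        # illustrative coefficients b1, ..., bp
X = rng.normal(size=(n, p))           # each row is one observation x in R^p
y_multi = a + X @ B + rng.normal(scale=0.5, size=n)

print(y.shape, y_multi.shape)
```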
So we'll come back to multivariate regression. Of course, you can write this as x transpose b, right? So this entire linear combination is of the form x transpose b, where b is the vector that has coordinates b1 to bp. OK? Sorry, here I wrote R^d, but p is the natural notation. All right, so our goal here, in the univariate case, is to try to write the model, to make sense of this little twiddle here. Essentially, from a statistical modeling point of view, the question is going to be: what distributional assumptions do you want to put on epsilon? Are you going to say it's Gaussian? Are you going to say it's binomial? Are you going to say it's Bernoulli? So that's what we're going to make sense of, and then we're going to try to find a method to estimate a and b. And then maybe we're going to try to do some inference about a and b: maybe test whether a and b take certain values, or whether they're less than something, maybe find some confidence regions for a and b, all right? So why would you want to do this?
Well, I'm sure all of you have an application: given some x, you're trying to predict what y is. Machine learning is all about doing this, right? Without maybe even trying to understand the physics behind it, they're saying: you give me a bag of words, I want to know whether it's going to be spam or not. You give me a bunch of economic indicators, I want you to tell me how much I should be selling my car for. You give me a bunch of measurements on some patient, I want you to predict how this person is going to respond to my drug, and things like this. All right, and often we actually don't have much modeling intuition about what the relationship between x and y is, and this linear thing is basically the simplest function we can think of. Arguably, linear functions are the simplest functions that are not trivial. Otherwise, we would just predict y to be a constant, meaning it does not depend on x. But if you want it to depend on x, then linear functions are basically as simple as it gets. It turns out, amazingly, this does the trick quite often.
So for example, if you look at economics, you might want to assume that the demand is a linear function of the price. So if your price is zero, there's going to be a certain demand. And as the price increases, the demand is going to move. Do you think b is going to be positive or negative here? What? Typically, it's negative, unless we're talking about maybe luxury goods, where, you know, the more expensive, the more people actually want it. If we're talking about actual economic demand, that's almost certainly negative. The relationship doesn't have to be obviously linear, either; sometimes you can transform it into something linear. So for example, you have this multiplicative relationship, PV = nRT, which is the ideal gas law. Say you want to predict what the pressure is going to be as a function of the volume and the temperature. Well, let's assume that n, the amount of gas, is fixed, and that the constant R is fixed. Then you take the log on each side.
So what that means is that log(PV) = log(nRT). So log P + log V = log(nR) + log T. We said that R is constant and n is fixed, so log(nR) is actually a constant; I'm going to call it a. And then that means that log P = a - log V + log T. OK? And so in particular, if I write b = -1 and c = +1, this gives me the formula that I have here: log P = a + b log V + c log T. Now again, this is the ideal gas law. In practice, if I start recording pressure, and temperature, and volume, I might make measurement errors, there might be slightly different conditions, in such a way that I'm not going to get exactly this. And I'm just going to put this little twiddle to account for the fact that the points that I'm going to be recording for log pressure, log volume, and log temperature are not going to be exactly on one line. OK, they're going to be close. Actually, in those physics experiments, usually, they're very close, because the conditions are controlled in lab experiments.
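[The log-linearization done on the board can be checked numerically: taking logs of PV = nRT gives log P = a - log V + log T with a = log(nR). The values of n, R, V, and T below are illustrative assumptions for the check.]

```python
import numpy as np

# Ideal gas law PV = nRT, with n (amount of gas) and R held fixed;
# the numbers are illustrative, not from the lecture.
n_gas, R = 1.0, 8.314
a = np.log(n_gas * R)                 # the constant term a = log(nR)

V = np.array([1.0, 2.0, 5.0])         # volumes (arbitrary units)
T = np.array([300.0, 310.0, 320.0])   # temperatures
P = n_gas * R * T / V                 # pressure from the gas law

# The log-linear form derived on the board: log P = a - log V + log T
log_P_linear = a - np.log(V) + np.log(T)
print(np.allclose(np.log(P), log_P_linear))  # True
```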
So in those experiments the noise is very small. But for other cases, like demand and prices, it's not a law of physics, and so this can change. Even the linear structure is probably not clear, right; at some point, there's probably going to be some weird curvature happening. All right, so this slide is just to tell you: maybe you don't have, obviously, a linear relationship, but maybe you do if you start taking logs, exponentials, squares. You can sometimes take the product of two variables, things like this. So this is variable transformation, and it's mostly domain-specific, so we're not going to go into more detail. Any questions? All right, so now let's start thinking a little more about what these coefficients should be. Well, first, is everybody clear on why I don't put the little i here? Right, I don't put the little i because I'm just talking about a generic x and a generic y, but the observations are x1, y1, right.
So typically, on the blackboard I'm often going to write only x, y, but the data really is x1, y1, all the way to xn, yn. So those are the points in this two-dimensional plot. But I think of them as being independent copies of the pair (x, y). They have to contain their relationship. And so when I talk about the distribution of those random variables, I talk about the distribution of (x, y), and it's the same for each pair. All right, so the first thing you might want to ask is: if I have an infinite amount of data, what can I hope to get for a and b? If my sample size goes to infinity, then I should actually know exactly what the distribution of (x, y) is. And so there should be an a and a b that capture this linear relationship between y and x. And so in particular, we're going to ask for the population, or theoretical, values of a and b, and you'll see that you can actually compute them explicitly. So let's just try to find out how. So as I said, we have a bunch of points close to a line, and I'm trying to find the best fit.
All right, so this guy is not a good fit. This guy is not a good fit. And we know that this guy is a good fit, somehow. So we need to mathematically formulate the fact that this line here is better than this line here, or better than this line here. So what we're trying to do is to create a function that has values that are smaller for this curve and larger for these two curves. And the way we do it is by measuring the fit, and the fit is essentially the aggregate distance of all the points to the curve. And there are many ways I can measure the distance to a curve. So let's just open a parenthesis; we're going to do it for one point at a time. If I have a point, there are many ways I can measure its distance to the curve, right? I can measure it like that; that is one distance to the curve. I can measure it like that, by having a right angle here; that is another distance to the curve. Or I can measure it like that; that is yet another distance to the curve. There are many ways I can go about it.
It turns out that one of them is actually going to be fairly convenient for us, and that's the one that says: let's look at the square of the difference between y and the value of the line at x. So this is the curve, y = a + bx. Now, I'm going to think of this point as a random point, capital X, capital Y, so that it can be x1, y1 or x2, y2, et cetera. Now, I want to measure the distance. Can somebody tell me which of the three (the first one, the second one, or the third one) this formula, the expectation of (Y - a - bX) squared, is representing?
AUDIENCE: The second one.
PHILIPPE RIGOLLET: The second one, where I have the right angle? OK, does everybody agree with this? Anybody want to vote for something else? Yeah?
AUDIENCE: The third one?
PHILIPPE RIGOLLET: The third one? Does everybody agree with the third one? So by default, everybody's on the first one? Yeah, it is the vertical distance, actually.
And the reason is that if it was the one with the right angle, it would actually be a very complicated mathematical formula, so let's just see why. And by why, I mean y. OK, so this means that this is my x, and this is my y. All right, so that means that this point is (x, y). So what I'm measuring is the difference between y and a + b times x. This is the thing I'm going to take the expectation of; the square and then the expectation. So a + b times x: if this is the line, this is this point, so that's this value here. This value here is a + bx, right? So what I'm really measuring is the difference between y and a + bx, which is this distance here. And since I like things like the Pythagorean theorem, I'm actually going to put a square here before I take the expectation. So now this is a random variable, this random variable here. And I want a number, so I'm going to turn it into a deterministic number. And the way I do this is by taking the expectation.
And if you think of expectations as being close to averages, this is the same thing as saying: I want the y's, on average, to be close to a + bx, right? So we're doing it in expectation, but that's going to translate into doing it on average over all the points. All right, so this is the thing I want to measure. So that's this vertical distance. Yeah? OK. This is my fault, actually. Maybe we should close those shades. OK, I cannot do just one at a time, sorry. All right, so now that I have those vertical distances, I can ask; well, now I have this function, right? A function that takes two parameters, a and b, and maps them to the expectation of (y - (a + bx)) squared. Sorry, the square is here. And I could ask: well, this is a function that measures the fit of the parameters a and b, right? This function should be small. It's a function of a and b that measures how close the point (x, y) is to the line y = a + bx, where y is equal to a + bx in expectation. OK, agreed?
This is what we just said. Again, if you're not comfortable with the reason why you get expectations, just think about having data points and taking the average value of this guy. So it's basically an aggregate distance of the points to the line. OK, does everybody agree this is a legitimate measure? If all my points were on the line, if y was actually equal to a + bx for some a and b, then this function would be equal to 0 for the correct a and b, right? If they are not, well, it's going to depend on how much noise I'm getting, but it's still going to be minimized for the best one. So let's minimize this thing. So here, I don't make any; again, sorry. I don't make any assumption on the distribution of x or y. Here, I assume, somehow, that the variance of x is not equal to 0. Can somebody tell me why? Yeah?
AUDIENCE: Not really a question. On the slides, you have y minus a minus bx, quantity squared, expectation of that, and here you've written the square of the expectation.
PHILIPPE RIGOLLET: No, here I actually have the expectation of the square.
If I wanted to write the square of the expectation, I would just do this. So let's just make it clear. Right? Do you want me to put an extra set of parentheses? That's what you want me to do?
AUDIENCE: Yeah, it's just confusing with the [INAUDIBLE].
PHILIPPE RIGOLLET: OK, so that's the one that makes sense, the square of the expectation?
AUDIENCE: Yeah.
PHILIPPE RIGOLLET: Oh, the expectation of the square, sorry. Yeah, dyslexia. All right, any questions? Yeah?
AUDIENCE: Does this assume that the error is Gaussian?
PHILIPPE RIGOLLET: No.
AUDIENCE: I mean, in the sense that, if we knew that the error followed, say, an e to the minus x to the fourth distribution, would we want to minimize the expectation of the fourth power of y minus a minus bx in order to get the best fit?
PHILIPPE RIGOLLET: Why? So you know the answer to your question, so I just want you to use the words. Right, so why would you want to use the fourth power?
429 00:23:04,756 --> 00:23:06,429 AUDIENCE: Well, because, like, we 430 00:23:06,429 --> 00:23:08,137 want to more strongly penalize deviations 431 00:23:08,137 --> 00:23:11,518 because we'd expect very large deviations to be 432 00:23:11,518 --> 00:23:15,870 very rare, or more rare, than they would 433 00:23:15,870 --> 00:23:18,170 be with the Gaussian [INAUDIBLE] power. 434 00:23:18,170 --> 00:23:19,360 PHILIPPE RIGOLLET: Yeah, so that would be the maximum likelihood 435 00:23:19,360 --> 00:23:21,290 estimator that you're describing to me, right? 436 00:23:21,290 --> 00:23:22,850 I can actually write the likelihood 437 00:23:22,850 --> 00:23:25,340 of a pair of numbers a, b. 438 00:23:25,340 --> 00:23:26,847 And if I know this, that's actually 439 00:23:26,847 --> 00:23:28,430 what's going to come into it because I 440 00:23:28,430 --> 00:23:31,610 know that the density is going to come into play when 441 00:23:31,610 --> 00:23:32,740 I talk about that. 442 00:23:32,740 --> 00:23:34,580 But here, I'm just talking about-- 443 00:23:34,580 --> 00:23:36,350 this is a mechanical tool. 444 00:23:36,350 --> 00:23:39,640 I'm just saying, let's minimize the distance to the curve. 445 00:23:39,640 --> 00:23:42,320 Another thing I could have done is take the absolute value 446 00:23:42,320 --> 00:23:43,750 of this thing, for example. 447 00:23:43,750 --> 00:23:46,190 I just decided to take the square root before I did it. 448 00:23:46,190 --> 00:23:48,630 OK, so regardless of what I'm doing, 449 00:23:48,630 --> 00:23:50,600 I'm just taking the squares because that's just 450 00:23:50,600 --> 00:23:53,600 going to be convenient for me to do my computations for now. 451 00:23:53,600 --> 00:23:55,400 But we don't have any statistical model 452 00:23:55,400 --> 00:23:56,940 at this point. 453 00:23:56,940 --> 00:23:59,040 I didn't say anything-- that y follows this. 454 00:23:59,040 --> 00:24:00,320 X follows this.
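As a concrete version of this "mechanical tool," here is a small Python sketch (my own illustration, not from the lecture) of the empirical least-squares criterion: the average squared vertical distance of the data points to a candidate line y = a + bx.

```python
# Empirical version of the least-squares criterion E[(Y - (a + bX))^2]:
# the average squared vertical distance of the points to the line y = a + b*x.
def squared_loss(a, b, xs, ys):
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Points lying exactly on the line y = 1 + 2x give zero loss,
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
print(squared_loss(1.0, 2.0, xs, ys))  # 0.0
# while a different line gives a strictly positive value.
print(squared_loss(0.0, 2.0, xs, ys))  # 1.0
```

Minimizing this over (a, b) is exactly the least-squares problem worked out next; the absolute value mentioned above would instead give a least-absolute-deviation criterion.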
455 00:24:00,320 --> 00:24:01,760 I'm just making minimal assumptions 456 00:24:01,760 --> 00:24:04,250 as we go, all right? 457 00:24:04,250 --> 00:24:06,140 So the variance of x is not equal to 0? 458 00:24:06,140 --> 00:24:07,270 Could somebody tell me why? 459 00:24:11,330 --> 00:24:14,490 What would my point cloud look like if the variance of x 460 00:24:14,490 --> 00:24:16,130 was equal to 0? 461 00:24:16,130 --> 00:24:18,122 Yeah, they would all be at the same point. 462 00:24:18,122 --> 00:24:20,580 So it's going to be hard for me to start fitting a line, 463 00:24:20,580 --> 00:24:21,180 right? 464 00:24:21,180 --> 00:24:24,100 I mean, best case scenario, I have this x. 465 00:24:24,100 --> 00:24:26,700 It has variance zero, so this is the expectation of x. 466 00:24:26,700 --> 00:24:31,020 And all my points have the same expectation, 467 00:24:31,020 --> 00:24:33,780 and so, yes, I could probably fit that line. 468 00:24:33,780 --> 00:24:38,340 But that wouldn't help very much for other x's. 469 00:24:38,340 --> 00:24:41,400 So I need a bit of variance so that things spread out 470 00:24:41,400 --> 00:24:42,010 a little bit. 471 00:24:47,440 --> 00:24:51,130 OK, I'm going to have to do this. 472 00:24:51,130 --> 00:24:52,370 I think it's just my-- 473 00:25:10,200 --> 00:25:13,460 All right, so I'm going to put a little bit of variance. 474 00:25:13,460 --> 00:25:15,960 And the other thing is here, I don't want to do much more, 475 00:25:15,960 --> 00:25:22,440 but I'm actually going to think of x as having mean zero. 476 00:25:22,440 --> 00:25:24,430 And the way I do this is as follows. 477 00:25:24,430 --> 00:25:30,570 Let's define x tilde, which is x minus the expectation of x. 478 00:25:30,570 --> 00:25:33,920 OK, so definitely the expectation of x tilde is what? 479 00:25:36,620 --> 00:25:38,110 Zero, OK.
480 00:25:38,110 --> 00:25:43,350 And so now I want to minimize in a, b, the expectation 481 00:25:43,350 --> 00:25:53,920 of y minus a plus bx, squared. 482 00:25:53,920 --> 00:26:03,810 And the way I'm going to do this is by turning x into x tilde 483 00:26:03,810 --> 00:26:07,060 and stuffing the extra-- 484 00:26:07,060 --> 00:26:12,760 and putting the extra expectation of x into the a. 485 00:26:12,760 --> 00:26:19,840 So I'm going to write this as an expectation of y minus a plus 486 00:26:19,840 --> 00:26:25,180 b expectation of x-- 487 00:26:25,180 --> 00:26:27,530 which I'm going to call a tilde-- 488 00:26:27,530 --> 00:26:30,300 and plus b x tilde. 489 00:26:33,930 --> 00:26:35,630 OK? 490 00:26:35,630 --> 00:26:38,920 And everybody agrees with this? 491 00:26:38,920 --> 00:26:41,490 So now I have two parameters, a tilde and b, 492 00:26:41,490 --> 00:26:44,350 and I'm going to pretend that now x tilde-- 493 00:26:44,350 --> 00:26:50,830 so now the role of x is played by x tilde, which is now 494 00:26:50,830 --> 00:26:53,020 a centered random variable. 495 00:26:53,020 --> 00:26:55,660 OK, so I'm going to call this guy a tilde, 496 00:26:55,660 --> 00:26:58,859 but for my computations I'm going to call it a. 497 00:26:58,859 --> 00:27:00,650 So how do I find the minimum of this thing? 498 00:27:05,114 --> 00:27:06,620 Derivative equal to zero, right? 499 00:27:06,620 --> 00:27:08,235 So here it's a quadratic thing. 500 00:27:08,235 --> 00:27:09,360 It's going to be like that. 501 00:27:09,360 --> 00:27:10,880 I take the derivative, set it to zero. 502 00:27:10,880 --> 00:27:13,130 So I'm first going to take the derivative with respect 503 00:27:13,130 --> 00:27:16,370 to a and set it equal to zero, so that's equivalent to saying 504 00:27:16,370 --> 00:27:18,320 that the expectation of-- 505 00:27:18,320 --> 00:27:21,315 well, here, I'm going to pick up a 2-- 506 00:27:21,315 --> 00:27:33,720 y minus a plus bx tilde is equal to zero.
507 00:27:33,720 --> 00:27:36,580 And then I also have that the derivative with respect to b is 508 00:27:36,580 --> 00:27:40,260 equal to zero, which is equivalent to the expectation 509 00:27:40,260 --> 00:27:42,100 of-- well, I have a negative sign somewhere, 510 00:27:42,100 --> 00:27:43,410 so let me put it here-- 511 00:27:43,410 --> 00:27:50,950 minus 2x tilde, y minus a plus bx tilde. 512 00:27:55,644 --> 00:27:58,910 OK, see that's why I don't want to put too many parentheses. 513 00:28:03,140 --> 00:28:05,741 OK. 514 00:28:05,741 --> 00:28:07,490 So I just took the derivative with respect 515 00:28:07,490 --> 00:28:09,920 to a, which is just basically the square, 516 00:28:09,920 --> 00:28:12,569 and then I have a negative 1 that comes out from inside. 517 00:28:12,569 --> 00:28:14,360 And then I take the derivative with respect 518 00:28:14,360 --> 00:28:17,010 to b, and since b has x tilde 519 00:28:17,010 --> 00:28:19,340 as a factor, it comes out as well. 520 00:28:19,340 --> 00:28:24,420 All right, so the minus 2's really won't matter for me. 521 00:28:24,420 --> 00:28:26,706 And so now I have two equations. 522 00:28:26,706 --> 00:28:28,580 The first equation, while it's pretty simple, 523 00:28:28,580 --> 00:28:31,955 it's just telling me that the expectation of y minus a 524 00:28:31,955 --> 00:28:33,710 is equal to zero. 525 00:28:33,710 --> 00:28:41,870 So what I know is that a is equal to the expectation of y. 526 00:28:41,870 --> 00:28:44,060 And really that was a tilde, which 527 00:28:44,060 --> 00:28:47,870 implies that the a I want is actually 528 00:28:47,870 --> 00:29:00,690 equal to the expectation of y minus b 529 00:29:00,690 --> 00:29:05,030 times the expectation of x. 530 00:29:05,030 --> 00:29:05,530 OK? 531 00:29:10,240 --> 00:29:13,450 Just because a tilde is a plus b times the expectation of x. 532 00:29:16,830 --> 00:29:19,360 So that's for my a. 533 00:29:19,360 --> 00:29:22,180 And then for my b, I use the second one.
534 00:29:22,180 --> 00:29:27,990 So the second one tells me that the expectation of x tilde times y 535 00:29:27,990 --> 00:29:32,430 is equal to a plus b times the expectation of x tilde, 536 00:29:32,430 --> 00:29:33,520 which is zero, right? 537 00:29:38,640 --> 00:29:39,460 OK? 538 00:29:39,460 --> 00:29:41,630 But this a is actually a tilde in this problem, 539 00:29:41,630 --> 00:29:47,210 so it's actually a plus b expectation of x. 540 00:29:51,900 --> 00:29:53,890 Now, this is the expectation of the product 541 00:29:53,890 --> 00:29:57,480 of two random variables, but x tilde is centered, right? 542 00:29:57,480 --> 00:30:00,670 It's x minus expectation of x, so this thing is actually 543 00:30:00,670 --> 00:30:03,640 equal to the covariance between x and y 544 00:30:03,640 --> 00:30:05,140 by definition of covariance. 545 00:30:09,130 --> 00:30:11,840 So now I have everything I need, right. 546 00:30:11,840 --> 00:30:14,110 How do I just-- 547 00:30:14,110 --> 00:30:16,520 I'm sorry about that. 548 00:30:16,520 --> 00:30:18,330 So I have everything I need. 549 00:30:18,330 --> 00:30:22,560 Now I have two equations with two unknowns, 550 00:30:22,560 --> 00:30:25,110 and all I have to do is to basically plug it in. 551 00:30:25,110 --> 00:30:29,460 So it's essentially telling me that the covariance of xy-- 552 00:30:29,460 --> 00:30:31,980 so the first equation tells me that the covariance of xy 553 00:30:31,980 --> 00:30:36,750 is equal to a plus b expectation of x, but a is expectation of y 554 00:30:36,750 --> 00:30:39,640 minus b expectation of x. 555 00:30:39,640 --> 00:30:45,113 So it's-- well, actually, maybe I should start with b. 556 00:30:54,780 --> 00:30:56,010 Oh, sorry. 557 00:30:56,010 --> 00:30:59,580 OK, I forgot one thing. 558 00:30:59,580 --> 00:31:00,750 This is not true, right. 559 00:31:00,750 --> 00:31:02,516 I forgot this term.
560 00:31:02,516 --> 00:31:05,850 x tilde multiplies x tilde here, so what 561 00:31:05,850 --> 00:31:07,680 I'm left with is x tilde-- 562 00:31:07,680 --> 00:31:11,320 it's minus b times the expectation of x tilde squared. 563 00:31:11,320 --> 00:31:14,800 So that's actually minus b times the variance of x 564 00:31:14,800 --> 00:31:17,970 tilde because x tilde is already centered, 565 00:31:17,970 --> 00:31:19,760 which is actually the variance of x. 566 00:31:23,850 --> 00:31:29,790 So now I have that this thing is actually a plus b expectation 567 00:31:29,790 --> 00:31:36,570 of x minus b variance of x. 568 00:31:36,570 --> 00:31:42,180 And I also have that a is equal to expectation 569 00:31:42,180 --> 00:31:45,960 of y minus b expectation of x. 570 00:31:53,720 --> 00:31:58,100 So if I sum the two, those guys are going to cancel. 571 00:31:58,100 --> 00:32:00,740 Those guys are going to cancel. 572 00:32:00,740 --> 00:32:05,630 And so what I'm going to be left with is covariance of xy 573 00:32:05,630 --> 00:32:10,570 is equal to expectation of x, expectation of y, 574 00:32:10,570 --> 00:32:12,610 and then I'm left with this term here, minus 575 00:32:12,610 --> 00:32:14,050 b times the variance of x. 576 00:32:17,070 --> 00:32:20,171 And so that tells me that b-- 577 00:32:20,171 --> 00:32:21,796 why do I still have the variance there? 578 00:32:34,692 --> 00:32:37,668 AUDIENCE: So is the covariance really 579 00:32:37,668 --> 00:32:43,124 the expectation of x tilde times y minus expectation of y? 580 00:32:43,124 --> 00:32:46,092 Because y is not centered, correct? 581 00:32:46,092 --> 00:32:47,092 PHILIPPE RIGOLLET: Yeah. 582 00:32:47,092 --> 00:32:48,814 AUDIENCE: OK, but x is still centered. 583 00:32:48,814 --> 00:32:50,980 PHILIPPE RIGOLLET: But x is still centered, right. 584 00:32:50,980 --> 00:32:52,700 So you just need to have one that's 585 00:32:52,700 --> 00:32:53,830 centered for this to work.
586 00:32:57,187 --> 00:32:58,520 Right, I mean, you can check it. 587 00:32:58,520 --> 00:33:00,144 But basically when you're going to have 588 00:33:00,144 --> 00:33:02,877 the product of the expectations, you only need one of the two 589 00:33:02,877 --> 00:33:03,960 in the product to be zero. 590 00:33:03,960 --> 00:33:04,920 So the product is zero. 591 00:33:09,090 --> 00:33:11,020 OK, why do I keep my-- 592 00:33:11,020 --> 00:33:14,542 so I get a, a, and then the b expectation. 593 00:33:14,542 --> 00:33:16,750 OK, so that's probably earlier that I made a mistake. 594 00:33:25,620 --> 00:33:29,140 So I get-- so this was a tilde. 595 00:33:29,140 --> 00:33:30,548 Let's just be clear about the-- 596 00:33:40,508 --> 00:33:43,350 So that tells me that a tilde-- 597 00:33:43,350 --> 00:33:45,570 maybe it's not super fair of me to-- 598 00:33:48,310 --> 00:33:50,426 yeah, OK, I think I know where I made a mistake. 599 00:33:50,426 --> 00:33:51,550 I should not have centered. 600 00:33:51,550 --> 00:33:54,760 I wanted to make my life easier, and I should not 601 00:33:54,760 --> 00:33:55,960 have done that. 602 00:33:55,960 --> 00:33:59,140 And the reason is a tilde depends on b, 603 00:33:59,140 --> 00:34:01,780 so when I take the derivative with respect 604 00:34:01,780 --> 00:34:04,840 to b, what I'm left with here-- 605 00:34:04,840 --> 00:34:06,880 since a tilde depends on b, when I 606 00:34:06,880 --> 00:34:09,370 take the derivative of this guy, I actually 607 00:34:09,370 --> 00:34:12,550 don't get a tilde here, but I really get-- 608 00:34:17,570 --> 00:34:20,896 so again, this was not-- 609 00:34:20,896 --> 00:34:21,960 so that's the first one. 610 00:34:30,389 --> 00:34:33,800 This is actually x here-- 611 00:34:33,800 --> 00:34:38,050 because when I take the derivative with respect to b. 
612 00:34:38,050 --> 00:34:40,929 And so now, what I'm left with is that the expectation-- so 613 00:34:40,929 --> 00:34:43,929 yeah, I'm basically left with nothing that helps. 614 00:34:43,929 --> 00:34:46,300 So I'm sorry about that. 615 00:34:46,300 --> 00:34:49,929 Let's start from the beginning because this is not 616 00:34:49,929 --> 00:34:53,090 getting us anywhere, and a fix is not going to help. 617 00:34:53,090 --> 00:34:55,370 So let's just do it again. 618 00:34:55,370 --> 00:34:56,320 Sorry about that. 619 00:34:56,320 --> 00:34:59,230 So let's not center anything and just do brute force 620 00:34:59,230 --> 00:35:01,120 because we're going to-- 621 00:35:01,120 --> 00:35:04,820 b x squared. 622 00:35:04,820 --> 00:35:07,270 All right. 623 00:35:07,270 --> 00:35:09,520 Setting the partial with respect to a 624 00:35:09,520 --> 00:35:11,920 equal to zero, my minus 2 625 00:35:11,920 --> 00:35:13,060 is going to cancel, right. 626 00:35:13,060 --> 00:35:14,851 So I'm going to actually forget about this. 627 00:35:14,851 --> 00:35:17,980 So it's actually telling me that the expectation 628 00:35:17,980 --> 00:35:25,660 of y minus a plus bx is equal to zero, which 629 00:35:25,660 --> 00:35:31,090 is equivalent to a plus b expectation of x, is 630 00:35:31,090 --> 00:35:33,775 equal to the expectation of y. 631 00:35:33,775 --> 00:35:35,650 Now, if I take the derivative with respect to 632 00:35:35,650 --> 00:35:38,830 b and set it equal to zero, this is telling me 633 00:35:38,830 --> 00:35:41,656 that the expectation of-- 634 00:35:41,656 --> 00:35:43,780 well, it's the same thing except that this time I'm 635 00:35:43,780 --> 00:35:45,280 going to pull out an x.
636 00:35:52,470 --> 00:35:54,310 This guy is equal to zero-- 637 00:35:54,310 --> 00:35:56,660 this guy is not here-- 638 00:35:56,660 --> 00:36:03,650 and so that implies that the expectation of xy 639 00:36:03,650 --> 00:36:09,560 is equal to a times the expectation of x, 640 00:36:09,560 --> 00:36:16,726 plus b times the expectation of x squared. 641 00:36:16,726 --> 00:36:17,226 OK? 642 00:36:21,540 --> 00:36:26,720 All right, so the first one is actually not giving me much, 643 00:36:26,720 --> 00:36:29,700 so I need to actually work with the two of those guys. 644 00:36:29,700 --> 00:36:31,470 So I'm going to take the first-- 645 00:36:31,470 --> 00:36:33,690 so let me rewrite those two equations that I have. 646 00:36:33,690 --> 00:36:40,830 I have a plus b, e of x is equal to e of y. 647 00:36:40,830 --> 00:36:43,092 And then I have e of xy. 648 00:36:50,970 --> 00:37:01,160 OK, and now what I do is that I multiply this guy. 649 00:37:01,160 --> 00:37:03,230 So I want to cancel one of those things, right? 650 00:37:03,230 --> 00:37:04,455 So what I'm going to-- 651 00:37:12,197 --> 00:37:13,780 so I'm going to take this guy, and I'm 652 00:37:13,780 --> 00:37:19,030 going to multiply it by e of x and take the difference. 653 00:37:19,030 --> 00:37:26,330 So I do times e of x, and then I take the sum of those two, 654 00:37:26,330 --> 00:37:28,840 and then those two terms are going to cancel. 655 00:37:28,840 --> 00:37:33,550 So then that tells me that b times e 656 00:37:33,550 --> 00:37:45,180 of x squared, plus the expectation of xy is equal to-- 657 00:37:45,180 --> 00:37:48,423 so this guy is the one that cancelled. 658 00:37:53,850 --> 00:37:56,570 Then I get this guy here, expectation 659 00:37:56,570 --> 00:38:02,450 of x times the expectation of y, plus the guy that 660 00:38:02,450 --> 00:38:04,070 remains here-- 661 00:38:04,070 --> 00:38:08,752 which is b times the expectation of x squared.
662 00:38:11,920 --> 00:38:16,220 So here I have b expectation of x, the whole thing squared. 663 00:38:16,220 --> 00:38:18,400 And here I have b expectation of x square. 664 00:38:18,400 --> 00:38:22,440 So if I pull this guy here, what do I get? 665 00:38:22,440 --> 00:38:26,140 b times the variance of x, OK? 666 00:38:26,140 --> 00:38:28,180 So I'm going to move here. 667 00:38:28,180 --> 00:38:31,160 And this guy here, when I move this guy here, 668 00:38:31,160 --> 00:38:32,980 I get the expectation of x times y, 669 00:38:32,980 --> 00:38:35,590 minus the expectation of x times the expectation of y. 670 00:38:35,590 --> 00:38:40,540 So this is actually telling me that the covariance of x and y 671 00:38:40,540 --> 00:38:45,450 is equal to b times the variance of x. 672 00:38:45,450 --> 00:38:48,840 And so then that tells me that b is 673 00:38:48,840 --> 00:38:55,519 equal to covariance of xy divided by the variance of x. 674 00:38:55,519 --> 00:38:57,310 And that's why I actually need the variance 675 00:38:57,310 --> 00:39:01,690 of x to be non-zero because I couldn't do that otherwise. 676 00:39:01,690 --> 00:39:03,190 And because if it was, it would mean 677 00:39:03,190 --> 00:39:04,890 that b should be plus infinity, which 678 00:39:04,890 --> 00:39:08,220 is what the limit of this guy is when the variance goes 679 00:39:08,220 --> 00:39:11,200 to zero or negative infinity. 680 00:39:11,200 --> 00:39:14,410 I can not sort them out. 681 00:39:14,410 --> 00:39:16,130 All right, so I'm sorry about the mess, 682 00:39:16,130 --> 00:39:19,070 but that should be more clear. 683 00:39:19,070 --> 00:39:21,410 Then a, of course, you can write it 684 00:39:21,410 --> 00:39:23,240 by plugging in the value of b, so you 685 00:39:23,240 --> 00:39:27,030 know it's only a function of your distribution, right? 
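Numerically, the closed form just derived, b equals the covariance of x and y over the variance of x, and a equals the mean of y minus b times the mean of x, can be checked with a short Python sketch (the function name and data are my own, for illustration), with the expectations replaced by sample averages:

```python
def fit_line(xs, ys):
    # b = Cov(x, y) / Var(x),  a = mean(y) - b * mean(x),
    # with population expectations replaced by sample averages.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mx) ** 2 for x in xs) / n
    if var_x == 0:
        # This is exactly why we need Var(x) != 0 above.
        raise ValueError("variance of x is zero: slope is not identifiable")
    b = cov_xy / var_x
    return my - b * mx, b

# Data lying exactly on the line y = -1 + 3x is recovered exactly.
a, b = fit_line([0.0, 1.0, 2.0], [-1.0, 2.0, 5.0])
print(round(a, 9), round(b, 9))  # -1.0 3.0
```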
686 00:39:27,030 --> 00:39:29,240 So what are the characteristics of the distribution-- 687 00:39:29,240 --> 00:39:31,031 so a distribution can have a bunch of things. 688 00:39:31,031 --> 00:39:34,330 It can have moments of order 4, of order 26. 689 00:39:34,330 --> 00:39:36,590 It can have heavy tails or light tails. 690 00:39:36,590 --> 00:39:39,320 But when you compute least squares, 691 00:39:39,320 --> 00:39:41,900 the only things that matter are the variance 692 00:39:41,900 --> 00:39:45,320 of x, the expectation of the individual ones-- 693 00:39:45,320 --> 00:39:50,300 and really what captures how y changes when you change x, 694 00:39:50,300 --> 00:39:51,590 is captured in the covariance. 695 00:39:51,590 --> 00:39:54,510 The rest is really just normalization. 696 00:39:54,510 --> 00:39:58,550 It's just telling you, I want things to cross the y-axis 697 00:39:58,550 --> 00:39:59,360 at the right place. 698 00:39:59,360 --> 00:40:02,330 I want things to cross the x-axis at the right place. 699 00:40:02,330 --> 00:40:05,720 But the slope is really captured by how much more covariance 700 00:40:05,720 --> 00:40:08,330 you have relative to the variance of x. 701 00:40:08,330 --> 00:40:12,350 So this is essentially setting the scale for the x-axis, 702 00:40:12,350 --> 00:40:15,410 and this is telling you for a unit scale, 703 00:40:15,410 --> 00:40:20,460 this is the unit of y that you're changing. 704 00:40:20,460 --> 00:40:23,600 OK, so we have explicit forms. 705 00:40:23,600 --> 00:40:26,300 And what I could do, if I wanted to estimate those things, 706 00:40:26,300 --> 00:40:32,510 is just say, well again, we have expectations, right?
707 00:40:32,510 --> 00:40:36,050 The expectation of xy minus the product of the expectations, 708 00:40:36,050 --> 00:40:38,510 I could replace expectations by averages 709 00:40:38,510 --> 00:40:40,310 and get an empirical covariance just 710 00:40:40,310 --> 00:40:42,710 like we can replace the expectations for the variance 711 00:40:42,710 --> 00:40:44,720 and get a sample variance. 712 00:40:44,720 --> 00:40:47,300 And this is basically what we're going to be doing. 713 00:40:47,300 --> 00:40:49,470 All right, this is essentially what you want. 714 00:40:49,470 --> 00:40:51,950 The problem is that if you view it that way, 715 00:40:51,950 --> 00:40:54,860 you sort of prevent yourself from being able to solve 716 00:40:54,860 --> 00:40:56,510 the multivariate problem. 717 00:40:56,510 --> 00:40:58,430 Because it's only in the univariate problem 718 00:40:58,430 --> 00:41:00,930 that you have closed form solutions for your problem. 719 00:41:00,930 --> 00:41:03,080 But if you actually go to multivariate, 720 00:41:03,080 --> 00:41:05,510 this is not where you want to replace expectations 721 00:41:05,510 --> 00:41:06,230 by averages. 722 00:41:06,230 --> 00:41:09,120 You actually want to replace expectation by averages here. 723 00:41:12,520 --> 00:41:14,950 And once you do it here, then you 724 00:41:14,950 --> 00:41:17,920 can actually just solve the minimization problem. 725 00:41:23,240 --> 00:41:29,840 OK, so one thing that arises from this guy 726 00:41:29,840 --> 00:41:35,795 is that this is an interesting formula. 727 00:41:40,640 --> 00:41:43,740 All right, think about it. 728 00:41:43,740 --> 00:42:00,190 If I have that y is a plus bx plus some noise. 729 00:42:00,190 --> 00:42:02,680 Things are no longer on the line. 730 00:42:02,680 --> 00:42:08,470 I have that y is equal to a plus bx plus some noise, which 731 00:42:08,470 --> 00:42:11,210 is usually denoted by epsilon. 732 00:42:11,210 --> 00:42:12,910 So that's the distribution, right?
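The distinction just made, plugging averages into the closed-form solution versus plugging averages into the criterion and then minimizing, can be sketched in Python with NumPy (my own illustration on simulated data); in one dimension the two routes agree, but only the second generalizes to several explanatory variables:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 2.0, size=500)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, size=500)

# Route 1: plug averages into the closed form b = Cov(x, y) / Var(x).
b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
a1 = y.mean() - b1 * x.mean()

# Route 2: plug averages into the criterion itself, i.e. minimize
# (1/n) * sum_i (y_i - a - b*x_i)^2 directly (least squares on [1, x]).
X = np.column_stack([np.ones_like(x), x])
a2, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

print(np.allclose([a1, b1], [a2, b2]))  # True
```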
733 00:42:12,910 --> 00:42:15,760 If I tell you the distribution of x, and I 734 00:42:15,760 --> 00:42:17,470 say y is a plus bx plus epsilon-- 735 00:42:17,470 --> 00:42:18,940 I tell you the distribution of y, 736 00:42:18,940 --> 00:42:21,190 and if I assume that those two are independent, 737 00:42:21,190 --> 00:42:23,860 you have a distribution on y. 738 00:42:23,860 --> 00:42:27,364 So what happens is that I can actually always say-- well, you 739 00:42:27,364 --> 00:42:28,780 know, this is equivalent to saying 740 00:42:28,780 --> 00:42:35,560 that epsilon is equal to y minus a plus bx, right? 741 00:42:35,560 --> 00:42:37,540 I can always write this as just-- 742 00:42:37,540 --> 00:42:40,320 I mean, as a tautology. 743 00:42:40,320 --> 00:42:42,069 But here, for those guys-- 744 00:42:42,069 --> 00:42:43,360 this is not for any guy, right. 745 00:42:43,360 --> 00:42:45,770 This is really for the best fit, a 746 00:42:45,770 --> 00:42:50,170 and b, those ones that satisfy this gradient is 747 00:42:50,170 --> 00:42:51,610 equal to zero thing. 748 00:42:51,610 --> 00:42:55,330 Then what we had is that the expectation of epsilon 749 00:42:55,330 --> 00:42:59,380 was equal to expectation of y minus a plus 750 00:42:59,380 --> 00:43:03,430 b expectation of x by linearity of the expectation, which 751 00:43:03,430 --> 00:43:05,560 was equal to zero. 752 00:43:05,560 --> 00:43:10,180 So for this best fit we have zero. 753 00:43:10,180 --> 00:43:13,630 Now, the covariance between x and y-- 754 00:43:17,190 --> 00:43:20,530 Between, sorry, x and epsilon, is what? 755 00:43:20,530 --> 00:43:23,420 Well, it's the covariance between x-- 756 00:43:23,420 --> 00:43:27,540 and well, epsilon was y minus a plus bx.
757 00:43:30,100 --> 00:43:33,240 Now, the covariance is bilinear, so what I have 758 00:43:33,240 --> 00:43:35,640 is that the covariance of this is 759 00:43:35,640 --> 00:43:38,760 the covariance of x and y-- 760 00:43:38,760 --> 00:43:41,790 sorry, of x and y, minus the variance-- well, 761 00:43:41,790 --> 00:43:50,220 minus a plus b, covariance of x and x, 762 00:43:50,220 --> 00:43:54,720 which is the variance of x? 763 00:43:59,050 --> 00:44:03,510 Covariance of xy minus a plus b variance of x. 764 00:44:12,384 --> 00:44:13,300 OK, I didn't write it. 765 00:44:13,300 --> 00:44:16,080 So here I have covariance of xy is 766 00:44:16,080 --> 00:44:17,910 equal to b variance of x, right? 767 00:44:34,070 --> 00:44:35,270 Covariance of xy. 768 00:44:35,270 --> 00:44:38,057 Yeah, that's because they cannot do that with the covariance. 769 00:44:44,030 --> 00:44:46,520 Yeah, I have those averages again. 770 00:44:46,520 --> 00:44:48,320 No, because this is centered, right? 771 00:44:48,320 --> 00:44:51,000 Sorry, this is centered, so this is actually 772 00:44:51,000 --> 00:44:56,760 equal to the expectation of x times y minus a plus bx. 773 00:45:01,527 --> 00:45:03,110 The covariance is equal to the product 774 00:45:03,110 --> 00:45:05,750 just because this term inside is actually centered. 775 00:45:05,750 --> 00:45:09,980 So this is the expectation of x times y 776 00:45:09,980 --> 00:45:20,100 minus a times the expectation of x, minus 777 00:45:20,100 --> 00:45:23,013 b times the expectation of x squared. 778 00:45:32,200 --> 00:45:34,720 Well, actually maybe I should not really go too far. 779 00:45:38,894 --> 00:45:40,560 So this is actually the one that I need. 780 00:45:40,560 --> 00:45:47,300 But if I stop here, this is actually equal to zero, right. 781 00:45:47,300 --> 00:45:49,095 Those are the same equations. 782 00:45:52,065 --> 00:45:53,050 OK? 783 00:45:53,050 --> 00:45:53,550 Yeah?
784 00:45:53,550 --> 00:45:55,516 AUDIENCE: What are we doing right now? 785 00:45:55,516 --> 00:45:57,140 PHILIPPE RIGOLLET: So we're just saying 786 00:45:57,140 --> 00:46:01,070 that if I actually believe that this best fit was the one that 787 00:46:01,070 --> 00:46:02,990 gave me the right parameters, what would 788 00:46:02,990 --> 00:46:05,804 that imply on the noise itself, on this epsilon? 789 00:46:05,804 --> 00:46:07,220 So here we're actually just trying 790 00:46:07,220 --> 00:46:10,070 to find some necessary condition for the noise to hold-- 791 00:46:10,070 --> 00:46:11,030 for the noise. 792 00:46:11,030 --> 00:46:14,540 And so those conditions are, that first, the expectation 793 00:46:14,540 --> 00:46:15,290 is zero. 794 00:46:15,290 --> 00:46:17,090 That's what we've got here. 795 00:46:17,090 --> 00:46:20,480 And then, that the covariance between the noise and x 796 00:46:20,480 --> 00:46:22,900 has to be zero as well. 797 00:46:22,900 --> 00:46:24,770 OK, so those are actually conditions 798 00:46:24,770 --> 00:46:26,360 that the noise must satisfy. 799 00:46:26,360 --> 00:46:29,450 But the noise was just not really defined as noise itself. 800 00:46:29,450 --> 00:46:31,550 We were just saying, OK, if we're 801 00:46:31,550 --> 00:46:35,230 going to put some assumptions on the epsilon, what 802 00:46:35,230 --> 00:46:36,110 do we better have? 803 00:46:36,110 --> 00:46:38,360 So the first one is that it's centered, which is good, 804 00:46:38,360 --> 00:46:41,150 because otherwise, the noise would shift everything. 805 00:46:41,150 --> 00:46:45,620 So now when you look at a linear regression model-- 806 00:46:45,620 --> 00:46:48,590 typically, if you open a book, it doesn't start by saying, 807 00:46:48,590 --> 00:46:50,920 let the noise be the difference between y 808 00:46:50,920 --> 00:46:52,940 and what I actually want y to be. 809 00:46:52,940 --> 00:46:57,210 It says let y be a plus bx plus epsilon. 
810 00:46:57,210 --> 00:47:02,120 So conversely, if we assume that this is the model that we have, 811 00:47:02,120 --> 00:47:04,340 then we're going to have to assume that epsilon-- 812 00:47:04,340 --> 00:47:06,298 we're going to assume that epsilon is centered, 813 00:47:06,298 --> 00:47:10,840 and that the covariance between x and epsilon is zero. 814 00:47:10,840 --> 00:47:13,760 Actually, often, we're going to assume much more. 815 00:47:13,760 --> 00:47:17,600 And one way to ensure that those two things are satisfied 816 00:47:17,600 --> 00:47:19,940 is to assume that x is independent of epsilon, 817 00:47:19,940 --> 00:47:21,290 for example. 818 00:47:21,290 --> 00:47:23,940 If you assume that x is independent of epsilon, 819 00:47:23,940 --> 00:47:28,332 of course the covariance is going to be zero. 820 00:47:28,332 --> 00:47:30,720 Or we might assume that the conditional expectation 821 00:47:30,720 --> 00:47:35,450 of epsilon, given x, is equal to zero, then that implies that. 822 00:47:35,450 --> 00:47:38,710 OK, now the fact that it's centered is one thing. 823 00:47:38,710 --> 00:47:43,500 So if we make this assumption, the only thing it's telling us 824 00:47:43,500 --> 00:47:47,700 is that those ab's that come-- right, we started from there. 825 00:47:47,700 --> 00:47:51,240 y is equal to a plus bx plus some epsilon for some a, 826 00:47:51,240 --> 00:47:51,960 for some b. 827 00:47:51,960 --> 00:47:55,890 What it turns out is that those a's and b's are actually 828 00:47:55,890 --> 00:47:58,680 the ones that you would get by solving this expectation 829 00:47:58,680 --> 00:48:00,690 of square thing. 
830 00:48:00,690 --> 00:48:02,610 All right, so when you asked-- 831 00:48:02,610 --> 00:48:04,530 back when you were following-- 832 00:48:04,530 --> 00:48:07,170 so when you asked, you know, why don't we 833 00:48:07,170 --> 00:48:10,290 take the square, for example, or the power 834 00:48:10,290 --> 00:48:12,210 4, or something like this-- 835 00:48:12,210 --> 00:48:15,990 then here, I'm saying, well, if I have y is equal to a plus bx, 836 00:48:15,990 --> 00:48:19,230 I don't actually need to put too many assumptions on epsilon. 837 00:48:19,230 --> 00:48:22,320 If epsilon is actually satisfying those two things, 838 00:48:22,320 --> 00:48:25,620 expectation is equal to zero and the covariance 839 00:48:25,620 --> 00:48:28,912 with x is equal to zero, then the right a and b 840 00:48:28,912 --> 00:48:30,870 that I'm looking for are actually the ones that 841 00:48:30,870 --> 00:48:32,120 come with the square-- 842 00:48:32,120 --> 00:48:36,750 not with power 4 or power 25. 843 00:48:36,750 --> 00:48:39,300 So those are actually pretty weak assumptions. 844 00:48:39,300 --> 00:48:41,510 If we want to do inference, we're 845 00:48:41,510 --> 00:48:43,350 going to have to assume slightly more. 846 00:48:43,350 --> 00:48:45,690 If we want to use T-distributions at some point, 847 00:48:45,690 --> 00:48:47,520 for example, and we will, we're going 848 00:48:47,520 --> 00:48:50,800 to have to assume that epsilon has a Gaussian distribution. 849 00:48:50,800 --> 00:48:53,700 So if you want to start doing more statistics beyond just 850 00:48:53,700 --> 00:48:56,550 like doing this least squares thing, which is minimizing 851 00:48:56,550 --> 00:48:58,350 the squared criterion, you're actually 852 00:48:58,350 --> 00:48:59,933 going to have to put more assumptions. 853 00:48:59,933 --> 00:49:01,710 But right now, we did not need them. 854 00:49:01,710 --> 00:49:04,210 We only need that epsilon has mean zero and covariance 855 00:49:04,210 --> 00:49:04,998 zero with x.
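These two necessary conditions on the noise, mean zero and covariance zero with x, hold exactly in-sample for the least-squares residuals, which can be verified with a short NumPy sketch (simulated data and noise level are my own illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 2.0, size=1000)
y = 0.5 + 1.5 * x + rng.normal(0.0, 0.3, size=1000)

# Least-squares fit from the closed form derived above.
b = np.cov(x, y, bias=True)[0, 1] / np.var(x)
a = y.mean() - b * x.mean()

# The residuals epsilon_i = y_i - (a + b*x_i) satisfy both conditions
# exactly in-sample (up to floating-point error), by the first-order
# conditions of the minimization -- no Gaussian assumption needed.
eps = y - (a + b * x)
print(abs(eps.mean()) < 1e-10)        # True: sample mean of the noise is zero
print(abs(np.mean(x * eps)) < 1e-10)  # True: sample covariance with x is zero
```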
856 00:49:08,750 --> 00:49:13,040 OK, so that was basically probabilistic, right. 857 00:49:13,040 --> 00:49:14,450 If I were to do probability and I 858 00:49:14,450 --> 00:49:17,090 were trying to model the relationship between two 859 00:49:17,090 --> 00:49:20,330 random variables, x and y, in the form 860 00:49:20,330 --> 00:49:24,320 y is a plus bx plus some noise, this is what would come out. 861 00:49:24,320 --> 00:49:25,640 Everything was expectations. 862 00:49:25,640 --> 00:49:27,290 There was no data involved. 863 00:49:27,290 --> 00:49:33,620 So now let's go to the data problem, which is now, 864 00:49:33,620 --> 00:49:35,540 I do not know what those expectations are. 865 00:49:35,540 --> 00:49:38,240 In particular, I don't know what the covariance of x and y is, 866 00:49:38,240 --> 00:49:40,610 and I don't know what the expectation of x 867 00:49:40,610 --> 00:49:42,950 and the expectation of y are. 868 00:49:42,950 --> 00:49:44,570 So I have data to do that. 869 00:49:44,570 --> 00:49:45,880 So how am I going to do this? 870 00:49:49,244 --> 00:49:50,660 Well, I'm just going to say, well, 871 00:49:50,660 --> 00:49:57,570 if I have x1, y1, ..., xn, yn, and I'm going 872 00:49:57,570 --> 00:49:59,781 to assume that they're i.i.d. 873 00:49:59,781 --> 00:50:01,530 And I'm actually going to assume that they 874 00:50:01,530 --> 00:50:02,820 have some model, right. 875 00:50:02,820 --> 00:50:06,570 So I'm going to assume that I have that a-- 876 00:50:06,570 --> 00:50:09,150 so that Yi follows the same model. 877 00:50:14,620 --> 00:50:17,000 So epsilon i, right. And I will 878 00:50:17,000 --> 00:50:23,610 say that the expectation of epsilon i is zero and covariance of xi, 879 00:50:23,610 --> 00:50:25,630 epsilon i is equal to zero. 880 00:50:25,630 --> 00:50:28,880 So I'm going to put the same model on all the data. 881 00:50:28,880 --> 00:50:31,420 So you can see that a is not ai, and b is not bi. 882 00:50:31,420 --> 00:50:32,380 It's the same.
883 00:50:32,380 --> 00:50:34,090 So as my data increases, I should 884 00:50:34,090 --> 00:50:36,850 be able to recover the correct things-- 885 00:50:36,850 --> 00:50:39,430 as the size of my data increases. 886 00:50:39,430 --> 00:50:43,030 OK, so this is what the statistical problem looks like. 887 00:50:43,030 --> 00:50:45,250 You're given the points. 888 00:50:45,250 --> 00:50:47,350 There is a true line from which these points 889 00:50:47,350 --> 00:50:48,557 were generated, right. 890 00:50:48,557 --> 00:50:49,390 There was this line. 891 00:50:49,390 --> 00:50:54,250 There was a true ab that I used to draw this plot, 892 00:50:54,250 --> 00:50:55,190 and that was the line. 893 00:50:55,190 --> 00:50:59,320 So first I picked an x, say uniformly at random 894 00:50:59,320 --> 00:51:02,110 on this interval, 0 to 2. 895 00:51:02,110 --> 00:51:03,610 I said that was this one. 896 00:51:03,610 --> 00:51:06,800 Then I said well, I want y to be a plus bx, 897 00:51:06,800 --> 00:51:08,500 so it should be here, but then I'm 898 00:51:08,500 --> 00:51:10,840 going to add some noise epsilon to move away 899 00:51:10,840 --> 00:51:13,270 from this line. 900 00:51:13,270 --> 00:51:16,970 And here, actually, we got two points that fall right 901 00:51:16,970 --> 00:51:18,070 on this line. 902 00:51:18,070 --> 00:51:20,170 So there's basically two epsilons 903 00:51:20,170 --> 00:51:22,330 that were small enough that the dots actually 904 00:51:22,330 --> 00:51:24,720 look like they're on the line. 905 00:51:24,720 --> 00:51:27,060 Everybody's clear about what I'm drawing? 906 00:51:27,060 --> 00:51:28,810 So now of course if you're a statistician, 907 00:51:28,810 --> 00:51:29,620 you don't see this. 908 00:51:29,620 --> 00:51:30,810 You only see this. 909 00:51:30,810 --> 00:51:32,610 And you have to recover this guy, 910 00:51:32,610 --> 00:51:34,260 and it's going to look like this. 
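The data-generating picture just described can be simulated in a few lines. This is only an illustrative sketch: the particular values a = 1.0, b = 2.0, the Gaussian noise level 0.5, and the sample size 100 are made up, not values from the lecture.

```python
import numpy as np

# Simulate the picture described above: pick x uniformly at random on
# [0, 2], compute a + b*x, then add noise epsilon to move off the line.
# The values a=1.0, b=2.0, noise sd 0.5, and n=100 are made-up choices.
rng = np.random.default_rng(0)
a, b, n = 1.0, 2.0, 100
x = rng.uniform(0.0, 2.0, size=n)     # x drawn uniformly on [0, 2]
eps = rng.normal(0.0, 0.5, size=n)    # mean-zero noise, independent of x
y = a + b * x + eps                   # the statistician only sees (x, y)
```

A scatterplot of (x, y) then gives exactly the cloud of points around the hidden true line that the lecture is describing.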
911 00:51:34,260 --> 00:51:36,550 You're going to have an estimated line, which 912 00:51:36,550 --> 00:51:37,780 is the red one. 913 00:51:37,780 --> 00:51:42,610 And the blue line, which is the true one, the one that 914 00:51:42,610 --> 00:51:44,230 actually generated the data. 915 00:51:44,230 --> 00:51:46,810 And your question is, well, this line corresponds 916 00:51:46,810 --> 00:51:48,967 to some parameters a hat and b hat, 917 00:51:48,967 --> 00:51:51,550 how could I make sure that those two lines-- how far those two 918 00:51:51,550 --> 00:51:52,060 lines are? 919 00:51:52,060 --> 00:51:53,620 And one way to address this question is 920 00:51:53,620 --> 00:51:57,920 to say how far is a from a hat, and how far is b from b hat? 921 00:51:57,920 --> 00:51:58,785 OK? 922 00:51:58,785 --> 00:52:00,660 Another question, of course, that you may ask 923 00:52:00,660 --> 00:52:04,470 is, how do you find a hat and b hat? 924 00:52:04,470 --> 00:52:07,530 And as you can see, it's basically the same thing. 925 00:52:07,530 --> 00:52:15,210 Remember, what was a-- so b was the covariance between x 926 00:52:15,210 --> 00:52:21,240 and y divided by the variance of x, right? 927 00:52:21,240 --> 00:52:22,410 We can rewrite this. 928 00:52:22,410 --> 00:52:26,430 The expectation of xy minus expectation 929 00:52:26,430 --> 00:52:30,060 of x times the expectation of y, divided 930 00:52:30,060 --> 00:52:35,580 by expectation of x squared minus expectation of x, 931 00:52:35,580 --> 00:52:37,540 the whole thing squared. 932 00:52:37,540 --> 00:52:39,040 OK? 933 00:52:39,040 --> 00:52:42,910 If you look at the expression for b hat, 934 00:52:42,910 --> 00:52:47,670 I basically replaced all the expectations by bars. 935 00:52:47,670 --> 00:52:49,800 So I said, well, this guy I'm going 936 00:52:49,800 --> 00:52:53,480 to estimate by an average. 937 00:52:53,480 --> 00:52:59,970 So that's the xy bar, and it's 1 over n, 938 00:52:59,970 --> 00:53:03,025 sum from i equal 1 to n of Xi times Yi. 939 00:53:05,555 --> 00:53:08,380 x bar, of course, is just the one that we're used to. 940 00:53:12,690 --> 00:53:14,970 And same for y bar. 941 00:53:14,970 --> 00:53:20,580 X squared bar, the one that's here, 942 00:53:20,580 --> 00:53:22,290 is the average of the squares. 943 00:53:22,290 --> 00:53:24,426 And x bar square is the square of the average. 944 00:53:39,510 --> 00:53:44,070 OK, so you just basically replace this guy by x bar, 945 00:53:44,070 --> 00:53:47,820 this guy by y bar, this guy by x square bar, 946 00:53:47,820 --> 00:53:52,350 and this guy by x bar and no square. 947 00:53:52,350 --> 00:53:54,810 OK, so that's basically one way to do it. 948 00:53:54,810 --> 00:53:56,340 Everywhere you see an expectation, 949 00:53:56,340 --> 00:53:58,740 you replace it by an average. 950 00:53:58,740 --> 00:54:02,070 That's the usual statistical hammer. 951 00:54:02,070 --> 00:54:04,720 You can actually be slightly more subtle about this. 952 00:54:09,980 --> 00:54:12,420 And as an exercise, I invite you-- 953 00:54:12,420 --> 00:54:14,940 just to make sure that you know how to do this computation, 954 00:54:14,940 --> 00:54:17,400 it's going to be exactly the same kind of computations 955 00:54:17,400 --> 00:54:18,840 that we've done. 956 00:54:18,840 --> 00:54:20,670 But as an exercise, you can check 957 00:54:20,670 --> 00:54:23,311 that if you actually look at say, well, 958 00:54:23,311 --> 00:54:25,810 what I wanted to minimize here, I had an expectation, right? 959 00:54:32,720 --> 00:54:35,660 And I said, let's minimize this thing. 960 00:54:35,660 --> 00:54:41,800 Well, let's replace this by an average first. 961 00:54:51,630 --> 00:54:54,270 And now minimize. 962 00:54:54,270 --> 00:54:57,100 OK, so if I do this, it turns out 963 00:54:57,100 --> 00:55:00,160 I'm going to actually get the same result. 
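The two routes just described can both be sketched in code: the plug-in recipe (replace every expectation in the formula for a and b by a sample average), and replacing the expectation by an average first and then minimizing. They land on the same answer. The true parameters (1, 2) and the noise level below are made-up values for illustration.

```python
import numpy as np

# Made-up simulated data: y = 1 + 2x + noise.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 2.0, size=500)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, size=500)

# Plug-in estimators: every expectation becomes a sample average.
xy_bar = np.mean(x * y)              # replaces E[XY]
x_bar, y_bar = x.mean(), y.mean()    # replace E[X], E[Y]
x2_bar = np.mean(x ** 2)             # replaces E[X^2]
b_hat = (xy_bar - x_bar * y_bar) / (x2_bar - x_bar ** 2)
a_hat = y_bar - b_hat * x_bar

# Alternative route: replace the expectation by the average in the
# squared-error criterion, then minimize (here by plain gradient
# descent on the average loss).  It converges to the same (a, b).
t = np.zeros(2)                      # t = (a, b), starting at zero
for _ in range(5000):
    r = y - (t[0] + t[1] * x)        # residuals at the current t
    t -= 0.1 * np.array([-2 * r.mean(), -2 * (r * x).mean()])
```

Both `t` and `(a_hat, b_hat)` end up at the same point, which is exactly the "same result" claim made here.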
964 00:55:00,160 --> 00:55:03,940 The minimum of the average is basically-- 965 00:55:03,940 --> 00:55:06,160 when I replace the average by-- sorry, 966 00:55:06,160 --> 00:55:09,040 when I replace the expectation by the average 967 00:55:09,040 --> 00:55:11,817 and then minimize, it's the same thing 968 00:55:11,817 --> 00:55:13,900 as first minimizing and then replacing expectations 969 00:55:13,900 --> 00:55:17,510 by averages in this case. 970 00:55:17,510 --> 00:55:21,764 Again, this is a much more general principle 971 00:55:21,764 --> 00:55:23,180 because if you don't have a closed 972 00:55:23,180 --> 00:55:27,530 form for the minimum like for some, say, likelihood problems, 973 00:55:27,530 --> 00:55:30,579 well, you might not actually have a possibility 974 00:55:30,579 --> 00:55:32,870 to just look at what the formula looks like-- see where 975 00:55:32,870 --> 00:55:35,480 the expectations show up-- and then just plug in the averages 976 00:55:35,480 --> 00:55:36,380 instead. 977 00:55:36,380 --> 00:55:39,170 So this is the one you want to keep in mind. 978 00:55:39,170 --> 00:55:41,000 And again, as an exercise. 979 00:55:47,000 --> 00:55:48,870 OK, so here, you take the expectation 980 00:55:48,870 --> 00:55:52,980 and replace it by averages. 981 00:55:52,980 --> 00:55:57,800 And then that's the same answer, and I encourage 982 00:55:57,800 --> 00:56:00,080 you to solve the exercise. 983 00:56:00,080 --> 00:56:03,770 OK, everybody's clear that this is actually the same expression 984 00:56:03,770 --> 00:56:07,140 for a hat and b hat that we had before for a and b 985 00:56:07,140 --> 00:56:12,460 when we replace the expectations by averages? 986 00:56:12,460 --> 00:56:16,960 Here, by the way, I minimize the sum rather than the average. 987 00:56:16,960 --> 00:56:19,708 It's clear to everyone that this is the same thing, right? 988 00:56:22,680 --> 00:56:23,180 Yep? 
989 00:56:23,180 --> 00:56:27,148 AUDIENCE: [INAUDIBLE] sum replacing it [INAUDIBLE] 990 00:56:27,148 --> 00:56:29,628 minimize the expectation, I'm assuming 991 00:56:29,628 --> 00:56:31,612 it's switched with the derivative 992 00:56:31,612 --> 00:56:33,596 on the expectation [INAUDIBLE]. 993 00:56:37,592 --> 00:56:39,050 PHILIPPE RIGOLLET: So we did switch 994 00:56:39,050 --> 00:56:43,640 the derivative and the expectation before you came, 995 00:56:43,640 --> 00:56:44,140 I think. 996 00:56:47,890 --> 00:56:49,810 All right, so indeed, the picture 997 00:56:49,810 --> 00:56:52,150 was the one that we said, so visually, this 998 00:56:52,150 --> 00:56:53,380 is what we're doing. 999 00:56:53,380 --> 00:56:55,780 We're looking among all the lines. 1000 00:56:55,780 --> 00:56:58,822 For each line, we compute this distance. 1001 00:56:58,822 --> 00:57:00,280 So if I give you another line there 1002 00:57:00,280 --> 00:57:01,759 would be another set of arrows. 1003 00:57:01,759 --> 00:57:02,800 You look at their length. 1004 00:57:02,800 --> 00:57:03,610 You square it. 1005 00:57:03,610 --> 00:57:05,520 And then you sum it all, and you find 1006 00:57:05,520 --> 00:57:08,080 the line that has the minimum sum of squared lengths 1007 00:57:08,080 --> 00:57:09,364 of the arrows. 1008 00:57:09,364 --> 00:57:11,780 All right, and those are the arrows that we're looking at. 1009 00:57:11,780 --> 00:57:14,710 But again, you could actually think of other distances, 1010 00:57:14,710 --> 00:57:17,307 and you would actually get different-- 1011 00:57:17,307 --> 00:57:19,390 you could actually get different solutions, right. 1012 00:57:19,390 --> 00:57:22,644 So there's something called mean absolute deviation, 1013 00:57:22,644 --> 00:57:24,310 which rather than minimizing this thing, 1014 00:57:24,310 --> 00:57:27,490 is actually minimizing the sum from i equal 1 to n 1015 00:57:27,490 --> 00:57:33,970 of the absolute value of Yi minus a plus bXi. 1016 00:57:33,970 --> 00:57:36,160 And that's not something for which 1017 00:57:36,160 --> 00:57:39,190 you're going to have a closed form, as you can imagine. 1018 00:57:39,190 --> 00:57:42,010 You might have something that's sort of implicit, 1019 00:57:42,010 --> 00:57:44,647 but you can actually still solve it numerically. 1020 00:57:44,647 --> 00:57:46,230 And this is something that people also 1021 00:57:46,230 --> 00:57:50,478 like to use but way, way less than the least squares one. 1022 00:57:50,478 --> 00:57:52,174 AUDIENCE: [INAUDIBLE] 1023 00:57:52,174 --> 00:57:53,840 PHILIPPE RIGOLLET: What did I just write? 1024 00:57:53,840 --> 00:57:56,600 AUDIENCE: [INAUDIBLE] 1025 00:57:56,600 --> 00:58:02,230 The sum of the absolute values of Yi minus a plus bXi. 1026 00:58:02,230 --> 00:58:04,432 So it's the same except I don't square here. 1027 00:58:07,820 --> 00:58:08,320 OK? 1028 00:58:11,250 --> 00:58:18,330 So arguably, you know, predicting a demand 1029 00:58:18,330 --> 00:58:21,780 based on price is a fairly naive problem. 1030 00:58:21,780 --> 00:58:23,787 Typically, what we have is a bunch of data 1031 00:58:23,787 --> 00:58:25,620 that we've collected, and we're hoping that, 1032 00:58:25,620 --> 00:58:29,460 together, they can help us do a better prediction. 1033 00:58:29,460 --> 00:58:31,890 All right, so maybe I don't have only the price, 1034 00:58:31,890 --> 00:58:35,670 but maybe I have a bunch of other social indicators. 1035 00:58:35,670 --> 00:58:40,484 Maybe I know the competition, the price of the competition. 1036 00:58:40,484 --> 00:58:42,150 And maybe I know a bunch of other things 1037 00:58:42,150 --> 00:58:43,980 that are actually relevant. 1038 00:58:43,980 --> 00:58:48,030 And so I'm trying to find a way to combine a bunch of points, 1039 00:58:48,030 --> 00:58:50,880 a bunch of measures. 
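The mean absolute deviation criterion mentioned a moment ago has no closed form, but, as stated, it can still be solved numerically. Here is a minimal sketch on made-up data, using a plain subgradient method that starts from the least squares fit and keeps the best point seen so far; this is one simple choice among many possible numerical methods, not the standard algorithm for this problem.

```python
import numpy as np

# Made-up simulated data: y = 1 + 2x + noise.
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 2.0, size=200)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, size=200)

def mad(t):
    # Mean absolute deviation criterion: sum_i |y_i - (a + b x_i)|.
    return float(np.sum(np.abs(y - (t[0] + t[1] * x))))

# Least squares fit, used only as a starting point.
b_ls = (np.mean(x * y) - x.mean() * y.mean()) / (np.mean(x**2) - x.mean()**2)
a_ls = y.mean() - b_ls * x.mean()

best = np.array([a_ls, b_ls])
best_val = mad(best)
t = best.copy()
for k in range(1, 20001):
    s = np.sign(y - (t[0] + t[1] * x))       # signs of the residuals
    g = -np.array([s.sum(), (s * x).sum()])  # a subgradient of mad at t
    t = t - 0.001 / np.sqrt(k) * g           # diminishing step sizes
    v = mad(t)
    if v < best_val:                         # keep the best iterate seen
        best, best_val = t.copy(), v
```

By construction `best` has a mean absolute deviation at least as small as the least squares fit's, and on data like this it lands close to the same line.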
1040 00:58:50,880 --> 00:58:52,540 There's a nice example that I like, 1041 00:58:52,540 --> 00:58:56,370 which is people were trying to measure something 1042 00:58:56,370 --> 00:59:00,750 related to your body mass index, so basically 1043 00:59:00,750 --> 00:59:04,820 the volume of your-- the density of your body. 1044 00:59:04,820 --> 00:59:07,380 And the way you can do this is by just, really, 1045 00:59:07,380 --> 00:59:10,170 weighing someone and also putting them 1046 00:59:10,170 --> 00:59:13,920 in some cubic meter of water and see how much overflows. 1047 00:59:13,920 --> 00:59:15,750 And then you have both the volume 1048 00:59:15,750 --> 00:59:20,850 and the mass of this person, and you 1049 00:59:20,850 --> 00:59:23,370 can start computing density. 1050 00:59:23,370 --> 00:59:25,860 But as you can imagine, you know, 1051 00:59:25,860 --> 00:59:27,684 I would not personally like to go to a gym 1052 00:59:27,684 --> 00:59:29,600 when the first thing they ask me is to just go 1053 00:59:29,600 --> 00:59:33,240 in a bucket of water, and so people 1054 00:59:33,240 --> 00:59:36,840 try to find ways to measure this based on other indicators that 1055 00:59:36,840 --> 00:59:38,110 are much easier to measure. 1056 00:59:38,110 --> 00:59:41,040 For example, I don't know, the length of my forearm, 1057 00:59:41,040 --> 00:59:45,090 and the circumference of my head, and maybe my belly 1058 00:59:45,090 --> 00:59:46,870 would probably be more appropriate here. 1059 00:59:46,870 --> 00:59:48,870 And so you know, they just try to find something 1060 00:59:48,870 --> 00:59:50,340 that actually makes sense. 1061 00:59:50,340 --> 00:59:52,094 And so there's actually a nice example 1062 00:59:52,094 --> 00:59:53,760 where you can show that if you measure-- 1063 00:59:53,760 --> 00:59:55,050 I think one of the most significant 1064 00:59:55,050 --> 00:59:56,860 was with the circumference of your wrist. 
1065 00:59:56,860 --> 01:00:02,070 This is actually a very good indicator of your body density. 1066 01:00:02,070 --> 01:00:06,780 And it turns out that if you stuff all the bunch of things 1067 01:00:06,780 --> 01:00:09,240 together, you might actually get a very good formula that 1068 01:00:09,240 --> 01:00:10,840 explains things. 1069 01:00:10,840 --> 01:00:12,390 All right, so what we're going to do 1070 01:00:12,390 --> 01:00:14,406 is rather than saying we have only one x 1071 01:00:14,406 --> 01:00:15,780 to explain y's, let's say we have 1072 01:00:15,780 --> 01:00:19,510 20 x's that we're trying to combine to explain y. 1073 01:00:19,510 --> 01:00:22,410 And again, just like assuming something of the form, 1074 01:00:22,410 --> 01:00:26,107 y is a plus b times x was the simplest thing we could do, 1075 01:00:26,107 --> 01:00:28,440 here we're just going to assume that we have y is a plus 1076 01:00:28,440 --> 01:00:31,650 b1, x1 plus b2, x2, plus b3, x3. 1077 01:00:31,650 --> 01:00:33,690 And we can write it in a vector form 1078 01:00:33,690 --> 01:00:39,210 by writing that Yi is Xi transposed b, which 1079 01:00:39,210 --> 01:00:42,770 is now a vector plus epsilon i. 1080 01:00:42,770 --> 01:00:44,520 OK, and here, on the board, I'm going 1081 01:00:44,520 --> 01:00:46,980 to have a hard time doing boldface, 1082 01:00:46,980 --> 01:00:52,360 but all these things are vectors except for y, 1083 01:00:52,360 --> 01:00:53,520 which is a number. 1084 01:00:53,520 --> 01:00:54,450 Yi is a number. 1085 01:00:54,450 --> 01:00:57,780 It's always the value of my y-axis. 1086 01:00:57,780 --> 01:00:59,930 So even if my x-axis lives on-- 1087 01:00:59,930 --> 01:01:04,350 this is x1, and this is x2, y is really just the real valued 1088 01:01:04,350 --> 01:01:05,249 function. 1089 01:01:05,249 --> 01:01:07,290 And so I'm going to get a bunch of points, x1,y1, 1090 01:01:07,290 --> 01:01:10,380 and I'm going to see how much they respond. 
1091 01:01:10,380 --> 01:01:13,560 So for example, my body density is y, 1092 01:01:13,560 --> 01:01:16,562 and then all the x's are a bunch of other things. 1093 01:01:16,562 --> 01:01:17,270 Agreed with that? 1094 01:01:17,270 --> 01:01:20,870 So this is an equation that holds on the real line, 1095 01:01:20,870 --> 01:01:27,390 but this guy here is in R^p, and this guy's in R^p. 1096 01:01:30,080 --> 01:01:33,550 It's actually common to call b beta 1097 01:01:33,550 --> 01:01:38,650 when it's a vector, and that's the usual linear regression 1098 01:01:38,650 --> 01:01:39,370 notation. 1099 01:01:39,370 --> 01:01:42,470 Y is x beta plus epsilon. 1100 01:01:42,470 --> 01:01:45,780 So x's are called explanatory variables. 1101 01:01:45,780 --> 01:01:50,600 y is called the explained variable, or dependent variable, 1102 01:01:50,600 --> 01:01:52,000 or response variable. 1103 01:01:52,000 --> 01:01:53,050 It has a bunch of names. 1104 01:01:53,050 --> 01:01:55,877 You can use whatever you feel more comfortable with. 1105 01:01:55,877 --> 01:01:57,460 It should actually be explicit, right, 1106 01:01:57,460 --> 01:01:58,668 so that's all you care about. 1107 01:02:01,100 --> 01:02:05,840 Now, what we typically do is that rather-- so you 1108 01:02:05,840 --> 01:02:07,840 notice here, that there's actually no intercept. 1109 01:02:07,840 --> 01:02:10,840 If I actually fold that back down to one dimension, 1110 01:02:10,840 --> 01:02:13,210 there's actually a is equal to zero, right? 1111 01:02:13,210 --> 01:02:18,350 If I go back to p is equal to 1, that 1112 01:02:18,350 --> 01:02:22,430 would imply that Yi is, well, say, beta times 1113 01:02:22,430 --> 01:02:24,979 Xi plus epsilon i. 1114 01:02:24,979 --> 01:02:27,020 And that's not good, I want to have an intercept. 
1115 01:02:27,020 --> 01:02:29,480 And the way I do this, rather than writing 1116 01:02:29,480 --> 01:02:31,910 a plus this, and you know, just have 1117 01:02:31,910 --> 01:02:35,420 like an overload of notation, what I am actually doing 1118 01:02:35,420 --> 01:02:37,670 is that I fold back. 1119 01:02:37,670 --> 01:02:40,750 I fold my intercept back into my x. 1120 01:02:43,460 --> 01:02:46,190 And so if I measure 20 variables, 1121 01:02:46,190 --> 01:02:48,080 I'm going to create a 21st variable, which 1122 01:02:48,080 --> 01:02:49,700 is always equal to 1. 1123 01:02:49,700 --> 01:02:52,650 OK, so you should think of x as being 1, 1124 01:02:52,650 --> 01:02:58,120 and then x1 up to xp. 1125 01:02:58,120 --> 01:03:00,790 And sorry, xp minus 1, I guess. 1126 01:03:00,790 --> 01:03:02,293 OK, and now this is in R^p. 1127 01:03:05,590 --> 01:03:07,900 I'm always going to assume that the first one is 1. 1128 01:03:07,900 --> 01:03:09,250 I can always do that. 1129 01:03:09,250 --> 01:03:11,320 If I have a table of data-- 1130 01:03:11,320 --> 01:03:15,940 if my data is given to me in an Excel spreadsheet-- 1131 01:03:15,940 --> 01:03:19,990 and here I have the density that I measured on my data, 1132 01:03:19,990 --> 01:03:22,940 and then maybe here I have the height, 1133 01:03:22,940 --> 01:03:25,544 and here I have the wrist circumference. 1134 01:03:25,544 --> 01:03:26,710 And I have all these things. 1135 01:03:26,710 --> 01:03:31,100 All I have to do is to create another column here of ones, 1136 01:03:31,100 --> 01:03:34,180 and I just put 1-1-1-1-1. 1137 01:03:34,180 --> 01:03:37,090 OK, that's all I have to do to create this guy. 1138 01:03:37,090 --> 01:03:39,190 Agreed? 1139 01:03:39,190 --> 01:03:43,940 And now my x is going to be just one of those rows. 1140 01:03:43,940 --> 01:03:46,190 So this is Xi, this entire row. 1141 01:03:46,190 --> 01:03:47,622 And this entry here is Yi. 
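The spreadsheet picture just described (add a column of ones so the intercept rides along inside beta) can be sketched as follows. The height and wrist-circumference numbers are invented for illustration, not data from the lecture.

```python
import numpy as np

# Fold the intercept into x: prepend a column of ones, so the first
# coordinate of beta plays the role of the intercept a.  The two
# measurement columns (say, height and wrist circumference) are
# invented numbers, purely for illustration.
measurements = np.array([[1.82, 17.1],
                         [1.75, 16.4],
                         [1.91, 18.0]])
ones = np.ones((measurements.shape[0], 1))   # the 1-1-1-1-1 column
X = np.hstack([ones, measurements])
# Each row of X is now one observation x_i = (1, x_i^1, ..., x_i^{p-1}).
```

Each row of `X` is the vector Xi from the board, and the corresponding entry of the response column is Yi.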
1142 01:03:54,430 --> 01:03:56,920 So now, for my noise coefficients, 1143 01:03:56,920 --> 01:03:59,300 I'm still going to ask for the same thing 1144 01:03:59,300 --> 01:04:04,090 except that here, the covariance is not between x-- 1145 01:04:04,090 --> 01:04:07,210 between one random variable and another random variable. 1146 01:04:07,210 --> 01:04:10,930 It's between a random vector and a random variable. 1147 01:04:10,930 --> 01:04:13,130 OK, how do I measure the covariance between a vector 1148 01:04:13,130 --> 01:04:14,594 and a random variable? 1149 01:04:23,866 --> 01:04:25,840 AUDIENCE: [INAUDIBLE] 1150 01:04:25,840 --> 01:04:29,002 PHILIPPE RIGOLLET: Yeah, so basically-- 1151 01:04:29,002 --> 01:04:31,380 AUDIENCE: [INAUDIBLE] 1152 01:04:31,380 --> 01:04:33,630 PHILIPPE RIGOLLET: Yeah, I mean, the covariance vector 1153 01:04:33,630 --> 01:04:36,171 is equal to 0 is the same thing as [INAUDIBLE] equal to zero, 1154 01:04:36,171 --> 01:04:39,270 but yeah, this is basically thought of entry-wise. 1155 01:04:39,270 --> 01:04:41,820 For each coordinate of x, I want that the covariance 1156 01:04:41,820 --> 01:04:47,430 between epsilon and this coordinate of x is equal to 0. 1157 01:04:47,430 --> 01:04:50,370 So I'm just asking this for all coordinates. 1158 01:04:50,370 --> 01:04:52,020 Again, in most instances, we're going 1159 01:04:52,020 --> 01:04:53,520 to think that epsilon is independent 1160 01:04:53,520 --> 01:04:56,310 of x, and that's something we can understand without thinking 1161 01:04:56,310 --> 01:04:59,022 about coordinates. 1162 01:04:59,022 --> 01:05:00,471 Yep? 1163 01:05:00,471 --> 01:05:03,852 AUDIENCE: [INAUDIBLE] like what if beta equals alpha 1164 01:05:03,852 --> 01:05:04,818 [INAUDIBLE]? 1165 01:05:06,774 --> 01:05:09,190 PHILIPPE RIGOLLET: I'm sorry, can you repeat the question? 1166 01:05:09,190 --> 01:05:09,773 I didn't hear. 1167 01:05:09,773 --> 01:05:12,140 AUDIENCE: Is this the parameter of beta, a parameter? 
1168 01:05:12,140 --> 01:05:13,100 PHILIPPE RIGOLLET: Yeah, beta is the parameter 1169 01:05:13,100 --> 01:05:14,141 we're looking for, right. 1170 01:05:14,141 --> 01:05:18,485 Just like it was the pair ab has become the whole vector of beta 1171 01:05:18,485 --> 01:05:19,394 now. 1172 01:05:19,394 --> 01:05:20,810 AUDIENCE: And what's [INAUDIBLE]?? 1173 01:05:22,720 --> 01:05:25,219 PHILIPPE RIGOLLET: Well, can you think of an intercept 1174 01:05:25,219 --> 01:05:26,260 of a function that take-- 1175 01:05:26,260 --> 01:05:28,630 I mean, there is one actually. 1176 01:05:28,630 --> 01:05:30,370 There's the one for which betas-- 1177 01:05:30,370 --> 01:05:31,840 all the betas that don't correspond 1178 01:05:31,840 --> 01:05:35,200 to the vector of all ones, so the intercept 1179 01:05:35,200 --> 01:05:38,469 is really the weight that I put on this guy. 1180 01:05:38,469 --> 01:05:40,510 That's the beta that's going to come to this guy, 1181 01:05:40,510 --> 01:05:44,310 but we don't really talk about intercept. 1182 01:05:44,310 --> 01:05:49,210 So if x lives in two dimensions, the way 1183 01:05:49,210 --> 01:05:50,950 you want to think about this is you 1184 01:05:50,950 --> 01:05:54,420 take a sheet of paper like that, so now I 1185 01:05:54,420 --> 01:05:57,080 have points that live in three dimensions. 1186 01:05:57,080 --> 01:05:59,320 So let's say one direction here is x1. 1187 01:05:59,320 --> 01:06:02,710 This direction is x2, and this direction is y. 1188 01:06:02,710 --> 01:06:04,960 And so what's going to happen is that I'm 1189 01:06:04,960 --> 01:06:07,120 going to have my points that live in this three 1190 01:06:07,120 --> 01:06:08,710 dimensional space. 1191 01:06:08,710 --> 01:06:10,180 And what I'm trying to do when I'm 1192 01:06:10,180 --> 01:06:12,580 trying to do a linear model for those guys-- 1193 01:06:12,580 --> 01:06:13,990 when I assume a linear model. 
1194 01:06:13,990 --> 01:06:17,380 What I assume is that there's a plane in those three 1195 01:06:17,380 --> 01:06:17,950 dimensions. 1196 01:06:17,950 --> 01:06:20,170 So think of this guy as going everywhere, 1197 01:06:20,170 --> 01:06:23,920 and there's a plane close to which all my points should be. 1198 01:06:23,920 --> 01:06:26,320 That's what's happening in two dimensions. 1199 01:06:26,320 --> 01:06:29,930 If you see higher dimensions then congratulations to you, 1200 01:06:29,930 --> 01:06:30,975 but I can't. 1201 01:06:33,530 --> 01:06:36,470 But you know, you can definitely formalize that fairly easily 1202 01:06:36,470 --> 01:06:38,405 mathematically and just talk about vectors. 1203 01:06:40,940 --> 01:06:44,200 So now here, if I talk about the least squares error estimator, 1204 01:06:44,200 --> 01:06:47,470 or just the least squares estimator of beta, 1205 01:06:47,470 --> 01:06:49,990 it's simply the same thing as before. 1206 01:06:49,990 --> 01:06:52,460 Just like we said-- 1207 01:06:52,460 --> 01:06:56,750 so remember, you should think of beta 1208 01:06:56,750 --> 01:06:59,930 as being the pair a b generalized. 1209 01:06:59,930 --> 01:07:05,060 So we said, oh, we wanted to minimize the expectation of y 1210 01:07:05,060 --> 01:07:13,640 minus a plus bx squared, right? 1211 01:07:13,640 --> 01:07:16,910 Now, so that's for p is equal to 1. 1212 01:07:16,910 --> 01:07:19,510 Now for p greater than or equal to 2, 1213 01:07:19,510 --> 01:07:28,760 we're just going to write it as y minus x transpose beta 1214 01:07:28,760 --> 01:07:29,260 squared. 1215 01:07:34,210 --> 01:07:37,900 OK, so I'm just trying to minimize this quantity. 1216 01:07:37,900 --> 01:07:40,857 Of course, I don't have access to this, 1217 01:07:40,857 --> 01:07:42,940 so what I'm going to do is I'm going to replace 1218 01:07:42,940 --> 01:07:44,881 my expectation by an average. 
1219 01:07:51,010 --> 01:07:54,890 So here I'm using the notation t because beta is the true one, 1220 01:07:54,890 --> 01:07:56,960 and I don't want you to just-- 1221 01:07:56,960 --> 01:07:59,960 so here, I have a variable t that's just moving around. 1222 01:07:59,960 --> 01:08:02,390 And so now I'm going to take the square of this thing. 1223 01:08:02,390 --> 01:08:08,450 And when I minimize this over all t in R^p, the argmin, 1224 01:08:08,450 --> 01:08:19,584 the minimum is attained at beta hat, which is my estimator. 1225 01:08:19,584 --> 01:08:20,084 OK? 1226 01:08:25,359 --> 01:08:29,337 So if I want to actually compute-- 1227 01:08:29,337 --> 01:08:29,837 yeah? 1228 01:08:29,837 --> 01:08:31,420 AUDIENCE: I'm sorry, on the last slide 1229 01:08:31,420 --> 01:08:36,422 did we require the expectation of [INAUDIBLE] to be zero? 1230 01:08:36,422 --> 01:08:38,380 PHILIPPE RIGOLLET: You mean the previous slide? 1231 01:08:38,380 --> 01:08:38,963 AUDIENCE: Yes. 1232 01:08:38,963 --> 01:08:40,262 [INAUDIBLE] 1233 01:08:40,262 --> 01:08:42,720 PHILIPPE RIGOLLET: So again, I'm just defining an estimator 1234 01:08:42,720 --> 01:08:45,053 just like I would tell you, just take the estimator that 1235 01:08:45,053 --> 01:08:46,539 has coordinates 4 everywhere. 1236 01:08:46,539 --> 01:08:48,984 AUDIENCE: So I'm saying, like, on that slide, we say 1237 01:08:48,984 --> 01:08:51,918 the noise terms we want to satisfy the covariance condition. 1238 01:08:51,918 --> 01:08:55,830 Do we also want them to satisfy expectation of each 1239 01:08:55,830 --> 01:08:56,808 noise term zero? 1240 01:09:07,827 --> 01:09:09,660 PHILIPPE RIGOLLET: And so the answer is yes. 1241 01:09:09,660 --> 01:09:13,050 I was just trying to think if this was captured. 
1242 01:09:13,050 --> 01:09:15,180 So it is not captured in this guy 1243 01:09:15,180 --> 01:09:17,700 because this is just telling me that the expectation 1244 01:09:17,700 --> 01:09:23,750 of epsilon i Xi minus the expectation of epsilon i times the expectation of Xi is equal to zero. 1245 01:09:23,750 --> 01:09:27,380 OK, so yes I need to have that epsilon has mean zero-- 1246 01:09:27,380 --> 01:09:29,130 let's assume that expectation of epsilon 1247 01:09:29,130 --> 01:09:31,545 is zero for this problem. 1248 01:09:43,640 --> 01:09:45,374 And we're going to need something, 1249 01:09:45,374 --> 01:09:47,540 some sort of assumption about the variance being 1250 01:09:47,540 --> 01:09:51,060 not equal to zero, right, but this is going to come up later. 1251 01:09:51,060 --> 01:09:54,710 So let's think for one second about doing the same approach 1252 01:09:54,710 --> 01:09:55,490 as we did before. 1253 01:09:55,490 --> 01:09:57,320 Take the partial derivative with respect 1254 01:09:57,320 --> 01:09:59,279 to the first coordinate of t, with respect 1255 01:09:59,279 --> 01:10:01,070 to the second coordinate of t, with respect 1256 01:10:01,070 --> 01:10:03,320 to the third coordinate of t, et cetera. 1257 01:10:03,320 --> 01:10:04,610 So that's what we did before. 1258 01:10:04,610 --> 01:10:07,460 We had two equations, and we reconciled them 1259 01:10:07,460 --> 01:10:10,190 because it was fairly easy to solve, right? 1260 01:10:10,190 --> 01:10:11,826 But in general, what's going to happen 1261 01:10:11,826 --> 01:10:13,700 is we're going to have a system of equations. 1262 01:10:13,700 --> 01:10:17,150 We're going to have a system of p equations, one for each 1263 01:10:17,150 --> 01:10:19,340 of the coordinates of t. 1264 01:10:19,340 --> 01:10:23,960 And we're going to have p unknowns, each coordinate of t. 1265 01:10:23,960 --> 01:10:26,559 And so we're going to have this system to solve-- 1266 01:10:26,559 --> 01:10:28,850 actually, it turns out it's going to be a linear system. 
1267 01:10:28,850 --> 01:10:29,960 But it's not going to be something 1268 01:10:29,960 --> 01:10:32,543 that we're going to be able to solve coordinate by coordinate. 1269 01:10:32,543 --> 01:10:34,020 It's going to be annoying to solve. 1270 01:10:34,020 --> 01:10:36,820 You know, you can guess what's going to happen, right. 1271 01:10:36,820 --> 01:10:40,700 Here, it involved the covariance between x and epsilon, right. 1272 01:10:40,700 --> 01:10:43,910 That's what it involved to understand-- 1273 01:10:43,910 --> 01:10:47,540 sorry, the correlation between x and y 1274 01:10:47,540 --> 01:10:50,660 to understand what the solution of this problem was. 1275 01:10:50,660 --> 01:10:52,070 In this case, there's going to be 1276 01:10:52,070 --> 01:10:57,930 not only the covariance between x1 and y, x2 and y, x3, et 1277 01:10:57,930 --> 01:10:59,510 cetera, all the way to xp and y. 1278 01:10:59,510 --> 01:11:02,960 There's also going to be all the cross covariances between xj 1279 01:11:02,960 --> 01:11:04,077 and xk. 1280 01:11:04,077 --> 01:11:05,660 And so this is going to be a nightmare 1281 01:11:05,660 --> 01:11:08,210 to solve, like, in this system. 1282 01:11:08,210 --> 01:11:12,100 And what we do is that we go on to using a matrix notation, 1283 01:11:12,100 --> 01:11:14,600 so that when we take derivatives, 1284 01:11:14,600 --> 01:11:16,340 we talk about gradients, and then we 1285 01:11:16,340 --> 01:11:20,390 can invert matrices and solve linear systems in a somewhat 1286 01:11:20,390 --> 01:11:23,330 formal manner by just saying that, if I want to solve 1287 01:11:23,330 --> 01:11:27,230 the system ax equals b-- 1288 01:11:27,230 --> 01:11:28,760 rather than actually solving this 1289 01:11:28,760 --> 01:11:30,440 for each coordinate of x individually, 1290 01:11:30,440 --> 01:11:33,770 I just say that x is equal to a inverse times b. 
1291 01:11:33,770 --> 01:11:37,490 So that's really why we're going to the equation one, 1292 01:11:37,490 --> 01:11:40,730 because we have a formalism to write that x 1293 01:11:40,730 --> 01:11:42,260 is the solution of the system. 1294 01:11:42,260 --> 01:11:43,843 I'm not telling you that this is going 1295 01:11:43,843 --> 01:11:48,110 to be easy to solve numerically, but at least I can write it. 1296 01:11:48,110 --> 01:11:51,307 And so here's how it goes. 1297 01:11:51,307 --> 01:11:52,390 I have a bunch of vectors. 1298 01:11:55,540 --> 01:11:56,790 So what are my vectors, right? 1299 01:11:56,790 --> 01:11:57,875 So I have x1-- 1300 01:11:57,875 --> 01:11:59,250 oh, by the way, I didn't actually 1301 01:11:59,250 --> 01:12:01,320 mention that when I put the subscript, 1302 01:12:01,320 --> 01:12:03,660 I'm talking about the observation. 1303 01:12:03,660 --> 01:12:05,118 And when I put the superscript, I'm 1304 01:12:05,118 --> 01:12:07,110 talking about the coordinates, right? 1305 01:12:07,110 --> 01:12:13,290 So I have x1, which is equal to 1, x1 1, up to 1306 01:12:13,290 --> 01:12:19,965 x1 p; x2, which is 1, 1307 01:12:19,965 --> 01:12:32,380 x2 1, up to x2 p; all the way to xn, which is 1, xn 1, up to xn p. 1308 01:12:32,380 --> 01:12:35,210 All right, so those are n observed x's, and then I 1309 01:12:35,210 --> 01:12:40,870 have y1, y2, up to yn, that come paired with those guys. 1310 01:12:40,870 --> 01:12:42,510 OK? 1311 01:12:42,510 --> 01:12:44,640 So the first thing is that I'm going 1312 01:12:44,640 --> 01:12:46,290 to stack those guys into some vector 1313 01:12:46,290 --> 01:12:47,520 that I'm going to call y. 1314 01:12:47,520 --> 01:12:49,710 So maybe I should put an arrow for the purpose 1315 01:12:49,710 --> 01:12:53,310 of the blackboard, and it's just y1 to yn. 1316 01:12:53,310 --> 01:12:56,720 OK, so this is a vector in R^n. 
1317 01:12:56,720 --> 01:12:59,150 Now, if I want to stack those guys together, 1318 01:12:59,150 --> 01:13:03,449 I can either create a long vector of size n times p, 1319 01:13:03,449 --> 01:13:05,990 but the problem is that I lose track of who's a coordinate 1320 01:13:05,990 --> 01:13:08,815 and who's an observation. 1321 01:13:08,815 --> 01:13:10,190 And so it's actually nicer for me 1322 01:13:10,190 --> 01:13:12,840 to just put those guys next to each other 1323 01:13:12,840 --> 01:13:15,320 and create one new variable. 1324 01:13:15,320 --> 01:13:18,020 And so the way I'm going to do this is-- rather than actually 1325 01:13:18,020 --> 01:13:22,220 stacking those guys like that, I'm taking their transposes 1326 01:13:22,220 --> 01:13:24,530 and stacking them as the rows of a matrix. 1327 01:13:24,530 --> 01:13:26,870 OK, so I'm going to create a matrix, which 1328 01:13:26,870 --> 01:13:28,700 here is denoted typically by-- 1329 01:13:28,700 --> 01:13:31,295 I'm going to write x double bar. 1330 01:13:31,295 --> 01:13:33,420 And here, I'm going to actually just-- so since I'm 1331 01:13:33,420 --> 01:13:35,940 taking those guys like this, the first column 1332 01:13:35,940 --> 01:13:37,010 is going to be only ones. 1333 01:13:40,510 --> 01:13:41,950 And then I'm going to have-- 1334 01:13:41,950 --> 01:13:47,130 well, x1^1, ..., x1^p in the first row. 1335 01:13:47,130 --> 01:13:52,890 And in the last row, I'm going to have xn^1, ..., xn^p. 1336 01:13:52,890 --> 01:13:57,690 OK, so here the number of rows is n, and the number of columns 1337 01:13:57,690 --> 01:13:58,800 is p. 1338 01:13:58,800 --> 01:14:02,352 One row per observation, one column per coordinate. 1339 01:14:05,010 --> 01:14:10,710 And again, I make your life miserable because this really 1340 01:14:10,710 --> 01:14:13,380 should be p minus 1 because I already used 1341 01:14:13,380 --> 01:14:15,850 the first one for this guy. 1342 01:14:15,850 --> 01:14:16,820 I'm sorry about that. 
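[Editor's note: the bookkeeping described on the board, one row per observation, a first column of ones for the intercept, can be sketched in NumPy with hypothetical sizes; none of this code is from the lecture.]

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 3   # hypothetical: n observations, p total columns including the intercept

# Raw explanatory variables: one row per observation, one column per coordinate.
X_raw = rng.normal(size=(n, p - 1))

# Design matrix: stack each x_i^T as a row, prepending a 1 for the intercept,
# so the first column is all ones.
X = np.hstack([np.ones((n, 1)), X_raw])
```

This matches the convention in the lecture: with the intercept column occupying the first slot, the remaining coordinates run only up to p minus 1.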
1343 01:14:16,820 --> 01:14:18,400 It's a bit painful. 1344 01:14:18,400 --> 01:14:20,490 So usually we don't even write what's in there. 1345 01:14:20,490 --> 01:14:21,948 So we don't have to think about it. 1346 01:14:21,948 --> 01:14:23,970 Those are just vectors of size p. 1347 01:14:23,970 --> 01:14:25,380 OK? 1348 01:14:25,380 --> 01:14:27,740 So now that I created this thing, 1349 01:14:27,740 --> 01:14:31,340 I can actually just basically stack up all my models. 1350 01:14:31,340 --> 01:14:39,270 So Yi equals Xi transpose beta plus epsilon i for all i 1351 01:14:39,270 --> 01:14:41,430 equal 1 to n. 1352 01:14:41,430 --> 01:14:44,010 This transforms into-- this is equivalent to saying 1353 01:14:44,010 --> 01:14:47,610 that the vector y is equal to the matrix x 1354 01:14:47,610 --> 01:14:51,150 times beta plus a vector epsilon, 1355 01:14:51,150 --> 01:14:57,940 where epsilon is just epsilon 1 to epsilon n, right. 1356 01:14:57,940 --> 01:14:59,830 So I have just this system, which 1357 01:14:59,830 --> 01:15:02,000 I write in matrix form, which really just consists 1358 01:15:02,000 --> 01:15:04,900 of stacking up all these equations on top of each other. 1359 01:15:10,195 --> 01:15:12,820 So now that I have this model-- this is the usual least squares 1360 01:15:12,820 --> 01:15:13,330 model. 1361 01:15:13,330 --> 01:15:16,150 And here, when I want to write my least squares criterion 1362 01:15:16,150 --> 01:15:17,500 in terms of matrices, right? 1363 01:15:17,500 --> 01:15:19,041 My least squares criterion, remember, 1364 01:15:19,041 --> 01:15:27,010 was sum from i equal 1 to n of Yi minus Xi transpose beta 1365 01:15:27,010 --> 01:15:28,210 squared. 1366 01:15:28,210 --> 01:15:31,060 Well, here it's really just the sum 1367 01:15:31,060 --> 01:15:35,260 of the squares of the coordinates of the vector 1368 01:15:35,260 --> 01:15:37,540 y minus x beta. 
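[Editor's note: stacking the n scalar equations Yi = Xi transpose beta + epsilon i into the single matrix equation y = X beta + epsilon can be checked numerically. This NumPy sketch uses made-up coefficients and noise; nothing here comes from the lecture.]

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
# Design matrix with an intercept column of ones (hypothetical data).
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])
beta = np.array([2.0, -1.5, 0.7])   # hypothetical coefficient vector
eps = 0.1 * rng.normal(size=n)      # noise vector (epsilon 1, ..., epsilon n)

# One matrix-vector product replaces all n scalar equations at once.
y = X @ beta + eps
```

Each coordinate of y agrees with the corresponding scalar equation Yi = Xi transpose beta + epsilon i, which is exactly what the stacking says.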
1369 01:15:37,540 --> 01:15:40,380 So this is actually equal to the norm squared 1370 01:15:40,380 --> 01:15:43,090 of y minus x beta. 1371 01:15:46,382 --> 01:15:47,340 That's just the Euclidean norm. 1372 01:15:47,340 --> 01:15:49,470 The norm squared is, by definition, the sum of the squares 1373 01:15:49,470 --> 01:15:51,720 of the coordinates. 1374 01:15:51,720 --> 01:15:53,885 And so now I can actually talk about minimizing 1375 01:15:53,885 --> 01:15:56,090 a norm squared, and here it's going 1376 01:15:56,090 --> 01:15:58,160 to be easier for me to take derivatives. 1377 01:15:58,160 --> 01:16:01,300 All right, so we'll do that next time.
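[Editor's note: the identity between the coordinate-wise sum of squares and the squared norm, and the minimization that the next lecture takes up via gradients, can both be sketched in NumPy. This is the editor's illustration with made-up data; np.linalg.lstsq is just one off-the-shelf way to compute the minimizer, not the lecture's method.]

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])    # hypothetical coefficients
y = X @ beta_true + 0.05 * rng.normal(size=n)

def criterion(beta):
    # Least squares criterion written as a squared Euclidean norm:
    # ||y - X beta||^2 = sum_i (Y_i - X_i^T beta)^2.
    return np.linalg.norm(y - X @ beta) ** 2

# The coordinate-wise sum of squares agrees with the norm form.
assert np.isclose(np.sum((y - X @ beta_true) ** 2), criterion(beta_true))

# The least squares estimator minimizes the criterion.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Since beta_hat is the minimizer, its criterion value is no larger than at any other beta, including the true one.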