The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

LORENZO ROSASCO: So what we want to do now is to move away from local methods and start to look at some form of global regularization method. The word "regularization" I'm going to use broadly, as a term for statistical and computational procedures that have some parameter that lets you go from a complex model to a simple model, in a very broad sense. What I mean by complex is something that is potentially getting closer to overfitting, and by simple, something that gives me a solution which is stable with respect to the data.

So we're going to consider the following algorithm. I imagine a lot of you have seen it before. It has a bunch of different names -- probably the most famous one is Tikhonov regularization. A bunch of people at the beginning of the '60s thought about something similar, either in the context of statistics or of solving linear equations. Tikhonov is the only one for whom I could find a picture. The other one was Phillips, and then there is Hoerl and other people. They basically all thought about this same procedure.

The procedure is based on a functional that you want to minimize, made of two terms. There are several ingredients going on here. First of all, this is f of x. We try to estimate the function, and we do assume a parametric form for this function, which in this case is just linear. And for the time being -- because you can easily put it back in -- I don't look at the offset. So I just take lines passing through the origin. And this is just because you can prove in one line that you can put the offset back in at zero cost. So for the time being, just think that the data are actually centered.
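The functional on the slide is not reproduced in the transcript; a minimal LaTeX rendering of the regularized least squares objective being described, assuming the 1/n-scaled squared loss that matches the derivation later in the lecture, is:

```latex
\min_{w \in \mathbb{R}^d} \;\; \frac{1}{n} \sum_{i=1}^{n} \big( y_i - w^\top x_i \big)^2 \;+\; \lambda \,\|w\|^2
```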
The way you try to estimate these parameters is, on one hand, to make the empirical error small, and on the other hand, to put a budget on the weights. The reason why you do this -- there are a bunch of ways to explain it. Andrei yesterday talked about margins, different separating lines, and so on. Another way to think about it is that you can convince yourself -- and we are going to see this later -- that in low dimensions a line is a very poor model, because if you have more than a few points, and they're not sitting on a line, you will not be able to make zero error. But if the number of points is lower than the number of dimensions, you can show that a line can actually give you zero error. It's just a matter of degrees of freedom: you have fewer equations than variables. So what you do is add a regularization term. It's basically a term that makes the problem well-posed. We're going to see this in a minute from a different perspective; the easiest one is going to be numerical.

We stick to least squares -- and there is an extra parenthesis that I forgot. Before I tell you why we use least squares, let me tell you that, as somebody pointed out, there is a mistake here: this should just be a minus. I'll fix it. So, back to why we use least squares. On the one hand, especially in low dimensions, you can think of least squares as pretty basic, but it's not a very robust way to measure error, because you square the errors, and so just one error can count a lot. So there is a whole literature on robust statistics, where you want to replace the square with something like an absolute value.
It turns out that, at least in our experience, when you have a high-dimensional problem it's not completely clear how much this kind of instability will actually occur, and whether it won't be cured by just adding a regularization term. And the computations underlying this algorithm are extremely simple. That's why we're sticking to it: it works pretty well in practice. We've actually developed in the last few years some toolboxes that you can use; they're pretty much plug and play. And the algorithm is easy to understand in simple terms. Yesterday Andrei was talking about SVM. SVM is very similar in principle; basically the only difference is that you change the way you measure the cost here. This algorithm you can use both for classification and regression, whereas the SVM that was discussed yesterday is just for classification. And because that cost function turns out to be non-smooth -- non-smooth basically means non-differentiable -- the math is much more complicated, because you have to learn how to minimize things that are not differentiable. In this case, you can stick to elementary stuff. And, as I think I put somewhere, also because Legendre 200 years ago said that least squares are really great. There is this old story about who invented least squares first, Gauss or Legendre, and there are actually long articles about this. But anyway, it's around that time -- around the end of the 18th century. So the algorithm is pretty old.

So what's the idea? Back to the case we had before: you're going to take a linear function. One thing, just to be careful -- think about it once, because if you've never thought about it before, it's good to focus. When you do this drawing, this line is not f of x. It's f of x equals zero.
I don't think I made time for a 3D plot, but f of x is actually a plane that cuts through the slide. It's positive where the line is not dotted -- because these points are positive -- and then it becomes negative. And this line is where it changes sign. So the decision boundary is not f of x itself; it's the level set that corresponds to f of x equals zero, whereas f of x itself is the plane. If you think in one dimension, the points are just sitting on a line: some here are plus 1, some here are minus 1. So what is f of x? It's just a line. What is the decision boundary in this case? It will just be one point, actually, because it's one line that cuts the input line in one point. That's it. If you were to take a more complicated nonlinear function, it would be more than one point. In two dimensions, the boundary becomes a line; in three dimensions, it becomes a plane, and so on and so forth. But the important piece -- just remember it at least once -- when we look at this plot, this is not f of x, but only the set where f of x equals zero, which is where it changes sign. And that's how you're going to make predictions. You take real-valued functions. In principle, in classification, you would allow this function to be binary, but optimization with binary functions is very hard. So what do you typically do to relax this? You allow it to be a real-valued function, and then you take the sign: when it's positive, you predict plus 1; when it's negative, you predict minus 1. If it's a regression problem, you just keep the value as it is.

And how many free parameters does this algorithm have? Well, one -- it's lambda, for now -- and w. But w we're going to find by solving this optimization problem. How about lambda? Well, whatever we discussed before for k.
We would try to sit down and do some bias-variance decomposition, see what it depends on, try to get a grasp on what the theory of this algorithm is. And then we would try to see if we can use cross-validation. You can do all these things, so we're not going to discuss much how you choose lambda; mostly we're going to discuss how you compute the minimizer. And this is not a problem, because the functional is smooth, so you can take the derivative with respect to w of this term and also of this one. What you can do is just take the derivative, set it equal to zero, and check what happens.

It's useful to introduce some vector notation; we've already seen it before. You take all the x's and stack them as rows of the data matrix Xn. The y's you stack as entries of a vector, which you call Yn. Then you can rewrite this term as this vector minus this vector here, which you obtain by multiplying the matrix by w. This norm is the norm in Rn. So this is just a simple rewriting. It's useful because if you now take the derivative of this with respect to w and set it equal to zero, you get this. This is the gradient -- I haven't set it to zero yet. This is the gradient of the least squares part, and this is the gradient of the second term, still multiplied by lambda. If you set them equal to zero, what you get is this. You take everything with X -- the 2 and the 2 cancel -- you take everything with X and put it here; there's still the term with lambda, and you put it here. You take the term X transpose Y and put it on the other side of the equality. So you take everything with w on one side and everything without w on the other side. And then here, I remove the n by multiplying through. And so what you get is a linear system. It's just a linear system. So that's the beauty of least squares.
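A minimal NumPy sketch of the computation just described: setting the gradient to zero gives the linear system (Xnᵀ Xn + λ n I) w = Xnᵀ Yn. The helper name and the toy data are mine, not from the lecture's toolbox.

```python
import numpy as np

def ridge_primal(X, y, lam):
    """Solve (X^T X + lam * n * I) w = X^T y for the regularized least squares weights."""
    n, d = X.shape
    A = X.T @ X + lam * n * np.eye(d)   # d x d system matrix
    b = X.T @ y
    return np.linalg.solve(A, b)        # solve the linear system rather than forming an inverse

# Toy usage: random data with labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
w_true = rng.standard_normal(5)
y = np.sign(X @ w_true)

w = ridge_primal(X, y, lam=0.1)
y_hat = np.sign(X @ w)                  # classification: take the sign of the real-valued f(x)
print("training accuracy:", np.mean(y_hat == y))
```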
Whether you regularize it or not -- in this case, for this simple squared loss regularization -- all you get is a linear system. And this is the first way to think about the effect of adding this term. So what is it doing? Just a quick linear system recap; this is a parenthesis, and I've changed notation a little. You're solving a linear system. The simplest case you can think of is when M is diagonal: suppose it's just a square diagonal matrix. How do you solve this problem? You have to invert the matrix M. What is the inverse of a diagonal matrix? It's just another diagonal matrix where, instead of each entry sigma, you have 1 over sigma. So if M is diagonal like this, this is what you're going to get. Suppose now that some of these numbers are actually small; then when you take 1 over them, they are going to blow up. When you apply this matrix to b, what can happen is that if you change the sigmas or b slightly, you get an explosion. And if you want, this is one way to understand why adding the lambda helps. It's another way to look at overfitting, from a numerical point of view: you take the data, change them slightly, and you have numerical instability right away. What is the effect of adding this term? Well, instead of just computing M inverse, you're computing the inverse of M plus lambda I. This is the simple case where everything is diagonal, and what you see is that on the diagonal, instead of 1 over sigma 1, you take 1 over sigma 1 plus lambda. If sigma 1 is big, adding lambda won't matter. If sigma d is small -- now think of the sigmas as ordered, and sigma d is the smallest--
If this is small, at some point lambda is going to jump in and make the problem stable, at the price of ignoring the information in that sigma -- you basically consider it to be at the same scale as the noise, the perturbation, or the sampling in your data. Does this make sense? So this is what the algorithm is doing, and it's a numerical way to look at stability. But you can imagine that there is an immediate statistical consequence: change the data slightly and you can have a big change in your solution, and the other way around. And lambda governs this, by basically telling you how invertible the matrix is. So it's a connection between statistical and numerical stability. Now of course you can say this is oversimplistic, because this is just a diagonal matrix. But if you now take matrices that you can diagonalize, conceptually nothing changes. There is a mistake here, by the way: there should be no minus 1, this is just sigma. If you have an M that you can diagonalize, every operation you want to do on the matrix you can just do on the diagonal. So all the reasoning here works the same; only now you have to remember to squeeze the diagonal matrix in between V and V transpose. I'm not saying this is what you want to do numerically, but the conceptual reasoning -- what we said was the effect of lambda -- holds just the same. This is M, which you can write like this, and M inverse you can write like this: it's just the same diagonal terms inverted. And now you see the effect of lambda; it's just the same. Once you grasp this conceptually, for any matrix you can make diagonal, it's the same.
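A small numerical illustration of this point (the values are made up for the sketch): a symmetric matrix with one tiny eigenvalue amplifies a tiny perturbation of b, while adding lambda times the identity tames it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Symmetric matrix with one tiny eigenvalue: M = V diag(sigma) V^T
V, _ = np.linalg.qr(rng.standard_normal((3, 3)))
sigma = np.array([10.0, 1.0, 1e-8])
M = V @ np.diag(sigma) @ V.T

b = rng.standard_normal(3)
b_perturbed = b + 1e-6 * rng.standard_normal(3)   # tiny change in the data

lam = 1e-3
solve = np.linalg.solve

# Without regularization: 1/sigma_min is huge, so a tiny change in b explodes
print(np.linalg.norm(solve(M, b) - solve(M, b_perturbed)))
# With regularization: the eigenvalues become sigma_i + lambda, so the solution barely moves
print(np.linalg.norm(solve(M + lam * np.eye(3), b) - solve(M + lam * np.eye(3), b_perturbed)))
```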
And the point is that as long as you have a symmetric positive semi-definite matrix -- which you can diagonalize -- you just have the same thing squeezed in between V and V transpose. And that's what we have, because what we have is exactly this matrix here. And you see here that this depends a lot on the dimensionality of the data. If the number of points is much bigger than the dimensionality, this matrix in principle could be invertible -- it's easier for it to be invertible. But if the number of points is smaller than the dimensionality -- how big is this matrix? Remember how big Xn was: the rows were the points, and the columns were the variables. So how big is this? We called the dimension d and the number of points n. So this is--

AUDIENCE: [INAUDIBLE]

LORENZO ROSASCO: --n by d. So this matrix here is how big? Just d by d. And if the number of points is smaller than the number of dimensions, the rank of this-- it's going to be rank-deficient, so it's not invertible. So if you're in a so-called high-dimensional scenario, where the number of points is smaller than the number of dimensions, for sure you won't be able to invert this. Ordinary least squares will not work; it will be unstable. And then you will have to regularize to get anything reasonable. So in the case of least squares, just by looking at this computation you get a grasp of both what kind of computations you have to do and what they mean, from both the statistical and the numerical point of view. And that's one of the beauties of least squares.
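A quick numerical sketch of the rank argument (the sizes are illustrative): when n < d, the d-by-d matrix XᵀX has rank at most n and is not invertible on its own, while the regularized system is still well-posed.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 100                       # fewer points than dimensions
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

G = X.T @ X                          # d x d, but rank at most n
print(np.linalg.matrix_rank(G))      # -> 10: rank-deficient, so G alone cannot be inverted

lam = 0.1
w = np.linalg.solve(G + lam * n * np.eye(d), X.T @ y)   # regularized system is solvable
print(np.linalg.norm(X @ w - y))     # small training residual
```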
We could go through a whole derivation of this -- what I've given is more the linear system perspective. There is a whole literature trying to justify what I'm saying from a more statistical point of view. You can talk about maximum likelihood, then maximum a posteriori. You can talk about variance reduction and the so-called Stein effect. And you can make a much bigger story, for example developing the whole theory of shrinkage estimators and the bias-variance tradeoff of this. But we're not going to talk about that. This simple numerical stability, statistical stability intuition is going to be my main motivation for considering these schemes.

So let me skip these. I wanted to show the demo, but it's very simple -- it's going to be very stable, because you're just drawing a one-dimensional line. Let's move on just a bit, because we didn't cover as much as I wanted in the first part. So first of all, so far so good? Are you all with me on this? Again, this is the one line where there is something conceptual happening, and this is the one line where we make it a bit more complicated mathematically. And then all you have to do is match this with what we just wrote before. That's all. These are the main three things we want to do. And think a bit about dimensionality.

Now, if you look at a problem even like this, as I said, it might be misleading -- it's low dimensional. In fact, what we typically do in high dimensions is, first of all, start with the linear model and see how far we can go with that. And typically you go a bit further than you might imagine. But still, you can ask: why should I stick to a linear decision rule? It won't give me much flexibility. In this case, obviously, it looks like something better would be some kind of quadratic decision boundary. So how can you do this? Suppose that I give you the code for least squares, and you're the laziest programmer in the world -- which in my case is actually not that hard to imagine.
How can you recycle the code to create a solution like this, instead of a solution like this? You see the question? I give you the code to solve the problem I showed you before -- the linear system for different lambdas -- but you want to go from this solution to that solution. How could you do that? One way you can do it in this simple case is this example. The idea is -- remember the data matrix? I'm going to invent new entries of the matrix: not new points, because you cannot invent points, but new variables. In this case I call them x1 and x2; I'm just in two dimensions. These are my data. This is one point, and x1 and x2 here are just the entries of the point x: the first coordinate and the second coordinate. So what you said is exactly one way to do this. I'm going to build a new vector representation of the same points. It's going to be the same point, but instead of two coordinates I now use three, which are going to be the first coordinate squared, the second coordinate squared, and the product of the two coordinates. Once I've done this, I forget about how I got it, and I just treat these as new variables, and I take a linear model in those variables. It's a linear model in these new variables, but it's a non-linear model in the original variables. And that's what you see here: x tilde is this stuff, just a new vector representation. Now I'm linear with respect to this new vector representation, but when you write x tilde out explicitly, it's some kind of non-linear function of the original variables. So this function here is non-linear in the original variables.
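A sketch of this recycling trick (my own toy example, not the lecture's demo): build the three monomial features and feed them to exactly the same regularized least squares solve as before.

```python
import numpy as np

def quadratic_features(X):
    """Map (x1, x2) -> (x1^2, x2^2, x1*x2): same points, new variables."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1**2, x2**2, x1 * x2])

# Toy labels: sign(x1 * x2). No line through the origin gets this right,
# but it is exactly linear in the new variables (it is the third feature).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sign(X[:, 0] * X[:, 1])

X_tilde = quadratic_features(X)            # n x 3 matrix replaces the n x 2 one
n, p = X_tilde.shape
lam = 1e-3
w = np.linalg.solve(X_tilde.T @ X_tilde + lam * n * np.eye(p), X_tilde.T @ y)  # same code as before
print("training accuracy:", np.mean(np.sign(X_tilde @ w) == y))
```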
It's probably harder to say than to see. Does it make sense? If you do this, you're completely recycling the beauty of linearity from a computational point of view, while augmenting the power of your model from linear to non-linear. It's still parametric, in the sense that -- what I mean by parametric is that we still fix a priori the number of degrees of freedom of our problem. It was two; now I make it three. More generally, I could make it p, but the number of numbers I have to find is fixed a priori; it doesn't depend on my data. But I can definitely go from linear to non-linear. So let's keep going. From the simple linear model we've already gone quite far, because we basically know that with the same computation we can now solve stuff like this. Let's take a couple of steps further. One is to appreciate that the code really is just the same: instead of X, I have to do a pre-processing step that replaces X with this new matrix X tilde, which, instead of being n by d, is now n by p, where p is the number of new variables that I invented.

Now it's useful to get a feeling for the complexity of this method, and this is a very quick complexity recap. Here, basically, the product of two numbers counts as one operation, and when you take products of vectors or matrices, you just count the real-number multiplications you do. So, a quick recap: if I multiply two vectors of size p, the cost is p. A matrix-vector product is going to be np. A matrix-matrix product is going to be n squared p: you have n vectors against another n vectors, each of size p, so each pairing costs you p, and you have to do n against n, so it's n squared p. The last one is less obvious just by looking at it, but roughly speaking, the inversion of a matrix costs n cubed in the worst case.
This is just to give you a feeling for what the complexities are. Does it make sense? It's a bit quick, but it's simple; if you know it, OK, otherwise just keep it on the side while you think about this. So what is the complexity of this? Well, you have to multiply this times this, and that is going to cost you nd or np. You have to build this matrix, and that is going to cost you d squared n or p squared n. And then you have to invert, and that is going to be p cubed -- or d cubed, because this matrix is d by d. So this is, roughly speaking, the cost. Now look at this. In this case p is the number of new variables, otherwise it's d. So I have p cubed, and then I have p squared n. But one question is: what if n is much smaller than p -- and that does happen? If n is, say, 10, do I really have to pay quadratically or even cubically in the number of dimensions to solve this problem? Because in some sense it looks like I'm overshooting a bit. I'm inverting a matrix, yes, but this matrix really has rank n: it has at most n linearly independent rows. It might be fewer, but at most it has n. So can I break the complexity of this? It's a linear system you have to solve; you just use the table I showed you before and check the computations. These are the computations you have to do. And one observation is that you pay a lot in the dimension -- the number of variables, or the number of features you invented. That might be OK when p is smaller than n. But it seems wrong intuitively when n is much smaller than p, because the complexity of the problem -- the rank of the problem -- is just n. The matrix here has n rows and d or p columns, depending on which representation you take, and so the rank of the whole thing is at most n. So now the red dot appears.
And what you can do is prove this one line. Let's see what it does, and then I'll tell you how you can prove it -- it's an exercise. You see here, if you invert this, then you have to multiply the inverse of this matrix by X transpose Y, which is what's written here. So I claim that this equality holds. Look at what it does: I take this X transpose and move it in front. But if I just do that, you clearly see that I'm messing around with dimensions, so what you do is also switch the order of the two matrices in the middle. Now, from a dimensionality point of view at least, I can see that this side and this side have matching dimensions. How do you prove this? You basically just need the SVD. You take the singular value decomposition of the matrix Xn, you plug it in, you compute, and you check that this side of the equality is the same as that side. There's nothing more than that, but we're going to skip it, so you just take it as a fact. It's a little trick. Why do I want to do this trick? Because look: now what I'm saying is that my w is going to be X transpose of something. What is this something? So w is going to be X transpose of this thing here. How big is this vector? First of all, how big is this matrix? Remember, how big was Xn?

AUDIENCE: N by d.

LORENZO ROSASCO: N by d, or p. How big is this?

AUDIENCE: N by n.

LORENZO ROSASCO: N by n. So how big is this vector? It's n by 1. So I found out that my w can always be written as X transpose c, where c is just an n-dimensional vector. I've rewritten it like this, if you want. So what is the cost of doing this? Well, this was the cost of doing the previous one.
But now, what is the cost of doing this thing here, above the bracket? Well, if that one was p cubed plus p squared n, this one will be how much? There, the matrix was p by p and the vector was p by 1, whereas here my matrix is n by n and the vector is n by 1. So basically these two numbers swap: instead of that complexity, you now have a complexity which is n cubed, and then n squared p, which sounds about right. It's linear in p -- you cannot avoid that, you have to look at the data at least once -- but then it's polynomial only in the smaller of the two quantities. In some sense, what you see is that, depending on the size of n, you still have to do this multiplication, but this multiplication is just nd or np. So let's recap what I'm telling you. This is the most mathematical fact I've put in, and I have a warning here. The first thing is that the question should be clear: can I break the complexity in the case where n is smaller than p or d? This is relevant because of the question that came up a second ago: should I always explode the dimension of my features? And here what you see is that, at least for now, even if you do, you don't pay more than linearly in that dimension. And the way you prove it is: A, you observe this fact -- which, again, I mentioned how to prove if you're curious, but it's one line. And B, you observe that once you have this, you can rewrite w as X transpose c. And to find c -- you've basically re-parametrized -- to find the new c is going to cost you only n cubed plus n squared p. So you do exactly what you wanted to do.
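A numerical sanity check of the identity and of the cost swap (the sizes are illustrative, and this is just a sketch of the trick, not the lecture's toolbox code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 500                      # many more features than points
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
lam = 0.1

# Primal: (X^T X + lam*n*I) w = X^T y  -> p x p solve, cost roughly p^3 + p^2 n
w_primal = np.linalg.solve(X.T @ X + lam * n * np.eye(p), X.T @ y)

# Dual: w = X^T (X X^T + lam*n*I)^{-1} y  -> n x n solve, cost roughly n^3 + n^2 p
c = np.linalg.solve(X @ X.T + lam * n * np.eye(n), y)
w_dual = X.T @ c

print(np.allclose(w_primal, w_dual))   # True: same solution from a much smaller system
```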
Basically, what you see now is that whenever you do least squares, you can check the number of dimensions and the number of points, and always re-parametrize the problem in such a way that the complexity depends linearly on the bigger of the two and polynomially on the smaller of the two. So that's good news. Oh, I wrote it down -- so this is where we are right now. If you're lost now, you're going to become completely lost in one second, because here is what we want to do: we want to introduce kernels in the simplest possible way, which is the following. Look at what we found out. We discovered -- we actually proved -- a theorem. The theorem says that the w's output by the least squares algorithm are not arbitrary d-dimensional vectors; they are always vectors that I can write as a combination of the training set vectors. So xi has length d, or p, and I sum them up with these weights. The w's that come out of least squares are always of that form; they cannot be of any other form. This is called the representer theorem. It's the basic theorem of so-called kernel methods. It shows you that the solution you're looking for can be written as a linear superposition of these terms. If you now write down f of x -- f of x is going to be x transpose w, just the linear function -- by linearity you get this: w is written like this, you multiply by x transpose, this is a finite sum, so you can bring x transpose inside the sum. This is what you get. Are you OK? You have x transpose times a sum, which is the sum of x transpose multiplied by the rest.
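In symbols, a minimal rendering of what was just derived, with c the n-dimensional vector of coefficients from the re-parametrization:

```latex
w \;=\; X_n^\top c \;=\; \sum_{i=1}^{n} c_i\, x_i,
\qquad
f(x) \;=\; x^\top w \;=\; \sum_{i=1}^{n} c_i\, x^\top x_i,
\qquad
c \;=\; \big( X_n X_n^\top + \lambda n I \big)^{-1} Y_n
```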
Why do we care about this? Because the idea of kernel methods -- in this very basic form -- is: what if I replace this inner product, which is a way to measure similarity between my inputs, with another similarity? So instead of mapping each x into a very high-dimensional vector and then taking the inner product -- which is itself, if you want, another way of measuring similarity, or distances, between vectors -- what if I define it not by an explicit mapping, but by redefining the inner product? This k here is similar to the one we had on the very first slide. The idea is to change the inner product, and then reuse everything else. So we need to answer two questions. The first one is: if I give you a procedure that, whenever you would want to do x transpose x prime, does something else called k of x comma x prime, how do the computations change? This is going to be very easy. But also, what are you doing from a modeling perspective?

From the computational point of view it's very easy, because you see here that you always had to build a matrix whose entries were xi transpose xj. It was always a product of two vectors. And what you do now is the same: you build the matrix Kn, which is not just Xn Xn transpose, but a new matrix whose entries are just this. This is just a generalization: if I put in the linear kernel, I get back exactly what we had before; if you put in another kernel, you just get something else. So from the computational point of view, you're done for the computation of c; you have to do nothing else, you just replace this matrix with the general matrix. And now if you want to compute the solution -- well, w you cannot compute anymore, because you don't know what an x is by itself. But if you want to compute f of x, you can, because you just have to plug in. So you know how to compute the c.
And you know how to compute this quantity, because you just have to put the kernel there. So the magic here is that you never, ever see a point x in isolation; you always have a point x multiplied by another point. And this allows you to replace the vectors -- in some sense, this is an implicit remapping of the points, obtained just by redefining the inner product. So what you should see for now is that the computation you did to compute f of x in the linear case you can redo, if you replace the inner product with this new function. Because, A, you can compute c by just using this new matrix in place of that one. And B, you can compute f of x, because all you need is to replace this inner product with this one and put in the right weights, which you know how to compute.

From a modeling perspective, what you can check is that, for example, if you choose this polynomial kernel -- which is just x transpose x prime plus 1, raised to the power d -- and you take, for example, d equal to 2, this is equivalent to the mapping I showed you before, the one with explicit monomials as entries. This is just doing it implicitly. If you're low-dimensional -- if n is very big and the dimension is very small -- the first way might be better. But if the dimension is much bigger than n, this way is better. But you can also use stuff like this, like a Gaussian kernel. In that case, you cannot really write down the explicit map, because it turns out to be infinite-dimensional: the vectors you would need to write down for the explicit embedding version of this are infinite-dimensional. So if you use this, you get a truly non-parametric model. If you think about the effect of using these kernels, it's quite clear if you plug them in here: in one case you have a superposition of linear terms, a superposition of polynomial terms, or a superposition of Gaussians.
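A minimal sketch of kernel least squares with a Gaussian kernel (my own toy code, assuming the same lambda-n scaling as before; it is not the toolbox mentioned in the lecture, and the parameter values are illustrative):

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

def kernel_ridge_fit(X, y, lam, sigma):
    """Solve (K + lam*n*I) c = y; by the representer theorem, f(x) = sum_i c_i k(x, x_i)."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def kernel_ridge_predict(X_train, c, X_test, sigma):
    return gaussian_kernel(X_test, X_train, sigma) @ c   # real-valued f; take the sign to classify

# Toy usage: a non-linear decision boundary (points inside a disk vs. outside)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where((X**2).sum(axis=1) < 0.5, 1.0, -1.0)

c = kernel_ridge_fit(X, y, lam=1e-3, sigma=0.3)
y_hat = np.sign(kernel_ridge_predict(X, c, X, sigma=0.3))
print("training accuracy:", np.mean(y_hat == y))
```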
795 00:35:28,490 --> 00:35:30,470 So same dataset we train. 796 00:35:30,470 --> 00:35:33,070 I take kernel least squares-- 797 00:35:33,070 --> 00:35:34,650 which is what I just showed you-- 798 00:35:34,650 --> 00:35:37,900 compute the c inverting that matrix, 799 00:35:37,900 --> 00:35:40,090 use the Gaussian kernel-- the last of the examples-- 800 00:35:40,090 --> 00:35:41,170 and then compute f of x. 801 00:35:41,170 --> 00:35:42,544 And then we just want to plot it. 802 00:35:47,231 --> 00:35:48,230 So this is the solution. 803 00:35:51,770 --> 00:35:53,510 The algorithm depends on two parameters. 804 00:35:53,510 --> 00:35:54,093 What are they? 805 00:35:57,700 --> 00:35:58,620 AUDIENCE: Lambda. 806 00:35:58,620 --> 00:36:00,870 LORENZO ROSASCO: Lambda, the regularization parameter, 807 00:36:00,870 --> 00:36:03,700 the one that appeared already in the linear case-- 808 00:36:03,700 --> 00:36:04,240 and then-- 809 00:36:04,240 --> 00:36:06,760 AUDIENCE: Whatever parameter you've chosen [INAUDIBLE].. 810 00:36:06,760 --> 00:36:07,570 LORENZO ROSASCO: Exactly. 811 00:36:07,570 --> 00:36:09,220 Whatever parameter there is in your kernel. 812 00:36:09,220 --> 00:36:10,803 In this case, it's the Gaussian, so it 813 00:36:10,803 --> 00:36:12,870 will depend on this width. 814 00:36:17,370 --> 00:36:22,120 Now suppose that I take gamma big. 815 00:36:22,120 --> 00:36:24,120 I don't know what big is. 816 00:36:24,120 --> 00:36:27,530 I just do it by hand here, so we see what happens. 817 00:36:32,620 --> 00:36:36,120 If you take gamma-- sorry, not gamma, sigma-- big, 818 00:36:36,120 --> 00:36:39,450 you start to get something very simple. 819 00:36:39,450 --> 00:36:42,820 And if I make it a bit bigger, it 820 00:36:42,820 --> 00:36:47,550 will probably start to look very much like a linear solution. 821 00:36:54,530 --> 00:36:56,220 If I make it small-- 822 00:36:59,805 --> 00:37:01,560 and again, I don't know what small is, 823 00:37:01,560 --> 00:37:02,601 so I'm just going to try. 824 00:37:07,725 --> 00:37:08,715 It's very small. 825 00:37:15,660 --> 00:37:18,540 You start to see what's going on. 826 00:37:18,540 --> 00:37:20,040 And if you go in between, you really 827 00:37:20,040 --> 00:37:23,224 start to see that you can circle out individual examples. 828 00:37:23,224 --> 00:37:25,140 So let's think a second what we're doing here. 829 00:37:29,370 --> 00:37:33,830 It is going to be again another hand-waving explanation. 830 00:37:33,830 --> 00:37:37,190 Look at this equation. 831 00:37:37,190 --> 00:37:39,170 Let's read out what it says. 832 00:37:39,170 --> 00:37:42,710 In the case of Gaussians, it says, I take a Gaussian-- 833 00:37:42,710 --> 00:37:44,480 just a usual Gaussian-- 834 00:37:44,480 --> 00:37:47,840 I center it over a training set point, 835 00:37:47,840 --> 00:37:52,160 then by choosing the ci I'm choosing whether it is 836 00:37:52,160 --> 00:37:54,310 going to be a peak or a valley. 837 00:37:54,310 --> 00:37:57,680 It can go up, or it can go down in the two-dimensional case. 838 00:37:57,680 --> 00:38:00,520 And by choosing the width, I decide 839 00:38:00,520 --> 00:38:03,140 how large it's going to be. 840 00:38:03,140 --> 00:38:08,060 If I do f of x, then I sum up all this stuff, which basically 841 00:38:08,060 --> 00:38:11,810 means that I'm going to have these peaks and these valleys 842 00:38:11,810 --> 00:38:17,000 and I connect them in some way.
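The equation being read out here is not reproduced in the transcript; for the Gaussian kernel it presumably has the form below (in LaTeX), a weighted sum of Gaussian bumps centered at the training points, with the coefficients c computed as in the sketch above:

f(x) \;=\; \sum_{i=1}^{n} c_i \, k(x_i, x) \;=\; \sum_{i=1}^{n} c_i \exp\!\left(-\frac{\|x - x_i\|^2}{2\sigma^2}\right), \qquad c = (K_n + \lambda n I)^{-1} y .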
843 00:38:17,000 --> 00:38:18,910 Now you remember before that I pointed out 844 00:38:18,910 --> 00:38:20,780 in the two-dimensional case what 845 00:38:20,780 --> 00:38:24,920 we draw is not f of x, but f of x equal to zero. 846 00:38:24,920 --> 00:38:31,525 So what you should really think is that f of x in this case 847 00:38:31,525 --> 00:38:36,250 is no longer a hyperplane, but it's this surface. 848 00:38:36,250 --> 00:38:37,545 It goes up, and it goes down. 849 00:38:37,545 --> 00:38:38,920 And it goes up, and it goes down. 850 00:38:38,920 --> 00:38:42,290 So in the blue part, it goes up, and in the orange part, 851 00:38:42,290 --> 00:38:45,260 it goes down into a valley. 852 00:38:45,260 --> 00:38:48,110 So what you do is that right now you're 853 00:38:48,110 --> 00:38:50,270 taking all these small Gaussians, 854 00:38:50,270 --> 00:38:54,350 and you put them around blue and orange points, 855 00:38:54,350 --> 00:38:56,702 and then you connect their peaks. 856 00:38:56,702 --> 00:38:58,160 And by making them small, you allow 857 00:38:58,160 --> 00:39:00,050 them to create a very complicated surface. 858 00:39:03,310 --> 00:39:05,290 So what did we put before? 859 00:39:11,240 --> 00:39:11,950 So they're small. 860 00:39:11,950 --> 00:39:14,210 They're getting smaller, and smaller, and smaller. 861 00:39:14,210 --> 00:39:16,640 And they go out, and you see the-- 862 00:39:16,640 --> 00:39:18,620 there is a point here, so they circle it out 863 00:39:18,620 --> 00:39:21,530 here by putting basically a Gaussian right there 864 00:39:21,530 --> 00:39:24,920 for that individual point. 865 00:39:24,920 --> 00:39:26,870 Imagine what happens if my points-- 866 00:39:26,870 --> 00:39:29,240 I have two points here and two points here-- 867 00:39:29,240 --> 00:39:33,020 and now I put a huge Gaussian around each point. 868 00:39:33,020 --> 00:39:36,314 Basically, the peaks are almost going to touch each other. 869 00:39:36,314 --> 00:39:37,730 So what you can imagine is that you 870 00:39:37,730 --> 00:39:40,670 get something, where basically the decision boundary has 871 00:39:40,670 --> 00:39:42,170 to look like a line, because you get 872 00:39:42,170 --> 00:39:43,470 something which is so smooth. 873 00:39:43,470 --> 00:39:45,095 It doesn't go up and down all the time. 874 00:39:45,095 --> 00:39:47,539 It's going to be-- 875 00:39:47,539 --> 00:39:49,080 And that's what we saw before, right? 876 00:39:49,080 --> 00:39:51,110 And again, I don't remember what I put here. 877 00:39:54,718 --> 00:39:56,490 So this is starting to look good. 878 00:39:56,490 --> 00:40:00,720 So you really see that somewhat something nice happens. 879 00:40:00,720 --> 00:40:03,720 Maybe if I put-- five is what we put before maybe. 880 00:40:09,010 --> 00:40:10,890 So what you're basically doing 881 00:40:10,890 --> 00:40:13,560 is that you're computing the center of mass 882 00:40:13,560 --> 00:40:16,794 of one class in the sense of the Gaussians. 883 00:40:16,794 --> 00:40:18,210 So you're doing a Gaussian mixture 884 00:40:18,210 --> 00:40:19,850 on one side, a Gaussian mixture on the other side, 885 00:40:19,850 --> 00:40:21,780 you're basically computing the center of masses, 886 00:40:21,780 --> 00:40:22,890 and then you just find the line that 887 00:40:22,890 --> 00:40:24,230 separates the center of masses. 888 00:40:24,230 --> 00:40:26,021 That's what you're doing here, and you just 889 00:40:26,021 --> 00:40:27,810 find this one big line here.
890 00:40:30,600 --> 00:40:35,250 So again, so we're not playing around 891 00:40:35,250 --> 00:40:38,040 with the number of points. 892 00:40:38,040 --> 00:40:39,776 We're not playing around with lambda. 893 00:40:39,776 --> 00:40:42,150 Because this is basically what we already saw before. 894 00:40:42,150 --> 00:40:44,691 All I want to show you right now is the effect of the kernel. 895 00:40:44,691 --> 00:40:51,610 And here I'm using the Gaussian kernel, but-- let's see-- 896 00:40:51,610 --> 00:40:54,177 but you can also use the linear kernel. 897 00:40:54,177 --> 00:40:55,260 This is the linear kernel. 898 00:40:55,260 --> 00:40:56,885 This is using the linear least squares. 899 00:40:56,885 --> 00:40:58,520 If you now use the Gaussian kernel, 900 00:40:58,520 --> 00:41:00,270 you give yourself the extra possibility. 901 00:41:00,270 --> 00:41:01,650 Essentially, what you see is that if you 902 00:41:01,650 --> 00:41:03,691 put a Gaussian which is very big, in some sense 903 00:41:03,691 --> 00:41:05,810 you get back the linear kernel. 904 00:41:05,810 --> 00:41:07,810 But if you put a Gaussian which is very small, 905 00:41:07,810 --> 00:41:12,050 you allow yourself this extra complexity. 906 00:41:12,050 --> 00:41:14,910 And so that's what we gain with this little trick 907 00:41:14,910 --> 00:41:20,500 that we did of replacing the inner product 908 00:41:20,500 --> 00:41:24,400 with this new kernel. 909 00:41:24,400 --> 00:41:26,380 We went from the simple linear estimators 910 00:41:26,380 --> 00:41:28,045 to something, which is-- 911 00:41:28,045 --> 00:41:29,650 It's the same thing-- if you want-- 912 00:41:29,650 --> 00:41:31,300 that we did by building explicitly 913 00:41:31,300 --> 00:41:34,180 these monomials of higher power, but here you're 914 00:41:34,180 --> 00:41:35,879 doing it implicitly. 915 00:41:35,879 --> 00:41:37,420 And it turns out that it's actually-- 916 00:41:37,420 --> 00:41:40,510 there is no explicit version that you can-- 917 00:41:40,510 --> 00:41:43,690 You can do it mathematically, but the feature representation, 918 00:41:43,690 --> 00:41:45,910 the variable representation of this kernel 919 00:41:45,910 --> 00:41:48,260 would be an infinitely long vector. 920 00:41:48,260 --> 00:41:49,900 The space of functions that is built 921 00:41:49,900 --> 00:41:53,130 as a combination of Gaussians is not finite-dimensional. 922 00:41:53,130 --> 00:41:56,040 For polynomials, you can check that the space of functions 923 00:41:56,040 --> 00:41:59,680 basically has dimension polynomial in d. 924 00:41:59,680 --> 00:42:02,030 If I ask you how big is the function 925 00:42:02,030 --> 00:42:04,780 space that you can build using this-- well, this is easy. 926 00:42:04,780 --> 00:42:06,910 It's just d-dimensional. 927 00:42:06,910 --> 00:42:09,220 With this, well, this is a bit more complicated, 928 00:42:09,220 --> 00:42:11,250 but you can compute it. 929 00:42:11,250 --> 00:42:16,270 For this, it's not easy to compute, because it's infinite. 930 00:42:16,270 --> 00:42:19,690 So it in some sense is a non-parametric model. 931 00:42:19,690 --> 00:42:21,280 What does it mean? 932 00:42:21,280 --> 00:42:22,990 Of course, you still have a finite number 933 00:42:22,990 --> 00:42:24,380 of parameters in practice. 934 00:42:24,380 --> 00:42:26,020 And that's the good news. 935 00:42:26,020 --> 00:42:28,440 But there is no fixed number of parameters a priori. 936 00:42:28,440 --> 00:42:31,129 If I give you a hundred points, you get a hundred parameters.
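A hypothetical way to reproduce this comparison with the sketch functions defined earlier (the toy data, the lambda value, and the two widths are made up for illustration): the same points fit with a linear kernel, a very wide Gaussian, and a very narrow Gaussian.

import numpy as np

# assumes linear_kernel, gaussian_kernel, kernel_least_squares_fit,
# and kernel_least_squares_predict from the earlier sketch are in scope
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                        # toy 2D inputs
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=100))    # noisy, roughly linear labels

kernels = {
    "linear": linear_kernel,
    "gaussian, big sigma": lambda a, b: gaussian_kernel(a, b, sigma=10.0),
    "gaussian, small sigma": lambda a, b: gaussian_kernel(a, b, sigma=0.1),
}
for name, k in kernels.items():
    c = kernel_least_squares_fit(X, y, k, lam=1e-3)
    preds = np.sign([kernel_least_squares_predict(X, c, k, x) for x in X])
    print(name, (preds == y).mean())                 # training fit only: the narrow Gaussian can fit almost anything

The wide Gaussian behaves much like the linear kernel, while the narrow one buys the extra complexity discussed above, for better or worse.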
937 00:42:31,129 --> 00:42:33,670 If I give you 2 million points, you get 2 million parameters. 938 00:42:33,670 --> 00:42:36,400 If I give you 5 million points, you get 5 million parameters. 939 00:42:36,400 --> 00:42:40,300 But you never hit a boundary of complexity, 940 00:42:40,300 --> 00:42:44,800 because these in some sense have an infinite-dimensional 941 00:42:44,800 --> 00:42:45,640 parameter space. 942 00:42:48,940 --> 00:42:52,210 So of course, I see that here 943 00:42:52,210 --> 00:42:55,040 some of the parts that I'm explaining are complicated, 944 00:42:55,040 --> 00:42:57,370 especially if this is the first time you see them. 945 00:42:57,370 --> 00:43:00,370 But the take-home message should be essentially 946 00:43:00,370 --> 00:43:02,020 from least squares, I can understand 947 00:43:02,020 --> 00:43:03,936 what's going on from a numerical point of view 948 00:43:03,936 --> 00:43:06,700 and bridge numerics and statistics. 949 00:43:06,700 --> 00:43:08,680 Then by just simple linear algebra, 950 00:43:08,680 --> 00:43:10,090 I can understand the complexity-- 951 00:43:10,090 --> 00:43:12,250 how I can get complexity-- which is 952 00:43:12,250 --> 00:43:14,110 linear in the number of dimensions 953 00:43:14,110 --> 00:43:16,240 or the number of points. 954 00:43:16,240 --> 00:43:19,840 And then by following up, I can do a little magic 955 00:43:19,840 --> 00:43:23,480 and go from the linear model to something non-linear. 956 00:43:23,480 --> 00:43:26,200 The deep reasons why this is possible are complicated. 957 00:43:26,200 --> 00:43:29,099 But as a take-home message, A, the computations 958 00:43:29,099 --> 00:43:29,890 you can check easily. 959 00:43:29,890 --> 00:43:31,620 They remain the same. 960 00:43:31,620 --> 00:43:33,796 B, you can check that what you're doing is now 961 00:43:33,796 --> 00:43:35,920 allowing yourself to take a more complicated model, 962 00:43:35,920 --> 00:43:38,800 a combination of the kernel functions. 963 00:43:38,800 --> 00:43:46,480 And then even just by playing with these simple demos, 964 00:43:46,480 --> 00:43:48,370 you can understand a bit what is the effect. 965 00:43:48,370 --> 00:43:50,300 And that's what you intuitively would expect. 966 00:43:50,300 --> 00:43:52,600 So I hope that it would get you close enough 967 00:43:52,600 --> 00:43:55,900 to have some awareness, when you use this. 968 00:43:55,900 --> 00:43:57,850 And of course, you can put-- 969 00:43:57,850 --> 00:44:01,510 when you abstract from the specificity of this algorithm, 970 00:44:01,510 --> 00:44:03,990 you build an algorithm with one or two parameters-- 971 00:44:03,990 --> 00:44:05,380 lambda and sigma. 972 00:44:05,380 --> 00:44:07,540 And so as soon as you ask me how you choose those, 973 00:44:07,540 --> 00:44:09,685 well, we go back to the first part of the lecture-- 974 00:44:09,685 --> 00:44:12,100 bias-variance tradeoffs, cross-validation, 975 00:44:12,100 --> 00:44:13,810 and so on and so forth. 976 00:44:13,810 --> 00:44:16,750 So you just have to put them together. 977 00:44:16,750 --> 00:44:19,100 There is a lot of stuff I've not talked about. 978 00:44:19,100 --> 00:44:21,130 And it's a step away from what we discussed, 979 00:44:21,130 --> 00:44:23,902 so you've just seen the take-home message part, 980 00:44:23,902 --> 00:44:25,360 but we could talk about reproducing 981 00:44:25,360 --> 00:44:27,670 kernel Hilbert spaces, the functional analysis 982 00:44:27,670 --> 00:44:29,560 behind everything I said.
983 00:44:29,560 --> 00:44:31,990 We can talk about Gaussian processes, which is basically 984 00:44:31,990 --> 00:44:34,780 the probabilistic version of what I just showed you now. 985 00:44:34,780 --> 00:44:37,450 Then we can also see the connection with a bunch of math 986 00:44:37,450 --> 00:44:39,746 like integral equations and PDEs. 987 00:44:39,746 --> 00:44:41,620 There is a whole connection with the sampling 988 00:44:41,620 --> 00:44:44,240 theory a la Shannon, inverse problems and so on. 989 00:44:44,240 --> 00:44:47,230 And there is a bunch of extensions, 990 00:44:47,230 --> 00:44:48,705 which are almost for free. 991 00:44:48,705 --> 00:44:50,234 You change the loss function. 992 00:44:50,234 --> 00:44:52,150 You can make it the logistic loss, and you get kernel 993 00:44:52,150 --> 00:44:53,110 logistic regression. 994 00:44:53,110 --> 00:44:57,520 You can take SVM, and you get kernel SVM. 995 00:44:57,520 --> 00:45:00,430 Then you can also take more complicated output spaces. 996 00:45:00,430 --> 00:45:03,220 And you can do multiclass, multivariate regression. 997 00:45:03,220 --> 00:45:04,210 You can do regression. 998 00:45:04,210 --> 00:45:06,220 You can do multilabel, and you can 999 00:45:06,220 --> 00:45:07,640 do a bunch of different things. 1000 00:45:07,640 --> 00:45:10,210 And these are really a step away. 1001 00:45:10,210 --> 00:45:12,387 These are minor modifications of the code. 1002 00:45:12,387 --> 00:45:13,720 And you can do a bunch of stuff. 1003 00:45:13,720 --> 00:45:16,120 So the good thing of this is that with really, really, 1004 00:45:16,120 --> 00:45:17,770 really minor effort, you can actually 1005 00:45:17,770 --> 00:45:19,090 solve a bunch of problems. 1006 00:45:19,090 --> 00:45:21,700 I'm not saying that it's going to be the best algorithm ever, 1007 00:45:21,700 --> 00:45:26,290 but definitely it gets you quite far. 1008 00:45:26,290 --> 00:45:28,240 So again we spent quite a bit of time thinking 1009 00:45:28,240 --> 00:45:30,950 about bias-variance and what it means and used 1010 00:45:30,950 --> 00:45:33,250 least squares and just basically warming up 1011 00:45:33,250 --> 00:45:34,570 a bit with this setting. 1012 00:45:34,570 --> 00:45:39,022 And then in the last hour or so, we discussed least squares, 1013 00:45:39,022 --> 00:45:41,480 because it allows you to just think in terms of linear algebra, 1014 00:45:41,480 --> 00:45:43,120 which is something that-- 1015 00:45:43,120 --> 00:45:45,130 one way or another-- you've seen in your life. 1016 00:45:45,130 --> 00:45:48,400 And then from there, you can go from linear to non-linear. 1017 00:45:48,400 --> 00:45:51,100 And that's a bit of magic, but a couple of parts-- 1018 00:45:51,100 --> 00:45:54,070 which are how you use it both numerically 1019 00:45:54,070 --> 00:45:56,300 and just from a practical perspective 1020 00:45:56,300 --> 00:45:59,260 to go from complex models to simple models and vice versa-- 1021 00:45:59,260 --> 00:46:03,579 should be-- is the part that I hope you keep in your mind. 1022 00:46:03,579 --> 00:46:05,870 For now, our concern has just been to make predictions. 1023 00:46:05,870 --> 00:46:07,370 If you hear classification, you want 1024 00:46:07,370 --> 00:46:08,494 to have good classification. 1025 00:46:08,494 --> 00:46:10,870 If you hear regression, you want to do good regression. 1026 00:46:10,870 --> 00:46:14,440 But you didn't talk about-- we didn't talk about understanding 1027 00:46:14,440 --> 00:46:17,200 how you do good regression.
1028 00:46:17,200 --> 00:46:21,944 So a typical example is the example in biology. 1029 00:46:21,944 --> 00:46:23,110 This is, perhaps, a bit old. 1030 00:46:23,110 --> 00:46:24,350 This is micro-arrays. 1031 00:46:24,350 --> 00:46:31,860 But the idea is the dataset you have is a bunch of patients. 1032 00:46:31,860 --> 00:46:33,870 For each patient, you have measurements, 1033 00:46:33,870 --> 00:46:36,480 and the measurements correspond to some gene expression 1034 00:46:36,480 --> 00:46:38,265 level or some other biological process. 1035 00:46:42,390 --> 00:46:45,840 The patients are divided in two groups, say, disease type 1036 00:46:45,840 --> 00:46:48,630 A and disease type B. And based on the good prediction 1037 00:46:48,630 --> 00:46:50,880 of whether a patient is disease type A or B, 1038 00:46:50,880 --> 00:46:55,089 you can change the way you cure it or you address the disease. 1039 00:46:55,089 --> 00:46:57,130 So of course, you want to have a good prediction. 1040 00:46:57,130 --> 00:46:59,190 You want to be able-- when a new patient arrives-- 1041 00:46:59,190 --> 00:47:04,440 to say whether this is type A or type B. 1042 00:47:04,440 --> 00:47:06,810 But oftentimes, what you want to do 1043 00:47:06,810 --> 00:47:09,900 is that you want to use this not as the final tool, 1044 00:47:09,900 --> 00:47:13,540 because unless deep learning can solve this, 1045 00:47:13,540 --> 00:47:18,450 you might go back and study a bit more the biological process 1046 00:47:18,450 --> 00:47:19,900 to understand a bit more. 1047 00:47:19,900 --> 00:47:23,590 So you use this more as a statistical tool, 1048 00:47:23,590 --> 00:47:28,840 like a measurement, like the way you can use a microscope 1049 00:47:28,840 --> 00:47:31,335 or something to look into your data and get information. 1050 00:47:31,335 --> 00:47:32,710 And in that sense sometimes, it's 1051 00:47:32,710 --> 00:47:34,335 interesting to-- instead of just saying 1052 00:47:34,335 --> 00:47:37,720 is this patient going to be more likely to be disease type 1053 00:47:37,720 --> 00:47:40,300 A or B, it's to go in and say, ah, 1054 00:47:40,300 --> 00:47:42,340 but when you make the prediction, what 1055 00:47:42,340 --> 00:47:44,500 are the processes that matter for this prediction? 1056 00:47:44,500 --> 00:47:49,032 Is it gene number 33 or 34, so that I can go in and say, 1057 00:47:49,032 --> 00:47:51,490 oh, these genes make sense, because they're in fact related 1058 00:47:51,490 --> 00:47:54,210 to these other processes, which are known to be 1059 00:47:54,210 --> 00:47:56,354 involved in this disease. 1060 00:47:56,354 --> 00:47:58,270 And doing that, you use this just as a little tool, 1061 00:47:58,270 --> 00:48:00,219 then you use other ones to get a picture. 1062 00:48:00,219 --> 00:48:01,510 And then you put them together. 1063 00:48:01,510 --> 00:48:04,330 And then it's mostly on the doctor, or the clinician, 1064 00:48:04,330 --> 00:48:08,320 or the biostatistician to try to develop better understanding. 1065 00:48:08,320 --> 00:48:10,640 But you do use these as tools to understand and look 1066 00:48:10,640 --> 00:48:12,460 into the data. 1067 00:48:12,460 --> 00:48:14,650 And in that perspective, the word interpretability 1068 00:48:14,650 --> 00:48:15,440 plays a big role.
1069 00:48:15,440 --> 00:48:17,590 And here by interpretability I mean 1070 00:48:17,590 --> 00:48:19,160 I not only want to make predictions, 1071 00:48:19,160 --> 00:48:22,300 but I want to know how I make predictions and to come 1072 00:48:22,300 --> 00:48:25,060 afterwards with an explanation of how 1073 00:48:25,060 --> 00:48:29,410 I picked the information that was contained in my data. 1074 00:48:29,410 --> 00:48:35,080 So far it's hard to see how to do it with the tools we had. 1075 00:48:35,080 --> 00:48:41,940 So this is basically the field of variable selection. 1076 00:48:41,940 --> 00:48:44,917 And in this basic form, the setting 1077 00:48:44,917 --> 00:48:46,500 where we do understand what's going on 1078 00:48:46,500 --> 00:48:49,210 is the setting of linear models. 1079 00:48:49,210 --> 00:48:52,530 So in this setting basically, I just 1080 00:48:52,530 --> 00:48:54,040 rewrite what we've seen before. 1081 00:48:54,040 --> 00:48:57,510 You have x is a vector, and you can think of it, for example, 1082 00:48:57,510 --> 00:48:59,000 as a patient. 1083 00:48:59,000 --> 00:49:01,290 And the xj are measurements that you have 1084 00:49:01,290 --> 00:49:04,140 done describing this patient. 1085 00:49:04,140 --> 00:49:06,060 When you do a linear model, you basically 1086 00:49:06,060 --> 00:49:10,630 have that by putting a weight on each variable, 1087 00:49:10,630 --> 00:49:13,680 you're putting a weight on each measurement. 1088 00:49:13,680 --> 00:49:15,840 If a measurement doesn't matter, you 1089 00:49:15,840 --> 00:49:17,310 think you might put a zero here. 1090 00:49:17,310 --> 00:49:19,650 And it will disappear from the sum. 1091 00:49:19,650 --> 00:49:21,420 If the measurement matters a lot, 1092 00:49:21,420 --> 00:49:23,610 then here you might get a big weight. 1093 00:49:23,610 --> 00:49:27,840 So one way to try to get a feeling of which measurements 1094 00:49:27,840 --> 00:49:29,640 are important and which are not is to try 1095 00:49:29,640 --> 00:49:33,980 to estimate a model, a linear model, where you get the w, 1096 00:49:33,980 --> 00:49:37,650 but ideally we would like to get a w, which has many zeros. 1097 00:49:37,650 --> 00:49:40,189 You don't want to fumble with what's small and what's not. 1098 00:49:40,189 --> 00:49:42,480 So if you do least squares the way I showed you before, 1099 00:49:42,480 --> 00:49:44,047 you would get a w. 1100 00:49:44,047 --> 00:49:44,880 Then you would get-- 1101 00:49:44,880 --> 00:49:47,820 most of them, you can check, will not be zero. 1102 00:49:47,820 --> 00:49:50,254 In fact, none of them will be zero in general. 1103 00:49:50,254 --> 00:49:52,670 And so now you have to decide what's small and what's big, 1104 00:49:52,670 --> 00:49:53,794 and that might not be easy. 1105 00:49:56,740 --> 00:49:57,975 Oops, what happened here? 1106 00:50:05,590 --> 00:50:10,797 So funny enough, this is the name I found-- 1107 00:50:10,797 --> 00:50:12,380 I don't remember the name of the book. 1108 00:50:12,380 --> 00:50:14,250 It's the name that was used to describe 1109 00:50:14,250 --> 00:50:16,759 the process of variable selection, which 1110 00:50:16,759 --> 00:50:18,300 is a much harder problem, because you 1111 00:50:18,300 --> 00:50:19,330 don't just want to make predictions. 1112 00:50:19,330 --> 00:50:20,705 You want to go back and check 1113 00:50:20,705 --> 00:50:22,640 how you make the prediction.
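A minimal numerical check of this last point, on made-up synthetic data (the sizes, the noise level, and the regularization value are all assumptions): even when only a few variables truly matter, the regularized least squares weight vector generically has no entries that are exactly zero, so there is nothing you can read off directly.

import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 1.0]                  # only the first 3 measurements actually matter
y = X @ w_true + 0.1 * rng.normal(size=n)

lam = 1e-2
w_ls = np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ y)
print(np.sum(w_ls == 0.0))                     # typically 0: no exact zeros to read off
print(np.round(w_ls, 3))                       # the rest are small but nonzero: what counts as "small"?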
1114 00:50:22,640 --> 00:50:28,230 And so it's very easy to start to get overfitting and start 1115 00:50:28,230 --> 00:50:30,690 to try to squeeze the data until you get some information. 1116 00:50:30,690 --> 00:50:32,231 So it's good to have a procedure that 1117 00:50:32,231 --> 00:50:34,950 will give you a somewhat clean way to extract 1118 00:50:34,950 --> 00:50:36,060 the important variables. 1119 00:50:36,060 --> 00:50:37,980 Again, you can think of this as a-- basically, 1120 00:50:37,980 --> 00:50:40,500 I want to build an f, but I also want 1121 00:50:40,500 --> 00:50:43,890 to come up with a list or even better weights that tell me 1122 00:50:43,890 --> 00:50:45,600 which variables are important. 1123 00:50:45,600 --> 00:50:47,150 And often this will be just a list, 1124 00:50:47,150 --> 00:50:49,560 which is much smaller than d, so that I can go back 1125 00:50:49,560 --> 00:50:52,740 and say, oh, measurement 33, 34, and 50-- what are they? 1126 00:50:52,740 --> 00:50:55,494 I could go in and look at it. 1127 00:50:55,494 --> 00:50:57,660 Notice that there is also a computational reason why 1128 00:50:57,660 --> 00:50:58,743 this would be interesting. 1129 00:50:58,743 --> 00:51:01,530 Because of course, if d here is 50,000-- 1130 00:51:01,530 --> 00:51:03,600 and what I see is that, in fact, I 1131 00:51:03,600 --> 00:51:07,710 can throw away most of these measurements and just keep 10-- 1132 00:51:07,710 --> 00:51:09,300 then it means that I can hopefully 1133 00:51:09,300 --> 00:51:10,980 reduce the complexity of my computation, 1134 00:51:10,980 --> 00:51:12,905 but also the storage of the data, for example. 1135 00:51:12,905 --> 00:51:14,780 If I have to send you the dataset after I've 1136 00:51:14,780 --> 00:51:16,410 done this thing, I just have to send you 1137 00:51:16,410 --> 00:51:17,520 this teeny tiny matrix. 1138 00:51:20,160 --> 00:51:22,210 So interpretability is one reason, 1139 00:51:22,210 --> 00:51:27,690 but the computational aspect could be another one. 1140 00:51:30,250 --> 00:51:33,510 Another reason that I don't want to talk too much about is also-- 1141 00:51:33,510 --> 00:51:36,270 remember that we had this idea, where 1142 00:51:36,270 --> 00:51:39,600 we said we could augment the complexity of a model 1143 00:51:39,600 --> 00:51:43,770 by inventing features, and we said 1144 00:51:43,770 --> 00:51:47,100 do I always have to pay the price of making it big? 1145 00:51:47,100 --> 00:51:49,339 Well, I basically-- if you want-- 1146 00:51:49,339 --> 00:51:50,130 I was pointing at-- 1147 00:51:50,130 --> 00:51:52,566 I said, no, not always, because I was thinking of kernels. 1148 00:51:52,566 --> 00:51:54,690 These, if you want, give you another way potentially 1149 00:51:54,690 --> 00:51:57,150 to go around it, in which what you do is that, first of all, 1150 00:51:57,150 --> 00:51:59,252 you explode the number of features. 1151 00:51:59,252 --> 00:52:00,960 You take many, many, many, many, and then 1152 00:52:00,960 --> 00:52:02,940 you use this as a preliminary step 1153 00:52:02,940 --> 00:52:07,050 to shrink them down to a more reasonable number. 1154 00:52:07,050 --> 00:52:08,700 Because it's quite likely that among 1155 00:52:08,700 --> 00:52:10,980 these many, many measurements, some of them 1156 00:52:10,980 --> 00:52:13,170 would just be very correlated, or uninteresting, 1157 00:52:13,170 --> 00:52:15,240 or so on and so forth.
1158 00:52:15,240 --> 00:52:18,480 So this dimensionality reduction or 1159 00:52:18,480 --> 00:52:21,030 computational or interpretable model perspective 1160 00:52:21,030 --> 00:52:25,860 is what stands behind the desire to do something like this. 1161 00:52:25,860 --> 00:52:28,450 So let's say one more thing and then we'll stop. 1162 00:52:31,570 --> 00:52:35,330 So suppose that you have infinite computational power. 1163 00:52:35,330 --> 00:52:39,450 So the computations are not your concern, 1164 00:52:39,450 --> 00:52:41,959 and you want to solve this problem. 1165 00:52:41,959 --> 00:52:42,750 How will you do it? 1166 00:52:46,358 --> 00:52:50,280 Suppose that you have the code for least squares. 1167 00:52:50,280 --> 00:52:52,450 And you can run it as many times as you want. 1168 00:52:52,450 --> 00:52:55,352 How would you go and try to estimate which 1169 00:52:55,352 --> 00:52:56,560 variables are more important? 1170 00:52:56,560 --> 00:52:58,864 AUDIENCE: [INAUDIBLE] possibility of computations. 1171 00:52:58,864 --> 00:53:00,530 LORENZO ROSASCO: That's one possibility. 1172 00:53:00,530 --> 00:53:02,760 What you do is that you have-- 1173 00:53:02,760 --> 00:53:05,270 you start and look at all single variables. 1174 00:53:05,270 --> 00:53:09,120 And you solve least squares for all single variables. 1175 00:53:09,120 --> 00:53:12,260 Then you take all couples of variables. 1176 00:53:12,260 --> 00:53:14,510 Then you get all triplets of variables. 1177 00:53:14,510 --> 00:53:17,797 And then you find which one is best. 1178 00:53:17,797 --> 00:53:19,380 From a statistical point of view there 1179 00:53:19,380 --> 00:53:21,530 is absolutely nothing wrong with this, 1180 00:53:21,530 --> 00:53:23,330 because you're trying everything. 1181 00:53:23,330 --> 00:53:26,510 And at some point, you find what's the best. 1182 00:53:26,510 --> 00:53:28,910 The problem is that it's combinatorial. 1183 00:53:28,910 --> 00:53:33,680 And you see that when you're in dimension more than-- 1184 00:53:33,680 --> 00:53:36,950 a very few, it's huge. 1185 00:53:36,950 --> 00:53:39,740 So it's exponential. 1186 00:53:39,740 --> 00:53:43,670 So it turns out that doing what you just 1187 00:53:43,670 --> 00:53:46,600 told me to do, which is what I asked you to tell me to do, 1188 00:53:46,600 --> 00:53:48,320 which is this brute force approach, 1189 00:53:48,320 --> 00:53:52,400 is equivalent to doing something like this, which is again 1190 00:53:52,400 --> 00:53:54,470 a regularization approach. 1191 00:53:54,470 --> 00:53:57,400 Here I put what is called the zero norm. 1192 00:53:57,400 --> 00:53:59,730 The zero norm is actually not a norm. 1193 00:53:59,730 --> 00:54:01,685 And it is just a functional. 1194 00:54:01,685 --> 00:54:04,040 It's a thing that does the following. 1195 00:54:04,040 --> 00:54:06,320 If I give you a vector, you have to return 1196 00:54:06,320 --> 00:54:10,250 the number of components different from zero, only that. 1197 00:54:10,250 --> 00:54:11,930 So you go inside and look at each entry, 1198 00:54:11,930 --> 00:54:14,630 and you tell if they are different from zero. 1199 00:54:14,630 --> 00:54:17,800 This is absolutely not convex. 1200 00:54:17,800 --> 00:54:21,270 And so this is the reason why this problem is equivalent-- 1201 00:54:21,270 --> 00:54:24,530 it becomes computationally not feasible. 1202 00:54:24,530 --> 00:54:26,480 So perhaps, we can stop here.
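For completeness, a brute-force sketch of the exhaustive search just described, which, under the usual identification, is the same as minimizing the least squares error plus a penalty on the zero norm of w; the data layout and the penalty value lam0 are assumptions for illustration. The point is the cost: it visits every subset of the d variables, so it is exponential in d.

import itertools
import numpy as np

def best_subset_l0(X, y, lam0=0.1):
    # Exhaustive search for  min_w (1/n) * ||X w - y||^2 + lam0 * ||w||_0  over all supports.
    n, d = X.shape
    best_score, best_vars = np.mean(y ** 2), ()               # start from the empty model, w = 0
    for k in range(1, d + 1):
        for subset in itertools.combinations(range(d), k):    # all subsets of size k
            Xs = X[:, list(subset)]
            w, *_ = np.linalg.lstsq(Xs, y, rcond=None)        # least squares restricted to this subset
            score = np.mean((Xs @ w - y) ** 2) + lam0 * k
            if score < best_score:
                best_score, best_vars = score, subset
    return best_vars                                          # 2^d subsets in total: only feasible for tiny d

The greedy methods and convex relaxations mentioned next are the two standard ways to approximate this search at a tractable cost.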
1203 00:54:26,480 --> 00:54:29,810 And what I want to show you next is essentially-- 1204 00:54:29,810 --> 00:54:31,270 if you have this-- 1205 00:54:31,270 --> 00:54:33,890 and you know that in some sense, this is what you would like 1206 00:54:33,890 --> 00:54:35,810 to do, if you could do it computationally, 1207 00:54:35,810 --> 00:54:37,100 but you cannot-- 1208 00:54:37,100 --> 00:54:41,330 so how can you find approximate versions of this 1209 00:54:41,330 --> 00:54:42,770 that you can compute in practice? 1210 00:54:42,770 --> 00:54:44,769 And we're going to discuss two ways of doing it. 1211 00:54:44,769 --> 00:54:48,820 One is greedy methods and one is convex relaxations.