The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

LORENZO ROSASCO: If you remember, we did local methods and bias-variance. Then we passed to global regularization methods -- least squares, linear least squares, kernel least squares, computations, and modeling. And that's where we were at. And then we moved on and started to think about more intractable models, and we were starting to think of the problem of variable selection, OK? And the way we posed it is that you are going to consider a linear model, and use the weight associated to each variable as the strength of the corresponding variable, which you can view as a measurement. And your game is not only to build good predictions from the measurements, but also to tell which measurements are interesting, OK? And so here, the term "relevant variable" is going to be related to the contribution to the predictive power of the corresponding function, OK? So that's how we measure relevance of a variable. So we looked at this problem with the funny name. And then we kind of agreed that there seems to be a default approach, which is basically based on trying all possible subsets, OK? So this is also called best subset selection. Variable selection is also sometimes called best subset selection. And this gives you a feeling that what you should do is try all possible subsets and check the one which is best with respect to your data, which, again, would be like a trade-off between how well you fit the data and how many variables you have, OK? And what I told you last was that you could actually see that this trying of all possible subsets is related to a form of regularization, one that looks very similar to the one we saw until a minute ago.
The main difference here is that I put fw, but fw is just the usual linear function. The only difference is that here, rather than the square norm, we put this functional that is called the 0 norm, which is a functional that, given a vector, returns the number of entries in the vector which are different from 0, OK? It turns out that if you were to minimize this, you would be solving the best subset selection problem. The issue here -- another manifestation of the complexity of the problem -- is the fact that this functional is non-convex. So there is no known polynomial-time algorithm to actually find a solution. It comes to my mind that somebody made a comment during the break. Notice that here I'm passing a bit quickly over a refinement of the question of best subset selection, which is related to: is there a unique subset which is good? Is there more than one? And if there's more than one, which one should I pick, OK? In practice, these questions arise immediately, because if you have two measurements that are very correlated, or even more, if they're perfectly correlated -- if you just build the measurements, you might build, out of two measurements, a third measurement which is just a linear combination of the first two. So at that point, what would you want to do? Do you want to keep the minimum number of variables, or the biggest possible number of variables? And you have to decide, because all these variables are, to some extent, completely dependent, OK? So for now, we just keep to the case where we don't really worry about this, OK? We just say: among the good ones, we want to pick one. A harder question would be: pick all of them, or pick one of them. And if you wanted to pick one of them, you would have to tell me which one you want, according to which criterion, OK? So the problem we're concerned with now is: OK, now that we know that we might want to do this, how can you do it in an approximate way that will be good enough, and what does it mean, good enough?
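For reference, the regularized form of best subset selection just described can be written out as follows. This is a sketch in the notation used so far, assuming the same square-loss data term as in the earlier least squares discussion; the exact normalization is a detail of the slides.

\[
\min_{w \in \mathbb{R}^{d}} \ \frac{1}{n} \sum_{i=1}^{n} \big( y_i - w^\top x_i \big)^2 \;+\; \lambda \, \| w \|_0 ,
\qquad \| w \|_0 = \#\{\, j : w_j \neq 0 \,\}.
\]

Both terms are easy to write down; the difficulty mentioned above is that the second one is non-convex, which is what makes the exact problem intractable.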
So the simplest way -- again, we can try to think about it together -- is kind of a greedy version of the brute force approach. The brute force approach was: start from all single variables, then all couples, then all triplets, and so on, OK? And this doesn't work computationally. So, just to wake up: how could you twist this approach to make it approximate, but computationally feasible? Let's keep the same spirit. Let's start from few, and then try to add more. So the general idea is: I pick one. And once I pick one, I pick another one, just keeping the one I already picked. And then I pick another one, and then another one, and so on. This, of course, will not be the exhaustive search from before. It's probably doable. There are a bunch of different ways you can do it. And you can hope that, under some conditions, you might be able to prove that it's not too far away from the brute force approach. And this is kind of what we would want to do, OK? So we will have a notion of residual. At the first iteration, the residual will be just the output. So just think of the first iteration. You get the output vector, and you want to explain it. You want to predict it well, OK? So what you do is that you first check the one variable that gives you the best prediction of this guy, and then you compute the prediction. Then, at the next round, you want to discount what you have already explained. What you do is that basically you take the actual output minus your prediction, and you get the residual. And then you try to explain that. That's what's left to explain, OK? So now you check for the variable that best explains this remaining bit.
Then you add this variable to the ones you already have, and you have a new notion of residual, which is what's left after what you explained in the first round and what you added in the second round. And then there's still something left, and you keep on going. If you let this thing go for enough time, you will have the least squares solution. At the end of the day, you will have explained everything. At each round, notice that you might or might not decide to put a variable back in, OK? So you might have that at each step you add one new variable, or you might take multiple steps but end up with fewer variables than the number of steps. No matter what, the number of steps will be related to the number of variables that are active in your model, OK? Does it make sense? This is the wordy version, but now we go into the details, OK? But this is it, roughly speaking. So, first round: you try to explain something, then you see what's left to explain. You keep the variable that explains the rest, and then you iterate. I'm not sure I used the word, but it's important. The key word here is "sparsity," OK? The fact that I'm assuming my model to depend on just a few vectors -- sorry, a few entries. So it's a vector with many zero entries. Sparsity is the key word for this property, which is a property of the problem. And so I build algorithms that will try to find sparse solutions explaining my data, and this is one way. So let's look at this list. You define the notion of residual as the thing that you want to try to explain: at the first round it will be the output, and at the second round it will be what's left to explain after your prediction. You have a coefficient vector, OK, because we're building a linear function. And then you have an index set, which is the set of variables which are important at that stage. So these are the three objects that you have to initialize.
So at the first round, the coefficient vector is going to be 0, the index set is going to be empty, and the residual is just going to be the output vector. Then you find the best single variable, and you update the index set: you add that variable to the index set. To include that variable, you compute the coefficient vector. And then you update the residual, and then you start again, OK? If you want, here I show you the first example -- just to give you an idea. So first of all, notice this, OK? Forget about anything that's written here. Just look at this matrix. The output vector is of the same length as a column of the matrix, right? So each column of the matrix will be related to one variable. So what you're going to do is try to see which of these best explains my output vector, and then you're going to define the residual and keep on going, OK? So in this case, for example, you can ask: which of the two directions, X1 and X2, best explains the vector Y, OK? This is the case where it's simple. I basically have this one direction. One variable is this one. Another variable is this one. And then I have that vector. I want to know which direction I should pick to best explain my Y. Which one do you think I should pick?
AUDIENCE: X1.
LORENZO ROSASCO: I should pick X1? OK. This projection here will be the weight I have to put on X1 to get a good prediction. And then what's the residual? Well, I have to take this X1 -- I have to take Yn, I have to subtract that, and this is what I have left. So this is it in simple terms, OK? So that's what we want to do. We said it with hands, we said it with words. Here is, more or less, the pseudocode, OK?
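As a companion to the pseudocode, here is a minimal sketch of that loop in Python. The names are hypothetical, it assumes a data matrix X with nonzero columns, and it implements the basic, non-orthogonal variant described here, where previously computed coefficients are never updated.

```python
import numpy as np

def matching_pursuit(X, y, T):
    """Basic (non-orthogonal) matching pursuit: T greedy iterations."""
    n, d = X.shape
    w = np.zeros(d)          # coefficient vector, starts at zero
    S = set()                # index set of selected variables
    r = y.copy()             # residual, starts at the output vector
    for _ in range(T):
        # one-dimensional least squares fit of the residual on each column
        a = X.T @ r / np.sum(X**2, axis=0)
        errs = np.sum((r[:, None] - X * a) ** 2, axis=0)
        k = int(np.argmin(errs))   # best column = most correlated with the residual
        S.add(k)
        w[k] += a[k]               # add the new coefficient, keep the old ones
        r = r - a[k] * X[:, k]     # discount what has just been explained
    return w, S
```

As said above, if you let it run long enough it ends up explaining everything, approaching the least squares fit; the free parameter is the number of iterations T.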
It's a bit boring to read, but you can see that it's four lines of code anyway. Now that we've said it 15 times, it probably won't be that hard to read, because what you see is that you have a notion of residual, you have the coefficient vector, and you have the index set. This is empty, this is all 0's, and this, at the first round, is just the output. Then you start. And the free parameter here is T, the number of iterations, OK? It's going to play the role of lambda, so to say. What you do is -- OK, you have, for each j -- so, just notation: j runs over the variables, and capital Xj will be the column of the data matrix that corresponds to the j-th variable, OK? And then what you do in this line, which I expand here, is to find the coefficient -- sorry, find the error that corresponds to the best variable, OK? If you look, it turns out that finding the column best correlated with the output, or the residual, is equivalent to finding the column that best explains it. These two things are the same, so here I write the equivalence. Pick the one that you prefer, OK? Either you say, I find the column that is best correlated with the residual, or you find the column that best explains the residual in the sense of least squares, OK? These two things are equivalent. Pick the one that you like. And that's the content of this line. Then you select the index of that column. So you solve this problem for each column. It's an easy problem -- it's a one-dimensional problem. And then you pick the one column that you like the most, which is the one that gives you the best correlation, a.k.a. the smallest least squares error. Then you add this k to the index set. And then, in this case, it's very simple.
I'm just not going to recompute anything. So, suppose that -- you remember the coefficient vector, which was all 0's, OK? Then at the first round, I compute one number, the solution with, say, the first coordinate, for example. And then I add a number in that entry, OK? So this is the canonical basis vector, OK? It has all 0's, but 1 in position k. So here I put this number. This is just a typo. And then what you do is that you sum them up, OK? So you have all 0's, just one number here at the first iteration, then the other one. And then you add this one there, and you keep on going, OK? This is the simplest possible version. And once you have this, you have this vector. This is a long vector. You multiply this -- sorry, this should be Xn. Maybe we should take note of the typos, because I'm never going to remember all of them. And then what you do is that you just discount what you have explained so far from the residual. So you already explained some part of the residual; now you discount this new part, you define the new residual, and then you go back. This method is -- so "greedy approaches" is one name. As often happens in machine learning and statistics and other fields, things get reinvented constantly, a bit because people come to them from a different perspective, a bit because people decide that studying and reading is not a priority sometimes. And so this one algorithm is often called greedy -- it's one example of greedy approaches. It's sometimes called matching pursuit. It's very much related to so-called forward stagewise regression; that's how it's called in statistics. And it has a bunch of other names. Now, this version is just the basic version -- it's the simplest version. This step typically remains; these two steps can be changed slightly, OK? For example, can you think of another way of doing this? Let me just give you a hint.
In this case, what you do is that you select a variable and you compute the coefficient. Then you select another variable and compute the coefficient for the second variable, but you keep the coefficient you already computed for the first variable. That first coefficient never knew that you took another one, because you hadn't taken it yet. So from this comment, do you see how you could change this method to somewhat fix this aspect? Do you see what I'm saying? I would like to change this one line where I compute the coefficient, and perhaps even this one line where I compute the residual, to account for the fact that this method basically never updates the weights it computed before; you only add a new one. And this seems potentially not a good idea, because when you have two variables, it's better to compute the solution with both of them. So what could you do?
AUDIENCE: [INAUDIBLE]
LORENZO ROSASCO: Right. So what you could do is essentially what is called orthogonal matching pursuit. You would take this set, and now you would solve a least squares problem with all the variables that are in the index set up to that point. You recompute everything. And now you have to solve not a one-dimensional problem, but an n times k-dimensional problem, where k -- I don't know, k is a bad name -- is the size of the index set, which could be T or less than T, OK? And then at that point, you also want to redefine this, because you're not discounting anymore what you already explained; each time you're recomputing everything. So you just want to do Yn minus the prediction, OK? So this algorithm is the one that actually has better properties. It works better. You pay a price, because each time you have to recompute the least squares solution, and when you have more than one variable inside, the problems become bigger and bigger.
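A minimal sketch of that orthogonal variant, under the same assumptions as the earlier snippet (hypothetical names; the least squares subproblem is solved with np.linalg.lstsq for simplicity):

```python
import numpy as np

def orthogonal_matching_pursuit(X, y, T):
    """Orthogonal matching pursuit: refit all selected variables at each step."""
    n, d = X.shape
    w = np.zeros(d)
    S = []                    # index set of selected variables
    r = y.copy()              # residual
    for _ in range(T):
        # select the column most correlated with the current residual
        corr = np.abs(X.T @ r) / np.sqrt(np.sum(X**2, axis=0))
        k = int(np.argmax(corr))
        if k not in S:
            S.append(k)
        # re-solve least squares on all selected columns, not one at a time
        w_S, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
        w = np.zeros(d)
        w[S] = w_S
        # residual is the output minus the full current prediction
        r = y - X[:, S] @ w_S
    return w, S
```

The design trade-off is the one described above: each iteration now solves a small least squares problem on the selected columns instead of a one-dimensional one.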
So if you stop after a few iterations, it's fine. But if you start to have many iterations, each time you have to solve a linear system, so the complexity is much higher. The basic one, as you can imagine, is super fast. So that's it. It turns out that this method -- called matching pursuit, or forward stagewise regression -- is one way to approximate the zero-norm solution. And one can prove exactly in which sense you can approximate it, OK? So I think this is the one that we might give you this afternoon, right?
AUDIENCE: Orthogonal.
LORENZO ROSASCO: Oh, yeah, the orthogonal version, the nicer version. The other way of doing this is the one that basically says: look, here what you're doing is just counting the number of entries different from 0. What if you were to replace this with something that does a bit more -- something that not only counts, but actually sums up the weights? So, if you want, in one case you just check: if a weight is different from 0, you count it as 1; otherwise, you count it as 0. Here you actually take the absolute value. So instead of summing up binary values, you sum up real numbers, OK? This is what is called the L1 norm. So each weight doesn't just count as 1 when it's nonzero; it counts for its absolute value. It turns out that this term, which you can imagine -- the absolute value looks like this, right, and now you're just summing them up -- is actually convex. So you're summing up two convex terms, and the overall functional is convex. And if you want, you can think of this a bit as a relaxation of the zero norm. We say "relaxation" in the sense of relaxing a strict requirement. I talked about relaxation before when I said that instead of binary values, you take real values and you optimize over the reals instead of over the binary values. Here it's kind of the same thing.
Instead of restricting yourself to this functional, which is binary-valued, you allow yourself to relax and get real numbers. And what you gain is that the corresponding optimization problem is convex, and you can try to solve it. It is still not something where we can do what we did before -- we cannot just take derivatives and set them equal to 0, because this term is not smooth. The absolute value looks like this, which means that here, around the kink, it's not differentiable. But we can still use convex analysis to try to get the solution, and actually the solution doesn't look too complicated. Getting there requires a bit of convex analysis, but there are techniques. The ones that are trendy these days are called forward-backward splitting, or proximal methods, to solve this problem. And apparently I'm not even going to show them to you, because they're not on the slides. But essentially, it's not too complicated. Just to tell you in one word what they do: they do gradient descent on the first term, and then at each step of the gradient they threshold. So they take a step of the gradient, get a vector, and look inside the vector. If an entry is smaller than a threshold that depends on lambda, I set it equal to 0; otherwise, I let it go, OK? I didn't put it on the slides, I don't know why, because it's really a one-line algorithm. It's a bit harder to derive, but it's very simple to check.
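For concreteness, here is a minimal sketch of that thresholded gradient iteration -- the proximal (forward-backward) scheme for the least squares plus L1 objective. The names are hypothetical, the step size is chosen from the spectral norm of X, and the update uses the standard soft-thresholding step, a slight refinement of the "set small entries to 0" description above.

```python
import numpy as np

def l1_regularized_least_squares(X, y, lam, n_iter=1000):
    """Proximal iteration for (1/n)||y - Xw||^2 + lam * ||w||_1."""
    n, d = X.shape
    step = n / (2 * np.linalg.norm(X, 2) ** 2)   # step size from the Lipschitz constant
    w = np.zeros(d)
    for _ in range(n_iter):
        # gradient step on the smooth (least squares) term
        grad = (2.0 / n) * X.T @ (X @ w - y)
        v = w - step * grad
        # soft thresholding: shrink, and set small entries exactly to 0
        w = np.sign(v) * np.maximum(np.abs(v) - step * lam, 0.0)
    return w
```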
So let's talk one second about this picture, and then let me tell you about what I'm not telling you. Hiding behind everything I said so far, there is a linear system, right? There is a linear system that is n by p, or d, or whatever you want to call it, with the number p of variables. And our game so far has always been: look, we have a linear system that can be non-invertible, or, even if it is invertible, it might have a bad condition number, and I want to try to find a way to stabilize the problem. The first time around, we basically replaced the inverse with an approximate inverse. That's the classical way of doing it. Here, we're making another assumption. We're basically saying: look, this vector does look very long, so the problem seems ill-posed. But in fact, if only a few entries were different from 0, and if you were able to tell me which ones they are, you could go in, delete all the other entries, delete all the corresponding columns, and then you would have a matrix that no longer looks short and wide -- it would look skinny and tall, OK? And that would probably be easier to solve. It would be one of the linear systems that we know how to solve. So what we described so far is a way to find solutions of linear systems where the number of equations is smaller than the number of unknowns, and which, by themselves, cannot be solved, under the extra assumption that, in fact, there are fewer unknowns than what it looks like. It's just that I'm not telling you which ones they are, OK? You see, if I could tell you, you would just get back to a very easy problem, where the number of unknowns is much smaller, OK? So this is a mathematical fact, OK? And these questions were open, because -- well, now they're not, because people have been talking about this stuff constantly for 10 years. But one question is: how much does this assumption buy you? For example, could you prove that in certain situations, even if you don't know which entries are nonzero, you could actually solve this problem exactly? If I give them to you, you can do it, right? But is there a way to try to guess them in some way, so that you can do almost as well, or with high probability as well, as if I had told you them in advance? And it turns out that the answer is yes, OK?
And the answer is basically that if the number of entries that are different from 0 is small enough, and the columns corresponding to those variables are not too correlated -- not too collinear, so they're distinguishable enough that when you perturb the problem a little bit nothing changes -- then you can solve the problem exactly, OK? So this, on the one hand, is exactly the kind of theory that tells you why using greedy methods and convex relaxations will give you a good approximation to L0, because that's basically what this story tells you. People have also been using this observation -- and this is the part that is interesting for us -- in a slightly different context, which is the following. You see, for us, Y and X we don't choose; we get them. And whatever they are, they are. If the columns are nice, nice. But if they're not nice, sorry, you have to live with it, OK? But there are settings where you can think of the following. Suppose that you have a signal, and you want to be able to reconstruct it. The classical Shannon sampling theorem results basically tell you that if you have something which is band-limited, you have to sample at twice the maximum frequency. But this is kind of worst case, because it's assuming that all the bands, all the frequencies, are full. Suppose that now we play -- it's an analogy, OK? -- and I tell you: oh, look, it's true, this is the maximum frequency, but there's only one other frequency, this one. Do you really need to sample that much, or can you do with much less?
And it turns out that, answering this question, the story here is that yes, you can do with much less. Ideally, what you would like to say is: well, instead of sampling at twice the maximum frequency, if I have just four frequencies different from 0, I should only need eight samples, OK? That would be ideal, but you would have to know which ones they are. You don't, so you pay a price, but it's just logarithmic. So you basically have a new sampling theorem that tells you that you don't need to sample that much. You can't sample that little either. Say the maximum frequency is d, and the number of non-zero frequencies is s. With the classical result, you would have to say 2d. Ideally, we would like to say 2s. What you can actually say is something like 2s log d. So you pay a log d price, because you didn't know where they are. But still, it's much less than being linear in the dimension. So essentially, the field of compressed sensing has been built around this observation, and the focus is slightly different. Instead of saying I want to do statistical estimation where I just get the data, you say: I have a signal, and now I view this as a sensing matrix that I design, with the property that I know it will allow me to do this estimation well. So you basically assume that you can choose those vectors in certain ways, and then you can prove that you can reconstruct with much fewer samples, OK? And this has been used, for example -- I never remember -- for, as you call it, MEG? In what? No, MRI, MRI. Two things I didn't tell you about, but that are worth mentioning: suppose that what I tell you is that it's actually not individual entries that are 0, but groups of entries that are 0 -- for example, because each entry belongs to a biological process. So I have genes, but genes are actually involved in biological processes. So there is a group of genes that is doing something, another group of genes doing something else, and what I want to select is not individual genes, but groups. Can you twist this stuff in such a way that you select groups? Yes. What if the groups are actually overlapping? How do you want to deal with the overlaps? Do you want to keep the overlap? Do you want to cut the overlap? What if you have a tree structure, OK? What do you do with this?
So first of all, who gives you this information, OK? And then, if you have the information, how do you use it? How are you going to use it? See, this is the whole field of structured sparsity. It's a whole industry of building penalties other than L1 that allow you to incorporate this kind of prior information. If you want, just as in kernel methods the kernel was the place where you could incorporate prior information, here, in this field, you can do that by designing a suitable regularizer. And then a lot of the reasoning is the same; everything translates, with these new regularizers. The last bit is that, with a bit of a twist, some of the ideas I showed you now, which are basically related to vectors and sparsity, translate to more general contexts, in particular that of matrices that have low rank, OK? The classical example is matrix completion, OK? I give you a matrix, but I actually delete most of the entries of the matrix. And I tell you: OK, estimate the original matrix. Well, how can I do that, right? It turns out that if the matrix itself has very low rank, so that many of the columns and rows you see are actually related to each other, and the way the entries to delete were chosen was not malicious, then you might actually be able to fill in the missing entries, OK? And the theory behind this is very similar to the theory that allows you to fill in the right entries of the vector, OK? Last bit -- PCA in 15 minutes. So what we've seen so far was the very hard problem of variable selection. It is still a supervised problem, where I give you labels, OK? The last bit I want to show you is PCA, which is the case where I don't give you labels. And what you try to answer is actually -- perhaps it's a simpler question.
Because you don't want to select one of the directions, but you would like to know if there are directions that matter. So you allow yourself, for example, to combine the different directions in your data, OK? This question is interesting for many, many reasons. One is data visualization, for example. You have stuff that you cannot look at, because you have, for example, digits in very high dimensions. You would like to look at them. How do you do it? Well, you would like to find directions: a first direction to project everything onto, a second direction, a third direction, because then you can plot them and look at them, OK? And this is one visualization of these images here. I don't remember the code now -- it's written here. You have different colors, and what you see is that this actually did a good job. Because what you expect, if you do a nice visualization, is that similar numbers, or the same numbers, end up in the same regions, and perhaps similar numbers are close, OK? So this is one reason why you might want to do this. Another reason is that you might want to reduce the dimensionality of your data, just to compress them, or because you might hope that certain dimensions don't matter or are simply noise. And so you just want to get rid of them, because this could be good for statistical reasons. OK, so the game is going to be the following. X is the data space, which is going to be R^D. And we want to define a map M that sends vectors of length D into vectors of length k. So k is going to be my reduced dimensionality. And what we're going to do is build a basic method to do this, which is PCA, and we're going to give a purely geometric view of PCA, OK? And this is going to be done by taking first the case where k is equal to 1, and then iterating to go up.
So in the first case, we're going to ask: if I give you vectors which are D-dimensional, how can I project them onto one dimension, with respect to some criterion of optimality, OK? And here what we ask is: we want to project the data onto the one dimension that gives the best possible reconstruction error. So I think I had it before. Do I have it -- no, no. This was done for another reason, but it's useful now. If you have this vector and you want to project it onto this direction, and this is a unit vector, what do you do? I want to know how to write this vector here, the projection. What you do is that you take the inner product between Yn and X. You get a number, and that number is the length you want to assign to X1, OK? So suppose that w is the direction, and I have a vector x, and I want to compute the projection, OK? What do I do? I take the inner product of x and w, and this is the length I have to assign to the vector w, which is unit norm, OK? So this is the best approximation of xi in the direction of w. Does it make sense? I fix a w, and I want to know how well I can describe x. I project x onto that direction, and then I take the difference between x and the projection, OK? And then I sum over all points. And then I check, among all possible directions, for the one that gives me the best error. So suppose that this is your data set. Which direction do you think is going to give me the best error?
AUDIENCE: Keep going.
LORENZO ROSASCO: Well, if you go in this direction, you can explain most of the stuff, OK? You can reconstruct it best. So this is going to be the solution. So the question here is really: how do you solve this problem? You could try to minimize with respect to w, but it's not immediately clear what kind of computation you have. And if we massage this a little bit, it turns out that it is actually exactly an eigenvalue problem. So that's what we want to do next.
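Written out (with the normalization left as in the slides), the one-dimensional problem just described is: find the unit-norm direction that minimizes the average reconstruction error,

\[
\min_{\|w\| = 1} \ \frac{1}{n} \sum_{i=1}^{n} \big\| \, x_i - (w^\top x_i)\, w \, \big\|^2 ,
\]

where \((w^\top x_i)\, w\) is the projection of \(x_i\) onto the direction \(w\).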
So conceptually, what we want to do is what I said here and nothing more. I want to find the single direction that allows me to reconstruct best, on average, all the training set points. And now what we want to do is just check what kind of computation this entails and learn a bit more about it, OK? So this notation is just to say that the vector has norm 1, so that I don't have to fumble with the size of the vector. OK, so let's do a couple of computations. This is ideal after lunch. You just take this square and develop it, OK? And remember that w is unit norm, so when you do w transpose w, you get 1. And if you don't forget to put your square, and you just develop this, you'll see that this is an equality, OK? There is a square missing here. So you have xi squared. Then you have the cross term between xi and this, which will be w transpose xi squared. And then you would also have this term squared, but that square is w transpose xi squared times w transpose w, which is 1. And so what you see is that instead of three terms we have two, because two of them -- not cancel out, they balance each other. OK. So then I'd argue that instead of minimizing this -- because this is equal to this -- you can maximize this. Why? Well, because this term is just a constant. It doesn't depend on w at all, so I can drop it from my functional. The minimum value will be different, but the minimizer, the w that solves the problem, will be the same, OK? And then, minimizing something with a minus is the same as maximizing the same thing without the minus, OK? I won't ask if it's all good so far, because I'm scared. So what you see now is that basically, if the data were centered, this would just be a variance.
778 00:34:23,110 --> 00:34:25,739 If the data are centered, so the mean here is 0, 779 00:34:25,739 --> 00:34:29,360 you can interpret this as measuring the variance 780 00:34:29,360 --> 00:34:30,069 in one direction. 781 00:34:30,069 --> 00:34:31,651 And so you have another interpretation 782 00:34:31,651 --> 00:34:33,960 of PCA, which is the one where instead of picking 783 00:34:33,960 --> 00:34:37,370 the single direction with the best possible reconstruction, 784 00:34:37,370 --> 00:34:39,889 you're picking the direction where the variance of the data 785 00:34:39,889 --> 00:34:42,569 is biggest, OK? 786 00:34:42,569 --> 00:34:44,860 And these two points of view are completely equivalent. 787 00:34:44,860 --> 00:34:46,886 Essentially, whenever you have a square norm, 788 00:34:46,886 --> 00:34:48,469 thinking about maximizing the variance 789 00:34:48,469 --> 00:34:51,080 or minimizing the reconstruction error are two 790 00:34:51,080 --> 00:34:53,540 complementary, dual ideas, OK? 791 00:34:53,540 --> 00:34:56,090 So that's what you will be doing here. 792 00:34:56,090 --> 00:34:57,090 One more bit. 793 00:34:57,090 --> 00:34:58,080 What about computation? 794 00:34:58,080 --> 00:35:00,640 So this is-- so we can think about reconstruction. 795 00:35:00,640 --> 00:35:03,920 You can think about variance, if you like. 796 00:35:03,920 --> 00:35:05,100 What about this computation? 797 00:35:05,100 --> 00:35:07,180 What kind of computation is this, OK? 798 00:35:07,180 --> 00:35:08,600 If we massage it a little bit, we 799 00:35:08,600 --> 00:35:10,550 see that it is just an eigenvalue problem. 800 00:35:10,550 --> 00:35:12,540 So this is how you do it. 801 00:35:12,540 --> 00:35:16,670 This actually looks-- well, it's annoying, but it's very simple. 802 00:35:16,670 --> 00:35:18,110 So I wrote out all the steps. 803 00:35:18,110 --> 00:35:21,200 This is a square, so it's something times itself. 804 00:35:21,200 --> 00:35:23,030 This whole thing is a scalar, so you 805 00:35:23,030 --> 00:35:26,420 can swap the order of this multiplication. 806 00:35:26,420 --> 00:35:30,980 So you get w transpose xi, xi transpose w. 807 00:35:30,980 --> 00:35:33,740 But then the sum is only 808 00:35:33,740 --> 00:35:35,240 going to involve these terms. 809 00:35:35,240 --> 00:35:40,310 So I can move the sum inside, and this is what you get. 810 00:35:40,310 --> 00:35:47,270 So you get w transpose, 1/n times the sum of xi xi transpose, w. 811 00:35:47,270 --> 00:35:53,247 So this is just a number. w transpose xi is just a number. 812 00:35:53,247 --> 00:35:55,580 But the moment you look at something that looks like xi 813 00:35:55,580 --> 00:35:57,750 xi transpose, what is that? 814 00:35:57,750 --> 00:36:00,310 Well, just look at dimensionality, OK? 815 00:36:00,310 --> 00:36:05,137 1 times d times d times 1 gives you a number, which is 1 by 1. 816 00:36:05,137 --> 00:36:06,720 Now you're doing it the other way around. 817 00:36:06,720 --> 00:36:07,610 So what is this? 818 00:36:07,610 --> 00:36:08,771 AUDIENCE: It's a matrix. 819 00:36:08,771 --> 00:36:09,812 LORENZO ROSASCO: It's a-- 820 00:36:09,812 --> 00:36:10,286 AUDIENCE: Matrix. 821 00:36:10,286 --> 00:36:11,577 LORENZO ROSASCO: It's a matrix. 822 00:36:11,577 --> 00:36:15,590 And it's a matrix which is d by d, and it's of rank 1, OK?
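A small numerical check of the claim above, assuming NumPy and centered data: the objective (1/n) sum over i of (w transpose xi) squared is exactly the quadratic form w transpose C w, that is, the variance of the data projected along w. The variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X = X - X.mean(axis=0)              # center the data
w = np.array([1.0, 1.0, 0.0])
w = w / np.linalg.norm(w)           # a unit-norm direction

C = (X.T @ X) / len(X)              # C = (1/n) sum_i xi xi^T, a d by d matrix
proj = X @ w                        # the numbers w^T xi, one per point

print(np.mean(proj ** 2))           # variance of the projections (data are centered)
print(w @ C @ w)                    # the quadratic form: the same number
```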
823 00:36:15,590 --> 00:36:18,347 And what you do now is that you sum them all up, 824 00:36:18,347 --> 00:36:20,180 and what you have is that this quantity here 825 00:36:20,180 --> 00:36:22,180 becomes what is called a quadratic form. 826 00:36:22,180 --> 00:36:26,300 It is a matrix C, which just looks like this. 827 00:36:26,300 --> 00:36:30,810 And it's squeezed in between two vectors, w transpose and w. 828 00:36:30,810 --> 00:36:33,350 So now what you want to do is that you 829 00:36:33,350 --> 00:36:38,840 can rewrite this just this way as maximizing over w-- 830 00:36:38,840 --> 00:36:42,320 sorry, finding the unit norm vector w that 831 00:36:42,320 --> 00:36:44,475 maximizes this quadratic form. 832 00:36:44,475 --> 00:36:46,100 And at this point, you can still ask me 833 00:36:46,100 --> 00:36:47,740 who cares, because I'm just rewriting 834 00:36:47,740 --> 00:36:50,530 the same problem over and over. 835 00:36:50,530 --> 00:36:54,230 But it turns out that, essentially using 836 00:36:54,230 --> 00:36:57,390 Lagrange multipliers, and it is relatively simple to do, 837 00:36:57,390 --> 00:36:59,660 you can check-- 838 00:36:59,660 --> 00:37:03,740 oh, so boring-- that the solution of this problem 839 00:37:03,740 --> 00:37:08,060 is the eigenvector of this matrix with the largest eigenvalue, OK? 840 00:37:08,060 --> 00:37:10,490 So this you can leave as an exercise. 841 00:37:10,490 --> 00:37:13,340 Essentially, you take the Lagrangian of this and use 842 00:37:13,340 --> 00:37:16,610 a little bit of duality, and you show that the solution 843 00:37:16,610 --> 00:37:19,370 of this problem is just 844 00:37:19,370 --> 00:37:23,930 the eigenvector corresponding 845 00:37:23,930 --> 00:37:28,610 to the maximum eigenvalue of the matrix C. 846 00:37:28,610 --> 00:37:30,800 So finding this direction is just 847 00:37:30,800 --> 00:37:32,810 solving an eigenvalue problem. 848 00:37:32,810 --> 00:37:33,310 OK. 849 00:37:36,274 --> 00:37:39,850 I think I'll just do the last few of these slides; they're kind of cute. 850 00:37:39,850 --> 00:37:42,860 It's pretty simple, OK? 851 00:37:42,860 --> 00:37:46,280 So this part, all these lines after lunch, 852 00:37:46,280 --> 00:37:48,689 is a bit there because I'm nice. 853 00:37:48,689 --> 00:37:51,230 But really, the only part which is a bit more complicated 854 00:37:51,230 --> 00:37:51,950 is this one here. 855 00:37:51,950 --> 00:37:56,630 The rest is really just very simple algebra. 856 00:37:56,630 --> 00:37:58,500 So what about k equal 2? 857 00:37:58,500 --> 00:37:59,660 I'm running out of time. 858 00:37:59,660 --> 00:38:03,350 But it turns out that what you want to do 859 00:38:03,350 --> 00:38:07,382 is basically this: say that you want to look for a second direction. 860 00:38:07,382 --> 00:38:08,840 So you look at the first direction. 861 00:38:08,840 --> 00:38:11,340 You solve it, and you know that it's 862 00:38:11,340 --> 00:38:12,290 the first eigenvector. 863 00:38:12,290 --> 00:38:14,206 And then let's say that you add the constraint 864 00:38:14,206 --> 00:38:15,710 that the second direction you find 865 00:38:15,710 --> 00:38:19,124 has to be orthogonal to the first direction. 866 00:38:19,124 --> 00:38:20,540 You might not want to do this, OK?
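A sketch of the resulting computation, assuming NumPy: as the Lagrangian argument above indicates, the unit-norm maximizer of the quadratic form comes out of the eigendecomposition of C. The toy data and names are only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) * np.array([3.0, 1.0, 0.5, 0.1])   # anisotropic toy data
X = X - X.mean(axis=0)

C = (X.T @ X) / len(X)                 # second moment / covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)   # eigh, since C is symmetric; ascending eigenvalues
w_star = eigvecs[:, -1]                # eigenvector with the largest eigenvalue

# No random unit-norm direction should beat w_star on the quadratic form.
for _ in range(5):
    w = rng.normal(size=4)
    w = w / np.linalg.norm(w)
    assert w @ C @ w <= w_star @ C @ w_star + 1e-12

print("first principal direction:", w_star)
```

Here np.linalg.eigh is the natural choice because C is symmetric, so its eigendecomposition is real.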
867 00:38:20,540 --> 00:38:22,545 But if you do, if you say you add 868 00:38:22,545 --> 00:38:25,670 the orthogonality constraint, then what you can check 869 00:38:25,670 --> 00:38:28,170 is that you can repeat the argument-- 870 00:38:28,170 --> 00:38:29,534 sorry, I didn't do it here. 871 00:38:29,534 --> 00:38:31,700 It's in my notes, the ones that I have on the website, 872 00:38:31,700 --> 00:38:33,500 and the computation is kind of cute. 873 00:38:33,500 --> 00:38:36,530 And what you see is that the solution of this problem, 874 00:38:36,530 --> 00:38:38,660 which looks exactly like the one before, 875 00:38:38,660 --> 00:38:40,970 only with this additional constraint, 876 00:38:40,970 --> 00:38:44,060 is exactly the eigenvector corresponding 877 00:38:44,060 --> 00:38:46,980 to the second largest eigenvalue, OK? 878 00:38:46,980 --> 00:38:49,439 And so you can keep on going. 879 00:38:49,439 --> 00:38:50,980 And so now this gives you a way to go 880 00:38:50,980 --> 00:38:54,040 from k equal to 1 to k bigger than 1, 881 00:38:54,040 --> 00:38:56,530 and you can keep on going, OK? 882 00:38:56,530 --> 00:38:58,690 So if you're looking for the directions that 883 00:38:58,690 --> 00:39:00,565 maximize the variance or minimize the reconstruction error, 884 00:39:00,565 --> 00:39:04,339 they turn out to be 885 00:39:04,339 --> 00:39:06,380 the eigenvectors corresponding 886 00:39:06,380 --> 00:39:09,430 to the biggest eigenvalues of this matrix C, which you 887 00:39:09,430 --> 00:39:11,470 can call the second moment or covariance 888 00:39:11,470 --> 00:39:14,380 matrix of the data. 889 00:39:14,380 --> 00:39:17,350 OK, so this is more or less the end. 890 00:39:17,350 --> 00:39:20,380 This is the basic, basic, basic version of this. 891 00:39:20,380 --> 00:39:24,970 You can mix this with pretty much all the other stuff 892 00:39:24,970 --> 00:39:25,900 we said today. 893 00:39:25,900 --> 00:39:29,620 So one is, how about trying to use kernels to do a nonlinear 894 00:39:29,620 --> 00:39:30,784 extension of this? 895 00:39:30,784 --> 00:39:32,950 So here we just looked at the linear reconstruction. 896 00:39:32,950 --> 00:39:34,417 How about nonlinear reconstruction? 897 00:39:34,417 --> 00:39:36,250 So what you would do is that you would first 898 00:39:36,250 --> 00:39:38,140 map the data in some way and then 899 00:39:38,140 --> 00:39:40,820 try to find some kind of nonlinear dimensionality 900 00:39:40,820 --> 00:39:41,390 reduction. 901 00:39:41,390 --> 00:39:44,980 You see that what I'm doing here is that I'm just 902 00:39:44,980 --> 00:39:46,840 using this linear-- 903 00:39:46,840 --> 00:39:49,496 it's just a linear dimensionality reduction, just 904 00:39:49,496 --> 00:39:51,100 a linear operator. 905 00:39:51,100 --> 00:39:52,620 But what about something nonlinear? 906 00:39:52,620 --> 00:39:54,790 What if my data lie on some kind of structure 907 00:39:54,790 --> 00:39:57,240 that looks like that-- 908 00:39:57,240 --> 00:40:01,210 our beloved machine learning Swiss roll? 909 00:40:01,210 --> 00:40:03,470 Well, if you do PCA, you're 910 00:40:03,470 --> 00:40:07,950 just going to find a plane that cuts that thing somewhere, OK? 911 00:40:07,950 --> 00:40:10,270 But if you try to embed the data in some nonlinear way, 912 00:40:10,270 --> 00:40:11,920 you could try to resolve this. 913 00:40:11,920 --> 00:40:13,830 And much of the research 914 00:40:13,830 --> 00:40:17,650 done in the direction of manifold learning is about exactly this.
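Putting the last few steps together, a sketch of the k greater than 1 case discussed above, assuming NumPy: keep the eigenvectors with the largest eigenvalues and project the centered data onto them. The helper name pca_directions is illustrative, not a library function:

```python
import numpy as np

def pca_directions(X, k):
    # Top-k eigenvectors of the covariance matrix, as columns of a d by k matrix.
    Xc = X - X.mean(axis=0)
    C = (Xc.T @ Xc) / len(Xc)
    eigvals, eigvecs = np.linalg.eigh(C)   # ascending order of eigenvalues
    return eigvecs[:, ::-1][:, :k]         # reorder so the largest eigenvalues come first

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
W = pca_directions(X, k=2)                 # two orthogonal directions
Z = (X - X.mean(axis=0)) @ W               # n by k reduced representation of the data
print(Z.shape)                             # (300, 2)
```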
915 00:40:17,650 --> 00:40:21,790 Here are just a few keywords-- 916 00:40:21,790 --> 00:40:25,270 kernel PCA is the easiest version, Laplacian eigenmaps, 917 00:40:25,270 --> 00:40:28,870 diffusion maps, and so on, OK? 918 00:40:28,870 --> 00:40:32,410 I only touch quickly upon random projections. 919 00:40:32,410 --> 00:40:34,310 There is a whole literature about those. 920 00:40:34,310 --> 00:40:38,290 The idea is, again, that by multiplying the data 921 00:40:38,290 --> 00:40:41,570 by random vectors, you can keep the information in the data, 922 00:40:41,570 --> 00:40:43,120 and you might be able to reconstruct the data 923 00:40:43,120 --> 00:40:45,100 as well as preserve distances. 924 00:40:45,100 --> 00:40:52,780 Also, you can combine ideas from sparsity with ideas from PCA. 925 00:40:52,780 --> 00:40:56,126 For example, you can say, what if I want not only-- 926 00:40:56,126 --> 00:40:58,000 I want to find something like an eigenvector, 927 00:40:58,000 --> 00:41:01,210 but I would like most of the entries of the eigenvector to be 0. 928 00:41:01,210 --> 00:41:03,520 So can I add here a constraint which basically 929 00:41:03,520 --> 00:41:05,470 says, among all the unit vectors, 930 00:41:05,470 --> 00:41:07,750 find the one whose entries are mostly 0-- 931 00:41:07,750 --> 00:41:10,810 so I want to add an L0 norm or an L1 norm constraint. 932 00:41:10,810 --> 00:41:12,580 So how can you do that, OK? 933 00:41:12,580 --> 00:41:15,930 And this leads to sparse PCA and other structured matrix 934 00:41:15,930 --> 00:41:17,239 estimation problems, OK? 935 00:41:17,239 --> 00:41:19,780 So this is, again, something I'm not going to tell you about, 936 00:41:19,780 --> 00:41:21,640 but that's kind of the beginning. 937 00:41:21,640 --> 00:41:26,650 And this, more or less, brings us to the desert island, 938 00:41:26,650 --> 00:41:28,500 and I'm done.
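And a sketch of the random-projection idea mentioned above, assuming NumPy; the Gaussian construction and the 1/sqrt(k) scaling follow the usual Johnson-Lindenstrauss style, with sizes chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 1000, 50
X = rng.normal(size=(n, d))                # high-dimensional toy data

R = rng.normal(size=(d, k)) / np.sqrt(k)   # random Gaussian projection matrix
Z = X @ R                                  # n by k projected data

# Pairwise distances are approximately preserved (exactly so in expectation).
print(np.linalg.norm(X[0] - X[1]))
print(np.linalg.norm(Z[0] - Z[1]))
```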