The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at OCW.mit.edu.

LORENZO ROSASCO: I'm Lorenzo Rosasco. This is going to be a couple of hours plus of basic machine learning. And I want to emphasize the word "basic," because I really tried to stick to the essentials, or the things I would consider essential if you just want to start. Suppose you have zero knowledge of machine learning and you want to start from zero. So if you have already had classes in machine learning, you might find this a little bit boring, or at least a rehearsal of things you already know.

The idea of looking at machine learning these days comes from at least two different perspectives. The first one is for those of you, probably most, who are interested in developing intelligent systems in a very broad sense. What has happened in the last few years is a kind of data-driven revolution, where systems that are trained rather than programmed have started to be the key engines for solving tasks. Here there are some pictures that are probably outdated: robotics, Siri on our phones, self-driving cars. In all these systems, one key engine is providing data to the system to essentially try to learn how to solve the task. So one idea of this class is to try to see what it means to learn.

And the moment you start to use data to solve complex tasks, there is a natural connection with what today is called data science, which is somewhat a rapidly [INAUDIBLE] renovated version of what we used to call just statistics. So basically, we start to have tons of data of all kinds.
Data are very easy to collect, and we are starving for knowledge, trying to extract information from them. And as it turns out, many of the techniques that are used to develop intelligent systems are the very same techniques that you can use to extract relevant information and patterns from your data. So what we want to do today is to see a bit of what is in the middle: what is the set of techniques that allows you to go from data to knowledge, or to acquiring the ability to solve tasks?

Machine learning is huge these days, and there are tons of possible applications. There has been theory developed in the last 20 or 30 years that brought the field to a certain level of maturity from a mathematical point of view, and there have been tons and tons of algorithms developed. So in three hours there is no way I could give you even just a little view of what machine learning is these days. So what I did is pretty much this. I don't know if you've ever done this, but you used to make a mixtape, and you would try to pick the songs that you would bring with yourself to a desert island. That's kind of the way I thought about what to put in this [INAUDIBLE] set of slides that we're going to show in a minute. So basically I thought: what are those three, four, five learning algorithms that you should know if you know nothing about machine learning? And this is more or less one answer, at least one part. Of course there are a few songs that stayed out of the compilation, but this is one selection.

So, as such, we're going to start, as I said, simple. The idea is that this morning you're going to see a few algorithms, and I picked algorithms that are relatively simple from a computational point of view. So the math level is going to be pretty basic.
I think I'm going to use some linear algebra at some point, and maybe some calculus, but that's about it. Most of the idea here is to emphasize the conceptual ideas, the concepts. And then this afternoon there are going to be labs, basically, where you sit down, pick these kinds of algorithms, and use them, so you immediately see what they mean. So at the end of the day you should have a reasonable knowledge of whatever you see this morning.

So this is how the class is structured. It's divided into parts, plus the lab. In the first part, what we want to do is start from probably the simplest learning algorithm you can think of, and use it as an excuse to introduce the idea of the bias-variance trade-off, which, to me, is probably the, or at least one of the, most fundamental concepts in statistics and machine learning. You're going to see it in more detail in a few minutes, but it's essentially the idea that you never have enough data. The game here is not about describing the data that you have today, as much as using the data you have today as a basis of knowledge to describe the data you're going to get tomorrow. So there is an inherent trade-off between what you have at your disposal and what you would like to predict. And then it turns out that you have to decide how much you want to trust the data, and how much you want to somewhat throw away, or regularize, as they say (smooth out the information in your data), because you think that some of it is actually an accident: the data you saw today have aspects that are not really reflective of the phenomenon that produced them, just because you saw 10 points rather than 100. The basic idea here is essentially the law of large numbers.
When you toss a coin, you might find that if you toss it just 10 times, it looks like it's not a fair coin, but if you go to 100, or 1,000, you start to see that it converges to 50-50. So that's kind of what's going on here. The idea is that you want to use some kind of induction principle that tells you how much you can trust the data.

Moving on from this basic class of algorithms, we're going to consider so-called regularization techniques. I use "regularization" in a very broad sense. And here we're going to concentrate on least squares, essentially because, A, it's simple and it just reduces to linear algebra, so you don't have to know anything about convex optimization or any other kind of fancy optimization technique; and B, because it's relatively simple to move from linear models to non-parametric, non-linear models using kernels. Kernels are a big field with a lot of math, but we're just going to look at the recipe to move from simple models to complicated models.

Finally, in the last part, we're going to move a bit away from pure prediction. These first two parts are about prediction, or what is called supervised learning. In the last part we're going to move away from prediction and ask questions more like: you have data, and you want to know what the important factors in your data are. So the one keyword here is interpretability. You want to have some form of interpretability of the data at hand. You would like to know not only how you can make good predictions, but what the important factors are. So you not only want to make good predictions, but you want to know how you make good predictions, what the important information is to actually get good predictions. In this last part we're going to take a peek into this. And as I said, the afternoon is basically going to be a practical session.
It's all in MATLAB; I think there is a quick part so that, if you have never seen MATLAB before, you can play around with it a little bit. But it's very easy, and then you've got a few different proposals, I think, of things you can do. You can pick, depending on what you already know and what you want to try, and start from there and be more or less fancy. So, it goes without saying: stop me. The more we interact, the better it is.

So in the first part, as I said, the idea is to use so-called local methods as an excuse to understand the bias-variance trade-off by experience. We're going to introduce the simplest algorithm you can think of, and we're going to use it to understand a much deeper concept.

So first of all, let's just put down our setup. The idea is that we are-- so, how many of you have had a machine learning class before? All right. So you won't be too bored. The idea is that we want to do supervised learning. In supervised learning there is an input and an output, and these inputs and outputs are somewhat related; I'll be more precise in a minute. The idea is that you want to learn this input-output relationship, and all you have at your disposal is a set of inputs and outputs. So x here is an input, y is the output, and f is a functional relation between the input and the output. All you have in this puzzle are these couples: I give you an input, and then what's the corresponding output? I give you another input, and I know the corresponding output. But I don't give you all of them; you just have n of them. n is the number of points, and you call this a training set, because it will be the basis of knowledge on which you can try to train a machine to estimate this functional relationship. And the key point here is that, on the one hand, you want to describe these data.
So you want to get a functional relationship that works well on these data, so that f(x1) is close to y1, f(x2) is close to y2, and so on. But more importantly, you want an f that, given a new point that was not in the training set, will give you an output which is a good estimate of the true output corresponding to that input. This is the most important thing about the setup: the idea of so-called generalization, or, if you want, prediction. You want to really do inference. You don't want to do descriptive statistics; you really want to do inferential statistics.

So here is a very, very simple example, just to start to have something in mind. Suppose that you have-- well, it's just a toy version of the face recognition system we have on our phones. You know that when you take a picture, you start--

AUDIENCE: Sorry.

LORENZO ROSASCO: They really weren't talking. You have something like this: a little square appearing around a face sometimes. It means that basically the system is going inside the image and recognizing faces. The real thing is a bit more complicated than this, but a toy version of the algorithm is: you have an image like this, and you think of the image as a matrix of numbers. Now this one is color, but imagine it's black and white; then each entry would just contain a number, which is the pixel value, the light intensity of that pixel. And you just have this array. And then, if you want, you can brutalize it and just unroll the matrix into one long vector. That gives you one vector. So p here would be, what? The number of? Just the number of pixels. So I take this image and unroll it, I take another image and unroll it, and I do the same for all the images.
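Just to make that unrolling step concrete, here is a minimal sketch in Python with NumPy (the labs use MATLAB; the array names and image size below are made up purely for illustration):

```python
import numpy as np

# Suppose we have n grayscale images, each 200 x 200 pixels, stored in an
# array of shape (n, 200, 200). These names and sizes are just placeholders.
n, height, width = 40, 200, 200
images = np.random.rand(n, height, width)   # stand-in for real images

p = height * width                          # p = number of pixels
X = images.reshape(n, p)                    # each row is one unrolled image

print(X.shape)                              # (40, 40000): n rows, p columns
```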
And you see, some images here do contain faces, and some of the images do not contain faces; here I use color to code them. And now what I have is that the images are my inputs, the x's. Here, full disclosure: I never use the little arrow above letters to denote vectors, so hopefully it will be clear from the context; when it's really useful I use upper or lower indices. Anyway, this is the data matrix. Rows are inputs, and columns are the so-called features or variables, the entries of each vector. So I have n rows and p columns. Associated to this, I have my output vector. And what is the output vector? Well, in this case it's just a simple binary vector. The idea is: if there is a face, I put 1; if there is not a face, I put minus 1.

So this is the way I turn an abstract question, recognizing faces in images, into a data structure that in a minute we're going to elaborate on, to try to actually answer the question whether there is a face in an image or not. This first step is kind of obvious in this case, but it's actually a tricky step. It's the part that I'm not going to give you any hints about; it's kind of an art. You have data, and at the very beginning you have to turn them into some kind of manageable data structure. Then you can elaborate in multiple ways, but the very first step is deciding how. For example, here we decided to unroll all these numbers into vectors. Does this sound like a good idea or a bad idea? One thing you notice is that this pixel here and this pixel here are probably related; there is some structure in the image. And when you take this pixel, say pixel 136, and you unroll it, it ends up over here, so they're not close anymore. Now, it turns out that, if you think about it, you'll see in a minute:
for those of you who remember, if you just take the Euclidean distance, you take products of numbers and you sum them up; that's invariant to the position of the individual pixels, so that's OK. But still, there is this intuition that maybe here I'm losing too much geometric information about the content of the image. And indeed, while this kind of works in practice, if you want to get better results you have to do the fancy stuff that Andrei was talking about today, looking locally, looking at collections of pixels, trying to keep more geometric information. I'm not going to talk about that kind of stuff; to date this is a lot of engineering, plus some good ways to learn it. We're going to try to just stick to simple representations. So how you build the representation is not going to be part of what I'm going to talk about.

So imagine that either you stick to this super-simple representation, or some friends of yours come in and put a box here in the middle, where you feed in this array of numbers and extract another vector, much fancier than this one, that contains some better representation of the image. At the end of the day, my job starts when you give me a vector representation that I can trust, where I can basically say that if two vectors seem similar, they should have the same label. That's the basic idea.

All right. So here is a little game. Imagine that these are just the two-pixel version of the images I showed you before. You have some boxes, some circles, and then I give you this one triangle. It's very original; Andrei showed you this yesterday. And the question is: what's the color of that? Unless you haven't slept a minute, you're going to say it's orange. But the question is, why do you think it's orange?
AUDIENCE: [INAUDIBLE]

LORENZO ROSASCO: Say it again?

AUDIENCE: It's surrounded by oranges.

LORENZO ROSASCO: It's surrounded by oranges. OK. She said it's close to the oranges. So it turns out that this is actually the simplest algorithm you can think of: you check who is close to you, and if it's orange, you say orange; if it's blue, you say blue.

But we already made an assumption here, hidden in the question, which is about nearby things. We are basically saying that our vector representation is such that I do have a distance, and if two things are close, then they might have the same semantic content. Which might be true or not. For example, if you take the images I showed you before, we cannot just draw them, right? We cannot just take 200 by 200 vectors, look at them, and do a visual inspection. You have to believe that this distance will be fine, and so the discussion that we just had about what is a good representation is going to kick in.

But the assumption you make, and in this case it's visually very easy because it's low dimensional, is that nearby things have similar labels. One thing that I forgot to tell you in the previous slides, but it's key, is exactly this observation: in machine learning we typically move away from situations like this one, where you can do visual inspection and you have low dimensionality, to a situation like the one I showed you a minute before, where you have images. If you have to think of each of these circles as an image, you won't be able to draw it, because it's typically going to be a vector with several hundreds, or tens of thousands, of dimensions. So the game is kind of different. Can we still do this kind of stuff? Can we just say that close things should have the same semantic content?
That's another question we're going to try to answer. But I just want to do a bit of inception here: this is a big deal, going from low dimensions to very high dimensions.

All right. But let's stick for a minute to the idea that nearby things should have the same label, and just write down the algorithm; it's one line. It's the kind of case where it's harder to write it down than to code it up or just explain what it is; it's super simple. What you do is: you have data points xi, where the xi are the input data in the training set, and x-bar is what I called x-new before, a new point. What you do is that you search; this just says, look for the index of the closest point. That's what you did before. So here, i-prime is the index of the point xi closest to x-bar. Once you find it, go into your dataset and find the label of that point, and then assign that label to the new point. Does that make sense? Everybody's happy? Not super complicated. Fair enough.

How does it work? So let me see if I can do this. This is extremely fancy code. Let's see. All right. So what did I do? Let me do it a bit smaller. This is just a simple two-dimensional dataset; I take 40 points. The dataset looks like this; it's the one on the left. What I do is take 40 points, and to make it a bit more complex, I flip some of the labels. This is called the two moons dataset, or something like that. And what I did is that for some of the points in this sea, I changed the color; I changed the label. So I made the problem a bit harder. And here is what, fortunately, you don't have in practice.
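As a quick aside, the one-line nearest-neighbor rule just written down takes only a few lines of code. This is an illustrative sketch in Python with NumPy, not the MATLAB code used in the demo; the names Xtr, Ytr, and Xnew are assumptions, with training inputs as rows and labels in {-1, +1}:

```python
import numpy as np

def nearest_neighbor(Xtr, Ytr, Xnew):
    """Assign to each new point the label of its closest training point."""
    labels = np.empty(len(Xnew))
    for j, x_bar in enumerate(Xnew):
        # squared Euclidean distance from x_bar to every training point
        dists = np.sum((Xtr - x_bar) ** 2, axis=1)
        i_prime = np.argmin(dists)      # index of the closest training point
        labels[j] = Ytr[i_prime]        # copy its label to the new point
    return labels
```

That is the entire rule; now back to the demo.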
488 00:19:18,830 --> 00:19:19,650 Here we're cheating. 489 00:19:19,650 --> 00:19:20,800 We're doing just the simulations. 490 00:19:20,800 --> 00:19:22,140 We're looking at the future. 491 00:19:22,140 --> 00:19:25,276 We assume that because we can generate this data, 492 00:19:25,276 --> 00:19:27,150 we can look at the future and check how we're 493 00:19:27,150 --> 00:19:28,274 going to do in future data. 494 00:19:28,274 --> 00:19:29,990 So you can think of this as a future data 495 00:19:29,990 --> 00:19:31,240 that typically you don't have. 496 00:19:31,240 --> 00:19:33,420 So here you're a normal human being. 497 00:19:33,420 --> 00:19:35,550 Here you're playing god and looking at the future. 498 00:19:35,550 --> 00:19:36,050 OK. 499 00:19:36,050 --> 00:19:38,224 Because we just want to do a little simulation. 500 00:19:41,190 --> 00:19:51,780 So based on that, we can just go here and put 1, train, 501 00:19:51,780 --> 00:19:53,110 and then test and plot. 502 00:19:56,200 --> 00:20:00,830 So what you see here is the so-called decision boundary. 503 00:20:00,830 --> 00:20:01,530 OK. 504 00:20:01,530 --> 00:20:05,730 What I did is exactly that one line of code you saw before. 505 00:20:05,730 --> 00:20:06,360 OK. 506 00:20:06,360 --> 00:20:07,980 And what I did is, in this case I can draw it, 507 00:20:07,980 --> 00:20:09,188 because it's low dimensional. 508 00:20:09,188 --> 00:20:12,000 And basically what I do is that I just put in the regions 509 00:20:12,000 --> 00:20:14,730 where I think I should put orange, 510 00:20:14,730 --> 00:20:18,281 and the region where it think I should put blue. 511 00:20:18,281 --> 00:20:18,780 OK. 512 00:20:18,780 --> 00:20:21,840 And here you can kind of see what's going on. 513 00:20:21,840 --> 00:20:26,370 These are actually very good on the data, right? 514 00:20:26,370 --> 00:20:29,510 How many mistakes do you make on the new dataset? 515 00:20:29,510 --> 00:20:32,790 Sorry, on the training set? 516 00:20:32,790 --> 00:20:33,330 Zero. 517 00:20:33,330 --> 00:20:34,170 It's perfect. 518 00:20:34,170 --> 00:20:35,250 OK. 519 00:20:35,250 --> 00:20:36,520 Is that a good idea? 520 00:20:36,520 --> 00:20:39,651 Well, when you look at it here, it doesn't look that good. 521 00:20:39,651 --> 00:20:40,150 OK. 522 00:20:40,150 --> 00:20:42,820 There is this whole region of points, 523 00:20:42,820 --> 00:20:49,470 for example, that are going to be predicted to be orange, 524 00:20:49,470 --> 00:20:51,026 but they're actually blue. 525 00:20:51,026 --> 00:20:53,400 Of course if you want to have zero errors in the training 526 00:20:53,400 --> 00:20:55,289 set, there's nothing else you can do, right? 527 00:20:55,289 --> 00:20:57,330 Because you see, you have this orange point here. 528 00:20:57,330 --> 00:20:58,913 You have these two orange points here. 529 00:20:58,913 --> 00:21:01,200 And you want to go and follow them. 530 00:21:01,200 --> 00:21:03,665 So there's nothing you can do. 531 00:21:03,665 --> 00:21:05,040 So this is the first observation. 532 00:21:05,040 --> 00:21:08,470 The second observation is, the curve, 533 00:21:08,470 --> 00:21:11,490 if you look close enough, it's piecewise linear. 534 00:21:11,490 --> 00:21:19,200 It's like a sequence of linear pieces stuck together. 535 00:21:19,200 --> 00:21:21,810 If we just try to do a little game 536 00:21:21,810 --> 00:21:23,670 and generate some new data-- 537 00:21:23,670 --> 00:21:25,810 OK, so imagine again, I'm playing god now. 
I generate a new dataset of the same kind, the way it should look. So take another peek at this. Oop. So now I generate the points, I plot them, I train. And now let's test. OK. If you remember the decision curve you've seen before, what do you notice here?

AUDIENCE: They're different.

LORENZO ROSASCO: They're very different. For example, the one before, if you remember, went all the way down here to follow those couple of points. But here you don't have those couple of points. So now, is that a good thing or a bad thing? Well, the point is that, because you have so few points, the moment you start to just fit the data, this will happen: you have something that changes all the time. It's very unstable. That's a key word. You have something where you change the data just a little bit, and it changes completely. That sounds like a bad idea. If I want to make a prediction, and I keep getting slightly different data and I change my mind completely, that's probably not a good way to make a prediction about anything. And this is happening all the time here. And it's exactly because our algorithm is, in some sense, greedy: you just try to get perfect performance on the training set without worrying much about the future.

Let's do this just once more. OK. And we keep on going; it's going to change all the time, all the time. Of course-- I don't know how much I can push this, because it's not super-duper fast, but let's try. Let's say 18 by 30. So what I did now is just that I augmented the number of points in my training set. It was 20 or 30, I don't remember; now I make it 100. So now you should see-- OK.
So this is one solution. We want to play the same game: generate other datasets of the same kind. So maybe now it might be that I took them all; I don't remember how many there are. No, I didn't take them all. So, what do you see now? We are doing exactly the same thing. And this is something that you absolutely cannot do in practice, because you cannot just generate datasets. But here, what you see is that I just augmented the number of training set points, and now the solution does change, but not as much. And you can kind of start to see that there is something going on a bit like this here. OK, this one actually looks pretty bad; let's try to do it once more.

OK. So again, it does change a lot, but not as much as before. And you roughly see that this guy says that here it should be orange and here it should be blue. So that's kind of what you expect: the more points you get, the better your solution gets. And if I put here all the possible points, what you would start to see is that the closest point to any point here would be a blue point, so it would be perfect.

So if I ask you whether this is a good algorithm or not, what would you say?

AUDIENCE: It's overfitting the data.

LORENZO ROSASCO: It's kind of overfitting the data. But it is not always overfitting the data. If the data are good, it's a good idea to fit them. But in some sense, this algorithm doesn't have a way to prevent itself from falling in love with the data when there are very few. And if you have very few data points, you just wiggle around, become extremely unstable, change your mind all the time. If the data are enough, it stabilizes, and in some sense, in this setting, fitting the data, or, as she's saying, overfitting the data, is actually not a bad thing.
OK. So this is what's going on here.

AUDIENCE: What do you mean by overfitting?

LORENZO ROSASCO: Fitting a bit too much. So if you look here: you're always fitting the data, but you're doing nothing else. If you have many data points, fitting the data is just fine. If you have few data points, by fitting them you, in some sense, overfit, in the sense that when you look at new data points, you have done a bit too much. What you saw before is that you get something that is very good because it perfectly fits the training data, but it's overfitting with respect to the future. Whereas here, the fit on the left-hand side reflects, not too badly, the fit on the right-hand side.

So the ideas of overfitting and stability that came out in this discussion are key. If you want, everything we're going to do in the next three hours is about understanding how you can prevent overfitting and build a good way to stabilize your algorithms.

OK. So let's go back here. This is going to be quick, because if I ask you, what is this, what would you say?

AUDIENCE: [INAUDIBLE]

[LAUGHING]

LORENZO ROSASCO: So the idea is that, when you have a situation like this, you're still pretty much able to say what the right answer is. And what you're going to do is move away from just asking what the closest point is, and look at a few more points; you don't look at just one. You look at, how many? Boh? "Boh" is a very useful Italian word; it means, I don't know.

So this algorithm, which is called the k-nearest neighbor algorithm, is probably the second simplest algorithm you can think of.
It's kind of the same as before. The notation here is a bit boring, but it's basically saying: take the new point, check its distance to every training point, sort the distances, and take the first k. If it's a classification problem, it's probably a good idea to take an odd number for k, so that you can then just have voting. Basically everybody votes, each vote counts one, somebody says blue, somebody says orange, and you make a decision. Fair enough.

Well, how does this work? You can kind of imagine. So what we have to do-- for example, here we have this guy. Now let's just put k-- well, let's make this a bit smaller, so we do 40. Generate, plot, train, [INAUDIBLE] test, plot. OK. Well, we got a bit lucky; this is actually a good dataset, because in some sense there are none of what you might call outliers. There are no orange points that really go and sit in the blue. So I just want to show you a bit of the dramatic effect of this, so I'm going to try to redo this one so that we get more-- yeah, this should do.

OK. So this is nearest neighbor; this is the solution you get. It's not too horrible, but, for example, you see that it starts following this guy. Now, what you can do is just go in and say, four. Well, four's a bad idea. Five. You retrain, same thing. And all of a sudden it just ignores this guy. Because the moment you put more points in, you realize that he's surrounded by blue guys, so his vote just counts one against four. And you can keep on going. And the idea here is that the more you make this big, the more your solution is going to be, what?
Well, you'd say it's going to be good, but it's actually not true. Because if you start to make k too big, at some point all you're doing is counting how many points you have in class one, counting how many points you have in class two, and always saying the same thing. So I'm going to put here, say, 20. What you start to see is that you obtain a decision boundary which is simpler, and simpler, and simpler. It looks kind of linear here.

What you will see is that, suppose now I regenerate the data. And you remember how much it changed before, when I was using nearest neighbor with just k equal to 1. Of course, it's probabilistic, so at some point I'm going to get a dataset like the one I showed you minutes ago, and I'll move past it as fast as possible. Because if I pick 10, one is going to look like that and nine are going to look like this. And when they look like this, you see, they kind of start to have this kind of line, a decision boundary with some twists, but it's very simple. And at some point, if I put k big enough, that is, the number of all the points, it won't change any more; it will just essentially divide the set into two equal parts.

So does that make sense? Now, would it make sense to weight the votes, so that, essentially, if a point is closer, its vote counts more than if a point is farther away? Yes, absolutely. Here we're doing the simplest thing in the world, the second simplest thing in the world, the third simplest thing in the world, and that is exactly what this next one does. And you can see that you can go pretty far with this. I mean, it's simple, but these are actually algorithms that are used sometimes. And what you do is that, if you just look at this-- again, this I don't want to explain too much.
If you've seen it before, it's simple; otherwise it doesn't really matter. But the basic idea here is that each vote is going to be between 0 and 1. You see here that I put the distance between the new point and each of the other points on top of an exponential, so the number I get is not 1, but something between 0 and 1. If the two points are close, and in the limit where they are the same, the exponent becomes 0 and the vote counts exactly one. If they're very far away, this would be, say, infinity, and the weight would be close to 0. So the closer you are, the more you count.

If you want, you can read it like this. You're sitting on a new point, and you put a zooming window, a zooming window of a certain size, and you basically say that everything inside this window is close; and the farther away you go-- so the window is like this, and you deform the space so that, basically, things that are far away are going to count less. And if I move sigma here, I'm somewhat making my visual field, if you want, larger or smaller around this one new point. This is just a physical interpretation of what this is doing; there are 15 other ways of looking at what the Gaussian is doing. Voting, changing the weight of the vote, is another one.

Why the Gaussian here? Well, because. Just because. You can use many, many others; you can use, for example, a hat window. And this is part of your prior knowledge, how much you want to weight. If you are in this kind of low-dimensional situation, you might have good ways to just look inside the data and decide, almost by visual inspection. Otherwise you have to trust some broader principles. And it's again back to the problem of learning the representation and deciding how to measure distances, which are two sides of the same story.
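The weighting he is describing is presumably the usual Gaussian one, something like w = exp(-||xbar - xi||^2 / (2 sigma^2)), so that the closest neighbors get weights near 1 and far-away ones get weights near 0. Here is a minimal sketch of k-nearest neighbors with this kind of weighted voting, in Python with NumPy (illustrative only; the function and parameter names are assumptions, and plain majority voting is the special case where every weight is 1):

```python
import numpy as np

def knn_predict(Xtr, Ytr, Xnew, k=5, sigma=None):
    """k-nearest-neighbor classification with optional Gaussian-weighted votes.

    Xtr: (n, p) training inputs; Ytr: (n,) labels in {-1, +1}; Xnew: (m, p) new inputs.
    If sigma is None, every neighbor's vote counts one (plain majority vote);
    in that case an odd k avoids ties.
    """
    preds = np.empty(len(Xnew))
    for j, x_bar in enumerate(Xnew):
        dists = np.sum((Xtr - x_bar) ** 2, axis=1)            # squared distances to all points
        nn = np.argsort(dists)[:k]                            # indices of the k closest points
        if sigma is None:
            weights = np.ones(k)                              # each vote counts one
        else:
            weights = np.exp(-dists[nn] / (2 * sigma ** 2))   # closer points count more
        preds[j] = np.sign(np.sum(weights * Ytr[nn]))         # weighted vote between the classes
    return preds
```

With a large sigma the weights are all close to 1 and this behaves like the plain majority vote; with a small sigma only the very closest neighbors really matter, which is the "zooming window" picture he describes.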
848 00:33:29,490 --> 00:33:33,510 And the other thing you see is that, if you 849 00:33:33,510 --> 00:33:35,370 start to do these games, you might actually 850 00:33:35,370 --> 00:33:37,020 add more parameters. 851 00:33:37,020 --> 00:33:37,680 OK. 852 00:33:37,680 --> 00:33:39,372 Because we start from nearest neighbor, 853 00:33:39,372 --> 00:33:40,830 which is completely parameter-free, 854 00:33:40,830 --> 00:33:42,690 but it was very unstable. 855 00:33:42,690 --> 00:33:43,840 We added k. 856 00:33:43,840 --> 00:33:45,930 We allow ourselves to go from simple to complex, 857 00:33:45,930 --> 00:33:48,000 from stability to overfitting. 858 00:33:48,000 --> 00:33:50,244 But we introduced a new parameter. 859 00:33:50,244 --> 00:33:51,910 And so that's not an algorithm any more. 860 00:33:51,910 --> 00:33:52,860 It's a half algorithm. 861 00:33:52,860 --> 00:33:55,059 A true algorithm is a parameter-free algorithm 862 00:33:55,059 --> 00:33:56,850 where I tell you how you choose everything. 863 00:33:56,850 --> 00:33:57,500 OK. 864 00:33:57,500 --> 00:33:59,166 So if they just give you something, say, 865 00:33:59,166 --> 00:34:01,940 yeah, there's k, well, how do you choose it? 866 00:34:01,940 --> 00:34:03,240 OK. 867 00:34:03,240 --> 00:34:05,230 It's not something you can use. 868 00:34:05,230 --> 00:34:06,690 And here I'm adding sigma. 869 00:34:06,690 --> 00:34:08,621 And again, you have to decide how you use it. 870 00:34:08,621 --> 00:34:09,120 OK. 871 00:34:09,120 --> 00:34:11,790 And so that's what we want to ask in a minute. 872 00:34:11,790 --> 00:34:17,580 So before doing that, just a side remark is-- 873 00:34:17,580 --> 00:34:19,480 we've been looking at vector data. 874 00:34:19,480 --> 00:34:19,980 OK. 875 00:34:19,980 --> 00:34:21,646 And we were basically measuring distance 876 00:34:21,646 --> 00:34:24,360 through just the Euclidean norm, OK, just the usual one, 877 00:34:24,360 --> 00:34:26,790 or this version like the Gaussian kernel 878 00:34:26,790 --> 00:34:30,690 that somewhat amplifies distances. 879 00:34:30,690 --> 00:34:34,630 What if you have strings, for example, or graphs? 880 00:34:34,630 --> 00:34:35,130 OK. 881 00:34:35,130 --> 00:34:36,570 Your data turns out to be strings 882 00:34:36,570 --> 00:34:37,778 and you want to compare them? 883 00:34:40,380 --> 00:34:42,210 Say even if they're binary strings, 884 00:34:42,210 --> 00:34:43,530 there's no linear structure. 885 00:34:43,530 --> 00:34:45,654 You cannot just sum them up. The Euclidean distance 886 00:34:45,654 --> 00:34:48,580 doesn't really make a lot of sense. 887 00:34:48,580 --> 00:34:50,310 But what you can do is that as long 888 00:34:50,310 --> 00:34:52,632 as you can define a distance-- and say this one 889 00:34:52,632 --> 00:34:54,840 would be the simplest one, just the Hamming distance. 890 00:34:54,840 --> 00:34:56,850 You just check entries, and if they're different, 891 00:34:56,850 --> 00:34:57,450 you count one. 892 00:34:57,450 --> 00:34:59,340 If they're the same, you count zero. 893 00:34:59,340 --> 00:35:00,600 OK. 894 00:35:00,600 --> 00:35:03,030 The moment you can define a distance on your data, 895 00:35:03,030 --> 00:35:06,760 then you can use this kind of technique.
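As a small illustration of nearest neighbors on non-vector data, here is a sketch using binary strings and the Hamming distance (the number of positions at which two equal-length strings differ). The toy strings and labels are made up for the example.

def hamming(s, t):
    # Hamming distance: count the positions where the two equal-length strings differ.
    assert len(s) == len(t)
    return sum(a != b for a, b in zip(s, t))

def nn_predict_string(train, x_new):
    # Nearest neighbor on strings: return the label of the closest training string.
    closest_string, closest_label = min(train, key=lambda pair: hamming(pair[0], x_new))
    return closest_label

train = [("0110", +1), ("0111", +1), ("1001", -1), ("1000", -1)]   # (string, label) pairs
print(hamming("0110", "1001"))           # 4: the two strings differ in every position
print(nn_predict_string(train, "0100"))  # closest training string is "0110", so the prediction is +1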
896 00:35:06,760 --> 00:35:10,150 So this technique is pretty flexible in that sense, 897 00:35:10,150 --> 00:35:12,296 that whenever you can give-- you don't need 898 00:35:12,296 --> 00:35:14,470 a vectoral representation, you just 899 00:35:14,470 --> 00:35:16,300 need a way to measure, say, similarity 900 00:35:16,300 --> 00:35:19,060 or distances between things, and then you can use this method. 901 00:35:19,060 --> 00:35:19,690 OK. 902 00:35:19,690 --> 00:35:21,398 So here I just mentioned this, and that's 903 00:35:21,398 --> 00:35:24,520 what most of these classes are going to be, about vector data. 904 00:35:24,520 --> 00:35:29,200 But this is one point where, the moment you have k-- 905 00:35:29,200 --> 00:35:31,450 you can think of this case sometimes as a similarity. 906 00:35:31,450 --> 00:35:31,950 OK. 907 00:35:31,950 --> 00:35:34,370 Similarity is kind of concept that is dual to distances. 908 00:35:34,370 --> 00:35:36,700 So if the similarity is big, it's good. 909 00:35:36,700 --> 00:35:37,980 The distance small is good. 910 00:35:37,980 --> 00:35:38,710 OK. 911 00:35:38,710 --> 00:35:42,040 And so here, if you have a way to build the k or a distance, 912 00:35:42,040 --> 00:35:43,730 then you're good to go. 913 00:35:43,730 --> 00:35:46,260 And we're not going to really talk about it, 914 00:35:46,260 --> 00:35:48,232 but there's a whole industry about how 915 00:35:48,232 --> 00:35:49,440 you build this kind of stuff. 916 00:35:49,440 --> 00:35:51,120 So we give restraints. 917 00:35:51,120 --> 00:35:53,830 Maybe I want to say that I should not only 918 00:35:53,830 --> 00:35:57,400 look at the entry of a string, but also the nearby entry when 919 00:35:57,400 --> 00:35:58,860 I make the score for that specific. 920 00:35:58,860 --> 00:36:01,960 So maybe I shifted a value of the string a little bit. 921 00:36:01,960 --> 00:36:02,820 It's not right here. 922 00:36:02,820 --> 00:36:05,560 It's in the next position over, so that should come to bits. 923 00:36:05,560 --> 00:36:08,150 So I want to do a soft version of this. 924 00:36:08,150 --> 00:36:08,800 OK. 925 00:36:08,800 --> 00:36:11,560 Or maybe I have graphs, and I want to compare graphs. 926 00:36:11,560 --> 00:36:14,080 And I want to say that if two graphs are close, then 927 00:36:14,080 --> 00:36:15,710 I want them to have the same label. 928 00:36:15,710 --> 00:36:16,210 OK. 929 00:36:16,210 --> 00:36:18,950 How do you do that? 930 00:36:18,950 --> 00:36:22,470 The next big question is-- 931 00:36:22,470 --> 00:36:24,219 we introduced three parameters. 932 00:36:24,219 --> 00:36:26,010 They look really nice, because they kind of 933 00:36:26,010 --> 00:36:29,430 allowed us to get more flexible solutions to the problem 934 00:36:29,430 --> 00:36:32,880 by choosing, for example, k or the sigma in the Gaussian. 935 00:36:32,880 --> 00:36:35,736 We can go from overfitting to stability. 936 00:36:35,736 --> 00:36:37,860 But then of course we have to choose the parameter, 937 00:36:37,860 --> 00:36:40,592 and we have to find good ways to choose them. 938 00:36:40,592 --> 00:36:42,175 And so there are a bunch of questions. 939 00:36:42,175 --> 00:36:46,020 So the first one is, well, is there an optimal value at all? 940 00:36:46,020 --> 00:36:46,950 OK. 941 00:36:46,950 --> 00:36:48,580 Does it exist? 942 00:36:48,580 --> 00:36:51,580 But if it does exist, I can go try to estimate it in some way. 943 00:36:51,580 --> 00:36:54,230 If it doesn't, well it does not even make sense. 
944 00:36:54,230 --> 00:36:55,610 I just throw a random number. 945 00:36:55,610 --> 00:36:56,741 I just say, k equals 4. 946 00:36:56,741 --> 00:36:57,240 Why? 947 00:36:57,240 --> 00:36:58,310 Just because. 948 00:36:58,310 --> 00:36:58,856 OK. 949 00:36:58,856 --> 00:36:59,730 So what do you think? 950 00:36:59,730 --> 00:37:01,980 It exists or not? 951 00:37:01,980 --> 00:37:04,130 What does it depend on? 952 00:37:04,130 --> 00:37:05,510 Because that's the next question. 953 00:37:05,510 --> 00:37:06,560 What does it depend on? 954 00:37:06,560 --> 00:37:07,730 Can we compute it? 955 00:37:07,730 --> 00:37:08,510 OK. 956 00:37:08,510 --> 00:37:10,759 So let's try to guess one minute before we go 957 00:37:10,759 --> 00:37:11,800 and check how we do this. 958 00:37:11,800 --> 00:37:14,640 OK. 959 00:37:14,640 --> 00:37:15,937 OK. 960 00:37:15,937 --> 00:37:16,770 I have to choose it. 961 00:37:16,770 --> 00:37:17,561 How do I choose it? 962 00:37:17,561 --> 00:37:19,332 What does it depend on? 963 00:37:19,332 --> 00:37:20,401 AUDIENCE: Size of this. 964 00:37:20,401 --> 00:37:22,650 LORENZO ROSASCO: One thing is the size of the dataset. 965 00:37:22,650 --> 00:37:26,400 Because what we saw is that a small k seems a good idea when 966 00:37:26,400 --> 00:37:29,370 you have a lot of data, but it seems like a bad idea 967 00:37:29,370 --> 00:37:31,670 when you have few. 968 00:37:31,670 --> 00:37:32,407 OK. 969 00:37:32,407 --> 00:37:33,240 So it should depend. 970 00:37:33,240 --> 00:37:34,656 It should be something that scales 971 00:37:34,656 --> 00:37:37,830 with n, the number of points, and probably also the training 972 00:37:37,830 --> 00:37:38,687 set itself. 973 00:37:38,687 --> 00:37:40,770 But we want something that works for all datasets, 974 00:37:40,770 --> 00:37:42,300 say, in expectation. 975 00:37:42,300 --> 00:37:44,190 So cardinality of the training set 976 00:37:44,190 --> 00:37:45,520 is going to be a main factor. 977 00:37:45,520 --> 00:37:46,840 What else? 978 00:37:46,840 --> 00:37:49,270 AUDIENCE: The smoothness of the boundary. 979 00:37:49,270 --> 00:37:49,470 LORENZO ROSASCO: The what? 980 00:37:49,470 --> 00:37:50,440 AUDIENCE: The smoothness. 981 00:37:50,440 --> 00:37:51,790 LORENZO ROSASCO: This smoothness of the boundary. 982 00:37:51,790 --> 00:37:52,290 Yeah. 983 00:37:52,290 --> 00:37:55,305 So what he's saying is, if my problem looks like this, 984 00:37:55,305 --> 00:37:57,525 or if my problem looks like this, 985 00:37:57,525 --> 00:37:59,310 it looks like k should be different. 986 00:37:59,310 --> 00:38:04,680 In this case I can take any arbitrary high k-- 987 00:38:04,680 --> 00:38:06,639 sorry, small k, I guess, or i. 988 00:38:06,639 --> 00:38:08,430 It doesn't matter, because whatever you do, 989 00:38:08,430 --> 00:38:09,888 you pretty much get the good thing. 990 00:38:09,888 --> 00:38:12,150 But if you start doing something like this, 991 00:38:12,150 --> 00:38:14,180 then you want-- k is enough, because otherwise 992 00:38:14,180 --> 00:38:15,690 you just start to blur everything. 993 00:38:15,690 --> 00:38:17,231 And this is exactly what he's saying. 994 00:38:17,231 --> 00:38:19,770 If your problem is complicated or it's easy. 995 00:38:19,770 --> 00:38:21,439 OK. 996 00:38:21,439 --> 00:38:22,980 And at the same time, this is related 997 00:38:22,980 --> 00:38:25,610 to the fact of how much noise you might have in the data, 998 00:38:25,610 --> 00:38:29,590 OK, how much flipping you might have in your data. 
999 00:38:29,590 --> 00:38:34,510 If the problem is hard, then you expect to need a different k. 1000 00:38:34,510 --> 00:38:35,010 OK. 1001 00:38:35,010 --> 00:38:37,200 So it depends on the cardinality of the data, 1002 00:38:37,200 --> 00:38:38,670 and how complicated is the problem? 1003 00:38:38,670 --> 00:38:40,128 How complicated it is the boundary? 1004 00:38:40,128 --> 00:38:41,280 How much noise do I have? 1005 00:38:41,280 --> 00:38:42,420 OK. 1006 00:38:42,420 --> 00:38:45,450 So it turns out that one thing you can ask 1007 00:38:45,450 --> 00:38:46,661 is, can we prove it? 1008 00:38:46,661 --> 00:38:47,160 OK. 1009 00:38:47,160 --> 00:38:51,690 Can we prove a theorem that says that there is an optimal k, 1010 00:38:51,690 --> 00:38:57,570 and it really does depends on this, on this quantities. 1011 00:38:57,570 --> 00:38:59,280 And it turns out that you can. 1012 00:38:59,280 --> 00:39:02,060 Of course, as always, to make a theory or to make assumptions, 1013 00:39:02,060 --> 00:39:03,800 you have to work within a model. 1014 00:39:03,800 --> 00:39:05,883 And the model we want to work on is the following. 1015 00:39:05,883 --> 00:39:07,707 You're basically saying, this is the k 1016 00:39:07,707 --> 00:39:08,790 nearest neighbor solution. 1017 00:39:08,790 --> 00:39:10,520 So big k here is the number of neighbors, 1018 00:39:10,520 --> 00:39:13,050 and this is hat because it depends on the data. 1019 00:39:13,050 --> 00:39:14,550 And what I say here is that I'm just 1020 00:39:14,550 --> 00:39:17,575 going to look at squared loss error, just because it's easy. 1021 00:39:17,575 --> 00:39:19,950 And I'm going to look at the regression problem, not just 1022 00:39:19,950 --> 00:39:21,439 this classification. 1023 00:39:21,439 --> 00:39:22,980 And what you do here is that you take 1024 00:39:22,980 --> 00:39:26,430 expectation over all possible input-output pairs. 1025 00:39:26,430 --> 00:39:30,160 So basically you say, when I tried to do math, 1026 00:39:30,160 --> 00:39:31,686 I want to see what's ideal. 1027 00:39:31,686 --> 00:39:33,060 An ideally I want a solution that 1028 00:39:33,060 --> 00:39:35,070 does well on future points. 1029 00:39:35,070 --> 00:39:35,670 OK. 1030 00:39:35,670 --> 00:39:36,780 So how do I do that? 1031 00:39:36,780 --> 00:39:40,230 I think the average error over all possible points 1032 00:39:40,230 --> 00:39:41,820 in the future, x and y. 1033 00:39:41,820 --> 00:39:45,220 So this is the meaning of this first expectation. 1034 00:39:45,220 --> 00:39:47,070 Make sense? 1035 00:39:47,070 --> 00:39:48,705 Yes? 1036 00:39:48,705 --> 00:39:49,980 No? 1037 00:39:49,980 --> 00:39:54,330 So if they fix y and x, this is the error on a specific couple 1038 00:39:54,330 --> 00:39:55,290 input and output. 1039 00:39:55,290 --> 00:39:56,300 I give you the input. 1040 00:39:56,300 --> 00:40:01,049 I do f(kx) and then I check if it's close or not to y. 1041 00:40:01,049 --> 00:40:03,090 But what I want to do if I want to be theoretical 1042 00:40:03,090 --> 00:40:04,506 is to say, OK, what I would really 1043 00:40:04,506 --> 00:40:08,461 like to be small is this error over all possible points. 1044 00:40:08,461 --> 00:40:10,710 So I take the expectation, not the one on the training 1045 00:40:10,710 --> 00:40:12,220 set, the one in the future. 
1046 00:40:12,220 --> 00:40:13,980 And I take expectation so that if points 1047 00:40:13,980 --> 00:40:15,450 are more likely to be sampled, they 1048 00:40:15,450 --> 00:40:19,050 will count more than points that are less likely to be sampled. 1049 00:40:19,050 --> 00:40:20,322 OK. 1050 00:40:20,322 --> 00:40:21,791 AUDIENCE: What was Es? 1051 00:40:21,791 --> 00:40:23,790 LORENZO ROSASCO: We haven't got to that one yet. 1052 00:40:23,790 --> 00:40:24,360 OK. 1053 00:40:24,360 --> 00:40:26,280 So Exy is what I just said. 1054 00:40:26,280 --> 00:40:27,610 What is Es? 1055 00:40:27,610 --> 00:40:30,360 It's the expectation over the training set. 1056 00:40:30,360 --> 00:40:31,684 Why do we need that? 1057 00:40:31,684 --> 00:40:33,600 Well because if we don't put that expectation, 1058 00:40:33,600 --> 00:40:37,290 I'm basically telling you what's the good k for this one 1059 00:40:37,290 --> 00:40:38,440 training set here. 1060 00:40:38,440 --> 00:40:39,465 Then I give you another training set 1061 00:40:39,465 --> 00:40:41,140 and I get another one, which in some sense is good, 1062 00:40:41,140 --> 00:40:42,598 but it's also bad, because we would 1063 00:40:42,598 --> 00:40:44,400 like to have a take-home message that 1064 00:40:44,400 --> 00:40:46,599 holds for all training sets. 1065 00:40:46,599 --> 00:40:47,640 And this is the simplest. 1066 00:40:47,640 --> 00:40:50,110 You say, for the average training set, 1067 00:40:50,110 --> 00:40:52,697 this is how I should choose k. 1068 00:40:52,697 --> 00:40:53,780 That's what we want to do. 1069 00:40:53,780 --> 00:40:54,090 OK. 1070 00:40:54,090 --> 00:40:56,470 So the first expectation is to measure error with respect 1071 00:40:56,470 --> 00:40:57,130 to the future. 1072 00:40:57,130 --> 00:40:58,812 The second expectation is to say, 1073 00:40:58,812 --> 00:41:00,270 I want to deal with the fact that I 1074 00:41:00,270 --> 00:41:03,740 have several potential training sets appearing. 1075 00:41:03,740 --> 00:41:06,480 OK. 1076 00:41:06,480 --> 00:41:09,180 So in the next couple of slides, this red dot 1077 00:41:09,180 --> 00:41:10,811 means that there are computations. 1078 00:41:10,811 --> 00:41:11,310 OK. 1079 00:41:11,310 --> 00:41:14,340 And so I want to do them quickly. 1080 00:41:14,340 --> 00:41:17,710 And the important thing of this bit is, it's an exercise. 1081 00:41:17,710 --> 00:41:18,210 OK. 1082 00:41:18,210 --> 00:41:22,550 So this is an exercise of stats zero. 1083 00:41:22,550 --> 00:41:23,550 OK. 1084 00:41:23,550 --> 00:41:25,800 So we don't want to spend time doing that. 1085 00:41:25,800 --> 00:41:28,133 The important thing is going to be the conceptual parts. 1086 00:41:28,133 --> 00:41:30,150 I'm going to go a bit quickly through it. 1087 00:41:30,150 --> 00:41:32,200 So you start from this, and you would 1088 00:41:32,200 --> 00:41:33,724 like to understand if there exists-- 1089 00:41:33,724 --> 00:41:36,140 so this is the quantity that you would like to make small, 1090 00:41:36,140 --> 00:41:37,590 ideally. 1091 00:41:37,590 --> 00:41:39,450 You will never have access to this, 1092 00:41:39,450 --> 00:41:42,930 but ideally, in the optimal scenario, 1093 00:41:42,930 --> 00:41:45,420 you want k to make this small. 1094 00:41:45,420 --> 00:41:46,260 OK. 1095 00:41:46,260 --> 00:41:49,020 Now the problem is that you want to essentially mathematically 1096 00:41:49,020 --> 00:41:51,000 study this minimization problem, but it's not easy, 1097 00:41:51,000 --> 00:41:52,166 because, how do you do this?
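In formulas (the notation here is a reconstruction of what the slide presumably shows), the quantity described in words above is the expected error of the k-nearest-neighbor estimator, averaged both over a new pair (x, y) and over the training set S:
\[
\mathcal{E}(k) \;=\; \mathbb{E}_{S}\,\mathbb{E}_{x,y}\Big[\big(\hat{f}_k(x) - y\big)^2\Big],
\qquad
k_{\mathrm{opt}} \;=\; \arg\min_{k}\,\mathcal{E}(k).
\]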
1098 00:41:52,166 --> 00:41:52,740 OK. 1099 00:41:52,740 --> 00:41:55,680 The dependence of this function on k is complicated. 1100 00:41:55,680 --> 00:41:57,895 It's that equation we had before, right? 1101 00:41:57,895 --> 00:41:59,520 So you kind of just take the derivative 1102 00:41:59,520 --> 00:42:01,034 and set it equal to zero. 1103 00:42:01,034 --> 00:42:02,200 Let's keep on going into to. 1104 00:42:02,200 --> 00:42:03,780 So what we are at is, these are the points 1105 00:42:03,780 --> 00:42:04,770 I would like to make small. 1106 00:42:04,770 --> 00:42:07,170 I would like to choose k so that I can make this small. 1107 00:42:07,170 --> 00:42:09,480 I want to study this from a mathematical point of view. 1108 00:42:09,480 --> 00:42:11,310 But I cannot just use what you're doing in calculus, 1109 00:42:11,310 --> 00:42:13,726 which is taking a derivative and setting it equal to zero, 1110 00:42:13,726 --> 00:42:16,880 because the dependence of these two k, which is my variable, 1111 00:42:16,880 --> 00:42:17,920 it's complicated. 1112 00:42:17,920 --> 00:42:18,420 OK. 1113 00:42:18,420 --> 00:42:20,400 So we go a bit of a round way. 1114 00:42:20,400 --> 00:42:21,960 We turn out to be pretty universal. 1115 00:42:21,960 --> 00:42:23,460 And this is what we are going to do. 1116 00:42:29,540 --> 00:42:31,730 First of all, we assume a model for our data. 1117 00:42:31,730 --> 00:42:33,781 And this is just for the sake of simplicity. 1118 00:42:33,781 --> 00:42:34,280 OK. 1119 00:42:34,280 --> 00:42:37,100 I can use a much more general model. 1120 00:42:37,100 --> 00:42:38,540 But this is the model. 1121 00:42:38,540 --> 00:42:41,210 I'm going to say that my y are just some fixed function 1122 00:42:41,210 --> 00:42:44,720 of star plus some noise. 1123 00:42:44,720 --> 00:42:47,180 OK. 1124 00:42:47,180 --> 00:42:51,210 And the noise is zero mean and variance sigma 1125 00:42:51,210 --> 00:42:54,150 square for all entries. 1126 00:42:54,150 --> 00:42:56,570 OK. 1127 00:42:56,570 --> 00:42:58,190 This is the simplest model. 1128 00:42:58,190 --> 00:43:00,740 It's a Gaussian regression model. 1129 00:43:07,500 --> 00:43:09,702 So one thing I'm doing, and this is like a trick 1130 00:43:09,702 --> 00:43:11,410 and you can really forget it, but it just 1131 00:43:11,410 --> 00:43:15,190 makes life much easier is that I take the expectation over xy 1132 00:43:15,190 --> 00:43:16,780 and a condition here. 1133 00:43:16,780 --> 00:43:18,094 OK. 1134 00:43:18,094 --> 00:43:19,510 The reason why you do this is just 1135 00:43:19,510 --> 00:43:20,890 to make the math a bit easier. 1136 00:43:20,890 --> 00:43:23,317 Because basically now, if you put this expectation out, 1137 00:43:23,317 --> 00:43:24,900 and you look just at these quantities, 1138 00:43:24,900 --> 00:43:26,835 you're looking at everything for fixed x. 1139 00:43:26,835 --> 00:43:30,490 And these just become a real number, OK, not the function 1140 00:43:30,490 --> 00:43:31,120 anymore. 1141 00:43:31,120 --> 00:43:33,280 So you can use normal calculus. 1142 00:43:33,280 --> 00:43:35,620 You have a real-valued function and you can just 1143 00:43:35,620 --> 00:43:36,970 use the usual stuff. 1144 00:43:36,970 --> 00:43:37,659 OK. 1145 00:43:37,659 --> 00:43:39,700 Again, I'm going to going a bit quickly over this 1146 00:43:39,700 --> 00:43:41,074 because it doesn't really matter. 1147 00:43:41,074 --> 00:43:42,070 So this ingredient one. 1148 00:43:42,070 --> 00:43:44,390 This is observation two. 
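Written out (again in reconstructed notation), the model and the conditioning step just described are
\[
y_i \;=\; f_*(x_i) + \varepsilon_i,
\qquad
\mathbb{E}[\varepsilon_i] = 0,
\qquad
\operatorname{Var}(\varepsilon_i) = \sigma^2,
\]
and, conditioning on a fixed input \(x\) (the irreducible noise term \(\sigma^2\) does not depend on \(k\), so it can be set aside), the object to study becomes the real-valued function
\[
k \;\longmapsto\; \mathbb{E}_{S}\Big[\big(\hat{f}_k(x) - f_*(x)\big)^2\Big].
\]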
1149 00:43:44,390 --> 00:43:46,000 Observation three is that you need 1150 00:43:46,000 --> 00:43:49,780 to introduce an object between the solution you 1151 00:43:49,780 --> 00:43:54,490 get in practice and this ideal function. 1152 00:43:54,490 --> 00:43:55,220 What is this? 1153 00:43:55,220 --> 00:43:58,780 It's this kind of, what is called the expectation 1154 00:43:58,780 --> 00:44:00,700 of my algorithm. 1155 00:44:00,700 --> 00:44:02,620 What you do is that-- in my algorithm 1156 00:44:02,620 --> 00:44:06,160 what I do here is that I put Yi, i OK, 1157 00:44:06,160 --> 00:44:07,960 just the label of my training set. 1158 00:44:07,960 --> 00:44:09,360 And the label are noisy. 1159 00:44:09,360 --> 00:44:11,950 But this is an ideal object where you put the true function 1160 00:44:11,950 --> 00:44:17,320 itself, and you just average the value of the true function. 1161 00:44:17,320 --> 00:44:18,250 Why do I use this? 1162 00:44:18,250 --> 00:44:20,680 Because I want to get something which 1163 00:44:20,680 --> 00:44:24,310 is in between this f-star and this f-hat. 1164 00:44:24,310 --> 00:44:27,860 So if you put k big enough-- so if you have enough points, 1165 00:44:27,860 --> 00:44:28,990 this is going to be-- 1166 00:44:28,990 --> 00:44:30,500 sorry, if you take k small enough-- 1167 00:44:30,500 --> 00:44:34,330 so this is closer to f-star than my f-hat, 1168 00:44:34,330 --> 00:44:37,120 OK, because you get no noisy data. 1169 00:44:37,120 --> 00:44:39,171 And what I want to do-- 1170 00:44:39,171 --> 00:44:39,670 oops. 1171 00:44:43,716 --> 00:44:46,090 What I want to do is that I want to plug it in the middle 1172 00:44:46,090 --> 00:44:49,620 and split this error in two. 1173 00:44:49,620 --> 00:44:50,890 And this is what I do. 1174 00:44:50,890 --> 00:44:51,580 OK. 1175 00:44:51,580 --> 00:44:55,960 If you do this, you can check that you have a square here. 1176 00:44:55,960 --> 00:44:56,860 You get two terms. 1177 00:44:56,860 --> 00:44:59,354 One simplifies, because of this assumption on the noise, 1178 00:44:59,354 --> 00:45:00,520 and you get these two terms. 1179 00:45:00,520 --> 00:45:00,790 OK. 1180 00:45:00,790 --> 00:45:02,539 And the important thing is these two terms 1181 00:45:02,539 --> 00:45:05,470 are-- one is the comparison between my algorithm 1182 00:45:05,470 --> 00:45:06,760 and its expectation. 1183 00:45:06,760 --> 00:45:09,000 So that's exactly what we called a variance. 1184 00:45:09,000 --> 00:45:10,000 OK. 1185 00:45:10,000 --> 00:45:12,210 And one is the comparison between the value 1186 00:45:12,210 --> 00:45:14,230 of the true function here, and the value 1187 00:45:14,230 --> 00:45:15,860 of this other function. 1188 00:45:15,860 --> 00:45:17,610 Sorry, this should be-- 1189 00:45:17,610 --> 00:45:18,410 oh yeah. 1190 00:45:18,410 --> 00:45:20,050 This is the expectation, which is 1191 00:45:20,050 --> 00:45:23,020 my ideal version of my algorithm, the one that has 1192 00:45:23,020 --> 00:45:25,000 access to the noiseless labels. 1193 00:45:25,000 --> 00:45:25,750 OK. 1194 00:45:25,750 --> 00:45:27,290 It's what you call a bias. 1195 00:45:27,290 --> 00:45:28,990 It's basically because, instead of using 1196 00:45:28,990 --> 00:45:31,620 the exact value of the function, you blur it 1197 00:45:31,620 --> 00:45:33,410 a bit by averaging out. 1198 00:45:33,410 --> 00:45:33,910 OK. 1199 00:45:33,910 --> 00:45:36,370 You see here, instead of using the value of the function, 1200 00:45:36,370 --> 00:45:39,590 you average out a few nearby values. 
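Spelled out, the intermediate object is the average of the true function over the k nearest neighbors of x, and adding and subtracting it inside the square splits the error into the two terms just named (the cross term vanishes because the noise has zero mean):
\[
\bar{f}_k(x) \;=\; \frac{1}{k}\sum_{i \in N_k(x)} f_*(x_i),
\]
\[
\mathbb{E}_{S}\Big[\big(\hat{f}_k(x) - f_*(x)\big)^2\Big]
\;=\;
\underbrace{\mathbb{E}_{S}\Big[\big(\hat{f}_k(x) - \bar{f}_k(x)\big)^2\Big]}_{\text{variance}}
\;+\;
\underbrace{\mathbb{E}_{S}\Big[\big(\bar{f}_k(x) - f_*(x)\big)^2\Big]}_{\text{bias}},
\]
where \(N_k(x)\) denotes the indices of the \(k\) training inputs closest to \(x\).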
1201 00:45:39,590 --> 00:45:41,986 So you're making it a bit dirtier. 1202 00:45:41,986 --> 00:45:44,110 The question now is, how would these two quantities 1203 00:45:44,110 --> 00:45:45,754 depend on k? 1204 00:45:45,754 --> 00:45:47,920 How this quantity depends on k and how this quantity 1205 00:45:47,920 --> 00:45:49,840 depends on k. 1206 00:45:49,840 --> 00:45:50,824 OK. 1207 00:45:50,824 --> 00:45:52,240 And then by putting this together, 1208 00:45:52,240 --> 00:45:54,340 we'll see that we have a certain behavior of this, 1209 00:45:54,340 --> 00:45:55,810 and a certain behavior of this. 1210 00:45:55,810 --> 00:45:57,400 And then balancing this out, we'll 1211 00:45:57,400 --> 00:45:59,530 get what the optimal value looked like. 1212 00:45:59,530 --> 00:46:02,980 And this is going to be all useless from-- 1213 00:46:02,980 --> 00:46:04,480 so these are going to be interesting 1214 00:46:04,480 --> 00:46:05,680 from a conceptual perspective. 1215 00:46:05,680 --> 00:46:07,638 We're going to learn something, but we'll still 1216 00:46:07,638 --> 00:46:10,030 have to do something practical, because nothing of this 1217 00:46:10,030 --> 00:46:11,350 you can measure in practice. 1218 00:46:11,350 --> 00:46:12,040 OK. 1219 00:46:12,040 --> 00:46:14,320 So the next question would be, now that we 1220 00:46:14,320 --> 00:46:16,520 know that it exists and it depends on this stuff, 1221 00:46:16,520 --> 00:46:18,220 how can we actually approximate it in practice? 1222 00:46:18,220 --> 00:46:20,770 And cross-validation is going to pop out of the window. 1223 00:46:20,770 --> 00:46:21,320 OK. 1224 00:46:21,320 --> 00:46:22,861 But this is the theory that shows you 1225 00:46:22,861 --> 00:46:25,710 that this would help proving a theory that 1226 00:46:25,710 --> 00:46:29,650 shows that cross-validation is a good idea, in a precise sense. 1227 00:46:29,650 --> 00:46:32,110 The take-home message is, by making this model 1228 00:46:32,110 --> 00:46:33,880 and using this as an intermediate object, 1229 00:46:33,880 --> 00:46:37,500 you split the error in two, and you start to be able to study. 1230 00:46:37,500 --> 00:46:41,260 And what you get is basically the following. 1231 00:46:41,260 --> 00:46:44,030 This term, by basically using-- 1232 00:46:44,030 --> 00:46:45,504 so we assume that the data-- 1233 00:46:45,504 --> 00:46:47,170 I didn't say that, but that's important. 1234 00:46:47,170 --> 00:46:49,960 We assume that the data are independent with each other. 1235 00:46:49,960 --> 00:46:50,800 OK. 1236 00:46:50,800 --> 00:46:54,070 And by using that, you get these results right away, 1237 00:46:54,070 --> 00:46:56,217 essentially using the fact that the variance 1238 00:46:56,217 --> 00:46:57,800 of the sum of the independent variable 1239 00:46:57,800 --> 00:47:00,130 is the sum of the variances. 1240 00:47:00,130 --> 00:47:01,930 You get these results in one line. 1241 00:47:01,930 --> 00:47:04,570 OK. 1242 00:47:04,570 --> 00:47:09,946 And basically what this shows is that, if k gets big-- 1243 00:47:09,946 --> 00:47:12,550 so variance is another word for the stability. 1244 00:47:12,550 --> 00:47:13,300 OK. 1245 00:47:13,300 --> 00:47:16,190 So if you have a big variance, things will vary a lot. 1246 00:47:16,190 --> 00:47:17,390 It will be unstable. 1247 00:47:17,390 --> 00:47:21,830 So what you see here is exactly what we observe in the plot 1248 00:47:21,830 --> 00:47:22,330 before. 1249 00:47:22,330 --> 00:47:24,940 If k was big, things are not changing as much. 
1250 00:47:24,940 --> 00:47:27,220 If k was small, things were changing a lot. 1251 00:47:27,220 --> 00:47:27,820 OK. 1252 00:47:27,820 --> 00:47:31,400 And this is the one equation that shows you that. 1253 00:47:31,400 --> 00:47:32,040 OK. 1254 00:47:32,040 --> 00:47:34,450 And if you just look at that, it would just tell you, 1255 00:47:34,450 --> 00:47:35,740 bigger is better. 1256 00:47:35,740 --> 00:47:36,660 Big with respect to what? 1257 00:47:36,660 --> 00:47:37,550 To the noise. 1258 00:47:37,550 --> 00:47:38,050 OK. 1259 00:47:38,050 --> 00:47:40,216 If there is a lot of noise, I should make it bigger. 1260 00:47:40,216 --> 00:47:43,480 If there's less noise, I can make it smaller. 1261 00:47:43,480 --> 00:47:45,970 But the point that we saw before is 1262 00:47:45,970 --> 00:47:47,650 that the problem of putting k large 1263 00:47:47,650 --> 00:47:49,300 was that we were forgetting about the problem. 1264 00:47:49,300 --> 00:47:51,341 We're just getting something that was very stable 1265 00:47:51,341 --> 00:47:54,220 but could be potentially very bad, if my function was not 1266 00:47:54,220 --> 00:47:55,430 that simple. 1267 00:47:55,430 --> 00:47:56,080 OK. 1268 00:47:56,080 --> 00:47:58,620 This is a bit harder to study mathematically. 1269 00:47:58,620 --> 00:47:59,290 OK. 1270 00:47:59,290 --> 00:48:00,850 This is a calculation that I show you 1271 00:48:00,850 --> 00:48:05,952 because you can do it yourself in like 20 minutes, or less. 1272 00:48:05,952 --> 00:48:07,660 This one takes a bit more. 1273 00:48:07,660 --> 00:48:10,360 But you can get a hunch of how it looks. 1274 00:48:10,360 --> 00:48:12,910 And the basic idea is what we already said. 1275 00:48:12,910 --> 00:48:19,210 If k is small, and the points are close enough, instead 1276 00:48:19,210 --> 00:48:24,910 of f-star of x, we are looking at f-star of the nearest points Xi. 1277 00:48:24,910 --> 00:48:26,560 And the Xi are close to x. 1278 00:48:26,560 --> 00:48:27,850 OK. 1279 00:48:27,850 --> 00:48:29,920 Now if we start to put k bigger, we 1280 00:48:29,920 --> 00:48:35,620 start to blur that prediction by looking at many nearby points. 1281 00:48:35,620 --> 00:48:37,090 But here there is no noise. 1282 00:48:37,090 --> 00:48:37,699 OK. 1283 00:48:37,699 --> 00:48:38,990 So that sounds like a bad idea. 1284 00:48:38,990 --> 00:48:41,110 So we expect the error in that case 1285 00:48:41,110 --> 00:48:45,610 to be either increasing, or at least flat with respect to k. 1286 00:48:45,610 --> 00:48:50,800 So when we take k larger, we're blurring this prediction, 1287 00:48:50,800 --> 00:48:53,690 and potentially make it far away from the true one. 1288 00:48:53,690 --> 00:48:54,190 OK. 1289 00:48:54,190 --> 00:48:57,040 And you can make this statement precise. 1290 00:48:57,040 --> 00:48:58,090 You can prove it. 1291 00:48:58,090 --> 00:49:00,780 And if you prove it, it's basically that you have-- 1292 00:49:00,780 --> 00:49:03,130 what happened? 1293 00:49:03,130 --> 00:49:04,700 You have linear dependence. 1294 00:49:04,700 --> 00:49:07,460 So the error here is linearly increasing or polynomially 1295 00:49:07,460 --> 00:49:10,551 increasing-- in fact I don't remember-- with respect to k. 1296 00:49:10,551 --> 00:49:11,050 OK.
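The one-line computation for the variance term (using independent, zero-mean noise with variance sigma squared) is
\[
\mathbb{E}_{S}\Big[\big(\hat{f}_k(x) - \bar{f}_k(x)\big)^2\Big]
\;=\;
\mathbb{E}\bigg[\Big(\frac{1}{k}\sum_{i \in N_k(x)} \varepsilon_i\Big)^{\!2}\bigg]
\;=\;
\frac{\sigma^2}{k},
\]
so the variance shrinks as k grows, while the bias term tends to grow with k, because averaging over more, and hence farther, neighbors blurs the value of f-star at x.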
1297 00:49:13,385 --> 00:49:14,760 So the reason why I'm showing you 1298 00:49:14,760 --> 00:49:16,577 this, skipping all these details, 1299 00:49:16,577 --> 00:49:18,910 is just to give you a feeling of the kind of computation 1300 00:49:18,910 --> 00:49:23,370 that answered the question if there is a optimal value 1301 00:49:23,370 --> 00:49:25,310 and what it depends on. 1302 00:49:25,310 --> 00:49:27,060 And then at this point, once you get this, 1303 00:49:27,060 --> 00:49:28,560 you start to see this kind of plot. 1304 00:49:28,560 --> 00:49:30,990 And typically here I put them the wrong way. 1305 00:49:30,990 --> 00:49:32,430 But here you basically say, I have 1306 00:49:32,430 --> 00:49:34,750 this one function I wanted to study, 1307 00:49:34,750 --> 00:49:39,030 which is the sum of two functions. 1308 00:49:39,030 --> 00:49:41,040 I have this, and I have this. 1309 00:49:41,040 --> 00:49:42,725 OK. 1310 00:49:42,725 --> 00:49:44,100 And now to study the minimum, I'm 1311 00:49:44,100 --> 00:49:45,750 basically going to sum them up and see 1312 00:49:45,750 --> 00:49:47,666 what's the optimal value to optimize this too. 1313 00:49:47,666 --> 00:49:51,854 And the k that optimized this is exactly the optimal k. 1314 00:49:51,854 --> 00:49:54,270 And you see that the optimal k will behave as we expected. 1315 00:49:54,270 --> 00:49:55,440 OK. 1316 00:49:55,440 --> 00:49:58,770 So here, one ingredient is missing. 1317 00:49:58,770 --> 00:50:01,425 And it's just missing because I didn't put it in, 1318 00:50:01,425 --> 00:50:02,790 which is the number of points. 1319 00:50:02,790 --> 00:50:03,290 OK. 1320 00:50:03,290 --> 00:50:05,660 It's just because I didn't renormalize things. 1321 00:50:05,660 --> 00:50:06,210 OK. 1322 00:50:06,210 --> 00:50:09,150 It should be a 1 over n here. 1323 00:50:15,210 --> 00:50:16,950 It's just that I didn't renormalize. 1324 00:50:16,950 --> 00:50:17,629 OK. 1325 00:50:17,629 --> 00:50:19,920 But you announced it, and it's good, because it's true. 1326 00:50:19,920 --> 00:50:21,450 There should be a 1 over n there. 1327 00:50:21,450 --> 00:50:23,040 But the rest is what we expected. 1328 00:50:23,040 --> 00:50:23,592 OK. 1329 00:50:23,592 --> 00:50:25,800 In some sense what we expect is that if my problem is 1330 00:50:25,800 --> 00:50:28,100 complicated, I need the smaller k. 1331 00:50:28,100 --> 00:50:31,680 If there is a lot of noise, I need a bigger k. 1332 00:50:31,680 --> 00:50:33,330 And depending on the number of points, 1333 00:50:33,330 --> 00:50:35,150 which would be in the numerator here, 1334 00:50:35,150 --> 00:50:37,890 I can make a bigger or a larger. 1335 00:50:37,890 --> 00:50:38,620 k. 1336 00:50:38,620 --> 00:50:40,290 OK. 1337 00:50:40,290 --> 00:50:43,020 This plot is fundamental because it shows some property which 1338 00:50:43,020 --> 00:50:44,410 is inherent in the problem. 1339 00:50:44,410 --> 00:50:47,490 And the theorem that somewhat is behind it-- 1340 00:50:47,490 --> 00:50:50,680 intuition I've been saying, repeating over and over, 1341 00:50:50,680 --> 00:50:53,890 which is this intuition that you cannot trust the data too much. 1342 00:50:53,890 --> 00:50:56,850 And there is the optimal amount of trust you can of your data 1343 00:50:56,850 --> 00:50:58,320 based on certain assumptions. 1344 00:50:58,320 --> 00:50:58,830 OK. 1345 00:50:58,830 --> 00:51:03,560 And in our case, the assumption where this kind of model. 
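Putting the two terms together, and restoring the 1/n just mentioned (it enters through the bias, since with more data the k nearest neighbors sit closer to x), the curve being minimized has the qualitative shape below; the exact exponent depends on smoothness assumptions on f-star, so it is left generic here:
\[
\mathcal{E}(k) \;\approx\; \frac{\sigma^2}{k} \;+\; C\Big(\frac{k}{n}\Big)^{\alpha},
\qquad \alpha > 0,
\]
and the minimizing k grows with the noise level and with the number of points n, and shrinks when the target function is more complex, exactly as discussed above.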
1346 00:51:03,560 --> 00:51:07,800 So little calculation I'll show you quickly, 1347 00:51:07,800 --> 00:51:11,220 grounds this intuition into a mathematical argument. 1348 00:51:11,220 --> 00:51:13,660 OK. 1349 00:51:13,660 --> 00:51:14,160 All right. 1350 00:51:14,160 --> 00:51:17,490 So we spent quite a bit of time on this. 1351 00:51:17,490 --> 00:51:20,420 In some sense, from a conceptual point of view, 1352 00:51:20,420 --> 00:51:21,420 this is a critical idea. 1353 00:51:21,420 --> 00:51:21,720 OK. 1354 00:51:21,720 --> 00:51:23,511 Because it's behind pretty much everything. 1355 00:51:23,511 --> 00:51:26,880 This idea of, how much you can trust or not of the data. 1356 00:51:26,880 --> 00:51:30,920 Of course here, as we said, this has been informative, 1357 00:51:30,920 --> 00:51:31,800 hopefully. 1358 00:51:31,800 --> 00:51:34,202 But you cannot really choose this k, 1359 00:51:34,202 --> 00:51:35,910 because you would need to know the noise, 1360 00:51:35,910 --> 00:51:38,250 but especially to know how to estimate this in order 1361 00:51:38,250 --> 00:51:40,510 to minimize this quantity. 1362 00:51:40,510 --> 00:51:44,730 So in practice what you can show is, 1363 00:51:44,730 --> 00:51:46,604 you can use what is called cross-validation. 1364 00:51:46,604 --> 00:51:48,020 And in effect, cross-validation is 1365 00:51:48,020 --> 00:51:50,430 one of a few other techniques you can use. 1366 00:51:50,430 --> 00:51:53,840 And the idea is that you don't have access [AUDIO OUT] 1367 00:51:53,840 --> 00:51:57,320 but you can show that if you take a bunch of data points, 1368 00:51:57,320 --> 00:51:59,780 you split them in two, you use half for the training 1369 00:51:59,780 --> 00:52:02,750 as you've always done, and you use the other half as a proxy 1370 00:52:02,750 --> 00:52:05,450 for this future data. 1371 00:52:05,450 --> 00:52:07,600 Then by minimizing the k-- 1372 00:52:07,600 --> 00:52:10,590 taking the k that minimized the error on this so-called holdout 1373 00:52:10,590 --> 00:52:15,620 set, then you can prove it's as good as if you 1374 00:52:15,620 --> 00:52:17,461 could have access to this. 1375 00:52:17,461 --> 00:52:17,960 OK. 1376 00:52:17,960 --> 00:52:19,730 And it's actually very easy to prove. 1377 00:52:19,730 --> 00:52:22,460 You can show that if you're just split in two, 1378 00:52:22,460 --> 00:52:24,260 and you minimize the error in second half-- 1379 00:52:24,260 --> 00:52:27,260 you do what is called the holdout cross-validation-- it's 1380 00:52:27,260 --> 00:52:31,291 as good as if you'd had access to this. 1381 00:52:31,291 --> 00:52:31,790 OK. 1382 00:52:31,790 --> 00:52:32,930 So it's optimal in a way. 1383 00:52:36,310 --> 00:52:39,790 Now, the problem with this is that we are only looking 1384 00:52:39,790 --> 00:52:44,394 at the area and expectation. 1385 00:52:44,394 --> 00:52:46,810 And what you can check is that if you look at higher order 1386 00:52:46,810 --> 00:52:49,741 statistics, say that variance of your estimators 1387 00:52:49,741 --> 00:52:51,490 and so on and so forth, what you might get 1388 00:52:51,490 --> 00:52:54,520 is that by splitting in two, [AUDIO OUT] big is fine. 1389 00:52:54,520 --> 00:52:56,337 In practice the difference is small, 1390 00:52:56,337 --> 00:52:58,420 you might get that the way you split might matter. 1391 00:52:58,420 --> 00:53:00,753 You might have bad luck and just split in a certain way. 1392 00:53:00,753 --> 00:53:03,850 And so there is a whole zoology of ways of splitting. 
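Before the refinements below, here is what the simple holdout procedure just described might look like in code, for choosing k in k-nearest-neighbor regression; the toy dataset, the candidate values of k, and the 50/50 split are illustrative assumptions.

import numpy as np

def knn_regress(X_train, y_train, x_new, k):
    # k-NN regression: average the labels of the k nearest training points.
    nearest = np.argsort(np.linalg.norm(X_train - x_new, axis=1))[:k]
    return y_train[nearest].mean()

def holdout_error(X_tr, y_tr, X_val, y_val, k):
    # Squared error on the held-out half, used as a proxy for the error on future data.
    preds = np.array([knn_regress(X_tr, y_tr, x, k) for x in X_val])
    return np.mean((preds - y_val) ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, 200)      # noisy toy regression data

half = len(X) // 2
X_tr, y_tr, X_val, y_val = X[:half], y[:half], X[half:], y[half:]
errors = {k: holdout_error(X_tr, y_tr, X_val, y_val, k) for k in [1, 3, 5, 10, 20, 50]}
best_k = min(errors, key=errors.get)               # the k that minimizes the holdout error
print(errors, best_k)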
1393 00:53:03,850 --> 00:53:06,947 And the basic one is, say, split-- 1394 00:53:06,947 --> 00:53:08,405 this is, for example, the simplest. 1395 00:53:08,405 --> 00:53:08,905 OK. 1396 00:53:08,905 --> 00:53:13,630 Split in a bunch of groups. 1397 00:53:13,630 --> 00:53:14,260 OK. 1398 00:53:14,260 --> 00:53:16,240 k-fold or v-fold cross-validation. 1399 00:53:16,240 --> 00:53:18,081 Take one group out at a time. 1400 00:53:18,081 --> 00:53:18,580 OK. 1401 00:53:18,580 --> 00:53:20,150 And do the same trick. 1402 00:53:20,150 --> 00:53:23,470 You know, you train here and calculate the error here 1403 00:53:23,470 --> 00:53:24,430 for different k's. 1404 00:53:24,430 --> 00:53:25,960 Then you do the same here, do the same here, 1405 00:53:25,960 --> 00:53:26,740 do the same here. 1406 00:53:26,740 --> 00:53:29,572 Sum the errors up, renormalizing, 1407 00:53:29,572 --> 00:53:31,280 and then just choose the k that minimizes 1408 00:53:31,280 --> 00:53:34,090 this new form of error. 1409 00:53:34,090 --> 00:53:37,450 And if the dataset is small, small, small, then 1410 00:53:37,450 --> 00:53:39,630 typically these groups will become very small. 1411 00:53:39,630 --> 00:53:42,171 And in the limit each group becomes one point, the leave-one-out error. 1412 00:53:42,171 --> 00:53:42,670 OK. 1413 00:53:42,670 --> 00:53:44,740 What you do is that you literally 1414 00:53:44,740 --> 00:53:47,110 leave one out, train on the rest, 1415 00:53:47,110 --> 00:53:49,670 get the error for all the values of k in this case. 1416 00:53:49,670 --> 00:53:56,210 Put it back in, take another one out, and repeat the procedure. 1417 00:53:56,210 --> 00:53:59,440 Now the question that I had 10, 15 minutes ago was, 1418 00:53:59,440 --> 00:54:01,770 how do you choose v? 1419 00:54:01,770 --> 00:54:04,270 OK. 1420 00:54:04,270 --> 00:54:06,150 Shall I make this two? 1421 00:54:06,150 --> 00:54:08,630 So I just do one split like this? 1422 00:54:08,630 --> 00:54:12,640 Or shall I make it n, so I do leave one out? 1423 00:54:12,640 --> 00:54:14,570 And as far as I know there is not 1424 00:54:14,570 --> 00:54:17,180 a lot of theory that would support 1425 00:54:17,180 --> 00:54:18,830 an answer to this question. 1426 00:54:18,830 --> 00:54:23,040 And what I know is mostly what you can expect intuitively, 1427 00:54:23,040 --> 00:54:24,874 which is, if you have a lot of data points-- 1428 00:54:24,874 --> 00:54:25,873 what does it mean a lot? 1429 00:54:25,873 --> 00:54:26,480 I don't know. 1430 00:54:26,480 --> 00:54:29,690 If you have two million, 10,000, I don't know. 1431 00:54:29,690 --> 00:54:32,120 If you have a big dataset, typically splitting in two, 1432 00:54:32,120 --> 00:54:36,590 or maybe doing just random splits is stable enough. 1433 00:54:36,590 --> 00:54:37,700 What does it mean? 1434 00:54:37,700 --> 00:54:42,320 That you try, and you look at how much it moves. 1435 00:54:42,320 --> 00:54:43,370 Whereas if you have, say-- 1436 00:54:43,370 --> 00:54:44,560 you know, I don't know if it even exists anymore, 1437 00:54:44,560 --> 00:54:46,768 but, you know, a few years ago there 1438 00:54:46,768 --> 00:54:51,350 were microarray applications where you would have 20, 30 inputs, 1439 00:54:51,350 --> 00:54:54,830 and you have 20 dimensions. 1440 00:54:54,830 --> 00:54:57,302 And then in that case, you really don't do much splitting. 1441 00:54:57,302 --> 00:54:59,510 If you have 20, for example, you try to leave one out 1442 00:54:59,510 --> 00:55:00,718 and it's the best you can do.
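And a sketch of the v-fold and leave-one-out variants just described (it reuses the toy knn_regress helper from the holdout sketch above; the fold counts and candidate k's are illustrative):

import numpy as np

def knn_regress(X_train, y_train, x_new, k):
    # Same toy k-NN regressor as in the holdout sketch above.
    nearest = np.argsort(np.linalg.norm(X_train - x_new, axis=1))[:k]
    return y_train[nearest].mean()

def vfold_error(X, y, k, v):
    # v-fold cross-validation error for a given k; with v = len(X) this is the leave-one-out error.
    folds = np.array_split(np.arange(len(X)), v)
    total = 0.0
    for fold in folds:
        mask = np.ones(len(X), dtype=bool)
        mask[fold] = False                        # hold this group out, train on the rest...
        preds = np.array([knn_regress(X[mask], y[mask], x, k) for x in X[fold]])
        total += np.sum((preds - y[fold]) ** 2)   # ...and score the predictions on the held-out group
    return total / len(X)

# For example, pick k among a few candidates by 5-fold cross-validation
# (using the toy X, y from the previous sketch); v = len(X) gives leave-one-out:
# best_k = min([1, 3, 5, 10, 20], key=lambda k: vfold_error(X, y, k, v=5))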
1443 00:55:00,718 --> 00:55:02,990 And it's already very unstable and sucks. 1444 00:55:02,990 --> 00:55:03,920 OK. 1445 00:55:03,920 --> 00:55:05,760 So in this case, there is work to be done. 1446 00:55:05,760 --> 00:55:08,311 I mean, as far as I know, that's the state of things. 1447 00:55:08,311 --> 00:55:08,810 OK. 1448 00:55:08,810 --> 00:55:10,964 So we introduced a class of very simple algorithms. 1449 00:55:10,964 --> 00:55:12,380 They seem to be pretty reasonable. 1450 00:55:12,380 --> 00:55:14,750 They seem to allow us, provided that we have a way 1451 00:55:14,750 --> 00:55:16,460 to measure distances or similarity, 1452 00:55:16,460 --> 00:55:19,010 to go from simple to complex. 1453 00:55:19,010 --> 00:55:21,320 And we have some kind of theory that 1454 00:55:21,320 --> 00:55:24,642 tells us what is the optimal value of a parameter, 1455 00:55:24,642 --> 00:55:28,800 a kind of practical procedure to actually choose it in practice. 1456 00:55:28,800 --> 00:55:30,630 OK. 1457 00:55:30,630 --> 00:55:31,850 Are we done? 1458 00:55:31,850 --> 00:55:33,229 Is that all? 1459 00:55:33,229 --> 00:55:34,520 do we need to do anything else? 1460 00:55:34,520 --> 00:55:36,326 What's missing here? 1461 00:55:36,326 --> 00:55:37,700 One thing that is missing here is 1462 00:55:37,700 --> 00:55:40,310 that most of the intuition we developed so far 1463 00:55:40,310 --> 00:55:42,000 are really related to low dimension. 1464 00:55:42,000 --> 00:55:42,720 OK. 1465 00:55:42,720 --> 00:55:46,160 And here, very quickly, if you just do a little exercise 1466 00:55:46,160 --> 00:55:47,960 where you try to say how big is a cube 1467 00:55:47,960 --> 00:55:53,911 that covers 1% of the volume of a bigger cube of a unit length? 1468 00:55:53,911 --> 00:55:54,410 OK. 1469 00:55:54,410 --> 00:55:56,540 So the big cube is volume 1. 1470 00:55:56,540 --> 00:55:58,040 The length of that is just 1. 1471 00:55:58,040 --> 00:55:59,700 And it ask you, how big is this, if it 1472 00:55:59,700 --> 00:56:02,280 has to cover 1% of the volume? 1473 00:56:02,280 --> 00:56:05,840 It's really to check that these are just going to be a dth-root 1474 00:56:05,840 --> 00:56:07,820 where d is the dimension of the cube. 1475 00:56:07,820 --> 00:56:10,460 And this is the shape of the dth-root. 1476 00:56:10,460 --> 00:56:11,090 OK. 1477 00:56:11,090 --> 00:56:13,940 So if you're in low dimension, basically, 1% 1478 00:56:13,940 --> 00:56:17,024 is intuitively small within the big cube. 1479 00:56:17,024 --> 00:56:19,190 But as soon as you're go in higher dimensional, what 1480 00:56:19,190 --> 00:56:22,940 you see is that the length of the edge of the little cube 1481 00:56:22,940 --> 00:56:26,390 that has to cover 1% of the volume becomes very close to 1, 1482 00:56:26,390 --> 00:56:27,490 almost immediately. 1483 00:56:27,490 --> 00:56:29,300 It's this curve going up. 1484 00:56:29,300 --> 00:56:30,410 OK. 1485 00:56:30,410 --> 00:56:31,160 What does it mean? 1486 00:56:31,160 --> 00:56:35,630 That if you say, our intuition is, well, 1%. 1487 00:56:35,630 --> 00:56:36,950 It's a pretty small volume. 1488 00:56:36,950 --> 00:56:39,920 If I just took the neighbors in 1%, they're pretty close, 1489 00:56:39,920 --> 00:56:42,420 so they should have the same label. 1490 00:56:42,420 --> 00:56:45,750 Well, in dimension 10, it's everything. 1491 00:56:45,750 --> 00:56:46,670 OK. 
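The little exercise is a one-liner: the edge of a sub-cube covering 1% of the volume of the unit cube in dimension d is 0.01 raised to the power 1/d, which rushes toward 1 as d grows.

for d in [1, 2, 3, 10, 100]:
    # edge length of a cube with volume 0.01 inside the unit cube in dimension d
    print(d, round(0.01 ** (1 / d), 3))
# prints roughly: 1 0.01, 2 0.1, 3 0.215, 10 0.631, 100 0.955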
1492 00:56:46,670 --> 00:56:49,962 So our intuition-- now you can say that probably there 1493 00:56:49,962 --> 00:56:52,420 is something wrong with my way of thinking of volume, sure. 1494 00:56:52,420 --> 00:56:54,420 But the problem is that we have to rethink a bit 1495 00:56:54,420 --> 00:56:57,400 how you think of dimensions and similarity in high dimension, 1496 00:56:57,400 --> 00:56:59,530 because things that are obvious low dimensional 1497 00:56:59,530 --> 00:57:01,450 start to be very complicated. 1498 00:57:01,450 --> 00:57:02,200 OK. 1499 00:57:02,200 --> 00:57:05,770 And the basic idea is that this neighbor technique just 1500 00:57:05,770 --> 00:57:08,770 looks at what's happening in one region. 1501 00:57:08,770 --> 00:57:11,540 But what you hope to do is that if your function actually 1502 00:57:11,540 --> 00:57:13,840 has some kind of global properties-- 1503 00:57:13,840 --> 00:57:16,204 so, say for example a sign is the simplest example 1504 00:57:16,204 --> 00:57:18,370 of something which is global, because the value here 1505 00:57:18,370 --> 00:57:21,507 and the value here are very much related. 1506 00:57:21,507 --> 00:57:23,090 And then it goes up and it's the same. 1507 00:57:23,090 --> 00:57:24,012 And then it goes down. 1508 00:57:24,012 --> 00:57:25,470 So if you know something like this, 1509 00:57:25,470 --> 00:57:27,490 the idea is that you can borrow strength 1510 00:57:27,490 --> 00:57:29,729 from points which are far away. 1511 00:57:29,729 --> 00:57:32,020 In some sense the function has some similar properties. 1512 00:57:32,020 --> 00:57:34,180 And so you want to go from a local estimation 1513 00:57:34,180 --> 00:57:36,310 to some form of global estimation. 1514 00:57:36,310 --> 00:57:36,900 OK. 1515 00:57:36,900 --> 00:57:38,500 And instead of making a decision based 1516 00:57:38,500 --> 00:57:40,030 only on the neighbors of the points, 1517 00:57:40,030 --> 00:57:42,640 you might want to use points which are potentially far away. 1518 00:57:42,640 --> 00:57:43,140 OK. 1519 00:57:43,140 --> 00:57:47,620 And this seems to be like a good idea in high dimensions where 1520 00:57:47,620 --> 00:57:51,066 the neighboring points might not give enough information . 1521 00:57:51,066 --> 00:57:52,440 And that's kind of what's called, 1522 00:57:52,440 --> 00:57:53,560 curse of dimensionality. 1523 00:57:53,560 --> 00:57:54,060 OK. 1524 00:57:54,060 --> 00:57:56,260 So what I want to do next-- 1525 00:57:56,260 --> 00:57:58,030 we can take a break here-- 1526 00:57:58,030 --> 00:58:02,010 is discussing least squares and kernel least squares. 1527 00:58:02,010 --> 00:58:02,520 OK. 1528 00:58:02,520 --> 00:58:04,186 But what we're going to do is that we're 1529 00:58:04,186 --> 00:58:06,200 going to take a linear model of our data, 1530 00:58:06,200 --> 00:58:07,930 and then we are going to try to see 1531 00:58:07,930 --> 00:58:09,250 how you can estimate and learn. 1532 00:58:09,250 --> 00:58:11,291 And we're going to look at bit of the computation 1533 00:58:11,291 --> 00:58:14,100 and a bit of the statistical idea underlying this model. 1534 00:58:14,100 --> 00:58:16,600 And then we're going to play around in a very simple for way 1535 00:58:16,600 --> 00:58:19,390 to extend from a linear model to a non-linear model 1536 00:58:19,390 --> 00:58:21,245 and actually make it non-parametric. 1537 00:58:21,245 --> 00:58:24,030 I'll tell you what non-parametric means.