WEBVTT

00:00:09.234 --> 00:00:10.300
PATRICK WINSTON: So
where are we?

00:00:10.300 --> 00:00:14.962
We started off with simple
methods for learning stuff.

00:00:14.962 --> 00:00:20.730
Then, we talked a little about
approaches to learning that

00:00:20.730 --> 00:00:24.556
were vaguely inspired by

00:00:24.556 --> 00:00:27.300
the fact that our heads are
stuffed with neurons, and that

00:00:27.300 --> 00:00:31.095
we seem to have evolved
from primates.

00:00:31.095 --> 00:00:34.940
Then, we talked about looking at
the problem and addressing the

00:00:34.940 --> 00:00:36.410
issue of [? phrenology ?]

00:00:36.410 --> 00:00:40.430
and how it's possible
to learn concepts.

00:00:40.430 --> 00:00:43.700
But now, we're coming full
circle back to the beginning

00:00:43.700 --> 00:00:47.990
and thinking about how to
divide up a space with

00:00:47.990 --> 00:00:49.930
decision boundaries.

00:00:49.930 --> 00:00:54.580
But whereas you can do it with
a neural net or nearest

00:00:54.580 --> 00:00:56.510
neighbors or an ID tree,

00:00:56.510 --> 00:01:02.115
those are very simple ideas
that work very often.

00:01:02.115 --> 00:01:05.895
Today, we're going to talk about
a very sophisticated

00:01:05.895 --> 00:01:09.212
idea that still has
a simple implementation.

00:01:09.212 --> 00:01:13.220
So this needs to be
in the tool bag of

00:01:13.220 --> 00:01:15.506
every civilized person.

00:01:15.506 --> 00:01:18.560
This is about support
vector machines, an

00:01:18.560 --> 00:01:20.735
idea that was developed--

00:01:20.735 --> 00:01:22.470
Well, I want to talk to
you today about how

00:01:22.470 --> 00:01:24.705
ideas develop, actually.

00:01:24.705 --> 00:01:27.150
Because you look at stuff like
this in a book, and you think,

00:01:27.150 --> 00:01:32.515
well, Vladimir Vapnik just
figured this out one Saturday

00:01:32.515 --> 00:01:35.780
afternoon when the weather was
too bad to go outside.

00:01:35.780 --> 00:01:37.185
That's not how it happens.

00:01:37.185 --> 00:01:38.580
It happens very differently.

00:01:38.580 --> 00:01:41.229
I want to talk to you
a little about that.

00:01:41.229 --> 00:01:46.950
The nice thing about great
things that were done by

00:01:46.950 --> 00:01:49.060
people who are still alive
is you can ask them

00:01:49.060 --> 00:01:50.210
how they did it.

00:01:50.210 --> 00:01:51.810
You can't do that
with Fourier.

00:01:51.810 --> 00:01:54.310
You can't say to Fourier,
how did you do it?

00:01:54.310 --> 00:01:56.946
Did you dream it up on
a Saturday afternoon?

00:01:56.946 --> 00:02:00.220
But you can call Vapnik on the phone
and ask him questions.

00:02:00.220 --> 00:02:02.050
That's the stuff I'm going
to talk about toward

00:02:02.050 --> 00:02:04.186
the end of the hour.

00:02:04.186 --> 00:02:06.045
Well, it's all about decision
boundaries.

00:02:06.045 --> 00:02:11.400
And now, we have several
techniques that we can use to

00:02:11.400 --> 00:02:12.620
draw some decision boundaries.

00:02:12.620 --> 00:02:14.700
And here's the same problem.

00:02:14.700 --> 00:02:18.329
And if we drew decision
boundaries in here, we might

00:02:18.329 --> 00:02:21.826
get something that would
look like maybe this.

00:02:21.826 --> 00:02:25.790
If we were doing a nearest
neighbor approach, and if

00:02:25.790 --> 00:02:31.522
we're doing ID trees, we'll just
draw in a line like that.

00:02:31.522 --> 00:02:34.945
And if we're doing neural nets,
well, you can put in a

00:02:34.945 --> 00:02:37.550
lot of straight lines wherever
you like with a neural net,

00:02:37.550 --> 00:02:39.110
depending on how it's
trained up.

00:02:39.110 --> 00:02:42.470
Or if you just simply go in
there and design it, so you

00:02:42.470 --> 00:02:45.554
could do that if you wanted.

00:02:45.554 --> 00:02:48.110
And you would think that after
people have been working on

00:02:48.110 --> 00:02:52.500
this sort of stuff for 50 or 75
years that there wouldn't

00:02:52.500 --> 00:02:54.535
be any tricks in the bag left.

00:02:54.535 --> 00:02:59.340
And that's when everybody got
surprised, because around the

00:02:59.340 --> 00:03:03.880
early '90s Vladimir Vapnik
introduced the ideas I'm about

00:03:03.880 --> 00:03:05.916
to talk to you about.

00:03:05.916 --> 00:03:11.215
So what Vapnik says is
something like this.

00:03:11.215 --> 00:03:17.470
Here you have a space, and you
have some negative examples,

00:03:17.470 --> 00:03:20.436
and you have some positive
examples.

00:03:20.436 --> 00:03:22.870
How do you divide the positive
examples from

00:03:22.870 --> 00:03:24.220
the negative examples?

00:03:24.220 --> 00:03:27.710
And what he says that we want
to do is we want to draw a

00:03:27.710 --> 00:03:29.140
straight line.

00:03:29.140 --> 00:03:32.062
But which straight line
is the question.

00:03:32.062 --> 00:03:35.140
Well, we want to draw
a straight line.

00:03:35.140 --> 00:03:38.141
Well, would this be a
good straight line?

00:03:38.141 --> 00:03:40.492
One that went up like that?

00:03:40.492 --> 00:03:42.660
Probably not so hot.

00:03:42.660 --> 00:03:45.622
How about one that's
just right here?

00:03:45.622 --> 00:03:49.460
Well, that might separate them,
but it seems awfully

00:03:49.460 --> 00:03:51.765
close to the negative
examples.

00:03:51.765 --> 00:03:55.030
So maybe what we ought to do
is we ought to draw our

00:03:55.030 --> 00:03:57.220
straight line in here,
sort of like this.

00:04:00.458 --> 00:04:07.590
And that line is drawn with a
view toward putting in the

00:04:07.590 --> 00:04:13.330
widest street that separates the
positive samples from the

00:04:13.330 --> 00:04:14.460
negative samples.

00:04:14.460 --> 00:04:17.209
That's why I call it the
widest street approach.

00:04:17.209 --> 00:04:21.535
So that way of putting
in the decision boundary

00:04:21.535 --> 00:04:25.560
is to put in a straight line, but
in contrast with the way an

00:04:25.560 --> 00:04:27.440
ID tree puts in a
straight line,

00:04:27.440 --> 00:04:32.165
it tries to put the line in in
such a way that the separation

00:04:32.165 --> 00:04:34.680
between the positive and
negative examples,

00:04:34.680 --> 00:04:37.236
that street, is as wide
as possible.

00:04:37.236 --> 00:04:37.722
All right.

00:04:37.722 --> 00:04:41.620
So you might think to do that in
a UROP project, and then

00:04:41.620 --> 00:04:43.205
let it go at that.

00:04:43.205 --> 00:04:44.730
What's the big deal?

00:04:44.730 --> 00:04:47.340
So what we've got to do is we've
got to go through why

00:04:47.340 --> 00:04:49.176
it's a big deal.

00:04:49.176 --> 00:04:55.170
So first of all, we like to
think about how you would make

00:04:55.170 --> 00:04:59.326
a decision rule that would use
that decision boundary.

00:04:59.326 --> 00:05:03.650
So what I'm going to ask you to
imagine is that we've got a

00:05:03.650 --> 00:05:09.650
vector of any length that you
like, constrained to be

00:05:09.650 --> 00:05:13.715
perpendicular to the median, or
if you like, perpendicular

00:05:13.715 --> 00:05:14.630
to the gutters.

00:05:14.630 --> 00:05:18.280
It's perpendicular to the median
line of the street.

00:05:18.280 --> 00:05:20.540
All right, it's drawn in such
a way that that's true.

00:05:20.540 --> 00:05:23.984
We don't know anything about
its length, yet.

00:05:23.984 --> 00:05:29.920
Then, we also have some unknown,
say, right here.

00:05:29.920 --> 00:05:35.325
And we have a vector that
points to it, like so.

00:05:35.325 --> 00:05:39.310
So now, what we're really
interested in is whether or

00:05:39.310 --> 00:05:42.920
not that unknown is on the right
side of the street or on

00:05:42.920 --> 00:05:45.062
the left side of the street.

00:05:45.062 --> 00:05:47.909
So what we'd want to do is
project that vector,

00:05:47.909 --> 00:05:51.990
u, down on to one that's
perpendicular to the street.

00:05:51.990 --> 00:05:55.205
Because then, we'll have the
distance in this direction or

00:05:55.205 --> 00:05:58.490
a number that's proportional
to this in this direction.

00:05:58.490 --> 00:06:02.670
And the further out we go, the
closer we'll get to being on

00:06:02.670 --> 00:06:05.360
the right side of the street,
where the right side of the

00:06:05.360 --> 00:06:08.065
street is not the correct side
but actually the right side of

00:06:08.065 --> 00:06:08.985
the street.

00:06:08.985 --> 00:06:14.280
So what we can do is we can say,
let's take w and dot it

00:06:14.280 --> 00:06:19.930
with u and measure whether or
not that number is equal to or

00:06:19.930 --> 00:06:22.646
greater than some constant, c.

00:06:22.646 --> 00:06:25.880
So remember that the dot
product takes the

00:06:25.880 --> 00:06:27.896
projection onto w.

00:06:27.896 --> 00:06:32.150
And the bigger that projection
is, the further out along this

00:06:32.150 --> 00:06:34.255
line the projection will lie.

00:06:34.255 --> 00:06:37.490
And eventually it will be so
big that the projection

00:06:37.490 --> 00:06:40.440
crosses the median line of the
street, and we'll say it must

00:06:40.440 --> 00:06:41.690
be a positive sample.

00:06:45.707 --> 00:06:50.880
Or we could say, without loss
of generality that the dot

00:06:50.880 --> 00:06:56.360
product plus some constant, b,
is equal to or greater than 0.

00:06:56.360 --> 00:07:03.050
If that's true, then it's
a positive sample.

00:07:03.050 --> 00:07:04.300
So that's our decision rule.
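
NOTE
A minimal Python sketch of the decision rule just stated: call u positive
when w . u + b >= 0. The numbers for w, b, and the two unknowns are made up
for illustration; at this point the lecture has not yet said how to find w and b.
```python
def dot(p, q):
    """Dot product of two equal-length sequences."""
    return sum(pi * qi for pi, qi in zip(p, q))
def is_positive(w, b, u):
    """Return True when u projects onto or past the median line of the street."""
    return dot(w, u) + b >= 0
w = [1.0, 1.0]  # hypothetical vector perpendicular to the median line
b = -3.0        # hypothetical offset, equivalent to the constant c = -b = 3
print(is_positive(w, b, [3.0, 2.0]))  # 3 + 2 - 3 = 2, which is >= 0
print(is_positive(w, b, [1.0, 0.5]))  # 1 + 0.5 - 3 = -1.5, which is < 0
```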

00:07:11.522 --> 00:07:17.300
And this is the first in several
elements that we're

00:07:17.300 --> 00:07:20.960
going to have to line up to
understand this idea called

00:07:20.960 --> 00:07:23.340
support vector machines.

00:07:23.340 --> 00:07:24.730
So that's the decision rule.

00:07:24.730 --> 00:07:29.460
And the trouble is we don't know
what constant to use, and

00:07:29.460 --> 00:07:32.450
we don't know which
w to use either.

00:07:32.450 --> 00:07:35.390
We know that w has to be
perpendicular to the median

00:07:35.390 --> 00:07:37.476
line of the street.

00:07:37.476 --> 00:07:39.880
But there's lot of w's that
are perpendicular to the

00:07:39.880 --> 00:07:41.070
median line of the street,
because it

00:07:41.070 --> 00:07:42.740
could be of any length.

00:07:42.740 --> 00:07:45.750
So we don't have enough
constraint here to fix a

00:07:45.750 --> 00:07:49.532
particular b or a
particular w.

00:07:49.532 --> 00:07:52.395
Are you with me so far?

00:07:52.395 --> 00:07:55.176
All right.

00:07:55.176 --> 00:07:57.990
And this, by the way, we get
just by saying that c

00:07:57.990 --> 00:07:59.240
equals minus b.

00:08:02.800 --> 00:08:05.790
What we're going to do next is
we're going to lay on some

00:08:05.790 --> 00:08:08.960
additional constraints with a
view toward putting enough

00:08:08.960 --> 00:08:13.330
constraint on the situation that
we can actually calculate

00:08:13.330 --> 00:08:16.015
a b and a w.

00:08:16.015 --> 00:08:21.290
So what we're going to say is
this, that if we look at this

00:08:21.290 --> 00:08:24.680
quantity that we're checking out
to be greater than or less

00:08:24.680 --> 00:08:28.040
than 0 to make our decision,
then, what we're going to do

00:08:28.040 --> 00:08:32.510
is we're going to say that if we
take that vector w, and we

00:08:32.510 --> 00:08:37.789
take the dot product of that
with some x plus, some

00:08:37.789 --> 00:08:38.929
positive sample, now.

00:08:38.929 --> 00:08:39.760
This is not an unknown.

00:08:39.760 --> 00:08:42.272
This is a positive sample.

00:08:42.272 --> 00:08:46.500
If we take the dot product of
those two vectors, and we add

00:08:46.500 --> 00:08:50.050
b just like in our decision
rule, we're going to want that

00:08:50.050 --> 00:08:51.370
to be equal to or
greater than 1.

00:08:54.220 --> 00:08:59.080
So in other words, you can be
an unknown anywhere in this

00:08:59.080 --> 00:09:02.140
street and be just a little bit
greater or just a little

00:09:02.140 --> 00:09:03.610
bit less than 0.

00:09:03.610 --> 00:09:06.120
But if you're a positive sample,
we're going to insist

00:09:06.120 --> 00:09:08.550
that this decision function
gives the

00:09:08.550 --> 00:09:11.476
value of one or greater.

00:09:11.476 --> 00:09:21.030
Likewise, if w dotted with
some negative sample is

00:09:21.030 --> 00:09:24.380
provided to us, then we're going
to say that has to be

00:09:24.380 --> 00:09:25.800
equal to or less than minus 1.

00:09:28.690 --> 00:09:29.866
All right.

00:09:29.866 --> 00:09:33.790
So if you're a minus sample,
like one of these two guys or

00:09:33.790 --> 00:09:38.330
any minus sample that may lie
down here, this function that

00:09:38.330 --> 00:09:42.506
gives us the decision rule must
return minus 1 or less.

00:09:42.506 --> 00:09:45.020
So there's a separation
of distance here.

00:09:45.020 --> 00:09:46.930
Minus 1 to plus 1 for
all of the samples.

00:09:50.717 --> 00:09:52.842
So that's cool.

00:09:52.842 --> 00:09:58.290
But we're not quite done,
because carrying around two

00:09:58.290 --> 00:10:01.534
equations like this,
it's a pain.

00:10:01.534 --> 00:10:04.760
So what we're going to do is
we're going to introduce

00:10:04.760 --> 00:10:08.190
another variable to make
life a little easier.

00:10:11.502 --> 00:10:15.210
Like many things that we do
when we develop this kind

00:10:15.210 --> 00:10:19.120
of stuff, introducing this
variable is not something that

00:10:19.120 --> 00:10:20.370
God says has to be done.

00:10:24.380 --> 00:10:25.310
What is it?

00:10:25.310 --> 00:10:28.930
We introduced this additional
stuff to do what?

00:10:28.930 --> 00:10:34.140
To make the mathematics more
convenient, so mathematical

00:10:34.140 --> 00:10:35.822
convenience.

00:10:35.822 --> 00:10:37.730
So what we're going to do is
we're going to introduce a

00:10:37.730 --> 00:10:53.600
variable, y sub i, such that y
sub i is equal to plus 1 for

00:10:53.600 --> 00:11:10.460
plus samples and minus 1
for negative samples.

00:11:10.460 --> 00:11:11.685
All right.

00:11:11.685 --> 00:11:14.190
So for each sample, we're going
to have a value for this

00:11:14.190 --> 00:11:16.680
new quantity we've
introduced, y.

00:11:16.680 --> 00:11:19.910
And the value of y is going to
be determined by whether it's

00:11:19.910 --> 00:11:22.370
a positive sample or
negative sample.

00:11:22.370 --> 00:11:26.600
If it's a positive sample it's
got to be plus 1 for this

00:11:26.600 --> 00:11:29.280
situation up here, and it's
going to be minus 1 for this

00:11:29.280 --> 00:11:31.235
situation down here.

00:11:31.235 --> 00:11:34.480
So what we're going to do with
this first equation is we're

00:11:34.480 --> 00:11:41.605
going to multiply it by y sub
i, and that is now x sub i,

00:11:41.605 --> 00:11:46.430
plus b is equal to or
greater than 1.

00:11:46.430 --> 00:11:47.740
And then, you know what
we're going to do?

00:11:47.740 --> 00:11:53.030
We're going to multiply the left
side of this equation by

00:11:53.030 --> 00:11:54.770
y sub i, as well.

00:11:54.770 --> 00:12:03.172
So the second equation becomes
y sub i times x sub i plus b.

00:12:03.172 --> 00:12:05.876
And now, what does that
do over here?

00:12:05.876 --> 00:12:09.480
We multiplied this guy
times minus 1.

00:12:09.480 --> 00:12:12.750
So it used to be the case that
that was less than minus 1.

00:12:12.750 --> 00:12:14.900
So if we multiply it by minus
1, then it has to be greater

00:12:14.900 --> 00:12:16.150
than plus 1.

00:12:18.990 --> 00:12:23.220
The two equations are the same,
because we introduced

00:12:23.220 --> 00:12:26.580
this little mathematical
convenience.

00:12:26.580 --> 00:12:35.430
So now, we can say that y sub
i times x sub i plus b.

00:12:37.986 --> 00:12:41.826
Well, what we're going to do--

00:12:41.826 --> 00:12:42.675
Brett?

00:12:42.675 --> 00:12:44.255
STUDENT: What happened
to the w?

00:12:44.255 --> 00:12:45.450
PATRICK WINSTON: Oh, did
I leave out a w?

00:12:45.450 --> 00:12:46.050
I'm sorry.

00:12:46.050 --> 00:12:48.612
Thank you.

00:12:48.612 --> 00:12:51.561
Yeah, I wouldn't have gotten
very far with that.

00:12:51.561 --> 00:12:54.210
So that's dot it with
w, dot it with w.

00:12:54.210 --> 00:12:55.605
Thank you, Brett.

00:12:55.605 --> 00:12:56.710
Those are all vectors.

00:12:56.710 --> 00:13:00.010
I'll pretty soon forget to put
the little vector marks on

00:13:00.010 --> 00:13:01.090
there, but you know
what I mean.

00:13:01.090 --> 00:13:05.256
So that's w plus b.

00:13:05.256 --> 00:13:09.660
And now, let me bring that 1
over to the left side, and

00:13:09.660 --> 00:13:11.010
that's equal to or
greater than 0.

00:13:13.535 --> 00:13:14.730
All right.

00:13:14.730 --> 00:13:17.440
With Brett's correction, I
think everything's OK.

00:13:17.440 --> 00:13:21.010
But we're going to take one more
step, and we're going to

00:13:21.010 --> 00:13:31.270
say that y sub i times x sub
i times w plus b minus 1.

00:13:33.885 --> 00:13:35.760
It's always got to be equal
to or greater than 0.

00:13:35.760 --> 00:13:42.492
But what I'm going to
say is that it equals 0 for

00:13:42.492 --> 00:13:44.550
x sub i in a gutter.

00:13:49.092 --> 00:13:51.140
So it's always going to be equal to
or greater than 0, but we're

00:13:51.140 --> 00:13:53.540
going to add the additional
constraint that it's going to

00:13:53.540 --> 00:13:58.300
be exactly 0 for all the samples
that end up in the

00:13:58.300 --> 00:14:00.190
gutters here of the street.

00:14:00.190 --> 00:14:03.010
So the value of that expression
is going to be

00:14:03.010 --> 00:14:08.390
exactly 0 for that sample, 0
for this sample and this

00:14:08.390 --> 00:14:10.460
sample, not 0 for that sample.

00:14:10.460 --> 00:14:12.180
It's got to be greater than 0.

00:14:12.180 --> 00:14:13.846
All right?

00:14:13.846 --> 00:14:16.760
So that's step number two.
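
NOTE
A small Python sketch of step two, assuming a made-up street (w, b) and four
made-up samples. It evaluates y_i * (w . x_i + b) - 1 for each sample, checks
that the value is never negative, and picks out the samples where it is zero,
which are the ones sitting in the gutters.
```python
def dot(p, q):
    """Dot product of two equal-length sequences."""
    return sum(pi * qi for pi, qi in zip(p, q))
def constraint_value(w, b, x, y):
    """Value of y * (w . x + b) - 1; zero means x lies in a gutter."""
    return y * (dot(w, x) + b) - 1
w, b = [0.5, 0.5], -2.0        # hypothetical street, chosen by hand
samples = [([3.0, 3.0], +1),   # plus sample in the gutter
           ([4.0, 4.0], +1),   # plus sample beyond the gutter
           ([1.0, 1.0], -1),   # minus sample in the gutter
           ([0.0, 1.0], -1)]   # minus sample beyond the gutter
values = [constraint_value(w, b, x, y) for x, y in samples]
all_satisfied = all(v >= 0 for v in values)
gutter = [x for (x, y), v in zip(samples, values) if abs(v) < 1e-9]
print(values, all_satisfied, gutter)
```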

00:14:25.319 --> 00:14:27.140
And this is step number one.

00:14:31.454 --> 00:14:31.950
OK.

00:14:31.950 --> 00:14:34.340
So now, we've just got some
expressions to talk about,

00:14:34.340 --> 00:14:36.415
some constraints.

00:14:36.415 --> 00:14:37.870
Now, what are we trying
to do here?

00:14:37.870 --> 00:14:39.922
I forgot.

00:14:39.922 --> 00:14:41.320
Oh, I remember now.

00:14:41.320 --> 00:14:45.500
We're trying to figure out how
to arrange for the line to be

00:14:45.500 --> 00:14:48.790
such that the street separating
the pluses from the minuses is as

00:14:48.790 --> 00:14:51.121
wide as possible.

00:14:51.121 --> 00:14:54.300
So maybe we better figure out
how we can express the

00:14:54.300 --> 00:14:56.130
distance between the
two gutters.

00:15:03.645 --> 00:15:06.822
Let's just repeat our drawing.

00:15:06.822 --> 00:15:12.030
We've got some minuses here, got
pluses out here, and we've

00:15:12.030 --> 00:15:17.021
got gutters that are
going down here.

00:15:17.021 --> 00:15:22.290
And now, we've got a vector here
to a minus, and we've got

00:15:22.290 --> 00:15:27.091
a vector here to a plus.

00:15:27.091 --> 00:15:33.950
So we'll call that x plus
and this x minus.

00:15:33.950 --> 00:15:36.730
So what's the width
of the street?

00:15:36.730 --> 00:15:37.600
I don't know, yet.

00:15:37.600 --> 00:15:40.360
But what we can do is we can
take the difference of those

00:15:40.360 --> 00:15:44.120
two vectors, and that will
be a vector that

00:15:44.120 --> 00:15:46.346
looks like this, right?

00:15:46.346 --> 00:15:52.016
So that's x plus
minus x minus.

00:15:52.016 --> 00:15:56.280
So now, if I only had a unit
normal that's normal to the

00:15:56.280 --> 00:16:00.320
median line of the street, if
it's a unit normal, then I

00:16:00.320 --> 00:16:02.120
could just take the dot product
of that unit normal

00:16:02.120 --> 00:16:03.975
and this difference vector, and
that would be the width of

00:16:03.975 --> 00:16:05.980
the street, right?

00:16:05.980 --> 00:16:13.090
So in other words, if I had a
unit vector in that direction,

00:16:13.090 --> 00:16:15.530
then I could just dot the two
together, and that would be

00:16:15.530 --> 00:16:17.896
the width of the street.

00:16:17.896 --> 00:16:21.550
So let me write that down
before I forget.

00:16:21.550 --> 00:16:31.625
So the width is equal to
x plus minus x minus.

00:16:31.625 --> 00:16:34.396
OK.

00:16:34.396 --> 00:16:35.580
That's the difference vector.

00:16:35.580 --> 00:16:37.510
And now, I've got to multiply
it by a unit vector.

00:16:37.510 --> 00:16:38.180
But wait a minute.

00:16:38.180 --> 00:16:41.590
I said that that w is
a normal, right?

00:16:41.590 --> 00:16:44.032
The w is a normal.

00:16:44.032 --> 00:16:50.018
So what I can do is I can
multiply this times w, and

00:16:50.018 --> 00:16:54.156
then, we'll divide by the
magnitude of w, and that will

00:16:54.156 --> 00:16:56.591
make it a unit vector.

00:16:56.591 --> 00:17:05.650
So that dot product, not a
product, that dot product is,

00:17:05.650 --> 00:17:10.329
in fact, a scalar, and it's
the width of the street.

00:17:10.329 --> 00:17:14.730
It doesn't do us much good,
because it doesn't look like

00:17:14.730 --> 00:17:17.053
we get much out of it.

00:17:17.053 --> 00:17:18.220
Oh, but I don't know.

00:17:18.220 --> 00:17:21.371
Let's see, what can
we get out of it?

00:17:21.371 --> 00:17:25.954
Oh gee, we've got this equation
over here, this

00:17:25.954 --> 00:17:28.594
equation that constrains
the samples

00:17:28.594 --> 00:17:31.310
that lie in the gutter.

00:17:31.310 --> 00:17:35.610
So if we have a positive sample,
for example, then this

00:17:35.610 --> 00:17:38.530
is plus 1, and we have
this equation.

00:17:41.150 --> 00:17:53.900
So it says that x plus times w
is equal to, oh, 1 minus b.

00:17:58.492 --> 00:18:02.210
See, I'm just taking this part
here, this vector here, and

00:18:02.210 --> 00:18:04.880
I'm dotting it with x plus.

00:18:04.880 --> 00:18:08.650
So that's this piece
right here.

00:18:08.650 --> 00:18:11.230
y is 1 for this kind
of sample.

00:18:11.230 --> 00:18:13.600
So I'll just take the 1 and the
b back over to the other

00:18:13.600 --> 00:18:16.212
side, and I've got 1 minus b.

00:18:16.212 --> 00:18:18.592
OK?

00:18:18.592 --> 00:18:22.241
Well, we can do the same
trick with x minus.

00:18:22.241 --> 00:18:24.806
If we've got a negative sample,

00:18:24.806 --> 00:18:28.572
then y sub i is negative.

00:18:28.572 --> 00:18:34.296
That gives us negative
w dotted with x sub i.

00:18:34.296 --> 00:18:37.190
But now, we take this stuff back
over to the right side,

00:18:37.190 --> 00:18:40.540
and we get 1 plus b.

00:18:45.252 --> 00:18:50.200
So that all licenses us to rewrite
this thing as 2 over

00:18:50.200 --> 00:18:52.646
the magnitude of w.

00:18:52.646 --> 00:18:54.210
How did I get there?

00:18:54.210 --> 00:18:59.270
Well, I decided I was going to
enforce this constraint.

00:18:59.270 --> 00:19:03.540
I noted that the width of the
street has got to be this

00:19:03.540 --> 00:19:06.105
difference vector times
a unit vector.

00:19:06.105 --> 00:19:09.400
Then, I used the constraint to
plug back some values here.

00:19:09.400 --> 00:19:12.480
And I discovered to my delight
and amazement that the width

00:19:12.480 --> 00:19:15.350
of the street is 2 over
the magnitude of w.
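
NOTE
A numeric check of the width result, as a Python sketch with made-up values.
The vectors x_plus and x_minus are chosen by hand to lie in the gutters of a
hypothetical street (w, b), so w . x_plus + b = 1 and w . x_minus + b = -1.
Projecting their difference onto the unit normal should match 2 / |w|.
```python
import math
def dot(p, q):
    """Dot product of two equal-length sequences."""
    return sum(pi * qi for pi, qi in zip(p, q))
w, b = [0.5, 0.5], -2.0                    # hypothetical street
x_plus, x_minus = [3.0, 3.0], [1.0, 1.0]   # hand-picked gutter samples
norm_w = math.sqrt(dot(w, w))
difference = [p - m for p, m in zip(x_plus, x_minus)]
width_by_projection = dot(difference, w) / norm_w  # (x+ - x-) . w / |w|
width_by_formula = 2.0 / norm_w                    # 2 / |w|
print(width_by_projection, width_by_formula)
```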

00:19:18.340 --> 00:19:20.388
Yes, Brett?

00:19:20.388 --> 00:19:23.881
STUDENT: So your first x
plus is 1 minus b, and x

00:19:23.881 --> 00:19:25.378
minus is 1 plus b.

00:19:25.378 --> 00:19:25.877
PATRICK WINSTON: Yeah.

00:19:25.877 --> 00:19:26.875
STUDENT: So you're
subtracting it?

00:19:26.875 --> 00:19:27.750
PATRICK WINSTON: Let's see.

00:19:27.750 --> 00:19:31.855
If I've got a minus here, then
that makes that minus, and

00:19:31.855 --> 00:19:33.810
then, the b is minus, and when I
take the b over to the other

00:19:33.810 --> 00:19:35.579
side it becomes plus.

00:19:35.579 --> 00:19:38.573
STUDENT: Yeah, so if you
subtract the left with the

00:19:38.573 --> 00:19:41.068
right [INAUDIBLE].

00:19:41.068 --> 00:19:41.670
PATRICK WINSTON: No.

00:19:41.670 --> 00:19:42.320
No, sorry.

00:19:42.320 --> 00:19:46.981
This expression here
is 1 plus b.

00:19:46.981 --> 00:19:48.870
Trust me it works.

00:19:48.870 --> 00:19:51.370
I haven't got my legs all
tangled up like last Friday,

00:19:51.370 --> 00:19:53.786
well, not yet, anyway.

00:19:53.786 --> 00:19:55.340
It's possible.

00:19:55.340 --> 00:19:58.958
There's going to be a lot of
algebra here eventually.

00:19:58.958 --> 00:20:04.995
So this quantity here, this
is miracle number three.

00:20:04.995 --> 00:20:09.731
This quantity here is the
width of the street.

00:20:09.731 --> 00:20:13.570
And what we're trying to
do is we're trying to

00:20:13.570 --> 00:20:17.158
maximize that, right?

00:20:17.158 --> 00:20:27.170
So we want to maximize 2 over
the magnitude of w if we're to

00:20:27.170 --> 00:20:29.300
get the widest street under
the constraints that we've

00:20:29.300 --> 00:20:32.210
decided that we're going
to work with.

00:20:32.210 --> 00:20:33.050
All right.

00:20:33.050 --> 00:20:46.281
So that means that it's OK to
maximize 1 over w, instead.

00:20:46.281 --> 00:20:48.250
We just drop the constant.

00:20:48.250 --> 00:20:53.550
And that means that it's
OK to minimize the

00:20:53.550 --> 00:20:56.150
magnitude of w, right?

00:20:59.572 --> 00:21:08.710
And that means that it's OK
to minimize 1/2 times the

00:21:08.710 --> 00:21:12.070
magnitude of w squared.

00:21:12.070 --> 00:21:13.675
Right, Brett?

00:21:13.675 --> 00:21:16.075
Why did I do that?

00:21:16.075 --> 00:21:19.010
Why did I multiply by
1/2 and square it?

00:21:19.010 --> 00:21:19.970
STUDENT: Because it's
mathematically convenient.

00:21:19.970 --> 00:21:20.930
PATRICK WINSTON: It's
mathematically convenient.

00:21:20.930 --> 00:21:22.850
Thank you.

00:21:22.850 --> 00:21:27.840
So this is point number three
in the development.

00:21:27.840 --> 00:21:28.950
So where do we go?

00:21:28.950 --> 00:21:31.170
We decided that was going
to be our decision rule.

00:21:31.170 --> 00:21:33.530
We're going to see which side
of the line we're on.

00:21:33.530 --> 00:21:36.420
We decided to constrain the
situation, so the value of the

00:21:36.420 --> 00:21:40.750
decision rule is plus 1 in the
gutters for the positive

00:21:40.750 --> 00:21:42.820
samples and minus 1
in the gutters for

00:21:42.820 --> 00:21:44.070
the negative samples.

00:21:44.070 --> 00:21:47.470
And then, we discovered that
maximizing the width of the

00:21:47.470 --> 00:21:51.090
street led us to an expression
like that,

00:21:51.090 --> 00:21:52.340
which we wish to minimize.

00:21:57.425 --> 00:21:58.350
Should we take a break?

00:21:58.350 --> 00:21:59.460
Should we get coffee?

00:21:59.460 --> 00:22:02.365
Too bad, we can't do that in
this kind of situation.

00:22:02.365 --> 00:22:04.400
But we would if we could.

00:22:04.400 --> 00:22:07.090
And I'm sure when Vapnik
got to this point, he

00:22:07.090 --> 00:22:09.826
went out for coffee.

00:22:09.826 --> 00:22:13.820
So now, we back up, and we say,
well, let's let these

00:22:13.820 --> 00:22:17.252
expressions start developing
into a song.

00:22:17.252 --> 00:22:21.030
Not like that, that's vapid,
speaking of Vapnik.

00:22:29.760 --> 00:22:31.970
What song is it going to sing?

00:22:31.970 --> 00:22:35.680
We've got an expression here
that we'd like to find the

00:22:35.680 --> 00:22:38.236
minimum of, the extremum of.

00:22:38.236 --> 00:22:41.790
And we've got some constraints
here that we

00:22:41.790 --> 00:22:44.040
would like to honor.

00:22:44.040 --> 00:22:45.290
What are we going to do?

00:22:47.600 --> 00:22:49.300
Let me put what we're going
to do to you in

00:22:49.300 --> 00:22:52.385
the form of a puzzle.

00:22:52.385 --> 00:22:58.900
Has it got something to
do with Legendre?

00:22:58.900 --> 00:23:04.270
Has it got something
to do with Laplace?

00:23:04.270 --> 00:23:07.375
Or does it have something
to do with Lagrange?

00:23:07.375 --> 00:23:09.400
She says Lagrange.

00:23:09.400 --> 00:23:12.850
Actually, all three were said
to be on Fourier's Doctoral

00:23:12.850 --> 00:23:15.590
Defense Committee-- must have
been quite an exam.

00:23:15.590 --> 00:23:18.960
But we want to talk about
Lagrange, because we've got a

00:23:18.960 --> 00:23:20.605
situation here.

00:23:20.605 --> 00:23:22.060
Is this 18.01?

00:23:22.060 --> 00:23:22.840
18.02?

00:23:22.840 --> 00:23:25.000
18.02.

00:23:25.000 --> 00:23:28.462
We learned in 18.02 that if we're
going to find the extremum of

00:23:28.462 --> 00:23:33.840
a function with constraints,
then we're going to have to

00:23:33.840 --> 00:23:35.922
use Lagrange multipliers.

00:23:35.922 --> 00:23:39.820
That would give us a new
expression, which we can

00:23:39.820 --> 00:23:43.350
maximize or minimize without
thinking about

00:23:43.350 --> 00:23:45.090
the constraints anymore.

00:23:45.090 --> 00:23:47.755
That's how Lagrange
multipliers work.

00:23:47.755 --> 00:23:52.440
So this brings us to miracle
number four, developmental

00:23:52.440 --> 00:23:53.770
piece number four.

00:23:53.770 --> 00:23:56.420
And it works like this.

00:23:56.420 --> 00:23:58.210
We're going to say that L--

00:23:58.210 --> 00:24:00.720
the thing we're going to try
to maximize in order to

00:24:00.720 --> 00:24:02.660
maximize the width
of the street--

00:24:02.660 --> 00:24:08.235
is equal to 1/2 times the
magnitude of that vector, w,

00:24:08.235 --> 00:24:12.476
squared minus.

00:24:12.476 --> 00:24:16.230
And now, we've got to have
a summation over all the

00:24:16.230 --> 00:24:17.480
constraints.

00:24:18.880 --> 00:24:21.460
And each of those constraints is
going to have a multiplier,

00:24:21.460 --> 00:24:23.412
alpha sub i.

00:24:23.412 --> 00:24:26.106
And then, we write down
the constraint.

00:24:26.106 --> 00:24:27.575
And when we write down
a constraint,

00:24:27.575 --> 00:24:29.100
there it is up there.

00:24:29.100 --> 00:24:31.690
And I've got to be hyper
careful here, because,

00:24:31.690 --> 00:24:33.830
otherwise, I'll get lost
in the algebra.

00:24:33.830 --> 00:24:42.520
So the constraint is y sub i
times vector, w, dotted with

00:24:42.520 --> 00:24:49.030
vector x sub i plus b, and
now, I've got a closing

00:24:49.030 --> 00:24:52.315
parenthesis, a minus 1.

00:24:52.315 --> 00:24:56.690
That's the end of my constraint,
like so.
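
NOTE
Written out in LaTeX, the Lagrangian dictated above reads as follows. The sum
is taken to run over all training samples i, one multiplier alpha_i per constraint.
```latex
L = \frac{1}{2}\lVert \vec{w} \rVert^{2} - \sum_{i} \alpha_i \left[ y_i \left( \vec{w} \cdot \vec{x}_i + b \right) - 1 \right]
```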

00:25:00.330 --> 00:25:03.380
I sure hope I've got that right,
because I'll be in deep

00:25:03.380 --> 00:25:04.730
trouble if that's wrong.

00:25:04.730 --> 00:25:05.940
Anybody see any bugs in that?

00:25:05.940 --> 00:25:08.250
That looks right, doesn't it?

00:25:08.250 --> 00:25:10.310
We've got the original thing
we're trying to work with.

00:25:10.310 --> 00:25:14.425
Now, we've got Lagrange
multipliers, each multiplied

00:25:14.425 --> 00:25:16.300
by that constraint
up there, where each

00:25:16.300 --> 00:25:20.512
constraint is constrained
to be 0.

00:25:20.512 --> 00:25:24.770
Well, there's a little bit of
mathematical sleight of hand

00:25:24.770 --> 00:25:27.810
here, because in the end, some
of these are going to be 0,

00:25:27.810 --> 00:25:31.210
the Lagrange multipliers here.

00:25:31.210 --> 00:25:33.795
The ones that are going to be
non 0 are going to be the ones

00:25:33.795 --> 00:25:36.120
connected with vectors that
lie in the gutter.

00:25:36.120 --> 00:25:39.848
The rest are going to be 0.

00:25:39.848 --> 00:25:43.380
But in any event, we can pretend
that this is what

00:25:43.380 --> 00:25:44.630
we're doing.

00:25:46.550 --> 00:25:48.350
I don't care whether it's
a maximum or minimum.

00:25:48.350 --> 00:25:49.550
I've lost track.

00:25:49.550 --> 00:25:51.290
But what we're going to do is
we're going to try to find an

00:25:51.290 --> 00:25:52.360
extremum of that.

00:25:52.360 --> 00:25:53.730
So what do we do?

00:25:53.730 --> 00:25:58.330
What does 18.01 teach us about?

00:25:58.330 --> 00:25:59.465
Finding the maximum--

00:25:59.465 --> 00:26:04.760
well, we've got to find the
derivatives and set them to 0.

00:26:04.760 --> 00:26:06.500
And then, after we've done that,
a little bit of that

00:26:06.500 --> 00:26:08.760
manipulation, we're going
to see a wonderful

00:26:08.760 --> 00:26:10.850
song start to emerge.

00:26:10.850 --> 00:26:12.890
So let's see if we can do it.

00:26:12.890 --> 00:26:17.160
Let's take the partial of L, the
Lagrangian, with respect

00:26:17.160 --> 00:26:19.190
to the vector, w.

00:26:19.190 --> 00:26:21.430
Oh my God, how do you
differentiate with

00:26:21.430 --> 00:26:22.680
respect to a vector?

00:26:25.255 --> 00:26:28.050
It turns out that it has a form
that looks exactly like

00:26:28.050 --> 00:26:30.450
differentiating with respect
to a scalar.

00:26:30.450 --> 00:26:32.580
And the way you prove that to
yourself is you just expand

00:26:32.580 --> 00:26:35.530
everything in terms of all of
the vector's components.

00:26:35.530 --> 00:26:37.660
You differentiate those with
respect to what you're

00:26:37.660 --> 00:26:40.140
differentiating with respect
to, and everything

00:26:40.140 --> 00:26:42.380
turns out the same.

00:26:42.380 --> 00:26:44.880
So what you get when you
differentiate this with

00:26:44.880 --> 00:26:52.280
respect to the vector, w, is 2
comes down, and we have just

00:26:52.280 --> 00:26:53.833
magnitude of w.

00:26:53.833 --> 00:26:56.090
Was it the magnitude of w?

00:26:56.090 --> 00:26:58.000
Yeah, like so.

00:27:01.629 --> 00:27:02.910
Was it the magnitude of w?

00:27:02.910 --> 00:27:06.510
Oh, it's not the
magnitude of w.

00:27:06.510 --> 00:27:12.396
It's just w, like so, no
magnitude involved.

00:27:12.396 --> 00:27:16.480
Then, we've got a w over here,
so we've got to differentiate

00:27:16.480 --> 00:27:18.270
this part with respect
to w, as well.

00:27:18.270 --> 00:27:19.690
But that part's a lot easier,
because all we

00:27:19.690 --> 00:27:21.310
have there is a w.

00:27:21.310 --> 00:27:22.350
There's no magnitude.

00:27:22.350 --> 00:27:24.002
It's not raised to any power.

00:27:24.002 --> 00:27:26.290
So what's w multiplied by?

00:27:26.290 --> 00:27:31.954
Well, it's multiplied by x and
y sub i and alpha sub i.

00:27:31.954 --> 00:27:32.610
All right.

00:27:32.610 --> 00:27:36.605
So that means that this
expression, this derivative of

00:27:36.605 --> 00:27:41.660
the Lagrangian, with respect to
w is going to be equal to w

00:27:41.660 --> 00:27:51.820
minus the sum of alpha sub i,
y sub i, x sub i, and that's

00:27:51.820 --> 00:27:54.240
got to be set to 0.

00:27:54.240 --> 00:28:02.250
And that implies that w is equal
to the sum of some alpha

00:28:02.250 --> 00:28:06.980
i, some scalars, times this
minus 1 or plus 1 variable

00:28:06.980 --> 00:28:11.332
times x sub i over i.

00:28:11.332 --> 00:28:14.430
And now, the math is
beginning to sing.

00:28:14.430 --> 00:28:19.490
Because it tells us that the
vector w is a linear sum of

00:28:19.490 --> 00:28:24.492
the samples, all the samples
or some of the samples.

00:28:24.492 --> 00:28:27.786
It didn't have to be that way.

00:28:27.786 --> 00:28:29.230
It could have been raised
to a power.

00:28:29.230 --> 00:28:31.160
It could have been
a logarithm.

00:28:31.160 --> 00:28:33.010
All sorts of horrible
things could have

00:28:33.010 --> 00:28:34.320
happened when we did this.

00:28:34.320 --> 00:28:39.210
But when we did this, we
discovered that w is going to

00:28:39.210 --> 00:28:44.620
be equal to a linear sum
of these vectors here.

00:28:44.620 --> 00:28:49.060
Some of the vectors in the
sample set, and I say some,

00:28:49.060 --> 00:28:51.260
because for some, alpha
will be 0.

00:28:54.265 --> 00:28:55.515
All right.

00:28:55.515 --> 00:29:01.560
So this is something that we
want to take note of as

00:29:01.560 --> 00:29:05.402
something important.

00:29:05.402 --> 00:29:09.760
Now, of course, we've got to
differentiate L with respect

00:29:09.760 --> 00:29:12.900
to anything else that might
vary, so we've got to

00:29:12.900 --> 00:29:15.180
differentiate L with respect
to b, as well.

00:29:18.436 --> 00:29:21.222
So what's that going
to be equal to?

00:29:21.222 --> 00:29:25.705
Well, there's no b in here, so
that makes no contribution.

00:29:25.705 --> 00:29:28.750
This part here doesn't have a
b in it, so that makes no

00:29:28.750 --> 00:29:29.335
contribution.

00:29:29.335 --> 00:29:32.270
There's no b over here, so that
makes no contribution.

00:29:32.270 --> 00:29:37.210
So we've got alpha i times
y sub i times b.

00:29:37.210 --> 00:29:39.365
That has a contribution.

00:29:39.365 --> 00:29:46.470
So that's going to be the sum
of alpha i times y sub i.

00:29:46.470 --> 00:29:48.570
And then, we're differentiating
with respect

00:29:48.570 --> 00:29:50.635
to b, so that disappears.

00:29:50.635 --> 00:29:55.440
There's a minus sign here, and
that's equal to 0, or that

00:29:55.440 --> 00:29:59.490
implies that the sum of the
alpha i times y sub

00:29:59.490 --> 00:30:03.012
i is equal to 0.
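
NOTE
Editor's summary of the board work in this passage, written out as a sketch.
The symbols (L, w, b, alpha_i, y_i, x_i) are the ones spoken in the lecture;
the typeset form is reconstructed from the spoken description and is not a
copy of the chalkboard.
```latex
\begin{align*}
L &= \tfrac{1}{2}\lVert \vec{w} \rVert^2
     - \sum_i \alpha_i \left[ y_i \left( \vec{w} \cdot \vec{x}_i + b \right) - 1 \right] \\
\frac{\partial L}{\partial \vec{w}} &= \vec{w} - \sum_i \alpha_i y_i \vec{x}_i = 0
     \;\Rightarrow\; \vec{w} = \sum_i \alpha_i y_i \vec{x}_i \\
\frac{\partial L}{\partial b} &= -\sum_i \alpha_i y_i = 0
     \;\Rightarrow\; \sum_i \alpha_i y_i = 0
\end{align*}
```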

00:30:03.012 --> 00:30:05.100
Hm, that looks like that might
be helpful somewhere.

00:30:10.460 --> 00:30:12.755
And now, it's time
for more coffee.

00:30:12.755 --> 00:30:15.520
By the way, these coffee
periods take months.

00:30:15.520 --> 00:30:16.905
You stare at it.

00:30:16.905 --> 00:30:18.980
You work on something else.

00:30:18.980 --> 00:30:22.000
You've got to worry
about your finals.

00:30:22.000 --> 00:30:24.020
And you think about
it some more.

00:30:24.020 --> 00:30:25.740
And eventually, you come
back from coffee

00:30:25.740 --> 00:30:28.930
and do the next thing.

00:30:28.930 --> 00:30:31.640
Oh, what is the next thing?

00:30:31.640 --> 00:30:34.180
Well, we've still got this
expression that we're trying

00:30:34.180 --> 00:30:41.020
to find the minimum for.

00:30:41.020 --> 00:30:43.500
And you say to yourself, this
is really a job for the

00:30:43.500 --> 00:30:44.480
numerical analysts.

00:30:44.480 --> 00:30:47.205
Those guys know about
this sort of stuff.

00:30:47.205 --> 00:30:49.620
Because of that little power
in there, that square.

00:30:49.620 --> 00:30:54.772
This is a so-called quadratic
optimization problem.

00:30:54.772 --> 00:30:57.480
So at this point, you would be
inclined to hand this problem

00:30:57.480 --> 00:30:59.290
over to a numerical analyst.

00:30:59.290 --> 00:31:01.410
They'll come back in a few
weeks with an algorithm.

00:31:01.410 --> 00:31:03.100
You implement the algorithm.

00:31:03.100 --> 00:31:04.120
And maybe things work.

00:31:04.120 --> 00:31:04.890
Maybe they don't converge.

00:31:04.890 --> 00:31:08.325
But in any case, you don't
worry about it.

00:31:08.325 --> 00:31:10.360
But we're not going to do that,
because we want to do a

00:31:10.360 --> 00:31:12.680
little bit more math, because
we're interested

00:31:12.680 --> 00:31:14.890
in stuff like this.

00:31:14.890 --> 00:31:18.770
We're interested in the fact
that the decision vector is a

00:31:18.770 --> 00:31:21.265
linear sum of the samples.

00:31:21.265 --> 00:31:24.030
So we're going to work a little
harder on this stuff.

00:31:24.030 --> 00:31:27.730
And in particular, now that
we've got an expression for w,

00:31:27.730 --> 00:31:31.010
this one right here, we're
going to plug it back in

00:31:31.010 --> 00:31:34.870
there, and we're going to plug
it back in here and see what

00:31:34.870 --> 00:31:37.440
happens to that thing
we're trying to find

00:31:37.440 --> 00:31:38.690
the extremum of.

00:31:46.817 --> 00:31:51.220
Is everybody relaxed,
taking a deep breath?

00:31:51.220 --> 00:31:52.530
Actually, this is the
easiest part.

00:31:52.530 --> 00:31:55.755
This is just doing a little
bit of the algebra.

00:31:55.755 --> 00:31:58.830
So the thing we're trying
to maximize or

00:31:58.830 --> 00:32:03.465
minimize is equal to 1/2.

00:32:03.465 --> 00:32:10.570
And now, we've got to
have this vector

00:32:10.570 --> 00:32:16.781
here in there twice.

00:32:16.781 --> 00:32:17.190
Right?

00:32:17.190 --> 00:32:21.295
Because we're multiplying
the two together.

00:32:21.295 --> 00:32:22.970
So let's see.

00:32:22.970 --> 00:32:26.860
We've got from that expression
up there, one of those w's

00:32:26.860 --> 00:32:33.670
will just be the sum of the
alpha i times y sub i times

00:32:33.670 --> 00:32:36.265
the vector x sub i.

00:32:36.265 --> 00:32:38.320
And then, we've got the
other one, too.

00:32:38.320 --> 00:32:41.620
So that's just going to
be the sum of alpha.

00:32:41.620 --> 00:32:45.280
Now, I'm going to, actually,
eventually, squish those two

00:32:45.280 --> 00:32:48.050
sums together into a double
summation, so I have to keep

00:32:48.050 --> 00:32:49.990
the indexes straight.

00:32:49.990 --> 00:32:53.786
So I'm just going to write
that as alpha sub j, y

00:32:53.786 --> 00:32:57.726
sub j, x sub j.

00:32:57.726 --> 00:32:59.760
So those are my two vectors and
I'm going to take the dot

00:32:59.760 --> 00:33:00.850
product of those.

00:33:00.850 --> 00:33:04.310
That's the first piece, right?

00:33:04.310 --> 00:33:07.345
Boy, this is hard.

00:33:07.345 --> 00:33:13.760
So minus, and now, the next term
looks like alpha i, y sub

00:33:13.760 --> 00:33:17.395
i, x sub i times w.

00:33:17.395 --> 00:33:19.640
So you've got a whole
bunch of these.

00:33:19.640 --> 00:33:26.996
We've got a sum of alpha i times
y sub i times x sub i,

00:33:26.996 --> 00:33:30.425
and then, that gets multiplied
times w.

00:33:30.425 --> 00:33:39.160
So we'll put this like this, the
sum of alpha j, y sub j, x

00:33:39.160 --> 00:33:41.630
sub j in there like that.

00:33:41.630 --> 00:33:44.345
And then, that's the dot
product like that.

00:33:44.345 --> 00:33:45.890
That wasn't as bad
as I thought.

00:33:49.731 --> 00:33:54.150
Now, I've got to deal with the
next term, the alpha i times y

00:33:54.150 --> 00:33:55.740
sub i times b.

00:33:58.475 --> 00:34:07.746
So that's minus the sum of alpha
i times y sub i times b.

00:34:07.746 --> 00:34:13.949
And then, to finish it off, we
have plus the sum of alpha sub

00:34:13.949 --> 00:34:18.320
i minus 1 up there, minus 1 in
front of the summation, so

00:34:18.320 --> 00:34:20.059
that's the sum of the alphas.

00:34:20.059 --> 00:34:21.605
Are you with me so far?

00:34:21.605 --> 00:34:24.096
Just a little algebra.

00:34:24.096 --> 00:34:24.860
It looks good.

00:34:24.860 --> 00:34:28.838
I think I haven't
mucked it up yet.

00:34:28.838 --> 00:34:30.952
Let's see.

00:34:30.952 --> 00:34:34.364
Alpha i times y sub i times
b. b is a constant.

00:34:34.364 --> 00:34:37.409
So pull that out there, and
then, I just got the sum of

00:34:37.409 --> 00:34:41.078
alpha sub i times y sub i.

00:34:41.078 --> 00:34:42.250
Oh, that's good.

00:34:42.250 --> 00:34:43.500
That's 0.

00:34:48.304 --> 00:34:51.900
Now, so for every one of these
terms, we dot it with this

00:34:51.900 --> 00:34:53.150
whole expression.

00:34:54.966 --> 00:35:00.050
So that's just like taking this
thing here and dotting

00:35:00.050 --> 00:35:02.145
those two things together,
right?

00:35:02.145 --> 00:35:04.240
Oh, but that's just the same
thing we've got here.

00:35:07.324 --> 00:35:11.140
So now, what we can do is we
can say that we can rewrite

00:35:11.140 --> 00:35:15.560
this Lagrangian as--

00:35:15.560 --> 00:35:19.566
we've got that sum of alpha i.

00:35:19.566 --> 00:35:22.256
That's the positive element.

00:35:22.256 --> 00:35:25.680
And then, we've got one of
these and half of these.

00:35:25.680 --> 00:35:28.865
So that's minus 1/2.

00:35:28.865 --> 00:35:30.980
And now, I'll just convert that
whole works into a double

00:35:30.980 --> 00:35:43.230
sum over both i and j of alpha
i times alpha j times y sub i

00:35:43.230 --> 00:35:49.760
times y sub j times x sub
i dotted with x sub j.
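
NOTE
Editor's sketch of the expression just derived: the sum of the alphas minus
one half of the double sum over alpha_i alpha_j y_i y_j (x_i . x_j).
The helper names (dot, dual_objective) and the two-point toy problem are
assumptions made for illustration; they are not from the lecture or its demo.
```python
# Sketch: evaluate the dual expression from the lecture in plain Python.
# Function names and the toy samples below are illustrative assumptions.
from typing import Sequence
#
def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))
#
def dual_objective(alphas, ys, xs) -> float:
    """sum_i alpha_i - 1/2 sum_i sum_j alpha_i alpha_j y_i y_j (x_i . x_j)."""
    n = len(xs)
    pair_sum = sum(
        alphas[i] * alphas[j] * ys[i] * ys[j] * dot(xs[i], xs[j])
        for i in range(n)
        for j in range(n)
    )
    return sum(alphas) - 0.5 * pair_sum
#
# Toy problem (assumed): one plus sample at (2, 0), one minus sample at (0, 0).
xs = [(2.0, 0.0), (0.0, 0.0)]
ys = [1.0, -1.0]
alphas = [0.5, 0.5]  # chosen so that sum(alpha_i * y_i) == 0
# Recover w from w = sum_i alpha_i y_i x_i, as derived in the lecture.
w = [sum(a * y * x[k] for a, y, x in zip(alphas, ys, xs)) for k in range(2)]
print(w, dual_objective(alphas, ys, xs), 0.5 * dot(w, w))
```
The samples enter dual_objective only through dot(xs[i], xs[j]), which is
the dependence on pairs of samples that the lecture singles out.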

00:35:52.670 --> 00:35:55.560
We sure went through a lot of
trouble to get there, but now,

00:35:55.560 --> 00:35:56.210
we've got it.

00:35:56.210 --> 00:35:59.200
And we know that what we're
trying to do is we're trying

00:35:59.200 --> 00:36:03.320
to find a maximum of
that expression.

00:36:07.212 --> 00:36:08.910
And that's the one we're
going to hand off to

00:36:08.910 --> 00:36:11.010
the numerical analysts.

00:36:11.010 --> 00:36:13.090
So if we're going to hand this
off to the numerical analysts

00:36:13.090 --> 00:36:16.136
anyway, why did I go to
all this trouble?

00:36:16.136 --> 00:36:19.200
Good question.

00:36:19.200 --> 00:36:22.626
Do you have any idea why I
went to all this trouble?

00:36:22.626 --> 00:36:25.440
Because I wanted to find out
the dependence of this

00:36:25.440 --> 00:36:26.950
expression.

00:36:26.950 --> 00:36:28.120
Wanda is telling me.

00:36:28.120 --> 00:36:29.450
I'm translating as I go.

00:36:29.450 --> 00:36:31.555
She's telling me in Romanian.

00:36:31.555 --> 00:36:35.510
I want to find what this
maximization depends on with

00:36:35.510 --> 00:36:41.160
respect to these vectors, the
x, the sample vectors.

00:36:41.160 --> 00:36:46.480
And what I've discovered is that
the optimization depends

00:36:46.480 --> 00:36:53.976
only on the dot product
of pairs of samples.

00:36:53.976 --> 00:36:55.300
And that's something we
want to keep in mind.

00:36:55.300 --> 00:36:56.620
That's why I put it
in royal purple.

00:36:59.350 --> 00:37:02.920
Now, up here, so let's see.

00:37:02.920 --> 00:37:04.210
What do we call that
one up there?

00:37:04.210 --> 00:37:05.715
That's two.

00:37:05.715 --> 00:37:10.505
I guess, we'll call this
piece here three.

00:37:10.505 --> 00:37:12.600
This piece here is four.

00:37:12.600 --> 00:37:15.060
And now, there's
one more piece.

00:37:15.060 --> 00:37:20.080
Because I want to take that w,
and not only stick it back

00:37:20.080 --> 00:37:22.700
into that Lagrangian, I want
to stick it back into the

00:37:22.700 --> 00:37:24.446
decision rule.

00:37:24.446 --> 00:37:29.030
So now, my decision rule with
this expression for w is going

00:37:29.030 --> 00:37:31.410
to be w plugged into
that thing.

00:37:31.410 --> 00:37:37.000
So the decision rule is going to
look like the sum of alpha

00:37:37.000 --> 00:37:45.960
i times y sub i times x sub
i dotted with the unknown

00:37:45.960 --> 00:37:47.840
vector, like so.

00:37:47.840 --> 00:37:51.536
And we're going to,
I guess, add b.

00:37:51.536 --> 00:37:53.770
And we're going to say, if
that's greater than or equal

00:37:53.770 --> 00:37:57.660
to 0, then plus.
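
NOTE
Editor's sketch of the decision rule as stated here: classify an unknown u
as plus when the sum of alpha_i y_i (x_i . u), plus b, is at least 0.
The names (dot, classify) and every number below are assumptions for
illustration, not values from the lecture's demonstration.
```python
# Sketch of the decision rule written only in terms of dot products.
# The toy samples, alphas, and b are illustrative assumptions.
from typing import Sequence
#
def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))
#
def classify(u, alphas, ys, xs, b: float) -> str:
    """Return '+' when sum_i alpha_i y_i (x_i . u) + b >= 0, else '-'."""
    score = sum(a * y * dot(x, u) for a, y, x in zip(alphas, ys, xs)) + b
    return "+" if score >= 0 else "-"
#
# Toy problem (assumed): one plus sample at (2, 0), one minus sample at (0, 0).
xs = [(2.0, 0.0), (0.0, 0.0)]
ys = [1.0, -1.0]
alphas = [0.5, 0.5]
b = -1.0  # puts the boundary midway between the two samples, at x = 1
print(classify((3.0, 5.0), alphas, ys, xs, b))
print(classify((0.5, -2.0), alphas, ys, xs, b))
```
Samples whose alpha is 0 drop out of the sum, so only the samples with
nonzero alpha need to be kept around at classification time.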

00:38:00.560 --> 00:38:04.750
So you see why the math is
beginning to sing to us now.

00:38:04.750 --> 00:38:08.840
Because now, we discover that
the decision rule, also,

00:38:08.840 --> 00:38:12.700
depends only on the dot product
of those sample

00:38:12.700 --> 00:38:15.340
vectors and the unknown.

00:38:15.340 --> 00:38:18.640
So the total dependence
of all of the

00:38:18.640 --> 00:38:21.106
math is on the dot products.

00:38:21.106 --> 00:38:24.034
All right.

00:38:24.034 --> 00:38:27.160
And now, I hear a whisper.

00:38:27.160 --> 00:38:30.410
Someone is saying, I
don't believe that

00:38:30.410 --> 00:38:31.720
mathematicians can do it.

00:38:31.720 --> 00:38:33.850
I don't think those numerical
analysts can find the

00:38:33.850 --> 00:38:35.100
optimization.

00:38:37.360 --> 00:38:38.925
I want to be sure of it.

00:38:38.925 --> 00:38:40.850
Give me ocular proof.

00:38:40.850 --> 00:38:42.360
So I'd like to run a
demonstration of it.

00:38:56.596 --> 00:38:57.090
OK.

00:38:57.090 --> 00:38:58.060
There's our sample problem.

00:38:58.060 --> 00:38:59.800
The one I started the
hour out with.

00:38:59.800 --> 00:39:05.430
Now, if the optimization
algorithm doesn't get stuck in

00:39:05.430 --> 00:39:07.720
a local maximum or something,
it should find a nice,

00:39:07.720 --> 00:39:10.900
straight line separating those
two guys, finding the widest

00:39:10.900 --> 00:39:14.445
street between the minuses
and the pluses.

00:39:14.445 --> 00:39:16.880
So in just a couple of steps,
you can see down

00:39:16.880 --> 00:39:18.150
there in step 11.

00:39:18.150 --> 00:39:20.630
It's decided that it's done
as much as it can on the

00:39:20.630 --> 00:39:22.406
optimization.

00:39:22.406 --> 00:39:25.480
And it's got three alphas.

00:39:25.480 --> 00:39:30.970
And you can see that the two
negative samples both figure

00:39:30.970 --> 00:39:34.575
into the solution, the weights
on the Lagrangian multipliers

00:39:34.575 --> 00:39:36.820
are given by those little
yellow bars.

00:39:36.820 --> 00:39:40.030
So the two negatives participate
in the solution, as does

00:39:40.030 --> 00:39:42.040
one of the positives, but the
other positive doesn't.

00:39:42.040 --> 00:39:45.500
So it has a 0 weight.

00:39:45.500 --> 00:39:47.700
So everything worked out well.

00:39:47.700 --> 00:39:50.440
Now, I said, as long as it
doesn't get stuck on a local

00:39:50.440 --> 00:39:55.095
maximum, guess what, those
mathematical friends of ours

00:39:55.095 --> 00:39:58.120
can tell us and prove
to us that this

00:39:58.120 --> 00:40:00.420
thing is a convex space.

00:40:00.420 --> 00:40:04.042
That means it can never get
stuck in a local maximum.

00:40:04.042 --> 00:40:07.780
So in contrast with things like
neural nets, where you

00:40:07.780 --> 00:40:11.160
have a plague of local maxima,
this guy never gets stuck in a

00:40:11.160 --> 00:40:12.355
local maximum.

00:40:12.355 --> 00:40:15.536
Let's try some other examples.

00:40:15.536 --> 00:40:17.250
Here's two vertical points--

00:40:17.250 --> 00:40:20.920
no surprises there, right?

00:40:20.920 --> 00:40:22.470
Well, you say, well,
maybe it can't deal

00:40:22.470 --> 00:40:24.165
with diagonal points.

00:40:24.165 --> 00:40:26.830
Sure it can.

00:40:26.830 --> 00:40:32.091
How about this thing here?

00:40:32.091 --> 00:40:38.510
Yeah, it only needed two of the
points since any two, a

00:40:38.510 --> 00:40:41.820
plus and a minus, will
define the street.

00:40:41.820 --> 00:40:44.580
Let's try this guy.

00:40:44.580 --> 00:40:46.526
Oh.

00:40:46.526 --> 00:40:47.110
What do you think?

00:40:47.110 --> 00:40:50.046
What happened here?

00:40:50.046 --> 00:40:51.345
Well, we're screwed, right?

00:40:51.345 --> 00:40:52.595
Because it's linearly
inseparable--

00:40:56.629 --> 00:40:57.879
bad news.

00:41:00.175 --> 00:41:04.250
So in situations where it's
linearly inseparable, the

00:41:04.250 --> 00:41:07.060
mechanism struggles, and
eventually, it will just slow

00:41:07.060 --> 00:41:08.570
down and you truncate
it, because it's

00:41:08.570 --> 00:41:09.510
not making any progress.

00:41:09.510 --> 00:41:14.765
And you see the red dots there
are ones that it got wrong.

00:41:14.765 --> 00:41:17.480
So you say, well, too bad for
our side-- doesn't look like

00:41:17.480 --> 00:41:19.502
it's all that good anyway.

00:41:19.502 --> 00:41:26.020
But then, a powerful idea comes
to the rescue: when

00:41:26.020 --> 00:41:28.896
stuck, switch to another
perspective.

00:41:28.896 --> 00:41:31.850
So if we don't like the space
that we're in, because it

00:41:31.850 --> 00:41:37.680
gives examples that are not
linearly separable, then we

00:41:37.680 --> 00:41:39.705
can say, oh, shoot.

00:41:39.705 --> 00:41:42.052
Here's our space.

00:41:42.052 --> 00:41:43.302
Here are two points.

00:41:49.486 --> 00:41:52.944
Here are two other points.

00:41:52.944 --> 00:41:54.630
We can't separate them.

00:41:54.630 --> 00:41:57.740
But if we could somehow get them
into another space, maybe

00:41:57.740 --> 00:42:06.600
we can separate them, because
they look like this in the

00:42:06.600 --> 00:42:08.925
other space, and they're
easy to separate.

00:42:08.925 --> 00:42:12.820
So what we need, then, is a
transformation that will take

00:42:12.820 --> 00:42:16.070
us from the space we're in into
a space where things are

00:42:16.070 --> 00:42:17.590
more convenient, so we're
going to call that

00:42:17.590 --> 00:42:22.745
transformation phi
with a vector, x.

00:42:22.745 --> 00:42:23.855
That's the transformation.

00:42:23.855 --> 00:42:26.290
And now, here's the reason
for all the magic.

00:42:28.950 --> 00:42:34.880
I said that the maximization
only depends on dot products.

00:42:34.880 --> 00:42:38.810
So all I need to do the
maximization is the

00:42:38.810 --> 00:42:43.975
transformation of one vector
dotted with the transformation

00:42:43.975 --> 00:42:47.235
of another vector, like so.

00:42:47.235 --> 00:42:51.260
That's what I need to maximize,
or to find the

00:42:51.260 --> 00:42:52.510
maximum on.

00:42:52.510 --> 00:42:55.216
Then, in order to recognize--

00:42:55.216 --> 00:42:57.706
where did it go?

00:42:57.706 --> 00:42:59.260
Underneath the chalkboard.

00:43:05.290 --> 00:43:06.002
Oh, yes.

00:43:06.002 --> 00:43:06.900
Here it is.

00:43:06.900 --> 00:43:09.620
To recognize, all I need
is dot products, too.

00:43:09.620 --> 00:43:17.025
So for that one I need phi of
x dotted with phi of u.

00:43:17.025 --> 00:43:19.300
And just to make this a little
bit more consistent, the

00:43:19.300 --> 00:43:22.750
notation, I'll call that
x sub j and this x sub i.

00:43:22.750 --> 00:43:23.550
And that's x sub i.

00:43:23.550 --> 00:43:27.595
Those are the quantities I
need in order to do it.

00:43:27.595 --> 00:43:34.540
So that means that if I have a
function, let's call it k of x

00:43:34.540 --> 00:43:45.370
sub i and x sub j, that's equal
to phi of x sub i dotted

00:43:45.370 --> 00:43:49.191
with phi of x sub j.

00:43:49.191 --> 00:43:50.215
Then, I'm done.

00:43:50.215 --> 00:43:52.306
This is what I need.

00:43:52.306 --> 00:43:54.020
I don't actually need this.

00:43:56.955 --> 00:44:00.990
All I need is that function, k,
which happens to be called

00:44:00.990 --> 00:44:04.650
a kernel function, which
provides me with the dot

00:44:04.650 --> 00:44:07.745
product of those two vectors
in another space.

00:44:07.745 --> 00:44:09.310
I don't have to know
the transformation

00:44:09.310 --> 00:44:11.200
into the other space.

00:44:11.200 --> 00:44:15.935
And that's the reason that
this stuff is a miracle.

00:44:15.935 --> 00:44:19.595
So what are some of the kernels
that are popular?

00:44:19.595 --> 00:44:27.200
One is the polynomial kernel that
says that u dotted with v plus

00:44:27.200 --> 00:44:32.515
1 to the n-th is such a kernel,
because it's got u in

00:44:32.515 --> 00:44:35.190
it and v in it, the
two vectors.

00:44:35.190 --> 00:44:38.060
And this is what the dot product
is in the other space.

00:44:38.060 --> 00:44:39.550
So that's one choice.
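
NOTE
Editor's sketch of why a kernel can stand in for the transformation.
For two-dimensional inputs and n = 2, the (u . v + 1)^n kernel equals the
dot product of the two vectors after an explicit map phi into six dimensions.
The particular phi below is one map that works, chosen for illustration;
the lecture does not write it out, and the names are assumptions.
```python
# Sketch: (u . v + 1)^2 computed directly versus through an explicit map phi.
# phi and the sample vectors are illustrative assumptions.
import math
from typing import Sequence, Tuple
#
def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))
#
def poly_kernel(u: Sequence[float], v: Sequence[float], n: int = 2) -> float:
    """The kernel from the lecture: (u . v + 1) raised to the n-th power."""
    return (dot(u, v) + 1.0) ** n
#
def phi(x: Sequence[float]) -> Tuple[float, ...]:
    """Explicit 2-D to 6-D map whose dot products match poly_kernel with n = 2."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (x1 * x1, x2 * x2, r2 * x1 * x2, r2 * x1, r2 * x2, 1.0)
#
u, v = (1.0, 2.0), (3.0, -1.0)
print(poly_kernel(u, v), dot(phi(u), phi(v)))
```
The two printed numbers agree up to floating-point rounding, and
poly_kernel never builds the six-dimensional vectors.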

00:44:39.550 --> 00:44:42.450
Another choice is a kernel
that looks like

00:44:42.450 --> 00:44:46.295
this, e to the minus.

00:44:46.295 --> 00:44:50.440
Let's take the dot product
of the difference

00:44:50.440 --> 00:44:51.690
of those two guys.

00:44:53.880 --> 00:44:56.360
Let's take the magnitude
of that and

00:44:56.360 --> 00:44:57.660
divide it by some sigma.

00:44:57.660 --> 00:45:01.160
That's a second kind of kernel
that we can use.

00:45:01.160 --> 00:45:04.350
So let's go back and see if we
can solve this problem by

00:45:04.350 --> 00:45:06.350
transforming it into another
space where we have another

00:45:06.350 --> 00:45:07.600
perspective.

00:45:10.082 --> 00:45:15.618
So that's it.

00:45:15.618 --> 00:45:17.760
That's another kernel.

00:45:17.760 --> 00:45:18.870
And so sure, we can.

00:45:18.870 --> 00:45:21.280
And that's the answer when
transformed back into the

00:45:21.280 --> 00:45:22.905
original space.

00:45:22.905 --> 00:45:24.690
We can also try doing that
with a so-called

00:45:24.690 --> 00:45:25.780
radial basis kernel.

00:45:25.780 --> 00:45:28.112
That's the one with the
exponential in it.

00:45:28.112 --> 00:45:29.310
We can learn on that one.

00:45:29.310 --> 00:45:30.480
Boom.

00:45:30.480 --> 00:45:33.346
No problem.

00:45:33.346 --> 00:45:36.860
So we've got a general method
that's convex and guaranteed

00:45:36.860 --> 00:45:39.245
to produce a global solution.

00:45:39.245 --> 00:45:42.950
We've got a mechanism that
easily allows us to transform

00:45:42.950 --> 00:45:45.470
this into another space.

00:45:45.470 --> 00:45:47.695
So it works like a charm.

00:45:47.695 --> 00:45:50.736
Of course, it doesn't remove
all possible problems.

00:45:50.736 --> 00:45:53.650
Look at that exponential
thing here.

00:45:53.650 --> 00:45:59.890
If we choose a sigma that is
small enough, then those

00:45:59.890 --> 00:46:02.760
sigmas are essentially shrunk
right around the sample

00:46:02.760 --> 00:46:06.092
points, and we could
get overfitting.
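
NOTE
Editor's sketch of the sigma effect described here, using the exponential
kernel in the form spoken earlier: e to the minus (magnitude of u - v)
divided by sigma. The function name and sample values are assumptions.
A widely used variant divides the squared distance by 2 sigma squared
instead; that variant is not what this sketch computes.
```python
# Sketch: how sigma changes the similarity between two points one unit apart.
# rbf_kernel and the sample points are illustrative assumptions.
import math
from typing import Sequence
#
def rbf_kernel(u: Sequence[float], v: Sequence[float], sigma: float) -> float:
    """exp(-||u - v|| / sigma), following the form described in the lecture."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return math.exp(-dist / sigma)
#
u, v = (0.0, 0.0), (1.0, 0.0)
for sigma in (10.0, 1.0, 0.1):
    # As sigma shrinks, the kernel value for distinct points falls toward 0.
    print(sigma, rbf_kernel(u, v, sigma))
```
With a very small sigma each sample looks similar only to itself, which is
the shrinking around the sample points that can lead to overfitting.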

00:46:06.092 --> 00:46:09.385
So it doesn't immunize us
against overfitting, but it

00:46:09.385 --> 00:46:12.500
does immunize us against local
maxima and does provide us

00:46:12.500 --> 00:46:16.820
with a general mechanism for
doing a transformation into

00:46:16.820 --> 00:46:18.935
another space with a
better perspective.

00:46:18.935 --> 00:46:22.435
Now, the history lesson, all
this stuff feels fairly new.

00:46:22.435 --> 00:46:25.746
It feels like it's younger
than you are.

00:46:25.746 --> 00:46:27.822
Here's the history of it.

00:46:27.822 --> 00:46:31.060
Vapnik immigrated from the
Soviet Union to the United

00:46:31.060 --> 00:46:33.760
States in about 1991.

00:46:33.760 --> 00:46:36.795
Nobody ever heard of this stuff
before he immigrated.

00:46:36.795 --> 00:46:40.200
He actually had done this work
on the basic support vector

00:46:40.200 --> 00:46:44.355
idea in his Ph.D. thesis
at Moscow University

00:46:44.355 --> 00:46:46.590
in the early '60s.

00:46:46.590 --> 00:46:49.470
But it wasn't possible for him
to do anything with it,

00:46:49.470 --> 00:46:51.220
because they didn't have any
computers they could try

00:46:51.220 --> 00:46:53.010
anything out with.

00:46:53.010 --> 00:46:57.460
So he spent the next 25 years at
some oncology institute in

00:46:57.460 --> 00:47:00.660
the Soviet Union doing
applications.

00:47:00.660 --> 00:47:03.440
Somebody from Bell Labs
discovers him, invites him

00:47:03.440 --> 00:47:05.445
over to the United States
where, subsequently, he

00:47:05.445 --> 00:47:07.466
decides to immigrate.

00:47:07.466 --> 00:47:13.580
In 1992, or thereabouts, Vapnik
submits three papers to

00:47:13.580 --> 00:47:17.115
NIPS, the Neural Information
Processing Systems journal.

00:47:17.115 --> 00:47:19.065
All of them were rejected.

00:47:19.065 --> 00:47:23.570
He's still sore about it,
but it's motivating.

00:47:23.570 --> 00:47:27.060
So around 1992, 1993, Bell
Labs was interested in

00:47:27.060 --> 00:47:28.420
hand-written character
recognition

00:47:28.420 --> 00:47:30.456
and in neural nets.

00:47:30.456 --> 00:47:33.270
Vapnik thinks that
neural nets--

00:47:33.270 --> 00:47:36.295
what would be a good
word to use?

00:47:36.295 --> 00:47:38.410
I can think of the vernacular,
but he thinks that

00:47:38.410 --> 00:47:40.150
they're not very good.

00:47:40.150 --> 00:47:44.320
So he bets a colleague a good
dinner that support vector

00:47:44.320 --> 00:47:46.385
machines will eventually do
better at handwriting

00:47:46.385 --> 00:47:50.356
recognition than neural nets.

00:47:50.356 --> 00:47:51.690
And it's a dinner bet, right?

00:47:51.690 --> 00:47:52.600
It's not that big of a deal.

00:47:52.600 --> 00:47:55.280
But as Napoleon said, it's
amazing what a soldier will do

00:47:55.280 --> 00:47:57.641
for a bit of ribbon.

00:47:57.641 --> 00:48:01.380
So that same colleague, who's
working on this problem with

00:48:01.380 --> 00:48:06.730
handwritten recognition, decides
to try a support

00:48:06.730 --> 00:48:12.700
vector machine with a kernel,
in which n equals 2, just

00:48:12.700 --> 00:48:14.820
slightly nonlinear, works
like a charm.

00:48:17.530 --> 00:48:19.890
Was this the first time anybody
tried a kernel?

00:48:19.890 --> 00:48:23.070
Vapnik actually had the idea in
his thesis but never thought

00:48:23.070 --> 00:48:25.560
it was very important.

00:48:25.560 --> 00:48:29.670
As soon as it was shown to work
in the early '90s on the

00:48:29.670 --> 00:48:32.090
problem of handwriting recognition,
Vapnik

00:48:32.090 --> 00:48:35.190
resuscitated the idea of the
kernel, began to develop it,

00:48:35.190 --> 00:48:38.270
and it became an essential part of
the whole approach of using

00:48:38.270 --> 00:48:39.920
support vector machines.

00:48:39.920 --> 00:48:43.980
So the main point about this
is that it was 30 years in

00:48:43.980 --> 00:48:47.380
between the concept and anybody
ever hearing about it.

00:48:47.380 --> 00:48:52.360
It was 30 years between Vapnik's
understanding of

00:48:52.360 --> 00:48:55.840
kernels and his appreciation
of their importance.

00:48:55.840 --> 00:48:59.870
And that's the way things often
go, great ideas followed

00:48:59.870 --> 00:49:03.320
by long periods of nothing
happening, followed by an

00:49:03.320 --> 00:49:06.640
epiphanous moment when the
original idea seemed to have

00:49:06.640 --> 00:49:09.320
great power with just a
little bit of a twist.

00:49:09.320 --> 00:49:10.960
And then, the world
never looks back.

00:49:10.960 --> 00:49:14.780
And Vapnik, who nobody ever
heard of until the early '90s,

00:49:14.780 --> 00:49:18.380
becomes famous for something
that everybody knows about

00:49:18.380 --> 00:49:19.630
today who does machine
learning.