WEBVTT

00:00:01.640 --> 00:00:04.040
The following content is
provided under a Creative

00:00:04.040 --> 00:00:05.580
Commons license.

00:00:05.580 --> 00:00:07.880
Your support will help
MIT OpenCourseWare

00:00:07.880 --> 00:00:12.270
continue to offer high quality
educational resources for free.

00:00:12.270 --> 00:00:14.870
To make a donation or
view additional materials

00:00:14.870 --> 00:00:18.830
from hundreds of MIT courses,
visit MIT OpenCourseWare

00:00:18.830 --> 00:00:21.670
at ocw.mit.edu.

00:00:21.670 --> 00:00:23.420
LORENZO ROSASCO: So
what we want to do now

00:00:23.420 --> 00:00:25.040
is to move away
from local methods

00:00:25.040 --> 00:00:29.810
and start to do some form of
global regularization method.

00:00:29.810 --> 00:00:33.470
The word regularization I'm
going to use broadly as a term

00:00:33.470 --> 00:00:36.860
to define statistical
procedures

00:00:36.860 --> 00:00:38.540
and computational
procedures that

00:00:38.540 --> 00:00:41.780
have some parameter that
allows you to go from a complex model

00:00:41.780 --> 00:00:43.880
to a simple model, in
a very broad sense.

00:00:43.880 --> 00:00:46.280
What I mean by complex is
something that is potentially

00:00:46.280 --> 00:00:49.910
going closer to overfitting and
by simple model something that

00:00:49.910 --> 00:00:54.630
is giving me something which
is stable with respect to the data.

00:00:54.630 --> 00:01:01.800
So we're going to consider
the following algorithm.

00:01:01.800 --> 00:01:03.960
I imagine a lot of you
have seen it before.

00:01:03.960 --> 00:01:07.890
This is called-- it has a
bunch of different names--

00:01:07.890 --> 00:01:12.200
probably the most famous one
is Tikhonov regularization.

00:01:12.200 --> 00:01:14.520
A bunch of people at the
beginning of the '60s

00:01:14.520 --> 00:01:17.520
thought about something
similar either in the context

00:01:17.520 --> 00:01:20.974
of statistics or solving
linear equations.

00:01:20.974 --> 00:01:23.515
So Tikhonov is the only one for
which I can find the picture.

00:01:23.515 --> 00:01:25.270
The other one was
Phillips, and then there

00:01:25.270 --> 00:01:28.020
is Hoerl and other people.

00:01:28.020 --> 00:01:31.007
They all basically thought
about this same procedure.

00:01:34.750 --> 00:01:39.506
The procedure basically
is based on a functional

00:01:39.506 --> 00:01:41.380
that you want to minimize,
composed of two terms.

00:01:41.380 --> 00:01:43.338
So there are several
ingredients going on here.

00:01:43.338 --> 00:01:44.970
First of all, this is f of x.

00:01:44.970 --> 00:01:47.760
We assume the
functional form of-- we

00:01:47.760 --> 00:01:49.740
try to estimate the
function, and we

00:01:49.740 --> 00:01:51.870
do assume a parametric form
of this function, which

00:01:51.870 --> 00:01:54.720
in this case, is just linear.

00:01:54.720 --> 00:01:57.810
And for the time being,
because you can easily

00:01:57.810 --> 00:02:01.230
put it back in, I
don't look at the offset.

00:02:01.230 --> 00:02:03.347
So I just take lines
passing through the origin.

00:02:03.347 --> 00:02:05.430
And this is just because
you can prove in one line

00:02:05.430 --> 00:02:09.665
that you can put back in
the offset at zero cost.

00:02:09.665 --> 00:02:11.039
So for the time
being, just think

00:02:11.039 --> 00:02:12.372
that the data are actually centered.

00:02:14.700 --> 00:02:17.670
The way you try to estimate
this parameter is, on one hand,

00:02:17.670 --> 00:02:22.420
try to make the empirical error
small, and on the other hand,

00:02:22.420 --> 00:02:27.844
you put a budget on the weights.
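
NOTE
For reference, a sketch of the functional being described, assuming the standard Tikhonov least squares setup used in this lecture:
    \[
    \hat w_\lambda = \arg\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \big( y_i - w^\top x_i \big)^2 \; + \; \lambda \, \| w \|^2
    \]
The first term is the empirical error; the second is the budget on the weights, traded off by lambda.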

00:02:27.844 --> 00:02:29.010
The reason why you do this--

00:02:29.010 --> 00:02:31.160
there are a bunch of
ways to explain this.

00:02:31.160 --> 00:02:34.110
Andrei yesterday talked about
margin, and different lines,

00:02:34.110 --> 00:02:35.270
and so on.

00:02:35.270 --> 00:02:37.620
Another way to think about
it is that you can convince

00:02:37.620 --> 00:02:39.720
yourself-- and we were
going to see later--

00:02:39.720 --> 00:02:44.550
that if you're in low dimension,
a line is a very poor model.

00:02:44.550 --> 00:02:47.587
Because basically if you
have more than a few points--

00:02:47.587 --> 00:02:49.170
and they're not
standing on the line--

00:02:49.170 --> 00:02:51.400
you will not be able
to make zero error.

00:02:51.400 --> 00:02:53.100
But if the number
of points is lower

00:02:53.100 --> 00:02:54.900
than the number
of dimensions, you

00:02:54.900 --> 00:02:58.200
can show that the line actually
can give you zero error.

00:02:58.200 --> 00:03:00.750
It's just a matter of
degrees of freedom.

00:03:00.750 --> 00:03:04.570
You have fewer equations
than the actual variables.

00:03:04.570 --> 00:03:07.140
So what you do is
that you actually

00:03:07.140 --> 00:03:09.240
add a regularization term.

00:03:09.240 --> 00:03:11.820
It's basically a term that
makes the problem well-posed.

00:03:11.820 --> 00:03:13.350
We're going to see
this in a minute

00:03:13.350 --> 00:03:14.800
from a different perspective.

00:03:14.800 --> 00:03:18.390
The easiest one is
going to be numerical.

00:03:18.390 --> 00:03:21.210
We stick to least squares
for-- and there is--

00:03:21.210 --> 00:03:26.430
so there is an extra
parenthesis that I forgot,

00:03:26.430 --> 00:03:28.571
but before I tell you why
we use least squares also

00:03:28.571 --> 00:03:30.570
let me tell you that--
as somebody pointed out--

00:03:30.570 --> 00:03:32.611
there is a mistake here,
because this should be a minus.

00:03:36.420 --> 00:03:38.396
It should just be a minus.

00:03:38.396 --> 00:03:41.350
I'll fix this.

00:03:41.350 --> 00:03:43.760
So back, why do you
use least squares?

00:03:43.760 --> 00:03:47.017
OK, so least squares
on the one hand,

00:03:47.017 --> 00:03:48.600
if you're in low
dimension especially,

00:03:48.600 --> 00:03:51.340
you can think of least
squares as basic in a way,

00:03:51.340 --> 00:03:54.390
but it's not a very robust
way to measure error,

00:03:54.390 --> 00:03:55.610
because you square the errors.

00:03:55.610 --> 00:04:00.750
And so just one error
can count a lot.

00:04:00.750 --> 00:04:02.570
So typically, there
is a whole literature

00:04:02.570 --> 00:04:04.290
on robust statistics, where
you want to replace least

00:04:04.290 --> 00:04:06.123
squares with something
like an absolute value

00:04:06.123 --> 00:04:07.560
or something like that.

00:04:07.560 --> 00:04:09.490
It turns out that at
least in our experience

00:04:09.490 --> 00:04:11.432
and when you have high
dimensional problems,

00:04:11.432 --> 00:04:13.890
it's not completely clear how
much this kind of instability

00:04:13.890 --> 00:04:16.890
will occur and will not
be cured by just adding

00:04:16.890 --> 00:04:18.660
some regularization term.

00:04:18.660 --> 00:04:22.710
And the computations
underlying this algorithm

00:04:22.710 --> 00:04:25.300
are extremely, extremely simple.

00:04:25.300 --> 00:04:28.230
So that's why we're
sticking to this, because it

00:04:28.230 --> 00:04:29.769
works pretty well in practice.

00:04:29.769 --> 00:04:31.560
We actually developed
in the last few years

00:04:31.560 --> 00:04:34.170
some toolbox that you can use.

00:04:34.170 --> 00:04:35.970
They're pretty
much plug and play.

00:04:35.970 --> 00:04:39.120
And because the algorithm
is easy to understand

00:04:39.120 --> 00:04:40.420
in simpler terms.

00:04:40.420 --> 00:04:42.600
Yesterday, Andrei was
talking about SVM.

00:04:42.600 --> 00:04:44.650
SVM is very similar
in principle.

00:04:44.650 --> 00:04:46.650
Basically the only
difference is that you change

00:04:46.650 --> 00:04:49.040
the way you measure cost here.

00:04:49.040 --> 00:04:51.690
This algorithm you can use
both for classification

00:04:51.690 --> 00:04:54.351
and regression, whereas SVM--

00:04:54.351 --> 00:04:56.100
the one which was
talked about yesterday--

00:04:56.100 --> 00:04:57.670
is just for classification.

00:04:57.670 --> 00:05:02.790
And because the cost function
turns out to be non-smooth--

00:05:02.790 --> 00:05:05.970
and non-smooth is basically
non-differentiable--

00:05:05.970 --> 00:05:08.321
the whole math is
much more complicated,

00:05:08.321 --> 00:05:10.320
because you have to learn
how to minimize things

00:05:10.320 --> 00:05:11.650
that are not differentiable.

00:05:11.650 --> 00:05:15.720
So in this case, you can
stick to elementary stuff.

00:05:15.720 --> 00:05:19.200
And I think I put it somewhere--
also because Legendre

00:05:19.200 --> 00:05:22.740
200 years ago said that least
squares are really great.

00:05:22.740 --> 00:05:24.320
There is this old story--

00:05:24.320 --> 00:05:28.570
who between Gauss and Legendre
invented least squares first.

00:05:28.570 --> 00:05:31.050
And there are actually
long articles about this.

00:05:31.050 --> 00:05:33.110
But anyway, it's
around that time.

00:05:33.110 --> 00:05:34.437
It's around the end of the--

00:05:34.437 --> 00:05:36.020
this is when he was
born-- it's around

00:05:36.020 --> 00:05:37.260
the end of the 18th century.

00:05:37.260 --> 00:05:42.430
So the algorithm is pretty old.

00:05:42.430 --> 00:05:43.340
So what's the idea?

00:05:43.340 --> 00:05:47.100
So back to the case we
had before, you're going

00:05:47.100 --> 00:05:50.110
to take a linear function.

00:05:50.110 --> 00:05:51.600
So one thing is--

00:05:51.600 --> 00:05:53.500
just to be careful--
think about it once.

00:05:53.500 --> 00:05:55.500
Because if you've never
thought about it before,

00:05:55.500 --> 00:05:57.240
it's good to focus.

00:05:57.240 --> 00:06:04.560
When you do this drawing,
this is not f of x.

00:06:04.560 --> 00:06:07.620
This line is not f of x.

00:06:07.620 --> 00:06:10.920
It's f of x equals zero.

00:06:10.920 --> 00:06:15.510
So I think I made enough
time to have a 3D plot.

00:06:15.510 --> 00:06:22.380
So f of x is actually a plane
that cuts through the slide.

00:06:22.380 --> 00:06:25.805
It's positive, when
it's not dotted--

00:06:25.805 --> 00:06:27.930
because these points are
positive-- and then it becomes

00:06:27.930 --> 00:06:28.830
negative.

00:06:28.830 --> 00:06:31.770
And this line is
where it changes sign.

00:06:31.770 --> 00:06:34.615
So the decision boundary
is not f of x itself,

00:06:34.615 --> 00:06:36.240
but it's the level
set that corresponds

00:06:36.240 --> 00:06:39.060
to f of x equals zero.

00:06:39.060 --> 00:06:41.460
Whereas f of x itself
is this one line.

00:06:41.460 --> 00:06:44.250
If you think in one
dimension, the points

00:06:44.250 --> 00:06:47.220
are just standing on a line.

00:06:47.220 --> 00:06:48.990
Some here are plus 1.

00:06:48.990 --> 00:06:50.180
Some here are minus 1.

00:06:50.180 --> 00:06:51.510
So what is f of x?

00:06:51.510 --> 00:06:54.660
It's just a line.

00:06:54.660 --> 00:06:57.900
What is the decision
boundary in this case?

00:06:57.900 --> 00:07:00.150
It will just be one point
in this case actually,

00:07:00.150 --> 00:07:04.060
because it's just
one line that cuts

00:07:04.060 --> 00:07:06.530
the input line in one point.

00:07:06.530 --> 00:07:07.560
And that's it.

00:07:07.560 --> 00:07:10.907
If you were to take a more
complicated nonlinear line,

00:07:10.907 --> 00:07:12.240
it would be more than one point.

00:07:12.240 --> 00:07:14.100
In two dimensions,
it becomes one line.

00:07:14.100 --> 00:07:16.894
In three dimensions, it becomes
a plane, and so on and so forth.

00:07:16.894 --> 00:07:19.310
But the important piece-- just
to remember, at least once--

00:07:19.310 --> 00:07:20.830
when we look at this plot.

00:07:20.830 --> 00:07:26.450
This is not f of
x, but only the set

00:07:26.450 --> 00:07:30.072
f of x equals zero, which
is where you change sign.

00:07:30.072 --> 00:07:32.030
And that's how you're
going to make prediction.

00:07:32.030 --> 00:07:34.039
You take real valued
functions, so you

00:07:34.039 --> 00:07:36.080
would like-- in principle,
in classification, you

00:07:36.080 --> 00:07:39.620
would allow this function
just to be binary.

00:07:39.620 --> 00:07:42.840
But optimization with binary
functions is very hard.

00:07:42.840 --> 00:07:44.540
So what do you typically
do to relax this?

00:07:44.540 --> 00:07:47.789
You just allow it to be
a real valued function,

00:07:47.789 --> 00:07:48.830
and then you take a sign.

00:07:48.830 --> 00:07:50.420
When it's positive,
you take plus 1.

00:07:50.420 --> 00:07:51.927
If it's negative,
you say minus 1.

00:07:51.927 --> 00:07:53.510
If it's a regression
problem, you just

00:07:53.510 --> 00:07:54.590
keep it for what it is.
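
NOTE
In symbols, the relaxation just described, assuming the linear model of this section:
    \[
    f(x) = w^\top x, \qquad \hat y = \operatorname{sign}(f(x)) \;\text{(classification)}, \qquad \hat y = f(x) \;\text{(regression)},
    \]
and the decision boundary is the level set \( \{ x : f(x) = 0 \} \).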

00:08:01.130 --> 00:08:04.550
And how many free parameters
does this algorithm have?

00:08:04.550 --> 00:08:05.380
Well, one.

00:08:05.380 --> 00:08:08.040
It's lambda for now and w.

00:08:08.040 --> 00:08:10.550
But w we're going to solve
by solving this optimization

00:08:10.550 --> 00:08:11.280
problem.

00:08:11.280 --> 00:08:12.170
How about lambda?

00:08:12.170 --> 00:08:15.960
Well, whatever we
discussed before for k.

00:08:15.960 --> 00:08:19.220
We would try to sit
down and do some bias

00:08:19.220 --> 00:08:21.410
variance decomposition,
see what it depends on,

00:08:21.410 --> 00:08:23.990
try to see if we can
get a grasp on what

00:08:23.990 --> 00:08:25.580
the theory of this algorithm is.

00:08:25.580 --> 00:08:28.162
And then we try to see if
we can use cross-validation.

00:08:28.162 --> 00:08:29.870
You can do all these
things, so we're not

00:08:29.870 --> 00:08:36.059
going to discuss much how you
choose lambda, but mostly we

00:08:36.059 --> 00:08:40.970
are going to discuss how you can
compute the minimizer of this.

00:08:40.970 --> 00:08:45.440
And this is not a problem,
because this is smooth.

00:08:45.440 --> 00:08:48.470
So you can take the derivative
with respect to w of this and of this.

00:08:48.470 --> 00:08:52.130
So what you can do is just to
take the derivative of this,

00:08:52.130 --> 00:08:54.360
set it equal to zero,
and check what happens.

00:08:59.090 --> 00:09:01.700
So it's useful to
introduce here--

00:09:01.700 --> 00:09:04.820
just some vectorial notation.

00:09:04.820 --> 00:09:06.770
We've already seen it before.

00:09:06.770 --> 00:09:10.700
So you take all the x's and you
stack it as rows of the data

00:09:10.700 --> 00:09:13.040
matrix x of n.

00:09:13.040 --> 00:09:17.060
So the n y's, you just stack
them as entries of a vector.

00:09:17.060 --> 00:09:18.470
You call it yn.

00:09:18.470 --> 00:09:21.140
Then you can rewrite
this term just

00:09:21.140 --> 00:09:24.440
in this way, as this vector
minus this vector here, which

00:09:24.440 --> 00:09:27.140
you obtain by multiplying
the matrix with w.

00:09:27.140 --> 00:09:28.760
So this norm is the norm in Rn.

00:09:31.270 --> 00:09:35.510
So this is just
simple rewriting.

00:09:35.510 --> 00:09:38.120
It's useful just
because if you now

00:09:38.120 --> 00:09:39.860
take the derivative
of this with respect

00:09:39.860 --> 00:09:42.330
to w, set it equal to
zero, you get this.

00:09:42.330 --> 00:09:43.690
This is the gradient.

00:09:43.690 --> 00:09:45.740
So I haven't set it to zero yet.

00:09:45.740 --> 00:09:49.190
This is the gradient of
the least square part.

00:09:49.190 --> 00:09:52.380
This is the gradient
of the second term.

00:09:52.380 --> 00:09:56.150
It is still
multiplied by lambda.

00:09:56.150 --> 00:09:59.420
If you set them equal to
zero, what you get is this.

00:09:59.420 --> 00:10:03.340
You take everything with x,
so the 2 and the 2 cancel.

00:10:03.340 --> 00:10:06.410
You took everything with
x, and you put it here.

00:10:06.410 --> 00:10:08.120
There's still the
one here with lambda.

00:10:08.120 --> 00:10:10.200
You put it here.

00:10:10.200 --> 00:10:12.319
You take this term
in x transpose y,

00:10:12.319 --> 00:10:14.360
and you put it on the
other side of the equality.

00:10:14.360 --> 00:10:17.810
So you take everything with
w on one side and everything

00:10:17.810 --> 00:10:19.610
without w on the other side.

00:10:19.610 --> 00:10:23.420
And then here, I remove
n by multiplying.
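
NOTE
Written out under the vector notation just introduced (Xn the n-by-d data matrix, Yn the vector of labels), the computation being described amounts to:
    \[
    \nabla_w \Big( \tfrac{1}{n} \| Y_n - X_n w \|^2 + \lambda \| w \|^2 \Big)
    = \tfrac{2}{n} X_n^\top ( X_n w - Y_n ) + 2 \lambda w = 0
    \;\;\Longrightarrow\;\;
    ( X_n^\top X_n + \lambda n I ) \, w = X_n^\top Y_n .
    \]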

00:10:26.440 --> 00:10:28.797
And so what you get
is a linear system.

00:10:28.797 --> 00:10:29.880
It's just a linear system.

00:10:29.880 --> 00:10:31.580
So that's the beauty
of least squares.

00:10:31.580 --> 00:10:33.140
Whether you
regularize it or not--

00:10:33.140 --> 00:10:36.620
in this case for this simple
squared loss regularization,

00:10:36.620 --> 00:10:39.300
all you get is a linear system.

00:10:39.300 --> 00:10:42.680
And this is the first way
to think about the effect

00:10:42.680 --> 00:10:45.770
of adding this term.
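
NOTE
A minimal sketch of that linear system in code (hypothetical names; the lambda-times-n scaling matches the 1/n on the empirical error above):
    import numpy as np
    def ridge_primal(X, y, lam):
        # Solve (X^T X + lam * n * I) w = X^T y, the system above.
        n, d = X.shape
        A = X.T @ X + lam * n * np.eye(d)  # d x d, positive definite for lam > 0
        return np.linalg.solve(A, X.T @ y)  # solve the system; never form the inverse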

00:10:45.770 --> 00:10:47.900
So what is this doing?

00:10:47.900 --> 00:10:53.180
So just quickly, a
quick linear system recap.

00:10:53.180 --> 00:10:54.759
You're solving a linear system.

00:10:54.759 --> 00:10:55.550
I changed notation.

00:10:55.550 --> 00:10:59.120
This is just a parenthesis,
just a little bit.

00:10:59.120 --> 00:11:00.680
The simplest case
you can think of

00:11:00.680 --> 00:11:03.830
is the case where m is diagonal.

00:11:03.830 --> 00:11:06.140
Suppose it's just a diagonal
matrix, a square diagonal

00:11:06.140 --> 00:11:06.640
matrix.

00:11:09.500 --> 00:11:11.020
How do you solve this problem?

00:11:11.020 --> 00:11:13.422
You have to invert the matrix m.

00:11:13.422 --> 00:11:15.130
What is the inverse
of a diagonal matrix?

00:11:18.190 --> 00:11:20.470
So it's just another
diagonal matrix.

00:11:20.470 --> 00:11:23.200
On the entries, instead of, say,
sigma, you have 1 over sigma

00:11:23.200 --> 00:11:24.040
or whatever it is.

00:11:27.250 --> 00:11:31.900
So what you see is that if
m-- you just consider m--

00:11:31.900 --> 00:11:34.030
and m is diagonal
like this-- this

00:11:34.030 --> 00:11:35.860
is what you're going to get.

00:11:35.860 --> 00:11:37.480
Suppose that now
some of these numbers

00:11:37.480 --> 00:11:42.070
are actually small, then
when you take 1 over,

00:11:42.070 --> 00:11:44.110
this is going to blow up.

00:11:44.110 --> 00:11:47.650
When you apply this matrix
to b, what you might have is

00:11:47.650 --> 00:11:51.820
that if you change the
sigmas or the b slightly,

00:11:51.820 --> 00:11:54.670
you can have an explosion.

00:11:54.670 --> 00:11:57.910
And if you want, this is
one way to understand why

00:11:57.910 --> 00:11:59.980
adding the lambda would help.

00:11:59.980 --> 00:12:02.560
And it's another way to look
at overfitting, if you want,

00:12:02.560 --> 00:12:04.130
from a numerical point of view.

00:12:04.130 --> 00:12:04.880
You take the data.

00:12:04.880 --> 00:12:07.480
You change them slightly, and
you have numerical instability

00:12:07.480 --> 00:12:09.190
right away.

00:12:09.190 --> 00:12:10.960
What is the effect
of adding this term?

00:12:14.020 --> 00:12:15.820
Well, what you see
is that instead

00:12:15.820 --> 00:12:23.710
of just doing m minus 1, you're
doing m plus lambda I minus 1.

00:12:23.710 --> 00:12:26.002
And this is the simple
case, where it's diagonal.

00:12:26.002 --> 00:12:28.210
But what you see is that on
the diagonal instead of 1

00:12:28.210 --> 00:12:32.990
over sigma 1, you take 1
over sigma 1 plus lambda.

00:12:32.990 --> 00:12:36.910
If sigma 1 is big, adding
this lambda won't matter.

00:12:36.910 --> 00:12:40.510
If sigma-- for example, sigma
d, now think they are ordered.

00:12:40.510 --> 00:12:42.880
I'm thinking they are
ordered, and sigma d is small.

00:12:42.880 --> 00:12:47.320
If this is small, at some point
lambda is going to jump in,

00:12:47.320 --> 00:12:49.660
make the problem
stable at the price

00:12:49.660 --> 00:12:51.911
of ignoring the
information in that sigma,

00:12:51.911 --> 00:12:53.410
that you basically
consider it to be

00:12:53.410 --> 00:12:55.510
at the same size of the
noise or the perturbation

00:12:55.510 --> 00:12:57.230
or the sample in your data.
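
NOTE
A toy numerical illustration of this filtering, with made-up sigmas (not from the lecture):
    import numpy as np
    sigma = np.array([10.0, 1.0, 1e-8])  # ordered, the last one is tiny
    b = np.ones(3)
    print(b / sigma)          # [0.1, 1.0, 1e8]: the small sigma blows up
    print(b / (sigma + 0.1))  # lambda = 0.1 caps it at ~10, at the price
                              # of ignoring that direction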

00:12:57.230 --> 00:12:59.050
Does this make sense?

00:12:59.050 --> 00:13:01.015
So this is what the
algorithm is doing.

00:13:01.015 --> 00:13:04.881
And it's a numerical way
to look at stability.

00:13:04.881 --> 00:13:07.255
But you can imagine that this
is an immediate statistical

00:13:07.255 --> 00:13:07.600
consequence.

00:13:07.600 --> 00:13:09.150
You change the data
slightly, you'll

00:13:09.150 --> 00:13:11.230
have a big change in your
solution and the other way

00:13:11.230 --> 00:13:11.730
around.

00:13:11.730 --> 00:13:14.140
And lambda governs this
by basically telling you

00:13:14.140 --> 00:13:16.060
how much this is invertible.

00:13:16.060 --> 00:13:18.310
So it's a connection between
statistical and numerical

00:13:18.310 --> 00:13:19.875
stability.

00:13:19.875 --> 00:13:22.000
Now of course, you can say,
this is oversimplistic,

00:13:22.000 --> 00:13:26.110
because this is just
a diagonal matrix.

00:13:26.110 --> 00:13:31.300
But basically, if
you now take matrices

00:13:31.300 --> 00:13:35.620
that you can diagonalize,
conceptually nothing

00:13:35.620 --> 00:13:36.280
would change.

00:13:36.280 --> 00:13:37.570
Because basically
you would have that

00:13:37.570 --> 00:13:39.653
if you have a matrix-- so
there is a mistake here.

00:13:39.653 --> 00:13:41.380
There should be no minus 1.

00:13:41.380 --> 00:13:43.360
If you have an m that you can--

00:13:43.360 --> 00:13:45.235
this is just sigma, not minus 1.

00:13:45.235 --> 00:13:47.057
You can just diagonalize it.

00:13:47.057 --> 00:13:49.390
And now every operation you
want to do on the matrix you

00:13:49.390 --> 00:13:51.910
can just do on the diagonal.

00:13:51.910 --> 00:13:54.845
So all the reasoning
here will work the same.

00:13:54.845 --> 00:13:56.440
Only now you have
to remember that you

00:13:56.440 --> 00:13:58.900
have to squeeze the diagonal
matrix in between v and v

00:13:58.900 --> 00:14:00.631
transpose.

00:14:00.631 --> 00:14:03.130
I'm not saying that this is
what you want to do numerically.

00:14:03.130 --> 00:14:05.350
But I'm just saying that the
conceptual reasoning here--

00:14:05.350 --> 00:14:07.610
that we tell it that this
was the effect of lambda--

00:14:07.610 --> 00:14:10.090
is going to hold
just the same here.

00:14:10.090 --> 00:14:13.210
This is m, which you can
write like this-- m minus 1

00:14:13.210 --> 00:14:14.290
you can write like this.

00:14:14.290 --> 00:14:18.220
And so this is just going to
be the same diagonal terms

00:14:18.220 --> 00:14:18.739
inverted.

00:14:18.739 --> 00:14:20.280
And now you see the
effect of lambda.

00:14:20.280 --> 00:14:22.390
It's just the same.

00:14:22.390 --> 00:14:24.290
So once you grasp
this conceptually,

00:14:24.290 --> 00:14:28.020
for any matrix you can make
diagonal, it's the same.

00:14:28.020 --> 00:14:29.736
And the point is
that as long as you

00:14:29.736 --> 00:14:32.110
have a symmetric positive
definite matrix, then you

00:14:32.110 --> 00:14:35.370
can diagonalize it, and you just
have the same thing squeezed

00:14:35.370 --> 00:14:37.630
in between v and v transpose.
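
NOTE
In symbols, for a symmetric positive definite M (a standard identity, stated here for reference):
    \[
    M = V \Sigma V^\top \;\Longrightarrow\; ( M + \lambda I )^{-1} = V \, ( \Sigma + \lambda I )^{-1} V^\top ,
    \]
so each diagonal entry 1/sigma_i becomes 1/(sigma_i + lambda), exactly as in the diagonal case.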

00:14:37.630 --> 00:14:39.520
And that's what we have,
because instead of--

00:14:44.680 --> 00:14:47.920
because what we have is
exactly this matrix here.

00:14:47.920 --> 00:14:50.270
So instead of-- and you see
here that basically this

00:14:50.270 --> 00:14:52.880
depends a lot on the
dimensionality of the data.

00:14:52.880 --> 00:14:57.220
If the number of points is much
bigger than the dimensionality,

00:14:57.220 --> 00:14:59.450
this matrix in
principle could be--

00:14:59.450 --> 00:15:01.420
it's easier for it to be invertible.

00:15:01.420 --> 00:15:03.100
But if the number
of points is smaller

00:15:03.100 --> 00:15:04.840
than the dimensionality--

00:15:04.840 --> 00:15:06.040
how big is this matrix?

00:15:06.040 --> 00:15:10.080
So xn is-- you remember
how big was xn?

00:15:10.080 --> 00:15:12.610
It was the rows were the
points, and the columns

00:15:12.610 --> 00:15:13.370
were the variables.

00:15:13.370 --> 00:15:14.595
So how big is this?

00:15:14.595 --> 00:15:15.730
And we call this d.

00:15:15.730 --> 00:15:16.930
We called the length n.

00:15:16.930 --> 00:15:18.304
So this is--

00:15:18.304 --> 00:15:19.510
AUDIENCE: [INAUDIBLE]

00:15:19.510 --> 00:15:22.470
LORENZO ROSASCO: --n by d.

00:15:22.470 --> 00:15:25.060
So this matrix here is how big?

00:15:25.060 --> 00:15:28.320
Just d by d, and
the number of points

00:15:28.320 --> 00:15:31.080
is smaller than the
number of dimensions.

00:15:31.080 --> 00:15:32.445
The rank of this--

00:15:32.445 --> 00:15:34.440
this is going to
be rank-deficient.

00:15:34.440 --> 00:15:35.725
So it's not invertible.

00:15:35.725 --> 00:15:38.100
So if the number of points is
smaller, if you're in a high--

00:15:38.100 --> 00:15:39.870
so called
high-dimensional scenario,

00:15:39.870 --> 00:15:41.520
where the number of
points is smaller than

00:15:41.520 --> 00:15:43.410
the number of
dimensions, for sure you

00:15:43.410 --> 00:15:44.790
won't be able to invert this.

00:15:44.790 --> 00:15:46.950
Ordinary least
squares will not work.

00:15:46.950 --> 00:15:48.120
It will be unstable.

00:15:48.120 --> 00:15:49.619
And then you will
have to regularize

00:15:49.619 --> 00:15:52.279
to get anything reasonable.

00:15:52.279 --> 00:15:54.820
So in the case of least squares,
just by setting lambda to zero

00:15:54.820 --> 00:15:56.790
and looking at this computation,
you get a grasp of both:

00:15:56.790 --> 00:15:58.530
what kind of computations
you have to do,

00:15:58.530 --> 00:15:59.835
and what they mean both
from the statistical

00:15:59.835 --> 00:16:01.376
and the numerical point of view.

00:16:01.376 --> 00:16:03.750
And that's why that's one of
the beauties of least squares.

00:16:08.280 --> 00:16:11.400
We could stick to a whole
derivation of this--

00:16:11.400 --> 00:16:14.480
so this is more the
linear system perspective.

00:16:14.480 --> 00:16:15.960
There is a whole
literature trying

00:16:15.960 --> 00:16:18.570
to justify more from a
statistical point of view what

00:16:18.570 --> 00:16:19.641
I'm saying.

00:16:19.641 --> 00:16:21.390
You can talk about the
maximum likelihood,

00:16:21.390 --> 00:16:24.300
then you can talk about
maximum a posteriori.

00:16:24.300 --> 00:16:26.760
You can talk about variance
reduction and so-called Stein

00:16:26.760 --> 00:16:27.610
effect.

00:16:27.610 --> 00:16:31.170
And you can make a much bigger
story trying, for example,

00:16:31.170 --> 00:16:34.227
to develop the whole theory of
shrinkage estimators, the bias

00:16:34.227 --> 00:16:35.310
variance tradeoff of this.

00:16:35.310 --> 00:16:37.380
But we're not going
to talk about that.

00:16:37.380 --> 00:16:39.690
So this simple
numerical stability,

00:16:39.690 --> 00:16:41.640
statistical stability
intuition is

00:16:41.640 --> 00:16:44.430
going to be my main motivation
for considering these schemes.

00:16:47.730 --> 00:16:49.890
So let me skip these.

00:16:49.890 --> 00:16:52.440
I wanted to show the demo, but--

00:16:52.440 --> 00:16:53.429
it's very simple.

00:16:53.429 --> 00:16:55.470
It's going to be very
stable, because you're just

00:16:55.470 --> 00:16:58.480
drawing a one-dimensional line.

00:16:58.480 --> 00:17:00.270
Then let's move on just
a bit, because we

00:17:00.270 --> 00:17:04.290
didn't cover as much as
I wanted in the first part.

00:17:04.290 --> 00:17:08.819
So first of all, so far so good?

00:17:08.819 --> 00:17:12.230
Are you all with me about this?

00:17:12.230 --> 00:17:14.290
So again, the basic
thing if you want--

00:17:14.290 --> 00:17:20.359
all the interesting--
so this is the one line,

00:17:20.359 --> 00:17:23.230
where there is something
conceptual happening.

00:17:23.230 --> 00:17:26.530
This is the one line, where we
make it a bit more complicated

00:17:26.530 --> 00:17:27.327
mathematically.

00:17:27.327 --> 00:17:29.160
And then all you have
to do is to match this

00:17:29.160 --> 00:17:31.760
with what we just wrote before.

00:17:31.760 --> 00:17:32.260
That's all.

00:17:32.260 --> 00:17:35.040
These are the main three
things we want to do.

00:17:35.040 --> 00:17:37.900
And think a bit
about dimensionality.

00:17:37.900 --> 00:17:44.029
Now if you look at a problem
even like this, as I said,

00:17:44.029 --> 00:17:45.820
this might be misleading--
a low dimension.

00:17:45.820 --> 00:17:47.740
And in fact, what we
typically do in high dimension

00:17:47.740 --> 00:17:49.540
is that, first of all, you
start with the linear model

00:17:49.540 --> 00:17:51.590
and you see how far
you can go with that.

00:17:51.590 --> 00:17:55.969
And typically, you go a bit
further than you might imagine.

00:17:55.969 --> 00:17:57.760
But still, you can
think, why should I just

00:17:57.760 --> 00:17:59.670
stick to a linear decision rule?

00:17:59.670 --> 00:18:03.080
This won't give me
much flexibility.

00:18:03.080 --> 00:18:04.990
So in this case,
obviously, it looks

00:18:04.990 --> 00:18:06.490
like something that
would be better,

00:18:06.490 --> 00:18:09.850
some kind of quadric
decision boundary.

00:18:09.850 --> 00:18:12.414
So how can you do this?

00:18:12.414 --> 00:18:14.080
How can you go--
suppose that I give you

00:18:14.080 --> 00:18:16.126
the code of least squares.

00:18:16.126 --> 00:18:17.500
And you're the
laziest programmer

00:18:17.500 --> 00:18:19.900
in the world, which in
my case is actually not

00:18:19.900 --> 00:18:22.360
that hard to imagine.

00:18:22.360 --> 00:18:25.500
How can you recycle
the code to fit,

00:18:25.500 --> 00:18:28.420
to create a solution
like this, instead

00:18:28.420 --> 00:18:30.880
of a solution like this?

00:18:30.880 --> 00:18:32.286
You see the question?

00:18:32.286 --> 00:18:34.035
I give you the code
to solve this problem,

00:18:34.035 --> 00:18:35.243
the one I showed you before--

00:18:35.243 --> 00:18:37.090
the linear system for
different lambdas.

00:18:37.090 --> 00:18:40.270
But you want to go from this
solution to the solution.

00:18:40.270 --> 00:18:41.870
How could you do that?

00:18:41.870 --> 00:18:44.300
So one way you can do
it in this simple case

00:18:44.300 --> 00:18:46.155
is-- this is the example.

00:18:46.155 --> 00:18:47.500
So the idea is--

00:18:47.500 --> 00:18:49.150
you remember the matrix?

00:18:49.150 --> 00:18:51.334
I'm going to invent new
entries of the matrix,

00:18:51.334 --> 00:18:53.500
not of the points, because
you cannot invent points,

00:18:53.500 --> 00:18:54.740
but of the variables.

00:18:54.740 --> 00:18:57.198
So what you're going to do,
instead of just-- let's say,

00:18:57.198 --> 00:18:59.320
in this case I call them x1, x2.

00:18:59.320 --> 00:19:00.510
I'm just in two dimensions.

00:19:00.510 --> 00:19:02.250
These are my data.

00:19:02.250 --> 00:19:04.240
This is just another
example of this.

00:19:04.240 --> 00:19:06.190
So these are my data--
sorry these are--

00:19:06.190 --> 00:19:07.780
let's see what they are.

00:19:07.780 --> 00:19:09.700
This is one point.

00:19:09.700 --> 00:19:12.160
X1 and x2 here
are just the entry

00:19:12.160 --> 00:19:15.970
of the point x, so
the first coordinate

00:19:15.970 --> 00:19:18.370
and the second coordinate.

00:19:18.370 --> 00:19:21.070
So what you said is
exactly one way to do this.

00:19:21.070 --> 00:19:22.210
And it is--

00:19:22.210 --> 00:19:24.790
I'm going to now build a
new vector representation

00:19:24.790 --> 00:19:25.710
of the same points.

00:19:25.710 --> 00:19:26.470
So it's going to
be the same point,

00:19:26.470 --> 00:19:28.240
but instead of two
coordinates I now use

00:19:28.240 --> 00:19:32.410
three, which are going to be
the first coordinate square,

00:19:32.410 --> 00:19:35.680
the second coordinate square,
and the product of the two

00:19:35.680 --> 00:19:36.700
coordinates.
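
NOTE
A minimal sketch of that remapping in code (the lecture keeps just these three monomials; other variants also keep the linear terms):
    import numpy as np
    def quadratic_features(X):
        # Map each 2-D point (x1, x2) to x_tilde = (x1^2, x2^2, x1 * x2).
        x1, x2 = X[:, 0], X[:, 1]
        return np.stack([x1**2, x2**2, x1 * x2], axis=1)
    # The lazy-programmer recipe: run the same least squares code on
    # quadratic_features(X) -- linear in the new variables, non-linear
    # in the original ones.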

00:19:39.560 --> 00:19:42.490
Once I've done this, I
forget about how I got this,

00:19:42.490 --> 00:19:44.800
and I just treat it
as new variables.

00:19:44.800 --> 00:19:49.000
And I take a linear model
with those variables.

00:19:49.000 --> 00:19:51.340
It's a linear model with
these new variables,

00:19:51.340 --> 00:19:54.170
but it's a non-linear model
with the original variables.

00:19:54.170 --> 00:19:56.090
And that's what you see here.

00:19:56.090 --> 00:20:00.120
So x tilde is this stuff.

00:20:00.120 --> 00:20:02.920
It's just a new
vector representation.

00:20:02.920 --> 00:20:05.500
And now I'm linear with
respect to this new vector

00:20:05.500 --> 00:20:06.640
representation.

00:20:06.640 --> 00:20:09.869
But when you write
x tilde explicitly,

00:20:09.869 --> 00:20:11.410
it's some kind of
non-linear function

00:20:11.410 --> 00:20:12.670
of the original variable.

00:20:12.670 --> 00:20:14.770
So this function
here is non-linear

00:20:14.770 --> 00:20:16.870
in the original variable.

00:20:16.870 --> 00:20:20.310
It's harder to say
than probably to see.

00:20:20.310 --> 00:20:22.310
Does it make sense?

00:20:22.310 --> 00:20:23.920
So if you do this,
you're completely

00:20:23.920 --> 00:20:26.650
recycling the beauty
of the linearity

00:20:26.650 --> 00:20:29.530
from a computational
point of view while

00:20:29.530 --> 00:20:31.510
augmenting the
power of your model

00:20:31.510 --> 00:20:33.700
from linear to non-linear.

00:20:33.700 --> 00:20:37.351
It's still parametric in the
sense that in this case--

00:20:37.351 --> 00:20:39.100
what I mean by parametric
is that we still

00:20:39.100 --> 00:20:41.020
fix a priori the
number of degrees

00:20:41.020 --> 00:20:42.850
of freedom of our problem.

00:20:42.850 --> 00:20:45.070
It was two, now I make it three.

00:20:45.070 --> 00:20:47.230
More generally, I could
make it p, but the number

00:20:47.230 --> 00:20:49.280
of numbers I have to
find is fixed a priori.

00:20:49.280 --> 00:20:54.040
It doesn't depend on my
data, and it's fixed.

00:20:54.040 --> 00:20:58.150
But I can definitely go
from linear to non-linear.

00:20:58.150 --> 00:20:59.540
So let's keep on going.

00:20:59.540 --> 00:21:02.447
So from the simple linear model
we already went quite far,

00:21:02.447 --> 00:21:04.780
because we basically know
that with the same computation

00:21:04.780 --> 00:21:06.500
we can now solve
stuff like this.

00:21:06.500 --> 00:21:09.220
Let's take a couple
of steps further.

00:21:09.220 --> 00:21:13.330
So one is-- appreciate
that really the code

00:21:13.330 --> 00:21:14.680
is just the same.

00:21:14.680 --> 00:21:16.900
Instead of x, I have
to do a pre-processing

00:21:16.900 --> 00:21:19.630
to replace x with this
new matrix x tilde, which

00:21:19.630 --> 00:21:22.497
is the one which instead of
being n by d, is now n by p

00:21:22.497 --> 00:21:24.830
where p is this new number
of variables that I invented.

00:21:28.000 --> 00:21:31.870
Now it's useful to just
get the feeling of what is

00:21:31.870 --> 00:21:33.710
the complexity of this method.

00:21:33.710 --> 00:21:38.400
And this is a very
quick complexity recap.

00:21:38.400 --> 00:21:40.540
Here basically, the
product of two numbers

00:21:40.540 --> 00:21:41.770
is going to count one.

00:21:41.770 --> 00:21:44.340
And then when you take product
of vectors or matrices,

00:21:44.340 --> 00:21:46.870
you just count how many real
number multiplications you do.

00:21:46.870 --> 00:21:48.220
And this is a quick recap.

00:21:48.220 --> 00:21:52.630
If I multiply two vectors
of size p, the cost is p;

00:21:52.630 --> 00:21:54.850
matrix vector is going to be np.

00:21:54.850 --> 00:21:58.600
Matrix matrix is going
to be n square p.

00:21:58.600 --> 00:22:00.274
You have n vectors.

00:22:00.274 --> 00:22:02.490
And, one by one, another n vectors.

00:22:02.490 --> 00:22:05.880
And they are size p, so each
time you have-- it costs you p.

00:22:05.880 --> 00:22:07.930
And you have to do n against n.

00:22:07.930 --> 00:22:10.090
So it's going to be n square p.

00:22:10.090 --> 00:22:13.674
And the last one
is-- this is a much--

00:22:13.674 --> 00:22:15.340
less clear to just
look at it like this.

00:22:15.340 --> 00:22:17.680
But roughly speaking,
the inversion of a matrix

00:22:17.680 --> 00:22:21.460
costs roughly speaking n
cube in the worst case.

00:22:21.460 --> 00:22:25.030
It's just to give you a feeling
of what the complexities are.

00:22:25.030 --> 00:22:27.740
So it makes sense?

00:22:27.740 --> 00:22:29.579
It's a bit quick,
but it's simple.

00:22:29.579 --> 00:22:30.370
If you know it, OK.

00:22:30.370 --> 00:22:32.120
Otherwise, you just
take this on the side,

00:22:32.120 --> 00:22:33.520
when you think about this.

00:22:33.520 --> 00:22:37.720
So what is the
complexity of this?

00:22:37.720 --> 00:22:41.390
Well, the matrix-- you have
to multiply this times this,

00:22:41.390 --> 00:22:45.460
and this is going to
cost you nd or np.

00:22:45.460 --> 00:22:46.820
You have to build this matrix.

00:22:46.820 --> 00:22:51.184
This is going to cost you
n square d or n square p.

00:22:51.184 --> 00:22:52.350
And then you have to invert.

00:22:52.350 --> 00:22:55.720
These are going to be n cube.

00:22:55.720 --> 00:22:58.990
So-- sorry, p cubed,
because this matrix is

00:22:58.990 --> 00:23:03.070
going to be p by p-- or d cubed,
because this matrix is d by d.

00:23:03.070 --> 00:23:06.560
So this is, roughly
speaking, the cost.

00:23:06.560 --> 00:23:07.525
So now look at this.

00:23:07.525 --> 00:23:10.150
This is-- I take this.

00:23:10.150 --> 00:23:13.750
In this case, p is the
new variable, otherwise d.

00:23:13.750 --> 00:23:21.050
So in this case, I have p cube,
and then I have p square n.

00:23:21.050 --> 00:23:23.710
But one question is
what if n is much--

00:23:23.710 --> 00:23:28.390
and that's a fact-- what if
n is much smaller than p?

00:23:28.390 --> 00:23:32.920
If n is 10, do I
really have to pay

00:23:32.920 --> 00:23:36.430
quadratic or even cubic
in the number of dimension

00:23:36.430 --> 00:23:37.357
to solve this problem?

00:23:37.357 --> 00:23:39.940
Because in some sense, it looks like
I'm overshooting things a bit.

00:23:39.940 --> 00:23:42.820
Because I'm inverting a
matrix, yes, but this matrix

00:23:42.820 --> 00:23:45.290
is really of rank n.

00:23:45.290 --> 00:23:48.010
It only has n rows that are
linearly independent at most.

00:23:48.010 --> 00:23:50.410
It might be less,
but at most it has n.

00:23:50.410 --> 00:23:53.920
So can I break the
complexity of this?

00:23:53.920 --> 00:23:56.074
This is the linear system you have
to solve; you just

00:23:56.074 --> 00:23:57.490
use the table I
showed you before.

00:23:57.490 --> 00:23:58.420
Check the computation.

00:23:58.420 --> 00:24:00.190
These are the computations
you have to do.

00:24:00.190 --> 00:24:02.530
And one observation
here is you pay really

00:24:02.530 --> 00:24:06.370
a lot in the dimension,
the number of variables

00:24:06.370 --> 00:24:08.800
or the number of
features you invented.

00:24:08.800 --> 00:24:12.370
And this might be OK,
when p is smaller than n.

00:24:12.370 --> 00:24:14.800
But one thing-- this
seems wrong intuitively,

00:24:14.800 --> 00:24:17.200
when n is much smaller than p.

00:24:17.200 --> 00:24:18.850
Because the complexity
of the problem,

00:24:18.850 --> 00:24:21.310
the rank of the
problem is just n.

00:24:21.310 --> 00:24:25.510
The matrix here has n
rows and d or p columns

00:24:25.510 --> 00:24:27.580
depending on which
representation you take.

00:24:27.580 --> 00:24:30.340
And so the rank of the
whole thing is at most n,

00:24:30.340 --> 00:24:36.160
if n is much smaller.

00:24:36.160 --> 00:24:39.850
So now the red dot appears.

00:24:39.850 --> 00:24:44.484
And what you can do is
prove this in one line.

00:24:44.484 --> 00:24:46.150
So let's see what
it does, and then I'll

00:24:46.150 --> 00:24:47.399
tell you how you can prove it.

00:24:47.399 --> 00:24:48.850
And it's an exercise.

00:24:48.850 --> 00:24:52.460
So you see here if
you invert this,

00:24:52.460 --> 00:24:54.670
then you have to
multiply x transpose

00:24:54.670 --> 00:24:57.940
y times the inverse
of this matrix, which

00:24:57.940 --> 00:25:00.670
is what's written in here.

00:25:00.670 --> 00:25:03.010
So I claim that this
equality holds.

00:25:03.010 --> 00:25:03.850
Look what it does.

00:25:03.850 --> 00:25:05.770
I take this x transpose.

00:25:05.770 --> 00:25:09.050
I move it in front.

00:25:09.050 --> 00:25:10.630
But then if I do
this, you clearly

00:25:10.630 --> 00:25:12.463
see that I'm messing
around with dimensions.

00:25:12.463 --> 00:25:16.122
So what you do is that you have
to switch the order of the two

00:25:16.122 --> 00:25:17.080
matrices in the middle.
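
NOTE
The claimed equality, written out (the SVD proof is the exercise mentioned below):
    \[
    ( X_n^\top X_n + \lambda n I )^{-1} X_n^\top
    \;=\;
    X_n^\top ( X_n X_n^\top + \lambda n I )^{-1} .
    \]
On the left the inverted matrix is d by d (or p by p); on the right it is n by n.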

00:25:19.900 --> 00:25:22.360
Now from a dimensionality
point of view, at least,

00:25:22.360 --> 00:25:24.490
I still see that this
matrix and this matrix

00:25:24.490 --> 00:25:27.520
have the same dimension.

00:25:27.520 --> 00:25:28.690
How do you prove this?

00:25:28.690 --> 00:25:31.090
Well, you basically
just need to do SVD.

00:25:31.090 --> 00:25:34.060
You take the singular-value
decomposition of the matrix Xn.

00:25:34.060 --> 00:25:36.917
You plug it in, and you
just compute things.

00:25:36.917 --> 00:25:38.750
And you check that this
side of the equality

00:25:38.750 --> 00:25:41.590
is the same as this
side of the equality.

00:25:41.590 --> 00:25:43.330
So there's nothing
more than this,

00:25:43.330 --> 00:25:44.750
but we're going to skip this.

00:25:44.750 --> 00:25:47.320
So you just take this as a fact.

00:25:47.320 --> 00:25:48.220
It's a little trick.

00:25:48.220 --> 00:25:49.660
Why do I want to do this trick?

00:25:49.660 --> 00:25:52.870
Because look, now what I
say is that my w is going

00:25:52.870 --> 00:25:55.930
to be x transpose of something.

00:25:59.330 --> 00:26:00.400
What is this something?

00:26:00.400 --> 00:26:06.130
So w is going to be X
transpose of this thing here.

00:26:06.130 --> 00:26:08.020
How big is this vector?

00:26:08.020 --> 00:26:11.020
So how big is this
matrix first of all?

00:26:11.020 --> 00:26:13.495
So remember, Xn was how big?

00:26:13.495 --> 00:26:14.770
AUDIENCE: N by d.

00:26:14.770 --> 00:26:17.260
LORENZO ROSASCO: N by d or p.

00:26:17.260 --> 00:26:18.340
How big is this?

00:26:18.340 --> 00:26:19.060
AUDIENCE: N by n.

00:26:19.060 --> 00:26:21.500
LORENZO ROSASCO: N by n.

00:26:21.500 --> 00:26:23.060
So how big is this vector?

00:26:23.060 --> 00:26:25.130
It's n by 1.

00:26:25.130 --> 00:26:27.370
So now I have to--

00:26:27.370 --> 00:26:29.770
I found out that
my w can always be

00:26:29.770 --> 00:26:34.320
written as x transpose
c, where c is just

00:26:34.320 --> 00:26:36.540
an n-dimensional vector.

00:26:36.540 --> 00:26:40.350
I rewrote it like
this, if you want.

00:26:40.350 --> 00:26:43.490
So what is the
cost of doing this?

00:26:47.800 --> 00:26:51.610
Well, this was the
cost of doing this.

00:26:51.610 --> 00:26:54.260
But now you just have to do--

00:26:54.260 --> 00:26:57.070
so let's say what is the
cost of doing this thing here

00:26:57.070 --> 00:26:59.080
above the bracket?

00:26:59.080 --> 00:27:01.710
Well, if this one was
p cubed plus p squared n,

00:27:01.710 --> 00:27:03.910
this one will be how much?

00:27:03.910 --> 00:27:07.720
I have that this
matrix will say p by p,

00:27:07.720 --> 00:27:10.860
and then this vector was p by 1.

00:27:10.860 --> 00:27:14.680
Whereas here, my matrix is n by
n, and the vector is n by 1.

00:27:17.200 --> 00:27:20.699
So you basically have that
these two numbers swap.

00:27:20.699 --> 00:27:22.115
Instead of having
this complexity,

00:27:22.115 --> 00:27:25.780
now you have a complexity,
which is n cube.

00:27:25.780 --> 00:27:30.910
And then you have n square
p, which sounds about right.

00:27:30.910 --> 00:27:32.060
It's linear in p.

00:27:32.060 --> 00:27:33.250
You cannot avoid that.

00:27:33.250 --> 00:27:35.590
You have to look at
the data at least once.

00:27:35.590 --> 00:27:40.035
But then it's polynomial only in
the small quantity of the two.

00:27:40.035 --> 00:27:41.410
So in some sense,
what you see is

00:27:41.410 --> 00:27:44.360
that, depending on the
size of n, of course,

00:27:44.360 --> 00:27:46.110
you still have to do
this multiplication.

00:27:46.110 --> 00:27:50.200
But this multiplication
is just n, nd, or np.
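
NOTE
A sketch of the re-parametrized computation (hypothetical names, same conventions as the earlier snippet):
    import numpy as np
    def ridge_dual(X, y, lam):
        # w = X^T c with c = (X X^T + lam * n * I)^{-1} y: an n x n system,
        # so the cost is about n^3 + n^2 p instead of p^3 + p^2 n.
        n = X.shape[0]
        c = np.linalg.solve(X @ X.T + lam * n * np.eye(n), y)
        return X.T @ c  # agrees with the primal solution up to round-off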

00:27:50.200 --> 00:27:52.170
So let's just recap
what I'm telling you.

00:27:52.170 --> 00:27:55.050
This is a lot more
mathematical fact I put.

00:27:55.050 --> 00:27:57.010
I have a warning here.

00:27:57.010 --> 00:27:59.860
The first thing is the
question should be clear.

00:27:59.860 --> 00:28:02.950
Can I break the complexity
of this in the case

00:28:02.950 --> 00:28:05.550
when n is smaller than p or d?

00:28:05.550 --> 00:28:08.620
This is relevant because the
question came out a second ago,

00:28:08.620 --> 00:28:10.870
which was should
I always explode

00:28:10.870 --> 00:28:12.700
the dimension of my features?

00:28:12.700 --> 00:28:14.320
And here what you see is that--

00:28:14.320 --> 00:28:16.960
well, at least for now we
see that even if you do,

00:28:16.960 --> 00:28:20.980
you don't pay more
than linearly in that.

00:28:20.980 --> 00:28:22.690
And the way you
prove it is A, you

00:28:22.690 --> 00:28:25.377
observe this fact,
which, again, I

00:28:25.377 --> 00:28:27.460
am happy, if you're curious,
to show how you do it,

00:28:27.460 --> 00:28:28.720
but it's a one line.

00:28:28.720 --> 00:28:30.840
And 2, you observe that
once you have this,

00:28:30.840 --> 00:28:34.840
if you just rewrite w, you can
write w as x transpose c.

00:28:34.840 --> 00:28:37.710
And to find a c-- which is now
you basically re-parametrize--

00:28:37.710 --> 00:28:39.970
and to find the new c
is going to cost you

00:28:39.970 --> 00:28:43.690
only n cube plus n square p.

00:28:43.690 --> 00:28:46.294
So you do exactly
what you wanted to do.

00:28:46.294 --> 00:28:47.710
And basically,
what you see now is

00:28:47.710 --> 00:28:49.270
that whenever you
do least squares,

00:28:49.270 --> 00:28:51.100
you can check the
number of dimensions,

00:28:51.100 --> 00:28:54.610
the number of points, and always
re-parametrize the problem

00:28:54.610 --> 00:28:58.750
in such a way that complexity
is depending linearly

00:28:58.750 --> 00:29:00.940
on the bigger of the
two and polynomially

00:29:00.940 --> 00:29:04.390
on the smaller of the two.

00:29:04.390 --> 00:29:05.230
So that's good news.

00:29:11.782 --> 00:29:12.740
Oh, I wrote it.

00:29:17.264 --> 00:29:18.680
So this is where
we are right now.

00:29:21.210 --> 00:29:23.870
So if you're lost now, you're
going to become completely lost

00:29:23.870 --> 00:29:24.530
in one second.

00:29:24.530 --> 00:29:26.480
Because this is
what we want to do.

00:29:26.480 --> 00:29:29.910
We want to introduce kernels
in the simplest possible way,

00:29:29.910 --> 00:29:31.095
which is the following.

00:29:33.770 --> 00:29:37.150
So look at-- this
is what we find out.

00:29:37.150 --> 00:29:40.870
We discovered, we
actually proved a theorem.

00:29:40.870 --> 00:29:44.350
And the theorem
says that the w's

00:29:44.350 --> 00:29:48.610
that are output by the least
squares algorithm are not

00:29:48.610 --> 00:29:50.740
any possible
d-dimensional vectors,

00:29:50.740 --> 00:29:52.630
but they're always
vectors that I

00:29:52.630 --> 00:29:57.100
can write as the combination
of the training set vectors.

00:29:57.100 --> 00:30:01.300
So xi has length d or p,
and I've summed them up

00:30:01.300 --> 00:30:02.429
with these weights.

00:30:02.429 --> 00:30:04.720
And the w's that are going
to come out of least squares

00:30:04.720 --> 00:30:07.500
are always of that form.

00:30:07.500 --> 00:30:10.150
They cannot be of
any other form.

00:30:10.150 --> 00:30:13.140
This is called the
representer theorem.

00:30:13.140 --> 00:30:16.010
It's the basic theorem of
so-called kernel methods.

00:30:16.010 --> 00:30:19.960
It shows you that the
solution you're looking for

00:30:19.960 --> 00:30:22.840
can be written as a linear
superposition of these terms.
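
NOTE
In symbols, the statement being used:
    \[
    w = \sum_{i=1}^{n} c_i \, x_i = X_n^\top c ,
    \qquad
    f(x) = x^\top w = \sum_{i=1}^{n} c_i \, x^\top x_i .
    \]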

00:30:27.310 --> 00:30:28.330
If you now write--

00:30:28.330 --> 00:30:29.260
this is just the w.

00:30:29.260 --> 00:30:31.150
Let's just write down f of x.

00:30:31.150 --> 00:30:33.670
F of x is going to be
x transpose w, just

00:30:33.670 --> 00:30:36.160
the linear function.

00:30:36.160 --> 00:30:39.400
And now you can-- if you write
it down, you just get this.

00:30:39.400 --> 00:30:40.960
By linearity you can--

00:30:40.960 --> 00:30:42.624
so w is written like this.

00:30:42.624 --> 00:30:43.790
You multiply by x transpose.

00:30:43.790 --> 00:30:45.190
This is a finite sum.

00:30:45.190 --> 00:30:47.680
So you can bring x
transpose inside the sum.

00:30:47.680 --> 00:30:49.420
This is what you get.

00:30:49.420 --> 00:30:50.320
Are you OK?

00:30:50.320 --> 00:30:52.650
So you have x
transpose times a sum.

00:30:52.650 --> 00:30:57.800
This is the sum of x transpose
multiplied by the rest.

00:30:57.800 --> 00:30:58.940
Why do we care about this?

00:30:58.940 --> 00:31:01.730
Because basically the
idea of kernel methods--

00:31:01.730 --> 00:31:05.570
in this very basic
form-- is what

00:31:05.570 --> 00:31:08.750
if I replace this
inner product, which

00:31:08.750 --> 00:31:11.720
is a way to measure similarity
between my points,

00:31:11.720 --> 00:31:14.360
with another similarity.

00:31:14.360 --> 00:31:17.417
So instead of mapping
each x into a very high

00:31:17.417 --> 00:31:19.250
dimensional vector and
then taking product--

00:31:19.250 --> 00:31:22.430
which is itself, if you
want another way, as I said,

00:31:22.430 --> 00:31:25.310
of measuring similarity
in your product,

00:31:25.310 --> 00:31:27.650
distances between vectors--
what if I just define it,

00:31:27.650 --> 00:31:29.420
instead of by an
explicit mapping,

00:31:29.420 --> 00:31:32.810
by redefining the inner product.

00:31:32.810 --> 00:31:36.830
So this k here is the
k similar to the one

00:31:36.830 --> 00:31:39.730
we had in the previous--
in the very first slide.

00:31:39.730 --> 00:31:43.250
And it's-- re-parametrize
the inner product.

00:31:43.250 --> 00:31:45.140
Change the inner
product, and then I

00:31:45.140 --> 00:31:47.220
want to use everything else.

00:31:47.220 --> 00:31:50.520
So we need to question-- we
need to answer two question.

00:31:50.520 --> 00:31:54.170
The first one is if I
give you now a procedure

00:31:54.170 --> 00:31:56.330
that whenever you would
want to do x transpose

00:31:56.330 --> 00:32:01.100
x does something else
called ax comma x prime.

00:32:01.100 --> 00:32:03.410
How do you change
the computations?

00:32:03.410 --> 00:32:05.030
This is going to be very easy.

00:32:05.030 --> 00:32:08.323
But also what are you doing
from a modeling perspective?

00:32:14.000 --> 00:32:17.200
So from the computational
point of view, it's very easy,

00:32:17.200 --> 00:32:21.520
because you see
here you always had

00:32:21.520 --> 00:32:24.040
that you have to build a
matrix whose entries were

00:32:24.040 --> 00:32:26.330
xi transpose xj.

00:32:29.920 --> 00:32:33.510
So it was always a
product of two vectors.

00:32:33.510 --> 00:32:36.070
And what you do now is
that you do the same.

00:32:36.070 --> 00:32:40.360
So you build the matrix
kn, which is not just xn,

00:32:40.360 --> 00:32:43.705
xn transpose but is a new matrix
whose entries are just this.

00:32:43.705 --> 00:32:45.880
This is just a generalization.

00:32:45.880 --> 00:32:47.830
If I put the linear
kernel, I just get back

00:32:47.830 --> 00:32:49.760
in what we had before.

00:32:49.760 --> 00:32:52.500
If you put another kernel,
you just get something else.

00:32:52.500 --> 00:32:54.610
So from a computational
point of view,

00:32:54.610 --> 00:32:58.164
you're done for this
computation of c.

00:32:58.164 --> 00:32:59.330
You have to do nothing else.

00:32:59.330 --> 00:33:02.390
You just replace this matrix
with these general matrix.

00:33:02.390 --> 00:33:05.350
And if you want
to now compute f--

00:33:05.350 --> 00:33:08.050
so w you cannot compute
anymore, because you don't know

00:33:08.050 --> 00:33:10.040
what's an x by itself.

00:33:10.040 --> 00:33:13.090
But if you want to
compute f of x, you can,

00:33:13.090 --> 00:33:14.470
because you just have to plug in--

00:33:14.470 --> 00:33:16.280
So you know how
to compute the c.

00:33:16.280 --> 00:33:20.190
And you know how to compute this
quantity, because you have just

00:33:20.190 --> 00:33:21.220
to put the kernel there.
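
NOTE
A minimal sketch of kernel regularized least squares with a Gaussian kernel (hypothetical names; the lambda-times-n scaling as before):
    import numpy as np
    def gaussian_kernel(A, B, sigma):
        # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), for all pairs of rows.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma**2))
    def kernel_rls_fit(X, y, lam, sigma):
        # c = (K_n + lam * n * I)^{-1} y; w itself is never formed.
        n = X.shape[0]
        return np.linalg.solve(gaussian_kernel(X, X, sigma) + lam * n * np.eye(n), y)
    def kernel_rls_predict(X_train, c, X_new, sigma):
        # f(x) = sum_i c_i k(x, x_i); take the sign for classification.
        return gaussian_kernel(X_new, X_train, sigma) @ c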

00:33:21.220 --> 00:33:23.530
So the magic here
is that you never

00:33:23.530 --> 00:33:25.090
ever see a point x in isolation.

00:33:25.090 --> 00:33:27.880
You always have a point x
multiplied by another point x.

00:33:27.880 --> 00:33:31.770
And this allows you to
replace vectors by--

00:33:31.770 --> 00:33:34.000
in some sense, this is
an implicit remapping

00:33:34.000 --> 00:33:37.160
of the points by just
redefining the inner product.

00:33:37.160 --> 00:33:39.089
So what you should
see for now is

00:33:39.089 --> 00:33:40.630
just that the
computation that you've

00:33:40.630 --> 00:33:45.100
done to compute f of x in
the linear case you can redo,

00:33:45.100 --> 00:33:48.070
if you replace the inner
product with this new function.

00:33:48.070 --> 00:33:52.990
Because A, you can compute c
by just using this new matrix

00:33:52.990 --> 00:33:54.220
in place of this.

00:33:54.220 --> 00:33:57.430
And B, you can replace f
of x, because all you need

00:33:57.430 --> 00:33:59.635
is to replace this inner
product with this one

00:33:59.635 --> 00:34:04.210
and put the right weights,
which you know how to compute.

00:34:04.210 --> 00:34:05.920
From a modeling
perspective what you

00:34:05.920 --> 00:34:09.880
can check is that, for
example, if you choose here

00:34:09.880 --> 00:34:11.260
this polynomial kernel--

00:34:11.260 --> 00:34:14.499
which is just x
transpose x prime plus 1

00:34:14.499 --> 00:34:16.210
raised to the power d--

00:34:16.210 --> 00:34:18.400
if you take, for
example, d equal 2,

00:34:18.400 --> 00:34:21.850
this is equivalent to the
mapping I showed you before,

00:34:21.850 --> 00:34:25.989
the one with explicit
monomials as entries.

00:34:25.989 --> 00:34:28.479
This is just doing
it implicitly.

00:34:28.479 --> 00:34:31.435
If you're low-dimensional--

00:34:31.435 --> 00:34:35.270
if n is very big, and the
dimensions are very small,

00:34:35.270 --> 00:34:37.179
the first way might be better.

00:34:37.179 --> 00:34:42.400
But if the dimension is much bigger,
this way would be better.

00:34:42.400 --> 00:34:45.310
But also you can use stuff like
this, like a Gaussian kernel.

00:34:45.310 --> 00:34:47.860
And in that case, you cannot
really write down explicitly

00:34:47.860 --> 00:34:50.530
the explicit map,
because it turns out that

00:34:50.530 --> 00:34:51.659
it's infinite-dimensional.

00:34:51.659 --> 00:34:53.560
The vectors you would
need to write down,

00:34:53.560 --> 00:34:57.560
to write down the explicit
variable version of--

00:34:57.560 --> 00:35:00.840
embedding version of this
is infinite-dimensional.

00:35:00.840 --> 00:35:01.900
So this is a--

00:35:01.900 --> 00:35:04.420
if you use this, you get the
truly non-parametric model.

00:35:07.480 --> 00:35:09.666
If you think of what is
the effect of using this,

00:35:09.666 --> 00:35:11.290
it's quite clear if
you plug them here.

00:35:11.290 --> 00:35:13.000
Because what you have
is that in one case

00:35:13.000 --> 00:35:15.420
you have a superposition
of linear stuff,

00:35:15.420 --> 00:35:17.920
a superposition of
polynomial stuff,

00:35:17.920 --> 00:35:20.100
or a superposition of Gaussians.

00:35:23.330 --> 00:35:24.580
So same game as before.

00:35:28.490 --> 00:35:30.470
So same dataset we train.

00:35:30.470 --> 00:35:33.070
I take kernel least squares--

00:35:33.070 --> 00:35:34.650
which is what I
just showed you--

00:35:34.650 --> 00:35:37.900
compute the c
inverting that matrix,

00:35:37.900 --> 00:35:40.090
use the Gaussian kernel--
the last of the example--

00:35:40.090 --> 00:35:41.170
and then compute f of x.

00:35:41.170 --> 00:35:42.544
And then we just
want to plot it.

00:35:47.231 --> 00:35:48.230
So this is the solution.

00:35:51.770 --> 00:35:53.510
The algorithm depends
on two parameters.

00:35:53.510 --> 00:35:54.093
What are they?

00:35:57.700 --> 00:35:58.620
AUDIENCE: Lambda.

00:35:58.620 --> 00:36:00.870
LORENZO ROSASCO: Lambda, the
regularization parameter,

00:36:00.870 --> 00:36:03.700
the one that appeared
already in the linear case--

00:36:03.700 --> 00:36:04.240
and then--

00:36:04.240 --> 00:36:06.760
AUDIENCE: Whatever parameter
you've chosen [INAUDIBLE]..

00:36:06.760 --> 00:36:07.570
LORENZO ROSASCO: Exactly.

00:36:07.570 --> 00:36:09.220
Whatever parameter
there is in your kernel.

00:36:09.220 --> 00:36:10.803
In this case, it's
the Gaussian, so it

00:36:10.803 --> 00:36:12.870
will depend on this width.

00:36:17.370 --> 00:36:22.120
Now suppose that
I take gamma big.

00:36:22.120 --> 00:36:24.120
I don't know what big is.

00:36:24.120 --> 00:36:27.530
I just do it by hand here,
so we see what happens.

00:36:32.620 --> 00:36:36.120
If you take gamma--
sorry, sigma-- big,

00:36:36.120 --> 00:36:39.450
you start to get
something very simple.

00:36:39.450 --> 00:36:42.820
And if I make it
a bit bigger, it

00:36:42.820 --> 00:36:47.550
will probably start to look very
much like a linear solution.

00:36:54.530 --> 00:36:56.220
If I make it small--

00:36:59.805 --> 00:37:01.560
and again, I don't
know what small is,

00:37:01.560 --> 00:37:02.601
so I'm just going to try.

00:37:07.725 --> 00:37:08.715
It's very small.

00:37:15.660 --> 00:37:18.540
You start to see
what's going on.

00:37:18.540 --> 00:37:20.040
And if you go in
between, you really

00:37:20.040 --> 00:37:23.224
start to see that you can
circle out individual examples.
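
In code, the demo amounts to refitting while sweeping the width by hand; a sketch reusing the sketches above, where X_train, y_train, X_grid, and the particular sigma and lam values are assumptions for illustration.

```python
# Large sigma: a very smooth, almost linear decision boundary.
# Small sigma: bumps that circle out individual examples.
for sigma in [10.0, 1.0, 0.1]:
    k = lambda x, z, s=sigma: gaussian_kernel(x, z, s)
    c = fit_kernel_least_squares(k, X_train, y_train, lam=0.01)
    f_grid = predict(k, X_train, c, X_grid)  # plot sign(f_grid)
```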

00:37:23.224 --> 00:37:25.140
So let's think for a second
about what we're doing here.

00:37:29.370 --> 00:37:33.830
It is going to be again another
hand-waving explanation.

00:37:33.830 --> 00:37:37.190
Look at this equation.

00:37:37.190 --> 00:37:39.170
Let's read out what it says.

00:37:39.170 --> 00:37:42.710
In the case of Gaussians,
it says, I take a Gaussian--

00:37:42.710 --> 00:37:44.480
just a usual Gaussian--

00:37:44.480 --> 00:37:47.840
I center it over a
training set point,

00:37:47.840 --> 00:37:52.160
then by choosing the ci
I'm choosing whether it is

00:37:52.160 --> 00:37:54.310
going to be a peak or a valley.

00:37:54.310 --> 00:37:57.680
It can go up, or it can go down
in the two-dimensional case.

00:37:57.680 --> 00:38:00.520
And by choosing
the width, I decide

00:38:00.520 --> 00:38:03.140
how large it's going to be.

00:38:03.140 --> 00:38:08.060
If I do f of x, then I sum up
all this stuff, which basically

00:38:08.060 --> 00:38:11.810
means that I'm going to have
these peaks and these valleys

00:38:11.810 --> 00:38:17.000
and I connect them in some way.

00:38:17.000 --> 00:38:18.910
Now you remember before
that I pointed out

00:38:18.910 --> 00:38:20.780
that in the
two-dimensional case what

00:38:20.780 --> 00:38:24.920
we draw is not f of x,
but f of x equal to zero.

00:38:24.920 --> 00:38:31.525
So what you should really think
is that f of x in this case

00:38:31.525 --> 00:38:36.250
is no longer a hyperplane,
but it's this surface.

00:38:36.250 --> 00:38:37.545
It goes up, and it goes down.

00:38:37.545 --> 00:38:38.920
And it goes up,
and it goes down.

00:38:38.920 --> 00:38:42.290
So in the blue part, it goes
up, and in the orange part,

00:38:42.290 --> 00:38:45.260
it goes down into a valley.

00:38:45.260 --> 00:38:48.110
So what you do is
that right now you're

00:38:48.110 --> 00:38:50.270
taking all these
small Gaussians,

00:38:50.270 --> 00:38:54.350
and you put them around
the blue and orange points,

00:38:54.350 --> 00:38:56.702
and then you
connect their peaks.

00:38:56.702 --> 00:38:58.160
And by making them
small, you allow

00:38:58.160 --> 00:39:00.050
them to create a very
complicated surface.

00:39:03.310 --> 00:39:05.290
So what did we put before?

00:39:11.240 --> 00:39:11.950
So they're small.

00:39:11.950 --> 00:39:14.210
They're getting smaller,
and smaller, and smaller.

00:39:14.210 --> 00:39:16.640
And they go out,
and you see the--

00:39:16.640 --> 00:39:18.620
there is a point here,
so they circle it out

00:39:18.620 --> 00:39:21.530
here by basically putting
a Gaussian right there

00:39:21.530 --> 00:39:24.920
for that individual point.

00:39:24.920 --> 00:39:26.870
Imagine what happens
if my points--

00:39:26.870 --> 00:39:29.240
I have two points here
and two points here--

00:39:29.240 --> 00:39:33.020
and now I put a huge
Gaussian around each point.

00:39:33.020 --> 00:39:36.314
Basically, the peaks are almost
going to touch each other.

00:39:36.314 --> 00:39:37.730
So what you can
imagine is that you

00:39:37.730 --> 00:39:40.670
get something, where basically
the decision boundary has

00:39:40.670 --> 00:39:42.170
to look like a line,
because you get

00:39:42.170 --> 00:39:43.470
something which is so smooth.

00:39:43.470 --> 00:39:45.095
It doesn't go up and
down all the time.

00:39:45.095 --> 00:39:47.539
It's going to be--

00:39:47.539 --> 00:39:49.080
And that's what we
saw before, right?

00:39:49.080 --> 00:39:51.110
And again, I don't
remember what I put here.

00:39:54.718 --> 00:39:56.490
So this is starting
to look good.

00:39:56.490 --> 00:40:00.720
So you really see that
something nice happens.

00:40:00.720 --> 00:40:03.720
Maybe if I put-- five is
what we put before maybe.

00:40:09.010 --> 00:40:10.890
So basically what
you're basically doing

00:40:10.890 --> 00:40:13.560
is that you're computing
the center of mass

00:40:13.560 --> 00:40:16.794
of one class in the
sense of the Gaussians.

00:40:16.794 --> 00:40:18.210
So you're doing a
Gaussian mixture

00:40:18.210 --> 00:40:19.850
on one side, a Gaussian
mixture on the other side,

00:40:19.850 --> 00:40:21.780
you're basically computing
the center of masses,

00:40:21.780 --> 00:40:22.890
and then you just
find the line that

00:40:22.890 --> 00:40:24.230
separates the center of masses.

00:40:24.230 --> 00:40:26.021
That's what you're
doing here, and you just

00:40:26.021 --> 00:40:27.810
find this one big line here.

00:40:30.600 --> 00:40:35.250
So again, we're
not playing around

00:40:35.250 --> 00:40:38.040
with the number of points.

00:40:38.040 --> 00:40:39.776
We're not playing
around with lambda,

00:40:39.776 --> 00:40:42.150
because this is basically
what we already saw before.

00:40:42.150 --> 00:40:44.691
All I want to show you right
now is the effect of the kernel.

00:40:44.691 --> 00:40:51.610
And here I'm using the Gaussian
kernel, but-- let's see--

00:40:51.610 --> 00:40:54.177
you can also use
the linear kernel.

00:40:54.177 --> 00:40:55.260
This is the linear kernel.

00:40:55.260 --> 00:40:56.885
This is using the
linear least squares.

00:40:56.885 --> 00:40:58.520
If you now use the
Gaussian kernel,

00:40:58.520 --> 00:41:00.270
you give yourself the
extra possibility.

00:41:00.270 --> 00:41:01.650
Essentially, what you
see is that if you

00:41:01.650 --> 00:41:03.691
put the Gaussian which is
very big, in some sense

00:41:03.691 --> 00:41:05.810
you get back the linear kernel.

00:41:05.810 --> 00:41:07.810
But if you put the Gaussian
which is very small,

00:41:07.810 --> 00:41:12.050
you allow yourself
this extra complexity.

00:41:12.050 --> 00:41:14.910
And so that's what we gain
with this little trick

00:41:14.910 --> 00:41:20.500
that we did of replacing
the inner product

00:41:20.500 --> 00:41:24.400
with this new kernel.

00:41:24.400 --> 00:41:26.380
We went from the simple
linear estimators

00:41:26.380 --> 00:41:28.045
to something, which is--

00:41:28.045 --> 00:41:29.650
It's the same
thing-- if you want--

00:41:29.650 --> 00:41:31.300
that we did by
building explicitly

00:41:31.300 --> 00:41:34.180
these monomials of higher
power, but here you're

00:41:34.180 --> 00:41:35.879
doing it implicitly.

00:41:35.879 --> 00:41:37.420
And it turns out
that it's actually--

00:41:37.420 --> 00:41:40.510
there is no explicit
version that you can--

00:41:40.510 --> 00:41:43.690
You can do it mathematically,
but the feature representation,

00:41:43.690 --> 00:41:45.910
the variable representation
of this kernel

00:41:45.910 --> 00:41:48.260
would be an infinitely
long vector.

00:41:48.260 --> 00:41:49.900
The space of functions
that is built

00:41:49.900 --> 00:41:53.130
as a combination of Gaussians
is not finite-dimensional.

00:41:53.130 --> 00:41:56.040
For polynomials, you can check
that the space of functions

00:41:56.040 --> 00:41:59.680
basically has dimension
polynomial in d.

00:41:59.680 --> 00:42:02.030
If I ask you how
big is the function

00:42:02.030 --> 00:42:04.780
space that you can build using
this-- well, this is easy.

00:42:04.780 --> 00:42:06.910
It's just d-dimensional.

00:42:06.910 --> 00:42:09.220
With this, well, this is
a bit more complicated,

00:42:09.220 --> 00:42:11.250
but you can compute it.

00:42:11.250 --> 00:42:16.270
For this, it's not easy to
compute, because it's infinite.

00:42:16.270 --> 00:42:19.690
So it in some sense is
a non-parametric model.

00:42:19.690 --> 00:42:21.280
What does it mean?

00:42:21.280 --> 00:42:22.990
Of course, you still
have a finite number

00:42:22.990 --> 00:42:24.380
of parameters in practice.

00:42:24.380 --> 00:42:26.020
And that's the good news.

00:42:26.020 --> 00:42:28.440
But there is no fixed number
of parameters a priori.

00:42:28.440 --> 00:42:31.129
If I give you a hundred points,
you get a hundred parameters.

00:42:31.129 --> 00:42:33.670
If I give you 2 million points,
you get 2 million parameters.

00:42:33.670 --> 00:42:36.400
If I give you 5 million points,
you get 5 million parameters.

00:42:36.400 --> 00:42:40.300
But you never hit a
boundary of complexity,

00:42:40.300 --> 00:42:44.800
because this has in some sense
an infinite-dimensional

00:42:44.800 --> 00:42:45.640
parameter space.

00:42:48.940 --> 00:42:52.210
So of course, I
see that some of

00:42:52.210 --> 00:42:55.040
the parts that I'm
explaining are complicated,

00:42:55.040 --> 00:42:57.370
especially if this is the
first time you see them.

00:42:57.370 --> 00:43:00.370
But the take-home message
should be essentially

00:43:00.370 --> 00:43:02.020
from least squares,
I can understand

00:43:02.020 --> 00:43:03.936
what's going on from a
numerical point of view

00:43:03.936 --> 00:43:06.700
and bridge numerics
and statistics.

00:43:06.700 --> 00:43:08.680
Then by just simple
linear algebra,

00:43:08.680 --> 00:43:10.090
I can understand
the complexity--

00:43:10.090 --> 00:43:12.250
how I can get
complexity-- which is

00:43:12.250 --> 00:43:14.110
linear in the
number of dimensions

00:43:14.110 --> 00:43:16.240
or the number of points.

00:43:16.240 --> 00:43:19.840
And then by following up,
I can do a little magic

00:43:19.840 --> 00:43:23.480
and go from the linear model
to something non-linear.

00:43:23.480 --> 00:43:26.200
The deep reasons why this is
possible are complicated.

00:43:26.200 --> 00:43:29.099
But as a take-home
message, A, the computation

00:43:29.099 --> 00:43:29.890
you can check easily:

00:43:29.890 --> 00:43:31.620
it remained the same.

00:43:31.620 --> 00:43:33.796
B, you can check that
what you're doing is now

00:43:33.796 --> 00:43:35.920
allowing yourself to take
a more complicated model,

00:43:35.920 --> 00:43:38.800
a combination of
the kernel functions.

00:43:38.800 --> 00:43:46.480
And then even just by playing
with these simple demos,

00:43:46.480 --> 00:43:48.370
you can understand a
bit what is the effect.

00:43:48.370 --> 00:43:50.300
And that's what you
intuitively would expect.

00:43:50.300 --> 00:43:52.600
So I hope that it would
get you close enough

00:43:52.600 --> 00:43:55.900
to have some awareness
when you use this.

00:43:55.900 --> 00:43:57.850
And of course, you can put--

00:43:57.850 --> 00:44:01.510
when you abstract from the
specificity of this algorithm,

00:44:01.510 --> 00:44:03.990
you build an algorithm with
one or two parameters--

00:44:03.990 --> 00:44:05.380
lambda and sigma.

00:44:05.380 --> 00:44:07.540
And so as soon as you ask
me how you choose those,

00:44:07.540 --> 00:44:09.685
well, we go back to the
first part of the lecture--

00:44:09.685 --> 00:44:12.100
bias-variance, tradeoffs,
cross-validation,

00:44:12.100 --> 00:44:13.810
and so on and so forth.

00:44:13.810 --> 00:44:16.750
So you just have to
put them together.

00:44:16.750 --> 00:44:19.100
There is a lot of stuff
I've not talked about.

00:44:19.100 --> 00:44:21.130
And it's a step away
from what we discussed,

00:44:21.130 --> 00:44:23.902
so you've just seen the
take-home message part,

00:44:23.902 --> 00:44:25.360
but we could talk
about reproducing

00:44:25.360 --> 00:44:27.670
kernel Hilbert spaces,
the functional analysis

00:44:27.670 --> 00:44:29.560
behind everything I said.

00:44:29.560 --> 00:44:31.990
We can talk about Gaussian
processes, which is basically

00:44:31.990 --> 00:44:34.780
the probabilistic version of
what I just showed you now.

00:44:34.780 --> 00:44:37.450
Then we can also see the
connection with a bunch of math

00:44:37.450 --> 00:44:39.746
like integral
equations and PDEs.

00:44:39.746 --> 00:44:41.620
There is a whole connection
with the sampling

00:44:41.620 --> 00:44:44.240
theory a la Shannon,
inverse problems and so on.

00:44:44.240 --> 00:44:47.230
And there is a
bunch of extensions,

00:44:47.230 --> 00:44:48.705
which are almost for free.

00:44:48.705 --> 00:44:50.234
You change the loss function.

00:44:50.234 --> 00:44:52.150
You can make it logistic,
and you get kernel

00:44:52.150 --> 00:44:53.110
logistic regression.

00:44:53.110 --> 00:44:57.520
You can take SVM, and
you get kernel SVM.
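
These variants are, for instance, available off the shelf; a sketch using scikit-learn (an assumption, not the course's own code), where only the loss changes while the Gaussian kernel machinery stays the same.

```python
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVC

# Same Gaussian ("rbf") kernel, two different loss functions:
krls = KernelRidge(kernel="rbf", alpha=0.1, gamma=0.5)  # square loss
ksvm = SVC(kernel="rbf", C=10.0, gamma=0.5)             # hinge loss
krls.fit(X_train, y_train)  # X_train, y_train assumed as before
ksvm.fit(X_train, y_train)
```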

00:44:57.520 --> 00:45:00.430
Then you can also take more
complicated output spaces.

00:45:00.430 --> 00:45:03.220
And you can do multiclass,
multivariate regression.

00:45:03.220 --> 00:45:04.210
You can do regression.

00:45:04.210 --> 00:45:06.220
You can do multilabel,
and you can

00:45:06.220 --> 00:45:07.640
do a bunch of different things.

00:45:07.640 --> 00:45:10.210
And these are
really a step away.

00:45:10.210 --> 00:45:12.387
These are minor
modifications of the code.

00:45:12.387 --> 00:45:13.720
And you can do a bunch of stuff.

00:45:13.720 --> 00:45:16.120
So the good thing of this
is that with really, really,

00:45:16.120 --> 00:45:17.770
really minor effort,
you can actually

00:45:17.770 --> 00:45:19.090
solve a bunch of problems.

00:45:19.090 --> 00:45:21.700
I'm not saying that it's going
to be the best algorithm ever,

00:45:21.700 --> 00:45:26.290
but definitely it
gets you quite far.

00:45:26.290 --> 00:45:28.240
So again we spent quite
a bit of time thinking

00:45:28.240 --> 00:45:30.950
about bias-variance and
what it means and used

00:45:30.950 --> 00:45:33.250
least squares and just
basically warming up

00:45:33.250 --> 00:45:34.570
a bit with this setting.

00:45:34.570 --> 00:45:39.022
And then in the last hour or
so, we discussed least squares,

00:45:39.022 --> 00:45:41.480
because it allows you to just think
in terms of linear algebra,

00:45:41.480 --> 00:45:43.120
which is something that--

00:45:43.120 --> 00:45:45.130
one way or another--
you've seen in your life.

00:45:45.130 --> 00:45:48.400
And then from there, you can
go from linear to non-linear.

00:45:48.400 --> 00:45:51.100
And that's a bit of magic,
but a couple of parts--

00:45:51.100 --> 00:45:54.070
which are how you use
it both numerically

00:45:54.070 --> 00:45:56.300
and just from a
practical perspective

00:45:56.300 --> 00:45:59.260
to go from complex models to
simple models and vice versa--

00:45:59.260 --> 00:46:03.579
is the part that
I hope you keep in your mind.

00:46:03.579 --> 00:46:05.870
For now, our concern has just
been to make predictions.

00:46:05.870 --> 00:46:07.370
If you hear
classification, you want

00:46:07.370 --> 00:46:08.494
to have good classification.

00:46:08.494 --> 00:46:10.870
If you hear regression, you
want to do good regression.

00:46:10.870 --> 00:46:14.440
But we didn't
talk about understanding

00:46:14.440 --> 00:46:17.200
how you do good regression.

00:46:17.200 --> 00:46:21.944
So a typical example is
the example in biology.

00:46:21.944 --> 00:46:23.110
This is, perhaps, a bit old.

00:46:23.110 --> 00:46:24.350
This is micro-arrays.

00:46:24.350 --> 00:46:31.860
But the idea is the dataset
you have is a bunch of patients.

00:46:31.860 --> 00:46:33.870
For each patient, you
have measurements,

00:46:33.870 --> 00:46:36.480
and the measurements correspond
to some gene expression

00:46:36.480 --> 00:46:38.265
level or some other
biological process.

00:46:42.390 --> 00:46:45.840
The patients are divided into
two groups, say, disease type

00:46:45.840 --> 00:46:48.630
A and disease type B. And
based on the good prediction

00:46:48.630 --> 00:46:50.880
of whether a patient
is disease type A or B,

00:46:50.880 --> 00:46:55.089
you can change the way you cure
it or address the disease.

00:46:55.089 --> 00:46:57.130
So of course, you want to
have a good prediction.

00:46:57.130 --> 00:46:59.190
You want to be able-- when
a new patient arrives--

00:46:59.190 --> 00:47:04.440
to say whether this is
going to be type A or type B.

00:47:04.440 --> 00:47:06.810
But oftentimes,
what you want to do

00:47:06.810 --> 00:47:09.900
is that you want to use
this not as the final tool,

00:47:09.900 --> 00:47:13.540
because unless deep
learning can solve this,

00:47:13.540 --> 00:47:18.450
you might go back and study a
bit more the biological process

00:47:18.450 --> 00:47:19.900
to understand a bit more.

00:47:19.900 --> 00:47:23.590
So you use this as
more statistical tools

00:47:23.590 --> 00:47:28.840
like measurements, like the
way you can use a microscope

00:47:28.840 --> 00:47:31.335
or something to look into
your data and get information.

00:47:31.335 --> 00:47:32.710
And in that sense
sometimes, it's

00:47:32.710 --> 00:47:34.335
interesting to--
instead of just saying

00:47:34.335 --> 00:47:37.720
is this patient going to be
more likely to be disease type

00:47:37.720 --> 00:47:40.300
A or B, it's to
go in and say, ah,

00:47:40.300 --> 00:47:42.340
but when you make
the prediction, what

00:47:42.340 --> 00:47:44.500
are the processes that
matter for this prediction?

00:47:44.500 --> 00:47:49.032
Is it gene number 33 or 34,
so that I can go in and say,

00:47:49.032 --> 00:47:51.490
oh, these genes make sense,
because they're in fact related

00:47:51.490 --> 00:47:54.210
to these other processes,
which are known to be related,

00:47:54.210 --> 00:47:56.354
or involved in this disease.

00:47:56.354 --> 00:47:58.270
And in doing that, you use
this just as a little tool,

00:47:58.270 --> 00:48:00.219
then you use other
ones to get a picture.

00:48:00.219 --> 00:48:01.510
And then you put them together.

00:48:01.510 --> 00:48:04.330
And then it's mostly on the
doctor, or the clinician,

00:48:04.330 --> 00:48:08.320
or the biostatistician to try
to develop a better understanding.

00:48:08.320 --> 00:48:10.640
But you do use these as
tools to understand and look

00:48:10.640 --> 00:48:12.460
into the data.

00:48:12.460 --> 00:48:14.650
And in that perspective,
the word interpretability

00:48:14.650 --> 00:48:15.440
plays a big role.

00:48:15.440 --> 00:48:17.590
And here by
interpretability I mean

00:48:17.590 --> 00:48:19.160
I not only want to
make predictions,

00:48:19.160 --> 00:48:22.300
but I want to know how I make
predictions, and come to you

00:48:22.300 --> 00:48:25.060
afterwards with an
explanation of how

00:48:25.060 --> 00:48:29.410
I picked the information that
was contained in my data.

00:48:29.410 --> 00:48:35.080
So far, it's hard to see how
to do it with the tools we had.

00:48:35.080 --> 00:48:41.940
So this is basically the
field of variable selection.

00:48:41.940 --> 00:48:44.917
And in this basic
form, the setting

00:48:44.917 --> 00:48:46.500
where we do understand
what's going on

00:48:46.500 --> 00:48:49.210
is the setting of linear models.

00:48:49.210 --> 00:48:52.530
So in this setting
basically, I just

00:48:52.530 --> 00:48:54.040
rewrite what we've seen before.

00:48:54.040 --> 00:48:57.510
You have x is a vector, and you
can think of it, for example,

00:48:57.510 --> 00:48:59.000
as a patient.

00:48:59.000 --> 00:49:01.290
And xj are measurements
that you have

00:49:01.290 --> 00:49:04.140
taken describing this patient.

00:49:04.140 --> 00:49:06.060
When you do a linear
model, you basically

00:49:06.060 --> 00:49:10.630
have that by putting a
weight on each variable,

00:49:10.630 --> 00:49:13.680
you're putting a weight
on each measurement.

00:49:13.680 --> 00:49:15.840
If a measurement
doesn't matter, you

00:49:15.840 --> 00:49:17.310
might think to put a zero here.

00:49:17.310 --> 00:49:19.650
And it will disappear
from the sum.

00:49:19.650 --> 00:49:21.420
If the measurement
matters a lot,

00:49:21.420 --> 00:49:23.610
then here you might
get a big weight.
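
In symbols, with x^j the j-th measurement of a patient x:

$$ f(x)=w^{\top}x=\sum_{j=1}^{d} w_j\,x^{j}, $$

so w_j = 0 removes measurement j from the sum, while a large |w_j| marks an influential one.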

00:49:23.610 --> 00:49:27.840
So one way to try to get the
feeling of which measurements

00:49:27.840 --> 00:49:29.640
are important and which
are not is to try

00:49:29.640 --> 00:49:33.980
to estimate a linear
model, where you get the w,

00:49:33.980 --> 00:49:37.650
but ideally we would like to
get a w which has many zeros.

00:49:37.650 --> 00:49:40.189
You don't want to fumble with
what's small and what's not.

00:49:40.189 --> 00:49:42.480
So if you do least squares
the way I showed you before,

00:49:42.480 --> 00:49:44.047
you would get a w.

00:49:44.047 --> 00:49:44.880
Then you would get--

00:49:44.880 --> 00:49:47.820
most of them, you can check,
will not be zero.

00:49:47.820 --> 00:49:50.254
In fact, none of them
will be zero in general.

00:49:50.254 --> 00:49:52.670
And so now you have to decide
what's small and what's big,

00:49:52.670 --> 00:49:53.794
and that might not be easy.

00:49:56.740 --> 00:49:57.975
Oops, what happened here?

00:50:05.590 --> 00:50:10.797
So funny enough, this is
the name I found--

00:50:10.797 --> 00:50:12.380
I don't remember the
name of the book.

00:50:12.380 --> 00:50:14.250
It's the name that
was used to describe

00:50:14.250 --> 00:50:16.759
the process of variable
selection, which

00:50:16.759 --> 00:50:18.300
is a much harder
problem, because you

00:50:18.300 --> 00:50:19.330
don't just want to make predictions.

00:50:19.330 --> 00:50:20.705
But you want to
go back and check

00:50:20.705 --> 00:50:22.640
how you make the prediction.

00:50:22.640 --> 00:50:28.230
And so it's very easy to start
to get overfitting and start

00:50:28.230 --> 00:50:30.690
to try to squeeze the data
until you get some information.

00:50:30.690 --> 00:50:32.231
So it's good to have
a procedure that

00:50:32.231 --> 00:50:34.950
will give you a somewhat
clean procedure to extract

00:50:34.950 --> 00:50:36.060
the important variables.

00:50:36.060 --> 00:50:37.980
Again, you can think of
this as a-- basically,

00:50:37.980 --> 00:50:40.500
I want to build an
f, but I also want

00:50:40.500 --> 00:50:43.890
to come up with a list, or even
better, weights that tell me

00:50:43.890 --> 00:50:45.600
which variables are important.

00:50:45.600 --> 00:50:47.150
And often this will
be just a list,

00:50:47.150 --> 00:50:49.560
which is much smaller than
d, so that I can go back

00:50:49.560 --> 00:50:52.740
and say, oh, measurement 33,
34, and 50-- what are they?

00:50:52.740 --> 00:50:55.494
I could go in and look at it.

00:50:55.494 --> 00:50:57.660
Notice that there is also
a computational reason why

00:50:57.660 --> 00:50:58.743
this would be interesting.

00:50:58.743 --> 00:51:01.530
Because of course,
if d here is 50,000--

00:51:01.530 --> 00:51:03.600
and what I see is
that, in fact, I

00:51:03.600 --> 00:51:07.710
can throw away most of these
measurements and just keep 10--

00:51:07.710 --> 00:51:09.300
then it means that
I can hopefully

00:51:09.300 --> 00:51:10.980
reduce the complexity
of my computation,

00:51:10.980 --> 00:51:12.905
but also the storage of
the data, for example.

00:51:12.905 --> 00:51:14.780
If I have to send you
the datasets after I've

00:51:14.780 --> 00:51:16.410
done this thing,
I just have to send you

00:51:16.410 --> 00:51:17.520
this teeny tiny matrix.

00:51:20.160 --> 00:51:22.210
So interpretability
is one reason,

00:51:22.210 --> 00:51:27.690
but the computational
aspect could be another one.

00:51:30.250 --> 00:51:33.510
Another reason that I don't
want to talk too much is also--

00:51:33.510 --> 00:51:36.270
remember that we
had this idea, where

00:51:36.270 --> 00:51:39.600
we said we could augment
the complexity of a model

00:51:39.600 --> 00:51:43.770
by inventing
features, and someone asked,

00:51:43.770 --> 00:51:47.100
do I always have to pay
the price of making it big?

00:51:47.100 --> 00:51:49.339
Well, I basically--
if you want--

00:51:49.339 --> 00:51:50.130
I was pointing at--

00:51:50.130 --> 00:51:52.566
I said, no, not always, because
I was thinking of kernels.

00:51:52.566 --> 00:51:54.690
These, if you want, give
you another potential way

00:51:54.690 --> 00:51:57.150
to go around it, in which what
you do is that, first of all,

00:51:57.150 --> 00:51:59.252
you explode the
number of features.

00:51:59.252 --> 00:52:00.960
You take many, many,
many, many, and then

00:52:00.960 --> 00:52:02.940
you use this as a
preliminary step

00:52:02.940 --> 00:52:07.050
to shrink them down to a
more reasonable number.

00:52:07.050 --> 00:52:08.700
Because it's quite
likely that among

00:52:08.700 --> 00:52:10.980
these many, many
measurements, some of them

00:52:10.980 --> 00:52:13.170
would just be very
correlated, or uninteresting,

00:52:13.170 --> 00:52:15.240
or so on and so forth.

00:52:15.240 --> 00:52:18.480
So this dimensionality
reduction or

00:52:18.480 --> 00:52:21.030
computational or interpretable
model perspective

00:52:21.030 --> 00:52:25.860
is what stands behind the desire
to do something like this.

00:52:25.860 --> 00:52:28.450
So let's say one more
thing and then we'll stop.

00:52:31.570 --> 00:52:35.330
So suppose that you have
infinite computational power.

00:52:35.330 --> 00:52:39.450
So the computations
are not your concern,

00:52:39.450 --> 00:52:41.959
and you want to
solve this problem.

00:52:41.959 --> 00:52:42.750
How will you do it?

00:52:46.358 --> 00:52:50.280
Suppose that you have the
code for least squares.

00:52:50.280 --> 00:52:52.450
And you can run it as
many times as you want.

00:52:52.450 --> 00:52:55.352
How would you go and
try to estimate which

00:52:55.352 --> 00:52:56.560
variables are more important?

00:52:56.560 --> 00:52:58.864
AUDIENCE: [INAUDIBLE]
possibility of computations.

00:52:58.864 --> 00:53:00.530
LORENZO ROSASCO:
That's one possibility.

00:53:00.530 --> 00:53:02.760
What you do is that you have--

00:53:02.760 --> 00:53:05.270
you start and look at
all single variables.

00:53:05.270 --> 00:53:09.120
And you solve least squares
for all single variables.

00:53:09.120 --> 00:53:12.260
Then you take all
couples of variables.

00:53:12.260 --> 00:53:14.510
Then you get all
triplets of variables.

00:53:14.510 --> 00:53:17.797
And then you find
which one is best.
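
A sketch of that brute-force search; fit_ls and err are assumed stand-ins for a least squares solver and a validation error, and the loop over all subsets is exactly what makes it combinatorial.

```python
import numpy as np
from itertools import combinations

def best_subset(X, y, max_size, fit_ls, err):
    # Try every subset of variables up to max_size: statistically
    # sound, but the number of subsets grows exponentially in d.
    best_e, best_cols = np.inf, None
    for size in range(1, max_size + 1):
        for cols in combinations(range(X.shape[1]), size):
            idx = list(cols)
            w = fit_ls(X[:, idx], y)
            e = err(X[:, idx] @ w, y)
            if e < best_e:
                best_e, best_cols = e, idx
    return best_cols, best_e
```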

00:53:17.797 --> 00:53:19.380
From a statistical
point of view there

00:53:19.380 --> 00:53:21.530
is absolutely nothing
wrong with this,

00:53:21.530 --> 00:53:23.330
because you're
trying everything.

00:53:23.330 --> 00:53:26.510
And at some point, you
find what's the best.

00:53:26.510 --> 00:53:28.910
The problem is that
it's combinatorial.

00:53:28.910 --> 00:53:33.680
And you see that when you're
in dimension more than--

00:53:33.680 --> 00:53:36.950
a very few, it's huge.

00:53:36.950 --> 00:53:39.740
So it's exponential.

00:53:39.740 --> 00:53:43.670
So it turns out that
doing what you just

00:53:43.670 --> 00:53:46.600
told me to do, which is what
I asked you to tell me to do,

00:53:46.600 --> 00:53:48.320
which is this brute
force approach

00:53:48.320 --> 00:53:52.400
is equivalent to doing
something like this, which is again

00:53:52.400 --> 00:53:54.470
a regularization approach.

00:53:54.470 --> 00:53:57.400
Here I put what is
called the zero norm.

00:53:57.400 --> 00:53:59.730
The zero norm is
actually not a norm.

00:53:59.730 --> 00:54:01.685
It is just a functional.

00:54:01.685 --> 00:54:04.040
It's a thing that does
the following thing.

00:54:04.040 --> 00:54:06.320
If I give you a vector,
you have to return

00:54:06.320 --> 00:54:10.250
the number of components
different from zero, only that.

00:54:10.250 --> 00:54:11.930
So you go inside and
look at each entry,

00:54:11.930 --> 00:54:14.630
and you tell if they
are different from zero.
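
Written out with the least squares data term from before, the functional is

$$ \min_{w\in\mathbb{R}^{d}}\ \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i-w^{\top}x_i\bigr)^{2}+\lambda\,\|w\|_{0}, \qquad \|w\|_{0}=\#\{\,j:\ w_j\neq 0\,\}. $$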

00:54:14.630 --> 00:54:17.800
This is absolutely not convex.

00:54:17.800 --> 00:54:21.270
And so this is the reason why
this problem is equivalent--

00:54:21.270 --> 00:54:24.530
it becomes computationally
not feasible.

00:54:24.530 --> 00:54:26.480
So perhaps, we can stop here.

00:54:26.480 --> 00:54:29.810
And what I want to show
you next is essentially--

00:54:29.810 --> 00:54:31.270
if you have this--

00:54:31.270 --> 00:54:33.890
and you know that in some sense,
this is what you would like

00:54:33.890 --> 00:54:35.810
to do, if you could
do it computationally,

00:54:35.810 --> 00:54:37.100
but you cannot--

00:54:37.100 --> 00:54:41.330
so how can you find
approximate versions of this

00:54:41.330 --> 00:54:42.770
that you can
compute in practice?

00:54:42.770 --> 00:54:44.769
And we're going to discuss
two ways of doing it.

00:54:44.769 --> 00:54:48.820
One is greedy methods and
one is convex relaxations.