WEBVTT

00:00:01.550 --> 00:00:03.920
The following content is
provided under a Creative

00:00:03.920 --> 00:00:05.310
Commons license.

00:00:05.310 --> 00:00:07.520
Your support will help
MIT OpenCourseWare

00:00:07.520 --> 00:00:11.610
continue to offer high quality
educational resources for free.

00:00:11.610 --> 00:00:14.180
To make a donation or to
view additional materials

00:00:14.180 --> 00:00:16.670
from hundreds of
MIT courses, visit

00:00:16.670 --> 00:00:18.540
MIT OpenCourseWare at ocw.mit.edu.

00:00:24.170 --> 00:00:29.070
GILBERT STRANG: So I'm going to
talk about the gradient descent

00:00:29.070 --> 00:00:32.580
today to get to that
central algorithm

00:00:32.580 --> 00:00:38.190
of neural net deep
learning, machine learning,

00:00:38.190 --> 00:00:40.530
and optimization in general.

00:00:40.530 --> 00:00:43.230
So I'm trying to
minimize a function.

00:00:43.230 --> 00:00:50.400
And that's the way you do it if
there are many, many variables,

00:00:50.400 --> 00:00:52.890
too many to take
second derivatives,

00:00:52.890 --> 00:00:56.880
then we settle for first
derivatives of the function.

00:00:56.880 --> 00:00:59.610
So I introduced,
and you've already

00:00:59.610 --> 00:01:01.610
met the idea of gradient.

00:01:01.610 --> 00:01:04.470
But let me just be sure
to make some comments

00:01:04.470 --> 00:01:07.410
about the gradient
and the Hessian

00:01:07.410 --> 00:01:15.610
and the role of convexity before
we see the big crucial example.

00:01:15.610 --> 00:01:19.425
So I've kind of prepared over
here for this crucial example.

00:01:22.010 --> 00:01:26.820
The function is a pure
quadratic, two unknowns, x

00:01:26.820 --> 00:01:30.240
and y, pure quadratic.

00:01:30.240 --> 00:01:34.620
So every pure quadratic
I can write in terms

00:01:34.620 --> 00:01:37.160
of a symmetric matrix S.

00:01:37.160 --> 00:01:42.890
And in this case, x1 squared
plus b x2 squared, the symmetric,

00:01:42.890 --> 00:01:45.810
the matrix is just 2 by 2.

00:01:45.810 --> 00:01:47.040
It's diagonal.

00:01:47.040 --> 00:01:52.440
It's got eigenvalues 1 and
b sitting on the diagonal.

00:01:52.440 --> 00:01:56.020
I'm thinking of b as
being the smaller one.

00:01:56.020 --> 00:02:00.720
So the condition
number, which we'll see,

00:02:00.720 --> 00:02:07.230
is all important in the question
of the speed of convergence

00:02:07.230 --> 00:02:13.260
is the ratio of the
largest to the smallest.

00:02:13.260 --> 00:02:17.310
In this case, the largest
is 1, the smallest is b.

00:02:17.310 --> 00:02:19.260
So that's 1 over b.

00:02:19.260 --> 00:02:23.370
And when 1 over b
is a big number,

00:02:23.370 --> 00:02:26.130
when b is a very small
number, then that's

00:02:26.130 --> 00:02:27.090
when we're in trouble.

00:02:31.560 --> 00:02:34.380
When the matrix is symmetric,
that condition number

00:02:34.380 --> 00:02:37.620
is lambda max over lambda min.

00:02:37.620 --> 00:02:40.830
If I had an
unsymmetric matrix, I

00:02:40.830 --> 00:02:44.360
would probably use sigma max
over sigma min, of course.

00:02:44.360 --> 00:02:48.660
But here, matrices
are symmetric.
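
NOTE
A minimal sketch of the condition number just described, assuming the 2 by 2 symmetric matrix with 1 and b on the diagonal. The helper names and the value b = 0.01 are illustrative assumptions, not from the lecture.
```python
import math
def sym2x2_eigenvalues(s11, s12, s22):
    # Eigenvalues of the symmetric matrix [[s11, s12], [s12, s22]], smallest first.
    mean = (s11 + s22) / 2.0
    radius = math.hypot((s11 - s22) / 2.0, s12)
    return mean - radius, mean + radius
def condition_number(s11, s12, s22):
    # For a symmetric positive definite matrix: lambda max over lambda min.
    lam_min, lam_max = sym2x2_eigenvalues(s11, s12, s22)
    return lam_max / lam_min
b = 0.01  # assumed small value, chosen only for illustration
kappa = condition_number(1.0, 0.0, b)
print(kappa)  # close to 1 / b
```
Shrinking b in this sketch makes kappa grow like 1 over b, which is the case the lecture calls trouble.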

00:02:48.660 --> 00:02:52.170
We're going to
see something neat

00:02:52.170 --> 00:02:58.260
is that we can actually take
the steps of steepest descent,

00:02:58.260 --> 00:03:01.440
write down what
each step gives us,

00:03:01.440 --> 00:03:05.310
and see how quickly they
converge to the answer.

00:03:05.310 --> 00:03:07.220
And what is the answer?

00:03:07.220 --> 00:03:11.370
So I haven't put in
any linear term here.

00:03:11.370 --> 00:03:14.730
So I just have a bowl
sitting on the origin.

00:03:14.730 --> 00:03:18.990
So of course, the minimum
point is x equal 0, y equals 0.

00:03:18.990 --> 00:03:26.050
So the minimum point x
star, is 0, 0, of course.

00:03:26.050 --> 00:03:29.670
So the question will be how
quickly do we get to that one.

00:03:29.670 --> 00:03:33.450
And you will say pretty
small example, not typical.

00:03:33.450 --> 00:03:37.080
But the terrific
thing is that we see

00:03:37.080 --> 00:03:38.890
everything for this example.

00:03:38.890 --> 00:03:43.380
We can see the actual
steps of steepest descent.

00:03:43.380 --> 00:03:45.600
We can see how
quickly they converge

00:03:45.600 --> 00:03:50.730
to the x star, the
answer, the place

00:03:50.730 --> 00:03:52.890
where this thing is a minimum.

00:03:52.890 --> 00:04:01.440
And we can begin to think
what to do if it's too slow.

00:04:01.440 --> 00:04:06.930
So I'll come to that example
after some general thoughts

00:04:06.930 --> 00:04:09.840
about gradients, Hessians.

00:04:09.840 --> 00:04:12.300
So what does the
gradient tell us?

00:04:12.300 --> 00:04:14.745
So let me just take an
example of the gradient.

00:04:17.860 --> 00:04:23.980
Let me take a linear function,
f of x, y equals, say, 2x plus 5y.

00:04:26.560 --> 00:04:31.540
I just think we ought to get
totally familiar with these.

00:04:31.540 --> 00:04:33.910
We're doing something.

00:04:33.910 --> 00:04:38.800
We're jumping into
an important topic.

00:04:38.800 --> 00:04:41.440
When I ask you
what's the gradient,

00:04:41.440 --> 00:04:43.780
that's a freshman question.

00:04:43.780 --> 00:04:48.460
But let's just be sure we know
how to interpret the gradient,

00:04:48.460 --> 00:04:51.970
how to compute
it, what it means,

00:04:51.970 --> 00:04:54.200
how to see it geometrically.

00:04:54.200 --> 00:04:56.650
So what's the gradient
of that function?

00:04:56.650 --> 00:04:58.380
It's a function
of two variables.

00:04:58.380 --> 00:05:02.110
So the gradient is a
vector with two components.

00:05:02.110 --> 00:05:02.980
And they are?

00:05:07.540 --> 00:05:09.420
The derivative with
respect to x, which

00:05:09.420 --> 00:05:13.320
is 2, and the derivative with
respect to y, which is 5.

00:05:13.320 --> 00:05:17.100
So in this case, the
gradient is constant.

00:05:17.100 --> 00:05:22.650
And the Hessian, which I
often call H after Hessian,

00:05:22.650 --> 00:05:25.800
or del squared F
would tell us we're

00:05:25.800 --> 00:05:27.990
taking the second
derivatives, that

00:05:27.990 --> 00:05:33.150
will be the second derivatives
obviously 0 in this case.

00:05:33.150 --> 00:05:38.230
So what shape is H here?

00:05:38.230 --> 00:05:39.730
It's 2 by 2.

00:05:39.730 --> 00:05:45.212
Everybody recognizes 2 by
2 is H would have the--

00:05:45.212 --> 00:05:49.220
I'll take a second
derivative of that--

00:05:49.220 --> 00:05:52.090
sorry, the first derivative
of that with respect to x,

00:05:52.090 --> 00:05:54.700
obviously 0, the first
derivative with respect

00:05:54.700 --> 00:06:00.620
to y, the first derivative
of that with respect to x y.

00:06:00.620 --> 00:06:04.840
Anyway, Hessian 0 for sure.

00:06:04.840 --> 00:06:08.080
So let me draw the surface.

00:06:08.080 --> 00:06:13.540
So x, y, and the surface, if
I graph F in this direction,

00:06:13.540 --> 00:06:16.960
then obviously, I have a plane.

00:06:16.960 --> 00:06:20.840
And I'm at a typical point
on the plane let's say.

00:06:20.840 --> 00:06:21.910
Yeah, yeah.

00:06:21.910 --> 00:06:24.070
So I'm at a point
x, y, I should say.

00:06:24.070 --> 00:06:25.690
I'm at a point x, y.

00:06:25.690 --> 00:06:28.340
And let me put the
plane through it.

00:06:28.340 --> 00:06:30.160
So how do I interpret
the gradient

00:06:30.160 --> 00:06:32.235
at that particular point x, y?

00:06:35.630 --> 00:06:38.240
What does 2x plus 5y tell me?

00:06:38.240 --> 00:06:46.400
Or rather what does grad
F tell me about movement

00:06:46.400 --> 00:06:50.510
from that point x, y?

00:06:50.510 --> 00:06:52.030
Of course, the
gradient is constant.

00:06:52.030 --> 00:06:55.130
So it really didn't matter
what point I'm moving from.

00:06:55.130 --> 00:06:57.680
But taking a point here.

00:06:57.680 --> 00:07:00.290
So what's the deal if I move?

00:07:00.290 --> 00:07:04.010
What's the fastest way
to go up the surface?

00:07:04.010 --> 00:07:09.110
If I took the plane that
went through that point x, y,

00:07:09.110 --> 00:07:11.620
what's the fastest way
to climb the plane?

00:07:11.620 --> 00:07:14.630
What direction goes up fastest?

00:07:14.630 --> 00:07:16.230
The gradient direction, right?

00:07:16.230 --> 00:07:19.080
The gradient direction
is the way up.

00:07:19.080 --> 00:07:22.700
How am I going to put
it in this picture?

00:07:22.700 --> 00:07:26.710
I guess I'm thinking
of this plane as--

00:07:26.710 --> 00:07:27.530
so what plane?

00:07:27.530 --> 00:07:30.230
You could well ask what
plane have I drawn?

00:07:30.230 --> 00:07:39.350
Suppose I've drawn the plane
2x plus 5y equals 0 even?

00:07:39.350 --> 00:07:41.560
So I'll make it go
through the origin.

00:07:41.560 --> 00:07:44.540
And I've taken a typical
point on that plane.

00:07:44.540 --> 00:07:48.380
Now if I want to
increase that function,

00:07:48.380 --> 00:07:52.700
I go perpendicular to the plane.

00:07:52.700 --> 00:07:54.665
If I want to stay level
with the function,

00:07:54.665 --> 00:07:58.620
if I wanted to stay at
0, I stay in the plane.

00:07:58.620 --> 00:08:00.650
So there are two key directions.

00:08:00.650 --> 00:08:01.880
Everybody knows this.

00:08:01.880 --> 00:08:03.200
I'm just repeating.

00:08:03.200 --> 00:08:08.030
This is the direction
of the gradient of F out

00:08:08.030 --> 00:08:10.250
of the plane, steepest upwards.

00:08:10.250 --> 00:08:13.190
This is the downwards
direction minus gradient

00:08:13.190 --> 00:08:16.940
of F, perpendicular to
the plane downwards.

00:08:16.940 --> 00:08:21.800
And that line is in the plane.

00:08:21.800 --> 00:08:23.660
That's part of the level set.

00:08:23.660 --> 00:08:28.070
2x plus 5y equals 0
would be a level set.
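
NOTE
A small sketch of the level and steepest directions for f = 2x + 5y, assuming unit steps from an arbitrary start point (1, 1). The names steepest_up, level and change_in_f are hypothetical, chosen only for this illustration.
```python
import math
def f(x, y):
    return 2.0 * x + 5.0 * y
grad = (2.0, 5.0)  # constant gradient of f
norm = math.hypot(grad[0], grad[1])
steepest_up = (grad[0] / norm, grad[1] / norm)
steepest_down = (-steepest_up[0], -steepest_up[1])
level = (5.0 / norm, -2.0 / norm)  # unit vector perpendicular to the gradient
def change_in_f(direction, start=(1.0, 1.0)):
    # Change in f after one unit step from start along direction.
    end = (start[0] + direction[0], start[1] + direction[1])
    return f(end[0], end[1]) - f(start[0], start[1])
up_change = change_in_f(steepest_up)
down_change = change_in_f(steepest_down)
level_change = change_in_f(level)
print(up_change, down_change, level_change)
```
A unit step along the gradient changes f by the length of the gradient, a step along minus the gradient changes it by the negative of that, and a step along the level direction leaves f unchanged up to rounding.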

00:08:28.070 --> 00:08:32.950
That's my pretty
amateur picture.

00:08:32.950 --> 00:08:45.130
Just all I want to remember is
these words level and steepest,

00:08:45.130 --> 00:08:49.330
up or down.

00:08:49.330 --> 00:08:54.610
Down with a minus sign that
we see in steepest descent.

00:08:54.610 --> 00:08:58.980
So we're in steepest descent.

00:09:03.020 --> 00:09:08.900
And what's the Hessian
telling me about the surface

00:09:08.900 --> 00:09:12.810
if I take the matrix
of second derivatives?

00:09:12.810 --> 00:09:14.680
So I have this surface.

00:09:14.680 --> 00:09:18.070
So I have a surface
F equal constant.

00:09:22.990 --> 00:09:25.620
That's the sort
of level surface.

00:09:25.620 --> 00:09:29.530
So if I stay in that surface,
the gradient of F is 0.

00:09:29.530 --> 00:09:33.351
Gradient of F is 0 in--

00:09:36.960 --> 00:09:39.270
on-- on is a better word--

00:09:39.270 --> 00:09:39.900
on the surface.

00:09:43.330 --> 00:09:46.220
The gradient of F
points perpendicular.

00:09:46.220 --> 00:09:58.100
But what about the Hessian,
the second derivative?

00:09:58.100 --> 00:10:03.430
What is that telling
me about that surface

00:10:03.430 --> 00:10:07.950
in particular when the Hessian
is 0 or other surfaces?

00:10:07.950 --> 00:10:10.395
What does the Hessian
tell me about--

00:10:13.370 --> 00:10:16.990
I'm thinking of the Hessian
at a particular point.

00:10:16.990 --> 00:10:25.580
So I'm getting 0 for the Hessian
because the surface is flat.

00:10:25.580 --> 00:10:34.180
If the surface was
convex upwards from--

00:10:34.180 --> 00:10:41.775
if it was a convex or a graph
of F, the Hessian would be--

00:10:46.340 --> 00:10:48.810
so I just want to make
that connection now.

00:10:48.810 --> 00:10:54.990
What's the connection between
the Hessian and convexity

00:10:54.990 --> 00:10:55.590
of the--

00:10:55.590 --> 00:11:00.660
the Hessian of the function
and convexity of the function?

00:11:00.660 --> 00:11:06.550
So the point is that convexity--

00:11:06.550 --> 00:11:10.350
the Hessian tells me whether
or not the surface is convex.

00:11:10.350 --> 00:11:11.550
And what is the test?

00:11:11.550 --> 00:11:12.600
AUDIENCE: [INAUDIBLE].

00:11:12.600 --> 00:11:16.350
GILBERT STRANG: Positive
definite or semi-definite.

00:11:16.350 --> 00:11:20.340
I'm just looking for
an excuse to write down

00:11:20.340 --> 00:11:26.910
convexity and strong.

00:11:26.910 --> 00:11:29.760
Do I say strict or
strong convexity?

00:11:29.760 --> 00:11:30.630
I've forgotten.

00:11:30.630 --> 00:11:32.150
Strict, I think.

00:11:32.150 --> 00:11:33.030
Strictly convex.

00:11:38.230 --> 00:11:45.100
So convexity, the Hessian
is positive semi-definite,

00:11:45.100 --> 00:11:48.330
or which includes--

00:11:48.330 --> 00:11:49.990
I better say that right here--

00:11:49.990 --> 00:11:52.074
includes positive definite.

00:11:58.380 --> 00:12:00.420
If I'm looking for
strict convexity,

00:12:00.420 --> 00:12:03.220
then I must require
positive definite.

00:12:03.220 --> 00:12:05.863
H is positive definite.

00:12:09.810 --> 00:12:12.300
Semi-definite won't do.

00:12:12.300 --> 00:12:15.300
So semi-definite for convex.

00:12:15.300 --> 00:12:18.540
So that in fact,
the linear function

00:12:18.540 --> 00:12:22.170
is convex, but not
strictly convex.

00:12:22.170 --> 00:12:25.160
Strictly means it
really bends upwards.

00:12:25.160 --> 00:12:26.890
The Hessian is
positive definite.

00:12:26.890 --> 00:12:31.120
The curvatures are positive.

00:12:31.120 --> 00:12:34.290
So this would include
linear functions,

00:12:34.290 --> 00:12:37.460
and that would not
include linear functions.

00:12:37.460 --> 00:12:40.740
They're not strictly convex.
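
NOTE
A hedged sketch of the Hessian test for a 2 by 2 symmetric matrix, assuming the standard determinant checks for definiteness. The function name classify_sym2x2 and the sample entries are illustrative assumptions.
```python
def classify_sym2x2(h11, h12, h22):
    # Classify the symmetric matrix [[h11, h12], [h12, h22]] by definiteness.
    det = h11 * h22 - h12 * h12
    if h11 > 0 and det > 0:
        return "positive definite"
    if h11 >= 0 and h22 >= 0 and det >= 0:
        return "positive semi-definite"
    return "not positive semi-definite"
linear_case = classify_sym2x2(0.0, 0.0, 0.0)  # Hessian of 2x + 5y is the zero matrix
bowl_case = classify_sym2x2(1.0, 0.0, 0.1)  # diagonal 1 and b, with b = 0.1 assumed
saddle_case = classify_sym2x2(1.0, 0.0, -1.0)  # one negative curvature
print(linear_case, bowl_case, saddle_case)
```
The zero Hessian of the linear function lands in the semi-definite case, so it passes the convexity test but not the positive definite test that the lecture attaches to strict convexity.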

00:12:40.740 --> 00:12:42.510
Good, good, good.

00:12:42.510 --> 00:12:46.600
Some examples-- OK, the
number one example, of course,

00:12:46.600 --> 00:12:49.410
is the one we're
talking about over here.

00:12:49.410 --> 00:12:59.840
So examples f of x equal
1/2 x transpose Sx.

00:13:03.020 --> 00:13:05.660
And of course, I could
have linear terms

00:13:05.660 --> 00:13:10.310
minus a transpose
x, a linear term.

00:13:10.310 --> 00:13:12.770
And I could have a constant.

00:13:12.770 --> 00:13:13.270
OK.

00:13:18.790 --> 00:13:23.390
So this function
is strictly convex

00:13:23.390 --> 00:13:28.130
when S is positive
definite, because H is now

00:13:28.130 --> 00:13:33.800
S for that function,
for that function

00:13:33.800 --> 00:13:39.170
H. Usually H, the Hessian is
varying from point to point.

00:13:39.170 --> 00:13:42.770
The nice thing about a pure
quadratic is it's constant.

00:13:42.770 --> 00:13:46.550
It's the same S at all points.

00:13:46.550 --> 00:13:49.580
Let me just ask you--

00:13:49.580 --> 00:13:53.370
so that's a convex function.

00:13:53.370 --> 00:13:56.250
And what's its minimum?

00:13:56.250 --> 00:13:57.883
What's the gradient,
first of all?

00:13:57.883 --> 00:13:59.050
What's the gradient of that?

00:14:03.790 --> 00:14:09.570
I'm asking really
for differentiating

00:14:09.570 --> 00:14:14.440
thinking in vector, doing all
n derivatives at once here.

00:14:14.440 --> 00:14:19.840
I'm asking for the whole
vector of first derivatives.

00:14:19.840 --> 00:14:24.420
Because here I'm giving
you the whole function

00:14:24.420 --> 00:14:28.150
with x for vector x.

00:14:28.150 --> 00:14:31.210
Of course, we could
take n to be 1.

00:14:31.210 --> 00:14:33.760
And then we would
see that if n was 1,

00:14:33.760 --> 00:14:39.880
this would just be Sx
squared, half Sx squared.

00:14:39.880 --> 00:14:44.170
And the derivative of
a half Sx squared--

00:14:44.170 --> 00:14:46.030
let me just put that
over here so we're

00:14:46.030 --> 00:14:48.700
sure to get it right--
half of Sx squared.

00:14:48.700 --> 00:14:51.490
This is in the n equal 1 case.

00:14:51.490 --> 00:14:53.860
And the derivative
is obviously Sx.

00:14:53.860 --> 00:14:55.540
And that's what it is here, Sx.

00:15:06.490 --> 00:15:10.200
It's obviously
simple, but if you

00:15:10.200 --> 00:15:14.190
haven't thought
about that line, it's

00:15:14.190 --> 00:15:18.120
asking for all the
first derivatives

00:15:18.120 --> 00:15:20.850
of that quadratic function.

00:15:20.850 --> 00:15:21.570
Oh!

00:15:21.570 --> 00:15:27.940
It's not-- What do I
have to include now here?

00:15:27.940 --> 00:15:31.200
That's not right as it stands
for the function that's

00:15:31.200 --> 00:15:32.517
written above it.

00:15:32.517 --> 00:15:33.600
What's the right gradient?

00:15:33.600 --> 00:15:34.517
AUDIENCE: [INAUDIBLE].

00:15:34.517 --> 00:15:38.220
GILBERT STRANG: Minus a, thanks.

00:15:38.220 --> 00:15:41.440
Because the linear function,
its partial derivatives

00:15:41.440 --> 00:15:45.120
are obviously just
the components of a.

00:15:45.120 --> 00:15:56.030
And the Hessian H is S,
derivatives of that guy.

00:15:56.030 --> 00:15:56.700
OK.

00:15:56.700 --> 00:15:57.300
Good.

00:15:57.300 --> 00:15:59.550
Good, good, good.

00:15:59.550 --> 00:16:02.520
And the minimum value-- we
might as well-- oh yeah!

00:16:02.520 --> 00:16:07.820
What's the right words
for a minimum value?

00:16:07.820 --> 00:16:09.570
No, I'm sorry.

00:16:09.570 --> 00:16:14.430
The right word is
minimum value like f min.

00:16:14.430 --> 00:16:17.880
So I want to compute f min.

00:16:17.880 --> 00:16:23.930
Well, first I have to figure out
where is that minimum reached?

00:16:23.930 --> 00:16:27.140
And what's the answer to that?

00:16:27.140 --> 00:16:30.840
We're putting everything on
the board for this simple case.

00:16:30.840 --> 00:16:38.990
The minimum of f
of x--

00:16:38.990 --> 00:16:42.290
remember, it's x is--
we're in n dimensions--

00:16:42.290 --> 00:16:49.910
is at x equal what?

00:16:49.910 --> 00:16:52.400
Well, the minimum is
where the gradient is 0.

00:16:55.460 --> 00:16:59.381
So what's the minimizing x?

00:16:59.381 --> 00:17:01.115
S inverse a, thanks.

00:17:08.180 --> 00:17:09.260
Sorry.

00:17:09.260 --> 00:17:12.480
That's not right.

00:17:12.480 --> 00:17:14.020
It's here that I
meant to write it.

00:17:17.099 --> 00:17:20.550
Really, my whole point
for this little moment

00:17:20.550 --> 00:17:23.250
is to be sure that
we keep straight what

00:17:23.250 --> 00:17:27.780
I mean by the place where
the minimum is reached

00:17:27.780 --> 00:17:29.160
and the minimum value.

00:17:29.160 --> 00:17:30.600
Those are two different things.

00:17:34.330 --> 00:17:36.810
So the minimum is
reached at S inverse

00:17:36.810 --> 00:17:40.270
a, because that's obviously
where the gradient is 0.

00:17:40.270 --> 00:17:43.073
It's the solution to Sx equal a.

00:17:43.073 --> 00:17:48.970
And what I was going to ask
you is what's the right word--

00:17:48.970 --> 00:17:56.440
well, sort of word, made up
word-- for this point x star

00:17:56.440 --> 00:17:58.760
where the minimum is reached?

00:17:58.760 --> 00:18:00.160
So it's not the minimum value.

00:18:00.160 --> 00:18:01.720
It's the point
where it's reached.

00:18:01.720 --> 00:18:06.057
And that's called-- the
notation for that point is

00:18:06.057 --> 00:18:06.991
AUDIENCE: Arg min.

00:18:06.991 --> 00:18:10.240
GILBERT STRANG: Arg min, thanks.

00:18:10.240 --> 00:18:16.620
Arg min of my function.

00:18:16.620 --> 00:18:18.900
And that means the place--

00:18:18.900 --> 00:18:24.918
the point where f equals f min.

00:18:28.200 --> 00:18:30.600
I haven't said yet what
the minimum value is.

00:18:30.600 --> 00:18:31.830
This tells us the point.

00:18:31.830 --> 00:18:34.290
And that's usually what
we're interested in.

00:18:34.290 --> 00:18:36.540
We're, to tell the
truth, not that

00:18:36.540 --> 00:18:40.470
interested in a typical example
and what the minimum value

00:18:40.470 --> 00:18:43.740
is as much as where is it?

00:18:43.740 --> 00:18:46.590
Where do we reach that thing?

00:18:46.590 --> 00:18:50.490
And of course, so this is x min.

00:18:50.490 --> 00:19:00.010
This is then arg min
of my function f.

00:19:00.010 --> 00:19:00.940
That's the point.

00:19:00.940 --> 00:19:04.420
And it happens to
be in this case,

00:19:04.420 --> 00:19:06.520
the minimum value is actually 0.

00:19:11.470 --> 00:19:15.190
Because there's no linear
term a transpose x.

00:19:20.080 --> 00:19:26.270
Why am I talking about arg
min when you've all seen it?

00:19:26.270 --> 00:19:28.990
I guess I think that
somebody could just

00:19:28.990 --> 00:19:34.750
be reading this stuff,
for example, learning

00:19:34.750 --> 00:19:40.740
about neural nets, and run
into this expression arg min

00:19:40.740 --> 00:19:43.360
and think what's that?

00:19:43.360 --> 00:19:47.620
So it's maybe a right
time to say what it is.

00:19:47.620 --> 00:19:50.110
It's the point where
the minimum is reached.

00:19:52.930 --> 00:19:55.510
Why those words, by the way?

00:19:55.510 --> 00:19:57.280
Well, arg isn't much of a word.

00:19:57.280 --> 00:20:00.160
It sounds like you're
getting strangled.

00:20:00.160 --> 00:20:03.520
But it's sort of short.

00:20:03.520 --> 00:20:05.440
I assume it's short.

00:20:05.440 --> 00:20:07.300
Nobody ever told me this.

00:20:07.300 --> 00:20:10.210
I assume it's
short for argument.

00:20:10.210 --> 00:20:15.160
The word argument is a kind of
long word for the value of x.

00:20:15.160 --> 00:20:18.850
If I have a function
f of x, f, I

00:20:18.850 --> 00:20:23.770
call it function and x is the
argument of that function.

00:20:23.770 --> 00:20:27.430
You might more often
see the word variable.

00:20:27.430 --> 00:20:31.240
But argument-- and I'm assuming
that's what that refers to,

00:20:31.240 --> 00:20:35.430
it's the argument that
minimizes the function.

00:20:35.430 --> 00:20:37.180
OK, good.

00:20:37.180 --> 00:20:41.090
And here it is, S inverse a.

00:20:41.090 --> 00:20:43.180
Now but just by the
way, what is f min?

00:20:43.180 --> 00:20:45.730
Do you know the
minimum of a quadratic?

00:20:45.730 --> 00:20:49.750
I mean, this is the fundamental
minimization question,

00:20:49.750 --> 00:20:52.660
to minimize a quadratic.

00:20:52.660 --> 00:20:56.410
Electrical engineering, a
quadratic regulator problem

00:20:56.410 --> 00:20:58.280
is the simplest problem there.

00:20:58.280 --> 00:20:59.920
There could be constraints.

00:20:59.920 --> 00:21:03.070
And we'll see it with
constraints included.

00:21:03.070 --> 00:21:06.260
But right now, no
constraints at all.

00:21:06.260 --> 00:21:08.560
We're just looking at
the function f of x.

00:21:11.480 --> 00:21:15.040
Let me remove the
b, because that just

00:21:15.040 --> 00:21:18.130
shifts the function by b.

00:21:18.130 --> 00:21:22.710
If I erase that, just
to say it didn't matter.

00:21:22.710 --> 00:21:25.000
It's really that function.

00:21:25.000 --> 00:21:28.030
So that function
actually goes through 0.

00:21:28.030 --> 00:21:32.290
As it is, when x is
0, we obviously get 0.

00:21:32.290 --> 00:21:35.950
But it's still on its
way down, so to speak.

00:21:35.950 --> 00:21:40.090
It's on its way down to
this point, S inverse a.

00:21:40.090 --> 00:21:42.490
That's where it bottoms out.

00:21:42.490 --> 00:21:47.060
And when it bottoms out,
what do you get for f?

00:21:47.060 --> 00:21:49.660
One thing I know, it's
going to be negative

00:21:49.660 --> 00:21:53.620
because it passed through 0,
and it was on its way below 0.

00:21:53.620 --> 00:21:57.220
So let's just figure
out what that f min is.

00:21:57.220 --> 00:22:00.010
So I have a half.

00:22:00.010 --> 00:22:05.560
I'm just going to plug in S
inverse a, the bottom point

00:22:05.560 --> 00:22:11.860
into the function, and see
where the surface bottoms out

00:22:11.860 --> 00:22:15.700
and at what level
it bottoms out.

00:22:15.700 --> 00:22:17.200
So I have a half.

00:22:17.200 --> 00:22:23.320
So that's S inverse a is
a transpose S inverse.

00:22:23.320 --> 00:22:26.950
S symmetric, so I'll just
write this inverse transpose.

00:22:26.950 --> 00:22:33.520
S, S inverse a from
the quadratic term,

00:22:33.520 --> 00:22:37.770
minus a transpose.

00:22:37.770 --> 00:22:40.030
And x is S inverse a.

00:22:40.030 --> 00:22:42.580
Have you done this calculation?

00:22:42.580 --> 00:22:46.240
It just doesn't
hurt to repeat it.

00:22:46.240 --> 00:22:53.530
So I've plugged in S inverse
a there, there, and there.

00:22:53.530 --> 00:22:55.060
OK, what have I got?

00:22:55.060 --> 00:22:58.630
Well, S inverse
cancels S. So I have

00:22:58.630 --> 00:23:02.310
a half of a transpose
S inverse a minus 1

00:23:02.310 --> 00:23:04.150
of a transpose S inverse a.

00:23:04.150 --> 00:23:08.350
So I get finally
negative a half.

00:23:08.350 --> 00:23:15.850
Half of it minus one of it
of a transpose S inverse a.

00:23:15.850 --> 00:23:19.480
Sorry, that's not brilliant
use of the blackboard

00:23:19.480 --> 00:23:21.370
to squeeze that in there.

00:23:21.370 --> 00:23:26.380
But that's easily repeatable.
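
NOTE
A minimal sketch of the calculation above for f(x) = 1/2 x transpose S x minus a transpose x, assuming a 2 by 2 positive definite S. The particular S and a, and the helper names, are made up for illustration.
```python
def mat_vec(S, x):
    return (S[0][0] * x[0] + S[0][1] * x[1], S[1][0] * x[0] + S[1][1] * x[1])
def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]
def f(S, a, x):
    # Quadratic with a linear term: 1/2 x'Sx - a'x.
    return 0.5 * dot(x, mat_vec(S, x)) - dot(a, x)
def gradient(S, a, x):
    # Gradient of f is Sx - a.
    Sx = mat_vec(S, x)
    return (Sx[0] - a[0], Sx[1] - a[1])
def solve_2x2(S, a):
    # Solve Sx = a by the explicit 2 by 2 inverse.
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    return ((S[1][1] * a[0] - S[0][1] * a[1]) / det, (S[0][0] * a[1] - S[1][0] * a[0]) / det)
S = ((2.0, 1.0), (1.0, 3.0))  # assumed symmetric positive definite example
a = (1.0, 2.0)  # assumed linear term
x_star = solve_2x2(S, a)  # arg min: the point where the minimum is reached
f_min = f(S, a, x_star)  # the minimum value
closed_form = -0.5 * dot(a, x_star)  # -1/2 a' S^{-1} a
print(x_star, f_min, closed_form)
```
The sketch keeps x_star (the arg min) and f_min (the minimum value) as separate names, since the lecture stresses that they are two different things.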

00:23:26.380 --> 00:23:29.770
OK, good.

00:23:29.770 --> 00:23:34.560
So that's what a quadratic bowl,
a perfect quadratic problem

00:23:34.560 --> 00:23:40.390
minimizes to. That's
its lowest level.

00:23:40.390 --> 00:23:45.390
Ooh, I wanted to mention
one other function,

00:23:45.390 --> 00:23:48.480
because I'm going to speak
mostly about quadratics,

00:23:48.480 --> 00:23:51.150
but obviously,
the whole point is

00:23:51.150 --> 00:23:56.520
that it's the convexity that's
really making things work.

00:23:56.520 --> 00:24:07.190
So here, let me just put here,
a remarkable convex function.

00:24:11.800 --> 00:24:20.690
And the notes tell what's the
gradient of this function.

00:24:20.690 --> 00:24:24.550
They don't actually go
as far as the Hessian.

00:24:24.550 --> 00:24:32.780
Proving that this function I'm
going to write down is convex,

00:24:32.780 --> 00:24:34.720
it takes a little thinking.

00:24:34.720 --> 00:24:37.810
But it's a fantastic function.

00:24:37.810 --> 00:24:41.922
You would never
sort of imagine it

00:24:41.922 --> 00:24:44.110
if you didn't see it sometime.

00:24:44.110 --> 00:24:48.580
So it's going to be a function
of a matrix, a function of--

00:24:48.580 --> 00:24:58.630
those are n squared
variables, x, i, j.

00:24:58.630 --> 00:25:01.140
So it's a function
of many variables.

00:25:01.140 --> 00:25:03.220
And here is this function.

00:25:03.220 --> 00:25:07.300
It's you take the
determinant of the matrix.

00:25:07.300 --> 00:25:11.010
That's clearly a function of
all the n squared variables.

00:25:11.010 --> 00:25:15.810
Then you take the log
of the determinant

00:25:15.810 --> 00:25:21.840
and put in a minus sign
because we want convex.

00:25:21.840 --> 00:25:24.660
That turns out to be
a convex function.

00:25:24.660 --> 00:25:29.250
And even to just check that
for 2 by 2 well, for 2 by 2

00:25:29.250 --> 00:25:32.190
you have four variables,
because it's a 2 by 2 matrix.

00:25:32.190 --> 00:25:35.160
We could maybe check it
for a symmetric matrix.

00:25:35.160 --> 00:25:37.170
That would move it down to
three variables.

00:25:37.170 --> 00:25:45.540
But I'd be glad anybody
who's ambitious to see

00:25:45.540 --> 00:25:51.450
why that log determinant
is a remarkable function.

00:25:51.450 --> 00:25:52.650
And let me see.

00:25:56.040 --> 00:26:01.860
So the gradient of that
thing is also amazing.

00:26:01.860 --> 00:26:06.120
The gradient of that function--

00:26:06.120 --> 00:26:11.610
I'm going to peek so I don't
write the wrong fact here.

00:26:15.780 --> 00:26:19.800
So the partial derivatives
of that function

00:26:19.800 --> 00:26:23.190
are the entries of--

00:26:23.190 --> 00:26:26.220
these are the entries
of a, a inverse.

00:26:26.220 --> 00:26:27.960
That's the-- of x inverse.

00:26:38.360 --> 00:26:39.880
That's like, wow.

00:26:39.880 --> 00:26:42.130
Where did that come from?

00:26:42.130 --> 00:26:45.410
It might be minus the
entries, of course.

00:26:45.410 --> 00:26:46.930
Yeah, yeah, yeah.

00:26:46.930 --> 00:26:53.240
So we've got n
squared function--

00:26:53.240 --> 00:26:56.560
what is a typical
entry in x inverse?

00:26:56.560 --> 00:27:02.090
What is a typical
x inverse i, j?

00:27:02.090 --> 00:27:05.890
Just to remember
that bit of pretty

00:27:05.890 --> 00:27:09.910
old fashioned linear
algebra, the entries

00:27:09.910 --> 00:27:14.980
of the inverse matrix,
I'm sure to divide by what?

00:27:14.980 --> 00:27:17.200
The determinant, that's
the one thing we know.

00:27:21.720 --> 00:27:24.270
And that's the reason
we take the log,

00:27:24.270 --> 00:27:27.840
because when you take
derivatives of a log,

00:27:27.840 --> 00:27:31.680
that will put determinant
of x in the denominator.

00:27:31.680 --> 00:27:33.990
And then the numerator
will be the derivatives

00:27:33.990 --> 00:27:36.160
of the determinant of x.

00:27:36.160 --> 00:27:36.660
Oh!

00:27:36.660 --> 00:27:41.640
Can we get any idea what are the
derivatives of the determinant?

00:27:41.640 --> 00:27:43.596
Oh my god.

00:27:43.596 --> 00:27:46.410
How did I never get into this?

00:27:46.410 --> 00:27:50.090
So are you with me so far?

00:27:50.090 --> 00:27:54.350
This is going to be
derivatives of determinant,

00:27:54.350 --> 00:27:58.020
with respect to all
these variables, divided

00:27:58.020 --> 00:28:02.130
by the determinant, because
that's what the log achieved.

00:28:02.130 --> 00:28:04.560
So when I take the derivative
of the log of something,

00:28:04.560 --> 00:28:12.060
the chain rule says take the
derivative of that something

00:28:12.060 --> 00:28:15.900
divide by the function
determinant of x.
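
NOTE
A hedged numerical check of the gradient claim for the function minus log determinant, assuming a 2 by 2 matrix with positive determinant. The sample matrix, the step h, and the helper names are illustrative assumptions; the check compares centered differences with minus the transposed inverse.
```python
import math
def det2(X):
    return X[0][0] * X[1][1] - X[0][1] * X[1][0]
def neg_log_det(X):
    return -math.log(det2(X))
def inverse2(X):
    d = det2(X)
    return [[X[1][1] / d, -X[0][1] / d], [-X[1][0] / d, X[0][0] / d]]
def numerical_gradient(X, h=1e-6):
    # Centered difference in each of the four entries, one at a time.
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(2):
        for j in range(2):
            plus = [row[:] for row in X]
            minus = [row[:] for row in X]
            plus[i][j] += h
            minus[i][j] -= h
            grad[i][j] = (neg_log_det(plus) - neg_log_det(minus)) / (2.0 * h)
    return grad
X = [[2.0, 1.0], [0.5, 3.0]]  # assumed example with positive determinant
numeric = numerical_gradient(X)
X_inv = inverse2(X)
analytic = [[-X_inv[j][i] for j in range(2)] for i in range(2)]  # minus the transposed inverse
max_gap = max(abs(numeric[i][j] - analytic[i][j]) for i in range(2) for j in range(2))
print(max_gap)
```
For this function the partial derivatives come out as minus the entries of the inverse (transposed), which matches the remark that the sign might be minus.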

00:28:15.900 --> 00:28:20.710
So what's the derivative of
the determinant of a matrix

00:28:20.710 --> 00:28:22.510
with respect to its 1, 1 entry?

00:28:22.510 --> 00:28:23.010
Yeah, sure.

00:28:23.010 --> 00:28:24.960
This is crazy.

00:28:24.960 --> 00:28:26.490
But it's crazy to be doing this.

00:28:26.490 --> 00:28:28.000
But it's healthy.

00:28:28.000 --> 00:28:28.500
OK.

00:28:31.960 --> 00:28:38.111
So I have a matrix X, da, da,
da, x, 1, 1, x, 1, n,

00:28:38.111 --> 00:28:43.400
et cetera, x, n, 1, x, n, n.

00:28:43.400 --> 00:28:45.050
OK.

00:28:45.050 --> 00:28:46.440
And what am I looking for?

00:28:46.440 --> 00:28:52.160
I'm looking for that for
the derivatives of the--

00:28:52.160 --> 00:28:55.630
do I want the derivatives
of the determinant?

00:28:55.630 --> 00:28:57.550
Yes.

00:28:57.550 --> 00:29:05.470
So what's the derivative
of the determinant of x with respect

00:29:05.470 --> 00:29:10.100
to the first entry equals what?

00:29:13.780 --> 00:29:15.950
How can I figure out?

00:29:15.950 --> 00:29:17.810
So what's this asking me to do?

00:29:17.810 --> 00:29:22.790
It's asking me to change x,
1, 1 by delta x and see what's

00:29:22.790 --> 00:29:25.980
the change in the determinant.

00:29:25.980 --> 00:29:28.220
That's what derivatives are.

00:29:28.220 --> 00:29:31.010
Change x, 1, 1 a little bit.

00:29:31.010 --> 00:29:32.615
How much did the
determinant change?

00:29:36.150 --> 00:29:39.060
What has the determinant
of the whole matrix

00:29:39.060 --> 00:29:42.850
got to do with x, 1, 1?

00:29:42.850 --> 00:29:47.270
You remember that there is
a formula for determinants.

00:29:47.270 --> 00:29:49.160
So I need that fact.

00:29:49.160 --> 00:29:55.600
The determinant of x is
x, 1, 1 times something.

00:29:55.600 --> 00:29:58.510
Is that something that
I really want to know?

00:29:58.510 --> 00:30:01.870
Plus x, 1, 2 times
other something plus

00:30:01.870 --> 00:30:06.348
say, along the first row
times another something.

00:30:09.340 --> 00:30:15.970
What are these
factors that multiply

00:30:15.970 --> 00:30:19.790
the x's to give the determinant?

00:30:19.790 --> 00:30:22.520
What [INAUDIBLE] a
linear combination

00:30:22.520 --> 00:30:27.340
of the first row times certain
factors gives the determinant?

00:30:27.340 --> 00:30:30.520
And how do I know that
there will be such factors,

00:30:30.520 --> 00:30:33.160
because the fundamental
property of the determinant

00:30:33.160 --> 00:30:39.280
is that it's linear in row 1 if
I don't mess with other rows.

00:30:39.280 --> 00:30:43.240
It's a linear function of row 1.

00:30:43.240 --> 00:30:46.510
So it has a form x,
1, 1 times something.

00:30:46.510 --> 00:30:48.284
And what is something?

00:30:48.284 --> 00:30:49.201
AUDIENCE: [INAUDIBLE].

00:30:49.201 --> 00:30:52.300
GILBERT STRANG: The
determinant of this.

00:30:52.300 --> 00:30:56.560
So what does x, 1, 1 multiply
when you compute determinants?

00:30:56.560 --> 00:31:00.280
X, 1, 1 will not multiply
any other guys in its row,

00:31:00.280 --> 00:31:02.920
because you're never
multiplying two

00:31:02.920 --> 00:31:06.280
x's in the same row
or the same column.

00:31:06.280 --> 00:31:10.210
What x, 1, 1 is
multiplying is all these guys.

00:31:10.210 --> 00:31:15.040
And in fact, it turns out
to be the determinant.

00:31:15.040 --> 00:31:17.180
And what is this called?

00:31:17.180 --> 00:31:22.930
That one smaller determinant
that I get by throwing away

00:31:22.930 --> 00:31:24.970
the first row and first column?

00:31:24.970 --> 00:31:27.710
It's called a--

00:31:27.710 --> 00:31:28.880
Minor is good.

00:31:28.880 --> 00:31:30.860
Yes, minor is good.

00:31:30.860 --> 00:31:33.650
I was saying there are two
words that can be used,

00:31:33.650 --> 00:31:36.890
minor and co-factor.

00:31:42.860 --> 00:31:43.560
Yeah.

00:31:43.560 --> 00:31:44.740
And what is it?

00:31:44.740 --> 00:31:46.050
I mean, how do I compute it?

00:31:46.050 --> 00:31:47.367
What is the number?

00:31:47.367 --> 00:31:48.075
This is a number.

00:31:51.180 --> 00:31:52.110
It's just a number.

00:31:56.880 --> 00:32:01.090
Maybe I think of the minor
as this determinant--

00:32:01.090 --> 00:32:01.750
Ah!

00:32:01.750 --> 00:32:03.480
Let me cancel that.

00:32:03.480 --> 00:32:05.820
Maybe I think of the
minor as this smaller

00:32:05.820 --> 00:32:08.790
matrix, and the
co-factor, which is

00:32:08.790 --> 00:32:10.425
the determinant of the minor.

00:32:15.180 --> 00:32:16.890
And there is a plus or minus.

00:32:16.890 --> 00:32:20.250
Everything about
determinants, there's

00:32:20.250 --> 00:32:23.430
a plus or
minus choice to be made.

00:32:23.430 --> 00:32:27.600
And we're not going
to worry about that.

00:32:27.600 --> 00:32:33.325
But so anyway, so
it's the co-factor.

00:32:33.325 --> 00:32:35.300
Let me call it C, 1, 1.

00:32:37.950 --> 00:32:42.690
And so that's the formula
for a determinant.

00:32:42.690 --> 00:32:46.842
That's the co-factor
expansion of a determinant.

00:32:54.230 --> 00:32:56.100
OK.

00:32:56.100 --> 00:32:59.400
And that will connect
back to this amazing fact

00:32:59.400 --> 00:33:02.790
that the gradient is the
entries of x inverse,

00:33:02.790 --> 00:33:07.720
because the inverse is the ratio
of co-factor to determinant.

00:33:07.720 --> 00:33:15.772
So x inverse 1, 1 is that
co-factor over the determinant.
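
NOTE
Editor's sketch, not part of the lecture: a small numerical check of the two facts just stated. The 3 by 3 matrix X and the step h are made up for illustration. The code confirms that the derivative of det X with respect to x11 is the cofactor C11, and that C11 over det X (the 1,1 entry of X inverse) is also the derivative of log(det X) with respect to x11.
```python
import math
# Determinant by cofactor expansion along the first row (fine for tiny matrices).
def det(m):
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j in range(len(m)):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * m[0][j] * det(minor)
    return total
# Hypothetical example matrix (not from the lecture).
X = [[4.0, 1.0, 2.0], [1.0, 3.0, 0.0], [2.0, 0.0, 5.0]]
# Cofactor C11: delete row 1 and column 1, take the determinant (sign is +).
c11 = det([row[1:] for row in X[1:]])
# Change x11 a little bit and see how much the determinant changes.
h = 1e-6
X_plus = [row[:] for row in X]
X_plus[0][0] += h
finite_diff = (det(X_plus) - det(X)) / h
# Entry (1,1) of X inverse is cofactor over determinant.
inv11 = c11 / det(X)
# The same number is the derivative of log(det X) with respect to x11.
log_diff = (math.log(det(X_plus)) - math.log(det(X))) / h
```
Because the determinant is linear in x11, finite_diff agrees with c11 up to rounding, while log_diff agrees with inv11 only up to the finite-difference error.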

00:33:18.670 --> 00:33:20.190
Yeah.

00:33:20.190 --> 00:33:22.530
So that's where
this all comes from.

00:33:22.530 --> 00:33:32.670
Anyway, I'm just mentioning that
as a very interesting example

00:33:32.670 --> 00:33:35.820
of a convex function.

00:33:35.820 --> 00:33:37.270
OK.

00:33:37.270 --> 00:33:37.950
I'll leave that.

00:33:37.950 --> 00:33:41.740
That's just for like, education.

00:33:41.740 --> 00:33:43.080
OK.

00:33:43.080 --> 00:33:48.510
Now I'm ready to go to
work on gradient descent.

00:33:48.510 --> 00:33:52.260
So actually, the rest of
this class and Friday's class

00:33:52.260 --> 00:33:59.310
about gradient descent are very
fundamental parts of 18.065.

00:33:59.310 --> 00:34:01.750
And that will be
one of our examples.

00:34:01.750 --> 00:34:06.650
And then the general case here.

00:34:06.650 --> 00:34:11.040
So I'm using this.

00:34:11.040 --> 00:34:13.670
It would be interesting
to minimize that thing,

00:34:13.670 --> 00:34:15.409
but we're not going there.

00:34:15.409 --> 00:34:20.480
Let's hide it, so we
don't see it again.

00:34:20.480 --> 00:34:23.030
And I'll work with that example.

00:34:26.429 --> 00:34:28.610
So here's gradient descent.

00:34:37.770 --> 00:34:45.030
xk plus 1 is xk
minus Sk, the step size,

00:34:45.030 --> 00:34:47.760
times the gradient of f at xk.

00:34:52.922 --> 00:34:56.080
So the only thing
left that requires

00:34:56.080 --> 00:35:01.570
us to input some decision making
is a step size, the learning

00:35:01.570 --> 00:35:03.100
rate.

00:35:03.100 --> 00:35:06.520
We can take it as constant.

00:35:06.520 --> 00:35:09.170
If we take too big
a learning rate,

00:35:09.170 --> 00:35:12.130
the thing will oscillate
all over the place

00:35:12.130 --> 00:35:16.130
and it's a disaster.

00:35:16.130 --> 00:35:19.520
If we take too small a
learning rate, too small steps,

00:35:19.520 --> 00:35:22.600
what's the matter with that?

00:35:22.600 --> 00:35:24.190
Takes too long.

00:35:24.190 --> 00:35:26.260
Takes too long.
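
NOTE
Editor's sketch, not part of the lecture: fixed-step gradient descent on the quadratic f(x, y) = x^2 + b*y^2. The value b = 0.1, the start point (b, 1), the 50 iterations, and the three step sizes are all illustrative assumptions chosen to show the three behaviors described above.
```python
def f(x, y, b):
    return x * x + b * y * y
def descend(step, b=0.1, iters=50):
    # Run x_{k+1} = x_k - step * grad f(x_k) with a constant step size.
    x, y = b, 1.0
    for _ in range(iters):
        gx, gy = 2.0 * x, 2.0 * b * y
        x, y = x - step * gx, y - step * gy
    return f(x, y, b)
f_start = f(0.1, 1.0, 0.1)
f_too_big = descend(step=1.1)      # overshoots: x flips sign and grows each step
f_too_small = descend(step=0.001)  # stable, but barely moves in 50 steps
f_moderate = descend(step=0.5)     # in between: steady decrease
```
With these numbers the large step ends far above the starting value, the tiny step ends within a few percent of it, and the moderate step reduces f by several orders of magnitude.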

00:35:26.260 --> 00:35:30.400
So the problem is to
get it just right.

00:35:30.400 --> 00:35:32.560
And one way that you
could say get it right

00:35:32.560 --> 00:35:37.030
would be to think of optimize.

00:35:37.030 --> 00:35:38.920
Choose the optimal Sk.

00:35:38.920 --> 00:35:43.450
Of course, that takes longer
than just deciding an Sk

00:35:43.450 --> 00:35:46.370
in advance, which
is what people do.

00:35:46.370 --> 00:35:51.760
So I'll tell you what people
do on really big problems:

00:35:51.760 --> 00:35:53.160
take an Sk--

00:35:53.160 --> 00:35:57.520
estimate a suitable Sk, and
then go with it for a while.

00:35:57.520 --> 00:36:02.830
And then look back to
see if it was too big,

00:36:02.830 --> 00:36:05.310
they'll see oscillations.

00:36:05.310 --> 00:36:09.220
It'll be bouncing
all over the place.

00:36:09.220 --> 00:36:13.525
Or of course, an
exact line search--

00:36:16.730 --> 00:36:19.090
so you see this
expression often.

00:36:19.090 --> 00:36:30.810
The exact line search chooses
Sk to make my function

00:36:30.810 --> 00:36:44.020
f at xk plus 1 a minimum on
the line, on the search line,

00:36:44.020 --> 00:36:48.235
a minimum in the
search direction.

00:36:54.175 --> 00:36:57.940
The search direction is
given by the gradient.

00:36:57.940 --> 00:36:59.770
That's the direction
we're moving.

00:36:59.770 --> 00:37:02.260
This is the distance
we're moving,

00:37:02.260 --> 00:37:05.440
or measure of the
distance we're moving.

00:37:05.440 --> 00:37:09.580
And an exact search would
be to go along there.

00:37:09.580 --> 00:37:14.110
If I have a convex function,
then as I move along this line,

00:37:14.110 --> 00:37:19.350
as I increase Sk, I'll see
the function start down,

00:37:19.350 --> 00:37:25.380
because the gradient,
negative gradient means down.

00:37:25.380 --> 00:37:28.080
But at some point
it'll turn up again.

00:37:28.080 --> 00:37:33.220
And an exact line search would
find that point and stop there.
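
NOTE
Editor's sketch, not part of the lecture: one way to carry out an exact line search numerically. It uses a ternary search on phi(s) = f(xk - s * grad f(xk)), which assumes phi goes down and then up exactly once on the bracket [0, s_max]. The helper name exact_step, the bracket s_max, and the tolerance are hypothetical choices. The test function is f = x^2 + b*y^2 with b = 0.1, starting at (b, 1).
```python
def exact_step(phi, s_max=10.0, tol=1e-9):
    # Shrink the bracket [lo, hi] around the minimizer of a convex phi.
    lo, hi = 0.0, s_max
    while hi - lo > tol:
        a = lo + (hi - lo) / 3.0
        c = hi - (hi - lo) / 3.0
        if phi(a) < phi(c):
            hi = c   # the turning point lies in [lo, c]
        else:
            lo = a   # the turning point lies in [a, hi]
    return 0.5 * (lo + hi)
b = 0.1
x0, y0 = b, 1.0
gx, gy = 2.0 * x0, 2.0 * b * y0   # gradient of f at the start point
def phi(s):
    # f along the search line, as a function of the step size s
    return (x0 - s * gx) ** 2 + b * (y0 - s * gy) ** 2
s_best = exact_step(phi)
```
For this quadratic, setting the derivative of phi to zero by hand gives s = 1/(1 + b), which the search reproduces to within the limits of floating-point comparison.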

00:37:36.310 --> 00:37:38.860
That doesn't mean we would--

00:37:38.860 --> 00:37:40.600
we will see in
this example where

00:37:40.600 --> 00:37:46.960
we will do exact line searches
that for a small value of b,

00:37:46.960 --> 00:37:51.790
it's extremely slow, that
the condition number controls

00:37:51.790 --> 00:37:52.660
the speed.

00:37:52.660 --> 00:37:55.330
That's really what
my message will

00:37:55.330 --> 00:37:59.050
be just in these last
minutes and next time

00:37:59.050 --> 00:38:03.340
the sort of key lecture
on gradient descent.

00:38:03.340 --> 00:38:06.670
So an exact line
search would be that.

00:38:06.670 --> 00:38:09.070
So what a backtracking
line search--

00:38:15.880 --> 00:38:24.670
backtracking would be
take a fixed S like one.

00:38:24.670 --> 00:38:32.290
And then be prepared
to come backwards.

00:38:32.290 --> 00:38:34.060
Cut back by half.

00:38:34.060 --> 00:38:36.250
See what you get at that point.

00:38:36.250 --> 00:38:40.180
Cut back by half of that to a
quarter of the original step.

00:38:40.180 --> 00:38:41.200
See what that is.

00:38:44.650 --> 00:38:48.970
So the full step might
have taken you back

00:38:48.970 --> 00:38:52.450
to the upward sweep.

00:38:52.450 --> 00:38:55.420
Halfway forward it might
still be on the upward sweep.

00:38:55.420 --> 00:39:00.760
Might be too much, but so
backtracking cuts the step size

00:39:00.760 --> 00:39:04.840
in pieces and checks until it--

00:39:08.440 --> 00:39:13.180
So S0, half of
S0, quarter of S0,

00:39:13.180 --> 00:39:18.250
or obviously, a different
parameter, aS0, a squared S0,

00:39:18.250 --> 00:39:25.720
and so on until you're
satisfied with that step.
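
NOTE
Editor's sketch, not part of the lecture: backtracking with steps s0, a*s0, a^2*s0, and so on. The lecture only says to stop when "you're satisfied"; the sufficient-decrease test with the constant c used here (the Armijo condition) is an assumed stopping rule, and the function name backtrack and the choice s0 = 4 are hypothetical. The test function is f = x^2 + b*y^2 with b = 0.1, starting at (b, 1).
```python
def backtrack(f, x, grad, s0=1.0, a=0.5, c=1e-4):
    # Cut the step by the factor a until f decreases by at least c * s * |grad|^2.
    g2 = sum(g * g for g in grad)
    fx = f(x)
    s = s0
    while f([xi - s * gi for xi, gi in zip(x, grad)]) > fx - c * s * g2:
        s *= a
    return s
b = 0.1
def f(p):
    return p[0] ** 2 + b * p[1] ** 2
x = [b, 1.0]
grad = [2.0 * x[0], 2.0 * b * x[1]]
# Start deliberately too large so that the step has to be cut back.
s = backtrack(f, x, grad, s0=4.0)
x_new = [xi - s * gi for xi, gi in zip(x, grad)]
```
Here the full step s0 = 4 and the half step 2 both land on the upward sweep and are rejected; the quarter step 1 is accepted.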

00:39:25.720 --> 00:39:28.070
And there are of course,
many, many refinements.

00:39:28.070 --> 00:39:31.810
We're talking about
the big algorithm

00:39:31.810 --> 00:39:40.260
here that everybody has,
depending on their function,

00:39:40.260 --> 00:39:44.250
has different experiences with.

00:39:44.250 --> 00:39:46.670
So here's my
fundamental question.

00:39:50.580 --> 00:39:53.610
Let's think of an
exact line search.

00:39:53.610 --> 00:39:57.700
How much does that
reduce the function?

00:39:57.700 --> 00:40:00.400
How much does that
reduce the function?

00:40:00.400 --> 00:40:05.380
So that's really what the
bounds that I want are.

00:40:05.380 --> 00:40:08.440
How much does that
reduce the function?

00:40:08.440 --> 00:40:24.320
And we'll see that the reduction
involves the condition number,

00:40:24.320 --> 00:40:32.730
m over M. So why don't I
turn to the example first?

00:40:32.730 --> 00:40:37.260
And then where we
know exact answers.

00:40:37.260 --> 00:40:39.980
That gives us a
basis for comparison.

00:40:39.980 --> 00:40:46.150
And then our math
goal is prove--

00:40:46.150 --> 00:40:50.050
get estimated bounds
on the size of f

00:40:50.050 --> 00:40:55.330
that match what we see
exactly in that example

00:40:55.330 --> 00:40:58.120
where we know everything.

00:40:58.120 --> 00:41:01.510
We know the gradient.

00:41:01.510 --> 00:41:03.140
We know the Hessian.

00:41:03.140 --> 00:41:04.090
It's that matrix.

00:41:04.090 --> 00:41:05.650
We know the condition number.

00:41:05.650 --> 00:41:08.440
So what happens if
I start at a point

00:41:08.440 --> 00:41:15.105
x0 y0 that's on my surface?

00:41:19.110 --> 00:41:20.230
Sorry.

00:41:20.230 --> 00:41:22.710
What do I want to do here?

00:41:22.710 --> 00:41:23.250
Yeah.

00:41:23.250 --> 00:41:31.080
I take a point, x0
y0 and I iterate.

00:41:34.350 --> 00:41:54.040
So the new xy k plus
1 is xyk minus the S,

00:41:54.040 --> 00:41:56.940
which I can compute
times the gradient of f.

00:41:56.940 --> 00:41:58.710
So I'm going to
put in gradient f.

00:41:58.710 --> 00:42:00.030
What is the gradient here?

00:42:02.790 --> 00:42:05.790
The derivative
with respect to x.

00:42:05.790 --> 00:42:11.970
So I have 2xk and 2b yk.

00:42:16.630 --> 00:42:18.244
And this is the step size.

00:42:22.120 --> 00:42:25.450
And for this small
problem where we're

00:42:25.450 --> 00:42:27.940
going to get such
a revealing answer,

00:42:27.940 --> 00:42:29.860
I'm going to choose
exact line search.

00:42:29.860 --> 00:42:31.240
I'm going to choose the best Sk.

00:42:34.040 --> 00:42:35.240
And what's the answer?

00:42:35.240 --> 00:42:39.500
So I just want to tell you
what the iterations are

00:42:39.500 --> 00:42:43.520
for that particular
function starting at x0 y0.

00:42:46.080 --> 00:42:51.460
So let me put start x0 y0.

00:42:54.810 --> 00:42:56.790
And I haven't done this
calculation myself.

00:42:56.790 --> 00:43:01.470
It's taken from the book by
Stephen Boyd and Vandenberghe

00:43:01.470 --> 00:43:03.240
called Convex Optimization.

00:43:03.240 --> 00:43:06.010
Of course, they weren't the
first to do this either.

00:43:06.010 --> 00:43:11.580
But I'm happy to mention that
book Convex Optimization.

00:43:11.580 --> 00:43:14.160
And Stephen Boyd will be
on campus this spring

00:43:14.160 --> 00:43:18.180
actually, in April
for three lectures.

00:43:18.180 --> 00:43:20.010
This is April, maybe.

00:43:20.010 --> 00:43:21.010
Yeah, OK.

00:43:21.010 --> 00:43:24.400
So it's this month in
two or three weeks.

00:43:24.400 --> 00:43:26.470
And I'll tell you about that.

00:43:26.470 --> 00:43:34.820
So here are the xk's and the
yk's and the f and the function

00:43:34.820 --> 00:43:35.320
values.

00:43:40.190 --> 00:43:41.400
So where am I going to start?

00:43:44.840 --> 00:43:45.440
Yeah.

00:43:45.440 --> 00:43:50.480
So I'm starting from the
point (x0, y0) equal (b, 1).

00:43:50.480 --> 00:43:54.110
Turns out that will make our
formulas very convenient,

00:43:54.110 --> 00:43:57.500
(x0, y0) equals (b, 1).

00:43:57.500 --> 00:43:58.340
Good.

00:43:58.340 --> 00:44:00.530
So OK.

00:44:00.530 --> 00:44:09.260
So xk is b times the key
ratio b minus 1 over b plus 1

00:44:09.260 --> 00:44:11.420
to the kth power.

00:44:11.420 --> 00:44:15.335
And yk happens to be--

00:44:20.270 --> 00:44:24.020
it has this same ratio.

00:44:24.020 --> 00:44:29.600
And my function f has
the same ratio too.

00:44:29.600 --> 00:44:30.815
This is fk.

00:44:30.815 --> 00:44:34.010
It has that same
ratio 1 minus b over 1

00:44:34.010 --> 00:44:39.710
plus b to the 2kth power times f0.

00:44:39.710 --> 00:44:51.160
That's the beautiful
formula that we're

00:44:51.160 --> 00:44:54.450
going to take as the
best example possible.

00:44:54.450 --> 00:44:55.160
Let's just see.

00:44:55.160 --> 00:45:04.800
If k equals 0, I have xk equal
b, yk equal 1, starting at (b, 1).

00:45:04.800 --> 00:45:09.690
And that tells me the rate
of decrease of the function.

00:45:09.690 --> 00:45:11.680
It's this same ratio.

00:45:11.680 --> 00:45:14.730
So what am I learning
from this example?

00:45:14.730 --> 00:45:20.365
What's jumping out is that this
ratio 1 minus b over 1 plus b

00:45:20.365 --> 00:45:20.865
is crucial.

00:45:25.920 --> 00:45:29.500
If b is near 1,
that ratio is small.

00:45:29.500 --> 00:45:32.870
If b is near 1,
that's near 0 over 2.

00:45:32.870 --> 00:45:36.070
And I converge quickly,
no problem at all.

00:45:36.070 --> 00:45:42.490
But if b is near 0, if my
condition number is bad--

00:45:42.490 --> 00:45:51.430
so the bad case, the
hard case is small b.

00:45:55.200 --> 00:46:01.300
Of course, when b is small,
that ratio is very near 1.

00:46:01.300 --> 00:46:02.590
It's below 1.

00:46:02.590 --> 00:46:06.220
The ratio is below 1, so
I'm getting convergence.

00:46:06.220 --> 00:46:07.360
I do get convergence.

00:46:07.360 --> 00:46:09.460
I do go downhill.

00:46:09.460 --> 00:46:13.810
But what happens is I don't
go downhill very far until I'm

00:46:13.810 --> 00:46:15.910
headed back uphill again.

00:46:15.910 --> 00:46:20.720
So the picture to
draw for this--

00:46:20.720 --> 00:46:26.070
let me change that picture
to a picture in the xy

00:46:26.070 --> 00:46:29.400
plane of the level sets.

00:46:29.400 --> 00:46:33.870
So the picture really to
see is in the xy plane.

00:46:33.870 --> 00:46:37.395
The level sets f equal constant.

00:46:37.395 --> 00:46:38.940
That's what a level set is.

00:46:38.940 --> 00:46:43.570
It's a set of points, x and
y where f has the same value.

00:46:43.570 --> 00:46:46.510
And what do those look like?

00:46:46.510 --> 00:46:48.000
Oh, let's see.

00:46:50.920 --> 00:46:53.680
I think-- what do you think?

00:46:53.680 --> 00:46:59.860
What do the level sets look like
for this particular function?

00:46:59.860 --> 00:47:04.520
If I look at the curve x
squared plus b y squared equal

00:47:04.520 --> 00:47:07.240
a constant, that's
what the level set is.

00:47:07.240 --> 00:47:13.620
This is x squared plus by
squared equal a constant.

00:47:13.620 --> 00:47:16.402
What kind of a curve is that?

00:47:16.402 --> 00:47:17.330
AUDIENCE: [INAUDIBLE].

00:47:17.330 --> 00:47:19.470
GILBERT STRANG:
That's an ellipse.

00:47:19.470 --> 00:47:21.900
And what's up with that ellipse?

00:47:21.900 --> 00:47:24.750
What's the shape of it?

00:47:24.750 --> 00:47:27.960
Because there is no
xy term, that ellipse

00:47:27.960 --> 00:47:33.180
is like, well lined
up with the axes.

00:47:33.180 --> 00:47:37.770
The major axes of the ellipse
are in the x and y directions,

00:47:37.770 --> 00:47:42.150
because there is
no cross term here.

00:47:42.150 --> 00:47:46.020
We could always have
diagonalized our matrix

00:47:46.020 --> 00:47:47.623
if it wasn't diagonal.

00:47:47.623 --> 00:47:49.290
And that wouldn't
have changed anything.

00:47:49.290 --> 00:47:52.740
So it's just
rotating this space.

00:47:52.740 --> 00:47:54.090
And we've done that.

00:47:57.570 --> 00:47:59.130
What do the level
sets look like?

00:47:59.130 --> 00:48:00.870
They're ellipses.

00:48:00.870 --> 00:48:06.690
And suppose b is a small number,
then what's with the ellipses?

00:48:06.690 --> 00:48:10.530
If b is small, I
have to go pretty--

00:48:10.530 --> 00:48:14.070
I have to take a pretty
large y to match a--

00:48:14.070 --> 00:48:15.090
change an x.

00:48:15.090 --> 00:48:18.340
I think maybe they're
ellipses of that sort.

00:48:18.340 --> 00:48:18.840
Are they?

00:48:24.220 --> 00:48:26.780
They're lined up with the axes.

00:48:26.780 --> 00:48:30.610
And I hope I'm drawing
in the right direction.

00:48:30.610 --> 00:48:33.807
They're long and thin.

00:48:33.807 --> 00:48:34.390
Is that right?

00:48:34.390 --> 00:48:36.880
Because I would have
to take a pretty big y

00:48:36.880 --> 00:48:40.120
to make up for a small b.

00:48:40.120 --> 00:48:41.830
OK.

00:48:41.830 --> 00:48:44.140
So what happens
when I'm descending?

00:48:44.140 --> 00:48:45.910
This is a narrow valley then.

00:48:45.910 --> 00:48:52.240
Think of it as a valley
which comes down steeply

00:48:52.240 --> 00:48:54.730
in the y direction,
but in the x direction

00:48:54.730 --> 00:48:57.560
I'm crossing the valley slow--

00:48:57.560 --> 00:49:00.250
Oh, is that right?

00:49:00.250 --> 00:49:04.300
So what happens if I
take a point there?

00:49:04.300 --> 00:49:06.690
Oh yeah, I remember what to do.

00:49:06.690 --> 00:49:10.850
So let's start at that
point on that ellipse.

00:49:14.070 --> 00:49:17.490
And those were the level
sets f equal constant.

00:49:17.490 --> 00:49:20.980
So what's the first
search direction?

00:49:20.980 --> 00:49:23.320
What direction do
I move from x0 y0?

00:49:28.510 --> 00:49:31.210
Do I move along the ellipse?

00:49:31.210 --> 00:49:35.490
Absolutely not, because along
the ellipse f is constant.

00:49:35.490 --> 00:49:39.430
The gradient direction is
perpendicular to the ellipse.

00:49:39.430 --> 00:49:42.280
So I move perpendicular
to the ellipse.
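
NOTE
Editor's sketch, not part of the lecture: a numerical check that the gradient of f = x^2 + b*y^2 is perpendicular to the level set through a point. The values of b, the level c, and the angle t are arbitrary. The ellipse x^2 + b*y^2 = c is parametrized as x = sqrt(c) cos t, y = sqrt(c/b) sin t, and the tangent is the derivative of that parametrization with respect to t.
```python
import math
b, c, t = 0.1, 2.0, 0.7
# A point on the level set f = c.
x = math.sqrt(c) * math.cos(t)
y = math.sqrt(c / b) * math.sin(t)
# Tangent direction along the ellipse at that point.
tangent = (-math.sqrt(c) * math.sin(t), math.sqrt(c / b) * math.cos(t))
# Gradient of f at that point.
gradient = (2.0 * x, 2.0 * b * y)
# Perpendicular means the dot product is zero.
dot = tangent[0] * gradient[0] + tangent[1] * gradient[1]
```
The dot product vanishes up to rounding for any t, so moving along the ellipse never changes f to first order.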

00:49:42.280 --> 00:49:43.285
And when do I stop?

00:49:47.040 --> 00:49:50.930
Pretty soon, because very
soon I'm going back up again.

00:50:02.410 --> 00:50:04.120
I haven't practiced
with this curve.

00:50:04.120 --> 00:50:08.400
But I know-- and time
is up, thank God.

00:50:08.400 --> 00:50:10.780
So what do I know
is going to happen?

00:50:10.780 --> 00:50:13.780
And by Friday we'll
make it happen?

00:50:13.780 --> 00:50:22.840
So what do we see for the
curve, the track of the--

00:50:22.840 --> 00:50:24.776
it's-- say it?

00:50:24.776 --> 00:50:25.770
AUDIENCE: Zigzag.

00:50:25.770 --> 00:50:28.110
GILBERT STRANG:
It's a zigzag, yeah.

00:50:28.110 --> 00:50:31.110
We would like to get here, but
we're not aimed here at all.

00:50:31.110 --> 00:50:36.000
So we zig, zig, zig zag,
and very slowly approach

00:50:36.000 --> 00:50:36.540
that point.

00:50:39.210 --> 00:50:41.910
And how slowly?

00:50:41.910 --> 00:50:48.990
With that multiplier, 1
minus b over 1 plus b.

00:50:48.990 --> 00:50:51.000
That's what I'm learning
from this example,

00:50:51.000 --> 00:50:53.010
that that's a key number.

00:50:53.010 --> 00:50:56.760
And then you could ask, well,
what about general examples?

00:50:56.760 --> 00:51:01.470
This was one specially chosen
example with an exact solution.

00:51:01.470 --> 00:51:04.530
Well, we'll see at the
beginning of next time

00:51:04.530 --> 00:51:08.400
that for a convex
function this is typical.

00:51:08.400 --> 00:51:14.550
This is 1 minus b is the
critical quantity, or 1 over b,

00:51:14.550 --> 00:51:17.760
or how small
is b compared to 1?

00:51:17.760 --> 00:51:20.110
So that will be the
critical quantity.

00:51:20.110 --> 00:51:24.390
And we see it in this ratio
1 minus b over 1 plus b.

00:51:24.390 --> 00:51:30.210
So if b is 1/100, this
is 0.99 over 1.01.

00:51:30.210 --> 00:51:31.830
It's virtually 1.
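
NOTE
Editor's sketch, not part of the lecture: how the ratio (1 - b)/(1 + b) translates into an iteration count. The value 0.99 over 1.01 corresponds to b = 0.01. The target reduction 1e-6 and the other sample values of b are illustrative assumptions, and steps_needed is a hypothetical helper that counts how many factors of the ratio are needed to reach the target.
```python
import math
def ratio(b):
    return (1.0 - b) / (1.0 + b)
def steps_needed(b, target=1e-6):
    # Smallest k with ratio(b)**k <= target.
    return math.ceil(math.log(target) / math.log(ratio(b)))
for b in (0.5, 0.1, 0.01):
    print(b, round(ratio(b), 4), steps_needed(b))
```
The count grows roughly like 1/b as b shrinks, which is the sense in which the condition number controls the speed.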

00:51:31.830 --> 00:51:32.460
OK.

00:51:32.460 --> 00:51:36.780
So next time is a
sort of a key lecture

00:51:36.780 --> 00:51:43.380
to see what I've just
said, that this controls

00:51:43.380 --> 00:51:46.440
the convergence of
steepest descent,

00:51:46.440 --> 00:51:51.130
and then to see an
idea that speeds it up.

00:51:51.130 --> 00:51:54.660
That idea is called
momentum or heavy ball.

00:51:54.660 --> 00:52:02.820
So the physical idea is if you
had a heavy ball right there

00:52:02.820 --> 00:52:06.930
and wanted to get it down
the valley toward the bottom,

00:52:06.930 --> 00:52:10.650
you wouldn't go perpendicular
to the level sets.

00:52:10.650 --> 00:52:11.280
Not at all.

00:52:11.280 --> 00:52:13.680
You'd let the momentum
of the ball take over

00:52:13.680 --> 00:52:16.990
and let it roll down.

00:52:16.990 --> 00:52:21.500
So the idea of momentum is
to model the possibility

00:52:21.500 --> 00:52:26.240
of letting that heavy ball
roll instead of directing it

00:52:26.240 --> 00:52:30.380
by the steepest
descent at every point.

00:52:30.380 --> 00:52:34.280
So there's an extra term in
steepest descent, the momentum

00:52:34.280 --> 00:52:36.230
term that accelerates.

00:52:36.230 --> 00:52:36.860
OK.

00:52:36.860 --> 00:52:39.530
So Friday is the day.

00:52:39.530 --> 00:52:40.190
Good.

00:52:40.190 --> 00:52:42.130
See you then.