1
00:00:01,550 --> 00:00:03,920
The following content is
provided under a Creative
2
00:00:03,920 --> 00:00:05,310
Commons license.
3
00:00:05,310 --> 00:00:07,520
Your support will help
MIT OpenCourseWare
4
00:00:07,520 --> 00:00:11,610
continue to offer high-quality
educational resources for free.
5
00:00:11,610 --> 00:00:14,180
To make a donation or to
view additional materials
6
00:00:14,180 --> 00:00:18,140
from hundreds of MIT courses,
visit MIT OpenCourseWare
7
00:00:18,140 --> 00:00:19,026
at ocw.mit.edu.
8
00:00:22,870 --> 00:00:25,590
GILBERT STRANG: OK, here we go.
9
00:00:25,590 --> 00:00:29,250
All set, and two
topics for today--
10
00:00:29,250 --> 00:00:34,800
one is to go back to
Professor Sra's lecture.
11
00:00:34,800 --> 00:00:37,410
That was last Friday.
12
00:00:37,410 --> 00:00:41,110
And he promised a
theorem and proof.
13
00:00:41,110 --> 00:00:45,180
And this morning,
he sent it to me.
14
00:00:45,180 --> 00:00:51,660
So it's proving the convergence
of stochastic gradient descent.
15
00:00:51,660 --> 00:00:54,240
And really, what's
important, maybe,
16
00:00:54,240 --> 00:00:58,590
and useful is not so much
the details of the proof,
17
00:00:58,590 --> 00:01:03,700
which I'm just learning,
but the assumptions--
18
00:01:03,700 --> 00:01:05,580
what's the logic
here, what do you
19
00:01:05,580 --> 00:01:10,740
have to assume about
the gradient and about
20
00:01:10,740 --> 00:01:14,970
the algorithm to get the answer?
21
00:01:14,970 --> 00:01:22,920
But now I actually look back
at the video of his lecture.
22
00:01:22,920 --> 00:01:25,860
And it was excellent.
23
00:01:25,860 --> 00:01:29,970
And as I looked at it, there
were a couple of things
24
00:01:29,970 --> 00:01:33,420
later in the lecture
that I thought
25
00:01:33,420 --> 00:01:35,340
would make good projects.
26
00:01:35,340 --> 00:01:37,590
So I don't know if
anybody is still
27
00:01:37,590 --> 00:01:40,920
open to what to do on a project.
28
00:01:40,920 --> 00:01:44,220
But here are my two ideas.
29
00:01:44,220 --> 00:01:47,280
And if you've already
finished your project,
30
00:01:47,280 --> 00:01:53,640
well, you get an A-plus by
considering one of these.
31
00:01:53,640 --> 00:01:56,010
So you remember-- and
this will remind you
32
00:01:56,010 --> 00:01:59,170
of the lecture, which
is a good thing.
33
00:01:59,170 --> 00:02:03,630
So do you remember that
question 1 was whether,
34
00:02:03,630 --> 00:02:10,289
in the stochastic part, after
you've sampled one or some mini
35
00:02:10,289 --> 00:02:16,170
batch-- but let's just say
one of the loss functions,
36
00:02:16,170 --> 00:02:17,800
coming from one sample--
37
00:02:17,800 --> 00:02:22,860
you remember, the whole point
is that if we do all zillion
38
00:02:22,860 --> 00:02:27,400
samples at every iteration,
we're really, really slow.
39
00:02:27,400 --> 00:02:31,170
So the stochastic idea
is to randomly pick
40
00:02:31,170 --> 00:02:35,900
one or a mini batch
of the samples
41
00:02:35,900 --> 00:02:41,790
and just reduce their loss,
just deal with the loss--
42
00:02:41,790 --> 00:02:43,440
say, the square loss.
43
00:02:43,440 --> 00:02:46,980
Or later we'll see
cross-entropy loss.
44
00:02:46,980 --> 00:02:54,240
But whatever the cost
is, just do a few or one.
45
00:02:54,240 --> 00:02:57,490
And then the question was,
after you've done that one,
46
00:02:57,490 --> 00:03:01,350
do you put it back
in the pot every time
47
00:03:01,350 --> 00:03:04,380
you sample over the
whole collection?
48
00:03:04,380 --> 00:03:06,810
But that's expensive.
49
00:03:06,810 --> 00:03:15,060
Or do you just make a list of
random order of all the samples
50
00:03:15,060 --> 00:03:17,290
and go through them?
51
00:03:17,290 --> 00:03:20,350
Which is then without
replacement, which
52
00:03:20,350 --> 00:03:22,840
is sort of semi-illegal.
53
00:03:22,840 --> 00:03:31,780
That is, the logic
in the randomization
54
00:03:31,780 --> 00:03:34,360
asks you to replace every time.
55
00:03:34,360 --> 00:03:36,620
But nobody does it.
56
00:03:36,620 --> 00:03:38,020
It costs a lot--
57
00:03:38,020 --> 00:03:39,670
probably not worth it.
58
00:03:39,670 --> 00:03:43,870
So the project would be,
suppose you take 1,000--
59
00:03:43,870 --> 00:03:46,290
or, say, just 100.
60
00:03:46,290 --> 00:03:54,305
100 random numbers--
say you use MATLAB, just
61
00:03:54,305 --> 00:03:56,240
the command "rand."
62
00:03:56,240 --> 00:04:00,540
So you get numbers whose
average is a half from rand.
63
00:04:00,540 --> 00:04:02,750
They're between 0 and 1.
64
00:04:02,750 --> 00:04:03,250
OK.
65
00:04:03,250 --> 00:04:06,320
So we know what the average is.
66
00:04:06,320 --> 00:04:08,880
So let's compute it two ways.
67
00:04:08,880 --> 00:04:13,130
One is by not replacing.
68
00:04:13,130 --> 00:04:16,800
And that's the interesting one.
69
00:04:16,800 --> 00:04:19,700
So take 100 samples.
70
00:04:19,700 --> 00:04:22,280
Well, I guess we know
that, after we've
71
00:04:22,280 --> 00:04:25,790
got through the full
100, we're going to get
72
00:04:25,790 --> 00:04:27,740
exactly the right answer.
73
00:04:27,740 --> 00:04:34,460
But anyway, my question would
be, how much difference do you
74
00:04:34,460 --> 00:04:40,220
see in the eventual approach--
so the law of large numbers,
75
00:04:40,220 --> 00:04:43,160
I guess, would tell
us we get an average
76
00:04:43,160 --> 00:04:50,820
of a half for these numbers
with uniform distribution
77
00:04:50,820 --> 00:04:51,930
between 0 and 1.
78
00:04:51,930 --> 00:04:54,000
Should I be writing
anything here?
79
00:04:54,000 --> 00:04:55,540
Maybe I should.
80
00:04:55,540 --> 00:04:56,370
OK.
81
00:04:56,370 --> 00:04:58,740
So this is project 1.
82
00:05:02,930 --> 00:05:13,110
You pick numbers ak,
which is from rand--
83
00:05:13,110 --> 00:05:22,470
so uniformly on 0,1.
84
00:05:22,470 --> 00:05:25,350
And then my question is,
what about convergence
85
00:05:25,350 --> 00:05:29,310
to the final--
86
00:05:29,310 --> 00:05:32,080
the average is a half.
87
00:05:32,080 --> 00:05:35,100
So this may be too
simple an example.
88
00:05:35,100 --> 00:05:39,330
But could we see what
happens for the convergence
89
00:05:39,330 --> 00:05:46,590
of the average as you either
do replacements or don't
90
00:05:46,590 --> 00:05:48,030
do replacements?
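[Editor's sketch, not part of the lecture: one possible way to set up project 1 in Python rather than MATLAB. The function names, the seed, and the choice of n = 100 are assumptions made for illustration. Note that "exactly the right answer" after a full pass without replacement means the mean of the 100 drawn numbers, which is near but not equal to one half.]

```python
import random

def running_means(values, order):
    """Running average of values visited in the given index order."""
    total = 0.0
    means = []
    for count, idx in enumerate(order, start=1):
        total += values[idx]
        means.append(total / count)
    return means

def compare_sampling(n=100, seed=0):
    """Hypothetical helper: same data, two sampling orders."""
    rng = random.Random(seed)
    values = [rng.random() for _ in range(n)]  # uniform on [0, 1)
    # With replacement: every draw may pick any index again.
    with_repl = [rng.randrange(n) for _ in range(n)]
    # Without replacement: one random permutation, i.e. one epoch.
    without_repl = list(range(n))
    rng.shuffle(without_repl)
    return (values,
            running_means(values, with_repl),
            running_means(values, without_repl))

values, m_with, m_without = compare_sampling()
sample_mean = sum(values) / len(values)
# After one full pass, the no-replacement average equals the sample mean
# (up to rounding); the with-replacement average generally does not.
print("without replacement, final error:", abs(m_without[-1] - sample_mean))
print("with replacement, final error:   ", abs(m_with[-1] - sample_mean))
```

[Plotting m_with and m_without against the iteration count would give the kind of convergence figure asked for here.]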
91
00:05:48,030 --> 00:05:53,010
And in fact, I would like
to see a figure that looks
92
00:05:53,010 --> 00:05:54,405
like those in his lecture.
93
00:05:54,405 --> 00:05:55,880
Do you remember?
94
00:05:55,880 --> 00:05:58,470
He started it somewhere--
95
00:05:58,470 --> 00:06:03,095
start-- and then
here's the finish.
96
00:06:05,730 --> 00:06:08,500
But you remember, the
stochastic gradient descent
97
00:06:08,500 --> 00:06:11,620
was kind of pretty
effective at the beginning.
98
00:06:11,620 --> 00:06:14,410
Well, the beginning,
those might be 100
99
00:06:14,410 --> 00:06:20,960
iterations each-- one epoch,
one run through the full number.
100
00:06:20,960 --> 00:06:23,830
But then when it got
to here, got closer,
101
00:06:23,830 --> 00:06:27,180
it started oscillating.
102
00:06:27,180 --> 00:06:30,930
You remember, he identified
the region of confusion
103
00:06:30,930 --> 00:06:33,790
around the thing.
104
00:06:33,790 --> 00:06:38,010
Well, my suggestion
is just, I think
105
00:06:38,010 --> 00:06:40,920
those videos should be
accessible to you on--
106
00:06:40,920 --> 00:06:43,140
are they on Stellar?
107
00:06:43,140 --> 00:06:43,710
Yeah.
108
00:06:43,710 --> 00:06:54,270
So I'd love to see that
behavior and some good examples
109
00:06:54,270 --> 00:06:56,510
of that behavior and
some pictures, too.
110
00:06:56,510 --> 00:06:59,730
So that would be one
idea with and with--
111
00:06:59,730 --> 00:07:03,840
oh, yeah, that's also idea 2.
112
00:07:03,840 --> 00:07:07,830
Idea 2 is the good
start and then
113
00:07:07,830 --> 00:07:12,560
the bad finish for a
stochastic gradient descent.
114
00:07:12,560 --> 00:07:17,970
And of course,
even without this,
115
00:07:17,970 --> 00:07:25,170
the magic words in computations
is "early stopping."
116
00:07:25,170 --> 00:07:29,360
We don't over-fit.
117
00:07:35,330 --> 00:07:38,810
So we wanted to
stop early, anyway.
118
00:07:38,810 --> 00:07:44,850
And early stopping
just is a good idea
119
00:07:44,850 --> 00:07:51,230
if that's what the
approach to the x
120
00:07:51,230 --> 00:07:53,400
star that you're looking for.
121
00:07:53,400 --> 00:07:57,170
This would be the
place where the--
122
00:07:57,170 --> 00:08:05,040
that's x star where
grad f at x star is 0.
123
00:08:05,040 --> 00:08:07,280
That's the minimum point.
124
00:08:07,280 --> 00:08:14,900
That's argmin-- exactly
what we're looking for.
125
00:08:14,900 --> 00:08:17,450
And we don't find it very well.
126
00:08:17,450 --> 00:08:20,090
But we get close to it fast.
127
00:08:20,090 --> 00:08:21,800
OK.
128
00:08:21,800 --> 00:08:25,520
Two ideas on projects--
129
00:08:25,520 --> 00:08:31,630
so maybe I'll go to the
main topic of today--
130
00:08:31,630 --> 00:08:35,299
the topic I promised--
131
00:08:35,299 --> 00:08:39,100
the idea of back propagation.
132
00:08:39,100 --> 00:08:48,600
This is all to compute grad f--
133
00:08:48,600 --> 00:08:50,130
the gradient.
134
00:08:50,130 --> 00:09:02,460
All the derivatives-- this
is the df dx1 to df dxm,
135
00:09:02,460 --> 00:09:14,730
maybe, I'll say, where I have
m features for the sample.
136
00:09:14,730 --> 00:09:15,690
OK.
137
00:09:15,690 --> 00:09:17,645
So that's back propagation.
138
00:09:17,645 --> 00:09:25,050
And that's the thing whose
discovery, or rediscovery,
139
00:09:25,050 --> 00:09:29,270
put neural nets on the map.
140
00:09:29,270 --> 00:09:32,510
That's the key calculation, of
course, to find the gradient.
141
00:09:32,510 --> 00:09:34,610
In the steepest
descent algorithm,
142
00:09:34,610 --> 00:09:38,030
every step needs a gradient.
143
00:09:38,030 --> 00:09:45,620
And if you can't compute it
quickly, you're in bad shape.
144
00:09:45,620 --> 00:09:48,200
But you can compute
it quickly by
145
00:09:48,200 --> 00:09:54,140
this automatic differentiation
in reverse mode, which
146
00:09:54,140 --> 00:09:56,780
is otherwise known--
147
00:09:56,780 --> 00:10:09,090
I don't think the people--
maybe Hinton was the leader
148
00:10:09,090 --> 00:10:12,620
in developing deep neural net--
149
00:10:12,620 --> 00:10:13,340
deep learning.
150
00:10:16,200 --> 00:10:18,200
So I give him big
credit for that--
151
00:10:18,200 --> 00:10:22,040
that back propagation would
work and would give him
152
00:10:22,040 --> 00:10:23,810
fast gradients.
153
00:10:23,810 --> 00:10:30,650
But it actually had been studied
before under the name AD--
154
00:10:30,650 --> 00:10:32,280
Automatic Differentiation.
155
00:10:32,280 --> 00:10:35,840
So may I just tell
you that idea?
156
00:10:35,840 --> 00:10:39,590
Some of you may know
it, may know about it,
157
00:10:39,590 --> 00:10:47,040
may know more than I, and
might know a good website
158
00:10:47,040 --> 00:10:49,720
to see this description.
159
00:10:49,720 --> 00:10:56,300
There will be, of course,
a section of the notes,
160
00:10:56,300 --> 00:10:57,630
you already have it.
161
00:10:57,630 --> 00:11:02,770
This is section 7.2.
162
00:11:02,770 --> 00:11:06,630
So this is the chapter
on deep learning.
163
00:11:06,630 --> 00:11:11,040
And the first section was
about the structure of F of x.
164
00:11:11,040 --> 00:11:13,710
And you remember the key
point about the structure
165
00:11:13,710 --> 00:11:19,920
of F of x is that I start with
x and apply some function, F1
166
00:11:19,920 --> 00:11:21,090
of x.
167
00:11:21,090 --> 00:11:24,550
And to that, I apply
some function, F2 of x.
168
00:11:24,550 --> 00:11:26,970
And to that, I
apply some function
169
00:11:26,970 --> 00:11:30,930
of F3 of F2 of F1 of x.
170
00:11:30,930 --> 00:11:35,320
And that's the thing
whose derivative I need.
171
00:11:35,320 --> 00:11:38,110
So I'll just take
ordinary derivative--
172
00:11:38,110 --> 00:11:40,950
well, partial
derivatives, really.
173
00:11:40,950 --> 00:11:42,890
Yeah, I better say
partial derivatives.
174
00:11:42,890 --> 00:11:45,910
So suppose x is a pair, x, y.
175
00:11:48,690 --> 00:11:57,880
Example-- so here, let
me show you my example.
176
00:11:57,880 --> 00:12:02,610
So suppose F of x is--
177
00:12:02,610 --> 00:12:04,650
let me take a simple example--
178
00:12:04,650 --> 00:12:06,720
x cubed times x plus 2y.
179
00:12:10,230 --> 00:12:11,980
OK.
180
00:12:11,980 --> 00:12:18,010
So I want to think of that
function the way anybody would,
181
00:12:18,010 --> 00:12:20,740
as the product of two functions.
182
00:12:20,740 --> 00:12:26,170
So there is a product rule
to get into the derivative.
183
00:12:26,170 --> 00:12:30,290
And then we need the
derivatives of each piece.
184
00:12:30,290 --> 00:12:36,880
So there's a power rule and
a linear combination rule.
185
00:12:36,880 --> 00:12:40,360
So it's got a few of
the rules that we use.
186
00:12:40,360 --> 00:12:45,160
And the point is to think
about the computation
187
00:12:45,160 --> 00:12:51,400
of F of x and the
computation of dF dx
188
00:12:51,400 --> 00:12:54,970
and the computation of dF dy.
189
00:12:54,970 --> 00:12:57,370
Those are the
derivatives that we need.
190
00:12:57,370 --> 00:13:01,110
This is the function
we need and how
191
00:13:01,110 --> 00:13:03,640
to do those
computations quickly.
192
00:13:03,640 --> 00:13:04,810
OK.
193
00:13:04,810 --> 00:13:15,100
And this is section 7.2, which
benefited a lot from a blog.
194
00:13:15,100 --> 00:13:18,010
I'm not a blog reader
or a blog writer.
195
00:13:18,010 --> 00:13:21,325
But somehow I found this blog.
196
00:13:27,250 --> 00:13:35,670
It's Christopher
Olah, is his name.
197
00:13:35,670 --> 00:13:38,490
And he really
writes clear things.
198
00:13:41,260 --> 00:13:43,890
He works for one of
the big companies
199
00:13:43,890 --> 00:13:47,850
and does the deeper research.
200
00:13:47,850 --> 00:13:51,610
But he's also a
really good expositor.
201
00:13:51,610 --> 00:13:55,950
And the website
that he now uses is
202
00:13:55,950 --> 00:14:00,530
called Distill dot something.
203
00:14:00,530 --> 00:14:04,620
But I think maybe this
blog was earlier than
204
00:14:04,620 --> 00:14:06,300
before the start of Distill.
205
00:14:06,300 --> 00:14:08,790
But it might be
loaded onto Distill.
206
00:14:08,790 --> 00:14:14,190
Anyway, that's where I got
this simple description
207
00:14:14,190 --> 00:14:16,890
of back propagation.
208
00:14:16,890 --> 00:14:21,160
And let's just do
calculus, first of all.
209
00:14:21,160 --> 00:14:24,553
If I just have a function
of maybe even one variable,
210
00:14:24,553 --> 00:14:25,470
what's the derivative?
211
00:14:25,470 --> 00:14:29,610
What is dF dx here,
just to remember
212
00:14:29,610 --> 00:14:32,970
what calculation we have to do?
213
00:14:32,970 --> 00:14:38,410
So dF dx, this is
with n equal one--
214
00:14:38,410 --> 00:14:40,110
one variable.
215
00:14:40,110 --> 00:14:47,340
So I use ordinary derivative
and not partial derivative.
216
00:14:47,340 --> 00:14:53,160
But that's what
really has to be done.
217
00:14:53,160 --> 00:14:55,530
But just, what's the
derivative of that--
218
00:14:55,530 --> 00:14:58,560
of a chain of functions?
219
00:14:58,560 --> 00:15:01,300
Well, of course, the chain rule.
220
00:15:01,300 --> 00:15:04,110
So what does the chain rule say?
221
00:15:04,110 --> 00:15:05,690
I differentiate dF.
222
00:15:10,550 --> 00:15:12,670
I don't know.
223
00:15:12,670 --> 00:15:15,950
What do I put that it's
differentiated with respect to?
224
00:15:19,380 --> 00:15:21,630
dF3, dF2-- is that
what I should put?
225
00:15:21,630 --> 00:15:22,130
OK.
226
00:15:26,210 --> 00:15:28,790
And where do I evaluate
that derivative?
227
00:15:31,700 --> 00:15:37,880
So yeah, I don't
evaluate it at x.
228
00:15:37,880 --> 00:15:39,860
I'm differentiated to F2.
229
00:15:39,860 --> 00:15:45,500
So do I evaluate it
at F2 of F1 of x?
230
00:15:45,500 --> 00:15:54,390
This is where the chain rule
gets sort of a little chain-ey.
231
00:15:54,390 --> 00:15:54,890
OK.
232
00:15:54,890 --> 00:15:57,260
Then we know that dF2 dF1.
233
00:16:01,390 --> 00:16:05,960
And again, that's now
evaluated at F1 of x.
234
00:16:05,960 --> 00:16:14,470
And then the final factor
is dF1 dx evaluated at x.
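[Editor's note, not part of the lecture: the three factors just described, written out for the one-variable chain F(x) = F_3(F_2(F_1(x))). The vertical-bar notation for "evaluated at" is a choice made here.]

```latex
\frac{dF}{dx}
  = \left.\frac{dF_3}{dF_2}\right|_{F_2(F_1(x))}
    \cdot \left.\frac{dF_2}{dF_1}\right|_{F_1(x)}
    \cdot \left.\frac{dF_1}{dx}\right|_{x}
```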
235
00:16:14,470 --> 00:16:17,780
That's somehow
what we have to do.
236
00:16:17,780 --> 00:16:22,010
And that's just for an
ordinary one-variable function.
237
00:16:22,010 --> 00:16:24,890
And I have here a
two-variable function.
238
00:16:24,890 --> 00:16:27,485
And deep learning has a
million-variable function.
239
00:16:31,150 --> 00:16:33,550
So I think we won't
go to a million.
240
00:16:33,550 --> 00:16:35,570
But two, we could manage.
241
00:16:35,570 --> 00:16:42,070
So let's compute the
function, first of all.
242
00:16:42,070 --> 00:16:58,760
Compute F. So I'm
given x equals, say, 2,
243
00:16:58,760 --> 00:17:01,490
and y equals, say, 3.
244
00:17:04,530 --> 00:17:09,869
And I'm going to create
a computational graph.
245
00:17:13,650 --> 00:17:27,480
So I'm actually going to
draw the computational graph
246
00:17:27,480 --> 00:17:37,140
to compute for F. And then it'll
be a variation of that graph
247
00:17:37,140 --> 00:17:40,000
to find the derivatives.
248
00:17:40,000 --> 00:17:42,360
So let's just start with
the graph, first of all,
249
00:17:42,360 --> 00:17:46,600
for the function, because
we're going to need that.
250
00:17:46,600 --> 00:17:49,870
So again, it's x cubed plus--
251
00:17:49,870 --> 00:17:54,250
so can I write that function
again? x cubed times x plus 2y.
252
00:17:58,390 --> 00:18:06,561
So I think the first step will
be to find x cubed--
253
00:18:06,561 --> 00:18:08,530
that factor, which will be 8.
254
00:18:11,190 --> 00:18:16,110
And we have to find the
other factor, x plus 2y.
255
00:18:16,110 --> 00:18:19,410
So then that uses y and x.
256
00:18:19,410 --> 00:18:23,610
So it's a directed
graph in going forward
257
00:18:23,610 --> 00:18:26,100
with this computation.
258
00:18:26,100 --> 00:18:29,390
So x plus 2y equals
whatever it is--
259
00:18:29,390 --> 00:18:31,620
2 and 6-- oh, 8 again.
260
00:18:31,620 --> 00:18:33,750
Not brilliant.
261
00:18:33,750 --> 00:18:36,600
What shall I change here?
262
00:18:36,600 --> 00:18:37,410
Make it 3y?
263
00:18:42,200 --> 00:18:47,540
3y, just to get a
different number here.
264
00:18:47,540 --> 00:18:49,280
So now x is 2.
265
00:18:49,280 --> 00:18:50,270
y is 3.
266
00:18:50,270 --> 00:18:50,960
I get 11.
267
00:18:50,960 --> 00:18:52,797
That's a good number.
268
00:18:52,797 --> 00:18:53,297
11.
269
00:18:57,130 --> 00:18:59,760
OK.
270
00:18:59,760 --> 00:19:01,500
So far, so good?
271
00:19:01,500 --> 00:19:05,480
And now the next step
on this graph will be,
272
00:19:05,480 --> 00:19:07,560
I have a product of those.
273
00:19:07,560 --> 00:19:10,230
So that will go to the product.
274
00:19:15,850 --> 00:19:18,835
F equals 8 times 11--
275
00:19:18,835 --> 00:19:19,335
88.
276
00:19:22,050 --> 00:19:22,600
OK.
277
00:19:22,600 --> 00:19:28,810
So we've got the answer,
88, which, normally, I
278
00:19:28,810 --> 00:19:31,480
wouldn't take that
much of a book
279
00:19:31,480 --> 00:19:41,710
to compute F. I would have said,
2 cubed times 2 plus 3 times 3.
280
00:19:41,710 --> 00:19:47,170
And I'd have simplified
that to 8 times 11.
281
00:19:47,170 --> 00:19:50,530
And I would have got 88.
282
00:19:50,530 --> 00:19:54,190
So if we were just writing
normally, that would do it.
283
00:19:54,190 --> 00:19:59,110
But this is the picture of
the computational graph.
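[Editor's sketch, not part of the lecture: the same computational graph written as three small steps in Python. The function name and the node names c and s (which the lecture introduces a little later) are used here for illustration.]

```python
def forward(x, y):
    """Evaluate F(x, y) = x**3 * (x + 3*y) one graph node at a time."""
    c = x ** 3       # power node: x cubed
    s = x + 3 * y    # linear node: x + 3y
    F = c * s        # product node
    return c, s, F

print(forward(2, 3))  # (8, 11, 88)
```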
284
00:19:59,110 --> 00:20:00,040
OK.
285
00:20:00,040 --> 00:20:00,550
Good.
286
00:20:00,550 --> 00:20:01,050
Good.
287
00:20:01,050 --> 00:20:02,440
Good.
288
00:20:02,440 --> 00:20:05,200
Now it's the derivatives--
289
00:20:05,200 --> 00:20:08,650
two derivatives to
find-- dF dx and dF dy.
290
00:20:08,650 --> 00:20:12,810
Suppose we go forward first.
291
00:20:12,810 --> 00:20:15,360
My point is going to
be-- or the great point
292
00:20:15,360 --> 00:20:17,520
is that backward is better.
293
00:20:17,520 --> 00:20:19,770
Reverse mode is better.
294
00:20:19,770 --> 00:20:22,650
But we don't know what that
means until we've gone forward.
295
00:20:22,650 --> 00:20:24,443
So let me go forward.
296
00:20:24,443 --> 00:20:25,735
So now I'm going to go forward.
297
00:20:38,940 --> 00:20:41,590
Let's do dF dx.
298
00:20:41,590 --> 00:20:44,170
Everybody is up for dF
dx-- the partial derivative
299
00:20:44,170 --> 00:20:46,300
with respect to x?
300
00:20:46,300 --> 00:20:54,980
So here we have x
equal 2 and y equal 3.
301
00:21:01,168 --> 00:21:04,030
OK.
302
00:21:04,030 --> 00:21:11,750
And then I take the
derivative of that step.
303
00:21:11,750 --> 00:21:15,040
The first step was x to x cubed.
304
00:21:15,040 --> 00:21:16,330
So I need the derivative.
305
00:21:16,330 --> 00:21:23,710
The whole point of AD is
that every computation
306
00:21:23,710 --> 00:21:30,400
of a derivative breaks down like
this into very simple pieces.
307
00:21:30,400 --> 00:21:34,710
And the derivatives
of those simple pieces
308
00:21:34,710 --> 00:21:36,660
are also simple pieces.
309
00:21:36,660 --> 00:21:44,190
So the whole point is
to replace appropriately
310
00:21:44,190 --> 00:21:50,020
those intermediate
steps with derivatives,
311
00:21:50,020 --> 00:21:52,920
so as to compute
the x derivative.
312
00:21:52,920 --> 00:22:00,070
So I have to use the fact
that the derivative of x
313
00:22:00,070 --> 00:22:02,040
cubed, with respect to x--
314
00:22:02,040 --> 00:22:04,650
oh, I better do partial
derivative-- partial
315
00:22:04,650 --> 00:22:09,950
derivatives of x cubed, with
respect to x, is 3x squared.
316
00:22:09,950 --> 00:22:14,340
I'll put maybe a formula
and then a number.
317
00:22:14,340 --> 00:22:21,860
So that gives 3 times 4--
318
00:22:21,860 --> 00:22:22,360
12.
319
00:22:25,910 --> 00:22:31,750
And the derivative of x
cubed, with respect to y,
320
00:22:31,750 --> 00:22:34,385
gives 0, clearly.
321
00:22:34,385 --> 00:22:35,780
So that's 0.
322
00:22:40,160 --> 00:22:44,350
So I'm doing the x derivative.
323
00:22:44,350 --> 00:22:51,170
So the derivative of y,
with respect to x, is--
324
00:22:51,170 --> 00:22:54,250
you get to tell me.
325
00:22:54,250 --> 00:22:58,560
If I'm computing partial
derivatives, it is 0.
326
00:22:58,560 --> 00:22:59,955
It is 0.
327
00:22:59,955 --> 00:23:03,030
y and x are independent.
328
00:23:03,030 --> 00:23:06,810
And this is the
reason, in my view,
329
00:23:06,810 --> 00:23:10,080
that the forward
method is wasteful,
330
00:23:10,080 --> 00:23:15,630
because I'm going to have to do
another whole graph for the y
331
00:23:15,630 --> 00:23:16,990
derivative.
332
00:23:16,990 --> 00:23:21,630
In other words, tracking
the x derivatives,
333
00:23:21,630 --> 00:23:25,650
a whole lot of stuff
never got off the ground.
334
00:23:25,650 --> 00:23:28,140
So we never should
have looked at it.
335
00:23:28,140 --> 00:23:41,812
So anyway, I have
this x plus 3y, maybe.
336
00:23:41,812 --> 00:23:43,270
I don't know whether
to erase that.
337
00:23:43,270 --> 00:23:45,970
I think I will,
just because I don't
338
00:23:45,970 --> 00:23:49,010
know what to do with it there.
339
00:23:49,010 --> 00:23:49,510
Yeah.
340
00:23:49,510 --> 00:23:56,130
So now let me take the
ones that I really need,
341
00:23:56,130 --> 00:24:08,400
is the derivative, with respect
to x, of x plus 3y, which is 1.
342
00:24:08,400 --> 00:24:14,520
And so that gives me the
answer 1 for any x actually.
343
00:24:14,520 --> 00:24:17,040
OK.
344
00:24:17,040 --> 00:24:18,250
And now what?
345
00:24:20,820 --> 00:24:23,440
Oh, yeah, I don't need these.
346
00:24:23,440 --> 00:24:25,410
This is a waste of time.
347
00:24:25,410 --> 00:24:26,330
Isn't it?
348
00:24:29,090 --> 00:24:33,120
Is it only x derivatives I want?
349
00:24:33,120 --> 00:24:36,640
Anyway, let's just keep going.
350
00:24:36,640 --> 00:24:40,170
You can see, this takes
a little organization.
351
00:24:40,170 --> 00:24:42,750
And I'm not practiced with it.
352
00:24:42,750 --> 00:24:44,170
So what am I going to do?
353
00:24:44,170 --> 00:24:47,700
I'm looking for the
x derivative of--
354
00:24:47,700 --> 00:24:50,160
I've got to use our
product rule now.
355
00:24:50,160 --> 00:24:54,750
I found the x derivative
of that factor was 12.
356
00:24:54,750 --> 00:24:58,600
The x derivative of
this factor is 1.
357
00:24:58,600 --> 00:25:03,950
And now the x derivative
of the product--
358
00:25:03,950 --> 00:25:10,590
so now I'm going to do,
somehow, a product rule--
359
00:25:10,590 --> 00:25:15,440
the x derivative
of this product.
360
00:25:15,440 --> 00:25:20,460
I should have given
these two terms a name.
361
00:25:20,460 --> 00:25:25,910
Let me call that first term x
cubed, and the second term x
362
00:25:25,910 --> 00:25:26,810
plus 3y--
363
00:25:26,810 --> 00:25:27,870
call it s.
364
00:25:27,870 --> 00:25:32,090
So I'll call the
two terms c and s.
365
00:25:38,930 --> 00:25:41,210
So that's dc ds.
366
00:25:41,210 --> 00:25:43,850
This is dc dx.
367
00:25:43,850 --> 00:25:46,820
This is dc dx.
368
00:25:46,820 --> 00:25:56,390
And this one is ds dx and dc dy.
369
00:25:56,390 --> 00:25:57,620
Do I need to know that?
370
00:25:57,620 --> 00:26:02,690
I'm sorry, this computational
graph has thrown me.
371
00:26:02,690 --> 00:26:07,080
But now I want to
use the product rule.
372
00:26:07,080 --> 00:26:09,860
And I'm taking x derivatives.
373
00:26:09,860 --> 00:26:13,580
So I should have
computed c and s.
374
00:26:13,580 --> 00:26:16,580
Yes, I see I need those
in the product rule.
375
00:26:16,580 --> 00:26:30,037
So I should have computed c
as being 8 and s as being 5.
376
00:26:30,037 --> 00:26:30,620
Is that right?
377
00:26:30,620 --> 00:26:35,940
2 plus 3-- so 11.
378
00:26:35,940 --> 00:26:37,800
Yeah, I needed the 8.
379
00:26:37,800 --> 00:26:43,040
Oh, is that-- what's up?
380
00:26:43,040 --> 00:26:45,440
I've just been
running along here
381
00:26:45,440 --> 00:26:49,730
without getting myself
in the whole picture.
382
00:26:49,730 --> 00:26:51,440
Yeah, 8 and 11 is right.
383
00:26:51,440 --> 00:26:53,990
But now I'm looking
for the derivatives.
384
00:26:53,990 --> 00:26:55,760
So I don't multiply those.
385
00:26:55,760 --> 00:26:57,250
That's not the product rule.
386
00:27:00,190 --> 00:27:01,810
So the product rule is what?
387
00:27:07,190 --> 00:27:13,120
So this product rule, I have
to do this combination of--
388
00:27:13,120 --> 00:27:14,810
this is now the product rule--
389
00:27:20,050 --> 00:27:25,240
for the derivative of c times s.
390
00:27:25,240 --> 00:27:30,640
So I want c ds dx plus s dc dx.
391
00:27:30,640 --> 00:27:32,940
I think I'm on track now.
392
00:27:32,940 --> 00:27:36,640
And now I want to
put it in numbers.
393
00:27:36,640 --> 00:27:40,900
So c is 8.
394
00:27:40,900 --> 00:27:45,370
ds dx-- have we computed ds dx?
395
00:27:45,370 --> 00:27:48,680
Yes, ds dx is 1.
396
00:27:48,680 --> 00:27:53,590
And now s itself
is computed as 11.
397
00:27:53,590 --> 00:27:58,840
And dc dx, we computed as 12.
398
00:27:58,840 --> 00:28:00,250
I don't dare look.
399
00:28:06,470 --> 00:28:08,120
I don't think I'm going to get--
400
00:28:08,120 --> 00:28:09,830
oh, no, I don't
know the answer yet.
401
00:28:09,830 --> 00:28:12,020
Sorry, I'm not trying to get 88.
402
00:28:14,740 --> 00:28:16,575
You guys are not helping.
403
00:28:16,575 --> 00:28:18,700
[LAUGHS]
404
00:28:18,700 --> 00:28:20,210
You see I'm in trouble.
405
00:28:20,210 --> 00:28:24,880
But what I imagine here is,
that's 8 and that's 132.
406
00:28:24,880 --> 00:28:28,000
So I'm getting 140.
407
00:28:28,000 --> 00:28:29,830
Is there any
possibility that that's
408
00:28:29,830 --> 00:28:34,330
the right answer for dF dx?
409
00:28:34,330 --> 00:28:36,660
This is dF dx I computed.
410
00:28:40,170 --> 00:28:44,920
By watching me struggle
here, you're seeing the idea.
411
00:28:47,970 --> 00:28:52,170
Every step, I take the
derivative of each step.
412
00:28:52,170 --> 00:28:55,050
So it was a power step, x cubed.
413
00:28:55,050 --> 00:28:57,000
So I had a 3x squared.
414
00:28:57,000 --> 00:29:00,480
And a sum step, so I had a 1.
415
00:29:00,480 --> 00:29:04,900
Then the next step
was a multiplication.
416
00:29:04,900 --> 00:29:08,730
So I needed the
product rule for that.
417
00:29:08,730 --> 00:29:11,040
I have these separate numbers.
418
00:29:11,040 --> 00:29:12,570
So I put them in.
419
00:29:12,570 --> 00:29:18,140
And so it's the
computational graph finished.
420
00:29:18,140 --> 00:29:21,710
We only needed two levels.
421
00:29:21,710 --> 00:29:23,840
And we got 8 and 132--
422
00:29:23,840 --> 00:29:25,180
140.
423
00:29:25,180 --> 00:29:26,540
OK.
424
00:29:26,540 --> 00:29:29,120
But we didn't get dF dy yet.
425
00:29:34,230 --> 00:29:37,190
And for that, I'd need
to redo this again.
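[Editor's sketch, not part of the lecture: forward mode for this example, carrying each node's derivative along with its value. The function name and the "seed" arguments dx, dy are assumptions made for illustration; seeding (1, 0) tracks the x derivative and (0, 1) tracks the y derivative, so two inputs need two passes.]

```python
def forward_mode(x, y, dx, dy):
    """Return F and its directional derivative for F = x**3 * (x + 3*y)."""
    c = x ** 3
    dc = 3 * x ** 2 * dx      # power rule
    s = x + 3 * y
    ds = dx + 3 * dy          # linear combination rule
    F = c * s
    dF = c * ds + s * dc      # product rule
    return F, dF

print(forward_mode(2, 3, 1, 0))  # (88, 140): dF/dx = 8*1 + 11*12
print(forward_mode(2, 3, 0, 1))  # (88, 24):  dF/dy = 8*3 + 11*0
```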
426
00:29:40,160 --> 00:29:43,330
And I don't want to do that.
427
00:29:43,330 --> 00:29:48,160
I would rather do the reverse
mode and do them both at once.
428
00:29:48,160 --> 00:29:50,090
That's the point of
the reverse mode.
429
00:29:50,090 --> 00:29:51,230
It's very efficient.
430
00:29:51,230 --> 00:29:55,140
It's very efficient, actually.
431
00:29:55,140 --> 00:29:59,490
Computing the
gradient after you've
432
00:29:59,490 --> 00:30:03,270
done the work for the function,
computing first derivatives--
433
00:30:03,270 --> 00:30:05,970
you could compute
n first derivatives
434
00:30:05,970 --> 00:30:10,800
with about four or five
times the cost, not n times.
435
00:30:10,800 --> 00:30:12,330
That's amazing to me.
436
00:30:12,330 --> 00:30:17,490
That is amazing that I can
compute the gradient very
437
00:30:17,490 --> 00:30:23,290
efficiently by the back prop.
438
00:30:23,290 --> 00:30:25,730
So I have to show you
the backwards way.
439
00:30:29,300 --> 00:30:31,250
Yeah.
440
00:30:31,250 --> 00:30:35,090
I'm just going to follow all
the paths backwards so that I
441
00:30:35,090 --> 00:30:38,960
get both dF dx and dF dy.
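[Editor's sketch, not part of the lecture: reverse mode for the same example, assuming the node names c and s from above. One forward pass stores the node values; one backward pass from F then yields both partial derivatives. The function name is hypothetical.]

```python
def reverse_mode(x, y):
    """Return F, dF/dx, dF/dy for F = x**3 * (x + 3*y) in one backward sweep."""
    # Forward pass: compute and keep the intermediate values.
    c = x ** 3
    s = x + 3 * y
    F = c * s
    # Backward pass: start at F and follow every edge back to the inputs.
    dF_dc = s                               # product node, with respect to c
    dF_ds = c                               # product node, with respect to s
    dF_dx = dF_dc * 3 * x ** 2 + dF_ds * 1  # x feeds both c and s
    dF_dy = dF_ds * 3                       # y feeds only s
    return F, dF_dx, dF_dy

print(reverse_mode(2, 3))  # (88, 140, 24)
```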
442
00:30:38,960 --> 00:30:43,280
You see, the idea is to take
the derivative of each step--
443
00:30:43,280 --> 00:30:45,020
each small step.
444
00:30:45,020 --> 00:30:48,080
That's really what
we do in calculus.
445
00:30:48,080 --> 00:30:51,050
If you think about the
start of a calculus course,
446
00:30:51,050 --> 00:30:53,600
what derivatives do
we actually know?
447
00:30:53,600 --> 00:31:00,020
Do we actually use F at
x plus delta x minus F?
448
00:31:00,020 --> 00:31:02,150
What derivatives
do we grind out?
449
00:31:05,960 --> 00:31:10,440
We do the derivatives
of x to the n.
450
00:31:10,440 --> 00:31:14,080
Every calculus book starts
with x squared and finds
451
00:31:14,080 --> 00:31:15,930
the derivative of x to the n.
452
00:31:15,930 --> 00:31:18,480
Then you do sine x and cos x.
453
00:31:21,150 --> 00:31:22,590
Then what others?
454
00:31:22,590 --> 00:31:25,390
Are there any more?
455
00:31:25,390 --> 00:31:28,450
e to the x-- good, e to the x.
456
00:31:28,450 --> 00:31:31,600
And it's the inverse
function log.
457
00:31:31,600 --> 00:31:35,920
In freshman calculus,
you always write ln, just
458
00:31:35,920 --> 00:31:37,640
to be out of date.
459
00:31:37,640 --> 00:31:38,330
OK.
460
00:31:38,330 --> 00:31:39,920
And now that may be the list.
461
00:31:39,920 --> 00:31:40,420
Is it?
462
00:31:40,420 --> 00:31:43,460
And then the chain rule.
463
00:31:43,460 --> 00:31:50,040
Are there others that you
actually do a computation of?
464
00:31:50,040 --> 00:31:53,820
Actually, e to the x is
defined by the property
465
00:31:53,820 --> 00:31:57,170
that its derivative
is e to the x.
466
00:31:57,170 --> 00:32:00,270
And then you discover
what log x has to be.
467
00:32:00,270 --> 00:32:04,500
And sine x-- how do you
do sine of x plus delta x?
468
00:32:04,500 --> 00:32:07,260
Well, compare minus sine of x.
469
00:32:07,260 --> 00:32:12,030
How do you find the hard
way, once-and-for-all way?
470
00:32:12,030 --> 00:32:17,970
You draw a little unit circle
and mess with some angles.
471
00:32:17,970 --> 00:32:21,480
And you discover that the
derivative of the sine
472
00:32:21,480 --> 00:32:24,140
is the cosine.
473
00:32:24,140 --> 00:32:30,180
That's if you've defined
the sine as a ratio of sides
474
00:32:30,180 --> 00:32:31,350
in a right triangle.
475
00:32:31,350 --> 00:32:34,050
Of course, you could define
it as an infinite series.
476
00:32:34,050 --> 00:32:37,600
And then you would be
back to just using that.
477
00:32:37,600 --> 00:32:38,100
OK.
478
00:32:40,680 --> 00:32:44,160
So calculus does exactly
what we're doing here--
479
00:32:44,160 --> 00:32:48,030
finds all derivatives
by the chain rule
480
00:32:48,030 --> 00:32:56,030
applied to a few ones that
it has worked out in detail.
481
00:32:56,030 --> 00:33:02,060
But tangent of x, we would
use the quotient rule.
482
00:33:02,060 --> 00:33:06,970
Secant of x, we would use the
quotient rule, 1 over cosine.
483
00:33:06,970 --> 00:33:09,370
And the products, we
use the product rule.
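As a reminder of how those rules build on the short list above, here is the quotient rule applied to the tangent and the secant. This is standard calculus written out in LaTeX; it was not worked on the board in the lecture.

```latex
\[
\frac{d}{dx}\tan x
  = \frac{d}{dx}\,\frac{\sin x}{\cos x}
  = \frac{\cos x\cdot\cos x - \sin x\cdot(-\sin x)}{\cos^2 x}
  = \frac{1}{\cos^2 x},
\qquad
\frac{d}{dx}\sec x
  = \frac{d}{dx}\,\frac{1}{\cos x}
  = \frac{0\cdot\cos x - 1\cdot(-\sin x)}{\cos^2 x}
  = \frac{\sin x}{\cos^2 x}.
\]
```

Only the derivatives of sine and cosine and the quotient rule are used, which is the point being made: a few derivatives computed by hand, plus the rules for combining them.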
484
00:33:09,370 --> 00:33:17,010
So really, calculus tends
to seem fairly simple
485
00:33:17,010 --> 00:33:22,750
when you look back to see
what, actually, you did.
486
00:33:22,750 --> 00:33:26,520
And then integration-- what
is integral calculus about?
487
00:33:26,520 --> 00:33:29,240
More or less
guessing the answer.
488
00:33:29,240 --> 00:33:34,230
You have to integrate f of x dx.
489
00:33:34,230 --> 00:33:38,130
So really, what you have
to do is sort of think, OK,
490
00:33:38,130 --> 00:33:40,290
what had this derivative?
491
00:33:40,290 --> 00:33:42,550
What function had
that derivative?
492
00:33:42,550 --> 00:33:46,230
And mess around and get it.
493
00:33:46,230 --> 00:33:54,210
So really, it's a
freshman course, I guess.
494
00:33:54,210 --> 00:33:54,960
OK.
495
00:33:54,960 --> 00:33:57,740
So where am I?
496
00:33:57,740 --> 00:33:58,400
Backward.
497
00:33:58,400 --> 00:33:59,410
Right.
498
00:33:59,410 --> 00:34:01,690
That's the thing still to do.
499
00:34:01,690 --> 00:34:04,330
How does the
backward system work?
500
00:34:04,330 --> 00:34:07,280
OK, I'll try my best.
501
00:34:07,280 --> 00:34:07,780
OK.
502
00:34:07,780 --> 00:34:10,679
So here is the big goal.
503
00:34:10,679 --> 00:34:14,750
Back-- so reverse mode AD.
504
00:34:21,040 --> 00:34:21,750
Right.
505
00:34:21,750 --> 00:34:25,489
And let me make
myself a little note.
506
00:34:25,489 --> 00:34:30,710
The little note is to give
you another example where
507
00:34:30,710 --> 00:34:34,219
the order that you
do the computations
508
00:34:34,219 --> 00:34:37,190
makes a big difference.
509
00:34:37,190 --> 00:34:39,699
And that's not
obvious that it will.
510
00:34:39,699 --> 00:34:41,770
There are many things
in math that you
511
00:34:41,770 --> 00:34:44,050
could do in either order.
512
00:34:44,050 --> 00:34:48,730
And it seems like, logically,
you've done the same things.
513
00:34:48,730 --> 00:34:53,980
So another, and
simpler, example which
514
00:34:53,980 --> 00:34:58,660
shows how one way could be
way faster than another way
515
00:34:58,660 --> 00:35:04,870
is when I'm multiplying
three matrices.
516
00:35:04,870 --> 00:35:06,790
So I'm multiplying
three matrices--
517
00:35:06,790 --> 00:35:08,740
A times B times C.
518
00:35:08,740 --> 00:35:14,110
And the question is, do I do BC
first and then multiply by A?
519
00:35:14,110 --> 00:35:20,230
Or do I do AB first and
then multiply that by C?
520
00:35:20,230 --> 00:35:22,840
And of course, I
kept them in order--
521
00:35:22,840 --> 00:35:24,370
in the order ABC.
522
00:35:24,370 --> 00:35:31,790
But the order of computations
can be different.
523
00:35:31,790 --> 00:35:33,530
You get the right
answer both ways.
524
00:35:33,530 --> 00:35:36,710
But those can be completely,
completely different.
525
00:35:36,710 --> 00:35:40,720
One can be 1,000 times
faster than the other.
526
00:35:40,720 --> 00:35:42,950
So that's just to show--
527
00:35:42,950 --> 00:35:45,990
actually, it kind
of connects to this.
528
00:35:45,990 --> 00:35:49,630
And there is also another--
529
00:35:49,630 --> 00:35:53,120
so I'll do that, too.
530
00:35:53,120 --> 00:36:01,580
So this is example 2, where
this is meant to be example 1.
531
00:36:01,580 --> 00:36:09,860
And example 3 leads to something
called the adjoint method
532
00:36:09,860 --> 00:36:17,530
in differential equations
or in optimization--
533
00:36:17,530 --> 00:36:23,880
in computing optimum
and maximizing it.
534
00:36:23,880 --> 00:36:24,380
Yeah.
535
00:36:28,010 --> 00:36:32,450
Really, the underlying
reason it gives us speed-up
536
00:36:32,450 --> 00:36:38,030
is, it makes the right choice
in a product of three things.
537
00:36:38,030 --> 00:36:39,170
Yeah.
538
00:36:39,170 --> 00:36:43,110
So it'll be enough to do
example 1 and example 2.
539
00:36:43,110 --> 00:36:48,540
OK, let me go with example 1.
540
00:36:48,540 --> 00:36:50,520
This is now back propagation.
541
00:36:50,520 --> 00:36:52,220
Finally, we got to it.
542
00:36:52,220 --> 00:36:52,720
OK.
543
00:36:59,330 --> 00:37:03,230
Well, looking at my
notes is how I do it.
544
00:37:07,170 --> 00:37:10,410
So the notes-- this
is section 7.2--
545
00:37:10,410 --> 00:37:12,720
does these computational graphs.
546
00:37:12,720 --> 00:37:15,450
And then here is reverse mode.
547
00:37:18,120 --> 00:37:20,840
So it starts over
here with the--
548
00:37:20,840 --> 00:37:22,810
so I'm going to
use the chain rule.
549
00:37:22,810 --> 00:37:26,040
So dF dF is 1.
550
00:37:26,040 --> 00:37:28,410
And then I'm going backwards.
551
00:37:31,500 --> 00:37:38,970
And of course, I have
to use the right rule.
552
00:37:38,970 --> 00:37:41,250
So I have to use
the product rule.
553
00:37:41,250 --> 00:37:43,920
And then soon I'll
have to use the power
554
00:37:43,920 --> 00:37:45,150
rule and linear rules.
555
00:37:45,150 --> 00:37:47,830
So of course, no change there.
556
00:37:47,830 --> 00:37:52,220
The change is that
by going backwards--
557
00:37:52,220 --> 00:37:55,330
oh, I don't know if I
completed that sentence,
558
00:37:55,330 --> 00:37:59,650
that I could find 100
partial derivatives,
559
00:37:59,650 --> 00:38:02,800
if the function depended
on 100 variables,
560
00:38:02,800 --> 00:38:07,870
in about five times the
cost of one variable--
561
00:38:07,870 --> 00:38:10,060
three to five times
the cost of one.
562
00:38:10,060 --> 00:38:16,480
So you would expect 100 chain
rules would cost 100 times.
563
00:38:16,480 --> 00:38:22,240
But you see, we're reusing
the pieces in the chain
564
00:38:22,240 --> 00:38:26,530
and just having a larger--
565
00:38:26,530 --> 00:38:28,190
our chain is wider.
566
00:38:28,190 --> 00:38:29,400
But it's not longer.
567
00:38:29,400 --> 00:38:30,630
And it's not repeated.
568
00:38:30,630 --> 00:38:36,400
Anyway, so here I'm going
to use whatever it is--
569
00:38:36,400 --> 00:38:43,080
dF dc and dF ds.
570
00:38:43,080 --> 00:38:44,710
And I'm remembering that--
571
00:38:47,980 --> 00:38:49,360
yeah, OK.
572
00:38:49,360 --> 00:38:54,880
So dF dc is s, and dF ds is c.
573
00:38:54,880 --> 00:39:01,090
That was because F
started out as c times s.
574
00:39:01,090 --> 00:39:02,650
It was the product.
575
00:39:02,650 --> 00:39:03,220
OK.
576
00:39:03,220 --> 00:39:06,900
Then we've got to
evaluate those.
577
00:39:06,900 --> 00:39:10,270
And I'll look again to see
that I'm hopefully writing down
578
00:39:10,270 --> 00:39:11,395
some of the correct things.
579
00:39:14,740 --> 00:39:16,250
OK.
580
00:39:16,250 --> 00:39:21,350
So now what I've written
down next is dF dc is 5.
581
00:39:21,350 --> 00:39:24,770
Or no, 5 on that example.
582
00:39:24,770 --> 00:39:30,960
What is it here? dF dc is--
583
00:39:30,960 --> 00:39:35,490
c is x cubed.
584
00:39:35,490 --> 00:39:40,410
So dF-- oh, sorry, dF dc--
585
00:39:40,410 --> 00:39:42,120
yeah, I want s.
586
00:39:42,120 --> 00:39:43,400
I'm looking for s here.
587
00:39:43,400 --> 00:39:44,502
Yeah.
588
00:39:44,502 --> 00:39:45,474
I'm looking for s.
589
00:39:50,830 --> 00:39:53,210
So I'm looking for s.
590
00:39:53,210 --> 00:39:58,460
And that's x plus 3y.
591
00:39:58,460 --> 00:39:59,638
Am I doing this well?
592
00:40:04,030 --> 00:40:08,210
I want, in the end, to get
the derivatives with respect
593
00:40:08,210 --> 00:40:10,880
to x and y-- the whole gradient.
594
00:40:10,880 --> 00:40:11,380
OK.
595
00:40:11,380 --> 00:40:13,580
I think we started right.
596
00:40:13,580 --> 00:40:16,650
The first derivatives
are with respect to c and s.
597
00:40:16,650 --> 00:40:20,190
And then let me leave
these boxes open,
598
00:40:20,190 --> 00:40:21,360
just to get the picture.
599
00:40:24,660 --> 00:40:43,220
Then I'll need dc dx,
dc dy, ds dx, and ds dy.
600
00:40:43,220 --> 00:40:44,140
I think that's right.
601
00:40:47,300 --> 00:40:49,400
Here, I had a
product of c and s.
602
00:40:49,400 --> 00:40:52,700
So I had two derivatives.
603
00:40:52,700 --> 00:40:57,710
Here I have c and s,
each to differentiate.
604
00:40:57,710 --> 00:41:01,760
So I have an x and a y derivative of c
and an x and a y derivative of s.
605
00:41:01,760 --> 00:41:05,330
And now it's just a matter
of putting in those numbers
606
00:41:05,330 --> 00:41:07,640
and following the
chain backwards.
607
00:41:13,630 --> 00:41:15,730
Maybe I'm not going to
put those numbers in,
608
00:41:15,730 --> 00:41:19,510
because if I didn't
reach 140, you wouldn't
609
00:41:19,510 --> 00:41:21,830
believe in back propagation.
610
00:41:21,830 --> 00:41:25,285
And that would be
an unhappy outcome.
611
00:41:28,250 --> 00:41:31,520
So I'll leave you to
put them in maybe.
612
00:41:31,520 --> 00:41:35,840
Or the notes have a separate
example that you can see.
613
00:41:35,840 --> 00:41:37,760
But do you see the point--
614
00:41:37,760 --> 00:41:47,305
that in the end, I'm
going to find dF dx and dF
615
00:41:47,305 --> 00:41:53,650
dy from the chain--
616
00:41:53,650 --> 00:41:59,200
from one chain and not
from a separate chain for x
617
00:41:59,200 --> 00:42:02,470
and a separate chain for y.
618
00:42:02,470 --> 00:42:06,070
To me, that's the
point of reverse mode.
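The backward sweep described here can be sketched in a few lines of Python. This is only an illustration, assuming F = c times s with c = x cubed and s = x + 3y as in the lecture. The function name `reverse_mode_gradient` and the evaluation point x = 2, y = 3 are choices made for this sketch, not taken from this part of the lecture.

```python
def reverse_mode_gradient(x, y):
    """Return F and (dF/dx, dF/dy) for F = c * s, c = x**3, s = x + 3*y."""
    # Forward pass: evaluate and keep the intermediate values.
    c = x ** 3
    s = x + 3 * y
    F = c * s

    # Backward pass: start from dF/dF = 1 and move toward the inputs.
    dF_dF = 1.0
    dF_dc = dF_dF * s   # product rule: d(c*s)/dc = s
    dF_ds = dF_dF * c   # product rule: d(c*s)/ds = c

    # Local derivatives of the two intermediate nodes.
    dc_dx, dc_dy = 3 * x ** 2, 0.0   # power rule
    ds_dx, ds_dy = 1.0, 3.0          # linear rule

    # One chain, reused for both inputs: dF_dc and dF_ds are computed once.
    dF_dx = dF_dc * dc_dx + dF_ds * ds_dx
    dF_dy = dF_dc * dc_dy + dF_ds * ds_dy
    return F, (dF_dx, dF_dy)


# Assumed evaluation point (hypothetical, chosen for illustration).
F, (dF_dx, dF_dy) = reverse_mode_gradient(2.0, 3.0)
print(F, dF_dx, dF_dy)
```

At that assumed point the sketch gives dF/dx = 140, the number mentioned in the lecture, and dF/dy = 24. The quantities dF_dc and dF_ds are shared by both partial derivatives, which is why adding more input variables widens the chain without repeating it.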
619
00:42:06,070 --> 00:42:09,400
It's a little bit of magic.
620
00:42:09,400 --> 00:42:12,190
But you see the steps--
621
00:42:12,190 --> 00:42:13,330
the ingredient.
622
00:42:13,330 --> 00:42:17,470
And some of you have seen
this before and maybe
623
00:42:17,470 --> 00:42:19,700
know a better exposition.
624
00:42:19,700 --> 00:42:24,100
I found this blog by
Christopher Olah clear.
625
00:42:24,100 --> 00:42:26,110
And these very simple
things, you'll see,
626
00:42:26,110 --> 00:42:28,420
are clear in the notes.
627
00:42:28,420 --> 00:42:36,730
But maybe another blog brings
out other points to make here.
628
00:42:36,730 --> 00:42:41,660
It's not obvious, maybe, that
I could have 100 variables
629
00:42:41,660 --> 00:42:48,570
and do the calculation in
four or five times the cost--
630
00:42:48,570 --> 00:42:52,740
four or five times
being instead of 100.
631
00:42:52,740 --> 00:42:53,740
Yeah.
632
00:42:53,740 --> 00:42:55,450
But it's possible.
633
00:42:55,450 --> 00:42:56,850
OK.
634
00:42:56,850 --> 00:43:00,262
So could I close
today with this one?
635
00:43:05,920 --> 00:43:07,370
How could those be different?
636
00:43:07,370 --> 00:43:12,940
You're computing the same
numbers, the same a_ij's, b_jk's,
637
00:43:12,940 --> 00:43:17,470
c_kl's, and doing these sums.
638
00:43:17,470 --> 00:43:19,390
But it certainly is different.
639
00:43:19,390 --> 00:43:21,370
So let's just do that.
640
00:43:21,370 --> 00:43:21,903
OK.
641
00:43:21,903 --> 00:43:22,570
I'll do it here.
642
00:43:28,480 --> 00:43:31,980
And then at the
right time-- and I
643
00:43:31,980 --> 00:43:36,030
guess it'll be after Professor
Rao on Friday and Monday,
644
00:43:36,030 --> 00:43:42,950
I'll come back to
Professor Sra's short proof
645
00:43:42,950 --> 00:43:48,470
of the convergence of
stochastic gradient descent.
646
00:43:48,470 --> 00:43:52,560
The whole point is to show you
what assumptions do you need.
647
00:43:52,560 --> 00:43:56,660
You need some assumptions on
the gradient, some assumptions
648
00:43:56,660 --> 00:43:58,190
on the step size.
649
00:43:58,190 --> 00:44:02,810
And for a good proof, all
the assumptions fit together,
650
00:44:02,810 --> 00:44:06,270
and, dong, out comes
the conclusion.
651
00:44:06,270 --> 00:44:10,010
And the conclusion would
be how fast it converges--
652
00:44:10,010 --> 00:44:11,600
stochastic gradient descent.
653
00:44:11,600 --> 00:44:18,230
So there's some expected
things, because it's stochastic.
654
00:44:18,230 --> 00:44:25,060
We expect some assumptions
about the mean and the variance
655
00:44:25,060 --> 00:44:28,390
to go into the proof.
656
00:44:28,390 --> 00:44:29,620
So you'll see that.
657
00:44:29,620 --> 00:44:33,960
But maybe it's too
much for today.
658
00:44:33,960 --> 00:44:36,690
So I'll come back to that.
659
00:44:36,690 --> 00:44:45,130
I might even put it on Stellar
and just close with this.
660
00:44:45,130 --> 00:44:56,320
So suppose A is m by n, B
is n by p, and C is p by q.
661
00:44:56,320 --> 00:44:57,930
OK.
662
00:44:57,930 --> 00:45:04,480
How many steps does it take
to find A times B times C--
663
00:45:04,480 --> 00:45:06,970
the product of those
three matrices?
664
00:45:06,970 --> 00:45:14,140
Well, if I go this way,
I have to do BC first.
665
00:45:14,140 --> 00:45:18,160
So BC costs-- how
many operations
666
00:45:18,160 --> 00:45:20,125
to multiply that times that?
667
00:45:24,010 --> 00:45:25,610
npq-- nice formula.
668
00:45:25,610 --> 00:45:26,110
npq.
669
00:45:28,670 --> 00:45:30,540
Why is that?
670
00:45:30,540 --> 00:45:36,280
Well, I could say that
the answer is n by q.
671
00:45:36,280 --> 00:45:41,960
And every number in there
was an inner product
672
00:45:41,960 --> 00:45:45,310
of a row and column of length p.
673
00:45:45,310 --> 00:45:50,350
So I have nq inner products.
674
00:45:50,350 --> 00:45:52,280
And each one costs p--
675
00:45:54,940 --> 00:45:58,450
multiply, adds.
676
00:45:58,450 --> 00:46:04,280
So now I have BC,
which will be--
677
00:46:04,280 --> 00:46:06,270
so now I have m by n.
678
00:46:06,270 --> 00:46:14,000
Then I have m by n,
which is the A times
679
00:46:14,000 --> 00:46:17,360
BC, which is now n by q.
680
00:46:17,360 --> 00:46:18,110
That's BC.
681
00:46:18,110 --> 00:46:20,480
This is A, BC.
682
00:46:20,480 --> 00:46:23,450
And this one costs--
683
00:46:23,450 --> 00:46:25,310
what's the cost here?
684
00:46:25,310 --> 00:46:28,340
m by n, m by q--
685
00:46:28,340 --> 00:46:30,035
by the same rule, it'll be mnq.
686
00:46:32,954 --> 00:46:34,450
Good.
687
00:46:34,450 --> 00:46:36,640
That's the first way--
688
00:46:36,640 --> 00:46:38,590
A times BC.
689
00:46:38,590 --> 00:46:44,530
Now, the second way is AB
times C. Let me write in again,
690
00:46:44,530 --> 00:46:47,455
m by n, n by p, p by q.
691
00:46:51,700 --> 00:46:53,890
So now I'm doing this first--
692
00:46:53,890 --> 00:46:56,680
so AB costs.
693
00:46:56,680 --> 00:46:58,870
Tell me again now,
what's the rule
694
00:46:58,870 --> 00:47:03,130
for the cost of a
matrix multiplication?
695
00:47:03,130 --> 00:47:04,295
mnp.
696
00:47:04,295 --> 00:47:04,795
mnp.
697
00:47:08,380 --> 00:47:16,410
And then I multiply m by p--
698
00:47:16,410 --> 00:47:18,930
that's AB-- times p by q.
699
00:47:18,930 --> 00:47:20,580
That's C.
700
00:47:20,580 --> 00:47:22,650
So I have mpq.
701
00:47:27,220 --> 00:47:32,320
So I have that together
with that, or that
702
00:47:32,320 --> 00:47:35,130
together with that.
703
00:47:35,130 --> 00:47:41,490
That sum-- those
two or these two.
704
00:47:41,490 --> 00:47:43,450
And they're different.
705
00:47:43,450 --> 00:47:48,340
And let's just recognize
the most important example.
706
00:47:48,340 --> 00:47:50,770
Suppose C is a column vector--
707
00:47:50,770 --> 00:47:52,540
C for column vector.
708
00:47:52,540 --> 00:47:54,280
So q is 1.
709
00:47:54,280 --> 00:47:56,050
There's only one column.
710
00:47:56,050 --> 00:48:00,170
So if q is 1, this way did np--
711
00:48:00,170 --> 00:48:02,020
let's just specialize to that.
712
00:48:06,130 --> 00:48:16,340
So specialize to C
equal a column vector,
713
00:48:16,340 --> 00:48:19,170
which means that q is 1.
714
00:48:19,170 --> 00:48:20,980
I only have one column.
715
00:48:20,980 --> 00:48:30,820
So then it's A times BC
versus AB times C.
716
00:48:30,820 --> 00:48:33,580
So let's just figure
that out when q is 1.
717
00:48:33,580 --> 00:48:37,840
So npq is just np.
718
00:48:37,840 --> 00:48:48,595
And mnq is just mn,
whereas AB is mnp.
719
00:48:48,595 --> 00:48:51,190
Oh, that's a bad one.
720
00:48:51,190 --> 00:48:52,210
Disaster already.
721
00:48:55,750 --> 00:48:58,660
Those are potentially
two big matrices,
722
00:48:58,660 --> 00:49:01,160
multiplying a column vector.
723
00:49:01,160 --> 00:49:03,340
So here I've done a
matrix multiplication.
724
00:49:03,340 --> 00:49:04,990
I never should have done that.
725
00:49:04,990 --> 00:49:07,750
This is a matrix vector.
726
00:49:07,750 --> 00:49:09,250
It gives me a vector.
727
00:49:09,250 --> 00:49:11,530
And then this is
a matrix vector.
728
00:49:11,530 --> 00:49:14,320
So I get nice numbers here.
729
00:49:14,320 --> 00:49:17,380
But I get a terrible
number for AB.
730
00:49:17,380 --> 00:49:21,700
And then I multiply that
by C. So that's mpq.
731
00:49:25,680 --> 00:49:26,180
mpq.
732
00:49:29,390 --> 00:49:31,760
So mp is factoring out.
733
00:49:31,760 --> 00:49:42,340
So if I write it as n times
m plus p versus this one
734
00:49:42,340 --> 00:49:50,190
is m that's factoring
out times m--
735
00:49:50,190 --> 00:49:51,570
no.
736
00:49:51,570 --> 00:49:53,240
Yeah.
737
00:49:53,240 --> 00:49:54,160
What's up here?
738
00:49:56,920 --> 00:49:57,650
Yeah.
739
00:49:57,650 --> 00:49:58,760
Sorry.
740
00:49:58,760 --> 00:49:59,570
What am I doing?
741
00:50:05,190 --> 00:50:06,320
Yeah.
742
00:50:06,320 --> 00:50:09,540
Is it p that factors
out from this one?
743
00:50:09,540 --> 00:50:11,520
OK.
744
00:50:11,520 --> 00:50:15,820
p times m plus n, I guess.
745
00:50:15,820 --> 00:50:16,320
Sorry.
746
00:50:19,140 --> 00:50:24,938
Anyway, the difference is--
747
00:50:24,938 --> 00:50:29,240
AUDIENCE: I think it's
mp times p plus q.
748
00:50:29,240 --> 00:50:30,480
[INAUDIBLE]
749
00:50:30,480 --> 00:50:34,080
GILBERT STRANG: Shall I go
over it again or write--?
750
00:50:34,080 --> 00:50:36,120
Let me do just this
thinking again.
751
00:50:36,120 --> 00:50:39,810
If q is 1, if I go
this way, was that
752
00:50:39,810 --> 00:50:42,960
my final total when q was 1?
753
00:50:42,960 --> 00:50:45,420
And that's this?
754
00:50:45,420 --> 00:50:46,440
No.
755
00:50:46,440 --> 00:50:49,740
m factors out times n plus p.
756
00:50:49,740 --> 00:50:52,800
Let's just get that right.
757
00:50:52,800 --> 00:50:54,690
Oh, no, n factors out.
758
00:50:54,690 --> 00:50:58,070
Sorry, n factors
out times m plus p.
759
00:50:58,070 --> 00:51:03,635
And this way was
all these things.
760
00:51:03,635 --> 00:51:07,520
AUDIENCE: Both the m
and the p factor out.
761
00:51:07,520 --> 00:51:10,340
GILBERT STRANG: Both the
m and the p factor out.
762
00:51:10,340 --> 00:51:11,790
OK.
763
00:51:11,790 --> 00:51:12,290
Thanks.
764
00:51:16,700 --> 00:51:22,100
Times n plus q.
765
00:51:22,100 --> 00:51:24,120
And q was 1.
766
00:51:24,120 --> 00:51:24,620
OK.
767
00:51:29,220 --> 00:51:32,520
The whole point is, we've got
this horrible multiplication
768
00:51:32,520 --> 00:51:36,300
of three big numbers.
769
00:51:36,300 --> 00:51:38,840
And this only had
two big numbers.
770
00:51:38,840 --> 00:51:42,990
So this is orders of
magnitude faster than that.
771
00:51:42,990 --> 00:51:45,480
And of course, you would
have done the calculation.
772
00:51:45,480 --> 00:51:48,720
That way, you would have
multiplied the column vector
773
00:51:48,720 --> 00:51:52,140
by a matrix to get
another column vector.
774
00:51:52,140 --> 00:51:54,090
And you would have
multiplied that by a matrix
775
00:51:54,090 --> 00:51:57,390
to get another column
vector, where here,
776
00:51:57,390 --> 00:52:02,100
you crazily multiplied two big
matrices together and then got
777
00:52:02,100 --> 00:52:02,940
a column vector.
778
00:52:02,940 --> 00:52:07,020
So there is a bad move.
779
00:52:07,020 --> 00:52:08,440
OK, thanks.
780
00:52:08,440 --> 00:52:11,670
Oh, I'm past the
time on this ABC.
781
00:52:11,670 --> 00:52:16,230
It's just to show that on a
very familiar calculation,
782
00:52:16,230 --> 00:52:18,510
you have to do it
in the right order.
783
00:52:18,510 --> 00:52:21,840
And back propagation
is the right order
784
00:52:21,840 --> 00:52:24,130
for partial derivatives.
785
00:52:24,130 --> 00:52:24,630
OK.
786
00:52:24,630 --> 00:52:25,260
Thank you.
787
00:52:25,260 --> 00:52:29,370
And so bring laptops Friday.
788
00:52:29,370 --> 00:52:35,490
And look forward
to Professor Rao.
789
00:52:35,490 --> 00:52:37,880
Give him a good welcome.