The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JOHN W. ROBERTS: We'll be talking about-- I heard last time I had bad handwriting. And I guess this isn't much improved yet, but I will try to be more deliberate if not more skilled. Stochastic-- all right.

So as you may remember, last time we were talking about the different assumptions that we've used in all the techniques we've applied so far. We assume that we have a model of the system, and that the system is deterministic-- that's not really any better handwriting, but this is the one that last time we talked about getting rid of anyway, right? Stochastic system, stochastic dynamics. So that's what we tried to remove. And then that the state is known. We've sort of already gotten rid of this one in some of our discussions.

And today we're going to talk about what you do if you don't have a model. This is something that's actually very important in a lot of interesting systems. The systems that we work on in the lab-- some of them we try to model, but some of them are hard for us to model. So dealing without a model is a very useful thing to be able to do. Hopefully, you'll all at the end appreciate the tremendous power of model-free reinforcement learning.

So the basic idea is, again, we have this policy parameterization alpha, which somehow defines our policy. In the problem sets that you recently did, it's open loop, so you just have one alpha for every time step. You can also imagine these are gains on a feedback policy, entries of the K matrix, or PD gains-- any way you want to parameterize it. And you think about how you use these parameters-- now, this is the simplest interpretation.
There are a lot more complicated ways of looking at it, but I'm going to look at the simplest way first. You send this into your system, so you can run your system with these parameters. This is, again, sort of like what you did in the problem set. You have a fixed initial condition, a fixed cost function, you give it a policy, you run it, and you see how it does. And what you get is J. You get the cost of running that policy.

So the question is this. Previously we've talked about, OK, if you have a model of the system, there are a lot of things you can do. You can do back prop to get the specific gradient, do something like SNOPT, you can do gradient descent using that. Depending on the dimensionality, you can do value iteration. So there are a lot of options when you have a model. But if you don't, if you don't know how the system works, if it's really just a black box where I have a policy parameterization and I get a cost, how do we achieve anything in that context? We don't have any sort of information about how these things relate to each other. Well, the thing is we do have some information, in that we can execute this black box. We can test it, right? We can run our policy and see how well it does.

So what would you say is the crudest thing you could do if you had a system like this, a black box? You give it an open loop tape, let's say, you run it, and it tells you the cost. What could we do?

AUDIENCE: SNOPT could also-- well, not that we'll [INAUDIBLE] SNOPT. But SNOPT could also-- or you have methods for estimating the gradient.

JOHN W. ROBERTS: You can do finite differences, right?

AUDIENCE: Yeah, do finite--

JOHN W. ROBERTS: So finite differences, exactly. So what you can do is you can say-- again, the notation we're using here is simple. A lot of times people parameterize it in a different way. But yeah. So pretty much in this context-- let's say we have a deterministic cost.
So we don't have a random system-- we'll talk about random systems later. Let's say we have a deterministic cost, which is a function of our alpha, our parameter vector. So what we do is we say, OK, let's say I have a 2D system, alpha 1 and alpha 2. Now, we don't know what this function is. But let's just say it's a simple function like this, convex, where these are the contour lines, and what we want is to get to the middle. So this is sort of the local min, and we start here.

Now, how could SNOPT get these gradients? One of the simplest things you can imagine doing is this. You measure here-- you run the system, you get J at this point. You run the system, you get J at this point. You run the system, you get J at this point. Then you take these differences, divide by your displacement, and what you get is some estimate of the local gradient. And if those displacements are small enough and your evaluations are nice enough, you can get arbitrarily close to the true gradient there. And this will tell you, OK, you want to move in this direction, right?

Now, the problem with this is that you have to do n plus 1 evaluations to get this, where n is the number of dimensions. So you sort of have to evaluate at alpha, and at alpha plus [delta, 0, 0, ...], and at alpha plus [0, delta, 0, ...], et cetera. Now, obviously, these perturbations just have to be linearly independent, actually, but you might as well do it this way. Do these finite differences and you get an estimate of the gradient. You can hand it to SNOPT, and SNOPT can try to do fancier things. Or you can do gradient descent, where you get this gradient, you compute it, and then you do an update where I say, OK, now my alpha at n plus 1 equals alpha at n plus some delta alpha. And you can say, OK, delta alpha equals negative eta times dJ/d-alpha-- that's a vector. And this eta is our learning rate. That says, OK, we have the gradient here-- how far are we going to move? Setting that can be an issue. But you update your alpha like this, and you can just keep doing that over and over again, keep evaluating over and over again. And eventually, you should move in toward the 0. You should get to a local min.
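[A minimal sketch of this finite-difference gradient descent, not from the lecture: the quadratic cost J, the step sizes, and the starting alpha are made-up stand-ins for a black box we can only evaluate.]

```python
import numpy as np

def J(alpha):
    """Hypothetical black-box cost: we can evaluate it, but we have no model of it."""
    return (alpha[0] - 1.0) ** 2 + 2.0 * (alpha[1] + 0.5) ** 2

def finite_difference_gradient(J, alpha, delta=1e-4):
    """Estimate dJ/dalpha with n + 1 evaluations: one at alpha, one per perturbed coordinate."""
    J0 = J(alpha)
    grad = np.zeros(len(alpha))
    for i in range(len(alpha)):
        perturbed = alpha.copy()
        perturbed[i] += delta                      # alpha + [0, ..., delta, ..., 0]
        grad[i] = (J(perturbed) - J0) / delta
    return grad

alpha = np.array([2.0, 2.0])                       # initial policy parameters
eta = 0.1                                          # learning rate
for step in range(200):
    alpha = alpha - eta * finite_difference_gradient(J, alpha)   # delta alpha = -eta dJ/dalpha
print(alpha, J(alpha))                             # ends near the local min at [1.0, -0.5]
```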
The thing is that doing n plus 1 evaluations every time is expensive. Now, you could cut that down a bit if you were to reuse some evaluations and things like that. But the point is that you have to do a lot of evaluations to get this local information. And if you move very far, you sort of have to discard those and do all those evaluations again. So you're doing a lot of evaluation to get an accurate estimate of the gradient right here, then you're throwing a lot of it away when you move, and the gradient could change. In that sense, doing all these evaluations is maybe wasteful, because you're being more careful than you have to be, and then you're just going to lose that information once you move somewhere else and have to evaluate again.

So there's another thing. This one, you could say, is even more crude. At least in evolutionary algorithm circles, I think they call it just hill climbing-- I mean, all of these things are sort of hill climbing, or valley descending. What you can also imagine doing is just having a point here, and now we just randomly perturb it. So I don't do this deterministic thing. I could randomly perturb and just be like, OK, well, what if I'm here? That's worse, right? The cost is higher. So you just throw it out, don't use it. Do it again here-- that's better. So now we just keep this. And we just do this over and over, discarding bad ones, keeping good ones, until we get back there.
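[And a sketch of that crude hill climbing on the same made-up cost: perturb randomly, keep the sample if it's better, throw it away if it's worse. The perturbation scale is an illustrative assumption.]

```python
import numpy as np

def J(alpha):
    """Hypothetical black-box cost, same stand-in as above."""
    return (alpha[0] - 1.0) ** 2 + 2.0 * (alpha[1] + 0.5) ** 2

rng = np.random.default_rng(0)
alpha = np.array([2.0, 2.0])
best_cost = J(alpha)
for step in range(1000):
    candidate = alpha + 0.1 * rng.standard_normal(alpha.shape)   # random perturbation
    cost = J(candidate)
    if cost < best_cost:                  # keep the samples that get better...
        alpha, best_cost = candidate, cost
    # ...and discard the worse ones, along with the information they carried
print(alpha, best_cost)
```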
But the thing is that there you're doing all these evaluations, and when they get worse, you're just throwing them out and acting like they give you no information. But there is information in that. Even if it gets worse-- by how much it gets worse, how much it gets better, there's information in all of that. You're getting information every time you do an evaluation. So it's wasteful to just throw it away and cast out the samples that do worse.

So that's the idea of stochastic gradient descent: instead of doing this deterministic evaluation of the local gradient, we're going to randomize the system, we're going to get an estimate of the gradient stochastically, and then we're going to follow that. And we're going to get as much information out of each evaluation as possible. That's one of the important things-- generally in these systems, the evaluation is all of the cost. Pretty much everything is dominated by checking the cost of the [INAUDIBLE] policy. So you want to get as much as you can out of each one. And stochastic gradient descent is a powerful way of doing that when you have no model. It's definitely more efficient than hill climbing.

So now the question is, what is the appropriate process for doing this? How do we randomly sample these guys and actually improve our policy? So I'm going to write down an update. This is a common update-- it's the weight perturbation update. It also shows up in an identical form in REINFORCE, if you see any of those. We'll talk about all of those. But you can look at the weight perturbation update--

Is my handwriting at all legible?

[INTERPOSING VOICES]

JOHN W. ROBERTS: Yeah? OK. So you want to look at this weight perturbation update. Take my word for now that this makes sense.
We're changing the alpha a bit: delta alpha equals negative eta, times the quantity J of alpha plus z minus J of alpha, times z. So I'm saying, change your alpha. Here we have the same learning rate eta, like in the deterministic gradient descent. And then here's where you evaluate. And this z is noise. So when you perturb your policy, this is the vector of how you perturb that alpha vector. So this is a z, this is a z, this is a z-- those z's are these perturbations.

So a simple and very common way is to have the vector z distributed as a multivariate Gaussian, where each element of z is iid with the same standard deviation, mean 0. And so you draw a sample z from that, you evaluate how well it does, you evaluate how well you do with your nominal policy right now, calculate this difference, and then you move in the direction of z.

So I'll try to draw this in 1D and then 2D so it makes sense. Here in 1D, you can say this is our one alpha, and this is J, so here's our cost function. We'll be here. Now, our z in this case is just a scalar. But our z is going to be mean 0, and it's going to have a Gaussian distribution. When you sample from this, you evaluate-- I should actually probably keep that update up at the same time. So you sample, you get this change. This is my J of alpha. This right here is my J of alpha plus z. And imagine this change-- that's going to say, OK, the cost went up. It went up by some amount. That's the difference. I'm going to move in the direction of z. z is just a scalar here, so it's just going to be the sign and the magnitude of it. And then I'm going to move sort of opposite this. So I perturbed by z, z went in this direction, and it got bigger. That change, then, is a positive number. So we're going to move down by an amount eta times that change, right?
And so if it gets a lot worse, we move down farther. If it gets a bit worse, we move down a bit. Does that make sense? And so when you're measuring here, you're going to get a small change for the same z. When you, again, draw your Gaussian around that, if you get a small change, you're just going to move a bit. When I'm here, where it's really steep, I'll get the same perturbation, I'm going to get a bigger change, and I'm going to move even farther. And I'll update here. And so if I do this a bunch of times, you can imagine I descend into the local min. Does that make sense? And every time, you're drawing this stochastically. So you're not doing this [INAUDIBLE] term thing. Every time you do it, you could be updating, you could try worse, you could try better. But stochastically, you can sort of intuitively see why it's going to descend. Does that make sense?

AUDIENCE: This is heavily depending on the fact that the function is sort of [INAUDIBLE] direction?

JOHN W. ROBERTS: It's sort of what?

AUDIENCE: It's like the function that you're looking at, if you're looking-- if you increase, like in this case, alpha in one direction versus the other one, the changes are sort of similar in both ways.

JOHN W. ROBERTS: No. I mean, that can affect the performance of the algorithm. But yeah, I can draw that. These are sort of common pathological cases. Let's look at it in 2D. So this is what you're saying, right? Now, the ideal one would be-- again, we can draw a contour map again. Now, this is about the same, right? You're saying this is about isotropic, or whatever. You're here, you perturb yourself randomly, so your Gaussian is going to put you anywhere around here. And you measure somewhere. You get better, so you're going to move in that direction, depending on what eta is, and I'll get an update.
You're saying, well, what happens if we're actually in trouble, and we have something that looks like this, right? Is that a problem? Well, that can hurt the convergence of it. It can be slower. But it still works. Because you can see-- let's say I'm here. Now, it's really steep here, and it's really shallow here. So what's going to happen is, when I perturb it, my perturbation in this direction is going to have an effect-- maybe it's relatively shallow-- but then in this direction it's going to be very sensitive. And so I'll move more in this direction, and I'll move very far. I'm going to go down here first-- I'm going to descend the steep part-- and then slowly converge in on the shallow part. That's called, I think, the banana problem, where you have this massive bowl, and you go really quickly right down here, and then really slowly. And so the thing is that if it's all very shallow, that's not a problem. You can make your learning rate bigger, you can make your samples go further out, and then it just doesn't matter, right? But this asymmetry is an issue. Now, there are some ways of dealing with that if you have an idea of how asymmetric it is. We can talk about this later. But it'll still descend.

And actually, you can show, and I'm about to show, that this update in expectation moves in the direction of the true gradient. So, I mean, randomly it can bounce all around. But in expectation, it will move in the right direction. And if you're doing deterministic evaluations-- well, we're going to do a linear analysis at first-- you actually can show that it'll always move within 90 degrees of the true gradient. So you'll never actually get worse. You can move parallel and not improve, but you'll never move with the wrong sign relative to the true gradient.
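[Here is the weight perturbation update itself as a sketch, again on a made-up quadratic cost; eta, sigma, and the number of steps are illustrative assumptions.]

```python
import numpy as np

def J(alpha):
    """Hypothetical black-box cost; only evaluations are available."""
    return (alpha[0] - 1.0) ** 2 + 2.0 * (alpha[1] + 0.5) ** 2

rng = np.random.default_rng(0)
alpha = np.array([2.0, 2.0])
eta, sigma = 0.5, 0.1
for step in range(3000):
    z = sigma * rng.standard_normal(alpha.shape)           # z ~ N(0, sigma^2 I), elements iid
    alpha = alpha - eta * (J(alpha + z) - J(alpha)) * z     # move along z, scaled by the change in cost
print(alpha, J(alpha))                                      # wanders downhill toward [1.0, -0.5]
```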
All right. So then, yeah, let's look at why that is in some detail. So again, our delta alpha is the same as up there, so I won't waste time rewriting it. And let's look at a first-order Taylor expansion of our cost function-- look at it locally, where you keep the linear term. So we linearize around alpha. Our J of alpha plus z is approximately equal to-- for small z-- J of alpha, plus dJ/d-alpha transpose times z. So that's the first-order Taylor expansion.

Now, if we plug this in for J of alpha plus z in that update, the J of alpha term cancels out, and we're going to get that delta alpha is approximately negative eta times (dJ/d-alpha transpose z) times z. So what does this look like? This is sort of like a dot product between the gradient with respect to alpha and our noise vector, all right? And so this is going to be about equal to negative eta times the sum, for i from 1 to N, of dJ/d-alpha_i times z_i, all times the vector z.

So if we multiply that out, we're going to get this vector, and the eta, because you're multiplying that coefficient times each term individually. You're going to get the vector whose first component is that sum over i of dJ/d-alpha_i z_i, times z_1, and so on, and the same thing for the last one-- the sum over i of dJ/d-alpha_i z_i, times z_N.

Now, if we take the expectation of this-- we know that each z_i is iid. Do you know iid? They're all distributed with the exact same distribution, all mean 0, Gaussian, standard deviation sigma, and they're all independent. So we can take the expectation of delta alpha. We can pull that eta out front, because expectation is linear. And what you'll get is, again, the sum over i of dJ/d-alpha_i-- that's not a random variable, so pull it out of the expectation--
times the expectation of z_i times z_1, for that first component. Now, this sum goes through all the i's, but the first component only has that z_1, right? Now, z_i and z_1 are independent and mean 0. So you can split these up, and you're going to get that they're 0 for every term except the term where i equals 1-- and in the second component the term where i equals 2, et cetera, right? All the other terms are going to go to 0. So it's easy, then. To get the expectations, you go through the sum, and you're going to see that you only have the terms with the expectation of z_1 squared, the expectation of z_2 squared, and so on.

Now, the expectation-- again, maybe you remember that the variance equals the expected value of x squared, minus the square of the expected value of x, right? Now, we're mean 0, so that second part is 0. Our variance is sigma squared, so our expected value of x squared is sigma squared. So that means each one of these expectations is going to be sigma squared. So you're going to end up with negative eta-- they all have the same sigma, so we can pull that out-- times sigma squared, times the vector dJ/d-alpha_1, dJ/d-alpha_2, et cetera. So you're going to get dJ/d-alpha.

So the expectation of this update, when we look at it in this linear sense, is negative eta sigma squared times dJ/d-alpha. Those are just scalars-- they just change the magnitude of it. But it's in the direction of the gradient. And eta is our parameter; we can control it. Does that make sense?

AUDIENCE: Is that sigma squared?

JOHN W. ROBERTS: Yes, yes. Sorry. Yeah, sorry. Yeah. So the noise you use pops out here. One comment-- oftentimes, when we look at this algorithm in a different way, people write the update with eta over sigma squared, your noise. And then that cancels out that sigma squared, and you purely just get eta times dJ/d-alpha. So you can put that in, too, if you want it to really just be eta times your true gradient. But the important thing is that you'll move, in expectation, in the direction of the true gradient.
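[You can check that expectation claim numerically. This sketch, with made-up numbers, averages the update over many sampled z's at a fixed alpha and compares it against negative eta sigma squared dJ/d-alpha; the analytic gradient is only used for the comparison, not by the update.]

```python
import numpy as np

def J(alpha):
    return (alpha[0] - 1.0) ** 2 + 2.0 * (alpha[1] + 0.5) ** 2

def grad_J(alpha):
    """Analytic gradient of the made-up cost, used only to check the expectation."""
    return np.array([2.0 * (alpha[0] - 1.0), 4.0 * (alpha[1] + 0.5)])

rng = np.random.default_rng(0)
alpha = np.array([2.0, 2.0])
eta, sigma = 0.5, 0.1
updates = np.zeros((200000, 2))
for k in range(len(updates)):
    z = sigma * rng.standard_normal(2)
    updates[k] = -eta * (J(alpha + z) - J(alpha)) * z
print(updates.mean(axis=0))                # empirical  E[delta alpha]
print(-eta * sigma**2 * grad_J(alpha))     # predicted -eta sigma^2 dJ/dalpha
```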
So there are a couple of interesting properties to this. Here, you see we still have to do two evaluations to get the update, right? If we want to cancel out that J of alpha term, we're going to have to evaluate twice. Now, it doesn't matter how many dimensions we have-- we only have to evaluate it twice. But we still have to evaluate it two times. And the question is, well, what happens if you don't evaluate it at J of alpha? What happens if you only evaluate it once? Well, that's a very common thing to do, actually, and it doesn't affect your expectation at all.

Lots of times, instead of this perfect baseline where you evaluate it, people average the last several evaluations to get that baseline-- oh, sorry, I don't think I defined baseline. This term right here, whatever it is, is your baseline. Now, it doesn't have to be J of alpha. It can be an exponentially decaying average of your last several evaluations. That's going to be approximately J of alpha. It won't be perfect, but the point is that it's not going to affect the expectation, and we're going to see that.

Maybe you'd expect that you need to get rid of that term for you to still move in the direction of your gradient. Because you can imagine, if you don't have that, if you don't know that term, and the cost is always positive-- I'll draw a diagram to make this clear. If you don't have that, and you're here-- let's say I just make that baseline 0. If I evaluate here, that's going to be a positive number, so I'm moving in the opposite direction. If I evaluate here, it's also going to be a positive number, so you're going to move in the opposite direction. So maybe you think, oh, without that baseline we could be in bad shape.
But actually, you'll move more in this direction when you do that sample than you move in this direction when you do the other sample. And so that scaling-- the fact that you move proportional to how big the change in your cost is-- means that in expectation, you'll still move in the direction of the true gradient. Now, in practice you won't do as well, and it makes sense that you won't do as well-- when you think about it, that's going to be bouncing all around crazily. But it'll still move in the direction of the gradient. And you don't just have to take my word for that.

If you look at this update again, we can do the linear expansion again, and you'll get negative eta times (dJ/d-alpha transpose z, plus, say, some scalar e-- the error in the baseline) times z. And that e is uncorrelated with the noise. That's the important thing: it's uncorrelated with the noise z. Now, use expectation again. Expectation is linear. So we have the expectation of the first term-- that's the same as it was before, that's the gradient. And then we have the expectation of negative eta e z. Now, e is uncorrelated with the noise, and these are both scalars, so you can pull them apart. The expectation of z-- it's mean 0. So this won't affect it at all.

So really, your expected update will not depend at all on what you use here. You could put a constant there. You could put in the exact value. You could put in some decaying average-- anything you want. It will still move, in expectation, in the right direction. But in practice, it can make a huge difference. I don't know if anyone's implemented these things on-- but a good baseline can be the difference between success and getting completely stuck and not moving anywhere. So if you do small updates, you should still be OK. But performance can depend a lot on getting a good baseline. Or it can depend a lot-- sometimes it doesn't matter.

Right. So again, a common thing to do here is that you're evaluating, and you're updating.
Let's say every time I do one evaluation, I update. If I take my last 10 evaluations and average them with decaying weights, so that the most recent one is the most heavily weighted, then you'll get an approximation of what the cost should be around here. And then I update based on that. And that way, you don't have to evaluate twice every time, so you can actually get improved performance. And it's still going to work.

And another cool thing-- this is when we go back to our assumptions about being deterministic. It doesn't have to be deterministic, either. Let's say, in the same way, we put noise into the evaluation-- again, a scalar noise w. Oh, I just got color. Now, that's going to show up in here again. Now it's a random variable, so it has an expectation. But if they're uncorrelated, we can split them up. That term will be equal to negative eta times the expectation of w times the expectation of z. Now, we know that z is mean 0 again, so that's 0. So it's not going to affect it either. We're still going to get the gradient term. And so you can add additive random noise, and you'll still move, in expectation, in the direction of the gradient.

So that's sort of cool. This is quite robust. You can have these errors in the baseline, you can have noisy evaluations, you can have all sorts of these things, and still, in expectation, it will move in the right direction. So that's nice. We're going to see that that has a lot of practical benefits.
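[A sketch of that one-evaluation-per-update variant, with an exponentially decaying average of past costs standing in for J of alpha as the baseline, and mean-zero noise added to every evaluation; the smoothing factor, noise level, and cost are all illustrative assumptions.]

```python
import numpy as np

def J_noisy(alpha, rng):
    """Hypothetical noisy black-box evaluation: true cost plus mean-zero measurement noise w."""
    true_cost = (alpha[0] - 1.0) ** 2 + 2.0 * (alpha[1] + 0.5) ** 2
    return true_cost + 0.05 * rng.standard_normal()

rng = np.random.default_rng(0)
alpha = np.array([2.0, 2.0])
eta, sigma = 0.5, 0.1
baseline = J_noisy(alpha, rng)                    # seed the baseline with one evaluation
for step in range(5000):
    z = sigma * rng.standard_normal(alpha.shape)
    cost = J_noisy(alpha + z, rng)                # the only evaluation for this update
    alpha = alpha - eta * (cost - baseline) * z
    baseline = 0.9 * baseline + 0.1 * cost        # decaying average of recent evaluations
print(alpha)                                      # still ends near [1.0, -0.5], with some jitter
```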
Is everybody with me here? I don't know if I went through this quickly or if-- everyone's sort of being quiet. They look sort of--

AUDIENCE: Is w the baseline there?

JOHN W. ROBERTS: No, no, sorry. This w-- I changed it to noise. Sorry, this is noise. Maybe you'd prefer it to be called xi or something like that. But this is just added noise. So you could say that w is drawn from-- it doesn't really matter what the distribution is, as long as it's uncorrelated. We could say it's drawn from some other Gaussian. And its expectation-- I mean, the expectation of this really can be 0, too. Because if it's not mean-0 noise, then you might as well just put that mean into your cost function and make it mean 0 again, right? Yes?

AUDIENCE: So the idea is to add this into the term J alpha? Or replace the term J alpha with a different baseline?

JOHN W. ROBERTS: Replace it, right.

AUDIENCE: OK. And then so what cancels-- so when we talk about the Taylor expansion? What cancels-- what--

JOHN W. ROBERTS: Nothing. Nothing cancels it. You see, that's the thing. Yeah, so I put an e here-- maybe I'm reusing too many things.

AUDIENCE: Oh, is it that J alpha is also uncorrelated with z?

JOHN W. ROBERTS: Well, J alpha is just a scalar, right? I mean, it is some number. So it is--

AUDIENCE: z is mean 0, so.

JOHN W. ROBERTS: Yeah, so z is mean 0. So whether we put in J alpha, or we put in an estimate of J alpha that has some error-- then our J alpha minus this is going to be some number, and it doesn't matter. If we just put in nothing at all, then our error is sort of that whole J alpha term. That J alpha term is just, again, some number that's uncorrelated, so the expectation gets rid of it. Does that make sense? Everyone looks sort of just--

AUDIENCE: So actually, putting another constant in that equation for the update makes you move more in some random z-direction. But on average, you're still going down the gradient the same way.

JOHN W. ROBERTS: Yeah. I mean, you can move more, yeah. If you put some giant constant in every time you update, maybe you'll bounce around farther.
But on average, you'll still move in the right direction, because you'll move farther in the right direction than you move in the wrong direction. So they sort of cancel out. So everybody is on board here? OK. I just really want you to--

AUDIENCE: Why wouldn't you include the actual J alpha?

JOHN W. ROBERTS: Well, because if you get it by evaluating the function-- if you run a policy, it can be expensive to get that J alpha, right? For example, I used this in some work I did where we had this flapping thing. I'll show you videos of it-- maybe I'll start setting that up right now. But we have this flapping system. And we've sort of souped it up now so it's a bit quicker, but it used to be that every time I wanted to evaluate the function, I had to sit there for 4 minutes and have this plate flap in this water and measure how quickly it was going, all these things. And so to evaluate that function once took me 4 minutes. So avoiding evaluations is important. And if you can just take your several previous evaluations and average them together-- now, it's not going to be perfect, but maybe it's an OK estimate, and then you don't have to spend any more time. And so in that sense, it's cheaper.

Please ask as many questions as possible, because this is--

AUDIENCE: But at some point you have to measure every time, right?

JOHN W. ROBERTS: You have to. Yeah, you have to measure every time when you want to do an update. But the thing is that-- here, let me draw a tiny one. The question is, if I have some estimate of that-- let's say my current alpha is here. Now, I need to randomly sample something, so I have to do that evaluation. Now, the question is, do I have to evaluate it here, too? Because this is my J of alpha. Do I evaluate that? Now, I could estimate this, because I have a bunch of other evaluations from however I got here, right?
So I've already evaluated. If I average those together, I'll get a pretty good idea of what this is. If I wanted to get it exactly, I'd have to run my system here, and then run it again here. And so every update would require two evaluations as opposed to just one. Now, sometimes it still makes sense to do that evaluation, though. Depending on how your system is-- if it's really noisy, if you have to do really big updates-- it makes sense.

AUDIENCE: [INAUDIBLE] using this delta alpha would you calculate [INAUDIBLE]?

JOHN W. ROBERTS: Pardon [INAUDIBLE]?

AUDIENCE: Yes.

JOHN W. ROBERTS: I'm sorry, I didn't hear what you said.

AUDIENCE: This new alpha that we have, that we have the [INAUDIBLE] before--

JOHN W. ROBERTS: This one? Yeah.

AUDIENCE: You calculate it by having a previous alpha, and then we did this thing, and--

JOHN W. ROBERTS: And I moved in that direction, right.

AUDIENCE: Right. But you're saying that you don't want to calculate the value for this new alpha. Instead we use, for example, the past 10 evaluations of J of alpha, and use that as your estimate.

JOHN W. ROBERTS: Yeah. You're saying that doesn't make sense to you?

AUDIENCE: It does make sense. In some cases I can think [INAUDIBLE] actually [INAUDIBLE] if the change-- a small change in alpha would have a huge effect on the end value [INAUDIBLE] from J-- like, if you have a very discrete-- like, [INAUDIBLE] condition pass over [INAUDIBLE].

JOHN W. ROBERTS: If you move very violently, yeah. So that's a good example in practice. I mean, there are things that we have in the theory, like this expectation stuff, and there are things that I've applied to several systems. And in practice, when you have really bad policies, and you need to move really far in state space-- let's say that right now you're trying to swing up a cart-pole, and you're not going anywhere near the top.
And your reward function doesn't have very smooth gradients, so you can't just swing up bit by bit. Well, a good thing to do is to put in possibly very big noise, a very big eta, and then do these two evaluations. Because it's going to change so much every time you do it-- for example, if you jump and suddenly you're doing a lot better, then your previous average is not going to be representative. And then you can actually bounce around. You can bounce around so violently in this big space of policies that you never improve, right? Maybe I should draw a diagram to make what I'm saying more clear. But the key thing is that, yeah, if you're making these really big jumps, and your cost is changing a lot every time, and you still want to move in the right direction, doing two evaluations can make sense. Because if you're stuck where you don't have good gradients in your cost function, a bunch of little updates that would slowly climb aren't going to give you anything, because maybe it's not even differentiable. Maybe you have some sort of discrete way of measuring reward, like how many time steps you spend in some goal region or something, and if you don't have any time steps there, there's no gradient at all right now. And so you need to be violent enough in your policy changes that you eventually get to where you're in that goal region. And once you get into that goal region, now you have some gradients and you're in good shape.

So that's actually another thing I was going to talk about. Designing your cost function is extremely important. There are cost functions that can be extremely poor, that this can work really poorly on. And there are cost functions that can make it a lot easier. So if you have a cost function which is relatively smooth-- ideally it doesn't have this sort of banana problem-- if it's relatively similar in all the different parameters, it can work a lot better.
806 00:35:03,950 --> 00:35:06,840 And you can sort of formulate the same task lots of ways, 807 00:35:06,840 --> 00:35:09,512 since lots of times your cost function isn't what you really 808 00:35:09,512 --> 00:35:10,220 want to optimize. 809 00:35:10,220 --> 00:35:12,750 It's just a proxy for trying to get something done. 810 00:35:12,750 --> 00:35:14,410 That's what Russ talked about when he said he didn't care about optimality. 811 00:35:14,410 --> 00:35:15,993 It's like, here's a cost function that 812 00:35:15,993 --> 00:35:17,785 gives us a means of solving how to do this. 813 00:35:17,785 --> 00:35:20,035 And so there's sort of a whole bunch of cost functions 814 00:35:20,035 --> 00:35:23,070 you can imagine coming up with that try to encapsulate that task. 815 00:35:23,070 --> 00:35:25,487 Now, if you come up with-- for the perch one, for example, 816 00:35:25,487 --> 00:35:28,535 this plane perching, which is a difficult problem, 817 00:35:28,535 --> 00:35:30,410 and a problem where the models are very bad-- 818 00:35:30,410 --> 00:35:32,300 I mean, the aerodynamic models of this plane 819 00:35:32,300 --> 00:35:34,110 flying like that are extremely poor. 820 00:35:34,110 --> 00:35:36,110 And we have-- we actually have some decent ones. 821 00:35:36,110 --> 00:35:37,890 We spent a lot of work trying to get decent ones. 822 00:35:37,890 --> 00:35:39,410 But sort of the high-fidelity kind of region, 823 00:35:39,410 --> 00:35:41,285 where you really want to just get at the end, 824 00:35:41,285 --> 00:35:42,890 it's hard to model that. 825 00:35:42,890 --> 00:35:45,093 So the thing is that, what if you had a cost 826 00:35:45,093 --> 00:35:46,760 function, like what we really care about 827 00:35:46,760 --> 00:35:47,640 is hitting that perch. 828 00:35:47,640 --> 00:35:49,890 So let's say that we give you a 1 if you hit the perch 829 00:35:49,890 --> 00:35:51,165 and a 0 everywhere else. 830 00:35:51,165 --> 00:35:52,790 Now, that means until we hit the perch, 831 00:35:52,790 --> 00:35:53,720 we're getting no information. 832 00:35:53,720 --> 00:35:54,780 We could be getting really close, 833 00:35:54,780 --> 00:35:55,450 we could be really far away. 834 00:35:55,450 --> 00:35:57,080 It's not going to tell us anything. 835 00:35:57,080 --> 00:35:59,780 Now, a lot of reinforcement learning actually 836 00:35:59,780 --> 00:36:02,630 has these sorts of rewards, these sort of delayed rewards 837 00:36:02,630 --> 00:36:03,440 where you get it here, and then you 838 00:36:03,440 --> 00:36:04,940 have to sort of propagate that back. 839 00:36:04,940 --> 00:36:07,030 When you're trying to accomplish a task like that, 840 00:36:07,030 --> 00:36:08,937 that doesn't necessarily work that well. 841 00:36:08,937 --> 00:36:10,520 If you measure something like distance 842 00:36:10,520 --> 00:36:12,290 from the perch or distance from your desired state, 843 00:36:12,290 --> 00:36:14,160 if you get a little bit closer to your desired state, 844 00:36:14,160 --> 00:36:15,660 you sort of get a little bit better. 845 00:36:15,660 --> 00:36:17,570 And then you can measure the gradient. 846 00:36:17,570 --> 00:36:19,970 And so that will make a big difference, right?
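To make that contrast concrete, here is a minimal sketch in Python (not from the lecture; the perch location, the hit threshold, and the test point are made-up numbers) comparing a hit-the-perch indicator cost with a distance-shaped cost. Perturbing the policy when you are far from the perch changes only the shaped cost, so only the shaped cost gives the learner a direction to move in.

    import numpy as np

    x_perch = np.array([0.0, 0.0])    # hypothetical perch state

    def cost_sparse(x_final):
        # 1 everywhere except a small window around the perch: no signal until you hit it
        return 0.0 if np.linalg.norm(x_final - x_perch) < 0.05 else 1.0

    def cost_shaped(x_final):
        # distance to the perch: getting a little closer always shows up in the cost
        return float(np.linalg.norm(x_final - x_perch))

    x = np.array([1.0, 0.5])          # final state of some rollout, far from the perch
    dx = 0.01 * np.random.randn(2)    # effect of a small policy perturbation on the final state
    print(cost_sparse(x + dx) - cost_sparse(x))   # 0.0 almost surely: nothing to learn from
    print(cost_shaped(x + dx) - cost_shaped(x))   # small but nonzero: an informative difference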
847 00:36:19,970 --> 00:36:21,590 And so if you had something where 848 00:36:21,590 --> 00:36:23,660 you have a region of state space where you 849 00:36:23,660 --> 00:36:25,610 have a good gradient in your cost function, 850 00:36:25,610 --> 00:36:27,710 and you're out here, and not getting a gradient, 851 00:36:27,710 --> 00:36:28,655 the little perturbations you're going 852 00:36:28,655 --> 00:36:30,650 to make sort of random walk and give you 853 00:36:30,650 --> 00:36:33,140 no update at all, because you may get no change. 854 00:36:33,140 --> 00:36:34,340 But if you do really big ones, maybe you'll 855 00:36:34,340 --> 00:36:35,750 bounce into this region where 856 00:36:35,750 --> 00:36:36,875 you're getting some reward. 857 00:36:36,875 --> 00:36:38,333 And in that case, these updates are 858 00:36:38,333 --> 00:36:40,140 so big that averaging doesn't make sense, 859 00:36:40,140 --> 00:36:41,932 a baseline still gives you a big advantage, 860 00:36:41,932 --> 00:36:44,300 and maybe two evaluations is worth it. 861 00:36:44,300 --> 00:36:45,860 In some of the flapping stuff I did, 862 00:36:45,860 --> 00:36:47,660 I did two evaluations when 863 00:36:47,660 --> 00:36:49,640 I was moving very violently, because averaging 864 00:36:49,640 --> 00:36:50,630 didn't work that well. 865 00:36:50,630 --> 00:36:53,973 And getting a good baseline was worth the extra time. 866 00:36:53,973 --> 00:36:55,640 But when we ended up getting it working, 867 00:36:55,640 --> 00:36:58,700 we put it online, and we actually-- we updated it 868 00:36:58,700 --> 00:36:59,750 every time we flapped. 869 00:36:59,750 --> 00:37:02,630 So it was just 1 second, flap, update, flap, update. 870 00:37:02,630 --> 00:37:04,070 And that way, we pretty much were 871 00:37:04,070 --> 00:37:06,470 able to sort of cut our time in half, because our policies were 872 00:37:06,470 --> 00:37:08,710 very similar, so our average was a pretty good estimate. 873 00:37:08,710 --> 00:37:10,640 It's so noisy that one evaluation, anyway, 874 00:37:10,640 --> 00:37:13,265 isn't necessarily that great of an estimate of your local value 875 00:37:13,265 --> 00:37:14,630 function. 876 00:37:14,630 --> 00:37:15,130 And so yeah. 877 00:37:15,130 --> 00:37:16,463 We just did an average baseline. 878 00:37:16,463 --> 00:37:18,530 And that's sort of half the running time, right? 879 00:37:18,530 --> 00:37:20,310 And so it can be a big one. 880 00:37:20,310 --> 00:37:22,670 And so there's a lot of details when you implement it 881 00:37:22,670 --> 00:37:24,540 about the right way to sort of put this together, 882 00:37:24,540 --> 00:37:26,620 depending on what your cost function is, and how good 883 00:37:26,620 --> 00:37:28,070 of an initial policy you have-- what your initial condition 884 00:37:28,070 --> 00:37:29,450 and your policy are. 885 00:37:29,450 --> 00:37:31,490 But yeah, there's a lot of factors like that. 886 00:37:34,830 --> 00:37:35,330 All right. 887 00:37:35,330 --> 00:37:42,530 So now we can do some of-- 888 00:37:46,360 --> 00:37:46,860 sorry. 889 00:37:51,180 --> 00:37:53,970 I can do an example of this. 890 00:37:53,970 --> 00:37:56,430 So I keep on talking about this flapping system. 891 00:37:56,430 --> 00:38:02,460 That's what I worked on for my master's thesis. 892 00:38:02,460 --> 00:38:05,160 And so that's sort of what my brain always goes back to, 893 00:38:05,160 --> 00:38:07,035 particularly since we used all these methods. 894 00:38:07,035 --> 00:38:07,980 But all right.
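Before the demo, a minimal sketch of the kind of update being discussed, written here in Python rather than the code actually run in lecture; the function names, the argument list, and the exact scaling of the step are assumptions. The point is the two baseline options it exposes: average your recent unperturbed costs (one rollout per update), or spend a second rollout to evaluate the cost at the current alpha.

    import numpy as np

    def wp_update(alpha, run_cost, eta, sigma, recent_costs=None):
        """One weight-perturbation step on parameters alpha.
        run_cost(alpha) -> scalar cost J of a single rollout."""
        z = np.random.randn(*alpha.shape)          # random perturbation direction
        J_perturbed = run_cost(alpha + sigma * z)  # one rollout with the perturbed policy
        if recent_costs:                           # cheap baseline: average of recent costs
            b = float(np.mean(recent_costs))
        else:                                      # "true" baseline: a second rollout at alpha
            b = run_cost(alpha)
        alpha_new = alpha - eta * (J_perturbed - b) * z   # move downhill in expectation
        return alpha_new, J_perturbed

With the averaged baseline you keep, say, the last ten values of J_perturbed in recent_costs and pay one rollout per update; passing recent_costs=None buys the exact baseline at the price of two rollouts per update, which is the tradeoff described above.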
895 00:38:07,980 --> 00:38:12,330 So now I wonder if I can do Russ' thing where 896 00:38:12,330 --> 00:38:14,822 he makes the font really big. 897 00:38:14,822 --> 00:38:16,530 That's also-- the thing I'm about to run, 898 00:38:16,530 --> 00:38:21,240 it's this relatively simple lumped parameter simulation 899 00:38:21,240 --> 00:38:23,400 of the flapping system. 900 00:38:23,400 --> 00:38:26,320 This is a lumped parameter model of-- let me show you, 901 00:38:26,320 --> 00:38:27,600 it's pretty cool-- 902 00:38:27,600 --> 00:38:32,580 of this system, which a guy at NYU named Jun [? Zhang ?] 903 00:38:32,580 --> 00:38:37,150 built-- a robot that effectively models flapping 904 00:38:37,150 --> 00:38:37,650 flight. 905 00:38:37,650 --> 00:38:39,567 It's a very simple model. 906 00:38:39,567 --> 00:38:40,900 I'll show it to you in a second. 907 00:38:40,900 --> 00:38:42,442 But it has a lot of the same dynamics 908 00:38:42,442 --> 00:38:46,110 and a lot of the same issues as sort of a bird. 909 00:38:46,110 --> 00:38:51,840 So the system, it's a sort of a rigid plate. 910 00:38:51,840 --> 00:38:57,250 Well, the one you see here, we attached a rubber tail to it. 911 00:38:57,250 --> 00:38:59,250 But the one-- most of these results 912 00:38:59,250 --> 00:39:05,450 are actually on a rigid plate, where it heaves up and down, 913 00:39:05,450 --> 00:39:11,470 and what we can do is control the motion it follows. 914 00:39:11,470 --> 00:39:13,310 I hope that the camera can see it. 915 00:39:20,816 --> 00:39:22,798 AUDIENCE: [INAUDIBLE] moonlight [INAUDIBLE]. 916 00:39:22,798 --> 00:39:23,715 JOHN W. ROBERTS: Mood? 917 00:39:23,715 --> 00:39:24,270 AUDIENCE: Moonlight. 918 00:39:24,270 --> 00:39:25,562 JOHN W. ROBERTS: Oh, moonlight. 919 00:39:25,562 --> 00:39:26,840 I was like, mood lighting? 920 00:39:26,840 --> 00:39:27,830 OK. 921 00:39:27,830 --> 00:39:31,000 Make my lecture more enjoyable. 922 00:39:31,000 --> 00:39:32,360 All right. 923 00:39:32,360 --> 00:39:35,600 So this is the system. 924 00:39:35,600 --> 00:39:37,100 You can see we drive it up and down. 925 00:39:37,100 --> 00:39:41,120 That big cylindrical disk right there is the load cell. 926 00:39:41,120 --> 00:39:42,890 So that measures the force we're applying. 927 00:39:42,890 --> 00:39:45,182 And then what we do is we control this vertical motion. 928 00:39:45,182 --> 00:39:47,223 How we control it is-- that's an important thing. 929 00:39:47,223 --> 00:39:49,460 I talked about how the cost function matters a lot. 930 00:39:49,460 --> 00:39:51,290 Well, another thing that matters a lot 931 00:39:51,290 --> 00:39:53,930 is the parameterization of your policy. 932 00:39:53,930 --> 00:39:57,860 Now, in the last few problems we had open-loop policies, 933 00:39:57,860 --> 00:39:58,860 which are pretty simple. 934 00:39:58,860 --> 00:40:01,580 You have like 251 parameters or something like that, right? 935 00:40:01,580 --> 00:40:04,730 Now, when you're doing gradient descent using back 936 00:40:04,730 --> 00:40:06,682 prop or SNOPT, you have the exact gradient. 937 00:40:06,682 --> 00:40:08,390 It's cheap to compute the exact gradient, 938 00:40:08,390 --> 00:40:10,265 so you can sort of follow this pretty nicely.
939 00:40:10,265 --> 00:40:13,130 But when you do stochastic gradient descent, 940 00:40:13,130 --> 00:40:15,298 the probability of being perpendicular sort 941 00:40:15,298 --> 00:40:17,840 of to your gradient, or nearly perpendicular to the gradient, 942 00:40:17,840 --> 00:40:20,180 increases as the number of parameters goes up. 943 00:40:20,180 --> 00:40:22,183 So you can think, if you're on-- 944 00:40:22,183 --> 00:40:23,600 if you're doing a 1D thing, you're 945 00:40:23,600 --> 00:40:25,490 always going to move pretty much-- it doesn't 946 00:40:25,490 --> 00:40:26,310 matter if you move in the right direction 947 00:40:26,310 --> 00:40:27,190 or the wrong direction. 948 00:40:27,190 --> 00:40:29,000 That's one of the benefits of this instead of that hill 949 00:40:29,000 --> 00:40:29,810 climbing. 950 00:40:29,810 --> 00:40:31,370 But you're always trying to get moving in the right direction 951 00:40:31,370 --> 00:40:32,480 to get this measurement. 952 00:40:32,480 --> 00:40:34,140 Does that make sense? 953 00:40:34,140 --> 00:40:36,470 If you think in 2D, you have the circle. 954 00:40:36,470 --> 00:40:38,650 You're going to be moving around. 955 00:40:38,650 --> 00:40:39,902 You're going to be along-- 956 00:40:39,902 --> 00:40:42,110 close to the direction of your gradient pretty often. 957 00:40:42,110 --> 00:40:44,540 A sphere, it's a lot easier to be pretty far away. 958 00:40:44,540 --> 00:40:48,770 I mean, sort of a lot more of the samples you do 959 00:40:48,770 --> 00:40:51,440 are going to be relatively perpendicular 960 00:40:51,440 --> 00:40:53,120 to your true gradient. 961 00:40:53,120 --> 00:40:55,070 And as your dimensionality gets very high, 962 00:40:55,070 --> 00:40:57,290 a lot of your samples are relatively perpendicular. 963 00:40:57,290 --> 00:40:58,210 And the thing is that whether you 964 00:40:58,210 --> 00:41:00,180 go in the right direction or wrong direction doesn't matter. 965 00:41:00,180 --> 00:41:01,670 You'll get the same information either way. 966 00:41:01,670 --> 00:41:03,128 Going perpendicular to the gradient 967 00:41:03,128 --> 00:41:04,850 gives you no information. 968 00:41:04,850 --> 00:41:07,220 Because you'll get no change, and there's no update. 969 00:41:07,220 --> 00:41:10,490 So it's still-- the curse of dimensionality 970 00:41:10,490 --> 00:41:12,410 is alive and well. 971 00:41:12,410 --> 00:41:14,210 And very high-dimensional policies 972 00:41:14,210 --> 00:41:15,380 can be slower to learn. 973 00:41:15,380 --> 00:41:18,380 And so those 251-dimensional policies you used 974 00:41:18,380 --> 00:41:20,390 may not be the best representation, 975 00:41:20,390 --> 00:41:21,830 because they sort of-- 976 00:41:21,830 --> 00:41:25,580 I mean, you probably don't need that many parameters 977 00:41:25,580 --> 00:41:27,330 to represent what you want to do. 978 00:41:27,330 --> 00:41:29,293 So for this, what we had-- 979 00:41:29,293 --> 00:41:30,710 and this made a big difference, we 980 00:41:30,710 --> 00:41:33,002 tried different things, this one worked really nicely-- 981 00:41:33,002 --> 00:41:34,100 was a spline. 982 00:41:34,100 --> 00:41:39,050 So we said, all right, if you have time, 983 00:41:39,050 --> 00:41:41,780 I'm going to set the final time here. 984 00:41:41,780 --> 00:41:43,580 Now that's a parameter, too. 985 00:41:43,580 --> 00:41:45,620 Then this is the z height. 986 00:41:45,620 --> 00:41:48,060 It's in millimeters or whatever you want.
987 00:41:48,060 --> 00:41:49,550 And I'm going to say, OK, I'm going 988 00:41:49,550 --> 00:41:51,920 to force it to be at the beginning, in the middle, 989 00:41:51,920 --> 00:41:54,420 and at the end-- wow, that's nowhere near the middle, is it? 990 00:41:58,090 --> 00:42:02,300 I shouldn't be a carpenter in the 1200s. 991 00:42:02,300 --> 00:42:03,500 So what do we do then? 992 00:42:03,500 --> 00:42:05,258 We then have five parameters-- now, we've 993 00:42:05,258 --> 00:42:07,550 done several versions, but simple one right here-- five 994 00:42:07,550 --> 00:42:09,200 parameters that define a spline. 995 00:42:09,200 --> 00:42:10,700 So this is going to be smooth. 996 00:42:10,700 --> 00:42:13,100 You can enforce to be a periodic spline, which 997 00:42:13,100 --> 00:42:16,700 means that the knot at the end, the connection here, is 998 00:42:16,700 --> 00:42:18,570 continuously differentiable as well. 999 00:42:18,570 --> 00:42:22,040 And then we force that this parameter-- so this number p1, 1000 00:42:22,040 --> 00:42:27,180 this one is going to be the opposite of it. 1001 00:42:27,180 --> 00:42:28,550 So it's a negative p1. 1002 00:42:28,550 --> 00:42:30,240 And that's true for all these. 1003 00:42:30,240 --> 00:42:35,930 So this way, we have this relatively rich policy class 1004 00:42:35,930 --> 00:42:37,980 that has sort of the right kind of properties. 1005 00:42:37,980 --> 00:42:39,955 But we do it with only five parameters. 1006 00:42:39,955 --> 00:42:41,330 So you can imagine, if we want it 1007 00:42:41,330 --> 00:42:42,920 to be asymmetric top and bottom, that would 1008 00:42:42,920 --> 00:42:43,760 double our parameters. 1009 00:42:43,760 --> 00:42:45,885 And we probably wouldn't want to tie this guy to 0, 1010 00:42:45,885 --> 00:42:47,450 so we'd even add one more. 1011 00:42:47,450 --> 00:42:49,270 And when we have the amplitude, you 1012 00:42:49,270 --> 00:42:51,230 can either fix it or make it free. 1013 00:42:51,230 --> 00:42:52,640 I can add another parameter. 1014 00:42:52,640 --> 00:42:54,200 So you can see that as you add this richness, 1015 00:42:54,200 --> 00:42:55,580 you're going to add all these different parameters. 1016 00:42:55,580 --> 00:42:57,718 But getting-- using a spline rather than-- 1017 00:42:57,718 --> 00:43:00,260 this is the height right now, this is the height right then-- 1018 00:43:00,260 --> 00:43:01,340 it's a huge advantage. 1019 00:43:01,340 --> 00:43:02,240 Because what's the chance that you're 1020 00:43:02,240 --> 00:43:04,698 going to want it to move very violently on a sort of like 1 1021 00:43:04,698 --> 00:43:05,750 dt time scale? 1022 00:43:05,750 --> 00:43:06,950 And if you try to do that, you could actually 1023 00:43:06,950 --> 00:43:07,742 damage your system. 1024 00:43:07,742 --> 00:43:09,200 Some of the policies that I-- when 1025 00:43:09,200 --> 00:43:11,030 I was working on this parameterization, 1026 00:43:11,030 --> 00:43:14,360 I had the load cell break off and fall into the tank once. 1027 00:43:14,360 --> 00:43:16,250 Luckily it broke off the wires and lost 1028 00:43:16,250 --> 00:43:20,510 its electric connection before it fell in there, but yeah. 1029 00:43:20,510 --> 00:43:24,890 So if you come up with a parameterization that 1030 00:43:24,890 --> 00:43:30,890 appropriately captures the kind of behaviors you expect to see, 1031 00:43:30,890 --> 00:43:32,630 it can be a lot faster to learn. 
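A rough sketch of that kind of low-dimensional waveform parameterization, in Python; the knot spacing, the pinning to zero at the ends and at the middle, and the use of scipy's periodic cubic spline are assumptions about one reasonable way to realize what's described, not the code actually used on the rig.

    import numpy as np
    from scipy.interpolate import CubicSpline

    def make_waveform(params, T=1.0):
        """Build a smooth periodic stroke z(t) on [0, T] from a handful of knot
        heights.  The second half-stroke is forced to be the negative of the
        first (z(t + T/2) = -z(t)), so a few numbers give a rich up/down shape."""
        p = np.asarray(params, dtype=float)
        n = len(p)
        t_half = np.linspace(0.0, T / 2, n + 2)            # knot times for the first half
        t = np.concatenate([t_half, T / 2 + t_half[1:]])   # mirrored times for the second half
        z = np.concatenate([[0.0], p, [0.0], -p, [0.0]])   # pinned to zero at 0, T/2, T
        return CubicSpline(t, z, bc_type="periodic")       # continuously differentiable at the wrap

    stroke = make_waveform([0.8, 1.2, 0.6, 0.9, 0.4])      # five order-one parameters
    ts = np.linspace(0.0, 1.0, 200)
    zs = stroke(ts)                                        # the commanded heave trajectory

Making the period T a free parameter, or letting the top and bottom halves differ, adds parameters in exactly the way described above: more richness, higher dimension, slower learning.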
1032 00:43:32,630 --> 00:43:36,495 Now, sort of the warning, then, is that you're only 1033 00:43:36,495 --> 00:43:37,370 going to be optimal-- 1034 00:43:37,370 --> 00:43:40,037 the only thing is, you're going to get to a local minimum in this sort 1035 00:43:40,037 --> 00:43:41,438 of parameterization space. 1036 00:43:41,438 --> 00:43:43,730 So if you parameterize-- if I were to parameterize this 1037 00:43:43,730 --> 00:43:48,260 by saying, OK, well I'm only going to let it be some-- 1038 00:43:48,260 --> 00:43:50,885 let's say I was going to do like a Fourier series kind of thing 1039 00:43:50,885 --> 00:43:56,750 and say, OK, it's this plus this plus this-- 1040 00:43:56,750 --> 00:43:58,265 now, that's not very rich. 1041 00:43:58,265 --> 00:43:59,390 It's only three parameters. 1042 00:43:59,390 --> 00:44:00,050 That's good. 1043 00:44:00,050 --> 00:44:01,490 But I'm going to do all sorts of things that are probably 1044 00:44:01,490 --> 00:44:02,407 extremely sub-optimal. 1045 00:44:02,407 --> 00:44:04,880 Now, it's still going to find the best kind of behavior, 1046 00:44:04,880 --> 00:44:06,470 or the locally best kind of behavior 1047 00:44:06,470 --> 00:44:09,920 it can, using this kind of policy. 1048 00:44:09,920 --> 00:44:11,210 But it could be quite bad. 1049 00:44:11,210 --> 00:44:14,340 So the actual optimum could be very different. 1050 00:44:14,340 --> 00:44:19,640 So your policy class, you'd like it to include the optimum. 1051 00:44:19,640 --> 00:44:21,180 And so that sort of is-- it depends 1052 00:44:21,180 --> 00:44:21,970 on what the question is. 1053 00:44:21,970 --> 00:44:24,210 You sort of have to just have a feel for what is a good policy 1054 00:44:24,210 --> 00:44:24,710 class. 1055 00:44:24,710 --> 00:44:26,850 How do I get [? my ?] dimension as low as possible, 1056 00:44:26,850 --> 00:44:29,430 while still having the richness to represent a wide variety 1057 00:44:29,430 --> 00:44:32,440 of viable policies? 1058 00:44:32,440 --> 00:44:35,370 So when you're trying to implement these things, 1059 00:44:35,370 --> 00:44:38,910 that can make a big difference. 1060 00:44:38,910 --> 00:44:39,410 So yeah. 1061 00:44:39,410 --> 00:44:40,440 So we set that up. 1062 00:44:40,440 --> 00:44:44,920 And we could control the shape of that curve. 1063 00:44:44,920 --> 00:44:47,770 And so that is the policy parameterization we chose. 1064 00:44:47,770 --> 00:44:51,960 So going back to this code here. 1065 00:44:51,960 --> 00:44:58,860 Now, I think I can just run this here. 1066 00:44:58,860 --> 00:45:02,580 This is going to be doing that bit we talked about on-- 1067 00:45:02,580 --> 00:45:06,650 again, a simple lumped parameter model of that flapping system. 1068 00:45:06,650 --> 00:45:08,766 So here's our curve. 1069 00:45:08,766 --> 00:45:13,485 It's this, you see this-- well, this 1070 00:45:13,485 --> 00:45:17,653 is the forward motion of the thing as it's flapping. 1071 00:45:17,653 --> 00:45:18,820 This is the vertical motion. 1072 00:45:18,820 --> 00:45:20,640 So this is sort of the waveform it's following. 1073 00:45:20,640 --> 00:45:21,810 This is where it is in x position. 1074 00:45:21,810 --> 00:45:24,120 You can see it sort of goes fast, bounces around-- 1075 00:45:24,120 --> 00:45:26,200 sorry, this is the speed, not the position. 1076 00:45:26,200 --> 00:45:27,930 So you can see it accelerates from 0, 1077 00:45:27,930 --> 00:45:29,730 and then as it's pumping, it sort of oscillates a bit.
1078 00:45:29,730 --> 00:45:31,200 In practice, there's more inertia and everything, 1079 00:45:31,200 --> 00:45:33,325 so you don't see these high-frequency oscillations. 1080 00:45:33,325 --> 00:45:36,450 But this is just a relatively simple, explicit model. 1081 00:45:36,450 --> 00:45:38,100 This is the shape we follow. 1082 00:45:38,100 --> 00:45:39,900 So we're following that curve. 1083 00:45:39,900 --> 00:45:42,040 And we have a little bit of noise to it. 1084 00:45:42,040 --> 00:45:47,065 And let me-- so now we're going to perturb it, measure again. 1085 00:45:47,065 --> 00:45:50,965 Try to measure again, and boom, here we are. 1086 00:45:50,965 --> 00:45:52,090 We got a little bit better. 1087 00:45:52,090 --> 00:45:54,215 This is our reward, and then we did another sample, 1088 00:45:54,215 --> 00:45:55,440 and that's our reward. 1089 00:45:55,440 --> 00:45:57,120 Let's do it again, better. 1090 00:46:00,424 --> 00:46:04,560 You see we improve quite nicely. 1091 00:46:04,560 --> 00:46:07,800 And also, notice, relatively monotonically. 1092 00:46:07,800 --> 00:46:09,840 Now, you might be surprised by that. 1093 00:46:09,840 --> 00:46:11,470 Because even though we're moving-- 1094 00:46:11,470 --> 00:46:12,970 we have this sort of guarantee we'll 1095 00:46:12,970 --> 00:46:14,910 move within 90 degrees of the gradient. 1096 00:46:14,910 --> 00:46:17,190 That's what I was talking about sort of with, 1097 00:46:17,190 --> 00:46:20,037 you'll always be within 90 degrees if it's deterministic. 1098 00:46:20,037 --> 00:46:21,120 And this is deterministic. 1099 00:46:21,120 --> 00:46:24,330 But it also sort of is this linear kind of interpretation, 1100 00:46:24,330 --> 00:46:24,900 right? 1101 00:46:24,900 --> 00:46:29,130 So as you run it, you'd imagine that you could perturb yourself 1102 00:46:29,130 --> 00:46:31,380 far enough that you got worse. 1103 00:46:31,380 --> 00:46:32,880 Now, the reason that's not happening 1104 00:46:32,880 --> 00:46:35,310 is because I'm perturbing myself very small amounts, 1105 00:46:35,310 --> 00:46:37,010 and I'm updating very small amounts. 1106 00:46:37,010 --> 00:46:38,550 So all this sort of linear analysis is appropriate. 1107 00:46:38,550 --> 00:46:40,380 And actually, you can see what I talked about that, 1108 00:46:40,380 --> 00:46:43,005 that you always get pretty close to the true gradient is there. 1109 00:46:43,005 --> 00:46:45,090 Sometimes it moves up a lot, sometimes it's steep, 1110 00:46:45,090 --> 00:46:48,090 sometimes it moves up shallowly, but it does a pretty good job. 1111 00:46:48,090 --> 00:46:51,540 Now, we can change that and try to sabotage our little code 1112 00:46:51,540 --> 00:46:52,380 here. 1113 00:46:52,380 --> 00:46:54,180 Or sometimes you're OK, actually. 1114 00:46:54,180 --> 00:46:55,680 That's the thing, is that in practice lots of times 1115 00:46:55,680 --> 00:46:57,330 it's OK if it gets worse sometimes, 1116 00:46:57,330 --> 00:46:58,500 because allowing it to get worse, 1117 00:46:58,500 --> 00:46:59,917 being violent enough to get worse, 1118 00:46:59,917 --> 00:47:02,392 it'll reach the optimum a lot faster. 1119 00:47:02,392 --> 00:47:03,850 So here, this is our eta parameter. 1120 00:47:03,850 --> 00:47:10,470 Let's make it bigger a factor of-- let's make it 20.5. 1121 00:47:10,470 --> 00:47:11,817 I don't want to risk-- 1122 00:47:11,817 --> 00:47:12,900 [INAUDIBLE] not get worse. 1123 00:47:12,900 --> 00:47:13,824 AUDIENCE: Is that the noise or-- 1124 00:47:13,824 --> 00:47:14,286 JOHN W. 
ROBERTS: Pardon? 1125 00:47:14,286 --> 00:47:15,278 No, that is the update. 1126 00:47:15,278 --> 00:47:16,320 So the noise is the same. 1127 00:47:16,320 --> 00:47:17,070 This noise is still local. 1128 00:47:17,070 --> 00:47:18,590 But now we're jumping really far. 1129 00:47:18,590 --> 00:47:20,798 And so you can imagine, we're measuring the gradient. 1130 00:47:20,798 --> 00:47:21,848 We're moving really far. 1131 00:47:21,848 --> 00:47:23,640 And now where we've moved to, that gradient 1132 00:47:23,640 --> 00:47:25,050 may be a poor measurement of sort 1133 00:47:25,050 --> 00:47:28,090 of the update over that long of a scale. 1134 00:47:28,090 --> 00:47:31,510 So let's do this again. 1135 00:47:31,510 --> 00:47:32,950 This is always fun. 1136 00:47:32,950 --> 00:47:35,650 Oh, there you go, already. 1137 00:47:35,650 --> 00:47:36,290 That's better. 1138 00:47:38,985 --> 00:47:41,110 See now-- but you see, that's a huge increase then. 1139 00:47:41,110 --> 00:47:42,250 That's what I'm talking about, is that there's 1140 00:47:42,250 --> 00:47:43,210 sort of a sweet spot. 1141 00:47:43,210 --> 00:47:45,550 And you don't necessarily want monotonic increasing. 1142 00:47:45,550 --> 00:47:47,637 Like, there's limitations on how violent 1143 00:47:47,637 --> 00:47:49,720 you want it to be in practice, because on a robot, 1144 00:47:49,720 --> 00:47:52,350 a very violent policy could break your load 1145 00:47:52,350 --> 00:47:55,137 cell of and have it almost cost you $400. 1146 00:47:55,137 --> 00:47:56,470 So you don't do something crazy. 1147 00:47:56,470 --> 00:47:59,470 But there's also the willingness that-- oh, that's ugly. 1148 00:47:59,470 --> 00:48:02,500 But you see, I mean, if you bounce pretty far, 1149 00:48:02,500 --> 00:48:04,210 you can also get huge improvements. 1150 00:48:04,210 --> 00:48:06,460 And so there's sort of this-- 1151 00:48:06,460 --> 00:48:09,910 monotonicity in your increasing reward 1152 00:48:09,910 --> 00:48:12,550 is not necessarily the best way to learn, I suppose. 1153 00:48:12,550 --> 00:48:15,400 That's from the trenches. 1154 00:48:15,400 --> 00:48:18,040 I learned that the hard way through many, many hours 1155 00:48:18,040 --> 00:48:20,150 sitting in front of a machine. 1156 00:48:20,150 --> 00:48:24,115 So then the other thing that we can do is this eta. 1157 00:48:24,115 --> 00:48:26,800 Let's decrease eta. 1158 00:48:26,800 --> 00:48:28,492 And now let's make our sigma really big. 1159 00:48:28,492 --> 00:48:30,700 Now, this is going to be really crazy stuff probably. 1160 00:48:30,700 --> 00:48:32,140 But you see, now we're going to measure so far. 1161 00:48:32,140 --> 00:48:33,610 And we're going to get this sort of-- 1162 00:48:33,610 --> 00:48:34,480 we're going to try to measure the gradient, 1163 00:48:34,480 --> 00:48:36,903 but it's going to be just way off, because it's 1164 00:48:36,903 --> 00:48:39,070 moving so far that the local structure is completely 1165 00:48:39,070 --> 00:48:39,570 ignored. 1166 00:48:42,910 --> 00:48:44,880 Yeah. 1167 00:48:44,880 --> 00:48:47,130 I probably don't have to be nearly as dramatic as this 1168 00:48:47,130 --> 00:48:47,838 to make my point. 1169 00:48:47,838 --> 00:48:50,700 But, you know, it's just completely falling apart. 1170 00:48:50,700 --> 00:48:51,660 Yeah. 1171 00:48:51,660 --> 00:48:53,520 That's doing as badly as it can, I guess. 1172 00:48:53,520 --> 00:48:56,040 I think it's like almost [? no net ?] motion, so. 1173 00:48:56,040 --> 00:48:56,563 Yeah. 
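What that demo is showing can be reproduced on a toy problem. Below is a hedged sketch in Python, with a made-up quadratic cost standing in for the flapping simulation and illustrative values of eta and sigma, where small values learn slowly but steadily, an update step that is too large can diverge, and a perturbation so large that the local structure is ignored stops improving at all.

    import numpy as np

    np.random.seed(0)

    def toy_cost(alpha):
        # stand-in for one noisy rollout: a smooth bowl plus measurement noise
        return float(np.sum(alpha ** 2) + 0.01 * np.random.randn())

    def learn(eta, sigma, steps=200):
        alpha = np.ones(5)                         # order-one parameters, like the spline knots
        for _ in range(steps):                     # starting cost is about 5.0
            z = np.random.randn(alpha.size)
            J_plus = toy_cost(alpha + sigma * z)   # perturbed rollout
            b = toy_cost(alpha)                    # true baseline (two evaluations per step)
            alpha = alpha - eta * (J_plus - b) * z
        return toy_cost(alpha)

    # moderate eta and sigma learn; a huge eta diverges; a huge sigma ignores local structure
    for eta, sigma in [(0.05, 0.1), (2.0, 0.1), (0.05, 2.0)]:
        print(f"eta={eta}, sigma={sigma}: final cost {learn(eta, sigma):.3g}")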
1174 00:48:56,563 --> 00:48:58,230 So the sweet spot, then, is somewhere in 1175 00:48:58,230 --> 00:49:02,805 between, where maybe you want an eta of, say, 3, 1176 00:49:02,805 --> 00:49:15,007 and sigma, I don't know, 0.1. 1177 00:49:15,007 --> 00:49:16,590 Oh, that's probably still too violent. 1178 00:49:19,420 --> 00:49:21,940 Yeah, definitely. 1179 00:49:21,940 --> 00:49:24,920 But I think that is-- 1180 00:49:24,920 --> 00:49:26,680 that's the sort of game you have to play. 1181 00:49:26,680 --> 00:49:29,950 And how big all these things are depend on a number of factors 1182 00:49:29,950 --> 00:49:31,750 specific to your system. 1183 00:49:31,750 --> 00:49:35,390 Like, if your system-- 1184 00:49:35,390 --> 00:49:37,130 if the change is very small in magnitude, 1185 00:49:37,130 --> 00:49:38,200 if your cost function is such that it's 1186 00:49:38,200 --> 00:49:40,450 changed between 10 to the negative fifth and 10 1187 00:49:40,450 --> 00:49:43,510 to the negative fifth plus 1 times 10 1188 00:49:43,510 --> 00:49:46,090 to the negative sixth, that's changing by very small amounts, 1189 00:49:46,090 --> 00:49:46,590 right? 1190 00:49:46,590 --> 00:49:49,150 You could need a very large eta just to make up for the fact 1191 00:49:49,150 --> 00:49:50,450 that your change is so small. 1192 00:49:50,450 --> 00:49:53,500 So a big eta-- like, there's no absolute perception 1193 00:49:53,500 --> 00:49:54,460 on what is a big eta. 1194 00:49:54,460 --> 00:49:56,532 It's not like 10,000 is a huge eta. 1195 00:49:56,532 --> 00:49:58,240 10,000 could be very small eta, depending 1196 00:49:58,240 --> 00:49:59,453 on what your rewards are. 1197 00:49:59,453 --> 00:50:00,370 Same thing with sigma. 1198 00:50:00,370 --> 00:50:04,222 It depends on how big your parameters are. 1199 00:50:04,222 --> 00:50:05,680 Because, I mean, my parameters here 1200 00:50:05,680 --> 00:50:07,900 are of order one, which is sort of convenient. 1201 00:50:07,900 --> 00:50:09,760 Yeah. 1202 00:50:09,760 --> 00:50:10,270 So there. 1203 00:50:17,130 --> 00:50:17,630 Yeah. 1204 00:50:17,630 --> 00:50:19,915 So here we're learning pretty quickly. 1205 00:50:19,915 --> 00:50:21,540 And so all those sort of things, that's 1206 00:50:21,540 --> 00:50:23,240 sort of a disadvantage of this technique, 1207 00:50:23,240 --> 00:50:25,698 is that there's a lot of tuning to sort solve these things, 1208 00:50:25,698 --> 00:50:27,320 is that-- 1209 00:50:27,320 --> 00:50:29,300 where SNOPT you don't have to set-- 1210 00:50:29,300 --> 00:50:32,228 SNOPT you don't have to set a learning rate, 1211 00:50:32,228 --> 00:50:33,770 here you have to set a learning rate. 1212 00:50:33,770 --> 00:50:35,660 You have to set your sigma. 1213 00:50:35,660 --> 00:50:37,790 And when you have really sort of hard problems, 1214 00:50:37,790 --> 00:50:38,660 there's even more things you have to do. 1215 00:50:38,660 --> 00:50:39,830 Like, your policy parameterization 1216 00:50:39,830 --> 00:50:41,250 could affect a lot of things. 1217 00:50:41,250 --> 00:50:42,250 There's a lot of issues. 1218 00:50:42,250 --> 00:50:44,790 But sometimes, that's-- sometimes it's the only sort 1219 00:50:44,790 --> 00:50:45,590 of route you have. 1220 00:50:45,590 --> 00:50:48,050 Like, the best this can ever do is gradient descent. 1221 00:50:48,050 --> 00:50:50,360 It's never going to do better than gradient descent. 1222 00:50:50,360 --> 00:50:52,250 And so there's of a lot of fancy packages out there. 
1223 00:50:52,250 --> 00:50:54,250 When you have better models and stuff like that, 1224 00:50:54,250 --> 00:50:56,480 you can do better than gradient descent. 1225 00:50:56,480 --> 00:50:58,057 But while even though you're only 1226 00:50:58,057 --> 00:50:59,195 going to be able to achieve gradient descent, 1227 00:50:59,195 --> 00:51:00,290 you can achieve it despite the fact 1228 00:51:00,290 --> 00:51:01,957 that you know nothing about your system, 1229 00:51:01,957 --> 00:51:04,980 your system is stochastic, and it's noisy, like that. 1230 00:51:04,980 --> 00:51:09,120 And so in those cases, it can be a big win. 1231 00:51:09,120 --> 00:51:11,360 AUDIENCE: So when you were doing this in real life, 1232 00:51:11,360 --> 00:51:13,190 instead of in space each time, you 1233 00:51:13,190 --> 00:51:15,963 were sitting for 4 minutes in front of a flapping-- 1234 00:51:15,963 --> 00:51:18,380 JOHN W. ROBERTS: I automated pretty much everything, yeah. 1235 00:51:18,380 --> 00:51:19,310 So I was-- but yeah. 1236 00:51:19,310 --> 00:51:20,540 I mean, this is a little simulation. 1237 00:51:20,540 --> 00:51:22,730 AUDIENCE: Every interval was like actually it running and-- 1238 00:51:22,730 --> 00:51:23,210 JOHN W. ROBERTS: Oh, yeah. 1239 00:51:23,210 --> 00:51:24,585 When I pressed Space, it actually 1240 00:51:24,585 --> 00:51:26,868 does two-- because this is using a true baseline. 1241 00:51:26,868 --> 00:51:28,410 I didn't put in the average baseline. 1242 00:51:28,410 --> 00:51:30,350 So this is running it twice every time I press Space. 1243 00:51:30,350 --> 00:51:30,770 But yeah. 1244 00:51:30,770 --> 00:51:31,940 You can imagine every time I'm doing Space, 1245 00:51:31,940 --> 00:51:34,107 it does this one update and gives me that new point. 1246 00:51:34,107 --> 00:51:37,340 What I was doing is I sat there, and it would run, 1247 00:51:37,340 --> 00:51:40,490 and I'd babysit to make sure it wasn't broken. 1248 00:51:40,490 --> 00:51:42,410 And it would throw up the curve as it 1249 00:51:42,410 --> 00:51:44,540 was running so I could make sure that the encoders weren't off, 1250 00:51:44,540 --> 00:51:45,470 just sort of sitting there keeping 1251 00:51:45,470 --> 00:51:46,160 track of all these things. 1252 00:51:46,160 --> 00:51:48,140 I was like a nuclear safety technician. 1253 00:51:48,140 --> 00:51:51,656 I just eat some doughnuts and go to Moe's, and I would have 1254 00:51:51,656 --> 00:51:54,630 been a good sitcom character. 1255 00:51:54,630 --> 00:51:55,130 But yeah. 1256 00:51:55,130 --> 00:51:56,150 So I mean, pretty much just babysitting it. 1257 00:51:56,150 --> 00:51:57,320 But yeah, every time you did it, every time you 1258 00:51:57,320 --> 00:51:58,970 got a new update-- like, every one of these points 1259 00:51:58,970 --> 00:52:00,630 cost me 6 minutes or something, because it 1260 00:52:00,630 --> 00:52:02,015 was like a 3-minute run for-- basically 1261 00:52:02,015 --> 00:52:03,320 a 3-minute run for an update. 1262 00:52:03,320 --> 00:52:05,528 Because I wasn't using averaged baseline then either. 1263 00:52:05,528 --> 00:52:06,950 I was trying to be more violent. 1264 00:52:06,950 --> 00:52:07,580 But yeah. 1265 00:52:07,580 --> 00:52:10,067 And so that's the thing, is that that 1266 00:52:10,067 --> 00:52:11,913 is the perfect encapsulation of why 1267 00:52:11,913 --> 00:52:14,330 you want to use this information as carefully as possible. 1268 00:52:14,330 --> 00:52:15,890 It's because it's very expensive to get a point. 
1269 00:52:15,890 --> 00:52:17,015 Like, here it cost nothing. 1270 00:52:17,015 --> 00:52:19,190 If I were to turn off the pause, like, 1271 00:52:19,190 --> 00:52:20,690 this thing would climb up like that. 1272 00:52:23,108 --> 00:52:24,650 If you're running on a robot, like we 1273 00:52:24,650 --> 00:52:26,450 want to use this on the glider, every time you watch 1274 00:52:26,450 --> 00:52:28,242 that glider, you have to set up the glider, 1275 00:52:28,242 --> 00:52:30,747 fire it off, take all this data, and reset it by hand, 1276 00:52:30,747 --> 00:52:31,580 and launch it again. 1277 00:52:31,580 --> 00:52:34,130 So getting a data point there is going be extremely expensive. 1278 00:52:34,130 --> 00:52:36,380 And so we've actually done some work on the right ways 1279 00:52:36,380 --> 00:52:37,052 to sample. 1280 00:52:37,052 --> 00:52:39,260 You can imagine trying to come up with the right ways 1281 00:52:39,260 --> 00:52:39,968 to have a policy. 1282 00:52:39,968 --> 00:52:42,917 But sampling intelligently can save you a lot of time. 1283 00:52:42,917 --> 00:52:44,750 We sort of look at the signal-to-noise ratio 1284 00:52:44,750 --> 00:52:46,340 of these updates. 1285 00:52:46,340 --> 00:52:47,990 I don't know if anyone-- 1286 00:52:47,990 --> 00:52:50,450 some people here probably at least heard about that stuff 1287 00:52:50,450 --> 00:52:51,750 since they're in my group. 1288 00:52:51,750 --> 00:52:56,432 But probably talk about that maybe tomorrow. 1289 00:52:56,432 --> 00:52:57,890 But there's these things you can do 1290 00:52:57,890 --> 00:53:01,330 that can improve the quality of your performance a lot. 1291 00:53:01,330 --> 00:53:03,080 And actually, I test on this exact system. 1292 00:53:03,080 --> 00:53:05,060 I got put on the system, and I ran it 1293 00:53:05,060 --> 00:53:07,865 with the sort of results we had that's just, 1294 00:53:07,865 --> 00:53:09,740 this is a better way to sample, and then just 1295 00:53:09,740 --> 00:53:13,970 with a naive Gaussian kind of sampling, and you learn faster. 1296 00:53:13,970 --> 00:53:17,150 And in the context of me sitting there and spending my days 1297 00:53:17,150 --> 00:53:19,880 in New York City huddled in front of a computer, that's 1298 00:53:19,880 --> 00:53:20,480 a big win. 1299 00:53:20,480 --> 00:53:21,702 So anyway. 1300 00:53:21,702 --> 00:53:23,647 AUDIENCE: So when you say change the sampling, 1301 00:53:23,647 --> 00:53:25,730 you can just change the variance like you would do 1302 00:53:25,730 --> 00:53:26,840 to a non-Gaussian distribution? 1303 00:53:26,840 --> 00:53:28,048 JOHN W. ROBERTS: Right, yeah. 1304 00:53:28,048 --> 00:53:29,000 So that-- yeah. 1305 00:53:29,000 --> 00:53:33,180 In fact, we used a very different kind of description 1306 00:53:33,180 --> 00:53:33,680 overall. 1307 00:53:33,680 --> 00:53:35,990 You can still-- the linear analysis will still work. 1308 00:53:35,990 --> 00:53:38,660 But it's just a local-- 1309 00:53:38,660 --> 00:53:41,083 but yeah, there's work where they change-- 1310 00:53:41,083 --> 00:53:43,250 We also have something where you change the variance 1311 00:53:43,250 --> 00:53:46,730 [? to ?] the Gaussian, but your different directions 1312 00:53:46,730 --> 00:53:48,038 have different variances. 
1313 00:53:48,038 --> 00:53:50,330 And so if you sort of need an estimate of the gradient, 1314 00:53:50,330 --> 00:53:50,830 then-- 1315 00:53:50,830 --> 00:53:53,873 but you just estimate the gradient to bias your sampling 1316 00:53:53,873 --> 00:53:56,290 more in the directions where you think the gradient is, so 1317 00:53:56,290 --> 00:53:59,120 that more of your sampling is along the directions you think 1318 00:53:59,120 --> 00:54:00,110 are most interesting. 1319 00:54:00,110 --> 00:54:04,760 And so that can be a win when you have a lot of parameters 1320 00:54:04,760 --> 00:54:05,930 that aren't well correlated. 1321 00:54:05,930 --> 00:54:07,638 Like if you imagine if you had a feedback 1322 00:54:07,638 --> 00:54:09,160 policy that was dependent on-- 1323 00:54:09,160 --> 00:54:10,910 a parameter is active in a certain state-- 1324 00:54:10,910 --> 00:54:14,540 like, if I was at negative 2 to negative 5, I do this, 1325 00:54:14,540 --> 00:54:16,370 and let's say I never get there, then 1326 00:54:16,370 --> 00:54:19,040 that parameter has nothing to do with how well I perform. 1327 00:54:19,040 --> 00:54:20,752 And so if you know that, you can sort 1328 00:54:20,752 --> 00:54:23,210 of-- there's something called an eligibility you can track. 1329 00:54:23,210 --> 00:54:24,170 And you cannot update that parameter. 1330 00:54:24,170 --> 00:54:25,610 There's no reason to sort of be fooling around 1331 00:54:25,610 --> 00:54:27,690 with that parameter when it's not affecting your output. 1332 00:54:27,690 --> 00:54:29,773 And if you know that, you can do things like that. 1333 00:54:29,773 --> 00:54:32,750 And we sort of have a way, a more careful way of, 1334 00:54:32,750 --> 00:54:34,330 shaping all these-- 1335 00:54:34,330 --> 00:54:36,530 of shaping this Gaussian to learn faster. 1336 00:54:36,530 --> 00:54:38,583 And it can. 1337 00:54:38,583 --> 00:54:41,000 And also, just completely very different kind of sampling. 1338 00:54:41,000 --> 00:54:43,133 Like, it's-- well, maybe I'll try to talk about it. 1339 00:54:43,133 --> 00:54:45,050 Because I think it's pretty interesting stuff. 1340 00:54:45,050 --> 00:54:47,120 The math is a little bit nasty, but I'll 1341 00:54:47,120 --> 00:54:49,760 skip the really ugly steps. 1342 00:54:49,760 --> 00:54:53,000 And actually, the one with the different distribution 1343 00:54:53,000 --> 00:54:54,030 isn't even that nasty. 1344 00:54:54,030 --> 00:54:54,530 But yeah. 1345 00:54:54,530 --> 00:54:57,205 I mean, we ran it here and it [INAUDIBLE] improvement. 1346 00:54:57,205 --> 00:54:58,220 Yeah, so. 1347 00:54:58,220 --> 00:54:59,990 Did I answer your question? 1348 00:54:59,990 --> 00:55:00,560 Yeah. 1349 00:55:00,560 --> 00:55:01,990 It's not just changing the variances. 1350 00:55:01,990 --> 00:55:03,050 It's more complicated than that. 1351 00:55:03,050 --> 00:55:05,092 Although changing the variances can be a big win. 1352 00:55:05,092 --> 00:55:07,802 For example, if you knew you had this anisotropy, 1353 00:55:07,802 --> 00:55:10,010 and if you were to have different etas in different-- 1354 00:55:10,010 --> 00:55:11,927 if you were to scale everything in your sigma, 1355 00:55:11,927 --> 00:55:14,420 you could effectively make it squashed in, right? 1356 00:55:14,420 --> 00:55:17,790 I mean, just a rescaling of this anisotropic bowl 1357 00:55:17,790 --> 00:55:19,350 will make it right. 1358 00:55:19,350 --> 00:55:21,660 So if you can evaluate that, you can fix it. 
1359 00:55:21,660 --> 00:55:23,713 But you sort of have to know that that's going on. 1360 00:55:23,713 --> 00:55:25,380 That's like when you have adaptive 1361 00:55:25,380 --> 00:55:26,580 learning rates and stuff. 1362 00:55:26,580 --> 00:55:28,410 Gradient descent, like if you keep moving in the same direction, 1363 00:55:28,410 --> 00:55:30,000 you use a bigger learning rate. 1364 00:55:30,000 --> 00:55:30,900 You can have different learning rates 1365 00:55:30,900 --> 00:55:32,440 for different parameters. 1366 00:55:32,440 --> 00:55:33,840 This one, as you get close to a local min, 1367 00:55:33,840 --> 00:55:35,100 you'll decrease your learning rate and your noise, 1368 00:55:35,100 --> 00:55:36,850 because you don't want to sort of bounce around. 1369 00:55:36,850 --> 00:55:39,250 You don't want to be jumping all across this min. 1370 00:55:39,250 --> 00:55:39,750 So-- 1371 00:55:39,750 --> 00:55:42,626 AUDIENCE: [INAUDIBLE] talked about a basically policy 1372 00:55:42,626 --> 00:55:44,114 gradient when we were [INAUDIBLE]. 1373 00:55:44,114 --> 00:55:45,197 JOHN W. ROBERTS: Yeah, no. 1374 00:55:45,197 --> 00:55:45,780 Yeah. 1375 00:55:45,780 --> 00:55:47,700 I mean, there is-- 1376 00:55:47,700 --> 00:55:49,410 it's definitely exactly that. 1377 00:55:49,410 --> 00:55:51,002 It's just stochastic gradient. 1378 00:55:51,002 --> 00:55:52,710 But yeah, it's all policy gradient ideas. 1379 00:55:52,710 --> 00:55:55,800 Because we don't-- I mean, these things don't have a critic, 1380 00:55:55,800 --> 00:55:56,970 right? 1381 00:55:56,970 --> 00:55:59,280 But you can combine this with some policy evaluation 1382 00:55:59,280 --> 00:55:59,780 techniques. 1383 00:55:59,780 --> 00:56:01,905 And you can turn them into actor-critic algorithms. 1384 00:56:01,905 --> 00:56:04,830 A very simple-- do people know about actor-critic algorithms? 1385 00:56:04,830 --> 00:56:05,770 That's going to be a subject I think 1386 00:56:05,770 --> 00:56:06,937 Russ talks about at the end. 1387 00:56:06,937 --> 00:56:09,270 But the thing is that right now-- well, I'll 1388 00:56:09,270 --> 00:56:11,007 motivate it in a completely different way. 1389 00:56:11,007 --> 00:56:12,840 We talked about how this baseline can affect 1390 00:56:12,840 --> 00:56:14,730 your performance a lot, right? 1391 00:56:14,730 --> 00:56:17,670 Now, a good baseline can make you do a lot better. 1392 00:56:17,670 --> 00:56:20,160 Now, the thing is that, what happens 1393 00:56:20,160 --> 00:56:23,260 if-- here we start with the same initial condition every time. 1394 00:56:23,260 --> 00:56:25,167 But let's say that I actually could be in one 1395 00:56:25,167 --> 00:56:26,250 of two initial conditions. 1396 00:56:26,250 --> 00:56:29,160 I can measure this, and then I run it. 1397 00:56:29,160 --> 00:56:31,592 And the system behaves very differently, 1398 00:56:31,592 --> 00:56:33,300 or the costs are very different depending 1399 00:56:33,300 --> 00:56:34,530 on my initial condition. 1400 00:56:34,530 --> 00:56:37,153 But I want sort of the same policy to cover both of these. 1401 00:56:37,153 --> 00:56:39,570 So the thing is, if I just did this and I had one baseline 1402 00:56:39,570 --> 00:56:42,065 for both of them, and I could randomly be putting these 1403 00:56:42,065 --> 00:56:44,190 in [? different initial ?]
conditions or whatever-- 1404 00:56:44,190 --> 00:56:45,503 or I mean, I could-- 1405 00:56:45,503 --> 00:56:47,670 there's probably a more sensible way of saying this, 1406 00:56:47,670 --> 00:56:49,690 but I don't want to confuse the issue. 1407 00:56:49,690 --> 00:56:52,758 So if you could have different initial conditions, 1408 00:56:52,758 --> 00:56:54,300 you can make your baseline a function 1409 00:56:54,300 --> 00:56:55,770 of your initial condition. 1410 00:56:55,770 --> 00:56:56,687 Does that makes sense? 1411 00:56:56,687 --> 00:56:59,410 Instead of just having B, instead of evaluating it twice, 1412 00:56:59,410 --> 00:57:01,800 I could have my B of x. 1413 00:57:01,800 --> 00:57:04,090 And if my x is here, I'm going to say, OK, 1414 00:57:04,090 --> 00:57:05,490 my cost should be like this. 1415 00:57:05,490 --> 00:57:06,780 And if my x is here, then it's like, oh, 1416 00:57:06,780 --> 00:57:08,010 my cost should be like this. 1417 00:57:08,010 --> 00:57:10,890 And when I evaluate my cost, when I perturb my policy, 1418 00:57:10,890 --> 00:57:13,170 I have a better idea of how well I'm doing. 1419 00:57:13,170 --> 00:57:15,635 Does that makes sense? 1420 00:57:15,635 --> 00:57:17,640 It probably doesn't, so. 1421 00:57:17,640 --> 00:57:20,400 All right. 1422 00:57:20,400 --> 00:57:25,500 So let's say-- now, this is phase space now. 1423 00:57:28,770 --> 00:57:33,810 Now let's say that I can start in either of these. 1424 00:57:33,810 --> 00:57:37,152 And let's say that I'm trying to get to-- 1425 00:57:37,152 --> 00:57:38,822 let's draw this here. 1426 00:57:38,822 --> 00:57:39,780 I'm trying to get to 0. 1427 00:57:39,780 --> 00:57:40,920 That's my goal. 1428 00:57:40,920 --> 00:57:42,240 And I can measure this. 1429 00:57:42,240 --> 00:57:46,270 But then one of them, I'm going to go [WHOOSH] like that. 1430 00:57:46,270 --> 00:57:49,447 And the other one I'm going to have to go, I don't know, 1431 00:57:49,447 --> 00:57:51,030 through whatever torque, [? limited ?] 1432 00:57:51,030 --> 00:57:52,860 reasons like that or something. 1433 00:57:52,860 --> 00:57:55,963 So this one always costs more than this one, all right? 1434 00:57:55,963 --> 00:57:57,630 It doesn't matter how good my policy is. 1435 00:57:57,630 --> 00:57:59,713 Like, you can imagine just have a feedback policy. 1436 00:57:59,713 --> 00:58:01,920 It doesn't matter how bad it is, how good it is. 1437 00:58:01,920 --> 00:58:04,435 I mean, the same policy is always going to do worse here. 1438 00:58:04,435 --> 00:58:07,060 Now, if you believe that a good baseline improves performance-- 1439 00:58:07,060 --> 00:58:08,370 and trust me, it does-- 1440 00:58:08,370 --> 00:58:09,912 then I don't want the same baseline. 1441 00:58:09,912 --> 00:58:12,120 I don't want the same B for both of these situations. 1442 00:58:12,120 --> 00:58:14,340 Because this guy should always be around 50, 1443 00:58:14,340 --> 00:58:16,433 and this guy should always be around 20, right? 1444 00:58:16,433 --> 00:58:17,850 So what I could do is I could have 1445 00:58:17,850 --> 00:58:19,422 my baseline be a function of x. 1446 00:58:19,422 --> 00:58:21,630 And I'm going to be like, OK, here my baseline is 50, 1447 00:58:21,630 --> 00:58:23,932 here my baseline is 20. 1448 00:58:23,932 --> 00:58:25,890 And let's say I don't know that from the start. 1449 00:58:25,890 --> 00:58:30,690 I can learn my baseline while I'm learning my policy. 1450 00:58:30,690 --> 00:58:32,970 So I can use the same policy for both situations. 
1451 00:58:32,970 --> 00:58:34,658 And then over here I measure my state, 1452 00:58:34,658 --> 00:58:36,950 and I'm like, oh, over here I'm doing bad all the time. 1453 00:58:36,950 --> 00:58:38,220 So my baseline is going to be high. 1454 00:58:38,220 --> 00:58:39,720 And over here I'm always doing well, 1455 00:58:39,720 --> 00:58:41,760 so my baseline is going to be low. 1456 00:58:41,760 --> 00:58:48,400 And so in that way you can take that into account. 1457 00:58:48,400 --> 00:58:49,968 Does that makes sense? 1458 00:58:49,968 --> 00:58:50,760 it does look like-- 1459 00:58:50,760 --> 00:58:54,200 AUDIENCE: [INAUDIBLE] this is basically Monte-Carlo 1460 00:58:54,200 --> 00:58:57,050 sampling and learning. 1461 00:58:57,050 --> 00:58:59,910 Because each time that you set your-- 1462 00:58:59,910 --> 00:59:02,290 so your policy is defined by a set of alphas. 1463 00:59:02,290 --> 00:59:04,500 And then you fix it, you run it, and you 1464 00:59:04,500 --> 00:59:07,200 get a sample that says what is the value associated 1465 00:59:07,200 --> 00:59:09,807 with this starting point given this [INAUDIBLE] policy. 1466 00:59:09,807 --> 00:59:11,890 JOHN W. ROBERTS: Are you talking about Monte-Carlo 1467 00:59:11,890 --> 00:59:12,810 for policy evaluation? 1468 00:59:12,810 --> 00:59:14,143 Because Monte-Carlo [INAUDIBLE]. 1469 00:59:14,143 --> 00:59:16,300 That's like TD infinity or whatever it is. 1470 00:59:16,300 --> 00:59:17,700 And that's for policy evaluation. 1471 00:59:17,700 --> 00:59:20,100 That's how you make a critic. 1472 00:59:20,100 --> 00:59:21,660 The policy is different, right? 1473 00:59:21,660 --> 00:59:23,202 The policy, you're doing this update, 1474 00:59:23,202 --> 00:59:24,540 then you're advancing it a bit. 1475 00:59:24,540 --> 00:59:26,082 Your critic, the way I just described 1476 00:59:26,082 --> 00:59:28,192 making the baseline for this, that would be 1477 00:59:28,192 --> 00:59:29,400 a Monte-Carlo interpretation. 1478 00:59:29,400 --> 00:59:32,100 You could do it with t, lambda, or anything you wanted to. 1479 00:59:32,100 --> 00:59:32,790 But yeah. 1480 00:59:32,790 --> 00:59:37,350 So the important thing is-- 1481 00:59:37,350 --> 00:59:39,570 I mean, it looks like the sort of blank faces 1482 00:59:39,570 --> 00:59:41,610 after I talked about that. 1483 00:59:41,610 --> 00:59:45,480 But Russ, I think, is going to go into more detail 1484 00:59:45,480 --> 00:59:49,260 into actor-critic. 1485 00:59:49,260 --> 00:59:52,720 But maybe I can talk about that more tomorrow if you want. 1486 00:59:52,720 --> 00:59:53,220 Yeah. 1487 00:59:53,220 --> 00:59:54,690 I mean, the important thing is that right now this 1488 00:59:54,690 --> 00:59:56,732 is a very simple kind of idea we've talked about, 1489 00:59:56,732 --> 00:59:59,640 where you run the alpha, and then if you ran the same alpha, 1490 00:59:59,640 --> 01:00:01,860 it would always do the same. 1491 01:00:01,860 --> 01:00:04,380 Or maybe it just has a little bit of additive noise. 1492 01:00:04,380 --> 01:00:07,140 But If actually running the same alpha from different states-- 1493 01:00:07,140 --> 01:00:09,570 which happens a lot in a lot of systems-- 1494 01:00:09,570 --> 01:00:12,492 the different states could have different expected performance. 1495 01:00:12,492 --> 01:00:14,700 And so while you'll still learn without the baseline, 1496 01:00:14,700 --> 01:00:16,075 having a good baseline everywhere 1497 01:00:16,075 --> 01:00:17,440 will make you learn faster. 
1498 01:00:17,440 --> 01:00:19,410 And so it's worth learning a baseline 1499 01:00:19,410 --> 01:00:23,562 and learning the policy simultaneously. 1500 01:00:23,562 --> 01:00:25,770 And sort of the thing we talked about, where you just 1501 01:00:25,770 --> 01:00:28,570 average your last several samples to get your baseline, 1502 01:00:28,570 --> 01:00:30,570 that's already we're learning a baseline, right? 1503 01:00:30,570 --> 01:00:32,310 We're just learning it for everywhere in state space. 1504 01:00:32,310 --> 01:00:34,310 We're saying this is the same everywhere, right? 1505 01:00:38,320 --> 01:00:40,510 AUDIENCE: That idea of sampling, can you 1506 01:00:40,510 --> 01:00:44,230 do something like [? smarter ?] using Gaussian processes 1507 01:00:44,230 --> 01:00:46,120 to do active learning on top of it 1508 01:00:46,120 --> 01:00:49,840 to sample in areas that are more promising? 1509 01:00:49,840 --> 01:00:51,663 Instead of just randomly moving somewhere? 1510 01:00:51,663 --> 01:00:53,080 JOHN W. ROBERTS: I mean, there are 1511 01:00:53,080 --> 01:00:55,870 ways of biasing your sampling based on what 1512 01:00:55,870 --> 01:00:57,145 you think the gradient is. 1513 01:00:57,145 --> 01:00:59,020 I mean, that's one of the things we worked on 1514 01:00:59,020 --> 01:01:02,410 with signal-to-noise ratio. 1515 01:01:02,410 --> 01:01:04,600 I'm not sure exactly what-- 1516 01:01:04,600 --> 01:01:08,230 AUDIENCE: I know some people worked on Aibos walking, 1517 01:01:08,230 --> 01:01:09,880 and they wanted to find a gain which 1518 01:01:09,880 --> 01:01:12,460 maximizes the speed of the Aibos when they're walking. 1519 01:01:12,460 --> 01:01:14,502 JOHN W. ROBERTS: I think I read that paper, yeah. 1520 01:01:14,502 --> 01:01:17,930 AUDIENCE: Yeah, and there are like 12 or 13 dimensions. 1521 01:01:17,930 --> 01:01:20,085 And it seems like a similar problem-- 1522 01:01:20,085 --> 01:01:21,460 JOHN W. ROBERTS: No, I think they 1523 01:01:21,460 --> 01:01:22,668 use a very similar algorithm. 1524 01:01:22,668 --> 01:01:24,502 I think they had a different update, though. 1525 01:01:24,502 --> 01:01:25,870 It was the same kind of idea. 1526 01:01:25,870 --> 01:01:27,662 I think that the update structure was maybe 1527 01:01:27,662 --> 01:01:29,660 different than that. 1528 01:01:29,660 --> 01:01:30,160 Yeah. 1529 01:01:30,160 --> 01:01:31,813 So I won't dwell on critic stuff. 1530 01:01:31,813 --> 01:01:33,730 That's, I think, the last lecture in the class 1531 01:01:33,730 --> 01:01:34,870 or something like that. 1532 01:01:34,870 --> 01:01:37,420 But yeah. 1533 01:01:40,680 --> 01:01:42,820 So here, I mean, this is sort of the sample system. 1534 01:01:42,820 --> 01:01:46,510 And you can see how this thing is robust to really noisy 1535 01:01:46,510 --> 01:01:47,380 systems in practice. 1536 01:01:47,380 --> 01:01:52,180 Because when I ran it on the flapping thing down at NYU, 1537 01:01:52,180 --> 01:01:57,280 the consecutive evaluations could be very different-- 1538 01:01:57,280 --> 01:01:58,780 not because of any change in policy, 1539 01:01:58,780 --> 01:02:00,160 You run the same policy, you get a big variance. 1540 01:02:00,160 --> 01:02:01,300 So that's just because you're running 1541 01:02:01,300 --> 01:02:03,370 on this physical robot with this fluid system 1542 01:02:03,370 --> 01:02:05,590 and you're measuring the forces in an analog sensor. 1543 01:02:05,590 --> 01:02:07,060 And so it's just very noisy. 1544 01:02:07,060 --> 01:02:08,560 But it's robust to that. 
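A minimal sketch of that idea in Python; the dictionary keyed on a rounded start state, the running-average update, and all the names here are assumptions, just to show "learn a baseline as a function of the start state while you learn the policy" in the simplest Monte-Carlo style.

    import numpy as np

    baselines = {}   # start-state key -> running average of the cost observed from that state
    counts = {}

    def update_with_state_baseline(alpha, x0, run_cost, eta, sigma):
        """One perturbation update where the baseline depends on the start state x0.
        run_cost(alpha, x0) -> scalar cost of one rollout started from x0."""
        key = tuple(np.round(x0, 2))               # crude discretization of the start state
        z = np.random.randn(*alpha.shape)
        J = run_cost(alpha + sigma * z, x0)
        b = baselines.get(key, J)                  # first visit to this state: no advantage signal yet
        alpha = alpha - eta * (J - b) * z          # judge the perturbation against this state's usual cost
        counts[key] = counts.get(key, 0) + 1       # Monte-Carlo running mean of the cost from here
        baselines[key] = b + (J - b) / counts[key]
        return alpha

So the start that always costs around 50 and the start that always costs around 20 each get their own number, and a perturbation is compared against the right one; averaging all past costs into a single b is just the special case where the baseline is the same everywhere in state space.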
1545 01:02:08,560 --> 01:02:09,670 And that's what's so nice. 1546 01:02:17,350 --> 01:02:18,330 Put that here. 1547 01:02:22,160 --> 01:02:24,080 So look at that. 1548 01:02:24,080 --> 01:02:25,940 I mean, this one-- these, luckily, 1549 01:02:25,940 --> 01:02:27,200 didn't take 3 minutes anymore. 1550 01:02:27,200 --> 01:02:28,710 They took 1 second. 1551 01:02:28,710 --> 01:02:31,342 So it wasn't nearly as bad. 1552 01:02:31,342 --> 01:02:33,050 But, I mean, look how much it's changing. 1553 01:02:33,050 --> 01:02:35,330 It's changing a significant percentage every time, right? 1554 01:02:35,330 --> 01:02:36,560 AUDIENCE: These are all with the same [? taping loop? ?] 1555 01:02:36,560 --> 01:02:36,880 JOHN W. ROBERTS: Yeah. 1556 01:02:36,880 --> 01:02:38,630 Yeah-- I mean, no, this is playing a different-- 1557 01:02:38,630 --> 01:02:39,410 this is learning. 1558 01:02:39,410 --> 01:02:40,723 So the thing is that-- 1559 01:02:40,723 --> 01:02:42,890 I mean, I showed you how it wasn't monotonic before. 1560 01:02:42,890 --> 01:02:44,150 But this, you can run the same tape. 1561 01:02:44,150 --> 01:02:45,950 I mean, up there it's pretty much running the same tape. 1562 01:02:45,950 --> 01:02:48,620 So up there you get an idea of what the noise looks like when 1563 01:02:48,620 --> 01:02:50,180 you're running the same policy. 1564 01:02:50,180 --> 01:02:51,290 Right. 1565 01:02:51,290 --> 01:02:53,030 And so you can imagine-- yes. 1566 01:02:53,030 --> 01:02:55,295 AUDIENCE: Just [INAUDIBLE] went with blue and red. 1567 01:02:55,295 --> 01:02:56,670 JOHN W. ROBERTS: Oh, blue and red 1568 01:02:56,670 --> 01:02:59,960 are different ways of keeping track of my baseline. 1569 01:02:59,960 --> 01:03:01,668 All right. 1570 01:03:01,668 --> 01:03:04,085 So I mean, I don't worry about the different blue and red. 1571 01:03:04,085 --> 01:03:05,300 They're just sort of an internal test 1572 01:03:05,300 --> 01:03:07,190 to see the right way to make these things-- we determined 1573 01:03:07,190 --> 01:03:08,330 that it didn't make a difference. 1574 01:03:08,330 --> 01:03:08,960 But yeah. 1575 01:03:08,960 --> 01:03:12,270 AUDIENCE: It looks like the red is much smoother. 1576 01:03:12,270 --> 01:03:13,520 JOHN W. ROBERTS: I don't know. 1577 01:03:13,520 --> 01:03:14,150 It may be plotting. 1578 01:03:14,150 --> 01:03:16,070 I may have plotted blue on top of red or something, too, 1579 01:03:16,070 --> 01:03:16,970 you know? 1580 01:03:16,970 --> 01:03:17,750 I don't know. 1581 01:03:17,750 --> 01:03:21,150 I remember we decided it didn't make much of a difference. 1582 01:03:21,150 --> 01:03:21,830 Yeah. 1583 01:03:21,830 --> 01:03:22,430 I see what you're saying. 1584 01:03:22,430 --> 01:03:24,305 It does look like the variance is a bit less, 1585 01:03:24,305 --> 01:03:25,850 but I don't think it was. 1586 01:03:25,850 --> 01:03:27,620 But these are trials on the bottom. 1587 01:03:27,620 --> 01:03:30,013 So that's, every second we sort of did another flap, 1588 01:03:30,013 --> 01:03:30,930 we did another update. 1589 01:03:30,930 --> 01:03:32,240 So this is update from the bottom. 1590 01:03:32,240 --> 01:03:32,540 And yeah. 1591 01:03:32,540 --> 01:03:34,460 This is-- we actually have a reward instead of cost here. 1592 01:03:34,460 --> 01:03:36,085 So it's going to go up instead of down. 1593 01:03:36,085 --> 01:03:37,552 But yeah. 
So despite the fact that this is really noisy, and despite the fact that our baseline wasn't perfect (it was just the averaged baseline I was talking about), it still learned, and it learned pretty quickly. 400 samples maybe doesn't sound very good, but that's also less than 10 minutes; it's about 7 minutes. So in practice it can work pretty darn well.

And solving this problem with other techniques would be very tricky. You could build a model, like the model we have, and try to solve it in simulation. That's generally how a lot of these problems are solved: do the optimization on a model. For example, Jane Wang at Cornell optimizes the stroke form for a fly, I think at fruit-fly scale. She built a pretty fancy model of the system and simulates it, doing the optimization on a computational fluid dynamics simulation. There you can get the gradients and do all the things we've already talked about explicitly, because you have the model. But the model takes a long time to run; I think that optimization took months of computer time.

That's the point here: for the full simulation of this system, where it took me 1 second to get an update on the hardware, it takes, I think, about an hour per flap. That's an hour on a computing cluster to get one full simulation of a single flap, and that's the simpler case. We're also working on versions with aeroelastic effects, where the body deforms in response to the fluid forces, and simulating those is even harder. So where the simulation takes an hour to give an update, I can get one in a second. Now, my update is going to be noisier, and I don't get the true gradient.
But when you can do 3,600 of these noisy updates in the time it takes the simulation to produce one, you're going to win. You get one simulated flap in the time you'd otherwise sit there for the better part of an hour. So in those kinds of problems this can be a big win: when the simulation is extremely expensive, or computing the gradient is extremely expensive, but you have the robot right in front of you, you can just take that data, accept the noise, and do model-free gradient descent.

I think that's what I wanted to talk about. If you have any questions, or anything didn't make sense at all, please let me know. Otherwise, maybe I'll introduce something I'm planning to talk about tomorrow, a different interpretation, just to get your brain ready for it. But if there are any other questions on this, please ask.

AUDIENCE: What was the reward function for [INAUDIBLE]?

JOHN W. ROBERTS: The reward function for this was the integral of the spin velocity divided by the integral of the power input. So it measured the force on it and multiplied that by the vertical velocity, which gives you the power, the rate at which work is being done; integrating that gives the energy put in. Integrating the spin velocity gives the distance traveled. That ratio is what we tried to optimize, so it's finding something like the minimum energy per unit distance. The rig spins around in a circle, but it's a model of forward flight: we did it for an angle, but you could do it just as easily with a linear test; it's just harder experimentally. So it's an efficiency metric.

Yeah? All right. Turn the lights back up. Let me make sure I crossed all my Ts and dotted my Is.
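As a concrete reading of that efficiency metric, here is a minimal sketch of how such a reward could be computed from logged data. The signal names, the sampling setup, and the use of trapezoidal integration are assumptions for illustration, not the instrumentation actually used on the rig.

```python
import numpy as np

def efficiency_reward(force, vertical_velocity, spin_velocity, dt):
    """Distance traveled per unit of energy put in (higher is better)."""
    power = force * vertical_velocity          # instantaneous power = force x velocity
    energy = np.trapz(power, dx=dt)            # integral of power input over the trial
    distance = np.trapz(spin_velocity, dx=dt)  # integral of spin velocity over the trial
    return distance / energy
```

Because the learner maximizes this ratio, it is effectively minimizing the energy spent per unit distance, which is why it reads as an efficiency metric.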
Oh, yeah. And actually, there's one story before I get into that. A lot of these things originated in the context of neural networks, like a lot of the things we've seen: back prop, gradient descent. We learned that [INAUDIBLE] originated in the context of neural networks; RTRL did. And a lot of this did too. The REINFORCE algorithm, which is the thing we're going to talk about, originated with neural networks.

One of the reasons people found it so appealing, particularly this kind of stochastic update, is that it seemed biologically plausible. What is the chance that a human brain is doing back prop? It could be doing some sort of approximate back prop or something like that; I actually don't know that much about neuroscience. But these computationally involved techniques don't seem reasonable as postulates for how the human brain, or how neurons, solve these problems. This one, though, is so simple. The little bit of randomness being part of it, and the simple update structure, do seem biologically plausible; intuitively, it makes more sense.

But even more than that, there's data and evidence suggesting that this kind of mechanism could be one aspect of how animals learn. The coolest example, I think, is songbirds learning how to sing. They're not born knowing a particular way to sing, but they hear their parents sing as they're growing up, they start singing more and more, and they get better and better. You can actually hear them getting better until they sing like their parents did. You can raise them in captivity and play them Elvis all the time, and they'll do a songbird impression of Elvis, which I'm surprised you can't buy on CD on late-night TV.

But a really cool thing is that there's a part of the brain where, if you measure the signals, they seem to be completely random. They just seem to be random noise.
And so it's strange that there's no structure there. What could this part of the brain be doing? Why would it need to produce random noise? So what they did, and the bird lovers out there may not like this, is they took one of these birds and waited until it had learned the full song. Then they deactivated, through some means, the part of the brain that produces the random noise. And nothing happened. Apparently the bird wasn't entirely the same, but it could still sing the songs fine.

Then they took a bird that was in the process of learning the song, one that had learned some of it but wasn't perfect yet and was still getting better, and they deactivated that part of the brain. And it just kept singing the same song. However it had been singing, it kept singing that way; it didn't get any better.

So that's some proxy evidence that the random noise was related to the ability to improve. It's not storing the signal, and it's not necessarily the descent itself; but the random noise could be how the bird perturbs its song in an effort to get better. It messes the song up a bit, listens, finds it's maybe a little bit better, and keeps doing that. That's reasonably compelling evidence that biology could use this as at least one aspect of how it improves: shut down the random noise and it stops learning. If you take the variance to zero, you're not going to get worse; you're just not going to do anything. You're going to keep singing the same song. So that's pretty cool, I think.

Right. So, just to give you something to chew on, there's another interpretation of this. The idea we've talked about so far is this kind of sampling, where we have some nominal policy.
We perturb it, measure how well we did, measure the performance, and update. That's pretty much what we have: we've got some policy we're working from, we add this z to it, which changes it a bit, we run it, and then we update.

There's a different interpretation -- my "performance" got too long up there -- the stochastic policy interpretation. In this view, you don't think of it as having some nominal policy and adding noise to it. Instead, the policy itself acts stochastically: the actions are random. That doesn't mean they're completely random; they're random with some distribution. But you're not saying exactly what you'll do.

You can imagine this is like playing Liar's Poker, where you hold the card above your head so you can see everyone else's card but not your own, and then you bet on these things. Do you know the game? Maybe it doesn't have enough cultural penetration to be a good example. But if you're playing normal poker, or any gambling game, and every time you had the same cards you made the exact same bet, people could eventually figure that out and use it to beat you. There are plenty of games like that: say every time I have a certain card, I always bet a certain way. Then when I bet that way, they'll think, oh, he has good cards, I'm going to fold; or, oh, he always bluffs when he has this card. So a deterministic policy doesn't make sense there; a stochastic policy is exactly what you use. Your policy is going to be something like: I've got pocket kings, so 95% of the time I'm going to raise whatever, and [INAUDIBLE] of the time I'm going to check. Those kinds of things, where there's some noise in what you do.
Now, you can question whether optimal policies would really be stochastic in the kinds of problems we look at. But the important thing is to realize that in this view your policy isn't a fixed set of actions; it is a distribution over what you do. So the parameterization controls the distribution. Ooh, my fifth grade teacher would not have liked that.

So what you do, then, is you can imagine you control, say, the mean of a distribution. You can think of it as really exactly the same thing as before: in the other interpretation I said, OK, my policy is alpha, and then I add random noise z to it. Here, my policy is parameterized by alpha, and the action that comes out is the same thing. The difference is that it's not "this is what I'm doing, and I'm sampling something else around it." This is actually my policy: if I ran the same policy again, I would just take all these actions with these probabilities. Your actions are stochastic.

Now, that's something that isn't always easy to absorb; when I first saw it, it wasn't easy for me to get my head around what it all meant. I won't go into more detail now; tomorrow we'll look at this different interpretation. You get the same learning. We'll actually show that the update is the same and the behavior is very similar, but the properties are a little bit different. And the big thing is that you don't have to do the linearization. Here we did a linear expansion and said, OK, this is true locally. When you look at it in the stochastic policy context, you can show that you'll always follow the gradient of the expected value of the cost under the policy. And that's a big difference, right?
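To make the stochastic-policy view concrete, here is a minimal sketch in which the parameters alpha control the mean of a Gaussian distribution over actions. The fixed noise level, the function names, and the specific update shown are assumptions for illustration; the actual derivation of this update is the subject of the next lecture.

```python
import numpy as np

SIGMA = 0.1  # assumed, fixed exploration noise of the stochastic policy

def sample_action(alpha):
    # The policy is a distribution: actions are drawn from N(alpha, SIGMA^2).
    # Sampling this way is literally "nominal parameters plus noise z".
    return alpha + SIGMA * np.random.randn(*alpha.shape)

def update(alpha, action, J, baseline, eta=0.01):
    # Same structure as the weight-perturbation update: (action - alpha)
    # is exactly the noise z, so the two interpretations give the same
    # update (possibly up to a constant factor).
    return alpha - eta * (J - baseline) * (action - alpha)
```

The code is the same loop as before; what changes is the reading of it: alpha no longer names the action you take, it names the distribution your actions are drawn from.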
With the weight-perturbation view, we're looking at the local gradient, and we're going to follow that local gradient. But suppose you have a very broad policy distribution, or a very violent cost landscape. Look at the 1D picture again, where the cost as a function of the parameter is extremely violent, with lots of sharp structure. Well, when I use this random, stochastic policy, that smooths it out. Because I have a stochastic policy, running it makes the cost a random variable; it depends on which actions get sampled. Even if my dynamics are deterministic, because my policy is stochastic, my cost is stochastic, so there's some expected cost for running this policy on that landscape. You can imagine that averaging over all of those samples smooths out some of the sharp structure.

And what you follow when you do this, with an update that is really identical -- it's the exact same update, possibly up to a coefficient out front that you may or may not include, but the structure is the same -- is the gradient of the expected performance of the stochastic policy. So it's a different way of thinking about it. I think the weight-perturbation way is the easier way to think about it at first. But tomorrow will be more probability-flavored, and we'll talk about the stochastic policy interpretation and some of its ramifications, and maybe some other interesting side notes. So yeah.
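As a closing illustration of the smoothing point above, here is a minimal numerical sketch. The sharply oscillating 1D cost and the Gaussian policy noise are assumptions chosen only to show the effect; the quantity the stochastic-policy update follows is the gradient of the smoothed (expected) curve, not of the raw one.

```python
import numpy as np

def violent_cost(alpha):
    # Made-up 1D cost with a lot of sharp local structure.
    return np.sin(40.0 * alpha) + 0.1 * alpha**2

def expected_cost(alpha, sigma=0.2, n_samples=5000):
    # Monte Carlo estimate of E[ J(alpha + z) ] with z ~ N(0, sigma^2),
    # i.e. the cost of the stochastic policy whose mean is alpha.
    z = sigma * np.random.randn(n_samples)
    return np.mean(violent_cost(alpha + z))

alphas = np.linspace(-2.0, 2.0, 101)
raw = violent_cost(alphas)
smoothed = np.array([expected_cost(a) for a in alphas])
# Plotting raw vs. smoothed shows the oscillations averaged away:
# the expected cost is a smooth function of the policy parameter.
```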