The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

RUSS TEDRAKE: Today is sort of the culmination of everything we've been doing in model-free optimal control. OK. So we talked a lot about the policy gradient methods, under the model-free category here. We've talked a lot about model-free policy gradient methods. And then the last week or so, we spent talking about model-free methods based on learning value functions. OK. Now both of those have some pros and some cons to them. OK? So the policy gradient methods: what's good about policy gradient methods?

STUDENT: They scale--

RUSS TEDRAKE: They scale with--

[INTERPOSING VOICES]

RUSS TEDRAKE: OK. And they can scale well to high dimensions. We'll qualify that. Right? It's actually still only a local search; that's why they scale. And the performance of the model-free methods degrades with the number of policy parameters. But if you have an infinite-dimensional system with one parameter you want to optimize, then you're in pretty good shape with a policy gradient method. Right? OK. What else? What are other pros and cons of policy gradient methods? What's a con? Well, I said a lot of it in the parentheses already. But it's local. What are some other cons about policy gradient methods?

STUDENT: [INAUDIBLE]

RUSS TEDRAKE: Yeah. Right. So this performance degradation typically is summarized by people saying they tend to have high variance. Right? Variance in the update, which can mean that you need many trials to converge.
I mean, fundamentally, if we're sampling policy space and making some stochastic update, it might be that it requires many, many samples, for instance, to accurately estimate the gradient. And if we're making a move after every sample, then it might take many, many trials for us to find the minimum. It's a noisy descent. Yeah?

STUDENT: You also have to choose like a [INAUDIBLE].

RUSS TEDRAKE: Good. That wasn't even on my list. But I totally agree. OK. I'll put it right up here.

There's one other very big advantage to the policy gradient algorithms. We take advantage of smoothness. They require smoothness to work. That's both a pro and a con, right? But the big one that we haven't said yet, I think, is that convergence is sort of virtually guaranteed. You're doing a direct search. You're doing a stochastic gradient descent in exactly the parameters you care about. Convergence is sort of trivial and guaranteed. OK?

That turns out to be probably one of the biggest motivating reasons for the community to have put their efforts into policy gradient. Because if you look at the value function methods, in many cases-- now, I told you about one case with function approximation, still linear function approximation, where there are stronger convergence results. And that was least squares policy iteration. But in most of the cases we've had, the convergence results were fairly weak. We told you that temporal difference learning converges if the policy remains fixed. OK? But if you're not careful, if you do temporal difference learning with the policy changing, with a function approximator involved, convergence is not guaranteed. OK? In fact, a lot of these methods struggle with convergence. Not just the proofs, which are more involved.
But there's a handful of-- sort of in the big switch from value methods to policy gradient methods, there are a number of papers showing sort of trivial examples of-- can I call them TD control methods? So temporal difference learning where you're actually also updating your policy-- TD control methods with function approximation, which diverge. Right?

There was even one-- I think it might have been, I forget whose-- it might have been Leemon Baird's example. But they actually showed that the method will oscillate between the best possible representation of the value function and the worst possible representation of the value function. And it sort of stably oscillated between the two. Right? Which was obviously something that they cooked up. But still, that makes the point. Right?

Even the convergence result we did give you for LSPI, least squares policy iteration, still had no guarantee that it wasn't going to oscillate; it could certainly oscillate. They gave a bound on the oscillation. But that bound has to be interpreted. Even LSPI could still oscillate. And that's one of the stronger convergence results we have.

OK. But they're relatively efficient. Right? So we put up with a lot of that. And we keep trying to use them because they're efficient to learn, in the sense that you're just learning a scalar value over all your states and actions. That's a relatively compact thing to learn. I tried to argue last time that it's easier than learning a model, even by just dimensionality arguments. And they tend to be efficient because the TD methods in particular reuse your estimates. They tend to be efficient in data. They reuse old estimates. They use your old estimate of the value function to update your new estimate. So when they do work, they tend to learn faster.
And with the least squares methods, they tend to be efficient in data, and therefore in time-- the number of trials. When these things do work, they're the tool of choice. The problem is-- and there are great examples of them working-- there are not enough guarantees of them working.

And if you want to sort of summarize why these value methods struggle, why they can struggle to converge and even diverge, you can think of it in a single line, I think. The basic, fundamental problem with the value methods is that a very small change in your estimate of the value function can cause a dramatic change in your policy. Right? So let's say my value function tips this way. Right? And I change my parameters a little bit. Now it's tipped this way. My policy just went from going left to going right, for instance. And now you're trying to update your value function as the policy changed. And things can just start oscillating out of control. Does that make sense?

OK. That's a reasonably accurate, I think, lay of the land in the methods we've told you about so far. If you can find a value method that converges nicely, use it. It's going to be faster than a policy gradient method. It's more efficient in reusing data. You're learning a fairly compact structure. Value iteration has always been our most efficient algorithm, when it works. But the policy gradient algorithms are guaranteed to work. And they're fairly simple to implement. And they can just be a sort of local search in the policy space-- directly in the space that you care about, really, your policy.

So the big idea, which is the culmination of the methods we've talked about in the model-free stuff so far, is to try to take the advantages of both by putting them together. Represent both a value function and a policy simultaneously. There's extra representational cost there.
But if you're willing to do that and make slower changes to the policy based on guesses that are coming from the value function, then you can overcome a lot of the stability problems of the value methods. You get the strong convergence results of the policy gradient. And ideally, you get some of the efficiency. You can reduce the variance of your update. You make more effective updates by using a value function. OK?

So the actor is the playful name for the policy. And the critic is your value estimate, telling you how well you're going to do. And one of the big ideas there is you'd like it to be a two-time-scale algorithm. The policy is changing slower than the greedy policy from the value function.

OK. So the ideas in actor-critic are actually very, very simple. The proofs are ugly. There's only a handful of papers you've got to look at if you want to get into the dirt. But these, I think, are the algorithms of choice today for model-free optimization. OK. So just to give you a couple of the key papers here. Konda and Tsitsiklis-- John's right upstairs-- had an actor-critic paper in 2003 that has all the algorithm derivation and proofs. Sutton has a similar one in '99 that's called Policy Gradient. But it's actually the same sort of math as in Konda and Tsitsiklis. And then our friend Jan Peters has got a newer take on it. He calls it Natural Actor-Critic, which is a popular one today. It should be easy to find.

OK. So I want to give you the basic tools. And then instead of getting into all the math, I'll give you a case study, which was my thesis. Works out.

So probably John already said quickly what the big idea was. Right? So John told you about the REINFORCE-type algorithms and weight perturbation. In the REINFORCE algorithms, we have some parameter vector. Let's call it alpha. And I'm going to change alpha with a very simple update rule.
In the simple case, maybe I'll run my system twice. I'll sample the output once with alpha. And then once I'll do it with alpha plus some noise. Let's say I'll run it from the same initial condition. Compare those two. And then multiply the difference times the noise I added. Right? And that's actually a reasonable estimator of the gradient. And if I multiply by the learning rate, then I've got a gradient descent type update. OK?

So this is not useful in its current form. John told you about the better forms of it, too. But the problem with this is that I have to run the system twice from exactly the same initial conditions. You don't want to run two trials, simulating the thing exactly twice, for every one update. And it sort of assumes that the system is deterministic. The more general form here would be to not run the system twice, but to use, for instance, some estimate of what reward I'd expect to get from this initial condition, and compare that to the learning trial.

So we just went from policy gradient to actor-critic, just like that. This is the simplest form of it. But let's think about what just happened. So if I do have an estimate of my value function, I have an estimate of my cost-to-go from every state. Right? Then that helps me make a policy gradient update. Because if I run a single trial, then I can compare the reward I expected to get with the reward I actually got, very compactly. OK? So this is the reward I actually got. I run a trial, one trial, even if it's noisy, with my perturbed parameters. I change my parameters a little bit. I run a trial. And what I want to efficiently do is compare it to the reward I should have expected to get, given the parameters I had a minute ago. Right? That's nothing but a value function right here. OK?
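A minimal sketch of that two-rollout weight-perturbation update, written here in Python. Everything in it is an illustrative assumption rather than code from the lecture: the name run_trial stands in for one rollout from a fixed initial condition returning a scalar cost, and eta and sigma are made-up step sizes.

```python
import numpy as np

def weight_perturbation_step(alpha, run_trial, eta=0.01, sigma=0.1):
    """One REINFORCE / weight-perturbation update (two-rollout form).

    alpha     : policy parameter vector (numpy array)
    run_trial : rollout from a fixed initial condition -> scalar cost
    eta       : learning rate
    sigma     : standard deviation of the parameter perturbation
    """
    z = sigma * np.random.randn(*alpha.shape)   # random perturbation of alpha
    J_nominal = run_trial(alpha)                # rollout with alpha
    J_perturbed = run_trial(alpha + z)          # rollout with alpha + noise
    # The correlation between the change in cost and the injected noise is a
    # (noisy) estimate of the gradient, so stepping against it is descent.
    return alpha - eta * (J_perturbed - J_nominal) * z
```

The restriction mentioned above is visible in the code: run_trial is called twice from the same initial condition. The actor-critic variant replaces the J_nominal rollout with a learned estimate of the cost-to-go from that initial condition.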
So the simplest way to think about an actor-critic algorithm is: go ahead and use a TD learning kind of algorithm. Every time you're running your robot, go ahead and work in the background on learning a value function of the system. And simply use that to compare against the samples you get from your policy search. Do you guys remember the sort of weight perturbation type updates enough for that to make sense? Yeah?

STUDENT: So in this case, that [INAUDIBLE] into your system but just through some expectation.

RUSS TEDRAKE: Excellent. That's where you're getting it-- from temporal difference learning. In the case of a stochastic system, where both of these are going to be noisy random variables, this actually can be better than running it twice. Because this is the expected value accumulated through experience. Right? And that's what you really want: to compare your noisy sample to the expected value. So in the stochastic case, you actually do better by comparing to the expected value of your update. What you can show, by various tools, is that comparing to the expected value of your update, which is the value function here, can dramatically reduce the variance of your estimator. OK?

You should always think about policy gradient as every one of these steps trying to estimate the change in the performance based on a change in parameters. But in general, what you get back is the true gradient plus a bunch of noise, because you're just taking a random sample here in one dimension of change. If this is a good estimate of the value function, then it can reduce the variance of that update.
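As one concrete way to picture the critic running in the background, here is a hedged sketch of a TD(0) update for a linear value-function approximator. The class, the feature map phi, and the step sizes are assumptions for illustration, not the lecture's implementation.

```python
import numpy as np

class LinearCritic:
    """Critic: V(x) ~ w . phi(x), trained by TD(0) on observed costs."""

    def __init__(self, phi, n_features, learning_rate=0.1, gamma=1.0):
        self.phi = phi                  # feature map: state -> feature vector
        self.w = np.zeros(n_features)   # value-function weights
        self.lr = learning_rate
        self.gamma = gamma

    def value(self, x):
        return self.w @ self.phi(x)

    def td_update(self, x, cost, x_next):
        # One-step temporal-difference error on the cost-to-go estimate.
        delta = cost + self.gamma * self.value(x_next) - self.value(x)
        self.w += self.lr * delta * self.phi(x)
        return delta
```

In the perturbation step sketched earlier, critic.value(x0) would then stand in for the second rollout: alpha <- alpha - eta * (J_observed - critic.value(x0)) * z.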
342 00:22:26,135 --> 00:22:27,930 It doesn't converge very fast. 343 00:22:27,930 --> 00:22:29,970 But you can still actually show that it'll, 344 00:22:29,970 --> 00:22:31,720 on average, converge. 345 00:22:31,720 --> 00:22:32,220 OK? 346 00:22:32,220 --> 00:22:37,530 So it's actually quite robust to the thing you subtract out. 347 00:22:37,530 --> 00:22:41,130 Because, especially if this thing doesn't depend on alpha, 348 00:22:41,130 --> 00:22:42,580 then it has zero expectation. 349 00:22:42,580 --> 00:22:45,190 So it doesn't even affect the expected value of your update. 350 00:22:45,190 --> 00:22:48,400 So it actually does not affect the convergence results at all. 351 00:22:48,400 --> 00:22:50,568 So the convergence results are still intact. 352 00:22:50,568 --> 00:22:52,110 But the performance should get better 353 00:22:52,110 --> 00:22:56,458 because you have a better estimate of your J. Right? 354 00:22:56,458 --> 00:22:58,500 And that should be intuitively obvious, actually. 355 00:22:58,500 --> 00:22:59,000 Right? 356 00:22:59,000 --> 00:23:04,320 If I did something and I said, how did I do? 357 00:23:04,320 --> 00:23:07,140 And [INAUDIBLE] just always said, 358 00:23:07,140 --> 00:23:10,303 you should have gotten a four every single time. 359 00:23:10,303 --> 00:23:12,720 If I got a lousy estimator of how well I should have done, 360 00:23:12,720 --> 00:23:13,293 I'd say, OK. 361 00:23:13,293 --> 00:23:14,460 Look, I got a six that time. 362 00:23:14,460 --> 00:23:16,200 And he says, you should have had a four. 363 00:23:16,200 --> 00:23:17,310 Six, you should have had a four. 364 00:23:17,310 --> 00:23:18,768 Then he's giving me no information. 365 00:23:18,768 --> 00:23:21,110 And that's not helping me evaluate my policy. 366 00:23:21,110 --> 00:23:21,610 Right? 367 00:23:21,610 --> 00:23:22,650 If someone said, OK. 368 00:23:22,650 --> 00:23:24,150 We did something a little different. 369 00:23:24,150 --> 00:23:27,660 I expected you to get a six, but you got a 6.1. 370 00:23:27,660 --> 00:23:31,984 Well, that's a much cleaner learning signal for me to use. 371 00:23:31,984 --> 00:23:41,310 STUDENT: [INAUDIBLE] the worst possible-- 372 00:23:41,310 --> 00:23:43,130 RUSS TEDRAKE: Yeah, absolutely. 373 00:23:43,130 --> 00:23:44,880 So that's the important point is that it's 374 00:23:44,880 --> 00:23:47,790 got to be uncorrelated with the noise you add to your system. 375 00:23:47,790 --> 00:23:48,780 OK? 376 00:23:48,780 --> 00:23:50,940 If it's not correlated with the noise you add in, 377 00:23:50,940 --> 00:23:53,307 then it actually goes away in expectation. 378 00:23:53,307 --> 00:23:55,890 So the variance can be very bad if you have the worst possible 379 00:23:55,890 --> 00:23:57,090 value estimate. 380 00:23:57,090 --> 00:24:01,620 But the convergence still happens. 381 00:24:04,230 --> 00:24:05,790 Like I said, zero actually works. 382 00:24:05,790 --> 00:24:08,313 Right? 383 00:24:08,313 --> 00:24:09,480 Which is sort of surprising. 384 00:24:09,480 --> 00:24:10,920 Right? 385 00:24:10,920 --> 00:24:14,550 If I have a reward function that always returns between zero 386 00:24:14,550 --> 00:24:20,580 and 10, and I'm trying to optimize my update, 387 00:24:20,580 --> 00:24:25,800 then I would always move in the direction of the noise I add. 388 00:24:25,800 --> 00:24:28,950 But I move more often in the ones that gave me high scores. 
It's actually worth thinking about that. It's actually pretty cool that it's so robust, that estimator. But certainly with a good estimator, it works better.

I don't know how much John told you. But we don't actually like talking about the variance. We like talking about the signal-to-noise ratio. Did you tell them about the signal-to-noise ratio, John?

STUDENT: I don't remember.

RUSS TEDRAKE: Quickly? Yeah. So John's got a nice paper. Maybe he was being modest. John has a nice paper analyzing the performance of these with a signal-to-noise ratio analysis, which is another way to look at the performance of the update.

So that's enough to take the power of the value methods and start putting them to use in the policy gradient methods. OK? The cool thing is, like I said, as long as it's uncorrelated with z, it can be a very bad approximation of the value function. It won't break convergence. The better the value estimate, the faster your convergence is. OK?

This isn't the update that people typically use when they talk about actor-critic updates. The Konda and Tsitsiklis one has a slightly more beautiful thing. This is maybe what you think of as an episodic update. Right? This is-- I just said we started at initial condition x. Maybe I should write an x zero or something. But we just start with initial condition x. We run our robot for a little bit with these parameters. We compare it to what we expected. And we make an update maybe once per trial. That's a perfectly good algorithm for making an update once per trial. There's a more beautiful sort of online update. Right? If you actually want to-- let's say you have an infinite horizon thing. An infinite horizon problem. There's actually a theorem-- I've debated how much of this to go into.
But I'll at least list the theorem for you, because it's nice. They call it the policy gradient theorem, which says: partial J, partial alpha-- where in the infinite horizon case, there are different ways to define infinite horizons; this is typically done in an average reward setting. It can be made to work for other formulations. But I'll be careful to state the one that I know there's a correct proof for. The policy gradient can actually be written as-- let me write it out.

This guy is the stationary distribution over states and actions from executing pi of alpha. This guy is the Q function from executing alpha-- the true Q function-- at that state and action. And this guy is the gradient of the log probabilities, which is the same thing we saw in the policy gradient algorithms-- the log probabilities of executing pi. Yeah. The gradient of the log probability.

I'm not trying to give you enough to completely get this. But I want you to know that it exists and know where to find it. And what it reveals is a very nice relationship between the Q function and the gradients that we were already computing in our REINFORCE-type algorithms. OK?

And it turns out an update of the form-- this is the gradient of the log probabilities again; I'll just write it-- would be doing gradient descent on this if you're running from sample paths. This term disappears if I'm just pulling x and u from the distribution that happens when I run the system, which gives me this stationary distribution coefficient for free. OK? And then if I could somehow multiply the true Q function times my eligibility-- the eligibility, I definitely have access to; because I have access to my policy, I can compute that. But this guy, the Q function, I have to estimate. OK?
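For reference, here is the standard statement of that theorem from the papers cited above (Sutton et al. '99; Konda and Tsitsiklis), in notation chosen here rather than copied from the board: d^{pi_alpha} is the stationary state distribution under the policy pi_alpha, and Q^{pi_alpha} is its true Q function.

```latex
\frac{\partial J}{\partial \alpha}
  = \sum_{x} d^{\pi_\alpha}(x) \sum_{u}
      \frac{\partial \pi_\alpha(u \mid x)}{\partial \alpha}\; Q^{\pi_\alpha}(x,u)
  = \mathbb{E}_{x \sim d^{\pi_\alpha},\; u \sim \pi_\alpha(\cdot \mid x)}
      \!\left[ \nabla_{\alpha} \log \pi_\alpha(u \mid x)\; Q^{\pi_\alpha}(x,u) \right].
```

Sampling (x, u) by simply running the policy supplies the stationary distribution for free, so a sample-based update moves along nabla_alpha log pi_alpha(u | x) times an estimate of Q^{pi_alpha}(x, u).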
So if I put a hat on there, then that's actually a good estimator of the policy gradient using an approximate Q function. And in the case where you hold up your updates for a long time and then make an estimate in an episodic case, it actually results in that exact algorithm. OK? Getting to that with a more detailed explanation is painful. But it's good to know.

I think the way you're going to appreciate actor-critic algorithms, though, is by seeing them work. OK? So let me show you how I made them work on a walking robot for my thesis. I've already done this. Is it going to turn on?

Since I think everybody's here, maybe while it's booting I'll do a quick context switch. Let's figure out projects real quick. And then we'll go back. I don't want to run out of time and forget to say all the last details about the projects. Yeah?

Somehow, I never remember to post the syllabus with all the dates on there. We're posting it now. But I can't believe I didn't post it a long time ago on the website. But I hope you know that the end of term is coming fast. Yeah? And you know you're doing a write-up. Right? And that write-up-- we're going to say the 21st, which is basically the last day I can possibly still grade them by. The write-up as described, which is I said six pages, sort of an [INAUDIBLE]-type format, is going to be due on May 21, online.

OK. But next week, the last week of term already, we're going to try to do oral presentations so you guys can tell me-- eight minutes each is what it works out to be. You get to tell us what you've been working on. OK? For each project-- there are a few of you that are working in pairs, but we'll still just do eight minutes per project. And we have 19 total projects.
So I figure we do eight-- sorry, nine-- next Thursday, which is going to be the 14th. Is that right? 5-14. And nine on 5-12, working back here, which leaves some unlucky son of a gun going on Thursday.

And the way I've always done this is I have a MATLAB script here that has everybody's name in it. Yeah. Why is it not on here? OK. I have a MATLAB script with all your names in it. OK? And I'm going to do a randperm on the names. And it'll print out what day you're going.

STUDENT: Maybe in fairness to that person, would we all be happy to stay an extra eight minutes on whatever it is? Tuesday?

RUSS TEDRAKE: Let's do it this way first. And then we'll figure it out.

[LAUGHTER]

And yes. So I'm going to call randperm in MATLAB. And for dramatic effect this year, I've added pause statements between the print commands.

[LAUGHTER]

So we should have a good time with this, I think. I will, at least. OK. Good. Let's make this nice and big.

I actually was going to just use a few slides from the middle of this. But I thought I'd at least let you see the motivation behind it as well. And I'll go through it quickly. But just to see at least my take on it in 2005, which hasn't changed a whole lot. It's matured, I hope. But I've told you about walking robots. We spent more time talking about passive walkers than we talked about some of the other approaches. But there's actually a lot of good walking robots out there. Even in 2005, there were a lot of good ones. This one is M2 from the Leg Lab. The wiring could have been cleaner. But it's actually a pretty beautiful robot in a lot of ways. The simulations of it are great. It hasn't walked very nicely yet. But it's a detail.
[LAUGHTER]

Honda's ASIMO had sort of the same humble beginnings. As you can imagine, it's not really fair that academics have to compete with people like Honda. Right? I mean, our robots looked like what you saw on the last page. And ASIMO looks like what it looks like. But it's kind of fun to see where ASIMO came from. So this is ASIMO 0.000. Right? And this is actually the progression of their ASIMO robots. That's the first one they told the world about in '97. Rocked the world of robotics. I was in the Leg Lab, remember. At the time, we were kind of like, oh wow. They did that? Wow. That sort of changed our view of the world. That's P3. And that's ASIMO. Right? Really, really still one of the most beautiful robots around.

You know about under-actuated systems. I don't have to tell you that. You know about acrobots. You know walking is under-actuated. Right? Just to say it again-- and I said it quickly-- essentially, the way ASIMO works is they are trying to avoid under-actuation. Right? When you watch videos of ASIMO walking, it's always got its foot flat on the ground. There's an exception where it runs with an aerial phase that you need a high-speed camera to see. But--

[LAUGHTER]

It's true. And that's just a small sort of deviation where they turn off the stability of the control system for long enough. And they can recover. Their controller is robust enough in the flat-on-the-ground phase that they can catch small disturbances, which are their uncontrolled aerial phase. So for the most part, they keep their foot flat on the ground. They assume that their foot is bolted to the ground, which would make them fully actuated. Right?
And then they do a lot of work to make sure that that assumption stays valid. So they're constantly estimating the center of pressure of that foot and trying to keep it inside the foot, which means the foot will not tip. And if you've heard of ZMP control, that's the ZMP control idea. OK? And then they do good robotics in between there. They're designing desired trajectories carefully. They're keeping the knees bent to avoid singularities. They're doing some-- depends on the story. I've heard good claims that they do very smart adaptive trajectory tracking control. I've heard more recently that they just do PD control, and that's good enough because they've got these enormous gear ratios. And that's good enough.

OK. So you've seen ASIMO working. The problem with it is that it's really inefficient. Right? Uses way too much energy. Walks slowly. And has no robustness. Right? I've told you that story.

Here's one view of everything we've been doing in this class. The fundamental thing that ASIMO is not doing in its control system is thinking about the future. OK? So if you were taking a reinforcement learning class, you would have started off with talking about delayed reward. And that's what makes the learning problem difficult. Right? I didn't use the words delayed reward in this class. But it's actually exactly the same thing. The fact that we're optimizing a cost function over some interval into the future means that I'm thinking about the future. I'm planning over the future. I'm doing long-term planning. And if you think about having to wait to the end of that future to figure out if what you did made sense, that's the delayed reward problem. It's exactly the thing that reinforcement learning folks use to convince other people that reinforcement learning is hard. OK?
So the point in walking is that you could do better if you stopped just trying to be fully actuated all the time and started thinking about the future. Think about long-term stability instead of trying to be fully actuated. OK?

The hoppers-- there are examples of really dynamically dexterous locomotion. But there are no general solutions to that. That's what this class has been trying to go for. So we do optimal control. We would love to have analytical approximations for optimal control for full humanoids like ASIMO. Love to have it. Don't have it. We're not even close. You know the tools that we have now. But even if we did have an analytical approximation of optimal control-- maybe we will in a few years, who knows-- we'd still like to have learning. Right? All this model-free stuff is still valuable because, if the world changes, you'd like to adapt. Right?

So my thesis was basically about trying to show that I could do online optimization on a real system in real time. And I told you about Andrew Ng's helicopters. There's a lot of work on Sony dogs that do trajectory optimization from trial and error. So Sony came out, and they had this sort of walking gait. Right? And then people started using them for soccer. And they said, how fast can we make this thing go? It turns out the fastest thing an AIBO does is to walk on its knees like this. And they found that from a policy gradient search, where they basically made the dog walk back and forth between a pink cone and a blue cone, just back and forth all day long, doing policy gradient. And they figured out this is a nice fast way to go. And then they won the soccer competition.

[LAUGHTER]

Not actually sure if that last part is true. I don't know who won. But I'd like to think it's true.
There are people that do a lot of walking robots. I think I showed you the UNH bipeds that were some of the first learning bipeds. Right? I told you about these all term. Right? So there are large continuous state and action spaces, complex dynamics. We want to minimize the number of trials. The dynamics are tough for walking, because of the collisions. And there's this delayed reward.

So in my thesis, the thing I did was try to build a robot that learned well. That was my goal. I simultaneously designed a good learning system but also built a robot where learning would work really well. Instead of working on ASIMO, I worked on this little dinky thing I call Toddler. Yeah? And I spent a lot of time on that little robot. So you know about passive walking.

This is the simplest-- this is the first passive walker I built. Passive walking 101 here. So it's sort of a funny story. I mean, I was in a neuroscience lab. I worked with the Leg Lab. But my advisor was in neuroscience. They spent lots of money on microscopes and lots of money. So at some point, I said, can I spend a little bit of money on a machine shop? And I promise it'll cost less than that lens you just bought for that one microscope? And so he gave me a little bit of money to go down. I was basically in a closet at the end of the hall. My tools looked like things like this. Like, I couldn't even afford another piece of rubber when I cut off a corner. And that's actually a CD rack that I got rid of somewhere. And that's my little wooden ramp that I was using for passive walking. But I built these little passive walkers, with a little Sherline CNC mill, that walked stably in 3D down a small ramp. Yeah? I don't know why it's playing badly.

So those were the first steps. If we're going to do walking, it's not hard. Those feet are actually CNC-ed out.
I spent a lot of time on those feet. They have a curvature that was designed carefully to get stability.

STUDENT: It's just a simple [INAUDIBLE].

RUSS TEDRAKE: Yeah. Just a pin joint. That's a walking robot.

At the time, people had been working on passive walkers for a long time. But nobody had done the obvious thing, which is add a few motors and make it walk on the flat. Nobody had done it. So that's what I set out to do with the learning. Turns out a few people did it around the same time. So we wrote a paper together. But the basic story was we went from this simple thing that was passive to the actuated version. The hip joint here on this robot is still passive. OK? We put actuators in at the ankle. So we had new degrees of freedom with actuators so that it could push off the ground but still keep its mostly passive gait. Actually, it's extruded stock here, stacked with gyros and rate gyros and all kinds of sensors. It's got a 700 megahertz Pentium in its belly, which kind of stung. In retrospect, I couldn't make very many efficiency arguments about the robot because it's carrying a computer the size of a desktop at the time. You know? And so there are five batteries total on the system. Right? Those four are powering the computer. There's one little one in there that's powering the motors. And still, those big four drained like 50% faster than the other one. But it's computationally powerful. Right? I actually ran a little web server off there, just because I thought it was funny.

[LAUGHTER]

And the arms look like I've added degrees of freedom. But actually, they're mechanically attached to the opposite leg. So when I move this, that bar across the front was making that coupling happen, which is important for the 3D walking.
836 00:45:24,320 --> 00:45:25,700 Because if you want to walk down, 837 00:45:25,700 --> 00:45:28,630 if you have no arms actually and you swing a big heavy foot, 838 00:45:28,630 --> 00:45:30,380 then you're going to get a big yaw moment. 839 00:45:30,380 --> 00:45:32,283 And the robots would often walk like this 840 00:45:32,283 --> 00:45:33,700 and go off the side of the ramp. 841 00:45:33,700 --> 00:45:36,200 So you put the big batteries on the side and then everything 842 00:45:36,200 --> 00:45:36,950 walks straight. 843 00:45:36,950 --> 00:45:38,427 And it's good. 844 00:45:38,427 --> 00:45:40,260 So in total, there's nine degrees of freedom 845 00:45:40,260 --> 00:45:42,620 if you count all the things that could possibly move. 846 00:45:42,620 --> 00:45:44,458 And there's four motors to do the controls. 847 00:45:44,458 --> 00:45:45,500 So that's under-actuated. 848 00:45:45,500 --> 00:45:46,000 Right? 849 00:45:48,392 --> 00:45:49,600 We've got the robot dynamics. 850 00:45:49,600 --> 00:45:50,480 Oops. 851 00:45:50,480 --> 00:45:51,520 I use a Mac now. 852 00:45:51,520 --> 00:45:52,520 I used to use Windows. 853 00:45:52,520 --> 00:45:55,231 So apparently my u is now O hat. 854 00:45:55,231 --> 00:45:57,540 [LAUGHTER] 855 00:45:57,540 --> 00:45:58,107 Sorry. 856 00:45:58,107 --> 00:45:58,940 That's actually tau. 857 00:45:58,940 --> 00:45:59,440 OK. 858 00:45:59,440 --> 00:46:00,450 So tau. 859 00:46:00,450 --> 00:46:00,950 Yeah. 860 00:46:00,950 --> 00:46:04,910 So I had almost the manipulator equations. 861 00:46:04,910 --> 00:46:07,850 But I had to go through this little hobby servo. 862 00:46:07,850 --> 00:46:11,210 So it wasn't quite the manipulator equation. 863 00:46:11,210 --> 00:46:14,750 And the goal was to find a control policy pi that was-- 864 00:46:14,750 --> 00:46:17,182 so it was already stable down a small ramp. 865 00:46:17,182 --> 00:46:18,890 And the way I formulated the problem is I 866 00:46:18,890 --> 00:46:20,900 wanted to take that same limit cycle 867 00:46:20,900 --> 00:46:23,628 that I could find experimentally down a ramp 868 00:46:23,628 --> 00:46:25,420 and make it so it worked on whatever slope. 869 00:46:25,420 --> 00:46:30,020 So make the return map dynamics invariant to slope. 870 00:46:30,020 --> 00:46:31,820 And to do that, you need to add energy. 871 00:46:31,820 --> 00:46:34,230 And you need to find a control policy. 872 00:46:34,230 --> 00:46:39,050 So my goal was to find this pi, stabilize the limit cycle 873 00:46:39,050 --> 00:46:44,460 solution that I saw downhill to make it work on any slope. 874 00:46:44,460 --> 00:46:49,610 So this was just showing that Toddler, with its computer 875 00:46:49,610 --> 00:46:52,160 turned off, its motors turned on-- actually, 876 00:46:52,160 --> 00:46:54,125 in this one even the motors are off. 877 00:46:54,125 --> 00:46:56,000 And there's just little splints on the ankle. 878 00:46:56,000 --> 00:46:58,220 Just showing that it was also a passive walker. 879 00:46:58,220 --> 00:47:01,310 And showing that I dramatically improved my hardware experience 880 00:47:01,310 --> 00:47:04,122 by getting a little ProForm treadmill 881 00:47:04,122 --> 00:47:05,330 that was off of the back lot. 882 00:47:05,330 --> 00:47:08,240 And I painted it yellow and stuff. 883 00:47:08,240 --> 00:47:11,430 So this thing would actually walk all day long. 884 00:47:11,430 --> 00:47:11,930 It would. 885 00:47:11,930 --> 00:47:15,590 So it's a little trick.
886 00:47:15,590 --> 00:47:18,030 At the very edge, in the middle, there's nothing going on. 887 00:47:18,030 --> 00:47:19,640 But at the very edge of the treadmill, 888 00:47:19,640 --> 00:47:21,150 I put a little lip there. 889 00:47:21,150 --> 00:47:23,330 So if it happened to wander itself over to the side, 890 00:47:23,330 --> 00:47:25,080 it would hit that lip and walk back towards the middle. 891 00:47:25,080 --> 00:47:25,730 OK? 892 00:47:25,730 --> 00:47:28,550 And I put a little wedge on the front and on the back 893 00:47:28,550 --> 00:47:31,640 so it sort of would try to stay in the middle of the treadmill. 894 00:47:31,640 --> 00:47:33,473 And that thing would just walk all day long. 895 00:47:33,473 --> 00:47:37,400 It would drive you crazy hearing those footsteps all day long. 896 00:47:37,400 --> 00:47:38,870 [LAUGHTER] 897 00:47:38,870 --> 00:47:39,530 But it worked. 898 00:47:39,530 --> 00:47:40,245 It worked well. 899 00:47:40,245 --> 00:47:41,870 It still works today, most of the time. 900 00:47:44,600 --> 00:47:46,850 So I use the words policy gradient. 901 00:47:46,850 --> 00:47:50,870 But this was really an actor critic algorithm. 902 00:47:50,870 --> 00:47:55,250 So I used a linear function approximator-- 903 00:47:55,250 --> 00:47:57,260 it's actually a barycentric grid in phi. 904 00:47:57,260 --> 00:47:59,160 And the basic story was policy gradient. 905 00:47:59,160 --> 00:47:59,660 OK? 906 00:47:59,660 --> 00:48:04,100 So it was something in between this perfectly online 907 00:48:04,100 --> 00:48:07,250 at every dt, make an update. 908 00:48:07,250 --> 00:48:10,160 And it was not quite the episodic run a trial, 909 00:48:10,160 --> 00:48:11,960 stop, run a trial, stop. 910 00:48:11,960 --> 00:48:16,310 The cost function was really a long term cost. 911 00:48:16,310 --> 00:48:18,200 But I did it once per footstep. 912 00:48:18,200 --> 00:48:19,610 OK? 913 00:48:19,610 --> 00:48:21,980 So every time the robot literally took a footstep, 914 00:48:21,980 --> 00:48:24,860 I would make a small change to the policy parameters. 915 00:48:24,860 --> 00:48:26,480 See how well it walked. 916 00:48:26,480 --> 00:48:27,830 See where it hit the return map. 917 00:48:27,830 --> 00:48:29,372 And then change the parameters again. 918 00:48:29,372 --> 00:48:30,993 Change the parameters again. 919 00:48:30,993 --> 00:48:32,660 And every time that foot hit the ground, 920 00:48:32,660 --> 00:48:35,900 I would evaluate the change in walking performance 921 00:48:35,900 --> 00:48:38,768 and make the change in W based on that result. OK. 922 00:48:38,768 --> 00:48:41,060 I'll show you the algorithm that I used in a second, which 923 00:48:41,060 --> 00:48:43,230 you'll now recognize. 924 00:48:43,230 --> 00:48:45,063 So the way to think about that sampling in W 925 00:48:45,063 --> 00:48:46,980 is that you're estimating the policy gradient. 926 00:48:46,980 --> 00:48:49,430 And you're performing online stochastic gradient descent. 927 00:48:49,430 --> 00:48:50,330 Right? 928 00:48:50,330 --> 00:48:53,600 So at the time, the way I described the big challenge 929 00:48:53,600 --> 00:48:55,790 was: what is the cost function for walking? 930 00:48:55,790 --> 00:48:58,850 And how do you achieve fast provable convergence, 931 00:48:58,850 --> 00:49:02,527 despite noisy gradient estimates? 932 00:49:02,527 --> 00:49:03,860 You guys know about return maps. 933 00:49:03,860 --> 00:49:06,830 This is my picture of return maps from a long time ago.
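(As a rough sketch of the per-footstep update just described: the robot perturbs its policy weights a little, takes a footstep, scores where it landed on the return map, and nudges the weights along the perturbation scaled by the cost. This is an illustration only, assuming a weight-perturbation style gradient estimate with a baseline; the names are made up and this is not the code that ran on Toddler.)

```python
import numpy as np

# Illustrative sketch only, not the original Toddler code: one small,
# weight-perturbation style policy-gradient update per footstep.
def footstep_update(w, z, cost, baseline, alpha=1e-3, sigma=1e-2):
    """w: policy weights; z: the perturbation that was added to w for this
    footstep; cost: scalar score of the footstep (e.g. a return-map error);
    baseline: estimate of the typical cost, used to reduce variance."""
    grad_estimate = (cost - baseline) * z / sigma**2   # noisy policy-gradient estimate
    w = w - alpha * grad_estimate                      # online stochastic gradient step
    z = sigma * np.random.randn(*w.shape)              # fresh perturbation for the next step
    return w, z
```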
934 00:49:09,810 --> 00:49:12,260 So this is the Van der Pol Oscillator. 935 00:49:12,260 --> 00:49:14,480 This is the return map here. 936 00:49:14,480 --> 00:49:16,610 The important point here, so this is 937 00:49:16,610 --> 00:49:18,088 the samples on the return map. 938 00:49:18,088 --> 00:49:20,630 This is the velocity at the n-th crossing versus the velocity 939 00:49:20,630 --> 00:49:22,460 at the (n+1)-th crossing. 940 00:49:22,460 --> 00:49:25,100 The blue line is the line of slope one. 941 00:49:25,100 --> 00:49:27,590 So it's stable, the Van der Pol Oscillator, 942 00:49:27,590 --> 00:49:30,440 because it's above the line here and below the line there. 943 00:49:30,440 --> 00:49:32,260 And you can evaluate local stability 944 00:49:32,260 --> 00:49:34,010 by linearizing and taking the eigenvalues. 945 00:49:34,010 --> 00:49:36,590 We've talked about these things. 946 00:49:36,590 --> 00:49:39,620 But I don't know if I made the point nicely before. 947 00:49:39,620 --> 00:49:41,870 That if you can pick anything, if you want your return 948 00:49:41,870 --> 00:49:43,495 map to look like anything in the world, 949 00:49:43,495 --> 00:49:45,860 if you could pick, what would you pick? 950 00:49:45,860 --> 00:49:47,150 You'd pick a flat line. 951 00:49:47,150 --> 00:49:48,200 Right? 952 00:49:48,200 --> 00:49:50,120 That's the deadbeat controller. 953 00:49:50,120 --> 00:49:52,790 I used the word deadbeat. 954 00:49:52,790 --> 00:49:55,277 So that's where my cost function came from. 955 00:49:55,277 --> 00:49:57,860 The cost function that tried to say that the robot was walking 956 00:49:57,860 --> 00:49:58,360 well-- 957 00:50:03,000 --> 00:50:08,370 wow-- my instantaneous cost 958 00:50:08,370 --> 00:50:12,810 function penalized the squared distance between my sample 959 00:50:12,810 --> 00:50:15,450 on the return map and the desired return map, 960 00:50:15,450 --> 00:50:16,810 which is that green line. 961 00:50:16,810 --> 00:50:17,760 OK. 962 00:50:17,760 --> 00:50:20,130 So basically I wanted, I tried to drive the system 963 00:50:20,130 --> 00:50:22,290 to have a deadbeat controller. 964 00:50:22,290 --> 00:50:23,610 And I did, but there are limits. 965 00:50:23,610 --> 00:50:24,930 There are actuator limits that mean 966 00:50:24,930 --> 00:50:25,770 it's never going to get there. 967 00:50:25,770 --> 00:50:27,330 But my cost function was trying to force that. 968 00:50:27,330 --> 00:50:28,705 Every time I got a sample, it was 969 00:50:28,705 --> 00:50:31,110 trying to push that sample more towards 970 00:50:31,110 --> 00:50:32,517 the deadbeat controller. 971 00:50:38,720 --> 00:50:40,610 Then basically, it worked. 972 00:50:40,610 --> 00:50:41,940 It worked really well. 973 00:50:41,940 --> 00:50:44,960 The robot began walking in one minute, which 974 00:50:44,960 --> 00:50:48,335 means it started getting foot clearance. 975 00:50:48,335 --> 00:50:50,210 So the first thing, if I set W equal to zero, 976 00:50:50,210 --> 00:50:53,090 it was configured so that when the policy parameters were 977 00:50:53,090 --> 00:50:54,620 zero, it was a passive walker. 978 00:50:54,620 --> 00:50:56,000 So I put it on flat. 979 00:50:56,000 --> 00:50:56,900 I picked it up. 980 00:50:56,900 --> 00:50:58,820 I picked it up a lot. 981 00:50:58,820 --> 00:50:59,570 And I drop it. 982 00:50:59,570 --> 00:51:01,310 It runs out of energy and stands still.
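(To make that concrete, here is a small sketch of the two ingredients just described: the deadbeat-style return-map cost and the local stability check. It is an illustration under my own naming, not the original code; x_star is the desired fixed point on the Poincare section, and return_map is assumed to be a function mapping one crossing to the next.)

```python
import numpy as np

def deadbeat_cost(x_next, x_star):
    # Penalize the squared distance between the observed return-map sample
    # and the desired return map (a flat line at the fixed point x_star).
    return float(np.sum((np.asarray(x_next) - np.asarray(x_star)) ** 2))

def locally_stable(return_map, x_star, eps=1e-5):
    # Linearize the return map about the fixed point by finite differences
    # and check that all eigenvalues lie inside the unit circle.
    x_star = np.asarray(x_star, dtype=float)
    n = len(x_star)
    A = np.zeros((n, n))
    for i in range(n):
        dx = np.zeros(n)
        dx[i] = eps
        A[:, i] = (return_map(x_star + dx) - return_map(x_star - dx)) / (2 * eps)
    return bool(np.all(np.abs(np.linalg.eigvals(A)) < 1.0))
```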
983 00:51:01,310 --> 00:51:02,852 Because it was just a passive walker, 984 00:51:02,852 --> 00:51:04,730 it's not getting energy from-- 985 00:51:04,730 --> 00:51:05,790 it's only losing energy. 986 00:51:05,790 --> 00:51:06,560 OK? 987 00:51:06,560 --> 00:51:08,450 So now, I pick it up. 988 00:51:08,450 --> 00:51:09,320 I drop it. 989 00:51:09,320 --> 00:51:12,897 And every time it takes a step, it's twiddling the parameters 990 00:51:12,897 --> 00:51:13,980 at the ankle a little bit. 991 00:51:13,980 --> 00:51:14,480 OK? 992 00:51:14,480 --> 00:51:16,590 So it started going like this a little bit. 993 00:51:16,590 --> 00:51:19,580 And then after about a minute of dropping it-- and quickly, 994 00:51:19,580 --> 00:51:21,480 I wrote a script that would kick it into place so I stopped 995 00:51:21,480 --> 00:51:21,860 dropping it-- 996 00:51:21,860 --> 00:51:22,460 OK. 997 00:51:22,460 --> 00:51:24,585 So I wrote a little script so it would go like this. 998 00:51:24,585 --> 00:51:27,740 And in about one minute, it was sort of marching in place. 999 00:51:27,740 --> 00:51:28,428 OK? 1000 00:51:28,428 --> 00:51:29,970 And then I started driving it around. 1001 00:51:29,970 --> 00:51:31,050 I had a little joystick which said, 1002 00:51:31,050 --> 00:51:33,000 I want your desired body to go like this. 1003 00:51:33,000 --> 00:51:34,370 And it started walking around. 1004 00:51:34,370 --> 00:51:37,860 And in about five minutes, it was sort of walking around. 1005 00:51:37,860 --> 00:51:39,900 I'll show you the video here in a second. 1006 00:51:39,900 --> 00:51:41,910 And then, I said 20 minutes for convergence. 1007 00:51:41,910 --> 00:51:42,620 That was conservative. 1008 00:51:42,620 --> 00:51:44,120 Most of the time, it was 10 minutes. 1009 00:51:44,120 --> 00:51:48,620 It would converge to the policy that was locally optimal 1010 00:51:48,620 --> 00:51:49,550 in this policy class. 1011 00:51:49,550 --> 00:51:50,980 But it worked very well. 1012 00:51:50,980 --> 00:51:53,060 And I just sort of sent it off down the hall. 1013 00:51:53,060 --> 00:51:53,960 And it would walk. 1014 00:51:53,960 --> 00:51:55,640 OK? 1015 00:51:55,640 --> 00:51:58,070 And doing the stability analysis, 1016 00:51:58,070 --> 00:52:01,160 it showed the learned controllers were considerably more stable 1017 00:52:01,160 --> 00:52:04,610 than the controllers I designed by hand, and I spent 1018 00:52:04,610 --> 00:52:05,930 a long time on those, too. 1019 00:52:08,840 --> 00:52:10,640 And now, here's a really key point. 1020 00:52:10,640 --> 00:52:11,630 OK? 1021 00:52:11,630 --> 00:52:18,587 So you might ask, how much is this sort of approximate value 1022 00:52:18,587 --> 00:52:19,920 function, how important is that? 1023 00:52:19,920 --> 00:52:21,010 That's sort of the topic for today. 1024 00:52:21,010 --> 00:52:21,510 Right? 1025 00:52:21,510 --> 00:52:24,940 How important is this approximate value function? 1026 00:52:24,940 --> 00:52:29,180 Well, it turns out, if I were to reset the policy, 1027 00:52:29,180 --> 00:52:31,977 if I just set the policy parameters to zero again 1028 00:52:31,977 --> 00:52:34,060 but keep the value function from the previous time 1029 00:52:34,060 --> 00:52:39,500 it learned, then the whole thing speeds up dramatically. 1030 00:52:39,500 --> 00:52:41,420 So instead of converging in 20 minutes, 1031 00:52:41,420 --> 00:52:43,190 the thing converges in like two minutes. 1032 00:52:43,190 --> 00:52:44,020 OK?
1033 00:52:44,020 --> 00:52:48,640 So just by virtue of having a good value estimate there, 1034 00:52:48,640 --> 00:52:50,405 learning goes dramatically faster. 1035 00:52:50,405 --> 00:52:52,030 And it's only when I have to learn them 1036 00:52:52,030 --> 00:52:53,830 both simultaneously that it takes more 1037 00:52:53,830 --> 00:52:56,770 like 10 or 20 minutes. 1038 00:52:56,770 --> 00:52:59,140 And it worked so fast that I never built a model. 1039 00:52:59,140 --> 00:53:01,750 I never built a model for the robot. 1040 00:53:01,750 --> 00:53:03,070 Actually, I tried later. 1041 00:53:03,070 --> 00:53:04,130 It's tough. 1042 00:53:04,130 --> 00:53:05,500 The dynamics of that-- 1043 00:53:05,500 --> 00:53:10,420 I mean, it's a curved foot with rubber on it, right? 1044 00:53:10,420 --> 00:53:14,405 It was just very hard to model accurately. 1045 00:53:14,405 --> 00:53:15,280 And I didn't need to. 1046 00:53:15,280 --> 00:53:15,780 It worked. 1047 00:53:15,780 --> 00:53:18,567 It learned very quickly. 1048 00:53:18,567 --> 00:53:20,650 Quickly enough that it was adapting to the terrain 1049 00:53:20,650 --> 00:53:21,370 as it walked. 1050 00:53:21,370 --> 00:53:21,870 All right. 1051 00:53:21,870 --> 00:53:25,360 So here are the Poincaré maps from that little Toddler robot 1052 00:53:25,360 --> 00:53:27,940 projected onto a plane. 1053 00:53:27,940 --> 00:53:30,400 So I picked it up a bunch of times. 1054 00:53:30,400 --> 00:53:33,460 I tried to make it just walk in place here. 1055 00:53:33,460 --> 00:53:35,650 Before learning, it was obviously 1056 00:53:35,650 --> 00:53:37,450 only stable at the zero-zero fixed point. 1057 00:53:37,450 --> 00:53:41,590 It was running out of energy on every step and going to zero. 1058 00:53:41,590 --> 00:53:45,550 After learning, this is what the return map looked like. 1059 00:53:45,550 --> 00:53:47,170 OK? 1060 00:53:47,170 --> 00:53:50,470 So it actually could start from stopped reliably. 1061 00:53:50,470 --> 00:53:51,460 Right? 1062 00:53:51,460 --> 00:53:55,290 This is actually far better than I expected it to do. 1063 00:53:55,290 --> 00:53:59,200 If you do your little staircase analysis of this, 1064 00:53:59,200 --> 00:54:04,780 it gets up to the fixed point in two steps or three steps 1065 00:54:04,780 --> 00:54:07,400 for most initial conditions. 1066 00:54:07,400 --> 00:54:07,900 Right? 1067 00:54:07,900 --> 00:54:09,820 And from a very large range of initial conditions, 1068 00:54:09,820 --> 00:54:11,380 as large as I cared to sample from. 1069 00:54:11,380 --> 00:54:14,650 So you could go up there-- and people did actually. 1070 00:54:14,650 --> 00:54:15,580 We had a little-- 1071 00:54:15,580 --> 00:54:19,360 after we got it working, the press came. 1072 00:54:19,360 --> 00:54:21,887 And then everybody was asking me, the reporters were saying, 1073 00:54:21,887 --> 00:54:23,470 can I have my kid play with the robot? 1074 00:54:23,470 --> 00:54:26,080 Or can we put it on a treadmill at the gym? 1075 00:54:26,080 --> 00:54:29,170 Rich Sutton put his fingers under it 1076 00:54:29,170 --> 00:54:31,960 and was playing with it, doing dips with it, one time. 1077 00:54:31,960 --> 00:54:34,270 So it got disturbed in every possible way. 1078 00:54:34,270 --> 00:54:35,980 And for the most part, it worked really-- 1079 00:54:35,980 --> 00:54:38,260 I mean, so if you give it a big push this way, 1080 00:54:38,260 --> 00:54:41,560 it actually takes energy out and comes back and recovers 1081 00:54:41,560 --> 00:54:42,970 in two steps.
1082 00:54:42,970 --> 00:54:44,020 You stop it. 1083 00:54:44,020 --> 00:54:44,710 It goes back up. 1084 00:54:44,710 --> 00:54:45,850 And it recovers. 1085 00:54:45,850 --> 00:54:50,710 And in the worst case, I had some demo to give or something. 1086 00:54:50,710 --> 00:54:52,590 And I took it out of the case. 1087 00:54:52,590 --> 00:54:54,910 It had traveled through the airport. 1088 00:54:54,910 --> 00:54:59,140 The customs people always asked me if it had commercial value. 1089 00:54:59,140 --> 00:55:00,670 It doesn't have commercial value. 1090 00:55:03,490 --> 00:55:05,380 But it broke somewhere in the travel. 1091 00:55:05,380 --> 00:55:06,380 And I didn't realize it. 1092 00:55:06,380 --> 00:55:08,958 I picked it up and headed to do its demo. 1093 00:55:08,958 --> 00:55:10,000 And it's going like this. 1094 00:55:10,000 --> 00:55:11,103 And it's sort of walking. 1095 00:55:11,103 --> 00:55:12,270 And it looks a little funny. 1096 00:55:12,270 --> 00:55:14,062 And people are so relatively happy with it. 1097 00:55:14,062 --> 00:55:16,150 Turns out the ankle had completely snapped. 1098 00:55:16,150 --> 00:55:17,650 But in just a few steps, it actually 1099 00:55:17,650 --> 00:55:19,817 found a policy that was walking with a broken ankle. 1100 00:55:19,817 --> 00:55:21,640 [LAUGHTER] 1101 00:55:21,640 --> 00:55:22,210 So it works. 1102 00:55:22,210 --> 00:55:23,830 It really worked. 1103 00:55:23,830 --> 00:55:24,640 It really did work. 1104 00:55:24,640 --> 00:55:27,790 I'm not sure-- I mean, yeah. 1105 00:55:27,790 --> 00:55:28,660 It really worked. 1106 00:55:28,660 --> 00:55:29,160 OK. 1107 00:55:29,160 --> 00:55:31,690 So here's the basic video. 1108 00:55:31,690 --> 00:55:33,705 This was the beginning. 1109 00:55:33,705 --> 00:55:34,330 I was paranoid. 1110 00:55:34,330 --> 00:55:37,562 So I had pads on it to make sure it didn't fall down and break. 1111 00:55:37,562 --> 00:55:39,520 This is the little policy that would kick it up 1112 00:55:39,520 --> 00:55:41,960 into a random initial condition like that. 1113 00:55:41,960 --> 00:55:43,660 And now it's learning. 1114 00:55:43,660 --> 00:55:44,570 It falls down. 1115 00:55:44,570 --> 00:55:47,323 I don't know why it's playing so badly. 1116 00:55:47,323 --> 00:55:48,490 This is after a few minutes. 1117 00:55:48,490 --> 00:55:49,720 It's stepping in place. 1118 00:55:49,720 --> 00:55:50,320 It's walking. 1119 00:55:54,820 --> 00:55:56,530 And then I started driving it around. 1120 00:55:56,530 --> 00:55:57,040 I say, OK. 1121 00:55:57,040 --> 00:55:57,790 Let's walk around. 1122 00:55:57,790 --> 00:56:00,070 And it stumbles. 1123 00:56:00,070 --> 00:56:03,330 But really, really fast, it learned a policy 1124 00:56:03,330 --> 00:56:04,330 that could stabilize it. 1125 00:56:07,210 --> 00:56:08,820 Right? 1126 00:56:08,820 --> 00:56:13,600 And after a few minutes, this is the disturbance tests. 1127 00:56:13,600 --> 00:56:17,180 I actually haven't shown these in a long time. 1128 00:56:19,750 --> 00:56:21,960 It's really robust to those things. 1129 00:56:21,960 --> 00:56:24,570 And then you can send it off down the hall. 1130 00:56:24,570 --> 00:56:27,960 And now, this is a little robot with big feet admittedly. 1131 00:56:27,960 --> 00:56:31,800 But you know, it's like the linoleum in E25-- 1132 00:56:31,800 --> 00:56:35,160 this is in E25-- was really not flat. 1133 00:56:35,160 --> 00:56:37,590 I mean, it's sort of embarrassing to tell people, 1134 00:56:37,590 --> 00:56:38,370 look at the floor. 
1135 00:56:38,370 --> 00:56:38,980 It's not flat. 1136 00:56:38,980 --> 00:56:42,160 But for that robot, I mean there's huge disturbances 1137 00:56:42,160 --> 00:56:46,140 as it walked down the floor. 1138 00:56:46,140 --> 00:56:48,540 But the policy parameters were changing quite a bit. 1139 00:56:48,540 --> 00:56:50,550 You could walk off tile onto carpet. 1140 00:56:50,550 --> 00:56:53,220 And in a few steps, it would adjust its parameters 1141 00:56:53,220 --> 00:56:54,090 and keep on walking. 1142 00:56:54,090 --> 00:56:57,493 This was it walking from E25 towards the Media Lab, 1143 00:56:57,493 --> 00:56:58,410 if you recognize that. 1144 00:57:11,350 --> 00:57:11,850 OK. 1145 00:57:11,850 --> 00:57:13,795 So one of the things I said is that one 1146 00:57:13,795 --> 00:57:15,420 of the problems with the value estimate 1147 00:57:15,420 --> 00:57:17,462 is you make a small change in the value function, 1148 00:57:17,462 --> 00:57:21,030 you get a big change in the policy. 1149 00:57:21,030 --> 00:57:22,740 Theoretically, no problem. 1150 00:57:22,740 --> 00:57:24,680 In practice, you probably don't want that. 1151 00:57:24,680 --> 00:57:25,180 Right? 1152 00:57:25,180 --> 00:57:27,660 One of the beautiful things about the policy gradient 1153 00:57:27,660 --> 00:57:30,573 algorithms is you make a small change to the policy. 1154 00:57:30,573 --> 00:57:32,740 It doesn't look like the robot's doing crazy things. 1155 00:57:32,740 --> 00:57:34,020 So every time, everything you saw there, 1156 00:57:34,020 --> 00:57:35,310 it was always learning. 1157 00:57:35,310 --> 00:57:35,970 Right? 1158 00:57:35,970 --> 00:57:38,280 Learning did not look like a big deviation 1159 00:57:38,280 --> 00:57:39,277 from nominal behavior. 1160 00:57:39,277 --> 00:57:40,860 I never turned off learning with this. 1161 00:57:40,860 --> 00:57:41,370 Right? 1162 00:57:41,370 --> 00:57:44,233 It turned out in the policy gradient setting, 1163 00:57:44,233 --> 00:57:45,900 I could add such a small amount of noise 1164 00:57:45,900 --> 00:57:48,300 to the policy parameters, which was a barycentric grid 1165 00:57:48,300 --> 00:57:52,800 over the state space, such a small amount of noise 1166 00:57:52,800 --> 00:57:55,060 that you couldn't even tell it was learning. 1167 00:57:55,060 --> 00:57:55,560 Right? 1168 00:57:55,560 --> 00:57:57,602 But it was enough to pull out a gradient estimate 1169 00:57:57,602 --> 00:57:58,610 and keep going. 1170 00:57:58,610 --> 00:58:00,735 So it didn't look like it was trying random things. 1171 00:58:00,735 --> 00:58:03,277 But then, if it walked off on the carpet and did a bad thing, 1172 00:58:03,277 --> 00:58:04,560 it would still adapt. 1173 00:58:04,560 --> 00:58:06,540 That was something I didn't expect. 1174 00:58:06,540 --> 00:58:08,780 It just was a very nice sort of match 1175 00:58:08,780 --> 00:58:10,530 between the amount of noise you had to add 1176 00:58:10,530 --> 00:58:14,550 and the speed of learning. 1177 00:58:14,550 --> 00:58:16,980 The value estimate was a low dimensional approximation 1178 00:58:16,980 --> 00:58:19,350 of the value function. 1179 00:58:19,350 --> 00:58:20,023 Very low. 1180 00:58:20,023 --> 00:58:20,940 Like ridiculously low. 1181 00:58:20,940 --> 00:58:21,523 One dimension. 1182 00:58:21,523 --> 00:58:22,800 Right? 1183 00:58:22,800 --> 00:58:24,960 But it was sufficient to decrease the variance 1184 00:58:24,960 --> 00:58:26,240 and allow fast convergence.
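(A small sketch of how a critic like that can enter the update: the learned value estimate supplies the baseline in the per-footstep gradient estimate, which is what cuts the variance. This is my own illustration of a standard actor-critic arrangement, with made-up names and a TD(0) critic; it is not a description of the exact updates used on Toddler.)

```python
import numpy as np

def critic_update(v, phi, cost, phi_next, gamma=0.2, beta=1e-2):
    """TD(0) update for a tiny linear value estimate.
    v: critic weights; phi, phi_next: low-dimensional value features at this
    and the next return-map crossing; cost: the per-footstep cost."""
    td_error = cost + gamma * (v @ phi_next) - (v @ phi)
    v = v + beta * td_error * phi
    return v, td_error

# In the per-footstep actor update sketched earlier, the baseline would then
# be the critic's prediction (v @ phi) rather than a fixed running average.
```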
1185 00:58:26,240 --> 00:58:28,573 I never got it to work before I put a value function in. 1186 00:58:31,530 --> 00:58:33,160 And here's this question. 1187 00:58:33,160 --> 00:58:36,390 So I ended up choosing gamma to be pretty low. 1188 00:58:36,390 --> 00:58:38,100 Gamma was 0.2. 1189 00:58:38,100 --> 00:58:39,960 I did try it with zero at times. 1190 00:58:39,960 --> 00:58:40,950 What did that mean? 1191 00:58:40,950 --> 00:58:44,280 So that's how far I carried back my eligibility, which 1192 00:58:44,280 --> 00:58:46,292 means how many steps ahead am I looking. 1193 00:58:46,292 --> 00:58:48,750 So you could think of it as receding horizon optimal 1194 00:58:48,750 --> 00:58:49,250 control. 1195 00:58:49,250 --> 00:58:50,910 How many steps ahead do you look? 1196 00:58:50,910 --> 00:58:51,720 Right. 1197 00:58:51,720 --> 00:58:52,870 Except it's discounted. 1198 00:58:52,870 --> 00:58:53,760 OK? 1199 00:58:53,760 --> 00:58:57,445 So 0.2 is really heavily discounted. 1200 00:58:57,445 --> 00:58:58,320 Really, really heavy. 1201 00:58:58,320 --> 00:59:00,600 It means I was basically looking one step ahead 1202 00:59:00,600 --> 00:59:03,570 and not worrying about things well 1203 00:59:03,570 --> 00:59:06,990 into the future, which made my learning faster but meant 1204 00:59:06,990 --> 00:59:09,660 I didn't take really aggressive corrections that were 1205 00:59:09,660 --> 00:59:11,640 multi-step sort of corrections. 1206 00:59:11,640 --> 00:59:14,550 Only very rarely, if the cost really warranted it. 1207 00:59:14,550 --> 00:59:15,247 OK. 1208 00:59:15,247 --> 00:59:16,830 So that was always something I thought 1209 00:59:16,830 --> 00:59:18,960 would be cool if I could get that higher 1210 00:59:18,960 --> 00:59:22,350 and show a reason why multi-step corrections made 1211 00:59:22,350 --> 00:59:23,850 it a lot more stable. 1212 00:59:23,850 --> 00:59:26,910 STUDENT: Did it not work as well? 1213 00:59:26,910 --> 00:59:28,810 RUSS TEDRAKE: It didn't learn as fast. 1214 00:59:28,810 --> 00:59:30,240 At some point, I decided I'm going 1215 00:59:30,240 --> 00:59:31,650 to try to make the point that these things can really 1216 00:59:31,650 --> 00:59:32,700 learn fast. 1217 00:59:32,700 --> 00:59:34,890 And so, I started turning all the knobs. 1218 00:59:34,890 --> 00:59:39,570 Simple policy, simple value function, low look ahead. 1219 00:59:39,570 --> 00:59:40,380 And it worked. 1220 00:59:40,380 --> 00:59:43,180 But it was fast. 1221 00:59:43,180 --> 00:59:49,360 STUDENT: Is gamma used [INAUDIBLE] the same as lambda? 1222 00:59:49,360 --> 00:59:55,070 RUSS TEDRAKE: It's a gamma in a discounted reward formulation. 1223 00:59:55,070 --> 00:59:57,632 STUDENT: So there is no eligibility trace? 1224 00:59:57,632 --> 00:59:59,090 RUSS TEDRAKE: The eligibility trace 1225 00:59:59,090 --> 01:00:01,142 for REINFORCE in a discounted problem 1226 01:00:01,142 --> 01:00:02,600 is the same as the discount factor. 1227 01:00:10,990 --> 01:00:13,383 So in my lab now, we're doing a lot 1228 01:00:13,383 --> 01:00:14,550 of these model based things. 1229 01:00:14,550 --> 01:00:15,450 We're doing LQR trees. 1230 01:00:15,450 --> 01:00:16,617 We're doing a lot of things. 1231 01:00:16,617 --> 01:00:19,170 In fact, the linear controls are working 1232 01:00:19,170 --> 01:00:23,250 so beautifully in simulation that Rick Corey, one 1233 01:00:23,250 --> 01:00:26,040 of our guys, started joshing me. 1234 01:00:26,040 --> 01:00:28,680 He's like, why didn't you just do LQR on Toddler?
1235 01:00:28,680 --> 01:00:31,153 And he was giving me a hard time for a long time. 1236 01:00:31,153 --> 01:00:32,820 Now he's asking about model free methods 1237 01:00:32,820 --> 01:00:39,450 again because it's really hard to get a good model of very 1238 01:00:39,450 --> 01:00:40,410 underactuated systems. 1239 01:00:40,410 --> 01:00:44,700 I mean, the plane that I'll tell you about more on Thursday, 1240 01:00:44,700 --> 01:00:49,050 our perching plane that we've seen briefly, has one actuator. 1241 01:00:49,050 --> 01:00:52,110 And depending on how you count the elevator, 1242 01:00:52,110 --> 01:00:54,480 eight degrees of freedom roughly. 1243 01:00:54,480 --> 01:00:58,530 And sorry, eight state variables. 1244 01:00:58,530 --> 01:01:01,860 And it's just very, very hard to build a good model 1245 01:01:01,860 --> 01:01:06,600 for that that's accurate for the long trajectory, the trajectory 1246 01:01:06,600 --> 01:01:08,730 all the way to the perch such that LQR could just 1247 01:01:08,730 --> 01:01:09,450 stabilize it. 1248 01:01:09,450 --> 01:01:11,053 We're trying. 1249 01:01:11,053 --> 01:01:13,470 But there's something sort of beautiful about these things 1250 01:01:13,470 --> 01:01:17,080 that just work without building a perfect model. 1251 01:01:17,080 --> 01:01:17,580 OK? 1252 01:01:20,820 --> 01:01:23,910 The big picture is roughly the class you saw. 1253 01:01:23,910 --> 01:01:26,250 This is actually, I had forgotten about this. 1254 01:01:26,250 --> 01:01:28,320 This was one of my backup slides from before. 1255 01:01:28,320 --> 01:01:33,330 But this is the basic learning plot, which 1256 01:01:33,330 --> 01:01:36,530 is just one average run here. 1257 01:01:36,530 --> 01:01:38,940 If I reset the learning parameters, 1258 01:01:38,940 --> 01:01:42,810 how quickly would it minimize the average one step error? 1259 01:01:42,810 --> 01:01:44,020 And it was pretty fast. 1260 01:01:44,020 --> 01:01:47,253 And then actually, that's a lot of steps. 1261 01:01:47,253 --> 01:01:48,420 That's more than I remember. 1262 01:01:48,420 --> 01:01:50,040 But this takes steps once a second. 1263 01:01:50,040 --> 01:01:52,680 And so, in a handful of minutes, it does hundreds of steps. 1264 01:01:52,680 --> 01:01:53,310 OK. 1265 01:01:53,310 --> 01:01:58,320 And this is the policy in two dimensions that it learned. 1266 01:01:58,320 --> 01:02:04,500 So if you think about a theta roll and theta roll dot, 1267 01:02:04,500 --> 01:02:07,350 I don't know if you have intuition about this, 1268 01:02:07,350 --> 01:02:12,240 but the sort of yin and yang of the Toddler 1269 01:02:12,240 --> 01:02:16,890 was that you wanted to push when you're in this side of the phase 1270 01:02:16,890 --> 01:02:18,602 portrait and push with this foot when 1271 01:02:18,602 --> 01:02:20,310 you're on this side of the phase portrait. 1272 01:02:20,310 --> 01:02:24,840 I did things like mirroring: the left ankle 1273 01:02:24,840 --> 01:02:27,890 was doing the inverse, the mirror, of the right ankle. 1274 01:02:27,890 --> 01:02:28,390 Right? 1275 01:02:28,390 --> 01:02:32,550 So I did everything I could to try to minimize the size 1276 01:02:32,550 --> 01:02:33,930 of the function I was learning. 1277 01:02:33,930 --> 01:02:36,300 And that's actually sort of a beautiful picture 1278 01:02:36,300 --> 01:02:42,510 of how it needed to push in order to stabilize its gait. 1279 01:02:48,008 --> 01:02:49,050 Any questions about that? 1280 01:02:59,560 --> 01:03:00,060 All right.
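(The mirroring he mentions is easy to illustrate. The sketch below is my own, with hypothetical names: one ankle policy is learned over the roll state, and the other ankle's command comes from mirroring the state and negating the output, so only half of the function has to be learned. The exact sign conventions on Toddler may well have differed.)

```python
import numpy as np

def mirror_roll_state(x):
    # x = [theta_roll, theta_roll_dot]; the mirrored (left/right swapped)
    # robot sees the roll state with its sign flipped.
    return -np.asarray(x, dtype=float)

def left_ankle_command(w, x, right_ankle_policy):
    # Left ankle = the mirror image of the single learned right-ankle policy.
    return -right_ankle_policy(w, mirror_roll_state(x))
```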
1281 01:03:00,060 --> 01:03:08,130 So that's one success story from model free learning 1282 01:03:08,130 --> 01:03:08,910 on real robots. 1283 01:03:08,910 --> 01:03:10,420 It learns in a few minutes. 1284 01:03:10,420 --> 01:03:11,670 There's other success stories. 1285 01:03:11,670 --> 01:03:14,550 I'll try to talk about more of them on Thursday. 1286 01:03:14,550 --> 01:03:16,260 But at this point, I've basically 1287 01:03:16,260 --> 01:03:21,930 given you all the tools that we talk about in research 1288 01:03:21,930 --> 01:03:23,907 to make these robots tick. 1289 01:03:23,907 --> 01:03:25,740 There's state estimation, which I didn't talk about. 1290 01:03:25,740 --> 01:03:28,420 There's Morse's idea that we didn't talk about. 1291 01:03:28,420 --> 01:03:31,740 But this is, I've given you a pretty big swath 1292 01:03:31,740 --> 01:03:33,270 of algorithms here. 1293 01:03:33,270 --> 01:03:36,000 So really I want to now hear from you next week. 1294 01:03:36,000 --> 01:03:38,070 And I want to give you a few more case 1295 01:03:38,070 --> 01:03:40,140 studies so you feel that these things actually 1296 01:03:40,140 --> 01:03:41,132 work in practice. 1297 01:03:41,132 --> 01:03:43,340 And you can go off and you use them in your research. 1298 01:03:43,340 --> 01:03:44,220 Yeah, John? 1299 01:03:44,220 --> 01:03:47,850 STUDENT: If there is a lot of stuff that's been published 1300 01:03:47,850 --> 01:03:52,080 and a lot of interest [INAUDIBLE] stochasticity, 1301 01:03:52,080 --> 01:03:54,790 then it would make sense to have a large gamma [INAUDIBLE].. 1302 01:03:54,790 --> 01:03:55,290 Right? 1303 01:03:55,290 --> 01:03:57,030 There'd be no reason, it would be 1304 01:03:57,030 --> 01:03:58,947 a faulty way of trying to interpret that data. 1305 01:03:58,947 --> 01:04:00,228 Right? 1306 01:04:00,228 --> 01:04:01,020 RUSS TEDRAKE: Yeah. 1307 01:04:01,020 --> 01:04:03,810 I mean, I think that so Katie's stuff, the metastability stuff, 1308 01:04:03,810 --> 01:04:06,308 argued that for most of these walking systems, 1309 01:04:06,308 --> 01:04:08,850 it doesn't make sense to look very far in the future anyways. 1310 01:04:08,850 --> 01:04:10,590 Because the dynamics of the system 1311 01:04:10,590 --> 01:04:13,230 mix with the stochasticity, which I think is the same thing 1312 01:04:13,230 --> 01:04:14,160 you just said. 1313 01:04:14,160 --> 01:04:15,030 Yeah. 1314 01:04:15,030 --> 01:04:15,720 Yeah. 1315 01:04:15,720 --> 01:04:19,700 STUDENT: The general dimensions of the robot [INAUDIBLE] 1316 01:04:19,700 --> 01:04:21,690 when you're designing that robot, 1317 01:04:21,690 --> 01:04:25,770 thinking about this model free learning when you started? 1318 01:04:25,770 --> 01:04:29,628 [INAUDIBLE] helps it be a little more stable. 1319 01:04:29,628 --> 01:04:30,420 RUSS TEDRAKE: Good. 1320 01:04:30,420 --> 01:04:33,420 So I'm glad you asked that. 1321 01:04:33,420 --> 01:04:35,070 So it's definitely very stable, which 1322 01:04:35,070 --> 01:04:37,900 was experimentally convenient. 1323 01:04:37,900 --> 01:04:38,400 Right? 1324 01:04:38,400 --> 01:04:39,940 Because I didn't have to pick it up as much. 1325 01:04:39,940 --> 01:04:42,273 But it actually learns fine when it starts off unstable. 1326 01:04:42,273 --> 01:04:45,300 So the way I tested that is, if the ramp was very steep, then 1327 01:04:45,300 --> 01:04:47,790 it starts oscillating and falls off sideways. 1328 01:04:47,790 --> 01:04:50,040 So just to show that it can stabilize an unstable system too.
1329 01:04:50,040 --> 01:04:51,623 It's like, oh, the same cost function. 1330 01:04:51,623 --> 01:04:53,390 It's absolutely no different. 1331 01:04:53,390 --> 01:04:54,880 I showed that it stabilized that. 1332 01:04:54,880 --> 01:04:55,810 And it just meant I had to pick it up 1333 01:04:55,810 --> 01:04:57,670 when it fell down a bunch of times. 1334 01:04:57,670 --> 01:04:58,770 But the same algorithm works for that. 1335 01:04:58,770 --> 01:05:01,103 So it's not really the stability that I was counting on. 1336 01:05:01,103 --> 01:05:03,300 That was just experimentally nice. 1337 01:05:03,300 --> 01:05:05,220 The big clown feet and everything 1338 01:05:05,220 --> 01:05:09,630 were because that's how I knew how to tune the passive gait. 1339 01:05:09,630 --> 01:05:10,620 Right? 1340 01:05:10,620 --> 01:05:12,627 In the passive walkers we work on these days, 1341 01:05:12,627 --> 01:05:13,710 you always see point feet. 1342 01:05:13,710 --> 01:05:15,660 Because I care about rough terrain now. 1343 01:05:15,660 --> 01:05:18,052 And those clown feet are not good for rough terrain. 1344 01:05:18,052 --> 01:05:19,510 So we could try to get rid of that. 1345 01:05:19,510 --> 01:05:21,968 STUDENT: You're saying if you wanted to scale that out, you 1346 01:05:21,968 --> 01:05:25,930 had mentioned the [INAUDIBLE] robots [INAUDIBLE] would 1347 01:05:25,930 --> 01:05:29,010 you have the same success [INAUDIBLE]?? 1348 01:05:29,010 --> 01:05:30,520 RUSS TEDRAKE: Got you. 1349 01:05:30,520 --> 01:05:31,270 I think it's fine. 1350 01:05:31,270 --> 01:05:35,155 I think that it would look ridiculous that big maybe. 1351 01:05:35,155 --> 01:05:37,030 And I wouldn't scale the feet quite that big. 1352 01:05:37,030 --> 01:05:37,530 Right? 1353 01:05:37,530 --> 01:05:40,840 That would be ridiculous. 1354 01:05:40,840 --> 01:05:44,560 But I don't think there's any scaling issues there really. 1355 01:05:44,560 --> 01:05:47,170 It's the inertia of the relative links that matters. 1356 01:05:47,170 --> 01:05:48,993 And I think you can scale that properly. 1357 01:05:48,993 --> 01:05:50,410 At some point you're going to just 1358 01:05:50,410 --> 01:05:53,140 look ridiculous if you don't have knees and you're that big. 1359 01:05:53,140 --> 01:05:57,350 So yeah. 1360 01:05:57,350 --> 01:05:59,900 Energetically, the mechanical cost of transport, 1361 01:05:59,900 --> 01:06:02,502 if you just look at the power coming out of the batteries-- 1362 01:06:02,502 --> 01:06:04,460 sorry, actually the work done by the actuators, 1363 01:06:04,460 --> 01:06:06,043 the actual work done by the actuators. 1364 01:06:06,043 --> 01:06:10,070 It was comparable to a human, 20 times better than ASIMO. 1365 01:06:10,070 --> 01:06:14,600 But if you plot the current coming out of the batteries, 1366 01:06:14,600 --> 01:06:17,300 it was three times worse than ASIMO or something like that. 1367 01:06:17,300 --> 01:06:20,060 Because it's got these little itty bitty steps and really 1368 01:06:20,060 --> 01:06:20,840 big computer. 1369 01:06:20,840 --> 01:06:23,990 And that was, in retrospect, maybe not the best decision. 1370 01:06:23,990 --> 01:06:26,138 Although I never had to worry about computation. 1371 01:06:26,138 --> 01:06:27,680 I never had to optimize my algorithms 1372 01:06:27,680 --> 01:06:31,250 to run on a small embedded chip. 1373 01:06:31,250 --> 01:06:35,450 STUDENT: Can you talk a little bit about the [INAUDIBLE]?? 1374 01:06:35,450 --> 01:06:38,040 RUSS TEDRAKE: You can actually see it here. 
1375 01:06:38,040 --> 01:06:43,430 So this is the barycentric policy space 1376 01:06:43,430 --> 01:06:48,170 that were the parameters. 1377 01:06:48,170 --> 01:06:48,950 Yeah. 1378 01:06:48,950 --> 01:06:54,110 So it was tiled over 0.5, 0.5 roughly. 1379 01:06:54,110 --> 01:06:56,990 And you could see the density of the tiling there. 1380 01:06:56,990 --> 01:06:57,830 Yeah. 1381 01:06:57,830 --> 01:06:59,810 And that was trained. 1382 01:06:59,810 --> 01:07:01,320 So there was no generalization. 1383 01:07:01,320 --> 01:07:03,778 So the fact that those looked like sort of consistent blobs 1384 01:07:03,778 --> 01:07:05,780 was just from experience and eligibility traces 1385 01:07:05,780 --> 01:07:06,500 carrying through. 1386 01:07:08,883 --> 01:07:11,300 But those are not constrained by the function approximator 1387 01:07:11,300 --> 01:07:13,858 to be similar more than one block away. 1388 01:07:13,858 --> 01:07:15,650 There's literally a barycentric grid there. 1389 01:07:15,650 --> 01:07:20,690 And then the value estimate was theta equals zero. 1390 01:07:20,690 --> 01:07:22,250 The different theta dots. 1391 01:07:22,250 --> 01:07:24,800 It was just the same size tiles. 1392 01:07:24,800 --> 01:07:27,770 But a line just straight up the middle. 1393 01:07:27,770 --> 01:07:31,070 STUDENT: So your joystick would just change theta? 1394 01:07:31,070 --> 01:07:32,280 Or not the theta? 1395 01:07:32,280 --> 01:07:36,290 But it would just change the position. 1396 01:07:36,290 --> 01:07:38,710 RUSS TEDRAKE: The joystick was, so the policy was mostly 1397 01:07:38,710 --> 01:07:41,210 for the side to side angles, which would give me limit cycle 1398 01:07:41,210 --> 01:07:41,730 stability. 1399 01:07:41,730 --> 01:07:43,730 And then I could just joystick control the front 1400 01:07:43,730 --> 01:07:44,495 to back angles. 1401 01:07:44,495 --> 01:07:46,370 So this thing, we could just lean it forward. 1402 01:07:46,370 --> 01:07:47,660 It starts walking forward. 1403 01:07:47,660 --> 01:07:48,670 Even uphill. 1404 01:07:48,670 --> 01:07:49,280 That's fine. 1405 01:07:49,280 --> 01:07:50,990 You lean back, it starts walking back. 1406 01:07:50,990 --> 01:07:52,390 It was really basically this. 1407 01:07:52,390 --> 01:07:52,990 Yeah. 1408 01:07:52,990 --> 01:07:55,370 If you want it to turn, you've got to go like this. 1409 01:07:55,370 --> 01:07:56,570 And it would do its thing. 1410 01:07:56,570 --> 01:07:58,315 Right? 1411 01:07:58,315 --> 01:07:58,940 So that was it. 1412 01:07:58,940 --> 01:08:03,140 It wasn't sort of highly maneuverable. 1413 01:08:03,140 --> 01:08:04,700 Yeah. 1414 01:08:04,700 --> 01:08:08,150 STUDENT: It seems like there are some [INAUDIBLE] to step 1415 01:08:08,150 --> 01:08:12,048 to step, having each step be like-- 1416 01:08:12,048 --> 01:08:12,965 RUSS TEDRAKE: A trial. 1417 01:08:12,965 --> 01:08:15,158 STUDENT: So a section on your Poincaré map. 1418 01:08:15,158 --> 01:08:15,950 RUSS TEDRAKE: Yeah. 1419 01:08:15,950 --> 01:08:18,560 STUDENT: I don't know if that would work for flapping. 1420 01:08:18,560 --> 01:08:18,814 RUSS TEDRAKE: Absolutely. 1421 01:08:18,814 --> 01:08:20,231 STUDENT: If [INAUDIBLE] up or down 1422 01:08:20,231 --> 01:08:21,770 is a similar kind of thing. 1423 01:08:21,770 --> 01:08:23,330 RUSS TEDRAKE: I think it would. 1424 01:08:23,330 --> 01:08:24,870 We were thinking about it that way. 1425 01:08:24,870 --> 01:08:25,550 So you're absolutely right. 
1426 01:08:25,550 --> 01:08:27,410 So it was nice to be able to, it was 1427 01:08:27,410 --> 01:08:29,930 very important to be able to add noise 1428 01:08:29,930 --> 01:08:33,260 by sort of making a persistent change in my policy. 1429 01:08:33,260 --> 01:08:35,149 So this whole function, adding noise 1430 01:08:35,149 --> 01:08:37,819 meant this whole function would change a little bit. 1431 01:08:37,819 --> 01:08:40,342 And then it would stay constant for that whole step. 1432 01:08:40,342 --> 01:08:41,550 And then change a little bit. 1433 01:08:41,550 --> 01:08:44,302 If you add noise every dt, for instance, then you 1434 01:08:44,302 --> 01:08:46,760 have to worry about it getting filtered out by the motors and stuff. 1435 01:08:46,760 --> 01:08:49,279 This was actually a very convenient discretization 1436 01:08:49,279 --> 01:08:50,779 in time on the Poincaré map. 1437 01:08:50,779 --> 01:08:51,762 Yeah. 1438 01:08:51,762 --> 01:08:53,720 So I think that was one of the keys to success. 1439 01:08:53,720 --> 01:08:54,483 John? 1440 01:08:54,483 --> 01:08:58,250 STUDENT: The actuators you took, were they pushing off 1441 01:08:58,250 --> 01:09:00,770 the sort of stance foot? 1442 01:09:00,770 --> 01:09:01,640 [INTERPOSING VOICES] 1443 01:09:01,640 --> 01:09:03,600 RUSS TEDRAKE: Or pulling it back up. 1444 01:09:03,600 --> 01:09:04,290 But yes. 1445 01:09:04,290 --> 01:09:06,439 STUDENT: So you just actuated the stance foot. 1446 01:09:06,439 --> 01:09:08,689 That was the actuator [INAUDIBLE].. 1447 01:09:08,689 --> 01:09:11,615 RUSS TEDRAKE: The units were, I guess they were scaled out. 1448 01:09:11,615 --> 01:09:13,490 I did actually do the kinematics of the link. 1449 01:09:13,490 --> 01:09:17,670 So it was literally a linear command in-- 1450 01:09:17,670 --> 01:09:20,689 those are probably meters or something in the-- 1451 01:09:23,540 --> 01:09:24,109 no. 1452 01:09:24,109 --> 01:09:25,130 It's way too big. 1453 01:09:25,130 --> 01:09:26,420 [INTERPOSING VOICES] 1454 01:09:26,420 --> 01:09:29,480 STUDENT: Touching down, but touch down at the same angle? 1455 01:09:29,480 --> 01:09:30,360 RUSS TEDRAKE: No. 1456 01:09:30,360 --> 01:09:33,450 The swing foot was also being controlled. 1457 01:09:33,450 --> 01:09:35,180 So it would get a big penalty actually 1458 01:09:35,180 --> 01:09:39,020 if it was at a weird angle when it touched down. 1459 01:09:39,020 --> 01:09:41,130 It would hit and it would lose all its energy. 1460 01:09:41,130 --> 01:09:42,949 But it was free to make that mistake. 1461 01:09:42,949 --> 01:09:44,532 STUDENT: So you have two actions then? 1462 01:09:44,532 --> 01:09:48,812 The address to [INAUDIBLE]? 1463 01:09:48,812 --> 01:09:49,520 RUSS TEDRAKE: No. 1464 01:09:49,520 --> 01:09:51,020 Well, it's one action. 1465 01:09:51,020 --> 01:09:53,390 But the policy is being run on two different actuators 1466 01:09:53,390 --> 01:09:54,220 at the same time. 1467 01:09:54,220 --> 01:09:56,170 So one of them is over in this side of the state space. 1468 01:09:56,170 --> 01:09:57,270 And the other one's over in the other side of the state 1469 01:09:57,270 --> 01:09:58,440 space at the same time. 1470 01:09:58,440 --> 01:09:58,940 STUDENT: OK. 1471 01:09:58,940 --> 01:10:00,232 So it just used different data. 1472 01:10:00,232 --> 01:10:01,110 But they're-- OK. 1473 01:10:01,110 --> 01:10:02,633 RUSS TEDRAKE: Yeah. 1474 01:10:02,633 --> 01:10:04,550 So just it was learning on both of those sides 1475 01:10:04,550 --> 01:10:05,258 at the same time.
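(To illustrate the point about the noise, here is a small sketch under my own assumed names: the exploration noise is one perturbation of the policy weights, sampled at a return-map crossing and then held constant through the whole footstep, rather than fresh noise at every dt that the motors would mostly filter out. The callbacks are placeholders, not real functions from the project.)

```python
import numpy as np

def fresh_perturbation(shape, sigma=1e-2):
    # Sampled once per footstep (at the return-map crossing), not once per dt.
    return sigma * np.random.randn(*shape)

def run_one_footstep(w, z, x0, policy, step_dynamics, touched_down, dt=0.002):
    """Low-level loop for a single footstep with the perturbed weights held fixed.
    policy(w, x) -> u, step_dynamics(x, u, dt) -> next x, and touched_down(x) -> bool
    are user-supplied placeholders standing in for the controller, the robot,
    and the foot-contact event."""
    x = x0
    while not touched_down(x):
        u = policy(w + z, x)        # the same perturbed weights for the whole step
        x = step_dynamics(x, u, dt)
    return x                        # the next crossing of the return map
```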
1476 01:10:21,180 --> 01:10:22,992 I'm a big fan of simplicity. 1477 01:10:22,992 --> 01:10:24,450 It's easy to make things that work. 1478 01:10:24,450 --> 01:10:27,480 I mean, I think it's a good way to get things working. 1479 01:10:27,480 --> 01:10:30,360 So that's what the test will be as we go forward 1480 01:10:30,360 --> 01:10:32,160 in how complex we can make these things. 1481 01:10:32,160 --> 01:10:35,960 But in sort of the simple case, they work really well. 1482 01:10:40,480 --> 01:10:41,030 Great. 1483 01:10:41,030 --> 01:10:41,530 OK. 1484 01:10:41,530 --> 01:10:45,280 So thanks for putting up with the randomized algorithm. 1485 01:10:45,280 --> 01:10:48,120 We'll see you on Thursday.