1 00:00:00,000 --> 00:00:02,490 The following content is provided under a Creative 2 00:00:02,490 --> 00:00:03,940 Commons license. 3 00:00:03,940 --> 00:00:06,330 Your support will help MIT OpenCourseWare 4 00:00:06,330 --> 00:00:10,630 continue to offer high-quality educational resources for free. 5 00:00:10,630 --> 00:00:13,320 To make a donation or view additional materials 6 00:00:13,320 --> 00:00:17,160 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,160 --> 00:00:18,252 at ocw.mit.edu. 8 00:00:21,870 --> 00:00:23,190 RUSS TEDRAKE: Welcome back. 9 00:00:23,190 --> 00:00:29,600 So today we get to finish our discussion 10 00:00:29,600 --> 00:00:33,500 on at least the first wave of value-based methods 11 00:00:33,500 --> 00:00:35,777 for trying to find optimal control 12 00:00:35,777 --> 00:00:39,130 policies, without a model. 13 00:00:39,130 --> 00:00:47,040 So we started last time talking about these model-free methods. 14 00:00:47,040 --> 00:00:51,270 And just to make sure we're all synced up here, 15 00:00:51,270 --> 00:01:01,090 so big picture is that we're trying 16 00:01:01,090 --> 00:01:14,920 to learn an optimal policy, approximate optimal policy, 17 00:01:14,920 --> 00:01:25,350 without a model, by just learning an approximate value 18 00:01:25,350 --> 00:01:25,850 function. 19 00:01:36,620 --> 00:01:41,720 And the claim is that value functions are a good thing 20 00:01:41,720 --> 00:01:45,630 to learn for a couple of reasons. 21 00:01:45,630 --> 00:01:47,060 First of all, they should describe 22 00:01:47,060 --> 00:01:53,720 everything you need to know about the optimal control. 23 00:01:53,720 --> 00:01:57,185 Second of all, they're actually fairly compact. 24 00:01:57,185 --> 00:01:59,060 I'm going to say more about that in a minute. 25 00:01:59,060 --> 00:02:02,120 But if you think about it, probably a value function 26 00:02:02,120 --> 00:02:03,920 might actually be a simpler thing 27 00:02:03,920 --> 00:02:08,449 to represent than the policy, a smaller thing to represent it, 28 00:02:08,449 --> 00:02:10,894 because it's just a scalar value over all states. 29 00:02:14,480 --> 00:02:19,250 And the third big motivation I tried to give last time 30 00:02:19,250 --> 00:02:23,390 was that these temporal difference methods which 31 00:02:23,390 --> 00:02:26,660 bootstrap based on previous experience, 32 00:02:26,660 --> 00:02:29,040 they're, like value iteration and dynamic programming, 33 00:02:29,040 --> 00:02:33,230 can be very efficient in terms of reusing the computation 34 00:02:33,230 --> 00:02:35,720 or reusing the samples that you've gotten by using 35 00:02:35,720 --> 00:02:37,803 estimates that you've already made with your value 36 00:02:37,803 --> 00:02:40,960 function to make better, fast estimates of your value 37 00:02:40,960 --> 00:02:43,878 function as you [INAUDIBLE]. 38 00:02:43,878 --> 00:02:45,420 These are all going to come up again. 39 00:02:45,420 --> 00:02:47,210 But that's just the high-level motivation 40 00:02:47,210 --> 00:02:50,038 for why we care about trying to learn value functions. 41 00:02:50,038 --> 00:02:51,830 And then first thing we did-- it was really 42 00:02:51,830 --> 00:02:54,950 all we did last time. 43 00:02:54,950 --> 00:02:59,510 The first thing we had to achieve 44 00:02:59,510 --> 00:03:19,996 was just estimate a value function for a fixed policy, 45 00:03:19,996 --> 00:03:23,990 which we called J pi, right? 46 00:03:23,990 --> 00:03:26,582 And we did it just from sample trajectories. 
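A minimal sketch of the policy-evaluation step just recapped: roll out the fixed policy pi, record the one-step costs, and average the discounted cost-to-go observed from each state. This assumes a discrete-state, episodic setting with discount factor gamma; the `env.reset()` / `env.step()` interface and all names are illustrative, not from the lecture.

```python
from collections import defaultdict

def mc_policy_evaluation(env, pi, num_episodes=1000, gamma=0.99):
    """Estimate J^pi(s) by averaging the discounted costs-to-go observed
    from each state along trajectories generated by the fixed policy pi."""
    returns = defaultdict(list)          # state -> list of observed costs-to-go
    for _ in range(num_episodes):
        s = env.reset()
        episode = []                     # [(state, one-step cost), ...]
        done = False
        while not done:
            a = pi(s)
            s_next, cost, done = env.step(a)   # assumed interface
            episode.append((s, cost))
            s = s_next
        # Walk backwards to accumulate the discounted cost-to-go.
        G = 0.0
        for s, cost in reversed(episode):
            G = cost + gamma * G
            returns[s].append(G)
    # J^pi(s) is the sample mean of the costs-to-go observed from s.
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```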
47 00:03:29,420 --> 00:03:34,910 In discrete state and action, we called them s and a. 48 00:03:34,910 --> 00:03:39,350 Take a bunch of trajectories, and you 49 00:03:39,350 --> 00:03:42,170 would be able from those trajectories 50 00:03:42,170 --> 00:03:45,675 to try to back out J pi. 51 00:03:45,675 --> 00:03:49,118 And those trajectories are generated using policy pi. 52 00:03:51,860 --> 00:03:54,180 So I actually tried to argue that that was useful even 53 00:03:54,180 --> 00:03:55,550 in the-- so if you just want-- 54 00:03:55,550 --> 00:03:57,407 if you have a robot out there that's already 55 00:03:57,407 --> 00:03:59,990 executing a policy, or a passive walker or something like this 56 00:03:59,990 --> 00:04:01,880 that doesn't have a policy, and you just 57 00:04:01,880 --> 00:04:06,260 want to see how well it's doing, estimate its stability 58 00:04:06,260 --> 00:04:09,652 by example, then you can actually-- 59 00:04:09,652 --> 00:04:10,860 this might be enough for you. 60 00:04:10,860 --> 00:04:15,020 You might just try to evaluate how well that policy is doing. 61 00:04:15,020 --> 00:04:16,857 We call that policy evaluation. 62 00:04:23,188 --> 00:04:26,320 What we're actually interested it's not that. 63 00:04:26,320 --> 00:04:28,840 That's just the first step. 64 00:04:28,840 --> 00:04:33,550 What we care about now is, given if we can estimate 65 00:04:33,550 --> 00:04:37,720 the value for a stationary policy, 66 00:04:37,720 --> 00:04:41,140 can we now do something smarter and more involved, 67 00:04:41,140 --> 00:04:45,490 and try to estimate what the optimal value function. 68 00:04:45,490 --> 00:04:49,030 Or you might think of it as continuing 69 00:04:49,030 --> 00:04:51,490 to estimate the value function as we change 70 00:04:51,490 --> 00:04:53,758 pi towards the [INAUDIBLE] the optimal cost, 71 00:04:53,758 --> 00:04:54,550 the optimal policy. 72 00:04:59,220 --> 00:05:01,980 We talked about a couple of ways to estimate the value function 73 00:05:01,980 --> 00:05:04,590 for a fixed pi, right? 74 00:05:04,590 --> 00:05:11,820 Even for function approximation, we did first Markov chains, 75 00:05:11,820 --> 00:05:14,490 and then we went to function approximation. 76 00:05:21,720 --> 00:05:31,290 And we have convergence results for linear function 77 00:05:31,290 --> 00:05:32,144 approximators. 78 00:05:36,425 --> 00:05:39,170 And we went back and looked up [INAUDIBLE] it had a question 79 00:05:39,170 --> 00:05:42,335 about whether they used lambda in your update, 80 00:05:42,335 --> 00:05:45,170 if it always got to the same estimate of J. 81 00:05:45,170 --> 00:05:47,850 And I think the answer was, yes, it always gets-- 82 00:05:47,850 --> 00:05:50,280 the convergence proof has an error bound. 83 00:05:50,280 --> 00:05:53,030 And that error bound does depend on lambda, 84 00:05:53,030 --> 00:05:56,180 if you remember [INAUDIBLE] discussion. 85 00:05:56,180 --> 00:05:58,970 But if you said your learning rate gets smaller and smaller 86 00:05:58,970 --> 00:06:01,400 and you go, it should converge to the-- they should all 87 00:06:01,400 --> 00:06:04,658 converge to the same estimate of J pi. 88 00:06:11,300 --> 00:06:15,800 So if you think about it, learning J pi 89 00:06:15,800 --> 00:06:18,420 shouldn't involve any new machinery, right? 
90 00:06:18,420 --> 00:06:22,140 If I'm just experiencing cost, and I'm experiencing states, 91 00:06:22,140 --> 00:06:24,440 and I'm trying to learn a function of cost-to-go given 92 00:06:24,440 --> 00:06:27,107 states, that should just be able to do a least squares function. 93 00:06:27,107 --> 00:06:30,450 It's just a standard function approximation task. 94 00:06:30,450 --> 00:06:37,765 I could just do just a least squares function approximation, 95 00:06:37,765 --> 00:06:45,230 least squares estimation, and what 96 00:06:45,230 --> 00:06:46,860 we call the Monte-Carlo error. 97 00:06:49,584 --> 00:06:52,918 Just run a bunch of trials, figure out 98 00:06:52,918 --> 00:06:55,210 the estimates of what the cost-to-go was at every time, 99 00:06:55,210 --> 00:06:59,100 and then just do least squares estimation. 100 00:06:59,100 --> 00:07:02,150 The machinery we developed last time 101 00:07:02,150 --> 00:07:06,078 was because it's actually a lot faster using 102 00:07:06,078 --> 00:07:09,656 bootstrapping algorithms. 103 00:07:09,656 --> 00:07:12,152 [INAUDIBLE] much faster than [INAUDIBLE].. 104 00:07:23,460 --> 00:07:23,960 Right. 105 00:07:23,960 --> 00:07:29,120 So we talked about the TD lambda algorithm, 106 00:07:29,120 --> 00:07:31,010 including for function approximation. 107 00:07:34,165 --> 00:07:36,290 The only reason we had to develop any new machinery 108 00:07:36,290 --> 00:07:39,630 is because we wanted to be able to essentially do the least 109 00:07:39,630 --> 00:07:41,190 squares estimation, but we wanted 110 00:07:41,190 --> 00:07:44,840 to reuse our current estimate as we build up the estimate. 111 00:07:44,840 --> 00:07:48,410 And that's why it's not just a standard function approximation 112 00:07:48,410 --> 00:07:54,778 task we did all the [INAUDIBLE] something [INAUDIBLE] 113 00:07:54,778 --> 00:07:57,700 presented it. 114 00:07:57,700 --> 00:07:58,200 OK. 115 00:08:02,520 --> 00:08:06,470 So that's the simple policy evaluation story. 116 00:08:10,420 --> 00:08:14,110 Now the question is, how do we use the ability 117 00:08:14,110 --> 00:08:18,004 to do policy evaluation to get towards a more optimal policy? 118 00:08:21,350 --> 00:08:34,789 So today, given the new policy evaluation, 119 00:08:34,789 --> 00:08:36,215 we want to improve the policy. 120 00:08:50,920 --> 00:08:53,422 And the idea of this-- 121 00:08:53,422 --> 00:08:55,755 the first idea you have to have in your head, very, very 122 00:08:55,755 --> 00:08:56,255 simple. 123 00:09:00,090 --> 00:09:02,410 And it's called policy iteration. 124 00:09:12,630 --> 00:09:17,560 So given I start off with some initial guess for a policy, 125 00:09:17,560 --> 00:09:22,840 and I run it for a little while, I could do policy evaluation. 126 00:09:22,840 --> 00:09:28,570 So I'm converged on a nice estimate to get J pi 1. 127 00:09:31,080 --> 00:09:36,690 And now I'd like to take J pi 1, my estimate, 128 00:09:36,690 --> 00:09:43,810 and come up with a new pi 2. 129 00:09:43,810 --> 00:09:49,670 We've talked about how the value function infers a policy. 130 00:09:49,670 --> 00:09:54,740 And if I repeat, and I do it properly, 131 00:09:54,740 --> 00:09:59,110 then if all goes well, I should find myself-- 132 00:09:59,110 --> 00:10:00,820 if it's always increasing in performance, 133 00:10:00,820 --> 00:10:06,730 and we can show that, then I should find myself eventually 134 00:10:06,730 --> 00:10:13,888 at the optimal policy and optimal value function, right? 
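The "just do least squares on the Monte-Carlo error" option mentioned above can be written down directly. A sketch assuming a linear function approximator J_hat(s) = alpha^T phi(s); the feature map `phi` and the trajectory format are illustrative.

```python
import numpy as np

def mc_least_squares(trajectories, phi, gamma=0.99):
    """Fit J_hat(s) = alpha^T phi(s) to Monte-Carlo costs-to-go by least squares.

    trajectories: list of episodes, each a list of (state, one-step cost)
                  pairs generated by the fixed policy pi.
    phi:          feature map, state -> numpy array of shape (k,)
    """
    features, targets = [], []
    for episode in trajectories:
        G = 0.0
        for s, cost in reversed(episode):
            G = cost + gamma * G            # Monte-Carlo cost-to-go from s
            features.append(phi(s))
            targets.append(G)
    Phi = np.array(features)                # (num_samples, k)
    y = np.array(targets)                   # (num_samples,)
    alpha, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return alpha                            # J_hat(s) = alpha @ phi(s)
```

The TD machinery developed in lecture is doing the same fit, but it reuses the current estimate (bootstraps) instead of waiting for full-trajectory returns, which is why it tends to be much more sample-efficient.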
135 00:10:19,760 --> 00:10:23,390 So we said TD lambda was a candidate 136 00:10:23,390 --> 00:10:25,515 for sitting there and evaluating policy, which I've 137 00:10:25,515 --> 00:10:27,140 talked about a couple of different ways 138 00:10:27,140 --> 00:10:28,400 to do policy evaluation. 139 00:10:28,400 --> 00:10:31,127 So the question now is, how do we this, then? 140 00:10:31,127 --> 00:10:32,210 That's the first question. 141 00:10:36,660 --> 00:10:39,830 So given your policy, given your value function, 142 00:10:39,830 --> 00:10:42,890 how do you compute a new policy that's 143 00:10:42,890 --> 00:10:46,174 at least as good as your own policy but maybe better? 144 00:11:02,000 --> 00:11:04,460 AUDIENCE: Maybe stochastic gradient descent? 145 00:11:04,460 --> 00:11:06,140 RUSS TEDRAKE: Do something like stochastic gradient descent? 146 00:11:06,140 --> 00:11:07,440 You have to be careful with stochastic gradient. 147 00:11:07,440 --> 00:11:09,315 You have to make sure it's always going down, 148 00:11:09,315 --> 00:11:11,810 and things like that. 149 00:11:11,810 --> 00:11:14,610 It's a good idea. 150 00:11:14,610 --> 00:11:16,550 In fact, that's sort of-- 151 00:11:16,550 --> 00:11:18,970 actually, [INAUDIBLE]. 152 00:11:18,970 --> 00:11:22,840 We combine stochastic gradient descent and evaluation 153 00:11:22,840 --> 00:11:25,600 to do actor-critic [INAUDIBLE]. 154 00:11:25,600 --> 00:11:29,618 But there's a simpler sort of idea. 155 00:11:32,398 --> 00:11:33,690 I guess the thing it requires-- 156 00:11:33,690 --> 00:11:36,148 I didn't even think about this when I was making the notes. 157 00:11:36,148 --> 00:11:39,120 But I guess it requires an observation that-- 158 00:11:41,870 --> 00:11:47,810 so the optimal value function and the optimal policy 159 00:11:47,810 --> 00:11:50,510 have a property that the policy is going, 160 00:11:50,510 --> 00:11:54,187 taking the fastest descent down the value function. 161 00:11:54,187 --> 00:11:56,770 Your job is to go down the value function as fast as possible. 162 00:11:59,510 --> 00:12:03,970 But if you're not optimal yet, I've got some random policy, 163 00:12:03,970 --> 00:12:09,010 and I figure out my value of executing that policy, that's 164 00:12:09,010 --> 00:12:12,210 actually not true yet. 165 00:12:12,210 --> 00:12:15,830 So what I need to say is, if you start giving your value 166 00:12:15,830 --> 00:12:19,190 function, you come up with a new policy which 167 00:12:19,190 --> 00:12:21,050 tries to be as aggressive as possible 168 00:12:21,050 --> 00:12:23,750 on this value function, which in our continuous sense, 169 00:12:23,750 --> 00:12:26,450 is going down the gradient of the value function 170 00:12:26,450 --> 00:12:28,778 as fast as possible. 171 00:12:28,778 --> 00:12:30,320 And that should be at least as good-- 172 00:12:30,320 --> 00:12:33,380 in the case of the optimal policy, it should be the same. 173 00:12:33,380 --> 00:12:35,910 It should return the optimal policy again. 174 00:12:35,910 --> 00:12:39,260 But in the case where the value estimates from another, 175 00:12:39,260 --> 00:12:41,776 original policy gets you to do better. 176 00:12:44,755 --> 00:12:46,040 So the basic story-- 177 00:12:46,040 --> 00:12:48,830 that's the continuous gradient-- is 178 00:12:48,830 --> 00:12:57,050 you want to come up with a greedy policy that moves down, 179 00:12:57,050 --> 00:13:00,150 that does the best it can with this J pi. 
180 00:13:06,660 --> 00:13:12,900 So pi 2, let's say, which is a function of s, should be, 181 00:13:12,900 --> 00:13:17,450 for instance, [INAUDIBLE] the discrete sense here, 182 00:13:17,450 --> 00:13:21,263 discrete state and action, minimize the expected value 183 00:13:21,263 --> 00:13:27,030 [INAUDIBLE] expected value first, by one-step error plus-- 184 00:13:53,430 --> 00:13:57,570 So I've got the cost that I incur here 185 00:13:57,570 --> 00:13:59,190 plus the long-term cost here. 186 00:13:59,190 --> 00:14:01,500 I want to pick the new min over a. 187 00:14:04,480 --> 00:14:06,806 The best thing I can do given that estimate 188 00:14:06,806 --> 00:14:08,890 of the value function. 189 00:14:08,890 --> 00:14:11,530 And that's going to give me a new policy, actually, pi 190 00:14:11,530 --> 00:14:15,163 2, which is greedy with respect to this estimate of the value 191 00:14:15,163 --> 00:14:15,663 function. 192 00:14:20,500 --> 00:14:24,602 What does that look like to you guys? 193 00:14:24,602 --> 00:14:25,600 AUDIENCE: [INAUDIBLE] 194 00:14:25,600 --> 00:14:26,392 RUSS TEDRAKE: Yeah. 195 00:14:26,392 --> 00:14:26,910 OK. 196 00:14:26,910 --> 00:14:30,060 So value iteration, or dynamic programming, 197 00:14:30,060 --> 00:14:34,440 is exactly policy iteration in the case 198 00:14:34,440 --> 00:14:36,780 where you do a sweep through your entire state 199 00:14:36,780 --> 00:14:40,860 space every time, and then you update, sweep your entire state 200 00:14:40,860 --> 00:14:42,639 space, you do the update. 201 00:15:15,670 --> 00:15:17,410 Absolutely. 202 00:15:17,410 --> 00:15:20,080 But it's a more general idea than just value iteration. 203 00:15:20,080 --> 00:15:22,330 You don't have to actually evaluate all s. 204 00:15:22,330 --> 00:15:24,030 You might call it asynchronous value-- 205 00:15:24,030 --> 00:15:25,600 [INAUDIBLE]? 206 00:15:25,600 --> 00:15:27,892 AUDIENCE: Shouldn't that be argmin [INAUDIBLE]?? 207 00:15:27,892 --> 00:15:28,850 RUSS TEDRAKE: Oh, good. 208 00:15:28,850 --> 00:15:30,170 Thank you, yeah. 209 00:15:30,170 --> 00:15:32,860 This is argmin. 210 00:15:32,860 --> 00:15:34,258 Good catch. 211 00:15:34,258 --> 00:15:36,600 Yeah. 212 00:15:36,600 --> 00:15:40,117 AUDIENCE: This is like g [INAUDIBLE].. 213 00:15:40,117 --> 00:15:40,950 RUSS TEDRAKE: Right. 214 00:15:40,950 --> 00:15:41,910 I always minimize this. 215 00:15:41,910 --> 00:15:51,040 So g is [? bad. ?] Well, I don't promise that I will never 216 00:15:51,040 --> 00:15:52,900 make a mistake with the signs, because I 217 00:15:52,900 --> 00:15:54,775 try to use reinforcement [INAUDIBLE] notation 218 00:15:54,775 --> 00:15:57,580 with costs, and I can sometimes get myself into trouble. 219 00:15:57,580 --> 00:16:00,190 I never write "arg." 220 00:16:00,190 --> 00:16:03,390 It's always g. 221 00:16:03,390 --> 00:16:03,890 OK. 222 00:16:03,890 --> 00:16:08,620 So so this would be argmin. 223 00:16:08,620 --> 00:16:12,452 The min is the value estimated in the case of value iteration. 224 00:16:12,452 --> 00:16:14,410 But in general, you don't have to wait till you 225 00:16:14,410 --> 00:16:15,850 sweep the entire state space. 226 00:16:15,850 --> 00:16:18,430 You can just take a single trajectory 227 00:16:18,430 --> 00:16:22,490 through, update your value J. Or you 228 00:16:22,490 --> 00:16:25,180 take lots of trajectories through [INAUDIBLE] 229 00:16:25,180 --> 00:16:30,300 get an improved estimate for J, and then do this 230 00:16:30,300 --> 00:16:32,070 and get a new policy, right? 
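A sketch of the greedy improvement step being written on the board: pi_2(s) = argmin_a E[ g(s,a) + gamma * J^pi_1(s') ]. This tabular version assumes access to the one-step cost g(s,a) and the transition probabilities P(s'|s,a), which is exactly the model requirement discussed next; the data-structure choices are illustrative.

```python
def greedy_policy(J, states, actions, P, g, gamma=0.99):
    """Policy improvement: pi_2(s) = argmin_a  g(s,a) + gamma * E[ J(s') ].

    J: dict state -> estimated cost-to-go under the previous policy pi_1
    P: dict (s, a) -> list of (s_next, probability) pairs   (the model)
    g: dict (s, a) -> one-step cost
    """
    pi_2 = {}
    for s in states:
        best_a, best_q = None, float("inf")
        for a in actions:
            expected_next = sum(p * J[s_next] for s_next, p in P[(s, a)])
            q = g[(s, a)] + gamma * expected_next
            if q < best_q:
                best_a, best_q = a, q
        pi_2[s] = best_a        # greedy w.r.t. the current estimate of J^pi_1
    return pi_2
```

Sweeping every state and immediately re-updating J in this way is value iteration; evaluating J^pi to convergence before each improvement is classic policy iteration.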
231 00:16:32,070 --> 00:16:35,730 In this policy iteration, the original idea 232 00:16:35,730 --> 00:16:38,340 is you should really do this policy evaluation 233 00:16:38,340 --> 00:16:43,480 step until your estimate of J pi convergence, and then move on. 234 00:16:43,480 --> 00:16:45,912 But in fact, value iteration and other-- 235 00:16:45,912 --> 00:16:47,370 many algorithms show that you can-- 236 00:16:47,370 --> 00:16:48,930 it actually is still stable when you 237 00:16:48,930 --> 00:16:50,180 don't wait for it to converge. 238 00:16:54,660 --> 00:16:58,110 But there's a problem with what I wrote here. 239 00:16:58,110 --> 00:17:00,130 I don't think there's a technical problem. 240 00:17:00,130 --> 00:17:03,190 But why is that not quite what we need for today's lecture? 241 00:17:03,190 --> 00:17:03,690 Yeah. 242 00:17:03,690 --> 00:17:05,170 AUDIENCE: I just had a quick question. 243 00:17:05,170 --> 00:17:07,140 So if you're going to be gradient with respect 244 00:17:07,140 --> 00:17:10,549 to the value function that you evaluate, 245 00:17:10,549 --> 00:17:12,624 you can't do that with a value function 246 00:17:12,624 --> 00:17:13,835 if you have a model, right? 247 00:17:13,835 --> 00:17:14,460 So you need a-- 248 00:17:14,460 --> 00:17:15,480 RUSS TEDRAKE: That's actually exactly-- 249 00:17:15,480 --> 00:17:17,480 you're answering the question that I was asking. 250 00:17:17,480 --> 00:17:19,829 That's perfect. 251 00:17:19,829 --> 00:17:22,950 So from, as I said, model-free, model-free, model-free, 252 00:17:22,950 --> 00:17:26,400 but then I wrote down a model here. 253 00:17:26,400 --> 00:17:27,240 So how can I-- 254 00:17:27,240 --> 00:17:32,940 even in the steepest descent sort of continuous sense, 255 00:17:32,940 --> 00:17:34,170 this is absurd. 256 00:17:34,170 --> 00:17:35,970 In the discrete sense, argmin over a 257 00:17:35,970 --> 00:17:37,980 is typically done with a search over all actions 258 00:17:37,980 --> 00:17:39,990 in the continuous state and action. 259 00:17:39,990 --> 00:17:43,810 I think it was finding the gradient down the slope. 260 00:17:43,810 --> 00:17:44,310 But right. 261 00:17:44,310 --> 00:17:47,880 Both of those require a model to actually do 262 00:17:47,880 --> 00:17:49,570 that policy [INAUDIBLE]. 263 00:17:49,570 --> 00:17:52,725 So the first question for today is, 264 00:17:52,725 --> 00:17:54,750 how do we come up with a gradient policy, 265 00:17:54,750 --> 00:17:59,336 basically, without any model? 266 00:17:59,336 --> 00:18:00,880 [INAUDIBLE] going to say it. 267 00:18:00,880 --> 00:18:03,376 [INAUDIBLE] know this, but that's the-- 268 00:18:06,172 --> 00:18:08,060 what do you think? 269 00:18:08,060 --> 00:18:10,440 [INAUDIBLE] haven't read all the [INAUDIBLE] algorithms. 270 00:18:10,440 --> 00:18:11,190 What do you think? 271 00:18:11,190 --> 00:18:13,690 What's the-- how could I possibly 272 00:18:13,690 --> 00:18:21,005 come up with a new policy without having a model? 273 00:18:25,835 --> 00:18:28,418 AUDIENCE: [INAUDIBLE] s n plus [INAUDIBLE] sample directly? 274 00:18:28,418 --> 00:18:29,210 RUSS TEDRAKE: Good. 275 00:18:29,210 --> 00:18:29,920 You could sample. 276 00:18:29,920 --> 00:18:32,523 You can start to do some local search 277 00:18:32,523 --> 00:18:33,690 to come up with [INAUDIBLE]. 278 00:18:36,408 --> 00:18:37,950 Turns out-- I mean, I didn't actually 279 00:18:37,950 --> 00:18:40,492 ask the question in a way that anybody would have answered it 280 00:18:40,492 --> 00:18:41,925 in the way I wanted, so. 
281 00:18:41,925 --> 00:18:43,890 So it turns out if we changed the thing 282 00:18:43,890 --> 00:18:49,830 we store just a little bit, then it turns out to contribute 283 00:18:49,830 --> 00:18:53,840 to do model-free greedy policy. 284 00:18:58,930 --> 00:18:59,470 OK. 285 00:18:59,470 --> 00:19:01,053 So the way we do that is a Q function. 286 00:19:05,640 --> 00:19:09,600 We need to find a Q function. 287 00:19:09,600 --> 00:19:11,750 It's a lot like a value function. 288 00:19:11,750 --> 00:19:13,976 But now it's a function of state and action. 289 00:19:30,245 --> 00:19:35,720 And we'll say this is still [INAUDIBLE] this way. 290 00:19:57,270 --> 00:19:58,080 OK. 291 00:19:58,080 --> 00:19:59,630 So what's a Q function? 292 00:19:59,630 --> 00:20:02,610 A Q function is the cost you should expect 293 00:20:02,610 --> 00:20:06,720 to take, to incur, given you're in a current state 294 00:20:06,720 --> 00:20:09,480 and you take a particular action. 295 00:20:09,480 --> 00:20:11,018 So it's a lot like a value function. 296 00:20:11,018 --> 00:20:12,810 But now you're actually learning a function 297 00:20:12,810 --> 00:20:14,250 over both state and actions. 298 00:20:14,250 --> 00:20:20,430 So in any state, Q pi is the cost 299 00:20:20,430 --> 00:20:24,120 I should expect to incur given I take action a for one step 300 00:20:24,120 --> 00:20:28,670 and I follow a policy pi for the rest of the time. 301 00:20:28,670 --> 00:20:31,590 That make sense? 302 00:20:31,590 --> 00:20:36,130 So I could have my acrobot controller or something 303 00:20:36,130 --> 00:20:36,810 like this. 304 00:20:36,810 --> 00:20:40,770 And in a current state, I've got a policy that mostly gets me 305 00:20:40,770 --> 00:20:43,560 up, but I'm learning more than just what that policy would 306 00:20:43,560 --> 00:20:44,352 do from this state. 307 00:20:44,352 --> 00:20:45,810 I'm learning what that policy would 308 00:20:45,810 --> 00:20:47,550 have done if I had for one step executed 309 00:20:47,550 --> 00:20:49,483 any random action on the function, 310 00:20:49,483 --> 00:20:50,400 for any random action. 311 00:20:50,400 --> 00:20:52,770 And then what would I do from the-- 312 00:20:52,770 --> 00:20:55,970 beginning I ran that controller for the rest of it. 313 00:20:55,970 --> 00:20:58,010 Algebraically, it's going to make a lot of sense 314 00:20:58,010 --> 00:20:59,570 why we would store this. 315 00:20:59,570 --> 00:21:01,695 But it's actually interesting to think a little bit 316 00:21:01,695 --> 00:21:05,700 about what that Q function should look like. 317 00:21:05,700 --> 00:21:08,180 And if you have a Q function, you certainly 318 00:21:08,180 --> 00:21:22,215 could also get the value function, 319 00:21:22,215 --> 00:21:26,660 because you can look up for a given pi what action 320 00:21:26,660 --> 00:21:28,690 that policy would have taken. 321 00:21:28,690 --> 00:21:36,515 You can always pull out your current value function from Q. 322 00:21:36,515 --> 00:21:37,265 But you can also-- 323 00:21:42,050 --> 00:21:45,600 [INAUDIBLE] simple relationship here in the [INAUDIBLE].. 324 00:21:54,278 --> 00:21:59,155 And for the optimal [INAUDIBLE],, I should actually 325 00:21:59,155 --> 00:22:00,680 do that search over a. 326 00:22:00,680 --> 00:22:02,875 I almost wrote minus. 327 00:22:02,875 --> 00:22:04,250 That can be your job for the day, 328 00:22:04,250 --> 00:22:06,163 make sure I don't flip any signs. 329 00:22:14,040 --> 00:22:14,540 OK. 
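A sketch of the relationships just described, assuming a small tabular Q stored as a dict over (state, action) pairs; everything here is illustrative.

```python
def J_from_Q(Q, pi, states):
    """J^pi(s) = Q^pi(s, pi(s)): evaluate the policy's own action."""
    return {s: Q[(s, pi(s))] for s in states}

def J_star_from_Q(Q, states, actions):
    """J*(s) = min_a Q*(s, a): the optimal cost-to-go, if Q is Q*."""
    return {s: min(Q[(s, a)] for a in actions) for s in states}

def greedy_from_Q(Q, states, actions):
    """pi_2(s) = argmin_a Q^pi_1(s, a): policy improvement, no model needed."""
    return {s: min(actions, key=lambda a: Q[(s, a)]) for s in states}
```

The last function is the point of the whole construction: the argmin is a lookup over actions, so acting greedily no longer requires the dynamics model.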
330 00:22:14,540 --> 00:22:16,860 We're roboticists in this room. 331 00:22:16,860 --> 00:22:20,690 What does it mean to learn a Q function? 332 00:22:20,690 --> 00:22:22,900 What are the implications of learning a Q function? 333 00:22:26,065 --> 00:22:27,190 Well, I guess I didn't say. 334 00:22:27,190 --> 00:22:31,880 So given the Q function pi [INAUDIBLE] having a Q function 335 00:22:31,880 --> 00:22:37,850 makes action collection easy. 336 00:22:41,259 --> 00:22:53,482 Pi 2 of s is now just a min over a Q pi s and a, where 337 00:22:53,482 --> 00:22:56,280 Q pi was [INAUDIBLE] with pi 1. 338 00:23:04,354 --> 00:23:07,312 AUDIENCE: [INAUDIBLE] 339 00:23:07,312 --> 00:23:09,284 RUSS TEDRAKE: It's argmin [INAUDIBLE].. 340 00:23:15,220 --> 00:23:17,050 But I was willing to-- 341 00:23:17,050 --> 00:23:20,170 the reason to learn a Q function in one case 342 00:23:20,170 --> 00:23:23,350 here is that it tells me about the other actions 343 00:23:23,350 --> 00:23:24,780 I could have taken. 344 00:23:24,780 --> 00:23:28,000 And if I want to now improve my policy, 345 00:23:28,000 --> 00:23:29,830 then I'll just look at my Q function. 346 00:23:29,830 --> 00:23:34,870 At every state I'm in, instead of taking the one at pi a, 347 00:23:34,870 --> 00:23:37,450 I'll go ahead and take the best one. 348 00:23:37,450 --> 00:23:41,270 If pi 1 was optimal, that I would just 349 00:23:41,270 --> 00:23:45,280 get back the same policy. 350 00:23:45,280 --> 00:23:46,840 But if pi 1 wasn't optimal, then I'll 351 00:23:46,840 --> 00:23:54,010 get back something better, given my current estimate of Q. OK. 352 00:23:54,010 --> 00:23:55,770 But what does it mean to run Q? 353 00:23:55,770 --> 00:23:58,020 And this is actually all you need 354 00:23:58,020 --> 00:24:03,884 to learn to do model-free value [INAUDIBLE] optimal policy. 355 00:24:07,580 --> 00:24:10,530 That's actually really big. 356 00:24:10,530 --> 00:24:15,890 So it's a little bit more to learn 357 00:24:15,890 --> 00:24:18,440 than learning a value function. 358 00:24:24,524 --> 00:24:26,460 And you're learning about your [INAUDIBLE].. 359 00:24:32,110 --> 00:24:35,680 If I had to learn J pi, how big is that? 360 00:24:35,680 --> 00:24:41,260 If I'm going to say I've got n dimensional 361 00:24:41,260 --> 00:24:53,104 states and m dimensional u-- 362 00:24:53,104 --> 00:24:54,854 I'll just think about these two new cases, 363 00:24:54,854 --> 00:24:56,062 even though [INAUDIBLE] this. 364 00:24:56,062 --> 00:24:59,460 If I have to learn J pi, how big is that? 365 00:24:59,460 --> 00:25:00,835 What's that function mapping for? 366 00:25:09,745 --> 00:25:12,060 AUDIENCE: [INAUDIBLE] scalar learning. 367 00:25:12,060 --> 00:25:14,460 RUSS TEDRAKE: Learning a scalar function over 368 00:25:14,460 --> 00:25:23,850 the state space to R1, just learning a scalar function. 369 00:25:23,850 --> 00:25:27,400 If I was learning a policy, how big would that be? 370 00:25:27,400 --> 00:25:33,720 If I was learning a stationary policy, it might be that. 371 00:25:37,980 --> 00:25:39,270 So how bad is it to learn Q? 372 00:25:43,720 --> 00:25:44,460 What's Q? 373 00:25:48,870 --> 00:25:52,472 AUDIENCE: [INAUDIBLE] asymptote [INAUDIBLE].. 374 00:25:52,472 --> 00:25:54,930 RUSS TEDRAKE: Let's keep it a deterministic policy for now. 375 00:25:57,852 --> 00:25:59,495 AUDIENCE: [INAUDIBLE] 376 00:25:59,495 --> 00:26:00,287 RUSS TEDRAKE: Yeah. 
377 00:26:06,962 --> 00:26:08,920 Now I've suddenly got to learn something over-- 378 00:26:13,500 --> 00:26:15,663 sorry, [INAUDIBLE] here. 379 00:26:15,663 --> 00:26:16,580 AUDIENCE: [INAUDIBLE]. 380 00:26:16,580 --> 00:26:18,520 Yeah, there. 381 00:26:18,520 --> 00:26:19,450 RUSS TEDRAKE: OK. 382 00:26:19,450 --> 00:26:25,620 And for [INAUDIBLE],, what would it be 383 00:26:25,620 --> 00:26:27,197 used to learn a modeled system? 384 00:26:30,680 --> 00:26:32,730 If I wanted to use this idea. 385 00:26:32,730 --> 00:26:33,656 What's that model? 386 00:26:37,544 --> 00:26:39,010 [INTERPOSING VOICES] 387 00:26:39,010 --> 00:26:40,360 RUSS TEDRAKE: Yeah. 388 00:26:40,360 --> 00:26:45,580 So f and then n plus m to Rn. 389 00:26:48,310 --> 00:26:51,110 So let's just think about how much you have to learn. 390 00:26:51,110 --> 00:26:53,170 So the easiness of learning this is not 391 00:26:53,170 --> 00:26:55,060 only related to the size. 392 00:26:55,060 --> 00:26:58,270 But it does matter. 393 00:26:58,270 --> 00:27:02,290 So most of the time, as control guys, as robotics guys 394 00:27:02,290 --> 00:27:07,390 we would probably try to learn a model first, and then 395 00:27:07,390 --> 00:27:08,722 do model-based control. 396 00:27:08,722 --> 00:27:10,180 The last few days I've been saying, 397 00:27:10,180 --> 00:27:12,580 let's try to do some things without learning a model. 398 00:27:12,580 --> 00:27:17,140 Here's one interesting reason why. 399 00:27:17,140 --> 00:27:20,015 It's actually-- learning a model is sort of a tall order. 400 00:27:20,015 --> 00:27:21,140 It's a lot to learn, right? 401 00:27:21,140 --> 00:27:23,500 You've got to learn from every possible state and action 402 00:27:23,500 --> 00:27:28,780 what's my x dot [INAUDIBLE]. 403 00:27:28,780 --> 00:27:32,770 This is only learning from every possible state and action. 404 00:27:32,770 --> 00:27:36,250 What's the expected cost-to-go [INAUDIBLE]?? 405 00:27:36,250 --> 00:27:39,060 [INAUDIBLE] a scalar. 406 00:27:39,060 --> 00:27:41,510 So this is learning one algorithm for all m. 407 00:27:41,510 --> 00:27:45,050 And the beautiful thing about optimal control, 408 00:27:45,050 --> 00:27:46,980 with this sort of additive cost functions 409 00:27:46,980 --> 00:27:50,180 and everything like that, the beautiful thing 410 00:27:50,180 --> 00:27:55,190 is that this is all you need to know to make optimal decisions. 411 00:27:55,190 --> 00:27:56,840 You don't need to know your model. 412 00:27:56,840 --> 00:28:00,140 That model is extra information. 413 00:28:00,140 --> 00:28:03,950 All you need to know to make optimal decisions, given 414 00:28:03,950 --> 00:28:06,570 these additive cost functions [INAUDIBLE] 415 00:28:06,570 --> 00:28:09,530 is given [INAUDIBLE] a state and then a given action, 416 00:28:09,530 --> 00:28:13,589 how much do I expect to incur cost [INAUDIBLE]?? 417 00:28:13,589 --> 00:28:16,430 It's a beautiful thing. 418 00:28:16,430 --> 00:28:19,430 So if we make it stochastic, it gets even sort of-- 419 00:28:19,430 --> 00:28:21,847 learning a stochastic model, if your dynamics are variable 420 00:28:21,847 --> 00:28:23,347 and that's important, you want to do 421 00:28:23,347 --> 00:28:24,710 stochastic optimal control. 422 00:28:24,710 --> 00:28:28,730 Learning a stochastic model is probably even harder than that. 
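To summarize the size comparison being drawn on the board, with n-dimensional state and m-dimensional input, the objects under discussion are maps of the following shapes (a shorthand restatement, not new material):

```latex
J^{\pi} : \mathbb{R}^{n} \to \mathbb{R}, \qquad
\pi : \mathbb{R}^{n} \to \mathbb{R}^{m}, \qquad
Q^{\pi} : \mathbb{R}^{n+m} \to \mathbb{R}, \qquad
f : \mathbb{R}^{n+m} \to \mathbb{R}^{n}
```

So the Q function pays for the extra action dimensions in its domain, but its range stays scalar, whereas a model has to predict an n-dimensional output from every state-action pair.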
423 00:28:28,730 --> 00:28:32,120 Maybe I have to learn the mean of x 424 00:28:32,120 --> 00:28:34,820 dot plus the covariance matrix of x dot or something 425 00:28:34,820 --> 00:28:36,325 like this. 426 00:28:36,325 --> 00:28:37,700 When I use a stochastic model, it 427 00:28:37,700 --> 00:28:40,910 would be even more expensive. 428 00:28:40,910 --> 00:28:44,005 Q, in the Q sense-- 429 00:28:44,005 --> 00:28:46,650 I left it off in the first pass just to keep it clean, 430 00:28:46,650 --> 00:28:50,930 but Q is just going to be the expected value around this. 431 00:28:58,440 --> 00:29:02,320 So Q is always going to be a scalar, even 432 00:29:02,320 --> 00:29:04,020 in the stochastic optimal control sense. 433 00:29:07,190 --> 00:29:09,530 So maybe this is the biggest point of optimal control, 434 00:29:09,530 --> 00:29:11,655 honestly-- 435 00:29:11,655 --> 00:29:13,580 optimal control related to learning-- 436 00:29:13,580 --> 00:29:19,880 is that if you're willing to do these additive expected 437 00:29:19,880 --> 00:29:22,640 value optimization problems, which 438 00:29:22,640 --> 00:29:24,980 I think you've seen lots of interesting problems 439 00:29:24,980 --> 00:29:27,530 that fall into that category, then all you 440 00:29:27,530 --> 00:29:31,370 need to know to make decisions is to be able to-- the value 441 00:29:31,370 --> 00:29:35,490 function, the Q function here. 442 00:29:35,490 --> 00:29:38,220 The expected value of future penalties. 443 00:29:38,220 --> 00:29:39,876 And for everything else, [INAUDIBLE].. 444 00:29:42,443 --> 00:29:43,110 Important point. 445 00:29:46,309 --> 00:29:50,497 Now, just to soften it a little bit, in practice, 446 00:29:50,497 --> 00:29:52,080 you might not get away with only that. 447 00:29:52,080 --> 00:29:54,705 If you have to somehow build an observer to do state estimation 448 00:29:54,705 --> 00:29:56,860 or to estimate Q, and you've got-- 449 00:29:56,860 --> 00:29:59,010 there might be other reasons floating around 450 00:29:59,010 --> 00:30:02,970 in your robot that might require you to learn this. 451 00:30:02,970 --> 00:30:09,478 But in a pure sense, that's really what you need to know. 452 00:30:09,478 --> 00:30:10,470 AUDIENCE: Hey, Russ? 453 00:30:10,470 --> 00:30:11,587 RUSS TEDRAKE: Yeah. 454 00:30:11,587 --> 00:30:12,670 AUDIENCE: You put x and u. 455 00:30:12,670 --> 00:30:14,170 Shouldn't that be s and-- 456 00:30:14,170 --> 00:30:15,710 RUSS TEDRAKE: Right. 457 00:30:15,710 --> 00:30:16,840 We could have-- 458 00:30:16,840 --> 00:30:20,440 I could have said n is the number of states. 459 00:30:23,540 --> 00:30:26,057 AUDIENCE: I just meant, should it be s and a? 460 00:30:26,057 --> 00:30:26,890 RUSS TEDRAKE: Right. 461 00:30:26,890 --> 00:30:27,800 I would have-- 462 00:30:27,800 --> 00:30:31,120 I wrote the dimension of x, and I called it Rn,m. 463 00:30:31,120 --> 00:30:32,472 So that's what I meant. 464 00:30:32,472 --> 00:30:34,180 If you want to make an analogy back here, 465 00:30:34,180 --> 00:30:36,220 then it would actually be just the number 466 00:30:36,220 --> 00:30:37,680 of elements in s and a. 467 00:30:37,680 --> 00:30:39,805 But I wanted to sort of be a roboticist [INAUDIBLE] 468 00:30:39,805 --> 00:30:40,513 for a little bit. 469 00:30:40,513 --> 00:30:41,500 AUDIENCE: OK. 470 00:30:41,500 --> 00:30:43,560 RUSS TEDRAKE: This is just the computer scientist 471 00:30:43,560 --> 00:30:46,010 that did this [INAUDIBLE]. 472 00:30:46,010 --> 00:30:48,513 But it does make this easier, so I still [INAUDIBLE].. 
473 00:30:48,513 --> 00:30:49,430 So I meant to do that. 474 00:30:56,340 --> 00:30:58,850 So that [INAUDIBLE]. 475 00:30:58,850 --> 00:30:59,350 OK. 476 00:30:59,350 --> 00:31:01,000 So now, how do we learn Q? 477 00:31:01,000 --> 00:31:05,510 I told you how to learn J. Q looks pretty close to J. 478 00:31:05,510 --> 00:31:06,430 How do I learn Q? 479 00:31:06,430 --> 00:31:08,080 I told you about temporal different learning, 480 00:31:08,080 --> 00:31:09,460 probably wouldn't have wasted your time 481 00:31:09,460 --> 00:31:10,480 talking about temporal difference 482 00:31:10,480 --> 00:31:11,938 learning if it wasn't also relevant 483 00:31:11,938 --> 00:31:18,050 for what we needed to do these model-free value methods. 484 00:31:18,050 --> 00:31:20,822 So let's just see that temporal difference learning also 485 00:31:20,822 --> 00:31:22,280 works for learning these functions. 486 00:31:22,280 --> 00:31:22,780 OK? 487 00:31:22,780 --> 00:31:26,760 That's just some [INAUDIBLE]. 488 00:31:26,760 --> 00:31:29,520 Let's do just a simple case first, 489 00:31:29,520 --> 00:31:34,860 where I'm just doing-- remember, TD0 was just bootstrapping. 490 00:31:34,860 --> 00:31:37,890 It wasn't carrying around long-term rewards. 491 00:31:37,890 --> 00:31:39,727 It was just saying [INAUDIBLE] one step, 492 00:31:39,727 --> 00:31:42,060 and then I'm going to use my value estimate for the rest 493 00:31:42,060 --> 00:31:43,792 of the time as my new update. 494 00:31:46,820 --> 00:31:48,395 And I'll go ahead, since we're-- 495 00:31:48,395 --> 00:31:50,800 we talked about last time how a function approximator 496 00:31:50,800 --> 00:31:53,440 [INAUDIBLE] reduce it to the Markov chain case, 497 00:31:53,440 --> 00:31:54,722 let's just do it like-- 498 00:31:58,124 --> 00:32:07,520 let's say [INAUDIBLE] of s is alpha i phi i, linear function 499 00:32:07,520 --> 00:32:09,665 approximators. 500 00:32:09,665 --> 00:32:16,780 Or we could, in fact, reduce alpha t phi s, a. 501 00:32:19,360 --> 00:32:19,860 OK. 502 00:32:19,860 --> 00:32:25,050 Then the TD lambda update it going to be-- 503 00:32:25,050 --> 00:32:27,990 TD0 update is just going to be alpha 504 00:32:27,990 --> 00:32:43,855 plus gamma call that hat just to be careful here-- 505 00:32:43,855 --> 00:32:50,290 pi s transpose. 506 00:33:18,220 --> 00:33:20,820 These really are supposed to be s n and a n. 507 00:33:20,820 --> 00:33:26,332 I get a lot of [INAUDIBLE] for my sloppy [INAUDIBLE].. 508 00:33:26,332 --> 00:33:29,110 OK. 509 00:33:29,110 --> 00:33:33,530 And Q pi-- or the gradient here in the linear function 510 00:33:33,530 --> 00:33:36,482 approximator case, is just phi s, a. 511 00:33:41,960 --> 00:33:43,730 So if you look back in your notes, that's 512 00:33:43,730 --> 00:33:49,700 exactly what we had before, where this used to be J. 513 00:33:49,700 --> 00:33:52,190 We're going to use is our new update-- 514 00:33:52,190 --> 00:33:55,730 we're going to say that our new estimate for J 515 00:33:55,730 --> 00:34:00,440 is basically the one-step cost plus the long-term look-ahead. 516 00:34:00,440 --> 00:34:05,470 a n plus 1 in the case of doing an on-policy-- 517 00:34:05,470 --> 00:34:07,490 if I'm just trying to do policy evaluation, 518 00:34:07,490 --> 00:34:13,558 it's going to be pi s n plus 1. 519 00:34:13,558 --> 00:34:16,969 We'll use that one-step prediction 520 00:34:16,969 --> 00:34:20,298 minus my current prediction and try to make that go to 0. 
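Reconstructing the update being written on the board (the transcript only partially captures the notation, so treat this as a sketch): with a linear approximator Q_hat(s,a) = alpha^T phi(s,a), the on-policy TD(0) update moves alpha along the temporal-difference error times the gradient, which for the linear case is just phi(s,a). The feature map, learning-rate name, and data interface are illustrative.

```python
import numpy as np

def td0_q_evaluation_step(alpha, phi, s, a, cost, s_next, pi,
                          gamma=0.99, eta=0.01):
    """One on-policy TD(0) update for Q_hat(s, a) = alpha^T phi(s, a).

    Target:  cost + gamma * Q_hat(s_next, pi(s_next))    (one-step bootstrap)
    Update:  alpha <- alpha + eta * td_error * phi(s, a)
    Note the bootstrap target is treated as a constant ("semi-gradient"):
    its own dependence on alpha is ignored, so this is not true gradient
    descent on the squared TD error -- a point discussed just below.
    """
    q_sa = alpha @ phi(s, a)
    q_next = alpha @ phi(s_next, pi(s_next))
    td_error = cost + gamma * q_next - q_sa
    return alpha + eta * td_error * phi(s, a)
```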
521 00:34:20,298 --> 00:34:22,590 And in order to do it in a function approximator sense, 522 00:34:22,590 --> 00:34:25,290 that means multiplying that error, the temporal difference 523 00:34:25,290 --> 00:34:26,840 error, by the gradient. 524 00:34:26,840 --> 00:34:30,246 And that was something like gradient descent 525 00:34:30,246 --> 00:34:32,150 on your temporal difference policy. 526 00:34:32,150 --> 00:34:37,889 But not exactly, because it's this whole recursive dependence 527 00:34:37,889 --> 00:34:38,389 thing. 528 00:34:42,420 --> 00:34:45,150 People get why-- do people get that it's not 529 00:34:45,150 --> 00:34:50,449 quite gradient descent but kind of this? 530 00:34:50,449 --> 00:34:52,070 This looks a lot like what I would 531 00:34:52,070 --> 00:34:58,638 get if I was trying to do gradient descent [INAUDIBLE],, 532 00:34:58,638 --> 00:34:59,138 right? 533 00:35:02,120 --> 00:35:05,190 But only in the case of TD1 was actually gradient descent. 534 00:35:05,190 --> 00:35:10,320 But normally if I have a y minus f of x, 535 00:35:10,320 --> 00:35:15,317 I'm trying to do the gradient with respect to this, 536 00:35:15,317 --> 00:35:16,400 I've got to minimize this. 537 00:35:19,952 --> 00:35:22,452 And I'll get something-- if I take the gradient with respect 538 00:35:22,452 --> 00:35:30,405 to alpha, I get the error alpha x [INAUDIBLE].. 539 00:35:33,279 --> 00:35:35,313 AUDIENCE: [INAUDIBLE] 540 00:35:35,313 --> 00:35:37,730 RUSS TEDRAKE: Because what we got here, this is our error. 541 00:35:37,730 --> 00:35:42,500 If we assume that this is just my desired 542 00:35:42,500 --> 00:35:45,380 and this is my actual, then this is gradient descent. 543 00:35:45,380 --> 00:35:48,640 But it's not quite that, because this depends on an alpha-- 544 00:35:48,640 --> 00:35:50,150 these all depend on alpha. 545 00:35:50,150 --> 00:35:52,070 So by virtue of having this one in the alpha, 546 00:35:52,070 --> 00:35:56,080 it's not exactly gradient descent algorithm. 547 00:35:56,080 --> 00:35:57,460 But it still works. 548 00:35:57,460 --> 00:35:59,258 People proved that it works. 549 00:35:59,258 --> 00:36:00,530 Is that OK? 550 00:36:00,530 --> 00:36:03,920 And actually, in the case where TD is one, 551 00:36:03,920 --> 00:36:08,108 these things actually go through it 552 00:36:08,108 --> 00:36:10,650 and cancel each other out with whatever is a gradient descent 553 00:36:10,650 --> 00:36:11,715 algorithm [INAUDIBLE]. 554 00:36:14,570 --> 00:36:17,180 But I want you to see, this is my error 555 00:36:17,180 --> 00:36:20,570 I'm trying to make Q and my current state and action 556 00:36:20,570 --> 00:36:24,823 look like my one-step cost plus Q of my next state and action. 557 00:36:24,823 --> 00:36:26,615 And I would do that by multiplying my error 558 00:36:26,615 --> 00:36:29,680 by my gradient in a gradient descent kind of idea. 559 00:36:35,890 --> 00:36:36,390 OK. 560 00:36:36,390 --> 00:36:43,615 You can still do TD lambda if you like also. 561 00:36:46,370 --> 00:37:00,650 Q functions And the big idea there 562 00:37:00,650 --> 00:37:03,350 was to use an eligibility trace, which 563 00:37:03,350 --> 00:37:10,080 in the function approximator case, 564 00:37:10,080 --> 00:37:18,610 was gamma lambda ei n plus [INAUDIBLE].. 565 00:37:34,480 --> 00:37:38,574 And then my update is the same thing-- 566 00:37:38,574 --> 00:37:41,470 alpha-- because this is my big temporal difference error. 
567 00:37:41,470 --> 00:37:43,740 And instead of multiplying by the gradient [INAUDIBLE] 568 00:37:43,740 --> 00:37:48,110 this eligibility trace. 569 00:37:48,110 --> 00:37:51,670 And magically through an algebraic trick, 570 00:37:51,670 --> 00:37:58,380 remembering the gradient computes the bootstrapping case 571 00:37:58,380 --> 00:38:02,850 when lambda is 0, and the Monte Carlo case when lambda is 1, 572 00:38:02,850 --> 00:38:05,258 and something in between when lambda is [INAUDIBLE].. 573 00:38:12,700 --> 00:38:14,545 OK. 574 00:38:14,545 --> 00:38:17,560 So you'd still do temporal difference there. 575 00:38:17,560 --> 00:38:18,670 Big point number two-- 576 00:38:23,230 --> 00:38:28,360 big idea number one is we have to use Q functions to do 577 00:38:28,360 --> 00:38:29,420 action selection. 578 00:38:32,020 --> 00:38:43,600 Big point number two is off-policy policy evaluation. 579 00:38:43,600 --> 00:38:47,860 Once we start using Q, you could do this trick 580 00:38:47,860 --> 00:38:53,140 that I mentioned first thing we're doing value methods. 581 00:38:53,140 --> 00:39:12,598 And that is to execute policy pi 1 but learn Q pi 2 [INAUDIBLE].. 582 00:39:19,542 --> 00:39:20,970 Can you see how we do that? 583 00:39:40,550 --> 00:39:42,800 By virtue of having this extra dimension, 584 00:39:42,800 --> 00:39:45,150 we know we're learning-- bless you-- 585 00:39:45,150 --> 00:39:50,330 not only what happens when I take policy pi from state s. 586 00:39:50,330 --> 00:39:54,990 I'm learning what happens when I take any action in state s. 587 00:39:54,990 --> 00:39:59,830 That gives me a lot more power. 588 00:39:59,830 --> 00:40:02,330 Because for instance, when I'm making my temporal difference 589 00:40:02,330 --> 00:40:06,950 error, I don't need to necessarily use 590 00:40:06,950 --> 00:40:13,180 my one-step prediction as the current policy. 591 00:40:13,180 --> 00:40:17,420 I can just look up what would policy 2 [INAUDIBLE].. 592 00:40:22,700 --> 00:40:27,950 Because I'm storing every state-action pair, 593 00:40:27,950 --> 00:40:30,110 it's more to learn, more work. 594 00:40:30,110 --> 00:40:35,090 But it means I can say, I'd like my new Q pi 2 595 00:40:35,090 --> 00:40:38,090 to be the one-step policy I got from taking a plus 596 00:40:38,090 --> 00:40:41,838 the long-term cost of taking policy pi 2. 597 00:40:46,020 --> 00:40:49,710 And then all the same equations play out, and you get-- 598 00:40:53,920 --> 00:40:55,360 you get an estimate for policy 2. 599 00:40:58,568 --> 00:40:59,610 AUDIENCE: Does it count-- 600 00:41:02,930 --> 00:41:05,260 RUSS TEDRAKE: Yeah? 601 00:41:05,260 --> 00:41:07,260 AUDIENCE: Does it count more than the first step 602 00:41:07,260 --> 00:41:11,294 of the policy 2 and then take your cost-to-go of policy 1? 603 00:41:11,294 --> 00:41:12,900 Or does it somehow-- 604 00:41:16,070 --> 00:41:19,157 RUSS TEDRAKE: So I can't switch-- 605 00:41:19,157 --> 00:41:21,490 we'll talk about whether you can switch halfway through. 606 00:41:21,490 --> 00:41:26,660 But once I commit to learning Q pi 2, 607 00:41:26,660 --> 00:41:29,570 then actually this whole thing is built up 608 00:41:29,570 --> 00:41:32,540 of experience of executing policy 2, 609 00:41:32,540 --> 00:41:37,710 even though I've only generated sample paths for policy 1. 610 00:41:37,710 --> 00:41:42,548 So it's a completely consistent estimator of Q pi 2, right? 
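Backing up to the TD(lambda) form of the update described just before the off-policy point: the instantaneous gradient phi(s,a) is replaced by an eligibility trace that decays at rate gamma*lambda. A sketch under the same linear-approximator assumptions as above; names are illustrative.

```python
import numpy as np

def td_lambda_q_evaluation(alpha, phi, trajectory, pi,
                           gamma=0.99, lam=0.7, eta=0.01):
    """On-policy TD(lambda) for Q_hat(s, a) = alpha^T phi(s, a), using an
    eligibility trace e in place of the instantaneous gradient phi(s, a).

    trajectory: list of (s, a, cost, s_next) tuples generated by pi.
    lam = 0 recovers the one-step bootstrapping update (TD(0));
    lam = 1 behaves like the Monte-Carlo / least-squares estimate.
    """
    e = np.zeros_like(alpha)                       # eligibility trace
    for s, a, cost, s_next in trajectory:
        e = gamma * lam * e + phi(s, a)            # decay old credit, add new
        td_error = (cost + gamma * alpha @ phi(s_next, pi(s_next))
                    - alpha @ phi(s, a))
        alpha = alpha + eta * td_error * e         # spread the error back along the trace
    return alpha
```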
611 00:41:42,548 --> 00:41:44,090 If I halfway through decided I wanted 612 00:41:44,090 --> 00:41:45,650 to start evaluating pi 3, then I'm 613 00:41:45,650 --> 00:41:47,630 going to have to wait for those cancel out, 614 00:41:47,630 --> 00:41:49,890 or we play some tricks to do that. 615 00:41:49,890 --> 00:41:52,760 But it actually recursively builds up 616 00:41:52,760 --> 00:41:55,370 in the estimator of pi 2. 617 00:41:55,370 --> 00:41:56,760 AUDIENCE: Can I ask a question? 618 00:41:56,760 --> 00:41:58,500 RUSS TEDRAKE: Of course. 619 00:41:58,500 --> 00:42:03,510 AUDIENCE: [INAUDIBLE] have that [INAUDIBLE] function like this, 620 00:42:03,510 --> 00:42:04,385 we can substitute y-- 621 00:42:04,385 --> 00:42:06,052 RUSS TEDRAKE: You're talking about this? 622 00:42:06,052 --> 00:42:06,910 AUDIENCE: Yes. 623 00:42:06,910 --> 00:42:09,571 You can substitute y by [? g or ?] gamma, 624 00:42:09,571 --> 00:42:12,380 and then execute the-- 625 00:42:12,380 --> 00:42:13,335 RUSS TEDRAKE: Yeah. 626 00:42:13,335 --> 00:42:14,960 AUDIENCE: And then take the derivative? 627 00:42:14,960 --> 00:42:16,214 RUSS TEDRAKE: Yes. 628 00:42:16,214 --> 00:42:17,539 AUDIENCE: Why [INAUDIBLE]? 629 00:42:17,539 --> 00:42:19,706 RUSS TEDRAKE: So why isn't it true gradient descent? 630 00:42:23,040 --> 00:42:25,020 That's exactly what I proposed to do. 631 00:42:25,020 --> 00:42:27,660 But the only problem is, this isn't what we have. 632 00:42:27,660 --> 00:42:32,410 What we actually have is this, which 633 00:42:32,410 --> 00:42:35,230 means that this is not the true gradient [INAUDIBLE] term 634 00:42:35,230 --> 00:42:36,840 for partial y partial from over here. 635 00:42:36,840 --> 00:42:38,382 AUDIENCE: That's what I'm suggesting. 636 00:42:38,382 --> 00:42:40,690 So instead of y alpha, we can actually [INAUDIBLE] 637 00:42:40,690 --> 00:42:43,510 g plus gamma-- an approximation of y. 638 00:42:43,510 --> 00:42:46,720 So this g plus gamma Q is [INAUDIBLE] 639 00:42:46,720 --> 00:42:48,400 approximation for y, right? 640 00:42:48,400 --> 00:42:50,150 RUSS TEDRAKE: I'm trying to perfectly make 641 00:42:50,150 --> 00:42:57,719 the analogy that this looks like that, and this looks like that. 642 00:42:57,719 --> 00:42:58,386 AUDIENCE: Right. 643 00:42:58,386 --> 00:43:00,719 But when we're taking the derivative from that function, 644 00:43:00,719 --> 00:43:02,050 we assume that y is constant. 645 00:43:02,050 --> 00:43:02,590 RUSS TEDRAKE: Yes. 646 00:43:02,590 --> 00:43:03,610 AUDIENCE: And then solve this. 647 00:43:03,610 --> 00:43:04,060 RUSS TEDRAKE: Yes. 648 00:43:04,060 --> 00:43:06,830 AUDIENCE: We can actually assume that y is dependent on alpha 649 00:43:06,830 --> 00:43:10,060 and the derivative of that term with respect to alpha as well, 650 00:43:10,060 --> 00:43:10,810 and then solve it. 651 00:43:10,810 --> 00:43:11,560 RUSS TEDRAKE: Yes. 652 00:43:11,560 --> 00:43:12,625 So you could do that. 653 00:43:12,625 --> 00:43:14,440 So you're saying why don't we actually have 654 00:43:14,440 --> 00:43:17,080 a different update which has the gradient [INAUDIBLE]?? 655 00:43:17,080 --> 00:43:17,630 OK, good. 656 00:43:17,630 --> 00:43:19,630 So in the case of TD0-- 657 00:43:19,630 --> 00:43:23,090 TD1, you actually do have that. 658 00:43:23,090 --> 00:43:24,820 And I think that's true. 659 00:43:24,820 --> 00:43:26,530 I worked this out a number of years ago. 
660 00:43:26,530 --> 00:43:30,280 But I think it's true that if you start including that, 661 00:43:30,280 --> 00:43:36,100 if you look at the sum over a chain, for this standard update 662 00:43:36,100 --> 00:43:40,210 with TD0, for instance, that those terms, 663 00:43:40,210 --> 00:43:43,302 this term now will actually cancel itself out on this term 664 00:43:43,302 --> 00:43:45,110 here, for instance. 665 00:43:45,110 --> 00:43:45,960 It doesn't work. 666 00:43:45,960 --> 00:43:46,970 It doesn't work nicely. 667 00:43:46,970 --> 00:43:49,663 It would give you-- it gives you back the Monte Carlo error. 668 00:43:49,663 --> 00:43:51,080 It doesn't do temporal difference. 669 00:43:51,080 --> 00:43:52,980 It doesn't do the bootstrapping. 670 00:43:52,980 --> 00:43:55,550 So basically, you start including that, 671 00:43:55,550 --> 00:43:58,700 then you do get a least squares algorithm, of course. 672 00:43:58,700 --> 00:44:04,220 But it's effectively doing Monte Carlo. 673 00:44:04,220 --> 00:44:06,964 You have to sort of ignore that do to temporal difference 674 00:44:06,964 --> 00:44:09,160 learning. 675 00:44:09,160 --> 00:44:13,021 You're actually saying, I'm going to believe this estimate 676 00:44:13,021 --> 00:44:14,422 in order to do that. 677 00:44:14,422 --> 00:44:17,050 OK? 678 00:44:17,050 --> 00:44:18,550 Temporal difference, if you actually 679 00:44:18,550 --> 00:44:20,850 want to prove any of these things, 680 00:44:20,850 --> 00:44:22,400 I have one example of it in the note. 681 00:44:22,400 --> 00:44:26,540 I think that I put in TD1 is gradient descent in the notes, 682 00:44:26,540 --> 00:44:28,190 just so you see an example. 683 00:44:28,190 --> 00:44:31,130 A story-- rule of the game in temporal difference learning, 684 00:44:31,130 --> 00:44:35,420 derivations, and proofs, is you start expanding these sums, 685 00:44:35,420 --> 00:44:39,080 and terms from time n and terms from time 686 00:44:39,080 --> 00:44:43,370 n plus 1 cancel each other out in a gradient way. 687 00:44:43,370 --> 00:44:46,102 And you're left with something much more compact. 688 00:44:46,102 --> 00:44:48,560 That's why [? everybody ?] calls it an algebraic trick, why 689 00:44:48,560 --> 00:44:50,210 these things work. 690 00:44:50,210 --> 00:44:56,835 But because these are not random samples drawn one at a time, 691 00:44:56,835 --> 00:44:58,210 they're actually directly related 692 00:44:58,210 --> 00:45:01,857 to each other, that's why it makes it more complicated. 693 00:45:07,700 --> 00:45:08,200 OK. 694 00:45:08,200 --> 00:45:11,140 So we said off-policy evaluation says, 695 00:45:11,140 --> 00:45:20,230 execute policy pi 1 to get pi 1 generates s n a n trajectories. 696 00:45:20,230 --> 00:45:26,508 But you're going to do the update alpha plus-- 697 00:45:26,508 --> 00:45:33,810 I'm going to just write quickly here-- g s, a plus gamma Q pi-- 698 00:45:33,810 --> 00:45:36,790 this is going to be estimator Q pi 2-- 699 00:45:36,790 --> 00:45:41,620 s n plus 1 pi 2. 700 00:45:41,620 --> 00:45:45,130 What would pi 2 have done in kind of s n plus 1? 701 00:45:57,505 --> 00:46:00,096 In general, we'll multiply it by [INAUDIBLE].. 702 00:46:04,127 --> 00:46:05,585 That's a really, really nice trick. 703 00:46:05,585 --> 00:46:08,240 Let's learn about policy 1-- or policy 2 704 00:46:08,240 --> 00:46:11,320 while we execute policy 1. 705 00:46:11,320 --> 00:46:11,820 OK. 706 00:46:11,820 --> 00:46:13,445 So what policy 2 should we learn about? 
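Before answering that, here is the off-policy update just described as a sketch: the samples (s_n, a_n, g, s_{n+1}) come from executing pi_1, but the bootstrap uses the action pi_2 would take at s_{n+1}, so the quantity being estimated is Q^pi_2. Names are illustrative.

```python
def off_policy_td0_step(alpha, phi, s, a, cost, s_next, pi_2,
                        gamma=0.99, eta=0.01):
    """Execute pi_1 (which produced this (s, a, cost, s_next) sample),
    but learn Q_hat^{pi_2}: the bootstrap uses pi_2's action at s_next."""
    td_error = cost + gamma * alpha @ phi(s_next, pi_2(s_next)) - alpha @ phi(s, a)
    return alpha + eta * td_error * phi(s, a)
```

As noted later in the lecture, the clean guarantees for this off-policy form are for the tabular MDP case; with function approximators you need importance-sampling corrections.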
707 00:46:19,515 --> 00:46:22,220 And again, these are-- 708 00:46:22,220 --> 00:46:23,970 I'm asking the questions in bizarre ways, 709 00:46:23,970 --> 00:46:25,220 and there's a specific answer. 710 00:46:25,220 --> 00:46:29,540 But [INAUDIBLE] ask that question. 711 00:46:29,540 --> 00:46:31,430 Our ultimate goal is not to learn 712 00:46:31,430 --> 00:46:33,033 about some arbitrary pi 2. 713 00:46:33,033 --> 00:46:34,741 I want to learn about the optimal policy. 714 00:46:38,180 --> 00:46:40,700 I don't have the optimal policy. 715 00:46:40,700 --> 00:46:43,430 But I have an estimate of it. 716 00:46:43,430 --> 00:46:45,680 So actually, a perfectly reasonable update 717 00:46:45,680 --> 00:46:53,870 to do, and the way you might describe it is, let's execute-- 718 00:46:53,870 --> 00:46:56,367 I'm putting it in quotes, because it's not 719 00:46:56,367 --> 00:46:58,200 entirely accurate, but it's the right idea-- 720 00:47:00,890 --> 00:47:11,006 execute policy 1 but learn about the optimal policy. 721 00:47:16,370 --> 00:47:17,550 And how would we do that? 722 00:47:17,550 --> 00:47:32,756 Well-- this is now my shorthand Q star here. 723 00:47:32,756 --> 00:47:38,565 Estimate of Q star is s n plus 1-- 724 00:47:38,565 --> 00:47:40,062 I should have-- 725 00:48:18,570 --> 00:48:19,700 It makes total sense. 726 00:48:19,700 --> 00:48:23,250 Might as well, as I'm learning, always 727 00:48:23,250 --> 00:48:25,990 try to learn about policy which is optimal with respect 728 00:48:25,990 --> 00:48:29,820 to my current estimate J [? hat. ?] 729 00:48:29,820 --> 00:48:31,962 And this algorithm is called Q-learning. 730 00:48:37,510 --> 00:48:39,190 OK. 731 00:48:39,190 --> 00:48:46,750 It's the crown jewel of the value-based methods 732 00:48:46,750 --> 00:48:47,730 [INAUDIBLE]. 733 00:48:47,730 --> 00:48:50,580 I would say it was the gold standard until probably about 734 00:48:50,580 --> 00:48:52,860 [? '90-- ?] something like that. 735 00:48:52,860 --> 00:48:57,095 When people started to do policy gradient stuff more often. 736 00:48:57,095 --> 00:48:58,720 Even probably halfway through the '90s, 737 00:48:58,720 --> 00:49:00,730 people were still mostly [INAUDIBLE] papers 738 00:49:00,730 --> 00:49:04,300 about Q-learning. 739 00:49:04,300 --> 00:49:07,446 and there was a movement in policy gradient. 740 00:49:07,446 --> 00:49:09,340 AUDIENCE: So is your current estimate 741 00:49:09,340 --> 00:49:12,730 Q star not based on pi 1? 742 00:49:16,070 --> 00:49:18,400 RUSS TEDRAKE: It is based on data from pi 1. 743 00:49:18,400 --> 00:49:23,860 But if I always make my update, making it this update, 744 00:49:23,860 --> 00:49:27,220 then it really is learning about pi 2. 745 00:49:31,060 --> 00:49:35,068 AUDIENCE: Isn't pi 2 what you're computing with this update? 746 00:49:40,708 --> 00:49:41,500 RUSS TEDRAKE: Good. 747 00:49:41,500 --> 00:49:44,440 There's a couple of ways that I can do this. 748 00:49:44,440 --> 00:49:45,460 So in the policy-- 749 00:49:45,460 --> 00:49:48,635 in the simple policy iteration, we 750 00:49:48,635 --> 00:49:51,160 use [INAUDIBLE] evaluate for a long time, 751 00:49:51,160 --> 00:49:54,642 and then you make an, update, you evaluate for a long time, 752 00:49:54,642 --> 00:49:55,600 and you make an update. 753 00:49:55,600 --> 00:49:56,440 AUDIENCE: This is dynamically-- 754 00:49:56,440 --> 00:49:58,930 RUSS TEDRAKE: This is always sort of updating, right? 
755 00:49:58,930 --> 00:50:00,220 [INAUDIBLE] 756 00:50:00,220 --> 00:50:03,520 And you can prove that it's still a sound algorithm 757 00:50:03,520 --> 00:50:04,750 despite [INAUDIBLE]. 758 00:50:04,750 --> 00:50:10,922 This is always sort of updating its policy as it goes. 759 00:50:10,922 --> 00:50:13,930 Compared to this, which is more of the-- 760 00:50:13,930 --> 00:50:16,780 learn about pi 2 for a while, stop, [INAUDIBLE] 761 00:50:16,780 --> 00:50:18,380 pi 3 for a while, stop, this is trying 762 00:50:18,380 --> 00:50:19,550 to go straight through pi. 763 00:50:27,440 --> 00:50:27,990 OK, good. 764 00:50:27,990 --> 00:50:31,250 So what is it-- 765 00:50:31,250 --> 00:50:36,270 what's required for a Q-learning algorithm to converge? 766 00:50:36,270 --> 00:50:40,220 So even for this algorithm to converge, in order for pi 1 767 00:50:40,220 --> 00:50:43,660 to really teach me everything there is to know about pi 2, 768 00:50:43,660 --> 00:50:50,320 there's some important feature, which is that pi 1 and pi 2 769 00:50:50,320 --> 00:50:53,982 had better pick the same actions with some old probability. 770 00:50:56,690 --> 00:50:58,310 So off-policy works. 771 00:51:01,852 --> 00:51:03,060 Let's just even think about-- 772 00:51:03,060 --> 00:51:07,760 I'll even [INAUDIBLE] first in the discrete state and discrete 773 00:51:07,760 --> 00:51:08,760 actions and the Markov-- 774 00:51:11,330 --> 00:51:13,770 MDP formulations. 775 00:51:13,770 --> 00:51:26,840 Off-policy works if pi 1 takes in general all state-action 776 00:51:26,840 --> 00:51:31,922 pairs with some small probability. 777 00:51:39,940 --> 00:51:47,890 If pi 2 took action [INAUDIBLE] state 1 and pi 1 never did, 778 00:51:47,890 --> 00:51:49,690 there's no way I'm going to learn really 779 00:51:49,690 --> 00:51:53,356 what pi 2 is all about. 780 00:51:53,356 --> 00:51:58,540 [INAUDIBLE] show you those two [INAUDIBLE].. 781 00:51:58,540 --> 00:51:59,303 OK. 782 00:51:59,303 --> 00:52:00,220 So how do you do that? 783 00:52:00,220 --> 00:52:03,370 If you just-- if you're thinking about greedy policies 784 00:52:03,370 --> 00:52:07,060 on a robot, and you've got your current estimate of the value, 785 00:52:07,060 --> 00:52:12,310 and you do the most aggressive action on the acrobot, 786 00:52:12,310 --> 00:52:14,280 I'll tell you what's going to happen. 787 00:52:14,280 --> 00:52:17,470 You're going to visit the states near the bottom, 788 00:52:17,470 --> 00:52:19,300 and you start learning a lot. 789 00:52:19,300 --> 00:52:22,356 And you're never going to visit the states up at the top 790 00:52:22,356 --> 00:52:24,728 when you're learning. 791 00:52:24,728 --> 00:52:27,020 So how are you going to get around that on the acrobot? 792 00:52:27,020 --> 00:52:29,150 And the acrobot is tough, actually. 793 00:52:29,150 --> 00:52:33,830 But the idea is, you'd better add some randomness 794 00:52:33,830 --> 00:52:36,620 so that you explore more and more state and actions. 795 00:52:36,620 --> 00:52:39,288 And the hope is that if you add enough for a long enough time, 796 00:52:39,288 --> 00:52:41,330 you're going to learn better and better policies, 797 00:52:41,330 --> 00:52:44,330 you're going to find your way up to the top. 798 00:52:44,330 --> 00:52:47,210 So the acrobot is actually almost as hard 799 00:52:47,210 --> 00:52:49,130 as it gets with these things, where you really 800 00:52:49,130 --> 00:52:50,780 have to find your way into this region 801 00:52:50,780 --> 00:52:53,790 to learn about the region. 
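Pulling together the update described above ("execute policy 1, but learn about the optimal policy"), here is a tabular Q-learning sketch; the dict-based Q table and parameter names are illustrative.

```python
def q_learning_step(Q, s, a, cost, s_next, actions, gamma=0.99, eta=0.1):
    """One Q-learning update: bootstrap with the best action at s_next,
    i.e. learn about the policy that is greedy w.r.t. the current Q estimate,
    regardless of which behavior policy generated (s, a, cost, s_next)."""
    best_next = min(Q[(s_next, a_next)] for a_next in actions)   # min: costs, not rewards
    td_error = cost + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += eta * td_error
    return Q
```

Because the bootstrap always takes the min over actions, the behavior policy's only job is to keep visiting all state-action pairs with some probability, which is exactly the exploration requirement just discussed.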
802 00:52:53,790 --> 00:52:56,090 In fact, my beef with the reinforcement learning 803 00:52:56,090 --> 00:52:58,370 community is that they learn only 804 00:52:58,370 --> 00:53:00,530 to swing up to some threshold. 805 00:53:00,530 --> 00:53:02,960 They never actually [INAUDIBLE] to the top. 806 00:53:02,960 --> 00:53:05,210 If you look, there's lots of papers, countless papers, 807 00:53:05,210 --> 00:53:07,340 written about reinforcement learning, Q-learning 808 00:53:07,340 --> 00:53:09,040 for the acrobot and things like this, 809 00:53:09,040 --> 00:53:10,790 and they never actually solve the problem. 810 00:53:10,790 --> 00:53:12,650 They just try to [INAUDIBLE] at the top. 811 00:53:12,650 --> 00:53:14,210 But they don't do it. 812 00:53:14,210 --> 00:53:15,680 They just get up this high. 813 00:53:15,680 --> 00:53:17,450 Because it is sort of a tough case. 814 00:53:19,583 --> 00:53:20,500 So how do you do this? 815 00:53:20,500 --> 00:53:21,917 So you get-- like I said, in order 816 00:53:21,917 --> 00:53:23,675 to start exploring that space, you'd 817 00:53:23,675 --> 00:53:27,140 better add some randomness. 818 00:53:27,140 --> 00:53:30,230 So one of the standard approaches 819 00:53:30,230 --> 00:53:32,649 is to use epsilon-greedy algorithms. 820 00:53:42,240 --> 00:53:47,280 So I said, let's make pi 2 exactly the minimizing thing. 821 00:53:47,280 --> 00:53:47,950 That's true. 822 00:53:47,950 --> 00:53:53,160 But if you execute that, you're probably going to [INAUDIBLE] 823 00:53:53,160 --> 00:54:03,590 much better to execute a policy pi epsilon, 824 00:54:03,590 --> 00:54:06,970 where the-- let's say the policy I care about, 825 00:54:06,970 --> 00:54:13,120 I'm going to execute with probability-- 826 00:54:13,120 --> 00:54:17,680 sort of, you flip a coin, you pull a random number 827 00:54:17,680 --> 00:54:18,983 between 0 and 1. 828 00:54:18,983 --> 00:54:20,400 If it's greater than epsilon, then 829 00:54:20,400 --> 00:54:22,090 go ahead and execute the policy you're 830 00:54:22,090 --> 00:54:24,910 trying to learn about, but execute 831 00:54:24,910 --> 00:54:32,822 some random action otherwise. 832 00:54:42,900 --> 00:54:43,400 OK. 833 00:54:43,400 --> 00:54:45,440 So every time I-- 834 00:54:45,440 --> 00:54:48,020 every dt I'm going to flip a coin, keep it-- 835 00:54:48,020 --> 00:54:49,380 well, not a coin. 836 00:54:49,380 --> 00:54:52,070 A hundred-sided coin, a 100 to-- 837 00:54:52,070 --> 00:54:53,750 0 to 1, a continuous thing. 838 00:54:53,750 --> 00:54:57,830 If it comes out less than epsilon, 839 00:54:57,830 --> 00:54:59,240 I'm going to do a random action. 840 00:54:59,240 --> 00:55:01,657 Just forget about my current policy, pick a random action. 841 00:55:01,657 --> 00:55:05,240 It's a uniform distribution over actions. 842 00:55:05,240 --> 00:55:10,170 Otherwise, I'll take this, the action from my policy. 843 00:55:10,170 --> 00:55:14,135 And the virtue of having a soft policy learning thing 844 00:55:14,135 --> 00:55:15,980 is I can still learn about pi 2, even 845 00:55:15,980 --> 00:55:19,858 if I'm taking this pi epsilon. 846 00:55:19,858 --> 00:55:21,400 But I have the advantage of exploring 847 00:55:21,400 --> 00:55:22,690 all the state-actions. 848 00:55:26,110 --> 00:55:26,610 Good. 849 00:55:26,610 --> 00:55:28,080 I'm missing a page. 850 00:55:32,908 --> 00:55:34,450 AUDIENCE: Is that the most randomness 851 00:55:34,450 --> 00:55:39,370 you can produce since they'll [INAUDIBLE] converge?
852 00:55:39,370 --> 00:55:41,950 RUSS TEDRAKE: There's a couple of different candidates. 853 00:55:41,950 --> 00:55:44,000 The softmax is another one that people use a lot. 854 00:55:47,300 --> 00:55:48,020 And a lot of-- 855 00:55:48,020 --> 00:55:51,050 I mean, in the off-policy sense, it's actually quite robust. 856 00:55:51,050 --> 00:55:55,470 So a lot of people talk about using a behavioral policy, which 857 00:55:55,470 --> 00:55:57,470 is just sort of something to try-- it's designed 858 00:55:57,470 --> 00:55:58,987 to explore the state space. 859 00:55:58,987 --> 00:56:00,820 Actually, my candidate for a behavioral policy 860 00:56:00,820 --> 00:56:02,810 is something like RRT. 861 00:56:02,810 --> 00:56:04,790 We should really try to do something 862 00:56:04,790 --> 00:56:08,600 that gets me into all areas of state space, for instance. 863 00:56:08,600 --> 00:56:10,570 And then maybe that's a good way to design, 864 00:56:10,570 --> 00:56:12,860 to sample these state-action pairs. 865 00:56:12,860 --> 00:56:16,100 And all the while, I try to learn about pi 2. 866 00:56:16,100 --> 00:56:17,684 So it is robust in that sense. 867 00:56:21,940 --> 00:56:24,630 When I say it works here, I have to be a little careful. 868 00:56:24,630 --> 00:56:28,560 This is only for the MDP case that it's really 869 00:56:28,560 --> 00:56:29,880 guaranteed to work. 870 00:56:29,880 --> 00:56:34,590 There's more recent work doing off-policy and function 871 00:56:34,590 --> 00:56:36,290 approximators. 872 00:56:36,290 --> 00:56:37,140 And you can do that. 873 00:56:48,450 --> 00:56:51,750 I don't want to bury you guys with random detail. 874 00:56:51,750 --> 00:56:59,220 But you can do off-policy with linear function approximators 875 00:56:59,220 --> 00:57:08,004 safely, using an importance [INAUDIBLE] when you're dealing 876 00:57:08,004 --> 00:57:08,992 [INAUDIBLE]. 877 00:57:18,970 --> 00:57:20,910 And that's work by Doina Precup. 878 00:57:28,790 --> 00:57:30,470 The basic idea is you have to-- 879 00:57:30,470 --> 00:57:34,310 if your policy is changing, like these things, 880 00:57:34,310 --> 00:57:37,760 it's changing over time, you'd better weight your updates 881 00:57:37,760 --> 00:57:39,142 based on the relative-- 882 00:57:44,444 --> 00:57:47,150 AUDIENCE: [INAUDIBLE] necessarily. 883 00:57:47,150 --> 00:57:48,482 But that's-- 884 00:57:48,482 --> 00:57:50,940 RUSS TEDRAKE: But what you're learning about is the state-- 885 00:57:50,940 --> 00:57:53,180 the probability of picking this action for one step 886 00:57:53,180 --> 00:57:54,503 and then executing pi 2. 887 00:57:54,503 --> 00:57:56,586 And that's still [INAUDIBLE] even if pi 2 would never 888 00:57:56,586 --> 00:57:57,488 take that action. 889 00:57:57,488 --> 00:57:58,280 AUDIENCE: Oh, yeah. 890 00:57:58,280 --> 00:57:59,422 Because it's-- OK. 891 00:57:59,422 --> 00:58:00,880 RUSS TEDRAKE: So I think it's good. 892 00:58:00,880 --> 00:58:02,870 The thing that has to happen is that pi 2 893 00:58:02,870 --> 00:58:05,358 has to be well-defined for every possible state. 894 00:58:05,358 --> 00:58:05,900 AUDIENCE: OK. 895 00:58:05,900 --> 00:58:09,470 So keeping the Q pi 2 is take a certain [INAUDIBLE] 896 00:58:09,470 --> 00:58:11,870 take a certain action, then [INAUDIBLE] pi [INAUDIBLE]. 897 00:58:11,870 --> 00:58:12,290 RUSS TEDRAKE: Yes. 898 00:58:12,290 --> 00:58:12,840 AUDIENCE: OK. 899 00:58:12,840 --> 00:58:14,030 Sorry, I lost that [INAUDIBLE]. 900 00:58:14,030 --> 00:58:14,630 RUSS TEDRAKE: OK, good.
901 00:58:14,630 --> 00:58:14,990 Sorry. 902 00:58:14,990 --> 00:58:16,157 Thank you for clarifying it. 903 00:58:16,157 --> 00:58:17,240 Yeah, so cool. 904 00:58:17,240 --> 00:58:18,659 So I think that still works. 905 00:58:22,790 --> 00:58:25,483 OK. 906 00:58:25,483 --> 00:58:26,150 So this is good. 907 00:58:26,150 --> 00:58:28,470 So let me tell you where you are so far. 908 00:58:28,470 --> 00:58:31,320 We've now switched from doing temporal difference learning 909 00:58:31,320 --> 00:58:33,560 on value functions to temporal difference 910 00:58:33,560 --> 00:58:35,900 learning on Q functions. 911 00:58:35,900 --> 00:58:37,580 And a major thing we got out of that 912 00:58:37,580 --> 00:58:41,186 was that we can do this off-policy learning. 913 00:58:45,400 --> 00:58:46,780 You put it all together. 914 00:58:46,780 --> 00:58:51,250 [INAUDIBLE] back into my policy iteration diagram, 915 00:58:51,250 --> 00:58:53,890 and what we have, we've defined the policy evaluation, that's 916 00:58:53,890 --> 00:58:54,940 the TD lambda. 917 00:58:54,940 --> 00:58:57,480 We defined our update, which could 918 00:58:57,480 --> 00:58:59,710 be this in the general sense. 919 00:58:59,710 --> 00:59:03,775 This one is-- if I used pi 1 again, 920 00:59:03,775 --> 00:59:09,004 if I really did on-policy, if I used pi 1 everywhere 921 00:59:09,004 --> 00:59:11,170 while I'm executing pi 1, then this 922 00:59:11,170 --> 00:59:15,166 would be called SARSA, [INAUDIBLE] sort 923 00:59:15,166 --> 00:59:20,480 of on-policy Q-learning, on-policy updating. 924 00:59:20,480 --> 00:59:25,270 And Q-learning is this, where you use the middle gradient. 925 00:59:25,270 --> 00:59:27,930 And what we know, what people have proven, 926 00:59:27,930 --> 00:59:30,570 the algorithms were in use for years and years and years 927 00:59:30,570 --> 00:59:33,580 before it was actually proven, even in the tabular case, 928 00:59:33,580 --> 00:59:36,040 where you have finite state and actions. 929 00:59:36,040 --> 00:59:39,010 But now we know that this thing is guaranteed 930 00:59:39,010 --> 00:59:41,140 to converge to the optimal policy, 931 00:59:41,140 --> 00:59:44,320 that policy iteration, even if it's updated at every step, 932 00:59:44,320 --> 00:59:47,380 is going to converge to the optimal policy 933 00:59:47,380 --> 00:59:51,080 and the optimal Q function, given 934 00:59:51,080 --> 00:59:54,290 that all state-action pairs are [INAUDIBLE] in the tabular 935 00:59:54,290 --> 00:59:54,790 case. 936 01:00:01,840 --> 01:00:05,700 If we go to function approximation, 937 01:00:05,700 --> 01:00:09,480 if you just do policy evaluation but not update, 938 01:00:09,480 --> 01:00:14,480 then we have an example where this is actually 939 01:00:14,480 --> 01:00:16,250 in '02 or something like that. 940 01:00:16,250 --> 01:00:21,840 It'd be '01 or '02, 2001. 941 01:00:21,840 --> 01:00:24,930 We finally proved that off-policy with linear function 942 01:00:24,930 --> 01:00:29,530 approximation would converge when the policy is not 943 01:00:29,530 --> 01:00:31,770 changing-- 944 01:00:31,770 --> 01:00:33,540 no control. 945 01:00:33,540 --> 01:00:36,410 So the thing I need to give you before we consider 946 01:00:36,410 --> 01:00:43,590 this a complete story here is, can we do off-policy learning? 947 01:00:43,590 --> 01:00:45,450 Can we do our policy improvement 948 01:00:45,450 --> 01:00:50,310 update stably with function approximation?
949 01:00:50,310 --> 01:00:52,740 And the algorithm that we have for that 950 01:00:52,740 --> 01:00:55,454 is our least squares policy iteration. 951 01:01:35,980 --> 01:01:40,030 Do you remember least squares temporal difference learning? 952 01:01:40,030 --> 01:01:45,150 Sort of the idea was that if we look at the stationary update-- 953 01:01:45,150 --> 01:01:47,014 maybe I should write it down again. 954 01:01:47,014 --> 01:01:47,946 [INAUDIBLE] find it. 955 01:01:51,642 --> 01:01:53,100 If I look at the stationary update, 956 01:01:53,100 --> 01:01:55,870 if I were to run an entire batch of-- 957 01:01:55,870 --> 01:01:58,370 I mean, the big idea is when you're doing the least squares, 958 01:01:58,370 --> 01:02:00,260 is that we're going to try to reuse old data. 959 01:02:00,260 --> 01:02:02,468 We're not going to just make a single update, spit it 960 01:02:02,468 --> 01:02:03,490 out, throw it away. 961 01:02:03,490 --> 01:02:06,500 We're going to remember a bunch of old state-action pairs, 962 01:02:06,500 --> 01:02:09,405 trying to make a least squares update with just the same thing 963 01:02:09,405 --> 01:02:11,570 as [INAUDIBLE]. 964 01:02:11,570 --> 01:02:13,165 In the Monte Carlo sense, it's easy. 965 01:02:13,165 --> 01:02:14,540 It's just function approximation, 966 01:02:14,540 --> 01:02:17,170 where with this TD term floating around it's harder. 967 01:02:17,170 --> 01:02:19,420 So we had to come up with least squares temporal difference 968 01:02:19,420 --> 01:02:21,430 learning. 969 01:02:21,430 --> 01:02:27,970 And in the LSTD case, the story was 970 01:02:27,970 --> 01:02:42,460 we could build up a matrix using something that looked like phi 971 01:02:42,460 --> 01:02:51,650 of s gamma phi transpose-- 972 01:02:51,650 --> 01:02:54,515 so it's ik. 973 01:02:54,515 --> 01:02:58,310 Let me just write ik in here-- 974 01:02:58,310 --> 01:03:02,750 ik plus 1 minus phi of ik, 975 01:03:05,715 --> 01:03:09,420 times my parameter vector. 976 01:03:09,420 --> 01:03:17,070 And b-- some of these terms was e, 977 01:03:17,070 --> 01:03:23,291 which were phi ik times our reward times-- 978 01:03:29,135 --> 01:03:33,560 And if I did this least squares solution-- 979 01:03:33,560 --> 01:03:35,810 or I could invert that carefully with SVD or something 980 01:03:35,810 --> 01:03:36,640 like that-- 981 01:03:36,640 --> 01:03:38,930 then what I get out is the-- 982 01:03:43,313 --> 01:03:55,822 it jumps immediately to the steady-state solution of TD 983 01:03:55,822 --> 01:03:56,322 lambda. 984 01:04:05,666 --> 01:04:09,330 So this is essentially the big piece 985 01:04:09,330 --> 01:04:13,650 of the TD lambda update broken into the part that 986 01:04:13,650 --> 01:04:17,040 depends on alpha, the part that doesn't depend on alpha. 987 01:04:17,040 --> 01:04:19,755 I could write my batch TD lambda update 988 01:04:19,755 --> 01:04:26,860 as alpha equals alpha plus gamma times (A alpha plus b). 989 01:04:26,860 --> 01:04:29,938 And I could solve that at steady-state [INAUDIBLE]. 990 01:04:32,690 --> 01:04:33,190 All right. 991 01:04:33,190 --> 01:04:35,970 So least squares policy, least squares temporal difference 992 01:04:35,970 --> 01:04:39,190 learning, is about reusing lots of trajectories 993 01:04:39,190 --> 01:04:43,780 to make a single update that was going to jump right to where 994 01:04:43,780 --> 01:04:47,598 the temporal difference learning would have gotten 995 01:04:47,598 --> 01:04:49,140 if we just replayed it a bunch of times.
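(For reference, a rough sketch of the batch LSTD construction just recalled: accumulate an A matrix and a b vector over stored transitions from the fixed policy, then solve for the parameter vector alpha in one shot. Sign and step-size conventions differ between references; this follows the common form that solves A alpha = b, and the function and variable names are assumptions for illustration rather than the board notation.)

    import numpy as np

    def lstd(transitions, phi, n_features, gamma=0.99):
        # transitions: list of (s, cost, s_next) generated by the fixed policy pi.
        # phi(s): feature vector of length n_features, with J_hat(s) = phi(s) . alpha.
        A = np.zeros((n_features, n_features))
        b = np.zeros(n_features)
        for s, cost, s_next in transitions:
            f, f_next = phi(s), phi(s_next)
            A += np.outer(f, f - gamma * f_next)  # the part of the update that multiplies alpha
            b += cost * f                         # the part that does not depend on alpha
        # One shot to the steady-state solution (SVD-based least squares, in case
        # A is poorly conditioned).
        return np.linalg.lstsq(A, b, rcond=None)[0]

Running another batch just means accumulating more terms into A and b before re-solving.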
996 01:04:51,690 --> 01:04:55,420 Now the question is, the policy is 997 01:04:55,420 --> 01:04:58,680 going to be moving while we're doing this. 998 01:04:58,680 --> 01:05:03,750 How can we do this sort of least squares method 999 01:05:03,750 --> 01:05:05,583 to do the policy iteration up here? 1000 01:05:12,780 --> 01:05:17,597 Again, the trick is pretty simple. 1001 01:05:17,597 --> 01:05:19,555 We've just got to learn the Q function instead, 1002 01:05:19,555 --> 01:05:20,638 [INAUDIBLE] biggest trick. 1003 01:05:24,760 --> 01:05:26,410 So in order to do-- 1004 01:05:26,410 --> 01:05:28,350 control not just evaluate a single policy 1005 01:05:28,350 --> 01:05:45,535 but actually try to find the optimal policy, 1006 01:05:45,535 --> 01:05:47,160 first thing we have to do is figure out 1007 01:05:47,160 --> 01:05:49,875 how to do LSTD on a Q function. 1008 01:05:53,268 --> 01:05:56,605 And it turns out it's no-- 1009 01:05:56,605 --> 01:05:57,685 yeah, what's up? 1010 01:05:57,685 --> 01:05:58,680 [INAUDIBLE] 1011 01:06:03,080 --> 01:06:07,000 It turns out if you keep along, [INAUDIBLE] 1012 01:06:07,000 --> 01:06:10,840 exactly the same form as we did in least squares 1013 01:06:10,840 --> 01:06:14,710 temporal difference learning, but now we do everything 1014 01:06:14,710 --> 01:06:18,467 with functions of s and a. 1015 01:06:50,554 --> 01:06:54,520 [INAUDIBLE] transpose on it. 1016 01:06:54,520 --> 01:06:55,485 Now I do the-- 1017 01:06:55,485 --> 01:06:57,964 you said form of the [INAUDIBLE] too much. 1018 01:06:57,964 --> 01:07:01,690 I just want you to know the big idea here. 1019 01:07:13,769 --> 01:07:18,080 Then I do gamma [INAUDIBLE] A inverse b, 1020 01:07:18,080 --> 01:07:20,446 this whole time we're representing our Q function. 1021 01:07:23,451 --> 01:07:30,720 Q hat s, a is now a linear combination of nonlinear basis 1022 01:07:30,720 --> 01:07:33,525 functions on s and a. 1023 01:07:33,525 --> 01:07:36,905 AUDIENCE: Shouldn't that be transpose [INAUDIBLE]? 1024 01:07:36,905 --> 01:07:37,780 RUSS TEDRAKE: I put-- 1025 01:07:37,780 --> 01:07:41,340 I tried to put a transpose with my poorly-- 1026 01:07:41,340 --> 01:07:43,140 throughout everything. 1027 01:07:43,140 --> 01:07:45,672 So you're saying this one shouldn't be transpose? 1028 01:07:45,672 --> 01:07:47,980 AUDIENCE: [INAUDIBLE] should be a [INAUDIBLE]? 1029 01:07:47,980 --> 01:07:48,920 RUSS TEDRAKE: Yeah. 1030 01:07:48,920 --> 01:07:49,420 Good. 1031 01:07:49,420 --> 01:07:50,710 So I'm going to-- 1032 01:07:50,710 --> 01:07:53,193 but this one, I wrote this whole update 1033 01:07:53,193 --> 01:07:55,110 as the transpose of the other-- of what I just 1034 01:07:55,110 --> 01:07:56,134 wrote over there. 1035 01:08:07,070 --> 01:08:08,745 AUDIENCE: [INAUDIBLE] write alpha? 1036 01:08:08,745 --> 01:08:12,538 [INAUDIBLE] If it was a plus b some stuff? 1037 01:08:12,538 --> 01:08:13,330 RUSS TEDRAKE: Yeah. 1038 01:08:13,330 --> 01:08:14,180 And there's an alpha, sorry. 1039 01:08:14,180 --> 01:08:14,790 Thank you. 1040 01:08:19,439 --> 01:08:24,319 Well, actually, it's-- the alpha is really not there. 1041 01:08:24,319 --> 01:08:26,170 Yeah, I should have written it here. 1042 01:08:26,170 --> 01:08:27,060 It's A times alpha. 1043 01:08:27,060 --> 01:08:29,012 That's what it makes the update on. 1044 01:08:29,012 --> 01:08:32,896 So we get [INAUDIBLE] alpha. 1045 01:08:32,896 --> 01:08:34,229 So that is actually [INAUDIBLE]. 1046 01:08:34,229 --> 01:08:37,100 This is the one I did [INAUDIBLE].
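(A rough sketch of the same construction for Q functions, LSTDQ, together with the outer LSPI loop that the discussion turns to next: the feature vector now depends on the state-action pair, the next-step feature is evaluated at whatever action the policy being learned about would take, and the same stored tapes are replayed after every policy improvement. Everything here, including the fixed iteration count and the greedy improvement step, is an assumed illustration rather than the lecture's exact algorithm.)

    import numpy as np

    def lstdq(samples, phi_sa, n_features, policy, gamma=0.99):
        # samples: stored tapes of (s, a, cost, s_next); phi_sa(s, a) has length n_features.
        # Solves for alpha in Q_hat(s, a) = phi_sa(s, a) . alpha for the given policy.
        A = np.zeros((n_features, n_features))
        b = np.zeros(n_features)
        for s, a, cost, s_next in samples:
            f = phi_sa(s, a)
            f_next = phi_sa(s_next, policy(s_next))  # next action from the policy we learn about
            A += np.outer(f, f - gamma * f_next)
            b += cost * f
        return np.linalg.lstsq(A, b, rcond=None)[0]

    def lspi(samples, phi_sa, n_features, actions, n_iters=20, gamma=0.99):
        # Least squares policy iteration: alternate LSTDQ with a greedy
        # (cost-minimizing) improvement, reusing the same tapes at every iteration.
        alpha = np.zeros(n_features)
        for _ in range(n_iters):
            policy = lambda s, al=alpha: min(actions, key=lambda a: phi_sa(s, a) @ al)
            alpha = lstdq(samples, phi_sa, n_features, policy, gamma)
        return alpha

Replaying the old tapes under a new policy only changes which action gets plugged into phi_sa at the next state, which is why none of the stored data has to be thrown away.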
1047 01:08:37,100 --> 01:08:38,562 OK, good. 1048 01:08:38,562 --> 01:08:40,520 So it turns out you can learn a Q function just 1049 01:08:40,520 --> 01:08:42,920 like you can learn a value function, 1050 01:08:42,920 --> 01:08:45,319 by storing up these matrices, which 1051 01:08:45,319 --> 01:08:50,720 are what TD learning would have done in a batch sense, 1052 01:08:50,720 --> 01:08:52,970 and then just taking a one-step shot 1053 01:08:52,970 --> 01:08:55,550 to get directly to the solution for temporal difference 1054 01:08:55,550 --> 01:08:56,966 learning for the Q function. 1055 01:09:01,220 --> 01:09:06,620 And again, I put in here s prime. 1056 01:09:06,620 --> 01:09:09,300 And I left this a little bit ambiguous. 1057 01:09:09,300 --> 01:09:15,560 So I can evaluate any policy by just putting 1058 01:09:15,560 --> 01:09:18,470 in that policy in here and doing the replay. 1059 01:09:23,410 --> 01:09:31,649 And it turns out if I now do this in a policy iteration 1060 01:09:31,649 --> 01:09:43,560 sense, LSPI-- 1061 01:09:43,560 --> 01:09:45,920 Least Squares Policy Iteration-- 1062 01:09:45,920 --> 01:09:49,170 basically, you start off with an initial guess, 1063 01:09:49,170 --> 01:09:58,080 we do LSTDQ [INAUDIBLE] to get Q pi 1. 1064 01:09:58,080 --> 01:10:02,640 And then you repeat, yeah? 1065 01:10:02,640 --> 01:10:09,370 Then this thing, that's enough to get you to-- 1066 01:10:09,370 --> 01:10:12,322 it converges. 1067 01:10:12,322 --> 01:10:14,270 Now, be careful about how it converges. 1068 01:10:14,270 --> 01:10:31,976 It converges with some error bound to pi star, Q star. 1069 01:10:34,810 --> 01:10:38,120 The error bound depends on a couple of parameters. 1070 01:10:38,120 --> 01:10:40,420 So technically, it could be close to your solution 1071 01:10:40,420 --> 01:10:42,290 and oscillate, or something like that. 1072 01:10:42,290 --> 01:10:44,290 But it's a pretty strong convergence result 1073 01:10:44,290 --> 01:10:50,162 for this sort of policy improvement with an approximate value 1074 01:10:50,162 --> 01:10:50,662 function. 1075 01:10:54,930 --> 01:10:57,480 In a pure sense, it is-- 1076 01:10:57,480 --> 01:11:01,650 you should run this for a while until you 1077 01:11:01,650 --> 01:11:08,510 get a good estimate for LSTD and you get your new Q pi. 1078 01:11:08,510 --> 01:11:14,150 But by virtue of using Q functions, when you do switch 1079 01:11:14,150 --> 01:11:16,880 to your new policy, pi 2, let's say, 1080 01:11:16,880 --> 01:11:19,710 you don't have to throw away all your old data. 1081 01:11:19,710 --> 01:11:25,700 You just take your old tapes and actually regenerate a and b 1082 01:11:25,700 --> 01:11:30,320 as if you had played off those old tapes with-- 1083 01:11:30,320 --> 01:11:33,020 as if you had seen the old tapes executing the new policy. 1084 01:11:36,250 --> 01:11:39,400 And you can reuse all your old data 1085 01:11:39,400 --> 01:11:43,686 and make an efficient update to get Q pi [INAUDIBLE]. 1086 01:11:52,870 --> 01:11:54,744 Least squares policy iteration. 1087 01:11:54,744 --> 01:11:56,780 Pretty simple. 1088 01:11:56,780 --> 01:11:57,280 OK. 1089 01:11:57,280 --> 01:11:59,697 I know that was a little dry and a little bit-- and a lot. 1090 01:11:59,697 --> 01:12:03,130 But let's make sure we know how we got where we got. 1091 01:12:03,130 --> 01:12:11,130 So there's another route besides pure policy search 1092 01:12:11,130 --> 01:12:12,310 to do model-free learning.
1093 01:12:12,310 --> 01:12:16,930 All you have to do is take a bunch of trajectories, 1094 01:12:16,930 --> 01:12:19,120 learn a value function for those trajectories. 1095 01:12:19,120 --> 01:12:24,130 You don't even actually have to take the perfect-- 1096 01:12:24,130 --> 01:12:25,340 your best controller yet. 1097 01:12:25,340 --> 01:12:27,760 You could take some RRT controller, something that's 1098 01:12:27,760 --> 01:12:30,740 going to explore the space and try to learn about your value 1099 01:12:30,740 --> 01:12:31,240 function. 1100 01:12:34,042 --> 01:12:39,190 Learn Q pi through these LSTD algorithms-- 1101 01:12:39,190 --> 01:12:42,810 you can do a pretty efficient update for Q pi. 1102 01:12:42,810 --> 01:12:44,810 You can improve efficiently by just looking at 1103 01:12:44,810 --> 01:12:48,670 the min over Q, and pretty quickly 1104 01:12:48,670 --> 01:12:51,977 iterate to an optimal policy and optimal value 1105 01:12:51,977 --> 01:12:54,820 function, only storing-- 1106 01:12:54,820 --> 01:12:57,160 the only thing you have to store in that whole process 1107 01:12:57,160 --> 01:12:59,700 is the Q function. 1108 01:12:59,700 --> 01:13:03,480 And in the LS case, the LSTD case, you remember the tape-- 1109 01:13:03,480 --> 01:13:07,470 the history of tapes, you just use them more efficiently. 1110 01:13:07,470 --> 01:13:08,370 Yeah? 1111 01:13:08,370 --> 01:13:11,680 AUDIENCE: So could you have used this on the flapper 1112 01:13:11,680 --> 01:13:14,820 that John showed? 1113 01:13:14,820 --> 01:13:16,488 Or what's the-- 1114 01:13:16,488 --> 01:13:17,280 RUSS TEDRAKE: Good. 1115 01:13:17,280 --> 01:13:18,072 Very good question. 1116 01:13:20,870 --> 01:13:22,590 That's an excellent question. 1117 01:13:22,590 --> 01:13:25,310 So in fact, the last day of class, what we're going to do 1118 01:13:25,310 --> 01:13:26,360 is going to-- 1119 01:13:26,360 --> 01:13:28,780 the last day I present in class, we're going to-- 1120 01:13:28,780 --> 01:13:29,870 I'm going to go through a couple of sort 1121 01:13:29,870 --> 01:13:31,130 of case studies and different problems 1122 01:13:31,130 --> 01:13:33,080 that people have had success on, and tell you 1123 01:13:33,080 --> 01:13:35,760 why we picked the algorithm we picked, things like that. 1124 01:13:35,760 --> 01:13:38,420 So why didn't we do this on a flapper? 1125 01:13:38,420 --> 01:13:42,730 The simplest reason is that we don't know the state space. 1126 01:13:42,730 --> 01:13:46,620 It's infinite dimensional in general. 1127 01:13:46,620 --> 01:13:48,180 So that would have been a big thing 1128 01:13:48,180 --> 01:13:51,070 to represent a Q function for. 1129 01:13:51,070 --> 01:13:53,490 It doesn't mean-- it doesn't make it invalid. 1130 01:13:53,490 --> 01:13:54,990 We could have learned, we could have 1131 01:13:54,990 --> 01:13:58,470 tried to approximate the state space with even a handful 1132 01:13:58,470 --> 01:14:03,060 of features, learned a very approximate Q function, 1133 01:14:03,060 --> 01:14:04,710 and done something like actor-critic 1134 01:14:04,710 --> 01:14:06,190 like we're going to do next time. 1135 01:14:06,190 --> 01:14:08,730 But I think in cases where you don't know the state space, 1136 01:14:08,730 --> 01:14:10,607 or the state space is very, very large, 1137 01:14:10,607 --> 01:14:12,190 and you can write a simple controller, 1138 01:14:12,190 --> 01:14:16,360 then it makes more sense to parameterize the policy.
1139 01:14:16,360 --> 01:14:18,880 It really goes down to that game, that accounting game, 1140 01:14:18,880 --> 01:14:22,830 in some ways, of how many dimensions things are. 1141 01:14:22,830 --> 01:14:26,820 But in a fluids case, you could have a pretty simple policy 1142 01:14:26,820 --> 01:14:29,960 from sensors to actions which we could twiddle. 1143 01:14:29,960 --> 01:14:35,230 We couldn't have an efficient value function. 1144 01:14:35,230 --> 01:14:38,390 Now, there are other cases where the opposite is true. 1145 01:14:38,390 --> 01:14:41,860 The opposite is true, where you have a small state space, 1146 01:14:41,860 --> 01:14:44,980 let's say, but the resulting policies 1147 01:14:44,980 --> 01:14:48,250 would require a lot of features to parameterize. 1148 01:14:48,250 --> 01:14:50,770 But I think in general, the strength of these algorithms 1149 01:14:50,770 --> 01:14:54,760 is that they are efficient with reusing data. 1150 01:14:54,760 --> 01:14:57,560 The weakness is that-- 1151 01:14:57,560 --> 01:14:59,230 well, the weakness a few years ago 1152 01:14:59,230 --> 01:15:01,188 would have been that they'd blow up a lot of the time. 1153 01:15:01,188 --> 01:15:04,380 But algorithms have gotten better as we [INAUDIBLE] 1154 01:15:04,380 --> 01:15:05,980 we have some convergence guarantees. 1155 01:15:05,980 --> 01:15:07,190 Not the general [INAUDIBLE]. 1156 01:15:07,190 --> 01:15:08,680 I never told you that it converged 1157 01:15:08,680 --> 01:15:10,990 if you have a nonlinear function approximator. 1158 01:15:10,990 --> 01:15:13,040 We'd love to have that result [INAUDIBLE]. 1159 01:15:13,040 --> 01:15:14,890 We won't have it for a while. 1160 01:15:14,890 --> 01:15:17,140 But in the linear function approximator sense, 1161 01:15:17,140 --> 01:15:20,290 we have both. 1162 01:15:20,290 --> 01:15:22,330 But there's a lot of success stories. 1163 01:15:22,330 --> 01:15:24,122 These are the kind of algorithms that were 1164 01:15:24,122 --> 01:15:26,230 used to play backgammon. 1165 01:15:26,230 --> 01:15:27,700 There are examples of them working 1166 01:15:27,700 --> 01:15:31,460 on things like the [INAUDIBLE]. 1167 01:15:31,460 --> 01:15:34,570 But in the domains that I care about most in my lab, 1168 01:15:34,570 --> 01:15:37,740 we tend to do more policy gradient sort of things.