1 00:00:00,000 --> 00:00:02,490 The following content is provided under a Creative 2 00:00:02,490 --> 00:00:03,940 Commons license. 3 00:00:03,940 --> 00:00:06,330 Your support will help MIT OpenCourseWare 4 00:00:06,330 --> 00:00:10,630 continue to offer high-quality educational resources for free. 5 00:00:10,630 --> 00:00:13,320 To make a donation or view additional materials 6 00:00:13,320 --> 00:00:17,160 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,160 --> 00:00:18,252 at ocw.mit.edu. 8 00:00:21,870 --> 00:00:23,190 RUSS TEDRAKE: Welcome back. 9 00:00:23,190 --> 00:00:29,600 So today we get to finish our discussion 10 00:00:29,600 --> 00:00:33,500 on at least the first wave of value-based methods 11 00:00:33,500 --> 00:00:35,777 for trying to find optimal control 12 00:00:35,777 --> 00:00:39,130 policies, without a model. 13 00:00:39,130 --> 00:00:47,040 So we started last time talking about these model-free methods. 14 00:00:47,040 --> 00:00:51,270 And just to make sure we're all synced up here, 15 00:00:51,270 --> 00:01:01,090 so big picture is that we're trying 16 00:01:01,090 --> 00:01:14,920 to learn an optimal policy, approximate optimal policy, 17 00:01:14,920 --> 00:01:25,350 without a model, by just learning an approximate value 18 00:01:25,350 --> 00:01:25,850 function. 19 00:01:36,620 --> 00:01:41,720 And the claim is that value functions are a good thing 20 00:01:41,720 --> 00:01:45,630 to learn for a couple of reasons. 21 00:01:45,630 --> 00:01:47,060 First of all, they should describe 22 00:01:47,060 --> 00:01:53,720 everything you need to know about the optimal control. 23 00:01:53,720 --> 00:01:57,185 Second of all, they're actually fairly compact. 24 00:01:57,185 --> 00:01:59,060 I'm going to say more about that in a minute. 25 00:01:59,060 --> 00:02:02,120 But if you think about it, probably a value function 26 00:02:02,120 --> 00:02:03,920 might actually be a simpler thing 27 00:02:03,920 --> 00:02:08,449 to represent than the policy, a smaller thing to represent it, 28 00:02:08,449 --> 00:02:10,894 because it's just a scalar value over all states. 29 00:02:14,480 --> 00:02:19,250 And the third big motivation I tried to give last time 30 00:02:19,250 --> 00:02:23,390 was that these temporal difference methods which 31 00:02:23,390 --> 00:02:26,660 bootstrap based on previous experience, 32 00:02:26,660 --> 00:02:29,040 they're, like value iteration and dynamic programming, 33 00:02:29,040 --> 00:02:33,230 can be very efficient in terms of reusing the computation 34 00:02:33,230 --> 00:02:35,720 or reusing the samples that you've gotten by using 35 00:02:35,720 --> 00:02:37,803 estimates that you've already made with your value 36 00:02:37,803 --> 00:02:40,960 function to make better, fast estimates of your value 37 00:02:40,960 --> 00:02:43,878 function as you [INAUDIBLE]. 38 00:02:43,878 --> 00:02:45,420 These are all going to come up again. 39 00:02:45,420 --> 00:02:47,210 But that's just the high-level motivation 40 00:02:47,210 --> 00:02:50,038 for why we care about trying to learn value functions. 41 00:02:50,038 --> 00:02:51,830 And then first thing we did-- it was really 42 00:02:51,830 --> 00:02:54,950 all we did last time. 43 00:02:54,950 --> 00:02:59,510 The first thing we had to achieve 44 00:02:59,510 --> 00:03:19,996 was just estimate a value function for a fixed policy, 45 00:03:19,996 --> 00:03:23,990 which we called J pi, right? 46 00:03:23,990 --> 00:03:26,582 And we did it just from sample trajectories. 
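A minimal sketch of the policy-evaluation step just recapped: roll out the fixed policy pi, record the one-step costs, and average the discounted cost-to-go observed from each state. This assumes a discrete-state, episodic setting with discount factor gamma; the `env.reset()` / `env.step()` interface and all names are illustrative, not from the lecture.

```python
from collections import defaultdict

def mc_policy_evaluation(env, pi, num_episodes=1000, gamma=0.99):
    """Estimate J^pi(s) by averaging the discounted costs-to-go observed
    from each state along trajectories generated by the fixed policy pi."""
    returns = defaultdict(list)          # state -> list of observed costs-to-go
    for _ in range(num_episodes):
        s = env.reset()
        episode = []                     # [(state, one-step cost), ...]
        done = False
        while not done:
            a = pi(s)
            s_next, cost, done = env.step(a)   # assumed interface
            episode.append((s, cost))
            s = s_next
        # Walk backwards to accumulate the discounted cost-to-go.
        G = 0.0
        for s, cost in reversed(episode):
            G = cost + gamma * G
            returns[s].append(G)
    # J^pi(s) is the sample mean of the costs-to-go observed from s.
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```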
47 00:03:29,420 --> 00:03:34,910 In discrete state and action, we called them s and a. 48 00:03:34,910 --> 00:03:39,350 Take a bunch of trajectories, and you 49 00:03:39,350 --> 00:03:42,170 would be able from those trajectories 50 00:03:42,170 --> 00:03:45,675 to try to back out J pi. 51 00:03:45,675 --> 00:03:49,118 And those trajectories are generated using policy pi. 52 00:03:51,860 --> 00:03:54,180 So I actually tried to argue that that was useful even 53 00:03:54,180 --> 00:03:55,550 in the-- so if you just want-- 54 00:03:55,550 --> 00:03:57,407 if you have a robot out there that's already 55 00:03:57,407 --> 00:03:59,990 executing a policy, or a passive walker or something like this 56 00:03:59,990 --> 00:04:01,880 that doesn't have a policy, and you just 57 00:04:01,880 --> 00:04:06,260 want to see how well it's doing, estimate its stability 58 00:04:06,260 --> 00:04:09,652 by example, then you can actually-- 59 00:04:09,652 --> 00:04:10,860 this might be enough for you. 60 00:04:10,860 --> 00:04:15,020 You might just try to evaluate how well that policy is doing. 61 00:04:15,020 --> 00:04:16,857 We call that policy evaluation. 62 00:04:23,188 --> 00:04:26,320 What we're actually interested it's not that. 63 00:04:26,320 --> 00:04:28,840 That's just the first step. 64 00:04:28,840 --> 00:04:33,550 What we care about now is, given if we can estimate 65 00:04:33,550 --> 00:04:37,720 the value for a stationary policy, 66 00:04:37,720 --> 00:04:41,140 can we now do something smarter and more involved, 67 00:04:41,140 --> 00:04:45,490 and try to estimate what the optimal value function. 68 00:04:45,490 --> 00:04:49,030 Or you might think of it as continuing 69 00:04:49,030 --> 00:04:51,490 to estimate the value function as we change 70 00:04:51,490 --> 00:04:53,758 pi towards the [INAUDIBLE] the optimal cost, 71 00:04:53,758 --> 00:04:54,550 the optimal policy. 72 00:04:59,220 --> 00:05:01,980 We talked about a couple of ways to estimate the value function 73 00:05:01,980 --> 00:05:04,590 for a fixed pi, right? 74 00:05:04,590 --> 00:05:11,820 Even for function approximation, we did first Markov chains, 75 00:05:11,820 --> 00:05:14,490 and then we went to function approximation. 76 00:05:21,720 --> 00:05:31,290 And we have convergence results for linear function 77 00:05:31,290 --> 00:05:32,144 approximators. 78 00:05:36,425 --> 00:05:39,170 And we went back and looked up [INAUDIBLE] it had a question 79 00:05:39,170 --> 00:05:42,335 about whether they used lambda in your update, 80 00:05:42,335 --> 00:05:45,170 if it always got to the same estimate of J. 81 00:05:45,170 --> 00:05:47,850 And I think the answer was, yes, it always gets-- 82 00:05:47,850 --> 00:05:50,280 the convergence proof has an error bound. 83 00:05:50,280 --> 00:05:53,030 And that error bound does depend on lambda, 84 00:05:53,030 --> 00:05:56,180 if you remember [INAUDIBLE] discussion. 85 00:05:56,180 --> 00:05:58,970 But if you said your learning rate gets smaller and smaller 86 00:05:58,970 --> 00:06:01,400 and you go, it should converge to the-- they should all 87 00:06:01,400 --> 00:06:04,658 converge to the same estimate of J pi. 88 00:06:11,300 --> 00:06:15,800 So if you think about it, learning J pi 89 00:06:15,800 --> 00:06:18,420 shouldn't involve any new machinery, right? 
90 00:06:18,420 --> 00:06:22,140 If I'm just experiencing cost, and I'm experiencing states, 91 00:06:22,140 --> 00:06:24,440 and I'm trying to learn a function of cost-to-go given 92 00:06:24,440 --> 00:06:27,107 states, that should just be able to do a least squares function. 93 00:06:27,107 --> 00:06:30,450 It's just a standard function approximation task. 94 00:06:30,450 --> 00:06:37,765 I could just do just a least squares function approximation, 95 00:06:37,765 --> 00:06:45,230 least squares estimation, and what 96 00:06:45,230 --> 00:06:46,860 we call the Monte-Carlo error. 97 00:06:49,584 --> 00:06:52,918 Just run a bunch of trials, figure out 98 00:06:52,918 --> 00:06:55,210 the estimates of what the cost-to-go was at every time, 99 00:06:55,210 --> 00:06:59,100 and then just do least squares estimation. 100 00:06:59,100 --> 00:07:02,150 The machinery we developed last time 101 00:07:02,150 --> 00:07:06,078 was because it's actually a lot faster using 102 00:07:06,078 --> 00:07:09,656 bootstrapping algorithms. 103 00:07:09,656 --> 00:07:12,152 [INAUDIBLE] much faster than [INAUDIBLE].. 104 00:07:23,460 --> 00:07:23,960 Right. 105 00:07:23,960 --> 00:07:29,120 So we talked about the TD lambda algorithm, 106 00:07:29,120 --> 00:07:31,010 including for function approximation. 107 00:07:34,165 --> 00:07:36,290 The only reason we had to develop any new machinery 108 00:07:36,290 --> 00:07:39,630 is because we wanted to be able to essentially do the least 109 00:07:39,630 --> 00:07:41,190 squares estimation, but we wanted 110 00:07:41,190 --> 00:07:44,840 to reuse our current estimate as we build up the estimate. 111 00:07:44,840 --> 00:07:48,410 And that's why it's not just a standard function approximation 112 00:07:48,410 --> 00:07:54,778 task we did all the [INAUDIBLE] something [INAUDIBLE] 113 00:07:54,778 --> 00:07:57,700 presented it. 114 00:07:57,700 --> 00:07:58,200 OK. 115 00:08:02,520 --> 00:08:06,470 So that's the simple policy evaluation story. 116 00:08:10,420 --> 00:08:14,110 Now the question is, how do we use the ability 117 00:08:14,110 --> 00:08:18,004 to do policy evaluation to get towards a more optimal policy? 118 00:08:21,350 --> 00:08:34,789 So today, given the new policy evaluation, 119 00:08:34,789 --> 00:08:36,215 we want to improve the policy. 120 00:08:50,920 --> 00:08:53,422 And the idea of this-- 121 00:08:53,422 --> 00:08:55,755 the first idea you have to have in your head, very, very 122 00:08:55,755 --> 00:08:56,255 simple. 123 00:09:00,090 --> 00:09:02,410 And it's called policy iteration. 124 00:09:12,630 --> 00:09:17,560 So given I start off with some initial guess for a policy, 125 00:09:17,560 --> 00:09:22,840 and I run it for a little while, I could do policy evaluation. 126 00:09:22,840 --> 00:09:28,570 So I'm converged on a nice estimate to get J pi 1. 127 00:09:31,080 --> 00:09:36,690 And now I'd like to take J pi 1, my estimate, 128 00:09:36,690 --> 00:09:43,810 and come up with a new pi 2. 129 00:09:43,810 --> 00:09:49,670 We've talked about how the value function infers a policy. 130 00:09:49,670 --> 00:09:54,740 And if I repeat, and I do it properly, 131 00:09:54,740 --> 00:09:59,110 then if all goes well, I should find myself-- 132 00:09:59,110 --> 00:10:00,820 if it's always increasing in performance, 133 00:10:00,820 --> 00:10:06,730 and we can show that, then I should find myself eventually 134 00:10:06,730 --> 00:10:13,888 at the optimal policy and optimal value function, right? 
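The "just do least squares on the Monte-Carlo error" option mentioned above can be written down directly. A sketch assuming a linear function approximator J_hat(s) = alpha^T phi(s); the feature map `phi` and the trajectory format are illustrative.

```python
import numpy as np

def mc_least_squares(trajectories, phi, gamma=0.99):
    """Fit J_hat(s) = alpha^T phi(s) to Monte-Carlo costs-to-go by least squares.

    trajectories: list of episodes, each a list of (state, one-step cost)
                  pairs generated by the fixed policy pi.
    phi:          feature map, state -> numpy array of shape (k,)
    """
    features, targets = [], []
    for episode in trajectories:
        G = 0.0
        for s, cost in reversed(episode):
            G = cost + gamma * G            # Monte-Carlo cost-to-go from s
            features.append(phi(s))
            targets.append(G)
    Phi = np.array(features)                # (num_samples, k)
    y = np.array(targets)                   # (num_samples,)
    alpha, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return alpha                            # J_hat(s) = alpha @ phi(s)
```

The TD machinery developed in lecture is doing the same fit, but it reuses the current estimate (bootstraps) instead of waiting for full-trajectory returns, which is why it tends to be much more sample-efficient.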
135 00:10:19,760 --> 00:10:23,390 So we said TD lambda was a candidate 136 00:10:23,390 --> 00:10:25,515 for sitting there and evaluating policy, which I've 137 00:10:25,515 --> 00:10:27,140 talked about a couple of different ways 138 00:10:27,140 --> 00:10:28,400 to do policy evaluation. 139 00:10:28,400 --> 00:10:31,127 So the question now is, how do we this, then? 140 00:10:31,127 --> 00:10:32,210 That's the first question. 141 00:10:36,660 --> 00:10:39,830 So given your policy, given your value function, 142 00:10:39,830 --> 00:10:42,890 how do you compute a new policy that's 143 00:10:42,890 --> 00:10:46,174 at least as good as your own policy but maybe better? 144 00:11:02,000 --> 00:11:04,460 AUDIENCE: Maybe stochastic gradient descent? 145 00:11:04,460 --> 00:11:06,140 RUSS TEDRAKE: Do something like stochastic gradient descent? 146 00:11:06,140 --> 00:11:07,440 You have to be careful with stochastic gradient. 147 00:11:07,440 --> 00:11:09,315 You have to make sure it's always going down, 148 00:11:09,315 --> 00:11:11,810 and things like that. 149 00:11:11,810 --> 00:11:14,610 It's a good idea. 150 00:11:14,610 --> 00:11:16,550 In fact, that's sort of-- 151 00:11:16,550 --> 00:11:18,970 actually, [INAUDIBLE]. 152 00:11:18,970 --> 00:11:22,840 We combine stochastic gradient descent and evaluation 153 00:11:22,840 --> 00:11:25,600 to do actor-critic [INAUDIBLE]. 154 00:11:25,600 --> 00:11:29,618 But there's a simpler sort of idea. 155 00:11:32,398 --> 00:11:33,690 I guess the thing it requires-- 156 00:11:33,690 --> 00:11:36,148 I didn't even think about this when I was making the notes. 157 00:11:36,148 --> 00:11:39,120 But I guess it requires an observation that-- 158 00:11:41,870 --> 00:11:47,810 so the optimal value function and the optimal policy 159 00:11:47,810 --> 00:11:50,510 have a property that the policy is going, 160 00:11:50,510 --> 00:11:54,187 taking the fastest descent down the value function. 161 00:11:54,187 --> 00:11:56,770 Your job is to go down the value function as fast as possible. 162 00:11:59,510 --> 00:12:03,970 But if you're not optimal yet, I've got some random policy, 163 00:12:03,970 --> 00:12:09,010 and I figure out my value of executing that policy, that's 164 00:12:09,010 --> 00:12:12,210 actually not true yet. 165 00:12:12,210 --> 00:12:15,830 So what I need to say is, if you start giving your value 166 00:12:15,830 --> 00:12:19,190 function, you come up with a new policy which 167 00:12:19,190 --> 00:12:21,050 tries to be as aggressive as possible 168 00:12:21,050 --> 00:12:23,750 on this value function, which in our continuous sense, 169 00:12:23,750 --> 00:12:26,450 is going down the gradient of the value function 170 00:12:26,450 --> 00:12:28,778 as fast as possible. 171 00:12:28,778 --> 00:12:30,320 And that should be at least as good-- 172 00:12:30,320 --> 00:12:33,380 in the case of the optimal policy, it should be the same. 173 00:12:33,380 --> 00:12:35,910 It should return the optimal policy again. 174 00:12:35,910 --> 00:12:39,260 But in the case where the value estimates from another, 175 00:12:39,260 --> 00:12:41,776 original policy gets you to do better. 176 00:12:44,755 --> 00:12:46,040 So the basic story-- 177 00:12:46,040 --> 00:12:48,830 that's the continuous gradient-- is 178 00:12:48,830 --> 00:12:57,050 you want to come up with a greedy policy that moves down, 179 00:12:57,050 --> 00:13:00,150 that does the best it can with this J pi. 
180 00:13:06,660 --> 00:13:12,900 So pi 2, let's say, which is a function of s, should be, 181 00:13:12,900 --> 00:13:17,450 for instance, [INAUDIBLE] the discrete sense here, 182 00:13:17,450 --> 00:13:21,263 discrete state and action, minimize the expected value 183 00:13:21,263 --> 00:13:27,030 [INAUDIBLE] expected value first, by one-step error plus-- 184 00:13:53,430 --> 00:13:57,570 So I've got the cost that I incur here 185 00:13:57,570 --> 00:13:59,190 plus the long-term cost here. 186 00:13:59,190 --> 00:14:01,500 I want to pick the new min over a. 187 00:14:04,480 --> 00:14:06,806 The best thing I can do given that estimate 188 00:14:06,806 --> 00:14:08,890 of the value function. 189 00:14:08,890 --> 00:14:11,530 And that's going to give me a new policy, actually, pi 190 00:14:11,530 --> 00:14:15,163 2, which is greedy with respect to this estimate of the value 191 00:14:15,163 --> 00:14:15,663 function. 192 00:14:20,500 --> 00:14:24,602 What does that look like to you guys? 193 00:14:24,602 --> 00:14:25,600 AUDIENCE: [INAUDIBLE] 194 00:14:25,600 --> 00:14:26,392 RUSS TEDRAKE: Yeah. 195 00:14:26,392 --> 00:14:26,910 OK. 196 00:14:26,910 --> 00:14:30,060 So value iteration, or dynamic programming, 197 00:14:30,060 --> 00:14:34,440 is exactly policy iteration in the case 198 00:14:34,440 --> 00:14:36,780 where you do a sweep through your entire state 199 00:14:36,780 --> 00:14:40,860 space every time, and then you update, sweep your entire state 200 00:14:40,860 --> 00:14:42,639 space, you do the update. 201 00:15:15,670 --> 00:15:17,410 Absolutely. 202 00:15:17,410 --> 00:15:20,080 But it's a more general idea than just value iteration. 203 00:15:20,080 --> 00:15:22,330 You don't have to actually evaluate all s. 204 00:15:22,330 --> 00:15:24,030 You might call it asynchronous value-- 205 00:15:24,030 --> 00:15:25,600 [INAUDIBLE]? 206 00:15:25,600 --> 00:15:27,892 AUDIENCE: Shouldn't that be argmin [INAUDIBLE]?? 207 00:15:27,892 --> 00:15:28,850 RUSS TEDRAKE: Oh, good. 208 00:15:28,850 --> 00:15:30,170 Thank you, yeah. 209 00:15:30,170 --> 00:15:32,860 This is argmin. 210 00:15:32,860 --> 00:15:34,258 Good catch. 211 00:15:34,258 --> 00:15:36,600 Yeah. 212 00:15:36,600 --> 00:15:40,117 AUDIENCE: This is like g [INAUDIBLE].. 213 00:15:40,117 --> 00:15:40,950 RUSS TEDRAKE: Right. 214 00:15:40,950 --> 00:15:41,910 I always minimize this. 215 00:15:41,910 --> 00:15:51,040 So g is [? bad. ?] Well, I don't promise that I will never 216 00:15:51,040 --> 00:15:52,900 make a mistake with the signs, because I 217 00:15:52,900 --> 00:15:54,775 try to use reinforcement [INAUDIBLE] notation 218 00:15:54,775 --> 00:15:57,580 with costs, and I can sometimes get myself into trouble. 219 00:15:57,580 --> 00:16:00,190 I never write "arg." 220 00:16:00,190 --> 00:16:03,390 It's always g. 221 00:16:03,390 --> 00:16:03,890 OK. 222 00:16:03,890 --> 00:16:08,620 So so this would be argmin. 223 00:16:08,620 --> 00:16:12,452 The min is the value estimated in the case of value iteration. 224 00:16:12,452 --> 00:16:14,410 But in general, you don't have to wait till you 225 00:16:14,410 --> 00:16:15,850 sweep the entire state space. 226 00:16:15,850 --> 00:16:18,430 You can just take a single trajectory 227 00:16:18,430 --> 00:16:22,490 through, update your value J. Or you 228 00:16:22,490 --> 00:16:25,180 take lots of trajectories through [INAUDIBLE] 229 00:16:25,180 --> 00:16:30,300 get an improved estimate for J, and then do this 230 00:16:30,300 --> 00:16:32,070 and get a new policy, right? 
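A sketch of the greedy improvement step being written on the board: pi_2(s) = argmin_a E[ g(s,a) + gamma * J^pi_1(s') ]. This tabular version assumes access to the one-step cost g(s,a) and the transition probabilities P(s'|s,a), which is exactly the model requirement discussed next; the data-structure choices are illustrative.

```python
def greedy_policy(J, states, actions, P, g, gamma=0.99):
    """Policy improvement: pi_2(s) = argmin_a  g(s,a) + gamma * E[ J(s') ].

    J: dict state -> estimated cost-to-go under the previous policy pi_1
    P: dict (s, a) -> list of (s_next, probability) pairs   (the model)
    g: dict (s, a) -> one-step cost
    """
    pi_2 = {}
    for s in states:
        best_a, best_q = None, float("inf")
        for a in actions:
            expected_next = sum(p * J[s_next] for s_next, p in P[(s, a)])
            q = g[(s, a)] + gamma * expected_next
            if q < best_q:
                best_a, best_q = a, q
        pi_2[s] = best_a        # greedy w.r.t. the current estimate of J^pi_1
    return pi_2
```

Sweeping every state and immediately re-updating J in this way is value iteration; evaluating J^pi to convergence before each improvement is classic policy iteration.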
231 00:16:32,070 --> 00:16:35,730 In this policy iteration, the original idea 232 00:16:35,730 --> 00:16:38,340 is you should really do this policy evaluation 233 00:16:38,340 --> 00:16:43,480 step until your estimate of J pi convergence, and then move on. 234 00:16:43,480 --> 00:16:45,912 But in fact, value iteration and other-- 235 00:16:45,912 --> 00:16:47,370 many algorithms show that you can-- 236 00:16:47,370 --> 00:16:48,930 it actually is still stable when you 237 00:16:48,930 --> 00:16:50,180 don't wait for it to converge. 238 00:16:54,660 --> 00:16:58,110 But there's a problem with what I wrote here. 239 00:16:58,110 --> 00:17:00,130 I don't think there's a technical problem. 240 00:17:00,130 --> 00:17:03,190 But why is that not quite what we need for today's lecture? 241 00:17:03,190 --> 00:17:03,690 Yeah. 242 00:17:03,690 --> 00:17:05,170 AUDIENCE: I just had a quick question. 243 00:17:05,170 --> 00:17:07,140 So if you're going to be gradient with respect 244 00:17:07,140 --> 00:17:10,549 to the value function that you evaluate, 245 00:17:10,549 --> 00:17:12,624 you can't do that with a value function 246 00:17:12,624 --> 00:17:13,835 if you have a model, right? 247 00:17:13,835 --> 00:17:14,460 So you need a-- 248 00:17:14,460 --> 00:17:15,480 RUSS TEDRAKE: That's actually exactly-- 249 00:17:15,480 --> 00:17:17,480 you're answering the question that I was asking. 250 00:17:17,480 --> 00:17:19,829 That's perfect. 251 00:17:19,829 --> 00:17:22,950 So from, as I said, model-free, model-free, model-free, 252 00:17:22,950 --> 00:17:26,400 but then I wrote down a model here. 253 00:17:26,400 --> 00:17:27,240 So how can I-- 254 00:17:27,240 --> 00:17:32,940 even in the steepest descent sort of continuous sense, 255 00:17:32,940 --> 00:17:34,170 this is absurd. 256 00:17:34,170 --> 00:17:35,970 In the discrete sense, argmin over a 257 00:17:35,970 --> 00:17:37,980 is typically done with a search over all actions 258 00:17:37,980 --> 00:17:39,990 in the continuous state and action. 259 00:17:39,990 --> 00:17:43,810 I think it was finding the gradient down the slope. 260 00:17:43,810 --> 00:17:44,310 But right. 261 00:17:44,310 --> 00:17:47,880 Both of those require a model to actually do 262 00:17:47,880 --> 00:17:49,570 that policy [INAUDIBLE]. 263 00:17:49,570 --> 00:17:52,725 So the first question for today is, 264 00:17:52,725 --> 00:17:54,750 how do we come up with a gradient policy, 265 00:17:54,750 --> 00:17:59,336 basically, without any model? 266 00:17:59,336 --> 00:18:00,880 [INAUDIBLE] going to say it. 267 00:18:00,880 --> 00:18:03,376 [INAUDIBLE] know this, but that's the-- 268 00:18:06,172 --> 00:18:08,060 what do you think? 269 00:18:08,060 --> 00:18:10,440 [INAUDIBLE] haven't read all the [INAUDIBLE] algorithms. 270 00:18:10,440 --> 00:18:11,190 What do you think? 271 00:18:11,190 --> 00:18:13,690 What's the-- how could I possibly 272 00:18:13,690 --> 00:18:21,005 come up with a new policy without having a model? 273 00:18:25,835 --> 00:18:28,418 AUDIENCE: [INAUDIBLE] s n plus [INAUDIBLE] sample directly? 274 00:18:28,418 --> 00:18:29,210 RUSS TEDRAKE: Good. 275 00:18:29,210 --> 00:18:29,920 You could sample. 276 00:18:29,920 --> 00:18:32,523 You can start to do some local search 277 00:18:32,523 --> 00:18:33,690 to come up with [INAUDIBLE]. 278 00:18:36,408 --> 00:18:37,950 Turns out-- I mean, I didn't actually 279 00:18:37,950 --> 00:18:40,492 ask the question in a way that anybody would have answered it 280 00:18:40,492 --> 00:18:41,925 in the way I wanted, so. 
281 00:18:41,925 --> 00:18:43,890 So it turns out if we changed the thing 282 00:18:43,890 --> 00:18:49,830 we store just a little bit, then it turns out to contribute 283 00:18:49,830 --> 00:18:53,840 to do model-free greedy policy. 284 00:18:58,930 --> 00:18:59,470 OK. 285 00:18:59,470 --> 00:19:01,053 So the way we do that is a Q function. 286 00:19:05,640 --> 00:19:09,600 We need to find a Q function. 287 00:19:09,600 --> 00:19:11,750 It's a lot like a value function. 288 00:19:11,750 --> 00:19:13,976 But now it's a function of state and action. 289 00:19:30,245 --> 00:19:35,720 And we'll say this is still [INAUDIBLE] this way. 290 00:19:57,270 --> 00:19:58,080 OK. 291 00:19:58,080 --> 00:19:59,630 So what's a Q function? 292 00:19:59,630 --> 00:20:02,610 A Q function is the cost you should expect 293 00:20:02,610 --> 00:20:06,720 to take, to incur, given you're in a current state 294 00:20:06,720 --> 00:20:09,480 and you take a particular action. 295 00:20:09,480 --> 00:20:11,018 So it's a lot like a value function. 296 00:20:11,018 --> 00:20:12,810 But now you're actually learning a function 297 00:20:12,810 --> 00:20:14,250 over both state and actions. 298 00:20:14,250 --> 00:20:20,430 So in any state, Q pi is the cost 299 00:20:20,430 --> 00:20:24,120 I should expect to incur given I take action a for one step 300 00:20:24,120 --> 00:20:28,670 and I follow a policy pi for the rest of the time. 301 00:20:28,670 --> 00:20:31,590 That make sense? 302 00:20:31,590 --> 00:20:36,130 So I could have my acrobot controller or something 303 00:20:36,130 --> 00:20:36,810 like this. 304 00:20:36,810 --> 00:20:40,770 And in a current state, I've got a policy that mostly gets me 305 00:20:40,770 --> 00:20:43,560 up, but I'm learning more than just what that policy would 306 00:20:43,560 --> 00:20:44,352 do from this state. 307 00:20:44,352 --> 00:20:45,810 I'm learning what that policy would 308 00:20:45,810 --> 00:20:47,550 have done if I had for one step executed 309 00:20:47,550 --> 00:20:49,483 any random action on the function, 310 00:20:49,483 --> 00:20:50,400 for any random action. 311 00:20:50,400 --> 00:20:52,770 And then what would I do from the-- 312 00:20:52,770 --> 00:20:55,970 beginning I ran that controller for the rest of it. 313 00:20:55,970 --> 00:20:58,010 Algebraically, it's going to make a lot of sense 314 00:20:58,010 --> 00:20:59,570 why we would store this. 315 00:20:59,570 --> 00:21:01,695 But it's actually interesting to think a little bit 316 00:21:01,695 --> 00:21:05,700 about what that Q function should look like. 317 00:21:05,700 --> 00:21:08,180 And if you have a Q function, you certainly 318 00:21:08,180 --> 00:21:22,215 could also get the value function, 319 00:21:22,215 --> 00:21:26,660 because you can look up for a given pi what action 320 00:21:26,660 --> 00:21:28,690 that policy would have taken. 321 00:21:28,690 --> 00:21:36,515 You can always pull out your current value function from Q. 322 00:21:36,515 --> 00:21:37,265 But you can also-- 323 00:21:42,050 --> 00:21:45,600 [INAUDIBLE] simple relationship here in the [INAUDIBLE].. 324 00:21:54,278 --> 00:21:59,155 And for the optimal [INAUDIBLE],, I should actually 325 00:21:59,155 --> 00:22:00,680 do that search over a. 326 00:22:00,680 --> 00:22:02,875 I almost wrote minus. 327 00:22:02,875 --> 00:22:04,250 That can be your job for the day, 328 00:22:04,250 --> 00:22:06,163 make sure I don't flip any signs. 329 00:22:14,040 --> 00:22:14,540 OK. 
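A sketch of the relationships just described, assuming a small tabular Q stored as a dict over (state, action) pairs; everything here is illustrative.

```python
def J_from_Q(Q, pi, states):
    """J^pi(s) = Q^pi(s, pi(s)): evaluate the policy's own action."""
    return {s: Q[(s, pi(s))] for s in states}

def J_star_from_Q(Q, states, actions):
    """J*(s) = min_a Q*(s, a): the optimal cost-to-go, if Q is Q*."""
    return {s: min(Q[(s, a)] for a in actions) for s in states}

def greedy_from_Q(Q, states, actions):
    """pi_2(s) = argmin_a Q^pi_1(s, a): policy improvement, no model needed."""
    return {s: min(actions, key=lambda a: Q[(s, a)]) for s in states}
```

The last function is the point of the whole construction: the argmin is a lookup over actions, so acting greedily no longer requires the dynamics model.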
330 00:22:14,540 --> 00:22:16,860 We're roboticists in this room. 331 00:22:16,860 --> 00:22:20,690 What does it mean to learn a Q function? 332 00:22:20,690 --> 00:22:22,900 What are the implications of learning a Q function? 333 00:22:26,065 --> 00:22:27,190 Well, I guess I didn't say. 334 00:22:27,190 --> 00:22:31,880 So given the Q function pi [INAUDIBLE] having a Q function 335 00:22:31,880 --> 00:22:37,850 makes action collection easy. 336 00:22:41,259 --> 00:22:53,482 Pi 2 of s is now just a min over a Q pi s and a, where 337 00:22:53,482 --> 00:22:56,280 Q pi was [INAUDIBLE] with pi 1. 338 00:23:04,354 --> 00:23:07,312 AUDIENCE: [INAUDIBLE] 339 00:23:07,312 --> 00:23:09,284 RUSS TEDRAKE: It's argmin [INAUDIBLE].. 340 00:23:15,220 --> 00:23:17,050 But I was willing to-- 341 00:23:17,050 --> 00:23:20,170 the reason to learn a Q function in one case 342 00:23:20,170 --> 00:23:23,350 here is that it tells me about the other actions 343 00:23:23,350 --> 00:23:24,780 I could have taken. 344 00:23:24,780 --> 00:23:28,000 And if I want to now improve my policy, 345 00:23:28,000 --> 00:23:29,830 then I'll just look at my Q function. 346 00:23:29,830 --> 00:23:34,870 At every state I'm in, instead of taking the one at pi a, 347 00:23:34,870 --> 00:23:37,450 I'll go ahead and take the best one. 348 00:23:37,450 --> 00:23:41,270 If pi 1 was optimal, that I would just 349 00:23:41,270 --> 00:23:45,280 get back the same policy. 350 00:23:45,280 --> 00:23:46,840 But if pi 1 wasn't optimal, then I'll 351 00:23:46,840 --> 00:23:54,010 get back something better, given my current estimate of Q. OK. 352 00:23:54,010 --> 00:23:55,770 But what does it mean to run Q? 353 00:23:55,770 --> 00:23:58,020 And this is actually all you need 354 00:23:58,020 --> 00:24:03,884 to learn to do model-free value [INAUDIBLE] optimal policy. 355 00:24:07,580 --> 00:24:10,530 That's actually really big. 356 00:24:10,530 --> 00:24:15,890 So it's a little bit more to learn 357 00:24:15,890 --> 00:24:18,440 than learning a value function. 358 00:24:24,524 --> 00:24:26,460 And you're learning about your [INAUDIBLE].. 359 00:24:32,110 --> 00:24:35,680 If I had to learn J pi, how big is that? 360 00:24:35,680 --> 00:24:41,260 If I'm going to say I've got n dimensional 361 00:24:41,260 --> 00:24:53,104 states and m dimensional u-- 362 00:24:53,104 --> 00:24:54,854 I'll just think about these two new cases, 363 00:24:54,854 --> 00:24:56,062 even though [INAUDIBLE] this. 364 00:24:56,062 --> 00:24:59,460 If I have to learn J pi, how big is that? 365 00:24:59,460 --> 00:25:00,835 What's that function mapping for? 366 00:25:09,745 --> 00:25:12,060 AUDIENCE: [INAUDIBLE] scalar learning. 367 00:25:12,060 --> 00:25:14,460 RUSS TEDRAKE: Learning a scalar function over 368 00:25:14,460 --> 00:25:23,850 the state space to R1, just learning a scalar function. 369 00:25:23,850 --> 00:25:27,400 If I was learning a policy, how big would that be? 370 00:25:27,400 --> 00:25:33,720 If I was learning a stationary policy, it might be that. 371 00:25:37,980 --> 00:25:39,270 So how bad is it to learn Q? 372 00:25:43,720 --> 00:25:44,460 What's Q? 373 00:25:48,870 --> 00:25:52,472 AUDIENCE: [INAUDIBLE] asymptote [INAUDIBLE].. 374 00:25:52,472 --> 00:25:54,930 RUSS TEDRAKE: Let's keep it a deterministic policy for now. 375 00:25:57,852 --> 00:25:59,495 AUDIENCE: [INAUDIBLE] 376 00:25:59,495 --> 00:26:00,287 RUSS TEDRAKE: Yeah. 
377 00:26:06,962 --> 00:26:08,920 Now I've suddenly got to learn something over-- 378 00:26:13,500 --> 00:26:15,663 sorry, [INAUDIBLE] here. 379 00:26:15,663 --> 00:26:16,580 AUDIENCE: [INAUDIBLE]. 380 00:26:16,580 --> 00:26:18,520 Yeah, there. 381 00:26:18,520 --> 00:26:19,450 RUSS TEDRAKE: OK. 382 00:26:19,450 --> 00:26:25,620 And for [INAUDIBLE],, what would it be 383 00:26:25,620 --> 00:26:27,197 used to learn a modeled system? 384 00:26:30,680 --> 00:26:32,730 If I wanted to use this idea. 385 00:26:32,730 --> 00:26:33,656 What's that model? 386 00:26:37,544 --> 00:26:39,010 [INTERPOSING VOICES] 387 00:26:39,010 --> 00:26:40,360 RUSS TEDRAKE: Yeah. 388 00:26:40,360 --> 00:26:45,580 So f and then n plus m to Rn. 389 00:26:48,310 --> 00:26:51,110 So let's just think about how much you have to learn. 390 00:26:51,110 --> 00:26:53,170 So the easiness of learning this is not 391 00:26:53,170 --> 00:26:55,060 only related to the size. 392 00:26:55,060 --> 00:26:58,270 But it does matter. 393 00:26:58,270 --> 00:27:02,290 So most of the time, as control guys, as robotics guys 394 00:27:02,290 --> 00:27:07,390 we would probably try to learn a model first, and then 395 00:27:07,390 --> 00:27:08,722 do model-based control. 396 00:27:08,722 --> 00:27:10,180 The last few days I've been saying, 397 00:27:10,180 --> 00:27:12,580 let's try to do some things without learning a model. 398 00:27:12,580 --> 00:27:17,140 Here's one interesting reason why. 399 00:27:17,140 --> 00:27:20,015 It's actually-- learning a model is sort of a tall order. 400 00:27:20,015 --> 00:27:21,140 It's a lot to learn, right? 401 00:27:21,140 --> 00:27:23,500 You've got to learn from every possible state and action 402 00:27:23,500 --> 00:27:28,780 what's my x dot [INAUDIBLE]. 403 00:27:28,780 --> 00:27:32,770 This is only learning from every possible state and action. 404 00:27:32,770 --> 00:27:36,250 What's the expected cost-to-go [INAUDIBLE]?? 405 00:27:36,250 --> 00:27:39,060 [INAUDIBLE] a scalar. 406 00:27:39,060 --> 00:27:41,510 So this is learning one algorithm for all m. 407 00:27:41,510 --> 00:27:45,050 And the beautiful thing about optimal control, 408 00:27:45,050 --> 00:27:46,980 with this sort of additive cost functions 409 00:27:46,980 --> 00:27:50,180 and everything like that, the beautiful thing 410 00:27:50,180 --> 00:27:55,190 is that this is all you need to know to make optimal decisions. 411 00:27:55,190 --> 00:27:56,840 You don't need to know your model. 412 00:27:56,840 --> 00:28:00,140 That model is extra information. 413 00:28:00,140 --> 00:28:03,950 All you need to know to make optimal decisions, given 414 00:28:03,950 --> 00:28:06,570 these additive cost functions [INAUDIBLE] 415 00:28:06,570 --> 00:28:09,530 is given [INAUDIBLE] a state and then a given action, 416 00:28:09,530 --> 00:28:13,589 how much do I expect to incur cost [INAUDIBLE]?? 417 00:28:13,589 --> 00:28:16,430 It's a beautiful thing. 418 00:28:16,430 --> 00:28:19,430 So if we make it stochastic, it gets even sort of-- 419 00:28:19,430 --> 00:28:21,847 learning a stochastic model, if your dynamics are variable 420 00:28:21,847 --> 00:28:23,347 and that's important, you want to do 421 00:28:23,347 --> 00:28:24,710 stochastic optimal control. 422 00:28:24,710 --> 00:28:28,730 Learning a stochastic model is probably even harder than that. 
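To summarize the size comparison being drawn on the board, with n-dimensional state and m-dimensional input, the objects under discussion are maps of the following shapes (a shorthand restatement, not new material):

```latex
J^{\pi} : \mathbb{R}^{n} \to \mathbb{R}, \qquad
\pi : \mathbb{R}^{n} \to \mathbb{R}^{m}, \qquad
Q^{\pi} : \mathbb{R}^{n+m} \to \mathbb{R}, \qquad
f : \mathbb{R}^{n+m} \to \mathbb{R}^{n}
```

So the Q function pays for the extra action dimensions in its domain, but its range stays scalar, whereas a model has to predict an n-dimensional output from every state-action pair.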
423 00:28:28,730 --> 00:28:32,120 Maybe I have to learn the mean of x 424 00:28:32,120 --> 00:28:34,820 dot plus the covariance matrix of x dot or something 425 00:28:34,820 --> 00:28:36,325 like this. 426 00:28:36,325 --> 00:28:37,700 When I use a stochastic model, it 427 00:28:37,700 --> 00:28:40,910 would be even more expensive. 428 00:28:40,910 --> 00:28:44,005 Q, in the Q sense-- 429 00:28:44,005 --> 00:28:46,650 I left it off in the first pass just to keep it clean, 430 00:28:46,650 --> 00:28:50,930 but Q is just going to be the expected value around this. 431 00:28:58,440 --> 00:29:02,320 So Q is always going to be a scalar, even 432 00:29:02,320 --> 00:29:04,020 in the stochastic optimal control sense. 433 00:29:07,190 --> 00:29:09,530 So maybe this is the biggest point of optimal control, 434 00:29:09,530 --> 00:29:11,655 honestly-- 435 00:29:11,655 --> 00:29:13,580 optimal control related to learning-- 436 00:29:13,580 --> 00:29:19,880 is that if you're willing to do these additive expected 437 00:29:19,880 --> 00:29:22,640 value optimization problems, which 438 00:29:22,640 --> 00:29:24,980 I think you've seen lots of interesting problems 439 00:29:24,980 --> 00:29:27,530 that fall into that category, then all you 440 00:29:27,530 --> 00:29:31,370 need to know to make decisions is to be able to-- the value 441 00:29:31,370 --> 00:29:35,490 function, the Q function here. 442 00:29:35,490 --> 00:29:38,220 The expected value of future penalties. 443 00:29:38,220 --> 00:29:39,876 And for everything else, [INAUDIBLE].. 444 00:29:42,443 --> 00:29:43,110 Important point. 445 00:29:46,309 --> 00:29:50,497 Now, just to soften it a little bit, in practice, 446 00:29:50,497 --> 00:29:52,080 you might not get away with only that. 447 00:29:52,080 --> 00:29:54,705 If you have to somehow build an observer to do state estimation 448 00:29:54,705 --> 00:29:56,860 or to estimate Q, and you've got-- 449 00:29:56,860 --> 00:29:59,010 there might be other reasons floating around 450 00:29:59,010 --> 00:30:02,970 in your robot that might require you to learn this. 451 00:30:02,970 --> 00:30:09,478 But in a pure sense, that's really what you need to know. 452 00:30:09,478 --> 00:30:10,470 AUDIENCE: Hey, Russ? 453 00:30:10,470 --> 00:30:11,587 RUSS TEDRAKE: Yeah. 454 00:30:11,587 --> 00:30:12,670 AUDIENCE: You put x and u. 455 00:30:12,670 --> 00:30:14,170 Shouldn't that be s and-- 456 00:30:14,170 --> 00:30:15,710 RUSS TEDRAKE: Right. 457 00:30:15,710 --> 00:30:16,840 We could have-- 458 00:30:16,840 --> 00:30:20,440 I could have said n is the number of states. 459 00:30:23,540 --> 00:30:26,057 AUDIENCE: I just meant, should it be s and a? 460 00:30:26,057 --> 00:30:26,890 RUSS TEDRAKE: Right. 461 00:30:26,890 --> 00:30:27,800 I would have-- 462 00:30:27,800 --> 00:30:31,120 I wrote the dimension of x, and I called it Rn,m. 463 00:30:31,120 --> 00:30:32,472 So that's what I meant. 464 00:30:32,472 --> 00:30:34,180 If you want to make an analogy back here, 465 00:30:34,180 --> 00:30:36,220 then it would actually be just the number 466 00:30:36,220 --> 00:30:37,680 of elements in s and a. 467 00:30:37,680 --> 00:30:39,805 But I wanted to sort of be a roboticist [INAUDIBLE] 468 00:30:39,805 --> 00:30:40,513 for a little bit. 469 00:30:40,513 --> 00:30:41,500 AUDIENCE: OK. 470 00:30:41,500 --> 00:30:43,560 RUSS TEDRAKE: This is just the computer scientist 471 00:30:43,560 --> 00:30:46,010 that did this [INAUDIBLE]. 472 00:30:46,010 --> 00:30:48,513 But it does make this easier, so I still [INAUDIBLE].. 
473 00:30:48,513 --> 00:30:49,430 So I meant to do that. 474 00:30:56,340 --> 00:30:58,850 So that [INAUDIBLE]. 475 00:30:58,850 --> 00:30:59,350 OK. 476 00:30:59,350 --> 00:31:01,000 So now, how do we learn Q? 477 00:31:01,000 --> 00:31:05,510 I told you how to learn J. Q looks pretty close to J. 478 00:31:05,510 --> 00:31:06,430 How do I learn Q? 479 00:31:06,430 --> 00:31:08,080 I told you about temporal different learning, 480 00:31:08,080 --> 00:31:09,460 probably wouldn't have wasted your time 481 00:31:09,460 --> 00:31:10,480 talking about temporal difference 482 00:31:10,480 --> 00:31:11,938 learning if it wasn't also relevant 483 00:31:11,938 --> 00:31:18,050 for what we needed to do these model-free value methods. 484 00:31:18,050 --> 00:31:20,822 So let's just see that temporal difference learning also 485 00:31:20,822 --> 00:31:22,280 works for learning these functions. 486 00:31:22,280 --> 00:31:22,780 OK? 487 00:31:22,780 --> 00:31:26,760 That's just some [INAUDIBLE]. 488 00:31:26,760 --> 00:31:29,520 Let's do just a simple case first, 489 00:31:29,520 --> 00:31:34,860 where I'm just doing-- remember, TD0 was just bootstrapping. 490 00:31:34,860 --> 00:31:37,890 It wasn't carrying around long-term rewards. 491 00:31:37,890 --> 00:31:39,727 It was just saying [INAUDIBLE] one step, 492 00:31:39,727 --> 00:31:42,060 and then I'm going to use my value estimate for the rest 493 00:31:42,060 --> 00:31:43,792 of the time as my new update. 494 00:31:46,820 --> 00:31:48,395 And I'll go ahead, since we're-- 495 00:31:48,395 --> 00:31:50,800 we talked about last time how a function approximator 496 00:31:50,800 --> 00:31:53,440 [INAUDIBLE] reduce it to the Markov chain case, 497 00:31:53,440 --> 00:31:54,722 let's just do it like-- 498 00:31:58,124 --> 00:32:07,520 let's say [INAUDIBLE] of s is alpha i phi i, linear function 499 00:32:07,520 --> 00:32:09,665 approximators. 500 00:32:09,665 --> 00:32:16,780 Or we could, in fact, reduce alpha t phi s, a. 501 00:32:19,360 --> 00:32:19,860 OK. 502 00:32:19,860 --> 00:32:25,050 Then the TD lambda update it going to be-- 503 00:32:25,050 --> 00:32:27,990 TD0 update is just going to be alpha 504 00:32:27,990 --> 00:32:43,855 plus gamma call that hat just to be careful here-- 505 00:32:43,855 --> 00:32:50,290 pi s transpose. 506 00:33:18,220 --> 00:33:20,820 These really are supposed to be s n and a n. 507 00:33:20,820 --> 00:33:26,332 I get a lot of [INAUDIBLE] for my sloppy [INAUDIBLE].. 508 00:33:26,332 --> 00:33:29,110 OK. 509 00:33:29,110 --> 00:33:33,530 And Q pi-- or the gradient here in the linear function 510 00:33:33,530 --> 00:33:36,482 approximator case, is just phi s, a. 511 00:33:41,960 --> 00:33:43,730 So if you look back in your notes, that's 512 00:33:43,730 --> 00:33:49,700 exactly what we had before, where this used to be J. 513 00:33:49,700 --> 00:33:52,190 We're going to use is our new update-- 514 00:33:52,190 --> 00:33:55,730 we're going to say that our new estimate for J 515 00:33:55,730 --> 00:34:00,440 is basically the one-step cost plus the long-term look-ahead. 516 00:34:00,440 --> 00:34:05,470 a n plus 1 in the case of doing an on-policy-- 517 00:34:05,470 --> 00:34:07,490 if I'm just trying to do policy evaluation, 518 00:34:07,490 --> 00:34:13,558 it's going to be pi s n plus 1. 519 00:34:13,558 --> 00:34:16,969 We'll use that one-step prediction 520 00:34:16,969 --> 00:34:20,298 minus my current prediction and try to make that go to 0. 
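Reconstructing the update being written on the board (the transcript only partially captures the notation, so treat this as a sketch): with a linear approximator Q_hat(s,a) = alpha^T phi(s,a), the on-policy TD(0) update moves alpha along the temporal-difference error times the gradient, which for the linear case is just phi(s,a). The feature map, learning-rate name, and data interface are illustrative.

```python
import numpy as np

def td0_q_evaluation_step(alpha, phi, s, a, cost, s_next, pi,
                          gamma=0.99, eta=0.01):
    """One on-policy TD(0) update for Q_hat(s, a) = alpha^T phi(s, a).

    Target:  cost + gamma * Q_hat(s_next, pi(s_next))    (one-step bootstrap)
    Update:  alpha <- alpha + eta * td_error * phi(s, a)
    Note the bootstrap target is treated as a constant ("semi-gradient"):
    its own dependence on alpha is ignored, so this is not true gradient
    descent on the squared TD error -- a point discussed just below.
    """
    q_sa = alpha @ phi(s, a)
    q_next = alpha @ phi(s_next, pi(s_next))
    td_error = cost + gamma * q_next - q_sa
    return alpha + eta * td_error * phi(s, a)
```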
521 00:34:20,298 --> 00:34:22,590 And in order to do it in a function approximator sense, 522 00:34:22,590 --> 00:34:25,290 that means multiplying that error, the temporal difference 523 00:34:25,290 --> 00:34:26,840 error, by the gradient. 524 00:34:26,840 --> 00:34:30,246 And that was something like gradient descent 525 00:34:30,246 --> 00:34:32,150 on your temporal difference policy. 526 00:34:32,150 --> 00:34:37,889 But not exactly, because it's this whole recursive dependence 527 00:34:37,889 --> 00:34:38,389 thing. 528 00:34:42,420 --> 00:34:45,150 People get why-- do people get that it's not 529 00:34:45,150 --> 00:34:50,449 quite gradient descent but kind of this? 530 00:34:50,449 --> 00:34:52,070 This looks a lot like what I would 531 00:34:52,070 --> 00:34:58,638 get if I was trying to do gradient descent [INAUDIBLE],, 532 00:34:58,638 --> 00:34:59,138 right? 533 00:35:02,120 --> 00:35:05,190 But only in the case of TD1 was actually gradient descent. 534 00:35:05,190 --> 00:35:10,320 But normally if I have a y minus f of x, 535 00:35:10,320 --> 00:35:15,317 I'm trying to do the gradient with respect to this, 536 00:35:15,317 --> 00:35:16,400 I've got to minimize this. 537 00:35:19,952 --> 00:35:22,452 And I'll get something-- if I take the gradient with respect 538 00:35:22,452 --> 00:35:30,405 to alpha, I get the error alpha x [INAUDIBLE].. 539 00:35:33,279 --> 00:35:35,313 AUDIENCE: [INAUDIBLE] 540 00:35:35,313 --> 00:35:37,730 RUSS TEDRAKE: Because what we got here, this is our error. 541 00:35:37,730 --> 00:35:42,500 If we assume that this is just my desired 542 00:35:42,500 --> 00:35:45,380 and this is my actual, then this is gradient descent. 543 00:35:45,380 --> 00:35:48,640 But it's not quite that, because this depends on an alpha-- 544 00:35:48,640 --> 00:35:50,150 these all depend on alpha. 545 00:35:50,150 --> 00:35:52,070 So by virtue of having this one in the alpha, 546 00:35:52,070 --> 00:35:56,080 it's not exactly gradient descent algorithm. 547 00:35:56,080 --> 00:35:57,460 But it still works. 548 00:35:57,460 --> 00:35:59,258 People proved that it works. 549 00:35:59,258 --> 00:36:00,530 Is that OK? 550 00:36:00,530 --> 00:36:03,920 And actually, in the case where TD is one, 551 00:36:03,920 --> 00:36:08,108 these things actually go through it 552 00:36:08,108 --> 00:36:10,650 and cancel each other out with whatever is a gradient descent 553 00:36:10,650 --> 00:36:11,715 algorithm [INAUDIBLE]. 554 00:36:14,570 --> 00:36:17,180 But I want you to see, this is my error 555 00:36:17,180 --> 00:36:20,570 I'm trying to make Q and my current state and action 556 00:36:20,570 --> 00:36:24,823 look like my one-step cost plus Q of my next state and action. 557 00:36:24,823 --> 00:36:26,615 And I would do that by multiplying my error 558 00:36:26,615 --> 00:36:29,680 by my gradient in a gradient descent kind of idea. 559 00:36:35,890 --> 00:36:36,390 OK. 560 00:36:36,390 --> 00:36:43,615 You can still do TD lambda if you like also. 561 00:36:46,370 --> 00:37:00,650 Q functions And the big idea there 562 00:37:00,650 --> 00:37:03,350 was to use an eligibility trace, which 563 00:37:03,350 --> 00:37:10,080 in the function approximator case, 564 00:37:10,080 --> 00:37:18,610 was gamma lambda ei n plus [INAUDIBLE].. 565 00:37:34,480 --> 00:37:38,574 And then my update is the same thing-- 566 00:37:38,574 --> 00:37:41,470 alpha-- because this is my big temporal difference error. 
567 00:37:41,470 --> 00:37:43,740 And instead of multiplying by the gradient [INAUDIBLE] 568 00:37:43,740 --> 00:37:48,110 this eligibility trace. 569 00:37:48,110 --> 00:37:51,670 And magically through an algebraic trick, 570 00:37:51,670 --> 00:37:58,380 remembering the gradient computes the bootstrapping case 571 00:37:58,380 --> 00:38:02,850 when lambda is 0, and the Monte Carlo case when lambda is 1, 572 00:38:02,850 --> 00:38:05,258 and something in between when lambda is [INAUDIBLE].. 573 00:38:12,700 --> 00:38:14,545 OK. 574 00:38:14,545 --> 00:38:17,560 So you'd still do temporal difference there. 575 00:38:17,560 --> 00:38:18,670 Big point number two-- 576 00:38:23,230 --> 00:38:28,360 big idea number one is we have to use Q functions to do 577 00:38:28,360 --> 00:38:29,420 action selection. 578 00:38:32,020 --> 00:38:43,600 Big point number two is off-policy policy evaluation. 579 00:38:43,600 --> 00:38:47,860 Once we start using Q, you could do this trick 580 00:38:47,860 --> 00:38:53,140 that I mentioned first thing we're doing value methods. 581 00:38:53,140 --> 00:39:12,598 And that is to execute policy pi 1 but learn Q pi 2 [INAUDIBLE].. 582 00:39:19,542 --> 00:39:20,970 Can you see how we do that? 583 00:39:40,550 --> 00:39:42,800 By virtue of having this extra dimension, 584 00:39:42,800 --> 00:39:45,150 we know we're learning-- bless you-- 585 00:39:45,150 --> 00:39:50,330 not only what happens when I take policy pi from state s. 586 00:39:50,330 --> 00:39:54,990 I'm learning what happens when I take any action in state s. 587 00:39:54,990 --> 00:39:59,830 That gives me a lot more power. 588 00:39:59,830 --> 00:40:02,330 Because for instance, when I'm making my temporal difference 589 00:40:02,330 --> 00:40:06,950 error, I don't need to necessarily use 590 00:40:06,950 --> 00:40:13,180 my one-step prediction as the current policy. 591 00:40:13,180 --> 00:40:17,420 I can just look up what would policy 2 [INAUDIBLE].. 592 00:40:22,700 --> 00:40:27,950 Because I'm storing every state-action pair, 593 00:40:27,950 --> 00:40:30,110 it's more to learn, more work. 594 00:40:30,110 --> 00:40:35,090 But it means I can say, I'd like my new Q pi 2 595 00:40:35,090 --> 00:40:38,090 to be the one-step policy I got from taking a plus 596 00:40:38,090 --> 00:40:41,838 the long-term cost of taking policy pi 2. 597 00:40:46,020 --> 00:40:49,710 And then all the same equations play out, and you get-- 598 00:40:53,920 --> 00:40:55,360 you get an estimate for policy 2. 599 00:40:58,568 --> 00:40:59,610 AUDIENCE: Does it count-- 600 00:41:02,930 --> 00:41:05,260 RUSS TEDRAKE: Yeah? 601 00:41:05,260 --> 00:41:07,260 AUDIENCE: Does it count more than the first step 602 00:41:07,260 --> 00:41:11,294 of the policy 2 and then take your cost-to-go of policy 1? 603 00:41:11,294 --> 00:41:12,900 Or does it somehow-- 604 00:41:16,070 --> 00:41:19,157 RUSS TEDRAKE: So I can't switch-- 605 00:41:19,157 --> 00:41:21,490 we'll talk about whether you can switch halfway through. 606 00:41:21,490 --> 00:41:26,660 But once I commit to learning Q pi 2, 607 00:41:26,660 --> 00:41:29,570 then actually this whole thing is built up 608 00:41:29,570 --> 00:41:32,540 of experience of executing policy 2, 609 00:41:32,540 --> 00:41:37,710 even though I've only generated sample paths for policy 1. 610 00:41:37,710 --> 00:41:42,548 So it's a completely consistent estimator of Q pi 2, right? 
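Backing up to the TD(lambda) form of the update described just before the off-policy point: the instantaneous gradient phi(s,a) is replaced by an eligibility trace that decays at rate gamma*lambda. A sketch under the same linear-approximator assumptions as above; names are illustrative.

```python
import numpy as np

def td_lambda_q_evaluation(alpha, phi, trajectory, pi,
                           gamma=0.99, lam=0.7, eta=0.01):
    """On-policy TD(lambda) for Q_hat(s, a) = alpha^T phi(s, a), using an
    eligibility trace e in place of the instantaneous gradient phi(s, a).

    trajectory: list of (s, a, cost, s_next) tuples generated by pi.
    lam = 0 recovers the one-step bootstrapping update (TD(0));
    lam = 1 behaves like the Monte-Carlo / least-squares estimate.
    """
    e = np.zeros_like(alpha)                       # eligibility trace
    for s, a, cost, s_next in trajectory:
        e = gamma * lam * e + phi(s, a)            # decay old credit, add new
        td_error = (cost + gamma * alpha @ phi(s_next, pi(s_next))
                    - alpha @ phi(s, a))
        alpha = alpha + eta * td_error * e         # spread the error back along the trace
    return alpha
```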
611 00:41:42,548 --> 00:41:44,090 If I halfway through decided I wanted 612 00:41:44,090 --> 00:41:45,650 to start evaluating pi 3, then I'm 613 00:41:45,650 --> 00:41:47,630 going to have to wait for those cancel out, 614 00:41:47,630 --> 00:41:49,890 or we play some tricks to do that. 615 00:41:49,890 --> 00:41:52,760 But it actually recursively builds up 616 00:41:52,760 --> 00:41:55,370 in the estimator of pi 2. 617 00:41:55,370 --> 00:41:56,760 AUDIENCE: Can I ask a question? 618 00:41:56,760 --> 00:41:58,500 RUSS TEDRAKE: Of course. 619 00:41:58,500 --> 00:42:03,510 AUDIENCE: [INAUDIBLE] have that [INAUDIBLE] function like this, 620 00:42:03,510 --> 00:42:04,385 we can substitute y-- 621 00:42:04,385 --> 00:42:06,052 RUSS TEDRAKE: You're talking about this? 622 00:42:06,052 --> 00:42:06,910 AUDIENCE: Yes. 623 00:42:06,910 --> 00:42:09,571 You can substitute y by [? g or ?] gamma, 624 00:42:09,571 --> 00:42:12,380 and then execute the-- 625 00:42:12,380 --> 00:42:13,335 RUSS TEDRAKE: Yeah. 626 00:42:13,335 --> 00:42:14,960 AUDIENCE: And then take the derivative? 627 00:42:14,960 --> 00:42:16,214 RUSS TEDRAKE: Yes. 628 00:42:16,214 --> 00:42:17,539 AUDIENCE: Why [INAUDIBLE]? 629 00:42:17,539 --> 00:42:19,706 RUSS TEDRAKE: So why isn't it true gradient descent? 630 00:42:23,040 --> 00:42:25,020 That's exactly what I proposed to do. 631 00:42:25,020 --> 00:42:27,660 But the only problem is, this isn't what we have. 632 00:42:27,660 --> 00:42:32,410 What we actually have is this, which 633 00:42:32,410 --> 00:42:35,230 means that this is not the true gradient [INAUDIBLE] term 634 00:42:35,230 --> 00:42:36,840 for partial y partial from over here. 635 00:42:36,840 --> 00:42:38,382 AUDIENCE: That's what I'm suggesting. 636 00:42:38,382 --> 00:42:40,690 So instead of y alpha, we can actually [INAUDIBLE] 637 00:42:40,690 --> 00:42:43,510 g plus gamma-- an approximation of y. 638 00:42:43,510 --> 00:42:46,720 So this g plus gamma Q is [INAUDIBLE] 639 00:42:46,720 --> 00:42:48,400 approximation for y, right? 640 00:42:48,400 --> 00:42:50,150 RUSS TEDRAKE: I'm trying to perfectly make 641 00:42:50,150 --> 00:42:57,719 the analogy that this looks like that, and this looks like that. 642 00:42:57,719 --> 00:42:58,386 AUDIENCE: Right. 643 00:42:58,386 --> 00:43:00,719 But when we're taking the derivative from that function, 644 00:43:00,719 --> 00:43:02,050 we assume that y is constant. 645 00:43:02,050 --> 00:43:02,590 RUSS TEDRAKE: Yes. 646 00:43:02,590 --> 00:43:03,610 AUDIENCE: And then solve this. 647 00:43:03,610 --> 00:43:04,060 RUSS TEDRAKE: Yes. 648 00:43:04,060 --> 00:43:06,830 AUDIENCE: We can actually assume that y is dependent on alpha 649 00:43:06,830 --> 00:43:10,060 and the derivative of that term with respect to alpha as well, 650 00:43:10,060 --> 00:43:10,810 and then solve it. 651 00:43:10,810 --> 00:43:11,560 RUSS TEDRAKE: Yes. 652 00:43:11,560 --> 00:43:12,625 So you could do that. 653 00:43:12,625 --> 00:43:14,440 So you're saying why don't we actually have 654 00:43:14,440 --> 00:43:17,080 a different update which has the gradient [INAUDIBLE]?? 655 00:43:17,080 --> 00:43:17,630 OK, good. 656 00:43:17,630 --> 00:43:19,630 So in the case of TD0-- 657 00:43:19,630 --> 00:43:23,090 TD1, you actually do have that. 658 00:43:23,090 --> 00:43:24,820 And I think that's true. 659 00:43:24,820 --> 00:43:26,530 I worked this out a number of years ago. 
660 00:43:26,530 --> 00:43:30,280 But I think it's true that if you start including that, 661 00:43:30,280 --> 00:43:36,100 if you look at the sum over a chain, for this standard update 662 00:43:36,100 --> 00:43:40,210 with TD0, for instance, that those terms, 663 00:43:40,210 --> 00:43:43,302 this term now will actually cancel itself out on this term 664 00:43:43,302 --> 00:43:45,110 here, for instance. 665 00:43:45,110 --> 00:43:45,960 It doesn't work. 666 00:43:45,960 --> 00:43:46,970 It doesn't work nicely. 667 00:43:46,970 --> 00:43:49,663 It would give you-- it gives you back the Monte Carlo error. 668 00:43:49,663 --> 00:43:51,080 It doesn't do temporal difference. 669 00:43:51,080 --> 00:43:52,980 It doesn't do the bootstrapping. 670 00:43:52,980 --> 00:43:55,550 So basically, you start including that, 671 00:43:55,550 --> 00:43:58,700 then you do get a least squares algorithm, of course. 672 00:43:58,700 --> 00:44:04,220 But it's effectively doing Monte Carlo. 673 00:44:04,220 --> 00:44:06,964 You have to sort of ignore that do to temporal difference 674 00:44:06,964 --> 00:44:09,160 learning. 675 00:44:09,160 --> 00:44:13,021 You're actually saying, I'm going to believe this estimate 676 00:44:13,021 --> 00:44:14,422 in order to do that. 677 00:44:14,422 --> 00:44:17,050 OK? 678 00:44:17,050 --> 00:44:18,550 Temporal difference, if you actually 679 00:44:18,550 --> 00:44:20,850 want to prove any of these things, 680 00:44:20,850 --> 00:44:22,400 I have one example of it in the note. 681 00:44:22,400 --> 00:44:26,540 I think that I put in TD1 is gradient descent in the notes, 682 00:44:26,540 --> 00:44:28,190 just so you see an example. 683 00:44:28,190 --> 00:44:31,130 A story-- rule of the game in temporal difference learning, 684 00:44:31,130 --> 00:44:35,420 derivations, and proofs, is you start expanding these sums, 685 00:44:35,420 --> 00:44:39,080 and terms from time n and terms from time 686 00:44:39,080 --> 00:44:43,370 n plus 1 cancel each other out in a gradient way. 687 00:44:43,370 --> 00:44:46,102 And you're left with something much more compact. 688 00:44:46,102 --> 00:44:48,560 That's why [? everybody ?] calls it an algebraic trick, why 689 00:44:48,560 --> 00:44:50,210 these things work. 690 00:44:50,210 --> 00:44:56,835 But because these are not random samples drawn one at a time, 691 00:44:56,835 --> 00:44:58,210 they're actually directly related 692 00:44:58,210 --> 00:45:01,857 to each other, that's why it makes it more complicated. 693 00:45:07,700 --> 00:45:08,200 OK. 694 00:45:08,200 --> 00:45:11,140 So we said off-policy evaluation says, 695 00:45:11,140 --> 00:45:20,230 execute policy pi 1 to get pi 1 generates s n a n trajectories. 696 00:45:20,230 --> 00:45:26,508 But you're going to do the update alpha plus-- 697 00:45:26,508 --> 00:45:33,810 I'm going to just write quickly here-- g s, a plus gamma Q pi-- 698 00:45:33,810 --> 00:45:36,790 this is going to be estimator Q pi 2-- 699 00:45:36,790 --> 00:45:41,620 s n plus 1 pi 2. 700 00:45:41,620 --> 00:45:45,130 What would pi 2 have done in kind of s n plus 1? 701 00:45:57,505 --> 00:46:00,096 In general, we'll multiply it by [INAUDIBLE].. 702 00:46:04,127 --> 00:46:05,585 That's a really, really nice trick. 703 00:46:05,585 --> 00:46:08,240 Let's learn about policy 1-- or policy 2 704 00:46:08,240 --> 00:46:11,320 while we execute policy 1. 705 00:46:11,320 --> 00:46:11,820 OK. 706 00:46:11,820 --> 00:46:13,445 So what policy 2 should we learn about? 
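Before answering that, here is the off-policy update just described as a sketch: the samples (s_n, a_n, g, s_{n+1}) come from executing pi_1, but the bootstrap uses the action pi_2 would take at s_{n+1}, so the quantity being estimated is Q^pi_2. Names are illustrative.

```python
def off_policy_td0_step(alpha, phi, s, a, cost, s_next, pi_2,
                        gamma=0.99, eta=0.01):
    """Execute pi_1 (which produced this (s, a, cost, s_next) sample),
    but learn Q_hat^{pi_2}: the bootstrap uses pi_2's action at s_next."""
    td_error = cost + gamma * alpha @ phi(s_next, pi_2(s_next)) - alpha @ phi(s, a)
    return alpha + eta * td_error * phi(s, a)
```

As noted later in the lecture, the clean guarantees for this off-policy form are for the tabular MDP case; with function approximators you need importance-sampling corrections.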
707 00:46:19,515 --> 00:46:22,220 And again, these are-- 708 00:46:22,220 --> 00:46:23,970 I'm asking the questions in bizarre ways, 709 00:46:23,970 --> 00:46:25,220 and there's a specific answer. 710 00:46:25,220 --> 00:46:29,540 But [INAUDIBLE] ask that question. 711 00:46:29,540 --> 00:46:31,430 Our ultimate goal is not to learn 712 00:46:31,430 --> 00:46:33,033 about some arbitrary pi 2. 713 00:46:33,033 --> 00:46:34,741 I want to learn about the optimal policy. 714 00:46:38,180 --> 00:46:40,700 I don't have the optimal policy. 715 00:46:40,700 --> 00:46:43,430 But I have an estimate of it. 716 00:46:43,430 --> 00:46:45,680 So actually, a perfectly reasonable update 717 00:46:45,680 --> 00:46:53,870 to do, and the way you might describe it is, let's execute-- 718 00:46:53,870 --> 00:46:56,367 I'm putting it in quotes, because it's not 719 00:46:56,367 --> 00:46:58,200 entirely accurate, but it's the right idea-- 720 00:47:00,890 --> 00:47:11,006 execute policy 1 but learn about the optimal policy. 721 00:47:16,370 --> 00:47:17,550 And how would we do that? 722 00:47:17,550 --> 00:47:32,756 Well-- this is now my shorthand Q star here. 723 00:47:32,756 --> 00:47:38,565 Estimate of Q star is s n plus 1-- 724 00:47:38,565 --> 00:47:40,062 I should have-- 725 00:48:18,570 --> 00:48:19,700 It makes total sense. 726 00:48:19,700 --> 00:48:23,250 Might as well, as I'm learning, always 727 00:48:23,250 --> 00:48:25,990 try to learn about policy which is optimal with respect 728 00:48:25,990 --> 00:48:29,820 to my current estimate J [? hat. ?] 729 00:48:29,820 --> 00:48:31,962 And this algorithm is called Q-learning. 730 00:48:37,510 --> 00:48:39,190 OK. 731 00:48:39,190 --> 00:48:46,750 It's the crown jewel of the value-based methods 732 00:48:46,750 --> 00:48:47,730 [INAUDIBLE]. 733 00:48:47,730 --> 00:48:50,580 I would say it was the gold standard until probably about 734 00:48:50,580 --> 00:48:52,860 [? '90-- ?] something like that. 735 00:48:52,860 --> 00:48:57,095 When people started to do policy gradient stuff more often. 736 00:48:57,095 --> 00:48:58,720 Even probably halfway through the '90s, 737 00:48:58,720 --> 00:49:00,730 people were still mostly [INAUDIBLE] papers 738 00:49:00,730 --> 00:49:04,300 about Q-learning. 739 00:49:04,300 --> 00:49:07,446 and there was a movement in policy gradient. 740 00:49:07,446 --> 00:49:09,340 AUDIENCE: So is your current estimate 741 00:49:09,340 --> 00:49:12,730 Q star not based on pi 1? 742 00:49:16,070 --> 00:49:18,400 RUSS TEDRAKE: It is based on data from pi 1. 743 00:49:18,400 --> 00:49:23,860 But if I always make my update, making it this update, 744 00:49:23,860 --> 00:49:27,220 then it really is learning about pi 2. 745 00:49:31,060 --> 00:49:35,068 AUDIENCE: Isn't pi 2 what you're computing with this update? 746 00:49:40,708 --> 00:49:41,500 RUSS TEDRAKE: Good. 747 00:49:41,500 --> 00:49:44,440 There's a couple of ways that I can do this. 748 00:49:44,440 --> 00:49:45,460 So in the policy-- 749 00:49:45,460 --> 00:49:48,635 in the simple policy iteration, we 750 00:49:48,635 --> 00:49:51,160 use [INAUDIBLE] evaluate for a long time, 751 00:49:51,160 --> 00:49:54,642 and then you make an, update, you evaluate for a long time, 752 00:49:54,642 --> 00:49:55,600 and you make an update. 753 00:49:55,600 --> 00:49:56,440 AUDIENCE: This is dynamically-- 754 00:49:56,440 --> 00:49:58,930 RUSS TEDRAKE: This is always sort of updating, right? 
755 00:49:58,930 --> 00:50:00,220 [INAUDIBLE] 756 00:50:00,220 --> 00:50:03,520 And you can prove that it's still a sound algorithm 757 00:50:03,520 --> 00:50:04,750 despite [INAUDIBLE]. 758 00:50:04,750 --> 00:50:10,922 This is always sort of updating its policy as it goes. 759 00:50:10,922 --> 00:50:13,930 Compared to this, which is more of the-- 760 00:50:13,930 --> 00:50:16,780 learn about pi 2 for a while, stop, [INAUDIBLE] 761 00:50:16,780 --> 00:50:18,380 pi 3 for a while, stop, this is trying 762 00:50:18,380 --> 00:50:19,550 to go straight through pi. 763 00:50:27,440 --> 00:50:27,990 OK, good. 764 00:50:27,990 --> 00:50:31,250 So what is it-- 765 00:50:31,250 --> 00:50:36,270 what's required for a Q-learning algorithm to converge? 766 00:50:36,270 --> 00:50:40,220 So even for this algorithm to converge, in order for pi 1 767 00:50:40,220 --> 00:50:43,660 to really teach me everything there is to know about pi 2, 768 00:50:43,660 --> 00:50:50,320 there's some important feature, which is that pi 1 and pi 2 769 00:50:50,320 --> 00:50:53,982 had better pick the same actions with some old probability. 770 00:50:56,690 --> 00:50:58,310 So off-policy works. 771 00:51:01,852 --> 00:51:03,060 Let's just even think about-- 772 00:51:03,060 --> 00:51:07,760 I'll even [INAUDIBLE] first in the discrete state and discrete 773 00:51:07,760 --> 00:51:08,760 actions and the Markov-- 774 00:51:11,330 --> 00:51:13,770 MDP formulations. 775 00:51:13,770 --> 00:51:26,840 Off-policy works if pi 1 takes in general all state-action 776 00:51:26,840 --> 00:51:31,922 pairs with some small probability. 777 00:51:39,940 --> 00:51:47,890 If pi 2 took action [INAUDIBLE] state 1 and pi 1 never did, 778 00:51:47,890 --> 00:51:49,690 there's no way I'm going to learn really 779 00:51:49,690 --> 00:51:53,356 what pi 2 is all about. 780 00:51:53,356 --> 00:51:58,540 [INAUDIBLE] show you those two [INAUDIBLE].. 781 00:51:58,540 --> 00:51:59,303 OK. 782 00:51:59,303 --> 00:52:00,220 So how do you do that? 783 00:52:00,220 --> 00:52:03,370 If you just-- if you're thinking about greedy policies 784 00:52:03,370 --> 00:52:07,060 on a robot, and you've got your current estimate of the value, 785 00:52:07,060 --> 00:52:12,310 and you do the most aggressive action on the acrobot, 786 00:52:12,310 --> 00:52:14,280 I'll tell you what's going to happen. 787 00:52:14,280 --> 00:52:17,470 You're going to visit the states near the bottom, 788 00:52:17,470 --> 00:52:19,300 and you start learning a lot. 789 00:52:19,300 --> 00:52:22,356 And you're never going to visit the states up at the top 790 00:52:22,356 --> 00:52:24,728 when you're learning. 791 00:52:24,728 --> 00:52:27,020 So how are you going to get around that on the acrobot? 792 00:52:27,020 --> 00:52:29,150 And the acrobot is tough, actually. 793 00:52:29,150 --> 00:52:33,830 But the idea is, you'd better add some randomness 794 00:52:33,830 --> 00:52:36,620 so that you explore more and more state and actions. 795 00:52:36,620 --> 00:52:39,288 And the hope is that if you add enough for a long enough time, 796 00:52:39,288 --> 00:52:41,330 you're going to learn better and better policies, 797 00:52:41,330 --> 00:52:44,330 you're going to find your way up to the top. 798 00:52:44,330 --> 00:52:47,210 So the acrobot is actually almost as hard 799 00:52:47,210 --> 00:52:49,130 as it gets with these things, where you really 800 00:52:49,130 --> 00:52:50,780 have to find your way into this region 801 00:52:50,780 --> 00:52:53,790 to learn about the region. 
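Pulling together the update described above ("execute policy 1, but learn about the optimal policy"), here is a tabular Q-learning sketch; the dict-based Q table and parameter names are illustrative.

```python
def q_learning_step(Q, s, a, cost, s_next, actions, gamma=0.99, eta=0.1):
    """One Q-learning update: bootstrap with the best action at s_next,
    i.e. learn about the policy that is greedy w.r.t. the current Q estimate,
    regardless of which behavior policy generated (s, a, cost, s_next)."""
    best_next = min(Q[(s_next, a_next)] for a_next in actions)   # min: costs, not rewards
    td_error = cost + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += eta * td_error
    return Q
```

Because the bootstrap always takes the min over actions, the behavior policy's only job is to keep visiting all state-action pairs with some probability, which is exactly the exploration requirement just discussed.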
802 00:52:53,790 --> 00:52:56,090 In fact, my beef with the reinforcement learning 803 00:52:56,090 --> 00:52:58,370 community is that they learn only 804 00:52:58,370 --> 00:53:00,530 to swing up to some threshold. 805 00:53:00,530 --> 00:53:02,960 They never actually [INAUDIBLE] to the top. 806 00:53:02,960 --> 00:53:05,210 If you look, there's lots of papers, countless papers, 807 00:53:05,210 --> 00:53:07,340 written about reinforcement learning, Q-learning 808 00:53:07,340 --> 00:53:09,040 for the acrobot and things like this, 809 00:53:09,040 --> 00:53:10,790 and they never actually solve the problem. 810 00:53:10,790 --> 00:53:12,650 They just try to [INAUDIBLE] at the top. 811 00:53:12,650 --> 00:53:14,210 But they don't do it. 812 00:53:14,210 --> 00:53:15,680 They just get up this high. 813 00:53:15,680 --> 00:53:17,450 Because it is sort of a tough case. 814 00:53:19,583 --> 00:53:20,500 So how do you do this? 815 00:53:20,500 --> 00:53:21,917 So you get-- like I said, in order 816 00:53:21,917 --> 00:53:23,675 to start exploring that space, you'd 817 00:53:23,675 --> 00:53:27,140 better add some randomness. 818 00:53:27,140 --> 00:53:30,230 So one of the standard approaches 819 00:53:30,230 --> 00:53:32,649 is to use epsilon-greedy algorithms. 820 00:53:42,240 --> 00:53:47,280 So I said, let's make pi 2 exactly the minimizing thing. 821 00:53:47,280 --> 00:53:47,950 That's true. 822 00:53:47,950 --> 00:53:53,160 But if you execute that, you're probably going to [INAUDIBLE] 823 00:53:53,160 --> 00:54:03,590 much better to execute a policy pi epsilon, 824 00:54:03,590 --> 00:54:06,970 where the-- let's say the policy I care about, 825 00:54:06,970 --> 00:54:13,120 I'm going to execute with probability-- 826 00:54:13,120 --> 00:54:17,680 sort of, you flip a coin, you pull a random number 827 00:54:17,680 --> 00:54:18,983 between 0 and 1. 828 00:54:18,983 --> 00:54:20,400 If it's greater than epsilon, then 829 00:54:20,400 --> 00:54:22,090 go ahead and execute the policy you're 830 00:54:22,090 --> 00:54:24,910 trying to learn about, but execute 831 00:54:24,910 --> 00:54:32,822 some random action otherwise. 832 00:54:42,900 --> 00:54:43,400 OK. 833 00:54:43,400 --> 00:54:45,440 So every time I-- 834 00:54:45,440 --> 00:54:48,020 every dt I'm going to flip a coin, keep it-- 835 00:54:48,020 --> 00:54:49,380 well, not a coin. 836 00:54:49,380 --> 00:54:52,070 A hundred-sided coin, a 100 to-- 837 00:54:52,070 --> 00:54:53,750 0 to 1, a continuous thing. 838 00:54:53,750 --> 00:54:57,830 If it comes out less than epsilon, 839 00:54:57,830 --> 00:54:59,240 I'm going to do a random action. 840 00:54:59,240 --> 00:55:01,657 Just forget about my current policy, pick a random action. 841 00:55:01,657 --> 00:55:05,240 It's a uniform distribution over actions. 842 00:55:05,240 --> 00:55:10,170 Otherwise, I'll take this, the action from my policy. 843 00:55:10,170 --> 00:55:14,135 And the virtue of having a soft policy learning thing 844 00:55:14,135 --> 00:55:15,980 is I can still learn about pi 2, even 845 00:55:15,980 --> 00:55:19,858 if I'm taking this pi epsilon. 846 00:55:19,858 --> 00:55:21,400 But I have the advantage of exploring 847 00:55:21,400 --> 00:55:22,690 all the state-actions. 848 00:55:26,110 --> 00:55:26,610 Good. 849 00:55:26,610 --> 00:55:28,080 I'm missing a page. 850 00:55:32,908 --> 00:55:34,450 AUDIENCE: Is that the most randomness 851 00:55:34,450 --> 00:55:39,370 you can produce since they'll [INAUDIBLE] converge?
852 00:55:39,370 --> 00:55:41,950 RUSS TEDRAKE: There's a couple of different candidates. 853 00:55:41,950 --> 00:55:44,000 The softmax is another one that people use a lot. 854 00:55:47,300 --> 00:55:48,020 And a lot of-- 855 00:55:48,020 --> 00:55:51,050 I mean, in the off-policy sense, it's actually quite robust. 856 00:55:51,050 --> 00:55:55,470 So a lot of people talk about using a behavioral policy, which 857 00:55:55,470 --> 00:55:57,470 is just sort of something to try-- it's designed 858 00:55:57,470 --> 00:55:58,987 to explore the state space. 859 00:55:58,987 --> 00:56:00,820 Actually, my candidate for a behavioral policy 860 00:56:00,820 --> 00:56:02,810 is something like RRT. 861 00:56:02,810 --> 00:56:04,790 We should really try to do something 862 00:56:04,790 --> 00:56:08,600 that gets me into all areas of state space, for instance. 863 00:56:08,600 --> 00:56:10,570 And then maybe that's a good way to design, 864 00:56:10,570 --> 00:56:12,860 to sample these state-action pairs. 865 00:56:12,860 --> 00:56:16,100 And all the while, I try to learn about pi 2. 866 00:56:16,100 --> 00:56:17,684 So it is robust in that sense. 867 00:56:21,940 --> 00:56:24,630 When I say it works here, I have to be a little careful. 868 00:56:24,630 --> 00:56:28,560 This is only for the MDP case that it's really 869 00:56:28,560 --> 00:56:29,880 guaranteed to work. 870 00:56:29,880 --> 00:56:34,590 There's more recent work doing off-policy and function 871 00:56:34,590 --> 00:56:36,290 approximators. 872 00:56:36,290 --> 00:56:37,140 And you can do that. 873 00:56:48,450 --> 00:56:51,750 I don't want to bury you guys with random detail. 874 00:56:51,750 --> 00:56:59,220 But you can do off-policy with linear function approximators 875 00:56:59,220 --> 00:57:08,004 safely, using an importance [INAUDIBLE] when you're dealing 876 00:57:08,004 --> 00:57:08,992 [INAUDIBLE]. 877 00:57:18,970 --> 00:57:20,910 And that's work by Doina Precup. 878 00:57:28,790 --> 00:57:30,470 The basic idea is you have to-- 879 00:57:30,470 --> 00:57:34,310 if your policy is changing, like these things, 880 00:57:34,310 --> 00:57:37,760 it's changing over time, you'd better weight your updates 881 00:57:37,760 --> 00:57:39,142 based on the relative-- 882 00:57:44,444 --> 00:57:47,150 AUDIENCE: [INAUDIBLE] necessarily. 883 00:57:47,150 --> 00:57:48,482 But that's-- 884 00:57:48,482 --> 00:57:50,940 RUSS TEDRAKE: But what you're learning about is the state-- 885 00:57:50,940 --> 00:57:53,180 the probability of picking this action for one step 886 00:57:53,180 --> 00:57:54,503 and then executing pi 2. 887 00:57:54,503 --> 00:57:56,586 And that's still [INAUDIBLE] even if pi 2 would never 888 00:57:56,586 --> 00:57:57,488 take that action. 889 00:57:57,488 --> 00:57:58,280 AUDIENCE: Oh, yeah. 890 00:57:58,280 --> 00:57:59,422 Because it's-- OK. 891 00:57:59,422 --> 00:58:00,880 RUSS TEDRAKE: So I think it's good. 892 00:58:00,880 --> 00:58:02,870 The thing that has to happen is that pi 2 893 00:58:02,870 --> 00:58:05,358 has to be well-defined for every possible state. 894 00:58:05,358 --> 00:58:05,900 AUDIENCE: OK. 895 00:58:05,900 --> 00:58:09,470 So keeping the Q pi 2 is take a certain [INAUDIBLE] 896 00:58:09,470 --> 00:58:11,870 take a certain action, then [INAUDIBLE] pi [INAUDIBLE]. 897 00:58:11,870 --> 00:58:12,290 RUSS TEDRAKE: Yes. 898 00:58:12,290 --> 00:58:12,840 AUDIENCE: OK. 899 00:58:12,840 --> 00:58:14,030 Sorry, I lost that [INAUDIBLE]. 900 00:58:14,030 --> 00:58:14,630 RUSS TEDRAKE: OK, good.
901 00:58:14,630 --> 00:58:14,990 Sorry. 902 00:58:14,990 --> 00:58:16,157 Thank you for clarifying it. 903 00:58:16,157 --> 00:58:17,240 Yeah, so cool. 904 00:58:17,240 --> 00:58:18,659 So I think that still works. 905 00:58:22,790 --> 00:58:25,483 OK. 906 00:58:25,483 --> 00:58:26,150 So this is good. 907 00:58:26,150 --> 00:58:28,470 So let me tell you where you are so far. 908 00:58:28,470 --> 00:58:31,320 We've now switched from doing temporal difference learning 909 00:58:31,320 --> 00:58:33,560 on value functions to temporal difference 910 00:58:33,560 --> 00:58:35,900 learning on Q functions. 911 00:58:35,900 --> 00:58:37,580 And a major thing we got out of that 912 00:58:37,580 --> 00:58:41,186 was that we can do this off-policy learning. 913 00:58:45,400 --> 00:58:46,780 You put it all together. 914 00:58:46,780 --> 00:58:51,250 [INAUDIBLE] back into my policy iteration diagram, 915 00:58:51,250 --> 00:58:53,890 and what we have, we've defined the policy evaluation, that's 916 00:58:53,890 --> 00:58:54,940 the TD lambda. 917 00:58:54,940 --> 00:58:57,480 We defined our update, which could 918 00:58:57,480 --> 00:58:59,710 be this in the general sense. 919 00:58:59,710 --> 00:59:03,775 This one is-- if I used pi 1 again, 920 00:59:03,775 --> 00:59:09,004 if I really did on-policy, if I used pi 1 everywhere 921 00:59:09,004 --> 00:59:11,170 while I'm executing pi 1, then this 922 00:59:11,170 --> 00:59:15,166 would be called SARSA, [INAUDIBLE] sort 923 00:59:15,166 --> 00:59:20,480 of on-policy Q-learning, on-policy updating. 924 00:59:20,480 --> 00:59:25,270 And Q-learning is this, where you use the middle gradient. 925 00:59:25,270 --> 00:59:27,930 And what we know, what people have proven, 926 00:59:27,930 --> 00:59:30,570 the algorithms were in use for years and years and years 927 00:59:30,570 --> 00:59:33,580 before it was actually proven, even in the tabular case, 928 00:59:33,580 --> 00:59:36,040 where you have finite state and actions. 929 00:59:36,040 --> 00:59:39,010 But now we know that this thing is guaranteed 930 00:59:39,010 --> 00:59:41,140 to converge to the optimal policy, 931 00:59:41,140 --> 00:59:44,320 that policy iteration, even if it's updated at every step, 932 00:59:44,320 --> 00:59:47,380 is going to converge to the optimal policy 933 00:59:47,380 --> 00:59:51,080 and the optimal Q function, given 934 00:59:51,080 --> 00:59:54,290 that all state-action pairs are [INAUDIBLE] in the tabular 935 00:59:54,290 --> 00:59:54,790 case. 936 01:00:01,840 --> 01:00:05,700 If we go to function approximation, 937 01:00:05,700 --> 01:00:09,480 if you just do policy evaluation but not update, 938 01:00:09,480 --> 01:00:14,480 then we have an example where this is actually 939 01:00:14,480 --> 01:00:16,250 in '02 or something like that. 940 01:00:16,250 --> 01:00:21,840 It'd be '01 or '02, 2001. 941 01:00:21,840 --> 01:00:24,930 We finally proved that off-policy with linear function 942 01:00:24,930 --> 01:00:29,530 approximation would converge when the policy is not 943 01:00:29,530 --> 01:00:31,770 changing-- 944 01:00:31,770 --> 01:00:33,540 no control. 945 01:00:33,540 --> 01:00:36,410 So the thing I need to give you before we consider 946 01:00:36,410 --> 01:00:43,590 this a complete story here is, can we do off-policy learning? 947 01:00:43,590 --> 01:00:45,450 Can we do our policy improvement 948 01:00:45,450 --> 01:00:50,310 update stably with function approximation?
949 01:00:50,310 --> 01:00:52,740 And the algorithm that we have for that 950 01:00:52,740 --> 01:00:55,454 is our least squares policy iteration. 951 01:01:35,980 --> 01:01:40,030 Do you remember least squares temporal difference learning? 952 01:01:40,030 --> 01:01:45,150 Sort of the idea was that if we look at the stationary update-- 953 01:01:45,150 --> 01:01:47,014 maybe I should write it down again. 954 01:01:47,014 --> 01:01:47,946 [INAUDIBLE] find it. 955 01:01:51,642 --> 01:01:53,100 If I look at the stationary update, 956 01:01:53,100 --> 01:01:55,870 if I were to run an entire batch of-- 957 01:01:55,870 --> 01:01:58,370 I mean, the big idea is when you're doing the least squares, 958 01:01:58,370 --> 01:02:00,260 is that we're going to try to reuse old data. 959 01:02:00,260 --> 01:02:02,468 We're not going to just make a single update, spit it 960 01:02:02,468 --> 01:02:03,490 out, throw it away. 961 01:02:03,490 --> 01:02:06,500 We're going to remember a bunch of old state-action pairs, 962 01:02:06,500 --> 01:02:09,405 trying to make a least squares update with just the same thing 963 01:02:09,405 --> 01:02:11,570 as [INAUDIBLE]. 964 01:02:11,570 --> 01:02:13,165 In the Monte Carlo sense, it's easy. 965 01:02:13,165 --> 01:02:14,540 It's just function approximation, 966 01:02:14,540 --> 01:02:17,170 where with this TD term floating around it's harder. 967 01:02:17,170 --> 01:02:19,420 So we had to come up with least squares temporal difference 968 01:02:19,420 --> 01:02:21,430 learning. 969 01:02:21,430 --> 01:02:27,970 And in the LSTD case, the story was 970 01:02:27,970 --> 01:02:42,460 we could build up a matrix using something that looked like phi 971 01:02:42,460 --> 01:02:51,650 of s gamma phi transpose-- 972 01:02:51,650 --> 01:02:54,515 so it's ik. 973 01:02:54,515 --> 01:02:58,310 Let me just write ik in here-- 974 01:02:58,310 --> 01:03:02,750 ik plus 1 minus phi of ik, 975 01:03:05,715 --> 01:03:09,420 times my parameter vector. 976 01:03:09,420 --> 01:03:17,070 And b-- some of these terms was e, 977 01:03:17,070 --> 01:03:23,291 which were phi ik times our reward times-- 978 01:03:29,135 --> 01:03:33,560 And if I did this least squares solution-- 979 01:03:33,560 --> 01:03:35,810 or I could invert that carefully with SVD or something 980 01:03:35,810 --> 01:03:36,640 like that-- 981 01:03:36,640 --> 01:03:38,930 then what I get out is the-- 982 01:03:43,313 --> 01:03:55,822 it jumps immediately to the steady-state solution of TD 983 01:03:55,822 --> 01:03:56,322 lambda. 984 01:04:05,666 --> 01:04:09,330 So this is essentially the big piece 985 01:04:09,330 --> 01:04:13,650 of the TD lambda update broken into the part that 986 01:04:13,650 --> 01:04:17,040 depends on alpha, the part that doesn't depend on alpha. 987 01:04:17,040 --> 01:04:19,755 I could write my batch TD lambda update 988 01:04:19,755 --> 01:04:26,860 as alpha equals alpha plus gamma times (A alpha plus b). 989 01:04:26,860 --> 01:04:29,938 And I could solve that at steady-state [INAUDIBLE]. 990 01:04:32,690 --> 01:04:33,190 All right. 991 01:04:33,190 --> 01:04:35,970 So least squares policy, least squares temporal difference 992 01:04:35,970 --> 01:04:39,190 learning, is about reusing lots of trajectories 993 01:04:39,190 --> 01:04:43,780 to make a single update that was going to jump right to where 994 01:04:43,780 --> 01:04:47,598 the temporal difference learning would have gotten 995 01:04:47,598 --> 01:04:49,140 if we just replayed it a bunch of times.
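(For reference, a rough sketch of the batch LSTD construction just recalled: accumulate an A matrix and a b vector over stored transitions from the fixed policy, then solve for the parameter vector alpha in one shot. Sign and step-size conventions differ between references; this follows the common form that solves A alpha = b, and the function and variable names are assumptions for illustration rather than the board notation.)

    import numpy as np

    def lstd(transitions, phi, n_features, gamma=0.99):
        # transitions: list of (s, cost, s_next) generated by the fixed policy pi.
        # phi(s): feature vector of length n_features, with J_hat(s) = phi(s) . alpha.
        A = np.zeros((n_features, n_features))
        b = np.zeros(n_features)
        for s, cost, s_next in transitions:
            f, f_next = phi(s), phi(s_next)
            A += np.outer(f, f - gamma * f_next)  # the part of the update that multiplies alpha
            b += cost * f                         # the part that does not depend on alpha
        # One shot to the steady-state solution (SVD-based least squares, in case
        # A is poorly conditioned).
        return np.linalg.lstsq(A, b, rcond=None)[0]

Running another batch just means accumulating more terms into A and b before re-solving.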
996 01:04:51,690 --> 01:04:55,420 Now the question is, the policy is 997 01:04:55,420 --> 01:04:58,680 going to be moving while we're doing this. 998 01:04:58,680 --> 01:05:03,750 How can we do this sort of least squares method 999 01:05:03,750 --> 01:05:05,583 to do the policy iteration up here? 1000 01:05:12,780 --> 01:05:17,597 Again, the trick is pretty simple. 1001 01:05:17,597 --> 01:05:19,555 We've just got to learn the Q function instead, 1002 01:05:19,555 --> 01:05:20,638 [INAUDIBLE] biggest trick. 1003 01:05:24,760 --> 01:05:26,410 So in order to do-- 1004 01:05:26,410 --> 01:05:28,350 control not just evaluate a single policy 1005 01:05:28,350 --> 01:05:45,535 but actually try to find the optimal policy, 1006 01:05:45,535 --> 01:05:47,160 first thing we have to do is figure out 1007 01:05:47,160 --> 01:05:49,875 how to do LSTD on a Q function. 1008 01:05:53,268 --> 01:05:56,605 And it turns out it's no-- 1009 01:05:56,605 --> 01:05:57,685 yeah, what's up? 1010 01:05:57,685 --> 01:05:58,680 [INAUDIBLE] 1011 01:06:03,080 --> 01:06:07,000 It turns out if you keep along, [INAUDIBLE] 1012 01:06:07,000 --> 01:06:10,840 exactly the same form as we did in least squares 1013 01:06:10,840 --> 01:06:14,710 temporal difference learning, but now we do everything 1014 01:06:14,710 --> 01:06:18,467 with functions of s and a. 1015 01:06:50,554 --> 01:06:54,520 [INAUDIBLE] transpose on it. 1016 01:06:54,520 --> 01:06:55,485 Now I do the-- 1017 01:06:55,485 --> 01:06:57,964 you said form of the [INAUDIBLE] too much. 1018 01:06:57,964 --> 01:07:01,690 I just want you to know the big idea here. 1019 01:07:13,769 --> 01:07:18,080 Then I do gamma [INAUDIBLE] A inverse b, 1020 01:07:18,080 --> 01:07:20,446 this whole time we're representing our Q function. 1021 01:07:23,451 --> 01:07:30,720 Q hat s, a is now a linear combination of nonlinear basis 1022 01:07:30,720 --> 01:07:33,525 functions on s and a. 1023 01:07:33,525 --> 01:07:36,905 AUDIENCE: Shouldn't that be transpose [INAUDIBLE]? 1024 01:07:36,905 --> 01:07:37,780 RUSS TEDRAKE: I put-- 1025 01:07:37,780 --> 01:07:41,340 I tried to put a transpose with my poorly-- 1026 01:07:41,340 --> 01:07:43,140 throughout everything. 1027 01:07:43,140 --> 01:07:45,672 So you're saying this one shouldn't be transpose? 1028 01:07:45,672 --> 01:07:47,980 AUDIENCE: [INAUDIBLE] should be a [INAUDIBLE]? 1029 01:07:47,980 --> 01:07:48,920 RUSS TEDRAKE: Yeah. 1030 01:07:48,920 --> 01:07:49,420 Good. 1031 01:07:49,420 --> 01:07:50,710 So I'm going to-- 1032 01:07:50,710 --> 01:07:53,193 but this one, I wrote this whole update 1033 01:07:53,193 --> 01:07:55,110 as the transpose of the other-- of what I just 1034 01:07:55,110 --> 01:07:56,134 wrote over there. 1035 01:08:07,070 --> 01:08:08,745 AUDIENCE: [INAUDIBLE] write alpha? 1036 01:08:08,745 --> 01:08:12,538 [INAUDIBLE] If it was a plus b some stuff? 1037 01:08:12,538 --> 01:08:13,330 RUSS TEDRAKE: Yeah. 1038 01:08:13,330 --> 01:08:14,180 And there's an alpha, sorry. 1039 01:08:14,180 --> 01:08:14,790 Thank you. 1040 01:08:19,439 --> 01:08:24,319 Well, actually, it's-- the alpha is really not there. 1041 01:08:24,319 --> 01:08:26,170 Yeah, I should have written it here. 1042 01:08:26,170 --> 01:08:27,060 It's A times alpha. 1043 01:08:27,060 --> 01:08:29,012 That's what it makes the update on. 1044 01:08:29,012 --> 01:08:32,896 So we get [INAUDIBLE] alpha. 1045 01:08:32,896 --> 01:08:34,229 So that is actually [INAUDIBLE]. 1046 01:08:34,229 --> 01:08:37,100 This is the one I did [INAUDIBLE].
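(A rough sketch of the same construction for Q functions, LSTDQ, together with the outer LSPI loop that the discussion turns to next: the feature vector now depends on the state-action pair, the next-step feature is evaluated at whatever action the policy being learned about would take, and the same stored tapes are replayed after every policy improvement. Everything here, including the fixed iteration count and the greedy improvement step, is an assumed illustration rather than the lecture's exact algorithm.)

    import numpy as np

    def lstdq(samples, phi_sa, n_features, policy, gamma=0.99):
        # samples: stored tapes of (s, a, cost, s_next); phi_sa(s, a) has length n_features.
        # Solves for alpha in Q_hat(s, a) = phi_sa(s, a) . alpha for the given policy.
        A = np.zeros((n_features, n_features))
        b = np.zeros(n_features)
        for s, a, cost, s_next in samples:
            f = phi_sa(s, a)
            f_next = phi_sa(s_next, policy(s_next))  # next action from the policy we learn about
            A += np.outer(f, f - gamma * f_next)
            b += cost * f
        return np.linalg.lstsq(A, b, rcond=None)[0]

    def lspi(samples, phi_sa, n_features, actions, n_iters=20, gamma=0.99):
        # Least squares policy iteration: alternate LSTDQ with a greedy
        # (cost-minimizing) improvement, reusing the same tapes at every iteration.
        alpha = np.zeros(n_features)
        for _ in range(n_iters):
            policy = lambda s, al=alpha: min(actions, key=lambda a: phi_sa(s, a) @ al)
            alpha = lstdq(samples, phi_sa, n_features, policy, gamma)
        return alpha

Replaying the old tapes under a new policy only changes which action gets plugged into phi_sa at the next state, which is why none of the stored data has to be thrown away.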
1047 01:08:37,100 --> 01:08:38,562 OK, good. 1048 01:08:38,562 --> 01:08:40,520 So it turns out you can learn a Q function just 1049 01:08:40,520 --> 01:08:42,920 like you can learn a value function, 1050 01:08:42,920 --> 01:08:45,319 by storing up these matrices, which 1051 01:08:45,319 --> 01:08:50,720 are what TD learning would have done in a batch sense, 1052 01:08:50,720 --> 01:08:52,970 and then just taking a one-step shot 1053 01:08:52,970 --> 01:08:55,550 to get directly to the solution for temporal difference 1054 01:08:55,550 --> 01:08:56,966 learning for the Q function. 1055 01:09:01,220 --> 01:09:06,620 And again, I put in here s prime. 1056 01:09:06,620 --> 01:09:09,300 And I left this a little bit ambiguous. 1057 01:09:09,300 --> 01:09:15,560 So I can evaluate any policy by just putting 1058 01:09:15,560 --> 01:09:18,470 in that policy in here and doing the replay. 1059 01:09:23,410 --> 01:09:31,649 And it turns out if I now do this in a policy iteration 1060 01:09:31,649 --> 01:09:43,560 sense, LSPI-- 1061 01:09:43,560 --> 01:09:45,920 Least Squares Policy Iteration-- 1062 01:09:45,920 --> 01:09:49,170 basically, you start off with an initial guess, 1063 01:09:49,170 --> 01:09:58,080 we do LSTDQ [INAUDIBLE] to get Q pi 1. 1064 01:09:58,080 --> 01:10:02,640 And then you repeat, yeah? 1065 01:10:02,640 --> 01:10:09,370 Then this thing, that's enough to get you to-- 1066 01:10:09,370 --> 01:10:12,322 it converges. 1067 01:10:12,322 --> 01:10:14,270 Now, be careful about how it converges. 1068 01:10:14,270 --> 01:10:31,976 It converges with some error bound to pi star, Q star. 1069 01:10:34,810 --> 01:10:38,120 The error bound depends on a couple of parameters. 1070 01:10:38,120 --> 01:10:40,420 So technically, it could be close to your solution 1071 01:10:40,420 --> 01:10:42,290 and oscillate, or something like that. 1072 01:10:42,290 --> 01:10:44,290 But it's a pretty strong convergence result 1073 01:10:44,290 --> 01:10:50,162 for this sort of policy improvement with an approximate value 1074 01:10:50,162 --> 01:10:50,662 function. 1075 01:10:54,930 --> 01:10:57,480 In a pure sense, it is-- 1076 01:10:57,480 --> 01:11:01,650 you should run this for a while until you 1077 01:11:01,650 --> 01:11:08,510 get a good estimate for LSTD and you get your new Q pi. 1078 01:11:08,510 --> 01:11:14,150 But by virtue of using Q functions, when you do switch 1079 01:11:14,150 --> 01:11:16,880 to your new policy, pi 2, let's say, 1080 01:11:16,880 --> 01:11:19,710 you don't have to throw away all your old data. 1081 01:11:19,710 --> 01:11:25,700 You just take your old tapes and actually regenerate a and b 1082 01:11:25,700 --> 01:11:30,320 as if you had played off those old tapes with-- 1083 01:11:30,320 --> 01:11:33,020 as if you had seen the old tapes executing the new policy. 1084 01:11:36,250 --> 01:11:39,400 And you can reuse all your old data 1085 01:11:39,400 --> 01:11:43,686 and make an efficient update to get Q pi [INAUDIBLE]. 1086 01:11:52,870 --> 01:11:54,744 Least squares policy iteration. 1087 01:11:54,744 --> 01:11:56,780 Pretty simple. 1088 01:11:56,780 --> 01:11:57,280 OK. 1089 01:11:57,280 --> 01:11:59,697 I know that was a little dry and a little bit-- and a lot. 1090 01:11:59,697 --> 01:12:03,130 But let's make sure we know how we got where we got. 1091 01:12:03,130 --> 01:12:11,130 So there's another route besides pure policy search 1092 01:12:11,130 --> 01:12:12,310 to do model-free learning.
1093 01:12:12,310 --> 01:12:16,930 All you have to do is take a bunch of trajectories, 1094 01:12:16,930 --> 01:12:19,120 learn a value function for those trajectories. 1095 01:12:19,120 --> 01:12:24,130 You don't even actually have to take the perfect-- 1096 01:12:24,130 --> 01:12:25,340 your best controller yet. 1097 01:12:25,340 --> 01:12:27,760 You could take some RRT controller, something that's 1098 01:12:27,760 --> 01:12:30,740 going to explore the space and try to learn about your value 1099 01:12:30,740 --> 01:12:31,240 function. 1100 01:12:34,042 --> 01:12:39,190 Learn Q pi through these LSTD algorithms-- 1101 01:12:39,190 --> 01:12:42,810 you can do a pretty efficient update for Q pi. 1102 01:12:42,810 --> 01:12:44,810 You can improve efficiently by just looking at 1103 01:12:44,810 --> 01:12:48,670 the min over Q, and pretty quickly 1104 01:12:48,670 --> 01:12:51,977 iterate to an optimal policy and optimal value 1105 01:12:51,977 --> 01:12:54,820 function, only storing-- 1106 01:12:54,820 --> 01:12:57,160 the only thing you have to store in that whole process 1107 01:12:57,160 --> 01:12:59,700 is the Q function. 1108 01:12:59,700 --> 01:13:03,480 And in the LS case, the LSTD case, you remember the tape-- 1109 01:13:03,480 --> 01:13:07,470 the history of tapes, you just use them more efficiently. 1110 01:13:07,470 --> 01:13:08,370 Yeah? 1111 01:13:08,370 --> 01:13:11,680 AUDIENCE: So could you have used this on the flapper 1112 01:13:11,680 --> 01:13:14,820 that John showed? 1113 01:13:14,820 --> 01:13:16,488 Or what's the-- 1114 01:13:16,488 --> 01:13:17,280 RUSS TEDRAKE: Good. 1115 01:13:17,280 --> 01:13:18,072 Very good question. 1116 01:13:20,870 --> 01:13:22,590 That's an excellent question. 1117 01:13:22,590 --> 01:13:25,310 So in fact, the last day of class, what we're going to do 1118 01:13:25,310 --> 01:13:26,360 is going to-- 1119 01:13:26,360 --> 01:13:28,780 the last day I present in class, we're going to-- 1120 01:13:28,780 --> 01:13:29,870 I'm going to go through a couple of sort 1121 01:13:29,870 --> 01:13:31,130 of case studies and different problems 1122 01:13:31,130 --> 01:13:33,080 that people have had success on, and tell you 1123 01:13:33,080 --> 01:13:35,760 why we picked the algorithm we picked, things like that. 1124 01:13:35,760 --> 01:13:38,420 So why didn't we do this on a flapper? 1125 01:13:38,420 --> 01:13:42,730 The simplest reason is that we don't know the state space. 1126 01:13:42,730 --> 01:13:46,620 It's infinite dimensional in general. 1127 01:13:46,620 --> 01:13:48,180 So that would have been a big thing 1128 01:13:48,180 --> 01:13:51,070 to represent a Q function for. 1129 01:13:51,070 --> 01:13:53,490 It doesn't mean-- it doesn't make it invalid. 1130 01:13:53,490 --> 01:13:54,990 We could have learned, we could have 1131 01:13:54,990 --> 01:13:58,470 tried to approximate the state space with even a handful 1132 01:13:58,470 --> 01:14:03,060 of features, learned a very approximate Q function, 1133 01:14:03,060 --> 01:14:04,710 and done something like actor-critic 1134 01:14:04,710 --> 01:14:06,190 like we're going to do next time. 1135 01:14:06,190 --> 01:14:08,730 But I think in cases where you don't know the state space, 1136 01:14:08,730 --> 01:14:10,607 or the state space is very, very large, 1137 01:14:10,607 --> 01:14:12,190 and you can write a simple controller, 1138 01:14:12,190 --> 01:14:16,360 then it makes more sense to parameterize the policy.
1139 01:14:16,360 --> 01:14:18,880 It really goes down to that game, that accounting game, 1140 01:14:18,880 --> 01:14:22,830 in some ways, of how many dimensions things are. 1141 01:14:22,830 --> 01:14:26,820 But in a fluids case, you could have a pretty simple policy 1142 01:14:26,820 --> 01:14:29,960 from sensors to actions which we could twiddle. 1143 01:14:29,960 --> 01:14:35,230 We couldn't have an efficient value function. 1144 01:14:35,230 --> 01:14:38,390 Now, there are other cases where the opposite is true. 1145 01:14:38,390 --> 01:14:41,860 The opposite is true, where you have a small state space, 1146 01:14:41,860 --> 01:14:44,980 let's say, but the resulting policies 1147 01:14:44,980 --> 01:14:48,250 would require a lot of features to parameterize. 1148 01:14:48,250 --> 01:14:50,770 But I think in general, the strength of these algorithms 1149 01:14:50,770 --> 01:14:54,760 is that they are efficient with reusing data. 1150 01:14:54,760 --> 01:14:57,560 The weakness is that-- 1151 01:14:57,560 --> 01:14:59,230 well, the weakness a few years ago 1152 01:14:59,230 --> 01:15:01,188 would have been that they'd blow up a lot of the time. 1153 01:15:01,188 --> 01:15:04,380 But algorithms have gotten better as we [INAUDIBLE] 1154 01:15:04,380 --> 01:15:05,980 we have some convergence guarantees. 1155 01:15:05,980 --> 01:15:07,190 Not the general [INAUDIBLE]. 1156 01:15:07,190 --> 01:15:08,680 I never told you that it converged 1157 01:15:08,680 --> 01:15:10,990 if you have a nonlinear function approximator. 1158 01:15:10,990 --> 01:15:13,040 We'd love to have that result [INAUDIBLE]. 1159 01:15:13,040 --> 01:15:14,890 We won't have it for a while. 1160 01:15:14,890 --> 01:15:17,140 But in the linear function approximator sense, 1161 01:15:17,140 --> 01:15:20,290 we have both. 1162 01:15:20,290 --> 01:15:22,330 But there's a lot of success stories. 1163 01:15:22,330 --> 01:15:24,122 These are the kind of algorithms that were 1164 01:15:24,122 --> 01:15:26,230 used to play backgammon. 1165 01:15:26,230 --> 01:15:27,700 There are examples of them working 1166 01:15:27,700 --> 01:15:31,460 on things like the [INAUDIBLE]. 1167 01:15:31,460 --> 01:15:34,570 But in the domains that I care about most in my lab, 1168 01:15:34,570 --> 01:15:37,740 we tend to do more policy gradient sort of things.