The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

RUSS TEDRAKE: OK, so every once in a while I stop and try to do a little bit of reflection, since we've had so many methods flying through this semester. Once again, let's say where we've been, where we're going, what we have, what we can do, what we can't do, and why we're going to do something different today. So just a little reflection here.

We've been talking, obviously, about optimal control. There are two major approaches to optimal control that we've focused on -- well, three, I guess. In some cases, we've done analytical optimal control. And I think, by now, you appreciate that, although it only works in special cases -- linear quadratic regulators and things like that -- the lessons we learned from it help us design better algorithms. And things like LQR can fit right into more complicated nonlinear algorithms to make them click. So I think it's absolutely essential to understand the things we can do analytically in optimal control, even though they crap out pretty early in the scale of complexity that we care about. Mostly, that's good for linear systems -- and even restricted there, linear systems with quadratic costs and things like that.

And then major direction number two was the dynamic programming and value iteration approach. The big idea there was that, because we've written our cost functions over time to be additive, we're going to figure out the cost-to-go function -- the value function, via value iteration. That captures all of the long-term reasoning we have to do about the system.
From that, we can extract the optimal control decisions. And it's actually very efficient. I hope, by now, you agree with me that it's very efficient, because if you think about it, it's solving for the optimal policy for every possible initial condition in times that are comparable to what we're doing for single initial conditions in the open-loop cases. But it only works in low dimensions, and it has some discretization issues.

And then the third major approach -- I called it policy search. We focused mostly, in the policy search, on open-loop trajectory optimization. But I tried to make the point early -- and I'm going to make the point again in a lecture or two -- that it's really not restricted to thinking about open-loop trajectories. So when I first said policy search, I said we could be looking for parameters of a feedback controller -- the linear gain matrix of a linear feedback. We could do a lot of things, but we quickly started being specific in our algorithms, trying to optimize some open-loop tape with direct collocation, with shooting. But the ideas really are more general than that, and we're going to have a lecture soon about how to do these kinds of things with function approximation, and do more general feedback controllers. These worked in higher dimensional systems, had local minima -- all the problems you know by now.

OK, good -- so for our model systems, we got pretty far with that. In the cases where we knew the model, we assumed that the model was deterministic and sensing was clean -- everything like that. We could make our simulations do pretty much what we wanted with that bag of tricks.

Then I threw in the stochastic optimal control case. We said, what happens if the models aren't deterministic? Analytical optimal control -- I didn't really talk about it, but there are still some cases where you can do analytical optimal control.
The linear quadratic Gaussian systems are the clear example of that. We said that value iteration for this -- although I was quickly challenged on it -- is basically no harder. It's basically no harder to do value iteration for the stochastic optimization, where now our goal is to minimize some expected value of a long-term cost. Value iteration, we basically said, is almost no harder to do in the case with transition probabilities flying around.

And in fact, the barycentric grids that we used in value iteration way back there, I told you, actually have a cleaner interpretation: you can think of it as taking a continuous deterministic system and converting it into a discrete-state stochastic system. Remember, the interpolation that you do in the barycentric scheme takes exactly the same form as some transition probabilities. You've got some grid, and you want to know where you're going to go from simulating forward from this point with some action for some dt; you can approximate that as being some fraction here, some fraction here, some fraction here. And it turns out to be exactly equivalent to saying that there's some probability I get here, some probability I get here, some probability I get here, and the like. (There's a small sketch of this idea after this recap.) So value iteration really, in that sense, can solve stochastic problems nicely.

The other major approach, the policy search, can still work for stochastic problems. In some cases, you can compute the gradient of the expected value with respect to your parameters analytically, with a [INAUDIBLE] update. In other cases, you can do sampling-based, Monte Carlo based estimates, and I'm going to get more into that.
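Going back to the barycentric interpolation point for a moment, here is a minimal one-dimensional sketch of that equivalence (not the course code; the dynamics function `f`, the grid, and the action are made-up placeholders). Simulating the deterministic system forward from a grid point generally lands between grid points, and the interpolation weights for that landing point can be read as transition probabilities to the neighboring grid points:

```python
import numpy as np

def interp_transition_probs(x_grid, x_next):
    """Probability vector over grid points for a continuous landing point
    x_next, using linear (1-D barycentric) interpolation.  The weights sum
    to 1, so they can be read as transition probabilities."""
    p = np.zeros(len(x_grid))
    if x_next <= x_grid[0]:
        p[0] = 1.0
    elif x_next >= x_grid[-1]:
        p[-1] = 1.0
    else:
        j = np.searchsorted(x_grid, x_next)      # neighbors bracketing x_next
        w = (x_next - x_grid[j - 1]) / (x_grid[j] - x_grid[j - 1])
        p[j - 1] = 1.0 - w                       # fraction assigned to the left neighbor
        p[j] = w                                 # fraction assigned to the right neighbor
    return p

# Hypothetical deterministic dynamics and grid, just for illustration.
f = lambda x, u, dt: x + dt * (-x + u)           # placeholder continuous dynamics
x_grid = np.linspace(-1.0, 1.0, 21)
u, dt = 0.3, 0.1

# Discrete-state "stochastic" model: row i gives P(next grid point | grid point i, u).
T = np.array([interp_transition_probs(x_grid, f(x, u, dt)) for x in x_grid])
assert np.allclose(T.sum(axis=1), 1.0)
```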
We're going to talk more about that, but the takeaway message is, when things get stochastic, both of these methods still work. They work in slightly different ways, but you can make both of them work.

And then, last week, John threw a major wrench into things. Sorry -- that wasn't supposed to be a statement about John. John made your life better by telling you that some of these algorithms work even if you don't know the model. And he talked about doing policy search without a model. And the big idea there was that fairly simple-looking algorithms -- which just perturb the parameters, run a trial, try different parameters -- simple sampling algorithms can estimate the same thing we would do with our policy gradient: the gradient of the expected reward with respect to the parameters. Even the simplest thing is, let's change my parameters a little bit and see what happened. That gives me a sample of this gradient. And if I do it enough times, I pull enough samples, I get an estimate of the gradient of the expected return.

If you think back, that's why we tried to stick in stochastic optimal control before we got to that, because John also told you the nice interpretation of these algorithms in the stochastic setting: even if the plant that you're measuring from is stochastic, or if the sensors are noisy, then actually these same sampling algorithms can still estimate these gradients for you nicely.

I think John also made the point -- and I want to make it again -- you would never use these algorithms if you had a model. They're beautiful, but probably, if you have a model -- maybe, if you have a model and you're a very patient person, but very lazy, then you might try to use this, because you can type it in in a few minutes, but it's going to take a lot longer to run. And the reason for that is that it requires many more simulations. In fact, it requires many simulations just to estimate a single policy gradient.
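As a rough illustration of that "change the parameters a little, run a trial, see what happened" idea, here is a minimal weight-perturbation sketch. It is one member of the family of algorithms described above, not the exact update from John's lecture; `run_trial` stands in for an experiment that returns the (possibly noisy) total cost for a given parameter vector, and the perturbation scale and sample count are made-up numbers:

```python
import numpy as np

def estimate_policy_gradient(run_trial, alpha, sigma=0.01, num_samples=20):
    """Weight-perturbation estimate of d E[cost] / d alpha.

    run_trial(alpha) -> scalar cost of one (possibly noisy) trial with
    policy parameters alpha (a numpy array).  Each perturbed trial gives
    one sample of the gradient; averaging many samples reduces variance.
    """
    baseline = run_trial(alpha)                  # cost at the current parameters
    grad = np.zeros_like(alpha)
    for _ in range(num_samples):
        dalpha = sigma * np.random.randn(*alpha.shape)
        dcost = run_trial(alpha + dalpha) - baseline
        grad += dcost * dalpha / sigma**2        # correlate cost change with perturbation
    return grad / num_samples

# Hypothetical usage: gradient descent using only trial costs, no model:
#   alpha -= learning_rate * estimate_policy_gradient(run_trial, alpha)
```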
Now, the next thing I'm going to say is a little more controversial, but most people would say that the limiting case, the best you could possibly do with these REINFORCE-type algorithms -- are you raising your hand or just--

AUDIENCE: [INAUDIBLE]

RUSS TEDRAKE: No -- sorry. The best thing you can do with these REINFORCE algorithms -- the best performance you could expect -- is a shooting algorithm. And it's really, I should say, a first-order shooting method. It's really just doing gradient descent by doing trials. And when we talked about shooting methods, I actually said never do first-order shooting methods. I made a big point. I said never do this, never do this -- because if you go to the second-order methods, things converge faster, you don't have to pick learning rates, and you can handle constraints. So there are people who do more second-order policy gradient algorithms, but that's not the standard yet.

So you should really think of those as cool algorithms that, if you don't have a model, let you almost do a shooting method.

Why do I say that's a controversial statement? Could you imagine somebody standing up and saying, this is actually better than doing gradient descent?

AUDIENCE: [INAUDIBLE]

RUSS TEDRAKE: Yeah. So the one advantage is that it's doing stochastic gradient descent. And there are people out there who really believe stochastic gradient descent can outperform even higher-order methods in certain cases, just because of its ability, by virtue of being random -- this is not some magical property we've endowed; it's because the algorithm is a little crazy -- to bounce out of local minima. So for that reason, it does have all the strong optimization claims that a stochastic gradient descent algorithm has.

There's another point to make, though, and I think John made this too.
The performance of this -- and John's written a nice paper on this -- the performance you'd expect, meaning the number of trials it would take to learn to optimize your cost function -- the performance of these REINFORCE-type algorithms degrades with the number of parameters you're trying to tune.

So remember, the fundamental idea was -- and the way I like to think of it is, imagine you're at a mixing station in a sound recording studio, and you're looking through the glass, and you've got a robot over there. You've got all your knobs set in some place, your robot does its behavior, and then you give it a score. You turn your knobs a little bit, and you see how the robot acts. You turn them a little bit more. And your job is to just twist these knobs in a way that finds the way down the gradient and gets your robot doing what you want it to do.

That maybe is a demystifying way to think about all of this, which is mathematically beautiful, but really, it's just turning knobs. If you have a model and you can compute the gradient, then you don't have to guess which way to turn the knobs. You should always use that model to turn the knobs in the right direction.

And also, if you think about that analogy, the length of time it's going to take you to optimize your function is going to depend on how many knobs you have to turn. If I have 100 knobs in front of me and I change them all a little bit and see how my robot acted, then it's going to be hard for me to figure out exactly which knob to assign credit to. The fewer knobs I have to change, the faster I can estimate which knobs were important, and climb down the gradient.

I still say, when you have a model, you should always use it, because you can estimate the gradients. You can turn the knobs in the right way. But in the case where you don't have a model, these are actually very nice classes of algorithms.
This knob-tuning thing sounds ridiculous. Maybe, if you have even an [INAUDIBLE], if you have a good model of the [INAUDIBLE], then you shouldn't -- you should definitely be using it. But if you have a very complicated system, and the performance only depends on the number of parameters, then -- I just want to make the point that these algorithms are actually pretty powerful for some control problems. The ones that we're working on in my group are fluid dynamics control problems -- specifically, problems where you can get away with a small number of parameters, but you have very complicated, unknown dynamics. And for those, these algorithms actually make a lot of sense to me.

So the performance of these randomized policy search algorithms goes with the number of parameters you're trying to tune. I could be sitting at this mixing station twiddling four parameters and having a simple pendulum do its thing, or I could be sitting there turning those same four knobs and having a Navier-Stokes simulation, with some very complicated fluid doing something, and the amount of time it takes me to twiddle those parameters is the same. One of the strongest properties of these algorithms is that, by virtue of ignoring the model, they're actually insensitive to the model complexity. So in my group, we're really trying to push on some problems where the dynamics are unknown and very complicated. A lot of the community is trying to build better models of these, and we're trying to say, well, maybe before you have perfect models, we can use some of these model-free search algorithms to build good controllers without perfect models.

Are people OK with that array of techniques? Yeah? You have a good arsenal of tools? Can you see the obvious place where I'm trying to go next, now that I've set it up like this?
We did value methods and policy search methods for the simple case, then we did value methods and policy search methods for the stochastic case, then we did policy methods for the model-free case. So how about we do model-free value methods today?

I know it's a complicated web of algorithms, so I want to make sure that I stop and say that kind of stuff every once in a while.

So what's the difference between a policy method and a value method? Value iteration -- like I said, it's very, very efficient. The way we represented value iteration, with a grid, and having to solve every possible state at every possible time, is the extreme form of the value methods. In general, we can try to build approximate value methods -- estimates of our value function that don't require the big discretization.

So actually, last week, at one of the meetings I was at, I met Gerry Tesauro. And Gerry Tesauro is the guy who did TD-Gammon. Anybody heard of TD-Gammon? Yeah? [INAUDIBLE] knows TD-Gammon. I don't know what year it was -- it was 20 years ago now. One of the big success stories for reinforcement learning was that they built a game player, based on reinforcement learning, that could play backgammon with the experts and beat the experts at backgammon. Now, backgammon's actually not a trivial game. It's got a huge state space -- huge state space. I don't play backgammon, but I know there are a lot of bits going around there. It's stochastic, because you roll dice every once in a while. So it's actually not some simple game. In some ways, it's surprising that it was solved before checkers and these others. Maybe it's just because not enough people play backgammon, so you can beat the experts more easily. I don't know.
But we were playing at competition level -- beating the best humans at backgammon -- well before checkers and chess, because of a model-free, value-based method for backgammon. So Gerry Tesauro actually used neural networks, and he learned, from watching the game, a value function for the game. What does that mean? So what do you do when you play backgammon -- or whatever game you play? I'm not trying to dump on that game; I just haven't played it myself. So if you look at a go board or a chess board, you don't think about every single state that's possibly in there, but you're able to quickly look at the board and get a sense of whether you're winning or losing: if I were to make this move, my life should get better. And there are serious people who think that the natural representation for very complicated physical control processes, or very complicated game-playing scenarios, is not to learn the policy directly, but to just learn a sense of what's good and what's bad -- to learn a value function directly. And then we [INAUDIBLE] from value iteration: that captures all that's hard about it -- it captures the entire long-term lookahead in the optimal control problem. Once I have a value function -- if I have a value function I believe, and I want to make an action, all I have to do is think, well, if I made this action, my value would get better by this much; if I made that action, my value would get better by that much. And I just pick the action that maximizes my expected value.

Now, the good thing about value-based methods is that they tend to be very efficient. You can simultaneously think about lots of different states at a time. Just like value iteration, it's very efficient to learn value methods. And historically, in the reinforcement learning world, nobody ever really did policy search methods until the early '90s.
There were at least 15 years where people were doing cool things with robots, and game playing, and things like that, where almost everybody, every paper, was talking about, how do you learn a value function? How do you learn a value function if you have to put in a function approximator? Or how do you learn a value function if this, if this? So really, even though I did it second, this was actually the core of reinforcement learning for a long time: how do you learn a value function? How do you estimate the cost-to-go -- ideally, the optimal cost-to-go -- given trial-and-error experience with the robot? So that's today's problem.

Good -- so we can make it easier by thinking about a sub-problem first. And that's really policy evaluation, which is the problem of: given my dynamics, of course, and some policy pi, I want to estimate or compute J^pi, the long-term, potentially expected, cost of executing that feedback policy on that robot, potentially from all states at all times.

So this is maybe equivalent to what I just said about chess. My value function for chess might look different than that of somebody who knows how to play chess. I look at the board, and most of the time I'm losing, and my actions are going to be chosen differently, because I wouldn't even know what to do if my rook ended up over there. And the optimal value function, given I was acting optimally, might look very different. But for me, the first problem is just to estimate my cost -- the cost of executing my current game-playing strategy, my current feedback controller, on this robot, or this game.

Now, there is something culturally different about the reinforcement learning, value-based communities, and I'm going to go ahead and make that switch now. Most of the time, these things are infinite-horizon discounted problems.
I'll say it's discrete time just to keep it clean, because then it's easy to write the sum. Let me just do it like this, and let's assume it's pure state feedback -- that'll just keep me writing fewer symbols for the rest of the lecture here:

J^π(x_0) = Σ_{n=0}^∞ γ^n g(x_n, π(x_n)),

where my action is always pulled directly from pi.

I mentioned it once before, but why do people do discounted problems? There are lots of reasons. First of all, if you have infinite-horizon costs, there's just a practical issue: if you're not careful, infinite-horizon costs will blow up on you. So you put in some sort of decaying factor gamma -- typically, it's constrained to be less than 1 -- just so you don't have to worry about things blowing up in the long term. But you can make it 1, and then you just have to be more careful that you get to a fixed point [INAUDIBLE] cost, or whatever it is. Let's just put some decaying cost on future experiences.

Philosophically, some people really like this. So a lot of the problems we've talked about are very episodic in nature. We talked about designing trajectories from time 0 to time final: what's the optimal thing? If you just want to live your life -- presumably, you don't know exactly when you're going to die. You're going to maximize some long-term reward. You'd like it to be infinite, but realistically, the things that are going to happen to me tomorrow are more important to me than the things that are happening in the very far, distant future. So some people, philosophically, just like having this as a cost function for a robot that's alive, executing an online policy: worrying about short-term things a little bit more, but still thinking about the future. And that knob is controlled by gamma.
Almost all of the RL tools can be made compatible with the episodic, non-discounted cases, but culturally, like I said, they're almost always written in this form, so I thought it would make sense to switch to that form for a little bit.

So how do we estimate J^π(x), given that kind of a setup? Let's do the model-based case, just as a first case. Let's say I have a good model. I made it look deterministic here, but we can, in general, do this for stochastic things. Let me do the model-based Markov chain version first.

So you remember, in general, we said that the optimal control problem for discrete states, discrete actions, and stochastic transitions looked like a Markov decision process, where we have some discrete state space S, we have a probability transition matrix T, where T_ij is the probability of transitioning from state i to state j, and we have some cost. And in the graph sense, instead of writing the cost as a function of the action, we tend to write it as g_ij, the cost of transitioning from state i to state j.

Good -- now, in the Markov decision processes that we talked about before, the transition matrix was a function of the action you chose. Your goal was to choose the actions that gave you the best transition matrices for your problem. In policy evaluation, where we're trying to figure out the expected cost-to-go of running this fixed policy, the parameterization by action disappears again. It's not a Markov decision process anymore; it falls right back into being a Markov chain. So it's a simple picture now. We have a graph; there's some probability of transitioning from each state to each other state, because my actions are predetermined -- if I'm in some state, I'm going to take the action given by pi. Each transition incurs some cost, and my goal is to figure out the long-term cost, in expected value, that I incur as I move around the graph.
So that's a good way to start figuring out how to do policy evaluation. So now, with the discrete states and the transition matrix in this form, I'm going to rewrite J^π as being a function of i, where i is drawn from S -- it's one discrete state. And it's the expected value of the discounted transition costs along the chain:

J^π(i) = E[ Σ_{n=0}^∞ γ^n g_{i_n, i_{n+1}} ], with i_0 = i.

I should mention another funny example I just remembered. So I gave an analogy of playing a game: you might look at the board and figure out the value of being in certain states. People think it's relevant in your brains, too. So there's actually a lot of work in neuroscience these days which probes the activity of certain neurons in your brain and finds neurons that basically respond with the expected value of your cost-to-go function. They have monkeys doing these tasks, where they pull levers or blink at the right time and get certain rewards. And there are neurons that fire correlated with their expected reward, in ways that -- they design the experiments so it doesn't look like something that's correlated with the action they're going to choose, but it does look like it's correlated with expected reward. And interestingly, as the monkeys learn during the task, you can actually see that they start making predictions accurately when they're close to the reward. They're about to get juice, and then, a few minutes later, they can predict when they're a minute away from getting juice. And then, if you look at it a couple of days in, they're able to predict when they're a half hour from getting juice, or something like that.

I think the structure of trying to learn the value function is very real, especially if you're a juice-deprived monkey. So let's continue on here. How do you compute J^π, given this equation? J is a vector now. Well, first of all, the dynamic programming recursion lets us write it like this.
J^π(i_k) is the expected value of taking one step -- the one-step cost plus the discount factor times the cost-to-go from wherever you land:

J^π(i_k) = E[ g_{i_k, i_{k+1}} + γ J^π(i_{k+1}) ].

The reason people choose this form for the discount factor is that the Bellman recursion just looks like that. You just put a gamma in front of everything.

We can take the expected value of this with our Markov chain notation and say it's the sum over the possible i_{k+1}'s:

J^π(i_k) = Σ_{i_{k+1}} T_{i_k, i_{k+1}} [ g_{i_k, i_{k+1}} + γ J^π(i_{k+1}) ].

Keep putting the pi everywhere so we remember that. The expected value is just a sum over the probabilities of getting each of the outcomes, so you can use that transition matrix. And in vector form, since I have a finite number of discrete states, I can just write that as

J^π = ḡ + γ T J^π, where the i-th element of ḡ is ḡ_i = Σ_j T_ij g_ij.

Keep my pi's everywhere. Everybody agree with those steps? OK. So what's J -- J^π?

AUDIENCE: [INAUDIBLE]

RUSS TEDRAKE: Mm-hmm.

AUDIENCE: [INAUDIBLE]

RUSS TEDRAKE: I have to go to a vector form for J, so I just put it over here. I'm saying that the i-th element of the vector ḡ -- this is my vector ḡ now -- has that T in there, absolutely. Yep. So it's the expected value of g there. OK, so what's J^π?

So lo and behold, policy evaluation on a Markov chain with known probabilities is trivial. It's this:

J^π = (I − γT)^{-1} ḡ.

It's almost free to compute. I could tell you exactly what my long-term cost is going to be just by knowing my transition matrix. That's something I think we forget, because we're going to get into models that look more complicated than that, but remember: if you have the transition matrix, it's trivial to compute the long-term cost for a Markov chain.
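Here's a minimal numerical sketch of that "almost free" computation, assuming you already have the Markov chain in hand: build the expected one-step cost vector ḡ from T and the transition costs, then solve the linear system J = ḡ + γTJ. The 3-state chain and costs below are made up purely for illustration:

```python
import numpy as np

gamma = 0.9

# Made-up 3-state Markov chain under a fixed policy pi.
T = np.array([[0.8, 0.2, 0.0],      # T[i, j] = probability of going from state i to state j
              [0.1, 0.6, 0.3],
              [0.0, 0.2, 0.8]])
g = np.array([[1.0, 2.0, 0.0],      # g[i, j] = cost of the transition i -> j
              [0.5, 1.0, 3.0],
              [0.0, 1.0, 0.2]])

g_bar = (T * g).sum(axis=1)                            # expected one-step cost from each state
J_pi = np.linalg.solve(np.eye(3) - gamma * T, g_bar)   # J = (I - gamma*T)^-1 g_bar
print(J_pi)                                            # infinite-horizon discounted cost-to-go
```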
So let me just show you why that's relevant, for instance. All right, so I told you about this the day the clock stopped. I kept telling you about it [INAUDIBLE]. And for the record, do you know what happened that day? The clock physically stopped. Michael debugged it. There was a little piece of paint that blocked it at exactly 3:05 the day I was giving that lecture. That was a hard one to catch, to be fair.

So one of my favorite models of stochastic processes in discrete time, for instance, is taking our rimless wheels, our passive walking models, and putting them on rough terrain. So this is the rimless wheel, where now, every time it takes a step, the ramp angle is drawn from some distribution. Now, in real life, maybe you don't roll rimless wheels on that kind of slope, but the contention in that paper was that actually, every floor is rough terrain, and you actually have to worry about the stochastic dynamics all the time. And you can take your compass-gait model and put it on rough terrain, and you could take the kneed model and put it on rough terrain. These are the passive things, so they can't walk on very much rough terrain before they fall down. But they can -- they can walk on rough terrain.

And then you want to ask complicated questions about this, maybe. You want to say, given my terrain was drawn from some distribution, how far should I expect my robot to walk before it falls down? That sounds like a hard question to answer. It's trivial to answer, actually. So this equation is exactly what drove that work. We built the transition matrix on the [INAUDIBLE] map, saying, given it's passive -- there are no actions to choose from -- what's the probability of being in this new state, given the terrain's drawn from some distribution and given it's at the current state. The cost function was 1 if it keeps taking a step, 0 if it fell over. And you compute this, and what does it tell you? It tells you the expected number of steps until you fall down -- period. One shot. Simple calculation. It's so simple.
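A rough sketch of that expected-steps calculation, in the same spirit (the numbers and the absorbing "fallen" state below are invented for illustration, not taken from the paper): give a reward of 1 for every successful step and 0 once fallen, and the same linear-system trick returns the expected number of steps before falling from each discretized state:

```python
import numpy as np

# Made-up discretized step-to-step chain: 3 "walking" states plus an
# absorbing "fallen" state at index 3.
T = np.array([[0.70, 0.20, 0.05, 0.05],
              [0.10, 0.70, 0.10, 0.10],
              [0.05, 0.15, 0.60, 0.20],
              [0.00, 0.00, 0.00, 1.00]])   # once fallen, stay fallen

# Reward 1 for each step taken from a walking state, 0 from the fallen state.
g_bar = np.array([1.0, 1.0, 1.0, 0.0])

# Expected number of steps before falling: restrict to the walking states
# (the absorbing state makes the undiscounted, gamma = 1 sum finite).
Tw = T[:3, :3]
expected_steps = np.linalg.solve(np.eye(3) - Tw, g_bar[:3])
print(expected_steps)                      # one number per walking state
```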
The bad part is you have to discretize your state space to do it. But if you're willing to discretize your state space, then you can make very long-term predictions about your model just like that -- to the point where we're trying to say, to people who talk about stability, people who are coming up with metrics for stability in walking systems: why not just do this? Why not actually compute, given some model of the terrain, how many steps you'd expect it to take until it falls down? That's what you'd like to compute, and it's not hard to compute, so you should do that.

So that's a clear place where policy evaluation by itself is useful. There are lots of cases where you have a robot that's doing something, it's got a control system, and you just want to verify how well it works. If you're trying to verify it in expected value, it's easy. Just do the Monte Carlo -- or sorry -- the Markov chain thing.

But what happens if I don't have a model? That's what we're supposed to be talking about today. Can we do the same thing if we don't have a model? I had to know T -- I had to know all the transition probabilities -- in order to make that calculation. What happens if we don't have a model -- we just have a robot we can run a bunch of times? How do you do it? What would you do, if I asked you -- I say, I like your robot; I want to know how long it tends to run before it fails. How would you do it? How would you do it?

There's an easy answer. You could run it a bunch of times and take an average. We know that these value functions are state-dependent, so it's a little more painful than that. Technically, you're going to have to run it a bunch of times from every single initial condition, but you could do that. And actually, that's not totally crazy.

So I want to know how much cost I'm going to incur -- in the case of the walking robot, how many steps it's going to take on average before it falls down.
First thing to try -- you don't have to know the transition matrices -- just run it a bunch of times. So if I say J_n(i) is the cost I incur the n-th time I run my robot from state i, I just keep track of the cost. I keep track of how many steps it took. I keep track of how much gold it found -- whatever your cost function is:

J_n(i) = Σ_k γ^k g_{i_k, i_{k+1}}, with i_0 = i.

The thing I'm trying to estimate is the expected value of that long-term cost, but from any one trial I get this thing out as a random variable. I could take the expected value of the random variable. I can make a nice estimate of J^π(i) by just running it a bunch of times and taking the average:

Ĵ^π(i) = (1/N) Σ_{n=1}^N J_n(i).

It doesn't sound very elegant, but it works.

AUDIENCE: [INAUDIBLE]

RUSS TEDRAKE: What? Sum over k. Good. Thank you, thank you. Good. Sum over k. [INAUDIBLE] you've corrected both simultaneously.

OK, so a couple of nuances here. So first of all, I have an infinite-horizon cost function. So this is only going to be an approximation, because I'm not going to run this forever 10 times. I'm going to run it for some finite duration 10 times. So in practice, I'm actually going to run the sum out to some big number of steps. But that's OK, because this discount factor means that a finite-trial approximation should be a pretty good estimate of the long-term cost. And if I run it from initial condition i long enough, and enough times, then I should be able to take an average and get the expected [INAUDIBLE].
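A minimal Monte Carlo sketch of that averaging idea, under some assumptions: `step` stands in for running the real robot (or a black-box simulator) one transition from a state, returning the next state and the incurred cost, and the horizon and trial count are arbitrary made-up numbers:

```python
import numpy as np

def monte_carlo_value(step, i0, gamma=0.95, num_trials=100, horizon=200):
    """Estimate J_pi(i0) by averaging truncated discounted returns.

    step(i) -> (i_next, cost): one transition of the closed-loop system,
    sampled from the real robot or simulator (no transition matrix needed).
    The horizon is finite, but with gamma < 1 the truncation error is small.
    """
    returns = []
    for _ in range(num_trials):
        i, J_n = i0, 0.0
        for k in range(horizon):
            i, cost = step(i)
            J_n += gamma**k * cost       # accumulate the discounted cost of this trial
        returns.append(J_n)
    return np.mean(returns)              # estimate of the expected long-term cost from i0
```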
769 00:52:21,980 --> 00:52:24,470 This is just a standard [INAUDIBLE] 770 00:52:24,470 --> 00:52:27,110 that you can do it more carefully. 771 00:52:31,355 --> 00:52:33,570 I could choose these to be a perfect weighting 772 00:52:33,570 --> 00:52:36,112 but in general, this is actually a pretty good approximation, 773 00:52:36,112 --> 00:52:40,610 as the number of trials goes up, to this sum without keeping 774 00:52:40,610 --> 00:52:41,412 track of every J. 775 00:52:41,412 --> 00:52:43,370 Every time, I'm just going to do moving average 776 00:52:43,370 --> 00:52:46,550 towards the new point, and by changing a small amount, 777 00:52:46,550 --> 00:52:47,600 it will converge. 778 00:52:47,600 --> 00:52:49,430 This is a low-pass filter. 779 00:52:49,430 --> 00:52:50,870 That's another way to say it. 780 00:52:50,870 --> 00:52:53,180 It's a low-pass filter that tries to get me to-- 781 00:52:53,180 --> 00:52:58,400 the mean of the J samples I'm getting in. 782 00:53:05,680 --> 00:53:08,020 So that gets rid of a little bit of bookkeeping. 783 00:53:08,020 --> 00:53:08,950 There's other things you can do. 784 00:53:08,950 --> 00:53:10,200 Now, here's a really cool one. 785 00:53:15,130 --> 00:53:18,310 Think about this, and tell me if you think it's possible. 786 00:53:18,310 --> 00:53:22,240 I'm going to tell you in a minute how that-- 787 00:53:22,240 --> 00:53:32,100 if I have two policies, I can-- 788 00:53:32,100 --> 00:53:34,140 say, pi 1 and pi 2-- 789 00:53:54,189 --> 00:53:55,200 Do you believe that? 790 00:53:58,612 --> 00:54:00,570 It's going to take a little bit more machinery, 791 00:54:00,570 --> 00:54:03,540 but just to see where we go. 792 00:54:03,540 --> 00:54:06,330 Say I have two control systems. 793 00:54:06,330 --> 00:54:10,530 I have the one that is risky, and I ran it once, 794 00:54:10,530 --> 00:54:13,072 and the thing fell down. 795 00:54:13,072 --> 00:54:15,030 So I don't actually want to run that 100 times. 796 00:54:15,030 --> 00:54:17,118 I might break my robot. 797 00:54:17,118 --> 00:54:18,660 Let's say I've got a different policy 798 00:54:18,660 --> 00:54:19,827 that I like a little better. 799 00:54:19,827 --> 00:54:21,900 It's a little safer to do evaluations on. 800 00:54:21,900 --> 00:54:25,410 Can you imagine running the safe policy, let's say, 801 00:54:25,410 --> 00:54:26,790 to learn about the risky policy? 802 00:54:30,480 --> 00:54:31,800 That's pretty cool idea, right? 803 00:54:35,270 --> 00:54:36,270 What is wrong with this? 804 00:54:48,390 --> 00:54:49,860 Typically done with a q function. 805 00:54:49,860 --> 00:54:51,610 I'll show you how to say that in a second. 806 00:54:56,283 --> 00:54:57,950 So there's lots of ways you can do that. 807 00:54:57,950 --> 00:55:02,322 You can run trials, you can keep averages, 808 00:55:02,322 --> 00:55:04,780 you can try to learn about one trial by learning the other. 809 00:55:04,780 --> 00:55:06,430 What the fundamental idea here is 810 00:55:06,430 --> 00:55:09,910 is that it requires stochasticity. 811 00:55:09,910 --> 00:55:14,680 You need that, in policy, pi 1 and pi 2 812 00:55:14,680 --> 00:55:17,140 have to change-- take the same actions 813 00:55:17,140 --> 00:55:20,020 with some non-zero probability. 814 00:55:20,020 --> 00:55:23,230 Pi 2 might be my risky policy, and every once in a while, 815 00:55:23,230 --> 00:55:26,470 with some small probability, it takes a safe action, let's say. 
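(The safe/risky example continues just below.) As a hedged aside, one standard way to use that overlap is importance weighting: reweight each safe-policy trial by how likely the risky policy would have been to take the same actions. This is my own illustration; the lecture defers its version of off-policy evaluation to the Q-function machinery mentioned in a moment.

    import numpy as np

    def off_policy_estimate(episodes, gamma=0.99):
        # episodes: trials run under the safe policy pi1; each episode is a
        # list of (prob_under_pi1, prob_under_pi2, cost) for the action taken.
        weighted_returns, weights = [], []
        for episode in episodes:
            w, J = 1.0, 0.0
            for k, (p1, p2, g) in enumerate(episode):
                w *= p2 / p1          # needs p1 > 0 wherever p2 > 0 (overlap)
                J += gamma**k * g
            weighted_returns.append(w * J)
            weights.append(w)
        # Weighted importance sampling estimate of the risky policy's cost.
        return np.sum(weighted_returns) / np.sum(weights)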
816 00:55:26,470 --> 00:55:31,080 And pi 1 is my safe policy, but every once in a while, it 817 00:55:31,080 --> 00:55:32,170 takes a risky action. 818 00:55:32,170 --> 00:55:35,150 As long as these things have some non-zero overlap 819 00:55:35,150 --> 00:55:38,230 in probability space, then I can actually 820 00:55:38,230 --> 00:55:39,760 learn about what it would have been 821 00:55:39,760 --> 00:55:41,500 to do the more risky thing by taking 822 00:55:41,500 --> 00:55:45,060 the more conservative thing. 823 00:55:45,060 --> 00:55:46,810 So policy evaluation's a really nice tool. 824 00:55:49,750 --> 00:55:51,100 But this feels slow. 825 00:55:51,100 --> 00:55:53,700 The Monte Carlo thing feels slow-- 826 00:55:53,700 --> 00:55:55,480 feels like I got to run a lot of trials 827 00:55:55,480 --> 00:55:57,280 from a lot of different initial conditions. 828 00:55:57,280 --> 00:55:59,245 And now you tell me what the cost-to-go 829 00:55:59,245 --> 00:56:01,120 is from this initial condition, and let's say 830 00:56:01,120 --> 00:56:02,260 I try this initial condition. 831 00:56:02,260 --> 00:56:02,810 What do I do? 832 00:56:02,810 --> 00:56:06,160 Do I just have to start over and run trials from the get-go 833 00:56:06,160 --> 00:56:06,760 again? 834 00:56:06,760 --> 00:56:08,480 Well, that doesn't seem very satisfying. 835 00:56:11,020 --> 00:56:14,358 Approach number two is bootstrapping. 836 00:56:31,470 --> 00:56:34,920 I call it bootstrapping. 837 00:56:34,920 --> 00:56:39,027 If I learned about the cost of being in this state, 838 00:56:39,027 --> 00:56:41,610 and I spent a long time learning about the cost-to-go of being 839 00:56:41,610 --> 00:56:43,980 in this state, and then I go back and ask what's 840 00:56:43,980 --> 00:56:47,400 the cost of being in this state, if this one 841 00:56:47,400 --> 00:56:48,840 transitions into this one, then I 842 00:56:48,840 --> 00:56:51,810 should be able to reuse what I learned about the state 843 00:56:51,810 --> 00:56:54,390 to make it faster to learn about that state. 844 00:56:54,390 --> 00:56:58,020 I didn't really plan to do it with the steps on the floor, 845 00:56:58,020 --> 00:56:59,428 but I hope that makes sense. 846 00:56:59,428 --> 00:57:00,720 Maybe I could do it on a graph. 847 00:57:00,720 --> 00:57:03,720 That's better, yeah? 848 00:57:03,720 --> 00:57:10,440 Let's say I figured out what J pi of this state is-- 849 00:57:10,440 --> 00:57:12,545 because I went from here, and I went around, 850 00:57:12,545 --> 00:57:13,920 and I did my stuff, and I learned 851 00:57:13,920 --> 00:57:16,295 pretty much what there is to learn about here [INAUDIBLE] 852 00:57:16,295 --> 00:57:17,190 policy. 853 00:57:17,190 --> 00:57:21,853 And now I want to know about this state. 854 00:57:21,853 --> 00:57:23,520 Well, I should be able to reuse the fact 855 00:57:23,520 --> 00:57:26,820 that I've learned about that to help me learn this 856 00:57:26,820 --> 00:57:29,730 more quickly-- 857 00:57:29,730 --> 00:57:30,420 reasonable idea. 858 00:57:33,120 --> 00:57:37,800 Using your estimate to inform your future estimates is 859 00:57:37,800 --> 00:57:40,080 an idea about bootstrapping, reusing-- 860 00:57:40,080 --> 00:57:45,060 building on your current guess to build a better future guess. 861 00:57:45,060 --> 00:57:49,740 And here's how it could look in the optimal control policy 862 00:57:49,740 --> 00:57:50,670 evaluation sense. 
863 00:58:05,000 --> 00:58:11,330 What if I said my online rule used 864 00:58:11,330 --> 00:58:18,380 to be this, where I've got some estimate J pi hat? 865 00:58:18,380 --> 00:58:21,440 I'm going to run from 0 to some very large number 866 00:58:21,440 --> 00:58:24,260 to estimate this, and then make the update. 867 00:58:24,260 --> 00:58:27,470 What if, instead, I just took a single step 868 00:58:27,470 --> 00:58:28,430 and I did this update? 869 00:59:08,193 --> 00:59:09,360 Does that make sense to you? 870 00:59:20,730 --> 00:59:25,530 Let's say I ask you to guess the long-term cost here. 871 00:59:25,530 --> 00:59:28,710 Instead of running all the way to the end, what if I just 872 00:59:28,710 --> 00:59:32,040 run a single step and then use as my cost 873 00:59:32,040 --> 00:59:37,500 my estimate for this, the cost of going here 874 00:59:37,500 --> 00:59:42,450 plus the gamma times the cost of doing all that? 875 00:59:42,450 --> 00:59:55,440 It's just using this one-step cost as an estimate for when I 876 00:59:55,440 --> 00:59:59,490 was going J-N of ik plus 1-- 877 00:59:59,490 --> 01:00:00,600 or sorry, J-N of ik. 878 01:00:04,640 --> 01:00:07,430 Does that makes sense? 879 01:00:07,430 --> 01:00:09,980 If I find myself in a lot of different initial conditions, 880 01:00:09,980 --> 01:00:11,780 I could take one step and then use my guess 881 01:00:11,780 --> 01:00:16,290 for the cost-to-go from that step to the rest of the time. 882 01:00:16,290 --> 01:00:18,240 Now, this starts feeling a lot more appealing, 883 01:00:18,240 --> 01:00:21,660 actually, because now I don't have to think-- 884 01:00:21,660 --> 01:00:24,300 this actually got rid of that whole episodic problem. 885 01:00:24,300 --> 01:00:27,750 I don't have to go in and run some fixed length 886 01:00:27,750 --> 01:00:31,290 trial to approximate the long-term thing. 887 01:00:31,290 --> 01:00:36,390 I just take a single step, use this as my estimate, 888 01:00:36,390 --> 01:00:38,940 and I can just keep moving through my Markov chain. 889 01:00:38,940 --> 01:00:41,250 I don't have to ever reset. 890 01:00:41,250 --> 01:00:44,798 And potentially, if I visit states often enough-- 891 01:00:44,798 --> 01:00:46,590 I won't get into all the details-- roughly, 892 01:00:46,590 --> 01:00:49,020 it involves that Markov chain being-- 893 01:00:49,020 --> 01:00:51,630 having ergodicity. 894 01:00:51,630 --> 01:00:53,550 you have to be able to visit all the states 895 01:00:53,550 --> 01:00:56,365 with some non-zero probability as you go along. 896 01:00:56,365 --> 01:00:58,740 But if you visit the states-- each state infinitely often 897 01:00:58,740 --> 01:01:03,180 is roughly the thing-- then this actually 898 01:01:03,180 --> 01:01:14,620 will converge to J pi of ik. 899 01:01:45,540 --> 01:01:48,700 So the ergodicity is actually bad news for my walking robot, 900 01:01:48,700 --> 01:01:50,897 because if my walking robot falls down, 901 01:01:50,897 --> 01:01:52,480 I'm going have to pick it back up if I 902 01:01:52,480 --> 01:01:54,055 want to get ergodicity back. 903 01:01:54,055 --> 01:01:55,930 There are robots that don't visit every state 904 01:01:55,930 --> 01:01:59,152 every arbitrarily often. 905 01:01:59,152 --> 01:02:00,610 But in the Markov chain sense, that 906 01:02:00,610 --> 01:02:02,320 doesn't seem like such a bad assumption. 
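A minimal sketch of that one-step bootstrapped update (TD(0)), assuming a table J_hat of value estimates over the discretized states and a step function that advances the real system one transition; both names are assumptions for illustration.

    def td0_update(J_hat, i, i_next, g, gamma=0.99, alpha=0.1):
        # g is the one-step cost g(i, i_next) observed on a real transition.
        td_error = g + gamma * J_hat[i_next] - J_hat[i]
        J_hat[i] += alpha * td_error
        return J_hat

    # Usage sketch: no resets, just keep walking the Markov chain.
    # while True:
    #     i_next, g = step(policy, i)      # hypothetical one-step simulator
    #     J_hat = td0_update(J_hat, i, i_next, g)
    #     i = i_next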
907 01:02:02,320 --> 01:02:04,570 And if I'm willing to take my robot when it falls down 908 01:02:04,570 --> 01:02:05,625 and pick it back up-- 909 01:02:05,625 --> 01:02:07,000 which, by the way, is about how I 910 01:02:07,000 --> 01:02:08,410 spent the last year of my PhD-- 911 01:02:10,872 --> 01:02:12,580 then actually, I can get ergodicity back. 912 01:02:17,590 --> 01:02:20,320 OK, cool-- so that makes sense, right? 913 01:02:20,320 --> 01:02:24,460 I'm going to use my existing estimate of the cost-to-go 914 01:02:24,460 --> 01:02:27,110 to bootstrap my algorithm for estimating the cost-to-go. 915 01:02:27,110 --> 01:02:27,610 Yeah? 916 01:02:27,610 --> 01:02:29,318 AUDIENCE: Does the transition [INAUDIBLE] 917 01:02:29,318 --> 01:02:30,760 come into play at all? 918 01:02:30,760 --> 01:02:32,350 RUSS TEDRAKE: It does, because I'm 919 01:02:32,350 --> 01:02:35,530 getting this from sampled data. 920 01:02:35,530 --> 01:02:36,628 So this is actually drawn. 921 01:02:36,628 --> 01:02:38,920 The expected value of this update does the right thing. 922 01:02:41,607 --> 01:02:43,440 So this update doesn't have it, because this 923 01:02:43,440 --> 01:02:45,660 is from a real trials. 924 01:02:45,660 --> 01:02:48,270 But you should think about this as a sample 925 01:02:48,270 --> 01:02:50,535 from the real distribution. 926 01:02:50,535 --> 01:02:52,410 Now, that's actually a really good way for me 927 01:02:52,410 --> 01:02:53,640 to lead into the next step. 928 01:02:53,640 --> 01:02:59,050 These algorithms tend to be a lot faster in practice than 929 01:02:59,050 --> 01:03:01,990 those algorithms-- not only are they a little bit more elegant, 930 01:03:01,990 --> 01:03:04,630 because you don't have to reset and run finite-length trials-- 931 01:03:04,630 --> 01:03:07,070 they tend to be a lot faster. 932 01:03:07,070 --> 01:03:12,280 And the reason for that is this here is really the-- 933 01:03:12,280 --> 01:03:15,130 it has the expected value of future costs built into it. 934 01:03:18,030 --> 01:03:21,420 Let me say that in the pictures. 935 01:03:21,420 --> 01:03:23,520 There's two ways, I could estimate this. 936 01:03:23,520 --> 01:03:26,320 I could get here and then I could take a single path. 937 01:03:26,320 --> 01:03:28,590 Well, this one is not rich enough for me 938 01:03:28,590 --> 01:03:30,690 to make my point here, but OK-- so I 939 01:03:30,690 --> 01:03:32,610 could take a single path through here 940 01:03:32,610 --> 01:03:38,460 and get a single sample estimating the long-term cost. 941 01:03:38,460 --> 01:03:42,510 But if I instead use J pi, J pi is the expected value 942 01:03:42,510 --> 01:03:45,330 of going around and living in this. 943 01:03:45,330 --> 01:03:48,300 So by using this update to bootstrap, 944 01:03:48,300 --> 01:03:50,070 or if I just take one step from here, 945 01:03:50,070 --> 01:03:54,000 I get for free the expected value of living over here 946 01:03:54,000 --> 01:03:56,680 for a long time. 947 01:03:56,680 --> 01:03:59,370 Does that make sense? 948 01:03:59,370 --> 01:04:01,770 So J is building up a map of the expected value, 949 01:04:01,770 --> 01:04:04,230 because it's visiting things often and it's-- drew this 950 01:04:04,230 --> 01:04:07,410 online algorithm with this low-pass filter. 951 01:04:07,410 --> 01:04:10,470 He's basically doing an expected value calculation. 
952 01:04:10,470 --> 01:04:14,550 By using my low-pass filtered [INAUDIBLE] in here, 953 01:04:14,550 --> 01:04:15,360 it's also-- 954 01:04:15,360 --> 01:04:16,527 it's getting the reward of-- 955 01:04:16,527 --> 01:04:18,485 maybe you could just say it's filtering faster. 956 01:04:18,485 --> 01:04:20,640 That's actually not a bad way to think about it. 957 01:04:20,640 --> 01:04:22,260 I've already got this thing filtered, 958 01:04:22,260 --> 01:04:24,600 so this one filters faster. 959 01:04:24,600 --> 01:04:28,510 That's a pretty reasonable way to think about it, actually. 960 01:04:28,510 --> 01:04:29,010 OK. 961 01:04:32,050 --> 01:04:35,020 So this quantity here in the brackets, 962 01:04:35,020 --> 01:04:44,660 this whole guy right here, it's a very important quantity, 963 01:04:44,660 --> 01:04:45,460 it comes up a lot. 964 01:04:45,460 --> 01:04:47,210 It's called the temporal difference error. 965 01:04:58,770 --> 01:05:03,300 It's the difference that I get from executing 966 01:05:03,300 --> 01:05:06,750 my policy for one step and then using the long-term estimate 967 01:05:06,750 --> 01:05:11,550 compared to what I have as my long-term estimate, 968 01:05:11,550 --> 01:05:13,530 temporal difference error. 969 01:05:13,530 --> 01:05:15,433 Now, if the system was deterministic 970 01:05:15,433 --> 01:05:17,850 and I had already converged, then that temporal difference 971 01:05:17,850 --> 01:05:23,670 there should be 0, because this thing should be exactly-- 972 01:05:23,670 --> 01:05:26,010 predict the long-term thing. 973 01:05:26,010 --> 01:05:29,130 If the system's stochastic, then this temporal difference error 974 01:05:29,130 --> 01:05:30,975 should be 0, on average. 975 01:05:35,080 --> 01:05:38,470 It's comparing my cost-to-go from ik, 976 01:05:38,470 --> 01:05:42,130 given my 1 step plus the cost-to-go from ik plus 1. 977 01:05:42,130 --> 01:05:46,030 So those things-- you want those to match, right? 978 01:05:46,030 --> 01:05:50,020 You want that my 1 step plus long-term prediction 979 01:05:50,020 --> 01:05:52,400 should match my long-term prediction, 980 01:05:52,400 --> 01:05:54,040 if things are right. 981 01:05:54,040 --> 01:05:56,090 They should match in expected value. 982 01:05:56,090 --> 01:05:58,565 So that thing's called the temporal difference error, 983 01:05:58,565 --> 01:06:00,940 and it's an important quantity in reinforcement learning. 984 01:06:14,010 --> 01:06:15,760 It makes sense to write that down. 985 01:06:15,760 --> 01:06:20,124 That's a reasonable estimate for Jn-- 986 01:06:20,124 --> 01:06:23,940 to take one step and then do the other one, 987 01:06:23,940 --> 01:06:25,800 but there's-- 988 01:06:25,800 --> 01:06:27,450 that seems a little bit arbitrary. 989 01:06:27,450 --> 01:06:30,952 Why don't I just do one step and then use my lookup? 990 01:06:30,952 --> 01:06:32,160 This is the way Rich says it. 991 01:06:32,160 --> 01:06:36,390 Why not do two steps and then use my value function 992 01:06:36,390 --> 01:06:37,260 to look up? 993 01:06:37,260 --> 01:06:39,510 Or three steps-- why not take, similarly, three steps, 994 01:06:39,510 --> 01:06:41,970 and then use that to look it up-- 995 01:06:41,970 --> 01:06:45,480 or 14 steps or something like that. 996 01:06:45,480 --> 01:06:49,710 Why should I arbitrarily pick this one piece of real data 997 01:06:49,710 --> 01:06:53,040 and then look ahead, instead of two real pieces of data 998 01:06:53,040 --> 01:06:53,790 and my look ahead?
999 01:06:56,732 --> 01:06:58,440 Well, there's no reason that you actually 1000 01:06:58,440 --> 01:07:01,050 have to pick like that. 1001 01:07:09,020 --> 01:07:10,760 Can I just write the inside part here? 1002 01:07:10,760 --> 01:07:20,720 I could have estimated Jn of ik as g(ik, ik 1003 01:07:20,720 --> 01:07:32,540 plus 1) plus gamma g(ik plus 1, ik plus 2) plus gamma squared 1004 01:07:32,540 --> 01:07:38,480 J pi at ik plus 2. 1005 01:07:38,480 --> 01:07:41,250 That's a perfectly good approximation too. 1006 01:07:41,250 --> 01:07:42,470 I could have done three-step. 1007 01:07:42,470 --> 01:07:43,923 I could have done four-step. 1008 01:07:43,923 --> 01:07:45,875 AUDIENCE: [INAUDIBLE] 1009 01:07:45,875 --> 01:07:47,000 RUSS TEDRAKE: Say it again. 1010 01:07:47,000 --> 01:07:48,642 AUDIENCE: [INAUDIBLE] 1011 01:07:48,642 --> 01:07:50,600 RUSS TEDRAKE: Yeah, I-- I haven't been writing it 1012 01:07:50,600 --> 01:07:53,300 that way because-- 1013 01:07:53,300 --> 01:07:55,048 yeah. 1014 01:07:55,048 --> 01:07:56,840 I would have been fine writing it that way. 1015 01:07:56,840 --> 01:07:58,460 At some point, I decided to not write pi there, 1016 01:07:58,460 --> 01:08:00,620 and I'll just stay consistent by not writing that. 1017 01:08:06,880 --> 01:08:12,400 OK, so Rich [INAUDIBLE] came up with a clever algorithm 1018 01:08:12,400 --> 01:08:17,410 that-- basically allows you to seamlessly pick between the one 1019 01:08:17,410 --> 01:08:21,340 step, two step, three step, n-step look ahead with a single 1020 01:08:21,340 --> 01:08:22,840 knob. 1021 01:08:22,840 --> 01:08:23,800 And it works. 1022 01:08:23,800 --> 01:08:29,830 It's called the TD lambda algorithm. 1023 01:08:40,410 --> 01:08:44,729 And the basic idea is that you want 1024 01:08:44,729 --> 01:08:53,180 to combine a lot of these different updates 1025 01:08:53,180 --> 01:08:54,229 into a single update. 1026 01:08:54,229 --> 01:08:55,939 It sounds really bizarre, [INAUDIBLE] 1027 01:08:55,939 --> 01:08:57,200 so let me just say it. 1028 01:08:57,200 --> 01:09:07,069 Let's say I call my estimate Jn, 1029 01:09:07,069 --> 01:09:11,074 with M-step look ahead, of ik: 1030 01:09:15,529 --> 01:09:26,810 the sum from m equals 0 to M of gamma to the m times g(ik plus m, ik plus m plus 1). 1031 01:09:39,939 --> 01:09:43,930 Big M. 1032 01:09:43,930 --> 01:09:48,580 This is a big M. Big M. Everything else is little m's. 1033 01:09:52,720 --> 01:09:53,890 That was the one step. 1034 01:09:53,890 --> 01:09:55,080 This is the two step. 1035 01:09:55,080 --> 01:09:56,830 In general, this is the M step look ahead. 1036 01:10:00,780 --> 01:10:03,690 So it turns out there's actually an efficient way 1037 01:10:03,690 --> 01:10:04,650 to compute this. 1038 01:10:23,240 --> 01:10:24,650 Let's call it something else-- 1039 01:10:24,650 --> 01:10:27,455 p, p, p. 1040 01:10:35,290 --> 01:10:38,920 This one takes a little time to digest. 1041 01:10:38,920 --> 01:10:45,910 But it turns out it's pretty efficient to calculate 1042 01:10:45,910 --> 01:10:52,930 a weighted sum of the one step, two step, three step-- 1043 01:10:52,930 --> 01:10:58,840 on to forever-- sum of look-aheads 1044 01:10:58,840 --> 01:11:01,960 parameterized by another parameter, lambda. 1045 01:11:05,020 --> 01:11:08,110 So when lambda's 1, this thing turns out 1046 01:11:08,110 --> 01:11:12,970 to be basically doing Monte Carlo. 1047 01:11:12,970 --> 01:11:16,060 And when lambda's 0, this thing basically 1048 01:11:16,060 --> 01:11:18,430 is doing just the one-step look ahead.
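A hedged reconstruction of the board notation being read out here, in my own symbols (the spoken version only gives the running-cost sum; the trailing bootstrap term is my assumption about what closes off the M-step lookahead):

    \hat{J}^{(M)}(i_k) = \sum_{m=0}^{M} \gamma^{m} \, g(i_{k+m}, i_{k+m+1}) + \gamma^{M+1} \, \hat{J}^{\pi}(i_{k+M+1})

and the lambda-weighted combination of those lookaheads, read out a moment later:

    \hat{J}^{\lambda}(i_k) = (1-\lambda) \sum_{p=1}^{\infty} \lambda^{p-1} \, \hat{J}^{(p)}(i_k)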
1049 01:11:18,430 --> 01:11:20,140 And when lambda is somewhere in between, 1050 01:11:20,140 --> 01:11:27,610 it's doing some look ahead using a few steps. 1051 01:11:27,610 --> 01:11:29,270 Does that make sense at all? 1052 01:11:29,270 --> 01:11:30,895 It's a lot of terms flying around here. 1053 01:11:33,880 --> 01:11:38,110 Even if you don't completely love that, 1054 01:11:38,110 --> 01:11:44,710 just think of my estimate, J of lambda, as being a weighted-- 1055 01:11:44,710 --> 01:11:47,770 basically something where, if lambda is 1, 1056 01:11:47,770 --> 01:11:50,470 it's going to be the very long-term look ahead. 1057 01:11:50,470 --> 01:11:53,182 If lambda is 0, it's going to be the very short-term look ahead. 1058 01:11:53,182 --> 01:11:54,640 And there's a continuum in between, 1059 01:11:54,640 --> 01:11:57,100 a continuous knob I can turn to say how far I'm 1060 01:11:57,100 --> 01:12:00,365 going to look ahead into the future as my estimate 1061 01:12:00,365 --> 01:12:01,990 that I'm going to use in that TD error. 1062 01:12:22,100 --> 01:12:23,720 And there's a whole gamut in between. 1063 01:12:26,594 --> 01:12:28,945 AUDIENCE: [INAUDIBLE] 1064 01:12:28,945 --> 01:12:30,330 RUSS TEDRAKE: Can I say it again? 1065 01:12:30,330 --> 01:12:31,205 AUDIENCE: [INAUDIBLE] 1066 01:12:31,205 --> 01:12:34,230 RUSS TEDRAKE: Or can I read it? 1067 01:12:34,230 --> 01:12:37,620 1 minus lambda is just the normalization factor. 1068 01:12:37,620 --> 01:12:40,620 p equals 1 to infinity. 1069 01:12:40,620 --> 01:12:47,045 Lambda to the p minus 1, J p-- 1070 01:12:47,045 --> 01:12:48,740 where this is the p step look ahead. 1071 01:12:57,460 --> 01:12:59,140 So this is a very famous algorithm-- 1072 01:12:59,140 --> 01:13:01,630 the TD lambda algorithm-- 1073 01:13:01,630 --> 01:13:07,630 which allows you to do policy evaluation 1074 01:13:07,630 --> 01:13:10,630 without knowing the transition matrix, 1075 01:13:10,630 --> 01:13:14,110 doing bootstrapping or Monte Carlo 1076 01:13:14,110 --> 01:13:18,220 in a simple single framework with just a parameter lambda 1077 01:13:18,220 --> 01:13:20,580 to evaluate. 1078 01:13:20,580 --> 01:13:23,760 So it's a tweak. 1079 01:13:23,760 --> 01:13:28,620 And it turns out it uses an eligibility trace, 1080 01:13:28,620 --> 01:13:30,210 just like in REINFORCE. 1081 01:13:30,210 --> 01:13:32,960 Did you get the eligibility traces, John? 1082 01:13:32,960 --> 01:13:35,190 AUDIENCE: [INAUDIBLE] 1083 01:13:35,190 --> 01:13:35,910 RUSS TEDRAKE: OK. 1084 01:13:35,910 --> 01:13:36,660 Well, that's fine. 1085 01:13:36,660 --> 01:13:41,730 So it turns out to have a really simple form. 1086 01:13:45,837 --> 01:13:47,420 I'll write it, because it's so simple, 1087 01:13:47,420 --> 01:13:50,020 but it'll also be in the notes, if you 1088 01:13:50,020 --> 01:13:51,580 want to spend more time with it here. 1089 01:15:01,820 --> 01:15:04,340 OK, two observations-- first of all, this 1090 01:15:04,340 --> 01:15:07,190 looks no harder [INAUDIBLE] than the original version I had, 1091 01:15:07,190 --> 01:15:09,190 pretty much. 1092 01:15:09,190 --> 01:15:10,940 It just requires one extra variable, which 1093 01:15:10,940 --> 01:15:14,120 is this eligibility trace. 1094 01:15:14,120 --> 01:15:15,920 What does the eligibility trace look like? 1095 01:15:28,110 --> 01:15:28,950 OK. 1096 01:15:28,950 --> 01:15:31,470 It starts off at 0. 1097 01:15:31,470 --> 01:15:36,240 There's an element for every node in the graph.
1098 01:15:36,240 --> 01:15:41,550 Every time I visit the graph-- that node in the graph, I-- 1099 01:15:41,550 --> 01:15:46,930 it goes up by 1, and then it starts 1100 01:15:46,930 --> 01:15:53,110 forgetting based on gamma and lambda, 1101 01:15:53,110 --> 01:15:54,950 as this discount factor. 1102 01:15:54,950 --> 01:15:57,340 And then, the next time I visit it, it goes up by 1. 1103 01:15:57,340 --> 01:15:59,830 If I visit it a lot, it can build up like this. 1104 01:15:59,830 --> 01:16:06,180 It's just a trace of memory of when I visited this cell. 1105 01:16:06,180 --> 01:16:09,070 Does that make sense, this dynamics here? 1106 01:16:09,070 --> 01:16:11,170 Every time I visit the cell, it goes up by 1, 1107 01:16:11,170 --> 01:16:15,670 and always, it's going down exponentially. 1108 01:16:15,670 --> 01:16:18,030 It turns out, if you just remember that, 1109 01:16:18,030 --> 01:16:22,150 the way that you've visited cells in the past, 1110 01:16:22,150 --> 01:16:26,270 decayed by this lambda-- as well as gamma-- but this lambda-- 1111 01:16:26,270 --> 01:16:28,210 which is the new term-- 1112 01:16:28,210 --> 01:16:32,260 then it's enough to [INAUDIBLE] this trivial update here, 1113 01:16:32,260 --> 01:16:34,360 scaled by the-- 1114 01:16:34,360 --> 01:16:37,510 how often I visited that cell recently. 1115 01:16:37,510 --> 01:16:41,500 It is enough to accomplish this seemingly bizarre combination 1116 01:16:41,500 --> 01:16:43,427 of short and long-term look-aheads. 1117 01:16:46,740 --> 01:16:50,460 So it's a really simple, really beautiful algorithm. 1118 01:16:50,460 --> 01:16:53,490 Just remember how-- when I visited these cells, 1119 01:16:53,490 --> 01:16:59,790 and then make this TD error update scaled by that, 1120 01:16:59,790 --> 01:17:01,883 and I've got the TD lambda algorithm. 1121 01:17:06,550 --> 01:17:07,960 And what people can do is they-- 1122 01:17:07,960 --> 01:17:18,870 people can prove that TD lambda converges-- that with the TD lambda 1123 01:17:18,870 --> 01:17:26,260 update, J hat will go to J pi from any initial conditions. 1124 01:17:26,260 --> 01:17:29,570 So you can just guess J randomly to begin with. 1125 01:17:29,570 --> 01:17:32,730 And if I run it, as I visit all these states arbitrarily often-- 1126 01:17:32,730 --> 01:17:34,950 it still makes that ergodicity assumption-- 1127 01:17:34,950 --> 01:17:39,480 then I'll get my policy evaluation out. 1128 01:17:39,480 --> 01:17:42,720 That's really cool-- simple algorithm. 1129 01:17:42,720 --> 01:17:47,310 Now, what people also realize is that, when you start out, 1130 01:17:47,310 --> 01:17:51,120 and J is randomly initialized, then it makes a lot of sense 1131 01:17:51,120 --> 01:17:56,863 to set lambda close to 1, because bootstrapping has less 1132 01:17:56,863 --> 01:17:58,030 value when I just start out. 1133 01:17:58,030 --> 01:18:00,010 My estimate is bad everywhere, so 1134 01:18:00,010 --> 01:18:04,940 why should I use my bad estimate as my predictor? 1135 01:18:04,940 --> 01:18:07,060 So you start off-- you keep lambda close to 1. 1136 01:18:07,060 --> 01:18:08,170 It does very long-term. 1137 01:18:08,170 --> 01:18:10,750 It does more Monte Carlo style updates. 1138 01:18:10,750 --> 01:18:12,430 And as this estimate starts converging 1139 01:18:12,430 --> 01:18:16,780 to the good estimate, you start turning lambda down.
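A minimal sketch of the tabular TD(lambda) update just described, with one eligibility-trace entry per discretized state; the array names and the one-step simulator are assumptions, and the course notes have the precise version. (The discussion of scheduling lambda continues below.)

    import numpy as np

    def td_lambda_step(J_hat, e, i, i_next, g, gamma=0.99, lam=0.7, alpha=0.1):
        e *= gamma * lam                 # every trace decays exponentially...
        e[i] += 1.0                      # ...and the visited state's trace goes up by 1
        td_error = g + gamma * J_hat[i_next] - J_hat[i]
        J_hat += alpha * td_error * e    # update every state, scaled by its trace
        return J_hat, e

    # Usage sketch:
    # J_hat, e = np.zeros(n_states), np.zeros(n_states)
    # while True:
    #     i_next, g = step(policy, i)    # hypothetical one-step simulator
    #     J_hat, e = td_lambda_step(J_hat, e, i, i_next, g)
    #     i = i_next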
1140 01:18:16,780 --> 01:18:20,530 And with a cleverly tuned timing of lambda, 1141 01:18:20,530 --> 01:18:22,330 you can get very fast convergence compared 1142 01:18:22,330 --> 01:18:26,290 to the Monte Carlo algorithms. 1143 01:18:26,290 --> 01:18:28,115 You more and more bootstrap. 1144 01:18:30,650 --> 01:18:31,150 Excellent. 1145 01:18:31,150 --> 01:18:33,160 Well, time's up. 1146 01:18:33,160 --> 01:18:37,120 The clock is still moving today, so I have to stop. 1147 01:18:37,120 --> 01:18:38,590 So the really cool thing-- 1148 01:18:41,630 --> 01:18:44,480 we only talked about policy evaluation today. 1149 01:18:44,480 --> 01:18:46,280 The next step is, how do you do these value 1150 01:18:46,280 --> 01:18:49,130 methods to improve your policy? 1151 01:18:49,130 --> 01:18:52,730 And it turns, out in many cases, if you make a current estimate 1152 01:18:52,730 --> 01:18:55,940 of your value function and then, on every step, 1153 01:18:55,940 --> 01:18:59,840 you try to do the greedy policy, epsilon greedy policy, 1154 01:18:59,840 --> 01:19:02,583 you basically-- you mostly exploit your current estimate 1155 01:19:02,583 --> 01:19:05,000 of the value function, then you can still prove that these 1156 01:19:05,000 --> 01:19:08,120 things-- at least on the grid, the Markov chain case-- 1157 01:19:08,120 --> 01:19:09,710 can get to their optimal-- 1158 01:19:09,710 --> 01:19:12,985 the optimal value function and the optimal policy. 1159 01:19:12,985 --> 01:19:14,360 So we'll finish that up next time 1160 01:19:14,360 --> 01:19:16,152 and get into the more interesting-- get rid 1161 01:19:16,152 --> 01:19:19,290 of these Markov chains to try to get back to the real world. 1162 01:19:19,290 --> 01:19:19,790 Good. 1163 01:19:19,790 --> 01:19:23,260 OK, see you Tuesday.