The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JOHN W. ROBERTS: We'll be talking about-- I heard last time I had bad handwriting. And I guess this isn't much improved yet, but I will try to be more deliberate if not more skilled. Stochastic-- all right.

So as you may remember, last time we were talking about the different assumptions that we've used in all the techniques we've applied so far. We assume that we have a model of the system, and that the system is deterministic-- that's not really any better handwriting, but this is the one that last time we talked about getting rid of anyway, right? Stochastic system, stochastic dynamics. So that's what we tried to remove. And then that the state is known. We've sort of already gotten rid of this one in some of our discussions.

And today we're going to talk about what you do if you don't have a model. This is something that's actually very important in a lot of interesting systems. The systems that we work on in the lab-- some of them we try to model, but some of them are hard for us to model. So dealing without a model is a very useful thing to be able to do. Hopefully, you'll all at the end appreciate the tremendous power of model-free reinforcement learning.

So the basic idea is, again, we have this policy parameterization alpha, which somehow defines our policy. In the problem sets that you recently did, it's open loop, so you just have one alpha for every time step. You can also imagine these are gains on a feedback policy, entries of the K matrix, or PD gains-- any way you want to parameterize it. And you think about how you use these parameters-- now, this is the simplest interpretation.
There are a lot more complicated ways of looking at it, but I'm going to look at the simplest way first. You send this into your system, so you can run your system with these parameters. This is, again, sort of like what you did in the problem set. You have a fixed initial condition, a fixed cost function, you give it a policy, you run it, and you see how it does. And what you get is J. You get the cost of running that policy.

So the question is this. Previously we've talked about, OK, if you have a model of the system, there are a lot of things you can do. You can do back prop to get the specific gradient, do something like SNOPT, you can do gradient descent using that. Depending on the dimensionality, you can do value iteration. So there are a lot of options when you have a model. But if you don't, if you don't know how the system works, if it's really just a black box where I have a policy parameterization and I get a cost, how do we achieve anything in that context? We don't have any sort of information about how these things relate to each other. Well, the thing is we do have some information, in that we can execute this black box. We can test it, right? We can run our policy and see how well it does.

So what would you say is the crudest thing you could do if you had a system like this, a black box? You give it an open loop tape, let's say, you run it, and it tells you the cost. What could we do?

AUDIENCE: SNOPT could also-- well, not that we'll [INAUDIBLE] SNOPT. But SNOPT could also-- or you have methods for estimating the gradient.

JOHN W. ROBERTS: You can do finite differences, right?

AUDIENCE: Yeah, do finite--

JOHN W. ROBERTS: So finite differences, exactly. So what you can do is you can say-- again, the notation we're using here is simple. A lot of times people parameterize it in a different way. But yeah. So pretty much in this context-- let's say we have a deterministic cost.
So we don't have a random system-- we'll talk about random systems later. Let's say we have a deterministic cost, which is a function of our alpha, our parameter vector. So what we do is we say, OK, let's say I have a 2D system, alpha 1 and alpha 2. Now, we don't know what this function is. But let's just say it's a simple function like this, convex, where these are the contour lines, and what we want is to get to the middle. So this is sort of the local min, and we start here.

Now, how could SNOPT get these gradients? One of the simplest things you can imagine doing is this. You measure here-- you run the system, you get J at this point. You run the system, you get J at this point. You run the system, you get J at this point. Then you take these differences, divide by your displacement, and what you get is some estimate of the local gradient. And if those displacements are small enough and your evaluations are nice enough, you can get arbitrarily close to the true gradient there. And this will tell you, OK, you want to move in this direction, right?

Now, the problem with this is that you have to do n plus 1 evaluations to get this, where n is the number of dimensions. So you sort of have to evaluate at alpha, and at alpha plus [delta, 0, 0, ...], and at alpha plus [0, delta, 0, ...], et cetera. Now, obviously, these perturbations just have to be linearly independent, actually, but you might as well do it this way. Do these finite differences and you get an estimate of the gradient. You can hand it to SNOPT, and SNOPT can try to do fancier things. Or you can do gradient descent, where you get this gradient, you compute it, and then you do an update where I say, OK, now my alpha at n plus 1 equals alpha at n plus some delta alpha. And you can say, OK, delta alpha equals negative eta times dJ/d-alpha-- that's a vector. And this eta is our learning rate. That says, OK, we have the gradient here-- how far are we going to move? Setting that can be an issue. But you update your alpha like this, and you can just keep doing that over and over again, keep evaluating over and over again. And eventually, you should move in toward the 0. You should get to a local min.
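[A minimal sketch of this finite-difference gradient descent, not from the lecture: the quadratic cost J, the step sizes, and the starting alpha are made-up stand-ins for a black box we can only evaluate.]

```python
import numpy as np

def J(alpha):
    """Hypothetical black-box cost: we can evaluate it, but we have no model of it."""
    return (alpha[0] - 1.0) ** 2 + 2.0 * (alpha[1] + 0.5) ** 2

def finite_difference_gradient(J, alpha, delta=1e-4):
    """Estimate dJ/dalpha with n + 1 evaluations: one at alpha, one per perturbed coordinate."""
    J0 = J(alpha)
    grad = np.zeros(len(alpha))
    for i in range(len(alpha)):
        perturbed = alpha.copy()
        perturbed[i] += delta                      # alpha + [0, ..., delta, ..., 0]
        grad[i] = (J(perturbed) - J0) / delta
    return grad

alpha = np.array([2.0, 2.0])                       # initial policy parameters
eta = 0.1                                          # learning rate
for step in range(200):
    alpha = alpha - eta * finite_difference_gradient(J, alpha)   # delta alpha = -eta dJ/dalpha
print(alpha, J(alpha))                             # ends near the local min at [1.0, -0.5]
```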
The thing is that doing n plus 1 evaluations every time is expensive. Now, you could cut that down a bit if you were to reuse some evaluations and things like that. But the point is that you have to do a lot of evaluations to get this local information. And if you move very far, you sort of have to discard those and do all those evaluations again. So you're doing a lot of evaluation to get an accurate estimate of the gradient right here, then you're throwing a lot of it away when you move, and the gradient could change. In that sense, doing all these evaluations is maybe wasteful, because you're being more careful than you have to be, and then you're just going to lose that information once you move somewhere else and have to evaluate again.

So there's another thing. This one, you could say, is even more crude. At least in evolutionary algorithm circles, I think they call it just hill climbing-- I mean, all of these things are sort of hill climbing, or valley descending. What you can also imagine doing is just having a point here, and now we just randomly perturb it. So I don't do this deterministic thing. I could randomly perturb and just be like, OK, well, what if I'm here? That's worse, right? The cost is higher. So you just throw it out, don't use it. Do it again here-- that's better. So now we just keep this. And we just do this over and over, discarding bad ones, keeping good ones, until we get back there.
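[And a sketch of that crude hill climbing on the same made-up cost: perturb randomly, keep the sample if it's better, throw it away if it's worse. The perturbation scale is an illustrative assumption.]

```python
import numpy as np

def J(alpha):
    """Hypothetical black-box cost, same stand-in as above."""
    return (alpha[0] - 1.0) ** 2 + 2.0 * (alpha[1] + 0.5) ** 2

rng = np.random.default_rng(0)
alpha = np.array([2.0, 2.0])
best_cost = J(alpha)
for step in range(1000):
    candidate = alpha + 0.1 * rng.standard_normal(alpha.shape)   # random perturbation
    cost = J(candidate)
    if cost < best_cost:                  # keep the samples that get better...
        alpha, best_cost = candidate, cost
    # ...and discard the worse ones, along with the information they carried
print(alpha, best_cost)
```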
But the thing is that there you're doing all these evaluations, and when they get worse, you're just throwing them out and acting like they give you no information. But there is information in that. Even if it gets worse-- by how much it gets worse, how much it gets better, there's information in all of that. You're getting information every time you do an evaluation. So it's wasteful to just throw it away and cast out the samples that do worse.

So that's the idea of stochastic gradient descent: instead of doing this deterministic evaluation of the local gradient, we're going to randomize the system, we're going to get an estimate of the gradient stochastically, and then we're going to follow that. And we're going to get as much information out of each evaluation as possible. That's one of the important things-- generally in these systems, the evaluation is all of the cost. Pretty much everything is dominated by checking the cost of the [INAUDIBLE] policy. So you want to get as much as you can out of each one. And stochastic gradient descent is a powerful way of doing that when you have no model. It's definitely more efficient than hill climbing.

So now the question is, what is the appropriate process for doing this? How do we randomly sample these guys and actually improve our policy? So I'm going to write down an update. This is a common update-- it's the weight perturbation update. It also shows up in an identical form in REINFORCE, if you see any of those. We'll talk about all of those. But you can look at the weight perturbation update--

Is my handwriting at all legible?

[INTERPOSING VOICES]

JOHN W. ROBERTS: Yeah? OK. So you want to look at this weight perturbation update. Take my word for now that this makes sense.
We're changing the alpha a bit: delta alpha equals negative eta, times the quantity J of alpha plus z minus J of alpha, times z. So I'm saying, change your alpha. Here we have the same learning rate eta, like in the deterministic gradient descent. And then here's where you evaluate. And this z is noise. So when you perturb your policy, this is the vector of how you perturb that alpha vector. So this is a z, this is a z, this is a z-- those z's are these perturbations.

So a simple and very common way is to have the vector z distributed as a multivariate Gaussian, where each element of z is iid with the same standard deviation, mean 0. And so you draw a sample z from that, you evaluate how well it does, you evaluate how well you do with your nominal policy right now, calculate this difference, and then you move in the direction of z.

So I'll try to draw this in 1D and then 2D so it makes sense. Here in 1D, you can say this is our one alpha, and this is J, so here's our cost function. We'll be here. Now, our z in this case is just a scalar. But our z is going to be mean 0, and it's going to have a Gaussian distribution. When you sample from this, you evaluate-- I should actually probably keep that update up at the same time. So you sample, you get this change. This is my J of alpha. This right here is my J of alpha plus z. And imagine this change-- that's going to say, OK, the cost went up. It went up by some amount. That's the difference. I'm going to move in the direction of z. z is just a scalar here, so it's just going to be the sign and the magnitude of it. And then I'm going to move sort of opposite this. So I perturbed by z, z went in this direction, and it got bigger. That change, then, is a positive number. So we're going to move down by an amount eta times that change, right?
And so if it gets a lot worse, we move down farther. If it gets a bit worse, we move down a bit. Does that make sense? And so when you're measuring here, you're going to get a small change for the same z. When you, again, draw your Gaussian around that, if you get a small change, you're just going to move a bit. When I'm here, where it's really steep, I'll get the same perturbation, I'm going to get a bigger change, and I'm going to move even farther. And I'll update here. And so if I do this a bunch of times, you can imagine I descend into the local min. Does that make sense? And every time, you're drawing this stochastically. So you're not doing this [INAUDIBLE] term thing. Every time you do it, you could be updating, you could try worse, you could try better. But stochastically, you can sort of intuitively see why it's going to descend. Does that make sense?

AUDIENCE: This is heavily depending on the fact that the function is sort of [INAUDIBLE] direction?

JOHN W. ROBERTS: It's sort of what?

AUDIENCE: It's like the function that you're looking at, if you're looking-- if you increase, like in this case, alpha in one direction versus the other one, the changes are sort of similar in both ways.

JOHN W. ROBERTS: No. I mean, that can affect the performance of the algorithm. But yeah, I can draw that. These are sort of common pathological cases. Let's look at it in 2D. So this is what you're saying, right? Now, the ideal one would be-- again, we can draw a contour map again. Now, this is about the same, right? You're saying this is about isotropic, or whatever. You're here, you perturb yourself randomly, so your Gaussian is going to put you anywhere around here. And you measure somewhere. You get better, so you're going to move in that direction, depending on what eta is, and I'll get an update.
You're saying, well, what happens if we're actually in trouble, and we have something that looks like this, right? Is that a problem? Well, that can hurt the convergence of it. It can be slower. But it still works. Because you can see-- let's say I'm here. Now, it's really steep here, and it's really shallow here. So what's going to happen is, when I perturb it, my perturbation in this direction is going to have an effect-- maybe it's relatively shallow-- but then in this direction it's going to be very sensitive. And so I'll move more in this direction, and I'll move very far. I'm going to go down here first-- I'm going to descend the steep part-- and then slowly converge in on the shallow part. That's called, I think, the banana problem, where you have this massive bowl, and you go really quickly right down here, and then really slowly. And so the thing is that if it's all very shallow, that's not a problem. You can make your learning rate bigger, you can make your samples go further out, and then it just doesn't matter, right? But this asymmetry is an issue. Now, there are some ways of dealing with that if you have an idea of how asymmetric it is. We can talk about this later. But it'll still descend.

And actually, you can show, and I'm about to show, that this update in expectation moves in the direction of the true gradient. So, I mean, randomly it can bounce all around. But in expectation, it will move in the right direction. And if you're doing deterministic evaluations-- well, we're going to do a linear analysis at first-- you actually can show that it'll always move within 90 degrees of the true gradient. So you'll never actually get worse. You can move parallel and not improve, but you'll never move with the wrong sign relative to the true gradient.
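[Here is the weight perturbation update itself as a sketch, again on a made-up quadratic cost; eta, sigma, and the number of steps are illustrative assumptions.]

```python
import numpy as np

def J(alpha):
    """Hypothetical black-box cost; only evaluations are available."""
    return (alpha[0] - 1.0) ** 2 + 2.0 * (alpha[1] + 0.5) ** 2

rng = np.random.default_rng(0)
alpha = np.array([2.0, 2.0])
eta, sigma = 0.5, 0.1
for step in range(3000):
    z = sigma * rng.standard_normal(alpha.shape)           # z ~ N(0, sigma^2 I), elements iid
    alpha = alpha - eta * (J(alpha + z) - J(alpha)) * z     # move along z, scaled by the change in cost
print(alpha, J(alpha))                                      # wanders downhill toward [1.0, -0.5]
```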
All right. So then, yeah, let's look at why that is in some detail. So again, our delta alpha is the same as up there, so I won't waste time rewriting it. And let's look at a first-order Taylor expansion of our cost function-- look at it locally, where you keep the linear term. So we linearize around alpha. Our J of alpha plus z is approximately equal to-- for small z-- J of alpha, plus dJ/d-alpha transpose times z. So that's the first-order Taylor expansion.

Now, if we plug this in for J of alpha plus z in that update, the J of alpha term cancels out, and we're going to get that delta alpha is approximately negative eta times (dJ/d-alpha transpose z) times z. So what does this look like? This is sort of like a dot product between the gradient with respect to alpha and our noise vector, all right? And so this is going to be about equal to negative eta times the sum, for i from 1 to N, of dJ/d-alpha_i times z_i, all times the vector z.

So if we multiply that out, we're going to get this vector, and the eta, because you're multiplying that coefficient times each term individually. You're going to get the vector whose first component is that sum over i of dJ/d-alpha_i z_i, times z_1, and so on, and the same thing for the last one-- the sum over i of dJ/d-alpha_i z_i, times z_N.

Now, if we take the expectation of this-- we know that each z_i is iid. Do you know iid? They're all distributed with the exact same distribution, all mean 0, Gaussian, standard deviation sigma, and they're all independent. So we can take the expectation of delta alpha. We can pull that eta out front, because expectation is linear. And what you'll get is, again, the sum over i of dJ/d-alpha_i-- that's not a random variable, so pull it out of the expectation--
times the expectation of z_i times z_1, for that first component. Now, this sum goes through all the i's, but the first component only has that z_1, right? Now, z_i and z_1 are independent and mean 0. So you can split these up, and you're going to get that they're 0 for every term except the term where i equals 1-- and in the second component the term where i equals 2, et cetera, right? All the other terms are going to go to 0. So it's easy, then. To get the expectations, you go through the sum, and you're going to see that you only have the terms with the expectation of z_1 squared, the expectation of z_2 squared, and so on.

Now, the expectation-- again, maybe you remember that the variance equals the expected value of x squared, minus the square of the expected value of x, right? Now, we're mean 0, so that second part is 0. Our variance is sigma squared, so our expected value of x squared is sigma squared. So that means each one of these expectations is going to be sigma squared. So you're going to end up with negative eta-- they all have the same sigma, so we can pull that out-- times sigma squared, times the vector dJ/d-alpha_1, dJ/d-alpha_2, et cetera. So you're going to get dJ/d-alpha.

So the expectation of this update, when we look at it in this linear sense, is negative eta sigma squared times dJ/d-alpha. Those are just scalars-- they just change the magnitude of it. But it's in the direction of the gradient. And eta is our parameter; we can control it. Does that make sense?

AUDIENCE: Is that sigma squared?

JOHN W. ROBERTS: Yes, yes. Sorry. Yeah, sorry. Yeah. So the noise you use pops out here. One comment-- oftentimes, when we look at this algorithm in a different way, people write the update with eta over sigma squared, your noise. And then that cancels out that sigma squared, and you purely just get eta times dJ/d-alpha. So you can put that in, too, if you want it to really just be eta times your true gradient. But the important thing is that you'll move, in expectation, in the direction of the true gradient.
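[You can check that expectation claim numerically. This sketch, with made-up numbers, averages the update over many sampled z's at a fixed alpha and compares it against negative eta sigma squared dJ/d-alpha; the analytic gradient is only used for the comparison, not by the update.]

```python
import numpy as np

def J(alpha):
    return (alpha[0] - 1.0) ** 2 + 2.0 * (alpha[1] + 0.5) ** 2

def grad_J(alpha):
    """Analytic gradient of the made-up cost, used only to check the expectation."""
    return np.array([2.0 * (alpha[0] - 1.0), 4.0 * (alpha[1] + 0.5)])

rng = np.random.default_rng(0)
alpha = np.array([2.0, 2.0])
eta, sigma = 0.5, 0.1
updates = np.zeros((200000, 2))
for k in range(len(updates)):
    z = sigma * rng.standard_normal(2)
    updates[k] = -eta * (J(alpha + z) - J(alpha)) * z
print(updates.mean(axis=0))                # empirical  E[delta alpha]
print(-eta * sigma**2 * grad_J(alpha))     # predicted -eta sigma^2 dJ/dalpha
```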
So there are a couple of interesting properties to this. Here, you see we still have to do two evaluations to get the update, right? If we want to cancel out that J of alpha term, we're going to have to evaluate twice. Now, it doesn't matter how many dimensions we have-- we only have to evaluate it twice. But we still have to evaluate it two times. And the question is, well, what happens if you don't evaluate it at J of alpha? What happens if you only evaluate it once? Well, that's a very common thing to do, actually, and it doesn't affect your expectation at all.

Lots of times, instead of this perfect baseline where you evaluate it, people average the last several evaluations to get that baseline-- oh, sorry, I don't think I defined baseline. This term right here, whatever it is, is your baseline. Now, it doesn't have to be J of alpha. It can be an exponentially decaying average of your last several evaluations. That's going to be approximately J of alpha. It won't be perfect, but the point is that it's not going to affect the expectation, and we're going to see that.

Maybe you'd expect that you need to get rid of that term for you to still move in the direction of your gradient. Because you can imagine, if you don't have that, if you don't know that term, and the cost is always positive-- I'll draw a diagram to make this clear. If you don't have that, and you're here-- let's say I just make that baseline 0. If I evaluate here, that's going to be a positive number, so I'm moving in the opposite direction. If I evaluate here, it's also going to be a positive number, so you're going to move in the opposite direction. So maybe you think, oh, without that baseline we could be in bad shape.
But actually, you'll move more in this direction when you do that sample than you move in this direction when you do the other sample. And so that scaling-- the fact that you move proportional to how big the change in your cost is-- means that in expectation, you'll still move in the direction of the true gradient. Now, in practice you won't do as well, and it makes sense that you won't do as well-- when you think about it, that's going to be bouncing all around crazily. But it'll still move in the direction of the gradient. And you don't just have to take my word for that.

If you look at this update again, we can do the linear expansion again, and you'll get negative eta times (dJ/d-alpha transpose z, plus, say, some scalar e-- the error in the baseline) times z. And that e is uncorrelated with the noise. That's the important thing: it's uncorrelated with the noise z. Now, use expectation again. Expectation is linear. So we have the expectation of the first term-- that's the same as it was before, that's the gradient. And then we have the expectation of negative eta e z. Now, e is uncorrelated with the noise, and these are both scalars, so you can pull them apart. The expectation of z-- it's mean 0. So this won't affect it at all.

So really, your expected update will not depend at all on what you use here. You could put a constant there. You could put in the exact value. You could put in some decaying average-- anything you want. It will still move, in expectation, in the right direction. But in practice, it can make a huge difference. I don't know if anyone's implemented these things on-- but a good baseline can be the difference between success and getting completely stuck and not moving anywhere. So if you do small updates, you should still be OK. But performance can depend a lot on getting a good baseline. Or it can depend a lot-- sometimes it doesn't matter.

Right. So again, a common thing to do here is that you're evaluating, and you're updating.
Let's say every time I do one evaluation, I update. If I take my last 10 evaluations and average them with decaying weights, so that the most recent one is the most heavily weighted, then you'll get an approximation of what the cost should be around here. And then I update based on that. And that way, you don't have to evaluate twice every time, so you can actually get improved performance. And it's still going to work.

And another cool thing-- this is when we go back to our assumptions about being deterministic. It doesn't have to be deterministic, either. Let's say, in the same way, we put noise into the evaluation-- again, a scalar noise w. Oh, I just got color. Now, that's going to show up in here again. Now it's a random variable, so it has an expectation. But if they're uncorrelated, we can split them up. That term will be equal to negative eta times the expectation of w times the expectation of z. Now, we know that z is mean 0 again, so that's 0. So it's not going to affect it either. We're still going to get the gradient term. And so you can add additive random noise, and you'll still move, in expectation, in the direction of the gradient.

So that's sort of cool. This is quite robust. You can have these errors in the baseline, you can have noisy evaluations, you can have all sorts of these things, and still, in expectation, it will move in the right direction. So that's nice. We're going to see that that has a lot of practical benefits.
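[A sketch of that one-evaluation-per-update variant, with an exponentially decaying average of past costs standing in for J of alpha as the baseline, and mean-zero noise added to every evaluation; the smoothing factor, noise level, and cost are all illustrative assumptions.]

```python
import numpy as np

def J_noisy(alpha, rng):
    """Hypothetical noisy black-box evaluation: true cost plus mean-zero measurement noise w."""
    true_cost = (alpha[0] - 1.0) ** 2 + 2.0 * (alpha[1] + 0.5) ** 2
    return true_cost + 0.05 * rng.standard_normal()

rng = np.random.default_rng(0)
alpha = np.array([2.0, 2.0])
eta, sigma = 0.5, 0.1
baseline = J_noisy(alpha, rng)                    # seed the baseline with one evaluation
for step in range(5000):
    z = sigma * rng.standard_normal(alpha.shape)
    cost = J_noisy(alpha + z, rng)                # the only evaluation for this update
    alpha = alpha - eta * (cost - baseline) * z
    baseline = 0.9 * baseline + 0.1 * cost        # decaying average of recent evaluations
print(alpha)                                      # still ends near [1.0, -0.5], with some jitter
```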
Is everybody with me here? I don't know if I went through this quickly or if-- everyone's sort of being quiet. They look sort of--

AUDIENCE: Is w the baseline there?

JOHN W. ROBERTS: No, no, sorry. This w-- I changed it to noise. Sorry, this is noise. Maybe you'd prefer it to be called xi or something like that. But this is just added noise. So you could say that w is drawn from-- it doesn't really matter what the distribution is, as long as it's uncorrelated. We could say it's drawn from some other Gaussian. And its expectation-- I mean, the expectation of this really can be 0, too. Because if it's not mean-0 noise, then you might as well just put that mean into your cost function and make it mean 0 again, right? Yes?

AUDIENCE: So the idea is to add this into the term J alpha? Or replace the term J alpha with a different baseline?

JOHN W. ROBERTS: Replace it, right.

AUDIENCE: OK. And then so what cancels-- so when we talk about the Taylor expansion? What cancels-- what--

JOHN W. ROBERTS: Nothing. Nothing cancels it. You see, that's the thing. Yeah, so I put an e here-- maybe I'm reusing too many things.

AUDIENCE: Oh, is it that J alpha is also uncorrelated with z?

JOHN W. ROBERTS: Well, J alpha is just a scalar, right? I mean, it is some number. So it is--

AUDIENCE: z is mean 0, so.

JOHN W. ROBERTS: Yeah, so z is mean 0. So whether we put in J alpha, or we put in an estimate of J alpha that has some error-- then our J alpha minus this is going to be some number, and it doesn't matter. If we just put in nothing at all, then our error is sort of that whole J alpha term. That J alpha term is just, again, some number that's uncorrelated, so the expectation gets rid of it. Does that make sense? Everyone looks sort of just--

AUDIENCE: So actually, putting another constant in that equation for the update makes you move more in some random z-direction. But on average, you're still going down the gradient the same way.

JOHN W. ROBERTS: Yeah. I mean, you can move more, yeah. If you put some giant constant in every time you update, maybe you'll bounce around farther.
But on average, you'll still move in the right direction, because you'll move farther in the right direction than you move in the wrong direction. So they sort of cancel out. So everybody is on board here? OK. I just really want you to--

AUDIENCE: Why wouldn't you include the actual J alpha?

JOHN W. ROBERTS: Well, because if you get it by evaluating the function-- if you run a policy, it can be expensive to get that J alpha, right? For example, I used this in some work I did where we had this flapping thing. I'll show you videos of it-- maybe I'll start setting that up right now. But we have this flapping system. And we've sort of souped it up now so it's a bit quicker, but it used to be that every time I wanted to evaluate the function, I had to sit there for 4 minutes and have this plate flap in this water and measure how quickly it was going, all these things. And so to evaluate that function once took me 4 minutes. So avoiding evaluations is important. And if you can just take your several previous evaluations and average them together-- now, it's not going to be perfect, but maybe it's an OK estimate, and then you don't have to spend any more time. And so in that sense, it's cheaper.

Please ask as many questions as possible, because this is--

AUDIENCE: But at some point you have to measure every time, right?

JOHN W. ROBERTS: You have to. Yeah, you have to measure every time when you want to do an update. But the thing is that-- here, let me draw a tiny one. The question is, if I have some estimate of that-- let's say my current alpha is here. Now, I need to randomly sample something, so I have to do that evaluation. Now, the question is, do I have to evaluate it here, too? Because this is my J of alpha. Do I evaluate that? Now, I could estimate this, because I have a bunch of other evaluations from however I got here, right?
So I've already evaluated. If I average those together, I'll get a pretty good idea of what this is. If I wanted to get it exactly, I'd have to run my system here, and then run it again here. And so every update would require two evaluations as opposed to just one. Now, sometimes it still makes sense to do that evaluation, though. Depending on how your system is-- if it's really noisy, if you have to do really big updates-- it makes sense.

AUDIENCE: [INAUDIBLE] using this delta alpha would you calculate [INAUDIBLE]?

JOHN W. ROBERTS: Pardon [INAUDIBLE]?

AUDIENCE: Yes.

JOHN W. ROBERTS: I'm sorry, I didn't hear what you said.

AUDIENCE: This new alpha that we have, that we have the [INAUDIBLE] before--

JOHN W. ROBERTS: This one? Yeah.

AUDIENCE: You calculate it by having a previous alpha, and then we did this thing, and--

JOHN W. ROBERTS: And I moved in that direction, right.

AUDIENCE: Right. But you're saying that you don't want to calculate the value for this new alpha. Instead we use, for example, the past 10 evaluations of J of alpha, and use that as your estimate.

JOHN W. ROBERTS: Yeah. You're saying that doesn't make sense to you?

AUDIENCE: It does make sense. In some cases I can think [INAUDIBLE] actually [INAUDIBLE] if the change-- a small change in alpha would have a huge effect on the end value [INAUDIBLE] from J-- like, if you have a very discrete-- like, [INAUDIBLE] condition pass over [INAUDIBLE].

JOHN W. ROBERTS: If you move very violently, yeah. So that's a good example in practice. I mean, there are things that we have in the theory, like this expectation stuff, and there are things that I've applied to several systems. And in practice, when you have really bad policies, and you need to move really far in state space-- let's say that right now you're trying to swing up a cart-pole, and you're not going anywhere near the top.
And your reward function doesn't have very smooth gradients, so you can't just swing up bit by bit. Well, a good thing to do is to put in possibly very big noise, a very big eta, and then do these two evaluations. Because it's going to change so much every time you do it-- for example, if you jump and suddenly you're doing a lot better, then your previous average is not going to be representative. And then you can actually bounce around. You can bounce around so violently in this big space of policies that you never improve, right? Maybe I should draw a diagram to make what I'm saying more clear. But the key thing is that, yeah, if you're making these really big jumps, and your cost is changing a lot every time, and you still want to move in the right direction, doing two evaluations can make sense. Because if you're stuck where you don't have good gradients in your cost function, a bunch of little updates that would slowly climb aren't going to give you anything, because maybe it's not even differentiable. Maybe you have some sort of discrete way of measuring reward, like how many time steps you spend in some goal region or something, and if you don't have any time steps there, there's no gradient at all right now. And so you need to be violent enough in your policy changes that you eventually get to where you're in that goal region. And once you get into that goal region, now you have some gradients and you're in good shape.

So that's actually another thing I was going to talk about. Designing your cost function is extremely important. There are cost functions that can be extremely poor, that this can work really poorly on. And there are cost functions that can make it a lot easier. So if you have a cost function which is relatively smooth-- ideally it doesn't have this sort of banana problem-- if it's relatively similar in all the different parameters, it can work a lot better.
806 00:35:03,950 --> 00:35:06,840 And you can sort of formulate the same task lots of ways, 807 00:35:06,840 --> 00:35:09,512 since lots of times your cost function isn't what you really 808 00:35:09,512 --> 00:35:10,220 want to optimize. 809 00:35:10,220 --> 00:35:12,750 It's just a proxy for trying to get something done. 810 00:35:12,750 --> 00:35:14,410 That's what Russ talked about when he said he didn't care about optimality. 811 00:35:14,410 --> 00:35:15,993 It's like, here's a cost function that 812 00:35:15,993 --> 00:35:17,785 gives us a means of solving how to do this. 813 00:35:17,785 --> 00:35:20,035 And so there's sort of a whole bunch of cost functions 814 00:35:20,035 --> 00:35:23,070 you can imagine coming up with that try to encapsulate that task. 815 00:35:23,070 --> 00:35:25,487 Now, if you come up with-- for the perch one, for example, 816 00:35:25,487 --> 00:35:28,535 this plane perching, which is a difficult problem, 817 00:35:28,535 --> 00:35:30,410 and a problem where the models are very bad-- 818 00:35:30,410 --> 00:35:32,300 I mean, the aerodynamic models of this plane 819 00:35:32,300 --> 00:35:34,110 flying like that are extremely poor. 820 00:35:34,110 --> 00:35:36,110 And we have-- we actually have some decent ones. 821 00:35:36,110 --> 00:35:37,890 We spent a lot of work trying to get decent ones. 822 00:35:37,890 --> 00:35:39,410 But sort of the high-fidelity kind of region, 823 00:35:39,410 --> 00:35:41,285 where you really want to just get at the end, 824 00:35:41,285 --> 00:35:42,890 it's hard to model that. 825 00:35:42,890 --> 00:35:45,093 So the thing is that, what if you had a cost 826 00:35:45,093 --> 00:35:46,760 function, like what we really care about 827 00:35:46,760 --> 00:35:47,640 is hitting that perch. 828 00:35:47,640 --> 00:35:49,890 So let's say that we give you a 1 if you hit the perch 829 00:35:49,890 --> 00:35:51,165 and a 0 everywhere else. 830 00:35:51,165 --> 00:35:52,790 Now, that means until we hit the perch, 831 00:35:52,790 --> 00:35:53,720 we're getting no information. 832 00:35:53,720 --> 00:35:54,780 We could be getting really close, 833 00:35:54,780 --> 00:35:55,450 we could be really far away. 834 00:35:55,450 --> 00:35:57,080 It's not going to tell us anything. 835 00:35:57,080 --> 00:35:59,780 Now, a lot of reinforcement learning actually 836 00:35:59,780 --> 00:36:02,630 has these sorts of rewards, these sort of delayed rewards 837 00:36:02,630 --> 00:36:03,440 where you get it here, and then you 838 00:36:03,440 --> 00:36:04,940 have to sort of propagate that back. 839 00:36:04,940 --> 00:36:07,030 When you're trying to accomplish a task like that, 840 00:36:07,030 --> 00:36:08,937 that doesn't necessarily work that well. 841 00:36:08,937 --> 00:36:10,520 If you measure something like distance 842 00:36:10,520 --> 00:36:12,290 from the perch or distance from your desired state, 843 00:36:12,290 --> 00:36:14,160 if you get a little bit closer to your desired state, 844 00:36:14,160 --> 00:36:15,660 you sort of get a little bit better. 845 00:36:15,660 --> 00:36:17,570 And then you can measure the gradient. 846 00:36:17,570 --> 00:36:19,970 And so that will make a big difference, right?
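To make that contrast concrete, here is a minimal sketch in Python (not from the lecture; the perch location, the hit threshold, and the test point are made-up numbers) comparing a hit-the-perch indicator cost with a distance-shaped cost. Perturbing the policy when you are far from the perch changes only the shaped cost, so only the shaped cost gives the learner a direction to move in.

    import numpy as np

    x_perch = np.array([0.0, 0.0])    # hypothetical perch state

    def cost_sparse(x_final):
        # 1 everywhere except a small window around the perch: no signal until you hit it
        return 0.0 if np.linalg.norm(x_final - x_perch) < 0.05 else 1.0

    def cost_shaped(x_final):
        # distance to the perch: getting a little closer always shows up in the cost
        return float(np.linalg.norm(x_final - x_perch))

    x = np.array([1.0, 0.5])          # final state of some rollout, far from the perch
    dx = 0.01 * np.random.randn(2)    # effect of a small policy perturbation on the final state
    print(cost_sparse(x + dx) - cost_sparse(x))   # 0.0 almost surely: nothing to learn from
    print(cost_shaped(x + dx) - cost_shaped(x))   # small but nonzero: an informative difference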
847 00:36:19,970 --> 00:36:21,590 And so if you had something where 848 00:36:21,590 --> 00:36:23,660 you have a region of state space where you 849 00:36:23,660 --> 00:36:25,610 have a good gradient in your cost function, 850 00:36:25,610 --> 00:36:27,710 and you're out here, and not getting a gradient, 851 00:36:27,710 --> 00:36:28,655 the little perturbations you're going 852 00:36:28,655 --> 00:36:30,650 to make sort of random walk and give you 853 00:36:30,650 --> 00:36:33,140 no update at all, because you may get no change. 854 00:36:33,140 --> 00:36:34,340 But if you do really big ones, maybe you'll 855 00:36:34,340 --> 00:36:35,750 bounce into this region where 856 00:36:35,750 --> 00:36:36,875 you're getting some reward. 857 00:36:36,875 --> 00:36:38,333 And in that case, these updates are 858 00:36:38,333 --> 00:36:40,140 so big that averaging doesn't make sense, 859 00:36:40,140 --> 00:36:41,932 a baseline still gives you a big advantage, 860 00:36:41,932 --> 00:36:44,300 and maybe two evaluations is worth it. 861 00:36:44,300 --> 00:36:45,860 In some of the flapping stuff I did, 862 00:36:45,860 --> 00:36:47,660 I did two evaluations when 863 00:36:47,660 --> 00:36:49,640 I was moving very violently, because averaging 864 00:36:49,640 --> 00:36:50,630 didn't work that well. 865 00:36:50,630 --> 00:36:53,973 And getting a good baseline was worth the extra time. 866 00:36:53,973 --> 00:36:55,640 But when we ended up getting it working, 867 00:36:55,640 --> 00:36:58,700 we put it online, and we actually-- we updated it 868 00:36:58,700 --> 00:36:59,750 every time we flapped. 869 00:36:59,750 --> 00:37:02,630 So it was just 1 second, flap, update, flap, update. 870 00:37:02,630 --> 00:37:04,070 And that way, we pretty much were 871 00:37:04,070 --> 00:37:06,470 able to sort of cut our time in half, because our policies were 872 00:37:06,470 --> 00:37:08,710 very similar, so our average was a pretty good estimate. 873 00:37:08,710 --> 00:37:10,640 It's so noisy that one evaluation, anyway, 874 00:37:10,640 --> 00:37:13,265 isn't necessarily that great of an estimate of your local value 875 00:37:13,265 --> 00:37:14,630 function. 876 00:37:14,630 --> 00:37:15,130 And so yeah. 877 00:37:15,130 --> 00:37:16,463 We just did an average baseline. 878 00:37:16,463 --> 00:37:18,530 And that's sort of half the running time, right? 879 00:37:18,530 --> 00:37:20,310 And so it can be a big one. 880 00:37:20,310 --> 00:37:22,670 And so there's a lot of details when you implement it 881 00:37:22,670 --> 00:37:24,540 about the right way to sort of put this together, 882 00:37:24,540 --> 00:37:26,620 depending on what your cost function is, and how good 883 00:37:26,620 --> 00:37:28,070 of an initial policy you have-- what your initial condition 884 00:37:28,070 --> 00:37:29,450 and your policy are. 885 00:37:29,450 --> 00:37:31,490 But yeah, there's a lot of factors like that. 886 00:37:34,830 --> 00:37:35,330 All right. 887 00:37:35,330 --> 00:37:42,530 So now we can do some of-- 888 00:37:46,360 --> 00:37:46,860 sorry. 889 00:37:51,180 --> 00:37:53,970 I can do an example of this. 890 00:37:53,970 --> 00:37:56,430 So I keep on talking about this flapping system. 891 00:37:56,430 --> 00:38:02,460 That's what I worked on for my master's thesis. 892 00:38:02,460 --> 00:38:05,160 And so that's sort of what my brain always goes back to, 893 00:38:05,160 --> 00:38:07,035 particularly since we used all these methods. 894 00:38:07,035 --> 00:38:07,980 But all right.
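Before the demo, a minimal sketch of the kind of update being discussed, written here in Python rather than the code actually run in lecture; the function names, the argument list, and the exact scaling of the step are assumptions. The point is the two baseline options it exposes: average your recent unperturbed costs (one rollout per update), or spend a second rollout to evaluate the cost at the current alpha.

    import numpy as np

    def wp_update(alpha, run_cost, eta, sigma, recent_costs=None):
        """One weight-perturbation step on parameters alpha.
        run_cost(alpha) -> scalar cost J of a single rollout."""
        z = np.random.randn(*alpha.shape)          # random perturbation direction
        J_perturbed = run_cost(alpha + sigma * z)  # one rollout with the perturbed policy
        if recent_costs:                           # cheap baseline: average of recent costs
            b = float(np.mean(recent_costs))
        else:                                      # "true" baseline: a second rollout at alpha
            b = run_cost(alpha)
        alpha_new = alpha - eta * (J_perturbed - b) * z   # move downhill in expectation
        return alpha_new, J_perturbed

With the averaged baseline you keep, say, the last ten values of J_perturbed in recent_costs and pay one rollout per update; passing recent_costs=None buys the exact baseline at the price of two rollouts per update, which is the tradeoff described above.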
895 00:38:07,980 --> 00:38:12,330 So now I wonder if I can do Russ' thing where 896 00:38:12,330 --> 00:38:14,822 he makes the font really big. 897 00:38:14,822 --> 00:38:16,530 That's also-- the thing I'm about to run, 898 00:38:16,530 --> 00:38:21,240 it's this relatively simple lumped parameter simulation 899 00:38:21,240 --> 00:38:23,400 of the flapping system. 900 00:38:23,400 --> 00:38:26,320 This is a lumped parameter model of-- let me show you, 901 00:38:26,320 --> 00:38:27,600 it's pretty cool-- 902 00:38:27,600 --> 00:38:32,580 of this system, which a guy at NYU named Jun [? Zhang ?] 903 00:38:32,580 --> 00:38:37,150 built-- a robot that effectively models flapping 904 00:38:37,150 --> 00:38:37,650 flight. 905 00:38:37,650 --> 00:38:39,567 It's a very simple model. 906 00:38:39,567 --> 00:38:40,900 I'll show it to you in a second. 907 00:38:40,900 --> 00:38:42,442 But it has a lot of the same dynamics 908 00:38:42,442 --> 00:38:46,110 and a lot of the same issues as sort of a bird. 909 00:38:46,110 --> 00:38:51,840 So the system, it's a sort of a rigid plate. 910 00:38:51,840 --> 00:38:57,250 Well, the one you see here, we attached a rubber tail to it. 911 00:38:57,250 --> 00:38:59,250 But the one-- most of these results 912 00:38:59,250 --> 00:39:05,450 are actually on a rigid plate, where it heaves up and down, 913 00:39:05,450 --> 00:39:11,470 and what we can do is control the motion it follows. 914 00:39:11,470 --> 00:39:13,310 I hope that the camera can see it. 915 00:39:20,816 --> 00:39:22,798 AUDIENCE: [INAUDIBLE] moonlight [INAUDIBLE]. 916 00:39:22,798 --> 00:39:23,715 JOHN W. ROBERTS: Mood? 917 00:39:23,715 --> 00:39:24,270 AUDIENCE: Moonlight. 918 00:39:24,270 --> 00:39:25,562 JOHN W. ROBERTS: Oh, moonlight. 919 00:39:25,562 --> 00:39:26,840 I was like, mood lighting? 920 00:39:26,840 --> 00:39:27,830 OK. 921 00:39:27,830 --> 00:39:31,000 Make my lecture more enjoyable. 922 00:39:31,000 --> 00:39:32,360 All right. 923 00:39:32,360 --> 00:39:35,600 So this is the system. 924 00:39:35,600 --> 00:39:37,100 You can see we drive it up and down. 925 00:39:37,100 --> 00:39:41,120 That big cylindrical disk right there is the load cell. 926 00:39:41,120 --> 00:39:42,890 So that measures the force we're applying. 927 00:39:42,890 --> 00:39:45,182 And then what we do is we control this vertical motion. 928 00:39:45,182 --> 00:39:47,223 How we control it is-- that's an important thing. 929 00:39:47,223 --> 00:39:49,460 I talked about how the cost function matters a lot. 930 00:39:49,460 --> 00:39:51,290 Well, another thing that matters a lot 931 00:39:51,290 --> 00:39:53,930 is the parameterization of your policy. 932 00:39:53,930 --> 00:39:57,860 Now, in the last few problems we had open-loop policies, 933 00:39:57,860 --> 00:39:58,860 which are pretty simple. 934 00:39:58,860 --> 00:40:01,580 You have like 251 parameters or something like that, right? 935 00:40:01,580 --> 00:40:04,730 Now, when you're doing gradient descent using back 936 00:40:04,730 --> 00:40:06,682 prop or SNOPT, you have the exact gradient. 937 00:40:06,682 --> 00:40:08,390 It's cheap to compute the exact gradient, 938 00:40:08,390 --> 00:40:10,265 so you can sort of follow this pretty nicely.
939 00:40:10,265 --> 00:40:13,130 But when you do stochastic gradient descent, 940 00:40:13,130 --> 00:40:15,298 the probability of being perpendicular sort 941 00:40:15,298 --> 00:40:17,840 of to your gradient, or nearly perpendicular to the gradient, 942 00:40:17,840 --> 00:40:20,180 increases as the number of parameters goes up. 943 00:40:20,180 --> 00:40:22,183 So you can think, if you're on-- 944 00:40:22,183 --> 00:40:23,600 if you're doing a 1D thing, you're 945 00:40:23,600 --> 00:40:25,490 always going to move pretty much-- it doesn't 946 00:40:25,490 --> 00:40:26,310 matter if you move in the right direction 947 00:40:26,310 --> 00:40:27,190 or the wrong direction. 948 00:40:27,190 --> 00:40:29,000 That's one of the benefits of this instead of that hill 949 00:40:29,000 --> 00:40:29,810 climbing. 950 00:40:29,810 --> 00:40:31,370 But you're always trying to get moving in the right direction 951 00:40:31,370 --> 00:40:32,480 to get this measurement. 952 00:40:32,480 --> 00:40:34,140 Does that make sense? 953 00:40:34,140 --> 00:40:36,470 If you think in 2D, you have the circle. 954 00:40:36,470 --> 00:40:38,650 You're going to be moving around. 955 00:40:38,650 --> 00:40:39,902 You're going to be along-- 956 00:40:39,902 --> 00:40:42,110 close to the direction of your gradient pretty often. 957 00:40:42,110 --> 00:40:44,540 A sphere, it's a lot easier to be pretty far away. 958 00:40:44,540 --> 00:40:48,770 I mean, sort of a lot more of the samples you do 959 00:40:48,770 --> 00:40:51,440 are going to be relatively perpendicular 960 00:40:51,440 --> 00:40:53,120 to your true gradient. 961 00:40:53,120 --> 00:40:55,070 And as your dimensionality gets very high, 962 00:40:55,070 --> 00:40:57,290 a lot of your samples are relatively perpendicular. 963 00:40:57,290 --> 00:40:58,210 And the thing is that whether you 964 00:40:58,210 --> 00:41:00,180 go in the right direction or wrong direction doesn't matter. 965 00:41:00,180 --> 00:41:01,670 You'll get the same information either way. 966 00:41:01,670 --> 00:41:03,128 Going perpendicular to the gradient 967 00:41:03,128 --> 00:41:04,850 gives you no information. 968 00:41:04,850 --> 00:41:07,220 Because you'll get no change, and there's no update. 969 00:41:07,220 --> 00:41:10,490 So it's still-- the curse of dimensionality 970 00:41:10,490 --> 00:41:12,410 is alive and well. 971 00:41:12,410 --> 00:41:14,210 And very high-dimensional policies 972 00:41:14,210 --> 00:41:15,380 can be slower to learn. 973 00:41:15,380 --> 00:41:18,380 And so those 251-dimensional policies you used 974 00:41:18,380 --> 00:41:20,390 may not be the best representation, 975 00:41:20,390 --> 00:41:21,830 because they sort of-- 976 00:41:21,830 --> 00:41:25,580 I mean, you probably don't need that many parameters 977 00:41:25,580 --> 00:41:27,330 to represent what you want to do. 978 00:41:27,330 --> 00:41:29,293 So for this, what we had-- 979 00:41:29,293 --> 00:41:30,710 and this made a big difference, we 980 00:41:30,710 --> 00:41:33,002 tried different things, this one worked really nicely-- 981 00:41:33,002 --> 00:41:34,100 was a spline. 982 00:41:34,100 --> 00:41:39,050 So we said, all right, if you have time, 983 00:41:39,050 --> 00:41:41,780 I'm going to set the final time here. 984 00:41:41,780 --> 00:41:43,580 Now that's a parameter, too. 985 00:41:43,580 --> 00:41:45,620 Then this is the z height. 986 00:41:45,620 --> 00:41:48,060 It's in millimeters or whatever you want.
987 00:41:48,060 --> 00:41:49,550 And I'm going to say, OK, I'm going 988 00:41:49,550 --> 00:41:51,920 to force it to be at the beginning, in the middle, 989 00:41:51,920 --> 00:41:54,420 and at the end-- wow, that's nowhere near the middle, is it? 990 00:41:58,090 --> 00:42:02,300 I shouldn't be a carpenter in the 1200s. 991 00:42:02,300 --> 00:42:03,500 So what do we do then? 992 00:42:03,500 --> 00:42:05,258 We then have five parameters-- now, we've 993 00:42:05,258 --> 00:42:07,550 done several versions, but simple one right here-- five 994 00:42:07,550 --> 00:42:09,200 parameters that define a spline. 995 00:42:09,200 --> 00:42:10,700 So this is going to be smooth. 996 00:42:10,700 --> 00:42:13,100 You can enforce to be a periodic spline, which 997 00:42:13,100 --> 00:42:16,700 means that the knot at the end, the connection here, is 998 00:42:16,700 --> 00:42:18,570 continuously differentiable as well. 999 00:42:18,570 --> 00:42:22,040 And then we force that this parameter-- so this number p1, 1000 00:42:22,040 --> 00:42:27,180 this one is going to be the opposite of it. 1001 00:42:27,180 --> 00:42:28,550 So it's a negative p1. 1002 00:42:28,550 --> 00:42:30,240 And that's true for all these. 1003 00:42:30,240 --> 00:42:35,930 So this way, we have this relatively rich policy class 1004 00:42:35,930 --> 00:42:37,980 that has sort of the right kind of properties. 1005 00:42:37,980 --> 00:42:39,955 But we do it with only five parameters. 1006 00:42:39,955 --> 00:42:41,330 So you can imagine, if we want it 1007 00:42:41,330 --> 00:42:42,920 to be asymmetric top and bottom, that would 1008 00:42:42,920 --> 00:42:43,760 double our parameters. 1009 00:42:43,760 --> 00:42:45,885 And we probably wouldn't want to tie this guy to 0, 1010 00:42:45,885 --> 00:42:47,450 so we'd even add one more. 1011 00:42:47,450 --> 00:42:49,270 And when we have the amplitude, you 1012 00:42:49,270 --> 00:42:51,230 can either fix it or make it free. 1013 00:42:51,230 --> 00:42:52,640 I can add another parameter. 1014 00:42:52,640 --> 00:42:54,200 So you can see that as you add this richness, 1015 00:42:54,200 --> 00:42:55,580 you're going to add all these different parameters. 1016 00:42:55,580 --> 00:42:57,718 But getting-- using a spline rather than-- 1017 00:42:57,718 --> 00:43:00,260 this is the height right now, this is the height right then-- 1018 00:43:00,260 --> 00:43:01,340 it's a huge advantage. 1019 00:43:01,340 --> 00:43:02,240 Because what's the chance that you're 1020 00:43:02,240 --> 00:43:04,698 going to want it to move very violently on a sort of like 1 1021 00:43:04,698 --> 00:43:05,750 dt time scale? 1022 00:43:05,750 --> 00:43:06,950 And if you try to do that, you could actually 1023 00:43:06,950 --> 00:43:07,742 damage your system. 1024 00:43:07,742 --> 00:43:09,200 Some of the policies that I-- when 1025 00:43:09,200 --> 00:43:11,030 I was working on this parameterization, 1026 00:43:11,030 --> 00:43:14,360 I had the load cell break off and fall into the tank once. 1027 00:43:14,360 --> 00:43:16,250 Luckily it broke off the wires and lost 1028 00:43:16,250 --> 00:43:20,510 its electric connection before it fell in there, but yeah. 1029 00:43:20,510 --> 00:43:24,890 So if you come up with a parameterization that 1030 00:43:24,890 --> 00:43:30,890 appropriately captures the kind of behaviors you expect to see, 1031 00:43:30,890 --> 00:43:32,630 it can be a lot faster to learn. 
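A rough sketch of that kind of low-dimensional waveform parameterization, in Python; the knot spacing, the pinning to zero at the ends and at the middle, and the use of scipy's periodic cubic spline are assumptions about one reasonable way to realize what's described, not the code actually used on the rig.

    import numpy as np
    from scipy.interpolate import CubicSpline

    def make_waveform(params, T=1.0):
        """Build a smooth periodic stroke z(t) on [0, T] from a handful of knot
        heights.  The second half-stroke is forced to be the negative of the
        first (z(t + T/2) = -z(t)), so a few numbers give a rich up/down shape."""
        p = np.asarray(params, dtype=float)
        n = len(p)
        t_half = np.linspace(0.0, T / 2, n + 2)            # knot times for the first half
        t = np.concatenate([t_half, T / 2 + t_half[1:]])   # mirrored times for the second half
        z = np.concatenate([[0.0], p, [0.0], -p, [0.0]])   # pinned to zero at 0, T/2, T
        return CubicSpline(t, z, bc_type="periodic")       # continuously differentiable at the wrap

    stroke = make_waveform([0.8, 1.2, 0.6, 0.9, 0.4])      # five order-one parameters
    ts = np.linspace(0.0, 1.0, 200)
    zs = stroke(ts)                                        # the commanded heave trajectory

Making the period T a free parameter, or letting the top and bottom halves differ, adds parameters in exactly the way described above: more richness, higher dimension, slower learning.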
1032 00:43:32,630 --> 00:43:36,495 Now, sort of the warning, then, is that you're only 1033 00:43:36,495 --> 00:43:37,370 going to be optimal-- 1034 00:43:37,370 --> 00:43:40,037 the only thing is, you're going to get to a local minimum in this sort 1035 00:43:40,037 --> 00:43:41,438 of parameterization space. 1036 00:43:41,438 --> 00:43:43,730 So if you parameterize-- if I were to parameterize this 1037 00:43:43,730 --> 00:43:48,260 by saying, OK, well I'm only going to let it be some-- 1038 00:43:48,260 --> 00:43:50,885 let's say I was going to do like a Fourier series kind of thing 1039 00:43:50,885 --> 00:43:56,750 and say, OK, it's this plus this plus this-- 1040 00:43:56,750 --> 00:43:58,265 now, that's not very rich. 1041 00:43:58,265 --> 00:43:59,390 It's only three parameters. 1042 00:43:59,390 --> 00:44:00,050 That's good. 1043 00:44:00,050 --> 00:44:01,490 But I'm going to do all sorts of things that are probably 1044 00:44:01,490 --> 00:44:02,407 extremely sub-optimal. 1045 00:44:02,407 --> 00:44:04,880 Now, it's still going to find the best kind of behavior, 1046 00:44:04,880 --> 00:44:06,470 or the locally best kind of behavior 1047 00:44:06,470 --> 00:44:09,920 it can, using this kind of policy. 1048 00:44:09,920 --> 00:44:11,210 But it could be quite bad. 1049 00:44:11,210 --> 00:44:14,340 So the actual optimum could be very different. 1050 00:44:14,340 --> 00:44:19,640 So your policy class, you'd like it to include the optimum. 1051 00:44:19,640 --> 00:44:21,180 And so that sort of is-- it depends 1052 00:44:21,180 --> 00:44:21,970 on what the question is. 1053 00:44:21,970 --> 00:44:24,210 You sort of have to just have a feel for what is a good policy 1054 00:44:24,210 --> 00:44:24,710 class. 1055 00:44:24,710 --> 00:44:26,850 How do I get [? my ?] dimension as low as possible, 1056 00:44:26,850 --> 00:44:29,430 while still having the richness to represent a wide variety 1057 00:44:29,430 --> 00:44:32,440 of viable policies? 1058 00:44:32,440 --> 00:44:35,370 So when you're trying to implement these things, 1059 00:44:35,370 --> 00:44:38,910 that can make a big difference. 1060 00:44:38,910 --> 00:44:39,410 So yeah. 1061 00:44:39,410 --> 00:44:40,440 So we set that up. 1062 00:44:40,440 --> 00:44:44,920 And we could control the shape of that curve. 1063 00:44:44,920 --> 00:44:47,770 And so that is the policy parameterization we chose. 1064 00:44:47,770 --> 00:44:51,960 So going back to this code here. 1065 00:44:51,960 --> 00:44:58,860 Now, I think I can just run this here. 1066 00:44:58,860 --> 00:45:02,580 This is going to be doing that bit we talked about on-- 1067 00:45:02,580 --> 00:45:06,650 again, a simple lumped parameter model of that flapping system. 1068 00:45:06,650 --> 00:45:08,766 So here's our curve. 1069 00:45:08,766 --> 00:45:13,485 It's this, you see this-- well, this 1070 00:45:13,485 --> 00:45:17,653 is the forward motion of the thing as it's flapping. 1071 00:45:17,653 --> 00:45:18,820 This is the vertical motion. 1072 00:45:18,820 --> 00:45:20,640 So this is sort of the waveform it's following. 1073 00:45:20,640 --> 00:45:21,810 This is where it is in x position. 1074 00:45:21,810 --> 00:45:24,120 You can see it sort of goes fast, bounces around-- 1075 00:45:24,120 --> 00:45:26,200 sorry, this is the speed, not the position. 1076 00:45:26,200 --> 00:45:27,930 So you can see it accelerates from 0, 1077 00:45:27,930 --> 00:45:29,730 and then as it's pumping, it sort of oscillates a bit.
1078 00:45:29,730 --> 00:45:31,200 In practice, there's more inertia and everything, 1079 00:45:31,200 --> 00:45:33,325 so you don't see these high-frequency oscillations. 1080 00:45:33,325 --> 00:45:36,450 But this is just a relatively simple, explicit model. 1081 00:45:36,450 --> 00:45:38,100 This is the shape we follow. 1082 00:45:38,100 --> 00:45:39,900 So we're following that curve. 1083 00:45:39,900 --> 00:45:42,040 And we have a little bit of noise to it. 1084 00:45:42,040 --> 00:45:47,065 And let me-- so now we're going to perturb it, measure again. 1085 00:45:47,065 --> 00:45:50,965 Try to measure again, and boom, here we are. 1086 00:45:50,965 --> 00:45:52,090 We got a little bit better. 1087 00:45:52,090 --> 00:45:54,215 This is our reward, and then we did another sample, 1088 00:45:54,215 --> 00:45:55,440 and that's our reward. 1089 00:45:55,440 --> 00:45:57,120 Let's do it again, better. 1090 00:46:00,424 --> 00:46:04,560 You see we improve quite nicely. 1091 00:46:04,560 --> 00:46:07,800 And also, notice, relatively monotonically. 1092 00:46:07,800 --> 00:46:09,840 Now, you might be surprised by that. 1093 00:46:09,840 --> 00:46:11,470 Because even though we're moving-- 1094 00:46:11,470 --> 00:46:12,970 we have this sort of guarantee we'll 1095 00:46:12,970 --> 00:46:14,910 move within 90 degrees of the gradient. 1096 00:46:14,910 --> 00:46:17,190 That's what I was talking about sort of with, 1097 00:46:17,190 --> 00:46:20,037 you'll always be within 90 degrees if it's deterministic. 1098 00:46:20,037 --> 00:46:21,120 And this is deterministic. 1099 00:46:21,120 --> 00:46:24,330 But it also sort of is this linear kind of interpretation, 1100 00:46:24,330 --> 00:46:24,900 right? 1101 00:46:24,900 --> 00:46:29,130 So as you run it, you'd imagine that you could perturb yourself 1102 00:46:29,130 --> 00:46:31,380 far enough that you got worse. 1103 00:46:31,380 --> 00:46:32,880 Now, the reason that's not happening 1104 00:46:32,880 --> 00:46:35,310 is because I'm perturbing myself very small amounts, 1105 00:46:35,310 --> 00:46:37,010 and I'm updating very small amounts. 1106 00:46:37,010 --> 00:46:38,550 So all this sort of linear analysis is appropriate. 1107 00:46:38,550 --> 00:46:40,380 And actually, you can see what I talked about that, 1108 00:46:40,380 --> 00:46:43,005 that you always get pretty close to the true gradient is there. 1109 00:46:43,005 --> 00:46:45,090 Sometimes it moves up a lot, sometimes it's steep, 1110 00:46:45,090 --> 00:46:48,090 sometimes it moves up shallowly, but it does a pretty good job. 1111 00:46:48,090 --> 00:46:51,540 Now, we can change that and try to sabotage our little code 1112 00:46:51,540 --> 00:46:52,380 here. 1113 00:46:52,380 --> 00:46:54,180 Or sometimes you're OK, actually. 1114 00:46:54,180 --> 00:46:55,680 That's the thing, is that in practice lots of times 1115 00:46:55,680 --> 00:46:57,330 it's OK if it gets worse sometimes, 1116 00:46:57,330 --> 00:46:58,500 because allowing it to get worse, 1117 00:46:58,500 --> 00:46:59,917 being violent enough to get worse, 1118 00:46:59,917 --> 00:47:02,392 it'll reach the optimum a lot faster. 1119 00:47:02,392 --> 00:47:03,850 So here, this is our eta parameter. 1120 00:47:03,850 --> 00:47:10,470 Let's make it bigger a factor of-- let's make it 20.5. 1121 00:47:10,470 --> 00:47:11,817 I don't want to risk-- 1122 00:47:11,817 --> 00:47:12,900 [INAUDIBLE] not get worse. 1123 00:47:12,900 --> 00:47:13,824 AUDIENCE: Is that the noise or-- 1124 00:47:13,824 --> 00:47:14,286 JOHN W. 
ROBERTS: Pardon? 1125 00:47:14,286 --> 00:47:15,278 No, that is the update. 1126 00:47:15,278 --> 00:47:16,320 So the noise is the same. 1127 00:47:16,320 --> 00:47:17,070 This noise is still local. 1128 00:47:17,070 --> 00:47:18,590 But now we're jumping really far. 1129 00:47:18,590 --> 00:47:20,798 And so you can imagine, we're measuring the gradient. 1130 00:47:20,798 --> 00:47:21,848 We're moving really far. 1131 00:47:21,848 --> 00:47:23,640 And now where we've moved to, that gradient 1132 00:47:23,640 --> 00:47:25,050 may be a poor measurement of sort 1133 00:47:25,050 --> 00:47:28,090 of the update over that long of a scale. 1134 00:47:28,090 --> 00:47:31,510 So let's do this again. 1135 00:47:31,510 --> 00:47:32,950 This is always fun. 1136 00:47:32,950 --> 00:47:35,650 Oh, there you go, already. 1137 00:47:35,650 --> 00:47:36,290 That's better. 1138 00:47:38,985 --> 00:47:41,110 See now-- but you see, that's a huge increase then. 1139 00:47:41,110 --> 00:47:42,250 That's what I'm talking about, is that there's 1140 00:47:42,250 --> 00:47:43,210 sort of a sweet spot. 1141 00:47:43,210 --> 00:47:45,550 And you don't necessarily want monotonic increasing. 1142 00:47:45,550 --> 00:47:47,637 Like, there's limitations on how violent 1143 00:47:47,637 --> 00:47:49,720 you want it to be in practice, because on a robot, 1144 00:47:49,720 --> 00:47:52,350 a very violent policy could break your load 1145 00:47:52,350 --> 00:47:55,137 cell of and have it almost cost you $400. 1146 00:47:55,137 --> 00:47:56,470 So you don't do something crazy. 1147 00:47:56,470 --> 00:47:59,470 But there's also the willingness that-- oh, that's ugly. 1148 00:47:59,470 --> 00:48:02,500 But you see, I mean, if you bounce pretty far, 1149 00:48:02,500 --> 00:48:04,210 you can also get huge improvements. 1150 00:48:04,210 --> 00:48:06,460 And so there's sort of this-- 1151 00:48:06,460 --> 00:48:09,910 monotonicity in your increasing reward 1152 00:48:09,910 --> 00:48:12,550 is not necessarily the best way to learn, I suppose. 1153 00:48:12,550 --> 00:48:15,400 That's from the trenches. 1154 00:48:15,400 --> 00:48:18,040 I learned that the hard way through many, many hours 1155 00:48:18,040 --> 00:48:20,150 sitting in front of a machine. 1156 00:48:20,150 --> 00:48:24,115 So then the other thing that we can do is this eta. 1157 00:48:24,115 --> 00:48:26,800 Let's decrease eta. 1158 00:48:26,800 --> 00:48:28,492 And now let's make our sigma really big. 1159 00:48:28,492 --> 00:48:30,700 Now, this is going to be really crazy stuff probably. 1160 00:48:30,700 --> 00:48:32,140 But you see, now we're going to measure so far. 1161 00:48:32,140 --> 00:48:33,610 And we're going to get this sort of-- 1162 00:48:33,610 --> 00:48:34,480 we're going to try to measure the gradient, 1163 00:48:34,480 --> 00:48:36,903 but it's going to be just way off, because it's 1164 00:48:36,903 --> 00:48:39,070 moving so far that the local structure is completely 1165 00:48:39,070 --> 00:48:39,570 ignored. 1166 00:48:42,910 --> 00:48:44,880 Yeah. 1167 00:48:44,880 --> 00:48:47,130 I probably don't have to be nearly as dramatic as this 1168 00:48:47,130 --> 00:48:47,838 to make my point. 1169 00:48:47,838 --> 00:48:50,700 But, you know, it's just completely falling apart. 1170 00:48:50,700 --> 00:48:51,660 Yeah. 1171 00:48:51,660 --> 00:48:53,520 That's doing as badly as it can, I guess. 1172 00:48:53,520 --> 00:48:56,040 I think it's like almost [? no net ?] motion, so. 1173 00:48:56,040 --> 00:48:56,563 Yeah. 
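What that demo is showing can be reproduced on a toy problem. Below is a hedged sketch in Python, with a made-up quadratic cost standing in for the flapping simulation and illustrative values of eta and sigma, where small values learn slowly but steadily, an update step that is too large can diverge, and a perturbation so large that the local structure is ignored stops improving at all.

    import numpy as np

    np.random.seed(0)

    def toy_cost(alpha):
        # stand-in for one noisy rollout: a smooth bowl plus measurement noise
        return float(np.sum(alpha ** 2) + 0.01 * np.random.randn())

    def learn(eta, sigma, steps=200):
        alpha = np.ones(5)                         # order-one parameters, like the spline knots
        for _ in range(steps):                     # starting cost is about 5.0
            z = np.random.randn(alpha.size)
            J_plus = toy_cost(alpha + sigma * z)   # perturbed rollout
            b = toy_cost(alpha)                    # true baseline (two evaluations per step)
            alpha = alpha - eta * (J_plus - b) * z
        return toy_cost(alpha)

    # moderate eta and sigma learn; a huge eta diverges; a huge sigma ignores local structure
    for eta, sigma in [(0.05, 0.1), (2.0, 0.1), (0.05, 2.0)]:
        print(f"eta={eta}, sigma={sigma}: final cost {learn(eta, sigma):.3g}")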
1174 00:48:56,563 --> 00:48:58,230 So the sweet spot, then, is somewhere in 1175 00:48:58,230 --> 00:49:02,805 between, where maybe you want an eta of, say, 3, 1176 00:49:02,805 --> 00:49:15,007 and sigma, I don't know, 0.1. 1177 00:49:15,007 --> 00:49:16,590 Oh, that's probably still too violent. 1178 00:49:19,420 --> 00:49:21,940 Yeah, definitely. 1179 00:49:21,940 --> 00:49:24,920 But I think that is-- 1180 00:49:24,920 --> 00:49:26,680 that's the sort of game you have to play. 1181 00:49:26,680 --> 00:49:29,950 And how big all these things are depend on a number of factors 1182 00:49:29,950 --> 00:49:31,750 specific to your system. 1183 00:49:31,750 --> 00:49:35,390 Like, if your system-- 1184 00:49:35,390 --> 00:49:37,130 if the change is very small in magnitude, 1185 00:49:37,130 --> 00:49:38,200 if your cost function is such that it's 1186 00:49:38,200 --> 00:49:40,450 changed between 10 to the negative fifth and 10 1187 00:49:40,450 --> 00:49:43,510 to the negative fifth plus 1 times 10 1188 00:49:43,510 --> 00:49:46,090 to the negative sixth, that's changing by very small amounts, 1189 00:49:46,090 --> 00:49:46,590 right? 1190 00:49:46,590 --> 00:49:49,150 You could need a very large eta just to make up for the fact 1191 00:49:49,150 --> 00:49:50,450 that your change is so small. 1192 00:49:50,450 --> 00:49:53,500 So a big eta-- like, there's no absolute perception 1193 00:49:53,500 --> 00:49:54,460 on what is a big eta. 1194 00:49:54,460 --> 00:49:56,532 It's not like 10,000 is a huge eta. 1195 00:49:56,532 --> 00:49:58,240 10,000 could be very small eta, depending 1196 00:49:58,240 --> 00:49:59,453 on what your rewards are. 1197 00:49:59,453 --> 00:50:00,370 Same thing with sigma. 1198 00:50:00,370 --> 00:50:04,222 It depends on how big your parameters are. 1199 00:50:04,222 --> 00:50:05,680 Because, I mean, my parameters here 1200 00:50:05,680 --> 00:50:07,900 are of order one, which is sort of convenient. 1201 00:50:07,900 --> 00:50:09,760 Yeah. 1202 00:50:09,760 --> 00:50:10,270 So there. 1203 00:50:17,130 --> 00:50:17,630 Yeah. 1204 00:50:17,630 --> 00:50:19,915 So here we're learning pretty quickly. 1205 00:50:19,915 --> 00:50:21,540 And so all those sort of things, that's 1206 00:50:21,540 --> 00:50:23,240 sort of a disadvantage of this technique, 1207 00:50:23,240 --> 00:50:25,698 is that there's a lot of tuning to sort solve these things, 1208 00:50:25,698 --> 00:50:27,320 is that-- 1209 00:50:27,320 --> 00:50:29,300 where SNOPT you don't have to set-- 1210 00:50:29,300 --> 00:50:32,228 SNOPT you don't have to set a learning rate, 1211 00:50:32,228 --> 00:50:33,770 here you have to set a learning rate. 1212 00:50:33,770 --> 00:50:35,660 You have to set your sigma. 1213 00:50:35,660 --> 00:50:37,790 And when you have really sort of hard problems, 1214 00:50:37,790 --> 00:50:38,660 there's even more things you have to do. 1215 00:50:38,660 --> 00:50:39,830 Like, your policy parameterization 1216 00:50:39,830 --> 00:50:41,250 could affect a lot of things. 1217 00:50:41,250 --> 00:50:42,250 There's a lot of issues. 1218 00:50:42,250 --> 00:50:44,790 But sometimes, that's-- sometimes it's the only sort 1219 00:50:44,790 --> 00:50:45,590 of route you have. 1220 00:50:45,590 --> 00:50:48,050 Like, the best this can ever do is gradient descent. 1221 00:50:48,050 --> 00:50:50,360 It's never going to do better than gradient descent. 1222 00:50:50,360 --> 00:50:52,250 And so there's of a lot of fancy packages out there. 
1223 00:50:52,250 --> 00:50:54,250 When you have better models and stuff like that, 1224 00:50:54,250 --> 00:50:56,480 you can do better than gradient descent. 1225 00:50:56,480 --> 00:50:58,057 But while even though you're only 1226 00:50:58,057 --> 00:50:59,195 going to be able to achieve gradient descent, 1227 00:50:59,195 --> 00:51:00,290 you can achieve it despite the fact 1228 00:51:00,290 --> 00:51:01,957 that you know nothing about your system, 1229 00:51:01,957 --> 00:51:04,980 your system is stochastic, and it's noisy, like that. 1230 00:51:04,980 --> 00:51:09,120 And so in those cases, it can be a big win. 1231 00:51:09,120 --> 00:51:11,360 AUDIENCE: So when you were doing this in real life, 1232 00:51:11,360 --> 00:51:13,190 instead of in space each time, you 1233 00:51:13,190 --> 00:51:15,963 were sitting for 4 minutes in front of a flapping-- 1234 00:51:15,963 --> 00:51:18,380 JOHN W. ROBERTS: I automated pretty much everything, yeah. 1235 00:51:18,380 --> 00:51:19,310 So I was-- but yeah. 1236 00:51:19,310 --> 00:51:20,540 I mean, this is a little simulation. 1237 00:51:20,540 --> 00:51:22,730 AUDIENCE: Every interval was like actually it running and-- 1238 00:51:22,730 --> 00:51:23,210 JOHN W. ROBERTS: Oh, yeah. 1239 00:51:23,210 --> 00:51:24,585 When I pressed Space, it actually 1240 00:51:24,585 --> 00:51:26,868 does two-- because this is using a true baseline. 1241 00:51:26,868 --> 00:51:28,410 I didn't put in the average baseline. 1242 00:51:28,410 --> 00:51:30,350 So this is running it twice every time I press Space. 1243 00:51:30,350 --> 00:51:30,770 But yeah. 1244 00:51:30,770 --> 00:51:31,940 You can imagine every time I'm doing Space, 1245 00:51:31,940 --> 00:51:34,107 it does this one update and gives me that new point. 1246 00:51:34,107 --> 00:51:37,340 What I was doing is I sat there, and it would run, 1247 00:51:37,340 --> 00:51:40,490 and I'd babysit to make sure it wasn't broken. 1248 00:51:40,490 --> 00:51:42,410 And it would throw up the curve as it 1249 00:51:42,410 --> 00:51:44,540 was running so I could make sure that the encoders weren't off, 1250 00:51:44,540 --> 00:51:45,470 just sort of sitting there keeping 1251 00:51:45,470 --> 00:51:46,160 track of all these things. 1252 00:51:46,160 --> 00:51:48,140 I was like a nuclear safety technician. 1253 00:51:48,140 --> 00:51:51,656 I just eat some doughnuts and go to Moe's, and I would have 1254 00:51:51,656 --> 00:51:54,630 been a good sitcom character. 1255 00:51:54,630 --> 00:51:55,130 But yeah. 1256 00:51:55,130 --> 00:51:56,150 So I mean, pretty much just babysitting it. 1257 00:51:56,150 --> 00:51:57,320 But yeah, every time you did it, every time you 1258 00:51:57,320 --> 00:51:58,970 got a new update-- like, every one of these points 1259 00:51:58,970 --> 00:52:00,630 cost me 6 minutes or something, because it 1260 00:52:00,630 --> 00:52:02,015 was like a 3-minute run for-- basically 1261 00:52:02,015 --> 00:52:03,320 a 3-minute run for an update. 1262 00:52:03,320 --> 00:52:05,528 Because I wasn't using averaged baseline then either. 1263 00:52:05,528 --> 00:52:06,950 I was trying to be more violent. 1264 00:52:06,950 --> 00:52:07,580 But yeah. 1265 00:52:07,580 --> 00:52:10,067 And so that's the thing, is that that 1266 00:52:10,067 --> 00:52:11,913 is the perfect encapsulation of why 1267 00:52:11,913 --> 00:52:14,330 you want to use this information as carefully as possible. 1268 00:52:14,330 --> 00:52:15,890 It's because it's very expensive to get a point. 
1269 00:52:15,890 --> 00:52:17,015 Like, here it cost nothing. 1270 00:52:17,015 --> 00:52:19,190 If I were to turn off the pause, like, 1271 00:52:19,190 --> 00:52:20,690 this thing would climb up like that. 1272 00:52:23,108 --> 00:52:24,650 If you're running on a robot, like we 1273 00:52:24,650 --> 00:52:26,450 want to use this on the glider, every time you watch 1274 00:52:26,450 --> 00:52:28,242 that glider, you have to set up the glider, 1275 00:52:28,242 --> 00:52:30,747 fire it off, take all this data, and reset it by hand, 1276 00:52:30,747 --> 00:52:31,580 and launch it again. 1277 00:52:31,580 --> 00:52:34,130 So getting a data point there is going be extremely expensive. 1278 00:52:34,130 --> 00:52:36,380 And so we've actually done some work on the right ways 1279 00:52:36,380 --> 00:52:37,052 to sample. 1280 00:52:37,052 --> 00:52:39,260 You can imagine trying to come up with the right ways 1281 00:52:39,260 --> 00:52:39,968 to have a policy. 1282 00:52:39,968 --> 00:52:42,917 But sampling intelligently can save you a lot of time. 1283 00:52:42,917 --> 00:52:44,750 We sort of look at the signal-to-noise ratio 1284 00:52:44,750 --> 00:52:46,340 of these updates. 1285 00:52:46,340 --> 00:52:47,990 I don't know if anyone-- 1286 00:52:47,990 --> 00:52:50,450 some people here probably at least heard about that stuff 1287 00:52:50,450 --> 00:52:51,750 since they're in my group. 1288 00:52:51,750 --> 00:52:56,432 But probably talk about that maybe tomorrow. 1289 00:52:56,432 --> 00:52:57,890 But there's these things you can do 1290 00:52:57,890 --> 00:53:01,330 that can improve the quality of your performance a lot. 1291 00:53:01,330 --> 00:53:03,080 And actually, I test on this exact system. 1292 00:53:03,080 --> 00:53:05,060 I got put on the system, and I ran it 1293 00:53:05,060 --> 00:53:07,865 with the sort of results we had that's just, 1294 00:53:07,865 --> 00:53:09,740 this is a better way to sample, and then just 1295 00:53:09,740 --> 00:53:13,970 with a naive Gaussian kind of sampling, and you learn faster. 1296 00:53:13,970 --> 00:53:17,150 And in the context of me sitting there and spending my days 1297 00:53:17,150 --> 00:53:19,880 in New York City huddled in front of a computer, that's 1298 00:53:19,880 --> 00:53:20,480 a big win. 1299 00:53:20,480 --> 00:53:21,702 So anyway. 1300 00:53:21,702 --> 00:53:23,647 AUDIENCE: So when you say change the sampling, 1301 00:53:23,647 --> 00:53:25,730 you can just change the variance like you would do 1302 00:53:25,730 --> 00:53:26,840 to a non-Gaussian distribution? 1303 00:53:26,840 --> 00:53:28,048 JOHN W. ROBERTS: Right, yeah. 1304 00:53:28,048 --> 00:53:29,000 So that-- yeah. 1305 00:53:29,000 --> 00:53:33,180 In fact, we used a very different kind of description 1306 00:53:33,180 --> 00:53:33,680 overall. 1307 00:53:33,680 --> 00:53:35,990 You can still-- the linear analysis will still work. 1308 00:53:35,990 --> 00:53:38,660 But it's just a local-- 1309 00:53:38,660 --> 00:53:41,083 but yeah, there's work where they change-- 1310 00:53:41,083 --> 00:53:43,250 We also have something where you change the variance 1311 00:53:43,250 --> 00:53:46,730 [? to ?] the Gaussian, but your different directions 1312 00:53:46,730 --> 00:53:48,038 have different variances. 
1313 00:53:48,038 --> 00:53:50,330 And so if you sort of need an estimate of the gradient, 1314 00:53:50,330 --> 00:53:50,830 then-- 1315 00:53:50,830 --> 00:53:53,873 but you just estimate the gradient to bias your sampling 1316 00:53:53,873 --> 00:53:56,290 more in the directions where you think the gradient is, so 1317 00:53:56,290 --> 00:53:59,120 that more of your sampling is along the directions you think 1318 00:53:59,120 --> 00:54:00,110 are most interesting. 1319 00:54:00,110 --> 00:54:04,760 And so that can be a win when you have a lot of parameters 1320 00:54:04,760 --> 00:54:05,930 that aren't well correlated. 1321 00:54:05,930 --> 00:54:07,638 Like if you imagine if you had a feedback 1322 00:54:07,638 --> 00:54:09,160 policy that was dependent on-- 1323 00:54:09,160 --> 00:54:10,910 a parameter is active in a certain state-- 1324 00:54:10,910 --> 00:54:14,540 like, if I was at negative 2 to negative 5, I do this, 1325 00:54:14,540 --> 00:54:16,370 and let's say I never get there, then 1326 00:54:16,370 --> 00:54:19,040 that parameter has nothing to do with how well I perform. 1327 00:54:19,040 --> 00:54:20,752 And so if you know that, you can sort 1328 00:54:20,752 --> 00:54:23,210 of-- there's something called an eligibility you can track. 1329 00:54:23,210 --> 00:54:24,170 And you cannot update that parameter. 1330 00:54:24,170 --> 00:54:25,610 There's no reason to sort of be fooling around 1331 00:54:25,610 --> 00:54:27,690 with that parameter when it's not affecting your output. 1332 00:54:27,690 --> 00:54:29,773 And if you know that, you can do things like that. 1333 00:54:29,773 --> 00:54:32,750 And we sort of have a way, a more careful way of, 1334 00:54:32,750 --> 00:54:34,330 shaping all these-- 1335 00:54:34,330 --> 00:54:36,530 of shaping this Gaussian to learn faster. 1336 00:54:36,530 --> 00:54:38,583 And it can. 1337 00:54:38,583 --> 00:54:41,000 And also, just completely very different kind of sampling. 1338 00:54:41,000 --> 00:54:43,133 Like, it's-- well, maybe I'll try to talk about it. 1339 00:54:43,133 --> 00:54:45,050 Because I think it's pretty interesting stuff. 1340 00:54:45,050 --> 00:54:47,120 The math is a little bit nasty, but I'll 1341 00:54:47,120 --> 00:54:49,760 skip the really ugly steps. 1342 00:54:49,760 --> 00:54:53,000 And actually, the one with the different distribution 1343 00:54:53,000 --> 00:54:54,030 isn't even that nasty. 1344 00:54:54,030 --> 00:54:54,530 But yeah. 1345 00:54:54,530 --> 00:54:57,205 I mean, we ran it here and it [INAUDIBLE] improvement. 1346 00:54:57,205 --> 00:54:58,220 Yeah, so. 1347 00:54:58,220 --> 00:54:59,990 Did I answer your question? 1348 00:54:59,990 --> 00:55:00,560 Yeah. 1349 00:55:00,560 --> 00:55:01,990 It's not just changing the variances. 1350 00:55:01,990 --> 00:55:03,050 It's more complicated than that. 1351 00:55:03,050 --> 00:55:05,092 Although changing the variances can be a big win. 1352 00:55:05,092 --> 00:55:07,802 For example, if you knew you had this anisotropy, 1353 00:55:07,802 --> 00:55:10,010 and if you were to have different etas in different-- 1354 00:55:10,010 --> 00:55:11,927 if you were to scale everything in your sigma, 1355 00:55:11,927 --> 00:55:14,420 you could effectively make it squashed in, right? 1356 00:55:14,420 --> 00:55:17,790 I mean, just a rescaling of this anisotropic bowl 1357 00:55:17,790 --> 00:55:19,350 will make it right. 1358 00:55:19,350 --> 00:55:21,660 So if you can evaluate that, you can fix it. 
1359 00:55:21,660 --> 00:55:23,713 But you sort of have to know that that's going on. 1360 00:55:23,713 --> 00:55:25,380 That's like when you have adaptive 1361 00:55:25,380 --> 00:55:26,580 learning rates and stuff. 1362 00:55:26,580 --> 00:55:28,410 Gradient descent, like if you keep moving in the same direction, 1363 00:55:28,410 --> 00:55:30,000 you use a bigger learning rate. 1364 00:55:30,000 --> 00:55:30,900 You can have different learning rates 1365 00:55:30,900 --> 00:55:32,440 for different parameters. 1366 00:55:32,440 --> 00:55:33,840 This one, as you get close to a local min, 1367 00:55:33,840 --> 00:55:35,100 you'll decrease your learning rate and your noise, 1368 00:55:35,100 --> 00:55:36,850 because you don't want to sort of bounce around. 1369 00:55:36,850 --> 00:55:39,250 You don't want to be jumping all across this min. 1370 00:55:39,250 --> 00:55:39,750 So-- 1371 00:55:39,750 --> 00:55:42,626 AUDIENCE: [INAUDIBLE] talked about a basically policy 1372 00:55:42,626 --> 00:55:44,114 gradient when we were [INAUDIBLE]. 1373 00:55:44,114 --> 00:55:45,197 JOHN W. ROBERTS: Yeah, no. 1374 00:55:45,197 --> 00:55:45,780 Yeah. 1375 00:55:45,780 --> 00:55:47,700 I mean, there is-- 1376 00:55:47,700 --> 00:55:49,410 it's definitely exactly that. 1377 00:55:49,410 --> 00:55:51,002 It's just stochastic gradient. 1378 00:55:51,002 --> 00:55:52,710 But yeah, it's all policy gradient ideas. 1379 00:55:52,710 --> 00:55:55,800 Because we don't-- I mean, these things don't have a critic, 1380 00:55:55,800 --> 00:55:56,970 right? 1381 00:55:56,970 --> 00:55:59,280 But you can combine this with some policy evaluation 1382 00:55:59,280 --> 00:55:59,780 techniques. 1383 00:55:59,780 --> 00:56:01,905 And you can turn them into actor-critic algorithms. 1384 00:56:01,905 --> 00:56:04,830 A very simple-- do people know about actor-critic algorithms? 1385 00:56:04,830 --> 00:56:05,770 That's going to be a subject I think 1386 00:56:05,770 --> 00:56:06,937 Russ talks about at the end. 1387 00:56:06,937 --> 00:56:09,270 But the thing is that right now-- well, I'll 1388 00:56:09,270 --> 00:56:11,007 motivate it in a completely different way. 1389 00:56:11,007 --> 00:56:12,840 We talked about how this baseline can affect 1390 00:56:12,840 --> 00:56:14,730 your performance a lot, right? 1391 00:56:14,730 --> 00:56:17,670 Now, a good baseline can make you do a lot better. 1392 00:56:17,670 --> 00:56:20,160 Now, the thing is that, what happens 1393 00:56:20,160 --> 00:56:23,260 if-- here we start with the same initial condition every time. 1394 00:56:23,260 --> 00:56:25,167 But let's say that I actually could be in one 1395 00:56:25,167 --> 00:56:26,250 of two initial conditions. 1396 00:56:26,250 --> 00:56:29,160 I can measure this, and then I run it. 1397 00:56:29,160 --> 00:56:31,592 And the system behaves very differently, 1398 00:56:31,592 --> 00:56:33,300 or the costs are very different depending 1399 00:56:33,300 --> 00:56:34,530 on my initial condition. 1400 00:56:34,530 --> 00:56:37,153 But I want sort of the same policy to cover both of these. 1401 00:56:37,153 --> 00:56:39,570 So the thing is, if I just did this and I had one baseline 1402 00:56:39,570 --> 00:56:42,065 for both of them, and I could randomly be putting these 1403 00:56:42,065 --> 00:56:44,190 in [? different initial ?]
conditions or whatever-- 1404 00:56:44,190 --> 00:56:45,503 or I mean, I could-- 1405 00:56:45,503 --> 00:56:47,670 there's probably a more sensible way of saying this, 1406 00:56:47,670 --> 00:56:49,690 but I don't want to confuse the issue. 1407 00:56:49,690 --> 00:56:52,758 So if you could have different initial conditions, 1408 00:56:52,758 --> 00:56:54,300 you can make your baseline a function 1409 00:56:54,300 --> 00:56:55,770 of your initial condition. 1410 00:56:55,770 --> 00:56:56,687 Does that makes sense? 1411 00:56:56,687 --> 00:56:59,410 Instead of just having B, instead of evaluating it twice, 1412 00:56:59,410 --> 00:57:01,800 I could have my B of x. 1413 00:57:01,800 --> 00:57:04,090 And if my x is here, I'm going to say, OK, 1414 00:57:04,090 --> 00:57:05,490 my cost should be like this. 1415 00:57:05,490 --> 00:57:06,780 And if my x is here, then it's like, oh, 1416 00:57:06,780 --> 00:57:08,010 my cost should be like this. 1417 00:57:08,010 --> 00:57:10,890 And when I evaluate my cost, when I perturb my policy, 1418 00:57:10,890 --> 00:57:13,170 I have a better idea of how well I'm doing. 1419 00:57:13,170 --> 00:57:15,635 Does that makes sense? 1420 00:57:15,635 --> 00:57:17,640 It probably doesn't, so. 1421 00:57:17,640 --> 00:57:20,400 All right. 1422 00:57:20,400 --> 00:57:25,500 So let's say-- now, this is phase space now. 1423 00:57:28,770 --> 00:57:33,810 Now let's say that I can start in either of these. 1424 00:57:33,810 --> 00:57:37,152 And let's say that I'm trying to get to-- 1425 00:57:37,152 --> 00:57:38,822 let's draw this here. 1426 00:57:38,822 --> 00:57:39,780 I'm trying to get to 0. 1427 00:57:39,780 --> 00:57:40,920 That's my goal. 1428 00:57:40,920 --> 00:57:42,240 And I can measure this. 1429 00:57:42,240 --> 00:57:46,270 But then one of them, I'm going to go [WHOOSH] like that. 1430 00:57:46,270 --> 00:57:49,447 And the other one I'm going to have to go, I don't know, 1431 00:57:49,447 --> 00:57:51,030 through whatever torque, [? limited ?] 1432 00:57:51,030 --> 00:57:52,860 reasons like that or something. 1433 00:57:52,860 --> 00:57:55,963 So this one always costs more than this one, all right? 1434 00:57:55,963 --> 00:57:57,630 It doesn't matter how good my policy is. 1435 00:57:57,630 --> 00:57:59,713 Like, you can imagine just have a feedback policy. 1436 00:57:59,713 --> 00:58:01,920 It doesn't matter how bad it is, how good it is. 1437 00:58:01,920 --> 00:58:04,435 I mean, the same policy is always going to do worse here. 1438 00:58:04,435 --> 00:58:07,060 Now, if you believe that a good baseline improves performance-- 1439 00:58:07,060 --> 00:58:08,370 and trust me, it does-- 1440 00:58:08,370 --> 00:58:09,912 then I don't want the same baseline. 1441 00:58:09,912 --> 00:58:12,120 I don't want the same B for both of these situations. 1442 00:58:12,120 --> 00:58:14,340 Because this guy should always be around 50, 1443 00:58:14,340 --> 00:58:16,433 and this guy should always be around 20, right? 1444 00:58:16,433 --> 00:58:17,850 So what I could do is I could have 1445 00:58:17,850 --> 00:58:19,422 my baseline be a function of x. 1446 00:58:19,422 --> 00:58:21,630 And I'm going to be like, OK, here my baseline is 50, 1447 00:58:21,630 --> 00:58:23,932 here my baseline is 20. 1448 00:58:23,932 --> 00:58:25,890 And let's say I don't know that from the start. 1449 00:58:25,890 --> 00:58:30,690 I can learn my baseline while I'm learning my policy. 1450 00:58:30,690 --> 00:58:32,970 So I can use the same policy for both situations. 
1451 00:58:32,970 --> 00:58:34,658 And then over here I measure my state, 1452 00:58:34,658 --> 00:58:36,950 and I'm like, oh, over here I'm doing bad all the time. 1453 00:58:36,950 --> 00:58:38,220 So my baseline is going to be high. 1454 00:58:38,220 --> 00:58:39,720 And over here I'm always doing well, 1455 00:58:39,720 --> 00:58:41,760 so my baseline is going to be low. 1456 00:58:41,760 --> 00:58:48,400 And so in that way you can take that into account. 1457 00:58:48,400 --> 00:58:49,968 Does that makes sense? 1458 00:58:49,968 --> 00:58:50,760 it does look like-- 1459 00:58:50,760 --> 00:58:54,200 AUDIENCE: [INAUDIBLE] this is basically Monte-Carlo 1460 00:58:54,200 --> 00:58:57,050 sampling and learning. 1461 00:58:57,050 --> 00:58:59,910 Because each time that you set your-- 1462 00:58:59,910 --> 00:59:02,290 so your policy is defined by a set of alphas. 1463 00:59:02,290 --> 00:59:04,500 And then you fix it, you run it, and you 1464 00:59:04,500 --> 00:59:07,200 get a sample that says what is the value associated 1465 00:59:07,200 --> 00:59:09,807 with this starting point given this [INAUDIBLE] policy. 1466 00:59:09,807 --> 00:59:11,890 JOHN W. ROBERTS: Are you talking about Monte-Carlo 1467 00:59:11,890 --> 00:59:12,810 for policy evaluation? 1468 00:59:12,810 --> 00:59:14,143 Because Monte-Carlo [INAUDIBLE]. 1469 00:59:14,143 --> 00:59:16,300 That's like TD infinity or whatever it is. 1470 00:59:16,300 --> 00:59:17,700 And that's for policy evaluation. 1471 00:59:17,700 --> 00:59:20,100 That's how you make a critic. 1472 00:59:20,100 --> 00:59:21,660 The policy is different, right? 1473 00:59:21,660 --> 00:59:23,202 The policy, you're doing this update, 1474 00:59:23,202 --> 00:59:24,540 then you're advancing it a bit. 1475 00:59:24,540 --> 00:59:26,082 Your critic, the way I just described 1476 00:59:26,082 --> 00:59:28,192 making the baseline for this, that would be 1477 00:59:28,192 --> 00:59:29,400 a Monte-Carlo interpretation. 1478 00:59:29,400 --> 00:59:32,100 You could do it with t, lambda, or anything you wanted to. 1479 00:59:32,100 --> 00:59:32,790 But yeah. 1480 00:59:32,790 --> 00:59:37,350 So the important thing is-- 1481 00:59:37,350 --> 00:59:39,570 I mean, it looks like the sort of blank faces 1482 00:59:39,570 --> 00:59:41,610 after I talked about that. 1483 00:59:41,610 --> 00:59:45,480 But Russ, I think, is going to go into more detail 1484 00:59:45,480 --> 00:59:49,260 into actor-critic. 1485 00:59:49,260 --> 00:59:52,720 But maybe I can talk about that more tomorrow if you want. 1486 00:59:52,720 --> 00:59:53,220 Yeah. 1487 00:59:53,220 --> 00:59:54,690 I mean, the important thing is that right now this 1488 00:59:54,690 --> 00:59:56,732 is a very simple kind of idea we've talked about, 1489 00:59:56,732 --> 00:59:59,640 where you run the alpha, and then if you ran the same alpha, 1490 00:59:59,640 --> 01:00:01,860 it would always do the same. 1491 01:00:01,860 --> 01:00:04,380 Or maybe it just has a little bit of additive noise. 1492 01:00:04,380 --> 01:00:07,140 But If actually running the same alpha from different states-- 1493 01:00:07,140 --> 01:00:09,570 which happens a lot in a lot of systems-- 1494 01:00:09,570 --> 01:00:12,492 the different states could have different expected performance. 1495 01:00:12,492 --> 01:00:14,700 And so while you'll still learn without the baseline, 1496 01:00:14,700 --> 01:00:16,075 having a good baseline everywhere 1497 01:00:16,075 --> 01:00:17,440 will make you learn faster. 
1498 01:00:17,440 --> 01:00:19,410 And so it's worth learning a baseline 1499 01:00:19,410 --> 01:00:23,562 and learning the policy simultaneously. 1500 01:00:23,562 --> 01:00:25,770 And sort of the thing we talked about, where you just 1501 01:00:25,770 --> 01:00:28,570 average your last several samples to get your baseline, 1502 01:00:28,570 --> 01:00:30,570 that's already we're learning a baseline, right? 1503 01:00:30,570 --> 01:00:32,310 We're just learning it for everywhere in state space. 1504 01:00:32,310 --> 01:00:34,310 We're saying this is the same everywhere, right? 1505 01:00:38,320 --> 01:00:40,510 AUDIENCE: That idea of sampling, can you 1506 01:00:40,510 --> 01:00:44,230 do something like [? smarter ?] using Gaussian processes 1507 01:00:44,230 --> 01:00:46,120 to do active learning on top of it 1508 01:00:46,120 --> 01:00:49,840 to sample in areas that are more promising? 1509 01:00:49,840 --> 01:00:51,663 Instead of just randomly moving somewhere? 1510 01:00:51,663 --> 01:00:53,080 JOHN W. ROBERTS: I mean, there are 1511 01:00:53,080 --> 01:00:55,870 ways of biasing your sampling based on what 1512 01:00:55,870 --> 01:00:57,145 you think the gradient is. 1513 01:00:57,145 --> 01:00:59,020 I mean, that's one of the things we worked on 1514 01:00:59,020 --> 01:01:02,410 with signal-to-noise ratio. 1515 01:01:02,410 --> 01:01:04,600 I'm not sure exactly what-- 1516 01:01:04,600 --> 01:01:08,230 AUDIENCE: I know some people worked on Aibos walking, 1517 01:01:08,230 --> 01:01:09,880 and they wanted to find a gain which 1518 01:01:09,880 --> 01:01:12,460 maximizes the speed of the Aibos when they're walking. 1519 01:01:12,460 --> 01:01:14,502 JOHN W. ROBERTS: I think I read that paper, yeah. 1520 01:01:14,502 --> 01:01:17,930 AUDIENCE: Yeah, and there are like 12 or 13 dimensions. 1521 01:01:17,930 --> 01:01:20,085 And it seems like a similar problem-- 1522 01:01:20,085 --> 01:01:21,460 JOHN W. ROBERTS: No, I think they 1523 01:01:21,460 --> 01:01:22,668 use a very similar algorithm. 1524 01:01:22,668 --> 01:01:24,502 I think they had a different update, though. 1525 01:01:24,502 --> 01:01:25,870 It was the same kind of idea. 1526 01:01:25,870 --> 01:01:27,662 I think that the update structure was maybe 1527 01:01:27,662 --> 01:01:29,660 different than that. 1528 01:01:29,660 --> 01:01:30,160 Yeah. 1529 01:01:30,160 --> 01:01:31,813 So I won't dwell on critic stuff. 1530 01:01:31,813 --> 01:01:33,730 That's, I think, the last lecture in the class 1531 01:01:33,730 --> 01:01:34,870 or something like that. 1532 01:01:34,870 --> 01:01:37,420 But yeah. 1533 01:01:40,680 --> 01:01:42,820 So here, I mean, this is sort of the sample system. 1534 01:01:42,820 --> 01:01:46,510 And you can see how this thing is robust to really noisy 1535 01:01:46,510 --> 01:01:47,380 systems in practice. 1536 01:01:47,380 --> 01:01:52,180 Because when I ran it on the flapping thing down at NYU, 1537 01:01:52,180 --> 01:01:57,280 the consecutive evaluations could be very different-- 1538 01:01:57,280 --> 01:01:58,780 not because of any change in policy, 1539 01:01:58,780 --> 01:02:00,160 You run the same policy, you get a big variance. 1540 01:02:00,160 --> 01:02:01,300 So that's just because you're running 1541 01:02:01,300 --> 01:02:03,370 on this physical robot with this fluid system 1542 01:02:03,370 --> 01:02:05,590 and you're measuring the forces in an analog sensor. 1543 01:02:05,590 --> 01:02:07,060 And so it's just very noisy. 1544 01:02:07,060 --> 01:02:08,560 But it's robust to that. 
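A minimal sketch of that idea in Python; the dictionary keyed on a rounded start state, the running-average update, and all the names here are assumptions, just to show "learn a baseline as a function of the start state while you learn the policy" in the simplest Monte-Carlo style.

    import numpy as np

    baselines = {}   # start-state key -> running average of the cost observed from that state
    counts = {}

    def update_with_state_baseline(alpha, x0, run_cost, eta, sigma):
        """One perturbation update where the baseline depends on the start state x0.
        run_cost(alpha, x0) -> scalar cost of one rollout started from x0."""
        key = tuple(np.round(x0, 2))               # crude discretization of the start state
        z = np.random.randn(*alpha.shape)
        J = run_cost(alpha + sigma * z, x0)
        b = baselines.get(key, J)                  # first visit to this state: no advantage signal yet
        alpha = alpha - eta * (J - b) * z          # judge the perturbation against this state's usual cost
        counts[key] = counts.get(key, 0) + 1       # Monte-Carlo running mean of the cost from here
        baselines[key] = b + (J - b) / counts[key]
        return alpha

So the start that always costs around 50 and the start that always costs around 20 each get their own number, and a perturbation is compared against the right one; averaging all past costs into a single b is just the special case where the baseline is the same everywhere in state space.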
1545 01:02:08,560 --> 01:02:09,670 And that's what's so nice. 1546 01:02:17,350 --> 01:02:18,330 Put that here. 1547 01:02:22,160 --> 01:02:24,080 So look at that. 1548 01:02:24,080 --> 01:02:25,940 I mean, this one-- these, luckily, 1549 01:02:25,940 --> 01:02:27,200 didn't take 3 minutes anymore. 1550 01:02:27,200 --> 01:02:28,710 They took 1 second. 1551 01:02:28,710 --> 01:02:31,342 So it wasn't nearly as bad. 1552 01:02:31,342 --> 01:02:33,050 But, I mean, look how much it's changing. 1553 01:02:33,050 --> 01:02:35,330 It's changing a significant percentage every time, right? 1554 01:02:35,330 --> 01:02:36,560 AUDIENCE: These are all with the same [? taping loop? ?] 1555 01:02:36,560 --> 01:02:36,880 JOHN W. ROBERTS: Yeah. 1556 01:02:36,880 --> 01:02:38,630 Yeah-- I mean, no, this is playing a different-- 1557 01:02:38,630 --> 01:02:39,410 this is learning. 1558 01:02:39,410 --> 01:02:40,723 So the thing is that-- 1559 01:02:40,723 --> 01:02:42,890 I mean, I showed you how it wasn't monotonic before. 1560 01:02:42,890 --> 01:02:44,150 But this, you can run the same tape. 1561 01:02:44,150 --> 01:02:45,950 I mean, up there it's pretty much running the same tape. 1562 01:02:45,950 --> 01:02:48,620 So up there you get an idea of what the noise looks like when 1563 01:02:48,620 --> 01:02:50,180 you're running the same policy. 1564 01:02:50,180 --> 01:02:51,290 Right. 1565 01:02:51,290 --> 01:02:53,030 And so you can imagine-- yes. 1566 01:02:53,030 --> 01:02:55,295 AUDIENCE: Just [INAUDIBLE] went with blue and red. 1567 01:02:55,295 --> 01:02:56,670 JOHN W. ROBERTS: Oh, blue and red 1568 01:02:56,670 --> 01:02:59,960 are different ways of keeping track of my baseline. 1569 01:02:59,960 --> 01:03:01,668 All right. 1570 01:03:01,668 --> 01:03:04,085 So I mean, I don't worry about the different blue and red. 1571 01:03:04,085 --> 01:03:05,300 They're just sort of an internal test 1572 01:03:05,300 --> 01:03:07,190 to see the right way to make these things-- we determined 1573 01:03:07,190 --> 01:03:08,330 that it didn't make a difference. 1574 01:03:08,330 --> 01:03:08,960 But yeah. 1575 01:03:08,960 --> 01:03:12,270 AUDIENCE: It looks like the red is much smoother. 1576 01:03:12,270 --> 01:03:13,520 JOHN W. ROBERTS: I don't know. 1577 01:03:13,520 --> 01:03:14,150 It may be plotting. 1578 01:03:14,150 --> 01:03:16,070 I may have plotted blue on top of red or something, too, 1579 01:03:16,070 --> 01:03:16,970 you know? 1580 01:03:16,970 --> 01:03:17,750 I don't know. 1581 01:03:17,750 --> 01:03:21,150 I remember we decided it didn't make much of a difference. 1582 01:03:21,150 --> 01:03:21,830 Yeah. 1583 01:03:21,830 --> 01:03:22,430 I see what you're saying. 1584 01:03:22,430 --> 01:03:24,305 It does look like the variance is a bit less, 1585 01:03:24,305 --> 01:03:25,850 but I don't think it was. 1586 01:03:25,850 --> 01:03:27,620 But these are trials on the bottom. 1587 01:03:27,620 --> 01:03:30,013 So that's, every second we sort of did another flap, 1588 01:03:30,013 --> 01:03:30,930 we did another update. 1589 01:03:30,930 --> 01:03:32,240 So this is update from the bottom. 1590 01:03:32,240 --> 01:03:32,540 And yeah. 1591 01:03:32,540 --> 01:03:34,460 This is-- we actually have a reward instead of cost here. 1592 01:03:34,460 --> 01:03:36,085 So it's going to go up instead of down. 1593 01:03:36,085 --> 01:03:37,552 But yeah. 
So despite the fact that this is really noisy, and despite the fact that our baseline wasn't perfect (it was just the averaged baseline I was talking about), it still learned, and it learned pretty quickly. 400 samples maybe doesn't sound very good, but that's also less than 10 minutes; it's about 7 minutes. So in practice it can work pretty darn well.

And solving this problem with other techniques would be very tricky. You could build a model, like the model we have, and try to solve it in simulation. That's generally how a lot of these problems are solved: do the optimization on a model. For example, Jane Wang at Cornell optimizes the stroke form for a fly, I think at fruit-fly scale. She built a pretty fancy model of the system and simulates it, doing the optimization on a computational fluid dynamics simulation. There you can get the gradients and do all the things we've already talked about explicitly, because you have the model. But the model takes a long time to run; I think that optimization took months of computer time.

That's the point here: for the full simulation of this system, where it took me 1 second to get an update on the hardware, it takes, I think, about an hour per flap. That's an hour on a computing cluster to get one full simulation of a single flap, and that's the simpler case. We're also working on versions with aeroelastic effects, where the body deforms in response to the fluid forces, and simulating those is even harder. So where the simulation takes an hour to give an update, I can get one in a second. Now, my update is going to be noisier, and I don't get the true gradient.
But when you can do 3,600 of these noisy updates in the time it takes the simulation to produce one, you're going to win. You get one simulated flap in the time you'd otherwise sit there for the better part of an hour. So in those kinds of problems this can be a big win: when the simulation is extremely expensive, or computing the gradient is extremely expensive, but you have the robot right in front of you, you can just take that data, accept the noise, and do model-free gradient descent.

I think that's what I wanted to talk about. If you have any questions, or anything didn't make sense at all, please let me know. Otherwise, maybe I'll introduce something I'm planning to talk about tomorrow, a different interpretation, just to get your brain ready for it. But if there are any other questions on this, please ask.

AUDIENCE: What was the reward function for [INAUDIBLE]?

JOHN W. ROBERTS: The reward function for this was the integral of the spin velocity divided by the integral of the power input. So it measured the force on it and multiplied that by the vertical velocity, which gives you the power, the rate at which work is being done; integrating that gives the energy put in. Integrating the spin velocity gives the distance traveled. That ratio is what we tried to optimize, so it's finding something like the minimum energy per unit distance. The rig spins around in a circle, but it's a model of forward flight: we did it for an angle, but you could do it just as easily with a linear test; it's just harder experimentally. So it's an efficiency metric.

Yeah? All right. Turn the lights back up. Let me make sure I crossed all my Ts and dotted my Is.
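As a concrete reading of that efficiency metric, here is a minimal sketch of how such a reward could be computed from logged data. The signal names, the sampling setup, and the use of trapezoidal integration are assumptions for illustration, not the instrumentation actually used on the rig.

```python
import numpy as np

def efficiency_reward(force, vertical_velocity, spin_velocity, dt):
    """Distance traveled per unit of energy put in (higher is better)."""
    power = force * vertical_velocity          # instantaneous power = force x velocity
    energy = np.trapz(power, dx=dt)            # integral of power input over the trial
    distance = np.trapz(spin_velocity, dx=dt)  # integral of spin velocity over the trial
    return distance / energy
```

Because the learner maximizes this ratio, it is effectively minimizing the energy spent per unit distance, which is why it reads as an efficiency metric.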
Oh, yeah. And actually, there's one story before I get into that. A lot of these things originated in the context of neural networks, like a lot of the things we've seen: back prop, gradient descent. We learned that [INAUDIBLE] originated in the context of neural networks; RTRL did. And a lot of this did too. The REINFORCE algorithm, which is the thing we're going to talk about, originated with neural networks.

One of the reasons people found it so appealing, particularly this kind of stochastic update, is that it seemed biologically plausible. What is the chance that a human brain is doing back prop? It could be doing some sort of approximate back prop or something like that; I actually don't know that much about neuroscience. But these computationally involved techniques don't seem reasonable as postulates for how the human brain, or how neurons, solve these problems. This one, though, is so simple. The little bit of randomness being part of it, and the simple update structure, do seem biologically plausible; intuitively, it makes more sense.

But even more than that, there's data and evidence suggesting that this kind of mechanism could be one aspect of how animals learn. The coolest example, I think, is songbirds learning how to sing. They're not born knowing a particular way to sing, but they hear their parents sing as they're growing up, they start singing more and more, and they get better and better. You can actually hear them getting better until they sing like their parents did. You can raise them in captivity and play them Elvis all the time, and they'll do a songbird impression of Elvis, which I'm surprised you can't buy on CD on late-night TV.

But a really cool thing is that there's a part of the brain where, if you measure the signals, they seem to be completely random. They just seem to be random noise.
And so it's strange that there's no structure there. What could this part of the brain be doing? Why would it need to produce random noise? So what they did, and the bird lovers out there may not like this, is they took one of these birds and waited until it had learned the full song. Then they deactivated, through some means, the part of the brain that produces the random noise. And nothing happened. Apparently the bird wasn't entirely the same, but it could still sing the songs fine.

Then they took a bird that was in the process of learning the song, one that had learned some of it but wasn't perfect yet and was still getting better, and they deactivated that part of the brain. And it just kept singing the same song. However it had been singing, it kept singing that way; it didn't get any better.

So that's some proxy evidence that the random noise was related to the ability to improve. It's not storing the signal, and it's not necessarily the descent itself; but the random noise could be how the bird perturbs its song in an effort to get better. It messes the song up a bit, listens, finds it's maybe a little bit better, and keeps doing that. That's reasonably compelling evidence that biology could use this as at least one aspect of how it improves: shut down the random noise and it stops learning. If you take the variance to zero, you're not going to get worse; you're just not going to do anything. You're going to keep singing the same song. So that's pretty cool, I think.

Right. So, just to give you something to chew on, there's another interpretation of this. The idea we've talked about so far is this kind of sampling, where we have some nominal policy.
We perturb it, measure how well we did, measure the performance, and update. That's pretty much what we have: we've got some policy we're working from, we add this z to it, which changes it a bit, we run it, and then we update.

There's a different interpretation -- my "performance" got too long up there -- the stochastic policy interpretation. In this view, you don't think of it as having some nominal policy and adding noise to it. Instead, the policy itself acts stochastically: the actions are random. That doesn't mean they're completely random; they're random with some distribution. But you're not saying exactly what you'll do.

You can imagine this is like playing Liar's Poker, where you hold the card above your head so you can see everyone else's card but not your own, and then you bet on these things. Do you know the game? Maybe it doesn't have enough cultural penetration to be a good example. But if you're playing normal poker, or any gambling game, and every time you had the same cards you made the exact same bet, people could eventually figure that out and use it to beat you. There are plenty of games like that: say every time I have a certain card, I always bet a certain way. Then when I bet that way, they'll think, oh, he has good cards, I'm going to fold; or, oh, he always bluffs when he has this card. So a deterministic policy doesn't make sense there; a stochastic policy is exactly what you use. Your policy is going to be something like: I've got pocket kings, so 95% of the time I'm going to raise whatever, and [INAUDIBLE] of the time I'm going to check. Those kinds of things, where there's some noise in what you do.
Now, you can question whether optimal policies would really be stochastic in the kinds of problems we look at. But the important thing is to realize that in this view your policy isn't a fixed set of actions; it is a distribution over what you do. So the parameterization controls the distribution. Ooh, my fifth grade teacher would not have liked that.

So what you do, then, is you can imagine you control, say, the mean of a distribution. You can think of it as really exactly the same thing as before: in the other interpretation I said, OK, my policy is alpha, and then I add random noise z to it. Here, my policy is parameterized by alpha, and the action that comes out is the same thing. The difference is that it's not "this is what I'm doing, and I'm sampling something else around it." This is actually my policy: if I ran the same policy again, I would just take all these actions with these probabilities. Your actions are stochastic.

Now, that's something that isn't always easy to absorb; when I first saw it, it wasn't easy for me to get my head around what it all meant. I won't go into more detail now; tomorrow we'll look at this different interpretation. You get the same learning. We'll actually show that the update is the same and the behavior is very similar, but the properties are a little bit different. And the big thing is that you don't have to do the linearization. Here we did a linear expansion and said, OK, this is true locally. When you look at it in the stochastic policy context, you can show that you'll always follow the gradient of the expected value of the cost under the policy. And that's a big difference, right?
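To make the stochastic-policy view concrete, here is a minimal sketch in which the parameters alpha control the mean of a Gaussian distribution over actions. The fixed noise level, the function names, and the specific update shown are assumptions for illustration; the actual derivation of this update is the subject of the next lecture.

```python
import numpy as np

SIGMA = 0.1  # assumed, fixed exploration noise of the stochastic policy

def sample_action(alpha):
    # The policy is a distribution: actions are drawn from N(alpha, SIGMA^2).
    # Sampling this way is literally "nominal parameters plus noise z".
    return alpha + SIGMA * np.random.randn(*alpha.shape)

def update(alpha, action, J, baseline, eta=0.01):
    # Same structure as the weight-perturbation update: (action - alpha)
    # is exactly the noise z, so the two interpretations give the same
    # update (possibly up to a constant factor).
    return alpha - eta * (J - baseline) * (action - alpha)
```

The code is the same loop as before; what changes is the reading of it: alpha no longer names the action you take, it names the distribution your actions are drawn from.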
With the weight-perturbation view, we're looking at the local gradient, and we're going to follow that local gradient. But suppose you have a very broad policy distribution, or a very violent cost landscape. Look at the 1D picture again, where the cost as a function of the parameter is extremely violent, with lots of sharp structure. Well, when I use this random, stochastic policy, that smooths it out. Because I have a stochastic policy, running it makes the cost a random variable; it depends on which actions get sampled. Even if my dynamics are deterministic, because my policy is stochastic, my cost is stochastic, so there's some expected cost for running this policy on that landscape. You can imagine that averaging over all of those samples smooths out some of the sharp structure.

And what you follow when you do this, with an update that is really identical -- it's the exact same update, possibly up to a coefficient out front that you may or may not include, but the structure is the same -- is the gradient of the expected performance of the stochastic policy. So it's a different way of thinking about it. I think the weight-perturbation way is the easier way to think about it at first. But tomorrow will be more probability-flavored, and we'll talk about the stochastic policy interpretation and some of its ramifications, and maybe some other interesting side notes. So yeah.
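As a closing illustration of the smoothing point above, here is a minimal numerical sketch. The sharply oscillating 1D cost and the Gaussian policy noise are assumptions chosen only to show the effect; the quantity the stochastic-policy update follows is the gradient of the smoothed (expected) curve, not of the raw one.

```python
import numpy as np

def violent_cost(alpha):
    # Made-up 1D cost with a lot of sharp local structure.
    return np.sin(40.0 * alpha) + 0.1 * alpha**2

def expected_cost(alpha, sigma=0.2, n_samples=5000):
    # Monte Carlo estimate of E[ J(alpha + z) ] with z ~ N(0, sigma^2),
    # i.e. the cost of the stochastic policy whose mean is alpha.
    z = sigma * np.random.randn(n_samples)
    return np.mean(violent_cost(alpha + z))

alphas = np.linspace(-2.0, 2.0, 101)
raw = violent_cost(alphas)
smoothed = np.array([expected_cost(a) for a in alphas])
# Plotting raw vs. smoothed shows the oscillations averaged away:
# the expected cost is a smooth function of the policy parameter.
```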