1 00:00:00,000 --> 00:00:02,490 The following content is provided under a Creative 2 00:00:02,490 --> 00:00:03,940 Commons license. 3 00:00:03,940 --> 00:00:06,330 Your support will help MIT OpenCourseWare 4 00:00:06,330 --> 00:00:10,660 continue to offer high quality educational resources for free. 5 00:00:10,660 --> 00:00:13,320 To make a donation or view additional materials 6 00:00:13,320 --> 00:00:17,160 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,160 --> 00:00:18,252 at ocw.mit.edu. 8 00:00:21,390 --> 00:00:22,810 PROFESSOR: OK, welcome back. 9 00:00:22,810 --> 00:00:26,340 Sorry for the technical blip there. 10 00:00:26,340 --> 00:00:31,611 OK, so I guess lecture two. 11 00:00:31,611 --> 00:00:32,910 I challenged you. 12 00:00:32,910 --> 00:00:35,850 We talked about the phase space of the simple pendulum, 13 00:00:35,850 --> 00:00:39,510 and I challenged you to come up with a simple algorithm. 14 00:00:39,510 --> 00:00:41,408 I guess I didn't say simple, but I 15 00:00:41,408 --> 00:00:43,200 challenged you to come up with an algorithm 16 00:00:43,200 --> 00:00:49,620 to try to, in some sort of minimal way, 17 00:00:49,620 --> 00:00:51,820 change the phase plot of this system 18 00:00:51,820 --> 00:00:55,290 so that the fixed points that used to be unstable 19 00:00:55,290 --> 00:00:57,902 become stable and vice versa. 20 00:00:57,902 --> 00:00:59,235 So today we're going to do that. 21 00:00:59,235 --> 00:01:03,042 I don't know if anybody did that for fun? 22 00:01:03,042 --> 00:01:04,220 Yeah, OK. 23 00:01:04,220 --> 00:01:06,610 [LAUGHTER] 24 00:01:06,610 --> 00:01:08,110 OK, so today we're going to do that. 25 00:01:08,110 --> 00:01:13,890 So yeah, the question is, can we use optimal control now, 26 00:01:13,890 --> 00:01:19,320 numerical optimal control, to reshape these dynamics, OK.
27 00:01:19,320 --> 00:01:24,000 And I want to start by doing sort 28 00:01:24,000 --> 00:01:27,450 of an evil thing but something that's 29 00:01:27,450 --> 00:01:30,460 going to make thinking about it a lot easier. 30 00:01:30,460 --> 00:01:33,540 We're going to discretize everything, OK. 31 00:01:33,540 --> 00:01:36,060 So let's start by-- 32 00:01:42,000 --> 00:01:53,670 we're going to discretize state, actions, and time, OK. 33 00:01:53,670 --> 00:02:00,120 So I'm actually going to take my vector of x, which 34 00:02:00,120 --> 00:02:06,420 lived on the real numbers, and start thinking 35 00:02:06,420 --> 00:02:12,460 about integer number of states. 36 00:02:12,460 --> 00:02:13,680 I'll say what I mean by that. 37 00:02:16,580 --> 00:02:19,050 OK. 38 00:02:19,050 --> 00:02:27,040 And I'm going to take my actions, my continuous action 39 00:02:27,040 --> 00:02:29,860 space, which I've been thinking of as u, 40 00:02:29,860 --> 00:02:32,380 and I'm going to turn that into a discrete state 41 00:02:32,380 --> 00:02:35,290 space, a discrete action space. 42 00:02:35,290 --> 00:02:38,170 And I'm going to take time and turn it 43 00:02:38,170 --> 00:02:44,980 into some integer, discrete time, OK. 44 00:02:48,690 --> 00:02:51,480 So and I'm going to try to be-- throughout the lectures, 45 00:02:51,480 --> 00:02:53,730 throughout the notes, I tried to be very, very careful 46 00:02:53,730 --> 00:02:57,630 to use X and U and time for continuous things 47 00:02:57,630 --> 00:03:02,710 and S for states, A for actions, N for discrete things. 48 00:03:02,710 --> 00:03:04,800 So we might find ourselves in situations 49 00:03:04,800 --> 00:03:07,140 where we have continuous state and discrete actions 50 00:03:07,140 --> 00:03:13,112 or some other combination, but that should be a code. 
51 00:03:13,112 --> 00:03:15,570 OK, so if we want to-- if we're willing to discretize state 52 00:03:15,570 --> 00:03:17,880 and time, then maybe one way to think about that 53 00:03:17,880 --> 00:03:22,960 on this picture is by thinking of every one of these-- 54 00:03:22,960 --> 00:03:24,600 this was my quick cartoon of the phase 55 00:03:24,600 --> 00:03:26,700 plot of the simple pendulum. 56 00:03:26,700 --> 00:03:29,970 Let's think about identifying each one 57 00:03:29,970 --> 00:03:34,110 of these possible states in the phase portrait 58 00:03:34,110 --> 00:03:37,020 as a particular state, OK. 59 00:03:37,020 --> 00:03:40,740 These little nodes, possible states we can live in. 60 00:03:40,740 --> 00:03:48,840 And through actions, we can transition to different states, 61 00:03:48,840 --> 00:03:51,030 if you see what I'm doing without drawing 62 00:03:51,030 --> 00:03:53,950 100,000 circles here. 63 00:03:53,950 --> 00:03:59,560 So let's tile the state space with discrete states. 64 00:03:59,560 --> 00:04:01,470 You could also think of it as drawing a grid 65 00:04:01,470 --> 00:04:06,030 and calling each box in the grid a state. 66 00:04:06,030 --> 00:04:08,430 And what that allows us to do-- we're also 67 00:04:08,430 --> 00:04:12,640 discretizing actions, so we have a finite number 68 00:04:12,640 --> 00:04:15,780 of possible options coming out of each state. 69 00:04:15,780 --> 00:04:19,050 It allows us to turn the continuous time optimal control 70 00:04:19,050 --> 00:04:26,070 problem into a simple graph search problem, OK. 71 00:04:26,070 --> 00:04:27,930 Graph search, we know how to do well. 72 00:04:27,930 --> 00:04:30,210 We're really good at that in computer science. 73 00:04:30,210 --> 00:04:34,200 OK, so let's see how far we can get first 74 00:04:34,200 --> 00:04:39,420 by just thinking about this very non-linear, very dynamic thing 75 00:04:39,420 --> 00:04:40,950 on a graph search, OK. 
76 00:04:45,180 --> 00:04:48,000 So we're going to do numerical optimal control. 77 00:04:48,000 --> 00:04:53,120 This is-- in particular, when people 78 00:04:53,120 --> 00:04:56,617 talk about the dynamic programming algorithm, 79 00:04:56,617 --> 00:04:59,075 they're often talking about discretizing state and actions. 80 00:05:02,270 --> 00:05:06,590 And we're going to use the standard optimal control 81 00:05:06,590 --> 00:05:08,310 formulation. 82 00:05:08,310 --> 00:05:16,430 I'm going to start with a finite horizon 83 00:05:16,430 --> 00:05:23,300 and say that my cost of being in state x, time t 84 00:05:23,300 --> 00:05:26,045 is h of x at the final time. 85 00:05:31,253 --> 00:05:33,170 All right, this is the continuous time optimal 86 00:05:33,170 --> 00:05:33,670 control. 87 00:05:42,690 --> 00:05:48,600 And I'm going to start thinking of that now as being in state S 88 00:05:48,600 --> 00:05:53,920 at integer time N and having me be 89 00:05:53,920 --> 00:06:02,550 at some final cost on S plus a sum 90 00:06:02,550 --> 00:06:09,300 from N equals 0 to N of g SA, OK. 91 00:06:09,300 --> 00:06:12,690 And my dynamics now are going to be of the form S-- 92 00:06:16,560 --> 00:06:19,050 maybe I should even write more explicitly, S N plus 1 93 00:06:19,050 --> 00:06:22,180 is a function of SN, AN, OK. 94 00:06:34,230 --> 00:06:41,850 OK, so again, dynamic programming 95 00:06:41,850 --> 00:06:46,320 exploits the fact that you can write this in a recursive form. 96 00:06:46,320 --> 00:06:52,830 So if I want to find the optimal cost 97 00:06:52,830 --> 00:07:03,600 to go, which I'll call J star, at the final time, 98 00:07:03,600 --> 00:07:08,610 it's just h of S, right. 99 00:07:08,610 --> 00:07:15,630 And going backwards in time, this 100 00:07:15,630 --> 00:07:26,130 is just going to be the min over a of g S, a plus h of S prime, 101 00:07:26,130 --> 00:07:27,120 where S prime is. 102 00:07:32,600 --> 00:07:33,100 Right? 
103 00:07:33,100 --> 00:07:35,652 I'll get one-- if N is-- 104 00:07:35,652 --> 00:07:37,360 N minus 1, I get one of these, and then I 105 00:07:37,360 --> 00:07:41,870 get the final cost, OK. 106 00:07:41,870 --> 00:07:49,060 And going backwards, we have this recursive form, 107 00:07:49,060 --> 00:07:58,540 which is min over a of g of S, a plus the cost to go from S prime 108 00:07:58,540 --> 00:08:03,340 at n plus 1, using that same S prime. 109 00:08:10,250 --> 00:08:14,440 OK, I want to make sure you see why that is, why 110 00:08:14,440 --> 00:08:16,030 this-- this is magical, right? 111 00:08:16,030 --> 00:08:20,110 The fact that I can summarize my optimal 112 00:08:20,110 --> 00:08:24,240 cost to go by doing a min over a single action, that's 113 00:08:24,240 --> 00:08:24,865 really magical. 114 00:08:27,910 --> 00:08:35,380 Just to make that extremely clear, think about J star 115 00:08:35,380 --> 00:08:40,360 at N minus 2, let's say. 116 00:08:40,360 --> 00:08:42,610 So I have to minimize over two actions. 117 00:08:42,610 --> 00:08:45,250 I have to minimize over, let's say I'll call them a1 and a2. 118 00:08:48,460 --> 00:08:49,810 I have two steps left to go. 119 00:08:49,810 --> 00:08:56,320 So I have to minimize g of S, a1 plus g of S prime, 120 00:08:56,320 --> 00:09:01,667 let's call it, a2 plus h of S double prime. 121 00:09:01,667 --> 00:09:03,250 That's my minimization that I'm trying 122 00:09:03,250 --> 00:09:08,620 to solve in order to find the optimal cost to go, 123 00:09:08,620 --> 00:09:13,750 where S prime is f of S, a. 124 00:09:13,750 --> 00:09:17,490 S double prime is f of S prime. 125 00:09:17,490 --> 00:09:19,540 This is a1, and this is a2. 126 00:09:25,615 --> 00:09:31,660 I'm just expanding this sum for the last two g's. 127 00:09:35,770 --> 00:09:39,970 And the cool thing is that, because of this additive form 128 00:09:39,970 --> 00:09:45,400 of g, this term doesn't depend at all on my decision a2.
129 00:09:49,440 --> 00:09:56,490 I'm given a current state S, and I have to decide my action a1. 130 00:09:56,490 --> 00:10:02,580 Nothing about this term depends at all on a2, OK. 131 00:10:02,580 --> 00:10:06,930 In contrast, this one does depend on a1, 132 00:10:06,930 --> 00:10:08,640 because S prime depends on a1. 133 00:10:11,730 --> 00:10:15,540 This one depends on a1 and a2. 134 00:10:15,540 --> 00:10:18,550 This one certainly depends on a2. 135 00:10:18,550 --> 00:10:19,550 You see what I'm saying? 136 00:10:22,630 --> 00:10:34,770 So I can rewrite this as min over a1 of g of S, a1 137 00:10:34,770 --> 00:10:46,530 plus min over a2 of g of S prime, a2 plus h of S double prime. 138 00:10:46,530 --> 00:10:48,240 I could just move that min inside 139 00:10:48,240 --> 00:10:50,310 to the only terms that matter. 140 00:10:54,670 --> 00:10:57,880 This is intended to be a moment of clarity, 141 00:10:57,880 --> 00:11:01,000 and I don't see clarity on your faces. 142 00:11:01,000 --> 00:11:05,410 Does that make sense, that this doesn't depend on a2? 143 00:11:05,410 --> 00:11:09,646 I know I'm going to-- a1 is my action at time N minus 2. 144 00:11:09,646 --> 00:11:13,180 a2 is my action at N minus 1. 145 00:11:13,180 --> 00:11:17,012 The action I take next time has absolutely no effect 146 00:11:17,012 --> 00:11:18,720 on my current state or my current action. 147 00:11:23,400 --> 00:11:30,090 So the great thing is this here is just-- 148 00:11:30,090 --> 00:11:33,570 this whole term right here is just 149 00:11:33,570 --> 00:11:40,980 J star of S prime at, I'm calling it, N minus 1 here.
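The two-step argument spoken here can be written out as a short derivation. This is a reconstruction from the spoken description of the board, not a copy of it: the lowercase symbols and the explicit arguments of f are notational assumptions, with s' = f(s, a_1) and s'' = f(s', a_2).

```latex
\begin{align*}
J^*(s, N-2)
  &= \min_{a_1,\, a_2} \big[\, g(s, a_1) + g(s', a_2) + h(s'') \,\big],
  \qquad s' = f(s, a_1), \quad s'' = f(s', a_2) \\
  &= \min_{a_1} \Big[\, g(s, a_1) + \min_{a_2} \big[\, g(s', a_2) + h(s'') \,\big] \Big] \\
  &= \min_{a_1} \big[\, g(s, a_1) + J^*(s', N-1) \,\big].
\end{align*}
```

The second line is valid because g(s, a_1) does not depend on a_2, so the inner minimization can be pushed past it; the third line recognizes the inner bracket as the one-step-to-go optimal cost evaluated at s'.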
150 00:11:49,310 --> 00:11:51,050 So it's really the fact that we're 151 00:11:51,050 --> 00:11:56,790 taking this min over this additive form that 152 00:11:56,790 --> 00:12:04,320 allows us to write the recursive statement like this that says, 153 00:12:04,320 --> 00:12:07,335 the best thing I can do with additive cost 154 00:12:07,335 --> 00:12:12,550 and all these things is to, in a single step, 155 00:12:12,550 --> 00:12:17,850 take the action which minimizes my one step cost combined 156 00:12:17,850 --> 00:12:22,110 with the cost I'm going to get from being in the state I 157 00:12:22,110 --> 00:12:26,160 transition to for the rest of time. 158 00:12:26,160 --> 00:12:28,980 It's a magical thing. 159 00:12:28,980 --> 00:12:32,790 At whatever time I'm at, I only have to think one action ahead 160 00:12:32,790 --> 00:12:38,090 if I've already got my J star computed, OK. 161 00:12:38,090 --> 00:12:40,860 Simultaneously, it's saying that I can 162 00:12:40,860 --> 00:12:42,750 compute the optimal cost to go. 163 00:12:42,750 --> 00:12:45,690 I could compute the optimal-- 164 00:12:45,690 --> 00:12:47,940 I know exactly how much cost I'm going 165 00:12:47,940 --> 00:12:51,600 to incur from any state, given I follow the optimal policy, 166 00:12:51,600 --> 00:12:53,460 if I just work backwards in time. 167 00:12:53,460 --> 00:12:56,095 And when I'm in time N minus 1, I 168 00:12:56,095 --> 00:12:57,720 don't have to think about the actions I 169 00:12:57,720 --> 00:12:59,790 was going to take beforehand. 
170 00:12:59,790 --> 00:13:01,840 As long as I know what state I'm in, 171 00:13:01,840 --> 00:13:03,840 because that state encompasses every action I've 172 00:13:03,840 --> 00:13:08,640 taken in the past, that state contains all the information, 173 00:13:08,640 --> 00:13:11,490 all I have to think about is the last action 174 00:13:11,490 --> 00:13:14,580 I'm going to take to decide my optimal policy one 175 00:13:14,580 --> 00:13:17,320 step from the end of time, OK. 176 00:13:23,070 --> 00:13:27,720 So the fact that you can solve these things 177 00:13:27,720 --> 00:13:32,880 backwards in time, that's the principle of optimality, OK. 178 00:13:35,610 --> 00:13:37,700 Ask questions if you don't like what I said. 179 00:13:42,670 --> 00:13:45,130 I think that the graphics that are about to come 180 00:13:45,130 --> 00:13:49,520 are going to make things clear, too. 181 00:13:49,520 --> 00:13:51,673 OK, so what does that mean? 182 00:13:51,673 --> 00:13:53,090 What are the implications of that? 183 00:14:05,910 --> 00:14:15,100 All right, for the additive costs, 184 00:14:15,100 --> 00:14:34,200 I can compute J star recursively from the end of time, which, 185 00:14:34,200 --> 00:14:37,620 in this case, is N back to 0. 186 00:14:46,940 --> 00:14:50,030 And the optimal action, the optimal policy, 187 00:14:50,030 --> 00:14:54,350 which I then want to call pi star, which 188 00:14:54,350 --> 00:15:04,982 could in general depend on the time, is just argmin over a. 189 00:15:04,982 --> 00:15:08,823 It's the action which minimizes that same expression. 190 00:15:24,120 --> 00:15:29,000 So I can compute J star recursively backwards in time, 191 00:15:29,000 --> 00:15:33,180 and if I know J star, then I essentially know 192 00:15:33,180 --> 00:15:34,390 my optimal policy. 193 00:15:34,390 --> 00:15:37,680 I know the best action, OK. 
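A minimal sketch of the backward recursion and the argmin policy just described, assuming a generic finite problem. The function name `finite_horizon_dp` and the five-state line-world used to exercise it are hypothetical illustrations, not something from the lecture.

```python
def finite_horizon_dp(states, actions, f, g, h, N):
    """Backward recursion for an additive-cost, finite-horizon problem.

    J[N][s] = h(s)
    J[n][s] = min_a ( g(s, a) + J[n+1][f(s, a)] )
    pi[n][s] is an action achieving that minimum (ties go to the first action).
    """
    J = [dict() for _ in range(N + 1)]
    pi = [dict() for _ in range(N)]
    for s in states:
        J[N][s] = h(s)
    for n in range(N - 1, -1, -1):          # work backwards from the end of time
        for s in states:
            best = min(actions, key=lambda a: g(s, a) + J[n + 1][f(s, a)])
            pi[n][s] = best
            J[n][s] = g(s, best) + J[n + 1][f(s, best)]
    return J, pi


# Hypothetical toy problem: five cells on a line, goal at cell 0.
states = range(5)
actions = (0, -1, 1)                        # sit still, step left, step right
f = lambda s, a: min(max(s + a, 0), 4)      # moves are clipped at the walls
g = lambda s, a: 0 if s == 0 else 1         # one unit of cost per step away from the goal
h = lambda s: 0 if s == 0 else 1            # terminal cost uses the same function

J, pi = finite_horizon_dp(states, actions, f, g, h, N=3)
print([J[0][s] for s in states])
print([pi[0][s] for s in states])
```

Note that both `J` and `pi` are indexed by time as well as state, which is the finite-horizon time dependence discussed in the lecture.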
194 00:15:37,680 --> 00:15:43,800 So but for this reason, the fact that the cost to go, 195 00:15:43,800 --> 00:15:46,620 the cost I expect to incur given I'm in state S 196 00:15:46,620 --> 00:15:49,103 and I'm running from time N, the cost to go 197 00:15:49,103 --> 00:15:51,270 becomes a very central construct in optimal control. 198 00:15:54,490 --> 00:15:56,850 All right, so part of the goal for today 199 00:15:56,850 --> 00:16:02,340 is to give you some more intuition about J star, 200 00:16:02,340 --> 00:16:05,560 OK, because it's actually a very intuitive thing, 201 00:16:05,560 --> 00:16:10,570 but you can be lost, I think, in the equations. 202 00:16:10,570 --> 00:16:12,838 So let's give you more intuition about that. 203 00:16:12,838 --> 00:16:14,880 I'm going to do that by getting a little bit more 204 00:16:14,880 --> 00:16:20,880 abstract, well, simultaneously abstract and concrete. 205 00:16:24,296 --> 00:16:28,200 AUDIENCE: [INAUDIBLE] 206 00:16:28,200 --> 00:16:31,316 PROFESSOR: Because it's finite horizon. 207 00:16:31,316 --> 00:16:32,798 AUDIENCE: You know that the reward 208 00:16:32,798 --> 00:16:35,722 function is dependent on time. 209 00:16:35,722 --> 00:16:37,180 PROFESSOR: I haven't included that. 210 00:16:37,180 --> 00:16:40,160 You can make the reward function depend on time. 211 00:16:40,160 --> 00:16:43,720 But even if the reward function, or cost function in my world, 212 00:16:43,720 --> 00:16:44,320 is-- 213 00:16:44,320 --> 00:16:48,310 there's a difference between optimal control people 214 00:16:48,310 --> 00:16:50,100 and reinforcement learning people. 215 00:16:50,100 --> 00:16:51,850 The optimal control people are pessimists. 216 00:16:51,850 --> 00:16:53,710 Everything's a cost. 217 00:16:53,710 --> 00:16:55,600 And the reward reinforcement learning people 218 00:16:55,600 --> 00:16:56,740 give rewards out. 219 00:16:56,740 --> 00:16:59,350 So I guess I'm a pessimist. 
220 00:16:59,350 --> 00:17:01,930 So yeah, so my cost is actually not a function of time. 221 00:17:01,930 --> 00:17:04,119 I could have made it that. 222 00:17:04,119 --> 00:17:06,760 But because there's a finite horizon time, 223 00:17:06,760 --> 00:17:09,363 that means my policy and my cost to go function still 224 00:17:09,363 --> 00:17:10,030 depends on time. 225 00:17:15,010 --> 00:17:17,172 Because if time ends in one step, 226 00:17:17,172 --> 00:17:19,380 I'm going to do something different than if time ends 227 00:17:19,380 --> 00:17:22,410 arbitrarily far in the future. 228 00:17:22,410 --> 00:17:23,700 OK. 229 00:17:23,700 --> 00:17:25,589 So we're going to-- 230 00:17:25,589 --> 00:17:37,020 my goal here is to get intuition about cost to go 231 00:17:37,020 --> 00:17:41,915 and dynamic programming, which I'm often going to call DP, OK. 232 00:17:41,915 --> 00:17:44,040 And I'm going to do it with the grid world example. 233 00:17:44,040 --> 00:17:49,340 This is right out of the reinforcement learning books. 234 00:17:54,180 --> 00:17:59,550 OK, so in that pendulum phase plot, 235 00:17:59,550 --> 00:18:02,970 I discretized the state space, and I 236 00:18:02,970 --> 00:18:06,330 started talking about transitions between states, OK. 237 00:18:06,330 --> 00:18:10,570 I can make that even more transparent by saying, 238 00:18:10,570 --> 00:18:15,630 OK, now you're a trashcan robot in a room. 239 00:18:15,630 --> 00:18:19,150 You're going to be in one of these tiles. 240 00:18:19,150 --> 00:18:21,990 You're on one of these blocks, so there's 241 00:18:21,990 --> 00:18:26,760 a finite, discrete state space, OK. 242 00:18:26,760 --> 00:18:31,110 I won't draw a trashcan robot, but let's say I'm here. 243 00:18:31,110 --> 00:18:35,970 And when you're here, you have five discrete actions 244 00:18:35,970 --> 00:18:36,630 you can take. 
245 00:18:36,630 --> 00:18:42,000 You can move up, you can move right, down, left, 246 00:18:42,000 --> 00:18:43,960 or you can sit still. 247 00:18:43,960 --> 00:18:44,460 OK. 248 00:19:01,410 --> 00:19:08,520 And discrete states and discrete time. 249 00:19:08,520 --> 00:19:11,250 Every time you take an action, in the next time step, 250 00:19:11,250 --> 00:19:14,910 you'll be in the next grid box. 251 00:19:19,490 --> 00:19:21,770 OK. 252 00:19:21,770 --> 00:19:27,950 Let's say I've got a goal state somewhere in the world. 253 00:19:27,950 --> 00:19:31,722 Well, we can formulate plenty of good optimal control problems 254 00:19:31,722 --> 00:19:32,930 to get us to that goal state. 255 00:19:36,200 --> 00:19:43,210 So plenty of good cost to go functions 256 00:19:43,210 --> 00:19:45,115 in the additive form-- 257 00:19:58,280 --> 00:20:00,843 let's say I want to do minimum-- 258 00:20:00,843 --> 00:20:02,510 I want to get there in the minimum time. 259 00:20:10,560 --> 00:20:16,020 Well, then I can just set g of S, a to be-- 260 00:20:18,900 --> 00:20:21,105 to actually have it in units of time, 261 00:20:21,105 --> 00:20:30,540 I should put a 1 if S is not at the goal and 0 262 00:20:30,540 --> 00:20:34,260 if S is at the goal, OK. 263 00:20:40,290 --> 00:20:42,450 And I don't actually care about actions. 264 00:20:42,450 --> 00:20:44,610 I have five discrete actions I can pick 265 00:20:44,610 --> 00:20:49,330 from whenever I'm in a state. 266 00:20:49,330 --> 00:20:52,890 If I'm not at the goal, I'm going to incur a cost of 1. 267 00:20:52,890 --> 00:20:56,550 So it's in my best interest as a trashcan robot 268 00:20:56,550 --> 00:20:58,627 to get to the goal. 269 00:20:58,627 --> 00:21:00,210 If I'm minimizing that cost, I'm going 270 00:21:00,210 --> 00:21:01,710 to get to the goal as fast as possible. 271 00:21:01,710 --> 00:21:03,660 And actually, the units, the cost to go 272 00:21:03,660 --> 00:21:07,635 will tell me the number of steps to get there.
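The grid world and minimum-time cost described here can be sketched as follows. The 5-by-5 grid size, the (row, column) reading of the goal position, and the rule that moves into a wall leave the robot in place are all assumptions made for illustration; the names `f`, `g`, and `backup` are likewise hypothetical. The terminal cost is taken to be the same function as g.

```python
import numpy as np

ROWS, COLS = 5, 5                      # assumed grid size
GOAL = (2, 3)                          # assumed (row, col) goal cell
# Five discrete actions: sit still, up, right, down, left.
ACTIONS = [(0, 0), (-1, 0), (0, 1), (1, 0), (0, -1)]


def f(s, a):
    """Deterministic dynamics s' = f(s, a); moves into a wall are clipped."""
    r = min(max(s[0] + a[0], 0), ROWS - 1)
    c = min(max(s[1] + a[1], 0), COLS - 1)
    return (r, c)


def g(s, a):
    """Minimum-time cost: 1 per step away from the goal, 0 at the goal."""
    return 0.0 if s == GOAL else 1.0


def backup(J_next):
    """One step of the backward recursion: J(s) = min_a g(s, a) + J_next(f(s, a))."""
    J = np.empty_like(J_next)
    for r in range(ROWS):
        for c in range(COLS):
            s = (r, c)
            J[r, c] = min(g(s, a) + J_next[f(s, a)] for a in ACTIONS)
    return J


# Terminal cost h is taken equal to g, so J at the final time is 0 at the goal, 1 elsewhere.
J_N = np.array([[g((r, c), ACTIONS[0]) for c in range(COLS)] for r in range(ROWS)])
J_Nm1 = backup(J_N)                    # one action left
J_Nm2 = backup(J_Nm1)                  # two actions left
print(J_Nm2)
```

Under these assumptions each backup lets cells one step further from the goal "see" it, so the cost to go counts steps to the goal until the remaining horizon caps it.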
273 00:21:07,635 --> 00:21:08,682 AUDIENCE: [INAUDIBLE] 274 00:21:08,682 --> 00:21:09,390 PROFESSOR: Right. 275 00:21:09,390 --> 00:21:13,140 So I'm going to do that graphically. 276 00:21:13,140 --> 00:21:15,510 But let's say there's a finite horizon now, 277 00:21:15,510 --> 00:21:17,885 but this is how I'm going to get to infinite horizon, so. 278 00:21:25,890 --> 00:21:28,510 And let's say that h of S is just 0. 279 00:21:32,027 --> 00:21:34,110 I don't really care where I am at the end of time. 280 00:21:38,360 --> 00:21:40,690 Or I could have h of S be this same function. 281 00:21:40,690 --> 00:21:41,690 That would be fine, too. 282 00:21:47,030 --> 00:21:47,530 OK. 283 00:21:51,220 --> 00:21:52,870 How is it going to look? 284 00:21:52,870 --> 00:22:00,190 What is J-- well, let's be specific about h. 285 00:22:00,190 --> 00:22:04,810 Let's make h actually be the same as g here. 286 00:22:04,810 --> 00:22:09,220 So I'll say it's g S with the 0 action. 287 00:22:09,220 --> 00:22:12,770 So since this doesn't depend on actions, it doesn't matter. 288 00:22:12,770 --> 00:22:17,140 Let's say h is the same function as g there. 289 00:22:17,140 --> 00:22:22,300 So what does my cost to go look like at time N? 290 00:22:36,926 --> 00:22:40,380 My optimal cost to go given I'm in some state, 291 00:22:40,380 --> 00:22:50,040 and it's time N. This is a function over S, 292 00:22:50,040 --> 00:22:53,582 and I'm at time N. And what is that function? 293 00:22:53,582 --> 00:22:54,470 AUDIENCE: g. 294 00:22:54,470 --> 00:22:56,100 PROFESSOR: Yeah. 295 00:22:56,100 --> 00:23:02,580 Well, if I'm not in the goal, it's that. 296 00:23:02,580 --> 00:23:05,430 It's the same as g, or h in this case. 297 00:23:13,080 --> 00:23:15,450 OK. 298 00:23:15,450 --> 00:23:19,800 What does J star of S, N minus 1 look like? 299 00:23:37,500 --> 00:23:41,160 Now I have time to take one action, OK.
300 00:23:44,380 --> 00:23:44,880 So-- 301 00:23:44,880 --> 00:23:46,864 AUDIENCE: One step away from the goal is 1. 302 00:23:46,864 --> 00:23:49,031 If you're on the goal, it's 0, but anywhere else, it 303 00:23:49,031 --> 00:23:49,810 would just be 1. 304 00:23:49,810 --> 00:23:51,040 PROFESSOR: Awesome. 305 00:23:51,040 --> 00:23:52,660 Right? 306 00:23:52,660 --> 00:23:59,350 If I'm on the goal, I can do nothing, incur zero cost to go. 307 00:23:59,350 --> 00:24:04,600 So the best thing for me to do if I'm on the goal 308 00:24:04,600 --> 00:24:07,180 is to stay there, OK. 309 00:24:07,180 --> 00:24:10,690 If I'm a long way from the goal, then I'm 310 00:24:10,690 --> 00:24:12,770 not going to get to the goal in two steps, 311 00:24:12,770 --> 00:24:16,570 so I'm going to incur two units of cost. 312 00:24:20,290 --> 00:24:23,290 I'll say loosely far from goal. 313 00:24:28,600 --> 00:24:30,547 And then there's this in-between place, 314 00:24:30,547 --> 00:24:32,380 which is if I'm one step away from the goal, 315 00:24:32,380 --> 00:24:36,468 I can take the right action and get there and incur 316 00:24:36,468 --> 00:24:37,385 only one unit of cost. 317 00:24:50,053 --> 00:24:51,470 All right, what's it going to be-- 318 00:24:51,470 --> 00:24:55,790 what's J S N minus 2 going to be? 319 00:24:55,790 --> 00:24:58,827 It's going to be 3, 2, or 1, depending 320 00:24:58,827 --> 00:25:00,410 on how closely-- if I'm near the goal, 321 00:25:00,410 --> 00:25:02,077 I've got a chance of getting to the goal 322 00:25:02,077 --> 00:25:07,100 and stopping this insane adding cost. 323 00:25:07,100 --> 00:25:08,750 Stop the madness. 324 00:25:08,750 --> 00:25:09,872 Get to the goal. 325 00:25:09,872 --> 00:25:12,080 Otherwise, I'm going to just incur the cost no matter 326 00:25:12,080 --> 00:25:14,690 what I do, OK. 327 00:25:14,690 --> 00:25:16,648 So what's the optimal policy? 
328 00:25:16,648 --> 00:25:19,190 If I'm on the goal, what's the best-- the best action to take 329 00:25:19,190 --> 00:25:20,970 is to sit still. 330 00:25:20,970 --> 00:25:23,840 If I'm one step away from the goal, the best thing to do 331 00:25:23,840 --> 00:25:26,745 is to move to the goal, whether it's up, down, left, or right. 332 00:25:26,745 --> 00:25:27,620 What if I'm out here? 333 00:25:27,620 --> 00:25:29,078 What's the best thing for me to do? 334 00:25:31,780 --> 00:25:33,010 Doesn't matter at all. 335 00:25:33,010 --> 00:25:35,550 I can do anything I want. 336 00:25:35,550 --> 00:25:37,340 I'm still going to incur the cost, 337 00:25:37,340 --> 00:25:42,310 so you might as well just choose your policy at random, OK. 338 00:25:42,310 --> 00:25:45,080 So optimal policies aren't necessarily unique. 339 00:25:45,080 --> 00:25:49,420 Sometimes multiple actions are equally optimal. 340 00:25:49,420 --> 00:25:51,130 OK, here's your world. 341 00:25:51,130 --> 00:25:58,060 I have put the goal always at 2,3, just randomly, OK. 342 00:25:58,060 --> 00:25:59,350 You are a blue star. 343 00:25:59,350 --> 00:26:01,720 The goal is a red asterisk. 344 00:26:01,720 --> 00:26:05,990 It's a-- take you back to the '80s or something, video games. 345 00:26:05,990 --> 00:26:08,530 OK. 346 00:26:08,530 --> 00:26:10,970 So let's just very simply-- 347 00:26:10,970 --> 00:26:14,410 I'm going to run this value iteration algorithm on it, OK, 348 00:26:14,410 --> 00:26:18,820 and I'm going to plot, at every step of the algorithm, the cost 349 00:26:18,820 --> 00:26:21,850 to go, OK, and the policy, actually. 350 00:26:21,850 --> 00:26:22,990 So it's not going to be-- 351 00:26:22,990 --> 00:26:24,970 I have my more general value iteration 352 00:26:24,970 --> 00:26:28,960 code that's not going to be quite as beautiful, but-- 353 00:26:33,415 --> 00:26:37,375 [TYPING] 354 00:26:46,105 --> 00:26:46,605 OK. 355 00:26:49,413 --> 00:26:50,580 Well, that went pretty fast. 
356 00:26:50,580 --> 00:26:51,630 There was supposed to be pause there. 357 00:26:51,630 --> 00:26:52,330 Let me get that-- 358 00:26:52,330 --> 00:26:53,705 add a pause in there quick, but-- 359 00:27:07,480 --> 00:27:07,980 OK. 360 00:27:11,700 --> 00:27:14,700 Here is J at time-- 361 00:27:14,700 --> 00:27:17,790 at J at capital N. My cost function 362 00:27:17,790 --> 00:27:22,380 is 0 if I'm at the goal, 1 everywhere else, OK. 363 00:27:22,380 --> 00:27:25,650 My policy, it doesn't matter what I choose. 364 00:27:25,650 --> 00:27:27,180 I've actually chosen to do-- 365 00:27:27,180 --> 00:27:28,020 I didn't put this-- 366 00:27:28,020 --> 00:27:33,420 I didn't give you a key, but 0 is the do nothing action, OK. 367 00:27:33,420 --> 00:27:35,830 So this just has do nothing everywhere. 368 00:27:35,830 --> 00:27:37,890 This is the lazy policy, I guess. 369 00:27:37,890 --> 00:27:39,840 And the cost it's going to get is 370 00:27:39,840 --> 00:27:42,360 it's going to get no cost if it's at the goal, one cost 371 00:27:42,360 --> 00:27:43,950 if it's everywhere else. 372 00:27:43,950 --> 00:27:47,010 OK, if I'm now computing J S N minus 1, 373 00:27:47,010 --> 00:27:48,720 you guys told me what that is. 374 00:27:48,720 --> 00:27:51,570 That says it's 0 here, it's 1 here, 375 00:27:51,570 --> 00:27:53,910 it's 2 everywhere else, right. 376 00:27:53,910 --> 00:27:56,700 And the co-- now you can see my key here. 377 00:27:56,700 --> 00:28:00,090 Orange must mean move down, red must mean move to the left, 378 00:28:00,090 --> 00:28:04,410 green must mean move to the right, and so on, OK. 379 00:28:04,410 --> 00:28:07,380 The value-- this backwards propagation, 380 00:28:07,380 --> 00:28:09,570 this dynamic programming propagation 381 00:28:09,570 --> 00:28:12,660 is a very beautiful and intuitive thing, OK. 382 00:28:12,660 --> 00:28:16,770 Every time I take a step, a few more states become reachable. 
383 00:28:19,440 --> 00:28:22,830 In that amount of time, I can get to the goal. 384 00:28:22,830 --> 00:28:27,313 The resulting cost to go function is simple. 385 00:28:27,313 --> 00:28:29,730 It's just the distance, the number of cells from the goal, 386 00:28:29,730 --> 00:28:30,360 yeah. 387 00:28:30,360 --> 00:28:32,880 And the policy, again, it's not unique. 388 00:28:32,880 --> 00:28:34,800 But this one, just because of the ordering 389 00:28:34,800 --> 00:28:37,170 I chose, and I just do a min over the actions, 390 00:28:37,170 --> 00:28:40,890 says it's always going to move down in that orange area, 391 00:28:40,890 --> 00:28:43,290 it's always going to move up in the blue area, 392 00:28:43,290 --> 00:28:45,140 and it's just going to-- 393 00:28:45,140 --> 00:28:49,740 so that's one of the optimal policies, all right. 394 00:28:49,740 --> 00:28:52,330 Now Alborz asked a good question, 395 00:28:52,330 --> 00:28:54,570 what's my horizon time? 396 00:28:54,570 --> 00:28:59,340 So I'm actually just working backwards from some arbitrary 397 00:28:59,340 --> 00:29:02,220 capital N and just going backwards in time 398 00:29:02,220 --> 00:29:04,200 further and further. 399 00:29:04,200 --> 00:29:09,780 But it turns out for this problem, and for many problems, 400 00:29:09,780 --> 00:29:13,660 everything converges, OK. 401 00:29:13,660 --> 00:29:16,990 After some amount of time, the optimal cost to go 402 00:29:16,990 --> 00:29:25,170 stops changing, and I know that's my optimal policy. 403 00:29:25,170 --> 00:29:26,122 Walk down. 404 00:29:26,122 --> 00:29:27,080 And this is too simple. 405 00:29:27,080 --> 00:29:28,610 This is painfully simple. 406 00:29:28,610 --> 00:29:31,240 But I think that intuition is going 407 00:29:31,240 --> 00:29:34,480 to take us a long way with the value methods, OK. 408 00:29:38,720 --> 00:29:40,143 AUDIENCE: So, Professor? 409 00:29:40,143 --> 00:29:40,810 PROFESSOR: Yeah. 
410 00:29:40,810 --> 00:29:44,420 AUDIENCE: In this example, the optimal policy is not unique. 411 00:29:44,420 --> 00:29:46,260 PROFESSOR: The optimal policy is not unique. 412 00:29:46,260 --> 00:29:48,843 The guy could have just as well gone left first and then down. 413 00:29:51,650 --> 00:29:54,062 So how does that manifest itself in those equations? 414 00:29:59,610 --> 00:30:02,220 There's multiple min over a's. 415 00:30:02,220 --> 00:30:06,960 There's multiple a's that give me the same J star S and N 416 00:30:06,960 --> 00:30:08,130 minus-- or plus 1, whatever. 417 00:30:12,210 --> 00:30:14,850 Multiple actions give me the same long-term cost, 418 00:30:14,850 --> 00:30:18,780 so I could equally pick any of them, yeah? 419 00:30:18,780 --> 00:30:22,470 OK, to make a more careful analogy 420 00:30:22,470 --> 00:30:26,630 to the more continuous world, that 421 00:30:26,630 --> 00:30:28,380 was a perfectly good minimum time problem. 422 00:30:28,380 --> 00:30:33,690 I could have equally well chosen a different cost function. 423 00:30:33,690 --> 00:30:37,740 Oh wait, let's put the obstacles back in, all right. 424 00:30:37,740 --> 00:30:39,338 So the cool thing is obstacles aren't 425 00:30:39,338 --> 00:30:41,130 going to make it any harder for us to solve 426 00:30:41,130 --> 00:30:43,800 this problem in our head. 427 00:30:43,800 --> 00:30:45,930 It's a nice observation that they don't actually 428 00:30:45,930 --> 00:30:50,488 make it any harder for the algorithm to solve it either. 429 00:30:50,488 --> 00:30:51,780 And that's a general principle. 430 00:30:51,780 --> 00:30:53,822 That's something I definitely want you to get out 431 00:30:53,822 --> 00:30:56,370 of this course, is that when we're 432 00:30:56,370 --> 00:30:59,790 doing analytical optimal control, every piece 433 00:30:59,790 --> 00:31:03,007 you add to the dynamics makes things cripplingly difficult. 
434 00:31:03,007 --> 00:31:05,340 And so you have to stay with these very simple dynamical 435 00:31:05,340 --> 00:31:06,840 systems. 436 00:31:06,840 --> 00:31:08,850 OK, the computational algorithms are actually 437 00:31:08,850 --> 00:31:11,610 pretty insensitive to how complex the dynamics are. 438 00:31:11,610 --> 00:31:15,475 They're going to break down in a different way, OK. 439 00:31:15,475 --> 00:31:17,850 So there's these different tools for different-- that are 440 00:31:17,850 --> 00:31:19,017 good for different problems. 441 00:31:19,017 --> 00:31:22,770 And there's a lot of problems which are very amenable 442 00:31:22,770 --> 00:31:24,840 to these computational tools that people aren't-- 443 00:31:24,840 --> 00:31:27,660 I mean, you can solve brand new problems pretty easily 444 00:31:27,660 --> 00:31:30,300 with some of these algorithms. 445 00:31:30,300 --> 00:31:32,580 OK, so let's think of another cost function. 446 00:31:35,800 --> 00:31:41,149 Let's do the equivalent of a quadratic regulator. 447 00:31:46,507 --> 00:31:48,090 I just had that whole spiel and forgot 448 00:31:48,090 --> 00:31:53,320 to run the boundary-- the obstacles together in Soapbox. 449 00:32:04,880 --> 00:32:10,580 OK, so now I'm just going to put in some obstacle. 450 00:32:10,580 --> 00:32:15,230 And if you see-- whoops, sorry. 451 00:32:15,230 --> 00:32:17,186 If my state-- 452 00:32:17,186 --> 00:32:20,517 OK, so I promised to use S and a in my notes and on the board, 453 00:32:20,517 --> 00:32:22,100 but I guess I didn't do it in my code. 454 00:32:22,100 --> 00:32:22,910 Sorry. 455 00:32:22,910 --> 00:32:28,100 So x equals the goal, then the cost to go is-- 456 00:32:28,100 --> 00:32:29,912 the cost, instantaneous cost, is 0. 457 00:32:29,912 --> 00:32:30,620 Otherwise it's 1. 458 00:32:30,620 --> 00:32:32,995 If there's an obstacle, I just give it a high cost of 10. 
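A hedged sketch of the obstacle cost just described (0 at the goal, 10 on an obstacle, 1 elsewhere), iterated backwards until the cost to go stops changing. The 5-by-5 grid, the three obstacle cells, the wall-clipping rule, and all function names are assumptions for illustration; the lecture's actual demo code is not shown in the transcript.

```python
import numpy as np

ROWS, COLS = 5, 5                                  # assumed grid size
GOAL = (2, 3)                                      # assumed (row, col) goal cell
OBSTACLES = {(1, 1), (2, 1), (3, 1)}               # hypothetical obstacle cells
ACTIONS = [(0, 0), (-1, 0), (0, 1), (1, 0), (0, -1)]  # still, up, right, down, left


def f(s, a):
    """Deterministic dynamics; moves into a wall are clipped."""
    return (min(max(s[0] + a[0], 0), ROWS - 1),
            min(max(s[1] + a[1], 0), COLS - 1))


def g(s, a):
    """Instantaneous cost: 0 at the goal, 10 on an obstacle, 1 otherwise."""
    if s == GOAL:
        return 0.0
    return 10.0 if s in OBSTACLES else 1.0


def backup(J_next):
    """J(s) = min_a g(s, a) + J_next(f(s, a)) for every cell."""
    J = np.empty_like(J_next)
    for r in range(ROWS):
        for c in range(COLS):
            J[r, c] = min(g((r, c), a) + J_next[f((r, c), a)] for a in ACTIONS)
    return J


# Start from the terminal cost (same function as g) and back up until nothing changes.
J = np.array([[g((r, c), ACTIONS[0]) for c in range(COLS)] for r in range(ROWS)])
for sweep in range(1000):
    J_new = backup(J)
    if np.array_equal(J_new, J):
        break
    J = J_new

# Follow a greedy (argmin) policy from the far side of the obstacle wall.
s = (2, 0)
path = [s]
while s != GOAL:
    s = f(s, min(ACTIONS, key=lambda a: g(s, a) + J[f(s, a)]))
    path.append(s)
print(path)
```

With these particular obstacle cells, the straight-line route from (2, 0) would cross a cost-10 cell, so the converged cost to go prefers the longer route around the wall. Nothing in the algorithm changed to accommodate the obstacle; only the cost function did.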
459 00:32:37,280 --> 00:32:45,310 So if I put that obstacle function in there, 460 00:32:45,310 --> 00:32:49,070 then I've got my same 0 cost for the goal. 461 00:32:49,070 --> 00:32:50,570 I've got a 1 cost almost everywhere, 462 00:32:50,570 --> 00:32:51,570 but I've got a 10 there. 463 00:32:51,570 --> 00:32:52,700 That's my cost function. 464 00:32:52,700 --> 00:32:56,810 And as I backup, a couple of things happened. 465 00:32:56,810 --> 00:32:58,360 First, this thing quickly figures out 466 00:32:58,360 --> 00:33:01,510 how to get off that obstacle as fast as possible 467 00:33:01,510 --> 00:33:04,300 and decides not to go there anymore. 468 00:33:04,300 --> 00:33:06,423 And then as you back up the cost function, 469 00:33:06,423 --> 00:33:07,840 the colors are a little more muted 470 00:33:07,840 --> 00:33:09,670 because I have this high color here. 471 00:33:09,670 --> 00:33:14,140 But the same basic algorithm plays out 472 00:33:14,140 --> 00:33:15,340 until it covers the space. 473 00:33:18,160 --> 00:33:19,996 And my s-- oh, that was a-- 474 00:33:19,996 --> 00:33:21,940 [LAUGHTER] 475 00:33:21,940 --> 00:33:25,150 --lucky initial condition. 476 00:33:25,150 --> 00:33:25,680 OK, good. 477 00:33:25,680 --> 00:33:26,680 Now he has to go around. 478 00:33:26,680 --> 00:33:27,180 Wow. 479 00:33:30,700 --> 00:33:34,873 OK, so adding an obstacle in the grid world is clearly trivial. 480 00:33:34,873 --> 00:33:37,290 It's nice to think that adding an obstacle when I get back 481 00:33:37,290 --> 00:33:39,000 to the pendulum would be trivial, 482 00:33:39,000 --> 00:33:41,375 because that's not trivial for most of your other control 483 00:33:41,375 --> 00:33:43,140 derivations. 484 00:33:43,140 --> 00:33:46,040 OK, so minimum-- the quadratic regulator now. 
485 00:33:50,210 --> 00:33:55,320 Now here, the cost I want is g of x, u, 486 00:33:55,320 --> 00:34:04,620 in the continuous world is some x minus x goal transpose Q x 487 00:34:04,620 --> 00:34:05,550 minus x goal. 488 00:34:09,600 --> 00:34:15,210 And you have to map that down into the integer 489 00:34:15,210 --> 00:34:17,760 world, the states. 490 00:34:17,760 --> 00:34:21,030 There's not a particularly clean way to write that, 491 00:34:21,030 --> 00:34:22,980 so I'm just going to allow you to imagine 492 00:34:22,980 --> 00:34:26,250 that it's trivial to code. 493 00:34:26,250 --> 00:34:27,859 Imagine that transition. 494 00:34:47,820 --> 00:34:51,360 OK, now my cost function is just penalizing me 495 00:34:51,360 --> 00:34:53,310 for being away from the goal. 496 00:34:53,310 --> 00:34:55,139 But it's not a 0 and 1. 497 00:34:55,139 --> 00:34:58,270 It's penalizing me more smoothly for being away from the goal. 498 00:34:58,270 --> 00:35:00,310 So what's the best thing to do? 499 00:35:00,310 --> 00:35:02,310 The best thing to do is still to get to the goal 500 00:35:02,310 --> 00:35:03,840 as quickly as possible. 501 00:35:03,840 --> 00:35:06,840 It actually doesn't really change the optimal policy here, 502 00:35:06,840 --> 00:35:09,670 but it's a more smooth cost function, 503 00:35:09,670 --> 00:35:14,680 which, in some problems, gives you nice properties. 504 00:35:14,680 --> 00:35:17,970 It turns out the optimal policy is more unique in this case. 505 00:35:20,467 --> 00:35:22,800 But that would have been an optimal for the minimum time 506 00:35:22,800 --> 00:35:24,180 problem, too. 507 00:35:24,180 --> 00:35:33,607 And it converges nicely and goes to the goal in the same way, 508 00:35:33,607 --> 00:35:35,440 and works fine with the obstacle, of course. 509 00:35:39,570 --> 00:35:40,070 OK? 510 00:35:45,080 --> 00:35:46,370 Good.
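One possible way to map the quadratic cost down into the integer states is sketched below. It assumes (this is not from the lecture) that states are numbered row by row on a grid of a given width, that the cell's row and column serve directly as the coordinates of x, and that Q is diagonal. The helper names are hypothetical.

```python
def index_to_cell(s, width):
    """Map an integer state index to (row, col) on a row-major grid."""
    return divmod(s, width)


def quadratic_cost(s, goal_s, width, q_diag=(1.0, 1.0)):
    """(x - x_goal)^T Q (x - x_goal) with a diagonal Q, evaluated on grid cells."""
    x = index_to_cell(s, width)
    x_goal = index_to_cell(goal_s, width)
    return sum(q * (a - b) ** 2 for q, a, b in zip(q_diag, x, x_goal))


# State 13 on a 5-wide grid is cell (2, 3); the goal is state 0, cell (0, 0).
example_cost = quadratic_cost(13, 0, width=5)
```

The cost is still zero at the goal, so the convergence argument for the minimum time cost carries over.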
511 00:35:46,370 --> 00:35:51,380 So now you have a little bit more intuition 512 00:35:51,380 --> 00:35:55,670 to work with on these cost to go functions. 513 00:35:55,670 --> 00:35:58,130 A couple of important things happened there 514 00:35:58,130 --> 00:36:01,540 that I want to highlight. 515 00:36:01,540 --> 00:36:04,715 First of all, I really want you to think in terms of cost 516 00:36:04,715 --> 00:36:05,690 to go functions. 517 00:36:05,690 --> 00:36:06,930 They're really intuitive. 518 00:36:10,790 --> 00:36:14,270 The cost that I will obtain till the end of time, 519 00:36:14,270 --> 00:36:18,320 the optimal cost to go says if I'm acting optimally, 520 00:36:18,320 --> 00:36:20,090 this is the cost I'm going to incur. 521 00:36:20,090 --> 00:36:24,630 And the optimal cost to go gives me the optimal policy, OK. 522 00:36:27,860 --> 00:36:31,640 And just to calibrate you here, J star 523 00:36:31,640 --> 00:36:33,980 is called the optimal cost to go, 524 00:36:33,980 --> 00:36:38,190 but it's also sometimes called a value function, optimal value 525 00:36:38,190 --> 00:36:38,690 function. 526 00:36:53,422 --> 00:36:55,880 A bunch of different communities talk about the same things 527 00:36:55,880 --> 00:36:59,000 with different words. 528 00:36:59,000 --> 00:37:00,170 These are the optimists. 529 00:37:00,170 --> 00:37:01,474 These are the pessimists. 530 00:37:08,400 --> 00:37:19,140 OK, the other thing that we saw is that for many problems, 531 00:37:19,140 --> 00:37:25,800 the limit as N goes to negative infinity-- 532 00:37:30,750 --> 00:37:33,990 I know that's a silly thing to say, I guess, but-- 533 00:37:37,980 --> 00:37:40,260 that a lot of times this thing actually 534 00:37:40,260 --> 00:37:48,696 goes to some well posed J star. 535 00:37:48,696 --> 00:37:51,930 It doesn't have to. 536 00:37:51,930 --> 00:37:53,010 Sometimes it blows up. 
537 00:37:57,330 --> 00:38:05,670 Another way to think of this is that I said J S of N 538 00:38:05,670 --> 00:38:20,400 is S of capital N. It's the limit of this as capital 539 00:38:20,400 --> 00:38:25,050 N goes to infinity, if you think of it in the forward way. 540 00:38:25,050 --> 00:38:31,740 So in order for this thing to converge to some nice solution, 541 00:38:31,740 --> 00:38:35,760 this sum had better converge in the limit. 542 00:38:38,760 --> 00:38:42,180 For my choice of g for the minimum time problem, 543 00:38:42,180 --> 00:38:45,472 and for the quadratic regulator, both of these 544 00:38:45,472 --> 00:38:47,430 had the property that when you get to the goal, 545 00:38:47,430 --> 00:38:50,900 you stop incurring cost. 546 00:38:50,900 --> 00:38:53,570 So that integral-- as long as you can get to the goal, 547 00:38:53,570 --> 00:38:56,225 that integral-- the sum, sorry, is going to converge. 548 00:38:59,330 --> 00:39:04,190 If I had chosen that I give a cost of 1 549 00:39:04,190 --> 00:39:06,800 when I'm at the goal and 2 when I'm anywhere else, 550 00:39:06,800 --> 00:39:08,630 then it wouldn't have converged. 551 00:39:12,100 --> 00:39:14,860 The cost to go would have gone to that same shape, 552 00:39:14,860 --> 00:39:16,360 but then that shape would have just 553 00:39:16,360 --> 00:39:18,430 kept increasing every time I go farther back in time. 554 00:39:18,430 --> 00:39:20,305 That whole function would just move up by one 555 00:39:20,305 --> 00:39:23,170 every increment of time, OK. 556 00:39:23,170 --> 00:39:26,140 But for a lot of problems, we do have 557 00:39:26,140 --> 00:39:32,350 this nice limiting behavior, OK, and that gives rise 558 00:39:32,350 --> 00:39:37,350 to the infinite horizon problems. 559 00:39:42,320 --> 00:39:44,740 So so far, I had talked about finite horizon, 560 00:39:44,740 --> 00:39:46,240 but a lot of time, a lot of problems 561 00:39:46,240 --> 00:39:47,448 we write as infinite horizon. 
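The convergence argument can be checked on a toy problem. The sketch below assumes a five-state chain with the goal at state 0 and three actions (left, stay, right), which is an illustrative setup rather than the lecture's grid. With cost 0 at the goal the backups settle to a fixed cost to go. With cost 1 at the goal and 2 elsewhere the same shape appears but keeps shifting up by one on every backup.

```python
def backup(J, g):
    """One dynamic programming backup on a 1-D chain with left/stay/right actions."""
    n = len(J)
    new_J = []
    for s in range(n):
        successors = {max(s - 1, 0), s, min(s + 1, n - 1)}
        new_J.append(g[s] + min(J[s2] for s2 in successors))
    return new_J


def run(g, num_backups):
    """Start from zero terminal cost and record the cost to go after each backup."""
    J = [0] * len(g)
    history = [J]
    for _ in range(num_backups):
        J = backup(J, g)
        history.append(J)
    return history


g_min_time = [0, 1, 1, 1, 1]   # 0 at the goal, 1 everywhere else
g_shifted = [1, 2, 2, 2, 2]    # 1 at the goal, 2 everywhere else

converged = run(g_min_time, 10)
drifting = run(g_shifted, 10)
step_increase = [a - b for a, b in zip(drifting[-1], drifting[-2])]
```

In the second case the minimizing action at each state is the same as in the first, so the policy is still sensible even though the cost to go never settles.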
562 00:40:03,059 --> 00:40:03,559 OK. 563 00:40:06,060 --> 00:40:10,240 When your problems are infinite horizon, J and J star 564 00:40:10,240 --> 00:40:13,740 don't depend on time anymore. 565 00:40:13,740 --> 00:40:17,500 And the optimal policy doesn't depend on time. 566 00:40:17,500 --> 00:40:23,940 So J star and pi, all these things are just functions of S, 567 00:40:23,940 --> 00:40:26,520 not of time, OK. 568 00:40:37,290 --> 00:40:40,830 And for these to be well posed, that sum had better converge. 569 00:41:04,650 --> 00:41:07,590 Now just to say it, but not to dwell on it, a lot of people 570 00:41:07,590 --> 00:41:12,480 do write other formulations that handle that. 571 00:41:12,480 --> 00:41:18,660 For instance, a lot of people do discounting. 572 00:41:18,660 --> 00:41:30,570 A lot of people like to solve problems of this form, OK, just 573 00:41:30,570 --> 00:41:37,080 to make it more likely that that sum's going to converge, 574 00:41:37,080 --> 00:41:37,740 for instance. 575 00:41:37,740 --> 00:41:40,330 And there's some problems which really do have discounting. 576 00:41:40,330 --> 00:41:40,830 Yeah. 577 00:41:40,830 --> 00:41:42,980 AUDIENCE: So that's less than 1 [INAUDIBLE].. 578 00:41:42,980 --> 00:41:43,550 PROFESSOR: Yes, thank you. 579 00:41:43,550 --> 00:41:43,710 Good. 580 00:41:43,710 --> 00:41:44,343 Good call. 581 00:41:51,500 --> 00:41:52,000 Thank you. 582 00:42:05,760 --> 00:42:10,170 OK, so you know the basic dynamic programming 583 00:42:10,170 --> 00:42:12,480 equations, no? 584 00:42:12,480 --> 00:42:17,660 Let me just say one word about implementation, 585 00:42:17,660 --> 00:42:21,270 if you want to go home and make your own '80s graphics 586 00:42:21,270 --> 00:42:22,425 game in Matlab. 587 00:42:27,000 --> 00:42:42,345 For discrete states, discrete actions, J, 588 00:42:42,345 --> 00:42:47,730 even J star of S at some N, it's a vector. 
589 00:42:53,550 --> 00:43:00,060 Typically I think of it as sort of a dimension of S 590 00:43:00,060 --> 00:43:01,170 by one vector. 591 00:43:03,990 --> 00:43:06,135 And dimension isn't the right word. 592 00:43:06,135 --> 00:43:12,620 This is-- so the cardinality of S, 593 00:43:12,620 --> 00:43:16,620 let's say, something like that, a big S, 594 00:43:16,620 --> 00:43:19,710 the number of possible states by one vector. 595 00:43:24,250 --> 00:43:30,880 And it's very practical to write that recursion for all states 596 00:43:30,880 --> 00:43:32,390 as a vector equation. 597 00:43:32,390 --> 00:43:38,350 So if I think of J star as being a vector, 598 00:43:38,350 --> 00:43:42,910 I have to do a min over a of g S, a. 599 00:43:42,910 --> 00:43:48,950 But g is another vector which depends on a. 600 00:43:48,950 --> 00:43:53,125 It's an S by 1 plus-- 601 00:44:10,030 --> 00:44:11,620 I can write it as a vector equation 602 00:44:11,620 --> 00:44:13,400 where this is a vector. 603 00:44:13,400 --> 00:44:14,285 This is a matrix. 604 00:44:14,285 --> 00:44:15,535 This is the transition matrix. 605 00:44:24,220 --> 00:44:25,390 And this is my vector again. 606 00:44:28,240 --> 00:44:52,336 And then the transition matrix is just 1 if f of i, a 607 00:44:52,336 --> 00:44:55,338 equals j and 0 otherwise. 608 00:45:08,050 --> 00:45:08,980 OK. 609 00:45:08,980 --> 00:45:12,040 That's just a standard graph notation. 610 00:45:16,830 --> 00:45:19,090 So it's trivial to code these things in Matlab 611 00:45:19,090 --> 00:45:21,786 with just a bunch of matrix manipulations. 612 00:45:24,910 --> 00:45:29,020 OK, we understand everything about the grid world. 613 00:45:29,020 --> 00:45:32,500 I think it is a very helpful example, actually. 614 00:45:32,500 --> 00:45:35,980 Now let's think about the more continuous problems 615 00:45:35,980 --> 00:45:37,180 that I care about.
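The vector form of the recursion, a min over actions of a cost vector plus a transition matrix times the cost-to-go vector, can be sketched as follows. The lecture mentions Matlab; this sketch uses plain Python lists instead. The function names and the four-state chain used to exercise it are assumptions for illustration only.

```python
def build_transition_matrix(f, num_states, a):
    """T_a[i][j] is 1 if f(i, a) == j and 0 otherwise."""
    return [[1 if f(i, a) == j else 0 for j in range(num_states)]
            for i in range(num_states)]


def mat_vec(T, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(t * x for t, x in zip(row, v)) for row in T]


def q_values(g, T, J):
    """For each action a, the vector g_a + T_a J."""
    return {a: [c + x for c, x in zip(g[a], mat_vec(T[a], J))] for a in g}


def value_iteration(g, T, num_states, num_backups):
    """Repeat J <- min over a of (g_a + T_a J), then read off a greedy policy."""
    J = [0.0] * num_states
    for _ in range(num_backups):
        Q = q_values(g, T, J)
        J = [min(Q[a][i] for a in Q) for i in range(num_states)]
    Q = q_values(g, T, J)
    policy = [min(Q, key=lambda a: Q[a][i]) for i in range(num_states)]
    return J, policy


# Hypothetical 4-state chain: actions move left, stay, or move right; goal is state 0.
n = 4
actions = [-1, 0, 1]
f = lambda i, a: min(max(i + a, 0), n - 1)
T = {a: build_transition_matrix(f, n, a) for a in actions}
g = {a: [0.0 if i == 0 else 1.0 for i in range(n)] for a in actions}
J_star, policy = value_iteration(g, T, n, num_backups=10)
```

At the goal state several actions tie, which is the same non-uniqueness of the optimal policy discussed earlier; the `min` here simply returns the first minimizer.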
616 00:45:37,180 --> 00:45:44,110 What if, instead of having the dynamics of this 617 00:45:44,110 --> 00:45:47,350 moving left, right, whatever, my dynamics, my transitions came 618 00:45:47,350 --> 00:45:50,020 from my equations of motion from one of the systems 619 00:45:50,020 --> 00:45:51,940 we care about? 620 00:45:51,940 --> 00:46:16,690 So let's think about the double integrator. 621 00:46:16,690 --> 00:46:18,640 q double dot equals u. 622 00:46:18,640 --> 00:46:20,185 Let's do the min time problem. 623 00:46:24,650 --> 00:46:29,700 I can use the same minimum time cost function I did before, OK. 624 00:46:48,126 --> 00:46:52,110 [TYPING] 625 00:47:03,130 --> 00:47:04,720 OK. 626 00:47:04,720 --> 00:47:09,190 This one, I didn't leave the pause in there, 627 00:47:09,190 --> 00:47:12,720 but look what happens. 628 00:47:12,720 --> 00:47:13,390 Oops, sorry. 629 00:47:13,390 --> 00:47:15,070 Meant to do that. 630 00:47:15,070 --> 00:47:15,700 Make it bigger. 631 00:47:18,690 --> 00:47:21,010 I pop the same-- 632 00:47:21,010 --> 00:47:22,440 let me turn the lights down. 633 00:47:22,440 --> 00:47:27,120 I pop that same exact set of equations. 634 00:47:27,120 --> 00:47:30,840 I run the same value iteration algorithm, dynamic programming 635 00:47:30,840 --> 00:47:32,660 algorithm. 636 00:47:32,660 --> 00:47:34,410 I should have said, people tend to call it 637 00:47:34,410 --> 00:47:38,790 value iteration for when you take the infinite horizon 638 00:47:38,790 --> 00:47:42,150 version and dynamic programming if you call it-- 639 00:47:42,150 --> 00:47:44,190 if you do the finite horizon, but they're 640 00:47:44,190 --> 00:47:46,150 exactly the same thing, OK. 641 00:47:46,150 --> 00:47:48,450 So I might accidentally say value iteration 642 00:47:48,450 --> 00:47:50,730 because I'm used to it. 643 00:47:50,730 --> 00:47:57,150 OK, so I took my double integrator dynamics. 644 00:47:57,150 --> 00:47:59,100 I discretized my space. 
645 00:47:59,100 --> 00:48:01,890 I made my cost function exactly the same 646 00:48:01,890 --> 00:48:03,510 as the minimum time cost function 647 00:48:03,510 --> 00:48:05,160 I used in the grid world, where there's 648 00:48:05,160 --> 00:48:09,480 a 0 cost of being at the goal and 1 everywhere else. 649 00:48:09,480 --> 00:48:11,460 And look what pops out. 650 00:48:11,460 --> 00:48:14,383 This is the cost to go function, as a function of state, 651 00:48:14,383 --> 00:48:15,300 and that's the policy. 652 00:48:18,310 --> 00:48:20,320 Remind you of anything? 653 00:48:20,320 --> 00:48:21,730 Right? 654 00:48:21,730 --> 00:48:23,260 Now I've got a big disclaimer that 655 00:48:23,260 --> 00:48:25,780 goes at the end of the lecture, but for now, let's just 656 00:48:25,780 --> 00:48:27,505 say that that's the perfect solution. 657 00:48:30,070 --> 00:48:32,740 The discretization is going to make this thing a little bit 658 00:48:32,740 --> 00:48:34,030 wrong. 659 00:48:34,030 --> 00:48:35,830 I'm going to say a few things about that 660 00:48:35,830 --> 00:48:37,240 at the end of the class. 661 00:48:37,240 --> 00:48:41,560 But the cool thing is that I pop my cost function in. 662 00:48:41,560 --> 00:48:43,670 I pop my continuous dynamical system. 663 00:48:43,670 --> 00:48:44,965 It's discretized. 664 00:48:44,965 --> 00:48:47,920 [CLICK] Run dynamic programming. 665 00:48:47,920 --> 00:48:50,890 As I back it up, it converges to some-- 666 00:48:50,890 --> 00:48:53,230 as N goes back, it does converge for this. 667 00:48:53,230 --> 00:48:54,850 It was the minimum time problem. 668 00:48:54,850 --> 00:48:56,740 And I get my optimal policy out, which 669 00:48:56,740 --> 00:48:59,488 is a bang bang policy, which is decelerate 670 00:48:59,488 --> 00:49:01,030 when you're at the bottom, accelerate 671 00:49:01,030 --> 00:49:02,030 when you're at the top.
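For comparison with the policy plot described here, the standard closed-form minimum-time policy for the double integrator can be written down directly. This sketch assumes the goal is the origin and the action is bounded by a hypothetical `u_max` (set to 1 here); the lecture's actual torque limits and grid are not stated, so the numbers are illustrative only.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def min_time_policy(q, qdot, u_max=1.0):
    """Bang-bang policy for q'' = u with |u| <= u_max and the goal at the origin."""
    # Switching curve: q = -qdot * |qdot| / (2 * u_max).
    s = q + qdot * abs(qdot) / (2.0 * u_max)
    if s != 0:
        return -u_max * sign(s)      # off the curve: push toward it at full effort
    return -u_max * sign(qdot)       # on the curve: brake along it into the origin


u_right_of_goal = min_time_policy(1.0, 0.0)    # at rest to the right of the goal
u_on_curve = min_time_policy(-0.5, 1.0)        # on the switching curve, moving right
```

A policy computed by dynamic programming on a grid would be expected to agree with this away from the switching curve, and to differ near it because of the discretization.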
672 00:49:02,030 --> 00:49:03,908 And that switching surface shows up in green 673 00:49:03,908 --> 00:49:05,200 just because it's interpolated. 674 00:49:05,200 --> 00:49:12,520 But when you know it, that's what we know about bang bang 675 00:49:12,520 --> 00:49:14,110 controllers, OK. 676 00:49:14,110 --> 00:49:14,650 Yeah. 677 00:49:14,650 --> 00:49:17,455 AUDIENCE: Did you have to encode that your only three 678 00:49:17,455 --> 00:49:20,315 actions were full forward, full backward, and-- 679 00:49:20,315 --> 00:49:21,940 PROFESSOR: The minimum over a is always 680 00:49:21,940 --> 00:49:24,670 going to choose the rails. 681 00:49:24,670 --> 00:49:26,462 In fact, in this implementation, they 682 00:49:26,462 --> 00:49:28,420 could have chosen in between things, and that's 683 00:49:28,420 --> 00:49:29,740 what it did right on the switching surface because 684 00:49:29,740 --> 00:49:30,550 of some-- 685 00:49:30,550 --> 00:49:31,360 it chose 0. 686 00:49:31,360 --> 00:49:32,860 AUDIENCE: OK, so you left the general just as-- 687 00:49:32,860 --> 00:49:34,068 PROFESSOR: I left it general. 688 00:49:34,068 --> 00:49:35,410 Yeah. 689 00:49:35,410 --> 00:49:37,870 So always, when I discretize the state 690 00:49:37,870 --> 00:49:40,300 and I discretize the actions of these continuous problems, 691 00:49:40,300 --> 00:49:41,890 I'm left with a finite set of states, 692 00:49:41,890 --> 00:49:43,040 a finite set of actions. 693 00:49:43,040 --> 00:49:44,260 So it can't pick unbounded. 694 00:49:44,260 --> 00:49:49,210 It's fundamentally bounded in actions that it can choose, 695 00:49:49,210 --> 00:49:51,210 and it chose those bounds. 696 00:49:51,210 --> 00:49:53,760 AUDIENCE: [INAUDIBLE] 697 00:49:53,760 --> 00:49:54,760 PROFESSOR: Say it again. 698 00:49:54,760 --> 00:49:58,040 AUDIENCE: How do you define the transition model? 699 00:49:58,040 --> 00:49:58,930 PROFESSOR: Good. 
700 00:49:58,930 --> 00:50:00,100 I'm going to say some words about that in a minute, 701 00:50:00,100 --> 00:50:01,300 too, OK. 702 00:50:01,300 --> 00:50:03,700 Yeah. 703 00:50:03,700 --> 00:50:04,240 But not yet. 704 00:50:04,240 --> 00:50:05,740 Just give me a minute. 705 00:50:05,740 --> 00:50:09,400 OK, let's say we want to solve the LQR 706 00:50:09,400 --> 00:50:13,060 problem, the quadratic regulator cost for this. 707 00:50:18,880 --> 00:50:22,760 [TYPING] 708 00:50:28,580 --> 00:50:35,120 So I animated the brick for you just to keep it exciting. 709 00:50:35,120 --> 00:50:37,460 OK, so what pops out? 710 00:50:37,460 --> 00:50:41,690 This beautiful quadratic cost to go function, OK. 711 00:50:41,690 --> 00:50:44,120 Now this is a little bit off. 712 00:50:44,120 --> 00:50:45,980 It's supposed to be a linear function. 713 00:50:45,980 --> 00:50:48,920 It almost is, but there's some saturation because 714 00:50:48,920 --> 00:50:51,890 of my actuator limits, OK. 715 00:50:51,890 --> 00:50:56,670 But within the resolution of sort of my discrete actions, 716 00:50:56,670 --> 00:51:00,720 that's what we expected, OK. 717 00:51:00,720 --> 00:51:02,550 So I can do this for the brick. 718 00:51:02,550 --> 00:51:04,010 I'm going to tell you the caveats again in a minute, 719 00:51:04,010 --> 00:51:06,343 and I'm going to tell you the interpolation in a minute. 720 00:51:06,343 --> 00:51:09,660 But first I just want to help you realize that this-- 721 00:51:09,660 --> 00:51:12,148 we can pop these equations in if we're 722 00:51:12,148 --> 00:51:14,190 willing to discretize the state and action space. 723 00:51:14,190 --> 00:51:17,760 Even for pretty hard problems, I can just [CLICK] let it go. 724 00:51:17,760 --> 00:51:19,110 It's pretty fast, too, actually. 725 00:51:23,340 --> 00:51:25,650 OK. 
726 00:51:25,650 --> 00:51:28,380 So now why not-- 727 00:51:28,380 --> 00:51:30,990 analytically, we had a hard time doing the pendulum, 728 00:51:30,990 --> 00:51:33,660 those nonlinear equations, OK. 729 00:51:33,660 --> 00:51:36,000 But if we tile the space, turn it into a graph, 730 00:51:36,000 --> 00:51:37,800 then I can run the exact same algorithm 731 00:51:37,800 --> 00:51:40,710 on the simple pendulum, OK. 732 00:51:40,710 --> 00:51:41,850 So let's do that. 733 00:51:46,700 --> 00:51:50,580 [TYPING] 734 00:51:55,540 --> 00:51:58,560 What am I going to get here? 735 00:51:58,560 --> 00:52:02,000 So minimum time for the simple pendulum. 736 00:52:04,520 --> 00:52:08,540 I've got my pause back in here. 737 00:52:08,540 --> 00:52:10,280 It's hard to see, but there's actually-- 738 00:52:10,280 --> 00:52:13,460 it's 1 everywhere except for 0 at the goal, which is the-- 739 00:52:13,460 --> 00:52:17,540 now I'm in phase space, so that's pi at 0. 740 00:52:17,540 --> 00:52:21,110 That's my unstable fixed point, OK. 741 00:52:21,110 --> 00:52:25,670 I've got a blue 0 there, 1 everywhere else. 742 00:52:25,670 --> 00:52:28,640 At the end of time, my action is just do nothing, 743 00:52:28,640 --> 00:52:31,070 because there's no benefit to doing anything. 744 00:52:31,070 --> 00:52:36,238 And as I back up in time, this will give you a key to what-- 745 00:52:36,238 --> 00:52:38,780 you can see a little bit about my interpolation as I do this. 746 00:52:38,780 --> 00:52:45,710 OK, then it starts giving me incentive to move. 747 00:52:45,710 --> 00:52:48,170 Again, when you can't get to the goal, 748 00:52:48,170 --> 00:52:51,890 that's actually just noise there. 749 00:52:51,890 --> 00:52:54,740 But this thing quickly figures out-- 750 00:53:00,920 --> 00:53:01,420 oops. 751 00:53:04,380 --> 00:53:11,530 Let me do the same thing and let it not plot every time.
752 00:53:22,060 --> 00:53:25,370 Figures out a cost to go function, 753 00:53:25,370 --> 00:53:28,150 the optimal cost to go function, and an optimal policy. 754 00:53:28,150 --> 00:53:30,530 Now it looks a little noisy there. 755 00:53:30,530 --> 00:53:32,530 Again, we're going to talk about the sensitivity 756 00:53:32,530 --> 00:53:33,760 to discretization. 757 00:53:33,760 --> 00:53:36,190 But this is very much a bang bang 758 00:53:36,190 --> 00:53:39,550 policy, with the blue area being, 759 00:53:39,550 --> 00:53:43,105 do one action, the red area doing the other action. 760 00:53:43,105 --> 00:53:48,200 The switching surface is actually pretty complicated. 761 00:53:48,200 --> 00:53:51,610 It's some complicated function of state, 762 00:53:51,610 --> 00:53:56,190 but it gets this beautifully smooth cost to go function, OK. 763 00:54:03,880 --> 00:54:06,850 Now let's take a second and look at the phase plots here. 764 00:54:09,802 --> 00:54:12,440 Let me actually do it in order here. 765 00:54:12,440 --> 00:54:26,450 So this is the phase plot of the damped passive pendulum, 766 00:54:26,450 --> 00:54:30,960 OK, the original one we thought about in class. 767 00:54:30,960 --> 00:54:33,450 I just drew a few lines to help you. 768 00:54:33,450 --> 00:54:36,620 So if I start at the downright position 769 00:54:36,620 --> 00:54:40,220 with a little bit of velocity, I'd slow down and stop. 770 00:54:40,220 --> 00:54:43,940 If I start near an unstable fixed point 771 00:54:43,940 --> 00:54:47,180 with near 0 velocity, then I actually 772 00:54:47,180 --> 00:54:50,870 fall down and go like this and end up standing still 773 00:54:50,870 --> 00:54:54,860 near the closest stable fixed point, OK. 774 00:54:57,380 --> 00:55:03,740 Now if I do my feedback linearization invert gravity 775 00:55:03,740 --> 00:55:06,732 controller to stabilize the fixed point, then 776 00:55:06,732 --> 00:55:08,440 what's the phase plot going to look like?
777 00:55:13,233 --> 00:55:14,650 It's going to look just like this, 778 00:55:14,650 --> 00:55:17,640 but it's going to be moved over there, right? 779 00:55:17,640 --> 00:55:19,460 So let's make sure that's true. 780 00:55:23,810 --> 00:55:26,065 Ah, what did I call it? 781 00:55:26,065 --> 00:55:26,690 Invert gravity. 782 00:55:30,940 --> 00:55:32,080 OK, yeah. 783 00:55:32,080 --> 00:55:34,525 So I see the exact same things. 784 00:55:34,525 --> 00:55:36,650 Used to be my stable fixed point are now going over 785 00:55:36,650 --> 00:55:39,790 to the closest unstable one. 786 00:55:39,790 --> 00:55:40,625 This works great. 787 00:55:40,625 --> 00:55:42,250 The only objection to it is it required 788 00:55:42,250 --> 00:55:44,530 an enormous amount of torque to just pretend 789 00:55:44,530 --> 00:55:46,670 like you've inverted gravity. 790 00:55:46,670 --> 00:55:50,580 OK, so what's the minimum time solution going to look like? 791 00:56:02,710 --> 00:56:04,293 AUDIENCE: It's going to depend on what 792 00:56:04,293 --> 00:56:05,600 your torque constraint is. 793 00:56:05,600 --> 00:56:07,433 PROFESSOR: It's going to depend on what my torque 794 00:56:07,433 --> 00:56:10,220 constraint is, yeah. 795 00:56:10,220 --> 00:56:13,000 So for whatever torque constraint I have now, 796 00:56:13,000 --> 00:56:15,500 you could even figure out the units here. 797 00:56:15,500 --> 00:56:17,510 My torque constraint was chosen to be something 798 00:56:17,510 --> 00:56:23,190 like half of the stall torque required to hold out like this. 799 00:56:23,190 --> 00:56:24,800 Then let's see what happens. 800 00:56:29,380 --> 00:56:33,840 This is the minimum time solution, 801 00:56:33,840 --> 00:56:35,070 which is exactly right. 802 00:56:35,070 --> 00:56:37,390 If I had more torque to give, it could 803 00:56:37,390 --> 00:56:39,450 have gotten out there quicker.
804 00:56:39,450 --> 00:56:43,350 And this added enough that, after going around once, 805 00:56:43,350 --> 00:56:45,460 it could get up to the top, OK. 806 00:56:54,520 --> 00:56:55,190 Let me see. 807 00:56:55,190 --> 00:56:56,740 Why is it not drawing anymore? 808 00:56:56,740 --> 00:56:58,211 I've got this [INAUDIBLE]. 809 00:57:05,085 --> 00:57:06,080 Oop. 810 00:57:06,080 --> 00:57:08,600 So that was-- that's a random initial condition. 811 00:57:08,600 --> 00:57:10,580 So from the one I had shown, it took one pump. 812 00:57:10,580 --> 00:57:15,080 That one took two pumps, and that gets it to the top. 813 00:57:15,080 --> 00:57:17,420 OK, but now, remember, my original challenge 814 00:57:17,420 --> 00:57:20,632 was to not just get to the top in minimum time. 815 00:57:20,632 --> 00:57:22,340 This is minimum time with bounded torque, 816 00:57:22,340 --> 00:57:23,900 so that's a little bit more satisfying. 817 00:57:23,900 --> 00:57:25,358 I don't want to pump in more torque 818 00:57:25,358 --> 00:57:27,290 than I could possibly implement. 819 00:57:27,290 --> 00:57:30,758 But what if I want to be sensitive about the torque? 820 00:57:30,758 --> 00:57:32,300 I want to get to the top, but I don't 821 00:57:32,300 --> 00:57:34,520 want to use a bunch of energy. 822 00:57:34,520 --> 00:57:37,270 OK, now the quadratic cost function makes a lot of sense, 823 00:57:37,270 --> 00:57:38,300 OK. 824 00:57:38,300 --> 00:57:40,040 So I'm going to put a quadratic cost 825 00:57:40,040 --> 00:57:42,560 on being away from the top and a big quadratic 826 00:57:42,560 --> 00:57:44,900 cost on using actions. 827 00:57:44,900 --> 00:57:48,650 So that'll give me some sense of minimally stabilizing the top, 828 00:57:48,650 --> 00:57:49,310 OK. 829 00:57:49,310 --> 00:57:50,768 What's that one going to look like? 830 00:57:53,900 --> 00:57:56,360 Would you expect it to look like-- 831 00:57:56,360 --> 00:57:57,212 phase plot going. 
832 00:57:57,212 --> 00:57:58,670 AUDIENCE: Basically in phase space, 833 00:57:58,670 --> 00:58:01,717 it will take more turns to get up there on the top [INAUDIBLE].. 834 00:58:01,717 --> 00:58:02,300 PROFESSOR: OK. 835 00:58:02,300 --> 00:58:03,633 What about if it's near the top? 836 00:58:03,633 --> 00:58:06,143 Is it going to look like a damped pendulum at the top? 837 00:58:06,143 --> 00:58:07,060 What's it going to do? 838 00:58:10,120 --> 00:58:12,600 AUDIENCE: Well, if it's headed the wrong way near the top, 839 00:58:12,600 --> 00:58:14,350 it will probably swing all the way around. 840 00:58:14,350 --> 00:58:15,250 PROFESSOR: Good. 841 00:58:15,250 --> 00:58:16,060 Right. 842 00:58:16,060 --> 00:58:18,750 AUDIENCE: But if you put too much cost on distance, 843 00:58:18,750 --> 00:58:22,050 it might end up quickest on the [INAUDIBLE].. 844 00:58:22,050 --> 00:58:23,730 PROFESSOR: Perfect, OK. 845 00:58:23,730 --> 00:58:30,990 So let's switch this to be my quadratic regulator cost. 846 00:58:43,410 --> 00:58:45,360 Right, so that's what you said. 847 00:58:45,360 --> 00:58:46,557 Took more pumps to get up. 848 00:58:46,557 --> 00:58:48,390 And if you plot the phase plot from a couple 849 00:58:48,390 --> 00:58:49,930 of these different places-- 850 00:58:49,930 --> 00:58:51,240 oh. 851 00:58:51,240 --> 00:58:51,840 Crap, sorry. 852 00:58:51,840 --> 00:58:53,010 I thought I picked initial conditions that were 853 00:58:53,010 --> 00:58:54,630 far enough to show you that. 854 00:58:54,630 --> 00:58:55,770 This is what you said, OK. 855 00:58:55,770 --> 00:58:58,228 This one happens to be close enough that it got to the top. 856 00:59:00,570 --> 00:59:03,750 This one took a lot of pumps and got out there.
857 00:59:03,750 --> 00:59:05,500 But the point I was trying to illustrate-- 858 00:59:05,500 --> 00:59:08,680 I guess I need to either penalize torque a little bit 859 00:59:08,680 --> 00:59:09,180 more or-- 860 00:59:23,400 --> 00:59:25,087 I never change things by a factor of 2. 861 00:59:25,087 --> 00:59:25,670 It's too slow. 862 00:59:29,318 --> 00:59:30,943 Oh, I made it not move. 863 00:59:30,943 --> 00:59:31,770 [LAUGHTER] 864 00:59:31,770 --> 00:59:32,840 Sorry. 865 00:59:32,840 --> 00:59:34,220 But it showed my point, OK. 866 00:59:34,220 --> 00:59:37,543 So yeah, it has no incentive to move from the bottom. 867 00:59:37,543 --> 00:59:39,710 It says, I'm going to incur more cost by moving than 868 00:59:39,710 --> 00:59:41,930 by getting close to the goal. 869 00:59:41,930 --> 00:59:42,740 Not getting close. 870 00:59:42,740 --> 00:59:46,730 OK, but up at the top, it is able to-- 871 00:59:46,730 --> 00:59:50,630 given it was near the top with some velocity, 872 00:59:50,630 --> 00:59:53,330 with a little effort, it's worth going around and stabilizing 873 00:59:53,330 --> 00:59:55,960 itself at the top. 874 00:59:55,960 --> 00:59:56,550 Yeah? 875 00:59:56,550 --> 00:59:57,050 OK. 876 01:00:00,170 --> 01:00:00,957 Good. 877 01:00:00,957 --> 01:00:02,582 AUDIENCE: If you iterate it far enough, 878 01:00:02,582 --> 01:00:05,852 it should go at the top, but-- 879 01:00:05,852 --> 01:00:06,630 PROFESSOR: No. 880 01:00:06,630 --> 01:00:07,130 Let's see. 881 01:00:07,130 --> 01:00:08,453 So-- 882 01:00:08,453 --> 01:00:10,036 AUDIENCE: It's because of the damping. 883 01:00:10,036 --> 01:00:11,190 PROFESSOR: It's because of the damping. 884 01:00:11,190 --> 01:00:11,898 AUDIENCE: Oh, OK. 885 01:00:11,898 --> 01:00:13,087 PROFESSOR: Yeah, good. 886 01:00:13,087 --> 01:00:15,170 Because that is actually the steady state solution 887 01:00:15,170 --> 01:00:15,637 I'm plotting. 888 01:00:15,637 --> 01:00:16,179 AUDIENCE: Oh. 
889 01:00:16,179 --> 01:00:18,440 PROFESSOR: Mm-hmm. 890 01:00:18,440 --> 01:00:20,754 OK. 891 01:00:20,754 --> 01:00:24,682 [RUSTLING] 892 01:00:33,030 --> 01:00:34,800 So if you care about simple pendula-- 893 01:00:34,800 --> 01:00:37,928 sorry-- and you want optimal solutions, 894 01:00:37,928 --> 01:00:39,970 this looks like a pretty satisfying way to do it. 895 01:00:39,970 --> 01:00:42,510 You could come up with your arbitrary cost functions 896 01:00:42,510 --> 01:00:43,815 and see what you get. 897 01:00:43,815 --> 01:00:47,250 It runs in no time on my laptop, and you 898 01:00:47,250 --> 01:00:50,145 get things that look like optimal policies, nice phase 899 01:00:50,145 --> 01:00:53,712 plots, you name it, OK. 900 01:00:53,712 --> 01:00:54,420 What's the catch? 901 01:00:54,420 --> 01:00:58,050 First catch is, how do I do the interpolation? 902 01:00:58,050 --> 01:00:59,760 How do I make that transition matrix? 903 01:01:22,280 --> 01:01:27,200 So on my pendulum example, I discretized some states. 904 01:01:27,200 --> 01:01:30,050 I have a handful-- 905 01:01:30,050 --> 01:01:32,990 I've already discretized actions, 906 01:01:32,990 --> 01:01:35,690 and I've got some other states over here 907 01:01:35,690 --> 01:01:38,600 that I've already discretized. 908 01:01:38,600 --> 01:01:41,960 I'd have to be pretty remarkably lucky to have it 909 01:01:41,960 --> 01:01:44,450 that the random actions that I chose, 910 01:01:44,450 --> 01:01:46,838 integrated for some small amount of time, 911 01:01:46,838 --> 01:01:49,130 actually landed right on top of one of my other states. 912 01:01:53,480 --> 01:01:55,880 In fact, they tend to land in between the states, 913 01:01:55,880 --> 01:01:59,480 OK, so we do a little bit of interpolation between them.
914 01:01:59,480 --> 01:02:03,710 And one of the reasons I showed you that transition matrix 915 01:02:03,710 --> 01:02:09,260 form is that it's actually quite OK, quite standard, 916 01:02:09,260 --> 01:02:16,610 to say that my transition matrix, my T from S 917 01:02:16,610 --> 01:02:22,505 to S prime as a function of a is some-- 918 01:02:22,505 --> 01:02:24,650 let me just handwave it here-- 919 01:02:24,650 --> 01:02:40,518 but is some interpolated set of weights for S1 close. 920 01:02:40,518 --> 01:02:44,390 [LAUGHS] OK. 921 01:02:44,390 --> 01:02:47,480 Zach just showed me a sign that said, the pendulum works. 922 01:02:47,480 --> 01:02:49,250 Having Matlab licensing issues. 923 01:02:49,250 --> 01:02:50,420 So we might-- 924 01:02:50,420 --> 01:02:51,680 I was hoping to run these on the real pendulum. 925 01:02:51,680 --> 01:02:53,197 We'll do it on Tuesday if not today. 926 01:02:53,197 --> 01:02:56,930 [LAUGHS] I don't know why he didn't just say that, 927 01:02:56,930 --> 01:03:00,680 but there's a big bright green sign. 928 01:03:00,680 --> 01:03:09,170 So let me write it like this for the moment, OK. 929 01:03:09,170 --> 01:03:14,510 So if I end up being near some states in two dimensions, 930 01:03:14,510 --> 01:03:19,670 I tend to interpolate between the three closest states, OK. 931 01:03:19,670 --> 01:03:21,540 So I'll call those Sy and S1-- 932 01:03:21,540 --> 01:03:27,380 Si, Sj, Sk, and I get some interpolants, W1, W2, and W3. 933 01:03:27,380 --> 01:03:36,200 They'd better sum to one, OK. 934 01:03:36,200 --> 01:03:38,280 And there's actually lots of ways to do that. 935 01:03:38,280 --> 01:03:40,610 So actually, in previous times I've given the class, 936 01:03:40,610 --> 01:03:42,210 I went into some detail about that. 937 01:03:42,210 --> 01:03:43,308 I think that you could-- 938 01:03:43,308 --> 01:03:45,600 if you care about it, there's a lot of ways to do that.
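As a rough illustration of the three weights W1, W2, W3 just described, here is a minimal sketch that computes barycentric weights for a point inside a triangle of grid states Si, Sj, Sk in two dimensions. This is one common way to get weights that sum to one, not necessarily the code used in the lecture; the function name `barycentric_weights` and the example coordinates are assumptions made for illustration.

```python
def barycentric_weights(p, si, sj, sk):
    """Return (w1, w2, w3) with w1 + w2 + w3 == 1 and
    p == w1*si + w2*sj + w3*sk, for 2-D points given as (x, y) tuples."""
    (x, y), (xi, yi), (xj, yj), (xk, yk) = p, si, sj, sk
    # Twice the signed area of the triangle; zero means the
    # three states are collinear and cannot form an interpolant.
    det = (yj - yk) * (xi - xk) + (xk - xj) * (yi - yk)
    if det == 0:
        raise ValueError("degenerate triangle")
    w1 = ((yj - yk) * (x - xk) + (xk - xj) * (y - yk)) / det
    w2 = ((yk - yi) * (x - xk) + (xi - xk) * (y - yk)) / det
    return w1, w2, 1.0 - w1 - w2


# Hypothetical grid states and a landing point that falls between them.
si, sj, sk = (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)
w1, w2, w3 = barycentric_weights((0.25, 0.25), si, sj, sk)
```

Under this sketch, the row of the transition matrix for the starting state and action would hold w1, w2, w3 in the columns for Si, Sj, Sk and zeros elsewhere.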
939 01:03:45,600 --> 01:03:48,440 You could use the Matlab interp2 function. 940 01:03:48,440 --> 01:03:52,250 The one we use is called barycentric interpolation. 941 01:03:59,450 --> 01:04:02,600 In the RL community, that was popularized by Munos and Moore. 942 01:04:07,650 --> 01:04:09,410 That'll be cited in the notes. 943 01:04:12,650 --> 01:04:15,170 And it uses-- if you're operating 944 01:04:15,170 --> 01:04:18,920 in an N-dimensional space, it uses N plus 1 interpolants. 945 01:04:18,920 --> 01:04:21,290 So in a two-dimensional space, it 946 01:04:21,290 --> 01:04:22,880 uses the three closest points. 947 01:04:22,880 --> 01:04:24,463 If you're in a four-dimensional space, 948 01:04:24,463 --> 01:04:27,890 it uses the five closest points, OK. 949 01:04:27,890 --> 01:04:32,030 And there's a very clean, simple algorithm 950 01:04:32,030 --> 01:04:38,510 to find the factors of that interpolant, OK. 951 01:04:38,510 --> 01:04:44,870 The caveat is that everything spreads out. 952 01:04:44,870 --> 01:04:49,070 If I simulate my dynamics, my graph dynamics, 953 01:04:49,070 --> 01:04:52,070 what it's roughly saying is that if I started from this state, 954 01:04:52,070 --> 01:04:53,540 I'm going to be a little bit in that state, a little bit 955 01:04:53,540 --> 01:04:54,360 in that-- 956 01:04:54,360 --> 01:04:57,980 a little bit in state 48, a little bit in state 52. 957 01:04:57,980 --> 01:05:00,230 And then my transition's out of there, 958 01:05:00,230 --> 01:05:04,805 so I get this diffusion across my graph of where my state is, 959 01:05:04,805 --> 01:05:06,100 if that makes sense. 960 01:05:06,100 --> 01:05:07,220 Yeah? 961 01:05:07,220 --> 01:05:09,590 And that's why you get some of the smoothing effects 962 01:05:09,590 --> 01:05:14,310 that you saw in the plots, OK. 963 01:05:19,840 --> 01:05:22,630 There's a bigger problem with that.
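The spreading described above can be reproduced on a toy one-dimensional grid. The sketch below is an assumption-laden illustration, not the pendulum code from the lecture: the hypothetical dynamics move the state exactly half a grid cell per step, so each interpolated transition splits its weight between two neighboring states. The names `step` and `shift`, the grid size of 100, and the starting state 50 are all made up for the example.

```python
def step(dist, shift):
    """Push a distribution over grid states through one interpolated
    transition. `shift` is the true motion per step in grid cells,
    with 0 <= shift < 1, so each state lands between s and s + 1."""
    n = len(dist)
    new = [0.0] * n
    for s, mass in enumerate(dist):
        if mass == 0.0:
            continue
        hi = min(s + 1, n - 1)          # clamp at the edge of the grid
        new[s] += (1.0 - shift) * mass  # weight on the lower neighbor
        new[hi] += shift * mass         # weight on the upper neighbor
    return new


dist = [0.0] * 100
dist[50] = 1.0          # start with certainty in state 50
for _ in range(4):
    dist = step(dist, 0.5)

# States that now carry some weight.
support = [s for s, m in enumerate(dist) if m > 0.0]
```

The continuous system in this toy setup would sit exactly at state 52 after four steps, but the discretized model spreads its weight over states 50 through 54, which is the diffusion effect in miniature.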
964 01:05:22,630 --> 01:05:28,210 The smoothing effects a lot of times don't look too dangerous, 965 01:05:28,210 --> 01:05:33,670 but they can do bad things to your solution 966 01:05:33,670 --> 01:05:37,210 if you're not careful, OK. 967 01:05:37,210 --> 01:05:42,280 So the big caveat is the solution 968 01:05:42,280 --> 01:05:51,595 you get is optimal only for the discrete system. 969 01:06:01,690 --> 01:06:12,670 We hope that it's approximately optimal for continuous, 970 01:06:12,670 --> 01:06:16,120 but compared to the finite element analysis 971 01:06:16,120 --> 01:06:18,687 world or the computational fluid dynamics world, 972 01:06:18,687 --> 01:06:20,770 or other people that solve these kind of problems, 973 01:06:20,770 --> 01:06:26,320 we have relatively less strict understanding of when-- 974 01:06:26,320 --> 01:06:30,820 of how bad this approximation can be compared-- 975 01:06:30,820 --> 01:06:32,170 based on the discretization. 976 01:06:32,170 --> 01:06:34,420 There might actually be people out there that know it. 977 01:06:34,420 --> 01:06:37,630 I don't know how to tell you how bad it's 978 01:06:37,630 --> 01:06:41,350 going to get with the appro-- with the discretization. 979 01:06:41,350 --> 01:06:43,630 But I will ask you on your problem set 980 01:06:43,630 --> 01:06:47,080 to plot the bang bang solution of the double pendulum-- 981 01:06:47,080 --> 01:06:49,570 or, sorry, of the double integrator, 982 01:06:49,570 --> 01:06:52,210 and plot the analytical solution on top of it. 983 01:06:52,210 --> 01:06:54,310 And you'll see that if you're not careful, 984 01:06:54,310 --> 01:06:55,870 it's not just a little bit wrong. 985 01:06:55,870 --> 01:06:57,285 It can be systematically wrong. 986 01:06:57,285 --> 01:06:59,410 The switching surface turns out in the wrong place. 987 01:06:59,410 --> 01:07:03,000 And we'll ask you to think a little bit about why that is, 988 01:07:03,000 --> 01:07:03,500 OK. 
989 01:07:07,620 --> 01:07:11,628 That's really the only caveat if you 990 01:07:11,628 --> 01:07:12,920 care about low dimensional problems. 991 01:07:16,990 --> 01:07:20,100 The more cited one, though, of course, 992 01:07:20,100 --> 01:07:22,225 is that there's this curse of dimensionality. 993 01:07:32,420 --> 01:07:36,980 The only reason that everybody doesn't use this stuff 994 01:07:36,980 --> 01:07:41,270 is because if I had a 10 degree of freedom robot 995 01:07:41,270 --> 01:07:44,120 and I had to break up that 10-dimensional space 996 01:07:44,120 --> 01:07:47,840 in discrete points, discrete buckets, and made a graph, 997 01:07:47,840 --> 01:07:49,655 I would need a bigger computer. 998 01:07:49,655 --> 01:07:53,150 Not just a little bit bigger, an exponentially bigger computer, 999 01:07:53,150 --> 01:07:53,650 OK. 1000 01:07:56,440 --> 01:07:58,488 So you have to be able to discretize your space, 1001 01:07:58,488 --> 01:08:00,280 and discretizing the space is exponentially 1002 01:08:00,280 --> 01:08:04,930 expensive in the number of states, OK. 1003 01:08:04,930 --> 01:08:08,200 But so people actually-- historically, 1004 01:08:08,200 --> 01:08:14,020 value methods were very popular in the '80s, say. 1005 01:08:14,020 --> 01:08:15,520 And there's a lot of work that we're 1006 01:08:15,520 --> 01:08:17,920 going to talk about that continues to be popular, 1007 01:08:17,920 --> 01:08:19,870 about using approximations, where you don't 1008 01:08:19,870 --> 01:08:21,990 do a strict discretization, but you 1009 01:08:21,990 --> 01:08:24,026 do it so to try to approximate these costs, 1010 01:08:24,026 --> 01:08:26,109 these dynamic programming algorithms with function 1011 01:08:26,109 --> 01:08:28,713 approximation.
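To put a number on the exponential growth mentioned above, here is a small sketch that counts grid points and the memory needed for a cost-to-go table as the dimension of the space goes up. The choice of 100 buckets per dimension and 8 bytes per table entry are assumptions for illustration only, and `grid_points` and `table_bytes` are hypothetical helper names.

```python
def grid_points(buckets_per_dim, dims):
    """Number of discrete states when every dimension of the
    space is cut into the same number of buckets."""
    return buckets_per_dim ** dims


def table_bytes(buckets_per_dim, dims, bytes_per_entry=8):
    """Memory for one cost-to-go value per discrete state."""
    return grid_points(buckets_per_dim, dims) * bytes_per_entry


# Assumed resolution: 100 buckets along each dimension.
for dims in (2, 4, 6, 10):
    n = grid_points(100, dims)
    gib = table_bytes(100, dims) / 2**30
    print(f"{dims:2d} dims: {n:.1e} states, {gib:.1e} GiB")
```

At this assumed resolution a two-dimensional grid has ten thousand states, while a 10-dimensional one has 10^20, which is why each added dimension multiplies the cost rather than adding to it.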
1012 01:08:28,713 --> 01:08:30,880 But because of this sort of curse of dimensionality, 1013 01:08:30,880 --> 01:08:33,040 a lot of people switched gears to a different class 1014 01:08:33,040 --> 01:08:37,120 of optimization algorithms based more on the Pontryagin 1015 01:08:37,120 --> 01:08:41,620 principle and more on gradient methods. 1016 01:08:41,620 --> 01:08:45,470 We're going to talk about those, too. 1017 01:08:45,470 --> 01:08:49,060 But I think we have to remember that since the 1980s, 1018 01:08:49,060 --> 01:08:51,770 our computers actually got a lot better, OK. 1019 01:08:51,770 --> 01:08:54,220 Sounds silly, but so in the '80s, 1020 01:08:54,220 --> 01:08:56,080 they could tile two-dimensional spaces, 1021 01:08:56,080 --> 01:08:58,510 and three-dimensional hurt. 1022 01:08:58,510 --> 01:09:02,240 Now we could probably do four, five, six-dimensional spaces, 1023 01:09:02,240 --> 01:09:03,640 OK. 1024 01:09:03,640 --> 01:09:06,370 We actually did for-- 1025 01:09:06,370 --> 01:09:08,170 we made that airplane land on a perch 1026 01:09:08,170 --> 01:09:10,810 by just tiling the state space and doing brute force 1027 01:09:10,810 --> 01:09:13,300 computation on that, OK. 1028 01:09:13,300 --> 01:09:14,470 So you should look around. 1029 01:09:14,470 --> 01:09:16,840 If there's some hard control problems that 1030 01:09:16,840 --> 01:09:20,649 are four-dimensional or less that you consider 1031 01:09:20,649 --> 01:09:22,390 to be unsolved, you could probably 1032 01:09:22,390 --> 01:09:24,059 just hand them to dynamic programming 1033 01:09:24,059 --> 01:09:26,692 and get a very nice solution, OK. 1034 01:09:26,692 --> 01:09:28,609 And say, hey, you couldn't do it 10 years ago, 1035 01:09:28,609 --> 01:09:30,067 but I can do it today on my laptop. 1036 01:09:33,960 --> 01:09:34,540 Awesome. 1037 01:09:34,540 --> 01:09:35,040 OK.
1038 01:09:35,040 --> 01:09:37,950 So unless Zach appears here, there's 1039 01:09:37,950 --> 01:09:39,700 only one last thing I want to say, 1040 01:09:39,700 --> 01:09:42,025 and that is I want to observe quickly-- 1041 01:09:52,740 --> 01:09:55,410 we talked about the fact that optimal policies are not 1042 01:09:55,410 --> 01:09:55,910 unique. 1043 01:09:58,980 --> 01:10:01,950 But there's more things you can learn by staring 1044 01:10:01,950 --> 01:10:03,150 at these guys a little bit. 1045 01:10:09,590 --> 01:10:12,600 Let's put my R down to something more manageable. 1046 01:10:21,420 --> 01:10:22,480 Go, go, go. 1047 01:10:25,810 --> 01:10:27,670 OK. 1048 01:10:27,670 --> 01:10:28,690 Can you see it in this? 1049 01:10:28,690 --> 01:10:31,270 It's a little bit hard to see it. 1050 01:10:31,270 --> 01:10:34,810 I think you can see it if I turn the lights down. 1051 01:10:34,810 --> 01:10:37,710 This is the quadratic regulator again. 1052 01:10:42,740 --> 01:10:45,740 Now this isn't quite the quadratic regulator 1053 01:10:45,740 --> 01:10:47,960 from the double integrator. 1054 01:10:47,960 --> 01:10:52,950 This is now a quadratic cost function on a nonlinear 1055 01:10:52,950 --> 01:10:55,020 dynamical system, OK. 1056 01:10:55,020 --> 01:10:57,780 In this case, the dynamics are smooth. 1057 01:10:57,780 --> 01:10:59,820 They're non-linear, but they're smooth. 1058 01:10:59,820 --> 01:11:02,700 There's nothing that changes abruptly in the derivatives. 1059 01:11:02,700 --> 01:11:05,400 And the cost function is smooth, but you 1060 01:11:05,400 --> 01:11:08,130 can find that the optimal policy can actually still 1061 01:11:08,130 --> 01:11:12,660 be discontinuous, OK. 1062 01:11:12,660 --> 01:11:16,140 So costs-- so why is it discontinuous? 
1063 01:11:16,140 --> 01:11:19,560 In this case, because if I'm here and I'm going this way, 1064 01:11:19,560 --> 01:11:22,800 I want to push up, but at some point, I have to change my mind 1065 01:11:22,800 --> 01:11:26,250 and go the opposite way to pump up energy and get to the top. 1066 01:11:26,250 --> 01:11:33,660 So this pump up strategy is inherently discontinuous, OK. 1067 01:11:33,660 --> 01:11:39,360 So this is the Gordian knot of optimal control, 1068 01:11:39,360 --> 01:11:42,930 is as soon as things stop being linear, 1069 01:11:42,930 --> 01:11:47,730 computing optimal cost to go functions can get arbitrarily 1070 01:11:47,730 --> 01:11:49,260 hard, OK. 1071 01:11:49,260 --> 01:11:51,420 And that's why computation's so great, because it 1072 01:11:51,420 --> 01:11:53,635 does that stuff for me. 1073 01:11:53,635 --> 01:11:55,510 But know that it doesn't take much to make it 1074 01:11:55,510 --> 01:11:58,387 so the cost to go function gets a lot more subtle. 1075 01:11:58,387 --> 01:11:58,887 Mm-hmm. 1076 01:12:04,050 --> 01:12:04,830 Good. 1077 01:12:04,830 --> 01:12:09,960 So the class will proceed taking these methods as far as we can, 1078 01:12:09,960 --> 01:12:12,030 breaking them, and then showing you 1079 01:12:12,030 --> 01:12:15,930 approximation methods that work in higher dimensional spaces. 1080 01:12:15,930 --> 01:12:17,910 And when we give up on optimality altogether, 1081 01:12:17,910 --> 01:12:19,770 we'll do motion planning, and we're 1082 01:12:19,770 --> 01:12:21,812 going to get to more and more interesting robots. 1083 01:12:21,812 --> 01:12:25,020 But this is really a key idea. 1084 01:12:25,020 --> 01:12:30,240 So I hope that the intuition came through and-- 1085 01:12:30,240 --> 01:12:31,470 through your problem set. 1086 01:12:31,470 --> 01:12:35,130 And I can share some of this code and everything.
1087 01:12:35,130 --> 01:12:36,990 I hope you play with it, and think about it, 1088 01:12:36,990 --> 01:12:40,020 and change cost functions, and see what happens. 1089 01:12:40,020 --> 01:12:42,350 OK, see you next week.