The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JOSH TENENBAUM: So where we left off was, again, I was telling you a story, both conceptual and motivational and a little bit technical, about how we got to the things we're trying to do now as part of the center. And it involves both the problems we want to solve: we want to understand what is this common sense knowledge about the physical world and the psychological world that you can see in some form even in young infants, and what are the learning mechanisms that build it and grow it. And then, what are the technical ideas that are hopefully also going to be useful for building intelligent robots or other AI systems, and that can explain, on the scientific side, how this stuff works? All right, so that was all this business.
And what I was suggesting, or I'll start to suggest here now, is that it goes back to that quote I gave at the beginning from Craik, right? The guy who in 1943 wrote this book called The Nature of Explanation. He was saying that the essence of intelligence is this ability to build models that allow you to explain the world, to then reason, simulate, plan, and so on. And I think we need tools for understanding how the brain is a modeling engine, or an explaining engine. Or, to get a little bit recursive about it, since what we're doing in science is also an explanatory activity, we need modeling engines in which we can build models of the brain as a modeling engine. And that's where the probabilistic programs are going to come in. So that's part of why I spent a while in the morning talking about these graphical models, and the ways we tried to model various aspects of cognition with them, where I think we made progress but were ultimately dissatisfied. I put up-- I didn't say too much about the technical details. That's fine.
You can read a lot about it or not. But these are ways of using graphs, mostly directed graphs, to capture something about the structure of the world. And then you put probabilities on it in some way, like a diffusion process or a noisy transmission process for a food web. That's a style of reasoning that sometimes goes by the name of Bayesian networks, or causal graphical models. It's been hugely influential in computer science and many other fields, not just AI, and in many fields outside of computer science: not just cognitive science and neuroscience, but many areas of science and engineering. Here are just a few examples of Bayesian networks you get if you search Google Images for Bayesian networks. And if you look carefully, you'll see they come from biology, economics, chemical engineering, whatever. They're due to many people, but maybe more than anyone, the person who's most associated with this idea and with the name Bayesian networks is Judea Pearl. He received the Turing Award, which is the highest award in computer science.
This is a language that we were using in all the projects you saw up until now, in some form, and that we and many others use, because it provides a powerful set of general-purpose tools. It goes back to this dream of building general-purpose systems for understanding the world. These provide general-purpose languages for representing causal structure-- I'll say a little bit more about that-- and general-purpose algorithms for doing probabilistic inference over them. So we talked about ways of combining sophisticated statistical inference with knowledge representation that's causal and compositional. I'll just tell you a little bit about the model in the upper left up there, the one that says diseases and symptoms. It is causal. It is compositional. It does support probabilistic inference. And it was at the heart of why we were doing what we were doing, and showing you how different kinds of causal graphical models could capture different modes of people's reasoning.
And the idea was that maybe learning about different domains was learning those different kinds of graph structures. So let me say a little bit about how it works, and then why it's not enough, because it really isn't enough. I mean, it's the right start. It's definitely in the right direction. But we need to go beyond it, and that's where the probabilistic programs come in. So look at that network up there on the upper left. It's one of the most famous Bayesian networks, a textbook example. One of the first actually implemented AI systems was based on this: a system for medical diagnosis. It's a simple approximation to what a general practitioner might be doing when a patient comes in and reports some pattern of symptoms, and they want to figure out what's wrong. So: diagnosis of a disease to explain the symptoms. The graph is a bipartite graph, two sets of nodes, with the arrows, again, going down in the causal direction. The bottom layer, the symptoms, are the things that you can nominally observe. A patient comes in reporting some symptoms.
Not all are observed, but others may be things that you could test, like medical test results. And then the top level is this level of latent structure: the causes, the things that cause the symptoms. The arrows represent, basically, which diseases cause which symptoms. In this model there are roughly 500 or 600 diseases-- you know, the commonish ones-- and 4,000 symptoms. So it's a big model. And in some sense, you can think of it as a big probability model. It's a way of specifying a joint distribution on this 4,600-dimensional space. But it's a very particular one that's causally structured. It represents only the minimal causal dependencies, and really only the minimal probabilistic dependencies. That sparsity is really important for how you use it, whether you're talking about inference or learning. So inference means observing the values of some of those variables, like patterns of symptoms, and making guesses about the others: observing some symptoms and making guesses about the diseases that are most likely to have explained those.
Or you might make a prediction about other symptoms you could observe. So you could go up and then back down. You could say, well, from these symptoms, I think the patient might have one of these two rare diseases; I don't know which one. But if it was this disease, then it would predict that symptom, or maybe that test result, and this other disease wouldn't. So that suggests a way to plan an action you could take to figure things out: I could go test for that symptom, and that would tell me which of these diseases the patient has. These models are also useful in planning other kinds of treatments, interventions. Like if you want to cure someone-- again, we all know this intuitively-- you should try to cure the disease, not the symptom. If you have some way to act to change the state of one of those disease variables, to kind of turn it off, then reasonably that should relieve the symptoms: if that disease gets turned off, these symptoms should turn off. Whereas just treating the symptom, like taking Advil for a headache, is fine if that's all the problem is.
But if it's being caused by something, you know, god forbid, like a brain tumor, it's not going to help. It's not going to cure the problem in the long term. OK, so all those patterns of causal inference-- reasoning, prediction, action planning, exploration-- this is a beautiful language for capturing all of those, and you can automate all those inferences. Why isn't it enough, then, for capturing commonsense reasoning, or this approach to cognition? Which I'm calling the model-building, explaining part, as opposed to the pattern recognition part. I mean, again, I don't want to get too far behind in talking about this, but that example is so rich. If you wanted to build a neural network, you could just turn the arrows around to learn a mapping from symptoms to diseases, and that would be a pattern classifier. So there are these two different paradigms for intelligence-- as some of the questions have been getting at, and as I'll show with some more interesting examples in a little bit-- and often the relations between them are quite subtle, and quite valuable.
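To make the diagnosis picture concrete, here is a minimal sketch of a bipartite disease-symptom network with noisy-OR links and inference by brute-force enumeration. Everything here, the disease names, the probabilities, and the leak term, is invented for illustration; the real diagnosis network described in the lecture is far larger and uses more sophisticated inference.

```python
import itertools

# Toy bipartite disease -> symptom network (noisy-OR). All diseases,
# symptoms, and probabilities below are made up for illustration.
priors = {"flu": 0.10, "migraine": 0.05}          # P(disease present)
# Edge strengths: P(symptom | that disease alone, no leak)
strength = {("flu", "fever"): 0.9, ("flu", "headache"): 0.6,
            ("migraine", "headache"): 0.95}
leak = 0.01   # background probability of a symptom with no disease

def p_symptom(symptom, active):
    """Noisy-OR: each active parent disease independently fails to
    cause the symptom with probability (1 - strength)."""
    p_none = 1 - leak
    for d in active:
        p_none *= 1 - strength.get((d, symptom), 0.0)
    return 1 - p_none

def posterior(observed):
    """Enumerate every disease combination; return the marginal
    posterior P(disease | observed symptoms) for each disease."""
    diseases = list(priors)
    marg = {d: 0.0 for d in diseases}
    total = 0.0
    for bits in itertools.product([False, True], repeat=len(diseases)):
        active = [d for d, b in zip(diseases, bits) if b]
        w = 1.0
        for d, b in zip(diseases, bits):       # prior on this combination
            w *= priors[d] if b else 1 - priors[d]
        for s, v in observed.items():          # likelihood of the evidence
            ps = p_symptom(s, active)
            w *= ps if v else 1 - ps
        total += w
        for d in active:
            marg[d] += w
    return {d: marg[d] / total for d in diseases}

# Headache without fever points toward migraine rather than flu.
print(posterior({"headache": True, "fever": False}))
```

Note that enumeration is exponential in the number of diseases, which is exactly why fast approximate inference matters once the network has hundreds of causes.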
So one nice way to work with such a model, for example-- I mentioned a lot of people want to know, and I'll keep talking about this for the rest of the hour, productive ways to combine these powerful generative models with more pattern recognition approaches. For these models, there are general-purpose algorithms that can support these inferences, that can tell you what diseases you're likely to have given what symptoms. In some cases they can be very fast; in other cases they can be very slow. Whereas you could imagine trying to learn a neural network that looks just like that, only the arrows go up, so it implements a mapping from data to diseases. That could help you do much faster inference in the cases where that's possible. So that's just one example, and it might be not a crazy way to think about, more generally, the way top-down and bottom-up connections work in the brain. I'll take that a little bit more literally in a vision example in a second.
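One way to picture that fast bottom-up mapping is to train it on samples from the generative model itself. The sketch below uses a lookup table rather than a neural network, and a one-disease toy model whose probabilities are made up, but the logic is the same: run the top-down model many times, then cache, for each symptom pattern, how often the cause was present.

```python
import random
from collections import defaultdict

random.seed(0)

# Tiny generative (top-down) model: sample a cause, then its effects.
# All the numbers here are invented for illustration.
def sample_world():
    disease = random.random() < 0.2              # P(disease)
    p_sym = 0.8 if disease else 0.05             # P(each symptom | disease)
    symptoms = tuple(random.random() < p_sym for _ in range(3))
    return disease, symptoms

# "Turn the arrows around": amortize inference by recording, for each
# symptom pattern, how often the disease was present in forward samples.
counts = defaultdict(lambda: [0, 0])   # pattern -> [disease count, total]
for _ in range(100_000):
    d, s = sample_world()
    counts[s][0] += d
    counts[s][1] += 1

def fast_posterior(symptoms):
    """Constant-time bottom-up guess learned from the generative model."""
    hit, total = counts[tuple(symptoms)]
    return hit / total

print(fast_posterior((True, True, True)))     # high: all symptoms present
print(fast_posterior((False, False, False)))  # low: no symptoms
```

A neural network plays the same role as the lookup table here, but generalizes to symptom patterns it has never sampled.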
So there's a lot you can get from studying these causal graphical models, including some version of what it is for the mind to explain the world, and how that explanation and the pattern recognition approach can work together. But it's not enough to really get at the heart of common sense. The mental generative models we build are more richly structured. They're more like programs. What do I mean by that? Well, here I'm giving a bunch of examples of scientific theories or models. Not commonsense ones, but I think the same idea applies. They are ways of, again, explaining the world, not just describing the pattern. So we went at the beginning through Newton's laws versus Kepler's laws. That's just one example. And you might not have thought of those laws as a program, but they're certainly not a graph. On the first slide, when I showed Newton's laws, there was a bunch of symbols, statements in English, some math. But what it comes down to is basically a set of pieces of code that you could run to generate the orbits.
It doesn't describe the shapes or the velocities; it's a machine that you plug some things into. You plug in some masses, some objects, some initial conditions. And you press run, and it generates the orbits, just like what you're seeing there. Although those probably weren't generated that way-- that's a GIF. OK, that's more like Kepler, or Ptolemy. But anyway, it's a powerful machine. It's a machine which, if you put down the right masses in the right positions, they don't just all go around in ellipses. Some of them are like moons, and they will go around the things that go around the others. And some of them will be like apples on the Earth, and they won't go around anything. They'll just fall down. So that's the powerful machine. And in the really simplest cases, those equations can be solved analytically. You can use calculus or other methods of analysis, like Newton did. He didn't have a computer.
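The "machine you press run on" can literally be a few lines of code. Here is a minimal gravitational simulator (in toy units with G = 1; the masses and initial conditions are arbitrary illustrative values, not anything from the slides). Nothing in the loop assumes how many bodies there are, which is exactly what the analytic approach cannot offer.

```python
import math

G = 1.0  # toy units

def step(bodies, dt):
    """One semi-implicit (symplectic) Euler step.
    bodies: list of [mass, x, y, vx, vy]."""
    for i, b in enumerate(bodies):
        ax = ay = 0.0
        for j, o in enumerate(bodies):
            if i == j:
                continue
            dx, dy = o[1] - b[1], o[2] - b[2]
            r = math.hypot(dx, dy)
            a = G * o[0] / (r * r)        # acceleration toward body o
            ax += a * dx / r
            ay += a * dy / r
        b[3] += ax * dt                   # update all velocities first...
        b[4] += ay * dt
    for b in bodies:
        b[1] += b[3] * dt                 # ...then all positions
        b[2] += b[4] * dt

# A heavy "sun" at rest plus one light planet started at circular-orbit
# speed v = sqrt(G*M/r) for r = 1. Press run and the orbit comes out.
bodies = [[1000.0, 0.0, 0.0, 0.0, 0.0],
          [0.001, 1.0, 0.0, 0.0, math.sqrt(1000.0)]]
for _ in range(10_000):
    step(bodies, 1e-4)
```

With two bodies you recover Kepler-style ellipses; add a third comparable mass and the same loop runs unchanged, even though no closed-form solution exists.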
And you can show that for a two-body system, one planet and one sun, you can solve those equations to show that you get Kepler's laws. Amazing. And under the approximation that, for every other planet, it's only the sun that's exerting a significant influence, you can describe all of Kepler's laws this way. But once you have more than two bodies interacting in some complex way, like three masses similar in size near each other, you can't solve the equations analytically anymore. You basically just have to run a simulation. For the most part, the world is complicated, and our models have to be run. Here's a model of riverbed formation. These are snapshots of a model of a galaxy collision, and there's climate modeling, or aerodynamics. So basically, what most modern science is, is that you write down descriptions of the causal processes, something going on in the world, and you study that through some combination of analysis and simulation to see what would happen. If you want to estimate parameters, you try out some guesses of the parameters.
And you run this thing, and you see if its behavior looks like the data you observe. If you're trying to decide between two different models, you simulate each of them, and you see which one looks more like the data you observe. If you think there's something wrong with your model-- it doesn't quite look like the data you observe-- you think, how could I change my model so that if I run it, it'll look more like the data I observe in some important way? Those activities of science-- those are, in some form, I'm arguing, the activities of common sense explanation. So when I'm talking about the child as scientist, that's what I'm basically talking about. It's some version of that. And that includes both describing the causal processes with a program that you run, and, if you want to talk about learning, the scientific analog of building one of these theories. You don't build a theory, whether it's Newton's laws or Mendel's laws or any of these things, by just finding patterns in data. You do something like this program thing, but kind of recursively.
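The run-and-compare loop just described can be sketched in a few lines. This is a made-up example, not one from the lecture: the unknown parameter is a gravitational acceleration, the "observed data" are synthetic drop times generated with g = 9.8, and estimation is brute-force guess-and-check against the simulator.

```python
def fall_time(height, g, dt=1e-4):
    """Run the causal model forward: simulate a dropped object and
    return the time it takes to reach the ground."""
    y, v, t = height, 0.0, 0.0
    while y > 0:
        v += g * dt
        y -= v * dt
        t += dt
    return t

heights = [1.0, 2.0, 5.0]
# Stand-in for real measurements, generated here with the true g = 9.8.
observed = [fall_time(h, 9.8) for h in heights]

# Try out guesses of the parameter; keep the one whose simulated
# behavior looks most like the data.
best_g, best_err = None, float("inf")
for guess in [x / 10 for x in range(50, 151)]:     # g from 5.0 to 15.0
    sim = [fall_time(h, guess) for h in heights]
    err = sum((s - o) ** 2 for s, o in zip(sim, observed))
    if err < best_err:
        best_g, best_err = guess, err

print(best_g)   # recovers a value close to 9.8
```

Comparing two different models works the same way: simulate each, and keep whichever one's output looks more like the observations.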
Think of it as having some kind of paradigm, some program that generates programs, and you use it to try to somehow search the space of programs to come up with a program that fits your data well. OK, so that's, again, kind of the big picture. And now let's talk about how we can actually do something with this idea-- use these programs. And you might be wondering, OK, maybe I understand-- I'm realizing I didn't say the main thing I want you to understand. The main thing I want you to get from this is how programs go beyond graphs. So none of these processes here can be nicely described with a graph the way we have in the language of graphical models. So the interesting causality-- I mean, in some sense, there's kind of a graph. You can talk about the state of the world at time T, the state of the world at time T plus 1, and an arrow forward in time, and I'll show you graphs like this in a second. But all the interesting stuff that science really gains power from is the much more fine-grained structure, captured in equations or functions that describe exactly how all this stuff works.
And it needs languages like math or C++ or LISP; it needs a symbolic language of processes to really do it justice. The second thing I want you to get will take a minute, but let's put it out there. Yes, OK, maybe you get the idea that programs can be used to describe causal processes in interesting ways. But where does the probability part come in? Well, the same thing is actually true in graphical models. How many people have read Judea Pearl's 2000 book called Causality? How many people have read his '88 book? OK, nobody's read anything. But what Pearl is most famous for-- I mean, when we say Pearl is famous for inventing Bayesian networks, that's based on work he did in the '80s, in which, yes, they were all probability models. But then he came to what he calls, and I would call too, a deeper view, in which it was really about basically deterministic causal relations. Basically, it was a graphical language for equations-- certain classes of equations, like structural equations. If you know about linear structural equations, it was sort of like nonlinear structural equations.
And then probabilities are things you put on top of that, to capture the things you don't know, that you're uncertain about. And I think he was getting at the fact that to scientists, and also to people-- there's some very nice work by Laura Schulz and Jessica Sommerville, both of whom will be here next week, actually, on how children's concepts of causality are basically deterministic at the core. And where the probabilities come in is on the things that we don't observe or the things we don't know-- the uncertainty. It's not that the world is noisy. It's that we believe, at least-- except for quantum mechanics-- our intuitive notions are that the world is basically deterministic, but with a lot of stuff we don't know. This was, for example, Laplace's view in the philosophy of science. And really, until quantum mechanics, it was broadly the Enlightenment science view that the world is full of all these complicated deterministic machines, and uncertainty comes from the things that we can't observe, or that we can't measure finely enough, or that are just in some form unknown or unknowable to us.
Does that make sense? So you'll see more of this in a second. But where the probabilities are going to come from is basically this: if there are inputs to the program that we don't know, or parameters we don't know, then in order to simulate the program we're going to have to put distributions on those, make some guesses, and then see what happens for different guesses. Does that make sense? OK. Good. So again, that's most of the technical stuff I need to say. And you'll learn about how this works in much more concrete detail if you go to the tutorial afterwards that Tomer is going to run. What you'll see there is this. So here are just a few examples. Many of you hopefully already looked at the web pages from this probmods.org thing. And what you see here is that each of these boxes is a probabilistic program model. Most of it is a bunch of define statements. So if you look here, you'll see these define statements. Those are just defining functions. They name the function.
429 00:16:22,370 --> 00:16:24,650 They take some inputs, which call other functions, 430 00:16:24,650 --> 00:16:26,120 and then they maybe do something-- 431 00:16:26,120 --> 00:16:28,650 they have some output that might be an object. 432 00:16:28,650 --> 00:16:30,470 It might itself be a function. 433 00:16:30,470 --> 00:16:34,760 These can be functions that generate other functions. 434 00:16:34,760 --> 00:16:36,590 And where the probabilities come in 435 00:16:36,590 --> 00:16:39,470 is that sometimes these functions call random number 436 00:16:39,470 --> 00:16:40,520 generators, basically. 437 00:16:40,520 --> 00:16:42,186 If you look carefully, you'll see things 438 00:16:42,186 --> 00:16:48,670 like Dirichlet, or uniform draw, or Gaussian, or flip. 439 00:16:48,670 --> 00:16:51,740 Right, those are primitive random functions that flip a coin, 440 00:16:51,740 --> 00:16:54,110 or roll a die, or draw from a Gaussian. 441 00:16:54,110 --> 00:16:59,040 And those capture things that are currently unknown. 442 00:16:59,040 --> 00:17:03,470 In a very important sense, the particular language, Church, 443 00:17:03,470 --> 00:17:06,800 that you're going to learn here with its sort of stochastic 444 00:17:06,800 --> 00:17:07,970 LISP-- 445 00:17:07,970 --> 00:17:10,520 basically just functions that call other functions 446 00:17:10,520 --> 00:17:12,650 and maybe add in some randomness to that-- 447 00:17:12,650 --> 00:17:15,931 is very much analogous to the directed graph of a Bayesian 448 00:17:15,931 --> 00:17:16,430 network. 449 00:17:16,430 --> 00:17:19,550 In a Bayesian network, you have nodes and arrows.
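In Python rather than Church (so none of this is the actual probmods.org code), the same structure looks like ordinary function definitions, a few primitive random functions, and functions that return functions:

```python
import random

def flip(p=0.5):
    # Primitive random function: a (possibly weighted) coin flip.
    return random.random() < p

def gaussian(mu, sigma):
    # Primitive random function: a draw from a Gaussian.
    return random.gauss(mu, sigma)

def make_coin():
    # A function whose output is itself a function: the returned
    # coin closes over a randomly drawn weight.
    weight = random.uniform(0.0, 1.0)
    return lambda: flip(weight)

coin = make_coin()
flips = [coin() for _ in range(20)]
```

Everything is deterministic function composition except where a primitive like `flip` or `gaussian` injects randomness.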
450 00:17:19,550 --> 00:17:22,490 And the parents of a node, the ones that send arrows to it, 451 00:17:22,490 --> 00:17:24,530 are basically the minimal set of variables 452 00:17:24,530 --> 00:17:26,530 that if you were going to sample from this model 453 00:17:26,530 --> 00:17:29,521 you'd have to sample first in order to then sample the child 454 00:17:29,521 --> 00:17:30,020 variable. 455 00:17:30,020 --> 00:17:32,270 Because those are the key things it depends on. 456 00:17:32,270 --> 00:17:34,640 And you can have a multi-layered Bayesian network 457 00:17:34,640 --> 00:17:36,410 that, if you are going to sample from it, 458 00:17:36,410 --> 00:17:38,755 you just start at the top and sort of go down. 459 00:17:38,755 --> 00:17:40,130 That's exactly the same thing you 460 00:17:40,130 --> 00:17:41,879 have in these probabilistic programs where 461 00:17:41,879 --> 00:17:44,570 the define statements are basically defining a function. 462 00:17:44,570 --> 00:17:48,436 And the functions are the nodes, and the other functions 463 00:17:48,436 --> 00:17:50,060 that they call as part of the statement 464 00:17:50,060 --> 00:17:52,970 are the nodes that send arrows there. 465 00:17:52,970 --> 00:17:55,520 But the key is, as you can imagine if you've ever-- 466 00:17:55,520 --> 00:17:57,950 I mean, all of you have written computer programs-- 467 00:17:57,950 --> 00:18:00,650 is that only very simple programs look 468 00:18:00,650 --> 00:18:02,240 like directed acyclic graphs. 469 00:18:02,240 --> 00:18:04,190 And that's what a Bayesian network is. 470 00:18:04,190 --> 00:18:06,129 It's very easy and often necessary 471 00:18:06,129 --> 00:18:08,420 to write a program to really capture something causally 472 00:18:08,420 --> 00:18:10,050 interesting in the world where it's not 473 00:18:10,050 --> 00:18:11,634 a directed acyclic graph. 474 00:18:11,634 --> 00:18:12,800 There's all sorts of cycles.
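That parents-before-children sampling order can be sketched with the textbook rain/sprinkler network (the structure and numbers here are the standard illustration, not anything from the slides); the function-call order *is* the topological order of the graph:

```python
import random

def flip(p):
    return random.random() < p

def sample_world():
    # Ancestral sampling: each variable is sampled only after its
    # parents -- exactly the order these statements impose.
    cloudy = flip(0.5)
    rain = flip(0.8) if cloudy else flip(0.1)
    sprinkler = flip(0.1) if cloudy else flip(0.5)
    wet = flip(0.99) if (rain or sprinkler) else flip(0.01)
    return {"cloudy": cloudy, "rain": rain,
            "sprinkler": sprinkler, "wet": wet}

samples = [sample_world() for _ in range(5000)]
```

Start at the top, go down: `cloudy` before `rain` and `sprinkler`, both before `wet`.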
475 00:18:12,800 --> 00:18:13,850 There's recursion. 476 00:18:13,850 --> 00:18:17,570 One thing that a function can do is make a whole other graph. 477 00:18:17,570 --> 00:18:21,290 Or often it might be directed and acyclic, 478 00:18:21,290 --> 00:18:22,940 but all the interesting stuff is kind 479 00:18:22,940 --> 00:18:26,169 of going on inside what happens when you evaluate one function. 480 00:18:26,169 --> 00:18:27,710 So if you were to draw it as a graph, 481 00:18:27,710 --> 00:18:30,680 it might look like you could draw a directed acyclic graph, 482 00:18:30,680 --> 00:18:32,096 but all the interesting stuff will 483 00:18:32,096 --> 00:18:35,342 be going on inside one node or one arrow. 484 00:18:35,342 --> 00:18:37,550 So let me get more specific about the particular kind 485 00:18:37,550 --> 00:18:41,300 of programs that we're going to be talking about. 486 00:18:41,300 --> 00:18:43,280 In a probabilistic programming language 487 00:18:43,280 --> 00:18:46,160 like Church, or in general in this view of the mind, 488 00:18:46,160 --> 00:18:48,140 we're interested in being able to build really 489 00:18:48,140 --> 00:18:49,220 any kind of thing. 490 00:18:49,220 --> 00:18:51,860 Again, there's lots of big dreams here. 491 00:18:51,860 --> 00:18:53,840 Like I was saying before, I felt like we 492 00:18:53,840 --> 00:18:55,100 had to give up on some dreams, but we've 493 00:18:55,100 --> 00:18:56,600 replaced it with even grander ones, 494 00:18:56,600 --> 00:18:59,000 like probabilistic modeling engines that 495 00:18:59,000 --> 00:19:00,990 can do any computable model. 
496 00:19:00,990 --> 00:19:04,400 But in the spirit of trying to scale up from something that we 497 00:19:04,400 --> 00:19:08,300 can get traction on, what I've been focusing on 498 00:19:08,300 --> 00:19:10,130 in a lot of my work recently and what we've 499 00:19:10,130 --> 00:19:11,960 been doing as part of the center, 500 00:19:11,960 --> 00:19:14,300 are particular probabilistic programs 501 00:19:14,300 --> 00:19:17,180 that we think can capture this very early core of common sense 502 00:19:17,180 --> 00:19:21,140 intuitive physics and intuitive psychology in young kids. 503 00:19:21,140 --> 00:19:23,520 It's what I called-- and I remember I mentioned this 504 00:19:23,520 --> 00:19:25,330 in the first lecture-- 505 00:19:25,330 --> 00:19:26,780 this game engine in your head. 506 00:19:26,780 --> 00:19:31,310 So it's programs for graphics engines, physics engines, 507 00:19:31,310 --> 00:19:33,140 planning engines, the basic kinds of things 508 00:19:33,140 --> 00:19:37,760 you might use to build one of these immersive video games. 509 00:19:37,760 --> 00:19:40,970 And we think if you wrap those inside this framework 510 00:19:40,970 --> 00:19:43,190 for probabilistic inference, then 511 00:19:43,190 --> 00:19:46,610 that's a powerful way to do the kind of common sense scene 512 00:19:46,610 --> 00:19:48,680 understanding, whether in these adult versions 513 00:19:48,680 --> 00:19:51,230 or in the young kid versions. 514 00:19:51,230 --> 00:19:56,690 Now, to specify this probabilistic programs 515 00:19:56,690 --> 00:19:58,670 view, just like with Bayesian networks 516 00:19:58,670 --> 00:20:01,471 or these graphical models, we wanted general purpose tools 517 00:20:01,471 --> 00:20:03,470 for representing interesting things in the world 518 00:20:03,470 --> 00:20:06,470 and for computing the inferences that we want.
519 00:20:06,470 --> 00:20:10,280 Again, which means basically observing, say, just like you 520 00:20:10,280 --> 00:20:12,110 observe some of the symptoms and you 521 00:20:12,110 --> 00:20:14,330 want to compute the likely diseases that best 522 00:20:14,330 --> 00:20:16,130 explain the observed symptoms. 523 00:20:16,130 --> 00:20:20,360 Here we talk about observing the outputs of some 524 00:20:20,360 --> 00:20:22,754 of these programs, like the image 525 00:20:22,754 --> 00:20:24,420 that's the output of a graphics program. 526 00:20:24,420 --> 00:20:27,150 And we want to work backwards and make a guess at the world 527 00:20:27,150 --> 00:20:29,030 state, the input to the graphics engine 528 00:20:29,030 --> 00:20:31,280 that's most likely to have produced the image. 529 00:20:31,280 --> 00:20:34,520 That's the analog of getting diseases from symptoms. 530 00:20:34,520 --> 00:20:38,624 Or again, that's our explanation right there. 531 00:20:38,624 --> 00:20:41,040 And there are lots of different algorithms for doing this. 532 00:20:41,040 --> 00:20:42,748 I'm not going to say too much about them. 533 00:20:42,748 --> 00:20:44,942 Tomer will say a little bit more in the afternoon. 534 00:20:44,942 --> 00:20:46,400 The main thing I will do is, I will 535 00:20:46,400 --> 00:20:48,710 say that the main general purpose 536 00:20:48,710 --> 00:20:51,500 algorithms for inference in probabilistic programming 537 00:20:51,500 --> 00:20:54,590 languages are in the category of slow 538 00:20:54,590 --> 00:20:59,360 and slower and really, really slow. 539 00:20:59,360 --> 00:21:01,850 And this is one of the many ways in which there's 540 00:21:01,850 --> 00:21:04,360 no magic or no free lunch. 541 00:21:04,360 --> 00:21:06,350 Across all of AI and cognitive science, 542 00:21:06,350 --> 00:21:08,777 when you build very powerful representations, 543 00:21:08,777 --> 00:21:10,610 doing inference with them becomes very hard.
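Working backwards from outputs can be sketched with the crudest general-purpose algorithm, rejection sampling, on a made-up two-disease model (all names and numbers invented for illustration); it also shows why these fully general methods land in the "slow" category — most forward runs get thrown away:

```python
import random

def flip(p):
    return random.random() < p

def sample_patient():
    # Forward, causal direction: diseases cause symptoms.
    flu = flip(0.10)
    cold = flip(0.20)
    fever = flip(0.90) if flu else flip(0.05)
    cough = flip(0.80) if (flu or cold) else flip(0.05)
    return flu, cold, fever, cough

# Backwards direction: observe the outputs (the symptoms), keep only
# the runs that reproduce them, and read off the hidden causes.
runs = (sample_patient() for _ in range(20000))
accepted = [(flu, cold) for flu, cold, fever, cough in runs
            if fever and cough]
p_flu_given_symptoms = sum(flu for flu, _ in accepted) / len(accepted)
```

The same conditioning pattern applies when the forward program is a graphics engine and the observed output is an image.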
544 00:21:10,610 --> 00:21:12,144 It's part of why people often like 545 00:21:12,144 --> 00:21:13,310 things like neural networks. 546 00:21:13,310 --> 00:21:14,809 They're much weaker representations, 547 00:21:14,809 --> 00:21:17,600 but inference can be much faster. 548 00:21:17,600 --> 00:21:19,984 And at the moment, the only totally general purpose 549 00:21:19,984 --> 00:21:22,400 algorithms for doing inference with probabilistic programs 550 00:21:22,400 --> 00:21:23,300 are slow. 551 00:21:23,300 --> 00:21:25,580 But first of all, they're getting faster. 552 00:21:25,580 --> 00:21:27,520 People are coming up with-- 553 00:21:27,520 --> 00:21:30,780 and I can talk about this offline where that's going-- 554 00:21:30,780 --> 00:21:34,210 but also-- and this is what I'll talk about in a sharper way 555 00:21:34,210 --> 00:21:35,450 in a second-- 556 00:21:35,450 --> 00:21:37,767 there are particular classes of probabilistic programs, 557 00:21:37,767 --> 00:21:40,100 in particular, the ones in the game engine in your head. 558 00:21:40,100 --> 00:21:42,710 Like for vision, it's inverse graphics, and maybe 559 00:21:42,710 --> 00:21:46,400 some things about physics and psychology too. 560 00:21:46,400 --> 00:21:48,987 I mean, again, I'm just thinking of the stuff like what's 561 00:21:48,987 --> 00:21:50,570 going on when a kid is playing 562 00:21:50,570 --> 00:21:51,980 with some objects around them and thinking 563 00:21:51,980 --> 00:21:54,271 about what other people might think about those things. 564 00:21:54,271 --> 00:21:57,680 It's just that setting where we think 565 00:21:57,680 --> 00:22:01,179 that you can build sort of in some sense special purpose. 566 00:22:01,179 --> 00:22:02,720 I mean, they're still pretty general.
567 00:22:02,720 --> 00:22:05,720 But inference algorithms for doing inference 568 00:22:05,720 --> 00:22:07,940 in probabilistic programs, getting the causes 569 00:22:07,940 --> 00:22:10,670 from the effects that are much, much faster 570 00:22:10,670 --> 00:22:12,170 than things that could work on just 571 00:22:12,170 --> 00:22:15,845 arbitrary probabilistic programs and that actually often look 572 00:22:15,845 --> 00:22:16,970 a lot like neural networks. 573 00:22:16,970 --> 00:22:18,710 And in particular, we can directly 574 00:22:18,710 --> 00:22:22,010 use, say for example, deep convolutional neural networks 575 00:22:22,010 --> 00:22:23,880 to build these recognition programs 576 00:22:23,880 --> 00:22:27,350 or basically inference programs that 577 00:22:27,350 --> 00:22:30,020 work by pattern recognition in, for example, 578 00:22:30,020 --> 00:22:31,790 an inverse graphics approach to vision. 579 00:22:31,790 --> 00:22:34,550 So that's what I'll show you basically now. 580 00:22:34,550 --> 00:22:36,350 I'm going to start off by just working 581 00:22:36,350 --> 00:22:37,725 through a couple of these arrows. 582 00:22:37,725 --> 00:22:41,630 I'm going to first talk about this sort of approach we've 583 00:22:41,630 --> 00:22:44,690 done to tackle both vision as inverse graphics 584 00:22:44,690 --> 00:22:46,539 and some intuitive physics on the scene 585 00:22:46,539 --> 00:22:48,830 recovered and then say a little bit about the intuitive 586 00:22:48,830 --> 00:22:51,260 psychology side. 587 00:22:51,260 --> 00:22:54,090 Here's an example of the kind of specific domain we've studied. 588 00:22:54,090 --> 00:22:56,150 It's like our Atari setting. 589 00:22:56,150 --> 00:22:59,360 It's a kind of video game inspired by the real game 590 00:22:59,360 --> 00:23:00,535 Jenga. 591 00:23:00,535 --> 00:23:02,660 Jenga's this cool game you play with wooden blocks. 
592 00:23:02,660 --> 00:23:06,500 You start off with a very, very, very nicely stacked up thing 593 00:23:06,500 --> 00:23:09,380 and you take turns removing the blocks. 594 00:23:09,380 --> 00:23:11,330 And the player who removes the block that 595 00:23:11,330 --> 00:23:13,662 makes the whole thing fall over is the one who loses. 596 00:23:13,662 --> 00:23:15,620 And it really exercises this part of your brain 597 00:23:15,620 --> 00:23:18,500 that we've been studying here, which is an ability 598 00:23:18,500 --> 00:23:22,130 to reason about stability and support. I very briefly went 599 00:23:22,130 --> 00:23:23,730 over this, but this is something that 600 00:23:23,730 --> 00:23:26,371 is one of the classic case studies of infant object 601 00:23:26,371 --> 00:23:26,870 knowledge, 602 00:23:26,870 --> 00:23:29,480 looking at how basically these concepts develop 603 00:23:29,480 --> 00:23:32,240 in some really interesting ways over the first year of life. 604 00:23:32,240 --> 00:23:34,880 Though what we're doing here is building models and testing 605 00:23:34,880 --> 00:23:36,110 them primarily with adults. 606 00:23:36,110 --> 00:23:38,568 It is part of what we're trying to do in our Brains, Minds, 607 00:23:38,568 --> 00:23:40,220 and Machines research program here, 608 00:23:40,220 --> 00:23:42,470 in collaboration with Liz and others, 609 00:23:42,470 --> 00:23:45,007 to actually test these ideas in experiments with infants. 610 00:23:45,007 --> 00:23:47,090 But what I'll show you is just kind of think of it 611 00:23:47,090 --> 00:23:49,790 as like infant-inspired adult intuitive physics 612 00:23:49,790 --> 00:23:52,490 where we build and test the models in an easier way, 613 00:23:52,490 --> 00:23:55,082 and then we're taking it down to kids going forward.
614 00:23:55,082 --> 00:23:56,540 So the kind of experiment we can do 615 00:23:56,540 --> 00:23:59,990 with adults is show them these configurations of blocks 616 00:23:59,990 --> 00:24:05,440 and say, for example, how stable under gravity 617 00:24:05,440 --> 00:24:07,989 is one of these towers or configurations? 618 00:24:07,989 --> 00:24:09,530 So like everything else, you can make 619 00:24:09,530 --> 00:24:11,930 a judgment on a scale of zero to 10 or one to seven. 620 00:24:11,930 --> 00:24:13,430 And probably most people would agree 621 00:24:13,430 --> 00:24:17,739 that the ones in the upper left are relatively stable, meaning 622 00:24:17,739 --> 00:24:19,280 if you just sort of run gravity on it 623 00:24:19,280 --> 00:24:20,630 it's not going to fall over. 624 00:24:20,630 --> 00:24:22,280 Whereas the ones in the lower right 625 00:24:22,280 --> 00:24:24,590 are much more likely to fall under gravity. 626 00:24:24,590 --> 00:24:25,702 Fair enough? 627 00:24:25,702 --> 00:24:26,660 That's what people say. 628 00:24:26,660 --> 00:24:27,260 OK. 629 00:24:27,260 --> 00:24:29,762 So that's the kind of thing we'd like to be able to explain 630 00:24:29,762 --> 00:24:31,220 as well as many other judgments you 631 00:24:31,220 --> 00:24:33,680 could make about this simple, but not 632 00:24:33,680 --> 00:24:35,057 that simple world of objects. 633 00:24:35,057 --> 00:24:36,890 And again, you can see how in principle this 634 00:24:36,890 --> 00:24:39,473 could very nicely interface with what Demis was talking about. 635 00:24:39,473 --> 00:24:42,350 He talked about their ambition to do the SHRDLU task, which 636 00:24:42,350 --> 00:24:45,620 was this ability to basically have a system that 637 00:24:45,620 --> 00:24:47,600 can take in instructions in language 638 00:24:47,600 --> 00:24:50,094 and manipulate objects in a blocks world. 639 00:24:50,094 --> 00:24:51,260 They are very far from that.
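A minimal sketch of that judgment as simulation, assuming a made-up tower representation (one horizontal center per stacked block) and a crude center-of-mass rule in place of a real physics engine:

```python
import random

def falls(xs, width=1.0):
    # Crude, deterministic stability rule: the stack falls if the
    # center of mass of the blocks above any block overhangs that
    # block's edge.
    for i in range(len(xs) - 1):
        above = xs[i + 1:]
        com = sum(above) / len(above)
        if abs(com - xs[i]) > width / 2:
            return True
    return False

def judged_instability(xs, noise=0.1, runs=200):
    # Perception is uncertain: jitter the inferred block positions,
    # run the deterministic rule on each guess, and report how often
    # the tower comes down.
    falls_count = sum(
        falls([x + random.gauss(0.0, noise) for x in xs])
        for _ in range(runs))
    return falls_count / runs

stable_tower = [0.0, 0.05, -0.05]    # well stacked
precarious_tower = [0.0, 0.4, 0.8]   # badly overhanging

low = judged_instability(stable_tower)
high = judged_instability(precarious_tower)
```

Graded, zero-to-ten-style stability judgments fall out of running a deterministic rule on many noisy perceptual guesses.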
640 00:24:51,260 --> 00:24:53,210 Everybody's really far from having a general purpose 641 00:24:53,210 --> 00:24:55,418 system that can do that in any way like a human does. 642 00:24:55,418 --> 00:24:58,280 But we think we're building some of the common sense knowledge 643 00:24:58,280 --> 00:24:59,330 about the physical world that would 644 00:24:59,330 --> 00:25:02,000 be necessary to get something like that to work or to explain 645 00:25:02,000 --> 00:25:04,400 how kids play with blocks, play with each other, 646 00:25:04,400 --> 00:25:07,410 talk to each other while they're playing with blocks and so on. 647 00:25:07,410 --> 00:25:09,490 So the first step is the vision part. 648 00:25:09,490 --> 00:25:12,770 In this picture here, it's that blue graphics arrow. 649 00:25:12,770 --> 00:25:15,260 Here's another way into it. 650 00:25:15,260 --> 00:25:19,469 We want to be able to take a 2D image and work backwards 651 00:25:19,469 --> 00:25:21,260 to the world state, the kind of world state 652 00:25:21,260 --> 00:25:22,850 that can support physical reasoning. 653 00:25:22,850 --> 00:25:26,570 Again, remember these buzzwords-- 654 00:25:26,570 --> 00:25:28,820 explaining the mind with generative models 655 00:25:28,820 --> 00:25:31,010 that are causal and compositional. 656 00:25:31,010 --> 00:25:33,530 We want a description of the world which 657 00:25:33,530 --> 00:25:35,390 supports causal reasoning of the sort 658 00:25:35,390 --> 00:25:37,464 that physics is doing, like forces interacting 659 00:25:37,464 --> 00:25:38,130 with each other. 660 00:25:38,130 --> 00:25:40,520 So it's got to have things that can exert force 661 00:25:40,520 --> 00:25:41,630 and can suffer forces. 662 00:25:41,630 --> 00:25:43,910 It's got to have mass in some form. 663 00:25:43,910 --> 00:25:45,710 It's got to be compositional because you've 664 00:25:45,710 --> 00:25:47,570 got to be able to pick up a block and take it away. 
665 00:25:47,570 --> 00:25:50,230 Or if I have these blocks over here and these blocks over here 666 00:25:50,230 --> 00:25:51,620 and I want to put these ones on top of there, 667 00:25:51,620 --> 00:25:53,240 the world state has to be able to support 668 00:25:53,240 --> 00:25:55,070 any number of objects in any configuration 669 00:25:55,070 --> 00:25:57,560 and to literally compose a representation 670 00:25:57,560 --> 00:25:59,570 of a world of objects that are composed together 671 00:25:59,570 --> 00:26:00,824 to make bigger things. 672 00:26:00,824 --> 00:26:02,240 So really the only way we know how 673 00:26:02,240 --> 00:26:05,180 to do that is something like what's sometimes in engineering 674 00:26:05,180 --> 00:26:07,022 called a CAD model or computer-aided design. 675 00:26:07,022 --> 00:26:08,480 But it's basically a representation 676 00:26:08,480 --> 00:26:10,640 of three-dimensional objects, often 677 00:26:10,640 --> 00:26:12,650 with something like a mesh or a grid 678 00:26:12,650 --> 00:26:15,470 of key points with their masses and springs for stiffness, 679 00:26:15,470 --> 00:26:16,636 something like that. 680 00:26:16,636 --> 00:26:18,260 Here my only picture of the world state 681 00:26:18,260 --> 00:26:20,090 looks an awful lot like the image, 682 00:26:20,090 --> 00:26:22,100 only it's in black and white instead of color. 683 00:26:22,100 --> 00:26:24,590 But the difference is that the thing on the bottom 684 00:26:24,590 --> 00:26:26,030 is actually an image. 685 00:26:26,030 --> 00:26:27,530 Whereas the thing on the top is just 686 00:26:27,530 --> 00:26:29,840 a 2D projection of a 3D model. 687 00:26:29,840 --> 00:26:31,190 I'll show you that one. 688 00:26:31,190 --> 00:26:32,194 Here's a few others. 689 00:26:32,194 --> 00:26:33,860 So I'll go back and forth between these. 690 00:26:33,860 --> 00:26:36,443 Notice how it kind of looks like the blocks are moving around. 
691 00:26:36,443 --> 00:26:38,360 So what's actually going on is these 692 00:26:38,360 --> 00:26:40,460 are samples from the Bayesian posterior 693 00:26:40,460 --> 00:26:42,410 in an inverse graphics system. 694 00:26:42,410 --> 00:26:44,330 We put a prior on world states, which 695 00:26:44,330 --> 00:26:48,050 is basically a prior on what we think the world is made out of. 696 00:26:48,050 --> 00:26:50,490 We think there are these Jenga blocks, basically. 697 00:26:50,490 --> 00:26:54,140 And then the likelihood, which is that forward model, is 698 00:26:54,140 --> 00:26:57,170 the probability of seeing a particular 2D image given 699 00:26:57,170 --> 00:26:58,610 a 3D configuration of blocks. 700 00:26:58,610 --> 00:27:00,110 And going back to the thing you had, 701 00:27:00,110 --> 00:27:02,750 it's basically deterministic with a little bit of noise. 702 00:27:02,750 --> 00:27:03,590 It's deterministic. 703 00:27:03,590 --> 00:27:06,590 It just follows the rules of OpenGL graphics. 704 00:27:06,590 --> 00:27:08,540 It basically says objects have surfaces. 705 00:27:08,540 --> 00:27:09,660 They're not transparent. 706 00:27:09,660 --> 00:27:10,784 You can't see through them. 707 00:27:10,784 --> 00:27:13,540 That's an extra complication if you wanted to have that. 708 00:27:13,540 --> 00:27:15,920 And basically the image is formed 709 00:27:15,920 --> 00:27:19,190 by taking the closest surface of the closest object 710 00:27:19,190 --> 00:27:22,160 and bouncing a ray of light off of it, which really just means 711 00:27:22,160 --> 00:27:24,060 taking its color and scaling it by intensity. 712 00:27:24,060 --> 00:27:26,840 It's a very simple shadow model. 713 00:27:26,840 --> 00:27:27,980 So that's the causal model. 714 00:27:27,980 --> 00:27:30,188 And then we can add a little bit of uncertainty like, 715 00:27:30,188 --> 00:27:31,550 for example, maybe we can't-- 716 00:27:31,550 --> 00:27:34,700 there's a little bit of noise in the sensor data.
717 00:27:34,700 --> 00:27:38,690 So you can be uncertain about exactly the low level image 718 00:27:38,690 --> 00:27:39,255 features. 719 00:27:39,255 --> 00:27:41,630 And then when you run one of these probabilistic programs 720 00:27:41,630 --> 00:27:44,840 in reverse to make a guess of what configuration of blocks 721 00:27:44,840 --> 00:27:46,720 is most likely to have produced that image, 722 00:27:46,720 --> 00:27:48,950 there is a little bit of posterior uncertainty 723 00:27:48,950 --> 00:27:54,530 that inherits from the fact that you can't perfectly localize 724 00:27:54,530 --> 00:27:56,220 those objects in the world. 725 00:27:56,220 --> 00:27:59,930 So again, what you see here are three or four samples 726 00:27:59,930 --> 00:28:01,910 from the posterior-- the distribution 727 00:28:01,910 --> 00:28:04,220 over best guesses of the world state 728 00:28:04,220 --> 00:28:07,160 of 3D objects that were most likely to have rendered 729 00:28:07,160 --> 00:28:08,780 into that 2D image. 730 00:28:08,780 --> 00:28:11,270 And any one of those is now an actionable representation 731 00:28:11,270 --> 00:28:14,220 for physical manipulation or reasoning. 732 00:28:14,220 --> 00:28:16,285 OK? 733 00:28:16,285 --> 00:28:17,660 And how we actually compute that, 734 00:28:17,660 --> 00:28:20,540 again, I'm not going to go into right now. 735 00:28:20,540 --> 00:28:23,250 I'll go into something like it in a minute. 736 00:28:23,250 --> 00:28:24,830 But at least in its most basic form, 737 00:28:24,830 --> 00:28:27,230 it involves some rather unfortunately 738 00:28:27,230 --> 00:28:29,720 slow random search process through the space 739 00:28:29,720 --> 00:28:32,870 of blocks models. 740 00:28:32,870 --> 00:28:33,900 Here's another example. 741 00:28:33,900 --> 00:28:36,050 This is another configuration there-- 742 00:28:36,050 --> 00:28:36,830 another image. 743 00:28:36,830 --> 00:28:39,950 And here is a few samples again from the posterior. 
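That slow random search can be sketched in one dimension, with a hypothetical tent-shaped "renderer" standing in for OpenGL and a single block position standing in for the full 3D world state:

```python
import random

def render(x):
    # Deterministic toy graphics: a block at position x makes a
    # tent-shaped bump in a 10-pixel, one-dimensional "image".
    return [max(0.0, 1.0 - abs(x - p) / 2.0) for p in range(10)]

def log_score(x, image):
    # Likelihood: deterministic rendering plus a little pixel noise,
    # so mismatch is penalized quadratically.
    return -sum((a - b) ** 2 for a, b in zip(render(x), image))

def infer(image, tries=2000):
    # The slow random search: guess world states at random and keep
    # the guess whose rendering best explains the observed image.
    best_x, best = None, float("-inf")
    for _ in range(tries):
        x = random.uniform(0.0, 9.0)
        s = log_score(x, image)
        if s > best:
            best_x, best = x, s
    return best_x

observed = render(5.0)   # image produced by a block at x = 5
guess = infer(observed)
```

Running the program forward is cheap; running it backwards means searching the space of inputs, which is why general-purpose inference is slow.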
744 00:28:39,950 --> 00:28:42,200 And hopefully when you see these things moving around, 745 00:28:42,200 --> 00:28:43,970 whether it's this one or the one before, 746 00:28:43,970 --> 00:28:46,550 you see them move a little bit, but most of them 747 00:28:46,550 --> 00:28:47,780 look very similar. 748 00:28:47,780 --> 00:28:49,966 You'd be hard pressed to tell the difference 749 00:28:49,966 --> 00:28:52,340 if you looked away for a second between any one of those. 750 00:28:52,340 --> 00:28:53,930 Which one are you actually seeing? 751 00:28:53,930 --> 00:28:55,820 And that's exactly the point. 752 00:28:55,820 --> 00:28:59,000 The uncertainty you see there is meant to capture basically 753 00:28:59,000 --> 00:29:01,280 the uncertainty you have in a single glance 754 00:29:01,280 --> 00:29:02,420 at an image like that. 755 00:29:02,420 --> 00:29:04,430 You can't perfectly tell where the blocks are. 756 00:29:04,430 --> 00:29:08,000 So basically any one of these configurations 757 00:29:08,000 --> 00:29:09,860 up here is about equally good. 758 00:29:09,860 --> 00:29:11,420 And we think your intuitive physics, 759 00:29:11,420 --> 00:29:14,090 your sort of common sense core intuitive physics 760 00:29:14,090 --> 00:29:16,610 that even babies have, is operating over one 761 00:29:16,610 --> 00:29:19,310 or a few samples like that. 
762 00:29:19,310 --> 00:29:21,840 Now in separate work that is not really-- 763 00:29:21,840 --> 00:29:23,960 I think of it as really about common sense, 764 00:29:23,960 --> 00:29:26,293 but it's one of the things we've been doing in our group 765 00:29:26,293 --> 00:29:28,709 and in CBMM where these ideas best make contact 766 00:29:28,709 --> 00:29:30,500 with the rest of what people are doing here 767 00:29:30,500 --> 00:29:33,530 and where we can really test interesting neural hypotheses 768 00:29:33,530 --> 00:29:35,780 potentially and understand the interplay 769 00:29:35,780 --> 00:29:37,940 between these generative models for explanation 770 00:29:37,940 --> 00:29:40,100 and the more sort of neural-network-type models 771 00:29:40,100 --> 00:29:42,207 for pattern recognition. 772 00:29:42,207 --> 00:29:43,790 We've been really pushing on this idea 773 00:29:43,790 --> 00:29:45,440 of vision as inverse graphics. 774 00:29:45,440 --> 00:29:47,810 So I'll tell you a little bit about that because it's 775 00:29:47,810 --> 00:29:49,075 quite interesting for CBMM. 776 00:29:49,075 --> 00:29:51,800 But I want to make sure to only do this for about five minutes 777 00:29:51,800 --> 00:29:53,930 and then go back to how this gets 778 00:29:53,930 --> 00:29:57,500 used for more of the intuitive physics and planning stuff. 779 00:29:57,500 --> 00:30:01,950 So this is an example from a paper by Tejas Kulkarni, who's 780 00:30:01,950 --> 00:30:03,750 one of our grad students. 781 00:30:03,750 --> 00:30:06,540 And it's joint work with a few other really smart people 782 00:30:06,540 --> 00:30:08,400 such as Vikash Mansinghka, who's a research 783 00:30:08,400 --> 00:30:10,500 scientist at MIT, and Pushmeet Kohli, 784 00:30:10,500 --> 00:30:12,750 who's at Microsoft Research.
785 00:30:12,750 --> 00:30:15,380 And it was a computer vision paper, a pure computer vision 786 00:30:15,380 --> 00:30:20,610 paper from the summer, where he was developing 787 00:30:20,610 --> 00:30:22,860 a specific kind of probabilistic programming language, 788 00:30:22,860 --> 00:30:25,200 but a general one for doing this kind of vision 789 00:30:25,200 --> 00:30:27,992 as inverse graphics, where you could give 790 00:30:27,992 --> 00:30:29,200 a number of different models. 791 00:30:29,200 --> 00:30:31,574 Here I'll show you one for faces, another one for bodies, 792 00:30:31,574 --> 00:30:33,030 another one for generic objects. 793 00:30:33,030 --> 00:30:38,250 But basically you can pretty easily specify a graphics model 794 00:30:38,250 --> 00:30:40,260 that when you run it in the forward direction 795 00:30:40,260 --> 00:30:43,260 generates random images of objects in a certain class. 796 00:30:43,260 --> 00:30:45,420 And then you can run it in the reverse direction 797 00:30:45,420 --> 00:30:49,480 to do scene parsing to go from the image 798 00:30:49,480 --> 00:30:50,980 to the underlying scene. 799 00:30:50,980 --> 00:30:53,070 So here's an example of this in faces 800 00:30:53,070 --> 00:30:56,460 where the graphics model-- it's really very directly based 801 00:30:56,460 --> 00:30:59,280 on work that Thomas Vetter, who was a former student 802 00:30:59,280 --> 00:31:00,870 or post-doc of Tommy's actually, so 803 00:31:00,870 --> 00:31:04,500 kind of an early ancestor of CBMM, built 804 00:31:04,500 --> 00:31:07,000 with his group in Basel, Switzerland, where 805 00:31:07,000 --> 00:31:09,570 it's a simple but still pretty nice 806 00:31:09,570 --> 00:31:12,090 graphics model for making face images. 807 00:31:12,090 --> 00:31:14,550 There's a model of the shape of the face, which again, is 808 00:31:14,550 --> 00:31:15,690 like a CAD model. 809 00:31:15,690 --> 00:31:17,520 It's a mesh surface description.
810 00:31:17,520 --> 00:31:21,390 Pretty fine-grained structure of the 2D surface 811 00:31:21,390 --> 00:31:23,430 of the face in 3D. 812 00:31:23,430 --> 00:31:25,710 And there are about 400 dimensions 813 00:31:25,710 --> 00:31:28,374 to characterize the possible shapes of faces. 814 00:31:28,374 --> 00:31:29,790 And there's another 400 dimensions 815 00:31:29,790 --> 00:31:31,206 to characterize the texture, which 816 00:31:31,206 --> 00:31:34,560 is like the skin, the beard, the eyes, the color, and surface 817 00:31:34,560 --> 00:31:36,779 properties that get mapped on top of the mesh. 818 00:31:36,779 --> 00:31:38,820 And then there's a little bit more graphics stuff, 819 00:31:38,820 --> 00:31:40,740 which is generic, not specific to faces. 820 00:31:40,740 --> 00:31:42,530 That stuff is all specific to faces. 821 00:31:42,530 --> 00:31:43,890 But then there is a simple lighting model. 822 00:31:43,890 --> 00:31:45,639 So you basically have a point light source 823 00:31:45,639 --> 00:31:48,474 somewhere out there and you shine the light on the face. 824 00:31:48,474 --> 00:31:49,890 It can produce shadows, of course, 825 00:31:49,890 --> 00:31:51,990 but not very complicated ones. 826 00:31:51,990 --> 00:31:54,284 And then there's a viewpoint camera thing. 827 00:31:54,284 --> 00:31:56,700 So you put the light source somewhere and you put a camera 828 00:31:56,700 --> 00:31:58,530 somewhere specifying the viewpoint. 829 00:31:58,530 --> 00:32:01,050 And the combination of these, shape, texture, lighting, 830 00:32:01,050 --> 00:32:03,680 and camera, give you a complete graphics specification. 831 00:32:03,680 --> 00:32:06,000 It produces an image of a particular face 832 00:32:06,000 --> 00:32:07,920 lit from a particular direction and viewed 833 00:32:07,920 --> 00:32:10,870 from some particular viewpoint and distance.
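The forward direction can be sketched as a sampler over scene descriptions. The 400-dimensional shape and texture counts come from the talk; the lighting and camera parameter names and ranges below are invented placeholders, and a real graphics engine would map this dictionary to pixels:

```python
import random

def sample_face_scene():
    # One draw from the prior: shape and texture coefficients plus
    # generic lighting and camera parameters fully specify a render.
    return {
        "shape":   [random.gauss(0.0, 1.0) for _ in range(400)],
        "texture": [random.gauss(0.0, 1.0) for _ in range(400)],
        "light":   {"azimuth": random.uniform(0, 360),
                    "elevation": random.uniform(0, 90)},
        "camera":  {"viewpoint": random.uniform(-90, 90),
                    "distance": random.uniform(1.0, 5.0)},
    }

# Pressing "Go" on the prior: each call is a new random face under
# random lighting, seen from a random viewpoint.
scene = sample_face_scene()
```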
834 00:32:10,870 --> 00:32:12,870 And what you see on the right are random samples 835 00:32:12,870 --> 00:32:15,414 from this probabilistic program, this generative model. 836 00:32:15,414 --> 00:32:16,830 So you can just write this program 837 00:32:16,830 --> 00:32:19,030 and press Go, Go, Go, Go, Go, and every time you run it, 838 00:32:19,030 --> 00:32:21,488 you get a new face viewed from a new direction and lighting 839 00:32:21,488 --> 00:32:22,417 condition. 840 00:32:22,417 --> 00:32:23,250 So that's the prior. 841 00:32:26,400 --> 00:32:28,347 Now, what about inference? 842 00:32:28,347 --> 00:32:30,180 Well, the idea of vision as inverse graphics 843 00:32:30,180 --> 00:32:33,570 is to say take a real image of a face like that one 844 00:32:33,570 --> 00:32:36,179 and see if you can produce from your graphics 845 00:32:36,179 --> 00:32:37,720 model something that looks like that. 846 00:32:37,720 --> 00:32:39,690 So, for example, here in the lower left 847 00:32:39,690 --> 00:32:42,360 is an example of a face that was produced from the graphics 848 00:32:42,360 --> 00:32:45,300 model that hopefully most of you agree looks kind of like that. 849 00:32:45,300 --> 00:32:47,970 Maybe not exactly the same, but kind of enough. 850 00:32:47,970 --> 00:32:50,170 And in building this system-- 851 00:32:50,170 --> 00:32:52,290 this system, by the way, is called Picture. 852 00:32:52,290 --> 00:32:54,330 That's that first word of the paper title, 853 00:32:54,330 --> 00:32:56,769 too, the Kulkarni, et al. paper. 854 00:32:56,769 --> 00:32:58,810 There were a few neat things that had to be done. 855 00:32:58,810 --> 00:33:00,351 One of the things that had to be done 856 00:33:00,351 --> 00:33:03,259 was to come up with various ways to say 857 00:33:03,259 --> 00:33:05,550 what does it mean for the output of the graphics engine 858 00:33:05,550 --> 00:33:07,150 to look like the image. 
859 00:33:07,150 --> 00:33:09,450 In the case of faces, actually matching up pixels 860 00:33:09,450 --> 00:33:10,750 is not completely crazy. 861 00:33:10,750 --> 00:33:13,140 But for most vision problems, it's 862 00:33:13,140 --> 00:33:16,020 going to be unrealistic and unnecessary to build a graphics 863 00:33:16,020 --> 00:33:19,070 engine that's pixel-level realistic. 864 00:33:19,070 --> 00:33:20,970 And so you might, for example, want 865 00:33:20,970 --> 00:33:25,062 to have something where the graphics engine hypothesis is 866 00:33:25,062 --> 00:33:26,520 matched to the image with something 867 00:33:26,520 --> 00:33:27,739 like some kind of features. 868 00:33:27,739 --> 00:33:30,030 Like it could be convolutional neural network features. 869 00:33:30,030 --> 00:33:32,400 That's one way to use, for example, neural networks 870 00:33:32,400 --> 00:33:34,800 to make something like this work well. 871 00:33:34,800 --> 00:33:37,590 And Jojen just showed me a paper by some other folks 872 00:33:37,590 --> 00:33:39,114 from Darmstadt, which is doing what 873 00:33:39,114 --> 00:33:41,280 looks like a very interesting similar kind of thing. 874 00:33:44,280 --> 00:33:48,330 Let me show what inference looks like in this model and then 875 00:33:48,330 --> 00:33:50,430 say what I think is an even more interesting way 876 00:33:50,430 --> 00:33:51,750 to use convolutional nets. 877 00:33:51,750 --> 00:33:54,620 And that's from another recent paper we've been looking at. 878 00:33:54,620 --> 00:33:59,100 So here is, if you watch this, this is one observed face. 879 00:33:59,100 --> 00:34:01,230 And what you're seeing over here is just 880 00:34:01,230 --> 00:34:04,140 a trace of the system kind of searching 881 00:34:04,140 --> 00:34:06,890 through the space of traces of the graphics program. 882 00:34:06,890 --> 00:34:08,698 Basically trying out random faces 883 00:34:08,698 --> 00:34:10,239 that might look like that face there.
884 00:34:10,239 --> 00:34:12,106 It's using a kind of MCMC inference. 885 00:34:12,106 --> 00:34:13,980 It's very similar to what you're going to see 886 00:34:13,980 --> 00:34:16,739 from Tomer in the tutorial. 887 00:34:16,739 --> 00:34:20,280 It basically starts off with a random face 888 00:34:20,280 --> 00:34:23,909 and takes a bunch of small random steps 889 00:34:23,909 --> 00:34:27,270 that are biased towards making the image look more and more 890 00:34:27,270 --> 00:34:28,973 like the actual observed image. 891 00:34:28,973 --> 00:34:30,389 And at the end, you have something 892 00:34:30,389 --> 00:34:33,031 which looks almost identical to the observed face. 893 00:34:33,031 --> 00:34:35,489 The key, right, though, is that though the observed face is 894 00:34:35,489 --> 00:34:37,679 literally just a 2D image, the thing 895 00:34:37,679 --> 00:34:39,659 you're seeing on the right is a projection 896 00:34:39,659 --> 00:34:41,610 of a 3D model of a face. 897 00:34:41,610 --> 00:34:45,570 And it's one that supports a lot of causal action. 898 00:34:45,570 --> 00:34:49,020 So here just to show you on a more interesting sort 899 00:34:49,020 --> 00:34:52,170 of high-resolution set of face images, the ones on the left 900 00:34:52,170 --> 00:34:53,489 are observed images. 901 00:34:53,489 --> 00:34:55,916 And then we fit this model. 902 00:34:55,916 --> 00:34:58,290 And then we can rotate it around and change the lighting. 903 00:34:58,290 --> 00:35:00,749 If we had parameters that control the expression-- 904 00:35:00,749 --> 00:35:02,790 there's no real expression parameters here-- that 905 00:35:02,790 --> 00:35:04,710 wouldn't be too hard to put in. 906 00:35:04,710 --> 00:35:06,400 You could make us happy or sad. 
907 00:35:06,400 --> 00:35:07,740 But you can see-- 908 00:35:07,740 --> 00:35:10,290 hopefully what you can see is that the recovered model 909 00:35:10,290 --> 00:35:12,840 supports fairly reasonable generalization 910 00:35:12,840 --> 00:35:14,874 to other viewpoints and lighting conditions. 911 00:35:14,874 --> 00:35:16,290 It's the sort of thing that should 912 00:35:16,290 --> 00:35:18,951 make for more robust face recognition. 913 00:35:18,951 --> 00:35:20,700 Although that's not the main focus of what 914 00:35:20,700 --> 00:35:21,360 we're trying to use it for here. 915 00:35:21,360 --> 00:35:23,250 I just want to emphasize there's all sorts of things that 916 00:35:23,250 --> 00:35:25,791 would be useful if you had an actual 3D model of the face you 917 00:35:25,791 --> 00:35:27,260 could get from a single image. 918 00:35:27,260 --> 00:35:31,600 Or here's the same kind of idea now for a body pose system. 919 00:35:31,600 --> 00:35:33,585 So now, the image we're going to assume 920 00:35:33,585 --> 00:35:35,460 has a person in it somewhere doing something. 921 00:35:35,460 --> 00:35:37,751 Remember back to that challenge I gave at the beginning 922 00:35:37,751 --> 00:35:41,400 about finding the bodies in a complex scene like the airplane 923 00:35:41,400 --> 00:35:44,880 full of computer vision researchers 924 00:35:44,880 --> 00:35:48,090 where you found the right hand or the left toe. 925 00:35:48,090 --> 00:35:50,280 So in order to do that, we think you 926 00:35:50,280 --> 00:35:52,950 have to have something like an actual 3D model of a body. 927 00:35:52,950 --> 00:35:54,840 What you see on the lower left is 928 00:35:54,840 --> 00:35:56,060 a bunch of samples from this. 929 00:35:56,060 --> 00:36:00,180 So we basically just took a kind of interesting 3D stick figure 930 00:36:00,180 --> 00:36:03,360 skeleton model and just put some knobs on it. 931 00:36:03,360 --> 00:36:04,540 You can tweak it around.
932 00:36:04,540 --> 00:36:06,030 You can put some simple probability models 933 00:36:06,030 --> 00:36:06,720 to get a prior. 934 00:36:06,720 --> 00:36:08,095 And these are just random samples 935 00:36:08,095 --> 00:36:09,502 of random body positions. 936 00:36:09,502 --> 00:36:11,460 And the idea of the system is to kind of search 937 00:36:11,460 --> 00:36:14,250 through that space of body positions 938 00:36:14,250 --> 00:36:16,800 until you find one, which then when you project it 939 00:36:16,800 --> 00:36:18,960 from a certain camera angle looks 940 00:36:18,960 --> 00:36:20,700 like the body you're seeing. 941 00:36:20,700 --> 00:36:22,510 So here is an example of this in action. 942 00:36:22,510 --> 00:36:24,030 This is some guy-- 943 00:36:24,030 --> 00:36:27,090 I guess Usain Bolt. Some kind of interesting slightly unusual 944 00:36:27,090 --> 00:36:30,240 pose as he's about to break the finish line maybe. 945 00:36:30,240 --> 00:36:32,280 And here is the system in action. 946 00:36:32,280 --> 00:36:33,991 So it starts off from a random position 947 00:36:33,991 --> 00:36:35,490 and, again, sort of takes 948 00:36:35,490 --> 00:36:38,910 a bunch of random steps moving around in 3D space 949 00:36:38,910 --> 00:36:40,890 until it finds a configuration, which 950 00:36:40,890 --> 00:36:42,720 when you project it into the image looks 951 00:36:42,720 --> 00:36:44,270 like what you see there. 952 00:36:44,270 --> 00:36:48,600 Now, notice a key difference when I say looks like-- 953 00:36:48,600 --> 00:36:52,050 it doesn't look like it at the pixel level like the face did. 954 00:36:52,050 --> 00:36:55,530 It's only matching at the level of these basically enhanced 955 00:36:55,530 --> 00:36:57,280 edge statistics which you see here. 956 00:36:57,280 --> 00:36:59,280 So this is an example of building a model that's 957 00:36:59,280 --> 00:37:02,160 not a photorealistic render.
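[Editor's sketch] The matching-at-feature-level point can be illustrated minimally. `edge_features` below is an invented stand-in for the enhanced edge statistics (or convnet features) being described, not the representation the actual system uses:

```python
def edge_features(image):
    # Compare images by local differences rather than raw intensities,
    # so overall appearance (brightness, clothing color, skin) cancels out.
    return [b - a for a, b in zip(image, image[1:])]

def feature_distance(rendered, observed):
    # Squared distance in feature space: the likelihood only has to
    # match the render to the image at this level, not pixel by pixel.
    return sum((a - b) ** 2
               for a, b in zip(edge_features(rendered), edge_features(observed)))

observed = [0.0, 1.0, 0.2, 0.9]          # toy "image" of a pose
recolored = [v + 0.5 for v in observed]  # same structure, different appearance

# In raw pixel space these two are far apart...
pixel_gap = sum((a - b) ** 2 for a, b in zip(recolored, observed))
# ...but in feature space they match almost exactly.
feature_gap = feature_distance(recolored, observed)
```

So a graphics model with no clothing or skin model can still "look like" the image at the only level that matters for pose.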
958 00:37:02,160 --> 00:37:04,380 The graphics model is not trying to match the image. 959 00:37:04,380 --> 00:37:05,880 It's trying to match this. 960 00:37:05,880 --> 00:37:08,130 Or it could be, for example, some intermediate level 961 00:37:08,130 --> 00:37:09,252 of convnet features. 962 00:37:09,252 --> 00:37:10,710 And we think this is very powerful. 963 00:37:10,710 --> 00:37:12,420 Because more generally while we might 964 00:37:12,420 --> 00:37:17,040 have a really detailed model of facial appearance, for bodies, 965 00:37:17,040 --> 00:37:18,630 we don't have a good clothing model. 966 00:37:18,630 --> 00:37:21,310 We're not trying to model the skin. 967 00:37:21,310 --> 00:37:24,752 We're just trying to model just enough 968 00:37:24,752 --> 00:37:26,460 to solve the problem we're interested in. 969 00:37:26,460 --> 00:37:29,200 And again, this is reflective of a much more broad theme 970 00:37:29,200 --> 00:37:32,340 in this idea of intelligence as explanation, 971 00:37:32,340 --> 00:37:35,190 modeling the causal structure of the world. 972 00:37:35,190 --> 00:37:37,047 We don't expect, even in science, 973 00:37:37,047 --> 00:37:38,880 but certainly not in our intuitive theories, 974 00:37:38,880 --> 00:37:42,820 to model the causal structure of the world at full detail. 975 00:37:42,820 --> 00:37:46,110 And a way that either I am always misunderstood or always 976 00:37:46,110 --> 00:37:48,582 fail to communicate-- it's my fault really-- 977 00:37:48,582 --> 00:37:50,790 is I say, oh, we have these rich models of the world. 978 00:37:50,790 --> 00:37:52,800 People often think that means that somehow 979 00:37:52,800 --> 00:37:53,800 we have the complete thing. 980 00:37:53,800 --> 00:37:55,680 Like if I say we have a physics engine in our head, 981 00:37:55,680 --> 00:37:56,720 it means we have all of physics. 982 00:37:56,720 --> 00:37:58,303 Or if I say we have a graphics engine, 983 00:37:58,303 --> 00:38:00,330 we have all of every possible thing.
984 00:38:00,330 --> 00:38:02,250 This isn't Pixar. 985 00:38:02,250 --> 00:38:05,280 We're not trying to make a beautiful movie, 986 00:38:05,280 --> 00:38:07,170 except maybe for faces. 987 00:38:07,170 --> 00:38:10,980 We're just trying to capture just the key parts, just 988 00:38:10,980 --> 00:38:13,925 the key causal parts of the way things move 989 00:38:13,925 --> 00:38:16,050 in the world as physical objects and the way images 990 00:38:16,050 --> 00:38:18,510 are formed that at the right level of abstraction 991 00:38:18,510 --> 00:38:22,560 that matters for us allows us to do what we need to do. 992 00:38:22,560 --> 00:38:27,910 This is just an example of our system 993 00:38:27,910 --> 00:38:31,230 solving some pretty challenging body pose recognition problems 994 00:38:31,230 --> 00:38:34,540 in 3D, cases which are problematic 995 00:38:34,540 --> 00:38:38,070 even for the best of standard computer vision systems. 996 00:38:38,070 --> 00:38:39,510 Either because it's a weird pose, 997 00:38:39,510 --> 00:38:42,090 like these weird sports figures, or because the body 998 00:38:42,090 --> 00:38:43,355 is heavily occluded. 999 00:38:43,355 --> 00:38:45,730 But I think, again, these are problems which people solve 1000 00:38:45,730 --> 00:38:46,590 effortlessly. 1001 00:38:46,590 --> 00:38:48,840 And I think something like this is 1002 00:38:48,840 --> 00:38:50,341 on the track of what we want to do. 1003 00:38:50,341 --> 00:38:51,840 You can apply the same kind of thing 1004 00:38:51,840 --> 00:38:54,820 to more generic objects like this, 1005 00:38:54,820 --> 00:38:56,700 but I'm not going to go into the details. 
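[Editor's sketch] The search that ran on the faces earlier -- start from a random hypothesis, take small random steps biased toward making the render look more like the observation -- can be sketched as a toy Metropolis loop. This is not the Picture system: `render` is a made-up stand-in for the graphics engine, and the four latents stand in for shape, texture, lighting, and camera parameters:

```python
import math
import random

def render(latents):
    # Toy stand-in for a graphics engine: a deterministic map from
    # latent parameters to an "image" (just a short list of numbers here).
    return [v * v + 0.5 * v for v in latents]

def log_likelihood(rendered, observed, noise=0.1):
    # Gaussian matching score (up to an additive constant).
    return -sum((a - b) ** 2 for a, b in zip(rendered, observed)) / (2 * noise ** 2)

def mcmc_inverse_graphics(observed, n_latents=4, steps=2000, step_size=0.05, seed=0):
    rng = random.Random(seed)
    latents = [rng.uniform(-1, 1) for _ in range(n_latents)]  # random initial face
    score = log_likelihood(render(latents), observed)
    best, best_score = list(latents), score
    for _ in range(steps):
        proposal = list(latents)
        i = rng.randrange(n_latents)
        proposal[i] += rng.gauss(0, step_size)  # small random step
        new_score = log_likelihood(render(proposal), observed)
        # Metropolis rule: steps that make the render look more like the
        # observed image are always accepted; worsening steps occasionally.
        if new_score >= score or rng.random() < math.exp(new_score - score):
            latents, score = proposal, new_score
            if score > best_score:
                best, best_score = list(latents), score
    return best

observed = render([0.3, -0.5, 0.8, 0.1])  # pretend this is the observed image
fit = mcmc_inverse_graphics(observed)
```

Because the toy renderer is not injective, the fitted latents need not equal the true ones; what matters is that their render closely matches the observed image.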
1006 00:38:56,700 --> 00:38:58,408 The last thing I want to say about vision 1007 00:38:58,408 --> 00:39:02,490 before getting back to common sense for a few minutes-- 1008 00:39:02,490 --> 00:39:05,310 and in some sense, maybe this is the most important slide 1009 00:39:05,310 --> 00:39:08,370 for the broader CBMM, brains, minds, and machines thing. 1010 00:39:08,370 --> 00:39:10,410 Because this is the clearest thing 1011 00:39:10,410 --> 00:39:12,807 I can point to for the thing I've been saying all along 1012 00:39:12,807 --> 00:39:14,640 since the beginning of the morning about how 1013 00:39:14,640 --> 00:39:18,090 we want to look for ways to combine the generative model 1014 00:39:18,090 --> 00:39:20,520 view and the pattern recognition view. 1015 00:39:20,520 --> 00:39:22,980 So the generative model is what you see on the left here. 1016 00:39:22,980 --> 00:39:24,370 It's the arrows going down. 1017 00:39:24,370 --> 00:39:26,430 It's exactly just the face graphics engine, 1018 00:39:26,430 --> 00:39:29,130 the same thing I showed you. 1019 00:39:29,130 --> 00:39:33,060 The thing on the right with the arrows going up is a convnet. 1020 00:39:33,060 --> 00:39:36,440 Basically it's an out-of-the-box, Caffe-style 1021 00:39:36,440 --> 00:39:39,180 convolutional neural net with some fully connected layers 1022 00:39:39,180 --> 00:39:39,921 on the top. 1023 00:39:39,921 --> 00:39:41,670 And then there's a few other dashed arrows 1024 00:39:41,670 --> 00:39:45,330 which represent linear decoders from layers of that model 1025 00:39:45,330 --> 00:39:47,670 to other things, which are basically 1026 00:39:47,670 --> 00:39:49,320 parts of the generative model. 1027 00:39:49,320 --> 00:39:51,835 And the idea here-- this is work due to Ilker Yildirim, who 1028 00:39:51,835 --> 00:39:52,960 some of you might have met. 1029 00:39:52,960 --> 00:39:54,600 He was here the other day.
1030 00:39:54,600 --> 00:39:57,840 He's one of our CBMM postdocs, but also 1031 00:39:57,840 --> 00:40:02,210 joint with Tejas and with Winrich who you saw before. 1032 00:40:02,210 --> 00:40:05,540 It's to try to in several senses combine 1033 00:40:05,540 --> 00:40:07,790 the best of these perspectives, to say, look, 1034 00:40:07,790 --> 00:40:10,244 if we want to recognize anything or perceive 1035 00:40:10,244 --> 00:40:11,660 the structure of the world richly, 1036 00:40:11,660 --> 00:40:14,450 I think it needs to be something like this inverse graphics 1037 00:40:14,450 --> 00:40:16,074 or inverting a graphics program. 1038 00:40:16,074 --> 00:40:17,240 But you saw how slow it was. 1039 00:40:17,240 --> 00:40:19,239 You saw how it took a couple of seconds at least 1040 00:40:19,239 --> 00:40:21,380 on our computer just for faces to search 1041 00:40:21,380 --> 00:40:22,310 through the space of faces to come up 1042 00:40:22,310 --> 00:40:23,518 with a convincing hypothesis. 1043 00:40:23,518 --> 00:40:24,380 That's way too slow. 1044 00:40:24,380 --> 00:40:25,970 Your visual system doesn't take that long. 1045 00:40:25,970 --> 00:40:29,300 We know a lot about exactly how long it takes you from Winrich's, 1046 00:40:29,300 --> 00:40:32,060 and Nancy's, and many other people's work. 1047 00:40:32,060 --> 00:40:35,390 So how can vision in this case, or really much more generally, 1048 00:40:35,390 --> 00:40:38,840 be so rich in terms of the model it builds, yet so fast? 1049 00:40:38,840 --> 00:40:40,730 Well, here's a proposal, which is 1050 00:40:40,730 --> 00:40:44,570 to take the things that are good at being fast like the pattern 1051 00:40:44,570 --> 00:40:47,524 recognizers, deep ones, and train 1052 00:40:47,524 --> 00:40:49,190 them to solve the hard inference problem 1053 00:40:49,190 --> 00:40:51,440 or at least to do most of the work.
1054 00:40:51,440 --> 00:40:53,540 It's an idea which is very heavily inspired 1055 00:40:53,540 --> 00:40:55,550 by an older idea of Geoff Hinton's 1056 00:40:55,550 --> 00:40:58,010 sometimes called the Helmholtz machine. 1057 00:40:58,010 --> 00:41:01,250 Here the idea in common with Hinton 1058 00:41:01,250 --> 00:41:05,300 is to have a generative model and a recognition model 1059 00:41:05,300 --> 00:41:07,460 where the recognition model is a neural network 1060 00:41:07,460 --> 00:41:09,740 and it's trained to invert the generative model. 1061 00:41:09,740 --> 00:41:15,620 Namely, it's trained to map not from sense data to task output, 1062 00:41:15,620 --> 00:41:18,290 but from sense data to the hidden deep causes 1063 00:41:18,290 --> 00:41:21,020 of the generative model, which then, when you want to use this 1064 00:41:21,020 --> 00:41:26,630 to act to plan what you're going to do, you plan on the model. 1065 00:41:26,630 --> 00:41:29,390 To make an analogy to, say, the DeepMind video game player, 1066 00:41:29,390 --> 00:41:31,280 this would be like having a system which, 1067 00:41:31,280 --> 00:41:33,380 in contrast to the Deep Q-network, which 1068 00:41:33,380 --> 00:41:36,050 mapped from pixel images to joystick commands, 1069 00:41:36,050 --> 00:41:38,600 this would be like learning a network that 1070 00:41:38,600 --> 00:41:40,316 maps from pixel images to the game state, 1071 00:41:40,316 --> 00:41:42,440 to the objects, the sprites that are moving around, 1072 00:41:42,440 --> 00:41:44,760 the score, and so on, and then plans on that. 1073 00:41:44,760 --> 00:41:50,190 And I think that's much more like what people do. 1074 00:41:50,190 --> 00:41:51,990 Here just in the limited case of faces, 1075 00:41:51,990 --> 00:41:53,031 what are we doing, right? 1076 00:41:53,031 --> 00:41:55,860 So what we've got here is we take 1077 00:41:55,860 --> 00:41:58,080 this convolutional neural network.
1078 00:41:58,080 --> 00:42:00,570 We train it in ways that you can read about in the paper. 1079 00:42:00,570 --> 00:42:05,400 It's a very easy kind of training to basically make predictions, 1080 00:42:05,400 --> 00:42:07,260 to make guesses about all the latent 1081 00:42:07,260 --> 00:42:09,510 variables, the shape, the texture, the lighting, 1082 00:42:09,510 --> 00:42:11,160 the camera angle. 1083 00:42:11,160 --> 00:42:14,160 And then you take those guesses, and they start off 1084 00:42:14,160 --> 00:42:15,150 that Markov chain. 1085 00:42:15,150 --> 00:42:17,614 So instead of starting off at a random graphics hypothesis, 1086 00:42:17,614 --> 00:42:19,030 you start off at a pretty good one 1087 00:42:19,030 --> 00:42:20,520 and then refine it a little bit. 1088 00:42:20,520 --> 00:42:21,978 What you can see here in these blue 1089 00:42:21,978 --> 00:42:29,220 and red curves is that the blue curve is the course of inference 1090 00:42:29,220 --> 00:42:30,820 for the model I showed you before, 1091 00:42:30,820 --> 00:42:32,820 where you start off at a random guess, 1092 00:42:32,820 --> 00:42:37,380 and after, I don't know, 100 iterations of MCMC, you improve 1093 00:42:37,380 --> 00:42:38,650 and you kind of get there. 1094 00:42:38,650 --> 00:42:40,025 Whereas the red curve is what you 1095 00:42:40,025 --> 00:42:42,400 see if you start off with the guess of this recognition 1096 00:42:42,400 --> 00:42:42,900 model. 1097 00:42:42,900 --> 00:42:45,060 And you can see that you start off sort 1098 00:42:45,060 --> 00:42:47,732 of in some sense almost as good as you're ever going to get, 1099 00:42:47,732 --> 00:42:48,690 and then you refine it. 1100 00:42:48,690 --> 00:42:49,930 Well, it might look like we were just 1101 00:42:49,930 --> 00:42:51,180 refining it a little bit. 1102 00:42:51,180 --> 00:42:53,070 But this is a kind of a double log scale. 1103 00:42:53,070 --> 00:42:56,140 It's a log plot of log probability.
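[Editor's sketch] Under the same kind of toy-renderer assumption (illustrative only, not the paper's setup), the benefit of starting the chain from a recognition network's guess rather than a random one can be sketched like this:

```python
import math
import random

def render(z):
    # Toy stand-in for the graphics engine.
    return [v * v + 0.5 * v for v in z]

def score(z, observed, noise=0.1):
    # Gaussian log-likelihood of the render against the observation.
    return -sum((a - b) ** 2 for a, b in zip(render(z), observed)) / (2 * noise ** 2)

def refine(z0, observed, steps=50, step_size=0.05, seed=1):
    # A short burst of Metropolis refinement from a given starting guess.
    rng = random.Random(seed)
    z, s = list(z0), score(z0, observed)
    for _ in range(steps):
        prop = list(z)
        i = rng.randrange(len(z))
        prop[i] += rng.gauss(0, step_size)
        sp = score(prop, observed)
        if sp >= s or rng.random() < math.exp(sp - s):
            z, s = prop, sp
    return s

true_z = [0.3, -0.5, 0.8, 0.1]
observed = render(true_z)

# Pretend a trained recognition network returns a slightly-off guess
# of the latents, versus starting the chain somewhere random:
recognition_init = [v + 0.05 for v in true_z]
random_init = [0.9, 0.9, -0.9, 0.9]

s_recognition = refine(recognition_init, observed)
s_random = refine(random_init, observed)
```

With only a few refinement steps, the recognition-initialized chain ends at a much higher score; the randomly initialized one is still far from a good hypothesis.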
1104 00:42:56,140 --> 00:42:58,680 So what looks like a little bit there on the red curve 1105 00:42:58,680 --> 00:42:59,890 is actually a lot-- 1106 00:42:59,890 --> 00:43:01,470 I mean perceptually. 1107 00:43:01,470 --> 00:43:03,900 You can see it here where if you take-- on the top 1108 00:43:03,900 --> 00:43:06,360 I'm showing observed input faces. 1109 00:43:06,360 --> 00:43:07,920 On the bottom I'm showing the result 1110 00:43:07,920 --> 00:43:09,480 of this full inverse graphics thing. 1111 00:43:09,480 --> 00:43:11,063 And they should look almost identical. 1112 00:43:11,063 --> 00:43:14,534 So the full model is able to basically perfectly invert this 1113 00:43:14,534 --> 00:43:16,200 and come up with a face that really does 1114 00:43:16,200 --> 00:43:17,492 look like the one on the top. 1115 00:43:17,492 --> 00:43:19,200 The ones in the middle are the best guess 1116 00:43:19,200 --> 00:43:20,824 you get from this neural network that's 1117 00:43:20,824 --> 00:43:23,425 been trained to approximately invert the generative model. 1118 00:43:23,425 --> 00:43:25,050 And what you can see is on first glance 1119 00:43:25,050 --> 00:43:26,280 it should look pretty good. 1120 00:43:26,280 --> 00:43:28,080 But if you pay a little bit of attention, 1121 00:43:28,080 --> 00:43:29,310 you can see differences. 1122 00:43:29,310 --> 00:43:32,340 Like hopefully you can see this person is not actually 1123 00:43:32,340 --> 00:43:34,862 that person, in a way that this one much more convincingly is. 1124 00:43:34,862 --> 00:43:36,570 Or this person-- this one is pretty good, 1125 00:43:36,570 --> 00:43:37,500 but I think this one-- 1126 00:43:37,500 --> 00:43:39,210 I think it's pretty easy to say, yeah, 1127 00:43:39,210 --> 00:43:41,085 this isn't quite the same person as that one. 1128 00:43:41,085 --> 00:43:41,980 Do you guys agree? 1129 00:43:41,980 --> 00:43:44,464 We've done some experiments to verify this.
1130 00:43:44,464 --> 00:43:46,380 But hopefully they should look pretty similar, 1131 00:43:46,380 --> 00:43:49,410 and that's the point. 1132 00:43:49,410 --> 00:43:52,590 How do you combine the best of these computational paradigms? 1133 00:43:52,590 --> 00:43:54,030 How can perception more generally 1134 00:43:54,030 --> 00:43:55,500 be so rich and so fast? 1135 00:43:55,500 --> 00:43:58,150 Well, quite possibly like this. 1136 00:43:58,150 --> 00:44:01,362 It even actually might provide some insight 1137 00:44:01,362 --> 00:44:03,570 into the neural circuitry that Winrich and Doris Tsao 1138 00:44:03,570 --> 00:44:05,640 and others have mapped out. 1139 00:44:05,640 --> 00:44:07,950 We think that this recognition model that's 1140 00:44:07,950 --> 00:44:09,840 trained to invert the graphics model 1141 00:44:09,840 --> 00:44:12,090 can provide a really nice account of some of Winrich's 1142 00:44:12,090 --> 00:44:13,230 data like you saw before. 1143 00:44:13,230 --> 00:44:14,910 But I will not go into the details 1144 00:44:14,910 --> 00:44:17,700 because in maybe five to 10 minutes 1145 00:44:17,700 --> 00:44:20,430 I want to get back to physics and psychology. 1146 00:44:20,430 --> 00:44:25,061 So physics-- and there won't be any more neural networks. 1147 00:44:25,061 --> 00:44:26,310 Because that's about as much-- 1148 00:44:26,310 --> 00:44:32,769 I mean, I think we'd like to take those ways of integrating 1149 00:44:32,769 --> 00:44:34,560 the best of these approaches and apply them 1150 00:44:34,560 --> 00:44:35,610 to these more general cases. 1151 00:44:35,610 --> 00:44:36,980 But that's about as far as we can get. 1152 00:44:36,980 --> 00:44:39,240 Here what I want to just give you a taste of at least 1153 00:44:39,240 --> 00:44:41,550 is how we're using ideas just purely 1154 00:44:41,550 --> 00:44:43,590 from probabilistic programs to capture 1155 00:44:43,590 --> 00:44:45,791 more of this common sense physics and psychology. 
1156 00:44:45,791 --> 00:44:47,790 So let's say we can solve this problem by making 1157 00:44:47,790 --> 00:44:49,620 a good guess of the 3D world state 1158 00:44:49,620 --> 00:44:52,860 from the image very quickly by inverting this graphics engine. 1159 00:44:52,860 --> 00:44:54,880 Now, we can start to do some physical reasoning, 1160 00:44:54,880 --> 00:44:59,910 a la Craik's mental model in the head of the physical world, 1161 00:44:59,910 --> 00:45:02,760 where we now take a physics engine, which is-- 1162 00:45:02,760 --> 00:45:05,340 here again we're using the kind of physics engines 1163 00:45:05,340 --> 00:45:07,610 that games use-- 1164 00:45:07,610 --> 00:45:09,900 like very simple-- again, I don't have time 1165 00:45:09,900 --> 00:45:11,070 to go into the details. 1166 00:45:11,070 --> 00:45:14,730 Although Tomer has written a very nice paper with, well, 1167 00:45:14,730 --> 00:45:15,330 with himself. 1168 00:45:15,330 --> 00:45:19,620 But he's nicely put my name and Liz's on it-- 1169 00:45:19,620 --> 00:45:21,660 about sort of trying to introduce 1170 00:45:21,660 --> 00:45:23,490 some of the basic game engine concepts 1171 00:45:23,490 --> 00:45:25,170 to cognitive scientists. 1172 00:45:25,170 --> 00:45:27,766 So hopefully we'll be able to show you that soon too. 1173 00:45:27,766 --> 00:45:28,890 Or you can read about them. 1174 00:45:28,890 --> 00:45:30,973 Basically it's that these physics engines are just 1175 00:45:30,973 --> 00:45:36,090 doing again a very quick, fast, approximate implementation 1176 00:45:36,090 --> 00:45:38,632 of certain aspects of Newtonian mechanics. 1177 00:45:38,632 --> 00:45:40,340 Sufficient that if you run it a few 1178 00:45:40,340 --> 00:45:41,915 time steps with a configuration of objects 1179 00:45:41,915 --> 00:45:43,290 like that you might get something 1180 00:45:43,290 --> 00:45:45,120 like what you see over there on the right.
1181 00:45:45,120 --> 00:45:47,970 That's an example of running this approximate Newtonian 1182 00:45:47,970 --> 00:45:50,260 physics forward a few time steps. 1183 00:45:50,260 --> 00:45:52,500 Here's another sample from this model, another kind 1184 00:45:52,500 --> 00:45:54,180 of mental simulation. 1185 00:45:54,180 --> 00:45:56,790 We take a slightly different guess of the world state, 1186 00:45:56,790 --> 00:45:58,560 and we run that forward a few time steps, 1187 00:45:58,560 --> 00:46:00,794 and you see something else happens. 1188 00:46:00,794 --> 00:46:03,210 Nothing here is claimed to be accurate in the ground truth 1189 00:46:03,210 --> 00:46:03,979 way. 1190 00:46:03,979 --> 00:46:06,270 Neither one of these is exactly the right configuration 1191 00:46:06,270 --> 00:46:07,020 of blocks. 1192 00:46:07,020 --> 00:46:09,000 And you run this thing forward, and it only approximately 1193 00:46:09,000 --> 00:46:11,208 captures the way blocks really bounce off each other. 1194 00:46:11,208 --> 00:46:14,092 It's a hard problem to actually totally realistically simulate. 1195 00:46:14,092 --> 00:46:16,050 But our point is that you don't really have to. 1196 00:46:16,050 --> 00:46:18,330 You just have to make a reasonable guess 1197 00:46:18,330 --> 00:46:20,910 of the position of the blocks and a reasonable guess 1198 00:46:20,910 --> 00:46:23,250 of what's going to happen a few time steps in the future 1199 00:46:23,250 --> 00:46:25,110 to predict what you need to know in common sense, which 1200 00:46:25,110 --> 00:46:26,776 is that, wow, that's going to fall over. 1201 00:46:26,776 --> 00:46:28,410 I better do something about it. 1202 00:46:28,410 --> 00:46:30,510 And that's what our experiment taps into.
1203 00:46:30,510 --> 00:46:32,220 We give people a whole bunch of stimuli 1204 00:46:32,220 --> 00:46:34,290 like the ones I showed you and ask them, 1205 00:46:34,290 --> 00:46:35,760 on some graded scale, how likely do 1206 00:46:35,760 --> 00:46:37,320 you think it is to fall over? 1207 00:46:37,320 --> 00:46:39,510 And what you see here-- 1208 00:46:39,510 --> 00:46:43,770 this is again one of those plots that always are the same where 1209 00:46:43,770 --> 00:46:46,920 on the y-axis are the average human judgments now of-- it's 1210 00:46:46,920 --> 00:46:48,657 an estimate of how unstable the tower is. 1211 00:46:48,657 --> 00:46:50,490 It's both the probability that it will fall, 1212 00:46:50,490 --> 00:46:52,410 but also how much of the tower will fall. 1213 00:46:52,410 --> 00:46:54,270 So it's like the expected proportion 1214 00:46:54,270 --> 00:46:56,870 of the tower that's going to fall over under gravity. 1215 00:46:56,870 --> 00:46:59,110 And along the x-axis is the model prediction, 1216 00:46:59,110 --> 00:47:01,669 which is just the average of a few samples from what 1217 00:47:01,669 --> 00:47:02,210 I showed you. 1218 00:47:02,210 --> 00:47:04,170 You just take a few guesses of the world state, 1219 00:47:04,170 --> 00:47:06,210 run it forward a few time steps, count up 1220 00:47:06,210 --> 00:47:09,065 the proportion of blocks that fell, and average that. 1221 00:47:09,065 --> 00:47:10,440 And what you can see is that does 1222 00:47:10,440 --> 00:47:15,270 a really nice job of predicting people's stability intuitions. 1223 00:47:15,270 --> 00:47:17,790 I'll just point to an interesting comparison. 1224 00:47:17,790 --> 00:47:19,320 Because it does come in here. 1225 00:47:19,320 --> 00:47:20,410 Where does the probability come in 1226 00:47:20,410 --> 00:47:21,330 in these probabilistic programs? 1227 00:47:21,330 --> 00:47:23,280 Well, here's one very noticeable way.
1228 00:47:23,280 --> 00:47:25,690 So if you look down there on the lower right, 1229 00:47:25,690 --> 00:47:29,740 you'll see a smaller version of a similar plot. 1230 00:47:29,740 --> 00:47:31,640 It's plotting now the results of-- 1231 00:47:31,640 --> 00:47:34,140 it says ground truth physics, but that's a little misleading 1232 00:47:34,140 --> 00:47:34,639 maybe. 1233 00:47:34,639 --> 00:47:36,210 It's just a noiseless physics engine. 1234 00:47:36,210 --> 00:47:37,699 So we take the same physics model, 1235 00:47:37,699 --> 00:47:39,740 but we get rid of any of the state uncertainties. 1236 00:47:39,740 --> 00:47:42,810 So we tell it the true position of the blocks, 1237 00:47:42,810 --> 00:47:44,340 and we give it the true physics. 1238 00:47:44,340 --> 00:47:46,830 Whereas our probabilistic physics engine 1239 00:47:46,830 --> 00:47:49,110 allows for some uncertainty in exactly which forces 1240 00:47:49,110 --> 00:47:50,070 are doing what. 1241 00:47:50,070 --> 00:47:52,860 But here we say we're just going to model gravity, friction, 1242 00:47:52,860 --> 00:47:54,880 collisions as best we can. 1243 00:47:54,880 --> 00:47:58,320 And we're going to get the state of the blocks perfectly. 1244 00:47:58,320 --> 00:48:01,000 And because it's noiseless, you notice that-- 1245 00:48:01,000 --> 00:48:03,157 so those crosses over there are crosses 1246 00:48:03,157 --> 00:48:05,490 because they're error bars, both across people and model 1247 00:48:05,490 --> 00:48:06,047 simulations. 1248 00:48:06,047 --> 00:48:07,380 Now they're just vertical lines. 1249 00:48:07,380 --> 00:48:09,254 There are no error bars in the model simulation 1250 00:48:09,254 --> 00:48:10,600 because it's deterministic. 1251 00:48:10,600 --> 00:48:13,100 It's graded because there's the proportion of the tower that 1252 00:48:13,100 --> 00:48:13,620 falls over. 1253 00:48:13,620 --> 00:48:15,630 But what you see is the model is a lot worse.
1254 00:48:15,630 --> 00:48:17,520 It scatters much more. 1255 00:48:17,520 --> 00:48:19,620 The correlation dropped from around 0.9 1256 00:48:19,620 --> 00:48:22,890 to around 0.6 in terms of correlation of model 1257 00:48:22,890 --> 00:48:24,060 with people's judgments. 1258 00:48:24,060 --> 00:48:26,310 And you have some cases like this red dot here-- 1259 00:48:26,310 --> 00:48:28,380 that corresponds to this stimulus-- 1260 00:48:28,380 --> 00:48:30,720 which goes from being a really nice model fit to a poor one. 1261 00:48:30,720 --> 00:48:33,280 This is one which people judged to be very unstable, 1262 00:48:33,280 --> 00:48:35,370 and so does the probabilistic physics engine. 1263 00:48:35,370 --> 00:48:37,930 But actually it's not unstable at all. 1264 00:48:37,930 --> 00:48:39,330 It's actually perfectly stable. 1265 00:48:39,330 --> 00:48:41,370 The blocks are actually just perfectly balanced 1266 00:48:41,370 --> 00:48:42,180 so that it doesn't fall. 1267 00:48:42,180 --> 00:48:43,888 Although I'm sure everybody looks at that 1268 00:48:43,888 --> 00:48:45,240 and finds that hard to believe. 1269 00:48:45,240 --> 00:48:46,060 So this is nice. 1270 00:48:46,060 --> 00:48:47,670 This is a kind of physics illusion. 1271 00:48:47,670 --> 00:48:50,400 There are real world versions of this out on the beaches 1272 00:48:50,400 --> 00:48:52,260 not too far from here. 1273 00:48:52,260 --> 00:48:54,990 It's a fun thing to do to stack up objects in ways 1274 00:48:54,990 --> 00:48:57,180 that are surprisingly stable. 1275 00:48:57,180 --> 00:49:01,260 We would say it's a surprise because your intuitive physics 1276 00:49:01,260 --> 00:49:04,410 has certain irreducible noise.
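[Editor's sketch] A minimal version of the probabilistic simulation idea, with an invented toy stability rule standing in for the real physics engine and a made-up perceptual noise level:

```python
import random

def proportion_fallen(centers, width=1.0):
    # centers: horizontal centers of equal-width blocks, bottom to top.
    # Crude rule standing in for a physics rollout: the blocks above
    # level i topple if their combined center of mass overhangs block i.
    n = len(centers)
    for i in range(n - 1):
        above = centers[i + 1:]
        com = sum(above) / len(above)
        if abs(com - centers[i]) > width / 2:
            return (n - 1 - i) / n  # everything above level i falls
    return 0.0

def judged_instability(centers, n_samples=200, position_noise=0.12, seed=0):
    # The probabilistic part: average the rollout outcome over noisy
    # perceptual guesses of where the blocks actually are.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        guess = [c + rng.gauss(0, position_noise) for c in centers]
        total += proportion_fallen(guess)
    return total / n_samples

balanced = [0.0, 0.45, 0.0, 0.45]  # finely balanced: ground truth says stable
sturdy = [0.0, 0.0, 0.0, 0.0]      # comfortably stacked
```

Here `proportion_fallen(balanced)` is 0.0, the noiseless ground-truth judgment, yet averaging over noisy position guesses rates the finely balanced tower as much less stable than the comfortably stacked one, the same pattern as the illusion.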
1277 00:49:04,410 --> 00:49:09,180 What we're suggesting here is that your physical intuitions-- 1278 00:49:09,180 --> 00:49:11,040 you're always in some sense making 1279 00:49:11,040 --> 00:49:13,440 a guess that's sensitive to the uncertainty about where 1280 00:49:13,440 --> 00:49:16,470 things might be and what forces might be active on the world. 1281 00:49:16,470 --> 00:49:19,080 And it's very hard to see these as deterministic physics, 1282 00:49:19,080 --> 00:49:21,330 even when you know that that's exactly what's going on 1283 00:49:21,330 --> 00:49:23,190 and that it is stable. 1284 00:49:23,190 --> 00:49:25,290 Let me say just a little bit about planning. 1285 00:49:25,290 --> 00:49:28,170 So how might you use this kind of model 1286 00:49:28,170 --> 00:49:32,120 to build some model of this core intuitive psychology? 1287 00:49:32,120 --> 00:49:35,016 And I don't mean here all of theory of mind. 1288 00:49:35,016 --> 00:49:36,390 Next week, we'll hear a lot more. 1289 00:49:36,390 --> 00:49:37,889 Like Rebecca Saxe will be down here. 1290 00:49:37,889 --> 00:49:41,604 We'll hear a lot more about much richer kinds of reasoning 1291 00:49:41,604 --> 00:49:43,020 about other people's mental states 1292 00:49:43,020 --> 00:49:45,030 that adults and older children can do. 1293 00:49:45,030 --> 00:49:47,130 But here we're talking about, just 1294 00:49:47,130 --> 00:49:48,960 as we were talking about what I was calling 1295 00:49:48,960 --> 00:49:52,350 core intuitive physics, again inspired by Liz's work of just 1296 00:49:52,350 --> 00:49:56,010 you know what objects do right here on the table top around us 1297 00:49:56,010 --> 00:49:59,054 over short time scales, the core theory of mind, 1298 00:49:59,054 --> 00:50:01,470 something that even very young babies can do in some form, 1299 00:50:01,470 --> 00:50:03,160 or at least young children. 
1300 00:50:03,160 --> 00:50:06,680 There's controversy over exactly at what age kids are 1301 00:50:06,680 --> 00:50:07,930 able to do this sort of thing. 1302 00:50:07,930 --> 00:50:13,210 But in some form I think before language, 1303 00:50:13,210 --> 00:50:17,010 it's the kind of thing that when you're starting to learn verbs, 1304 00:50:17,010 --> 00:50:19,382 the earliest language is kind of mentalistic 1305 00:50:19,382 --> 00:50:20,590 and builds on this knowledge. 1306 00:50:20,590 --> 00:50:24,180 And take the red and blue ball chasing scene that you saw, 1307 00:50:24,180 --> 00:50:25,260 remember, from Tomer. 1308 00:50:25,260 --> 00:50:26,310 That was 13-month-olds. 1309 00:50:26,310 --> 00:50:29,850 So there's definitely some form of kind of interpretation 1310 00:50:29,850 --> 00:50:32,220 of beliefs and desires in some protoform 1311 00:50:32,220 --> 00:50:36,430 that you can see even in infants of around one year of age. 1312 00:50:36,430 --> 00:50:38,546 And it's exactly that kind of thing also. 1313 00:50:38,546 --> 00:50:40,920 Remember that, if you saw John Leonard's talk yesterday-- 1314 00:50:40,920 --> 00:50:43,410 he was the robotics guy who talked about self-driving cars 1315 00:50:43,410 --> 00:50:46,110 and how there are certain gaps in what they 1316 00:50:46,110 --> 00:50:47,700 can do despite all the publicity, 1317 00:50:47,700 --> 00:50:50,250 like they can't turn left basically 1318 00:50:50,250 --> 00:50:52,124 in an unrestricted intersection. 1319 00:50:52,124 --> 00:50:53,790 Because there's a certain kind of theory 1320 00:50:53,790 --> 00:50:56,689 of mind in street scenes when cars could be coming and people 1321 00:50:56,689 --> 00:50:58,230 could be crossing or all those things 1322 00:50:58,230 --> 00:51:00,229 about the police officers.
1323 00:51:00,229 --> 00:51:01,770 Part of why this is so exciting to me 1324 00:51:01,770 --> 00:51:04,590 and why I love that talk is because this is, I think, 1325 00:51:04,590 --> 00:51:06,840 that same common sense knowledge that if we can really 1326 00:51:06,840 --> 00:51:09,510 figure out how to capture this reasoning about beliefs 1327 00:51:09,510 --> 00:51:11,580 and desires in the limited context 1328 00:51:11,580 --> 00:51:14,250 where desires are people moving around in space around us 1329 00:51:14,250 --> 00:51:16,560 and the beliefs are who can see who 1330 00:51:16,560 --> 00:51:18,980 and who can see who can see who-- 1331 00:51:18,980 --> 00:51:21,960 in driving, the art of making eye contact with other drivers 1332 00:51:21,960 --> 00:51:24,559 or pedestrians is seeing that they can see you 1333 00:51:24,559 --> 00:51:26,100 or that they can see what you can see 1334 00:51:26,100 --> 00:51:27,930 and that they can see you seeing them. 1335 00:51:27,930 --> 00:51:29,910 It doesn't have to be super deeply recursive, 1336 00:51:29,910 --> 00:51:31,537 but it's a couple of layers deep. 1337 00:51:31,537 --> 00:51:33,370 We don't have to think about it consciously, 1338 00:51:33,370 --> 00:51:35,050 but we have to be able to do it. 1339 00:51:35,050 --> 00:51:37,059 So that's the kind of core belief-desire 1340 00:51:37,059 --> 00:51:38,100 theory of mind reasoning. 1341 00:51:38,100 --> 00:51:40,500 And here's how we've tried to capture this 1342 00:51:40,500 --> 00:51:43,110 with probabilistic programs. 1343 00:51:43,110 --> 00:51:47,080 This is work that Chris Baker started doing a few years ago. 1344 00:51:47,080 --> 00:51:50,250 And a lot of it joint with Rebecca Saxe 1345 00:51:50,250 --> 00:51:53,820 and also some of it with Julian Jara-Ettinger and some of it 1346 00:51:53,820 --> 00:51:54,510 with Tomer.
1347 00:51:54,510 --> 00:51:55,410 So there's a whole bunch of us who've 1348 00:51:55,410 --> 00:51:56,620 been working on versions of this, 1349 00:51:56,620 --> 00:51:58,411 but I'll just show you one or two examples. 1350 00:52:01,350 --> 00:52:08,730 Again, the key programs here are not graphics or physics 1351 00:52:08,730 --> 00:52:11,200 engines, but planning engines and perception engines. 1352 00:52:11,200 --> 00:52:15,090 So very simple kinds of robotics programs, 1353 00:52:15,090 --> 00:52:18,210 far too simple in this form to build 1354 00:52:18,210 --> 00:52:20,760 a self-driving car or a humanoid robot, 1355 00:52:20,760 --> 00:52:24,520 but maybe the kind of thing that in game robots like the zombie 1356 00:52:24,520 --> 00:52:26,460 or the security guard in Quake or something 1357 00:52:26,460 --> 00:52:28,720 might do something like this. 1358 00:52:28,720 --> 00:52:32,100 So planning basically just means it's a little bit more 1359 00:52:32,100 --> 00:52:35,220 than sort of shortest path planning. 1360 00:52:35,220 --> 00:52:37,890 But it's basically like find a sequence of actions 1361 00:52:37,890 --> 00:52:39,720 in a simple world like moving around 1362 00:52:39,720 --> 00:52:44,490 a 2D environment that maximizes your long run expected reward. 1363 00:52:44,490 --> 00:52:46,050 So there's a kind of utility theory, 1364 00:52:46,050 --> 00:52:49,340 or what Laura Schulz calls a naive utility calculus, here. 1365 00:52:49,340 --> 00:52:52,800 A calculation of costs and benefits where in a sense 1366 00:52:52,800 --> 00:52:55,620 you get a big reward, a good positive utility 1367 00:52:55,620 --> 00:52:58,740 for getting to your goal and a small cost for each action you 1368 00:52:58,740 --> 00:52:59,640 take. 
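The planning engine just described, find a sequence of actions that maximizes long-run expected reward, with a positive utility at the goal and a small cost per step, can be sketched as value iteration on a tiny grid. The grid layout, the wall, and the reward and cost numbers below are made-up illustrations, not the engine from the studies.

```python
# A minimal sketch of "planning as cost-benefit calculation":
# value iteration on a toy 2D grid with a goal reward and a step cost.
GRID_W, GRID_H = 5, 4
WALLS = {(2, 1), (2, 2)}           # an obstacle, like the wall in the movies
GOAL = (4, 3)
STEP_COST, GOAL_REWARD = -1.0, 10.0
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]

STATES = [(x, y) for x in range(GRID_W) for y in range(GRID_H)
          if (x, y) not in WALLS]

def step(s, a):
    """Deterministic dynamics: bump into a wall or the edge and stay put."""
    nxt = (s[0] + a[0], s[1] + a[1])
    return nxt if nxt in STATES else s

def value_iteration(iters=100):
    """Long-run value of each state: goal reward minus accumulated step costs."""
    V = {s: 0.0 for s in STATES}
    for _ in range(iters):
        for s in STATES:
            if s == GOAL:
                V[s] = GOAL_REWARD
            else:
                V[s] = STEP_COST + max(V[step(s, a)] for a in MOVES)
    return V

def plan(start, V):
    """Greedy rollout: at each state, take the action leading to the best value."""
    path, s = [start], start
    while s != GOAL and len(path) < 50:
        s = max((step(s, a) for a in MOVES), key=lambda t: V[t])
        path.append(s)
    return path

V = value_iteration()
print(plan((0, 0), V))   # an efficient path around the obstacle to the goal
```

Because each step carries a cost, the planner automatically trades off the value of the goal against the effort of getting there, which is the naive utility calculus in miniature.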
1369 00:52:59,640 --> 00:53:03,525 And under that view, then in some sense-- 1370 00:53:03,525 --> 00:53:06,600 and some actions might be more costly than others, something 1371 00:53:06,600 --> 00:53:09,690 that Tomer is looking at in infants and something 1372 00:53:09,690 --> 00:53:11,430 that Julian Jara-Ettinger has looked 1373 00:53:11,430 --> 00:53:13,890 at in older kids, this understanding of that. 1374 00:53:13,890 --> 00:53:15,832 But this sort of basic cost-benefit trade-off 1375 00:53:15,832 --> 00:53:18,984 that is going on whenever you move around an environment 1376 00:53:18,984 --> 00:53:21,150 and decide, well, is it worthwhile to go all the way 1377 00:53:21,150 --> 00:53:24,330 over there, or, well, I know I like the coffee up at Pie 1378 00:53:24,330 --> 00:53:26,850 in the Sky better than the coffee in the dining hall 1379 00:53:26,850 --> 00:53:27,540 here at Swope. 1380 00:53:27,540 --> 00:53:30,090 But to think about, am I going to be late to my lecture? 1381 00:53:30,090 --> 00:53:31,830 Am I going to be late to Nancy's lecture? 1382 00:53:31,830 --> 00:53:33,870 Those are different costs-- 1383 00:53:33,870 --> 00:53:35,130 both costs. 1384 00:53:35,130 --> 00:53:36,600 It's that kind of calculation. 1385 00:53:39,390 --> 00:53:42,310 So here let me get more concrete. 1386 00:53:42,310 --> 00:53:44,580 So here's an example of an experiment 1387 00:53:44,580 --> 00:53:46,980 that Chris did a few years ago where, again, it's 1388 00:53:46,980 --> 00:53:49,540 like what you saw with the Heider and Simmel squares 1389 00:53:49,540 --> 00:53:52,230 and triangles and circles, or the Southgate 1390 00:53:52,230 --> 00:53:54,840 and Csibra red and blue balls chasing each other. 1391 00:53:54,840 --> 00:53:56,280 Very simple stuff. 1392 00:53:56,280 --> 00:53:57,450 Here you see an agent. 1393 00:53:57,450 --> 00:53:59,890 It's like an overhead view of a room, 1394 00:53:59,890 --> 00:54:01,290 2D environment from the top.
1395 00:54:01,290 --> 00:54:03,270 The agent's moving along some path. 1396 00:54:03,270 --> 00:54:06,384 There are three possible goals, A, B, or C. 1397 00:54:06,384 --> 00:54:08,550 And then there's maybe some obstacles or constraints 1398 00:54:08,550 --> 00:54:10,505 like a wall like you saw in those movies. 1399 00:54:10,505 --> 00:54:12,630 Maybe the wall has a hole that he can pass through. 1400 00:54:12,630 --> 00:54:14,190 Maybe it doesn't. 1401 00:54:14,190 --> 00:54:16,194 And across different trials of the experiment, 1402 00:54:16,194 --> 00:54:18,360 just like in the physics stuff where we vary all the block 1403 00:54:18,360 --> 00:54:21,900 configurations and so on, here we vary where the goals are. 1404 00:54:21,900 --> 00:54:23,790 We vary whether the wall has a hole or not. 1405 00:54:23,790 --> 00:54:25,440 We vary the agent's path. 1406 00:54:25,440 --> 00:54:28,440 On different trials, we also stop it at different points. 1407 00:54:28,440 --> 00:54:30,600 Because we're trying to see, as you watch this agent 1408 00:54:30,600 --> 00:54:33,060 move around and action unfolds over time, 1409 00:54:33,060 --> 00:54:36,570 how do your guesses about his goal change over time? 1410 00:54:36,570 --> 00:54:39,060 And what you see-- 1411 00:54:39,060 --> 00:54:42,570 so these are just examples of a few of the scenes. 1412 00:54:42,570 --> 00:54:44,590 And here what you see are examples of the data. 1413 00:54:44,590 --> 00:54:47,680 Again, the y-axis is the average human judgment. 1414 00:54:47,680 --> 00:54:49,680 Red, blue, and green is color-coded to the goal. 1415 00:54:49,680 --> 00:54:51,055 They're just asked, how likely do 1416 00:54:51,055 --> 00:54:53,490 you think each of those three things is his goal? 1417 00:54:53,490 --> 00:54:55,810 And then here the x-axis is time. 1418 00:54:55,810 --> 00:54:58,830 So these are time steps that we ask at different points 1419 00:54:58,830 --> 00:54:59,952 along the trajectory.
1420 00:54:59,952 --> 00:55:01,410 And what you can see is that people 1421 00:55:01,410 --> 00:55:03,571 are making various systematic kinds of judgments. 1422 00:55:03,571 --> 00:55:05,820 Sometimes they're not sure whether his goal is A or B, 1423 00:55:05,820 --> 00:55:08,010 but they know it's not C. And then 1424 00:55:08,010 --> 00:55:10,837 after a little while or some key event happens, 1425 00:55:10,837 --> 00:55:12,670 and now they're quite sure it's A and not B. 1426 00:55:12,670 --> 00:55:14,430 Or they could change their mind. 1427 00:55:14,430 --> 00:55:18,480 Here people were pretty sure it was either green or red but not 1428 00:55:18,480 --> 00:55:19,390 blue. 1429 00:55:19,390 --> 00:55:21,150 And then there comes a point where it's surely not green, 1430 00:55:21,150 --> 00:55:22,316 but it might be blue or red. 1431 00:55:22,316 --> 00:55:23,480 Oh no, then it's red. 1432 00:55:23,480 --> 00:55:25,290 Here they were pretty sure it was green. 1433 00:55:25,290 --> 00:55:26,910 Then no, definitely not green. 1434 00:55:26,910 --> 00:55:28,250 And now, I think it's red. 1435 00:55:28,250 --> 00:55:29,850 It was probably never blue. 1436 00:55:29,850 --> 00:55:30,780 OK. 1437 00:55:30,780 --> 00:55:32,640 And the really striking thing to us 1438 00:55:32,640 --> 00:55:36,000 is how closely you can match those judgments 1439 00:55:36,000 --> 00:55:38,250 with this very simple probabilistic planning 1440 00:55:38,250 --> 00:55:39,552 program run in reverse. 1441 00:55:39,552 --> 00:55:41,510 So we take, again, this simple planning program 1442 00:55:41,510 --> 00:55:44,850 that just says, basically, get as efficiently 1443 00:55:44,850 --> 00:55:46,237 as possible to your goal. 1444 00:55:46,237 --> 00:55:47,820 I don't know what your goal is though.
1445 00:55:47,820 --> 00:55:50,370 I observe your actions that result from an efficient plan, 1446 00:55:50,370 --> 00:55:52,100 and I want to work backwards to say, 1447 00:55:52,100 --> 00:55:53,850 what do I think your goal is, your desire, 1448 00:55:53,850 --> 00:55:55,290 the rewarding state? 1449 00:55:55,290 --> 00:55:57,090 And just doing that just basically 1450 00:55:57,090 --> 00:55:58,830 perfectly predicts people's data. 1451 00:55:58,830 --> 00:56:00,990 I mean, of all the mathematical models of behavior 1452 00:56:00,990 --> 00:56:02,760 I've ever had a hand in building, 1453 00:56:02,760 --> 00:56:05,190 this one works the best. 1454 00:56:05,190 --> 00:56:06,600 It's really quite striking. 1455 00:56:06,600 --> 00:56:08,130 To me it was striking because I came 1456 00:56:08,130 --> 00:56:11,310 in thinking this would be a very high-level, weird, flaky, 1457 00:56:11,310 --> 00:56:13,230 hard-to-model thing. 1458 00:56:13,230 --> 00:56:14,760 Here's just one more example of one 1459 00:56:14,760 --> 00:56:17,350 of these things, which actually puts beliefs in there, 1460 00:56:17,350 --> 00:56:18,090 not just desires. 1461 00:56:18,090 --> 00:56:19,950 So it's a key part of intuitive psychology 1462 00:56:19,950 --> 00:56:22,670 that we do joint inference over beliefs and desires. 1463 00:56:22,670 --> 00:56:26,040 In this one here, we assume that you, the subject, 1464 00:56:26,040 --> 00:56:27,750 the agent who's moving around, all of us 1465 00:56:27,750 --> 00:56:29,716 have shared full knowledge of the world. 1466 00:56:29,716 --> 00:56:31,090 So we know where the objects are. 1467 00:56:31,090 --> 00:56:31,830 We know where the holes are. 1468 00:56:31,830 --> 00:56:33,270 There's none of this false belief, 1469 00:56:33,270 --> 00:56:36,090 like you think something is there when it isn't. 
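Running a planning program "in reverse" like this is Bayesian inverse planning: assume the agent is noisily rational, then use Bayes' rule to get a posterior over goals as the path unfolds. The sketch below is a deliberate simplification of the model described here: the goal positions and the rationality parameter are made up, and actions are scored by a one-step distance heuristic rather than a full plan around obstacles.

```python
import math

GOALS = {"A": (0, 5), "B": (5, 5), "C": (5, 0)}  # illustrative goal positions
BETA = 2.0  # rationality: higher = more reliably efficient agent

def dist(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def action_likelihood(s, s_next, goal):
    """P(move s -> s_next | goal): a softmax over the four moves,
    each scored by how close it would bring the agent to the goal."""
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    scores = [math.exp(-BETA * dist((s[0] + dx, s[1] + dy), GOALS[goal]))
              for dx, dy in moves]
    chosen = math.exp(-BETA * dist(s_next, GOALS[goal]))
    return chosen / sum(scores)

def goal_posterior(path):
    """P(goal | observed path) by Bayes' rule, uniform prior over goals."""
    post = {g: 1.0 / len(GOALS) for g in GOALS}
    for s, s_next in zip(path, path[1:]):
        for g in post:
            post[g] *= action_likelihood(s, s_next, g)
        z = sum(post.values())
        post = {g: p / z for g, p in post.items()}
    return post

# The agent starts at the bottom middle and heads straight up:
# A and B stay live hypotheses, while C is quickly ruled out.
print(goal_posterior([(2, 0), (2, 1), (2, 2)]))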
1470 00:56:36,090 --> 00:56:38,010 Now, here's some later work that Chris 1471 00:56:38,010 --> 00:56:42,575 did, what we call the food truck studies, 1472 00:56:42,575 --> 00:56:44,700 where here we add in some uncertainty about beliefs 1473 00:56:44,700 --> 00:56:46,080 in addition to desires. 1474 00:56:46,080 --> 00:56:48,330 And it's easiest just to explain with this one example 1475 00:56:48,330 --> 00:56:49,940 up there in the upper left. 1476 00:56:49,940 --> 00:56:54,390 So here, like on a lot of university campuses, 1477 00:56:54,390 --> 00:56:56,720 lunch is best found at food trucks, 1478 00:56:56,720 --> 00:56:59,540 which can park in different spots around campus. 1479 00:56:59,540 --> 00:57:02,840 Here the two yellow squares show the two parking spots 1480 00:57:02,840 --> 00:57:04,467 on this part of campus. 1481 00:57:04,467 --> 00:57:06,050 And there are several different trucks 1482 00:57:06,050 --> 00:57:07,490 that can come and park in different places 1483 00:57:07,490 --> 00:57:08,240 on different days. 1484 00:57:08,240 --> 00:57:09,350 There's a Korean truck. 1485 00:57:09,350 --> 00:57:10,350 That's K. 1486 00:57:10,350 --> 00:57:11,600 There's a Lebanese truck. 1487 00:57:11,600 --> 00:57:12,522 That's L. 1488 00:57:12,522 --> 00:57:14,480 There's also other trucks, like a Mexican truck. 1489 00:57:14,480 --> 00:57:15,900 But there's only two spots. 1490 00:57:15,900 --> 00:57:17,780 So if the Korean one parks there and the Lebanese one 1491 00:57:17,780 --> 00:57:19,821 parks there, the Mexican has to go somewhere else 1492 00:57:19,821 --> 00:57:21,470 or can't come there today. 1493 00:57:21,470 --> 00:57:24,150 And on some days the trucks park in different places. 1494 00:57:24,150 --> 00:57:26,290 Or a spot could also be unoccupied. 1495 00:57:26,290 --> 00:57:28,070 The trucks could be elsewhere. 1496 00:57:28,070 --> 00:57:29,870 So look at what happens on this day.
1497 00:57:29,870 --> 00:57:33,830 Our friendly grad student, Harold, 1498 00:57:33,830 --> 00:57:35,550 comes out from his office here. 1499 00:57:35,550 --> 00:57:38,000 And importantly, the way we model interesting notions 1500 00:57:38,000 --> 00:57:39,874 of evolving belief is that now we've 1501 00:57:39,874 --> 00:57:41,790 got that perception and inference arrow there. 1502 00:57:41,790 --> 00:57:43,550 So Harold forms his belief about what's 1503 00:57:43,550 --> 00:57:44,870 where based on what he can see. 1504 00:57:44,870 --> 00:57:47,420 And it's just the simplest perception model, just 1505 00:57:47,420 --> 00:57:48,710 line-of-sight access. 1506 00:57:48,710 --> 00:57:51,470 We assume he can kind of see anything that's unobstructed 1507 00:57:51,470 --> 00:57:52,490 in his line of sight. 1508 00:57:52,490 --> 00:57:56,270 So that means that when he comes out here, 1509 00:57:56,270 --> 00:57:59,389 he can see that there is the Korean truck here. 1510 00:57:59,389 --> 00:58:01,430 But he can't see-- this is a wall or a building. 1511 00:58:01,430 --> 00:58:03,722 He can't see what's on the other side of that. 1512 00:58:03,722 --> 00:58:04,680 OK, so what does he do? 1513 00:58:04,680 --> 00:58:05,721 Well, he walks down here. 1514 00:58:05,721 --> 00:58:07,471 He goes past the Korean truck, goes around 1515 00:58:07,471 --> 00:58:08,762 the other side of the building. 1516 00:58:08,762 --> 00:58:10,340 Now at this point, his line of sight 1517 00:58:10,340 --> 00:58:11,715 gives him-- he can see that there 1518 00:58:11,715 --> 00:58:13,280 is a Lebanese truck there. 1519 00:58:13,280 --> 00:58:15,920 He turns around, and he goes back to the Korean truck. 1520 00:58:15,920 --> 00:58:19,310 So the question for you is, what is his favorite truck? 1521 00:58:19,310 --> 00:58:21,702 Is it Korean, Lebanese, or Mexican? 1522 00:58:21,702 --> 00:58:22,646 AUDIENCE: Mexican.
1523 00:58:22,646 --> 00:58:23,510 PROFESSOR: Mexican, yeah, it doesn't sound 1524 00:58:23,510 --> 00:58:24,950 very hard to figure that out. 1525 00:58:24,950 --> 00:58:27,500 But it's quite interesting because the Mexican one 1526 00:58:27,500 --> 00:58:30,080 isn't even in the scene. 1527 00:58:30,080 --> 00:58:33,660 The most basic kind of goal recognition-- and this, 1528 00:58:33,660 --> 00:58:35,660 again, cuts right to the heart of the difference 1529 00:58:35,660 --> 00:58:37,681 between recognition and explanation. 1530 00:58:37,681 --> 00:58:39,680 There's been a lot of progress in machine vision 1531 00:58:39,680 --> 00:58:42,290 systems for action understanding, action 1532 00:58:42,290 --> 00:58:43,880 recognition, and so on. 1533 00:58:43,880 --> 00:58:48,110 And they do things like, for example, they take video. 1534 00:58:48,110 --> 00:58:50,699 And the best cue that somebody wants something 1535 00:58:50,699 --> 00:58:52,490 is if they reach for it or move towards it. 1536 00:58:52,490 --> 00:58:54,740 And that's certainly what was going on here. 1537 00:58:54,740 --> 00:58:57,800 In all of these scenes, your best inference 1538 00:58:57,800 --> 00:59:00,426 about what the guy's goal is is which 1539 00:59:00,426 --> 00:59:01,550 thing is he moving towards. 1540 00:59:01,550 --> 00:59:03,650 And it's just subtle to parse out 1541 00:59:03,650 --> 00:59:05,150 the relative degrees of confidence 1542 00:59:05,150 --> 00:59:08,550 when there's a complex environment with constraints. 1543 00:59:08,550 --> 00:59:10,400 But in every case, by the end it's 1544 00:59:10,400 --> 00:59:12,350 clear he's going for one thing, and the thing 1545 00:59:12,350 --> 00:59:14,660 he is moving towards is the thing he wants. 1546 00:59:14,660 --> 00:59:16,880 But here you have no trouble realizing 1547 00:59:16,880 --> 00:59:18,680 that his goal is something that isn't 1548 00:59:18,680 --> 00:59:20,095 even present in the scene. 
1549 00:59:20,095 --> 00:59:21,470 Yet he's still moving towards it. 1550 00:59:21,470 --> 00:59:23,990 In a sense, he's moving towards his mental representation 1551 00:59:23,990 --> 00:59:25,190 of it. 1552 00:59:25,190 --> 00:59:30,300 He's moving towards the Mexican truck in his mind's model. 1553 00:59:30,300 --> 00:59:33,230 And that's him explaining the data he sees. 1554 00:59:33,230 --> 00:59:34,882 For some reason, he must have had 1555 00:59:34,882 --> 00:59:37,340 maybe a prior belief that the Mexican truck would be there. 1556 00:59:37,340 --> 00:59:39,140 So he formed a plan to go there. 1557 00:59:39,140 --> 00:59:41,180 And in fact, we can ask people not only 1558 00:59:41,180 --> 00:59:43,474 which truck does he like-- it's the Mexican truck. 1559 00:59:43,474 --> 00:59:45,390 That's what people say, and here is the model. 1560 00:59:45,390 --> 00:59:47,420 But we also asked them a belief inference. 1561 00:59:47,420 --> 00:59:50,060 We say, prior to setting out, what 1562 00:59:50,060 --> 00:59:52,089 did Harold think was on the other side? 1563 00:59:52,089 --> 00:59:54,380 What was parked in the other spot that he couldn't see? 1564 00:59:54,380 --> 00:59:57,050 Did he think it was Lebanese, Mexican, or neither? 1565 00:59:57,050 --> 00:59:58,300 And we ask a degree of belief. 1566 00:59:58,300 --> 00:59:59,692 So you could say he had no idea. 1567 00:59:59,692 --> 01:00:01,400 But interestingly, people say he probably 1568 01:00:01,400 --> 01:00:02,358 thought it was Mexican. 1569 01:00:02,358 --> 01:00:05,510 Because how else could you explain what he's doing? 1570 01:00:05,510 --> 01:00:09,440 So I mean, if I had to point to just one example of cognition 1571 01:00:09,440 --> 01:00:11,090 as explanation, it's this.
1572 01:00:11,090 --> 01:00:14,630 The only sensible way, and it's a very intuitive and compelling 1573 01:00:14,630 --> 01:00:17,960 way, to explain why did he go the way he did 1574 01:00:17,960 --> 01:00:19,830 and then turn around just when he did 1575 01:00:19,830 --> 01:00:23,360 and wind up just where he did, is this set of inferences 1576 01:00:23,360 --> 01:00:24,200 basically. 1577 01:00:24,200 --> 01:00:26,480 That his favorite is Mexican, his second favorite 1578 01:00:26,480 --> 01:00:28,460 is Korean-- that's also important-- his least 1579 01:00:28,460 --> 01:00:29,960 favorite is Lebanese. 1580 01:00:29,960 --> 01:00:32,480 And he thought that Mexican was there, 1581 01:00:32,480 --> 01:00:34,730 which is why it was worthwhile to go and check. 1582 01:00:34,730 --> 01:00:36,710 At least, he thought it was likely. 1583 01:00:36,710 --> 01:00:37,850 He wasn't sure, right? 1584 01:00:37,850 --> 01:00:39,010 Notice it's not very high. 1585 01:00:39,010 --> 01:00:41,630 But it's more likely than the other possibilities. 1586 01:00:41,630 --> 01:00:44,090 Because, of course, if he was quite sure it was Lebanese, 1587 01:00:44,090 --> 01:00:45,560 well, he wouldn't have bothered to go around there. 1588 01:00:45,560 --> 01:00:46,610 And in fact, you do see that. 1589 01:00:46,610 --> 01:00:47,330 So you have ones-- 1590 01:00:47,330 --> 01:00:48,621 I guess I don't have them here. 1591 01:00:48,621 --> 01:00:50,960 But there are scenes where he just goes straight here. 1592 01:00:50,960 --> 01:00:53,480 And then that's consistent with him thinking possibly it 1593 01:00:53,480 --> 01:00:54,320 was Lebanese. 1594 01:00:54,320 --> 01:00:55,820 And if he thought nothing was there, 1595 01:00:55,820 --> 01:00:58,340 well, again, he wouldn't have gone to check. 1596 01:00:58,340 --> 01:01:00,770 And again, this model is extremely quantitatively 1597 01:01:00,770 --> 01:01:05,510 predictive of people's judgments about both desires and beliefs.
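That set of inferences can be made concrete with a toy joint belief-desire inference for the food-truck scenario. This is illustrative only, not the paper's actual model: the hypothesis space is just preference rankings over the three trucks plus a guess about the hidden spot, and the 0.9/0.1 "rationality" numbers are made up.

```python
from itertools import permutations

TRUCKS = ["K", "L", "M"]     # Korean, Lebanese, Mexican
BELIEFS = ["L", "M", None]   # what he thought was in the hidden spot

def likelihood(ranking, belief):
    """P(observed behavior | hypothesis). Observed: he walked past the
    visible Korean truck to check the hidden spot, saw the Lebanese
    truck there, then came back and ate Korean."""
    prefers = lambda a, b: ranking.index(a) < ranking.index(b)
    worth_checking = belief is not None and prefers(belief, "K")
    returns_to_k = prefers("K", "L")
    p_check = 0.9 if worth_checking else 0.1    # rationality knobs (made up)
    p_return = 0.9 if returns_to_k else 0.1
    return p_check * p_return

# Uniform prior over all (ranking, belief) hypotheses; score each by Bayes.
joint = {(r, b): likelihood(r, b)
         for r in permutations(TRUCKS) for b in BELIEFS}
z = sum(joint.values())
favorite = {t: sum(p for (r, b), p in joint.items() if r[0] == t) / z
            for t in TRUCKS}
belief = {b: sum(p for (r, bb), p in joint.items() if bb == b) / z
          for b in BELIEFS}
print(favorite)   # Mexican comes out on top, though it never appears
print(belief)     # and he probably believed Mexican was parked there
```

Even this crude version reproduces the qualitative pattern: the posterior favors "favorite is Mexican, and he believed Mexican was there," but not with certainty, matching the observation that people's belief judgment is high yet not very high.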
1598 01:01:05,510 --> 01:01:08,195 You can read in some of Battaglia's papers 1599 01:01:08,195 --> 01:01:10,320 ways in which you take the very same physics engine 1600 01:01:10,320 --> 01:01:12,650 and use it for all these different tasks, including 1601 01:01:12,650 --> 01:01:14,750 sort of slightly weird ones like these tasks. 1602 01:01:14,750 --> 01:01:16,220 If you bump the table, are you more 1603 01:01:16,220 --> 01:01:19,010 likely to knock off red blocks or yellow blocks? 1604 01:01:19,010 --> 01:01:22,430 Not a task you ever got any end-to-end training on, right? 1605 01:01:22,430 --> 01:01:24,980 But an example of the compositionality 1606 01:01:24,980 --> 01:01:26,810 of your model and your task. 1607 01:01:26,810 --> 01:01:28,310 Somebody asked me this during lunch, 1608 01:01:28,310 --> 01:01:31,340 and I think it is a key point to make about compositionality. 1609 01:01:31,340 --> 01:01:34,310 One of the key ways in which compositionality 1610 01:01:34,310 --> 01:01:37,004 works in this view of the mind, as opposed to the pattern 1611 01:01:37,004 --> 01:01:38,420 recognition view or the way, let's 1612 01:01:38,420 --> 01:01:40,130 say, like a DeepQ network works-- 1613 01:01:40,130 --> 01:01:42,410 AUDIENCE: You mean the [INAUDIBLE].. 1614 01:01:42,410 --> 01:01:45,680 PROFESSOR: Just ways of getting a very flexible repertoire 1615 01:01:45,680 --> 01:01:48,230 of inferences from composing pieces without having 1616 01:01:48,230 --> 01:01:50,840 to train specifically for it. 1617 01:01:50,840 --> 01:01:52,930 It's that if you have a physics engine, 1618 01:01:52,930 --> 01:01:55,209 you can simulate the physical world. 1619 01:01:55,209 --> 01:01:57,250 You can answer questions that you've never gotten 1620 01:01:57,250 --> 01:01:58,564 any training at all to solve. 
1621 01:01:58,564 --> 01:01:59,980 So in this experiment here, we ask 1622 01:01:59,980 --> 01:02:01,879 people, if you bump the table hard enough 1623 01:02:01,879 --> 01:02:03,670 to knock some of the blocks onto the floor, 1624 01:02:03,670 --> 01:02:05,544 is it more likely to be red or yellow blocks? 1625 01:02:05,544 --> 01:02:07,630 Unlike questions of will this tower 1626 01:02:07,630 --> 01:02:09,400 fall over-- we've made a lot of judgments of that sort. 1627 01:02:09,400 --> 01:02:11,358 You've never made that kind of judgment before. 1628 01:02:11,358 --> 01:02:12,550 It's a slightly weird one. 1629 01:02:12,550 --> 01:02:13,840 But you have no trouble making it. 1630 01:02:13,840 --> 01:02:15,760 And for many different configurations of blocks, 1631 01:02:15,760 --> 01:02:16,990 you make various graded judgments, 1632 01:02:16,990 --> 01:02:19,330 and the model captures it perfectly with no extra stuff 1633 01:02:19,330 --> 01:02:20,080 put in. 1634 01:02:20,080 --> 01:02:22,030 You just take the same model, 1635 01:02:22,030 --> 01:02:23,830 and you ask it a different question. 1636 01:02:23,830 --> 01:02:26,410 So if our dream is to build AI systems that 1637 01:02:26,410 --> 01:02:28,210 can answer questions, for example, which 1638 01:02:28,210 --> 01:02:30,040 a lot of people's dream is, I think 1639 01:02:30,040 --> 01:02:32,170 there's really no compelling alternative 1640 01:02:32,170 --> 01:02:33,160 to something like this. 1641 01:02:33,160 --> 01:02:35,534 That you build a model that you can ask all the questions 1642 01:02:35,534 --> 01:02:36,920 of that you'd want to ask. 1643 01:02:36,920 --> 01:02:39,700 And in this limited domain, again, it's just our Atari.
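The "same model, different question" point can be sketched as one generative simulation that two different queries are composed with, with no retraining in between. This is a toy stand-in for Battaglia's engine: the table edges, block positions, colors, and the bump-as-random-kick dynamics are all made up for illustration.

```python
import random

random.seed(0)
TABLE = (0.0, 10.0)                                  # table edges (made up)
BLOCKS = {"red": [1.0, 4.0], "yellow": [6.0, 9.5]}   # block x-positions (made up)

def simulate_bump(strength=1.5):
    """One noisy simulation of bumping the table: every block gets a
    random horizontal kick; it falls off if it ends up past an edge."""
    fallen = []
    for color, xs in BLOCKS.items():
        for x in xs:
            x2 = x + random.gauss(0.0, strength)
            if not TABLE[0] <= x2 <= TABLE[1]:
                fallen.append(color)
    return fallen

def query(question, n=5000):
    """Ask a new question of the same model by re-scoring its samples."""
    return sum(question(simulate_bump()) for _ in range(n)) / n

# Two different questions, one unchanged generative model:
p_any = query(lambda fallen: len(fallen) > 0)
p_red_vs_yellow = query(lambda fallen:
                        fallen.count("red") > fallen.count("yellow"))
print(p_any, p_red_vs_yellow)
```

The second question, red versus yellow, is the "slightly weird" one nobody trained for; it falls out of composing a new scoring function with the same simulations, which is the compositionality claim in miniature.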
1644 01:02:39,700 --> 01:02:41,290 In this limited domain of reasoning 1645 01:02:41,290 --> 01:02:43,880 about the physics of blocks, it's really pretty cool 1646 01:02:43,880 --> 01:02:45,490 what this physics engine is able to do 1647 01:02:45,490 --> 01:02:47,079 with many kinds of questions. 1648 01:02:47,079 --> 01:02:49,120 It can reason about things with different masses. 1649 01:02:49,120 --> 01:02:50,920 It can make guesses about the masses. 1650 01:02:50,920 --> 01:02:53,570 You can make some of the objects bigger or smaller. 1651 01:02:53,570 --> 01:02:56,806 You can attach constraints like fences to the table. 1652 01:02:56,806 --> 01:02:58,930 And the same model, without any fundamental change, 1653 01:02:58,930 --> 01:03:00,130 can answer all these questions. 1654 01:03:00,130 --> 01:03:01,588 So it doesn't have to be retrained. 1655 01:03:01,588 --> 01:03:03,490 Because there's basically no training. 1656 01:03:03,490 --> 01:03:05,080 It's just reasoning. 1657 01:03:05,080 --> 01:03:07,399 If we want to understand how learning works, 1658 01:03:07,399 --> 01:03:09,190 we first have to understand what's learned. 1659 01:03:09,190 --> 01:03:10,960 I think right now, we're only at the point 1660 01:03:10,960 --> 01:03:12,418 where we're starting to really have 1661 01:03:12,418 --> 01:03:15,950 a sense of what are these mental models of the physical world 1662 01:03:15,950 --> 01:03:17,440 and intentional action-- 1663 01:03:17,440 --> 01:03:20,400 these probabilistic programs that even young children 1664 01:03:20,400 --> 01:03:23,415 are using to reason about the world. 1665 01:03:23,415 --> 01:03:24,790 And then it's a separate question 1666 01:03:24,790 --> 01:03:27,550 how those are built up through some combination 1667 01:03:27,550 --> 01:03:34,090 of scientific discovery sorts of processes and evolution. 1668 01:03:34,090 --> 01:03:36,265 So here's the story, and I've told most 1669 01:03:36,265 --> 01:03:37,390 of what I want to tell you.
1670 01:03:37,390 --> 01:03:40,836 But the rest you'll get to hear-- 1671 01:03:40,836 --> 01:03:42,460 some of it you'll get to hear next week 1672 01:03:42,460 --> 01:03:45,190 from both our developmental colleagues 1673 01:03:45,190 --> 01:03:46,480 and from me and Tomer. 1674 01:03:46,480 --> 01:03:47,897 More on the computational side. 1675 01:03:47,897 --> 01:03:49,480 But actually the most interesting part 1676 01:03:49,480 --> 01:03:50,644 we just don't know yet. 1677 01:03:50,644 --> 01:03:52,810 So we hope you will actually write that next chapter 1678 01:03:52,810 --> 01:03:53,800 of this story. 1679 01:03:53,800 --> 01:03:58,630 But here's the outline of where we currently see things. 1680 01:03:58,630 --> 01:04:02,080 We think that we have a good target for what is really 1681 01:04:02,080 --> 01:04:04,751 the core of human intelligence, what makes us so smart, in terms 1682 01:04:04,751 --> 01:04:06,250 of these ideas of both what we start 1683 01:04:06,250 --> 01:04:09,790 with, this common sense core physics and psychology, 1684 01:04:09,790 --> 01:04:12,590 and how those things grow. 1685 01:04:12,590 --> 01:04:16,390 What are the learning mechanisms that I've just gestured at? 1686 01:04:16,390 --> 01:04:20,260 Again, more next week on the sort of science-like mechanisms 1687 01:04:20,260 --> 01:04:23,740 of hypothesis formation, experiment testing, play, 1688 01:04:23,740 --> 01:04:27,280 and exploration that you can use to build these intuitive theories, 1689 01:04:27,280 --> 01:04:30,182 much like scientists build their scientific theories.
1690 01:04:30,182 --> 01:04:32,140 And that we're starting on the engineering side 1691 01:04:32,140 --> 01:04:34,780 to have tools to capture this, both to capture the knowledge 1692 01:04:34,780 --> 01:04:36,390 and how it might grow through the use 1693 01:04:36,390 --> 01:04:38,800 of probabilistic programs and things 1694 01:04:38,800 --> 01:04:41,410 that sometimes go by the name of program induction or program 1695 01:04:41,410 --> 01:04:42,340 synthesis. 1696 01:04:42,340 --> 01:04:44,170 Or if you like hierarchical Bayes 1697 01:04:44,170 --> 01:04:46,720 on programs that generate other programs where 1698 01:04:46,720 --> 01:04:50,380 the search for a good program is like the inference of a program 1699 01:04:50,380 --> 01:04:53,170 that best explains the data as generated from a prior that's 1700 01:04:53,170 --> 01:04:54,820 a higher level program. 1701 01:04:54,820 --> 01:04:57,160 If you go to the tutorial from Tomer 1702 01:04:57,160 --> 01:04:58,684 you'll actually see building blocks. 1703 01:04:58,684 --> 01:05:00,100 You can write Church programs that 1704 01:05:00,100 --> 01:05:01,766 will do something like that, and we will 1705 01:05:01,766 --> 01:05:03,080 see more of that next time. 1706 01:05:03,080 --> 01:05:06,030 But the key is that we have a language now 1707 01:05:06,030 --> 01:05:08,740 which keeps building the different ingredients that we 1708 01:05:08,740 --> 01:05:09,340 think we need. 1709 01:05:09,340 --> 01:05:12,030 On the one hand, we've gone from thinking that we need something 1710 01:05:12,030 --> 01:05:14,530 like probabilistic generative models, which many people will 1711 01:05:14,530 --> 01:05:16,197 agree with, to recognizing that not only 1712 01:05:16,197 --> 01:05:17,863 do they have to be generative, they have 1713 01:05:17,863 --> 01:05:19,180 to be causal and compositional. 
1714 01:05:19,180 --> 01:05:21,670 And they have to have this fine-grained compositional 1715 01:05:21,670 --> 01:05:24,010 structure needed to capture the real stuff of the world. 1716 01:05:24,010 --> 01:05:26,440 Not graphs, but something more like equations 1717 01:05:26,440 --> 01:05:30,100 that capture graphics or physics or planning. 1718 01:05:30,100 --> 01:05:31,360 Of course, that's not all. 1719 01:05:31,360 --> 01:05:33,220 I mean, as I tried to gesture at, 1720 01:05:33,220 --> 01:05:36,970 we need also ways to make these things work very, very quickly. 1721 01:05:36,970 --> 01:05:39,460 There might be a place in this picture for something 1722 01:05:39,460 --> 01:05:40,930 like neural networks or some kind 1723 01:05:40,930 --> 01:05:44,440 of alternative or complementary approach based 1724 01:05:44,440 --> 01:05:46,060 on pattern recognition. 1725 01:05:46,060 --> 01:05:47,710 But these are just a number of the ways 1726 01:05:47,710 --> 01:05:50,260 in which I think we need to think about going forward. 1727 01:05:50,260 --> 01:05:53,710 We need to take the idea of both the brain as a pattern recognition 1728 01:05:53,710 --> 01:05:56,680 engine seriously and the idea of the brain as a modeling 1729 01:05:56,680 --> 01:05:58,425 or explanation engine seriously. 1730 01:05:58,425 --> 01:05:59,800 We're excited because we now have 1731 01:05:59,800 --> 01:06:01,852 tools to model modeling engines and maybe 1732 01:06:01,852 --> 01:06:04,060 to model how pattern recognition engines and modeling 1733 01:06:04,060 --> 01:06:05,410 engines might interact. 1734 01:06:05,410 --> 01:06:08,620 But really, again, the great challenges 1735 01:06:08,620 --> 01:06:10,960 here are really very much in our future. 1736 01:06:10,960 --> 01:06:13,240 Not the unforeseeable future, but the foreseeable one. 1737 01:06:13,240 --> 01:06:14,890 So help us work on it. 1738 01:06:14,890 --> 01:06:16,440 Thanks.