The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JOSH TENENBAUM: I'm going to be talking about computational cognitive science. In the brains, minds, and machines landscape, this is connecting the minds and the machines part. And I really want to try to emphasize both some conceptual themes and some technical themes that are complementary to a lot of what you've seen for the first week or so of the class. That's going to include ideas of generative models and ideas of probabilistic programs, which we'll see a little bit here and a lot more in the tutorial in the afternoon. And on the cognitive side, maybe we could sum it up by calling it common sense.

Since this is meant to be a broad introduction-- and I'm going to try to cover from some very basic, fundamental things that people in this field were doing maybe 10 or 15 years ago up until the state-of-the-art current research-- I want to try to give that whole broad sweep. And I also want to try to give a bit of a philosophical introduction at the beginning, to set this in context with the other things you're seeing in the summer school.

I think it's fair to say that there are two different notions of intelligence that are both important and are both interesting to members of this center in the summer school. The two different notions are what I think you could call classifying, recognizing patterns in data, and what you could call explaining, understanding, modeling the world. So, again, there's the notion of classification, pattern recognition, finding patterns in data, and maybe patterns that connect data to some task you're trying to solve. And then there's this idea of intelligence as explaining, understanding, building a model of the world that you can use to plan on and solve problems with.
I'm going to emphasize here notions of explanation, because I think they are absolutely central to intelligence, certainly in any sense that we mean when we talk about humans, and because they get kind of underemphasized in a lot of recent work in machine learning, AI, neural networks, and so on. Most of the techniques that you've seen so far in other parts of the class, and will continue to see, I think it's fair to say, sort of fall under the broad idea of trying to classify and recognize patterns in data. And there's good reason why there's been a lot of attention on these recently, particularly coming from the more brain side: because it's much easier, when you go and look in the brain, to understand how neural circuits do things like classifying and recognizing patterns. And it's also, I think, at least with certain kinds of current technology, much easier to get machines to do this, right? All the excitement in deep neural networks is all about this, right?

But what I want to try to convince you of here, and illustrate with a lot of different kinds of examples, is how both of these kinds of approaches are probably necessary, essential, to understanding the mind. I won't really bother to try to convince you that the pattern recognition approach is essential, because I take that for granted. But both are essential, and, also, they essentially need each other. I'll try to illustrate a couple of ways in which they really each solve the problems that the other one needs solved-- so ways in which ideas like deep neural networks for doing really fast pattern recognition can help to make the sort of explaining, understanding view of intelligence much quicker and maybe much lower energy, but also ways in which the sort of explaining, understanding view of intelligence can make the pattern recognition view much richer, much more flexible.

So what do we really mean? What's the difference between classification and explanation? Or what makes a good explanation?
So we're talking about intelligence as trying to explain your experience in the world-- basically, to build a model that is, in some sense, a kind of actionable causal model. And there are a bunch of virtues here, these bullet points under explanation-- a bunch of things we could say about what makes a good explanation of the world, or a good model. I won't say too much abstractly; I'll mostly try to illustrate this over the morning. But like any kind of model, whether it's the more pattern recognition, classification style or these more explanatory-type models, ideas of compactness and unification are important, right? You want to explain a lot with a little. OK? There's a term, if anybody has read David Deutsch's book The Beginning of Infinity-- he talks about this view, in a certain form, of good explanations as being hard to vary, non-arbitrary. OK. That's sort of in common with any way of describing or explaining the world.

But some key features of the models we're going to talk about-- one is that they're generative. What we mean by generative is that they generate the world, right? In some sense, their output is the world, your experience. They're trying to explain the stuff you observe by positing some hidden, unobservable, but really important, causal, actionable, deep stuff. They don't model a task. That's really important. Because if you're used to something like, you know, end-to-end training of a deep neural network for classification, where there's an objective function and a task, and the task is to map from things you experience and observe in the world to how you should behave, that's sort of the opposite view, right? These are things whose output is not behavior on a task, but whose output is the world you see. Because what they're trying to do is produce or generate explanations. And that means they have to come into contact.
They have to basically explain the stuff you see. OK. Now, these models are not just generative in this sense, but they're causal. And, again, I'm using these terms intuitively; I'll get more precise later on. But what I mean by that is the hidden or latent variables that generate the stuff you observe are, in some form, trying to get at the actual causal mechanisms in the world-- the things that, if you were then to go act on the world, you could intervene on and move around and succeed in changing the world the way you want. Because that's the point of having one of these rich models: so that you can use it to act intelligently, right?

And, again, this is a contrast with an approach that's trying to find and classify patterns that are useful for performing some particular task: oh, when I see this, I should do this; when I see that, I should do that, right? That's good for one task. But these are meant to be good for an endless array of tasks. Not any task, but, in some important sense, a kind of unbounded set of tasks, where, given a goal-- which is different from your model of the world; you have your goal, you have your model of the world-- you use that model to plan some sequence of actions to achieve your goal. And if you change the goal, you get a different plan. But the model is the invariant, right? And it's invariant because it captures what's really going on causally.

And then maybe the most important, but hardest to really get a handle on, theme-- although, again, we'll try to do this by the end of today-- is that they're compositional in some way. They consist of parts which have independent meaning, or which have some notion of meaning, and then ways of hooking those together to form larger wholes.
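To make "generative," "causal," and "compositional" a little more concrete before going on, here is a minimal sketch in the spirit of the probabilistic programs we'll look at in the afternoon tutorial. It is not code from the lecture: the toy scene model, the bump-shaped parts, and all the numbers are invented for illustration. The latent scene is a composition of parts, the render function plays the role of the causal process that produces what you observe, and inference runs the program many times to find latent scenes that explain the data.

```python
import math
import random

# A toy generative model written as an ordinary program: latent causes
# (how many bump-shaped parts, and where they are) produce an observed
# 1D "image". Illustrative sketch only; every modeling choice is made up.

def sample_scene():
    # Compositional latent structure: a scene is just a list of parts.
    n_parts = random.randint(1, 3)
    return sorted(random.uniform(0, 10) for _ in range(n_parts))

def render(scene, xs):
    # Causal/forward model: the parts combine (here, by summing) to
    # produce the signal you actually get to observe.
    return [sum(math.exp(-(x - b) ** 2) for b in scene) for x in xs]

def likelihood(observed, predicted, noise=0.1):
    sq_err = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.exp(-sq_err / (2 * noise ** 2))

def infer_num_parts(observed, xs, n_samples=20000):
    # Inference by weighted sampling ("analysis by synthesis"): guess
    # latent scenes, keep the ones whose rendering explains the data,
    # and read off a posterior over how many parts there are.
    weights = {}
    for _ in range(n_samples):
        scene = sample_scene()
        w = likelihood(observed, render(scene, xs))
        weights[len(scene)] = weights.get(len(scene), 0.0) + w
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()} if total else {}

xs = [i * 0.5 for i in range(21)]
observed = render([2.0, 7.5], xs)      # data secretly generated by two parts
print(infer_num_parts(observed, xs))   # posterior mass should land on 2
```

The same forward program that generates the data is what gets inverted at inference time, which is the sense in which the model's output is the world rather than a task.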
And that compositionality gives a kind of flexibility or extensibility that is fundamental, important to intelligence-- the ability not just to, say, learn from little data, but to be able to take what you've learned in some tasks and use it instantly, immediately, on tasks you've never had any training for. It's, I think, really only with this kind of model-building view of intelligence that you can do that.

I'll give one other motivating example-- just because it will appear in different forms throughout the talk-- of the difference between classification and explanation as ways of thinking about the world, thinking about, in particular, planets and the orbits of objects in the solar system. That could include objects, basically, on any one planet, like ours. But think about the problem of describing the motions of the planets around the sun. Well, there are some phenomena. You can make observations. You could observe them in various ways. Go back to the early stages of modern science, when the data by which the phenomena were represented were, you know, things like just measurements of those light spots in the sky, over nights, over years.

So here are two ways to capture the regularities in the data. You could think about Kepler's laws or Newton's laws. Just to remind you, these are Kepler's laws, and these are Newton's laws. I won't really go through the details; probably all of you know these or have some familiarity. The key thing is that Kepler's laws are laws about patterns of motion in space and time. They specify the shape of the orbits, the shape of the path that the planets trace out in the solar system-- not in the sky, but in the actual 3D world-- the idea that the orbits of the planets are ellipses with the sun at one focus.
And then they give some other mathematical regularities that describe, in a sense, how fast the planets go around the sun as a function of the size of the orbit, and the fact that they go faster at some places and slower at other places in the orbit, right? OK. But in a very important sense, they don't explain why the planets do these things, right? These are patterns which, if I were to give you a set of data, a path, and I said, is this a possible planet or not-- maybe there's an undiscovered planet, and this is possibly that, or maybe this is some other thing like a comet-- you could use to classify and say, yeah, that's a planet, not a comet, right? And, you know, you could use them to predict, right? If you've observed a planet over some period of time in the sky, then you could use Kepler's laws to basically fit an ellipse and figure out where it's going to be later on. That's great. But they don't explain.

In contrast, Newton's laws work like this. Again, there are several different kinds of laws. There are, classically, Newton's laws of motion. These ideas about inertia, and F equals ma, and every action produces an equal and opposite reaction, again, don't say anything about planets. But they really say everything about force. They talk about how forces work and how forces interact and combine and compose-- compositional-- to produce motion or, in particular, to produce the change of motion: that's acceleration, or the second derivative of position. And then there's this other law, the law of gravitational force-- universal gravitation-- which specifies how you get one particular force, the force we call gravity, as a function of the masses of the two bodies, the squared distance between them, and some unknown constant, right? And the idea is you put these things together and you get Kepler's laws.
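To make "put these things together and you get Kepler's laws" concrete, here is a minimal numerical sketch. It is mine, not something from the lecture; the units and initial conditions are arbitrary (G times the sun's mass set to 1). The code knows nothing about ellipses-- it only steps F = ma forward under an inverse-square gravitational force-- and a Kepler-style orbit comes out: the planet's distance from the sun oscillates between a fixed perihelion and aphelion instead of spiraling in or out.

```python
import math

# Integrate Newton's second law plus universal gravitation for a single
# planet around a fixed sun. Units and starting conditions are arbitrary
# choices for illustration.

GM = 1.0                 # gravitational constant times the sun's mass
x, y = 1.0, 0.0          # initial position of the planet
vx, vy = 0.0, 0.8        # initial speed below circular speed -> an ellipse
dt = 0.0005

radii = []
for _ in range(200000):
    r = math.hypot(x, y)
    # a = F/m, with F = -G M m r_hat / r^2  (inverse-square attraction)
    ax, ay = -GM * x / r**3, -GM * y / r**3
    # Semi-implicit (symplectic) Euler keeps the orbit from drifting.
    vx, vy = vx + ax * dt, vy + ay * dt
    x, y = x + vx * dt, y + vy * dt
    radii.append(math.hypot(x, y))

print("closest approach:", round(min(radii), 3))
print("farthest distance:", round(max(radii), 3))
# A closed ellipse with the sun at one focus shows up as the radius
# bouncing between these two fixed values, orbit after orbit.
```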
You can derive the fact that the planets have to go that way from the combination of these laws of motion and the law of gravitational force. So there's a sense in which the explanation is deeper, in that you can derive the patterns from the explanation. But it's a lot more than that. Because these laws don't just explain the motions of the planets around the sun, but a huge number of other things. For example, they don't just explain the orbits of the planets, but also other things in the solar system. You can use them to describe comets. You can use them to describe the moons going around the planets. And you can use them to explain why the moon goes around the Earth and not around the sun, in that sense, right?

You can use them to explain not just the motions of the really big things in the solar system, but the really little things-- like, you know, this-- and to explain why, when I drop this, or when Newton famously did or didn't drop an apple, or had an apple drop on his head, right? That, superficially, seems to be a very different pattern, right? It's something going down in your current frame of reference. But the very same laws describe exactly that, and explain why the moon goes around the Earth but the bottle or the apple goes down in my current experience of the world.

In terms of things like causal and actionable ideas, they explain how you could get a man to the moon and back again, or how you could build a rocket to escape the gravitational field-- to not only get off the ground the way we're all on the ground, but to get out of orbiting around one thing and get to orbiting some other thing, right? And it's all about compositionality as well as causality. In order to escape the Earth's gravitational field, or get to the moon and back again, there are a lot of things you have to do. But one of the key things you have to do is generate some significant force to oppose, and be stronger than, gravity. And, you know, Newton really didn't know how to do that.
But some years later, people figured out, you know, by chemistry and other things-- explosions, rockets-- how to do some other kind of physics which could generate a force that was powerful enough for an object the size of a rocket to go against gravity, to get to where you need to be, and then to get back. So the idea of a causal model, which in this case is the one based on forces, and compositionality-- the ability to take the general laws of forces, laws about one particular kind of force that's generated by this mysterious thing called mass, and some other kinds of forces generated by exploding chemicals, and put those all together-- is hugely powerful. And, of course, this, as an expression of human intelligence-- you know, the moon shot is a classic metaphor. Demis used it in his talk. And I think if we really want to understand the way intelligence works in the human mind and brain that could lead to this, you have to go back to the roots of intelligence. You've heard me say this before, and I'm going to do this more later today. We want to go back to the roots of intelligence in even very young children, where you already see all of this happening, right? OK. So that's the big picture.

I'll just point you-- if you want to learn more about the history of this idea, a really nice thing to read is this book by Kenneth Craik. He was an English scientist, sort of a contemporary of Turing, who also died tragically early, although from different tragic causes. He was, you know, one of the first people to start thinking about this topic of brains, minds, and machines-- cybernetics-type ideas, using math to describe how the brain works, how the mind might work in a brain. As you'll see when you read this quote, he didn't even really know what a computer was, because it was pre-Turing, right? But he wrote this wonderful, very short book. And I'll just quote here from one of the chapters. The book was called The Nature of Explanation.
And it was sort of both a philosophical study of that-- how explanation works in science, like some of the ideas I was just going through-- but also really arguing, in very common-sense and compelling ways, why this is a key idea for understanding how the mind and the brain work. And he wasn't just talking about humans. You know, these ideas have their greatest expression in some form, their most powerful expression, in the human mind. But they're also important for understanding other intelligent brains.

So he says here, "One of the most fundamental properties of thought is its power of predicting events. It enables us, for instance, to design bridges with a sufficient factor of safety instead of building them haphazard and waiting to see whether they collapse. If the organism carries a small-scale model of external reality and of its own possible actions within its head, it is able to try out various alternatives, conclude which is the best of them, react to future situations before they arise, utilize the knowledge of past events in dealing with the present and future, and in every way to react in a much fuller, safer, and more competent manner to the emergencies which face it."

So he's really summing up what intelligence is about-- building a model of the world that you can manipulate and plan on and improve, think about, reason about, all that. And then he makes this very nice analogy, a kind of cognitive technology analogy: "Most of the greatest advances of modern technology have been instruments which extended the scope of our sense organs, our brains, or our limbs-- such as telescopes and microscopes, wireless, calculating machines, typewriters, motor cars, ships, and airplanes." Right? He's writing in 1943-- or that's when the book was published; he was writing a little before that, right? He didn't even have the word computer.
Or, back then, computer meant something different-- people who did calculations, basically. But it's the same idea; that's what he's talking about. He's talking about a computer, though he doesn't yet quite have the language to describe it. "Is it not possible, therefore, that our brains themselves utilize comparable mechanisms to achieve the same ends and that these mechanisms can parallel phenomena in the external world as a calculating machine can parallel the development of strains in a bridge?" What he's saying is that the brain is this amazing kind of calculating machine that, in some form, can parallel the development of forces in all sorts of different systems in the world-- and not only forces. And, again, he doesn't have the vocabulary in English, or the math, really, to describe it formally. That's, you know, why this is such an exciting time to be doing all the things we're doing: because now we're really starting to have the vocabulary and the technology to make good on this idea. OK. So that's it for the big-picture philosophical introduction.

Now, I'll try to get more concrete with the questions that have motivated not only me, but many cognitive scientists. Why are we thinking about these issues of explanation? And what are our concrete handles? Let's give a couple of examples of ways we can study intelligence in this form. And I like to say that the big question of our field-- it's big enough that it can fold in most, if not all, of our big questions-- is this one: how does the mind get so much out of so little? So across cognition, wherever you look, our minds are building these rich models of the world that go way beyond the data of our senses. That's this extension of our sense organs that Craik was talking about, right? From data that is altogether way too sparse, noisy, and ambiguous in all sorts of ways, we build models that allow us to go beyond our experience, to plan effectively. How do we do it?
And you could add-- and I do want to go in this direction, because it is part of how we relate the mind to the brain, or these more explanatory models to the more pattern classification models-- that we also have to ask not only how you get such a rich model of the world from so little data, but how you do it so quickly. How do you do it so flexibly? How do you do it with so little energy, right? Metabolic energy is an incredible constraint on computation in the mind and brain.

So just to give some examples-- again, these are ones that will keep coming up here. They've come up in our work. But they're key ones that allow us to take the perspective that you're seeing today and bring it into contact with the other perspectives you're seeing in the summer school. So let's look at visual scene perception. This is just a snapshot of images I got searching on Google Images for, I think, object detection, right? And we've seen a lot of examples of these kinds of things. You can go to the iCub and see its trainable object detectors. We'll see more of this when Amnon, the Mobileye guy, comes and tells us about really cool things they've done to do object detection for self-driving cars. You saw a lot of this kind of thing in robotics before. OK.

So what's the basic idea, the state of the art, in a lot of higher-level computer vision? It's getting a system that learns to put boxes around regions of an image that contain some object of interest that you can label with a word, like person or pedestrian or car or horse, or various parts of things. Like, you might not just put a box around the bicycle, but you might put a box around the wheel, handlebar, seat, and so on. OK. And in some sense, you know, this is starting to get at some aspect of computer vision, right? Several people have quoted David Marr, who said, you know, vision is figuring out what is where from images, right?
But Marr meant something that goes way beyond this, way beyond putting boxes in images with single-word labels. And I think you just have to, you know, look around you to see that your brain's ability to reconstruct the world-- the whole three-dimensional world with all the objects and surfaces in it-- goes so far beyond putting a few boxes around some parts of the image, right? Even put aside the fact that when you actually do this in real time on a real system, you know, the mistakes and the gaps are just glaring, right? Even if you could do this, even if you could put a box around all the things that we could easily label, you look around the world, and you see so many objects and surfaces out there, all actionable. This is what I mean when I talk about causality, right?

Think about it: if somebody told me that there was some treasure hidden behind the chair that has Timothy Goldsmith's name on it, I know I could go around looking for the chair. I think I saw it over there, right? And I know exactly what I'd have to do. I'd have to go there and lift up the thing, right? That's just one of the many plans I could make given what I see in this world. If I didn't know that that was Timothy Goldsmith's chair-- somewhere over there, there's the Lily chair, right? OK. So I know that there are chairs here, and there are little name tags on them. I could go around, make my way through looking at the tags, find the one that says Lily, and then, again, know what I have to do to go look for the treasure buried under it, right? That's just one of, really, this endless number of tasks that you can do with the model of the world around you that you've built from visual perception.
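That "one model, endless tasks" point is easy to put in code. Here is a minimal sketch-- mine, not the lecture's; the little grid-world room, the chair locations, and the goals are all invented-- in which a single causal model of the room and of what moving does serves arbitrary goals: hand the same planner a different goal, and the same model yields a different plan.

```python
from collections import deque

# One invariant world model, many goals: a toy room as a grid, a causal
# "what happens if I step this way" model, and a generic planner.
# The room size, chair tags, and positions are invented for illustration.

WORLD = {
    "size": (5, 5),
    "chairs": {"Timothy Goldsmith": (4, 1), "Lily": (0, 4)},  # tag -> spot
}
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}

def step(state, action):
    # Causal transition model: what an action does to where I am.
    x, y = state
    dx, dy = MOVES[action]
    nx, ny = x + dx, y + dy
    w, h = WORLD["size"]
    return (nx, ny) if 0 <= nx < w and 0 <= ny < h else (x, y)

def plan(start, goal_test):
    # Generic breadth-first planner: the model stays fixed, only the
    # goal changes from task to task.
    frontier, seen = deque([(start, [])]), {start}
    while frontier:
        state, actions = frontier.popleft()
        if goal_test(state):
            return actions
        for a in MOVES:
            nxt = step(state, a)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, actions + [a]))
    return None

start = (2, 0)
for tag in ("Timothy Goldsmith", "Lily"):
    target = WORLD["chairs"][tag]
    route = plan(start, lambda s, t=target: s == t)
    print(f"treasure under the {tag} chair:", route + ["lift the chair"])
```

Swapping the goal from one chair to the other changes the plan but touches nothing in the model, which is the sense in which the model is the invariant.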
And we don't need to get into a debate here-- we can do this in a few minutes if you want-- about the difference between, say, what Jim DiCarlo might call core object recognition, or the kind of stuff that Winrich is studying, where, you know, you show a monkey just a single object against maybe a cluttered background, or a single face, for 100 or 200 milliseconds, and you ask a very important question: what can you get in 100 milliseconds in that kind of limited scene? That's a very important question. And the convergence of visual neuroscience on that problem has enabled us to really understand a lot about the circuits that drive the first initial pass of some aspects of high-level vision, right? But that is really only getting at the classification or pattern detection part of the problem. And the other part of the problem-- figuring out the stuff in the world that causes what you see, which is really the actionable part of things, to guide your actions in the world-- we really are still quite far from understanding, at least with those kinds of methods.

Just to give a few examples-- some of my favorite kinds of hard object detection examples, but ones that show that your brain is really doing this kind of thing even from a single image. You know, it doesn't require a lot of extensive exploration. So let's do some person detection problems here. Here are a few images. Let's just start with the one in the upper left. You tell me-- here, I'll point with this, so you can see it on the screen-- how many people are in this upper left image? Just tell me.

AUDIENCE: Three.

AUDIENCE: About 18.

JOSH TENENBAUM: About 18? OK. Yeah, that's a good answer. There are somewhere between 20 and 30 or something. Yeah. That was even more precise than I was expecting. OK. Now, I don't know.
This would be a good project, if somebody is still looking for a project. If you take the best person detector that you can find out there, or that you can build from however much training data you can find labeled on the web, how many of those people is it going to detect? You know, my guess is, at best, it's going to detect just five or six-- just the bicyclists in the front row. Does that seem fair to say? Even that will be a challenge, right? Whereas not only do you have no trouble detecting the bicyclists in the front row, but all the other ones back there, too, even though for many of them all you can see is a little bit of their face or neck, or sometimes even just that funny helmet that bicyclists wear. But your ability to make sense of that depends on understanding a lot of causal stuff in the world-- the three-dimensional structure of the world, the three-dimensional structure of bodies in the world, some of the behaviors that bicyclists tend to engage in, and so on.

Or take the scene in the upper right there. How many people are in that scene?

AUDIENCE: 350.

JOSH TENENBAUM: 350. Maybe a couple of hundred or something. Yeah, I guess. Were you counting all this time?

AUDIENCE: No.

JOSH TENENBAUM: No. That was a good estimate. Yeah, OK. The scene in the lower left, how many people are there?

AUDIENCE: 100?

JOSH TENENBAUM: 100-something, yeah. The scene in the lower right?

AUDIENCE: Zero.

JOSH TENENBAUM: Zero. Was anybody tempted to say two? Were you tempted to say two as a joke or seriously? Both are valid responses.

AUDIENCE: [INAUDIBLE]

JOSH TENENBAUM: Yeah. OK. So, again, how do we solve all those problems, including knowing-- that one in the bottom, maybe it takes a second or so-- but knowing that, you know, there's actually zero there.
You know, it's the hats, the graduation hats, that are the cues to people in the other scenes. But here, again, because we know something about physics, and the fact that people need to breathe-- or just tend to not bury themselves all the way up to the tippy top of their head, unless it's like some kind of Samuel Beckett play or something, Graduation Endgame-- then, you know, there's almost certainly nobody in that scene. OK.

Now, all of those problems, again, are really way beyond what current computer vision can do, or really wants to do. But I think, you know, the aspect of scene understanding that really taps into this notion of intelligence, of explaining and modeling the causal structure of the world, should be able to do all that. Because we can, right? But here's a problem, one that motivates us on the vision side, that's somewhere in between those sort of ridiculously-hard-by-current-standards problems and the ones that, you know, people can do now. This is the kind of problem that I've been trying to put out there for the computer vision community to think about in a serious way, because it's a big challenge, but it's not ridiculously hard. OK.

So here, this is a scene of an airplane full of computer vision researchers, in fact, going to last year's CVPR conference. And, again, how many people are in the scene?

AUDIENCE: 20?

JOSH TENENBAUM: 20, 50? Yeah, something like that. Again, you know, more than 10, less than 500, right? You could count. Well, you can count, actually. Let's try that. So, you know, just do this mentally along with me. Just touch, in your mind, all the people. You know, 1, 2, 3, 4-- well, it's too hard to do it with the mouse. Da, da, da, da, da-- you know, at some point it gets a little bit hard to see exactly how many people are standing in the back by the restroom. OK.
But it's amazing how much you can, with just the slightest little bit of effort, pick out all the people, even though most of them are barely visible. And it's not only that. It's not just that you can pick them out. While you only see a very small part of their bodies, you know where all the rest of their body is, to some degree-- well enough to be able to predict and act if you needed to, right?

So to sort of probe this, here's a kind of little experiment we can do. Let's take this guy here. See, you've just got his head. And though you see his head, think about where the rest of his body is. And in particular, think about where his right hand is in the scene. You can't see his right hand, but in some sense, you know where it is. I'll move the cursor, and you just hum when I get to where you think his right hand is-- if you could see it, like if everything was transparent.

AUDIENCE: Yeah.

AUDIENCE: Yeah.

JOSH TENENBAUM: OK. Somewhere around there. All right, how about let's take this guy. You can see his scalp only, and maybe a bit of his shoulder. Think about his left big toe. OK? Think about that. And just hum when I get to where his left big toe is.

AUDIENCE: Yeah.

AUDIENCE: Yeah.

JOSH TENENBAUM: Somewhere there, yeah. All right, so you can see we did an instant experiment. You don't even need Mechanical Turk. It's like recording from neurons, only you're each being a neuron, and you're humming instead of spiking. But it's amazing how much you can learn about your brain just by doing things like that. You've got a whole probability distribution right there, right? And that's a meaningful distribution. You weren't just hallucinating, right? You were using a model, a causal model, of how bodies work and how other three-dimensional structures work to solve that problem. OK.
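One cheap way to mimic what the audience just did is to write down a crude causal model of a seated body, anchor it at the only thing that is visible (the head), and sample the unobserved joints: each sample is one "hum," and together they form the probability distribution over where the hidden hand could be. This is only an illustrative sketch added here; the skeleton proportions and angle ranges are invented, not anything measured from the slide.

```python
import math
import random

# A crude 2D stick-figure model of a seated person, rooted at the one
# thing we can actually see: the head. Sampling the unobserved joint
# angles from rough priors and pushing them through the skeleton gives
# a distribution over where the hidden hand could be. All proportions
# and angle ranges below are invented for illustration.

HEAD = (0.0, 0.0)                          # observed head position
NECK, UPPER_ARM, FOREARM = 0.2, 0.3, 0.3   # rough segment lengths

def sample_hand():
    # Seated posture: torso roughly vertical, hanging below the head.
    torso = math.radians(random.gauss(-90, 8))
    shoulder = (HEAD[0] + NECK * math.cos(torso),
                HEAD[1] + NECK * math.sin(torso))
    # Unobserved degrees of freedom: where the arm happens to be.
    upper = torso + math.radians(random.uniform(-40, 40))
    elbow = (shoulder[0] + UPPER_ARM * math.cos(upper),
             shoulder[1] + UPPER_ARM * math.sin(upper))
    fore = upper + math.radians(random.uniform(0, 110))  # elbow bends one way
    return (elbow[0] + FOREARM * math.cos(fore),
            elbow[1] + FOREARM * math.sin(fore))

hands = [sample_hand() for _ in range(5000)]
mean_x = sum(x for x, _ in hands) / len(hands)
mean_y = sum(y for _, y in hands) / len(hands)
print(f"most likely hand region, relative to the head: ({mean_x:.2f}, {mean_y:.2f})")
print("but it's a spread-out distribution, not a point: heights from",
      round(min(y for _, y in hands), 2), "to", round(max(y for _, y in hands), 2))
```

A better model would condition on more of what's visible, but the shape of the computation-- run the causal model forward many times, consistent with what you see-- is the point.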
This isn't just about bodies, right? Our ability to detect objects-- like to detect all the books on my bookshelf there, again, most of which are barely visible, just a few pixels, a small part of each book, or the glasses in this tabletop scene there, right? I don't really know any other way you can do this. Any standard machine-learning-based book detector is not going to detect most of those books. Any standard glass detector is not going to detect most of those glasses. And yet you can do it. And I don't think there's any alternative to saying that, in some sense-- and we'll talk more about it in a little bit-- you're kind of inverting the graphics process. In computer science now, we call it graphics; we maybe used to call it optics. But the way light bounces off the surfaces of objects in the world and comes into your eye, that's a causal process that your visual system is in some way able to invert-- to model, and go from the observable to the unobservable stuff, just like Newton was doing with astronomical data. OK. Enough on vision for now, sort of.

Let's go from actually just perceiving this stuff out there in the world to forming concepts and generalizing. So a problem that I've studied a lot, that a lot of us have studied in this field, is the problem of learning concepts-- in particular, one very particular kind of concept, which is object kinds, categories of objects, things we could label with a word. It's one of the most obvious forms of interesting learning that you see in young children, part of learning language. But it's not just about language. And the striking thing when you look at, say, a child learning words-- in particular, let's say, words that label kinds of objects, like chair or horse or bottle or ball-- is how little data of a certain kind-- labels-- or how little task-relevant data is required. A lot of other data is probably used in some way, right?
774 00:29:57,654 --> 00:29:59,070 And, again, this is a theme you've 775 00:29:59,070 --> 00:30:00,929 heard from a number of the other speakers. 776 00:30:00,929 --> 00:30:02,970 But just to give you some of my favorite examples 777 00:30:02,970 --> 00:30:04,860 of how we can learn object concepts from just 778 00:30:04,860 --> 00:30:06,362 one or a few examples, well, here's 779 00:30:06,362 --> 00:30:08,070 an example from some experimental stimuli 780 00:30:08,070 --> 00:30:11,880 we use where we just made up a whole little world of objects. 781 00:30:11,880 --> 00:30:16,604 And in this world, I can teach you a new name, let's say tufa, 782 00:30:16,604 --> 00:30:17,770 and give you a few examples. 783 00:30:17,770 --> 00:30:18,900 And, again, you can now go through. 784 00:30:18,900 --> 00:30:21,150 We can try this as a little experiment here and just 785 00:30:21,150 --> 00:30:22,470 say, you know, yes or no. 786 00:30:22,470 --> 00:30:23,830 For each of these objects, is it a tufa? 787 00:30:23,830 --> 00:30:25,038 So how about this, yes or no? 788 00:30:25,038 --> 00:30:25,645 AUDIENCE: Yes. 789 00:30:25,645 --> 00:30:26,520 JOSH TENENBAUM: Here? 790 00:30:26,520 --> 00:30:27,175 AUDIENCE: No. 791 00:30:27,175 --> 00:30:28,050 JOSH TENENBAUM: Here? 792 00:30:28,050 --> 00:30:28,592 AUDIENCE: No. 793 00:30:28,592 --> 00:30:29,466 JOSH TENENBAUM: Here? 794 00:30:29,466 --> 00:30:30,150 AUDIENCE: No. 795 00:30:30,150 --> 00:30:30,420 JOSH TENENBAUM: Here? 796 00:30:30,420 --> 00:30:30,962 AUDIENCE: No. 797 00:30:30,962 --> 00:30:31,836 JOSH TENENBAUM: Here? 798 00:30:31,836 --> 00:30:32,420 AUDIENCE: Yes. 799 00:30:32,420 --> 00:30:33,294 JOSH TENENBAUM: Here? 800 00:30:33,294 --> 00:30:33,960 AUDIENCE: No. 801 00:30:33,960 --> 00:30:34,835 JOSH TENENBAUM: Here? 802 00:30:34,835 --> 00:30:35,733 AUDIENCE: Yes. 803 00:30:35,733 --> 00:30:37,206 No. 804 00:30:37,206 --> 00:30:38,679 No. 805 00:30:38,679 --> 00:30:39,661 No. 806 00:30:39,661 --> 00:30:41,625 No. 807 00:30:41,625 --> 00:30:42,620 Yes. 808 00:30:42,620 --> 00:30:43,770 JOSH TENENBAUM: Yeah. 809 00:30:43,770 --> 00:30:44,270 OK. 810 00:30:44,270 --> 00:30:47,090 So first of all, how long did it take you for each one? 811 00:30:47,090 --> 00:30:48,590 I mean, it basically didn't take you 812 00:30:48,590 --> 00:30:52,550 any longer than it takes in one of Winrich's experiments to get 813 00:30:52,550 --> 00:30:54,110 the spike seeing the face. 814 00:30:54,110 --> 00:30:56,030 So you learned this concept, and now you 815 00:30:56,030 --> 00:30:57,680 can just use it right away. 816 00:30:57,680 --> 00:31:01,680 It's far less than a second of actual visual processing. 817 00:31:01,680 --> 00:31:04,040 And there was a little bit of a latency. 818 00:31:04,040 --> 00:31:07,550 This one's a little more uncertain here, right? 819 00:31:07,550 --> 00:31:10,220 And you saw that in that it took you maybe almost twice as 820 00:31:10,220 --> 00:31:12,165 long to make that decision. 821 00:31:12,165 --> 00:31:13,470 OK. 822 00:31:13,470 --> 00:31:16,460 That's the kind of thing we'd like to be able to explain. 823 00:31:16,460 --> 00:31:18,919 And that means how can you get a whole concept? 824 00:31:18,919 --> 00:31:20,210 It's a whole new kind of thing. 825 00:31:20,210 --> 00:31:21,560 You don't really know much about it. 826 00:31:21,560 --> 00:31:23,060 Maybe you know it's some kind of weird plant 827 00:31:23,060 --> 00:31:23,893 on this weird thing. 
828 00:31:23,893 --> 00:31:27,320 But you've got a whole new concept and a whole entry 829 00:31:27,320 --> 00:31:30,920 into a whole, probably, system of concepts. 830 00:31:30,920 --> 00:31:33,740 Again, several notions of being quick-- sample complexity, 831 00:31:33,740 --> 00:31:35,570 as we say, just one or a few examples, 832 00:31:35,570 --> 00:31:37,340 but also the speed-- the speed in which 833 00:31:37,340 --> 00:31:38,990 you formed that concept and the speed 834 00:31:38,990 --> 00:31:41,810 in which you're able to deploy it in now recognizing 835 00:31:41,810 --> 00:31:44,660 and detecting things. 836 00:31:44,660 --> 00:31:47,360 Just to give one other real world example, so it's not just 837 00:31:47,360 --> 00:31:50,030 we make things up-- but, for example, here's an object. 838 00:31:50,030 --> 00:31:52,130 Just how many know what this thing is? 839 00:31:52,130 --> 00:31:53,414 Raise your hand if you do. 840 00:31:53,414 --> 00:31:55,330 How many people don't know what this thing is? 841 00:31:55,330 --> 00:31:55,850 OK. 842 00:31:55,850 --> 00:31:56,990 Good. 843 00:31:56,990 --> 00:31:59,870 So this is a piece of rock climbing equipment. 844 00:31:59,870 --> 00:32:01,010 It's called a cam. 845 00:32:01,010 --> 00:32:02,781 I won't tell you anything more than that. 846 00:32:02,781 --> 00:32:04,280 Well, maybe I'll tell you one thing, 847 00:32:04,280 --> 00:32:07,370 because it's kind of useful. 848 00:32:07,370 --> 00:32:09,880 Well, I mean, you may or may not even need to-- 849 00:32:09,880 --> 00:32:10,640 yeah. 850 00:32:10,640 --> 00:32:12,059 This strap here is not technically 851 00:32:12,059 --> 00:32:13,350 part of the piece of equipment. 852 00:32:13,350 --> 00:32:14,210 But it doesn't really matter. 853 00:32:14,210 --> 00:32:14,709 OK. 854 00:32:14,709 --> 00:32:17,439 So anyway, I've given you one example of this new kind 855 00:32:17,439 --> 00:32:18,480 of thing for most of you. 856 00:32:18,480 --> 00:32:20,390 And now, you can look at a complex scene 857 00:32:20,390 --> 00:32:22,700 like this climber's equipment rack. 858 00:32:22,700 --> 00:32:26,220 And tell me, are there any cams in this scene? 859 00:32:26,220 --> 00:32:26,900 AUDIENCE: Yes. 860 00:32:26,900 --> 00:32:28,191 JOSH TENENBAUM: Where are they? 861 00:32:28,191 --> 00:32:29,735 AUDIENCE: On top. 862 00:32:29,735 --> 00:32:30,610 JOSH TENENBAUM: Yeah. 863 00:32:30,610 --> 00:32:31,430 The top. 864 00:32:31,430 --> 00:32:31,930 Like here? 865 00:32:31,930 --> 00:32:32,814 AUDIENCE: No. 866 00:32:32,814 --> 00:32:33,700 Next to there. 867 00:32:33,700 --> 00:32:34,120 JOSH TENENBAUM: Here. 868 00:32:34,120 --> 00:32:34,620 Yeah. 869 00:32:34,620 --> 00:32:35,530 Right, exactly. 870 00:32:35,530 --> 00:32:38,267 How about this scene, any? 871 00:32:38,267 --> 00:32:39,608 AUDIENCE: No. 872 00:32:39,608 --> 00:32:40,950 AUDIENCE: [INAUDIBLE] 873 00:32:40,950 --> 00:32:42,010 JOSH TENENBAUM: There's none of that-- 874 00:32:42,010 --> 00:32:42,968 well, there's a couple. 875 00:32:42,968 --> 00:32:45,760 Anyone see the ones over up in the upper right up here? 876 00:32:45,760 --> 00:32:46,630 AUDIENCE: Yeah. 877 00:32:46,630 --> 00:32:46,990 JOSH TENENBAUM: Yeah. 878 00:32:46,990 --> 00:32:47,823 They're hard to see. 879 00:32:47,823 --> 00:32:49,780 They're really dark and shaded, right? 880 00:32:52,242 --> 00:32:54,700 But when I draw your attention to it, and then you're like, 881 00:32:54,700 --> 00:32:55,200 oh yeah. 882 00:32:55,200 --> 00:32:56,827 I see that, right? 
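As a rough illustration of how a concept like "tufa" (or the cam) could be acquired from just one or a few positive examples, here is a minimal sketch in the spirit of Bayesian concept learning; it is not the actual model from this line of work. The candidate categories, their sizes, and the uniform prior are all invented; the key ingredient is the size principle, which favors the smallest hypothesis consistent with the examples.

```python
# A minimal sketch of one-shot / few-shot word learning as Bayesian inference.
# Hypotheses are nested candidate categories; all names and sizes are invented.

hypotheses = {
    "tufas_only":       {"tufa1", "tufa2", "tufa3"},
    "tufa_like_plants": {"tufa1", "tufa2", "tufa3", "plantA", "plantB"},
    "all_plants":       {"tufa1", "tufa2", "tufa3", "plantA", "plantB",
                         "treeA", "treeB", "treeC"},
}
prior = {"tufas_only": 1 / 3, "tufa_like_plants": 1 / 3, "all_plants": 1 / 3}

def posterior(examples):
    """Bayes' rule with the size principle: examples are assumed to be sampled
    from the true category, so small consistent categories get likelihood
    (1/|h|)^n and are favored as more examples come in."""
    scores = {}
    for h, members in hypotheses.items():
        if all(x in members for x in examples):
            scores[h] = prior[h] * (1 / len(members)) ** len(examples)
        else:
            scores[h] = 0.0
    z = sum(scores.values())
    return {h: s / z for h, s in scores.items()}

def prob_in_concept(obj, examples):
    """Generalization: probability that a new object falls under the new word."""
    post = posterior(examples)
    return sum(p for h, p in post.items() if obj in hypotheses[h])

print(posterior(["tufa1", "tufa2", "tufa3"]))
print(prob_in_concept("plantA", ["tufa1", "tufa2", "tufa3"]))
```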
883 00:32:56,827 --> 00:32:58,660 So part of why I give these examples is they 884 00:32:58,660 --> 00:33:00,970 show how the several examples I've 885 00:33:00,970 --> 00:33:02,470 been giving, like the object concept 886 00:33:02,470 --> 00:33:04,511 learning thing, interact with vision, right? 887 00:33:04,511 --> 00:33:06,010 I think your ability to solve tasks 888 00:33:06,010 --> 00:33:09,880 like this rests on your ability to form this abstract concept 889 00:33:09,880 --> 00:33:11,620 of this physical object. 890 00:33:11,620 --> 00:33:14,737 And notice all these ones, they're different colors. 891 00:33:14,737 --> 00:33:16,820 The physical details of the objects are different. 892 00:33:16,820 --> 00:33:19,270 It's only a category of object that's preserved. 893 00:33:19,270 --> 00:33:22,060 But your ability to recognize these things in the real world 894 00:33:22,060 --> 00:33:24,250 depends on, also, the ability to recognize them 895 00:33:24,250 --> 00:33:26,666 in very different viewpoints under very different lighting 896 00:33:26,666 --> 00:33:27,250 conditions. 897 00:33:27,250 --> 00:33:30,130 And if we want to explain how you can do this-- again, 898 00:33:30,130 --> 00:33:33,100 to go back to composability and compositionality-- 899 00:33:33,100 --> 00:33:35,620 we need to understand how you can put together 900 00:33:35,620 --> 00:33:39,017 the kind of causal model of how scenes are formed-- 901 00:33:39,017 --> 00:33:41,350 the one that vision is inverting, this inverse graphics thing-- 902 00:33:41,350 --> 00:33:42,766 with the causal model of something 903 00:33:42,766 --> 00:33:47,560 about how object concepts work, and compose them together 904 00:33:47,560 --> 00:33:50,260 to be able to learn a new concept of an object 905 00:33:50,260 --> 00:33:55,175 that you can also recognize new instances of the kind of thing 906 00:33:55,175 --> 00:33:57,550 in new viewpoints and under different lighting conditions 907 00:33:57,550 --> 00:34:00,160 than the really wonderfully perfect example I gave you here 908 00:34:00,160 --> 00:34:02,230 with nice lighting and a nice viewpoint. 909 00:34:02,230 --> 00:34:04,900 We can push this to quite an extreme. 910 00:34:04,900 --> 00:34:06,970 Like, in that scene in the upper right, 911 00:34:06,970 --> 00:34:08,150 do you see any cams there? 912 00:34:08,150 --> 00:34:09,020 AUDIENCE: Yeah. 913 00:34:09,020 --> 00:34:09,894 JOSH TENENBAUM: Yeah. 914 00:34:09,894 --> 00:34:11,440 How many are there? 915 00:34:11,440 --> 00:34:12,864 AUDIENCE: [INAUDIBLE] 916 00:34:12,864 --> 00:34:14,739 JOSH TENENBAUM: Quite a lot, yeah, and, like, 917 00:34:14,739 --> 00:34:16,010 all occluded and cluttered. 918 00:34:16,010 --> 00:34:16,510 Yeah. 919 00:34:16,510 --> 00:34:18,560 Amazing that you can do this. 920 00:34:18,560 --> 00:34:20,949 And as we'll see in a little bit, what we 921 00:34:20,949 --> 00:34:23,199 do with our object concepts-- 922 00:34:23,199 --> 00:34:25,570 and these are other ways to show this notion 923 00:34:25,570 --> 00:34:27,110 of a generative model-- 924 00:34:27,110 --> 00:34:28,360 we don't just classify things. 925 00:34:28,360 --> 00:34:30,790 But we can use them for all sorts of other tasks, right? 926 00:34:30,790 --> 00:34:33,550 We can use them to generate or imagine new instances. 927 00:34:33,550 --> 00:34:35,320 We can parse an object out into parts. 928 00:34:35,320 --> 00:34:37,960 This is another novel, but real object-- 929 00:34:37,960 --> 00:34:41,415 the Segway personal thing.
930 00:34:41,415 --> 00:34:43,540 Which, again, probably all of you know this, right? 931 00:34:43,540 --> 00:34:46,550 How many people have seen those Segways before, right? 932 00:34:46,550 --> 00:34:47,050 OK. 933 00:34:47,050 --> 00:34:48,460 But you all probably remember the first time 934 00:34:48,460 --> 00:34:49,239 you saw one on the street. 935 00:34:49,239 --> 00:34:50,020 And whoa, that's really cool. 936 00:34:50,020 --> 00:34:50,936 What's that new thing? 937 00:34:50,936 --> 00:34:53,449 And then somebody tells you, and now you know, right? 938 00:34:53,449 --> 00:34:56,679 But it's partly related to your ability to parse out the parts. 939 00:34:56,679 --> 00:34:58,930 If somebody says, oh, my Segway has a flat tire, 940 00:34:58,930 --> 00:35:00,670 you kind of know what that means and what 941 00:35:00,670 --> 00:35:03,130 you could do, at least in principle, to fix it, right? 942 00:35:07,840 --> 00:35:09,600 You can take different kinds of things 943 00:35:09,600 --> 00:35:12,010 in some category like vehicles and imagine 944 00:35:12,010 --> 00:35:13,690 ways of combining the parts to make yet 945 00:35:13,690 --> 00:35:16,610 other new either real or fanciful vehicles, like that C 946 00:35:16,610 --> 00:35:18,380 to the lower right there. 947 00:35:18,380 --> 00:35:21,040 These are all things you do from very little data 948 00:35:21,040 --> 00:35:23,650 from these object concepts. 949 00:35:23,650 --> 00:35:27,010 Moving on and then both back to some examples 950 00:35:27,010 --> 00:35:30,310 you saw Tomer and I talk about on the first day 951 00:35:30,310 --> 00:35:32,500 in our brief introduction and what 952 00:35:32,500 --> 00:35:35,470 we'll get to more by the end of today, examples like these. 953 00:35:35,470 --> 00:35:37,360 So Tomer already showed you the scene 954 00:35:37,360 --> 00:35:41,470 of the red and the blue ball chasing each other around. 955 00:35:41,470 --> 00:35:43,270 I won't rehearse that example. 956 00:35:43,270 --> 00:35:46,400 I'll show you another scene that is more famous. 957 00:35:46,400 --> 00:35:46,900 OK. 958 00:35:46,900 --> 00:35:49,067 Well, so for the people who haven't seen it, 959 00:35:49,067 --> 00:35:50,650 you can never watch it too many times. 960 00:35:50,650 --> 00:35:53,066 Again, like that one, it's just some shapes moving around. 961 00:35:53,066 --> 00:35:56,050 It was done in the 1940s, that golden age 962 00:35:56,050 --> 00:35:58,690 for cognitive science as well as many other things. 963 00:35:58,690 --> 00:36:03,310 And much lower technology of animation, it's 964 00:36:03,310 --> 00:36:06,222 like stop-action animation on a table top. 965 00:36:06,222 --> 00:36:07,930 But just like the scene on the left which 966 00:36:07,930 --> 00:36:09,760 is done with computer animation, just 967 00:36:09,760 --> 00:36:12,610 from the motion of a few shapes in this two-dimensional world, 968 00:36:12,610 --> 00:36:14,350 you get so much. 969 00:36:14,350 --> 00:36:15,600 First of all, you get physics. 970 00:36:15,600 --> 00:36:17,755 Let's watch it again. 971 00:36:17,755 --> 00:36:19,180 It looks like there's a collision. 972 00:36:19,180 --> 00:36:20,380 It's just objects, shapes moving. 973 00:36:20,380 --> 00:36:22,190 But it looks like one thing is banging into another. 974 00:36:22,190 --> 00:36:24,110 And it looks like they're characters, right? 975 00:36:24,110 --> 00:36:26,430 It looks like the big one is kind of bullying the other one. 
976 00:36:26,430 --> 00:36:28,240 It's sort of backed him up against the wall 977 00:36:28,240 --> 00:36:29,240 scaring them off, right? 978 00:36:29,240 --> 00:36:31,270 Does you guys see that? 979 00:36:31,270 --> 00:36:32,830 The other one was hiding. 980 00:36:32,830 --> 00:36:34,960 Now, this one goes in to go after him. 981 00:36:34,960 --> 00:36:36,820 It starts to get a little scary, right? 982 00:36:36,820 --> 00:36:39,224 Cue the scary music if it was a silent movie. 983 00:36:39,224 --> 00:36:40,390 Doo, doo, doo, doo, doo, OK. 984 00:36:40,390 --> 00:36:42,160 You can watch the end of it on YouTube if you want. 985 00:36:42,160 --> 00:36:43,210 It's quite famous. 986 00:36:43,210 --> 00:36:45,290 So I won't show you the end of it. 987 00:36:45,290 --> 00:36:47,620 But in case you're getting nervous, don't worry. 988 00:36:47,620 --> 00:36:51,820 It ends happily, at least for two of the three characters. 989 00:36:51,820 --> 00:36:54,730 From some combination of all your experiences in your life 990 00:36:54,730 --> 00:36:57,100 and whatever evolution genetics gave you 991 00:36:57,100 --> 00:36:58,780 before you came out into the world, 992 00:36:58,780 --> 00:37:00,740 you've built up a model that allows you to understand this. 993 00:37:00,740 --> 00:37:02,410 And then it's a separate, but very interesting, 994 00:37:02,410 --> 00:37:03,430 question and harder one. 995 00:37:03,430 --> 00:37:05,239 How do you get to that point, right? 996 00:37:05,239 --> 00:37:07,030 The question of the development of the kind 997 00:37:07,030 --> 00:37:08,530 of commonsense knowledge that allows 998 00:37:08,530 --> 00:37:10,945 you to parse out just the motion into both forces, 999 00:37:10,945 --> 00:37:13,390 you know, one thing hitting another thing, and then 1000 00:37:13,390 --> 00:37:16,892 the whole mental state structure and the sort of social who's 1001 00:37:16,892 --> 00:37:18,100 good and who's bad on there-- 1002 00:37:18,100 --> 00:37:19,130 I mean, because, again, most people 1003 00:37:19,130 --> 00:37:21,088 when they see this and think about a little bit 1004 00:37:21,088 --> 00:37:24,160 see some of the characters as good and others as bad. 1005 00:37:24,160 --> 00:37:27,450 How that knowledge develops is extremely interesting. 1006 00:37:27,450 --> 00:37:30,340 We're going to see a lot more of the more experiments, how 1007 00:37:30,340 --> 00:37:34,090 we study this kind of thing in young children, next week. 1008 00:37:34,090 --> 00:37:36,132 And we'll talk more about the learning next week. 1009 00:37:36,132 --> 00:37:37,631 We'll see how much of that I get to. 1010 00:37:37,631 --> 00:37:39,267 What I want to talk about here is 1011 00:37:39,267 --> 00:37:41,350 sort of general issues of how the knowledge works, 1012 00:37:41,350 --> 00:37:43,510 how you deploy it, how you make the inferences 1013 00:37:43,510 --> 00:37:45,110 with the knowledge, and a little bit about learning. 1014 00:37:45,110 --> 00:37:46,690 Maybe we'll see if we have time for that at the end. 1015 00:37:46,690 --> 00:37:48,273 But they'll be more of that next week. 1016 00:37:48,273 --> 00:37:52,974 I think it's important to understand what the models are, 1017 00:37:52,974 --> 00:37:55,390 these generative models that you're building of the world, 1018 00:37:55,390 --> 00:37:56,890 before you actually study learning. 1019 00:37:56,890 --> 00:37:59,380 I think there's a danger if you study learning. 
1020 00:37:59,380 --> 00:38:02,680 Without having the right target of learning, you might be-- 1021 00:38:02,680 --> 00:38:06,580 to take a classic analogy-- 1022 00:38:06,580 --> 00:38:08,560 trying to get to the moon by climbing trees. 1023 00:38:13,460 --> 00:38:15,080 How about this? 1024 00:38:15,080 --> 00:38:18,755 Just to give one example that is familiar, 1025 00:38:18,755 --> 00:38:20,630 because we saw this wonderful talk by Demis-- 1026 00:38:20,630 --> 00:38:24,350 and I think many people had seen the DeepMind work. 1027 00:38:29,520 --> 00:38:32,720 And I hope everybody here saw Demis' talk. 1028 00:38:32,720 --> 00:38:35,210 This is just a couple of slides from their Nature paper, 1029 00:38:35,210 --> 00:38:38,510 where, again, they had this deep Q-network, which 1030 00:38:38,510 --> 00:38:40,970 is I think a great example of trying 1031 00:38:40,970 --> 00:38:44,450 to see how far you can go with this pattern recognition idea, 1032 00:38:44,450 --> 00:38:45,020 right? 1033 00:38:45,020 --> 00:38:47,186 In a sense, what this network does, if you remember, 1034 00:38:47,186 --> 00:38:49,650 is it has a bunch of sort of convolutional layers 1035 00:38:49,650 --> 00:38:50,900 and of fully connected layers. 1036 00:38:50,900 --> 00:38:52,080 But it's mapping. 1037 00:38:52,080 --> 00:38:55,370 It's learning a feedforward mapping from images 1038 00:38:55,370 --> 00:38:57,170 to joystick action. 1039 00:38:57,170 --> 00:38:59,180 So it's a perfect example of trying 1040 00:38:59,180 --> 00:39:01,907 to solve interesting problems of intelligence. 1041 00:39:01,907 --> 00:39:03,740 I think that the problems of video gaming AI 1042 00:39:03,740 --> 00:39:05,936 are really cool ones. 1043 00:39:05,936 --> 00:39:07,310 With this pattern classification, 1044 00:39:07,310 --> 00:39:09,018 they're basically trying to find patterns 1045 00:39:09,018 --> 00:39:10,790 of pixels in Atari video games that 1046 00:39:10,790 --> 00:39:12,375 are diagnostic of whether you should 1047 00:39:12,375 --> 00:39:14,000 move your joystick this way or that way 1048 00:39:14,000 --> 00:39:16,664 or press the button this way or that way, right? 1049 00:39:16,664 --> 00:39:18,080 And they showed that that can give 1050 00:39:18,080 --> 00:39:20,990 very competitive performance with humans when you give it 1051 00:39:20,990 --> 00:39:23,660 enough training data and with clever training algorithms, 1052 00:39:23,660 --> 00:39:25,250 right? 1053 00:39:25,250 --> 00:39:28,160 But I think there's also an important sense in which what 1054 00:39:28,160 --> 00:39:30,680 this is doing is quite different from what 1055 00:39:30,680 --> 00:39:32,660 humans are doing when they're learning 1056 00:39:32,660 --> 00:39:34,310 to play one of these games. 1057 00:39:34,310 --> 00:39:36,800 And, you know, Demis, I think is quite aware of this. 1058 00:39:36,800 --> 00:39:38,490 He made some of these points in his talk 1059 00:39:38,490 --> 00:39:40,760 and, informally, afterwards, right? 1060 00:39:40,760 --> 00:39:42,710 There's all sorts of things that a person 1061 00:39:42,710 --> 00:39:44,960 brings to the problem of learning an Atari video game, 1062 00:39:44,960 --> 00:39:48,470 just like your question of what do you bring to learning this. 
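For concreteness, here is a rough sketch in PyTorch (not DeepMind's code) of the kind of feedforward mapping just described: a stack of recent game frames goes through convolutional layers and then fully connected layers, and comes out as one estimated value per joystick action. The layer sizes follow the published description of the network as I understand it, but treat the details here as approximate.

```python
# A rough sketch of a deep Q-network: stacked frames in, one value per action out.
import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, n_actions: int):
        super().__init__()
        self.features = nn.Sequential(              # convolutional pattern-recognition layers
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(                  # fully connected layers
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),              # one Q-value per joystick/button action
        )

    def forward(self, frames):                      # frames: (batch, 4, 84, 84) grayscale
        return self.head(self.features(frames))

q_net = DQN(n_actions=18)                           # Atari exposes up to 18 discrete actions
q_values = q_net(torch.zeros(1, 4, 84, 84))
action = q_values.argmax(dim=1)                     # act greedily on the predicted values
```

Nothing in this mapping represents platforms, igloos, or birds as objects with causal roles; that is exactly the contrast being drawn with how a person approaches the same game.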
1063 00:39:48,470 --> 00:39:51,150 But I think from a cognitive point of view, 1064 00:39:51,150 --> 00:39:52,790 the real problem of intelligence is 1065 00:39:52,790 --> 00:39:55,454 to understand how learning works with the knowledge 1066 00:39:55,454 --> 00:39:56,870 that you have and how you actually 1067 00:39:56,870 --> 00:39:58,190 build up that knowledge. 1068 00:39:58,190 --> 00:40:00,770 I think that at least the current DeepMind 1069 00:40:00,770 --> 00:40:03,360 system, the one that was published a few months ago, 1070 00:40:03,360 --> 00:40:06,020 is not really getting at that question. 1071 00:40:06,020 --> 00:40:08,630 It's trying to see how much you can do without really 1072 00:40:08,630 --> 00:40:10,290 having a causal model of the world. 1073 00:40:10,290 --> 00:40:12,590 But as I think Demis showed in his talk, 1074 00:40:12,590 --> 00:40:17,415 that's a direction, among many others, 1075 00:40:17,415 --> 00:40:20,409 that I think they realized they need to go in. 1076 00:40:20,409 --> 00:40:21,950 A nice way to illustrate this is just 1077 00:40:21,950 --> 00:40:24,160 to look at one particular video game. 1078 00:40:24,160 --> 00:40:25,880 This is a game called Frostbite. 1079 00:40:25,880 --> 00:40:29,180 It's one of the ones down here on this chart, 1080 00:40:29,180 --> 00:40:31,700 which the DeepMind system did particularly 1081 00:40:31,700 --> 00:40:34,370 poorly on in terms of getting only 1082 00:40:34,370 --> 00:40:37,550 about 6% performance relative to humans. 1083 00:40:37,550 --> 00:40:40,264 But I think it's interesting and informative. 1084 00:40:40,264 --> 00:40:42,680 And it really gets to the heart of all of the things we're 1085 00:40:42,680 --> 00:40:43,850 talking about here. 1086 00:40:43,850 --> 00:40:48,290 To contrast how the DeepMind system, as well as other attempts 1087 00:40:48,290 --> 00:40:52,670 to do sort of powerful, scalable deep reinforcement learning, learn this game, 1088 00:40:52,670 --> 00:40:54,620 I'll show you another more recent result 1089 00:40:54,620 --> 00:40:56,540 from a different group in a second. 1090 00:40:56,540 --> 00:40:59,870 Contrast how those systems learn to play this video game 1091 00:40:59,870 --> 00:41:02,600 with how a human child might learn to play a game, 1092 00:41:02,600 --> 00:41:04,820 like that kid over there who's watching his older 1093 00:41:04,820 --> 00:41:07,230 brother play a game, right? 1094 00:41:07,230 --> 00:41:11,690 So the DeepMind system, you know, 1095 00:41:11,690 --> 00:41:17,060 gets about 1,000 hours of game play experience, right? 1096 00:41:17,060 --> 00:41:20,390 And then it chops that up in various interesting ways 1097 00:41:20,390 --> 00:41:22,580 with the replay that Demis talked about, right? 1098 00:41:22,580 --> 00:41:24,920 But when we talk about getting so much from so little, 1099 00:41:24,920 --> 00:41:28,550 the basic data is about 1,000 hours of experience. 1100 00:41:28,550 --> 00:41:33,020 But I would venture that a kid learns a lot more 1101 00:41:33,020 --> 00:41:34,657 from a lot less, right? 1102 00:41:34,657 --> 00:41:36,740 The way a kid actually learns to play a video game 1103 00:41:36,740 --> 00:41:40,175 is not by trial and error for 1,000 hours, right? 1104 00:41:40,175 --> 00:41:42,800 I mean, there might be a little bit of trial and error on their part. 1105 00:41:42,800 --> 00:41:44,000 But, often, it might be just watching 1106 00:41:44,000 --> 00:41:46,140 someone else play and say, wow, that's awesome. 1107 00:41:46,140 --> 00:41:47,040 I'd like to do that.
1108 00:41:47,040 --> 00:41:47,540 Can I play? 1109 00:41:47,540 --> 00:41:47,900 My turn. 1110 00:41:47,900 --> 00:41:49,550 My turn-- and wrestling for the joystick and then 1111 00:41:49,550 --> 00:41:50,570 seeing what you can do. 1112 00:41:50,570 --> 00:41:52,370 And it only takes a minute, really, 1113 00:41:52,370 --> 00:41:54,200 to figure out if this game is fun, 1114 00:41:54,200 --> 00:41:55,616 interesting, if it's something you 1115 00:41:55,616 --> 00:41:59,100 want to do, and to sort of get the basic hang of things, 1116 00:41:59,100 --> 00:42:00,890 at least of what you should try to do. 1117 00:42:00,890 --> 00:42:02,520 That's not to say to be able to do it. 1118 00:42:02,520 --> 00:42:05,420 So I mean, unless you saw me give a talk, 1119 00:42:05,420 --> 00:42:06,960 has anybody played this game before? 1120 00:42:06,960 --> 00:42:07,460 OK. 1121 00:42:07,460 --> 00:42:11,300 So perfect example-- let's watch a minute of this game 1122 00:42:11,300 --> 00:42:13,430 and see if you can figure out what's going on. 1123 00:42:13,430 --> 00:42:15,920 Think about how you learn to play this game, right? 1124 00:42:15,920 --> 00:42:17,780 Imagine you're watching somebody else play. 1125 00:42:17,780 --> 00:42:19,670 This is a video of not the DeepMind system, 1126 00:42:19,670 --> 00:42:22,070 but of an expert human game player, 1127 00:42:22,070 --> 00:42:23,660 a really good human playing this, 1128 00:42:23,660 --> 00:42:25,265 like that kid's older brother. 1129 00:42:25,265 --> 00:42:30,710 [VIDEO PLAYBACK] 1130 00:43:11,767 --> 00:43:12,350 [END PLAYBACK] 1131 00:43:12,350 --> 00:43:13,250 OK. 1132 00:43:13,250 --> 00:43:14,340 Maybe you've got the idea. 1133 00:43:14,340 --> 00:43:17,600 So, again, only people who haven't seen before, 1134 00:43:17,600 --> 00:43:19,295 so how does this game work? 1135 00:43:19,295 --> 00:43:21,170 So probably everybody noticed, and it's maybe 1136 00:43:21,170 --> 00:43:22,720 so obvious you didn't even mention it, 1137 00:43:22,720 --> 00:43:24,050 but every time he hits a platform, 1138 00:43:24,050 --> 00:43:24,970 there's a beep, right? 1139 00:43:24,970 --> 00:43:26,120 And the platform turns blue. 1140 00:43:26,120 --> 00:43:27,140 Did everybody notice that? 1141 00:43:27,140 --> 00:43:27,640 Right. 1142 00:43:27,640 --> 00:43:30,710 So it only takes like one or two of those, maybe even just one. 1143 00:43:30,710 --> 00:43:33,350 Like, beep, beep, woop, woop, and you get that right away. 1144 00:43:33,350 --> 00:43:34,730 That's an important causal thing. 1145 00:43:34,730 --> 00:43:37,700 And it just happened that this guy is so good, 1146 00:43:37,700 --> 00:43:38,940 and he starts right away. 1147 00:43:38,940 --> 00:43:40,790 So he goes, ba, ba ba, ba, ba, and he's 1148 00:43:40,790 --> 00:43:42,200 doing it about once a second. 1149 00:43:42,200 --> 00:43:43,970 And so there's an illusory correlation. 1150 00:43:43,970 --> 00:43:46,410 And the same part of your brain that figures out 1151 00:43:46,410 --> 00:43:49,521 the actually important and true causal thing going on, 1152 00:43:49,521 --> 00:43:51,020 the first thing I mentioned, figures 1153 00:43:51,020 --> 00:43:53,060 out this other thing, which is just a slight illusion. 1154 00:43:53,060 --> 00:43:54,500 But if you started playing it yourself, 1155 00:43:54,500 --> 00:43:56,480 you would quickly notice that that wasn't true, right? 1156 00:43:56,480 --> 00:43:57,827 Because you'd start off there. 
1157 00:43:57,827 --> 00:43:59,910 Maybe you would have thought of that for a minute. 1158 00:43:59,910 --> 00:44:01,090 But then you'd start off playing. 1159 00:44:01,090 --> 00:44:03,170 And very quickly, you'd see you're sitting there 1160 00:44:03,170 --> 00:44:03,950 trying to decide what to do. 1161 00:44:03,950 --> 00:44:05,783 Because you're not as expert as this person. 1162 00:44:05,783 --> 00:44:08,107 And the temperature's going down anyway. 1163 00:44:08,107 --> 00:44:10,190 So, again, you would figure that out very quickly. 1164 00:44:10,190 --> 00:44:11,240 What else is going on in this game? 1165 00:44:11,240 --> 00:44:12,936 AUDIENCE: He has to build an igloo. 1166 00:44:12,936 --> 00:44:15,316 JOSH TENENBAUM: He has to build an igloo, yeah. 1167 00:44:15,316 --> 00:44:16,440 How does he build an igloo? 1168 00:44:16,440 --> 00:44:18,300 AUDIENCE: Just by [INAUDIBLE]. 1169 00:44:18,300 --> 00:44:18,510 JOSH TENENBAUM: Right. 1170 00:44:18,510 --> 00:44:20,260 Every time he hits one of those platforms, 1171 00:44:20,260 --> 00:44:21,820 a brick comes into play. 1172 00:44:21,820 --> 00:44:26,160 And then what, when you say he has to build an igloo? 1173 00:44:26,160 --> 00:44:28,395 AUDIENCE: [INAUDIBLE] 1174 00:44:28,395 --> 00:44:29,270 JOSH TENENBAUM: Yeah. 1175 00:44:29,270 --> 00:44:30,938 And then what happens? 1176 00:44:30,938 --> 00:44:31,906 AUDIENCE: [INAUDIBLE] 1177 00:44:31,906 --> 00:44:33,842 JOSH TENENBAUM: What, sir? 1178 00:44:33,842 --> 00:44:38,200 AUDIENCE: [INAUDIBLE] 1179 00:44:38,200 --> 00:44:40,070 JOSH TENENBAUM: Right. 1180 00:44:40,070 --> 00:44:41,380 He goes in. 1181 00:44:41,380 --> 00:44:44,320 The level ends, he gets some score for. 1182 00:44:44,320 --> 00:44:45,360 What about these things? 1183 00:44:45,360 --> 00:44:48,430 What are these, those little dust on the screen? 1184 00:44:48,430 --> 00:44:50,125 AUDIENCE: Avoid them. 1185 00:44:50,125 --> 00:44:51,250 JOSH TENENBAUM: Avoid them. 1186 00:44:51,250 --> 00:44:51,749 Yeah. 1187 00:44:51,749 --> 00:44:53,537 How do you know? 1188 00:44:53,537 --> 00:44:55,328 AUDIENCE: He doesn't actually [INAUDIBLE].. 1189 00:44:55,328 --> 00:44:56,240 AUDIENCE: We haven't seen an example. 1190 00:44:56,240 --> 00:44:57,260 JOSH TENENBAUM: Yeah. 1191 00:44:57,260 --> 00:44:58,470 Well, an example of what? 1192 00:44:58,470 --> 00:45:00,010 We don't know what's going to happen if he hits one. 1193 00:45:00,010 --> 00:45:00,290 AUDIENCE: We assume [INAUDIBLE]. 1194 00:45:00,290 --> 00:45:02,250 JOSH TENENBAUM: But somehow, we assume-- well, 1195 00:45:02,250 --> 00:45:03,249 it's just an assumption. 1196 00:45:03,249 --> 00:45:07,224 I think we very reasonably infer that there's something bad 1197 00:45:07,224 --> 00:45:08,390 will happen if he hits them. 1198 00:45:08,390 --> 00:45:10,431 Now, do you remember of some of the other objects 1199 00:45:10,431 --> 00:45:11,840 that we saw on the second screen? 1200 00:45:11,840 --> 00:45:13,460 There were these fish, yeah. 1201 00:45:13,460 --> 00:45:16,106 What happens if he hits those? 1202 00:45:16,106 --> 00:45:17,510 AUDIENCE: He gets more points 1203 00:45:17,510 --> 00:45:18,410 JOSH TENENBAUM: He gets points, yeah. 1204 00:45:18,410 --> 00:45:20,490 And he went out of his way to actually get them. 1205 00:45:20,490 --> 00:45:20,990 OK. 1206 00:45:20,990 --> 00:45:24,132 So you basically figured it out, right? 
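Just to make the contrast vivid, the kind of thing the audience extracted in that minute might be written down, very roughly, as a little symbolic world model: objects, causal rules, goals, and subgoals. The sketch below is invented notation, not a real system; the rules are simply the ones called out in the exchange above.

```python
# A sketch (invented notation) of the structured model a person seems to build
# from a minute of watching Frostbite, which can then support planning.
world_model = {
    "objects": ["player", "ice_platform", "bird", "fish", "igloo"],
    "causal_rules": [
        {"if": ("player", "lands_on", "white_platform"),
         "then": ["platform_turns_blue", "beep", "igloo_gains_brick"]},
        {"if": ("player", "touches", "bird"),
         "then": ["something_bad"]},          # inferred, never actually observed
        {"if": ("player", "touches", "fish"),
         "then": ["gain_points"]},
        {"if": ("igloo", "is", "complete"),
         "then": ["can_enter_igloo", "level_ends", "gain_points"]},
    ],
    "goal": "finish_level",
    "subgoals": ["turn_all_platforms_blue", "complete_igloo", "enter_igloo"],
}

def useful_actions(model):
    """Toy 'planning': pick out the events worth causing, given the rules."""
    good = {"igloo_gains_brick", "gain_points", "can_enter_igloo", "level_ends"}
    return [rule["if"] for rule in model["causal_rules"]
            if good & set(rule["then"])]

print(useful_actions(world_model))
```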
1207 00:45:24,132 --> 00:45:26,090 It only took you really literally just a minute 1208 00:45:26,090 --> 00:45:29,789 of watching this game to figure out a lot. 1209 00:45:29,789 --> 00:45:31,580 Now, if you actually went to go and play it 1210 00:45:31,580 --> 00:45:33,170 after a minute of experience, you 1211 00:45:33,170 --> 00:45:34,460 wouldn't be that good, right? 1212 00:45:34,460 --> 00:45:37,430 It turns out that it's hard to coordinate all these moves. 1213 00:45:37,430 --> 00:45:40,760 But you would be kind of excited and frustrated, 1214 00:45:40,760 --> 00:45:43,640 which is the experience of a good video game, right? 1215 00:45:43,640 --> 00:45:45,916 Anybody remember the Flappy Bird phenomenon? 1216 00:45:45,916 --> 00:45:46,630 AUDIENCE: Yeah 1217 00:45:46,630 --> 00:45:47,000 JOSH TENENBAUM: Right. 1218 00:45:47,000 --> 00:45:48,920 This was this, like, sensation, this game that 1219 00:45:48,920 --> 00:45:50,300 was like the stupidest game. 1220 00:45:50,300 --> 00:45:52,190 I mean, it seemed like it should be trivial, 1221 00:45:52,190 --> 00:45:53,240 and yet it was really hard. 1222 00:45:53,240 --> 00:45:54,410 But, again, you just watch it for a second, 1223 00:45:54,410 --> 00:45:55,370 you know exactly what you're supposed to do. 1224 00:45:55,370 --> 00:45:56,953 You think you can do it, but it's just 1225 00:45:56,953 --> 00:45:59,010 hard to get the rhythms down for most people. 1226 00:45:59,010 --> 00:46:00,200 And certainly, this game is a little bit hard 1227 00:46:00,200 --> 00:46:01,310 to time the rhythms. 1228 00:46:01,310 --> 00:46:03,350 But what you do when you play this game is 1229 00:46:03,350 --> 00:46:05,690 you get, from one minute, you build that whole model 1230 00:46:05,690 --> 00:46:08,330 of the world, the causal relations, the goals, 1231 00:46:08,330 --> 00:46:09,860 the subgoals. 1232 00:46:09,860 --> 00:46:11,810 And you can formulate clearly what 1233 00:46:11,810 --> 00:46:13,460 are the right kinds of plans. 1234 00:46:13,460 --> 00:46:15,610 But to actually implement them in real time, 1235 00:46:15,610 --> 00:46:18,050 but without getting killed is a little bit harder. 1236 00:46:18,050 --> 00:46:20,030 And you could say that, you know, 1237 00:46:20,030 --> 00:46:21,570 when the child is learning to walk 1238 00:46:21,570 --> 00:46:23,278 there's a similar kind of thing going on, 1239 00:46:23,278 --> 00:46:26,150 except usually without the danger of getting killed, just 1240 00:46:26,150 --> 00:46:27,740 danger falling over a little bit. 1241 00:46:27,740 --> 00:46:29,540 OK. 1242 00:46:29,540 --> 00:46:31,366 Contrast that learning dynamics-- which, 1243 00:46:31,366 --> 00:46:32,990 again, I'm just describing anecdotally. 
1244 00:46:32,990 --> 00:46:35,120 One of the things we'd like to do actually as one 1245 00:46:35,120 --> 00:46:37,100 of our center activities and it's a possible project 1246 00:46:37,100 --> 00:46:39,350 for students, either in our center or some of you guys 1247 00:46:39,350 --> 00:46:41,550 if you're interested-- it's a big possible project-- 1248 00:46:41,550 --> 00:46:43,300 is to actually measure this, like actually 1249 00:46:43,300 --> 00:46:46,290 study what do people learn from just a minute or two 1250 00:46:46,290 --> 00:46:48,890 or very, very quick learning experience 1251 00:46:48,890 --> 00:46:51,710 with these kinds of games, whether they're 1252 00:46:51,710 --> 00:46:53,377 adults like us who've played other games 1253 00:46:53,377 --> 00:46:54,835 or even young children who've never 1254 00:46:54,835 --> 00:46:56,030 played a video game before. 1255 00:46:56,030 --> 00:46:59,157 But I think what we will find is the kind of learning dynamic 1256 00:46:59,157 --> 00:46:59,990 that I'm describing. 1257 00:46:59,990 --> 00:47:01,323 It will be tricky to measure it. 1258 00:47:01,323 --> 00:47:03,197 But I'm sure we can. 1259 00:47:03,197 --> 00:47:05,780 And it'll be very different from the kind of learning dynamics 1260 00:47:05,780 --> 00:47:08,170 that you get from these deep reinforcement networks. 1261 00:47:08,170 --> 00:47:11,074 Here, this is an example of their learning curves 1262 00:47:11,074 --> 00:47:12,740 which comes not from the DeepMind paper, 1263 00:47:12,740 --> 00:47:14,864 but from some slightly more recent work from Pieter 1264 00:47:14,864 --> 00:47:16,687 Abbeel's group which basically builds 1265 00:47:16,687 --> 00:47:18,770 on the same architecture, but shows how to improve 1266 00:47:18,770 --> 00:47:22,400 the exploration part of it in order to improve dramatically 1267 00:47:22,400 --> 00:47:24,470 on some games, including this Frostbite game. 1268 00:47:24,470 --> 00:47:28,440 So this is the learning curve for this game you just saw. 1269 00:47:28,440 --> 00:47:32,180 The black dashed line is the DeepMind system 1270 00:47:32,180 --> 00:47:33,140 from the Nature paper. 1271 00:47:33,140 --> 00:47:35,390 And they will tell you that their current system 1272 00:47:35,390 --> 00:47:36,020 is much better. 1273 00:47:36,020 --> 00:47:37,730 So I don't know how much better. 1274 00:47:37,730 --> 00:47:41,930 But, anyway, just to be fair, right? 1275 00:47:41,930 --> 00:47:46,520 And, again, I'm essentially criticizing these approaches 1276 00:47:46,520 --> 00:47:48,210 saying, from a human point of view, 1277 00:47:48,210 --> 00:47:48,990 they're very different from humans. 1278 00:47:48,990 --> 00:47:51,950 That's not to take away from the really impressive engineering 1279 00:47:51,950 --> 00:47:53,952 in AI, machine learning accomplishments 1280 00:47:53,952 --> 00:47:55,160 that these systems are doing. 1281 00:47:55,160 --> 00:47:56,831 I think they are really interesting. 1282 00:47:56,831 --> 00:47:57,830 They're really valuable. 1283 00:47:57,830 --> 00:48:02,240 They have scientific value as well as engineering value. 1284 00:48:02,240 --> 00:48:05,450 I just want to draw the contrast between what they're doing 1285 00:48:05,450 --> 00:48:07,910 and some other really important scientific and engineering 1286 00:48:07,910 --> 00:48:09,493 questions that are the ones that we're 1287 00:48:09,493 --> 00:48:12,330 trying to talk about here. 1288 00:48:12,330 --> 00:48:14,692 So the DeepMind system is the black dashed line. 
1289 00:48:14,692 --> 00:48:17,150 And then the red and blue curves are two different versions 1290 00:48:17,150 --> 00:48:19,622 of the system from Pieter Abbeel's group, which 1291 00:48:19,622 --> 00:48:21,080 is basically the same architecture, 1292 00:48:21,080 --> 00:48:23,240 but it just explores a little bit better. 1293 00:48:23,240 --> 00:48:26,960 And you can see that the x-axis is the amount of experience. 1294 00:48:26,960 --> 00:48:28,020 It's in training epochs. 1295 00:48:28,020 --> 00:48:29,540 But I think, if I understand correctly, 1296 00:48:29,540 --> 00:48:30,740 it's roughly proportional to like 1297 00:48:30,740 --> 00:48:31,948 hours of gameplay experience. 1298 00:48:31,948 --> 00:48:34,850 So 100 is like 100 hours. 1299 00:48:34,850 --> 00:48:37,580 At the end, the DeepQ network in the Nature paper 1300 00:48:37,580 --> 00:48:38,550 trained up for 1,000. 1301 00:48:38,550 --> 00:48:41,150 And you're showing there the asymptote. 1302 00:48:41,150 --> 00:48:43,550 That's the horizontal dashed line. 1303 00:48:43,550 --> 00:48:47,250 And then this line here is what it does 1304 00:48:47,250 --> 00:48:48,860 after about 100 iterations. 1305 00:48:48,860 --> 00:48:50,960 And you can see it's basically asymptoted 1306 00:48:50,960 --> 00:48:53,880 in that after 10 times as much, there's a time lapse here, 1307 00:48:53,880 --> 00:48:54,380 right? 1308 00:48:54,380 --> 00:48:55,640 10 times as much, it gets up to about there. 1309 00:48:55,640 --> 00:48:56,300 OK. 1310 00:48:56,300 --> 00:48:59,810 And impressively, Abbeel's group system does much better. 1311 00:48:59,810 --> 00:49:01,970 After only 100 hours, it's already 1312 00:49:01,970 --> 00:49:04,190 twice as good as that system. 1313 00:49:04,190 --> 00:49:06,680 But, again, contrast this with humans, both 1314 00:49:06,680 --> 00:49:09,050 what a human would do and also where 1315 00:49:09,050 --> 00:49:10,820 the human knowledge is, right? 1316 00:49:14,260 --> 00:49:15,860 I mean, the human game player that you 1317 00:49:15,860 --> 00:49:20,060 saw in here, by the time it's finished the first screen, 1318 00:49:20,060 --> 00:49:24,290 is already like up here, so after about a minute of play. 1319 00:49:24,290 --> 00:49:27,290 Now, again, you wouldn't be able to be that good after a minute. 1320 00:49:27,290 --> 00:49:29,750 But essentially, the difference between these systems 1321 00:49:29,750 --> 00:49:34,130 is that the DeepQ network never gets past the first screen even 1322 00:49:34,130 --> 00:49:35,480 with 1,000 hours. 1323 00:49:35,480 --> 00:49:38,390 And this other one gets past the first screen in 100 hours, 1324 00:49:38,390 --> 00:49:40,760 kind of gets to about the second screen. 1325 00:49:40,760 --> 00:49:42,800 It's sort of midway through the second screen. 1326 00:49:42,800 --> 00:49:44,480 In this domain, it's really interesting 1327 00:49:44,480 --> 00:49:47,862 to think about not what happens scientifically. 1328 00:49:47,862 --> 00:49:49,820 It's really interesting to think about not what 1329 00:49:49,820 --> 00:49:52,644 happens when you had 1,000 hours of experience 1330 00:49:52,644 --> 00:49:55,060 with no prior knowledge, because humans just don't do that 1331 00:49:55,060 --> 00:49:56,740 on this or really any other task that we 1332 00:49:56,740 --> 00:49:57,890 can study experimentally. 1333 00:49:57,890 --> 00:50:00,820 But you can study what humans do in the first minute, which is 1334 00:50:00,820 --> 00:50:02,742 just this blip like right here. 
1335 00:50:02,742 --> 00:50:05,200 I think if we could get the right learning curve, you know, 1336 00:50:05,200 --> 00:50:07,283 what you'd see is that humans are going like this. 1337 00:50:07,283 --> 00:50:09,830 And they may asymptote well before any of these systems do. 1338 00:50:09,830 --> 00:50:11,710 But the interesting human learning part 1339 00:50:11,710 --> 00:50:15,090 is what's going on in the first minute, more or less 1340 00:50:15,090 --> 00:50:19,090 or the first hour, with all of the knowledge that you bring 1341 00:50:19,090 --> 00:50:21,579 to this task as well as how did you build up 1342 00:50:21,579 --> 00:50:22,370 all that knowledge. 1343 00:50:22,370 --> 00:50:24,190 So you want to talk about learning to learn 1344 00:50:24,190 --> 00:50:26,440 and multiple task learning, so that's all there, too. 1345 00:50:26,440 --> 00:50:27,760 I'm just saying in this one game that's 1346 00:50:27,760 --> 00:50:29,230 what you can study I think, or that's 1347 00:50:29,230 --> 00:50:31,510 where the heart of the matter is of human intelligence 1348 00:50:31,510 --> 00:50:32,996 in this setting. 1349 00:50:32,996 --> 00:50:34,370 And I think we should study that. 1350 00:50:34,370 --> 00:50:35,440 So, you know, what I've been trying 1351 00:50:35,440 --> 00:50:37,940 to do here for the last hour is motivate the kinds of things 1352 00:50:37,940 --> 00:50:39,700 we should study if we want to understand 1353 00:50:39,700 --> 00:50:42,790 the aspect of intelligence that we could call explaining, 1354 00:50:42,790 --> 00:50:45,710 understanding, the heart of building causal 1355 00:50:45,710 --> 00:50:46,550 models of the world. 1356 00:50:46,550 --> 00:50:47,380 We can do it. 1357 00:50:47,380 --> 00:50:49,360 But we have to do it a little bit differently. 1358 00:50:49,360 --> 00:50:52,240 In a flash, that's the first problem, I started with. 1359 00:50:52,240 --> 00:50:55,390 How do we learn a generalizable concept from just one example? 1360 00:50:55,390 --> 00:50:56,920 How can we discover causal relations 1361 00:50:56,920 --> 00:50:59,260 from just a single observed event, like that, you know, 1362 00:50:59,260 --> 00:51:02,710 jumping on the block and the beep and so on, 1363 00:51:02,710 --> 00:51:05,720 which sometimes can go wrong like any other perceptual 1364 00:51:05,720 --> 00:51:06,220 process? 1365 00:51:06,220 --> 00:51:07,030 You can have illusions. 1366 00:51:07,030 --> 00:51:08,980 You can see an accident that isn't quite right. 1367 00:51:08,980 --> 00:51:11,521 And then you move your head, and you see something different. 1368 00:51:11,521 --> 00:51:14,095 Or you go into the game, and you realize that it's not just 1369 00:51:14,095 --> 00:51:16,220 touching blocks that makes the temperature go down, 1370 00:51:16,220 --> 00:51:17,950 but it's just time. 1371 00:51:17,950 --> 00:51:21,135 How do we see forces, physics, and see inside of other minds 1372 00:51:21,135 --> 00:51:22,510 even if they're just a few shapes 1373 00:51:22,510 --> 00:51:24,260 moving around in two dimensions? 1374 00:51:24,260 --> 00:51:27,670 How do we learn to play games and act in a whole new world 1375 00:51:27,670 --> 00:51:30,735 in just under a minute, right? 
1376 00:51:30,735 --> 00:51:32,110 And then there's all the problems 1377 00:51:32,110 --> 00:51:34,151 of language, which I'm not going to go into, 1378 00:51:34,151 --> 00:51:35,650 like understanding what we're saying 1379 00:51:35,650 --> 00:51:36,899 and what you're reading here-- 1380 00:51:36,899 --> 00:51:38,290 also, versions of these problems. 1381 00:51:38,290 --> 00:51:41,200 And our goal in our field is to understand this in engineering 1382 00:51:41,200 --> 00:51:43,090 terms, to have a computational framework that 1383 00:51:43,090 --> 00:51:45,790 explains how this is even possible and, in particular, 1384 00:51:45,790 --> 00:51:47,450 then how people do it. 1385 00:51:47,450 --> 00:51:48,650 OK. 1386 00:51:48,650 --> 00:51:51,901 Now, you know, in some sense cognitive scientists 1387 00:51:51,901 --> 00:51:54,400 and researchers, we're not the first people to work on this. 1388 00:51:54,400 --> 00:51:56,800 Philosophers have talked about this kind of thing 1389 00:51:56,800 --> 00:51:59,455 for thousands of years in the Western tradition. 1390 00:51:59,455 --> 00:52:02,375 It's a version of the problem of induction, the problem of how 1391 00:52:02,375 --> 00:52:04,250 do you know the sun is going to rise tomorrow 1392 00:52:04,250 --> 00:52:07,210 or just generalizing from experience. 1393 00:52:07,210 --> 00:52:09,400 And for as long as people have studied this problem, 1394 00:52:09,400 --> 00:52:11,770 the answer has always been clear in some form 1395 00:52:11,770 --> 00:52:13,840 that, again, it has to be about the knowledge 1396 00:52:13,840 --> 00:52:16,420 that you bring to the situation that gives you 1397 00:52:16,420 --> 00:52:18,460 the constraints that allows you to fill in 1398 00:52:18,460 --> 00:52:20,300 from this very sparse data. 1399 00:52:20,300 --> 00:52:24,354 But, again, if you're dissatisfied with that 1400 00:52:24,354 --> 00:52:26,020 is the answer, of course, you should be. 1401 00:52:26,020 --> 00:52:26,920 That's not really the answer. 1402 00:52:26,920 --> 00:52:28,670 That just raises the real problems, right? 1403 00:52:28,670 --> 00:52:30,220 And these are the problems that I 1404 00:52:30,220 --> 00:52:32,350 want to try to address in the more 1405 00:52:32,350 --> 00:52:37,230 substantive part of the morning, which is these questions here. 1406 00:52:37,230 --> 00:52:40,380 So how do you actually use knowledge 1407 00:52:40,380 --> 00:52:41,850 to guide learning from sparse data? 1408 00:52:41,850 --> 00:52:43,400 What form does it take? 1409 00:52:43,400 --> 00:52:44,834 How can we describe the knowledge? 1410 00:52:44,834 --> 00:52:46,500 And how can we explain how it's learned? 1411 00:52:46,500 --> 00:52:49,230 How is that knowledge itself constructed 1412 00:52:49,230 --> 00:52:51,150 from other kinds of experiences you 1413 00:52:51,150 --> 00:52:54,660 have combined with whatever, you know, your genes have set up 1414 00:52:54,660 --> 00:52:55,710 for you? 1415 00:52:55,710 --> 00:52:58,140 And I'm going to be talking about this approach. 1416 00:52:58,140 --> 00:52:59,925 And you know, again, really think 1417 00:52:59,925 --> 00:53:01,800 of this as the introduction to the whole day. 1418 00:53:01,800 --> 00:53:03,674 Because you're going to see a couple of hours 1419 00:53:03,674 --> 00:53:06,900 from me and then also from Tomer more hands on in the afternoon. 1420 00:53:06,900 --> 00:53:08,012 This is our approach. 1421 00:53:08,012 --> 00:53:09,720 You can give it different kinds of names. 
1422 00:53:09,720 --> 00:53:11,303 I guess I called it generative models, 1423 00:53:11,303 --> 00:53:13,830 because that's what Tommy likes to call it in CBMM. 1424 00:53:13,830 --> 00:53:14,620 And that's fine. 1425 00:53:14,620 --> 00:53:16,320 Like any other approach, you know, 1426 00:53:16,320 --> 00:53:19,290 there's no one word that captures what it's about. 1427 00:53:19,290 --> 00:53:20,910 But these are the key ideas that we're 1428 00:53:20,910 --> 00:53:23,100 going to be talking about. 1429 00:53:23,100 --> 00:53:25,620 We're going to talk a lot about generative models 1430 00:53:25,620 --> 00:53:26,670 in a probabilistic sense. 1431 00:53:26,670 --> 00:53:29,250 So what it means to have a generative model 1432 00:53:29,250 --> 00:53:34,620 is to be able to describe the joint distribution in some form 1433 00:53:34,620 --> 00:53:37,750 over your observable data together with some kind of latent variables, 1434 00:53:37,750 --> 00:53:38,250 right? 1435 00:53:38,250 --> 00:53:39,770 And then you can do probabilistic inference 1436 00:53:39,770 --> 00:53:41,370 or Bayesian inference, which means 1437 00:53:41,370 --> 00:53:43,504 conditioning on some of the outputs 1438 00:53:43,504 --> 00:53:45,420 of that generative model and making inferences 1439 00:53:45,420 --> 00:53:47,580 about the latent structure, the hidden variables, 1440 00:53:47,580 --> 00:53:48,900 as well as the other things. 1441 00:53:48,900 --> 00:53:52,500 But crucially, there's lots of probabilistic models, 1442 00:53:52,500 --> 00:53:55,050 but these ones have very particular kinds of structures, 1443 00:53:55,050 --> 00:53:55,550 right? 1444 00:53:55,550 --> 00:53:57,930 So the probabilities are not just defined 1445 00:53:57,930 --> 00:53:59,040 in statisticians' terms. 1446 00:53:59,040 --> 00:54:01,710 But they're defined on some kind of interestingly structured 1447 00:54:01,710 --> 00:54:03,780 representation that can actually capture 1448 00:54:03,780 --> 00:54:06,347 the causal and compositional things we're talking about, 1449 00:54:06,347 --> 00:54:08,430 that can capture the causal structure of the world 1450 00:54:08,430 --> 00:54:12,660 in a composable way that can support the kind of flexibility 1451 00:54:12,660 --> 00:54:14,800 of learning and planning that we're talking about. 1452 00:54:14,800 --> 00:54:18,030 So a key part of how you do this sort of work 1453 00:54:18,030 --> 00:54:20,940 is to understand how to build probabilistic models 1454 00:54:20,940 --> 00:54:25,050 and do inference over various kinds of richly structured 1455 00:54:25,050 --> 00:54:26,502 symbolic representations. 1456 00:54:26,502 --> 00:54:27,960 And this is the sort of thing which 1457 00:54:27,960 --> 00:54:30,720 is a fairly new technical advance, right? 1458 00:54:30,720 --> 00:54:33,300 If you look in the history of AI as well as 1459 00:54:33,300 --> 00:54:34,971 in cognitive science, there's been 1460 00:54:34,971 --> 00:54:37,470 a lot of back and forth between people emphasizing these two 1461 00:54:37,470 --> 00:54:40,380 big ideas, the ideas of statistics and symbols 1462 00:54:40,380 --> 00:54:41,462 if you like, right? 1463 00:54:41,462 --> 00:54:43,170 And there's a long history of people sort 1464 00:54:43,170 --> 00:54:45,820 of saying one of these is going to explain everything 1465 00:54:45,820 --> 00:54:48,000 and the other one is not going to explain very much 1466 00:54:48,000 --> 00:54:50,220 or isn't even real, right?
1467 00:54:50,220 --> 00:54:52,819 For example, some of the debates between Chomsky, in language 1468 00:54:52,819 --> 00:54:55,110 and cognitive science, and the people who came before him 1469 00:54:55,110 --> 00:54:57,870 and the people who came after him had this character, right? 1470 00:54:57,870 --> 00:54:59,880 Or some of the debates in AI in the first wave 1471 00:54:59,880 --> 00:55:03,560 of neural networks, people like Minsky, for example, 1472 00:55:03,560 --> 00:55:06,900 and some of the neural network people 1473 00:55:06,900 --> 00:55:09,090 like Jay McClelland initially-- 1474 00:55:09,090 --> 00:55:13,601 I mean, I'm mixing up chronology there. 1475 00:55:13,601 --> 00:55:14,100 I'm sorry. 1476 00:55:14,100 --> 00:55:16,920 But you know, you see this every time whether it's 1477 00:55:16,920 --> 00:55:18,630 in the '60s or the '80s or now. 1478 00:55:18,630 --> 00:55:23,040 You know, there's a discourse in our field, which 1479 00:55:23,040 --> 00:55:24,290 is a really interesting one. 1480 00:55:24,290 --> 00:55:26,389 I think, ultimately, we have to go beyond it. 1481 00:55:26,389 --> 00:55:27,930 And what's so exciting is that we are 1482 00:55:27,930 --> 00:55:29,500 starting to go beyond it. 1483 00:55:29,500 --> 00:55:31,892 But there's been this discourse of people really saying, 1484 00:55:31,892 --> 00:55:33,600 you know, the heart of human intelligence 1485 00:55:33,600 --> 00:55:35,814 is some kind of rich symbolic structures. 1486 00:55:35,814 --> 00:55:37,980 Oh, and there's some other people who said something 1487 00:55:37,980 --> 00:55:38,970 about statistics. 1488 00:55:38,970 --> 00:55:42,411 But that's like trivial or uninteresting or never going 1489 00:55:42,411 --> 00:55:42,910 to amount to anything. 1490 00:55:42,910 --> 00:55:45,160 And then some other people often responding 1491 00:55:45,160 --> 00:55:47,160 to those first people-- it's very much a back 1492 00:55:47,160 --> 00:55:49,870 and forth debate. 1493 00:55:49,870 --> 00:55:53,590 It gets very acrimonious and emotional saying, you know, 1494 00:55:53,590 --> 00:55:55,680 no, those symbols are magical, mysterious things, 1495 00:55:55,680 --> 00:55:59,580 completely ridiculous, totally useless, never worked. 1496 00:55:59,580 --> 00:56:03,240 It's really all about statistics. 1497 00:56:03,240 --> 00:56:05,370 And somehow something kind of maybe like symbols 1498 00:56:05,370 --> 00:56:06,630 will emerge from those. 1499 00:56:06,630 --> 00:56:08,760 And I think we as a field are learning 1500 00:56:08,760 --> 00:56:10,710 that neither of those extreme views 1501 00:56:10,710 --> 00:56:12,820 is going to get us anywhere really quite honestly 1502 00:56:12,820 --> 00:56:14,910 and that we have to understand-- among other things. 1503 00:56:14,910 --> 00:56:16,110 It's not the only thing we have to understand. 1504 00:56:16,110 --> 00:56:17,651 But a big thing we have to understand 1505 00:56:17,651 --> 00:56:19,200 and are starting to understand is 1506 00:56:19,200 --> 00:56:22,020 how to do probabilistic inference over richly 1507 00:56:22,020 --> 00:56:23,910 structured symbolic objects.
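To give a tiny, concrete picture of what probabilistic inference over a richly structured symbolic object can mean, here is a minimal sketch: a toy probabilistic grammar defines a distribution over parse trees and the word strings they yield, and conditioning on the observed beginning of a sentence gives a posterior over the full symbolic structures that could have produced it. The grammar and the probabilities are invented for illustration.

```python
# A minimal sketch: a probability distribution over symbolic structures (parse
# trees), plus Bayesian conditioning by enumeration. Grammar is invented.
from itertools import product

RULES = {
    "S":  [(("NP", "VP"), 1.0)],
    "NP": [(("dogs",), 0.5), (("cats",), 0.5)],
    "VP": [(("V",), 0.4), (("V", "NP"), 0.6)],
    "V":  [(("chase",), 0.5), (("sleep",), 0.5)],
}

def expand(symbol):
    """Enumerate (tree, words, probability) for everything this symbol can generate."""
    if symbol not in RULES:                      # terminal word
        yield symbol, [symbol], 1.0
        return
    for rhs, p_rule in RULES[symbol]:
        child_options = [list(expand(child)) for child in rhs]
        for combo in product(*child_options):    # every way of expanding the children
            tree = (symbol, [c[0] for c in combo])
            words = [w for c in combo for w in c[1]]
            prob = p_rule
            for c in combo:
                prob *= c[2]
            yield tree, words, prob

def posterior_given_prefix(prefix):
    """Condition on the observed first words; infer the full symbolic structure."""
    consistent = [(tree, words, p) for tree, words, p in expand("S")
                  if words[:len(prefix)] == prefix]
    z = sum(p for _, _, p in consistent)
    return [(words, tree, p / z) for tree, words, p in consistent]

for words, tree, p in posterior_given_prefix(["dogs"]):
    print(round(p, 3), words, tree)
```

In a real probabilistic programming system the brute-force enumeration would be replaced by more general inference, but the shape of the computation is the same: condition on what you observe, infer the latent symbolic structure.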
1508 00:56:23,910 --> 00:56:27,150 And that means both using interesting symbolic structures 1509 00:56:27,150 --> 00:56:29,460 to define the priors for probabilistic inference, 1510 00:56:29,460 --> 00:56:32,400 but also-- and this moves more into the third topic-- 1511 00:56:32,400 --> 00:56:34,710 being able to think about learning 1512 00:56:34,710 --> 00:56:37,140 interesting symbolic representations as a kind 1513 00:56:37,140 --> 00:56:38,430 of probabilistic inference. 1514 00:56:38,430 --> 00:56:41,727 And to do that, we need to combine statistics and symbols 1515 00:56:41,727 --> 00:56:43,560 with some kind of notion of what's sometimes 1516 00:56:43,560 --> 00:56:45,730 called hierarchical probabilistic models. 1517 00:56:45,730 --> 00:56:48,750 Or it's a certain kind of recursive generative model 1518 00:56:48,750 --> 00:56:51,345 where you don't just have a generative model that 1519 00:56:51,345 --> 00:56:53,220 has some latent variables which then generate 1520 00:56:53,220 --> 00:56:54,540 your observable experience, but where 1521 00:56:54,540 --> 00:56:56,010 you have hierarchies of these things-- 1522 00:56:56,010 --> 00:56:58,260 so generative models for generative models or priors 1523 00:56:58,260 --> 00:56:58,920 on priors. 1524 00:56:58,920 --> 00:57:01,420 If you've heard of hierarchical Bayes or hierarchical models 1525 00:57:01,420 --> 00:57:03,414 and statistics, it's a version of the idea. 1526 00:57:03,414 --> 00:57:05,580 But it's sort of a more general version of that idea 1527 00:57:05,580 --> 00:57:08,940 where the hypothesis space and priors for Bayesian 1528 00:57:08,940 --> 00:57:11,220 inference that, you know, you see in the simplest 1529 00:57:11,220 --> 00:57:13,290 version of Bayes' rule, are not considered 1530 00:57:13,290 --> 00:57:17,960 to be just some fixed thing that you write down and wire up 1531 00:57:17,960 --> 00:57:18,720 and that's it. 1532 00:57:18,720 --> 00:57:20,220 But rather, they themselves could 1533 00:57:20,220 --> 00:57:23,036 be generated by some higher level or more abstract 1534 00:57:23,036 --> 00:57:24,660 probabilistic model, a hypothesis space 1535 00:57:24,660 --> 00:57:26,610 of hypothesis spaces, or priors on priors, 1536 00:57:26,610 --> 00:57:29,410 or a generative model for generative models. 1537 00:57:29,410 --> 00:57:31,620 And, again, there's a long history of that idea. 1538 00:57:31,620 --> 00:57:33,990 So, for example, some really interesting early work 1539 00:57:33,990 --> 00:57:36,330 on grammar induction in the 1960s 1540 00:57:36,330 --> 00:57:38,300 introduced something called grammar grammar, 1541 00:57:38,300 --> 00:57:40,880 where it used the grammar, a formal grammar, 1542 00:57:40,880 --> 00:57:45,774 to give a hypothesis base for grammars of languages, right? 1543 00:57:45,774 --> 00:57:47,690 But, again, what we're understanding how to do 1544 00:57:47,690 --> 00:57:51,410 is to combine this notion of a kind of recursive abstraction 1545 00:57:51,410 --> 00:57:52,797 with statistics and symbols. 1546 00:57:52,797 --> 00:57:54,380 And you put all those things together, 1547 00:57:54,380 --> 00:57:56,180 and you get a really powerful tool kit 1548 00:57:56,180 --> 00:57:57,860 for thinking about intelligence. 1549 00:57:57,860 --> 00:58:00,590 There's one other version of this big picture which you'll 1550 00:58:00,590 --> 00:58:03,740 hear about both in the morning and in the afternoon, which 1551 00:58:03,740 --> 00:58:05,880 is this idea of probabilistic programs. 
1552 00:58:05,880 --> 00:58:09,290 So when I would give this kind of tutorial introduction about 1553 00:58:09,290 --> 00:58:10,850 five years ago-- oops, sorry-- 1554 00:58:10,850 --> 00:58:13,190 I would say all of this. 1555 00:58:13,190 --> 00:58:15,590 But one of the really exciting recent developments 1556 00:58:15,590 --> 00:58:17,420 in the last few years is, in a sense, 1557 00:58:17,420 --> 00:58:20,520 a kind of unified language that puts all these things together. 1558 00:58:20,520 --> 00:58:22,490 So we can have a lot fewer words on the slide 1559 00:58:22,490 --> 00:58:24,781 and just say, oh, it's all a big probabilistic program. 1560 00:58:24,781 --> 00:58:27,200 I mean, that's simplifying a lot and leaving 1561 00:58:27,200 --> 00:58:28,610 out a lot of important stuff. 1562 00:58:28,610 --> 00:58:30,830 But the language of probabilistic programs 1563 00:58:30,830 --> 00:58:33,080 that you're going to see in little bits in my talks 1564 00:58:33,080 --> 00:58:35,450 and much more in the tutorial later on 1565 00:58:35,450 --> 00:58:37,760 is powerful partly-- or really 1566 00:58:37,760 --> 00:58:38,960 mainly-- 1567 00:58:38,960 --> 00:58:41,360 because it gives a unifying language and set 1568 00:58:41,360 --> 00:58:42,780 of tools for all of these things, 1569 00:58:42,780 --> 00:58:44,900 including probabilistic models defined 1570 00:58:44,900 --> 00:58:47,580 over all sorts of interesting symbolic structures. 1571 00:58:47,580 --> 00:58:50,540 In fact, any computable model-- any probabilistic model 1572 00:58:50,540 --> 00:58:53,660 defined on any representation that's computable-- 1573 00:58:53,660 --> 00:58:55,950 can be expressed as a probabilistic program. 1574 00:58:55,950 --> 00:58:59,450 It's where Turing-universal computation meets probability. 1575 00:58:59,450 --> 00:59:01,850 And everything about hierarchical models-- 1576 00:59:01,850 --> 00:59:03,890 generative models for generative models, 1577 00:59:03,890 --> 00:59:06,890 priors on priors, hypothesis spaces of hypothesis spaces-- 1578 00:59:06,890 --> 00:59:08,591 can be very naturally expressed in terms 1579 00:59:08,591 --> 00:59:11,090 of probabilistic programs, where basically you have programs 1580 00:59:11,090 --> 00:59:12,600 that generate other programs. 1581 00:59:12,600 --> 00:59:15,024 So if your model is a program, and it's 1582 00:59:15,024 --> 00:59:16,440 a probabilistic generative model-- 1583 00:59:16,440 --> 00:59:17,510 so it's a probabilistic program-- 1584 00:59:17,510 --> 00:59:19,337 and you want to put down a generative model 1585 00:59:19,337 --> 00:59:21,170 for generative models that can make learning 1586 00:59:21,170 --> 00:59:24,170 into inference recursively at higher levels of abstraction, 1587 00:59:24,170 --> 00:59:27,260 you just add a little bit more to the probabilistic program. 1588 00:59:27,260 --> 00:59:30,170 And so it's both a very beautiful and an extremely 1589 00:59:30,170 --> 00:59:33,362 useful model-building tool kit. 1590 00:59:33,362 --> 00:59:34,820 Now, there are a few other ideas that 1591 00:59:34,820 --> 00:59:37,281 go along with these things which I won't talk about.
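[Editor's note: as a toy illustration of the "programs that generate other programs" idea just described, here is a short sketch, again in plain Python rather than any of the probabilistic programming languages covered in the tutorial; every name in it is hypothetical. The generative model first samples a small arithmetic expression-- itself an executable program-- and then runs that program on a few inputs with observation noise, so inferring the program from data is learning treated as inference one level up.]

```python
import random

# A toy "program that generates programs": the generative model samples a
# small arithmetic expression (itself runnable code), then executes it on a
# few inputs with Gaussian observation noise. Purely illustrative; this is
# not the lecture's actual tool kit.

def sample_expression(depth=0):
    # With some probability, return a leaf (the input x or a small constant);
    # otherwise recurse and build a larger expression.
    if depth >= 2 or random.random() < 0.4:
        return "x" if random.random() < 0.5 else str(random.randint(1, 9))
    left = sample_expression(depth + 1)
    right = sample_expression(depth + 1)
    op = random.choice(["+", "*"])
    return "({} {} {})".format(left, op, right)

def run_generative_model():
    # Sample a program (the hypothesis), then generate observations
    # by running it with a little noise added.
    expr = sample_expression()
    program = eval("lambda x: " + expr)
    data = [(x, program(x) + random.gauss(0, 0.1)) for x in range(5)]
    return expr, data

expr, data = run_generative_model()
print("sampled program:", expr)
print("noisy observations:", [(x, round(y, 2)) for x, y in data])
```

[A real probabilistic programming language makes this pattern first-class, but the recursive structure-- a stochastic program whose output is the definition of another stochastic program-- is the core of the idea.]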
1592 00:59:37,281 --> 00:59:38,780 What I'm going to try 1593 00:59:38,780 --> 00:59:39,890 to do for the rest of the morning, 1594 00:59:39,890 --> 00:59:41,431 and what you'll see in the afternoon, 1595 00:59:41,431 --> 00:59:44,840 is just to give you various examples and ways to do things 1596 00:59:44,840 --> 00:59:46,370 with the ideas on these slides. 1597 00:59:46,370 --> 00:59:47,480 Now, there's some other stuff which 1598 00:59:47,480 --> 00:59:48,350 we won't say that much about-- 1599 00:59:48,350 --> 00:59:50,510 although I think Tomer, who just walked in-- hey-- 1600 00:59:50,510 --> 00:59:52,310 will talk a little about MCMC, right? 1601 00:59:52,310 --> 00:59:54,632 And we'll say a little bit about item four, 1602 00:59:54,632 --> 00:59:56,840 because it goes back to these questions I started off 1603 00:59:56,840 --> 00:59:58,412 with, which are also very pressing. 1604 00:59:58,412 --> 00:59:59,870 And they're really interesting ones 1605 00:59:59,870 --> 01:00:02,570 for where neural networks meet up with generative models. 1606 01:00:02,570 --> 01:00:05,660 You know, how can we do inference and learning so fast-- 1607 01:00:05,660 --> 01:00:08,210 not just from few examples-- that's what this stuff is 1608 01:00:08,210 --> 01:00:08,810 about-- 1609 01:00:08,810 --> 01:00:13,670 but very quickly in terms of time? 1610 01:00:13,670 --> 01:00:16,080 So we will say a little bit about that. 1611 01:00:16,080 --> 01:00:20,530 But all of these-- every item, every component of this approach-- 1612 01:00:20,530 --> 01:00:22,280 is a whole research area in and of itself. 1613 01:00:22,280 --> 01:00:24,655 There are people who spend their entire career these days 1614 01:00:24,655 --> 01:00:29,060 focusing on how to make item four work, and other people who 1615 01:00:29,060 --> 01:00:32,870 focus on how to use these kinds of rich probabilistic models 1616 01:00:32,870 --> 01:00:35,480 to guide planning and decision making, 1617 01:00:35,480 --> 01:00:37,090 or how to relate them to the brain. 1618 01:00:37,090 --> 01:00:40,074 Any one of these you could spend more than a career on. 1619 01:00:40,074 --> 01:00:41,990 But what's exciting to us is that with a bunch 1620 01:00:41,990 --> 01:00:44,600 of smart people working on these, and kind of developing 1621 01:00:44,600 --> 01:00:46,990 common languages to link up these questions, 1622 01:00:46,990 --> 01:00:50,570 I think we really are poised to make progress in my lifetime 1623 01:00:50,570 --> 01:00:52,840 and even more in yours.