The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JOSH TENENBAUM: I'm just going to give a bunch of examples of things that we in our field have done. Most of them are things I've played some role in; maybe it was a thesis project of a student. But they're meant to be representative of a broader set of work that many people have contributed to in developing this toolkit. And we're going to start from the beginning, in a sense, with some very simple things we did to look at ways in which probabilistic generative models can inform people's basic cognitive processes, and then build up to more interesting kinds of symbolically structured models, hierarchical models, and ultimately to these probabilistic programs for common sense.

When I say a lot of people have been doing this, here is just a small sample of those people. Every year or two I try to update this slide, but it's very much historically dated with people I knew when I was in grad school. There's a lot of really great work by younger people whose names haven't appeared on this slide, so those dot-dot-dots are extremely serious, and a lot of the best stuff is not included here. But over the last couple of decades, across basically all the different areas of cognitive science, covering basically all the different things that cognition does, there's been great progress building serious mathematical, reverse-engineering models, in the sense that they are quantitative models of human cognition phrased in the terms of engineering, the same terms you would use to build a robot to do these things, at least in principle. And what's been developing is this toolkit of probabilistic generative models.
I want to start off by telling you a little bit about some work I did together with Tom Griffiths. Tom is now a senior faculty member at Berkeley, one of the leaders in this field, as well as a leading person in machine learning, actually. One of the great things he's done is to take inspiration from human learning and develop fundamentally new kinds of probabilistic models, in nonparametric Bayes in particular. But when Tom was a grad student, we worked together. He was my first student; we're almost the same age, so at this point we're more like senior colleagues than student and advisor. I'll tell you about some work we did back when he was a student and I was just starting off.

We were both trying to tackle this problem, trying to see what the prospects are for understanding even very basic cognitive intuitions, like senses of similarity or the most basic kinds of causal discovery intuitions like the ones we were talking about before, using some idea of probabilistic inference in a generative model. At the time-- remember, in the introduction I was talking about how there's been this back-and-forth discourse over the decades, with people saying, yes, rah rah, statistics, and then, statistics, that's trivial and uninteresting. At the time we started doing this, at least in cognitive psychology, the idea that cognition could be seen as some kind of sophisticated statistical inference was very much not a popular one. But we thought it was fundamentally right in some ways. This was work we were doing in the early 2000s, when it was already very clear in machine learning and AI how transformative these ideas were in starting to build intelligent machines. So it seemed clear to us that it was at least a good hypothesis worth exploring and taking much more seriously than psychologists had before--
the hypothesis that this could also describe basic aspects of human thinking.

So I'll give you a couple of examples of what we did. Here's a simple kind of causal inference from coincidences, much like what you saw going on in the video game. There's no time in this one; it's really mostly just space, or maybe a little bit of time. The motivation was not a video game. To put a real-world context on it, imagine what are sometimes called cancer clusters or rare disease clusters. You can often read about these in the newspaper: somebody has seen some evidence suggestive of a possibly hidden environmental cause-- maybe a toxic chemical leak or something-- that seems to be responsible for a set of cases. Or maybe they don't have a cause in mind; they just see a suspicious coincidence, a few cases of some very rare disease that seem surprisingly clustered in space and time.

So for example, let's say this is one square mile of a city, and each dot represents one case of some very rare disease that occurred in the span of a year. You look at this and you might think, well, it doesn't look like those dots are completely uniformly, randomly distributed over the area. Maybe there's some weird thing going on in the upper left, the northwest corner-- who knows what-- making people sick. So let me just ask you: on a scale of 0 to 10, where 10 means you're sure there's something going on, some special cause in some part of this map, and 0 means you're quite sure there's nothing going on, it's just random, what do you say? To what extent does this give evidence for some hidden cause? Give me a number between 0 and 10.

AUDIENCE: 5.

JOSH TENENBAUM: OK, great. 5, 2, 7-- I heard a few examples of each of those. Perfect. That's exactly what people do. You could do the same thing on Mechanical Turk, get 10 times as much data, and pay a lot more, and it would be the same.
I'll show you the data in a second, but here's the model we built. It's a very simple kind of generative model with a hidden cause, of the sort that various people in statistics have worked with for a while. We're basically modeling the data as a mixture. Since it's a generative model, we have to model the whole data set. When we say there's a hidden cause, we don't necessarily mean that everything is caused by it. It's just that the data you see in this picture is a mixture of whatever the normal random process is, plus possibly some spatially localized cause with an unknown position, unknown extent-- maybe it's a very big region-- and unknown intensity-- maybe it's causing a lot of cases, maybe not that many.

The hypothesis space is maybe best visualized like this. Each of these squares is a different hypothesis: a mixture density, or mixture model, combining whatever the normal uniform process is that causes the disease independent of location with some Gaussian bump, which can vary in location, size, and intensity, and which is the possible hidden cause of some of these cases. What the model we proposed says is that your sense of spatial coincidence-- when you look at a pattern of dots and you see, oh, it looks like there's a hidden cluster there somewhere-- is basically you trying to see whether something like one of those hypotheses on the right is going on, as opposed to the null hypothesis of pure randomness. So we take the log likelihood ratio, comparing the probability of the data under the hypothesis that there's some interesting hidden cause, one of the things on the right, versus the alternative hypothesis that it's just random, which is the simple, completely uniform density. What makes this a little bit interesting computationally is that there's an infinite number of these possibilities on the right: an infinite number of different locations, sizes, and intensities for the Gaussian, and you have to integrate over all of them.
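To make that computation concrete, here is a minimal sketch in Python of the kind of calculation involved. It is not the published model: the priors over cluster center, width, and mixing weight, the use of a mixing weight as a stand-in for the cluster's intensity, and the Monte Carlo approximation of the integral are all illustrative assumptions.

```python
import numpy as np

def log_likelihood_uniform(points, area=1.0):
    # Null hypothesis: each point falls uniformly in a unit-area region.
    return len(points) * np.log(1.0 / area)

def log_likelihood_cluster(points, rng, n_samples=20000):
    # Alternative hypothesis: a mixture of the uniform background and a
    # Gaussian "bump" with unknown center, width, and mixing weight.
    # Integrate over those unknowns by Monte Carlo, sampling them from
    # illustrative priors (uniform center, log-uniform width, uniform
    # weight) -- assumptions for this sketch only.
    centers = rng.uniform(0.0, 1.0, size=(n_samples, 2))
    widths = np.exp(rng.uniform(np.log(0.02), np.log(0.5), size=n_samples))
    weights = rng.uniform(0.0, 1.0, size=n_samples)

    log_liks = np.zeros(n_samples)
    for i in range(n_samples):
        # Density of each point under this particular cluster hypothesis
        # (the Gaussian is not truncated to the square -- an approximation).
        d2 = np.sum((points - centers[i]) ** 2, axis=1)
        gauss = np.exp(-d2 / (2 * widths[i] ** 2)) / (2 * np.pi * widths[i] ** 2)
        mix = weights[i] * gauss + (1 - weights[i]) * 1.0  # uniform density = 1
        log_liks[i] = np.sum(np.log(mix))
    # Average the likelihood (not the log-likelihood) over the prior samples.
    return np.logaddexp.reduce(log_liks) - np.log(n_samples)

rng = np.random.default_rng(0)
points = rng.uniform(0, 1, size=(6, 2))                  # hypothetical data
points[:4] = 0.2 + 0.05 * rng.standard_normal((4, 2))    # four points bunched up

log_bf = log_likelihood_cluster(points, rng) - log_likelihood_uniform(points)
print(f"log evidence ratio in favor of a hidden cluster: {log_bf:.2f}")
```

The model's prediction for a rating task like the one above is monotonically related to this log ratio: the more tightly some of the points bunch up relative to how many there are, the larger it gets.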
Again, I'm not going to give a whole lot of mathematical detail here, but you can read about this in the papers we have on it. For those of you who are familiar with latent variable models, effectively what you're doing is integrating, either analytically or in a simulation, over all the possible models, and computing on average how much the evidence supports something like what you see on the right, one of those cluster possibilities, versus just the uniform density.

Now what I'm showing you is that model compared to people's judgments in an experiment. In this experiment, we showed people patterns like the one you just saw-- the one you saw is this one here-- but across the different stimuli we varied parameters we thought would be relevant: how many points there were in total, how strong the cluster was in various ways, whether it was very tightly clustered or very spread out, and the relative number of points in the cluster versus outside it. So what you can see here, for example, is a very similar geometry, except this is a biggish cluster, with basically four points that look clustered and two that don't, and in these cases we just make the four points more and more tightly clustered. Here, we go from having no points that look clustered to having almost all of the points look clustered, varying the ratio of clustered to non-clustered points. Here, we just change the overall number. Notice that this one is basically the same as this one: in both of these, we've got four clustered points and two seemingly non-clustered ones, and here we just scale up from four and two to eight and four.
And here we scale it down to two and one, and there are various other manipulations. What you can see is that these have systematic effects on people's judgments. What I'm calling the data here is the average of about 150 people who made the same judgment you did, 0 to 10. The one I gave you was this one here, and the average judgment was almost exactly five. And if you look at the variance, it looks just like what you saw here: some people say two or three, some people say seven. I chose one that was right in the middle.

The interesting thing is that, while you maybe felt like you were guessing-- and if you just listened to what everyone else was shouting out, maybe it sounded like random numbers-- that's not what you're doing. On that one it looks like it, because it's right at the threshold. But if you look across all these different patterns, what you see is that sometimes people give much higher numbers than others, and sometimes much lower. And the details of that variation, both within these different manipulations and across them, are almost perfectly captured by this very simple probabilistic generative model with a latent cause. The model predictions are shown here: basically, a high bar means strong evidence in favor of the hidden-latent-cause hypothesis-- some cluster, one or more-- and a low bar means strong evidence for the alternative hypothesis. The scale is a bit arbitrary-- it's a log probability ratio scale-- so I won't comment on it, but importantly, it's the same scale across all of these. So a big difference here is the same big difference in both cases.
And I think this is fairly good evidence that this model is capturing your sense of spatial coincidence, and showing that it's not just random or arbitrary, but actually a very rational measure of how much evidence the data give for a hidden cause. Here's the same model applied to a different data set that we had actually collected a few years before, which varies the same kinds of parameters but has a lot more points. The same model works in those cases, too; the differences are a little more subtle with more points.

So I'll give you one other example of this sort of thing. Like the one I just showed you, we're taking a fairly simple statistical model. This one, as you'll see, isn't even really causal-- the one I just showed you at least is causal. The advantage of this other one is that it's a kind of textbook statistics example, but one where people do something more interesting than what's in the textbook-- although you can extend the textbook analysis to make it look like what people do. And unlike in the clustering case, you can actually measure the relevant empirical statistics. Instead of just positing a simple model of what a latent environmental cause would be like, you can go out and measure all the relevant probability distributions, and compare people not just with a notional model but with what, in some stronger sense, is the rational thing to do if you were doing some kind of intuitive Bayesian inference.

This is, again, work that Tom Griffiths did with me, in and then after grad school. We asked people to make the following kind of everyday prediction. Suppose you read about a movie that's made $60 million to date: how much money will it make in total? Or you see that something's been baking in the oven for 34 minutes: how long until it's ready? You meet someone who's 78 years old: how long will they live?
Your friend quotes to you from line 17 of his favorite poem: how long is the poem? Or you meet a US congressman who has served for 11 years: how long will he serve in total? In each of these cases, you're encountering some phenomenon or event in the world with some unknown total extent or duration. We'll call that t_total. All we know is that t_total is somewhere between zero and infinity. We might have a prior on it, as you'll see in a second, but we don't know very much about this particular t_total, except that you get one example, one piece of data, some t, which we'll just assume is randomly sampled between zero and t_total. So all we know is that whatever these observations are, they are something randomly chosen, less than the total extent or duration of these events. And now we can ask: what can you guess about the total extent or duration from that one observation? In mathematical terms, there is some unknown interval from zero up to some maximal value, you can put a prior on what that interval is, and you have to guess the interval from one point sampled randomly within it.

It's also very similar-- and this is another reason we studied it-- to the problem of learning a concept from one example: learning what horses are from one example, or learning what that piece of rock climbing equipment is-- what a cam is-- from one example, or what a tufa is from one example. You can think of there being some region in the space of all possible objects, some set out there, and you get one or a few sample points and have to figure out the extent of the region. It's basically the same kind of problem, mathematically. But what's cool about this case is that we can measure the priors for these different classes of events and compare people with an optimal Bayesian inference.
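To make that optimal Bayesian inference explicit, here is a compact way to write down the computation being described, with the uniform-sampling assumption just stated as the likelihood:

$$
p(t_{\mathrm{total}} \mid t) \;\propto\; p(t \mid t_{\mathrm{total}})\, p(t_{\mathrm{total}}),
\qquad
p(t \mid t_{\mathrm{total}}) =
\begin{cases}
1/t_{\mathrm{total}}, & 0 \le t \le t_{\mathrm{total}},\\
0, & \text{otherwise},
\end{cases}
$$

so the posterior is proportional to $p(t_{\mathrm{total}})/t_{\mathrm{total}}$ for $t_{\mathrm{total}} \ge t$, and the prediction $t^{*}$ is read out as the posterior median, the value satisfying $P(t_{\mathrm{total}} > t^{*} \mid t) = 1/2$, which is the estimator used for the model curves described below.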
And you see something kind of striking. I'm showing two different kinds of data here. On the top are just empirical statistics of events you can measure in the world-- nothing behavioral, nothing about cognition. On the bottom I'm showing some behavioral data and comparing it with model predictions that are based on the statistics measured on top. Each column is one of these classes of events, like movie grosses in dollars. You can get that data from IMDb, the Internet Movie Database. You can see that most movies make $100 million or less-- it's roughly a power law-- but a few movies make hundreds of millions, maybe even a billion dollars these days. Similarly, poems have a power-law distribution of length: most poems are pretty short, they fit on a page or less, but there are some epic, multi-page poems of many hundreds of lines, and they fall off with a long tail. Lifespans and movie runtimes are kind of unimodal, almost Gaussian-- not exactly. The histogram bars show the empirical statistics we measured from public data, and the red curves show the best fit of a simple parametric model, like the Gaussian or power-law distributions I'm mentioning. For the House of Representatives, how long people serve has a kind of gamma shape-- a particular gamma called an Erlang-- with a bit of an incumbency effect. Cake baking times-- remember, we asked how long this cake is going to bake for-- don't have any simple parametric form when you go and look in cookbooks, but you can see there's something systematic there: a lot of things are supposed to bake for exactly an hour, there's a shorter but broad mode, and then there are a few epic 90-minute cakes out there. So that's all the empirical statistics.
Now, what you're seeing on the bottom: on the y-axis, the vertical axis, you have the median of a bunch of human predictions for the total extent of one of these things-- for example, your guess of the total length of a poem given that, basically, there is a line 17 in it. On the x-axis is the one data point, the one value of t; all you know is that it's somewhere between zero and t_total. Different groups of subjects were given five different values, so you see five black dots, which correspond to what five different subgroups of subjects said for each of those t values. The black and red curves are the model fits, which come from taking a certain kind of Bayesian optimal prediction, where the prior is what's specified on top-- that's the prior on t_total-- and the likelihood is uniform: t is just a uniform random sample from zero up to t_total. You put those together to compute a posterior, and the particular estimator we're using is what's called the posterior median: we're looking at the median of the posterior and comparing that with the median of the human subjects. And what you can see is that it's almost a perfect fit. It doesn't really matter whether you take the red curve, which comes from approximating the prior with one of those simple parametric models, or the black one, which comes from just taking the empirical histogram-- although for the cake baking times you can really only use the empirical one, because there is no simple parametric form; that's why you just see a jagged black line there. But it's interesting that it's almost a perfect fit.
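As a concrete illustration of that calculation, here is a minimal sketch in Python of the posterior-median prediction using an empirical prior over a grid of candidate totals. The grid and the power-law-shaped prior are made up for illustration; they are not the actual IMDb or cookbook statistics used in the experiments.

```python
import numpy as np

def posterior_median_prediction(t_obs, prior_values, prior_probs):
    """Predict t_total from one observation t_obs.

    prior_values: ascending grid of candidate t_total values (e.g. bin centers)
    prior_probs:  empirical prior probability of each candidate
    Likelihood: t_obs is uniform on [0, t_total], i.e. 1/t_total if t_total >= t_obs.
    """
    likelihood = np.where(prior_values >= t_obs, 1.0 / prior_values, 0.0)
    posterior = likelihood * prior_probs
    posterior /= posterior.sum()
    # Posterior median: smallest candidate whose cumulative posterior reaches 0.5.
    cdf = np.cumsum(posterior)
    return prior_values[np.searchsorted(cdf, 0.5)]

# Illustrative (made-up) power-law-ish prior over movie grosses, in millions.
grosses = np.arange(1, 1001, dtype=float)
prior = grosses ** -1.5
prior /= prior.sum()

for t in [10, 60, 200]:
    print(t, "->", posterior_median_prediction(t, grosses, prior))
```

With a power-law prior the predicted total scales multiplicatively with the observation, whereas a near-Gaussian prior (like lifespans) would produce the flatter, then converging, prediction function described next.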
There are one or two cases we found where this model doesn't work-- just like somebody asked about in Demis's talk-- sometimes dramatically, sometimes a little bit, and they're all interesting, but I don't have time to talk about them; that's one of the things I decided to skip. If you'd like to talk about it later, I'm happy to do that. But most of the time, in most of the cases we've studied, these are representative. And I think all of the failure cases are quite interesting ones; they point to things we need to go beyond.

But the interesting thing isn't just that the curves fit the data; it's that the actual shape of the prediction function is different in each case. Depending on the prior for these different classes of events, you get a qualitatively different prediction function. Sometimes it's linear, sometimes it's nonlinear, sometimes it has some weird shape. And, quite surprisingly to us, people seem to be sensitive to that. They seem to predict in ways that reflect not only the optimal Bayesian thing to do, but the optimal Bayesian thing to do from the correct prior. I certainly don't want to suggest that people always do this. But it was very interesting to us that, for a bunch of everyday events-- and really, the places where this analysis works best are ones where we think people plausibly have had the relevant experiences-- they seem to be sensitive both to the statistics, in the sense of what's actually going on in the world, and to doing the right statistical prediction with them.

So that's what we did ten years ago or so; that was the state of the art for us. And then we wanted to know: can we take these ideas and scale them up to some more interesting cognitive problems, like, say, learning words for object categories? And we did some of that. I'll show you a little bit of it before showing you what I think was missing there. In a lot of ways, this is a harder problem, though it's very similar, as I said.
It's basically like the problem I just showed you, where there was an unknown total extent or duration and you got one random sample from it. Here, imagine the space of all possible objects-- it could be a manifold, or described by a bunch of knobs. These particular objects are all generated from a computer program; if they were real biological things, they would be generated from DNA or whatever it is. But there's some huge, maybe interestingly structured, space of all possible objects, and within that space there's some subset, some region, somehow described, that is the set of tufas. And somehow you're able to grasp that subset, more or less-- to get its boundaries, to be able to say yes or no as you did at the beginning of the lecture-- from just, in this case, a few points, three points, randomly sampled from somewhere in that region. It would work just as well if I showed you only one of them.

So in some sense it's the same problem, but it's much harder, because in the prediction case the space was one-dimensional-- it was just a number-- whereas here we don't know the dimensionality of the space of objects, and we don't know how to describe the regions. There, we knew how to describe the regions: they were just intervals with a lower bound at zero and an upper bound at some unknown value, and the hypothesis space of possible regions was just all the possible upper bounds on the event's duration. Here, we don't know how to describe the space, we don't know how to describe the regions that correspond to object concepts, and we don't know how to put a prior on those hypotheses.

But in some work that I did with Fei Xu, who is also now a professor at Berkeley-- we were colleagues and friends in graduate school-- we did what we could at the time. We made some guesses about what that space might be like, what the hypothesis space might be like, how to put priors on it, and so on.
We used exactly the same likelihood, the very simple idea that the observed examples are a uniform random draw from some subset of the world, and you have to figure out what that subset is. And we were able to make some progress. What we did was say, well, perhaps it's like in biology. How many people saw Surya Ganguli's lecture yesterday morning? Cool. I sort of tailored this assuming you'd probably seen that, because there are a lot of similarities and parallels, which is neat-- it's, again, part of engaging generative models with neural networks. So, as you saw him do, you'll get my version of this.

As he also mentioned, there are actual processes in the world that generate objects like this. We know that evolution produces basically tree-structured groups, which we call species, or genera, or just taxa: groups of organisms that have a common evolutionary descent. That's how a biologist might describe it. And these days we know a lot about the mechanisms that produce that. Even going back one or two hundred years, say to Darwin, we knew something about the mechanisms, even if we didn't know the genetic details: ideas of mutation, variation, and natural selection as a kind of mechanistic account, right up there with Newton and forces. In any case, scientists can describe some process that generates trees. And maybe people have some intuitions-- just as they seem to have intuitions about the statistics of everyday events-- about the causal processes in the world that give rise to groups and subgroups, and they can use that to set up a hypothesis space.
The way we went about this is: we have no idea how to describe people's internal mental models of these things, but there are simple ways to get at this picture by asking people to judge similarity and doing hierarchical clustering. So this is a tree we built up by getting a subjective similarity metric from people and then doing hierarchical clustering, which we thought could roughly approximate the internal hierarchy that our mental models impose on these objects. Were you raising your hand, or just-- no? OK, cool.

We ultimately found this dissatisfying, because we don't really know what the features are, we don't really know if this is the right tree, or how people built it up. But it actually worked pretty well, in the sense that we could build up this tree and then assume that the hypotheses for concepts corresponded to branches of the tree. Then, to put it intuitively, the way you do this learning from one or a few examples-- let's say you see those few tufas over there-- is that you're basically asking: those were randomly drawn from some internal branch of the tree, some subtree; which subtree is it? And intuitively, if you see those things and say they're randomly drawn from some branch, maybe it's the one I've circled. That sounds like a better bet than, for example, this one here, or maybe this one, which would include one of these examples but not the others-- so that's probably unlikely. And it's probably a better bet than, say, this branch, or this branch, or these ones, which are logically compatible but somehow would have been a suspicious coincidence: if the set of tufas had really been this branch here, or this one here, it would have been quite a coincidence that the first three examples you saw were all clustered over there in one corner.
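Here is a minimal sketch of that kind of computation, under illustrative assumptions: the dissimilarity matrix and item names are made up, the tree is built with ordinary average-linkage clustering, every subtree is treated as a candidate hypothesis under a flat prior, and the likelihood is the size principle (examples drawn uniformly from the hypothesis's extension), so smaller branches that still contain all the examples win out.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree
from scipy.spatial.distance import squareform

names = ["tufa1", "tufa2", "tufa3", "blicket", "dax", "wug"]
n = len(names)

# Made-up dissimilarities: the three tufas are close, everything else is far.
D = np.full((n, n), 0.9)
D[:3, :3] = 0.2
np.fill_diagonal(D, 0.0)

# Build a tree by average-linkage hierarchical clustering.
Z = linkage(squareform(D, checks=False), method="average")
root = to_tree(Z)

def subtrees(node):
    # Enumerate every subtree; each is a candidate extension of the new word.
    yield node
    if not node.is_leaf():
        yield from subtrees(node.get_left())
        yield from subtrees(node.get_right())

examples = {0, 1, 2}            # indices of the observed tufas
scores = {}
for node in subtrees(root):
    leaves = set(node.pre_order(lambda leaf: leaf.id))
    if not examples <= leaves:
        continue                 # hypothesis must contain all the examples
    prior = 1.0                  # flat prior over branches (an assumption)
    likelihood = (1.0 / len(leaves)) ** len(examples)   # size principle
    scores[frozenset(leaves)] = prior * likelihood

total = sum(scores.values())
for leaves, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(sorted(names[i] for i in leaves), round(score / total, 3))
```

The tightest branch containing all three examples gets most of the posterior, which is the "suspicious coincidence" logic: broader, logically compatible branches are penalized because three random draws from them would have been unlikely to land together.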
And what we showed was that that kind of model-- where the suspicious coincidence falls out of the same kind of math I've just been showing you for the causal clustering example and for the interval prediction, the same Bayesian machinery but now with this tree-structured hypothesis space-- did a pretty good job of capturing people's judgments. We gave people one or a few examples of these concepts, where the examples could be more narrowly or broadly spread, just like in the clustering case, though less extensively varied. We did this with adults and we did it with kids. I won't really go into the details, but if you're interested, check out the various Xu and Tenenbaum papers; that's the main one there.

And the model kind of worked. But ultimately we found it dissatisfying, because we couldn't really explain-- we didn't really know-- what the hypothesis space was or how people were building up this tree. So we did a few things. We-- meaning I, with some other people-- turned to other problems where we had a better idea of the feature space and the hypothesis space, but where the same kinds of ideas could be explored and developed. And then ultimately-- I'll show you this maybe before lunch, or maybe after-- we went back and tackled the problem of learning concepts from examples in other settings where we could get a better handle on really knowing what representations people were using, and where we could also compare with machines in much more compelling, apples-to-apples ways. In some sense, here there's no machine, as far as I know, that can solve this problem as well as our model. On the other hand, it's very much like the issue that came up when we were talking about the deep learning-- I guess maybe it was with you, Tyler, or with you, Leo-- the deep reinforcement learning network.
A machine that's looking at this just as pixels is missing so much of what we bring to it: we see these things as three-dimensional objects. And just like the cam in rock climbing, or any of those other examples I gave before, I think that's essential to what people are doing. The generative model we build, this tree, is based not on pixels, or even on ConvNet features, but on a sense of the three-dimensional object, its parts, and their relations to each other. So fundamentally, until we know how to perceive objects better, this is not going to be comparable between humans and machines on equal terms. I'll show you a little later some still quite interesting, but simpler, visual concepts that you can also learn and generalize from one example, but where humans and machines are comparable on equal terms.

But first I want to tell you a little about yet another cognitive judgment which, like the word learning or concept learning cases, involves generalizing from a few examples and using prior knowledge, but where maybe we have some way of capturing people's prior knowledge by using the right combination of statistical inference over some kind of symbolically structured model. You can already see the narrative here. The examples I gave first don't require any symbolic structure. All that stuff I was talking about at the beginning, about how we have to combine sophisticated statistical inference with sophisticated symbolic representations-- you don't need any of that there. All the representations could just be counting up numbers or using simple probability distributions that statisticians have worked with for over a hundred years. Once we start to go here, now we have to define a model with some interesting structure, like a branching tree structure, and so on.
And as you'll see, we can quickly get to much more interesting causal, compositionally structured generative models in similar kinds of tasks. In particular, for a few years we were very interested in these property induction tasks. This happened to be-- I think it was a coincidence, or maybe we were both influenced by Susan Carey, actually-- the same domain as the work Surya talked about, the work he was trying to explain as a theoretician. Remember, Surya and Andrew Saxe were trying to give a theory of the neural network models that Jay McClelland and Tim Rogers had built in the early 2000s, around the same time we were doing this work. And those were inspired by some of Susan Carey's work on children's intuitive biology, as well as by other people in cognitive psychology-- for example, Lance Rips, Smith and Medin, Dan Osherson. Many, many cognitive psychologists have studied things like this.

They often talked about this as a kind of inductive reasoning, or property induction. It might look different from the task I've given you before, but actually it's deeply related.
765 00:29:35,800 --> 00:29:38,221 There's another reason why I wanted to cover this. 766 00:29:38,221 --> 00:29:39,720 We worked on these things because we 767 00:29:39,720 --> 00:29:42,095 wanted to be able to engage with the same kinds of things 768 00:29:42,095 --> 00:29:44,100 that people like Jay McClelland and Tom Mitchell 769 00:29:44,100 --> 00:29:46,180 were thinking about, coming from different perspectives. 770 00:29:46,180 --> 00:29:47,610 Remember, Tom Mitchell showed you 771 00:29:47,610 --> 00:29:52,710 his way of classifying brain representations of semantics 772 00:29:52,710 --> 00:29:56,550 with matrices of objects and 20-question-like features that 773 00:29:56,550 --> 00:29:58,920 included things like is it hairy, or is it alive, 774 00:29:58,920 --> 00:30:03,020 or does it lay eggs, or is it bigger than a car, 775 00:30:03,020 --> 00:30:06,190 or bigger than a breadbox, or whatever. 776 00:30:06,190 --> 00:30:07,910 Any one of these things-- 777 00:30:07,910 --> 00:30:09,960 basically, we're getting at the same thing. 778 00:30:09,960 --> 00:30:11,510 Here there's just what's-- 779 00:30:11,510 --> 00:30:13,010 often these experiments with humans 780 00:30:13,010 --> 00:30:15,176 were done with so-called blank predicates, something 781 00:30:15,176 --> 00:30:17,840 that sounded vaguely biological, but was basically made up, 782 00:30:17,840 --> 00:30:19,850 or that most people wouldn't know much about. 783 00:30:19,850 --> 00:30:22,220 Does anyone know anything about T9 hormones? 784 00:30:22,220 --> 00:30:24,299 I hope so, because I made it up. 785 00:30:24,299 --> 00:30:26,090 But some of them were just done with things 786 00:30:26,090 --> 00:30:28,740 that were real, but not known to most people. 787 00:30:28,740 --> 00:30:31,072 So if I tell you that gorillas and seals both have 788 00:30:31,072 --> 00:30:33,530 T9 hormones, you might think it's sort of fairly plausible 789 00:30:33,530 --> 00:30:35,738 that horses have T9 hormones, maybe more so than if I 790 00:30:35,738 --> 00:30:38,060 hadn't told you anything. 791 00:30:38,060 --> 00:30:40,010 Maybe you think that argument is more 792 00:30:40,010 --> 00:30:41,990 plausible than the one on the right; given 793 00:30:41,990 --> 00:30:45,007 that gorillas and seals have T9 hormones, that anteaters 794 00:30:45,007 --> 00:30:45,590 have T9 hormones. 795 00:30:45,590 --> 00:30:48,020 So maybe you think horses are somehow 796 00:30:48,020 --> 00:30:50,629 more similar to gorillas and seals than anteaters are. 797 00:30:50,629 --> 00:30:51,170 I don't know. 798 00:30:51,170 --> 00:30:51,670 Maybe. 799 00:30:51,670 --> 00:30:52,670 Maybe a little bit. 800 00:30:52,670 --> 00:30:54,020 If I made that bees-- 801 00:30:54,020 --> 00:30:56,305 gorillas and seals have T9 hormones. 802 00:30:56,305 --> 00:30:58,430 Does that make you think it's likely that bees have 803 00:30:58,430 --> 00:31:02,030 T9 hormones, or pine trees? 804 00:31:02,030 --> 00:31:03,620 The farther the conclusion category 805 00:31:03,620 --> 00:31:06,620 gets from the premises, the less plausible it seems. 806 00:31:06,620 --> 00:31:09,260 Maybe the one on the lower right also seems not very plausible, 807 00:31:09,260 --> 00:31:10,652 or not as plausible.
808 00:31:10,652 --> 00:31:12,110 Because if I tell you that gorillas 809 00:31:12,110 --> 00:31:14,240 have T9 hormones, chimps, monkeys, and baboons 810 00:31:14,240 --> 00:31:16,130 all have T9 on hormones, maybe you 811 00:31:16,130 --> 00:31:18,140 think that it's only primates or something. 812 00:31:18,140 --> 00:31:19,640 So they're not a very-- it's, again, 813 00:31:19,640 --> 00:31:21,140 one of these typicality-suspicious 814 00:31:21,140 --> 00:31:23,750 coincidence businesses. 815 00:31:23,750 --> 00:31:25,670 So again, you can think of it as-- you 816 00:31:25,670 --> 00:31:27,380 can do these experiments in various ways. 817 00:31:27,380 --> 00:31:28,963 I won't really go through the details, 818 00:31:28,963 --> 00:31:30,770 but it basically involves giving people 819 00:31:30,770 --> 00:31:33,229 a bunch of different sets of examples, just like-- 820 00:31:33,229 --> 00:31:35,270 I mean, in some sense, the important thing to get 821 00:31:35,270 --> 00:31:37,160 is that abstractly it has the same character of all 822 00:31:37,160 --> 00:31:38,420 the other tasks you've seen. 823 00:31:38,420 --> 00:31:40,820 You're giving people one or a few examples, which we're 824 00:31:40,820 --> 00:31:44,150 going to treat as random draws from some concept, 825 00:31:44,150 --> 00:31:46,940 or some region in some larger space. 826 00:31:46,940 --> 00:31:49,400 In this case, the examples are the different premise 827 00:31:49,400 --> 00:31:51,200 categories, like gorillas and seals 828 00:31:51,200 --> 00:31:55,280 are examples of the concept of having T9 hormones. 829 00:31:55,280 --> 00:31:57,310 Or gorillas, chimps, monkeys, and baboons 830 00:31:57,310 --> 00:31:58,880 are an example of a concept. 831 00:31:58,880 --> 00:32:01,460 We're going to put a prior on possible extents 832 00:32:01,460 --> 00:32:04,430 of that concept, and then ask what kind of inferences 833 00:32:04,430 --> 00:32:06,867 people make from that prior, to figure out 834 00:32:06,867 --> 00:32:08,450 what other things are in that concept. 835 00:32:08,450 --> 00:32:11,630 So are horses in that same concept? 836 00:32:11,630 --> 00:32:12,382 Or are anteaters? 837 00:32:12,382 --> 00:32:14,840 Or are horses in it more or less, depending on the examples 838 00:32:14,840 --> 00:32:15,470 you give? 839 00:32:15,470 --> 00:32:17,750 And what's the nature of that prior? 840 00:32:17,750 --> 00:32:19,820 And what's good about this is that, 841 00:32:19,820 --> 00:32:24,740 kind of like the everyday prediction task-- 842 00:32:24,740 --> 00:32:27,290 the lines of the poems, or the movie grosses, or the cake 843 00:32:27,290 --> 00:32:29,210 baking-- we can actually sort of go out and measure 844 00:32:29,210 --> 00:32:30,960 some features that are plausibly relevant, 845 00:32:30,960 --> 00:32:33,200 to set up a plausibly relevant prior, 846 00:32:33,200 --> 00:32:36,860 unlike the interesting object cases. 847 00:32:36,860 --> 00:32:38,829 But like the interesting object cases, 848 00:32:38,829 --> 00:32:41,120 there are some interesting hierarchical and other kinds 849 00:32:41,120 --> 00:32:42,740 of causal compositional structures 850 00:32:42,740 --> 00:32:44,360 that people seem to be using that we 851 00:32:44,360 --> 00:32:46,680 can capture in our models. 852 00:32:46,680 --> 00:32:50,240 So here, again, the kinds of experiments-- these features 853 00:32:50,240 --> 00:32:54,220 were generated many years ago by Osherson and colleagues. 
854 00:32:54,220 --> 00:32:56,270 But it's very similar to the 20 questions game 855 00:32:56,270 --> 00:32:57,290 that Tom Mitchell used. 856 00:32:57,290 --> 00:32:58,820 And I don't remember if Surya talked 857 00:32:58,820 --> 00:33:00,230 about where these features came from, 858 00:33:00,230 --> 00:33:02,570 that he talked a lot about a matrix of objects and features. 859 00:33:02,570 --> 00:33:04,778 I don't know if he talked about where they come from. 860 00:33:04,778 --> 00:33:07,250 But actually, psychologists spent a while coming up 861 00:33:07,250 --> 00:33:09,110 with ways to get people to just tell you 862 00:33:09,110 --> 00:33:10,820 a bunch of features of animals. 863 00:33:10,820 --> 00:33:13,790 This is, again, it's meant to capture the knowledge 864 00:33:13,790 --> 00:33:17,210 that maybe a kid would get from maybe plausibly reading 865 00:33:17,210 --> 00:33:18,700 books and going to the zoo. 866 00:33:18,700 --> 00:33:20,042 We know that elephants are gray. 867 00:33:20,042 --> 00:33:20,750 They're hairless. 868 00:33:20,750 --> 00:33:21,590 They have tough skin. 869 00:33:21,590 --> 00:33:22,100 They're big. 870 00:33:22,100 --> 00:33:24,740 They have a bulbous body shape. 871 00:33:24,740 --> 00:33:25,642 They have long legs. 872 00:33:25,642 --> 00:33:27,600 These are all mostly relative to other animals. 873 00:33:27,600 --> 00:33:28,910 They have a tail. 874 00:33:28,910 --> 00:33:29,580 They have tusks. 875 00:33:29,580 --> 00:33:31,639 They might be smelly, compared to other animals-- 876 00:33:31,639 --> 00:33:33,680 smellier than average is sort of what that means. 877 00:33:33,680 --> 00:33:35,130 They walk, as opposed to fly. 878 00:33:35,130 --> 00:33:36,752 They're slow, as opposed to fast. 879 00:33:36,752 --> 00:33:38,210 They're strong, as opposed to weak. 880 00:33:38,210 --> 00:33:39,644 It's that kind of business. 881 00:33:39,644 --> 00:33:41,810 So basically what that gives you is this big matrix. 882 00:33:41,810 --> 00:33:43,393 Again, the same kind of thing that you 883 00:33:43,393 --> 00:33:46,130 saw in Surya's talk, the same kind of thing 884 00:33:46,130 --> 00:33:47,750 that Tom Mitchell is using to help 885 00:33:47,750 --> 00:33:49,130 classify things, the same kind of thing 886 00:33:49,130 --> 00:33:50,963 that basically everybody in machine learning 887 00:33:50,963 --> 00:33:55,190 uses-- a matrix of data with objects, maybe as rows, 888 00:33:55,190 --> 00:33:57,530 and features, or attributes, as columns. 889 00:33:57,530 --> 00:33:59,330 And the problem here is-- 890 00:33:59,330 --> 00:34:02,145 the problem of learning is to say-- 891 00:34:02,145 --> 00:34:04,520 the problem of learning and generalizing from one example 892 00:34:04,520 --> 00:34:06,920 is to take a new property, which is a new concept, which 893 00:34:06,920 --> 00:34:08,810 is like a new column here, to get 894 00:34:08,810 --> 00:34:11,659 one or a few examples of that concept, which is basically 895 00:34:11,659 --> 00:34:14,600 just filling in one or a few entries in that column, 896 00:34:14,600 --> 00:34:16,850 and figure out how to fill in the others, to decide, 897 00:34:16,850 --> 00:34:19,790 do you or don't you have that property, somehow 898 00:34:19,790 --> 00:34:21,800 building knowledge that you can generalize 899 00:34:21,800 --> 00:34:24,080 from your prior experience, which 900 00:34:24,080 --> 00:34:26,360 could be captured by, say, all the other features 901 00:34:26,360 --> 00:34:27,780 that you know about objects. 
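To make this setup concrete, here is a minimal sketch in Python of the object-by-feature matrix and the sparse new column being described. The animal names, feature names, and values are made up for illustration; this is not the actual Osherson feature data.

```python
# A toy object-by-feature matrix of the kind described above (values made up
# for illustration; this is not the real feature-norm data).
animals = ["elephant", "horse", "gorilla", "seal", "anteater", "bee"]
feature_names = ["is gray", "is big", "lives in water", "has hair", "has a tail"]
features = {
    "elephant": [1, 1, 0, 0, 1],
    "horse":    [0, 1, 0, 1, 1],
    "gorilla":  [0, 1, 0, 1, 0],
    "seal":     [1, 0, 1, 1, 1],
    "anteater": [0, 0, 0, 1, 1],
    "bee":      [0, 0, 0, 1, 0],
}

# A new property ("has T9 hormones") is a new, mostly empty column: we only
# observe it for the premise categories.
observed_new_property = {"gorilla": 1, "seal": 1}

# The induction problem: fill in the missing entries of that column -- e.g.,
# does "horse" have it? -- using whatever structure the known features
# support. The models discussed next differ in how they turn the known
# matrix into a prior over columns like this one.
```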
902 00:34:27,780 --> 00:34:29,405 So that's the way that you might set up 903 00:34:29,405 --> 00:34:32,030 this problem, which again, looks like a lot of other problems 904 00:34:32,030 --> 00:34:34,550 of, say, semi-supervised learning or sparse matrix 905 00:34:34,550 --> 00:34:35,189 completion. 906 00:34:35,189 --> 00:34:36,980 It's a problem in which we can, or at least 907 00:34:36,980 --> 00:34:38,592 we thought we could, compare humans 908 00:34:38,592 --> 00:34:40,550 and many different algorithms, and even theory, 909 00:34:40,550 --> 00:34:42,020 like from Surya's talk. 910 00:34:42,020 --> 00:34:43,760 And that seemed very appealing to us. 911 00:34:43,760 --> 00:34:45,480 What we thought, though, that people were doing, 912 00:34:45,480 --> 00:34:47,355 which is maybe a little different than what-- 913 00:34:47,355 --> 00:34:49,940 or somewhat different-- well, quite different than what Jay 914 00:34:49,940 --> 00:34:51,199 McClelland thought people were doing-- 915 00:34:51,199 --> 00:34:53,389 maybe a little bit more like what Susan Carey or some 916 00:34:53,389 --> 00:34:55,722 of the earlier psychologists thought people were doing-- 917 00:34:55,722 --> 00:34:57,110 was something like this. 918 00:34:57,110 --> 00:34:59,890 That the way we solve this problem, 919 00:34:59,890 --> 00:35:02,380 the way we bridged from our prior experience to new things 920 00:35:02,380 --> 00:35:06,860 we wanted to learn was not, say, by just computing 921 00:35:06,860 --> 00:35:08,860 the second order of statistics and correlations, 922 00:35:08,860 --> 00:35:11,276 and compressing that through some bottleneck hidden layer, 923 00:35:11,276 --> 00:35:12,970 but by building a more interesting 924 00:35:12,970 --> 00:35:15,850 structured probabilistic model that was, in some form, 925 00:35:15,850 --> 00:35:16,450 causal-- 926 00:35:16,450 --> 00:35:18,640 in some form-- in some form, compositional 927 00:35:18,640 --> 00:35:22,140 and hierarchical-- something kind of like this. 928 00:35:22,140 --> 00:35:26,800 And this is a good example of a hierarchical generative model. 929 00:35:26,800 --> 00:35:29,110 There's three layers of structure here. 930 00:35:29,110 --> 00:35:31,620 The bottom layer is the observable layer. 931 00:35:31,620 --> 00:35:34,510 So the arrows in these generative models point down, 932 00:35:34,510 --> 00:35:37,750 often, usually, where the thing on the bottom is the thing 933 00:35:37,750 --> 00:35:39,820 you observe, the data of your experience. 934 00:35:39,820 --> 00:35:42,460 And then the stuff above it are various levels of structure 935 00:35:42,460 --> 00:35:45,440 that your mind is positing to explain it. 936 00:35:45,440 --> 00:35:47,230 So here we have two levels of structure. 937 00:35:47,230 --> 00:35:50,590 The level above this is sort of this tree in your head. 938 00:35:50,590 --> 00:35:54,130 The idea-- it's like a certain kind of graph structure, 939 00:35:54,130 --> 00:35:56,654 where the objects, or the species, are the leaf nodes. 940 00:35:56,654 --> 00:35:58,570 And there's some internal nodes corresponding, 941 00:35:58,570 --> 00:36:00,760 maybe to higher level taxa, or groups, or something. 942 00:36:00,760 --> 00:36:02,860 You might have words for these, too, like mammal, 943 00:36:02,860 --> 00:36:04,930 or primate, or animal. 
944 00:36:04,930 --> 00:36:07,000 And the idea is that there's some kind 945 00:36:07,000 --> 00:36:08,890 of probabilistic model that you can describe, 946 00:36:08,890 --> 00:36:12,070 maybe even a causal one on top of that symbolic structure, 947 00:36:12,070 --> 00:36:16,000 that tree, that produces the data that's more directly 948 00:36:16,000 --> 00:36:17,622 observable, the observable features, 949 00:36:17,622 --> 00:36:19,330 including the things you've only sparsely 950 00:36:19,330 --> 00:36:20,867 observed and want to fill in. 951 00:36:20,867 --> 00:36:23,200 And then you might also have higher levels of structure. 952 00:36:23,200 --> 00:36:24,970 Like if you want to explain, how did you 953 00:36:24,970 --> 00:36:27,080 learn that tree in the first place, 954 00:36:27,080 --> 00:36:29,957 maybe it's because you have some kind of generative model 955 00:36:29,957 --> 00:36:31,040 for that generative model. 956 00:36:31,040 --> 00:36:32,680 So here I'm just using words to describe it, 957 00:36:32,680 --> 00:36:34,388 but I'll show you some other stuff in a-- 958 00:36:34,388 --> 00:36:37,600 or I'll show you something more formal a little bit later. 959 00:36:37,600 --> 00:36:40,120 But you could say, well, maybe the way 960 00:36:40,120 --> 00:36:42,100 I figure out that there's a tree structure is 961 00:36:42,100 --> 00:36:44,206 by having a hypothesis-- 962 00:36:44,206 --> 00:36:45,580 the way I figure out that there's 963 00:36:45,580 --> 00:36:49,480 that particular tree-structured graphical model of this domain 964 00:36:49,480 --> 00:36:51,520 is by having the more general hypothesis 965 00:36:51,520 --> 00:36:54,310 that there is some latent hierarchy of species. 966 00:36:54,310 --> 00:36:57,040 And I just have to figure out which one it is. 967 00:36:57,040 --> 00:36:59,380 So you could formulate this as a hierarchical inference 968 00:36:59,380 --> 00:37:00,940 by saying that what we're calling 969 00:37:00,940 --> 00:37:02,500 the form, the form of the model, it's 970 00:37:02,500 --> 00:37:05,230 like a hypothesis space of models, which are themselves 971 00:37:05,230 --> 00:37:08,890 hypothesis spaces of possible observed patterns of feature 972 00:37:08,890 --> 00:37:10,060 correlation. 973 00:37:10,060 --> 00:37:11,860 And that, that higher level knowledge, 974 00:37:11,860 --> 00:37:14,020 puts some kind of a generative model on these graph 975 00:37:14,020 --> 00:37:15,910 structures, where each graph structure then 976 00:37:15,910 --> 00:37:17,830 puts a generative model on the data you can observe. 977 00:37:17,830 --> 00:37:19,210 And then you could have even higher levels 978 00:37:19,210 --> 00:37:20,080 of this sort of thing. 979 00:37:20,080 --> 00:37:22,180 And then learning could go on at any or all levels 980 00:37:22,180 --> 00:37:25,399 of this hierarchy, higher than the level of experience. 981 00:37:25,399 --> 00:37:27,940 So just to show you a little bit about how this kind of thing 982 00:37:27,940 --> 00:37:31,870 works, what we're calling the probability of the data 983 00:37:31,870 --> 00:37:34,420 given the structure is actually exactly the same, really, 984 00:37:34,420 --> 00:37:37,300 as the model that Surya and Andrew Saxe used. 985 00:37:37,300 --> 00:37:40,960 The difference is that we were suggesting-- 986 00:37:40,960 --> 00:37:42,642 may be right, may be wrong-- 987 00:37:42,642 --> 00:37:44,350 that something like this generative model 988 00:37:44,350 --> 00:37:46,655 was actually in your head. 
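Written out generically, the three-level hierarchy just described factors into a prior over forms, a prior over structures given a form, and a likelihood of the observed data given a structure. This is a schematic statement of the idea, not the exact notation of the original papers:

\[
P(F, S, D) = P(F)\, P(S \mid F)\, P(D \mid S),
\qquad
P(S, F \mid D) \propto P(D \mid S)\, P(S \mid F)\, P(F),
\]

where F is the form (for example, "some latent tree over species"), S is a particular structure (a particular tree), and D is the observed object-feature matrix; learning can then happen at the level of S, of F, or both.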
989 00:37:46,655 --> 00:37:49,270 Surya presented a very simple abstraction 990 00:37:49,270 --> 00:37:52,480 of evolutionary branching process, a kind of diffusion 991 00:37:52,480 --> 00:37:54,940 over the tree, where properties could turn on or off. 992 00:37:54,940 --> 00:37:57,700 And we built basically that same kind of model. 993 00:37:57,700 --> 00:38:00,880 And we said, maybe you have something in your head 994 00:38:00,880 --> 00:38:05,230 as a model of, again, the distribution of properties, 995 00:38:05,230 --> 00:38:08,630 or features, or attributes over the leaf nodes of the tree. 996 00:38:08,630 --> 00:38:11,222 So if you have this kind of statistical model. 997 00:38:11,222 --> 00:38:13,180 If you think that there's something like a tree 998 00:38:13,180 --> 00:38:16,990 structure, and properties are produced over the leaf nodes 999 00:38:16,990 --> 00:38:19,739 by some kind of switching, on-and-off, mutation-like 1000 00:38:19,739 --> 00:38:22,030 process, then you can do something like in this picture 1001 00:38:22,030 --> 00:38:22,530 here. 1002 00:38:22,530 --> 00:38:25,030 You can take an observe a set of features in that matrix 1003 00:38:25,030 --> 00:38:26,440 and learn the best tree. 1004 00:38:26,440 --> 00:38:28,760 You can figure out that thing I'm showing on the top, 1005 00:38:28,760 --> 00:38:30,970 that structure, which is, in some sense, the best 1006 00:38:30,970 --> 00:38:34,000 guess of a tree structure-- a latent tree structure-- which 1007 00:38:34,000 --> 00:38:37,060 if you then define some kind of diffusion mutation 1008 00:38:37,060 --> 00:38:40,810 process over that tree, would produce with high probability 1009 00:38:40,810 --> 00:38:43,930 distributions of features like those shown there. 1010 00:38:43,930 --> 00:38:45,460 If I gave you a very different tree 1011 00:38:45,460 --> 00:38:47,450 it would produce other patterns of correlation. 1012 00:38:47,450 --> 00:38:48,927 And it's just like Surya said, it 1013 00:38:48,927 --> 00:38:51,010 can be all captured by the second order statistics 1014 00:38:51,010 --> 00:38:52,420 of feature correlations. 1015 00:38:52,420 --> 00:38:54,190 The nice thing about this is that now this 1016 00:38:54,190 --> 00:38:56,420 also gives a distribution on new properties. 1017 00:38:56,420 --> 00:38:57,280 So if I observe-- 1018 00:38:57,280 --> 00:38:59,302 because each column is conditionally independent 1019 00:38:59,302 --> 00:39:00,010 given that model. 1020 00:39:00,010 --> 00:39:01,690 Each column is an independent sample 1021 00:39:01,690 --> 00:39:04,400 from that generative model. 1022 00:39:04,400 --> 00:39:06,340 And the idea is if I observe a new property, 1023 00:39:06,340 --> 00:39:08,840 and I want to say, well, which other things have this, well, 1024 00:39:08,840 --> 00:39:11,740 I can make a guess on using that probabilistic model. 1025 00:39:11,740 --> 00:39:14,020 I can say, all right, given that I 1026 00:39:14,020 --> 00:39:16,180 know the value of this function over the tree, 1027 00:39:16,180 --> 00:39:18,851 this stochastic process, at some points, what 1028 00:39:18,851 --> 00:39:21,100 do I think the most likely values are at other points? 1029 00:39:21,100 --> 00:39:23,599 And basically, what you get is, again, like in the diffusion 1030 00:39:23,599 --> 00:39:25,750 process, a kind of similarity-based generalization 1031 00:39:25,750 --> 00:39:28,960 with a tree-structured metric, that nearby points in the tree 1032 00:39:28,960 --> 00:39:30,450 are likely to have the same value. 
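As a rough illustration of the kind of model being described, here is a brute-force sketch in Python: a small hand-made tree (not one learned from data), a simple switching process in which the property flips along each branch with some fixed probability, and the resulting Bayesian generalization from observed leaves to a query leaf. The tree, the flip probability, and the species names are all assumptions for illustration; the actual work learned the tree from the feature matrix and used a proper branch-length diffusion process.

```python
import itertools

# A tiny hand-made tree, just for illustration:
#            root
#           /    \
#       groupA   groupB
#       /    \   /    \
#  gorilla seal horse anteater
nodes = ["root", "groupA", "groupB", "gorilla", "seal", "horse", "anteater"]
edges = [("root", "groupA"), ("root", "groupB"),
         ("groupA", "gorilla"), ("groupA", "seal"),
         ("groupB", "horse"), ("groupB", "anteater")]

FLIP = 0.1    # assumed probability the property switches along any one branch
P_ROOT = 0.5  # assumed prior probability the property is "on" at the root

def joint_probability(assign):
    """Probability of one full on/off assignment to every node under the
    simple switching ("mutation-like") process over the tree."""
    p = P_ROOT if assign["root"] == 1 else 1.0 - P_ROOT
    for parent, child in edges:
        p *= FLIP if assign[parent] != assign[child] else 1.0 - FLIP
    return p

def generalize(query, observed_positive):
    """P(query leaf has the property | the observed leaves have it),
    by brute-force enumeration over all 2^7 node assignments."""
    numerator = denominator = 0.0
    for values in itertools.product([0, 1], repeat=len(nodes)):
        assign = dict(zip(nodes, values))
        if any(assign[leaf] == 0 for leaf in observed_positive):
            continue
        p = joint_probability(assign)
        denominator += p
        if assign[query] == 1:
            numerator += p
    return numerator / denominator

# Nearby leaves get more generalization than distant ones:
print(generalize("seal", ["gorilla"]))      # relatively high
print(generalize("anteater", ["gorilla"]))  # noticeably lower
```

The qualitative behavior is the one described next: leaves close in the tree to the observed examples get strong generalization, while distant leaves get much less.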
1033 00:39:30,450 --> 00:39:33,077 So in particular, things that are near to, say, species one 1034 00:39:33,077 --> 00:39:35,326 and nine are probably going to have the same property, 1035 00:39:35,326 --> 00:39:36,640 and others maybe less so. 1036 00:39:36,640 --> 00:39:38,150 And you build that model. 1037 00:39:38,150 --> 00:39:40,000 And it's really quite striking how much 1038 00:39:40,000 --> 00:39:41,292 it matches people's intuitions. 1039 00:39:41,292 --> 00:39:43,666 So now you're seeing the kinds of plots I was showing you 1040 00:39:43,666 --> 00:39:44,350 before, where-- 1041 00:39:44,350 --> 00:39:46,060 all my data plots look like this. 1042 00:39:46,060 --> 00:39:48,310 Whenever I'm showing the scatterplot, by default, 1043 00:39:48,310 --> 00:39:52,300 the y-axis is the average of a bunch of people's judgments, 1044 00:39:52,300 --> 00:39:55,660 and the x-axis is the model predictions on the same units 1045 00:39:55,660 --> 00:39:56,950 or scale. 1046 00:39:56,950 --> 00:39:58,330 And each of these scatterplots is 1047 00:39:58,330 --> 00:40:00,090 from a different experiment-- not done 1048 00:40:00,090 --> 00:40:02,610 by us, done by other people, like Osherson and Smith 1049 00:40:02,610 --> 00:40:03,840 from a couple of decades ago. 1050 00:40:06,360 --> 00:40:08,460 But they all sort of have the same kind of form, 1051 00:40:08,460 --> 00:40:10,800 where each dot is a different set of examples, 1052 00:40:10,800 --> 00:40:12,160 or a different argument. 1053 00:40:12,160 --> 00:40:14,550 And what typically varied within an experiment-- 1054 00:40:14,550 --> 00:40:15,750 you vary the examples. 1055 00:40:15,750 --> 00:40:18,810 And you fix constant the conclusion category. 1056 00:40:18,810 --> 00:40:21,240 And you see, basically, how much evidential support 1057 00:40:21,240 --> 00:40:23,130 to different sets of two or three examples 1058 00:40:23,130 --> 00:40:25,200 gives to a certain conclusion. 1059 00:40:25,200 --> 00:40:27,480 And it's really, again, quite striking that-- 1060 00:40:27,480 --> 00:40:29,070 sometimes in a more categorical way, 1061 00:40:29,070 --> 00:40:31,110 sometimes in a more graded way-- but basically, 1062 00:40:31,110 --> 00:40:32,820 people's average judgments here just 1063 00:40:32,820 --> 00:40:37,950 line up quite well with the sort of Bayesian inference 1064 00:40:37,950 --> 00:40:41,190 on this tree-structured generative model. 1065 00:40:41,190 --> 00:40:45,120 These are just examples of the kinds of stimuli here. 1066 00:40:45,120 --> 00:40:46,080 Now, we can compare. 1067 00:40:46,080 --> 00:40:47,280 One of the reasons why we were interested 1068 00:40:47,280 --> 00:40:49,654 in this was to compare, again, many different approaches. 1069 00:40:49,654 --> 00:40:52,110 So here I'm going to show you a comparison with just 1070 00:40:52,110 --> 00:40:53,340 a variant of our approach. 1071 00:40:53,340 --> 00:40:55,710 It's the same kind of hierarchical Bayesian model, 1072 00:40:55,710 --> 00:40:57,540 but now the structure isn't a tree, 1073 00:40:57,540 --> 00:41:00,360 it's a low-dimensional Euclidean space. 1074 00:41:00,360 --> 00:41:03,967 You can define the same kinds of proximity smoothness thing. 1075 00:41:03,967 --> 00:41:06,300 I mean, again, it's more a standard in machine learning. 1076 00:41:06,300 --> 00:41:10,500 It's related to Gaussian processes. 1077 00:41:10,500 --> 00:41:12,164 It's much more like neural networks. 
1078 00:41:12,164 --> 00:41:14,580 You could think of this as kind of like a Bayesian version 1079 00:41:14,580 --> 00:41:17,430 of a bottleneck hidden layer with two dimensions, 1080 00:41:17,430 --> 00:41:18,930 or a small number of dimensions. 1081 00:41:18,930 --> 00:41:21,974 The pictures that Surya showed you were all higher dimensions 1082 00:41:21,974 --> 00:41:23,640 than two dimensions in the latent space, 1083 00:41:23,640 --> 00:41:25,800 or the hidden variable space, of the neural network, 1084 00:41:25,800 --> 00:41:26,760 the hidden layer space. 1085 00:41:26,760 --> 00:41:28,920 But when he compress it down to two dimensions, 1086 00:41:28,920 --> 00:41:31,090 it looks pretty good. 1087 00:41:31,090 --> 00:41:32,340 So it's the same kind of idea. 1088 00:41:32,340 --> 00:41:37,370 Now what you're saying is you're going to find, 1089 00:41:37,370 --> 00:41:39,600 not the best tree that explains all these features, 1090 00:41:39,600 --> 00:41:41,520 but the best two-dimensional space. 1091 00:41:41,520 --> 00:41:43,680 Maybe it looks like this. 1092 00:41:43,680 --> 00:41:45,270 Where, again, the probabilistic model 1093 00:41:45,270 --> 00:41:47,594 says that things which are relatively-- things that 1094 00:41:47,594 --> 00:41:49,260 are closer in this two-dimensional space 1095 00:41:49,260 --> 00:41:51,218 are more likely to have the same feature value. 1096 00:41:51,218 --> 00:41:53,850 So you're basically explaining all the pairwise feature 1097 00:41:53,850 --> 00:41:57,340 correlations by distance in this space. 1098 00:41:57,340 --> 00:41:58,230 It's similar. 1099 00:41:58,230 --> 00:42:01,890 Importantly it's not as causal and compositional. 1100 00:42:01,890 --> 00:42:04,770 The tree models something about, possibly, the causal processes 1101 00:42:04,770 --> 00:42:06,570 of how organisms come to be. 1102 00:42:06,570 --> 00:42:08,870 If I told you that, oh, there's this-- 1103 00:42:08,870 --> 00:42:13,315 that I told you about a subspecies, like whatever-- 1104 00:42:13,315 --> 00:42:15,600 what's a good example-- 1105 00:42:15,600 --> 00:42:16,860 different breeds of dogs. 1106 00:42:16,860 --> 00:42:20,310 Or I told you that, oh, well, there's not just wolves, 1107 00:42:20,310 --> 00:42:22,710 but there's the gray-tailed wolf and the red-tailed wolf. 1108 00:42:22,710 --> 00:42:23,376 Red-tailed wolf? 1109 00:42:23,376 --> 00:42:23,940 I don't know. 1110 00:42:23,940 --> 00:42:26,340 Again, they're probably similar, but they might-- 1111 00:42:26,340 --> 00:42:28,452 one red-tailed wolf, whatever that is, more 1112 00:42:28,452 --> 00:42:29,910 similar to another red-tailed wolf, 1113 00:42:29,910 --> 00:42:31,409 probably has more features in common 1114 00:42:31,409 --> 00:42:33,420 than with a gray-tailed wolf, and probably more 1115 00:42:33,420 --> 00:42:35,004 to the gray-tailed wolf than to a dog. 1116 00:42:35,004 --> 00:42:37,461 The nice thing about a tree is I can tell you these things, 1117 00:42:37,461 --> 00:42:38,610 and you can, in your mind-- 1118 00:42:38,610 --> 00:42:40,530 maybe you'll never forget that there's a red-tailed wolf. 1119 00:42:40,530 --> 00:42:41,029 There isn't. 1120 00:42:41,029 --> 00:42:41,920 I just made it up. 
1121 00:42:41,920 --> 00:42:44,677 But if you ever find yourself thinking 1122 00:42:44,677 --> 00:42:47,010 about red-tailed wolves and whether their properties are 1123 00:42:47,010 --> 00:42:50,040 more or less similar to each other than to gray-tailed 1124 00:42:50,040 --> 00:42:52,529 wolves, or less so to dogs, or so on, 1125 00:42:52,529 --> 00:42:54,070 it's because I just said some things, 1126 00:42:54,070 --> 00:42:55,736 and you grew out your tree in your mind. 1127 00:42:55,736 --> 00:43:00,187 That's a lot harder to do in a low-dimensional space. 1128 00:43:00,187 --> 00:43:01,770 And it turns out that, that model also 1129 00:43:01,770 --> 00:43:03,000 fits this data less well. 1130 00:43:03,000 --> 00:43:05,140 So here I'm just showing two of those experiments. 1131 00:43:05,140 --> 00:43:06,806 Some of them are well fit by that model, 1132 00:43:06,806 --> 00:43:09,182 but others are less well fit. 1133 00:43:09,182 --> 00:43:10,890 Now, that's not to say that they wouldn't 1134 00:43:10,890 --> 00:43:11,710 be good for other things. 1135 00:43:11,710 --> 00:43:12,930 So we also did some experiments. 1136 00:43:12,930 --> 00:43:14,250 This was experiments that we did. 1137 00:43:14,250 --> 00:43:15,930 Oh, actually, I forgot to say, really importantly, 1138 00:43:15,930 --> 00:43:18,060 this was all worked done by Charles Kemp, who's 1139 00:43:18,060 --> 00:43:19,860 now a professor at CMU. 1140 00:43:19,860 --> 00:43:24,870 And it was part of the stuff that he did in his PhD thesis. 1141 00:43:24,870 --> 00:43:28,430 So we were interested in this as a way, not to study trees, 1142 00:43:28,430 --> 00:43:30,680 but to study a range of different kinds of structures. 1143 00:43:30,680 --> 00:43:33,280 And it is true, going back, I guess, to the question 1144 00:43:33,280 --> 00:43:35,280 you asked, this is what I was referring to about 1145 00:43:35,280 --> 00:43:36,390 low-dimensional manifolds. 1146 00:43:36,390 --> 00:43:38,520 There are some kinds of knowledge representations 1147 00:43:38,520 --> 00:43:40,830 we have which might have a low-dimensional spatial 1148 00:43:40,830 --> 00:43:44,340 structure, in particular, like mental maps of the world. 1149 00:43:44,340 --> 00:43:48,726 So our intuitive models of the Earth's surface, and things 1150 00:43:48,726 --> 00:43:50,850 which might be distributed over the Earth's surface 1151 00:43:50,850 --> 00:43:53,250 spatially, a two-dimensional map is probably 1152 00:43:53,250 --> 00:43:55,290 a good one for that. 1153 00:43:55,290 --> 00:43:58,590 So here we considered a similar kind of concept 1154 00:43:58,590 --> 00:44:02,794 learning from a few examples task, where we said-- 1155 00:44:02,794 --> 00:44:03,960 but now we put it like this. 1156 00:44:03,960 --> 00:44:05,459 We said, suppose that a certain kind 1157 00:44:05,459 --> 00:44:07,170 of Native American artifact has been 1158 00:44:07,170 --> 00:44:09,240 found in sites near city x. 1159 00:44:09,240 --> 00:44:12,480 How likely is it also to be found in sites near city y? 1160 00:44:12,480 --> 00:44:16,650 Or we could say sites near city x and y, how about city z. 1161 00:44:16,650 --> 00:44:21,390 And we told people that different Native American 1162 00:44:21,390 --> 00:44:25,200 tribes maybe had-- some lived in a very small area, some 1163 00:44:25,200 --> 00:44:26,487 lived in a very big area. 1164 00:44:26,487 --> 00:44:28,320 Some lived in one place, some another place. 1165 00:44:28,320 --> 00:44:30,309 Some lived here, and then moved there. 
1166 00:44:30,309 --> 00:44:32,850 We just told people very vague things that tap into people's 1167 00:44:32,850 --> 00:44:35,730 probably badly remembered, and very distorted, 1168 00:44:35,730 --> 00:44:39,930 versions of American history that would basically suggest 1169 00:44:39,930 --> 00:44:42,090 that there should be some kind of similar kind 1170 00:44:42,090 --> 00:44:44,010 of spatial diffusion process, but now 1171 00:44:44,010 --> 00:44:46,320 in your 2D mental map of cities. 1172 00:44:46,320 --> 00:44:50,040 So again, there's no claim that there's any reality to this, 1173 00:44:50,040 --> 00:44:51,137 or fine-grained reality. 1174 00:44:51,137 --> 00:44:53,220 But we thought it would sort of roughly correspond 1175 00:44:53,220 --> 00:44:55,410 to people's internal causal generative 1176 00:44:55,410 --> 00:44:58,800 models of archeology. 1177 00:44:58,800 --> 00:45:00,900 Again, I think it says something about the way 1178 00:45:00,900 --> 00:45:02,566 human intelligence works that none of us 1179 00:45:02,566 --> 00:45:05,504 are archaeologists, probably, but we still have these ideas. 1180 00:45:05,504 --> 00:45:07,920 And it turned out that, here, a spatially structured model 1181 00:45:07,920 --> 00:45:08,940 actually works a lot better. 1182 00:45:08,940 --> 00:45:10,356 Again, it shouldn't be surprising. 1183 00:45:10,356 --> 00:45:13,816 It's just showing that actually, the way-- the judgments 1184 00:45:13,816 --> 00:45:16,440 people make when they're making inferences from a few examples, 1185 00:45:16,440 --> 00:45:17,981 just like you saw with predicting 1186 00:45:17,981 --> 00:45:20,070 the everyday events, but now in the much more 1187 00:45:20,070 --> 00:45:22,230 interestingly structured domain, is 1188 00:45:22,230 --> 00:45:25,200 sensitive to the different kinds of environmental statistics. 1189 00:45:25,200 --> 00:45:28,950 There it was different power laws versus Gaussians for cake 1190 00:45:28,950 --> 00:45:34,080 baking-- or for movie grosses versus lifetimes or something. 1191 00:45:34,080 --> 00:45:35,580 Here it's other stuff. 1192 00:45:35,580 --> 00:45:38,250 It's more interestingly structured kinds of knowledge. 1193 00:45:38,250 --> 00:45:40,050 But you see the same kind of picture. 1194 00:45:40,050 --> 00:45:41,760 And we thought that was interesting, 1195 00:45:41,760 --> 00:45:43,979 and again, suggests some of the ways 1196 00:45:43,979 --> 00:45:46,020 that we are starting to put these tools together, 1197 00:45:46,020 --> 00:45:48,019 putting together probabilistic generative models 1198 00:45:48,019 --> 00:45:50,340 with some kind of interestingly structured knowledge. 1199 00:45:50,340 --> 00:45:52,140 Now, again, as you saw from Surya, 1200 00:45:52,140 --> 00:45:54,740 and as Jay McClelland and Tim Rogers worked on, 1201 00:45:54,740 --> 00:45:56,490 you can try to capture a lot of this stuff 1202 00:45:56,490 --> 00:45:57,240 with neural networks. 1203 00:45:57,240 --> 00:45:59,781 The neat thing about the neural networks that these guys have 1204 00:45:59,781 --> 00:46:03,090 worked on is that exactly the same neural network can 1205 00:46:03,090 --> 00:46:05,640 capture this kind of thing, and it 1206 00:46:05,640 --> 00:46:07,380 can capture this kind of thing. 1207 00:46:07,380 --> 00:46:11,400 So you can train the very same hidden multilayer neural 1208 00:46:11,400 --> 00:46:16,260 network with one matrix of objects and features.
1209 00:46:16,260 --> 00:46:17,790 And the very same neural network can 1210 00:46:17,790 --> 00:46:21,416 predict the tree-structured patterns for animals 1211 00:46:21,416 --> 00:46:23,790 and their properties, as well as the spatially-structured 1212 00:46:23,790 --> 00:46:27,090 patterns for Native American artifacts and their cities. 1213 00:46:27,090 --> 00:46:31,380 The catch is that it doesn't do either of them that well. 1214 00:46:31,380 --> 00:46:35,900 It doesn't do as well as the tree-structured models do 1215 00:46:35,900 --> 00:46:36,630 for peop-- 1216 00:46:36,630 --> 00:46:37,950 when I say either, it doesn't do that well, I 1217 00:46:37,950 --> 00:46:39,450 mean, in capturing people's judgments. 1218 00:46:39,450 --> 00:46:41,670 It doesn't do as well as the best tree-structured models do 1219 00:46:41,670 --> 00:46:43,919 for people's concepts of animals and their properties. 1220 00:46:43,919 --> 00:46:46,890 And it doesn't do as well as the best spacial structures. 1221 00:46:46,890 --> 00:46:50,130 But again, it's in the same spirit as the DeepMind networks 1222 00:46:50,130 --> 00:46:51,550 for playing lots of Atari games. 1223 00:46:51,550 --> 00:46:53,550 The idea there is to have the same network solve 1224 00:46:53,550 --> 00:46:54,734 all these different tasks. 1225 00:46:54,734 --> 00:46:56,650 And in some sense, I think that's a good idea. 1226 00:46:56,650 --> 00:46:58,709 I just think that the architecture should 1227 00:46:58,709 --> 00:47:00,000 have a more flexible structure. 1228 00:47:00,000 --> 00:47:02,100 So we would also say, in some sense, 1229 00:47:02,100 --> 00:47:04,560 the same architecture is solving all these different tasks. 1230 00:47:04,560 --> 00:47:07,200 It's just that this is one setting of it. 1231 00:47:07,200 --> 00:47:10,260 And this is another setting of it. 1232 00:47:10,260 --> 00:47:12,720 And where they differ is in the kind of structure 1233 00:47:12,720 --> 00:47:14,220 that-- well, they differ in the fact 1234 00:47:14,220 --> 00:47:16,502 that they explicitly represent structure in the world. 1235 00:47:16,502 --> 00:47:18,960 And they explicitly represent different kinds of structure. 1236 00:47:18,960 --> 00:47:21,043 And they explicitly represent that different kinds 1237 00:47:21,043 --> 00:47:23,895 of structure are appropriate to different kinds of domains 1238 00:47:23,895 --> 00:47:26,520 in the world and our intuitions about the causal processes that 1239 00:47:26,520 --> 00:47:28,470 are at work producing the data. 1240 00:47:28,470 --> 00:47:29,970 And I think that, again, that's sort 1241 00:47:29,970 --> 00:47:32,136 of the difference between the pattern classification 1242 00:47:32,136 --> 00:47:33,840 and the understanding or explaining 1243 00:47:33,840 --> 00:47:37,680 view of intelligence. 1244 00:47:37,680 --> 00:47:40,320 The explanations, of course, go a lot beyond different ways 1245 00:47:40,320 --> 00:47:42,120 that similarity can be structured. 1246 00:47:42,120 --> 00:47:44,220 So one of the kind of nice things-- oh, 1247 00:47:44,220 --> 00:47:46,380 and I guess another-- 1248 00:47:46,380 --> 00:47:48,360 two other points beyond that. 1249 00:47:48,360 --> 00:47:50,070 One is that to get the neural networks 1250 00:47:50,070 --> 00:47:52,319 to do that, you have to train them with a lot of data. 
1251 00:47:52,319 --> 00:47:55,380 Remember, Surya, as Tommy pushed him on in that talk, 1252 00:47:55,380 --> 00:47:57,570 Surya was very concerned with modeling 1253 00:47:57,570 --> 00:48:01,650 the dynamics of learning in the sense of the optimization time 1254 00:48:01,650 --> 00:48:04,410 course, how the weights change over time. 1255 00:48:04,410 --> 00:48:06,750 But he was usually looking at infinite data. 1256 00:48:06,750 --> 00:48:09,300 So he was assuming that you had, effectively, 1257 00:48:09,300 --> 00:48:11,940 an infinite number of columns of any of these matrices. 1258 00:48:11,940 --> 00:48:13,920 So you could perfectly compute the statistics. 1259 00:48:13,920 --> 00:48:15,920 And another important thing about the difference 1260 00:48:15,920 --> 00:48:18,503 being the neural network models and the ones I was showing you 1261 00:48:18,503 --> 00:48:21,030 is that, suppose you want to train the model, not 1262 00:48:21,030 --> 00:48:24,780 on an infinite matrix, but on a small finite one, 1263 00:48:24,780 --> 00:48:26,460 and maybe one with missing data. 1264 00:48:26,460 --> 00:48:28,710 It's a lot harder to get the-- the neural network will 1265 00:48:28,710 --> 00:48:32,940 do a much poorer job capturing the structure than these more 1266 00:48:32,940 --> 00:48:33,760 structured models. 1267 00:48:33,760 --> 00:48:35,750 And again, in a way that's familiar with-- 1268 00:48:35,750 --> 00:48:38,820 have you guys talked about bias-variance dilemma? 1269 00:48:38,820 --> 00:48:42,249 So it's that same kind of idea that you probably 1270 00:48:42,249 --> 00:48:43,290 heard about from Lorenzo. 1271 00:48:43,290 --> 00:48:45,520 Was it Lorenzo or one of the machin learni-- 1272 00:48:45,520 --> 00:48:46,020 OK. 1273 00:48:46,020 --> 00:48:48,019 So it's that same kind of idea, but now applying 1274 00:48:48,019 --> 00:48:50,370 in this interesting case of structured estimation 1275 00:48:50,370 --> 00:48:52,440 of generative models for the world, 1276 00:48:52,440 --> 00:48:56,340 that if you have relatively little data, and sparse data, 1277 00:48:56,340 --> 00:48:59,822 then having a more structured inductive bi-- 1278 00:48:59,822 --> 00:49:02,280 having the inductive bias that comes from a more structured 1279 00:49:02,280 --> 00:49:06,030 representation is going to be much more valuable when 1280 00:49:06,030 --> 00:49:08,935 you have sparse and noisy data. 1281 00:49:08,935 --> 00:49:11,310 The key-- and again, this is something that Charles and I 1282 00:49:11,310 --> 00:49:13,590 were really interested in-- is we wanted to-- 1283 00:49:13,590 --> 00:49:16,080 like the DeepMind people, like the connectionists, 1284 00:49:16,080 --> 00:49:19,080 we wanted to build general purpose semantic cognition, 1285 00:49:19,080 --> 00:49:23,340 wanted to build general purpose learning and reasoning systems. 1286 00:49:23,340 --> 00:49:25,260 And we wanted to somehow figure out 1287 00:49:25,260 --> 00:49:27,843 how you could have the best of both worlds, how you could have 1288 00:49:27,843 --> 00:49:31,170 a system that relatively quickly could come 1289 00:49:31,170 --> 00:49:34,380 to get the right kind of strong constraint-inductive bias 1290 00:49:34,380 --> 00:49:37,110 in some domain, and a different one for a different domain, 1291 00:49:37,110 --> 00:49:38,700 yet could learn in a flexible way 1292 00:49:38,700 --> 00:49:41,580 to capture the different structure in different domains. 1293 00:49:41,580 --> 00:49:43,060 More on that in a little bit. 
1294 00:49:43,060 --> 00:49:45,150 But the other thing I wanted to talk about here 1295 00:49:45,150 --> 00:49:47,970 is just ways in which our mental models, 1296 00:49:47,970 --> 00:49:49,860 our causal and compositional ones, 1297 00:49:49,860 --> 00:49:51,991 go beyond just similarity. 1298 00:49:51,991 --> 00:49:53,490 I guess, since time is short-- well, 1299 00:49:53,490 --> 00:49:55,698 I was planning to go through this relatively quickly. 1300 00:49:55,698 --> 00:49:57,902 But anyway, mostly I'll just gesture towards this. 1301 00:49:57,902 --> 00:49:59,360 And if you're interested, you could 1302 00:49:59,360 --> 00:50:02,200 read the papers that Charles has, or his thesis. 1303 00:50:02,200 --> 00:50:05,690 But here, there's a long history of asking people 1304 00:50:05,690 --> 00:50:07,460 to make these kind of judgments, in which 1305 00:50:07,460 --> 00:50:10,730 the basis for the judgment isn't something like similarity, 1306 00:50:10,730 --> 00:50:12,599 but some other kind of causal reasoning. 1307 00:50:12,599 --> 00:50:14,390 So for example, consider these things here. 1308 00:50:14,390 --> 00:50:17,234 Poodles can bite through wire, therefore German shepherds 1309 00:50:17,234 --> 00:50:18,150 can bite through wire. 1310 00:50:18,150 --> 00:50:19,889 Is that a strong argument or weak? 1311 00:50:19,889 --> 00:50:22,430 Compare that with, dobermans can bite through wire, therefore 1312 00:50:22,430 --> 00:50:24,080 German shepherds can bite through wire. 1313 00:50:24,080 --> 00:50:26,600 So how many people think that the top argument is a stronger 1314 00:50:26,600 --> 00:50:27,890 one? 1315 00:50:27,890 --> 00:50:30,320 How many people think the bottom line is a stronger one? 1316 00:50:30,320 --> 00:50:31,100 So that's typical. 1317 00:50:31,100 --> 00:50:33,980 About twice as many people prefer the top one. 1318 00:50:33,980 --> 00:50:36,560 Because intuitively-- do I have a little thing 1319 00:50:36,560 --> 00:50:37,370 that will appear? 1320 00:50:37,370 --> 00:50:40,262 Intuitively, anyone want to explain why you thought so? 1321 00:50:44,198 --> 00:50:45,680 AUDIENCE: Poodles are really small. 1322 00:50:45,680 --> 00:50:47,330 JOSH TENENBAUM: Poodles are small or weak. 1323 00:50:47,330 --> 00:50:47,630 Yes. 1324 00:50:47,630 --> 00:50:49,296 And German shepherds are big and strong. 1325 00:50:49,296 --> 00:50:50,817 And what about dobermans? 1326 00:50:50,817 --> 00:50:52,900 AUDIENCE: They're just as big as German shepherds. 1327 00:50:52,900 --> 00:50:53,260 JOSH TENENBAUM: Yeah. 1328 00:50:53,260 --> 00:50:53,860 That's right. 1329 00:50:53,860 --> 00:50:56,291 So they're more similar to German shepherds, 1330 00:50:56,291 --> 00:50:57,790 because they're both big and strong. 1331 00:50:57,790 --> 00:51:01,327 But notice that something very different is going on here. 1332 00:51:01,327 --> 00:51:02,410 It's not about similarity. 1333 00:51:02,410 --> 00:51:04,180 It's sort of anti-similarity. 1334 00:51:04,180 --> 00:51:05,990 But it's not just anti-similarity. 1335 00:51:05,990 --> 00:51:09,160 Suppose I said, German shepherds can bite through wire, 1336 00:51:09,160 --> 00:51:11,320 therefore poodles can bite through wire. 1337 00:51:11,320 --> 00:51:12,489 Is that a good argument? 1338 00:51:12,489 --> 00:51:13,030 AUDIENCE: No. 1339 00:51:13,030 --> 00:51:13,630 It's an argument against. 1340 00:51:13,630 --> 00:51:13,750 JOSH TENENBAUM: No. 1341 00:51:13,750 --> 00:51:15,850 It's sort of a terrible argument, right? 
1342 00:51:15,850 --> 00:51:18,970 So there's some kind of asymmetric dimensional 1343 00:51:18,970 --> 00:51:20,380 reasoning going on. 1344 00:51:20,380 --> 00:51:23,470 Or similarly, if I said, which of these seems better 1345 00:51:23,470 --> 00:51:27,120 intuitively; Salmon carry some bacteria, 1346 00:51:27,120 --> 00:51:29,350 therefore grizzly bears are likely to carry it, 1347 00:51:29,350 --> 00:51:31,180 versus grizzly bears carry this, therefore 1348 00:51:31,180 --> 00:51:32,890 salmon are likely to carry it. 1349 00:51:32,890 --> 00:51:36,280 How many people say salmon, therefore grizzly bears? 1350 00:51:36,280 --> 00:51:38,560 How many people say grizzly bears, therefore salmon? 1351 00:51:38,560 --> 00:51:39,870 How do you know? 1352 00:51:39,870 --> 00:51:41,170 Those who-- yeah, you're right. 1353 00:51:41,170 --> 00:51:43,128 I mean, you're right in that's what people say. 1354 00:51:43,128 --> 00:51:44,427 I don't know if it's right. 1355 00:51:44,427 --> 00:51:45,260 Again, I made it up. 1356 00:51:45,260 --> 00:51:47,170 But why did you say that, those of you who said salmon? 1357 00:51:47,170 --> 00:51:48,430 AUDIENCE: Bears eat salmon. 1358 00:51:48,430 --> 00:51:49,180 JOSH TENENBAUM: Bears eat salmon. 1359 00:51:49,180 --> 00:51:49,680 Yeah. 1360 00:51:49,680 --> 00:51:52,600 So assuming that's true, so we're told or see on TV, 1361 00:51:52,600 --> 00:51:53,140 then yeah. 1362 00:51:53,140 --> 00:51:55,199 So anyway, these are these different kinds 1363 00:51:55,199 --> 00:51:56,365 of things that are going on. 1364 00:51:56,365 --> 00:51:58,736 And to cut to the chase, what we showed 1365 00:51:58,736 --> 00:52:01,360 is that you could capture these different patterns of reasoning 1366 00:52:01,360 --> 00:52:04,190 with, again, the same kind of thing, but different. 1367 00:52:04,190 --> 00:52:08,230 It's also a hierarchical generative model. 1368 00:52:08,230 --> 00:52:10,480 It also has, the key level of the hierarchy 1369 00:52:10,480 --> 00:52:13,630 is some kind of directed graphical structure 1370 00:52:13,630 --> 00:52:16,012 that generates distribution on observable properties. 1371 00:52:16,012 --> 00:52:18,220 But it's a fundamentally different kind of structure. 1372 00:52:18,220 --> 00:52:20,680 It's not just a tree or a space. 1373 00:52:20,680 --> 00:52:22,420 It might be a different kind of graph 1374 00:52:22,420 --> 00:52:23,890 and a different kind of process. 1375 00:52:23,890 --> 00:52:26,740 So to be a little bit more technical, the things 1376 00:52:26,740 --> 00:52:29,400 I showed you with the tree and the low-dimensional space, 1377 00:52:29,400 --> 00:52:31,540 they had a different geometry to the graph, 1378 00:52:31,540 --> 00:52:34,090 but the same stochastic process operating over it. 1379 00:52:34,090 --> 00:52:36,730 It was, in both cases, basically a diffusion process. 1380 00:52:36,730 --> 00:52:39,292 Whereas to get the kinds of reasoning that you saw here, 1381 00:52:39,292 --> 00:52:40,750 you need a different kind of graph. 1382 00:52:40,750 --> 00:52:44,020 In one case it's like a chain to capture a dimension of strength 1383 00:52:44,020 --> 00:52:45,230 or size, say. 1384 00:52:45,230 --> 00:52:47,494 In the other case, it's some kind of food web thing. 1385 00:52:47,494 --> 00:52:48,160 It's not a tree. 1386 00:52:48,160 --> 00:52:50,860 It's that kind of directed network. 1387 00:52:50,860 --> 00:52:52,750 But you also need a different process. 
1388 00:52:52,750 --> 00:52:54,580 So the ways-- the kind of probability model 1389 00:52:54,580 --> 00:52:55,840 you define over it is different. 1390 00:52:55,840 --> 00:52:57,460 And it's easy to see on the-- 1391 00:52:57,460 --> 00:53:00,970 for example-- on the reasoning with these threshold things, 1392 00:53:00,970 --> 00:53:02,710 like the strength properties, if you 1393 00:53:02,710 --> 00:53:07,870 compare a 1D chain with just symmetric diffusion, 1394 00:53:07,870 --> 00:53:09,280 you get a much worse fit to people's 1395 00:53:09,280 --> 00:53:11,410 judgments than if you'd used what we called this drift 1396 00:53:11,410 --> 00:53:13,576 threshold thing, which is basically a way of saying, 1397 00:53:13,576 --> 00:53:14,890 OK, I don't know. 1398 00:53:14,890 --> 00:53:17,407 There's some mapping from strength to being 1399 00:53:17,407 --> 00:53:18,490 able to bite through wire. 1400 00:53:18,490 --> 00:53:19,820 I don't know exactly what it is. 1401 00:53:19,820 --> 00:53:21,340 But the higher up you go on one, it's 1402 00:53:21,340 --> 00:53:23,006 probably more likely that you can bite-- 1403 00:53:23,006 --> 00:53:24,497 that you can do the other. 1404 00:53:24,497 --> 00:53:26,830 So that provides a wonderful model of people's judgments 1405 00:53:26,830 --> 00:53:28,660 on these kinds of tasks. 1406 00:53:28,660 --> 00:53:30,700 But that sort of diffusion process, 1407 00:53:30,700 --> 00:53:34,510 like if it was like mutation in biology, then that 1408 00:53:34,510 --> 00:53:35,980 would provide a very bad model. 1409 00:53:35,980 --> 00:53:38,320 That's the second row here. 1410 00:53:38,320 --> 00:53:41,830 Similarly, this sort of directed kind of noisy transmission 1411 00:53:41,830 --> 00:53:46,890 process on a food web does a great job of modeling people's 1412 00:53:46,890 --> 00:53:48,730 judgments about diseases, but not 1413 00:53:48,730 --> 00:53:51,520 a very good way of modeling people's judgments 1414 00:53:51,520 --> 00:53:53,230 about these biological properties. 1415 00:53:53,230 --> 00:53:55,300 But the tree models you saw before that 1416 00:53:55,300 --> 00:53:56,770 do a great job of modeling people's 1417 00:53:56,770 --> 00:53:58,520 judgments about the properties of animals, 1418 00:53:58,520 --> 00:54:01,420 they do a lousy job of modeling these disease judgments. 1419 00:54:01,420 --> 00:54:04,060 So we have this picture emerging that, at the time, 1420 00:54:04,060 --> 00:54:05,950 was very satisfying to us. 1421 00:54:05,950 --> 00:54:08,380 That, hey, we can take this domain of, say, 1422 00:54:08,380 --> 00:54:10,870 animals and their properties, or the various things we 1423 00:54:10,870 --> 00:54:14,110 can reason about, and there's a lot of different ways 1424 00:54:14,110 --> 00:54:15,940 we can reason about just this one domain. 1425 00:54:15,940 --> 00:54:18,942 And by building these structured probabilistic models 1426 00:54:18,942 --> 00:54:20,650 with different kinds of graph structures 1427 00:54:20,650 --> 00:54:22,990 that capture different kinds of causal processes, 1428 00:54:22,990 --> 00:54:25,690 we could really describe a lot of different kinds 1429 00:54:25,690 --> 00:54:27,640 of reasoning. 1430 00:54:27,640 --> 00:54:29,249 And we saw this as part of a theme 1431 00:54:29,249 --> 00:54:31,040 that a lot of other people were working on. 1432 00:54:31,040 --> 00:54:32,590 So this is-- I mentioned this before, 1433 00:54:32,590 --> 00:54:35,440 but now I'm just sort of throwing it all out there.
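To make the asymmetry concrete, here is a deliberately simplified sketch in Python of a threshold-style model on a 1D strength ordering. It is not the drift process actually used in that work, just about the simplest version that reproduces the poodle/German shepherd asymmetry; the strength values and the threshold grid are made-up numbers.

```python
# A simplified threshold model over a 1D "strength" ordering (illustrative
# numbers only; the work described above used a drift process over a chain).
strength = {"poodle": 1, "doberman": 4, "german_shepherd": 5}

# Hypotheses: the property ("can bite through wire") holds for every animal
# whose strength exceeds some unknown threshold t, with a uniform prior on t.
thresholds = [0, 1, 2, 3, 4, 5]

def argument_strength(conclusion, premise):
    """P(conclusion animal has the property | premise animal has it)."""
    consistent = [t for t in thresholds if strength[premise] > t]
    agree = [t for t in consistent if strength[conclusion] > t]
    return len(agree) / len(consistent)

# "Poodles can bite through wire, therefore German shepherds can":
print(argument_strength("german_shepherd", premise="poodle"))  # 1.0
# "German shepherds can bite through wire, therefore poodles can":
print(argument_strength("poodle", premise="german_shepherd"))  # 0.2
```

A symmetric diffusion process along the same chain would give the same answer in both directions, which is exactly why it fits these asymmetric judgments badly.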
1434 00:54:35,440 --> 00:54:36,970 A lot of people at the time-- again, 1435 00:54:36,970 --> 00:54:40,690 this is maybe somewhere between 5 to 10 years ago-- 1436 00:54:40,690 --> 00:54:43,930 more like six or seven years ago-- 1437 00:54:43,930 --> 00:54:46,930 we're extremely interested in this general view 1438 00:54:46,930 --> 00:54:49,540 of common sense reasoning and semantic cognition 1439 00:54:49,540 --> 00:54:51,599 by basically taking big matrices and boiling them 1440 00:54:51,599 --> 00:54:53,140 down to some kind of graph structure. 1441 00:54:53,140 --> 00:54:55,699 In some form, that's what Tom Mitchell was doing, 1442 00:54:55,699 --> 00:54:57,490 not just in the talk you saw, but remember, 1443 00:54:57,490 --> 00:54:59,380 he said there's this other stuff he does-- 1444 00:54:59,380 --> 00:55:01,960 this thing called NELL, the Never Ending Language Learner. 1445 00:55:01,960 --> 00:55:03,460 I'm showing a little glimpse of that 1446 00:55:03,460 --> 00:55:06,730 up there from a New York Times piece on it in the upper right. 1447 00:55:06,730 --> 00:55:09,220 In some ways, in a sort of at least more implicit way, 1448 00:55:09,220 --> 00:55:11,560 it's what the neural networks that Jay McClelland, Tim 1449 00:55:11,560 --> 00:55:14,156 Rogers, Surya were talking about do. 1450 00:55:14,156 --> 00:55:16,030 And we thought-- you know, we had good reason 1451 00:55:16,030 --> 00:55:17,830 to think that our approach was more 1452 00:55:17,830 --> 00:55:20,630 like what people were doing than some of these others. 1453 00:55:20,630 --> 00:55:23,380 But I then came to see-- and this was around the time when 1454 00:55:23,380 --> 00:55:25,367 CBMM was actually getting started-- 1455 00:55:25,367 --> 00:55:26,950 that none of these were going to work. 1456 00:55:26,950 --> 00:55:29,680 Like the whole thing was just not going to work. 1457 00:55:29,680 --> 00:55:32,370 Liz was one of the main people who convinced me of this. 1458 00:55:32,370 --> 00:55:34,120 But you could just read the New York Times 1459 00:55:34,120 --> 00:55:37,210 article on Tom Mitchell's piece, and you can see what's missing. 1460 00:55:37,210 --> 00:55:40,750 So there's Tom, remember. 1461 00:55:40,750 --> 00:55:43,510 This was an article from 2010. 1462 00:55:43,510 --> 00:55:46,070 Just to set the chronology right, that was right around-- 1463 00:55:46,070 --> 00:55:48,490 a little bit after Charles had finished all that nice work 1464 00:55:48,490 --> 00:55:51,440 I showed you, which again, I still think is valuable. 1465 00:55:51,440 --> 00:55:53,260 I think it is capturing something 1466 00:55:53,260 --> 00:55:55,390 about what's going on. 1467 00:55:55,390 --> 00:55:57,520 It was very appealing to people, like at Google, 1468 00:55:57,520 --> 00:56:00,490 because these knowledge graphs are very much like the way, 1469 00:56:00,490 --> 00:56:02,380 around the same time, Google was starting 1470 00:56:02,380 --> 00:56:04,690 to try to put more semantics into web search-- 1471 00:56:04,690 --> 00:56:07,330 again, connected to the work that Tom was doing. 1472 00:56:07,330 --> 00:56:09,820 And there was this nice article in the New York 1473 00:56:09,820 --> 00:56:12,850 Times talking about how they built their system by reading 1474 00:56:12,850 --> 00:56:13,780 the web. 1475 00:56:13,780 --> 00:56:15,387 But the best part of it was describing 1476 00:56:15,387 --> 00:56:16,970 one of the mistakes their system made. 1477 00:56:16,970 --> 00:56:18,949 So let me just show this to you. 
1478 00:56:18,949 --> 00:56:20,740 About knowledge that's obvious to a person, 1479 00:56:20,740 --> 00:56:22,540 but not to a computer-- again, it's 1480 00:56:22,540 --> 00:56:25,566 Tom Mitchell himself describing this. 1481 00:56:25,566 --> 00:56:27,940 And the challenge of, that's where NELL has to be headed, 1482 00:56:27,940 --> 00:56:30,610 is how to make the things that are obvious to people 1483 00:56:30,610 --> 00:56:33,160 obvious to computers. 1484 00:56:33,160 --> 00:56:35,590 He gives this example of a bug that 1485 00:56:35,590 --> 00:56:38,560 happened in NELL's early life. 1486 00:56:38,560 --> 00:56:40,980 The research team noticed that-- 1487 00:56:40,980 --> 00:56:42,990 oh, let's skip down there. 1488 00:56:42,990 --> 00:56:44,330 So, a particular example-- 1489 00:56:44,330 --> 00:56:47,920 when Dr. Mitchell scanned the baked goods category recently, 1490 00:56:47,920 --> 00:56:50,050 he noticed a clear pattern. 1491 00:56:50,050 --> 00:56:51,760 NELL was at first quite accurate, easily 1492 00:56:51,760 --> 00:56:54,420 identifying all kinds of pies, breads, cakes, and cookies 1493 00:56:54,420 --> 00:56:55,570 as baked goods. 1494 00:56:55,570 --> 00:56:58,620 But things went awry after NELL's noun phrase classifier 1495 00:56:58,620 --> 00:57:00,370 decided internet cookies was a baked good. 1496 00:57:03,100 --> 00:57:06,100 NELL had read the sentence "I deleted my internet cookies." 1497 00:57:06,100 --> 00:57:07,960 And again, think of that as, it's kind 1498 00:57:07,960 --> 00:57:09,340 of like a simple proposition. 1499 00:57:09,340 --> 00:57:10,180 It's like, OK. 1500 00:57:10,180 --> 00:57:12,184 But the way it parses that is cookies 1501 00:57:12,184 --> 00:57:14,350 are things that can be deleted, the same way you can 1502 00:57:14,350 --> 00:57:16,040 say horses have T9 hormones. 1503 00:57:16,040 --> 00:57:17,500 It's basically just a matrix. 1504 00:57:17,500 --> 00:57:21,220 And the concept is internet cookies. 1505 00:57:21,220 --> 00:57:23,620 And then there's the property of can be deleted, 1506 00:57:23,620 --> 00:57:25,545 or something like that. 1507 00:57:25,545 --> 00:57:27,920 And it knows something about natural language processing. 1508 00:57:27,920 --> 00:57:28,644 So it can see-- 1509 00:57:28,644 --> 00:57:30,060 and it's trying to be intelligent. 1510 00:57:30,060 --> 00:57:30,980 Oh, internet cookies. 1511 00:57:30,980 --> 00:57:33,160 Well, maybe like chocolate chip cookies and oatmeal raisin 1512 00:57:33,160 --> 00:57:34,743 cookies, those were a kind of cookies. 1513 00:57:34,743 --> 00:57:36,170 Basically, that's what it did. 1514 00:57:36,170 --> 00:57:37,670 Or no, actually did the opposite. 1515 00:57:37,670 --> 00:57:39,340 [LAUGHS] 1516 00:57:39,340 --> 00:57:42,740 It said-- when it read "I deleted my files," 1517 00:57:42,740 --> 00:57:44,740 it decided files was probably a baked good, too. 1518 00:57:44,740 --> 00:57:46,990 Well, first it decided internet cookies was a baked good, 1519 00:57:46,990 --> 00:57:47,710 like those other cookies. 1520 00:57:47,710 --> 00:57:49,450 And then it decided that files were a baked goods. 1521 00:57:49,450 --> 00:57:51,450 And it started this whole avalanche of mistakes, 1522 00:57:51,450 --> 00:57:52,776 Dr. Mitchell said. 1523 00:57:52,776 --> 00:57:54,400 He corrected the internet cookies error 1524 00:57:54,400 --> 00:57:56,827 and restarted NELL's bakery education. 1525 00:57:56,827 --> 00:57:57,910 [LAUGHS] I mean, like, OK. 
1526 00:57:57,910 --> 00:58:02,080 Now rerun without that problem. 1527 00:58:02,080 --> 00:58:04,365 So the point, the lesson Tom draws from this, 1528 00:58:04,365 --> 00:58:05,740 and that the article talks about, 1529 00:58:05,740 --> 00:58:07,630 is, oh, well, we still need some assistance. 1530 00:58:07,630 --> 00:58:10,010 We have to go back and, by hand, set these things. 1531 00:58:10,010 --> 00:58:11,902 But the key thing is that, really-- 1532 00:58:11,902 --> 00:58:13,360 I think the message this is telling 1533 00:58:13,360 --> 00:58:16,360 us is no human child would ever make this mistake. 1534 00:58:16,360 --> 00:58:17,735 Human children learn in this way. 1535 00:58:17,735 --> 00:58:19,401 They don't need this kind of assistance. 1536 00:58:19,401 --> 00:58:21,310 It's true that, as Tom says, you and I don't 1537 00:58:21,310 --> 00:58:22,730 learn in isolation either. 1538 00:58:22,730 --> 00:58:24,688 So, all of the things we've been talking about, 1539 00:58:24,688 --> 00:58:27,280 about learning from prior knowledge and so on, are true. 1540 00:58:27,280 --> 00:58:30,010 But there's a basic kind of common sense thing 1541 00:58:30,010 --> 00:58:32,710 that this is missing, which is that by the time 1542 00:58:32,710 --> 00:58:34,300 a child is learning anything about computers, 1543 00:58:34,300 --> 00:58:37,600 and files, and so on, 1544 00:58:37,600 --> 00:58:40,300 they understand well before that, 1545 00:58:40,300 --> 00:58:43,030 like back in early infancy, from say, work that Liz has done, 1546 00:58:43,030 --> 00:58:47,020 and many others, that cookies, in the sense of baked goods, 1547 00:58:47,020 --> 00:58:49,960 are physical objects, a kind of food, a thing you eat. 1548 00:58:49,960 --> 00:58:52,441 Files, email-- not a physical object. 1549 00:58:52,441 --> 00:58:54,190 And there's all sorts of interesting stuff 1550 00:58:54,190 --> 00:58:57,640 to understand about how kids learn that a book can be both-- 1551 00:58:57,640 --> 00:59:00,100 that a novel is both a story and also 1552 00:59:00,100 --> 00:59:02,560 a physical object, and a lot of that stuff. 1553 00:59:02,560 --> 00:59:05,380 But there's a basic common sense understanding 1554 00:59:05,380 --> 00:59:08,170 of the world as consisting of physical objects, 1555 00:59:08,170 --> 00:59:09,921 and for example, agents and their goals. 1556 00:59:09,921 --> 00:59:11,670 You heard a little bit about this from us, 1557 00:59:11,670 --> 00:59:13,270 from me and Tomer, on the first day. 1558 00:59:13,270 --> 00:59:15,060 And that's where I want to turn next.
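To make the "big matrix" framing concrete, here is a minimal sketch of the failure mode being described. This is not NELL's actual algorithm, and the entities, properties, and weights are invented for illustration: knowledge is a noun-phrase-by-property matrix, category membership is guessed from row similarity plus surface form, and "internet cookies" drifts into the baked-goods cluster.

```python
# A minimal sketch (not NELL's actual algorithm) of the "big matrix" framing:
# rows are noun phrases, columns are extracted properties, and category
# membership is guessed from row similarity plus surface form.
# All entities, properties, and numbers are made up for illustration.

import math

# Column labels for the property vectors below (documentation only).
PROPERTIES = ["is sweet", "is baked", "can be eaten", "can be deleted"]

#                          sweet  baked  eaten  deleted
CONCEPTS = {
    "chocolate chip cookies": [1, 1, 1, 0],
    "oatmeal raisin cookies": [1, 1, 1, 0],
    "bread":                  [0, 1, 1, 0],
    # Only evidence ever read: "I deleted my internet cookies."
    "internet cookies":       [0, 0, 0, 1],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def surface_overlap(a, b):
    # Text-driven extraction also sees the shared head noun ("... cookies"),
    # which acts like an extra shared feature.
    return 1.0 if a.split()[-1] == b.split()[-1] else 0.0

baked_goods = ["chocolate chip cookies", "oatmeal raisin cookies", "bread"]
query = "internet cookies"

score = sum(
    0.5 * cosine(CONCEPTS[query], CONCEPTS[b]) + 0.5 * surface_overlap(query, b)
    for b in baked_goods
) / len(baked_goods)

print(f"similarity of '{query}' to the baked goods cluster: {score:.2f}")
# The property vectors alone give ~0 similarity, but the shared word "cookies"
# pulls the score up. Once "internet cookies" is wrongly added to the cluster,
# "can be deleted" starts looking like a baked-good property, "files" gets
# pulled in next, and you get the avalanche the article describes. Nothing in
# the matrix says a cookie is a physical object you eat and a file is not.
```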
1559 00:59:15,060 --> 00:59:17,810 And this is just one of many examples where we realized that, 1560 00:59:17,810 --> 00:59:21,610 as cool as this system is, as great as all this stuff is, 1561 00:59:21,610 --> 00:59:24,670 just trying to approach semantic knowledge and common sense 1562 00:59:24,670 --> 00:59:27,460 reasoning as some kind of big matrix completion, 1563 00:59:27,460 --> 00:59:30,580 without a much more fundamental grasp of the ways in which 1564 00:59:30,580 --> 00:59:32,824 the world is real to a human mind, 1565 00:59:32,824 --> 00:59:34,990 well before a child is learning anything about language 1566 00:59:34,990 --> 00:59:36,930 or any of this higher level stuff, 1567 00:59:36,930 --> 00:59:39,850 was just not going to work, in the same way 1568 00:59:39,850 --> 00:59:42,100 that I think if you want to build a system that learns 1569 00:59:42,100 --> 00:59:44,100 to play a video game, even remotely like the way 1570 00:59:44,100 --> 00:59:44,830 a human does, 1571 00:59:44,830 --> 00:59:47,165 there's a lot of more basic stuff you have to build on. 1572 00:59:47,165 --> 00:59:49,040 And it's the same basic stuff, I would argue. 1573 00:59:49,040 --> 00:59:50,920 A cool thing about Atari video games 1574 00:59:50,920 --> 00:59:54,310 is that, even though they were very low resolution, 1575 00:59:54,310 --> 00:59:59,140 very low-bit color displays, with very big pixels, what 1576 00:59:59,140 --> 01:00:01,750 makes your ability to learn that game work is 1577 01:00:01,750 --> 01:00:04,840 the same kind of thing that makes the ability, even as 1578 01:00:04,840 --> 01:00:07,030 a young child, to not make this mistake. 1579 01:00:07,030 --> 01:00:09,370 And it's the kind of thing that Liz and people 1580 01:00:09,370 --> 01:00:11,530 in her field of developmental psychology-- 1581 01:00:11,530 --> 01:00:14,710 in particular, infant research-- have been studying really 1582 01:00:14,710 --> 01:00:16,960 excitingly for a couple of decades. 1583 01:00:16,960 --> 01:00:20,144 That, I think, is as transformative for the topic 1584 01:00:20,144 --> 01:00:22,060 of intelligence in brains, minds, and machines 1585 01:00:22,060 --> 01:00:23,090 as anything. 1586 01:00:23,090 --> 01:00:24,880 So that's what motivated the work we've 1587 01:00:24,880 --> 01:00:27,213 been doing in the last few years and the main work we're 1588 01:00:27,213 --> 01:00:28,480 trying to do in the center. 1589 01:00:28,480 --> 01:00:31,830 And it also goes hand-in-hand with the ways in which 1590 01:00:31,830 --> 01:00:34,810 we've realized that we have to take what we've learned how 1591 01:00:34,810 --> 01:00:37,570 to do with building probabilistic models over interesting 1592 01:00:37,570 --> 01:00:39,200 symbolically-structured representations 1593 01:00:39,200 --> 01:00:42,550 and so on, but move way beyond what you could call-- 1594 01:00:42,550 --> 01:00:45,379 I mean, we need better, even more interesting, 1595 01:00:45,379 --> 01:00:46,420 symbolic representations. 1596 01:00:46,420 --> 01:00:48,760 In particular, we need to move beyond graphs 1597 01:00:48,760 --> 01:00:52,441 and stochastic processes defined over graphs to programs. 1598 01:00:52,441 --> 01:00:54,190 So that's where the probabilistic programs 1599 01:00:54,190 --> 01:00:55,790 come back into the mix. 1600 01:00:55,790 --> 01:00:57,100 So again, you already saw this. 1601 01:00:57,100 --> 01:00:58,641 And I'm trying to close the loop back 1602 01:00:58,641 --> 01:00:59,860 to what we're doing in CBMM.
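To give a flavor of what moving "beyond graphs to programs" means, here is a toy sketch, my own illustration rather than any particular model from the lecture: a graph stores static relations you can only look up, while a probabilistic program describes a process (here, an agent taking noisy, roughly rational steps toward a goal) that you can run forward and invert to infer the latent goal from observed motion.

```python
# A toy contrast (illustration only): graph-style knowledge is a lookup table,
# program-style knowledge is a generative process you can simulate and invert.

import math
import random

# Graph-style knowledge: which nodes relate to which. Shown only for contrast;
# there is nothing here to simulate.
knowledge_graph = {("agent", "moves_toward"): "goal"}

# Program-style knowledge: agents choose actions that (noisily) move them
# toward a latent goal. Inference inverts the program: given observed motion,
# which goal was being pursued?
GOALS = {"red ball": (10.0, 0.0), "blue ball": (0.0, 10.0)}

def step_toward(pos, goal, noise=0.5):
    """One noisy, roughly rational unit step toward a goal position."""
    dx, dy = goal[0] - pos[0], goal[1] - pos[1]
    norm = math.hypot(dx, dy) or 1.0
    return (pos[0] + dx / norm + random.gauss(0, noise),
            pos[1] + dy / norm + random.gauss(0, noise))

def likelihood(path, goal, noise=0.5):
    """How probable is the observed path if the agent wanted this goal?"""
    logp = 0.0
    for p, q in zip(path, path[1:]):
        mx, my = step_toward(p, goal, noise=0.0)   # expected next position
        logp += -((q[0] - mx) ** 2 + (q[1] - my) ** 2) / (2 * noise ** 2)
    return math.exp(logp)

# An observed path drifting rightward looks like pursuit of the red ball.
observed = [(0.0, 0.0), (1.1, 0.2), (2.0, -0.1), (3.2, 0.3)]
scores = {g: likelihood(observed, xy) for g, xy in GOALS.items()}
total = sum(scores.values())
for g, s in scores.items():
    print(f"P(goal = {g} | path) ~ {s / total:.2f}")
```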
1603 01:00:59,860 --> 01:01:01,960 I've given you about 10 to 15 years 1604 01:01:01,960 --> 01:01:04,669 of background in our field of how we got to this, why 1605 01:01:04,669 --> 01:01:06,460 we think this is interesting and important, 1606 01:01:06,460 --> 01:01:08,350 and why we think we need to-- 1607 01:01:08,350 --> 01:01:11,170 why we've developed a certain toolkit of ideas, 1608 01:01:11,170 --> 01:01:15,100 and why we think we needed to keep extending it. 1609 01:01:15,100 --> 01:01:17,830 And I think, as you saw before, and as you'll see, 1610 01:01:17,830 --> 01:01:19,512 this also, in some ways-- 1611 01:01:19,512 --> 01:01:21,970 I think we're getting more and more to the interesting part 1612 01:01:21,970 --> 01:01:22,871 of common sense. 1613 01:01:22,871 --> 01:01:25,120 But in another way, we're getting back to the problems 1614 01:01:25,120 --> 01:01:26,500 I started off with and what a lot 1615 01:01:26,500 --> 01:01:27,880 of other people at this summer school 1616 01:01:27,880 --> 01:01:29,380 have an interest in, which is things 1617 01:01:29,380 --> 01:01:31,810 like much more basic aspects of visual perception. 1618 01:01:31,810 --> 01:01:35,020 I think the heart of real intelligence and common sense 1619 01:01:35,020 --> 01:01:37,050 reasoning that we're talking about 1620 01:01:37,050 --> 01:01:40,180 is directly connected to vision and other sense modalities, 1621 01:01:40,180 --> 01:01:42,580 and how we get around in the world and plan our actions, 1622 01:01:42,580 --> 01:01:45,157 and the very basic kinds of goal-directed social 1623 01:01:45,157 --> 01:01:47,240 understanding that you saw in those little videos 1624 01:01:47,240 --> 01:01:49,027 of the red and blue ball, or that you 1625 01:01:49,027 --> 01:01:51,360 see if you're trying to do action recognition and action 1626 01:01:51,360 --> 01:01:52,450 understanding. 1627 01:01:52,450 --> 01:01:55,080 So in some sense, it's gotten more cognitive. 1628 01:01:55,080 --> 01:01:58,080 But also, by getting to the root of our common sense 1629 01:01:58,080 --> 01:02:01,350 knowledge, it makes better contact with vision, 1630 01:02:01,350 --> 01:02:02,980 with neuroscience research. 1631 01:02:02,980 --> 01:02:05,429 And so I think it's a super exciting development 1632 01:02:05,429 --> 01:02:07,470 in what we're doing for the larger Brains, Minds, 1633 01:02:07,470 --> 01:02:09,150 and Machines agenda. 1634 01:02:09,150 --> 01:02:10,770 So again, now we're saying, OK, let's 1635 01:02:10,770 --> 01:02:13,920 try to understand the way in which-- 1636 01:02:13,920 --> 01:02:16,530 even these kids playing with blocks, 1637 01:02:16,530 --> 01:02:19,290 the world is real to them. 1638 01:02:19,290 --> 01:02:21,209 It's not just a big matrix of data. 1639 01:02:21,209 --> 01:02:22,500 That is a thing in their hands. 1640 01:02:22,500 --> 01:02:24,083 And they have an understanding of what 1641 01:02:24,083 --> 01:02:27,540 a thing is before they start compiling lists of properties. 1642 01:02:27,540 --> 01:02:29,430 And they're playing with somebody else. 1643 01:02:29,430 --> 01:02:31,890 That hand is attached to a person, who has goals. 1644 01:02:31,890 --> 01:02:35,410 It's not just a big matrix of rows and columns. 1645 01:02:35,410 --> 01:02:38,430 It's an agent with goals, and even a mind.
1646 01:02:38,430 --> 01:02:41,610 And they understand those things before they 1647 01:02:41,610 --> 01:02:44,605 start to learn a lot of other things, like words for objects, 1648 01:02:44,605 --> 01:02:49,000 and advanced game-playing behavior, and so on. 1649 01:02:49,000 --> 01:02:50,820 And when we want to talk about learning, 1650 01:02:50,820 --> 01:02:52,695 we still are interested in one-shot learning, 1651 01:02:52,695 --> 01:02:55,440 or very rapid learning from a few examples. 1652 01:02:55,440 --> 01:02:58,950 And we're still interested in how prior knowledge guides 1653 01:02:58,950 --> 01:03:01,390 that, and how that knowledge can be built. 1654 01:03:01,390 --> 01:03:03,080 But we want to do it in this context. 1655 01:03:03,080 --> 01:03:04,410 We want to study it in the context of, say, 1656 01:03:04,410 --> 01:03:06,210 how you learn how magnets work, or how you learn 1657 01:03:06,210 --> 01:03:08,760 how a touchscreen device works-- really interesting kinds 1658 01:03:08,760 --> 01:03:12,900 of grounded physical causes. 1659 01:03:12,900 --> 01:03:15,990 So this is what we have, or what I've come 1660 01:03:15,990 --> 01:03:18,889 to call the common sense core. 1661 01:03:18,889 --> 01:03:21,180 Liz, are you going to talk about core knowledge at all? 1662 01:03:21,180 --> 01:03:22,596 So there's a phrase that Liz likes 1663 01:03:22,596 --> 01:03:23,820 to use called core knowledge. 1664 01:03:23,820 --> 01:03:25,620 And this is definitely meant to evoke that. 1665 01:03:25,620 --> 01:03:27,780 And it's inspired by it. 1666 01:03:27,780 --> 01:03:29,269 I guess I changed it a little bit, 1667 01:03:29,269 --> 01:03:30,810 because I wanted it to mean something 1668 01:03:30,810 --> 01:03:32,080 a little bit different. 1669 01:03:32,080 --> 01:03:34,410 And I think, again, to anticipate a little bit, 1670 01:03:34,410 --> 01:03:35,789 the main difference is-- 1671 01:03:35,789 --> 01:03:36,330 I don't know. 1672 01:03:36,330 --> 01:03:38,667 What's the main difference? 1673 01:03:38,667 --> 01:03:40,500 The main difference is that, in the same way 1674 01:03:40,500 --> 01:03:42,090 that lots of people look at me and say, oh, he's 1675 01:03:42,090 --> 01:03:44,190 the Bayesian guy, lots of people look at Liz 1676 01:03:44,190 --> 01:03:47,340 and say, oh, she's the nativist gal or something. 1677 01:03:47,340 --> 01:03:49,590 And it's true that, compared to a lot of other people, 1678 01:03:49,590 --> 01:03:51,330 I tend to be more interested in, and have 1679 01:03:51,330 --> 01:03:53,070 done more work prominently associated 1680 01:03:53,070 --> 01:03:54,507 with, Bayesian inference. 1681 01:03:54,507 --> 01:03:56,590 But by no means do I think that's the whole story. 1682 01:03:56,590 --> 01:03:58,860 And part of what I tried to show you, and will keep showing you, 1683 01:03:58,860 --> 01:03:59,880 is ways in which that's only really 1684 01:03:59,880 --> 01:04:01,005 the beginning of the story.
1685 01:04:01,005 --> 01:04:02,670 And Liz is prominently associated, 1686 01:04:02,670 --> 01:04:05,850 and you'll see some of this, with really fascinating 1687 01:04:05,850 --> 01:04:10,290 discoveries that key high level concepts, key kinds 1688 01:04:10,290 --> 01:04:13,230 of real knowledge, are present, in some sense, 1689 01:04:13,230 --> 01:04:16,890 as early as you can look, and in some form, I think, 1690 01:04:16,890 --> 01:04:20,490 very plausibly, have to be due to some kind of innately 1691 01:04:20,490 --> 01:04:23,070 unfolding genetic program that builds a mind the same way it 1692 01:04:23,070 --> 01:04:23,820 builds a brain. 1693 01:04:23,820 --> 01:04:25,140 But, as we'll hear from her, that's, 1694 01:04:25,140 --> 01:04:27,223 in some ways, only the beginning, or only one part 1695 01:04:27,223 --> 01:04:29,130 of a much richer, more interesting story 1696 01:04:29,130 --> 01:04:31,450 that she's been developing. 1697 01:04:31,450 --> 01:04:33,274 But for that, among other reasons, 1698 01:04:33,274 --> 01:04:34,690 I'm calling it something a little different. 1699 01:04:34,690 --> 01:04:36,990 And I'm trying to emphasize the connection to what people in AI 1700 01:04:36,990 --> 01:04:38,156 call common sense reasoning. 1701 01:04:38,156 --> 01:04:41,050 Because I really do think this is the heart of common sense. 1702 01:04:41,050 --> 01:04:43,380 It's this intuitive physics and intuitive psychology. 1703 01:04:43,380 --> 01:04:47,610 So again, you saw us already give an intro to this. 1704 01:04:47,610 --> 01:04:49,950 Maybe what I'll just do is show you a little bit more 1705 01:04:49,950 --> 01:04:53,692 of the-- well, are you going to talk about the stuff at all? 1706 01:04:53,692 --> 01:04:54,525 LIZ SPELKE: I guess. 1707 01:04:54,525 --> 01:04:55,020 Yeah. 1708 01:04:55,020 --> 01:04:56,061 JOSH TENENBAUM: Well, OK. 1709 01:04:56,061 --> 01:04:58,599 So this is work-- some of this is based on Liz's work. 1710 01:04:58,599 --> 01:05:00,390 Some of this is based on the work of Renée Baillargeon, 1711 01:05:00,390 --> 01:05:02,950 a close colleague of hers, and many other people out there. 1712 01:05:02,950 --> 01:05:04,800 And I wasn't really going to go into the details. 1713 01:05:04,800 --> 01:05:07,230 And maybe, Liz, we can decide whether you want to do this 1714 01:05:07,230 --> 01:05:07,740 or not. 1715 01:05:07,740 --> 01:05:10,920 But what they've shown is that, even prior 1716 01:05:10,920 --> 01:05:14,280 to the time when kids are learning words for objects, 1717 01:05:14,280 --> 01:05:17,400 all of this stuff with infants, two months, four months, eight 1718 01:05:17,400 --> 01:05:18,840 months-- 1719 01:05:18,840 --> 01:05:23,520 at this age, kids have, at best, some vague statistical 1720 01:05:23,520 --> 01:05:26,460 associations of words to kinds of objects. 1721 01:05:26,460 --> 01:05:28,860 But they already have a great deal 1722 01:05:28,860 --> 01:05:30,630 of much more abstract understanding 1723 01:05:30,630 --> 01:05:33,120 of physical objects. 1724 01:05:33,120 --> 01:05:36,732 So I won't-- maybe I should not go into the details of it. 1725 01:05:36,732 --> 01:05:38,940 But you saw it in that nice video of the baby playing 1726 01:05:38,940 --> 01:05:39,523 with the cups. 1727 01:05:42,015 --> 01:05:44,364 And there are really interesting, sort 1728 01:05:44,364 --> 01:05:45,780 of rough, developmental timelines.
1729 01:05:45,780 --> 01:05:47,946 One of the things we're trying to figure out in CBMM 1730 01:05:47,946 --> 01:05:51,120 is to actually get a much, much clearer picture of this. 1731 01:05:51,120 --> 01:05:54,120 But at least if you look across a bunch of different studies, 1732 01:05:54,120 --> 01:05:57,090 sometimes by one lab, sometimes by multiple labs, 1733 01:05:57,090 --> 01:06:00,360 you see ways in which, say, going from two months to five 1734 01:06:00,360 --> 01:06:01,979 months, or five months to 12 months, 1735 01:06:01,979 --> 01:06:04,020 kids seem to-- their intuitive physics of objects 1736 01:06:04,020 --> 01:06:06,210 is getting a little bit more sophisticated. 1737 01:06:06,210 --> 01:06:12,180 So for example, they tend to understand-- 1738 01:06:12,180 --> 01:06:14,880 in some form, they understand a little bit of how collisions 1739 01:06:14,880 --> 01:06:19,830 conserve momentum, a little bit, by five months or six months-- 1740 01:06:19,830 --> 01:06:22,410 according to one of Baillargeon's studies-- 1741 01:06:22,410 --> 01:06:25,230 in the sense that if they see a ball roll down a ramp 1742 01:06:25,230 --> 01:06:28,830 and hit another one, and the second one goes 1743 01:06:28,830 --> 01:06:31,395 a certain distance, if a bigger object comes, 1744 01:06:31,395 --> 01:06:33,520 they're not too surprised if this one goes farther. 1745 01:06:33,520 --> 01:06:35,800 But if a little object hits it, then they are surprised. 1746 01:06:35,800 --> 01:06:37,966 So they expect a bigger object to be able to move it 1747 01:06:37,966 --> 01:06:39,270 more than a little object. 1748 01:06:39,270 --> 01:06:41,340 But a two-month-old doesn't understand that. 1749 01:06:41,340 --> 01:06:43,890 Although a two-month-old does understand-- this is, again, 1750 01:06:43,890 --> 01:06:45,000 from Liz's work-- 1751 01:06:45,000 --> 01:06:47,970 that if an object is occluded by a screen, 1752 01:06:47,970 --> 01:06:50,010 it hasn't disappeared, and that if an object 1753 01:06:50,010 --> 01:06:52,980 is rolling towards a wall, and that wall looks solid, 1754 01:06:52,980 --> 01:06:56,174 that the object can't go through it, and that if it somehow-- 1755 01:06:56,174 --> 01:06:58,590 when the screen is removed, as you see on the upper left-- 1756 01:06:58,590 --> 01:07:00,840 appears on the other side of the wall, that's 1757 01:07:00,840 --> 01:07:01,954 very surprising to them. 1758 01:07:01,954 --> 01:07:04,620 I think-- I'm sure what Liz will talk about, among other things, 1759 01:07:04,620 --> 01:07:07,710 are the methods they use, the looking time methods to reveal 1760 01:07:07,710 --> 01:07:09,795 this. 1761 01:07:09,795 --> 01:07:11,490 And I think there's really-- 1762 01:07:11,490 --> 01:07:14,200 this is one of the two main insights that I, 1763 01:07:14,200 --> 01:07:15,630 and I think our whole field, need 1764 01:07:15,630 --> 01:07:17,850 to learn from developmental psychology: 1765 01:07:17,850 --> 01:07:20,730 how much of a basic understanding of physics 1766 01:07:20,730 --> 01:07:23,390 like this is present very early. 1767 01:07:23,390 --> 01:07:25,710 And it doesn't matter whether it's-- 1768 01:07:25,710 --> 01:07:28,140 in some sense, it doesn't matter for the points 1769 01:07:28,140 --> 01:07:31,440 I want to make here, how much or in what way this is innate, 1770 01:07:31,440 --> 01:07:34,719 or how the genetics and the experience interact. 1771 01:07:34,719 --> 01:07:35,760 I mean, that does matter.
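As a toy illustration of how an expectation like the collision result might be formalized (this is not the actual infant stimuli or any published model, and all the numbers are invented): simulate the struck ball's travel under a crude, noisy momentum-transfer rule, and treat outcomes the simulator rarely produces as surprising, which is the intuition behind reading surprise off looking times.

```python
# A toy "noisy physics" sketch of the collision expectation described above.
# My own illustration, not the actual stimuli or a published model: simulate
# how far the struck ball should travel for a big vs. small striker, and call
# outcomes the simulator rarely produces "surprising".

import random

def simulate_travel(striker_mass, struck_mass=1.0, speed=5.0, noise=0.2):
    """Crude momentum transfer plus noise: how far does the struck ball slide?"""
    transfer = striker_mass / (striker_mass + struck_mass)  # heavier striker, more push
    return max(0.0, speed * transfer * 4.0 * (1.0 + random.gauss(0, noise)))

def surprise(observed_distance, striker_mass, n=20000, tol=3.0):
    """1 minus the fraction of simulations near the observation; higher = more surprising."""
    hits = sum(abs(simulate_travel(striker_mass) - observed_distance) < tol
               for _ in range(n))
    return 1.0 - hits / n

# Big striker sending the ball far: unsurprising. Small striker doing the same: surprising.
print("big striker, ball goes far   -> surprise", round(surprise(15.0, striker_mass=4.0), 2))
print("small striker, ball goes far -> surprise", round(surprise(15.0, striker_mass=0.3), 2))
```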
1772 01:07:35,760 --> 01:07:37,801 And that question, of what's innate and how genetics and experience interact, is something we want to understand, 1773 01:07:37,801 --> 01:07:39,450 and we are hoping to try to understand 1774 01:07:39,450 --> 01:07:41,830 in the hopefully not-too-distant future. 1775 01:07:41,830 --> 01:07:43,404 But for the purpose of understanding 1776 01:07:43,404 --> 01:07:44,820 what is the heart of common sense, 1777 01:07:44,820 --> 01:07:47,070 how are we going to build these causal, compositional, 1778 01:07:47,070 --> 01:07:49,149 generative models to really get at intelligence, 1779 01:07:49,149 --> 01:07:51,690 the main thing is that it should be about this kind of stuff. 1780 01:07:51,690 --> 01:07:52,800 That's the main focus. 1781 01:07:52,800 --> 01:07:56,340 And then the other big insight from developmental psychology, 1782 01:07:56,340 --> 01:07:58,260 which has to do with how we build this stuff, 1783 01:07:58,260 --> 01:08:02,130 is this idea sometimes called the child as scientist. 1784 01:08:02,130 --> 01:08:04,950 The basic idea is that this early commonsense 1785 01:08:04,950 --> 01:08:07,227 knowledge is something like a scientific theory, 1786 01:08:07,227 --> 01:08:09,810 something like a good scientific theory, the way Newton's laws 1787 01:08:09,810 --> 01:08:12,330 are a better scientific theory than Kepler's laws 1788 01:08:12,330 --> 01:08:14,910 because of how they capture the causal structure of the world 1789 01:08:14,910 --> 01:08:15,965 in a compositional way. 1790 01:08:15,965 --> 01:08:17,340 That's another way to sum up what 1791 01:08:17,340 --> 01:08:20,850 I'm trying to say about children's early knowledge. 1792 01:08:20,850 --> 01:08:23,520 But also, the way children build their knowledge is something 1793 01:08:23,520 --> 01:08:25,710 like the way scientists build their knowledge, which 1794 01:08:25,710 --> 01:08:30,700 is, well, they do experiments, of course. 1795 01:08:30,700 --> 01:08:32,649 We normally call that play. 1796 01:08:32,649 --> 01:08:34,474 That's one of Laura Schulz's big ideas. 1797 01:08:34,474 --> 01:08:36,140 But it's not just about the experiments. 1798 01:08:36,140 --> 01:08:38,410 I mean, Newton didn't really do any experiments. 1799 01:08:38,410 --> 01:08:39,077 He just thought. 1800 01:08:39,077 --> 01:08:41,451 And another thing you'll hear from Laura, and also 1801 01:08:41,451 --> 01:08:43,870 from Tomer, is that a lot of children's learning 1802 01:08:43,870 --> 01:08:46,810 looks less like, say, stochastic gradient descent, and more 1803 01:08:46,810 --> 01:08:49,479 like scratching your head and trying to make sense of, 1804 01:08:49,479 --> 01:08:51,337 well, that's really funny. 1805 01:08:51,337 --> 01:08:52,420 Why does this happen here? 1806 01:08:52,420 --> 01:08:53,890 Why does that happen over there? 1807 01:08:53,890 --> 01:08:57,250 Or how can I explain what seemed to be 1808 01:08:57,250 --> 01:09:00,790 diverse patterns of phenomena with some common underlying 1809 01:09:00,790 --> 01:09:02,859 principles, and making analogies between things, 1810 01:09:02,859 --> 01:09:04,420 and then trying out, oh, well, if that's right, 1811 01:09:04,420 --> 01:09:06,020 then it would make this prediction. 1812 01:09:06,020 --> 01:09:08,380 And the kid doesn't have to be conscious of that the way 1813 01:09:08,380 --> 01:09:10,720 scientists maybe are.
1814 01:09:10,720 --> 01:09:13,990 That process of coming up with theories 1815 01:09:13,990 --> 01:09:16,630 and considering variations, trying them out, 1816 01:09:16,630 --> 01:09:19,450 seeing what kinds of new experiences you can create 1817 01:09:19,450 --> 01:09:20,459 for yourself-- 1818 01:09:20,459 --> 01:09:22,000 call them experiments, or call them 1819 01:09:22,000 --> 01:09:24,010 just games, or playing with a toy, 1820 01:09:24,010 --> 01:09:28,210 but that dynamic is the real heart of how children learn 1821 01:09:28,210 --> 01:09:30,700 and build their knowledge from the early stages 1822 01:09:30,700 --> 01:09:33,939 to what we come to have as adults. 1823 01:09:33,939 --> 01:09:36,640 Those two insights of what we start with and how we grow, 1824 01:09:36,640 --> 01:09:39,790 I think, are hugely powerful and hugely important 1825 01:09:39,790 --> 01:09:42,404 for anything we want to do in capturing-- making machines 1826 01:09:42,404 --> 01:09:43,779 that learn like humans, or making 1827 01:09:43,779 --> 01:09:46,112 computational models that really get at the heart of how 1828 01:09:46,112 --> 01:09:47,850 we come to be smart.
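One way to read the contrast with stochastic gradient descent in computational terms is as search over discrete theories rather than nudging continuous weights. The sketch below is my own toy illustration, with invented observations and candidate rules, loosely echoing the earlier magnets example: propose small symbolic hypotheses, score how well each explains what has been seen, and keep the best explanation.

```python
# A toy sketch of "learning as theory search" rather than gradient descent.
# Everything here (observations and candidate rules) is invented for
# illustration.

# Observations: (object, does it stick to the fridge?)
observations = [
    ("iron key", True), ("steel spoon", True),
    ("plastic cup", False), ("wooden block", False), ("iron nail", True),
]

# Candidate "theories" a learner might entertain, as simple predicates.
# The tuple in "small things stick" just encodes which objects this learner
# happens to consider small.
theories = {
    "everything sticks":  lambda obj: True,
    "small things stick": lambda obj: obj in ("iron key", "iron nail", "plastic cup"),
    "metal things stick": lambda obj: obj.startswith(("iron", "steel")),
}

def score(theory):
    """How many observations does this theory predict correctly?"""
    return sum(theory(obj) == sticks for obj, sticks in observations)

best = max(theories, key=lambda name: score(theories[name]))
for name in theories:
    print(f"{name:20s} explains {score(theories[name])}/{len(observations)}")
print("best current theory:", best)
```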