The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation, or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JAMES DICARLO: So let me start. I already alluded to this, but let's talk about the problem of vision. This is just one computational challenge that our brains solve, but it's one that many of us are very fascinated by. As you'll hear in the rest of the course, there are other problems that are equally fascinating. But I'm going to talk about problems of vision, and about one specific problem of vision: object recognition. I will try to operationalize that for you. And one thing you'll see as I talk is that even though our field can be motivated by words like "vision" and "object recognition," we're only going to make progress if we start to operationally define things and then decide in what domain our models are going to apply. I think that's an important lesson, and I hope it will come across in my talk.

This is the way computer vision operationally defines part of the problem of object recognition and vision: you take a scene like this, and you want to come up with an answer space that looks like this, where you have noun labels, say "car," and what are called bounding boxes around the cars, and similarly for people, or buildings, or trees, or whatever nouns you, or DARPA, or whoever, wants to label. So this is just one way of operationalizing vision. But I think it gets at the crux of what we're after, which is that there is what's called latent content in this image, content that all of us instantly bring to mind: we can say, aha, that's a car, that's a building. There are nouns that pop into our heads.
We also know other latent information about these things, like the pose of this car, the position of the car, the size of the car. The key point I'm going to make today about this problem is that that information feels obvious to us, but it's quite latent in the image; it's only implicit in the pixel representation. Those of you who have worked on this problem will understand this, and for those of you who haven't, I hope to give you some flavor of what the problem feels like.

I want to back up a bit, more from a cognitive science perspective, or a human brain perspective, and ask: why would we even bother worrying about this problem of object recognition? Maybe this is obvious, but I like to point out that we think of the representations of the tokens of what's out there in the world as being the substrates of what you might call higher-level cognition: things like memory, value judgments, decisions, and actions in the world. Imagine building a robot and having it try to act in the world when it doesn't even really know what's out there. So these representations are the substrate of those kinds of cognitive processes.

Again, from an engineering perspective, these are processes or behaviors. This is just a short list of behaviors that might depend on your ability to recognize and discriminate among different objects. If you look through this list, you can imagine things going terribly wrong if you didn't do a good job of identifying what's out there in the world. So that's just to think about it, again, as an engineer building a robot.

This is a slide I stuck in to connect to this course. I know many of you are from maybe these backgrounds, or from this background. And when I think about the brain, I have this coin here to say that these are really two sides of the same coin; we're studying the same coin from two directions.
And really the question we all have to be excited about, and I hope many of you are excited about it, is: how does the brain work? You could do computer science and not care at all about this question. I think it's a little harder to do these and not care about this question, but it's possible, I guess. So these are all trying to answer that question.

And this is maybe pretty obvious, but when you have biological brains that are performing tasks better than current computer systems, machines that humans have built, then the flow tends to go this way. You discover phenomena or constraints over here, and these lead to ideas that can be built into computer code, where you say: hey, can I build a better machine based on what we discovered over here? Many of us came into the field excited to do this, and we're still excited about that direction. But an equally important direction is that when you have systems that are matched with our abilities, or that can compute some of the things we think the brain has to compute, then the flow goes more this way: there are many possible ways to implement an idea, and these become falsifiable. That is, they can be tested against experimental data to ask which of the many ways of implementing a computation are the ones actually occurring in the brain. And that's important if, say, you want to build brain-machine interfaces, or fix diseases, or do anything that involves interacting with the brain directly.

I hope you keep this picture in mind, because I think it's the spirit of the course that both of these directions are important. It's not as if we work on this for 20 years and then work on that for 20 years; it's really the flow across them that I think is most exciting to us.

So just to connect to that, a little bit of history: where was the field on this problem of visual recognition?
I don't know if many of you have heard this, but here you are at summer school, and there was a Summer Vision Project, as it was called, at MIT. I used to think this story was apocryphal. In 1966, there was a project whose final goal was object identification: it would actually name objects by matching them against a vocabulary of known objects. So this was essentially a summer project that said, we're going to get a couple of undergraduate students together and we're going to build a recognition system, in 1966. This was the excitement of AI: we can build anything we want. And of course, as those of you who know the story are aware, the problem turned out to be much, much harder than anticipated. So sometimes problems that seem easy for us are actually quite difficult. If any of you wants it, I would be happy to share the document with you. It's interesting: the space of objects they describe includes things that, of course, I would also mention, like coffee cups on your desk, but it also includes packs of cigarettes on your desk. That dates the project; it's a little bit like Mad Men or something.

So now, here we are today. And I guess I just can't help but get excited about this really cool machine that's just amazing, that does these computations. I can't tell you how it does all this, but it has 100 billion computing elements and it solves problems not solvable by any previous machine. And the thing looks crazy, but it only requires 20 watts of power. Those of you who have seen this slide know I'm not talking about this thing; I'm talking about that thing right there. So this is the scale of what we're after. We often talk about power, and this is something engineers are especially interested in as they build these systems: how do our brains solve these problems at such a low wattage, so to speak? This is, again, the spirit of many of the things that I hope you will be excited about in the future of this field.
Here's another slide I pulled out that I often like to show. From an engineer's point of view, we often say we want to build machines that are as good as, or better than, our brain. Machines today, as you know, beat us at many things: straight calculation; chess, which happened back when I was a grad student; and recently they won at Jeopardy. At memory they've always beaten us; machines are way better than us at the simple form of memory. And in seeing, in pattern matching: go to the grocery store and ask, what did that bar code reader just do? I don't know exactly what it did, but it scans the code and somehow does pattern matching. So there are forms of vision at which machines are way better than us. But at forms of vision that are more complicated, that require generalization, like object recognition, or more broadly scene understanding, we like to think that we are still the winners. And even things we take for granted, like walking, are quite challenging problems. So engineers really want to move this over here.

Our goal is to discover how the brain solves object recognition. And the reason I put this up is that, from an engineering point of view, that doesn't just mean writing a bunch of papers and a textbook that says this part of the brain does it; it means actually helping to implement a system that is at least matched with us, and, I assume, someday will be better than us. This is also a gateway problem: even if it's just this domain, we think the systems we're studying might generalize to other domains, for instance other sensory domains. Gabriel told me you are going to do an auditory-visual comparison session later in the week.

That's the engineer's point of view: how do I just build better systems? Let's step back and talk from a scientist's point of view. This is really to introduce the talk I'm going to give you today. When you're a scientist, what's your job? We say we want to understand.
We all write that word, "understand." What does that mean? Well, what it really means, if you boil it down (and I would love to discuss this if you like), is that you have some measurements in some domain. You can think of this as a state space: this is like the positions of the planets today, and this is like the positions of the planets tomorrow. Or you could say this is the DNA sequence inside a cell, and this is some protein that's going to get made. So you're searching for mappings that are predictive from one domain to another. And we can give lots of examples of what we call successful science where that's true. This is the core of science: to predict, given some measurements or observations, what's going to happen, either in the future or in some other set of measurements. So predictive power is the core of all science and the core of understanding. I think it would be fun, if you want to debate that, to hear whether you think there's another way, but this is where I come to in thinking about this problem.

The reason I'm bringing this up is that the accuracy of this predictive mapping is a measure of the strength of any scientific field. Some fields are further along than others, and I would say ours is still not very far along. Our job is to bring it from a nonpredictive state to a very predictive state. That means building models that can be falsified and that can predict things, and you'll hear that throughout my talk. As Gabriel mentioned, what we try to do is build models that can predict either behavior or neural activity. That's what we think progress looks like.

So now let's translate this to the problem I gave you, the problem of vision, or more specifically object recognition. You can imagine there's a domain of images. Just to slow down here, so everybody's on the same page: each dot here might be all the pixels in this image, and this dot all the pixels in that image.
So there's a set of possible pixel images that you could see. And we imagine that they give rise, in the brain, to some state space. Think of this as the whole brain for now, just to fix ideas: this image, the one you're looking at, gives rise to some pattern of activity across your whole brain, and this image gives rise to a different pattern of activity across your whole brain. Loosely, we call this the neural representation of the image.

But then, when we ask you for behavioral reports, there's a mapping between that neural state space and what we measure as the output. Whether you say it or write it, you might say, that's a face, and these are both faces, if I asked you for nouns. So this is another domain of measurement.

So now you can see I'm setting up the notion of predictivity. We have this complex thing over here: images that somehow map internally into neural activity and then somehow map to the thing we call perceptual reports. And notice I've already put in the kinds of nouns that we usually associate with objects: cars, faces, dogs, cats, clocks, and so forth. Understanding this mapping in a predictive sense is really a summary of what our part of the field is about. And again, accurate predictivity is the core product of the science that underlies our ability to build a system like this, or, as many of you are interested in, to fix a system like this, or perhaps even to augment our own systems. If we want to inject signals here and have them give rise to percepts, we have to know how this works.

A big part of the field of vision has spent a lot of the last three decades working on the mapping between images and neural activity. That's usually called encoding, as in predictive encoding mechanisms, and it was driven by Hubel and Wiesel's work. People saw it as a great way forward: let's go study the neurons and try to understand what in the image is driving them. That is, what's an image-computable model that would go from images to neural responses?
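To make the phrase "predictive encoding model" concrete, here is a minimal Python sketch; the synthetic data, the array sizes, and the closed-form ridge fit are illustrative assumptions, not anything from the lecture or the lab's actual methods. The point is only that an encoding model is a fitted, image-computable mapping from images to measured neural responses, scored by prediction on held-out images.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 500 images of 256 pixels each, and one neuron's
# response to each image (simulated here for illustration only).
n_images, n_pixels = 500, 256
images = rng.standard_normal((n_images, n_pixels))
true_weights = rng.standard_normal(n_pixels)
rates = images @ true_weights + 0.5 * rng.standard_normal(n_images)

# Predictivity is always judged on images the model was never fit on.
train, test = slice(0, 400), slice(400, 500)

# Simplest possible image-computable encoding model: ridge regression from
# pixels to the neuron's response. (The real ventral-stream mapping is
# nonlinear; a linear fit is just the easiest instance of the recipe.)
lam = 10.0
X, y = images[train], rates[train]
w = np.linalg.solve(X.T @ X + lam * np.eye(n_pixels), X.T @ y)

# Correlation between predicted and observed held-out responses is one
# common score of how predictive the encoding model is.
pred = images[test] @ w
r = np.corrcoef(pred, rates[test])[0, 1]
print(f"held-out prediction r = {r:.2f}")
```

The same general recipe, with the pixel regressors swapped for the features of a richer model and the single neuron swapped for a population, is the shape of how encoding models are typically evaluated.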
The other part is that there's some linkage, we think, between the neural activity and these reports. And notice (this is actually why most of us get into neuroscience) that this arrow is two-way. This is actually quite deep. From an engineer's point of view, you say, well, there has to be some mapping between the neural activity and the button presses of my fingers, or my saying the noun out loud; there is some causal linkage between this activity and the things we observe objectively in a subject. But this is where philosophers start debating, because in some sense these are two sides of the same coin: we say of our own perception that some aspects of the internal activity are the thing we call awareness or perception. I'm not going to get into all of that, but I do want to point out that if you're just building models, you can't approach that. It's this strange relationship between neurons and these reported states that many of us are fascinated by. So this link is called predictive decoding mechanisms. For me, it's all going to be operationalized in terms of reports from humans or animals. I'll leave out the philosophical part, but I thought I'd mention it for those of you who like to think about those things.

For visual object perception, I want to point out that, again, the history of the field has mostly been on that first mapping. This second link has been neglected, or dominated by weakly predictive word models. That doesn't mean they're not useful starting points, but they're weakly predictive. An example of a weakly predictive word model would be: inferior temporal cortex, a part of the brain I'm going to tell you about today, does object recognition. That model has been around for a long time.
It is somewhat predictive, because it says that if you take that area out, all object recognition will be destroyed; that would be a prediction. It turns out that doesn't actually happen, and we can discuss that. But it doesn't tell you how the area does it, how to inject signals, or which tasks are more or less affected. That's what I mean by weakly predictive: it's a word model. "Face neurons do face tasks" is probably true to some extent, and it's a bit tighter; it says, take out these smaller regions and some set of tasks that involve faces will be affected, and I won't say anything about other tasks. So that's a somewhat more strongly predictive model, but still pretty weakly predictive. And my personal favorite, which comes in from reviewers a lot, is "attention solves that." I bring that one up just so you're on the lookout for word models that don't actually have content in terms of prediction. I don't know what that statement means; I read it as, the hand of God reaches in and solves the problem. There has to be an actual predictive model that can be falsified.

OK, I don't mean to doubt the importance of these. Before people start giving me a hard time: there are attentional phenomena, there are face neurons, there is an IT; that's what we study. I'm just trying to emphasize that we need to go beyond word models to actual testable models that make predictions, models that would still make predictions even if the person who proposed them were no longer around.

Let me try to define a domain. I said we're going to try to define things, and that's hard: vision is a big area. Object recognition I've only described vaguely, and when I say it, I include faces as objects, socially important ones; you'll hear about this from Winrich, I think. But I want to limit the problem even further, because that is still a big domain.
So we tried early on to reduce the problem even further, to something that is more naturalistic and that we think can give us more traction, in this predictive sense. We started by noting that when you take a scene like this and analyze it, you may not notice it, but your ventral stream, really your retina, has high acuity only in roughly the central 10 degrees. There is anatomy, which I'll show you later, indicating that the ventral stream is especially interested in processing the central 10 degrees of the visual field. That's about two hands at arm's length, for those of you in the room. So you may have the sense that you know what's out there in the whole scene, but you don't really; you stitch that sense together. Lots of people have shown how you stitch it together: by making rapid eye movements, called saccades, followed by fixations, which are 200 to 500 milliseconds in duration. You don't really see during the saccade itself. It's not that your brain shuts down; it's just that the movement is too fast for your retina to keep up with. So you make these rapid eye movements, and you fixate, fixate, fixate. What that does is bring a sampled version of the scene into the central 10 degrees, which might look something like this. Those are 200-millisecond snapshots along that scan path, and I'll play it for you one more time. Now, you should notice that there is one or more objects in each and every image; you probably said, oh, there's a sign, there's a person, there's a car. You might have gotten two out of each one. But you were extracting, at least intuitively it seems to me, one or more foreground or central objects from each of those images. And that ability, to do what I just showed you there, we think is the core of how you analyze or build up a scene like this, or at least how the ventral stream contributes to it.
And therefore we call that "core recognition," which I define as recognition within the central 10 degrees of the visual field, at 100 to 200 milliseconds of viewing duration. Again, it's not all of object recognition, but we think it's a good starting point.

One way we probably got into this is through the rapid serial visual presentation movies from the 70s; Molly Potter showed this really nicely. This is a movie that I've been showing for 15 years now. Notice that it's just a sequence of images in which there is typically one or more foreground objects, and you should be quickly mapping those to memory, even though I'm not telling you what to expect. Like the Leaning Tower of Pisa, right? I'm not going to tell you that you're going to see Star Wars characters; well, I just did. But you are quickly able to map those things to some noun, or even to a more precise subordinate noun: I know this is Yoda. We are very, very good at that. Notice that you didn't need a lot of pre-cueing, yet you're still able to do it. And that is really what fascinates us about vision, and about object recognition in particular: even without featural attention or pre-cueing, you're able to do a remarkable amount of processing. I think that's a great demonstration of it.

And just to quantify this for you, because sometimes people say, well, you're showing the images too briefly, and your visual system doesn't do much in that time: here's an eight-way categorization task, which I'll show you more of later, run under a range of transformations. These are just example images of the eight different categories of objects. It doesn't much matter what I do here; you get a very similar curve. And that is: you get most of the performance gain in about the first 100 milliseconds. This axis is accuracy, and you're about 85% correct. This is a challenging task, as I'll show you later; it looks easy here, but it's quite challenging. From 85% correct, if I let you look at the image longer, up to two seconds, you can bump up to around 90%.
So there is some gain with longer viewing duration, but chance here is 50%, so you already get this huge ability in that first glimpse. And we're not the first to show this. This is just to show you, in our own kind of task, with the data I'm going to tell you about, where we show the image for 100 or 200 milliseconds (the typical primate viewing duration that I pin this on; we use it for reasons of efficiency), that performance is similar across that range. You get a lot done. Your visual system does a lot of work in that first glimpse. And that is the core recognition that we are trying to study here. I know it's not all of object recognition, or all of vision, but it is now, we think, a much more defined domain that we can make progress on. That's what we've been working on, and that's essentially what I'm going to talk about today. So think of vision and object recognition within that setting of core recognition.

This is David Marr. David and Tommy Poggio, whom I studied with for a long time; Tommy wrote the introduction to David's book, Vision. If you haven't read it, do you guys know this book? It's really a classic book in our field. The first couple of chapters are the part you should really read; that's the best part of the book. And one of the things you take from this book, which I think David and Tommy helped lay out a long time ago, is that there is this challenge of levels.

One of the things I take from it is that they tried to define three clean levels. It turns out not to be that clean in practice, but there is one level called computational theory: what is the goal of the computation, why is it appropriate, and by what logic or strategy can it be carried out? There's another level which is, once you've decided that, how should you represent the data, and how can you implement an algorithm to do it? And then there's the level of how you actually run it: how do you build it in hardware?
Neuroscientists often come in at that hardware level. They say, I'm going to study neurons, and it's a bit like jumping into your iPhone and saying, I'm going to study transistors. They tend to start at the hardware level: oh wait, there's something going on here, these transistors are firing. And you would make up some story about it, whether you were recording from the brain or measuring the transistors in my iPhone. But I think the important point to take from this is that it helps to start by thinking about what the point of the system is. What might it be doing? How might you solve that problem? That leads you to the algorithm, and then you think about representations. So it's a top-down approach, rather than just digging into the brain and hoping that the answers will emerge.

I'm going to try to give you that top-down approach to the problem I'm talking about. I've already given you a bit of it by introducing you to the problem; I'll say a little bit more about that and then step down a little bit this way. This kind of thinking, I believe, is important to making progress on how the brain computes things.

So here's a related slide that I made a long time ago and pulled out again for you, which I think helps bridge between what I just said about Marr's levels of analysis and whether you're a neuroscientist or cognitive scientist, or a computer vision or machine learning person. The first question is: what is the problem we're trying to solve? That's Marr's level one, computational theory. Operationally, you'll hear folks in machine learning say, well, there are benchmarks, and that's good; there's the ImageNet challenge, or whatever challenge they want to solve. Sometimes they'll say, well, the brain solves it. That's not good, because they haven't really defined the problem.
Neuroscientists will say, well, it's something like perception, or behavior; there's some sort of behavior that they have imagined, although characterizing that behavior is not usually their primary goal. But I think there is at least some progress in that regard.

Now, what does a solution look like? This is really just a matter of language. Machine learning people talk about useful image representations, what we might call features, while neuroscientists talk about explicit neuronal spiking populations. You heard this in Haim's talk; he was using these words interchangeably. Again, this may be obvious to you, but I thought it was worth going through. This is Marr level two: representation.

How do we instantiate these solutions? This is still level two: algorithms, or mechanisms, that actually build useful feature representations. Neuroscientists think about the neuronal wiring and weighting patterns that actually execute those algorithms. That, we think, is the bridging language there.

And then there's a deeper level, which came up in the questions: how would you construct the system from the beginning? Learning rules, initial conditions, and training images are the words used here; there is a learning machine. Neuroscientists talk about plasticity, architecture, and experience. Again, those are similar questions, just in different language. I'm going through this because I think the spirit of this course is to try to build these links at all of these different levels.

OK, so hopefully that helps orient you to how we think about it. Now let me go to number one: what is the problem we're trying to solve, and why is it hard? I said object recognition is hard, and I showed you that MIT challenge and how difficult it turned out to be. Maybe it's hard because there are lots of objects. Who thinks that's why it's hard? Who thinks that's not why it's hard?
You think computers can list a bunch of objects? It's easy, right? This is what I said about memory: it's a big, long list of stuff, and computers are good at that. There may be thousands of objects, but a list of objects is not a hard thing for a machine to handle. What's hard is that each object can produce an essentially infinite number of images. So you somehow have to be able to take some samples, certain views or poses of an object (this is a car under different poses), and be able to generalize, to predict what the car might look like from another view.

This is what's called the invariance problem, and it arises because there is identity-preserving image variation. This is why the bar code reader in your supermarket works fine: the code is always laid out very simply. But when you have to generalize across a bunch of conditions, potentially things like background clutter, or, even more severely, occlusion, things you heard about from Gabriel, or when you even want to generalize across the class of cars, where individual cars have slightly different geometry but are still cars, these kinds of generalizations are what make the problem hard. So I'm lumping them all together into what we call the invariance problem. Many of you in the room know this is the hard problem, and I hope that fixes the idea of what you should be thinking about: it's not the number of objects, it's the fact that you have to deal with that invariance problem.

Haim was talking about manifolds, and this is my version of that. This is to introduce you to what the invariance problem looks like, or feels like. I'm not going to give you math on how to solve it; it's just a geometric feel for the problem. So imagine you're a camera, or your retina, capturing an image of an object. Let's call this a person; I think I called him Joe.
So when you see this image of Joe: this is the retina, so this is a state space of what's going on in your retina. There are about a million retinal ganglion cells; think of each as giving an analog value, so this is a million-dimensional state space. When you see this image of Joe, he activates every retinal ganglion cell, some a lot, some a little, so he is some point in that million-dimensional space. OK, everybody with me? If everybody has heard all this before and wants me to go on, wave your hand and I'll move on.

AUDIENCE: No, it's good.

JAMES DICARLO: Keep going, OK. So the basic idea is that if Joe undergoes a transformation, like a change in pose, that is only one degree of freedom: under the hood, I'm turning one of those latent variables. If I had a graphics engine, I would be changing the pose latent variable. It's only one knob that I'm turning, so to speak. And that means there's one curve traced through this space as Joe projects into these different images. I'm ignoring noise and such; this is just the deterministic mapping onto the retinal ganglion cells. So Joe goes over here, and if I turn the other knob, he goes over there. You can imagine that if I turned those two knobs, two axes of pose, through all possible values and plotted the results in the million-dimensional state space, there would be a curved-up sheet of points, which you can think of as Joe's identity manifold over those two degrees of view change. It's only two dimensions because it's hard to show more than that, but it's this curved-up sheet of points. Everybody with me so far? You don't actually get to see all of those images. You could imagine a machine rendering them all, but you only ever get samples of them. Still, there is some underlying manifold structure there.
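As a toy illustration of that geometry, here is a short Python sketch; the two "knobs," the random sinusoidal rendering, and every dimension and number in it are made-up stand-ins rather than anything from the lecture. It sweeps two pose-like latent variables, pushes them through a fixed nonlinear map into a high-dimensional response space, and checks that the resulting cloud is a curved two-dimensional sheet rather than a flat plane.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two latent "knobs" for one object (say, azimuth and elevation of pose).
azimuth = np.linspace(-1.0, 1.0, 30)
elevation = np.linspace(-1.0, 1.0, 30)
az, el = np.meshgrid(azimuth, elevation)
latents = np.column_stack([az.ravel(), el.ravel()])       # (900, 2)

# A fixed, nonlinear "rendering" from the 2 latents into a 1000-dimensional
# stand-in for retinal ganglion cell responses. The sine nonlinearity is an
# arbitrary choice; the only point is that the map is smooth and nonlinear.
n_dims = 1000
proj = rng.standard_normal((2, n_dims))
phase = rng.uniform(0, 2 * np.pi, n_dims)
responses = np.sin(latents @ proj + phase)                # (900, 1000)

# Only 2 degrees of freedom were varied, so the 900 points lie on a 2-D
# sheet ("Joe's identity manifold") curled up inside the 1000-D space.
centered = responses - responses.mean(axis=0)
svals = np.linalg.svd(centered, compute_uv=False)
var_explained = (svals**2) / (svals**2).sum()
print("variance captured by top 5 linear directions:",
      np.round(var_explained[:5], 3))
# The spectrum is not concentrated in exactly 2 directions: because the
# sheet is curved, it spreads across many linear dimensions even though
# its intrinsic dimensionality is only 2.
```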
Now, what's interesting and important to point out is that this thing, even though I've drawn it as a little curve, is highly complicated in this native pixel space. It's all curved up and bending all over the place. And the reason that matters, and this is what Haim introduced you to, is that if you want to be able to separate Joe from another object, say not-Joe, another person, then you need a representation. I showed you retinal ganglion cells; here is another imaginary state space where you can take simple tools and extract the information. The simple tools that we like to use are linear classifiers, but you can use other simple tools. Haim gave you the exact same description in his talk: you have some linear decoder on the state space that can cleanly separate Joe from not-Joe. So in a good space, these manifolds are nicely separated by a separating hyperplane. That's what these tools tend to do: they like to cut planes, or they want to find compact regions in the space, depending on what kind of tool you use. What you don't want is a tool that has to do all kinds of complicated tracing through the space; that's basically the original problem itself. So what you need is a simple toolbox, which we think of as downstream neurons. A linear classifier, as an approximation, is like a dot product: a weighted sum, which is what we neuroscientists think of downstream neurons as doing. So if we want an explicit representation in some neural state space, we need to be able to take weighted sums of some population's responses and separate Joe from not-Joe, and Sam from Jill, and everything from everything else that we want to separate. If we had such a space of neural population activity, we would call that a good set of features, or an explicit representation of object shape.
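Here is an equally small sketch of that readout idea; the population size, the synthetic responses, and the least-squares fit are illustrative assumptions, not the lab's actual analysis. The "downstream neuron" is literally a learned weight vector: a weighted sum of the population followed by a threshold, fit from a small number of labeled examples.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical population responses: 200 "images" of Joe and 200 of not-Joe,
# each represented by the activity of 500 neurons in some downstream area.
# In this toy space the two classes are, by construction, linearly separable.
n_per_class, n_neurons = 200, 500
offset = rng.standard_normal(n_neurons) * 0.5
joe = rng.standard_normal((n_per_class, n_neurons)) + offset
not_joe = rng.standard_normal((n_per_class, n_neurons)) - offset
X = np.vstack([joe, not_joe])
y = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])

# "Downstream neuron" readout: a weighted sum plus a threshold, i.e. a
# linear classifier. The weights come from a least-squares fit on a small
# number of labeled examples, the kind of data-efficiency discussed above.
n_train = 20
idx = rng.permutation(len(y))
train, test = idx[:n_train], idx[n_train:]
w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

decision = np.sign(X[test] @ w)           # weighted sum, then threshold
accuracy = (decision == y[test]).mean()
print(f"accuracy with only {n_train} training examples: {accuracy:.2f}")
```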
769 00:27:21,050 --> 00:27:22,880 And for any aficionados here, it's 770 00:27:22,880 --> 00:27:26,060 not just cleanly linear separation, 771 00:27:26,060 --> 00:27:28,040 it's actually being able to find this 772 00:27:28,040 --> 00:27:29,780 with a low number of training examples. 773 00:27:29,780 --> 00:27:32,030 So that turns out to be important. 774 00:27:32,030 --> 00:27:35,420 But it helps to fix ideas to think about linear separation, 775 00:27:35,420 --> 00:27:37,770 ideally with a low number of training examples. 776 00:27:37,770 --> 00:27:40,340 So that's a good representation. 777 00:27:40,340 --> 00:27:43,870 And notice, I'm starting to mix up terms here. 778 00:27:43,870 --> 00:27:45,620 I am assuming, when I talk about shape, 779 00:27:45,620 --> 00:27:47,600 that that will map cleanly to identity, 780 00:27:47,600 --> 00:27:49,610 or what you might call broadly, category. 781 00:27:49,610 --> 00:27:52,460 That's another topic I won't talk about, if you just 782 00:27:52,460 --> 00:27:55,790 think about the shape of Joe, or separating one geometry 783 00:27:55,790 --> 00:27:57,629 from another. 784 00:27:57,629 --> 00:28:00,170 Now, here's a simulation that my first graduate student, Dave 785 00:28:00,170 --> 00:28:01,850 Cox, who's now at Harvard, did. 786 00:28:01,850 --> 00:28:03,230 This is a number of years old. 787 00:28:03,230 --> 00:28:06,140 This takes these two face objects, render them 788 00:28:06,140 --> 00:28:08,250 under changes, and view. 789 00:28:08,250 --> 00:28:12,830 And then he actually simulated the manifolds 790 00:28:12,830 --> 00:28:15,780 in a 14,000 dimensional space. 791 00:28:15,780 --> 00:28:17,270 And then he wanted to visualize it. 792 00:28:17,270 --> 00:28:18,770 And because we wanted to try to make 793 00:28:18,770 --> 00:28:22,250 the point that these manifolds of these two objects 794 00:28:22,250 --> 00:28:24,355 are highly curved and highly tangled, 795 00:28:24,355 --> 00:28:25,730 this is a three dimensional view. 796 00:28:25,730 --> 00:28:28,220 Remember, it's sitting on a 14,000 dimensional simulation 797 00:28:28,220 --> 00:28:29,240 space. 798 00:28:29,240 --> 00:28:30,770 You can't view that space. 799 00:28:30,770 --> 00:28:32,600 This is a three dimensional view of it. 800 00:28:32,600 --> 00:28:35,420 And the point is that it's like two sheets of paper 801 00:28:35,420 --> 00:28:39,470 being all crumpled up together and they're not fused. 802 00:28:39,470 --> 00:28:41,720 They look fused here because it's in three dimensions. 803 00:28:41,720 --> 00:28:44,690 But they're not actually fused. 804 00:28:44,690 --> 00:28:46,730 But they're complicated, you can't easily 805 00:28:46,730 --> 00:28:50,330 find a separating hyperplane to separate these two objects. 806 00:28:50,330 --> 00:28:53,150 We call these tangled object manifolds. 807 00:28:53,150 --> 00:28:56,550 And really, they're tangled due to image variation. 808 00:28:56,550 --> 00:28:59,050 Remember, if I didn't change those knobs of view or position 809 00:28:59,050 --> 00:29:01,520 or scale, there would just be two points in the space 810 00:29:01,520 --> 00:29:02,400 and it would be easy. 811 00:29:02,400 --> 00:29:04,192 That's the easy problem of listing objects. 812 00:29:04,192 --> 00:29:06,358 But if they have to undergo all this transformation, 813 00:29:06,358 --> 00:29:08,060 they become these complicated structures 814 00:29:08,060 --> 00:29:10,440 that need to be untangled from each other. 
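A toy version of the tangling point (a stand-in, not the original simulation): let each object trace out a closed curve in a high-dimensional, pixel-like space as a view knob is turned, with one curve nested inside the other. In that raw space no hyperplane separates them cleanly, a purely linear re-encoding (a rotation) doesn't help, and a single nonlinear feature does. All dimensions, noise levels, and the embedding below are made up for illustration.

```python
# Toy construction of two "tangled object manifolds" (not the original simulation):
# each object traces a ring in a high-dimensional pixel-like space as a view knob
# turns, and one ring is nested inside the other.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_dims, n_views = 200, 400
theta = np.linspace(0, 2 * np.pi, n_views)             # one view "knob"
basis = np.linalg.qr(rng.normal(size=(n_dims, 2)))[0]   # a random 2D image subspace

def manifold(radius):
    """Points traced out by one object as the view knob is swept."""
    ring = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])
    return ring @ basis.T + 0.05 * rng.normal(size=(n_views, n_dims))

X = np.vstack([manifold(1.0), manifold(2.0)])           # "Joe" and "not Joe"
y = np.array([1] * n_views + [0] * n_views)

raw = LinearSVC(max_iter=20000).fit(X, y).score(X, y)              # tangled: poor
Q = np.linalg.qr(rng.normal(size=(n_dims, n_dims)))[0]             # a linear re-encoding
rot = LinearSVC(max_iter=20000).fit(X @ Q, y).score(X @ Q, y)      # still poor
feat = np.column_stack([X, (X ** 2).sum(axis=1)])                  # one nonlinear feature
untangled = LinearSVC(max_iter=20000).fit(feat, y).score(feat, y)  # now separable
print(f"raw: {raw:.2f}  rotated: {rot:.2f}  with nonlinear feature: {untangled:.2f}")
```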
815 00:29:10,440 --> 00:29:12,620 So the problem that's being solved 816 00:29:12,620 --> 00:29:14,510 is, you have this retina sampling data, 817 00:29:14,510 --> 00:29:16,699 like a camera on the front end, where things look 818 00:29:16,699 --> 00:29:18,740 complicated with respect to the latent variables, 819 00:29:18,740 --> 00:29:21,885 in this case shape or identity, Sam or Joe. 820 00:29:21,885 --> 00:29:24,260 And that they somehow are transformed, as Haim mentioned, 821 00:29:24,260 --> 00:29:26,810 they're transformed by some non-linear transformation, 822 00:29:26,810 --> 00:29:30,440 some other neural population state space, shown here, where 823 00:29:30,440 --> 00:29:31,770 the things look more like this. 824 00:29:31,770 --> 00:29:34,340 The latent variable structure is more explicit, 825 00:29:34,340 --> 00:29:37,160 that you can easily take things like separating hyperplanes 826 00:29:37,160 --> 00:29:39,410 to identify things like shape, which again, roughly 827 00:29:39,410 --> 00:29:41,960 corresponds to identity or other latent parameters, 828 00:29:41,960 --> 00:29:42,980 like position and scale. 829 00:29:42,980 --> 00:29:44,990 You maybe haven't thrown away all these other latent 830 00:29:44,990 --> 00:29:45,570 parameters. 831 00:29:45,570 --> 00:29:47,611 And if I have time, I'll say something about that 832 00:29:47,611 --> 00:29:49,200 so you don't just get identity. 833 00:29:49,200 --> 00:29:50,810 But if you can untangle this, you 834 00:29:50,810 --> 00:29:52,880 would have a very nice representation with regard 835 00:29:52,880 --> 00:29:54,170 to those originally latent parameters. 836 00:29:54,170 --> 00:29:55,920 That's the dream of what you'd like to do. 837 00:29:55,920 --> 00:29:59,390 It's like reverse graphics, if you will. 838 00:29:59,390 --> 00:30:02,360 So this is what we call an untangled explicit object 839 00:30:02,360 --> 00:30:02,990 information. 840 00:30:02,990 --> 00:30:04,680 And we think it lives somewhere in the brain, 841 00:30:04,680 --> 00:30:05,679 at least to some degree. 842 00:30:05,679 --> 00:30:07,950 And I'll show you the evidence for that later on. 843 00:30:07,950 --> 00:30:10,460 So what you have then is you have a poor encoding basis, 844 00:30:10,460 --> 00:30:11,487 the pixel space. 845 00:30:11,487 --> 00:30:13,820 And somewhere in the brain is a powerful encoding basis, 846 00:30:13,820 --> 00:30:15,540 a good set of features. 847 00:30:15,540 --> 00:30:17,270 And as Haim mentioned, as I already said, 848 00:30:17,270 --> 00:30:19,400 this must be a non-linear transformation 849 00:30:19,400 --> 00:30:21,191 because the linear transformations are just 850 00:30:21,191 --> 00:30:23,400 rotations of that original space. 851 00:30:23,400 --> 00:30:25,337 So now let's go down to-- actually this 852 00:30:25,337 --> 00:30:26,420 would be Marr level three. 853 00:30:26,420 --> 00:30:27,666 Let's go to instantiation. 854 00:30:27,666 --> 00:30:29,040 Let's get into the hardware here. 855 00:30:29,040 --> 00:30:30,320 We're supposed to be talking about brains. 856 00:30:30,320 --> 00:30:32,450 So I'm going to give you a tour of the ventral stream. 857 00:30:32,450 --> 00:30:34,640 So we would love to know how this brain solves it. 858 00:30:34,640 --> 00:30:36,904 This is the human brain. 859 00:30:36,904 --> 00:30:38,070 This is a non-human primate. 860 00:30:38,070 --> 00:30:39,230 This is not shown to scale. 
861 00:30:39,230 --> 00:30:40,605 This is blown up to show you it's 862 00:30:40,605 --> 00:30:42,920 a similar structure, temporal lobe, frontal lobes, 863 00:30:42,920 --> 00:30:43,992 occipital lobe. 864 00:30:43,992 --> 00:30:45,200 There is a non-human primate. 865 00:30:45,200 --> 00:30:48,240 We like this model for a number of reasons. 866 00:30:48,240 --> 00:30:50,040 One reason that we like it is that they 867 00:30:50,040 --> 00:30:51,710 are very visual creatures, their acuity 868 00:30:51,710 --> 00:30:53,070 is very well matched to ours. 869 00:30:53,070 --> 00:30:55,280 In fact, even their object recognition abilities 870 00:30:55,280 --> 00:30:57,144 are actually quite similar to our own. 871 00:30:57,144 --> 00:30:59,060 This may be surprising to you, but let me just 872 00:30:59,060 --> 00:31:01,310 show you some data for that. 873 00:31:01,310 --> 00:31:05,909 This is actually data from Rishi Rajalingham, in my lab. 874 00:31:05,909 --> 00:31:07,700 The slide says 'in press,' but this just came out. 875 00:31:07,700 --> 00:31:09,620 These are the confusion matrix patterns 876 00:31:09,620 --> 00:31:11,720 of humans trying to discriminate different objects 877 00:31:11,720 --> 00:31:14,649 under those transformations that I showed you earlier, 878 00:31:14,649 --> 00:31:16,190 where they're not just seeing images, 879 00:31:16,190 --> 00:31:18,590 but they have to deal with these invariances. 880 00:31:18,590 --> 00:31:22,250 And this is rhesus monkey data from the same task. 881 00:31:22,250 --> 00:31:24,140 And the task goes, I'll give you a test image 882 00:31:24,140 --> 00:31:25,190 and then you get choice images. 883 00:31:25,190 --> 00:31:26,171 Was it a car or a dog? 884 00:31:26,171 --> 00:31:28,670 I'll show you an image; which choice was it, a dog or a tree? 885 00:31:28,670 --> 00:31:31,500 And you're trying to entertain many objects all at once, 886 00:31:31,500 --> 00:31:34,482 and you get an image under some unpredictable view 887 00:31:34,482 --> 00:31:36,065 and unpredictable background, and then 888 00:31:36,065 --> 00:31:37,148 you have to make a choice. 889 00:31:37,148 --> 00:31:39,510 So this shows the confusion difficulty. 890 00:31:39,510 --> 00:31:42,020 And when you look at this, it's intuitive 891 00:31:42,020 --> 00:31:43,960 that the confused objects are geometrically similar. 892 00:31:43,960 --> 00:31:48,230 Camel is confused with dog, and tank is confused with truck, 893 00:31:48,230 --> 00:31:50,180 and that's true of both monkeys and humans. 894 00:31:50,180 --> 00:31:54,377 And to some level, this shouldn't be surprising to you. 895 00:31:54,377 --> 00:31:56,210 The same tasks that are difficult for humans 896 00:31:56,210 --> 00:31:58,550 are difficult for monkeys because probably they 897 00:31:58,550 --> 00:32:03,000 share very similar processing structures. 898 00:32:03,000 --> 00:32:05,000 They don't have to bring in a bunch of knowledge 899 00:32:05,000 --> 00:32:08,370 about tanks being driven by people or anything like that, they just have to say, 900 00:32:08,370 --> 00:32:09,590 was there a tank or a truck. 901 00:32:09,590 --> 00:32:12,048 And under those conditions, they make very similar patterns 902 00:32:12,048 --> 00:32:12,920 of confusion. 903 00:32:12,920 --> 00:32:15,140 And these patterns are very different from those 904 00:32:15,140 --> 00:32:16,850 that you get when you run classifiers 905 00:32:16,850 --> 00:32:20,680 on pixels or low level visual simulations.
906 00:32:20,680 --> 00:32:22,680 But they're very similar to each other, in fact, 907 00:32:22,680 --> 00:32:24,180 they're statistically indistinguishable, 908 00:32:24,180 --> 00:32:27,300 monkeys and humans, on these kinds of patterns of confusion. 909 00:32:27,300 --> 00:32:31,440 OK, so that's one reason we like this subject, the monkey model, 910 00:32:31,440 --> 00:32:34,551 is that the behavior is very well matched to the humans. 911 00:32:34,551 --> 00:32:37,050 The other reason is that we know from a lot of previous work 912 00:32:37,050 --> 00:32:40,470 that I alluded to, that some studies have shown that lesions 913 00:32:40,470 --> 00:32:43,290 in these parts of the brain can lead to deficits in recognition 914 00:32:43,290 --> 00:32:44,040 tasks. 915 00:32:44,040 --> 00:32:47,870 So again, we think the ventral stream solves recognition. 916 00:32:47,870 --> 00:32:50,370 So we have a weak word model of where to look, 917 00:32:50,370 --> 00:32:53,070 we just don't know exactly what's going on there. 918 00:32:53,070 --> 00:32:55,170 Just to orient you, these ventral 919 00:32:55,170 --> 00:32:59,010 areas, V1, V2, V4, and inferotemporal cortex, or IT cortex-- 920 00:32:59,010 --> 00:33:01,560 IT projects anatomically to the frontal lobe 921 00:33:01,560 --> 00:33:03,570 to regions involved in decision and action, 922 00:33:03,570 --> 00:33:05,986 and around the bend to the medial temporal lobe to regions 923 00:33:05,986 --> 00:33:08,642 involved in formation of long-term memory. 924 00:33:08,642 --> 00:33:10,350 Because these are monkeys and not humans, 925 00:33:10,350 --> 00:33:12,840 and Gabriel mentioned this in his talk, we can go in 926 00:33:12,840 --> 00:33:14,430 and we can record from their brains, 927 00:33:14,430 --> 00:33:16,860 and we can perturb neural activity in their brains 928 00:33:16,860 --> 00:33:17,380 directly. 929 00:33:17,380 --> 00:33:18,600 And we can do that in a systematic way. 930 00:33:18,600 --> 00:33:20,725 This is the advantage of an animal model as opposed 931 00:33:20,725 --> 00:33:21,900 to a human model. 932 00:33:21,900 --> 00:33:24,090 OK, as neuroscientists now, we've 933 00:33:24,090 --> 00:33:26,190 taken a problem, translated it to behavior, 934 00:33:26,190 --> 00:33:28,477 taken that behavior into a species we can study, 935 00:33:28,477 --> 00:33:30,060 we know roughly where to look, and now 936 00:33:30,060 --> 00:33:32,140 we want to try to understand what's going on. 937 00:33:32,140 --> 00:33:35,135 So as engineers, we take these curled up sheets of cortex 938 00:33:35,135 --> 00:33:37,260 and think of them, as I've already been showing you, 939 00:33:37,260 --> 00:33:39,289 as populations of neurons. 940 00:33:39,289 --> 00:33:41,580 So there's millions of neurons on each of these sheets. 941 00:33:41,580 --> 00:33:43,710 I'll give you numbers on a slide coming up. 942 00:33:43,710 --> 00:33:46,130 There's some sort of processing that may be common here, 943 00:33:46,130 --> 00:33:47,546 I put these T's in, there might be 944 00:33:47,546 --> 00:33:50,850 some common cortical algorithm processing forward this way. 945 00:33:50,850 --> 00:33:52,720 There's also inter-cortical processing. 946 00:33:52,720 --> 00:33:55,180 And there's also some feedback processing going on in here. 947 00:33:55,180 --> 00:33:57,570 So all that's schematically illustrated in this slide 948 00:33:57,570 --> 00:33:58,680 that I'll keep bringing up here when 949 00:33:58,680 --> 00:34:01,140 we talk about these different levels of the ventral stream.
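For readers who want the human-monkey comparison described a moment ago in concrete terms: it boils down to correlating the error patterns of two confusion matrices. Here is a toy version with made-up numbers (not the Rajalingham data, and not the actual statistics used in that study).

```python
# Toy comparison of two observers' confusion patterns (invented numbers).
import numpy as np

rng = np.random.default_rng(0)
objects = ["camel", "dog", "tank", "truck", "tree", "car", "plane", "face"]
n = len(objects)

def toy_confusions(shared, noise, rng):
    """Row-normalized confusion matrix = shared difficulty structure + observer noise."""
    m = shared + noise * rng.random((n, n))
    np.fill_diagonal(m, 5.0)                 # mostly correct choices on the diagonal
    return m / m.sum(axis=1, keepdims=True)

shared = rng.random((n, n))                  # e.g., camel<->dog, tank<->truck hard pairs
human  = toy_confusions(shared, 0.3, rng)
monkey = toy_confusions(shared, 0.3, rng)

off_diag = ~np.eye(n, dtype=bool)            # compare only the patterns of errors
r = np.corrcoef(human[off_diag], monkey[off_diag])[0, 1]
print(f"human-monkey confusion pattern correlation: {r:.2f}")
```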
950 00:34:01,140 --> 00:34:03,348 Now I'm mostly going to be talking about IT cortex here 951 00:34:03,348 --> 00:34:04,200 at the end. 952 00:34:04,200 --> 00:34:05,747 Why do we call these different areas? 953 00:34:05,747 --> 00:34:07,830 One reason is that there's a complete retinotopic 954 00:34:07,830 --> 00:34:10,080 map, a map of the whole visual space, in each 955 00:34:10,080 --> 00:34:11,312 of these different levels. 956 00:34:11,312 --> 00:34:12,270 In the retina, there's one. 957 00:34:12,270 --> 00:34:14,264 In the LGN-- in the thalamus-- there's another. 958 00:34:14,264 --> 00:34:15,389 In V1, there's another map. 959 00:34:15,389 --> 00:34:16,380 In V2, there's another map. 960 00:34:16,380 --> 00:34:17,610 In V4, there's another map. 961 00:34:17,610 --> 00:34:20,580 In IT, it's less clear that it's retinotopic, 962 00:34:20,580 --> 00:34:23,670 and we're not even sure that IT is one area. 963 00:34:23,670 --> 00:34:27,260 Maybe, if we have time, I'll say more about that detail. 964 00:34:27,260 --> 00:34:29,340 So it's not that retinotopic in IT, 965 00:34:29,340 --> 00:34:32,280 except in the most posterior parts of IT. 966 00:34:32,280 --> 00:34:34,440 But that's why neuroscientists divide these 967 00:34:34,440 --> 00:34:36,310 into different areas. 968 00:34:36,310 --> 00:34:38,610 So a key concept, though, for you computationally is, 969 00:34:38,610 --> 00:34:41,250 think of each of these as a population representation 970 00:34:41,250 --> 00:34:44,489 that's re-transforming the data from that complicated space 971 00:34:44,489 --> 00:34:46,320 to some nicer space. 972 00:34:46,320 --> 00:34:49,830 And it's doing this probably in a stepwise, gradual manner. 973 00:34:49,830 --> 00:34:52,136 So IT is believed to be that powerful encoding 974 00:34:52,136 --> 00:34:53,719 basis that I alluded to earlier, where 975 00:34:53,719 --> 00:34:56,024 you have these nice flattened object manifolds. 976 00:34:56,024 --> 00:34:57,690 And I'll show you the evidence for that. 977 00:35:00,590 --> 00:35:02,910 This is from a recent review I did that gives 978 00:35:02,910 --> 00:35:04,230 more numbers on these things. 979 00:35:04,230 --> 00:35:06,210 And I've sized the areas according 980 00:35:06,210 --> 00:35:08,920 to their relative cortical area in the monkey. 981 00:35:08,920 --> 00:35:11,220 Here's V1, V2, V4, IT. 982 00:35:11,220 --> 00:35:13,080 IT is a complex of areas. 983 00:35:13,080 --> 00:35:15,570 And I'm showing you these latencies. 984 00:35:15,570 --> 00:35:19,552 These are the average latencies in 985 00:35:19,552 --> 00:35:20,760 these different visual areas. 986 00:35:20,760 --> 00:35:22,350 You can see, it's about 50 milliseconds 987 00:35:22,350 --> 00:35:23,766 from when an image hits the retina 988 00:35:23,766 --> 00:35:25,110 until you get activity in V1. 989 00:35:25,110 --> 00:35:27,942 60 in V2, 70 in V4-- there's about a 10 millisecond step 990 00:35:27,942 --> 00:35:29,150 across these different areas. 991 00:35:29,150 --> 00:35:32,280 So there's about a 100 millisecond lag between when an image hits here 992 00:35:32,280 --> 00:35:34,800 and when you start to see changes in activity at this level 993 00:35:34,800 --> 00:35:36,750 up here that I'm referring to. 994 00:35:36,750 --> 00:35:40,440 When I say IT, I'm referring to AIT and CIT together. 995 00:35:40,440 --> 00:35:43,241 That's my usage of the word IT for the aficionados 996 00:35:43,241 --> 00:35:43,740 in the room.
997 00:35:43,740 --> 00:35:46,680 And that's about 10 million output neurons in IT 998 00:35:46,680 --> 00:35:48,180 just to fix numbers. 999 00:35:48,180 --> 00:35:50,650 In V1 here, you have like 37 million output neurons. 1000 00:35:50,650 --> 00:35:54,460 There's about 200 million neurons in V1, similar in V2. 1001 00:35:54,460 --> 00:35:56,460 And many of you probably heard about other parts 1002 00:35:56,460 --> 00:35:58,040 of the visual system. 1003 00:35:58,040 --> 00:36:00,970 Here's MT, many of you probably heard about MT. 1004 00:36:00,970 --> 00:36:03,650 So you can see it's tiny compared to some of these areas 1005 00:36:03,650 --> 00:36:05,195 that I'm talking about here. 1006 00:36:05,195 --> 00:36:06,820 I'm going to show you some neural dam-- 1007 00:36:06,820 --> 00:36:07,950 I'm just going to give you a brief tour 1008 00:36:07,950 --> 00:36:10,830 of these different areas, so brief, it's almost cartoonish. 1009 00:36:10,830 --> 00:36:13,026 But at least those of you who haven't seen this 1010 00:36:13,026 --> 00:36:14,150 should at least be exposed. 1011 00:36:14,150 --> 00:36:15,372 So in the retina-- 1012 00:36:15,372 --> 00:36:16,830 you guys know in the retina there's 1013 00:36:16,830 --> 00:36:18,862 a bunch of cell layers in the retina. 1014 00:36:18,862 --> 00:36:20,320 The retina is a complicated device. 1015 00:36:20,320 --> 00:36:22,320 I think of it as a beautiful camera. 1016 00:36:22,320 --> 00:36:23,904 So you're down in the retina. 1017 00:36:23,904 --> 00:36:25,320 To me, the key thing in the retina 1018 00:36:25,320 --> 00:36:27,120 is in the end you've got some cells that are going to project 1019 00:36:27,120 --> 00:36:28,784 back along the optic nerve. 1020 00:36:28,784 --> 00:36:30,450 So these are the retinal ganglion cells, 1021 00:36:30,450 --> 00:36:31,530 they actually live on the surface. 1022 00:36:31,530 --> 00:36:33,613 The light comes through, photo receptors are here, 1023 00:36:33,613 --> 00:36:35,850 there is processing in these intermediate layers, 1024 00:36:35,850 --> 00:36:38,370 and then there's a bunch of retinal ganglion cell types. 1025 00:36:38,370 --> 00:36:40,590 There's thought to be about 20 types or so. 1026 00:36:40,590 --> 00:36:42,780 The original physiology, there are 1027 00:36:42,780 --> 00:36:45,480 two functional central types where they 1028 00:36:45,480 --> 00:36:47,506 have on center or off center. 1029 00:36:47,506 --> 00:36:49,380 Let's take an on center cell, you shine light 1030 00:36:49,380 --> 00:36:51,360 in the middle of a spot-- now this 1031 00:36:51,360 --> 00:36:52,950 is a tiny little spot on the retina, 1032 00:36:52,950 --> 00:36:56,127 the size depends on where you are in the visual field. 1033 00:36:56,127 --> 00:36:58,210 But you shine a little bit of light in the center, 1034 00:36:58,210 --> 00:36:59,084 the response goes up. 1035 00:36:59,084 --> 00:37:00,499 See the spike rate going up here. 1036 00:37:00,499 --> 00:37:02,790 Put light in the surround, the response rate goes down. 1037 00:37:02,790 --> 00:37:06,270 So it has an on center, off surround profile. 1038 00:37:06,270 --> 00:37:08,610 And then there's a flip type here. 1039 00:37:08,610 --> 00:37:10,152 So that's the basic functional type. 1040 00:37:10,152 --> 00:37:11,610 When you think about the retina, it 1041 00:37:11,610 --> 00:37:13,920 is tiled with all of these point detectors that 1042 00:37:13,920 --> 00:37:16,090 have some nice center surround effects. 
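The on-center, off-surround behavior just described is often caricatured as a difference-of-Gaussians filter. Here is a small sketch of that caricature with made-up parameters (illustrative only, not a fit to real retinal ganglion cell data).

```python
# Toy model of an on-center / off-surround retinal ganglion cell: a difference-of-
# Gaussians receptive field applied at one spot of the image (parameters invented).
import numpy as np

def gaussian(size, sigma):
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    g = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return g / g.sum()

size = 21
dog = gaussian(size, sigma=1.5) - gaussian(size, sigma=4.0)   # center minus surround

center_spot = np.zeros((size, size)); center_spot[8:13, 8:13] = 1.0     # light in the center
surround_ring = np.ones((size, size)); surround_ring[5:16, 5:16] = 0.0  # light in the surround
full_field = np.ones((size, size))                                      # uniform illumination

for name, img in [("center spot", center_spot),
                  ("surround ring", surround_ring),
                  ("full field", full_field)]:
    drive = float((dog * img).sum())          # dot product of the image with the RF
    print(f"{name:13s} -> relative response {drive:+.3f}")
# Light in the center drives the cell up, light in the surround pushes it down,
# and full-field illumination roughly cancels.
```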
1043 00:37:16,090 --> 00:37:19,620 There's some nice gain control for overall illumination 1044 00:37:19,620 --> 00:37:21,570 conditions. 1045 00:37:21,570 --> 00:37:24,410 But in my toy model of the retina, it's 1046 00:37:24,410 --> 00:37:27,410 basically a really nice pixel map coming back down 1047 00:37:27,410 --> 00:37:30,770 the optic tract to the LGN. 1048 00:37:30,770 --> 00:37:34,250 OK, I'm going to skip the LGN and go straight to V1. 1049 00:37:34,250 --> 00:37:37,130 People have known for a long time that, functionally, V1 1050 00:37:37,130 --> 00:37:42,350 cells have sensitivity especially to edges. 1051 00:37:42,350 --> 00:37:45,590 They have what's called orientation selectivity. 1052 00:37:45,590 --> 00:37:47,210 Hopefully this isn't new to you guys. 1053 00:37:47,210 --> 00:37:48,752 Here's a simple cell in V1. 1054 00:37:48,752 --> 00:37:50,210 If you shine a bar of light on it 1055 00:37:50,210 --> 00:37:51,485 inside its receptive field-- 1056 00:37:51,485 --> 00:37:53,360 does everyone know what a receptive field is? 1057 00:37:53,360 --> 00:37:54,193 I don't want to go-- 1058 00:37:54,193 --> 00:37:55,184 OK. 1059 00:37:55,184 --> 00:37:56,600 It's OK if you ask, because I want 1060 00:37:56,600 --> 00:37:57,809 to make sure you guys are OK. 1061 00:37:57,809 --> 00:37:59,974 So in the receptive field, you shine a bar of light in it, 1062 00:37:59,974 --> 00:38:01,580 turn it on in the right orientation, 1063 00:38:01,580 --> 00:38:04,190 and it gives a good response out of the cell. 1064 00:38:04,190 --> 00:38:06,545 Move it off this position, now not much response, 1065 00:38:06,545 --> 00:38:08,420 there's a little bit of an off response here. 1066 00:38:08,420 --> 00:38:10,580 Change the orientation, nothing happens. 1067 00:38:10,580 --> 00:38:12,860 Full field illumination, nothing happens. 1068 00:38:12,860 --> 00:38:15,620 OK, so this is called selectivity. 1069 00:38:15,620 --> 00:38:17,810 That is, there's some portion of the image space 1070 00:38:17,810 --> 00:38:19,040 that it cares about. 1071 00:38:19,040 --> 00:38:21,020 It doesn't just respond to any light 1072 00:38:21,020 --> 00:38:25,030 at that spot like the pixel-wise retinal ganglion cell 1073 00:38:25,030 --> 00:38:26,120 would. 1074 00:38:26,120 --> 00:38:28,730 So now there's this complex cell that's 1075 00:38:28,730 --> 00:38:32,700 also in V1, which maintains this orientation 1076 00:38:32,700 --> 00:38:35,150 selectivity across a change in position, 1077 00:38:35,150 --> 00:38:38,460 as shown here, and also across some changes in scale. 1078 00:38:38,460 --> 00:38:41,500 So it maintains it, meaning that you have this tolerance-- 1079 00:38:41,500 --> 00:38:44,360 so that's called position tolerance-- tolerance for position. 1080 00:38:44,360 --> 00:38:47,120 You can move the bar around and it still likes that oriented bar. 1081 00:38:47,120 --> 00:38:50,420 But you change its angle and it goes down, 1082 00:38:50,420 --> 00:38:52,460 so it still maintains the same selectivity here 1083 00:38:52,460 --> 00:38:53,570 but it has some tolerance. 1084 00:38:53,570 --> 00:38:57,020 So you get this build-up of some orientation sensitivity 1085 00:38:57,020 --> 00:38:58,690 followed by some tolerance. 1086 00:38:58,690 --> 00:39:00,469 And there are models from Hubel and Wiesel, 1087 00:39:00,469 --> 00:39:02,510 where they thought you could build the simple cells first 1088 00:39:02,510 --> 00:39:04,093 and then build the complex cells out of those, 1089 00:39:04,093 --> 00:39:05,540 that's the simple version.
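That simple-cell/complex-cell idea maps naturally onto a few lines of code. The sketch below is only a cartoon with hypothetical filters and bar stimuli (not a model fit to data): an AND-like weighted sum over aligned, pixel-like inputs gives orientation selectivity, and a max over shifted copies gives position tolerance while keeping that selectivity.

```python
# Cartoon of the Hubel-and-Wiesel build-up (illustrative, not fit to data).
import numpy as np

size = 15

def oriented_bar(angle_deg, offset=0):
    """Binary image of a thin bar at a given angle, shifted by `offset` pixels."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    t = np.deg2rad(angle_deg)
    dist = np.abs(np.cos(t) * yy - np.sin(t) * xx - offset)
    return (dist < 1.0).astype(float)

def simple_cell(img, angle_deg, offset=0):
    """AND-like: weight a line of aligned positions positively, a parallel flank negatively."""
    rf = oriented_bar(angle_deg, offset) - 0.5 * oriented_bar(angle_deg, offset + 3)
    return float((rf * img).sum())

def complex_cell(img, angle_deg):
    """Pool (max) over simple cells with the same orientation at shifted positions."""
    return max(simple_cell(img, angle_deg, off) for off in range(-4, 5))

vertical_bar = oriented_bar(90)
shifted_bar  = oriented_bar(90, offset=3)
tilted_bar   = oriented_bar(45)

print(simple_cell(vertical_bar, 90), simple_cell(shifted_bar, 90))    # strong, then suppressed
print(complex_cell(vertical_bar, 90), complex_cell(shifted_bar, 90))  # strong for both positions
print(complex_cell(tilted_bar, 90))                                   # still weak: selectivity kept
```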
1090 00:39:05,540 --> 00:39:06,687 And here they are. 1091 00:39:06,687 --> 00:39:08,270 These are the Hubel and Wiesel models of 1092 00:39:08,270 --> 00:39:11,420 how you build these: you build the selectivity 1093 00:39:11,420 --> 00:39:14,539 from pixel-wise cells with an AND-like operator, lining 1094 00:39:14,539 --> 00:39:15,330 those cells up correctly. 1095 00:39:15,330 --> 00:39:17,770 You can imagine orientation-tuned cells built this way. 1096 00:39:17,770 --> 00:39:20,120 There's evidence for this in physiology, 1097 00:39:20,120 --> 00:39:21,980 that this is how these are constructed. 1098 00:39:21,980 --> 00:39:23,680 The tolerance of these complex cells 1099 00:39:23,680 --> 00:39:27,230 is thought to be built by a combination of simple cells. 1100 00:39:27,230 --> 00:39:29,167 And there's some evidence for this. 1101 00:39:29,167 --> 00:39:31,375 And this goes, again, all the way back to Hubel and Wiesel, 1102 00:39:31,375 --> 00:39:35,030 who won a Nobel Prize for this and related work done in the 1960s. 1103 00:39:35,030 --> 00:39:38,450 And then there were a bunch of computational models 1104 00:39:38,450 --> 00:39:40,460 that were really inspired by this and that I 1105 00:39:40,460 --> 00:39:42,974 think are still the core models of how the system works. 1106 00:39:42,974 --> 00:39:45,140 And some of the original ones that were written down 1107 00:39:45,140 --> 00:39:47,900 are Fukushima's in the '80s, and then 1108 00:39:47,900 --> 00:39:50,702 Tommy Poggio and others built what's called the HMAX model, 1109 00:39:50,702 --> 00:39:52,160 which you guys have probably heard about, 1110 00:39:52,160 --> 00:39:54,440 that's built off of these same ideas, much more 1111 00:39:54,440 --> 00:39:58,055 refined and much more matched to the neural data. 1112 00:39:58,055 --> 00:39:59,930 But I'm just trying to point out that these kinds 1113 00:39:59,930 --> 00:40:01,460 of physiological observations are 1114 00:40:01,460 --> 00:40:04,960 what inspired this class of largely feedforward models 1115 00:40:04,960 --> 00:40:08,300 that you've heard a lot about today. 1116 00:40:08,300 --> 00:40:11,840 So that's a brief tour of V1. 1117 00:40:11,840 --> 00:40:13,700 Now, what's going on in V2? 1118 00:40:13,700 --> 00:40:15,380 For a long time, people thought it 1119 00:40:15,380 --> 00:40:17,360 was hard to tell the difference between V1 and V2. 1120 00:40:17,360 --> 00:40:18,901 And I just thought I'd show you guys, 1121 00:40:18,901 --> 00:40:21,054 this is a slide I stuck in, this is from Eero 1122 00:40:21,054 --> 00:40:22,220 Simoncelli and Tony Movshon. 1123 00:40:22,220 --> 00:40:24,678 And I think you guys have Eero teaching in the course a bit 1124 00:40:24,678 --> 00:40:26,500 later, so he may say some of this. 1125 00:40:26,500 --> 00:40:33,050 But V2 cells have some sensitivity to natural image 1126 00:40:33,050 --> 00:40:35,810 statistics that V1 cells don't. 1127 00:40:35,810 --> 00:40:38,030 And maybe I'll see if I can take you through this. 1128 00:40:38,030 --> 00:40:42,680 So the way that they did this is you can simulate-- 1129 00:40:42,680 --> 00:40:45,410 so this is all driven off of work that Eero and Tony have 1130 00:40:45,410 --> 00:40:45,910 done-- 1131 00:40:45,910 --> 00:40:48,390 especially Eero has done on texture synthesis.
1132 00:40:48,390 --> 00:40:50,120 So you have these original images, 1133 00:40:50,120 --> 00:40:53,120 and if you run them through a bunch of V1-like filter banks, 1134 00:40:53,120 --> 00:40:56,540 and then you take a new image, a random seed, which 1135 00:40:56,540 --> 00:40:58,430 is like white noise, and you try to make sure 1136 00:40:58,430 --> 00:41:00,860 that it would activate populations 1137 00:41:00,860 --> 00:41:02,530 of V1 cells in a similar way, there's 1138 00:41:02,530 --> 00:41:05,030 a large set of images that would do that because you're just 1139 00:41:05,030 --> 00:41:07,100 doing summary statistics, but these 1140 00:41:07,100 --> 00:41:08,340 are some examples of them. 1141 00:41:08,340 --> 00:41:10,548 For this image, this is one that one might look like. 1142 00:41:10,548 --> 00:41:12,980 So you can see, to you, it doesn't look the same as this. 1143 00:41:12,980 --> 00:41:15,230 But to V1, these are metamers, they're 1144 00:41:15,230 --> 00:41:18,650 very similar in the summary statistics in V1. 1145 00:41:18,650 --> 00:41:21,320 And then you start taking cross products of these V1 summary 1146 00:41:21,320 --> 00:41:23,162 statistics and then you try to match those. 1147 00:41:23,162 --> 00:41:24,620 And what's interesting is you start 1148 00:41:24,620 --> 00:41:26,720 to get something that looks, texture wise, much more 1149 00:41:26,720 --> 00:41:27,810 like this original image. 1150 00:41:27,810 --> 00:41:29,893 And this is a big part of what Eero and others did 1151 00:41:29,893 --> 00:41:30,740 in that work. 1152 00:41:30,740 --> 00:41:32,198 And the reason I'm showing you this 1153 00:41:32,198 --> 00:41:35,570 is that Tony's lab has gone and recorded in V1 and V2 1154 00:41:35,570 --> 00:41:38,840 with these kinds of stimuli, and the main observation they have 1155 00:41:38,840 --> 00:41:43,670 is that V1 doesn't care whether you show it this or this. 1156 00:41:43,670 --> 00:41:46,022 To V1, these are both the same, which 1157 00:41:46,022 --> 00:41:47,480 says we have the summary statistics 1158 00:41:47,480 --> 00:41:49,867 for V1 right in terms of the average V1 response. 1159 00:41:49,867 --> 00:41:51,200 That's all I'm showing you here. 1160 00:41:51,200 --> 00:41:53,256 The paper, if you want it, is much more detailed. 1161 00:41:53,256 --> 00:41:55,130 But you go to V2 and there's a big difference 1162 00:41:55,130 --> 00:41:59,000 between this, which V2 cells respond to more, and this, 1163 00:41:59,000 --> 00:42:00,680 which they respond to less. 1164 00:42:00,680 --> 00:42:02,720 And really one inference you can take from this 1165 00:42:02,720 --> 00:42:06,950 is that V2 neurons apply a repeated-- another and like 1166 00:42:06,950 --> 00:42:08,480 operator on V1. 1167 00:42:08,480 --> 00:42:10,850 That's a simple inference that these kinds of data seem 1168 00:42:10,850 --> 00:42:11,720 to support . 1169 00:42:11,720 --> 00:42:14,090 And they also tell you that these and-like operators, 1170 00:42:14,090 --> 00:42:15,890 these conjunctions of V1 statistics 1171 00:42:15,890 --> 00:42:18,560 tend to be in the direction of the statistics 1172 00:42:18,560 --> 00:42:21,500 of the natural world, that's naturalistic statistics. 1173 00:42:21,500 --> 00:42:24,110 Now lots of controls haven't been done here 1174 00:42:24,110 --> 00:42:26,300 to narrow in exactly what kinds ands, 1175 00:42:26,300 --> 00:42:28,280 but that's the spirit of where the field is 1176 00:42:28,280 --> 00:42:29,635 in trying to understand V2. 
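For readers who want the flavor of "summary statistics" in code, here is a drastically simplified stand-in for the texture-model idea (it is not the Portilla-Simoncelli model and the filter bank is arbitrary): marginal statistics of oriented-filter outputs play the role of the V1-level description, and cross products of those outputs play the role of the conjunctions that the V2 story appeals to. A synthesis procedure would start from noise and iteratively adjust pixels until these summary numbers match an original image's.

```python
# Simplified stand-in for V1/V2-style summary statistics (not the actual model).
import numpy as np
from scipy.signal import convolve2d

def gabor(size=9, sigma=2.0, wavelength=4.0, theta=0.0):
    """A small oriented filter: circular Gaussian envelope times an oriented carrier."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    xr = xx * np.cos(theta) + yy * np.sin(theta)
    return np.exp(-(xx**2 + yy**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xr / wavelength)

bank = [gabor(theta=t) for t in (0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)]

def v1_statistics(img):
    """Mean rectified response of each oriented filter: marginal, V1-level statistics."""
    maps = [np.abs(convolve2d(img, f, mode="same")) for f in bank]
    return np.array([m.mean() for m in maps]), maps

def v1_cross_products(maps):
    """Correlations between filter-response maps: joint, conjunction-like statistics."""
    flat = np.array([m.ravel() - m.mean() for m in maps])
    return np.corrcoef(flat)

rng = np.random.default_rng(0)
image = rng.random((64, 64))                   # stand-in for a texture photograph
marginals, maps = v1_statistics(image)
print(marginals)                               # 4 numbers: one per orientation
print(v1_cross_products(maps).round(2))        # 4x4 matrix of map correlations
```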
1177 00:42:29,635 --> 00:42:31,010 Everybody thinks it has something 1178 00:42:31,010 --> 00:42:33,135 to do with corners or a more complicated structure. 1179 00:42:33,135 --> 00:42:35,270 But this is a way that current in the field 1180 00:42:35,270 --> 00:42:38,170 to try to move these image computing models forward 1181 00:42:38,170 --> 00:42:39,140 in V1 and V2. 1182 00:42:39,140 --> 00:42:42,080 And Tony likes to point out that this is one of the strongest 1183 00:42:42,080 --> 00:42:44,599 differences that you see between V1 and V2, 1184 00:42:44,599 --> 00:42:46,140 other than the receptive field sizes. 1185 00:42:46,140 --> 00:42:49,460 So I think that's quite some exciting work if you 1186 00:42:49,460 --> 00:42:51,820 don't know about it on V2. 1187 00:42:51,820 --> 00:42:54,740 OK, then you get up into V4 and things get much murkier. 1188 00:42:54,740 --> 00:42:56,420 So what's going on in V4? 1189 00:42:56,420 --> 00:42:59,030 Well, let me just briefly say that one of my post-docs-- this 1190 00:42:59,030 --> 00:43:01,770 is more recent work just because it builds on that earlier work. 1191 00:43:01,770 --> 00:43:04,280 This is Nicole Rust, when she was a post-doc in the lab, 1192 00:43:04,280 --> 00:43:05,210 compared V4. 1193 00:43:05,210 --> 00:43:07,280 She actually compared it to IT. 1194 00:43:07,280 --> 00:43:08,000 I'll skip that. 1195 00:43:08,000 --> 00:43:10,760 But she was using these Simoncelli scrambled images. 1196 00:43:10,760 --> 00:43:13,636 These are actually the texture images from-- 1197 00:43:13,636 --> 00:43:15,260 these are the original images and these 1198 00:43:15,260 --> 00:43:16,040 are the texture versions. 1199 00:43:16,040 --> 00:43:17,960 So this should look like a textured version of that. 1200 00:43:17,960 --> 00:43:19,959 You can see that these algorithms don't actually 1201 00:43:19,959 --> 00:43:23,180 capture the object content of these images. 1202 00:43:23,180 --> 00:43:26,990 And what Nicole actually showed is that similar to what 1203 00:43:26,990 --> 00:43:29,510 you just saw there, in the earlier work like V1, 1204 00:43:29,510 --> 00:43:32,360 V4 doesn't care about the differences between these. 1205 00:43:32,360 --> 00:43:35,090 It responds similarly, as a population, to this and this, 1206 00:43:35,090 --> 00:43:36,800 and this and this, and this and this. 1207 00:43:36,800 --> 00:43:40,220 But IT cares a lot about this versus this. 1208 00:43:40,220 --> 00:43:43,250 So this is just repeating the same theme, the general idea 1209 00:43:43,250 --> 00:43:46,402 that you have and -like operators that we think 1210 00:43:46,402 --> 00:43:48,110 are aligned along the ventral stream that 1211 00:43:48,110 --> 00:43:49,790 are tuned to the kind of statistics 1212 00:43:49,790 --> 00:43:51,530 that you tend to encounter in the world. 1213 00:43:51,530 --> 00:43:54,650 And this is some of the evidence for it in V2, 1214 00:43:54,650 --> 00:43:57,500 and then later in V4, and IT, and Nicole's work, 1215 00:43:57,500 --> 00:43:59,000 if you piece that all together. 1216 00:43:59,000 --> 00:44:00,980 When you go to a place like V4, remember V4 1217 00:44:00,980 --> 00:44:02,690 is now like three levels up. 1218 00:44:02,690 --> 00:44:04,170 And what does V4 do? 1219 00:44:04,170 --> 00:44:07,409 Look, this is Jack's work in 1996. 1220 00:44:07,409 --> 00:44:08,950 This is from Jack Gallant when he was 1221 00:44:08,950 --> 00:44:10,158 working with David Van Essen. 
1222 00:44:10,158 --> 00:44:12,200 And people had some ideas that maybe there 1223 00:44:12,200 --> 00:44:14,600 are these certain functions that V4 neurons like, 1224 00:44:14,600 --> 00:44:16,407 and they would show these-- 1225 00:44:16,407 --> 00:44:17,990 the same thing people have done in V2, 1226 00:44:17,990 --> 00:44:19,781 they would show a bunch of images like this 1227 00:44:19,781 --> 00:44:22,490 and figure out, well, does it like these Cartesian gratings 1228 00:44:22,490 --> 00:44:23,390 or these curved ones? 1229 00:44:23,390 --> 00:44:25,070 And you know, what you get out of this is, 1230 00:44:25,070 --> 00:44:26,540 you could tell some story about it, 1231 00:44:26,540 --> 00:44:28,331 but you get a bunch of responses out of it. 1232 00:44:28,331 --> 00:44:30,140 The color indicates the response. 1233 00:44:30,140 --> 00:44:32,180 And you kind of look at it, and people would tell some stories, 1234 00:44:32,180 --> 00:44:33,980 but it really was just kind of like reading tea leaves. 1235 00:44:33,980 --> 00:44:35,604 Here's a bunch of data, and we don't really 1236 00:44:35,604 --> 00:44:38,150 know what these V4 neurons were doing. 1237 00:44:38,150 --> 00:44:43,170 This was a Science paper, so you can go back and read it. 1238 00:44:43,170 --> 00:44:47,720 And then Ed Connor and Anitha Pasupathy 1239 00:44:47,720 --> 00:44:49,760 worked together a few years after that 1240 00:44:49,760 --> 00:44:54,260 to try to figure out more about what V4 neurons do. 1241 00:44:54,260 --> 00:44:55,790 And they did things like take images 1242 00:44:55,790 --> 00:44:57,470 like this, which were isolated, and try 1243 00:44:57,470 --> 00:45:00,560 to cut them into parts, like curved parts, pointy parts, 1244 00:45:00,560 --> 00:45:02,930 curved, concave, convex. 1245 00:45:02,930 --> 00:45:06,355 And this was motivated by some of the psychology literature. 1246 00:45:06,355 --> 00:45:07,730 And they would define these based 1247 00:45:07,730 --> 00:45:09,510 on the center of the object. 1248 00:45:09,510 --> 00:45:11,580 So this wasn't an image-computable model, 1249 00:45:11,580 --> 00:45:13,880 it was just a basis set that they 1250 00:45:13,880 --> 00:45:16,259 built around these silhouette objects. 1251 00:45:16,259 --> 00:45:18,800 And so they made this basis set for any kind of silhouetted 1252 00:45:18,800 --> 00:45:20,155 object they liked here. 1253 00:45:20,155 --> 00:45:21,530 They hypothesized that they could 1254 00:45:21,530 --> 00:45:23,750 fit the responses of V4 neurons in this basis set. 1255 00:45:23,750 --> 00:45:25,340 And this was their attempt to do it. 1256 00:45:25,340 --> 00:45:27,369 They could actually fit them quite well. 1257 00:45:27,369 --> 00:45:29,160 And that's kind of what's being shown here. 1258 00:45:29,160 --> 00:45:30,618 Here's the response of a V4 neuron. 1259 00:45:30,618 --> 00:45:32,790 The color indicates the depth of the response. 1260 00:45:32,790 --> 00:45:35,040 You can see, this is sort of like that previous slide, 1261 00:45:35,040 --> 00:45:35,870 you're looking at tea leaves. 1262 00:45:35,870 --> 00:45:37,756 It looks complicated, but under this model 1263 00:45:37,756 --> 00:45:40,130 they were able to, in the shape space, explain about half 1264 00:45:40,130 --> 00:45:42,620 of the response variance of V4 neurons. 1265 00:45:42,620 --> 00:45:48,800 The upshot is that V4 cares about some combination 1266 00:45:48,800 --> 00:45:49,730 of curves.
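The general recipe just described, regressing a neuron's responses onto a hand-built shape basis and asking how much response variance is explained, can be sketched with synthetic data. The basis, the "neuron," and all numbers below are invented for illustration; this is not the Connor/Pasupathy model itself.

```python
# Sketch of fitting a neuron's responses with a hand-built shape basis (synthetic data).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_shapes, n_basis = 80, 24

# Pretend feature matrix: each row is one silhouette projected onto 24 basis
# functions tiling (curvature, angular position). In the real work these would
# come from the measured object boundaries.
X = rng.random((n_shapes, n_basis))

# Pretend V4 neuron: tuned to a couple of curvature/position conjunctions, plus noise.
true_w = np.zeros(n_basis); true_w[[3, 17]] = [4.0, 2.5]
rates = X @ true_w + rng.normal(scale=1.5, size=n_shapes)

r2 = cross_val_score(Ridge(alpha=1.0), X, rates, cv=5, scoring="r2")
print(f"cross-validated variance explained: {r2.mean():.2f}")
```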
1267 00:45:49,730 --> 00:45:51,590 And then later, Scott Brincat, with Ed, 1268 00:45:51,590 --> 00:45:53,300 went on into posterior IT and showed 1269 00:45:53,300 --> 00:45:55,670 that maybe some combinations of these V4 cells 1270 00:45:55,670 --> 00:45:58,859 could fit posterior IT responses quite well. 1271 00:45:58,859 --> 00:46:00,650 So if you read the literature on V4 and IT, 1272 00:46:00,650 --> 00:46:01,780 you'll come across these studies. 1273 00:46:01,780 --> 00:46:03,500 And they are important ones to look at. 1274 00:46:03,500 --> 00:46:04,916 Unfortunately, they don't give you 1275 00:46:04,916 --> 00:46:07,530 an image-computable model of what these neurons are doing. 1276 00:46:07,530 --> 00:46:09,790 But it's some of the work that you should know about 1277 00:46:09,790 --> 00:46:12,980 if you want to look in V4 or early IT, 1278 00:46:12,980 --> 00:46:14,275 so I'm telling it to you. 1279 00:46:14,275 --> 00:46:16,400 So let me go on to IT, which is what I want to talk 1280 00:46:16,400 --> 00:46:18,404 about for the rest of today. 1281 00:46:18,404 --> 00:46:19,945 Again, I'm talking about AIT and CIT. 1282 00:46:23,270 --> 00:46:26,150 And I'll just quickly say that the anatomy, again, 1283 00:46:26,150 --> 00:46:29,580 suggests that IT covers mostly the central 10 degrees. 1284 00:46:29,580 --> 00:46:34,400 And even though V1, V2, and V4 cover the whole visual field, 1285 00:46:34,400 --> 00:46:36,650 if you make injections in V4, that's 1286 00:46:36,650 --> 00:46:39,560 shown here, where you make injections 1287 00:46:39,560 --> 00:46:42,560 in the more peripheral parts of the V4 representation, which 1288 00:46:42,560 --> 00:46:46,021 is up here, you don't get much projection into IT, which 1289 00:46:46,021 --> 00:46:46,520 is here. 1290 00:46:46,520 --> 00:46:49,061 You don't see much green color, whereas if you make injections 1291 00:46:49,061 --> 00:46:51,170 in the central part of V4, these red sites here, 1292 00:46:51,170 --> 00:46:55,700 you see much more coverage into IT, which is shown here. 1293 00:46:55,700 --> 00:46:57,830 So when I say 10 degrees, that's rough. 1294 00:46:57,830 --> 00:46:59,260 Everything in biology is messy. 1295 00:46:59,260 --> 00:47:01,940 But this is some of the evidence-- beyond recordings, 1296 00:47:01,940 --> 00:47:04,910 there's anatomical evidence-- that as you go down into IT, 1297 00:47:04,910 --> 00:47:07,890 you are more and more focused on the central 10 degrees. 1298 00:47:07,890 --> 00:47:10,700 OK, let me talk a little bit about the history of IT 1299 00:47:10,700 --> 00:47:11,290 recordings. 1300 00:47:11,290 --> 00:47:14,291 This is when people got excited about IT, in the '70s. 1301 00:47:14,291 --> 00:47:16,790 This is work by Charlie Gross, who's one of the first people 1302 00:47:16,790 --> 00:47:19,070 to record in IT cortex. 1303 00:47:19,070 --> 00:47:22,250 And I'll show you what they did here. 1304 00:47:22,250 --> 00:47:24,710 This was in an era where, remember, Hubel and Wiesel 1305 00:47:24,710 --> 00:47:26,360 had just done their work in the '60s. 1306 00:47:26,360 --> 00:47:28,900 And they recorded from the cat visual cortex. 1307 00:47:28,900 --> 00:47:30,400 And they had found these edge cells, 1308 00:47:30,400 --> 00:47:32,525 and they ended up winning the Nobel Prize for that. 1309 00:47:32,525 --> 00:47:34,700 So it was the heyday of like, let's record 1310 00:47:34,700 --> 00:47:36,380 and figure out what makes cells go.
1311 00:47:36,380 --> 00:47:39,620 So they were brave enough to put an electrode down an IT cortex 1312 00:47:39,620 --> 00:47:42,470 in 1970 and said, what makes this neuron go. 1313 00:47:42,470 --> 00:47:44,300 Remember, that's an encoding question, 1314 00:47:44,300 --> 00:47:48,620 what's the image content that will drive this neuron. 1315 00:47:48,620 --> 00:47:50,360 And it's fun to just look back on this 1316 00:47:50,360 --> 00:47:51,597 and what they were doing. 1317 00:47:51,597 --> 00:47:53,180 So they didn't have computer monitors. 1318 00:47:53,180 --> 00:47:55,090 They were actually waving around stimuli 1319 00:47:55,090 --> 00:47:56,090 in front of the animals. 1320 00:47:56,090 --> 00:47:59,030 This is an anesthetized animal on a table. 1321 00:47:59,030 --> 00:48:00,005 This is a monkey. 1322 00:48:00,005 --> 00:48:01,380 Actually, they started with a cat 1323 00:48:01,380 --> 00:48:02,838 and then they later went to monkey. 1324 00:48:02,838 --> 00:48:05,520 The use of these stimuli was begun one day when, 1325 00:48:05,520 --> 00:48:08,020 having failed to drive a unit with any light stimulus-- that 1326 00:48:08,020 --> 00:48:10,100 probably means spots of light, edges things 1327 00:48:10,100 --> 00:48:11,930 that Hubel and Wiesel had been using. 1328 00:48:11,930 --> 00:48:14,530 We waved a hand at the stimulus screen, 1329 00:48:14,530 --> 00:48:15,950 they waved in front of the monkey, 1330 00:48:15,950 --> 00:48:18,080 and elicited a very vigorous response 1331 00:48:18,080 --> 00:48:21,190 from the previously unresponsive neuron. 1332 00:48:21,190 --> 00:48:23,621 And then we spent the next 12 hours-- so the animal's 1333 00:48:23,621 --> 00:48:26,120 anesthetized on the table, their recording from this neuron. 1334 00:48:26,120 --> 00:48:27,680 It's 12 hours because nothing's moving, 1335 00:48:27,680 --> 00:48:29,570 so you can record for a long period of time. 1336 00:48:29,570 --> 00:48:30,650 So singular neuron, they're recording, 1337 00:48:30,650 --> 00:48:31,430 listening to the spikes. 1338 00:48:31,430 --> 00:48:33,770 We spent the next 12 hours testing various paper cut 1339 00:48:33,770 --> 00:48:36,740 outs in attempt to find the trigger feature. 1340 00:48:36,740 --> 00:48:38,990 You can see, that's a Hubel and Wiesel idea, 1341 00:48:38,990 --> 00:48:40,310 what makes this neuron go. 1342 00:48:40,310 --> 00:48:43,390 What's the best thing, that's become 1343 00:48:43,390 --> 00:48:45,680 a lot of what the field spent time doing. 1344 00:48:45,680 --> 00:48:48,500 Trigger feature for this unit, when the entire stimulus set 1345 00:48:48,500 --> 00:48:51,290 were used, were ranked according to the strength of the response 1346 00:48:51,290 --> 00:48:52,081 that they produced. 1347 00:48:52,081 --> 00:48:54,230 We could not find a simple physical dimension 1348 00:48:54,230 --> 00:48:55,829 that correlated with this rank order. 1349 00:48:55,829 --> 00:48:57,620 However, the rank order of adequate stimuli 1350 00:48:57,620 --> 00:48:59,660 did correlate with similarity for us, 1351 00:48:59,660 --> 00:49:01,760 that means psychophysical judged, 1352 00:49:01,760 --> 00:49:03,800 to the shadow of a monkey hand. 1353 00:49:03,800 --> 00:49:05,820 So these are their rank order of the stimuli. 1354 00:49:05,820 --> 00:49:08,480 And they say look, it looks like it's some sort of hand neuron. 1355 00:49:08,480 --> 00:49:10,021 That's all I know how to describe it. 1356 00:49:10,021 --> 00:49:11,960 I can't find some simple thing on here. 
1357 00:49:11,960 --> 00:49:14,972 So this kind of study then launched a whole domain 1358 00:49:14,972 --> 00:49:17,180 where people started to go in to record these neurons 1359 00:49:17,180 --> 00:49:18,980 and they found interesting different types. 1360 00:49:18,980 --> 00:49:21,200 Bob Desimone, who worked with Charlie Gross, 1361 00:49:21,200 --> 00:49:23,325 later showed much more nicely under more controlled 1362 00:49:23,325 --> 00:49:25,710 conditions, yes, there are indeed neurons that respond. 1363 00:49:25,710 --> 00:49:27,770 You can see more to these hand-- this is the post stimulus time 1364 00:49:27,770 --> 00:49:30,228 histogram, lots of spikes, lots of spikes, lots of spikes-- 1365 00:49:30,228 --> 00:49:32,870 respond more to these hands than to these other kind 1366 00:49:32,870 --> 00:49:34,851 of stimuli here. 1367 00:49:34,851 --> 00:49:36,350 So you could say, these neurons have 1368 00:49:36,350 --> 00:49:39,380 tuned to specific combinations of high selectivity. 1369 00:49:39,380 --> 00:49:40,970 You'll hear from Winrich that others 1370 00:49:40,970 --> 00:49:42,470 had shown that you could record some 1371 00:49:42,470 --> 00:49:45,410 of the neurons are really like faces that you could find, 1372 00:49:45,410 --> 00:49:46,832 and not so much hands. 1373 00:49:46,832 --> 00:49:48,290 So you could find neurons that seem 1374 00:49:48,290 --> 00:49:51,230 to have some interesting selectivity in IT cortex. 1375 00:49:51,230 --> 00:49:53,030 And then others later went on to show 1376 00:49:53,030 --> 00:49:55,605 in a number of studies-- this is from Nico Logothetis' work 1377 00:49:55,605 --> 00:49:56,730 of a number of years later. 1378 00:49:56,730 --> 00:50:00,500 It's just one example that this selectivity had some tolerance 1379 00:50:00,500 --> 00:50:02,390 to, say, the position of the stimulus, that's 1380 00:50:02,390 --> 00:50:03,350 what's shown here. 1381 00:50:03,350 --> 00:50:05,180 The fact that these bars are high just 1382 00:50:05,180 --> 00:50:09,230 means that it tolerates movement in where the-- 1383 00:50:09,230 --> 00:50:11,270 sorry, this is size, degrees of visual angle. 1384 00:50:11,270 --> 00:50:13,640 This is position, moving the stimulus around. 1385 00:50:13,640 --> 00:50:16,190 So this was known for a number of years 1386 00:50:16,190 --> 00:50:18,200 that there's some tolerance to position and size 1387 00:50:18,200 --> 00:50:19,384 changes at least. 1388 00:50:19,384 --> 00:50:21,050 OK, so I'm putting these up and you say, 1389 00:50:21,050 --> 00:50:23,836 there's some selectivity and there's some tolerance. 1390 00:50:23,836 --> 00:50:26,210 And that should remind you of what we already said in V1, 1391 00:50:26,210 --> 00:50:27,834 there's some selectivity, simple cells. 1392 00:50:27,834 --> 00:50:29,880 There's some tolerance, complex cells. 1393 00:50:29,880 --> 00:50:31,460 So you have the same themes here, 1394 00:50:31,460 --> 00:50:34,760 just different kinds of types of stimuli being used. 1395 00:50:34,760 --> 00:50:38,150 Then people really went on, in the 80s especially, and said, 1396 00:50:38,150 --> 00:50:39,710 let's go after this trigger feature. 1397 00:50:39,710 --> 00:50:44,360 And Tanaka's group really went after this really hard. 
1398 00:50:44,360 --> 00:50:46,550 Tanaka's group would find the best stimulus 1399 00:50:46,550 --> 00:50:48,350 they would find, dangle a bunch of objects 1400 00:50:48,350 --> 00:50:50,350 in front of a recorded neuron, find the best out 1401 00:50:50,350 --> 00:50:52,016 of a whole set of objects, and then they 1402 00:50:52,016 --> 00:50:53,300 try to do a reduction. 1403 00:50:53,300 --> 00:50:55,460 They'd try to figure out, how can I reduce this. 1404 00:50:55,460 --> 00:50:59,570 This is their attempt to reduce the stimulus to its features 1405 00:50:59,570 --> 00:51:01,190 without lowering the neural response. 1406 00:51:01,190 --> 00:51:03,290 So high response, high response, high response, high response, 1407 00:51:03,290 --> 00:51:05,390 high response, suddenly I do this, the response drops. 1408 00:51:05,390 --> 00:51:06,639 I do this, the response drops. 1409 00:51:06,639 --> 00:51:08,660 And they have lots of examples of this. 1410 00:51:08,660 --> 00:51:11,600 And they want you to try to get to the simplest thing that 1411 00:51:11,600 --> 00:51:12,830 could capture the response. 1412 00:51:12,830 --> 00:51:15,350 And when they did this, they would take stimuli like this, 1413 00:51:15,350 --> 00:51:18,440 and end up with stimuli that looked like that. 1414 00:51:18,440 --> 00:51:20,960 Now, many of you should probably start to wonder here, 1415 00:51:20,960 --> 00:51:23,120 there's lots of paths for stimulus space. 1416 00:51:23,120 --> 00:51:25,754 It's not clear that these are elemental in any way. 1417 00:51:25,754 --> 00:51:27,920 There's lots of ways that you can show with modeling 1418 00:51:27,920 --> 00:51:31,430 that you can get easily lost in this space of navigating around 1419 00:51:31,430 --> 00:51:32,210 here. 1420 00:51:32,210 --> 00:51:34,001 This is just, again, a history of the work. 1421 00:51:34,001 --> 00:51:36,110 This is the kind of things that people were doing. 1422 00:51:36,110 --> 00:51:37,820 And then from that, they presented 1423 00:51:37,820 --> 00:51:40,070 what we think of as the ice cube model of IT, 1424 00:51:40,070 --> 00:51:43,015 that I think is actually still a very reasonable approximation. 1425 00:51:43,015 --> 00:51:44,390 They not only showed that neurons 1426 00:51:44,390 --> 00:51:49,070 tended to like certain relatively reduced stimulus 1427 00:51:49,070 --> 00:51:51,410 features, not full objects, but that they 1428 00:51:51,410 --> 00:51:52,570 are gathered together. 1429 00:51:52,570 --> 00:51:55,340 So these are millimeter scale regions of IT 1430 00:51:55,340 --> 00:51:58,070 that nearby neurons, within a millimeter or so, 1431 00:51:58,070 --> 00:52:00,590 have similar preferences. 1432 00:52:00,590 --> 00:52:02,224 They're not just scattered willy-nilly 1433 00:52:02,224 --> 00:52:03,140 throughout the tissue. 1434 00:52:03,140 --> 00:52:05,480 When you go record nearby neurons, they're similar. 1435 00:52:05,480 --> 00:52:08,910 So there's some mapping within IT cortex. 1436 00:52:08,910 --> 00:52:10,370 This is schematic here. 1437 00:52:10,370 --> 00:52:13,940 This is optical imaging data of IT cortex also 1438 00:52:13,940 --> 00:52:16,130 from Tanaka's group that show you 1439 00:52:16,130 --> 00:52:19,130 that these different blobs of tissue 1440 00:52:19,130 --> 00:52:21,120 get activated by different images shown here. 1441 00:52:21,120 --> 00:52:22,911 And I'm just showing you the scale of this, 1442 00:52:22,911 --> 00:52:25,399 it's around a little less than a millimeter. 
1443 00:52:25,399 --> 00:52:26,940 And our lab has evidence of this too. 1444 00:52:26,940 --> 00:52:30,300 So there's some sort of spatial organization in IT, 1445 00:52:30,300 --> 00:52:32,850 but we really don't really yet understand the features, 1446 00:52:32,850 --> 00:52:36,840 these elemental features yet, or at least, not at this time. 1447 00:52:36,840 --> 00:52:39,194 Then later, there's lots of beautiful work in IT. 1448 00:52:39,194 --> 00:52:41,110 Again, I'm probably not telling you all of it. 1449 00:52:41,110 --> 00:52:42,925 Some of the most exciting work recently-- 1450 00:52:42,925 --> 00:52:44,790 and you'll hear about this from Winrich, 1451 00:52:44,790 --> 00:52:46,740 that people started to use fMRIs. 1452 00:52:46,740 --> 00:52:49,530 So Doris Tsao and Winrich Freiwald and Marge Livingstone 1453 00:52:49,530 --> 00:52:52,880 all together started to use fMRI data to compare 1454 00:52:52,880 --> 00:52:54,602 faces versus objects. 1455 00:52:54,602 --> 00:52:56,060 This was motivated from human work, 1456 00:52:56,060 --> 00:52:59,361 by work like Nancy Kanwisher lab and others. 1457 00:52:59,361 --> 00:53:00,860 What they found was that in monkeys, 1458 00:53:00,860 --> 00:53:03,722 you could find different parts that would show up, 1459 00:53:03,722 --> 00:53:05,180 what are called face patches, where 1460 00:53:05,180 --> 00:53:07,940 you have a relative preference for faces over objects. 1461 00:53:07,940 --> 00:53:10,640 Again, I don't want to take all of Winrich's talk here, 1462 00:53:10,640 --> 00:53:12,674 but you have these different patches here. 1463 00:53:12,674 --> 00:53:14,840 And then what's really cool is, you go in and record 1464 00:53:14,840 --> 00:53:18,312 from these patches and then you find a very enriched locations 1465 00:53:18,312 --> 00:53:19,020 for face neurons. 1466 00:53:19,020 --> 00:53:21,119 And these enriched locations were known 1467 00:53:21,119 --> 00:53:22,410 from a number of other studies. 1468 00:53:22,410 --> 00:53:25,100 But this is a nice correlation between functional imaging 1469 00:53:25,100 --> 00:53:26,810 and this enrichment of these face cells. 1470 00:53:26,810 --> 00:53:30,200 And that's what's shown here, that these neurons respond 1471 00:53:30,200 --> 00:53:32,900 mostly to faces and not so much other objects. 1472 00:53:32,900 --> 00:53:35,150 Although, you see they still sort of respond to these. 1473 00:53:35,150 --> 00:53:37,715 So this kind of says fMRI and physiology are 1474 00:53:37,715 --> 00:53:38,840 telling you similar things. 1475 00:53:38,840 --> 00:53:41,600 It also tells you there's some spatial clumping, at least 1476 00:53:41,600 --> 00:53:44,160 for face-like objects, at a scale of a few millimeters 1477 00:53:44,160 --> 00:53:46,070 or so, the size of these patches. 1478 00:53:46,070 --> 00:53:49,250 OK, so that's larger scale organization. 1479 00:53:49,250 --> 00:53:52,670 This is data from our own lab that shows the same thing. 1480 00:53:52,670 --> 00:53:55,310 Maybe I'll just skip through this in the interest of time-- 1481 00:53:55,310 --> 00:53:58,100 that we can map and record the neurons very precisely, 1482 00:53:58,100 --> 00:54:02,090 map them spatially and compare that with fMRI. 1483 00:54:02,090 --> 00:54:05,840 So this is just a larger field of view maps of the same idea. 
1484 00:54:05,840 --> 00:54:08,870 So what we have then, just to wrap up 1485 00:54:08,870 --> 00:54:11,720 this whirlwind tour of the ventral stream, 1486 00:54:11,720 --> 00:54:15,430 is that we had some untangled explicit information. 1487 00:54:15,430 --> 00:54:18,100 And what I want to try to convince you of now, is that-- 1488 00:54:18,100 --> 00:54:19,370 I've told you about the ventral stream, 1489 00:54:19,370 --> 00:54:21,536 but I'm going to try to tell you that, in IT cortex, 1490 00:54:21,536 --> 00:54:24,130 this is a powerful representation for encoding 1491 00:54:24,130 --> 00:54:26,364 object information. 1492 00:54:26,364 --> 00:54:28,780 And then we'll take a break because we've already probably 1493 00:54:28,780 --> 00:54:30,167 been going a while. 1494 00:54:30,167 --> 00:54:32,500 Yeah, about 10 more minutes and then we'll take a break. 1495 00:54:32,500 --> 00:54:36,280 So what I've told you is, I've led you up the ventral stream, 1496 00:54:36,280 --> 00:54:37,880 I've given you a bit of the history, 1497 00:54:37,880 --> 00:54:39,720 so now let's talk about IT more precisely. 1498 00:54:39,720 --> 00:54:42,940 So now this is work from my own lab. 1499 00:54:42,940 --> 00:54:44,510 You go in and record IT. 1500 00:54:44,510 --> 00:54:45,940 You go record extracellularly. 1501 00:54:45,940 --> 00:54:48,940 You travel down into IT cortex, which is down here. 1502 00:54:48,940 --> 00:54:50,170 And you record from this. 1503 00:54:50,170 --> 00:54:52,900 And similar to what you saw, another version of what 1504 00:54:52,900 --> 00:54:55,900 you saw from Charlie Gross or Bob Desimone, 1505 00:54:55,900 --> 00:54:57,790 you show a bunch of images. 1506 00:54:57,790 --> 00:54:59,410 And they could be arbitrary images. 1507 00:54:59,410 --> 00:55:01,270 You take an IT recording site, and see these little dots, 1508 00:55:01,270 --> 00:55:03,700 those are action potential spikes out of a particular IT 1509 00:55:03,700 --> 00:55:04,730 site. 1510 00:55:04,730 --> 00:55:06,190 And these are repeatable. 1511 00:55:06,190 --> 00:55:08,580 You have some Poisson variability here. 1512 00:55:08,580 --> 00:55:10,330 But you see that there's more spikes here, 1513 00:55:10,330 --> 00:55:12,490 there's little more here, less here, less there. 1514 00:55:12,490 --> 00:55:13,990 These images are all randomly interleaved 1515 00:55:13,990 --> 00:55:16,323 when you collect the data, as I'll show you in a minute. 1516 00:55:16,323 --> 00:55:18,920 And you go to different sites and it likes different images. 1517 00:55:18,920 --> 00:55:20,425 So there is certainly some image selectivity. 1518 00:55:20,425 --> 00:55:22,840 This should not be surprising because I already showed 1519 00:55:22,840 --> 00:55:24,580 you this from previous work. 1520 00:55:24,580 --> 00:55:26,320 This is just data from our own lab. 1521 00:55:26,320 --> 00:55:28,361 You can also see now that you are looking closely 1522 00:55:28,361 --> 00:55:30,910 at the time lag, remember, I said around 100 milliseconds 1523 00:55:30,910 --> 00:55:31,945 stimulus on. 1524 00:55:31,945 --> 00:55:34,240 Stimulus off, the stimulus is actually off 1525 00:55:34,240 --> 00:55:36,220 before the spikes actually start to occur out 1526 00:55:36,220 --> 00:55:38,170 here in IT because, again, there's a long time 1527 00:55:38,170 --> 00:55:39,880 lag, 100 milliseconds. 1528 00:55:39,880 --> 00:55:42,755 OK, so that's what the neural responses look like. 
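A quick way to see what "repeatable responses with Poisson variability" means for spike counts (made-up rates, not recorded data): each image has a fixed underlying rate for a given site, and every repeat draws a count around that mean, with variance that tracks the mean.

```python
# Toy illustration of repeatable spike counts with Poisson trial-to-trial variability.
import numpy as np

rng = np.random.default_rng(0)
mean_rates = {"image A": 12.0, "image B": 6.0, "image C": 1.5}   # spikes per counting window

for name, rate in mean_rates.items():
    counts = rng.poisson(rate, size=10)       # 10 randomly interleaved repeats
    print(f"{name}: repeats = {counts}, mean = {counts.mean():.1f}, var = {counts.var():.1f}")
# For a Poisson process the variance tracks the mean, which is roughly what the
# trial-to-trial variability of counts at a single IT site looks like.
```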
1529 00:55:42,755 --> 00:55:44,380 I don't know if you guys can hear this,
1530 00:55:44,380 --> 00:55:45,880 maybe I should have hooked up audio.
1531 00:55:45,880 --> 00:55:47,350 Maybe you might be able to hear--
1532 00:55:47,350 --> 00:55:50,394 this is actually a recording that Chou Hung made
1533 00:55:50,394 --> 00:55:52,810 when he collected his data for the early studies
1534 00:55:52,810 --> 00:55:54,340 we did in my lab.
1535 00:55:54,340 --> 00:55:56,700 I don't know if you guys can hear.
1536 00:55:56,700 --> 00:55:58,140 [STATIC]
1537 00:55:58,140 --> 00:56:00,060 [BEEP]
1538 00:56:00,060 --> 00:56:03,900 [BEEP]
1539 00:56:03,900 --> 00:56:04,880 [BEEP]
1540 00:56:04,880 --> 00:56:06,960 Those high beeps are the animal getting reward
1541 00:56:06,960 --> 00:56:08,712 for fixating on that dot.
1542 00:56:08,712 --> 00:56:10,670 You're not even going to be able to parse that.
1543 00:56:10,670 --> 00:56:12,636 I mean, you hear the spikes clicking by, those--
1544 00:56:12,636 --> 00:56:13,136 [STATIC]
1545 00:56:13,136 --> 00:56:15,416 Those are action potentials.
1546 00:56:15,416 --> 00:56:17,790 And I don't expect you to look at anything like, oh, it's
1547 00:56:17,790 --> 00:56:18,780 a face neuron, or whatever.
1548 00:56:18,780 --> 00:56:20,988 I just want you to get a feel for how those data were
1549 00:56:20,988 --> 00:56:21,940 originally collected.
1550 00:56:21,940 --> 00:56:23,640 This is a pretty grainy video.
1551 00:56:23,640 --> 00:56:26,400 But you get the idea.
1552 00:56:26,400 --> 00:56:27,649 You collect data like that.
1553 00:56:27,649 --> 00:56:29,940 And again, you can find selectivity in those population
1554 00:56:29,940 --> 00:56:31,540 patterns, as I just showed you.
1555 00:56:31,540 --> 00:56:34,980 But then, Gabriel and Tommy and I, so the three of us,
1556 00:56:34,980 --> 00:56:36,630 I think all in this room, way back when
1557 00:56:36,630 --> 00:56:39,780 in 2005 said, well look, the population of IT
1558 00:56:39,780 --> 00:56:41,490 might have good, useful information
1559 00:56:41,490 --> 00:56:43,812 for solving this difficult object manifold
1560 00:56:43,812 --> 00:56:44,520 tangling problem.
1561 00:56:44,520 --> 00:56:46,810 It might be a good explicit representation.
1562 00:56:46,810 --> 00:56:50,460 So we did what I call an early test of this idea.
1563 00:56:50,460 --> 00:56:54,810 We took this simple image set from eight different categories
1564 00:56:54,810 --> 00:56:56,400 that we had chosen.
1565 00:56:56,400 --> 00:56:59,070 And there are good stories of why we chose those objects,
1566 00:56:59,070 --> 00:57:00,360 if you'd like to hear them.
1567 00:57:00,360 --> 00:57:02,490 But let me just say, simple objects: we moved them
1568 00:57:02,490 --> 00:57:06,510 across position and scale, and we collected the responses
1569 00:57:06,510 --> 00:57:09,690 of a bunch of IT sites
1570 00:57:09,690 --> 00:57:11,504 to all these different visual images.
1571 00:57:11,504 --> 00:57:13,170 And we showed them as I just showed you.
1572 00:57:13,170 --> 00:57:14,878 We just showed them for 100 milliseconds.
1573 00:57:14,878 --> 00:57:16,827 This is the core recognition regime--
1574 00:57:16,827 --> 00:57:18,660 we're just showing them for 100 milliseconds.
1575 00:57:18,660 --> 00:57:20,576 And then we show another one, and they're just
1576 00:57:20,576 --> 00:57:21,756 randomly interleaved.
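As a rough illustration of that kind of stimulus design, here is a minimal sketch, not the actual experimental code: objects from eight categories, each shown at a few positions and scales, presented for about 100 milliseconds apiece in randomly interleaved order. The category names, position and scale values, and repeat count are illustrative assumptions.

```python
# A minimal sketch of this kind of stimulus design (not the actual experiment
# code): objects from eight categories, each rendered at several positions and
# scales, presented ~100 ms apiece in randomly interleaved order. The category
# names, position/scale values, and repeat count are illustrative assumptions.
import itertools
import random

categories = ["face", "toy", "food", "animal", "vehicle", "box", "plant", "tool"]
positions_deg = [-2.0, 0.0, 2.0]   # horizontal offsets from center, in degrees
sizes_deg = [1.5, 3.0, 6.0]        # object sizes, in degrees

# Every combination of category x position x size is one image condition.
conditions = list(itertools.product(categories, positions_deg, sizes_deg))

# Repeat each condition several times and randomly interleave the trials.
n_repeats = 10
trials = conditions * n_repeats
random.seed(0)
random.shuffle(trials)

presentation_ms = 100
print(f"{len(trials)} trials, each presented for ~{presentation_ms} ms")
print("first few interleaved trials:", trials[:3])
```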
1577 00:57:21,756 --> 00:57:23,130 And from this, what you do is you
1578 00:57:23,130 --> 00:57:25,110 get a population data set
1579 00:57:25,110 --> 00:57:27,900 where we recorded 350 IT sites.
1580 00:57:27,900 --> 00:57:29,820 Here's a sample of 63 sites.
1581 00:57:29,820 --> 00:57:32,270 These are 78 of the images we showed,
1582 00:57:32,270 --> 00:57:34,140 and each entry here is the mean neural response
1583 00:57:34,140 --> 00:57:35,599 to an image.
1584 00:57:35,599 --> 00:57:38,139 There's nothing for you to read into here, other than
1585 00:57:38,139 --> 00:57:40,020 that you have this rich population data.
1586 00:57:40,020 --> 00:57:43,665 And now our question is, well, what lives in this population
1587 00:57:43,665 --> 00:57:44,940 data that we've collected?
1588 00:57:44,940 --> 00:57:47,130 Is it explicit with regard to categories?
1589 00:57:47,130 --> 00:57:48,630 So we come back to what I showed you
1590 00:57:48,630 --> 00:57:51,390 earlier about those tangled manifolds and said,
1591 00:57:51,390 --> 00:57:53,460 we need simple decoding tools.
1592 00:57:53,460 --> 00:57:56,280 Can a simple decoding tool look at that population
1593 00:57:56,280 --> 00:57:57,930 and tell me what's out there?
1594 00:57:57,930 --> 00:58:00,474 And again, we were using linear classifiers at the time,
1595 00:58:00,474 --> 00:58:01,890 because we took that, as you heard
1596 00:58:01,890 --> 00:58:04,350 from Haim, as our operational definition of what
1597 00:58:04,350 --> 00:58:05,310 a simple tool is.
1598 00:58:05,310 --> 00:58:07,821 And if it could decode information about the object
1599 00:58:07,821 --> 00:58:09,570 identity, then we'd say, well, that means,
1600 00:58:09,570 --> 00:58:11,550 by that operational definition, this
1601 00:58:11,550 --> 00:58:14,790 is explicit, available, accessible information, or just
1602 00:58:14,790 --> 00:58:15,960 generally good.
1603 00:58:15,960 --> 00:58:19,650 So imagine that the activity-- this is schematic.
1604 00:58:19,650 --> 00:58:21,450 This is neuron one, this is neuron two,
1605 00:58:21,450 --> 00:58:23,210 and you could have a bunch of IT neurons.
1606 00:58:23,210 --> 00:58:24,926 These points represent
1607 00:58:24,926 --> 00:58:26,550 the population response
1608 00:58:26,550 --> 00:58:28,650 to each image of an object.
1609 00:58:28,650 --> 00:58:30,210 Remember, there are many images of each object.
1610 00:58:30,210 --> 00:58:32,190 If you can linearly separate any object
1611 00:58:32,190 --> 00:58:33,856 from all the other objects,
1612 00:58:33,856 --> 00:58:35,252 that would mean it was explicit.
1613 00:58:35,252 --> 00:58:36,960 And if you had a hard time separating it,
1614 00:58:36,960 --> 00:58:38,610 this would be implicit.
1615 00:58:38,610 --> 00:58:40,270 These are like tangled object manifolds.
1616 00:58:40,270 --> 00:58:42,900 This is inaccessible, or bad, information.
1617 00:58:42,900 --> 00:58:44,781 So we just-- and when I say we,
1618 00:58:44,781 --> 00:58:46,280 I mean Chou Hung, who led the study,
1619 00:58:46,280 --> 00:58:47,940 plus Gabriel, Tommy, and me.
1620 00:58:47,940 --> 00:58:51,300 We took the response to an image, like this one.
1621 00:58:51,300 --> 00:58:53,337 It produced a population vector.
1622 00:58:53,337 --> 00:58:54,920 Again, we recorded a bunch of neurons.
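To make that operational definition of "explicit" concrete, here is a minimal sketch of the decoding logic, not the published analysis: each image's population response becomes a vector of spike counts, and a simple linear classifier, tested on held-out images, tries to separate one category from the rest. The site count, image count, and simulated responses below are placeholder assumptions.

```python
# A minimal sketch of the decoding logic (not the published analysis): treat
# each image's population response as a vector of spike counts and ask whether
# a simple, cross-validated linear classifier can separate one category from
# the rest. The site count, image count, and responses are simulated placeholders.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

n_sites = 256       # recorded IT sites (the real study used a few hundred)
n_images = 640      # images, pre-assigned to 8 categories
categories = rng.integers(0, 8, size=n_images)

# Simulated population responses: trial noise plus a weak category-dependent
# pattern across sites, standing in for real IT spike counts.
category_patterns = rng.normal(0.0, 1.0, size=(8, n_sites))
X = category_patterns[categories] + rng.normal(0.0, 2.0, size=(n_images, n_sites))

# "Explicit" by the operational definition: a linear classifier, tested on
# held-out images, can read the category out. Balanced accuracy is reported
# because the one-vs-rest split (e.g. faces vs. non-faces) is unbalanced.
y = (categories == 0).astype(int)
clf = LinearSVC(C=1.0, max_iter=10000)
scores = cross_val_score(clf, X, y, cv=5, scoring="balanced_accuracy")
print("cross-validated balanced accuracy:", scores.mean().round(3))
```

If a readout like this did no better than chance, the same operational definition would call that information implicit, or tangled.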
1623 00:58:54,920 --> 00:58:57,170 We recorded them sequentially and then pieced together
1624 00:58:57,170 --> 00:58:59,430 this population vector.
1625 00:58:59,430 --> 00:59:01,940 So these are spikes simulated from
1626 00:59:01,940 --> 00:59:03,630 a population of IT sites.
1627 00:59:03,630 --> 00:59:05,069 We could do various things.
1628 00:59:05,069 --> 00:59:07,110 In fact, I think Gabriel did everything possible,
1629 00:59:07,110 --> 00:59:08,345 as I remember at the time.
1630 00:59:08,345 --> 00:59:10,470 And one of the things we did was just count spikes.
1631 00:59:10,470 --> 00:59:12,060 One of the simple things that turns out to work quite well
1632 00:59:12,060 --> 00:59:14,470 is to count the spikes over 100 milliseconds.
1633 00:59:14,470 --> 00:59:15,780 So for this neuron, you count spikes
1634 00:59:15,780 --> 00:59:17,580 and that gives you one number; for that neuron,
1635 00:59:17,580 --> 00:59:19,590 you count spikes and get another number.
1636 00:59:19,590 --> 00:59:22,380 So if you have n neurons, you get n numbers.
1637 00:59:22,380 --> 00:59:25,770 So it's a point in an n-dimensional state space, where
1638 00:59:25,770 --> 00:59:27,310 n is the number of neurons.
1639 00:59:27,310 --> 00:59:29,370 And then we had already pre-divided the images
1640 00:59:29,370 --> 00:59:32,280 into different categories, as shown here.
1641 00:59:32,280 --> 00:59:33,460 These are the categories.
1642 00:59:33,460 --> 00:59:36,270 And again, we just asked how well
1643 00:59:36,270 --> 00:59:37,950 you could do faces versus non-faces,
1644 00:59:37,950 --> 00:59:41,186 toys versus non-toys, and so on and so forth.
1645 00:59:41,186 --> 00:59:42,060 These are old slides.
1646 00:59:42,060 --> 00:59:43,500 But you get the idea: basically, you
1647 00:59:43,500 --> 00:59:45,420 don't need that many sites to get
1648 00:59:45,420 --> 00:59:47,280 to very high levels of performance
1649 00:59:47,280 --> 00:59:49,980 on both categorization and identification.
1650 00:59:49,980 --> 00:59:51,510 The interesting thing about this was
1651 00:59:51,510 --> 00:59:54,389 that you could solve simple forms
1652 00:59:54,389 --> 00:59:56,430 of this invariance problem in this representation
1653 00:59:56,430 --> 00:59:57,450 quite easily.
1654 00:59:57,450 --> 01:00:00,900 If you just trained on the central objects--
1655 01:00:00,900 --> 01:00:03,670 the simple three-degree size, center position--
1656 01:00:03,670 --> 01:00:06,180 and tested on the same thing, just
1657 01:00:06,180 --> 01:00:08,730 held-out repeats of this data, you did quite well.
1658 01:00:08,730 --> 01:00:09,930 That's a baseline.
1659 01:00:09,930 --> 01:00:12,810 But what's interesting is you test at different positions
1660 01:00:12,810 --> 01:00:13,740 and scales.
1661 01:00:13,740 --> 01:00:15,780 And then you also do nearly as well.
1662 01:00:15,780 --> 01:00:19,050 So you naturally generalize to these other conditions
1663 01:00:19,050 --> 01:00:21,450 by training on these simple conditions.
1664 01:00:21,450 --> 01:00:23,760 So this is evidence that the population
1665 01:00:23,760 --> 01:00:26,850 is a good basis set for solving these kinds of problems.
1666 01:00:26,850 --> 01:00:29,520 A small number of training examples on this population
1667 01:00:29,520 --> 01:00:31,523 then generalizes well across the conditions
1668 01:00:31,523 --> 01:00:33,690 that make the problem hard.
1669 01:00:33,690 --> 01:00:35,620 So again, we published that a long time ago.
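Here is a minimal sketch of that generalization test, not the published analysis: count spikes over a 100 millisecond window to get one number per site, train a linear classifier only on the central, standard-size condition, and then test it on shifted and rescaled versions of the same objects. The simulated responses, and how much position and scale tolerance is built into them, are illustrative assumptions.

```python
# A minimal sketch of the generalization test (not the published analysis):
# count spikes over a 100 ms window to get one number per site, train a linear
# classifier only on the central, standard-size condition, then test it on
# shifted and rescaled versions of the same objects. The simulated responses,
# and how tolerant they are across conditions, are illustrative assumptions.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)

n_sites = 200
n_objects = 8
n_trials = 20
conditions = ["center_3deg", "shifted", "larger", "smaller"]

# Assume each object evokes a characteristic pattern across IT sites that is
# largely preserved (tolerant) across position and scale changes.
object_patterns = rng.normal(0.0, 1.0, size=(n_objects, n_sites))

def population_responses(obj, condition):
    # A small condition-dependent perturbation models imperfect tolerance.
    perturbation = 0.0 if condition == "center_3deg" else 0.3
    base = object_patterns[obj] + rng.normal(0.0, perturbation, size=n_sites)
    return base + rng.normal(0.0, 1.0, size=(n_trials, n_sites))  # trial noise

def build_set(condition):
    X = np.vstack([population_responses(o, condition) for o in range(n_objects)])
    y = np.repeat(np.arange(n_objects), n_trials)
    return X, y

# Train only on the central, three-degree condition...
X_train, y_train = build_set("center_3deg")
clf = LinearSVC(max_iter=10000).fit(X_train, y_train)

# ...and test on new trials of that condition and on the transformed conditions.
for cond in conditions:
    X_test, y_test = build_set(cond)
    print(cond, "accuracy:", round(clf.score(X_test, y_test), 3))
```

The design choice mirrors the point of the result: if the population patterns are largely preserved across position and scale, a readout trained on one condition transfers to the others with little loss.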
1670 01:00:35,620 --> 01:00:37,120 This was an early step to say, look,
1671 01:00:37,120 --> 01:00:39,780 the phenomenology looks right for the story that I've
1672 01:00:39,780 --> 01:00:41,820 been telling you so far.
1673 01:00:41,820 --> 01:00:45,450 You can't do this easily in earlier visual areas like V1,
1674 01:00:45,450 --> 01:00:47,270 or simulated V1, or V4.
1675 01:00:47,270 --> 01:00:49,920 And we later showed that in a number of ways.
1676 01:00:49,920 --> 01:00:52,740 This is consistent with the work I was showing you
1677 01:00:52,740 --> 01:00:54,780 from Logothetis-- position tolerance, size
1678 01:00:54,780 --> 01:00:56,620 tolerance, the selectivity.
1679 01:00:56,620 --> 01:00:59,780 It's really just an explicit test of the idea of population
1680 01:00:59,780 --> 01:01:00,780 encoding.
1681 01:01:00,780 --> 01:01:03,750 So the take-home here is that there's this explicit object
1682 01:01:03,750 --> 01:01:05,354 representation in IT.
1683 01:01:05,354 --> 01:01:06,770 I haven't yet proven to you the link
1684 01:01:06,770 --> 01:01:08,970 from this to predictive models of decoding.
1685 01:01:08,970 --> 01:01:09,920 We're going to talk about that next.
1686 01:01:09,920 --> 01:01:11,794 But this was some of the important population
1687 01:01:11,794 --> 01:01:14,120 phenomenology that we did.
1688 01:01:14,120 --> 01:01:15,500 What I've tried to tell you today--
1689 01:01:15,500 --> 01:01:17,750 hopefully I've introduced you to the problem of visual object
1690 01:01:17,750 --> 01:01:19,416 recognition and the way we restricted it
1691 01:01:19,416 --> 01:01:20,924 to core object recognition.
1692 01:01:20,924 --> 01:01:23,340 We talked a lot about predictive models as being the goal,
1693 01:01:23,340 --> 01:01:24,950 although I haven't presented much to you yet.
1694 01:01:24,950 --> 01:01:26,997 Hopefully, that's the second part of the talk.
1695 01:01:26,997 --> 01:01:28,830 I've given you a tour of the ventral stream.
1696 01:01:28,830 --> 01:01:30,440 But it was a poor tour.
1697 01:01:30,440 --> 01:01:31,940 I'm sure everybody I work with would
1698 01:01:31,940 --> 01:01:34,231 say that I've neglected all this work, because there's
1699 01:01:34,231 --> 01:01:38,300 no way I can do it all in even a whole week.
1700 01:01:38,300 --> 01:01:40,820 I just tried to hit some of the highlights for you.
1701 01:01:40,820 --> 01:01:42,650 And I told you that the IT population
1702 01:01:42,650 --> 01:01:44,900 seems to have solved a key problem,
1703 01:01:44,900 --> 01:01:47,390 this sort of invariance problem that I set up.
1704 01:01:47,390 --> 01:01:50,240 And one way to step back is to say that, over the last 40 years
1705 01:01:50,240 --> 01:01:53,150 or so, from those early studies of Charlie Gross
1706 01:01:53,150 --> 01:01:56,270 or even Hubel and Wiesel, we, the field of ventral stream
1707 01:01:56,270 --> 01:01:58,920 physiology, have largely described important
1708 01:01:58,920 --> 01:01:59,540 phenomenology.
1709 01:01:59,540 --> 01:02:02,780 Even that last study is population phenomenology.
1710 01:02:02,780 --> 01:02:06,237 And so now we need these more advanced models.
1711 01:02:06,237 --> 01:02:08,570 So the next phase of the field is developing and testing
1712 01:02:08,570 --> 01:02:10,016 these predictive models that I've
1713 01:02:10,016 --> 01:02:11,390 motivated at the beginning, but I
1714 01:02:11,390 --> 01:02:13,230 haven't given you much of yet.
1715 01:02:13,230 --> 01:02:16,430 So this was hopefully a bit of history and set some context
1716 01:02:16,430 --> 01:02:18,370 for where we are.