The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation, or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at OCW.MIT.edu.

JAMES DICARLO: I'm going to shift more towards this decoding space that we talked about, the linkage between neural activity and behavioral report. And I introduced that a bit. You just saw that there's some powerful population activity in IT, and I'm going to expand on that a bit here. But stepping back, when you think again about what I call an end-to-end understanding, going from the image all the way to neural activity to the perceptual report, one of the things we want to do is define a decoding mechanism that the brain uses to support these perceptual reports. Basically, what neural activity is directly responsible for these tasks? And I'll come back later to this encoding side. Notice I'm putting these in this order, right? Once you know what the relevant aspects of neural activity are in IT, or wherever you think they are, that sets a target for what image-to-neural transformation you're trying to explain. Not predict any neural response, but those particular aspects of the neural response. So that's what I mean by the relevant ventral stream patterns of activity. So we start here, we work to here, and then we work to here, rather than the other way around.

OK, so I'm going to keep with the domain I set up. I talked about core recognition. I now need to start to define tasks. I'm going to talk about specific tasks that are, for now, let's call them basic-level nouns. I'm actually going to relax that to subordinate tasks in a minute. But here they are: car, clock, cat. These are not the actual nouns. I'll show you the ones we use.
But just to fix ideas, we're imagining a space of all possible nouns that you might use to describe what you just saw. And I'm going to have a generative image domain. So I now have a space of images here. I'm not just going to draw these off the web. We're going to generate our own image domain that we think engages the problem, but gives us control of the latent variables. So I'll show you that now. The way we're going to do this is by generating one foreground object in each image that we're going to show. And we did this by taking 3-D models like these. This is a model of a car, and we can control its other latent variables beyond its identity. So this is a car; it has a particular car type. So there are a couple of latent variables about identity here that relate to the geometry. Then there are other latent variables like position, size, and pose that I mentioned, which are unknowns that make the problem challenging. And we can then just render this thing, and we could place it on any old background we wanted to. What we did was tend to place them on uncorrelated naturalistic backgrounds. And that creates these sort of weirdish looking images. Some of them may look sort of natural; this one looks pretty unnatural. But why would you do this? We did this because we had a generative space, so we know what's going on with the latent variables we care about. And also, when we built this, it was challenging for computer vision systems to deal with, even though humans could do it naturally. They don't have the advantage of any contextual cues here, because by construction these are uncorrelated. We just took natural images and would randomly put objects on them. But this was enough to fool a lot of the computer vision systems at the time, which tended to rely on contextual cues, like blue in the background signaling that it's an airplane. We didn't want those kinds of things being done.
We wanted the actual extraction of object identity. And again, humans could do it quite well. So that's why we ended up in this sort of no man's land of image space, which is not very simple, but not ImageNet just pulled off the web. And that's how we got there. And just to give you a sense that this is actually quite doable for humans, I'll show you a few images. I won't even cue you on what they are. I'm going to show them for 100 milliseconds. You can kind of shout out what object you see.

AUDIENCE: Car.

AUDIENCE: [INAUDIBLE]

JAMES DICARLO: Right. So see, it's pretty straightforward, right? Even though those images look weird, you can do that quite well. And here are the kinds of images that we would generate. When we think of image bags, we think of partitions of image space. These are some images that would correspond to faces. These are all images of faces under some transformations, again on different backgrounds. These are not faces; these are other objects, again under transformations. And we can have as many of these as we want. We call this distinction, when shown for 100 milliseconds, one core recognition test: discriminate face from not-face. Here is a subordinate task: beetle from not-beetle. This is a particular type of car, and you can see it's more challenging. Again, we don't show these images like this. This is just to show you the set; we show them one at a time. And so let me now go ahead and say, we're going to try to make a predictive model using that kind of image space, to see if we can understand what are the relevant aspects of neural activity that can predict human report on an image space. And when I say we, I mean Najib Majaj and Ha Hong, the post-doc and graduate student in the lab who led this experimental work. And Ethan Solomon and Dan Yamins also contributed to the work.
So what we did was record a bunch of IT activity, to measure what's going on in the population as I showed you earlier, but now in this more defined space, where we're going to collect a bunch of human behavior to compare possible ways of reading IT with the behavior of the human. This is how we started. We're now doing monkeys where we're recording while the monkey is doing a task. But what we did here was passively fixating monkeys, compared with behaving humans. And as I showed you earlier, monkeys and humans have very similar patterns of behavior. So we record from IT. In this case, we were using array recording electrodes. These are chronically implanted; this shows them here. You implant them during a surgery, as is kind of shown here, down in the IT cortex. You can get their size here. There are actually 96 electrodes on each of them. They typically yield about half of the electrodes having active neurons on them, so you get on the order of 150 recording sites. And you can lay them out. We would typically lay out three of them across IT and V4 to record a population sample out of IT. And we would do this across multiple monkeys. And here's an example of the kind of data we would get. This is 168 IT recording sites. This is similar to what I showed you earlier: the mean response in a particular time window out of IT, similar to that study with Gabriel. And what I'm showing here is just to give you a feel. That's one image. Here are seven more images, and these are just the population vectors in graphic form. But we actually collected 2,560 images. This is the mean response data of these 168 neurons. And now you have, again, this rich population data.
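Just to fix ideas about the data format, here is a little sketch, not our actual analysis code, of the kind of "images by recording sites" matrix being described; the spike counts below are made up, and only the image, site, and window numbers come from the numbers quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake stand-in for the recordings: spike counts per repetition, image, and site.
n_images, n_sites, n_reps = 2560, 168, 10
spike_counts = rng.poisson(lam=5.0, size=(n_reps, n_images, n_sites))

# Trial-averaged firing rate per image and site, counted in a fixed ~100 ms window.
# Each row is the "population vector" for one image.
window_sec = 0.100
rates = spike_counts.mean(axis=0) / window_sec   # shape: (n_images, n_sites)

print(rates.shape)    # (2560, 168)
print(rates[0, :5])   # first image, first 5 sites
```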
And you can ask, what's available in there to support these tasks? And how well does it predict human patterns of performance on those tasks? In this study, that's all we were asking. We're trying to do more recently. But let me show you what we observed: even though you saw that you could do car, you could do faces, and it seemed like you were doing 100%, it turns out you're better at some things than others. So this is a d-prime map of humans. Red means good performance, high d-prime. A d-prime of 3, and the psychophysicists in the room may correct me, is on the order of 90 to 95% correct, in that range. So these are very high performance levels when you get up to 5. Zero is chance. And this isn't 50%; this is an eight-way task, so chance is one in eight correct. The subjects were doing either eight-way basic-level tasks, or eight-way subordinate cars, or eight-way faces. And these are the d-prime levels under different amounts of variation of those other latent variables: position, size, and pose. Don't worry about those details. What I want you to see is the color here. So look at tables: discriminating tables from all these other objects, you do that at a very high d-prime. Discriminating beetles from other cars, you do at a slightly lower d-prime. You can see this especially at high variation, where you're actually starting to get down to lower performance. And faces, one face versus another face, you're actually quite poor at. You're a little bit better than chance. But it's actually quite challenging, in 100 milliseconds, without hair and glasses, to discriminate those 3-D kinds of face models. I showed you Sam and Joe earlier as examples. It's actually quite challenging for humans to do that in that domain of faces.
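For reference, the d-prime being plotted is the standard signal-detection quantity, d' = Z(hit rate) minus Z(false-alarm rate). A tiny sketch, with made-up rates, just to anchor the "d-prime of 3 is roughly 90 to 95% correct" intuition:

```python
import numpy as np
from scipy.stats import norm

def dprime(hit_rate: float, false_alarm_rate: float, clip: float = 1e-3) -> float:
    """Signal-detection d' = Z(hit rate) - Z(false-alarm rate).

    Rates are clipped away from 0 and 1 so the z-transform stays finite.
    """
    hr = np.clip(hit_rate, clip, 1 - clip)
    fa = np.clip(false_alarm_rate, clip, 1 - clip)
    return norm.ppf(hr) - norm.ppf(fa)

# Roughly 95% hits with 5% false alarms lands near d' = 3.3.
print(dprime(0.95, 0.05))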
So what I want to show you here is that you have this pattern of behavioral performance, and you have all this IT activity. This is humans; this is monkeys. And what we wanted to do is say, look, we can use this pattern. It's very repeatable across humans. Can we use this repeatable behavioral pattern to understand what aspects of this activity could map to it? And again, this pattern is reliable; I just said that. And it's not as if you can predict this pattern by just running classifiers on pixels or V1. In fact, I'll show you that in a minute. But we thought there were some aspects of IT activity that would predict this, and we wanted to try to find those aspects. So again, this was motivated by that study I showed you earlier. Which part of the IT population activity could predict this behavior over all recognition tasks? We're seeking a general decoding model that would work. Here are some specific tasks, but we'd like it to work over any task that we could imagine testing humans on within this domain of taking 3-D models and putting them under variation; to work over that entire domain. That was what we were hoping to do. So again, I'll briefly take you through this, because I already showed you this earlier. We've previously shown that you could take this kind of state space and say, hey, can you separate images of faces from non-faces using these simple linear classifiers, which are essentially weighted sums on the IT activity? And now we wanted to ask, could this predict human behavioral face performance, and monkey, because again, they're very similar? And not only would this class of decoding models that was motivated by the earlier work predict this task, but would it predict car detection? Would the same model predict car one versus car two? That's a subordinate task. And all such tasks.
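Just to make the "learned weighted sum" readout concrete, here's a little sketch, not our actual analysis code: a linear classifier trained on a hypothetical matrix of trial-averaged IT responses with synthetic face labels, then scored on held-out images. Only the logic is the point; the data and the face signal are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical stand-ins: an images x sites matrix of trial-averaged IT responses,
# and a label saying which images contain a face.
n_images, n_sites = 2560, 168
rates = rng.normal(size=(n_images, n_sites))
is_face = rng.integers(0, 2, size=n_images)
rates[is_face == 1, :20] += 0.5   # give a subset of fake sites some face signal

# A simple linear readout: a learned weighted sum over IT sites plus a threshold,
# trained on some images and tested on held-out ones.
decoder = LogisticRegression(max_iter=1000)
acc = cross_val_score(decoder, rates, is_face, cv=5).mean()
print(f"held-out face vs. non-face accuracy: {acc:.2f}")
```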
Again, over the whole domain, can you take the same decoding strategy, take the data, and say: I'm going to learn on a certain number of training examples, build a classifier, and then say that's my model of how the human does every one of these tasks? And if that's true, then it should perfectly predict that pattern of performance that I just showed you earlier. And so here, again, was the working hypothesis: passively evoked spike rates, using a single fixed time scale, that are spatially distributed because they're sampled over IT, over a single fixed amount of non-human primate cortex. So a single number of neurons, and learned from a reasonable number of training examples. All of that is a decoding class of models that we thought might work. And if this is correct, which is what I just said, it should predict the behavioral data that we collect. For example, the d-prime data I just showed you, but also, in principle, more fine-grained behavioral data. So I want to step back to make it clear that it's not obvious that this should work, right? In the audience I get people on completely different sides of whether this should work or not. So one objection is, well look, it's passively evoked. You heard Gabriel say he didn't like passive tasks, and I agree with that. In the ideal world, the animal would be actively doing the task, and then you'd measure while the animal is doing the task. That's going to be your best chance of prediction. But we also saw earlier what the passively fixating monkey still gives you. Nobody would argue that passively evoked retinal data is not going to be somewhat applicable to vision. And the question is, how much of those arousal effects show up in a place like IT cortex, which is high up in the ventral stream? So you could argue both sides of this.
But it's possible that attentional or arousal mechanisms are needed to make this a good predictive linkage, beyond activating IT in this sort of crude way, if you like. Some people have pointed out that you need the trial-by-trial coordinated spike timing structure to actually make good predictions, that those are critical. Some people have pointed out that you have to assign different parts of IT to particular roles, which is a prior on the decoding space. For instance, you could believe that biologically, when an animal is born, there's some tissue that's going to be dedicated to faces, and you have to wire the downstream neurons to that tissue. That means you're going to restrict the decoding space, rather than just letting downstream neurons learn from IT as if they collected samples off of all of IT. I think some people implicitly believe that, even if it's not stated quite that way. Maybe IT does not directly underlie recognition. You could imagine that; it's not known for sure, and some lesions of IT don't produce deficits in recognition. That's a possibility. Maybe you need too many training examples. Maybe monkey neural codes cannot explain human behavior; but again, I already showed you monkeys and humans are very similar. So these are reasons you might say this is negative and might not work. And you've probably already guessed that I'm listing all these negatives because it turns out this simple thing works quite well for the grain of behavior that I've shown you so far. And here's my evidence of that. This is the actual behavioral performance of humans that I showed you earlier; this is mean d-prime. This is the predicted behavioral performance from taking a classifier reading from that IT population data that I've shown you, which gives a predicted d-prime. We first chose a decoder. We had to match things like the number of neurons.
We had to get it in the ballpark, because again, there's a free variable, as I showed you earlier. There's at least one. But for now, think of matching the number of neurons to get you near the diagonal, so that you have a sufficient number of neural recordings to say how well you do on a face detection task. And then here are all the other tasks. These are those 64 points that I showed you earlier. Here are some examples, like fruit versus other things, car versus other things. And you should see that all these points kind of line up along this diagonal, which says, wow, this is actually quite predictive: I can take this simple thing and predict all the stuff that we've collected so far. So let me now be more concrete about the inferred neural mechanism that we're testing here. I'll show you in a minute. For each new object, we think what happens is that some downstream observer, a downstream neuron, randomly samples roughly 50,000 single neurons, spatially distributed over all of IT, not biased to any compartments. It listens to each IT site. When I say listen, in this case we think it could average over 100 milliseconds. We're not sure about this; it's just the version that's shown here. It learns an appropriate weighted sum of that IT spiking. And then it listens to about 10%. That is, once you learn, about 10% of the IT neurons are heavily weighted for each of the tasks. That's just an observation that we have in our data. But this is trying to map these decoder versions out of IT into neuroscientist language. So that is a model that says: learn weighted sums of 50,000 random, 100-millisecond-averaged single-unit responses distributed over all of IT. A bunch of stuff in there is what your model is encapsulating. That's still too long, so I made a little acronym out of it, and I call that the LaWS of RAD IT decoding mechanism.
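Just to collect the ingredients of that hypothesis in one place, here is a rough parameter summary; the numbers are the ones quoted above, and the field names are just made up for the sketch.

```python
from dataclasses import dataclass

@dataclass
class LawsOfRadITReadout:
    """Rough summary of the hypothesized decoding mechanism (numbers from the talk)."""
    n_sampled_neurons: int = 50_000                 # randomly sampled single units
    spatial_sampling: str = "random over all of IT, no compartment bias"
    averaging_window_ms: int = 100                  # spikes averaged over ~100 ms
    readout: str = "learned weighted sum (linear classifier)"
    heavily_weighted_fraction: float = 0.10         # ~10% of units dominate each task

print(LawsOfRadITReadout())
```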
So this is just to say there's a hypothesis of how everything might work, but it can now make predictions for other objects and could potentially be falsified. So far, this model works quite well over these tasks. In fact, the correlation is 0.92. You might look at this and say, oh, it's not perfect. But it turns out that that's about the level at which humans differ from each other. So it's passing a Turing test, in that this mechanism, read off of the monkey IT, hides in the distribution of the human population that we're asking to perform these same tasks. It can't be distinguished from a human on these tasks. Did you guys watch "Ex Machina"? It doesn't pass that test; it passes just a simple core recognition test. But that was a Turing test of this. So OK, this is quantified here. This is human-to-human consistency. That's the range I just mentioned; you've got to get into here to pass our Turing test. And that's the decoding mechanism I just showed you. There are other ways of reading out of IT that don't pass. There are ways of reading out of V4, which we also recorded from; none of the ones we've tried are able to get you up to here. That doesn't mean V4 isn't involved. V4 is the feeder to IT. It just means you can't take simple decodes off of V4 and naturally produce this pattern. And that's similar for pixel or V1 representations. So lower-level representations don't naturally predict this pattern of behavior. And even some computer vision codes that we tested at the time, as you can see, those of you who know these older computer vision models can see they didn't do this. But more recent computer vision models actually do, and I'll show you that at the end. OK. This is a little bit for the aficionados, to tell you how we got there. As we increase the number of units in IT, that drives performance up. So as you read more and more units out of IT, you get better and better performance.
That's also true out of V4. But what I'm trying to show you here is that it's not the absolute performance that is the right thing to compare between a model and actual behavioral data. It's the pattern of performance, which we call the consistency with the humans. That's the correlation along the diagonal that I showed you earlier: tasks that are hard for the models are also hard for the humans, and tasks that are easy for humans are also easy for the models. And you could imagine doing that not just at the task level, but at the image level as well. Anyway, that's what's quantified here. And you see what happens when you get up to around 100 sites or so; I showed you 168 recordings out of IT. This point right there is about 500 IT features. And taking you through some things that maybe I won't have time for, that's actually how we approximate that 50,000-single-IT-neuron number. That's an inference from our data; we didn't actually record 50,000 single neurons. But from these kinds of plots, we're able to make a pretty good guess that this kind of model right here would land right there: consistent with humans, and at the absolute level of performance that humans matched. And the models we tried out of V4, this is one example of them, can get performance. But they don't match this pattern of performance naturally. They over-perform on some tasks and under-perform on others. They sort of reveal themselves as not being human-like by being too good at some things, right? So that's a way to fail the Turing test. OK. Maybe I'll skip through this; it's sort of the same thing. This is about training examples. If you guys care about this, I could take you through it. There's actually a family of solutions in there, and I'm just telling you about one of them for simplicity. So let me take it down to another grain.
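Here's a little sketch of that consistency comparison, not our actual code, using entirely made-up per-task d-prime patterns. The only point is the logic: correlate the model's task-by-task pattern with the human pattern as the number of units read out grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake human per-task d' pattern over 64 tasks.
n_tasks = 64
human_dprime = rng.uniform(0.5, 4.5, size=n_tasks)

def decoder_dprime(n_units: int) -> np.ndarray:
    # Fake stand-in: more units raises performance and tightens the match to humans.
    noise = rng.normal(scale=3.0 / np.sqrt(n_units), size=n_tasks)
    return np.clip(human_dprime * n_units / (n_units + 150) + noise, 0.0, None)

for n_units in (16, 64, 168, 500):
    pred = decoder_dprime(n_units)
    consistency = np.corrcoef(human_dprime, pred)[0, 1]
    print(f"{n_units:4d} units: mean d' = {pred.mean():.2f}, consistency r = {consistency:.2f}")
```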
511 00:18:17,166 --> 00:18:18,790 So that was the pattern of performance, 512 00:18:18,790 --> 00:18:20,164 it's actually naturally predicted 513 00:18:20,164 --> 00:18:22,990 by this first decoding mechanism that we tried. 514 00:18:22,990 --> 00:18:24,830 But what about the confusion pattern? 515 00:18:24,830 --> 00:18:27,370 So not just the absolute D primes for each of these tasks, 516 00:18:27,370 --> 00:18:30,640 but there's finer grained data, like how often an animal is 517 00:18:30,640 --> 00:18:33,160 confused with a fruit, or an animal's confused with a face. 518 00:18:33,160 --> 00:18:34,960 These are the confusion pattern data here. 519 00:18:34,960 --> 00:18:36,670 I'm sorry I don't have the color bars up. 520 00:18:36,670 --> 00:18:38,753 All I'm going to need you to do is say, well these 521 00:18:38,753 --> 00:18:41,380 are the confusion patterns that we predicted. 522 00:18:41,380 --> 00:18:44,710 And this is what is the predicted confusion pattern, 523 00:18:44,710 --> 00:18:49,630 if I gave the machine, the IT, these ground truth labels. 524 00:18:49,630 --> 00:18:50,810 And it predicts this. 525 00:18:50,810 --> 00:18:52,370 This is what actually happened in human data. 526 00:18:52,370 --> 00:18:53,920 And what I want to sort of look at this and this, and say, 527 00:18:53,920 --> 00:18:55,900 there actually look quite similar. 528 00:18:55,900 --> 00:18:58,270 Their noise corrected correlation is 0.91. 529 00:18:58,270 --> 00:19:00,940 So they were still quite good at predicting confusion patterns. 530 00:19:00,940 --> 00:19:03,420 Although this did not hold up fully. 531 00:19:03,420 --> 00:19:04,570 We're only at 0.68. 532 00:19:04,570 --> 00:19:05,410 I say only. 533 00:19:05,410 --> 00:19:07,300 Some people would say this is success. 534 00:19:07,300 --> 00:19:09,320 We're only at 0.68 on high variation. 535 00:19:09,320 --> 00:19:11,020 So there's a failure here of the model. 536 00:19:11,020 --> 00:19:13,360 That should be at 1, because it's noise corrected. 537 00:19:13,360 --> 00:19:14,980 So there's something about this that's 538 00:19:14,980 --> 00:19:17,410 not quite right at predicting the confusion 539 00:19:17,410 --> 00:19:19,600 patterns of humans at high variation images. 540 00:19:19,600 --> 00:19:22,960 And that to us, that's an opening to push forward, right? 541 00:19:22,960 --> 00:19:24,730 So this is a strategy going forward 542 00:19:24,730 --> 00:19:28,300 as we have an initial guess of how you read out of IT. 543 00:19:28,300 --> 00:19:30,730 It looks pretty good for first grain test. 544 00:19:30,730 --> 00:19:32,800 But now we can turn the crank harder. 545 00:19:32,800 --> 00:19:33,940 We need more neural data. 546 00:19:33,940 --> 00:19:36,970 We need more psychophysics, finer grained measurements 547 00:19:36,970 --> 00:19:38,650 to sort of distinguish among, not just 548 00:19:38,650 --> 00:19:41,560 say IT's better than V4 or those other representations. 549 00:19:41,560 --> 00:19:44,380 But what exactly about the IT representation? 550 00:19:44,380 --> 00:19:45,550 Is it 100 milliseconds? 551 00:19:45,550 --> 00:19:46,764 What time scale? 552 00:19:46,764 --> 00:19:48,430 Maybe those synchronous codes do matter. 553 00:19:48,430 --> 00:19:50,430 Some of those things that I put on there earlier 554 00:19:50,430 --> 00:19:53,170 might start to matter when we push the code-- push 555 00:19:53,170 --> 00:19:54,370 this even further. 
So the take-home here is that you do quite well with this first-order rate code read out of IT. But now there's an opportunity to dig in and ask, at what point do these readouts break down? And what kind of decoding models are you going to replace them with? That's what we're trying to do. I've told you that IT does well at identity. But remember I showed you those manifolds earlier and said there are other latent variables, like position and scale, and that those don't get thrown away. They just get unwrapped, right? Remember that manifold picture I showed earlier? So one of the things we've been doing recently is asking, because we built these images, we know those other latent variables, like position and pose. That was one of the advantages of building the images this way. And we've been asking how well IT encodes those other latent variables: the pose of the object, the position of the object. And to make a long story short, IT not only has information about these kinds of variables, which is really not surprising, because others have shown there's information about those kinds of things before, but that's sort of what's shown here. Everything I'm showing here: here's IT, V4, simulated V1, and pixels. And everything goes up along the ventral stream for the other variables, which may be non-intuitive to some of you, because position is supposed to be V1. But the position of an object on a complex background is better decoded from IT. That's one example. All these latent variables go up along the ventral stream in terms of their ease of decoding.
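That "ease of decoding" comparison can be sketched as a cross-validated linear regression from each representation to a known latent variable taken from the generative image parameters. Everything below is synthetic, just to show the recipe, not our actual pipeline.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical feature matrices for the same images from different representations,
# plus one known latent variable (here, horizontal position) from the image generator.
n_images = 2560
position_x = rng.uniform(-3, 3, size=n_images)

features = {
    "pixels": rng.normal(size=(n_images, 256)),
    "V1-like": rng.normal(size=(n_images, 256)),
    "V4 sites": rng.normal(size=(n_images, 128)),
    "IT sites": rng.normal(size=(n_images, 168)) + 0.3 * position_x[:, None],  # fake signal
}

# Ease of decoding = cross-validated R^2 of a simple linear regression
# from each representation to the latent variable.
for name, X in features.items():
    r2 = cross_val_score(Ridge(alpha=1.0), X, position_x, cv=5, scoring="r2").mean()
    print(f"{name:9s}: cross-validated R^2 = {r2:.2f}")
```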
But what I'm most excited about is that if you do this comparison with humans again, you get a pretty decent, though not quite as tight, correlation between the actual measured human behavioral performance at estimating those other latent variables and the predicted behavioral performance out of IT. And again, much better correlations. It's not perfect, so again there's some gap here, some failure of understanding. But it's much better than if you read out of V4, V1, or pixels. So this says that the representation isn't just an identity thing. It seems like this representation could underlie some of these other judgments, at least in the central 10 degrees, for the sort of foreground objects we've been measuring here. Don't worry about the details on here; that's the upshot of what I'm trying to say with this slide. But I just wanted to put that out there so you didn't forget that you haven't thrown away all this other interesting stuff about what's out there in the scene. OK. I've sort of alluded to this a bit already. I want to come back to what is now Marr level 3 stuff, right? You have this idea of what you're trying to solve. You have an algorithm that's a decoder on a basis, which looks like it predicts pretty well. It's not perfect; there's work to be done there. But it actually does quite well. Now what does that mean at the physical hardware level? That's Marr level 3. So here's how I visualize it. You have IT cortex, by which I mean AIT and CIT. It's about 150 square millimeters in a monkey. And remember I told you there was about a 1 millimeter scale of organization? I showed you that earlier. And others have shown, and I showed this earlier too, that there are sort of face regions. So I've drawn them just for scale here, just a schematic.
Those are slightly bigger organizations; they're 2 to 5 millimeters. So I think of IT as being this set of something like 100 to 200 little modules, similar to Tanaka. This is not a new conceptual idea. The simple version would be that each millimeter does exactly the same thing, is a feature. And if you sample off of that, you might take 5,000 neurons, but they're really sampling from only about 150 IT features at the 1 millimeter scale. Remember, I don't know if you caught this, but I showed you that 168 IT neurons predicted the pattern of human performance. I showed that a few slides ago. But I told you the real number of neurons is probably 50,000. Most of those are redundant copies of that 168-dimensional feature set. That's how we think about it. So you could imagine it's just a redundant set of, I like to think, about 100 features in IT, which are sampled, maybe randomly, by downstream neurons whose weights are then learned. So when you learn faces versus other things, well, there's lots of good information about faces versus other things in these face patches; that's how they're defined. So this downstream neuron is going to lean heavily on those neurons. And that would make these regions causally involved. So that doesn't mean you had to pre-build anything in here. You just learn this at a downstream stage, and you would get something that looks like it would explain our data. So we like that, because it captures that case. But it also captures the more general case. If you learn cars, you're going to sample from a different subset of neurons, but you're following the same learning rule. That's what I said earlier on. So we think this is the initial state, and this is what happens when you learn objects.
And so what we think you have post-learning is, again, about 100 to 150 IT sub-regions, each at the 1 millimeter scale, that are supporting a number of noun tasks read off this common basis here. That's the model that we like, given the kind of data that I've been showing you. The post-learning model, as we call it. The reason I'm bringing this up is probably for the neuroscientists, to fix ideas about how we think about IT as a basis set. And I think Haim set this up nicely; he implied similar things, that somebody downstream reads from it. OK. But now we're starting to have a more concrete model, where I'm trying to be physical about it: the size of these regions, how many there are, connecting to earlier data. So we're gaining inference on that from these different experiments. And now, if you believe this, it starts to make predictions, and we can do causality, right? Somebody mentioned that earlier. So one of the things we've been doing recently is asking whether we can start to silence bits of tissue. Look, the way I've drawn this, and this is just schematic, this bit of tissue is somehow involved in this task and that task: the face task and the car task. But this bit of tissue, only the face task. And that bit of tissue, only the car task. And this bit of tissue, neither. So if you believe that, and you had the tools, you should be able to go in and start to silence little bits of IT. And you should get predictable patterns of behavioral deficits in the animal when you make those manipulations, right? Everybody follow that? Right? OK. And now the models give you a framework to build those predictions and also to estimate the magnitude of the effects that you should see. And that's what we've been doing more recently.
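Here's a toy version of how such a prediction could be generated, not our actual model: train a linear readout on a full set of simulated "IT features," then zero out the features assigned to one hypothetical 1 millimeter module at test time and look at the drop in performance. All numbers and labels below are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Simulated IT: 100 modules, a few sites per module; 10 modules carry task signal.
n_images, n_modules, sites_per_module = 4000, 100, 5
n_sites = n_modules * sites_per_module
module_of_site = np.repeat(np.arange(n_modules), sites_per_module)

labels = rng.integers(0, 2, size=n_images)            # e.g., a two-way task
rates = rng.normal(size=(n_images, n_sites))
informative = module_of_site < 10
rates[:, informative] += 0.4 * labels[:, None]         # fake task signal in 10 modules

train, test = np.arange(0, 3000), np.arange(3000, 4000)
decoder = LogisticRegression(max_iter=1000).fit(rates[train], labels[train])
baseline = decoder.score(rates[test], labels[test])

# "Silence" one informative module by zeroing its sites at test time only.
silenced = rates[test].copy()
silenced[:, module_of_site == 3] = 0.0
deficit = baseline - decoder.score(silenced, labels[test])
print(f"baseline {baseline:.2%}, predicted deficit from silencing one module: {deficit:.2%}")
```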
I'll just give you a taste of this, because it's really ongoing. But I think it connects to what Gabriel said earlier, that there are now tools available to do this. Oh, I put that in from an earlier talk. I think Google has a thing called Inception, and I don't know, was it Google? Somebody has it. But you can't do inception unless you're actually in a brain. So are we going to try to insert signals? The reason we do this is that my student who is working on it really wants to inject signals into the brain. There's a dream about BMI, right? Could you inject a percept? To do that, you're going to need to do experiments like this and understand this hardware in order to interact with it. It's something we talked about earlier. And actually, Tonegawa's lab has some cool inception-like stuff on memory. But this is like inserting an object or a person, and that has been a dream for many of us for a long time. Can we reliably disrupt performance by suppressing 1 millimeter bits of IT? To do that, what we're doing is testing a large battery of tasks and a battery of suppression patterns. So not just asking, can we affect a face task or one task? Imagine we test a battery of tasks, and the idea is we'd do every bit of IT one by one, and then in combination, and we'd get all that data and figure out what's going on, right? That's sort of the dream, and we're trying to build towards it. Do you guys get it? Right. And we're motivated by this kind of idea here. So to build toward that, we started with the tools. I'm just going to give you a quick tour of the tools we have to start to do this. This is our recording; we can localize what we're recording to a very fine grain using x-rays.
792 00:27:24,649 --> 00:27:26,940 So we know exactly where we're recording the IT to like 793 00:27:26,940 --> 00:27:29,214 about 300 micron resolution. 794 00:27:29,214 --> 00:27:30,880 So that's why I'm putting this slide up. 795 00:27:30,880 --> 00:27:32,463 And what we're interested in is going, 796 00:27:32,463 --> 00:27:35,955 if I silence this bit of IT, or that bit of IT, 797 00:27:35,955 --> 00:27:38,790 or that bit of IT, so actually do this experiment, what 798 00:27:38,790 --> 00:27:40,230 happens behaviorally? 799 00:27:40,230 --> 00:27:43,140 And Arash Afraz is a post-doc in the lab, started 800 00:27:43,140 --> 00:27:44,610 these actual experiments. 801 00:27:44,610 --> 00:27:47,130 And one of the things Arash did was to first say, 802 00:27:47,130 --> 00:27:51,330 let's see if we can get this silencing of optogenetics tool 803 00:27:51,330 --> 00:27:52,290 to work in our hands. 804 00:27:52,290 --> 00:27:54,123 And the reason we were so excited about that 805 00:27:54,123 --> 00:27:55,890 is because we think lesions, if we 806 00:27:55,890 --> 00:27:58,410 can make temporary brief silencing, 807 00:27:58,410 --> 00:28:03,020 that that will give it much more reliable disruption of behavior 808 00:28:03,020 --> 00:28:06,300 that then, if we started to try to inject signals, 809 00:28:06,300 --> 00:28:09,330 which would be our dream, but that seems too risky to us. 810 00:28:09,330 --> 00:28:11,580 We just want to say, what is a temporary lesion 811 00:28:11,580 --> 00:28:13,082 of each bit of IT do? 812 00:28:13,082 --> 00:28:14,790 And optogenetics is cool, because there's 813 00:28:14,790 --> 00:28:17,610 no other technique that can briefly silence-- 814 00:28:17,610 --> 00:28:19,782 temporarily silence activity. 815 00:28:19,782 --> 00:28:21,490 You can do pharmacological manipulations, 816 00:28:21,490 --> 00:28:23,010 but those last for hours. 817 00:28:23,010 --> 00:28:25,544 So this could briefly silence bits of IT. 818 00:28:25,544 --> 00:28:27,210 And that's why we were excited about it. 819 00:28:27,210 --> 00:28:30,120 We also did pharmacological manipulation as a reference 820 00:28:30,120 --> 00:28:30,780 to get started. 821 00:28:30,780 --> 00:28:33,600 But what we're doing is trying to silence 1 millimeter regions 822 00:28:33,600 --> 00:28:36,490 of IT using light delivered through optical fibers 823 00:28:36,490 --> 00:28:38,220 as the recording electrode. 824 00:28:38,220 --> 00:28:41,490 And to silence bits of neurons here. 825 00:28:41,490 --> 00:28:43,757 And so what Arash did was first show 826 00:28:43,757 --> 00:28:45,840 that you can actually silence neurons in this way. 827 00:28:45,840 --> 00:28:48,390 So if you guys haven't seen optogenetics plots, 828 00:28:48,390 --> 00:28:49,560 this is data from our lab. 829 00:28:49,560 --> 00:28:51,060 What's quite cool about this, again, 830 00:28:51,060 --> 00:28:53,290 is you have the same images are being presented. 831 00:28:53,290 --> 00:28:55,110 So this green line should be up here. 832 00:28:55,110 --> 00:28:57,900 But Arash turns a laser on right here, shines light on there. 833 00:28:57,900 --> 00:28:59,970 And there's some opsins expressed in the neurons 834 00:28:59,970 --> 00:29:00,861 in that local area. 835 00:29:00,861 --> 00:29:03,360 And you can see it just sort of shuts the thing down, and it 836 00:29:03,360 --> 00:29:04,830 sort of deletes or blocks this. 837 00:29:04,830 --> 00:29:06,390 You have the same input coming in. 
838 00:29:06,390 --> 00:29:08,300 But you can sort of delete it here. 839 00:29:08,300 --> 00:29:09,670 And this is another example. 840 00:29:09,670 --> 00:29:11,253 These are some pretty strong examples. 841 00:29:11,253 --> 00:29:13,750 It's not always this strong. 842 00:29:13,750 --> 00:29:16,140 But this is, again-- you can see we can return back 843 00:29:16,140 --> 00:29:17,400 to normal right away, right? 844 00:29:17,400 --> 00:29:19,290 So this is a 200 millisecond silencing. 845 00:29:19,290 --> 00:29:21,230 You could go even narrower than that. 846 00:29:21,230 --> 00:29:23,340 But so this is what we had done so far. 847 00:29:23,340 --> 00:29:25,020 And again, what we did was say, look. 848 00:29:25,020 --> 00:29:25,980 This is a risky tool. 849 00:29:25,980 --> 00:29:27,479 It might not have worked at all. 850 00:29:27,479 --> 00:29:29,250 So Arash just wanted to test something 851 00:29:29,250 --> 00:29:31,850 that was likely to work. 852 00:29:31,850 --> 00:29:34,080 And so we picked a face task because there 853 00:29:34,080 --> 00:29:36,480 was a lot of evidence of spatial clustering of faces, 854 00:29:36,480 --> 00:29:39,240 which you'll hear about from Winrich and which is also 855 00:29:39,240 --> 00:29:40,750 known in the literature. 856 00:29:40,750 --> 00:29:42,770 So what Arash did was to say, we picked 857 00:29:42,770 --> 00:29:44,910 a task of discriminating males from females. 858 00:29:44,910 --> 00:29:46,470 We put in our notion of invariance. 859 00:29:46,470 --> 00:29:48,390 It's not just doing it on one image. 860 00:29:48,390 --> 00:29:50,970 But you have to do it across a bunch of transformations. 861 00:29:50,970 --> 00:29:53,127 In this case, identity is a transformation. 862 00:29:53,127 --> 00:29:55,710 So you're saying, all of these are supposed to be called male, 863 00:29:55,710 --> 00:29:57,050 and all these are called female. 864 00:29:57,050 --> 00:29:59,049 And he wanted you to distinguish this from this. 865 00:29:59,049 --> 00:30:00,900 That's what he trained a monkey to do. 866 00:30:00,900 --> 00:30:04,031 And just to give you the upshot: we do all this work, 867 00:30:04,031 --> 00:30:05,280 we silence the bits of cortex. 868 00:30:05,280 --> 00:30:06,870 And here's the big take home. 869 00:30:06,870 --> 00:30:10,140 You get a 2% deficit from silencing single one-millimeter 870 00:30:10,140 --> 00:30:12,750 bits of IT cortex. 871 00:30:12,750 --> 00:30:16,200 Parts of IT cortex, not all of IT cortex, 872 00:30:16,200 --> 00:30:17,670 produce a 2% deficit. 873 00:30:17,670 --> 00:30:19,545 Here's the animal running at about 86% correct. 874 00:30:19,545 --> 00:30:21,086 These are interleaved trials where we 875 00:30:21,086 --> 00:30:22,610 silence some local bit of IT. 876 00:30:22,610 --> 00:30:23,787 You get a 2% deficit. 877 00:30:23,787 --> 00:30:25,620 That's true only in the contralateral field, 878 00:30:25,620 --> 00:30:28,290 not the ipsilateral field, for the aficionados. 879 00:30:28,290 --> 00:30:30,630 You might look at this 2% and go, well, that's tiny. 880 00:30:30,630 --> 00:30:32,190 But when we looked at it, this is exactly what's 881 00:30:32,190 --> 00:30:34,315 predicted by the models that we were talking about. 882 00:30:34,315 --> 00:30:37,140 It's right in the range of what should happen. 883 00:30:37,140 --> 00:30:39,170 And so this, to us, is really quite cool. 884 00:30:39,170 --> 00:30:40,445 This is highly significant.
885 00:30:40,445 --> 00:30:42,570 And now we sort of are in position to start to say, 886 00:30:42,570 --> 00:30:43,870 OK, these tools work. 887 00:30:43,870 --> 00:30:45,330 They do what they're supposed to. 888 00:30:45,330 --> 00:30:47,670 And now we can start to expand that task space. 889 00:30:47,670 --> 00:30:49,620 So this result has been published recently, 890 00:30:49,620 --> 00:30:51,112 if you're interested in this. 891 00:30:51,112 --> 00:30:53,070 And here is one of the ways we're going forward: 892 00:30:53,070 --> 00:30:55,736 Rishi Rajalingham, the one doing those tasks in the monkeys 893 00:30:55,736 --> 00:30:56,670 I showed you earlier, 894 00:30:56,670 --> 00:30:58,390 is silencing different parts of IT. 895 00:30:58,390 --> 00:31:01,320 This is now with muscimol; different bits of IT-- 896 00:31:01,320 --> 00:31:03,570 these are different tasks-- lead to different patterns. 897 00:31:03,570 --> 00:31:04,944 That's what these dots are here-- 898 00:31:04,944 --> 00:31:06,300 different patterns of deficits. 899 00:31:06,300 --> 00:31:08,010 And if you go back to the same location, 900 00:31:08,010 --> 00:31:09,640 you get the same pattern of deficits. 901 00:31:09,640 --> 00:31:11,064 So this is only 10 tasks. 902 00:31:11,064 --> 00:31:12,480 But I think it hopefully gives you 903 00:31:12,480 --> 00:31:14,760 the spirit of what we're trying to do. 904 00:31:14,760 --> 00:31:16,650 And again, this is only muscimol, 905 00:31:16,650 --> 00:31:19,600 which doesn't have all the advantages of optogenetics. 906 00:31:19,600 --> 00:31:22,460 But this is what we're building towards here. 907 00:31:22,460 --> 00:31:26,116 So I'm just giving you the sort of state of the art. 908 00:31:26,116 --> 00:31:27,990 So our aim is to measure the specific pattern 909 00:31:27,990 --> 00:31:30,740 of behavioral change induced by the suppression of each IT sub 910 00:31:30,740 --> 00:31:32,850 region, ideally testing many of them, 911 00:31:32,850 --> 00:31:35,389 and then compare with the model predictions. 912 00:31:35,389 --> 00:31:36,930 I'm saying there's this domain, and I 913 00:31:36,930 --> 00:31:38,220 want to sort of sample the whole domain. 914 00:31:38,220 --> 00:31:41,040 So far, I've given you only samples of tasks in the domain. 915 00:31:41,040 --> 00:31:42,950 But we're really trying to define the domain. 916 00:31:42,950 --> 00:31:43,770 And I'm just-- 917 00:31:43,770 --> 00:31:46,353 I'm going to skip through this just to give you the punchline, 918 00:31:46,353 --> 00:31:48,780 which is that we do a whole bunch of behavioral measurements. 919 00:31:48,780 --> 00:31:50,040 We presented this work before. 920 00:31:50,040 --> 00:31:52,456 This is now up to three million Mechanical Turk 921 00:31:52,456 --> 00:31:53,130 trials. 922 00:31:53,130 --> 00:31:56,550 It seems to us that we can embed all objects, even 923 00:31:56,550 --> 00:31:58,230 subordinate objects, of the type of task 924 00:31:58,230 --> 00:31:59,854 that I've been telling you about, 925 00:31:59,854 --> 00:32:01,830 in essentially a 20 dimensional space. 926 00:32:01,830 --> 00:32:02,940 So there's 20 dimensions. 927 00:32:02,940 --> 00:32:05,190 We infer that humans are projecting 928 00:32:05,190 --> 00:32:07,730 to about 20 dimensions to do the kinds of tasks 929 00:32:07,730 --> 00:32:08,860 that we've shown here.
930 00:32:08,860 --> 00:32:11,310 Which is sort of smaller, but eerily 931 00:32:11,310 --> 00:32:13,380 close to that in the order of magnitude 932 00:32:13,380 --> 00:32:15,900 to that 100 or so features that I've been talking about. 933 00:32:15,900 --> 00:32:19,020 So that's where-- regardless of whether-- these 934 00:32:19,020 --> 00:32:21,580 are some of the dimensions and how we're projecting them. 935 00:32:21,580 --> 00:32:22,930 Again, I won't take you through this, 936 00:32:22,930 --> 00:32:24,490 because I think we've already used up enough time 937 00:32:24,490 --> 00:32:25,906 and I want to get on to this part. 938 00:32:25,906 --> 00:32:28,565 But we're trying to define a domain of all tasks 939 00:32:28,565 --> 00:32:29,940 where we can sort of predict what 940 00:32:29,940 --> 00:32:32,010 would happen across anything within that domain. 941 00:32:32,010 --> 00:32:35,175 And that raises questions of the dimensionality of that domain. 942 00:32:35,175 --> 00:32:37,050 And there were behavioral methods to do that. 943 00:32:37,050 --> 00:32:39,330 And we've been doing some work on that. 944 00:32:39,330 --> 00:32:40,664 So I'll just leave it at that. 945 00:32:40,664 --> 00:32:42,080 And if you guys have questions, we 946 00:32:42,080 --> 00:32:43,567 can talk about that some more. 947 00:32:43,567 --> 00:32:45,150 I want to sort of in the time I really 948 00:32:45,150 --> 00:32:47,910 have left is to talk about the encoding side of things, 949 00:32:47,910 --> 00:32:49,410 because I promised you guys I would get to this. 950 00:32:49,410 --> 00:32:50,868 Unless people have any more burning 951 00:32:50,868 --> 00:32:52,202 questions on this decoding side. 952 00:32:52,202 --> 00:32:53,826 So far I've been talking about the link 953 00:32:53,826 --> 00:32:55,020 between IT and perception. 954 00:32:55,020 --> 00:32:57,900 Now I'm going to switch gears and talk about this other side. 955 00:32:57,900 --> 00:32:59,310 Which is, so I talked about this. 956 00:32:59,310 --> 00:33:01,630 And that tells us that the mean rates in IT 957 00:33:01,630 --> 00:33:03,630 are something that seem to be highly predictive. 958 00:33:03,630 --> 00:33:05,130 I showed you at least one model that 959 00:33:05,130 --> 00:33:06,510 has the laws of RAD IT model. 960 00:33:06,510 --> 00:33:09,300 But now, it's like now, we can turn to the encoding side 961 00:33:09,300 --> 00:33:11,677 and say, we need to predict the mean rates of IT. 962 00:33:11,677 --> 00:33:14,010 And that should be our goal if we want to explain images 963 00:33:14,010 --> 00:33:15,570 to IT activity. 964 00:33:15,570 --> 00:33:18,940 So, these would be called predictive encoding mechanisms. 965 00:33:18,940 --> 00:33:21,360 So, now you guys have heard about 966 00:33:21,360 --> 00:33:22,994 deep convolutional networks. 967 00:33:22,994 --> 00:33:24,660 If not, you've heard about them already, 968 00:33:24,660 --> 00:33:26,409 you'll probably hear about them some more. 969 00:33:26,409 --> 00:33:28,767 So we started messing around in 2008. 970 00:33:28,767 --> 00:33:29,850 This is a model inspired-- 971 00:33:29,850 --> 00:33:31,558 I mentioned this family of models before. 
972 00:33:31,558 --> 00:33:34,430 Hubel and Wiesel, Fukushima, and there's a whole HMAX family 973 00:33:34,430 --> 00:33:38,030 of models-- that really was the inspiration for this larger-- 974 00:33:38,030 --> 00:33:39,930 this large family of models that 975 00:33:39,930 --> 00:33:43,939 have this repeating structure. 976 00:33:43,939 --> 00:33:46,230 The modern-day deep convolutional networks 977 00:33:46,230 --> 00:33:48,640 really grew out of all of this earlier work. 978 00:33:48,640 --> 00:33:51,600 And so we started exploring the family in 2008. 979 00:33:51,600 --> 00:33:54,140 And just, this is a slide that you've already sort of seen 980 00:33:54,140 --> 00:33:56,056 a version of from Gabriel, where, you know, 981 00:33:56,056 --> 00:33:58,125 you take an image, you pass it 982 00:33:58,125 --> 00:33:59,250 through a set of operators. 983 00:33:59,250 --> 00:34:00,180 So you have filters. 984 00:34:00,180 --> 00:34:02,840 So these are dot products over some restricted spatial 985 00:34:02,840 --> 00:34:05,550 region, like receptive fields. 986 00:34:05,550 --> 00:34:08,330 You have a non-linearity, like a threshold and a saturation. 987 00:34:08,330 --> 00:34:10,159 You have a pooling operation. 988 00:34:10,159 --> 00:34:11,409 Then you have a normalization. 989 00:34:11,409 --> 00:34:13,610 So you have all these operations happen here. 990 00:34:13,610 --> 00:34:14,928 And that produces a stack. 991 00:34:14,928 --> 00:34:16,969 So think of it like, if there are four filters here, 992 00:34:16,969 --> 00:34:19,389 like four orientations, you get four images: 993 00:34:19,389 --> 00:34:21,389 you have one image in, you have four images out. 994 00:34:21,389 --> 00:34:23,638 But if you had 10 of these, you'd get 10 of these out. 995 00:34:23,638 --> 00:34:25,125 Then you repeat this here, right? 996 00:34:25,125 --> 00:34:26,750 And so as you keep adding more filters, 997 00:34:26,750 --> 00:34:28,749 this stack just keeps getting bigger and bigger. 998 00:34:28,749 --> 00:34:30,743 And because you're spatially pooling, 999 00:34:30,743 --> 00:34:32,659 it keeps getting narrower and narrower, right? 1000 00:34:32,659 --> 00:34:34,544 So you go from this image to this sort 1001 00:34:34,544 --> 00:34:38,060 of deep stack of features that has less retinotopy. 1002 00:34:38,060 --> 00:34:40,130 It still has a little bit of retinotopy. 1003 00:34:40,130 --> 00:34:42,560 And that, you can see, is exactly why people liked it 1004 00:34:42,560 --> 00:34:44,389 as a model of how people 1005 00:34:44,389 --> 00:34:46,310 think about the ventral stream. 1006 00:34:46,310 --> 00:34:48,830 So these models typically have thousands 1007 00:34:48,830 --> 00:34:52,010 of visual neurons or features at the top level. 1008 00:34:52,010 --> 00:34:55,520 Just to give you a sense of scale of how they're run. 1009 00:34:55,520 --> 00:34:57,209 And just to take you through, you 1010 00:34:57,209 --> 00:34:59,000 know, I guess maybe you'll hear about this, 1011 00:34:59,000 --> 00:35:00,110 if you haven't already. 1012 00:35:00,110 --> 00:35:02,900 Each element has, like, a filter with a large fan-in. 1013 00:35:02,900 --> 00:35:05,330 These are like neuroscience-related things. 1014 00:35:05,330 --> 00:35:08,460 They have non-linearities, like thresholds of neurons. 1015 00:35:08,460 --> 00:35:10,250 Each layer is convolutional, which 1016 00:35:10,250 --> 00:35:12,962 means you apply the same filters across visual space.
1017 00:35:12,962 --> 00:35:15,170 Which is like retinotopy: there's a V1 cell that 1018 00:35:15,170 --> 00:35:16,010 is oriented here, 1019 00:35:16,010 --> 00:35:17,690 and there'll be another V1 cell that's 1020 00:35:17,690 --> 00:35:19,670 in another spatial position-- same orientation, 1021 00:35:19,670 --> 00:35:21,080 different spatial position. 1022 00:35:21,080 --> 00:35:23,360 The convolutional models are just 1023 00:35:23,360 --> 00:35:26,810 an implementation of that idea of copying the same filter 1024 00:35:26,810 --> 00:35:28,782 type across the retina. 1025 00:35:28,782 --> 00:35:30,240 And there's a deep stack of layers. 1026 00:35:30,240 --> 00:35:31,615 These are all things that I think 1027 00:35:31,615 --> 00:35:34,610 are commensurate with the ventral stream 1028 00:35:34,610 --> 00:35:36,796 anatomy and physiology. 1029 00:35:36,796 --> 00:35:39,230 But one of the key things that those 1030 00:35:39,230 --> 00:35:40,790 who work with these models know is 1031 00:35:40,790 --> 00:35:43,250 that they have lots of unknown parameters 1032 00:35:43,250 --> 00:35:45,320 that are not determined from the neurobiology. 1033 00:35:45,320 --> 00:35:47,750 Even though the family of models is well described, 1034 00:35:47,750 --> 00:35:50,270 what are the exact filter weights? 1035 00:35:50,270 --> 00:35:51,830 What are the threshold parameters? 1036 00:35:51,830 --> 00:35:53,090 How exactly do you pool? 1037 00:35:53,090 --> 00:35:54,050 How do you normalize? 1038 00:35:54,050 --> 00:35:56,480 There are lots of parameters when you build these things, 1039 00:35:56,480 --> 00:35:59,090 essentially thousands of parameters, most of them hidden 1040 00:35:59,090 --> 00:36:00,360 in the weight structure here. 1041 00:36:00,360 --> 00:36:02,360 Which, if you think about the first layer, 1042 00:36:02,360 --> 00:36:04,190 would be like: should I choose Gabor filters? 1043 00:36:04,190 --> 00:36:06,110 Or should I do something else-- you know, Haim was talking 1044 00:36:06,110 --> 00:36:07,369 about random weights, right? 1045 00:36:07,369 --> 00:36:08,410 So there are choices there. 1046 00:36:08,410 --> 00:36:09,140 There are lots of parameters. 1047 00:36:09,140 --> 00:36:11,410 So the upshot is, there's a big-- that's why 1048 00:36:11,410 --> 00:36:13,240 I call it a family of models. 1049 00:36:13,240 --> 00:36:16,560 And how do you choose which one is the right one, so to speak? 1050 00:36:16,560 --> 00:36:17,900 Or is there a right one? 1051 00:36:17,900 --> 00:36:19,650 Or maybe the whole family is wrong, right? 1052 00:36:19,650 --> 00:36:21,680 These are the interesting discussions. 1053 00:36:21,680 --> 00:36:24,610 So, what I like about it is, at least once you set it, 1054 00:36:24,610 --> 00:36:25,220 it's a model. 1055 00:36:25,220 --> 00:36:26,120 It makes predictions. 1056 00:36:26,120 --> 00:36:27,161 And then you can test it. 1057 00:36:27,161 --> 00:36:28,610 So it's at least a model. 1058 00:36:28,610 --> 00:36:30,800 And it predicts the entire-- you know, 1059 00:36:30,800 --> 00:36:33,560 if you start to map these, you say this is V1, this is V2, 1060 00:36:33,560 --> 00:36:34,220 this is V4. 1061 00:36:34,220 --> 00:36:37,400 It predicts the full neural population response 1062 00:36:37,400 --> 00:36:39,380 to any image across these areas. 1063 00:36:39,380 --> 00:36:43,100 So it's a strongly predictive model once built. 1064 00:36:43,100 --> 00:36:44,176 So that's nice.
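To make the filter, non-linearity, pooling, and normalization motif he just described concrete, here is a minimal sketch of one such stage and a small stack of them. This is not the lab's model or any published architecture; the framework (PyTorch), the filter counts, kernel sizes, and pooling choices below are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class ConvStage(nn.Module):
    """One stage of the motif described above: filtering (local dot products),
    a threshold-like nonlinearity, spatial pooling, and normalization."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.filters = nn.Conv2d(in_channels, out_channels, kernel_size=7, padding=3)
        self.nonlin = nn.ReLU()                   # threshold-like nonlinearity
        self.pool = nn.MaxPool2d(kernel_size=2)   # spatial pooling: maps get narrower
        self.norm = nn.LocalResponseNorm(size=5)  # divisive normalization across filters

    def forward(self, x):
        return self.norm(self.pool(self.nonlin(self.filters(x))))

# Stacking stages: one image in, a deeper but spatially narrower stack of
# feature maps out -- the "stack keeps getting bigger and narrower" idea.
stack = nn.Sequential(
    ConvStage(3, 16),     # e.g. a handful of oriented filters at the first stage
    ConvStage(16, 64),
    ConvStage(64, 128),
    ConvStage(128, 256),
)

image = torch.randn(1, 3, 224, 224)   # a random stand-in for one RGB image
features = stack(image)
print(features.shape)                 # torch.Size([1, 256, 14, 14])
```

Each added filter bank multiplies the number of feature maps, while each pooling step shrinks their spatial extent, which is why the representation gets deeper and narrower and keeps only a little retinotopy near the top.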
1065 00:36:44,176 --> 00:36:46,550 But now you have to determine how am I going to build it? 1066 00:36:46,550 --> 00:36:48,300 How do I set the parameters? 1067 00:36:48,300 --> 00:36:50,129 So how do we do that? 1068 00:36:50,129 --> 00:36:51,920 Well, there's lots of ways you could do it. 1069 00:36:51,920 --> 00:36:53,753 And I'll tell you the way we chose to do it. 1070 00:36:53,753 --> 00:36:56,540 Which was to just not use any neural data. 1071 00:36:56,540 --> 00:36:58,370 It was just to use optimization methods 1072 00:36:58,370 --> 00:37:00,860 to find specific models to set the parameters 1073 00:37:00,860 --> 00:37:02,660 inside this model class. 1074 00:37:02,660 --> 00:37:04,924 And we chose an optimization target. 1075 00:37:04,924 --> 00:37:07,340 This is a little bit, again, inspired from a top down view 1076 00:37:07,340 --> 00:37:09,146 of what the system's doing. 1077 00:37:09,146 --> 00:37:10,520 What are the visual tasks that we 1078 00:37:10,520 --> 00:37:13,010 suppose the ventral stream was supposed to solve? 1079 00:37:13,010 --> 00:37:15,350 Which I already told you, we think it's invariant object 1080 00:37:15,350 --> 00:37:15,950 recognition. 1081 00:37:15,950 --> 00:37:17,600 That's what makes the problem hard. 1082 00:37:17,600 --> 00:37:19,476 So we tried to optimize models to solve that. 1083 00:37:19,476 --> 00:37:21,058 And essentially when we're doing that, 1084 00:37:21,058 --> 00:37:23,540 we're kind of doing the same thing that computer vision is 1085 00:37:23,540 --> 00:37:26,164 trying to do, except we're doing it in our own domain of images 1086 00:37:26,164 --> 00:37:27,329 and tasks that we set up. 1087 00:37:27,329 --> 00:37:29,870 But we essentially, there's a meeting between computer vision 1088 00:37:29,870 --> 00:37:32,240 and what we were trying to do here. 1089 00:37:32,240 --> 00:37:34,450 And when I say we, this is work by Dan Yamins, 1090 00:37:34,450 --> 00:37:37,520 a post-doc in the lab, and Ha Hong, a graduate student. 1091 00:37:37,520 --> 00:37:40,712 And what we did was to just try to simulate again, 1092 00:37:40,712 --> 00:37:41,420 as I did earlier. 1093 00:37:41,420 --> 00:37:43,049 We took these simple 3-D objects. 1094 00:37:43,049 --> 00:37:44,590 We could render them, just as before, 1095 00:37:44,590 --> 00:37:46,730 place them on naturalistic background. 1096 00:37:46,730 --> 00:37:48,380 And then we just built models that 1097 00:37:48,380 --> 00:37:50,360 would try to discriminate bodies from buildings 1098 00:37:50,360 --> 00:37:51,318 from flowers from guns. 1099 00:37:51,318 --> 00:37:53,076 So they would have good feature sets 1100 00:37:53,076 --> 00:37:54,950 that would discriminate between these things. 1101 00:37:54,950 --> 00:37:58,280 And these were essentially trained by various forms 1102 00:37:58,280 --> 00:37:59,220 of supervision. 1103 00:37:59,220 --> 00:38:02,240 Now there's lots of ways you can train these models. 1104 00:38:02,240 --> 00:38:03,740 I could tell you about how we did it 1105 00:38:03,740 --> 00:38:04,970 and how others have done it. 1106 00:38:04,970 --> 00:38:06,546 I think those details are beyond what 1107 00:38:06,546 --> 00:38:07,670 I want to talk about today. 1108 00:38:07,670 --> 00:38:09,530 But just, it's a supervised class 1109 00:38:09,530 --> 00:38:12,156 that's probably not learned in the same way 1110 00:38:12,156 --> 00:38:13,280 that the brain has learned. 1111 00:38:13,280 --> 00:38:14,711 Most people don't think so. 
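The idea of setting the parameters purely by optimizing for categorization performance can be sketched in a few lines. The actual HMO procedure was a more elaborate hyperparameter search, so this is only the generic supervised version of the idea; the small architecture, category count, and random stand-in data below are illustrative assumptions, not the lab's setup.

```python
import torch
import torch.nn as nn

# Hypothetical categories standing in for the rendered-object classes.
n_categories = 8

# A small stand-in convolutional hierarchy plus a linear readout;
# the point here is only the supervised objective, not the architecture.
net = nn.Sequential(
    nn.Conv2d(3, 32, 7, stride=2, padding=3), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(128, n_categories),
)

optimizer = torch.optim.SGD(net.parameters(), lr=1e-2, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One supervised update: nudge all the filter weights toward
    better categorization of the labeled images."""
    optimizer.zero_grad()
    loss = loss_fn(net(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# A fake batch of images and category labels, just to show the call.
images = torch.randn(16, 3, 128, 128)
labels = torch.randint(0, n_categories, (16,))
print(train_step(images, labels))
```

With real rendered images and category labels in place of the random tensors, repeating train_step over many batches is the whole "optimize for performance" step; note that no neural data enters anywhere.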
1112 00:38:14,711 --> 00:38:16,460 But the interesting thing is the end state 1113 00:38:16,460 --> 00:38:19,659 of these models might look very much like the current adult 1114 00:38:19,659 --> 00:38:20,450 state of the brain. 1115 00:38:20,450 --> 00:38:22,390 And that's what I want to try to tell you next. 1116 00:38:22,390 --> 00:38:24,280 So first, let me show you that when we built these models, 1117 00:38:24,280 --> 00:38:25,280 this was in 2012. 1118 00:38:25,280 --> 00:38:27,080 We had a particular optimization approach 1119 00:38:27,080 --> 00:38:29,176 that we called HMO that was trying 1120 00:38:29,176 --> 00:38:31,550 to solve these kind of problems that I showed you earlier 1121 00:38:31,550 --> 00:38:33,020 on these kind of images. 1122 00:38:33,020 --> 00:38:35,104 And I showed you IT was pretty good compared with humans. 1123 00:38:35,104 --> 00:38:37,520 I showed you its performance was almost up to humans, even 1124 00:38:37,520 --> 00:38:39,379 with just 168 sites. 1125 00:38:39,379 --> 00:38:40,920 And when we first built a model here, 1126 00:38:40,920 --> 00:38:42,586 we were able to do much better than some 1127 00:38:42,586 --> 00:38:44,960 of our previous models 1128 00:38:44,960 --> 00:38:46,130 on these same kinds of tasks. 1129 00:38:46,130 --> 00:38:47,796 So I told you why we constructed these images: because we 1130 00:38:47,796 --> 00:38:49,530 knew they made these kinds of models 1131 00:38:49,530 --> 00:38:51,410 not do so well. 1132 00:38:51,410 --> 00:38:53,390 So we built these high invariance tasks 1133 00:38:53,390 --> 00:38:55,040 to push these models down. 1134 00:38:55,040 --> 00:38:56,720 And then we had space to build a model 1135 00:38:56,720 --> 00:38:59,050 that could do better. 1136 00:38:59,050 --> 00:39:01,850 And we called it HMO 1.0. 1137 00:39:01,850 --> 00:39:03,530 And then we started to say, now we 1138 00:39:03,530 --> 00:39:06,560 have this model that has been optimized for performance. 1139 00:39:06,560 --> 00:39:09,380 Let's see how well it does when we compare it with neurons. 1140 00:39:09,380 --> 00:39:12,540 Let's see if its internals look like the neural data. 1141 00:39:12,540 --> 00:39:14,324 So here's the model we built, HMO 1.0. 1142 00:39:14,324 --> 00:39:15,740 It's a deep convolutional network. 1143 00:39:15,740 --> 00:39:16,906 It has a few different levels-- 1144 00:39:16,906 --> 00:39:18,140 it had four levels. 1145 00:39:18,140 --> 00:39:20,524 It had a bunch of parameters that we set by optimization-- 1146 00:39:20,524 --> 00:39:22,690 I'm just telling you kind of what we optimized for. 1147 00:39:22,690 --> 00:39:23,481 I didn't tell you-- 1148 00:39:23,481 --> 00:39:25,530 I'm not telling you any of the parameters. 1149 00:39:25,530 --> 00:39:26,660 And now, we come back and say, well, look. 1150 00:39:26,660 --> 00:39:28,190 We can show the same images to the model 1151 00:39:28,190 --> 00:39:29,550 that we showed to the neurons. 1152 00:39:29,550 --> 00:39:32,060 And then we can compare how well this population looks 1153 00:39:32,060 --> 00:39:35,830 like that population, or this population looks like that. 1154 00:39:35,830 --> 00:39:38,820 And so what we did was ask first: how well can layer four predict 1155 00:39:38,820 --> 00:39:39,524 IT? 1156 00:39:39,524 --> 00:39:40,940 That was the first thing we wanted 1157 00:39:40,940 --> 00:39:43,160 to do: take the top layer of this model, 1158 00:39:43,160 --> 00:39:46,852 the last layer before the linear readout of this model.
1159 00:39:46,852 --> 00:39:49,310 And to do that, you might sort of say, well, wait a minute. 1160 00:39:49,310 --> 00:39:51,170 The model doesn't have mappings. 1161 00:39:51,170 --> 00:39:54,680 It has sort of neurons simulated here, neuron 12 or something. 1162 00:39:54,680 --> 00:39:56,180 And there's some neuron we recorded. 1163 00:39:56,180 --> 00:39:58,970 But there's no linkage between that neuron and that neuron, 1164 00:39:58,970 --> 00:39:59,470 right? 1165 00:39:59,470 --> 00:40:01,320 You have to make that map. 1166 00:40:01,320 --> 00:40:03,770 So what we do is we take each IT neuron 1167 00:40:03,770 --> 00:40:06,066 and treat this as sort of a generative space. 1168 00:40:06,066 --> 00:40:07,940 You can generate as many simulated IT neurons 1169 00:40:07,940 --> 00:40:08,510 as you want. 1170 00:40:08,510 --> 00:40:10,550 You would just ask, let's take this neuron, 1171 00:40:10,550 --> 00:40:13,640 take some of its data, and try to build a linear regression 1172 00:40:13,640 --> 00:40:14,330 to this neuron. 1173 00:40:14,330 --> 00:40:16,640 Treat this as a basis to explain that neuron. 1174 00:40:16,640 --> 00:40:19,946 And then test the predictive power on the held out IT data. 1175 00:40:19,946 --> 00:40:21,320 And that's what I'm writing here. 1176 00:40:21,320 --> 00:40:23,470 That's cross-validation linear regression. 1177 00:40:23,470 --> 00:40:25,850 So I'm going to show you predictions on held out data 1178 00:40:25,850 --> 00:40:28,880 where some of the data were used to make the mapping. 1179 00:40:28,880 --> 00:40:31,217 And there's lots of ways we chose-- 1180 00:40:31,217 --> 00:40:32,300 we could make the mapping. 1181 00:40:32,300 --> 00:40:33,934 And we did essentially all of them. 1182 00:40:33,934 --> 00:40:35,600 And I could talk about that if you want. 1183 00:40:35,600 --> 00:40:37,130 But that's this central idea. 1184 00:40:37,130 --> 00:40:40,010 Take some of your data, say, is this in the linear space 1185 00:40:40,010 --> 00:40:41,450 spanned by this basis set? 1186 00:40:41,450 --> 00:40:44,890 So I can I fit that well with this linear basis here? 1187 00:40:44,890 --> 00:40:47,000 As a linear map from this basis? 1188 00:40:47,000 --> 00:40:49,500 And here's what we actually-- here's what it looks like. 1189 00:40:49,500 --> 00:40:53,470 Here's the IT neural response of one simulated-- one actual IT 1190 00:40:53,470 --> 00:40:54,710 neuron in black. 1191 00:40:54,710 --> 00:40:55,880 This is not time. 1192 00:40:55,880 --> 00:40:57,260 These are images. 1193 00:40:57,260 --> 00:40:59,174 I think there's like 1,600 images here. 1194 00:40:59,174 --> 00:41:01,340 So each black going up and down, you can barely see, 1195 00:41:01,340 --> 00:41:04,370 is the response, the mean response, to different images. 1196 00:41:04,370 --> 00:41:06,930 And you see we grouped them by categories, just so, 1197 00:41:06,930 --> 00:41:09,780 just to help you kind of understand the data. 1198 00:41:09,780 --> 00:41:11,805 Otherwise, it'd just be a big mess. 1199 00:41:11,805 --> 00:41:13,430 Because IT neurons do-- you can kind of 1200 00:41:13,430 --> 00:41:15,410 see they have a bit of category selectivity. 1201 00:41:15,410 --> 00:41:16,710 And again, this was known. 1202 00:41:16,710 --> 00:41:19,451 This neuron seems to like chair images, but not all chair 1203 00:41:19,451 --> 00:41:19,950 images. 1204 00:41:19,950 --> 00:41:23,060 It sometimes likes boats and some planes a little bit. 
1205 00:41:23,060 --> 00:41:26,270 And the red line is the prediction of the model, 1206 00:41:26,270 --> 00:41:28,042 once fit to part of the data for this neuron. 1207 00:41:28,042 --> 00:41:30,500 This is the prediction on the held-out data for the neuron. 1208 00:41:30,500 --> 00:41:32,690 You can see the R squared is 0.48. 1209 00:41:32,690 --> 00:41:35,150 So half the explainable response variance 1210 00:41:35,150 --> 00:41:37,050 is explained by this model. 1211 00:41:37,050 --> 00:41:39,350 And again, these are predictions. 1212 00:41:39,350 --> 00:41:41,360 The images were never seen-- 1213 00:41:41,360 --> 00:41:44,360 the objects even were never seen by this model 1214 00:41:44,360 --> 00:41:47,750 before it makes these predictions here. 1215 00:41:47,750 --> 00:41:50,810 So this is just saying that the IT neurons live in this space. 1216 00:41:50,810 --> 00:41:53,120 It's actually quite well captured by the top level, 1217 00:41:53,120 --> 00:41:55,170 in this case, of this first HMO model we built. 1218 00:41:55,170 --> 00:41:57,480 I'll show you some other models in a minute. 1219 00:41:57,480 --> 00:41:59,480 Here's another neuron that you might call a face 1220 00:41:59,480 --> 00:42:02,840 neuron because it tends to like faces over other categories. 1221 00:42:02,840 --> 00:42:04,340 So it might-- it would pass the test 1222 00:42:04,340 --> 00:42:06,410 of the operational definition of a face neuron. 1223 00:42:06,410 --> 00:42:09,615 This neuron was well predicted, again, 1224 00:42:09,615 --> 00:42:12,470 for both its preferred and non-preferred face images, 1225 00:42:12,470 --> 00:42:13,880 by this HMO model. 1226 00:42:13,880 --> 00:42:16,757 Again, an R squared near 0.5. 1227 00:42:16,757 --> 00:42:19,340 Here's a neuron where, if you look at the category structure, 1228 00:42:19,340 --> 00:42:20,332 you don't even-- 1229 00:42:20,332 --> 00:42:22,040 you can't really see the categories here. 1230 00:42:22,040 --> 00:42:23,060 They're still here. 1231 00:42:23,060 --> 00:42:25,010 But you don't see these sort of blocks. 1232 00:42:25,010 --> 00:42:27,050 You just see there are some images it likes and some 1233 00:42:27,050 --> 00:42:27,410 it doesn't. 1234 00:42:27,410 --> 00:42:29,530 It's hard to even know what's driving this neuron. 1235 00:42:29,530 --> 00:42:31,722 But it's actually quite well predicted, I think. 1236 00:42:31,722 --> 00:42:32,930 You don't have the R squared here. 1237 00:42:32,930 --> 00:42:33,800 But it's similar. 1238 00:42:33,800 --> 00:42:35,990 It's about half the explainable variance. 1239 00:42:35,990 --> 00:42:37,490 Just another example. 1240 00:42:37,490 --> 00:42:39,140 And here is a sort of summary. 1241 00:42:39,140 --> 00:42:40,700 This is a distribution, 1242 00:42:40,700 --> 00:42:43,730 across-- I think this is 168 IT sites-- of how much 1243 00:42:43,730 --> 00:42:46,850 of the explainable variance the top level of the model captures. 1244 00:42:46,850 --> 00:42:48,950 Some sites are fit really well, near 100%. 1245 00:42:48,950 --> 00:42:50,390 Some are fit not as well. 1246 00:42:50,390 --> 00:42:53,040 The average is about 50%, which is shown here. 1247 00:42:53,040 --> 00:42:55,890 So this is the median of that distribution here. 1248 00:42:55,890 --> 00:42:58,400 So the summary take-home is about 50% 1249 00:42:58,400 --> 00:43:00,277 of single-site response variance predicted.
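The mapping behind those numbers is the cross-validated linear regression described a moment ago: fit a linear combination of the model's top-level features to one IT site using part of the images, then score the prediction on held-out images. A minimal sketch, with random arrays standing in for the model features and the recorded firing rates:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

n_images, n_model_units = 1600, 1000
model_features = np.random.randn(n_images, n_model_units)  # model responses per image
it_site = np.random.randn(n_images)                         # one site's mean rate per image

# Fit the linear map on part of the images, score it on the held-out part.
X_train, X_test, y_train, y_test = train_test_split(
    model_features, it_site, test_size=0.25, random_state=0)

mapping = Ridge(alpha=1.0).fit(X_train, y_train)   # regularized linear regression
r2 = r2_score(y_test, mapping.predict(X_test))     # predictivity on held-out images
print(f"held-out R^2 for this site: {r2:.2f}")
```

In the analyses he is describing, that held-out R squared is reported relative to each site's explainable variance, i.e., the part of its response that is reliable across repeated presentations.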
1250 00:43:00,277 --> 00:43:02,360 And this is a big improvement over previous models, 1251 00:43:02,360 --> 00:43:03,600 as I'll show you in a minute. 1252 00:43:03,600 --> 00:43:06,020 The other levels of the model don't predict nearly as well. 1253 00:43:06,020 --> 00:43:07,550 So the first level doesn't predict well. 1254 00:43:07,550 --> 00:43:09,216 Second level better, third level better, 1255 00:43:09,216 --> 00:43:10,550 the fourth level the best. 1256 00:43:10,550 --> 00:43:13,216 If you take other models-- these are some of the models I showed 1257 00:43:13,216 --> 00:43:13,880 you earlier-- 1258 00:43:13,880 --> 00:43:16,110 they don't fit nearly as well. 1259 00:43:16,110 --> 00:43:18,440 Here are their distributions and here's their average, 1260 00:43:18,440 --> 00:43:20,240 their median explained variance. 1261 00:43:20,240 --> 00:43:24,120 And just to fix ideas, you might think, 1262 00:43:24,120 --> 00:43:27,080 well look, we built a model that's a good categorizer. 1263 00:43:27,080 --> 00:43:28,730 So of course it fits IT neurons well. 1264 00:43:28,730 --> 00:43:30,260 Because IT neurons are categorizers. 1265 00:43:30,260 --> 00:43:32,954 Well, here's a model that actually has explicit knowledge 1266 00:43:32,954 --> 00:43:33,620 of the category. 1267 00:43:33,620 --> 00:43:35,210 It's not an image-computable model, 1268 00:43:35,210 --> 00:43:36,590 and it's not an easy one. 1269 00:43:36,590 --> 00:43:38,360 It's just a sort of oracle 1270 00:43:38,360 --> 00:43:41,460 that's given the category, and we ask how well it explains IT. 1271 00:43:41,460 --> 00:43:43,490 And you can see, it explains IT much worse 1272 00:43:43,490 --> 00:43:44,600 than the actual model. 1273 00:43:44,600 --> 00:43:47,930 So this implies the model is limited by-- 1274 00:43:47,930 --> 00:43:50,570 the architecture puts constraints on the model, 1275 00:43:50,570 --> 00:43:53,680 and it adds variance that the statement "IT 1276 00:43:53,680 --> 00:43:56,290 neurons are categorizers" does not easily capture. 1277 00:43:56,290 --> 00:44:00,320 So that kind of-- 1278 00:44:00,320 --> 00:44:02,836 that sort of inspired us to say, OK. 1279 00:44:02,836 --> 00:44:04,710 What about if we go down and say not just IT, 1280 00:44:04,710 --> 00:44:05,650 but let's go to V4? 1281 00:44:05,650 --> 00:44:07,600 Because we had a bunch of V4 data. 1282 00:44:07,600 --> 00:44:09,280 And so we play the same game in V4. 1283 00:44:09,280 --> 00:44:12,130 Let's take level three and see if we can predict V4. 1284 00:44:12,130 --> 00:44:14,710 And here's the IT data I just showed you a minute ago. 1285 00:44:14,710 --> 00:44:16,270 And here's the V4 data. 1286 00:44:16,270 --> 00:44:19,630 So the V4 neurons are highly predicted by the middle layer. 1287 00:44:19,630 --> 00:44:21,700 Layer three is the best predictor of V4. 1288 00:44:21,700 --> 00:44:24,400 The top layer is actually not so predictive, less predictive 1289 00:44:24,400 --> 00:44:26,664 of V4 neurons than the middle layers. 1290 00:44:26,664 --> 00:44:28,580 And the first layer is not so predictive either. 1291 00:44:28,580 --> 00:44:30,288 And again, the other models-- 1292 00:44:30,288 --> 00:44:32,830 now you can see they're doing relatively better. 1293 00:44:32,830 --> 00:44:35,740 You can think of them as sort of lower level models. 1294 00:44:35,740 --> 00:44:38,390 And they're getting better, which is what you'd expect. 1295 00:44:38,390 --> 00:44:41,230 But interestingly, this is really exciting to us.
1296 00:44:41,230 --> 00:44:44,080 Because look, this model was not optimized 1297 00:44:44,080 --> 00:44:47,200 to fit any neural data other than that last mapping step. 1298 00:44:47,200 --> 00:44:49,300 All it is is a bio inspired algorithm class, 1299 00:44:49,300 --> 00:44:51,610 which is the neuroscience sort of view 1300 00:44:51,610 --> 00:44:54,550 of the feed-forward class of the field. 1301 00:44:54,550 --> 00:44:56,867 And tasks that we and others hypothesize 1302 00:44:56,867 --> 00:44:58,450 are important, that the ventral stream 1303 00:44:58,450 --> 00:45:01,810 might be optimized to solve, and an actual optimization 1304 00:45:01,810 --> 00:45:03,640 procedure that we applied. 1305 00:45:03,640 --> 00:45:07,000 And that leads to neural like encoding functions at the top 1306 00:45:07,000 --> 00:45:08,200 and in the middle layer. 1307 00:45:08,200 --> 00:45:11,380 So you don't-- so this sort of leads to funny things like 1308 00:45:11,380 --> 00:45:13,150 saying, what does V4 do? 1309 00:45:13,150 --> 00:45:14,650 The answer here would be, well, it's 1310 00:45:14,650 --> 00:45:17,020 an intermediate layer in a network built 1311 00:45:17,020 --> 00:45:18,410 to optimize these things. 1312 00:45:18,410 --> 00:45:19,960 That's the way to describe what V4 1313 00:45:19,960 --> 00:45:22,840 does, according to this kind of modeling approach. 1314 00:45:22,840 --> 00:45:24,880 Now I want to point out, this is only half 1315 00:45:24,880 --> 00:45:26,046 of the explainable variance. 1316 00:45:26,046 --> 00:45:27,252 So it's far from perfect. 1317 00:45:27,252 --> 00:45:28,460 There's room to improve here. 1318 00:45:28,460 --> 00:45:30,793 But it's really dramatic how much improvement we got out 1319 00:45:30,793 --> 00:45:32,400 of these kind of models. 1320 00:45:32,400 --> 00:45:34,670 And so if you take this sort of-- 1321 00:45:34,670 --> 00:45:35,870 well, I'll skip this. 1322 00:45:35,870 --> 00:45:38,650 If you take this back to you know, big picture, 1323 00:45:38,650 --> 00:45:40,000 what did we do here? 1324 00:45:40,000 --> 00:45:41,710 What we're doing is we have performance 1325 00:45:41,710 --> 00:45:43,793 of a model on high end variance recognition tasks. 1326 00:45:43,793 --> 00:45:46,474 We're saying, this is what we've been trying to optimize. 1327 00:45:46,474 --> 00:45:47,890 And what we noticed is that if you 1328 00:45:47,890 --> 00:45:50,852 plot-- these dots are samples out of that model family. 1329 00:45:50,852 --> 00:45:52,810 These black dots are other models I showed you. 1330 00:45:52,810 --> 00:45:55,600 So they're control models that were in the field at the time. 1331 00:45:55,600 --> 00:45:57,400 And this is the ability of the top-- 1332 00:45:57,400 --> 00:45:59,500 the model-- the top level of any of the models 1333 00:45:59,500 --> 00:46:01,120 to predict IT responses. 1334 00:46:01,120 --> 00:46:03,450 So, you know, how good they are predicting-- 1335 00:46:03,450 --> 00:46:06,160 this is sort of the median variance explained of single IT 1336 00:46:06,160 --> 00:46:06,935 responses. 1337 00:46:06,935 --> 00:46:08,560 And you see there's a correlation here. 1338 00:46:08,560 --> 00:46:11,140 If you're better at this, you're better at predicting that. 1339 00:46:11,140 --> 00:46:13,279 And all we did was optimize this way, 1340 00:46:13,279 --> 00:46:15,445 which we think of as like, evolution or development. 1341 00:46:15,445 --> 00:46:16,820 So we're not fitting neural data. 
1342 00:46:16,820 --> 00:46:19,180 We're just optimizing for task performance. 1343 00:46:19,180 --> 00:46:21,850 And that led in 2012 to the model that I just showed you, 1344 00:46:21,850 --> 00:46:24,277 which explained about half of the IT response variance. 1345 00:46:24,277 --> 00:46:26,110 OK, so it's like, well, this looks like it's 1346 00:46:26,110 --> 00:46:27,670 continuing up this way. 1347 00:46:27,670 --> 00:46:31,930 OK, so if you believe that story, then that says, 1348 00:46:31,930 --> 00:46:35,200 if we can optimize further on these kinds of tasks, 1349 00:46:35,200 --> 00:46:37,649 maybe we can explain more variance. 1350 00:46:37,649 --> 00:46:39,190 And it turned out we didn't actually 1351 00:46:39,190 --> 00:46:41,290 need to do that, because, again, as I said, 1352 00:46:41,290 --> 00:46:43,804 computer vision was already working on this. 1353 00:46:43,804 --> 00:46:45,220 And they've got a lot more resources. 1354 00:46:45,220 --> 00:46:46,090 They're already doing it. 1355 00:46:46,090 --> 00:46:47,714 They're already better than us on this. 1356 00:46:47,714 --> 00:46:49,137 So here's our HMO model. 1357 00:46:49,137 --> 00:46:51,220 This is now Charles Cadieu, a post-doc in the lab. 1358 00:46:51,220 --> 00:46:52,600 These were models that came out at the time. 1359 00:46:52,600 --> 00:46:54,183 This is Krizhevsky et al.-- SuperVision. 1360 00:46:54,183 --> 00:46:56,200 It's ICLR 2013. 1361 00:46:56,200 --> 00:46:58,287 They were better than the model that we had built. 1362 00:46:58,287 --> 00:47:00,370 You know, we were in this restricted image domain; 1363 00:47:00,370 --> 00:47:01,930 you know, there's lots of reasons why 1364 00:47:01,930 --> 00:47:03,096 we could say they're better. 1365 00:47:03,096 --> 00:47:06,070 Regardless, they were better at our own tasks than the models 1366 00:47:06,070 --> 00:47:07,300 that we had built, right? 1367 00:47:07,300 --> 00:47:09,490 So they were already ahead of us on the task 1368 00:47:09,490 --> 00:47:10,990 that we had designed. 1369 00:47:10,990 --> 00:47:13,199 And so they were up here, and then they were up here. 1370 00:47:13,199 --> 00:47:14,781 And so, if you follow that prediction, 1371 00:47:14,781 --> 00:47:16,900 that means these models might be better predictors 1372 00:47:16,900 --> 00:47:17,890 of our neural data, right? 1373 00:47:17,890 --> 00:47:19,240 These guys don't have our neural data. 1374 00:47:19,240 --> 00:47:21,740 All they're doing is building models to optimize performance 1375 00:47:21,740 --> 00:47:22,370 on tasks. 1376 00:47:22,370 --> 00:47:24,970 But we could take their features, apply them to our neural data, 1377 00:47:24,970 --> 00:47:25,930 and play the same game. 1378 00:47:25,930 --> 00:47:29,020 And their features actually explained our data 1379 00:47:29,020 --> 00:47:31,060 better than our own model explained our own data. 1380 00:47:31,060 --> 00:47:34,120 So this is a nice statement, because it's not even from our own lab: 1381 00:47:34,120 --> 00:47:36,980 just continued optimization for those kinds of tasks 1382 00:47:36,980 --> 00:47:41,180 leads to features that are good predictors of the IT responses. 1383 00:47:41,180 --> 00:47:42,560 And that's what's shown here. 1384 00:47:42,560 --> 00:47:45,970 So I think that's what I just said there. 1385 00:47:45,970 --> 00:47:49,120 So, Charles took this further and analyzed 1386 00:47:49,120 --> 00:47:50,597 this in more detail.
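"Take their features and play the same game" is easy to sketch today, because networks of that family ship with standard libraries. Assuming a recent torchvision (the pretrained AlexNet-style model here is only a stand-in for the published networks he mentions), and with random placeholders for the images and the neural responses, the procedure is just feature extraction followed by the same cross-validated regression:

```python
import numpy as np
import torch
from torchvision.models import alexnet
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

net = alexnet(weights="IMAGENET1K_V1").eval()    # a publicly available pretrained CNN

images = torch.randn(50, 3, 224, 224)            # stand-in for the test images
with torch.no_grad():
    feats = net.features(images)                  # last convolutional activations
    feats = torch.flatten(net.avgpool(feats), 1).numpy()

it_site = np.random.randn(50)                     # stand-in for one IT site's mean rates

# Same game as before: cross-validated linear regression from features to rates.
scores = cross_val_score(Ridge(alpha=1.0), feats, it_site, cv=5, scoring="r2")
print("cross-validated R^2:", scores.mean())
```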
1387 00:47:50,597 --> 00:47:52,930 This is a summary of what I presented in the second half 1388 00:47:52,930 --> 00:47:56,080 now, showing that IT-firing-rate-based, 1389 00:47:56,080 --> 00:47:58,420 learned object judgments naturally predict human and monkey 1390 00:47:58,420 --> 00:47:58,919 performance. 1391 00:47:58,919 --> 00:48:00,820 That's the LaWS of RAD IT idea. 1392 00:48:00,820 --> 00:48:02,530 I picked a particular model, which 1393 00:48:02,530 --> 00:48:05,730 is a read of a 100 millisecond time window, 50,000 neurons, 1394 00:48:05,730 --> 00:48:07,150 100 training examples. 1395 00:48:07,150 --> 00:48:12,220 That's one particular choice of a decode model-- that's just 1396 00:48:12,220 --> 00:48:17,230 the current sort of decode model that fits a lot of our data, 1397 00:48:17,230 --> 00:48:18,230 but not all of our data. 1398 00:48:18,230 --> 00:48:20,620 And we also want to get finer-grained data. 1399 00:48:20,620 --> 00:48:22,960 The inference is, this might be the specific neural code 1400 00:48:22,960 --> 00:48:25,102 and decoding mechanism that the brain uses 1401 00:48:25,102 --> 00:48:26,060 to support these tasks. 1402 00:48:26,060 --> 00:48:28,174 That's what we'd like to think. 1403 00:48:28,174 --> 00:48:30,340 But now, we're trying to do systematic causal tests. 1404 00:48:30,340 --> 00:48:32,590 And we talked a lot about trying to silence bits of IT 1405 00:48:32,590 --> 00:48:33,810 as one example of that. 1406 00:48:33,810 --> 00:48:37,070 And the tools are still not where we'd like them to be. 1407 00:48:37,070 --> 00:48:39,800 But you see we're making progress there. 1408 00:48:39,800 --> 00:48:43,021 The second thing was, I showed that optimization of deep CNN models 1409 00:48:43,021 --> 00:48:44,770 for invariant object recognition tasks led 1410 00:48:44,770 --> 00:48:46,420 to dramatic improvements in our ability 1411 00:48:46,420 --> 00:48:47,950 to predict IT and V4 responses. 1412 00:48:47,950 --> 00:48:49,150 I showed you our model, HMO. 1413 00:48:49,150 --> 00:48:51,700 But then the convolutional neural networks in the field 1414 00:48:51,700 --> 00:48:53,810 have already surpassed our predictive ability 1415 00:48:53,810 --> 00:48:56,320 on our own data. 1416 00:48:56,320 --> 00:48:58,570 And so the inference is that the encoding mechanisms 1417 00:48:58,570 --> 00:49:00,486 in these models might be similar to those that 1418 00:49:00,486 --> 00:49:02,040 work in the ventral stream. 1419 00:49:02,040 --> 00:49:04,096 And now, you know, there's a whole sort of area 1420 00:49:04,096 --> 00:49:06,220 where you can start to think about doing physiology 1421 00:49:06,220 --> 00:49:07,516 on the models, so to speak. 1422 00:49:07,516 --> 00:49:08,890 And that problem's almost as hard 1423 00:49:08,890 --> 00:49:10,600 as doing physiology on the animal, 1424 00:49:10,600 --> 00:49:13,000 except that you can gather a lot more data. 1425 00:49:13,000 --> 00:49:14,890 And this is allowing the field 1426 00:49:14,890 --> 00:49:17,080 to design experiments to explore what remains-- 1427 00:49:17,080 --> 00:49:20,050 what's unique and powerful about primate object perception. 1428 00:49:20,050 --> 00:49:21,520 Working within core object recognition, 1429 00:49:21,520 --> 00:49:23,560 or perhaps having to extend out of that, 1430 00:49:23,560 --> 00:49:26,320 is, I think, now what people are trying to do.
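To pin down the decoding model summarized at the start of this recap: mean firing rates in a roughly 100 millisecond window feed a linear readout (a learned weighted sum) trained from a limited number of labeled examples per task. A minimal sketch with random stand-in data and far fewer neurons than the numbers quoted above; the particular classifier here (logistic regression) is just one convenient linear readout, not necessarily the one used in the published work.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

n_neurons = 500                    # far fewer than the ~50,000 extrapolated in the talk
n_train, n_test = 100, 400         # ~100 training examples per task, as in the summary

rng = np.random.default_rng(0)
spike_counts = rng.poisson(5.0, size=(n_train + n_test, n_neurons))  # counts in a 100 ms window
rates = spike_counts / 0.1                                           # mean rates (spikes/s)
labels = rng.integers(0, 2, size=n_train + n_test)                   # e.g. car vs. not-car

# A learned weighted sum of the rates: fit on the training trials,
# then read out the task on held-out trials.
decoder = LogisticRegression(max_iter=1000).fit(rates[:n_train], labels[:n_train])
accuracy = decoder.score(rates[n_train:], labels[n_train:])
print(f"held-out accuracy of the learned weighted sum: {accuracy:.2f}")
```

With real IT recordings in place of the random counts, the held-out accuracy of this kind of readout is the quantity that is compared against human and monkey performance.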
1431 00:49:26,320 --> 00:49:28,570 So, big picture, in terms of where we go in the future: 1432 00:49:28,570 --> 00:49:30,850 I've talked about this LaWS of RAD IT idea. 1433 00:49:30,850 --> 00:49:32,650 Can we perturb here and get effects here 1434 00:49:32,650 --> 00:49:33,940 that are predictable? 1435 00:49:33,940 --> 00:49:36,770 Can we make predictions for each image, for the coding model, 1436 00:49:36,770 --> 00:49:39,130 and for the optical manipulations? 1437 00:49:39,130 --> 00:49:40,840 We talked about that. 1438 00:49:40,840 --> 00:49:42,400 Dynamics and feedback are something 1439 00:49:42,400 --> 00:49:43,510 that we're interested in, 1440 00:49:43,510 --> 00:49:45,310 but that I haven't talked much about at all. 1441 00:49:45,310 --> 00:49:48,850 I think that's a good point-- a discussion topic. 1442 00:49:48,850 --> 00:49:51,160 I can tell you how we're thinking about it. 1443 00:49:51,160 --> 00:49:54,100 We have some efforts in that regard. 1444 00:49:54,100 --> 00:49:56,380 I talked on the encoding side about these kinds 1445 00:49:56,380 --> 00:49:59,050 of deep convolutional networks that map from images. 1446 00:49:59,050 --> 00:50:01,570 But the dashed lines mean they're only about 50% predictive. 1447 00:50:01,570 --> 00:50:04,570 In both of these cases, they're not perfect, right? 1448 00:50:04,570 --> 00:50:06,380 So there's work to be done there. 1449 00:50:06,380 --> 00:50:07,930 And one of the really exciting things 1450 00:50:07,930 --> 00:50:09,377 here is how these models learn. 1451 00:50:09,377 --> 00:50:11,210 This supervised way of learning these models 1452 00:50:11,210 --> 00:50:13,480 is almost surely not what's going on in the brain. 1453 00:50:13,480 --> 00:50:16,930 So finding more-- less supervised, biologically 1454 00:50:16,930 --> 00:50:18,880 motivated learning of these models 1455 00:50:18,880 --> 00:50:22,660 is the next step, I think, for much of the field. 1456 00:50:22,660 --> 00:50:24,620 But what's nice is to have an end state that 1457 00:50:24,620 --> 00:50:27,930 is much better than any previous end state we'd had before. 1458 00:50:27,930 --> 00:50:32,080 So that sets a target of what success might look like. 1459 00:50:32,080 --> 00:50:34,240 And, you know, maybe we can think about expanding 1460 00:50:34,240 --> 00:50:35,880 beyond core recognition. 1461 00:50:35,880 --> 00:50:37,920 We can talk in the question period about that. 1462 00:50:37,920 --> 00:50:39,940 When is the right time to keep 1463 00:50:39,940 --> 00:50:42,640 working within the domain of core recognition as 1464 00:50:42,640 --> 00:50:45,042 it's set up, versus expanding beyond that? 1465 00:50:45,042 --> 00:50:46,750 Because there are lots of aspects of object 1466 00:50:46,750 --> 00:50:48,790 recognition that I didn't touch on here. 1467 00:50:48,790 --> 00:50:50,830 And that comes up in the questions. 1468 00:50:50,830 --> 00:50:54,190 I think there's lots of work to be done within the domain, 1469 00:50:54,190 --> 00:50:56,590 but there are also interesting directions that 1470 00:50:56,590 --> 00:50:59,070 extend outside of that domain.