The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

TOMASO POGGIO: So I'll speak about i-theory, visual cortex, and deep learning networks.

The background for this is the conceptual framework that we take as a guide for presenting work in vision in this center: the idea that there is a phase in visual perception, essentially up to the first saccade--say, 100 milliseconds from the onset of an image--in which most of the processing in the visual cortex is feedforward, and that top-down signals--I hate the term feedback in this case, but back projections going from higher visual areas, like inferotemporal cortex, back to V2 and other cortical areas--are not active in this first hundred milliseconds.

Now, all of this is a conjecture based on a body of data, so it has to be proven. For us it is just a motivation, a guide, to first study feedforward processing in, as I said, the first 100 milliseconds or so, and to think that other types of theory--generative models, the probabilistic inference you have heard about, the visual routines you have heard about from Shimon--are important not so much in the first 100 milliseconds but later on, especially when feedback through back projections, but also through movements of the eyes that acquire new images depending on the first one you have seen, comes into play.

OK. This is just to motivate feedforward. And of course, the evidence I refer to is evidence like--you have heard from Jim DiCarlo, from the physiology there is quite a bit of data showing that neurons in IT become active and selective for what is in the image about 80 or 90 milliseconds after onset of the stimulus. And this basically implies that there are no big feedback loops from one area to another one.
It takes 40 milliseconds to get to V1, and 10 milliseconds or so for each of the next areas.

So the problem is computational vision--the guy on the left is David Marr. And here is really where, most probably, a lot of object recognition takes place: the ventral stream, from V1 to V2, V4, and the IT complex.

So that's the back of the head. As I said, it takes 40 milliseconds for electrical signals to come from the eye in the front, through the LGN, back to neurons in V1--simple and complex cells. And then signals go from the back to the front; that's the feedforward part.

And on the bottom right, you have seen this picture already. This is from Van Essen, edited recently by Movshon. The sizes of the areas and the sizes of the connections are roughly proportional to the number of neurons and fibers. So you see that V1 is as big as V2; they both have about 200 million neurons. And V4 is about 50 million, and the inferotemporal complex is probably 100 million or so. Our brain is about one million flies. A fly is around 300,000 neurons or so. A bee is one million.

And as I think Jim DiCarlo mentioned, there are these models that have been developed since Hubel and Wiesel--so that's '59--that try to model feedforward processing from V1 to IT. And they start with simple and complex cells, S1 and C1: simple cells being essentially equivalent to Gabor filters, oriented Gabor filters at different positions and different orientations; and then complex cells that put together the signals from simple cells with the same orientation preference but different positions, and so have somewhat more position tolerance than simple cells. And then a repetition of this basic scheme, with S2 cells representing more complex--let's call them features--than lines, maybe combinations of lines, and then C2 cells again pooling together cells of the same preference in order to get more invariance to position.
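As a rough illustration of the S1/C1 stage just described, here is a minimal sketch, assuming a small bank of oriented Gabor filters for S1 and local max pooling for C1 (the sizes, orientations, and the choice of max are illustrative, not the parameters of any particular published model):

```python
import numpy as np
from scipy.signal import convolve2d

def gabor(size=11, wavelength=5.0, sigma=3.0, theta=0.0):
    # An oriented Gabor patch: the standard model of a V1 simple-cell receptive field.
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xr / wavelength)
    return g - g.mean()

def s1(image, thetas=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    # S1: convolve the image with Gabor filters at several orientations.
    return [convolve2d(image, gabor(theta=t), mode="same") for t in thetas]

def c1(s1_maps, pool=8):
    # C1: max-pool each orientation map over local windows, gaining position tolerance.
    out = []
    for m in s1_maps:
        h, w = (m.shape[0] // pool) * pool, (m.shape[1] // pool) * pool
        blocks = m[:h, :w].reshape(h // pool, pool, w // pool, pool)
        out.append(blocks.max(axis=(1, 3)))
    return out

image = np.random.default_rng(0).standard_normal((64, 64))
print([m.shape for m in c1(s1(image))])  # four 8x8 orientation maps
```

S2/C2 repeat the same convolve-then-pool pattern on the C1 maps, with stored or learned patches in place of the Gabors.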
And there is evidence from the old work of Hubel and Wiesel about simple and complex cells in V1--so S1 and C1--although the morphological identity of complex and simple cells is still an open question; you know, which specific cells they are. We can discuss that later. But for the rest, this hierarchy continuing in other areas, like V2 and V4 and IT, is a conjecture in models like this.

And we, like others before us, modeled these different areas--V1, V2, and V4--with this kind of model, about 15 years ago. And the reason to do so was not really to do object recognition; it was to try to see whether we could get the physiological properties of cells in the different areas in such a feedforward model--the ones that people had recorded from and published about. And we could do that, reproduce those properties. Of course, some of them we put in: properties of simple and complex cells. But others, like how much invariance to position there was at the top level, we got out of the model, consistent with the data.

One surprising thing that we had with this model was that, although it was not designed in order to perform well at object recognition, it did actually work pretty well. So the kind of thing you have to think about here is rapid categorization. You have seen that already. And the task is, for each image, is there an animal or not? And you can kind of get the feeling that you can do that. In the real experiment, you have an image and then a mask, another image. And then you say yes, there is an animal, or no, there is not.

This is called rapid categorization. It was introduced by Molly Potter, and more recently Simon Thorpe in France used it. And it's a way to force the observer to work in a feedforward mode, because you don't have the time to move your eyes, to fixate. There is some evidence that the mask may stop the back projections from working.
So this is a situation in which you can compare human performance to these feedforward models, which are not a complete description of vision anyway, because they don't take into account different eye fixations and feedback and higher processes--like, as I said, probabilistic inference and routines, whatever happens, very likely, in normal vision, in which you have time to look around.

So in this case, this d-prime is a measure of performance, of how well you are doing this task. And you can see, first of all, the absolute performance--80% correct on a certain database. This task, animal versus no animal, is similar between the model and humans. And images that are difficult for people--like images in which there is a lot of clutter, or the animals are small--are also difficult for the model. And the easy ones are easy for both. So there is a correlation between models and humans.

This does not say that the model is correct, of course, but it gives a hint that models of this type capture something of what's going on in the visual pathway. And Jim DiCarlo spoke about a more sophisticated version of these feedforward models, including training with backpropagation, that gives pretty good results also in terms of agreement between neurons and units in the model.

So the question is why these models work. They're very simple, feedforward. It has been surprisingly difficult to understand why they work as well as they do. When I started to work on these kinds of things 15 years ago, I thought this kind of architecture would not work. But then they worked much better than I thought.

And if you believe deep learning these days, which I do--for instance, its performance on ImageNet--my guess is that they actually work better than humans, because the right comparison for humans on ImageNet would be the rapid categorization one. So you'd present images briefly, because that's what the models have--just one image, no chance of getting a second view.
179 00:13:23,950 --> 00:13:28,010 Anyway, that's a more complex discussion 180 00:13:28,010 --> 00:13:31,430 that has to do also with how to model 181 00:13:31,430 --> 00:13:36,025 the fact that in our eyes, in our cortex, 182 00:13:36,025 --> 00:13:39,270 every-- solution depends on eccentricity. 183 00:13:39,270 --> 00:13:43,610 It's a pretty rapidly decaying resolution 184 00:13:43,610 --> 00:13:47,640 as you go away from the fovea, and has 185 00:13:47,640 --> 00:13:53,210 some significant implications for all these topics. 186 00:13:53,210 --> 00:13:55,880 I'll get to that. 187 00:13:55,880 --> 00:13:59,750 What I want to do today is, one way 188 00:13:59,750 --> 00:14:01,790 to look at this to try to understand 189 00:14:01,790 --> 00:14:04,640 how these kind of feedforward models work-- 190 00:14:07,750 --> 00:14:12,050 i-theory is based on trying to understand 191 00:14:12,050 --> 00:14:17,540 how models that are simple and complex cells 192 00:14:17,540 --> 00:14:21,860 and can be integrated in a hierarchical architecture 193 00:14:21,860 --> 00:14:31,070 can provide a signature set of features that 194 00:14:31,070 --> 00:14:36,980 are invariant to transformations observed during development, 195 00:14:36,980 --> 00:14:41,130 and at the same time keep selectivity. 196 00:14:41,130 --> 00:14:45,530 You don't lose any selectivity to different objects. 197 00:14:48,690 --> 00:14:53,710 And then I want to see what they say 198 00:14:53,710 --> 00:15:00,280 about deep convolutional learning networks, 199 00:15:00,280 --> 00:15:04,690 and look at some of the-- 200 00:15:04,690 --> 00:15:09,010 beginning with theory about deep learning. 201 00:15:09,010 --> 00:15:17,260 And then I want to look at a couple of predictions, 202 00:15:17,260 --> 00:15:21,025 particularly related to eccentricity-dependent 203 00:15:21,025 --> 00:15:24,210 resolution coming from i-theory, that 204 00:15:24,210 --> 00:15:28,810 are interesting for the sake of physics and modeling. 205 00:15:28,810 --> 00:15:32,260 And then it's basically garbage time, 206 00:15:32,260 --> 00:15:34,930 if you're interested in mathematical details 207 00:15:34,930 --> 00:15:41,110 and proofs of theorems and historical background. 208 00:15:41,110 --> 00:15:41,620 OK. 209 00:15:41,620 --> 00:15:43,890 Let's start with i-theory. 210 00:15:46,540 --> 00:15:49,030 These are the kind of things that we want, ideally, 211 00:15:49,030 --> 00:15:50,680 to explain. 212 00:15:50,680 --> 00:15:54,040 This is the visual cortex on the left. 213 00:15:54,040 --> 00:15:59,290 Models like HMAX, or feedforward models. 214 00:15:59,290 --> 00:16:02,770 And on the right are the deep learning 215 00:16:02,770 --> 00:16:08,050 convolutional networks, a couple of them, 216 00:16:08,050 --> 00:16:10,560 which basically have convolutional stage 217 00:16:10,560 --> 00:16:16,450 stages very similar to S1, and pooling stages similar to C1. 218 00:16:19,090 --> 00:16:22,502 But quite a lot of those layers. 219 00:16:28,350 --> 00:16:30,340 How many of you know about deep learning? 220 00:16:30,340 --> 00:16:31,120 Everybody, right? 221 00:16:35,440 --> 00:16:38,180 OK. 222 00:16:38,180 --> 00:16:43,400 These are the kind of questions that i-theory tries to answer-- 223 00:16:43,400 --> 00:16:47,810 why these hierarchies work well, what 224 00:16:47,810 --> 00:16:52,343 is really visual cortex, what is the goal of V1 to IT. 
We know a lot about simple and complex cells, but again, what is the computational goal of these simple and complex cells? Why do we have Gabor tuning in the early areas? And why do we have quite generic tuning in the first visual areas, but quite specific tuning to different types of objects, like faces and bodies, higher up?

The main hypothesis from which i-theory starts is that one of the main goals of the visual cortex--it's a hypothesis--is to compute a set of features, a representation of images, that is invariant to transformations that the organism has experienced--visual transformations--and that remains selective.

Now, why is invariance important? A lot of the problem of recognizing objects is the fact that I can see Rosalie's face once, and then the next time it's the same face, but the image is completely different, because it's much bigger now since I'm closer, or the illumination is different. So the pixels are different. And from one single object you can produce in this way--through translation, scaling, different illumination, viewpoint--thousands of different images.

So the intuition is that if I could get a computer description--say, a long vector of features of her face--that does not change under these transformations, recognition would be much easier. Easier means, especially, that I could learn to recognize an object with many fewer labeled examples.

Here on the right you have a very simple demonstration of what I mean, an empirical demonstration. At the bottom we have different cars and different planes. And there is a linear classifier which is trained directly on the pixels--a very stupid classifier. And you train it with one car and one plane--this is on the left--or two cars, two planes. And then you test on other images.
And as you can see, when it's trained with the bottom examples, which are at all kinds of viewpoints and sizes, the performance of the classifier in answering "is this a car or is this a plane" is 50%. It's chance. It does not learn at all.

On the other hand, suppose I have an oracle--which I will conjecture is visual cortex, essentially--that gives you, for each image, a feature vector which is invariant to these transformations. So it's like having the images of cars in this row B: they're all at the same position, same illumination, and so on, and the same for the planes. And I repeat this experiment. I use one pair--one car, one plane--to train, or two cars, two planes, and I see immediately that when tested on new images, this classifier is close to 90%. So much better.

So having an invariant representation can help a lot. That's the simple, empirical demonstration. And you can prove theorems saying the same thing: that if you have an invariant representation, you can have a much lower sample complexity, which means you need many fewer labeled examples to train a classifier to achieve a certain level of accuracy.

So how can you compute an invariant representation? There are many ways to do it. But I'll describe to you one which I think is attractive, because it's neurophysiologically very plausible.

The basic assumption I'm making here is that neurons are very slow devices. They don't do a lot of things well. One of the things they probably do best is high-dimensional dot products. And the reason is that you have a dendritic tree, and in cortical neurons you have between 1,000 and 10,000 synapses. So you have between 1,000 and 10,000 inputs. And each input gets essentially multiplied by the weight of the synapse, which can be changed during learning. It's plastic.
And then the post-synaptic depolarizations or hyperpolarizations--the electrical changes at the synapses--all get summated in the soma. So you have a sum over i: the x_i are your inputs, the w_i are your synapses. That's a dot product. And this happens automatically, within a millisecond. So it's one of the few things that neurons do well.

It's, I think, one of the distinctive features of the neurons of the brain relative to our electronic components, that each neuron, each unit in the brain, has about 10,000 wires getting in or out, whereas for the transistors or logical units in our computers the number of wires is more like three or four. So this is the assumption, that this kind of dot product is easy to do.

And so this suggests the following kind of algorithm for computing invariance. Suppose you are a baby in the cradle. You're playing with a toy--it's a bike--and you are rotating it, for instance. For simplicity; we'll do more complex things later. The unsupervised learning that you need to do at this point is just to store the movie of what happens to your toy. For instance, suppose you get a perfect rotation. This is the movie up there. There are eight frames. You store those, and you keep them forever.

All right. So when you see a new image--it could be Rosalie's face, or this fish--I want to compute a feature vector which is invariant to rotation, even if I've never seen the fish rotated. What I do is, I compute the dot product of the image of the fish with each one of the frames. So I get eight numbers. And the claim is that these eight numbers--not their order, but the numbers--are invariant to rotation of the fish. So if I see the fish now at a different rotation angle--suppose it's vertical--I'd still get the same eight numbers, in a different order, probably. That's what I said: they are invariant to rotation of the fish.
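A minimal numerical sketch of this memory-based scheme (not the actual implementation): to keep the transformation group exact on a discrete grid, it uses circular shifts of a 1-D vector in place of in-plane rotation, and random vectors stand in for the stored "bike" frames and the new "fish" image:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16                                      # toy 1-D "images" with D pixels
group = range(D)                            # the observed transformations: all circular shifts

template = rng.standard_normal(D)           # the toy you played with ("the bike")
frames = [np.roll(template, g) for g in group]   # the stored movie: one frame per transformation

def signature(image):
    # Dot product of the image with every stored frame; sorting throws away the order,
    # keeping only the (unordered) set of values.
    return np.sort([image @ f for f in frames])

fish = rng.standard_normal(D)               # a new object, never seen transformed
print(np.allclose(signature(fish), signature(np.roll(fish, 5))))  # True: same set of numbers
```

The same multiset of dot products comes out whatever the shift of the fish, which is exactly the claim made above for rotation.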
There are various quantities that you can use to represent compactly the fact that they are the same independent of rotation. For instance, the probability distribution--the histogram--of these values does not depend on the order. And so if you make a histogram, it should be independent of rotation, invariant to rotation. Or moments of the histogram, like the average, the variance, the moment of order infinity.

And the equation for computing a histogram is written there. You have the dot product of the image, the fish, with one template of the bike, the template t_k--you have several templates, not just one. And g_i is an element of the rotation group, so you get various rotations of it, simply because you have observed them. You don't need to know it's the rotation group; you don't need to compute that. These are just images that you have stored. And there can be different thresholds for the simple cells. And sigma could be just a threshold function, for instance--as it turns out, and I'll describe later, the nonlinearity can be almost anything. And the sum is the pooling. This is very robust to different choices of the nonlinearity and the pooling.

Here are some examples in which now the transformation you have observed for the bike is translation. And if I compute a histogram--from more than eight frames, in this case--I get the red histogram for the fish, and you can see the red histogram does not change, even if the image of the fish is translated. Same for the blue histogram, which is the set of features corresponding to the cat: it is also invariant to translation, but it is different from the red one. So these quantities, the histograms, can be invariant, of course, but also selective, which is what you want. In order to have a selectivity as high as you want, you need more than one template.
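Written out, the pooled quantity described above has, in the i-theory papers, roughly the form

$$\mu_n^k(I) \;=\; \frac{1}{|G|}\sum_{i=1}^{|G|}\sigma\!\big(\langle I,\, g_i t^k\rangle + n\Delta\big),$$

where $t^k$ is the $k$-th stored template, the $g_i$ are the observed transformations of it, $\sigma$ is the nonlinearity, and $n\Delta$ is the threshold of the $n$-th simple cell (the exact normalization varies between papers). Sweeping $n$ over a range of thresholds estimates the histogram of the dot products; a linear $\sigma$ gives the mean, and a max-type pooling corresponds to the moment of order infinity mentioned above.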
There are some results about how many templates you need. I can go into more detail on this, but essentially, you need a number of templates--templates like the bike, in our original example--that is logarithmic in the number of images you want to separate. For instance, suppose you want to be able to distinguish 1,000 faces, or 1,000 objects. Then the number of templates you need is on the order of log 1,000. So it does not increase very much.

Yeah. So there are two things, one of which you implied. The reason I spoke about rotation in the image plane is that rotation is a compact group. You never get out; you come back in. Translation, in principle, mathematically, goes from plus infinity to minus infinity. Of course that does not make sense physically, but mathematically it means that it's a little bit more difficult to prove the same results in the case of translation and scale. But we can do it. That's the first point. The second one is the combinatorics of different transformations. It turns out that one approach to this is to have what the visual system seems to have, in which you have relatively small ranges of invariance at different stages. So at the first stage, say in V1, you have pooling by the complex cells over a small range of translations, and probably scale. And then at the second stage you have a larger range. I'll come to that later. But it's a very interesting point.

I'll not go into this. These are technical extensions: to partially observable groups--these non-compact groups--to non-group transformations, the approximate invariance to rotations in 3D, or changes of expression, and so on; and then what happens when you have a hierarchy of these modules. I'll say briefly something about each one.
One is that if you look at the templates that give you simultaneous--so what we want to do is get scale and position invariance. And suppose you want the templates that maximize the simultaneous range of invariance to scale and position. It turns out that Gabor templates, Gabor filters, are the ones that do that. So that may be one computational reason why Gabor filters are a good thing to use in processing images.

For getting approximately good invariance to non-group transformations, you need to have some conditions. The main one is that the template must transform in a way similar to the objects for which you want to compute the invariance, like faces.

And these properties carry over to a hierarchy of modules. Think of this inverted triangle as a set of simple cells at the base and one complex cell, the red circle, at the top. So the architecture that we're looking at is simple-complex--this would be like V1--and next to it another simple-complex module; this is all V1. And then you have V2 in the second layer, which is getting its input from V1, and you repeat the same thing, but on the output of V1. This is exactly like a deep learning network. It's like visual cortex, where you have different stages and the effective receptive fields increase as you go up, as you see here. So this would be the increase in spatial pooling--so invariance--and also, as I mentioned, not drawn here, in scale: pooling over size, over scale.

And you can show that, if the following is true, that--let me see. Is this animated? No.

What you need to have--and a number of different networks, certainly the ones I described, have this property--is covariance. So suppose you have an object that translates in the image. OK.
What I need is that the neural activity--the red circles at the first level--also translates. This is covariance.

So what happens is the following. Suppose the object is smaller than those receptive fields--in this drawing it's as big, but suppose it's smaller. Then if the object translates within one of those receptive fields, going from one point to another, because each one has invariance to translations within its receptive field--it's pooling over them--translation within the receptive field will give the same output. You will have invariance right there.

But suppose you have one image, and then in the next one the object moves to a different receptive field, or gets out of the receptive field. Then you don't have invariance at the first layer. But if you have covariance--so the neural activity moves--then at the layer above you may have invariance, under that larger receptive field. In other words, in this construction, if you have this covariance property, then at some point in the network one of these receptive fields will be invariant.

Is that--

AUDIENCE: Can you explain that again?

TOMASO POGGIO: Yeah. The argument is--suppose I have an object like this. I have an image. And then I have another image in which the object is here. Obviously the response at this level--the response of this cell--will change, because before it saw this object, and now there are these other cells that see it. So the response has changed; you don't have invariance.

However, if you look at what happens, say, at the top red circle there: the top red circle will see some activity in the first image here, because it was activated by this, and in the second case it sees some activity over there, which should be equivalent. And under these receptive fields, translations will give rise to the same signature. Under this big receptive field, you have invariance for translation within it.
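A toy version of this covariance argument, again using circular shifts so the group is exact, with max pooling standing in for the complex-cell operation (the window size and the choice of max are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
D, win = 32, 8                           # image length; first-layer receptive-field size

def complex_layer(x, win):
    # One pooling cell per window of `win` positions (a row of first-layer "complex cells").
    return np.array([x[i:i + win].max() for i in range(0, len(x), win)])

image = np.zeros(D)
image[2:5] = rng.random(3)               # a small object inside the first receptive field

v1 = complex_layer(image, win)
v1_small = complex_layer(np.roll(image, 2), win)    # shift within the same window
v1_large = complex_layer(np.roll(image, win), win)  # shift into the neighboring window

print(np.allclose(v1, v1_small))   # True: invariance already at the first layer
print(np.allclose(v1, v1_large))   # False: activity moved to another unit (covariance)
print(np.isclose(v1.max(), v1_large.max()))  # True: a cell pooling over the whole first layer is invariant
```

The small shift is absorbed by a first-layer cell; the large shift only moves the pattern of first-layer activity, so a second-layer cell pooling over all of it still gives the same answer.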
So, to recap, the argument is that either you have invariance at one layer, because the object just moved within it, and then you are done--it's invariant, and everything above is invariant--or you don't have invariance at this layer, but you will have it at some layer above.

So in a sense--if you go back to this, I'll make this point later, but if you go back to this algorithm, the basic idea is that you want to have invariance to rotation, and so you average over the rotations. But suppose instead you want an estimate of rotation, and you're not interested in identity. Then what you do is, you don't pool over rotations; you pool over different objects at one rotation. So you can do both. All right?

AUDIENCE: My question was more physiological than theoretical.

TOMASO POGGIO: Yeah. Physiological--we had done experiments long ago in IT with Jim DiCarlo, Gabriel Kreiman. And from the same population of neurons, we could read out identity--object identity--invariant to scale and position. And we could also read out position, invariant to identity. And--

AUDIENCE: The same from the--

TOMASO POGGIO: Same population. I'm not saying the same neuron, but the same population of 200 neurons. And so you can imagine that you could have different situations. One could be that some of the neurons are only conveying position, and some others are completely invariant; and when you read out with a classifier, it will work. Or you have neurons that are already combining this information, because of the channels--either way.

OK, let me do this, and then we can take a break. I want to make the connection with simple and complex cells. We already mentioned this, but in this set of operations, you can think of this sigma of the dot product plus n delta as a simple cell. This is a dot product of the image with the receptive field of the simple cell.
603 00:41:19,790 --> 00:41:21,870 That's what this parenthesis is. 604 00:41:26,360 --> 00:41:30,860 You have a bias, or a threshold, and the nonlinearity. 605 00:41:30,860 --> 00:41:32,780 Could be the spiking nonlinearity. 606 00:41:32,780 --> 00:41:37,010 Could be, as I said, a rectifier. 607 00:41:37,010 --> 00:41:43,760 Neurons don't generate negative spikes. 608 00:41:43,760 --> 00:41:47,870 And so all of this is very plausible biologically. 609 00:41:47,870 --> 00:41:51,440 And the simple cell will simply pool, 610 00:41:51,440 --> 00:41:55,240 take the over the different simple cells. 611 00:42:02,670 --> 00:42:05,700 So that's what I mentioned before, that nonlinearity 612 00:42:05,700 --> 00:42:06,720 can be almost anything. 613 00:42:12,900 --> 00:42:14,900 And I want to mention something that could 614 00:42:14,900 --> 00:42:18,120 be interesting for physiology. 615 00:42:18,120 --> 00:42:21,120 From the point of view of this algorithm, 616 00:42:21,120 --> 00:42:25,970 this may be a solution to this problem that has been around 617 00:42:25,970 --> 00:42:33,050 for 30 years or so, which is that Hubel and Wiesel and other 618 00:42:33,050 --> 00:42:37,010 physiologists after them identified 619 00:42:37,010 --> 00:42:39,320 simple and complex cells in terms 620 00:42:39,320 --> 00:42:41,570 of their physiological properties. 621 00:42:41,570 --> 00:42:44,840 They couldn't see from where they are recording. 622 00:42:44,840 --> 00:42:48,470 But there were cells that behaved in different ways. 623 00:42:48,470 --> 00:42:52,400 The simple cells had the small receptive field. 624 00:42:52,400 --> 00:42:54,860 The complex cell had larger receptive field. 625 00:42:58,970 --> 00:43:01,550 The complex cells were more invariant. 626 00:43:01,550 --> 00:43:05,600 And then physiologists today are using 627 00:43:05,600 --> 00:43:07,850 criteria in which the complex cell is 628 00:43:07,850 --> 00:43:11,690 more non-linear than the simple cell. 629 00:43:11,690 --> 00:43:14,450 Now, from the point of view of the theory, 630 00:43:14,450 --> 00:43:17,830 the real difference is one is doing the pooling-- 631 00:43:17,830 --> 00:43:19,400 the complex cells. 632 00:43:19,400 --> 00:43:20,610 The simple cell is not. 633 00:43:23,840 --> 00:43:30,520 And the puzzle is that despite these physiological difference, 634 00:43:30,520 --> 00:43:36,070 they were never able to say this type of pyramidal cell 635 00:43:36,070 --> 00:43:40,850 is simple, and this type of pyramid cell are complex. 636 00:43:40,850 --> 00:43:46,440 And part of the reason could be that maybe simple and complex 637 00:43:46,440 --> 00:43:49,880 cells are the same cell. 638 00:43:49,880 --> 00:43:54,940 So that the operation can be done on the same cell. 639 00:43:54,940 --> 00:43:58,440 If you look at the theory, what may happen 640 00:43:58,440 --> 00:44:06,390 is that you have one dendrite play the roll of a simple cell. 641 00:44:06,390 --> 00:44:10,370 You have inputs, synaptic weights. 642 00:44:10,370 --> 00:44:13,360 So this could give rise, for instance, 643 00:44:13,360 --> 00:44:18,340 to the Gabor-like receptive field. 644 00:44:18,340 --> 00:44:25,490 And then-- these other dendrites to another simple cell. 645 00:44:25,490 --> 00:44:29,920 It's a Gabor-like in a slightly different position in the image 646 00:44:29,920 --> 00:44:33,220 plane, in the retina. 647 00:44:33,220 --> 00:44:37,050 You need the nonlinearities. 
And the nonlinearities may be, instead of at the output of the cell, the so-called voltage- and time-dependent conductances in the dendrites. In the meantime, we know that pyramidal cells in the visual cortex have these nonlinearities, almost to the point of having spike generation in the dendrites. And then the soma will summate everything; this is what the complex cell is doing.

And if one of the cells is computing something like an average, which is one of the moments of the distribution, then the nonlinearity will not even be needed. And then physiologists, using the criteria they use these days, would classify that cell as simple, even if, from the point of view of the theory, it's still complex.

Anyway, that's the proposed machinery that comes from the theory. That's everything that we need. And it would say simple and complex cells could be one and the same cell.
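Putting these pieces together, here is a minimal sketch of the simple-cell/complex-cell pair as the theory describes it (circular shifts again stand in for the observed transformations; the rectifier and the average are just two of the many admissible choices of nonlinearity and pooling):

```python
import numpy as np

rng = np.random.default_rng(2)
D = 16
template = rng.standard_normal(D)
frames = [np.roll(template, g) for g in range(D)]      # g_i t_k: the stored transformed templates

def simple_cell(image, frame, n_delta=0.0):
    # sigma(<I, g_i t_k> + n*Delta): dot product, threshold, rectifying nonlinearity.
    return max(image @ frame + n_delta, 0.0)

def complex_cell(image, frames, n_delta=0.0):
    # Pool (here, average) the simple cells that share a template but differ in g_i.
    return np.mean([simple_cell(image, f, n_delta) for f in frames])

fish = rng.standard_normal(D)
print(np.isclose(complex_cell(fish, frames),
                 complex_cell(np.roll(fish, 3), frames)))   # True: invariant output
```

Whether the dot products, the nonlinearity, and the pooling live in separate cells or in the dendrites and soma of a single pyramidal cell, as suggested above, the computation is the same.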