The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

EERO SIMONCELLI: I'm going to talk about a bunch of work that we've been doing over the last-- it's been about four years-- on trying to understand, basically, that terra incognita that Gabrielle just mentioned that lies between V1 and IT.

I brought this back with me from the Dolomites, where I was last week with my family. And when you sit and you look at it, and that image comes into your eyes and gets processed by your brain, there's a lot of information there. It's a lot of pixels. And the question that I'm going to start with is, where does it go? You have all this information. It's flooding into your eyes all day, every day, for your entire lifetime. Obviously, you don't store it all there. Your head doesn't inflate until it gets to the point of explosion. So where does it go?

And here's a theorist's diagram of the brain-- a square with rounded corners. In comes the information, and there are really only three options. You either act on the information-- do something with it, sensory-motor loops-- or, for complex organisms especially, a fair amount of it you might actually try to remember. You might hold on to it, and we heard about that earlier today. But this really only accounts for, I think, a fairly small portion of what goes on, because a lot of it you throw away. You have to. You really don't have a choice. You have to summarize it, squeeze it down to the relevant bits that you're going to hold on to or act on, and the rest of it you just toss.

So the question is, how can we exploit that basic fact? It's an obvious fact. It has to be true.
How do we exploit that to understand something about what the system does and what it doesn't do? And there's a long history to this. In fact, since I come from vision, and most of my work is centered on vision-- and the auditory system to some extent-- the vision scientists were the first to recognize the importance of this. And it really is a foundational chunk of work at the beginning of the field that set in motion a lot of things that we currently know about vision. So for those of you that don't know that story, I'm going to give a very, very brief reminder of what it is, because I think it's an absolutely fantastic scientific story. And then from there, I'll talk about texture. So two examples-- I'm going to quickly say something about trichromatic color vision, and then I'm going to talk about texture, and then we'll go into V2 and metamers and other things.

So trichromacy-- Newton figured out that light comes in wavelengths. He split light with a prism. There's the drawing of him splitting light coming in through a hole in the wall. He split it with a prism into wavelengths, saw a rainbow, did a lot of experiments to recognize that you could take that rainbow and reassemble it into white light, but you couldn't further subdivide it, and basically gave us the foundations for thinking about light and spectral distributions.

In the 1800s, a group of people who were physicists, mathematicians, and psychologists all rolled into one-- and there were quite a few of them; Helmholtz was one, and Grassmann was one of the most important ones, and I'll mention him again in a moment-- figured out something peculiar about human vision: that even though there was this huge array of colors in the wavelengths of the spectrum, humans actually had deficits-- we were not able to sense or discriminate things that it seemed like we should be able to.
And it boiled down in the end, after a lot of study and discussion and theorizing, to this experiment, which is known as a bipartite color matching experiment. So in this little display, here's a gray annulus, and in the middle is a circle. On the left side of this is light coming from some source. It has some spectral distribution, illustrated here. This is all a cartoon, but just to give you the idea of how this works. On the right side are three primary lights. And the job of the observer in this experiment is to adjust, let's say, sliders or knobs in order to change the intensity of these three lights, to make the light on the right side of this split circle look the same as the light on the left side.

And it turns out that-- so just to be clear, these three things have their own spectral distributions. They might look like that, for example. And when the observer comes up with the knob settings, they're going to produce something that might look like that. This is just a sum of three copies of these spectra, weighted by the knob settings. So this is a linear combination of three spectral distributions. And intentionally, I've drawn this so that they don't look the same, because that's the whole point of the experiment.

It turns out that you can do this experiment, and any human with normal color vision can make these settings so that these two things are absolutely indistinguishable. They look identical, and yet they knew, even in the mid-1800s, that these two things have very different spectra-- and I've drawn it that way intentionally. So the point is that even though we can see all the bands of the spectrum, we can see all the colors, we actually have this deficiency in terms of noticing the difference between these two things. So how can that be?
And I think and hope that most of you know the answer to that question, because you're using devices every day that exploit this fact. But the bottom line is that in the 1850s, Grassmann laid down a set of rules. Grassmann was a mathematician. He actually developed a large chunk of linear algebra in order to explain and understand and manipulate these ideas. He laid out a set of laws, and I won't drag you through all of them. But in the end, taking into account all of the evidence that he had, what those laws amounted to is that the human being, when setting these knobs, was acting like a linear system. The human was taking an input, which is a wavelength spectrum, and adjusting the knobs. And the settings of the knobs were a linear function of the wavelength spectrum that was coming into the eye.

And it's a remarkable and amazing fact: if you know that the brain is a highly non-linear device, how is it that a human can act like a linear device? And the answer is that basically the human, taking this thing in and making the knob settings, has a front end that's linear and is doing a projection of the wavelength spectrum onto basically three axes. And that process-- those three measurements-- is linear. Everything that happens after that is complicated and non-linear and involves noise and decisions and all kinds of motor control and everything else. But as long as the information in those original three measurements is not lost, the human is going to basically act like a linear system, in terms of doing this matching. So Grassmann realized this.
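To make that linear-front-end story concrete, here's a minimal sketch of the matching experiment as linear algebra. All the numbers are made up: `cones` stands in for the three cone sensitivity curves sampled at a set of wavelengths, `primaries` for the spectra of the three adjustable lights, and `test` for the light to be matched. The match is just the solution of a 3x3 linear system.

```python
import numpy as np

# Hypothetical discretized spectra: 31 wavelength samples (400-700 nm).
rng = np.random.default_rng(0)
cones = rng.random((3, 31))      # stand-in for L, M, S sensitivities
primaries = rng.random((31, 3))  # spectra of the three primary lights
test = rng.random(31)            # spectrum of the light to be matched

# The linear front end: each light is reduced to three cone responses.
target = cones @ test            # all the observer can ever sense

# Find knob settings w so the mixture of primaries produces the same
# three cone responses: (cones @ primaries) @ w = cones @ test.
w = np.linalg.solve(cones @ primaries, target)

# The two spectra are physically different, yet the cone responses match:
match = primaries @ w
assert not np.allclose(match, test)              # different spectra
assert np.allclose(cones @ match, cones @ test)  # a metamer
```

One detail worth noting: nothing forces the solved knob settings to be positive; in the real experiment, a negative setting corresponds to moving that primary over to the test side of the field.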
The theory that he set out, and that others then elaborated on, perfectly explained all the data for normal human subjects: lights could be created that appear identical but have physically distinct wavelength spectra, and they called these metamers-- two things that are physically different but look the same.

This was codified. It took many, many decades. Things moved slower back then. There were none of these rapid, Google-style overturns of the scientific establishment within a year or two. It took until the 1930s to actually build this into a set of standards that were used in the engineering community to generate and create color film, color devices, eventually color video, color monitors, color projectors, color printers-- everything else that we use. And these specifications were to allow the reproduction of colors so that they looked the way they were supposed to look. So you record color with a camera. It turns out that your camera is also only recording three color channels, just like your eye, and then you have to be able to re-render that on another device. And these standards specify how to do that.

The surprising thing in the whole story is-- so this is the 1850s. Well, we go back to Newton; that was the 1600s. Then in the 1850s, we're getting this beautiful theory that's very, very precise, and it gets built into engineering standards. And it's not until 1987 that it actually gets verified in a mechanistic sense. And I like to tell this story, because I think it's a reminder that always aiming for the reductionist solution is not necessarily the right thing to do. This is a very beautiful piece of science that was done at Stanford, actually, by Baylor, Nunn, and Schnapf. They took cones from a macaque-- I think originally they worked with turtles, but then macaque-- sucked them up into a glass micro-pipette, shined monochromatic lights through them, and measured their absorption properties.
And they found these three functions for three different types of cones, and verified, basically, that these three absorption spectra perfectly explained the data from the 1800s. So this is an amazing thing: you can have a theory and a set of behavioral experiments that make very precise and clear predictions, which then get verified and tested in a mechanistic sense more than 100 years later, and they come out basically perfect. So it's an astounding, astounding sequence, in my view.

So what we wanted to do is set out to do the same kind of thing for pattern vision. And we're going to do that by thinking about texture. So what's a texture? A texture is an image that's homogeneous, with repeated structures. So each of these is an example of texture. That's a piece of woven basket. This is tree bark. That's a herringbone pattern, and these are some sort of nuts or stones. And each of these has the property that there are lots of repeated elements with some variability. Sometimes there's more variability, sometimes there's less, but there's usually at least some.

And of course, these things are ubiquitous. When I started working on this, which is about 15 years ago-- maybe a little bit more, 16 years ago-- I started photographing things that I saw as I walked around, and textures are everywhere. Most things are textured. The world is not made up of Mondrians. It's not made up of things that are plain, blank colors separated by sharp edges. It's made up of textures, and often the boundaries between things are boundaries between textured objects, like the seats in the auditorium, for example.

So how is it that we can go about thinking about this in terms of metamers and representation in, let's say, the visual system? And the idea really comes from Julesz, who proposed in 1962 a famous theory that he later abandoned. The theory goes like this.
First of all, he said the thing that we're going to do to try to describe textures is use statistics. And why statistics? Because these things are supposed to be variable, so I need some stochasticity. But I also want something that's homogeneous, so I'm going to measure things averaged across the entire image. That's the statistical side of it. And he proposed that, well, if I start by measuring just pixel statistics-- say single-pixel statistics, pairwise pixel statistics, maybe triples of pixels-- eventually I should reach a point where I've made enough measurements to sufficiently constrain the texture, such that any two textures that have the same statistics up to that order, whatever that order is, should look the same to a human being. And he didn't talk about this in physiological terms, but I think in the background is the notion that humans are actually measuring those statistics, and if you can get them right-- if you can make two images have the same statistics, and that's the only thing that humans are measuring-- then those two images will look the same.

So Julesz goes ahead with this, and eventually constructs by hand-- because he did everything with binary patterns constructed by hand-- these two examples that have identical statistics. He first falsifies the theory at n equals 2, and then he tries third-order statistics. And he comes up with these two examples-- counter-examples to the theory. These are matched in terms of their third-order statistics. It's not easy to see or realize that, but it's true. If you take triples of pixels, and you take the product of those three, and you average that over the image, these two things are identical, but they look very different. And if you draw samples of each of these, it's very easy to label them as, let's say, A or B, into these two categories. Here's another example that came out a bit later, by Jack Yellott. These two things also are matched up to third order.
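Just to pin down what "matched third-order statistics" means, here's a minimal sketch of one such measurement. Everything in it is illustrative: the offsets defining the pixel triple are an arbitrary choice, and I've assumed periodic wraparound at the image boundary for simplicity. Matching "up to third order" means these numbers agree for every choice of offsets, along with all the first- and second-order statistics.

```python
import numpy as np

def third_order_stat(img, off1, off2):
    """Average over the image of the product of pixel triples:
    mean of I(x) * I(x + off1) * I(x + off2), with periodic wraparound."""
    b = np.roll(img, shift=off1, axis=(0, 1))
    c = np.roll(img, shift=off2, axis=(0, 1))
    return (img * b * c).mean()

# Toy check on a random binary pattern, Julesz-style.
rng = np.random.default_rng(1)
img = rng.integers(0, 2, size=(64, 64)).astype(float)
print(third_order_stat(img, (0, 1), (1, 0)))  # one triple configuration
```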
So Julesz decides that the theory is a failure, and he abandons it. And he begins a new theory, which is the theory of textons-- a much less precisely specified theory. Basically, it's a generative model, if you like. Everybody's fond of generative models these days, except for me-- ah, and maybe Tommy. He comes up with a generative model, which is to lay down many copies of a small, repeating unit, which he called the texton. And so he came up with this method of generating texture images, which he went to town on, and he made lots of examples. The problem is that that wasn't a description of how to analyze texture images, or of how a human would analyze texture images, and so it became very difficult to bridge that gap. In my view, the theory really never succeeded, and he should have stuck with the initial theory. Anyway, that gave us an opportunity.

So we went back many years later-- this is around 1999. I had a fantastic post-doc, Javier Portilla, who came from Spain, and we started thinking about texture and started putting together a model that was Juleszian in spirit, but a little bit different, because we wanted to build in a little bit of what we knew about physiology. Now, Julesz knew about physiology, because Hubel and Wiesel were doing all those experiments in V1 in the late '50s and early '60s, but he really didn't incorporate that into his thinking.

So what we did is build a very simple model-- just dumb, stupid, simple-- in which we took this description of V1 neurons. So these are oriented receptive fields. The idea is that this is a description of a neuron that takes a weighted sum of the pixels, with positive and negative lobes. And it has a preferred orientation, because the positive and negative lobes have a particular oriented structure.
And then it takes the output of that weighted sum and runs it through some rectifying, nonlinear function. This is a classic thing that Hubel and Wiesel described for a simple cell. And here's another one, which is a complex cell. This one basically does two of these and combines them. I'm trying to avoid the details here, because they're not critical for understanding what I'm going to show you.

So then we took those things and we said, well, what if we measure joint statistics of those things over the image? So we're going to take not just these filters-- of course, we're going to do a convolution. That is, we're going to compute the response of this weighted sum at different positions throughout the image. We're going to rectify all of them. Now we're going to take joint statistics. What do I mean by that? Just correlations, basically-- second-order statistics of the simple cells, of the complex cells, and the cross-statistics between them. And these statistics are between different orientations, different positions, and also different sizes. And given that large set of numbers-- typically, for the images that we worked with back then, these were on the order of 700 numbers. So we have an image over here, which is, say, tens of thousands or hundreds of thousands of pixels, being transformed through this box into a set of, let's say, 700 numbers. So 700 summary statistics to describe this pattern.

And then the question is, how do we test the model? Most people, when they test models like this, do classification. This should sound very familiar these days, with the deep network world. They take a model, and then they run it on lots of examples. And they ask, well, do the examples that are supposed to be the same kind of thing, like the same tree bark-- do they come out with statistics that are similar or almost the same as each other?
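Here's a toy sketch of that measurement pipeline: a bank of oriented filters, rectification, and then correlations among the rectified response maps. The Gabor filter and the particular statistics chosen here are stand-ins of my own; the actual model uses a steerable pyramid spanning multiple scales and a specific, carefully chosen set of around 700 statistics. This is just the flavor of the computation.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor(theta, size=15, freq=0.25, sigma=3.0):
    """A toy oriented filter; stand-in for the model's pyramid filters."""
    r = np.arange(size) - size // 2
    x, y = np.meshgrid(r, r)
    u = x * np.cos(theta) + y * np.sin(theta)
    return np.cos(2 * np.pi * freq * u) * np.exp(-(x**2 + y**2) / (2 * sigma**2))

def texture_stats(img, n_orient=4):
    """Rectified oriented-filter responses, then all pairwise correlations
    across orientations -- a toy 'joint statistics' vector."""
    responses = [np.maximum(fftconvolve(img, gabor(t), mode="same"), 0)
                 for t in np.pi * np.arange(n_orient) / n_orient]
    stats = [r.mean() for r in responses]          # first-order statistics
    for i in range(n_orient):                      # second-order (cross) terms
        for j in range(i, n_orient):
            stats.append((responses[i] * responses[j]).mean())
    return np.array(stats)

rng = np.random.default_rng(2)
print(texture_stats(rng.random((64, 64))).shape)   # (14,) for 4 orientations
```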
And can I classify or group or cluster them and get the right answer when trying to identify the different examples? We decided that that was-- at least at the time-- a very weak test of this model, because this is a high-dimensional space, and we had only, let's say, on the order of hundreds of example textures. And hundreds sounds like a lot of textures-- a couple hundred textures-- but if the outputs live in a 700-dimensional space, then it's basically nothing. We're not filling that space. And for those of you that are statistically oriented, you know that there's this thing called the curse of dimensionality: the number of data samples that you need to fill up a space goes up exponentially with the number of dimensions. So this was really bad news, and we decided that it was going to be a disaster to just do classification-- that pretty much any set of measurements would work for classification.

So we were looking for a more demanding test of the model. And for that, we turned to synthesis. The idea is like this. You take this image. You run it through the model. You get your responses. Now we're going to take a patch of white noise. We're going to run it through the same model, and then we're going to lean on the noise-- push on all the pixels in that noise image until we get the same outputs. So this is sometimes called synthesis by analysis. This is not a generative model, but we're using it like a generative model. We're going to draw samples of images that have the same statistics by starting with white noise and just pounding on it until it looks right. And pounding on it means, for those of you that want to know, measuring the gradients of the deviation away from the desired output and just moving in the direction of the gradient. I'm giving you the quick version of this.

A little bit more abstractly, we can think of it this way. There's a space of all possible images. Here's the original image.
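In code, that gradient procedure is short. Here's a minimal sketch in PyTorch, where a made-up two-filter statistic stands in for the full set of texture statistics; the point is only the loop structure-- define the deviation from the target statistics as a loss, and descend its gradient with respect to the pixels.

```python
import torch

def stats(img):
    """Toy statistics: means of rectified responses to two fixed filters.
    A stand-in for the ~700 statistics of the actual texture model."""
    filters = torch.stack([
        torch.tensor([[1., -1.], [1., -1.]]),   # vertical-edge filter
        torch.tensor([[1., 1.], [-1., -1.]]),   # horizontal-edge filter
    ]).unsqueeze(1)
    resp = torch.nn.functional.conv2d(img, filters)
    return torch.relu(resp).mean(dim=(0, 2, 3))

target_img = torch.rand(1, 1, 64, 64)             # stand-in for the photograph
target = stats(target_img).detach()               # desired model outputs

x = torch.rand(1, 1, 64, 64, requires_grad=True)  # white-noise seed
opt = torch.optim.SGD([x], lr=1.0)
for step in range(2000):
    opt.zero_grad()
    loss = ((stats(x) - target) ** 2).sum()       # deviation from the target
    loss.backward()                               # gradient w.r.t. the pixels
    opt.step()                                    # push the pixels downhill
```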
It's a point in this space. We compute the responses of the model, which live in a lower-dimensional space-- a smaller space. That's this. Because this is a many-to-one mapping and it's continuous, there's actually a manifold-- a continuous collection of images over here, all of which have exactly the same model responses. And what we're trying to do is grab one of these. We want to draw a sample from that manifold. If the theory is right-- if this model is a good representation of what humans see and capture when they look at textures-- then all of these things should look the same. That's the hypothesis. And the way we do it, again, is to start with a noise seed-- just an image filled with noise. We project it onto the manifold. We push it onto this point. We can test that, because we can, of course, measure the same things on this image and make sure that they're the same as on that image, and that's our synthesized image. So that's an abstract picture of what I told you on the previous slide.

And then finally, the scientific or experimental logic is to test this by showing it to a human observer. So we have the original image, and then we compute the model responses. We generate a new image, and we ask the human, do these look the same? If the model captures the same properties as the visual system, then two images with identical model responses should appear identical to a human. So that's the logic. And any strong failure of this indicates that the model is insufficient to capture what is important about these images.

So it works, or I wouldn't be telling you about it. Here are just a few examples. There are hundreds more on the web page that describes this work. On the top are original photographs-- lizard skin, plaster of some sort, beans. On the bottom are synthesized versions of these. The lizard skin works really well. The plaster works quite well. The beans, a little less so.
And whether it works well or not depends on the viewing condition. So if you flash these up quickly, people might be convinced that they all look really great. If you allow them to inspect them carefully, they can start to see deviations or funny little artifacts. So it's a partial success. And I should point out that it also provides a pretty convincing success on Julesz's counter-examples. So these are examples: this is synthesized from that, and this is synthesized from that, and they're easily classifiable.

And there are fun things you can do with this. You can fill in regions around images. So if you take this little chunk of text here, and you measure the statistics, and you say, fill in the stuff around it with something that has the same statistics-- but try to do a careful job of matching up at the boundaries-- you can create things like this. You can read the words in the center, but the outside looks like gibberish. Each one of these was created in the same way. So the center of each of these is the original image, and what's around it is synthesized. It works reasonably well.

You can also do fun things like this. These are examples where-- I told you we started from white noise and then pushed it onto the manifold, but we can actually start from any image. So if we start from these images-- these are three of my collaborators: two of my students and my collaborator Tony Movshon. If we start with those as starting-point images, and we use these textures for each of them, we arrive at these images, where you can still see some of the global structure of the face. Because the model is a homogeneous model, it doesn't impose anything on global structure. And so if you seed it with something that has a particular global structure or arrangement, it will inherit some of that. It'll hold onto it. Anyway, this is just for fun. Let's get back to science.
So now, here's an example of Richard Feynman. This is Richard Feynman after he's gone through the blender. You can see pieces of skin-like things and folds and flaps, but it's all disorganized. Again, it's a homogeneous model. It doesn't know anything about the global organization of this photograph. But what we want to know is: do we have a model that's just a model for the perception of homogeneous textures, or can we actually push it a little bit and make it, first of all, a little more physiological, and second of all, maybe a little bit more relevant for everyday vision? For me, standing here and looking at this scene, how do I go about describing what's going on when I'm looking at a normal scene? So let's think through how to do this.

I'm going to jump right to this diagram of the brain again. So V1 is in the back of the brain. The information that comes into your eyes goes through the retina, the LGN, back to V1. And then it splits into these two branches, the dorsal and the ventral stream. The ventral stream is usually associated with spatial form and recognition and memory. So I'm going to think about the ventral stream, and we're going to try to understand what this model might have to say about processing in the ventral stream.

I'm going to rely on just a few simple assumptions. First, each of these areas has neurons, and they respond to small regions of the visual input, known as receptive fields. Most of you know that. In each visual area, I'm going to assume that those receptive fields are covering, blanketing, the entire visual field. So there are no dead spots-- no spots that are left out. Everything is covered nicely. And in fact, we know that this is true starting, for example, in the retina. So this is a cartoon diagram to illustrate the inhomogeneity that's found in the retina.
So the receptive field sizes in the retina grow with eccentricity. And it turns out that that starts in the retina, but it's true, actually, all the way through the visual system, and throughout the ventral stream in particular. And this diagram is showing-- these little circles are about 10 times the size of the midget ganglion cell receptive fields in your retina. So if you fixate right here in the center of this, these things are about 10 times the size of your receptive fields. And that's long been thought to be the primary driver of your limits on acuity in peripheral vision.

So in particular, if you take this eye chart-- this was done by Stuart Anstis back in the '70s-- and you lay it out in this fashion, these letters are about 10 times the threshold for visibility and recognition. And so you can say that the stroke widths of the letters are about matched to the size of these ganglion cells, and it works, at least qualitatively-- things are scaling in the right way, in terms of acuity and in terms of the size that the letters need to be for you to recognize them.

And you can make pictures like this. This is after Bill Geisler, who showed that if you foveate-- if you fixate here-- you can't see the details of the stuff that's far from your fixation point, and if you blur it, people don't notice. Alternatively, you can add high-frequency noise to it, and people won't notice that either. Because those receptive fields are getting larger and larger, you're basically blurring out the information that would allow you to distinguish, let's say, these two things. When you look right at it, you can see it, but if you keep your eye fixated here, you won't notice it.

So let's work off of those ideas-- the idea of these receptive fields that are getting larger with eccentricity, and that are covering the entire visual field.
And let's notice the following. This is physiological data taken from several papers, assembled by Jeremy Freeman, who was a grad student in my lab. And here you can see the center of the receptive fields versus the size of the receptive fields. And you can see that in the retina-- I already showed you on the previous slide that size grows with eccentricity, but it's actually very slow compared to what happens in the cortex. In V1, the receptive fields grow at a pretty good clip. In V2, they grow about twice as fast as that, and in V4, twice as fast again. Another way of saying this: at any given receptive field location relative to the fovea-- let's say 15 degrees-- the receptive fields in V1 are of a given size: the diameter is about 0.2 to 0.25 times the eccentricity. The receptive fields in V2 are twice that size, so about 0.45 times the eccentricity, and the receptive fields in V4 are twice that again.

In cartoon form, it looks something like this. So here's V1-- lots of cells, and smallish receptive fields growing with eccentricity. Here's V2. They're bigger. They grow faster. Here's V4. And by the time you get to IT-- Jim DiCarlo was here a bunch of days ago, and he probably told you this-- almost every IT cell includes the fovea as part of its receptive field. They're very large, and they often cover half the visual field.

So now we have to figure out what to put inside of these little circles in order to make a model, and I'm going to basically combine-- smash together-- the texture model that I told you about, which was a global, homogeneous model, with this receptive field model. I'm going to basically stick a little texture model in each of these little circles. That's the concept. So how do we do that? Well, we're going to go back to Hubel and Wiesel.
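(Before that, a quick back-of-the-envelope version of those receptive field scaling numbers. The slopes are the approximate values quoted a moment ago, used here purely for illustration, not fitted data.)

```python
# Approximate receptive-field diameter as a linear function of eccentricity,
# using the rough slopes quoted above (illustrative, not fitted values).
slopes = {"V1": 0.25, "V2": 0.45, "V4": 0.9}

eccentricity_deg = 15.0
for area, slope in slopes.items():
    print(f"{area}: ~{slope * eccentricity_deg:.1f} deg diameter "
          f"at {eccentricity_deg} deg eccentricity")
# V1: ~3.8 deg, V2: ~6.8 deg, V4: ~13.5 deg -- roughly doubling per stage.
```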
Hubel and Wiesel were the ones who said that you make V1 simple-cell receptive fields out of LGN cells by just taking a bunch of LGN cells that line up. Here they are-- center-surround receptive fields from the LGN, which come off of the center-surround architecture of the retina. You line them up, you add them together, and that gives you an oriented receptive field, like the ones that I showed you earlier. And in more of a computational diagram, you might draw it like this. Here's an array of LGN inputs coming in. We're going to take a weighted sum of those. Black is negative; white is positive. So we add up these three guys, we subtract the two guys on either side, and then we run that through a rectifying nonlinearity. That's a simple cell.

Hubel and Wiesel also suggested that you could create complex cells by combining simple cells. This is the diagram from their paper in 1962. And we can diagram that like this. Here are basically three of these simple cells. They're displaced in position, but they have the same orientation. We half-wave rectify all of them, add them together, and that gives us a complex cell. And it's interesting to note that the hook here is going to be that this is an average of these. An average is a statistic-- a local average. So we're going to compute local averages, and we're going to call those statistics-- statistics, as in the ones used in the texture model.

So let's do that. Here's the V2 receptive field. Open that up. Inside of it is a bunch of V1 cells, here all shown at the same orientation. In reality, they would be at all different orientations and different sizes. And now we're going to compute those joint statistics, just like I did in the texture model, and that's going to give us our responses. We're going to have to do that for each one of these receptive fields.
So there are a lot of these. It's not 700 numbers anymore. It's reduced per receptive field-- there are details here-- but there are a lot of receptive fields, so it's quite a lot of parameters. And these local correlations that I told you we were going to compute can actually be re-expressed in a form that looks just like the simple- and complex-cell calculations that I showed you for V1. So in fact, if you take these V1 cells, and you take weighted sums of them, and you half-wave rectify them and add them, you get something that's essentially equivalent to the texture model that I told you about. And that's pretty cool, because it means that the calculations that take us from the LGN input to V1 outputs have a form, a structure, which is then repeated when we get to V2. We do the same kind of calculations-- linear filters, rectification, pooling or averaging. And that, of course, has become ubiquitous with the advent of all the deep network stuff.

But the idea here is that we can actually do this kind of canonical computation again and again and again, and produce something that replicates the loss of information and the extraction of features or parameters that the human visual system is performing. So this canonical idea, I think, is important, and it's something that we've been thinking about for a long time-- linear filtering that determines pattern selectivity, some sort of rectifying nonlinearity, some sort of pooling. And we usually also include some sort of local gain control, which seems to be ubiquitous throughout the visual system and the auditory system at every stage, and noise as well. And we're currently, in my lab, working on lots of models that try to incorporate all of these things in stacked networks-- small numbers of layers, not deep; shallow, shallow networks for us-- in order to try to understand their implications for perception and physiology.
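Here's a toy sketch of that canonical stage-- linear filtering, rectification, local gain control, and pooling-- written so that stages can be stacked. Every parameter here (filter shapes, pooling size, the gain-control constant) is an arbitrary stand-in, not a value from any fitted model.

```python
import numpy as np
from scipy.signal import fftconvolve

def canonical_stage(x, filt, pool=4, eps=0.1):
    """One toy canonical stage: linear filter, half-wave rectification,
    divisive local gain control, then pooling (local averaging)."""
    linear = fftconvolve(x, filt, mode="same")        # pattern selectivity
    rect = np.maximum(linear, 0.0)                    # rectifying nonlinearity
    local_mean = fftconvolve(rect, np.ones((5, 5)) / 25, mode="same")
    gain = rect / (eps + local_mean)                  # local gain control
    h, w = gain.shape
    g = gain[: h - h % pool, : w - w % pool]          # crop to a multiple
    return g.reshape(h // pool, pool, w // pool, pool).mean(axis=(1, 3))

# Stack two stages, as in a shallow hierarchical model.
rng = np.random.default_rng(3)
img = rng.random((64, 64))
f1 = rng.standard_normal((7, 7))   # stand-ins for oriented filters
f2 = rng.standard_normal((3, 3))
out = canonical_stage(canonical_stage(img, f1), f2)
print(out.shape)  # (4, 4): local averages passed up as 'statistics'
```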
777 00:31:11,400 --> 00:31:15,060 This was just a description of a single stage, 778 00:31:15,060 --> 00:31:16,922 and then, of course, you have to stack them. 779 00:31:16,922 --> 00:31:19,380 And there are many people that have talked about that idea. 780 00:31:19,380 --> 00:31:24,300 This is a figure from Tommy's paper with Christof, I think-- 781 00:31:24,300 --> 00:31:25,620 1999. 782 00:31:25,620 --> 00:31:29,010 And Fukushima had proposed a basic architecture 783 00:31:29,010 --> 00:31:31,530 like this earlier. 784 00:31:31,530 --> 00:31:33,810 And so I think this has now become-- 785 00:31:33,810 --> 00:31:36,510 you barely even need to say it, because of the deep network 786 00:31:36,510 --> 00:31:38,820 literature. 787 00:31:38,820 --> 00:31:40,260 So how do we do this? 788 00:31:40,260 --> 00:31:41,610 Same thing I told you before. 789 00:31:41,610 --> 00:31:44,460 Take an image, plop down all these V2 receptive fields. 790 00:31:44,460 --> 00:31:46,620 By the way, I should have said this at the outset-- 791 00:31:46,620 --> 00:31:50,010 this is drawn as a cartoon. 792 00:31:50,010 --> 00:31:51,480 The actual receptive fields that we 793 00:31:51,480 --> 00:31:55,230 use are smooth and overlapping, so that there are no holes. 794 00:31:55,230 --> 00:31:57,630 And in fact, the details of that are 795 00:31:57,630 --> 00:31:59,130 that since we're computing averages, 796 00:31:59,130 --> 00:32:00,921 you can think of this as a low pass filter, 797 00:32:00,921 --> 00:32:03,330 and we try to at least approximately obey the Nyquist 798 00:32:03,330 --> 00:32:05,640 theorem, so that there's no aliasing-- 799 00:32:05,640 --> 00:32:07,530 that is, there's no evidence of the sampling 800 00:32:07,530 --> 00:32:13,170 lattice, for those of you that are thinking down those lines. 801 00:32:13,170 --> 00:32:15,019 If you were not thinking down those lines, 802 00:32:15,019 --> 00:32:16,560 I'll just say the simple thing, which 803 00:32:16,560 --> 00:32:19,260 is that they're not little disks that are non-overlapping, 804 00:32:19,260 --> 00:32:21,510 because then we would be screwing everything up in 805 00:32:21,510 --> 00:32:22,700 between them. 806 00:32:22,700 --> 00:32:24,270 They're smooth and overlapping so 807 00:32:24,270 --> 00:32:26,853 that we cover the whole image, and all the pixels in the image 808 00:32:26,853 --> 00:32:28,944 are going to be affected by this process. 809 00:32:28,944 --> 00:32:30,360 So we make all those measurements. 810 00:32:30,360 --> 00:32:32,070 It's a very large set of measurements. 811 00:32:32,070 --> 00:32:36,160 And now we start with white noise, and we push the button. 812 00:32:36,160 --> 00:32:38,910 And again, push simultaneously on the gradients 813 00:32:38,910 --> 00:32:41,370 from all those little regions until we achieve something 814 00:32:41,370 --> 00:32:43,200 that matches all the measurements in all 815 00:32:43,200 --> 00:32:44,864 of those receptive fields. 816 00:32:44,864 --> 00:32:46,530 The measurements in the receptive fields 817 00:32:46,530 --> 00:32:48,280 are averaged over different regions. 818 00:32:48,280 --> 00:32:53,910 So the ones that are in the far periphery 819 00:32:53,910 --> 00:32:57,360 are averaged over large regions, and so those averages 820 00:32:57,360 --> 00:32:59,730 are throwing away a lot more information. 821 00:32:59,730 --> 00:33:01,980 The ones that are averaged near the fovea 822 00:33:01,980 --> 00:33:04,020 are throwing away a small amount of information. 
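[As a toy version of the "push the button" step just described: start from white noise and descend, simultaneously, the gradients of the mismatch between the pooled statistics of the synthesis and of the target, using smooth overlapping windows so there are no holes. This sketch works on a 1-D signal and matches only local means and local energies; the real model matches the full set of joint texture statistics in each window, with windows that scale with eccentricity.]

```python
import numpy as np

def smooth_windows(n, n_win):
    """Smooth, overlapping raised-cosine windows with no holes and no
    visible sampling lattice, normalized to sum to one at every sample."""
    centers = (np.arange(n_win) + 0.5) * n / n_win
    width = 2.0 * n / n_win
    x = np.arange(n)
    arg = np.pi * (x[None, :] - centers[:, None]) / width
    w = np.cos(np.clip(arg, -np.pi / 2, np.pi / 2)) ** 2
    return w / w.sum(axis=0, keepdims=True)

def synthesize(target, n_win=8, steps=5000, lr=0.02, seed=0):
    """Start from white noise; push simultaneously on the gradients from
    all windows until local means and local energies match the target's."""
    rng = np.random.default_rng(seed)
    w = smooth_windows(len(target), n_win)
    t1, t2 = w @ target, w @ target ** 2      # target statistics
    img = rng.standard_normal(len(target))
    for _ in range(steps):
        d1 = w @ img - t1                     # mismatch in local means
        d2 = w @ img ** 2 - t2                # mismatch in local energies
        grad = w.T @ d1 + (w.T @ d2) * 2 * img
        img -= lr * grad / (np.abs(grad).max() + 1e-12)
    return img
```

[The normalized gradient step at the end is a crude step-size control for the sketch, not the projection machinery used in the actual synthesis.]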
823 00:33:04,020 --> 00:33:05,370 When you get close enough to the fovea, 824 00:33:05,370 --> 00:33:06,840 they're throwing away nothing. 825 00:33:06,840 --> 00:33:09,284 So the original image is preserved in the center, 826 00:33:09,284 --> 00:33:10,950 and then it gets more and more distorted 827 00:33:10,950 --> 00:33:12,720 as you go away from the fovea. 828 00:33:12,720 --> 00:33:16,980 So the question is, does that work for a human? 829 00:33:16,980 --> 00:33:18,240 Is it metameric? 830 00:33:18,240 --> 00:33:19,864 The display here is not very good, 831 00:33:19,864 --> 00:33:21,780 but I'll try to give you a demonstration of it 832 00:33:21,780 --> 00:33:23,310 to convince you that it does work. 833 00:33:23,310 --> 00:33:25,020 You have to keep your eyes planted here, 834 00:33:25,020 --> 00:33:26,520 and I'm going to flip back and forth 835 00:33:26,520 --> 00:33:28,230 between this original picture, which 836 00:33:28,230 --> 00:33:32,250 was taken in Washington Square Park, near the department. 837 00:33:32,250 --> 00:33:35,400 And I'm going to flip between this and a synthesized version. 838 00:33:35,400 --> 00:33:38,540 You have to keep your eyes here, at least for a bunch of flips. 839 00:33:38,540 --> 00:33:40,710 Hello. 840 00:33:40,710 --> 00:33:42,100 Here we go. 841 00:33:42,100 --> 00:33:43,627 Keep your eyes fixated. 842 00:33:43,627 --> 00:33:45,210 Those two images should look the same. 843 00:33:45,210 --> 00:33:48,600 It's going back and forth, A, B, A, B, and they 844 00:33:48,600 --> 00:33:49,570 should look the same. 845 00:33:49,570 --> 00:33:51,820 I think for most of you, and for most of these viewing 846 00:33:51,820 --> 00:33:53,029 distances, it should work. 847 00:33:53,029 --> 00:33:54,570 And now if you look over here, you'll 848 00:33:54,570 --> 00:33:57,340 see that they actually are not the same. 849 00:33:57,340 --> 00:34:00,020 That's about the size of a V2 receptive field, 850 00:34:00,020 --> 00:34:01,980 and it is the same two images. 851 00:34:01,980 --> 00:34:04,690 I'm not cheating here, in case anybody's worried. 852 00:34:04,690 --> 00:34:07,510 I'm just flipping back and forth between the same two images. 853 00:34:07,510 --> 00:34:10,739 And you can see that the original image has 854 00:34:10,739 --> 00:34:12,480 a couple of faces in that circle, 855 00:34:12,480 --> 00:34:16,110 but the synthesized one, they're all distorted, 856 00:34:16,110 --> 00:34:20,130 the same way Feynman was when I showed you his photograph. 857 00:34:20,130 --> 00:34:24,570 But again, the point here is that these two are not metamers 858 00:34:24,570 --> 00:34:26,969 when you look right at this peripheral region, 859 00:34:26,969 --> 00:34:29,531 but when you keep your eyes fixated here, 860 00:34:29,531 --> 00:34:30,989 they're pretty hard to distinguish. 861 00:34:30,989 --> 00:34:33,330 This is right at about the threshold for the subjects 862 00:34:33,330 --> 00:34:37,320 that we ran in this experiment, so it should be 863 00:34:37,320 --> 00:34:39,690 basically imperceptible to you. 864 00:34:39,690 --> 00:34:41,580 That was a demo, just to convince you 865 00:34:41,580 --> 00:34:43,480 that it seems to work. 866 00:34:43,480 --> 00:34:45,360 We did an experiment, because we wanted 867 00:34:45,360 --> 00:34:48,594 to do more than just show that it sort of works. 
868 00:34:48,594 --> 00:34:51,300 We wanted to figure out whether we could actually tie it 869 00:34:51,300 --> 00:34:53,650 to the physiology in a more direct way, 870 00:34:53,650 --> 00:34:57,000 so what we did is we generated stimuli 871 00:34:57,000 --> 00:35:01,160 where we used different receptive field size scaling. 872 00:35:01,160 --> 00:35:01,960 So this is a plot. 873 00:35:01,960 --> 00:35:03,820 Along this axis is going to be-- 874 00:35:03,820 --> 00:35:05,710 just to get you situated, along this axis 875 00:35:05,710 --> 00:35:08,590 is going to be models that are used 876 00:35:08,590 --> 00:35:11,371 to generate stimuli with different receptive field size 877 00:35:11,371 --> 00:35:11,870 scaling. 878 00:35:11,870 --> 00:35:14,295 That's the ratio of diameter to eccentricity-- diameter 879 00:35:14,295 --> 00:35:16,420 of the receptive field to the eccentricity distance 880 00:35:16,420 --> 00:35:17,560 from the fovea. 881 00:35:17,560 --> 00:35:20,200 And along here is going to be the percent correct 882 00:35:20,200 --> 00:35:26,020 that a human is able to achieve-- 883 00:35:26,020 --> 00:35:28,809 the way we did this, it's called an ABX experiment. 884 00:35:28,809 --> 00:35:30,850 So we show one image, then we show another image, 885 00:35:30,850 --> 00:35:31,974 then we show a third image. 886 00:35:31,974 --> 00:35:36,760 And we say, which of the first two images does the third one look like? 887 00:35:36,760 --> 00:35:40,060 So we're going to plot percent correct here. 888 00:35:40,060 --> 00:35:44,790 And if we use a model with very small receptive fields, 889 00:35:44,790 --> 00:35:46,540 then we get syntheses that look like this. 890 00:35:46,540 --> 00:35:48,030 This one has very little distortion. 891 00:35:48,030 --> 00:35:49,330 There's a little bit of distortion 892 00:35:49,330 --> 00:35:51,580 near the edges, but it's pretty close to the original. 893 00:35:51,580 --> 00:35:53,200 If we use really big receptive fields, 894 00:35:53,200 --> 00:35:54,550 then we get a lot of distortion. 895 00:35:54,550 --> 00:35:56,590 Things really start to fall apart. 896 00:35:56,590 --> 00:35:58,550 And somewhere in between-- 897 00:35:58,550 --> 00:36:00,250 so far to the right on this plot, 898 00:36:00,250 --> 00:36:05,530 we expect people to be at 100% noticing the distortions, 899 00:36:05,530 --> 00:36:06,940 and far to the left on this plot, 900 00:36:06,940 --> 00:36:09,632 we expect them to be at chance. 901 00:36:09,632 --> 00:36:11,840 We expect them to not be able to tell the difference. 902 00:36:11,840 --> 00:36:13,173 And that's exactly what happens. 903 00:36:13,173 --> 00:36:15,640 This is an average over four observers. 904 00:36:15,640 --> 00:36:17,830 And you can see that the performance, the percent 905 00:36:17,830 --> 00:36:20,380 correct starts at around 50%, and then 906 00:36:20,380 --> 00:36:23,420 climbs up and asymptotes. 907 00:36:23,420 --> 00:36:25,360 So what's more, we can now do something-- 908 00:36:25,360 --> 00:36:27,820 this is a little bit complicated to get your head around. 909 00:36:27,820 --> 00:36:31,390 We're using this model to generate the stimuli, 910 00:36:31,390 --> 00:36:34,120 and this is the model parameter plotted along this axis. 911 00:36:34,120 --> 00:36:36,610 Now we're going to use the model again, 912 00:36:36,610 --> 00:36:38,110 but now we're going to use the model 913 00:36:38,110 --> 00:36:39,820 as a model for the observer. 914 00:36:39,820 --> 00:36:41,320 So there are two models here.
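[A quick aside to make the x-axis concrete before the two models get put to work: for a fixed scaling factor, pooling regions tile the visual field in rings whose diameters grow linearly with distance from the fovea, so halving the scaling roughly doubles the number of rings. The eccentricity range and the two example values below are illustrative, not numbers from the study.]

```python
import numpy as np

def pooling_rings(scaling, e_min=0.5, e_max=20.0):
    """Pooling-region (eccentricity, diameter) pairs for one scaling
    factor, defined as diameter / eccentricity. Diameters grow linearly
    with eccentricity, so each ring starts where the last one ends."""
    rings, e = [], e_min
    while e < e_max:
        d = scaling * e          # diameter proportional to eccentricity
        rings.append((e, d))
        e += d                   # the next ring begins past this one
    return rings

# Halving the scaling roughly doubles how many rings tile the same range.
for s in (0.25, 0.5):
    print(f"scaling {s}: {len(pooling_rings(s))} rings")
```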
915 00:36:41,320 --> 00:36:42,670 One is generating the stimuli. 916 00:36:42,670 --> 00:36:45,100 The other one, we're going to try to fit-- 917 00:36:45,100 --> 00:36:47,920 we're going to ask, if we used a second copy of the model 918 00:36:47,920 --> 00:36:49,930 to actually look at these images and tell 919 00:36:49,930 --> 00:36:51,670 the difference between them, what 920 00:36:51,670 --> 00:36:55,180 would its receptive fields have to be in order 921 00:36:55,180 --> 00:36:57,780 to match the human data? 922 00:36:57,780 --> 00:37:00,040 And I'm not going to drag you through the details, 923 00:37:00,040 --> 00:37:03,370 but the basic idea is that allows us to produce 924 00:37:03,370 --> 00:37:06,220 a prediction-- this black line-- 925 00:37:06,220 --> 00:37:08,170 for how this model would behave if it 926 00:37:08,170 --> 00:37:10,330 were acting as an observer. 927 00:37:10,330 --> 00:37:13,630 And by adjusting the parameter of the observer model, 928 00:37:13,630 --> 00:37:17,120 we can estimate the size of the human receptive fields. 929 00:37:17,120 --> 00:37:18,726 So the end result of all of this is 930 00:37:18,726 --> 00:37:20,350 we're going to fit a curve to the data, 931 00:37:20,350 --> 00:37:22,060 and it's going to give us an estimate 932 00:37:22,060 --> 00:37:23,650 of the size of the receptive fields 933 00:37:23,650 --> 00:37:26,350 that the human is using to do this task. 934 00:37:26,350 --> 00:37:28,150 And that is right here. 935 00:37:28,150 --> 00:37:29,800 In fact, it's right at the place where 936 00:37:29,800 --> 00:37:32,644 the curve hits the 50% line. 937 00:37:32,644 --> 00:37:35,060 That's the point where the human can't tell the difference 938 00:37:35,060 --> 00:37:37,120 anymore, and that's the point where 939 00:37:37,120 --> 00:37:40,610 we think an observer would be-- 940 00:37:40,610 --> 00:37:42,970 where the receptive fields of the stimulus 941 00:37:42,970 --> 00:37:44,770 would be the same size as the receptive 942 00:37:44,770 --> 00:37:45,970 fields of the observer. 943 00:37:45,970 --> 00:37:47,449 So that's what we're looking for. 944 00:37:47,449 --> 00:37:49,240 And when we do that for our four observers, 945 00:37:49,240 --> 00:37:50,860 they come out very consistent. 946 00:37:50,860 --> 00:37:54,490 So here's a plot of the estimated receptive field 947 00:37:54,490 --> 00:37:56,680 sizes of these observers. 948 00:37:56,680 --> 00:37:57,430 All four of them-- 949 00:37:57,430 --> 00:38:00,040 1, 2, 3, 4, and the average over the four. 950 00:38:00,040 --> 00:38:03,070 And nicely enough-- remember, I told you that we know something 951 00:38:03,070 --> 00:38:05,230 about the receptive field sizes in-- 952 00:38:05,230 --> 00:38:06,790 these are macaque monkey. 953 00:38:06,790 --> 00:38:08,770 And if we plot those on the same plot, 954 00:38:08,770 --> 00:38:11,480 these color bands are the size of the receptive fields 955 00:38:11,480 --> 00:38:15,456 in a macaque, now combined over this large set of data 956 00:38:15,456 --> 00:38:17,080 from a whole bunch of different papers. 957 00:38:17,080 --> 00:38:19,319 Jeremy went through incredibly painstaking work 958 00:38:19,319 --> 00:38:21,610 to try to put these all into the same coordinate system 959 00:38:21,610 --> 00:38:23,620 and unify the data sets. 
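[The readout just described -- fit a curve through the ABX data and find where it first leaves chance -- can be sketched as follows. The parametric form (at chance below a critical scaling, rising toward an asymptote above it) and the data points are hypothetical stand-ins, not the study's observer model or measurements.]

```python
import numpy as np
from scipy.optimize import curve_fit

def pc_model(s, s_crit, pc_max):
    """Proportion correct vs. synthesis scaling: at chance (0.5) while the
    synthesis pooling regions are smaller than the observer's own
    (s <= s_crit), rising toward an asymptote above that. A convenient
    parametric form, assumed here for illustration."""
    s = np.asarray(s, dtype=float)
    lift = np.clip(1.0 - (s_crit / s) ** 2, 0.0, None)
    return 0.5 + (pc_max - 0.5) * lift

# Hypothetical ABX data: scaling of the synthesis model vs. % correct.
scalings = np.array([0.25, 0.35, 0.5, 0.7, 1.0, 1.4])
pcorrect = np.array([0.51, 0.49, 0.55, 0.71, 0.82, 0.87])

(s_crit, pc_max), _ = curve_fit(pc_model, scalings, pcorrect,
                                p0=[0.4, 0.9], bounds=([0.01, 0.5], [2.0, 1.0]))
print(f"estimated critical scaling (observer RF size): {s_crit:.2f}")
```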
960 00:38:23,620 --> 00:38:26,770 And so the height of each of these bars tells you-- 961 00:38:26,770 --> 00:38:30,010 they're error bars on how much variability there is, where 962 00:38:30,010 --> 00:38:31,210 we think the estimates are. 963 00:38:31,210 --> 00:38:33,168 And you can see that the answers for the humans 964 00:38:33,168 --> 00:38:35,980 are coming right down on top of V2. 965 00:38:35,980 --> 00:38:39,160 So we really do think that the information that 966 00:38:39,160 --> 00:38:42,550 is being lost in these stimuli is being lost in V2, 967 00:38:42,550 --> 00:38:45,190 and it seems to match the receptive field sizes at least 968 00:38:45,190 --> 00:38:47,350 of macaque monkey. 969 00:38:47,350 --> 00:38:50,980 We were worried that this might depend a lot on the details 970 00:38:50,980 --> 00:38:52,540 of the experiment. 971 00:38:52,540 --> 00:38:54,999 So for example, we thought, well, 972 00:38:54,999 --> 00:38:57,040 what if we give people a little more information? 973 00:38:57,040 --> 00:38:59,800 For example, what if we let them look at the stimulus longer? 974 00:38:59,800 --> 00:39:02,410 So the original experiment was pretty brief-- 975 00:39:02,410 --> 00:39:03,400 200 milliseconds. 976 00:39:03,400 --> 00:39:05,630 What if we give them 400 milliseconds? 977 00:39:05,630 --> 00:39:09,430 And so up here are plots for the same four subjects. 978 00:39:09,430 --> 00:39:13,024 The original task is in the dark gray, 979 00:39:13,024 --> 00:39:15,190 and you can see the curves for each of the subjects. 980 00:39:15,190 --> 00:39:18,010 When we give them more time, what you notice 981 00:39:18,010 --> 00:39:21,080 is that, in general, they do better. 982 00:39:21,080 --> 00:39:22,900 So generally, the light gray curves-- 983 00:39:22,900 --> 00:39:25,410 1, 2, 3-- are above the dark gray curves. 984 00:39:25,410 --> 00:39:27,490 They get higher percent correct. 985 00:39:27,490 --> 00:39:30,730 But the important thing is that each of these curves 986 00:39:30,730 --> 00:39:34,400 dives down and hits the 50% point at the same place. 987 00:39:34,400 --> 00:39:37,060 In other words, what we interpret this to mean 988 00:39:37,060 --> 00:39:41,230 is that the estimate of the receptive field sizes 989 00:39:41,230 --> 00:39:45,070 is an architectural constraint, and we 990 00:39:45,070 --> 00:39:48,070 can estimate the same architectural constraint 991 00:39:48,070 --> 00:39:50,920 under both of these conditions, even though performance 992 00:39:50,920 --> 00:39:53,560 is noticeably different, at least for these three subjects. 993 00:39:53,560 --> 00:39:56,350 This one, it's really quite a big, big improvement. 994 00:39:56,350 --> 00:39:58,120 This subject is doing much, much better 995 00:39:58,120 --> 00:40:00,260 on the task when we give them more time. 996 00:40:00,260 --> 00:40:03,290 And yet, this estimate of receptive field sizes 997 00:40:03,290 --> 00:40:05,060 is pretty stable, so we thought this 998 00:40:05,060 --> 00:40:07,220 was a pretty important control. 999 00:40:07,220 --> 00:40:09,180 And down below is another control. 1000 00:40:09,180 --> 00:40:10,400 That was a bottom-up control. 1001 00:40:10,400 --> 00:40:11,770 This is a top-down control. 
1002 00:40:11,770 --> 00:40:14,270 People have talked about attention 1003 00:40:14,270 --> 00:40:16,580 being very important in peripheral tasks, 1004 00:40:16,580 --> 00:40:19,300 so we now gave the subjects an attentional cue-- 1005 00:40:19,300 --> 00:40:22,790 a little arrow at the center of the display that 1006 00:40:22,790 --> 00:40:25,190 pointed toward the region of the periphery 1007 00:40:25,190 --> 00:40:28,350 where the distortion was largest in a mean-squared error sense. 1008 00:40:28,350 --> 00:40:31,190 So we measure little chunks of the peripheral image 1009 00:40:31,190 --> 00:40:34,070 and look for the place where there's the biggest difference, 1010 00:40:34,070 --> 00:40:36,380 and we tell them to pay attention 1011 00:40:36,380 --> 00:40:37,670 to that part of the stimulus. 1012 00:40:37,670 --> 00:40:38,720 They're not allowed to move their eyes. 1013 00:40:38,720 --> 00:40:40,530 We have an eye tracker on them the whole time, 1014 00:40:40,530 --> 00:40:41,960 so they're not allowed to look at it. 1015 00:40:41,960 --> 00:40:44,270 But we're telling them, try to pay attention to what's, 1016 00:40:44,270 --> 00:40:46,500 let's say, in the upper left. 1017 00:40:46,500 --> 00:40:48,500 And again, the result is quite similar. 1018 00:40:48,500 --> 00:40:51,179 Their performance improves noticeably, at least 1019 00:40:51,179 --> 00:40:52,220 for these three subjects. 1020 00:40:52,220 --> 00:40:55,070 This one, again, is the most dramatic performance 1021 00:40:55,070 --> 00:40:55,610 improvement. 1022 00:40:55,610 --> 00:40:56,690 Nobody gets worse. 1023 00:40:56,690 --> 00:40:59,600 This subject basically stayed about the same. 1024 00:40:59,600 --> 00:41:01,820 But again, the estimates of receptive field size 1025 00:41:01,820 --> 00:41:02,790 are quite stable. 1026 00:41:02,790 --> 00:41:05,960 So our interpretation is attention 1027 00:41:05,960 --> 00:41:09,920 is boosting the signal, if there is a signal, that 1028 00:41:09,920 --> 00:41:11,670 allows them to do the task. 1029 00:41:11,670 --> 00:41:14,720 But if they're at chance and there's no signal, 1030 00:41:14,720 --> 00:41:18,380 attention does nothing, which is why that when you get to 50%, 1031 00:41:18,380 --> 00:41:20,870 all these points coalesce. 1032 00:41:20,870 --> 00:41:24,260 All the curves are hitting 50% at the same place. 1033 00:41:24,260 --> 00:41:26,750 One last control-- we wanted to convince ourselves 1034 00:41:26,750 --> 00:41:28,970 that really it was V2, and it wasn't just luck 1035 00:41:28,970 --> 00:41:32,060 that we happened to get that receptive field size that 1036 00:41:32,060 --> 00:41:33,630 matched the macaque data. 1037 00:41:33,630 --> 00:41:35,540 So we did a control experiment where we tried 1038 00:41:35,540 --> 00:41:37,610 to get the same result for V1. 1039 00:41:37,610 --> 00:41:41,570 So this time, we just measure local oriented receptive fields 1040 00:41:41,570 --> 00:41:45,860 like Hubel and Wiesel described, and we average them 1041 00:41:45,860 --> 00:41:48,950 as in a complex cell over different sized regions. 1042 00:41:48,950 --> 00:41:50,630 And we generate stimuli that are just 1043 00:41:50,630 --> 00:41:54,067 matched for the average responses of the V1 cells. 1044 00:41:54,067 --> 00:41:56,150 We don't do all the statistics on top of that that 1045 00:41:56,150 --> 00:41:57,920 represents the V2 calculation. 1046 00:41:57,920 --> 00:42:01,310 We're just doing average V1 responses. 
1047 00:42:01,310 --> 00:42:02,450 When we do that-- 1048 00:42:02,450 --> 00:42:04,760 we generate the stimuli, we do the same experiment, 1049 00:42:04,760 --> 00:42:07,280 we get a very different result in light gray here. 1050 00:42:07,280 --> 00:42:09,202 So you can see that these curves are always 1051 00:42:09,202 --> 00:42:10,910 higher than the other ones, but they also 1052 00:42:10,910 --> 00:42:15,350 hit the axis at a much, much smaller value, 1053 00:42:15,350 --> 00:42:18,590 usually by about a factor of two, which is just right, 1054 00:42:18,590 --> 00:42:21,100 given what I told you before about receptive field sizes. 1055 00:42:21,100 --> 00:42:24,020 So if we go back and we combine all the data on one plot-- 1056 00:42:24,020 --> 00:42:25,910 down here are the V1 controls. 1057 00:42:25,910 --> 00:42:28,160 They're about the right size for V1. 1058 00:42:28,160 --> 00:42:31,759 And up here is the original experiment and the two controls 1059 00:42:31,759 --> 00:42:33,800 that I told you about-- the extended presentation 1060 00:42:33,800 --> 00:42:36,650 and the directed attention, and those are all pretty much 1061 00:42:36,650 --> 00:42:38,510 lying in the range of V2. 1062 00:42:38,510 --> 00:42:41,090 We think this has a pretty strong implication for reading 1063 00:42:41,090 --> 00:42:42,140 speed. 1064 00:42:42,140 --> 00:42:44,180 When you read, your eyes hop across the page. 1065 00:42:44,180 --> 00:42:45,890 You do not scan continuously. 1066 00:42:45,890 --> 00:42:46,970 You hop. 1067 00:42:46,970 --> 00:42:48,500 And when you hop, here's an example 1068 00:42:48,500 --> 00:42:50,458 of the kind of hops you do when you're reading. 1069 00:42:50,458 --> 00:42:53,300 There's an eye position, and the typical hop distance 1070 00:42:53,300 --> 00:42:54,710 would be about that-- 1071 00:42:54,710 --> 00:42:55,730 from here to there. 1072 00:42:55,730 --> 00:42:57,320 This is the same piece of text. 1073 00:42:57,320 --> 00:43:00,980 We've synthesized it as a metamer using this model, just 1074 00:43:00,980 --> 00:43:03,770 to illustrate the idea that the chunk of stuff 1075 00:43:03,770 --> 00:43:06,290 that you can read around that fixation point, 1076 00:43:06,290 --> 00:43:07,520 it's about right. 1077 00:43:07,520 --> 00:43:11,630 It matches what you would expect for the kind 1078 00:43:11,630 --> 00:43:12,980 of hopping that you could do. 1079 00:43:12,980 --> 00:43:17,450 Your reading speed is limited by the distance of those hops, 1080 00:43:17,450 --> 00:43:19,220 and the distance of those hops is limited 1081 00:43:19,220 --> 00:43:22,700 by this loss of information. 1082 00:43:22,700 --> 00:43:26,192 So you can't read anything beyond maybe this I and this N. 1083 00:43:26,192 --> 00:43:28,400 And in order to read it, you hop your eyes over here, 1084 00:43:28,400 --> 00:43:29,960 and now you get most of this word. 1085 00:43:29,960 --> 00:43:32,420 You can make out the rest of an "involuntarily." 1086 00:43:32,420 --> 00:43:35,060 So there's an interesting implication here, 1087 00:43:35,060 --> 00:43:37,190 which is that you can potentially 1088 00:43:37,190 --> 00:43:43,160 increase reading speed by using this model to optimize 1089 00:43:43,160 --> 00:43:44,726 the presentation of text. 
1090 00:43:44,726 --> 00:43:46,850 And now that we can do these things electronically, 1091 00:43:46,850 --> 00:43:48,590 you can imagine all kinds of devices 1092 00:43:48,590 --> 00:43:51,754 where the word spacing and the line spacing 1093 00:43:51,754 --> 00:43:53,420 and the letter sizes and everything else 1094 00:43:53,420 --> 00:43:57,244 could change with time and position on the display. 1095 00:43:57,244 --> 00:43:58,910 So you don't have to just put things out 1096 00:43:58,910 --> 00:44:01,010 as static arrays of characters. 1097 00:44:01,010 --> 00:44:04,321 You could now imagine jumping things around and rescaling 1098 00:44:04,321 --> 00:44:04,820 things. 1099 00:44:04,820 --> 00:44:07,700 You could imagine designing new fonts that 1100 00:44:07,700 --> 00:44:11,640 caused less distortion or loss of information, et cetera. 1101 00:44:11,640 --> 00:44:14,510 So this is just going back to the trichromacy story 1102 00:44:14,510 --> 00:44:15,470 that I told you. 1103 00:44:15,470 --> 00:44:17,884 I told you that once they figured out the theory, 1104 00:44:17,884 --> 00:44:19,550 and they had all the psychophysics down, 1105 00:44:19,550 --> 00:44:21,950 the next thing that happened is all that engineering. 1106 00:44:21,950 --> 00:44:23,616 They came up with engineering standards, 1107 00:44:23,616 --> 00:44:27,690 and they used it to design devices and specify protocols 1108 00:44:27,690 --> 00:44:30,110 for transmitting images, for communicating them, 1109 00:44:30,110 --> 00:44:31,520 for rendering them. 1110 00:44:31,520 --> 00:44:34,340 I think that this has that kind of potential. 1111 00:44:34,340 --> 00:44:36,620 And this theory is too crude right now, 1112 00:44:36,620 --> 00:44:39,800 but if you had a really solid theory for what information 1113 00:44:39,800 --> 00:44:43,190 survived in the periphery, you can really 1114 00:44:43,190 --> 00:44:48,050 start to push hard on designing devices and designing 1115 00:44:48,050 --> 00:44:52,659 specifications for devices for improved whatever. 1116 00:44:52,659 --> 00:44:54,200 Sometimes you want to improve things. 1117 00:44:54,200 --> 00:44:55,866 Sometimes you want to make things harder 1118 00:44:55,866 --> 00:44:57,920 to see, like in this example. 1119 00:44:57,920 --> 00:44:59,257 So you want to build camouflage. 1120 00:44:59,257 --> 00:45:01,840 You go in, you take a bunch of photographs of the environment, 1121 00:45:01,840 --> 00:45:03,730 and then you say, let's design a camouflage 1122 00:45:03,730 --> 00:45:08,200 that best hides itself when it's not seen directly 1123 00:45:08,200 --> 00:45:09,260 within this environment. 1124 00:45:09,260 --> 00:45:13,390 So you could use these kinds of loss of information 1125 00:45:13,390 --> 00:45:15,820 to exploit things or to aid things 1126 00:45:15,820 --> 00:45:19,450 in terms of human perception. 1127 00:45:19,450 --> 00:45:21,460 So let me say just a few things about V2, 1128 00:45:21,460 --> 00:45:24,470 and then maybe I should stop. 1129 00:45:24,470 --> 00:45:28,270 So this work that Jeremy and I did 1130 00:45:28,270 --> 00:45:30,220 in building this model for metamers, which 1131 00:45:30,220 --> 00:45:32,350 is a global version of the texture model 1132 00:45:32,350 --> 00:45:35,260 that operates in local regions, led 1133 00:45:35,260 --> 00:45:39,040 us to start asking questions about what 1134 00:45:39,040 --> 00:45:42,160 we could learn by actually measuring cells in V2. 
1135 00:45:42,160 --> 00:45:44,260 And we joined forces with Tony Movshon, 1136 00:45:44,260 --> 00:45:46,240 who is the chair of my department 1137 00:45:46,240 --> 00:45:48,400 and a longtime collaborator and friend. 1138 00:45:48,400 --> 00:45:50,830 And we started a series of experiments 1139 00:45:50,830 --> 00:45:55,894 to try to explore presentations of texture to V2 neurons 1140 00:45:55,894 --> 00:45:57,310 to try to understand what we could 1141 00:45:57,310 --> 00:46:01,030 learn about the actual representations of V2. 1142 00:46:01,030 --> 00:46:03,840 And these are all done in macaque monkey. 1143 00:46:03,840 --> 00:46:08,565 And I should also mention that V2 is-- 1144 00:46:12,400 --> 00:46:14,500 it's been studied for a long time. 1145 00:46:14,500 --> 00:46:17,740 Hubel and Wiesel wrote a very important paper about V2 1146 00:46:17,740 --> 00:46:21,400 in 1965, which was quite beautiful, documenting 1147 00:46:21,400 --> 00:46:23,080 the properties that they could find. 1148 00:46:23,080 --> 00:46:25,180 But the thing that's interesting about this 1149 00:46:25,180 --> 00:46:30,460 is that V1 didn't really crack until Hubel and Wiesel figured 1150 00:46:30,460 --> 00:46:32,020 out what the magic ingredient was. 1151 00:46:32,020 --> 00:46:34,210 And the magic ingredient was orientation. 1152 00:46:34,210 --> 00:46:36,070 Before Hubel and Wiesel, people have 1153 00:46:36,070 --> 00:46:38,200 been poking at primary visual cortex, 1154 00:46:38,200 --> 00:46:40,725 showing little spots of light and little annuli-- 1155 00:46:40,725 --> 00:46:42,100 all the things that worked really 1156 00:46:42,100 --> 00:46:44,830 well in the retina and the LGN, and they were not getting 1157 00:46:44,830 --> 00:46:45,910 very interesting results. 1158 00:46:45,910 --> 00:46:48,340 They were saying, well, the receptive fields are bigger 1159 00:46:48,340 --> 00:46:51,670 and there are hot spots, positive and negative regions, 1160 00:46:51,670 --> 00:46:56,650 but the cells are not responding that well. 1161 00:46:56,650 --> 00:46:58,360 And when Hubel and Wiesel figured out 1162 00:46:58,360 --> 00:47:00,280 that orientation was the magic ingredient-- 1163 00:47:00,280 --> 00:47:03,160 and the apocryphal story is that they did that late at night, 1164 00:47:03,160 --> 00:47:05,076 and they figured it out when they were putting 1165 00:47:05,076 --> 00:47:07,180 a slide into the projector, and they had forgotten 1166 00:47:07,180 --> 00:47:08,860 to cover the cat's eyes. 1167 00:47:08,860 --> 00:47:12,940 And they put the slide into the projector, 1168 00:47:12,940 --> 00:47:15,572 and the line at the edge of the slide went past on the screen-- 1169 00:47:15,572 --> 00:47:16,560 TOMMY: it was broken. 1170 00:47:16,560 --> 00:47:17,851 EERO SIMONCELLI: It was broken. 1171 00:47:17,851 --> 00:47:21,290 Ah, I always thought it was the edge of the slide. 1172 00:47:21,290 --> 00:47:23,780 I've fibbed, and Tommy has corrected me 1173 00:47:23,780 --> 00:47:25,990 that it was something broken in the slide. 1174 00:47:25,990 --> 00:47:29,770 But in any case, the point is that a boundary went by, 1175 00:47:29,770 --> 00:47:31,600 and they heard-- 1176 00:47:31,600 --> 00:47:33,940 so they played the spikes through a loudspeaker. 1177 00:47:33,940 --> 00:47:36,550 This is what most physiologists did in those days, 1178 00:47:36,550 --> 00:47:38,422 and even still a lot do. 
1179 00:47:38,422 --> 00:47:40,630 Certainly, in Tony's lab you can always walk in there 1180 00:47:40,630 --> 00:47:42,630 and hear the spikes coming over the loudspeaker. 1181 00:47:42,630 --> 00:47:44,710 Anyway, they heard this huge barrage of spikes, 1182 00:47:44,710 --> 00:47:46,690 more than they had ever heard from any cell 1183 00:47:46,690 --> 00:47:48,580 that they had recorded from, and that 1184 00:47:48,580 --> 00:47:53,500 was the beginning of a whole sequence of just fabulous work. 1185 00:47:53,500 --> 00:47:55,390 And that tool-- 1186 00:47:55,390 --> 00:47:58,280 very simple and very obvious in retrospect-- 1187 00:47:58,280 --> 00:48:01,230 was absolutely critical for the progress. 1188 00:48:01,230 --> 00:48:05,860 The point is that the stimuli matter, and making the jump 1189 00:48:05,860 --> 00:48:08,660 to the right stimuli changes everything. 1190 00:48:08,660 --> 00:48:11,800 So V2 for the last 40 years has been 1191 00:48:11,800 --> 00:48:14,740 sitting in this difficult state where people 1192 00:48:14,740 --> 00:48:16,420 keep throwing stimuli at it. 1193 00:48:16,420 --> 00:48:17,410 They try angles. 1194 00:48:17,410 --> 00:48:18,340 They try curves. 1195 00:48:18,340 --> 00:48:19,600 They try swirly things. 1196 00:48:19,600 --> 00:48:20,830 They try corners. 1197 00:48:20,830 --> 00:48:25,840 They try contours of various kinds, illusory contours. 1198 00:48:25,840 --> 00:48:29,830 And throughout all of this, the end story 1199 00:48:29,830 --> 00:48:35,200 is V2 cells have bigger receptive fields, many of them 1200 00:48:35,200 --> 00:48:38,230 respond to orientation, some of them 1201 00:48:38,230 --> 00:48:40,690 respond to particular combinations of orientation, 1202 00:48:40,690 --> 00:48:44,969 but it's usually a small subset, and the responses are weak. 1203 00:48:44,969 --> 00:48:46,510 And that's really what the literature 1204 00:48:46,510 --> 00:48:49,160 has looked like for 40 years. 1205 00:48:49,160 --> 00:48:53,560 So what we were after is, can we drive these cells convincingly, 1206 00:48:53,560 --> 00:48:55,540 and in a way that we can document 1207 00:48:55,540 --> 00:48:58,840 as significantly different from what we see in V1? 1208 00:48:58,840 --> 00:49:00,250 That was the goal-- 1209 00:49:00,250 --> 00:49:04,870 find a way to drive most of the cells 1210 00:49:04,870 --> 00:49:07,600 and to drive them differently than what 1211 00:49:07,600 --> 00:49:10,000 one would expect in V1. 1212 00:49:10,000 --> 00:49:12,940 As a starting point, we succeeded with textures. 1213 00:49:12,940 --> 00:49:15,290 So basically, we took a bunch of textures. 1214 00:49:15,290 --> 00:49:17,710 Here are some example textures drawn from the model. 1215 00:49:17,710 --> 00:49:20,101 Down below are spectrally-matched equivalents. 1216 00:49:20,101 --> 00:49:22,600 So these things have the same power spectra, the same amount 1217 00:49:22,600 --> 00:49:24,850 of energy in different orientation and frequency bands, 1218 00:49:24,850 --> 00:49:28,330 but they lack all the higher-order statistics that 1219 00:49:28,330 --> 00:49:31,390 are coming in this texture model that give you nice, 1220 00:49:31,390 --> 00:49:35,480 clean edges and contours and object-y things, 1221 00:49:35,480 --> 00:49:37,150 or lumps of objects. 1222 00:49:37,150 --> 00:49:39,370 And sure enough-- so here are some example cells. 1223 00:49:39,370 --> 00:49:40,460 Here are three V1 cells. 1224 00:49:40,460 --> 00:49:42,000 Here are three V2 cells.
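[Before the responses: the spectrally-matched noise stimuli can be made by keeping each texture image's Fourier amplitude spectrum and randomizing its phases. Phase scrambling is the standard recipe for this, though the talk does not spell out the exact procedure used.]

```python
import numpy as np

def spectrally_matched_noise(image, seed=0):
    """Noise with the same power spectrum as `image` -- the same energy in
    every orientation and frequency band -- but randomized Fourier phases,
    which destroys the higher-order statistics behind clean edges,
    contours, and object-y lumps."""
    rng = np.random.default_rng(seed)
    amplitude = np.abs(np.fft.fft2(image))
    # Phases of a real-valued noise image have the conjugate symmetry
    # needed for the result to come back real.
    phase = np.angle(np.fft.fft2(rng.standard_normal(image.shape)))
    return np.real(np.fft.ifft2(amplitude * np.exp(1j * phase)))
```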
1225 00:49:42,000 --> 00:49:44,060 And in each of these plots, there's two curves. 1226 00:49:44,060 --> 00:49:45,460 These are shown over time. 1227 00:49:45,460 --> 00:49:48,280 The stimulus is presented here for 100 milliseconds. 1228 00:49:48,280 --> 00:49:50,200 You see a little bump in the response. 1229 00:49:50,200 --> 00:49:52,300 And there's a light curve and a dark curve. 1230 00:49:52,300 --> 00:49:55,630 The light curve is the response to the spectrally-matched 1231 00:49:55,630 --> 00:49:58,930 noise, and the dark curve is the response to the texture, 1232 00:49:58,930 --> 00:50:00,830 with the higher-order statistics. 1233 00:50:00,830 --> 00:50:04,200 V1 doesn't seem to care is the short answer here, 1234 00:50:04,200 --> 00:50:07,860 and V2 cares quite significantly. 1235 00:50:07,860 --> 00:50:10,230 So when you put those higher-order statistics in, 1236 00:50:10,230 --> 00:50:13,740 almost all V2 cells respond significantly more, 1237 00:50:13,740 --> 00:50:15,660 and you can see that in these three examples. 1238 00:50:15,660 --> 00:50:16,720 These are not unusual. 1239 00:50:16,720 --> 00:50:18,910 That's what most of the cells look like. 1240 00:50:18,910 --> 00:50:22,560 So here's a plot, just showing you 63% of the V2 neurons 1241 00:50:22,560 --> 00:50:24,600 significantly and positively modulated. 1242 00:50:24,600 --> 00:50:27,360 And by the way, this is averaged over all the textures 1243 00:50:27,360 --> 00:50:28,740 that we showed them. 1244 00:50:28,740 --> 00:50:30,630 And if you pick any individual cell, 1245 00:50:30,630 --> 00:50:32,130 there's usually a couple of textures 1246 00:50:32,130 --> 00:50:33,780 that drive it really well, and then 1247 00:50:33,780 --> 00:50:35,613 a bunch of textures that drive it less well. 1248 00:50:35,613 --> 00:50:37,260 So this effect could be made stronger 1249 00:50:37,260 --> 00:50:40,500 if you chose only the textures that drove the cell well. 1250 00:50:40,500 --> 00:50:43,290 And up here is V1, where you can see that very few of them 1251 00:50:43,290 --> 00:50:46,290 are modulated by the existence of these higher-order 1252 00:50:46,290 --> 00:50:47,680 statistics. 1253 00:50:47,680 --> 00:50:50,070 Oh, here it is across texture category. 1254 00:50:50,070 --> 00:50:54,000 So now on the horizontal axis is the texture category-- 1255 00:50:54,000 --> 00:50:56,310 15 different textures, and you can see, again, 1256 00:50:56,310 --> 00:50:59,640 that V1 is pretty much very close to the same responses-- 1257 00:50:59,640 --> 00:51:02,220 dark and light, again, for the spectrally-matched 1258 00:51:02,220 --> 00:51:03,480 and the higher-order. 1259 00:51:03,480 --> 00:51:08,100 And for these three V1 cells, they're 1260 00:51:08,100 --> 00:51:11,640 basically the same responses for each of these pairs. 1261 00:51:11,640 --> 00:51:14,460 And for the V2 cells, there are always 1262 00:51:14,460 --> 00:51:18,696 at least some textures where there's an extreme difference. 1263 00:51:18,696 --> 00:51:20,070 So this is a really good example. 1264 00:51:20,070 --> 00:51:22,020 There's a huge difference in response 1265 00:51:22,020 --> 00:51:24,360 here for these two textures, but for actually many 1266 00:51:24,360 --> 00:51:28,130 of the other textures, there's not much of a difference. 1267 00:51:28,130 --> 00:51:31,470 So sort of a success. 
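[A natural way to quantify "cares" versus "doesn't care" here is a per-cell modulation index comparing the mean response to the naturalistic textures against the mean response to the spectrally matched noise. The index below, and the firing rates in the toy usage, are illustrative assumptions rather than the reported analysis.]

```python
import numpy as np

def modulation_index(r_texture, r_noise):
    """Per-cell modulation by higher-order statistics: the difference
    between the mean rate for naturalistic texture and for its
    spectrally matched noise, normalized by their sum. Positive means
    the cell prefers the texture."""
    t, n = np.mean(r_texture), np.mean(r_noise)
    return (t - n) / (t + n)

# Hypothetical rates (spikes/s), averaged across texture families:
v1_like = modulation_index([22.0, 25.0, 19.0], [21.0, 26.0, 20.0])
v2_like = modulation_index([30.0, 41.0, 25.0], [18.0, 22.0, 16.0])
print(f"V1-like: {v1_like:+.2f}   V2-like: {v2_like:+.2f}")
```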
1268 00:51:31,470 --> 00:51:33,480 And the last thing I was going to tell you about 1269 00:51:33,480 --> 00:51:36,120 is that we think-- 1270 00:51:36,120 --> 00:51:39,630 so this is really fitting, given what Jim DiCarlo told you 1271 00:51:39,630 --> 00:51:42,900 about, or what I assume he told you about-- 1272 00:51:42,900 --> 00:51:47,220 this idea of tolerance or invariance versus selectivity. 1273 00:51:47,220 --> 00:51:49,176 We wanted to know, how can we take 1274 00:51:49,176 --> 00:51:50,550 what we know about these V2 cells 1275 00:51:50,550 --> 00:51:53,280 and pull it back into the perceptual domain? 1276 00:51:53,280 --> 00:51:54,930 How can we ask, what is it that you 1277 00:51:54,930 --> 00:51:57,270 could do with a population of V2 cells 1278 00:51:57,270 --> 00:52:00,900 that you couldn't do with a population of V1 cells? 1279 00:52:00,900 --> 00:52:04,050 And the thought was if the V2 cells are responding 1280 00:52:04,050 --> 00:52:06,750 to these texture statistics, then 1281 00:52:06,750 --> 00:52:10,080 if I made a whole bunch of samples of the same texture, 1282 00:52:10,080 --> 00:52:12,660 the V2 cells should be really good at identifying 1283 00:52:12,660 --> 00:52:14,860 which texture that is-- 1284 00:52:14,860 --> 00:52:17,130 which family it came from. 1285 00:52:17,130 --> 00:52:19,770 And the V1 cells will be all confused by the fact 1286 00:52:19,770 --> 00:52:22,680 that those samples each have different details that 1287 00:52:22,680 --> 00:52:23,820 are shifting around. 1288 00:52:23,820 --> 00:52:26,610 So the V1 cells will respond to those details, 1289 00:52:26,610 --> 00:52:29,310 and they'll give a huge variety of responses 1290 00:52:29,310 --> 00:52:33,090 across re-samplings from that family, 1291 00:52:33,090 --> 00:52:36,270 and the V2 cells will be more invariant or more 1292 00:52:36,270 --> 00:52:38,310 tolerant to re-sampling from that family. 1293 00:52:38,310 --> 00:52:39,419 That was the concept. 1294 00:52:39,419 --> 00:52:40,960 And that turns out to be the case, so 1295 00:52:40,960 --> 00:52:42,450 let me show you the evidence. 1296 00:52:42,450 --> 00:52:45,720 So here are four different textures, 1297 00:52:45,720 --> 00:52:47,990 four different-- what we call different families. 1298 00:52:47,990 --> 00:52:51,360 Here are images of three different examples drawn from each. 1299 00:52:51,360 --> 00:52:53,550 So these are just three samples drawn, 1300 00:52:53,550 --> 00:52:55,334 starting with different white noise seeds. 1301 00:52:55,334 --> 00:52:57,750 And you can see that they're actually physically different 1302 00:52:57,750 --> 00:53:00,820 images, but they look the same. 1303 00:53:00,820 --> 00:53:01,530 Three again. 1304 00:53:01,530 --> 00:53:03,090 Three again. 1305 00:53:03,090 --> 00:53:08,915 And so we got 100 cells from V1 and about 100 cells from V2. 1306 00:53:08,915 --> 00:53:11,880 The stimuli are presented for 100 milliseconds. 1307 00:53:11,880 --> 00:53:13,380 We do 20 repetitions each. 1308 00:53:13,380 --> 00:53:14,580 We need a lot of data. 1309 00:53:14,580 --> 00:53:18,330 And what's shown here is just this 4 by 3 array, 1310 00:53:18,330 --> 00:53:20,280 but we actually had 15 different families 1311 00:53:20,280 --> 00:53:22,640 and 15 examples of each. 1312 00:53:22,640 --> 00:53:24,360 20 repetitions of each of those. 1313 00:53:24,360 --> 00:53:26,910 225 stimuli times 20 repetitions. 1314 00:53:26,910 --> 00:53:28,400 That's the experiment.
1315 00:53:28,400 --> 00:53:31,529 So what we wanted to know is, does the hypothesis hold? 1316 00:53:31,529 --> 00:53:32,570 And so here's an example. 1317 00:53:32,570 --> 00:53:36,390 These are responses laid out for these 12 stimuli. 1318 00:53:36,390 --> 00:53:38,760 And what you can see is that this is a V1 neuron-- 1319 00:53:38,760 --> 00:53:40,260 a typical V1 neuron. 1320 00:53:40,260 --> 00:53:41,970 You can see that the neuron actually 1321 00:53:41,970 --> 00:53:45,390 responds with a fair amount of variety in these columns. 1322 00:53:45,390 --> 00:53:49,380 That is, for different exemplars from the same family, 1323 00:53:49,380 --> 00:53:50,460 there's some variety. 1324 00:53:50,460 --> 00:53:52,350 High response here, medium response here, 1325 00:53:52,350 --> 00:53:53,820 very low response here. 1326 00:53:53,820 --> 00:53:56,160 And this is for these three images, which 1327 00:53:56,160 --> 00:53:58,380 to us look basically the same. 1328 00:53:58,380 --> 00:54:02,680 So this cell would not be very good at separating out 1329 00:54:02,680 --> 00:54:07,140 or recognizing or helping in the process of recognizing 1330 00:54:07,140 --> 00:54:09,720 which kind of texture you were looking at, because it's 1331 00:54:09,720 --> 00:54:13,800 flopping all over the place when we draw different samples. 1332 00:54:13,800 --> 00:54:15,780 Compare that to a V2 cell-- 1333 00:54:15,780 --> 00:54:17,880 this is a typical V2 cell, which you can see 1334 00:54:17,880 --> 00:54:19,920 is much more stable across these columns. 1335 00:54:19,920 --> 00:54:22,470 This is roughly the same response here, roughly the same 1336 00:54:22,470 --> 00:54:24,840 here, a little bit of variety in this one, 1337 00:54:24,840 --> 00:54:26,580 roughly the same in this one. 1338 00:54:26,580 --> 00:54:30,220 And sure enough, if you actually go and plot this, 1339 00:54:30,220 --> 00:54:33,450 V2 has much higher variance across families. 1340 00:54:33,450 --> 00:54:34,890 That's vertically. 1341 00:54:34,890 --> 00:54:36,160 These are the V1 cells. 1342 00:54:36,160 --> 00:54:37,142 These are the V2 cells. 1343 00:54:37,142 --> 00:54:38,850 And this is the variance across families. 1344 00:54:38,850 --> 00:54:40,940 This is the variance across exemplars. 1345 00:54:40,940 --> 00:54:44,070 V2 has higher variance typically across families, 1346 00:54:44,070 --> 00:54:47,070 and V1 has higher variance across exemplars. 1347 00:54:47,070 --> 00:54:50,380 And now if you take populations of equal size-- 1348 00:54:50,380 --> 00:54:53,190 100 of each, and you ask, well, how good 1349 00:54:53,190 --> 00:54:55,050 would I be at taking that population 1350 00:54:55,050 --> 00:55:02,430 and identifying which family, which kind of texture 1351 00:55:02,430 --> 00:55:03,380 I'm looking at? 1352 00:55:03,380 --> 00:55:04,915 And we do this with cross-validation 1353 00:55:04,915 --> 00:55:05,540 and everything. 1354 00:55:05,540 --> 00:55:07,790 I can give you the details later, if you want to know. 1355 00:55:07,790 --> 00:55:12,680 We find V2 is always better than V1 in doing this task. 1356 00:55:12,680 --> 00:55:16,460 So we can do a better job in performing this task-- 1357 00:55:16,460 --> 00:55:19,460 identifying which of these families 1358 00:55:19,460 --> 00:55:23,540 a given example was drawn from if we look at V2 than if we 1359 00:55:23,540 --> 00:55:24,560 look at V1.
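[Both analyses in this passage -- the variance split across families versus exemplars, and the cross-validated population decoding -- can be sketched briefly. The arrays here are hypothetical trial-averaged firing rates, and nearest-centroid classification is an assumed stand-in for whatever decoder was actually used; the exemplar-identification flip discussed next uses the same machinery with the roles of families and exemplars swapped.]

```python
import numpy as np

def variance_split(responses):
    """responses: (n_families, n_exemplars) mean rates for one cell.
    Returns (variance across family means, mean variance across exemplars
    within a family). V2-like cells should have the first large and the
    second small; V1-like cells the reverse."""
    across_families = np.var(responses.mean(axis=1))
    across_exemplars = np.mean(np.var(responses, axis=1))
    return across_families, across_exemplars

def decode_family(population, held_out):
    """Leave-one-exemplar-out family identification from a population.
    population: (n_cells, n_families, n_exemplars) mean rates. Classifies
    each held-out exemplar by the nearest family centroid and returns the
    fraction of families identified correctly."""
    train = np.delete(population, held_out, axis=2)
    centroids = train.mean(axis=2)              # (n_cells, n_families)
    probe = population[:, :, held_out]          # (n_cells, n_families)
    n_fam = population.shape[1]
    correct = sum(
        int(np.argmin(np.linalg.norm(centroids - probe[:, [f]], axis=0)) == f)
        for f in range(n_fam))
    return correct / n_fam
```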
1360 00:55:24,560 --> 00:55:26,270 And if we flip that around and we 1361 00:55:26,270 --> 00:55:27,830 try to do exemplar identification, 1362 00:55:27,830 --> 00:55:31,400 with 15 different examples of a given family-- 1363 00:55:31,400 --> 00:55:33,430 if we say, which one was it? 1364 00:55:33,430 --> 00:55:36,800 It turns out that V1 is better than V2 for that. 1365 00:55:36,800 --> 00:55:39,610 So we think of this as evidence that V2 1366 00:55:39,610 --> 00:55:44,260 has some invariance across these samples, 1367 00:55:44,260 --> 00:55:46,150 whereas V1 is much more specialized 1368 00:55:46,150 --> 00:55:48,640 for the particular samples. 1369 00:55:48,640 --> 00:55:51,970 This work started with this fantastic post-doc 1370 00:55:51,970 --> 00:55:54,700 that I had mentioned earlier, Javier Portilla. 1371 00:55:54,700 --> 00:55:56,800 Jeremy Freeman came into my lab, and we just 1372 00:55:56,800 --> 00:55:59,950 jumped all over this in making the metamers. 1373 00:55:59,950 --> 00:56:01,810 Josh McDermott is on here because I usually 1374 00:56:01,810 --> 00:56:05,454 also play the auditory examples and walk 1375 00:56:05,454 --> 00:56:06,870 through a little bit of that work, 1376 00:56:06,870 --> 00:56:09,160 but I'm going to leave that for him. 1377 00:56:09,160 --> 00:56:11,740 And Corey Ziemba, who's a student in the lab 1378 00:56:11,740 --> 00:56:14,560 right now, did a lot of the physiology 1379 00:56:14,560 --> 00:56:18,280 that I showed you, in Tony's lab. 1380 00:56:18,280 --> 00:56:22,310 And we were funded by HHMI and also the NIH. 1381 00:56:22,310 --> 00:56:23,940 So thanks.