SHIMON ULLMAN: Now for a different, entirely different type of issue, one that has more to do with recognition, some psychophysics, some computer vision. But you will see at the end that the motivation was really to be able to recognize and understand really complicated things that are happening in natural images.

Now, when we look at objects in the world: people have worked a lot on object recognition, and we can recognize complete objects well, but we can also recognize a very limited configuration of objects. So we are very good at using limited information, if this is what is available, in order to recognize what is there. This is some arbitrary collection; I guess you can recognize all of them. Some of them, if you think about a person or even a face (this is a very small part of a face), everybody I guess knows what it is, right? It's not even a recognizable, well-delineated part like an eye. You see a part of a person here; we know what it is, right? Everybody recognizes this, and so on.

Now, I think that the ability to get all the information out, even from a limited region, plays an important role in understanding images. Let me motivate it by one example; I'll go back to it at the end. When we look at these images, we know what's happening. We know what the action here is. All the people are performing the same action: they are all drinking, even drinking from a bottle, right? But the images as images are very different. If you stored one image and tried to recognize another image based on the first one, it would be difficult. The variability here can be huge.
But if you focus on where the action really takes place, the most informative part, where most of the answer to what's happening is already given to you, which is where the bottle is docked into the mouth, you can see that these diverse images become virtually a copy of one another, almost the same. So if you manage to understand this and extract the most informative part, although it is limited, the variability will be much, much reduced. The variability here is much reduced compared to the variability in the entire image. Most of the other stuff is much less relevant; this is where the information is concentrated. And in this limited, restricted configuration, recognition will be much easier and will generalize much better from one situation to another, because of this principle of highly reduced variability in the delimited image.

So we became interested in this. As you see, it's useful. But dealing with small images and still recognizing them raises some very challenging issues, and I want to discuss those a little bit, and then also discuss a little bit more what it's good for. I will show you some human studies; what we wanted to see is what are the minimal images that people can still recognize. We examined some computational models, and I will not keep the secret: it turns out that well-performing current schemes, including deep networks, cannot deal well with such minimal images. And, from this, I want to discuss some implications in terms of representations in our system, brain processing, and things like this.

Quite a number of people have been involved in this; here are the names. Some of them are at the Weizmann Institute in Israel, and a few are here. Leyla Isik is here in the summer school, and a student, Yena Han, at MIT is doing some brain imaging on this, which I will mention very briefly.
So I'll start with the human study. We are looking for minimal, atomic things in recognition, and the experiment goes like this. You show a subject an image and ask them to recognize it, just produce a label. So this is a dog; if they say a dog, they recognized it. And if they recognize it correctly, we generate five descendants from this initial image. Say this image was 50 by 50 pixels (I'll say more about pixels in a minute). We make it somewhat smaller, we reduce it, because it's still not minimal, and we reduce it in five ways. We either crop it at one of the four corners to create, say, a 48 by 48 image, by taking two pixels off from this corner here, or this corner here, and so on, to get the descendants. Or we take the full image, keep it as is, and do not crop it; we just reduce the resolution. We resample it so some details start to become lost: it is still the full image, but 48 by 48 instead of 50 by 50 pixels.

And then each one of these images (now we have five) is given again to a subject; this is beginning to expand as a tree. If they recognize it, again five descendants are generated, and we explore the entire tree until we find all the sub-images which are minimal and still recognizable in this original configuration.

Now, this is challenging psychophysically in terms of the number of subjects, because we use a subject only once. If you show a subject this image and he recognizes it, and then you show the same subject a reduced image, he will recognize it based on his previous exposure. So you do not want to use him again; you show the other images to a new subject. And this requires a large number of subjects.
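To make the search procedure concrete, here is a minimal sketch of it, assuming grayscale images stored as 2D arrays. It is not the experiment's actual code: is_recognized is a hypothetical stand-in for showing a patch to a fresh group of subjects and checking their naming rate against the recognition criterion.

```python
# Minimal sketch of the descendant generation and tree search described above.
from scipy.ndimage import zoom

def descendants(img, step=2):
    """Five descendants: four corner crops and one resolution reduction."""
    h, w = img.shape
    crops = [img[:h - step, :w - step],   # keep the top-left region
             img[:h - step, step:],       # keep the top-right region
             img[step:, :w - step],       # keep the bottom-left region
             img[step:, step:]]           # keep the bottom-right region
    reduced = zoom(img, (h - step) / h, order=1)  # full view, fewer sample points
    return crops + [reduced]

def find_minimal_images(img, is_recognized):
    """Patches that are recognized while none of their descendants are."""
    minimal, stack = [], [img]
    while stack:
        node = stack.pop()
        if not is_recognized(node):
            continue                       # unrecognizable: stop this branch
        children = [c for c in descendants(node) if is_recognized(c)]
        if children:
            stack.extend(children)         # still reducible, keep going down
        else:
            minimal.append(node)           # a minimal recognizable configuration
    return minimal
```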
So 15,000 subjects participated in this experiment online, on Mechanical Turk, together with some laboratory controls to check that they were doing the right thing, and to see how it compares with the same experiment done under laboratory conditions, and so on.

The way we define a minimal image for recognition is in this tree. Here is an image, and this image is recognizable. Then we create the five descendants, and none of the descendants is recognizable. So this one is recognizable, and nothing below it is recognizable. It's minimal because you can no longer reduce it, either by lowering the resolution or by reducing the size; any manipulation like this will make it unrecognizable.

Technically, when I'm measuring and using numbers, saying the image is 50 pixels or 35 pixels, this is actually well defined. I don't mean the pixels on the screen; you can take the image and make it bigger or smaller on the screen. But the number of sampling points in the image is well defined. When you give me a particular image, I can tell you how many sample points you need in order to capture it. Technically, for those of you who know it: if you take the Fourier transform and take twice the cutoff frequency, the highest frequency in the Fourier spectrum, then by Shannon's sampling theorem that is the number of points you need per dimension to represent the image. So when I say that the image was 35 pixels, I don't really care whether you make it somewhat smaller or larger on the screen by interpolation; that doesn't change the information content. How many discrete sampling points there are in these images is a mathematically well-defined notion.
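As a rough illustration of this well-defined sample count, one could estimate it from the image spectrum. The specific recipe below (a radial energy threshold for picking the cutoff frequency) is only an assumed way of doing it; the lecture states only the Nyquist relation itself.

```python
# Rough sketch of the "number of sampling points" idea above.
import numpy as np

def effective_samples(img, energy_frac=0.99):
    """Samples per dimension ~ twice the radial cutoff frequency that
    contains energy_frac of the spectral energy (Shannon/Nyquist)."""
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    h, w = img.shape
    yy, xx = np.mgrid[:h, :w]
    radius = np.hypot(yy - h / 2, xx - w / 2)    # frequency in cycles per image
    order = np.argsort(radius.ravel())
    cumulative = np.cumsum(spectrum.ravel()[order])
    idx = np.searchsorted(cumulative, energy_frac * cumulative[-1])
    cutoff = radius.ravel()[order][idx]
    return 2 * cutoff                            # two samples per cycle at the cutoff
```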
So a very interesting thing that we found when we found these minimal images is that there is a sharp transition when you get to the level of the minimal images. You go down and you recognize it, and then there is a sharp transition: it suddenly becomes unrecognizable to the large majority of people. It can change a little bit, and I'll show you some examples so you can see what these minimal images look like at the recognizable level and at the unrecognizable level. This is the recognizable level; this is the unrecognizable level.

To show you some examples here, I will show you first the unrecognizable one, the one which people find, on average, more difficult to recognize. If you recognize it, raise your hand. Don't say what you see, because this will influence other people; just raise your hand if you recognize the image. Then I'll show you the more recognizable one, and let's see if more hands show up, if the distinction between recognizable and unrecognizable holds here. I'll just show you a couple of examples.

OK, so this is the one which is supposed to be difficult to recognize. If you see what it is, if you know what the object is, raise your hand. OK, good. Don't say what it is. We have two. Let's see here. OK, certainly more hands. What do you see? What do you think?

AUDIENCE: Should I say it?

SHIMON ULLMAN: OK, now you can say it.

AUDIENCE: A horse.

SHIMON ULLMAN: A horse, right. So let me show them side by side. You see that it's very difficult to recognize the unrecognizable one; you can see the statistics. This one is recognized by 93% of the subjects (30 subjects saw each one of these images). 93% recognized this; 3% recognized this. And you look at the images and you see that they are very similar, and yet it drops from 90% to 3%. So you can see the two images, you can see the similarity, and you can see the large drop.
This is part of the entire tree which is being explored. This is the parent, the recognized one, the minimal image. And you can see that even reducing the resolution, which is really not a big manipulation, produces a drop in performance. We used 50% as our criterion: the parent should be recognized above it, the descendant below it. But typically the jump is very sharp.

Let's try two more, just for fun. If you can recognize it, raise your hand. OK, nobody, just for the record. OK, look around; you can see many. What do you see?

AUDIENCE: A boat.

SHIMON ULLMAN: A boat, right. So you can see the two images: 80% on this one, 0% here. And you can see that what's really missing here is the tip. There are many contours in this image, but this particular sharp corner makes an enormous difference, and it goes from 80% to 0%.

OK, let me skip. Just one more. OK, let me skip this one. This is somewhat easier. OK, at least one, two, three. How about this one? Everybody, I think, or maybe we are missing one. So, again, you can see that if you look at the two there is a difference, and it's this thing here. It's not a very big part of the image, but it's crucial. You have to be trained on this; it's part of your representation; it's important. You go from almost 90% to roughly 15%. So it's important, and you can see that the drop is typically very, very sharp.
The sharp transition is also interesting in another sense. If it drops, like the horse, from 90% to 3%, or even here, it also says that we all carry around in our heads a very similar representation. Because if each one of us, based on our history and visual experience, were more or less sensitive to various features, then we would not find this sharp transition; different people would lose it at different points in the manipulation. But at 90%, roughly everybody recognizes it; you remove a feature and it goes to 3%. So everybody is using the same, or a very similar, representation, which I find somewhat surprising, at least for some of these images. We don't all have the same kind of experience with horses, or with battleships, or things like that, and still the representation is strikingly similar across individuals.

The experiment was done on 10 different objects. These are the initial objects: I showed you the object at the beginning of the hierarchy, and then you start the manipulation to discover all the minimal images inside them. And we ended up with a very nice catalog. We have a database of all the minimal images in all of these 10 images, and of all the children, the unrecognizable ones. So, in terms of modeling and in terms of exploring visual features and what is necessary in order to recognize, and so on, there is a very rich data set here of all the minimal images in all of these 10 images.

Here are some more pairs of recognizable and unrecognizable images. We already saw this in principle, but in some cases it's pretty clear what may be going on. For example, these are horse legs, the front legs of the horse; this seems to be important. You can see that very often the difference is tremendously small: in this fly image, very small differences, very hard to pinpoint. And in the eyeglasses it's the glass; something here is missing a little bit.
But very small things, in a very reliable way, cause this dramatic change. As somebody here mentioned about the inflection point, you can manipulate this psychophysically a bit more. For example, this was another version of a minimal image; it was cropped at two locations. You can crop only the left side, or only the bottom side, and try to see what makes the difference. So you can really zoom in on the critical features.

In terms of the number of pixels, the impression is that it's surprisingly small. I guess you can recognize that this is an eagle and this is an airplane. And the number of pixels: those of you who know vision, your retina has 120 million pixels. The fovea, which is the area of very high acuity, is 2 degrees; it's about 250 by 250, 250 by 200 pixels. That is the area at the center, the area of high acuity. But you can recognize things with, I don't know, 15, 20 pixels. It's 1/10 of your fovea. It's tiny, tiny. You can make it larger on the screen, but in terms of how much visual information there is, I find it surprising that you need very, very little.

It's also interesting that this is very useful, in the sense that it's very redundant. If you have a visual system that can recognize each one of these minimal images individually, and in fact they can be recognized on their own, then a full image like this contains a large number of partially overlapping minimal images. Some of them are large. Each one of these colored frames is a minimal image, shown not necessarily at the right resolution; you can reduce the resolution of things. You can see that some of them are essentially low-resolution representations of the entire object, like almost the entire eagle, but some of them just contain something relatively small around the head and the eye.
For the eye region, you can see that you can get a low-resolution version of almost everything, but just the corner of the eye and things like that are also enough. We find, in general, that for things related to humans you have a large number of these minimal images.

So they provide a sensitive tool to compare representations, to see what is missing in the sub-image that made the image become unrecognizable. We sometimes call them minimal recognizable configurations. We call them configurations, not objects, because they are not objects, and not parts, because, as we saw in the examples, they do not have to be well-delineated parts; they are more like local configurations. But, anyway, minimal images.

The next thing that we did: we were wondering about this kind of behavior. The ability to recognize these images from such minimal information places an interesting challenge, or an interesting test, on a recognition system, because you really have to extract and use all the available information. By definition, this is minimal. If you do not use all the information that is in this minimal image, then you don't have the minimal information; you have less than that, and you will fail. So a system that is not good enough will fail on these minimal images, and the ability to recognize them means that you really can suck out all the relevant information. So we were wondering what would happen if we showed them to various computational algorithms that performed well on full images. What happens when you challenge them with things which are, by nature, designed to be non-redundant?

So here is what I will do. This is not a computer vision school, so I will not go too much through the details of the computational schemes, just show you what was happening. And the bottom line is that they are not doing a good job. Two things happen.
First, when you train a computational system, you do not see the same drop that you see here, where it recognizes one image and does not recognize the other. You don't get a drop in recognition. This sort of phase transition that characterizes the human visual system is not reproduced in any of the current recognition systems, including deep networks and any other ones. And, second, they are not very good at recognizing the minimal images. Regardless of whether there is a sharp transition, they do not get good recognition results on these minimal images; they do not suck out all the necessary information.

With the full images it worked like this. We had images of side views of a plane, so we are training on airplanes. You can think of a deep network; we actually tried a whole range of good classifiers. Those of you who are not in vision probably got enough at the beginning of this summer school to have a feeling for what a classifier is in computer vision. It's a system, an algorithm, a scheme, that you give training images. You don't have to tell it what to look for; you just give it lots of images and tell it that all of these are of the same class. Then it calibrates itself, adjusts parameters, and so on. And then you give it new images, and the system is supposed to tell you whether each one is a new member of the same class or not.

So, in this case, we trained the systems by giving them full side views of an airplane, but then we gave them just the tails. Compared to random patches taken from other images, the question is: can they reliably tell you that this is the tail of an airplane, part of the previous class? Or will they be confused and give an even higher score to things which do not come from an airplane at all?
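One simple way to quantify this kind of test is sketched below; model_score is a hypothetical stand-in for whichever trained classifier is being evaluated, and the pairwise win rate is just one possible measure, not necessarily the one used in the study.

```python
# Sketch of the test just described: a classifier trained on full airplane
# views scores tail patches against random patches taken from other images.
import numpy as np

def tail_vs_nonclass(model_score, tail_patches, nonclass_patches):
    """Fraction of (tail, non-class) pairs where the tail scores higher."""
    tails = np.array([model_score(p) for p in tail_patches])
    others = np.array([model_score(p) for p in nonclass_patches])
    win_rate = np.mean([(t > others).mean() for t in tails])
    return win_rate   # 1.0 means tails are always ranked above non-class patches
```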
So we started this when deep networks were still not the leaders, and we had some other things, like DPM, including HMAX, which is a very good model of the human visual system and performs very well. So we included it as well, and deep networks as well. This is HMAX; this is convolutional neural networks.

You probably got the idea, but it's worth pointing out something I find interesting about the computer vision community: you have Olympic games every year. It's something which is very structured and very competitive, and very nice in this regard. There is the Pascal challenge, and the ImageNet challenge, and it's well run. People who think they have a better algorithm than others can submit an entry, submit an algorithm. Everybody gets training images that are distributed publicly, but there are secret images used for testing. You can train your algorithm on the available data; everybody uses the same data. Then you submit your algorithm, the algorithm is run by the central committee on the test images, the results are published, and everybody knows who's number one and who's number two. You have the gold medal and the silver medal. It's very competitive, and in some sense it's doing very good things; it's driving the performance up. It also has some negative effects, I think, on the way things are being done. One negative is that it's very difficult to come up with an entirely new scheme which explores a completely new idea. Because, initially, before you fine-tune it, it will not be at the level of the high-performing winners, and until it establishes itself as a winner, it will not get credit. So the field becomes a little bit conservative in this regard, which is the unfortunate part.

So, as I told you, and I will not go into great detail, there are two basic outcomes. The first concerns the gap between the recognizable and unrecognizable images. These two bars are the gap for human vision; that's the whole group of horse images.
The parents are highly recognizable; the children, the offspring, are not recognizable. A very large drop. This drop is not recaptured in any of the models. If you have a deep network, or one of these classifiers, what is recognized and not recognized depends on the threshold. You can decide that: it gives you a number and says, I have this-and-this confidence that this belongs to the class. So what we did here is try to match them. We had a class of images, and people recognized them at 80%. So we put the threshold in the artificial system, the computer vision system, at such a level that it correctly recognized 80% of the minimal images. So you match them. And then we looked at how many of the sub-images passed the threshold. And you get (this is for the deep network) that, instead of a gap, you actually get an anti-gap: it actually recognized a few more. But this should not confuse you. It does not mean that the deep network did better than humans; it actually did much worse, although the bars here are higher. The reason is the following. Even with a very bad classifier, you can always get 80% recognition by just lowering the threshold, and then 80% of the class examples will exceed it. The question is how many garbage images, non-class images, will also pass the threshold at the same time. If you get 80% of the class but also lots and lots of false positives, non-class images also saying "I'm an airplane," then that's bad performance. So these high bars by themselves do not say anything. The actual recognition levels were very low. You can see here for deep networks that this high bar is the performance on new airplanes; for airplanes it did very well. But the percent correct on the minimal images was 3% or 4%, very, very low. So it did very badly on the minimal images.
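The threshold-matching step can be sketched as follows, assuming arrays of classifier scores for the minimal images and for non-class patches (the names are illustrative, not from the study):

```python
# Small sketch of the threshold matching just described.
import numpy as np

def match_human_hit_rate(scores_minimal, scores_nonclass, human_hit_rate=0.80):
    # Place the threshold so that 80% of the minimal images pass it,
    # mirroring the human recognition rate on the same images.
    threshold = np.quantile(scores_minimal, 1.0 - human_hit_rate)
    hit_rate = np.mean(scores_minimal >= threshold)       # ~0.80 by construction
    false_alarms = np.mean(scores_nonclass >= threshold)  # the number that matters
    return threshold, hit_rate, false_alarms
```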
So recognition of minimal images does not emerge by training any of the existing models that I know of, including deep network models.

Now, the second test, as was asked here, was another large test. All of these things actually took a lot of effort and time. Because now we have this: this one, in the original experiment, was a minimal image, but for a new patch we don't know whether it is a minimal image. So we collected a range of airplane tails like this from many other airplanes, and we ran another Turk experiment, which was pretty large, because we wanted to verify that each one of these patches that we added to our test set, and were going to use for testing recognition, was indeed a minimal image for recognition. So each one of these patches, and there were 60 of them, we ran psychophysically. And we saw that it is recognizable, and that if you make it smaller, if you try to reduce it, it becomes unrecognizable. So each one of these is individually also a minimal image. So here we did training and testing on minimal images. These are some examples: here are various images of the fly, and each one of them was tested on 30 subjects on Mechanical Turk.

And the results are that, in terms of correct recognition, there is a substantial improvement, from 3% to 60%. But 60% is not very large, and I should say that you should also look at the false alarms. The number of errors, which I will show you later: even after training on minimal images, the performance of the deep network and all the other models on the minimal images is far worse than human recognition levels, human performance, on the same images. So it's not just that the gap is not reproduced; even when training with minimal images, the performance is not reproduced. The errors, or the accuracy, are far worse in all the models, including the deep network, compared to human vision. So these systems do not do it.
You can always ask what happens if you train it with 100,000 images and add more and more examples. We couldn't do this; it becomes larger and larger. But with the experiments we have done, which are quite extensive, it does not begin to approach human accuracy. Humans are much better. And I'll show you that I think it's not just a competition about who does better; I think there is something deeper there, and that's where I want to go next. Let me skip some of this. These are the error comparisons, and you can see, just as we saw in a lot of different examples, 0% error for humans, 17% error for the deep networks, and so on. Those are big differences.

A related thing which, I think, gets to the heart of what's going on, something humans can do with these minimal images and models at the moment cannot, is that we not only recognize these images and say this is a man, this is an eagle, this is a horse. Although the image itself is sort of atomic, in the sense that if you reduce it recognition goes away, once we recognize it we can recognize sort of subatomic particles: we can recognize things inside it. So if this is a person, we ask again in the psychophysical test to tell us what you see inside the image, using various methodologies which I will not go into. But people recognize this. This is a person in an Italian suit, for those of you who could not recognize it. And once people recognize it, they say: this is the neck of the person, this is the tie, this is the knot of the tie, this is part of the jacket, and so on and so forth. They recognize a whole lot of details, semantic internal details. If they see that this is the horse, even though the contrast is low, they see the ear, and the other ear, and the eye, and the mouth. But if you reduce the image, they lose the recognition completely.
Once they recognize it, they recognize a whole lot of structure inside. And I think that this structure, by itself, is the more interesting part, because, really, we don't just want to see a horse or a car; we want to know where the car door is, where the knob is. We want to recognize all the internal details. But the ability to recognize all of these internal details also automatically helps you improve recognition and reject false detections. These are images the deep network thought were good images of a man in a suit. But once you dive inside and ask where exactly the neck is, and where exactly the tie is, and whether it has the structure that I expect, the answer is that it's not quite right. And you can use that. This internal interpretation is, first of all, the more important goal of vision, but, in addition, once you do it, you can reject things that appeared correct based on the coarse structure, and in this way you can get the correct recognition.

For this reason, my prediction is that it will be very difficult to get this with current deep networks, because what you need is not only to get the label out, but to be able to dive down, get the correct interpretation, and inspect it. And it has some properties: the knot in the tie is slightly wider than the part under it, and so on. You know these things and you check for them, and if you don't do it, then the recognition will remain limited.

Now, when you look at it and say, OK, let's try to develop an algorithm which will actually dive in and do the internal interpretation, do it correctly, reject false alarms, and so on, it turns out that this is an interesting business. You have to be very accurate, and some of the properties and relations that you need to extract are very specific to certain categories and are very precise.
For example, this was selected by the deep network as a very good example of a horse head, and, basically, it does have the right shape. But people reject it. We asked people who did not accept it as a horse head, and they said, for example, that these lines are too straight; it looks like a man-made part rather than a part of a real animal. That was a repeating answer, for example. But deviation from straightness and so on is a bit tricky. And it also didn't quite have the ear that you expect here.

So we think that the kind of features you need in order to do this internal interpretation depend on relatively complicated properties and relations that you don't want to spend time and effort computing in a bottom-up way over the entire visual field: whether two contours meet smoothly or in a corner, whether something is really straight or only semi-straight. My hunch is that, to do all of these complicated computations, you need only some specific ones, for some specific classes, at some specific locations. So the right way to do this kind of computation, the right architecture, seems to me a combination of bottom-up and top-down processing. And we know that in the visual system (this is a diagram of the visual system) there are lots of connections going up, but also a lot of connections going down.

The suggestion that I would like to put forward, and I think it's what's happening here, is that we have something like a deep network that does an initial, generic, bottom-up classification. It was trained on many categories; it is not sensitive to all of these small and informative things that you need for internal classification. And it gives you an initial recognition, which is OK. It's especially OK when you have a complete object and not something challenging like a minimal image.
Because you may be wrong on a couple of the minimal images, but you have 20 of them in each object, so if two are wrong, it's not too bad. Under many circumstances, you will be OK in terms of general recognition. But this does not complete the process; it triggers the application of something which is much more class-specific. It says: oh, it looks like a horse, let's check whether it has the right structure, or let's now complete the interpretation. It's not just a validation; you really want to know where the eye is, where the ear is, where the mouth is, and so on. You may want to know whether the mouth is open or closed. You want to feed the horse, you want to pet the horse; when you interact with objects, all of these things are important. So you continue your understanding of the visual scene. But this is not the generic bottom-up recognition: you are looking for specific structures that you learned about when you interacted with these objects before. And then you test specific things. Where is the eye? There should be a round thing roughly here, and so on and so forth. So these are more extended routines that you apply to the detected region, directed from above, and you know what kind of features to look for at different locations within the minimal image.

And this kind of ongoing, continuing interpretation is not just internal to what you have succeeded in recognizing; it spreads over the entire image. For example, if you look at this image, what do you see here? Anyone want to suggest what we see here?

AUDIENCE: A face, maybe.

SHIMON ULLMAN: Sorry?

AUDIENCE: A face.

AUDIENCE: A woman's face.

SHIMON ULLMAN: A woman's face. What is the woman doing?

AUDIENCE: Drinking.

SHIMON ULLMAN: Drinking.
Right. So it's a woman drinking, for those of you who managed to recognize it. This is the woman, and she's drinking from a cup. Now, we tested this. The woman is actually a minimal image: if you remove the cup and show this image, people recognize it at a relatively high rate. But nobody recognizes that this is a glass when you just show the glass on its own. We think that the actual recognition process in your head starts with recognizing what is recognizable on its own, the minimal configuration that you know what it is. You don't need help, you don't need context, you don't need anything: this is a woman, this is the mouth. And you can continue from there. In the same way that you can recognize internally that this is the nose and the nostril, and this is the upper lip and the lower lip, you can also guide the interpretation outward and say that the thing which is docked at her mouth is a glass. This has been implemented by Guy Ben-Yosef, who is also now a part of CBMM, and this internal interpretation begins to work interestingly well.

We also started to do some MEG studies at MIT, because if this is correct, if the correct recognition of minimal images and the following full interpretation process requires, for its completion, the triggering of top-down processing, then we should be able to see it with the right kind of imaging. In this case, we started to study minimal images in the MEG. Was MEG already mentioned here in any of the talks? MEG, as you know, doesn't have very good spatial resolution, it's not like fMRI, but it has very good temporal resolution. This part was led by Leyla Isik. What we've done here is let subjects in the MEG recognize minimal images, and we took the sensor signals from the MEG and trained a decoder.
The decoder is trained to say whether or not the image contains, say, an eagle in this case. And we had various images. The question is whether we can follow the performance of the computational decoder over time: at what point does the pattern of sensor signals allow us to deduce that there is an eagle in the image? And we see that the decoder becomes successful, as you can see here, at about 400 milliseconds. This is late for vision; the initial bottom-up recognition is more like 150 milliseconds, or something like that. We get the same picture when we do psychophysics: with normal images you can get good recognition, at the human level, after an exposure of, say, 100 milliseconds followed by a mask. With minimal images, you have to give more time, which we suspect is the time needed to allow the application of the top-down interpretation. And if you don't give enough time, then people degenerate, so to speak, into deep networks, and you get roughly the same kind of performance. This is all still unpublished and still running, and we need more subjects, but all of it is pointing in the right direction, providing support for top-down processing here.

This is, by the way, interesting methodologically, because it is otherwise very difficult to study. With real images, the input is so rich, and you get so much information already on the way up because of these redundancies, that even if you make a 20% error it doesn't really matter: you have many redundant, sufficient minimal images within any object, and so on. So it's very difficult to tease out where exactly the top-down information starts to matter. Where do you need it? Where exactly do you fail if you don't have it? We think you need it for this internal interpretation and for the correct recognition of minimal images, and here you can start seeing good signals in the MEG.
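As an aside, the kind of time-resolved decoding described here is typically done by training a separate classifier on the sensor pattern at each time point. The sketch below shows that general recipe with scikit-learn on synthetic stand-in data; it is not the actual analysis from this study, and the array shapes and variable names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def decode_over_time(meg_data, labels, cv=5):
    """Fit and cross-validate a separate decoder at each time point;
    returns decoding accuracy as a function of time."""
    n_trials, n_sensors, n_times = meg_data.shape
    accuracy = np.zeros(n_times)
    for t in range(n_times):
        X = meg_data[:, :, t]  # sensor pattern at time t, one row per trial
        clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
        accuracy[t] = cross_val_score(clf, X, labels, cv=cv).mean()
    return accuracy

# Synthetic stand-in data: 80 trials, 30 sensors, 40 time points;
# labels say whether the target (e.g., an eagle) was present in each trial.
rng = np.random.default_rng(0)
meg_data = rng.normal(size=(80, 30, 40))
labels = rng.integers(0, 2, size=80)
acc = decode_over_time(meg_data, labels)
# The time point where `acc` rises above chance marks when target information
# becomes decodable; a late rise for minimal images, compared with full
# images, would be consistent with an extra top-down stage.
```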
The MEG provides you with a tool that is pretty unique and allows you to do these things.

So let me add something very informal about where I think this is going. I think that when you look at difficult images, like the action recognition we discussed earlier, many of the things that we do depend not on a coarse label, that there is a person there, or there is an airplane, or there is a dog; they really depend on the fine details of the internal interpretation. And if you could turn off what I think is the class-specific, top-down part of the processing, then many of these fine distinctions that we make all the time, which are what vision is really about, vision is not about producing coarse categories, would go away. So these things will become a more and more important part of vision.

Let me show you some specific examples of this variability in action recognition. This is something that confuses current classifiers: to most of them it seems that the person is drinking, because there is a person, there is a bottle, and the bottle is close to the mouth. So, at that rough level of description, the person is drinking. But, obviously, here this person is drinking and this person is pouring, right? Is this person drinking at the moment? Yes or no?

AUDIENCE: No.

SHIMON ULLMAN: No. Why not? She's holding a cup, and it's not far, maybe on the way to the mouth. We know that she's not drinking, right? But why exactly not? Again, this is something that is picked up as drinking by many recognition systems, but something is wrong here. All of these are different objects and different actions that the people are performing: this is drinking from a straw, this is smoking, and this is brushing their teeth. But to tell them apart you have to go to the right location and decide exactly what's happening there.
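The idea that the decision lives in a small informative region, rather than in the whole image, can be sketched as a crop-then-classify step. Everything below (the landmark detector, the crop size, the fine classifier) is a hypothetical placeholder, not a system from this work.

```python
import numpy as np

def detect_face_landmarks(image):
    # Stand-in for a landmark detector; a real one would return pixel
    # coordinates of the mouth, hands, and so on.
    h, w = image.shape[:2]
    return {"mouth": (w // 2, int(h * 0.6))}

def crop_around(image, center, size=96):
    # Crop a small window centered on the informative location.
    x, y = center
    half = size // 2
    return image[max(0, y - half):y + half, max(0, x - half):x + half]

def fine_action_classifier(patch):
    # Stand-in for a classifier trained only on mouth-region patches,
    # distinguishing drink / pour / smoke / brush_teeth / none.
    return "drink"

def classify_action(image):
    landmarks = detect_face_landmarks(image)
    if "mouth" not in landmarks:
        return "unknown"
    # Variability inside this small window is far lower than in the full
    # image, so the fine distinction is much easier to learn and to check.
    patch = crop_around(image, landmarks["mouth"])
    return fine_action_classifier(patch)

print(classify_action(np.zeros((240, 320, 3))))  # -> "drink" (dummy stand-ins)
```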
This is the kind of thing that we do all the time. Here are some more challenges, just informal ones, to show you how we deal with fine interpretation of details of interest in the image. What is this arrow pointing at?

AUDIENCE: Bottle.

SHIMON ULLMAN: Sorry?

AUDIENCE: Bottle.

SHIMON ULLMAN: Yeah. But above the bottle, there is something else there.

AUDIENCE: Fingers.

SHIMON ULLMAN: Sorry?

AUDIENCE: Finger.

SHIMON ULLMAN: Fingers, right. Let's see, just playing this game. What is this arrow pointing at?

AUDIENCE: Zipper.

SHIMON ULLMAN: Zipper. Let's see. Here are two challenging things, here are two arrows. What is this one pointing at?

AUDIENCE: Cup?

AUDIENCE: Tea.

AUDIENCE: Cup.

SHIMON ULLMAN: All right. Next to the cup, right, there is also something, and this one is really challenging. Let's see if some of you can get it. What is this one pointing at?

AUDIENCE: A tray?

AUDIENCE: A tray.

SHIMON ULLMAN: Tray. So the tray, think about it: it's this, but you match it with this thing here in order to know that it's a tray. It's not something that will be easily picked up. I mean, I'm looking for difficult things which are a little bit challenging, and you say, ah, I can get it. But this level of detail, interpreting the fine details in images in a top-down fashion, happens all the time. Is this person smoking? Of course not, and we are not fooled by it; we immediately zoom in on the right things. And, really, all the information is here at the end of the-- and so on, and so forth.

We were also looking at dealing visually with social interactions, understanding the social interactions between agents. Again, it's very difficult to do correctly, and it depends on subtle things.
I mean, you can get something roughly OK. But, for example, is this sort of an intimate hug, or is this just a cordial hug of people who are not-- we know exactly what's going on, right? And it turns out that the features are not that easy to get. This was picked up incorrectly by something that we designed to detect people hugging. It's not very far from people hugging, but it doesn't fool us, right? They are not really hugging. On social interactions, we can read interactions even between non-human agents. This interaction, is it a threatening interaction or a friendly interaction? What do you think? Yeah. Correct. I think so too.

Anyway, these are all things that we can do, and I think that vision is about this. It's not about looking at this room and saying that this is a computer and this is a chair. It's about understanding the situation, making fine judgments, and interacting with objects. And, in fact, we're looking at this as part of what we're doing at CBMM. We are looking at the problem of asking questions about images. We want a system that you can give an image and a question, and we want the system to process the image in such a way that it gives you a good answer to the question. This is interesting because it means that it's not just a generic pipeline, a fixed sequence of operations you run the image through; depending on what you're interested in, the whole visual process should be directed in a particular way to produce just the relevant answer. A rough sketch of this question-directed idea is given below. With students, we looked at a set of some 600 questions: we gave people on Mechanical Turk images and said, imagine some questions about these images, ask some questions about these images. And they came up with some questions.
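Here is the rough, question-directed sketch referred to above: the question selects which visual routines run and where, instead of pushing the image through one fixed pipeline. All of the routines below are hypothetical stand-ins, not the CBMM system.

```python
def extract_category(question):
    # Stand-in: pull the queried category out of the question text.
    return question.split()[-1].strip("?")

def count_objects(image, category):
    return f"counting instances of '{category}' (stand-in)"

def detect_person(image):
    return {"mouth_region": (120, 80, 60, 60)}  # dummy box

def interpret_action_region(image, person):
    return "interpreting the mouth/hand region (stand-in)"

def describe_scene(image):
    return "generic scene description (stand-in)"

def answer_question(image, question):
    q = question.lower()
    if "how many" in q:
        # Counting questions: only detect instances of the queried category.
        return count_objects(image, extract_category(q))
    if "holding" in q or "drinking" in q:
        # Fine action questions: find the person, then interpret only the
        # small informative region instead of the whole image.
        return interpret_action_region(image, detect_person(image))
    # Coarse questions fall back to a generic pass over the image.
    return describe_scene(image)

print(answer_question(image=None, question="Is the woman drinking?"))
```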
We looked at those questions, and an informal, initial observation is that for most of the questions people invented to ask about images, answering required something that depended on a precise internal interpretation of the details. So these things come up all the time: you have to dive into the image and analyze the subtle cues that tell you that these people are not hugging, that this is not threatening, that this is not an intimate hug, and so on and so forth. And this is where the whole story of minimal images and internal interpretation is heading. The real goal, eventually, is to identify the visual features and structures that are important for this, and to think about automatically learning how to extract the internal structure that will support the interpretation of all these interesting and meaningful aspects of images, which at the moment we do not have.

OK, let me skip this. I think I've already said all of these conclusions in the final comments, so let me stop here.