The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

SHIMON ULLMAN: What I'm going to talk about today is very much related to this general issue of using vision, but with the final goal of seeing and understanding, in a very deep and complete manner, what's going on around you.

Let me start with just one image. We see images all the time, and we know that we can get a lot of meaning out of them. You look at one image, and you get a lot of information out of it. You understand what happened before -- that there was some kind of a flood -- and why these people are hanging out up there on the wires, and so on. All of this we get very quickly. And what we usually do in computational vision, and in much of my own work, is try to understand computational schemes that can take an image like this as input and get all this information out.

But what I want to talk about in the first part -- I'm going to break the afternoon into two talks on two different topics, but they are both closely related to this ultimate goal which drives me, which, as I think will become apparent as we go along, is this complete and full understanding, the complicated concepts and notions that we can derive from an image. What I'm going to discuss in the first part is how it all starts. And this combines vision with another topic which is very important in cognition -- part of CBMM as well, but not just this particular center; it's part of understanding cognition -- namely infant learning and how it all starts. Certainly for vision and for learning, this is a very interesting and fascinating problem. Because think about babies: they open up their eyes.
And they see -- but they cannot understand the images that they see. You can think of them as seeing pixels. They watch the pixels, and the pixels keep transforming, and they look at the world as things change around them, and so on. What this short clip is trying to make explicit is that, somehow, these pixels acquire meaning over time. The infants come to understand the world they see. The input changes from shifting pixels and light intensities into other infants, and rooms, and whatever they see around them.

So we would like to be able to understand this and to do something similar. Imagine that we have some kind of a system that starts without any specific world knowledge wired into it. It watches movies for, say, six months. And at the end of the six months, this system knows about the world: about people, agents, animals, objects, actions, and social interactions between agents, the way that infants do during the first year of life. For me at least, the goal is not necessarily the engineering part, to really build such a system, but to think about and develop schemes that would be able to deal with this and do something similar. I think it's also -- maybe I'll mention it at the end -- a very interesting direction for artificial intelligence: to think about generating not final, complete systems, but baby systems, if you want, that have some interesting and useful initial capacities, and the rest is just that they watch the world and get intelligent as time goes by.

I'm going to talk initially about two particular things that we've been working on, which we selected because we thought them particularly interesting. One has to do with hands, and the other one has to do with gaze.
The reason we selected them is that, as I'll show in a minute, for computer vision, dealing with people's hands in images and dealing with direction of gaze are very difficult problems. A lot of work has been done in computer vision on issues related to hands and to gaze. They're also very important for cognition in general -- again, I will discuss this a bit more later. Understanding hands and what they are doing -- interacting with objects, manipulating objects, action recognition -- hands are part of understanding the whole domain of actions, and social interactions between agents are part of it as well. Gaze is also very important for understanding people's intentions and the interactions between people. So these are two very basic types of objects, or concepts, that you want to acquire and that are very difficult. And the final, surprising thing is that they come very early in infant vision; they are among the first things to be learned.

You can see here what I mean when I say hands are important, for example, for action recognition. I don't know if you can tell what this person is doing. Any guess?

AUDIENCE: Talking on the phone.

SHIMON ULLMAN: Talking on the phone. And we cannot really see the phone; it's more where the hand is relative to the ear, and so on. And we can see the interactions between agents, and so on. A lot depends on understanding the body posture, and in particular, the hands. So hands are certainly very important for us.

I mentioned that it is very difficult to automatically extract and localize hands in an image. There are two reasons for this. One is that hands are very flexible, much more so than most rigid objects we encounter in the world. So a hand does not have a typical appearance.
It has so many different appearances that it's difficult to handle all of them. The other reason is that although hands in images are important, very often there is very little information in the image showing the hand. Just because of resolution, size, and partial occlusion, we know where the hands are, but we can see very little of them. We have the impression here, when we look at this, that we know what this person is doing, right? He's holding a camera and taking a picture. But if you take the image region where the hand and the camera are -- this is the camera, this is the hand, and so on -- there's really not much information. Yet we can use it very effectively, and similarly in the other images that you see here.

Children, or infants, acquire this ability to deal with hands -- and, as we'll see later, with gaze -- in a completely unsupervised manner. Nobody teaches them, "Look, this is a hand." That isn't even theoretically possible, because this capacity to deal with hands and gaze comes at the age of three months, well before language starts to develop. So all of this is entirely unsupervised: just watching things happen in an unstructured way and mastering these concepts. And when you try to imitate this in computer vision systems, there are not many learning systems that can deal well with unsupervised data. I can tell you, without elaborating on the different schemes that exist, that nothing out there can learn hands in an unsupervised way.

It may be interesting to know, just anecdotally, that when we actually started to work on this, deep networks were not what they are today. If you go back in the literature to when the term "deep networks" and the initial work on deep networks started -- at least in the group of Geoff Hinton, while Yann LeCun was doing independent things separately.
The goal then was to learn everything in an unsupervised way. That was stated as the goal of the project: to be able to build a machine that would not need supervision. You just aim it at the world, and it will start to absorb information from the world and build an internal representation in a completely unsupervised way. And they demonstrated it on simple examples -- for example, on MNIST digits. You don't tell the system that there are 10 digits and so on; you just show it data, and it builds a deep network that automatically divides the input into 10 different classes. In interesting ways, it also divides them into subclasses -- there is a closed 4 and an open 4, and so on -- in something very natural. So it was an example of dealing with multiple classes in an unsupervised manner. But when you try to do something like hands, which we tried, it basically fails as an unsupervised method.

And the problem has remained difficult. Here is a quote from Jitendra Malik -- those of you who deal with computer vision will know the name; he is a leading person in computer vision. He says that dealing with body configuration in general, and hands in particular, is maybe "the most difficult recognition problem in computer vision." I think that's probably going too far, but still, you can see that people took it to be a very difficult problem.

On the unsupervised side, which is still a big open problem, the biggest effort so far has been a paper -- already a few years ago -- from a collaboration between Google and Stanford, by Andrew Ng and others, in which they took images from 10 million YouTube movies and tried to learn whatever they could in an unsupervised way. They designed a system that was built to extract information in an unsupervised manner.
And basically, from all this information, they managed to get out three concepts: the machine developed units that were sensitive, specifically, to three different categories. One was faces; another one was cats. That's from their paper -- it's not easy to see the cat here, but there is a cat; they found the cat. And there is a sort of torso, an upper body. Three concepts -- after all of this training, three concepts emerged. And in fact, only one of them, faces, was really there, in the sense that there were units that were very sensitive to faces. For the other cases, like cats and upper bodies, the units were not all that selective.

And by the way, cats are not very surprising. You know why cats came out in these movies? If you watched YouTube, you would know. It's literally the case that if you take random movies, or millions of movies, from YouTube, many, many of them will contain cats. So in the database, after faces and bodies, cats were the third most frequent category. But this approach wouldn't get to hands, or gaze, and so on. It's really picking up only things which are very, very salient and very, very frequent in the input.

And as I said, babies do it. Now people have started to look more closely at this. One technique is to put a sort of webcam on infants. This is not an infant -- this is a slightly older person, a toddler -- but they do it with infants, and they look at what the babies are looking at. And what babies are looking at in the very initial stages are faces and hands. They really like faces, and they really like hands. And they recognize hands; they already have information and expectations about hands at a very early age.
So it's not just the visual recognition, grouping together images of hands; they also know, for example, that hands are the causal agents that move objects in the world. And this comes from an experiment by Rebecca Saxe. There is a person here working with Rebecca Saxe, right? Did she talk already?

AUDIENCE: She will.

SHIMON ULLMAN: She will. And she's worth listening to. So this is from one of her studies, in which they showed infants -- these are slightly older infants -- on a computer screen, a hand moving an object. This is not taken from the paper directly; I just drew it. A hand moving an object -- in this case, a cup or a glass. The infant watches it for a while and sort of gets bored. After that, they show the infant either the hand moving alone on the screen, or the glass moving alone on the screen. With the hand moving alone on the screen, the infants are still bored; they don't look at it much. When the cup is moving alone on the screen, they are very surprised and interested. So they know it's the hand moving the cup. It's not the cup moving the hand, and the two don't have equal status in the motion of this configuration: the originator, the actor, the mover, is the hand. This is at seven months.

But it has been known that this kind of motion, in which one object can cause another object to move, is something that babies are sensitive to not only at the age of seven months; it appears in infants as early as you can imagine and test. And for us, this was the guideline, the open door, to what may be going on that lets the infant pick up specifically on hands and quickly develop a well-developed hand detector.
Infants are known to be sensitive to motion in general. They really follow moving objects, and they use motion a lot. But motion by itself is not very helpful if you want to recognize hands. It's true that hands move, but many other things move as well. If you take random videos from a child's perspective, they see doors opening and closing; they see people moving back and forth, coming by and disappearing. Hands are just one small category of moving things. But infants are also sensitive, as I said, not just to motion but to the particular situation in which an object moves, comes in contact with another object, and causes it to move. And this is not even at the level of objects. At three months, they don't yet have a well-set notion of objects; they just begin to organize the world into separate objects. You can think of a cloud of pixels, if you want, moving around, coming in contact with stationary pixels and causing them to move. Whenever this happens, infants pay attention; they look at it preferentially. It's known that they are sensitive to this type of motion, which is called a mover event. A mover event is the event I just described: something is in motion, comes in contact with a stationary item, and causes it to move.

So we started by developing a very simple algorithm that simply looks at video, at changing images, watching for, or waiting for, mover events to occur. The way it's defined is very simple. In the algorithm, we divide the visual field into small regions, and we monitor each one of these cells in the grid for the occurrence of a mover, which means that there is some optical flow, some motion, coming into the cell and then leaving the cell, carrying with it the content of the cell. So this does require some kind of optical flow and change detection, and all of these capacities are known to be in place before three months of age.
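To make this concrete, here is a minimal sketch of such a grid-based mover-event detector. This is my own illustration under stated assumptions, not the actual system: it assumes OpenCV's dense Farneback optical flow, and the cell size, motion threshold, and the simplified "motion enters, then leaves" state machine are invented for the example.

```python
# Minimal sketch (assumptions mine) of a grid-based mover-event detector.
import cv2
import numpy as np

CELL = 40          # grid cell size in pixels (assumed)
FLOW_THRESH = 1.0  # mean flow magnitude that counts as "motion" (assumed)

def cell_motion(flow):
    """Mean optical-flow magnitude over each grid cell."""
    mag = np.linalg.norm(flow, axis=2)
    h, w = mag.shape
    gh, gw = h // CELL, w // CELL
    return mag[:gh * CELL, :gw * CELL].reshape(gh, CELL, gw, CELL).mean(axis=(1, 3))

def detect_mover_events(frames):
    """Yield (frame_index, cell_row, cell_col) whenever motion enters a
    previously stationary cell and then leaves it, roughly 'carrying'
    the cell's content with it (content tracking is omitted here)."""
    prev_gray, state = None, None  # state: 0 = stationary, 1 = motion entered
    for i, frame in enumerate(frames):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is not None:
            flow = cv2.calcOpticalFlowFarneback(
                prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
            moving = cell_motion(flow) > FLOW_THRESH
            if state is None:
                state = np.zeros(moving.shape, dtype=np.uint8)
            # Motion arrives at a stationary cell: remember it.
            state[(state == 0) & moving] = 1
            # Motion leaves a cell it had entered: report a mover event.
            left = (state == 1) & ~moving
            for r, c in zip(*np.nonzero(left)):
                yield i, r, c
            state[left] = 0
        prev_gray = gray
```

A fuller implementation would also verify that the cell's content changed when the motion left, the "carrying the content" part of the definition; the sketch only tracks the in-and-out flow pattern.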
Now, look at a video of a person manipulating objects when all you do is simply monitor all locations in the image for the occurrence of this kind of mover event. What you should pay attention to is the fact that motion by itself is not doing anything. What the algorithm does is, whenever it detects a mover event, it draws a square around it and continues to follow it, to track it, for a little bit. So you can see that the minute the hand moves something, it is detected; hands moving on their own are not triggering the system here. The hand is moving, but this by itself is not the signal. It is the interaction between the hand and an object that is detected, based on these very low-level cues. The system doesn't know about hands; it doesn't know about objects. And you can see here a false alarm, which was interesting -- you can probably understand why, and so on.

So here's an example of what happens if you just let this scheme run on hours and hours of videos. Some of these videos do not contain anything related to people. In some of them there are people, but they are just going back and forth, entering the room and leaving the room, not manipulating objects; nothing specific happens in these videos. The system that is looking for these kinds of mover events finds very rare occasions in which anything interesting happens. But you can see the output: these are just examples, from these many videos, of the kind of images it extracted by being tuned specifically to the occurrence of this specific event. You can see that you get a lot of hands.
These hand images are the continuation of the tracking. The assumption here, which we again modeled after infants: when infants see something that starts to move, they track it for about one or two seconds. So we also tracked it for about one or two seconds, and we show some images from this tracking. These are some false alarms -- there are not very many of them, but these are examples where something happened in the image that caused the detector for this specific mover event to be triggered, and it collected these images. But on the whole, as shown here, you get very good performance. It's actually surprising: if you look at all the occurrences of hands touching objects, 90% of them, in this whole collection of videos, were captured by the scheme. And the accuracy was 65%, so there were some false alarms, but the majority were fine. And what you end up with is a large collection which is mostly composed of hands.

So now, without being supplied with external supervision -- here is a hand, here is a hand, here is a hand -- you suddenly have a collection of 10,000 images, and most of them contain a hand. And if you look at this and apply completely standard algorithms for object recognition, this is sufficient for them to learn a new object. You can use a deep network, but even simpler algorithms of various sorts will do. Using this collection of images, which were identified as belonging to the same concept by all triggering the same special event -- something causing something else to move -- you will get a hand detector.

So these are just lots and lots of movies; I will not play them. Some of them can play for half an hour without a single event of this kind; others are pretty dense with such events. And the events are being detected.
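To make the "standard algorithms" step concrete, here is a minimal sketch of how the automatically collected crops could train a hand detector with no manual labels. The choices here are my own assumptions, not the actual system: HOG features with a linear SVM via scikit-learn, and crops assumed to be 64x64 grayscale patches.

```python
# Minimal sketch (assumptions mine): train a hand/background classifier
# from the crops gathered automatically around mover events.
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_features(patches):
    """HOG descriptor for each 64x64 grayscale patch."""
    return np.array([hog(p, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2)) for p in patches])

def train_hand_detector(mover_crops, background_crops):
    """mover_crops: patches collected around mover events (mostly hands);
    background_crops: random patches from the same videos."""
    X = np.vstack([hog_features(mover_crops), hog_features(background_crops)])
    y = np.concatenate([np.ones(len(mover_crops)),
                        np.zeros(len(background_crops))])
    return LinearSVC(C=1.0).fit(X, y)
```

Run with a sliding window over new still frames, such a classifier marks candidate hand locations, which is essentially what the detector output on the next slide shows.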
What is shown here -- it's not the greatest image -- is the result: all these squares, the yellow squares, are the output of a detector, having gone through this first round of learning from these events. You take the images that were labeled in this way, you give them to a standard computer vision classifier that takes these images and finds what's common to them, and then, on completely new images -- static images now -- it marks in the image, following this learning, where the hands are.

Now, this is a very good start in terms of being able to learn hands, and not only to learn hands. As I showed you before, very early on the notion of a hand, in the cognitive system, is automatically and closely associated with moving objects, with causing objects to move, as we saw in the Rebecca Saxe experiment. This also happens in this system. But the learning continues to develop, because eventually we want to learn hands not only in this grasping configuration; later on, we want to be able to recognize any hand in any configuration. So the system needs to learn more.

And this is accomplished by something where, again, the details of the specific application are less important than the principle. The principle is two subsystems in the cognitive system training each other. Together, by each one advising the other, they reach much more than either system would be able to reach on its own. The two subsystems in this case are these. One is the ability to recognize hands by their appearance: I can show you just an image of a hand, without the rest of the body or the rest of the image, and you know, by the local appearance, that this is a hand. The other: here, you cannot even see the hands of this woman, but you know where the hands are. So you can also use the surrounding context in order to find where the hands are. These are two different algorithms that are known in computer vision. They are also known in infants.
People have demonstrated, independently, before we did our work, that infants associate the body, and even the face, with the hand: when they see a hand, they immediately look up and look for a face, and they are surprised if there is no face there. So they know about the association between the body parts surrounding the hand and the hand itself. And you can think about it this way: we saw this image before, where the hand itself is not very clear, but you can get to the hand if you know where the face is, for example -- you can go to the shoulder, to the upper arm, to the lower arm, and end up at the hand. People have used this in computer vision as well for finding hands: finding hands on their own, by their own appearance, or using the surrounding body configuration.

And the nice thing is that instead of just thinking of these as two methods, two schemes, that can both produce the same final goal, they can also, during learning, help each other and guide each other. The way it goes is shown here: the appearance can help in finding hands by the body pose, and the body pose can do the same for the appearance. Suppose, for example, that initially I learned this particular hand in this particular appearance and pose; then I know it by appearance, and I know it by the pose. If I keep the same pose but completely change the appearance of the hand, I still have the pose guiding me to the right location, so I can grab a new image and say, OK, this is still a hand, but with a new appearance. Now that I have the new appearance, I can move to a new location: I recognize the hand by the appearance that I already know, but this is a new pose, so I say, aha, here is a new body configuration that ends in a hand. And then I can change the appearance again.
So you can see that, by having enough images, I can use the appearance to learn various poses that end in a hand, using the common appearance, and vice versa: if I know the pose, I can use the same pose and deduce the different appearances of the same hand. I will not go into the algorithms, but this becomes very powerful. You just go through this iteration: you start from a small subset of correct identifications, and then you let these two schemes guide each other, this kind of learning in which one system guides the other.

We see here graphs of performance. Roughly speaking -- the details are not that important -- this is called a precision-recall graph. But even without explaining the details of recall and precision, higher graphs mean better performance of the system. This is the initial performance, what you get if you just train it using hands grabbing objects. Actually, the system is doing a good job at recognizing hands, but only in the limited domain of hands touching objects; other things it does not recognize very well. That is shown here: it has high accuracy but does not cover the whole range of possible hands that you could learn. And then, without doing anything else, you just continue to watch movies, but you also integrate these two systems, each one supplying internal supervision to the other. Everything grows and grows, and after training with several hours of video, we get up to the green curve. The red curve is the absolute maximum you can get: this is using the best classifier we could get, with everything completely supervised, so that on every frame, in 10,000 frames or more, you tell the system exactly where the hand is. So the red is what you can get with a completely supervised scheme, and the green is what you can get with reasonable training -- I mean, seven hours of training; infants get more -- completely unsupervised. It just happens on its own.
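The iteration described above is essentially a co-training loop between the two detectors. Here is a schematic sketch of it; the appearance_model and pose_model objects, their methods, and the confidence threshold are all assumptions for illustration, not the actual system.

```python
# Schematic sketch (assumptions mine) of two detectors supplying
# internal supervision to each other.

def co_train(appearance_model, pose_model, unlabeled_frames,
             rounds=5, confidence=0.9):
    """Each round, one detector labels the frames it is confident about,
    and those labels become new training data for the other detector."""
    for _ in range(rounds):
        # The appearance detector proposes hand locations it is sure of...
        by_appearance = [
            (frame, box) for frame in unlabeled_frames
            for box, score in appearance_model.detect(frame)
            if score > confidence
        ]
        # ...and the pose/context detector retrains on them, learning new
        # body configurations that end in a hand.
        pose_model.train(by_appearance)

        # The pose detector returns the favor: it localizes hands from the
        # surrounding body configuration, handing the appearance detector
        # hand appearances it has never seen before.
        by_pose = [
            (frame, box) for frame in unlabeled_frames
            for box, score in pose_model.detect(frame)
            if score > confidence
        ]
        appearance_model.train(by_pose)
    return appearance_model, pose_model
```

Starting from the small, reliable seed that mover events provide, each pass expands one detector's coverage using the other's confident detections, which is why performance climbs from the initial curve toward the fully supervised red one.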
It's interesting, also, to think about infants here. And actually, I was planning to ask you a question; let's see. What else could help infants do this? Can you think of other tricks by which infants -- for whom it's very important to pick up hands -- could somehow pick them up? What other signals, or tricks, or guidelines could help them pick up hands?

OK, since I sort of gave it away with my own hands: you can think about babies waving their hands, and babies do wave their hands a lot in the air. You can imagine a scheme in which the brain knows this: you wave your hands, and the images generated by these motor activities are interpreted by the system, which already knows to grab them, and are somehow used to build hand detectors. This idea is, I think, interesting, and we know that infants are interested in their own hands. But there are reasons to believe that this is not the case. Because, for example, if you really try this, and you try to learn hands from waving your hands in this way, imitating what infants may see, a scheme that learns hands in this way is very bad at recognizing hands manipulating objects. If, after the waving hands, you test the system on a hand coming into the image and touching or grabbing something, the difference in appearance and point of view between waving your own hands and watching somebody grab an object is so large that it does not allow good generalization at this stage. And we know, from testing infants, that the first thing that they recognize well is actually other people's hands grabbing objects. So it's much more consistent with the idea that the guiding internal signal that helps them deal, in an unsupervised way, with this difficult task is the special event of the hand as the mover of objects.
OK, I want to move from hands to gaze; this part will be shorter, but I want to say something about gaze. Gaze is also, as I said, interesting. The capability starts at about three months of age. What happens at three months is that an infant may look at an adult -- a caregiver, the mother, say, or another person -- and if the other person is looking at an object over there, then the infant will look at the other person, then follow the gaze and look at the object that the other person is looking at. So it is, first of all, identification of the correct direction of gaze, but then also using it in order to get joint attention. All of these things are known to be very important in early visual development. Psychologists, child psychologists, talk a lot about this mechanism of joint attention, in which the parent, or the caregiver, and the child attend jointly. And some people, some infants, do not have this mechanism of joint attention, of being able to attend to the same event or object that the other person is attending to, and this has developmental consequences. So it's an important aspect of communication and learning, and understanding direction of gaze is very important.

Here it's perhaps even more surprising and unexpected than with hands. Because gaze, in some sense, doesn't really exist objectively out there in the scene. It's not an object, a yellow object, that you see; it's some kind of imaginary vector, if you want, pointing in a particular direction based on the face features. It's very non-explicit. And you have, somehow, to observe it, to see it, to start extracting it from what you see -- all of this in a completely unsupervised manner. So what would it take for a system to be able to watch things, get no supervision, and after a while, correctly extract direction of gaze?
Direction of gaze actually depends on two sources of information: one is the direction of the head, and the other is the direction of the eyes in their orbits. Both of them are important, and you have to master both.

There are more recent studies of this, and more accurate studies of this, but I like this reference, because it is a scientific paper on the relative effect of head orientation and eye orientation in gaze perception -- a scientific paper from 1824. So this problem was already being studied with experiments, and good judgment, and so on. The point here, by the way, is that these people look as if they are looking in different directions. But in fact, the eyes here are exactly the same. It's a cut and paste -- literally the same eye region -- and only the head is turning. And this is enough to cause us to perceive these two people as looking in two different directions.

In terms of how this learning comes about in infants, the head comes first. Initially, if the caregiver's head is pointing in a particular direction but the eyes are not, the infant will follow the head direction; this is at three months. Later on, they combine the information from the head and from the eyes.

And the eyes are really subtle cues, which we use very intuitively, very naturally. Let me hide this for a minute. This person -- it's a bad image; it's blurred, especially for those who sit in the back -- is he looking, basically, roughly at you, or is he looking at the objects down there? Sorry?

AUDIENCE: [INAUDIBLE]

SHIMON ULLMAN: Yeah, basically at you, right? Now, look at the eyes -- these are the eyes. This is all the information: the actual pixels, the same number of pixels as in the image, and so on.
So this is the information in the eyes. It's not a lot, yet we use it effectively in order to decide where the person is looking -- we just look at it. And it's interesting that it's so small and inconspicuous in objective terms, but for us, it's enough to know that the person is looking, roughly, at us.

Now, there are computer algorithms that estimate gaze. Gaze, again, is not an easy problem, and people have worked quite a lot on detecting direction of gaze. And all the schemes are highly supervised; once it's highly supervised, you can do it. By highly supervised, I mean that you give the learning system many, many images, and together with each input image, you supply the direction of gaze: this is the image, and this is the direction the person is looking. There are ways of taking this input information, the appearance of the face and the direction of gaze, associating them, and then, when you see a new face, recovering the direction of gaze. But it really depends on a large set of supervised images.

So if we want to go along the same direction as before, with getting hands correctly, we need something to replace the external supervision: some kind of signal that can tell the baby, can tell the system, without any explicit supervision -- provide some kind of internal teaching signal that tells it what the direction of gaze is. And this turns out to be very close to the hand and the mover, using the following observation. Once I have picked up an object, I can do whatever I want with it; I don't have to look at it; I can manipulate it and so on. But if it's placed somewhere and I want to pick it up -- nobody picks up objects without looking. When you pick objects up, at the moment of making the contact to pick them up, you look at them.
771 00:37:09,000 --> 00:37:12,590 And in fact, that's a spontaneous behavior which 772 00:37:12,590 --> 00:37:13,980 we checked psychophysically. 773 00:37:13,980 --> 00:37:15,900 You just tell people, pick up objects. 774 00:37:15,900 --> 00:37:17,930 You don't tell them what you are trying to do. 775 00:37:17,930 --> 00:37:21,290 And invariably, at the instant 776 00:37:21,290 --> 00:37:24,470 of grabbing the object, making the contact, they look at it. 777 00:37:24,470 --> 00:37:26,200 So if we already have a mover detector, 778 00:37:26,200 --> 00:37:28,320 or sort of a hand detector, 779 00:37:28,320 --> 00:37:31,910 that detects hands that are touching objects and causing 780 00:37:31,910 --> 00:37:33,590 them to move, all you have to do-- 781 00:37:33,590 --> 00:37:35,390 whenever you take an event like this, 782 00:37:35,390 --> 00:37:36,830 it's not only useful for hands. 783 00:37:36,830 --> 00:37:40,080 But once a hand is touching the object, this kind of mover 784 00:37:40,080 --> 00:37:43,590 event, you can freeze your video, you can take the frame. 785 00:37:43,590 --> 00:37:46,130 And you can know with high precision 786 00:37:46,130 --> 00:37:49,400 that the direction of gaze 787 00:37:49,400 --> 00:37:53,120 is now directed toward the object. 788 00:37:53,120 --> 00:37:57,600 So we asked people to manipulate objects on the table and so on. 789 00:37:57,600 --> 00:38:03,000 And what we did is we ran our previous mover detector. 790 00:38:03,000 --> 00:38:08,180 And let me skip the movie. 791 00:38:08,180 --> 00:38:12,590 But whenever this kind of a detection of a hand touching 792 00:38:12,590 --> 00:38:15,220 an object, making initial contact with an object, 793 00:38:15,220 --> 00:38:17,390 happened, we froze the image. 794 00:38:17,390 --> 00:38:19,500 Unfortunately, this is not a very good slide, 795 00:38:19,500 --> 00:38:20,750 so it may be difficult to see. 796 00:38:20,750 --> 00:38:22,460 Maybe you can see here. 797 00:38:22,460 --> 00:38:25,190 So we simply drew a vector pointing 798 00:38:25,190 --> 00:38:31,440 from the face in the direction of the detected grabbing event. 799 00:38:31,440 --> 00:38:34,730 And we assumed-- we don't know-- that's an implicit, internal, 800 00:38:34,730 --> 00:38:36,230 imaginary supervision. 801 00:38:36,230 --> 00:38:37,210 Nobody checked it. 802 00:38:37,210 --> 00:38:46,704 But we grabbed the image, and we drew the vector 803 00:38:46,704 --> 00:38:48,437 to the contact point. 804 00:38:48,437 --> 00:38:49,520 So now you have a system-- 805 00:38:49,520 --> 00:38:52,040 on the one hand, we have face images 806 00:38:52,040 --> 00:38:55,310 at the point where we took the contacts. 807 00:38:55,310 --> 00:38:56,150 So here is a face. 808 00:38:56,150 --> 00:38:58,430 And this is a descriptor, some way 809 00:38:58,430 --> 00:39:00,920 of describing the appearance of the face based 810 00:39:00,920 --> 00:39:02,770 on local gradients. 811 00:39:02,770 --> 00:39:03,270 Sorry? 812 00:39:03,270 --> 00:39:05,130 AUDIENCE: How do you find the face? 813 00:39:05,130 --> 00:39:06,505 SHIMON ULLMAN: You assume a face. 814 00:39:06,505 --> 00:39:08,085 The face detector, I just left it out-- 815 00:39:08,085 --> 00:39:08,960 AUDIENCE: [INAUDIBLE] 816 00:39:08,960 --> 00:39:09,245 SHIMON ULLMAN: Right. 817 00:39:09,245 --> 00:39:09,900 Right. 818 00:39:09,900 --> 00:39:11,330 Faces come even before-- 819 00:39:11,330 --> 00:39:15,080 I didn't talk about faces, but faces come even beforehand.
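A minimal sketch of this label-harvesting idea, assuming hypothetical detector callables detect_mover_contact and detect_face as stand-ins for the detectors described here, not the actual system:

    # Hedged sketch: harvest (face image, gaze direction) pairs with no
    # external labels, using the moment of hand-object contact as the
    # internal teaching signal.
    import numpy as np

    def collect_gaze_labels(frames, detect_mover_contact, detect_face):
        """detect_mover_contact(frame) -> (x, y) contact point, or None.
        detect_face(frame) -> (face_crop, face_center), or None."""
        dataset = []
        for frame in frames:
            contact = detect_mover_contact(frame)
            face = detect_face(frame)
            if contact is None or face is None:
                continue
            face_crop, face_center = face
            # At the instant of contact, the person is assumed to be
            # looking at the grasped object.
            v = np.asarray(contact, float) - np.asarray(face_center, float)
            v /= np.linalg.norm(v)  # unit vector from face to contact
            dataset.append((face_crop, v))
        return dataset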
820 00:39:15,080 --> 00:39:19,220 As I mentioned, the first thing that infants look at is faces. 821 00:39:19,220 --> 00:39:20,960 And this is even before the three months 822 00:39:20,960 --> 00:39:24,350 that they look at hands. 823 00:39:24,350 --> 00:39:28,680 The current theory is that, for faces, 824 00:39:28,680 --> 00:39:33,860 you're born with a primitive initial face template. 825 00:39:33,860 --> 00:39:38,050 There is some discussion about where the face template is. 826 00:39:38,050 --> 00:39:40,460 There is some evidence that it may not be in the cortex. 827 00:39:40,460 --> 00:39:41,930 It may be in the amygdala. 828 00:39:41,930 --> 00:39:44,790 But there is some evidence for this template, 829 00:39:44,790 --> 00:39:49,530 from manipulation of the patterns that infants look at. 830 00:39:49,530 --> 00:39:52,020 It's a very simple template, basically 831 00:39:52,020 --> 00:39:57,020 the two eyes-- something round, with two dark blobs. 832 00:39:57,020 --> 00:40:04,290 And this makes them fixate more on faces, in a very similar way 833 00:40:04,290 --> 00:40:06,270 to the handling of the hands. 834 00:40:06,270 --> 00:40:08,580 You just-- if you do this, from time to time, 835 00:40:08,580 --> 00:40:11,940 you will end up focusing not on a face, 836 00:40:11,940 --> 00:40:13,860 but on some random texture that has 837 00:40:13,860 --> 00:40:15,490 these two blobs or something. 838 00:40:15,490 --> 00:40:17,420 But if you really run it, then you 839 00:40:17,420 --> 00:40:19,260 will get lots and lots of face images. 840 00:40:19,260 --> 00:40:23,290 And then you'll develop a more refined face detector. 841 00:40:23,290 --> 00:40:27,300 So babies, from day one, by the way-- the way we think that-- 842 00:40:27,300 --> 00:40:31,260 the way people think it's innate is that 843 00:40:31,260 --> 00:40:34,110 experiments have been done on the first day after 844 00:40:34,110 --> 00:40:37,140 babies were born, day one. 845 00:40:37,140 --> 00:40:39,310 They keep their eyes closed most of the time. 846 00:40:39,310 --> 00:40:42,110 But when they are open, they fixate. 847 00:40:42,110 --> 00:40:44,550 You have to make big stimuli, like 848 00:40:44,550 --> 00:40:49,410 close-up faces, because the acuity is still not 849 00:40:49,410 --> 00:40:51,270 fully developed. 850 00:40:51,270 --> 00:40:53,280 But you can test what they're fixating on. 851 00:40:53,280 --> 00:40:56,070 And they fixate specifically on faces. 852 00:40:56,070 --> 00:40:58,890 And once there is a face, they fixate on it. 853 00:40:58,890 --> 00:41:01,170 And the face can move, and they will even track it. 854 00:41:01,170 --> 00:41:04,160 So this is day one. 855 00:41:04,160 --> 00:41:10,050 So faces seem to be innate in a stronger sense. 856 00:41:10,050 --> 00:41:12,210 In the case of the hand, for example, as I said, 857 00:41:12,210 --> 00:41:15,930 you cannot even imagine building an innate hand detector 858 00:41:15,930 --> 00:41:17,970 because of all this variability in appearance. 859 00:41:17,970 --> 00:41:20,520 For the face, it seems that there is an initial face 860 00:41:20,520 --> 00:41:23,340 detector which gets elaborated. 861 00:41:23,340 --> 00:41:26,370 So we assume that there is some kind of-- in these images, 862 00:41:26,370 --> 00:41:30,630 we assume that when we grab an event of contact like this, 863 00:41:30,630 --> 00:41:33,590 the face is known-- the location of the face and the location 864 00:41:33,590 --> 00:41:34,590 of the contact are known.
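A toy sketch of such a template, entirely my own illustration: a bright disk with two dark blobs, swept over a grayscale image by zero-mean correlation to pick a fixation point.

    # Toy innate face template: a bright round patch with two dark
    # "eye" blobs. Illustrative only, not the detector from the talk.
    import numpy as np

    def blob_template(size=15):
        t = np.zeros((size, size))
        yy, xx = np.mgrid[:size, :size]
        c = size // 2
        t[(yy - c) ** 2 + (xx - c) ** 2 <= c ** 2] = 1.0       # round head
        for ex in (c - size // 4, c + size // 4):               # two dark blobs
            t[(yy - (c - size // 5)) ** 2 + (xx - ex) ** 2 <= 2] = -1.0
        return t - t.mean()

    def best_fixation(image, template):
        """Location in a grayscale image where the template matches best."""
        H, W = template.shape
        best_score, best_xy = -np.inf, (0, 0)
        for y in range(image.shape[0] - H + 1):
            for x in range(image.shape[1] - W + 1):
                patch = image[y:y + H, x:x + W]
                score = ((patch - patch.mean()) * template).sum()
                if score > best_score:
                    best_score, best_xy = score, (y, x)
        return best_xy

Run on real input, a crude matcher like this would sometimes land on blob-like textures, but mostly on faces, and that bias is all the bootstrapping needs.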
865 00:41:34,590 --> 00:41:37,344 And you can draw a vector from the first to the second. 866 00:41:37,344 --> 00:41:38,760 And this is the direction of gaze. 867 00:41:38,760 --> 00:41:41,460 And when, now, you see a new image in which there 868 00:41:41,460 --> 00:41:43,080 is no contact, you just have the face, 869 00:41:43,080 --> 00:41:45,630 and you have to decide what the direction of gaze is, 870 00:41:45,630 --> 00:41:49,140 you look at similar faces that you have stored in your memory. 871 00:41:49,140 --> 00:41:54,370 And from this stored face in memory, 872 00:41:54,370 --> 00:41:56,670 you already know from the learning phase 873 00:41:56,670 --> 00:41:59,280 what the associated direction of gaze is. 874 00:41:59,280 --> 00:42:00,900 And you retrieve it. 875 00:42:00,900 --> 00:42:02,850 And this is the kind of thing that you do. 876 00:42:02,850 --> 00:42:05,490 What we see here with the yellow arrows 877 00:42:05,490 --> 00:42:07,650 are collected images, in which, again, 878 00:42:07,650 --> 00:42:11,100 the direction of gaze, the supervised direction, 879 00:42:11,100 --> 00:42:14,400 was collected automatically, 880 00:42:14,400 --> 00:42:19,500 by just identifying the direction to the contact point. 881 00:42:19,500 --> 00:42:20,580 These are some examples. 882 00:42:20,580 --> 00:42:25,800 And what this shows is just doing some psychophysics 883 00:42:25,800 --> 00:42:27,660 and comparing what this algorithm does-- which 884 00:42:27,660 --> 00:42:33,140 is sort of this infant-related algorithm which just has 885 00:42:33,140 --> 00:42:36,570 no supervision, looking in images for hands touching 886 00:42:36,570 --> 00:42:39,120 things, collecting directions of gaze, 887 00:42:39,120 --> 00:42:42,640 developing a gaze detector. 888 00:42:42,640 --> 00:42:46,770 So the red and the green, one is the model, the other one 889 00:42:46,770 --> 00:42:50,250 is human judgment on a similar situation. 890 00:42:50,250 --> 00:42:52,060 And you get good agreement. 891 00:42:52,060 --> 00:42:53,275 I mean, it's not perfect. 892 00:42:53,275 --> 00:42:54,900 It's not the state of the art, but it's 893 00:42:54,900 --> 00:42:57,090 close to state of the art. 894 00:42:57,090 --> 00:43:00,080 And this is just training with some videos. 895 00:43:00,080 --> 00:43:04,740 I mean, this certainly does at least as well as infants. 896 00:43:04,740 --> 00:43:07,530 And it keeps developing, getting from here 897 00:43:07,530 --> 00:43:12,700 to a better and better gaze detector with reduced error. 898 00:43:12,700 --> 00:43:16,050 Well, the error is pretty small here too. 899 00:43:16,050 --> 00:43:17,730 But you can improve it. 900 00:43:17,730 --> 00:43:23,340 That's already more standard additional training. 901 00:43:23,340 --> 00:43:25,980 But making the first jump of being 902 00:43:25,980 --> 00:43:29,160 able to deal with gaze at all-- 903 00:43:29,160 --> 00:43:31,920 collecting a lot of data without any supervision, which 904 00:43:31,920 --> 00:43:38,640 is quite accurate, about where the gaze is and so on-- 905 00:43:38,640 --> 00:43:42,300 this is supplied by, again, this internal teaching signal 906 00:43:42,300 --> 00:43:47,940 that can come instead of any external supervision 907 00:43:47,940 --> 00:43:52,050 and make it unnecessary. 908 00:43:52,050 --> 00:43:55,260 And you can do it without the outside supervision.
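A minimal nearest-neighbor sketch of this retrieval step, reusing the (face crop, gaze vector) pairs harvested in the earlier sketch; the crude gradient-histogram descriptor below is an illustrative stand-in for whatever local-gradient descriptor the actual system used:

    # Hedged sketch: describe a face by a local-gradient histogram,
    # find the most similar stored face, return its harvested gaze.
    import numpy as np

    def gradient_descriptor(face_crop, bins=9):
        """Crude orientation histogram of a grayscale face crop."""
        gy, gx = np.gradient(face_crop.astype(float))
        ang = np.arctan2(gy, gx) % np.pi          # gradient orientations
        mag = np.hypot(gx, gy)                     # gradient magnitudes
        hist, _ = np.histogram(ang, bins=bins, range=(0, np.pi), weights=mag)
        return hist / (np.linalg.norm(hist) + 1e-8)

    def estimate_gaze(face_crop, dataset):
        """dataset: list of (face_crop, gaze_vector) from the learning phase."""
        q = gradient_descriptor(face_crop)
        dists = [np.linalg.norm(q - gradient_descriptor(f)) for f, _ in dataset]
        return dataset[int(np.argmin(dists))][1]

A nearest-neighbor lookup is just one simple realization; any regressor trained on the harvested pairs would play the same role.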
909 00:43:55,260 --> 00:43:57,650 It also has, I think, some-- 910 00:43:57,650 --> 00:44:01,260 the beginning of the more cognitively correct associations, 911 00:44:01,260 --> 00:44:05,460 like the hand being associated with moving 912 00:44:05,460 --> 00:44:09,030 objects, and direction of gaze with 913 00:44:09,030 --> 00:44:14,260 following it to see what the object at the other end is 914 00:44:14,260 --> 00:44:15,770 and so on. 915 00:44:15,770 --> 00:44:18,450 Gaze is associated with the attention 916 00:44:18,450 --> 00:44:20,964 of people, what they are interested in at the moment. 917 00:44:20,964 --> 00:44:22,380 So it's not just the fact that you 918 00:44:22,380 --> 00:44:25,600 connect the face with the target object and so on. 919 00:44:25,600 --> 00:44:28,920 It's a good way of creating internal supervision. 920 00:44:28,920 --> 00:44:32,660 But it also starts to, I think, create the right associations, 921 00:44:32,660 --> 00:44:36,840 that the hand is associated with manipulation and goals 922 00:44:36,840 --> 00:44:38,490 of manipulating objects. 923 00:44:38,490 --> 00:44:41,730 And gaze is associated with attention, 924 00:44:41,730 --> 00:44:46,730 and what we are paying attention to, and so on. 925 00:44:46,730 --> 00:44:50,880 So you can see that you start to have-- 926 00:44:50,880 --> 00:44:53,290 based on this, if you have an image like this, 927 00:44:53,290 --> 00:44:55,935 and you can detect hands-- 928 00:44:55,935 --> 00:44:58,500 there's a scheme that does it. 929 00:44:58,500 --> 00:44:59,850 You know about hands. 930 00:44:59,850 --> 00:45:01,800 You know about direction of gaze. 931 00:45:01,800 --> 00:45:03,510 You know about-- I didn't talk about it, 932 00:45:03,510 --> 00:45:05,677 but you also follow which objects move around. 933 00:45:05,677 --> 00:45:07,260 And you know which objects are movable 934 00:45:07,260 --> 00:45:09,990 and which objects are not movable-- 935 00:45:09,990 --> 00:45:14,130 so a very simple scheme that follows the chains 936 00:45:14,130 --> 00:45:15,890 of processing that I described. 937 00:45:15,890 --> 00:45:18,520 So it already starts to know-- 938 00:45:18,520 --> 00:45:23,820 you know, it's not quite having this full representation 939 00:45:23,820 --> 00:45:24,400 in itself. 940 00:45:24,400 --> 00:45:27,930 But it's quite a way along to knowing that the two agents here-- 941 00:45:27,930 --> 00:45:30,252 the two agents are manipulating objects. 942 00:45:30,252 --> 00:45:31,710 And the one on the left is actually 943 00:45:31,710 --> 00:45:34,320 interested in the object that the other one is holding. 944 00:45:34,320 --> 00:45:35,820 So you have all the building 945 00:45:35,820 --> 00:45:37,620 blocks to start building-- 946 00:45:37,620 --> 00:45:40,140 to start having an internal description 947 00:45:40,140 --> 00:45:45,600 along this line, following the chain of processing 948 00:45:45,600 --> 00:45:47,970 that I mentioned. 949 00:45:47,970 --> 00:45:49,770 And by the way, this internal training, 950 00:45:49,770 --> 00:45:52,200 that one thing can train another, if you want-- 951 00:45:52,200 --> 00:45:54,240 a simple mover can train the hand. 952 00:45:54,240 --> 00:45:57,810 A mover and a hand together can train a gaze detector.
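Schematically, the chain might look like the sketch below; every function name in it is a hypothetical placeholder for the detectors discussed above, not part of any real system:

    # Schematic of the internal-training chain: each stage's output
    # becomes the teaching signal for the next stage.
    def bootstrap(frames, detect_mover, fit_detector):
        # Stage 1: the innate motion cue fires on mover events and
        # yields cropped patches of the moving hand (or None).
        hand_patches = []
        for frame in frames:
            patch = detect_mover(frame)
            if patch is not None:
                hand_patches.append(patch)
        # Stage 2: a generic appearance learner turns those patches into
        # a static hand detector, so motion is no longer required.
        hand_detector = fit_detector(hand_patches)
        # Stage 3 repeats the trick one level up: mover plus hand
        # contacts label gaze directions (see the earlier sketch),
        # which train a gaze detector in the same way.
        return hand_detector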
953 00:45:57,810 --> 00:46:00,840 It turns out that gaze is important in learning language, 954 00:46:00,840 --> 00:46:03,696 in disambiguating nouns and verbs 955 00:46:03,696 --> 00:46:05,070 when you learn language, when you 956 00:46:05,070 --> 00:46:06,510 acquire your first language. 957 00:46:06,510 --> 00:46:15,210 So this is from a particular verb-learning experiment. 958 00:46:15,210 --> 00:46:16,260 But let me ignore that. 959 00:46:16,260 --> 00:46:23,000 A simple example would be acquiring a noun that I say. 960 00:46:23,000 --> 00:46:25,690 I say, suddenly, oh, look at my new blicket. 961 00:46:25,690 --> 00:46:27,600 And people have done experiments like this. 962 00:46:27,600 --> 00:46:30,150 And I can say, look at my blicket, 963 00:46:30,150 --> 00:46:32,040 looking at an object on the right side 964 00:46:32,040 --> 00:46:36,000 or looking at another object on the left side, 965 00:46:36,000 --> 00:46:38,100 saying exactly the same expression. 966 00:46:38,100 --> 00:46:40,950 And people have shown that infants exposed 967 00:46:40,950 --> 00:46:43,320 to this kind of situation automatically 968 00:46:43,320 --> 00:46:45,870 associate the term, the noun "blicket," 969 00:46:45,870 --> 00:46:48,990 with the object that has been attended to. 970 00:46:48,990 --> 00:46:50,730 Namely, the gaze was used in order 971 00:46:50,730 --> 00:46:54,340 to disambiguate the reference. 972 00:46:54,340 --> 00:46:55,920 So you can see a nice-- 973 00:46:55,920 --> 00:46:58,960 starting with very low-level internal guiding signals 974 00:46:58,960 --> 00:47:03,540 of, say, moving pixels that can tell you about hands 975 00:47:03,540 --> 00:47:06,120 and about direction of gaze-- and then direction of gaze 976 00:47:06,120 --> 00:47:08,990 helps you to disambiguate the reference of words-- 977 00:47:08,990 --> 00:47:12,960 so these kinds of trajectories of internal supervision 978 00:47:12,960 --> 00:47:17,460 can help you learn to deal with the world. 979 00:47:17,460 --> 00:47:20,210 This is, to me, a part of a larger project, which we called 980 00:47:20,210 --> 00:47:22,260 the digital baby, in which we-- 981 00:47:22,260 --> 00:47:23,400 it's an extension of this. 982 00:47:23,400 --> 00:47:24,840 We really want to understand, what 983 00:47:24,840 --> 00:47:29,820 are all these various innate capacities that we 984 00:47:29,820 --> 00:47:31,460 are born with cognitively? 985 00:47:31,460 --> 00:47:38,400 And we mentioned, here, a number of suggested ones-- the mover, 986 00:47:38,400 --> 00:47:41,500 how the mover can train a gaze detector, and the co-training of two 987 00:47:41,500 --> 00:47:42,125 systems. 988 00:47:42,125 --> 00:47:45,960 And some of these things we think are happening innately 989 00:47:45,960 --> 00:47:49,320 before we begin to learn. 990 00:47:49,320 --> 00:47:53,370 And then we would like to be able to watch 991 00:47:53,370 --> 00:47:56,760 lots and lots of sensory input, which could be visual. 992 00:47:56,760 --> 00:48:00,210 It can be, in general, non-visual. 993 00:48:00,210 --> 00:48:02,310 And from this, what will 994 00:48:02,310 --> 00:48:04,740 happen is the automatic generation of lots 995 00:48:04,740 --> 00:48:09,300 of understanding of the world-- concepts like hands 996 00:48:09,300 --> 00:48:11,490 and intention, direction of looking, and eventually, 997 00:48:11,490 --> 00:48:16,790 nouns, and verbs, and so on-- so that's how we'll be able to do it.
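A toy rendering of that disambiguation step, under assumptions of my own (2-D positions, a single gaze direction; none of this is from the experiments described): bind the heard noun to whichever visible object lies closest to the gaze ray.

    # Illustrative sketch: a heard noun is bound to the attended object,
    # i.e. the object at the smallest angle off the speaker's gaze ray.
    import numpy as np

    def bind_word_to_object(word, gaze_origin, gaze_dir, objects, lexicon):
        """objects: name -> 2-D position; lexicon: noun -> object name."""
        gaze_origin = np.asarray(gaze_origin, float)
        d = np.asarray(gaze_dir, float)
        d /= np.linalg.norm(d)
        def angle_to(pos):
            v = np.asarray(pos, float) - gaze_origin
            v /= np.linalg.norm(v)
            return np.arccos(np.clip(v @ d, -1.0, 1.0))
        referent = min(objects, key=lambda name: angle_to(objects[name]))
        lexicon[word] = referent
        return referent

    lexicon = {}
    objects = {"ball": (1.0, -0.5), "toy": (-1.0, -0.5)}  # hypothetical scene
    bind_word_to_object("blicket", (0.0, 0.0), (-1.0, -0.6), objects, lexicon)
    print(lexicon)  # {'blicket': 'toy'}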
998 00:48:16,790 --> 00:48:23,010 Note that it's very different from the less structured 999 00:48:23,010 --> 00:48:27,090 direction of deep networks, which are interesting and are 1000 00:48:27,090 --> 00:48:28,230 doing wonderful things. 1001 00:48:28,230 --> 00:48:31,090 I think that they're a very useful tool. 1002 00:48:31,090 --> 00:48:34,380 But I think that they are not the answer to the digital baby. 1003 00:48:34,380 --> 00:48:41,290 They do not have the capacity to learn interesting concepts 1004 00:48:41,290 --> 00:48:44,550 in an unsupervised way. 1005 00:48:44,550 --> 00:48:45,600 They do not distinguish the meaningful from the merely salient. 1006 00:48:45,600 --> 00:48:47,760 They go, as I showed you at the very beginning, 1007 00:48:47,760 --> 00:48:50,970 with the cats, and the upper body, and so on. 1008 00:48:50,970 --> 00:48:53,100 They go only for the salient things. 1009 00:48:53,100 --> 00:48:54,590 Gaze is not a salient thing. 1010 00:48:54,590 --> 00:48:56,280 I mean, we have internal signals that 1011 00:48:56,280 --> 00:48:59,980 allow us to zoom in on meaningful things. 1012 00:48:59,980 --> 00:49:03,870 Even if they are not very salient 1013 00:49:03,870 --> 00:49:06,060 objectively in the statistical sense, 1014 00:49:06,060 --> 00:49:09,210 there is something inside us that is tuned to it. 1015 00:49:09,210 --> 00:49:10,470 We are born with it. 1016 00:49:10,470 --> 00:49:12,720 And it guides us towards extracting 1017 00:49:12,720 --> 00:49:14,970 this meaningful information, even 1018 00:49:14,970 --> 00:49:16,830 if it's not all that salient. 1019 00:49:16,830 --> 00:49:18,390 So all of these things are missing 1020 00:49:18,390 --> 00:49:25,050 from the unstructured nets-- or the networks which 1021 00:49:25,050 --> 00:49:28,200 do not have all of these pre-concepts and internal 1022 00:49:28,200 --> 00:49:28,710 guidance. 1023 00:49:28,710 --> 00:49:31,770 And I don't think that they could provide a good model 1024 00:49:31,770 --> 00:49:36,240 for cognitive learning in this sense of the digital baby. 1025 00:49:36,240 --> 00:49:41,970 Although I can see a very useful role for them, for example, 1026 00:49:41,970 --> 00:49:43,210 as just-- 1027 00:49:43,210 --> 00:49:48,820 in answer to Doreen's question-- if you want to then get, 1028 00:49:48,820 --> 00:49:50,950 from all the data and the internal supervision 1029 00:49:50,950 --> 00:49:55,350 that you provided, an accurate gaze detector, 1030 00:49:55,350 --> 00:50:00,550 then supervised training 1031 00:50:00,550 --> 00:50:07,030 of appropriate deep networks can be a very good way to go. 1032 00:50:07,030 --> 00:50:09,679 I wanted to also show you-- this is not directly related, 1033 00:50:09,679 --> 00:50:12,220 but it's something impressive about the use of hands in order 1034 00:50:12,220 --> 00:50:16,070 to understand the world, just to show you how smart infants are. 1035 00:50:16,070 --> 00:50:19,660 I talked more about detecting the hands. 1036 00:50:19,660 --> 00:50:21,730 It was more the visual aspect of, 1037 00:50:21,730 --> 00:50:24,760 here is an image, show me the hand. 1038 00:50:24,760 --> 00:50:26,630 But how they use it-- and this is 1039 00:50:26,630 --> 00:50:29,590 at the age of about one year-- maybe 13 months, 1040 00:50:29,590 --> 00:50:33,080 but one year of age. 1041 00:50:33,080 --> 00:50:34,120 Here's the experiment. 1042 00:50:34,120 --> 00:50:37,210 I think it's a really nice experiment. 1043 00:50:37,210 --> 00:50:40,090 This experiment was with an experimenter.
1044 00:50:40,090 --> 00:50:43,774 This is the experimenter, in one image. 1045 00:50:43,774 --> 00:50:45,190 What happened in the experiment is 1046 00:50:45,190 --> 00:50:48,910 that there was a sort of a lamp that you can turn 1047 00:50:48,910 --> 00:50:50,740 on by pressing it from above. 1048 00:50:50,740 --> 00:50:52,360 It sort of has this dome shape. 1049 00:50:52,360 --> 00:50:54,040 That's the white shape here. 1050 00:50:54,040 --> 00:50:57,170 You press it down, and it turns on. 1051 00:50:57,170 --> 00:50:59,740 It shines blue light, and it's very nice. 1052 00:50:59,740 --> 00:51:01,420 And babies like it. 1053 00:51:01,420 --> 00:51:03,280 And they smile at it, and they jiggle, 1054 00:51:03,280 --> 00:51:07,290 and they like this turning on of the bright light. 1055 00:51:07,290 --> 00:51:08,920 And during the experiment, what happens 1056 00:51:08,920 --> 00:51:12,760 is that the infant is sitting on its parent's lap. 1057 00:51:12,760 --> 00:51:15,910 And the experimenter is on the other side, that experimenter. 1058 00:51:15,910 --> 00:51:17,260 And she turns on the light. 1059 00:51:17,260 --> 00:51:20,350 But she turns on the light-- instead of pressing it, 1060 00:51:20,350 --> 00:51:23,830 as you'd expect, with her hand, she's 1061 00:51:23,830 --> 00:51:25,870 pressing on it with her forehead. 1062 00:51:25,870 --> 00:51:29,110 She leans forward, and she presses 1063 00:51:29,110 --> 00:51:33,366 the lamp, this dome, and the light comes on. 1064 00:51:33,366 --> 00:51:35,990 And then, these are babies that can already manipulate objects. 1065 00:51:35,990 --> 00:51:37,910 So after they see it three or four times, 1066 00:51:37,910 --> 00:51:40,990 and they are happy seeing the light coming on, 1067 00:51:40,990 --> 00:51:44,170 they are handed the lamp and asked 1068 00:51:44,170 --> 00:51:46,990 to turn it on on their own. 1069 00:51:50,290 --> 00:51:55,030 And here is the clever manipulation. 1070 00:51:55,030 --> 00:51:58,660 For half the babies, the experimenter 1071 00:51:58,660 --> 00:52:00,820 had her hands concealed. 1072 00:52:00,820 --> 00:52:02,760 She didn't have her hands here, you see? 1073 00:52:02,760 --> 00:52:06,290 No hands are under this poncho. 1074 00:52:06,290 --> 00:52:08,140 Here it's the same, very similar thing, 1075 00:52:08,140 --> 00:52:09,880 but the hands are visible. 1076 00:52:09,880 --> 00:52:11,680 Now, it turns out that the babies-- 1077 00:52:11,680 --> 00:52:13,590 or the infants-- these are not babies anymore. 1078 00:52:13,590 --> 00:52:16,750 These are young infants-- 1079 00:52:16,750 --> 00:52:18,940 some of them, when they were handed the lamp, 1080 00:52:18,940 --> 00:52:20,920 they did exactly what the experimenter did. 1081 00:52:20,920 --> 00:52:23,590 They bent over, and pressed the lamp with their forehead, 1082 00:52:23,590 --> 00:52:25,630 and turned it on. 1083 00:52:25,630 --> 00:52:27,940 And other children, instead of-- although that's 1084 00:52:27,940 --> 00:52:32,290 what they saw, when they got the lamp over to their side, 1085 00:52:32,290 --> 00:52:35,830 they turned it on by pressing it with their hand, 1086 00:52:35,830 --> 00:52:38,650 unlike what they saw the experimenter do.
1087 00:52:38,650 --> 00:52:42,030 Any prediction on your side of what happened-- 1088 00:52:42,030 --> 00:52:45,520 you see these two situations-- when which babies-- 1089 00:52:45,520 --> 00:52:48,580 I mean, in this case or in this case, in which case 1090 00:52:48,580 --> 00:52:52,840 do you think they actually did it with their hands 1091 00:52:52,840 --> 00:52:54,410 rather than using their forehead? 1092 00:52:54,410 --> 00:52:54,980 Any guess? 1093 00:52:54,980 --> 00:52:55,480 Yeah? 1094 00:52:55,480 --> 00:52:58,874 AUDIENCE: Hands in A and no hands in B. 1095 00:52:58,874 --> 00:53:00,040 SHIMON ULLMAN: That's right. 1096 00:53:00,040 --> 00:53:01,123 And what's your reasoning? 1097 00:53:01,123 --> 00:53:03,566 AUDIENCE: [INAUDIBLE]. 1098 00:53:03,566 --> 00:53:05,440 SHIMON ULLMAN: But you think about, you know, 1099 00:53:05,440 --> 00:53:11,350 a baby-- if you saw a baby, an infant, a young one-year-old, just moving 1100 00:53:11,350 --> 00:53:14,950 seemingly quasi-randomly and so on-- something like that 1101 00:53:14,950 --> 00:53:17,240 went on in their head: here, she 1102 00:53:17,240 --> 00:53:18,820 did it with her forehead. 1103 00:53:18,820 --> 00:53:20,470 She would have used her hands, but she 1104 00:53:20,470 --> 00:53:22,477 couldn't, because they were concealed, 1105 00:53:22,477 --> 00:53:23,560 and she couldn't use them. 1106 00:53:23,560 --> 00:53:25,840 So she used her forehead, but that's not the right way 1107 00:53:25,840 --> 00:53:26,830 to do it. 1108 00:53:26,830 --> 00:53:28,770 I can do it differently and so on. 1109 00:53:28,770 --> 00:53:31,249 That's sort of-- they don't say it explicitly. 1110 00:53:31,249 --> 00:53:32,290 They don't have language. 1111 00:53:32,290 --> 00:53:35,230 But that's the kind of reasoning that went on. 1112 00:53:35,230 --> 00:53:37,300 And indeed, a much larger proportion-- so this 1113 00:53:37,300 --> 00:53:44,360 is the proportion of using their hands, where the green, I think, 1114 00:53:44,360 --> 00:53:46,900 was when the hands were occupied, 1115 00:53:46,900 --> 00:53:48,971 and the other when the hands of the experimenter were free. 1116 00:53:48,971 --> 00:53:51,220 You see that there is a big difference between the two 1117 00:53:51,220 --> 00:53:52,160 groups. 1118 00:53:52,160 --> 00:53:54,010 So they notice the hands. 1119 00:53:54,010 --> 00:53:58,224 They ran through some kind of inference and reasoning. 1120 00:53:58,224 --> 00:53:59,140 What are hands useful for? 1121 00:53:59,140 --> 00:54:00,356 What should I do? 1122 00:54:00,356 --> 00:54:02,230 Should I do it in the same way because that's 1123 00:54:02,230 --> 00:54:04,100 what other people are doing? 1124 00:54:04,100 --> 00:54:05,320 Should I do it differently? 1125 00:54:05,320 --> 00:54:09,350 So I find it impressive. 1126 00:54:09,350 --> 00:54:12,760 So some general comments-- 1127 00:54:16,660 --> 00:54:19,840 general thoughts on learning and the combination of learning 1128 00:54:19,840 --> 00:54:22,540 and innate structures: there 1129 00:54:22,540 --> 00:54:28,970 is a big sort of argument in the field, 1130 00:54:28,970 --> 00:54:31,810 which has been going on since the philosophers of ancient times, 1131 00:54:31,810 --> 00:54:36,910 about whether human cognition is learned. 1132 00:54:36,910 --> 00:54:41,950 This is nativism against empiricism, where nativism 1133 00:54:41,950 --> 00:54:43,570 proposed that things are basically-- 1134 00:54:43,570 --> 00:54:47,540 we are born with what is needed in order to deal 1135 00:54:47,540 --> 00:54:48,680 with the world.
1136 00:54:48,680 --> 00:54:51,030 And empiricism, in the extreme form, 1137 00:54:51,030 --> 00:54:56,690 is that we are born with a blank slate and just a big learning 1138 00:54:56,690 --> 00:54:58,400 machine, maybe like a deep network. 1139 00:54:58,400 --> 00:55:01,370 And we learn everything from the contingencies in the world. 1140 00:55:01,370 --> 00:55:05,780 So this is empiricism versus nativism. 1141 00:55:05,780 --> 00:55:08,600 In these examples, in an interesting way, I think, 1142 00:55:08,600 --> 00:55:12,200 complex concepts were neither learned on their own 1143 00:55:12,200 --> 00:55:15,590 nor innate-- so for example, we didn't have an innate hand 1144 00:55:15,590 --> 00:55:18,080 detector, but also, it couldn't emerge 1145 00:55:18,080 --> 00:55:20,600 in a purely empiricist way. 1146 00:55:20,600 --> 00:55:24,920 But we had enough structure inside that would not 1147 00:55:24,920 --> 00:55:27,500 be the final solution, but would be the right guidance, 1148 00:55:27,500 --> 00:55:34,100 or the right infrastructure, to make learning possible. 1149 00:55:34,100 --> 00:55:36,530 And this is not just a very generic learner. 1150 00:55:36,530 --> 00:55:40,130 But in this case, the learner was informed by-- 1151 00:55:40,130 --> 00:55:46,100 you know, was looking for some mover events or things like that. 1152 00:55:46,100 --> 00:55:48,710 So it's not the hands; it was the movers. 1153 00:55:48,710 --> 00:55:52,520 And this guides the system without supervision, 1154 00:55:52,520 --> 00:55:55,610 not only making supervision unnecessary, 1155 00:55:55,610 --> 00:56:00,470 but also focusing the learner on meaningful representations, not 1156 00:56:00,470 --> 00:56:03,800 necessarily just the things that 1157 00:56:03,800 --> 00:56:09,370 jump out at you statistically from the visual input. 1158 00:56:09,370 --> 00:56:11,450 So there are these kinds of learning trajectories, 1159 00:56:11,450 --> 00:56:15,590 like the mover, hand, gaze, and reference in language-- sort 1160 00:56:15,590 --> 00:56:19,370 of natural trajectories in which one thing leads to another 1161 00:56:19,370 --> 00:56:21,380 and help us acquire things which would be very 1162 00:56:21,380 --> 00:56:25,100 difficult to extract otherwise. 1163 00:56:25,100 --> 00:56:26,720 As I mentioned at the beginning, I 1164 00:56:26,720 --> 00:56:29,180 think that there are some interesting possibilities 1165 00:56:29,180 --> 00:56:34,610 for AI, as I said, to build intelligent machines by not 1166 00:56:34,610 --> 00:56:37,300 thinking about the final intelligent system, 1167 00:56:37,300 --> 00:56:41,090 but thinking about a baby system with the right internal 1168 00:56:41,090 --> 00:56:46,730 capacities, which will make it able to then learn. 1169 00:56:46,730 --> 00:56:49,670 So the use of learning is sort of-- 1170 00:56:49,670 --> 00:56:50,840 we all follow it. 1171 00:56:50,840 --> 00:56:54,480 But the point is, probably just a big learning machine is not 1172 00:56:54,480 --> 00:56:54,980 enough. 1173 00:56:54,980 --> 00:56:57,050 It's really the combination: we 1174 00:56:57,050 --> 00:56:59,420 have to understand the kind of internal structures 1175 00:56:59,420 --> 00:57:05,335 that allow babies to efficiently extract information 1176 00:57:05,335 --> 00:57:05,960 from the world.
1177 00:57:05,960 --> 00:57:11,450 If we manage to put something like this into a baby system 1178 00:57:11,450 --> 00:57:13,196 and let it interact with the world, 1179 00:57:13,196 --> 00:57:14,570 then we have a much higher chance 1180 00:57:14,570 --> 00:57:19,854 of starting to develop really intelligent systems. 1181 00:57:19,854 --> 00:57:21,270 It's interesting, by the way, that 1182 00:57:21,270 --> 00:57:25,040 in the original paper by Turing, where he discusses the Turing 1183 00:57:25,040 --> 00:57:26,720 test and the question, 1184 00:57:26,720 --> 00:57:30,410 can machines think, he discusses the issue 1185 00:57:30,410 --> 00:57:32,300 of building intelligent machines 1186 00:57:32,300 --> 00:57:33,920 somewhere in the future. 1187 00:57:33,920 --> 00:57:37,190 And he says that his hunch is that the really good way 1188 00:57:37,190 --> 00:57:40,070 of building, eventually, intelligent computers, 1189 00:57:40,070 --> 00:57:43,850 intelligent machines, would be to build a baby 1190 00:57:43,850 --> 00:57:45,540 computer, a digital baby, and let 1191 00:57:45,540 --> 00:57:50,530 it learn, rather than thinking about the final one.