The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation, or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

ANDREI BARBU: All right. To start off with, perception is a very difficult problem. And there's a good reason why perception should be difficult, right? We get a very impoverished stimulus. We get, like, a 2D array of values from a 3D scene. Given this impoverished stimulus, we have to understand a huge amount of stuff about the world. We have to understand the 3D structure of the world. If you look at any one pixel, you have to understand the properties of the surface that produced the illumination that gave you that pixel. You have to understand the color or the texture. You have to see the color of the light that hit that surface, the roughness of the surface, et cetera. So you have a very small channel, a very small window onto the world, and you have to extract a tremendous amount of information so that you can survive and not get killed by cars regularly.

All right. That's exactly the problem we're going to talk about here, which is, how do we use our knowledge of the world to structure our perception? To actually modify what we see in order to be able to solve this problem? How do we take a small, impoverished stimulus and extract a huge amount of information about the world around us?

So let's start with a few examples where knowledge about the world really, practically changes what we see. I'm Canadian, so I'm required to show you a Canadian flag in every talk. Here we are. You can take this flag and you can give it to any of you. You can give it to any kid and you can ask them, here's a marker. Put big red marks around the regions on this flag, or the regions in this image, that are red.
And it's pretty clear to all of us that there is a distinction between the red that's in the flag, in the bars and in the maple leaf, versus the red that's actually in the background. And we can all tell those two apart. Except if you actually look at the pixel values, you open them in Photoshop, or GIMP, or whatever other program you want, you're going to notice that those pixel values are actually not particularly different. There's no threshold that you can choose that will separate the red on the flag from the red in the background. So you're doing a huge amount of inference just to solve this trivial little problem of, what color is where? You're using knowledge about regions, knowledge about flags, knowledge about transparency, in order to figure out that the red in the flag is different from the red in the background. So this is a really practical way that your knowledge about the world really changes your perception. You're not seeing the colors that are really there.

Another nice example comes from a paper by Antonio Torralba. So if you look at the scene, it's pretty blurry. And it has to be blurry because your visual system is so incredibly good, we have to degrade the input for you to see how poor it actually is if we take away some information. So this looks like a scene. And the background looks like a building. In the foreground, you can see there's maybe a street. And the thing on the street kind of looks like a car. Does it look like a car to everybody? Awesome.

We can look at a slightly different image, a very similar scene. Again, same building in the background, same street in the foreground. Now, there's kind of a blob in the foreground. And it looks as if it's a person. It looks like a person to everyone, right? Awesome. Well, the only problem is the blob on the left is exactly the same as the blob on the right. It's difficult to believe me, but you can find these two images online and in his paper.
You can open them up in your favorite image viewer. You can zoom in and you'll see they are, pixel for pixel, completely identical. So you're using a tremendous amount of information to put together the fact that, between buildings and streets, when you see these horizontal streaks, it means a car. And when you see these vertical streaks, it means people. And this really changes how you see the world. And it changes it to the point where you actually probably don't believe me that these two blobs are the same. And I couldn't believe it either until I really zoomed in and checked.

So you can see lots of interesting effects where your high-level knowledge of the world is structuring your low-level perception, and it is actually overriding it. You've seen this example with the hammer, where you were unable to recognize what's going on in a small region. But when I give you the rest of the context, you can tell that it's a hammer. And when you see the whole video, you actually don't see the hammer disappear. You're filling in information from context in single images, and you're filling in information from context in whole videos.

But if we dig into what's going on here just a little bit more: somewhere inside your head, there's something resembling a hammer detector, right? So you run a hammer detector. And you ran that hammer detector over that little region. And it said, I'm not so sure. I'm not very confident about what I see in this little region. And somewhere inside your head, there's some detector or something that can recognize someone hammering something.

So if we look at a more traditional computer vision pipeline, what you would do is run your hammer detector. You would take your hammer detector and use that knowledge in order to recognize hammering in the scene. And at the end, you would say, I'm really confused, because my hammer detector didn't work very well.
The reason why you can actually do this is because you have a feedback. You were able to recognize the hammering event as a whole, and that lets you upgrade the scores of your hammer detector, which is very unreliable in this case. So feedback was really critical in being able to understand this scene. Unfortunately, pretty much all of computer vision is feed-forward, even though most of your visual system has, for the most part, feedback connections -- more feedback than feed-forward. So in this talk, we're going to talk about that feedback. And we're going to see a way to build this feedback in a principled way: if we choose the right detections, the right algorithms, and the right representations for our low-level perception, we're going to be able to combine it with our high-level perception of the world.

So we've seen that perception is very unreliable, and that top-down knowledge really affects your perception. And what you're going to see in a moment is that one integrated representation can be used for many tasks. The advantage of these feedbacks goes beyond just better vision. It lets you solve a lot of different problems that look very, very distinct, but actually turn out to be very, very similar.

So one problem is recognition. I can give you a picture of a chair, and I can ask you, what is this? And you can tell me it's a chair. Or I can give you a picture and I can give you a sentence, this is a chair. And you can tell me, I believe you. This is true. There's also a completely different problem, which is retrieval. Related to recognition, right? How about I give you a library of videos and ask you to find me the video where the person was sitting on the chair? And you can solve that problem. You can also solve a problem like generation. I can give you a video and I can tell you, I don't know what's here. Please describe it to me.
So if you see the scene, you can say, what's on the screen? Well, there's a whole bunch of text on the screen. You can also do question answering. You can take an image like this. I can ask you a question. What's the color of the font? And you can say, the font is white. So you were able to take some very high-level knowledge that's in my head that got transmitted to your head. You were able to understand the purpose of this transmission, connect it to your perception, figure out the knowledge that I wanted extracted from your perception, and give it back to me in a way that's meaningful to me.

Even more than this, you can disambiguate. You can take a sentence that's extremely ambiguous about the world and figure out what I'm referring to. And we do this all the time. That's really what makes human communication possible, right? The fact that most of what I say is extremely ambiguous. That's why programming computers is a real pain, but talking to people is generally easier, depending on the person.

You can also acquire knowledge, right? You can look at a whole bunch of videos. If you're a child, you sort of perceive the world around you. Occasionally, an adult comes and drops a sentence here or there for you. But what's important is that no adult ever really points out what the sentence is referring to. You don't know that 'approach' refers to this particular vector when someone was doing some action. You don't know that 'apple' refers to this particular object class. Who knows what it could mean? But you get enough data, and you're able to disentangle this problem of seeing weakly-supervised videos paired with sentences. And we'll see how you can do that.

Pretty much everything I'll talk about is going to be about videos. And I'll tell you a story about how I think we can do images as well. There are a bunch of other problems that you can solve with this approach. I'm sorry?

AUDIENCE: So those images go with the video?

ANDREI BARBU: Yes.
So rather than doing videos, we're going to do images. So one thing that you can do is you can try to do translation. We haven't done this. We're going to be doing this in the fall. We have two students. And I'll tell you at the end what the story is for how you're going to do a task that sounds as if it's from language to language, but you're going to do it in a grounded way that involves vision. Even more than that, you can do planning, and I'll tell you about that at the end a little bit. And finally, you can also incorporate some theory of mind. That's actually the project that the students are doing as part of summer school, and I'll say a few words about that. What's important about this is that the parts at the top we understand better; we've published papers about them. The parts at the bottom are sort of more future work, and I'll say less about them.

Well, one important part about this is that I've shown you all these tasks, but you really have to believe me that humans perform these tasks all the time. Every time you're sitting at a table and you ask someone, give me a cup, that's a really hard vision-language task. There may be 10 different cups in front of you on the table if you're sitting at one of the big, round tables. And you have to figure out, what object am I talking about? What kind of cup am I talking about? Which cup would I be interested in? If I drank out of the cup, I would expect that you give me my cup, not your cup. Otherwise, let me know. I will sit at different tables from now on.

If I ask you, which chair should I sit in? Again, you have to solve a pretty difficult problem where you look at chairs. You figure out what I mean by which chair should I sit in. Is it that there's a chair that's reserved for someone? Is it that a chair is for a child and I'm an adult? That kind of thing. You can say something like, this is an apple.
And when you say that to a child, you're saying it for a particular reason, to convey some idea. You have to coordinate your gaze with the other person's gaze to make sure you're drawing their attention to the real object. Even more than that, you can say very abstract things, like, to win this game, you have to make a straight line out of these pieces. That means that we both agree on what a piece is. That I've drawn your attention to the right idea of a piece. That we agree on what a straight line means on this particular board. There's a lot of knowledge that goes into each of these. But the important part is that they're each grounded in perception. We have to agree on what we're seeing in front of each other in order to be able to exchange this information. And pretty much everything that we do in daily communication is a language-vision problem on some level.

All right. So if we believe these problems are important, we can make one other observation, which is that none of you got training in most of these problems. No adult ever sat you down and said, OK, now you're four. Now I'm going to teach you how to ask questions about the real world. Or no one sat you down and said, OK, now let's talk about language acquisition. You're supposed to do gradient descent. So what's important is that you have some core ability that's shared across all of these tasks. And you're able to acquire knowledge, maybe in one of these tasks or across all of these tasks. You're able to put it together. And as soon as you have this knowledge, you can use it for all these other tasks without having to learn anything else. And that's what we're going to see.

And the core of this that we're going to focus on is recognition. So we're going to build one component. This is the engineering portion of the talk. We're going to build one scoring function that takes a sentence and a video and gives you a score.
How well does this sentence match this video? If the score is 1, it means the system really believes the sentence is depicted by the video. If the score is 0, it means the system really believes the sentence does not occur anywhere in this video. And this is the basic thing that's going to allow us to connect our top-down knowledge about what's going on in the world with our low-level perception. And after we have this, we're going to see how we reformulate everything in terms of this one function, so we don't have to learn anything else about the world.

All right. So we said we need this one function, a scoring function between sentences and videos. So let's look at what we would need to have inside this function in the first place. If I give you a video like this, it's just a person riding a skateboard. And I give you a sentence: the person rode the skateboard leftward. Well, I can ask you, is the sentence true of this video? Indeed, it is true. But let's think about what you had to do in order to be able to answer this question.

Well, you had to, at some level, decide there's a person there. I'm not saying that you're doing this in this order in your brain. I'm not saying that there have to be individual stages. I'm not saying you have to have object detectors. But at some point, you had to decide there really is a person there somehow. You also had to decide there is a skateboard there. You had to look at these objects over time, or at least in one or two frames, and decide that they have a particular relationship, so that the person isn't flying in the air while the skateboard continues onwards. And you had to look at this relationship and decide, yeah, OK, this is riding. And it's happening leftward. So you have to have these components on some level. You've got to see the objects. You've got to see the relationships, the static and the changing relationships between the objects.
And you have to have some way of combining those together to form some kind of sentence, so you can represent that knowledge. And that's what we're going to do. Everything I described to you is this feed-forward system, right? We have objects. We have tracks. We take tracks and we build events, events like ride. And we take those events together and we form sentences out of them. And there's this hard separation, right? It's easy to understand a system where you have objects, tracks, events, and sentences, and you use tracks in order to see if your events happened, and your events in order to see if a particular sentence occurred. So that's what we're going to describe first. And then we're going to see how, because we're going to choose the right representations for each of these, these feedbacks become completely trivial and very natural to implement.

All right. We need to start with some object detections. Otherwise, we're just going to hallucinate objects all the time. Any off-the-shelf object detector that you choose will sometimes work. Here, we ran a person detector, in red, and a bag detector, in blue. It will sometimes give you false positives. Trees are often confused for people. I guess we're both just two long, vertical lines. And sometimes, you get false negatives. Sometimes, a bag is so deformable that you think the person's knee is the bag.

Lest you think that object detection is solved, it actually isn't. If you look at something like the ImageNet challenge, mostly people talk about image classification, the stuff in light blue. And they're saying that there's 10% error; these days, there's 5% error on this. But that's really not what you're doing in the real world. You're not classifying whole images. When you see an image in the real world, what you're doing is trying to figure out what objects are where.
And that's the red part. That's the part where you have an average precision of 50%. In other words, the object detector really, really, really sucks. Most of the time, it's going to be pretty wrong. It's very, very, very far away from how accurate you are. If your object detector was that bad, you would die every time you crossed the street.

All right. So we believe that object detection doesn't work well. In order to fix this -- because somehow we have to be able to extract some knowledge about the video that's pretty robust, for us to be able to track these objects and recognize these sentences -- we need to modify object detectors a little bit. We're going to go into our object detector. Normally, they have a threshold. At some point, they learn that if the score of this detection is above this level, I should have confidence in it. And if the score of this detection is below this level, I shouldn't have confidence in it. And what we're going to do is remove that threshold. We're going to tell the object detector, give me thousands or millions of detections in every frame. We're going to take those detections and figure out how to filter them later on.

All right. The way we're going to do this -- and this is the only slide that's going to have any equations, and it's just going to be a linear combination -- is we're going to take every detection in every frame of this video and arrange them in a lattice. In every column of this lattice, we're going to have the detections for one particular frame. And in essence, what we want is one detection for every object for every frame. In other words, we want a path through this lattice. We want to select one detection in every column. But we want tracks that have a particular property, right? If I'm approaching this microphone, you know that you expect to see me kind of far away. Then, getting closer. Then, eventually I'm close to the microphone.
You don't expect me to be over there, and then to appear over here as if I've teleported. So we want to build in this intuition that objects move smoothly, and that objects move according to how they previously moved, right? It's not like someone moves 10 pixels to the right in one frame, then 10 pixels to the left in the next frame, and keeps oscillating between the two. And that's what we're going to do. I'm not going to talk about how we compute this. It's really trivial. If you know about optical flow, you can do it.

But basically, what we want is a track where we don't hallucinate the objects. So every node in our resulting track should be strong. If we ignore the strength of the object detector, we're just going to pretend that there are a whole bunch of people in front of us. And every edge should also be strong. In other words, when we look at two detections from adjacent frames, if I have a person over here in one frame and a person over there in another frame, I shouldn't really think that's a very good person track. But if I have a person over here that kind of moved to the right in the previous frame, and I have a new detection that's just slightly to the right of that one, I should expect that it's a much better track.

So that's all we do. And encoding this intuition is very, very straightforward. It's just a linear combination. The score of one path, the score of the track of an object, is just the sum of your confidence in the detections, in every detection in every frame, along with the confidence that the object track was actually coherent. This is the only equation we're going to see in this talk, and it's just a linear combination. But it will come back to haunt us several times before the end.

All right. So we use dynamic programming. We find the path through this lattice. And this is a tracker. And actually, Viterbi did this in 1967 for radar.
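A minimal sketch of that dynamic program, in Python, for a single tracked object. The detection fields, the motion-coherence term, and the toy numbers are illustrative assumptions rather than the actual system, which also uses optical flow to predict where a box should have moved.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    x: float       # box center, x
    y: float       # box center, y
    score: float   # raw detector confidence, with the threshold removed

def coherence(prev, cur):
    # Edge score: prefer small frame-to-frame displacement. A real tracker
    # would use optical flow to predict where the box should have moved.
    return -((cur.x - prev.x) ** 2 + (cur.y - prev.y) ** 2) ** 0.5

def track(frames):
    """Viterbi over the detection lattice: pick one detection per frame,
    maximizing the sum of detection scores plus coherence scores."""
    best = [d.score for d in frames[0]]   # best path score ending at each node
    backpointers = []
    for t in range(1, len(frames)):
        scores, pointers = [], []
        for d in frames[t]:
            cands = [best[i] + coherence(p, d) for i, p in enumerate(frames[t - 1])]
            i_best = max(range(len(cands)), key=cands.__getitem__)
            scores.append(cands[i_best] + d.score)
            pointers.append(i_best)
        best = scores
        backpointers.append(pointers)
    # Trace the best path backwards through the lattice.
    j = max(range(len(best)), key=best.__getitem__)
    path = [j]
    for pointers in reversed(backpointers):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [frames[t][j] for t, j in enumerate(path)]

# Toy example: a greedy per-frame argmax would jump from x=10 to x=81;
# the joint path prefers the coherent x=80 -> x=81 track.
frames = [[Detection(10, 5, 0.9), Detection(80, 5, 0.5)],
          [Detection(12, 5, 0.5), Detection(81, 5, 0.9)]]
print(track(frames))
```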
This is not a new idea. Here, we ran it for a computer vision task where we just wanted to track objects. We ran a person detector and a motorcycle detector, but we don't have a person-standing-up detector and a person-sitting-down detector. So the tracker is good enough that it can keep the two people separate from each other, despite the fact that they're actually pretty close in the video. And you see we do a pretty decent job of tracking all the objects until they get pretty small in the field of view and the object detector doesn't work well anymore.

All right. So now what we have are the tracks of objects. We can see object motion over time in a video. And somehow, we have to look at these tracks and determine what happened to them. Was someone riding? Was someone running? Was someone bouncing up and down? In order to do this, we're going to get some features from our tracks. You can look at the track in every frame and extract out a lot of information. You can extract out the average color. You can extract out the position, the velocity, the acceleration, the aspect ratio. Anything that you want to get out of this frame, knowing that this bounding box is there, you can compute, and the algorithm doesn't care.

All right. There's one small problem, though. Most of the time, we need more complicated feature vectors. For example, for ride, it's not enough to have a feature vector that only includes the person. You need to look at the relative position between the person and the skateboard to determine that they're actually going together, and that one isn't going right while the other one is going left. So for that, what we're going to do is build a feature vector for the agent of the action -- the person, in the case of ride -- and a feature vector for the instrument -- the skateboard. We're going to concatenate the two together, so we get a bigger feature vector.
And then we're going to have some extra features that tell us about the relationships between these two. So we can include things like the distance, the relative velocity, the angle, the overlap. Anything that you want to compute between these two bounding boxes in this frame, you're welcome to compute.

All right. And if you build this feature vector between the person and the skateboard, you can recognize that the person rode the skateboard in this video. If you build a different feature vector, for example between these two people, you can recognize that the person was approaching the other person, or that the person was leaving the other person. If you build a feature vector between the skateboard and the other person, you can recognize that the skateboard is approaching the person, et cetera. So depending on which feature vector you build, you can recognize different kinds of actions.

So when we have our tracks, we know how the objects moved in these videos. We get out some feature vectors from our tracks. And what we need to do is decide what these feature vectors are actually telling us. Is the person riding that skateboard? The way we're going to do this is using hidden Markov models. Hidden Markov models are really simple. All they assume is that there is a model of the world that follows a particular kind of dynamics. In this case, imagine that we have an action like approach. I'm far away from the object. I get closer to the object. Eventually, I'm next to the object. So this action, for example, has three states: one where I was far, one as I was getting nearer, one when I was very close. And we have a particular transition structure between these states, right? We already said that I don't teleport, so I shouldn't be able to go directly from being far away to being right next to the object. You should expect me to go from the first state to the second state and then to the third state, without going from the first to the last. In each state, you have something that you want to observe about me, right?
You want to really see that I'm far away in the first state, that I'm getting closer in the second, and that I'm actually there in the third. So we have some model for what we expect to see in every state, and we can connect this with our feature vectors. So the idea is that there's some hidden information behind the motion of these objects, and we're going to assume that hidden information is represented with an HMM. And what we need to recover is the real state of these objects. So if you see a video of me moving towards this microphone, you have to recover some hidden information: which frames was I far away in? Which frames was I getting nearer in? And which frames was I actually next to the object in?

For now, what we're going to do is assume that we have one of these hidden Markov models for every different word. So for every verb, we have a different hidden Markov model. There's one for approach. There's one for pick up. There's one for ride, et cetera. And if you want to tell me what's going on in this video, you just have a big library of hidden Markov models. You apply every one to every video. You have some threshold. And anything above that threshold, you say happened, and you produce a sentence for it.

OK. If we look at how you actually figure out what this hidden information is -- what state am I in when I'm approaching this object -- it looks a lot like the tracker. What you have is a choice to make in every frame. Your choice is, which state is my action in? Is it state 1 through 3, or some other state? In the same way, in the tracker, you have to make a choice. You have to choose, which detection is the system in for each frame? And here, you also have edges. Edges tell you, how likely am I to transition between different states in my action? And every node also has a score. It's the score of, did you actually observe me doing what you're supposed to observe me doing in each state?
So if you're saying I'm in the first state, did you actually see me stationary and far away from that object? And what you want is a path through this lattice, in the same way that we had a path before. And a path just means you made a decision that I'm in state 1 in the first frame, or in state 1 in the third frame, et cetera. And that's just the linear combination of the scores. So it's the same equation we saw before.

So here's an example of this sort of feed-forward pipeline in action. We ran it over a few thousand videos. It produces output like the person carried something, the person went away, the person walked, the person had the bag. It's pretty limited in its vocabulary. It has 48 verbs, about 30 different objects, a few different prepositions. And it even works when the camera moves. So the person chased the car rightward, the person slowly ran rightward to the car. And it should also probably say the person had a really bad day, but that's for the future.

So we've seen this feed-forward pipeline. We've seen that we can get objects. We can get tracks. We can look at our tracks, get some features, run event detectors, take those event detectors, and produce some sentences. And now, all we're going to do is break down the barriers between these and show you how you can have feedback in a really, really simple way.

All right. So first, let's combine our event detector and our tracker. Because what that's going to say is, if you're looking for someone riding something, well, you should be biased towards seeing people that are riding something. So in the occlusion example, if you see someone go behind some large pillar, well, you might lose them. But you have a bias that you should reacquire someone riding a skateboard after they leave the pillar, which you don't have if you just run the tracker independently from the event detector.
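Before the tracker and the event recognizer get combined, here is a minimal sketch of the feed-forward word scorer just described: per-frame features computed from already-fixed tracks, scored by a small hand-written HMM with the same kind of Viterbi dynamic program as the tracking sketch above. The three-state "approach" model, the feature choices, and the scores are illustrative assumptions, not the models the actual system uses.

```python
import math

def pair_features(agent, patient):
    # Per-frame relational features between two tracked boxes; a real system
    # also concatenates each box's own features (position, velocity, ...)
    # and adds relative velocity, angle, overlap, and so on.
    return {"distance": math.hypot(agent.x - patient.x, agent.y - patient.y)}

# A hand-written three-state HMM for "approach": far -> getting closer -> near.
# Each state scores the per-frame features; transitions forbid skipping states.
APPROACH_STATES = [
    lambda f: 0.0 if f["distance"] > 50 else -5.0,        # far away
    lambda f: 0.0 if 10 < f["distance"] <= 50 else -5.0,  # getting closer
    lambda f: 0.0 if f["distance"] <= 10 else -5.0,       # next to it
]
APPROACH_TRANS = {(0, 0): 0.0, (0, 1): 0.0, (1, 1): 0.0, (1, 2): 0.0, (2, 2): 0.0}

def score_word(states, trans, features_per_frame):
    """Viterbi over HMM states: the same linear combination as the tracker."""
    best = {0: states[0](features_per_frame[0])}   # must start in the first state
    for f in features_per_frame[1:]:
        new_best = {}
        for (i, j), t_score in trans.items():
            if i in best:
                s = best[i] + t_score + states[j](f)
                if s > new_best.get(j, -math.inf):
                    new_best[j] = s
        best = new_best
    return best.get(len(states) - 1, -math.inf)    # must end in the last state

# Usage, given two tracks produced by the tracker sketch above
# (person_track and ball_track are hypothetical names):
#   features = [pair_features(a, p) for a, p in zip(person_track, ball_track)]
#   print(score_word(APPROACH_STATES, APPROACH_TRANS, features))
```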
So the way we're going to put them together is very, very easy. There's a reason why these two look completely identical and why the inference algorithm for each of them is identical. Right now, what we're doing is we have a tracker on the left, or on your left, and we have an event recognizer on the right. Right now, we run one, and then we feed the output of one into the other. Basically, we run one maximization, and then we run another maximization. And all we're going to do is move the max on the right over to the left. And you get the exact same inference algorithm.

The intuition behind this is that you have two lattices, and you can take the cross-product of the lattices. Basically, for every tracker node, you just look at all the event recognizer's nodes and you make one big node for each of those. And every node represents the fact that the tracker was in some state and the event recognizer was in some other state. So we have a node that says the tracker chose the first detection and the event recognizer was in the first state. We have another node that says the tracker chose the second detection and the event recognizer was still in the first state. And you do this for every detection. Then, you do the same thing for the event recognizer being in the second state, et cetera. So you're just taking a cross-product between all of the states. Does that make sense?

Another way to say it is that we have two Markov chains: one that's observing the output from the object detector, and another one that's observing the output of the middle Markov chain. And you do joint inference over them. And the way you can do joint inference is by taking the cross-product. Basically, you have two hidden Markov models, one that does tracking and one that does event recognition, and all we're going to do is joint inference in both of them.
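A sketch of that cross-product, reusing the Detection, coherence, and word-HMM pieces from the two sketches above. Each node of the joint lattice is a (detection, HMM state) pair, and the path score is still the same linear combination: detection score plus motion coherence plus state observation plus state transition. This covers one tracker and one word; the names and the single-object simplification are assumptions, and the actual system takes the product over several trackers and several words at once.

```python
import math

def joint_track_and_recognize(frames, states, trans, features):
    """Joint Viterbi over (detection index, HMM state) pairs.
    `features(det)` turns a candidate detection into the per-frame features
    that the word HMM scores; `states`, `trans`, `coherence`, and the
    Detection objects are as in the earlier sketches."""
    # best[(d, q)] = best score of any path ending at detection d, state q
    best = {(d, 0): det.score + states[0](features(det))
            for d, det in enumerate(frames[0])}
    back = []
    for t in range(1, len(frames)):
        new_best, pointers = {}, {}
        for d, det in enumerate(frames[t]):
            for (i, j), t_score in trans.items():
                for (pd, pq), prev_score in best.items():
                    if pq != i:
                        continue
                    s = (prev_score
                         + coherence(frames[t - 1][pd], det)  # tracker edge
                         + t_score                            # HMM edge
                         + det.score                          # tracker node
                         + states[j](features(det)))          # HMM node
                    if s > new_best.get((d, j), -math.inf):
                        new_best[(d, j)], pointers[(d, j)] = s, (pd, pq)
        best, back = new_best, back + [pointers]
    # The best final node must be in the word's last state.
    finals = {k: v for k, v in best.items() if k[1] == len(states) - 1}
    node = max(finals, key=finals.get)
    path = [node]
    for pointers in reversed(back):
        node = pointers[node]
        path.append(node)
    return list(reversed(path))   # one (detection index, state) pair per frame
```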
So rather than trying to choose the best detection, and then the best state for my event, I'm going to jointly figure out, what's the best detection if I assume I'm in this state? What's the best detection if I assume I'm in this other state? And at the end, I'll pick the best combination. Make sense? So this is a way for your event recognizer to influence your tracker, because now you're jointly choosing the best detection for both the tracker and the event recognizer. So that was really, really simple. We put in a tremendous amount of feedback by just taking a cross-product.

So we can see this in action. I'm going to show you the same video twice. The person is not going to move in this video at all. What we told the system is that a ball will approach a person. That's it. We didn't tell it which person. We didn't tell the system which particular ball, which direction it's going to come from, or anything like that. The top detection in this frame happens to be the window. It's a little hard to see, but it's quite a bit stronger than the person. And because neither the window nor the person ever moves in this scenario, the tracker can't possibly help you. You have no motion information. The only way to override that window detection is to know something else about the world. So we told it that the ball will approach. And you can see that, for the combined tracker and event recognizer, when the ball comes into view, everything makes more sense.

So, coming back to the question that you asked, the reason why we don't run it over small windows is because we want this effect -- knowledge from much, much later in the video, like the fact that the ball will eventually approach that person as opposed to that window -- to actually help you much earlier in the video. If you run it over small windows, you lose that effect.
789 00:26:55,450 --> 00:26:58,400 So here, you track the person correctly from the very first 790 00:26:58,400 --> 00:27:01,210 frame despite the fact that the ball only comes into view 791 00:27:01,210 --> 00:27:03,850 halfway through the video. 792 00:27:03,850 --> 00:27:05,440 There are many more examples of this. 793 00:27:05,440 --> 00:27:07,570 In this case, it's a person carrying something. 794 00:27:07,570 --> 00:27:11,789 Here, we told the system one person's carrying something. 795 00:27:11,789 --> 00:27:13,330 And you'll see when the person moves, 796 00:27:13,330 --> 00:27:16,580 we can detect the person and the bag. 797 00:27:16,580 --> 00:27:19,210 The object detector fails much, much earlier because the person 798 00:27:19,210 --> 00:27:20,860 was deformable. 799 00:27:20,860 --> 00:27:23,470 So we've seen how we can combine together trackers and events 800 00:27:23,470 --> 00:27:24,190 recognizers. 801 00:27:24,190 --> 00:27:25,684 And now, we need to add sentences. 802 00:27:25,684 --> 00:27:27,100 And the trick for adding sentences 803 00:27:27,100 --> 00:27:29,506 is going to do more of the same. 804 00:27:29,506 --> 00:27:32,020 What we're going to do is we're going to take a tracker. 805 00:27:32,020 --> 00:27:33,975 It's just exactly what we saw before. 806 00:27:33,975 --> 00:27:35,350 And what we just did a moment ago 807 00:27:35,350 --> 00:27:37,529 is we combined it with an event recognizer. 808 00:27:37,529 --> 00:27:39,820 Well, there's no reason why we can't add more trackers. 809 00:27:39,820 --> 00:27:40,840 We actually kind of did that, right? 810 00:27:40,840 --> 00:27:43,580 We were tracking both a person and a ball a moment ago. 811 00:27:43,580 --> 00:27:45,580 So we can take an even bigger cross-product, 812 00:27:45,580 --> 00:27:49,510 have multiple trackers, and have multiple words. 813 00:27:49,510 --> 00:27:51,520 So all we're saying is, I have, say, 814 00:27:51,520 --> 00:27:53,110 five trackers that are running. 815 00:27:53,110 --> 00:27:54,970 I have five words that I want to detect, 816 00:27:54,970 --> 00:27:56,495 or 10 words that I want to detect. 817 00:27:56,495 --> 00:27:58,870 And I want to make the choice for all of these 5 trackers 818 00:27:58,870 --> 00:28:03,040 jointly, so that they match all of these 10 words. 819 00:28:03,040 --> 00:28:04,762 In this picture, basically our words 820 00:28:04,762 --> 00:28:07,220 are kind of-- our sentences are kind of like bags of words, 821 00:28:07,220 --> 00:28:07,720 right? 822 00:28:07,720 --> 00:28:09,900 Every word is combined with every tracker. 823 00:28:09,900 --> 00:28:12,340 But we know if you look at the structure of a sentence 824 00:28:12,340 --> 00:28:14,920 like the tall person quickly rode the horse, 825 00:28:14,920 --> 00:28:18,880 not every word refers to every object in the sentence. 826 00:28:18,880 --> 00:28:22,234 So you can run your object detectors over your video. 827 00:28:22,234 --> 00:28:23,650 And you can look at your sentence. 828 00:28:23,650 --> 00:28:25,400 And you can look at the nouns and say, OK. 829 00:28:25,400 --> 00:28:29,170 So I have people and horses inside the sentence. 830 00:28:29,170 --> 00:28:30,260 And you can say, OK. 831 00:28:30,260 --> 00:28:32,770 Well, if I have people and horses, I need two trackers. 832 00:28:32,770 --> 00:28:34,895 But you can look a little bit more at your sentence 833 00:28:34,895 --> 00:28:37,390 and see that, oh, well, it's the other horse. 
834 00:28:37,390 --> 00:28:40,030 So you analyze your sentence and you 835 00:28:40,030 --> 00:28:42,640 can determine there are three participants in the event 836 00:28:42,640 --> 00:28:44,180 described by the sentence. 837 00:28:44,180 --> 00:28:46,360 There's a person and two horses. 838 00:28:46,360 --> 00:28:47,390 One's the agent. 839 00:28:47,390 --> 00:28:50,710 One's the patient-- the thing that's being ridden-- 840 00:28:50,710 --> 00:28:57,160 and one's source-- the thing that is being left. 841 00:28:57,160 --> 00:28:58,820 Does that make sense? 842 00:28:58,820 --> 00:28:59,440 Awesome. 843 00:28:59,440 --> 00:29:03,620 So now, given a sentence, we know that we need n trackers. 844 00:29:03,620 --> 00:29:06,640 And for every word, we can have a hidden Markov model. 845 00:29:06,640 --> 00:29:08,540 We can have a hidden Markov model for ride. 846 00:29:08,540 --> 00:29:09,589 It's just another verb. 847 00:29:09,589 --> 00:29:11,380 And we just have to be careful how we build 848 00:29:11,380 --> 00:29:12,700 a feature vector for ride. 849 00:29:12,700 --> 00:29:14,170 Because if we build it in one way, 850 00:29:14,170 --> 00:29:16,090 we're going to detect the person rode the horse. 851 00:29:16,090 --> 00:29:17,440 And if we build it in the opposite way 852 00:29:17,440 --> 00:29:19,700 by concatenating the vectors the other way around, 853 00:29:19,700 --> 00:29:22,360 we're going to detect the horse rode the person, which 854 00:29:22,360 --> 00:29:23,496 is not what we want. 855 00:29:23,496 --> 00:29:24,885 We can also detect tall. 856 00:29:24,885 --> 00:29:27,010 Tall is kind of a weird hidden Markov model, right? 857 00:29:27,010 --> 00:29:29,140 It has only a single state, but it's still a hidden Markov 858 00:29:29,140 --> 00:29:29,710 model. 859 00:29:29,710 --> 00:29:31,930 It just wants to see that this object is tall. 860 00:29:31,930 --> 00:29:36,010 So maybe its aspect ratio is more than the mean aspect ratio 861 00:29:36,010 --> 00:29:37,840 of objects of this class. 862 00:29:37,840 --> 00:29:41,369 But nonetheless, it still fits into this paradigm. 863 00:29:41,369 --> 00:29:42,910 We can do the same thing for quickly. 864 00:29:42,910 --> 00:29:43,960 We can have an HMR for that. 865 00:29:43,960 --> 00:29:44,751 We can do leftward. 866 00:29:44,751 --> 00:29:45,602 We can do away from. 867 00:29:45,602 --> 00:29:48,320 Away from looks a lot like leave. 868 00:29:48,320 --> 00:29:50,140 It's the same meaning. 869 00:29:50,140 --> 00:29:52,570 And basically, we end up with this bipartite graph. 870 00:29:52,570 --> 00:29:55,780 At the top, we have lattices that represent words. 871 00:29:55,780 --> 00:29:57,612 Each word has a hidden Markov model. 872 00:29:57,612 --> 00:29:59,070 And in the middle, we have lattices 873 00:29:59,070 --> 00:30:00,820 that represent trackers. 874 00:30:00,820 --> 00:30:03,275 We can combine them together according to the links. 875 00:30:03,275 --> 00:30:05,650 And you can get these links from your favorite dependency 876 00:30:05,650 --> 00:30:06,149 parser. 877 00:30:06,149 --> 00:30:09,400 You can get them from Boris's START system. 878 00:30:09,400 --> 00:30:12,372 Any language analysis system will give you this. 879 00:30:12,372 --> 00:30:14,080 So this is actually all the heavy lifting 880 00:30:14,080 --> 00:30:14,871 that we have to do. 881 00:30:14,871 --> 00:30:18,487 Everything from now on is kind of eye candy. 
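To make the bipartite picture above concrete, here is a small sketch of the kind of data structure that links word models to trackers. The class names and the hand-written roles are hypothetical; in the real system the word-to-participant links would come from a dependency parser or a system like START rather than being typed in by hand.

```python
from dataclasses import dataclass, field

@dataclass
class WordModel:
    word: str
    n_states: int        # e.g. 1 for "tall", a few states for a verb like "ride"
    participants: list   # indices of the trackers this word constrains

@dataclass
class SentenceModel:
    trackers: list                      # one tracker per participant
    words: list = field(default_factory=list)

def build_sentence_model():
    # "The tall person quickly rode the horse."
    m = SentenceModel(trackers=["person", "horse"])
    m.words = [
        WordModel("person",  1, [0]),   # noun: constrains tracker 0's object class
        WordModel("horse",   1, [1]),
        WordModel("tall",    1, [0]),   # adjective: one-state HMM on tracker 0
        WordModel("quickly", 1, [0]),   # adverb: speed of the agent's track
        # "ride" is asymmetric: its feature vector is built as (agent, patient),
        # so [0, 1] encodes "person rode horse" rather than the reverse.
        WordModel("ride",    3, [0, 1]),
    ]
    return m
```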
882 00:30:18,487 --> 00:30:20,320 One thing that we really wanted to make sure 883 00:30:20,320 --> 00:30:22,420 that system was doing is that we could distinguish 884 00:30:22,420 --> 00:30:23,690 different sentences. 885 00:30:23,690 --> 00:30:26,650 So we tried to come up with an experiment that 886 00:30:26,650 --> 00:30:28,987 is, in some way, maximally difficult where 887 00:30:28,987 --> 00:30:30,820 events are going to happen at the same time. 888 00:30:30,820 --> 00:30:33,280 So you can't use time in order to distinguish them. 889 00:30:33,280 --> 00:30:37,600 And the sentences only differ in one word or one lexical item. 890 00:30:37,600 --> 00:30:39,070 So in this case, we have a sentence 891 00:30:39,070 --> 00:30:41,361 like the person picked up an object and person put down 892 00:30:41,361 --> 00:30:42,400 an object. 893 00:30:42,400 --> 00:30:44,169 There are two systems that are running. 894 00:30:44,169 --> 00:30:45,460 One is running on one sentence. 895 00:30:45,460 --> 00:30:47,200 One is running on the other sentence. 896 00:30:47,200 --> 00:30:48,908 You're going to see the same video played 897 00:30:48,908 --> 00:30:50,726 twice side by side. 898 00:30:50,726 --> 00:30:52,600 And you can already see that one system, when 899 00:30:52,600 --> 00:30:54,130 we primed it to look for pickup, it 900 00:30:54,130 --> 00:30:56,020 detected me picking up my backpack. 901 00:30:56,020 --> 00:30:58,240 And then, the other one it detected one of my lab 902 00:30:58,240 --> 00:31:00,410 mates picking up a bin. 903 00:31:00,410 --> 00:31:02,849 So the only way you could focus its attention 904 00:31:02,849 --> 00:31:05,140 on the right object is if it understood the distinction 905 00:31:05,140 --> 00:31:06,740 between these two sentences, or if it 906 00:31:06,740 --> 00:31:09,410 was able to represent them. 907 00:31:09,410 --> 00:31:11,510 So we can play this game many, many times over. 908 00:31:11,510 --> 00:31:13,629 We can have it pay attention to the subject. 909 00:31:13,629 --> 00:31:15,670 Is a backpack approaching something or is a chair 910 00:31:15,670 --> 00:31:17,110 approaching something? 911 00:31:17,110 --> 00:31:19,997 We can have it pay attention to the color of an object. 912 00:31:19,997 --> 00:31:22,330 Is the red object approaching something or a blue object 913 00:31:22,330 --> 00:31:23,530 approaching something? 914 00:31:23,530 --> 00:31:26,237 We can have it pay attention to a preposition. 915 00:31:26,237 --> 00:31:28,570 Is someone picking up an object to the left of something 916 00:31:28,570 --> 00:31:29,778 or to the right of something? 917 00:31:29,778 --> 00:31:32,080 And we have many, many dozens or hundreds of these. 918 00:31:32,080 --> 00:31:34,150 And I won't bore you with all of them. 919 00:31:34,150 --> 00:31:36,812 But the important part is we can handle lots and lots 920 00:31:36,812 --> 00:31:38,020 of different parts of speech. 921 00:31:38,020 --> 00:31:40,019 And we can still represent them and we can still 922 00:31:40,019 --> 00:31:42,400 be sensitive to these subtle distinctions in the meanings 923 00:31:42,400 --> 00:31:44,900 of the sentences. 924 00:31:44,900 --> 00:31:45,400 All right. 925 00:31:45,400 --> 00:31:46,566 So we did all the hard work. 926 00:31:46,566 --> 00:31:49,240 And we actually built this recognizer-- 927 00:31:49,240 --> 00:31:51,825 the score of a sentence given a video. 
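Here is a minimal sketch of how that one score can be reused, using video retrieval as the example. score_sentence(segment, sentence) stands in for the joint-inference score just described; the function name, the top_k default, and the fixed-length chopping are all assumptions made for the illustration, not details of the actual system.

```python
def retrieve(query_sentence, video, segment_length, score_sentence, top_k=10):
    """Chop a long video into short segments and rank them by how well
    the query sentence describes each one (higher score = better match)."""
    segments = [video[i:i + segment_length]
                for i in range(0, len(video), segment_length)]
    scored = [(score_sentence(seg, query_sentence), i)
              for i, seg in enumerate(segments)]
    scored.sort(reverse=True)
    return scored[:top_k]

# Disambiguation is the same one-liner in spirit: score each candidate
# interpretation of a sentence against the video and keep the argmax.
```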
928 00:31:51,825 --> 00:31:53,200 And now, it turns out that we can 929 00:31:53,200 --> 00:31:55,150 reformulate all of these other tasks 930 00:31:55,150 --> 00:31:56,380 in terms of this one score. 931 00:31:56,380 --> 00:31:58,580 And it's going to do all the heavy lifting for us. 932 00:31:58,580 --> 00:32:00,970 So when we tune the parameters of whatever 933 00:32:00,970 --> 00:32:02,980 goes into the scoring function, we're 934 00:32:02,980 --> 00:32:05,975 going to get the ability to do all these other tasks. 935 00:32:05,975 --> 00:32:07,100 So let's look at retrieval. 936 00:32:07,100 --> 00:32:09,010 It's the most straightforward kind of task, right? 937 00:32:09,010 --> 00:32:10,360 It's what YouTube does for you. 938 00:32:10,360 --> 00:32:11,170 You go to YouTube. 939 00:32:11,170 --> 00:32:14,660 You type in a query, and YouTube comes back with some answers. 940 00:32:14,660 --> 00:32:17,650 So let's see what YouTube actually does. 941 00:32:17,650 --> 00:32:19,220 If you look at YouTube. 942 00:32:19,220 --> 00:32:21,550 And if you look at something like pickup, 943 00:32:21,550 --> 00:32:23,920 you get men picking up women. 944 00:32:23,920 --> 00:32:26,862 If you look at approach, you get men picking up women. 945 00:32:26,862 --> 00:32:28,570 If you look at put down, once upon a time 946 00:32:28,570 --> 00:32:29,944 you did get men picking up women, 947 00:32:29,944 --> 00:32:33,190 but rap is now more popular. 948 00:32:33,190 --> 00:32:35,430 If you ask something more interesting-- the person 949 00:32:35,430 --> 00:32:36,580 approached the other person-- you 950 00:32:36,580 --> 00:32:38,663 don't get videos where people approach each other. 951 00:32:38,663 --> 00:32:40,914 You get videos about how you should approach women. 952 00:32:40,914 --> 00:32:41,830 I didn't select these. 953 00:32:41,830 --> 00:32:43,660 I typed them in and this is just what happened. 954 00:32:43,660 --> 00:32:45,670 If you type in, like the person approached the cat, 955 00:32:45,670 --> 00:32:47,378 you get lots of people playing with cats, 956 00:32:47,378 --> 00:32:50,410 but no one approaching cats, including 957 00:32:50,410 --> 00:32:54,030 a link that's kind of scary and an Airbus landing. 958 00:32:54,030 --> 00:32:55,720 And I have no idea what that means. 959 00:32:55,720 --> 00:32:58,690 So what we did is we built a video retrieval system that 960 00:32:58,690 --> 00:33:00,910 actually understands what's going on in the videos as 961 00:33:00,910 --> 00:33:02,368 opposed to just looking at the tags 962 00:33:02,368 --> 00:33:03,970 that the people apply to these videos. 963 00:33:03,970 --> 00:33:05,553 People don't describe what's going on. 964 00:33:05,553 --> 00:33:08,300 People describe some high-level concept. 965 00:33:08,300 --> 00:33:10,240 So we took a whole bunch of object detectors 966 00:33:10,240 --> 00:33:13,020 that are completely of the shelf for people and for horses. 967 00:33:13,020 --> 00:33:15,070 And we took 10 Hollywood movies. 968 00:33:15,070 --> 00:33:16,810 Nominally, they're all Westerns. 969 00:33:16,810 --> 00:33:18,147 They involve people on horses. 970 00:33:18,147 --> 00:33:19,980 And the reason why we chose people on horses 971 00:33:19,980 --> 00:33:21,750 was because people on horses tend 972 00:33:21,750 --> 00:33:23,680 to be fairly larger in the field of view. 
973 00:33:23,680 --> 00:33:25,630 And given that object detectors suck so much, 974 00:33:25,630 --> 00:33:27,713 we thought we should kind of help the system along 975 00:33:27,713 --> 00:33:29,290 as best we could. 976 00:33:29,290 --> 00:33:32,400 So we built a system. 977 00:33:32,400 --> 00:33:34,820 It's a system that knows about three verbs. 978 00:33:34,820 --> 00:33:37,830 It knows about two nouns, person and horse. 979 00:33:37,830 --> 00:33:41,100 It knows about some adverbs, quickly and slowly. 980 00:33:41,100 --> 00:33:43,610 It knows about some prepositions, leftwards, 981 00:33:43,610 --> 00:33:45,814 rightwards, towards, away from. 982 00:33:45,814 --> 00:33:47,980 And given this template, you can generate about 200, 983 00:33:47,980 --> 00:33:50,210 300 different sentences. 984 00:33:50,210 --> 00:33:53,830 So we can type in something like the person rode the horse. 985 00:33:53,830 --> 00:33:55,910 And we can get a bunch of results. 986 00:33:55,910 --> 00:33:59,150 So you can see, we were 90% accurate in the top 10 results. 987 00:33:59,150 --> 00:34:01,750 You can see these are really videos of people riding horses. 988 00:34:01,750 --> 00:34:04,240 The way this works is we took one of these long videos. 989 00:34:04,240 --> 00:34:06,320 We chopped it up into many small segments 990 00:34:06,320 --> 00:34:08,596 and we ran it over each individual segment. 991 00:34:08,596 --> 00:34:10,179 You could run it over the whole video, 992 00:34:10,179 --> 00:34:12,137 but then it would just classify the whole video 993 00:34:12,137 --> 00:34:14,110 because it's an HMM and would sort of adapt 994 00:34:14,110 --> 00:34:16,009 to the length of the video. 995 00:34:16,009 --> 00:34:17,800 We can also ask for other kinds of queries, 996 00:34:17,800 --> 00:34:20,409 like the person rode the horse quickly. 997 00:34:20,409 --> 00:34:23,739 You can see we get videos that really are quicker. 998 00:34:23,739 --> 00:34:25,900 We can ask for something more ambitious, 999 00:34:25,900 --> 00:34:28,420 like the person rode the horse quickly rightward. 1000 00:34:28,420 --> 00:34:32,061 And we get videos where people are riding horses rightward. 1001 00:34:32,061 --> 00:34:32,560 All right. 1002 00:34:32,560 --> 00:34:35,340 So we did the hard work of building this recognition 1003 00:34:35,340 --> 00:34:35,840 system. 1004 00:34:35,840 --> 00:34:37,840 And we saw we can use it for another task, which 1005 00:34:37,840 --> 00:34:38,679 is retrieval. 1006 00:34:38,679 --> 00:34:39,960 But let's do something else. 1007 00:34:39,960 --> 00:34:41,080 Let's do generation. 1008 00:34:41,080 --> 00:34:43,750 Someone asked about generation earlier. 1009 00:34:43,750 --> 00:34:45,790 Generation is very similar to retrieval. 1010 00:34:45,790 --> 00:34:48,310 In retrieval, what we had was we had a fixed sentence 1011 00:34:48,310 --> 00:34:50,409 and we searched over all our videos 1012 00:34:50,409 --> 00:34:52,330 to see which ones were the best match. 1013 00:34:52,330 --> 00:34:54,310 Here, we have a fixed video. 1014 00:34:54,310 --> 00:34:56,409 And we're going to search over all our sentences. 1015 00:34:56,409 --> 00:34:58,660 The only trick is you have a language model, 1016 00:34:58,660 --> 00:35:00,930 so it can generate a huge number of sentences. 1017 00:35:00,930 --> 00:35:03,100 But we're going to see that's OK. 1018 00:35:03,100 --> 00:35:05,200 So we have a language model.
1019 00:35:05,200 --> 00:35:07,820 It's very, very small model by Boris' standards, 1020 00:35:07,820 --> 00:35:10,210 or the standard of NLP. 1021 00:35:10,210 --> 00:35:16,270 We have only four verbs, two adjectives, only four nouns, 1022 00:35:16,270 --> 00:35:17,800 some adverbs, et cetera. 1023 00:35:17,800 --> 00:35:20,380 But the important part is even if we ignore recursion, 1024 00:35:20,380 --> 00:35:22,260 we have a tremendous number of sentences. 1025 00:35:22,260 --> 00:35:24,640 And this model is recursive, so we can really 1026 00:35:24,640 --> 00:35:27,107 generate an infinite number of sentences from it. 1027 00:35:27,107 --> 00:35:28,690 But nonetheless, it turns out that you 1028 00:35:28,690 --> 00:35:30,470 can search the space of sentences very, 1029 00:35:30,470 --> 00:35:33,520 very efficiently and actually find the global optimum. 1030 00:35:33,520 --> 00:35:36,980 And the intuition for why that's true is pretty straightforward. 1031 00:35:36,980 --> 00:35:39,790 You can think of your sentence as a constraint on what you 1032 00:35:39,790 --> 00:35:41,050 can see in the world. 1033 00:35:41,050 --> 00:35:43,870 The longer your sentence, the more constrains you have. 1034 00:35:43,870 --> 00:35:45,710 So the lower the overall score is. 1035 00:35:45,710 --> 00:35:47,740 So every time you add a word, the score 1036 00:35:47,740 --> 00:35:49,870 can't possibly increase, right? 1037 00:35:49,870 --> 00:35:51,370 The score has to always decrease. 1038 00:35:51,370 --> 00:35:53,590 So basically, you have this monotonically-decreasing 1039 00:35:53,590 --> 00:35:56,814 function over a lattice of sentences. 1040 00:35:56,814 --> 00:35:58,480 And if you ignore the fact that you only 1041 00:35:58,480 --> 00:36:00,271 have to search sentences, you can start off 1042 00:36:00,271 --> 00:36:02,940 with individual words, aggregate words together. 1043 00:36:02,940 --> 00:36:04,652 So you look at all one-word phrases. 1044 00:36:04,652 --> 00:36:06,610 You can a two-word phrases, three-word phrases. 1045 00:36:06,610 --> 00:36:09,734 Eventually, get out to real sentences. 1046 00:36:09,734 --> 00:36:11,650 But because this is a monotonically-decreasing 1047 00:36:11,650 --> 00:36:14,590 function, this is a very quick search. 1048 00:36:14,590 --> 00:36:16,930 So you can start off with an empty set. 1049 00:36:16,930 --> 00:36:18,140 You can add a word. 1050 00:36:18,140 --> 00:36:20,080 For example, you can add carried. 1051 00:36:20,080 --> 00:36:23,170 You can look at all the ways that you can extend carried 1052 00:36:23,170 --> 00:36:24,250 with another word or two. 1053 00:36:24,250 --> 00:36:26,260 So you get a phrase like the person carried. 1054 00:36:26,260 --> 00:36:28,000 And you can keep adding words to it 1055 00:36:28,000 --> 00:36:31,090 until you get to the global optimum. 1056 00:36:31,090 --> 00:36:33,270 So given a video like this, where 1057 00:36:33,270 --> 00:36:34,984 you see me doing something, you can 1058 00:36:34,984 --> 00:36:36,400 produce a sentence like the person 1059 00:36:36,400 --> 00:36:38,358 to the right of the bin picked up the backpack. 1060 00:36:41,034 --> 00:36:42,450 And that's pretty straightforward. 1061 00:36:42,450 --> 00:36:45,320 We built a generator in just a few lines of code 1062 00:36:45,320 --> 00:36:47,454 as long as we had our recognition system. 1063 00:36:47,454 --> 00:36:49,370 So you have this problem in question answering 1064 00:36:49,370 --> 00:36:51,790 that you have to connect two sentences with a video. 
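Before the question-answering discussion continues, here is a sketch of the sentence-generation search just described, written as a best-first search that relies on the score only ever decreasing as words are added. The grammar object, its methods, and the score function are placeholders for the real language model and recognizer, not the actual implementation.

```python
import heapq

def generate(video, grammar, score):
    # Max-heap via negated scores; start from the empty partial sentence.
    frontier = [(-score(video, []), [])]
    while frontier:
        neg, partial = heapq.heappop(frontier)
        if grammar.is_complete(partial):
            # Because adding a word can never raise the score, the first
            # complete sentence popped is the global optimum.
            return partial, -neg
        for word in grammar.expansions(partial):
            extended = partial + [word]
            heapq.heappush(frontier, (-score(video, extended), extended))
    return None, float("-inf")

# Question answering reuses the same search: seed `partial` with the question
# template ("the person put ___ on top of the red car") and only allow
# expansions that fill the gap with a noun phrase.
```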
1065 00:36:51,790 --> 00:36:53,210 And instead of doing that, what we're going to do 1066 00:36:53,210 --> 00:36:55,293 is we're going to make some connection between two 1067 00:36:55,293 --> 00:36:56,320 sentences. 1068 00:36:56,320 --> 00:36:57,820 So we're going to take our question. 1069 00:36:57,820 --> 00:37:00,111 We're going to give it to something like Boris' system. 1070 00:37:00,111 --> 00:37:02,390 And it's going to tell us this question, 1071 00:37:02,390 --> 00:37:06,879 like, what did the person put on top of the red car? 1072 00:37:06,879 --> 00:37:09,170 If you wanted to answer it, you would produce an answer 1073 00:37:09,170 --> 00:37:12,980 like, the person put some noun phrase on top of the red car. 1074 00:37:12,980 --> 00:37:15,140 So you can run the generation system exactly 1075 00:37:15,140 --> 00:37:16,190 as was suggested. 1076 00:37:16,190 --> 00:37:17,360 You seed it with this. 1077 00:37:17,360 --> 00:37:18,830 You give it a constraint that what 1078 00:37:18,830 --> 00:37:20,840 it has to produce next inside this empty gap 1079 00:37:20,840 --> 00:37:22,700 is a noun phrase. 1080 00:37:22,700 --> 00:37:24,430 And you're going to get out the answer. 1081 00:37:24,430 --> 00:37:25,846 Another way to think about this is 1082 00:37:25,846 --> 00:37:27,680 you have sort of a partial detector. 1083 00:37:27,680 --> 00:37:30,290 You look inside the video to see where it matches. 1084 00:37:30,290 --> 00:37:32,790 You choose the best region where it matches, 1085 00:37:32,790 --> 00:37:34,549 and then you complete your sentence. 1086 00:37:34,549 --> 00:37:36,090 And you get an answer like the person 1087 00:37:36,090 --> 00:37:39,187 put the pair on top of the red car. 1088 00:37:39,187 --> 00:37:41,270 There's one small problem with question answering, 1089 00:37:41,270 --> 00:37:44,220 and it differs from generation in one way. 1090 00:37:44,220 --> 00:37:45,922 So imagine that we're in a parking lot 1091 00:37:45,922 --> 00:37:48,380 and there are a hundred white cars inside this parking lot. 1092 00:37:48,380 --> 00:37:51,232 And you come to me desperate and you say, I lost my keys. 1093 00:37:51,232 --> 00:37:52,190 And I say, don't worry. 1094 00:37:52,190 --> 00:37:54,110 I know exactly where your keys are. 1095 00:37:54,110 --> 00:37:56,510 And you look at me and I say, they're in the white car. 1096 00:37:56,510 --> 00:37:58,260 And then you think I'm a complete asshole, 1097 00:37:58,260 --> 00:38:00,590 because that was totally worthless information, right? 1098 00:38:00,590 --> 00:38:02,510 I told you something that's basically true. 1099 00:38:02,510 --> 00:38:04,640 It's a parking lot full of white cars, 1100 00:38:04,640 --> 00:38:07,742 but isn't actually giving you anything useful. 1101 00:38:07,742 --> 00:38:09,200 So to handle this-- in the same way 1102 00:38:09,200 --> 00:38:11,660 that in generation, we had this one parameter 1103 00:38:11,660 --> 00:38:14,679 that we could tune to get, more or less, for both sentences. 1104 00:38:14,679 --> 00:38:16,220 We're going to add only one parameter 1105 00:38:16,220 --> 00:38:17,990 to question answering, which is kind 1106 00:38:17,990 --> 00:38:20,600 of a truthfulness parameter. 1107 00:38:20,600 --> 00:38:26,360 Which basically is going to say, this sentence, the person put 1108 00:38:26,360 --> 00:38:29,540 an object on top of the red car in this video, 1109 00:38:29,540 --> 00:38:30,865 is very ambiguous, right? 
1110 00:38:30,865 --> 00:38:32,240 It could either be Danny that did 1111 00:38:32,240 --> 00:38:33,890 it or it could be me that put something 1112 00:38:33,890 --> 00:38:35,054 on top of the red car. 1113 00:38:35,054 --> 00:38:36,470 So what we're going to do is we're 1114 00:38:36,470 --> 00:38:38,269 going to take this candidate's answer. 1115 00:38:38,269 --> 00:38:39,810 We're going to run it over the video. 1116 00:38:39,810 --> 00:38:41,935 And we're going to see how many times it has really 1117 00:38:41,935 --> 00:38:43,220 close matches in the video. 1118 00:38:43,220 --> 00:38:44,720 And depending on this one parameter, 1119 00:38:44,720 --> 00:38:46,136 we're going to say you are allowed 1120 00:38:46,136 --> 00:38:48,290 to say more things about the video 1121 00:38:48,290 --> 00:38:51,350 to become more specific about what you're referring to. 1122 00:38:51,350 --> 00:38:53,960 But potentially, slightly less true because the score 1123 00:38:53,960 --> 00:38:54,591 will be lower. 1124 00:38:54,591 --> 00:38:56,090 In the same way that you were saying 1125 00:38:56,090 --> 00:38:58,190 slightly more in the generation case 1126 00:38:58,190 --> 00:39:00,140 at the risk of saying potentially something 1127 00:39:00,140 --> 00:39:02,450 that's slightly less true. 1128 00:39:02,450 --> 00:39:05,180 So this way, you can ignore the sentence, which is unhelpful. 1129 00:39:05,180 --> 00:39:06,638 And you can end up saying something 1130 00:39:06,638 --> 00:39:09,140 like, the person on the left of the car 1131 00:39:09,140 --> 00:39:12,590 put an object on top of the red car. 1132 00:39:12,590 --> 00:39:14,490 So we can actually do that and the system 1133 00:39:14,490 --> 00:39:15,950 produces that output. 1134 00:39:15,950 --> 00:39:18,054 We built one recognition approach. 1135 00:39:18,054 --> 00:39:19,970 And we did retrieval, generation, and question 1136 00:39:19,970 --> 00:39:20,719 answering with it. 1137 00:39:20,719 --> 00:39:22,767 We can also do disambiguation with it. 1138 00:39:22,767 --> 00:39:24,350 In disambiguation, we take a sentence, 1139 00:39:24,350 --> 00:39:26,757 like Danny approached the chair with a bag. 1140 00:39:26,757 --> 00:39:28,340 And you can imagine that this sentence 1141 00:39:28,340 --> 00:39:30,150 can mean multiple things. 1142 00:39:30,150 --> 00:39:33,770 It could mean Danny was actually carrying a bag 1143 00:39:33,770 --> 00:39:35,780 and approaching a chair. 1144 00:39:35,780 --> 00:39:39,740 Or it could mean there was a bag on a chair 1145 00:39:39,740 --> 00:39:41,312 and Danny was approaching it. 1146 00:39:41,312 --> 00:39:42,770 And there's the question of, how do 1147 00:39:42,770 --> 00:39:45,320 you decide which interpretation for the sentence 1148 00:39:45,320 --> 00:39:46,990 corresponds to which video? 1149 00:39:50,370 --> 00:39:52,200 Basically, you can take your sentences 1150 00:39:52,200 --> 00:39:53,520 and you can look at their parse trees. 1151 00:39:53,520 --> 00:39:55,200 And you're going to see that they're different. 1152 00:39:55,200 --> 00:39:56,520 Essentially, your language system 1153 00:39:56,520 --> 00:39:58,228 is going to give you a slightly different 1154 00:39:58,228 --> 00:40:00,130 internal representation for each of these. 1155 00:40:00,130 --> 00:40:02,610 And we already know that when we build our detectors 1156 00:40:02,610 --> 00:40:05,280 for the sentence, we take these kinds of relationships 1157 00:40:05,280 --> 00:40:06,840 between the words as inputs. 
1158 00:40:06,840 --> 00:40:09,210 So even though there's one sentence in English 1159 00:40:09,210 --> 00:40:11,160 that described both of these scenarios, when 1160 00:40:11,160 --> 00:40:12,510 we build detectors we're going to end up 1161 00:40:12,510 --> 00:40:13,718 with two different detectors. 1162 00:40:13,718 --> 00:40:17,587 One for one meaning, one for the other meaning. 1163 00:40:17,587 --> 00:40:19,170 And then we can just run the detectors 1164 00:40:19,170 --> 00:40:22,080 and figure out which meaning corresponds to which video. 1165 00:40:22,080 --> 00:40:23,772 And indeed, that's what we did. 1166 00:40:23,772 --> 00:40:25,230 Except that there are lots and lots 1167 00:40:25,230 --> 00:40:27,224 of different potential ambiguities. 1168 00:40:27,224 --> 00:40:28,890 There are different kinds of attachment. 1169 00:40:28,890 --> 00:40:29,699 In the same case-- 1170 00:40:29,699 --> 00:40:30,990 I won't go through all of them. 1171 00:40:30,990 --> 00:40:34,230 But for example, you might not know where the bag is. 1172 00:40:34,230 --> 00:40:37,020 You might not know who's performing the action. 1173 00:40:37,020 --> 00:40:39,000 You might not be sure if both people 1174 00:40:39,000 --> 00:40:40,860 are performing the action or only one person 1175 00:40:40,860 --> 00:40:43,230 is performing the action. 1176 00:40:43,230 --> 00:40:46,770 There may be some problems with references. 1177 00:40:46,770 --> 00:40:49,011 So this is a very simple example, 1178 00:40:49,011 --> 00:40:50,760 like Danny picked up the bag in the chair. 1179 00:40:50,760 --> 00:40:51,302 It is yellow. 1180 00:40:51,302 --> 00:40:53,301 But this is the kind of thing that you would see 1181 00:40:53,301 --> 00:40:54,780 if you had a long paragraph. 1182 00:40:54,780 --> 00:40:57,000 You would have some reference later on 1183 00:40:57,000 --> 00:40:58,980 or earlier on to some person. 1184 00:40:58,980 --> 00:41:02,460 And you wouldn't be sure who was the referent. 1185 00:41:02,460 --> 00:41:04,880 And it turns out that if you have sentences like this, 1186 00:41:04,880 --> 00:41:06,630 you can disambiguate them pretty reliably. 1187 00:41:10,080 --> 00:41:13,150 So what's important is it's not just a case of parse trees. 1188 00:41:13,150 --> 00:41:16,110 We need a more interesting internal representation. 1189 00:41:16,110 --> 00:41:18,420 And an example of how we do this is we take a sentence 1190 00:41:18,420 --> 00:41:21,870 and we make some first-order logic formula out of it. 1191 00:41:21,870 --> 00:41:23,190 So you have some variables. 1192 00:41:23,190 --> 00:41:25,540 The chair is something like x. 1193 00:41:25,540 --> 00:41:28,455 You have Danny, who moved it, and I moved it. 1194 00:41:28,455 --> 00:41:30,580 Or in the other case, you have two separate chairs. 1195 00:41:30,580 --> 00:41:32,330 And I moved one and Danny moved the other. 1196 00:41:32,330 --> 00:41:34,440 And they're distinct chairs. 1197 00:41:34,440 --> 00:41:37,020 What we do is we first ignore the people. 1198 00:41:37,020 --> 00:41:38,670 So we just say there are two people. 1199 00:41:38,670 --> 00:41:41,539 And in both cases, we're distinct from each other. 1200 00:41:41,539 --> 00:41:43,830 But we don't have person recognizers, face recognition, 1201 00:41:43,830 --> 00:41:45,420 or anything like that. 1202 00:41:45,420 --> 00:41:48,780 Then for each of these variables, we build a tracker. 1203 00:41:48,780 --> 00:41:51,510 And for every constraint, we have a word model. 
1204 00:41:51,510 --> 00:41:54,600 And essentially, you can go from this first-order logic formula 1205 00:41:54,600 --> 00:41:56,700 to one of our detectors. 1206 00:41:56,700 --> 00:41:58,470 So it's exactly the same thing as the case 1207 00:41:58,470 --> 00:42:00,570 where we had a sentence and a video. 1208 00:42:00,570 --> 00:42:03,809 And we just wanted to see, is the sentence true of the video? 1209 00:42:03,809 --> 00:42:05,850 Except that now we have a sentence interpretation 1210 00:42:05,850 --> 00:42:06,433 and the video. 1211 00:42:09,510 --> 00:42:11,380 So we've seen that if all you have 1212 00:42:11,380 --> 00:42:13,470 are multiple interpretation of a sentence, 1213 00:42:13,470 --> 00:42:17,114 you can figure out which one belongs to which video. 1214 00:42:17,114 --> 00:42:18,780 And we'll come back to this in a moment, 1215 00:42:18,780 --> 00:42:20,440 because it's actually quite useful. 1216 00:42:20,440 --> 00:42:22,380 So you can imagine a scenario where 1217 00:42:22,380 --> 00:42:23,814 you want to talk to a robot. 1218 00:42:23,814 --> 00:42:25,230 And you want to give it a command. 1219 00:42:25,230 --> 00:42:27,210 You don't want to play 20 questions with it, right? 1220 00:42:27,210 --> 00:42:28,260 You want to tell it something. 1221 00:42:28,260 --> 00:42:29,200 It should look at the environment. 1222 00:42:29,200 --> 00:42:31,533 And it should figure out, you're referring to this chair 1223 00:42:31,533 --> 00:42:33,900 and this is what I'm supposed to do. 1224 00:42:33,900 --> 00:42:36,450 So the other reason for disambiguation 1225 00:42:36,450 --> 00:42:39,270 is going to be because you get a lot of ambiguities 1226 00:42:39,270 --> 00:42:41,099 while you're acquiring language. 1227 00:42:41,099 --> 00:42:43,140 So we're going to break down language acquisition 1228 00:42:43,140 --> 00:42:44,920 into two parts. 1229 00:42:44,920 --> 00:42:48,030 One part is we want to learn the meanings of each of our words. 1230 00:42:48,030 --> 00:42:50,910 And another one is we want to learn how we take a sentence 1231 00:42:50,910 --> 00:42:53,640 and we transform it into this internal representation 1232 00:42:53,640 --> 00:42:56,550 that we use to actually build these detectors. 1233 00:42:56,550 --> 00:42:58,980 So if you look at the first one, let's 1234 00:42:58,980 --> 00:43:00,570 say you have a whole bunch of videos. 1235 00:43:00,570 --> 00:43:02,697 And every video comes with a sentence. 1236 00:43:02,697 --> 00:43:04,530 You don't know what the sentence is actually 1237 00:43:04,530 --> 00:43:06,060 referring to in the video. 1238 00:43:06,060 --> 00:43:08,914 When children are born, nobody gives mothers bounding boxes 1239 00:43:08,914 --> 00:43:10,830 and tells them, put this around the Teddy bear 1240 00:43:10,830 --> 00:43:13,120 so your child knows what you're referring to. 1241 00:43:13,120 --> 00:43:14,100 So we don't get those. 1242 00:43:14,100 --> 00:43:16,540 We have this more weakly-supervised system. 1243 00:43:16,540 --> 00:43:18,840 But what's important is we get this data set and there 1244 00:43:18,840 --> 00:43:21,090 are certain correlations in this data set, right? 1245 00:43:21,090 --> 00:43:23,370 We know the chair occurs in some videos. 1246 00:43:23,370 --> 00:43:25,696 We know that backpack occurs in others just 1247 00:43:25,696 --> 00:43:26,820 by looking at the sentence. 1248 00:43:26,820 --> 00:43:28,770 We know pickup occurs in others. 
1249 00:43:28,770 --> 00:43:30,270 So basically, this is the same thing 1250 00:43:30,270 --> 00:43:32,700 as training one, big hidden Markov model. 1251 00:43:32,700 --> 00:43:35,520 Except that now we have multiple hidden Markov models 1252 00:43:35,520 --> 00:43:37,984 that have a small amount of dependency between them. 1253 00:43:37,984 --> 00:43:39,150 And I won't talk about this. 1254 00:43:39,150 --> 00:43:40,300 You'll have to take my word for it. 1255 00:43:40,300 --> 00:43:41,383 You can look at the paper. 1256 00:43:41,383 --> 00:43:45,690 But it's identical to the Baum-Welch algorithm. 1257 00:43:45,690 --> 00:43:48,420 Essentially, all you do is you take the gradient through all 1258 00:43:48,420 --> 00:43:50,160 the parameters of these words and you 1259 00:43:50,160 --> 00:43:52,390 can acquire their meanings. 1260 00:43:52,390 --> 00:43:54,870 There are lots of technical issues with this, 1261 00:43:54,870 --> 00:43:57,696 but that's the general idea. 1262 00:43:57,696 --> 00:43:59,320 So we can also look at learning syntax. 1263 00:43:59,320 --> 00:44:01,111 And this is something that we haven't done, 1264 00:44:01,111 --> 00:44:02,160 but we really want to do. 1265 00:44:02,160 --> 00:44:05,015 And this is where disambiguation work really comes into play. 1266 00:44:05,015 --> 00:44:06,640 So if I give you a sentence, like Danny 1267 00:44:06,640 --> 00:44:09,130 approached the chair with a bag, you feed it into a parser. 1268 00:44:09,130 --> 00:44:11,280 Something like Boris' start system. 1269 00:44:11,280 --> 00:44:14,310 And you get potentially two parse trees, right? 1270 00:44:14,310 --> 00:44:15,780 One for one interpretation and one 1271 00:44:15,780 --> 00:44:17,190 for the other interpretation. 1272 00:44:17,190 --> 00:44:18,889 You take the video and you can select 1273 00:44:18,889 --> 00:44:19,930 one of these parse trees. 1274 00:44:19,930 --> 00:44:21,870 That's the game we just played a moment ago. 1275 00:44:21,870 --> 00:44:25,350 But imagine that we take Boris' system and we brain damage 1276 00:44:25,350 --> 00:44:26,130 it a little bit. 1277 00:44:26,130 --> 00:44:29,130 Or we take some deep network that does parsing 1278 00:44:29,130 --> 00:44:31,220 and we just randomize a few of the parameters. 1279 00:44:31,220 --> 00:44:34,200 So now, rather than getting a single or two parse trees 1280 00:44:34,200 --> 00:44:37,170 for our two interpretations, we get 100 or 1,000 1281 00:44:37,170 --> 00:44:38,310 different parse trees. 1282 00:44:38,310 --> 00:44:39,810 We can take each one of those and we 1283 00:44:39,810 --> 00:44:42,686 can see, how well does this match our video? 1284 00:44:42,686 --> 00:44:44,310 And we get some distribution over them. 1285 00:44:44,310 --> 00:44:46,710 Maybe we won't get a single one that matches the best. 1286 00:44:46,710 --> 00:44:48,570 Maybe we'll get a few that match well 1287 00:44:48,570 --> 00:44:50,950 and a bunch that match really, really poorly. 1288 00:44:50,950 --> 00:44:53,892 So this provides a signal to actually train the parser. 1289 00:44:53,892 --> 00:44:55,350 Essentially, you have a parser that 1290 00:44:55,350 --> 00:44:57,870 produces a distribution over parse trees. 1291 00:44:57,870 --> 00:44:59,550 You use the vision system to decide 1292 00:44:59,550 --> 00:45:01,667 which of these parse trees are better than others. 1293 00:45:01,667 --> 00:45:03,750 And you feed this information back into the parser 1294 00:45:03,750 --> 00:45:05,040 and retrain it. 
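As a sketch of the parser-retraining idea just described (which, as noted, has not been built yet), the vision score can turn a pool of candidate parses into a distribution that serves as a weak training signal. The function names here are placeholders for a parser that proposes many trees, the detector construction described earlier, and the recognition score.

```python
import math

def parse_training_signal(sentence, video, parse_candidates,
                          build_detector, score_against_video):
    parses = parse_candidates(sentence)            # e.g. 100-1,000 candidate trees
    scores = [score_against_video(build_detector(p), video) for p in parses]
    # Softmax the video scores into a distribution over parses.
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    posterior = [w / z for w in weights]
    # These (parse, weight) pairs can be fed back as weighted examples
    # to retrain the parser.
    return list(zip(parses, posterior))
```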
1295 00:45:05,040 --> 00:45:08,437 We haven't done this, but it's in the pipeline. 1296 00:45:08,437 --> 00:45:10,020 And eventually, the idea is that we're 1297 00:45:10,020 --> 00:45:12,060 going to be able to close the loop 1298 00:45:12,060 --> 00:45:13,560 and learn the meanings of the words 1299 00:45:13,560 --> 00:45:15,460 while we end up learning the parser. 1300 00:45:15,460 --> 00:45:18,164 But that's further down the line. 1301 00:45:18,164 --> 00:45:19,830 So lest you think that there's something 1302 00:45:19,830 --> 00:45:22,560 remarkable about language learning in humans, 1303 00:45:22,560 --> 00:45:26,140 actually lots of animals learn language, not just humans. 1304 00:45:26,140 --> 00:45:29,190 And here's a cute example of a dog that does something 1305 00:45:29,190 --> 00:45:30,420 that our system can't do. 1306 00:45:30,420 --> 00:45:33,540 And actually, no language system out there can do. 1307 00:45:33,540 --> 00:45:37,560 So there's this paper, but this is from PBS. 1308 00:45:37,560 --> 00:45:40,050 And what ended up happening is this dog 1309 00:45:40,050 --> 00:45:42,416 knows the meaning of about 1,000 different words 1310 00:45:42,416 --> 00:45:44,040 because there are labels that have been 1311 00:45:44,040 --> 00:45:45,540 attached to different toys. 1312 00:45:45,540 --> 00:45:47,190 So it has 1,000 different toys. 1313 00:45:47,190 --> 00:45:48,570 Each one has a unique name. 1314 00:45:48,570 --> 00:45:52,320 And if you tell the dog, give me Blinky, 1315 00:45:52,320 --> 00:45:54,510 it knows exactly which toy Blinky is. 1316 00:45:54,510 --> 00:45:56,790 And it has 100% accuracy getting you Blinky 1317 00:45:56,790 --> 00:45:59,490 from it's big, big pile of toys. 1318 00:45:59,490 --> 00:46:01,560 So what they did is they took 10 toys. 1319 00:46:01,560 --> 00:46:03,930 They put them behind the sofa. 1320 00:46:03,930 --> 00:46:05,640 And they added one additional toy 1321 00:46:05,640 --> 00:46:07,929 that the dog has never seen before. 1322 00:46:07,929 --> 00:46:09,720 They tested the dog many times to make sure 1323 00:46:09,720 --> 00:46:11,428 that it doesn't have a novelty preference 1324 00:46:11,428 --> 00:46:12,400 or anything like that. 1325 00:46:12,400 --> 00:46:15,000 And then they asked the dog, bring the Blinky. 1326 00:46:15,000 --> 00:46:17,760 And you can see the dog was asked. 1327 00:46:17,760 --> 00:46:18,930 It goes behind. 1328 00:46:18,930 --> 00:46:20,160 It quickly finds Blinky. 1329 00:46:20,160 --> 00:46:21,360 It brings it back. 1330 00:46:21,360 --> 00:46:22,360 And there we go. 1331 00:46:22,360 --> 00:46:26,710 And now, the dog is really happy. 1332 00:46:26,710 --> 00:46:31,020 So now, the dog is going to be asked, bring me this new toy. 1333 00:46:31,020 --> 00:46:34,650 Bring me the professor, or whatever the toy is called. 1334 00:46:34,650 --> 00:46:36,120 It's a little less certain. 1335 00:46:36,120 --> 00:46:36,690 OK. 1336 00:46:36,690 --> 00:46:38,370 So it's going to go behind and it's 1337 00:46:38,370 --> 00:46:40,399 going to look at all the objects. 1338 00:46:40,399 --> 00:46:41,940 The toy with the beard is the new one 1339 00:46:41,940 --> 00:46:43,560 that it hasn't seen before. 1340 00:46:43,560 --> 00:46:45,632 And it was there in the previous trial. 1341 00:46:45,632 --> 00:46:47,590 So it looks around and it's a little uncertain. 1342 00:46:47,590 --> 00:46:50,000 It doesn't quite want to come back. 
1343 00:46:50,000 --> 00:46:52,500 We're going to see that we're going 1344 00:46:52,500 --> 00:46:56,700 to have to give it another instruction in a moment. 1345 00:46:56,700 --> 00:46:58,550 He's going to call it back and ask the dog 1346 00:46:58,550 --> 00:47:00,100 to do exactly the same task again. 1347 00:47:00,100 --> 00:47:01,350 Isn't telling it anything new. 1348 00:47:01,350 --> 00:47:03,766 It's just to give it some encouragement. 1349 00:47:09,730 --> 00:47:13,811 So looking around for some toy. 1350 00:47:13,811 --> 00:47:15,990 And it picks the-- you'll see in a moment. 1351 00:47:19,280 --> 00:47:22,010 It picks the toy that it hasn't seen before, 1352 00:47:22,010 --> 00:47:23,274 because it's a new word. 1353 00:47:23,274 --> 00:47:24,440 And the dog is really happy. 1354 00:47:24,440 --> 00:47:26,773 And I think the human is even happier that this actually 1355 00:47:26,773 --> 00:47:28,010 worked. 1356 00:47:28,010 --> 00:47:30,470 But the important part is, there's this dog 1357 00:47:30,470 --> 00:47:33,620 that we normally don't associate with having a huge amount 1358 00:47:33,620 --> 00:47:34,652 of linguistic ability. 1359 00:47:34,652 --> 00:47:36,110 But it's learning language in a way 1360 00:47:36,110 --> 00:47:39,646 that is far more advanced than anything that we have. 1361 00:47:39,646 --> 00:47:41,270 And it's learning it in a grounded way, 1362 00:47:41,270 --> 00:47:43,850 like it hard to connect its knowledge about what it sees 1363 00:47:43,850 --> 00:47:45,380 with these toys to this new object 1364 00:47:45,380 --> 00:47:48,830 that it's never seen before and understand this new label. 1365 00:47:48,830 --> 00:47:51,600 And dogs are not the only animal that can do this. 1366 00:47:51,600 --> 00:47:54,600 There are many other animals that can do this. 1367 00:47:54,600 --> 00:47:55,324 All right. 1368 00:47:55,324 --> 00:47:56,990 And of course, children do this as well. 1369 00:47:59,760 --> 00:48:01,400 So there was a question about the fact 1370 00:48:01,400 --> 00:48:03,170 that we're constantly using videos here. 1371 00:48:03,170 --> 00:48:04,580 And we're very focused on motion. 1372 00:48:04,580 --> 00:48:06,020 But of course, in many of these sentences, 1373 00:48:06,020 --> 00:48:07,936 we were referring to objects that were static. 1374 00:48:07,936 --> 00:48:10,228 So we're not only sensitive to objects that are moving. 1375 00:48:10,228 --> 00:48:11,769 So for example, when I said something 1376 00:48:11,769 --> 00:48:13,850 like it was the person to the left of the car, 1377 00:48:13,850 --> 00:48:17,390 neither the person nor the car were moving in that question. 1378 00:48:17,390 --> 00:48:19,530 It was the pair that was moving. 1379 00:48:19,530 --> 00:48:21,380 But there's an interesting question, 1380 00:48:21,380 --> 00:48:24,950 what if you want to recognize actions in still images? 1381 00:48:24,950 --> 00:48:26,260 After all, we can do it. 1382 00:48:26,260 --> 00:48:28,730 It probably didn't involve looking at photos. 1383 00:48:28,730 --> 00:48:31,400 You know, 200 million years ago when our visual system 1384 00:48:31,400 --> 00:48:32,790 was being formed. 1385 00:48:32,790 --> 00:48:34,940 So somehow, we take our video ability 1386 00:48:34,940 --> 00:48:37,647 and we apply it to images. 1387 00:48:37,647 --> 00:48:39,980 And the way we're going to do that is by taking an image 1388 00:48:39,980 --> 00:48:41,945 and predicting a video from it. 
1389 00:48:41,945 --> 00:48:43,820 We haven't done this, but we've done the part 1390 00:48:43,820 --> 00:48:46,820 where you can actually get predicting motion 1391 00:48:46,820 --> 00:48:48,129 from single frames. 1392 00:48:48,129 --> 00:48:49,670 So the intuition about why this works 1393 00:48:49,670 --> 00:48:51,770 is, if you look at this image and I ask you, 1394 00:48:51,770 --> 00:48:54,230 how quickly is this baseball moving? 1395 00:48:54,230 --> 00:48:56,965 You can give me an answer. 1396 00:48:56,965 --> 00:48:58,090 AUDIENCE: Not very quickly. 1397 00:48:58,090 --> 00:48:59,090 ANDREI BARBU: Not very quickly. 1398 00:48:59,090 --> 00:48:59,810 Right. 1399 00:48:59,810 --> 00:49:01,690 And if you look at this baseball, 1400 00:49:01,690 --> 00:49:06,034 you can decide that it's moving very quickly, right? 1401 00:49:06,034 --> 00:49:07,450 So the other story in this talk is 1402 00:49:07,450 --> 00:49:08,740 I'm becoming more and more American. 1403 00:49:08,740 --> 00:49:10,490 I started with the Canadian flag and now I 1404 00:49:10,490 --> 00:49:12,070 ended up with baseball. 1405 00:49:12,070 --> 00:49:12,640 All right. 1406 00:49:12,640 --> 00:49:14,512 So you can clearly do this task. 1407 00:49:14,512 --> 00:49:15,970 There is good neuroscience evidence 1408 00:49:15,970 --> 00:49:19,600 that people are doing this fairly regularly. 1409 00:49:19,600 --> 00:49:22,150 Kids can do this, et cetera. 1410 00:49:22,150 --> 00:49:23,230 All right. 1411 00:49:23,230 --> 00:49:25,360 So now, what we did is we went to YouTube 1412 00:49:25,360 --> 00:49:27,370 and we got a whole bunch of videos. 1413 00:49:27,370 --> 00:49:30,010 Videos that contain cars or different kinds of objects. 1414 00:49:30,010 --> 00:49:31,900 We had eight different object classes. 1415 00:49:31,900 --> 00:49:34,390 And we ran a standard optical flow algorithm just 1416 00:49:34,390 --> 00:49:35,380 off the shelf. 1417 00:49:35,380 --> 00:49:38,320 And this gives us an idea of how the motion actually 1418 00:49:38,320 --> 00:49:39,655 happens inside this video. 1419 00:49:39,655 --> 00:49:40,780 Then, we discard the video. 1420 00:49:40,780 --> 00:49:43,030 And we only keep one of the frames. 1421 00:49:43,030 --> 00:49:44,380 And we train a deep network. 1422 00:49:44,380 --> 00:49:47,620 This is the only time deep networks appear in this talk. 1423 00:49:47,620 --> 00:49:51,700 That takes as input the image and predicts the optical flow. 1424 00:49:51,700 --> 00:49:53,290 It looks a lot like an auto-encoder, 1425 00:49:53,290 --> 00:49:56,014 except the input and the output are different from each other. 1426 00:49:56,014 --> 00:49:57,680 And it turns out this works pretty well. 1427 00:49:57,680 --> 00:50:00,250 It has similar performance to actually doing optical flow 1428 00:50:00,250 --> 00:50:02,950 on the video with sort of a crappier, earlier optical flow 1429 00:50:02,950 --> 00:50:04,600 algorithm. 1430 00:50:04,600 --> 00:50:07,564 So up until now are things that we've done. 1431 00:50:07,564 --> 00:50:08,980 At the end I'll talk briefly about 1432 00:50:08,980 --> 00:50:11,000 what we're doing in the future. 1433 00:50:11,000 --> 00:50:13,620 So one thing that you can do is translation. 1434 00:50:13,620 --> 00:50:16,210 And you can cast translation as a visual language task, 1435 00:50:16,210 --> 00:50:19,220 even though it sounds like it has nothing to do with vision. 
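Here is a minimal sketch of the single-frame motion-prediction idea mentioned a moment ago: an encoder-decoder network that maps one RGB frame to a two-channel flow field, trained against the flow that an off-the-shelf optical-flow algorithm computed on the original video. The specific layer sizes are assumptions for the illustration, not the architecture used in the work.

```python
import torch
import torch.nn as nn

class Frame2Flow(nn.Module):
    def __init__(self):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 2, 4, stride=2, padding=1),  # (dx, dy) per pixel
        )

    def forward(self, frame):                     # frame: (N, 3, H, W)
        return self.decode(self.encode(frame))    # flow:  (N, 2, H, W)

# Training outline: the video is used only to compute flow targets, then
# discarded; minimize an L2 loss between predicted and computed flow, e.g.
#   loss = ((model(frame) - flow_target) ** 2).mean()
```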
1436 00:50:19,220 --> 00:50:22,240 So if I give you a sentence in Chinese, 1437 00:50:22,240 --> 00:50:25,215 you can imagine scenarios for that sentence, 1438 00:50:25,215 --> 00:50:27,340 and then try to describe them with another language 1439 00:50:27,340 --> 00:50:28,596 that you know. 1440 00:50:28,596 --> 00:50:30,970 This is very different from the way people do translation 1441 00:50:30,970 --> 00:50:31,545 right now. 1442 00:50:31,545 --> 00:50:32,920 So right now, the way it works is 1443 00:50:32,920 --> 00:50:34,789 you have a sentence, like Sam was happy. 1444 00:50:34,789 --> 00:50:36,080 And you have a parallel corpus. 1445 00:50:36,080 --> 00:50:38,288 If you want to translate into French, you go off you. 1446 00:50:38,288 --> 00:50:41,920 Get the Hansard corpus and you get a whole bunch 1447 00:50:41,920 --> 00:50:43,780 of French and English sentences that 1448 00:50:43,780 --> 00:50:46,280 are aligned with each other and you learn the correspondence 1449 00:50:46,280 --> 00:50:46,960 between them. 1450 00:50:46,960 --> 00:50:50,386 Here, I translated into [AUDIO OUT] Russian. 1451 00:50:50,386 --> 00:50:51,760 The important part is in English, 1452 00:50:51,760 --> 00:50:54,130 there's no assumption about the gender of Sam. 1453 00:50:54,130 --> 00:50:56,890 Sam is both a male name and a female name. 1454 00:50:56,890 --> 00:51:01,630 But the problem is Romanian, Russian, French, et cetera, 1455 00:51:01,630 --> 00:51:04,300 they really force you to specify the gender of the people that 1456 00:51:04,300 --> 00:51:05,620 are involved in these actions. 1457 00:51:05,620 --> 00:51:07,411 And you have to go through a certain amount 1458 00:51:07,411 --> 00:51:10,580 of [AUDIO OUT] really want to avoid specifying their gender. 1459 00:51:10,580 --> 00:51:13,090 So here, we specify the gender as male. 1460 00:51:13,090 --> 00:51:15,220 Here, we specify the gender as female. 1461 00:51:15,220 --> 00:51:17,800 And if all you have is statistical machine translation 1462 00:51:17,800 --> 00:51:20,351 system, you may get an arbitrary one of these two. 1463 00:51:20,351 --> 00:51:21,850 And you may not know that you've got 1464 00:51:21,850 --> 00:51:22,950 an arbitrary one of these two. 1465 00:51:22,950 --> 00:51:25,074 And there may be a terrible faux pas at some point. 1466 00:51:27,730 --> 00:51:31,210 So this problem is not restricted to gender. 1467 00:51:31,210 --> 00:51:33,122 And it occurs all the time. 1468 00:51:33,122 --> 00:51:35,080 For example, in Thai, you specify your siblings 1469 00:51:35,080 --> 00:51:37,040 by age, not by their gender. 1470 00:51:37,040 --> 00:51:39,550 So if you have an English sentence like my brother 1471 00:51:39,550 --> 00:51:42,266 did x, translating that is quite difficult. 1472 00:51:42,266 --> 00:51:44,890 In English, you specify relative time through the tense system, 1473 00:51:44,890 --> 00:51:47,872 but Mandarin doesn't have the same kind of tense system. 1474 00:51:47,872 --> 00:51:49,330 In this language that I never tried 1475 00:51:49,330 --> 00:51:52,930 to pronounce after the first time that I tried, 1476 00:51:52,930 --> 00:51:54,640 you don't use relative direction. 1477 00:51:54,640 --> 00:51:57,370 So you don't say the bottle to the left of the laptop. 1478 00:51:57,370 --> 00:51:59,050 We all agree on a common reference 1479 00:51:59,050 --> 00:52:01,480 frame like a hill or something. 1480 00:52:01,480 --> 00:52:03,790 Or we agree on cardinal directions. 
1481 00:52:03,790 --> 00:52:06,280 And you say, the bottle to the north or something. 1482 00:52:06,280 --> 00:52:08,450 And these people are really, really good 1483 00:52:08,450 --> 00:52:09,991 at wayfinding because they constantly 1484 00:52:09,991 --> 00:52:11,170 have to know where north is. 1485 00:52:11,170 --> 00:52:13,090 Many languages don't distinguish blue and green. 1486 00:52:13,090 --> 00:52:14,590 Historically, this is not something 1487 00:52:14,590 --> 00:52:17,110 that languages have done. 1488 00:52:17,110 --> 00:52:18,910 It's pretty new. 1489 00:52:18,910 --> 00:52:22,420 For example, Japanese didn't until a hundred years ago. 1490 00:52:22,420 --> 00:52:24,790 They only started distinguishing the two fairly recently 1491 00:52:24,790 --> 00:52:26,832 when they started interacting with the West more. 1492 00:52:26,832 --> 00:52:28,581 And many languages don't set that boundary 1493 00:52:28,581 --> 00:52:29,870 at exactly the same place. 1494 00:52:29,870 --> 00:52:31,286 So one language, you may say blue. 1495 00:52:31,286 --> 00:52:33,850 In another language, you may have to say green. 1496 00:52:33,850 --> 00:52:36,550 In Swahili, you specify the color of everything 1497 00:52:36,550 --> 00:52:37,900 as the color of x. 1498 00:52:37,900 --> 00:52:39,670 So like in English, we have orange. 1499 00:52:39,670 --> 00:52:41,375 But in Swahili, I could say the color 1500 00:52:41,375 --> 00:52:43,090 of the back of my cell phone. 1501 00:52:43,090 --> 00:52:46,600 And I expect you to know that's blue as long as you can see it. 1502 00:52:46,600 --> 00:52:50,140 In Turkish, there is a relatively complicated 1503 00:52:50,140 --> 00:52:51,490 evidentiality system. 1504 00:52:51,490 --> 00:52:54,670 So you have to fairly often tell me why you know something. 1505 00:52:54,670 --> 00:52:56,980 So if you saw somebody do something, 1506 00:52:56,980 --> 00:52:59,320 you have to mark that in the sentence as opposed 1507 00:52:59,320 --> 00:53:01,880 to hearing it from someone else. 1508 00:53:01,880 --> 00:53:04,210 So if it's hearsay, you have to let me know. 1509 00:53:04,210 --> 00:53:06,430 There are much more complicated evidentiality systems 1510 00:53:06,430 --> 00:53:08,888 where you have to tell me, did you hear it, did you see it, 1511 00:53:08,888 --> 00:53:09,960 did you feel it? 1512 00:53:09,960 --> 00:53:11,200 It can get pretty hairy. 1513 00:53:11,200 --> 00:53:13,600 So there are a lot of reasons why just 1514 00:53:13,600 --> 00:53:15,580 doing the straightforward sentence alignment 1515 00:53:15,580 --> 00:53:17,110 can really fail on you. 1516 00:53:17,110 --> 00:53:19,180 And you can make some pretty terrible mistakes. 1517 00:53:19,180 --> 00:53:21,040 And more importantly, you just won't know 1518 00:53:21,040 --> 00:53:22,870 that that made these mistakes. 1519 00:53:22,870 --> 00:53:24,370 So instead, what we've been thinking 1520 00:53:24,370 --> 00:53:26,720 is sort of translation by imagination. 1521 00:53:26,720 --> 00:53:28,600 So you take a sentence. 1522 00:53:28,600 --> 00:53:30,610 And it's a generative model that we have that 1523 00:53:30,610 --> 00:53:32,230 connects sentences and videos. 1524 00:53:32,230 --> 00:53:33,560 And what you do is you sample. 1525 00:53:33,560 --> 00:53:35,160 You sample a whole bunch of videos. 1526 00:53:35,160 --> 00:53:37,420 So basically, you imagine what scenarios the sentence 1527 00:53:37,420 --> 00:53:38,770 could be true of. 1528 00:53:38,770 --> 00:53:42,370 You get your collection from the generator. 
1529 00:53:42,370 --> 00:53:45,850 You search over sentences that describe these videos 1530 00:53:45,850 --> 00:53:47,770 and you output a sentence that describes them 1531 00:53:47,770 --> 00:53:49,120 well in aggregate. 1532 00:53:49,120 --> 00:53:52,300 So basically, you just combine your ability 1533 00:53:52,300 --> 00:53:55,660 to sample, which comes from your recognizer, 1534 00:53:55,660 --> 00:53:57,070 and your ability to generate. 1535 00:53:57,070 --> 00:53:58,850 And you get a translation system. 1536 00:53:58,850 --> 00:54:01,300 So you do a language-to-language task 1537 00:54:01,300 --> 00:54:04,270 mediated by your understanding of the real world. 1538 00:54:04,270 --> 00:54:06,460 Something else that you can do is planning, which 1539 00:54:06,460 --> 00:54:08,770 I'll just say two words about. 1540 00:54:08,770 --> 00:54:11,055 All you do is-- 1541 00:54:11,055 --> 00:54:11,680 [PHONE RINGING] 1542 00:54:11,680 --> 00:54:14,150 --in a planning task, what you have is 1543 00:54:14,150 --> 00:54:16,280 you have a planning language. 1544 00:54:16,280 --> 00:54:18,741 I'm glad that wasn't my cellphone. 1545 00:54:18,741 --> 00:54:20,240 You have a planning language, right? 1546 00:54:20,240 --> 00:54:22,294 So you have a fairly constrained vocabulary 1547 00:54:22,294 --> 00:54:23,960 that you can use to describe your plans. 1548 00:54:23,960 --> 00:54:26,570 And this allows you to have efficient inference. 1549 00:54:26,570 --> 00:54:29,810 Instead, you can imagine that I have two frames of a video, 1550 00:54:29,810 --> 00:54:33,410 real or imagined, where I have the first world. 1551 00:54:33,410 --> 00:54:35,000 I am far away from the microphone. 1552 00:54:35,000 --> 00:54:37,520 I have the last world where I'm near the microphone. 1553 00:54:37,520 --> 00:54:40,580 And I have an unobserved video between the two. 1554 00:54:40,580 --> 00:54:42,260 People have work and I've done some work 1555 00:54:42,260 --> 00:54:44,720 on filling in partially observed videos. 1556 00:54:44,720 --> 00:54:47,090 So it's a very similar idea, except that here we have 1557 00:54:47,090 --> 00:54:48,474 a partially-observed video. 1558 00:54:48,474 --> 00:54:50,390 And we know that this partially-observed video 1559 00:54:50,390 --> 00:54:52,829 should be described by one or more sentences. 1560 00:54:52,829 --> 00:54:55,370 So we're going to do the same kind of sampling process, where 1561 00:54:55,370 --> 00:54:57,380 we sample from this partially-observed video 1562 00:54:57,380 --> 00:54:59,440 and we try to describe what the sentence is. 1563 00:54:59,440 --> 00:55:00,740 And now you're doing planning. 1564 00:55:00,740 --> 00:55:02,198 You're coming up with a description 1565 00:55:02,198 --> 00:55:04,917 of what had happened in this missing chunk of the video. 1566 00:55:04,917 --> 00:55:06,500 But your planning language is English, 1567 00:55:06,500 --> 00:55:07,820 so you get to take advantage of things 1568 00:55:07,820 --> 00:55:09,620 like ambiguity, which you couldn't take 1569 00:55:09,620 --> 00:55:13,970 advantage of in many languages. 1570 00:55:13,970 --> 00:55:15,070 Theory of mind. 1571 00:55:15,070 --> 00:55:18,000 The idea here is relatively straightforward as well. 1572 00:55:18,000 --> 00:55:20,052 So what we have right now, basically 1573 00:55:20,052 --> 00:55:21,260 are two hidden Markov models. 1574 00:55:21,260 --> 00:55:23,240 Or two kinds of hidden Markov models, right? 1575 00:55:23,240 --> 00:55:24,050 There's a video. 
1570 00:55:13,970 --> 00:55:15,070 Theory of mind. 1571 00:55:15,070 --> 00:55:18,000 The idea here is relatively straightforward as well. 1572 00:55:18,000 --> 00:55:20,052 So what we have right now, basically, 1573 00:55:20,052 --> 00:55:21,260 are two hidden Markov models. 1574 00:55:21,260 --> 00:55:23,240 Or two kinds of hidden Markov models, right? 1575 00:55:23,240 --> 00:55:24,050 There's a video. 1576 00:55:24,050 --> 00:55:26,300 We have some hidden Markov models that are tracks 1577 00:55:26,300 --> 00:55:29,030 and we have some hidden Markov models that look at the tracks 1578 00:55:29,030 --> 00:55:30,654 and they do some inference about what's 1579 00:55:30,654 --> 00:55:32,780 going on with the events in these videos. 1580 00:55:32,780 --> 00:55:34,630 So now imagine that I had a third kind. 1581 00:55:34,630 --> 00:55:36,546 A third kind of hidden Markov model that 1582 00:55:36,546 --> 00:55:37,670 only looks at the trackers. 1583 00:55:37,670 --> 00:55:39,402 Doesn't look at the words directly. 1584 00:55:39,402 --> 00:55:41,360 And what it does is it makes another assumption 1585 00:55:41,360 --> 00:55:42,170 about the videos. 1586 00:55:42,170 --> 00:55:45,410 So first, we assumed the objects were moving in a coherent way. 1587 00:55:45,410 --> 00:55:47,300 Then, we assumed that the objects 1588 00:55:47,300 --> 00:55:49,760 were moving according to the dynamics of some hidden Markov 1589 00:55:49,760 --> 00:55:50,450 models. 1590 00:55:50,450 --> 00:55:52,220 Now, we're going to assume that people 1591 00:55:52,220 --> 00:55:54,200 move according to some dynamics of what's 1592 00:55:54,200 --> 00:55:55,860 going on inside our heads. 1593 00:55:55,860 --> 00:55:59,180 So you can assume that I have a planner inside my head 1594 00:55:59,180 --> 00:56:01,130 that tells me what I want to do and what 1595 00:56:01,130 --> 00:56:03,710 I should do in the future to accomplish my goals. 1596 00:56:03,710 --> 00:56:07,280 And you can look at a sequence of my actions and try to infer: 1597 00:56:07,280 --> 00:56:09,800 if you believe this planner is running in my head, 1598 00:56:09,800 --> 00:56:11,820 what do you think I should do next? 1599 00:56:11,820 --> 00:56:14,079 Now, the nice part about many of these planners 1600 00:56:14,079 --> 00:56:16,370 is that they look a lot like these hidden Markov models. 1601 00:56:16,370 --> 00:56:19,310 And the inference algorithms look a lot like these models. 1602 00:56:19,310 --> 00:56:21,560 So basically, you can do the same kind of trick 1603 00:56:21,560 --> 00:56:23,930 by assuming that HMM-like things are 1604 00:56:23,930 --> 00:56:25,620 going on inside people's heads. 1605 00:56:25,620 --> 00:56:27,370 So you can do things like predict actions, 1606 00:56:27,370 --> 00:56:29,720 figure out what people want to do in the future, what 1607 00:56:29,720 --> 00:56:30,594 they did in the past. 1608 00:56:33,370 --> 00:56:35,024 That's what the project is.
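Here is a minimal sketch of that theory-of-mind trick, with toy numbers of my own rather than anything taken from the actual system: treat the planner assumed to be running in someone's head as an HMM over hidden plan steps, run the standard forward algorithm on the actions observed so far, and push the resulting belief one step forward to predict what they will probably do next.

import numpy as np

states = ["walk_to_table", "grasp", "carry_away"]  # hidden plan steps (toy)
actions = ["step", "reach", "lift", "turn"]        # observable actions (toy)

T = np.array([[0.6, 0.4, 0.0],   # transition: plan steps mostly move forward
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
E = np.array([[0.8, 0.1, 0.05, 0.05],  # emission: which actions each plan step produces
              [0.1, 0.6, 0.2, 0.1],
              [0.1, 0.1, 0.4, 0.4]])
pi = np.array([1.0, 0.0, 0.0])         # plans start at the first step

def forward(observed):
    """Belief over hidden plan steps after the observed action sequence."""
    belief = pi * E[:, actions.index(observed[0])]
    belief /= belief.sum()
    for a in observed[1:]:
        belief = (belief @ T) * E[:, actions.index(a)]
        belief /= belief.sum()
    return belief

def predict_next_action(observed):
    """Most likely next action, assuming the HMM-like planner keeps running."""
    next_step_belief = forward(observed) @ T
    action_probs = next_step_belief @ E
    return actions[int(np.argmax(action_probs))]

print(predict_next_action(["step", "step", "reach"]))  # -> "reach" with these toy numbers

The same forward pass over a full sequence is also what would let you ask what someone was trying to do in the past, not just what they will do next.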
1609 00:56:35,024 --> 00:56:37,440 I want to show you another example of vision and language, 1610 00:56:37,440 --> 00:56:39,900 but in a totally different domain that I won't talk about, 1611 00:56:39,900 --> 00:56:41,670 which is in the case of robots. 1612 00:56:41,670 --> 00:56:45,710 This is something that we built several years ago. 1613 00:56:45,710 --> 00:56:47,650 This is a robot that looks at a 3D structure. 1614 00:56:47,650 --> 00:56:48,860 It's built out of Lincoln Logs. 1615 00:56:48,860 --> 00:56:49,360 They're big. 1616 00:56:49,360 --> 00:56:51,610 They're easier for the robot to manipulate than LEGOs. 1617 00:56:51,610 --> 00:56:53,026 The downside is they're all brown, 1618 00:56:53,026 --> 00:56:54,880 so it's very difficult to do vision on this. 1619 00:56:54,880 --> 00:56:56,560 But it actually will, in a moment, 1620 00:56:56,560 --> 00:56:58,840 reconstruct the 3D structure of what it sees. 1621 00:56:58,840 --> 00:57:01,270 And we annotated in red what errors it made. 1622 00:57:01,270 --> 00:57:03,190 We didn't tell it this. 1623 00:57:03,190 --> 00:57:05,650 What it does is it measures its own confidence 1624 00:57:05,650 --> 00:57:08,200 and it figures out what parts are occluded. 1625 00:57:08,200 --> 00:57:10,210 So it has too little information. 1626 00:57:10,210 --> 00:57:11,970 And it plans another view. 1627 00:57:11,970 --> 00:57:16,060 It goes and acquires it by measuring its own confidence. 1628 00:57:16,060 --> 00:57:18,370 This view is actually worse than the previous view, 1629 00:57:18,370 --> 00:57:19,610 but it's complementary. 1630 00:57:19,610 --> 00:57:22,300 So it will actually gain the information that it's missing. 1631 00:57:22,300 --> 00:57:24,970 And all of this comes from the same kind of generative model 1632 00:57:24,970 --> 00:57:26,920 trick that I showed you a moment ago. 1633 00:57:26,920 --> 00:57:29,710 A similar model, it just makes different assumptions 1634 00:57:29,710 --> 00:57:31,671 about what's built into it. 1635 00:57:31,671 --> 00:57:33,670 So now, because we have a nice generative model, 1636 00:57:33,670 --> 00:57:35,800 we can integrate the two views together. 1637 00:57:35,800 --> 00:57:38,529 You're going to see it in a moment. 1638 00:57:38,529 --> 00:57:39,820 It'll still make some mistakes. 1639 00:57:39,820 --> 00:57:42,444 It won't be completely confident because there are some regions 1640 00:57:42,444 --> 00:57:44,830 that it can't see, even from both views. 1641 00:57:44,830 --> 00:57:46,690 And then what we told it is, OK, fine. 1642 00:57:46,690 --> 00:57:49,510 For now, ignore the second view, take just the first view. 1643 00:57:49,510 --> 00:57:50,570 Here's a sentence. 1644 00:57:50,570 --> 00:57:52,450 Or in this case, a sentence fragment. 1645 00:57:52,450 --> 00:57:54,100 The fragment is something like, there's 1646 00:57:54,100 --> 00:57:56,620 a window to the left of and perpendicular to this door. 1647 00:57:56,620 --> 00:57:58,180 It'll just appear in a moment. 1648 00:57:58,180 --> 00:58:00,280 And integrating this one view that it 1649 00:58:00,280 --> 00:58:04,630 saw, which it was uncertain about, with that one sentence, which is 1650 00:58:04,630 --> 00:58:07,420 also very generic and applies to many structures, 1651 00:58:07,420 --> 00:58:09,070 it determined that these two completely 1652 00:58:09,070 --> 00:58:10,780 disambiguate the structure. 1653 00:58:10,780 --> 00:58:13,061 And now, it's perfectly confident in what's going on. 1654 00:58:13,061 --> 00:58:14,560 And it can go and it can disassemble 1655 00:58:14,560 --> 00:58:15,670 the structure for you. 1656 00:58:15,670 --> 00:58:17,590 And we can play this game in many directions. 1657 00:58:17,590 --> 00:58:20,170 We can have the robot describe structures to us. 1658 00:58:20,170 --> 00:58:22,900 We can give it a description and it can build the structure. 1659 00:58:22,900 --> 00:58:25,090 One robot can describe the structure in English 1660 00:58:25,090 --> 00:58:27,780 to another robot who can build it for it. 1661 00:58:27,780 --> 00:58:30,640 And it's exactly the same kind of idea. 1662 00:58:30,640 --> 00:58:33,580 You connect your vision and your language model 1663 00:58:33,580 --> 00:58:35,620 to something in the real world, and then you 1664 00:58:35,620 --> 00:58:37,120 can play many, many different tricks 1665 00:58:37,120 --> 00:58:41,020 with one internal representation without modifying it at all.
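And here is a toy sketch, in the spirit of that Lincoln Log demo, of how one generic sentence fragment can finish disambiguating an uncertain visual estimate; the candidate structures, their probabilities, and the constraint test are all invented for illustration, not the robot's actual representation. Vision leaves a posterior over a few structures it cannot tell apart from one view, and conditioning on the sentence removes the inconsistent ones, here leaving a single confident answer.

# Candidate structures a single view could not tell apart, with a toy visual posterior.
candidates = [
    ({"window": (0, 1), "door": (2, 1), "window_perpendicular_to_door": True}, 0.45),
    ({"window": (3, 1), "door": (2, 1), "window_perpendicular_to_door": True}, 0.40),
    ({"window": (0, 1), "door": (2, 1), "window_perpendicular_to_door": False}, 0.15),
]

def satisfies(structure):
    """'There's a window to the left of and perpendicular to this door.'"""
    window_x, door_x = structure["window"][0], structure["door"][0]
    return window_x < door_x and structure["window_perpendicular_to_door"]

def condition_on_sentence(candidates):
    """Drop structures inconsistent with the sentence, then renormalize."""
    consistent = [(s, p) for s, p in candidates if satisfies(s)]
    total = sum(p for _, p in consistent)
    return [(s, p / total) for s, p in consistent]

for structure, prob in condition_on_sentence(candidates):
    print(prob, structure)  # one structure left, with probability 1.0

Because the sentence is treated as just another observation of the same underlying structure, the same machinery can run in every direction the talk mentions: describing a structure, building one from a description, or one robot instructing another.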
1666 00:58:41,020 --> 00:58:44,080 But I realized yesterday that I was the last speaker 1667 00:58:44,080 --> 00:58:45,580 before the weekend, so I want to end 1668 00:58:45,580 --> 00:58:48,100 by leaving you as depressed as I possibly can, 1669 00:58:48,100 --> 00:58:51,040 and tell you all the wonderful things that don't work. 1670 00:58:51,040 --> 00:58:53,990 And how far away we are from understanding anything. 1671 00:58:53,990 --> 00:58:55,840 So first of all, we can't generate 1672 00:58:55,840 --> 00:58:58,570 the kind of coherent stories that Patrick looks at. 1673 00:58:58,570 --> 00:59:00,880 Really, if you look at a long video, what we can do 1674 00:59:00,880 --> 00:59:03,014 is we can search or we can describe small events. 1675 00:59:03,014 --> 00:59:04,180 A person picks something up. 1676 00:59:04,180 --> 00:59:05,020 They put it down. 1677 00:59:05,020 --> 00:59:08,530 What we can't say is, the thief entered the room 1678 00:59:08,530 --> 00:59:10,665 and rummaged around and ran away with the gold. 1679 00:59:10,665 --> 00:59:12,790 That's the kind of thing that you want to generate. 1680 00:59:12,790 --> 00:59:14,540 It's the kind of thing that kids generate, 1681 00:59:14,540 --> 00:59:15,700 but we're not there yet. 1682 00:59:15,700 --> 00:59:17,180 Not even close. 1683 00:59:17,180 --> 00:59:18,640 We also only reason in 2D. 1684 00:59:18,640 --> 00:59:20,380 There's no 3D reasoning here. 1685 00:59:20,380 --> 00:59:21,880 And that significantly hurts us. 1686 00:59:21,880 --> 00:59:24,610 Although, we have some ideas for how we might do 3D. 1687 00:59:24,610 --> 00:59:27,640 Another important aspect is we don't know forces and contact 1688 00:59:27,640 --> 00:59:28,450 relationships. 1689 00:59:28,450 --> 00:59:29,920 Now, that's fine as long as pick up 1690 00:59:29,920 --> 00:59:32,069 means this kind of action where you 1691 00:59:32,069 --> 00:59:34,360 see me standing next to an object and the object moving 1692 00:59:34,360 --> 00:59:34,860 up. 1693 00:59:34,860 --> 00:59:37,257 But sometimes, pick up means something totally different. 1694 00:59:37,257 --> 00:59:39,340 So you're going to see this cat is going to pick up 1695 00:59:39,340 --> 00:59:42,277 that kitten in just a moment. 1696 00:59:42,277 --> 00:59:44,110 And you're going to see, if you pay attention 1697 00:59:44,110 --> 00:59:46,600 to the motion of the cat, that it doesn't look 1698 00:59:46,600 --> 00:59:48,901 like it's picking something up. 1699 00:59:48,901 --> 00:59:51,150 It's not very good at picking up the kitten, mind you. 1700 00:59:51,150 --> 00:59:52,630 I think this may be its first try. 1701 00:59:55,619 --> 00:59:56,910 I think it's having a good day. 1702 00:59:56,910 --> 00:59:58,330 It's OK. 1703 00:59:58,330 --> 00:59:59,825 Struggling a little bit. 1704 00:59:59,825 --> 01:00:01,480 But see? 1705 01:00:01,480 --> 01:00:03,040 So definitely, picked it up. 1706 01:00:03,040 --> 01:00:05,380 But it didn't look anything like any of the other pick-up 1707 01:00:05,380 --> 01:00:06,700 examples I showed you. 1708 01:00:06,700 --> 01:00:09,040 But conceptually, you should totally 1709 01:00:09,040 --> 01:00:11,260 recognize this if you've seen those other examples. 1710 01:00:11,260 --> 01:00:13,120 And kids can do this. 1711 01:00:13,120 --> 01:00:14,950 So the important part is you have 1712 01:00:14,950 --> 01:00:16,630 to change how you reason. You can't just 1713 01:00:16,630 --> 01:00:18,924 reason about the relative motions of the objects.
1714 01:00:18,924 --> 01:00:21,340 You have to assume that there are some hidden forces going 1715 01:00:21,340 --> 01:00:21,790 on. 1716 01:00:21,790 --> 01:00:23,110 And you have to reason about the contact 1717 01:00:23,110 --> 01:00:25,151 relationships and the forces that the objects are 1718 01:00:25,151 --> 01:00:26,956 undergoing. 1719 01:00:26,956 --> 01:00:29,469 What happens if you try to recognize a helicopter picking 1720 01:00:29,469 --> 01:00:30,010 something up? 1721 01:00:30,010 --> 01:00:32,051 It looks totally different from a human doing it, 1722 01:00:32,051 --> 01:00:35,380 but no one has any problems recognizing this. 1723 01:00:35,380 --> 01:00:36,880 Segmentation is also a huge problem. 1724 01:00:36,880 --> 01:00:38,379 For many of these problems, you have 1725 01:00:38,379 --> 01:00:40,740 to pay attention to the fine boundaries of the objects 1726 01:00:40,740 --> 01:00:43,115 in order to understand that that kitten was being rotated 1727 01:00:43,115 --> 01:00:45,442 and then slightly lifted. 1728 01:00:45,442 --> 01:00:47,150 There's also a more philosophical problem 1729 01:00:47,150 --> 01:00:48,733 about what is a part and what it means 1730 01:00:48,733 --> 01:00:50,470 for something to be an object. 1731 01:00:50,470 --> 01:00:52,750 We arbitrarily say that the cat is an object, 1732 01:00:52,750 --> 01:00:54,370 but I could refer to its paws. 1733 01:00:54,370 --> 01:00:55,510 I could refer to its ears. 1734 01:00:55,510 --> 01:00:57,580 I could refer to one small patch on its back. 1735 01:00:57,580 --> 01:00:59,579 As long as we all know what we're talking about, 1736 01:00:59,579 --> 01:01:00,790 that can be our object. 1737 01:01:00,790 --> 01:01:03,130 And that's a problem throughout computer vision. 1738 01:01:03,130 --> 01:01:05,500 It also occurs in a totally different problem. 1739 01:01:05,500 --> 01:01:07,240 So if you've ever seen Bongard problems, 1740 01:01:07,240 --> 01:01:09,550 these are problems where you have these weird patches, 1741 01:01:09,550 --> 01:01:11,730 and you have to figure out what's in common between them. 1742 01:01:11,730 --> 01:01:12,610 And that's a case where you have 1743 01:01:12,610 --> 01:01:14,290 to dig deep into your visual system 1744 01:01:14,290 --> 01:01:17,060 to extract a completely different kind of information. 1745 01:01:17,060 --> 01:01:18,855 And this is an example that I prefer. 1746 01:01:18,855 --> 01:01:22,780 So in this task, you can try to find the real dog. 1747 01:01:22,780 --> 01:01:25,600 And we can all spot it after looking for a little while. 1748 01:01:25,600 --> 01:01:26,300 Right? 1749 01:01:26,300 --> 01:01:28,350 Does everyone see it? 1750 01:01:28,350 --> 01:01:28,940 OK. 1751 01:01:28,940 --> 01:01:30,489 So you can all see it. 1752 01:01:30,489 --> 01:01:31,530 The interesting part is-- 1753 01:01:31,530 --> 01:01:34,030 I mean, I doubt you have ever had training 1754 01:01:34,030 --> 01:01:38,550 detecting real dogs amongst masses of fake dogs. 1755 01:01:38,550 --> 01:01:41,410 But somehow, you were able to adapt and extract 1756 01:01:41,410 --> 01:01:43,260 a completely different kind of information 1757 01:01:43,260 --> 01:01:44,310 from your visual system. 1758 01:01:44,310 --> 01:01:46,560 Information that isn't captured by our feature vector, 1759 01:01:46,560 --> 01:01:50,580 which, as I talked about, contains the color, location, velocity, et cetera.
1760 01:01:50,580 --> 01:01:52,710 So you have this ability to extract out 1761 01:01:52,710 --> 01:01:54,870 task-specific information. 1762 01:01:54,870 --> 01:01:56,580 You can do things like theory of mind, 1763 01:01:56,580 --> 01:01:58,050 but you can do far more than assume 1764 01:01:58,050 --> 01:01:59,340 people are running a planner. 1765 01:01:59,340 --> 01:02:00,510 You can detect if I'm sad. 1766 01:02:00,510 --> 01:02:02,490 If I'm happy. 1767 01:02:02,490 --> 01:02:04,590 You can reason about whether two people 1768 01:02:04,590 --> 01:02:06,600 are having a particular kind of interaction. 1769 01:02:06,600 --> 01:02:08,894 Who's more powerful than the other person. 1770 01:02:08,894 --> 01:02:11,310 You also have a very strong physics model inside your head 1771 01:02:11,310 --> 01:02:13,110 that underlies much of this. 1772 01:02:13,110 --> 01:02:15,930 And even more than that, there's the concept of modification. 1773 01:02:15,930 --> 01:02:19,500 So walking quickly looks very different from running quickly. 1774 01:02:19,500 --> 01:02:21,720 And the way you model these is quite complicated. 1775 01:02:21,720 --> 01:02:24,490 And the system that I presented doesn't do a good job of it. 1776 01:02:24,490 --> 01:02:27,900 But one of my favorite examples from my childhood 1777 01:02:27,900 --> 01:02:31,350 long ago is this one, which is a kind of modification. 1778 01:02:31,350 --> 01:02:34,687 So the Coyote is going to draw this. 1779 01:02:34,687 --> 01:02:36,270 You're going to see the Roadrunner try 1780 01:02:36,270 --> 01:02:37,750 to run through it. 1781 01:02:37,750 --> 01:02:39,010 And he makes it. 1782 01:02:39,010 --> 01:02:42,145 And you can imagine what's about to happen next. 1783 01:02:42,145 --> 01:02:43,770 The Coyote is not going to have a good day. 1784 01:02:43,770 --> 01:02:45,880 So this looks silly, right? 1785 01:02:45,880 --> 01:02:48,380 And you would think to yourself, how could we possibly apply 1786 01:02:48,380 --> 01:02:49,350 this to the real world? 1787 01:02:49,350 --> 01:02:51,766 But actually, this happens in the real world all the time. 1788 01:02:51,766 --> 01:02:54,602 A cage can be open for a mouse, but closed for an elephant. 1789 01:02:54,602 --> 01:02:56,560 So if you're going to represent something like, 1790 01:02:56,560 --> 01:02:58,530 is something closed or not, you have 1791 01:02:58,530 --> 01:03:00,281 to be able to handle situations like this. 1792 01:03:00,281 --> 01:03:01,696 And that's why kids can understand 1793 01:03:01,696 --> 01:03:03,960 really weird scenarios like this, because they're not 1794 01:03:03,960 --> 01:03:05,896 so outlandish. 1795 01:03:05,896 --> 01:03:07,770 There's also the problem of the vast majority 1796 01:03:07,770 --> 01:03:08,850 of English verbs-- 1797 01:03:08,850 --> 01:03:12,430 things like absolve, admire, anger, approve, bark, et 1798 01:03:12,430 --> 01:03:12,930 cetera. 1799 01:03:12,930 --> 01:03:14,820 All of them require far more knowledge. 1800 01:03:14,820 --> 01:03:17,340 They require many of the things I've talked about before. 1801 01:03:17,340 --> 01:03:19,680 And actually, far more than them. 1802 01:03:19,680 --> 01:03:21,630 And what's even worse is we also use language 1803 01:03:21,630 --> 01:03:23,250 in pretty bizarre ways. 1804 01:03:23,250 --> 01:03:25,920 So there are some kinds of idioms in English, 1805 01:03:25,920 --> 01:03:28,230 like the market [AUDIO OUT] bullish, 1806 01:03:28,230 --> 01:03:30,660 that you have to have seen before to understand, right?
1807 01:03:30,660 --> 01:03:32,267 There's no reason to assume that bears 1808 01:03:32,267 --> 01:03:34,350 are better or worse than bulls when you apply them 1809 01:03:34,350 --> 01:03:35,739 to the stock market. 1810 01:03:35,739 --> 01:03:37,530 On the other hand, there are certain things 1811 01:03:37,530 --> 01:03:38,571 that are very systematic. 1812 01:03:38,571 --> 01:03:40,467 I can have an up day or a down day, 1813 01:03:40,467 --> 01:03:43,050 because we've kind of, as a culture, agreed that up is good 1814 01:03:43,050 --> 01:03:43,767 and down is bad. 1815 01:03:43,767 --> 01:03:45,600 Some cultures have made the opposite choice. 1816 01:03:45,600 --> 01:03:47,820 But usually, it's up is good, down is bad. 1817 01:03:47,820 --> 01:03:51,210 So an idea can be grand or it can be small, 1818 01:03:51,210 --> 01:03:54,870 because we've decided big things are better than small things. 1819 01:03:54,870 --> 01:03:57,210 Someone's mood can be dark or light. 1820 01:03:57,210 --> 01:03:59,070 And these are very systematic variations 1821 01:03:59,070 --> 01:04:00,390 that underlie all of language. 1822 01:04:00,390 --> 01:04:02,490 And we constantly use metaphoric extension 1823 01:04:02,490 --> 01:04:04,800 in order to describe what's going on around us 1824 01:04:04,800 --> 01:04:06,390 and to talk about abstract things. 1825 01:04:06,390 --> 01:04:08,306 It really seems as if this is kind of built in 1826 01:04:08,306 --> 01:04:10,110 to our model of the world. 1827 01:04:10,110 --> 01:04:13,140 And modeling this is kind of over the horizon. 1828 01:04:13,140 --> 01:04:15,030 And there are many, many, many other things 1829 01:04:15,030 --> 01:04:16,489 that we're missing here. 1830 01:04:16,489 --> 01:04:18,780 So I just want to thank all my wonderful collaborators, 1831 01:04:18,780 --> 01:04:22,650 like Boris, Max, Candace, and people at MIT, and people 1832 01:04:22,650 --> 01:04:23,670 elsewhere. 1833 01:04:23,670 --> 01:04:25,835 But to recap, what we saw is that we 1834 01:04:25,835 --> 01:04:27,960 can get a little bit of traction on these problems. 1835 01:04:27,960 --> 01:04:32,460 We can build one system that does one simple problem: it just 1836 01:04:32,460 --> 01:04:34,920 connects our perception with our high-level knowledge, 1837 01:04:34,920 --> 01:04:37,620 takes a video and a sentence, and gives us a score. 1838 01:04:37,620 --> 01:04:39,510 And once we have this interesting connection, 1839 01:04:39,510 --> 01:04:41,700 this interesting feedback between these two 1840 01:04:41,700 --> 01:04:43,440 very different-looking systems, it 1841 01:04:43,440 --> 01:04:46,050 turns out that we can do many different and sometimes 1842 01:04:46,050 --> 01:04:48,500 surprising things.