The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

AUDE OLIVA: Thank you very much for the introduction. Good morning, everyone. I'm very pleased to be here; it's my first visit. In this lecture, I would like to give you a tour of how you can predict human visual memory, and an interdisciplinary account of the methods you can use together to get a view of what people can remember or forget.

The specific question we are going to ask today is this. We are experiencing and seeing digital information all the time: you see the real world first, but you are also exposed to many, many videos and images. So vision and visual memory are among the core concepts of cognition. And the question we ask is: can we predict which image, graph, face, word, video, piece of information, or event is going to be memorable or forgettable for a group of people, and eventually for a given individual?

So let's take a moment to imagine that we were able to predict people's memory accurately. This would be very useful for understanding the mechanisms of human memory, at the cognitive, neuroscience, and systems levels. It could also help diagnose memory problems, in short-term or long-term visual memory, that may arise acutely or develop over time, and help design mnemonic aids so that information can be recalled better.
But beyond basic science, if you could predict which information is memorable or forgettable, there is a whole realm of applications you could work on or propose, lying everywhere between data visualization and slogans, basically making slogans better, as well as the whole realm of education and the individual differences that may arise between people. Suppose someone does not learn very well. What can we do to increase the amount of visual information that this person can grasp and remember? And there are various other applications: social networking, faces, retrieving images better, and so on.

So understanding what makes a piece of information memorable or forgettable is really a very interdisciplinary question, and something very exciting for us to work on, because there is a lot of future in that topic.

This is a topic we started in my lab a few years ago. And the best way for you to get a sense of how you start working on memory is to do the kind of game and experiment we had people do. So welcome to the Visual Memory Game. A stream of images will be presented on the screen for one second each. If you were in front of your computer, you would press a key. But what you're going to do is play the game with me, and clap your hands anytime you see an image that you saw before. You're going to have to be attentive, because the images go by fast in this rapid stream. And you will be getting feedback.

So it's a very straightforward memory experiment, and it's the first step toward getting scores, over a lot of images, for the type of information that people will naturally forget or naturally remember. So let's play the game. Are you ready? All right. Clap your hands whenever you see a repeat.

So this is how it will run. Very simple images.

[CLAPPING]

Fine. Excellent.
So images are shown for one second, with a one-second interval.

[CLAPPING]

You're good. No false alarms. Excellent. All right. So that was one level, level 9, out of 30. And this is the game that people played: they could play for five minutes, have a break, and come back. The game had a lot of success, and we ran it widely; I'm going to show many results from it. You could see your score and the amount of money you had earned. It was run on Amazon Mechanical Turk, which allowed us to collect a lot of data, from all around the world, on many, many images.

Those visual memory experiments were set up by Phillip Isola from MIT. And you can play that game with any kind of information displayed visually. We did it for pictures, faces, and words, and you're going to see the results.

So let's start with the pictures. In the first experiment, we presented 10,000 images, and about 2,200 of them were repeated; those were the ones for which we collected scores. For a given subject, each of those images was actually seen exactly twice. You have the stream of images at the top. So, for instance, an image would be shown again after 90, 100, 110 intervening images or so. And if it was an image that the subject recognized, they would press a key. So it's exactly the design you just experienced.

And when you first look at the types of images that are highly memorable or forgettable, there is a trend we all expect: images that are funny, or that have something distinctive or different, or that show people doing various actions, or objects that are out of context, tend to be memorable. And, in general, landscapes or images without any activity tend to be forgettable. So from that experiment we have scores for more than 2,000 images.
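To make the scoring concrete, here is a minimal sketch in Python of how a per-image memorability score could be computed from repeat-detection responses. The trial layout (image ID, repeat flag, response flag) is a hypothetical simplification of the actual logs; the score shown is simply the hit rate, the fraction of repeats of an image that subjects correctly detected.

```python
from collections import defaultdict

def memorability_scores(trials):
    """Per-image hit rates from repeat-detection trials.

    `trials` is a list of dicts with keys:
      'image'     - image identifier
      'is_repeat' - True if the image appeared earlier in the stream
      'responded' - True if the subject pressed the key (clapped)
    Only repeat trials count: a hit is a repeat the subject detected.
    """
    hits = defaultdict(int)
    repeats = defaultdict(int)
    for t in trials:
        if t['is_repeat']:
            repeats[t['image']] += 1
            if t['responded']:
                hits[t['image']] += 1
    # Hit rate per image, pooled over all subjects who saw its repeat.
    return {img: hits[img] / n for img, n in repeats.items()}

# Toy usage: one image remembered, one forgotten.
trials = [
    {'image': 'funny_dog', 'is_repeat': True, 'responded': True},
    {'image': 'landscape', 'is_repeat': True, 'responded': False},
]
print(memorability_scores(trials))  # {'funny_dog': 1.0, 'landscape': 0.0}
```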
So one of the first things we need to know is this: everyone is playing the game, and we can compute each image's memory score. But in order to know whether some images are indeed memorable for all of us, or for a good portion of the population, we need to check whether there is consistency between people.

So here is the simple measure we use. You have a group of people looking at the images, and you have the memory scores. You can split the group into two halves, and rank the images from most memorable to least memorable according to the averages of the first half. Then you can also rank the same images according to the second half. If the two halves behaved identically, you would get a correlation of 1 between the two rankings. And what we observe, when we repeatedly split the group in two like this, is a correlation of 0.75, which is pretty high. This gives us the maximum performance we can expect: how well one group of people can predict the image ranking of another group of people.

And here is the curve that shows what 0.75 consistency looks like. On the x-axis, you have the image rank according to group number one, starting with the images with the highest memory scores; there is a set of them above 90%. And naturally, the curve for group number one, in blue, decreases as images become less and less memorable. That's basically your ground-truth curve. The green curve shows group number two, a totally independent set of people, and the performance they produced for each bin along the image rank. And you can see the two curves are pretty close. So a correlation of 0.75 looks like this.

It means that, across independent groups of people, some images are going to be systematically more memorable or forgettable. You can see the full range, going below 40% for the images that were forgotten, up to 95% for the ones that were systematically remembered.
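As a sketch of the consistency analysis just described: below is a minimal Python version of the split-half procedure, assuming a per-subject, per-image score matrix (a hypothetical layout) and using the Spearman rank correlation from scipy. The repeated random splits and averaging follow the description above, not the study's exact code.

```python
import numpy as np
from scipy.stats import spearmanr

def split_half_consistency(scores, n_splits=25, seed=0):
    """Average Spearman correlation between the image rankings
    produced by two random halves of the subject pool.

    `scores` is an (n_subjects, n_images) array of memory scores;
    NaN marks images a subject did not see.
    """
    rng = np.random.default_rng(seed)
    n_subjects = scores.shape[0]
    rhos = []
    for _ in range(n_splits):
        perm = rng.permutation(n_subjects)
        half1, half2 = perm[:n_subjects // 2], perm[n_subjects // 2:]
        m1 = np.nanmean(scores[half1], axis=0)  # per-image score, group 1
        m2 = np.nanmean(scores[half2], axis=0)  # per-image score, group 2
        rho, _ = spearmanr(m1, m2)
        rhos.append(rho)
    return float(np.mean(rhos))

# Toy usage: 20 simulated subjects rating 50 images whose "true"
# memorability is stable; the split-half correlation comes out high.
rng = np.random.default_rng(1)
true_scores = rng.uniform(0.4, 0.95, size=50)
noisy = true_scores + rng.normal(0, 0.1, size=(20, 50))
print(round(split_half_consistency(noisy), 2))
```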
But, importantly, one group of people predicts another group of people.

Now, there are several ways to test memory. The way we tested memory here, to get ground truth, was an objective measurement: you see an image again, and if you remember it, you press a key. But you can also ask people: do you think you will remember an image? Do you think someone else will remember it? We also collected those subjective memory scores. And we observed a very interesting trend: subjective judgments do not predict image memorability. Which means that if you ask yourself, am I going to remember this? Well, maybe, maybe not. Your subjective judgment of what your memory will be, or what someone else's memory will be, is not correlated with true memory, with what you are actually going to remember or not. This was very interesting, because it shows that objective measurements are needed here in order to really get a sense of what people will remember or forget.

We have published many papers on this topic since 2010; they are all on the website. Some of them look at the correlation between memory, the fact that you are going to remember certain kinds of images, and other attributes, for instance aesthetics. And, again, we found that memorability is distinct from image aesthetics. This means you could have an image that is judged very beautiful or, on the contrary, ugly or boring, and in either case you may still remember it. We found this absence of correlation between the two attributes in our various studies, and we replicated it with other data sets, and with faces as well. So it looks like what you will remember comes down to this notion of distinctiveness; an image can be beautiful or ugly, it doesn't matter, you can still either remember it or forget it.

Now, there was this question about the notion of the lag.
So the lag is this: you could test visual memory after a few seconds or a few intervening images, or you could test it a few minutes or even an hour later, or even days later. Because we were running these experiments on Amazon Mechanical Turk, we did not test across days. However, we did run some experiments with a larger gap, up to 1,000 intervening images between the first presentation and the repeat. The design I showed you had about 100 intervening images between the first presentation and the repeat; but what about shorter and longer time scales?

All that work is published; you can download the papers and see the details. But the basic finding is that the ranks were conserved. So if an image is very memorable, one of the top images after, let's say, a few seconds, it will still be memorable after hours. And if an image is forgettable, one of the most forgettable after a few seconds, it will still be forgettable after hours. The fact that the magnitude, the percentage of images remembered, decreases over time is normal; that has been known from memory research for decades, so it is expected. But memorability here is the rank an image holds for a population, independently of the raw magnitude: an image that is among the most memorable in a given condition may still, in absolute terms, be forgotten by many people.

And we ran those experiments both on the web and in the lab, because in the lab we could control for more factors. And it was very interesting to see, in the lab experiment, that after only 20 seconds there were images that were totally forgotten by a good proportion of people. So some images seem to really not stick, and are gone within who knows how many seconds. We did not go all the way down to short-term memory; we started within long-term memory, at 20 seconds and beyond. But the phenomenon is there. So it suggests that some features of the visual information are encoded in less detail than others, or with more detail than others.
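To see in code what "ranks conserved, magnitude drops" means: here is a tiny sketch, assuming two arrays of per-image hit rates measured at a short and a long lag (simulated data, not the published numbers). The Spearman correlation captures the conserved rank order; the means capture the overall forgetting.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
base = rng.uniform(0.4, 0.95, size=200)   # per-image "true" memorability
# Short lag: scores close to baseline. Long lag: everything drops,
# but the drop preserves the ordering of the images.
short_lag = np.clip(base + rng.normal(0, 0.03, 200), 0, 1)
long_lag = np.clip(0.6 * base + rng.normal(0, 0.03, 200), 0, 1)

rho, _ = spearmanr(short_lag, long_lag)
print(f"rank correlation: {rho:.2f}")   # high: ranks are conserved
print(f"mean hit rate: {short_lag.mean():.2f} -> {long_lag.mean():.2f}")
```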
And what's very interesting then is that you can go to neuroscience and start studying the level of detail, or the quality of encoding, of an image, and see where in the visual pathway an image is basically gone after 20 seconds or so. So, important point: the rank is conserved.

We also looked at whether the principles of memorability we found for images hold for faces. Faces are a very interesting material to work with, because they are essentially all images that look alike: you have one object, and you look at many, many exemplars. And there is no reason to believe that the high consistency we found for images as different as an amphitheater, a parking lot, a landscape, and so on would also be found with faces.

So we gathered a data set of 10,000 faces. The paper is published, and the entire data set is available on the web, so you can download those 10,000 faces along with all the attributes we collected when we studied the data set. And we found the same phenomenon: very high consistency, both in the correct positive responses, when people remembered seeing a face, and in the false alarms, when people had not seen a face but falsely thought they had, and pressed the key. So this very high consistency on both measurements suggests that, again, in people's facial features, or at least in the way a photo is taken, there is something at the level of the image that will make a face highly memorable or highly forgettable for most people.

All the details of this study are on the web. It was a pretty complex study to run, because we have very different sensitivities to faces, for instance the race effect, which depends on where we grew up; we know there are a lot of individual differences. So the way the study was run, the collection of faces followed the US census in terms of male, female, race, and age, starting at 18 years old and older.
And so did our participant population. So, at the group level, we showed a collection of faces that matched the population of people who were running through the study. And, as I said, all the data are available on the web if some of you want to go back and do additional analyses.

So we kept going with this. We have the consistency for visual material; now, what about words? Words are a very interesting case, because now you already know the words. But are there kinds of words that we can predict are more forgettable or memorable? Again, there is no reason to believe a high consistency will be found. But we ran the study twice, with two different data sets of 10,000 items each, two different sets of words. And we found, again, very high consistency: a collection of words were systematically remembered by people, and others were systematically forgotten.

This work, done in collaboration with Mahowald, Isola, Gibson, and Fedorenko, is under submission. But let me give you a taste of what we found. So what makes words more memorable or forgettable? Here is a cartoon that gives you the basic idea: if there is one word for one meaning, so the word has a single meaning, it will tend to be much more memorable than a word that has many meanings. In the paper, we also look at the correlation between memorability and imageability, frequency, and so on. All of this is described there, but really the main factor is this one-to-one mapping between a word and its referent, its meaning or concept.

So let's look at some examples. The paper will come with the two data sets and the thousands of words that were found memorable or forgettable. So I think if you write a recommendation letter, you should not say that your student is excellent, but that she is fabulous. And our research is not a blast; it's in vogue. And the ideas of a team are not irrational; they are grotesque.
So those are just a few examples: on average, the first word of each of those three pairs has more referents, and therefore a tendency to be more forgettable, because it might be used for many more things than the single-meaning word. I also noticed that a lot of words borrowed from French tend to be memorable.

So we do find this stable principle: from the content, you can predict which types of images, faces, and words are memorable. We also did it for visualizations, and we have started working on the topic of education. So this is very useful, because at least at the level of a group, you can start making that prediction. Oh, I also have massive and avalanche; I forgot about those.

So now that we have all those data and can see that there is this consistency, one of the next questions you can ask is: if it seems that we all have a tendency to remember the same images, can we find a neural signature in the human brain? So, this question of memorability, is it a perceptual question or a memory question? Because in all our experiments, the images are shown for a short time, and then, when they are repeated, you see them a second time. But basically, all the action is at the perception level. Whenever you perceive an image for half a second or one second, something is going on at the perception level that is going to bias whether this image goes into memory or not.

So, knowing this, if we want to look at the potential neural framework of memorability, we have to look at the entire brain. We have to look at all the regions that have been found to be related to perception, face perception, picture perception, objects, space, and so on, as well as the medial temporal lobe regions, more toward the middle of the brain, which have been related to memory. So this is what we did with Wilma Bainbridge; this is her PhD work, basically having a look at all those regions. And here is the very simple experiment we ran.
So we took a collection of faces and scenes from the thousands we have, keeping the two classes separate, since we would be looking both at the regions more activated by scenes and at the regions more activated by faces. And we split each class into a memorable set and a forgettable set.

So in those sets, every image is novel. Exactly as in the memory experiments, we showed each image one time, for half a second. So you are at the perception level; you are in a perception experiment. You saw those images one after the other, only one time, and all of them are novel. So we are going to look at the contrast of novel images minus novel images: scenes minus scenes, faces minus faces, except that some of the images are highly memorable and some are highly forgettable.

Another factor you have to consider is that, within those groups of images and faces that are highly memorable or forgettable, there may be a lot of image features that correlate with memorability. If you take a collection of images like the ones I showed you before, the photos with people or actions tend to be memorable, whereas landscapes tend to be forgettable. So you have a lot of visual features that will co-vary with the dimension of memorability. So in this brain study, we equalized for that, because we had enough images. Here you have a sample of the two groups, sample images, and the kind of statistics we looked at. The two scene sets, for instance, were equalized in terms of the categories they contained, outdoor, indoor, beach, landscape, house, kitchen, and so on, as well as a collection of low-level features. And you can see some of the average signatures, which are essentially identical across a range of low-, mid-, and higher-level image features that were equalized between the two groups. So whatever we find is not going to be due to simple statistics of the images or to the types of objects they contain.
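As a sketch of what equalizing two stimulus sets can look like: the Python snippet below greedily pairs each memorable image with the unused forgettable image closest to it in a low-level feature space, so that the two final sets are matched on those features. The feature vectors and the greedy nearest-neighbor pairing are illustrative assumptions, not the procedure used in the study.

```python
import numpy as np

def match_sets(feats_a, feats_b):
    """Greedily pair each item of set A with its nearest unused
    neighbor in set B, by Euclidean distance in feature space.

    feats_a: (n_a, d) features of the memorable images
    feats_b: (n_b, d) features of the forgettable images, n_b >= n_a
    Returns a list of (index_in_a, index_in_b) pairs.
    """
    used = set()
    pairs = []
    for i, fa in enumerate(feats_a):
        dists = np.linalg.norm(feats_b - fa, axis=1)
        for j in np.argsort(dists):      # try the closest candidates first
            if int(j) not in used:       # each forgettable image used once
                used.add(int(j))
                pairs.append((i, int(j)))
                break
    return pairs

# Toy usage: 5 memorable vs. 8 forgettable images described by 3
# hypothetical low-level features (e.g., mean luminance, contrast, hue).
rng = np.random.default_rng(0)
print(match_sets(rng.random((5, 3)), rng.random((8, 3))))
```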
We could play the same game for the faces, so we did. Here are the numbers again: memorable and forgettable faces, which were also equalized for various attributes like attractiveness, emotion, kindness, happiness, and so on, as well as male, female, race, expression, and so forth. And you can see that the statistics, the average faces for both the memorable and the forgettable group, are also essentially identical. So with those groups, what is left is hopefully only the factor of something else in the image, at the level of a higher-order image statistic, because only an image statistic can explain the fact that very different people will remember the same faces and forget the same faces. It is certainly not one of the obvious low-level image statistics; those cannot explain the results.

So, two years and four studies later, we had replicated this study four times with many different materials. I'm just going to show you one snapshot of the results. This is a multivariate pattern analysis (MVPA) comparing the memorable versus the forgettable groups: a searchlight analysis looking for the regions of the brain that show a different pattern, and are also more active, for memorable faces. We find signatures in the hippocampus, the parahippocampal area, and the perirhinal cortex that are typical for memorable faces and scenes, and no signature in the visual areas, even the higher visual areas, because we did equalize for those features. So it seems that those medial temporal lobe (MTL) regions play a role in a kind of higher-order statistical perception, a notion of distinctiveness that is within those images.

So this suggests that at the perception of a new image, be it a face, a scene, a collection of objects, and so on, there is already a signature that is going to guide, or bias, whether this image is going to be put into memory or not. We have those results at the level of the group, and we are now looking at the level of individuals.

All right.
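To make the searchlight MVPA idea concrete, here is a minimal sketch in Python. It slides a small neighborhood over simulated voxel data and, at each location, cross-validates a linear classifier on memorable-versus-forgettable labels; locations where accuracy beats chance carry information about the distinction. The data shapes, the cubic neighborhood standing in for a sphere, and the classifier are illustrative assumptions; real analyses use dedicated neuroimaging tools and careful statistics.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def searchlight(volume, labels, radius=1):
    """volume: (n_trials, x, y, z) voxel activations
    labels: (n_trials,) 0 = forgettable, 1 = memorable
    Returns an (x, y, z) map of cross-validated decoding accuracy."""
    _, nx, ny, nz = volume.shape
    acc = np.zeros((nx, ny, nz))
    for x in range(nx):
        for y in range(ny):
            for z in range(nz):
                # Cube-shaped neighborhood standing in for a sphere.
                patch = volume[:,
                               max(0, x - radius):x + radius + 1,
                               max(0, y - radius):y + radius + 1,
                               max(0, z - radius):z + radius + 1]
                X = patch.reshape(len(labels), -1)
                clf = LogisticRegression(max_iter=1000)
                acc[x, y, z] = cross_val_score(clf, X, labels, cv=5).mean()
    return acc

# Toy usage: 40 trials on a tiny 6x6x6 volume, with an injected
# "memorability" signal in one corner that the map should pick up.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 20)
vol = rng.normal(size=(40, 6, 6, 6))
vol[labels == 1, :2, :2, :2] += 1.0
print(searchlight(vol, labels).max())  # highest accuracy near the signal
```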
So, to your question, it looks like where we are going now is trying to model this notion of memorability. We have a good case that there is some information in the image, of a higher-level nature, we don't know exactly which, that cannot be explained by simple features, and that makes all of us react the same way. Even our brains react the same way; we have neural signatures of memorability coming up. So we have this intrinsic information that makes all of us remember or forget the same kinds of visual information. So can we now, in a way, imitate or model those results in an artificial system?

All right. You have all heard about the revolution in computer vision from a couple of years ago, deep learning, and those neural networks that are now able to recognize and perform a lot of tasks, some of them at the level of humans, such as recognizing various objects and so on. And one aspect of those neural networks, and I'm going to talk about them, is that they require a lot of information. You need to teach them the classes you want to distinguish, and they can use, and need, a lot of data.

Everything we had done so far was based on a couple of thousand images, and that's really not enough to even start scratching at computational modeling. So with Aditya Khosla, we recently ran a new large-scale visual memorability study on Amazon Mechanical Turk, this time getting scores for 60,000 photographs. And you have a sample of them over here. 60,000. They are all going to be available in a few weeks; the paper is under revision, and it's looking good. So as soon as we have a citation, we are going to give away all the data, the scores, the images, and so on. In this experiment, images were presented for 600 milliseconds, a shorter time, but the results really did not change much.
So, as a snapshot, because the 60,000 images really cover a lot of types of photos, faces, events, actions, even graphs and so on, here you can see some that were highly memorable or forgettable. There is also the website; it's not populated with much yet, but it will be very shortly. And the consistency we got on that data set was again pretty high, 0.68. That was expected; the paper explains the various splits we did. But, again, we find this very high correlation, so we do know that there is something to model there.

And, again, a very quick summary: the types of images that seem to be the most memorable are the ones that have a focus and a close setting, that show some dynamics and something distinctive, unusual, a little different; whereas the less memorable ones tend to have no single focus, a distant view, static content, and more commonalities. Still, among all photos you can find two images that are both focused and dynamic, where one of them will be more memorable than the other because it has something unusual that our system can now capture.

So we have this new data set, we have all the memory scores, and we have the high consistency. How do we even start thinking about a computational model of visual memory, or memorability? Well, in order to give you a sense of one of the basics we need, in order to even start thinking of a model, I'm going to show and run another demo. In this demo, you're also going to see some images, and you clap your hands whenever you see an image that repeats. OK? Exactly the same game as before. Ready? All right. If everyone plays the game, it will be fun. OK.

[CLAPPING]

[CLAPPING]

A little false alarm.

[CLAPPING]

False alarm.

[CLAPPING]

Good. More energy.

[CLAPPING]

Good.
[CLAPPING]

Sorry.

[CLAPPING]

No.

[LAUGHTER]

[CLAPPING]

No.

[CLAPPING]

Yes.

[CLAPPING]

No.

AUDIENCE: Too close.

[CLAPPING]

AUDE OLIVA: No. Yeah, that was... yes.

[CLAPPING]

[LAUGHTER]

All right. So, for the sake of the demo, I put in images that are really very different. Some you are familiar with: you have a concept, you know what it is. This is a restaurant. This is an alley. This is a stadium, and so on. And some for which either you don't have a specific concept, or you have the same concept for all of them: texture, painting, texture, texture, texture. And the basic idea of memory is that you need to recognize, to attach a unique tag or collection of tags, in order to remember an individual image. Because you saw a collection of textures, or paintings, or whatever you want to call them, you'll remember them as a group. But to get to the individual memory of one, you're going to need a specific concept. And in order to remember it, there will be an abstraction, a format, a collection of words, or a coding that makes it unique.

So you need to recognize in order to remember. This means that if you want a model of memory that starts from the raw image, and I'm not in a toy world here, I really start from the raw image, like the retina, well, you're going to need to build a visual recognition system first. So first you need a model that recognizes objects and scenes and events and so on. And then, from this, there can be a base for starting to model memory.

So, fortunately, the field of computer vision has made a lot of progress in the past couple of years, so we now have visual recognition systems that work pretty well. I'm going to describe them.
So, first, we need a visual recognition system. All right. What does a visual recognition system need to do? Well, it's Sunday morning, you're going to the picnic area, and you're faced with this view. You take a photo. This photo actually went viral on the web. And here is the state of the art of computer vision systems. When it comes to recognizing the objects, and I know this is a different view, but it works very well for any view, object recognition over about 1,000 object categories is now reaching human performance. So a system will tell you this is a black bear, there's a bench, a table, and so on, and trees in the background. But it's missing the point, which is that this is a picnic area.

So you need at least two kinds of information in order to reach visual scene understanding: you need the scene and the context, you need to know the place, and you need to know the objects. As I said, on the ImageNet Challenge in computer vision, models now match average human performance, which is 95% correct, on new exemplars of objects they have never been trained on, across 1,000 object categories. And recently, we published a few papers on the other part of visual scene understanding, the place and the context. This is an output of our system, and you can go on the web, I'm going to give you the address, play with it, and see how well it recognizes the context and the place.

So just to put into context what the field of computer vision has been doing for the past 15 years: the number of data sets, and the number of images per data set, have kept increasing, so that there are more exemplars to learn from. And, for perspective, you can see the estimate for a two-year-old kid. Of course, it will depend on the sampling you use to estimate the number of visual inputs that the retina sees. But a child sees much, much more variety, and many more visual inputs.
But right now, both ImageNet, which is a data set of objects, and Places, which is the data set of scenes I'm going to present to you now, have about 10 million labeled images across many categories. Labeled means that, for Places, it will tell you this is a kitchen, or this is a conference room, and so on, for hundreds of categories. So 10 million is largely enough to start building very serious visual recognition systems, but it's nowhere near the human brain. However, we might be getting there.

So how do we even start to build a visual scene recognition data set? This is work we did and published in 2010, the Scene Understanding, or SUN, data set, where we collected the words from the dictionary that correspond to places at the subordinate level, I'm going to give you examples, and then retrieved a lot of images from the web. And there was a total of 900 different categories, and about 400 of them had enough exemplars to build an artificial system.

So instead of only going out to build a data set of bedrooms and kitchens and so on, think about what happens for the human brain: you encounter many different environments, there are many attributes you can attach to them, and they come with affordances and storytelling and so on. For most places, it's not only that this is, for instance, a bedroom; you would go further, to a student bedroom, say. Well, I think those are student bedrooms two doors down from my colleague. So there are many kinds of adjectives we can use in order to retrieve images that will give us a larger panorama of the types of concepts the human brain uses to recognize environments. So: a simple bedroom. The tags were assigned automatically. Superior bedroom. Senior bedroom. Colorful bedroom. Hotel bedroom. And so on and so on.
So that was the retrieval we did, which means that now, for every category, there is also a tag for the subtype of environment it can be. Messy bedroom. And a couple of years later, 80 million images later, and a lot of Amazon Mechanical Turk experiments later, we are launching this week the Places2 data set, with 460 different categories of environments and 10 million labeled images. So this is a larger data set of labeled images, with labels ready to be used right away for training artificial systems, deep learning, and so on. Here is just a snapshot of how Places compares with other large data sets in terms of the number of exemplars.

The Places data set is actually part of the ImageNet challenge this year, which means that you can go to ImageNet, register for the challenge, and download right now, tonight, eight million images of places to use for training your neural networks, along with a set that will be used for testing, and participate in the challenge. So this was launched last week, and the website associated with Places will be launched this week. And we decided to just give this away to everyone right away. We are finishing up the paper now; it will be an arXiv paper. No time to wait for months and months. This is a data set that can be used by a lot of people to make progress fast, and that's what we are doing.

So, as I said, computer vision models now require, if you use deep learning, a lot of data, and we hope that with this data set fast progress will be made. What we specifically did is use the AlexNet deep learning architecture. If you don't know what this is, I can tell you later how to access it, with the paper and so on. This is not my model; it's a model put together by Geoffrey Hinton and collaborators a few years ago. And you can download the model, or download the code and retrain it.
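As a hedged sketch of what downloading the model and retraining can look like today: the snippet below loads a pretrained AlexNet from torchvision and swaps its final layer to predict a new set of scene categories, which is the usual fine-tuning recipe. torchvision and its ImageNet weights are my assumptions for illustration; the original work used the Caffe-era AlexNet trained directly on the Places images.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_SCENE_CLASSES = 200  # roughly the category count of the demo model below

# Start from AlexNet pretrained on ImageNet objects, then replace the
# final classifier layer so it outputs scene categories instead.
net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
net.classifier[6] = nn.Linear(4096, NUM_SCENE_CLASSES)

# Fine-tune: earlier layers keep their generic features and train slowly,
# while the freshly initialized head learns at a higher rate.
optimizer = torch.optim.SGD([
    {"params": net.features.parameters(), "lr": 1e-4},
    {"params": net.classifier.parameters(), "lr": 1e-2},
], momentum=0.9)
criterion = nn.CrossEntropyLoss()

# One training step on a dummy batch standing in for Places images.
images = torch.randn(8, 3, 224, 224)                 # 8 RGB crops
labels = torch.randint(0, NUM_SCENE_CLASSES, (8,))   # fake scene labels
loss = criterion(net(images), labels)
loss.backward()
optimizer.step()
print(float(loss))
```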
795 00:41:07,350 --> 00:41:10,020 So neural nets now are basically based 796 00:41:10,020 --> 00:41:13,610 on a collection of operations that 797 00:41:13,610 --> 00:41:17,670 are called layers-- convolution, normalization, simple image 798 00:41:17,670 --> 00:41:20,760 processing operations that you apply in a sequence. 799 00:41:20,760 --> 00:41:23,540 You do it one time, then a second and a third and so on. 800 00:41:23,540 --> 00:41:26,910 And then you have these multi-layer models. 801 00:41:26,910 --> 00:41:30,960 And the number of layers is still a research question. 802 00:41:30,960 --> 00:41:32,730 How do the layers correspond to the brain? 803 00:41:32,730 --> 00:41:36,300 I'm going to say a little bit about that. 804 00:41:36,300 --> 00:41:40,060 And using this simple-- 805 00:41:40,060 --> 00:41:45,985 well, this AlexNet model, we built a scene recognition 806 00:41:45,985 --> 00:41:46,485 system. 807 00:41:46,485 --> 00:41:50,920 And now you can go to places.csail.mit.edu-- 808 00:41:50,920 --> 00:41:54,430 a smartphone will work-- and take a photo. 809 00:41:54,430 --> 00:41:57,560 And it should tell you the type of environment 810 00:41:57,560 --> 00:41:59,620 the photo represents. 811 00:41:59,620 --> 00:42:02,890 It will give you several possibilities, 812 00:42:02,890 --> 00:42:06,070 because environments are fundamentally ambiguous. 813 00:42:06,070 --> 00:42:07,320 They can be of different types. 814 00:42:07,320 --> 00:42:08,653 So I don't know if you can read these. 815 00:42:08,653 --> 00:42:09,950 Let me read a few. 816 00:42:09,950 --> 00:42:13,300 The first one says restaurant, coffee shop, 817 00:42:13,300 --> 00:42:16,990 cafeteria, food court, restaurant patio. 818 00:42:16,990 --> 00:42:18,430 I guess they all fit. 819 00:42:18,430 --> 00:42:21,880 The second one, parking lot and driveway. 820 00:42:21,880 --> 00:42:25,000 The third one, conference room, dining room, banquet hall, 821 00:42:25,000 --> 00:42:26,100 classroom. 822 00:42:26,100 --> 00:42:28,220 That was a difficult one. 823 00:42:28,220 --> 00:42:31,240 And the fourth environment is patio, restaurant patio, 824 00:42:31,240 --> 00:42:33,970 or restaurant. 825 00:42:33,970 --> 00:42:37,720 If you go there, you can also give feedback 826 00:42:37,720 --> 00:42:41,830 on whether one of the labels matches the environment 827 00:42:41,830 --> 00:42:43,000 that you're looking at. 828 00:42:43,000 --> 00:42:45,960 And it should be above 80% correct. 829 00:42:45,960 --> 00:42:50,890 This model used 1.5 million images and 200 categories. 830 00:42:50,890 --> 00:42:53,110 So soon, with a bigger data set, 831 00:42:53,110 --> 00:42:59,310 we hope that things will be even more interesting and accurate. 832 00:42:59,310 --> 00:43:02,680 I took a couple of photos at breakfast this morning. 833 00:43:02,680 --> 00:43:06,920 You may all recognize the scenery here. 834 00:43:06,920 --> 00:43:10,870 From the breakfast area looking outside, 835 00:43:10,870 --> 00:43:14,680 it's an outdoor harbor, dock, boat deck. 836 00:43:14,680 --> 00:43:16,900 Yeah, it could actually be on a boat 837 00:43:16,900 --> 00:43:19,680 deck looking at the harbor. 838 00:43:19,680 --> 00:43:22,360 And otherwise, the breakfast area 839 00:43:22,360 --> 00:43:25,240 was restaurant, cafeteria, coffee shop, food court, 840 00:43:25,240 --> 00:43:26,190 or bar. 841 00:43:26,190 --> 00:43:28,930 All of those, again, fit. 842 00:43:28,930 --> 00:43:31,480 So those models now work very, very well. 843 00:43:31,480 --> 00:43:32,410 So why?
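The kind of top-k scene prediction the demo returns can be sketched as follows. Note the assumption: torchvision's AlexNet ships with ImageNet object weights, so it stands in for the actual Places-trained Caffe model here; the photo filename is also hypothetical.

```python
# Sketch: top-5 predictions from an AlexNet classifier, mirroring how the
# demo reports several candidate labels because scenes are ambiguous.
import torch
from torchvision import models, transforms
from PIL import Image

model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("photo.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    probs = model(img).softmax(dim=1)[0]

top5 = probs.topk(5)                  # several labels, ranked by confidence
for p, idx in zip(top5.values, top5.indices):
    print(f"class {idx.item()}: {p.item():.2f}")
```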
844 00:43:32,410 --> 00:43:34,790 Well, let me tell you why. 845 00:43:34,790 --> 00:43:40,990 With those neural networks, you can go to any layer, 846 00:43:40,990 --> 00:43:49,450 open it up, and look at what every single artificial neuron 847 00:43:49,450 --> 00:43:50,110 does. 848 00:43:50,110 --> 00:43:52,330 That is, what we call the receptive field 849 00:43:52,330 --> 00:43:54,820 of every single unit in layer 850 00:43:54,820 --> 00:43:58,180 one, layer two, layer three, and so on. 851 00:43:58,180 --> 00:44:00,634 So the first layer-- 852 00:44:00,634 --> 00:44:03,880 here, four layers are shown-- 853 00:44:03,880 --> 00:44:10,690 basically learns simple features, contours and simple textures. 854 00:44:10,690 --> 00:44:12,700 That's called pool1. 855 00:44:12,700 --> 00:44:20,080 And those look like the types of responses of visual cells, 856 00:44:20,080 --> 00:44:21,810 possibly in V1 and V2. 857 00:44:21,810 --> 00:44:23,540 I'm going to say more about this. 858 00:44:23,540 --> 00:44:28,060 And as you go higher up in the layers, 859 00:44:28,060 --> 00:44:31,480 you start having textures and patches 860 00:44:31,480 --> 00:44:32,880 that make more sense. 861 00:44:32,880 --> 00:44:36,600 And higher up in the layers, like layer number five, 862 00:44:36,600 --> 00:44:42,090 you start having artificial receptive fields that 863 00:44:42,090 --> 00:44:45,640 are specific to a part of an object, a part of a scene, 864 00:44:45,640 --> 00:44:47,770 or an entire object by itself. 865 00:44:47,770 --> 00:44:52,090 Like we can see here some kind of [INAUDIBLE] coffee, 866 00:44:52,090 --> 00:44:55,330 as well as the tower and so on. 867 00:44:55,330 --> 00:44:59,950 So it seems that the systems are able to recognize environments 868 00:44:59,950 --> 00:45:01,520 and objects. 869 00:45:01,520 --> 00:45:04,540 But what they learn are the parts and the objects 870 00:45:04,540 --> 00:45:05,980 that the environments contain. 871 00:45:05,980 --> 00:45:08,130 And I'm going to show examples of those. 872 00:45:08,130 --> 00:45:15,050 So, jumping ahead, let me just give you a result in neuroscience. 873 00:45:15,050 --> 00:45:17,800 There's a lot of debate out there: OK, well, 874 00:45:17,800 --> 00:45:20,930 there are those models with different layers. 875 00:45:20,930 --> 00:45:22,840 And you have different models out there. 876 00:45:22,840 --> 00:45:27,520 To what extent do they correspond to the visual hierarchy 877 00:45:27,520 --> 00:45:28,810 of the human brain? 878 00:45:28,810 --> 00:45:32,170 Well, first, the computational models 879 00:45:32,170 --> 00:45:38,320 were inspired by the visual hierarchy-- V1, V2, V4, 880 00:45:38,320 --> 00:45:40,750 and [INAUDIBLE] and so on-- knowing 881 00:45:40,750 --> 00:45:47,920 that more complex features are built over time and space. 882 00:45:47,920 --> 00:45:52,660 So what you can also do is run the same images 883 00:45:52,660 --> 00:45:57,550 through a network and through an fMRI experiment, 884 00:45:57,550 --> 00:46:02,500 and then look at the correlation between the responses 885 00:46:02,500 --> 00:46:05,290 of the units, let's say in layer one, 886 00:46:05,290 --> 00:46:08,380 and the responses you measure in the human brain, 887 00:46:08,380 --> 00:46:10,340 in different parts of the brain.
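"Opening up" a layer is mechanically simple with forward hooks. Here is a minimal sketch, again assuming torchvision's AlexNet as a stand-in network; the random input and the chosen unit index are placeholders.

```python
# Sketch: capturing per-layer activations so individual units can be inspected.
import torch
from torchvision import models

model = models.alexnet(weights=None).eval()
activations = {}

def save_to(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# In torchvision's AlexNet, model.features indices 0/3/6/8/10 are conv1-conv5.
for i, name in [(0, "conv1"), (3, "conv2"), (6, "conv3"), (8, "conv4"), (10, "conv5")]:
    model.features[i].register_forward_hook(save_to(name))

x = torch.randn(1, 3, 224, 224)       # one (random) image for illustration
model(x)

unit = 42                              # an arbitrary conv5 channel
print(activations["conv5"][0, unit].max())  # that unit's peak response
```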
888 00:46:10,340 --> 00:46:14,890 And what we find is that layer one corresponds more, 889 00:46:14,890 --> 00:46:18,070 has a higher correlation, with responses 890 00:46:18,070 --> 00:46:21,700 in the early visual areas, literally V1 and V2. 891 00:46:21,700 --> 00:46:25,900 And as you move up through the layers, 892 00:46:25,900 --> 00:46:30,520 there is a higher [INAUDIBLE] correlation 893 00:46:30,520 --> 00:46:34,340 with parts of the ventral and parietal regions of the brain. 894 00:46:34,340 --> 00:46:39,150 And I know you had the lecture by Jim DiCarlo, which 895 00:46:39,150 --> 00:46:41,740 must have explained this. 896 00:46:41,740 --> 00:46:46,500 Jim DiCarlo's team did it, Jack Gallant as well at Berkeley. 897 00:46:46,500 --> 00:46:49,140 And we also did it with other types 898 00:46:49,140 --> 00:46:53,100 of images and other networks, and all the results really 899 00:46:53,100 --> 00:46:57,360 corroborate each other, with this nice correspondence 900 00:46:57,360 --> 00:47:02,970 between low- and high-level visual areas in the brain, 901 00:47:02,970 --> 00:47:06,910 the human brain, and those multi-layer models. 902 00:47:06,910 --> 00:47:07,410 All right. 903 00:47:07,410 --> 00:47:09,810 Let me show you some of the receptive 904 00:47:09,810 --> 00:47:11,700 fields, the artificial receptive fields, 905 00:47:11,700 --> 00:47:14,670 that we find in the higher layers. 906 00:47:14,670 --> 00:47:19,990 This network was trained for scene categorization. 907 00:47:19,990 --> 00:47:22,080 So the only thing that this network, 908 00:47:22,080 --> 00:47:24,480 the one we are using here, learned was 909 00:47:24,480 --> 00:47:29,940 to discriminate between, in this case, 200 categories-- 910 00:47:29,940 --> 00:47:33,090 the kitchen, the bedroom, the bathroom, the alley, 911 00:47:33,090 --> 00:47:35,890 the living room, the forest, and so on and so on. 912 00:47:35,890 --> 00:47:37,860 200 of those. 913 00:47:37,860 --> 00:47:40,510 So that's the task. 914 00:47:40,510 --> 00:47:42,960 What the network learns, and what we 915 00:47:42,960 --> 00:47:46,650 observe, is that the object information that is discriminant 916 00:47:46,650 --> 00:47:49,640 and diagnostic between those categories 917 00:47:49,640 --> 00:47:52,620 forms the emerging representations that are automatically, 918 00:47:52,620 --> 00:47:55,950 naturally learned by the network. 919 00:47:55,950 --> 00:48:00,900 So the network has never been taught a wheel, 920 00:48:00,900 --> 00:48:04,840 but the representation emerges, as you can see here. 921 00:48:04,840 --> 00:48:09,920 So this is one artificial neuron. 922 00:48:09,920 --> 00:48:14,190 And here are its receptive field and its responses, 923 00:48:14,190 --> 00:48:17,880 the highest responses it gave for a collection of images. 924 00:48:17,880 --> 00:48:22,650 And as you can see, those higher pool5 receptive fields 925 00:48:22,650 --> 00:48:24,780 are more invariant to location. 926 00:48:24,780 --> 00:48:26,610 They are built that way. 927 00:48:26,610 --> 00:48:29,850 But the network never learned the parts explicitly. 928 00:48:29,850 --> 00:48:32,250 This is something that emerged naturally 929 00:48:32,250 --> 00:48:34,262 from learning different environments.
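One common way to compute such a layer-to-brain correlation is representational similarity analysis (RSA); the lecture does not specify which variant was used, so take this as one plausible sketch, with random arrays standing in for real network features and fMRI responses to the same images.

```python
# Sketch: comparing a network layer's representation to a brain region's via RSA.
import numpy as np
from scipy.stats import spearmanr

def rdm(responses):
    """Representational dissimilarity matrix: 1 - correlation between
    the response patterns evoked by every pair of images."""
    return 1.0 - np.corrcoef(responses)

n_images = 100
layer1_feats = np.random.randn(n_images, 4096)   # stand-in network features
v1_voxels    = np.random.randn(n_images, 500)    # stand-in fMRI responses

# Correlate the two representational geometries over the RDM upper triangles.
iu = np.triu_indices(n_images, k=1)
rho, _ = spearmanr(rdm(layer1_feats)[iu], rdm(v1_voxels)[iu])
print(f"layer1 vs. V1 representational similarity: rho = {rho:.2f}")
```

Repeating this for every layer against every brain region yields the pattern described above: early layers align best with V1/V2, later layers with ventral and parietal areas.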
930 00:48:34,262 --> 00:48:36,720 The other thing that's very interesting in this model, when 931 00:48:36,720 --> 00:48:39,060 you open it up and look at the receptive fields using 932 00:48:39,060 --> 00:48:44,460 various methods-- here is one-- is that this model has never 933 00:48:44,460 --> 00:48:47,800 learned the notion of a shape or an object. 934 00:48:47,800 --> 00:48:51,420 So it's basically going to become 935 00:48:51,420 --> 00:48:55,840 sensitive to discriminant and diagnostic information. 936 00:48:55,840 --> 00:48:59,820 So you have this unit that is selective for the bottom parts 937 00:48:59,820 --> 00:49:03,000 of things-- the legs of animate beings, or even, you 938 00:49:03,000 --> 00:49:04,690 can see, the trees over there. 939 00:49:04,690 --> 00:49:05,790 So this is one unit. 940 00:49:05,790 --> 00:49:10,650 It seems that, in order to classify environments, 941 00:49:10,650 --> 00:49:14,520 it was necessary to have units that we might not have a word for. 942 00:49:14,520 --> 00:49:17,040 But the human brain might as well 943 00:49:17,040 --> 00:49:19,690 have many of those units that do not 944 00:49:19,690 --> 00:49:21,750 necessarily correspond to a word. 945 00:49:21,750 --> 00:49:24,150 So, basically, with this model you 946 00:49:24,150 --> 00:49:27,450 can have a lot of new objects emerging that you might not 947 00:49:27,450 --> 00:49:31,260 have thought of, but they become parts of the code 948 00:49:31,260 --> 00:49:35,590 needed in order to identify an environment or an object. 949 00:49:35,590 --> 00:49:38,760 So we also have another unit for the bottom parts 950 00:49:38,760 --> 00:49:40,560 of a collection of chairs. 951 00:49:40,560 --> 00:49:43,110 We also have chairs, of course, showing up. 952 00:49:43,110 --> 00:49:43,860 Faces. 953 00:49:43,860 --> 00:49:46,500 The model never learned faces. 954 00:49:46,500 --> 00:49:49,290 It only learned kitchen, bathroom, street. 955 00:49:49,290 --> 00:49:53,110 We have several units emerging for faces. 956 00:49:53,110 --> 00:49:53,610 Why? 957 00:49:53,610 --> 00:49:56,190 Because faces are correlated with a collection 958 00:49:56,190 --> 00:49:57,650 of environments. 959 00:49:57,650 --> 00:50:01,350 Then, entire object shapes, like a bed, 960 00:50:01,350 --> 00:50:04,110 very diagnostic of a bedroom in this case. 961 00:50:04,110 --> 00:50:09,130 So that will be a unit that is very, very specific. 962 00:50:09,130 --> 00:50:13,650 That one is really only for beds. 963 00:50:13,650 --> 00:50:17,970 And then others, like lamps and so on. 964 00:50:17,970 --> 00:50:20,810 There are thousands and thousands here. 965 00:50:20,810 --> 00:50:24,720 Another thing never taught: screen monitors. 966 00:50:24,720 --> 00:50:29,005 And here is a specific unit for that which emerged. 967 00:50:29,005 --> 00:50:32,160 Also, collections of objects or space-- 968 00:50:32,160 --> 00:50:35,010 a collection of chairs over here. 969 00:50:35,010 --> 00:50:39,660 The network found that crowds are discriminant information 970 00:50:39,660 --> 00:50:42,300 for classifying environments. 971 00:50:42,300 --> 00:50:45,030 It's a very interesting unit because it's 972 00:50:45,030 --> 00:50:50,250 really independent of location, as well as 973 00:50:50,250 --> 00:50:53,710 of the number of people and whether they are closer or further away. 974 00:50:53,710 --> 00:50:56,490 But it does capture this notion of a crowd. 975 00:50:56,490 --> 00:50:58,440 So the model doesn't have the word crowd.
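One of the simplest of the "various methods" for characterizing what a unit prefers is to rank a held-out image set by how strongly each image drives that unit, then look at the top images. A sketch under the same stand-in AlexNet as before; the dataset path and unit index are hypothetical.

```python
# Sketch: find the images that most strongly activate one conv5 unit.
import torch
from torchvision import models, datasets, transforms

model = models.alexnet(weights=None).eval()
conv5_out = {}
model.features[10].register_forward_hook(
    lambda m, i, o: conv5_out.update(act=o.detach()))

tf = transforms.Compose([transforms.Resize(256), transforms.CenterCrop(224),
                         transforms.ToTensor()])
data = datasets.ImageFolder("places/val", transform=tf)   # hypothetical path
loader = torch.utils.data.DataLoader(data, batch_size=64)

unit, scores = 42, []
with torch.no_grad():
    for images, _ in loader:
        model(images)
        # Score each image by the unit's peak activation anywhere in the image.
        scores.append(conv5_out["act"][:, unit].amax(dim=(1, 2)))

top = torch.cat(scores).topk(9).indices   # the 9 most-driving images
print([data.samples[int(i)][0] for i in top])
```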
976 00:50:58,440 --> 00:51:01,390 It has never been told about crowds. 977 00:51:01,390 --> 00:51:03,300 It's one of the emerging units that 978 00:51:03,300 --> 00:51:05,850 can now be used as an object detector 979 00:51:05,850 --> 00:51:08,970 to enhance the recognition of what's going on in the scene. 980 00:51:08,970 --> 00:51:13,630 So this is an ice skating area, and there is a crowd. 981 00:51:13,630 --> 00:51:18,380 There are also units that are more specific to space 982 00:51:18,380 --> 00:51:20,610 and useful for navigation, for instance. 983 00:51:20,610 --> 00:51:22,300 We have many of those. 984 00:51:22,300 --> 00:51:30,090 So in that case, just the fact that there are lamps above, 985 00:51:30,090 --> 00:51:33,810 or perspective-- we have a unit 986 00:51:33,810 --> 00:51:36,030 like this, specific to that. 987 00:51:36,030 --> 00:51:39,030 So it's not only objects as physical objects. 988 00:51:39,030 --> 00:51:40,890 There's also a collection of units 989 00:51:40,890 --> 00:51:45,130 related to spatial layout and geometry that 990 00:51:45,130 --> 00:51:49,120 are also discriminant for environments. 991 00:51:49,120 --> 00:51:52,080 And many, many others are showing up. 992 00:51:52,080 --> 00:51:53,760 So those object detectors naturally 993 00:51:53,760 --> 00:51:56,390 emerge inside this kind of network trained 994 00:51:56,390 --> 00:51:58,770 for scene understanding. 995 00:51:58,770 --> 00:52:02,740 Now I only have a couple of minutes-- well, 20 minutes-- to wrap up, 996 00:52:02,740 --> 00:52:09,670 so I'm just going to give you a hint of the next part. 997 00:52:09,670 --> 00:52:15,480 With the Places challenge that is starting this week, 998 00:52:15,480 --> 00:52:20,500 certainly in less than a year the computational vision models 999 00:52:20,500 --> 00:52:22,440 of scene recognition are going to be very 1000 00:52:22,440 --> 00:52:25,170 close to human performance. 1001 00:52:25,170 --> 00:52:30,360 And then there's still a long way to go and many more things 1002 00:52:30,360 --> 00:52:32,100 to match, like the errors-- 1003 00:52:32,100 --> 00:52:33,660 whether the errors look alike, 1004 00:52:33,660 --> 00:52:36,450 or how a category can contain many types of objects, 1005 00:52:36,450 --> 00:52:37,570 or what can happen next. 1006 00:52:37,570 --> 00:52:39,240 So you can really expand. 1007 00:52:39,240 --> 00:52:41,130 But let's say we can now consider 1008 00:52:41,130 --> 00:52:44,250 that we have a base of visual recognition 1009 00:52:44,250 --> 00:52:49,210 in a model that works pretty well at the level of humans, 1010 00:52:49,210 --> 00:52:52,140 or close enough, or will be close enough. 1011 00:52:52,140 --> 00:52:56,640 So, now that we have that, we can add the memory module. 1012 00:52:56,640 --> 00:52:59,190 How to add the memorability module 1013 00:52:59,190 --> 00:53:01,260 is really up in the air. 1014 00:53:01,260 --> 00:53:03,000 There are many ways to do it. 1015 00:53:03,000 --> 00:53:08,270 So we just did it one way, to have a grounded very first model 1016 00:53:08,270 --> 00:53:12,010 of visual memorability at the level of humans. 1017 00:53:12,010 --> 00:53:15,870 And this is going to be out-- the paper is in revision-- 1018 00:53:15,870 --> 00:53:16,495 in a few weeks. 1019 00:53:16,495 --> 00:53:19,078 And you're going to be able to download the images, the model, 1020 00:53:19,078 --> 00:53:19,800 and so on.
1021 00:53:19,800 --> 00:53:23,130 Again, this is model number one, and we hope that better 1022 00:53:23,130 --> 00:53:24,480 models can then be built. 1023 00:53:24,480 --> 00:53:28,920 So we went for the Occam's razor approach, 1024 00:53:28,920 --> 00:53:34,620 the simplest one given the model: we took AlexNet, 1025 00:53:34,620 --> 00:53:38,550 and we fed AlexNet with both ImageNet and Places. 1026 00:53:38,550 --> 00:53:41,122 Because the images that are memorable or forgettable 1027 00:53:41,122 --> 00:53:43,080 might contain objects, and they might contain places. 1028 00:53:43,080 --> 00:53:46,290 OK, so let's put the two together so we have more power. 1029 00:53:46,290 --> 00:53:50,190 We train the model, so we have the visual recognition model. 1030 00:53:50,190 --> 00:53:53,820 We do know that those units make sense. 1031 00:53:53,820 --> 00:53:55,680 They recognize that this is a kitchen. 1032 00:53:55,680 --> 00:53:58,760 And we do know why, because we have the parts. 1033 00:53:58,760 --> 00:54:02,160 So that's a classical, standard AlexNet. 1034 00:54:02,160 --> 00:54:05,295 The output is scene and object categorization, 1035 00:54:05,295 --> 00:54:07,785 so we are still in categorization land. 1036 00:54:07,785 --> 00:54:10,710 And we can remove the last layer. 1037 00:54:10,710 --> 00:54:14,610 Then there is a procedure that has been well published 1038 00:54:14,610 --> 00:54:17,790 in computer vision and computer science, 1039 00:54:17,790 --> 00:54:20,610 this notion of fine-tuning with back-propagation. 1040 00:54:20,610 --> 00:54:23,070 You take the network 1041 00:54:23,070 --> 00:54:24,990 that has learned to recognize places, 1042 00:54:24,990 --> 00:54:27,840 and you fine-tune and adapt the features 1043 00:54:27,840 --> 00:54:29,760 because the task has changed. 1044 00:54:29,760 --> 00:54:31,980 The task was recognition. 1045 00:54:31,980 --> 00:54:33,720 And now, at the end, 1046 00:54:33,720 --> 00:54:37,190 there is a new task for that network that has learned 1047 00:54:37,190 --> 00:54:40,260 all those objects and scenes. 1048 00:54:40,260 --> 00:54:44,310 And the new task is now to learn whether those elements are 1049 00:54:44,310 --> 00:54:47,700 of high, medium, or low memorability, 1050 00:54:47,700 --> 00:54:50,800 which is a continuous value. 1051 00:54:50,800 --> 00:54:54,300 So by doing this, we now have a model 1052 00:54:54,300 --> 00:54:56,610 where we give it a new image, and it's 1053 00:54:56,610 --> 00:55:02,390 going to output a score between 0 and 1. 1054 00:55:02,390 --> 00:55:06,820 And the human-to-human consistency, I'll show you, is a 0.68 correlation. 1055 00:55:06,820 --> 00:55:10,620 The human-to-computer is 0.65, which means 1056 00:55:10,620 --> 00:55:16,950 that there is now a first model that can basically 1057 00:55:16,950 --> 00:55:22,500 replicate the memorability of a group nearly at 1058 00:55:22,500 --> 00:55:24,330 the level of a human. 1059 00:55:24,330 --> 00:55:28,590 And we used a data set of 60,000 images 1060 00:55:28,590 --> 00:55:32,010 to fine-tune the recognition model, 1061 00:55:32,010 --> 00:55:35,770 as well as to test it with images it had not seen.
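A hedged sketch of that fine-tuning step: take a pretrained AlexNet, remove the categorization head, attach a single regression output, and train it on continuous memorability scores. The loss, learning rate, and sigmoid squashing here are illustrative choices, not the paper's exact recipe.

```python
# Sketch: fine-tuning a pretrained classifier into a memorability regressor.
import torch
import torch.nn as nn
from torchvision import models

net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)

# Remove the last (categorization) layer and replace it with one output unit.
net.classifier[6] = nn.Linear(4096, 1)

optimizer = torch.optim.SGD(net.parameters(), lr=1e-3, momentum=0.9)
loss_fn = nn.MSELoss()

def train_step(images, mem_scores):
    """images: (B, 3, 224, 224); mem_scores: (B,) human memorability in [0, 1]."""
    optimizer.zero_grad()
    pred = net(images).squeeze(1).sigmoid()   # squash prediction into [0, 1]
    loss = loss_fn(pred, mem_scores)
    loss.backward()                            # back-propagate through all layers
    optimizer.step()
    return loss.item()

# One illustrative step on random stand-in data:
print(train_step(torch.randn(8, 3, 224, 224), torch.rand(8)))
```

Because the backbone already encodes objects and scenes, only a modest labeled set (here, the 60,000 scored images mentioned above) is needed to adapt it to the new task.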
1062 00:55:35,770 --> 00:55:42,360 And because we can open up this new memorability network 1063 00:55:42,360 --> 00:55:47,250 and also look now at the receptive fields of the neurons 1064 00:55:47,250 --> 00:55:51,600 that are related to high- or low-memorability images-- 1065 00:55:51,600 --> 00:55:56,340 and we will publish every single receptive field on the web 1066 00:55:56,340 --> 00:55:58,420 as well, with this paper-- 1067 00:55:58,420 --> 00:56:02,900 we can now see the units that carry 1068 00:56:02,900 --> 00:56:09,270 this higher-level information of objects or space or parts 1069 00:56:09,270 --> 00:56:11,760 and so on that is associated with images 1070 00:56:11,760 --> 00:56:14,940 that are highly memorable-- in green, strongly positive-- 1071 00:56:14,940 --> 00:56:16,590 or highly forgettable. 1072 00:56:16,590 --> 00:56:21,330 And we find, again, that if you have animate objects, 1073 00:56:21,330 --> 00:56:23,280 roundish objects, 1074 00:56:23,280 --> 00:56:26,100 and so on, [INAUDIBLE] those will make 1075 00:56:26,100 --> 00:56:29,340 your images more memorable. 1076 00:56:29,340 --> 00:56:33,230 And so our last part is: now that we 1077 00:56:33,230 --> 00:56:36,910 do have this model that spits out responses 1078 00:56:36,910 --> 00:56:42,470 at the level of humans and indicates the parts that 1079 00:56:42,470 --> 00:56:47,210 are related to higher or lower memorability, at least 1080 00:56:47,210 --> 00:56:51,560 as a guideline, we can go back to a given image 1081 00:56:51,560 --> 00:56:54,680 and emphasize the parts that correspond 1082 00:56:54,680 --> 00:56:58,210 to the high-memorability receptive fields 1083 00:56:58,210 --> 00:57:01,280 and de-emphasize the parts corresponding 1084 00:57:01,280 --> 00:57:06,680 to the forgettable receptive fields. 1085 00:57:06,680 --> 00:57:08,630 And this gives you images like those on the right, 1086 00:57:08,630 --> 00:57:11,270 which have been weighted 1087 00:57:11,270 --> 00:57:12,830 by the elements that are memorable 1088 00:57:12,830 --> 00:57:14,690 and the elements that are forgettable. 1089 00:57:14,690 --> 00:57:18,360 So maybe here it doesn't matter that the ground is forgettable. 1090 00:57:18,360 --> 00:57:21,430 Here, the parts I'd like to emphasize 1091 00:57:21,430 --> 00:57:23,390 are the memorable ones. 1092 00:57:23,390 --> 00:57:26,940 But, for instance, in this image we have two people. 1093 00:57:26,940 --> 00:57:31,550 And, well, she just happened to be 1094 00:57:31,550 --> 00:57:34,700 more forgettable in this case, which would make her a perfect CIA 1095 00:57:34,700 --> 00:57:38,270 agent, if you think about it. 1096 00:57:38,270 --> 00:57:40,530 And we did test those. 1097 00:57:40,530 --> 00:57:45,320 Then, for various scenes 1098 00:57:45,320 --> 00:57:51,260 relevant to navigation, we do find that the elements of an exit 1099 00:57:51,260 --> 00:57:53,820 or entrance-- where there's basically 1100 00:57:53,820 --> 00:57:57,340 a path and a 3D structure for navigation-- 1101 00:57:57,340 --> 00:57:58,970 also tend to be more memorable. 1102 00:57:58,970 --> 00:58:02,510 So those are highlighted in these images. 1103 00:58:02,510 --> 00:58:04,640 Another example, where the kids are 1104 00:58:04,640 --> 00:58:08,890 more memorable than the features of the person. 1105 00:58:08,890 --> 00:58:09,810 And I don't-- 1106 00:58:09,810 --> 00:58:11,510 Well, I'm not an expert in games. 1107 00:58:11,510 --> 00:58:14,480 You can explain that one to me.
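One plausible way to build such a memorability weighting (the paper's exact method may differ) is to weight each high-layer unit's activation map by a signed unit-memorability association, sum, and upsample to image resolution. Everything below is a sketch with stand-in tensors: the activation shape, the association weights, and the blending rule are all assumptions.

```python
# Sketch: turning unit activations into a spatial memorability map.
import torch
import torch.nn.functional as F

def memorability_map(conv5_act, unit_weights, image_hw):
    """conv5_act: (C, h, w) activations for one image;
    unit_weights: (C,) signed unit-memorability associations (assumed given);
    returns an (H, W) map, positive where memorable content is detected."""
    heat = (unit_weights[:, None, None] * conv5_act).sum(dim=0)   # (h, w)
    heat = heat[None, None]                                        # (1, 1, h, w)
    heat = F.interpolate(heat, size=image_hw, mode="bilinear",
                         align_corners=False)[0, 0]
    return heat

# Illustrative use with stand-in tensors:
act = torch.randn(256, 6, 6)          # conv5-sized activation map
w = torch.randn(256)                   # hypothetical signed associations
mask = memorability_map(act, w, (224, 224)).sigmoid()
image = torch.rand(3, 224, 224)
emphasized = image * (0.5 + 0.5 * mask)   # brighten memorable, dim forgettable
```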
1108 00:58:14,480 --> 00:58:18,470 So I have to stop because we need a break, 1109 00:58:18,470 --> 00:58:21,420 and I might need to talk more later. 1110 00:58:21,420 --> 00:58:27,410 However, here is a vision of where we are going. 1111 00:58:27,410 --> 00:58:30,980 And maybe other people will be interested in going 1112 00:58:30,980 --> 00:58:33,350 on this adventure, 1113 00:58:33,350 --> 00:58:36,180 because it's really, really just the beginning. 1114 00:58:36,180 --> 00:58:39,410 But if you can now have a model that, 1115 00:58:39,410 --> 00:58:42,110 at the level of a group of humans, predicts which images 1116 00:58:42,110 --> 00:58:47,000 are memorable, and that also matches 1117 00:58:47,000 --> 00:58:51,530 parts of the visual regions and even the higher-level object 1118 00:58:51,530 --> 00:58:56,720 recognition areas, then you can start drawing these similarities, 1119 00:58:56,720 --> 00:59:01,070 in this kind of study, between the human brain, in all its parts 1120 00:59:01,070 --> 00:59:05,180 from perception to memory, and computational models, 1121 00:59:05,180 --> 00:59:09,770 to characterize the computations underlying 1122 00:59:09,770 --> 00:59:11,780 those particular regions. 1123 00:59:11,780 --> 00:59:15,500 Because now you have model zero. 1124 00:59:15,500 --> 00:59:17,540 I'm not saying that the model I'm showing you 1125 00:59:17,540 --> 00:59:20,520 is a model that imitates the brain. 1126 00:59:20,520 --> 00:59:23,090 But it's a model zero, and now it 1127 00:59:23,090 --> 00:59:27,110 can be tuned to better learn the computations that 1128 00:59:27,110 --> 00:59:30,110 are done at different levels along the visual- 1129 00:59:30,110 --> 00:59:36,340 to-memory hierarchy, in order to characterize visual memory.