[MUSIC PLAYING]

LAURIE BAYET: My name is Laurie Bayet. I'm a postdoc at the University of Rochester and Boston Children's Hospital, and I work on developmental cognitive neuroscience.

ALON BARAM: My name is Alon, and I'm currently at Oxford doing my PhD with Professor Tim Behrens. I'm working on computational cognitive neuroscience.

LAURIE BAYET: Alon and I are trying to use a paper by Tomaso Poggio and colleagues on a specific way to achieve invariant recognition in computer vision algorithms. So we're basically trying to implement this in a simple case first, and then moving on to face recognition under rotations.

ALON BARAM: The idea is that most of the variance in computer vision, when an algorithm tries to discover what is in an image, is due to very few transformations, like translation, which is shifting the image across the visual field, or rotation, or scaling. So Poggio has a cool idea of how to create the signature that Laurie just mentioned, which is invariant to these transformations and might reduce the sample complexity.
That is, how many examples you need in order to learn.

LAURIE BAYET: For the simple case, we just used an existing data set of digits. For the face data set, we tried to find a suitable one online, but we ended up just taking videos of people, using materials provided by the summer school. So, taking videos of people slowly rotating their heads like this.

ALON BARAM: Yeah, it was fun.

LAURIE BAYET: Moving around a little bit.

ALON BARAM: We now have a complete data set of people's heads from different angles.

LAURIE BAYET: We wanted to provide the algorithm with a hopefully limited number of raw frames from people rotating their heads like this, to act as templates, so to speak, like a kernel, so that it can then recognize unseen people under various angles. So that whenever a person is showing this profile or that profile, you would still be able to recognize them with the same level of accuracy as if they were in front of you, presenting a frontal face.
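The template-and-pooling idea described above can be sketched in a few lines. This is a minimal illustration in our own words, not code from the paper: it uses 1D signals and cyclic shifts as the transformation group (standing in for head rotations), dots the input against every transformed version of each stored template, and then pools over the transformations so that the resulting signature no longer depends on how the input itself was transformed. The function names are ours.

```python
import numpy as np

def signature(x, templates):
    """Invariant signature of signal x given stored template signals.

    For each template, dot x with every cyclically shifted version of
    that template, then pool over shifts. Pooling (here, the mean and
    max of the dot products) discards *which* shift matched, keeping
    only a summary that is the same for all shifted versions of x.
    """
    sig = []
    for t in templates:
        n = len(t)
        dots = [np.dot(x, np.roll(t, s)) for s in range(n)]
        sig.extend([np.mean(dots), np.max(dots)])
    return np.array(sig)

# Shifting the input permutes the set of dot products, so the pooled
# signature is unchanged: a shifted signal gets the same signature.
rng = np.random.default_rng(0)
x = rng.normal(size=8)
templates = [rng.normal(size=8), rng.normal(size=8)]
same = np.allclose(signature(x, templates),
                   signature(np.roll(x, 3), templates))
```

Exact invariance holds here because cyclic shifts form a group and we pool over all of them; for head rotations sampled from video frames, the pooling is over a finite sample of views, so the invariance is only approximate.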
ALON BARAM: The long-run purpose of this project, or of this i-theory, as Tommy Poggio calls it, would be to reduce the number of examples that an algorithm, for example a deep neural net, needs to see in order to learn its weights, in order to learn how to classify or retrieve images.

LAURIE BAYET: We haven't started the face part. We've only done the digits part, which worked. So we're--

ALON BARAM: It's working, basically. We hope it will also work in the endlessly more complex domain of faces.

LAURIE BAYET: Now you know.

ALON BARAM: But we're hoping.

LAURIE BAYET: We're reasonably optimistic. I don't know. We'll see.

ALON BARAM: Fingers crossed.

LAURIE BAYET: We've approached the project from very different angles but still ended up having common interests, which I guess is a hallmark of this summer school, too. Alon is very interested in the engineering problems, so to speak: how can we achieve this with machines?
And I approached the project from a developmental perspective. Given that current algorithms manage invariant face recognition based on a fairly large number of exemplars, how come infants can achieve this within a few months based on a lot of experience, but not that much: mostly looking at their parents, caregivers, and a few other exemplars, not 3,000 people from all possible angles? So this is why I was very interested in this theory, and trying to implement it manually has been pretty cool so far.

[MUSIC PLAYING]