[MUSIC PLAYING]

LAURIE BAYET: My name is Laurie Bayet. I'm a postdoc at the University of Rochester and Boston Children's Hospital, and I work on developmental cognitive neuroscience.

ALON BARAM: My name is Alon, and I'm currently at Oxford doing my PhD with Professor Tim Behrens. I'm working on computational cognitive neuroscience.

LAURIE BAYET: Alon and I are trying to use a paper by Tomaso Poggio and colleagues on a specific way to achieve invariant recognition in computer vision algorithms. So we're basically trying to implement this in a simple case first, and then moving on to face recognition under rotations.

ALON BARAM: The idea is that most of the variance in computer vision, when an algorithm tries to discover what is in an image, is due to very few transformations, like translation, which is shifting the image across the visual field, or rotation, or scaling. So Poggio has a cool idea of how to create the signature that Laurie just mentioned, which is invariant to these transformations and might reduce the sample complexity.
That is, how many examples you need in order to learn.

LAURIE BAYET: For the simple case, we just used an existing data set of digits. For the face data set, we tried to find a suitable one online, but we ended up just taking videos of people, using materials provided by the summer school. So, taking videos of people slowly rotating their heads like this.

ALON BARAM: Yeah, it was fun.

LAURIE BAYET: Moving around a little bit.

ALON BARAM: We now have a complete data set of people's heads from different angles.

LAURIE BAYET: We wanted to provide the algorithm with a hopefully limited number of raw frames from people rotating their heads like this, to act as templates, so to speak, like a kernel, so that it can then recognize unseen people under various angles. So that whenever a person is showing this profile or that profile, you would still be able to recognize them with the same level of accuracy as if they were in front of you, presenting a frontal face.
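The template-and-pooling idea described above can be sketched in a few lines. This is a minimal illustration in our own words, not code from the paper: it uses 1D signals and cyclic shifts as the transformation group (standing in for head rotations), dots the input against every transformed version of each stored template, and then pools over the transformations so that the resulting signature no longer depends on how the input itself was transformed. The function names are ours.

```python
import numpy as np

def signature(x, templates):
    """Invariant signature of signal x given stored template signals.

    For each template, dot x with every cyclically shifted version of
    that template, then pool over shifts. Pooling (here, the mean and
    max of the dot products) discards *which* shift matched, keeping
    only a summary that is the same for all shifted versions of x.
    """
    sig = []
    for t in templates:
        n = len(t)
        dots = [np.dot(x, np.roll(t, s)) for s in range(n)]
        sig.extend([np.mean(dots), np.max(dots)])
    return np.array(sig)

# Shifting the input permutes the set of dot products, so the pooled
# signature is unchanged: a shifted signal gets the same signature.
rng = np.random.default_rng(0)
x = rng.normal(size=8)
templates = [rng.normal(size=8), rng.normal(size=8)]
same = np.allclose(signature(x, templates),
                   signature(np.roll(x, 3), templates))
```

Exact invariance holds here because cyclic shifts form a group and we pool over all of them; for head rotations sampled from video frames, the pooling is over a finite sample of views, so the invariance is only approximate.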
ALON BARAM: The long-run purpose of this project, or of this i-theory, as Tommy Poggio calls it, would be to reduce the number of examples that an algorithm, for example a deep neural net, needs to see in order to learn its weights, in order to learn how to classify or retrieve images.

LAURIE BAYET: We haven't started the face part. We've only done the digits part, which worked. So we're--

ALON BARAM: It's working, basically. We hope it will also work in the endlessly more complex domain of faces.

LAURIE BAYET: Now you know.

ALON BARAM: But we're hoping.

LAURIE BAYET: We're reasonably optimistic. I don't know. We'll see.

ALON BARAM: Fingers crossed.

LAURIE BAYET: We've approached the project from very different angles but still ended up having common interests, which I guess is a hallmark of this summer school, too. Alon is very interested in the engineering problems, so to speak: how can we achieve this with machines?
And I approached the project from a developmental perspective. Given that current algorithms manage invariant face recognition based on a fairly large number of exemplars, how come infants can achieve this within a few months based on a lot of experience, but not that much: mostly looking at their parents, caregivers, and a few other exemplars, not 3,000 people from all possible angles? So this is why I was very interested in this theory, and trying to implement it manually has been pretty cool so far.

[MUSIC PLAYING]