The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

SHIMON ULLMAN: What I'm going to talk about today is very much related to this general issue of using vision, but with the final goal of seeing and understanding, in a very deep and complete manner, what's going on around you.

Let me start with just one image. We see images all the time, and we know that we can get a lot of meaning out of them. You look at one image, and you get a lot of information out of it. You understand what happened before -- that there was some kind of a flood -- and why these people are hanging out up there on the wires, and so on. All of this we get very quickly. And what we usually do in computational vision, and in much of my own work, is try to understand computational schemes that can take an image like this as input and get all this information out.

But what I want to talk about in the first part -- I'm going to break the afternoon into two talks on two different topics, but they are both closely related to this ultimate goal which drives me, which, as I think will become apparent as we go along, is this complete and full understanding, the complicated concepts and notions that we can derive from an image. What I'm going to discuss in the first part is how it all starts. And this combines vision with another topic which is very important in cognition -- part of CBMM as well, but not just this particular center; it's part of understanding cognition -- namely infant learning and how it all starts. Certainly for vision and for learning, this is a very interesting and fascinating problem. Because think about babies: they open up their eyes.
And they see -- but they cannot understand the images that they see. You can think of them as seeing pixels. They watch the pixels, and the pixels keep transforming, and they look at the world as things change around them, and so on. What this short clip is trying to make explicit is that, somehow, these pixels acquire meaning over time. The infants come to understand the world they see. The input changes from shifting pixels and light intensities into other infants, and rooms, and whatever they see around them.

So we would like to be able to understand this and to do something similar. Imagine that we have some kind of a system that starts without any specific world knowledge wired into it. It watches movies for, say, six months. And at the end of the six months, this system knows about the world: about people, agents, animals, objects, actions, and social interactions between agents, the way that infants do during the first year of life. For me at least, the goal is not necessarily the engineering part, to really build such a system, but to think about and develop schemes that would be able to deal with this and do something similar. I think it's also -- maybe I'll mention it at the end -- a very interesting direction for artificial intelligence: to think about generating not final, complete systems, but baby systems, if you want, that have some interesting and useful initial capacities, and the rest is just that they watch the world and get intelligent as time goes by.

I'm going to talk initially about two particular things that we've been working on, which we selected because we thought them particularly interesting. One has to do with hands, and the other one has to do with gaze.
The reason we selected them is that, as I'll show in a minute, for computer vision, dealing with people's hands in images and dealing with direction of gaze are very difficult problems. A lot of work has been done in computer vision on issues related to hands and to gaze. They're also very important for cognition in general -- again, I will discuss this a bit more later. Understanding hands and what they are doing -- interacting with objects, manipulating objects, action recognition -- hands are part of understanding the whole domain of actions, and social interactions between agents are part of it as well. Gaze is also very important for understanding people's intentions and the interactions between people. So these are two very basic types of objects, or concepts, that you want to acquire and that are very difficult. And the final, surprising thing is that they come very early in infant vision; they are among the first things to be learned.

You can see here what I mean when I say hands are important, for example, for action recognition. I don't know if you can tell what this person is doing. Any guess?

AUDIENCE: Talking on the phone.

SHIMON ULLMAN: Talking on the phone. And we cannot really see the phone; it's more where the hand is relative to the ear, and so on. And we can see the interactions between agents, and so on. A lot depends on understanding the body posture, and in particular, the hands. So hands are certainly very important for us.

I mentioned that it is very difficult to automatically extract and localize hands in an image. There are two reasons for this. One is that hands are very flexible, much more so than most rigid objects we encounter in the world. So a hand does not have a typical appearance.
It has so many different appearances that it's difficult to handle all of them. The other reason is that although hands in images are important, very often there is very little information in the image showing the hand. Just because of resolution, size, and partial occlusion, we know where the hands are, but we can see very little of them. We have the impression here, when we look at this, that we know what this person is doing, right? He's holding a camera and taking a picture. But if you take the image region where the hand and the camera are -- this is the camera, this is the hand, and so on -- there's really not much information. Yet we can use it very effectively, and similarly in the other images that you see here.

Children, or infants, acquire this ability to deal with hands -- and, as we'll see later, with gaze -- in a completely unsupervised manner. Nobody teaches them, "Look, this is a hand." That isn't even theoretically possible, because this capacity to deal with hands and gaze comes at the age of three months, well before language starts to develop. So all of this is entirely unsupervised: just watching things happen in an unstructured way and mastering these concepts. And when you try to imitate this in computer vision systems, there are not many learning systems that can deal well with unsupervised data. I can tell you, without elaborating on the different schemes that exist, that nothing out there can learn hands in an unsupervised way.

It may be interesting to know, just anecdotally, that when we actually started to work on this, deep networks were not what they are today. If you go back in the literature to when the term "deep networks" and the initial work on deep networks started -- at least in the group of Geoff Hinton, while Yann LeCun was doing independent things separately.
The goal then was to learn everything in an unsupervised way. That was stated as the goal of the project: to be able to build a machine that would not need supervision. You just aim it at the world, and it will start to absorb information from the world and build an internal representation in a completely unsupervised way. And they demonstrated it on simple examples -- for example, on MNIST digits. You don't tell the system that there are 10 digits and so on; you just show it data, and it builds a deep network that automatically divides the input into 10 different classes. In interesting ways, it also divides them into subclasses -- there is a closed 4 and an open 4, and so on -- in something very natural. So it was an example of dealing with multiple classes in an unsupervised manner. But when you try to do something like hands, which we tried, it basically fails as an unsupervised method.

And the problem has remained difficult. Here is a quote from Jitendra Malik -- those of you who deal with computer vision will know the name; he is a leading person in computer vision. He says that dealing with body configuration in general, and hands in particular, is maybe "the most difficult recognition problem in computer vision." I think that's probably going too far, but still, you can see that people took it to be a very difficult problem.

On the unsupervised side, which is still a big open problem, the biggest effort so far has been a paper -- already a few years ago -- from a collaboration between Google and Stanford, by Andrew Ng and others, in which they took images from 10 million YouTube movies and tried to learn whatever they could in an unsupervised way. They designed a system that was built to extract information in an unsupervised manner.
And basically, from all this information, they managed to get out three concepts: the machine developed units that were sensitive, specifically, to three different categories. One was faces; another one was cats. That's from their paper -- it's not easy to see the cat here, but there is a cat; they found the cat. And there is a sort of torso, an upper body. Three concepts -- after all of this training, three concepts emerged. And in fact, only one of them, faces, was really there, in the sense that there were units that were very sensitive to faces. For the other cases, like cats and upper bodies, the units were not all that selective.

And by the way, cats are not very surprising. You know why cats came out in these movies? If you watched YouTube, you would know. It's literally the case that if you take random movies, or millions of movies, from YouTube, many, many of them will contain cats. So in the database, after faces and bodies, cats were the third most frequent category. But this approach wouldn't get to hands, or gaze, and so on. It's really picking up only things which are very, very salient and very, very frequent in the input.

And as I said, babies do it. Now people have started to look more closely at this. One technique is to put a sort of webcam on infants. This is not an infant -- this is a slightly older person, a toddler -- but they do it with infants, and they look at what the babies are looking at. And what babies are looking at in the very initial stages are faces and hands. They really like faces, and they really like hands. And they recognize hands; they already have information and expectations about hands at a very early age.
So it's not just the visual recognition, grouping together images of hands; they also know, for example, that hands are the causal agents that move objects in the world. And this comes from an experiment by Rebecca Saxe. There is a person here working with Rebecca Saxe, right? Did she talk already?

AUDIENCE: She will.

SHIMON ULLMAN: She will. And she's worth listening to. So this is from one of her studies, in which they showed infants -- these are slightly older infants -- on a computer screen, a hand moving an object. This is not taken from the paper directly; I just drew it. A hand moving an object -- in this case, a cup or a glass. The infant watches it for a while and sort of gets bored. After that, they show the infant either the hand moving alone on the screen, or the glass moving alone on the screen. With the hand moving alone on the screen, the infants are still bored; they don't look at it much. When the cup is moving alone on the screen, they are very surprised and interested. So they know it's the hand moving the cup. It's not the cup moving the hand, and the two don't have equal status in the motion of this configuration: the originator, the actor, the mover, is the hand. This is at seven months.

But it has been known that this kind of motion, in which one object can cause another object to move, is something that babies are sensitive to not only at the age of seven months; it appears in infants as early as you can imagine and test. And for us, this was the guideline, the open door, to what may be going on that lets the infant pick up specifically on hands and quickly develop a well-developed hand detector.
Infants are known to be sensitive to motion in general. They really follow moving objects, and they use motion a lot. But motion by itself is not very helpful if you want to recognize hands. It's true that hands move, but many other things move as well. If you take random videos from a child's perspective, they see doors opening and closing; they see people moving back and forth, coming by and disappearing. Hands are just one small category of moving things. But infants are also sensitive, as I said, not just to motion but to the particular situation in which an object moves, comes in contact with another object, and causes it to move. And this is not even at the level of objects. At three months, they don't yet have a well-set notion of objects; they just begin to organize the world into separate objects. You can think of a cloud of pixels, if you want, moving around, coming in contact with stationary pixels and causing them to move. Whenever this happens, infants pay attention; they look at it preferentially. It's known that they are sensitive to this type of motion, which is called a mover event. A mover event is the event I just described: something is in motion, comes in contact with a stationary item, and causes it to move.

So we started by developing a very simple algorithm that simply looks at video, at changing images, watching for, or waiting for, mover events to occur. The way it's defined is very simple. In the algorithm, we divide the visual field into small regions, and we monitor each one of these cells in the grid for the occurrence of a mover, which means that there is some optical flow, some motion, coming into the cell and then leaving the cell, carrying with it the content of the cell. So this does require some kind of optical flow and change detection, and all of these capacities are known to be in place before three months of age.
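To make this concrete, here is a minimal sketch of such a grid-based mover-event detector. This is my own illustration under stated assumptions, not the actual system: it assumes OpenCV's dense Farneback optical flow, and the cell size, motion threshold, and the simplified "motion enters, then leaves" state machine are invented for the example.

```python
# Minimal sketch (assumptions mine) of a grid-based mover-event detector.
import cv2
import numpy as np

CELL = 40          # grid cell size in pixels (assumed)
FLOW_THRESH = 1.0  # mean flow magnitude that counts as "motion" (assumed)

def cell_motion(flow):
    """Mean optical-flow magnitude over each grid cell."""
    mag = np.linalg.norm(flow, axis=2)
    h, w = mag.shape
    gh, gw = h // CELL, w // CELL
    return mag[:gh * CELL, :gw * CELL].reshape(gh, CELL, gw, CELL).mean(axis=(1, 3))

def detect_mover_events(frames):
    """Yield (frame_index, cell_row, cell_col) whenever motion enters a
    previously stationary cell and then leaves it, roughly 'carrying'
    the cell's content with it (content tracking is omitted here)."""
    prev_gray, state = None, None  # state: 0 = stationary, 1 = motion entered
    for i, frame in enumerate(frames):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is not None:
            flow = cv2.calcOpticalFlowFarneback(
                prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
            moving = cell_motion(flow) > FLOW_THRESH
            if state is None:
                state = np.zeros(moving.shape, dtype=np.uint8)
            # Motion arrives at a stationary cell: remember it.
            state[(state == 0) & moving] = 1
            # Motion leaves a cell it had entered: report a mover event.
            left = (state == 1) & ~moving
            for r, c in zip(*np.nonzero(left)):
                yield i, r, c
            state[left] = 0
        prev_gray = gray
```

A fuller implementation would also verify that the cell's content changed when the motion left, the "carrying the content" part of the definition; the sketch only tracks the in-and-out flow pattern.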
Now, look at a video of a person manipulating objects when all you do is simply monitor all locations in the image for the occurrence of this kind of mover event. What you should pay attention to is the fact that motion by itself is not doing anything. What the algorithm does is, whenever it detects a mover event, it draws a square around it and continues to follow it, to track it, for a little bit. So you can see that the minute the hand moves something, it is detected; hands moving on their own are not triggering the system here. The hand is moving, but this by itself is not the signal. It is the interaction between the hand and an object that is detected, based on these very low-level cues. The system doesn't know about hands; it doesn't know about objects. And you can see here a false alarm, which was interesting -- you can probably understand why, and so on.

So here's an example of what happens if you just let this scheme run on hours and hours of videos. Some of these videos do not contain anything related to people. In some of them there are people, but they are just going back and forth, entering the room and leaving the room, not manipulating objects; nothing specific happens in these videos. The system that is looking for these kinds of mover events finds very rare occasions in which anything interesting happens. But you can see the output: these are just examples, from these many videos, of the kind of images it extracted by being tuned specifically to the occurrence of this specific event. You can see that you get a lot of hands.
These hand images are the continuation of the tracking. The assumption here, which we again modeled after infants: when infants see something that starts to move, they track it for about one or two seconds. So we also tracked it for about one or two seconds, and we show some images from this tracking. These are some false alarms -- there are not very many of them, but these are examples where something happened in the image that caused the detector for this specific mover event to be triggered, and it collected these images. But on the whole, as shown here, you get very good performance. It's actually surprising: if you look at all the occurrences of hands touching objects, 90% of them, in this whole collection of videos, were captured by the scheme. And the accuracy was 65%, so there were some false alarms, but the majority were fine. And what you end up with is a large collection which is mostly composed of hands.

So now, without being supplied with external supervision -- here is a hand, here is a hand, here is a hand -- you suddenly have a collection of 10,000 images, and most of them contain a hand. And if you look at this and apply completely standard algorithms for object recognition, this is sufficient for them to learn a new object. You can use a deep network, but even simpler algorithms of various sorts will do. Using this collection of images, which were identified as belonging to the same concept by all triggering the same special event -- something causing something else to move -- you will get a hand detector.

So these are just lots and lots of movies; I will not play them. Some of them can play for half an hour without a single event of this kind; others are pretty dense with such events. And the events are being detected.
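To make the "standard algorithms" step concrete, here is a minimal sketch of how the automatically collected crops could train a hand detector with no manual labels. The choices here are my own assumptions, not the actual system: HOG features with a linear SVM via scikit-learn, and crops assumed to be 64x64 grayscale patches.

```python
# Minimal sketch (assumptions mine): train a hand/background classifier
# from the crops gathered automatically around mover events.
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_features(patches):
    """HOG descriptor for each 64x64 grayscale patch."""
    return np.array([hog(p, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2)) for p in patches])

def train_hand_detector(mover_crops, background_crops):
    """mover_crops: patches collected around mover events (mostly hands);
    background_crops: random patches from the same videos."""
    X = np.vstack([hog_features(mover_crops), hog_features(background_crops)])
    y = np.concatenate([np.ones(len(mover_crops)),
                        np.zeros(len(background_crops))])
    return LinearSVC(C=1.0).fit(X, y)
```

Run with a sliding window over new still frames, such a classifier marks candidate hand locations, which is essentially what the detector output on the next slide shows.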
What is shown here -- it's not the greatest image -- is the result: all these squares, the yellow squares, are the output of a detector, having gone through this first round of learning from these events. You take the images that were labeled in this way, you give them to a standard computer vision classifier that takes these images and finds what's common to them, and then, on completely new images -- static images now -- it marks in the image, following this learning, where the hands are.

Now, this is a very good start in terms of being able to learn hands, and not only to learn hands. As I showed you before, very early on the notion of a hand, in the cognitive system, is automatically and closely associated with moving objects, with causing objects to move, as we saw in the Rebecca Saxe experiment. This also happens in this system. But the learning continues to develop, because eventually we want to learn hands not only in this grasping configuration; later on, we want to be able to recognize any hand in any configuration. So the system needs to learn more.

And this is accomplished by something where, again, the details of the specific application are less important than the principle. The principle is two subsystems in the cognitive system training each other. Together, by each one advising the other, they reach much more than either system would be able to reach on its own. The two subsystems in this case are these. One is the ability to recognize hands by their appearance: I can show you just an image of a hand, without the rest of the body or the rest of the image, and you know, by the local appearance, that this is a hand. The other: here, you cannot even see the hands of this woman, but you know where the hands are. So you can also use the surrounding context in order to find where the hands are. These are two different algorithms that are known in computer vision. They are also known in infants.
People have demonstrated, independently, before we did our work, that infants associate the body, and even the face, with the hand: when they see a hand, they immediately look up and look for a face, and they are surprised if there is no face there. So they know about the association between the body parts surrounding the hand and the hand itself. And you can think about it this way: we saw this image before, where the hand itself is not very clear, but you can get to the hand if you know where the face is, for example -- you can go to the shoulder, to the upper arm, to the lower arm, and end up at the hand. People have used this in computer vision as well for finding hands: finding hands on their own, by their own appearance, or using the surrounding body configuration.

And the nice thing is that instead of just thinking of these as two methods, two schemes, that can both produce the same final goal, they can also, during learning, help each other and guide each other. The way it goes is shown here: the appearance can help in finding hands by the body pose, and the body pose can do the same for the appearance. Suppose, for example, that initially I learned this particular hand in this particular appearance and pose; then I know it by appearance, and I know it by the pose. If I keep the same pose but completely change the appearance of the hand, I still have the pose guiding me to the right location, so I can grab a new image and say, OK, this is still a hand, but with a new appearance. Now that I have the new appearance, I can move to a new location: I recognize the hand by the appearance that I already know, but this is a new pose, so I say, aha, here is a new body configuration that ends in a hand. And then I can change the appearance again.
So you can see that, by having enough images, I can use the appearance to learn various poses that end in a hand, using the common appearance, and vice versa: if I know the pose, I can use the same pose and deduce the different appearances of the same hand. I will not go into the algorithms, but this becomes very powerful. You just go through this iteration: you start from a small subset of correct identifications, and then you let these two schemes guide each other, this kind of learning in which one system guides the other.

We see here graphs of performance. Roughly speaking -- the details are not that important -- this is called a precision-recall graph. But even without explaining the details of recall and precision, higher graphs mean better performance of the system. This is the initial performance, what you get if you just train it using hands grabbing objects. Actually, the system is doing a good job at recognizing hands, but only in the limited domain of hands touching objects; other things it does not recognize very well. That is shown here: it has high accuracy but does not cover the whole range of possible hands that you could learn. And then, without doing anything else, you just continue to watch movies, but you also integrate these two systems, each one supplying internal supervision to the other. Everything grows and grows, and after training with several hours of video, we get up to the green curve. The red curve is the absolute maximum you can get: this is using the best classifier we could get, with everything completely supervised, so that on every frame, in 10,000 frames or more, you tell the system exactly where the hand is. So the red is what you can get with a completely supervised scheme, and the green is what you can get with reasonable training -- I mean, seven hours of training; infants get more -- completely unsupervised. It just happens on its own.
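The iteration described above is essentially a co-training loop between the two detectors. Here is a schematic sketch of it; the appearance_model and pose_model objects, their methods, and the confidence threshold are all assumptions for illustration, not the actual system.

```python
# Schematic sketch (assumptions mine) of two detectors supplying
# internal supervision to each other.

def co_train(appearance_model, pose_model, unlabeled_frames,
             rounds=5, confidence=0.9):
    """Each round, one detector labels the frames it is confident about,
    and those labels become new training data for the other detector."""
    for _ in range(rounds):
        # The appearance detector proposes hand locations it is sure of...
        by_appearance = [
            (frame, box) for frame in unlabeled_frames
            for box, score in appearance_model.detect(frame)
            if score > confidence
        ]
        # ...and the pose/context detector retrains on them, learning new
        # body configurations that end in a hand.
        pose_model.train(by_appearance)

        # The pose detector returns the favor: it localizes hands from the
        # surrounding body configuration, handing the appearance detector
        # hand appearances it has never seen before.
        by_pose = [
            (frame, box) for frame in unlabeled_frames
            for box, score in pose_model.detect(frame)
            if score > confidence
        ]
        appearance_model.train(by_pose)
    return appearance_model, pose_model
```

Starting from the small, reliable seed that mover events provide, each pass expands one detector's coverage using the other's confident detections, which is why performance climbs from the initial curve toward the fully supervised red one.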
It's interesting, also, to think about infants here. And actually, I was planning to ask you a question; let's see. What else could help infants do this? Can you think of other tricks by which infants -- for whom it's very important to pick up hands -- could somehow pick them up? What other signals, or tricks, or guidelines could help them pick up hands?

OK, since I sort of gave it away with my own hands: you can think about babies waving their hands, and babies do wave their hands a lot in the air. You can imagine a scheme in which the brain knows this: you wave your hands, and the images generated by these motor activities are interpreted by the system, which already knows to grab them, and are somehow used to build hand detectors. This idea is, I think, interesting, and we know that infants are interested in their own hands. But there are reasons to believe that this is not the case. Because, for example, if you really try this, and you try to learn hands from waving your hands in this way, imitating what infants may see, a scheme that learns hands in this way is very bad at recognizing hands manipulating objects. If, after the waving hands, you test the system on a hand coming into the image and touching or grabbing something, the difference in appearance and point of view between waving your own hands and watching somebody grab an object is so large that it does not allow good generalization at this stage. And we know, from testing infants, that the first thing that they recognize well is actually other people's hands grabbing objects. So it's much more consistent with the idea that the guiding internal signal that helps them deal, in an unsupervised way, with this difficult task is the special event of the hand as the mover of objects.
OK, I want to move from hands to gaze; this part will be shorter, but I want to say something about gaze. Gaze is also, as I said, interesting. The capability starts at about three months of age. What happens at three months is that an infant may look at an adult -- a caregiver, the mother, say, or another person -- and if the other person is looking at an object over there, then the infant will look at the other person, then follow the gaze and look at the object that the other person is looking at. So it is, first of all, identification of the correct direction of gaze, but then also using it in order to get joint attention. All of these things are known to be very important in early visual development. Psychologists, child psychologists, talk a lot about this mechanism of joint attention, in which the parent, or the caregiver, and the child attend jointly. And some people, some infants, do not have this mechanism of joint attention, of being able to attend to the same event or object that the other person is attending to, and this has developmental consequences. So it's an important aspect of communication and learning, and understanding direction of gaze is very important.

Here it's perhaps even more surprising and unexpected than with hands. Because gaze, in some sense, doesn't really exist objectively out there in the scene. It's not an object, a yellow object, that you see; it's some kind of imaginary vector, if you want, pointing in a particular direction based on the face features. It's very non-explicit. And you have, somehow, to observe it, to see it, to start extracting it from what you see -- all of this in a completely unsupervised manner. So what would it take for a system to be able to watch things, get no supervision, and after a while, correctly extract direction of gaze?
Direction of gaze actually depends on two sources of information: one is the direction of the head, and the other is the direction of the eyes in their orbits. Both of them are important, and you have to master both.

There are more recent studies of this, and more accurate studies of this, but I like this reference, because it is a scientific paper on the relative effect of head orientation and eye orientation in gaze perception -- a scientific paper from 1824. So this problem was already being studied with experiments, and good judgment, and so on. The point here, by the way, is that these people look as if they are looking in different directions. But in fact, the eyes here are exactly the same. It's a cut and paste -- literally the same eye region -- and only the head is turning. And this is enough to cause us to perceive these two people as looking in two different directions.

In terms of how this learning comes about in infants, the head comes first. Initially, if the caregiver's head is pointing in a particular direction but the eyes are not, the infant will follow the head direction; this is at three months. Later on, they combine the information from the head and from the eyes.

And the eyes are really subtle cues, which we use very intuitively, very naturally. Let me hide this for a minute. This person -- it's a bad image; it's blurred, especially for those who sit in the back -- is he looking, basically, roughly at you, or is he looking at the objects down there? Sorry?

AUDIENCE: [INAUDIBLE]

SHIMON ULLMAN: Yeah, basically at you, right? Now, look at the eyes -- these are the eyes. This is all the information: the actual pixels, the same number of pixels as in the image, and so on.
So this is the information in the eyes. It's not a lot, yet we use it effectively in order to decide where the person is looking -- we just look at it. And it's interesting that it's so small and inconspicuous in objective terms, but for us, it's enough to know that the person is looking, roughly, at us.

Now, there are computer algorithms that estimate gaze. Gaze, again, is not an easy problem, and people have worked quite a lot on detecting direction of gaze. And all the schemes are highly supervised; once it's highly supervised, you can do it. By highly supervised, I mean that you give the learning system many, many images, and together with each input image, you supply the direction of gaze: this is the image, and this is the direction the person is looking. There are ways of taking this input information, the appearance of the face and the direction of gaze, associating them, and then, when you see a new face, recovering the direction of gaze. But it really depends on a large set of supervised images.

So if we want to go along the same direction as before, with getting hands correctly, we need something to replace the external supervision: some kind of signal that can tell the baby, can tell the system, without any explicit supervision -- provide some kind of internal teaching signal that tells it what the direction of gaze is. And this turns out to be very close to the hand and the mover, using the following observation. Once I have picked up an object, I can do whatever I want with it; I don't have to look at it; I can manipulate it and so on. But if it's placed somewhere and I want to pick it up -- nobody picks up objects without looking. When you pick objects up, at the moment of making the contact to pick them up, you look at them.
771 00:37:09,000 --> 00:37:12,590 And in fact, that's a spontaneous behavior which 772 00:37:12,590 --> 00:37:13,980 we checked psychophysically. 773 00:37:13,980 --> 00:37:15,900 You just tell people, pick up objects. 774 00:37:15,900 --> 00:37:17,930 You don't tell them what you are trying to do. 775 00:37:17,930 --> 00:37:21,290 And invariably, at the instant 776 00:37:21,290 --> 00:37:24,470 of grabbing the object, making the contact, they look at it. 777 00:37:24,470 --> 00:37:26,200 So if we already have a mover detector, 778 00:37:26,200 --> 00:37:28,320 or sort of a hand detector, 779 00:37:28,320 --> 00:37:31,910 that detects hands that are touching objects and causing 780 00:37:31,910 --> 00:37:33,590 them to move, all you have to do-- 781 00:37:33,590 --> 00:37:35,390 whenever you take an event like this, 782 00:37:35,390 --> 00:37:36,830 it's not only useful for hands. 783 00:37:36,830 --> 00:37:40,080 But once a hand is touching the object, this kind of mover 784 00:37:40,080 --> 00:37:43,590 event, you can freeze your video, you can take the frame. 785 00:37:43,590 --> 00:37:46,130 And you can know with high precision 786 00:37:46,130 --> 00:37:49,400 that the direction of gaze 787 00:37:49,400 --> 00:37:53,120 is now directed toward the object. 788 00:37:53,120 --> 00:37:57,600 So we asked people to manipulate objects on the table and so on. 789 00:37:57,600 --> 00:38:03,000 And what we did is we ran our previous mover detector. 790 00:38:03,000 --> 00:38:08,180 And let me skip the movie. 791 00:38:08,180 --> 00:38:12,590 But whenever this kind of a detection of a hand touching 792 00:38:12,590 --> 00:38:15,220 an object, making initial contact with an object, 793 00:38:15,220 --> 00:38:17,390 happened, we froze the image. 794 00:38:17,390 --> 00:38:19,500 Unfortunately, this is not a very good slide, 795 00:38:19,500 --> 00:38:20,750 so it may be difficult to see. 796 00:38:20,750 --> 00:38:22,460 Maybe you can see here. 797 00:38:22,460 --> 00:38:25,190 So we simply drew a vector pointing 798 00:38:25,190 --> 00:38:31,440 from the face in the direction of the detected grabbing event. 799 00:38:31,440 --> 00:38:34,730 And we assumed-- we don't know-- that's an implicit, internal, 800 00:38:34,730 --> 00:38:36,230 imaginary supervision. 801 00:38:36,230 --> 00:38:37,210 Nobody checked it. 802 00:38:37,210 --> 00:38:46,704 But we grabbed the image, and we drew the vector 803 00:38:46,704 --> 00:38:48,437 to the contact point. 804 00:38:48,437 --> 00:38:49,520 So now you have a system-- 805 00:38:49,520 --> 00:38:52,040 on the one hand, we have face images 806 00:38:52,040 --> 00:38:55,310 at the point where we took the contacts. 807 00:38:55,310 --> 00:38:56,150 So here is a face. 808 00:38:56,150 --> 00:38:58,430 And this is a descriptor, some way 809 00:38:58,430 --> 00:39:00,920 of describing the appearance of the face based 810 00:39:00,920 --> 00:39:02,770 on local gradients. 811 00:39:02,770 --> 00:39:03,270 Sorry? 812 00:39:03,270 --> 00:39:05,130 AUDIENCE: How do you find the face? 813 00:39:05,130 --> 00:39:06,505 SHIMON ULLMAN: You assume a face. 814 00:39:06,505 --> 00:39:08,085 The face detector, I just left it out-- 815 00:39:08,085 --> 00:39:08,960 AUDIENCE: [INAUDIBLE] 816 00:39:08,960 --> 00:39:09,245 SHIMON ULLMAN: Right. 817 00:39:09,245 --> 00:39:09,900 Right. 818 00:39:09,900 --> 00:39:11,330 Faces come even before-- 819 00:39:11,330 --> 00:39:15,080 I didn't talk about faces, but faces come even beforehand.
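A minimal sketch of this label-harvesting idea, assuming hypothetical detector callables detect_mover_contact and detect_face as stand-ins for the detectors described here, not the actual system:

    # Hedged sketch: harvest (face image, gaze direction) pairs with no
    # external labels, using the moment of hand-object contact as the
    # internal teaching signal.
    import numpy as np

    def collect_gaze_labels(frames, detect_mover_contact, detect_face):
        """detect_mover_contact(frame) -> (x, y) contact point, or None.
        detect_face(frame) -> (face_crop, face_center), or None."""
        dataset = []
        for frame in frames:
            contact = detect_mover_contact(frame)
            face = detect_face(frame)
            if contact is None or face is None:
                continue
            face_crop, face_center = face
            # At the instant of contact, the person is assumed to be
            # looking at the grasped object.
            v = np.asarray(contact, float) - np.asarray(face_center, float)
            v /= np.linalg.norm(v)  # unit vector from face to contact
            dataset.append((face_crop, v))
        return dataset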
820 00:39:15,080 --> 00:39:19,220 As I mentioned, the first thing that infants look at is faces. 821 00:39:19,220 --> 00:39:20,960 And this is even before the three months 822 00:39:20,960 --> 00:39:24,350 that they look at hands. 823 00:39:24,350 --> 00:39:28,680 The current theory is that, for faces, 824 00:39:28,680 --> 00:39:33,860 you're born with a primitive initial face template. 825 00:39:33,860 --> 00:39:38,050 There is some discussion about where the face template is. 826 00:39:38,050 --> 00:39:40,460 There is some evidence that it may not be in the cortex. 827 00:39:40,460 --> 00:39:41,930 It may be in the amygdala. 828 00:39:41,930 --> 00:39:44,790 But there is some evidence for this template, 829 00:39:44,790 --> 00:39:49,530 from manipulation of the patterns that infants look at. 830 00:39:49,530 --> 00:39:52,020 It's a very simple template, basically 831 00:39:52,020 --> 00:39:57,020 the two eyes-- something round, with two dark blobs. 832 00:39:57,020 --> 00:40:04,290 And this makes them fixate more on faces, in a very similar way 833 00:40:04,290 --> 00:40:06,270 to the handling of the hands. 834 00:40:06,270 --> 00:40:08,580 You just-- if you do this, from time to time, 835 00:40:08,580 --> 00:40:11,940 you will end up focusing not on a face, 836 00:40:11,940 --> 00:40:13,860 but on some random texture that has 837 00:40:13,860 --> 00:40:15,490 these two blobs or something. 838 00:40:15,490 --> 00:40:17,420 But if you really run it, then you 839 00:40:17,420 --> 00:40:19,260 will get lots and lots of face images. 840 00:40:19,260 --> 00:40:23,290 And then you'll develop a more refined face detector. 841 00:40:23,290 --> 00:40:27,300 So babies, from day one, by the way-- the way we think that-- 842 00:40:27,300 --> 00:40:31,260 the way people think it's innate is that 843 00:40:31,260 --> 00:40:34,110 experiments have been done on the first day after 844 00:40:34,110 --> 00:40:37,140 babies were born, day one. 845 00:40:37,140 --> 00:40:39,310 They keep their eyes closed most of the time. 846 00:40:39,310 --> 00:40:42,110 But when they are open, they fixate. 847 00:40:42,110 --> 00:40:44,550 You have to make big stimuli, like 848 00:40:44,550 --> 00:40:49,410 close-up faces, because the acuity is still not 849 00:40:49,410 --> 00:40:51,270 fully developed. 850 00:40:51,270 --> 00:40:53,280 But you can test what they're fixating on. 851 00:40:53,280 --> 00:40:56,070 And they fixate specifically on faces. 852 00:40:56,070 --> 00:40:58,890 And once there is a face, they fixate on it. 853 00:40:58,890 --> 00:41:01,170 And the face can move, and they will even track it. 854 00:41:01,170 --> 00:41:04,160 So this is day one. 855 00:41:04,160 --> 00:41:10,050 So faces seem to be innate in a stronger sense. 856 00:41:10,050 --> 00:41:12,210 In the case of the hand, for example, as I said, 857 00:41:12,210 --> 00:41:15,930 you cannot even imagine building an innate hand detector 858 00:41:15,930 --> 00:41:17,970 because of all this variability in appearance. 859 00:41:17,970 --> 00:41:20,520 For the face, it seems that there is an initial face 860 00:41:20,520 --> 00:41:23,340 detector which gets elaborated. 861 00:41:23,340 --> 00:41:26,370 So we assume that there is some kind of-- in these images, 862 00:41:26,370 --> 00:41:30,630 we assume that when we grab an event of contact like this, 863 00:41:30,630 --> 00:41:33,590 the face is known-- the location of the face and the location 864 00:41:33,590 --> 00:41:34,590 of the contact are known.
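A toy sketch of such a template, entirely my own illustration: a bright disk with two dark blobs, swept over a grayscale image by zero-mean correlation to pick a fixation point.

    # Toy innate face template: a bright round patch with two dark
    # "eye" blobs. Illustrative only, not the detector from the talk.
    import numpy as np

    def blob_template(size=15):
        t = np.zeros((size, size))
        yy, xx = np.mgrid[:size, :size]
        c = size // 2
        t[(yy - c) ** 2 + (xx - c) ** 2 <= c ** 2] = 1.0       # round head
        for ex in (c - size // 4, c + size // 4):               # two dark blobs
            t[(yy - (c - size // 5)) ** 2 + (xx - ex) ** 2 <= 2] = -1.0
        return t - t.mean()

    def best_fixation(image, template):
        """Location in a grayscale image where the template matches best."""
        H, W = template.shape
        best_score, best_xy = -np.inf, (0, 0)
        for y in range(image.shape[0] - H + 1):
            for x in range(image.shape[1] - W + 1):
                patch = image[y:y + H, x:x + W]
                score = ((patch - patch.mean()) * template).sum()
                if score > best_score:
                    best_score, best_xy = score, (y, x)
        return best_xy

Run on real input, a crude matcher like this would sometimes land on blob-like textures, but mostly on faces, and that bias is all the bootstrapping needs.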
865 00:41:34,590 --> 00:41:37,344 And you can draw a vector from the first to the second. 866 00:41:37,344 --> 00:41:38,760 And this is the direction of gaze. 867 00:41:38,760 --> 00:41:41,460 And when, now, you see a new image in which there 868 00:41:41,460 --> 00:41:43,080 is no contact, you just have the face, 869 00:41:43,080 --> 00:41:45,630 and you have to decide what the direction of gaze is, 870 00:41:45,630 --> 00:41:49,140 you look at similar faces that you have stored in your memory. 871 00:41:49,140 --> 00:41:54,370 And from this stored face in memory, 872 00:41:54,370 --> 00:41:56,670 you already know from the learning phase 873 00:41:56,670 --> 00:41:59,280 what the associated direction of gaze is. 874 00:41:59,280 --> 00:42:00,900 And you retrieve it. 875 00:42:00,900 --> 00:42:02,850 And this is the kind of thing that you do. 876 00:42:02,850 --> 00:42:05,490 What we see here with the yellow arrows 877 00:42:05,490 --> 00:42:07,650 are collected images, in which, again, 878 00:42:07,650 --> 00:42:11,100 the direction of gaze, the supervised direction, 879 00:42:11,100 --> 00:42:14,400 was collected automatically, 880 00:42:14,400 --> 00:42:19,500 by just identifying the direction to the contact point. 881 00:42:19,500 --> 00:42:20,580 These are some examples. 882 00:42:20,580 --> 00:42:25,800 And what this shows is just doing some psychophysics 883 00:42:25,800 --> 00:42:27,660 and comparing what this algorithm does-- which 884 00:42:27,660 --> 00:42:33,140 is sort of this infant-related algorithm which just has 885 00:42:33,140 --> 00:42:36,570 no supervision, looking in images for hands touching 886 00:42:36,570 --> 00:42:39,120 things, collecting directions of gaze, 887 00:42:39,120 --> 00:42:42,640 developing a gaze detector. 888 00:42:42,640 --> 00:42:46,770 So the red and the green, one is the model, the other one 889 00:42:46,770 --> 00:42:50,250 is human judgment on a similar situation. 890 00:42:50,250 --> 00:42:52,060 And you get good agreement. 891 00:42:52,060 --> 00:42:53,275 I mean, it's not perfect. 892 00:42:53,275 --> 00:42:54,900 It's not the state of the art, but it's 893 00:42:54,900 --> 00:42:57,090 close to state of the art. 894 00:42:57,090 --> 00:43:00,080 And this is just training with some videos. 895 00:43:00,080 --> 00:43:04,740 I mean, this certainly does at least as well as infants. 896 00:43:04,740 --> 00:43:07,530 And it keeps developing, getting from here 897 00:43:07,530 --> 00:43:12,700 to a better and better gaze detector with reduced error. 898 00:43:12,700 --> 00:43:16,050 Well, the error is pretty small here too. 899 00:43:16,050 --> 00:43:17,730 But you can improve it. 900 00:43:17,730 --> 00:43:23,340 That's already more standard additional training. 901 00:43:23,340 --> 00:43:25,980 But making the first jump of being 902 00:43:25,980 --> 00:43:29,160 able to deal with gaze at all-- 903 00:43:29,160 --> 00:43:31,920 collecting a lot of data without any supervision, which 904 00:43:31,920 --> 00:43:38,640 is quite accurate, about where the gaze is and so on-- 905 00:43:38,640 --> 00:43:42,300 this is supplied by, again, this internal teaching signal 906 00:43:42,300 --> 00:43:47,940 that can come instead of any external supervision 907 00:43:47,940 --> 00:43:52,050 and make it unnecessary. 908 00:43:52,050 --> 00:43:55,260 And you can do it without the outside supervision.
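A minimal nearest-neighbor sketch of this retrieval step, reusing the (face crop, gaze vector) pairs harvested in the earlier sketch; the crude gradient-histogram descriptor below is an illustrative stand-in for whatever local-gradient descriptor the actual system used:

    # Hedged sketch: describe a face by a local-gradient histogram,
    # find the most similar stored face, return its harvested gaze.
    import numpy as np

    def gradient_descriptor(face_crop, bins=9):
        """Crude orientation histogram of a grayscale face crop."""
        gy, gx = np.gradient(face_crop.astype(float))
        ang = np.arctan2(gy, gx) % np.pi          # gradient orientations
        mag = np.hypot(gx, gy)                     # gradient magnitudes
        hist, _ = np.histogram(ang, bins=bins, range=(0, np.pi), weights=mag)
        return hist / (np.linalg.norm(hist) + 1e-8)

    def estimate_gaze(face_crop, dataset):
        """dataset: list of (face_crop, gaze_vector) from the learning phase."""
        q = gradient_descriptor(face_crop)
        dists = [np.linalg.norm(q - gradient_descriptor(f)) for f, _ in dataset]
        return dataset[int(np.argmin(dists))][1]

A nearest-neighbor lookup is just one simple realization; any regressor trained on the harvested pairs would play the same role.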
909 00:43:55,260 --> 00:43:57,650 It also has, I think, some-- 910 00:43:57,650 --> 00:44:01,260 the beginning of the more cognitively correct associations, 911 00:44:01,260 --> 00:44:05,460 like the hand being associated with moving 912 00:44:05,460 --> 00:44:09,030 objects, and direction of gaze with 913 00:44:09,030 --> 00:44:14,260 following it to see what the object at the other end is 914 00:44:14,260 --> 00:44:15,770 and so on. 915 00:44:15,770 --> 00:44:18,450 Gaze is associated with the attention 916 00:44:18,450 --> 00:44:20,964 of people, what they are interested in at the moment. 917 00:44:20,964 --> 00:44:22,380 So it's not just the fact that you 918 00:44:22,380 --> 00:44:25,600 connect the face with the target object and so on. 919 00:44:25,600 --> 00:44:28,920 It's a good way of creating internal supervision. 920 00:44:28,920 --> 00:44:32,660 But it also starts to, I think, create the right associations, 921 00:44:32,660 --> 00:44:36,840 that the hand is associated with manipulation and goals 922 00:44:36,840 --> 00:44:38,490 of manipulating objects. 923 00:44:38,490 --> 00:44:41,730 And gaze is associated with attention, 924 00:44:41,730 --> 00:44:46,730 and what we are paying attention to, and so on. 925 00:44:46,730 --> 00:44:50,880 So you can see that you start to have-- 926 00:44:50,880 --> 00:44:53,290 based on this, if you have an image like this, 927 00:44:53,290 --> 00:44:55,935 and you can detect hands-- 928 00:44:55,935 --> 00:44:58,500 there's a scheme that does it. 929 00:44:58,500 --> 00:44:59,850 You know about hands. 930 00:44:59,850 --> 00:45:01,800 You know about direction of gaze. 931 00:45:01,800 --> 00:45:03,510 You know about-- I didn't talk about it, 932 00:45:03,510 --> 00:45:05,677 but you also follow which objects move around. 933 00:45:05,677 --> 00:45:07,260 And you know which objects are movable 934 00:45:07,260 --> 00:45:09,990 and which objects are not movable-- 935 00:45:09,990 --> 00:45:14,130 so a very simple scheme that follows the chains 936 00:45:14,130 --> 00:45:15,890 of processing that I described. 937 00:45:15,890 --> 00:45:18,520 So it already starts to know-- 938 00:45:18,520 --> 00:45:23,820 you know, it's not quite having this full representation 939 00:45:23,820 --> 00:45:24,400 in itself. 940 00:45:24,400 --> 00:45:27,930 But it's quite a way along to knowing that the two agents here-- 941 00:45:27,930 --> 00:45:30,252 the two agents are manipulating objects. 942 00:45:30,252 --> 00:45:31,710 And the one on the left is actually 943 00:45:31,710 --> 00:45:34,320 interested in the object that the other one is holding. 944 00:45:34,320 --> 00:45:35,820 So you have all the building 945 00:45:35,820 --> 00:45:37,620 blocks to start building-- 946 00:45:37,620 --> 00:45:40,140 to start having an internal description 947 00:45:40,140 --> 00:45:45,600 along this line, following the chain of processing 948 00:45:45,600 --> 00:45:47,970 that I mentioned. 949 00:45:47,970 --> 00:45:49,770 And by the way, this internal training, 950 00:45:49,770 --> 00:45:52,200 that one thing can train another, if you want-- 951 00:45:52,200 --> 00:45:54,240 a simple mover can train the hand. 952 00:45:54,240 --> 00:45:57,810 A mover and a hand together can train a gaze detector.
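Schematically, the chain might look like the sketch below; every function name in it is a hypothetical placeholder for the detectors discussed above, not part of any real system:

    # Schematic of the internal-training chain: each stage's output
    # becomes the teaching signal for the next stage.
    def bootstrap(frames, detect_mover, fit_detector):
        # Stage 1: the innate motion cue fires on mover events and
        # yields cropped patches of the moving hand (or None).
        hand_patches = []
        for frame in frames:
            patch = detect_mover(frame)
            if patch is not None:
                hand_patches.append(patch)
        # Stage 2: a generic appearance learner turns those patches into
        # a static hand detector, so motion is no longer required.
        hand_detector = fit_detector(hand_patches)
        # Stage 3 repeats the trick one level up: mover plus hand
        # contacts label gaze directions (see the earlier sketch),
        # which train a gaze detector in the same way.
        return hand_detector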
953 00:45:57,810 --> 00:46:00,840 It turns out that gaze is important in learning language, 954 00:46:00,840 --> 00:46:03,696 in disambiguating nouns and verbs 955 00:46:03,696 --> 00:46:05,070 when you learn language, when you 956 00:46:05,070 --> 00:46:06,510 acquire your first language. 957 00:46:06,510 --> 00:46:15,210 So this is from a particular verb-learning experiment. 958 00:46:15,210 --> 00:46:16,260 But let me ignore that. 959 00:46:16,260 --> 00:46:23,000 A simple example would be acquiring a noun that I say. 960 00:46:23,000 --> 00:46:25,690 I say, suddenly, oh, look at my new blicket. 961 00:46:25,690 --> 00:46:27,600 And people have done experiments like this. 962 00:46:27,600 --> 00:46:30,150 And I can say, look at my blicket, 963 00:46:30,150 --> 00:46:32,040 looking at an object on the right side 964 00:46:32,040 --> 00:46:36,000 or looking at another object on the left side, 965 00:46:36,000 --> 00:46:38,100 saying exactly the same expression. 966 00:46:38,100 --> 00:46:40,950 And people have shown that infants exposed 967 00:46:40,950 --> 00:46:43,320 to this kind of situation automatically 968 00:46:43,320 --> 00:46:45,870 associate the term, the noun "blicket," 969 00:46:45,870 --> 00:46:48,990 with the object that has been attended to. 970 00:46:48,990 --> 00:46:50,730 Namely, the gaze was used in order 971 00:46:50,730 --> 00:46:54,340 to disambiguate the reference. 972 00:46:54,340 --> 00:46:55,920 So you can see a nice-- 973 00:46:55,920 --> 00:46:58,960 starting with very low-level internal guiding signals 974 00:46:58,960 --> 00:47:03,540 of, say, moving pixels that can tell you about hands 975 00:47:03,540 --> 00:47:06,120 and about direction of gaze-- and then direction of gaze 976 00:47:06,120 --> 00:47:08,990 helps you to disambiguate the reference of words-- 977 00:47:08,990 --> 00:47:12,960 so these kinds of trajectories of internal supervision 978 00:47:12,960 --> 00:47:17,460 can help you learn to deal with the world. 979 00:47:17,460 --> 00:47:20,210 This is, to me, a part of a larger project, which we called 980 00:47:20,210 --> 00:47:22,260 the digital baby, in which we-- 981 00:47:22,260 --> 00:47:23,400 it's an extension of this. 982 00:47:23,400 --> 00:47:24,840 We really want to understand, what 983 00:47:24,840 --> 00:47:29,820 are all these various innate capacities that we 984 00:47:29,820 --> 00:47:31,460 are born with cognitively? 985 00:47:31,460 --> 00:47:38,400 And we mentioned, here, a number of suggested ones-- the mover, 986 00:47:38,400 --> 00:47:41,500 how the mover can train a gaze detector, and the co-training of two 987 00:47:41,500 --> 00:47:42,125 systems. 988 00:47:42,125 --> 00:47:45,960 And some of these things we think are happening innately 989 00:47:45,960 --> 00:47:49,320 before we begin to learn. 990 00:47:49,320 --> 00:47:53,370 And then we would like to be able to watch 991 00:47:53,370 --> 00:47:56,760 lots and lots of sensory input, which could be visual. 992 00:47:56,760 --> 00:48:00,210 It can be, in general, non-visual. 993 00:48:00,210 --> 00:48:02,310 And from this, what will 994 00:48:02,310 --> 00:48:04,740 happen is the automatic generation of lots 995 00:48:04,740 --> 00:48:09,300 of understanding of the world-- concepts like hands 996 00:48:09,300 --> 00:48:11,490 and intention, direction of looking, and eventually, 997 00:48:11,490 --> 00:48:16,790 nouns, and verbs, and so on-- so that's how we'll be able to do it.
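A toy rendering of that disambiguation step, under assumptions of my own (2-D positions, a single gaze direction; none of this is from the experiments described): bind the heard noun to whichever visible object lies closest to the gaze ray.

    # Illustrative sketch: a heard noun is bound to the attended object,
    # i.e. the object at the smallest angle off the speaker's gaze ray.
    import numpy as np

    def bind_word_to_object(word, gaze_origin, gaze_dir, objects, lexicon):
        """objects: name -> 2-D position; lexicon: noun -> object name."""
        gaze_origin = np.asarray(gaze_origin, float)
        d = np.asarray(gaze_dir, float)
        d /= np.linalg.norm(d)
        def angle_to(pos):
            v = np.asarray(pos, float) - gaze_origin
            v /= np.linalg.norm(v)
            return np.arccos(np.clip(v @ d, -1.0, 1.0))
        referent = min(objects, key=lambda name: angle_to(objects[name]))
        lexicon[word] = referent
        return referent

    lexicon = {}
    objects = {"ball": (1.0, -0.5), "toy": (-1.0, -0.5)}  # hypothetical scene
    bind_word_to_object("blicket", (0.0, 0.0), (-1.0, -0.6), objects, lexicon)
    print(lexicon)  # {'blicket': 'toy'}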
998 00:48:16,790 --> 00:48:23,010 Note that it's very different from the less structured 999 00:48:23,010 --> 00:48:27,090 direction of deep networks, which are interesting and are 1000 00:48:27,090 --> 00:48:28,230 doing wonderful things. 1001 00:48:28,230 --> 00:48:31,090 I think that they're a very useful tool. 1002 00:48:31,090 --> 00:48:34,380 But I think that they are not the answer to the digital baby. 1003 00:48:34,380 --> 00:48:41,290 They do not have the capacity to learn interesting concepts 1004 00:48:41,290 --> 00:48:44,550 in an unsupervised way. 1005 00:48:44,550 --> 00:48:45,600 They do not distinguish the meaningful from the merely salient. 1006 00:48:45,600 --> 00:48:47,760 They go, as I showed you at the very beginning, 1007 00:48:47,760 --> 00:48:50,970 with the cats, and the upper body, and so on. 1008 00:48:50,970 --> 00:48:53,100 They go only for the salient things. 1009 00:48:53,100 --> 00:48:54,590 Gaze is not a salient thing. 1010 00:48:54,590 --> 00:48:56,280 I mean, we have internal signals that 1011 00:48:56,280 --> 00:48:59,980 allow us to zoom in on meaningful things. 1012 00:48:59,980 --> 00:49:03,870 Even if they are not very salient 1013 00:49:03,870 --> 00:49:06,060 objectively in the statistical sense, 1014 00:49:06,060 --> 00:49:09,210 there is something inside us that is tuned to it. 1015 00:49:09,210 --> 00:49:10,470 We are born with it. 1016 00:49:10,470 --> 00:49:12,720 And it guides us towards extracting 1017 00:49:12,720 --> 00:49:14,970 this meaningful information, even 1018 00:49:14,970 --> 00:49:16,830 if it's not all that salient. 1019 00:49:16,830 --> 00:49:18,390 So all of these things are missing 1020 00:49:18,390 --> 00:49:25,050 from the unstructured nets-- or the networks which 1021 00:49:25,050 --> 00:49:28,200 do not have all of these pre-concepts and internal 1022 00:49:28,200 --> 00:49:28,710 guidance. 1023 00:49:28,710 --> 00:49:31,770 And I don't think that they could provide a good model 1024 00:49:31,770 --> 00:49:36,240 for cognitive learning in this sense of the digital baby. 1025 00:49:36,240 --> 00:49:41,970 Although I can see a very useful role for them, for example, 1026 00:49:41,970 --> 00:49:43,210 as just-- 1027 00:49:43,210 --> 00:49:48,820 in answer to Doreen's question-- if you want to then get, 1028 00:49:48,820 --> 00:49:50,950 from all the data and the internal supervision 1029 00:49:50,950 --> 00:49:55,350 that you provided, an accurate gaze detector, 1030 00:49:55,350 --> 00:50:00,550 then supervised training 1031 00:50:00,550 --> 00:50:07,030 of appropriate deep networks can be a very good way to go. 1032 00:50:07,030 --> 00:50:09,679 I wanted to also show you-- this is not directly related, 1033 00:50:09,679 --> 00:50:12,220 but it's something impressive about the use of hands in order 1034 00:50:12,220 --> 00:50:16,070 to understand the world, just to show you how smart infants are. 1035 00:50:16,070 --> 00:50:19,660 I talked more about detecting the hands. 1036 00:50:19,660 --> 00:50:21,730 It was more the visual aspect of, 1037 00:50:21,730 --> 00:50:24,760 here is an image, show me the hand. 1038 00:50:24,760 --> 00:50:26,630 But how they use it-- and this is 1039 00:50:26,630 --> 00:50:29,590 at the age of about one year-- maybe 13 months, 1040 00:50:29,590 --> 00:50:33,080 but one year of age. 1041 00:50:33,080 --> 00:50:34,120 Here's the experiment. 1042 00:50:34,120 --> 00:50:37,210 I think it's a really nice experiment. 1043 00:50:37,210 --> 00:50:40,090 This experiment was with an experimenter.
1044 00:50:40,090 --> 00:50:43,774 This is the experimenter, in one image. 1045 00:50:43,774 --> 00:50:45,190 What happened in the experiment is 1046 00:50:45,190 --> 00:50:48,910 that there was a sort of a lamp that you can turn 1047 00:50:48,910 --> 00:50:50,740 on by pressing it from above. 1048 00:50:50,740 --> 00:50:52,360 It sort of has this dome shape. 1049 00:50:52,360 --> 00:50:54,040 That's the white shape here. 1050 00:50:54,040 --> 00:50:57,170 You press it down, and it turns on. 1051 00:50:57,170 --> 00:50:59,740 It shines blue light, and it's very nice. 1052 00:50:59,740 --> 00:51:01,420 And babies like it. 1053 00:51:01,420 --> 00:51:03,280 And they smile at it, and they jiggle, 1054 00:51:03,280 --> 00:51:07,290 and they like this turning on of the bright light. 1055 00:51:07,290 --> 00:51:08,920 And during the experiment, what happens 1056 00:51:08,920 --> 00:51:12,760 is that the infant is sitting on its parent's lap. 1057 00:51:12,760 --> 00:51:15,910 And the experimenter is on the other side, that experimenter. 1058 00:51:15,910 --> 00:51:17,260 And she turns on the light. 1059 00:51:17,260 --> 00:51:20,350 But she turns on the light-- instead of pressing it, 1060 00:51:20,350 --> 00:51:23,830 as you'd expect, with her hand, she's 1061 00:51:23,830 --> 00:51:25,870 pressing on it with her forehead. 1062 00:51:25,870 --> 00:51:29,110 She leans forward, and she presses 1063 00:51:29,110 --> 00:51:33,366 the lamp, this dome, and the light comes on. 1064 00:51:33,366 --> 00:51:35,990 And then, these are babies that can already manipulate objects. 1065 00:51:35,990 --> 00:51:37,910 So after they see it three or four times, 1066 00:51:37,910 --> 00:51:40,990 and they are happy seeing the light coming on, 1067 00:51:40,990 --> 00:51:44,170 they are handed the lamp and asked 1068 00:51:44,170 --> 00:51:46,990 to turn it on on their own. 1069 00:51:50,290 --> 00:51:55,030 And here is the clever manipulation. 1070 00:51:55,030 --> 00:51:58,660 For half the babies, the experimenter 1071 00:51:58,660 --> 00:52:00,820 had her hands concealed. 1072 00:52:00,820 --> 00:52:02,760 She didn't have her hands here, you see? 1073 00:52:02,760 --> 00:52:06,290 No hands are under this poncho. 1074 00:52:06,290 --> 00:52:08,140 Here it's the same, very similar thing, 1075 00:52:08,140 --> 00:52:09,880 but the hands are visible. 1076 00:52:09,880 --> 00:52:11,680 Now, it turns out that the babies-- 1077 00:52:11,680 --> 00:52:13,590 or the infants-- these are not babies anymore. 1078 00:52:13,590 --> 00:52:16,750 These are young infants-- 1079 00:52:16,750 --> 00:52:18,940 some of them, when they were handed the lamp, 1080 00:52:18,940 --> 00:52:20,920 they did exactly what the experimenter did. 1081 00:52:20,920 --> 00:52:23,590 They bent over, and pressed the lamp with their forehead, 1082 00:52:23,590 --> 00:52:25,630 and turned it on. 1083 00:52:25,630 --> 00:52:27,940 And other children, instead of-- although that's 1084 00:52:27,940 --> 00:52:32,290 what they saw, when they got the lamp over to their side, 1085 00:52:32,290 --> 00:52:35,830 they turned it on by pressing it with their hand, 1086 00:52:35,830 --> 00:52:38,650 unlike what they saw the experimenter do.
1087 00:52:38,650 --> 00:52:42,030 Any prediction on your side of what happened-- 1088 00:52:42,030 --> 00:52:45,520 you see these two situations-- when which babies-- 1089 00:52:45,520 --> 00:52:48,580 I mean, in this case or in this case, in which case 1090 00:52:48,580 --> 00:52:52,840 do you think they actually did it with their hands 1091 00:52:52,840 --> 00:52:54,410 rather than using their forehead? 1092 00:52:54,410 --> 00:52:54,980 Any guess? 1093 00:52:54,980 --> 00:52:55,480 Yeah? 1094 00:52:55,480 --> 00:52:58,874 AUDIENCE: Hands in A and no hands in B. 1095 00:52:58,874 --> 00:53:00,040 SHIMON ULLMAN: That's right. 1096 00:53:00,040 --> 00:53:01,123 And what's your reasoning? 1097 00:53:01,123 --> 00:53:03,566 AUDIENCE: [INAUDIBLE]. 1098 00:53:03,566 --> 00:53:05,440 SHIMON ULLMAN: But you think about, you know, 1099 00:53:05,440 --> 00:53:11,350 a baby-- if you saw a baby, an infant, a young one-year-old, just moving 1100 00:53:11,350 --> 00:53:14,950 seemingly quasi-randomly and so on-- something like that 1101 00:53:14,950 --> 00:53:17,240 went on in their head: here, she 1102 00:53:17,240 --> 00:53:18,820 did it with her forehead. 1103 00:53:18,820 --> 00:53:20,470 She would have used her hands, but she 1104 00:53:20,470 --> 00:53:22,477 couldn't, because they were concealed, 1105 00:53:22,477 --> 00:53:23,560 and she couldn't use them. 1106 00:53:23,560 --> 00:53:25,840 So she used her forehead, but that's not the right way 1107 00:53:25,840 --> 00:53:26,830 to do it. 1108 00:53:26,830 --> 00:53:28,770 I can do it differently and so on. 1109 00:53:28,770 --> 00:53:31,249 That's sort of-- they don't say it explicitly. 1110 00:53:31,249 --> 00:53:32,290 They don't have language. 1111 00:53:32,290 --> 00:53:35,230 But that's the kind of reasoning that went on. 1112 00:53:35,230 --> 00:53:37,300 And indeed, a much larger proportion-- so this 1113 00:53:37,300 --> 00:53:44,360 is the proportion of using their hands, where the green, I think, 1114 00:53:44,360 --> 00:53:46,900 was when the hands were occupied, 1115 00:53:46,900 --> 00:53:48,971 and the other when the hands of the experimenter were free. 1116 00:53:48,971 --> 00:53:51,220 You see that there is a big difference between the two 1117 00:53:51,220 --> 00:53:52,160 groups. 1118 00:53:52,160 --> 00:53:54,010 So they notice the hands. 1119 00:53:54,010 --> 00:53:58,224 They ran through some kind of inference and reasoning. 1120 00:53:58,224 --> 00:53:59,140 What are hands useful for? 1121 00:53:59,140 --> 00:54:00,356 What should I do? 1122 00:54:00,356 --> 00:54:02,230 Should I do it in the same way because that's 1123 00:54:02,230 --> 00:54:04,100 what other people are doing? 1124 00:54:04,100 --> 00:54:05,320 Should I do it differently? 1125 00:54:05,320 --> 00:54:09,350 So I find it impressive. 1126 00:54:09,350 --> 00:54:12,760 So some general comments-- 1127 00:54:16,660 --> 00:54:19,840 general thoughts on learning and the combination of learning 1128 00:54:19,840 --> 00:54:22,540 and innate structures: there 1129 00:54:22,540 --> 00:54:28,970 is a big sort of argument in the field, 1130 00:54:28,970 --> 00:54:31,810 which has been going on since the philosophers of ancient times, 1131 00:54:31,810 --> 00:54:36,910 about whether human cognition is learned. 1132 00:54:36,910 --> 00:54:41,950 This is nativism against empiricism, where nativism 1133 00:54:41,950 --> 00:54:43,570 proposed that things are basically-- 1134 00:54:43,570 --> 00:54:47,540 we are born with what is needed in order to deal 1135 00:54:47,540 --> 00:54:48,680 with the world.
1136 00:54:48,680 --> 00:54:51,030 And empiricism, in the extreme form, 1137 00:54:51,030 --> 00:54:56,690 is that we are born with a blank slate and just a big learning 1138 00:54:56,690 --> 00:54:58,400 machine, maybe like a deep network. 1139 00:54:58,400 --> 00:55:01,370 And we learn everything from the contingencies in the world. 1140 00:55:01,370 --> 00:55:05,780 So this is empiricism versus nativism. 1141 00:55:05,780 --> 00:55:08,600 In these examples, in an interesting way, I think, 1142 00:55:08,600 --> 00:55:12,200 complex concepts were neither learned on their own 1143 00:55:12,200 --> 00:55:15,590 nor innate-- so for example, we didn't have an innate hand 1144 00:55:15,590 --> 00:55:18,080 detector, but also, it couldn't emerge 1145 00:55:18,080 --> 00:55:20,600 in a purely empiricist way. 1146 00:55:20,600 --> 00:55:24,920 But we had enough structure inside that would not 1147 00:55:24,920 --> 00:55:27,500 be the final solution, but would be the right guidance, 1148 00:55:27,500 --> 00:55:34,100 or the right infrastructure, to make learning possible. 1149 00:55:34,100 --> 00:55:36,530 And this is not just a very generic learner. 1150 00:55:36,530 --> 00:55:40,130 But in this case, the learner was informed by-- 1151 00:55:40,130 --> 00:55:46,100 you know, was looking for some mover events or things like that. 1152 00:55:46,100 --> 00:55:48,710 So it's not the hands; it was the movers. 1153 00:55:48,710 --> 00:55:52,520 And this guides the system without supervision, 1154 00:55:52,520 --> 00:55:55,610 not only making supervision unnecessary, 1155 00:55:55,610 --> 00:56:00,470 but also focusing the learner on meaningful representations, not 1156 00:56:00,470 --> 00:56:03,800 necessarily just the things that 1157 00:56:03,800 --> 00:56:09,370 jump out at you statistically from the visual input. 1158 00:56:09,370 --> 00:56:11,450 So there are these kinds of learning trajectories, 1159 00:56:11,450 --> 00:56:15,590 like the mover, hand, gaze, and reference in language-- sort 1160 00:56:15,590 --> 00:56:19,370 of natural trajectories in which one thing leads to another 1161 00:56:19,370 --> 00:56:21,380 and help us acquire things which would be very 1162 00:56:21,380 --> 00:56:25,100 difficult to extract otherwise. 1163 00:56:25,100 --> 00:56:26,720 As I mentioned at the beginning, I 1164 00:56:26,720 --> 00:56:29,180 think that there are some interesting possibilities 1165 00:56:29,180 --> 00:56:34,610 for AI, as I said, to build intelligent machines by not 1166 00:56:34,610 --> 00:56:37,300 thinking about the final intelligent system, 1167 00:56:37,300 --> 00:56:41,090 but thinking about a baby system with the right internal 1168 00:56:41,090 --> 00:56:46,730 capacities, which will make it able to then learn. 1169 00:56:46,730 --> 00:56:49,670 So the use of learning is sort of-- 1170 00:56:49,670 --> 00:56:50,840 we all follow it. 1171 00:56:50,840 --> 00:56:54,480 But the point is, probably just a big learning machine is not 1172 00:56:54,480 --> 00:56:54,980 enough. 1173 00:56:54,980 --> 00:56:57,050 It's really the combination: we 1174 00:56:57,050 --> 00:56:59,420 have to understand the kind of internal structures 1175 00:56:59,420 --> 00:57:05,335 that allow babies to efficiently extract information 1176 00:57:05,335 --> 00:57:05,960 from the world.
1177 00:57:05,960 --> 00:57:11,450 If we manage to put something like this into a baby system 1178 00:57:11,450 --> 00:57:13,196 and let it interact with the world, 1179 00:57:13,196 --> 00:57:14,570 then we have a much higher chance 1180 00:57:14,570 --> 00:57:19,854 of starting to develop really intelligent systems. 1181 00:57:19,854 --> 00:57:21,270 It's interesting, by the way, that 1182 00:57:21,270 --> 00:57:25,040 in the original paper by Turing, where he discusses the Turing 1183 00:57:25,040 --> 00:57:26,720 test and the question, 1184 00:57:26,720 --> 00:57:30,410 can machines think, he discusses the issue 1185 00:57:30,410 --> 00:57:32,300 of building intelligent machines 1186 00:57:32,300 --> 00:57:33,920 somewhere in the future. 1187 00:57:33,920 --> 00:57:37,190 And he says that his hunch is that the really good way 1188 00:57:37,190 --> 00:57:40,070 of building, eventually, intelligent computers, 1189 00:57:40,070 --> 00:57:43,850 intelligent machines, would be to build a baby 1190 00:57:43,850 --> 00:57:45,540 computer, a digital baby, and let 1191 00:57:45,540 --> 00:57:50,530 it learn, rather than thinking about the final one.