The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation, or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

ANDREI BARBU: All right. To start off with, perception is a very difficult problem. And there's a good reason why perception should be difficult, right? We get a very impoverished stimulus. We get, like, a 2D array of values from a 3D scene. Given this impoverished stimulus, we have to understand a huge amount of stuff about the world. We have to understand the 3D structure of the world. If you look at any one pixel, you have to understand the properties of the surface that produced the illumination that gave you that pixel. You have to understand the color or the texture. You have to see the color of the light that hit that surface, the roughness of the surface, et cetera. So you have a very small channel, a very small window onto the world, and you have to extract a tremendous amount of information so that you can survive and not get killed by cars regularly.

All right. That's exactly the problem we're going to talk about here, which is, how do we use our knowledge of the world to structure our perception? To actually modify what we see in order to be able to solve this problem? How do we take a small, impoverished stimulus and extract a huge amount of information about the world around us?

So let's start with a few examples where knowledge about the world really, practically changes what we see. I'm Canadian, so I'm required to show you a Canadian flag in every talk. Here we are. You can take this flag and you can give it to any of you. You can give it to any kid and you can ask them, here's a marker. Put big red marks around the regions on this flag, or the regions in this image, that are red.
And it's pretty clear to all of us that there is a distinction between the red that's in the flag, in the bars and in the maple leaf, versus the red that's actually in the background. And we can all tell those two apart. Except if you actually look at the pixel values, you open them in Photoshop, or GIMP, or whatever other program you want, you're going to notice that those pixel values are actually not particularly different. There's no threshold that you can choose that will separate the red on the flag from the red in the background. So you're doing a huge amount of inference just to solve this trivial little problem of, what color is where? You're using knowledge about regions, knowledge about flags, knowledge about transparency, in order to figure out that the red in the flag is different from the red in the background. So this is a really practical way that your knowledge about the world really changes your perception. You're not seeing the colors that are really there.

Another nice example comes from a paper by Antonio Torralba. So if you look at the scene, it's pretty blurry. And it has to be blurry because your visual system is so incredibly good, we have to degrade the input for you to see how poor it actually is if we take away some information. So this looks like a scene. And the background looks like a building. In the foreground, you can see there's maybe a street. And the thing on the street kind of looks like a car. Does it look like a car to everybody? Awesome.

We can look at a slightly different image, a very similar scene. Again, same building in the background, same street in the foreground. Now, there's kind of a blob in the foreground. And it looks as if it's a person. It looks like a person to everyone, right? Awesome. Well, the only problem is the blob on the left is exactly the same as the blob on the right. It's difficult to believe me, but you can find these two images online and in his paper.
You can open them up in your favorite image viewer. You can zoom in and you'll see they are, pixel for pixel, completely identical. So you're using a tremendous amount of information to put together the fact that, between buildings and streets, when you see these horizontal streaks, it means a car. And when you see these vertical streaks, it means people. And this really changes how you see the world. And it changes it to the point where you actually probably don't believe me that these two blobs are the same. And I couldn't believe it either until I really zoomed in and checked.

So you can see lots of interesting effects where your high-level knowledge of the world is structuring your low-level perception, and it is actually overriding it. You've seen this example with the hammer, where you were unable to recognize what's going on in a small region. But when I give you the rest of the context, you can tell that it's a hammer. And when you see the whole video, you actually don't see the hammer disappear. You're filling in information from context in single images, and you're filling in information from context in whole videos.

But if we dig into what's going on here just a little bit more: somewhere inside your head, there's something resembling a hammer detector, right? So you run a hammer detector. And you ran that hammer detector over that little region. And it said, I'm not so sure. I'm not very confident about what I see in this little region. And somewhere inside your head, there's some detector or something that can recognize someone hammering something.

So if we look at a more traditional computer vision pipeline, what you would do is run your hammer detector. You would take your hammer detector and use that knowledge in order to recognize hammering in the scene. And at the end, you would say, I'm really confused, because my hammer detector didn't work very well.
The reason why you can actually do this is because you have a feedback. You were able to recognize the hammering event as a whole, and that lets you upgrade the scores of your hammer detector, which is very unreliable in this case. So feedback was really critical in being able to understand this scene. Unfortunately, pretty much all of computer vision is feed-forward, even though most of your visual system has, for the most part, feedback connections -- more feedback than feed-forward. So in this talk, we're going to talk about that feedback. And we're going to see a way to build this feedback in a principled way: if we choose the right detections, the right algorithms, and the right representations for our low-level perception, we're going to be able to combine it with our high-level perception of the world.

So we've seen that perception is very unreliable, and that top-down knowledge really affects your perception. And what you're going to see in a moment is that one integrated representation can be used for many tasks. The advantage of these feedbacks goes beyond just better vision. It lets you solve a lot of different problems that look very, very distinct, but actually turn out to be very, very similar.

So one problem is recognition. I can give you a picture of a chair, and I can ask you, what is this? And you can tell me it's a chair. Or I can give you a picture and I can give you a sentence, this is a chair. And you can tell me, I believe you. This is true. There's also a completely different problem, which is retrieval. Related to recognition, right? How about I give you a library of videos and ask you to find me the video where the person was sitting on the chair? And you can solve that problem. You can also solve a problem like generation. I can give you a video and I can tell you, I don't know what's here. Please describe it to me.
So if you see the scene, you can say, what's on the screen? Well, there's a whole bunch of text on the screen. You can also do question answering. You can take an image like this. I can ask you a question. What's the color of the font? And you can say, the font is white. So you were able to take some very high-level knowledge that's in my head that got transmitted to your head. You were able to understand the purpose of this transmission, connect it to your perception, figure out the knowledge that I wanted extracted from your perception, and give it back to me in a way that's meaningful to me.

Even more than this, you can disambiguate. You can take a sentence that's extremely ambiguous about the world and figure out what I'm referring to. And we do this all the time. That's really what makes human communication possible, right? The fact that most of what I say is extremely ambiguous. That's why programming computers is a real pain, but talking to people is generally easier, depending on the person.

You can also acquire knowledge, right? You can look at a whole bunch of videos. If you're a child, you sort of perceive the world around you. Occasionally, an adult comes and drops a sentence here or there for you. But what's important is that no adult ever really points out what the sentence is referring to. You don't know that 'approach' refers to this particular vector when someone was doing some action. You don't know that 'apple' refers to this particular object class. Who knows what it could mean? But you get enough data, and you're able to disentangle this problem of seeing weakly-supervised videos paired with sentences. And we'll see how you can do that.

Pretty much everything I'll talk about is going to be about videos. And I'll tell you a story about how I think we can do images as well. There are a bunch of other problems that you can solve with this approach. I'm sorry?

AUDIENCE: So those images go with the video?

ANDREI BARBU: Yes.
So rather than doing videos, we're going to do images. So one thing that you can do is you can try to do translation. We haven't done this. We're going to be doing this in the fall. We have two students. And I'll tell you at the end what the story is for how you're going to do a task that sounds as if it's from language to language, but you're going to do it in a grounded way that involves vision. Even more than that, you can do planning, and I'll tell you about that at the end a little bit. And finally, you can also incorporate some theory of mind. That's actually the project that the students are doing as part of summer school, and I'll say a few words about that. What's important about this is that the parts at the top we understand better; we've published papers about them. The parts at the bottom are sort of more future work, and I'll say less about them.

Well, one important part about this is that I've shown you all these tasks, but you really have to believe me that humans perform these tasks all the time. Every time you're sitting at a table and you ask someone, give me a cup, that's a really hard vision-language task. There may be 10 different cups in front of you on the table if you're sitting at one of the big, round tables. And you have to figure out, what object am I talking about? What kind of cup am I talking about? Which cup would I be interested in? If I drank out of the cup, I would expect that you give me my cup, not your cup. Otherwise, let me know. I will sit at different tables from now on.

If I ask you, which chair should I sit in? Again, you have to solve a pretty difficult problem where you look at chairs. You figure out what I mean by which chair should I sit in. Is it that there's a chair that's reserved for someone? Is it that a chair is for a child and I'm an adult? That kind of thing. You can say something like, this is an apple.
And when you say that to a child, you're saying it for a particular reason, to convey some idea. You have to coordinate your gaze with the other person's gaze to make sure you're drawing their attention to the real object. Even more than that, you can say very abstract things, like, to win this game, you have to make a straight line out of these pieces. That means that we both agree on what a piece is. That I've drawn your attention to the right idea of a piece. That we agree on what a straight line means on this particular board. There's a lot of knowledge that goes into each of these. But the important part is that they're each grounded in perception. We have to agree on what we're seeing in front of each other in order to be able to exchange this information. And pretty much everything that we do in daily communication is a language-vision problem on some level.

All right. So if we believe these problems are important, we can make one other observation, which is that none of you got training in most of these problems. No adult ever sat you down and said, OK, now you're four. Now I'm going to teach you how to ask questions about the real world. Or no one sat you down and said, OK, now let's talk about language acquisition. You're supposed to do gradient descent. So what's important is that you have some core ability that's shared across all of these tasks. And you're able to acquire knowledge, maybe in one of these tasks or across all of these tasks. You're able to put it together. And as soon as you have this knowledge, you can use it for all these other tasks without having to learn anything else. And that's what we're going to see.

And the core of this that we're going to focus on is recognition. So we're going to build one component. This is the engineering portion of the talk. We're going to build one scoring function that takes a sentence and a video and gives you a score.
How well does this sentence match this video? If the score is 1, it means the system really believes the sentence is depicted by the video. If the score is 0, it means the system really believes the sentence does not occur anywhere in this video. And this is the basic thing that's going to allow us to connect our top-down knowledge about what's going on in the world with our low-level perception. And after we have this, we're going to see how we reformulate everything in terms of this one function, so we don't have to learn anything else about the world.

All right. So we said we need this one function, a scoring function between sentences and videos. So let's look at what we would need to have inside this function in the first place. If I give you a video like this, it's just a person riding a skateboard. And I give you a sentence: the person rode the skateboard leftward. Well, I can ask you, is the sentence true of this video? Indeed, it is true. But let's think about what you had to do in order to be able to answer this question.

Well, you had to, at some level, decide there's a person there. I'm not saying that you're doing this in this order in your brain. I'm not saying that there have to be individual stages. I'm not saying you have to have object detectors. But at some point, you had to decide there really is a person there somehow. You also had to decide there is a skateboard there. You had to look at these objects over time, or at least in one or two frames, and decide that they have a particular relationship, so that the person isn't flying in the air while the skateboard continues onwards. And you had to look at this relationship and decide, yeah, OK, this is riding. And it's happening leftward. So you have to have these components on some level. You've got to see the objects. You've got to see the relationships, the static and the changing relationships between the objects.
And you have to have some way of combining those together to form some kind of sentence, so you can represent that knowledge. And that's what we're going to do. Everything I described to you is this feed-forward system, right? We have objects. We have tracks. We take tracks and we build events, events like ride. And we take those events together and we form sentences out of them. And there's this hard separation, right? It's easy to understand a system where you have objects, tracks, events, and sentences, and you use tracks in order to see if your events happened, and your events in order to see if a particular sentence occurred. So that's what we're going to describe first. And then we're going to see how, because we're going to choose the right representations for each of these, these feedbacks become completely trivial and very natural to implement.

All right. We need to start with some object detections. Otherwise, we're just going to hallucinate objects all the time. Any off-the-shelf object detector that you choose will sometimes work. Here, we ran a person detector, in red, and a bag detector, in blue. It will sometimes give you false positives. Trees are often confused for people. I guess we're both just two long, vertical lines. And sometimes, you get false negatives. Sometimes, a bag is so deformable that you think the person's knee is the bag.

Lest you think that object detection is solved, it actually isn't. If you look at something like the ImageNet challenge, mostly people talk about image classification, the stuff in light blue. And they're saying that there's 10% error; these days, there's 5% error on this. But that's really not what you're doing in the real world. You're not classifying whole images. When you see an image in the real world, what you're doing is trying to figure out what objects are where.
And that's the red part. That's the part where you have an average precision of 50%. In other words, the object detector really, really, really sucks. Most of the time, it's going to be pretty wrong. It's very, very, very far away from how accurate you are. If your object detector was that bad, you would die every time you crossed the street.

All right. So we believe that object detection doesn't work well. In order to fix this -- because somehow we have to be able to extract some knowledge about the video that's pretty robust, for us to be able to track these objects and recognize these sentences -- we need to modify object detectors a little bit. We're going to go into our object detector. Normally, they have a threshold. At some point, they learn that if the score of this detection is above this level, I should have confidence in it. And if the score of this detection is below this level, I shouldn't have confidence in it. And what we're going to do is remove that threshold. We're going to tell the object detector, give me thousands or millions of detections in every frame. We're going to take those detections and figure out how to filter them later on.

All right. The way we're going to do this -- and this is the only slide that's going to have any equations, and it's just going to be a linear combination -- is we're going to take every detection in every frame of this video and arrange them in a lattice. In every column of this lattice, we're going to have the detections for one particular frame. And in essence, what we want is one detection for every object for every frame. In other words, we want a path through this lattice. We want to select one detection in every column. But we want tracks that have a particular property, right? If I'm approaching this microphone, you know that you expect to see me kind of far away. Then, getting closer. Then, eventually I'm close to the microphone.
You don't expect me to be over there, and then to appear over here as if I've teleported. So we want to build in this intuition that objects move smoothly, and that objects move according to how they previously moved, right? It's not like someone moves 10 pixels to the right in one frame, then 10 pixels to the left in the next frame, and keeps oscillating between the two. And that's what we're going to do. I'm not going to talk about how we compute this. It's really trivial. If you know about optical flow, you can do it.

But basically, what we want is a track where we don't hallucinate the objects. So every node in our resulting track should be strong. If we ignore the strength of the object detector, we're just going to pretend that there are a whole bunch of people in front of us. And every edge should also be strong. In other words, when we look at two detections from adjacent frames, if I have a person over here in one frame and a person over there in another frame, I shouldn't really think that's a very good person track. But if I have a person over here that kind of moved to the right in the previous frame, and I have a new detection that's just slightly to the right of that one, I should expect that it's a much better track.

So that's all we do. And encoding this intuition is very, very straightforward. It's just a linear combination. The score of one path, the score of the track of an object, is just the sum of your confidence in the detections, in every detection in every frame, along with the confidence that the object track was actually coherent. This is the only equation we're going to see in this talk, and it's just a linear combination. But it will come back to haunt us several times before the end.

All right. So we use dynamic programming. We find the path through this lattice. And this is a tracker. And actually, Viterbi did this in 1967 for radar.
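A minimal sketch of that dynamic program, in Python, for a single tracked object. The detection fields, the motion-coherence term, and the toy numbers are illustrative assumptions rather than the actual system, which also uses optical flow to predict where a box should have moved.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    x: float       # box center, x
    y: float       # box center, y
    score: float   # raw detector confidence, with the threshold removed

def coherence(prev, cur):
    # Edge score: prefer small frame-to-frame displacement. A real tracker
    # would use optical flow to predict where the box should have moved.
    return -((cur.x - prev.x) ** 2 + (cur.y - prev.y) ** 2) ** 0.5

def track(frames):
    """Viterbi over the detection lattice: pick one detection per frame,
    maximizing the sum of detection scores plus coherence scores."""
    best = [d.score for d in frames[0]]   # best path score ending at each node
    backpointers = []
    for t in range(1, len(frames)):
        scores, pointers = [], []
        for d in frames[t]:
            cands = [best[i] + coherence(p, d) for i, p in enumerate(frames[t - 1])]
            i_best = max(range(len(cands)), key=cands.__getitem__)
            scores.append(cands[i_best] + d.score)
            pointers.append(i_best)
        best = scores
        backpointers.append(pointers)
    # Trace the best path backwards through the lattice.
    j = max(range(len(best)), key=best.__getitem__)
    path = [j]
    for pointers in reversed(backpointers):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [frames[t][j] for t, j in enumerate(path)]

# Toy example: a greedy per-frame argmax would jump from x=10 to x=81;
# the joint path prefers the coherent x=80 -> x=81 track.
frames = [[Detection(10, 5, 0.9), Detection(80, 5, 0.5)],
          [Detection(12, 5, 0.5), Detection(81, 5, 0.9)]]
print(track(frames))
```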
This is not a new idea. Here, we ran it for a computer vision task where we just wanted to track objects. We ran a person detector and a motorcycle detector, but we don't have a person-standing-up detector and a person-sitting-down detector. So the tracker is good enough that it can keep the two people separate from each other, despite the fact that they're actually pretty close in the video. And you see we do a pretty decent job of tracking all the objects until they get pretty small in the field of view and the object detector doesn't work well anymore.

All right. So now what we have are the tracks of objects. We can see object motion over time in a video. And somehow, we have to look at these tracks and determine what happened to them. Was someone riding? Was someone running? Was someone bouncing up and down? In order to do this, we're going to get some features from our tracks. You can look at the track in every frame and extract out a lot of information. You can extract out the average color. You can extract out the position, the velocity, the acceleration, the aspect ratio. Anything that you want to get out of this frame, knowing that this bounding box is there, you can compute, and the algorithm doesn't care.

All right. There's one small problem, though. Most of the time, we need more complicated feature vectors. For example, for ride, it's not enough to have a feature vector that only includes the person. You need to look at the relative position between the person and the skateboard to determine that they're actually going together, and that one isn't going right while the other one is going left. So for that, what we're going to do is build a feature vector for the agent of the action -- the person, in the case of ride -- and a feature vector for the instrument -- the skateboard. We're going to concatenate the two together, so we get a bigger feature vector.
And then we're going to have some extra features that tell us about the relationships between these two. So we can include things like the distance, the relative velocity, the angle, the overlap. Anything that you want to compute between these two bounding boxes in this frame, you're welcome to compute.

All right. And if you build this feature vector between the person and the skateboard, you can recognize that the person rode the skateboard in this video. If you build a different feature vector, for example between these two people, you can recognize that the person was approaching the other person, or that the person was leaving the other person. If you build a feature vector between the skateboard and the other person, you can recognize that the skateboard is approaching the person, et cetera. So depending on which feature vector you build, you can recognize different kinds of actions.

So when we have our tracks, we know how the objects moved in these videos. We get out some feature vectors from our tracks. And what we need to do is decide what these feature vectors are actually telling us. Is the person riding that skateboard? The way we're going to do this is using hidden Markov models. Hidden Markov models are really simple. All they assume is that there is a model of the world that follows a particular kind of dynamics. In this case, imagine that we have an action like approach. I'm far away from the object. I get closer to the object. Eventually, I'm next to the object. So this action, for example, has three states: one where I was far, one as I was getting nearer, one when I was very close. And we have a particular transition structure between these states, right? We already said that I don't teleport, so I shouldn't be able to go directly from being far away to being right next to the object. You should expect me to go from the first state to the second state and then to the third state, without going from the first to the last. In each state, you have something that you want to observe about me, right?
You want to really see that I'm far away in the first state, that I'm getting closer in the second, and that I'm actually there in the third. So we have some model for what we expect to see in every state, and we can connect this with our feature vectors. So the idea is that there's some hidden information behind the motion of these objects, and we're going to assume that hidden information is represented with an HMM. And what we need to recover is the real state of these objects. So if you see a video of me moving towards this microphone, you have to recover some hidden information: which frames was I far away in? Which frames was I getting nearer in? And which frames was I actually next to the object in?

For now, what we're going to do is assume that we have one of these hidden Markov models for every different word. So for every verb, we have a different hidden Markov model. There's one for approach. There's one for pick up. There's one for ride, et cetera. And if you want to tell me what's going on in this video, you just have a big library of hidden Markov models. You apply every one to every video. You have some threshold. And anything above that threshold, you say happened, and you produce a sentence for it.

OK. If we look at how you actually figure out what this hidden information is -- what state am I in when I'm approaching this object -- it looks a lot like the tracker. What you have is a choice to make in every frame. Your choice is, which state is my action in? Is it state 1 through 3, or some other state? In the same way, in the tracker, you have to make a choice. You have to choose, which detection is the system in for each frame? And here, you also have edges. Edges tell you, how likely am I to transition between different states in my action? And every node also has a score. It's the score of, did you actually observe me doing what you're supposed to observe me doing in each state?
So if you're saying I'm in the first state, did you actually see me stationary and far away from that object? And what you want is a path through this lattice, in the same way that we had a path before. And a path just means you made a decision that I'm in state 1 in the first frame, or in state 1 in the third frame, et cetera. And that's just the linear combination of the scores. So it's the same equation we saw before.

So here's an example of this sort of feed-forward pipeline in action. We ran it over a few thousand videos. It produces output like the person carried something, the person went away, the person walked, the person had the bag. It's pretty limited in its vocabulary. It has 48 verbs, about 30 different objects, a few different prepositions. And it even works when the camera moves. So the person chased the car rightward, the person slowly ran rightward to the car. And it should also probably say the person had a really bad day, but that's for the future.

So we've seen this feed-forward pipeline. We've seen that we can get objects. We can get tracks. We can look at our tracks, get some features, run event detectors, take those event detectors, and produce some sentences. And now, all we're going to do is break down the barriers between these and show you how you can have feedback in a really, really simple way.

All right. So first, let's combine our event detector and our tracker. Because what that's going to say is, if you're looking for someone riding something, well, you should be biased towards seeing people that are riding something. So in the occlusion example, if you see someone go behind some large pillar, well, you might lose them. But you have a bias that you should reacquire someone riding a skateboard after they leave the pillar, which you don't have if you just run the tracker independently from the event detector.
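Before the tracker and the event recognizer get combined, here is a minimal sketch of the feed-forward word scorer just described: per-frame features computed from already-fixed tracks, scored by a small hand-written HMM with the same kind of Viterbi dynamic program as the tracking sketch above. The three-state "approach" model, the feature choices, and the scores are illustrative assumptions, not the models the actual system uses.

```python
import math

def pair_features(agent, patient):
    # Per-frame relational features between two tracked boxes; a real system
    # also concatenates each box's own features (position, velocity, ...)
    # and adds relative velocity, angle, overlap, and so on.
    return {"distance": math.hypot(agent.x - patient.x, agent.y - patient.y)}

# A hand-written three-state HMM for "approach": far -> getting closer -> near.
# Each state scores the per-frame features; transitions forbid skipping states.
APPROACH_STATES = [
    lambda f: 0.0 if f["distance"] > 50 else -5.0,        # far away
    lambda f: 0.0 if 10 < f["distance"] <= 50 else -5.0,  # getting closer
    lambda f: 0.0 if f["distance"] <= 10 else -5.0,       # next to it
]
APPROACH_TRANS = {(0, 0): 0.0, (0, 1): 0.0, (1, 1): 0.0, (1, 2): 0.0, (2, 2): 0.0}

def score_word(states, trans, features_per_frame):
    """Viterbi over HMM states: the same linear combination as the tracker."""
    best = {0: states[0](features_per_frame[0])}   # must start in the first state
    for f in features_per_frame[1:]:
        new_best = {}
        for (i, j), t_score in trans.items():
            if i in best:
                s = best[i] + t_score + states[j](f)
                if s > new_best.get(j, -math.inf):
                    new_best[j] = s
        best = new_best
    return best.get(len(states) - 1, -math.inf)    # must end in the last state

# Usage, given two tracks produced by the tracker sketch above
# (person_track and ball_track are hypothetical names):
#   features = [pair_features(a, p) for a, p in zip(person_track, ball_track)]
#   print(score_word(APPROACH_STATES, APPROACH_TRANS, features))
```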
So the way we're going to put them together is very, very easy. There's a reason why these two look completely identical and why the inference algorithm for each of them is identical. Right now, what we're doing is we have a tracker on the left, or on your left, and we have an event recognizer on the right. Right now, we run one, and then we feed the output of one into the other. Basically, we run one maximization, and then we run another maximization. And all we're going to do is move the max on the right over to the left. And you get the exact same inference algorithm.

The intuition behind this is that you have two lattices, and you can take the cross-product of the lattices. Basically, for every tracker node, you just look at all the event recognizer's nodes and you make one big node for each of those. And every node represents the fact that the tracker was in some state and the event recognizer was in some other state. So we have a node that says the tracker chose the first detection and the event recognizer was in the first state. We have another node that says the tracker chose the second detection and the event recognizer was still in the first state. And you do this for every detection. Then, you do the same thing for the event recognizer being in the second state, et cetera. So you're just taking a cross-product between all of the states. Does that make sense?

Another way to say it is that we have two Markov chains: one that's observing the output from the object detector, and another one that's observing the output of the middle Markov chain. And you do joint inference over them. And the way you can do joint inference is by taking the cross-product. Basically, you have two hidden Markov models, one that does tracking and one that does event recognition, and all we're going to do is joint inference in both of them.
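A sketch of that cross-product, reusing the Detection, coherence, and word-HMM pieces from the two sketches above. Each node of the joint lattice is a (detection, HMM state) pair, and the path score is still the same linear combination: detection score plus motion coherence plus state observation plus state transition. This covers one tracker and one word; the names and the single-object simplification are assumptions, and the actual system takes the product over several trackers and several words at once.

```python
import math

def joint_track_and_recognize(frames, states, trans, features):
    """Joint Viterbi over (detection index, HMM state) pairs.
    `features(det)` turns a candidate detection into the per-frame features
    that the word HMM scores; `states`, `trans`, `coherence`, and the
    Detection objects are as in the earlier sketches."""
    # best[(d, q)] = best score of any path ending at detection d, state q
    best = {(d, 0): det.score + states[0](features(det))
            for d, det in enumerate(frames[0])}
    back = []
    for t in range(1, len(frames)):
        new_best, pointers = {}, {}
        for d, det in enumerate(frames[t]):
            for (i, j), t_score in trans.items():
                for (pd, pq), prev_score in best.items():
                    if pq != i:
                        continue
                    s = (prev_score
                         + coherence(frames[t - 1][pd], det)  # tracker edge
                         + t_score                            # HMM edge
                         + det.score                          # tracker node
                         + states[j](features(det)))          # HMM node
                    if s > new_best.get((d, j), -math.inf):
                        new_best[(d, j)], pointers[(d, j)] = s, (pd, pq)
        best, back = new_best, back + [pointers]
    # The best final node must be in the word's last state.
    finals = {k: v for k, v in best.items() if k[1] == len(states) - 1}
    node = max(finals, key=finals.get)
    path = [node]
    for pointers in reversed(back):
        node = pointers[node]
        path.append(node)
    return list(reversed(path))   # one (detection index, state) pair per frame
```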
So rather than trying to choose the best detection, and then the best state for my event, I'm going to jointly figure out, what's the best detection if I assume I'm in this state? What's the best detection if I assume I'm in this other state? And at the end, I'll pick the best combination. Make sense? So this is a way for your event recognizer to influence your tracker, because now you're jointly choosing the best detection for both the tracker and the event recognizer. So that was really, really simple. We put in a tremendous amount of feedback by just taking a cross-product.

So we can see this in action. I'm going to show you the same video twice. The person is not going to move in this video at all. What we told the system is that a ball will approach a person. That's it. We didn't tell it which person. We didn't tell the system which particular ball, which direction it's going to come from, or anything like that. The top detection in this frame happens to be the window. It's a little hard to see, but it's quite a bit stronger than the person. And because neither the window nor the person ever moves in this scenario, the tracker can't possibly help you. You have no motion information. The only way to override that window detection is to know something else about the world. So we told it that the ball will approach. And you can see that, for the combined tracker and event recognizer, when the ball comes into view, everything makes more sense.

So, coming back to the question that you asked, the reason why we don't run it over small windows is because we want this effect -- knowledge from much, much later in the video, like the fact that the ball will eventually approach that person as opposed to that window -- to actually help you much earlier in the video. If you run it over small windows, you lose that effect.
789 00:26:55,450 --> 00:26:58,400 So here, you track the person correctly from the very first 790 00:26:58,400 --> 00:27:01,210 frame despite the fact that the ball only comes into view 791 00:27:01,210 --> 00:27:03,850 halfway through the video. 792 00:27:03,850 --> 00:27:05,440 There are many more examples of this. 793 00:27:05,440 --> 00:27:07,570 In this case, it's a person carrying something. 794 00:27:07,570 --> 00:27:11,789 Here, we told the system one person's carrying something. 795 00:27:11,789 --> 00:27:13,330 And you'll see when the person moves, 796 00:27:13,330 --> 00:27:16,580 we can detect the person and the bag. 797 00:27:16,580 --> 00:27:19,210 The object detector fails much, much earlier because the person 798 00:27:19,210 --> 00:27:20,860 was deformable. 799 00:27:20,860 --> 00:27:23,470 So we've seen how we can combine together trackers and events 800 00:27:23,470 --> 00:27:24,190 recognizers. 801 00:27:24,190 --> 00:27:25,684 And now, we need to add sentences. 802 00:27:25,684 --> 00:27:27,100 And the trick for adding sentences 803 00:27:27,100 --> 00:27:29,506 is going to do more of the same. 804 00:27:29,506 --> 00:27:32,020 What we're going to do is we're going to take a tracker. 805 00:27:32,020 --> 00:27:33,975 It's just exactly what we saw before. 806 00:27:33,975 --> 00:27:35,350 And what we just did a moment ago 807 00:27:35,350 --> 00:27:37,529 is we combined it with an event recognizer. 808 00:27:37,529 --> 00:27:39,820 Well, there's no reason why we can't add more trackers. 809 00:27:39,820 --> 00:27:40,840 We actually kind of did that, right? 810 00:27:40,840 --> 00:27:43,580 We were tracking both a person and a ball a moment ago. 811 00:27:43,580 --> 00:27:45,580 So we can take an even bigger cross-product, 812 00:27:45,580 --> 00:27:49,510 have multiple trackers, and have multiple words. 813 00:27:49,510 --> 00:27:51,520 So all we're saying is, I have, say, 814 00:27:51,520 --> 00:27:53,110 five trackers that are running. 815 00:27:53,110 --> 00:27:54,970 I have five words that I want to detect, 816 00:27:54,970 --> 00:27:56,495 or 10 words that I want to detect. 817 00:27:56,495 --> 00:27:58,870 And I want to make the choice for all of these 5 trackers 818 00:27:58,870 --> 00:28:03,040 jointly, so that they match all of these 10 words. 819 00:28:03,040 --> 00:28:04,762 In this picture, basically our words 820 00:28:04,762 --> 00:28:07,220 are kind of-- our sentences are kind of like bags of words, 821 00:28:07,220 --> 00:28:07,720 right? 822 00:28:07,720 --> 00:28:09,900 Every word is combined with every tracker. 823 00:28:09,900 --> 00:28:12,340 But we know if you look at the structure of a sentence 824 00:28:12,340 --> 00:28:14,920 like the tall person quickly rode the horse, 825 00:28:14,920 --> 00:28:18,880 not every word refers to every object in the sentence. 826 00:28:18,880 --> 00:28:22,234 So you can run your object detectors over your video. 827 00:28:22,234 --> 00:28:23,650 And you can look at your sentence. 828 00:28:23,650 --> 00:28:25,400 And you can look at the nouns and say, OK. 829 00:28:25,400 --> 00:28:29,170 So I have people and horses inside the sentence. 830 00:28:29,170 --> 00:28:30,260 And you can say, OK. 831 00:28:30,260 --> 00:28:32,770 Well, if I have people and horses, I need two trackers. 832 00:28:32,770 --> 00:28:34,895 But you can look a little bit more at your sentence 833 00:28:34,895 --> 00:28:37,390 and see that, oh, well, it's the other horse. 
834 00:28:37,390 --> 00:28:40,030 So you analyze your sentence and you 835 00:28:40,030 --> 00:28:42,640 can determine there are three participants in the event 836 00:28:42,640 --> 00:28:44,180 described by the sentence. 837 00:28:44,180 --> 00:28:46,360 There's a person and two horses. 838 00:28:46,360 --> 00:28:47,390 One's the agent. 839 00:28:47,390 --> 00:28:50,710 One's the patient-- the thing that's being ridden-- 840 00:28:50,710 --> 00:28:57,160 and one's source-- the thing that is being left. 841 00:28:57,160 --> 00:28:58,820 Does that make sense? 842 00:28:58,820 --> 00:28:59,440 Awesome. 843 00:28:59,440 --> 00:29:03,620 So now, given a sentence, we know that we need n trackers. 844 00:29:03,620 --> 00:29:06,640 And for every word, we can have a hidden Markov model. 845 00:29:06,640 --> 00:29:08,540 We can have a hidden Markov model for ride. 846 00:29:08,540 --> 00:29:09,589 It's just another verb. 847 00:29:09,589 --> 00:29:11,380 And we just have to be careful how we build 848 00:29:11,380 --> 00:29:12,700 a feature vector for ride. 849 00:29:12,700 --> 00:29:14,170 Because if we build it in one way, 850 00:29:14,170 --> 00:29:16,090 we're going to detect the person rode the horse. 851 00:29:16,090 --> 00:29:17,440 And if we build it in the opposite way 852 00:29:17,440 --> 00:29:19,700 by concatenating the vectors the other way around, 853 00:29:19,700 --> 00:29:22,360 we're going to detect the horse rode the person, which 854 00:29:22,360 --> 00:29:23,496 is not what we want. 855 00:29:23,496 --> 00:29:24,885 We can also detect tall. 856 00:29:24,885 --> 00:29:27,010 Tall is kind of a weird hidden Markov model, right? 857 00:29:27,010 --> 00:29:29,140 It has only a single state, but it's still a hidden Markov 858 00:29:29,140 --> 00:29:29,710 model. 859 00:29:29,710 --> 00:29:31,930 It just wants to see that this object is tall. 860 00:29:31,930 --> 00:29:36,010 So maybe its aspect ratio is more than the mean aspect ratio 861 00:29:36,010 --> 00:29:37,840 of objects of this class. 862 00:29:37,840 --> 00:29:41,369 But nonetheless, it still fits into this paradigm. 863 00:29:41,369 --> 00:29:42,910 We can do the same thing for quickly. 864 00:29:42,910 --> 00:29:43,960 We can have an HMR for that. 865 00:29:43,960 --> 00:29:44,751 We can do leftward. 866 00:29:44,751 --> 00:29:45,602 We can do away from. 867 00:29:45,602 --> 00:29:48,320 Away from looks a lot like leave. 868 00:29:48,320 --> 00:29:50,140 It's the same meaning. 869 00:29:50,140 --> 00:29:52,570 And basically, we end up with this bipartite graph. 870 00:29:52,570 --> 00:29:55,780 At the top, we have lattices that represent words. 871 00:29:55,780 --> 00:29:57,612 Each word has a hidden Markov model. 872 00:29:57,612 --> 00:29:59,070 And in the middle, we have lattices 873 00:29:59,070 --> 00:30:00,820 that represent trackers. 874 00:30:00,820 --> 00:30:03,275 We can combine them together according to the links. 875 00:30:03,275 --> 00:30:05,650 And you can get these links from your favorite dependency 876 00:30:05,650 --> 00:30:06,149 parser. 877 00:30:06,149 --> 00:30:09,400 You can get them from Boris's START system. 878 00:30:09,400 --> 00:30:12,372 Any language analysis system will give you this. 879 00:30:12,372 --> 00:30:14,080 So this is actually all the heavy lifting 880 00:30:14,080 --> 00:30:14,871 that we have to do. 881 00:30:14,871 --> 00:30:18,487 Everything from now on is kind of eye candy. 
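To make the bipartite picture above concrete, here is a small sketch of the kind of data structure that links word models to trackers. The class names and the hand-written roles are hypothetical; in the real system the word-to-participant links would come from a dependency parser or a system like START rather than being typed in by hand.

```python
from dataclasses import dataclass, field

@dataclass
class WordModel:
    word: str
    n_states: int        # e.g. 1 for "tall", a few states for a verb like "ride"
    participants: list   # indices of the trackers this word constrains

@dataclass
class SentenceModel:
    trackers: list                      # one tracker per participant
    words: list = field(default_factory=list)

def build_sentence_model():
    # "The tall person quickly rode the horse."
    m = SentenceModel(trackers=["person", "horse"])
    m.words = [
        WordModel("person",  1, [0]),   # noun: constrains tracker 0's object class
        WordModel("horse",   1, [1]),
        WordModel("tall",    1, [0]),   # adjective: one-state HMM on tracker 0
        WordModel("quickly", 1, [0]),   # adverb: speed of the agent's track
        # "ride" is asymmetric: its feature vector is built as (agent, patient),
        # so [0, 1] encodes "person rode horse" rather than the reverse.
        WordModel("ride",    3, [0, 1]),
    ]
    return m
```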
882 00:30:18,487 --> 00:30:20,320 One thing that we really wanted to make sure 883 00:30:20,320 --> 00:30:22,420 that system was doing is that we could distinguish 884 00:30:22,420 --> 00:30:23,690 different sentences. 885 00:30:23,690 --> 00:30:26,650 So we tried to come up with an experiment that 886 00:30:26,650 --> 00:30:28,987 is, in some way, maximally difficult where 887 00:30:28,987 --> 00:30:30,820 events are going to happen at the same time. 888 00:30:30,820 --> 00:30:33,280 So you can't use time in order to distinguish them. 889 00:30:33,280 --> 00:30:37,600 And the sentences only differ in one word or one lexical item. 890 00:30:37,600 --> 00:30:39,070 So in this case, we have a sentence 891 00:30:39,070 --> 00:30:41,361 like the person picked up an object and person put down 892 00:30:41,361 --> 00:30:42,400 an object. 893 00:30:42,400 --> 00:30:44,169 There are two systems that are running. 894 00:30:44,169 --> 00:30:45,460 One is running on one sentence. 895 00:30:45,460 --> 00:30:47,200 One is running on the other sentence. 896 00:30:47,200 --> 00:30:48,908 You're going to see the same video played 897 00:30:48,908 --> 00:30:50,726 twice side by side. 898 00:30:50,726 --> 00:30:52,600 And you can already see that one system, when 899 00:30:52,600 --> 00:30:54,130 we primed it to look for pickup, it 900 00:30:54,130 --> 00:30:56,020 detected me picking up my backpack. 901 00:30:56,020 --> 00:30:58,240 And then, the other one it detected one of my lab 902 00:30:58,240 --> 00:31:00,410 mates picking up a bin. 903 00:31:00,410 --> 00:31:02,849 So the only way you could focus its attention 904 00:31:02,849 --> 00:31:05,140 on the right object is if it understood the distinction 905 00:31:05,140 --> 00:31:06,740 between these two sentences, or if it 906 00:31:06,740 --> 00:31:09,410 was able to represent them. 907 00:31:09,410 --> 00:31:11,510 So we can play this game many, many times over. 908 00:31:11,510 --> 00:31:13,629 We can have it pay attention to the subject. 909 00:31:13,629 --> 00:31:15,670 Is a backpack approaching something or is a chair 910 00:31:15,670 --> 00:31:17,110 approaching something? 911 00:31:17,110 --> 00:31:19,997 We can have it pay attention to the color of an object. 912 00:31:19,997 --> 00:31:22,330 Is the red object approaching something or a blue object 913 00:31:22,330 --> 00:31:23,530 approaching something? 914 00:31:23,530 --> 00:31:26,237 We can have it pay attention to a preposition. 915 00:31:26,237 --> 00:31:28,570 Is someone picking up an object to the left of something 916 00:31:28,570 --> 00:31:29,778 or to the right of something? 917 00:31:29,778 --> 00:31:32,080 And we have many, many dozens or hundreds of these. 918 00:31:32,080 --> 00:31:34,150 And I won't bore you with all of them. 919 00:31:34,150 --> 00:31:36,812 But the important part is we can handle lots and lots 920 00:31:36,812 --> 00:31:38,020 of different parts of speech. 921 00:31:38,020 --> 00:31:40,019 And we can still represent them and we can still 922 00:31:40,019 --> 00:31:42,400 be sensitive to these subtle distinctions in the meanings 923 00:31:42,400 --> 00:31:44,900 of the sentences. 924 00:31:44,900 --> 00:31:45,400 All right. 925 00:31:45,400 --> 00:31:46,566 So we did all the hard work. 926 00:31:46,566 --> 00:31:49,240 And we actually built this recognizer-- 927 00:31:49,240 --> 00:31:51,825 the score of a sentence given a video. 
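Here is a minimal sketch of how that one score can be reused, using video retrieval as the example. score_sentence(segment, sentence) stands in for the joint-inference score just described; the function name, the top_k default, and the fixed-length chopping are all assumptions made for the illustration, not details of the actual system.

```python
def retrieve(query_sentence, video, segment_length, score_sentence, top_k=10):
    """Chop a long video into short segments and rank them by how well
    the query sentence describes each one (higher score = better match)."""
    segments = [video[i:i + segment_length]
                for i in range(0, len(video), segment_length)]
    scored = [(score_sentence(seg, query_sentence), i)
              for i, seg in enumerate(segments)]
    scored.sort(reverse=True)
    return scored[:top_k]

# Disambiguation is the same one-liner in spirit: score each candidate
# interpretation of a sentence against the video and keep the argmax.
```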
928 00:31:51,825 --> 00:31:53,200 And now, it turns out that we can 929 00:31:53,200 --> 00:31:55,150 reformulate all of these other tasks 930 00:31:55,150 --> 00:31:56,380 in terms of this one score. 931 00:31:56,380 --> 00:31:58,580 And it's going to do all the heavy lifting for us. 932 00:31:58,580 --> 00:32:00,970 So when we tune the parameters of whatever 933 00:32:00,970 --> 00:32:02,980 goes into the scoring function, we're 934 00:32:02,980 --> 00:32:05,975 going to get the ability to do all these other tasks. 935 00:32:05,975 --> 00:32:07,100 So let's look at retrieval. 936 00:32:07,100 --> 00:32:09,010 It's the most straightforward kind of task, right? 937 00:32:09,010 --> 00:32:10,360 It's what YouTube does for you. 938 00:32:10,360 --> 00:32:11,170 You go to YouTube. 939 00:32:11,170 --> 00:32:14,660 You type in a query, and YouTube comes back with some answers. 940 00:32:14,660 --> 00:32:17,650 So let's see what YouTube actually does. 941 00:32:17,650 --> 00:32:19,220 If you look at YouTube. 942 00:32:19,220 --> 00:32:21,550 And if you look at something like pickup, 943 00:32:21,550 --> 00:32:23,920 you get men picking up women. 944 00:32:23,920 --> 00:32:26,862 If you look at approach, you get men picking up women. 945 00:32:26,862 --> 00:32:28,570 If you look at put down, once upon a time 946 00:32:28,570 --> 00:32:29,944 you did get men picking up women, 947 00:32:29,944 --> 00:32:33,190 but rap is now more popular. 948 00:32:33,190 --> 00:32:35,430 If you ask something more interesting-- the person 949 00:32:35,430 --> 00:32:36,580 approached the other person-- you 950 00:32:36,580 --> 00:32:38,663 don't get videos where people approach each other. 951 00:32:38,663 --> 00:32:40,914 You get videos about how you should approach women. 952 00:32:40,914 --> 00:32:41,830 I didn't select these. 953 00:32:41,830 --> 00:32:43,660 I typed them in and this is just what happened. 954 00:32:43,660 --> 00:32:45,670 If you type in, like the person approached the cat, 955 00:32:45,670 --> 00:32:47,378 you get lots of people playing with cats, 956 00:32:47,378 --> 00:32:50,410 but no one approaching cats, including 957 00:32:50,410 --> 00:32:54,030 a link that's kind of scary and an Airbus landing. 958 00:32:54,030 --> 00:32:55,720 And I have no idea what that means. 959 00:32:55,720 --> 00:32:58,690 So what we did is we built a video retrieval system that 960 00:32:58,690 --> 00:33:00,910 actually understands what's going on in the videos as 961 00:33:00,910 --> 00:33:02,368 opposed to just looking at the tags 962 00:33:02,368 --> 00:33:03,970 that the people apply to these videos. 963 00:33:03,970 --> 00:33:05,553 People don't describe what's going on. 964 00:33:05,553 --> 00:33:08,300 People describe some high-level concept. 965 00:33:08,300 --> 00:33:10,240 So we took a whole bunch of object detectors 966 00:33:10,240 --> 00:33:13,020 that are completely of the shelf for people and for horses. 967 00:33:13,020 --> 00:33:15,070 And we took 10 Hollywood movies. 968 00:33:15,070 --> 00:33:16,810 Nominally, they're all Westerns. 969 00:33:16,810 --> 00:33:18,147 They involve people on horses. 970 00:33:18,147 --> 00:33:19,980 And the reason why we chose people on horses 971 00:33:19,980 --> 00:33:21,750 was because people on horses tend 972 00:33:21,750 --> 00:33:23,680 to be fairly larger in the field of view. 
973 00:33:23,680 --> 00:33:25,630 And given that object detectors suck so much, 974 00:33:25,630 --> 00:33:27,713 we thought we should kind of help the system along 975 00:33:27,713 --> 00:33:29,290 as best we could. 976 00:33:29,290 --> 00:33:32,400 So we built a system. 977 00:33:32,400 --> 00:33:34,820 It's a system that knows about three verbs. 978 00:33:34,820 --> 00:33:37,830 It knows about two nouns, person and horse. 979 00:33:37,830 --> 00:33:41,100 It knows about some adverbs, quickly and slowly. 980 00:33:41,100 --> 00:33:43,610 It knows about some prepositions, leftwards, 981 00:33:43,610 --> 00:33:45,814 rightwards, towards, away from. 982 00:33:45,814 --> 00:33:47,980 And given this template, you can generate about 200, 983 00:33:47,980 --> 00:33:50,210 300 different sentences. 984 00:33:50,210 --> 00:33:53,830 So we can type in something like the person rode the horse. 985 00:33:53,830 --> 00:33:55,910 And we can get a bunch of results. 986 00:33:55,910 --> 00:33:59,150 So you can see, we were 90% accurate in the top 10 results. 987 00:33:59,150 --> 00:34:01,750 You can see these are really videos of people riding horses. 988 00:34:01,750 --> 00:34:04,240 The way this works is we took one of these long videos. 989 00:34:04,240 --> 00:34:06,320 We chopped it up into many small segments 990 00:34:06,320 --> 00:34:08,596 and we ran it over each individual segment. 991 00:34:08,596 --> 00:34:10,179 You could run it over the whole video, 992 00:34:10,179 --> 00:34:12,137 but then it would just classify the whole video 993 00:34:12,137 --> 00:34:14,110 because it's an HMM and would sort of adapt 994 00:34:14,110 --> 00:34:16,009 to the length of the video. 995 00:34:16,009 --> 00:34:17,800 We can also ask for other kinds of queries, 996 00:34:17,800 --> 00:34:20,409 like the person rode the horse quickly. 997 00:34:20,409 --> 00:34:23,739 You can see we get videos that really are quicker. 998 00:34:23,739 --> 00:34:25,900 We can ask for something more ambitious, 999 00:34:25,900 --> 00:34:28,420 like the person rode the horse quickly rightward. 1000 00:34:28,420 --> 00:34:32,061 And we get videos where people are riding horses rightward. 1001 00:34:32,061 --> 00:34:32,560 All right. 1002 00:34:32,560 --> 00:34:35,340 So we did the hard work of building this recognition 1003 00:34:35,340 --> 00:34:35,840 system. 1004 00:34:35,840 --> 00:34:37,840 And we saw we can use it for another task, which 1005 00:34:37,840 --> 00:34:38,679 is retrieval. 1006 00:34:38,679 --> 00:34:39,960 But let's do something else. 1007 00:34:39,960 --> 00:34:41,080 Let's do generation. 1008 00:34:41,080 --> 00:34:43,750 Someone asked about generation earlier. 1009 00:34:43,750 --> 00:34:45,790 Generation is very similar to retrieval. 1010 00:34:45,790 --> 00:34:48,310 In retrieval, what we had was we had a fixed sentence 1011 00:34:48,310 --> 00:34:50,409 and we searched over all our videos 1012 00:34:50,409 --> 00:34:52,330 to see which ones were the best match. 1013 00:34:52,330 --> 00:34:54,310 Here, we have a fixed video. 1014 00:34:54,310 --> 00:34:56,409 And we're going to search over all our sentences. 1015 00:34:56,409 --> 00:34:58,660 The only trick is you have a language model, 1016 00:34:58,660 --> 00:35:00,930 so it can generate a huge number of sentences. 1017 00:35:00,930 --> 00:35:03,100 But we're going to see that's OK. 1018 00:35:03,100 --> 00:35:05,200 So we have a language model.
1019 00:35:05,200 --> 00:35:07,820 It's very, very small model by Boris' standards, 1020 00:35:07,820 --> 00:35:10,210 or the standard of NLP. 1021 00:35:10,210 --> 00:35:16,270 We have only four verbs, two adjectives, only four nouns, 1022 00:35:16,270 --> 00:35:17,800 some adverbs, et cetera. 1023 00:35:17,800 --> 00:35:20,380 But the important part is even if we ignore recursion, 1024 00:35:20,380 --> 00:35:22,260 we have a tremendous number of sentences. 1025 00:35:22,260 --> 00:35:24,640 And this model is recursive, so we can really 1026 00:35:24,640 --> 00:35:27,107 generate an infinite number of sentences from it. 1027 00:35:27,107 --> 00:35:28,690 But nonetheless, it turns out that you 1028 00:35:28,690 --> 00:35:30,470 can search the space of sentences very, 1029 00:35:30,470 --> 00:35:33,520 very efficiently and actually find the global optimum. 1030 00:35:33,520 --> 00:35:36,980 And the intuition for why that's true is pretty straightforward. 1031 00:35:36,980 --> 00:35:39,790 You can think of your sentence as a constraint on what you 1032 00:35:39,790 --> 00:35:41,050 can see in the world. 1033 00:35:41,050 --> 00:35:43,870 The longer your sentence, the more constrains you have. 1034 00:35:43,870 --> 00:35:45,710 So the lower the overall score is. 1035 00:35:45,710 --> 00:35:47,740 So every time you add a word, the score 1036 00:35:47,740 --> 00:35:49,870 can't possibly increase, right? 1037 00:35:49,870 --> 00:35:51,370 The score has to always decrease. 1038 00:35:51,370 --> 00:35:53,590 So basically, you have this monotonically-decreasing 1039 00:35:53,590 --> 00:35:56,814 function over a lattice of sentences. 1040 00:35:56,814 --> 00:35:58,480 And if you ignore the fact that you only 1041 00:35:58,480 --> 00:36:00,271 have to search sentences, you can start off 1042 00:36:00,271 --> 00:36:02,940 with individual words, aggregate words together. 1043 00:36:02,940 --> 00:36:04,652 So you look at all one-word phrases. 1044 00:36:04,652 --> 00:36:06,610 You can a two-word phrases, three-word phrases. 1045 00:36:06,610 --> 00:36:09,734 Eventually, get out to real sentences. 1046 00:36:09,734 --> 00:36:11,650 But because this is a monotonically-decreasing 1047 00:36:11,650 --> 00:36:14,590 function, this is a very quick search. 1048 00:36:14,590 --> 00:36:16,930 So you can start off with an empty set. 1049 00:36:16,930 --> 00:36:18,140 You can add a word. 1050 00:36:18,140 --> 00:36:20,080 For example, you can add carried. 1051 00:36:20,080 --> 00:36:23,170 You can look at all the ways that you can extend carried 1052 00:36:23,170 --> 00:36:24,250 with another word or two. 1053 00:36:24,250 --> 00:36:26,260 So you get a phrase like the person carried. 1054 00:36:26,260 --> 00:36:28,000 And you can keep adding words to it 1055 00:36:28,000 --> 00:36:31,090 until you get to the global optimum. 1056 00:36:31,090 --> 00:36:33,270 So given a video like this, where 1057 00:36:33,270 --> 00:36:34,984 you see me doing something, you can 1058 00:36:34,984 --> 00:36:36,400 produce a sentence like the person 1059 00:36:36,400 --> 00:36:38,358 to the right of the bin picked up the backpack. 1060 00:36:41,034 --> 00:36:42,450 And that's pretty straightforward. 1061 00:36:42,450 --> 00:36:45,320 We built a generator in just a few lines of code 1062 00:36:45,320 --> 00:36:47,454 as long as we had our recognition system. 1063 00:36:47,454 --> 00:36:49,370 So you have this problem in question answering 1064 00:36:49,370 --> 00:36:51,790 that you have to connect two sentences with a video. 
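Before the question-answering discussion continues, here is a sketch of the sentence-generation search just described, written as a best-first search that relies on the score only ever decreasing as words are added. The grammar object, its methods, and the score function are placeholders for the real language model and recognizer, not the actual implementation.

```python
import heapq

def generate(video, grammar, score):
    # Max-heap via negated scores; start from the empty partial sentence.
    frontier = [(-score(video, []), [])]
    while frontier:
        neg, partial = heapq.heappop(frontier)
        if grammar.is_complete(partial):
            # Because adding a word can never raise the score, the first
            # complete sentence popped is the global optimum.
            return partial, -neg
        for word in grammar.expansions(partial):
            extended = partial + [word]
            heapq.heappush(frontier, (-score(video, extended), extended))
    return None, float("-inf")

# Question answering reuses the same search: seed `partial` with the question
# template ("the person put ___ on top of the red car") and only allow
# expansions that fill the gap with a noun phrase.
```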
1065 00:36:51,790 --> 00:36:53,210 And instead of doing that, what we're going to do 1066 00:36:53,210 --> 00:36:55,293 is we're going to make some connection between two 1067 00:36:55,293 --> 00:36:56,320 sentences. 1068 00:36:56,320 --> 00:36:57,820 So we're going to take our question. 1069 00:36:57,820 --> 00:37:00,111 We're going to give it to something like Boris' system. 1070 00:37:00,111 --> 00:37:02,390 And it's going to tell us this question, 1071 00:37:02,390 --> 00:37:06,879 like, what did the person put on top of the red car? 1072 00:37:06,879 --> 00:37:09,170 If you wanted to answer it, you would produce an answer 1073 00:37:09,170 --> 00:37:12,980 like, the person put some noun phrase on top of the red car. 1074 00:37:12,980 --> 00:37:15,140 So you can run the generation system exactly 1075 00:37:15,140 --> 00:37:16,190 as was suggested. 1076 00:37:16,190 --> 00:37:17,360 You seed it with this. 1077 00:37:17,360 --> 00:37:18,830 You give it a constraint that what 1078 00:37:18,830 --> 00:37:20,840 it has to produce next inside this empty gap 1079 00:37:20,840 --> 00:37:22,700 is a noun phrase. 1080 00:37:22,700 --> 00:37:24,430 And you're going to get out the answer. 1081 00:37:24,430 --> 00:37:25,846 Another way to think about this is 1082 00:37:25,846 --> 00:37:27,680 you have sort of a partial detector. 1083 00:37:27,680 --> 00:37:30,290 You look inside the video to see where it matches. 1084 00:37:30,290 --> 00:37:32,790 You choose the best region where it matches, 1085 00:37:32,790 --> 00:37:34,549 and then you complete your sentence. 1086 00:37:34,549 --> 00:37:36,090 And you get an answer like the person 1087 00:37:36,090 --> 00:37:39,187 put the pair on top of the red car. 1088 00:37:39,187 --> 00:37:41,270 There's one small problem with question answering, 1089 00:37:41,270 --> 00:37:44,220 and it differs from generation in one way. 1090 00:37:44,220 --> 00:37:45,922 So imagine that we're in a parking lot 1091 00:37:45,922 --> 00:37:48,380 and there are a hundred white cars inside this parking lot. 1092 00:37:48,380 --> 00:37:51,232 And you come to me desperate and you say, I lost my keys. 1093 00:37:51,232 --> 00:37:52,190 And I say, don't worry. 1094 00:37:52,190 --> 00:37:54,110 I know exactly where your keys are. 1095 00:37:54,110 --> 00:37:56,510 And you look at me and I say, they're in the white car. 1096 00:37:56,510 --> 00:37:58,260 And then you think I'm a complete asshole, 1097 00:37:58,260 --> 00:38:00,590 because that was totally worthless information, right? 1098 00:38:00,590 --> 00:38:02,510 I told you something that's basically true. 1099 00:38:02,510 --> 00:38:04,640 It's a parking lot full of white cars, 1100 00:38:04,640 --> 00:38:07,742 but isn't actually giving you anything useful. 1101 00:38:07,742 --> 00:38:09,200 So to handle this-- in the same way 1102 00:38:09,200 --> 00:38:11,660 that in generation, we had this one parameter 1103 00:38:11,660 --> 00:38:14,679 that we could tune to get, more or less, for both sentences. 1104 00:38:14,679 --> 00:38:16,220 We're going to add only one parameter 1105 00:38:16,220 --> 00:38:17,990 to question answering, which is kind 1106 00:38:17,990 --> 00:38:20,600 of a truthfulness parameter. 1107 00:38:20,600 --> 00:38:26,360 Which basically is going to say, this sentence, the person put 1108 00:38:26,360 --> 00:38:29,540 an object on top of the red car in this video, 1109 00:38:29,540 --> 00:38:30,865 is very ambiguous, right? 
1110 00:38:30,865 --> 00:38:32,240 It could either be Danny that did 1111 00:38:32,240 --> 00:38:33,890 it or it could be me that put something 1112 00:38:33,890 --> 00:38:35,054 on top of the red car. 1113 00:38:35,054 --> 00:38:36,470 So what we're going to do is we're 1114 00:38:36,470 --> 00:38:38,269 going to take this candidate's answer. 1115 00:38:38,269 --> 00:38:39,810 We're going to run it over the video. 1116 00:38:39,810 --> 00:38:41,935 And we're going to see how many times it has really 1117 00:38:41,935 --> 00:38:43,220 close matches in the video. 1118 00:38:43,220 --> 00:38:44,720 And depending on this one parameter, 1119 00:38:44,720 --> 00:38:46,136 we're going to say you are allowed 1120 00:38:46,136 --> 00:38:48,290 to say more things about the video 1121 00:38:48,290 --> 00:38:51,350 to become more specific about what you're referring to. 1122 00:38:51,350 --> 00:38:53,960 But potentially, slightly less true because the score 1123 00:38:53,960 --> 00:38:54,591 will be lower. 1124 00:38:54,591 --> 00:38:56,090 In the same way that you were saying 1125 00:38:56,090 --> 00:38:58,190 slightly more in the generation case 1126 00:38:58,190 --> 00:39:00,140 at the risk of saying potentially something 1127 00:39:00,140 --> 00:39:02,450 that's slightly less true. 1128 00:39:02,450 --> 00:39:05,180 So this way, you can ignore the sentence, which is unhelpful. 1129 00:39:05,180 --> 00:39:06,638 And you can end up saying something 1130 00:39:06,638 --> 00:39:09,140 like, the person on the left of the car 1131 00:39:09,140 --> 00:39:12,590 put an object on top of the red car. 1132 00:39:12,590 --> 00:39:14,490 So we can actually do that and the system 1133 00:39:14,490 --> 00:39:15,950 produces that output. 1134 00:39:15,950 --> 00:39:18,054 We built one recognition approach. 1135 00:39:18,054 --> 00:39:19,970 And we did retrieval, generation, and question 1136 00:39:19,970 --> 00:39:20,719 answering with it. 1137 00:39:20,719 --> 00:39:22,767 We can also do disambiguation with it. 1138 00:39:22,767 --> 00:39:24,350 In disambiguation, we take a sentence, 1139 00:39:24,350 --> 00:39:26,757 like Danny approached the chair with a bag. 1140 00:39:26,757 --> 00:39:28,340 And you can imagine that this sentence 1141 00:39:28,340 --> 00:39:30,150 can mean multiple things. 1142 00:39:30,150 --> 00:39:33,770 It could mean Danny was actually carrying a bag 1143 00:39:33,770 --> 00:39:35,780 and approaching a chair. 1144 00:39:35,780 --> 00:39:39,740 Or it could mean there was a bag on a chair 1145 00:39:39,740 --> 00:39:41,312 and Danny was approaching it. 1146 00:39:41,312 --> 00:39:42,770 And there's the question of, how do 1147 00:39:42,770 --> 00:39:45,320 you decide which interpretation for the sentence 1148 00:39:45,320 --> 00:39:46,990 corresponds to which video? 1149 00:39:50,370 --> 00:39:52,200 Basically, you can take your sentences 1150 00:39:52,200 --> 00:39:53,520 and you can look at their parse trees. 1151 00:39:53,520 --> 00:39:55,200 And you're going to see that they're different. 1152 00:39:55,200 --> 00:39:56,520 Essentially, your language system 1153 00:39:56,520 --> 00:39:58,228 is going to give you a slightly different 1154 00:39:58,228 --> 00:40:00,130 internal representation for each of these. 1155 00:40:00,130 --> 00:40:02,610 And we already know that when we build our detectors 1156 00:40:02,610 --> 00:40:05,280 for the sentence, we take these kinds of relationships 1157 00:40:05,280 --> 00:40:06,840 between the words as inputs. 
1158 00:40:06,840 --> 00:40:09,210 So even though there's one sentence in English 1159 00:40:09,210 --> 00:40:11,160 that described both of these scenarios, when 1160 00:40:11,160 --> 00:40:12,510 we build detectors we're going to end up 1161 00:40:12,510 --> 00:40:13,718 with two different detectors. 1162 00:40:13,718 --> 00:40:17,587 One for one meaning, one for the other meaning. 1163 00:40:17,587 --> 00:40:19,170 And then we can just run the detectors 1164 00:40:19,170 --> 00:40:22,080 and figure out which meaning corresponds to which video. 1165 00:40:22,080 --> 00:40:23,772 And indeed, that's what we did. 1166 00:40:23,772 --> 00:40:25,230 Except that there are lots and lots 1167 00:40:25,230 --> 00:40:27,224 of different potential ambiguities. 1168 00:40:27,224 --> 00:40:28,890 There are different kinds of attachment. 1169 00:40:28,890 --> 00:40:29,699 In the same case-- 1170 00:40:29,699 --> 00:40:30,990 I won't go through all of them. 1171 00:40:30,990 --> 00:40:34,230 But for example, you might not know where the bag is. 1172 00:40:34,230 --> 00:40:37,020 You might not know who's performing the action. 1173 00:40:37,020 --> 00:40:39,000 You might not be sure if both people 1174 00:40:39,000 --> 00:40:40,860 are performing the action or only one person 1175 00:40:40,860 --> 00:40:43,230 is performing the action. 1176 00:40:43,230 --> 00:40:46,770 There may be some problems with references. 1177 00:40:46,770 --> 00:40:49,011 So this is a very simple example, 1178 00:40:49,011 --> 00:40:50,760 like Danny picked up the bag in the chair. 1179 00:40:50,760 --> 00:40:51,302 It is yellow. 1180 00:40:51,302 --> 00:40:53,301 But this is the kind of thing that you would see 1181 00:40:53,301 --> 00:40:54,780 if you had a long paragraph. 1182 00:40:54,780 --> 00:40:57,000 You would have some reference later on 1183 00:40:57,000 --> 00:40:58,980 or earlier on to some person. 1184 00:40:58,980 --> 00:41:02,460 And you wouldn't be sure who was the referent. 1185 00:41:02,460 --> 00:41:04,880 And it turns out that if you have sentences like this, 1186 00:41:04,880 --> 00:41:06,630 you can disambiguate them pretty reliably. 1187 00:41:10,080 --> 00:41:13,150 So what's important is it's not just a case of parse trees. 1188 00:41:13,150 --> 00:41:16,110 We need a more interesting internal representation. 1189 00:41:16,110 --> 00:41:18,420 And an example of how we do this is we take a sentence 1190 00:41:18,420 --> 00:41:21,870 and we make some first-order logic formula out of it. 1191 00:41:21,870 --> 00:41:23,190 So you have some variables. 1192 00:41:23,190 --> 00:41:25,540 The chair is something like x. 1193 00:41:25,540 --> 00:41:28,455 You have Danny, who moved it, and I moved it. 1194 00:41:28,455 --> 00:41:30,580 Or in the other case, you have two separate chairs. 1195 00:41:30,580 --> 00:41:32,330 And I moved one and Danny moved the other. 1196 00:41:32,330 --> 00:41:34,440 And they're distinct chairs. 1197 00:41:34,440 --> 00:41:37,020 What we do is we first ignore the people. 1198 00:41:37,020 --> 00:41:38,670 So we just say there are two people. 1199 00:41:38,670 --> 00:41:41,539 And in both cases, we're distinct from each other. 1200 00:41:41,539 --> 00:41:43,830 But we don't have person recognizers, face recognition, 1201 00:41:43,830 --> 00:41:45,420 or anything like that. 1202 00:41:45,420 --> 00:41:48,780 Then for each of these variables, we build a tracker. 1203 00:41:48,780 --> 00:41:51,510 And for every constraint, we have a word model. 
1204 00:41:51,510 --> 00:41:54,600 And essentially, you can go from this first-order logic formula 1205 00:41:54,600 --> 00:41:56,700 to one of our detectors. 1206 00:41:56,700 --> 00:41:58,470 So it's exactly the same thing as the case 1207 00:41:58,470 --> 00:42:00,570 where we had a sentence and a video. 1208 00:42:00,570 --> 00:42:03,809 And we just wanted to see, is the sentence true of the video? 1209 00:42:03,809 --> 00:42:05,850 Except that now we have a sentence interpretation 1210 00:42:05,850 --> 00:42:06,433 and the video. 1211 00:42:09,510 --> 00:42:11,380 So we've seen that if all you have 1212 00:42:11,380 --> 00:42:13,470 are multiple interpretation of a sentence, 1213 00:42:13,470 --> 00:42:17,114 you can figure out which one belongs to which video. 1214 00:42:17,114 --> 00:42:18,780 And we'll come back to this in a moment, 1215 00:42:18,780 --> 00:42:20,440 because it's actually quite useful. 1216 00:42:20,440 --> 00:42:22,380 So you can imagine a scenario where 1217 00:42:22,380 --> 00:42:23,814 you want to talk to a robot. 1218 00:42:23,814 --> 00:42:25,230 And you want to give it a command. 1219 00:42:25,230 --> 00:42:27,210 You don't want to play 20 questions with it, right? 1220 00:42:27,210 --> 00:42:28,260 You want to tell it something. 1221 00:42:28,260 --> 00:42:29,200 It should look at the environment. 1222 00:42:29,200 --> 00:42:31,533 And it should figure out, you're referring to this chair 1223 00:42:31,533 --> 00:42:33,900 and this is what I'm supposed to do. 1224 00:42:33,900 --> 00:42:36,450 So the other reason for disambiguation 1225 00:42:36,450 --> 00:42:39,270 is going to be because you get a lot of ambiguities 1226 00:42:39,270 --> 00:42:41,099 while you're acquiring language. 1227 00:42:41,099 --> 00:42:43,140 So we're going to break down language acquisition 1228 00:42:43,140 --> 00:42:44,920 into two parts. 1229 00:42:44,920 --> 00:42:48,030 One part is we want to learn the meanings of each of our words. 1230 00:42:48,030 --> 00:42:50,910 And another one is we want to learn how we take a sentence 1231 00:42:50,910 --> 00:42:53,640 and we transform it into this internal representation 1232 00:42:53,640 --> 00:42:56,550 that we use to actually build these detectors. 1233 00:42:56,550 --> 00:42:58,980 So if you look at the first one, let's 1234 00:42:58,980 --> 00:43:00,570 say you have a whole bunch of videos. 1235 00:43:00,570 --> 00:43:02,697 And every video comes with a sentence. 1236 00:43:02,697 --> 00:43:04,530 You don't know what the sentence is actually 1237 00:43:04,530 --> 00:43:06,060 referring to in the video. 1238 00:43:06,060 --> 00:43:08,914 When children are born, nobody gives mothers bounding boxes 1239 00:43:08,914 --> 00:43:10,830 and tells them, put this around the Teddy bear 1240 00:43:10,830 --> 00:43:13,120 so your child knows what you're referring to. 1241 00:43:13,120 --> 00:43:14,100 So we don't get those. 1242 00:43:14,100 --> 00:43:16,540 We have this more weakly-supervised system. 1243 00:43:16,540 --> 00:43:18,840 But what's important is we get this data set and there 1244 00:43:18,840 --> 00:43:21,090 are certain correlations in this data set, right? 1245 00:43:21,090 --> 00:43:23,370 We know the chair occurs in some videos. 1246 00:43:23,370 --> 00:43:25,696 We know that backpack occurs in others just 1247 00:43:25,696 --> 00:43:26,820 by looking at the sentence. 1248 00:43:26,820 --> 00:43:28,770 We know pickup occurs in others. 
1249 00:43:28,770 --> 00:43:30,270 So basically, this is the same thing 1250 00:43:30,270 --> 00:43:32,700 as training one, big hidden Markov model. 1251 00:43:32,700 --> 00:43:35,520 Except that now we have multiple hidden Markov models 1252 00:43:35,520 --> 00:43:37,984 that have a small amount of dependency between them. 1253 00:43:37,984 --> 00:43:39,150 And I won't talk about this. 1254 00:43:39,150 --> 00:43:40,300 You'll have to take my word for it. 1255 00:43:40,300 --> 00:43:41,383 You can look at the paper. 1256 00:43:41,383 --> 00:43:45,690 But it's identical to the Baum-Welch algorithm. 1257 00:43:45,690 --> 00:43:48,420 Essentially, all you do is you take the gradient through all 1258 00:43:48,420 --> 00:43:50,160 the parameters of these words and you 1259 00:43:50,160 --> 00:43:52,390 can acquire their meanings. 1260 00:43:52,390 --> 00:43:54,870 There are lots of technical issues with this, 1261 00:43:54,870 --> 00:43:57,696 but that's the general idea. 1262 00:43:57,696 --> 00:43:59,320 So we can also look at learning syntax. 1263 00:43:59,320 --> 00:44:01,111 And this is something that we haven't done, 1264 00:44:01,111 --> 00:44:02,160 but we really want to do. 1265 00:44:02,160 --> 00:44:05,015 And this is where disambiguation work really comes into play. 1266 00:44:05,015 --> 00:44:06,640 So if I give you a sentence, like Danny 1267 00:44:06,640 --> 00:44:09,130 approached the chair with a bag, you feed it into a parser. 1268 00:44:09,130 --> 00:44:11,280 Something like Boris' start system. 1269 00:44:11,280 --> 00:44:14,310 And you get potentially two parse trees, right? 1270 00:44:14,310 --> 00:44:15,780 One for one interpretation and one 1271 00:44:15,780 --> 00:44:17,190 for the other interpretation. 1272 00:44:17,190 --> 00:44:18,889 You take the video and you can select 1273 00:44:18,889 --> 00:44:19,930 one of these parse trees. 1274 00:44:19,930 --> 00:44:21,870 That's the game we just played a moment ago. 1275 00:44:21,870 --> 00:44:25,350 But imagine that we take Boris' system and we brain damage 1276 00:44:25,350 --> 00:44:26,130 it a little bit. 1277 00:44:26,130 --> 00:44:29,130 Or we take some deep network that does parsing 1278 00:44:29,130 --> 00:44:31,220 and we just randomize a few of the parameters. 1279 00:44:31,220 --> 00:44:34,200 So now, rather than getting a single or two parse trees 1280 00:44:34,200 --> 00:44:37,170 for our two interpretations, we get 100 or 1,000 1281 00:44:37,170 --> 00:44:38,310 different parse trees. 1282 00:44:38,310 --> 00:44:39,810 We can take each one of those and we 1283 00:44:39,810 --> 00:44:42,686 can see, how well does this match our video? 1284 00:44:42,686 --> 00:44:44,310 And we get some distribution over them. 1285 00:44:44,310 --> 00:44:46,710 Maybe we won't get a single one that matches the best. 1286 00:44:46,710 --> 00:44:48,570 Maybe we'll get a few that match well 1287 00:44:48,570 --> 00:44:50,950 and a bunch that match really, really poorly. 1288 00:44:50,950 --> 00:44:53,892 So this provides a signal to actually train the parser. 1289 00:44:53,892 --> 00:44:55,350 Essentially, you have a parser that 1290 00:44:55,350 --> 00:44:57,870 produces a distribution over parse trees. 1291 00:44:57,870 --> 00:44:59,550 You use the vision system to decide 1292 00:44:59,550 --> 00:45:01,667 which of these parse trees are better than others. 1293 00:45:01,667 --> 00:45:03,750 And you feed this information back into the parser 1294 00:45:03,750 --> 00:45:05,040 and retrain it. 
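As a sketch of the parser-retraining idea just described (which, as noted, has not been built yet), the vision score can turn a pool of candidate parses into a distribution that serves as a weak training signal. The function names here are placeholders for a parser that proposes many trees, the detector construction described earlier, and the recognition score.

```python
import math

def parse_training_signal(sentence, video, parse_candidates,
                          build_detector, score_against_video):
    parses = parse_candidates(sentence)            # e.g. 100-1,000 candidate trees
    scores = [score_against_video(build_detector(p), video) for p in parses]
    # Softmax the video scores into a distribution over parses.
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    posterior = [w / z for w in weights]
    # These (parse, weight) pairs can be fed back as weighted examples
    # to retrain the parser.
    return list(zip(parses, posterior))
```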
1295 00:45:05,040 --> 00:45:08,437 We haven't done this, but it's in the pipeline. 1296 00:45:08,437 --> 00:45:10,020 And eventually, the idea is that we're 1297 00:45:10,020 --> 00:45:12,060 going to be able to close the loop 1298 00:45:12,060 --> 00:45:13,560 and learn the meanings of the words 1299 00:45:13,560 --> 00:45:15,460 while we end up learning the parser. 1300 00:45:15,460 --> 00:45:18,164 But that's further down the line. 1301 00:45:18,164 --> 00:45:19,830 So lest you think that there's something 1302 00:45:19,830 --> 00:45:22,560 remarkable about language learning in humans, 1303 00:45:22,560 --> 00:45:26,140 actually lots of animals learn language, not just humans. 1304 00:45:26,140 --> 00:45:29,190 And here's a cute example of a dog that does something 1305 00:45:29,190 --> 00:45:30,420 that our system can't do. 1306 00:45:30,420 --> 00:45:33,540 And actually, no language system out there can do. 1307 00:45:33,540 --> 00:45:37,560 So there's this paper, but this is from PBS. 1308 00:45:37,560 --> 00:45:40,050 And what ended up happening is this dog 1309 00:45:40,050 --> 00:45:42,416 knows the meaning of about 1,000 different words 1310 00:45:42,416 --> 00:45:44,040 because there are labels that have been 1311 00:45:44,040 --> 00:45:45,540 attached to different toys. 1312 00:45:45,540 --> 00:45:47,190 So it has 1,000 different toys. 1313 00:45:47,190 --> 00:45:48,570 Each one has a unique name. 1314 00:45:48,570 --> 00:45:52,320 And if you tell the dog, give me Blinky, 1315 00:45:52,320 --> 00:45:54,510 it knows exactly which toy Blinky is. 1316 00:45:54,510 --> 00:45:56,790 And it has 100% accuracy getting you Blinky 1317 00:45:56,790 --> 00:45:59,490 from it's big, big pile of toys. 1318 00:45:59,490 --> 00:46:01,560 So what they did is they took 10 toys. 1319 00:46:01,560 --> 00:46:03,930 They put them behind the sofa. 1320 00:46:03,930 --> 00:46:05,640 And they added one additional toy 1321 00:46:05,640 --> 00:46:07,929 that the dog has never seen before. 1322 00:46:07,929 --> 00:46:09,720 They tested the dog many times to make sure 1323 00:46:09,720 --> 00:46:11,428 that it doesn't have a novelty preference 1324 00:46:11,428 --> 00:46:12,400 or anything like that. 1325 00:46:12,400 --> 00:46:15,000 And then they asked the dog, bring the Blinky. 1326 00:46:15,000 --> 00:46:17,760 And you can see the dog was asked. 1327 00:46:17,760 --> 00:46:18,930 It goes behind. 1328 00:46:18,930 --> 00:46:20,160 It quickly finds Blinky. 1329 00:46:20,160 --> 00:46:21,360 It brings it back. 1330 00:46:21,360 --> 00:46:22,360 And there we go. 1331 00:46:22,360 --> 00:46:26,710 And now, the dog is really happy. 1332 00:46:26,710 --> 00:46:31,020 So now, the dog is going to be asked, bring me this new toy. 1333 00:46:31,020 --> 00:46:34,650 Bring me the professor, or whatever the toy is called. 1334 00:46:34,650 --> 00:46:36,120 It's a little less certain. 1335 00:46:36,120 --> 00:46:36,690 OK. 1336 00:46:36,690 --> 00:46:38,370 So it's going to go behind and it's 1337 00:46:38,370 --> 00:46:40,399 going to look at all the objects. 1338 00:46:40,399 --> 00:46:41,940 The toy with the beard is the new one 1339 00:46:41,940 --> 00:46:43,560 that it hasn't seen before. 1340 00:46:43,560 --> 00:46:45,632 And it was there in the previous trial. 1341 00:46:45,632 --> 00:46:47,590 So it looks around and it's a little uncertain. 1342 00:46:47,590 --> 00:46:50,000 It doesn't quite want to come back. 
1343 00:46:50,000 --> 00:46:52,500 We're going to see that we're going 1344 00:46:52,500 --> 00:46:56,700 to have to give it another instruction in a moment. 1345 00:46:56,700 --> 00:46:58,550 He's going to call it back and ask the dog 1346 00:46:58,550 --> 00:47:00,100 to do exactly the same task again. 1347 00:47:00,100 --> 00:47:01,350 Isn't telling it anything new. 1348 00:47:01,350 --> 00:47:03,766 It's just to give it some encouragement. 1349 00:47:09,730 --> 00:47:13,811 So looking around for some toy. 1350 00:47:13,811 --> 00:47:15,990 And it picks the-- you'll see in a moment. 1351 00:47:19,280 --> 00:47:22,010 It picks the toy that it hasn't seen before, 1352 00:47:22,010 --> 00:47:23,274 because it's a new word. 1353 00:47:23,274 --> 00:47:24,440 And the dog is really happy. 1354 00:47:24,440 --> 00:47:26,773 And I think the human is even happier that this actually 1355 00:47:26,773 --> 00:47:28,010 worked. 1356 00:47:28,010 --> 00:47:30,470 But the important part is, there's this dog 1357 00:47:30,470 --> 00:47:33,620 that we normally don't associate with having a huge amount 1358 00:47:33,620 --> 00:47:34,652 of linguistic ability. 1359 00:47:34,652 --> 00:47:36,110 But it's learning language in a way 1360 00:47:36,110 --> 00:47:39,646 that is far more advanced than anything that we have. 1361 00:47:39,646 --> 00:47:41,270 And it's learning it in a grounded way, 1362 00:47:41,270 --> 00:47:43,850 like it hard to connect its knowledge about what it sees 1363 00:47:43,850 --> 00:47:45,380 with these toys to this new object 1364 00:47:45,380 --> 00:47:48,830 that it's never seen before and understand this new label. 1365 00:47:48,830 --> 00:47:51,600 And dogs are not the only animal that can do this. 1366 00:47:51,600 --> 00:47:54,600 There are many other animals that can do this. 1367 00:47:54,600 --> 00:47:55,324 All right. 1368 00:47:55,324 --> 00:47:56,990 And of course, children do this as well. 1369 00:47:59,760 --> 00:48:01,400 So there was a question about the fact 1370 00:48:01,400 --> 00:48:03,170 that we're constantly using videos here. 1371 00:48:03,170 --> 00:48:04,580 And we're very focused on motion. 1372 00:48:04,580 --> 00:48:06,020 But of course, in many of these sentences, 1373 00:48:06,020 --> 00:48:07,936 we were referring to objects that were static. 1374 00:48:07,936 --> 00:48:10,228 So we're not only sensitive to objects that are moving. 1375 00:48:10,228 --> 00:48:11,769 So for example, when I said something 1376 00:48:11,769 --> 00:48:13,850 like it was the person to the left of the car, 1377 00:48:13,850 --> 00:48:17,390 neither the person nor the car were moving in that question. 1378 00:48:17,390 --> 00:48:19,530 It was the pair that was moving. 1379 00:48:19,530 --> 00:48:21,380 But there's an interesting question, 1380 00:48:21,380 --> 00:48:24,950 what if you want to recognize actions in still images? 1381 00:48:24,950 --> 00:48:26,260 After all, we can do it. 1382 00:48:26,260 --> 00:48:28,730 It probably didn't involve looking at photos. 1383 00:48:28,730 --> 00:48:31,400 You know, 200 million years ago when our visual system 1384 00:48:31,400 --> 00:48:32,790 was being formed. 1385 00:48:32,790 --> 00:48:34,940 So somehow, we take our video ability 1386 00:48:34,940 --> 00:48:37,647 and we apply it to images. 1387 00:48:37,647 --> 00:48:39,980 And the way we're going to do that is by taking an image 1388 00:48:39,980 --> 00:48:41,945 and predicting a video from it. 
1389 00:48:41,945 --> 00:48:43,820 We haven't done this, but we've done the part 1390 00:48:43,820 --> 00:48:46,820 where you can actually get predicting motion 1391 00:48:46,820 --> 00:48:48,129 from single frames. 1392 00:48:48,129 --> 00:48:49,670 So the intuition about why this works 1393 00:48:49,670 --> 00:48:51,770 is, if you look at this image and I ask you, 1394 00:48:51,770 --> 00:48:54,230 how quickly is this baseball moving? 1395 00:48:54,230 --> 00:48:56,965 You can give me an answer. 1396 00:48:56,965 --> 00:48:58,090 AUDIENCE: Not very quickly. 1397 00:48:58,090 --> 00:48:59,090 ANDREI BARBU: Not very quickly. 1398 00:48:59,090 --> 00:48:59,810 Right. 1399 00:48:59,810 --> 00:49:01,690 And if you look at this baseball, 1400 00:49:01,690 --> 00:49:06,034 you can decide that it's moving very quickly, right? 1401 00:49:06,034 --> 00:49:07,450 So the other story in this talk is 1402 00:49:07,450 --> 00:49:08,740 I'm becoming more and more American. 1403 00:49:08,740 --> 00:49:10,490 I started with the Canadian flag and now I 1404 00:49:10,490 --> 00:49:12,070 ended up with baseball. 1405 00:49:12,070 --> 00:49:12,640 All right. 1406 00:49:12,640 --> 00:49:14,512 So you can clearly do this task. 1407 00:49:14,512 --> 00:49:15,970 There is good neuroscience evidence 1408 00:49:15,970 --> 00:49:19,600 that people are doing this fairly regularly. 1409 00:49:19,600 --> 00:49:22,150 Kids can do this, et cetera. 1410 00:49:22,150 --> 00:49:23,230 All right. 1411 00:49:23,230 --> 00:49:25,360 So now, what we did is we went to YouTube 1412 00:49:25,360 --> 00:49:27,370 and we got a whole bunch of videos. 1413 00:49:27,370 --> 00:49:30,010 Videos that contain cars or different kinds of objects. 1414 00:49:30,010 --> 00:49:31,900 We had eight different object classes. 1415 00:49:31,900 --> 00:49:34,390 And we ran a standard optical flow algorithm just 1416 00:49:34,390 --> 00:49:35,380 off the shelf. 1417 00:49:35,380 --> 00:49:38,320 And this gives us an idea of how the motion actually 1418 00:49:38,320 --> 00:49:39,655 happens inside this video. 1419 00:49:39,655 --> 00:49:40,780 Then, we discard the video. 1420 00:49:40,780 --> 00:49:43,030 And we only keep one of the frames. 1421 00:49:43,030 --> 00:49:44,380 And we train a deep network. 1422 00:49:44,380 --> 00:49:47,620 This is the only time deep networks appear in this talk. 1423 00:49:47,620 --> 00:49:51,700 That takes as input the image and predicts the optical flow. 1424 00:49:51,700 --> 00:49:53,290 It looks a lot like an auto-encoder, 1425 00:49:53,290 --> 00:49:56,014 except the input and the output are different from each other. 1426 00:49:56,014 --> 00:49:57,680 And it turns out this works pretty well. 1427 00:49:57,680 --> 00:50:00,250 It has similar performance to actually doing optical flow 1428 00:50:00,250 --> 00:50:02,950 on the video with sort of a crappier, earlier optical flow 1429 00:50:02,950 --> 00:50:04,600 algorithm. 1430 00:50:04,600 --> 00:50:07,564 So up until now are things that we've done. 1431 00:50:07,564 --> 00:50:08,980 At the end I'll talk briefly about 1432 00:50:08,980 --> 00:50:11,000 what we're doing in the future. 1433 00:50:11,000 --> 00:50:13,620 So one thing that you can do is translation. 1434 00:50:13,620 --> 00:50:16,210 And you can cast translation as a visual language task, 1435 00:50:16,210 --> 00:50:19,220 even though it sounds like it has nothing to do with vision. 
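Here is a minimal sketch of the single-frame motion-prediction idea mentioned a moment ago: an encoder-decoder network that maps one RGB frame to a two-channel flow field, trained against the flow that an off-the-shelf optical-flow algorithm computed on the original video. The specific layer sizes are assumptions for the illustration, not the architecture used in the work.

```python
import torch
import torch.nn as nn

class Frame2Flow(nn.Module):
    def __init__(self):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 2, 4, stride=2, padding=1),  # (dx, dy) per pixel
        )

    def forward(self, frame):                     # frame: (N, 3, H, W)
        return self.decode(self.encode(frame))    # flow:  (N, 2, H, W)

# Training outline: the video is used only to compute flow targets, then
# discarded; minimize an L2 loss between predicted and computed flow, e.g.
#   loss = ((model(frame) - flow_target) ** 2).mean()
```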
1436 00:50:19,220 --> 00:50:22,240 So if I give you a sentence in Chinese, 1437 00:50:22,240 --> 00:50:25,215 you can imagine scenarios for that sentence, 1438 00:50:25,215 --> 00:50:27,340 and then try to describe them with another language 1439 00:50:27,340 --> 00:50:28,596 that you know. 1440 00:50:28,596 --> 00:50:30,970 This is very different from the way people do translation 1441 00:50:30,970 --> 00:50:31,545 right now. 1442 00:50:31,545 --> 00:50:32,920 So right now, the way it works is 1443 00:50:32,920 --> 00:50:34,789 you have a sentence, like Sam was happy. 1444 00:50:34,789 --> 00:50:36,080 And you have a parallel corpus. 1445 00:50:36,080 --> 00:50:38,288 If you want to translate into French, you go off you. 1446 00:50:38,288 --> 00:50:41,920 Get the Hansard corpus and you get a whole bunch 1447 00:50:41,920 --> 00:50:43,780 of French and English sentences that 1448 00:50:43,780 --> 00:50:46,280 are aligned with each other and you learn the correspondence 1449 00:50:46,280 --> 00:50:46,960 between them. 1450 00:50:46,960 --> 00:50:50,386 Here, I translated into [AUDIO OUT] Russian. 1451 00:50:50,386 --> 00:50:51,760 The important part is in English, 1452 00:50:51,760 --> 00:50:54,130 there's no assumption about the gender of Sam. 1453 00:50:54,130 --> 00:50:56,890 Sam is both a male name and a female name. 1454 00:50:56,890 --> 00:51:01,630 But the problem is Romanian, Russian, French, et cetera, 1455 00:51:01,630 --> 00:51:04,300 they really force you to specify the gender of the people that 1456 00:51:04,300 --> 00:51:05,620 are involved in these actions. 1457 00:51:05,620 --> 00:51:07,411 And you have to go through a certain amount 1458 00:51:07,411 --> 00:51:10,580 of [AUDIO OUT] really want to avoid specifying their gender. 1459 00:51:10,580 --> 00:51:13,090 So here, we specify the gender as male. 1460 00:51:13,090 --> 00:51:15,220 Here, we specify the gender as female. 1461 00:51:15,220 --> 00:51:17,800 And if all you have is statistical machine translation 1462 00:51:17,800 --> 00:51:20,351 system, you may get an arbitrary one of these two. 1463 00:51:20,351 --> 00:51:21,850 And you may not know that you've got 1464 00:51:21,850 --> 00:51:22,950 an arbitrary one of these two. 1465 00:51:22,950 --> 00:51:25,074 And there may be a terrible faux pas at some point. 1466 00:51:27,730 --> 00:51:31,210 So this problem is not restricted to gender. 1467 00:51:31,210 --> 00:51:33,122 And it occurs all the time. 1468 00:51:33,122 --> 00:51:35,080 For example, in Thai, you specify your siblings 1469 00:51:35,080 --> 00:51:37,040 by age, not by their gender. 1470 00:51:37,040 --> 00:51:39,550 So if you have an English sentence like my brother 1471 00:51:39,550 --> 00:51:42,266 did x, translating that is quite difficult. 1472 00:51:42,266 --> 00:51:44,890 In English, you specify relative time through the tense system, 1473 00:51:44,890 --> 00:51:47,872 but Mandarin doesn't have the same kind of tense system. 1474 00:51:47,872 --> 00:51:49,330 In this language that I never tried 1475 00:51:49,330 --> 00:51:52,930 to pronounce after the first time that I tried, 1476 00:51:52,930 --> 00:51:54,640 you don't use relative direction. 1477 00:51:54,640 --> 00:51:57,370 So you don't say the bottle to the left of the laptop. 1478 00:51:57,370 --> 00:51:59,050 We all agree on a common reference 1479 00:51:59,050 --> 00:52:01,480 frame like a hill or something. 1480 00:52:01,480 --> 00:52:03,790 Or we agree on cardinal directions. 
1481 00:52:03,790 --> 00:52:06,280 And you say, the bottle to the north or something. 1482 00:52:06,280 --> 00:52:08,450 And these people are really, really good 1483 00:52:08,450 --> 00:52:09,991 at wayfinding because they constantly 1484 00:52:09,991 --> 00:52:11,170 have to know where north is. 1485 00:52:11,170 --> 00:52:13,090 Many languages don't distinguish blue and green. 1486 00:52:13,090 --> 00:52:14,590 Historically, this is not something 1487 00:52:14,590 --> 00:52:17,110 that languages have done. 1488 00:52:17,110 --> 00:52:18,910 It's pretty new. 1489 00:52:18,910 --> 00:52:22,420 For example, Japanese didn't until a hundred years ago. 1490 00:52:22,420 --> 00:52:24,790 They only started distinguishing the two fairly recently 1491 00:52:24,790 --> 00:52:26,832 when they started interacting with the West more. 1492 00:52:26,832 --> 00:52:28,581 And many languages don't set that boundary 1493 00:52:28,581 --> 00:52:29,870 at exactly the same place. 1494 00:52:29,870 --> 00:52:31,286 So one language, you may say blue. 1495 00:52:31,286 --> 00:52:33,850 In another language, you may have to say green. 1496 00:52:33,850 --> 00:52:36,550 In Swahili, you specify the color of everything 1497 00:52:36,550 --> 00:52:37,900 as the color of x. 1498 00:52:37,900 --> 00:52:39,670 So like in English, we have orange. 1499 00:52:39,670 --> 00:52:41,375 But in Swahili, I could say the color 1500 00:52:41,375 --> 00:52:43,090 of the back of my cell phone. 1501 00:52:43,090 --> 00:52:46,600 And I expect you to know that's blue as long as you can see it. 1502 00:52:46,600 --> 00:52:50,140 In Turkish, there is a relatively complicated 1503 00:52:50,140 --> 00:52:51,490 evidentiality system. 1504 00:52:51,490 --> 00:52:54,670 So you have to fairly often tell me why you know something. 1505 00:52:54,670 --> 00:52:56,980 So if you saw somebody do something, 1506 00:52:56,980 --> 00:52:59,320 you have to mark that in the sentence as opposed 1507 00:52:59,320 --> 00:53:01,880 to hearing it from someone else. 1508 00:53:01,880 --> 00:53:04,210 So if it's hearsay, you have to let me know. 1509 00:53:04,210 --> 00:53:06,430 There are much more complicated evidentiality systems 1510 00:53:06,430 --> 00:53:08,888 where you have to tell me, did you hear it, did you see it, 1511 00:53:08,888 --> 00:53:09,960 did you feel it? 1512 00:53:09,960 --> 00:53:11,200 It can get pretty hairy. 1513 00:53:11,200 --> 00:53:13,600 So there are a lot of reasons why just 1514 00:53:13,600 --> 00:53:15,580 doing the straightforward sentence alignment 1515 00:53:15,580 --> 00:53:17,110 can really fail on you. 1516 00:53:17,110 --> 00:53:19,180 And you can make some pretty terrible mistakes. 1517 00:53:19,180 --> 00:53:21,040 And more importantly, you just won't know 1518 00:53:21,040 --> 00:53:22,870 that that made these mistakes. 1519 00:53:22,870 --> 00:53:24,370 So instead, what we've been thinking 1520 00:53:24,370 --> 00:53:26,720 is sort of translation by imagination. 1521 00:53:26,720 --> 00:53:28,600 So you take a sentence. 1522 00:53:28,600 --> 00:53:30,610 And it's a generative model that we have that 1523 00:53:30,610 --> 00:53:32,230 connects sentences and videos. 1524 00:53:32,230 --> 00:53:33,560 And what you do is you sample. 1525 00:53:33,560 --> 00:53:35,160 You sample a whole bunch of videos. 1526 00:53:35,160 --> 00:53:37,420 So basically, you imagine what scenarios the sentence 1527 00:53:37,420 --> 00:53:38,770 could be true of. 1528 00:53:38,770 --> 00:53:42,370 You get your collection from the generator. 
1529 00:53:42,370 --> 00:53:45,850 You search over sentences that describe these videos 1530 00:53:45,850 --> 00:53:47,770 and you output a sentence that describes them 1531 00:53:47,770 --> 00:53:49,120 well in aggregate. 1532 00:53:49,120 --> 00:53:52,300 So basically, you just combine your ability 1533 00:53:52,300 --> 00:53:55,660 to sample, which comes from your recognizer, 1534 00:53:55,660 --> 00:53:57,070 and your ability to generate. 1535 00:53:57,070 --> 00:53:58,850 And you get a translation system. 1536 00:53:58,850 --> 00:54:01,300 So you do a language-to-language task 1537 00:54:01,300 --> 00:54:04,270 mediated by your understanding of the real world. 1538 00:54:04,270 --> 00:54:06,460 Something else that you can do is planning, which 1539 00:54:06,460 --> 00:54:08,770 I'll just say two words about. 1540 00:54:08,770 --> 00:54:11,055 All you do is-- 1541 00:54:11,055 --> 00:54:11,680 [PHONE RINGING] 1542 00:54:11,680 --> 00:54:14,150 --in a planning task, what you have is 1543 00:54:14,150 --> 00:54:16,280 you have a planning language. 1544 00:54:16,280 --> 00:54:18,741 I'm glad that wasn't my cellphone. 1545 00:54:18,741 --> 00:54:20,240 You have a planning language, right? 1546 00:54:20,240 --> 00:54:22,294 So you have a fairly constrained vocabulary 1547 00:54:22,294 --> 00:54:23,960 that you can use to describe your plans. 1548 00:54:23,960 --> 00:54:26,570 And this allows you to have efficient inference. 1549 00:54:26,570 --> 00:54:29,810 Instead, you can imagine that I have two frames of a video, 1550 00:54:29,810 --> 00:54:33,410 real or imagined, where I have the first world. 1551 00:54:33,410 --> 00:54:35,000 I am far away from the microphone. 1552 00:54:35,000 --> 00:54:37,520 I have the last world where I'm near the microphone. 1553 00:54:37,520 --> 00:54:40,580 And I have an unobserved video between the two. 1554 00:54:40,580 --> 00:54:42,260 People have work and I've done some work 1555 00:54:42,260 --> 00:54:44,720 on filling in partially observed videos. 1556 00:54:44,720 --> 00:54:47,090 So it's a very similar idea, except that here we have 1557 00:54:47,090 --> 00:54:48,474 a partially-observed video. 1558 00:54:48,474 --> 00:54:50,390 And we know that this partially-observed video 1559 00:54:50,390 --> 00:54:52,829 should be described by one or more sentences. 1560 00:54:52,829 --> 00:54:55,370 So we're going to do the same kind of sampling process, where 1561 00:54:55,370 --> 00:54:57,380 we sample from this partially-observed video 1562 00:54:57,380 --> 00:54:59,440 and we try to describe what the sentence is. 1563 00:54:59,440 --> 00:55:00,740 And now you're doing planning. 1564 00:55:00,740 --> 00:55:02,198 You're coming up with a description 1565 00:55:02,198 --> 00:55:04,917 of what had happened in this missing chunk of the video. 1566 00:55:04,917 --> 00:55:06,500 But your planning language is English, 1567 00:55:06,500 --> 00:55:07,820 so you get to take advantage of things 1568 00:55:07,820 --> 00:55:09,620 like ambiguity, which you couldn't take 1569 00:55:09,620 --> 00:55:13,970 advantage of in many languages. 1570 00:55:13,970 --> 00:55:15,070 Theory of mind. 1571 00:55:15,070 --> 00:55:18,000 The idea here is relatively straightforward as well. 1572 00:55:18,000 --> 00:55:20,052 So what we have right now, basically 1573 00:55:20,052 --> 00:55:21,260 are two hidden Markov models. 1574 00:55:21,260 --> 00:55:23,240 Or two kinds of hidden Markov models, right? 1575 00:55:23,240 --> 00:55:24,050 There's a video. 
1570 00:55:13,970 --> 00:55:15,070 Theory of mind. 1571 00:55:15,070 --> 00:55:18,000 The idea here is relatively straightforward as well. 1572 00:55:18,000 --> 00:55:20,052 So what we have right now, basically, 1573 00:55:20,052 --> 00:55:21,260 are two hidden Markov models. 1574 00:55:21,260 --> 00:55:23,240 Or two kinds of hidden Markov models, right? 1575 00:55:23,240 --> 00:55:24,050 There's a video. 1576 00:55:24,050 --> 00:55:26,300 We have some hidden Markov models that are tracks 1577 00:55:26,300 --> 00:55:29,030 and we have some hidden Markov models that look at the tracks 1578 00:55:29,030 --> 00:55:30,654 and they do some inference about what's 1579 00:55:30,654 --> 00:55:32,780 going on with the events in these videos. 1580 00:55:32,780 --> 00:55:34,630 So now imagine that I had a third kind. 1581 00:55:34,630 --> 00:55:36,546 A third kind of hidden Markov model that 1582 00:55:36,546 --> 00:55:37,670 only looks at the trackers. 1583 00:55:37,670 --> 00:55:39,402 Doesn't look at the words directly. 1584 00:55:39,402 --> 00:55:41,360 And what it does is it makes another assumption 1585 00:55:41,360 --> 00:55:42,170 about the videos. 1586 00:55:42,170 --> 00:55:45,410 So first, we assumed the objects were moving in a coherent way. 1587 00:55:45,410 --> 00:55:47,300 Then, we assumed that the objects 1588 00:55:47,300 --> 00:55:49,760 were moving according to the dynamics of some hidden Markov 1589 00:55:49,760 --> 00:55:50,450 models. 1590 00:55:50,450 --> 00:55:52,220 Now, we're going to assume that people 1591 00:55:52,220 --> 00:55:54,200 move according to some dynamics of what's 1592 00:55:54,200 --> 00:55:55,860 going on inside our heads. 1593 00:55:55,860 --> 00:55:59,180 So you can assume that I have a planner inside my head 1594 00:55:59,180 --> 00:56:01,130 that tells me what I want to do and what 1595 00:56:01,130 --> 00:56:03,710 I should do in the future to accomplish my goals. 1596 00:56:03,710 --> 00:56:07,280 And you can look at a sequence of my actions and try to infer: 1597 00:56:07,280 --> 00:56:09,800 if you believe this planner is running in my head, 1598 00:56:09,800 --> 00:56:11,820 what do you think I should do next? 1599 00:56:11,820 --> 00:56:14,079 Now, the nice part about many of these planners 1600 00:56:14,079 --> 00:56:16,370 is that they look a lot like these hidden Markov models. 1601 00:56:16,370 --> 00:56:19,310 And the inference algorithms look a lot like these models. 1602 00:56:19,310 --> 00:56:21,560 So basically, you can do the same kind of trick 1603 00:56:21,560 --> 00:56:23,930 by assuming that HMM-like things are 1604 00:56:23,930 --> 00:56:25,620 going on inside people's heads. 1605 00:56:25,620 --> 00:56:27,370 So you can do things like predict actions, 1606 00:56:27,370 --> 00:56:29,720 figure out what people want to do in the future, what 1607 00:56:29,720 --> 00:56:30,594 they did in the past. 1608 00:56:33,370 --> 00:56:35,024 That's what the project is.
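Here is a minimal sketch of that theory-of-mind trick, with toy numbers of my own rather than anything taken from the actual system: treat the planner assumed to be running in someone's head as an HMM over hidden plan steps, run the standard forward algorithm on the actions observed so far, and push the resulting belief one step forward to predict what they will probably do next.

import numpy as np

states = ["walk_to_table", "grasp", "carry_away"]  # hidden plan steps (toy)
actions = ["step", "reach", "lift", "turn"]        # observable actions (toy)

T = np.array([[0.6, 0.4, 0.0],   # transition: plan steps mostly move forward
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
E = np.array([[0.8, 0.1, 0.05, 0.05],  # emission: which actions each plan step produces
              [0.1, 0.6, 0.2, 0.1],
              [0.1, 0.1, 0.4, 0.4]])
pi = np.array([1.0, 0.0, 0.0])         # plans start at the first step

def forward(observed):
    """Belief over hidden plan steps after the observed action sequence."""
    belief = pi * E[:, actions.index(observed[0])]
    belief /= belief.sum()
    for a in observed[1:]:
        belief = (belief @ T) * E[:, actions.index(a)]
        belief /= belief.sum()
    return belief

def predict_next_action(observed):
    """Most likely next action, assuming the HMM-like planner keeps running."""
    next_step_belief = forward(observed) @ T
    action_probs = next_step_belief @ E
    return actions[int(np.argmax(action_probs))]

print(predict_next_action(["step", "step", "reach"]))  # -> "reach" with these toy numbers

The same forward pass over a full sequence is also what would let you ask what someone was trying to do in the past, not just what they will do next.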
1609 00:56:35,024 --> 00:56:37,440 I want to show you another example of vision and language, 1610 00:56:37,440 --> 00:56:39,900 but in a totally different domain that I won't talk about, 1611 00:56:39,900 --> 00:56:41,670 which is in the case of robots. 1612 00:56:41,670 --> 00:56:45,710 This is something that we built several years ago. 1613 00:56:45,710 --> 00:56:47,650 This is a robot that looks at a 3D structure. 1614 00:56:47,650 --> 00:56:48,860 It's built out of Lincoln Logs. 1615 00:56:48,860 --> 00:56:49,360 They're big. 1616 00:56:49,360 --> 00:56:51,610 They're easier for the robot to manipulate than LEGOs. 1617 00:56:51,610 --> 00:56:53,026 The downside is they're all brown, 1618 00:56:53,026 --> 00:56:54,880 so it's very difficult to do vision on this. 1619 00:56:54,880 --> 00:56:56,560 But it actually will, in a moment, 1620 00:56:56,560 --> 00:56:58,840 reconstruct the 3D structure of what it sees. 1621 00:56:58,840 --> 00:57:01,270 And we annotated in red what errors it made. 1622 00:57:01,270 --> 00:57:03,190 We didn't tell it this. 1623 00:57:03,190 --> 00:57:05,650 What it does is it measures its own confidence 1624 00:57:05,650 --> 00:57:08,200 and it figures out what parts are occluded. 1625 00:57:08,200 --> 00:57:10,210 So it has too little information. 1626 00:57:10,210 --> 00:57:11,970 And it plans another view. 1627 00:57:11,970 --> 00:57:16,060 It goes and acquires it by measuring its own confidence. 1628 00:57:16,060 --> 00:57:18,370 This view is actually worse than the previous view, 1629 00:57:18,370 --> 00:57:19,610 but it's complementary. 1630 00:57:19,610 --> 00:57:22,300 So it will actually gain the information that it's missing. 1631 00:57:22,300 --> 00:57:24,970 And all of this comes from the same kind of generative model 1632 00:57:24,970 --> 00:57:26,920 trick that I showed you a moment ago. 1633 00:57:26,920 --> 00:57:29,710 A similar model, it just makes different assumptions 1634 00:57:29,710 --> 00:57:31,671 about what's built into it. 1635 00:57:31,671 --> 00:57:33,670 So now, because we have a nice generative model, 1636 00:57:33,670 --> 00:57:35,800 we can integrate the two views together. 1637 00:57:35,800 --> 00:57:38,529 You're going to see it in a moment. 1638 00:57:38,529 --> 00:57:39,820 It'll still make some mistakes. 1639 00:57:39,820 --> 00:57:42,444 It won't be completely confident because there are some regions 1640 00:57:42,444 --> 00:57:44,830 that it can't see, even from both views. 1641 00:57:44,830 --> 00:57:46,690 And then what we told it is, OK, fine. 1642 00:57:46,690 --> 00:57:49,510 For now, ignore the second view, take just the first view. 1643 00:57:49,510 --> 00:57:50,570 Here's a sentence. 1644 00:57:50,570 --> 00:57:52,450 Or in this case, a sentence fragment. 1645 00:57:52,450 --> 00:57:54,100 The fragment is something like, there's 1646 00:57:54,100 --> 00:57:56,620 a window to the left of and perpendicular to this door. 1647 00:57:56,620 --> 00:57:58,180 It'll just appear in a moment. 1648 00:57:58,180 --> 00:58:00,280 And integrating this one view that it 1649 00:58:00,280 --> 00:58:04,630 saw, which it was uncertain about, with that one sentence, which is 1650 00:58:04,630 --> 00:58:07,420 also very generic and applies to many structures, 1651 00:58:07,420 --> 00:58:09,070 it determined that these two completely 1652 00:58:09,070 --> 00:58:10,780 disambiguate the structure. 1653 00:58:10,780 --> 00:58:13,061 And now, it's perfectly confident in what's going on. 1654 00:58:13,061 --> 00:58:14,560 And it can go and it can disassemble 1655 00:58:14,560 --> 00:58:15,670 the structure for you. 1656 00:58:15,670 --> 00:58:17,590 And we can play this game in many directions. 1657 00:58:17,590 --> 00:58:20,170 We can have the robot describe structures to us. 1658 00:58:20,170 --> 00:58:22,900 We can give it a description and it can build the structure. 1659 00:58:22,900 --> 00:58:25,090 One robot can describe the structure in English 1660 00:58:25,090 --> 00:58:27,780 to another robot who can build it for it. 1661 00:58:27,780 --> 00:58:30,640 And it's exactly the same kind of idea. 1662 00:58:30,640 --> 00:58:33,580 You connect your vision and your language model 1663 00:58:33,580 --> 00:58:35,620 to something in the real world, and then you 1664 00:58:35,620 --> 00:58:37,120 can play many, many different tricks 1665 00:58:37,120 --> 00:58:41,020 with one internal representation without modifying it at all.
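And here is a toy sketch, in the spirit of that Lincoln Log demo, of how one generic sentence fragment can finish disambiguating an uncertain visual estimate; the candidate structures, their probabilities, and the constraint test are all invented for illustration, not the robot's actual representation. Vision leaves a posterior over a few structures it cannot tell apart from one view, and conditioning on the sentence removes the inconsistent ones, here leaving a single confident answer.

# Candidate structures a single view could not tell apart, with a toy visual posterior.
candidates = [
    ({"window": (0, 1), "door": (2, 1), "window_perpendicular_to_door": True}, 0.45),
    ({"window": (3, 1), "door": (2, 1), "window_perpendicular_to_door": True}, 0.40),
    ({"window": (0, 1), "door": (2, 1), "window_perpendicular_to_door": False}, 0.15),
]

def satisfies(structure):
    """'There's a window to the left of and perpendicular to this door.'"""
    window_x, door_x = structure["window"][0], structure["door"][0]
    return window_x < door_x and structure["window_perpendicular_to_door"]

def condition_on_sentence(candidates):
    """Drop structures inconsistent with the sentence, then renormalize."""
    consistent = [(s, p) for s, p in candidates if satisfies(s)]
    total = sum(p for _, p in consistent)
    return [(s, p / total) for s, p in consistent]

for structure, prob in condition_on_sentence(candidates):
    print(prob, structure)  # one structure left, with probability 1.0

Because the sentence is treated as just another observation of the same underlying structure, the same machinery can run in every direction the talk mentions: describing a structure, building one from a description, or one robot instructing another.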
1666 00:58:41,020 --> 00:58:44,080 But I realized yesterday that I was the last speaker 1667 00:58:44,080 --> 00:58:45,580 before the weekend, so I want to end 1668 00:58:45,580 --> 00:58:48,100 by leaving you as depressed as I possibly can, 1669 00:58:48,100 --> 00:58:51,040 and tell you all the wonderful things that don't work. 1670 00:58:51,040 --> 00:58:53,990 And how far away we are from understanding anything. 1671 00:58:53,990 --> 00:58:55,840 So first of all, we can't generate 1672 00:58:55,840 --> 00:58:58,570 the kind of coherent stories that Patrick looks at. 1673 00:58:58,570 --> 00:59:00,880 Really, if you look at a long video, what we can do 1674 00:59:00,880 --> 00:59:03,014 is we can search or we can describe small events. 1675 00:59:03,014 --> 00:59:04,180 A person picks something up. 1676 00:59:04,180 --> 00:59:05,020 They put it down. 1677 00:59:05,020 --> 00:59:08,530 What we can't say is, the thief entered the room 1678 00:59:08,530 --> 00:59:10,665 and rummaged around and ran away with the gold. 1679 00:59:10,665 --> 00:59:12,790 That's the kind of thing that you want to generate. 1680 00:59:12,790 --> 00:59:14,540 It's the kind of thing that kids generate, 1681 00:59:14,540 --> 00:59:15,700 but we're not there yet. 1682 00:59:15,700 --> 00:59:17,180 Not even close. 1683 00:59:17,180 --> 00:59:18,640 We also only reason in 2D. 1684 00:59:18,640 --> 00:59:20,380 There's no 3D reasoning here. 1685 00:59:20,380 --> 00:59:21,880 And that significantly hurts us. 1686 00:59:21,880 --> 00:59:24,610 Although, we have some ideas for how we might do 3D. 1687 00:59:24,610 --> 00:59:27,640 Another important aspect is we don't know forces and contact 1688 00:59:27,640 --> 00:59:28,450 relationships. 1689 00:59:28,450 --> 00:59:29,920 Now, that's fine as long as pick up 1690 00:59:29,920 --> 00:59:32,069 means this kind of action where you 1691 00:59:32,069 --> 00:59:34,360 see me standing next to an object and the object moving 1692 00:59:34,360 --> 00:59:34,860 up. 1693 00:59:34,860 --> 00:59:37,257 But sometimes, pick up means something totally different. 1694 00:59:37,257 --> 00:59:39,340 So you're going to see this cat is going to pick up 1695 00:59:39,340 --> 00:59:42,277 that kitten in just a moment. 1696 00:59:42,277 --> 00:59:44,110 And you're going to see, if you pay attention 1697 00:59:44,110 --> 00:59:46,600 to the motion of the cat, that it doesn't look 1698 00:59:46,600 --> 00:59:48,901 like it's picking something up. 1699 00:59:48,901 --> 00:59:51,150 It's not very good at picking up the kitten, mind you. 1700 00:59:51,150 --> 00:59:52,630 I think this may be its first try. 1701 00:59:55,619 --> 00:59:56,910 I think it's having a good day. 1702 00:59:56,910 --> 00:59:58,330 It's OK. 1703 00:59:58,330 --> 00:59:59,825 Struggling a little bit. 1704 00:59:59,825 --> 01:00:01,480 But see? 1705 01:00:01,480 --> 01:00:03,040 So definitely, picked it up. 1706 01:00:03,040 --> 01:00:05,380 But it didn't look anything like any of the other pick-up 1707 01:00:05,380 --> 01:00:06,700 examples I showed you. 1708 01:00:06,700 --> 01:00:09,040 But conceptually, you should totally 1709 01:00:09,040 --> 01:00:11,260 recognize this if you've seen those other examples. 1710 01:00:11,260 --> 01:00:13,120 And kids can do this. 1711 01:00:13,120 --> 01:00:14,950 So the important part is you have 1712 01:00:14,950 --> 01:00:16,630 to change how you reason. You can't just 1713 01:00:16,630 --> 01:00:18,924 reason about the relative motions of the objects.
1714 01:00:18,924 --> 01:00:21,340 You have to assume that there are some hidden forces going 1715 01:00:21,340 --> 01:00:21,790 on. 1716 01:00:21,790 --> 01:00:23,110 And you have to reason about the contact 1717 01:00:23,110 --> 01:00:25,151 relationships and the forces that the objects are 1718 01:00:25,151 --> 01:00:26,956 undergoing. 1719 01:00:26,956 --> 01:00:29,469 What happens if you try to recognize a helicopter picking 1720 01:00:29,469 --> 01:00:30,010 something up? 1721 01:00:30,010 --> 01:00:32,051 It looks totally different from a human doing it, 1722 01:00:32,051 --> 01:00:35,380 but no one has any problems recognizing this. 1723 01:00:35,380 --> 01:00:36,880 Segmentation is also a huge problem. 1724 01:00:36,880 --> 01:00:38,379 For many of these problems, you have 1725 01:00:38,379 --> 01:00:40,740 to pay attention to the fine boundaries of the objects 1726 01:00:40,740 --> 01:00:43,115 in order to understand that that kitten was being rotated 1727 01:00:43,115 --> 01:00:45,442 and then slightly lifted. 1728 01:00:45,442 --> 01:00:47,150 There's also a more philosophical problem 1729 01:00:47,150 --> 01:00:48,733 about what is a part and what it means 1730 01:00:48,733 --> 01:00:50,470 for something to be an object. 1731 01:00:50,470 --> 01:00:52,750 We arbitrarily say that the cat is an object, 1732 01:00:52,750 --> 01:00:54,370 but I could refer to its paws. 1733 01:00:54,370 --> 01:00:55,510 I could refer to its ears. 1734 01:00:55,510 --> 01:00:57,580 I could refer to one small patch on its back. 1735 01:00:57,580 --> 01:00:59,579 As long as we all know what we're talking about, 1736 01:00:59,579 --> 01:01:00,790 that can be our object. 1737 01:01:00,790 --> 01:01:03,130 And that's a problem throughout computer vision. 1738 01:01:03,130 --> 01:01:05,500 It also occurs in a totally different problem. 1739 01:01:05,500 --> 01:01:07,240 So if you've ever seen Bongard problems, 1740 01:01:07,240 --> 01:01:09,550 these are problems where you have these weird patches, 1741 01:01:09,550 --> 01:01:11,730 and you have to figure out what's in common between them. 1742 01:01:11,730 --> 01:01:12,610 And that's a case where you have 1743 01:01:12,610 --> 01:01:14,290 to dig deep into your visual system 1744 01:01:14,290 --> 01:01:17,060 to extract a completely different kind of information. 1745 01:01:17,060 --> 01:01:18,855 And this is an example that I prefer. 1746 01:01:18,855 --> 01:01:22,780 So in this task, you can try to find the real dog. 1747 01:01:22,780 --> 01:01:25,600 And we can all spot it after looking for a little while. 1748 01:01:25,600 --> 01:01:26,300 Right? 1749 01:01:26,300 --> 01:01:28,350 Does everyone see it? 1750 01:01:28,350 --> 01:01:28,940 OK. 1751 01:01:28,940 --> 01:01:30,489 So you can all see it. 1752 01:01:30,489 --> 01:01:31,530 The interesting part is-- 1753 01:01:31,530 --> 01:01:34,030 I mean, I doubt you have ever had training 1754 01:01:34,030 --> 01:01:38,550 detecting real dogs amongst masses of fake dogs. 1755 01:01:38,550 --> 01:01:41,410 But somehow, you were able to adapt and extract 1756 01:01:41,410 --> 01:01:43,260 a completely different kind of information 1757 01:01:43,260 --> 01:01:44,310 from your visual system. 1758 01:01:44,310 --> 01:01:46,560 Information that isn't captured by our feature vector, 1759 01:01:46,560 --> 01:01:50,580 which, as I talked about, contains the color, location, velocity, et cetera.
1760 01:01:50,580 --> 01:01:52,710 So you have this ability to extract out 1761 01:01:52,710 --> 01:01:54,870 task-specific information. 1762 01:01:54,870 --> 01:01:56,580 You can do things like theory of mind, 1763 01:01:56,580 --> 01:01:58,050 but you can do far more than assume 1764 01:01:58,050 --> 01:01:59,340 people are running a planner. 1765 01:01:59,340 --> 01:02:00,510 You can detect if I'm sad. 1766 01:02:00,510 --> 01:02:02,490 If I'm happy. 1767 01:02:02,490 --> 01:02:04,590 You can reason about whether two people 1768 01:02:04,590 --> 01:02:06,600 are having a particular kind of interaction. 1769 01:02:06,600 --> 01:02:08,894 Who's more powerful than the other person. 1770 01:02:08,894 --> 01:02:11,310 You also have a very strong physics model inside your head 1771 01:02:11,310 --> 01:02:13,110 that underlies much of this. 1772 01:02:13,110 --> 01:02:15,930 And even more than that, there's the concept of modification. 1773 01:02:15,930 --> 01:02:19,500 So walking quickly looks very different from running quickly. 1774 01:02:19,500 --> 01:02:21,720 And the way you model these is quite complicated. 1775 01:02:21,720 --> 01:02:24,490 And the system that I presented doesn't do a good job of it. 1776 01:02:24,490 --> 01:02:27,900 But one of my favorite examples from my childhood 1777 01:02:27,900 --> 01:02:31,350 long ago is this one, which is a kind of modification. 1778 01:02:31,350 --> 01:02:34,687 So the Coyote is going to draw this. 1779 01:02:34,687 --> 01:02:36,270 You're going to see the Roadrunner try 1780 01:02:36,270 --> 01:02:37,750 to run through it. 1781 01:02:37,750 --> 01:02:39,010 And he makes it. 1782 01:02:39,010 --> 01:02:42,145 And you can imagine what's about to happen next. 1783 01:02:42,145 --> 01:02:43,770 The Coyote is not going to have a good day. 1784 01:02:43,770 --> 01:02:45,880 So this looks silly, right? 1785 01:02:45,880 --> 01:02:48,380 And you would think to yourself, how could we possibly apply 1786 01:02:48,380 --> 01:02:49,350 this to the real world? 1787 01:02:49,350 --> 01:02:51,766 But actually, this happens in the real world all the time. 1788 01:02:51,766 --> 01:02:54,602 A cage can be open for a mouse, but closed for an elephant. 1789 01:02:54,602 --> 01:02:56,560 So if you're going to represent something like, 1790 01:02:56,560 --> 01:02:58,530 is something closed or not, you have 1791 01:02:58,530 --> 01:03:00,281 to be able to handle situations like this. 1792 01:03:00,281 --> 01:03:01,696 And that's why kids can understand 1793 01:03:01,696 --> 01:03:03,960 really weird scenarios like this, because they're not 1794 01:03:03,960 --> 01:03:05,896 so outlandish. 1795 01:03:05,896 --> 01:03:07,770 There's also the problem of the vast majority 1796 01:03:07,770 --> 01:03:08,850 of English verbs-- 1797 01:03:08,850 --> 01:03:12,430 things like absolve, admire, anger, approve, bark, et 1798 01:03:12,430 --> 01:03:12,930 cetera. 1799 01:03:12,930 --> 01:03:14,820 All of them require far more knowledge. 1800 01:03:14,820 --> 01:03:17,340 They require many of the things I've talked about before. 1801 01:03:17,340 --> 01:03:19,680 And actually, far more than them. 1802 01:03:19,680 --> 01:03:21,630 And what's even worse is we also use language 1803 01:03:21,630 --> 01:03:23,250 in pretty bizarre ways. 1804 01:03:23,250 --> 01:03:25,920 So there are some kinds of idioms in English, 1805 01:03:25,920 --> 01:03:28,230 like the market [AUDIO OUT] bullish, 1806 01:03:28,230 --> 01:03:30,660 that you have to have seen before to understand, right?
1807 01:03:30,660 --> 01:03:32,267 There's no reason to assume that bears 1808 01:03:32,267 --> 01:03:34,350 are better or worse than bulls when you apply them 1809 01:03:34,350 --> 01:03:35,739 to the stock market. 1810 01:03:35,739 --> 01:03:37,530 On the other hand, there are certain things 1811 01:03:37,530 --> 01:03:38,571 that are very systematic. 1812 01:03:38,571 --> 01:03:40,467 I can have an up day or a down day, 1813 01:03:40,467 --> 01:03:43,050 because we've kind of, as a culture, agreed that up is good 1814 01:03:43,050 --> 01:03:43,767 and down is bad. 1815 01:03:43,767 --> 01:03:45,600 Some cultures have made the opposite choice. 1816 01:03:45,600 --> 01:03:47,820 But usually, it's up is good, down is bad. 1817 01:03:47,820 --> 01:03:51,210 So an idea can be grand or it can be small, 1818 01:03:51,210 --> 01:03:54,870 because we've decided big things are better than small things. 1819 01:03:54,870 --> 01:03:57,210 Someone's mood can be dark or light. 1820 01:03:57,210 --> 01:03:59,070 And these are very systematic variations 1821 01:03:59,070 --> 01:04:00,390 that underlie all of language. 1822 01:04:00,390 --> 01:04:02,490 And we constantly use metaphoric extension 1823 01:04:02,490 --> 01:04:04,800 in order to describe what's going on around us 1824 01:04:04,800 --> 01:04:06,390 and to talk about abstract things. 1825 01:04:06,390 --> 01:04:08,306 It really seems as if this is kind of built in 1826 01:04:08,306 --> 01:04:10,110 to our model of the world. 1827 01:04:10,110 --> 01:04:13,140 And modeling this is kind of over the horizon. 1828 01:04:13,140 --> 01:04:15,030 And there are many, many, many other things 1829 01:04:15,030 --> 01:04:16,489 that we're missing here. 1830 01:04:16,489 --> 01:04:18,780 So I just want to thank all my wonderful collaborators, 1831 01:04:18,780 --> 01:04:22,650 like Boris, Max, Candace, and people at MIT, and people 1832 01:04:22,650 --> 01:04:23,670 elsewhere. 1833 01:04:23,670 --> 01:04:25,835 But to recap, what we saw is that we 1834 01:04:25,835 --> 01:04:27,960 can get a little bit of traction on these problems. 1835 01:04:27,960 --> 01:04:32,460 We can build one system that does one simple problem: it just 1836 01:04:32,460 --> 01:04:34,920 connects our perception with our high-level knowledge, 1837 01:04:34,920 --> 01:04:37,620 takes a video and a sentence, and gives us a score. 1838 01:04:37,620 --> 01:04:39,510 And once we have this interesting connection, 1839 01:04:39,510 --> 01:04:41,700 this interesting feedback between these two 1840 01:04:41,700 --> 01:04:43,440 very different-looking systems, it 1841 01:04:43,440 --> 01:04:46,050 turns out that we can do many different and sometimes 1842 01:04:46,050 --> 01:04:48,500 surprising things.