The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JOSH TENENBAUM: So where we left off was, again, I was telling you a story, both conceptual and motivational and a little bit technical, about how we got to the things we're trying to do now as part of the center. And it involves both the problems we want to solve: we want to understand what is this common sense knowledge about the physical world and the psychological world that you can see in some form even in young infants, and what are the learning mechanisms that build it and grow it. And then, what are the technical ideas that are hopefully also going to be useful for building intelligent robots or other AI systems, and that can explain, on the scientific side, how this stuff works? All right, so that was all this business.
And what I was suggesting, or I'll start to suggest here now, is that it goes back to that quote I gave at the beginning from Craik, right? The guy who in 1943 wrote this book called The Nature of Explanation. He was saying that the essence of intelligence is this ability to build models that allow you to explain the world, to then reason, simulate, plan, and so on. And I think we need tools for understanding how the brain is a modeling engine, or an explaining engine. Or, to get a little bit recursive about it, since what we're doing in science is also an explanatory activity, we need modeling engines in which we can build models of the brain as a modeling engine. And that's where the probabilistic programs are going to come in. So that's part of why I spent a while in the morning talking about these graphical models, and the ways we tried to model various aspects of cognition with them, where I think we made progress but were ultimately dissatisfied. I put up-- I didn't say too much about the technical details. That's fine.
You can read a lot about it or not. But these are ways of using graphs, mostly directed graphs, to capture something about the structure of the world. And then you put probabilities on it in some way, like a diffusion process or a noisy transmission process for a food web. That's a style of reasoning that sometimes goes by the name of Bayesian networks, or causal graphical models. It's been hugely influential in computer science and many other fields, not just AI, and in many fields outside of computer science: not just cognitive science and neuroscience, but many areas of science and engineering. Here are just a few examples of Bayesian networks you get if you search Google Images for Bayesian networks. And if you look carefully, you'll see they come from biology, economics, chemical engineering, whatever. They're due to many people, but maybe more than anyone, the person who's most associated with this idea and with the name Bayesian networks is Judea Pearl. He received the Turing Award, which is the highest award in computer science.
This is a language that we were using in all the projects you saw up until now, in some form, and that we and many others use, because it provides a powerful set of general-purpose tools. It goes back to this dream of building general-purpose systems for understanding the world. These provide general-purpose languages for representing causal structure-- I'll say a little bit more about that-- and general-purpose algorithms for doing probabilistic inference over them. So we talked about ways of combining sophisticated statistical inference with knowledge representation that's causal and compositional. I'll just tell you a little bit about the model in the upper left up there, the one that says diseases and symptoms. It is causal. It is compositional. It does support probabilistic inference. And it was at the heart of why we were doing what we were doing, and showing you how different kinds of causal graphical models could capture different modes of people's reasoning.
And the idea was that maybe learning about different domains was learning those different kinds of graph structures. So let me say a little bit about how it works, and then why it's not enough, because it really isn't enough. I mean, it's the right start. It's definitely in the right direction. But we need to go beyond it, and that's where the probabilistic programs come in. So look at that network up there on the upper left. It's one of the most famous Bayesian networks, a textbook example. One of the first actually implemented AI systems was based on this: a system for medical diagnosis. It's a simple approximation to what a general practitioner might be doing when a patient comes in and reports some pattern of symptoms, and they want to figure out what's wrong. So: diagnosis of a disease to explain the symptoms. The graph is a bipartite graph, two sets of nodes, with the arrows, again, going down in the causal direction. The bottom layer, the symptoms, are the things that you can nominally observe. A patient comes in reporting some symptoms.
Not all are observed, but others may be things that you could test, like medical test results. And then the top level is this level of latent structure: the causes, the things that cause the symptoms. The arrows represent, basically, which diseases cause which symptoms. In this model there are roughly 500 or 600 diseases-- you know, the commonish ones-- and 4,000 symptoms. So it's a big model. And in some sense, you can think of it as a big probability model. It's a way of specifying a joint distribution on this 4,600-dimensional space. But it's a very particular one that's causally structured. It represents only the minimal causal dependencies, and really only the minimal probabilistic dependencies. That sparsity is really important for how you use it, whether you're talking about inference or learning. So inference means observing the values of some of those variables, like patterns of symptoms, and making guesses about the others: observing some symptoms and making guesses about the diseases that are most likely to have explained those.
Or you might make a prediction about other symptoms you could observe. So you could go up and then back down. You could say, well, from these symptoms, I think the patient might have one of these two rare diseases; I don't know which one. But if it was this disease, then it would predict that symptom, or maybe that test result, and this other disease wouldn't. So that suggests a way to plan an action you could take to figure things out: I could go test for that symptom, and that would tell me which of these diseases the patient has. These models are also useful in planning other kinds of treatments, interventions. Like if you want to cure someone-- again, we all know this intuitively-- you should try to cure the disease, not the symptom. If you have some way to act to change the state of one of those disease variables, to kind of turn it off, then reasonably that should relieve the symptoms: if that disease gets turned off, these symptoms should turn off. Whereas just treating the symptom, like taking Advil for a headache, is fine if that's all the problem is.
But if it's being caused by something, you know, god forbid, like a brain tumor, it's not going to help. It's not going to cure the problem in the long term. OK, so all those patterns of causal inference-- reasoning, prediction, action planning, exploration-- this is a beautiful language for capturing all of those, and you can automate all those inferences. Why isn't it enough, then, for capturing commonsense reasoning, or this approach to cognition? Which I'm calling the model-building, explaining part, as opposed to the pattern recognition part. I mean, again, I don't want to get too far behind in talking about this, but that example is so rich. If you wanted to build a neural network, you could just turn the arrows around to learn a mapping from symptoms to diseases, and that would be a pattern classifier. So there are these two different paradigms for intelligence-- as some of the questions have been getting at, and as I'll show with some more interesting examples in a little bit-- and often the relations between them are quite subtle, and quite valuable.
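To make the diagnosis picture concrete, here is a minimal sketch of a bipartite disease-symptom network with noisy-OR links and inference by brute-force enumeration. Everything here, the disease names, the probabilities, and the leak term, is invented for illustration; the real diagnosis network described in the lecture is far larger and uses more sophisticated inference.

```python
import itertools

# Toy bipartite disease -> symptom network (noisy-OR). All diseases,
# symptoms, and probabilities below are made up for illustration.
priors = {"flu": 0.10, "migraine": 0.05}          # P(disease present)
# Edge strengths: P(symptom | that disease alone, no leak)
strength = {("flu", "fever"): 0.9, ("flu", "headache"): 0.6,
            ("migraine", "headache"): 0.95}
leak = 0.01   # background probability of a symptom with no disease

def p_symptom(symptom, active):
    """Noisy-OR: each active parent disease independently fails to
    cause the symptom with probability (1 - strength)."""
    p_none = 1 - leak
    for d in active:
        p_none *= 1 - strength.get((d, symptom), 0.0)
    return 1 - p_none

def posterior(observed):
    """Enumerate every disease combination; return the marginal
    posterior P(disease | observed symptoms) for each disease."""
    diseases = list(priors)
    marg = {d: 0.0 for d in diseases}
    total = 0.0
    for bits in itertools.product([False, True], repeat=len(diseases)):
        active = [d for d, b in zip(diseases, bits) if b]
        w = 1.0
        for d, b in zip(diseases, bits):       # prior on this combination
            w *= priors[d] if b else 1 - priors[d]
        for s, v in observed.items():          # likelihood of the evidence
            ps = p_symptom(s, active)
            w *= ps if v else 1 - ps
        total += w
        for d in active:
            marg[d] += w
    return {d: marg[d] / total for d in diseases}

# Headache without fever points toward migraine rather than flu.
print(posterior({"headache": True, "fever": False}))
```

Note that enumeration is exponential in the number of diseases, which is exactly why fast approximate inference matters once the network has hundreds of causes.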
So one nice way to work with such a model, for example-- I mentioned a lot of people want to know, and I'll keep talking about this for the rest of the hour, productive ways to combine these powerful generative models with more pattern recognition approaches. For these models, there are general-purpose algorithms that can support these inferences, that can tell you what diseases you're likely to have given what symptoms. In some cases they can be very fast; in other cases they can be very slow. Whereas you could imagine trying to learn a neural network that looks just like that, only the arrows go up, so it implements a mapping from data to diseases. That could help you do much faster inference in the cases where that's possible. So that's just one example, and it might be not a crazy way to think about, more generally, the way top-down and bottom-up connections work in the brain. I'll take that a little bit more literally in a vision example in a second.
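One way to picture that fast bottom-up mapping is to train it on samples from the generative model itself. The sketch below uses a lookup table rather than a neural network, and a one-disease toy model whose probabilities are made up, but the logic is the same: run the top-down model many times, then cache, for each symptom pattern, how often the cause was present.

```python
import random
from collections import defaultdict

random.seed(0)

# Tiny generative (top-down) model: sample a cause, then its effects.
# All the numbers here are invented for illustration.
def sample_world():
    disease = random.random() < 0.2              # P(disease)
    p_sym = 0.8 if disease else 0.05             # P(each symptom | disease)
    symptoms = tuple(random.random() < p_sym for _ in range(3))
    return disease, symptoms

# "Turn the arrows around": amortize inference by recording, for each
# symptom pattern, how often the disease was present in forward samples.
counts = defaultdict(lambda: [0, 0])   # pattern -> [disease count, total]
for _ in range(100_000):
    d, s = sample_world()
    counts[s][0] += d
    counts[s][1] += 1

def fast_posterior(symptoms):
    """Constant-time bottom-up guess learned from the generative model."""
    hit, total = counts[tuple(symptoms)]
    return hit / total

print(fast_posterior((True, True, True)))     # high: all symptoms present
print(fast_posterior((False, False, False)))  # low: no symptoms
```

A neural network plays the same role as the lookup table here, but generalizes to symptom patterns it has never sampled.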
So there's a lot you can get from studying these causal graphical models, including some version of what it is for the mind to explain the world, and how that explanation and the pattern recognition approach can work together. But it's not enough to really get at the heart of common sense. The mental generative models we build are more richly structured. They're more like programs. What do I mean by that? Well, here I'm giving a bunch of examples of scientific theories or models. Not commonsense ones, but I think the same idea applies. They are ways of, again, explaining the world, not just describing the pattern. So we went at the beginning through Newton's laws versus Kepler's laws. That's just one example. And you might not have thought of those laws as a program, but they're certainly not a graph. On the first slide, when I showed Newton's laws, there was a bunch of symbols, statements in English, some math. But what it comes down to is basically a set of pieces of code that you could run to generate the orbits.
It doesn't describe the shapes or the velocities; it's a machine that you plug some things into. You plug in some masses, some objects, some initial conditions. And you press run, and it generates the orbits, just like what you're seeing there. Although those probably weren't generated that way-- that's a GIF. OK, that's more like Kepler, or Ptolemy. But anyway, it's a powerful machine. It's a machine which, if you put down the right masses in the right positions, they don't just all go around in ellipses. Some of them are like moons, and they will go around the things that go around the others. And some of them will be like apples on the Earth, and they won't go around anything. They'll just fall down. So that's the powerful machine. And in the really simplest cases, those equations can be solved analytically. You can use calculus or other methods of analysis, like Newton did. He didn't have a computer.
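The "machine you press run on" can literally be a few lines of code. Here is a minimal gravitational simulator (in toy units with G = 1; the masses and initial conditions are arbitrary illustrative values, not anything from the slides). Nothing in the loop assumes how many bodies there are, which is exactly what the analytic approach cannot offer.

```python
import math

G = 1.0  # toy units

def step(bodies, dt):
    """One semi-implicit (symplectic) Euler step.
    bodies: list of [mass, x, y, vx, vy]."""
    for i, b in enumerate(bodies):
        ax = ay = 0.0
        for j, o in enumerate(bodies):
            if i == j:
                continue
            dx, dy = o[1] - b[1], o[2] - b[2]
            r = math.hypot(dx, dy)
            a = G * o[0] / (r * r)        # acceleration toward body o
            ax += a * dx / r
            ay += a * dy / r
        b[3] += ax * dt                   # update all velocities first...
        b[4] += ay * dt
    for b in bodies:
        b[1] += b[3] * dt                 # ...then all positions
        b[2] += b[4] * dt

# A heavy "sun" at rest plus one light planet started at circular-orbit
# speed v = sqrt(G*M/r) for r = 1. Press run and the orbit comes out.
bodies = [[1000.0, 0.0, 0.0, 0.0, 0.0],
          [0.001, 1.0, 0.0, 0.0, math.sqrt(1000.0)]]
for _ in range(10_000):
    step(bodies, 1e-4)
```

With two bodies you recover Kepler-style ellipses; add a third comparable mass and the same loop runs unchanged, even though no closed-form solution exists.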
And you can show that for a two-body system, one planet and one sun, you can solve those equations to show that you get Kepler's laws. Amazing. And under the approximation that, for every other planet, it's only the sun that's exerting a significant influence, you can describe all of Kepler's laws this way. But once you have more than two bodies interacting in some complex way, like three masses similar in size near each other, you can't solve the equations analytically anymore. You basically just have to run a simulation. For the most part, the world is complicated, and our models have to be run. Here's a model of riverbed formation. These are snapshots of a model of a galaxy collision, and there's climate modeling, or aerodynamics. So basically, what most modern science is, is that you write down descriptions of the causal processes, something going on in the world, and you study that through some combination of analysis and simulation to see what would happen. If you want to estimate parameters, you try out some guesses of the parameters.
And you run this thing, and you see if its behavior looks like the data you observe. If you're trying to decide between two different models, you simulate each of them, and you see which one looks more like the data you observe. If you think there's something wrong with your model-- it doesn't quite look like the data you observe-- you think, how could I change my model so that if I run it, it'll look more like the data I observe in some important way? Those activities of science-- those are, in some form, I'm arguing, the activities of common sense explanation. So when I'm talking about the child as scientist, that's what I'm basically talking about. It's some version of that. And that includes both describing the causal processes with a program that you run, and, if you want to talk about learning, the scientific analog of building one of these theories. You don't build a theory, whether it's Newton's laws or Mendel's laws or any of these things, by just finding patterns in data. You do something like this program thing, but kind of recursively.
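The run-and-compare loop just described can be sketched in a few lines. This is a made-up example, not one from the lecture: the unknown parameter is a gravitational acceleration, the "observed data" are synthetic drop times generated with g = 9.8, and estimation is brute-force guess-and-check against the simulator.

```python
def fall_time(height, g, dt=1e-4):
    """Run the causal model forward: simulate a dropped object and
    return the time it takes to reach the ground."""
    y, v, t = height, 0.0, 0.0
    while y > 0:
        v += g * dt
        y -= v * dt
        t += dt
    return t

heights = [1.0, 2.0, 5.0]
# Stand-in for real measurements, generated here with the true g = 9.8.
observed = [fall_time(h, 9.8) for h in heights]

# Try out guesses of the parameter; keep the one whose simulated
# behavior looks most like the data.
best_g, best_err = None, float("inf")
for guess in [x / 10 for x in range(50, 151)]:     # g from 5.0 to 15.0
    sim = [fall_time(h, guess) for h in heights]
    err = sum((s - o) ** 2 for s, o in zip(sim, observed))
    if err < best_err:
        best_g, best_err = guess, err

print(best_g)   # recovers a value close to 9.8
```

Comparing two different models works the same way: simulate each, and keep whichever one's output looks more like the observations.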
Think of it as having some kind of paradigm, some program that generates programs, and you use it to try to somehow search the space of programs to come up with a program that fits your data well. OK, so that's, again, kind of the big picture. And now let's talk about how we can actually do something with this idea-- use these programs. And you might be wondering, OK, maybe I understand-- I'm realizing I didn't say the main thing I want you to understand. The main thing I want you to get from this is how programs go beyond graphs. So none of these processes here can be nicely described with a graph the way we have in the language of graphical models. So the interesting causality-- I mean, in some sense, there's kind of a graph. You can talk about the state of the world at time T, the state of the world at time T plus 1, and an arrow forward in time, and I'll show you graphs like this in a second. But all the interesting stuff that science really gains power from is the much more fine-grained structure, captured in equations or functions that describe exactly how all this stuff works.
And it needs languages like math or C++ or LISP; it needs a symbolic language of processes to really do it justice. The second thing I want you to get will take a minute, but let's put it out there. Yes, OK, maybe you get the idea that programs can be used to describe causal processes in interesting ways. But where does the probability part come in? Well, the same thing is actually true in graphical models. How many people have read Judea Pearl's 2000 book called Causality? How many people have read his '88 book? OK, nobody's read anything. But what Pearl is most famous for-- I mean, when we say Pearl is famous for inventing Bayesian networks, that's based on work he did in the '80s, in which, yes, they were all probability models. But then he came to what he calls, and I would call too, a deeper view, in which it was really about basically deterministic causal relations. Basically, it was a graphical language for equations-- certain classes of equations, like structural equations. If you know about linear structural equations, it was sort of like nonlinear structural equations.
And then probabilities are things you put on top of that, to capture the things you don't know, that you're uncertain about. And I think he was getting at the fact that to scientists, and also to people-- there's some very nice work by Laura Schulz and Jessica Sommerville, both of whom will be here next week, actually, on how children's concepts of causality are basically deterministic at the core. And where the probabilities come in is on the things that we don't observe or the things we don't know-- the uncertainty. It's not that the world is noisy. It's that we believe, at least-- except for quantum mechanics-- our intuitive notions are that the world is basically deterministic, but with a lot of stuff we don't know. This was, for example, Laplace's view in the philosophy of science. And really, until quantum mechanics, it was broadly the Enlightenment science view that the world is full of all these complicated deterministic machines, and uncertainty comes from the things that we can't observe, or that we can't measure finely enough, or that are just in some form unknown or unknowable to us.
Does that make sense? So you'll see more of this in a second. But where the probabilities are going to come from is basically this: if there are inputs to the program that we don't know, or parameters we don't know, then in order to simulate the program we're going to have to put distributions on those, make some guesses, and then see what happens for different guesses. Does that make sense? OK. Good. So again, that's most of the technical stuff I need to say. And you'll learn about how this works in much more concrete detail if you go to the tutorial afterwards that Tomer is going to run. What you'll see there is this. So here are just a few examples. Many of you hopefully already looked at the web pages from this probmods.org thing. And what you see here is that each of these boxes is a probabilistic program model. Most of it is a bunch of define statements. So if you look here, you'll see these define statements. Those are just defining functions. They name the function.
429 00:16:22,370 --> 00:16:24,650 They take some inputs, which call other functions, 430 00:16:24,650 --> 00:16:26,120 and then they maybe do something-- 431 00:16:26,120 --> 00:16:28,650 they have some output that might be an object. 432 00:16:28,650 --> 00:16:30,470 It might itself be a function. 433 00:16:30,470 --> 00:16:34,760 These can be functions that generate other functions. 434 00:16:34,760 --> 00:16:36,590 And where the probabilities come in 435 00:16:36,590 --> 00:16:39,470 is that sometimes these functions call random number 436 00:16:39,470 --> 00:16:40,520 generators, basically. 437 00:16:40,520 --> 00:16:42,186 If you look carefully, you'll see things 438 00:16:42,186 --> 00:16:48,670 like Dirichlet, or uniform draw, or Gaussian, or flip. 439 00:16:48,670 --> 00:16:51,740 Right, those are primitive random functions that flip a coin, 440 00:16:51,740 --> 00:16:54,110 or roll a die, or draw from a Gaussian. 441 00:16:54,110 --> 00:16:59,040 And those capture things that are currently unknown. 442 00:16:59,040 --> 00:17:03,470 In a very important sense, the particular language, Church, 443 00:17:03,470 --> 00:17:06,800 that you're going to learn here with its sort of stochastic 444 00:17:06,800 --> 00:17:07,970 LISP-- 445 00:17:07,970 --> 00:17:10,520 basically just functions that call other functions 446 00:17:10,520 --> 00:17:12,650 and maybe add in some randomness to that-- 447 00:17:12,650 --> 00:17:15,931 is very much analogous to the directed graph of a Bayesian 448 00:17:15,931 --> 00:17:16,430 network. 449 00:17:16,430 --> 00:17:19,550 In a Bayesian network, you have nodes and arrows.
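In Python rather than Church (so none of this is the actual probmods.org code), the same structure looks like ordinary function definitions, a few primitive random functions, and functions that return functions:

```python
import random

def flip(p=0.5):
    # Primitive random function: a (possibly weighted) coin flip.
    return random.random() < p

def gaussian(mu, sigma):
    # Primitive random function: a draw from a Gaussian.
    return random.gauss(mu, sigma)

def make_coin():
    # A function whose output is itself a function: the returned
    # coin closes over a randomly drawn weight.
    weight = random.uniform(0.0, 1.0)
    return lambda: flip(weight)

coin = make_coin()
flips = [coin() for _ in range(20)]
```

Everything is deterministic function composition except where a primitive like `flip` or `gaussian` injects randomness.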
450 00:17:19,550 --> 00:17:22,490 And the parents of a node, the ones that send arrows to it, 451 00:17:22,490 --> 00:17:24,530 are basically the minimal set of variables 452 00:17:24,530 --> 00:17:26,530 that if you were going to sample from this model 453 00:17:26,530 --> 00:17:29,521 you'd have to sample first in order to then sample the child 454 00:17:29,521 --> 00:17:30,020 variable. 455 00:17:30,020 --> 00:17:32,270 Because those are the key things it depends on. 456 00:17:32,270 --> 00:17:34,640 And you can have a multi-layered Bayesian network 457 00:17:34,640 --> 00:17:36,410 that, if you are going to sample from it, 458 00:17:36,410 --> 00:17:38,755 you just start at the top and sort of go down. 459 00:17:38,755 --> 00:17:40,130 That's exactly the same thing you 460 00:17:40,130 --> 00:17:41,879 have in these probabilistic programs where 461 00:17:41,879 --> 00:17:44,570 the define statements are basically defining a function. 462 00:17:44,570 --> 00:17:48,436 And the functions are the nodes, and the other functions 463 00:17:48,436 --> 00:17:50,060 that they call as part of the statement 464 00:17:50,060 --> 00:17:52,970 are the nodes that send arrows there. 465 00:17:52,970 --> 00:17:55,520 But the key is, as you can imagine if you've ever-- 466 00:17:55,520 --> 00:17:57,950 I mean, all of you have written computer programs-- 467 00:17:57,950 --> 00:18:00,650 is that only very simple programs look 468 00:18:00,650 --> 00:18:02,240 like directed acyclic graphs. 469 00:18:02,240 --> 00:18:04,190 And that's what a Bayesian network is. 470 00:18:04,190 --> 00:18:06,129 It's very easy and often necessary 471 00:18:06,129 --> 00:18:08,420 to write a program to really capture something causally 472 00:18:08,420 --> 00:18:10,050 interesting in the world where it's not 473 00:18:10,050 --> 00:18:11,634 a directed acyclic graph. 474 00:18:11,634 --> 00:18:12,800 There's all sorts of cycles.
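That parents-before-children sampling order can be sketched with the textbook rain/sprinkler network (the structure and numbers here are the standard illustration, not anything from the slides); the function-call order *is* the topological order of the graph:

```python
import random

def flip(p):
    return random.random() < p

def sample_world():
    # Ancestral sampling: each variable is sampled only after its
    # parents -- exactly the order these statements impose.
    cloudy = flip(0.5)
    rain = flip(0.8) if cloudy else flip(0.1)
    sprinkler = flip(0.1) if cloudy else flip(0.5)
    wet = flip(0.99) if (rain or sprinkler) else flip(0.01)
    return {"cloudy": cloudy, "rain": rain,
            "sprinkler": sprinkler, "wet": wet}

samples = [sample_world() for _ in range(5000)]
```

Start at the top, go down: `cloudy` before `rain` and `sprinkler`, both before `wet`.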
475 00:18:12,800 --> 00:18:13,850 There's recursion. 476 00:18:13,850 --> 00:18:17,570 One thing that a function can do is make a whole other graph. 477 00:18:17,570 --> 00:18:21,290 Or often it might be directed and acyclic, 478 00:18:21,290 --> 00:18:22,940 but all the interesting stuff is kind 479 00:18:22,940 --> 00:18:26,169 of going on inside what happens when you evaluate one function. 480 00:18:26,169 --> 00:18:27,710 So if you were to draw it as a graph, 481 00:18:27,710 --> 00:18:30,680 it might look like you could draw a directed acyclic graph, 482 00:18:30,680 --> 00:18:32,096 but all the interesting stuff will 483 00:18:32,096 --> 00:18:35,342 be going on inside one node or one arrow. 484 00:18:35,342 --> 00:18:37,550 So let me get more specific about the particular kind 485 00:18:37,550 --> 00:18:41,300 of programs that we're going to be talking about. 486 00:18:41,300 --> 00:18:43,280 In a probabilistic programming language 487 00:18:43,280 --> 00:18:46,160 like Church, or in general in this view of the mind, 488 00:18:46,160 --> 00:18:48,140 we're interested in being able to build really 489 00:18:48,140 --> 00:18:49,220 any kind of thing. 490 00:18:49,220 --> 00:18:51,860 Again, there's lots of big dreams here. 491 00:18:51,860 --> 00:18:53,840 Like I was saying before, I felt like we 492 00:18:53,840 --> 00:18:55,100 had to give up on some dreams, but we've 493 00:18:55,100 --> 00:18:56,600 replaced it with even grander ones, 494 00:18:56,600 --> 00:18:59,000 like probabilistic modeling engines that 495 00:18:59,000 --> 00:19:00,990 can do any computable model. 
496 00:19:00,990 --> 00:19:04,400 But in the spirit of trying to scale up from something that we 497 00:19:04,400 --> 00:19:08,300 can get traction on, what I've been focusing on 498 00:19:08,300 --> 00:19:10,130 in a lot of my work recently and what we've 499 00:19:10,130 --> 00:19:11,960 been doing as part of the center, 500 00:19:11,960 --> 00:19:14,300 are particular probabilistic programs 501 00:19:14,300 --> 00:19:17,180 that we think can capture this very early core of common sense 502 00:19:17,180 --> 00:19:21,140 intuitive physics and intuitive psychology in young kids. 503 00:19:21,140 --> 00:19:23,520 It's what I called-- and I remember I mentioned this 504 00:19:23,520 --> 00:19:25,330 in the first lecture-- 505 00:19:25,330 --> 00:19:26,780 this game engine in your head. 506 00:19:26,780 --> 00:19:31,310 So it's programs for graphics engines, physics engines, 507 00:19:31,310 --> 00:19:33,140 planning engines, the basic kinds of things 508 00:19:33,140 --> 00:19:37,760 you might use to build one of these immersive video games. 509 00:19:37,760 --> 00:19:40,970 And we think if you wrap those inside this framework 510 00:19:40,970 --> 00:19:43,190 for probabilistic inference, then 511 00:19:43,190 --> 00:19:46,610 that's a powerful way to do the kind of common sense scene 512 00:19:46,610 --> 00:19:48,680 understanding, whether in these adult versions 513 00:19:48,680 --> 00:19:51,230 or in the young kid versions. 514 00:19:51,230 --> 00:19:56,690 Now, to specify this probabilistic programs 515 00:19:56,690 --> 00:19:58,670 view, just like with Bayesian networks 516 00:19:58,670 --> 00:20:01,471 or these graphical models, we wanted general purpose tools 517 00:20:01,471 --> 00:20:03,470 for representing interesting things in the world 518 00:20:03,470 --> 00:20:06,470 and for computing the inferences that we want.
519 00:20:06,470 --> 00:20:10,280 Again, which means basically observing, say, just like you 520 00:20:10,280 --> 00:20:12,110 observe some of the symptoms and you 521 00:20:12,110 --> 00:20:14,330 want to compute the likely diseases that best 522 00:20:14,330 --> 00:20:16,130 explain the observed symptoms. 523 00:20:16,130 --> 00:20:20,360 Here we talk about observing the outputs of some 524 00:20:20,360 --> 00:20:22,754 of these programs, like the image 525 00:20:22,754 --> 00:20:24,420 that's the output of a graphics program. 526 00:20:24,420 --> 00:20:27,150 And we want to work backwards and make a guess at the world 527 00:20:27,150 --> 00:20:29,030 state, the input to the graphics engine 528 00:20:29,030 --> 00:20:31,280 that's most likely to have produced the image. 529 00:20:31,280 --> 00:20:34,520 That's the analog of getting diseases from symptoms. 530 00:20:34,520 --> 00:20:38,624 Or again, that's our explanation right there. 531 00:20:38,624 --> 00:20:41,040 And there are lots of different algorithms for doing this. 532 00:20:41,040 --> 00:20:42,748 I'm not going to say too much about them. 533 00:20:42,748 --> 00:20:44,942 Tomer will say a little bit more in the afternoon. 534 00:20:44,942 --> 00:20:46,400 The main thing I will do is, I will 535 00:20:46,400 --> 00:20:48,710 say that the main general purpose 536 00:20:48,710 --> 00:20:51,500 algorithms for inference in probabilistic programming 537 00:20:51,500 --> 00:20:54,590 languages are in the category of slow 538 00:20:54,590 --> 00:20:59,360 and slower and really, really slow. 539 00:20:59,360 --> 00:21:01,850 And this is one of the many ways in which there's 540 00:21:01,850 --> 00:21:04,360 no magic or no free lunch. 541 00:21:04,360 --> 00:21:06,350 Across all of AI and cognitive science, 542 00:21:06,350 --> 00:21:08,777 when you build very powerful representations, 543 00:21:08,777 --> 00:21:10,610 doing inference with them becomes very hard.
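Working backwards from outputs can be sketched with the crudest general-purpose algorithm, rejection sampling, on a made-up two-disease model (all names and numbers invented for illustration); it also shows why these fully general methods land in the "slow" category — most forward runs get thrown away:

```python
import random

def flip(p):
    return random.random() < p

def sample_patient():
    # Forward, causal direction: diseases cause symptoms.
    flu = flip(0.10)
    cold = flip(0.20)
    fever = flip(0.90) if flu else flip(0.05)
    cough = flip(0.80) if (flu or cold) else flip(0.05)
    return flu, cold, fever, cough

# Backwards direction: observe the outputs (the symptoms), keep only
# the runs that reproduce them, and read off the hidden causes.
runs = (sample_patient() for _ in range(20000))
accepted = [(flu, cold) for flu, cold, fever, cough in runs
            if fever and cough]
p_flu_given_symptoms = sum(flu for flu, _ in accepted) / len(accepted)
```

The same conditioning pattern applies when the forward program is a graphics engine and the observed output is an image.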
544 00:21:10,610 --> 00:21:12,144 It's part of why people often like 545 00:21:12,144 --> 00:21:13,310 things like neural networks. 546 00:21:13,310 --> 00:21:14,809 They're much weaker representations, 547 00:21:14,809 --> 00:21:17,600 but inference can be much faster. 548 00:21:17,600 --> 00:21:19,984 And at the moment, the only totally general purpose 549 00:21:19,984 --> 00:21:22,400 algorithms for doing inference with probabilistic programs 550 00:21:22,400 --> 00:21:23,300 are slow. 551 00:21:23,300 --> 00:21:25,580 But first of all, they're getting faster. 552 00:21:25,580 --> 00:21:27,520 People are coming up with-- 553 00:21:27,520 --> 00:21:30,780 and I can talk about this offline where that's going-- 554 00:21:30,780 --> 00:21:34,210 but also-- and this is what I'll talk about in a sharper way 555 00:21:34,210 --> 00:21:35,450 in a second-- 556 00:21:35,450 --> 00:21:37,767 there are particular classes of probabilistic programs, 557 00:21:37,767 --> 00:21:40,100 in particular, the ones in the game engine in your head. 558 00:21:40,100 --> 00:21:42,710 Like for vision, it's inverse graphics, and maybe 559 00:21:42,710 --> 00:21:46,400 some things about physics and psychology too. 560 00:21:46,400 --> 00:21:48,987 I mean, again, I'm just thinking of the stuff like what's 561 00:21:48,987 --> 00:21:50,570 going on when a kid is playing 562 00:21:50,570 --> 00:21:51,980 with some objects around them and thinking 563 00:21:51,980 --> 00:21:54,271 about what other people might think about those things. 564 00:21:54,271 --> 00:21:57,680 It's just that setting where we think 565 00:21:57,680 --> 00:22:01,179 that you can build sort of in some sense special purpose. 566 00:22:01,179 --> 00:22:02,720 I mean, they're still pretty general.
567 00:22:02,720 --> 00:22:05,720 But inference algorithms for doing inference 568 00:22:05,720 --> 00:22:07,940 in probabilistic programs, getting the causes 569 00:22:07,940 --> 00:22:10,670 from the effects that are much, much faster 570 00:22:10,670 --> 00:22:12,170 than things that could work on just 571 00:22:12,170 --> 00:22:15,845 arbitrary probabilistic programs and that actually often look 572 00:22:15,845 --> 00:22:16,970 a lot like neural networks. 573 00:22:16,970 --> 00:22:18,710 And in particular, we can directly 574 00:22:18,710 --> 00:22:22,010 use, say for example, deep convolutional neural networks 575 00:22:22,010 --> 00:22:23,880 to build these recognition programs 576 00:22:23,880 --> 00:22:27,350 or basically inference programs that 577 00:22:27,350 --> 00:22:30,020 work by pattern recognition in, for example, 578 00:22:30,020 --> 00:22:31,790 an inverse graphics approach to vision. 579 00:22:31,790 --> 00:22:34,550 So that's what I'll show you basically now. 580 00:22:34,550 --> 00:22:36,350 I'm going to start off by just working 581 00:22:36,350 --> 00:22:37,725 through a couple of these arrows. 582 00:22:37,725 --> 00:22:41,630 I'm going to first talk about this sort of approach we've 583 00:22:41,630 --> 00:22:44,690 done to tackle both vision as inverse graphics 584 00:22:44,690 --> 00:22:46,539 and some intuitive physics on the scene 585 00:22:46,539 --> 00:22:48,830 recovered and then say a little bit about the intuitive 586 00:22:48,830 --> 00:22:51,260 psychology side. 587 00:22:51,260 --> 00:22:54,090 Here's an example of the kind of specific domain we've studied. 588 00:22:54,090 --> 00:22:56,150 It's like our Atari setting. 589 00:22:56,150 --> 00:22:59,360 It's a kind of video game inspired by the real game 590 00:22:59,360 --> 00:23:00,535 Jenga. 591 00:23:00,535 --> 00:23:02,660 Jenga's this cool game you play with wooden blocks. 
592 00:23:02,660 --> 00:23:06,500 You start off with a very, very, very nicely stacked up thing 593 00:23:06,500 --> 00:23:09,380 and you take turns removing the blocks. 594 00:23:09,380 --> 00:23:11,330 And the player who removes the block that 595 00:23:11,330 --> 00:23:13,662 makes the whole thing fall over is the one who loses. 596 00:23:13,662 --> 00:23:15,620 And it really exercises this part of your brain 597 00:23:15,620 --> 00:23:18,500 that we've been studying here, which is an ability 598 00:23:18,500 --> 00:23:22,130 to reason about stability and support. I very briefly went 599 00:23:22,130 --> 00:23:23,730 over this, but this is something that 600 00:23:23,730 --> 00:23:26,371 is one of the classic case studies of infant object 601 00:23:26,371 --> 00:23:26,870 knowledge, 602 00:23:26,870 --> 00:23:29,480 looking at how basically these concepts develop 603 00:23:29,480 --> 00:23:32,240 in some really interesting ways over the first year of life. 604 00:23:32,240 --> 00:23:34,880 Though what we're doing here is building models and testing 605 00:23:34,880 --> 00:23:36,110 them primarily with adults. 606 00:23:36,110 --> 00:23:38,568 It is part of what we're trying to do in our Brains, Minds, 607 00:23:38,568 --> 00:23:40,220 and Machines research program here, 608 00:23:40,220 --> 00:23:42,470 in collaboration with Liz and others, 609 00:23:42,470 --> 00:23:45,007 to actually test these ideas in experiments with infants. 610 00:23:45,007 --> 00:23:47,090 But what I'll show you is just kind of think of it 611 00:23:47,090 --> 00:23:49,790 as like infant-inspired adult intuitive physics 612 00:23:49,790 --> 00:23:52,490 where we build and test the models in an easier way, 613 00:23:52,490 --> 00:23:55,082 and then we're taking it down to kids going forward.
614 00:23:55,082 --> 00:23:56,540 So the kind of experiment we can do 615 00:23:56,540 --> 00:23:59,990 with adults is show them these configurations of blocks 616 00:23:59,990 --> 00:24:05,440 and say, for example, how stable under gravity 617 00:24:05,440 --> 00:24:07,989 is one of these towers or configurations? 618 00:24:07,989 --> 00:24:09,530 So like everything else, you can make 619 00:24:09,530 --> 00:24:11,930 a judgment on a scale of zero to 10 or one to seven. 620 00:24:11,930 --> 00:24:13,430 And probably most people would agree 621 00:24:13,430 --> 00:24:17,739 that the ones in the upper left are relatively stable, meaning 622 00:24:17,739 --> 00:24:19,280 if you just sort of run gravity on it 623 00:24:19,280 --> 00:24:20,630 it's not going to fall over. 624 00:24:20,630 --> 00:24:22,280 Whereas the ones in the lower right 625 00:24:22,280 --> 00:24:24,590 are much more likely to fall under gravity. 626 00:24:24,590 --> 00:24:25,702 Fair enough? 627 00:24:25,702 --> 00:24:26,660 That's what people say. 628 00:24:26,660 --> 00:24:27,260 OK. 629 00:24:27,260 --> 00:24:29,762 So that's the kind of thing we'd like to be able to explain 630 00:24:29,762 --> 00:24:31,220 as well as many other judgments you 631 00:24:31,220 --> 00:24:33,680 could make about this simple, but not 632 00:24:33,680 --> 00:24:35,057 that simple world of objects. 633 00:24:35,057 --> 00:24:36,890 And again, you can see how in principle this 634 00:24:36,890 --> 00:24:39,473 could very nicely interface with what Demis was talking about. 635 00:24:39,473 --> 00:24:42,350 He talked about their ambition to do the SHRDLU task, which 636 00:24:42,350 --> 00:24:45,620 was this ability to basically have a system that 637 00:24:45,620 --> 00:24:47,600 can take in instructions in language 638 00:24:47,600 --> 00:24:50,094 and manipulate objects in a blocks world. 639 00:24:50,094 --> 00:24:51,260 They are very far from that.
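A minimal sketch of that judgment as simulation, assuming a made-up tower representation (one horizontal center per stacked block) and a crude center-of-mass rule in place of a real physics engine:

```python
import random

def falls(xs, width=1.0):
    # Crude, deterministic stability rule: the stack falls if the
    # center of mass of the blocks above any block overhangs that
    # block's edge.
    for i in range(len(xs) - 1):
        above = xs[i + 1:]
        com = sum(above) / len(above)
        if abs(com - xs[i]) > width / 2:
            return True
    return False

def judged_instability(xs, noise=0.1, runs=200):
    # Perception is uncertain: jitter the inferred block positions,
    # run the deterministic rule on each guess, and report how often
    # the tower comes down.
    falls_count = sum(
        falls([x + random.gauss(0.0, noise) for x in xs])
        for _ in range(runs))
    return falls_count / runs

stable_tower = [0.0, 0.05, -0.05]    # well stacked
precarious_tower = [0.0, 0.4, 0.8]   # badly overhanging

low = judged_instability(stable_tower)
high = judged_instability(precarious_tower)
```

Graded, zero-to-ten-style stability judgments fall out of running a deterministic rule on many noisy perceptual guesses.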
640 00:24:51,260 --> 00:24:53,210 Everybody's really far from having a general purpose 641 00:24:53,210 --> 00:24:55,418 system that can do that in any way like a human does. 642 00:24:55,418 --> 00:24:58,280 But we think we're building some of the common sense knowledge 643 00:24:58,280 --> 00:24:59,330 about the physical world that would 644 00:24:59,330 --> 00:25:02,000 be necessary to get something like that to work or to explain 645 00:25:02,000 --> 00:25:04,400 how kids play with blocks, play with each other, 646 00:25:04,400 --> 00:25:07,410 talk to each other while they're playing with blocks and so on. 647 00:25:07,410 --> 00:25:09,490 So the first step is the vision part. 648 00:25:09,490 --> 00:25:12,770 In this picture here, it's that blue graphics arrow. 649 00:25:12,770 --> 00:25:15,260 Here's another way into it. 650 00:25:15,260 --> 00:25:19,469 We want to be able to take a 2D image and work backwards 651 00:25:19,469 --> 00:25:21,260 to the world state, the kind of world state 652 00:25:21,260 --> 00:25:22,850 that can support physical reasoning. 653 00:25:22,850 --> 00:25:26,570 Again, remember these buzzwords-- 654 00:25:26,570 --> 00:25:28,820 explaining the mind with generative models 655 00:25:28,820 --> 00:25:31,010 that are causal and compositional. 656 00:25:31,010 --> 00:25:33,530 We want a description of the world which 657 00:25:33,530 --> 00:25:35,390 supports causal reasoning of the sort 658 00:25:35,390 --> 00:25:37,464 that physics is doing, like forces interacting 659 00:25:37,464 --> 00:25:38,130 with each other. 660 00:25:38,130 --> 00:25:40,520 So it's got to have things that can exert force 661 00:25:40,520 --> 00:25:41,630 and can suffer forces. 662 00:25:41,630 --> 00:25:43,910 It's got to have mass in some form. 663 00:25:43,910 --> 00:25:45,710 It's got to be compositional because you've 664 00:25:45,710 --> 00:25:47,570 got to be able to pick up a block and take it away. 
665 00:25:47,570 --> 00:25:50,230 Or if I have these blocks over here and these blocks over here 666 00:25:50,230 --> 00:25:51,620 and I want to put these ones on top of there, 667 00:25:51,620 --> 00:25:53,240 the world state has to be able to support 668 00:25:53,240 --> 00:25:55,070 any number of objects in any configuration 669 00:25:55,070 --> 00:25:57,560 and to literally compose a representation 670 00:25:57,560 --> 00:25:59,570 of a world of objects that are composed together 671 00:25:59,570 --> 00:26:00,824 to make bigger things. 672 00:26:00,824 --> 00:26:02,240 So really the only way we know how 673 00:26:02,240 --> 00:26:05,180 to do that is something like what's sometimes in engineering 674 00:26:05,180 --> 00:26:07,022 called a CAD model or computer-aided design. 675 00:26:07,022 --> 00:26:08,480 But it's basically a representation 676 00:26:08,480 --> 00:26:10,640 of three-dimensional objects, often 677 00:26:10,640 --> 00:26:12,650 with something like a mesh or a grid 678 00:26:12,650 --> 00:26:15,470 of key points with their masses and springs for stiffness, 679 00:26:15,470 --> 00:26:16,636 something like that. 680 00:26:16,636 --> 00:26:18,260 Here my only picture of the world state 681 00:26:18,260 --> 00:26:20,090 looks an awful lot like the image, 682 00:26:20,090 --> 00:26:22,100 only it's in black and white instead of color. 683 00:26:22,100 --> 00:26:24,590 But the difference is that the thing on the bottom 684 00:26:24,590 --> 00:26:26,030 is actually an image. 685 00:26:26,030 --> 00:26:27,530 Whereas the thing on the top is just 686 00:26:27,530 --> 00:26:29,840 a 2D projection of a 3D model. 687 00:26:29,840 --> 00:26:31,190 I'll show you that one. 688 00:26:31,190 --> 00:26:32,194 Here's a few others. 689 00:26:32,194 --> 00:26:33,860 So I'll go back and forth between these. 690 00:26:33,860 --> 00:26:36,443 Notice how it kind of looks like the blocks are moving around. 
691 00:26:36,443 --> 00:26:38,360 So what's actually going on is these 692 00:26:38,360 --> 00:26:40,460 are samples from the Bayesian posterior 693 00:26:40,460 --> 00:26:42,410 in an inverse graphics system. 694 00:26:42,410 --> 00:26:44,330 We put a prior on world states, which 695 00:26:44,330 --> 00:26:48,050 is basically a prior on what we think the world is made out of. 696 00:26:48,050 --> 00:26:50,490 We think there are these Jenga blocks, basically. 697 00:26:50,490 --> 00:26:54,140 And then the likelihood, which is that forward model, is 698 00:26:54,140 --> 00:26:57,170 the probability of seeing a particular 2D image given 699 00:26:57,170 --> 00:26:58,610 a 3D configuration of blocks. 700 00:26:58,610 --> 00:27:00,110 And going back to the thing you had, 701 00:27:00,110 --> 00:27:02,750 it's basically deterministic with a little bit of noise. 702 00:27:02,750 --> 00:27:03,590 It's deterministic. 703 00:27:03,590 --> 00:27:06,590 It just follows the rules of OpenGL graphics. 704 00:27:06,590 --> 00:27:08,540 It basically says objects have surfaces. 705 00:27:08,540 --> 00:27:09,660 They're not transparent. 706 00:27:09,660 --> 00:27:10,784 You can't see through them. 707 00:27:10,784 --> 00:27:13,540 That's an extra complication if you wanted to have that. 708 00:27:13,540 --> 00:27:15,920 And basically the image is formed 709 00:27:15,920 --> 00:27:19,190 by taking the closest surface of the closest object 710 00:27:19,190 --> 00:27:22,160 and bouncing a ray of light off of it, which really just means 711 00:27:22,160 --> 00:27:24,060 taking its color and scaling it by intensity. 712 00:27:24,060 --> 00:27:26,840 It's a very simple shadow model. 713 00:27:26,840 --> 00:27:27,980 So that's the causal model. 714 00:27:27,980 --> 00:27:30,188 And then we can add a little bit of uncertainty like, 715 00:27:30,188 --> 00:27:31,550 for example, maybe we can't-- 716 00:27:31,550 --> 00:27:34,700 there's a little bit of noise in the sensor data.
717 00:27:34,700 --> 00:27:38,690 So you can be uncertain about exactly the low level image 718 00:27:38,690 --> 00:27:39,255 features. 719 00:27:39,255 --> 00:27:41,630 And then when you run one of these probabilistic programs 720 00:27:41,630 --> 00:27:44,840 in reverse to make a guess of what configuration of blocks 721 00:27:44,840 --> 00:27:46,720 is most likely to have produced that image, 722 00:27:46,720 --> 00:27:48,950 there is a little bit of posterior uncertainty 723 00:27:48,950 --> 00:27:54,530 that inherits from the fact that you can't perfectly localize 724 00:27:54,530 --> 00:27:56,220 those objects in the world. 725 00:27:56,220 --> 00:27:59,930 So again, what you see here are three or four samples 726 00:27:59,930 --> 00:28:01,910 from the posterior-- the distribution 727 00:28:01,910 --> 00:28:04,220 over best guesses of the world state 728 00:28:04,220 --> 00:28:07,160 of 3D objects that were most likely to have rendered 729 00:28:07,160 --> 00:28:08,780 into that 2D image. 730 00:28:08,780 --> 00:28:11,270 And any one of those is now an actionable representation 731 00:28:11,270 --> 00:28:14,220 for physical manipulation or reasoning. 732 00:28:14,220 --> 00:28:16,285 OK? 733 00:28:16,285 --> 00:28:17,660 And how we actually compute that, 734 00:28:17,660 --> 00:28:20,540 again, I'm not going to go into right now. 735 00:28:20,540 --> 00:28:23,250 I'll go into something like it in a minute. 736 00:28:23,250 --> 00:28:24,830 But at least in its most basic form, 737 00:28:24,830 --> 00:28:27,230 it involves some rather unfortunately 738 00:28:27,230 --> 00:28:29,720 slow random search process through the space 739 00:28:29,720 --> 00:28:32,870 of blocks models. 740 00:28:32,870 --> 00:28:33,900 Here's another example. 741 00:28:33,900 --> 00:28:36,050 This is another configuration there-- 742 00:28:36,050 --> 00:28:36,830 another image. 743 00:28:36,830 --> 00:28:39,950 And here is a few samples again from the posterior. 
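That slow random search can be sketched in one dimension, with a hypothetical tent-shaped "renderer" standing in for OpenGL and a single block position standing in for the full 3D world state:

```python
import random

def render(x):
    # Deterministic toy graphics: a block at position x makes a
    # tent-shaped bump in a 10-pixel, one-dimensional "image".
    return [max(0.0, 1.0 - abs(x - p) / 2.0) for p in range(10)]

def log_score(x, image):
    # Likelihood: deterministic rendering plus a little pixel noise,
    # so mismatch is penalized quadratically.
    return -sum((a - b) ** 2 for a, b in zip(render(x), image))

def infer(image, tries=2000):
    # The slow random search: guess world states at random and keep
    # the guess whose rendering best explains the observed image.
    best_x, best = None, float("-inf")
    for _ in range(tries):
        x = random.uniform(0.0, 9.0)
        s = log_score(x, image)
        if s > best:
            best_x, best = x, s
    return best_x

observed = render(5.0)   # image produced by a block at x = 5
guess = infer(observed)
```

Running the program forward is cheap; running it backwards means searching the space of inputs, which is why general-purpose inference is slow.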
744 00:28:39,950 --> 00:28:42,200 And hopefully when you see these things moving around, 745 00:28:42,200 --> 00:28:43,970 whether it's this one or the one before, 746 00:28:43,970 --> 00:28:46,550 you see them move a little bit, but most of them 747 00:28:46,550 --> 00:28:47,780 look very similar. 748 00:28:47,780 --> 00:28:49,966 You'd be hard pressed to tell the difference 749 00:28:49,966 --> 00:28:52,340 if you looked away for a second between any one of those. 750 00:28:52,340 --> 00:28:53,930 Which one are you actually seeing? 751 00:28:53,930 --> 00:28:55,820 And that's exactly the point. 752 00:28:55,820 --> 00:28:59,000 The uncertainty you see there is meant to capture basically 753 00:28:59,000 --> 00:29:01,280 the uncertainty you have in a single glance 754 00:29:01,280 --> 00:29:02,420 at an image like that. 755 00:29:02,420 --> 00:29:04,430 You can't perfectly tell where the blocks are. 756 00:29:04,430 --> 00:29:08,000 So basically any one of these configurations 757 00:29:08,000 --> 00:29:09,860 up here is about equally good. 758 00:29:09,860 --> 00:29:11,420 And we think your intuitive physics, 759 00:29:11,420 --> 00:29:14,090 your sort of common sense core intuitive physics 760 00:29:14,090 --> 00:29:16,610 that even babies have, is operating over one 761 00:29:16,610 --> 00:29:19,310 or a few samples like that. 
762 00:29:19,310 --> 00:29:21,840 Now in separate work that is not really-- 763 00:29:21,840 --> 00:29:23,960 I think of it as really about common sense, 764 00:29:23,960 --> 00:29:26,293 but it's one of the things we've been doing in our group 765 00:29:26,293 --> 00:29:28,709 and in CBMM where these ideas best make contact 766 00:29:28,709 --> 00:29:30,500 with the rest of what people are doing here 767 00:29:30,500 --> 00:29:33,530 and where we can really test interesting neural hypotheses 768 00:29:33,530 --> 00:29:35,780 potentially and understand the interplay 769 00:29:35,780 --> 00:29:37,940 between these generative models for explanation 770 00:29:37,940 --> 00:29:40,100 and the more sort of neural-network-type models 771 00:29:40,100 --> 00:29:42,207 for pattern recognition. 772 00:29:42,207 --> 00:29:43,790 We've been really pushing on this idea 773 00:29:43,790 --> 00:29:45,440 of vision as inverse graphics. 774 00:29:45,440 --> 00:29:47,810 So I'll tell you a little bit about that because it's 775 00:29:47,810 --> 00:29:49,075 quite interesting for CBMM. 776 00:29:49,075 --> 00:29:51,800 But I want to make sure to only do this for about five minutes 777 00:29:51,800 --> 00:29:53,930 and then go back to how this gets 778 00:29:53,930 --> 00:29:57,500 used for more of the intuitive physics and planning stuff. 779 00:29:57,500 --> 00:30:01,950 So this is an example from a paper by Tejas Kulkarni, who's 780 00:30:01,950 --> 00:30:03,750 one of our grad students. 781 00:30:03,750 --> 00:30:06,540 And it's joint work with a few other really smart people 782 00:30:06,540 --> 00:30:08,400 such as Vikash Mansinghka, who's a research 783 00:30:08,400 --> 00:30:10,500 scientist at MIT, and Pushmeet Kohli, 784 00:30:10,500 --> 00:30:12,750 who's at Microsoft Research.
785 00:30:12,750 --> 00:30:15,380 And it was a computer vision paper, a pure computer vision 786 00:30:15,380 --> 00:30:20,610 paper from the summer, where he was developing 787 00:30:20,610 --> 00:30:22,860 a specific kind of probabilistic programming language, 788 00:30:22,860 --> 00:30:25,200 but a general one for doing this kind of vision 789 00:30:25,200 --> 00:30:27,992 as inverse graphics, where you could give 790 00:30:27,992 --> 00:30:29,200 a number of different models. 791 00:30:29,200 --> 00:30:31,574 Here I'll show you one for faces, another one for bodies, 792 00:30:31,574 --> 00:30:33,030 another one for generic objects. 793 00:30:33,030 --> 00:30:38,250 But basically you can pretty easily specify a graphics model 794 00:30:38,250 --> 00:30:40,260 that when you run it in the forward direction 795 00:30:40,260 --> 00:30:43,260 generates random images of objects in a certain class. 796 00:30:43,260 --> 00:30:45,420 And then you can run it in the reverse direction 797 00:30:45,420 --> 00:30:49,480 to do scene parsing to go from the image 798 00:30:49,480 --> 00:30:50,980 to the underlying scene. 799 00:30:50,980 --> 00:30:53,070 So here's an example of this in faces 800 00:30:53,070 --> 00:30:56,460 where the graphics model-- it's really very directly based 801 00:30:56,460 --> 00:30:59,280 on work that Thomas Vetter, who was a former student 802 00:30:59,280 --> 00:31:00,870 or post-doc of Tommy's actually, so 803 00:31:00,870 --> 00:31:04,500 kind of an early ancestor of CBMM, built 804 00:31:04,500 --> 00:31:07,000 with his group in Basel, Switzerland, where 805 00:31:07,000 --> 00:31:09,570 it's a simple but still pretty nice 806 00:31:09,570 --> 00:31:12,090 graphics model for making face images. 807 00:31:12,090 --> 00:31:14,550 There's a model of the shape of the face, which again, is 808 00:31:14,550 --> 00:31:15,690 like a CAD model. 809 00:31:15,690 --> 00:31:17,520 It's a mesh surface description.
810 00:31:17,520 --> 00:31:21,390 Pretty fine-grained structure of the 2D surface 811 00:31:21,390 --> 00:31:23,430 of the face in 3D. 812 00:31:23,430 --> 00:31:25,710 And there are about 400 dimensions 813 00:31:25,710 --> 00:31:28,374 to characterize the possible shapes of faces. 814 00:31:28,374 --> 00:31:29,790 And there's another 400 dimensions 815 00:31:29,790 --> 00:31:31,206 to characterize the texture, which 816 00:31:31,206 --> 00:31:34,560 is like the skin, the beard, the eyes, the color, and surface 817 00:31:34,560 --> 00:31:36,779 properties that get mapped on top of the mesh. 818 00:31:36,779 --> 00:31:38,820 And then there's a little bit more graphics stuff, 819 00:31:38,820 --> 00:31:40,740 which is generic, not specific to faces. 820 00:31:40,740 --> 00:31:42,530 That stuff is all specific to faces. 821 00:31:42,530 --> 00:31:43,890 But then there is a simple lighting model. 822 00:31:43,890 --> 00:31:45,639 So you basically have a point light source 823 00:31:45,639 --> 00:31:48,474 somewhere out there and you shine the light on the face. 824 00:31:48,474 --> 00:31:49,890 It can produce shadows, of course, 825 00:31:49,890 --> 00:31:51,990 but not very complicated ones. 826 00:31:51,990 --> 00:31:54,284 And then there's a viewpoint camera thing. 827 00:31:54,284 --> 00:31:56,700 So you put the light source somewhere and you put a camera 828 00:31:56,700 --> 00:31:58,530 somewhere specifying the viewpoint. 829 00:31:58,530 --> 00:32:01,050 And the combination of these, shape, texture, lighting, 830 00:32:01,050 --> 00:32:03,680 and camera, give you a complete graphics specification. 831 00:32:03,680 --> 00:32:06,000 It produces an image of a particular face 832 00:32:06,000 --> 00:32:07,920 lit from a particular direction and viewed 833 00:32:07,920 --> 00:32:10,870 from some particular viewpoint and distance.
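The forward direction can be sketched as a sampler over scene descriptions. The 400-dimensional shape and texture counts come from the talk; the lighting and camera parameter names and ranges below are invented placeholders, and a real graphics engine would map this dictionary to pixels:

```python
import random

def sample_face_scene():
    # One draw from the prior: shape and texture coefficients plus
    # generic lighting and camera parameters fully specify a render.
    return {
        "shape":   [random.gauss(0.0, 1.0) for _ in range(400)],
        "texture": [random.gauss(0.0, 1.0) for _ in range(400)],
        "light":   {"azimuth": random.uniform(0, 360),
                    "elevation": random.uniform(0, 90)},
        "camera":  {"viewpoint": random.uniform(-90, 90),
                    "distance": random.uniform(1.0, 5.0)},
    }

# Pressing "Go" on the prior: each call is a new random face under
# random lighting, seen from a random viewpoint.
scene = sample_face_scene()
```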
834 00:32:10,870 --> 00:32:12,870 And what you see on the right are random samples 835 00:32:12,870 --> 00:32:15,414 from this probabilistic program, this generative model. 836 00:32:15,414 --> 00:32:16,830 So you can just write this program 837 00:32:16,830 --> 00:32:19,030 and press Go, Go, Go, Go, Go, and every time you run it, 838 00:32:19,030 --> 00:32:21,488 you get a new face viewed from a new direction and lighting 839 00:32:21,488 --> 00:32:22,417 condition. 840 00:32:22,417 --> 00:32:23,250 So that's the prior. 841 00:32:26,400 --> 00:32:28,347 Now, what about inference? 842 00:32:28,347 --> 00:32:30,180 Well, the idea of vision as inverse graphics 843 00:32:30,180 --> 00:32:33,570 is to say take a real image of a face like that one 844 00:32:33,570 --> 00:32:36,179 and see if you can produce from your graphics 845 00:32:36,179 --> 00:32:37,720 model something that looks like that. 846 00:32:37,720 --> 00:32:39,690 So, for example, here in the lower left 847 00:32:39,690 --> 00:32:42,360 is an example of a face that was produced from the graphics 848 00:32:42,360 --> 00:32:45,300 model that hopefully most of you agree looks kind of like that. 849 00:32:45,300 --> 00:32:47,970 Maybe not exactly the same, but kind of enough. 850 00:32:47,970 --> 00:32:50,170 And in building this system-- 851 00:32:50,170 --> 00:32:52,290 this system, by the way, is called Picture. 852 00:32:52,290 --> 00:32:54,330 That's that first word of the paper title, 853 00:32:54,330 --> 00:32:56,769 too, the Kulkarni, et al. paper. 854 00:32:56,769 --> 00:32:58,810 There were a few neat things that had to be done. 855 00:32:58,810 --> 00:33:00,351 One of the things that had to be done 856 00:33:00,351 --> 00:33:03,259 was to come up with various ways to say 857 00:33:03,259 --> 00:33:05,550 what does it mean for the output of the graphics engine 858 00:33:05,550 --> 00:33:07,150 to look like the image. 
859 00:33:07,150 --> 00:33:09,450 In the case of faces, actually matching up pixels 860 00:33:09,450 --> 00:33:10,750 is not completely crazy. 861 00:33:10,750 --> 00:33:13,140 But for most vision problems, it's 862 00:33:13,140 --> 00:33:16,020 going to be unrealistic and unnecessary to build a graphics 863 00:33:16,020 --> 00:33:19,070 engine that's pixel-level realistic. 864 00:33:19,070 --> 00:33:20,970 And so you might, for example, want 865 00:33:20,970 --> 00:33:25,062 to have something where the graphics engine hypothesis is 866 00:33:25,062 --> 00:33:26,520 matched to the image with something 867 00:33:26,520 --> 00:33:27,739 like some kind of features. 868 00:33:27,739 --> 00:33:30,030 Like it could be convolutional neural network features. 869 00:33:30,030 --> 00:33:32,400 That's one way to use, for example, neural networks 870 00:33:32,400 --> 00:33:34,800 to make something like this work well. 871 00:33:34,800 --> 00:33:37,590 And Jojen just showed me a paper by some other folks 872 00:33:37,590 --> 00:33:39,114 from Darmstadt, which is doing what 873 00:33:39,114 --> 00:33:41,280 looks like a very interesting similar kind of thing. 874 00:33:44,280 --> 00:33:48,330 Let me show what inference looks like in this model and then 875 00:33:48,330 --> 00:33:50,430 say what I think is an even more interesting way 876 00:33:50,430 --> 00:33:51,750 to use convolutional nets. 877 00:33:51,750 --> 00:33:54,620 And that's from another recent paper we've been looking at. 878 00:33:54,620 --> 00:33:59,100 So here is, if you watch this, this is one observed face. 879 00:33:59,100 --> 00:34:01,230 And what you're seeing over here is just 880 00:34:01,230 --> 00:34:04,140 a trace of the system kind of searching 881 00:34:04,140 --> 00:34:06,890 through the space of traces of the graphics program. 882 00:34:06,890 --> 00:34:08,698 Basically trying out random faces 883 00:34:08,698 --> 00:34:10,239 that might look like that face there.
884 00:34:10,239 --> 00:34:12,106 It's using a kind of MCMC inference. 885 00:34:12,106 --> 00:34:13,980 It's very similar to what you're going to see 886 00:34:13,980 --> 00:34:16,739 from Tomer in the tutorial. 887 00:34:16,739 --> 00:34:20,280 It basically starts off with a random face 888 00:34:20,280 --> 00:34:23,909 and takes a bunch of small random steps 889 00:34:23,909 --> 00:34:27,270 that are biased towards making the image look more and more 890 00:34:27,270 --> 00:34:28,973 like the actual observed image. 891 00:34:28,973 --> 00:34:30,389 And at the end, you have something 892 00:34:30,389 --> 00:34:33,031 which looks almost identical to the observed face. 893 00:34:33,031 --> 00:34:35,489 The key, right, though, is that though the observed face is 894 00:34:35,489 --> 00:34:37,679 literally just a 2D image, the thing 895 00:34:37,679 --> 00:34:39,659 you're seeing on the right is a projection 896 00:34:39,659 --> 00:34:41,610 of a 3D model of a face. 897 00:34:41,610 --> 00:34:45,570 And it's one that supports a lot of causal action. 898 00:34:45,570 --> 00:34:49,020 So here just to show you on a more interesting sort 899 00:34:49,020 --> 00:34:52,170 of high-resolution set of face images, the ones on the left 900 00:34:52,170 --> 00:34:53,489 are observed images. 901 00:34:53,489 --> 00:34:55,916 And then we fit this model. 902 00:34:55,916 --> 00:34:58,290 And then we can rotate it around and change the lighting. 903 00:34:58,290 --> 00:35:00,749 If we had parameters that control the expression-- 904 00:35:00,749 --> 00:35:02,790 there's no real expression parameters here-- that 905 00:35:02,790 --> 00:35:04,710 wouldn't be too hard to put in. 906 00:35:04,710 --> 00:35:06,400 You could make us happy or sad. 
907 00:35:06,400 --> 00:35:07,740 But you can see-- 908 00:35:07,740 --> 00:35:10,290 hopefully what you can see is that the recovered model 909 00:35:10,290 --> 00:35:12,840 supports fairly reasonable generalization 910 00:35:12,840 --> 00:35:14,874 to other viewpoints and lighting conditions. 911 00:35:14,874 --> 00:35:16,290 It's the sort of thing that should 912 00:35:16,290 --> 00:35:18,951 make for more robust face recognition. 913 00:35:18,951 --> 00:35:20,700 Although that's not the main focus of what 914 00:35:20,700 --> 00:35:21,360 we're trying to use it for here. 915 00:35:21,360 --> 00:35:23,250 I just want to emphasize there's all sorts of things that 916 00:35:23,250 --> 00:35:25,791 would be useful if you had an actual 3D model of the face you 917 00:35:25,791 --> 00:35:27,260 could get from a single image. 918 00:35:27,260 --> 00:35:31,600 Or here's the same kind of idea now for a body pose system. 919 00:35:31,600 --> 00:35:33,585 So now, the image we're going to assume 920 00:35:33,585 --> 00:35:35,460 has a person in it somewhere doing something. 921 00:35:35,460 --> 00:35:37,751 Remember back to that challenge I gave at the beginning 922 00:35:37,751 --> 00:35:41,400 about finding the bodies in a complex scene like the airplane 923 00:35:41,400 --> 00:35:44,880 full of computer vision researchers 924 00:35:44,880 --> 00:35:48,090 where you found the right hand or the left toe. 925 00:35:48,090 --> 00:35:50,280 So in order to do that, we think you 926 00:35:50,280 --> 00:35:52,950 have to have something like an actual 3D model of a body. 927 00:35:52,950 --> 00:35:54,840 What you see on the lower left is 928 00:35:54,840 --> 00:35:56,060 a bunch of samples from this. 929 00:35:56,060 --> 00:36:00,180 So we basically just took a kind of interesting 3D stick figure 930 00:36:00,180 --> 00:36:03,360 skeleton model and just put some knobs on it. 931 00:36:03,360 --> 00:36:04,540 You can tweak it around.
932 00:36:04,540 --> 00:36:06,030 You can put some simple probability models 933 00:36:06,030 --> 00:36:06,720 to get a prior. 934 00:36:06,720 --> 00:36:08,095 And these are just random samples 935 00:36:08,095 --> 00:36:09,502 of random body positions. 936 00:36:09,502 --> 00:36:11,460 And the idea of the system is to kind of search 937 00:36:11,460 --> 00:36:14,250 through that space of body positions 938 00:36:14,250 --> 00:36:16,800 until you find one, which then when you project it 939 00:36:16,800 --> 00:36:18,960 from a certain camera angle looks 940 00:36:18,960 --> 00:36:20,700 like the body you're seeing. 941 00:36:20,700 --> 00:36:22,510 So here is an example of this in action. 942 00:36:22,510 --> 00:36:24,030 This is some guy-- 943 00:36:24,030 --> 00:36:27,090 I guess Usain Bolt. Some kind of interesting slightly unusual 944 00:36:27,090 --> 00:36:30,240 pose as he's about to break the finish line maybe. 945 00:36:30,240 --> 00:36:32,280 And here is the system in action. 946 00:36:32,280 --> 00:36:33,991 So it starts off from a random position 947 00:36:33,991 --> 00:36:35,490 and, again, sort of takes 948 00:36:35,490 --> 00:36:38,910 a bunch of random steps moving around in 3D space 949 00:36:38,910 --> 00:36:40,890 until it finds a configuration, which 950 00:36:40,890 --> 00:36:42,720 when you project it into the image looks 951 00:36:42,720 --> 00:36:44,270 like what you see there. 952 00:36:44,270 --> 00:36:48,600 Now, notice a key difference when I say looks like-- 953 00:36:48,600 --> 00:36:52,050 it doesn't look like it at the pixel level like the face did. 954 00:36:52,050 --> 00:36:55,530 It's only matching at the level of these basically enhanced 955 00:36:55,530 --> 00:36:57,280 edge statistics which you see here. 956 00:36:57,280 --> 00:36:59,280 So this is an example of building a model that's 957 00:36:59,280 --> 00:37:02,160 not a photorealistic render.
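[Editor's sketch] The matching-at-feature-level point can be illustrated minimally. `edge_features` below is an invented stand-in for the enhanced edge statistics (or convnet features) being described, not the representation the actual system uses:

```python
def edge_features(image):
    # Compare images by local differences rather than raw intensities,
    # so overall appearance (brightness, clothing color, skin) cancels out.
    return [b - a for a, b in zip(image, image[1:])]

def feature_distance(rendered, observed):
    # Squared distance in feature space: the likelihood only has to
    # match the render to the image at this level, not pixel by pixel.
    return sum((a - b) ** 2
               for a, b in zip(edge_features(rendered), edge_features(observed)))

observed = [0.0, 1.0, 0.2, 0.9]          # toy "image" of a pose
recolored = [v + 0.5 for v in observed]  # same structure, different appearance

# In raw pixel space these two are far apart...
pixel_gap = sum((a - b) ** 2 for a, b in zip(recolored, observed))
# ...but in feature space they match almost exactly.
feature_gap = feature_distance(recolored, observed)
```

So a graphics model with no clothing or skin model can still "look like" the image at the only level that matters for pose.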
958 00:37:02,160 --> 00:37:04,380 The graphics model is not trying to match the image. 959 00:37:04,380 --> 00:37:05,880 It's trying to match this. 960 00:37:05,880 --> 00:37:08,130 Or it could be, for example, some intermediate level 961 00:37:08,130 --> 00:37:09,252 of convnet features. 962 00:37:09,252 --> 00:37:10,710 And we think this is very powerful. 963 00:37:10,710 --> 00:37:12,420 Because more generally while we might 964 00:37:12,420 --> 00:37:17,040 have a really detailed model of facial appearance, for bodies, 965 00:37:17,040 --> 00:37:18,630 we don't have a good clothing model. 966 00:37:18,630 --> 00:37:21,310 We're not trying to model the skin. 967 00:37:21,310 --> 00:37:24,752 We're just trying to model just enough 968 00:37:24,752 --> 00:37:26,460 to solve the problem we're interested in. 969 00:37:26,460 --> 00:37:29,200 And again, this is reflective of a much more broad theme 970 00:37:29,200 --> 00:37:32,340 in this idea of intelligence as explanation, 971 00:37:32,340 --> 00:37:35,190 modeling the causal structure of the world. 972 00:37:35,190 --> 00:37:37,047 We don't expect, even in science, 973 00:37:37,047 --> 00:37:38,880 but certainly not in our intuitive theories, 974 00:37:38,880 --> 00:37:42,820 to model the causal structure of the world at full detail. 975 00:37:42,820 --> 00:37:46,110 And a way that either I am always misunderstood or always 976 00:37:46,110 --> 00:37:48,582 fail to communicate-- it's my fault really-- 977 00:37:48,582 --> 00:37:50,790 is I say, oh, we have these rich models of the world. 978 00:37:50,790 --> 00:37:52,800 People often think that means that somehow 979 00:37:52,800 --> 00:37:53,800 we have the complete thing. 980 00:37:53,800 --> 00:37:55,680 Like if I say we have a physics engine in our head, 981 00:37:55,680 --> 00:37:56,720 it means we have all of physics. 982 00:37:56,720 --> 00:37:58,303 Or if I say we have a graphics engine, 983 00:37:58,303 --> 00:38:00,330 we have all of every possible thing.
984 00:38:00,330 --> 00:38:02,250 This isn't Pixar. 985 00:38:02,250 --> 00:38:05,280 We're not trying to make a beautiful movie, 986 00:38:05,280 --> 00:38:07,170 except maybe for faces. 987 00:38:07,170 --> 00:38:10,980 We're just trying to capture just the key parts, just 988 00:38:10,980 --> 00:38:13,925 the key causal parts of the way things move 989 00:38:13,925 --> 00:38:16,050 in the world as physical objects and the way images 990 00:38:16,050 --> 00:38:18,510 are formed that at the right level of abstraction 991 00:38:18,510 --> 00:38:22,560 that matters for us allows us to do what we need to do. 992 00:38:22,560 --> 00:38:27,910 This is just an example of our system 993 00:38:27,910 --> 00:38:31,230 solving some pretty challenging body pose recognition problems 994 00:38:31,230 --> 00:38:34,540 in 3D, cases which are problematic 995 00:38:34,540 --> 00:38:38,070 even for the best of standard computer vision systems. 996 00:38:38,070 --> 00:38:39,510 Either because it's a weird pose, 997 00:38:39,510 --> 00:38:42,090 like these weird sports figures, or because the body 998 00:38:42,090 --> 00:38:43,355 is heavily occluded. 999 00:38:43,355 --> 00:38:45,730 But I think, again, these are problems which people solve 1000 00:38:45,730 --> 00:38:46,590 effortlessly. 1001 00:38:46,590 --> 00:38:48,840 And I think something like this is 1002 00:38:48,840 --> 00:38:50,341 on the track of what we want to do. 1003 00:38:50,341 --> 00:38:51,840 You can apply the same kind of thing 1004 00:38:51,840 --> 00:38:54,820 to more generic objects like this, 1005 00:38:54,820 --> 00:38:56,700 but I'm not going to go into the details. 
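[Editor's sketch] The search that ran on the faces earlier -- start from a random hypothesis, take small random steps biased toward making the render look more like the observation -- can be sketched as a toy Metropolis loop. This is not the Picture system: `render` is a made-up stand-in for the graphics engine, and the four latents stand in for shape, texture, lighting, and camera parameters:

```python
import math
import random

def render(latents):
    # Toy stand-in for a graphics engine: a deterministic map from
    # latent parameters to an "image" (just a short list of numbers here).
    return [v * v + 0.5 * v for v in latents]

def log_likelihood(rendered, observed, noise=0.1):
    # Gaussian matching score (up to an additive constant).
    return -sum((a - b) ** 2 for a, b in zip(rendered, observed)) / (2 * noise ** 2)

def mcmc_inverse_graphics(observed, n_latents=4, steps=2000, step_size=0.05, seed=0):
    rng = random.Random(seed)
    latents = [rng.uniform(-1, 1) for _ in range(n_latents)]  # random initial face
    score = log_likelihood(render(latents), observed)
    best, best_score = list(latents), score
    for _ in range(steps):
        proposal = list(latents)
        i = rng.randrange(n_latents)
        proposal[i] += rng.gauss(0, step_size)  # small random step
        new_score = log_likelihood(render(proposal), observed)
        # Metropolis rule: steps that make the render look more like the
        # observed image are always accepted; worsening steps occasionally.
        if new_score >= score or rng.random() < math.exp(new_score - score):
            latents, score = proposal, new_score
            if score > best_score:
                best, best_score = list(latents), score
    return best

observed = render([0.3, -0.5, 0.8, 0.1])  # pretend this is the observed image
fit = mcmc_inverse_graphics(observed)
```

Because the toy renderer is not injective, the fitted latents need not equal the true ones; what matters is that their render closely matches the observed image.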
1006 00:38:56,700 --> 00:38:58,408 The last thing I want to say about vision 1007 00:38:58,408 --> 00:39:02,490 before getting back to common sense for a few minutes-- 1008 00:39:02,490 --> 00:39:05,310 and in some sense, maybe this is the most important slide 1009 00:39:05,310 --> 00:39:08,370 for the broader CBMM, brains, minds, and machines thing. 1010 00:39:08,370 --> 00:39:10,410 Because this is the clearest thing 1011 00:39:10,410 --> 00:39:12,807 I can point to for the thing I've been saying all along 1012 00:39:12,807 --> 00:39:14,640 since the beginning of the morning about how 1013 00:39:14,640 --> 00:39:18,090 we want to look for ways to combine the generative model 1014 00:39:18,090 --> 00:39:20,520 view and the pattern recognition view. 1015 00:39:20,520 --> 00:39:22,980 So the generative model is what you see on the left here. 1016 00:39:22,980 --> 00:39:24,370 It's the arrows going down. 1017 00:39:24,370 --> 00:39:26,430 It's exactly just the face graphics engine, 1018 00:39:26,430 --> 00:39:29,130 the same thing I showed you. 1019 00:39:29,130 --> 00:39:33,060 The thing on the right with the arrows going up is a convnet. 1020 00:39:33,060 --> 00:39:36,440 Basically it's an out-of-the-box, Caffe-style 1021 00:39:36,440 --> 00:39:39,180 convolutional neural net with some fully connected layers 1022 00:39:39,180 --> 00:39:39,921 on the top. 1023 00:39:39,921 --> 00:39:41,670 And then there's a few other dashed arrows 1024 00:39:41,670 --> 00:39:45,330 which represent linear decoders from layers of that model 1025 00:39:45,330 --> 00:39:47,670 to other things, which are basically 1026 00:39:47,670 --> 00:39:49,320 parts of the generative model. 1027 00:39:49,320 --> 00:39:51,835 And the idea here-- this is work due to Ilker Yildirim, who 1028 00:39:51,835 --> 00:39:52,960 some of you might have met. 1029 00:39:52,960 --> 00:39:54,600 He was here the other day.
1030 00:39:54,600 --> 00:39:57,840 He's one of our CBMM postdocs, but also 1031 00:39:57,840 --> 00:40:02,210 joint with Tejas and with Winrich who you saw before. 1032 00:40:02,210 --> 00:40:05,540 It's to try to in several senses combine 1033 00:40:05,540 --> 00:40:07,790 the best of these perspectives, to say, look, 1034 00:40:07,790 --> 00:40:10,244 if we want to recognize anything or perceive 1035 00:40:10,244 --> 00:40:11,660 the structure of the world richly, 1036 00:40:11,660 --> 00:40:14,450 I think it needs to be something like this inverse graphics 1037 00:40:14,450 --> 00:40:16,074 or inverting a graphics program. 1038 00:40:16,074 --> 00:40:17,240 But you saw how slow it was. 1039 00:40:17,240 --> 00:40:19,239 You saw how it took a couple of seconds at least 1040 00:40:19,239 --> 00:40:21,380 on our computer just for faces to search 1041 00:40:21,380 --> 00:40:22,310 through the space of faces to come up 1042 00:40:22,310 --> 00:40:23,518 with a convincing hypothesis. 1043 00:40:23,518 --> 00:40:24,380 That's way too slow. 1044 00:40:24,380 --> 00:40:25,970 Your visual system doesn't take that long. 1045 00:40:25,970 --> 00:40:29,300 We know a lot about exactly how long it takes you from Winrich's, 1046 00:40:29,300 --> 00:40:32,060 and Nancy's, and many other people's work. 1047 00:40:32,060 --> 00:40:35,390 So how can vision in this case, or really much more generally, 1048 00:40:35,390 --> 00:40:38,840 be so rich in terms of the model it builds, yet so fast? 1049 00:40:38,840 --> 00:40:40,730 Well, here's a proposal, which is 1050 00:40:40,730 --> 00:40:44,570 to take the things that are good at being fast like the pattern 1051 00:40:44,570 --> 00:40:47,524 recognizers, deep ones, and train 1052 00:40:47,524 --> 00:40:49,190 them to solve the hard inference problem 1053 00:40:49,190 --> 00:40:51,440 or at least to do most of the work.
1054 00:40:51,440 --> 00:40:53,540 It's an idea which is very heavily inspired 1055 00:40:53,540 --> 00:40:55,550 by an older idea of Geoff Hinton's 1056 00:40:55,550 --> 00:40:58,010 sometimes called the Helmholtz machine. 1057 00:40:58,010 --> 00:41:01,250 Here the idea in common with Hinton 1058 00:41:01,250 --> 00:41:05,300 is to have a generative model and a recognition model 1059 00:41:05,300 --> 00:41:07,460 where the recognition model is a neural network 1060 00:41:07,460 --> 00:41:09,740 and it's trained to invert the generative model. 1061 00:41:09,740 --> 00:41:15,620 Namely, it's trained to map not from sense data to task output, 1062 00:41:15,620 --> 00:41:18,290 but from sense data to the hidden deep causes 1063 00:41:18,290 --> 00:41:21,020 of the generative model, which then, when you want to use this 1064 00:41:21,020 --> 00:41:26,630 to act to plan what you're going to do, you plan on the model. 1065 00:41:26,630 --> 00:41:29,390 To make an analogy to, say, the DeepMind video game player, 1066 00:41:29,390 --> 00:41:31,280 this would be like having a system which, 1067 00:41:31,280 --> 00:41:33,380 in contrast to the Deep Q-network, which 1068 00:41:33,380 --> 00:41:36,050 mapped from pixel images to joystick commands, 1069 00:41:36,050 --> 00:41:38,600 this would be like learning a network that 1070 00:41:38,600 --> 00:41:40,316 maps from pixel images to the game state, 1071 00:41:40,316 --> 00:41:42,440 to the objects, the sprites that are moving around, 1072 00:41:42,440 --> 00:41:44,760 the score, and so on, and then plans on that. 1073 00:41:44,760 --> 00:41:50,190 And I think that's much more like what people do. 1074 00:41:50,190 --> 00:41:51,990 Here just in the limited case of faces, 1075 00:41:51,990 --> 00:41:53,031 what are we doing, right? 1076 00:41:53,031 --> 00:41:55,860 So what we've got here is we take 1077 00:41:55,860 --> 00:41:58,080 this convolutional neural network.
1078 00:41:58,080 --> 00:42:00,570 We train it in ways that you can read about in the paper. 1079 00:42:00,570 --> 00:42:05,400 It's a very easy kind of training to basically make predictions, 1080 00:42:05,400 --> 00:42:07,260 to make guesses about all the latent 1081 00:42:07,260 --> 00:42:09,510 variables, the shape, the texture, the lighting, 1082 00:42:09,510 --> 00:42:11,160 the camera angle. 1083 00:42:11,160 --> 00:42:14,160 And then you take those guesses, and they start off 1084 00:42:14,160 --> 00:42:15,150 that Markov chain. 1085 00:42:15,150 --> 00:42:17,614 So instead of starting off at a random graphics hypothesis, 1086 00:42:17,614 --> 00:42:19,030 you start off at a pretty good one 1087 00:42:19,030 --> 00:42:20,520 and then refine it a little bit. 1088 00:42:20,520 --> 00:42:21,978 What you can see here in these blue 1089 00:42:21,978 --> 00:42:29,220 and red curves is that the blue curve is the course of inference 1090 00:42:29,220 --> 00:42:30,820 for the model I showed you before, 1091 00:42:30,820 --> 00:42:32,820 where you start off at a random guess, 1092 00:42:32,820 --> 00:42:37,380 and after, I don't know, 100 iterations of MCMC, you improve 1093 00:42:37,380 --> 00:42:38,650 and you kind of get there. 1094 00:42:38,650 --> 00:42:40,025 Whereas the red curve is what you 1095 00:42:40,025 --> 00:42:42,400 see if you start off with the guess of this recognition 1096 00:42:42,400 --> 00:42:42,900 model. 1097 00:42:42,900 --> 00:42:45,060 And you can see that you start off sort 1098 00:42:45,060 --> 00:42:47,732 of in some sense almost as good as you're ever going to get, 1099 00:42:47,732 --> 00:42:48,690 and then you refine it. 1100 00:42:48,690 --> 00:42:49,930 Well, it might look like we were just 1101 00:42:49,930 --> 00:42:51,180 refining it a little bit. 1102 00:42:51,180 --> 00:42:53,070 But this is a kind of a double log scale. 1103 00:42:53,070 --> 00:42:56,140 It's a log plot of log probability.
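[Editor's sketch] Under the same kind of toy-renderer assumption (illustrative only, not the paper's setup), the benefit of starting the chain from a recognition network's guess rather than a random one can be sketched like this:

```python
import math
import random

def render(z):
    # Toy stand-in for the graphics engine.
    return [v * v + 0.5 * v for v in z]

def score(z, observed, noise=0.1):
    # Gaussian log-likelihood of the render against the observation.
    return -sum((a - b) ** 2 for a, b in zip(render(z), observed)) / (2 * noise ** 2)

def refine(z0, observed, steps=50, step_size=0.05, seed=1):
    # A short burst of Metropolis refinement from a given starting guess.
    rng = random.Random(seed)
    z, s = list(z0), score(z0, observed)
    for _ in range(steps):
        prop = list(z)
        i = rng.randrange(len(z))
        prop[i] += rng.gauss(0, step_size)
        sp = score(prop, observed)
        if sp >= s or rng.random() < math.exp(sp - s):
            z, s = prop, sp
    return s

true_z = [0.3, -0.5, 0.8, 0.1]
observed = render(true_z)

# Pretend a trained recognition network returns a slightly-off guess
# of the latents, versus starting the chain somewhere random:
recognition_init = [v + 0.05 for v in true_z]
random_init = [0.9, 0.9, -0.9, 0.9]

s_recognition = refine(recognition_init, observed)
s_random = refine(random_init, observed)
```

With only a few refinement steps, the recognition-initialized chain ends at a much higher score; the randomly initialized one is still far from a good hypothesis.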
1104 00:42:56,140 --> 00:42:58,680 So what looks like a little bit there on the red curve 1105 00:42:58,680 --> 00:42:59,890 is actually a lot-- 1106 00:42:59,890 --> 00:43:01,470 I mean perceptually. 1107 00:43:01,470 --> 00:43:03,900 You can see it here where if you take-- on the top 1108 00:43:03,900 --> 00:43:06,360 I'm showing observed input faces. 1109 00:43:06,360 --> 00:43:07,920 On the bottom I'm showing the result 1110 00:43:07,920 --> 00:43:09,480 of this full inverse graphics thing. 1111 00:43:09,480 --> 00:43:11,063 And they should look almost identical. 1112 00:43:11,063 --> 00:43:14,534 So the full model is able to basically perfectly invert this 1113 00:43:14,534 --> 00:43:16,200 and come up with a face that really does 1114 00:43:16,200 --> 00:43:17,492 look like the one on the top. 1115 00:43:17,492 --> 00:43:19,200 The ones in the middle are the best guess 1116 00:43:19,200 --> 00:43:20,824 you get from this neural network that's 1117 00:43:20,824 --> 00:43:23,425 been trained to approximately invert the generative model. 1118 00:43:23,425 --> 00:43:25,050 And what you can see is on first glance 1119 00:43:25,050 --> 00:43:26,280 it should look pretty good. 1120 00:43:26,280 --> 00:43:28,080 But if you pay a little bit of attention, 1121 00:43:28,080 --> 00:43:29,310 you can see differences. 1122 00:43:29,310 --> 00:43:32,340 Like hopefully you can see this person is not actually 1123 00:43:32,340 --> 00:43:34,862 that person, in a way that this one much more convincingly is. 1124 00:43:34,862 --> 00:43:36,570 Or this person-- this one is pretty good, 1125 00:43:36,570 --> 00:43:37,500 but I think this one-- 1126 00:43:37,500 --> 00:43:39,210 I think it's pretty easy to say, yeah, 1127 00:43:39,210 --> 00:43:41,085 this isn't quite the same person as that one. 1128 00:43:41,085 --> 00:43:41,980 Do you guys agree? 1129 00:43:41,980 --> 00:43:44,464 We've done some experiments to verify this.
1130 00:43:44,464 --> 00:43:46,380 But hopefully they should look pretty similar, 1131 00:43:46,380 --> 00:43:49,410 and that's the point. 1132 00:43:49,410 --> 00:43:52,590 How do you combine the best of these computational paradigms? 1133 00:43:52,590 --> 00:43:54,030 How can perception more generally 1134 00:43:54,030 --> 00:43:55,500 be so rich and so fast? 1135 00:43:55,500 --> 00:43:58,150 Well, quite possibly like this. 1136 00:43:58,150 --> 00:44:01,362 It even actually might provide some insight 1137 00:44:01,362 --> 00:44:03,570 into the neural circuitry that Winrich and Doris Tsao 1138 00:44:03,570 --> 00:44:05,640 and others have mapped out. 1139 00:44:05,640 --> 00:44:07,950 We think that this recognition model that's 1140 00:44:07,950 --> 00:44:09,840 trained to invert the graphics model 1141 00:44:09,840 --> 00:44:12,090 can provide a really nice account of some of Winrich's 1142 00:44:12,090 --> 00:44:13,230 data like you saw before. 1143 00:44:13,230 --> 00:44:14,910 But I will not go into the details 1144 00:44:14,910 --> 00:44:17,700 because in maybe five to 10 minutes 1145 00:44:17,700 --> 00:44:20,430 I want to get back to physics and psychology. 1146 00:44:20,430 --> 00:44:25,061 So physics-- and there won't be any more neural networks. 1147 00:44:25,061 --> 00:44:26,310 Because that's about as much-- 1148 00:44:26,310 --> 00:44:32,769 I mean, I think we'd like to take those ways of integrating 1149 00:44:32,769 --> 00:44:34,560 the best of these approaches and apply them 1150 00:44:34,560 --> 00:44:35,610 to these more general cases. 1151 00:44:35,610 --> 00:44:36,980 But that's about as far as we can get. 1152 00:44:36,980 --> 00:44:39,240 Here what I want to just give you a taste of at least 1153 00:44:39,240 --> 00:44:41,550 is how we're using ideas just purely 1154 00:44:41,550 --> 00:44:43,590 from probabilistic programs to capture 1155 00:44:43,590 --> 00:44:45,791 more of this common sense physics and psychology. 
1156 00:44:45,791 --> 00:44:47,790 So let's say we can solve this problem by making 1157 00:44:47,790 --> 00:44:49,620 a good guess of the 3D world state 1158 00:44:49,620 --> 00:44:52,860 from the image very quickly by inverting this graphics engine. 1159 00:44:52,860 --> 00:44:54,880 Now, we can start to do some physical reasoning, 1160 00:44:54,880 --> 00:44:59,910 a la Craik's mental model in the head of the physical world, 1161 00:44:59,910 --> 00:45:02,760 where we now take a physics engine, which is-- 1162 00:45:02,760 --> 00:45:05,340 here again we're using the kind of physics engines 1163 00:45:05,340 --> 00:45:07,610 that games use-- 1164 00:45:07,610 --> 00:45:09,900 like very simple-- again, I don't have time 1165 00:45:09,900 --> 00:45:11,070 to go into the details. 1166 00:45:11,070 --> 00:45:14,730 Although Tomer has written a very nice paper with, well, 1167 00:45:14,730 --> 00:45:15,330 with himself. 1168 00:45:15,330 --> 00:45:19,620 But he's nicely put my name and Liz's on it-- 1169 00:45:19,620 --> 00:45:21,660 about sort of trying to introduce 1170 00:45:21,660 --> 00:45:23,490 some of the basic game engine concepts 1171 00:45:23,490 --> 00:45:25,170 to cognitive scientists. 1172 00:45:25,170 --> 00:45:27,766 So hopefully we'll be able to show you that soon too. 1173 00:45:27,766 --> 00:45:28,890 Or you can read about them. 1174 00:45:28,890 --> 00:45:30,973 Basically it's that these physics engines are just 1175 00:45:30,973 --> 00:45:36,090 doing again a very quick, fast, approximate implementation 1176 00:45:36,090 --> 00:45:38,632 of certain aspects of Newtonian mechanics. 1177 00:45:38,632 --> 00:45:40,340 Sufficient that if you run it a few 1178 00:45:40,340 --> 00:45:41,915 time steps with a configuration of objects 1179 00:45:41,915 --> 00:45:43,290 like that you might get something 1180 00:45:43,290 --> 00:45:45,120 like what you see over there on the right.
1181 00:45:45,120 --> 00:45:47,970 That's an example of running this approximate Newtonian 1182 00:45:47,970 --> 00:45:50,260 physics forward a few time steps. 1183 00:45:50,260 --> 00:45:52,500 Here's another sample from this model, another kind 1184 00:45:52,500 --> 00:45:54,180 of mental simulation. 1185 00:45:54,180 --> 00:45:56,790 We take a slightly different guess of the world state, 1186 00:45:56,790 --> 00:45:58,560 and we run that forward a few time steps, 1187 00:45:58,560 --> 00:46:00,794 and you see something else happens. 1188 00:46:00,794 --> 00:46:03,210 Nothing here is claimed to be accurate in the ground truth 1189 00:46:03,210 --> 00:46:03,979 way. 1190 00:46:03,979 --> 00:46:06,270 Neither one of these is exactly the right configuration 1191 00:46:06,270 --> 00:46:07,020 of blocks. 1192 00:46:07,020 --> 00:46:09,000 And you run this thing forward, and it only approximately 1193 00:46:09,000 --> 00:46:11,208 captures the way blocks really bounce off each other. 1194 00:46:11,208 --> 00:46:14,092 It's a hard problem to actually totally realistically simulate. 1195 00:46:14,092 --> 00:46:16,050 But our point is that you don't really have to. 1196 00:46:16,050 --> 00:46:18,330 You just have to make a reasonable guess 1197 00:46:18,330 --> 00:46:20,910 of the position of the blocks and a reasonable guess 1198 00:46:20,910 --> 00:46:23,250 of what's going to happen a few time steps in the future 1199 00:46:23,250 --> 00:46:25,110 to predict what you need to know in common sense, which 1200 00:46:25,110 --> 00:46:26,776 is that, wow, that's going to fall over. 1201 00:46:26,776 --> 00:46:28,410 I better do something about it. 1202 00:46:28,410 --> 00:46:30,510 And that's what our experiment taps into.
1203 00:46:30,510 --> 00:46:32,220 We give people a whole bunch of stimuli 1204 00:46:32,220 --> 00:46:34,290 like the ones I showed you and ask them, 1205 00:46:34,290 --> 00:46:35,760 on some graded scale, how likely do 1206 00:46:35,760 --> 00:46:37,320 you think it is to fall over? 1207 00:46:37,320 --> 00:46:39,510 And what you see here-- 1208 00:46:39,510 --> 00:46:43,770 this is again one of those plots that always are the same where 1209 00:46:43,770 --> 00:46:46,920 on the y-axis are the average human judgments now of-- it's 1210 00:46:46,920 --> 00:46:48,657 an estimate of how unstable the tower is. 1211 00:46:48,657 --> 00:46:50,490 It's both the probability that it will fall, 1212 00:46:50,490 --> 00:46:52,410 but also how much of the tower will fall. 1213 00:46:52,410 --> 00:46:54,270 So it's like the expected proportion 1214 00:46:54,270 --> 00:46:56,870 of the tower that's going to fall over under gravity. 1215 00:46:56,870 --> 00:46:59,110 And along the x-axis is the model prediction, 1216 00:46:59,110 --> 00:47:01,669 which is just the average of a few samples from what 1217 00:47:01,669 --> 00:47:02,210 I showed you. 1218 00:47:02,210 --> 00:47:04,170 You just take a few guesses of the world state, 1219 00:47:04,170 --> 00:47:06,210 run it forward a few time steps, count up 1220 00:47:06,210 --> 00:47:09,065 the proportion of blocks that fell, and average that. 1221 00:47:09,065 --> 00:47:10,440 And what you can see is that does 1222 00:47:10,440 --> 00:47:15,270 a really nice job of predicting people's stability intuitions. 1223 00:47:15,270 --> 00:47:17,790 I'll just point to an interesting comparison. 1224 00:47:17,790 --> 00:47:19,320 Because it does come in here. 1225 00:47:19,320 --> 00:47:20,410 Where does the probability come in 1226 00:47:20,410 --> 00:47:21,330 in these probabilistic programs? 1227 00:47:21,330 --> 00:47:23,280 Well, here's one very noticeable way.
1228 00:47:23,280 --> 00:47:25,690 So if you look down there on the lower right, 1229 00:47:25,690 --> 00:47:29,740 you'll see a smaller version of a similar plot. 1230 00:47:29,740 --> 00:47:31,640 It's plotting now the results of-- 1231 00:47:31,640 --> 00:47:34,140 it says ground truth physics, but that's a little misleading 1232 00:47:34,140 --> 00:47:34,639 maybe. 1233 00:47:34,639 --> 00:47:36,210 It's just a noiseless physics engine. 1234 00:47:36,210 --> 00:47:37,699 So we take the same physics model, 1235 00:47:37,699 --> 00:47:39,740 but we get rid of any of the state uncertainties. 1236 00:47:39,740 --> 00:47:42,810 So we tell it the true position of the blocks, 1237 00:47:42,810 --> 00:47:44,340 and we give it the true physics. 1238 00:47:44,340 --> 00:47:46,830 Whereas our probabilistic physics engine 1239 00:47:46,830 --> 00:47:49,110 allows for some uncertainty in exactly which forces 1240 00:47:49,110 --> 00:47:50,070 are doing what. 1241 00:47:50,070 --> 00:47:52,860 But here we say we're just going to model gravity, friction, 1242 00:47:52,860 --> 00:47:54,880 collisions as best we can. 1243 00:47:54,880 --> 00:47:58,320 And we're going to get the state of the blocks perfectly. 1244 00:47:58,320 --> 00:48:01,000 And because it's noiseless, you notice that-- 1245 00:48:01,000 --> 00:48:03,157 so those crosses over there are crosses 1246 00:48:03,157 --> 00:48:05,490 because they're error bars, both across people and model 1247 00:48:05,490 --> 00:48:06,047 simulations. 1248 00:48:06,047 --> 00:48:07,380 Now they're just vertical lines. 1249 00:48:07,380 --> 00:48:09,254 There are no error bars in the model simulation 1250 00:48:09,254 --> 00:48:10,600 because it's deterministic. 1251 00:48:10,600 --> 00:48:13,100 It's graded because there's the proportion of the tower that 1252 00:48:13,100 --> 00:48:13,620 falls over. 1253 00:48:13,620 --> 00:48:15,630 But what you see is the model is a lot worse.
1254 00:48:15,630 --> 00:48:17,520 It scatters much more. 1255 00:48:17,520 --> 00:48:19,620 The correlation dropped from around 0.9 1256 00:48:19,620 --> 00:48:22,890 to around 0.6 in terms of correlation of model 1257 00:48:22,890 --> 00:48:24,060 with people's judgments. 1258 00:48:24,060 --> 00:48:26,310 And you have some cases like this red dot here-- 1259 00:48:26,310 --> 00:48:28,380 that corresponds to this stimulus-- 1260 00:48:28,380 --> 00:48:30,720 which goes from being a really nice model fit to a poor one. 1261 00:48:30,720 --> 00:48:33,280 This is one which people judged to be very unstable, 1262 00:48:33,280 --> 00:48:35,370 and so does the probabilistic physics engine. 1263 00:48:35,370 --> 00:48:37,930 But actually it's not unstable at all. 1264 00:48:37,930 --> 00:48:39,330 It's actually perfectly stable. 1265 00:48:39,330 --> 00:48:41,370 The blocks are actually just perfectly balanced 1266 00:48:41,370 --> 00:48:42,180 so that it doesn't fall. 1267 00:48:42,180 --> 00:48:43,888 Although I'm sure everybody looks at that 1268 00:48:43,888 --> 00:48:45,240 and finds that hard to believe. 1269 00:48:45,240 --> 00:48:46,060 So this is nice. 1270 00:48:46,060 --> 00:48:47,670 This is a kind of physics illusion. 1271 00:48:47,670 --> 00:48:50,400 There are real world versions of this out on the beaches 1272 00:48:50,400 --> 00:48:52,260 not too far from here. 1273 00:48:52,260 --> 00:48:54,990 It's a fun thing to do to stack up objects in ways 1274 00:48:54,990 --> 00:48:57,180 that are surprisingly stable. 1275 00:48:57,180 --> 00:49:01,260 We would say it's a surprise because your intuitive physics 1276 00:49:01,260 --> 00:49:04,410 has certain irreducible noise.
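[Editor's sketch] A minimal version of the probabilistic simulation idea, with an invented toy stability rule standing in for the real physics engine and a made-up perceptual noise level:

```python
import random

def proportion_fallen(centers, width=1.0):
    # centers: horizontal centers of equal-width blocks, bottom to top.
    # Crude rule standing in for a physics rollout: the blocks above
    # level i topple if their combined center of mass overhangs block i.
    n = len(centers)
    for i in range(n - 1):
        above = centers[i + 1:]
        com = sum(above) / len(above)
        if abs(com - centers[i]) > width / 2:
            return (n - 1 - i) / n  # everything above level i falls
    return 0.0

def judged_instability(centers, n_samples=200, position_noise=0.12, seed=0):
    # The probabilistic part: average the rollout outcome over noisy
    # perceptual guesses of where the blocks actually are.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        guess = [c + rng.gauss(0, position_noise) for c in centers]
        total += proportion_fallen(guess)
    return total / n_samples

balanced = [0.0, 0.45, 0.0, 0.45]  # finely balanced: ground truth says stable
sturdy = [0.0, 0.0, 0.0, 0.0]      # comfortably stacked
```

Here `proportion_fallen(balanced)` is 0.0, the noiseless ground-truth judgment, yet averaging over noisy position guesses rates the finely balanced tower as much less stable than the comfortably stacked one, the same pattern as the illusion.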
1277 00:49:04,410 --> 00:49:09,180 What we're suggesting here is that your physical intuitions-- 1278 00:49:09,180 --> 00:49:11,040 you're always in some sense making 1279 00:49:11,040 --> 00:49:13,440 a guess that's sensitive to the uncertainty about where 1280 00:49:13,440 --> 00:49:16,470 things might be and what forces might be active on the world. 1281 00:49:16,470 --> 00:49:19,080 And it's very hard to see these as deterministic physics, 1282 00:49:19,080 --> 00:49:21,330 even when you know that that's exactly what's going on 1283 00:49:21,330 --> 00:49:23,190 and that it is stable. 1284 00:49:23,190 --> 00:49:25,290 Let me say just a little bit about planning. 1285 00:49:25,290 --> 00:49:28,170 So how might you use this kind of model 1286 00:49:28,170 --> 00:49:32,120 to build some model of this core intuitive psychology? 1287 00:49:32,120 --> 00:49:35,016 And I don't mean here all of theory of mind. 1288 00:49:35,016 --> 00:49:36,390 Next week, we'll hear a lot more. 1289 00:49:36,390 --> 00:49:37,889 Like Rebecca Saxe will be down here. 1290 00:49:37,889 --> 00:49:41,604 We'll hear a lot more about much richer kinds of reasoning 1291 00:49:41,604 --> 00:49:43,020 about other people's mental states 1292 00:49:43,020 --> 00:49:45,030 that adults and older children can do. 1293 00:49:45,030 --> 00:49:47,130 But here we're talking about, just 1294 00:49:47,130 --> 00:49:48,960 as we were talking about what I was calling 1295 00:49:48,960 --> 00:49:52,350 core intuitive physics, again inspired by Liz's work of just 1296 00:49:52,350 --> 00:49:56,010 you know what objects do right here on the table top around us 1297 00:49:56,010 --> 00:49:59,054 over short time scales, the core theory of mind, 1298 00:49:59,054 --> 00:50:01,470 something that even very young babies can do in some form, 1299 00:50:01,470 --> 00:50:03,160 or at least young children. 
1300 00:50:03,160 --> 00:50:06,680 There's controversy over exactly at what age kids are 1301 00:50:06,680 --> 00:50:07,930 able to do this sort of thing. 1302 00:50:07,930 --> 00:50:13,210 But in some form I think before language, 1303 00:50:13,210 --> 00:50:17,010 it's the kind of thing that when you're starting to learn verbs, 1304 00:50:17,010 --> 00:50:19,382 the earliest language is kind of mentalistic 1305 00:50:19,382 --> 00:50:20,590 and builds on this knowledge. 1306 00:50:20,590 --> 00:50:24,180 And take the red and blue ball chasing scene that you saw, 1307 00:50:24,180 --> 00:50:25,260 remember, from Tomer. 1308 00:50:25,260 --> 00:50:26,310 That was 13-month-olds. 1309 00:50:26,310 --> 00:50:29,850 So there's definitely some form of kind of interpretation 1310 00:50:29,850 --> 00:50:32,220 of beliefs and desires in some protoform 1311 00:50:32,220 --> 00:50:36,430 that you can see even in infants of around one year of age. 1312 00:50:36,430 --> 00:50:38,546 And it's exactly that kind of thing also. 1313 00:50:38,546 --> 00:50:40,920 Remember that, if you saw John Leonard's talk yesterday-- 1314 00:50:40,920 --> 00:50:43,410 he was the robotics guy who talked about self-driving cars 1315 00:50:43,410 --> 00:50:46,110 and how there are certain gaps in what they 1316 00:50:46,110 --> 00:50:47,700 can do despite all the publicity, 1317 00:50:47,700 --> 00:50:50,250 like they can't turn left basically 1318 00:50:50,250 --> 00:50:52,124 in an unrestricted intersection. 1319 00:50:52,124 --> 00:50:53,790 Because there's a certain kind of theory 1320 00:50:53,790 --> 00:50:56,689 of mind in street scenes when cars could be coming and people 1321 00:50:56,689 --> 00:50:58,230 could be crossing or all those things 1322 00:50:58,230 --> 00:51:00,229 about the police officers.
1323 00:51:00,229 --> 00:51:01,770 Part of why this is so exciting to me 1324 00:51:01,770 --> 00:51:04,590 and why I love that talk is because this is, I think, 1325 00:51:04,590 --> 00:51:06,840 that same common sense knowledge that if we can really 1326 00:51:06,840 --> 00:51:09,510 figure out how to capture this reasoning about beliefs 1327 00:51:09,510 --> 00:51:11,580 and desires in the limited context 1328 00:51:11,580 --> 00:51:14,250 where desires are people moving around in space around us 1329 00:51:14,250 --> 00:51:16,560 and the beliefs are who can see who 1330 00:51:16,560 --> 00:51:18,980 and who can see who can see who-- 1331 00:51:18,980 --> 00:51:21,960 in driving, the art of making eye contact with other drivers 1332 00:51:21,960 --> 00:51:24,559 or pedestrians is seeing that they can see you 1333 00:51:24,559 --> 00:51:26,100 or that they can see what you can see 1334 00:51:26,100 --> 00:51:27,930 and that they can see you seeing them. 1335 00:51:27,930 --> 00:51:29,910 It doesn't have to be super deeply recursive, 1336 00:51:29,910 --> 00:51:31,537 but it's a couple of layers deep. 1337 00:51:31,537 --> 00:51:33,370 We don't have to think about it consciously, 1338 00:51:33,370 --> 00:51:35,050 but we have to be able to do it. 1339 00:51:35,050 --> 00:51:37,059 So that's the kind of core belief-desire 1340 00:51:37,059 --> 00:51:38,100 theory of mind reasoning. 1341 00:51:38,100 --> 00:51:40,500 And here's how we've tried to capture this 1342 00:51:40,500 --> 00:51:43,110 with probabilistic programs. 1343 00:51:43,110 --> 00:51:47,080 This is work that Chris Baker started doing a few years ago. 1344 00:51:47,080 --> 00:51:50,250 And a lot of it joint with Rebecca Saxe 1345 00:51:50,250 --> 00:51:53,820 and also some of it with Julian Jara-Ettinger and some of it 1346 00:51:53,820 --> 00:51:54,510 with Tomer.
1347 00:51:54,510 --> 00:51:55,410 So there's a whole bunch of us who've 1348 00:51:55,410 --> 00:51:56,620 been working on versions of this, 1349 00:51:56,620 --> 00:51:58,411 but I'll just show you one or two examples. 1350 00:52:01,350 --> 00:52:08,730 Again, the key programs here are not graphics or physics 1351 00:52:08,730 --> 00:52:11,200 engines, but planning engines and perception engines. 1352 00:52:11,200 --> 00:52:15,090 So very simple kinds of robotics programs, 1353 00:52:15,090 --> 00:52:18,210 far too simple in this form to build 1354 00:52:18,210 --> 00:52:20,760 a self-driving car or a humanoid robot, 1355 00:52:20,760 --> 00:52:24,520 but maybe the kind of thing that in game robots like the zombie 1356 00:52:24,520 --> 00:52:26,460 or the security guard in Quake or something 1357 00:52:26,460 --> 00:52:28,720 might do something like this. 1358 00:52:28,720 --> 00:52:32,100 So planning basically just means it's a little bit more 1359 00:52:32,100 --> 00:52:35,220 than sort of shortest path planning. 1360 00:52:35,220 --> 00:52:37,890 But it's basically like find a sequence of actions 1361 00:52:37,890 --> 00:52:39,720 in a simple world like moving around 1362 00:52:39,720 --> 00:52:44,490 a 2D environment that maximizes your long run expected reward. 1363 00:52:44,490 --> 00:52:46,050 So there's a kind of utility theory, 1364 00:52:46,050 --> 00:52:49,340 or what Laura Schulz calls a naive utility calculus, here. 1365 00:52:49,340 --> 00:52:52,800 A calculation of costs and benefits where in a sense 1366 00:52:52,800 --> 00:52:55,620 you get a big reward, a good positive utility 1367 00:52:55,620 --> 00:52:58,740 for getting to your goal and a small cost for each action you 1368 00:52:58,740 --> 00:52:59,640 take. 
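The planning engine just described, find a sequence of actions that maximizes long-run expected reward, with a positive utility at the goal and a small cost per step, can be sketched as value iteration on a tiny grid. The grid layout, the wall, and the reward and cost numbers below are made-up illustrations, not the engine from the studies.

```python
# A minimal sketch of "planning as cost-benefit calculation":
# value iteration on a toy 2D grid with a goal reward and a step cost.
GRID_W, GRID_H = 5, 4
WALLS = {(2, 1), (2, 2)}           # an obstacle, like the wall in the movies
GOAL = (4, 3)
STEP_COST, GOAL_REWARD = -1.0, 10.0
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]

STATES = [(x, y) for x in range(GRID_W) for y in range(GRID_H)
          if (x, y) not in WALLS]

def step(s, a):
    """Deterministic dynamics: bump into a wall or the edge and stay put."""
    nxt = (s[0] + a[0], s[1] + a[1])
    return nxt if nxt in STATES else s

def value_iteration(iters=100):
    """Long-run value of each state: goal reward minus accumulated step costs."""
    V = {s: 0.0 for s in STATES}
    for _ in range(iters):
        for s in STATES:
            if s == GOAL:
                V[s] = GOAL_REWARD
            else:
                V[s] = STEP_COST + max(V[step(s, a)] for a in MOVES)
    return V

def plan(start, V):
    """Greedy rollout: at each state, take the action leading to the best value."""
    path, s = [start], start
    while s != GOAL and len(path) < 50:
        s = max((step(s, a) for a in MOVES), key=lambda t: V[t])
        path.append(s)
    return path

V = value_iteration()
print(plan((0, 0), V))   # an efficient path around the obstacle to the goal
```

Because each step carries a cost, the planner automatically trades off the value of the goal against the effort of getting there, which is the naive utility calculus in miniature.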
1369 00:52:59,640 --> 00:53:03,525 And under that view, then in some sense-- 1370 00:53:03,525 --> 00:53:06,600 and some actions might be more costly than others, something 1371 00:53:06,600 --> 00:53:09,690 that Tomer is looking at in infants and something 1372 00:53:09,690 --> 00:53:11,430 that Julian Jara-Ettinger has looked 1373 00:53:11,430 --> 00:53:13,890 at in older kids, this understanding of that. 1374 00:53:13,890 --> 00:53:15,832 But this sort of basic cost-benefit trade-off 1375 00:53:15,832 --> 00:53:18,984 that is going on whenever you move around an environment 1376 00:53:18,984 --> 00:53:21,150 and decide, well, is it worthwhile to go all the way 1377 00:53:21,150 --> 00:53:24,330 over there, or, well, I know I like the coffee up at Pie 1378 00:53:24,330 --> 00:53:26,850 in the Sky better than the coffee in the dining hall 1379 00:53:26,850 --> 00:53:27,540 here at Swope. 1380 00:53:27,540 --> 00:53:30,090 But to think about, am I going to be late to my lecture? 1381 00:53:30,090 --> 00:53:31,830 Am I going to be late to Nancy's lecture? 1382 00:53:31,830 --> 00:53:33,870 Those are different costs-- 1383 00:53:33,870 --> 00:53:35,130 both costs. 1384 00:53:35,130 --> 00:53:36,600 It's that kind of calculation. 1385 00:53:39,390 --> 00:53:42,310 So here let me get more concrete. 1386 00:53:42,310 --> 00:53:44,580 So here's an example of an experiment 1387 00:53:44,580 --> 00:53:46,980 that Chris did a few years ago where, again, it's 1388 00:53:46,980 --> 00:53:49,540 like what you saw with the Heider and Simmel squares 1389 00:53:49,540 --> 00:53:52,230 and triangles and circles, or the Southgate 1390 00:53:52,230 --> 00:53:54,840 and Csibra red and blue balls chasing each other. 1391 00:53:54,840 --> 00:53:56,280 Very simple stuff. 1392 00:53:56,280 --> 00:53:57,450 Here you see an agent. 1393 00:53:57,450 --> 00:53:59,890 It's like an overhead view of a room, 1394 00:53:59,890 --> 00:54:01,290 2D environment from the top.
1395 00:54:01,290 --> 00:54:03,270 The agent's moving along some path. 1396 00:54:03,270 --> 00:54:06,384 There are three possible goals, A, B, or C. 1397 00:54:06,384 --> 00:54:08,550 And then there's maybe some obstacles or constraints 1398 00:54:08,550 --> 00:54:10,505 like a wall like you saw in those movies. 1399 00:54:10,505 --> 00:54:12,630 Maybe the wall has a hole that he can pass through. 1400 00:54:12,630 --> 00:54:14,190 Maybe it doesn't. 1401 00:54:14,190 --> 00:54:16,194 And across different trials of the experiment, 1402 00:54:16,194 --> 00:54:18,360 just like in the physics stuff where we vary all the block 1403 00:54:18,360 --> 00:54:21,900 configurations and so on, here we vary where the goals are. 1404 00:54:21,900 --> 00:54:23,790 We vary whether the wall has a hole or not. 1405 00:54:23,790 --> 00:54:25,440 We vary the agent's path. 1406 00:54:25,440 --> 00:54:28,440 On different trials, we also stop it at different points. 1407 00:54:28,440 --> 00:54:30,600 Because we're trying to see, as you watch this agent 1408 00:54:30,600 --> 00:54:33,060 move around and action unfolds over time, 1409 00:54:33,060 --> 00:54:36,570 how do your guesses about his goal change over time? 1410 00:54:36,570 --> 00:54:39,060 And what you see-- 1411 00:54:39,060 --> 00:54:42,570 so these are just examples of a few of the scenes. 1412 00:54:42,570 --> 00:54:44,590 And here what you see are examples of the data. 1413 00:54:44,590 --> 00:54:47,680 Again, the y-axis is the average human judgment. 1414 00:54:47,680 --> 00:54:49,680 Red, blue, and green is color-coded to the goal. 1415 00:54:49,680 --> 00:54:51,055 They're just asked, how likely do 1416 00:54:51,055 --> 00:54:53,490 you think each of those three things is his goal? 1417 00:54:53,490 --> 00:54:55,810 And then here the x-axis is time. 1418 00:54:55,810 --> 00:54:58,830 So these are time steps that we ask at different points 1419 00:54:58,830 --> 00:54:59,952 along the trajectory.
1420 00:54:59,952 --> 00:55:01,410 And what you can see is that people 1421 00:55:01,410 --> 00:55:03,571 are making various systematic kinds of judgments. 1422 00:55:03,571 --> 00:55:05,820 Sometimes they're not sure whether his goal is A or B, 1423 00:55:05,820 --> 00:55:08,010 but they know it's not C. And then 1424 00:55:08,010 --> 00:55:10,837 after a little while or some key event happens, 1425 00:55:10,837 --> 00:55:12,670 and now they're quite sure it's A and not B. 1426 00:55:12,670 --> 00:55:14,430 Or they could change their mind. 1427 00:55:14,430 --> 00:55:18,480 Here people were pretty sure it was either green or red but not 1428 00:55:18,480 --> 00:55:19,390 blue. 1429 00:55:19,390 --> 00:55:21,150 And then there comes a point where it's surely not green, 1430 00:55:21,150 --> 00:55:22,316 but it might be blue or red. 1431 00:55:22,316 --> 00:55:23,480 Oh no, then it's red. 1432 00:55:23,480 --> 00:55:25,290 Here they were pretty sure it was green. 1433 00:55:25,290 --> 00:55:26,910 Then no, definitely not green. 1434 00:55:26,910 --> 00:55:28,250 And now, I think it's red. 1435 00:55:28,250 --> 00:55:29,850 It was probably never blue. 1436 00:55:29,850 --> 00:55:30,780 OK. 1437 00:55:30,780 --> 00:55:32,640 And the really striking thing to us 1438 00:55:32,640 --> 00:55:36,000 is how closely you can match those judgments 1439 00:55:36,000 --> 00:55:38,250 with this very simple probabilistic planning 1440 00:55:38,250 --> 00:55:39,552 program run in reverse. 1441 00:55:39,552 --> 00:55:41,510 So we take, again, this simple planning program 1442 00:55:41,510 --> 00:55:44,850 that just says, basically, get as efficiently 1443 00:55:44,850 --> 00:55:46,237 as possible to your goal. 1444 00:55:46,237 --> 00:55:47,820 I don't know what your goal is though.
1445 00:55:47,820 --> 00:55:50,370 I observe your actions that result from an efficient plan, 1446 00:55:50,370 --> 00:55:52,100 and I want to work backwards to say, 1447 00:55:52,100 --> 00:55:53,850 what do I think your goal is, your desire, 1448 00:55:53,850 --> 00:55:55,290 the rewarding state? 1449 00:55:55,290 --> 00:55:57,090 And just doing that just basically 1450 00:55:57,090 --> 00:55:58,830 perfectly predicts people's data. 1451 00:55:58,830 --> 00:56:00,990 I mean, of all the mathematical models of behavior 1452 00:56:00,990 --> 00:56:02,760 I've ever had a hand in building, 1453 00:56:02,760 --> 00:56:05,190 this one works the best. 1454 00:56:05,190 --> 00:56:06,600 It's really quite striking. 1455 00:56:06,600 --> 00:56:08,130 To me it was striking because I came 1456 00:56:08,130 --> 00:56:11,310 in thinking this would be a very high-level, weird, flaky, 1457 00:56:11,310 --> 00:56:13,230 hard-to-model thing. 1458 00:56:13,230 --> 00:56:14,760 Here's just one more example of one 1459 00:56:14,760 --> 00:56:17,350 of these things, which actually puts beliefs in there, 1460 00:56:17,350 --> 00:56:18,090 not just desires. 1461 00:56:18,090 --> 00:56:19,950 So it's a key part of intuitive psychology 1462 00:56:19,950 --> 00:56:22,670 that we do joint inference over beliefs and desires. 1463 00:56:22,670 --> 00:56:26,040 In this one here, we assume that you, the subject, 1464 00:56:26,040 --> 00:56:27,750 the agent who's moving around, all of us 1465 00:56:27,750 --> 00:56:29,716 have shared full knowledge of the world. 1466 00:56:29,716 --> 00:56:31,090 So we know where the objects are. 1467 00:56:31,090 --> 00:56:31,830 We know where the holes are. 1468 00:56:31,830 --> 00:56:33,270 There's none of this false belief, 1469 00:56:33,270 --> 00:56:36,090 like you think something is there when it isn't. 
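Running a planning program "in reverse" like this is Bayesian inverse planning: assume the agent is noisily rational, then use Bayes' rule to get a posterior over goals as the path unfolds. The sketch below is a deliberate simplification of the model described here: the goal positions and the rationality parameter are made up, and actions are scored by a one-step distance heuristic rather than a full plan around obstacles.

```python
import math

GOALS = {"A": (0, 5), "B": (5, 5), "C": (5, 0)}  # illustrative goal positions
BETA = 2.0  # rationality: higher = more reliably efficient agent

def dist(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def action_likelihood(s, s_next, goal):
    """P(move s -> s_next | goal): a softmax over the four moves,
    each scored by how close it would bring the agent to the goal."""
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    scores = [math.exp(-BETA * dist((s[0] + dx, s[1] + dy), GOALS[goal]))
              for dx, dy in moves]
    chosen = math.exp(-BETA * dist(s_next, GOALS[goal]))
    return chosen / sum(scores)

def goal_posterior(path):
    """P(goal | observed path) by Bayes' rule, uniform prior over goals."""
    post = {g: 1.0 / len(GOALS) for g in GOALS}
    for s, s_next in zip(path, path[1:]):
        for g in post:
            post[g] *= action_likelihood(s, s_next, g)
        z = sum(post.values())
        post = {g: p / z for g, p in post.items()}
    return post

# The agent starts at the bottom middle and heads straight up:
# A and B stay live hypotheses, while C is quickly ruled out.
print(goal_posterior([(2, 0), (2, 1), (2, 2)]))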
1470 00:56:36,090 --> 00:56:38,010 Now, here's some later work that Chris 1471 00:56:38,010 --> 00:56:42,575 did, what we call the food truck studies, 1472 00:56:42,575 --> 00:56:44,700 where here we add in some uncertainty about beliefs 1473 00:56:44,700 --> 00:56:46,080 in addition to desires. 1474 00:56:46,080 --> 00:56:48,330 And it's easiest just to explain with this one example 1475 00:56:48,330 --> 00:56:49,940 up there in the upper left. 1476 00:56:49,940 --> 00:56:54,390 So here, like on a lot of university campuses, 1477 00:56:54,390 --> 00:56:56,720 lunch is best found at food trucks, 1478 00:56:56,720 --> 00:56:59,540 which can park in different spots around campus. 1479 00:56:59,540 --> 00:57:02,840 Here the two yellow squares show the two parking spots 1480 00:57:02,840 --> 00:57:04,467 on this part of campus. 1481 00:57:04,467 --> 00:57:06,050 And there are several different trucks 1482 00:57:06,050 --> 00:57:07,490 that can come and park in different places 1483 00:57:07,490 --> 00:57:08,240 on different days. 1484 00:57:08,240 --> 00:57:09,350 There's a Korean truck. 1485 00:57:09,350 --> 00:57:10,350 That's K. 1486 00:57:10,350 --> 00:57:11,600 There's a Lebanese truck. 1487 00:57:11,600 --> 00:57:12,522 That's L. 1488 00:57:12,522 --> 00:57:14,480 There's also other trucks, like a Mexican truck. 1489 00:57:14,480 --> 00:57:15,900 But there's only two spots. 1490 00:57:15,900 --> 00:57:17,780 So if the Korean one parks there and the Lebanese one 1491 00:57:17,780 --> 00:57:19,821 parks there, the Mexican has to go somewhere else 1492 00:57:19,821 --> 00:57:21,470 or can't come there today. 1493 00:57:21,470 --> 00:57:24,150 And on some days the trucks park in different places. 1494 00:57:24,150 --> 00:57:26,290 Or a spot could also be unoccupied. 1495 00:57:26,290 --> 00:57:28,070 The trucks could be elsewhere. 1496 00:57:28,070 --> 00:57:29,870 So look at what happens on this day.
1497 00:57:29,870 --> 00:57:33,830 Our friendly grad student, Harold, 1498 00:57:33,830 --> 00:57:35,550 comes out from his office here. 1499 00:57:35,550 --> 00:57:38,000 And importantly, the way we model interesting notions 1500 00:57:38,000 --> 00:57:39,874 of evolving belief is that now we've 1501 00:57:39,874 --> 00:57:41,790 got that perception and inference arrow there. 1502 00:57:41,790 --> 00:57:43,550 So Harold forms his belief about what's 1503 00:57:43,550 --> 00:57:44,870 where based on what he can see. 1504 00:57:44,870 --> 00:57:47,420 And it's just the simplest perception model, just 1505 00:57:47,420 --> 00:57:48,710 line-of-sight access. 1506 00:57:48,710 --> 00:57:51,470 We assume he can kind of see anything that's unobstructed 1507 00:57:51,470 --> 00:57:52,490 in his line of sight. 1508 00:57:52,490 --> 00:57:56,270 So that means that when he comes out here, 1509 00:57:56,270 --> 00:57:59,389 he can see that there is the Korean truck here. 1510 00:57:59,389 --> 00:58:01,430 But he can't see-- this is a wall or a building. 1511 00:58:01,430 --> 00:58:03,722 He can't see what's on the other side of that. 1512 00:58:03,722 --> 00:58:04,680 OK, so what does he do? 1513 00:58:04,680 --> 00:58:05,721 Well, he walks down here. 1514 00:58:05,721 --> 00:58:07,471 He goes past the Korean truck, goes around 1515 00:58:07,471 --> 00:58:08,762 the other side of the building. 1516 00:58:08,762 --> 00:58:10,340 Now at this point, his line of sight 1517 00:58:10,340 --> 00:58:11,715 gives him-- he can see that there 1518 00:58:11,715 --> 00:58:13,280 is a Lebanese truck there. 1519 00:58:13,280 --> 00:58:15,920 He turns around, and he goes back to the Korean truck. 1520 00:58:15,920 --> 00:58:19,310 So the question for you is, what is his favorite truck? 1521 00:58:19,310 --> 00:58:21,702 Is it Korean, Lebanese, or Mexican? 1522 00:58:21,702 --> 00:58:22,646 AUDIENCE: Mexican.
1523 00:58:22,646 --> 00:58:23,510 PROFESSOR: Mexican, yeah, it doesn't sound 1524 00:58:23,510 --> 00:58:24,950 very hard to figure that out. 1525 00:58:24,950 --> 00:58:27,500 But it's quite interesting because the Mexican one 1526 00:58:27,500 --> 00:58:30,080 isn't even in the scene. 1527 00:58:30,080 --> 00:58:33,660 The most basic kind of goal recognition-- and this, 1528 00:58:33,660 --> 00:58:35,660 again, cuts right to the heart of the difference 1529 00:58:35,660 --> 00:58:37,681 between recognition and explanation. 1530 00:58:37,681 --> 00:58:39,680 There's been a lot of progress in machine vision 1531 00:58:39,680 --> 00:58:42,290 systems for action understanding, action 1532 00:58:42,290 --> 00:58:43,880 recognition, and so on. 1533 00:58:43,880 --> 00:58:48,110 And they do things like, for example, they take video. 1534 00:58:48,110 --> 00:58:50,699 And the best cue that somebody wants something 1535 00:58:50,699 --> 00:58:52,490 is if they reach for it or move towards it. 1536 00:58:52,490 --> 00:58:54,740 And that's certainly what was going on here. 1537 00:58:54,740 --> 00:58:57,800 In all of these scenes, your best inference 1538 00:58:57,800 --> 00:59:00,426 about what the guy's goal is is which 1539 00:59:00,426 --> 00:59:01,550 thing is he moving towards. 1540 00:59:01,550 --> 00:59:03,650 And it's just subtle to parse out 1541 00:59:03,650 --> 00:59:05,150 the relative degrees of confidence 1542 00:59:05,150 --> 00:59:08,550 when there's a complex environment with constraints. 1543 00:59:08,550 --> 00:59:10,400 But in every case, by the end it's 1544 00:59:10,400 --> 00:59:12,350 clear he's going for one thing, and the thing 1545 00:59:12,350 --> 00:59:14,660 he is moving towards is the thing he wants. 1546 00:59:14,660 --> 00:59:16,880 But here you have no trouble realizing 1547 00:59:16,880 --> 00:59:18,680 that his goal is something that isn't 1548 00:59:18,680 --> 00:59:20,095 even present in the scene. 
1549 00:59:20,095 --> 00:59:21,470 Yet he's still moving towards it. 1550 00:59:21,470 --> 00:59:23,990 In a sense, he's moving towards his mental representation 1551 00:59:23,990 --> 00:59:25,190 of it. 1552 00:59:25,190 --> 00:59:30,300 He's moving towards the Mexican truck in his mind's model. 1553 00:59:30,300 --> 00:59:33,230 And that's him explaining the data he sees. 1554 00:59:33,230 --> 00:59:34,882 For some reason, he must have had 1555 00:59:34,882 --> 00:59:37,340 maybe a prior belief that the Mexican truck would be there. 1556 00:59:37,340 --> 00:59:39,140 So he formed a plan to go there. 1557 00:59:39,140 --> 00:59:41,180 And in fact, we can ask people not only 1558 00:59:41,180 --> 00:59:43,474 which truck does he like-- it's the Mexican truck. 1559 00:59:43,474 --> 00:59:45,390 That's what people say, and here is the model. 1560 00:59:45,390 --> 00:59:47,420 But we also asked them a belief inference. 1561 00:59:47,420 --> 00:59:50,060 We say, prior to setting out, what 1562 00:59:50,060 --> 00:59:52,089 did Harold think was on the other side? 1563 00:59:52,089 --> 00:59:54,380 What was parked in the other spot that he couldn't see? 1564 00:59:54,380 --> 00:59:57,050 Did he think it was Lebanese, Mexican, or neither? 1565 00:59:57,050 --> 00:59:58,300 And we ask a degree of belief. 1566 00:59:58,300 --> 00:59:59,692 So you could say he had no idea. 1567 00:59:59,692 --> 01:00:01,400 But interestingly, people say he probably 1568 01:00:01,400 --> 01:00:02,358 thought it was Mexican. 1569 01:00:02,358 --> 01:00:05,510 Because how else could you explain what he's doing? 1570 01:00:05,510 --> 01:00:09,440 So I mean, if I had to point to just one example of cognition 1571 01:00:09,440 --> 01:00:11,090 as explanation, it's this.
1572 01:00:11,090 --> 01:00:14,630 The only sensible way, and it's a very intuitive and compelling 1573 01:00:14,630 --> 01:00:17,960 way, to explain why did he go the way he did 1574 01:00:17,960 --> 01:00:19,830 and then turn around just when he did 1575 01:00:19,830 --> 01:00:23,360 and wind up just where he did, is this set of inferences 1576 01:00:23,360 --> 01:00:24,200 basically. 1577 01:00:24,200 --> 01:00:26,480 That his favorite is Mexican, his second favorite 1578 01:00:26,480 --> 01:00:28,460 is Korean-- that's also important-- his least 1579 01:00:28,460 --> 01:00:29,960 favorite is Lebanese. 1580 01:00:29,960 --> 01:00:32,480 And he thought that Mexican was there, 1581 01:00:32,480 --> 01:00:34,730 which is why it was worthwhile to go and check. 1582 01:00:34,730 --> 01:00:36,710 At least, he thought it was likely. 1583 01:00:36,710 --> 01:00:37,850 He wasn't sure, right? 1584 01:00:37,850 --> 01:00:39,010 Notice it's not very high. 1585 01:00:39,010 --> 01:00:41,630 But it's more likely than the other possibilities. 1586 01:00:41,630 --> 01:00:44,090 Because, of course, if he was quite sure it was Lebanese, 1587 01:00:44,090 --> 01:00:45,560 well, he wouldn't have bothered to go around there. 1588 01:00:45,560 --> 01:00:46,610 And in fact, you do see that. 1589 01:00:46,610 --> 01:00:47,330 So you have ones-- 1590 01:00:47,330 --> 01:00:48,621 I guess I don't have them here. 1591 01:00:48,621 --> 01:00:50,960 But there are scenes where he just goes straight here. 1592 01:00:50,960 --> 01:00:53,480 And then that's consistent with him thinking possibly it 1593 01:00:53,480 --> 01:00:54,320 was Lebanese. 1594 01:00:54,320 --> 01:00:55,820 And if he thought nothing was there, 1595 01:00:55,820 --> 01:00:58,340 well, again, he wouldn't have gone to check. 1596 01:00:58,340 --> 01:01:00,770 And again, this model is extremely quantitatively 1597 01:01:00,770 --> 01:01:05,510 predictive of people's judgments about both desires and beliefs.
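That set of inferences can be made concrete with a toy joint belief-desire inference for the food-truck scenario. This is illustrative only, not the paper's actual model: the hypothesis space is just preference rankings over the three trucks plus a guess about the hidden spot, and the 0.9/0.1 "rationality" numbers are made up.

```python
from itertools import permutations

TRUCKS = ["K", "L", "M"]     # Korean, Lebanese, Mexican
BELIEFS = ["L", "M", None]   # what he thought was in the hidden spot

def likelihood(ranking, belief):
    """P(observed behavior | hypothesis). Observed: he walked past the
    visible Korean truck to check the hidden spot, saw the Lebanese
    truck there, then came back and ate Korean."""
    prefers = lambda a, b: ranking.index(a) < ranking.index(b)
    worth_checking = belief is not None and prefers(belief, "K")
    returns_to_k = prefers("K", "L")
    p_check = 0.9 if worth_checking else 0.1    # rationality knobs (made up)
    p_return = 0.9 if returns_to_k else 0.1
    return p_check * p_return

# Uniform prior over all (ranking, belief) hypotheses; score each by Bayes.
joint = {(r, b): likelihood(r, b)
         for r in permutations(TRUCKS) for b in BELIEFS}
z = sum(joint.values())
favorite = {t: sum(p for (r, b), p in joint.items() if r[0] == t) / z
            for t in TRUCKS}
belief = {b: sum(p for (r, bb), p in joint.items() if bb == b) / z
          for b in BELIEFS}
print(favorite)   # Mexican comes out on top, though it never appears
print(belief)     # and he probably believed Mexican was parked there
```

Even this crude version reproduces the qualitative pattern: the posterior favors "favorite is Mexican, and he believed Mexican was there," but not with certainty, matching the observation that people's belief judgment is high yet not very high.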
1598 01:01:05,510 --> 01:01:08,195 You can read in some of Battaglia's papers 1599 01:01:08,195 --> 01:01:10,320 ways in which you take the very same physics engine 1600 01:01:10,320 --> 01:01:12,650 and use it for all these different tasks, including 1601 01:01:12,650 --> 01:01:14,750 sort of slightly weird ones like these tasks. 1602 01:01:14,750 --> 01:01:16,220 If you bump the table, are you more 1603 01:01:16,220 --> 01:01:19,010 likely to knock off red blocks or yellow blocks? 1604 01:01:19,010 --> 01:01:22,430 Not a task you ever got any end-to-end training on, right? 1605 01:01:22,430 --> 01:01:24,980 But an example of the compositionality 1606 01:01:24,980 --> 01:01:26,810 of your model and your task. 1607 01:01:26,810 --> 01:01:28,310 Somebody asked me this during lunch, 1608 01:01:28,310 --> 01:01:31,340 and I think it is a key point to make about compositionality. 1609 01:01:31,340 --> 01:01:34,310 One of the key ways in which compositionality 1610 01:01:34,310 --> 01:01:37,004 works in this view of the mind, as opposed to the pattern 1611 01:01:37,004 --> 01:01:38,420 recognition view or the way, let's 1612 01:01:38,420 --> 01:01:40,130 say, like a DeepQ network works-- 1613 01:01:40,130 --> 01:01:42,410 AUDIENCE: You mean the [INAUDIBLE].. 1614 01:01:42,410 --> 01:01:45,680 PROFESSOR: Just ways of getting a very flexible repertoire 1615 01:01:45,680 --> 01:01:48,230 of inferences from composing pieces without having 1616 01:01:48,230 --> 01:01:50,840 to train specifically for it. 1617 01:01:50,840 --> 01:01:52,930 It's that if you have a physics engine, 1618 01:01:52,930 --> 01:01:55,209 you can simulate the physical world. 1619 01:01:55,209 --> 01:01:57,250 You can answer questions that you've never gotten 1620 01:01:57,250 --> 01:01:58,564 any training at all to solve. 
1621 01:01:58,564 --> 01:01:59,980 So in this experiment here, we ask 1622 01:01:59,980 --> 01:02:01,879 people, if you bump the table hard enough 1623 01:02:01,879 --> 01:02:03,670 to knock some of the blocks onto the floor, 1624 01:02:03,670 --> 01:02:05,544 is it more likely to be red or yellow blocks? 1625 01:02:05,544 --> 01:02:07,630 Unlike questions of will this tower 1626 01:02:07,630 --> 01:02:09,400 fall over-- we've made a lot of judgments of that sort. 1627 01:02:09,400 --> 01:02:11,358 You've never made that kind of judgment before. 1628 01:02:11,358 --> 01:02:12,550 It's a slightly weird one. 1629 01:02:12,550 --> 01:02:13,840 But you have no trouble making it. 1630 01:02:13,840 --> 01:02:15,760 And for many different configurations of blocks, 1631 01:02:15,760 --> 01:02:16,990 you make various graded judgments, 1632 01:02:16,990 --> 01:02:19,330 and the model captures it perfectly with no extra stuff 1633 01:02:19,330 --> 01:02:20,080 put in. 1634 01:02:20,080 --> 01:02:22,030 You just take the same model, 1635 01:02:22,030 --> 01:02:23,830 and you ask it a different question. 1636 01:02:23,830 --> 01:02:26,410 So if our dream is to build AI systems that 1637 01:02:26,410 --> 01:02:28,210 can answer questions, for example, which 1638 01:02:28,210 --> 01:02:30,040 a lot of people's dream is, I think 1639 01:02:30,040 --> 01:02:32,170 there's really no compelling alternative 1640 01:02:32,170 --> 01:02:33,160 to something like this. 1641 01:02:33,160 --> 01:02:35,534 That you build a model that you can ask all the questions 1642 01:02:35,534 --> 01:02:36,920 of that you'd want to ask. 1643 01:02:36,920 --> 01:02:39,700 And in this limited domain, again, it's just our Atari.
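The "same model, different question" point can be sketched as one generative simulation that two different queries are composed with, with no retraining in between. This is a toy stand-in for Battaglia's engine: the table edges, block positions, colors, and the bump-as-random-kick dynamics are all made up for illustration.

```python
import random

random.seed(0)
TABLE = (0.0, 10.0)                                  # table edges (made up)
BLOCKS = {"red": [1.0, 4.0], "yellow": [6.0, 9.5]}   # block x-positions (made up)

def simulate_bump(strength=1.5):
    """One noisy simulation of bumping the table: every block gets a
    random horizontal kick; it falls off if it ends up past an edge."""
    fallen = []
    for color, xs in BLOCKS.items():
        for x in xs:
            x2 = x + random.gauss(0.0, strength)
            if not TABLE[0] <= x2 <= TABLE[1]:
                fallen.append(color)
    return fallen

def query(question, n=5000):
    """Ask a new question of the same model by re-scoring its samples."""
    return sum(question(simulate_bump()) for _ in range(n)) / n

# Two different questions, one unchanged generative model:
p_any = query(lambda fallen: len(fallen) > 0)
p_red_vs_yellow = query(lambda fallen:
                        fallen.count("red") > fallen.count("yellow"))
print(p_any, p_red_vs_yellow)
```

The second question, red versus yellow, is the "slightly weird" one nobody trained for; it falls out of composing a new scoring function with the same simulations, which is the compositionality claim in miniature.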
1644 01:02:39,700 --> 01:02:41,290 In this limited domain of reasoning 1645 01:02:41,290 --> 01:02:43,880 about the physics of blocks, it's really pretty cool 1646 01:02:43,880 --> 01:02:45,490 what this physics engine is able to do 1647 01:02:45,490 --> 01:02:47,079 with many kinds of questions. 1648 01:02:47,079 --> 01:02:49,120 It can reason about things with different masses. 1649 01:02:49,120 --> 01:02:50,920 It can make guesses about the masses. 1650 01:02:50,920 --> 01:02:53,570 You can make some of the objects bigger or smaller. 1651 01:02:53,570 --> 01:02:56,806 You can attach constraints like fences to the table. 1652 01:02:56,806 --> 01:02:58,930 And the same model, without any fundamental change, 1653 01:02:58,930 --> 01:03:00,130 can answer all these questions. 1654 01:03:00,130 --> 01:03:01,588 So it doesn't have to be retrained. 1655 01:03:01,588 --> 01:03:03,490 Because there's basically no training. 1656 01:03:03,490 --> 01:03:05,080 It's just reasoning. 1657 01:03:05,080 --> 01:03:07,399 If we want to understand how learning works, 1658 01:03:07,399 --> 01:03:09,190 we first have to understand what's learned. 1659 01:03:09,190 --> 01:03:10,960 I think right now, we're only at the point 1660 01:03:10,960 --> 01:03:12,418 where we're starting to really have 1661 01:03:12,418 --> 01:03:15,950 a sense of what are these mental models of the physical world 1662 01:03:15,950 --> 01:03:17,440 and intentional action-- 1663 01:03:17,440 --> 01:03:20,400 these probabilistic programs that even young children 1664 01:03:20,400 --> 01:03:23,415 are using to reason about the world. 1665 01:03:23,415 --> 01:03:24,790 And then it's a separate question 1666 01:03:24,790 --> 01:03:27,550 how those are built up through some combination 1667 01:03:27,550 --> 01:03:34,090 of scientific discovery sorts of processes and evolution. 1668 01:03:34,090 --> 01:03:36,265 So here's the story, and I've told most 1669 01:03:36,265 --> 01:03:37,390 of what I want to tell you.
1670 01:03:37,390 --> 01:03:40,836 But the rest you'll get to hear-- 1671 01:03:40,836 --> 01:03:42,460 some of it you'll get to hear next week 1672 01:03:42,460 --> 01:03:45,190 from both our developmental colleagues 1673 01:03:45,190 --> 01:03:46,480 and from me and Tomer. 1674 01:03:46,480 --> 01:03:47,897 More on the computational side. 1675 01:03:47,897 --> 01:03:49,480 But actually the most interesting part 1676 01:03:49,480 --> 01:03:50,644 we just don't know yet. 1677 01:03:50,644 --> 01:03:52,810 So we hope you will actually write that next chapter 1678 01:03:52,810 --> 01:03:53,800 of this story. 1679 01:03:53,800 --> 01:03:58,630 But here's the outline of where we currently see things. 1680 01:03:58,630 --> 01:04:02,080 We think that we have a good target for what is really 1681 01:04:02,080 --> 01:04:04,751 the core of human intelligence, what makes us so smart, in terms 1682 01:04:04,751 --> 01:04:06,250 of these ideas of both what we start 1683 01:04:06,250 --> 01:04:09,790 with, this common sense core physics and psychology, 1684 01:04:09,790 --> 01:04:12,590 and how those things grow. 1685 01:04:12,590 --> 01:04:16,390 What are the learning mechanisms that I've just gestured at? 1686 01:04:16,390 --> 01:04:20,260 Again, more next week on the sort of science-like mechanisms 1687 01:04:20,260 --> 01:04:23,740 of hypothesis formation, experiment testing, play, 1688 01:04:23,740 --> 01:04:27,280 and exploration that you can use to build these intuitive theories, 1689 01:04:27,280 --> 01:04:30,182 much like scientists build their scientific theories.
1690 01:04:30,182 --> 01:04:32,140 And that we're starting on the engineering side 1691 01:04:32,140 --> 01:04:34,780 to have tools to capture this, both to capture the knowledge 1692 01:04:34,780 --> 01:04:36,390 and how it might grow through the use 1693 01:04:36,390 --> 01:04:38,800 of probabilistic programs and things 1694 01:04:38,800 --> 01:04:41,410 that sometimes go by the name of program induction or program 1695 01:04:41,410 --> 01:04:42,340 synthesis. 1696 01:04:42,340 --> 01:04:44,170 Or if you like hierarchical Bayes 1697 01:04:44,170 --> 01:04:46,720 on programs that generate other programs where 1698 01:04:46,720 --> 01:04:50,380 the search for a good program is like the inference of a program 1699 01:04:50,380 --> 01:04:53,170 that best explains the data as generated from a prior that's 1700 01:04:53,170 --> 01:04:54,820 a higher level program. 1701 01:04:54,820 --> 01:04:57,160 If you go to the tutorial from Tomer 1702 01:04:57,160 --> 01:04:58,684 you'll actually see building blocks. 1703 01:04:58,684 --> 01:05:00,100 You can write Church programs that 1704 01:05:00,100 --> 01:05:01,766 will do something like that, and we will 1705 01:05:01,766 --> 01:05:03,080 see more of that next time. 1706 01:05:03,080 --> 01:05:06,030 But the key is that we have a language now 1707 01:05:06,030 --> 01:05:08,740 which keeps building the different ingredients that we 1708 01:05:08,740 --> 01:05:09,340 think we need. 1709 01:05:09,340 --> 01:05:12,030 On the one hand, we've gone from thinking that we need something 1710 01:05:12,030 --> 01:05:14,530 like probabilistic generative models, which many people will 1711 01:05:14,530 --> 01:05:16,197 agree with, to recognizing that not only 1712 01:05:16,197 --> 01:05:17,863 do they have to be generative, they have 1713 01:05:17,863 --> 01:05:19,180 to be causal and compositional. 
1714 01:05:19,180 --> 01:05:21,670 And they have to have this fine-grained compositional 1715 01:05:21,670 --> 01:05:24,010 structure needed to capture the real stuff of the world. 1716 01:05:24,010 --> 01:05:26,440 Not graphs, but something more like equations 1717 01:05:26,440 --> 01:05:30,100 that capture graphics or physics or planning. 1718 01:05:30,100 --> 01:05:31,360 Of course, that's not all. 1719 01:05:31,360 --> 01:05:33,220 I mean, as I tried to gesture at, 1720 01:05:33,220 --> 01:05:36,970 we need also ways to make these things work very, very quickly. 1721 01:05:36,970 --> 01:05:39,460 There might be a place in this picture for something 1722 01:05:39,460 --> 01:05:40,930 like neural networks or some kind 1723 01:05:40,930 --> 01:05:44,440 of alternative or complementary approach based 1724 01:05:44,440 --> 01:05:46,060 on pattern recognition. 1725 01:05:46,060 --> 01:05:47,710 But these are just a number of the ways 1726 01:05:47,710 --> 01:05:50,260 in which I think we need to think about going forward. 1727 01:05:50,260 --> 01:05:53,710 We need to take the idea of both the brain as a pattern recognition 1728 01:05:53,710 --> 01:05:56,680 engine seriously and the idea of the brain as a modeling 1729 01:05:56,680 --> 01:05:58,425 or explanation engine seriously. 1730 01:05:58,425 --> 01:05:59,800 We're excited because we now have 1731 01:05:59,800 --> 01:06:01,852 tools to model modeling engines and maybe 1732 01:06:01,852 --> 01:06:04,060 to model how pattern recognition engines and modeling 1733 01:06:04,060 --> 01:06:05,410 engines might interact. 1734 01:06:05,410 --> 01:06:08,620 But really, again, the great challenges 1735 01:06:08,620 --> 01:06:10,960 here are really very much in our future. 1736 01:06:10,960 --> 01:06:13,240 Not the unforeseeable future, but the foreseeable one. 1737 01:06:13,240 --> 01:06:14,890 So help us work on it. 1738 01:06:14,890 --> 01:06:16,440 Thanks.