The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation, or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JAMES DICARLO: So let me start. I already alluded to this, but let's talk about the problem of vision. This is just one computational challenge that our brains solve, but it's one that many of us are very fascinated by. As you'll hear in the rest of the course, there are other problems that are equally fascinating. But I'm going to talk about problems of vision, and about one specific problem of vision: object recognition. I will try to operationalize that for you. And one thing you'll see as I talk is that even though our field can be motivated by words like "vision" and "object recognition," we're only going to make progress if we start to operationally define things and then decide in what domain our models are going to apply. I think that's an important lesson, and I hope it will come across in my talk.

This is the way computer vision operationally defines part of the problem of object recognition and vision: you take a scene like this, and you want to come up with an answer space that looks like this, where you have noun labels, say "car," and what are called bounding boxes around the cars, and similarly for people, or buildings, or trees, or whatever nouns you, or DARPA, or whoever, wants to label. So this is just one way of operationalizing vision. But I think it gets at the crux of what we're after, which is that there is what's called latent content in this image, content that all of us instantly bring to mind: we can say, aha, that's a car, that's a building. There are nouns that pop into our heads.
We also know other latent information about these things, like the pose of this car, the position of the car, the size of the car. The key point I'm going to make today about this problem is that that information feels obvious to us, but it's quite latent in the image; it's only implicit in the pixel representation. Those of you who have worked on this problem will understand this, and for those of you who haven't, I hope to give you some flavor of what the problem feels like.

I want to back up a bit, more from a cognitive science perspective, or a human brain perspective, and ask: why would we even bother worrying about this problem of object recognition? Maybe this is obvious, but I like to point out that we think of the representations of the tokens of what's out there in the world as being the substrates of what you might call higher-level cognition: things like memory, value judgments, decisions, and actions in the world. Imagine building a robot and having it try to act in the world when it doesn't even really know what's out there. So these representations are the substrate of those kinds of cognitive processes.

Again, from an engineering perspective, these are processes or behaviors. This is just a short list of behaviors that might depend on your ability to recognize and discriminate among different objects. If you look through this list, you can imagine things going terribly wrong if you didn't do a good job of identifying what's out there in the world. So that's just to think about it, again, as an engineer building a robot.

This is a slide I stuck in to connect to this course. I know many of you are from maybe these backgrounds, or from this background. And when I think about the brain, I have this coin here to say that these are really two sides of the same coin; we're studying the same coin from two directions.
And really the question we all have to be excited about, and I hope many of you are excited about it, is: how does the brain work? You could do computer science and not care at all about this question. I think it's a little harder to do these and not care about this question, but it's possible, I guess. So these are all trying to answer that question.

And this is maybe pretty obvious, but when you have biological brains that are performing tasks better than current computer systems, machines that humans have built, then the flow tends to go this way. You discover phenomena or constraints over here, and these lead to ideas that can be built into computer code, where you say: hey, can I build a better machine based on what we discovered over here? Many of us came into the field excited to do this, and we're still excited about that direction. But an equally important direction is that when you have systems that are matched with our abilities, or that can compute some of the things we think the brain has to compute, then the flow goes more this way: there are many possible ways to implement an idea, and these become falsifiable. That is, they can be tested against experimental data to ask which of the many ways of implementing a computation are the ones actually occurring in the brain. And that's important if, say, you want to build brain-machine interfaces, or fix diseases, or do anything that involves interacting with the brain directly.

I hope you keep this picture in mind, because I think it's the spirit of the course that both of these directions are important. It's not as if we work on this for 20 years and then work on that for 20 years; it's really the flow across them that I think is most exciting to us.

So just to connect to that, a little bit of history: where was the field on this problem of visual recognition?
I don't know if many of you have heard this, but here you are at summer school, and there was a Summer Vision Project, as it was called, at MIT. I used to think this story was apocryphal. In 1966, there was a project whose final goal was object identification: it would actually name objects by matching them against a vocabulary of known objects. So this was essentially a summer project that said, we're going to get a couple of undergraduate students together and we're going to build a recognition system, in 1966. This was the excitement of AI: we can build anything we want. And of course, as those of you who know the story are aware, the problem turned out to be much, much harder than anticipated. So sometimes problems that seem easy for us are actually quite difficult. If any of you wants it, I would be happy to share the document with you. It's interesting: the space of objects they describe includes things that, of course, I would also mention, like coffee cups on your desk, but it also includes packs of cigarettes on your desk. That dates the project; it's a little bit like Mad Men or something.

So now, here we are today. And I guess I just can't help but get excited about this really cool machine that's just amazing, that does these computations. I can't tell you how it does all this, but it has 100 billion computing elements and it solves problems not solvable by any previous machine. And the thing looks crazy, but it only requires 20 watts of power. Those of you who have seen this slide know I'm not talking about this thing; I'm talking about that thing right there. So this is the scale of what we're after. We often talk about power, and this is something engineers are especially interested in as they build these systems: how do our brains solve these problems at such a low wattage, so to speak? This is, again, the spirit of many of the things that I hope you will be excited about in the future of this field.
Here's another slide I pulled out that I often like to show. From an engineer's point of view, we often say we want to build machines that are as good as, or better than, our brain. Machines today, as you know, beat us at many things: straight calculation; chess, which happened back when I was a grad student; and recently they won at Jeopardy. At memory they've always beaten us; machines are way better than us at the simple form of memory. And in seeing, in pattern matching: go to the grocery store and ask, what did that bar code reader just do? I don't know exactly what it did, but it scans the code and somehow does pattern matching. So there are forms of vision at which machines are way better than us. But at forms of vision that are more complicated, that require generalization, like object recognition, or more broadly scene understanding, we like to think that we are still the winners. And even things we take for granted, like walking, are quite challenging problems. So engineers really want to move this over here.

Our goal is to discover how the brain solves object recognition. And the reason I put this up is that, from an engineering point of view, that doesn't just mean writing a bunch of papers and a textbook that says this part of the brain does it; it means actually helping to implement a system that is at least matched with us, and, I assume, someday will be better than us. This is also a gateway problem: even if it's just this domain, we think the systems we're studying might generalize to other domains, for instance other sensory domains. Gabriel told me you are going to do an auditory-visual comparison session later in the week.

That's the engineer's point of view: how do I just build better systems? Let's step back and talk from a scientist's point of view. This is really to introduce the talk I'm going to give you today. When you're a scientist, what's your job? We say we want to understand.
We all write that word, "understand." What does that mean? Well, what it really means, if you boil it down (and I would love to discuss this if you like), is that you have some measurements in some domain. You can think of this as a state space: this is like the positions of the planets today, and this is like the positions of the planets tomorrow. Or you could say this is the DNA sequence inside a cell, and this is some protein that's going to get made. So you're searching for mappings that are predictive from one domain to another. And we can give lots of examples of what we call successful science where that's true. This is the core of science: to predict, given some measurements or observations, what's going to happen, either in the future or in some other set of measurements. So predictive power is the core of all science and the core of understanding. I think it would be fun, if you want to debate that, to hear whether you think there's another way, but this is where I come to in thinking about this problem.

The reason I'm bringing this up is that the accuracy of this predictive mapping is a measure of the strength of any scientific field. Some fields are further along than others, and I would say ours is still not very far along. Our job is to bring it from a nonpredictive state to a very predictive state. That means building models that can be falsified and that can predict things, and you'll hear that throughout my talk. As Gabriel mentioned, what we try to do is build models that can predict either behavior or neural activity. That's what we think progress looks like.

So now let's translate this to the problem I gave you, the problem of vision, or more specifically object recognition. You can imagine there's a domain of images. Just to slow down here, so everybody's on the same page: each dot here might be all the pixels in this image, and this dot all the pixels in that image.
So there's a set of possible pixel images that you could see. And we imagine that they give rise, in the brain, to some state space. Think of this as the whole brain for now, just to fix ideas: this image, the one you're looking at, gives rise to some pattern of activity across your whole brain, and this image gives rise to a different pattern of activity across your whole brain. Loosely, we call this the neural representation of the image.

But then, when we ask you for behavioral reports, there's a mapping between that neural state space and what we measure as the output. Whether you say it or write it, you might say, that's a face, and these are both faces, if I asked you for nouns. So this is another domain of measurement.

So now you can see I'm setting up the notion of predictivity. We have this complex thing over here: images that somehow map internally into neural activity and then somehow map to the thing we call perceptual reports. And notice I've already put in the kinds of nouns that we usually associate with objects: cars, faces, dogs, cats, clocks, and so forth. Understanding this mapping in a predictive sense is really a summary of what our part of the field is about. And again, accurate predictivity is the core product of the science that underlies our ability to build a system like this, or, as many of you are interested in, to fix a system like this, or perhaps even to augment our own systems. If we want to inject signals here and have them give rise to percepts, we have to know how this works.

A big part of the field of vision has spent a lot of the last three decades working on the mapping between images and neural activity. That's usually called encoding, as in predictive encoding mechanisms, and it was driven by Hubel and Wiesel's work. People saw it as a great way forward: let's go study the neurons and try to understand what in the image is driving them. That is, what's an image-computable model that would go from images to neural responses?
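To make the phrase "predictive encoding model" concrete, here is a minimal Python sketch; the synthetic data, the array sizes, and the closed-form ridge fit are illustrative assumptions, not anything from the lecture or the lab's actual methods. The point is only that an encoding model is a fitted, image-computable mapping from images to measured neural responses, scored by prediction on held-out images.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 500 images of 256 pixels each, and one neuron's
# response to each image (simulated here for illustration only).
n_images, n_pixels = 500, 256
images = rng.standard_normal((n_images, n_pixels))
true_weights = rng.standard_normal(n_pixels)
rates = images @ true_weights + 0.5 * rng.standard_normal(n_images)

# Predictivity is always judged on images the model was never fit on.
train, test = slice(0, 400), slice(400, 500)

# Simplest possible image-computable encoding model: ridge regression from
# pixels to the neuron's response. (The real ventral-stream mapping is
# nonlinear; a linear fit is just the easiest instance of the recipe.)
lam = 10.0
X, y = images[train], rates[train]
w = np.linalg.solve(X.T @ X + lam * np.eye(n_pixels), X.T @ y)

# Correlation between predicted and observed held-out responses is one
# common score of how predictive the encoding model is.
pred = images[test] @ w
r = np.corrcoef(pred, rates[test])[0, 1]
print(f"held-out prediction r = {r:.2f}")
```

The same general recipe, with the pixel regressors swapped for the features of a richer model and the single neuron swapped for a population, is the shape of how encoding models are typically evaluated.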
The other part is that there's some linkage, we think, between the neural activity and these reports. And notice (this is actually why most of us get into neuroscience) that this arrow is two-way. This is actually quite deep. From an engineer's point of view, you say, well, there has to be some mapping between the neural activity and the button presses of my fingers, or my saying the noun out loud; there is some causal linkage between this activity and the things we observe objectively in a subject. But this is where philosophers start debating, because in some sense these are two sides of the same coin: we say of our own perception that some aspects of the internal activity are the thing we call awareness or perception. I'm not going to get into all of that, but I do want to point out that if you're just building models, you can't approach that. It's this strange relationship between neurons and these reported states that many of us are fascinated by. So this link is called predictive decoding mechanisms. For me, it's all going to be operationalized in terms of reports from humans or animals. I'll leave out the philosophical part, but I thought I'd mention it for those of you who like to think about those things.

For visual object perception, I want to point out that, again, the history of the field has mostly been on that first mapping. This second link has been neglected, or dominated by weakly predictive word models. That doesn't mean they're not useful starting points, but they're weakly predictive. An example of a weakly predictive word model would be: inferior temporal cortex, a part of the brain I'm going to tell you about today, does object recognition. That model has been around for a long time.
It is somewhat predictive, because it says that if you take that area out, all object recognition will be destroyed; that would be a prediction. It turns out that doesn't actually happen, and we can discuss that. But it doesn't tell you how the area does it, how to inject signals, or which tasks are more or less affected. That's what I mean by weakly predictive: it's a word model. "Face neurons do face tasks" is probably true to some extent, and it's a bit tighter; it says, take out these smaller regions and some set of tasks that involve faces will be affected, and I won't say anything about other tasks. So that's a somewhat more strongly predictive model, but still pretty weakly predictive. And my personal favorite, which comes in from reviewers a lot, is "attention solves that." I bring that one up just so you're on the lookout for word models that don't actually have content in terms of prediction. I don't know what that statement means; I read it as, the hand of God reaches in and solves the problem. There has to be an actual predictive model that can be falsified.

OK, I don't mean to doubt the importance of these. Before people start giving me a hard time: there are attentional phenomena, there are face neurons, there is an IT; that's what we study. I'm just trying to emphasize that we need to go beyond word models to actual testable models that make predictions, models that would still make predictions even if the person who proposed them were no longer around.

Let me try to define a domain. I said we're going to try to define things, and that's hard: vision is a big area. Object recognition I've only described vaguely, and when I say it, I include faces as objects, socially important ones; you'll hear about this from Winrich, I think. But I want to limit the problem even further, because that is still a big domain.
So we tried early on to reduce the problem even further, to something that is more naturalistic and that we think can give us more traction, in this predictive sense. We started by noting that when you take a scene like this and analyze it, you may not notice it, but your ventral stream, really your retina, has high acuity only in roughly the central 10 degrees. There is anatomy, which I'll show you later, indicating that the ventral stream is especially interested in processing the central 10 degrees of the visual field. That's about two hands at arm's length, for those of you in the room. So you may have the sense that you know what's out there in the whole scene, but you don't really; you stitch that sense together. Lots of people have shown how you stitch it together: by making rapid eye movements, called saccades, followed by fixations, which are 200 to 500 milliseconds in duration. You don't really see during the saccade itself. It's not that your brain shuts down; it's just that the movement is too fast for your retina to keep up with. So you make these rapid eye movements, and you fixate, fixate, fixate. What that does is bring a sampled version of the scene into the central 10 degrees, which might look something like this. Those are 200-millisecond snapshots along that scan path, and I'll play it for you one more time. Now, you should notice that there is one or more objects in each and every image; you probably said, oh, there's a sign, there's a person, there's a car. You might have gotten two out of each one. But you were extracting, at least intuitively it seems to me, one or more foreground or central objects from each of those images. And that ability, to do what I just showed you there, we think is the core of how you analyze or build up a scene like this, or at least how the ventral stream contributes to it.
And therefore we call that "core recognition," which I define as recognition within the central 10 degrees of the visual field, at 100 to 200 milliseconds of viewing duration. Again, it's not all of object recognition, but we think it's a good starting point.

One way we probably got into this is through the rapid serial visual presentation movies from the 70s; Molly Potter showed this really nicely. This is a movie that I've been showing for 15 years now. Notice that it's just a sequence of images in which there is typically one or more foreground objects, and you should be quickly mapping those to memory, even though I'm not telling you what to expect. Like the Leaning Tower of Pisa, right? I'm not going to tell you that you're going to see Star Wars characters; well, I just did. But you are quickly able to map those things to some noun, or even to a more precise subordinate noun: I know this is Yoda. We are very, very good at that. Notice that you didn't need a lot of pre-cueing, yet you're still able to do it. And that is really what fascinates us about vision, and about object recognition in particular: even without featural attention or pre-cueing, you're able to do a remarkable amount of processing. I think that's a great demonstration of it.

And just to quantify this for you, because sometimes people say, well, you're showing the images too briefly, and your visual system doesn't do much in that time: here's an eight-way categorization task, which I'll show you more of later, run under a range of transformations. These are just example images of the eight different categories of objects. It doesn't much matter what I do here; you get a very similar curve. And that is: you get most of the performance gain in about the first 100 milliseconds. This axis is accuracy, and you're about 85% correct. This is a challenging task, as I'll show you later; it looks easy here, but it's quite challenging. From 85% correct, if I let you look at the image longer, up to two seconds, you can bump up to around 90%.
So there is some gain with longer viewing duration, but chance here is 50%, so you already get this huge ability in that first glimpse. And we're not the first to show this. This is just to show you, in our own kind of task, with the data I'm going to tell you about, where we show the image for 100 or 200 milliseconds (the typical primate viewing duration that I pin this on; we use it for reasons of efficiency), that performance is similar across that range. You get a lot done. Your visual system does a lot of work in that first glimpse. And that is the core recognition that we are trying to study here. I know it's not all of object recognition, or all of vision, but it is now, we think, a much more defined domain that we can make progress on. That's what we've been working on, and that's essentially what I'm going to talk about today. So think of vision and object recognition within that setting of core recognition.

This is David Marr. David and Tommy Poggio, whom I studied with for a long time; Tommy wrote the introduction to David's book, Vision. If you haven't read it, do you guys know this book? It's really a classic book in our field. The first couple of chapters are the part you should really read; that's the best part of the book. And one of the things you take from this book, which I think David and Tommy helped lay out a long time ago, is that there is this challenge of levels.

One of the things I take from it is that they tried to define three clean levels. It turns out not to be that clean in practice, but there is one level called computational theory: what is the goal of the computation, why is it appropriate, and by what logic or strategy can it be carried out? There's another level which is, once you've decided that, how should you represent the data, and how can you implement an algorithm to do it? And then there's the level of how you actually run it: how do you build it in hardware?
Neuroscientists often come in at that hardware level. They say, I'm going to study neurons, and it's a bit like jumping into your iPhone and saying, I'm going to study transistors. They tend to start at the hardware level: oh wait, there's something going on here, these transistors are firing. And you would make up some story about it, whether you were recording from the brain or measuring the transistors in my iPhone. But I think the important point to take from this is that it helps to start by thinking about what the point of the system is. What might it be doing? How might you solve that problem? That leads you to the algorithm, and then you think about representations. So it's a top-down approach, rather than just digging into the brain and hoping that the answers will emerge.

I'm going to try to give you that top-down approach to the problem I'm talking about. I've already given you a bit of it by introducing you to the problem; I'll say a little bit more about that and then step down a little bit this way. This kind of thinking, I believe, is important to making progress on how the brain computes things.

So here's a related slide that I made a long time ago and pulled out again for you, which I think helps bridge between what I just said about Marr's levels of analysis and whether you're a neuroscientist or cognitive scientist, or a computer vision or machine learning person. The first question is: what is the problem we're trying to solve? That's Marr's level one, computational theory. Operationally, you'll hear folks in machine learning say, well, there are benchmarks, and that's good; there's the ImageNet challenge, or whatever challenge they want to solve. Sometimes they'll say, well, the brain solves it. That's not good, because they haven't really defined the problem.
Neuroscientists will say, well, it's something like perception, or behavior; there's some sort of behavior that they have imagined, although characterizing that behavior is not usually their primary goal. But I think there is at least some progress in that regard.

Now, what does a solution look like? This is really just a matter of language. Machine learning people talk about useful image representations, what we might call features, while neuroscientists talk about explicit neuronal spiking populations. You heard this in Haim's talk; he was using these words interchangeably. Again, this may be obvious to you, but I thought it was worth going through. This is Marr level two: representation.

How do we instantiate these solutions? This is still level two: algorithms, or mechanisms, that actually build useful feature representations. Neuroscientists think about the neuronal wiring and weighting patterns that actually execute those algorithms. That, we think, is the bridging language there.

And then there's a deeper level, which came up in the questions: how would you construct the system from the beginning? Learning rules, initial conditions, and training images are the words used here; there is a learning machine. Neuroscientists talk about plasticity, architecture, and experience. Again, those are similar questions, just in different language. I'm going through this because I think the spirit of this course is to try to build these links at all of these different levels.

OK, so hopefully that helps orient you to how we think about it. Now let me go to number one: what is the problem we're trying to solve, and why is it hard? I said object recognition is hard, and I showed you that MIT challenge and how difficult it turned out to be. Maybe it's hard because there are lots of objects. Who thinks that's why it's hard? Who thinks that's not why it's hard?
You think computers can list a bunch of objects? It's easy, right? This is what I said about memory: it's a big, long list of stuff, and computers are good at that. There may be thousands of objects, but a list of objects is not a hard thing for a machine to handle. What's hard is that each object can produce an essentially infinite number of images. So you somehow have to be able to take some samples, certain views or poses of an object (this is a car under different poses), and be able to generalize, to predict what the car might look like from another view.

This is what's called the invariance problem, and it arises because there is identity-preserving image variation. This is why the bar code reader in your supermarket works fine: the code is always laid out very simply. But when you have to generalize across a bunch of conditions, potentially things like background clutter, or, even more severely, occlusion, things you heard about from Gabriel, or when you even want to generalize across the class of cars, where individual cars have slightly different geometry but are still cars, these kinds of generalizations are what make the problem hard. So I'm lumping them all together into what we call the invariance problem. Many of you in the room know this is the hard problem, and I hope that fixes the idea of what you should be thinking about: it's not the number of objects, it's the fact that you have to deal with that invariance problem.

Haim was talking about manifolds, and this is my version of that. This is to introduce you to what the invariance problem looks like, or feels like. I'm not going to give you math on how to solve it; it's just a geometric feel for the problem. So imagine you're a camera, or your retina, capturing an image of an object. Let's call this a person; I think I called him Joe.
So when you see this image of Joe: this is the retina, so this is a state space of what's going on in your retina. There are about a million retinal ganglion cells; think of each as giving an analog value, so this is a million-dimensional state space. When you see this image of Joe, he activates every retinal ganglion cell, some a lot, some a little, so he is some point in that million-dimensional space. OK, everybody with me? If everybody has heard all this before and wants me to go on, wave your hand and I'll move on.

AUDIENCE: No, it's good.

JAMES DICARLO: Keep going, OK. So the basic idea is that if Joe undergoes a transformation, like a change in pose, that is only one degree of freedom: under the hood, I'm turning one of those latent variables. If I had a graphics engine, I would be changing the pose latent variable. It's only one knob that I'm turning, so to speak. And that means there's one curve traced through this space as Joe projects into these different images. I'm ignoring noise and such; this is just the deterministic mapping onto the retinal ganglion cells. So Joe goes over here, and if I turn the other knob, he goes over there. You can imagine that if I turned those two knobs, two axes of pose, through all possible values and plotted the results in the million-dimensional state space, there would be a curved-up sheet of points, which you can think of as Joe's identity manifold over those two degrees of view change. It's only two dimensions because it's hard to show more than that, but it's this curved-up sheet of points. Everybody with me so far? You don't actually get to see all of those images. You could imagine a machine rendering them all, but you only ever get samples of them. Still, there is some underlying manifold structure there.
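As a toy illustration of that geometry, here is a short Python sketch; the two "knobs," the random sinusoidal rendering, and every dimension and number in it are made-up stand-ins rather than anything from the lecture. It sweeps two pose-like latent variables, pushes them through a fixed nonlinear map into a high-dimensional response space, and checks that the resulting cloud is a curved two-dimensional sheet rather than a flat plane.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two latent "knobs" for one object (say, azimuth and elevation of pose).
azimuth = np.linspace(-1.0, 1.0, 30)
elevation = np.linspace(-1.0, 1.0, 30)
az, el = np.meshgrid(azimuth, elevation)
latents = np.column_stack([az.ravel(), el.ravel()])       # (900, 2)

# A fixed, nonlinear "rendering" from the 2 latents into a 1000-dimensional
# stand-in for retinal ganglion cell responses. The sine nonlinearity is an
# arbitrary choice; the only point is that the map is smooth and nonlinear.
n_dims = 1000
proj = rng.standard_normal((2, n_dims))
phase = rng.uniform(0, 2 * np.pi, n_dims)
responses = np.sin(latents @ proj + phase)                # (900, 1000)

# Only 2 degrees of freedom were varied, so the 900 points lie on a 2-D
# sheet ("Joe's identity manifold") curled up inside the 1000-D space.
centered = responses - responses.mean(axis=0)
svals = np.linalg.svd(centered, compute_uv=False)
var_explained = (svals**2) / (svals**2).sum()
print("variance captured by top 5 linear directions:",
      np.round(var_explained[:5], 3))
# The spectrum is not concentrated in exactly 2 directions: because the
# sheet is curved, it spreads across many linear dimensions even though
# its intrinsic dimensionality is only 2.
```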
Now, what's interesting and important to point out is that this thing, even though I've drawn it as a little curve, is highly complicated in this native pixel space. It's all curved up and bending all over the place. And the reason that matters, and this is what Haim introduced you to, is that if you want to be able to separate Joe from another object, say not-Joe, another person, then you need a representation. I showed you retinal ganglion cells; here is another imaginary state space where you can take simple tools and extract the information. The simple tools that we like to use are linear classifiers, but you can use other simple tools. Haim gave you the exact same description in his talk: you have some linear decoder on the state space that can cleanly separate Joe from not-Joe. So in a good space, these manifolds are nicely separated by a separating hyperplane. That's what these tools tend to do: they like to cut planes, or they want to find compact regions in the space, depending on what kind of tool you use. What you don't want is a tool that has to do all kinds of complicated tracing through the space; that's basically the original problem itself. So what you need is a simple toolbox, which we think of as downstream neurons. A linear classifier, as an approximation, is like a dot product: a weighted sum, which is what we neuroscientists think of downstream neurons as doing. So if we want an explicit representation in some neural state space, we need to be able to take weighted sums of some population's responses and separate Joe from not-Joe, and Sam from Jill, and everything from everything else that we want to separate. If we had such a space of neural population activity, we would call that a good set of features, or an explicit representation of object shape.
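Here is an equally small sketch of that readout idea; the population size, the synthetic responses, and the least-squares fit are illustrative assumptions, not the lab's actual analysis. The "downstream neuron" is literally a learned weight vector: a weighted sum of the population followed by a threshold, fit from a small number of labeled examples.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical population responses: 200 "images" of Joe and 200 of not-Joe,
# each represented by the activity of 500 neurons in some downstream area.
# In this toy space the two classes are, by construction, linearly separable.
n_per_class, n_neurons = 200, 500
offset = rng.standard_normal(n_neurons) * 0.5
joe = rng.standard_normal((n_per_class, n_neurons)) + offset
not_joe = rng.standard_normal((n_per_class, n_neurons)) - offset
X = np.vstack([joe, not_joe])
y = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])

# "Downstream neuron" readout: a weighted sum plus a threshold, i.e. a
# linear classifier. The weights come from a least-squares fit on a small
# number of labeled examples, the kind of data-efficiency discussed above.
n_train = 20
idx = rng.permutation(len(y))
train, test = idx[:n_train], idx[n_train:]
w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

decision = np.sign(X[test] @ w)           # weighted sum, then threshold
accuracy = (decision == y[test]).mean()
print(f"accuracy with only {n_train} training examples: {accuracy:.2f}")
```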
769 00:27:21,050 --> 00:27:22,880 And for any aficionados here, it's 770 00:27:22,880 --> 00:27:26,060 not just cleanly linear separation, 771 00:27:26,060 --> 00:27:28,040 it's actually being able to find this 772 00:27:28,040 --> 00:27:29,780 with a low number of training examples. 773 00:27:29,780 --> 00:27:32,030 So that turns out to be important. 774 00:27:32,030 --> 00:27:35,420 But it helps to fix ideas to think about linear separation, 775 00:27:35,420 --> 00:27:37,770 ideally with a low number of training examples. 776 00:27:37,770 --> 00:27:40,340 So that's a good representation. 777 00:27:40,340 --> 00:27:43,870 And notice, I'm starting to mix up terms here. 778 00:27:43,870 --> 00:27:45,620 I am assuming, when I talk about shape, 779 00:27:45,620 --> 00:27:47,600 that that will map cleanly to identity, 780 00:27:47,600 --> 00:27:49,610 or what you might call broadly, category. 781 00:27:49,610 --> 00:27:52,460 That's another topic I won't talk about, if you just 782 00:27:52,460 --> 00:27:55,790 think about the shape of Joe, or separating one geometry 783 00:27:55,790 --> 00:27:57,629 from another. 784 00:27:57,629 --> 00:28:00,170 Now, here's a simulation that my first graduate student, Dave 785 00:28:00,170 --> 00:28:01,850 Cox, who's now at Harvard, did. 786 00:28:01,850 --> 00:28:03,230 This is a number of years old. 787 00:28:03,230 --> 00:28:06,140 This takes these two face objects, render them 788 00:28:06,140 --> 00:28:08,250 under changes, and view. 789 00:28:08,250 --> 00:28:12,830 And then he actually simulated the manifolds 790 00:28:12,830 --> 00:28:15,780 in a 14,000 dimensional space. 791 00:28:15,780 --> 00:28:17,270 And then he wanted to visualize it. 792 00:28:17,270 --> 00:28:18,770 And because we wanted to try to make 793 00:28:18,770 --> 00:28:22,250 the point that these manifolds of these two objects 794 00:28:22,250 --> 00:28:24,355 are highly curved and highly tangled, 795 00:28:24,355 --> 00:28:25,730 this is a three dimensional view. 796 00:28:25,730 --> 00:28:28,220 Remember, it's sitting on a 14,000 dimensional simulation 797 00:28:28,220 --> 00:28:29,240 space. 798 00:28:29,240 --> 00:28:30,770 You can't view that space. 799 00:28:30,770 --> 00:28:32,600 This is a three dimensional view of it. 800 00:28:32,600 --> 00:28:35,420 And the point is that it's like two sheets of paper 801 00:28:35,420 --> 00:28:39,470 being all crumpled up together and they're not fused. 802 00:28:39,470 --> 00:28:41,720 They look fused here because it's in three dimensions. 803 00:28:41,720 --> 00:28:44,690 But they're not actually fused. 804 00:28:44,690 --> 00:28:46,730 But they're complicated, you can't easily 805 00:28:46,730 --> 00:28:50,330 find a separating hyperplane to separate these two objects. 806 00:28:50,330 --> 00:28:53,150 We call these tangled object manifolds. 807 00:28:53,150 --> 00:28:56,550 And really, they're tangled due to image variation. 808 00:28:56,550 --> 00:28:59,050 Remember, if I didn't change those knobs of view or position 809 00:28:59,050 --> 00:29:01,520 or scale, there would just be two points in the space 810 00:29:01,520 --> 00:29:02,400 and it would be easy. 811 00:29:02,400 --> 00:29:04,192 That's the easy problem of listing objects. 812 00:29:04,192 --> 00:29:06,358 But if they have to undergo all this transformation, 813 00:29:06,358 --> 00:29:08,060 they become these complicated structures 814 00:29:08,060 --> 00:29:10,440 that need to be untangled from each other. 
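A toy version of the tangling point (a stand-in, not the original simulation): let each object trace out a closed curve in a high-dimensional, pixel-like space as a view knob is turned, with one curve nested inside the other. In that raw space no hyperplane separates them cleanly, a purely linear re-encoding (a rotation) doesn't help, and a single nonlinear feature does. All dimensions, noise levels, and the embedding below are made up for illustration.

```python
# Toy construction of two "tangled object manifolds" (not the original simulation):
# each object traces a ring in a high-dimensional pixel-like space as a view knob
# turns, and one ring is nested inside the other.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_dims, n_views = 200, 400
theta = np.linspace(0, 2 * np.pi, n_views)             # one view "knob"
basis = np.linalg.qr(rng.normal(size=(n_dims, 2)))[0]   # a random 2D image subspace

def manifold(radius):
    """Points traced out by one object as the view knob is swept."""
    ring = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])
    return ring @ basis.T + 0.05 * rng.normal(size=(n_views, n_dims))

X = np.vstack([manifold(1.0), manifold(2.0)])           # "Joe" and "not Joe"
y = np.array([1] * n_views + [0] * n_views)

raw = LinearSVC(max_iter=20000).fit(X, y).score(X, y)              # tangled: poor
Q = np.linalg.qr(rng.normal(size=(n_dims, n_dims)))[0]             # a linear re-encoding
rot = LinearSVC(max_iter=20000).fit(X @ Q, y).score(X @ Q, y)      # still poor
feat = np.column_stack([X, (X ** 2).sum(axis=1)])                  # one nonlinear feature
untangled = LinearSVC(max_iter=20000).fit(feat, y).score(feat, y)  # now separable
print(f"raw: {raw:.2f}  rotated: {rot:.2f}  with nonlinear feature: {untangled:.2f}")
```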
815 00:29:10,440 --> 00:29:12,620 So the problem that's being solved 816 00:29:12,620 --> 00:29:14,510 is, you have this retina sampling data, 817 00:29:14,510 --> 00:29:16,699 like a camera on the front end, where things look 818 00:29:16,699 --> 00:29:18,740 complicated with respect to the latent variables, 819 00:29:18,740 --> 00:29:21,885 in this case shape or identity, Sam or Joe. 820 00:29:21,885 --> 00:29:24,260 And that they somehow are transformed, as Haim mentioned, 821 00:29:24,260 --> 00:29:26,810 they're transformed by some non-linear transformation, 822 00:29:26,810 --> 00:29:30,440 some other neural population state space, shown here, where 823 00:29:30,440 --> 00:29:31,770 the things look more like this. 824 00:29:31,770 --> 00:29:34,340 The latent variable structure is more explicit, 825 00:29:34,340 --> 00:29:37,160 that you can easily take things like separating hyperplanes 826 00:29:37,160 --> 00:29:39,410 to identify things like shape, which again, roughly 827 00:29:39,410 --> 00:29:41,960 corresponds to identity or other latent parameters, 828 00:29:41,960 --> 00:29:42,980 like position and scale. 829 00:29:42,980 --> 00:29:44,990 You maybe haven't thrown away all these other latent 830 00:29:44,990 --> 00:29:45,570 parameters. 831 00:29:45,570 --> 00:29:47,611 And if I have time, I'll say something about that 832 00:29:47,611 --> 00:29:49,200 so you don't just get identity. 833 00:29:49,200 --> 00:29:50,810 But if you can untangle this, you 834 00:29:50,810 --> 00:29:52,880 would have a very nice representation with regard 835 00:29:52,880 --> 00:29:54,170 to those originally latent parameters. 836 00:29:54,170 --> 00:29:55,920 That's the dream of what you'd like to do. 837 00:29:55,920 --> 00:29:59,390 It's like reverse graphics, if you will. 838 00:29:59,390 --> 00:30:02,360 So this is what we call an untangled explicit object 839 00:30:02,360 --> 00:30:02,990 information. 840 00:30:02,990 --> 00:30:04,680 And we think it lives somewhere in the brain, 841 00:30:04,680 --> 00:30:05,679 at least to some degree. 842 00:30:05,679 --> 00:30:07,950 And I'll show you the evidence for that later on. 843 00:30:07,950 --> 00:30:10,460 So what you have then is you have a poor encoding basis, 844 00:30:10,460 --> 00:30:11,487 the pixel space. 845 00:30:11,487 --> 00:30:13,820 And somewhere in the brain is a powerful encoding basis, 846 00:30:13,820 --> 00:30:15,540 a good set of features. 847 00:30:15,540 --> 00:30:17,270 And as Haim mentioned, as I already said, 848 00:30:17,270 --> 00:30:19,400 this must be a non-linear transformation 849 00:30:19,400 --> 00:30:21,191 because the linear transformations are just 850 00:30:21,191 --> 00:30:23,400 rotations of that original space. 851 00:30:23,400 --> 00:30:25,337 So now let's go down to-- actually this 852 00:30:25,337 --> 00:30:26,420 would be Marr level three. 853 00:30:26,420 --> 00:30:27,666 Let's go to instantiation. 854 00:30:27,666 --> 00:30:29,040 Let's get into the hardware here. 855 00:30:29,040 --> 00:30:30,320 We're supposed to be talking about brains. 856 00:30:30,320 --> 00:30:32,450 So I'm going to give you a tour of the ventral stream. 857 00:30:32,450 --> 00:30:34,640 So we would love to know how this brain solves it. 858 00:30:34,640 --> 00:30:36,904 This is the human brain. 859 00:30:36,904 --> 00:30:38,070 This is a non-human primate. 860 00:30:38,070 --> 00:30:39,230 This is not shown to scale. 
861 00:30:39,230 --> 00:30:40,605 This is blown up to show you it's 862 00:30:40,605 --> 00:30:42,920 a similar structure, temporal lobe, frontal lobes, 863 00:30:42,920 --> 00:30:43,992 occipital lobe. 864 00:30:43,992 --> 00:30:45,200 There is a non-human primate. 865 00:30:45,200 --> 00:30:48,240 We like this model for a number of reasons. 866 00:30:48,240 --> 00:30:50,040 One reason that we like it is that they 867 00:30:50,040 --> 00:30:51,710 are very visual creatures, their acuity 868 00:30:51,710 --> 00:30:53,070 is very well matched to ours. 869 00:30:53,070 --> 00:30:55,280 In fact, even their object recognition abilities 870 00:30:55,280 --> 00:30:57,144 are actually quite similar to our own. 871 00:30:57,144 --> 00:30:59,060 This may be surprising to you, but let me just 872 00:30:59,060 --> 00:31:01,310 show you some data for that. 873 00:31:01,310 --> 00:31:05,909 This is actually data from Rishi Rajalingham, in my lab. 874 00:31:05,909 --> 00:31:07,700 The slide says 'in press,' but this just came out. 875 00:31:07,700 --> 00:31:09,620 These are the confusion matrix patterns 876 00:31:09,620 --> 00:31:11,720 of humans trying to discriminate different objects 877 00:31:11,720 --> 00:31:14,649 under those transformations that I showed you earlier, 878 00:31:14,649 --> 00:31:16,190 where they're not just seeing images, 879 00:31:16,190 --> 00:31:18,590 but they have to deal with these invariances. 880 00:31:18,590 --> 00:31:22,250 And this is rhesus monkey data from the same task. 881 00:31:22,250 --> 00:31:24,140 And the task goes, I'll give you a test image 882 00:31:24,140 --> 00:31:25,190 and then you get choice images. 883 00:31:25,190 --> 00:31:26,171 Was it a car or a dog? 884 00:31:26,171 --> 00:31:28,670 I'll show you an image; which choice was it, a dog or a tree? 885 00:31:28,670 --> 00:31:31,500 And you're trying to entertain many objects all at once, 886 00:31:31,500 --> 00:31:34,482 and you get an image under some unpredictable view 887 00:31:34,482 --> 00:31:36,065 and unpredictable background, and then 888 00:31:36,065 --> 00:31:37,148 you have to make a choice. 889 00:31:37,148 --> 00:31:39,510 So this shows the confusion difficulty. 890 00:31:39,510 --> 00:31:42,020 And when you look at this, it's intuitive 891 00:31:42,020 --> 00:31:43,960 that the confused objects are geometrically similar. 892 00:31:43,960 --> 00:31:48,230 Camel is confused with dog, and tank is confused with truck, 893 00:31:48,230 --> 00:31:50,180 and that's true of both monkeys and humans. 894 00:31:50,180 --> 00:31:54,377 And to some level, this shouldn't be surprising to you. 895 00:31:54,377 --> 00:31:56,210 The same tasks that are difficult for humans 896 00:31:56,210 --> 00:31:58,550 are difficult for monkeys because probably they 897 00:31:58,550 --> 00:32:03,000 share very similar processing structures. 898 00:32:03,000 --> 00:32:05,000 They don't have to bring in a bunch of knowledge 899 00:32:05,000 --> 00:32:08,370 about tanks being driven by people or anything like that, they just have to say, 900 00:32:08,370 --> 00:32:09,590 was there a tank or a truck. 901 00:32:09,590 --> 00:32:12,048 And under those conditions, they make very similar patterns 902 00:32:12,048 --> 00:32:12,920 of confusion. 903 00:32:12,920 --> 00:32:15,140 And these patterns are very different from those 904 00:32:15,140 --> 00:32:16,850 that you get when you run classifiers 905 00:32:16,850 --> 00:32:20,680 on pixels or low level visual simulations.
906 00:32:20,680 --> 00:32:22,680 But they're very similar to each other, in fact, 907 00:32:22,680 --> 00:32:24,180 they're statistically indistinguishable, 908 00:32:24,180 --> 00:32:27,300 monkeys and humans, on these kinds of patterns of confusion. 909 00:32:27,300 --> 00:32:31,440 OK, so that's one reason we like this subject, the monkey model, 910 00:32:31,440 --> 00:32:34,551 is that the behavior is very well matched to the humans. 911 00:32:34,551 --> 00:32:37,050 The other reason is that we know from a lot of previous work 912 00:32:37,050 --> 00:32:40,470 that I alluded to, that some studies have shown that lesions 913 00:32:40,470 --> 00:32:43,290 in these parts of the brain can lead to deficits in recognition 914 00:32:43,290 --> 00:32:44,040 tasks. 915 00:32:44,040 --> 00:32:47,870 So again, we think the ventral stream solves recognition. 916 00:32:47,870 --> 00:32:50,370 So we have a weak word model of where to look, 917 00:32:50,370 --> 00:32:53,070 we just don't know exactly what's going on there. 918 00:32:53,070 --> 00:32:55,170 Just to orient you, these ventral 919 00:32:55,170 --> 00:32:59,010 areas, V1, V2, V4, and inferotemporal cortex, or IT cortex-- 920 00:32:59,010 --> 00:33:01,560 IT projects anatomically to the frontal lobe 921 00:33:01,560 --> 00:33:03,570 to regions involved in decision and action, 922 00:33:03,570 --> 00:33:05,986 and around the bend to the medial temporal lobe to regions 923 00:33:05,986 --> 00:33:08,642 involved in formation of long-term memory. 924 00:33:08,642 --> 00:33:10,350 Because these are monkeys and not humans, 925 00:33:10,350 --> 00:33:12,840 and Gabriel mentioned this in his talk, we can go in 926 00:33:12,840 --> 00:33:14,430 and we can record from their brains, 927 00:33:14,430 --> 00:33:16,860 and we can perturb neural activity in their brains 928 00:33:16,860 --> 00:33:17,380 directly. 929 00:33:17,380 --> 00:33:18,600 And we can do that in a systematic way. 930 00:33:18,600 --> 00:33:20,725 This is the advantage of an animal model as opposed 931 00:33:20,725 --> 00:33:21,900 to a human model. 932 00:33:21,900 --> 00:33:24,090 OK, as neuroscientists now, we've 933 00:33:24,090 --> 00:33:26,190 taken a problem, translated it to behavior, 934 00:33:26,190 --> 00:33:28,477 taken that behavior into a species we can study, 935 00:33:28,477 --> 00:33:30,060 we know roughly where to look, and now 936 00:33:30,060 --> 00:33:32,140 we want to try to understand what's going on. 937 00:33:32,140 --> 00:33:35,135 So as engineers, we take these curled up sheets of cortex 938 00:33:35,135 --> 00:33:37,260 and think of them, as I've already been showing you, 939 00:33:37,260 --> 00:33:39,289 as populations of neurons. 940 00:33:39,289 --> 00:33:41,580 So there's millions of neurons on each of these sheets. 941 00:33:41,580 --> 00:33:43,710 I'll give you numbers on a slide coming up. 942 00:33:43,710 --> 00:33:46,130 There's some sort of processing that may be common here, 943 00:33:46,130 --> 00:33:47,546 I put these T's in, there might be 944 00:33:47,546 --> 00:33:50,850 some common cortical algorithm processing forward this way. 945 00:33:50,850 --> 00:33:52,720 There's also inter-cortical processing. 946 00:33:52,720 --> 00:33:55,180 And there's also some feedback processing going on in here. 947 00:33:55,180 --> 00:33:57,570 So all that's schematically illustrated in this slide 948 00:33:57,570 --> 00:33:58,680 that I'll keep bringing up here when 949 00:33:58,680 --> 00:34:01,140 we talk about these different levels of the ventral stream.
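For readers who want the human-monkey comparison described a moment ago in concrete terms: it boils down to correlating the error patterns of two confusion matrices. Here is a toy version with made-up numbers (not the Rajalingham data, and not the actual statistics used in that study).

```python
# Toy comparison of two observers' confusion patterns (invented numbers).
import numpy as np

rng = np.random.default_rng(0)
objects = ["camel", "dog", "tank", "truck", "tree", "car", "plane", "face"]
n = len(objects)

def toy_confusions(shared, noise, rng):
    """Row-normalized confusion matrix = shared difficulty structure + observer noise."""
    m = shared + noise * rng.random((n, n))
    np.fill_diagonal(m, 5.0)                 # mostly correct choices on the diagonal
    return m / m.sum(axis=1, keepdims=True)

shared = rng.random((n, n))                  # e.g., camel<->dog, tank<->truck hard pairs
human  = toy_confusions(shared, 0.3, rng)
monkey = toy_confusions(shared, 0.3, rng)

off_diag = ~np.eye(n, dtype=bool)            # compare only the patterns of errors
r = np.corrcoef(human[off_diag], monkey[off_diag])[0, 1]
print(f"human-monkey confusion pattern correlation: {r:.2f}")
```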
950 00:34:01,140 --> 00:34:03,348 Now I'm mostly going to be talking about IT cortex here 951 00:34:03,348 --> 00:34:04,200 at the end. 952 00:34:04,200 --> 00:34:05,747 Why do we call these different areas? 953 00:34:05,747 --> 00:34:07,830 One reason is that there's a complete retinotopic 954 00:34:07,830 --> 00:34:10,080 map, a map of the whole visual space, in each 955 00:34:10,080 --> 00:34:11,312 of these different levels. 956 00:34:11,312 --> 00:34:12,270 In the retina, there's one. 957 00:34:12,270 --> 00:34:14,264 In the LGN-- in the thalamus-- there's another. 958 00:34:14,264 --> 00:34:15,389 In V1, there's another map. 959 00:34:15,389 --> 00:34:16,380 In V2, there's another map. 960 00:34:16,380 --> 00:34:17,610 In V4, there's another map. 961 00:34:17,610 --> 00:34:20,580 In IT, it's less clear that it's retinotopic, 962 00:34:20,580 --> 00:34:23,670 and we're not even sure that IT is one area. 963 00:34:23,670 --> 00:34:27,260 Maybe, if we have time, I'll say more about that detail. 964 00:34:27,260 --> 00:34:29,340 So it's not that retinotopic in IT, 965 00:34:29,340 --> 00:34:32,280 except in the most posterior parts of IT. 966 00:34:32,280 --> 00:34:34,440 But that's why neuroscientists divide these 967 00:34:34,440 --> 00:34:36,310 into different areas. 968 00:34:36,310 --> 00:34:38,610 So a key concept, though, for you computationally is, 969 00:34:38,610 --> 00:34:41,250 think of each of these as a population representation 970 00:34:41,250 --> 00:34:44,489 that's re-transforming the data from that complicated space 971 00:34:44,489 --> 00:34:46,320 to some nicer space. 972 00:34:46,320 --> 00:34:49,830 And it's doing this probably in a stepwise, gradual manner. 973 00:34:49,830 --> 00:34:52,136 So IT is believed to be that powerful encoding 974 00:34:52,136 --> 00:34:53,719 basis that I alluded to earlier, where 975 00:34:53,719 --> 00:34:56,024 you have these nice flattened object manifolds. 976 00:34:56,024 --> 00:34:57,690 And I'll show you the evidence for that. 977 00:35:00,590 --> 00:35:02,910 This is from a recent review I did that gives 978 00:35:02,910 --> 00:35:04,230 more numbers on these things. 979 00:35:04,230 --> 00:35:06,210 And I've sized the areas according 980 00:35:06,210 --> 00:35:08,920 to their relative cortical area in the monkey. 981 00:35:08,920 --> 00:35:11,220 Here's V1, V2, V4, IT. 982 00:35:11,220 --> 00:35:13,080 IT is a complex of areas. 983 00:35:13,080 --> 00:35:15,570 And I'm showing you these latencies. 984 00:35:15,570 --> 00:35:19,552 These are the average latencies in 985 00:35:19,552 --> 00:35:20,760 these different visual areas. 986 00:35:20,760 --> 00:35:22,350 You can see, it's about 50 milliseconds 987 00:35:22,350 --> 00:35:23,766 from when an image hits the retina 988 00:35:23,766 --> 00:35:25,110 until you get activity in V1. 989 00:35:25,110 --> 00:35:27,942 60 in V2, 70 in V4-- there's about a 10 millisecond step 990 00:35:27,942 --> 00:35:29,150 across these different areas. 991 00:35:29,150 --> 00:35:32,280 So there's about a 100 millisecond lag between when an image hits here 992 00:35:32,280 --> 00:35:34,800 and when you start to see changes in activity at this level 993 00:35:34,800 --> 00:35:36,750 up here that I'm referring to. 994 00:35:36,750 --> 00:35:40,440 When I say IT, I'm referring to AIT and CIT together. 995 00:35:40,440 --> 00:35:43,241 That's my usage of the word IT for the aficionados 996 00:35:43,241 --> 00:35:43,740 in the room.
997 00:35:43,740 --> 00:35:46,680 And that's about 10 million output neurons in IT 998 00:35:46,680 --> 00:35:48,180 just to fix numbers. 999 00:35:48,180 --> 00:35:50,650 In V1 here, you have like 37 million output neurons. 1000 00:35:50,650 --> 00:35:54,460 There's about 200 million neurons in V1, similar in V2. 1001 00:35:54,460 --> 00:35:56,460 And many of you probably heard about other parts 1002 00:35:56,460 --> 00:35:58,040 of the visual system. 1003 00:35:58,040 --> 00:36:00,970 Here's MT, many of you probably heard about MT. 1004 00:36:00,970 --> 00:36:03,650 So you can see it's tiny compared to some of these areas 1005 00:36:03,650 --> 00:36:05,195 that I'm talking about here. 1006 00:36:05,195 --> 00:36:06,820 I'm going to show you some neural dam-- 1007 00:36:06,820 --> 00:36:07,950 I'm just going to give you a brief tour 1008 00:36:07,950 --> 00:36:10,830 of these different areas, so brief, it's almost cartoonish. 1009 00:36:10,830 --> 00:36:13,026 But at least those of you who haven't seen this 1010 00:36:13,026 --> 00:36:14,150 should at least be exposed. 1011 00:36:14,150 --> 00:36:15,372 So in the retina-- 1012 00:36:15,372 --> 00:36:16,830 you guys know in the retina there's 1013 00:36:16,830 --> 00:36:18,862 a bunch of cell layers in the retina. 1014 00:36:18,862 --> 00:36:20,320 The retina is a complicated device. 1015 00:36:20,320 --> 00:36:22,320 I think of it as a beautiful camera. 1016 00:36:22,320 --> 00:36:23,904 So you're down in the retina. 1017 00:36:23,904 --> 00:36:25,320 To me, the key thing in the retina 1018 00:36:25,320 --> 00:36:27,120 is in the end you've got some cells that are going to project 1019 00:36:27,120 --> 00:36:28,784 back along the optic nerve. 1020 00:36:28,784 --> 00:36:30,450 So these are the retinal ganglion cells, 1021 00:36:30,450 --> 00:36:31,530 they actually live on the surface. 1022 00:36:31,530 --> 00:36:33,613 The light comes through, photo receptors are here, 1023 00:36:33,613 --> 00:36:35,850 there is processing in these intermediate layers, 1024 00:36:35,850 --> 00:36:38,370 and then there's a bunch of retinal ganglion cell types. 1025 00:36:38,370 --> 00:36:40,590 There's thought to be about 20 types or so. 1026 00:36:40,590 --> 00:36:42,780 The original physiology, there are 1027 00:36:42,780 --> 00:36:45,480 two functional central types where they 1028 00:36:45,480 --> 00:36:47,506 have on center or off center. 1029 00:36:47,506 --> 00:36:49,380 Let's take an on center cell, you shine light 1030 00:36:49,380 --> 00:36:51,360 in the middle of a spot-- now this 1031 00:36:51,360 --> 00:36:52,950 is a tiny little spot on the retina, 1032 00:36:52,950 --> 00:36:56,127 the size depends on where you are in the visual field. 1033 00:36:56,127 --> 00:36:58,210 But you shine a little bit of light in the center, 1034 00:36:58,210 --> 00:36:59,084 the response goes up. 1035 00:36:59,084 --> 00:37:00,499 See the spike rate going up here. 1036 00:37:00,499 --> 00:37:02,790 Put light in the surround, the response rate goes down. 1037 00:37:02,790 --> 00:37:06,270 So it has an on center, off surround profile. 1038 00:37:06,270 --> 00:37:08,610 And then there's a flip type here. 1039 00:37:08,610 --> 00:37:10,152 So that's the basic functional type. 1040 00:37:10,152 --> 00:37:11,610 When you think about the retina, it 1041 00:37:11,610 --> 00:37:13,920 is tiled with all of these point detectors that 1042 00:37:13,920 --> 00:37:16,090 have some nice center surround effects. 
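The on-center, off-surround behavior just described is often caricatured as a difference-of-Gaussians filter. Here is a small sketch of that caricature with made-up parameters (illustrative only, not a fit to real retinal ganglion cell data).

```python
# Toy model of an on-center / off-surround retinal ganglion cell: a difference-of-
# Gaussians receptive field applied at one spot of the image (parameters invented).
import numpy as np

def gaussian(size, sigma):
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    g = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return g / g.sum()

size = 21
dog = gaussian(size, sigma=1.5) - gaussian(size, sigma=4.0)   # center minus surround

center_spot = np.zeros((size, size)); center_spot[8:13, 8:13] = 1.0     # light in the center
surround_ring = np.ones((size, size)); surround_ring[5:16, 5:16] = 0.0  # light in the surround
full_field = np.ones((size, size))                                      # uniform illumination

for name, img in [("center spot", center_spot),
                  ("surround ring", surround_ring),
                  ("full field", full_field)]:
    drive = float((dog * img).sum())          # dot product of the image with the RF
    print(f"{name:13s} -> relative response {drive:+.3f}")
# Light in the center drives the cell up, light in the surround pushes it down,
# and full-field illumination roughly cancels.
```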
1043 00:37:16,090 --> 00:37:19,620 There's some nice gain control for overall illumination 1044 00:37:19,620 --> 00:37:21,570 conditions. 1045 00:37:21,570 --> 00:37:24,410 But in my toy model of the retina, it's 1046 00:37:24,410 --> 00:37:27,410 basically a really nice pixel map coming back down 1047 00:37:27,410 --> 00:37:30,770 the optic tract to the LGN. 1048 00:37:30,770 --> 00:37:34,250 OK, I'm going to skip the LGN and go straight to V1. 1049 00:37:34,250 --> 00:37:37,130 People have known for a long time that, functionally, V1 1050 00:37:37,130 --> 00:37:42,350 cells have sensitivity especially to edges. 1051 00:37:42,350 --> 00:37:45,590 They have what's called orientation selectivity. 1052 00:37:45,590 --> 00:37:47,210 Hopefully this isn't new to you guys. 1053 00:37:47,210 --> 00:37:48,752 Here's a simple cell in V1. 1054 00:37:48,752 --> 00:37:50,210 If you shine a bar of light on it 1055 00:37:50,210 --> 00:37:51,485 inside its receptive field-- 1056 00:37:51,485 --> 00:37:53,360 does everyone know what a receptive field is? 1057 00:37:53,360 --> 00:37:54,193 I don't want to go-- 1058 00:37:54,193 --> 00:37:55,184 OK. 1059 00:37:55,184 --> 00:37:56,600 It's OK if you ask, because I want 1060 00:37:56,600 --> 00:37:57,809 to make sure you guys are OK. 1061 00:37:57,809 --> 00:37:59,974 So in the receptive field, you shine a bar of light in it, 1062 00:37:59,974 --> 00:38:01,580 turn it on in the right orientation, 1063 00:38:01,580 --> 00:38:04,190 and it gives a good response out of the cell. 1064 00:38:04,190 --> 00:38:06,545 Move it off this position, now not much response, 1065 00:38:06,545 --> 00:38:08,420 there's a little bit of an off response here. 1066 00:38:08,420 --> 00:38:10,580 Change the orientation, nothing happens. 1067 00:38:10,580 --> 00:38:12,860 Full field illumination, nothing happens. 1068 00:38:12,860 --> 00:38:15,620 OK, so this is called selectivity. 1069 00:38:15,620 --> 00:38:17,810 That is, there's some portion of the image space 1070 00:38:17,810 --> 00:38:19,040 that it cares about. 1071 00:38:19,040 --> 00:38:21,020 It doesn't just respond to any light 1072 00:38:21,020 --> 00:38:25,030 at that spot like the pixel-wise retinal ganglion cell 1073 00:38:25,030 --> 00:38:26,120 would. 1074 00:38:26,120 --> 00:38:28,730 So now there's this complex cell that's 1075 00:38:28,730 --> 00:38:32,700 also in V1, which maintains this orientation 1076 00:38:32,700 --> 00:38:35,150 selectivity across a change in position, 1077 00:38:35,150 --> 00:38:38,460 as shown here, and also across some changes in scale. 1078 00:38:38,460 --> 00:38:41,500 So it maintains it, meaning that you have this tolerance-- 1079 00:38:41,500 --> 00:38:44,360 so that's called position tolerance-- tolerance for position. 1080 00:38:44,360 --> 00:38:47,120 You can move the bar around and it still likes that oriented bar. 1081 00:38:47,120 --> 00:38:50,420 But you change its angle and it goes down, 1082 00:38:50,420 --> 00:38:52,460 so it still maintains the same selectivity here 1083 00:38:52,460 --> 00:38:53,570 but it has some tolerance. 1084 00:38:53,570 --> 00:38:57,020 So you get this build-up of some orientation sensitivity 1085 00:38:57,020 --> 00:38:58,690 followed by some tolerance. 1086 00:38:58,690 --> 00:39:00,469 And there are models from Hubel and Wiesel, 1087 00:39:00,469 --> 00:39:02,510 where they thought you could build the simple cells first 1088 00:39:02,510 --> 00:39:04,093 and then build the complex cells out of those, 1089 00:39:04,093 --> 00:39:05,540 that's the simple version.
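That simple-cell/complex-cell idea maps naturally onto a few lines of code. The sketch below is only a cartoon with hypothetical filters and bar stimuli (not a model fit to data): an AND-like weighted sum over aligned, pixel-like inputs gives orientation selectivity, and a max over shifted copies gives position tolerance while keeping that selectivity.

```python
# Cartoon of the Hubel-and-Wiesel build-up (illustrative, not fit to data).
import numpy as np

size = 15

def oriented_bar(angle_deg, offset=0):
    """Binary image of a thin bar at a given angle, shifted by `offset` pixels."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    t = np.deg2rad(angle_deg)
    dist = np.abs(np.cos(t) * yy - np.sin(t) * xx - offset)
    return (dist < 1.0).astype(float)

def simple_cell(img, angle_deg, offset=0):
    """AND-like: weight a line of aligned positions positively, a parallel flank negatively."""
    rf = oriented_bar(angle_deg, offset) - 0.5 * oriented_bar(angle_deg, offset + 3)
    return float((rf * img).sum())

def complex_cell(img, angle_deg):
    """Pool (max) over simple cells with the same orientation at shifted positions."""
    return max(simple_cell(img, angle_deg, off) for off in range(-4, 5))

vertical_bar = oriented_bar(90)
shifted_bar  = oriented_bar(90, offset=3)
tilted_bar   = oriented_bar(45)

print(simple_cell(vertical_bar, 90), simple_cell(shifted_bar, 90))    # strong, then suppressed
print(complex_cell(vertical_bar, 90), complex_cell(shifted_bar, 90))  # strong for both positions
print(complex_cell(tilted_bar, 90))                                   # still weak: selectivity kept
```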
1090 00:39:05,540 --> 00:39:06,687 And here they are. 1091 00:39:06,687 --> 00:39:08,270 These are the Hubel and Wiesel models of 1092 00:39:08,270 --> 00:39:11,420 how you build these: you build the selectivity 1093 00:39:11,420 --> 00:39:14,539 from pixel-wise cells with an AND-like operator, lining 1094 00:39:14,539 --> 00:39:15,330 those cells up correctly. 1095 00:39:15,330 --> 00:39:17,770 You can imagine orientation-tuned cells built this way. 1096 00:39:17,770 --> 00:39:20,120 There's evidence for this in physiology, 1097 00:39:20,120 --> 00:39:21,980 that this is how these are constructed. 1098 00:39:21,980 --> 00:39:23,680 The tolerance of these complex cells 1099 00:39:23,680 --> 00:39:27,230 is thought to be built by a combination of simple cells. 1100 00:39:27,230 --> 00:39:29,167 And there's some evidence for this. 1101 00:39:29,167 --> 00:39:31,375 And this goes, again, all the way back to Hubel and Wiesel, 1102 00:39:31,375 --> 00:39:35,030 who won a Nobel Prize for this and related work done in the 1960s. 1103 00:39:35,030 --> 00:39:38,450 And then there were a bunch of computational models 1104 00:39:38,450 --> 00:39:40,460 that were really inspired by this and that I 1105 00:39:40,460 --> 00:39:42,974 think are still the core models of how the system works. 1106 00:39:42,974 --> 00:39:45,140 And some of the original ones that were written down 1107 00:39:45,140 --> 00:39:47,900 are Fukushima's in the '80s, and then 1108 00:39:47,900 --> 00:39:50,702 Tommy Poggio and others built what's called the HMAX model, 1109 00:39:50,702 --> 00:39:52,160 which you guys have probably heard about, 1110 00:39:52,160 --> 00:39:54,440 that's built off of these same ideas, much more 1111 00:39:54,440 --> 00:39:58,055 refined and much more matched to the neural data. 1112 00:39:58,055 --> 00:39:59,930 But I'm just trying to point out that these kinds 1113 00:39:59,930 --> 00:40:01,460 of physiological observations are 1114 00:40:01,460 --> 00:40:04,960 what inspired this class of largely feedforward models 1115 00:40:04,960 --> 00:40:08,300 that you've heard a lot about today. 1116 00:40:08,300 --> 00:40:11,840 So that's a brief tour of V1. 1117 00:40:11,840 --> 00:40:13,700 Now, what's going on in V2? 1118 00:40:13,700 --> 00:40:15,380 For a long time, people thought it 1119 00:40:15,380 --> 00:40:17,360 was hard to tell the difference between V1 and V2. 1120 00:40:17,360 --> 00:40:18,901 And I just thought I'd show you guys, 1121 00:40:18,901 --> 00:40:21,054 this is a slide I stuck in, this is from Eero 1122 00:40:21,054 --> 00:40:22,220 Simoncelli and Tony Movshon. 1123 00:40:22,220 --> 00:40:24,678 And I think you guys have Eero teaching in the course a bit 1124 00:40:24,678 --> 00:40:26,500 later, so he may say some of this. 1125 00:40:26,500 --> 00:40:33,050 But V2 cells have some sensitivity to natural image 1126 00:40:33,050 --> 00:40:35,810 statistics that V1 cells don't. 1127 00:40:35,810 --> 00:40:38,030 And maybe I'll see if I can take you through this. 1128 00:40:38,030 --> 00:40:42,680 So the way that they did this is you can simulate-- 1129 00:40:42,680 --> 00:40:45,410 so this is all driven off of work that Eero and Tony have 1130 00:40:45,410 --> 00:40:45,910 done-- 1131 00:40:45,910 --> 00:40:48,390 especially Eero has done on texture synthesis.
1132 00:40:48,390 --> 00:40:50,120 So you have these original images, 1133 00:40:50,120 --> 00:40:53,120 and if you run them through a bunch of V1-like filter banks, 1134 00:40:53,120 --> 00:40:56,540 and then you take a new image, a random seed, which 1135 00:40:56,540 --> 00:40:58,430 is like white noise, and you try to make sure 1136 00:40:58,430 --> 00:41:00,860 that it would activate populations 1137 00:41:00,860 --> 00:41:02,530 of V1 cells in a similar way, there's 1138 00:41:02,530 --> 00:41:05,030 a large set of images that would do that because you're just 1139 00:41:05,030 --> 00:41:07,100 doing summary statistics, but these 1140 00:41:07,100 --> 00:41:08,340 are some examples of them. 1141 00:41:08,340 --> 00:41:10,548 For this image, this is one that one might look like. 1142 00:41:10,548 --> 00:41:12,980 So you can see, to you, it doesn't look the same as this. 1143 00:41:12,980 --> 00:41:15,230 But to V1, these are metamers, they're 1144 00:41:15,230 --> 00:41:18,650 very similar in the summary statistics in V1. 1145 00:41:18,650 --> 00:41:21,320 And then you start taking cross products of these V1 summary 1146 00:41:21,320 --> 00:41:23,162 statistics and then you try to match those. 1147 00:41:23,162 --> 00:41:24,620 And what's interesting is you start 1148 00:41:24,620 --> 00:41:26,720 to get something that looks, texture wise, much more 1149 00:41:26,720 --> 00:41:27,810 like this original image. 1150 00:41:27,810 --> 00:41:29,893 And this is a big part of what Eero and others did 1151 00:41:29,893 --> 00:41:30,740 in that work. 1152 00:41:30,740 --> 00:41:32,198 And the reason I'm showing you this 1153 00:41:32,198 --> 00:41:35,570 is that Tony's lab has gone and recorded in V1 and V2 1154 00:41:35,570 --> 00:41:38,840 with these kinds of stimuli, and the main observation they have 1155 00:41:38,840 --> 00:41:43,670 is that V1 doesn't care whether you show it this or this. 1156 00:41:43,670 --> 00:41:46,022 To V1, these are both the same, which 1157 00:41:46,022 --> 00:41:47,480 says we have the summary statistics 1158 00:41:47,480 --> 00:41:49,867 for V1 right in terms of the average V1 response. 1159 00:41:49,867 --> 00:41:51,200 That's all I'm showing you here. 1160 00:41:51,200 --> 00:41:53,256 The paper, if you want it, is much more detailed. 1161 00:41:53,256 --> 00:41:55,130 But you go to V2 and there's a big difference 1162 00:41:55,130 --> 00:41:59,000 between this, which V2 cells respond to more, and this, 1163 00:41:59,000 --> 00:42:00,680 which they respond to less. 1164 00:42:00,680 --> 00:42:02,720 And really one inference you can take from this 1165 00:42:02,720 --> 00:42:06,950 is that V2 neurons apply a repeated-- another and like 1166 00:42:06,950 --> 00:42:08,480 operator on V1. 1167 00:42:08,480 --> 00:42:10,850 That's a simple inference that these kinds of data seem 1168 00:42:10,850 --> 00:42:11,720 to support . 1169 00:42:11,720 --> 00:42:14,090 And they also tell you that these and-like operators, 1170 00:42:14,090 --> 00:42:15,890 these conjunctions of V1 statistics 1171 00:42:15,890 --> 00:42:18,560 tend to be in the direction of the statistics 1172 00:42:18,560 --> 00:42:21,500 of the natural world, that's naturalistic statistics. 1173 00:42:21,500 --> 00:42:24,110 Now lots of controls haven't been done here 1174 00:42:24,110 --> 00:42:26,300 to narrow in exactly what kinds ands, 1175 00:42:26,300 --> 00:42:28,280 but that's the spirit of where the field is 1176 00:42:28,280 --> 00:42:29,635 in trying to understand V2. 
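For readers who want the flavor of "summary statistics" in code, here is a drastically simplified stand-in for the texture-model idea (it is not the Portilla-Simoncelli model and the filter bank is arbitrary): marginal statistics of oriented-filter outputs play the role of the V1-level description, and cross products of those outputs play the role of the conjunctions that the V2 story appeals to. A synthesis procedure would start from noise and iteratively adjust pixels until these summary numbers match an original image's.

```python
# Simplified stand-in for V1/V2-style summary statistics (not the actual model).
import numpy as np
from scipy.signal import convolve2d

def gabor(size=9, sigma=2.0, wavelength=4.0, theta=0.0):
    """A small oriented filter: circular Gaussian envelope times an oriented carrier."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    xr = xx * np.cos(theta) + yy * np.sin(theta)
    return np.exp(-(xx**2 + yy**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xr / wavelength)

bank = [gabor(theta=t) for t in (0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)]

def v1_statistics(img):
    """Mean rectified response of each oriented filter: marginal, V1-level statistics."""
    maps = [np.abs(convolve2d(img, f, mode="same")) for f in bank]
    return np.array([m.mean() for m in maps]), maps

def v1_cross_products(maps):
    """Correlations between filter-response maps: joint, conjunction-like statistics."""
    flat = np.array([m.ravel() - m.mean() for m in maps])
    return np.corrcoef(flat)

rng = np.random.default_rng(0)
image = rng.random((64, 64))                   # stand-in for a texture photograph
marginals, maps = v1_statistics(image)
print(marginals)                               # 4 numbers: one per orientation
print(v1_cross_products(maps).round(2))        # 4x4 matrix of map correlations
```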
1177 00:42:29,635 --> 00:42:31,010 Everybody thinks it has something 1178 00:42:31,010 --> 00:42:33,135 to do with corners or a more complicated structure. 1179 00:42:33,135 --> 00:42:35,270 But this is a way that current in the field 1180 00:42:35,270 --> 00:42:38,170 to try to move these image computing models forward 1181 00:42:38,170 --> 00:42:39,140 in V1 and V2. 1182 00:42:39,140 --> 00:42:42,080 And Tony likes to point out that this is one of the strongest 1183 00:42:42,080 --> 00:42:44,599 differences that you see between V1 and V2, 1184 00:42:44,599 --> 00:42:46,140 other than the receptive field sizes. 1185 00:42:46,140 --> 00:42:49,460 So I think that's quite some exciting work if you 1186 00:42:49,460 --> 00:42:51,820 don't know about it on V2. 1187 00:42:51,820 --> 00:42:54,740 OK, then you get up into V4 and things get much murkier. 1188 00:42:54,740 --> 00:42:56,420 So what's going on in V4? 1189 00:42:56,420 --> 00:42:59,030 Well, let me just briefly say that one of my post-docs-- this 1190 00:42:59,030 --> 00:43:01,770 is more recent work just because it builds on that earlier work. 1191 00:43:01,770 --> 00:43:04,280 This is Nicole Rust, when she was a post-doc in the lab, 1192 00:43:04,280 --> 00:43:05,210 compared V4. 1193 00:43:05,210 --> 00:43:07,280 She actually compared it to IT. 1194 00:43:07,280 --> 00:43:08,000 I'll skip that. 1195 00:43:08,000 --> 00:43:10,760 But she was using these Simoncelli scrambled images. 1196 00:43:10,760 --> 00:43:13,636 These are actually the texture images from-- 1197 00:43:13,636 --> 00:43:15,260 these are the original images and these 1198 00:43:15,260 --> 00:43:16,040 are the texture versions. 1199 00:43:16,040 --> 00:43:17,960 So this should look like a textured version of that. 1200 00:43:17,960 --> 00:43:19,959 You can see that these algorithms don't actually 1201 00:43:19,959 --> 00:43:23,180 capture the object content of these images. 1202 00:43:23,180 --> 00:43:26,990 And what Nicole actually showed is that similar to what 1203 00:43:26,990 --> 00:43:29,510 you just saw there, in the earlier work like V1, 1204 00:43:29,510 --> 00:43:32,360 V4 doesn't care about the differences between these. 1205 00:43:32,360 --> 00:43:35,090 It responds similarly, as a population, to this and this, 1206 00:43:35,090 --> 00:43:36,800 and this and this, and this and this. 1207 00:43:36,800 --> 00:43:40,220 But IT cares a lot about this versus this. 1208 00:43:40,220 --> 00:43:43,250 So this is just repeating the same theme, the general idea 1209 00:43:43,250 --> 00:43:46,402 that you have and -like operators that we think 1210 00:43:46,402 --> 00:43:48,110 are aligned along the ventral stream that 1211 00:43:48,110 --> 00:43:49,790 are tuned to the kind of statistics 1212 00:43:49,790 --> 00:43:51,530 that you tend to encounter in the world. 1213 00:43:51,530 --> 00:43:54,650 And this is some of the evidence for it in V2, 1214 00:43:54,650 --> 00:43:57,500 and then later in V4, and IT, and Nicole's work, 1215 00:43:57,500 --> 00:43:59,000 if you piece that all together. 1216 00:43:59,000 --> 00:44:00,980 When you go to a place like V4, remember V4 1217 00:44:00,980 --> 00:44:02,690 is now like three levels up. 1218 00:44:02,690 --> 00:44:04,170 And what does V4 do? 1219 00:44:04,170 --> 00:44:07,409 Look, this is Jack's work in 1996. 1220 00:44:07,409 --> 00:44:08,950 This is from Jack Gallant when he was 1221 00:44:08,950 --> 00:44:10,158 working with David Van Essen. 
1222 00:44:10,158 --> 00:44:12,200 And people had some ideas that maybe there 1223 00:44:12,200 --> 00:44:14,600 are these certain functions that V4 neurons like, 1224 00:44:14,600 --> 00:44:16,407 and they would show these-- 1225 00:44:16,407 --> 00:44:17,990 the same thing people have done in V2, 1226 00:44:17,990 --> 00:44:19,781 they would show a bunch of images like this 1227 00:44:19,781 --> 00:44:22,490 and figure out, well, does it like these Cartesian gratings 1228 00:44:22,490 --> 00:44:23,390 or these curved ones? 1229 00:44:23,390 --> 00:44:25,070 And you know, what you get out of this is, 1230 00:44:25,070 --> 00:44:26,540 you could tell some story about it, 1231 00:44:26,540 --> 00:44:28,331 but you get a bunch of responses out of it. 1232 00:44:28,331 --> 00:44:30,140 The color indicates the response. 1233 00:44:30,140 --> 00:44:32,180 And you kind of look at it, and people would tell some stories, 1234 00:44:32,180 --> 00:44:33,980 but it really was just kind of like reading tea leaves. 1235 00:44:33,980 --> 00:44:35,604 Here's a bunch of data, and we don't really 1236 00:44:35,604 --> 00:44:38,150 know what these V4 neurons were doing. 1237 00:44:38,150 --> 00:44:43,170 This was a Science paper, so you can go back and read it. 1238 00:44:43,170 --> 00:44:47,720 And then Ed Connor and Anitha Pasupathy 1239 00:44:47,720 --> 00:44:49,760 worked together a few years after that 1240 00:44:49,760 --> 00:44:54,260 to try to figure out more about what V4 neurons do. 1241 00:44:54,260 --> 00:44:55,790 And they did things like take images 1242 00:44:55,790 --> 00:44:57,470 like this, which were isolated, and try 1243 00:44:57,470 --> 00:45:00,560 to cut them into parts, like curved parts, pointy parts, 1244 00:45:00,560 --> 00:45:02,930 curved, concave, convex. 1245 00:45:02,930 --> 00:45:06,355 And this was motivated by some of the psychology literature. 1246 00:45:06,355 --> 00:45:07,730 And they would define these based 1247 00:45:07,730 --> 00:45:09,510 on the center of the object. 1248 00:45:09,510 --> 00:45:11,580 So this wasn't an image-computable model, 1249 00:45:11,580 --> 00:45:13,880 it was just a basis set that they 1250 00:45:13,880 --> 00:45:16,259 built around these silhouette objects. 1251 00:45:16,259 --> 00:45:18,800 And so they made this basis set for any kind of silhouetted 1252 00:45:18,800 --> 00:45:20,155 object they liked here. 1253 00:45:20,155 --> 00:45:21,530 They hypothesized that they could 1254 00:45:21,530 --> 00:45:23,750 fit the responses of V4 neurons in this basis set. 1255 00:45:23,750 --> 00:45:25,340 And this was their attempt to do it. 1256 00:45:25,340 --> 00:45:27,369 They could actually fit them quite well. 1257 00:45:27,369 --> 00:45:29,160 And that's kind of what's being shown here. 1258 00:45:29,160 --> 00:45:30,618 Here's the response of a V4 neuron. 1259 00:45:30,618 --> 00:45:32,790 The color indicates the depth of the response. 1260 00:45:32,790 --> 00:45:35,040 You can see, this is sort of like that previous slide, 1261 00:45:35,040 --> 00:45:35,870 you're looking at tea leaves. 1262 00:45:35,870 --> 00:45:37,756 It looks complicated, but under this model 1263 00:45:37,756 --> 00:45:40,130 they were able to, in the shape space, explain about half 1264 00:45:40,130 --> 00:45:42,620 of the response variance of V4 neurons. 1265 00:45:42,620 --> 00:45:48,800 The upshot is that V4 cares about some combination 1266 00:45:48,800 --> 00:45:49,730 of curves.
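The general recipe just described, regressing a neuron's responses onto a hand-built shape basis and asking how much response variance is explained, can be sketched with synthetic data. The basis, the "neuron," and all numbers below are invented for illustration; this is not the Connor/Pasupathy model itself.

```python
# Sketch of fitting a neuron's responses with a hand-built shape basis (synthetic data).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_shapes, n_basis = 80, 24

# Pretend feature matrix: each row is one silhouette projected onto 24 basis
# functions tiling (curvature, angular position). In the real work these would
# come from the measured object boundaries.
X = rng.random((n_shapes, n_basis))

# Pretend V4 neuron: tuned to a couple of curvature/position conjunctions, plus noise.
true_w = np.zeros(n_basis); true_w[[3, 17]] = [4.0, 2.5]
rates = X @ true_w + rng.normal(scale=1.5, size=n_shapes)

r2 = cross_val_score(Ridge(alpha=1.0), X, rates, cv=5, scoring="r2")
print(f"cross-validated variance explained: {r2.mean():.2f}")
```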
1267 00:45:49,730 --> 00:45:51,590 And then later, Scott Brincat, with Ed, 1268 00:45:51,590 --> 00:45:53,300 went on into posterior IT and showed 1269 00:45:53,300 --> 00:45:55,670 that maybe some combinations of these V4 cells 1270 00:45:55,670 --> 00:45:58,859 could fit posterior IT responses quite well. 1271 00:45:58,859 --> 00:46:00,650 So if you read the literature on V4 and IT, 1272 00:46:00,650 --> 00:46:01,780 you'll come across these studies. 1273 00:46:01,780 --> 00:46:03,500 And they are important ones to look at. 1274 00:46:03,500 --> 00:46:04,916 Unfortunately, they don't give you 1275 00:46:04,916 --> 00:46:07,530 an image-computable model of what these neurons are doing. 1276 00:46:07,530 --> 00:46:09,790 But it's some of the work that you should know about 1277 00:46:09,790 --> 00:46:12,980 if you want to look in V4 or early IT, 1278 00:46:12,980 --> 00:46:14,275 so I'm telling it to you. 1279 00:46:14,275 --> 00:46:16,400 So let me go on to IT, which is what I want to talk 1280 00:46:16,400 --> 00:46:18,404 about for the rest of today. 1281 00:46:18,404 --> 00:46:19,945 Again, I'm talking about AIT and CIT. 1282 00:46:23,270 --> 00:46:26,150 And I'll just quickly say that the anatomy, again, 1283 00:46:26,150 --> 00:46:29,580 suggests that IT covers mostly the central 10 degrees. 1284 00:46:29,580 --> 00:46:34,400 And even though V1, V2, and V4 cover the whole visual field, 1285 00:46:34,400 --> 00:46:36,650 if you make injections in V4, that's 1286 00:46:36,650 --> 00:46:39,560 shown here, where you make injections 1287 00:46:39,560 --> 00:46:42,560 in the more peripheral parts of the V4 representation, which 1288 00:46:42,560 --> 00:46:46,021 is up here, you don't get much projection into IT, which 1289 00:46:46,021 --> 00:46:46,520 is here. 1290 00:46:46,520 --> 00:46:49,061 You don't see much green color, whereas if you make injections 1291 00:46:49,061 --> 00:46:51,170 in the central part of V4, these red sites here, 1292 00:46:51,170 --> 00:46:55,700 you see much more coverage into IT, which is shown here. 1293 00:46:55,700 --> 00:46:57,830 So when I say 10 degrees, that's rough. 1294 00:46:57,830 --> 00:46:59,260 Everything in biology is messy. 1295 00:46:59,260 --> 00:47:01,940 But this is some of the evidence-- beyond recordings, 1296 00:47:01,940 --> 00:47:04,910 there's anatomical evidence-- that as you go down into IT, 1297 00:47:04,910 --> 00:47:07,890 you are more and more focused on the central 10 degrees. 1298 00:47:07,890 --> 00:47:10,700 OK, let me talk a little bit about the history of IT 1299 00:47:10,700 --> 00:47:11,290 recordings. 1300 00:47:11,290 --> 00:47:14,291 This is when people got excited about IT, in the '70s. 1301 00:47:14,291 --> 00:47:16,790 This is work by Charlie Gross, who's one of the first people 1302 00:47:16,790 --> 00:47:19,070 to record in IT cortex. 1303 00:47:19,070 --> 00:47:22,250 And I'll show you what they did here. 1304 00:47:22,250 --> 00:47:24,710 This was in an era where, remember, Hubel and Wiesel 1305 00:47:24,710 --> 00:47:26,360 had just done their work in the '60s. 1306 00:47:26,360 --> 00:47:28,900 And they recorded from the cat visual cortex. 1307 00:47:28,900 --> 00:47:30,400 And they had found these edge cells, 1308 00:47:30,400 --> 00:47:32,525 and they ended up winning the Nobel Prize for that. 1309 00:47:32,525 --> 00:47:34,700 So it was the heyday of like, let's record 1310 00:47:34,700 --> 00:47:36,380 and figure out what makes cells go.
1311 00:47:36,380 --> 00:47:39,620 So they were brave enough to put an electrode down an IT cortex 1312 00:47:39,620 --> 00:47:42,470 in 1970 and said, what makes this neuron go. 1313 00:47:42,470 --> 00:47:44,300 Remember, that's an encoding question, 1314 00:47:44,300 --> 00:47:48,620 what's the image content that will drive this neuron. 1315 00:47:48,620 --> 00:47:50,360 And it's fun to just look back on this 1316 00:47:50,360 --> 00:47:51,597 and what they were doing. 1317 00:47:51,597 --> 00:47:53,180 So they didn't have computer monitors. 1318 00:47:53,180 --> 00:47:55,090 They were actually waving around stimuli 1319 00:47:55,090 --> 00:47:56,090 in front of the animals. 1320 00:47:56,090 --> 00:47:59,030 This is an anesthetized animal on a table. 1321 00:47:59,030 --> 00:48:00,005 This is a monkey. 1322 00:48:00,005 --> 00:48:01,380 Actually, they started with a cat 1323 00:48:01,380 --> 00:48:02,838 and then they later went to monkey. 1324 00:48:02,838 --> 00:48:05,520 The use of these stimuli was begun one day when, 1325 00:48:05,520 --> 00:48:08,020 having failed to drive a unit with any light stimulus-- that 1326 00:48:08,020 --> 00:48:10,100 probably means spots of light, edges things 1327 00:48:10,100 --> 00:48:11,930 that Hubel and Wiesel had been using. 1328 00:48:11,930 --> 00:48:14,530 We waved a hand at the stimulus screen, 1329 00:48:14,530 --> 00:48:15,950 they waved in front of the monkey, 1330 00:48:15,950 --> 00:48:18,080 and elicited a very vigorous response 1331 00:48:18,080 --> 00:48:21,190 from the previously unresponsive neuron. 1332 00:48:21,190 --> 00:48:23,621 And then we spent the next 12 hours-- so the animal's 1333 00:48:23,621 --> 00:48:26,120 anesthetized on the table, their recording from this neuron. 1334 00:48:26,120 --> 00:48:27,680 It's 12 hours because nothing's moving, 1335 00:48:27,680 --> 00:48:29,570 so you can record for a long period of time. 1336 00:48:29,570 --> 00:48:30,650 So singular neuron, they're recording, 1337 00:48:30,650 --> 00:48:31,430 listening to the spikes. 1338 00:48:31,430 --> 00:48:33,770 We spent the next 12 hours testing various paper cut 1339 00:48:33,770 --> 00:48:36,740 outs in attempt to find the trigger feature. 1340 00:48:36,740 --> 00:48:38,990 You can see, that's a Hubel and Wiesel idea, 1341 00:48:38,990 --> 00:48:40,310 what makes this neuron go. 1342 00:48:40,310 --> 00:48:43,390 What's the best thing, that's become 1343 00:48:43,390 --> 00:48:45,680 a lot of what the field spent time doing. 1344 00:48:45,680 --> 00:48:48,500 Trigger feature for this unit, when the entire stimulus set 1345 00:48:48,500 --> 00:48:51,290 were used, were ranked according to the strength of the response 1346 00:48:51,290 --> 00:48:52,081 that they produced. 1347 00:48:52,081 --> 00:48:54,230 We could not find a simple physical dimension 1348 00:48:54,230 --> 00:48:55,829 that correlated with this rank order. 1349 00:48:55,829 --> 00:48:57,620 However, the rank order of adequate stimuli 1350 00:48:57,620 --> 00:48:59,660 did correlate with similarity for us, 1351 00:48:59,660 --> 00:49:01,760 that means psychophysical judged, 1352 00:49:01,760 --> 00:49:03,800 to the shadow of a monkey hand. 1353 00:49:03,800 --> 00:49:05,820 So these are their rank order of the stimuli. 1354 00:49:05,820 --> 00:49:08,480 And they say look, it looks like it's some sort of hand neuron. 1355 00:49:08,480 --> 00:49:10,021 That's all I know how to describe it. 1356 00:49:10,021 --> 00:49:11,960 I can't find some simple thing on here. 
1357 00:49:11,960 --> 00:49:14,972 So this kind of study then launched a whole domain 1358 00:49:14,972 --> 00:49:17,180 where people started to go in to record these neurons 1359 00:49:17,180 --> 00:49:18,980 and they found interesting different types. 1360 00:49:18,980 --> 00:49:21,200 Bob Desimone, who worked with Charlie Gross, 1361 00:49:21,200 --> 00:49:23,325 later showed much more nicely under more controlled 1362 00:49:23,325 --> 00:49:25,710 conditions, yes, there are indeed neurons that respond. 1363 00:49:25,710 --> 00:49:27,770 You can see more to these hand-- this is the post stimulus time 1364 00:49:27,770 --> 00:49:30,228 histogram, lots of spikes, lots of spikes, lots of spikes-- 1365 00:49:30,228 --> 00:49:32,870 respond more to these hands than to these other kind 1366 00:49:32,870 --> 00:49:34,851 of stimuli here. 1367 00:49:34,851 --> 00:49:36,350 So you could say, these neurons have 1368 00:49:36,350 --> 00:49:39,380 tuned to specific combinations of high selectivity. 1369 00:49:39,380 --> 00:49:40,970 You'll hear from Winrich that others 1370 00:49:40,970 --> 00:49:42,470 had shown that you could record some 1371 00:49:42,470 --> 00:49:45,410 of the neurons are really like faces that you could find, 1372 00:49:45,410 --> 00:49:46,832 and not so much hands. 1373 00:49:46,832 --> 00:49:48,290 So you could find neurons that seem 1374 00:49:48,290 --> 00:49:51,230 to have some interesting selectivity in IT cortex. 1375 00:49:51,230 --> 00:49:53,030 And then others later went on to show 1376 00:49:53,030 --> 00:49:55,605 in a number of studies-- this is from Nico Logothetis' work 1377 00:49:55,605 --> 00:49:56,730 of a number of years later. 1378 00:49:56,730 --> 00:50:00,500 It's just one example that this selectivity had some tolerance 1379 00:50:00,500 --> 00:50:02,390 to, say, the position of the stimulus, that's 1380 00:50:02,390 --> 00:50:03,350 what's shown here. 1381 00:50:03,350 --> 00:50:05,180 The fact that these bars are high just 1382 00:50:05,180 --> 00:50:09,230 means that it tolerates movement in where the-- 1383 00:50:09,230 --> 00:50:11,270 sorry, this is size, degrees of visual angle. 1384 00:50:11,270 --> 00:50:13,640 This is position, moving the stimulus around. 1385 00:50:13,640 --> 00:50:16,190 So this was known for a number of years 1386 00:50:16,190 --> 00:50:18,200 that there's some tolerance to position and size 1387 00:50:18,200 --> 00:50:19,384 changes at least. 1388 00:50:19,384 --> 00:50:21,050 OK, so I'm putting these up and you say, 1389 00:50:21,050 --> 00:50:23,836 there's some selectivity and there's some tolerance. 1390 00:50:23,836 --> 00:50:26,210 And that should remind you of what we already said in V1, 1391 00:50:26,210 --> 00:50:27,834 there's some selectivity, simple cells. 1392 00:50:27,834 --> 00:50:29,880 There's some tolerance, complex cells. 1393 00:50:29,880 --> 00:50:31,460 So you have the same themes here, 1394 00:50:31,460 --> 00:50:34,760 just different kinds of types of stimuli being used. 1395 00:50:34,760 --> 00:50:38,150 Then people really went on, in the 80s especially, and said, 1396 00:50:38,150 --> 00:50:39,710 let's go after this trigger feature. 1397 00:50:39,710 --> 00:50:44,360 And Tanaka's group really went after this really hard. 
1398 00:50:44,360 --> 00:50:46,550 Tanaka's group would find the best stimulus 1399 00:50:46,550 --> 00:50:48,350 they would find, dangle a bunch of objects 1400 00:50:48,350 --> 00:50:50,350 in front of a recorded neuron, find the best out 1401 00:50:50,350 --> 00:50:52,016 of a whole set of objects, and then they 1402 00:50:52,016 --> 00:50:53,300 try to do a reduction. 1403 00:50:53,300 --> 00:50:55,460 They'd try to figure out, how can I reduce this. 1404 00:50:55,460 --> 00:50:59,570 This is their attempt to reduce the stimulus to its features 1405 00:50:59,570 --> 00:51:01,190 without lowering the neural response. 1406 00:51:01,190 --> 00:51:03,290 So high response, high response, high response, high response, 1407 00:51:03,290 --> 00:51:05,390 high response, suddenly I do this, the response drops. 1408 00:51:05,390 --> 00:51:06,639 I do this, the response drops. 1409 00:51:06,639 --> 00:51:08,660 And they have lots of examples of this. 1410 00:51:08,660 --> 00:51:11,600 And they want you to try to get to the simplest thing that 1411 00:51:11,600 --> 00:51:12,830 could capture the response. 1412 00:51:12,830 --> 00:51:15,350 And when they did this, they would take stimuli like this, 1413 00:51:15,350 --> 00:51:18,440 and end up with stimuli that looked like that. 1414 00:51:18,440 --> 00:51:20,960 Now, many of you should probably start to wonder here, 1415 00:51:20,960 --> 00:51:23,120 there's lots of paths for stimulus space. 1416 00:51:23,120 --> 00:51:25,754 It's not clear that these are elemental in any way. 1417 00:51:25,754 --> 00:51:27,920 There's lots of ways that you can show with modeling 1418 00:51:27,920 --> 00:51:31,430 that you can get easily lost in this space of navigating around 1419 00:51:31,430 --> 00:51:32,210 here. 1420 00:51:32,210 --> 00:51:34,001 This is just, again, a history of the work. 1421 00:51:34,001 --> 00:51:36,110 This is the kind of things that people were doing. 1422 00:51:36,110 --> 00:51:37,820 And then from that, they presented 1423 00:51:37,820 --> 00:51:40,070 what we think of as the ice cube model of IT, 1424 00:51:40,070 --> 00:51:43,015 that I think is actually still a very reasonable approximation. 1425 00:51:43,015 --> 00:51:44,390 They not only showed that neurons 1426 00:51:44,390 --> 00:51:49,070 tended to like certain relatively reduced stimulus 1427 00:51:49,070 --> 00:51:51,410 features, not full objects, but that they 1428 00:51:51,410 --> 00:51:52,570 are gathered together. 1429 00:51:52,570 --> 00:51:55,340 So these are millimeter scale regions of IT 1430 00:51:55,340 --> 00:51:58,070 that nearby neurons, within a millimeter or so, 1431 00:51:58,070 --> 00:52:00,590 have similar preferences. 1432 00:52:00,590 --> 00:52:02,224 They're not just scattered willy-nilly 1433 00:52:02,224 --> 00:52:03,140 throughout the tissue. 1434 00:52:03,140 --> 00:52:05,480 When you go record nearby neurons, they're similar. 1435 00:52:05,480 --> 00:52:08,910 So there's some mapping within IT cortex. 1436 00:52:08,910 --> 00:52:10,370 This is schematic here. 1437 00:52:10,370 --> 00:52:13,940 This is optical imaging data of IT cortex also 1438 00:52:13,940 --> 00:52:16,130 from Tanaka's group that show you 1439 00:52:16,130 --> 00:52:19,130 that these different blobs of tissue 1440 00:52:19,130 --> 00:52:21,120 get activated by different images shown here. 1441 00:52:21,120 --> 00:52:22,911 And I'm just showing you the scale of this, 1442 00:52:22,911 --> 00:52:25,399 it's around a little less than a millimeter. 
1443 00:52:25,399 --> 00:52:26,940 And our lab has evidence of this too. 1444 00:52:26,940 --> 00:52:30,300 So there's some sort of spatial organization in IT, 1445 00:52:30,300 --> 00:52:32,850 but we really don't really yet understand the features, 1446 00:52:32,850 --> 00:52:36,840 these elemental features yet, or at least, not at this time. 1447 00:52:36,840 --> 00:52:39,194 Then later, there's lots of beautiful work in IT. 1448 00:52:39,194 --> 00:52:41,110 Again, I'm probably not telling you all of it. 1449 00:52:41,110 --> 00:52:42,925 Some of the most exciting work recently-- 1450 00:52:42,925 --> 00:52:44,790 and you'll hear about this from Winrich, 1451 00:52:44,790 --> 00:52:46,740 that people started to use fMRIs. 1452 00:52:46,740 --> 00:52:49,530 So Doris Tsao and Winrich Freiwald and Marge Livingstone 1453 00:52:49,530 --> 00:52:52,880 all together started to use fMRI data to compare 1454 00:52:52,880 --> 00:52:54,602 faces versus objects. 1455 00:52:54,602 --> 00:52:56,060 This was motivated from human work, 1456 00:52:56,060 --> 00:52:59,361 by work like Nancy Kanwisher lab and others. 1457 00:52:59,361 --> 00:53:00,860 What they found was that in monkeys, 1458 00:53:00,860 --> 00:53:03,722 you could find different parts that would show up, 1459 00:53:03,722 --> 00:53:05,180 what are called face patches, where 1460 00:53:05,180 --> 00:53:07,940 you have a relative preference for faces over objects. 1461 00:53:07,940 --> 00:53:10,640 Again, I don't want to take all of Winrich's talk here, 1462 00:53:10,640 --> 00:53:12,674 but you have these different patches here. 1463 00:53:12,674 --> 00:53:14,840 And then what's really cool is, you go in and record 1464 00:53:14,840 --> 00:53:18,312 from these patches and then you find a very enriched locations 1465 00:53:18,312 --> 00:53:19,020 for face neurons. 1466 00:53:19,020 --> 00:53:21,119 And these enriched locations were known 1467 00:53:21,119 --> 00:53:22,410 from a number of other studies. 1468 00:53:22,410 --> 00:53:25,100 But this is a nice correlation between functional imaging 1469 00:53:25,100 --> 00:53:26,810 and this enrichment of these face cells. 1470 00:53:26,810 --> 00:53:30,200 And that's what's shown here, that these neurons respond 1471 00:53:30,200 --> 00:53:32,900 mostly to faces and not so much other objects. 1472 00:53:32,900 --> 00:53:35,150 Although, you see they still sort of respond to these. 1473 00:53:35,150 --> 00:53:37,715 So this kind of says fMRI and physiology are 1474 00:53:37,715 --> 00:53:38,840 telling you similar things. 1475 00:53:38,840 --> 00:53:41,600 It also tells you there's some spatial clumping, at least 1476 00:53:41,600 --> 00:53:44,160 for face-like objects, at a scale of a few millimeters 1477 00:53:44,160 --> 00:53:46,070 or so, the size of these patches. 1478 00:53:46,070 --> 00:53:49,250 OK, so that's larger scale organization. 1479 00:53:49,250 --> 00:53:52,670 This is data from our own lab that shows the same thing. 1480 00:53:52,670 --> 00:53:55,310 Maybe I'll just skip through this in the interest of time-- 1481 00:53:55,310 --> 00:53:58,100 that we can map and record the neurons very precisely, 1482 00:53:58,100 --> 00:54:02,090 map them spatially and compare that with fMRI. 1483 00:54:02,090 --> 00:54:05,840 So this is just a larger field of view maps of the same idea. 
1484 00:54:05,840 --> 00:54:08,870 So what we have then, just to wrap up 1485 00:54:08,870 --> 00:54:11,720 this whirlwind tour of the ventral stream, 1486 00:54:11,720 --> 00:54:15,430 is that we had some untangled explicit information. 1487 00:54:15,430 --> 00:54:18,100 And what I want to try to convince you of now, is that-- 1488 00:54:18,100 --> 00:54:19,370 I've told you about the ventral stream, 1489 00:54:19,370 --> 00:54:21,536 but I'm going to try to tell you that, in IT cortex, 1490 00:54:21,536 --> 00:54:24,130 this is a powerful representation for encoding 1491 00:54:24,130 --> 00:54:26,364 object information. 1492 00:54:26,364 --> 00:54:28,780 And then we'll take a break because we've already probably 1493 00:54:28,780 --> 00:54:30,167 been going a while. 1494 00:54:30,167 --> 00:54:32,500 Yeah, about 10 more minutes and then we'll take a break. 1495 00:54:32,500 --> 00:54:36,280 So what I've told you is, I've led you up the ventral stream, 1496 00:54:36,280 --> 00:54:37,880 I've given you a bit of the history, 1497 00:54:37,880 --> 00:54:39,720 so now let's talk about IT more precisely. 1498 00:54:39,720 --> 00:54:42,940 So now this is work from my own lab. 1499 00:54:42,940 --> 00:54:44,510 You go in and record IT. 1500 00:54:44,510 --> 00:54:45,940 You go record extracellularly. 1501 00:54:45,940 --> 00:54:48,940 You travel down into IT cortex, which is down here. 1502 00:54:48,940 --> 00:54:50,170 And you record from this. 1503 00:54:50,170 --> 00:54:52,900 And similar to what you saw, another version of what 1504 00:54:52,900 --> 00:54:55,900 you saw from Charlie Gross or Bob Desimone, 1505 00:54:55,900 --> 00:54:57,790 you show a bunch of images. 1506 00:54:57,790 --> 00:54:59,410 And they could be arbitrary images. 1507 00:54:59,410 --> 00:55:01,270 You take an IT recording site, and see these little dots, 1508 00:55:01,270 --> 00:55:03,700 those are action potential spikes out of a particular IT 1509 00:55:03,700 --> 00:55:04,730 site. 1510 00:55:04,730 --> 00:55:06,190 And these are repeatable. 1511 00:55:06,190 --> 00:55:08,580 You have some Poisson variability here. 1512 00:55:08,580 --> 00:55:10,330 But you see that there's more spikes here, 1513 00:55:10,330 --> 00:55:12,490 there's little more here, less here, less there. 1514 00:55:12,490 --> 00:55:13,990 These images are all randomly interleaved 1515 00:55:13,990 --> 00:55:16,323 when you collect the data, as I'll show you in a minute. 1516 00:55:16,323 --> 00:55:18,920 And you go to different sites and it likes different images. 1517 00:55:18,920 --> 00:55:20,425 So there is certainly some image selectivity. 1518 00:55:20,425 --> 00:55:22,840 This should not be surprising because I already showed 1519 00:55:22,840 --> 00:55:24,580 you this from previous work. 1520 00:55:24,580 --> 00:55:26,320 This is just data from our own lab. 1521 00:55:26,320 --> 00:55:28,361 You can also see now that you are looking closely 1522 00:55:28,361 --> 00:55:30,910 at the time lag, remember, I said around 100 milliseconds 1523 00:55:30,910 --> 00:55:31,945 stimulus on. 1524 00:55:31,945 --> 00:55:34,240 Stimulus off, the stimulus is actually off 1525 00:55:34,240 --> 00:55:36,220 before the spikes actually start to occur out 1526 00:55:36,220 --> 00:55:38,170 here in IT because, again, there's a long time 1527 00:55:38,170 --> 00:55:39,880 lag, 100 milliseconds. 1528 00:55:39,880 --> 00:55:42,755 OK, so that's what the neural responses look like. 
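A quick way to see what "repeatable responses with Poisson variability" means for spike counts (made-up rates, not recorded data): each image has a fixed underlying rate for a given site, and every repeat draws a count around that mean, with variance that tracks the mean.

```python
# Toy illustration of repeatable spike counts with Poisson trial-to-trial variability.
import numpy as np

rng = np.random.default_rng(0)
mean_rates = {"image A": 12.0, "image B": 6.0, "image C": 1.5}   # spikes per counting window

for name, rate in mean_rates.items():
    counts = rng.poisson(rate, size=10)       # 10 randomly interleaved repeats
    print(f"{name}: repeats = {counts}, mean = {counts.mean():.1f}, var = {counts.var():.1f}")
# For a Poisson process the variance tracks the mean, which is roughly what the
# trial-to-trial variability of counts at a single IT site looks like.
```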
1529 00:55:42,755 --> 00:55:44,380 I don't know if you guys can hear this,
1530 00:55:44,380 --> 00:55:45,880 maybe I should have hooked up audio.
1531 00:55:45,880 --> 00:55:47,350 Maybe you might be able to hear--
1532 00:55:47,350 --> 00:55:50,394 this is actually a recording that Chou Hung made
1533 00:55:50,394 --> 00:55:52,810 when he collected his data for the early studies
1534 00:55:52,810 --> 00:55:54,340 we did in my lab.
1535 00:55:54,340 --> 00:55:56,700 I don't know if you guys can hear.
1536 00:55:56,700 --> 00:55:58,140 [STATIC]
1537 00:55:58,140 --> 00:56:00,060 [BEEP]
1538 00:56:00,060 --> 00:56:03,900 [BEEP]
1539 00:56:03,900 --> 00:56:04,880 [BEEP]
1540 00:56:04,880 --> 00:56:06,960 Those high beeps are the animal getting reward
1541 00:56:06,960 --> 00:56:08,712 for fixating on that dot.
1542 00:56:08,712 --> 00:56:10,670 You're not even going to be able to parse that.
1543 00:56:10,670 --> 00:56:12,636 I mean, you hear the spikes clicking by, those--
1544 00:56:12,636 --> 00:56:13,136 [STATIC]
1545 00:56:13,136 --> 00:56:15,416 Those are action potentials.
1546 00:56:15,416 --> 00:56:17,790 And I don't expect you to look at anything like, oh, it's
1547 00:56:17,790 --> 00:56:18,780 a face neuron, or whatever.
1548 00:56:18,780 --> 00:56:20,988 I just want you to get a feel for how those data were
1549 00:56:20,988 --> 00:56:21,940 originally collected.
1550 00:56:21,940 --> 00:56:23,640 This is a pretty grainy video.
1551 00:56:23,640 --> 00:56:26,400 But you get the idea.
1552 00:56:26,400 --> 00:56:27,649 You collect data like that.
1553 00:56:27,649 --> 00:56:29,940 And again, you can find selectivity in those population
1554 00:56:29,940 --> 00:56:31,540 patterns, as I just showed you.
1555 00:56:31,540 --> 00:56:34,980 But then, Gabriel and Tommy and I, so the three of us,
1556 00:56:34,980 --> 00:56:36,630 I think all in this room, way back when
1557 00:56:36,630 --> 00:56:39,780 in 2005 said, well look, the population of IT
1558 00:56:39,780 --> 00:56:41,490 might have good, useful information
1559 00:56:41,490 --> 00:56:43,812 for solving this difficult object manifold
1560 00:56:43,812 --> 00:56:44,520 tangling problem.
1561 00:56:44,520 --> 00:56:46,810 It might be a good explicit representation.
1562 00:56:46,810 --> 00:56:50,460 So we did what I call an early test of this idea.
1563 00:56:50,460 --> 00:56:54,810 We took this simple image set from eight different categories
1564 00:56:54,810 --> 00:56:56,400 that we had chosen.
1565 00:56:56,400 --> 00:56:59,070 And there are good stories of why we chose those objects,
1566 00:56:59,070 --> 00:57:00,360 if you'd like to hear them.
1567 00:57:00,360 --> 00:57:02,490 But let me just say, simple objects: we moved them
1568 00:57:02,490 --> 00:57:06,510 across position and scale, and we collected the responses
1569 00:57:06,510 --> 00:57:09,690 of a bunch of IT sites
1570 00:57:09,690 --> 00:57:11,504 to all these different visual images.
1571 00:57:11,504 --> 00:57:13,170 And we showed them as I just showed you.
1572 00:57:13,170 --> 00:57:14,878 We just showed them for 100 milliseconds.
1573 00:57:14,878 --> 00:57:16,827 This is the core recognition regime--
1574 00:57:16,827 --> 00:57:18,660 we're just showing them for 100 milliseconds.
1575 00:57:18,660 --> 00:57:20,576 And then we show another one, and they're just
1576 00:57:20,576 --> 00:57:21,756 randomly interleaved.
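As a rough illustration of that kind of stimulus design, here is a minimal sketch, not the actual experimental code: objects from eight categories, each shown at a few positions and scales, presented for about 100 milliseconds apiece in randomly interleaved order. The category names, position and scale values, and repeat count are illustrative assumptions.

```python
# A minimal sketch of this kind of stimulus design (not the actual experiment
# code): objects from eight categories, each rendered at several positions and
# scales, presented ~100 ms apiece in randomly interleaved order. The category
# names, position/scale values, and repeat count are illustrative assumptions.
import itertools
import random

categories = ["face", "toy", "food", "animal", "vehicle", "box", "plant", "tool"]
positions_deg = [-2.0, 0.0, 2.0]   # horizontal offsets from center, in degrees
sizes_deg = [1.5, 3.0, 6.0]        # object sizes, in degrees

# Every combination of category x position x size is one image condition.
conditions = list(itertools.product(categories, positions_deg, sizes_deg))

# Repeat each condition several times and randomly interleave the trials.
n_repeats = 10
trials = conditions * n_repeats
random.seed(0)
random.shuffle(trials)

presentation_ms = 100
print(f"{len(trials)} trials, each presented for ~{presentation_ms} ms")
print("first few interleaved trials:", trials[:3])
```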
1577 00:57:21,756 --> 00:57:23,130 And from this, what you do is you
1578 00:57:23,130 --> 00:57:25,110 get a population data set
1579 00:57:25,110 --> 00:57:27,900 where we recorded 350 IT sites.
1580 00:57:27,900 --> 00:57:29,820 Here's a sample of 63 sites.
1581 00:57:29,820 --> 00:57:32,270 These are 78 of the images we showed,
1582 00:57:32,270 --> 00:57:34,140 and each entry here is the mean neural response
1583 00:57:34,140 --> 00:57:35,599 to an image.
1584 00:57:35,599 --> 00:57:38,139 There's nothing for you to read into here, other than
1585 00:57:38,139 --> 00:57:40,020 that you have this rich population data.
1586 00:57:40,020 --> 00:57:43,665 And now our question is, well, what lives in this population
1587 00:57:43,665 --> 00:57:44,940 data that we've collected?
1588 00:57:44,940 --> 00:57:47,130 Is it explicit with regard to categories?
1589 00:57:47,130 --> 00:57:48,630 So we come back to what I showed you
1590 00:57:48,630 --> 00:57:51,390 earlier about those tangled manifolds and said,
1591 00:57:51,390 --> 00:57:53,460 we need simple decoding tools.
1592 00:57:53,460 --> 00:57:56,280 Can a simple decoding tool look at that population
1593 00:57:56,280 --> 00:57:57,930 and tell me what's out there?
1594 00:57:57,930 --> 00:58:00,474 And again, we were using linear classifiers at the time,
1595 00:58:00,474 --> 00:58:01,890 because we took that, as you heard
1596 00:58:01,890 --> 00:58:04,350 from Haim, as our operational definition of what
1597 00:58:04,350 --> 00:58:05,310 a simple tool is.
1598 00:58:05,310 --> 00:58:07,821 And if it could decode information about the object
1599 00:58:07,821 --> 00:58:09,570 identity, then we'd say, well, that means,
1600 00:58:09,570 --> 00:58:11,550 by that operational definition, this
1601 00:58:11,550 --> 00:58:14,790 is explicit, available, accessible information, or just
1602 00:58:14,790 --> 00:58:15,960 generally good.
1603 00:58:15,960 --> 00:58:19,650 So imagine that the activity-- this is schematic.
1604 00:58:19,650 --> 00:58:21,450 This is neuron one, this is neuron two,
1605 00:58:21,450 --> 00:58:23,210 and you could have a bunch of IT neurons.
1606 00:58:23,210 --> 00:58:24,926 These points represent
1607 00:58:24,926 --> 00:58:26,550 the population response
1608 00:58:26,550 --> 00:58:28,650 to each image of an object.
1609 00:58:28,650 --> 00:58:30,210 Remember, there are many images of each object.
1610 00:58:30,210 --> 00:58:32,190 If you can linearly separate any object
1611 00:58:32,190 --> 00:58:33,856 from all the other objects,
1612 00:58:33,856 --> 00:58:35,252 that would mean it was explicit.
1613 00:58:35,252 --> 00:58:36,960 And if you had a hard time separating it,
1614 00:58:36,960 --> 00:58:38,610 this would be implicit.
1615 00:58:38,610 --> 00:58:40,270 These are like tangled object manifolds.
1616 00:58:40,270 --> 00:58:42,900 This is inaccessible, or bad, information.
1617 00:58:42,900 --> 00:58:44,781 So we just-- and when I say we,
1618 00:58:44,781 --> 00:58:46,280 I mean Chou Hung, who led the study,
1619 00:58:46,280 --> 00:58:47,940 plus Gabriel, Tommy, and me.
1620 00:58:47,940 --> 00:58:51,300 We took the response to an image, like this one.
1621 00:58:51,300 --> 00:58:53,337 It produced a population vector.
1622 00:58:53,337 --> 00:58:54,920 Again, we recorded a bunch of neurons.
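To make that operational definition of "explicit" concrete, here is a minimal sketch of the decoding logic, not the published analysis: each image's population response becomes a vector of spike counts, and a simple linear classifier, tested on held-out images, tries to separate one category from the rest. The site count, image count, and simulated responses below are placeholder assumptions.

```python
# A minimal sketch of the decoding logic (not the published analysis): treat
# each image's population response as a vector of spike counts and ask whether
# a simple, cross-validated linear classifier can separate one category from
# the rest. The site count, image count, and responses are simulated placeholders.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

n_sites = 256       # recorded IT sites (the real study used a few hundred)
n_images = 640      # images, pre-assigned to 8 categories
categories = rng.integers(0, 8, size=n_images)

# Simulated population responses: trial noise plus a weak category-dependent
# pattern across sites, standing in for real IT spike counts.
category_patterns = rng.normal(0.0, 1.0, size=(8, n_sites))
X = category_patterns[categories] + rng.normal(0.0, 2.0, size=(n_images, n_sites))

# "Explicit" by the operational definition: a linear classifier, tested on
# held-out images, can read the category out. Balanced accuracy is reported
# because the one-vs-rest split (e.g. faces vs. non-faces) is unbalanced.
y = (categories == 0).astype(int)
clf = LinearSVC(C=1.0, max_iter=10000)
scores = cross_val_score(clf, X, y, cv=5, scoring="balanced_accuracy")
print("cross-validated balanced accuracy:", scores.mean().round(3))
```

If a readout like this did no better than chance, the same operational definition would call that information implicit, or tangled.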
1623 00:58:54,920 --> 00:58:57,170 We recorded them sequentially and then pieced together
1624 00:58:57,170 --> 00:58:59,430 this population vector.
1625 00:58:59,430 --> 00:59:01,940 So these are spikes simulated from
1626 00:59:01,940 --> 00:59:03,630 a population of IT sites.
1627 00:59:03,630 --> 00:59:05,069 We could do various things.
1628 00:59:05,069 --> 00:59:07,110 In fact, I think Gabriel did everything possible,
1629 00:59:07,110 --> 00:59:08,345 as I remember at the time.
1630 00:59:08,345 --> 00:59:10,470 And one of the things we did was just count spikes.
1631 00:59:10,470 --> 00:59:12,060 One of the simple things that turns out to work quite well
1632 00:59:12,060 --> 00:59:14,470 is to count the spikes over 100 milliseconds.
1633 00:59:14,470 --> 00:59:15,780 So for this neuron, you count spikes
1634 00:59:15,780 --> 00:59:17,580 and that gives you one number; for that neuron,
1635 00:59:17,580 --> 00:59:19,590 you count spikes and get another number.
1636 00:59:19,590 --> 00:59:22,380 So if you have n neurons, you get n numbers.
1637 00:59:22,380 --> 00:59:25,770 So it's a point in an n-dimensional state space, where
1638 00:59:25,770 --> 00:59:27,310 n is the number of neurons.
1639 00:59:27,310 --> 00:59:29,370 And then we had already pre-divided the images
1640 00:59:29,370 --> 00:59:32,280 into different categories, as shown here.
1641 00:59:32,280 --> 00:59:33,460 These are the categories.
1642 00:59:33,460 --> 00:59:36,270 And again, we just asked how well
1643 00:59:36,270 --> 00:59:37,950 you could do faces versus non-faces,
1644 00:59:37,950 --> 00:59:41,186 toys versus non-toys, and so on and so forth.
1645 00:59:41,186 --> 00:59:42,060 These are old slides.
1646 00:59:42,060 --> 00:59:43,500 But you get the idea: basically, you
1647 00:59:43,500 --> 00:59:45,420 don't need that many sites to get
1648 00:59:45,420 --> 00:59:47,280 to very high levels of performance
1649 00:59:47,280 --> 00:59:49,980 on both categorization and identification.
1650 00:59:49,980 --> 00:59:51,510 The interesting thing about this was
1651 00:59:51,510 --> 00:59:54,389 that you could solve simple forms
1652 00:59:54,389 --> 00:59:56,430 of this invariance problem in this representation
1653 00:59:56,430 --> 00:59:57,450 quite easily.
1654 00:59:57,450 --> 01:00:00,900 If you just trained on the central objects--
1655 01:00:00,900 --> 01:00:03,670 the simple three-degree size, center position--
1656 01:00:03,670 --> 01:00:06,180 and tested on the same thing, just
1657 01:00:06,180 --> 01:00:08,730 held-out repeats of this data, you did quite well.
1658 01:00:08,730 --> 01:00:09,930 That's a baseline.
1659 01:00:09,930 --> 01:00:12,810 But what's interesting is you test at different positions
1660 01:00:12,810 --> 01:00:13,740 and scales.
1661 01:00:13,740 --> 01:00:15,780 And then you also do nearly as well.
1662 01:00:15,780 --> 01:00:19,050 So you naturally generalize to these other conditions
1663 01:00:19,050 --> 01:00:21,450 by training on these simple conditions.
1664 01:00:21,450 --> 01:00:23,760 So this is evidence that the population
1665 01:00:23,760 --> 01:00:26,850 is a good basis set for solving these kinds of problems.
1666 01:00:26,850 --> 01:00:29,520 A small number of training examples on this population
1667 01:00:29,520 --> 01:00:31,523 then generalizes well across the conditions
1668 01:00:31,523 --> 01:00:33,690 that make the problem hard.
1669 01:00:33,690 --> 01:00:35,620 So again, we published that a long time ago.
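Here is a minimal sketch of that generalization test, not the published analysis: count spikes over a 100 millisecond window to get one number per site, train a linear classifier only on the central, standard-size condition, and then test it on shifted and rescaled versions of the same objects. The simulated responses, and how much position and scale tolerance is built into them, are illustrative assumptions.

```python
# A minimal sketch of the generalization test (not the published analysis):
# count spikes over a 100 ms window to get one number per site, train a linear
# classifier only on the central, standard-size condition, then test it on
# shifted and rescaled versions of the same objects. The simulated responses,
# and how tolerant they are across conditions, are illustrative assumptions.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)

n_sites = 200
n_objects = 8
n_trials = 20
conditions = ["center_3deg", "shifted", "larger", "smaller"]

# Assume each object evokes a characteristic pattern across IT sites that is
# largely preserved (tolerant) across position and scale changes.
object_patterns = rng.normal(0.0, 1.0, size=(n_objects, n_sites))

def population_responses(obj, condition):
    # A small condition-dependent perturbation models imperfect tolerance.
    perturbation = 0.0 if condition == "center_3deg" else 0.3
    base = object_patterns[obj] + rng.normal(0.0, perturbation, size=n_sites)
    return base + rng.normal(0.0, 1.0, size=(n_trials, n_sites))  # trial noise

def build_set(condition):
    X = np.vstack([population_responses(o, condition) for o in range(n_objects)])
    y = np.repeat(np.arange(n_objects), n_trials)
    return X, y

# Train only on the central, three-degree condition...
X_train, y_train = build_set("center_3deg")
clf = LinearSVC(max_iter=10000).fit(X_train, y_train)

# ...and test on new trials of that condition and on the transformed conditions.
for cond in conditions:
    X_test, y_test = build_set(cond)
    print(cond, "accuracy:", round(clf.score(X_test, y_test), 3))
```

The design choice mirrors the point of the result: if the population patterns are largely preserved across position and scale, a readout trained on one condition transfers to the others with little loss.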
1670 01:00:35,620 --> 01:00:37,120 This was an early step to say, look,
1671 01:00:37,120 --> 01:00:39,780 the phenomenology looks right for the story that I've
1672 01:00:39,780 --> 01:00:41,820 been telling you so far.
1673 01:00:41,820 --> 01:00:45,450 You can't do this easily in earlier visual areas like V1,
1674 01:00:45,450 --> 01:00:47,270 or simulated V1, or V4.
1675 01:00:47,270 --> 01:00:49,920 And we later showed that in a number of ways.
1676 01:00:49,920 --> 01:00:52,740 This is consistent with the work I was showing you
1677 01:00:52,740 --> 01:00:54,780 from Logothetis-- position tolerance, size
1678 01:00:54,780 --> 01:00:56,620 tolerance, the selectivity.
1679 01:00:56,620 --> 01:00:59,780 It's really just an explicit test of the idea of population
1680 01:00:59,780 --> 01:01:00,780 encoding.
1681 01:01:00,780 --> 01:01:03,750 So the take-home here is that there's this explicit object
1682 01:01:03,750 --> 01:01:05,354 representation in IT.
1683 01:01:05,354 --> 01:01:06,770 I haven't yet proven to you the link
1684 01:01:06,770 --> 01:01:08,970 from this to predictive models of decoding.
1685 01:01:08,970 --> 01:01:09,920 We're going to talk about that next.
1686 01:01:09,920 --> 01:01:11,794 But this was some of the important population
1687 01:01:11,794 --> 01:01:14,120 phenomenology that we did.
1688 01:01:14,120 --> 01:01:15,500 What I've tried to tell you today--
1689 01:01:15,500 --> 01:01:17,750 hopefully I've introduced you to the problem of visual object
1690 01:01:17,750 --> 01:01:19,416 recognition and the way we restricted it
1691 01:01:19,416 --> 01:01:20,924 to core object recognition.
1692 01:01:20,924 --> 01:01:23,340 We talked a lot about predictive models as being the goal,
1693 01:01:23,340 --> 01:01:24,950 although I haven't presented much to you yet.
1694 01:01:24,950 --> 01:01:26,997 Hopefully, that's the second part of the talk.
1695 01:01:26,997 --> 01:01:28,830 I've given you a tour of the ventral stream.
1696 01:01:28,830 --> 01:01:30,440 But it was a poor tour.
1697 01:01:30,440 --> 01:01:31,940 I'm sure everybody I work with would
1698 01:01:31,940 --> 01:01:34,231 say that I've neglected all this work, because there's
1699 01:01:34,231 --> 01:01:38,300 no way I can do it all in even a whole week.
1700 01:01:38,300 --> 01:01:40,820 I just tried to hit some of the highlights for you.
1701 01:01:40,820 --> 01:01:42,650 And I told you that the IT population
1702 01:01:42,650 --> 01:01:44,900 seems to have solved a key problem,
1703 01:01:44,900 --> 01:01:47,390 this sort of invariance problem that I set up.
1704 01:01:47,390 --> 01:01:50,240 And one way to step back is to say that, over the last 40 years
1705 01:01:50,240 --> 01:01:53,150 or so, from those early studies of Charlie Gross
1706 01:01:53,150 --> 01:01:56,270 or even Hubel and Wiesel, we, the field of ventral stream
1707 01:01:56,270 --> 01:01:58,920 physiology, have largely described important
1708 01:01:58,920 --> 01:01:59,540 phenomenology.
1709 01:01:59,540 --> 01:02:02,780 Even that last study is population phenomenology.
1710 01:02:02,780 --> 01:02:06,237 And so now we need these more advanced models.
1711 01:02:06,237 --> 01:02:08,570 So the next phase of the field is developing and testing
1712 01:02:08,570 --> 01:02:10,016 these predictive models that I've
1713 01:02:10,016 --> 01:02:11,390 motivated at the beginning, but I
1714 01:02:11,390 --> 01:02:13,230 haven't given you much of yet.
1715 01:02:13,230 --> 01:02:16,430 So this was hopefully a bit of history and set some context
1716 01:02:16,430 --> 01:02:18,370 for where we are.