The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation, or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at OCW.MIT.edu.

JAMES DICARLO: I'm going to shift more towards this decoding space that we talked about, the linkage between neural activity and behavioral report. And I introduced that a bit. You just saw that there's some powerful population activity in IT, and I'm going to expand on that a bit here. But stepping back, when you think again about what I call an end-to-end understanding, going from the image all the way to neural activity to the perceptual report, one of the things we want to do is define a decoding mechanism that the brain uses to support these perceptual reports. Basically, what neural activity is directly responsible for these tasks? And I'll come back later to this encoding side. Notice I'm putting these in this order, right? Once you know what the relevant aspects of neural activity are in IT, or wherever you think they are, that sets a target for what image-to-neural transformation you're trying to explain. Not predict any neural response, but those particular aspects of the neural response. So that's what I mean by the relevant ventral stream patterns of activity. So we start here, we work to here, and then we work to here, rather than the other way around.

OK, so I'm going to keep with the domain I set up. I talked about core recognition. I now need to start to define tasks. I'm going to talk about specific tasks that are, for now, let's call them basic-level nouns. I'm actually going to relax that to subordinate tasks in a minute. But here they are: car, clock, cat. These are not the actual nouns. I'll show you the ones we use.
But just to fix ideas, we're imagining a space of all possible nouns that you might use to describe what you just saw. And I'm going to have a generative image domain. So I now have a space of images here. I'm not just going to draw these off the web. We're going to generate our own image domain that we think engages the problem, but gives us control of the latent variables. So I'll show you that now. The way we're going to do this is by generating one foreground object in each image that we're going to show. And we did this by taking 3-D models like these. This is a model of a car, and we can control its other latent variables beyond its identity. So this is a car; it has a particular car type. So there are a couple of latent variables about identity here that relate to the geometry. Then there are other latent variables like position, size, and pose that I mentioned, which are unknowns that make the problem challenging. And we can then just render this thing, and we could place it on any old background we wanted to. What we did was tend to place them on uncorrelated naturalistic backgrounds. And that creates these sort of weirdish looking images. Some of them may look sort of natural; this one looks pretty unnatural. But why would you do this? We did this because we had a generative space, so we know what's going on with the latent variables we care about. And also, when we built this, it was challenging for computer vision systems to deal with, even though humans could do it naturally. They don't have the advantage of any contextual cues here, because by construction these are uncorrelated. We just took natural images and would randomly put objects on them. But this was enough to fool a lot of the computer vision systems at the time, which tended to rely on contextual cues, like blue in the background signaling that it's an airplane. We didn't want those kinds of things being done.
We wanted the actual extraction of object identity. And again, humans could do it quite well. So that's why we ended up in this sort of no man's land of image space, which is not very simple, but not ImageNet just pulled off the web. And that's how we got there. And just to give you a sense that this is actually quite doable for humans, I'll show you a few images. I won't even cue you on what they are. I'm going to show them for 100 milliseconds. You can kind of shout out what object you see.

AUDIENCE: Car.

AUDIENCE: [INAUDIBLE]

JAMES DICARLO: Right. So see, it's pretty straightforward, right? Even though those images look weird, you can do that quite well. And here are the kinds of images that we would generate. When we think of image bags, we think of partitions of image space. These are some images that would correspond to faces. These are all images of faces under some transformations, again on different backgrounds. These are not faces; these are other objects, again under transformations. And we can have as many of these as we want. We call this distinction, when shown for 100 milliseconds, one core recognition test: discriminate face from not-face. Here is a subordinate task: beetle from not-beetle. This is a particular type of car, and you can see it's more challenging. Again, we don't show these images like this. This is just to show you the set; we show them one at a time. And so let me now go ahead and say, we're going to try to make a predictive model using that kind of image space, to see if we can understand what are the relevant aspects of neural activity that can predict human report on an image space. And when I say we, I mean Najib Majaj and Ha Hong, the post-doc and graduate student in the lab who led this experimental work. And Ethan Solomon and Dan Yamins also contributed to the work.
So what we did was record a bunch of IT activity, to measure what's going on in the population as I showed you earlier, but now in this more defined space, where we're going to collect a bunch of human behavior to compare possible ways of reading IT with the behavior of the human. This is how we started. We're now doing monkeys where we're recording while the monkey is doing a task. But what we did here was passively fixating monkeys, compared with behaving humans. And as I showed you earlier, monkeys and humans have very similar patterns of behavior. So we record from IT. In this case, we were using array recording electrodes. These are chronically implanted; this shows them here. You implant them during a surgery, as is kind of shown here, down in the IT cortex. You can get their size here. There are actually 96 electrodes on each of them. They typically yield about half of the electrodes having active neurons on them, so you get on the order of 150 recording sites. And you can lay them out. We would typically lay out three of them across IT and V4 to record a population sample out of IT. And we would do this across multiple monkeys. And here's an example of the kind of data we would get. This is 168 IT recording sites. This is similar to what I showed you earlier: the mean response in a particular time window out of IT, similar to that study with Gabriel. And what I'm showing here is just to give you a feel. That's one image. Here are seven more images, and these are just the population vectors in graphic form. But we actually collected 2,560 images. This is the mean response data of these 168 neurons. And now you have, again, this rich population data.
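Just to fix ideas about the data format, here is a little sketch, not our actual analysis code, of the kind of "images by recording sites" matrix being described; the spike counts below are made up, and only the image, site, and window numbers come from the numbers quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake stand-in for the recordings: spike counts per repetition, image, and site.
n_images, n_sites, n_reps = 2560, 168, 10
spike_counts = rng.poisson(lam=5.0, size=(n_reps, n_images, n_sites))

# Trial-averaged firing rate per image and site, counted in a fixed ~100 ms window.
# Each row is the "population vector" for one image.
window_sec = 0.100
rates = spike_counts.mean(axis=0) / window_sec   # shape: (n_images, n_sites)

print(rates.shape)    # (2560, 168)
print(rates[0, :5])   # first image, first 5 sites
```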
And you can ask, what's available in there to support these tasks? And how well does it predict human patterns of performance on those tasks? In this study, that's all we were asking. We're trying to do more recently. But let me show you what we observed: even though you saw that you could do car, you could do faces, and it seemed like you were doing 100%, it turns out you're better at some things than others. So this is a d-prime map of humans. Red means good performance, high d-prime. A d-prime of 3, and the psychophysicists in the room may correct me, is on the order of 90 to 95% correct, in that range. So these are very high performance levels when you get up to 5. Zero is chance. And this isn't 50%; this is an eight-way task, so chance is one in eight correct. The subjects were doing either eight-way basic-level tasks, or eight-way subordinate cars, or eight-way faces. And these are the d-prime levels under different amounts of variation of those other latent variables: position, size, and pose. Don't worry about those details. What I want you to see is the color here. So look at tables: discriminating tables from all these other objects, you do that at a very high d-prime. Discriminating beetles from other cars, you do at a slightly lower d-prime. You can see this especially at high variation, where you're actually starting to get down to lower performance. And faces, one face versus another face, you're actually quite poor at. You're a little bit better than chance. But it's actually quite challenging, in 100 milliseconds, without hair and glasses, to discriminate those 3-D kinds of face models. I showed you Sam and Joe earlier as examples. It's actually quite challenging for humans to do that in that domain of faces.
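For reference, the d-prime being plotted is the standard signal-detection quantity, d' = Z(hit rate) minus Z(false-alarm rate). A tiny sketch, with made-up rates, just to anchor the "d-prime of 3 is roughly 90 to 95% correct" intuition:

```python
import numpy as np
from scipy.stats import norm

def dprime(hit_rate: float, false_alarm_rate: float, clip: float = 1e-3) -> float:
    """Signal-detection d' = Z(hit rate) - Z(false-alarm rate).

    Rates are clipped away from 0 and 1 so the z-transform stays finite.
    """
    hr = np.clip(hit_rate, clip, 1 - clip)
    fa = np.clip(false_alarm_rate, clip, 1 - clip)
    return norm.ppf(hr) - norm.ppf(fa)

# Roughly 95% hits with 5% false alarms lands near d' = 3.3.
print(dprime(0.95, 0.05))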
So what I want to show you here is that you have this pattern of behavioral performance, and you have all this IT activity. This is humans; this is monkeys. And what we wanted to do is say, look, we can use this pattern. It's very repeatable across humans. Can we use this repeatable behavioral pattern to understand what aspects of this activity could map to it? And again, this pattern is reliable; I just said that. And it's not as if you can predict this pattern by just running classifiers on pixels or V1. In fact, I'll show you that in a minute. But we thought there were some aspects of IT activity that would predict this, and we wanted to try to find those aspects. So again, this was motivated by that study I showed you earlier. Which part of the IT population activity could predict this behavior over all recognition tasks? We're seeking a general decoding model that would work. Here are some specific tasks, but we'd like it to work over any task that we could imagine testing humans on within this domain of taking 3-D models and putting them under variation; to work over that entire domain. That was what we were hoping to do. So again, I'll briefly take you through this, because I already showed you this earlier. We've previously shown that you could take this kind of state space and say, hey, can you separate images of faces from non-faces using these simple linear classifiers, which are essentially weighted sums on the IT activity? And now we wanted to ask, could this predict human behavioral face performance, and monkey, because again, they're very similar? And not only would this class of decoding models that was motivated by the earlier work predict this task, but would it predict car detection? Would the same model predict car one versus car two? That's a subordinate task. And all such tasks.
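Just to make the "learned weighted sum" readout concrete, here's a little sketch, not our actual analysis code: a linear classifier trained on a hypothetical matrix of trial-averaged IT responses with synthetic face labels, then scored on held-out images. Only the logic is the point; the data and the face signal are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical stand-ins: an images x sites matrix of trial-averaged IT responses,
# and a label saying which images contain a face.
n_images, n_sites = 2560, 168
rates = rng.normal(size=(n_images, n_sites))
is_face = rng.integers(0, 2, size=n_images)
rates[is_face == 1, :20] += 0.5   # give a subset of fake sites some face signal

# A simple linear readout: a learned weighted sum over IT sites plus a threshold,
# trained on some images and tested on held-out ones.
decoder = LogisticRegression(max_iter=1000)
acc = cross_val_score(decoder, rates, is_face, cv=5).mean()
print(f"held-out face vs. non-face accuracy: {acc:.2f}")
```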
Again, over the whole domain, can you take the same decoding strategy, take the data, and say: I'm going to learn on a certain number of training examples, build a classifier, and then say that's my model of how the human does every one of these tasks? And if that's true, then it should perfectly predict that pattern of performance that I just showed you earlier. And so here, again, was the working hypothesis: passively evoked spike rates, using a single fixed time scale, that are spatially distributed because they're sampled over IT, over a single fixed amount of non-human primate cortex. So a single number of neurons, and learned from a reasonable number of training examples. All of that is a decoding class of models that we thought might work. And if this is correct, which is what I just said, it should predict the behavioral data that we collect. For example, the d-prime data I just showed you, but also, in principle, more fine-grained behavioral data. So I want to step back to make it clear that it's not obvious that this should work, right? In the audience I get people on completely different sides of whether this should work or not. So one objection is, well look, it's passively evoked. You heard Gabriel say he didn't like passive tasks, and I agree with that. In the ideal world, the animal would be actively doing the task, and then you'd measure while the animal is doing the task. That's going to be your best chance of prediction. But we also saw earlier what the passively fixating monkey still gives you. Nobody would argue that passively evoked retinal data is not going to be somewhat applicable to vision. And the question is, how much of those arousal effects show up in a place like IT cortex, which is high up in the ventral stream? So you could argue both sides of this.
But it's possible that attentional or arousal mechanisms are needed to make this a good predictive linkage, beyond activating IT in this sort of crude way, if you like. Some people have pointed out that you need the trial-by-trial coordinated spike timing structure to actually make good predictions, that those are critical. Some people have pointed out that you have to assign different parts of IT to particular roles, which is a prior on the decoding space. For instance, you could believe that biologically, when an animal is born, there's some tissue that's going to be dedicated to faces, and you have to wire the downstream neurons to that tissue. That means you're going to restrict the decoding space, rather than just letting downstream neurons learn from IT as if they collected samples off of all of IT. I think some people implicitly believe that, even if it's not stated quite that way. Maybe IT does not directly underlie recognition. You could imagine that; it's not known for sure, and some lesions of IT don't produce deficits in recognition. That's a possibility. Maybe you need too many training examples. Maybe monkey neural codes cannot explain human behavior; but again, I already showed you monkeys and humans are very similar. So these are reasons you might say this is negative and might not work. And you've probably already guessed that I'm listing all these negatives because it turns out this simple thing works quite well for the grain of behavior that I've shown you so far. And here's my evidence of that. This is the actual behavioral performance of humans that I showed you earlier; this is mean d-prime. This is the predicted behavioral performance from taking a classifier reading from that IT population data that I've shown you, which gives a predicted d-prime. We first chose a decoder. We had to match things like the number of neurons.
We had to get it in the ballpark, because again, there's a free variable, as I showed you earlier. There's at least one. But for now, think of matching the number of neurons to get you near the diagonal, so that you have a sufficient number of neural recordings to say how well you do on a face detection task. And then here are all the other tasks. These are those 64 points that I showed you earlier. Here are some examples, like fruit versus other things, car versus other things. And you should see that all these points kind of line up along this diagonal, which says, wow, this is actually quite predictive: I can take this simple thing and predict all the stuff that we've collected so far. So let me now be more concrete about the inferred neural mechanism that we're testing here. I'll show you in a minute. For each new object, we think what happens is that some downstream observer, a downstream neuron, randomly samples roughly 50,000 single neurons, spatially distributed over all of IT, not biased to any compartments. It listens to each IT site. When I say listen, in this case we think it could average over 100 milliseconds. We're not sure about this; it's just the version that's shown here. It learns an appropriate weighted sum of that IT spiking. And then it listens to about 10%. That is, once you learn, about 10% of the IT neurons are heavily weighted for each of the tasks. That's just an observation that we have in our data. But this is trying to map these decoder versions out of IT into neuroscientist language. So that is a model that says: learn weighted sums of 50,000 random, 100-millisecond-averaged single-unit responses distributed over all of IT. A bunch of stuff in there is what your model is encapsulating. That's still too long, so I made a little acronym out of it, and I call that the LaWS of RAD IT decoding mechanism.
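Just to collect the ingredients of that hypothesis in one place, here is a rough parameter summary; the numbers are the ones quoted above, and the field names are just made up for the sketch.

```python
from dataclasses import dataclass

@dataclass
class LawsOfRadITReadout:
    """Rough summary of the hypothesized decoding mechanism (numbers from the talk)."""
    n_sampled_neurons: int = 50_000                 # randomly sampled single units
    spatial_sampling: str = "random over all of IT, no compartment bias"
    averaging_window_ms: int = 100                  # spikes averaged over ~100 ms
    readout: str = "learned weighted sum (linear classifier)"
    heavily_weighted_fraction: float = 0.10         # ~10% of units dominate each task

print(LawsOfRadITReadout())
```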
So this is just to say there's a hypothesis of how everything might work, but it can now make predictions for other objects and could potentially be falsified. So far, this model works quite well over these tasks. In fact, the correlation is 0.92. You might look at this and say, oh, it's not perfect. But it turns out that that's about the level at which humans differ from each other. So it's passing a Turing test, in that this mechanism, read off of the monkey IT, hides in the distribution of the human population that we're asking to perform these same tasks. It can't be distinguished from a human on these tasks. Did you guys watch "Ex Machina"? It doesn't pass that test; it passes just a simple core recognition test. But that was a Turing test of this. So OK, this is quantified here. This is human-to-human consistency. That's the range I just mentioned; you've got to get into here to pass our Turing test. And that's the decoding mechanism I just showed you. There are other ways of reading out of IT that don't pass. There are ways of reading out of V4, which we also recorded from; none of the ones we've tried are able to get you up to here. That doesn't mean V4 isn't involved. V4 is the feeder to IT. It just means you can't take simple decodes off of V4 and naturally produce this pattern. And that's similar for pixel or V1 representations. So lower-level representations don't naturally predict this pattern of behavior. And even some computer vision codes that we tested at the time, as you can see, those of you who know these older computer vision models can see they didn't do this. But more recent computer vision models actually do, and I'll show you that at the end. OK. This is a little bit for the aficionados, to tell you how we got there. As we increase the number of units in IT, that drives performance up. So as you read more and more units out of IT, you get better and better performance.
That's also true out of V4. But what I'm trying to show you here is that it's not the absolute performance that is the right thing to compare between a model and actual behavioral data. It's the pattern of performance, which we call the consistency with the humans. That's the correlation along the diagonal that I showed you earlier: tasks that are hard for the models are also hard for the humans, and tasks that are easy for humans are also easy for the models. And you could imagine doing that not just at the task level, but at the image level as well. Anyway, that's what's quantified here. And you see what happens when you get up to around 100 sites or so; I showed you 168 recordings out of IT. This point right there is about 500 IT features. And taking you through some things that maybe I won't have time for, that's actually how we approximate that 50,000-single-IT-neuron number. That's an inference from our data; we didn't actually record 50,000 single neurons. But from these kinds of plots, we're able to make a pretty good guess that this kind of model right here would land right there: consistent with humans, and at the absolute level of performance that humans matched. And the models we tried out of V4, this is one example of them, can get performance. But they don't match this pattern of performance naturally. They over-perform on some tasks and under-perform on others. They sort of reveal themselves as not being human-like by being too good at some things, right? So that's a way to fail the Turing test. OK. Maybe I'll skip through this; it's sort of the same thing. This is about training examples. If you guys care about this, I could take you through it. There's actually a family of solutions in there, and I'm just telling you about one of them for simplicity. So let me take it down to another grain.
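Here's a little sketch of that consistency comparison, not our actual code, using entirely made-up per-task d-prime patterns. The only point is the logic: correlate the model's task-by-task pattern with the human pattern as the number of units read out grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake human per-task d' pattern over 64 tasks.
n_tasks = 64
human_dprime = rng.uniform(0.5, 4.5, size=n_tasks)

def decoder_dprime(n_units: int) -> np.ndarray:
    # Fake stand-in: more units raises performance and tightens the match to humans.
    noise = rng.normal(scale=3.0 / np.sqrt(n_units), size=n_tasks)
    return np.clip(human_dprime * n_units / (n_units + 150) + noise, 0.0, None)

for n_units in (16, 64, 168, 500):
    pred = decoder_dprime(n_units)
    consistency = np.corrcoef(human_dprime, pred)[0, 1]
    print(f"{n_units:4d} units: mean d' = {pred.mean():.2f}, consistency r = {consistency:.2f}")
```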
511 00:18:17,166 --> 00:18:18,790 So that was the pattern of performance, 512 00:18:18,790 --> 00:18:20,164 it's actually naturally predicted 513 00:18:20,164 --> 00:18:22,990 by this first decoding mechanism that we tried. 514 00:18:22,990 --> 00:18:24,830 But what about the confusion pattern? 515 00:18:24,830 --> 00:18:27,370 So not just the absolute D primes for each of these tasks, 516 00:18:27,370 --> 00:18:30,640 but there's finer grained data, like how often an animal is 517 00:18:30,640 --> 00:18:33,160 confused with a fruit, or an animal's confused with a face. 518 00:18:33,160 --> 00:18:34,960 These are the confusion pattern data here. 519 00:18:34,960 --> 00:18:36,670 I'm sorry I don't have the color bars up. 520 00:18:36,670 --> 00:18:38,753 All I'm going to need you to do is say, well these 521 00:18:38,753 --> 00:18:41,380 are the confusion patterns that we predicted. 522 00:18:41,380 --> 00:18:44,710 And this is what is the predicted confusion pattern, 523 00:18:44,710 --> 00:18:49,630 if I gave the machine, the IT, these ground truth labels. 524 00:18:49,630 --> 00:18:50,810 And it predicts this. 525 00:18:50,810 --> 00:18:52,370 This is what actually happened in human data. 526 00:18:52,370 --> 00:18:53,920 And what I want to sort of look at this and this, and say, 527 00:18:53,920 --> 00:18:55,900 there actually look quite similar. 528 00:18:55,900 --> 00:18:58,270 Their noise corrected correlation is 0.91. 529 00:18:58,270 --> 00:19:00,940 So they were still quite good at predicting confusion patterns. 530 00:19:00,940 --> 00:19:03,420 Although this did not hold up fully. 531 00:19:03,420 --> 00:19:04,570 We're only at 0.68. 532 00:19:04,570 --> 00:19:05,410 I say only. 533 00:19:05,410 --> 00:19:07,300 Some people would say this is success. 534 00:19:07,300 --> 00:19:09,320 We're only at 0.68 on high variation. 535 00:19:09,320 --> 00:19:11,020 So there's a failure here of the model. 536 00:19:11,020 --> 00:19:13,360 That should be at 1, because it's noise corrected. 537 00:19:13,360 --> 00:19:14,980 So there's something about this that's 538 00:19:14,980 --> 00:19:17,410 not quite right at predicting the confusion 539 00:19:17,410 --> 00:19:19,600 patterns of humans at high variation images. 540 00:19:19,600 --> 00:19:22,960 And that to us, that's an opening to push forward, right? 541 00:19:22,960 --> 00:19:24,730 So this is a strategy going forward 542 00:19:24,730 --> 00:19:28,300 as we have an initial guess of how you read out of IT. 543 00:19:28,300 --> 00:19:30,730 It looks pretty good for first grain test. 544 00:19:30,730 --> 00:19:32,800 But now we can turn the crank harder. 545 00:19:32,800 --> 00:19:33,940 We need more neural data. 546 00:19:33,940 --> 00:19:36,970 We need more psychophysics, finer grained measurements 547 00:19:36,970 --> 00:19:38,650 to sort of distinguish among, not just 548 00:19:38,650 --> 00:19:41,560 say IT's better than V4 or those other representations. 549 00:19:41,560 --> 00:19:44,380 But what exactly about the IT representation? 550 00:19:44,380 --> 00:19:45,550 Is it 100 milliseconds? 551 00:19:45,550 --> 00:19:46,764 What time scale? 552 00:19:46,764 --> 00:19:48,430 Maybe those synchronous codes do matter. 553 00:19:48,430 --> 00:19:50,430 Some of those things that I put on there earlier 554 00:19:50,430 --> 00:19:53,170 might start to matter when we push the code-- push 555 00:19:53,170 --> 00:19:54,370 this even further. 
So the take-home here is that you do quite well with this first-order rate code read out of IT. But now there's an opportunity to dig in and ask, at what point do these readouts break down? And what kind of decoding models are you going to replace them with? That's what we're trying to do. I've told you that IT does well at identity. But remember I showed you those manifolds earlier and said there are other latent variables, like position and scale, and that those don't get thrown away. They just get unwrapped, right? Remember that manifold picture I showed earlier? So one of the things we've been doing recently is asking, because we built these images, we know those other latent variables, like position and pose. That was one of the advantages of building the images this way. And we've been asking how well IT encodes those other latent variables: the pose of the object, the position of the object. And to make a long story short, IT not only has information about these kinds of variables, which is really not surprising, because others have shown there's information about those kinds of things before, but that's sort of what's shown here. Everything I'm showing here: here's IT, V4, simulated V1, and pixels. And everything goes up along the ventral stream for the other variables, which may be non-intuitive to some of you, because position is supposed to be V1. But the position of an object on a complex background is better decoded from IT. That's one example. All these latent variables go up along the ventral stream in terms of their ease of decoding.
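That "ease of decoding" comparison can be sketched as a cross-validated linear regression from each representation to a known latent variable taken from the generative image parameters. Everything below is synthetic, just to show the recipe, not our actual pipeline.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical feature matrices for the same images from different representations,
# plus one known latent variable (here, horizontal position) from the image generator.
n_images = 2560
position_x = rng.uniform(-3, 3, size=n_images)

features = {
    "pixels": rng.normal(size=(n_images, 256)),
    "V1-like": rng.normal(size=(n_images, 256)),
    "V4 sites": rng.normal(size=(n_images, 128)),
    "IT sites": rng.normal(size=(n_images, 168)) + 0.3 * position_x[:, None],  # fake signal
}

# Ease of decoding = cross-validated R^2 of a simple linear regression
# from each representation to the latent variable.
for name, X in features.items():
    r2 = cross_val_score(Ridge(alpha=1.0), X, position_x, cv=5, scoring="r2").mean()
    print(f"{name:9s}: cross-validated R^2 = {r2:.2f}")
```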
But what I'm most excited about is that if you do this comparison with humans again, you get a pretty decent, though not quite as tight, correlation between the actual measured human behavioral performance at estimating those other latent variables and the predicted behavioral performance out of IT. And again, much better correlations. It's not perfect, so again there's some gap here, some failure of understanding. But it's much better than if you read out of V4, V1, or pixels. So this says that the representation isn't just an identity thing. It seems like this representation could underlie some of these other judgments, at least in the central 10 degrees, for the sort of foreground objects we've been measuring here. Don't worry about the details on here; that's the upshot of what I'm trying to say with this slide. But I just wanted to put that out there so you didn't forget that you haven't thrown away all this other interesting stuff about what's out there in the scene. OK. I've sort of alluded to this a bit already. I want to come back to what is now Marr level 3 stuff, right? You have this idea of what you're trying to solve. You have an algorithm that's a decoder on a basis, which looks like it predicts pretty well. It's not perfect; there's work to be done there. But it actually does quite well. Now what does that mean at the physical hardware level? That's Marr level 3. So here's how I visualize it. You have IT cortex, by which I mean AIT and CIT. It's about 150 square millimeters in a monkey. And remember I told you there was about a 1 millimeter scale of organization? I showed you that earlier. And others have shown, and I showed this earlier too, that there are sort of face regions. So I've drawn them just for scale here, just a schematic.
Those are slightly bigger organizations; they're 2 to 5 millimeters. So I think of IT as being this set of something like 100 to 200 little modules, similar to Tanaka. This is not a new conceptual idea. The simple version would be that each millimeter does exactly the same thing, is a feature. And if you sample off of that, you might take 5,000 neurons, but they're really sampling from only about 150 IT features at the 1 millimeter scale. Remember, I don't know if you caught this, but I showed you that 168 IT neurons predicted the pattern of human performance. I showed that a few slides ago. But I told you the real number of neurons is probably 50,000. Most of those are redundant copies of that 168-dimensional feature set. That's how we think about it. So you could imagine it's just a redundant set of, I like to think, about 100 features in IT, which are sampled, maybe randomly, by downstream neurons whose weights are then learned. So when you learn faces versus other things, well, there's lots of good information about faces versus other things in these face patches; that's how they're defined. So this downstream neuron is going to lean heavily on those neurons. And that would make these regions causally involved. So that doesn't mean you had to pre-build anything in here. You just learn this at a downstream stage, and you would get something that looks like it would explain our data. So we like that, because it captures that case. But it also captures the more general case. If you learn cars, you're going to sample from a different subset of neurons, but you're following the same learning rule. That's what I said earlier on. So we think this is the initial state, and this is what happens when you learn objects.
And so what we think you have post-learning is, again, about 100 to 150 IT sub-regions, each at the 1 millimeter scale, that are supporting a number of noun tasks read off this common basis here. That's the model that we like, given the kind of data that I've been showing you. The post-learning model, as we call it. The reason I'm bringing this up is probably for the neuroscientists, to fix ideas about how we think about IT as a basis set. And I think Haim set this up nicely; he implied similar things, that somebody downstream reads from it. OK. But now we're starting to have a more concrete model, where I'm trying to be physical about it: the size of these regions, how many there are, connecting to earlier data. So we're gaining inference on that from these different experiments. And now, if you believe this, it starts to make predictions, and we can do causality, right? Somebody mentioned that earlier. So one of the things we've been doing recently is asking whether we can start to silence bits of tissue. Look, the way I've drawn this, and this is just schematic, this bit of tissue is somehow involved in this task and that task: the face task and the car task. But this bit of tissue, only the face task. And that bit of tissue, only the car task. And this bit of tissue, neither. So if you believe that, and you had the tools, you should be able to go in and start to silence little bits of IT. And you should get predictable patterns of behavioral deficits in the animal when you make those manipulations, right? Everybody follow that? Right? OK. And now the models give you a framework to build those predictions and also to estimate the magnitude of the effects that you should see. And that's what we've been doing more recently.
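Here's a toy version of how such a prediction could be generated, not our actual model: train a linear readout on a full set of simulated "IT features," then zero out the features assigned to one hypothetical 1 millimeter module at test time and look at the drop in performance. All numbers and labels below are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Simulated IT: 100 modules, a few sites per module; 10 modules carry task signal.
n_images, n_modules, sites_per_module = 4000, 100, 5
n_sites = n_modules * sites_per_module
module_of_site = np.repeat(np.arange(n_modules), sites_per_module)

labels = rng.integers(0, 2, size=n_images)            # e.g., a two-way task
rates = rng.normal(size=(n_images, n_sites))
informative = module_of_site < 10
rates[:, informative] += 0.4 * labels[:, None]         # fake task signal in 10 modules

train, test = np.arange(0, 3000), np.arange(3000, 4000)
decoder = LogisticRegression(max_iter=1000).fit(rates[train], labels[train])
baseline = decoder.score(rates[test], labels[test])

# "Silence" one informative module by zeroing its sites at test time only.
silenced = rates[test].copy()
silenced[:, module_of_site == 3] = 0.0
deficit = baseline - decoder.score(silenced, labels[test])
print(f"baseline {baseline:.2%}, predicted deficit from silencing one module: {deficit:.2%}")
```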
I'll just give you a taste of this, because it's really ongoing. But I think it connects to what Gabriel said earlier, that there are now tools available to do this. Oh, I put that in from an earlier talk. I think Google has a thing called Inception, and I don't know, was it Google? Somebody has it. But you can't do inception unless you're actually in a brain. So are we going to try to insert signals? The reason we do this is that my student who is working on it really wants to inject signals into the brain. There's a dream about BMI, right? Could you inject a percept? To do that, you're going to need to do experiments like this and understand this hardware in order to interact with it. It's something we talked about earlier. And actually, Tonegawa's lab has some cool inception-like stuff on memory. But this is like inserting an object or a person, and that has been a dream for many of us for a long time. Can we reliably disrupt performance by suppressing 1 millimeter bits of IT? To do that, what we're doing is testing a large battery of tasks and a battery of suppression patterns. So not just asking, can we affect a face task or one task? Imagine we test a battery of tasks, and the idea is we'd do every bit of IT one by one, and then in combination, and we'd get all that data and figure out what's going on, right? That's sort of the dream, and we're trying to build towards it. Do you guys get it? Right. And we're motivated by this kind of idea here. So to build toward that, we started with the tools. I'm just going to give you a quick tour of the tools we have to start to do this. This is our recording; we can localize what we're recording to a very fine grain using x-rays.
792 00:27:24,649 --> 00:27:26,940 So we know exactly where we're recording the IT to like 793 00:27:26,940 --> 00:27:29,214 about 300 micron resolution. 794 00:27:29,214 --> 00:27:30,880 So that's why I'm putting this slide up. 795 00:27:30,880 --> 00:27:32,463 And what we're interested in is going, 796 00:27:32,463 --> 00:27:35,955 if I silence this bit of IT, or that bit of IT, 797 00:27:35,955 --> 00:27:38,790 or that bit of IT, so actually do this experiment, what 798 00:27:38,790 --> 00:27:40,230 happens behaviorally? 799 00:27:40,230 --> 00:27:43,140 And Arash Afraz is a post-doc in the lab, started 800 00:27:43,140 --> 00:27:44,610 these actual experiments. 801 00:27:44,610 --> 00:27:47,130 And one of the things Arash did was to first say, 802 00:27:47,130 --> 00:27:51,330 let's see if we can get this silencing of optogenetics tool 803 00:27:51,330 --> 00:27:52,290 to work in our hands. 804 00:27:52,290 --> 00:27:54,123 And the reason we were so excited about that 805 00:27:54,123 --> 00:27:55,890 is because we think lesions, if we 806 00:27:55,890 --> 00:27:58,410 can make temporary brief silencing, 807 00:27:58,410 --> 00:28:03,020 that that will give it much more reliable disruption of behavior 808 00:28:03,020 --> 00:28:06,300 that then, if we started to try to inject signals, 809 00:28:06,300 --> 00:28:09,330 which would be our dream, but that seems too risky to us. 810 00:28:09,330 --> 00:28:11,580 We just want to say, what is a temporary lesion 811 00:28:11,580 --> 00:28:13,082 of each bit of IT do? 812 00:28:13,082 --> 00:28:14,790 And optogenetics is cool, because there's 813 00:28:14,790 --> 00:28:17,610 no other technique that can briefly silence-- 814 00:28:17,610 --> 00:28:19,782 temporarily silence activity. 815 00:28:19,782 --> 00:28:21,490 You can do pharmacological manipulations, 816 00:28:21,490 --> 00:28:23,010 but those last for hours. 817 00:28:23,010 --> 00:28:25,544 So this could briefly silence bits of IT. 818 00:28:25,544 --> 00:28:27,210 And that's why we were excited about it. 819 00:28:27,210 --> 00:28:30,120 We also did pharmacological manipulation as a reference 820 00:28:30,120 --> 00:28:30,780 to get started. 821 00:28:30,780 --> 00:28:33,600 But what we're doing is trying to silence 1 millimeter regions 822 00:28:33,600 --> 00:28:36,490 of IT using light delivered through optical fibers 823 00:28:36,490 --> 00:28:38,220 as the recording electrode. 824 00:28:38,220 --> 00:28:41,490 And to silence bits of neurons here. 825 00:28:41,490 --> 00:28:43,757 And so what Arash did was first show 826 00:28:43,757 --> 00:28:45,840 that you can actually silence neurons in this way. 827 00:28:45,840 --> 00:28:48,390 So if you guys haven't seen optogenetics plots, 828 00:28:48,390 --> 00:28:49,560 this is data from our lab. 829 00:28:49,560 --> 00:28:51,060 What's quite cool about this, again, 830 00:28:51,060 --> 00:28:53,290 is you have the same images are being presented. 831 00:28:53,290 --> 00:28:55,110 So this green line should be up here. 832 00:28:55,110 --> 00:28:57,900 But Arash turns a laser on right here, shines light on there. 833 00:28:57,900 --> 00:28:59,970 And there's some opsins expressed in the neurons 834 00:28:59,970 --> 00:29:00,861 in that local area. 835 00:29:00,861 --> 00:29:03,360 And you can see it just sort of shuts the thing down, and it 836 00:29:03,360 --> 00:29:04,830 sort of deletes or blocks this. 837 00:29:04,830 --> 00:29:06,390 You have the same input coming in. 
838 00:29:06,390 --> 00:29:08,300 But you can sort of delete it here. 839 00:29:08,300 --> 00:29:09,670 And this is another example. 840 00:29:09,670 --> 00:29:11,253 These are some pretty strong examples. 841 00:29:11,253 --> 00:29:13,750 It's not always this strong. 842 00:29:13,750 --> 00:29:16,140 But this is, again-- you can see we can return back 843 00:29:16,140 --> 00:29:17,400 to normal right away, right? 844 00:29:17,400 --> 00:29:19,290 So this is a 200 millisecond silencing. 845 00:29:19,290 --> 00:29:21,230 You could go even narrower than that. 846 00:29:21,230 --> 00:29:23,340 But so this is what we had done so far. 847 00:29:23,340 --> 00:29:25,020 And again, what we did was say, look. 848 00:29:25,020 --> 00:29:25,980 This is a risky tool. 849 00:29:25,980 --> 00:29:27,479 It might not have worked at all. 850 00:29:27,479 --> 00:29:29,250 So Arash just wanted to test something 851 00:29:29,250 --> 00:29:31,850 that was likely to work. 852 00:29:31,850 --> 00:29:34,080 And so we picked a face task because there 853 00:29:34,080 --> 00:29:36,480 was a lot of evidence of spatial clustering of faces, 854 00:29:36,480 --> 00:29:39,240 which you'll hear about from Winrich and which is also 855 00:29:39,240 --> 00:29:40,750 known in the literature. 856 00:29:40,750 --> 00:29:42,770 So what Arash did was to say, we picked 857 00:29:42,770 --> 00:29:44,910 a task of discriminating males from females. 858 00:29:44,910 --> 00:29:46,470 We put in our notion of invariance. 859 00:29:46,470 --> 00:29:48,390 It's not just doing it on one image. 860 00:29:48,390 --> 00:29:50,970 But you have to do it across a bunch of transformations. 861 00:29:50,970 --> 00:29:53,127 In this case, identity is a transformation. 862 00:29:53,127 --> 00:29:55,710 So you're saying, all of these are supposed to be called male, 863 00:29:55,710 --> 00:29:57,050 and all these are called female. 864 00:29:57,050 --> 00:29:59,049 And he wanted you to distinguish this from this. 865 00:29:59,049 --> 00:30:00,900 That's what he trained a monkey to do. 866 00:30:00,900 --> 00:30:04,031 And just to give you the upshot: we do all this work, 867 00:30:04,031 --> 00:30:05,280 we silence the bits of cortex. 868 00:30:05,280 --> 00:30:06,870 And here's the big take home. 869 00:30:06,870 --> 00:30:10,140 You get a 2% deficit from silencing single one-millimeter 870 00:30:10,140 --> 00:30:12,750 bits of IT cortex. 871 00:30:12,750 --> 00:30:16,200 Parts of IT cortex, not all of IT cortex, 872 00:30:16,200 --> 00:30:17,670 produce a 2% deficit. 873 00:30:17,670 --> 00:30:19,545 Here's the animal running at about 86% correct. 874 00:30:19,545 --> 00:30:21,086 These are interleaved trials where we 875 00:30:21,086 --> 00:30:22,610 silence some local bit of IT. 876 00:30:22,610 --> 00:30:23,787 You get a 2% deficit. 877 00:30:23,787 --> 00:30:25,620 That's true only in the contralateral field, 878 00:30:25,620 --> 00:30:28,290 not the ipsilateral field, for the aficionados. 879 00:30:28,290 --> 00:30:30,630 You might look at this 2% and go, well, that's tiny. 880 00:30:30,630 --> 00:30:32,190 But when we looked at it, this is exactly what's 881 00:30:32,190 --> 00:30:34,315 predicted by the models that we were talking about. 882 00:30:34,315 --> 00:30:37,140 It's right in the range of what should happen. 883 00:30:37,140 --> 00:30:39,170 And so this, to us, is really quite cool. 884 00:30:39,170 --> 00:30:40,445 This is highly significant.
885 00:30:40,445 --> 00:30:42,570 And now we sort of are in position to start to say, 886 00:30:42,570 --> 00:30:43,870 OK, these tools work. 887 00:30:43,870 --> 00:30:45,330 They do what they're supposed to. 888 00:30:45,330 --> 00:30:47,670 And now we can start to expand that task space. 889 00:30:47,670 --> 00:30:49,620 So this result has been published recently, 890 00:30:49,620 --> 00:30:51,112 if you're interested in this. 891 00:30:51,112 --> 00:30:53,070 And here is one of the ways we're going forward: 892 00:30:53,070 --> 00:30:55,736 Rishi Rajalingham, the one doing those tasks in the monkeys 893 00:30:55,736 --> 00:30:56,670 I showed you earlier, 894 00:30:56,670 --> 00:30:58,390 is silencing different parts of IT. 895 00:30:58,390 --> 00:31:01,320 This is now with muscimol; different bits of IT-- 896 00:31:01,320 --> 00:31:03,570 these are different tasks-- lead to different patterns. 897 00:31:03,570 --> 00:31:04,944 That's what these dots are here-- 898 00:31:04,944 --> 00:31:06,300 different patterns of deficits. 899 00:31:06,300 --> 00:31:08,010 And if you go back to the same location, 900 00:31:08,010 --> 00:31:09,640 you get the same pattern of deficits. 901 00:31:09,640 --> 00:31:11,064 So this is only 10 tasks. 902 00:31:11,064 --> 00:31:12,480 But I think it hopefully gives you 903 00:31:12,480 --> 00:31:14,760 the spirit of what we're trying to do. 904 00:31:14,760 --> 00:31:16,650 And again, this is only muscimol, 905 00:31:16,650 --> 00:31:19,600 which doesn't have all the advantages of optogenetics. 906 00:31:19,600 --> 00:31:22,460 But this is what we're building towards here. 907 00:31:22,460 --> 00:31:26,116 So I'm just giving you the sort of state of the art. 908 00:31:26,116 --> 00:31:27,990 So our aim is to measure the specific pattern 909 00:31:27,990 --> 00:31:30,740 of behavioral change induced by the suppression of each IT sub 910 00:31:30,740 --> 00:31:32,850 region, ideally testing many of them, 911 00:31:32,850 --> 00:31:35,389 and then compare with the model predictions. 912 00:31:35,389 --> 00:31:36,930 I'm saying there's this domain, and I 913 00:31:36,930 --> 00:31:38,220 want to sort of sample the whole domain. 914 00:31:38,220 --> 00:31:41,040 So far, I've given you only samples of tasks in the domain. 915 00:31:41,040 --> 00:31:42,950 But we're really trying to define the domain. 916 00:31:42,950 --> 00:31:43,770 And I'm just-- 917 00:31:43,770 --> 00:31:46,353 I'm going to skip through this just to give you the punchline, 918 00:31:46,353 --> 00:31:48,780 which is that we do a whole bunch of behavioral measurements. 919 00:31:48,780 --> 00:31:50,040 We presented this work before. 920 00:31:50,040 --> 00:31:52,456 This is now up to three million Mechanical Turk 921 00:31:52,456 --> 00:31:53,130 trials. 922 00:31:53,130 --> 00:31:56,550 It seems to us that we can embed all objects, even 923 00:31:56,550 --> 00:31:58,230 subordinate objects, of the type of task 924 00:31:58,230 --> 00:31:59,854 that I've been telling you about, 925 00:31:59,854 --> 00:32:01,830 in essentially a 20 dimensional space. 926 00:32:01,830 --> 00:32:02,940 So there's 20 dimensions. 927 00:32:02,940 --> 00:32:05,190 We infer that humans are projecting 928 00:32:05,190 --> 00:32:07,730 to about 20 dimensions to do the kinds of tasks 929 00:32:07,730 --> 00:32:08,860 that we've shown here.
930 00:32:08,860 --> 00:32:11,310 Which is sort of smaller, but eerily 931 00:32:11,310 --> 00:32:13,380 close to that in the order of magnitude 932 00:32:13,380 --> 00:32:15,900 to that 100 or so features that I've been talking about. 933 00:32:15,900 --> 00:32:19,020 So that's where-- regardless of whether-- these 934 00:32:19,020 --> 00:32:21,580 are some of the dimensions and how we're projecting them. 935 00:32:21,580 --> 00:32:22,930 Again, I won't take you through this, 936 00:32:22,930 --> 00:32:24,490 because I think we've already used up enough time 937 00:32:24,490 --> 00:32:25,906 and I want to get on to this part. 938 00:32:25,906 --> 00:32:28,565 But we're trying to define a domain of all tasks 939 00:32:28,565 --> 00:32:29,940 where we can sort of predict what 940 00:32:29,940 --> 00:32:32,010 would happen across anything within that domain. 941 00:32:32,010 --> 00:32:35,175 And that raises questions of the dimensionality of that domain. 942 00:32:35,175 --> 00:32:37,050 And there were behavioral methods to do that. 943 00:32:37,050 --> 00:32:39,330 And we've been doing some work on that. 944 00:32:39,330 --> 00:32:40,664 So I'll just leave it at that. 945 00:32:40,664 --> 00:32:42,080 And if you guys have questions, we 946 00:32:42,080 --> 00:32:43,567 can talk about that some more. 947 00:32:43,567 --> 00:32:45,150 I want to sort of in the time I really 948 00:32:45,150 --> 00:32:47,910 have left is to talk about the encoding side of things, 949 00:32:47,910 --> 00:32:49,410 because I promised you guys I would get to this. 950 00:32:49,410 --> 00:32:50,868 Unless people have any more burning 951 00:32:50,868 --> 00:32:52,202 questions on this decoding side. 952 00:32:52,202 --> 00:32:53,826 So far I've been talking about the link 953 00:32:53,826 --> 00:32:55,020 between IT and perception. 954 00:32:55,020 --> 00:32:57,900 Now I'm going to switch gears and talk about this other side. 955 00:32:57,900 --> 00:32:59,310 Which is, so I talked about this. 956 00:32:59,310 --> 00:33:01,630 And that tells us that the mean rates in IT 957 00:33:01,630 --> 00:33:03,630 are something that seem to be highly predictive. 958 00:33:03,630 --> 00:33:05,130 I showed you at least one model that 959 00:33:05,130 --> 00:33:06,510 has the laws of RAD IT model. 960 00:33:06,510 --> 00:33:09,300 But now, it's like now, we can turn to the encoding side 961 00:33:09,300 --> 00:33:11,677 and say, we need to predict the mean rates of IT. 962 00:33:11,677 --> 00:33:14,010 And that should be our goal if we want to explain images 963 00:33:14,010 --> 00:33:15,570 to IT activity. 964 00:33:15,570 --> 00:33:18,940 So, these would be called predictive encoding mechanisms. 965 00:33:18,940 --> 00:33:21,360 So, now you guys have heard about 966 00:33:21,360 --> 00:33:22,994 deep convolutional networks. 967 00:33:22,994 --> 00:33:24,660 If not, you've heard about them already, 968 00:33:24,660 --> 00:33:26,409 you'll probably hear about them some more. 969 00:33:26,409 --> 00:33:28,767 So we started messing around in 2008. 970 00:33:28,767 --> 00:33:29,850 This is a model inspired-- 971 00:33:29,850 --> 00:33:31,558 I mentioned this family of models before. 
972 00:33:31,558 --> 00:33:34,430 Hubel and Wiesel, Fukushima, and there's a whole HMAX family 973 00:33:34,430 --> 00:33:38,030 of models-- that really was the inspiration for this larger-- 974 00:33:38,030 --> 00:33:39,930 this large family of models that 975 00:33:39,930 --> 00:33:43,939 have this repeating structure. 976 00:33:43,939 --> 00:33:46,230 The modern-day deep convolutional networks 977 00:33:46,230 --> 00:33:48,640 really grew out of all of this earlier work. 978 00:33:48,640 --> 00:33:51,600 And so we started exploring the family in 2008. 979 00:33:51,600 --> 00:33:54,140 And just, this is a slide that you've already sort of seen 980 00:33:54,140 --> 00:33:56,056 a version of from Gabriel, where, you know, 981 00:33:56,056 --> 00:33:58,125 you take an image, you pass it 982 00:33:58,125 --> 00:33:59,250 through a set of operators. 983 00:33:59,250 --> 00:34:00,180 So you have filters. 984 00:34:00,180 --> 00:34:02,840 So these are dot products over some restricted spatial 985 00:34:02,840 --> 00:34:05,550 region, like receptive fields. 986 00:34:05,550 --> 00:34:08,330 You have a non-linearity, like a threshold and a saturation. 987 00:34:08,330 --> 00:34:10,159 You have a pooling operation. 988 00:34:10,159 --> 00:34:11,409 Then you have a normalization. 989 00:34:11,409 --> 00:34:13,610 So you have all these operations happen here. 990 00:34:13,610 --> 00:34:14,928 And that produces a stack. 991 00:34:14,928 --> 00:34:16,969 So think of it like, if there are four filters here, 992 00:34:16,969 --> 00:34:19,389 like four orientations, you get four images: 993 00:34:19,389 --> 00:34:21,389 you have one image in, you have four images out. 994 00:34:21,389 --> 00:34:23,638 But if you had 10 of these, you'd get 10 of these out. 995 00:34:23,638 --> 00:34:25,125 Then you repeat this here, right? 996 00:34:25,125 --> 00:34:26,750 And so as you keep adding more filters, 997 00:34:26,750 --> 00:34:28,749 this stack just keeps getting bigger and bigger. 998 00:34:28,749 --> 00:34:30,743 And because you're spatially pooling, 999 00:34:30,743 --> 00:34:32,659 it keeps getting narrower and narrower, right? 1000 00:34:32,659 --> 00:34:34,544 So you go from this image to this sort 1001 00:34:34,544 --> 00:34:38,060 of deep stack of features that has less retinotopy. 1002 00:34:38,060 --> 00:34:40,130 It still has a little bit of retinotopy. 1003 00:34:40,130 --> 00:34:42,560 And that, you can see, is exactly why people liked it 1004 00:34:42,560 --> 00:34:44,389 as a model of how people 1005 00:34:44,389 --> 00:34:46,310 think about the ventral stream. 1006 00:34:46,310 --> 00:34:48,830 So these models typically have thousands 1007 00:34:48,830 --> 00:34:52,010 of visual neurons or features at the top level. 1008 00:34:52,010 --> 00:34:55,520 Just to give you a sense of scale of how they're run. 1009 00:34:55,520 --> 00:34:57,209 And just to take you through, you 1010 00:34:57,209 --> 00:34:59,000 know, I guess maybe you'll hear about this, 1011 00:34:59,000 --> 00:35:00,110 if you haven't already. 1012 00:35:00,110 --> 00:35:02,900 Each element has, like, a filter with a large fan-in. 1013 00:35:02,900 --> 00:35:05,330 These are like neuroscience-related things. 1014 00:35:05,330 --> 00:35:08,460 They have non-linearities, like thresholds of neurons. 1015 00:35:08,460 --> 00:35:10,250 Each layer is convolutional, which 1016 00:35:10,250 --> 00:35:12,962 means you apply the same filters across visual space.
1017 00:35:12,962 --> 00:35:15,170 Which is like retinotopy: there's a V1 cell that 1018 00:35:15,170 --> 00:35:16,010 is oriented here, 1019 00:35:16,010 --> 00:35:17,690 and there'll be another V1 cell that's 1020 00:35:17,690 --> 00:35:19,670 in another spatial position-- same orientation, 1021 00:35:19,670 --> 00:35:21,080 different spatial position. 1022 00:35:21,080 --> 00:35:23,360 The convolutional models are just 1023 00:35:23,360 --> 00:35:26,810 an implementation of that idea of copying the same filter 1024 00:35:26,810 --> 00:35:28,782 type across the retina. 1025 00:35:28,782 --> 00:35:30,240 And there's a deep stack of layers. 1026 00:35:30,240 --> 00:35:31,615 These are all things that I think 1027 00:35:31,615 --> 00:35:34,610 are commensurate with the ventral stream 1028 00:35:34,610 --> 00:35:36,796 anatomy and physiology. 1029 00:35:36,796 --> 00:35:39,230 But one of the key things that those 1030 00:35:39,230 --> 00:35:40,790 who work with these models know is 1031 00:35:40,790 --> 00:35:43,250 that they have lots of unknown parameters 1032 00:35:43,250 --> 00:35:45,320 that are not determined from the neurobiology. 1033 00:35:45,320 --> 00:35:47,750 Even though the family of models is well described, 1034 00:35:47,750 --> 00:35:50,270 what are the exact filter weights? 1035 00:35:50,270 --> 00:35:51,830 What are the threshold parameters? 1036 00:35:51,830 --> 00:35:53,090 How exactly do you pool? 1037 00:35:53,090 --> 00:35:54,050 How do you normalize? 1038 00:35:54,050 --> 00:35:56,480 There are lots of parameters when you build these things, 1039 00:35:56,480 --> 00:35:59,090 essentially thousands of parameters, most of them hidden 1040 00:35:59,090 --> 00:36:00,360 in the weight structure here. 1041 00:36:00,360 --> 00:36:02,360 Which, if you think about the first layer, 1042 00:36:02,360 --> 00:36:04,190 would be like: should I choose Gabor filters? 1043 00:36:04,190 --> 00:36:06,110 Or should I do something else-- you know, Haim was talking 1044 00:36:06,110 --> 00:36:07,369 about random weights, right? 1045 00:36:07,369 --> 00:36:08,410 So there are choices there. 1046 00:36:08,410 --> 00:36:09,140 There are lots of parameters. 1047 00:36:09,140 --> 00:36:11,410 So the upshot is, there's a big-- that's why 1048 00:36:11,410 --> 00:36:13,240 I call it a family of models. 1049 00:36:13,240 --> 00:36:16,560 And how do you choose which one is the right one, so to speak? 1050 00:36:16,560 --> 00:36:17,900 Or is there a right one? 1051 00:36:17,900 --> 00:36:19,650 Or maybe the whole family is wrong, right? 1052 00:36:19,650 --> 00:36:21,680 These are the interesting discussions. 1053 00:36:21,680 --> 00:36:24,610 So, what I like about it is, at least once you set it, 1054 00:36:24,610 --> 00:36:25,220 it's a model. 1055 00:36:25,220 --> 00:36:26,120 It makes predictions. 1056 00:36:26,120 --> 00:36:27,161 And then you can test it. 1057 00:36:27,161 --> 00:36:28,610 So it's at least a model. 1058 00:36:28,610 --> 00:36:30,800 And it predicts the entire-- you know, 1059 00:36:30,800 --> 00:36:33,560 if you start to map these, you say this is V1, this is V2, 1060 00:36:33,560 --> 00:36:34,220 this is V4. 1061 00:36:34,220 --> 00:36:37,400 It predicts the full neural population response 1062 00:36:37,400 --> 00:36:39,380 to any image across these areas. 1063 00:36:39,380 --> 00:36:43,100 So it's a strongly predictive model once built. 1064 00:36:43,100 --> 00:36:44,176 So that's nice.
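To make the filter, non-linearity, pooling, and normalization motif he just described concrete, here is a minimal sketch of one such stage and a small stack of them. This is not the lab's model or any published architecture; the framework (PyTorch), the filter counts, kernel sizes, and pooling choices below are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class ConvStage(nn.Module):
    """One stage of the motif described above: filtering (local dot products),
    a threshold-like nonlinearity, spatial pooling, and normalization."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.filters = nn.Conv2d(in_channels, out_channels, kernel_size=7, padding=3)
        self.nonlin = nn.ReLU()                   # threshold-like nonlinearity
        self.pool = nn.MaxPool2d(kernel_size=2)   # spatial pooling: maps get narrower
        self.norm = nn.LocalResponseNorm(size=5)  # divisive normalization across filters

    def forward(self, x):
        return self.norm(self.pool(self.nonlin(self.filters(x))))

# Stacking stages: one image in, a deeper but spatially narrower stack of
# feature maps out -- the "stack keeps getting bigger and narrower" idea.
stack = nn.Sequential(
    ConvStage(3, 16),     # e.g. a handful of oriented filters at the first stage
    ConvStage(16, 64),
    ConvStage(64, 128),
    ConvStage(128, 256),
)

image = torch.randn(1, 3, 224, 224)   # a random stand-in for one RGB image
features = stack(image)
print(features.shape)                 # torch.Size([1, 256, 14, 14])
```

Each added filter bank multiplies the number of feature maps, while each pooling step shrinks their spatial extent, which is why the representation gets deeper and narrower and keeps only a little retinotopy near the top.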
1065 00:36:44,176 --> 00:36:46,550 But now you have to determine how am I going to build it? 1066 00:36:46,550 --> 00:36:48,300 How do I set the parameters? 1067 00:36:48,300 --> 00:36:50,129 So how do we do that? 1068 00:36:50,129 --> 00:36:51,920 Well, there's lots of ways you could do it. 1069 00:36:51,920 --> 00:36:53,753 And I'll tell you the way we chose to do it. 1070 00:36:53,753 --> 00:36:56,540 Which was to just not use any neural data. 1071 00:36:56,540 --> 00:36:58,370 It was just to use optimization methods 1072 00:36:58,370 --> 00:37:00,860 to find specific models to set the parameters 1073 00:37:00,860 --> 00:37:02,660 inside this model class. 1074 00:37:02,660 --> 00:37:04,924 And we chose an optimization target. 1075 00:37:04,924 --> 00:37:07,340 This is a little bit, again, inspired from a top down view 1076 00:37:07,340 --> 00:37:09,146 of what the system's doing. 1077 00:37:09,146 --> 00:37:10,520 What are the visual tasks that we 1078 00:37:10,520 --> 00:37:13,010 suppose the ventral stream was supposed to solve? 1079 00:37:13,010 --> 00:37:15,350 Which I already told you, we think it's invariant object 1080 00:37:15,350 --> 00:37:15,950 recognition. 1081 00:37:15,950 --> 00:37:17,600 That's what makes the problem hard. 1082 00:37:17,600 --> 00:37:19,476 So we tried to optimize models to solve that. 1083 00:37:19,476 --> 00:37:21,058 And essentially when we're doing that, 1084 00:37:21,058 --> 00:37:23,540 we're kind of doing the same thing that computer vision is 1085 00:37:23,540 --> 00:37:26,164 trying to do, except we're doing it in our own domain of images 1086 00:37:26,164 --> 00:37:27,329 and tasks that we set up. 1087 00:37:27,329 --> 00:37:29,870 But we essentially, there's a meeting between computer vision 1088 00:37:29,870 --> 00:37:32,240 and what we were trying to do here. 1089 00:37:32,240 --> 00:37:34,450 And when I say we, this is work by Dan Yamins, 1090 00:37:34,450 --> 00:37:37,520 a post-doc in the lab, and Ha Hong, a graduate student. 1091 00:37:37,520 --> 00:37:40,712 And what we did was to just try to simulate again, 1092 00:37:40,712 --> 00:37:41,420 as I did earlier. 1093 00:37:41,420 --> 00:37:43,049 We took these simple 3-D objects. 1094 00:37:43,049 --> 00:37:44,590 We could render them, just as before, 1095 00:37:44,590 --> 00:37:46,730 place them on naturalistic background. 1096 00:37:46,730 --> 00:37:48,380 And then we just built models that 1097 00:37:48,380 --> 00:37:50,360 would try to discriminate bodies from buildings 1098 00:37:50,360 --> 00:37:51,318 from flowers from guns. 1099 00:37:51,318 --> 00:37:53,076 So they would have good feature sets 1100 00:37:53,076 --> 00:37:54,950 that would discriminate between these things. 1101 00:37:54,950 --> 00:37:58,280 And these were essentially trained by various forms 1102 00:37:58,280 --> 00:37:59,220 of supervision. 1103 00:37:59,220 --> 00:38:02,240 Now there's lots of ways you can train these models. 1104 00:38:02,240 --> 00:38:03,740 I could tell you about how we did it 1105 00:38:03,740 --> 00:38:04,970 and how others have done it. 1106 00:38:04,970 --> 00:38:06,546 I think those details are beyond what 1107 00:38:06,546 --> 00:38:07,670 I want to talk about today. 1108 00:38:07,670 --> 00:38:09,530 But just, it's a supervised class 1109 00:38:09,530 --> 00:38:12,156 that's probably not learned in the same way 1110 00:38:12,156 --> 00:38:13,280 that the brain has learned. 1111 00:38:13,280 --> 00:38:14,711 Most people don't think so. 
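The idea of setting the parameters purely by optimizing for categorization performance can be sketched in a few lines. The actual HMO procedure was a more elaborate hyperparameter search, so this is only the generic supervised version of the idea; the small architecture, category count, and random stand-in data below are illustrative assumptions, not the lab's setup.

```python
import torch
import torch.nn as nn

# Hypothetical categories standing in for the rendered-object classes.
n_categories = 8

# A small stand-in convolutional hierarchy plus a linear readout;
# the point here is only the supervised objective, not the architecture.
net = nn.Sequential(
    nn.Conv2d(3, 32, 7, stride=2, padding=3), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(128, n_categories),
)

optimizer = torch.optim.SGD(net.parameters(), lr=1e-2, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One supervised update: nudge all the filter weights toward
    better categorization of the labeled images."""
    optimizer.zero_grad()
    loss = loss_fn(net(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# A fake batch of images and category labels, just to show the call.
images = torch.randn(16, 3, 128, 128)
labels = torch.randint(0, n_categories, (16,))
print(train_step(images, labels))
```

With real rendered images and category labels in place of the random tensors, repeating train_step over many batches is the whole "optimize for performance" step; note that no neural data enters anywhere.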
1112 00:38:14,711 --> 00:38:16,460 But the interesting thing is the end state 1113 00:38:16,460 --> 00:38:19,659 of these models might look very much like the current adult 1114 00:38:19,659 --> 00:38:20,450 state of the brain. 1115 00:38:20,450 --> 00:38:22,390 And that's what I want to try to tell you next. 1116 00:38:22,390 --> 00:38:24,280 So first, let me show you that when we built these models, 1117 00:38:24,280 --> 00:38:25,280 this was in 2012. 1118 00:38:25,280 --> 00:38:27,080 We had a particular optimization approach 1119 00:38:27,080 --> 00:38:29,176 that we called HMO that was trying 1120 00:38:29,176 --> 00:38:31,550 to solve these kind of problems that I showed you earlier 1121 00:38:31,550 --> 00:38:33,020 on these kind of images. 1122 00:38:33,020 --> 00:38:35,104 And I showed you IT was pretty good compared with humans. 1123 00:38:35,104 --> 00:38:37,520 I showed you its performance was almost up to humans, even 1124 00:38:37,520 --> 00:38:39,379 with just 168 sites. 1125 00:38:39,379 --> 00:38:40,920 And when we first built a model here, 1126 00:38:40,920 --> 00:38:42,586 we were able to do much better than some 1127 00:38:42,586 --> 00:38:44,960 of our previous models 1128 00:38:44,960 --> 00:38:46,130 on these same kinds of tasks. 1129 00:38:46,130 --> 00:38:47,796 So I told you why we constructed these images: because we 1130 00:38:47,796 --> 00:38:49,530 knew they made these kinds of models 1131 00:38:49,530 --> 00:38:51,410 not do so well. 1132 00:38:51,410 --> 00:38:53,390 So we built these high invariance tasks 1133 00:38:53,390 --> 00:38:55,040 to push these models down. 1134 00:38:55,040 --> 00:38:56,720 And then we had space to build a model 1135 00:38:56,720 --> 00:38:59,050 that could do better. 1136 00:38:59,050 --> 00:39:01,850 And we called it HMO 1.0. 1137 00:39:01,850 --> 00:39:03,530 And then we started to say, now we 1138 00:39:03,530 --> 00:39:06,560 have this model that has been optimized for performance. 1139 00:39:06,560 --> 00:39:09,380 Let's see how well it does when we compare it with neurons. 1140 00:39:09,380 --> 00:39:12,540 Let's see if its internals look like the neural data. 1141 00:39:12,540 --> 00:39:14,324 So here's the model we built, HMO 1.0. 1142 00:39:14,324 --> 00:39:15,740 It's a deep convolutional network. 1143 00:39:15,740 --> 00:39:16,906 It has a few different levels-- 1144 00:39:16,906 --> 00:39:18,140 it had four levels. 1145 00:39:18,140 --> 00:39:20,524 It had a bunch of parameters that we set by optimization-- 1146 00:39:20,524 --> 00:39:22,690 I'm just telling you kind of what we optimized for. 1147 00:39:22,690 --> 00:39:23,481 I didn't tell you-- 1148 00:39:23,481 --> 00:39:25,530 I'm not telling you any of the parameters. 1149 00:39:25,530 --> 00:39:26,660 And now, we come back and say, well, look. 1150 00:39:26,660 --> 00:39:28,190 We can show the same images to the model 1151 00:39:28,190 --> 00:39:29,550 that we showed to the neurons. 1152 00:39:29,550 --> 00:39:32,060 And then we can compare how well this population looks 1153 00:39:32,060 --> 00:39:35,830 like that population, or this population looks like that. 1154 00:39:35,830 --> 00:39:38,820 And so what we did was ask first: how well can layer four predict 1155 00:39:38,820 --> 00:39:39,524 IT? 1156 00:39:39,524 --> 00:39:40,940 That was the first thing we wanted 1157 00:39:40,940 --> 00:39:43,160 to do: take the top layer of this model, 1158 00:39:43,160 --> 00:39:46,852 the last layer before the linear readout of this model.
1159 00:39:46,852 --> 00:39:49,310 And to do that, you might sort of say, well, wait a minute. 1160 00:39:49,310 --> 00:39:51,170 The model doesn't have mappings. 1161 00:39:51,170 --> 00:39:54,680 It has sort of neurons simulated here, neuron 12 or something. 1162 00:39:54,680 --> 00:39:56,180 And there's some neuron we recorded. 1163 00:39:56,180 --> 00:39:58,970 But there's no linkage between that neuron and that neuron, 1164 00:39:58,970 --> 00:39:59,470 right? 1165 00:39:59,470 --> 00:40:01,320 You have to make that map. 1166 00:40:01,320 --> 00:40:03,770 So what we do is we take each IT neuron 1167 00:40:03,770 --> 00:40:06,066 and treat this as sort of a generative space. 1168 00:40:06,066 --> 00:40:07,940 You can generate as many simulated IT neurons 1169 00:40:07,940 --> 00:40:08,510 as you want. 1170 00:40:08,510 --> 00:40:10,550 You would just ask, let's take this neuron, 1171 00:40:10,550 --> 00:40:13,640 take some of its data, and try to build a linear regression 1172 00:40:13,640 --> 00:40:14,330 to this neuron. 1173 00:40:14,330 --> 00:40:16,640 Treat this as a basis to explain that neuron. 1174 00:40:16,640 --> 00:40:19,946 And then test the predictive power on the held out IT data. 1175 00:40:19,946 --> 00:40:21,320 And that's what I'm writing here. 1176 00:40:21,320 --> 00:40:23,470 That's cross-validation linear regression. 1177 00:40:23,470 --> 00:40:25,850 So I'm going to show you predictions on held out data 1178 00:40:25,850 --> 00:40:28,880 where some of the data were used to make the mapping. 1179 00:40:28,880 --> 00:40:31,217 And there's lots of ways we chose-- 1180 00:40:31,217 --> 00:40:32,300 we could make the mapping. 1181 00:40:32,300 --> 00:40:33,934 And we did essentially all of them. 1182 00:40:33,934 --> 00:40:35,600 And I could talk about that if you want. 1183 00:40:35,600 --> 00:40:37,130 But that's this central idea. 1184 00:40:37,130 --> 00:40:40,010 Take some of your data, say, is this in the linear space 1185 00:40:40,010 --> 00:40:41,450 spanned by this basis set? 1186 00:40:41,450 --> 00:40:44,890 So I can I fit that well with this linear basis here? 1187 00:40:44,890 --> 00:40:47,000 As a linear map from this basis? 1188 00:40:47,000 --> 00:40:49,500 And here's what we actually-- here's what it looks like. 1189 00:40:49,500 --> 00:40:53,470 Here's the IT neural response of one simulated-- one actual IT 1190 00:40:53,470 --> 00:40:54,710 neuron in black. 1191 00:40:54,710 --> 00:40:55,880 This is not time. 1192 00:40:55,880 --> 00:40:57,260 These are images. 1193 00:40:57,260 --> 00:40:59,174 I think there's like 1,600 images here. 1194 00:40:59,174 --> 00:41:01,340 So each black going up and down, you can barely see, 1195 00:41:01,340 --> 00:41:04,370 is the response, the mean response, to different images. 1196 00:41:04,370 --> 00:41:06,930 And you see we grouped them by categories, just so, 1197 00:41:06,930 --> 00:41:09,780 just to help you kind of understand the data. 1198 00:41:09,780 --> 00:41:11,805 Otherwise, it'd just be a big mess. 1199 00:41:11,805 --> 00:41:13,430 Because IT neurons do-- you can kind of 1200 00:41:13,430 --> 00:41:15,410 see they have a bit of category selectivity. 1201 00:41:15,410 --> 00:41:16,710 And again, this was known. 1202 00:41:16,710 --> 00:41:19,451 This neuron seems to like chair images, but not all chair 1203 00:41:19,451 --> 00:41:19,950 images. 1204 00:41:19,950 --> 00:41:23,060 It sometimes likes boats and some planes a little bit. 
1205 00:41:23,060 --> 00:41:26,270 And the red line is the prediction of the model, 1206 00:41:26,270 --> 00:41:28,042 once fit to part of the data for this neuron. 1207 00:41:28,042 --> 00:41:30,500 This is the prediction on the held-out data for the neuron. 1208 00:41:30,500 --> 00:41:32,690 You can see the R squared is 0.48. 1209 00:41:32,690 --> 00:41:35,150 So half the explainable response variance 1210 00:41:35,150 --> 00:41:37,050 is explained by this model. 1211 00:41:37,050 --> 00:41:39,350 And again, these are predictions. 1212 00:41:39,350 --> 00:41:41,360 The images were never seen-- 1213 00:41:41,360 --> 00:41:44,360 the objects even were never seen by this model 1214 00:41:44,360 --> 00:41:47,750 before it makes these predictions here. 1215 00:41:47,750 --> 00:41:50,810 So this is just saying that the IT neurons live in this space. 1216 00:41:50,810 --> 00:41:53,120 It's actually quite well captured by the top level, 1217 00:41:53,120 --> 00:41:55,170 in this case, of this first HMO model we built. 1218 00:41:55,170 --> 00:41:57,480 I'll show you some other models in a minute. 1219 00:41:57,480 --> 00:41:59,480 Here's another neuron that you might call a face 1220 00:41:59,480 --> 00:42:02,840 neuron because it tends to like faces over other categories. 1221 00:42:02,840 --> 00:42:04,340 So it might-- it would pass the test 1222 00:42:04,340 --> 00:42:06,410 of the operational definition of a face neuron. 1223 00:42:06,410 --> 00:42:09,615 This neuron was well predicted, again, 1224 00:42:09,615 --> 00:42:12,470 for both its preferred and non-preferred face images, 1225 00:42:12,470 --> 00:42:13,880 by this HMO model. 1226 00:42:13,880 --> 00:42:16,757 Again, an R squared near 0.5. 1227 00:42:16,757 --> 00:42:19,340 Here's a neuron where, if you look at the category structure, 1228 00:42:19,340 --> 00:42:20,332 you don't even-- 1229 00:42:20,332 --> 00:42:22,040 you can't really see the categories here. 1230 00:42:22,040 --> 00:42:23,060 They're still here. 1231 00:42:23,060 --> 00:42:25,010 But you don't see these sort of blocks. 1232 00:42:25,010 --> 00:42:27,050 You just see there are some images it likes and some 1233 00:42:27,050 --> 00:42:27,410 it doesn't. 1234 00:42:27,410 --> 00:42:29,530 It's hard to even know what's driving this neuron. 1235 00:42:29,530 --> 00:42:31,722 But it's actually quite well predicted, I think. 1236 00:42:31,722 --> 00:42:32,930 You don't have the R squared here. 1237 00:42:32,930 --> 00:42:33,800 But it's similar. 1238 00:42:33,800 --> 00:42:35,990 It's about half the explainable variance. 1239 00:42:35,990 --> 00:42:37,490 Just another example. 1240 00:42:37,490 --> 00:42:39,140 And here is a sort of summary. 1241 00:42:39,140 --> 00:42:40,700 This is a distribution, 1242 00:42:40,700 --> 00:42:43,730 across-- I think this is 168 IT sites-- of how much 1243 00:42:43,730 --> 00:42:46,850 of the explainable variance the top level of the model captures. 1244 00:42:46,850 --> 00:42:48,950 Some sites are fit really well, near 100%. 1245 00:42:48,950 --> 00:42:50,390 Some are fit not as well. 1246 00:42:50,390 --> 00:42:53,040 The average is about 50%, which is shown here. 1247 00:42:53,040 --> 00:42:55,890 So this is the median of that distribution here. 1248 00:42:55,890 --> 00:42:58,400 So the summary take-home is about 50% 1249 00:42:58,400 --> 00:43:00,277 of single-site response variance predicted.
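The mapping behind those numbers is the cross-validated linear regression described a moment ago: fit a linear combination of the model's top-level features to one IT site using part of the images, then score the prediction on held-out images. A minimal sketch, with random arrays standing in for the model features and the recorded firing rates:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

n_images, n_model_units = 1600, 1000
model_features = np.random.randn(n_images, n_model_units)  # model responses per image
it_site = np.random.randn(n_images)                         # one site's mean rate per image

# Fit the linear map on part of the images, score it on the held-out part.
X_train, X_test, y_train, y_test = train_test_split(
    model_features, it_site, test_size=0.25, random_state=0)

mapping = Ridge(alpha=1.0).fit(X_train, y_train)   # regularized linear regression
r2 = r2_score(y_test, mapping.predict(X_test))     # predictivity on held-out images
print(f"held-out R^2 for this site: {r2:.2f}")
```

In the analyses he is describing, that held-out R squared is reported relative to each site's explainable variance, i.e., the part of its response that is reliable across repeated presentations.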
1250 00:43:00,277 --> 00:43:02,360 And this is a big improvement over previous models, 1251 00:43:02,360 --> 00:43:03,600 as I'll show you in a minute. 1252 00:43:03,600 --> 00:43:06,020 The other levels of the model don't predict nearly as well. 1253 00:43:06,020 --> 00:43:07,550 So the first level doesn't predict well. 1254 00:43:07,550 --> 00:43:09,216 Second level better, third level better, 1255 00:43:09,216 --> 00:43:10,550 the fourth level the best. 1256 00:43:10,550 --> 00:43:13,216 If you take other models-- these are some of the models I showed 1257 00:43:13,216 --> 00:43:13,880 you earlier-- 1258 00:43:13,880 --> 00:43:16,110 they don't fit nearly as well. 1259 00:43:16,110 --> 00:43:18,440 Here are their distributions and here's their average, 1260 00:43:18,440 --> 00:43:20,240 their median explained variance. 1261 00:43:20,240 --> 00:43:24,120 And just to fix ideas, you might think, 1262 00:43:24,120 --> 00:43:27,080 well look, we built a model that's a good categorizer. 1263 00:43:27,080 --> 00:43:28,730 So of course it fits IT neurons well. 1264 00:43:28,730 --> 00:43:30,260 Because IT neurons are categorizers. 1265 00:43:30,260 --> 00:43:32,954 Well, here's a model that actually has explicit knowledge 1266 00:43:32,954 --> 00:43:33,620 of the category. 1267 00:43:33,620 --> 00:43:35,210 It's not an image-computable model, 1268 00:43:35,210 --> 00:43:36,590 and it's not an easy one. 1269 00:43:36,590 --> 00:43:38,360 It's just a sort of oracle 1270 00:43:38,360 --> 00:43:41,460 that's given the category, and we ask how well it explains IT. 1271 00:43:41,460 --> 00:43:43,490 And you can see, it explains IT much worse 1272 00:43:43,490 --> 00:43:44,600 than the actual model. 1273 00:43:44,600 --> 00:43:47,930 So this implies the model is limited by-- 1274 00:43:47,930 --> 00:43:50,570 the architecture puts constraints on the model, 1275 00:43:50,570 --> 00:43:53,680 and it adds variance that the statement "IT 1276 00:43:53,680 --> 00:43:56,290 neurons are categorizers" does not easily capture. 1277 00:43:56,290 --> 00:44:00,320 So that kind of-- 1278 00:44:00,320 --> 00:44:02,836 that sort of inspired us to say, OK. 1279 00:44:02,836 --> 00:44:04,710 What about if we go down and say not just IT, 1280 00:44:04,710 --> 00:44:05,650 but let's go to V4? 1281 00:44:05,650 --> 00:44:07,600 Because we had a bunch of V4 data. 1282 00:44:07,600 --> 00:44:09,280 And so we play the same game in V4. 1283 00:44:09,280 --> 00:44:12,130 Let's take level three and see if we can predict V4. 1284 00:44:12,130 --> 00:44:14,710 And here's the IT data I just showed you a minute ago. 1285 00:44:14,710 --> 00:44:16,270 And here's the V4 data. 1286 00:44:16,270 --> 00:44:19,630 So the V4 neurons are highly predicted by the middle layer. 1287 00:44:19,630 --> 00:44:21,700 Layer three is the best predictor of V4. 1288 00:44:21,700 --> 00:44:24,400 The top layer is actually not so predictive, less predictive 1289 00:44:24,400 --> 00:44:26,664 of V4 neurons than the middle layers. 1290 00:44:26,664 --> 00:44:28,580 And the first layer is not so predictive either. 1291 00:44:28,580 --> 00:44:30,288 And again, the other models-- 1292 00:44:30,288 --> 00:44:32,830 now you can see they're doing relatively better. 1293 00:44:32,830 --> 00:44:35,740 You can think of them as sort of lower level models. 1294 00:44:35,740 --> 00:44:38,390 And they're getting better, which is what you'd expect. 1295 00:44:38,390 --> 00:44:41,230 But interestingly, this is really exciting to us.
1296 00:44:41,230 --> 00:44:44,080 Because look, this model was not optimized 1297 00:44:44,080 --> 00:44:47,200 to fit any neural data other than that last mapping step. 1298 00:44:47,200 --> 00:44:49,300 All it is is a bio inspired algorithm class, 1299 00:44:49,300 --> 00:44:51,610 which is the neuroscience sort of view 1300 00:44:51,610 --> 00:44:54,550 of the feed-forward class of the field. 1301 00:44:54,550 --> 00:44:56,867 And tasks that we and others hypothesize 1302 00:44:56,867 --> 00:44:58,450 are important, that the ventral stream 1303 00:44:58,450 --> 00:45:01,810 might be optimized to solve, and an actual optimization 1304 00:45:01,810 --> 00:45:03,640 procedure that we applied. 1305 00:45:03,640 --> 00:45:07,000 And that leads to neural like encoding functions at the top 1306 00:45:07,000 --> 00:45:08,200 and in the middle layer. 1307 00:45:08,200 --> 00:45:11,380 So you don't-- so this sort of leads to funny things like 1308 00:45:11,380 --> 00:45:13,150 saying, what does V4 do? 1309 00:45:13,150 --> 00:45:14,650 The answer here would be, well, it's 1310 00:45:14,650 --> 00:45:17,020 an intermediate layer in a network built 1311 00:45:17,020 --> 00:45:18,410 to optimize these things. 1312 00:45:18,410 --> 00:45:19,960 That's the way to describe what V4 1313 00:45:19,960 --> 00:45:22,840 does, according to this kind of modeling approach. 1314 00:45:22,840 --> 00:45:24,880 Now I want to point out, this is only half 1315 00:45:24,880 --> 00:45:26,046 of the explainable variance. 1316 00:45:26,046 --> 00:45:27,252 So it's far from perfect. 1317 00:45:27,252 --> 00:45:28,460 There's room to improve here. 1318 00:45:28,460 --> 00:45:30,793 But it's really dramatic how much improvement we got out 1319 00:45:30,793 --> 00:45:32,400 of these kind of models. 1320 00:45:32,400 --> 00:45:34,670 And so if you take this sort of-- 1321 00:45:34,670 --> 00:45:35,870 well, I'll skip this. 1322 00:45:35,870 --> 00:45:38,650 If you take this back to you know, big picture, 1323 00:45:38,650 --> 00:45:40,000 what did we do here? 1324 00:45:40,000 --> 00:45:41,710 What we're doing is we have performance 1325 00:45:41,710 --> 00:45:43,793 of a model on high end variance recognition tasks. 1326 00:45:43,793 --> 00:45:46,474 We're saying, this is what we've been trying to optimize. 1327 00:45:46,474 --> 00:45:47,890 And what we noticed is that if you 1328 00:45:47,890 --> 00:45:50,852 plot-- these dots are samples out of that model family. 1329 00:45:50,852 --> 00:45:52,810 These black dots are other models I showed you. 1330 00:45:52,810 --> 00:45:55,600 So they're control models that were in the field at the time. 1331 00:45:55,600 --> 00:45:57,400 And this is the ability of the top-- 1332 00:45:57,400 --> 00:45:59,500 the model-- the top level of any of the models 1333 00:45:59,500 --> 00:46:01,120 to predict IT responses. 1334 00:46:01,120 --> 00:46:03,450 So, you know, how good they are predicting-- 1335 00:46:03,450 --> 00:46:06,160 this is sort of the median variance explained of single IT 1336 00:46:06,160 --> 00:46:06,935 responses. 1337 00:46:06,935 --> 00:46:08,560 And you see there's a correlation here. 1338 00:46:08,560 --> 00:46:11,140 If you're better at this, you're better at predicting that. 1339 00:46:11,140 --> 00:46:13,279 And all we did was optimize this way, 1340 00:46:13,279 --> 00:46:15,445 which we think of as like, evolution or development. 1341 00:46:15,445 --> 00:46:16,820 So we're not fitting neural data. 
1342 00:46:16,820 --> 00:46:19,180 We're just optimizing for task performance. 1343 00:46:19,180 --> 00:46:21,850 And that led in 2012 to the model that I just showed you, 1344 00:46:21,850 --> 00:46:24,277 which explained about half of the IT response variance. 1345 00:46:24,277 --> 00:46:26,110 OK, so it's like, well, this looks like it's 1346 00:46:26,110 --> 00:46:27,670 continuing up this way. 1347 00:46:27,670 --> 00:46:31,930 OK, so if you believe that story, then that says, 1348 00:46:31,930 --> 00:46:35,200 if we can optimize further on these kinds of tasks, 1349 00:46:35,200 --> 00:46:37,649 maybe we can explain more variance. 1350 00:46:37,649 --> 00:46:39,190 And it turned out we didn't actually 1351 00:46:39,190 --> 00:46:41,290 need to do that, because, again, as I said, 1352 00:46:41,290 --> 00:46:43,804 computer vision was already working on this. 1353 00:46:43,804 --> 00:46:45,220 And they've got a lot more resources. 1354 00:46:45,220 --> 00:46:46,090 They're already doing it. 1355 00:46:46,090 --> 00:46:47,714 They're already better than us on this. 1356 00:46:47,714 --> 00:46:49,137 So here's our HMO model. 1357 00:46:49,137 --> 00:46:51,220 This is now Charles Cadieu, a post-doc in the lab. 1358 00:46:51,220 --> 00:46:52,600 These were models that came out at the time. 1359 00:46:52,600 --> 00:46:54,183 This is Krizhevsky et al.-- SuperVision. 1360 00:46:54,183 --> 00:46:56,200 It's ICLR 2013. 1361 00:46:56,200 --> 00:46:58,287 They were better than the model that we had built. 1362 00:46:58,287 --> 00:47:00,370 You know, we were in this restricted image domain; 1363 00:47:00,370 --> 00:47:01,930 you know, there's lots of reasons why 1364 00:47:01,930 --> 00:47:03,096 we could say they're better. 1365 00:47:03,096 --> 00:47:06,070 Regardless, they were better at our own tasks than the models 1366 00:47:06,070 --> 00:47:07,300 that we had built, right? 1367 00:47:07,300 --> 00:47:09,490 So they were already ahead of us on the task 1368 00:47:09,490 --> 00:47:10,990 that we had designed. 1369 00:47:10,990 --> 00:47:13,199 And so they were up here, and then they were up here. 1370 00:47:13,199 --> 00:47:14,781 And so, if you follow that prediction, 1371 00:47:14,781 --> 00:47:16,900 that means these models might be better predictors 1372 00:47:16,900 --> 00:47:17,890 of our neural data, right? 1373 00:47:17,890 --> 00:47:19,240 These guys don't have our neural data. 1374 00:47:19,240 --> 00:47:21,740 All they're doing is building models to optimize performance 1375 00:47:21,740 --> 00:47:22,370 on tasks. 1376 00:47:22,370 --> 00:47:24,970 But we could take their features, apply them to our neural data, 1377 00:47:24,970 --> 00:47:25,930 and play the same game. 1378 00:47:25,930 --> 00:47:29,020 And their features actually explained our data 1379 00:47:29,020 --> 00:47:31,060 better than our own model explained our own data. 1380 00:47:31,060 --> 00:47:34,120 So this is a nice statement, because it's not even from our own lab: 1381 00:47:34,120 --> 00:47:36,980 just continued optimization for those kinds of tasks 1382 00:47:36,980 --> 00:47:41,180 leads to features that are good predictors of the IT responses. 1383 00:47:41,180 --> 00:47:42,560 And that's what's shown here. 1384 00:47:42,560 --> 00:47:45,970 So I think that's what I just said there. 1385 00:47:45,970 --> 00:47:49,120 So, Charles took this further and analyzed 1386 00:47:49,120 --> 00:47:50,597 this in more detail.
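"Take their features and play the same game" is easy to sketch today, because networks of that family ship with standard libraries. Assuming a recent torchvision (the pretrained AlexNet-style model here is only a stand-in for the published networks he mentions), and with random placeholders for the images and the neural responses, the procedure is just feature extraction followed by the same cross-validated regression:

```python
import numpy as np
import torch
from torchvision.models import alexnet
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

net = alexnet(weights="IMAGENET1K_V1").eval()    # a publicly available pretrained CNN

images = torch.randn(50, 3, 224, 224)            # stand-in for the test images
with torch.no_grad():
    feats = net.features(images)                  # last convolutional activations
    feats = torch.flatten(net.avgpool(feats), 1).numpy()

it_site = np.random.randn(50)                     # stand-in for one IT site's mean rates

# Same game as before: cross-validated linear regression from features to rates.
scores = cross_val_score(Ridge(alpha=1.0), feats, it_site, cv=5, scoring="r2")
print("cross-validated R^2:", scores.mean())
```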
1387 00:47:50,597 --> 00:47:52,930 This is a summary of what I presented in the second half 1388 00:47:52,930 --> 00:47:56,080 now, showing that IT-firing-rate-based, 1389 00:47:56,080 --> 00:47:58,420 learned object judgments naturally predict human and monkey 1390 00:47:58,420 --> 00:47:58,919 performance. 1391 00:47:58,919 --> 00:48:00,820 That's the LaWS of RAD IT idea. 1392 00:48:00,820 --> 00:48:02,530 I picked a particular model, which 1393 00:48:02,530 --> 00:48:05,730 is a read of a 100 millisecond time window, 50,000 neurons, 1394 00:48:05,730 --> 00:48:07,150 100 training examples. 1395 00:48:07,150 --> 00:48:12,220 That's one particular choice of a decode model-- that's just 1396 00:48:12,220 --> 00:48:17,230 the current sort of decode model that fits a lot of our data, 1397 00:48:17,230 --> 00:48:18,230 but not all of our data. 1398 00:48:18,230 --> 00:48:20,620 And we also want to get finer-grained data. 1399 00:48:20,620 --> 00:48:22,960 The inference is, this might be the specific neural code 1400 00:48:22,960 --> 00:48:25,102 and decoding mechanism that the brain uses 1401 00:48:25,102 --> 00:48:26,060 to support these tasks. 1402 00:48:26,060 --> 00:48:28,174 That's what we'd like to think. 1403 00:48:28,174 --> 00:48:30,340 But now, we're trying to do systematic causal tests. 1404 00:48:30,340 --> 00:48:32,590 And we talked a lot about trying to silence bits of IT 1405 00:48:32,590 --> 00:48:33,810 as one example of that. 1406 00:48:33,810 --> 00:48:37,070 And the tools are still not where we'd like them to be. 1407 00:48:37,070 --> 00:48:39,800 But you see we're making progress there. 1408 00:48:39,800 --> 00:48:43,021 The second thing was, I showed that optimization of deep CNN models 1409 00:48:43,021 --> 00:48:44,770 for invariant object recognition tasks led 1410 00:48:44,770 --> 00:48:46,420 to dramatic improvements in our ability 1411 00:48:46,420 --> 00:48:47,950 to predict IT and V4 responses. 1412 00:48:47,950 --> 00:48:49,150 I showed you our model, HMO. 1413 00:48:49,150 --> 00:48:51,700 But then the convolutional neural networks in the field 1414 00:48:51,700 --> 00:48:53,810 have already surpassed our predictive ability 1415 00:48:53,810 --> 00:48:56,320 on our own data. 1416 00:48:56,320 --> 00:48:58,570 And so the inference is that the encoding mechanisms 1417 00:48:58,570 --> 00:49:00,486 in these models might be similar to those that 1418 00:49:00,486 --> 00:49:02,040 work in the ventral stream. 1419 00:49:02,040 --> 00:49:04,096 And now, you know, there's a whole sort of area 1420 00:49:04,096 --> 00:49:06,220 where you can start to think about doing physiology 1421 00:49:06,220 --> 00:49:07,516 on the models, so to speak. 1422 00:49:07,516 --> 00:49:08,890 And that problem's almost as hard 1423 00:49:08,890 --> 00:49:10,600 as doing physiology on the animal, 1424 00:49:10,600 --> 00:49:13,000 except that you can gather a lot more data. 1425 00:49:13,000 --> 00:49:14,890 And this is allowing the field 1426 00:49:14,890 --> 00:49:17,080 to design experiments to explore what remains-- 1427 00:49:17,080 --> 00:49:20,050 what's unique and powerful about primate object perception. 1428 00:49:20,050 --> 00:49:21,520 Working within core object recognition, 1429 00:49:21,520 --> 00:49:23,560 or perhaps having to extend out of that, 1430 00:49:23,560 --> 00:49:26,320 is, I think, now what people are trying to do.
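To pin down the decoding model summarized at the start of this recap: mean firing rates in a roughly 100 millisecond window feed a linear readout (a learned weighted sum) trained from a limited number of labeled examples per task. A minimal sketch with random stand-in data and far fewer neurons than the numbers quoted above; the particular classifier here (logistic regression) is just one convenient linear readout, not necessarily the one used in the published work.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

n_neurons = 500                    # far fewer than the ~50,000 extrapolated in the talk
n_train, n_test = 100, 400         # ~100 training examples per task, as in the summary

rng = np.random.default_rng(0)
spike_counts = rng.poisson(5.0, size=(n_train + n_test, n_neurons))  # counts in a 100 ms window
rates = spike_counts / 0.1                                           # mean rates (spikes/s)
labels = rng.integers(0, 2, size=n_train + n_test)                   # e.g. car vs. not-car

# A learned weighted sum of the rates: fit on the training trials,
# then read out the task on held-out trials.
decoder = LogisticRegression(max_iter=1000).fit(rates[:n_train], labels[:n_train])
accuracy = decoder.score(rates[n_train:], labels[n_train:])
print(f"held-out accuracy of the learned weighted sum: {accuracy:.2f}")
```

With real IT recordings in place of the random counts, the held-out accuracy of this kind of readout is the quantity that is compared against human and monkey performance.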
1431 00:49:26,320 --> 00:49:28,570 So, big picture, in terms of where we go in the future: 1432 00:49:28,570 --> 00:49:30,850 I've talked about this LaWS of RAD IT idea. 1433 00:49:30,850 --> 00:49:32,650 Can we perturb here and get effects here 1434 00:49:32,650 --> 00:49:33,940 that are predictable? 1435 00:49:33,940 --> 00:49:36,770 Can we make predictions for each image, for the coding model, 1436 00:49:36,770 --> 00:49:39,130 and for the optical manipulations? 1437 00:49:39,130 --> 00:49:40,840 We talked about that. 1438 00:49:40,840 --> 00:49:42,400 Dynamics and feedback are something 1439 00:49:42,400 --> 00:49:43,510 that we're interested in, 1440 00:49:43,510 --> 00:49:45,310 but that I haven't talked much about at all. 1441 00:49:45,310 --> 00:49:48,850 I think that's a good point-- a discussion topic. 1442 00:49:48,850 --> 00:49:51,160 I can tell you how we're thinking about it. 1443 00:49:51,160 --> 00:49:54,100 We have some efforts in that regard. 1444 00:49:54,100 --> 00:49:56,380 I talked on the encoding side about these kinds 1445 00:49:56,380 --> 00:49:59,050 of deep convolutional networks that map from images. 1446 00:49:59,050 --> 00:50:01,570 But the dashed lines mean they're only about 50% predictive. 1447 00:50:01,570 --> 00:50:04,570 In both of these cases, they're not perfect, right? 1448 00:50:04,570 --> 00:50:06,380 So there's work to be done there. 1449 00:50:06,380 --> 00:50:07,930 And one of the really exciting things 1450 00:50:07,930 --> 00:50:09,377 here is how these models learn. 1451 00:50:09,377 --> 00:50:11,210 This supervised way of learning these models 1452 00:50:11,210 --> 00:50:13,480 is almost surely not what's going on in the brain. 1453 00:50:13,480 --> 00:50:16,930 So finding more-- less supervised, biologically 1454 00:50:16,930 --> 00:50:18,880 motivated learning of these models 1455 00:50:18,880 --> 00:50:22,660 is the next step, I think, for much of the field. 1456 00:50:22,660 --> 00:50:24,620 But what's nice is to have an end state that 1457 00:50:24,620 --> 00:50:27,930 is much better than any previous end state we'd had before. 1458 00:50:27,930 --> 00:50:32,080 So that sets a target of what success might look like. 1459 00:50:32,080 --> 00:50:34,240 And, you know, maybe we can think about expanding 1460 00:50:34,240 --> 00:50:35,880 beyond core recognition. 1461 00:50:35,880 --> 00:50:37,920 We can talk in the question period about that. 1462 00:50:37,920 --> 00:50:39,940 When is the right time to keep 1463 00:50:39,940 --> 00:50:42,640 working within the domain of core recognition as 1464 00:50:42,640 --> 00:50:45,042 it's set up, versus expanding beyond that? 1465 00:50:45,042 --> 00:50:46,750 Because there are lots of aspects of object 1466 00:50:46,750 --> 00:50:48,790 recognition that I didn't touch on here. 1467 00:50:48,790 --> 00:50:50,830 And that comes up in the questions. 1468 00:50:50,830 --> 00:50:54,190 I think there's lots of work to be done within the domain, 1469 00:50:54,190 --> 00:50:56,590 but there are also interesting directions that 1470 00:50:56,590 --> 00:50:59,070 extend outside of that domain.