WEBVTT

00:00:01.725 --> 00:00:04.080
The following content is
provided under a Creative

00:00:04.080 --> 00:00:05.620
Commons license.

00:00:05.620 --> 00:00:07.920
Your support will help
MIT OpenCourseWare

00:00:07.920 --> 00:00:12.280
continue to offer high quality
educational resources for free.

00:00:12.280 --> 00:00:14.910
To make a donation, or
view additional materials

00:00:14.910 --> 00:00:18.870
from hundreds of MIT courses,
visit MIT OpenCourseWare

00:00:18.870 --> 00:00:21.517
at OCW.MIT.edu.

00:00:21.517 --> 00:00:23.100
JAMES DICARLO: I'm
going to shift more

00:00:23.100 --> 00:00:25.980
towards this decoding
space that we talked about,

00:00:25.980 --> 00:00:29.850
the linkage between neural
activity and behavioral report.

00:00:29.850 --> 00:00:31.170
And I introduced that a bit.

00:00:31.170 --> 00:00:34.290
You just saw that there's some
powerful population activity

00:00:34.290 --> 00:00:34.980
in IT.

00:00:34.980 --> 00:00:37.450
And I'm going to expand
on that a bit here.

00:00:37.450 --> 00:00:41.114
But sort of stepping back,
when you think about it again,

00:00:41.114 --> 00:00:42.780
what I call an end
to end understanding,

00:00:42.780 --> 00:00:44.460
going from the image all
the way to neural activity

00:00:44.460 --> 00:00:47.200
to the perceptual report, one
of the things we want to do,

00:00:47.200 --> 00:00:49.540
again, is just define
a decoding mechanism

00:00:49.540 --> 00:00:52.110
that the brain uses to support
these perceptual reports.

00:00:52.110 --> 00:00:54.000
Basically, what neural
activity is directly

00:00:54.000 --> 00:00:56.250
responsible for these tasks?

00:00:56.250 --> 00:00:58.359
And I'll come back later
to this encoding side.

00:00:58.359 --> 00:01:00.150
It's like, you know,
and notice I'm putting

00:01:00.150 --> 00:01:01.274
these in this order, right?

00:01:01.274 --> 00:01:03.450
So once you know what
the relevant aspects

00:01:03.450 --> 00:01:06.900
of neural activity are in IT,
or wherever you think they are,

00:01:06.900 --> 00:01:09.690
then that sets a target
for what is the image

00:01:09.690 --> 00:01:12.420
to neural transformation that
you're trying to explain?

00:01:12.420 --> 00:01:14.220
Not predict any neural
response, but those

00:01:14.220 --> 00:01:16.554
particular aspects of
the neural response.

00:01:16.554 --> 00:01:18.720
So that's what I mean by
the relevant ventral stream

00:01:18.720 --> 00:01:20.220
patterns of activity.

00:01:20.220 --> 00:01:21.600
So we start here.

00:01:21.600 --> 00:01:24.150
We work to here, and then
we work to here, rather than

00:01:24.150 --> 00:01:25.516
the other way around.

00:01:25.516 --> 00:01:26.890
OK, so I'm going
to try it again.

00:01:26.890 --> 00:01:28.170
Keep with the domain I set up.

00:01:28.170 --> 00:01:29.520
I talked about core recognition.

00:01:29.520 --> 00:01:31.470
I now need to start
to define tasks.

00:01:31.470 --> 00:01:34.890
I'm going to talk about specific
tasks that are, for now, let's

00:01:34.890 --> 00:01:36.700
call them basic level nouns.

00:01:36.700 --> 00:01:39.130
I'm actually going to relax
that to subordinate tasks

00:01:39.130 --> 00:01:39.760
in a minute.

00:01:39.760 --> 00:01:40.830
But here they are.

00:01:40.830 --> 00:01:41.759
Car, clock, cat.

00:01:41.759 --> 00:01:43.050
These are not the actual nouns.

00:01:43.050 --> 00:01:44.100
I'll show you the ones we use.

00:01:44.100 --> 00:01:45.810
But just to fix
ideas, we're imagining

00:01:45.810 --> 00:01:49.140
a space of all possible
nouns that you might use

00:01:49.140 --> 00:01:51.520
to describe what you just saw.

00:01:51.520 --> 00:01:53.520
And I'm going to have a
generative image domain.

00:01:53.520 --> 00:01:55.426
So I now have a
space of images here.

00:01:55.426 --> 00:01:57.300
I'm not just going to
draw these off the web.

00:01:57.300 --> 00:01:58.560
We're going to
generate our own image

00:01:58.560 --> 00:02:00.600
domain that we think
engages on the problem,

00:02:00.600 --> 00:02:02.760
but gives us control of
the latent variables.

00:02:02.760 --> 00:02:03.950
So I'll show you that now.

00:02:03.950 --> 00:02:05.325
So the way we're
going to do this

00:02:05.325 --> 00:02:08.850
is by generating one
foreground object in each image

00:02:08.850 --> 00:02:10.530
that we're going to show.

00:02:10.530 --> 00:02:13.890
And we just did this by
taking 3-D models like these--

00:02:13.890 --> 00:02:15.450
this is a model of a car.

00:02:15.450 --> 00:02:18.180
We can control its other latent
variables beyond its identity.

00:02:18.180 --> 00:02:19.250
So this is a car.

00:02:19.250 --> 00:02:21.130
It has a particular car type.

00:02:21.130 --> 00:02:23.730
So there's a couple of latent
variables about identity here

00:02:23.730 --> 00:02:25.350
that relate to the geometry.

00:02:25.350 --> 00:02:27.210
Then there's these position--
other latent variables like

00:02:27.210 --> 00:02:28.960
position, size, and
pose that I mentioned,

00:02:28.960 --> 00:02:31.620
that are unknowns that make
the problem challenging.

00:02:31.620 --> 00:02:33.880
And we can then just,
like, render this thing.

00:02:33.880 --> 00:02:37.140
And we could place it on any
old background we wanted to.

00:02:37.140 --> 00:02:39.120
And what we did was we
tended to place them

00:02:39.120 --> 00:02:41.430
on uncorrelated
naturalistic backgrounds.

00:02:41.430 --> 00:02:44.190
And that creates these sort
of weirdish looking images.

00:02:44.190 --> 00:02:46.140
Some of them may look
sort of natural, whereas

00:02:46.140 --> 00:02:48.376
this looks pretty unnatural.

00:02:48.376 --> 00:02:49.500
But the reason we did this.

00:02:49.500 --> 00:02:50.416
Why would you do this?

00:02:50.416 --> 00:02:55.270
So-- so we did this because we
could have a generative space.

00:02:55.270 --> 00:02:58.980
And because it was-- so we know
what's going on with the latent

00:02:58.980 --> 00:03:00.244
variables we care about.

00:03:00.244 --> 00:03:02.910
And we also, when we built this,
it was challenging for computer

00:03:02.910 --> 00:03:04.360
vision systems to
deal with this,

00:03:04.360 --> 00:03:06.420
even though humans could
naturally-- you know,

00:03:06.420 --> 00:03:08.520
they don't have advantage
of any contextual cues

00:03:08.520 --> 00:03:11.970
here because by construction,
these are uncorrelated.

00:03:11.970 --> 00:03:14.100
We just took natural
images and would randomly

00:03:14.100 --> 00:03:15.460
put objects on them.

00:03:15.460 --> 00:03:17.880
But this was enough to fool
a lot of the computer vision

00:03:17.880 --> 00:03:21.000
systems at the time that tended
to rely on the contextual cues.

00:03:21.000 --> 00:03:23.652
Like blue in the background
signals it being an airplane,

00:03:23.652 --> 00:03:25.610
we didn't want those kind
of things being done.

00:03:25.610 --> 00:03:27.840
We wanted the actual
extraction of object identity.

00:03:27.840 --> 00:03:29.940
And again, humans
could do it quite well.

00:03:29.940 --> 00:03:31.680
So that's why we
ended up in this sort

00:03:31.680 --> 00:03:34.830
of maybe this no man's
land of image space, which

00:03:34.830 --> 00:03:38.310
is not very simple, but not
ImageNet just pulled off

00:03:38.310 --> 00:03:39.779
the web.

00:03:39.779 --> 00:03:41.070
And so that's how we got there.

00:03:41.070 --> 00:03:42.870
And just to give you a
sense that this is actually

00:03:42.870 --> 00:03:44.910
quite doable for humans,
I'll show you a few images.

00:03:44.910 --> 00:03:46.050
I won't even cue
you what they are.

00:03:46.050 --> 00:03:47.883
I'm going to show them
for 100 milliseconds.

00:03:47.883 --> 00:03:50.555
You can kind of shout
out what object you see.

00:03:50.555 --> 00:03:51.525
AUDIENCE: Car.

00:03:51.525 --> 00:03:55.265
AUDIENCE: [INAUDIBLE]

00:03:55.265 --> 00:03:56.140
JAMES DICARLO: Right.

00:03:56.140 --> 00:03:57.931
So see, it's pretty
straightforward, right?

00:03:57.931 --> 00:04:00.655
And though those look weird, you
can do that quite well.

00:04:00.655 --> 00:04:03.280
And you know, here's the kind of
images that we would generate.

00:04:03.280 --> 00:04:05.997
This would be-- so when
we think of image bags,

00:04:05.997 --> 00:04:07.580
we think of partitions
of image space.

00:04:07.580 --> 00:04:10.490
This is some images that
would correspond to faces.

00:04:10.490 --> 00:04:12.962
These are all images of faces
under some transformations.

00:04:12.962 --> 00:04:14.170
Again, different backgrounds.

00:04:14.170 --> 00:04:15.190
These are not faces.

00:04:15.190 --> 00:04:17.740
These are other objects
again, under transformations.

00:04:17.740 --> 00:04:20.260
And we can have as many
of these as we want.

00:04:20.260 --> 00:04:21.709
We call this one--

00:04:21.709 --> 00:04:24.280
this distinction, when
shown for 100 milliseconds--

00:04:24.280 --> 00:04:25.780
is one core recognition test.

00:04:25.780 --> 00:04:27.530
Discriminate face from not face.

00:04:27.530 --> 00:04:28.989
Here is a subordinate task.

00:04:28.989 --> 00:04:30.280
This is Beetle from not Beetle.

00:04:30.280 --> 00:04:31.690
This is a particular
type of car.

00:04:31.690 --> 00:04:33.487
You can see it's
more challenging.

00:04:33.487 --> 00:04:35.320
Again, we don't show
these images like this.

00:04:35.320 --> 00:04:36.695
This is just to
show you the set.

00:04:36.695 --> 00:04:38.210
We show them one at a time.

00:04:38.210 --> 00:04:40.730
And so let me now
go ahead and say,

00:04:40.730 --> 00:04:43.240
we're going to try to make
a predictive model using

00:04:43.240 --> 00:04:46.990
that kind of image space to
see if we can understand what

00:04:46.990 --> 00:04:48.940
are the relevant aspects
of neural activity

00:04:48.940 --> 00:04:51.640
that can predict human
report on an image space?

00:04:51.640 --> 00:04:54.280
And when I say we, I
mean Najib Majaj and Ha

00:04:54.280 --> 00:04:56.230
Hong, who are post-doc
and graduate student

00:04:56.230 --> 00:04:59.260
that were in the lab that
led this experimental work.

00:04:59.260 --> 00:05:03.200
And Ethan Solomon and Dan Yamins
also contributed to the work.

00:05:03.200 --> 00:05:07.720
So what we did was to try to
record a bunch of IT activity

00:05:07.720 --> 00:05:10.390
to measure what's going
on in the population

00:05:10.390 --> 00:05:12.940
as I showed you earlier, but
now in this more defined space

00:05:12.940 --> 00:05:15.190
where we're going to collect
a bunch of human behavior

00:05:15.190 --> 00:05:18.040
to compare possible
ways of reading IT

00:05:18.040 --> 00:05:20.200
with the behavior of the human.

00:05:20.200 --> 00:05:21.820
This is how we started.

00:05:21.820 --> 00:05:23.310
We're now doing monkeys--

00:05:23.310 --> 00:05:25.670
where we're recording and
the monkey's doing a task.

00:05:25.670 --> 00:05:28.720
But what we did here was we
just had passively fixating monkeys,

00:05:28.720 --> 00:05:30.812
compared with behaving humans.

00:05:30.812 --> 00:05:32.770
And as I showed you
earlier, monkeys and humans

00:05:32.770 --> 00:05:34.850
have very similar
patterns of behavior.

00:05:34.850 --> 00:05:37.120
So what we record
from IT, in this case,

00:05:37.120 --> 00:05:39.170
we were using array
recording electrodes.

00:05:39.170 --> 00:05:40.750
These are chronically implanted.

00:05:40.750 --> 00:05:41.650
This shows them here.

00:05:41.650 --> 00:05:43.066
You implant them
during a surgery,

00:05:43.066 --> 00:05:44.520
as kind of is shown here.

00:05:44.520 --> 00:05:45.920
Down in the IT cortex.

00:05:45.920 --> 00:05:47.119
You can get their size here.

00:05:47.119 --> 00:05:48.160
There are about hundred--

00:05:48.160 --> 00:05:50.739
there's actually 96
electrodes on each of them.

00:05:50.739 --> 00:05:52.780
They typically yield about
half of the electrodes

00:05:52.780 --> 00:05:54.700
having active neurons on them.

00:05:54.700 --> 00:05:58.309
So you get, you know, on the
order of 150 recording sites.

00:05:58.309 --> 00:05:59.350
And you can lay them out.

00:05:59.350 --> 00:06:01.016
You can lay-- we would
typically lay out

00:06:01.016 --> 00:06:04.990
three of them across IT and
V4 to record a population

00:06:04.990 --> 00:06:06.790
sample out of IT.

00:06:06.790 --> 00:06:09.269
And we would do this across
multiple monkeys.

00:06:09.269 --> 00:06:11.560
And here's an example of the
kind of data we would get.

00:06:11.560 --> 00:06:14.276
This is 168 IT recording sites.

00:06:14.276 --> 00:06:16.150
This is similar to what
I showed you earlier.

00:06:16.150 --> 00:06:19.600
This is the mean response
in a particular time window

00:06:19.600 --> 00:06:21.790
out of IT, similar to
what I showed you earlier

00:06:21.790 --> 00:06:23.230
in that study with Gabriel.

00:06:23.230 --> 00:06:26.230
And what we do here is, I'm just
showing you to give you a feel.

00:06:26.230 --> 00:06:27.250
That's one image.

00:06:27.250 --> 00:06:30.460
Here is eight more--
here's seven more images.

00:06:30.460 --> 00:06:32.200
And these are just
the population vectors

00:06:32.200 --> 00:06:34.160
in a graphic form.

00:06:34.160 --> 00:06:36.310
And but we actually
collected nearly 25--

00:06:36.310 --> 00:06:37.870
this is 2,560 images.

00:06:37.870 --> 00:06:39.250
This is sort of
the mean response

00:06:39.250 --> 00:06:41.230
data of these 168 neurons.

00:06:41.230 --> 00:06:43.550
And now you have this again,
this rich population data.

00:06:43.550 --> 00:06:46.030
And you can ask, what's
available in there

00:06:46.030 --> 00:06:47.200
to support these tasks?

00:06:47.200 --> 00:06:49.870
And how well does it predict
human patterns of performance

00:06:49.870 --> 00:06:51.190
on those tasks?

00:06:51.190 --> 00:06:54.391
So in this study, that's
all we were asking to do.

00:06:54.391 --> 00:06:56.140
We're trying to do
more and more recently.

00:06:56.140 --> 00:06:58.639
But let me show you. All
we were trying to do is to say,

00:06:58.639 --> 00:06:59.140
look.

00:06:59.140 --> 00:07:00.910
One thing we observed,
even though you

00:07:00.910 --> 00:07:03.670
saw that car-- you could
do car, you could do faces.

00:07:03.670 --> 00:07:05.710
It seemed like you
were doing 100%.

00:07:05.710 --> 00:07:08.080
Turns out you're better at
some things than others.

00:07:08.080 --> 00:07:10.510
So discriminate-- this is
a D prime map of humans.

00:07:10.510 --> 00:07:12.700
So red means good performance.

00:07:12.700 --> 00:07:14.090
High D prime.

00:07:14.090 --> 00:07:16.404
You know, a D prime of
3 is something like--

00:07:16.404 --> 00:07:18.820
I don't know, psychophysicists
in the room may correct me.

00:07:18.820 --> 00:07:22.000
A D prime of 3 is sort of
on the order of 90-some, 95%

00:07:22.000 --> 00:07:23.570
correct, in that range.

00:07:23.570 --> 00:07:25.450
So these are very high
performance levels

00:07:25.450 --> 00:07:27.400
when you get up to 5.

00:07:27.400 --> 00:07:28.660
0 is chance.

00:07:28.660 --> 00:07:31.070
So 50%-- well this
is an eight way task.

00:07:31.070 --> 00:07:33.220
So one over eight correct.
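
NOTE
Editor's sketch, not from the lecture: the mapping from D prime to percent correct quoted above can be checked numerically. This assumes an unbiased observer in a yes/no task with equal-variance Gaussian noise, so that proportion correct is Phi(d'/2). The function names are hypothetical. Under this two-alternative assumption chance is 50%, whereas chance in the eight-way tasks described here is one in eight.
```python
from statistics import NormalDist
#
_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1
#
def d_prime(hit_rate: float, false_alarm_rate: float) -> float:
    """d' = z(hit rate) - z(false-alarm rate); both rates must lie strictly in (0, 1)."""
    return _STD_NORMAL.inv_cdf(hit_rate) - _STD_NORMAL.inv_cdf(false_alarm_rate)
#
def percent_correct(dp: float) -> float:
    """Proportion correct for an unbiased yes/no observer: Phi(d'/2)."""
    return _STD_NORMAL.cdf(dp / 2.0)
#
if __name__ == "__main__":
    for dp in (0.0, 1.0, 3.0, 5.0):
        print(f"d' = {dp:.1f}: {100 * percent_correct(dp):.1f}% correct")
```
Under these assumptions a D prime of 3 corresponds to roughly 93% correct, which sits in the range quoted in the lecture.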

00:07:33.220 --> 00:07:36.910
So the subjects were doing
either eight way basic level

00:07:36.910 --> 00:07:39.527
tasks, or eight way subordinate
cars, or eight way faces.

00:07:39.527 --> 00:07:41.860
And these are the D prime
levels under different amounts

00:07:41.860 --> 00:07:43.693
of variation of those
other latent variables

00:07:43.693 --> 00:07:44.967
position, size, and pose.

00:07:44.967 --> 00:07:46.300
Don't worry about those details.

00:07:46.300 --> 00:07:48.520
What I want you to
see is the color here.

00:07:48.520 --> 00:07:50.560
So look, it's tables versus--

00:07:50.560 --> 00:07:53.020
discriminating tables from
all these other objects.

00:07:53.020 --> 00:07:55.120
You do that at a
very high D prime.

00:07:55.120 --> 00:07:57.700
Discriminating Beetles
from other cars,

00:07:57.700 --> 00:08:00.580
you do it at slightly
lower D prime.

00:08:00.580 --> 00:08:02.934
You can see this, especially
at a high variation,

00:08:02.934 --> 00:08:05.350
you're actually starting to
get down to lower performance.

00:08:05.350 --> 00:08:07.720
And faces-- one face
versus another face,

00:08:07.720 --> 00:08:09.530
you're actually
quite poor at that.

00:08:09.530 --> 00:08:11.369
You're a little bit
better than chance.

00:08:11.369 --> 00:08:13.660
But it's actually quite
challenging in 100 milliseconds

00:08:13.660 --> 00:08:15.730
without hair and
glasses to discriminate

00:08:15.730 --> 00:08:17.680
those 3-D kind of face models.

00:08:17.680 --> 00:08:20.260
I showed you Sam and
Joe earlier as examples.

00:08:20.260 --> 00:08:22.480
It's actually quite
challenging to do

00:08:22.480 --> 00:08:25.040
that for humans in
that domain of faces.

00:08:25.040 --> 00:08:26.631
So, what I want to
show you here is

00:08:26.631 --> 00:08:28.630
you have this pattern of
behavioral performance.

00:08:28.630 --> 00:08:29.980
You have all this IT activity.

00:08:29.980 --> 00:08:30.680
This is humans.

00:08:30.680 --> 00:08:31.767
This is monkeys.

00:08:31.767 --> 00:08:33.350
And what we wanted
to do is say, look.

00:08:33.350 --> 00:08:34.370
We can use this pattern.

00:08:34.370 --> 00:08:36.610
This is very repeatable
across humans.

00:08:36.610 --> 00:08:39.190
Can we use this repeatable
behavioral pattern

00:08:39.190 --> 00:08:41.559
to understand what
aspects of this activity

00:08:41.559 --> 00:08:43.210
could map to that?

00:08:43.210 --> 00:08:44.800
And again, this
pattern is reliable.

00:08:44.800 --> 00:08:45.910
I just said that.

00:08:45.910 --> 00:08:48.220
And it's not as if you
can predict this pattern

00:08:48.220 --> 00:08:51.210
by just running classifiers
on pixels or V1.

00:08:51.210 --> 00:08:53.297
In fact, I'll show
you that in a minute.

00:08:53.297 --> 00:08:55.380
But we thought there's
some aspects of IT activity

00:08:55.380 --> 00:08:56.390
that would predict this.

00:08:56.390 --> 00:08:59.250
And we wanted to try to
find those aspects to--

00:08:59.250 --> 00:09:01.920
so, again, this was motivated
by that study I showed you

00:09:01.920 --> 00:09:02.800
earlier.

00:09:02.800 --> 00:09:05.220
So which part of the
IT population activity

00:09:05.220 --> 00:09:08.190
could predict this behavior
over all recognition tasks?

00:09:08.190 --> 00:09:11.086
We're seeking a general
decoding model that would work.

00:09:11.086 --> 00:09:12.210
Here's some specific tasks.

00:09:12.210 --> 00:09:13.350
But we'd like it to be--

00:09:13.350 --> 00:09:15.900
work over any task that we
could imagine testing humans

00:09:15.900 --> 00:09:18.240
within this domain
of taking 3D models,

00:09:18.240 --> 00:09:19.650
putting them under variation.

00:09:19.650 --> 00:09:20.940
Work over that entire domain.

00:09:20.940 --> 00:09:22.725
That was what we
were hoping to do.

00:09:22.725 --> 00:09:24.600
So again, I'll briefly
take you through this.

00:09:24.600 --> 00:09:26.040
Because I already
showed you this earlier.

00:09:26.040 --> 00:09:28.415
Again, we've previously shown
that you could kind of take

00:09:28.415 --> 00:09:30.570
this kind of state
space, and say hey,

00:09:30.570 --> 00:09:33.027
can you separate images
of faces from non-faces,

00:09:33.027 --> 00:09:34.860
using these simple
linear classifiers, which

00:09:34.860 --> 00:09:38.730
are essentially weighted
sums on the IT activity?
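
NOTE
Editor's sketch, not from the lecture: a linear classifier expressed as a weighted sum over population responses, learned from training examples and scored on held-out images. The data are simulated (unit-variance Gaussian noise plus a hypothetical per-site shift on face images), the mean-difference rule is a simple stand-in rather than the classifier used in the study, and every name and parameter value is an assumption made for illustration.
```python
import random
#
def simulate_population(n_images, n_sites, shifts, is_face, rng):
    """Hypothetical stand-in for recordings: unit-variance noise per site,
    plus a site-specific mean shift on face images."""
    return [[rng.gauss(shifts[s] if is_face else 0.0, 1.0) for s in range(n_sites)]
            for _ in range(n_images)]
#
def column_means(rows):
    """Mean response of each site across a list of population vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]
#
def fit_weighted_sum(faces, nonfaces):
    """Mean-difference linear readout: weights point from the non-face mean
    to the face mean, and the threshold sits at the midpoint between them."""
    mu_f, mu_n = column_means(faces), column_means(nonfaces)
    weights = [f - n for f, n in zip(mu_f, mu_n)]
    threshold = sum(w * (f + n) / 2.0 for w, f, n in zip(weights, mu_f, mu_n))
    return weights, threshold
#
def decode(weights, threshold, response):
    """Report 'face' (True) when the weighted sum exceeds the threshold."""
    return sum(w * r for w, r in zip(weights, response)) > threshold
#
def run_demo(seed=0, n_sites=50, n_train=100, n_test=200):
    """Train on simulated images, then return accuracy on fresh held-out images."""
    rng = random.Random(seed)
    shifts = [rng.gauss(0.0, 0.5) for _ in range(n_sites)]
    train_f = simulate_population(n_train, n_sites, shifts, True, rng)
    train_n = simulate_population(n_train, n_sites, shifts, False, rng)
    weights, threshold = fit_weighted_sum(train_f, train_n)
    test_f = simulate_population(n_test, n_sites, shifts, True, rng)
    test_n = simulate_population(n_test, n_sites, shifts, False, rng)
    correct = sum(decode(weights, threshold, r) for r in test_f)
    correct += sum(not decode(weights, threshold, r) for r in test_n)
    return correct / (2 * n_test)
#
if __name__ == "__main__":
    print(f"held-out accuracy: {run_demo():.2f}")
```
The point of the sketch is only that the readout is a fixed weighted sum plus a threshold; swapping in another linear learner would change the weights but not the form of the decoder.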

00:09:38.730 --> 00:09:40.620
And now we wanted
to ask, could this

00:09:40.620 --> 00:09:43.260
predict human behavioral
face performance,

00:09:43.260 --> 00:09:45.600
and monkey, because again,
they're very similar.

00:09:45.600 --> 00:09:47.970
And not only would
this class of decoding

00:09:47.970 --> 00:09:51.420
models that was motivated by the
earlier work predict this task,

00:09:51.420 --> 00:09:53.700
but would predict car detection?

00:09:53.700 --> 00:09:56.580
Would the same model predict
car one versus car two?

00:09:56.580 --> 00:09:58.050
That's a subordinate task.

00:09:58.050 --> 00:09:59.120
And all such tasks.

00:09:59.120 --> 00:10:01.740
Again, over the whole domain,
can you take the same decoding

00:10:01.740 --> 00:10:03.549
strategy and take
the data and say,

00:10:03.549 --> 00:10:05.840
I'm going to just learn on
a certain number of training

00:10:05.840 --> 00:10:07.830
examples, build a
classifier, and then I'll

00:10:07.830 --> 00:10:10.320
say that's my model
of how the human does

00:10:10.320 --> 00:10:11.560
every one of these tasks.

00:10:11.560 --> 00:10:13.542
And if that's true,
then it should perfectly

00:10:13.542 --> 00:10:15.000
predict that pattern
or performance

00:10:15.000 --> 00:10:16.900
that I just showed you earlier.

00:10:16.900 --> 00:10:20.550
And so here was again, this
was the working hypothesis.

00:10:20.550 --> 00:10:22.842
Passively evoked spike rates
using a single fixed time

00:10:22.842 --> 00:10:25.050
scale that are spatially
distributed, because they're

00:10:25.050 --> 00:10:29.400
sampled over IT, over a single
fixed number of non-human--

00:10:29.400 --> 00:10:31.260
of non-human primate cortex.

00:10:31.260 --> 00:10:33.027
So a single number of neurons.

00:10:33.027 --> 00:10:35.360
And learned from a reasonable
number of training examples.

00:10:35.360 --> 00:10:39.390
So all of that is a decoding
class of models that we thought

00:10:39.390 --> 00:10:40.617
might work.

00:10:40.617 --> 00:10:42.450
And if this is correct--
this is what I just

00:10:42.450 --> 00:10:44.490
said-- it should predict
the behavioral data that we

00:10:44.490 --> 00:10:44.989
collect.

00:10:44.989 --> 00:10:47.350
For example, the D prime
data I just showed you.

00:10:47.350 --> 00:10:51.000
But also more fine grained
behavioral data in principle.

00:10:51.000 --> 00:10:52.530
So I want to just
step back to make

00:10:52.530 --> 00:10:55.880
it clear that it's not obvious
that this should work, right?

00:10:55.880 --> 00:10:56.960
I mean, it depends--

00:10:56.960 --> 00:10:59.880
in the audience, I get people
on completely different sides

00:10:59.880 --> 00:11:01.990
of this, whether this
should work or not.

00:11:01.990 --> 00:11:03.990
So, you know, one thing
is, like, well look,

00:11:03.990 --> 00:11:04.920
it's passively evoked.

00:11:04.920 --> 00:11:07.490
You heard Gabriel say, well,
he didn't like passive tasks.

00:11:07.490 --> 00:11:08.220
And I agree with that.

00:11:08.220 --> 00:11:09.510
In the ideal world,
the animal will

00:11:09.510 --> 00:11:10.710
be actively doing the task.

00:11:10.710 --> 00:11:12.390
And then you'd say, well I'll
measure while the animal's

00:11:12.390 --> 00:11:12.870
doing the task.

00:11:12.870 --> 00:11:14.953
That's going to be your
best chance of prediction.

00:11:14.953 --> 00:11:16.800
But we also saw earlier
that that passively

00:11:16.800 --> 00:11:17.904
evoked monkey still--

00:11:17.904 --> 00:11:20.070
you know, nobody would argue
that passively evoked

00:11:20.070 --> 00:11:24.330
retinal data is not going to be
somewhat applicable to vision.

00:11:24.330 --> 00:11:26.100
And you know, the
question is, how much

00:11:26.100 --> 00:11:28.470
of those arousal effects
show up in a place

00:11:28.470 --> 00:11:31.110
like IT cortex, which
is typically high?

00:11:31.110 --> 00:11:32.850
Which is high up in
the ventral stream.

00:11:32.850 --> 00:11:34.722
So you could argue
both sides of this.

00:11:34.722 --> 00:11:36.930
But it's possible that
attentional arousal mechanisms

00:11:36.930 --> 00:11:40.120
are needed to make this a good
predictive linkage between that

00:11:40.120 --> 00:11:43.554
to sort of activate IT in this
sort of crude way, if you like.

00:11:43.554 --> 00:11:45.720
Some people have pointed
out that you need the trial

00:11:45.720 --> 00:11:47.550
by trial coordinated
spike timing

00:11:47.550 --> 00:11:49.650
structure to actually
make good predictions,

00:11:49.650 --> 00:11:51.160
that those are critical.

00:11:51.160 --> 00:11:54.570
Some people have pointed out
that you have to kind of assign

00:11:54.570 --> 00:11:57.030
different parts of IT to
particular roles, which is

00:11:57.030 --> 00:11:58.840
a prior on the decoding space.

00:11:58.840 --> 00:12:01.680
For instance, that you could
believe that biologically,

00:12:01.680 --> 00:12:02.460
an animal's born.

00:12:02.460 --> 00:12:04.876
There's some tissue that's
going to be dedicated to faces.

00:12:04.876 --> 00:12:07.502
You have to wire those neurons
downstream to that tissue.

00:12:07.502 --> 00:12:09.960
And that means you're going to
restrict the decoding space,

00:12:09.960 --> 00:12:12.555
rather than just letting them
learn from the space of IT

00:12:12.555 --> 00:12:15.600
as if they collected
samples off of all of IT.

00:12:15.600 --> 00:12:16.980
So I think some
people implicitly

00:12:16.980 --> 00:12:20.180
believe that even if it's
not stated quite that way.

00:12:20.180 --> 00:12:22.110
IT does not directly
underlie recognition.

00:12:22.110 --> 00:12:23.460
You could imagine that.

00:12:23.460 --> 00:12:25.090
I mean, it's not for sure known.

00:12:25.090 --> 00:12:27.690
And some lesions of IT
don't produce deficits

00:12:27.690 --> 00:12:28.320
in recognition.

00:12:28.320 --> 00:12:29.340
That's a possibility.

00:12:29.340 --> 00:12:31.090
Maybe you need too
many training examples.

00:12:31.090 --> 00:12:33.992
Monkey neural codes cannot
explain human behavior.

00:12:33.992 --> 00:12:35.700
You know, again, but
I already showed you

00:12:35.700 --> 00:12:36.970
monkeys and humans
are very similar.

00:12:36.970 --> 00:12:38.345
So these are the
reasons that you

00:12:38.345 --> 00:12:40.350
might say this is negative,
and might not work.

00:12:40.350 --> 00:12:41.560
And you've probably
already guessed

00:12:41.560 --> 00:12:43.935
that I'm telling you all these
negatives because it turns out

00:12:43.935 --> 00:12:46.800
this simple thing works quite
well for the grain of behavior

00:12:46.800 --> 00:12:48.359
that I've shown you so far.

00:12:48.359 --> 00:12:49.650
And here's my evidence of that.

00:12:49.650 --> 00:12:52.117
So this is actual behavioral
performance out of humans

00:12:52.117 --> 00:12:53.200
that I showed you earlier.

00:12:53.200 --> 00:12:54.074
This is mean D prime.

00:12:54.074 --> 00:12:56.010
This is the predicted
behavior or performance

00:12:56.010 --> 00:12:59.520
of taking a classifier, reading
from that IT population data

00:12:59.520 --> 00:13:02.520
that I've shown you, which
gives a predicted D prime.

00:13:02.520 --> 00:13:04.080
Here is-- we first
chose a decoder.

00:13:04.080 --> 00:13:06.180
We had to match things
like the number of neurons.

00:13:06.180 --> 00:13:07.800
We had to get it in
the ballpark, so--

00:13:07.800 --> 00:13:08.895
because again, there's
a free variable,

00:13:08.895 --> 00:13:09.895
as I showed you earlier.

00:13:09.895 --> 00:13:11.080
There's at least one.

00:13:11.080 --> 00:13:14.130
But for now, let's think of
matching the number of neurons

00:13:14.130 --> 00:13:16.299
to get you near the
diagonal, so that you

00:13:16.299 --> 00:13:18.090
have a sufficient number
of neural recordings

00:13:18.090 --> 00:13:20.700
to say, how well do you do
on a face detection task?

00:13:20.700 --> 00:13:22.267
And then, here's
all the other tasks.

00:13:22.267 --> 00:13:24.350
This is those 64 points
that I showed you earlier.

00:13:24.350 --> 00:13:26.910
Here's some examples like
fruit versus other things, car

00:13:26.910 --> 00:13:28.597
versus other things.

00:13:28.597 --> 00:13:30.930
And you should see that all
these points kind of line up

00:13:30.930 --> 00:13:33.546
along this diagonal, which
says, wow, this is actually

00:13:33.546 --> 00:13:35.670
quite predictive, that I
can take this simple thing

00:13:35.670 --> 00:13:38.980
and predict all the stuff
that we've collected so far.

00:13:38.980 --> 00:13:40.980
And so let me now kind
of be more concrete

00:13:40.980 --> 00:13:43.230
about what is the
inferred neural mechanism

00:13:43.230 --> 00:13:44.820
that we're testing here?

00:13:44.820 --> 00:13:47.050
Well, I'll show you in a minute.

00:13:47.050 --> 00:13:49.560
This is, for each
new object, we think

00:13:49.560 --> 00:13:51.570
what happens is some
downstream observer,

00:13:51.570 --> 00:13:54.550
a downstream neuron, randomly
samples roughly 50,000

00:13:54.550 --> 00:13:57.430
single neurons, spatially
distributed over all of IT,

00:13:57.430 --> 00:13:59.530
not biased to any compartments.

00:13:59.530 --> 00:14:02.074
Listens to each IT site.

00:14:02.074 --> 00:14:03.490
When I say listen
in this case, we

00:14:03.490 --> 00:14:05.477
think it could average
over 100 milliseconds.

00:14:05.477 --> 00:14:06.560
We're not sure about this.

00:14:06.560 --> 00:14:08.860
This is just the version
that's shown here.

00:14:08.860 --> 00:14:11.680
Learn an appropriate weighted
sum of that IT spiking.

00:14:11.680 --> 00:14:12.970
And then listen at 10%.

00:14:12.970 --> 00:14:14.650
That's basically,
once you learn,

00:14:14.650 --> 00:14:18.169
there's about 10%
of the IT neurons that

00:14:18.169 --> 00:14:19.960
are heavily weighted
for each of the tasks.

00:14:19.960 --> 00:14:22.690
That's just an observation
that we have in our data.

00:14:22.690 --> 00:14:25.300
But this is trying to map it
to neuroscientist language

00:14:25.300 --> 00:14:28.030
from these decoder
versions out of IT.

00:14:28.030 --> 00:14:29.650
So what that is, is a
model that says,

00:14:29.650 --> 00:14:32.384
learned weighted sums of 50,000
randomly sampled, 100-millisecond-averaged

00:14:32.384 --> 00:14:34.300
single unit responses
distributed over all IT.

00:14:34.300 --> 00:14:37.760
So a bunch of stuff in
here is what your model

00:14:37.760 --> 00:14:39.190
is sort of encapsulating.

00:14:39.190 --> 00:14:40.416
That's still too long.

00:14:40.416 --> 00:14:42.040
So I made a little
acronym out of that.

00:14:42.040 --> 00:14:45.010
And that's called the LaWS of
RAD IT decoding mechanism.

00:14:45.010 --> 00:14:49.000
So this is just to say there's
a hypothesis of how everything

00:14:49.000 --> 00:14:52.330
might work, but now it can make
predictions for other objects

00:14:52.330 --> 00:14:54.520
and could potentially
be falsified.

00:14:54.520 --> 00:14:59.590
So, so far, this model works
quite well over these tasks.

00:14:59.590 --> 00:15:01.240
And in fact, the
correlation is 0.92.

00:15:01.240 --> 00:15:03.637
You might look at this and
say, oh, it's not perfect.

00:15:03.637 --> 00:15:05.470
But it turns out that
that's about the level

00:15:05.470 --> 00:15:07.220
at which humans
differ from each other.

00:15:07.220 --> 00:15:10.870
So it's passing a Turing test,
that this mechanism read off

00:15:10.870 --> 00:15:14.020
of the monkey IT hides
in the distribution

00:15:14.020 --> 00:15:16.870
of the human population that
we're asking to also perform

00:15:16.870 --> 00:15:17.680
these same tasks.

00:15:17.680 --> 00:15:19.305
So it can't be
distinguished from being

00:15:19.305 --> 00:15:22.750
a human in these tasks.
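
NOTE
Editor's sketch, not from the lecture: one way to express the consistency comparison described above is a Pearson correlation between the decoder's predicted D prime values and pooled human D prime values across tasks, set against the same correlation for a held-out human. All numbers below are made up for illustration and are not data from the study; the names are hypothetical.
```python
from math import sqrt
#
def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
#
# Made-up D prime values for five tasks (illustration only, not study data).
pooled_human = [4.1, 3.2, 2.5, 1.4, 0.6]
decoder_prediction = [3.8, 3.4, 2.1, 1.6, 0.9]
held_out_human = [4.3, 2.9, 2.6, 1.1, 0.8]
#
model_consistency = pearson_r(decoder_prediction, pooled_human)
human_consistency = pearson_r(held_out_human, pooled_human)
#
if __name__ == "__main__":
    print(f"decoder vs. pooled humans: r = {model_consistency:.2f}")
    print(f"held-out human vs. pooled humans: r = {human_consistency:.2f}")
```
In this framing the decoder "passes" when its consistency falls within the range of human-to-human consistency, rather than when it reaches a correlation of 1.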

00:15:22.750 --> 00:15:24.640
You guys watch "Ex Machina"?

00:15:24.640 --> 00:15:25.990
Wasn't that a movie I saw?

00:15:25.990 --> 00:15:27.190
Doesn't pass that test.

00:15:27.190 --> 00:15:29.760
Passes just a simple
core recognition test.

00:15:29.760 --> 00:15:31.570
But so that was a
Turing test of this.

00:15:31.570 --> 00:15:34.510
So OK, so, this is
here that I quantified.

00:15:34.510 --> 00:15:37.000
So this is human to
human consistency.

00:15:37.000 --> 00:15:38.380
That's the range
I just mentioned

00:15:38.380 --> 00:15:41.500
that, you've got to get
into here to pass our Turing

00:15:41.500 --> 00:15:42.387
test on this.

00:15:42.387 --> 00:15:44.470
And that's a decoding
mechanism I just showed you.

00:15:44.470 --> 00:15:47.460
There's other ways of reading
out of IT that don't pass.

00:15:47.460 --> 00:15:49.960
There's ways of reading out of
V4, which we recorded from--

00:15:49.960 --> 00:15:52.790
none of them we've tried are
able to get you to this here.

00:15:52.790 --> 00:15:54.370
That doesn't mean
V4 isn't involved.

00:15:54.370 --> 00:15:56.200
V4 is the feeder to IT.

00:15:56.200 --> 00:15:59.470
It just means you can't take
simple decodes off of V4

00:15:59.470 --> 00:16:01.330
and naturally
produce this pattern.

00:16:01.330 --> 00:16:03.970
And that's similar for like,
pixels or V1 representations.

00:16:03.970 --> 00:16:06.610
So lower level representations
don't naturally

00:16:06.610 --> 00:16:08.542
predict this
pattern of behavior.

00:16:08.542 --> 00:16:10.000
And even some
computer vision codes

00:16:10.000 --> 00:16:12.760
that we tested at the time, as
you can see, for those of you who

00:16:12.760 --> 00:16:16.750
know these older computer
vision models, didn't do this.

00:16:16.750 --> 00:16:19.870
But more recent computer
vision models actually do.

00:16:19.870 --> 00:16:22.230
And I'll show you
that at the end.

00:16:22.230 --> 00:16:22.900
OK.

00:16:22.900 --> 00:16:25.420
So, this is a little
bit for the aficionados

00:16:25.420 --> 00:16:27.220
to tell you how
we got there. As we

00:16:27.220 --> 00:16:31.160
increase the number of units in
IT, that drives performance up.

00:16:31.160 --> 00:16:33.100
So as you read more and
more units out of IT,

00:16:33.100 --> 00:16:34.930
you get better and
better performance.

00:16:34.930 --> 00:16:36.539
That's also true out of V4.

00:16:36.539 --> 00:16:38.080
But what I'm trying to
show you here

00:16:38.080 --> 00:16:40.750
is it's like, not the
absolute performance

00:16:40.750 --> 00:16:44.260
that is the good thing
to compare a model

00:16:44.260 --> 00:16:46.390
with actual behavioral data.

00:16:46.390 --> 00:16:48.010
It's the pattern of
performance, which

00:16:48.010 --> 00:16:49.780
we call the consistency
with the humans.

00:16:49.780 --> 00:16:51.959
That's that correlation
along that diagonal

00:16:51.959 --> 00:16:53.500
that I showed you
earlier, that tasks

00:16:53.500 --> 00:16:56.020
that are hard for the models
are also hard for the humans.

00:16:56.020 --> 00:16:58.660
Tasks that are easy for humans
are also easy for the models.

00:16:58.660 --> 00:17:00.035
And you could
imagine doing that,

00:17:00.035 --> 00:17:03.560
not just at the task level,
but at the image level as well.

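NOTE
Editor's sketch, not part of the lecture. The consistency measure described in the cues above compares the pattern of performance across tasks, so one way to picture it is as a correlation between per-task model scores and per-task human scores. The Pearson formula, the function name pearson, and all numbers below are illustrative assumptions, not the lab's analysis code or data.
```python
from math import sqrt
def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
# Hypothetical per-task performance scores, one entry per task.
human_scores = [3.0, 2.5, 1.0, 0.5, 2.0]
model_a = [2.8, 2.6, 1.2, 0.4, 1.9]  # hard where humans find it hard
model_b = [2.0, 0.5, 2.5, 3.0, 1.0]  # similar average, different pattern
consistency_a = pearson(model_a, human_scores)
consistency_b = pearson(model_b, human_scores)
```
In this toy case model_a and model_b have nearly the same average score, but only model_a tracks the human pattern, so only model_a would count as consistent with humans.
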
00:17:03.560 --> 00:17:05.319
And anyway, that's
what's quantified here.

00:17:05.319 --> 00:17:07.990
And you see that when you get up
to around you know, about 100--

00:17:07.990 --> 00:17:12.010
I showed you 168
recordings out of IT.

00:17:12.010 --> 00:17:14.950
This point right there
is about 500 IT features.

00:17:14.950 --> 00:17:16.660
And taking you
through some things

00:17:16.660 --> 00:17:18.130
that maybe I won't
have time for,

00:17:18.130 --> 00:17:21.160
that's actually how we
approximate that 50,000 single

00:17:21.160 --> 00:17:21.970
IT neuron number.

00:17:21.970 --> 00:17:24.880
That's an inference
from our data

00:17:24.880 --> 00:17:27.532
based on-- we didn't actually
record 50,000 single neurons.

00:17:27.532 --> 00:17:28.990
But from these kind
of plots, we're

00:17:28.990 --> 00:17:33.190
able to make a pretty good guess
that this kind of model right

00:17:33.190 --> 00:17:34.590
here would produce--

00:17:34.590 --> 00:17:35.940
would land right there.

00:17:35.940 --> 00:17:37.810
To be consistent with
humans, and would

00:17:37.810 --> 00:17:39.760
get the absolute
level of performance

00:17:39.760 --> 00:17:41.170
which humans matched.

00:17:41.170 --> 00:17:43.140
And you know, the models
we tried out of V4,

00:17:43.140 --> 00:17:44.410
this is one example of them.

00:17:44.410 --> 00:17:45.580
They can get performance.

00:17:45.580 --> 00:17:47.850
But they can never-- they
don't match this pattern

00:17:47.850 --> 00:17:48.940
of performance naturally.

00:17:48.940 --> 00:17:51.640
They over-perform on some tasks,
and under-perform on others.

00:17:51.640 --> 00:17:53.980
They sort of reveal
themselves as not

00:17:53.980 --> 00:17:57.640
being human-like by being too
good at some things, right?

00:17:57.640 --> 00:18:00.310
So that's a way to
fail the Turing test.

00:18:00.310 --> 00:18:01.540
OK.

00:18:01.540 --> 00:18:04.010
Maybe I'll skip through this,
it's sort of the same thing.

00:18:04.010 --> 00:18:05.343
This is about training examples.

00:18:05.343 --> 00:18:06.974
If any of you guys
care about this,

00:18:06.974 --> 00:18:09.390
I could kind of take you through
how we-- there's actually

00:18:09.390 --> 00:18:11.110
a family of solutions in there.

00:18:11.110 --> 00:18:14.200
And I'm just telling you about
one of them for simplicity.

00:18:14.200 --> 00:18:17.166
So, let me then just take
it down to another grain.

00:18:17.166 --> 00:18:18.790
So that was the
pattern of performance,

00:18:18.790 --> 00:18:20.164
it's actually
naturally predicted

00:18:20.164 --> 00:18:22.990
by this first decoding
mechanism that we tried.

00:18:22.990 --> 00:18:24.830
But what about the
confusion pattern?

00:18:24.830 --> 00:18:27.370
So not just the absolute d
primes for each of these tasks,

00:18:27.370 --> 00:18:30.640
but there's finer grained data,
like how often an animal is

00:18:30.640 --> 00:18:33.160
confused with a fruit, or an
animal's confused with a face.

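NOTE
Editor's sketch, not part of the lecture. The cues above contrast a per-task d-prime with the finer-grained confusion pattern. This shows one conventional way to get both from the same table of responses, using the standard definition of d-prime as z(hit rate) minus z(false-alarm rate). The class names, counts, and function names are hypothetical.
```python
from statistics import NormalDist
def d_prime(hit_rate, false_alarm_rate):
    """Sensitivity index: z(hit rate) minus z(false-alarm rate)."""
    z = NormalDist().inv_cdf  # rates must lie strictly between 0 and 1
    return z(hit_rate) - z(false_alarm_rate)
def confusion_rates(counts):
    """Row-normalize a confusion-count matrix (rows are the true class)."""
    return [[c / sum(row) for c in row] for row in counts]
# Hypothetical response counts for three classes: animal, fruit, face.
counts = [[80, 15, 5], [10, 85, 5], [4, 6, 90]]
rates = confusion_rates(counts)
# "Animal vs. not animal": hits are animals reported as animal,
# false alarms are non-animals reported as animal.
hit = rates[0][0]
false_alarm = (counts[1][0] + counts[2][0]) / (sum(counts[1]) + sum(counts[2]))
animal_dprime = d_prime(hit, false_alarm)
```
The single d-prime collapses the table to one number per task, while the off-diagonal entries of rates keep the information about which classes get confused with which.
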
00:18:33.160 --> 00:18:34.960
These are the confusion
pattern data here.

00:18:34.960 --> 00:18:36.670
I'm sorry I don't have
the color bars up.

00:18:36.670 --> 00:18:38.753
All I'm going to need you
to do is say, well these

00:18:38.753 --> 00:18:41.380
are the confusion patterns
that we predicted.

00:18:41.380 --> 00:18:44.710
And this is the
predicted confusion pattern,

00:18:44.710 --> 00:18:49.630
if I gave the machine, the
IT, these ground truth labels.

00:18:49.630 --> 00:18:50.810
And it predicts this.

00:18:50.810 --> 00:18:52.370
This is what actually
happened in human data.

00:18:52.370 --> 00:18:53.920
And what I want you to do is sort of
look at this and this, and say,

00:18:53.920 --> 00:18:55.900
they actually
look quite similar.

00:18:55.900 --> 00:18:58.270
Their noise-corrected
correlation is 0.91.

00:18:58.270 --> 00:19:00.940
So they were still quite good at
predicting confusion patterns.

00:19:00.940 --> 00:19:03.420
Although this did
not hold up fully.

00:19:03.420 --> 00:19:04.570
We're only at 0.68.

00:19:04.570 --> 00:19:05.410
I say only.

00:19:05.410 --> 00:19:07.300
Some people would
say this is success.

00:19:07.300 --> 00:19:09.320
We're only at 0.68
on high variation.

00:19:09.320 --> 00:19:11.020
So there's a failure
here of the model.

00:19:11.020 --> 00:19:13.360
That should be at 1, because
it's noise corrected.

00:19:13.360 --> 00:19:14.980
So there's something
about this that's

00:19:14.980 --> 00:19:17.410
not quite right at
predicting the confusion

00:19:17.410 --> 00:19:19.600
patterns of humans at
high variation images.

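NOTE
Editor's sketch, not part of the lecture. The talk does not spell out how the correlation was noise corrected. A common approach, assumed here, is to divide the raw correlation by the geometric mean of the two measurements' reliabilities (for example, split-half reliabilities), so that a model limited only by measurement noise would come out at 1. The function name and all numbers are hypothetical.
```python
from math import sqrt
def noise_corrected(raw_r, reliability_a, reliability_b):
    """Divide a raw correlation by the geometric mean of the two reliabilities."""
    return raw_r / sqrt(reliability_a * reliability_b)
# Hypothetical values: a raw model-to-human correlation and the
# reliabilities of the model and human measurements.
raw_r = 0.60
model_reliability = 0.90
human_reliability = 0.85
corrected = noise_corrected(raw_r, model_reliability, human_reliability)
```
Under this assumption, a corrected value that stays well below 1 points to a real mismatch between model and behavior rather than to noisy data.
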
00:19:19.600 --> 00:19:22.960
And that to us, that's an
opening to push forward, right?

00:19:22.960 --> 00:19:24.730
So this is a strategy
going forward

00:19:24.730 --> 00:19:28.300
as we have an initial guess
of how you read out of IT.

00:19:28.300 --> 00:19:30.730
It looks pretty good
for a first-grain test.

00:19:30.730 --> 00:19:32.800
But now we can turn
the crank harder.

00:19:32.800 --> 00:19:33.940
We need more neural data.

00:19:33.940 --> 00:19:36.970
We need more psychophysics,
finer grained measurements

00:19:36.970 --> 00:19:38.650
to sort of distinguish
among, not just

00:19:38.650 --> 00:19:41.560
say IT's better than V4 or
those other representations.

00:19:41.560 --> 00:19:44.380
But what exactly about
the IT representation?

00:19:44.380 --> 00:19:45.550
Is it 100 milliseconds?

00:19:45.550 --> 00:19:46.764
What time scale?

00:19:46.764 --> 00:19:48.430
Maybe those synchronous
codes do matter.

00:19:48.430 --> 00:19:50.430
Some of those things that
I put on there earlier

00:19:50.430 --> 00:19:53.170
might start to matter when
we push the code-- push

00:19:53.170 --> 00:19:54.370
this even further.

00:19:54.370 --> 00:19:56.440
So what I take home
here is that you

00:19:56.440 --> 00:19:59.590
do quite well with this
first order rate code

00:19:59.590 --> 00:20:01.390
reads out of IT.

00:20:01.390 --> 00:20:04.330
But now there's an opportunity
to try to dig in and say,

00:20:04.330 --> 00:20:06.444
well at what point
do they break down?

00:20:06.444 --> 00:20:08.110
And what kind of
decoding models are you

00:20:08.110 --> 00:20:09.234
going to replace them with?

00:20:09.234 --> 00:20:11.680
And that's what
we're trying to do.

00:20:11.680 --> 00:20:13.900
I've told you that IT
does well at identity.

00:20:13.900 --> 00:20:15.820
But remember I said
earlier on, remember

00:20:15.820 --> 00:20:16.660
I showed you those
manifolds, and said

00:20:16.660 --> 00:20:19.420
there's other latent variables
like position and scale.

00:20:19.420 --> 00:20:21.640
And I said those
don't get thrown away.

00:20:21.640 --> 00:20:23.170
They just get unwrapped, right?

00:20:23.170 --> 00:20:25.377
Remember that manifold
picture I showed earlier?

00:20:25.377 --> 00:20:27.460
And so one of the things
we've been doing recently

00:20:27.460 --> 00:20:29.510
is asking, because we
built these images,

00:20:29.510 --> 00:20:31.120
we know these other
latent variables,

00:20:31.120 --> 00:20:32.495
like position and
pose-- that was

00:20:32.495 --> 00:20:34.750
one of the advantages of
building the images this way.

00:20:34.750 --> 00:20:38.530
And we've been asking how well
IT encodes those other latent

00:20:38.530 --> 00:20:40.450
variables about the
pose of the object,

00:20:40.450 --> 00:20:42.040
the position of the object.

00:20:42.040 --> 00:20:45.670
And to make-- let me
just skip through.

00:20:45.670 --> 00:20:48.250
To make a long story short,
IT actually encodes--

00:20:48.250 --> 00:20:50.990
not only has information
about these kind of variables,

00:20:50.990 --> 00:20:53.290
which is really not
surprising, because others

00:20:53.290 --> 00:20:55.000
have shown that
there's information

00:20:55.000 --> 00:20:57.040
about those kind
of things before.

00:20:57.040 --> 00:20:58.914
But that's sort
of what's on here.

00:20:58.914 --> 00:21:00.580
Everything that I'm
showing here, here's

00:21:00.580 --> 00:21:03.060
IT, V4, simulated V1, and pixels.

00:21:03.060 --> 00:21:05.500
And always, everything goes
up along the ventral stream

00:21:05.500 --> 00:21:07.330
for the other
variables, which may be

00:21:07.330 --> 00:21:08.950
non-intuitive to some of you.

00:21:08.950 --> 00:21:10.989
I mean, because position
is supposed to be V1.

00:21:10.989 --> 00:21:13.030
But position of an object
in a complex background

00:21:13.030 --> 00:21:14.560
is better at IT.

00:21:14.560 --> 00:21:15.430
That's one example.

00:21:15.430 --> 00:21:17.806
But all these latent
variables go up

00:21:17.806 --> 00:21:19.180
along the ventral
stream in terms

00:21:19.180 --> 00:21:20.900
of their ease of decoding.

00:21:20.900 --> 00:21:22.810
But what I'm most
excited about is

00:21:22.810 --> 00:21:26.080
that if you do this
comparison with humans again,

00:21:26.080 --> 00:21:28.300
you actually get this sort
of, again, pretty decent,

00:21:28.300 --> 00:21:31.877
not quite as tight correlation,
between the human--

00:21:31.877 --> 00:21:33.460
actual measured
behavioral performance

00:21:33.460 --> 00:21:35.890
on making estimates of those
other latent variables,

00:21:35.890 --> 00:21:38.450
and the predicted behavioral
performance out of IT.

00:21:38.450 --> 00:21:40.180
And again, much
better correlations.

00:21:40.180 --> 00:21:41.220
It's not perfect.

00:21:41.220 --> 00:21:44.170
So again, there's some gap here,
some failure of understanding.

00:21:44.170 --> 00:21:47.650
But much better than if you
read out of V4, V1 or pixel.

00:21:47.650 --> 00:21:50.620
So this says that the
representation again isn't just

00:21:50.620 --> 00:21:51.817
an identity thing.

00:21:51.817 --> 00:21:53.650
It seems like this representation
could also

00:21:53.650 --> 00:21:56.320
underlie some of these
other judgments, at least

00:21:56.320 --> 00:21:59.380
at the central 10 degrees for
sort of foreground objects

00:21:59.380 --> 00:22:00.970
as we've been measuring here.

00:22:00.970 --> 00:22:03.428
That's the-- don't worry about
the details on here-- that's

00:22:03.428 --> 00:22:05.637
the upshot of what I'm trying
to say with this slide.

00:22:05.637 --> 00:22:07.261
But I just wanted to
put that out there

00:22:07.261 --> 00:22:09.310
so you didn't forget that
you haven't thrown away

00:22:09.310 --> 00:22:11.290
all this other interesting
stuff about what's

00:22:11.290 --> 00:22:13.461
out there in the scene.

00:22:13.461 --> 00:22:13.960
OK.

00:22:13.960 --> 00:22:14.626
Let me kind of--

00:22:14.626 --> 00:22:16.510
I've sort of alluded
to this a bit.

00:22:16.510 --> 00:22:18.310
I want to come back
to kind of now,

00:22:18.310 --> 00:22:20.350
this is like Marr
level 3 stuff, right?

00:22:20.350 --> 00:22:22.990
So you have this idea of
what you're trying to solve.

00:22:22.990 --> 00:22:23.800
You have a decode--

00:22:23.800 --> 00:22:26.350
you have an algorithm that's
a decoder on a basis, that's

00:22:26.350 --> 00:22:28.790
trying-- that looks like
it predicts pretty well.

00:22:28.790 --> 00:22:29.510
It's not perfect.

00:22:29.510 --> 00:22:30.560
There's work to be done there.

00:22:30.560 --> 00:22:31.893
But it actually does quite well.

00:22:31.893 --> 00:22:34.240
Now what does that mean on
the physical hardware level?

00:22:34.240 --> 00:22:35.742
So that's Marr level 3.

00:22:35.742 --> 00:22:37.450
So you think-- here's
how I visualize it.

00:22:37.450 --> 00:22:40.960
You have IT cortex,
by which I mean AIT and CIT.

00:22:40.960 --> 00:22:43.534
So it's about 150 square
millimeters in a monkey.

00:22:43.534 --> 00:22:45.700
And remember I told you
there was about 1 millimeter

00:22:45.700 --> 00:22:46.900
scale of organization?

00:22:46.900 --> 00:22:48.792
I showed you that earlier.

00:22:48.792 --> 00:22:49.750
And others have shown--

00:22:49.750 --> 00:22:51.458
I showed this earlier,
too-- that there's

00:22:51.458 --> 00:22:52.390
sort of face regions.

00:22:52.390 --> 00:22:54.514
So I've drawn them just
sort of for scale here,

00:22:54.514 --> 00:22:55.690
just a schematic.

00:22:55.690 --> 00:22:58.000
That they're slightly
bigger organizations,

00:22:58.000 --> 00:22:59.620
they're 2 to 5 millimeters.

00:22:59.620 --> 00:23:04.030
So I think of IT as being
this sort of like 100 to 200

00:23:04.030 --> 00:23:04.930
little--

00:23:04.930 --> 00:23:06.100
similar to Tanaka.

00:23:06.100 --> 00:23:07.600
This is not a new
conceptual idea.

00:23:07.600 --> 00:23:09.391
But there's sort of
just the simple version

00:23:09.391 --> 00:23:12.070
would be each millimeter does
exactly the same thing, is

00:23:12.070 --> 00:23:13.030
a feature.

00:23:13.030 --> 00:23:15.910
And if you sample off of
that, you take 5,000 neurons,

00:23:15.910 --> 00:23:19.320
but they're really sampling
from only about 150 IT

00:23:19.320 --> 00:23:21.871
features at 1 millimeter scale.

00:23:21.871 --> 00:23:23.620
Remember, I don't know
if you caught that.

00:23:23.620 --> 00:23:24.820
But I showed 150--

00:23:24.820 --> 00:23:26.950
101-- 150.

00:23:26.950 --> 00:23:30.425
I showed you 168 IT neurons
predicted the pattern

00:23:30.425 --> 00:23:31.300
of human performance.

00:23:31.300 --> 00:23:32.770
I showed that a few slides ago.

00:23:32.770 --> 00:23:35.440
But I told you the real number
of neurons is probably 50,000.

00:23:35.440 --> 00:23:37.360
Most of those are
redundant copies

00:23:37.360 --> 00:23:40.142
of that 168 dimensional
feature set.

00:23:40.142 --> 00:23:41.350
That's how we think about it.

00:23:41.350 --> 00:23:44.350
So you could imagine, it's
just a redundant set of about--

00:23:44.350 --> 00:23:47.410
I like to think of about
100 features in IT which

00:23:47.410 --> 00:23:51.370
are sampled, maybe randomly, by
downstream neurons that are

00:23:51.370 --> 00:23:52.270
then learned.

00:23:52.270 --> 00:23:54.390
So when you learn faces
versus other things,

00:23:54.390 --> 00:23:56.550
hey, there's lots of good
information about faces

00:23:56.550 --> 00:23:57.383
versus other things.

00:23:57.383 --> 00:23:59.610
And these face patches,
that's how they're defined.

00:23:59.610 --> 00:24:01.822
But those neurons are
going to lean heavily--

00:24:01.822 --> 00:24:03.780
this downstream neuron
is going to lean heavily

00:24:03.780 --> 00:24:04.500
on those neurons.

00:24:04.500 --> 00:24:07.800
And then these-- so that would
make these regions causally

00:24:07.800 --> 00:24:08.320
involved.

00:24:08.320 --> 00:24:11.067
So that doesn't mean you had
to pre-build in anything here.

00:24:11.067 --> 00:24:12.900
You just learn this at
a downstream version.

00:24:12.900 --> 00:24:14.190
And you would get
something that looks

00:24:14.190 --> 00:24:15.481
like it would explain our data.

00:24:15.481 --> 00:24:17.919
So we like that, because
it captures that case.

00:24:17.919 --> 00:24:19.710
But it also captures
the more general case.

00:24:19.710 --> 00:24:21.418
If you learn cars,
you're going to sample

00:24:21.418 --> 00:24:22.940
from a different
subset of neurons.

00:24:22.940 --> 00:24:25.470
But you're following
the same learning rule.

00:24:25.470 --> 00:24:26.920
That's what I said earlier on.

00:24:26.920 --> 00:24:27.900
So you end up--

00:24:27.900 --> 00:24:29.760
we think this is
the initial state.

00:24:29.760 --> 00:24:31.411
This is when you learn objects.

00:24:31.411 --> 00:24:33.660
And so what we think is,
post-learning, what you have

00:24:33.660 --> 00:24:36.690
is again, about 100
to 150 IT sub regions,

00:24:36.690 --> 00:24:38.520
each at 1 millimeter
scale, that are

00:24:38.520 --> 00:24:40.500
supporting a number
of noun tasks

00:24:40.500 --> 00:24:42.720
read off this common basis here.

00:24:42.720 --> 00:24:44.580
That's the model
that we like, given

00:24:44.580 --> 00:24:46.440
the kind of data that
I've been showing you.

00:24:46.440 --> 00:24:48.384
The post learning
model, as we call it.

00:24:48.384 --> 00:24:49.800
So the reason I'm
bringing this up

00:24:49.800 --> 00:24:51.258
is probably for
the neuroscientists

00:24:51.258 --> 00:24:54.900
to fix ideas about how we
think about IT as a basis set.

00:24:54.900 --> 00:24:55.530
And this is--

00:24:55.530 --> 00:24:57.030
I think Haim sort of
set this up nicely,

00:24:57.030 --> 00:24:58.446
he sort of implied
similar things.

00:24:58.446 --> 00:25:01.020
That somebody downstream
reads from it.

00:25:01.020 --> 00:25:01.650
OK.

00:25:01.650 --> 00:25:02.970
But now, we have
a more-- you know,

00:25:02.970 --> 00:25:04.950
we're starting to have a more
concrete model, that we now,

00:25:04.950 --> 00:25:06.330
I'm trying to start
to be physical

00:25:06.330 --> 00:25:08.496
about it, about the size
of these regions connecting

00:25:08.496 --> 00:25:10.194
to earlier data,
how many there are.

00:25:10.194 --> 00:25:11.610
So we're gaining
inference on that

00:25:11.610 --> 00:25:13.190
from these different
experiments.

00:25:13.190 --> 00:25:15.690
And now, if you believe this,
it starts to make a prediction

00:25:15.690 --> 00:25:17.640
of what's-- now we can
do causality, right?

00:25:17.640 --> 00:25:19.300
Somebody mentioned that earlier.

00:25:19.300 --> 00:25:20.580
And so, one of the things
we've been doing recently

00:25:20.580 --> 00:25:21.989
is if we can start to silence--

00:25:21.989 --> 00:25:24.030
look, the way I've drawn
this, this bit of tissue

00:25:24.030 --> 00:25:25.950
for-- this is just
schematic-- is somehow

00:25:25.950 --> 00:25:27.840
involved in this
task and that task.

00:25:27.840 --> 00:25:29.700
Face task and car task.

00:25:29.700 --> 00:25:31.980
But this bit of
tissue, only face task.

00:25:31.980 --> 00:25:34.080
And that bit of
tissue, only car task.

00:25:34.080 --> 00:25:35.780
And this bit of tissue, neither.

00:25:35.780 --> 00:25:37.770
So if you believe that,
and you had the tools,

00:25:37.770 --> 00:25:39.311
you should be able
to go in and start

00:25:39.311 --> 00:25:40.830
to silence little bits of IT.

00:25:40.830 --> 00:25:42.944
And you should get
predictable patterns out

00:25:42.944 --> 00:25:44.610
of the behavioral
deficits of the animal

00:25:44.610 --> 00:25:46.460
when you make those
manipulations, right?

00:25:46.460 --> 00:25:48.130
Everybody follow that?

00:25:48.130 --> 00:25:48.630
Right?

00:25:48.630 --> 00:25:49.130
OK.

00:25:49.130 --> 00:25:51.030
And now the models
give you a framework

00:25:51.030 --> 00:25:52.800
to build those
predictions and to also

00:25:52.800 --> 00:25:55.710
estimate the magnitude of those
effects that you should see.

00:25:55.710 --> 00:25:57.850
And so that's what we've
been doing more recently.

00:25:57.850 --> 00:25:59.040
And I'll just give
you a taste of this,

00:25:59.040 --> 00:26:00.330
because this is really ongoing.

00:26:00.330 --> 00:26:01.890
But I think it connects
to what Gabriel

00:26:01.890 --> 00:26:04.389
said earlier about now there
are these tools available to do

00:26:04.389 --> 00:26:04.894
that.

00:26:04.894 --> 00:26:06.810
Oh, I put that in from
an earlier talk where--

00:26:06.810 --> 00:26:08.750
I think Google has a
thing called Inception.

00:26:08.750 --> 00:26:09.570
And I don't know--

00:26:09.570 --> 00:26:10.200
was it Google?

00:26:10.200 --> 00:26:12.090
Or somebody has it-- you can't
do Inception unless you're

00:26:12.090 --> 00:26:13.050
actually in a brain.

00:26:13.050 --> 00:26:15.120
So are you going
to try to insert--

00:26:15.120 --> 00:26:17.495
the reason we do this is my
student that is working on it

00:26:17.495 --> 00:26:20.010
really wants to inject
signals in the brain.

00:26:20.010 --> 00:26:21.450
There's a dream
about BMI, right?

00:26:21.450 --> 00:26:23.760
Could you kind of
inject a percept?

00:26:23.760 --> 00:26:25.260
And to do that,
you're going to need

00:26:25.260 --> 00:26:26.427
to do experiments like this.

00:26:26.427 --> 00:26:28.635
And you need to understand this
hardware to interact with it.

00:26:28.635 --> 00:26:30.430
It's something we
talked about earlier.

00:26:30.430 --> 00:26:34.181
So actually-- and Tonegawa's lab
has some cool Inception stuff

00:26:34.181 --> 00:26:34.680
on memory.

00:26:34.680 --> 00:26:37.140
But this is like inserting
an object percept.

00:26:37.140 --> 00:26:39.510
So to do that, this has
been a dream for many of us

00:26:39.510 --> 00:26:40.230
for a long time.

00:26:40.230 --> 00:26:42.600
Can we reliably
disrupt performance

00:26:42.600 --> 00:26:45.600
by suppressing 1
millimeter bits of IT?

00:26:45.600 --> 00:26:48.390
So to do that,
what we're doing is

00:26:48.390 --> 00:26:50.630
testing a large battery
of tasks and a battery

00:26:50.630 --> 00:26:51.630
of suppression patterns.

00:26:51.630 --> 00:26:53.310
So not just sort
of saying, can we

00:26:53.310 --> 00:26:55.140
affect face tasks or one task?

00:26:55.140 --> 00:26:57.546
But let's imagine we
test a battery of tasks.

00:26:57.546 --> 00:26:58.920
And then, we--
and the ideal is where

00:26:58.920 --> 00:27:01.652
we'd have a whole bunch of tasks
and we'd do every bit of IT one

00:27:01.652 --> 00:27:03.360
by one, and then in
combination, and we'd

00:27:03.360 --> 00:27:04.650
sort of get all that data
and figure out what's

00:27:04.650 --> 00:27:05.220
going on, right?

00:27:05.220 --> 00:27:05.970
That's sort of the dream, right?

00:27:05.970 --> 00:27:07.990
So we're trying to build
towards that dream.

00:27:07.990 --> 00:27:09.100
Do you guys get it?

00:27:09.100 --> 00:27:09.600
Right.

00:27:09.600 --> 00:27:10.620
I mean, I don't know.

00:27:10.620 --> 00:27:12.990
And then we're motivated
by this kind of idea here.

00:27:12.990 --> 00:27:14.865
So to build-- so we started--

00:27:14.865 --> 00:27:16.740
I'm just going to give
you a quick tour of we

00:27:16.740 --> 00:27:19.250
have tools to start to do this.

00:27:19.250 --> 00:27:20.952
You know, this is
our recording, we

00:27:20.952 --> 00:27:23.160
can localize what we're
recording to very fine grain

00:27:23.160 --> 00:27:24.649
using x-rays.

00:27:24.649 --> 00:27:26.940
So we know exactly where
we're recording in IT to like

00:27:26.940 --> 00:27:29.214
about 300 micron resolution.

00:27:29.214 --> 00:27:30.880
So that's why I'm
putting this slide up.

00:27:30.880 --> 00:27:32.463
And what we're
interested in is going,

00:27:32.463 --> 00:27:35.955
if I silence this bit of
IT, or that bit of IT,

00:27:35.955 --> 00:27:38.790
or that bit of IT, so actually
do this experiment, what

00:27:38.790 --> 00:27:40.230
happens behaviorally?

00:27:40.230 --> 00:27:43.140
And Arash Afraz, a
post-doc in the lab, started

00:27:43.140 --> 00:27:44.610
these actual experiments.

00:27:44.610 --> 00:27:47.130
And one of the things
Arash did was to first say,

00:27:47.130 --> 00:27:51.330
let's see if we can get this
optogenetic silencing tool

00:27:51.330 --> 00:27:52.290
to work in our hands.

00:27:52.290 --> 00:27:54.123
And the reason we were
so excited about that

00:27:54.123 --> 00:27:55.890
is because we think
lesions, if we

00:27:55.890 --> 00:27:58.410
can make temporary
brief silencing,

00:27:58.410 --> 00:28:03.020
that will give a much more
reliable disruption of behavior

00:28:03.020 --> 00:28:06.300
than if we started
to try to inject signals,

00:28:06.300 --> 00:28:09.330
which would be our dream, but
that seems too risky to us.

00:28:09.330 --> 00:28:11.580
We just want to say, what
does a temporary lesion

00:28:11.580 --> 00:28:13.082
of each bit of IT do?

00:28:13.082 --> 00:28:14.790
And optogenetics is
cool, because there's

00:28:14.790 --> 00:28:17.610
no other technique that
can briefly silence--

00:28:17.610 --> 00:28:19.782
temporarily silence activity.

00:28:19.782 --> 00:28:21.490
You can do pharmacological
manipulations,

00:28:21.490 --> 00:28:23.010
but those last for hours.

00:28:23.010 --> 00:28:25.544
So this could briefly
silence bits of IT.

00:28:25.544 --> 00:28:27.210
And that's why we
were excited about it.

00:28:27.210 --> 00:28:30.120
We also did pharmacological
manipulation as a reference

00:28:30.120 --> 00:28:30.780
to get started.

00:28:30.780 --> 00:28:33.600
But what we're doing is trying
to silence 1 millimeter regions

00:28:33.600 --> 00:28:36.490
of IT using light delivered
through optical fibers

00:28:36.490 --> 00:28:38.220
as the recording electrode.

00:28:38.220 --> 00:28:41.490
And to silence bits
of neurons here.

00:28:41.490 --> 00:28:43.757
And so what Arash
did was first show

00:28:43.757 --> 00:28:45.840
that you can actually
silence neurons in this way.

00:28:45.840 --> 00:28:48.390
So if you guys haven't
seen optogenetics plots,

00:28:48.390 --> 00:28:49.560
this is data from our lab.

00:28:49.560 --> 00:28:51.060
What's quite cool
about this, again,

00:28:51.060 --> 00:28:53.290
is that the same
images are being presented.

00:28:53.290 --> 00:28:55.110
So this green line
should be up here.

00:28:55.110 --> 00:28:57.900
But Arash turns a laser on right
here, shines light on there.

00:28:57.900 --> 00:28:59.970
And there's some opsins
expressed in the neurons

00:28:59.970 --> 00:29:00.861
in that local area.

00:29:00.861 --> 00:29:03.360
And you can see it just sort
of shuts the thing down, and it

00:29:03.360 --> 00:29:04.830
sort of deletes or blocks this.

00:29:04.830 --> 00:29:06.390
You have the same
input coming in.

00:29:06.390 --> 00:29:08.300
But you can sort
of delete it here.

00:29:08.300 --> 00:29:09.670
And this is another example.

00:29:09.670 --> 00:29:11.253
These are some pretty
strong examples.

00:29:11.253 --> 00:29:13.750
It's not always this strong.

00:29:13.750 --> 00:29:16.140
But this is, again, you
can see we can return back

00:29:16.140 --> 00:29:17.400
to normal right away, right?

00:29:17.400 --> 00:29:19.290
So this is a 200
millisecond silencing.

00:29:19.290 --> 00:29:21.230
You could go even
narrower than that.

00:29:21.230 --> 00:29:23.340
But so this is what
we had done so far.

00:29:23.340 --> 00:29:25.020
And again, what we
did was say, look.

00:29:25.020 --> 00:29:25.980
This is a risky tool.

00:29:25.980 --> 00:29:27.479
This might not
work at all.

00:29:27.479 --> 00:29:29.250
So Arash just wanted
to test something

00:29:29.250 --> 00:29:31.850
that was likely to work.

00:29:31.850 --> 00:29:34.080
And so we picked a
face task because there

00:29:34.080 --> 00:29:36.480
was a lot of evidence of
spatial clustering of faces

00:29:36.480 --> 00:29:39.240
that you'll hear about from
Winrich and that's also

00:29:39.240 --> 00:29:40.750
known in the literature.

00:29:40.750 --> 00:29:42.770
So what Arash did
was to say, we picked

00:29:42.770 --> 00:29:44.910
a task of discriminating
males from females.

00:29:44.910 --> 00:29:46.470
We put in our notion
of invariance.

00:29:46.470 --> 00:29:48.390
It's not just do
this image access.

00:29:48.390 --> 00:29:50.970
But you have to do it across
a bunch of transformations.

00:29:50.970 --> 00:29:53.127
In this case, it's identity
as a transformation.

00:29:53.127 --> 00:29:55.710
So you're saying, all of these
are supposed to be called male,

00:29:55.710 --> 00:29:57.050
and all these are called female.

00:29:57.050 --> 00:29:59.049
And he wanted you to
distinguish this from this.

00:29:59.049 --> 00:30:00.900
That's what he trained
a monkey to do.

00:30:00.900 --> 00:30:04.031
And just to give you the upshot:
we do all this work,

00:30:04.031 --> 00:30:05.280
we silence the bits of cortex.

00:30:05.280 --> 00:30:06.870
And here's the big take home.

00:30:06.870 --> 00:30:10.140
You get a 2% deficit from
single one-millimeter

00:30:10.140 --> 00:30:12.750
silencing of bits of IT cortex.

00:30:12.750 --> 00:30:16.200
Parts of IT cortex,
not all of IT cortex,

00:30:16.200 --> 00:30:17.670
produce a 2% deficit.

00:30:17.670 --> 00:30:19.545
Here's the animal running
at 80%, 6% correct.

00:30:19.545 --> 00:30:21.086
These are interleaved
trials where we

00:30:21.086 --> 00:30:22.610
silence some local bit of IT.

00:30:22.610 --> 00:30:23.787
You get a 2% deficit.

00:30:23.787 --> 00:30:25.620
That's true only in the
contralateral field,

00:30:25.620 --> 00:30:28.290
not that ipsilateral
field, for the aficionados.

00:30:28.290 --> 00:30:30.630
You might look at this 2%
and go, well, that's tiny.

00:30:30.630 --> 00:30:32.190
But we looked at it,
this is exactly what's

00:30:32.190 --> 00:30:34.315
predicted by the models
that we were talking about.

00:30:34.315 --> 00:30:37.140
It's right in the range
of what should happen.

00:30:37.140 --> 00:30:39.170
And so this, to us,
is really quite cool.

00:30:39.170 --> 00:30:40.445
This is highly significant.

00:30:40.445 --> 00:30:42.570
And now we sort of are in
position to start to say,

00:30:42.570 --> 00:30:43.870
OK these tools work.

00:30:43.870 --> 00:30:45.330
They do what
they're supposed to.

00:30:45.330 --> 00:30:47.670
And now we can start to
expand that task space.

00:30:47.670 --> 00:30:49.620
So this result has been
published recently,

00:30:49.620 --> 00:30:51.112
if you're interested in this.

00:30:51.112 --> 00:30:53.070
And here is one of the
ways we're going forward

00:30:53.070 --> 00:30:55.736
is that Rish Rajalingham, the one
doing those tasks in the monkey

00:30:55.736 --> 00:30:56.670
I showed you earlier.

00:30:56.670 --> 00:30:58.390
Silencing different parts of IT.

00:30:58.390 --> 00:31:01.320
This is now with muscimol,
different bits of IT--

00:31:01.320 --> 00:31:03.570
these are different tasks,
lead to different patterns.

00:31:03.570 --> 00:31:04.944
That's what these
dots are here--

00:31:04.944 --> 00:31:06.300
different patterns of deficits.

00:31:06.300 --> 00:31:08.010
And if you go back
to the same location,

00:31:08.010 --> 00:31:09.640
you get the same
pattern of deficits.

00:31:09.640 --> 00:31:11.064
So this is only 10 tasks.

00:31:11.064 --> 00:31:12.480
But I think it
hopefully gives you

00:31:12.480 --> 00:31:14.760
the spirit of what
we're trying to do.

00:31:14.760 --> 00:31:16.650
And again, this
is only muscimol,

00:31:16.650 --> 00:31:19.600
which doesn't have all the
advantages of optogenetics.

00:31:19.600 --> 00:31:22.460
But this is what we're
building towards here.

00:31:22.460 --> 00:31:26.116
So I'm just giving you the
sort of state of the art.

00:31:26.116 --> 00:31:27.990
So our aim is to measure
the specific pattern

00:31:27.990 --> 00:31:30.740
of behavioral change induced by
the suppression of each IT sub

00:31:30.740 --> 00:31:32.850
region, ideally
testing many of them,

00:31:32.850 --> 00:31:35.389
and then compare with
the model predictions.

00:31:35.389 --> 00:31:36.930
I'm saying there's
this domain, and I

00:31:36.930 --> 00:31:38.220
want to sort of sample
the whole domain.

00:31:38.220 --> 00:31:41.040
So far, I've given you only just
samples of tasks in the domain.

00:31:41.040 --> 00:31:42.950
But we're really trying
to define the domain.

00:31:42.950 --> 00:31:43.770
And I'm just--

00:31:43.770 --> 00:31:46.353
I'm going to skip through this
just to give you the punchline,

00:31:46.353 --> 00:31:48.780
is that we do a whole bunch
of behavioral measurements.

00:31:48.780 --> 00:31:50.040
We presented this work before.

00:31:50.040 --> 00:31:52.456
It's like, this is now up to
three million Mechanical Turk

00:31:52.456 --> 00:31:53.130
trials.

00:31:53.130 --> 00:31:56.550
It seems to us that we can
embed all objects, even

00:31:56.550 --> 00:31:58.230
subordinate objects,
of the type of task

00:31:58.230 --> 00:31:59.854
that I've been telling
you, in roughly,

00:31:59.854 --> 00:32:01.830
in essentially a 20
dimensional space.

00:32:01.830 --> 00:32:02.940
So there's 20 dimensions.

00:32:02.940 --> 00:32:05.190
We think we infer that
humans are projecting

00:32:05.190 --> 00:32:07.730
to about 20 dimensions to
do these kind of, the tasks

00:32:07.730 --> 00:32:08.860
that we've shown here.

00:32:08.860 --> 00:32:11.310
Which is sort of
smaller, but eerily

00:32:11.310 --> 00:32:13.380
close to that in the
order of magnitude

00:32:13.380 --> 00:32:15.900
to that 100 or so features
that I've been talking about.

00:32:15.900 --> 00:32:19.020
So that's where-- regardless
of whether-- these

00:32:19.020 --> 00:32:21.580
are some of the dimensions
and how we're projecting them.

00:32:21.580 --> 00:32:22.930
Again, I won't take
you through this,

00:32:22.930 --> 00:32:24.490
because I think we've
already used up enough time

00:32:24.490 --> 00:32:25.906
and I want to get
on to this part.

00:32:25.906 --> 00:32:28.565
But we're trying to define
a domain of all tasks

00:32:28.565 --> 00:32:29.940
where we can sort
of predict what

00:32:29.940 --> 00:32:32.010
would happen across
anything within that domain.

00:32:32.010 --> 00:32:35.175
And that raises questions of the
dimensionality of that domain.

00:32:35.175 --> 00:32:37.050
And there were behavioral
methods to do that.

00:32:37.050 --> 00:32:39.330
And we've been doing
some work on that.

00:32:39.330 --> 00:32:40.664
So I'll just leave it at that.

00:32:40.664 --> 00:32:42.080
And if you guys
have questions, we

00:32:42.080 --> 00:32:43.567
can talk about that some more.

00:32:43.567 --> 00:32:45.150
I want to sort of
in the time I really

00:32:45.150 --> 00:32:47.910
have left is to talk about
the encoding side of things,

00:32:47.910 --> 00:32:49.410
because I promised you
guys I would get to this.

00:32:49.410 --> 00:32:50.868
Unless people have
any more burning

00:32:50.868 --> 00:32:52.202
questions on this decoding side.

00:32:52.202 --> 00:32:53.826
So far I've been
talking about the link

00:32:53.826 --> 00:32:55.020
between IT and perception.

00:32:55.020 --> 00:32:57.900
Now I'm going to switch gears
and talk about this other side.

00:32:57.900 --> 00:32:59.310
Which is, so I
talked about this.

00:32:59.310 --> 00:33:01.630
And that tells us that
the mean rates in IT

00:33:01.630 --> 00:33:03.630
are something that seem
to be highly predictive.

00:33:03.630 --> 00:33:05.130
I showed you at
least one model that

00:33:05.130 --> 00:33:06.510
has the LaWS of RAD IT model.

00:33:06.510 --> 00:33:09.300
But now, it's like now, we
can turn to the encoding side

00:33:09.300 --> 00:33:11.677
and say, we need to predict
the mean rates of IT.

00:33:11.677 --> 00:33:14.010
And that should be our goal
if we want to explain images

00:33:14.010 --> 00:33:15.570
to IT activity.

00:33:15.570 --> 00:33:18.940
So, these would be called
predictive encoding mechanisms.

00:33:18.940 --> 00:33:21.360
So, now you guys
have heard about

00:33:21.360 --> 00:33:22.994
deep convolutional networks.

00:33:22.994 --> 00:33:24.660
If you haven't heard
about them already,

00:33:24.660 --> 00:33:26.409
you'll probably hear
about them some more.

00:33:26.409 --> 00:33:28.767
So we started messing
around in 2008.

00:33:28.767 --> 00:33:29.850
This is a model inspired--

00:33:29.850 --> 00:33:31.558
I mentioned this family
of models before.

00:33:31.558 --> 00:33:34.430
Hubel-Wiesel, Fukushima, and
there's a whole HMAX family

00:33:34.430 --> 00:33:38.030
of models, that really was the
inspiration of this larger--

00:33:38.030 --> 00:33:39.930
this large family
of models, that

00:33:39.930 --> 00:33:43.939
have this repeating
structure that are now

00:33:43.939 --> 00:33:46.230
really the sort of modern
day deep convolutional networks

00:33:46.230 --> 00:33:48.640
really grew out of all
of this earlier work.

00:33:48.640 --> 00:33:51.600
And so we started exploring
the family in 2008.

00:33:51.600 --> 00:33:54.140
And just, this is a slide that
you've already sort of seen

00:33:54.140 --> 00:33:56.056
a version of this from
Gabriel where you know,

00:33:56.056 --> 00:33:58.125
for when you take an
image, you pass it

00:33:58.125 --> 00:33:59.250
through a set of operators.

00:33:59.250 --> 00:34:00.180
So you have filters.

00:34:00.180 --> 00:34:02.840
So these are dot products
over some restricted spatial

00:34:02.840 --> 00:34:05.550
region, like
receptive fields.

00:34:05.550 --> 00:34:08.330
You have a nonlinearity, like
a threshold and a saturation.

00:34:08.330 --> 00:34:10.159
You have a pooling operation.

00:34:10.159 --> 00:34:11.409
Then you have a normalization.

00:34:11.409 --> 00:34:13.610
So you have all these
operations happen here.

00:34:13.610 --> 00:34:14.928
And that produces a stack.

00:34:14.928 --> 00:34:16.969
So think of like, if there
are four filters here,

00:34:16.969 --> 00:34:19.389
like four orientations,
you get four images,

00:34:19.389 --> 00:34:21.389
you have one image in,
you have four images out.

00:34:21.389 --> 00:34:23.638
But if you had 10 of these,
you'd get 10 of these out.

00:34:23.638 --> 00:34:25.125
Then you repeat
this here, right?

00:34:25.125 --> 00:34:26.750
And so as you keep
adding more filters,

00:34:26.750 --> 00:34:28.749
this stack just keeps
getting bigger and bigger.

00:34:28.749 --> 00:34:30.743
And it keeps, because
you're spatially pooling,

00:34:30.743 --> 00:34:32.659
it keeps getting narrower
and narrower, right?

00:34:32.659 --> 00:34:34.544
So you go from this
image to this sort

00:34:34.544 --> 00:34:38.060
of deep stack of features
that has less retinatopy.

00:34:38.060 --> 00:34:40.130
It still has a little
bit of retinotopy.
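
NOTE
Editor's sketch, not the speaker's code: one way to picture the
filter / nonlinearity / pooling / normalization stage described above.
The function name conv_layer, the 32x32 random image, the 3x3 random
filters, and the clip-to-[0, 1] nonlinearity are all assumptions made
for illustration. It only shows how the stack gets deeper (more
channels) and spatially narrower as stages repeat.
```python
import numpy as np
def conv_layer(stack, filters, pool=2):
    """One filter / nonlinearity / pool / normalize stage (illustrative only).
    stack:   (channels_in, height, width) feature maps
    filters: (channels_out, channels_in, k, k) weights, shared across positions
    """
    c_out, c_in, k, _ = filters.shape
    _, h, w = stack.shape
    oh, ow = h - k + 1, w - k + 1
    out = np.zeros((c_out, oh, ow))
    # Filter: the same dot product applied to every local patch.
    for y in range(oh):
        for x in range(ow):
            patch = stack[:, y:y + k, x:x + k]
            out[:, y, x] = np.tensordot(filters, patch, axes=3)
    # Nonlinearity: threshold at 0, saturate at 1.
    out = np.clip(out, 0.0, 1.0)
    # Pooling: max over non-overlapping pool x pool windows.
    ph, pw = oh // pool, ow // pool
    out = out[:, :ph * pool, :pw * pool].reshape(c_out, ph, pool, pw, pool).max(axis=(2, 4))
    # Normalization: divide by the activity across channels at each position.
    return out / (np.linalg.norm(out, axis=0, keepdims=True) + 1e-6)
rng = np.random.default_rng(0)
image = rng.random((1, 32, 32))                              # one image in
layer1 = conv_layer(image, rng.normal(size=(4, 1, 3, 3)))    # four maps out
layer2 = conv_layer(layer1, rng.normal(size=(10, 4, 3, 3)))  # ten maps out
```
Here layer1 has 4 channels and layer2 has 10, while the spatial
extent shrinks at each stage because of the pooling.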

00:34:40.130 --> 00:34:42.560
And that, you can see, has
been exactly a very good model

00:34:42.560 --> 00:34:44.389
why people liked
it of how people

00:34:44.389 --> 00:34:46.310
think about the ventral stream.

00:34:46.310 --> 00:34:48.830
So these models
typically have thousands

00:34:48.830 --> 00:34:52.010
of feat-- visual neurons or
features at the top level.

00:34:52.010 --> 00:34:55.520
Just to give you a sense of
scale of how they're run.

00:34:55.520 --> 00:34:57.209
And just to take
you through, you

00:34:57.209 --> 00:34:59.000
know, I guess maybe
you'll hear about this,

00:34:59.000 --> 00:35:00.110
if you haven't already.

00:35:00.110 --> 00:35:02.900
Each element has like, a
filter, has a large fan in.

00:35:02.900 --> 00:35:05.330
Like these are like
neuroscience related things.

00:35:05.330 --> 00:35:08.460
They have non-linearities,
like thresholds of neurons.

00:35:08.460 --> 00:35:10.250
Each layer is
convolutional, which

00:35:10.250 --> 00:35:12.962
means you apply the same
filters across visual space.

00:35:12.962 --> 00:35:15.170
Which is like retinotopy,
that is, a V1 cell that

00:35:15.170 --> 00:35:16.010
is oriented here.

00:35:16.010 --> 00:35:17.690
There'll be another
V1 cell that's

00:35:17.690 --> 00:35:19.670
in another spatial
position, same orientation,

00:35:19.670 --> 00:35:21.080
different spatial position.

00:35:21.080 --> 00:35:23.360
That's what the
convolutional models are just

00:35:23.360 --> 00:35:26.810
an implementation of that idea
of copying the same filter

00:35:26.810 --> 00:35:28.782
type across the retina.
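
NOTE
Editor's sketch, not the speaker's code: a toy check of the idea that
a convolutional layer copies the same filter across visual space. The
helper apply_filter, the 8x8 image, and the vertical-edge kernel are
hypothetical stand-ins for one orientation-tuned unit. Moving the edge
two pixels to the right moves the response two pixels to the right and
changes nothing else.
```python
import numpy as np
def apply_filter(image, kernel):
    """Slide one kernel over every position of a 2-D image (valid region only)."""
    k = kernel.shape[0]
    h, w = image.shape
    out = np.zeros((h - k + 1, w - k + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + k, x:x + k] * kernel)
    return out
# A hypothetical vertical-edge kernel standing in for one orientation-tuned unit.
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)
image = np.zeros((8, 8))
image[:, 3:] = 1.0                 # vertical edge between columns 2 and 3
shifted = np.zeros((8, 8))
shifted[:, 5:] = 1.0               # same edge, two pixels to the right
resp = apply_filter(image, kernel)
resp_shifted = apply_filter(shifted, kernel)
# Same filter, different position: the response map is simply translated.
same_pattern = np.array_equal(resp_shifted[:, 2:], resp[:, :-2])
```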

00:35:28.782 --> 00:35:30.240
And there's a deep
stack of layers.

00:35:30.240 --> 00:35:31.615
These are all
things that I think

00:35:31.615 --> 00:35:34.610
are commensurate with
the ventral stream

00:35:34.610 --> 00:35:36.796
anatomy and physiology.

00:35:36.796 --> 00:35:39.230
So, but one of the
key things that those

00:35:39.230 --> 00:35:40.790
who work with these
models know is

00:35:40.790 --> 00:35:43.250
that, they have lots
of unknown parameters

00:35:43.250 --> 00:35:45.320
that are not determined
from the neurobiology.

00:35:45.320 --> 00:35:47.750
Even though the family of
models is well described,

00:35:47.750 --> 00:35:50.270
what are the exact
filter weights?

00:35:50.270 --> 00:35:51.830
What are the
threshold parameters?

00:35:51.830 --> 00:35:53.090
How exactly do you pool?

00:35:53.090 --> 00:35:54.050
How do you normalize?

00:35:54.050 --> 00:35:56.480
There's lots of parameters
when you build these things,

00:35:56.480 --> 00:35:59.090
essentially thousands of
parameters, most of them hidden

00:35:59.090 --> 00:36:00.360
in the weight structure here.

00:36:00.360 --> 00:36:02.360
Which, if you think about,
the first layer, that

00:36:02.360 --> 00:36:04.190
would be like, should
I choose Gabor filters?

00:36:04.190 --> 00:36:06.110
Or should I do some other--
you know Haim was talking

00:36:06.110 --> 00:36:07.369
about random weights, right?

00:36:07.369 --> 00:36:08.410
So there's choices there.

00:36:08.410 --> 00:36:09.140
There are lots of parameters.

00:36:09.140 --> 00:36:11.410
So the upshot is, there's
a big-- that's why

00:36:11.410 --> 00:36:13.240
I call it a family of models.

00:36:13.240 --> 00:36:16.560
And how do you choose which one
is the right one, so to speak?

00:36:16.560 --> 00:36:17.900
Or is there a right one?

00:36:17.900 --> 00:36:19.650
Or maybe the whole
family is wrong, right?

00:36:19.650 --> 00:36:21.680
These are the
interesting discussions.

00:36:21.680 --> 00:36:24.610
So, what I like about it is,
at least when you set it,

00:36:24.610 --> 00:36:25.220
it's a model.

00:36:25.220 --> 00:36:26.120
It makes predictions.

00:36:26.120 --> 00:36:27.161
And then you can test it.

00:36:27.161 --> 00:36:28.610
So it's at least a model.

00:36:28.610 --> 00:36:30.800
And it predicts the
entire-- you know,

00:36:30.800 --> 00:36:33.560
if you start to map these, you
say this is V1, this is V2,

00:36:33.560 --> 00:36:34.220
this is V4.

00:36:34.220 --> 00:36:37.400
It predicts the full
neural population response

00:36:37.400 --> 00:36:39.380
to any image across these areas.

00:36:39.380 --> 00:36:43.100
So it's a strongly
predictive model once built.

00:36:43.100 --> 00:36:44.176
So that's nice.

00:36:44.176 --> 00:36:46.550
But now you have to determine
how am I going to build it?

00:36:46.550 --> 00:36:48.300
How do I set the parameters?

00:36:48.300 --> 00:36:50.129
So how do we do that?

00:36:50.129 --> 00:36:51.920
Well, there's lots of
ways you could do it.

00:36:51.920 --> 00:36:53.753
And I'll tell you the
way we chose to do it.

00:36:53.753 --> 00:36:56.540
Which was to just not
use any neural data.

00:36:56.540 --> 00:36:58.370
It was just to use
optimization methods

00:36:58.370 --> 00:37:00.860
to find specific models
to set the parameters

00:37:00.860 --> 00:37:02.660
inside this model class.

00:37:02.660 --> 00:37:04.924
And we chose an
optimization target.

00:37:04.924 --> 00:37:07.340
This is a little bit, again,
inspired from a top down view

00:37:07.340 --> 00:37:09.146
of what the system's doing.

00:37:09.146 --> 00:37:10.520
What are the visual
tasks that we

00:37:10.520 --> 00:37:13.010
suppose the ventral stream
was supposed to solve?

00:37:13.010 --> 00:37:15.350
Which I already told you, we
think it's invariant object

00:37:15.350 --> 00:37:15.950
recognition.

00:37:15.950 --> 00:37:17.600
That's what makes
the problem hard.

00:37:17.600 --> 00:37:19.476
So we tried to optimize
models to solve that.

00:37:19.476 --> 00:37:21.058
And essentially when
we're doing that,

00:37:21.058 --> 00:37:23.540
we're kind of doing the same
thing that computer vision is

00:37:23.540 --> 00:37:26.164
trying to do, except we're doing
it in our own domain of images

00:37:26.164 --> 00:37:27.329
and tasks that we set up.

00:37:27.329 --> 00:37:29.870
But we essentially, there's a
meeting between computer vision

00:37:29.870 --> 00:37:32.240
and what we were
trying to do here.

00:37:32.240 --> 00:37:34.450
And when I say we, this
is work by Dan Yamins,

00:37:34.450 --> 00:37:37.520
a post-doc in the lab, and
Ha Hong, a graduate student.

00:37:37.520 --> 00:37:40.712
And what we did was to
just try to simulate again,

00:37:40.712 --> 00:37:41.420
as I did earlier.

00:37:41.420 --> 00:37:43.049
We took these
simple 3-D objects.

00:37:43.049 --> 00:37:44.590
We could render
them, just as before,

00:37:44.590 --> 00:37:46.730
place them on
naturalistic background.

00:37:46.730 --> 00:37:48.380
And then we just
built models that

00:37:48.380 --> 00:37:50.360
would try to discriminate
bodies from buildings

00:37:50.360 --> 00:37:51.318
from flowers from guns.

00:37:51.318 --> 00:37:53.076
So they would have
good feature sets

00:37:53.076 --> 00:37:54.950
that would discriminate
between these things.

00:37:54.950 --> 00:37:58.280
And these were essentially
trained by various forms

00:37:58.280 --> 00:37:59.220
of supervision.

00:37:59.220 --> 00:38:02.240
Now there's lots of ways
you can train these models.

00:38:02.240 --> 00:38:03.740
I could tell you
about how we did it

00:38:03.740 --> 00:38:04.970
and how others have done it.

00:38:04.970 --> 00:38:06.546
I think those details
are beyond what

00:38:06.546 --> 00:38:07.670
I want to talk about today.

00:38:07.670 --> 00:38:09.530
But just, it's a
supervised class

00:38:09.530 --> 00:38:12.156
that's probably not
learned in the same way

00:38:12.156 --> 00:38:13.280
that the brain has learned.

00:38:13.280 --> 00:38:14.711
Most people don't think so.

00:38:14.711 --> 00:38:16.460
But the interesting
thing is the end state

00:38:16.460 --> 00:38:19.659
of these models might look very
much like the current adult

00:38:19.659 --> 00:38:20.450
state of the brain.

00:38:20.450 --> 00:38:22.390
And that's what I want
to try to tell you next.

00:38:22.390 --> 00:38:24.280
So first, let me show you that
when we built these models,

00:38:24.280 --> 00:38:25.280
this was in 2012.

00:38:25.280 --> 00:38:27.080
We had a particular
optimization approach

00:38:27.080 --> 00:38:29.176
that we called HMO
that was trying

00:38:29.176 --> 00:38:31.550
to solve these kind of problems
that I showed you earlier

00:38:31.550 --> 00:38:33.020
on these kind of images.

00:38:33.020 --> 00:38:35.104
And I showed you IT was
pretty good with humans.

00:38:35.104 --> 00:38:37.520
I showed you its performance
was almost up to humans, even

00:38:37.520 --> 00:38:39.379
with just 168 samples.

00:38:39.379 --> 00:38:40.920
And when we first
built a model here,

00:38:40.920 --> 00:38:42.586
we were able to do
much better than some

00:38:42.586 --> 00:38:44.960
of our previous models that--

00:38:44.960 --> 00:38:46.130
on these same kind of tasks.

00:38:46.130 --> 00:38:47.796
So I told you we
constructed, because we

00:38:47.796 --> 00:38:49.530
knew it made these things--

00:38:49.530 --> 00:38:51.410
we made these models
not do so well.

00:38:51.410 --> 00:38:53.390
So we built these
high invariance tasks

00:38:53.390 --> 00:38:55.040
to push these models down.

00:38:55.040 --> 00:38:56.720
And then we had space
to build a model

00:38:56.720 --> 00:38:59.050
that we could do better on.

00:38:59.050 --> 00:39:01.850
And we called it HMO 1.0.

00:39:01.850 --> 00:39:03.530
And then we started
to say, now we

00:39:03.530 --> 00:39:06.560
have this model that has been
optimized for performance.

00:39:06.560 --> 00:39:09.380
Let's see how well it does
on comparing with neurons.

00:39:09.380 --> 00:39:12.540
Let's see if its internals
look like the neural data.

00:39:12.540 --> 00:39:14.324
So here's the model
we built, HMO 1.0.

00:39:14.324 --> 00:39:15.740
It's a deep
convolutional network.

00:39:15.740 --> 00:39:16.906
It has two different levels.

00:39:16.906 --> 00:39:18.140
It had four levels.

00:39:18.140 --> 00:39:20.524
It had a bunch of parameters
that we set by optimization,

00:39:20.524 --> 00:39:22.690
that I'm just telling you
kind of what we optimized.

00:39:22.690 --> 00:39:23.481
I didn't tell you--

00:39:23.481 --> 00:39:25.530
I'm not telling you
any of the parameters.

00:39:25.530 --> 00:39:26.660
And now, we come back
to say, well look.

00:39:26.660 --> 00:39:28.190
We can show the same
images to the model

00:39:28.190 --> 00:39:29.550
that we showed to the neurons.

00:39:29.550 --> 00:39:32.060
And then we can compare how
well these populations look

00:39:32.060 --> 00:39:35.830
like that population, or this
population looks like that.

00:39:35.830 --> 00:39:38.820
And so what we did was, we asked
how well can layer four predict

00:39:38.820 --> 00:39:39.524
IT first?

00:39:39.524 --> 00:39:40.940
That was the first
thing we wanted

00:39:40.940 --> 00:39:43.160
to do, take the top
layer of this model,

00:39:43.160 --> 00:39:46.852
the last layer before the
linear readout of this model.

00:39:46.852 --> 00:39:49.310
And to do that, you might sort
of say, well, wait a minute.

00:39:49.310 --> 00:39:51.170
The model doesn't have mappings.

00:39:51.170 --> 00:39:54.680
It has sort of neurons simulated
here, neuron 12 or something.

00:39:54.680 --> 00:39:56.180
And there's some
neuron we recorded.

00:39:56.180 --> 00:39:58.970
But there's no linkage between
that neuron and that neuron,

00:39:58.970 --> 00:39:59.470
right?

00:39:59.470 --> 00:40:01.320
You have to make that map.

00:40:01.320 --> 00:40:03.770
So what we do is we
take each IT neuron

00:40:03.770 --> 00:40:06.066
and treat this as sort
of a generative space.

00:40:06.066 --> 00:40:07.940
You can generate as many
simulated IT neurons

00:40:07.940 --> 00:40:08.510
as you want.

00:40:08.510 --> 00:40:10.550
You would just ask,
let's take this neuron,

00:40:10.550 --> 00:40:13.640
take some of its data, and try
to build a linear regression

00:40:13.640 --> 00:40:14.330
to this neuron.

00:40:14.330 --> 00:40:16.640
Treat this as a basis
to explain that neuron.

00:40:16.640 --> 00:40:19.946
And then test the predictive
power on the held out IT data.

00:40:19.946 --> 00:40:21.320
And that's what
I'm writing here.

00:40:21.320 --> 00:40:23.470
That's cross-validated
linear regression.

00:40:23.470 --> 00:40:25.850
So I'm going to show you
predictions on held out data

00:40:25.850 --> 00:40:28.880
where some of the data were
used to make the mapping.

00:40:28.880 --> 00:40:31.217
And there's lots
of ways we chose--

00:40:31.217 --> 00:40:32.300
we could make the mapping.

00:40:32.300 --> 00:40:33.934
And we did essentially
all of them.

00:40:33.934 --> 00:40:35.600
And I could talk about
that if you want.

00:40:35.600 --> 00:40:37.130
But that's this central idea.

00:40:37.130 --> 00:40:40.010
Take some of your data, say,
is this in the linear space

00:40:40.010 --> 00:40:41.450
spanned by this basis set?

00:40:41.450 --> 00:40:44.890
So can I fit that well
with this linear basis here?

00:40:44.890 --> 00:40:47.000
As a linear map from this basis?
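
NOTE
Editor's sketch, not the lab's analysis code: the mapping step
described above, fitting a linear regression from model features to
one neuron on some images and scoring the prediction on held-out
images. The function name heldout_r2, the ridge penalty, the 5-fold
split, and the synthetic data are all assumptions. It reports a plain
held-out R squared and does not correct for neural noise, so it is not
the "explainable variance" measure quoted in the talk.
```python
import numpy as np
def heldout_r2(features, responses, n_folds=5, ridge=1.0, seed=0):
    """Cross-validated R^2 of a linear map from model features to one neuron.
    features:  (n_images, n_features) top-layer model activations
    responses: (n_images,) mean response of one recorded site
    """
    n = len(responses)
    order = np.random.default_rng(seed).permutation(n)
    pred = np.zeros(n)
    for test in np.array_split(order, n_folds):
        train = np.setdiff1d(order, test)
        mu_x, mu_y = features[train].mean(axis=0), responses[train].mean()
        x = features[train] - mu_x
        # Ridge solution fit on the training images only.
        w = np.linalg.solve(x.T @ x + ridge * np.eye(x.shape[1]), x.T @ (responses[train] - mu_y))
        # Predict the images this fold never saw.
        pred[test] = (features[test] - mu_x) @ w + mu_y
    resid = responses - pred
    return 1.0 - np.sum(resid ** 2) / np.sum((responses - responses.mean()) ** 2)
# Synthetic stand-in data: 1,600 images, 50 features, a noisy linear "neuron".
rng = np.random.default_rng(1)
feats = rng.normal(size=(1600, 50))
rates = feats @ rng.normal(size=50) + rng.normal(scale=7.0, size=1600)
r2 = heldout_r2(feats, rates)
```
Because every prediction comes from weights fit without that image,
r2 measures generalization rather than goodness of fit.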

00:40:47.000 --> 00:40:49.500
And here's what we actually--
here's what it looks like.

00:40:49.500 --> 00:40:53.470
Here's the IT neural response
of one simulated-- one actual IT

00:40:53.470 --> 00:40:54.710
neuron in black.

00:40:54.710 --> 00:40:55.880
This is not time.

00:40:55.880 --> 00:40:57.260
These are images.

00:40:57.260 --> 00:40:59.174
I think there's like
1,600 images here.

00:40:59.174 --> 00:41:01.340
So each black going up and
down, you can barely see,

00:41:01.340 --> 00:41:04.370
is the response, the mean
response, to different images.

00:41:04.370 --> 00:41:06.930
And you see we grouped them
by categories, just so,

00:41:06.930 --> 00:41:09.780
just to help you kind
of understand the data.

00:41:09.780 --> 00:41:11.805
Otherwise, it'd
just be a big mess.

00:41:11.805 --> 00:41:13.430
Because IT neurons
do-- you can kind of

00:41:13.430 --> 00:41:15.410
see they have a bit of
category selectivity.

00:41:15.410 --> 00:41:16.710
And again, this was known.

00:41:16.710 --> 00:41:19.451
This neuron seems to like
chair images, but not all chair

00:41:19.451 --> 00:41:19.950
images.

00:41:19.950 --> 00:41:23.060
It sometimes likes boats and
some planes a little bit.

00:41:23.060 --> 00:41:26.270
And the red line is the
prediction of the model,

00:41:26.270 --> 00:41:28.042
once fit to part of
the-- to this neuron.

00:41:28.042 --> 00:41:30.500
This is the prediction on the
held out data for the neuron.

00:41:30.500 --> 00:41:32.690
You can see the R
squared is 0.48.

00:41:32.690 --> 00:41:35.150
So half the explainable
response variance

00:41:35.150 --> 00:41:37.050
is explained by this model.

00:41:37.050 --> 00:41:39.350
And again, these
are predictions.

00:41:39.350 --> 00:41:41.360
The images were never seen--

00:41:41.360 --> 00:41:44.360
the objects even were
never seen by this model

00:41:44.360 --> 00:41:47.750
before it makes these
predictions here.

00:41:47.750 --> 00:41:50.810
So this is just saying that the
IT neurons live in this space.

00:41:50.810 --> 00:41:53.120
It's actually quite well
captured by the top level,

00:41:53.120 --> 00:41:55.170
in this case, of this
first HMO model we built.

00:41:55.170 --> 00:41:57.480
I'll show you some other
models in a minute.

00:41:57.480 --> 00:41:59.480
Here's another neuron
that you might call a face

00:41:59.480 --> 00:42:02.840
neuron because it tends to like
faces over other categories.

00:42:02.840 --> 00:42:04.340
So it might-- it
would pass the test

00:42:04.340 --> 00:42:06.410
of the operational
definition of a face neuron.

00:42:06.410 --> 00:42:09.615
This model, this neuron
was well predicted, again,

00:42:09.615 --> 00:42:12.470
by both its preferred and
non-preferred face images

00:42:12.470 --> 00:42:13.880
by this HMO model.

00:42:13.880 --> 00:42:16.757
Again, a slightly--
an R squared near 0.5.

00:42:16.757 --> 00:42:19.340
Here's a neuron that you would
look at the category structure.

00:42:19.340 --> 00:42:20.332
And you don't even--

00:42:20.332 --> 00:42:22.040
you can't really see
the categories here.

00:42:22.040 --> 00:42:23.060
They're still here.

00:42:23.060 --> 00:42:25.010
But you don't see
these sort of blocks.

00:42:25.010 --> 00:42:27.050
You just see there's sort of
some images it likes and some

00:42:27.050 --> 00:42:27.410
it doesn't.

00:42:27.410 --> 00:42:29.530
It's hard to even know
what's driving this neuron.

00:42:29.530 --> 00:42:31.722
But it's actually quite
well predicted, I think.

00:42:31.722 --> 00:42:32.930
You don't have the R squared.

00:42:32.930 --> 00:42:33.800
But it's similar.

00:42:33.800 --> 00:42:35.990
It's about half the
explainable variance.

00:42:35.990 --> 00:42:37.490
Just another example.

00:42:37.490 --> 00:42:39.140
And here is a sort
of summary here.

00:42:39.140 --> 00:42:40.700
If you take-- this
is a distribution

00:42:40.700 --> 00:42:43.730
of the explainable variance
for the top level of the model

00:42:43.730 --> 00:42:46.850
fitting about, I think
this is 168 IT sites.

00:42:46.850 --> 00:42:48.950
Some sites are fit
really well, near 100%.

00:42:48.950 --> 00:42:50.390
Some are fit not as well.

00:42:50.390 --> 00:42:53.040
The average is about
50%, which is shown here.

00:42:53.040 --> 00:42:55.890
So this is the median of
that distribution here.

00:42:55.890 --> 00:42:58.400
So the summary take
home is about 50%

00:42:58.400 --> 00:43:00.277
of single IT response variance predicted.
variance predicted.

00:43:00.277 --> 00:43:02.360
And this is a big improvement
over previous models

00:43:02.360 --> 00:43:03.600
I'll show you in a minute.

00:43:03.600 --> 00:43:06.020
The other levels of the model
don't predict nearly as well.

00:43:06.020 --> 00:43:07.550
So the first level
doesn't predict well.

00:43:07.550 --> 00:43:09.216
Second level better,
third level better,

00:43:09.216 --> 00:43:10.550
the fourth level the best.

00:43:10.550 --> 00:43:13.216
If you take other models-- these
are some of the models I showed

00:43:13.216 --> 00:43:13.880
you earlier--

00:43:13.880 --> 00:43:16.110
they don't fit nearly as well.

00:43:16.110 --> 00:43:18.440
Here's their distributions
and here's their average,

00:43:18.440 --> 00:43:20.240
their median explained variance.

00:43:20.240 --> 00:43:24.120
And just to fix-- to just
fix ideas, you might think,

00:43:24.120 --> 00:43:27.080
well look, we built a model
that's a good categorizer.

00:43:27.080 --> 00:43:28.730
So of course it fits
IT neurons well.

00:43:28.730 --> 00:43:30.260
Because IT neurons
are categorizers.

00:43:30.260 --> 00:43:32.954
Well, here's a model that
actually has explicit knowledge

00:43:32.954 --> 00:43:33.620
of the category.

00:43:33.620 --> 00:43:35.210
It's not an image
computable model,

00:43:35.210 --> 00:43:36.590
and it's not an easy one.

00:43:36.590 --> 00:43:38.360
But it's just given
that sort of an oracle

00:43:38.360 --> 00:43:41.460
that's given the category,
and how well it explains IT.

00:43:41.460 --> 00:43:43.490
And you can see, it
explains IT much worse

00:43:43.490 --> 00:43:44.600
than the actual model.

00:43:44.600 --> 00:43:47.930
So this implies a model
is limited by the real--

00:43:47.930 --> 00:43:50.570
the architecture puts
constraints on the model

00:43:50.570 --> 00:43:53.680
and how it adds variance
that just saying IT

00:43:53.680 --> 00:43:56.290
neurons are categorizers
does not easily capture.

00:43:56.290 --> 00:44:00.320
So that kind of--

00:44:00.320 --> 00:44:02.836
that sort of inspired
us to say, OK.

00:44:02.836 --> 00:44:04.710
What about if we go down
and say not just IT,

00:44:04.710 --> 00:44:05.650
but let's go to V4.

00:44:05.650 --> 00:44:07.600
Because we had a
bunch of V4 data.

00:44:07.600 --> 00:44:09.280
And so we play the
same game in V4.

00:44:09.280 --> 00:44:12.130
Let's take level three and
see if we can predict V4.

00:44:12.130 --> 00:44:14.710
And here's the IT data I
just showed you a minute ago.

00:44:14.710 --> 00:44:16.270
And here's the V4 data.

00:44:16.270 --> 00:44:19.630
So the V4 neurons are highly
predicted in the middle layer.

00:44:19.630 --> 00:44:21.700
Layer three is the
best predictor of V4.

00:44:21.700 --> 00:44:24.400
The top layer is actually not
so predictive, less predictive

00:44:24.400 --> 00:44:26.664
of V4 neurons than
the middle layers.

00:44:26.664 --> 00:44:28.580
And the first layer is
not so well predictive.

00:44:28.580 --> 00:44:30.288
And again, the other
models are actually,

00:44:30.288 --> 00:44:32.830
now you can see they're
getting on relatively better.

00:44:32.830 --> 00:44:35.740
You can think of them as
sort of lower level models.

00:44:35.740 --> 00:44:38.390
And they're getting better,
which is what you'd expect.

00:44:38.390 --> 00:44:41.230
But interestingly, this
is really exciting to us.

00:44:41.230 --> 00:44:44.080
Because look, this
model was not optimized

00:44:44.080 --> 00:44:47.200
to fit any neural data other
than that last mapping step.

00:44:47.200 --> 00:44:49.300
All it is is a bio
inspired algorithm class,

00:44:49.300 --> 00:44:51.610
which is the
neuroscience sort of view

00:44:51.610 --> 00:44:54.550
of the feed-forward
class of the field.

00:44:54.550 --> 00:44:56.867
And tasks that we and
others hypothesize

00:44:56.867 --> 00:44:58.450
are important, that
the ventral stream

00:44:58.450 --> 00:45:01.810
might be optimized to solve,
and an actual optimization

00:45:01.810 --> 00:45:03.640
procedure that we applied.

00:45:03.640 --> 00:45:07.000
And that leads to neural like
encoding functions at the top

00:45:07.000 --> 00:45:08.200
and in the middle layer.

00:45:08.200 --> 00:45:11.380
So you don't-- so this sort
of leads to funny things like

00:45:11.380 --> 00:45:13.150
saying, what does V4 do?

00:45:13.150 --> 00:45:14.650
The answer here
would be, well, it's

00:45:14.650 --> 00:45:17.020
an intermediate layer
in a network built

00:45:17.020 --> 00:45:18.410
to optimize these things.

00:45:18.410 --> 00:45:19.960
That's the way to
describe what V4

00:45:19.960 --> 00:45:22.840
does, according to this
kind of modeling approach.

00:45:22.840 --> 00:45:24.880
Now I want to point
out, this is only half

00:45:24.880 --> 00:45:26.046
of the explainable variance.

00:45:26.046 --> 00:45:27.252
So it's far from perfect.

00:45:27.252 --> 00:45:28.460
There's room to improve here.

00:45:28.460 --> 00:45:30.793
But it's really dramatic how
much improvement we got out

00:45:30.793 --> 00:45:32.400
of these kind of models.

00:45:32.400 --> 00:45:34.670
And so if you take
this sort of--

00:45:34.670 --> 00:45:35.870
well, I'll skip this.

00:45:35.870 --> 00:45:38.650
If you take this back to
you know, big picture,

00:45:38.650 --> 00:45:40.000
what did we do here?

00:45:40.000 --> 00:45:41.710
What we're doing is
we have performance

00:45:41.710 --> 00:45:43.793
of a model on high
invariance recognition tasks.

00:45:43.793 --> 00:45:46.474
We're saying, this is what
we've been trying to optimize.

00:45:46.474 --> 00:45:47.890
And what we noticed
is that if you

00:45:47.890 --> 00:45:50.852
plot-- these dots are samples
out of that model family.

00:45:50.852 --> 00:45:52.810
These black dots are
other models I showed you.

00:45:52.810 --> 00:45:55.600
So they're control models that
were in the field at the time.

00:45:55.600 --> 00:45:57.400
And this is the
ability of the top--

00:45:57.400 --> 00:45:59.500
the model-- the top level
of any of the models

00:45:59.500 --> 00:46:01.120
to predict IT responses.

00:46:01.120 --> 00:46:03.450
So, you know, how good
they are predicting--

00:46:03.450 --> 00:46:06.160
this is sort of the median
variance explained of single IT

00:46:06.160 --> 00:46:06.935
responses.

00:46:06.935 --> 00:46:08.560
And you see there's
a correlation here.

00:46:08.560 --> 00:46:11.140
If you're better at this, you're
better at predicting that.

00:46:11.140 --> 00:46:13.279
And all we did was
optimize this way,

00:46:13.279 --> 00:46:15.445
which we think of as like,
evolution or development.

00:46:15.445 --> 00:46:16.820
So we're not
fitting neural data.

00:46:16.820 --> 00:46:19.180
We're just optimizing
for task performance.

00:46:19.180 --> 00:46:21.850
And that led in 2012 to a
model that, as I just showed you,

00:46:21.850 --> 00:46:24.277
explained about half of
the IT response variance.

00:46:24.277 --> 00:46:26.110
OK, so it's like, well,
this looks like it's

00:46:26.110 --> 00:46:27.670
continuing up this way.

00:46:27.670 --> 00:46:31.930
OK so if you believe that
story, then, that says,

00:46:31.930 --> 00:46:35.200
if we can optimize further
on these kind of tasks,

00:46:35.200 --> 00:46:37.649
maybe we can explain
more variance.

00:46:37.649 --> 00:46:39.190
And it turned out,
we didn't actually

00:46:39.190 --> 00:46:41.290
need to do that,
because again, I said,

00:46:41.290 --> 00:46:43.804
computer vision was
already working on this.

00:46:43.804 --> 00:46:45.220
And they got a lot
more resources.

00:46:45.220 --> 00:46:46.090
They're already doing it.

00:46:46.090 --> 00:46:47.714
They're already better
than us on this.

00:46:47.714 --> 00:46:49.137
So here's our HMO model.

00:46:49.137 --> 00:46:51.220
This is now Charles Cadieu,
a post-doc in the lab.

00:46:51.220 --> 00:46:52.600
These were models that
came out at the time.

00:46:52.600 --> 00:46:54.183
This is Krizhevsky
et al., SuperVision.

00:46:54.183 --> 00:46:56.200
It's ICLR 2013.

00:46:56.200 --> 00:46:58.287
They were better than the
model that we had built.

00:46:58.287 --> 00:47:00.370
You know, we were in this
restricted image domain,

00:47:00.370 --> 00:47:01.930
you know, there's
lots of reasons why

00:47:01.930 --> 00:47:03.096
we could say they're better.

00:47:03.096 --> 00:47:06.070
Regardless, they were better at
our own tasks than the models

00:47:06.070 --> 00:47:07.300
that we had built, right?

00:47:07.300 --> 00:47:09.490
So they were already
ahead of us on the task

00:47:09.490 --> 00:47:10.990
that we had designed.

00:47:10.990 --> 00:47:13.199
And so they were up here,
and then they were up here.

00:47:13.199 --> 00:47:14.781
And so, if you follow
that prediction,

00:47:14.781 --> 00:47:16.900
that means these models
might be better predictors

00:47:16.900 --> 00:47:17.890
of our neural data, right?

00:47:17.890 --> 00:47:19.240
These guys don't
have our neural data.

00:47:19.240 --> 00:47:21.740
All they're doing is building
models to optimize performance

00:47:21.740 --> 00:47:22.370
on tasks.

00:47:22.370 --> 00:47:24.970
But we could take their
features and, with our neural data,

00:47:24.970 --> 00:47:25.930
play the same game.

00:47:25.930 --> 00:47:29.020
And we actually explained
our response-- data

00:47:29.020 --> 00:47:31.060
better than our model
explained our own data.

00:47:31.060 --> 00:47:34.120
So this is a nice statement
that is not even in our own lab.

00:47:34.120 --> 00:47:36.980
Just a continued optimization
for those kinds of tasks

00:47:36.980 --> 00:47:41.180
leads to features that are good
predictors of the IT responses.
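
NOTE
Editorial sketch, not from the lecture: one way to "play the same game" is to fit
a regularized linear map from a model's top-level features to one recorded site's
responses, then score it by variance explained on held-out images. Everything here
is an assumption for illustration: the ridge penalty, the 80/20 split, the function
name heldout_explained_variance, and the synthetic data standing in for features
and recordings. It also skips the noise correction implied by "explainable variance".
```python
import numpy as np
def heldout_explained_variance(features, responses, ridge=1.0, train_frac=0.8, seed=0):
    """Fit ridge regression from model features to one site's responses; return held-out R^2."""
    rng = np.random.default_rng(seed)
    n_images = features.shape[0]
    order = rng.permutation(n_images)
    n_train = int(train_frac * n_images)
    train, test = order[:n_train], order[n_train:]
    # Center using training statistics only, so the test images stay unseen.
    x_mean = features[train].mean(axis=0)
    y_mean = responses[train].mean()
    x_train = features[train] - x_mean
    y_train = responses[train] - y_mean
    # Closed-form ridge solution: w = (X^T X + ridge * I)^-1 X^T y.
    gram = x_train.T @ x_train + ridge * np.eye(x_train.shape[1])
    weights = np.linalg.solve(gram, x_train.T @ y_train)
    predicted = (features[test] - x_mean) @ weights + y_mean
    residual = np.sum((responses[test] - predicted) ** 2)
    total = np.sum((responses[test] - responses[test].mean()) ** 2)
    return 1.0 - residual / total
# Synthetic stand-in data (hypothetical): 500 images, 20 model features, one site.
rng = np.random.default_rng(1)
features = rng.normal(size=(500, 20))
true_weights = rng.normal(size=20)
responses = features @ true_weights + rng.normal(scale=1.0, size=500)
r2 = heldout_explained_variance(features, responses)
```
Comparing models would then mean repeating this per recorded site for each model's
features and summarizing, for example with the median across sites.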

00:47:41.180 --> 00:47:42.560
And that's what's shown here.

00:47:42.560 --> 00:47:45.970
So I think that's what
I just said there.

00:47:45.970 --> 00:47:49.120
So, Charles took this
further and analyzed

00:47:49.120 --> 00:47:50.597
this in more detail.

00:47:50.597 --> 00:47:52.930
This is a summary of what I
presented in the second half

00:47:52.930 --> 00:47:56.080
now, showing that IT firing
rates are feature based,

00:47:56.080 --> 00:47:58.420
learned object judgments
naturally predict human and monkey

00:47:58.420 --> 00:47:58.919
performance.

00:47:58.919 --> 00:48:00.820
This is what we call LaWS of RAD IT.

00:48:00.820 --> 00:48:02.530
I picked a particular
model, which

00:48:02.530 --> 00:48:05.730
is 100 millisecond read on this
time window, 50,000 neurons.

00:48:05.730 --> 00:48:07.150
100 training examples.

00:48:07.150 --> 00:48:12.220
That's one particular choice of
a decode model, that's just a--

00:48:12.220 --> 00:48:17.230
is a current set of decode model
that fits a lot of our data,

00:48:17.230 --> 00:48:18.230
but not all of our data.
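
NOTE
Editorial sketch, not from the lecture: a decode model of this general kind can be
written as a learned weighted sum of firing rates, each averaged over one readout
window, trained on about 100 labeled examples. The population below is synthetic
and much smaller than the numbers quoted in the talk. The ridge penalty, the tuning
model, and the names train_linear_readout, decode, and simulate are all hypothetical.
```python
import numpy as np
def train_linear_readout(rates, labels, ridge=1.0):
    """Learn weights and bias for a weighted sum of firing rates (labels are +1 / -1)."""
    mean_rate = rates.mean(axis=0)
    centered = rates - mean_rate
    gram = centered.T @ centered + ridge * np.eye(rates.shape[1])
    weights = np.linalg.solve(gram, centered.T @ labels)
    bias = labels.mean() - mean_rate @ weights
    return weights, bias
def decode(rates, weights, bias):
    """Report +1 or -1 from the sign of the weighted sum."""
    return np.where(rates @ weights + bias >= 0, 1, -1)
# Synthetic population (hypothetical sizes, far smaller than the numbers in the talk).
rng = np.random.default_rng(0)
n_neurons, n_train, n_test = 200, 100, 400
tuning = rng.normal(scale=0.3, size=n_neurons)  # each neuron's preference for object A vs. B
def simulate(n_trials):
    labels = rng.choice([-1, 1], size=n_trials)
    # Mean rate over the readout window: tuning times label plus trial-to-trial noise.
    rates = labels[:, None] * tuning[None, :] + rng.normal(size=(n_trials, n_neurons))
    return rates, labels
train_rates, train_labels = simulate(n_train)
test_rates, test_labels = simulate(n_test)
weights, bias = train_linear_readout(train_rates, train_labels)
accuracy = np.mean(decode(test_rates, weights, bias) == test_labels)
```
The held-out accuracy of such a readout is the quantity one would compare against
behavioral performance on the same task.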

00:48:18.230 --> 00:48:20.620
And we also want to
get finer-grained data.

00:48:20.620 --> 00:48:22.960
The inference is, this might
be the specific neural code

00:48:22.960 --> 00:48:25.102
and decoding mechanism
that the brain uses

00:48:25.102 --> 00:48:26.060
to support these tasks.

00:48:26.060 --> 00:48:28.174
That's what we'd like to think.

00:48:28.174 --> 00:48:30.340
But now, we're trying to
do systematic causal tests.

00:48:30.340 --> 00:48:32.590
And we talked a lot about
trying to silence bits of IT

00:48:32.590 --> 00:48:33.810
as one example of that.

00:48:33.810 --> 00:48:37.070
And the tools are still not
where we'd like them to be.

00:48:37.070 --> 00:48:39.800
But you see we're
making progress there.

00:48:39.800 --> 00:48:43.021
So the second was I showed the
optimization of deep CNN models

00:48:43.021 --> 00:48:44.770
for invariant object
recognition tasks led

00:48:44.770 --> 00:48:46.420
to dramatic improvements
in our ability

00:48:46.420 --> 00:48:47.950
to predict IT and V4 responses.

00:48:47.950 --> 00:48:49.150
I showed you our model HMO.

00:48:49.150 --> 00:48:51.700
But then the convolutional
neural networks in the field

00:48:51.700 --> 00:48:53.810
have already surpassed
our predictive ability

00:48:53.810 --> 00:48:56.320
on our own data.

00:48:56.320 --> 00:48:58.570
And so the inference is that
these encoding mechanisms

00:48:58.570 --> 00:49:00.486
in these models might
be similar to those that

00:49:00.486 --> 00:49:02.040
work in the ventral stream.

00:49:02.040 --> 00:49:04.096
And now, you know, there's
a whole sort of area

00:49:04.096 --> 00:49:06.220
where you can start to
think about doing physiology

00:49:06.220 --> 00:49:07.516
on the models, so to speak.

00:49:07.516 --> 00:49:08.890
And that problem's
almost as hard

00:49:08.890 --> 00:49:10.600
as doing physiology
on the animal,

00:49:10.600 --> 00:49:13.000
except that you can
gain a lot more data.

00:49:13.000 --> 00:49:14.890
And so, and this is
allowing the field

00:49:14.890 --> 00:49:17.080
to design experiments
to explore what remains,

00:49:17.080 --> 00:49:20.050
what's unique and powerful
about primate object perception.

00:49:20.050 --> 00:49:21.520
So within core
object recognition

00:49:21.520 --> 00:49:23.560
or perhaps having to
extend out of that,

00:49:23.560 --> 00:49:26.320
I think is now what
people are trying to do.

00:49:26.320 --> 00:49:28.570
So big picture in terms
of us for the future,

00:49:28.570 --> 00:49:30.850
I've talked about
this LaWS of RAD IT.

00:49:30.850 --> 00:49:32.650
Can we perturb here
and get effects here

00:49:32.650 --> 00:49:33.940
that are predictable?

00:49:33.940 --> 00:49:36.770
Can we predict for each
image, coding model,

00:49:36.770 --> 00:49:39.130
and for the optical
manipulations?

00:49:39.130 --> 00:49:40.840
We talked about that.

00:49:40.840 --> 00:49:42.400
Dynamics and feedback
are something

00:49:42.400 --> 00:49:43.510
that we're interested in.

00:49:43.510 --> 00:49:45.310
But I haven't talked
much at all about.

00:49:45.310 --> 00:49:48.850
I think that's a good
point, a discussion topic.

00:49:48.850 --> 00:49:51.160
I can tell you how
we're thinking about it.

00:49:51.160 --> 00:49:54.100
We have some efforts
in that regard.

00:49:54.100 --> 00:49:56.380
I talked on the encoding
side about these kind

00:49:56.380 --> 00:49:59.050
of deep convolutional
networks that map from images.

00:49:59.050 --> 00:50:01.570
But the dashed lines mean
they're only 50% predicted.

00:50:01.570 --> 00:50:04.570
Both of these cases,
they're not perfect, right?

00:50:04.570 --> 00:50:06.380
So there's work
to be done there.

00:50:06.380 --> 00:50:07.930
And one of the really
exciting things

00:50:07.930 --> 00:50:09.377
here is how
these models learn.

00:50:09.377 --> 00:50:11.210
This supervised way of
learning these models

00:50:11.210 --> 00:50:13.480
is almost surely not what's
going on in the brain.

00:50:13.480 --> 00:50:16.930
So finding more-- less
supervised, biologically

00:50:16.930 --> 00:50:18.880
motivated learning
of these models

00:50:18.880 --> 00:50:22.660
is a good-- is the next step,
I think, for much of the field.

00:50:22.660 --> 00:50:24.620
But what's nice is to
have an end state that

00:50:24.620 --> 00:50:27.930
is much better than any previous
end state we'd had before.

00:50:27.930 --> 00:50:32.080
So that sets a target of
what success might look like.

00:50:32.080 --> 00:50:34.240
And you know, maybe we
can think about expanding

00:50:34.240 --> 00:50:35.880
beyond core recognition.

00:50:35.880 --> 00:50:37.920
We can talk in the
question period about that.

00:50:37.920 --> 00:50:39.940
When is the right
time to kind of keep

00:50:39.940 --> 00:50:42.640
working within the domain
of core recognition that

00:50:42.640 --> 00:50:45.042
is set up, versus
expanding beyond that?

00:50:45.042 --> 00:50:46.750
Because there's lots
of aspects of object

00:50:46.750 --> 00:50:48.790
recognition that I
didn't touch on here.

00:50:48.790 --> 00:50:50.830
And that comes up
in the questions.

00:50:50.830 --> 00:50:54.190
I think, there's lots of work
to be done within the domain,

00:50:54.190 --> 00:50:56.590
but there's also
interesting directions that

00:50:56.590 --> 00:50:59.070
extend outside of that domain.