The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

EERO SIMONCELLI: I'm going to talk about a bunch of work that we've been doing over the last-- it's been about four years-- on trying to understand, basically, that terra incognita that Gabrielle just mentioned that lies between V1 and IT.

I brought this back with me from the Dolomites, where I was last week with my family. And when you sit and you look at it, and that image comes into your eyes and gets processed by your brain, there's a lot of information there. It's a lot of pixels. And the question that I'm going to start with is, where does it go? You have all this information. It's flooding into your eyes all day, every day, for your entire lifetime. Obviously, you don't store it all there. Your head doesn't inflate until it gets to the point of explosion. So where does it go?

And here's a theorist's diagram of the brain-- a square with rounded corners. In comes the information, and there are really only three options. You either act on the information-- do something with it, sensory-motor loops-- or, for complex organisms especially, a fair amount of it you might actually try to remember. You might hold on to it, and we heard about that earlier today. But this really only accounts for, I think, a fairly small portion of what goes on, because a lot of it you throw away. You have to. You really don't have a choice. You have to summarize it, squeeze it down to the relevant bits that you're going to hold on to or act on, and the rest of it you just toss.

So the question is, how can we exploit that basic fact? It's an obvious fact. It has to be true.
How do we exploit that to understand something about what the system does and what it doesn't do? And there's a long history to this. In fact, since I come from vision, and most of my work is centered on vision-- and the auditory system to some extent-- the vision scientists were the first to recognize the importance of this. And it really is a foundational chunk of work at the beginning of the field that set in motion a lot of things that we currently know about vision. So for those of you that don't know that story, I'm going to give a very, very brief reminder of what it is, because I think it's an absolutely fantastic scientific story. And then from there, I'll talk about texture. So two examples-- I'm going to quickly say something about trichromatic color vision, and then I'm going to talk about texture, and then we'll go into V2 and metamers and other things.

So trichromacy-- Newton figured out that light comes in wavelengths. He split light with a prism. There's the drawing of him splitting light coming in through a hole in the wall. He split it with a prism into wavelengths, saw a rainbow, did a lot of experiments to recognize that you could take that rainbow and reassemble it into white light, but you couldn't further subdivide it, and basically gave us the foundations for thinking about light and spectral distributions.

In the 1800s, a group of people who were physicists, mathematicians, and psychologists all rolled into one-- and there were quite a few of them; Helmholtz was one, and Grassmann was one of the most important ones, and I'll mention him again in a moment-- figured out something peculiar about human vision: that even though there was this huge array of colors in the wavelengths of the spectrum, humans actually had deficits-- we were not able to sense or discriminate things that it seemed like we should be able to.
And it boiled down in the end, after a lot of study and discussion and theorizing, to this experiment, which is known as a bipartite color matching experiment. So in this little display, here's a gray annulus, and in the middle is a circle. On the left side of this is light coming from some source. It has some spectral distribution, illustrated here. This is all a cartoon, but just to give you the idea of how this works. On the right side are three primary lights. And the job of the observer in this experiment is to adjust, let's say, sliders or knobs in order to change the intensity of these three lights, to make the light on the right side of this split circle look the same as the light on the left side.

And it turns out that-- so just to be clear, these three things have their own spectral distributions. They might look like that, for example. And when the observer comes up with the knob settings, they're going to produce something that might look like that. This is just a sum of three copies of these spectra, weighted by the knob settings. So this is a linear combination of three spectral distributions. And intentionally, I've drawn this so that they don't look the same, because that's the whole point of the experiment.

It turns out that you can do this experiment, and any human with normal color vision can make these settings so that these two things are absolutely indistinguishable. They look identical, and yet they knew, even in the mid-1800s, that these two things have very different spectra-- and I've drawn it that way intentionally. So the point is that even though we can see all the bands of the spectrum, we can see all the colors, we actually have this deficiency in terms of noticing the difference between these two things. So how can that be?
And I think and hope that most of you know the answer to that question, because you're using devices every day that exploit this fact. But the bottom line is that in the 1850s, Grassmann laid down a set of rules. Grassmann was a mathematician. He actually developed a large chunk of linear algebra in order to explain and understand and manipulate these ideas. He laid out a set of laws, and I won't drag you through all of them. But in the end, taking into account all of the evidence that he had, what those laws amounted to is that the human being, when setting these knobs, was acting like a linear system. The human was taking an input, which is a wavelength spectrum, and adjusting the knobs. And the settings of the knobs were a linear function of the wavelength spectrum that was coming into the eye.

And it's a remarkable and amazing fact: if you know that the brain is a highly non-linear device, how is it that a human can act like a linear device? And the answer is that basically the human, taking this thing in and making the knob settings, has a front end that's linear and is doing a projection of the wavelength spectrum onto basically three axes. And that process-- those three measurements-- is linear. Everything that happens after that is complicated and non-linear and involves noise and decisions and all kinds of motor control and everything else. But as long as the information in those original three measurements is not lost, the human is going to basically act like a linear system, in terms of doing this matching. So Grassmann realized this.
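To make that linear-front-end story concrete, here's a minimal sketch of the matching experiment as linear algebra. All the numbers are made up: `cones` stands in for the three cone sensitivity curves sampled at a set of wavelengths, `primaries` for the spectra of the three adjustable lights, and `test` for the light to be matched. The match is just the solution of a 3x3 linear system.

```python
import numpy as np

# Hypothetical discretized spectra: 31 wavelength samples (400-700 nm).
rng = np.random.default_rng(0)
cones = rng.random((3, 31))      # stand-in for L, M, S sensitivities
primaries = rng.random((31, 3))  # spectra of the three primary lights
test = rng.random(31)            # spectrum of the light to be matched

# The linear front end: each light is reduced to three cone responses.
target = cones @ test            # all the observer can ever sense

# Find knob settings w so the mixture of primaries produces the same
# three cone responses: (cones @ primaries) @ w = cones @ test.
w = np.linalg.solve(cones @ primaries, target)

# The two spectra are physically different, yet the cone responses match:
match = primaries @ w
assert not np.allclose(match, test)              # different spectra
assert np.allclose(cones @ match, cones @ test)  # a metamer
```

One detail worth noting: nothing forces the solved knob settings to be positive; in the real experiment, a negative setting corresponds to moving that primary over to the test side of the field.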
The theory that he set out, and that others then elaborated on, perfectly explained all the data for normal human subjects: lights could be created that appear identical but have physically distinct wavelength spectra, and they called these metamers-- two things that are physically different but look the same.

This was codified. It took many, many decades. Things moved slower back then. There were none of these rapid, Google-style overturns of the scientific establishment within a year or two. It took until the 1930s to actually build this into a set of standards that were used in the engineering community to generate and create color film, color devices, eventually color video, color monitors, color projectors, color printers-- everything else that we use. And these specifications were to allow the reproduction of colors so that they looked the way they were supposed to look. So you record color with a camera. It turns out that your camera is also only recording three color channels, just like your eye, and then you have to be able to re-render that on another device. And these standards specify how to do that.

The surprising thing in the whole story is-- so this is the 1850s. Well, we go back to Newton; that was the 1600s. Then in the 1850s, we're getting this beautiful theory that's very, very precise, and it gets built into engineering standards. And it's not until 1987 that it actually gets verified in a mechanistic sense. And I like to tell this story, because I think it's a reminder that always aiming for the reductionist solution is not necessarily the right thing to do. This is a very beautiful piece of science that was done at Stanford, actually, by Baylor, Nunn, and Schnapf. They took cones from a macaque-- I think originally they worked with turtles, but then macaque-- sucked them up into a glass micro-pipette, shined monochromatic lights through them, and measured their absorption properties.
And they found these three functions for three different types of cones, and verified, basically, that these three absorption spectra perfectly explained the data from the 1800s. So this is an amazing thing: you can have a theory and a set of behavioral experiments that make very precise and clear predictions, which then get verified and tested in a mechanistic sense more than 100 years later, and they come out basically perfect. So it's an astounding, astounding sequence, in my view.

So what we wanted to do is set out to do the same kind of thing for pattern vision. And we're going to do that by thinking about texture. So what's a texture? A texture is an image that's homogeneous, with repeated structures. So each of these is an example of texture. That's a piece of woven basket. This is tree bark. That's a herringbone pattern, and these are some sort of nuts or stones. And each of these has the property that there are lots of repeated elements with some variability. Sometimes there's more variability, sometimes there's less, but there's usually at least some.

And of course, these things are ubiquitous. When I started working on this, which is about 15 years ago-- maybe a little bit more, 16 years ago-- I started photographing things that I saw as I walked around, and textures are everywhere. Most things are textured. The world is not made up of Mondrians. It's not made up of things that are plain, blank colors separated by sharp edges. It's made up of textures, and often the boundaries between things are boundaries between textured objects, like the seats in the auditorium, for example.

So how is it that we can go about thinking about this in terms of metamers and representation in, let's say, the visual system? And the idea really comes from Julesz, who proposed in 1962 a famous theory that he later abandoned. The theory goes like this.
First of all, he said the thing that we're going to do to try to describe textures is use statistics. And why statistics? Because these things are supposed to be variable, so I need some stochasticity. But I also want something that's homogeneous, so I'm going to measure things averaged across the entire image. That's the statistical side of it. And he proposed that, well, if I start by measuring just pixel statistics-- say single-pixel statistics, pairwise pixel statistics, maybe triples of pixels-- eventually I should reach a point where I've made enough measurements to sufficiently constrain the texture, such that any two textures that have the same statistics up to that order, whatever that order is, should look the same to a human being. And he didn't talk about this in physiological terms, but I think in the background is the notion that humans are actually measuring those statistics, and if you can get them right-- if you can make two images have the same statistics, and that's the only thing that humans are measuring-- then those two images will look the same.

So Julesz goes ahead with this, and eventually constructs by hand-- because he did everything with binary patterns constructed by hand-- these two examples that have identical statistics. He first falsifies the theory at n equals 2, and then he tries third-order statistics. And he comes up with these two examples-- counter-examples to the theory. These are matched in terms of their third-order statistics. It's not easy to see or realize that, but it's true. If you take triples of pixels, and you take the product of those three, and you average that over the image, these two things are identical, but they look very different. And if you draw samples of each of these, it's very easy to label them as, let's say, A or B, into these two categories. Here's another example that came out a bit later, by Jack Yellott. These two things also are matched up to third order.
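Just to pin down what "matched third-order statistics" means, here's a minimal sketch of one such measurement. Everything in it is illustrative: the offsets defining the pixel triple are an arbitrary choice, and I've assumed periodic wraparound at the image boundary for simplicity. Matching "up to third order" means these numbers agree for every choice of offsets, along with all the first- and second-order statistics.

```python
import numpy as np

def third_order_stat(img, off1, off2):
    """Average over the image of the product of pixel triples:
    mean of I(x) * I(x + off1) * I(x + off2), with periodic wraparound."""
    b = np.roll(img, shift=off1, axis=(0, 1))
    c = np.roll(img, shift=off2, axis=(0, 1))
    return (img * b * c).mean()

# Toy check on a random binary pattern, Julesz-style.
rng = np.random.default_rng(1)
img = rng.integers(0, 2, size=(64, 64)).astype(float)
print(third_order_stat(img, (0, 1), (1, 0)))  # one triple configuration
```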
So Julesz decides that the theory is a failure, and he abandons it. And he begins a new theory, which is the theory of textons-- a much less precisely specified theory. Basically, it's a generative model, if you like. Everybody's fond of generative models these days, except for me-- ah, and maybe Tommy. He comes up with a generative model, which is to lay down many copies of a small, repeating unit, which he called the texton. And so he came up with this method of generating texture images, which he went to town on, and he made lots of examples. The problem is that that wasn't a description of how to analyze texture images, or of how a human would analyze texture images, and so it became very difficult to bridge that gap. In my view, the theory really never succeeded, and he should have stuck with the initial theory. Anyway, that gave us an opportunity.

So we went back many years later-- this is around 1999. I had a fantastic post-doc, Javier Portilla, who came from Spain, and we started thinking about texture and started putting together a model that was Juleszian in spirit, but a little bit different, because we wanted to build in a little bit of what we knew about physiology. Now, Julesz knew about physiology, because Hubel and Wiesel were doing all those experiments in V1 in the late '50s and early '60s, but he really didn't incorporate that into his thinking.

So what we did is build a very simple model-- just dumb, stupid, simple-- in which we took this description of V1 neurons. So these are oriented receptive fields. The idea is that this is a description of a neuron that takes a weighted sum of the pixels, with positive and negative lobes. And it has a preferred orientation, because the positive and negative lobes have a particular oriented structure.
And then it takes the output of that weighted sum and runs it through some rectifying, nonlinear function. This is a classic thing that Hubel and Wiesel described for a simple cell. And here's another one, which is a complex cell. This one basically does two of these and combines them. I'm trying to avoid the details here, because they're not critical for understanding what I'm going to show you.

So then we took those things and we said, well, what if we measure joint statistics of those things over the image? So we're going to take not just these filters-- of course, we're going to do a convolution. That is, we're going to compute the response of this weighted sum at different positions throughout the image. We're going to rectify all of them. Now we're going to take joint statistics. What do I mean by that? Just correlations, basically-- second-order statistics of the simple cells, of the complex cells, and the cross-statistics between them. And these statistics are between different orientations, different positions, and also different sizes. And given that large set of numbers-- typically, for the images that we worked with back then, these were on the order of 700 numbers. So we have an image over here, which is, say, tens of thousands or hundreds of thousands of pixels, being transformed through this box into a set of, let's say, 700 numbers. So 700 summary statistics to describe this pattern.

And then the question is, how do we test the model? Most people, when they test models like this, do classification. This should sound very familiar these days, with the deep network world. They take a model, and then they run it on lots of examples. And they ask, well, do the examples that are supposed to be the same kind of thing, like the same tree bark-- do they come out with statistics that are similar or almost the same as each other?
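Here's a toy sketch of that measurement pipeline: a bank of oriented filters, rectification, and then correlations among the rectified response maps. The Gabor filter and the particular statistics chosen here are stand-ins of my own; the actual model uses a steerable pyramid spanning multiple scales and a specific, carefully chosen set of around 700 statistics. This is just the flavor of the computation.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor(theta, size=15, freq=0.25, sigma=3.0):
    """A toy oriented filter; stand-in for the model's pyramid filters."""
    r = np.arange(size) - size // 2
    x, y = np.meshgrid(r, r)
    u = x * np.cos(theta) + y * np.sin(theta)
    return np.cos(2 * np.pi * freq * u) * np.exp(-(x**2 + y**2) / (2 * sigma**2))

def texture_stats(img, n_orient=4):
    """Rectified oriented-filter responses, then all pairwise correlations
    across orientations -- a toy 'joint statistics' vector."""
    responses = [np.maximum(fftconvolve(img, gabor(t), mode="same"), 0)
                 for t in np.pi * np.arange(n_orient) / n_orient]
    stats = [r.mean() for r in responses]          # first-order statistics
    for i in range(n_orient):                      # second-order (cross) terms
        for j in range(i, n_orient):
            stats.append((responses[i] * responses[j]).mean())
    return np.array(stats)

rng = np.random.default_rng(2)
print(texture_stats(rng.random((64, 64))).shape)   # (14,) for 4 orientations
```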
And can I classify or group or cluster them and get the right answer when trying to identify the different examples? We decided that that was-- at least at the time-- a very weak test of this model, because this is a high-dimensional space, and we had only, let's say, on the order of hundreds of example textures. And hundreds sounds like a lot of textures-- a couple hundred textures-- but if the outputs live in a 700-dimensional space, then it's basically nothing. We're not filling that space. And for those of you that are statistically oriented, you know that there's this thing called the curse of dimensionality: the number of data samples that you need to fill up a space goes up exponentially with the number of dimensions. So this was really bad news, and we decided that it was going to be a disaster to just do classification-- that pretty much any set of measurements would work for classification.

So we were looking for a more demanding test of the model. And for that, we turned to synthesis. The idea is like this. You take this image. You run it through the model. You get your responses. Now we're going to take a patch of white noise. We're going to run it through the same model, and then we're going to lean on the noise-- push on all the pixels in that noise image until we get the same outputs. So this is sometimes called synthesis by analysis. This is not a generative model, but we're using it like a generative model. We're going to draw samples of images that have the same statistics by starting with white noise and just pounding on it until it looks right. And pounding on it means, for those of you that want to know, measuring the gradients of the deviation away from the desired output and just moving in the direction of the gradient. I'm giving you the quick version of this.

A little bit more abstractly, we can think of it this way. There's a space of all possible images. Here's the original image.
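In code, that gradient procedure is short. Here's a minimal sketch in PyTorch, where a made-up two-filter statistic stands in for the full set of texture statistics; the point is only the loop structure-- define the deviation from the target statistics as a loss, and descend its gradient with respect to the pixels.

```python
import torch

def stats(img):
    """Toy statistics: means of rectified responses to two fixed filters.
    A stand-in for the ~700 statistics of the actual texture model."""
    filters = torch.stack([
        torch.tensor([[1., -1.], [1., -1.]]),   # vertical-edge filter
        torch.tensor([[1., 1.], [-1., -1.]]),   # horizontal-edge filter
    ]).unsqueeze(1)
    resp = torch.nn.functional.conv2d(img, filters)
    return torch.relu(resp).mean(dim=(0, 2, 3))

target_img = torch.rand(1, 1, 64, 64)             # stand-in for the photograph
target = stats(target_img).detach()               # desired model outputs

x = torch.rand(1, 1, 64, 64, requires_grad=True)  # white-noise seed
opt = torch.optim.SGD([x], lr=1.0)
for step in range(2000):
    opt.zero_grad()
    loss = ((stats(x) - target) ** 2).sum()       # deviation from the target
    loss.backward()                               # gradient w.r.t. the pixels
    opt.step()                                    # push the pixels downhill
```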
It's a point in this space. We compute the responses of the model, which live in a lower-dimensional space-- a smaller space. That's this. Because this is a many-to-one mapping and it's continuous, there's actually a manifold-- a continuous collection of images over here, all of which have exactly the same model responses. And what we're trying to do is grab one of these. We want to draw a sample from that manifold. If the theory is right-- if this model is a good representation of what humans see and capture when they look at textures-- then all of these things should look the same. That's the hypothesis. And the way we do it, again, is to start with a noise seed-- just an image filled with noise. We project it onto the manifold. We push it onto this point. We can test that, because we can, of course, measure the same things on this image and make sure that they're the same as on that image, and that's our synthesized image. So that's an abstract picture of what I told you on the previous slide.

And then finally, the scientific or experimental logic is to test this by showing it to a human observer. So we have the original image, and then we compute the model responses. We generate a new image, and we ask the human, do these look the same? If the model captures the same properties as the visual system, then two images with identical model responses should appear identical to a human. So that's the logic. And any strong failure of this indicates that the model is insufficient to capture what is important about these images.

So it works, or I wouldn't be telling you about it. Here are just a few examples. There are hundreds more on the web page that describes this work. On the top are original photographs-- lizard skin, plaster of some sort, beans. On the bottom are synthesized versions of these. The lizard skin works really well. The plaster works quite well. The beans, a little less so.
And whether it works well or not depends on the viewing condition. So if you flash these up quickly, people might be convinced that they all look really great. If you allow them to inspect them carefully, they can start to see deviations or funny little artifacts. So it's a partial success. And I should point out that it also provides a pretty convincing success on Julesz's counter-examples. So these are examples: this is synthesized from that, and this is synthesized from that, and they're easily classifiable.

And there are fun things you can do with this. You can fill in regions around images. So if you take this little chunk of text here, and you measure the statistics, and you say, fill in the stuff around it with something that has the same statistics-- but try to do a careful job of matching up at the boundaries-- you can create things like this. You can read the words in the center, but the outside looks like gibberish. Each one of these was created in the same way. So the center of each of these is the original image, and what's around it is synthesized. It works reasonably well.

You can also do fun things like this. These are examples where-- I told you we started from white noise and then pushed it onto the manifold, but we can actually start from any image. So if we start from these images-- these are three of my collaborators: two of my students and my collaborator Tony Movshon. If we start with those as starting-point images, and we use these textures for each of them, we arrive at these images, where you can still see some of the global structure of the face. Because the model is a homogeneous model, it doesn't impose anything on global structure. And so if you seed it with something that has a particular global structure or arrangement, it will inherit some of that. It'll hold onto it. Anyway, this is just for fun. Let's get back to science.
So now, here's an example of Richard Feynman. This is Richard Feynman after he's gone through the blender. You can see pieces of skin-like things and folds and flaps, but it's all disorganized. Again, it's a homogeneous model. It doesn't know anything about the global organization of this photograph. But what we want to know is: do we have a model that's just a model for the perception of homogeneous textures, or can we actually push it a little bit and make it, first of all, a little more physiological, and second of all, maybe a little bit more relevant for everyday vision? For me, standing here and looking at this scene, how do I go about describing what's going on when I'm looking at a normal scene? So let's think through how to do this.

I'm going to jump right to this diagram of the brain again. So V1 is in the back of the brain. The information that comes into your eyes goes through the retina, the LGN, back to V1. And then it splits into these two branches, the dorsal and the ventral stream. The ventral stream is usually associated with spatial form and recognition and memory. So I'm going to think about the ventral stream, and we're going to try to understand what this model might have to say about processing in the ventral stream.

I'm going to rely on just a few simple assumptions. First, each of these areas has neurons, and they respond to small regions of the visual input, known as receptive fields. Most of you know that. In each visual area, I'm going to assume that those receptive fields are covering, blanketing, the entire visual field. So there are no dead spots-- no spots that are left out. Everything is covered nicely. And in fact, we know that this is true starting, for example, in the retina. So this is a cartoon diagram to illustrate the inhomogeneity that's found in the retina.
So the receptive field sizes in the retina grow with eccentricity. And it turns out that that starts in the retina, but it's true, actually, all the way through the visual system, and throughout the ventral stream in particular. And this diagram is showing-- these little circles are about 10 times the size of the midget ganglion cell receptive fields in your retina. So if you fixate right here in the center of this, these things are about 10 times the size of your receptive fields. And that's long been thought to be the primary driver of your limits on acuity in peripheral vision.

So in particular, if you take this eye chart-- this was done by Stuart Anstis back in the '70s-- and you lay it out in this fashion, these letters are about 10 times the threshold for visibility and recognition. And so you can say that the stroke widths of the letters are about matched to the size of these ganglion cells, and it works, at least qualitatively-- things are scaling in the right way, in terms of acuity and in terms of the size that the letters need to be for you to recognize them.

And you can make pictures like this. This is after Bill Geisler, who showed that if you foveate-- if you fixate here-- you can't see the details of the stuff that's far from your fixation point, and if you blur it, people don't notice. Alternatively, you can add high-frequency noise to it, and people won't notice that either. Because those receptive fields are getting larger and larger, you're basically blurring out the information that would allow you to distinguish, let's say, these two things. When you look right at it, you can see it, but if you keep your eye fixated here, you won't notice it.

So let's work off of those ideas-- the idea of these receptive fields that are getting larger with eccentricity, and that are covering the entire visual field.
And let's notice the following. This is physiological data taken from several papers, assembled by Jeremy Freeman, who was a grad student in my lab. And here you can see the center of the receptive fields versus the size of the receptive fields. And you can see that in the retina-- I already showed you on the previous slide that size grows with eccentricity, but it's actually very slow compared to what happens in the cortex. In V1, the receptive fields grow at a pretty good clip. In V2, they grow about twice as fast as that, and in V4, twice as fast again. Another way of saying this: at any given receptive field location relative to the fovea-- let's say 15 degrees-- the receptive fields in V1 are of a given size: the diameter is about 0.2 to 0.25 times the eccentricity. The receptive fields in V2 are twice that size, so about 0.45 times the eccentricity, and the receptive fields in V4 are twice that again.

In cartoon form, it looks something like this. So here's V1-- lots of cells, and smallish receptive fields growing with eccentricity. Here's V2. They're bigger. They grow faster. Here's V4. And by the time you get to IT-- Jim DiCarlo was here a bunch of days ago, and he probably told you this-- almost every IT cell includes the fovea as part of its receptive field. They're very large, and they often cover half the visual field.

So now we have to figure out what to put inside of these little circles in order to make a model, and I'm going to basically combine-- smash together-- the texture model that I told you about, which was a global, homogeneous model, with this receptive field model. I'm going to basically stick a little texture model in each of these little circles. That's the concept. So how do we do that? Well, we're going to go back to Hubel and Wiesel.
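(Before that, a quick back-of-the-envelope version of those receptive field scaling numbers. The slopes are the approximate values quoted a moment ago, used here purely for illustration, not fitted data.)

```python
# Approximate receptive-field diameter as a linear function of eccentricity,
# using the rough slopes quoted above (illustrative, not fitted values).
slopes = {"V1": 0.25, "V2": 0.45, "V4": 0.9}

eccentricity_deg = 15.0
for area, slope in slopes.items():
    print(f"{area}: ~{slope * eccentricity_deg:.1f} deg diameter "
          f"at {eccentricity_deg} deg eccentricity")
# V1: ~3.8 deg, V2: ~6.8 deg, V4: ~13.5 deg -- roughly doubling per stage.
```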
Hubel and Wiesel were the ones who said that you make V1 simple-cell receptive fields out of LGN cells by just taking a bunch of LGN cells that line up. Here they are-- center-surround receptive fields from the LGN, which come off of the center-surround architecture of the retina. You line them up, you add them together, and that gives you an oriented receptive field, like the ones that I showed you earlier. And in more of a computational diagram, you might draw it like this. Here's an array of LGN inputs coming in. We're going to take a weighted sum of those. Black is negative; white is positive. So we add up these three guys, we subtract the two guys on either side, and then we run that through a rectifying nonlinearity. That's a simple cell.

Hubel and Wiesel also suggested that you could create complex cells by combining simple cells. This is the diagram from their paper in 1962. And we can diagram that like this. Here are basically three of these simple cells. They're displaced in position, but they have the same orientation. We half-wave rectify all of them, add them together, and that gives us a complex cell. And it's interesting to note that the hook here is going to be that this is an average of these. An average is a statistic-- a local average. So we're going to compute local averages, and we're going to call those statistics-- statistics, as in the ones used in the texture model.

So let's do that. Here's the V2 receptive field. Open that up. Inside of it is a bunch of V1 cells, here all shown at the same orientation. In reality, they would be at all different orientations and different sizes. And now we're going to compute those joint statistics, just like I did in the texture model, and that's going to give us our responses. We're going to have to do that for each one of these receptive fields.
So there are a lot of these. It's not 700 numbers anymore. It's reduced per receptive field-- there are details here-- but there are a lot of receptive fields, so it's quite a lot of parameters. And these local correlations that I told you we were going to compute can actually be re-expressed in a form that looks just like the simple- and complex-cell calculations that I showed you for V1. So in fact, if you take these V1 cells, and you take weighted sums of them, and you half-wave rectify them and add them, you get something that's essentially equivalent to the texture model that I told you about. And that's pretty cool, because it means that the calculations that take us from the LGN input to V1 outputs have a form, a structure, which is then repeated when we get to V2. We do the same kind of calculations-- linear filters, rectification, pooling or averaging. And that, of course, has become ubiquitous with the advent of all the deep network stuff.

But the idea here is that we can actually do this kind of canonical computation again and again and again, and produce something that replicates the loss of information and the extraction of features or parameters that the human visual system is performing. So this canonical idea, I think, is important, and it's something that we've been thinking about for a long time-- linear filtering that determines pattern selectivity, some sort of rectifying nonlinearity, some sort of pooling. And we usually also include some sort of local gain control, which seems to be ubiquitous throughout the visual system and the auditory system at every stage, and noise as well. And we're currently, in my lab, working on lots of models that try to incorporate all of these things in stacked networks-- small numbers of layers, not deep; shallow, shallow networks for us-- in order to try to understand their implications for perception and physiology.
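Here's a toy sketch of that canonical stage-- linear filtering, rectification, local gain control, and pooling-- written so that stages can be stacked. Every parameter here (filter shapes, pooling size, the gain-control constant) is an arbitrary stand-in, not a value from any fitted model.

```python
import numpy as np
from scipy.signal import fftconvolve

def canonical_stage(x, filt, pool=4, eps=0.1):
    """One toy canonical stage: linear filter, half-wave rectification,
    divisive local gain control, then pooling (local averaging)."""
    linear = fftconvolve(x, filt, mode="same")        # pattern selectivity
    rect = np.maximum(linear, 0.0)                    # rectifying nonlinearity
    local_mean = fftconvolve(rect, np.ones((5, 5)) / 25, mode="same")
    gain = rect / (eps + local_mean)                  # local gain control
    h, w = gain.shape
    g = gain[: h - h % pool, : w - w % pool]          # crop to a multiple
    return g.reshape(h // pool, pool, w // pool, pool).mean(axis=(1, 3))

# Stack two stages, as in a shallow hierarchical model.
rng = np.random.default_rng(3)
img = rng.random((64, 64))
f1 = rng.standard_normal((7, 7))   # stand-ins for oriented filters
f2 = rng.standard_normal((3, 3))
out = canonical_stage(canonical_stage(img, f1), f2)
print(out.shape)  # (4, 4): local averages passed up as 'statistics'
```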
777 00:31:11,400 --> 00:31:15,060 This was just a description of a single stage, 778 00:31:15,060 --> 00:31:16,922 and then, of course, you have to stack them. 779 00:31:16,922 --> 00:31:19,380 And there are many people that have talked about that idea. 780 00:31:19,380 --> 00:31:24,300 This is a figure from Tommy's paper with Christof, I think-- 781 00:31:24,300 --> 00:31:25,620 1999. 782 00:31:25,620 --> 00:31:29,010 And Fukushima had proposed a basic architecture 783 00:31:29,010 --> 00:31:31,530 like this earlier. 784 00:31:31,530 --> 00:31:33,810 And so I think this has now become-- 785 00:31:33,810 --> 00:31:36,510 you barely even need to say it, because of the deep network 786 00:31:36,510 --> 00:31:38,820 literature. 787 00:31:38,820 --> 00:31:40,260 So how do we do this? 788 00:31:40,260 --> 00:31:41,610 Same thing I told you before. 789 00:31:41,610 --> 00:31:44,460 Take an image, plop down all these V2 receptive fields. 790 00:31:44,460 --> 00:31:46,620 By the way, I should have said this at the outset-- 791 00:31:46,620 --> 00:31:50,010 this is drawn as a cartoon. 792 00:31:50,010 --> 00:31:51,480 The actual receptive fields that we 793 00:31:51,480 --> 00:31:55,230 use are smooth and overlapping, so that there are no holes. 794 00:31:55,230 --> 00:31:57,630 And in fact, the details of that are 795 00:31:57,630 --> 00:31:59,130 that since we're computing averages, 796 00:31:59,130 --> 00:32:00,921 you can think of this as a low pass filter, 797 00:32:00,921 --> 00:32:03,330 and we try to at least approximately obey the Nyquist 798 00:32:03,330 --> 00:32:05,640 theorem, so that there's no aliasing-- 799 00:32:05,640 --> 00:32:07,530 that is, there's no evidence of the sampling 800 00:32:07,530 --> 00:32:13,170 lattice, for those of you that are thinking down those lines. 801 00:32:13,170 --> 00:32:15,019 If you were not thinking down those lines, 802 00:32:15,019 --> 00:32:16,560 I'll just say the simple thing, which 803 00:32:16,560 --> 00:32:19,260 is that they're not little disks that are non-overlapping, 804 00:32:19,260 --> 00:32:21,510 because then we would be screwing everything up in 805 00:32:21,510 --> 00:32:22,700 between them. 806 00:32:22,700 --> 00:32:24,270 They're smooth and overlapping so 807 00:32:24,270 --> 00:32:26,853 that we cover the whole image, and all the pixels in the image 808 00:32:26,853 --> 00:32:28,944 are going to be affected by this process. 809 00:32:28,944 --> 00:32:30,360 So we make all those measurements. 810 00:32:30,360 --> 00:32:32,070 It's a very large set of measurements. 811 00:32:32,070 --> 00:32:36,160 And now we start with white noise, and we push the button. 812 00:32:36,160 --> 00:32:38,910 And again, push simultaneously on the gradients 813 00:32:38,910 --> 00:32:41,370 from all those little regions until we achieve something 814 00:32:41,370 --> 00:32:43,200 that matches all the measurements in all 815 00:32:43,200 --> 00:32:44,864 of those receptive fields. 816 00:32:44,864 --> 00:32:46,530 The measurements in the receptive fields 817 00:32:46,530 --> 00:32:48,280 are averaged over different regions. 818 00:32:48,280 --> 00:32:53,910 So the ones that are in the far periphery 819 00:32:53,910 --> 00:32:57,360 are averaged over large regions, and so those averages 820 00:32:57,360 --> 00:32:59,730 are throwing away a lot more information. 821 00:32:59,730 --> 00:33:01,980 The ones that are averaged near the fovea 822 00:33:01,980 --> 00:33:04,020 are throwing away a small amount of information. 
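[As a toy version of the "push the button" step just described: start from white noise and descend, simultaneously, the gradients of the mismatch between the pooled statistics of the synthesis and of the target, using smooth overlapping windows so there are no holes. This sketch works on a 1-D signal and matches only local means and local energies; the real model matches the full set of joint texture statistics in each window, with windows that scale with eccentricity.]

```python
import numpy as np

def smooth_windows(n, n_win):
    """Smooth, overlapping raised-cosine windows with no holes and no
    visible sampling lattice, normalized to sum to one at every sample."""
    centers = (np.arange(n_win) + 0.5) * n / n_win
    width = 2.0 * n / n_win
    x = np.arange(n)
    arg = np.pi * (x[None, :] - centers[:, None]) / width
    w = np.cos(np.clip(arg, -np.pi / 2, np.pi / 2)) ** 2
    return w / w.sum(axis=0, keepdims=True)

def synthesize(target, n_win=8, steps=5000, lr=0.02, seed=0):
    """Start from white noise; push simultaneously on the gradients from
    all windows until local means and local energies match the target's."""
    rng = np.random.default_rng(seed)
    w = smooth_windows(len(target), n_win)
    t1, t2 = w @ target, w @ target ** 2      # target statistics
    img = rng.standard_normal(len(target))
    for _ in range(steps):
        d1 = w @ img - t1                     # mismatch in local means
        d2 = w @ img ** 2 - t2                # mismatch in local energies
        grad = w.T @ d1 + (w.T @ d2) * 2 * img
        img -= lr * grad / (np.abs(grad).max() + 1e-12)
    return img
```

[The normalized gradient step at the end is a crude step-size control for the sketch, not the projection machinery used in the actual synthesis.]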
823 00:33:04,020 --> 00:33:05,370 When you get close enough to the fovea, 824 00:33:05,370 --> 00:33:06,840 they're throwing away nothing. 825 00:33:06,840 --> 00:33:09,284 So the original image is preserved in the center, 826 00:33:09,284 --> 00:33:10,950 and then it gets more and more distorted 827 00:33:10,950 --> 00:33:12,720 as you go away from the fovea. 828 00:33:12,720 --> 00:33:16,980 So the question is, does that work for a human? 829 00:33:16,980 --> 00:33:18,240 Is it metameric? 830 00:33:18,240 --> 00:33:19,864 The display here is not very good, 831 00:33:19,864 --> 00:33:21,780 but I'll try to give you a demonstration of it 832 00:33:21,780 --> 00:33:23,310 to convince you that it does work. 833 00:33:23,310 --> 00:33:25,020 You have to keep your eyes planted here, 834 00:33:25,020 --> 00:33:26,520 and I'm going to flip back and forth 835 00:33:26,520 --> 00:33:28,230 between this original picture, which 836 00:33:28,230 --> 00:33:32,250 was taken in Washington Square Park, near the department. 837 00:33:32,250 --> 00:33:35,400 And I'm going to flip between this and a synthesized version. 838 00:33:35,400 --> 00:33:38,540 You have to keep your eyes here, at least for a bunch of flips. 839 00:33:38,540 --> 00:33:40,710 Hello. 840 00:33:40,710 --> 00:33:42,100 Here we go. 841 00:33:42,100 --> 00:33:43,627 Keep your eyes fixated. 842 00:33:43,627 --> 00:33:45,210 Those two images should look the same. 843 00:33:45,210 --> 00:33:48,600 It's going back and forth, A, B, A, B, and they 844 00:33:48,600 --> 00:33:49,570 should look the same. 845 00:33:49,570 --> 00:33:51,820 I think for most of you, and for most of these viewing 846 00:33:51,820 --> 00:33:53,029 distances, it should work. 847 00:33:53,029 --> 00:33:54,570 And now if you look over here, you'll 848 00:33:54,570 --> 00:33:57,340 see that they actually are not the same. 849 00:33:57,340 --> 00:34:00,020 That's about the size of a V2 receptive field, 850 00:34:00,020 --> 00:34:01,980 and it is the same two images. 851 00:34:01,980 --> 00:34:04,690 I'm not cheating here, in case anybody's worried. 852 00:34:04,690 --> 00:34:07,510 I'm just flipping back and forth between the same two images. 853 00:34:07,510 --> 00:34:10,739 And you can see that the original image has 854 00:34:10,739 --> 00:34:12,480 a couple of faces in that circle, 855 00:34:12,480 --> 00:34:16,110 but the synthesized one, they're all distorted, 856 00:34:16,110 --> 00:34:20,130 the same way Feynman was when I showed you his photograph. 857 00:34:20,130 --> 00:34:24,570 But again, the point here is that these two are not metamers 858 00:34:24,570 --> 00:34:26,969 when you look right at this peripheral region, 859 00:34:26,969 --> 00:34:29,531 but when you keep your eyes fixated here, 860 00:34:29,531 --> 00:34:30,989 they're pretty hard to distinguish. 861 00:34:30,989 --> 00:34:33,330 This is right at about the threshold for the subjects 862 00:34:33,330 --> 00:34:37,320 that we ran in this experiment, so it should be 863 00:34:37,320 --> 00:34:39,690 basically imperceptible to you. 864 00:34:39,690 --> 00:34:41,580 That was a demo, just to convince you 865 00:34:41,580 --> 00:34:43,480 that it seems to work. 866 00:34:43,480 --> 00:34:45,360 We did an experiment, because we wanted 867 00:34:45,360 --> 00:34:48,594 to do more than just show that it sort of works. 
868 00:34:48,594 --> 00:34:51,300 We wanted to figure out whether we could actually tie it 869 00:34:51,300 --> 00:34:53,650 to the physiology in a more direct way, 870 00:34:53,650 --> 00:34:57,000 so what we did is we generated stimuli 871 00:34:57,000 --> 00:35:01,160 where we used different receptive field size scaling. 872 00:35:01,160 --> 00:35:01,960 So this is a plot. 873 00:35:01,960 --> 00:35:03,820 Along this axis is going to be-- 874 00:35:03,820 --> 00:35:05,710 just to get you situated, along this axis 875 00:35:05,710 --> 00:35:08,590 is going to be models that are used 876 00:35:08,590 --> 00:35:11,371 to generate stimuli with different receptive field size 877 00:35:11,371 --> 00:35:11,870 scaling. 878 00:35:11,870 --> 00:35:14,295 That's the ratio of diameter to eccentricity-- diameter 879 00:35:14,295 --> 00:35:16,420 of the receptive field to the eccentricity distance 880 00:35:16,420 --> 00:35:17,560 from the fovea. 881 00:35:17,560 --> 00:35:20,200 And along here is going to be the percent correct 882 00:35:20,200 --> 00:35:26,020 that a human is able to achieve-- 883 00:35:26,020 --> 00:35:28,809 the way we did this, it's called an ABX experiment. 884 00:35:28,809 --> 00:35:30,850 So we show one image, then we show another image, 885 00:35:30,850 --> 00:35:31,974 then we show a third image. 886 00:35:31,974 --> 00:35:36,760 And we say, which of the first two images does the third one look like? 887 00:35:36,760 --> 00:35:40,060 So we're going to plot percent correct here. 888 00:35:40,060 --> 00:35:44,790 And if we use a model with very small receptive fields, 889 00:35:44,790 --> 00:35:46,540 then we get syntheses that look like this. 890 00:35:46,540 --> 00:35:48,030 This one has very little distortion. 891 00:35:48,030 --> 00:35:49,330 There's a little bit of distortion 892 00:35:49,330 --> 00:35:51,580 near the edges, but it's pretty close to the original. 893 00:35:51,580 --> 00:35:53,200 If we use really big receptive fields, 894 00:35:53,200 --> 00:35:54,550 then we get a lot of distortion. 895 00:35:54,550 --> 00:35:56,590 Things really start to fall apart. 896 00:35:56,590 --> 00:35:58,550 And somewhere in between-- 897 00:35:58,550 --> 00:36:00,250 so far to the right on this plot, 898 00:36:00,250 --> 00:36:05,530 we expect people to be at 100% noticing the distortions, 899 00:36:05,530 --> 00:36:06,940 and far to the left on this plot, 900 00:36:06,940 --> 00:36:09,632 we expect them to be at chance. 901 00:36:09,632 --> 00:36:11,840 We expect them to not be able to tell the difference. 902 00:36:11,840 --> 00:36:13,173 And that's exactly what happens. 903 00:36:13,173 --> 00:36:15,640 This is an average over four observers. 904 00:36:15,640 --> 00:36:17,830 And you can see that the performance, the percent 905 00:36:17,830 --> 00:36:20,380 correct starts at around 50%, and then 906 00:36:20,380 --> 00:36:23,420 climbs up and asymptotes. 907 00:36:23,420 --> 00:36:25,360 So what's more, we can now do something-- 908 00:36:25,360 --> 00:36:27,820 this is a little bit complicated to get your head around. 909 00:36:27,820 --> 00:36:31,390 We're using this model to generate the stimuli, 910 00:36:31,390 --> 00:36:34,120 and this is the model parameter plotted along this axis. 911 00:36:34,120 --> 00:36:36,610 Now we're going to use the model again, 912 00:36:36,610 --> 00:36:38,110 but now we're going to use the model 913 00:36:38,110 --> 00:36:39,820 as a model for the observer. 914 00:36:39,820 --> 00:36:41,320 So there are two models here.
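[A quick aside to make the x-axis concrete before the two models get put to work: for a fixed scaling factor, pooling regions tile the visual field in rings whose diameters grow linearly with distance from the fovea, so halving the scaling roughly doubles the number of rings. The eccentricity range and the two example values below are illustrative, not numbers from the study.]

```python
import numpy as np

def pooling_rings(scaling, e_min=0.5, e_max=20.0):
    """Pooling-region (eccentricity, diameter) pairs for one scaling
    factor, defined as diameter / eccentricity. Diameters grow linearly
    with eccentricity, so each ring starts where the last one ends."""
    rings, e = [], e_min
    while e < e_max:
        d = scaling * e          # diameter proportional to eccentricity
        rings.append((e, d))
        e += d                   # the next ring begins past this one
    return rings

# Halving the scaling roughly doubles how many rings tile the same range.
for s in (0.25, 0.5):
    print(f"scaling {s}: {len(pooling_rings(s))} rings")
```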
915 00:36:41,320 --> 00:36:42,670 One is generating the stimuli. 916 00:36:42,670 --> 00:36:45,100 The other one, we're going to try to fit-- 917 00:36:45,100 --> 00:36:47,920 we're going to ask, if we used a second copy of the model 918 00:36:47,920 --> 00:36:49,930 to actually look at these images and tell 919 00:36:49,930 --> 00:36:51,670 the difference between them, what 920 00:36:51,670 --> 00:36:55,180 would its receptive fields have to be in order 921 00:36:55,180 --> 00:36:57,780 to match the human data? 922 00:36:57,780 --> 00:37:00,040 And I'm not going to drag you through the details, 923 00:37:00,040 --> 00:37:03,370 but the basic idea is that allows us to produce 924 00:37:03,370 --> 00:37:06,220 a prediction-- this black line-- 925 00:37:06,220 --> 00:37:08,170 for how this model would behave if it 926 00:37:08,170 --> 00:37:10,330 were acting as an observer. 927 00:37:10,330 --> 00:37:13,630 And by adjusting the parameter of the observer model, 928 00:37:13,630 --> 00:37:17,120 we can estimate the size of the human receptive fields. 929 00:37:17,120 --> 00:37:18,726 So the end result of all of this is 930 00:37:18,726 --> 00:37:20,350 we're going to fit a curve to the data, 931 00:37:20,350 --> 00:37:22,060 and it's going to give us an estimate 932 00:37:22,060 --> 00:37:23,650 of the size of the receptive fields 933 00:37:23,650 --> 00:37:26,350 that the human is using to do this task. 934 00:37:26,350 --> 00:37:28,150 And that is right here. 935 00:37:28,150 --> 00:37:29,800 In fact, it's right at the place where 936 00:37:29,800 --> 00:37:32,644 the curve hits the 50% line. 937 00:37:32,644 --> 00:37:35,060 That's the point where the human can't tell the difference 938 00:37:35,060 --> 00:37:37,120 anymore, and that's the point where 939 00:37:37,120 --> 00:37:40,610 we think an observer would be-- 940 00:37:40,610 --> 00:37:42,970 where the receptive fields of the stimulus 941 00:37:42,970 --> 00:37:44,770 would be the same size as the receptive 942 00:37:44,770 --> 00:37:45,970 fields of the observer. 943 00:37:45,970 --> 00:37:47,449 So that's what we're looking for. 944 00:37:47,449 --> 00:37:49,240 And when we do that for our four observers, 945 00:37:49,240 --> 00:37:50,860 they come out very consistent. 946 00:37:50,860 --> 00:37:54,490 So here's a plot of the estimated receptive field 947 00:37:54,490 --> 00:37:56,680 sizes of these observers. 948 00:37:56,680 --> 00:37:57,430 All four of them-- 949 00:37:57,430 --> 00:38:00,040 1, 2, 3, 4, and the average over the four. 950 00:38:00,040 --> 00:38:03,070 And nicely enough-- remember, I told you that we know something 951 00:38:03,070 --> 00:38:05,230 about the receptive field sizes in-- 952 00:38:05,230 --> 00:38:06,790 these are macaque monkey. 953 00:38:06,790 --> 00:38:08,770 And if we plot those on the same plot, 954 00:38:08,770 --> 00:38:11,480 these color bands are the size of the receptive fields 955 00:38:11,480 --> 00:38:15,456 in a macaque, now combined over this large set of data 956 00:38:15,456 --> 00:38:17,080 from a whole bunch of different papers. 957 00:38:17,080 --> 00:38:19,319 Jeremy went through incredibly painstaking work 958 00:38:19,319 --> 00:38:21,610 to try to put these all into the same coordinate system 959 00:38:21,610 --> 00:38:23,620 and unify the data sets. 
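[The readout just described -- fit a curve through the ABX data and find where it first leaves chance -- can be sketched as follows. The parametric form (at chance below a critical scaling, rising toward an asymptote above it) and the data points are hypothetical stand-ins, not the study's observer model or measurements.]

```python
import numpy as np
from scipy.optimize import curve_fit

def pc_model(s, s_crit, pc_max):
    """Proportion correct vs. synthesis scaling: at chance (0.5) while the
    synthesis pooling regions are smaller than the observer's own
    (s <= s_crit), rising toward an asymptote above that. A convenient
    parametric form, assumed here for illustration."""
    s = np.asarray(s, dtype=float)
    lift = np.clip(1.0 - (s_crit / s) ** 2, 0.0, None)
    return 0.5 + (pc_max - 0.5) * lift

# Hypothetical ABX data: scaling of the synthesis model vs. % correct.
scalings = np.array([0.25, 0.35, 0.5, 0.7, 1.0, 1.4])
pcorrect = np.array([0.51, 0.49, 0.55, 0.71, 0.82, 0.87])

(s_crit, pc_max), _ = curve_fit(pc_model, scalings, pcorrect,
                                p0=[0.4, 0.9], bounds=([0.01, 0.5], [2.0, 1.0]))
print(f"estimated critical scaling (observer RF size): {s_crit:.2f}")
```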
960 00:38:23,620 --> 00:38:26,770 And so the height of each of these bars tells you-- 961 00:38:26,770 --> 00:38:30,010 they're error bars on how much variability there is, where 962 00:38:30,010 --> 00:38:31,210 we think the estimates are. 963 00:38:31,210 --> 00:38:33,168 And you can see that the answers for the humans 964 00:38:33,168 --> 00:38:35,980 are coming right down on top of V2. 965 00:38:35,980 --> 00:38:39,160 So we really do think that the information that 966 00:38:39,160 --> 00:38:42,550 is being lost in these stimuli is being lost in V2, 967 00:38:42,550 --> 00:38:45,190 and it seems to match the receptive field sizes at least 968 00:38:45,190 --> 00:38:47,350 of macaque monkey. 969 00:38:47,350 --> 00:38:50,980 We were worried that this might depend a lot on the details 970 00:38:50,980 --> 00:38:52,540 of the experiment. 971 00:38:52,540 --> 00:38:54,999 So for example, we thought, well, 972 00:38:54,999 --> 00:38:57,040 what if we give people a little more information? 973 00:38:57,040 --> 00:38:59,800 For example, what if we let them look at the stimulus longer? 974 00:38:59,800 --> 00:39:02,410 So the original experiment was pretty brief-- 975 00:39:02,410 --> 00:39:03,400 200 milliseconds. 976 00:39:03,400 --> 00:39:05,630 What if we give them 400 milliseconds? 977 00:39:05,630 --> 00:39:09,430 And so up here are plots for the same four subjects. 978 00:39:09,430 --> 00:39:13,024 The original task is in the dark gray, 979 00:39:13,024 --> 00:39:15,190 and you can see the curves for each of the subjects. 980 00:39:15,190 --> 00:39:18,010 When we give them more time, what you notice 981 00:39:18,010 --> 00:39:21,080 is that, in general, they do better. 982 00:39:21,080 --> 00:39:22,900 So generally, the light gray curves-- 983 00:39:22,900 --> 00:39:25,410 1, 2, 3-- are above the dark gray curves. 984 00:39:25,410 --> 00:39:27,490 They get higher percent correct. 985 00:39:27,490 --> 00:39:30,730 But the important thing is that each of these curves 986 00:39:30,730 --> 00:39:34,400 dives down and hits the 50% point at the same place. 987 00:39:34,400 --> 00:39:37,060 In other words, what we interpret this to mean 988 00:39:37,060 --> 00:39:41,230 is that the estimate of the receptive field sizes 989 00:39:41,230 --> 00:39:45,070 is an architectural constraint, and we 990 00:39:45,070 --> 00:39:48,070 can estimate the same architectural constraint 991 00:39:48,070 --> 00:39:50,920 under both of these conditions, even though performance 992 00:39:50,920 --> 00:39:53,560 is noticeably different, at least for these three subjects. 993 00:39:53,560 --> 00:39:56,350 This one, it's really quite a big, big improvement. 994 00:39:56,350 --> 00:39:58,120 This subject is doing much, much better 995 00:39:58,120 --> 00:40:00,260 on the task when we give them more time. 996 00:40:00,260 --> 00:40:03,290 And yet, this estimate of receptive field sizes 997 00:40:03,290 --> 00:40:05,060 is pretty stable, so we thought this 998 00:40:05,060 --> 00:40:07,220 was a pretty important control. 999 00:40:07,220 --> 00:40:09,180 And down below is another control. 1000 00:40:09,180 --> 00:40:10,400 That was a bottom-up control. 1001 00:40:10,400 --> 00:40:11,770 This is a top-down control. 
1002 00:40:11,770 --> 00:40:14,270 People have talked about attention 1003 00:40:14,270 --> 00:40:16,580 being very important in peripheral tasks, 1004 00:40:16,580 --> 00:40:19,300 so we now gave the subjects an attentional cue-- 1005 00:40:19,300 --> 00:40:22,790 a little arrow at the center of the display that 1006 00:40:22,790 --> 00:40:25,190 pointed toward the region of the periphery 1007 00:40:25,190 --> 00:40:28,350 where the distortion was largest in a mean-squared error sense. 1008 00:40:28,350 --> 00:40:31,190 So we measure little chunks of the peripheral image 1009 00:40:31,190 --> 00:40:34,070 and look for the place where there's the biggest difference, 1010 00:40:34,070 --> 00:40:36,380 and we tell them to pay attention 1011 00:40:36,380 --> 00:40:37,670 to that part of the stimulus. 1012 00:40:37,670 --> 00:40:38,720 They're not allowed to move their eyes. 1013 00:40:38,720 --> 00:40:40,530 We have an eye tracker on them the whole time, 1014 00:40:40,530 --> 00:40:41,960 so they're not allowed to look at it. 1015 00:40:41,960 --> 00:40:44,270 But we're telling them, try to pay attention to what's, 1016 00:40:44,270 --> 00:40:46,500 let's say, in the upper left. 1017 00:40:46,500 --> 00:40:48,500 And again, the result is quite similar. 1018 00:40:48,500 --> 00:40:51,179 Their performance improves noticeably, at least 1019 00:40:51,179 --> 00:40:52,220 for these three subjects. 1020 00:40:52,220 --> 00:40:55,070 This one, again, is the most dramatic performance 1021 00:40:55,070 --> 00:40:55,610 improvement. 1022 00:40:55,610 --> 00:40:56,690 Nobody gets worse. 1023 00:40:56,690 --> 00:40:59,600 This subject basically stayed about the same. 1024 00:40:59,600 --> 00:41:01,820 But again, the estimates of receptive field size 1025 00:41:01,820 --> 00:41:02,790 are quite stable. 1026 00:41:02,790 --> 00:41:05,960 So our interpretation is attention 1027 00:41:05,960 --> 00:41:09,920 is boosting the signal, if there is a signal, that 1028 00:41:09,920 --> 00:41:11,670 allows them to do the task. 1029 00:41:11,670 --> 00:41:14,720 But if they're at chance and there's no signal, 1030 00:41:14,720 --> 00:41:18,380 attention does nothing, which is why that when you get to 50%, 1031 00:41:18,380 --> 00:41:20,870 all these points coalesce. 1032 00:41:20,870 --> 00:41:24,260 All the curves are hitting 50% at the same place. 1033 00:41:24,260 --> 00:41:26,750 One last control-- we wanted to convince ourselves 1034 00:41:26,750 --> 00:41:28,970 that really it was V2, and it wasn't just luck 1035 00:41:28,970 --> 00:41:32,060 that we happened to get that receptive field size that 1036 00:41:32,060 --> 00:41:33,630 matched the macaque data. 1037 00:41:33,630 --> 00:41:35,540 So we did a control experiment where we tried 1038 00:41:35,540 --> 00:41:37,610 to get the same result for V1. 1039 00:41:37,610 --> 00:41:41,570 So this time, we just measure local oriented receptive fields 1040 00:41:41,570 --> 00:41:45,860 like Hubel and Wiesel described, and we average them 1041 00:41:45,860 --> 00:41:48,950 as in a complex cell over different sized regions. 1042 00:41:48,950 --> 00:41:50,630 And we generate stimuli that are just 1043 00:41:50,630 --> 00:41:54,067 matched for the average responses of the V1 cells. 1044 00:41:54,067 --> 00:41:56,150 We don't do all the statistics on top of that that 1045 00:41:56,150 --> 00:41:57,920 represents the V2 calculation. 1046 00:41:57,920 --> 00:42:01,310 We're just doing average V1 responses. 
1047 00:42:01,310 --> 00:42:02,450 When we do that-- 1048 00:42:02,450 --> 00:42:04,760 we generate the stimuli, we do the same experiment, 1049 00:42:04,760 --> 00:42:07,280 we get a very different result in light gray here. 1050 00:42:07,280 --> 00:42:09,202 So you can see that these curves are always 1051 00:42:09,202 --> 00:42:10,910 higher than the other ones, but they also 1052 00:42:10,910 --> 00:42:15,350 hit the axis at a much, much smaller value, 1053 00:42:15,350 --> 00:42:18,590 usually by about a factor of two, which is just right, 1054 00:42:18,590 --> 00:42:21,100 given what I told you before about receptive field sizes. 1055 00:42:21,100 --> 00:42:24,020 So if we go back and we combine all the data on one plot-- 1056 00:42:24,020 --> 00:42:25,910 down here are the V1 controls. 1057 00:42:25,910 --> 00:42:28,160 They're about the right size for V1. 1058 00:42:28,160 --> 00:42:31,759 And up here is the original experiment and the two controls 1059 00:42:31,759 --> 00:42:33,800 that I told you about-- the extended presentation 1060 00:42:33,800 --> 00:42:36,650 and the directed attention, and those are all pretty much 1061 00:42:36,650 --> 00:42:38,510 lying in the range of V2. 1062 00:42:38,510 --> 00:42:41,090 We think this has a pretty strong implication for reading 1063 00:42:41,090 --> 00:42:42,140 speed. 1064 00:42:42,140 --> 00:42:44,180 When you read, your eyes hop across the page. 1065 00:42:44,180 --> 00:42:45,890 You do not scan continuously. 1066 00:42:45,890 --> 00:42:46,970 You hop. 1067 00:42:46,970 --> 00:42:48,500 And when you hop, here's an example 1068 00:42:48,500 --> 00:42:50,458 of the kind of hops you do when you're reading. 1069 00:42:50,458 --> 00:42:53,300 There's an eye position, and the typical hop distance 1070 00:42:53,300 --> 00:42:54,710 would be about that-- 1071 00:42:54,710 --> 00:42:55,730 from here to there. 1072 00:42:55,730 --> 00:42:57,320 This is the same piece of text. 1073 00:42:57,320 --> 00:43:00,980 We've synthesized it as a metamer using this model, just 1074 00:43:00,980 --> 00:43:03,770 to illustrate the idea that the chunk of stuff 1075 00:43:03,770 --> 00:43:06,290 that you can read around that fixation point, 1076 00:43:06,290 --> 00:43:07,520 it's about right. 1077 00:43:07,520 --> 00:43:11,630 It matches what you would expect for the kind 1078 00:43:11,630 --> 00:43:12,980 of hopping that you could do. 1079 00:43:12,980 --> 00:43:17,450 Your reading speed is limited by the distance of those hops, 1080 00:43:17,450 --> 00:43:19,220 and the distance of those hops is limited 1081 00:43:19,220 --> 00:43:22,700 by this loss of information. 1082 00:43:22,700 --> 00:43:26,192 So you can't read anything beyond maybe this I and this N. 1083 00:43:26,192 --> 00:43:28,400 And in order to read it, you hop your eyes over here, 1084 00:43:28,400 --> 00:43:29,960 and now you get most of this word. 1085 00:43:29,960 --> 00:43:32,420 You can make out the rest of an "involuntarily." 1086 00:43:32,420 --> 00:43:35,060 So there's an interesting implication here, 1087 00:43:35,060 --> 00:43:37,190 which is that you can potentially 1088 00:43:37,190 --> 00:43:43,160 increase reading speed by using this model to optimize 1089 00:43:43,160 --> 00:43:44,726 the presentation of text. 
1090 00:43:44,726 --> 00:43:46,850 And now that we can do these things electronically, 1091 00:43:46,850 --> 00:43:48,590 you can imagine all kinds of devices 1092 00:43:48,590 --> 00:43:51,754 where the word spacing and the line spacing 1093 00:43:51,754 --> 00:43:53,420 and the letter sizes and everything else 1094 00:43:53,420 --> 00:43:57,244 could change with time and position on the display. 1095 00:43:57,244 --> 00:43:58,910 So you don't have to just put things out 1096 00:43:58,910 --> 00:44:01,010 as static arrays of characters. 1097 00:44:01,010 --> 00:44:04,321 You could now imagine jumping things around and rescaling 1098 00:44:04,321 --> 00:44:04,820 things. 1099 00:44:04,820 --> 00:44:07,700 You could imagine designing new fonts that 1100 00:44:07,700 --> 00:44:11,640 caused less distortion or loss of information, et cetera. 1101 00:44:11,640 --> 00:44:14,510 So this is just going back to the trichromacy story 1102 00:44:14,510 --> 00:44:15,470 that I told you. 1103 00:44:15,470 --> 00:44:17,884 I told you that once they figured out the theory, 1104 00:44:17,884 --> 00:44:19,550 and they had all the psychophysics down, 1105 00:44:19,550 --> 00:44:21,950 the next thing that happened is all that engineering. 1106 00:44:21,950 --> 00:44:23,616 They came up with engineering standards, 1107 00:44:23,616 --> 00:44:27,690 and they used it to design devices and specify protocols 1108 00:44:27,690 --> 00:44:30,110 for transmitting images, for communicating them, 1109 00:44:30,110 --> 00:44:31,520 for rendering them. 1110 00:44:31,520 --> 00:44:34,340 I think that this has that kind of potential. 1111 00:44:34,340 --> 00:44:36,620 And this theory is too crude right now, 1112 00:44:36,620 --> 00:44:39,800 but if you had a really solid theory for what information 1113 00:44:39,800 --> 00:44:43,190 survived in the periphery, you can really 1114 00:44:43,190 --> 00:44:48,050 start to push hard on designing devices and designing 1115 00:44:48,050 --> 00:44:52,659 specifications for devices for improved whatever. 1116 00:44:52,659 --> 00:44:54,200 Sometimes you want to improve things. 1117 00:44:54,200 --> 00:44:55,866 Sometimes you want to make things harder 1118 00:44:55,866 --> 00:44:57,920 to see, like in this example. 1119 00:44:57,920 --> 00:44:59,257 So you want to build camouflage. 1120 00:44:59,257 --> 00:45:01,840 You go in, you take a bunch of photographs of the environment, 1121 00:45:01,840 --> 00:45:03,730 and then you say, let's design a camouflage 1122 00:45:03,730 --> 00:45:08,200 that best hides itself when it's not seen directly 1123 00:45:08,200 --> 00:45:09,260 within this environment. 1124 00:45:09,260 --> 00:45:13,390 So you could use these kinds of loss of information 1125 00:45:13,390 --> 00:45:15,820 to exploit things or to aid things 1126 00:45:15,820 --> 00:45:19,450 in terms of human perception. 1127 00:45:19,450 --> 00:45:21,460 So let me say just a few things about V2, 1128 00:45:21,460 --> 00:45:24,470 and then maybe I should stop. 1129 00:45:24,470 --> 00:45:28,270 So this work that Jeremy and I did 1130 00:45:28,270 --> 00:45:30,220 in building this model for metamers, which 1131 00:45:30,220 --> 00:45:32,350 is a global version of the texture model 1132 00:45:32,350 --> 00:45:35,260 that operates in local regions, led 1133 00:45:35,260 --> 00:45:39,040 us to start asking questions about what 1134 00:45:39,040 --> 00:45:42,160 we could learn by actually measuring cells in V2. 
1135 00:45:42,160 --> 00:45:44,260 And we joined forces with Tony Movshon, 1136 00:45:44,260 --> 00:45:46,240 who is the chair of my department 1137 00:45:46,240 --> 00:45:48,400 and a longtime collaborator and friend. 1138 00:45:48,400 --> 00:45:50,830 And we started a series of experiments 1139 00:45:50,830 --> 00:45:55,894 to try to explore presentations of texture to V2 neurons 1140 00:45:55,894 --> 00:45:57,310 to try to understand what we could 1141 00:45:57,310 --> 00:46:01,030 learn about the actual representations of V2. 1142 00:46:01,030 --> 00:46:03,840 And these are all done in macaque monkey. 1143 00:46:03,840 --> 00:46:08,565 And I should also mention that V2 is-- 1144 00:46:12,400 --> 00:46:14,500 it's been studied for a long time. 1145 00:46:14,500 --> 00:46:17,740 Hubel and Wiesel wrote a very important paper about V2 1146 00:46:17,740 --> 00:46:21,400 in 1965, which was quite beautiful, documenting 1147 00:46:21,400 --> 00:46:23,080 the properties that they could find. 1148 00:46:23,080 --> 00:46:25,180 But the thing that's interesting about this 1149 00:46:25,180 --> 00:46:30,460 is that V1 didn't really crack until Hubel and Wiesel figured 1150 00:46:30,460 --> 00:46:32,020 out what the magic ingredient was. 1151 00:46:32,020 --> 00:46:34,210 And the magic ingredient was orientation. 1152 00:46:34,210 --> 00:46:36,070 Before Hubel and Wiesel, people have 1153 00:46:36,070 --> 00:46:38,200 been poking at primary visual cortex, 1154 00:46:38,200 --> 00:46:40,725 showing little spots of light and little annuli-- 1155 00:46:40,725 --> 00:46:42,100 all the things that worked really 1156 00:46:42,100 --> 00:46:44,830 well in the retina and the LGN, and they were not getting 1157 00:46:44,830 --> 00:46:45,910 very interesting results. 1158 00:46:45,910 --> 00:46:48,340 They were saying, well, the receptive fields are bigger 1159 00:46:48,340 --> 00:46:51,670 and there are hot spots, positive and negative regions, 1160 00:46:51,670 --> 00:46:56,650 but the cells are not responding that well. 1161 00:46:56,650 --> 00:46:58,360 And when Hubel and Wiesel figured out 1162 00:46:58,360 --> 00:47:00,280 that orientation was the magic ingredient-- 1163 00:47:00,280 --> 00:47:03,160 and the apocryphal story is that they did that late at night, 1164 00:47:03,160 --> 00:47:05,076 and they figured it out when they were putting 1165 00:47:05,076 --> 00:47:07,180 a slide into the projector, and they had forgotten 1166 00:47:07,180 --> 00:47:08,860 to cover the cat's eyes. 1167 00:47:08,860 --> 00:47:12,940 And they put the slide into the projector, 1168 00:47:12,940 --> 00:47:15,572 and the line at the edge of the slide went past on the screen-- 1169 00:47:15,572 --> 00:47:16,560 TOMMY: it was broken. 1170 00:47:16,560 --> 00:47:17,851 EERO SIMONCELLI: It was broken. 1171 00:47:17,851 --> 00:47:21,290 Ah, I always thought it was the edge of the slide. 1172 00:47:21,290 --> 00:47:23,780 I've fibbed, and Tommy has corrected me 1173 00:47:23,780 --> 00:47:25,990 that it was something broken in the slide. 1174 00:47:25,990 --> 00:47:29,770 But in any case, the point is that a boundary went by, 1175 00:47:29,770 --> 00:47:31,600 and they heard-- 1176 00:47:31,600 --> 00:47:33,940 so they played the spikes through a loudspeaker. 1177 00:47:33,940 --> 00:47:36,550 This is what most physiologists did in those days, 1178 00:47:36,550 --> 00:47:38,422 and even still a lot do. 
1179 00:47:38,422 --> 00:47:40,630 Certainly, in Tony's lab you can always walk in there 1180 00:47:40,630 --> 00:47:42,630 and hear the spikes coming over the loudspeaker. 1181 00:47:42,630 --> 00:47:44,710 Anyway, they heard this huge barrage of spikes, 1182 00:47:44,710 --> 00:47:46,690 more than they had ever heard from any cell 1183 00:47:46,690 --> 00:47:48,580 that they had recorded from, and that 1184 00:47:48,580 --> 00:47:53,500 was the beginning of a whole sequence of just fabulous work. 1185 00:47:53,500 --> 00:47:55,390 And that tool-- 1186 00:47:55,390 --> 00:47:58,280 very simple and very obvious in retrospect-- 1187 00:47:58,280 --> 00:48:01,230 was absolutely critical for the progress. 1188 00:48:01,230 --> 00:48:05,860 The point is that the stimuli matter, and making the jump 1189 00:48:05,860 --> 00:48:08,660 to the right stimuli changes everything. 1190 00:48:08,660 --> 00:48:11,800 So V2 for the last 40 years has been 1191 00:48:11,800 --> 00:48:14,740 sitting in this difficult state where people 1192 00:48:14,740 --> 00:48:16,420 keep throwing stimuli at it. 1193 00:48:16,420 --> 00:48:17,410 They try angles. 1194 00:48:17,410 --> 00:48:18,340 They try curves. 1195 00:48:18,340 --> 00:48:19,600 They try swirly things. 1196 00:48:19,600 --> 00:48:20,830 They try corners. 1197 00:48:20,830 --> 00:48:25,840 They try contours of various kinds, illusory contours. 1198 00:48:25,840 --> 00:48:29,830 And throughout all of this, the end story 1199 00:48:29,830 --> 00:48:35,200 is V2 cells have bigger receptive fields, many of them 1200 00:48:35,200 --> 00:48:38,230 respond to orientation, some of them 1201 00:48:38,230 --> 00:48:40,690 respond to particular combinations of orientation, 1202 00:48:40,690 --> 00:48:44,969 but it's usually a small subset, and the responses are weak. 1203 00:48:44,969 --> 00:48:46,510 And that's really what the literature 1204 00:48:46,510 --> 00:48:49,160 has looked like for 40 years. 1205 00:48:49,160 --> 00:48:53,560 So what we were after is, can we drive these cells convincingly, 1206 00:48:53,560 --> 00:48:55,540 and in a way that we can document 1207 00:48:55,540 --> 00:48:58,840 as significantly different from what we see in V1? 1208 00:48:58,840 --> 00:49:00,250 That was the goal-- 1209 00:49:00,250 --> 00:49:04,870 find a way to drive most of the cells 1210 00:49:04,870 --> 00:49:07,600 and to drive them differently than what 1211 00:49:07,600 --> 00:49:10,000 one would expect in V1. 1212 00:49:10,000 --> 00:49:12,940 As a starting point, we succeeded with textures. 1213 00:49:12,940 --> 00:49:15,290 So basically, we took a bunch of textures. 1214 00:49:15,290 --> 00:49:17,710 Here are some example textures drawn from the model. 1215 00:49:17,710 --> 00:49:20,101 Down below are spectrally-matched equivalents. 1216 00:49:20,101 --> 00:49:22,600 So these things have the same power spectra, the same amount 1217 00:49:22,600 --> 00:49:24,850 of energy in different orientation and frequency bands, 1218 00:49:24,850 --> 00:49:28,330 but they lack all the higher-order statistics that 1219 00:49:28,330 --> 00:49:31,390 are coming in this texture model that give you nice, 1220 00:49:31,390 --> 00:49:35,480 clean edges and contours and object-y things, 1221 00:49:35,480 --> 00:49:37,150 or lumps of objects. 1222 00:49:37,150 --> 00:49:39,370 And sure enough-- so here are some example cells. 1223 00:49:39,370 --> 00:49:40,460 Here are three V1 cells. 1224 00:49:40,460 --> 00:49:42,000 Here are three V2 cells.
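[Before the responses: the spectrally-matched noise stimuli can be made by keeping each texture image's Fourier amplitude spectrum and randomizing its phases. Phase scrambling is the standard recipe for this, though the talk does not spell out the exact procedure used.]

```python
import numpy as np

def spectrally_matched_noise(image, seed=0):
    """Noise with the same power spectrum as `image` -- the same energy in
    every orientation and frequency band -- but randomized Fourier phases,
    which destroys the higher-order statistics behind clean edges,
    contours, and object-y lumps."""
    rng = np.random.default_rng(seed)
    amplitude = np.abs(np.fft.fft2(image))
    # Phases of a real-valued noise image have the conjugate symmetry
    # needed for the result to come back real.
    phase = np.angle(np.fft.fft2(rng.standard_normal(image.shape)))
    return np.real(np.fft.ifft2(amplitude * np.exp(1j * phase)))
```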
1225 00:49:42,000 --> 00:49:44,060 And in each of these plots, there's two curves. 1226 00:49:44,060 --> 00:49:45,460 These are shown over time. 1227 00:49:45,460 --> 00:49:48,280 The stimulus is presented here for 100 milliseconds. 1228 00:49:48,280 --> 00:49:50,200 You see a little bump in the response. 1229 00:49:50,200 --> 00:49:52,300 And there's a light curve and a dark curve. 1230 00:49:52,300 --> 00:49:55,630 The light curve is the response to the spectrally-matched 1231 00:49:55,630 --> 00:49:58,930 noise, and the dark curve is the response to the texture, 1232 00:49:58,930 --> 00:50:00,830 with the higher-order statistics. 1233 00:50:00,830 --> 00:50:04,200 V1 doesn't seem to care is the short answer here, 1234 00:50:04,200 --> 00:50:07,860 and V2 cares quite significantly. 1235 00:50:07,860 --> 00:50:10,230 So when you put those higher-order statistics in, 1236 00:50:10,230 --> 00:50:13,740 almost all V2 cells respond significantly more, 1237 00:50:13,740 --> 00:50:15,660 and you can see that in these three examples. 1238 00:50:15,660 --> 00:50:16,720 These are not unusual. 1239 00:50:16,720 --> 00:50:18,910 That's what most of the cells look like. 1240 00:50:18,910 --> 00:50:22,560 So here's a plot, just showing you 63% of the V2 neurons 1241 00:50:22,560 --> 00:50:24,600 significantly and positively modulated. 1242 00:50:24,600 --> 00:50:27,360 And by the way, this is averaged over all the textures 1243 00:50:27,360 --> 00:50:28,740 that we showed them. 1244 00:50:28,740 --> 00:50:30,630 And if you pick any individual cell, 1245 00:50:30,630 --> 00:50:32,130 there's usually a couple of textures 1246 00:50:32,130 --> 00:50:33,780 that drive it really well, and then 1247 00:50:33,780 --> 00:50:35,613 a bunch of textures that drive it less well. 1248 00:50:35,613 --> 00:50:37,260 So this effect could be made stronger 1249 00:50:37,260 --> 00:50:40,500 if you chose only the textures that drove the cell well. 1250 00:50:40,500 --> 00:50:43,290 And up here is V1, where you can see that very few of them 1251 00:50:43,290 --> 00:50:46,290 are modulated by the existence of these higher-order 1252 00:50:46,290 --> 00:50:47,680 statistics. 1253 00:50:47,680 --> 00:50:50,070 Oh, here it is across texture category. 1254 00:50:50,070 --> 00:50:54,000 So now on the horizontal axis is the texture category-- 1255 00:50:54,000 --> 00:50:56,310 15 different textures, and you can see, again, 1256 00:50:56,310 --> 00:50:59,640 that V1 is pretty much very close to the same responses-- 1257 00:50:59,640 --> 00:51:02,220 dark and light, again, for the spectrally-matched 1258 00:51:02,220 --> 00:51:03,480 and the higher-order. 1259 00:51:03,480 --> 00:51:08,100 And for these three V1 cells, they're 1260 00:51:08,100 --> 00:51:11,640 basically the same responses for each of these pairs. 1261 00:51:11,640 --> 00:51:14,460 And for the V2 cells, there are always 1262 00:51:14,460 --> 00:51:18,696 at least some textures where there's an extreme difference. 1263 00:51:18,696 --> 00:51:20,070 So this is a really good example. 1264 00:51:20,070 --> 00:51:22,020 There's a huge difference in response 1265 00:51:22,020 --> 00:51:24,360 here for these two textures, but for actually many 1266 00:51:24,360 --> 00:51:28,130 of the other textures, there's not much of a difference. 1267 00:51:28,130 --> 00:51:31,470 So sort of a success. 
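[A natural way to quantify "cares" versus "doesn't care" here is a per-cell modulation index comparing the mean response to the naturalistic textures against the mean response to the spectrally matched noise. The index below, and the firing rates in the toy usage, are illustrative assumptions rather than the reported analysis.]

```python
import numpy as np

def modulation_index(r_texture, r_noise):
    """Per-cell modulation by higher-order statistics: the difference
    between the mean rate for naturalistic texture and for its
    spectrally matched noise, normalized by their sum. Positive means
    the cell prefers the texture."""
    t, n = np.mean(r_texture), np.mean(r_noise)
    return (t - n) / (t + n)

# Hypothetical rates (spikes/s), averaged across texture families:
v1_like = modulation_index([22.0, 25.0, 19.0], [21.0, 26.0, 20.0])
v2_like = modulation_index([30.0, 41.0, 25.0], [18.0, 22.0, 16.0])
print(f"V1-like: {v1_like:+.2f}   V2-like: {v2_like:+.2f}")
```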
1268 00:51:31,470 --> 00:51:33,480 And the last thing I was going to tell you about 1269 00:51:33,480 --> 00:51:36,120 is that we think-- 1270 00:51:36,120 --> 00:51:39,630 so this is really fitting, given what Jim DiCarlo told you 1271 00:51:39,630 --> 00:51:42,900 about, or what I assume he told you about-- 1272 00:51:42,900 --> 00:51:47,220 this idea of tolerance or invariance versus selectivity. 1273 00:51:47,220 --> 00:51:49,176 We wanted to know, how can we take 1274 00:51:49,176 --> 00:51:50,550 what we know about these V2 cells 1275 00:51:50,550 --> 00:51:53,280 and pull it back into the perceptual domain? 1276 00:51:53,280 --> 00:51:54,930 How can we ask, what is it that you 1277 00:51:54,930 --> 00:51:57,270 could do with a population of V2 cells 1278 00:51:57,270 --> 00:52:00,900 that you couldn't do with a population of V1 cells? 1279 00:52:00,900 --> 00:52:04,050 And the thought was if the V2 cells are responding 1280 00:52:04,050 --> 00:52:06,750 to these texture statistics, then 1281 00:52:06,750 --> 00:52:10,080 if I made a whole bunch of samples of the same texture, 1282 00:52:10,080 --> 00:52:12,660 the V2 cells should be really good at identifying 1283 00:52:12,660 --> 00:52:14,860 which texture that is-- 1284 00:52:14,860 --> 00:52:17,130 which family it came from. 1285 00:52:17,130 --> 00:52:19,770 And the V1 cells will be all confused by the fact 1286 00:52:19,770 --> 00:52:22,680 that those samples each have different details that 1287 00:52:22,680 --> 00:52:23,820 are shifting around. 1288 00:52:23,820 --> 00:52:26,610 So the V1 cells will respond to those details, 1289 00:52:26,610 --> 00:52:29,310 and they'll give a huge variety of responses 1290 00:52:29,310 --> 00:52:33,090 across re-samplings from that family, 1291 00:52:33,090 --> 00:52:36,270 and the V2 cells will be more invariant or more 1292 00:52:36,270 --> 00:52:38,310 tolerant to re-sampling from that family. 1293 00:52:38,310 --> 00:52:39,419 That was the concept. 1294 00:52:39,419 --> 00:52:40,960 And that turns out to be the case, so 1295 00:52:40,960 --> 00:52:42,450 let me show you the evidence. 1296 00:52:42,450 --> 00:52:45,720 So here are four different textures, 1297 00:52:45,720 --> 00:52:47,990 four different-- what we call different families. 1298 00:52:47,990 --> 00:52:51,360 Here are images of three different examples drawn from each. 1299 00:52:51,360 --> 00:52:53,550 So these are just three samples drawn, 1300 00:52:53,550 --> 00:52:55,334 starting with different white noise seeds. 1301 00:52:55,334 --> 00:52:57,750 And you can see that they're actually physically different 1302 00:52:57,750 --> 00:53:00,820 images, but they look the same. 1303 00:53:00,820 --> 00:53:01,530 Three again. 1304 00:53:01,530 --> 00:53:03,090 Three again. 1305 00:53:03,090 --> 00:53:08,915 And so we got 100 cells from V1 and about 100 cells from V2. 1306 00:53:08,915 --> 00:53:11,880 The stimuli are presented for 100 milliseconds. 1307 00:53:11,880 --> 00:53:13,380 We do 20 repetitions each. 1308 00:53:13,380 --> 00:53:14,580 We need a lot of data. 1309 00:53:14,580 --> 00:53:18,330 And what's shown here is just this 4 by 3 array, 1310 00:53:18,330 --> 00:53:20,280 but we actually had 15 different families 1311 00:53:20,280 --> 00:53:22,640 and 15 examples of each. 1312 00:53:22,640 --> 00:53:24,360 20 repetitions of each of those. 1313 00:53:24,360 --> 00:53:26,910 225 stimuli times 20 repetitions. 1314 00:53:26,910 --> 00:53:28,400 That's the experiment.
1315 00:53:28,400 --> 00:53:31,529 So what we wanted to know is, does the hypothesis hold? 1316 00:53:31,529 --> 00:53:32,570 And so here's an example. 1317 00:53:32,570 --> 00:53:36,390 These are responses laid out for these 12 stimuli. 1318 00:53:36,390 --> 00:53:38,760 And what you can see is that this is a V1 neuron-- 1319 00:53:38,760 --> 00:53:40,260 a typical V1 neuron. 1320 00:53:40,260 --> 00:53:41,970 You can see that the neuron actually 1321 00:53:41,970 --> 00:53:45,390 responds with a fair amount of variety in these columns. 1322 00:53:45,390 --> 00:53:49,380 That is, for different exemplars from the same family, 1323 00:53:49,380 --> 00:53:50,460 there's some variety. 1324 00:53:50,460 --> 00:53:52,350 High response here, medium response here, 1325 00:53:52,350 --> 00:53:53,820 very low response here. 1326 00:53:53,820 --> 00:53:56,160 And this is for these three images, which 1327 00:53:56,160 --> 00:53:58,380 to us look basically the same. 1328 00:53:58,380 --> 00:54:02,680 So this cell would not be very good at separating out 1329 00:54:02,680 --> 00:54:07,140 or recognizing or helping in the process of recognizing 1330 00:54:07,140 --> 00:54:09,720 which kind of texture you were looking at, because it's 1331 00:54:09,720 --> 00:54:13,800 flopping all over the place when we draw different samples. 1332 00:54:13,800 --> 00:54:15,780 Compare that to a V2 cell-- 1333 00:54:15,780 --> 00:54:17,880 this is a typical V2 cell, which you can see 1334 00:54:17,880 --> 00:54:19,920 is much more stable across these columns. 1335 00:54:19,920 --> 00:54:22,470 This is roughly the same response here, roughly the same 1336 00:54:22,470 --> 00:54:24,840 here, a little bit of variety in this one, 1337 00:54:24,840 --> 00:54:26,580 roughly the same in this one. 1338 00:54:26,580 --> 00:54:30,220 And sure enough, if you actually go and plot this, 1339 00:54:30,220 --> 00:54:33,450 V2 has much higher variance across families. 1340 00:54:33,450 --> 00:54:34,890 That's vertically. 1341 00:54:34,890 --> 00:54:36,160 These are the V1 cells. 1342 00:54:36,160 --> 00:54:37,142 These are the V2 cells. 1343 00:54:37,142 --> 00:54:38,850 And this is the variance across families. 1344 00:54:38,850 --> 00:54:40,940 This is the variance across exemplars. 1345 00:54:40,940 --> 00:54:44,070 V2 has higher variance typically across families, 1346 00:54:44,070 --> 00:54:47,070 and V1 has higher variance across exemplars. 1347 00:54:47,070 --> 00:54:50,380 And now if you take populations of equal size-- 1348 00:54:50,380 --> 00:54:53,190 100 of each, and you ask, well, how good 1349 00:54:53,190 --> 00:54:55,050 would I be at taking that population 1350 00:54:55,050 --> 00:55:02,430 and identifying which family, which kind of texture 1351 00:55:02,430 --> 00:55:03,380 I'm looking at? 1352 00:55:03,380 --> 00:55:04,915 And we do this with cross-validation 1353 00:55:04,915 --> 00:55:05,540 and everything. 1354 00:55:05,540 --> 00:55:07,790 I can give you the details later, if you want to know. 1355 00:55:07,790 --> 00:55:12,680 We find V2 is always better than V1 in doing this task. 1356 00:55:12,680 --> 00:55:16,460 So we can do a better job in performing this task-- 1357 00:55:16,460 --> 00:55:19,460 identifying which of these families 1358 00:55:19,460 --> 00:55:23,540 a given example was drawn from if we look at V2 than if we 1359 00:55:23,540 --> 00:55:24,560 look at V1.
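[Both analyses in this passage -- the variance split across families versus exemplars, and the cross-validated population decoding -- can be sketched briefly. The arrays here are hypothetical trial-averaged firing rates, and nearest-centroid classification is an assumed stand-in for whatever decoder was actually used; the exemplar-identification flip discussed next uses the same machinery with the roles of families and exemplars swapped.]

```python
import numpy as np

def variance_split(responses):
    """responses: (n_families, n_exemplars) mean rates for one cell.
    Returns (variance across family means, mean variance across exemplars
    within a family). V2-like cells should have the first large and the
    second small; V1-like cells the reverse."""
    across_families = np.var(responses.mean(axis=1))
    across_exemplars = np.mean(np.var(responses, axis=1))
    return across_families, across_exemplars

def decode_family(population, held_out):
    """Leave-one-exemplar-out family identification from a population.
    population: (n_cells, n_families, n_exemplars) mean rates. Classifies
    each held-out exemplar by the nearest family centroid and returns the
    fraction of families identified correctly."""
    train = np.delete(population, held_out, axis=2)
    centroids = train.mean(axis=2)              # (n_cells, n_families)
    probe = population[:, :, held_out]          # (n_cells, n_families)
    n_fam = population.shape[1]
    correct = sum(
        int(np.argmin(np.linalg.norm(centroids - probe[:, [f]], axis=0)) == f)
        for f in range(n_fam))
    return correct / n_fam
```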
1360 00:55:24,560 --> 00:55:26,270 And if we flip that around and we 1361 00:55:26,270 --> 00:55:27,830 try to do exemplar identification, 1362 00:55:27,830 --> 00:55:31,400 with 15 different examples of a given family-- 1363 00:55:31,400 --> 00:55:33,430 if we say, which one was it? 1364 00:55:33,430 --> 00:55:36,800 It turns out that V1 is better than V2 for that. 1365 00:55:36,800 --> 00:55:39,610 So we think of this as evidence that V2 1366 00:55:39,610 --> 00:55:44,260 has some invariance across these samples, 1367 00:55:44,260 --> 00:55:46,150 whereas V1 is much more specialized 1368 00:55:46,150 --> 00:55:48,640 for the particular samples. 1369 00:55:48,640 --> 00:55:51,970 This work started with this fantastic post-doc 1370 00:55:51,970 --> 00:55:54,700 that I had mentioned earlier, Javier Portilla. 1371 00:55:54,700 --> 00:55:56,800 Jeremy Freeman came into my lab, and we just 1372 00:55:56,800 --> 00:55:59,950 jumped all over this in making the metamers. 1373 00:55:59,950 --> 00:56:01,810 Josh McDermott is on here because I usually 1374 00:56:01,810 --> 00:56:05,454 also play the auditory examples and walk 1375 00:56:05,454 --> 00:56:06,870 through a little bit of that work, 1376 00:56:06,870 --> 00:56:09,160 but I'm going to leave that for him. 1377 00:56:09,160 --> 00:56:11,740 And Corey Ziemba, who's a student in the lab 1378 00:56:11,740 --> 00:56:14,560 right now, did a lot of the physiology 1379 00:56:14,560 --> 00:56:18,280 that I showed you, in Tony's lab. 1380 00:56:18,280 --> 00:56:22,310 And we were funded by HHMI and also the NIH. 1381 00:56:22,310 --> 00:56:23,940 So thanks.