The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

TOMASO POGGIO: So I'll speak about i-theory, visual cortex, and deep learning networks.

The background for this is the conceptual framework that we take as a guide for presenting work in vision in this center: the idea that there is a phase in visual perception, essentially up to the first saccade--say, 100 milliseconds from the onset of an image--in which most of the processing in the visual cortex is feedforward, and that top-down signals--I hate the term feedback in this case, but back projections going from higher visual areas, like inferotemporal cortex, back to V2 and other cortical areas--are not active in this first hundred milliseconds.

Now, all of this is a conjecture based on a body of data, so it has to be proven. For us it is just a motivation, a guide, to first study feedforward processing in, as I said, the first 100 milliseconds or so, and to think that other types of theory--generative models, the probabilistic inference you have heard about, the visual routines you have heard about from Shimon--are important not so much in the first 100 milliseconds but later on, especially when feedback through back projections, but also through movements of the eyes that acquire new images depending on the first one you have seen, comes into play.

OK. This is just to motivate feedforward. And of course, the evidence I refer to is evidence like--you have heard from Jim DiCarlo, from the physiology there is quite a bit of data showing that neurons in IT become active and selective for what is in the image about 80 or 90 milliseconds after onset of the stimulus. And this basically implies that there are no big feedback loops from one area to another one.
It takes 40 milliseconds to get to V1, and 10 milliseconds or so for each of the next areas.

So the problem is computational vision--the guy on the left is David Marr. And here is really where, most probably, a lot of object recognition takes place: the ventral stream, from V1 to V2, V4, and the IT complex.

So that's the back of the head. As I said, it takes 40 milliseconds for electrical signals to come from the eye in the front, through the LGN, back to neurons in V1--simple and complex cells. And then signals go from the back to the front; that's the feedforward part.

And on the bottom right, you have seen this picture already. This is from Van Essen, edited recently by Movshon. The sizes of the areas and the sizes of the connections are roughly proportional to the number of neurons and fibers. So you see that V1 is as big as V2; they both have about 200 million neurons. And V4 is about 50 million, and the inferotemporal complex is probably 100 million or so. Our brain is about one million flies. A fly is around 300,000 neurons or so. A bee is one million.

And as I think Jim DiCarlo mentioned, there are these models that have been developed since Hubel and Wiesel--so that's '59--that try to model feedforward processing from V1 to IT. And they start with simple and complex cells, S1 and C1: simple cells being essentially equivalent to Gabor filters, oriented Gabor filters at different positions and different orientations; and then complex cells that put together the signals from simple cells with the same orientation preference but different positions, and so have somewhat more position tolerance than simple cells. And then a repetition of this basic scheme, with S2 cells representing more complex--let's call them features--than lines, maybe combinations of lines, and then C2 cells again pooling together cells of the same preference in order to get more invariance to position.
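As a rough illustration of the S1/C1 stage just described, here is a minimal sketch, assuming a small bank of oriented Gabor filters for S1 and local max pooling for C1 (the sizes, orientations, and the choice of max are illustrative, not the parameters of any particular published model):

```python
import numpy as np
from scipy.signal import convolve2d

def gabor(size=11, wavelength=5.0, sigma=3.0, theta=0.0):
    # An oriented Gabor patch: the standard model of a V1 simple-cell receptive field.
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xr / wavelength)
    return g - g.mean()

def s1(image, thetas=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    # S1: convolve the image with Gabor filters at several orientations.
    return [convolve2d(image, gabor(theta=t), mode="same") for t in thetas]

def c1(s1_maps, pool=8):
    # C1: max-pool each orientation map over local windows, gaining position tolerance.
    out = []
    for m in s1_maps:
        h, w = (m.shape[0] // pool) * pool, (m.shape[1] // pool) * pool
        blocks = m[:h, :w].reshape(h // pool, pool, w // pool, pool)
        out.append(blocks.max(axis=(1, 3)))
    return out

image = np.random.default_rng(0).standard_normal((64, 64))
print([m.shape for m in c1(s1(image))])  # four 8x8 orientation maps
```

S2/C2 repeat the same convolve-then-pool pattern on the C1 maps, with stored or learned patches in place of the Gabors.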
And there is evidence from the old work of Hubel and Wiesel about simple and complex cells in V1--so S1 and C1--although the morphological identity of complex and simple cells is still an open question; you know, which specific cells they are. We can discuss that later. But for the rest, this hierarchy continuing in other areas, like V2 and V4 and IT, is a conjecture in models like this.

And we, like others before us, modeled these different areas--V1, V2, and V4--with this kind of model, about 15 years ago. And the reason to do so was not really to do object recognition; it was to try to see whether we could get the physiological properties of cells in the different areas in such a feedforward model--the ones that people had recorded from and published about. And we could do that, reproduce those properties. Of course, some of them we put in: properties of simple and complex cells. But others, like how much invariance to position there was at the top level, we got out of the model, consistent with the data.

One surprising thing that we had with this model was that, although it was not designed in order to perform well at object recognition, it did actually work pretty well. So the kind of thing you have to think about here is rapid categorization. You have seen that already. And the task is, for each image, is there an animal or not? And you can kind of get the feeling that you can do that. In the real experiment, you have an image and then a mask, another image. And then you say yes, there is an animal, or no, there is not.

This is called rapid categorization. It was introduced by Molly Potter, and more recently Simon Thorpe in France used it. And it's a way to force the observer to work in a feedforward mode, because you don't have the time to move your eyes, to fixate. There is some evidence that the mask may stop the back projections from working.
So this is a situation in which you can compare human performance to these feedforward models, which are not a complete description of vision anyway, because they don't take into account different eye fixations and feedback and higher processes--like, as I said, probabilistic inference and routines, whatever happens, very likely, in normal vision, in which you have time to look around.

So in this case, this d-prime is a measure of performance, of how well you are doing this task. And you can see, first of all, the absolute performance--80% correct on a certain database. This task, animal versus no animal, is similar between the model and humans. And images that are difficult for people--like images in which there is a lot of clutter, or the animals are small--are also difficult for the model. And the easy ones are easy for both. So there is a correlation between models and humans.

This does not say that the model is correct, of course, but it gives a hint that models of this type capture something of what's going on in the visual pathway. And Jim DiCarlo spoke about a more sophisticated version of these feedforward models, including training with backpropagation, that gives pretty good results also in terms of agreement between neurons and units in the model.

So the question is why these models work. They're very simple, feedforward. It has been surprisingly difficult to understand why they work as well as they do. When I started to work on these kinds of things 15 years ago, I thought this kind of architecture would not work. But then they worked much better than I thought.

And if you believe deep learning these days, which I do--for instance, its performance on ImageNet--my guess is that they actually work better than humans, because the right comparison for humans on ImageNet would be the rapid categorization one. So you'd present images briefly, because that's what the models have--just one image, no chance of getting a second view.
179 00:13:23,950 --> 00:13:28,010 Anyway, that's a more complex discussion 180 00:13:28,010 --> 00:13:31,430 that has to do also with how to model 181 00:13:31,430 --> 00:13:36,025 the fact that in our eyes, in our cortex, 182 00:13:36,025 --> 00:13:39,270 every-- solution depends on eccentricity. 183 00:13:39,270 --> 00:13:43,610 It's a pretty rapidly decaying resolution 184 00:13:43,610 --> 00:13:47,640 as you go away from the fovea, and has 185 00:13:47,640 --> 00:13:53,210 some significant implications for all these topics. 186 00:13:53,210 --> 00:13:55,880 I'll get to that. 187 00:13:55,880 --> 00:13:59,750 What I want to do today is, one way 188 00:13:59,750 --> 00:14:01,790 to look at this to try to understand 189 00:14:01,790 --> 00:14:04,640 how these kind of feedforward models work-- 190 00:14:07,750 --> 00:14:12,050 i-theory is based on trying to understand 191 00:14:12,050 --> 00:14:17,540 how models that are simple and complex cells 192 00:14:17,540 --> 00:14:21,860 and can be integrated in a hierarchical architecture 193 00:14:21,860 --> 00:14:31,070 can provide a signature set of features that 194 00:14:31,070 --> 00:14:36,980 are invariant to transformations observed during development, 195 00:14:36,980 --> 00:14:41,130 and at the same time keep selectivity. 196 00:14:41,130 --> 00:14:45,530 You don't lose any selectivity to different objects. 197 00:14:48,690 --> 00:14:53,710 And then I want to see what they say 198 00:14:53,710 --> 00:15:00,280 about deep convolutional learning networks, 199 00:15:00,280 --> 00:15:04,690 and look at some of the-- 200 00:15:04,690 --> 00:15:09,010 beginning with theory about deep learning. 201 00:15:09,010 --> 00:15:17,260 And then I want to look at a couple of predictions, 202 00:15:17,260 --> 00:15:21,025 particularly related to eccentricity-dependent 203 00:15:21,025 --> 00:15:24,210 resolution coming from i-theory, that 204 00:15:24,210 --> 00:15:28,810 are interesting for the sake of physics and modeling. 205 00:15:28,810 --> 00:15:32,260 And then it's basically garbage time, 206 00:15:32,260 --> 00:15:34,930 if you're interested in mathematical details 207 00:15:34,930 --> 00:15:41,110 and proofs of theorems and historical background. 208 00:15:41,110 --> 00:15:41,620 OK. 209 00:15:41,620 --> 00:15:43,890 Let's start with i-theory. 210 00:15:46,540 --> 00:15:49,030 These are the kind of things that we want, ideally, 211 00:15:49,030 --> 00:15:50,680 to explain. 212 00:15:50,680 --> 00:15:54,040 This is the visual cortex on the left. 213 00:15:54,040 --> 00:15:59,290 Models like HMAX, or feedforward models. 214 00:15:59,290 --> 00:16:02,770 And on the right are the deep learning 215 00:16:02,770 --> 00:16:08,050 convolutional networks, a couple of them, 216 00:16:08,050 --> 00:16:10,560 which basically have convolutional stage 217 00:16:10,560 --> 00:16:16,450 stages very similar to S1, and pooling stages similar to C1. 218 00:16:19,090 --> 00:16:22,502 But quite a lot of those layers. 219 00:16:28,350 --> 00:16:30,340 How many of you know about deep learning? 220 00:16:30,340 --> 00:16:31,120 Everybody, right? 221 00:16:35,440 --> 00:16:38,180 OK. 222 00:16:38,180 --> 00:16:43,400 These are the kind of questions that i-theory tries to answer-- 223 00:16:43,400 --> 00:16:47,810 why these hierarchies work well, what 224 00:16:47,810 --> 00:16:52,343 is really visual cortex, what is the goal of V1 to IT. 
We know a lot about simple and complex cells, but again, what is the computational goal of these simple and complex cells? Why do we have Gabor tuning in the early areas? And why do we have quite generic tuning in the first visual areas, but quite specific tuning to different types of objects, like faces and bodies, higher up?

The main hypothesis from which i-theory starts is that one of the main goals of the visual cortex--it's a hypothesis--is to compute a set of features, a representation of images, that is invariant to transformations that the organism has experienced--visual transformations--and that remains selective.

Now, why is invariance important? A lot of the problem of recognizing objects is the fact that I can see Rosalie's face once, and then the next time it's the same face, but the image is completely different, because it's much bigger now since I'm closer, or the illumination is different. So the pixels are different. And from one single object you can produce in this way--through translation, scaling, different illumination, viewpoint--thousands of different images.

So the intuition is that if I could get a computer description--say, a long vector of features of her face--that does not change under these transformations, recognition would be much easier. Easier means, especially, that I could learn to recognize an object with many fewer labeled examples.

Here on the right you have a very simple demonstration of what I mean, an empirical demonstration. At the bottom we have different cars and different planes. And there is a linear classifier which is trained directly on the pixels--a very stupid classifier. And you train it with one car and one plane--this is on the left--or two cars, two planes. And then you test on other images.
And as you can see, when it's trained with the bottom examples, which are at all kinds of viewpoints and sizes, the performance of the classifier in answering "is this a car or is this a plane" is 50%. It's chance. It does not learn at all.

On the other hand, suppose I have an oracle--which I will conjecture is visual cortex, essentially--that gives you, for each image, a feature vector which is invariant to these transformations. So it's like having the images of cars in this row B: they're all at the same position, same illumination, and so on, and the same for the planes. And I repeat this experiment. I use one pair--one car, one plane--to train, or two cars, two planes, and I see immediately that when tested on new images, this classifier is close to 90%. So much better.

So having an invariant representation can help a lot. That's the simple, empirical demonstration. And you can prove theorems saying the same thing: that if you have an invariant representation, you can have a much lower sample complexity, which means you need many fewer labeled examples to train a classifier to achieve a certain level of accuracy.

So how can you compute an invariant representation? There are many ways to do it. But I'll describe to you one which I think is attractive, because it's neurophysiologically very plausible.

The basic assumption I'm making here is that neurons are very slow devices. They don't do a lot of things well. One of the things they probably do best is high-dimensional dot products. And the reason is that you have a dendritic tree, and in cortical neurons you have between 1,000 and 10,000 synapses. So you have between 1,000 and 10,000 inputs. And each input gets essentially multiplied by the weight of the synapse, which can be changed during learning. It's plastic.
And then the post-synaptic depolarizations or hyperpolarizations--the electrical changes at the synapses--all get summated in the soma. So you have a sum over i: the x_i are your inputs, the w_i are your synapses. That's a dot product. And this happens automatically, within a millisecond. So it's one of the few things that neurons do well.

It's, I think, one of the distinctive features of the neurons of the brain relative to our electronic components, that each neuron, each unit in the brain, has about 10,000 wires getting in or out, whereas for the transistors or logical units in our computers the number of wires is more like three or four. So this is the assumption, that this kind of dot product is easy to do.

And so this suggests the following kind of algorithm for computing invariance. Suppose you are a baby in the cradle. You're playing with a toy--it's a bike--and you are rotating it, for instance. For simplicity; we'll do more complex things later. The unsupervised learning that you need to do at this point is just to store the movie of what happens to your toy. For instance, suppose you get a perfect rotation. This is the movie up there. There are eight frames. You store those, and you keep them forever.

All right. So when you see a new image--it could be Rosalie's face, or this fish--I want to compute a feature vector which is invariant to rotation, even if I've never seen the fish rotated. What I do is, I compute the dot product of the image of the fish with each one of the frames. So I get eight numbers. And the claim is that these eight numbers--not their order, but the numbers--are invariant to rotation of the fish. So if I see the fish now at a different rotation angle--suppose it's vertical--I'd still get the same eight numbers, in a different order, probably. That's what I said: they are invariant to rotation of the fish.
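A minimal numerical sketch of this memory-based scheme (not the actual implementation): to keep the transformation group exact on a discrete grid, it uses circular shifts of a 1-D vector in place of in-plane rotation, and random vectors stand in for the stored "bike" frames and the new "fish" image:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16                                      # toy 1-D "images" with D pixels
group = range(D)                            # the observed transformations: all circular shifts

template = rng.standard_normal(D)           # the toy you played with ("the bike")
frames = [np.roll(template, g) for g in group]   # the stored movie: one frame per transformation

def signature(image):
    # Dot product of the image with every stored frame; sorting throws away the order,
    # keeping only the (unordered) set of values.
    return np.sort([image @ f for f in frames])

fish = rng.standard_normal(D)               # a new object, never seen transformed
print(np.allclose(signature(fish), signature(np.roll(fish, 5))))  # True: same set of numbers
```

The same multiset of dot products comes out whatever the shift of the fish, which is exactly the claim made above for rotation.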
There are various quantities that you can use to represent compactly the fact that they are the same independent of rotation. For instance, the probability distribution--the histogram--of these values does not depend on the order. And so if you make a histogram, it should be independent of rotation, invariant to rotation. Or moments of the histogram, like the average, the variance, the moment of order infinity.

And the equation for computing a histogram is written there. You have the dot product of the image, the fish, with one template of the bike, the template t_k--you have several templates, not just one. And g_i is an element of the rotation group, so you get various rotations of it, simply because you have observed them. You don't need to know it's the rotation group; you don't need to compute that. These are just images that you have stored. And there can be different thresholds for the simple cells. And sigma could be just a threshold function, for instance--as it turns out, and I'll describe later, the nonlinearity can be almost anything. And the sum is the pooling. This is very robust to different choices of the nonlinearity and the pooling.

Here are some examples in which now the transformation you have observed for the bike is translation. And if I compute a histogram--from more than eight frames, in this case--I get the red histogram for the fish, and you can see the red histogram does not change, even if the image of the fish is translated. Same for the blue histogram, which is the set of features corresponding to the cat: it is also invariant to translation, but it is different from the red one. So these quantities, the histograms, can be invariant, of course, but also selective, which is what you want. In order to have a selectivity as high as you want, you need more than one template.
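Written out, the pooled quantity described above has, in the i-theory papers, roughly the form

$$\mu_n^k(I) \;=\; \frac{1}{|G|}\sum_{i=1}^{|G|}\sigma\!\big(\langle I,\, g_i t^k\rangle + n\Delta\big),$$

where $t^k$ is the $k$-th stored template, the $g_i$ are the observed transformations of it, $\sigma$ is the nonlinearity, and $n\Delta$ is the threshold of the $n$-th simple cell (the exact normalization varies between papers). Sweeping $n$ over a range of thresholds estimates the histogram of the dot products; a linear $\sigma$ gives the mean, and a max-type pooling corresponds to the moment of order infinity mentioned above.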
There are some results about how many templates you need. I can go into more detail on this, but essentially, you need a number of templates--templates like the bike, in our original example--that is logarithmic in the number of images you want to separate. For instance, suppose you want to be able to distinguish 1,000 faces, or 1,000 objects. Then the number of templates you need is on the order of log 1,000. So it does not increase very much.

Yeah. So there are two things, one of which you implied. The reason I spoke about rotation in the image plane is that rotation is a compact group. You never get out; you come back in. Translation, in principle, mathematically, goes from plus infinity to minus infinity. Of course that does not make sense physically, but mathematically it means that it's a little bit more difficult to prove the same results in the case of translation and scale. But we can do it. That's the first point. The second one is the combinatorics of different transformations. It turns out that one approach to this is to have what the visual system seems to have, in which you have relatively small ranges of invariance at different stages. So at the first stage, say in V1, you have pooling by the complex cells over a small range of translations, and probably scale. And then at the second stage you have a larger range. I'll come to that later. But it's a very interesting point.

I'll not go into this. These are technical extensions: to partially observable groups--these non-compact groups--to non-group transformations, the approximate invariance to rotations in 3D, or changes of expression, and so on; and then what happens when you have a hierarchy of these modules. I'll say briefly something about each one.
One is that if you look at the templates that give you simultaneous--so what we want to do is get scale and position invariance. And suppose you want the templates that maximize the simultaneous range of invariance to scale and position. It turns out that Gabor templates, Gabor filters, are the ones that do that. So that may be one computational reason why Gabor filters are a good thing to use in processing images.

For getting approximately good invariance to non-group transformations, you need to have some conditions. The main one is that the template must transform in a way similar to the objects for which you want to compute the invariance, like faces.

And these properties carry over to a hierarchy of modules. Think of this inverted triangle as a set of simple cells at the base and one complex cell, the red circle, at the top. So the architecture that we're looking at is simple-complex--this would be like V1--and next to it another simple-complex module; this is all V1. And then you have V2 in the second layer, which is getting its input from V1, and you repeat the same thing, but on the output of V1. This is exactly like a deep learning network. It's like visual cortex, where you have different stages and the effective receptive fields increase as you go up, as you see here. So this would be the increase in spatial pooling--so invariance--and also, as I mentioned, not drawn here, in scale: pooling over size, over scale.

And you can show that, if the following is true, that--let me see. Is this animated? No.

What you need to have--and a number of different networks, certainly the ones I described, have this property--is covariance. So suppose you have an object that translates in the image. OK.
What I need is that the neural activity--the red circles at the first level--also translates. This is covariance.

So what happens is the following. Suppose the object is smaller than those receptive fields--in this drawing it's as big, but suppose it's smaller. Then if the object translates within one of those receptive fields, going from one point to another, because each one has invariance to translations within its receptive field--it's pooling over them--translation within the receptive field will give the same output. You will have invariance right there.

But suppose you have one image, and then in the next one the object moves to a different receptive field, or gets out of the receptive field. Then you don't have invariance at the first layer. But if you have covariance--so the neural activity moves--then at the layer above you may have invariance, under that larger receptive field. In other words, in this construction, if you have this covariance property, then at some point in the network one of these receptive fields will be invariant.

Is that--

AUDIENCE: Can you explain that again?

TOMASO POGGIO: Yeah. The argument is--suppose I have an object like this. I have an image. And then I have another image in which the object is here. Obviously the response at this level--the response of this cell--will change, because before it saw this object, and now there are these other cells that see it. So the response has changed; you don't have invariance.

However, if you look at what happens, say, at the top red circle there: the top red circle will see some activity in the first image here, because it was activated by this, and in the second case it sees some activity over there, which should be equivalent. And under these receptive fields, translations will give rise to the same signature. Under this big receptive field, you have invariance for translation within it.
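A toy version of this covariance argument, again using circular shifts so the group is exact, with max pooling standing in for the complex-cell operation (the window size and the choice of max are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
D, win = 32, 8                           # image length; first-layer receptive-field size

def complex_layer(x, win):
    # One pooling cell per window of `win` positions (a row of first-layer "complex cells").
    return np.array([x[i:i + win].max() for i in range(0, len(x), win)])

image = np.zeros(D)
image[2:5] = rng.random(3)               # a small object inside the first receptive field

v1 = complex_layer(image, win)
v1_small = complex_layer(np.roll(image, 2), win)    # shift within the same window
v1_large = complex_layer(np.roll(image, win), win)  # shift into the neighboring window

print(np.allclose(v1, v1_small))   # True: invariance already at the first layer
print(np.allclose(v1, v1_large))   # False: activity moved to another unit (covariance)
print(np.isclose(v1.max(), v1_large.max()))  # True: a cell pooling over the whole first layer is invariant
```

The small shift is absorbed by a first-layer cell; the large shift only moves the pattern of first-layer activity, so a second-layer cell pooling over all of it still gives the same answer.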
So, to recap, the argument is that either you have invariance at one layer, because the object just moved within it, and then you are done--it's invariant, and everything above is invariant--or you don't have invariance at this layer, but you will have it at some layer above.

So in a sense--if you go back to this, I'll make this point later, but if you go back to this algorithm, the basic idea is that you want to have invariance to rotation, and so you average over the rotations. But suppose instead you want an estimate of rotation, and you're not interested in identity. Then what you do is, you don't pool over rotations; you pool over different objects at one rotation. So you can do both. All right?

AUDIENCE: My question was more physiological than theoretical.

TOMASO POGGIO: Yeah. Physiological--we had done experiments long ago in IT with Jim DiCarlo, Gabriel Kreiman. And from the same population of neurons, we could read out identity--object identity--invariant to scale and position. And we could also read out position, invariant to identity. And--

AUDIENCE: The same from the--

TOMASO POGGIO: Same population. I'm not saying the same neuron, but the same population of 200 neurons. And so you can imagine that you could have different situations. One could be that some of the neurons are only conveying position, and some others are completely invariant; and when you read out with a classifier, it will work. Or you have neurons that are already combining this information, because of the channels--either way.

OK, let me do this, and then we can take a break. I want to make the connection with simple and complex cells. We already mentioned this, but in this set of operations, you can think of this sigma of the dot product plus n delta as a simple cell. This is a dot product of the image with the receptive field of the simple cell.
603 00:41:19,790 --> 00:41:21,870 That's what this parenthesis is. 604 00:41:26,360 --> 00:41:30,860 You have a bias, or a threshold, and the nonlinearity. 605 00:41:30,860 --> 00:41:32,780 Could be the spiking nonlinearity. 606 00:41:32,780 --> 00:41:37,010 Could be, as I said, a rectifier. 607 00:41:37,010 --> 00:41:43,760 Neurons don't generate negative spikes. 608 00:41:43,760 --> 00:41:47,870 And so all of this is very plausible biologically. 609 00:41:47,870 --> 00:41:51,440 And the simple cell will simply pool, 610 00:41:51,440 --> 00:41:55,240 take the over the different simple cells. 611 00:42:02,670 --> 00:42:05,700 So that's what I mentioned before, that nonlinearity 612 00:42:05,700 --> 00:42:06,720 can be almost anything. 613 00:42:12,900 --> 00:42:14,900 And I want to mention something that could 614 00:42:14,900 --> 00:42:18,120 be interesting for physiology. 615 00:42:18,120 --> 00:42:21,120 From the point of view of this algorithm, 616 00:42:21,120 --> 00:42:25,970 this may be a solution to this problem that has been around 617 00:42:25,970 --> 00:42:33,050 for 30 years or so, which is that Hubel and Wiesel and other 618 00:42:33,050 --> 00:42:37,010 physiologists after them identified 619 00:42:37,010 --> 00:42:39,320 simple and complex cells in terms 620 00:42:39,320 --> 00:42:41,570 of their physiological properties. 621 00:42:41,570 --> 00:42:44,840 They couldn't see from where they are recording. 622 00:42:44,840 --> 00:42:48,470 But there were cells that behaved in different ways. 623 00:42:48,470 --> 00:42:52,400 The simple cells had the small receptive field. 624 00:42:52,400 --> 00:42:54,860 The complex cell had larger receptive field. 625 00:42:58,970 --> 00:43:01,550 The complex cells were more invariant. 626 00:43:01,550 --> 00:43:05,600 And then physiologists today are using 627 00:43:05,600 --> 00:43:07,850 criteria in which the complex cell is 628 00:43:07,850 --> 00:43:11,690 more non-linear than the simple cell. 629 00:43:11,690 --> 00:43:14,450 Now, from the point of view of the theory, 630 00:43:14,450 --> 00:43:17,830 the real difference is one is doing the pooling-- 631 00:43:17,830 --> 00:43:19,400 the complex cells. 632 00:43:19,400 --> 00:43:20,610 The simple cell is not. 633 00:43:23,840 --> 00:43:30,520 And the puzzle is that despite these physiological difference, 634 00:43:30,520 --> 00:43:36,070 they were never able to say this type of pyramidal cell 635 00:43:36,070 --> 00:43:40,850 is simple, and this type of pyramid cell are complex. 636 00:43:40,850 --> 00:43:46,440 And part of the reason could be that maybe simple and complex 637 00:43:46,440 --> 00:43:49,880 cells are the same cell. 638 00:43:49,880 --> 00:43:54,940 So that the operation can be done on the same cell. 639 00:43:54,940 --> 00:43:58,440 If you look at the theory, what may happen 640 00:43:58,440 --> 00:44:06,390 is that you have one dendrite play the roll of a simple cell. 641 00:44:06,390 --> 00:44:10,370 You have inputs, synaptic weights. 642 00:44:10,370 --> 00:44:13,360 So this could give rise, for instance, 643 00:44:13,360 --> 00:44:18,340 to the Gabor-like receptive field. 644 00:44:18,340 --> 00:44:25,490 And then-- these other dendrites to another simple cell. 645 00:44:25,490 --> 00:44:29,920 It's a Gabor-like in a slightly different position in the image 646 00:44:29,920 --> 00:44:33,220 plane, in the retina. 647 00:44:33,220 --> 00:44:37,050 You need the nonlinearities. 
And the nonlinearities may be, instead of at the output of the cell, the so-called voltage- and time-dependent conductances in the dendrites. In the meantime, we know that pyramidal cells in the visual cortex have these nonlinearities, almost to the point of having spike generation in the dendrites. And then the soma will summate everything; this is what the complex cell is doing.

And if one of the cells is computing something like an average, which is one of the moments of the distribution, then the nonlinearity will not even be needed. And then physiologists, using the criteria they use these days, would classify that cell as simple, even if, from the point of view of the theory, it's still complex.

Anyway, that's the proposed machinery that comes from the theory. That's everything that we need. And it would say simple and complex cells could be one and the same cell.
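Putting these pieces together, here is a minimal sketch of the simple-cell/complex-cell pair as the theory describes it (circular shifts again stand in for the observed transformations; the rectifier and the average are just two of the many admissible choices of nonlinearity and pooling):

```python
import numpy as np

rng = np.random.default_rng(2)
D = 16
template = rng.standard_normal(D)
frames = [np.roll(template, g) for g in range(D)]      # g_i t_k: the stored transformed templates

def simple_cell(image, frame, n_delta=0.0):
    # sigma(<I, g_i t_k> + n*Delta): dot product, threshold, rectifying nonlinearity.
    return max(image @ frame + n_delta, 0.0)

def complex_cell(image, frames, n_delta=0.0):
    # Pool (here, average) the simple cells that share a template but differ in g_i.
    return np.mean([simple_cell(image, f, n_delta) for f in frames])

fish = rng.standard_normal(D)
print(np.isclose(complex_cell(fish, frames),
                 complex_cell(np.roll(fish, 3), frames)))   # True: invariant output
```

Whether the dot products, the nonlinearity, and the pooling live in separate cells or in the dendrites and soma of a single pyramidal cell, as suggested above, the computation is the same.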