The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

HAIM SOMPOLINSKY: My topic today is sensory representations in deep cortex-like architectures. I should say the topic is perhaps toward a theory of sensory representations in deep networks. As you will see, our attempt is to develop a systematic theoretical understanding of the capacity and limitations of architectures of that type.

The general context is well known. In many sensory systems, information propagates from the periphery, like the retina, to primary visual cortex, and then through many stages up to a very high level, or maybe to the hippocampal structures. It's not purely feedforward: there are massive backward or top-down connections, and recurrent connections, and I'll talk about some of those extra features. But the most intuitive feature is simply a transformation, or filtering, of data across multiple stages. The same holds in the auditory pathway, and in other systems we see a similar structure, or aspects of a similar structure, as well.

A well-known, classical model system for computational neuroscience is the cerebellum, where information comes in through the mossy fiber layer, expands enormously into the granule layer, and then converges onto a Purkinje cell. So if you look at a single Purkinje cell, the output of the cerebellum, as a unit, you see first an expansion, from on the order of 1,000 mossy fiber inputs to something two orders of magnitude larger in the granule layer, and then a convergence of 200,000 or so parallel fibers onto a single Purkinje cell. And there are many, many modules of that type across the cerebellum. So, again, a transformation which involves, in this case, expansion and then convergence.
In the basal ganglia, which I wouldn't categorize as a sensory system, it is more related to motor function, you nevertheless see cortex converging first onto various stages of the basal ganglia and then expanding again back to cortex. The hippocampus also has multiple pathways, but some of them include a convergence, for instance a convergence onto CA3, and then an expansion again back to cortex. And there are multiple other pathways as well, with sensory information propagating, of course, across their different stages.

And, finally, there is the artificial network story of deep neural networks, which all of you may have heard. An input layer, then a sequence of stages, purely feedforward. And at least in the canonical leading networks, the output layer performs an object recognition or object classification task, and the whole network is trained by backprop, by supervised learning for that task.

What I'll talk about is more in the spirit of the idea that the first stages are more general purpose than the specific classification task at the output layer. So there are many issues: the number of stages that are required, their sizes, why compression or expansion. In many systems, you'll see that the fraction of active neurons is small in the expanded layer. That's what we call sparseness. So high sparseness means a small number of neurons active for any given stimulus. The terminology is somewhat confusing, but high sparseness means a small number of active neurons.

One important and crucial question is how to transform. What are the filters, the weights, that are good for transforming sensory information from one layer to another? And, in particular, whether random weights are good enough, or maybe even optimal in some sense, or whether one needs more structure, more learned synaptic weights.
This is a crucial question, perhaps not for machine learning but for computational neuroscience, because there is some experimental evidence, for at least some of the systems that have been studied, that the mapping from the compressed, original representation to the sparse representation is actually done by randomly connected weights. One example is olfactory cortex: the mapping of the olfactory representation from the olfactory bulb, from the glomerular layer, to the piriform cortex seems to be random, as far as one can tell. Similarly, in the cerebellum, the example I mentioned before, when one looks at the mapping from the mossy fibers to the granule cells, again an enormous expansion by a few orders of magnitude, the weights nevertheless seem to be random. Now, of course, one cannot say conclusively that they are random and that there are no subtle correlations or structures. But, nevertheless, there is a strong motivation to ask whether random projections are good enough. And if not, what does it mean to be structured? What kind of structure is appropriate for this task? There are also the questions of top-down and feedback loops, recurrent connections, and so on. All of that I hope to at least briefly mention later in my talk.

Before I continue: most of, or a large part of, this talk is based on published and unpublished work with Baktash Babadi, who was until recently a postdoctoral Swartz Fellow at Harvard University and has since gone on to practice medicine; Elia Frankin, a master's student at the Hebrew University; SueYeon, whom all of you know here at Harvard; Uri Cohen, a PhD student at the Hebrew University; and Dan Lee from the University of Pennsylvania.

So here is our formalization of the problem. We have an input layer, denoted 0. Typically it's a small, compressed layer with a dense representation, so here every input will activate maybe half of the population, on average. Then there is a feedforward layer of synaptic weights, which expands to a higher-dimensional layer, which we call the cortical layer.
It's expanded in terms of the number of neurons; this will be S1. It is sparse because f, the fraction of neurons that are active for each given input vector, will be small. So it is expanded and sparse. That will be the first part of my talk. Then, later on, I'll talk about staging, cascading this transformation over several stages. And ultimately there is a readout, which will be some classification task. Readout 1 will be one classification rule, readout 2 another classification rule, et cetera, each of them with synaptic weights which are learned to perform that task. So we call that the supervised layer, and those are the unsupervised layers.

That's the formalization of the problem. And, as you will see, we'll make enormously simplifying abstractions of the real biological system in order to try to gain some insight into the computational capacity of such systems.

The first important question is: what is the statistics, the statistical structure, of the input? The input is an N-dimensional vector, where N, or N0, is the number of units here, and each sensory event evokes a pattern of activity. But what is the statistical structure that we are working with? The simplest one, which we are going to discuss, is the following. We assume, basically, that the inputs come from a mixture of, so to speak, Gaussian statistics. It's not actually going to be Gaussian because, for simplicity, we'll assume the units are binary, but that doesn't really matter. So imagine that this is a caricature, a graphical representation, of a high-dimensional space, and imagine that the sensory inputs are clustered around templates, or cluster centers. These are the centers of these balls, and the inputs themselves come from the neighborhoods of those templates.
So each input will be one point in this space, and it will originate from the ensemble around one of those template states. That's the simple picture. And, in real space, the network maps each of those states into another state in the next layer. Then, finally, the task: imagine that some of those balls are classified as plus and some as minus. Say these are olfactory stimuli, and some of them are classified as appetitive and some as aversive. So the readout unit at the output layer has to classify some of those spheres as plus and some as minus. And, of course, depending on how many there are, their dimensionality, and their locations, this may or may not be an easy problem. So, for instance, here it's fine: a linear classifier on the input space can do it. Here, I think there should be some mistakes; yeah, here. So here is a case where a linear classifier at the input layer cannot do it.

And that's a theme which is very popular, both in computational and systems neuroscience and in machine learning. The following question comes up. Suppose we see that there is a transformation of data from, let's say, the photoreceptor layer in vision to the ganglion cells at the output of the retina, and then to cortex in several stages. How do we gauge, how do we assess, what the advantage is for the brain in transforming information from, let's say, retina to V1, and so on and so forth? After all, in this feedforward architecture, no net information is generated at the next layer. So if no net information is generated, the question is: what did we gain by these transformations? One possible answer is that they reformat the sensory representation into a different representation which makes subsequent computations simpler. So what does it mean for subsequent computation to be simpler?
One notion of simplicity is whether the subsequent computation can be realized by a simple linear readout. That's the strategy we are going to adopt here: to ask, as the representation changes from one layer to another, how well a linear readout can perform the task. So that's the input; that's the story. And then, as I said, there is an input, unsupervised representations, and a supervised readout at the end.

I need to introduce notation. Bear with me; this is a computational talk. I cannot just talk about ideas, because the whole point is to be able to actually come up with a quantitative theory that tests ideas. So let me introduce notation. At each layer, you can ask what the representation of the centers of these stimuli is. I'll denote a center by a bar, and mu is the index of the pattern. So mu goes from 1 to P, where P is the number of those balls, those spheres, or the number of clusters, if you think about clustering some sensory data. So P is the number of clusters. i, from 1 to N, simply indexes the neuron, the unit, whose activation we look at for each mu. And L is the layer: 0 is the input layer, and it goes up to layer L; so this would be 0, 1, and so on. The mean activation at each layer, from layer 1 onward, is held constant at f. f goes from 0 to 1, and the smaller f is, the sparser the representation. We will assume that the input representation is dense, so there f is 0.5. N, again, we'll assume for simplicity to be constant across layers, except for the first layer, where there is expansion. You can vary those parameters, and the theory actually accommodates such variations, but that's the simplest architecture: you expand a dense representation into a sparse, higher-dimensional one, and you keep doing that as you go along. So that's the notation.
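To make these definitions concrete, here is a minimal sketch, in Python with NumPy, of one way to generate this kind of input ensemble: P binary templates of dimension N0 with coding level f0 = 0.5, plus noisy samples obtained by flipping bits around a template. The function names and the bit-flip parameterization of the noise are illustrative assumptions, not the definitions used in the actual analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_templates(P, N0, f0=0.5):
    """P random binary cluster centers with coding level f0."""
    return (rng.random((P, N0)) < f0).astype(float)

def noisy_sample(template, flip_prob):
    """A point in the 'ball' around a template: each bit is flipped
    independently with probability flip_prob (illustrative noise model;
    flip_prob = 0.5 would give a completely random pattern)."""
    flips = rng.random(template.shape) < flip_prob
    return np.abs(template - flips.astype(float))

# Example: 50 clusters in a 100-dimensional dense input layer.
P, N0 = 50, 100
centers = make_templates(P, N0)              # the S-bar^mu of the talk
x = noisy_sample(centers[0], flip_prob=0.1)  # one noisy input from cluster 1
```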
Now, how do we assess what the next stages are doing to those clusters? As I said, one measure is to take a linear classifier and see how it performs. But you can also look at the statistics of the projected sensory stimuli at each layer and learn something from that. Basically, I'm going to suggest looking at two major statistical aspects of the data in each layer of the transformation. One of them is noise, and one of them is correlation.

So what is noise? Noise will simply be the radius, or a measure of the radius, of the sphere. If you had only the templates as inputs, the problem would be simple; it would be easy as long as you have enough dimensions. You expand, you can easily apply a linear classifier, and you solve the problem. The problem, in our case, is that the input is actually an infinite number of inputs, or an exponentially large number of possible inputs, because they all come from Gaussian noise, or a binarized version of Gaussian noise, around the templates. I'll denote the noise by delta. Delta equal to 0 means no noise. The normalization is such that delta equal to 1 means the patterns are random: basically, you cannot tell whether the input is coming from this cluster or from any other point in the input space.

The other quantity, correlations, is more subtle. I'm going to assume that those balls come from a roughly uniform distribution. Imagine you take a template here and draw a ball around it; you take another template there and draw a ball around it; everything is more or less uniformly distributed. The only structure is the fact that the data comes from this mixture of Gaussians, these noisy patterns around the centers. So that's fine.
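Delta can be made concrete with a small sketch. The normalization below, the raw Hamming distance per unit divided by 2f(1 - f) so that two statistically independent patterns with coding level f give delta close to 1, is one plausible choice consistent with what was just said; it is not necessarily the exact definition used in the analysis.

```python
import numpy as np

rng = np.random.default_rng(1)

def delta(x, center, f):
    """Normalized distance between a pattern and its cluster center.
    Identical patterns give 0; independent random patterns with coding
    level f give a value near 1 (normalization assumed for illustration)."""
    return np.mean(np.abs(x - center)) / (2.0 * f * (1.0 - f))

# Sanity check of the normalization with dense (f = 0.5) random patterns.
f, N = 0.5, 10000
a = (rng.random(N) < f).astype(float)
b = (rng.random(N) < f).astype(float)
print(delta(a, a, f), delta(a, b, f))   # approximately 0.0 and 1.0
```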
But as you project those clusters into the next stage, I claim that those centers, those templates, get a new representation which can actually have structure in it, simply because you push all of them through the same, common synaptic weights into the next layer. I'm going to measure this by Q. Basically, low Q, or Q equal to 0, corresponds to randomly, uniformly distributed centers, and I'll always start from that at the input layer. But then there is a danger, or it might happen, that as you propagate this representation through the next layer, the centers, the structure of the data, will look like this: on average, the distance between two centers is the same as before, but they are clumped together. It's a kind of random clustering of the clusters. And that can be induced by the fact that the data is fed forward from this representation. That can pose a problem. If there is no noise, then again there is no problem; you can still differentiate between them. But if there is noise, this can aggravate the situation, because some of the clusters become dangerously close to each other. And we will come back to that.

So, anyway, we have delta, the noise, the size of the clusters, and we have Q, the correlations, how the centers are clumped in each representation. And now we can ask how delta evolves as you go from one representation to another, how Q evolves from one representation to another, and how linear classifier performance changes from one representation to another. The simplicity of these assumptions allows for a systematic, analytical exploration of all this.

Those are the definitions; let's go on. So what would be the ideal situation? The ideal situation would be that I start from some level of noise, which gives my spheres at the input layer, and I may or may not start with some correlation.
The simplest case would be that I start from randomly distributed centers, so Q would be 0. And the best situation would be that, as I propagate the sensory stimuli, delta, the noise, goes to 0. As I said, if the noise goes to 0, you are left with basically points, and those points, given enough dimensionality, are easily classifiable. It would also be good, if the noise doesn't go to 0, to have the clusters spread roughly uniformly; so it would be good to keep Q small.

So let's look at one layer. We have the input layer, the output layer here, and the readout. The first question is what to choose for this feedforward projection. The simplest answer would be to choose it at random. So what we do is just take Gaussian weights in this layer; they're very simple, zero mean, with some normalization, it doesn't matter. Then we project the inputs through them into each one of the units here, and we add a threshold to enforce the sparsity that we want. So whatever the input to this layer is, the threshold makes sure that only the fraction f of units with the largest input will be active, and the rest will be 0. So there is a nonlinearity, which is of course extremely important: if you map one layer to another with a linear transformation, you don't gain anything in terms of classification. So there is a nonlinearity, simply a threshold nonlinearity after a random projection.

All right, so how are we going to analyze this? It's straightforward to actually compute analytically what happens to the noise. Imagine you take two input vectors some Hamming distance apart from each other. You map them, so to speak, through the Gaussian weights, and then you threshold them to get some sparsity. So f is the sparsity; the smaller f is, the sparser it is.
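Here is a minimal sketch of this random expansion, under the same illustrative conventions as above: a zero-mean Gaussian projection followed by a top-f threshold, applied to a template and to a noisy sample from its cluster, so you can compare the normalized distance before and after the expansion. The function names and the use of a per-pattern top-f threshold, rather than a fixed threshold value, are simplifications on my part.

```python
import numpy as np

rng = np.random.default_rng(2)

def random_expansion(N_in, N_out):
    """Zero-mean Gaussian feedforward weights (random, unstructured)."""
    return rng.normal(0.0, 1.0 / np.sqrt(N_in), size=(N_out, N_in))

def sparsify(h, f):
    """Threshold nonlinearity: keep only the fraction f of units with
    the largest input active (binary output)."""
    k = max(1, int(round(f * h.size)))
    out = np.zeros_like(h)
    out[np.argsort(h)[-k:]] = 1.0
    return out

# Compare normalized distance before and after one random expansion stage.
N0, N1, f0, f1 = 100, 1000, 0.5, 0.05
W = random_expansion(N0, N1)
center = (rng.random(N0) < f0).astype(float)
flips = rng.random(N0) < 0.05                # a noisy sample from the cluster
sample = np.abs(center - flips.astype(float))

delta_in = np.mean(np.abs(sample - center)) / (2 * f0 * (1 - f0))
y_center = sparsify(W @ center, f1)
y_sample = sparsify(W @ sample, f1)
delta_out = np.mean(np.abs(y_sample - y_center)) / (2 * f1 * (1 - f1))
print(delta_in, delta_out)   # typically delta_out > delta_in: the noise is amplified
```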
This is the noise level, the normalized sphere radius or Hamming distance, in the output layer plotted against that in the input layer. If you start at 0, of course, you stay at the origin, and if you are random at the input, you will be random there too; those endpoints are fine. But, as you see, in between there is immediately an amplification of the noise as you go from the input to the output. You start from 0.2, but after one layer you actually get to 0.6. And this curve is at a relatively high sparsity, or at least as you go from this curve to that one you increase the sparsity, namely f becomes smaller. And as f becomes smaller, the curve becomes steeper and steeper. So not only do you amplify noise, but the amplification becomes worse the sparser the representation is.

So that is the negative result. The idea that you can gain by expanding data to a higher dimension and making them more separable later on dates back to David Marr's classical theory of the cerebellum. But what we show here is that, if you think not about clean data, a set of points that you want to separate, but about the more realistic case where you have noisy data, data with high variance, then the situation is very different. A random expansion actually amplifies that noise. And that's a theme that we will live with as we go along. Random expansion does separate the templates, but the problem is that it also separates two nearby points within a cluster. Everything becomes separated from everything else, and this is why the noise is amplified.

Now, what about the more subtle thing, the overlap between the centers? On average, the centers are as far apart as random points. But if you look, not at the average, but at individual pairs, you see that there are excess correlations, excess overlaps, between them.
So this is the overlap between the centers. Again, on average it is 0, but the variance is not 0: on average it's like random, but the variance is larger than random. So there is an amplification, a generation of this excess overlap, although it's nicely controlled by sparsity: as f goes down, as the representation becomes sparser, these correlations go down. So that's not a tremendous problem. The major problem, as I said, is the noise.

By the way, you can do a nice exercise where you take this cortical layer representation, do a PCA, and look at the eigenvalue spectrum. If you just look at random sparse points and do the SVD, with the eigenvalues plotted ranked by index, you find the famous Marchenko-Pastur distribution. But, in our case, you see there is extra power. In this case, the input layer has 100 units, and there is extra power in the first 100 eigenvalues, matching the input dimensionality.

Now, why is that? What Q, what nonzero Q, is telling us is the following. You take a set of random points and project them into higher dimensions: you start with 100 dimensions and project into 1,000 dimensions. On average, they are random. So you would imagine that it's a perfect thing: you project them with random weights, and you would imagine that you have just created a set of random points in the expanded-dimension representation. If that were so, then a PCA of this representation would give what you expect from the PCA of a set of random points, and this is this curve. In fact, there is a trace of low dimensionality in the data.

I think that's an important point which I would like to explain. You start from a set of points. If you don't threshold them and you just map them linearly into the 1,000-dimensional space, those 100-dimensional inputs will remain 100-dimensional.
They will just be rotated, and so on, but everything will live in a 100-dimensional subspace. Now you add thresholding, a high threshold giving sparsity. Because of the nonlinearity, that 100-dimensional subspace now spreads into the full 1,000 dimensions. But although this nonlinearity takes the 100-dimensional inputs and makes them 1,000-dimensional, the result is still not like random. This 1,000-dimensional cloud is still elongated; it is not simply uniformly distributed. And this is the signature that you see here: in the largest 100 eigenvalues, there is extra power relative to the random case. The rest is not 0; if you look here, this goes up to 1,000, and the rest is not 0. So the system is, strictly speaking, living in the 1,000-dimensional space, but it's not random. It has increased power in 100 channels.

If you do a readout, a linear classifier readout, what you find, again when you expand with random weights, is that there is an optimal sparsity. This is the readout error of the classifier as a function of the sparsity, for different levels of noise. And you see that, in the case of random weights, very high sparsity is bad; there is an optimal sparsity, or sparseness, and then a shallow increase in the error as you go to a denser representation.

One important point coming from the analysis, which I want to emphasize, and let me skip the equations, is what you see here. The question is: can I do better by further increasing the size of the layer? Here I plot the readout error as a function of the size of the cortical layer. Can I do better? If I make this cortical layer infinitely wide, can I do better? Well, you can do better if you start with zero noise. But if you have noisy inputs, then, basically, the performance saturates. And that's kind of surprising. We were expecting that, if you go to a larger and larger representation, eventually the error would go to 0.
But it doesn't go to 0. And that actually happens even for what we call the structured representation. And it's the same for different types of readout: perceptron, pseudo-inverse, SVM. All of them show this saturation as you increase the size of the cortical layer. That's one of the very important outcomes of our study: when you talk about noisy inputs, and you can think of this as more of a generalization task, there is a limit to what you gain by expanding the representation. Even if you expand in a nonlinear fashion and increase the dimensionality, you cannot combat the noise beyond some level. Beyond that level, there is no point in further expansion, because the error basically saturates.

Since time goes fast, let me talk about the alternatives. If random weights are not doing so well, what are the alternatives? The alternative is to do some kind of unsupervised learning. Here we are doing it with a kind of shortcut for unsupervised learning. What is the shortcut? We say the following. Imagine that the learner of these layers knows about the representation of the clusters. It doesn't know the labels; in other words, it doesn't know which ones are pluses and which are minuses. But it does know about the statistical structure of the input, and this is this S bar, these are the centers. So we want to encode the statistical structure of these inputs in these expansion weights. And the simplest way to do that is with a kind of Hebb rule. We do the following. We first choose, or recruit, or allocate a randomly chosen sparse state here to represent each one of the clusters. These are the R; the R are the randomly chosen patterns here. And then we associate those randomly chosen representations with the actual centers of the input clusters. So this is S bar and R.
And then we do the association by the simple, so-called Hebb rule. This Hebbian rule associates each cluster center with its randomly assigned state in the cortical layer through a simple summation of outer products. There are more sophisticated ways to do it, but that's the simplest.

It turns out that this simple rule has enormous potential for suppressing noise. Again, this is the input noise versus the output noise, the Hamming distances at the input and the output, properly normalized. And you see that, as you go to higher and higher sparseness, to lower and lower f, the input noise is more and more completely quenched. When f is 0.01, for instance, you are already on this curve; when f is 0.05, the curve is sub-linear and sits here; and so on and so forth. So sparse representations, in particular, are very effective in suppressing noise, provided the weights have this kind of unsupervised learning encoded into them, embedding the cluster structure of the inputs.

The same, or a similar, thing is true for Q, for these correlations. This was the random case; this is Q as a function of f. It is extremely suppressed for sparse representations. Basically, it's exponentially small in 1/f, so it's essentially 0 for a sparse representation. Which means that those centers look essentially like randomly distributed points, with very small noise. So you took these spheres and you basically mapped them onto random points with a very small radius. It's not surprising, then, that in this case the error for small f, even for large noise values, is basically small, essentially 0. Nevertheless, it still saturates as a function of the network size, the cortical size. So the saturation of performance as a function of cortical size is a general property of such systems. Nevertheless, the performance itself, for any given size, is extremely impressive, I would say, when the representation is sparse and the noise level is moderate.
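Here is a sketch of this structured, Hebbian expansion, under the same illustrative conventions as the earlier snippets: each cluster center S-bar^mu is associated, via a sum of outer products, with a randomly allocated sparse cortical pattern R^mu, and the same top-f threshold is applied at the output. The mean-subtraction and the scaling of the weights are simplifying assumptions on my part, not necessarily the exact rule used in the analysis.

```python
import numpy as np

rng = np.random.default_rng(3)

def hebbian_weights(centers, f_out, N_out):
    """Associate each cluster center with a randomly allocated sparse
    cortical pattern R^mu by a sum of outer products (a simple Hebb rule).
    Mean activities are subtracted so the rule stores fluctuations, an
    assumption made here to keep the raw outer-product rule well behaved."""
    P, N_in = centers.shape
    R = (rng.random((P, N_out)) < f_out).astype(float)   # random sparse targets
    f_in = centers.mean()
    W = (R - f_out).T @ (centers - f_in) / N_in
    return W, R

def sparsify(h, f):
    """Keep only the top fraction f of units active."""
    k = max(1, int(round(f * h.size)))
    out = np.zeros_like(h)
    out[np.argsort(h)[-k:]] = 1.0
    return out

# A noisy input from cluster mu is mapped close to R^mu, so the cluster
# 'radius' in the cortical layer shrinks instead of growing.
P, N0, N1, f1 = 50, 100, 1000, 0.05
centers = (rng.random((P, N0)) < 0.5).astype(float)
W, R = hebbian_weights(centers, f1, N1)
flips = rng.random(N0) < 0.1
sample = np.abs(centers[0] - flips.astype(float))
y = sparsify(W @ sample, f1)
print(np.mean(np.abs(y - R[0])))   # small: the output lands near the allocated state
```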
OK, let me skip this because I don't have time, and let me briefly talk about the extension of this story to multiple layers. So now we briefly discuss what happens if you take this story and just keep propagating it along the architecture.

Let's start with random weights. The idea would be that maybe something good happens: although initially the performance was poor, maybe we can improve it by cascading such layers. And the answer is no, particularly for the noise level. This is now plotted against the number of layers; what we discussed before is here, at one layer. And you see the problem becomes worse and worse. As you continue to propagate those signals, the noise is amplified and essentially goes to 1. So basically you will get just chance performance if you keep doing this with random weights.

The reason, where is it, I missed a slide, the reason is basically that if you think about the mapping from the noise at one layer to the noise at the next layer, there are two fixed points, 0 and 1. The 0 fixed point is unstable, so everything eventually goes to 1. This system gives you a nice perspective on such deep networks: you can think about them as a kind of dynamical system. For instance, how is the level of noise at one layer related to the level of noise at the previous layer? It's a kind of iterative map, delta_n versus delta_{n-1}. And what's good about this is that, once you draw the curve for how one layer maps to the next, you know what happens in a deep network: you just iterate it. You have to find the fixed points, and which ones are stable and which are not. In this case, 1 is stable and 0 is unstable. So, unfortunately, from any level of noise you start with, you eventually go to 1. For the correlations it's a similar story, and the readout error will go to 0.5. So that does not work very well.
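This dynamical-systems picture can be illustrated numerically: treat the single-layer noise transformation as a map and iterate it. The sketch below just reuses the random-expansion conventions from above, propagating a template and a noisy sample through a cascade of random stages and recording the normalized distance at every layer; it is an illustration of the iteration idea, not the analytical map from the theory.

```python
import numpy as np

rng = np.random.default_rng(4)

def sparsify(h, f):
    k = max(1, int(round(f * h.size)))
    out = np.zeros_like(h)
    out[np.argsort(h)[-k:]] = 1.0
    return out

def delta_across_layers(n_layers, N=1000, f=0.05, flip_prob=0.05, f0=0.5, N0=100):
    """Propagate a template and a noisy sample through a cascade of random
    Gaussian expansions with top-f thresholding, recording the normalized
    distance (the 'delta' of the talk) at every stage."""
    center = (rng.random(N0) < f0).astype(float)
    sample = np.abs(center - (rng.random(N0) < flip_prob).astype(float))
    deltas = [np.mean(np.abs(sample - center)) / (2 * f0 * (1 - f0))]
    for _ in range(n_layers):
        n_in = center.size
        W = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(N, n_in))   # fresh random weights
        center, sample = sparsify(W @ center, f), sparsify(W @ sample, f)
        deltas.append(np.mean(np.abs(sample - center)) / (2 * f * (1 - f)))
    return deltas

print(delta_across_layers(5))   # delta grows toward 1, the noisy stable fixed point
```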
There are cases, by the way, where you can find parameters such that you initially improve, like here, but eventually it still goes to 0.5.

Now, compare this to what happens with structured weights, if you keep doing the same kind of unsupervised Hebbian learning from one layer to another; I'll skip the details. You see the opposite. Here are parameter values for which one expansion stage actually increases the noise, because f is not too small, the load is large, and the starting noise is high. So you can have such a situation. But even in such a situation, the system eventually goes through stages where the noise basically goes to 0. And if you trace why this is so in the iterative-map picture, you see that the picture is very different. You have one fixed point at 0, one fixed point at 1, and an intermediate fixed point at a high value. The intermediate one is unstable, and the other two are stable. So even if you start from fairly large values of noise, as long as you are below that unstable point, you will eventually iterate down to 0. So it does pay to go through several stages of this deep network, to make sure the noise is suppressed to 0. Similarly for the correlations: even if the parameters are such that the correlations initially increase, and you can find parameters like that, eventually the correlations go to almost 0.

And this is a comparison of the readout error as a function of the number of layers with structured weights, against the readout error of an infinitely wide layer, a kind of kernel, an infinitely wide shallow network. Here I compare the same type of unsupervised learning in two different architectures: one is the deep network architecture, and the other is a shallow architecture, infinitely wide.
I'm not claiming that we can show there is no kernel or shallow architecture that would do better. I'm saying that if we compare the same learning rule across these two architectures, you find that you do gain by going through multiple stages of nonlinearity rather than by using a single infinitely wide layer.

I'll skip this. I want to go briefly through two more issues. One issue is recurrent networks. Why recurrent networks? The primary reason is that, in each of the stages I have referred to, if you look at the biology, in most of them, not all, but most, and definitely in neocortex, you find massive recurrent, or lateral, interactions within each layer. So, again, we would like to ask what the computational advantage of having this recurrence is. Now, in our case, we had an extra motivation. Remember that I started by saying that, in some cases, there is experimental evidence that the initial projection is random. So we asked ourselves what happens if we start from a random feedforward projection and then add recurrent connections. Think of it as going from the olfactory bulb, for instance, to piriform cortex, with perhaps random feedforward projections, but with the associational, recurrent connections within piriform cortex being structured.

How do we do that? We imagine starting from the random projection, generating an initial representation by that random projection, and then stabilizing those representations into attractors with the recurrent connections. And that actually works pretty well. It's not the optimal architecture, but it does pretty well. For instance, the noise, which is initially increased by the random projection, is quenched by the convergence to the attractors. And, similarly, Q will not go to 0, but it will not continue growing; it settles at an intermediate value. And the error is quite good.
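As a rough sketch of this attractor cleanup, under the same illustrative conventions as before: a random feedforward projection defines a sparse cortical pattern for each cluster center, a Hopfield-style covariance rule stores those patterns in the recurrent weights, and iterating the recurrent dynamics with the same top-f constraint pulls a noisy input toward the stored state. The particular storage rule and the k-winners-take-all dynamics are my assumptions, not necessarily those used in the work.

```python
import numpy as np

rng = np.random.default_rng(5)

def sparsify(h, f):
    k = max(1, int(round(f * h.size)))
    out = np.zeros_like(h)
    out[np.argsort(h)[-k:]] = 1.0
    return out

# Random feedforward expansion of P cluster centers into a sparse cortical layer.
P, N0, N1, f0, f1 = 20, 100, 1000, 0.5, 0.05
centers = (rng.random((P, N0)) < f0).astype(float)
W_ff = rng.normal(0.0, 1.0 / np.sqrt(N0), size=(N1, N0))
V = np.array([sparsify(W_ff @ c, f1) for c in centers])   # cortical states to stabilize

# Hopfield-style recurrent weights that store these states as attractors
# (covariance rule for sparse patterns; an illustrative choice).
J = (V - f1).T @ (V - f1) / N1
np.fill_diagonal(J, 0.0)

def recurrent_cleanup(x, n_iter=10):
    """Iterate the recurrent dynamics, re-imposing the coding level f1 each step."""
    for _ in range(n_iter):
        x = sparsify(J @ x, f1)
    return x

# A noisy input from cluster 0 is pushed toward the attractor for that cluster.
flips = rng.random(N0) < 0.1
noisy = np.abs(centers[0] - flips.astype(float))
y0 = sparsify(W_ff @ noisy, f1)     # the random projection amplifies the noise...
y = recurrent_cleanup(y0)           # ...and the attractor dynamics quench it
print(np.mean(np.abs(y0 - V[0])), np.mean(np.abs(y - V[0])))
```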
777 00:44:25,760 --> 00:44:30,260 So if you look at this case, the error really 778 00:44:30,260 --> 00:44:32,730 goes down to very low values. 779 00:44:32,730 --> 00:44:34,190 But now it's not layers. 780 00:44:34,190 --> 00:44:36,440 Now it is the number of iterations 781 00:44:36,440 --> 00:44:38,150 of the recurrent dynamics. 782 00:44:38,150 --> 00:44:44,000 So you start from just the input layer, or the random projection, 783 00:44:44,000 --> 00:44:47,450 and then you iterate the dynamics and the error goes to 0. 784 00:44:47,450 --> 00:44:48,420 So it's not the layers. 785 00:44:48,420 --> 00:44:55,160 It's just the dynamics of the convergence to the attractor. 786 00:44:55,160 --> 00:44:56,570 My final point. 787 00:44:56,570 --> 00:44:58,600 I have 3 or 4 minutes? 788 00:44:58,600 --> 00:44:59,300 OK. 789 00:44:59,300 --> 00:45:04,670 My final point before wrapping up is the question of top-down input. 790 00:45:04,670 --> 00:45:09,980 So recurrence, we briefly talked about. 791 00:45:09,980 --> 00:45:13,790 But incorporating contextual knowledge is a major question. 792 00:45:13,790 --> 00:45:17,900 How can you improve on deep networks 793 00:45:17,900 --> 00:45:22,820 by incorporating, not simply the feedforward sensory input, 794 00:45:22,820 --> 00:45:28,400 but other sources of knowledge about this particular stimulus? 795 00:45:28,400 --> 00:45:33,080 And it's important that we are not talking about knowledge 796 00:45:33,080 --> 00:45:35,690 about the statistics of the input, which 797 00:45:35,690 --> 00:45:37,730 can be incorporated into the learning 798 00:45:37,730 --> 00:45:39,200 of the feedforward weights. 799 00:45:39,200 --> 00:45:42,450 We're talking about inputs, 800 00:45:42,450 --> 00:45:47,330 or knowledge, which we have now, given a network which has already 801 00:45:47,330 --> 00:45:49,430 learned whatever it has learned. 802 00:45:49,430 --> 00:45:52,610 So we have a mature network, whatever the architecture is. 803 00:45:52,610 --> 00:45:53,840 We have a sensory input. 804 00:45:53,840 --> 00:45:55,400 It goes feedforward. 805 00:45:55,400 --> 00:45:58,610 And now we have additional information, about 806 00:45:58,610 --> 00:46:00,170 context for instance, that we want 807 00:46:00,170 --> 00:46:03,140 to incorporate with the sensory input 808 00:46:03,140 --> 00:46:05,345 to improve the performance. 809 00:46:05,345 --> 00:46:09,260 So how do we do that? 810 00:46:09,260 --> 00:46:13,460 It turns out to be a non-trivial computational problem. 811 00:46:13,460 --> 00:46:18,980 It is very straightforward to do it in a Bayesian framework, 812 00:46:18,980 --> 00:46:21,550 where you simply update the prior 813 00:46:21,550 --> 00:46:29,070 on what the sensory input is using this contextual information. 814 00:46:29,070 --> 00:46:32,210 But if you want to implement it in a network, 815 00:46:32,210 --> 00:46:36,910 you find that it's not easy to find 816 00:46:36,910 --> 00:46:38,960 the appropriate architecture. 817 00:46:38,960 --> 00:46:43,530 So I'll just briefly talk about how we do it. 818 00:46:43,530 --> 00:46:46,670 So imagine you have, again, these sensory inputs, 819 00:46:46,670 --> 00:46:50,460 but now there is some context, different contexts. 820 00:46:50,460 --> 00:46:53,840 And imagine you have the information 821 00:46:53,840 --> 00:47:00,710 that the input is coming from a particular part of state 822 00:47:00,710 --> 00:47:02,040 space. 
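The Bayesian version really is straightforward. A small sketch, assuming isotropic Gaussian noise around each template and a context cue that simply concentrates the prior on one category; the sizes and the noise level sigma are placeholders.

import numpy as np

rng = np.random.default_rng(1)
N_in, n_cat, n_tok = 50, 30, 30             # placeholder sizes (30 categories x 30 tokens)
sigma = 1.5                                  # assumed noise level
templates = rng.standard_normal((n_cat * n_tok, N_in))
category = np.repeat(np.arange(n_cat), n_tok)

# Noisy observation of token 0, which belongs to category 0
x = templates[0] + sigma * rng.standard_normal(N_in)

# Log-likelihood of each template under isotropic Gaussian noise
log_lik = -np.sum((x - templates) ** 2, axis=1) / (2 * sigma ** 2)

# Top-down context: the prior is concentrated on the cued category,
# ruling out every template from the other 29 categories.
log_prior = np.where(category == 0, np.log(1.0 / n_tok), -np.inf)

print("ML template (no context):   ", np.argmax(log_lik))
print("MAP template (with context):", np.argmax(log_lik + log_prior))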
823 00:47:02,040 --> 00:47:05,030 So basically the question is how to selectively amplify 824 00:47:05,030 --> 00:47:08,630 a specific set of states in a distributed representation. 825 00:47:08,630 --> 00:47:12,790 So usually when we talk about attention, or gating, 826 00:47:12,790 --> 00:47:15,020 or questions like that, we think about, OK, we 827 00:47:15,020 --> 00:47:16,640 have these neurons. 828 00:47:16,640 --> 00:47:20,650 We suppress some of them, or maybe amplify other ones. 829 00:47:20,650 --> 00:47:24,310 Or we have a set of axons, or pathways. 830 00:47:24,310 --> 00:47:26,890 We suppress some, and amplify others. 831 00:47:26,890 --> 00:47:29,410 But what about a representation which is more 832 00:47:29,410 --> 00:47:33,040 distributed, where you really have to suppress states 833 00:47:33,040 --> 00:47:36,730 rather than neural populations? 834 00:47:36,730 --> 00:47:42,850 So I won't go into it-- again, it's a complicated architecture. 835 00:47:42,850 --> 00:47:48,010 But, basically, we're using some sort of a mixed representation, 836 00:47:48,010 --> 00:47:52,090 where we take the sensory input and the category 837 00:47:52,090 --> 00:47:55,510 or contextual input, mix them through a nonlinearity, 838 00:47:55,510 --> 00:47:58,614 use them to clean up the representation, and propagate it. 839 00:47:58,614 --> 00:48:00,280 So it's a more complicated architecture, 840 00:48:00,280 --> 00:48:01,960 but it works beautifully. 841 00:48:01,960 --> 00:48:04,150 Let me show you here an example, and you'll get 842 00:48:04,150 --> 00:48:05,770 a flavor of what we are doing. 843 00:48:05,770 --> 00:48:13,420 So now for the input, we have those 900 spheres or templates, 844 00:48:13,420 --> 00:48:20,110 but they are organized into 30 categories, 845 00:48:20,110 --> 00:48:23,440 with 30 tokens per category. 846 00:48:23,440 --> 00:48:27,400 Now, the tokens, which are the actual sensory inputs, 847 00:48:27,400 --> 00:48:30,880 are represented by, let's say, 200 neurons. 848 00:48:30,880 --> 00:48:32,695 And you have a small number of neurons 849 00:48:32,695 --> 00:48:34,230 representing a category. 850 00:48:34,230 --> 00:48:35,434 Maybe 20 is enough. 851 00:48:35,434 --> 00:48:36,850 So that's important: you don't 852 00:48:36,850 --> 00:48:39,320 really have to expand 853 00:48:39,320 --> 00:48:42,430 the representation dramatically. 854 00:48:42,430 --> 00:48:45,140 So this is the input. 855 00:48:45,140 --> 00:48:48,340 And now we have very noisy inputs. 856 00:48:48,340 --> 00:48:51,440 If you look at the readout-- this axis is layers, 857 00:48:51,440 --> 00:48:52,600 and this is the readout error-- 858 00:48:52,600 --> 00:48:57,140 whether you do it on the input layer, or any subsequent layer 859 00:48:57,140 --> 00:49:00,290 here, but without top-down information, 860 00:49:00,290 --> 00:49:03,610 even with structured interactions and all that I told you, 861 00:49:03,610 --> 00:49:07,420 this is such a noisy input that the performance is basically 862 00:49:07,420 --> 00:49:08,800 0.5. 863 00:49:08,800 --> 00:49:13,150 There is nothing that you can do without top-down information 864 00:49:13,150 --> 00:49:14,540 in this network. 865 00:49:14,540 --> 00:49:17,600 You can ask what the performance would be 866 00:49:17,600 --> 00:49:21,910 if you have an ideal observer that looks at the noisy input 867 00:49:21,910 --> 00:49:27,310 and makes a maximum likelihood categorization. 868 00:49:27,310 --> 00:49:28,920 Well, then it will do much better. 869 00:49:28,920 --> 00:49:31,390 Also not 0, but at this level. 
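To give a flavor of what nonlinear mixing of a sensory input with a small category cue could look like, here is a minimal sketch with random mixing weights and a threshold nonlinearity; the actual architecture, including the cleanup and propagation stages just described, is more elaborate, and N_mix and theta are assumed values.

import numpy as np

rng = np.random.default_rng(2)
N_sens, N_ctx, N_mix = 200, 20, 800   # 200 and 20 loosely follow the example; N_mix is assumed

# Random mixing weights (a placeholder for the actual learned architecture)
W_s = rng.standard_normal((N_mix, N_sens)) / np.sqrt(N_sens)
W_c = rng.standard_normal((N_mix, N_ctx))      # unnormalized so the cue competes with the sensory drive

def mixed_layer(sensory, context, theta=1.0):
    # Threshold units driven jointly by the bottom-up input and the top-down category cue.
    return np.heaviside(sensory @ W_s.T + context @ W_c.T - theta, 0.0)

sensory = rng.standard_normal(N_sens)          # stand-in for a (noisy) token representation
cue = np.zeros(N_ctx)
cue[3] = 1.0                                   # one-hot cue for an assumed category
h = mixed_layer(sensory, cue)
print("fraction of active mixed units:", h.mean())

# Different category cues drive largely distinct subsets of mixed units here,
# which is the kind of handle later stages can use to amplify only the states
# consistent with the cued part of state space.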
870 00:49:33,970 --> 00:49:38,650 This higher error reflects the fact 871 00:49:38,650 --> 00:49:42,190 that this network is still not doing 872 00:49:42,190 --> 00:49:46,990 what an optimal maximum likelihood observer would do. 873 00:49:46,990 --> 00:49:48,320 So this is the network. 874 00:49:48,320 --> 00:49:53,110 This is a maximum likelihood readout, both of them 875 00:49:53,110 --> 00:49:56,320 without extra top-down information. 876 00:49:56,320 --> 00:49:59,950 And in the network that I hinted at, 877 00:49:59,950 --> 00:50:04,570 if you add this top-down information by generating 878 00:50:04,570 --> 00:50:08,740 a mixed representation, you get a performance which is really 879 00:50:08,740 --> 00:50:11,210 dramatically improved. 880 00:50:11,210 --> 00:50:16,690 And as you keep doing it, one layer after another, 881 00:50:16,690 --> 00:50:21,310 you really get very nice performance. 882 00:50:21,310 --> 00:50:24,010 So let me just summarize. 883 00:50:28,490 --> 00:50:31,735 There is one more point before summarizing. 884 00:50:31,735 --> 00:50:33,130 Yeah, OK. 885 00:50:33,130 --> 00:50:33,820 Before that. 886 00:50:33,820 --> 00:50:34,320 OK. 887 00:50:34,320 --> 00:50:40,930 So two points to bear in mind. 888 00:50:40,930 --> 00:50:44,790 One of them is that what I discussed with you today 889 00:50:44,790 --> 00:50:51,460 relies on either assuming random projections, or comparing random 890 00:50:51,460 --> 00:50:56,410 projections to unsupervised learning of a very simple type, 891 00:50:56,410 --> 00:50:59,010 a kind of Hebbian type. 892 00:50:59,010 --> 00:51:07,480 The output readout can be Hebbian, or a perceptron, or an SVM, and so on. 893 00:51:07,480 --> 00:51:09,040 You could ask, what happens if you 894 00:51:09,040 --> 00:51:12,850 use more sophisticated learning rules 895 00:51:12,850 --> 00:51:14,140 for the unsupervised weights? 896 00:51:14,140 --> 00:51:15,280 Some of them we've studied. 897 00:51:15,280 --> 00:51:20,470 But, anyway, that's something which is important to explore. 898 00:51:20,470 --> 00:51:24,640 And another very important issue for thinking 899 00:51:24,640 --> 00:51:28,110 about object recognition in vision 900 00:51:28,110 --> 00:51:33,340 and in other real-life problems is the input statistics. 901 00:51:33,340 --> 00:51:36,070 Because what we assumed is a very simple mixture 902 00:51:36,070 --> 00:51:37,870 of Gaussians model. 903 00:51:37,870 --> 00:51:40,930 So you can think of the task of the network 904 00:51:40,930 --> 00:51:46,270 as taking the variation away 905 00:51:46,270 --> 00:51:49,080 from the center of each sphere 906 00:51:49,080 --> 00:51:52,870 and generating a representation which is invariant to that. 907 00:51:52,870 --> 00:51:55,510 But this is a very simple invariance problem, 908 00:51:55,510 --> 00:51:58,450 because the invariance was simply 909 00:51:58,450 --> 00:52:03,790 restricted to these simple geometric structures. 910 00:52:03,790 --> 00:52:12,070 Problems which are closer to real-life problems 911 00:52:12,070 --> 00:52:17,200 will have inputs which essentially have 912 00:52:17,200 --> 00:52:19,240 some structure, but the structure 913 00:52:19,240 --> 00:52:22,860 can take a variety of shapes. 914 00:52:22,860 --> 00:52:27,090 Each one of them corresponds to an object, or a cluster, 915 00:52:27,090 --> 00:52:31,850 or a manifold representing an entity, a perceptual entity. 
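For concreteness, a minimal sketch of this kind of input model: each sample sits at a fixed distance from a template center, and the variation over the sphere is exactly what the representation should become invariant to. The sizes and the radius are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(3)
N_in, P, radius = 100, 20, 0.5              # arbitrary input dimension, template count, sphere radius
centers = rng.standard_normal((P, N_in))    # template centers (the "perceptual entities")

def sample_on_sphere(center):
    # A point at a fixed distance from the center: the within-template variation
    # that the representation is supposed to be invariant to.
    u = rng.standard_normal(center.shape)
    return center + radius * u / np.linalg.norm(u)

x = sample_on_sphere(centers[7])
# Nearest-center decoding does not care where on the sphere x landed,
# as long as the spheres of different templates do not overlap.
print("decoded template:", np.argmin(np.linalg.norm(centers - x, axis=1)))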
916 00:52:31,850 --> 00:52:37,220 But how you go from this nice, simple 917 00:52:37,220 --> 00:52:43,100 spherical invariance problem to those problems 918 00:52:43,100 --> 00:52:45,780 is, of course, a challenge. 919 00:52:45,780 --> 00:52:50,300 And that's ongoing work, 920 00:52:50,300 --> 00:52:53,990 also with SueYeon Chung and Dan Lee. 921 00:52:53,990 --> 00:53:00,130 But it's a story which is still unfolding.