The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

HAIM SOMPOLINSKY: My topic today is sensory representations in deep cortex-like architectures. I should say the topic is perhaps toward a theory of sensory representations in deep networks. As you will see, our attempt is to develop a systematic theoretical understanding of the capacity and limitations of architectures of that type.

The general context is well known. In many sensory systems, information propagates from the periphery, like the retina, to primary visual cortex, and then through many stages up to a very high level, or maybe to the hippocampal structures. It's not purely feedforward: there are massive backward or top-down connections, and recurrent connections, and I'll talk about some of those extra features. But the most intuitive feature is simply a transformation, or filtering, of data across multiple stages. The same holds in the auditory pathway, and in other systems we see a similar structure, or aspects of a similar structure, as well.

A well-known, classical model system for computational neuroscience is the cerebellum, where information comes in through the mossy fiber layer, expands enormously into the granule layer, and then converges onto a Purkinje cell. So if you look at a single Purkinje cell, the output of the cerebellum, as a unit, you see first an expansion, from on the order of 1,000 mossy fiber inputs to something two orders of magnitude larger in the granule layer, and then a convergence of 200,000 or so parallel fibers onto a single Purkinje cell. And there are many, many modules of that type across the cerebellum. So, again, a transformation which involves, in this case, expansion and then convergence.
In the basal ganglia, which I wouldn't categorize as a sensory system, it is more related to motor function, you nevertheless see cortex converging first onto various stages of the basal ganglia and then expanding again back to cortex. The hippocampus also has multiple pathways, but some of them include a convergence, for instance a convergence onto CA3, and then an expansion again back to cortex. And there are multiple other pathways as well, with sensory information propagating, of course, across their different stages.

And, finally, there is the artificial network story of deep neural networks, which all of you may have heard. An input layer, then a sequence of stages, purely feedforward. And at least in the canonical leading networks, the output layer performs an object recognition or object classification task, and the whole network is trained by backprop, by supervised learning for that task.

What I'll talk about is more in the spirit of the idea that the first stages are more general purpose than the specific classification task at the output layer. So there are many issues: the number of stages that are required, their sizes, why compression or expansion. In many systems, you'll see that the fraction of active neurons is small in the expanded layer. That's what we call sparseness. So high sparseness means a small number of neurons active for any given stimulus. The terminology is somewhat confusing, but high sparseness means a small number of active neurons.

One important and crucial question is how to transform. What are the filters, the weights, that are good for transforming sensory information from one layer to another? And, in particular, whether random weights are good enough, or maybe even optimal in some sense, or whether one needs more structure, more learned synaptic weights.
This is a crucial question, perhaps not for machine learning but for computational neuroscience, because there is some experimental evidence, for at least some of the systems that have been studied, that the mapping from the compressed, original representation to the sparse representation is actually done by randomly connected weights. One example is olfactory cortex: the mapping of the olfactory representation from the olfactory bulb, from the glomerular layer, to the piriform cortex seems to be random, as far as one can tell. Similarly, in the cerebellum, the example I mentioned before, when one looks at the mapping from the mossy fibers to the granule cells, again an enormous expansion by a few orders of magnitude, the weights nevertheless seem to be random. Now, of course, one cannot say conclusively that they are random and that there are no subtle correlations or structures. But, nevertheless, there is a strong motivation to ask whether random projections are good enough. And if not, what does it mean to be structured? What kind of structure is appropriate for this task? There are also the questions of top-down and feedback loops, recurrent connections, and so on. All of that I hope to at least briefly mention later in my talk.

Before I continue: most of, or a large part of, this talk is based on published and unpublished work with Baktash Babadi, who was until recently a postdoctoral Swartz Fellow at Harvard University and has since gone on to practice medicine; Elia Frankin, a master's student at the Hebrew University; SueYeon, whom all of you know here at Harvard; Uri Cohen, a PhD student at the Hebrew University; and Dan Lee from the University of Pennsylvania.

So here is our formalization of the problem. We have an input layer, denoted 0. Typically it's a small, compressed layer with a dense representation, so here every input will activate maybe half of the population, on average. Then there is a feedforward layer of synaptic weights, which expands to a higher-dimensional layer, which we call the cortical layer.
It's expanded in terms of the number of neurons; this will be S1. It is sparse because f, the fraction of neurons that are active for each given input vector, will be small. So it is expanded and sparse. That will be the first part of my talk. Then, later on, I'll talk about staging, cascading this transformation over several stages. And ultimately there is a readout, which will be some classification task. Readout 1 will be one classification rule, readout 2 another classification rule, et cetera, each of them with synaptic weights which are learned to perform that task. So we call that the supervised layer, and those are the unsupervised layers.

That's the formalization of the problem. And, as you will see, we'll make enormously simplifying abstractions of the real biological system in order to try to gain some insight into the computational capacity of such systems.

The first important question is: what is the statistics, the statistical structure, of the input? The input is an N-dimensional vector, where N, or N0, is the number of units here, and each sensory event evokes a pattern of activity. But what is the statistical structure that we are working with? The simplest one, which we are going to discuss, is the following. We assume, basically, that the inputs come from a mixture of, so to speak, Gaussian statistics. It's not actually going to be Gaussian because, for simplicity, we'll assume the units are binary, but that doesn't really matter. So imagine that this is a caricature, a graphical representation, of a high-dimensional space, and imagine that the sensory inputs are clustered around templates, or cluster centers. These are the centers of these balls, and the inputs themselves come from the neighborhoods of those templates.
So each input will be one point in this space, and it will originate from the ensemble around one of those template states. That's the simple picture. And, in real space, the network maps each of those states into another state in the next layer. Then, finally, the task: imagine that some of those balls are classified as plus and some as minus. Say these are olfactory stimuli, and some of them are classified as appetitive and some as aversive. So the readout unit at the output layer has to classify some of those spheres as plus and some as minus. And, of course, depending on how many there are, their dimensionality, and their locations, this may or may not be an easy problem. So, for instance, here it's fine: a linear classifier on the input space can do it. Here, I think there should be some mistakes; yeah, here. So here is a case where a linear classifier at the input layer cannot do it.

And that's a theme which is very popular, both in computational and systems neuroscience and in machine learning. The following question comes up. Suppose we see that there is a transformation of data from, let's say, the photoreceptor layer in vision to the ganglion cells at the output of the retina, and then to cortex in several stages. How do we gauge, how do we assess, what the advantage is for the brain in transforming information from, let's say, retina to V1, and so on and so forth? After all, in this feedforward architecture, no net information is generated at the next layer. So if no net information is generated, the question is: what did we gain by these transformations? One possible answer is that they reformat the sensory representation into a different representation which makes subsequent computations simpler. So what does it mean for subsequent computation to be simpler?
One notion of simplicity is whether the subsequent computation can be realized by a simple linear readout. That's the strategy we are going to adopt here: to ask, as the representation changes from one layer to another, how well a linear readout can perform the task. So that's the input; that's the story. And then, as I said, there is an input, unsupervised representations, and a supervised readout at the end.

I need to introduce notation. Bear with me; this is a computational talk. I cannot just talk about ideas, because the whole point is to be able to actually come up with a quantitative theory that tests ideas. So let me introduce notation. At each layer, you can ask what the representation of the centers of these stimuli is. I'll denote a center by a bar, and mu is the index of the pattern. So mu goes from 1 to P, where P is the number of those balls, those spheres, or the number of clusters, if you think about clustering some sensory data. So P is the number of clusters. i, from 1 to N, simply indexes the neuron, the unit, whose activation we look at for each mu. And L is the layer: 0 is the input layer, and it goes up to layer L; so this would be 0, 1, and so on. The mean activation at each layer, from layer 1 onward, is held constant at f. f goes from 0 to 1, and the smaller f is, the sparser the representation. We will assume that the input representation is dense, so there f is 0.5. N, again, we'll assume for simplicity to be constant across layers, except for the first layer, where there is expansion. You can vary those parameters, and the theory actually accommodates such variations, but that's the simplest architecture: you expand a dense representation into a sparse, higher-dimensional one, and you keep doing that as you go along. So that's the notation.
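To make these definitions concrete, here is a minimal sketch, in Python with NumPy, of one way to generate this kind of input ensemble: P binary templates of dimension N0 with coding level f0 = 0.5, plus noisy samples obtained by flipping bits around a template. The function names and the bit-flip parameterization of the noise are illustrative assumptions, not the definitions used in the actual analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_templates(P, N0, f0=0.5):
    """P random binary cluster centers with coding level f0."""
    return (rng.random((P, N0)) < f0).astype(float)

def noisy_sample(template, flip_prob):
    """A point in the 'ball' around a template: each bit is flipped
    independently with probability flip_prob (illustrative noise model;
    flip_prob = 0.5 would give a completely random pattern)."""
    flips = rng.random(template.shape) < flip_prob
    return np.abs(template - flips.astype(float))

# Example: 50 clusters in a 100-dimensional dense input layer.
P, N0 = 50, 100
centers = make_templates(P, N0)              # the S-bar^mu of the talk
x = noisy_sample(centers[0], flip_prob=0.1)  # one noisy input from cluster 1
```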
Now, how do we assess what the next stages are doing to those clusters? As I said, one measure is to take a linear classifier and see how it performs. But you can also look at the statistics of the projected sensory stimuli at each layer and learn something from that. Basically, I'm going to suggest looking at two major statistical aspects of the data in each layer of the transformation. One of them is noise, and one of them is correlation.

So what is noise? Noise will simply be the radius, or a measure of the radius, of the sphere. If you had only the templates as inputs, the problem would be simple; it would be easy as long as you have enough dimensions. You expand, you can easily apply a linear classifier, and you solve the problem. The problem, in our case, is that the input is actually an infinite number of inputs, or an exponentially large number of possible inputs, because they all come from Gaussian noise, or a binarized version of Gaussian noise, around the templates. I'll denote the noise by delta. Delta equal to 0 means no noise. The normalization is such that delta equal to 1 means the patterns are random: basically, you cannot tell whether the input is coming from this cluster or from any other point in the input space.

The other quantity, correlations, is more subtle. I'm going to assume that those balls come from a roughly uniform distribution. Imagine you take a template here and draw a ball around it; you take another template there and draw a ball around it; everything is more or less uniformly distributed. The only structure is the fact that the data comes from this mixture of Gaussians, these noisy patterns around the centers. So that's fine.
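Delta can be made concrete with a small sketch. The normalization below, the raw Hamming distance per unit divided by 2f(1 - f) so that two statistically independent patterns with coding level f give delta close to 1, is one plausible choice consistent with what was just said; it is not necessarily the exact definition used in the analysis.

```python
import numpy as np

rng = np.random.default_rng(1)

def delta(x, center, f):
    """Normalized distance between a pattern and its cluster center.
    Identical patterns give 0; independent random patterns with coding
    level f give a value near 1 (normalization assumed for illustration)."""
    return np.mean(np.abs(x - center)) / (2.0 * f * (1.0 - f))

# Sanity check of the normalization with dense (f = 0.5) random patterns.
f, N = 0.5, 10000
a = (rng.random(N) < f).astype(float)
b = (rng.random(N) < f).astype(float)
print(delta(a, a, f), delta(a, b, f))   # approximately 0.0 and 1.0
```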
But as you project those clusters into the next stage, I claim that those centers, those templates, get a new representation which can actually have structure in it, simply because you push all of them through the same, common synaptic weights into the next layer. I'm going to measure this by Q. Basically, low Q, or Q equal to 0, corresponds to randomly, uniformly distributed centers, and I'll always start from that at the input layer. But then there is a danger, or it might happen, that as you propagate this representation through the next layer, the centers, the structure of the data, will look like this: on average, the distance between two centers is the same as before, but they are clumped together. It's a kind of random clustering of the clusters. And that can be induced by the fact that the data is fed forward from this representation. That can pose a problem. If there is no noise, then again there is no problem; you can still differentiate between them. But if there is noise, this can aggravate the situation, because some of the clusters become dangerously close to each other. And we will come back to that.

So, anyway, we have delta, the noise, the size of the clusters, and we have Q, the correlations, how the centers are clumped in each representation. And now we can ask how delta evolves as you go from one representation to another, how Q evolves from one representation to another, and how linear classifier performance changes from one representation to another. The simplicity of these assumptions allows for a systematic, analytical exploration of all this.

Those are the definitions; let's go on. So what would be the ideal situation? The ideal situation would be that I start from some level of noise, which gives my spheres at the input layer, and I may or may not start with some correlation.
The simplest case would be that I start from randomly distributed centers, so Q would be 0. And the best situation would be that, as I propagate the sensory stimuli, delta, the noise, goes to 0. As I said, if the noise goes to 0, you are left with basically points, and those points, given enough dimensionality, are easily classifiable. It would also be good, if the noise doesn't go to 0, to have the clusters spread roughly uniformly; so it would be good to keep Q small.

So let's look at one layer. We have the input layer, the output layer here, and the readout. The first question is what to choose for this feedforward projection. The simplest answer would be to choose it at random. So what we do is just take Gaussian weights in this layer; they're very simple, zero mean, with some normalization, it doesn't matter. Then we project the inputs through them into each one of the units here, and we add a threshold to enforce the sparsity that we want. So whatever the input to this layer is, the threshold makes sure that only the fraction f of units with the largest input will be active, and the rest will be 0. So there is a nonlinearity, which is of course extremely important: if you map one layer to another with a linear transformation, you don't gain anything in terms of classification. So there is a nonlinearity, simply a threshold nonlinearity after a random projection.

All right, so how are we going to analyze this? It's straightforward to actually compute analytically what happens to the noise. Imagine you take two input vectors some Hamming distance apart from each other. You map them, so to speak, through the Gaussian weights, and then you threshold them to get some sparsity. So f is the sparsity; the smaller f is, the sparser it is.
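Here is a minimal sketch of this random expansion, under the same illustrative conventions as above: a zero-mean Gaussian projection followed by a top-f threshold, applied to a template and to a noisy sample from its cluster, so you can compare the normalized distance before and after the expansion. The function names and the use of a per-pattern top-f threshold, rather than a fixed threshold value, are simplifications on my part.

```python
import numpy as np

rng = np.random.default_rng(2)

def random_expansion(N_in, N_out):
    """Zero-mean Gaussian feedforward weights (random, unstructured)."""
    return rng.normal(0.0, 1.0 / np.sqrt(N_in), size=(N_out, N_in))

def sparsify(h, f):
    """Threshold nonlinearity: keep only the fraction f of units with
    the largest input active (binary output)."""
    k = max(1, int(round(f * h.size)))
    out = np.zeros_like(h)
    out[np.argsort(h)[-k:]] = 1.0
    return out

# Compare normalized distance before and after one random expansion stage.
N0, N1, f0, f1 = 100, 1000, 0.5, 0.05
W = random_expansion(N0, N1)
center = (rng.random(N0) < f0).astype(float)
flips = rng.random(N0) < 0.05                # a noisy sample from the cluster
sample = np.abs(center - flips.astype(float))

delta_in = np.mean(np.abs(sample - center)) / (2 * f0 * (1 - f0))
y_center = sparsify(W @ center, f1)
y_sample = sparsify(W @ sample, f1)
delta_out = np.mean(np.abs(y_sample - y_center)) / (2 * f1 * (1 - f1))
print(delta_in, delta_out)   # typically delta_out > delta_in: the noise is amplified
```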
This is the noise level, the normalized sphere radius or Hamming distance, in the output layer plotted against that in the input layer. If you start at 0, of course, you stay at the origin, and if you are random at the input, you will be random there too; those endpoints are fine. But, as you see, in between there is immediately an amplification of the noise as you go from the input to the output. You start from 0.2, but after one layer you actually get to 0.6. And this curve is at a relatively high sparsity, or at least as you go from this curve to that one you increase the sparsity, namely f becomes smaller. And as f becomes smaller, the curve becomes steeper and steeper. So not only do you amplify noise, but the amplification becomes worse the sparser the representation is.

So that is the negative result. The idea that you can gain by expanding data to a higher dimension and making them more separable later on dates back to David Marr's classical theory of the cerebellum. But what we show here is that, if you think not about clean data, a set of points that you want to separate, but about the more realistic case where you have noisy data, data with high variance, then the situation is very different. A random expansion actually amplifies that noise. And that's a theme that we will live with as we go along. Random expansion does separate the templates, but the problem is that it also separates two nearby points within a cluster. Everything becomes separated from everything else, and this is why the noise is amplified.

Now, what about the more subtle thing, the overlap between the centers? On average, the centers are as far apart as random points. But if you look, not at the average, but at individual pairs, you see that there are excess correlations, excess overlaps, between them.
So this is the overlap between the centers. Again, on average it is 0, but the variance is not 0: on average it's like random, but the variance is larger than random. So there is an amplification, a generation of this excess overlap, although it's nicely controlled by sparsity: as f goes down, as the representation becomes sparser, these correlations go down. So that's not a tremendous problem. The major problem, as I said, is the noise.

By the way, you can do a nice exercise where you take this cortical layer representation, do a PCA, and look at the eigenvalue spectrum. If you just look at random sparse points and do the SVD, with the eigenvalues plotted ranked by index, you find the famous Marchenko-Pastur distribution. But, in our case, you see there is extra power. In this case, the input layer has 100 units, and there is extra power in the first 100 eigenvalues, matching the input dimensionality.

Now, why is that? What Q, what nonzero Q, is telling us is the following. You take a set of random points and project them into higher dimensions: you start with 100 dimensions and project into 1,000 dimensions. On average, they are random. So you would imagine that it's a perfect thing: you project them with random weights, and you would imagine that you have just created a set of random points in the expanded-dimension representation. If that were so, then a PCA of this representation would give what you expect from the PCA of a set of random points, and this is this curve. In fact, there is a trace of low dimensionality in the data.

I think that's an important point which I would like to explain. You start from a set of points. If you don't threshold them and you just map them linearly into the 1,000-dimensional space, those 100-dimensional inputs will remain 100-dimensional.
They will just be rotated, and so on, but everything will live in a 100-dimensional subspace. Now you add thresholding, a high threshold giving sparsity. Because of the nonlinearity, that 100-dimensional subspace now spreads into the full 1,000 dimensions. But although this nonlinearity takes the 100-dimensional inputs and makes them 1,000-dimensional, the result is still not like random. This 1,000-dimensional cloud is still elongated; it is not simply uniformly distributed. And this is the signature that you see here: in the largest 100 eigenvalues, there is extra power relative to the random case. The rest is not 0; if you look here, this goes up to 1,000, and the rest is not 0. So the system is, strictly speaking, living in the 1,000-dimensional space, but it's not random. It has increased power in 100 channels.

If you do a readout, a linear classifier readout, what you find, again when you expand with random weights, is that there is an optimal sparsity. This is the readout error of the classifier as a function of the sparsity, for different levels of noise. And you see that, in the case of random weights, very high sparsity is bad; there is an optimal sparsity, or sparseness, and then a shallow increase in the error as you go to a denser representation.

One important point coming from the analysis, which I want to emphasize, and let me skip the equations, is what you see here. The question is: can I do better by further increasing the size of the layer? Here I plot the readout error as a function of the size of the cortical layer. Can I do better? If I make this cortical layer infinitely wide, can I do better? Well, you can do better if you start with zero noise. But if you have noisy inputs, then, basically, the performance saturates. And that's kind of surprising. We were expecting that, if you go to a larger and larger representation, eventually the error would go to 0.
But it doesn't go to 0. And that actually happens even for what we call the structured representation. And it's the same for different types of readout: perceptron, pseudo-inverse, SVM. All of them show this saturation as you increase the size of the cortical layer. That's one of the very important outcomes of our study: when you talk about noisy inputs, and you can think of this as more of a generalization task, there is a limit to what you gain by expanding the representation. Even if you expand in a nonlinear fashion and increase the dimensionality, you cannot combat the noise beyond some level. Beyond that level, there is no point in further expansion, because the error basically saturates.

Since time goes fast, let me talk about the alternatives. If random weights are not doing so well, what are the alternatives? The alternative is to do some kind of unsupervised learning. Here we are doing it with a kind of shortcut for unsupervised learning. What is the shortcut? We say the following. Imagine that the learner of these layers knows about the representation of the clusters. It doesn't know the labels; in other words, it doesn't know which ones are pluses and which are minuses. But it does know about the statistical structure of the input, and this is this S bar, these are the centers. So we want to encode the statistical structure of these inputs in these expansion weights. And the simplest way to do that is with a kind of Hebb rule. We do the following. We first choose, or recruit, or allocate a randomly chosen sparse state here to represent each one of the clusters. These are the R; the R are the randomly chosen patterns here. And then we associate those randomly chosen representations with the actual centers of the input clusters. So this is S bar and R.
And then we do the association by the simple, so-called Hebb rule. This Hebbian rule associates each cluster center with its randomly assigned state in the cortical layer through a simple summation of outer products. There are more sophisticated ways to do it, but that's the simplest.

It turns out that this simple rule has enormous potential for suppressing noise. Again, this is the input noise versus the output noise, the Hamming distances at the input and the output, properly normalized. And you see that, as you go to higher and higher sparseness, to lower and lower f, the input noise is more and more completely quenched. When f is 0.01, for instance, you are already on this curve; when f is 0.05, the curve is sub-linear and sits here; and so on and so forth. So sparse representations, in particular, are very effective in suppressing noise, provided the weights have this kind of unsupervised learning encoded into them, embedding the cluster structure of the inputs.

The same, or a similar, thing is true for Q, for these correlations. This was the random case; this is Q as a function of f. It is extremely suppressed for sparse representations. Basically, it's exponentially small in 1/f, so it's essentially 0 for a sparse representation. Which means that those centers look essentially like randomly distributed points, with very small noise. So you took these spheres and you basically mapped them onto random points with a very small radius. It's not surprising, then, that in this case the error for small f, even for large noise values, is basically small, essentially 0. Nevertheless, it still saturates as a function of the network size, the cortical size. So the saturation of performance as a function of cortical size is a general property of such systems. Nevertheless, the performance itself, for any given size, is extremely impressive, I would say, when the representation is sparse and the noise level is moderate.
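Here is a sketch of this structured, Hebbian expansion, under the same illustrative conventions as the earlier snippets: each cluster center S-bar^mu is associated, via a sum of outer products, with a randomly allocated sparse cortical pattern R^mu, and the same top-f threshold is applied at the output. The mean-subtraction and the scaling of the weights are simplifying assumptions on my part, not necessarily the exact rule used in the analysis.

```python
import numpy as np

rng = np.random.default_rng(3)

def hebbian_weights(centers, f_out, N_out):
    """Associate each cluster center with a randomly allocated sparse
    cortical pattern R^mu by a sum of outer products (a simple Hebb rule).
    Mean activities are subtracted so the rule stores fluctuations, an
    assumption made here to keep the raw outer-product rule well behaved."""
    P, N_in = centers.shape
    R = (rng.random((P, N_out)) < f_out).astype(float)   # random sparse targets
    f_in = centers.mean()
    W = (R - f_out).T @ (centers - f_in) / N_in
    return W, R

def sparsify(h, f):
    """Keep only the top fraction f of units active."""
    k = max(1, int(round(f * h.size)))
    out = np.zeros_like(h)
    out[np.argsort(h)[-k:]] = 1.0
    return out

# A noisy input from cluster mu is mapped close to R^mu, so the cluster
# 'radius' in the cortical layer shrinks instead of growing.
P, N0, N1, f1 = 50, 100, 1000, 0.05
centers = (rng.random((P, N0)) < 0.5).astype(float)
W, R = hebbian_weights(centers, f1, N1)
flips = rng.random(N0) < 0.1
sample = np.abs(centers[0] - flips.astype(float))
y = sparsify(W @ sample, f1)
print(np.mean(np.abs(y - R[0])))   # small: the output lands near the allocated state
```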
OK, let me skip this because I don't have time, and let me briefly talk about the extension of this story to multiple layers. So now we briefly discuss what happens if you take this story and just keep propagating it along the architecture.

Let's start with random weights. The idea would be that maybe something good happens: although initially the performance was poor, maybe we can improve it by cascading such layers. And the answer is no, particularly for the noise level. This is now plotted against the number of layers; what we discussed before is here, at one layer. And you see the problem becomes worse and worse. As you continue to propagate those signals, the noise is amplified and essentially goes to 1. So basically you will get just chance performance if you keep doing this with random weights.

The reason, where is it, I missed a slide, the reason is basically that if you think about the mapping from the noise at one layer to the noise at the next layer, there are two fixed points, 0 and 1. The 0 fixed point is unstable, so everything eventually goes to 1. This system gives you a nice perspective on such deep networks: you can think about them as a kind of dynamical system. For instance, how is the level of noise at one layer related to the level of noise at the previous layer? It's a kind of iterative map, delta_n versus delta_{n-1}. And what's good about this is that, once you draw the curve for how one layer maps to the next, you know what happens in a deep network: you just iterate it. You have to find the fixed points, and which ones are stable and which are not. In this case, 1 is stable and 0 is unstable. So, unfortunately, from any level of noise you start with, you eventually go to 1. For the correlations it's a similar story, and the readout error will go to 0.5. So that does not work very well.
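This dynamical-systems picture can be illustrated numerically: treat the single-layer noise transformation as a map and iterate it. The sketch below just reuses the random-expansion conventions from above, propagating a template and a noisy sample through a cascade of random stages and recording the normalized distance at every layer; it is an illustration of the iteration idea, not the analytical map from the theory.

```python
import numpy as np

rng = np.random.default_rng(4)

def sparsify(h, f):
    k = max(1, int(round(f * h.size)))
    out = np.zeros_like(h)
    out[np.argsort(h)[-k:]] = 1.0
    return out

def delta_across_layers(n_layers, N=1000, f=0.05, flip_prob=0.05, f0=0.5, N0=100):
    """Propagate a template and a noisy sample through a cascade of random
    Gaussian expansions with top-f thresholding, recording the normalized
    distance (the 'delta' of the talk) at every stage."""
    center = (rng.random(N0) < f0).astype(float)
    sample = np.abs(center - (rng.random(N0) < flip_prob).astype(float))
    deltas = [np.mean(np.abs(sample - center)) / (2 * f0 * (1 - f0))]
    for _ in range(n_layers):
        n_in = center.size
        W = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(N, n_in))   # fresh random weights
        center, sample = sparsify(W @ center, f), sparsify(W @ sample, f)
        deltas.append(np.mean(np.abs(sample - center)) / (2 * f * (1 - f)))
    return deltas

print(delta_across_layers(5))   # delta grows toward 1, the noisy stable fixed point
```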
There are cases, by the way, where you can find parameters such that you initially improve, like here, but eventually it still goes to 0.5.

Now, compare this to what happens with structured weights, if you keep doing the same kind of unsupervised Hebbian learning from one layer to another; I'll skip the details. You see the opposite. Here are parameter values for which one expansion stage actually increases the noise, because f is not too small, the load is large, and the starting noise is high. So you can have such a situation. But even in such a situation, the system eventually goes through stages where the noise basically goes to 0. And if you trace why this is so in the iterative-map picture, you see that the picture is very different. You have one fixed point at 0, one fixed point at 1, and an intermediate fixed point at a high value. The intermediate one is unstable, and the other two are stable. So even if you start from fairly large values of noise, as long as you are below that unstable point, you will eventually iterate down to 0. So it does pay to go through several stages of this deep network, to make sure the noise is suppressed to 0. Similarly for the correlations: even if the parameters are such that the correlations initially increase, and you can find parameters like that, eventually the correlations go to almost 0.

And this is a comparison of the readout error as a function of the number of layers with structured weights, against the readout error of an infinitely wide layer, a kind of kernel, an infinitely wide shallow network. Here I compare the same type of unsupervised learning in two different architectures: one is the deep network architecture, and the other is a shallow architecture, infinitely wide.
I'm not claiming that we can show there is no kernel or shallow architecture that would do better. I'm saying that if we compare the same learning rule across these two architectures, you find that you do gain by going through multiple stages of nonlinearity rather than by using a single infinitely wide layer.

I'll skip this. I want to go briefly through two more issues. One issue is recurrent networks. Why recurrent networks? The primary reason is that, in each of the stages I have referred to, if you look at the biology, in most of them, not all, but most, and definitely in neocortex, you find massive recurrent, or lateral, interactions within each layer. So, again, we would like to ask what the computational advantage of having this recurrence is. Now, in our case, we had an extra motivation. Remember that I started by saying that, in some cases, there is experimental evidence that the initial projection is random. So we asked ourselves what happens if we start from a random feedforward projection and then add recurrent connections. Think of it as going from the olfactory bulb, for instance, to piriform cortex, with perhaps random feedforward projections, but with the associational, recurrent connections within piriform cortex being structured.

How do we do that? We imagine starting from the random projection, generating an initial representation by that random projection, and then stabilizing those representations into attractors with the recurrent connections. And that actually works pretty well. It's not the optimal architecture, but it does pretty well. For instance, the noise, which is initially increased by the random projection, is quenched by the convergence to the attractors. And, similarly, Q will not go to 0, but it will not continue growing; it settles at an intermediate value. And the error is quite good.
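As a rough sketch of this attractor cleanup, under the same illustrative conventions as before: a random feedforward projection defines a sparse cortical pattern for each cluster center, a Hopfield-style covariance rule stores those patterns in the recurrent weights, and iterating the recurrent dynamics with the same top-f constraint pulls a noisy input toward the stored state. The particular storage rule and the k-winners-take-all dynamics are my assumptions, not necessarily those used in the work.

```python
import numpy as np

rng = np.random.default_rng(5)

def sparsify(h, f):
    k = max(1, int(round(f * h.size)))
    out = np.zeros_like(h)
    out[np.argsort(h)[-k:]] = 1.0
    return out

# Random feedforward expansion of P cluster centers into a sparse cortical layer.
P, N0, N1, f0, f1 = 20, 100, 1000, 0.5, 0.05
centers = (rng.random((P, N0)) < f0).astype(float)
W_ff = rng.normal(0.0, 1.0 / np.sqrt(N0), size=(N1, N0))
V = np.array([sparsify(W_ff @ c, f1) for c in centers])   # cortical states to stabilize

# Hopfield-style recurrent weights that store these states as attractors
# (covariance rule for sparse patterns; an illustrative choice).
J = (V - f1).T @ (V - f1) / N1
np.fill_diagonal(J, 0.0)

def recurrent_cleanup(x, n_iter=10):
    """Iterate the recurrent dynamics, re-imposing the coding level f1 each step."""
    for _ in range(n_iter):
        x = sparsify(J @ x, f1)
    return x

# A noisy input from cluster 0 is pushed toward the attractor for that cluster.
flips = rng.random(N0) < 0.1
noisy = np.abs(centers[0] - flips.astype(float))
y0 = sparsify(W_ff @ noisy, f1)     # the random projection amplifies the noise...
y = recurrent_cleanup(y0)           # ...and the attractor dynamics quench it
print(np.mean(np.abs(y0 - V[0])), np.mean(np.abs(y - V[0])))
```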
777 00:44:25,760 --> 00:44:30,260 So if you look at this case, the error really 778 00:44:30,260 --> 00:44:32,730 goes down to very low values. 779 00:44:32,730 --> 00:44:34,190 But now it's not layers. 780 00:44:34,190 --> 00:44:36,440 Now it is the number of iterations 781 00:44:36,440 --> 00:44:38,150 of the recurrent dynamics. 782 00:44:38,150 --> 00:44:44,000 So you start from just the input layer, or the random projection, 783 00:44:44,000 --> 00:44:47,450 and then you iterate the dynamics and the error goes to 0. 784 00:44:47,450 --> 00:44:48,420 So it's not the layers. 785 00:44:48,420 --> 00:44:55,160 It's just the dynamics of the convergence to the attractor. 786 00:44:55,160 --> 00:44:56,570 My final point. 787 00:44:56,570 --> 00:44:58,600 I have 3 or 4 minutes? 788 00:44:58,600 --> 00:44:59,300 OK. 789 00:44:59,300 --> 00:45:04,670 My final point before wrapping up is the question of top-down input. 790 00:45:04,670 --> 00:45:09,980 So recurrence, we briefly talked about. 791 00:45:09,980 --> 00:45:13,790 But incorporating contextual knowledge is a major question. 792 00:45:13,790 --> 00:45:17,900 How can you improve on deep networks 793 00:45:17,900 --> 00:45:22,820 by incorporating, not simply the feedforward sensory input, 794 00:45:22,820 --> 00:45:28,400 but other sources of knowledge about this particular stimulus? 795 00:45:28,400 --> 00:45:33,080 And it's important that we are not talking about knowledge 796 00:45:33,080 --> 00:45:35,690 about the statistics of the input, which 797 00:45:35,690 --> 00:45:37,730 can be incorporated into the learning 798 00:45:37,730 --> 00:45:39,200 of the feedforward weights. 799 00:45:39,200 --> 00:45:42,450 We're talking about inputs, 800 00:45:42,450 --> 00:45:47,330 or knowledge, which we have now, given a network which has already 801 00:45:47,330 --> 00:45:49,430 learned whatever it has learned. 802 00:45:49,430 --> 00:45:52,610 So we have a mature network, whatever the architecture is. 803 00:45:52,610 --> 00:45:53,840 We have a sensory input. 804 00:45:53,840 --> 00:45:55,400 It goes feedforward. 805 00:45:55,400 --> 00:45:58,610 And now we have additional information, about 806 00:45:58,610 --> 00:46:00,170 context for instance, that we want 807 00:46:00,170 --> 00:46:03,140 to incorporate with the sensory input 808 00:46:03,140 --> 00:46:05,345 to improve the performance. 809 00:46:05,345 --> 00:46:09,260 So how do we do that? 810 00:46:09,260 --> 00:46:13,460 It turns out to be a non-trivial computational problem. 811 00:46:13,460 --> 00:46:18,980 It is very straightforward to do it in a Bayesian framework, 812 00:46:18,980 --> 00:46:21,550 where you simply update the prior 813 00:46:21,550 --> 00:46:29,070 on what the sensory input is using this contextual information. 814 00:46:29,070 --> 00:46:32,210 But if you want to implement it in a network, 815 00:46:32,210 --> 00:46:36,910 you find that it's not easy to find 816 00:46:36,910 --> 00:46:38,960 the appropriate architecture. 817 00:46:38,960 --> 00:46:43,530 So I'll just briefly talk about how we do it. 818 00:46:43,530 --> 00:46:46,670 So imagine you have, again, these sensory inputs, 819 00:46:46,670 --> 00:46:50,460 but now there is some context, different contexts. 820 00:46:50,460 --> 00:46:53,840 And imagine you have the information 821 00:46:53,840 --> 00:47:00,710 that the input is coming from a particular part of state 822 00:47:00,710 --> 00:47:02,040 space. 
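The Bayesian version really is straightforward. A small sketch, assuming isotropic Gaussian noise around each template and a context cue that simply concentrates the prior on one category; the sizes and the noise level sigma are placeholders.

import numpy as np

rng = np.random.default_rng(1)
N_in, n_cat, n_tok = 50, 30, 30             # placeholder sizes (30 categories x 30 tokens)
sigma = 1.5                                  # assumed noise level
templates = rng.standard_normal((n_cat * n_tok, N_in))
category = np.repeat(np.arange(n_cat), n_tok)

# Noisy observation of token 0, which belongs to category 0
x = templates[0] + sigma * rng.standard_normal(N_in)

# Log-likelihood of each template under isotropic Gaussian noise
log_lik = -np.sum((x - templates) ** 2, axis=1) / (2 * sigma ** 2)

# Top-down context: the prior is concentrated on the cued category,
# ruling out every template from the other 29 categories.
log_prior = np.where(category == 0, np.log(1.0 / n_tok), -np.inf)

print("ML template (no context):   ", np.argmax(log_lik))
print("MAP template (with context):", np.argmax(log_lik + log_prior))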
823 00:47:02,040 --> 00:47:05,030 So basically the question is how to selectively amplify 824 00:47:05,030 --> 00:47:08,630 a specific set of states in a distributed representation. 825 00:47:08,630 --> 00:47:12,790 So usually when we talk about attention, or gating, 826 00:47:12,790 --> 00:47:15,020 or questions like that, we think about, OK, we 827 00:47:15,020 --> 00:47:16,640 have these neurons. 828 00:47:16,640 --> 00:47:20,650 We suppress some of them, or maybe amplify other ones. 829 00:47:20,650 --> 00:47:24,310 Or we have a set of axons, or pathways. 830 00:47:24,310 --> 00:47:26,890 We suppress some, and amplify others. 831 00:47:26,890 --> 00:47:29,410 But what about a representation which is more 832 00:47:29,410 --> 00:47:33,040 distributed, where you really have to suppress states 833 00:47:33,040 --> 00:47:36,730 rather than neural populations? 834 00:47:36,730 --> 00:47:42,850 So I won't go into it-- again, it's a complicated architecture. 835 00:47:42,850 --> 00:47:48,010 But, basically, we're using some sort of a mixed representation, 836 00:47:48,010 --> 00:47:52,090 where we take the sensory input and the category 837 00:47:52,090 --> 00:47:55,510 or contextual input, mix them through a nonlinearity, 838 00:47:55,510 --> 00:47:58,614 use them to clean up the representation, and propagate it. 839 00:47:58,614 --> 00:48:00,280 So it's a more complicated architecture, 840 00:48:00,280 --> 00:48:01,960 but it works beautifully. 841 00:48:01,960 --> 00:48:04,150 Let me show you here an example, and you'll get 842 00:48:04,150 --> 00:48:05,770 a flavor of what we are doing. 843 00:48:05,770 --> 00:48:13,420 So now for the input, we have those 900 spheres or templates, 844 00:48:13,420 --> 00:48:20,110 but they are organized into 30 categories, 845 00:48:20,110 --> 00:48:23,440 with 30 tokens per category. 846 00:48:23,440 --> 00:48:27,400 Now, the tokens, which are the actual sensory inputs, 847 00:48:27,400 --> 00:48:30,880 are represented by, let's say, 200 neurons. 848 00:48:30,880 --> 00:48:32,695 And you have a small number of neurons 849 00:48:32,695 --> 00:48:34,230 representing a category. 850 00:48:34,230 --> 00:48:35,434 Maybe 20 is enough. 851 00:48:35,434 --> 00:48:36,850 So that's important: you don't 852 00:48:36,850 --> 00:48:39,320 really have to expand 853 00:48:39,320 --> 00:48:42,430 the representation dramatically. 854 00:48:42,430 --> 00:48:45,140 So this is the input. 855 00:48:45,140 --> 00:48:48,340 And now we have very noisy inputs. 856 00:48:48,340 --> 00:48:51,440 If you look at the readout-- this axis is layers, 857 00:48:51,440 --> 00:48:52,600 and this is the readout error-- 858 00:48:52,600 --> 00:48:57,140 whether you do it on the input layer, or any subsequent layer 859 00:48:57,140 --> 00:49:00,290 here, but without top-down information, 860 00:49:00,290 --> 00:49:03,610 even with structured interactions and all that I told you, 861 00:49:03,610 --> 00:49:07,420 this is such a noisy input that the performance is basically 862 00:49:07,420 --> 00:49:08,800 0.5. 863 00:49:08,800 --> 00:49:13,150 There is nothing that you can do without top-down information 864 00:49:13,150 --> 00:49:14,540 in this network. 865 00:49:14,540 --> 00:49:17,600 You can ask what the performance would be 866 00:49:17,600 --> 00:49:21,910 if you have an ideal observer that looks at the noisy input 867 00:49:21,910 --> 00:49:27,310 and makes a maximum likelihood categorization. 868 00:49:27,310 --> 00:49:28,920 Well, then it will do much better. 869 00:49:28,920 --> 00:49:31,390 Also not 0, but at this level. 
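To give a flavor of what nonlinear mixing of a sensory input with a small category cue could look like, here is a minimal sketch with random mixing weights and a threshold nonlinearity; the actual architecture, including the cleanup and propagation stages just described, is more elaborate, and N_mix and theta are assumed values.

import numpy as np

rng = np.random.default_rng(2)
N_sens, N_ctx, N_mix = 200, 20, 800   # 200 and 20 loosely follow the example; N_mix is assumed

# Random mixing weights (a placeholder for the actual learned architecture)
W_s = rng.standard_normal((N_mix, N_sens)) / np.sqrt(N_sens)
W_c = rng.standard_normal((N_mix, N_ctx))      # unnormalized so the cue competes with the sensory drive

def mixed_layer(sensory, context, theta=1.0):
    # Threshold units driven jointly by the bottom-up input and the top-down category cue.
    return np.heaviside(sensory @ W_s.T + context @ W_c.T - theta, 0.0)

sensory = rng.standard_normal(N_sens)          # stand-in for a (noisy) token representation
cue = np.zeros(N_ctx)
cue[3] = 1.0                                   # one-hot cue for an assumed category
h = mixed_layer(sensory, cue)
print("fraction of active mixed units:", h.mean())

# Different category cues drive largely distinct subsets of mixed units here,
# which is the kind of handle later stages can use to amplify only the states
# consistent with the cued part of state space.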
870 00:49:33,970 --> 00:49:38,650 This higher error reflects the fact 871 00:49:38,650 --> 00:49:42,190 that this network is still not doing 872 00:49:42,190 --> 00:49:46,990 what an optimal maximum likelihood observer would do. 873 00:49:46,990 --> 00:49:48,320 So this is the network. 874 00:49:48,320 --> 00:49:53,110 This is a maximum likelihood readout, both of them 875 00:49:53,110 --> 00:49:56,320 without extra top-down information. 876 00:49:56,320 --> 00:49:59,950 And in the network that I hinted at, 877 00:49:59,950 --> 00:50:04,570 if you add this top-down information by generating 878 00:50:04,570 --> 00:50:08,740 a mixed representation, you get a performance which is really 879 00:50:08,740 --> 00:50:11,210 dramatically improved. 880 00:50:11,210 --> 00:50:16,690 And as you keep doing it, one layer after another, 881 00:50:16,690 --> 00:50:21,310 you really get very nice performance. 882 00:50:21,310 --> 00:50:24,010 So let me just summarize. 883 00:50:28,490 --> 00:50:31,735 There is one more point before summarizing. 884 00:50:31,735 --> 00:50:33,130 Yeah, OK. 885 00:50:33,130 --> 00:50:33,820 Before that. 886 00:50:33,820 --> 00:50:34,320 OK. 887 00:50:34,320 --> 00:50:40,930 So two points to bear in mind. 888 00:50:40,930 --> 00:50:44,790 One of them is that what I discussed with you today 889 00:50:44,790 --> 00:50:51,460 relies on either assuming random projections, or comparing random 890 00:50:51,460 --> 00:50:56,410 projections to unsupervised learning of a very simple type, 891 00:50:56,410 --> 00:50:59,010 a kind of Hebbian type. 892 00:50:59,010 --> 00:51:07,480 The output readout can be Hebbian, or a perceptron, or an SVM, and so on. 893 00:51:07,480 --> 00:51:09,040 You could ask, what happens if you 894 00:51:09,040 --> 00:51:12,850 use more sophisticated learning rules 895 00:51:12,850 --> 00:51:14,140 for the unsupervised weights? 896 00:51:14,140 --> 00:51:15,280 Some of them we've studied. 897 00:51:15,280 --> 00:51:20,470 But, anyway, that's something which is important to explore. 898 00:51:20,470 --> 00:51:24,640 And another very important issue for thinking 899 00:51:24,640 --> 00:51:28,110 about object recognition in vision 900 00:51:28,110 --> 00:51:33,340 and in other real-life problems is the input statistics. 901 00:51:33,340 --> 00:51:36,070 Because what we assumed is a very simple mixture 902 00:51:36,070 --> 00:51:37,870 of Gaussians model. 903 00:51:37,870 --> 00:51:40,930 So you can think of the task of the network 904 00:51:40,930 --> 00:51:46,270 as taking the variation away 905 00:51:46,270 --> 00:51:49,080 from the center of each sphere 906 00:51:49,080 --> 00:51:52,870 and generating a representation which is invariant to that. 907 00:51:52,870 --> 00:51:55,510 But this is a very simple invariance problem, 908 00:51:55,510 --> 00:51:58,450 because the invariance was simply 909 00:51:58,450 --> 00:52:03,790 restricted to these simple geometric structures. 910 00:52:03,790 --> 00:52:12,070 Problems which are closer to real-life problems 911 00:52:12,070 --> 00:52:17,200 will have inputs which essentially have 912 00:52:17,200 --> 00:52:19,240 some structure, but the structure 913 00:52:19,240 --> 00:52:22,860 can take a variety of shapes. 914 00:52:22,860 --> 00:52:27,090 Each one of them corresponds to an object, or a cluster, 915 00:52:27,090 --> 00:52:31,850 or a manifold representing an entity, a perceptual entity. 
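For concreteness, a minimal sketch of this kind of input model: each sample sits at a fixed distance from a template center, and the variation over the sphere is exactly what the representation should become invariant to. The sizes and the radius are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(3)
N_in, P, radius = 100, 20, 0.5              # arbitrary input dimension, template count, sphere radius
centers = rng.standard_normal((P, N_in))    # template centers (the "perceptual entities")

def sample_on_sphere(center):
    # A point at a fixed distance from the center: the within-template variation
    # that the representation is supposed to be invariant to.
    u = rng.standard_normal(center.shape)
    return center + radius * u / np.linalg.norm(u)

x = sample_on_sphere(centers[7])
# Nearest-center decoding does not care where on the sphere x landed,
# as long as the spheres of different templates do not overlap.
print("decoded template:", np.argmin(np.linalg.norm(centers - x, axis=1)))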
916 00:52:31,850 --> 00:52:37,220 But how you go from this nice, simple 917 00:52:37,220 --> 00:52:43,100 spherical invariance problem to those problems 918 00:52:43,100 --> 00:52:45,780 is, of course, a challenge. 919 00:52:45,780 --> 00:52:50,300 And that's ongoing work, 920 00:52:50,300 --> 00:52:53,990 also with SueYeon Chung and Dan Lee. 921 00:52:53,990 --> 00:53:00,130 But it's a story which is still unfolding.