The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

ALEX KELL: So, let's talk about the historical arcs of the vision and auditory sciences. In the mid-20th century, auditory psychophysics was a pretty robust and diverse field. But currently there are very few auditory faculty in psychology departments, whereas vision faculty are a cornerstone of basically every psychology department in the country. In a similar vein, automated speech recognition is a big, fruitful field, but there's not much broader automated sound recognition; in contrast, computer vision is a huge field. So I'm wondering, historically and sociologically, how did we get here?

JOSH TENENBAUM: You probably talk about this more.

JOSH MCDERMOTT: Sure, yeah. I'm happy to tell you my take on this. I think these are really interesting case studies in the history of science. If you go back to the '50s, psychoacoustics was sort of a centerpiece of psychology. If you looked around at the top universities, they all had really good people studying hearing. Even some of the names that you know, like Green and Swets-- the guys who pretty much invented signal detection theory-- were psychoacousticians. People often forget that, but they literally wrote the book on signal detection theory.

JOSH TENENBAUM: When people talked about separating the signal from the noise, they meant actual noise.

JOSH MCDERMOTT: Yeah. And there was this famous psychoacoustics lab at Harvard that everybody kind of passed through. So back then, what constituted psychology was something pretty different, and it was closely related to things like signal detection theory.
And it was pretty low level by the standards of today. What happened over time is that hearing gradually drifted out of psychology departments and vision became more and more prominent. There are several forces here, but I think one really important factor is that hearing impairment really involves abnormal functioning at the level of the cochlea. The actual signal processing being done in the cochlea changes when people start to lose their hearing. So there has always been a pretty strong impetus, coming in part from the NIH, to understand hearing impairment and to know how to treat it. And knowing what happens at the front end of the auditory system has been critical to making that work.

In contrast, most vision impairments are optical in nature, and you fix them with glasses, right? So it's not like studying vision is really going to help you understand visual impairment. There was never that same clinical driver. And so when psychology gradually got more and more cognitive, vision science went along with it. That really didn't happen with hearing, and I think part of that was the clinical impetus to keep trying to understand the periphery.

The auditory periphery was also just harder to work out, because the cochlea is a mechanical device. You can't just stick an electrode into it and characterize it; it's really technically challenging to work out what's happening. And that kept people busy for a very long time. But as psychology was advancing, what people in hearing science were studying kind of ceased to be what psychologists found interesting. And so the field dropped out of psychology departments and moved into speech and hearing science departments, which were typically at bigger state schools.
And they never really got on the cognitive bandwagon in the same way that everybody in vision did.

And then there's this interesting phenomenon in science where people get trained in fields where there are already lots of scientists. If you're a grad student, you need an advisor, and so you often end up working on something your advisor does. So if some field is under-represented, it typically gets more under-represented as time goes on. And that's sort of been the case. If you want to study olfaction, it's a great idea, right? But how are you going to do it? You've got to find somebody to work with, and there are not many places you can go to get trained. The same has been true for hearing for a long time. So that's my take on that part of it. Hynek, do you have anything to say on the computational side?

HYNEK HERMANSKY: I don't know. I'm wondering whether it also evolved with the tools available. In the old days it was easier to generate sounds than to generate images; on the computer now, generating images is much easier, right? So vision and visual research became sexier. I teach auditory perception to engineers, and also a little bit of visual perception, and I notice that they are much more interested in visual perception, especially the various visual effects, because somehow you can see it better.

And the other thing is, of course, funding. The main application of hearing research is, as you said, hearing prostheses, and these people don't have much money, right? And in speech recognition, we unfortunately haven't yet gotten much benefit from hearing research. So I wonder if this is not also a little bit--

JOSH TENENBAUM: Can I add one? This is sort of what you guys also gestured towards, but-- to go back to similarities and not just differences; maybe that will be my theme.
In both vision and hearing-- or audition-- there's, I think, a strong bias towards the aspects of the problem that fit with the rest of cognition. And often that's mediated by language, right? So there's been a lot of interest in vision in object recognition: parts of vision that ultimately lead into something like attaching a word to a part of an image or a scene. And there are a lot of other parts of vision, like certain kinds of scene understanding, that have been way understudied until recently. And it does seem like the part of hearing that has been the focus is speech. It's maybe not as big as vision and object recognition, but certainly there are a lot of people in mainline cognitive psychology who have studied things like categorization of basic speech elements-- things that bleed very quickly into psycholinguistics. Whereas the parts of hearing that are more about auditory scene analysis in general-- sound textures, physical events, all the richness of the world that we might get through sound-- have been super understudied.

But echoing something Josh said, at least implicitly: just because an area is understudied and you need to find an advisor doesn't mean you shouldn't work on it. If you have any interest in this, and there's even one person, say, who's doing good work on it, you should work with them. It's a great opportunity.

JOSH MCDERMOTT: If I could follow up and just say: my sense is that if you can bear with it and figure out a way to make it work, in the long run it's actually a great place to be. It's a lot of fun to work in an area that's not crowded.

ALEX KELL: All right. To transition: obviously, a large emphasis of the summer school is on the potential symbiosis between engineering and science.
So what have been the most fruitful interactions between machine perception and the science of perception, in either vision or hearing, over the years? And are there any general lessons we can learn?

HYNEK HERMANSKY: You know, I went a little bit backwards, quite frankly. I'm trained as an engineer, and I was always paid to build better machines. And in the process of building better machines, you discover that, almost unknowingly, we were emulating some properties of hearing. So then, of course, I started to be interested, and I wanted to get more of it. That's how I got into this.

But I have to admit that in my field, we are a little bit looked down on.

[INTERPOSING VOICES]

HYNEK HERMANSKY: Mainly because engineers are such that [INAUDIBLE] if something works, they like it and they don't much want to know why-- or at least they don't talk much about it. So it's interesting: when I'm in engineering meetings, they look at me as this strange kid who also works on hearing. And when I'm in an environment like this, people look at me as a speech engineer. But I don't really feel like either, in some ways.

But what I was wondering, when Josh was talking: is there anybody in this world who works on both? Because that, I think, is much more needed than anything else-- somebody who is interested in both audio and visual processing and is capable of making this--

JOSH TENENBAUM: So not both science and engineering, but both vision and audition.

HYNEK HERMANSKY: Yes, that's what I mean: vision and audition, with a real goal of trying to understand both and trying to find the similarities. Because, personally, I got some inspiration from visual research, even though I work with audio. But I don't see much of that anymore.
And there are some basic questions I would like to ask-- maybe I should have sent them in ahead of time. For instance, I don't even know which aspects are most similar and most different in audio and vision. Should I look at time? Should I look at modulations? Should I look at frequencies? Should I look at spatial resolution? And so on. I don't know if somebody can help me with this; I would be very happy to go home knowing that. Sometimes I suspect that spatial resolution in vision and frequency resolution in hearing are similar. I'm thinking about modulations in speech-- Josh talked a lot about that-- but there must be modulation in vision also.

Of course, we never studied the vision of moving pictures much; a lot of vision research was on fixed, static images. It was a little bit like speech research, where we used to study vowels. We don't do that anymore, because running speech is something very, very different. The same thing is happening in image processing. I think that's now changing more and more, because-- again, going back to the availability of machines-- you can do work on moving images, on video, and so on. But I don't know how much of that is happening.

So my question is really: are there any similarities, and at which level are they? [INAUDIBLE] There is time, there is spatial frequency, there is carrier frequency, there are modulations in speech. I don't know. I would like to know this. That's something somebody can help me with.

DAN YAMINS: I think those are interesting questions, but I'm sure I don't have the answers at this point. I was going to say something a little different, going back to the original general question-- again, thinking about it from a historical point of view, which I think is helpful.
In the long run, I think what's happened over the past 50 years is that biological inspiration has been very helpful at injecting ideas into the engineering realm that end up being very powerful. I think we're seeing the arc of that right now in a very strong way. In terms of vision and audition, the algorithms that are most dominant are ones that were strongly biologically inspired. And there has been a historical arc over a period of decades: first the algorithms were biologically inspired, back in the '50s and '60s; then they were not biologically inspired for a while, because the biology stuff didn't seem to be panning out-- that was sort of the dark ages of neural networks. Then more recently that began to change, and, again, biologically inspired ideas seem to be very powerful for creating algorithmic approaches.

But the arc is a very long one, right? It's not like you discover something in the lab, implement it in your algorithm the next day, and suddenly get a 20% improvement on some task. That's not realistic. But if you're willing to have the patience to wait for a while and see ideas slowly percolate up, I think they can be very powerful.

Now, the other direction is also very interesting: using the algorithms to understand neuroscience. That's one where you can get a lot of bang for your buck quickly-- it's like a short-term high. Because what happens is that you take this machine that you didn't understand, you apply it to a problem that you were worried about and that is scientifically interesting, and suddenly you get a 20% improvement overnight.
That is feasible in that direction-- taking advances from the computational or algorithmic side and applying them to understanding data in neuroscience. That does seem to have been borne out, but only much more recently. There wasn't much of that at all for many decades, but more recently it has begun to happen.

So I think there is a really interesting open question right now as to which direction is more live-- which one is leading at this point. Maybe neither of them is leading. But certainly in the everyday, on-the-ground experience of somebody who is trying to do some of both, it feels to me like the algorithms are leading the biology-- leading the neuroscience-- in the sense that, in the short run at least, things are going to come out of the community of people doing algorithm development that help us understand neuroscience data before specific things come out of the neuroscience community that help make better algorithms. Again, I think the long run can be different. And I think a really deep, open research-program question is: which tasks should you choose such that learning about them from a neuroscience and psychology point of view will, in the five- to ten-year run, help make better algorithms? If you can choose those correctly, I think you really have done something valuable. But I think that's really hard.

ALEX KELL: Yeah. I want to push back a little. You said the engineering is really helping the science right now, but the seed that science planted in the engineering-- what you're talking about there is just CNNs, basically.

DAN YAMINS: Well, not entirely, because I think there are also some ideas in recurrent neural networks.

ALEX KELL: OK, sure.
Neural networks generally, then. But the point is, the ideas that inspired them have been around for decades and decades. Are there other key examples, besides the operations that you throw into a CNN? The idea of convolution and the idea of layered computation-- these are obviously very, very important ideas. But what are other contributions that science has given engineering besides those?

DAN YAMINS: Well, Green and Swets-- the thing that Josh mentioned earlier about Green and Swets is another great example. Psychophysics helped develop signal detection theory. That's much older, but it's a very clear example.

HYNEK HERMANSKY: Signal detection theory didn't come from Green and Swets. It came from the Second World War.

DAN YAMINS: I was just thinking of all the work--

HYNEK HERMANSKY: They did very good work, obviously, and they indeed were auditory people.

DAN YAMINS: And they were actually doing a lot of work during-- government--

JOSH MCDERMOTT: They formalized a lot of stuff.

DAN YAMINS: Yeah, and they did a lot--

HYNEK HERMANSKY: I don't want to take anything away from them.

DAN YAMINS: But, you know, it's interesting. There's this great paper where Green and Swets talk about their--

HYNEK HERMANSKY: --was engineering.

DAN YAMINS: They talked about their military work, right? They actually worked for the military, determining, say, which type of plane was the one they were going to be facing. So, yeah, I agree that it came out of that.

HYNEK HERMANSKY: If I can still spend a little bit of time on engineering versus science: we are also missing one big thing, which is Bell Labs. Bell Labs was the organization that paid people for-- having fun. Doing really good research.
There was no question that, at the time, Bell Labs was about speech and about audio, so a lot of things were justified that way. Even Bela Julesz and those people-- they pretended they were working on perception because the company wanted to make more money on telephone calls. That is gone now, right? Both: speech is gone, Bell Labs is gone. And maybe image processing is in, because the government is interested in finding various things in images and so on. So a lot of that is funding.

Since you mentioned neural networks: it never ceases to amaze me that people would call artificial neural networks anything similar to biology. The only thing I see as similar there, maybe, is now these layered networks and that sort of thing.

ALEX KELL: I think a lot of the concepts were inspired by biology. I don't think anyone takes it as a super serious--

HYNEK HERMANSKY: There I still have maybe one point to make. Most of the ideas which are now being explored in neural networks are also very old. The only thing is that we didn't have the hardware; we didn't have the means, basically, of doing it. So technology really supported this, and suddenly, yes, it's working. But to some people it's not even surprising that it's working. They say, of course-- we just couldn't do it before.

DAN YAMINS: I think what was surprising to them was that it didn't work for so long, and people were very disappointed and upset about that. But I agree that basically all the ideas are these 40- or 50-year-old ideas that people had thought of-- many of them coming out of the psychology and neuroscience communities a long time ago-- but just couldn't do anything about. And so it takes a long time to bear fruit, it feels like.
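[For readers who want to see concretely what the "convolution and layered computation" ideas mentioned above refer to, here is a minimal editorial sketch in Python; it is not code from the panel, and the filter values and sizes are arbitrary assumptions.]

```python
# A minimal sketch of the two "biologically inspired" operations discussed:
# convolution with a shared local filter (a model receptive field) and
# layered computation with a simple nonlinearity.
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution: slide one shared filter over the image."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # Same weights applied at every location: weight sharing.
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

def relu(x):
    """Pointwise rectification, loosely analogous to a firing threshold."""
    return np.maximum(x, 0.0)

# Two stacked stages: each layer sees only local neighborhoods of the one
# below, so deeper layers respond to larger, more complex structure.
rng = np.random.default_rng(0)
image = rng.standard_normal((16, 16))
edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)  # crude oriented-edge detector
layer1 = relu(conv2d(image, edge_filter))
layer2 = relu(conv2d(layer1, rng.standard_normal((3, 3))))
print(layer1.shape, layer2.shape)  # (14, 14) (12, 12)
```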
GABRIEL KREIMAN: So I have more questions than answers, but let me try to get back to the question about vision and hearing and how the two can synergistically interact.

First, I wanted to lay out a couple of biases-- almost religious beliefs I have-- around the notion that cortex is cortex: there's a six-layer structure, and the patterns of connectivity that have been described in visual cortex and in auditory cortex are remarkably similar, so we have to work with the same type of hardware in both cases. The types of plasticity and learning rules that have been described are very similar in both cases. So there's a huge amount of similarity at the biological level. And we use a lot of the same vocabulary in terms of describing problems about invariance and so on.

And yet, at the same time, I wanted to raise a few questions-- particularly to demonstrate my ignorance of the auditory world and get answers from these two experts here. I cannot help but feel there's a nasty possibility that there are real differences between the two. In particular, the role of timing has been somewhat under-explored in the visual domain. We have done some work on this, and some other people have. But it seems that timing plays a much more fundamental role in the auditory domain. Perhaps the most extreme example is sound localization, where we need to take into account microsecond differences in the arrival of signals at the two ears. I don't know of anything even close to that in the visual domain. So that's one example where I think we have to say there is a fundamental difference.

Now, thinking more about the recognition questions that many of us are interested in: again, timing seems to play a fundamental role in the auditory domain. But I would love to hear from these two experts here.
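[As an editorial aside: the microsecond-scale computation Kreiman refers to is often modeled as a cross-correlation between the two ears' signals, in the spirit of the classic Jeffress delay-line idea. A minimal sketch follows, with an assumed sample rate and a made-up delay; it is an illustration, not the panel's proposal.]

```python
# Estimating an interaural time difference (ITD) by cross-correlating
# the left- and right-ear signals and picking the best-matching lag.
import numpy as np

fs = 96_000                        # sample rate in Hz (assumed)
itd_true = 300e-6                  # a plausible ITD of 300 microseconds
delay = int(round(itd_true * fs))  # the ITD expressed in samples

rng = np.random.default_rng(1)
source = rng.standard_normal(4096)
left = source
right = np.roll(source, delay)     # the right ear hears the same sound later

# Search physiologically plausible lags (roughly +/-700 us for a human head).
max_lag = int(round(700e-6 * fs))
lags = np.arange(-max_lag, max_lag + 1)
corr = [np.dot(left, np.roll(right, -lag)) for lag in lags]
itd_est = lags[int(np.argmax(corr))] / fs
print(f"estimated ITD: {itd_est * 1e6:.0f} microseconds")  # ~300
```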
GABRIEL KREIMAN: I can easily come up with questions. For instance: what is an object in the auditory domain? An object is defined in a somewhat heuristic way in the visual domain, but we all more or less agree on what objects are, and I don't know what the equivalent is in the auditory domain. And how much attention should we pay to the fact that the temporal evolution of signals is a fundamental aspect of the auditory world-- something that, by and large, we don't really think about too much in the visual domain?

With that said, I do hope that in the end we will find similar fundamental principles and algorithms, because, as I said, cortex is cortex.

JOSH MCDERMOTT: I can speak to some of those issues a little bit. Look, I think it's an interesting and fun and often useful exercise to try to map concepts from one modality onto another. But at the end of the day, the purpose of perception is to figure out what's out there in the world, and the information you get from different modalities in some cases just tells you about different kinds of things. Sound is usually created when something happens, right? It's not quite the same as there being an object there off of which light reflects. Sometimes there is a sense in which there's an object there-- like a person producing sound. But oftentimes the sound is produced by an interaction between a couple of different things. So, really, the question is what happened, as much as what's there. You could probably try to find things that are analogous to objects, but in my mind that may just not be exactly the right question to be asking about sound.

JOSH TENENBAUM: Can I comment on that one? I think this is a place where Gabriel and I have somewhat different biases, although, again, it's all open.
But an object, to me, is not a visual thing or an auditory thing. An object is a physical thing. For those of you who saw Liz Spelke's lectures on this: it's very inspiring to me that, from very early on, infants have a concept of an object, which is basically a thing in the world that can move on its own or be moved. And the same principles apply in vision but also in haptics. It's true that the main way we perceive objects is not through audition, but we can certainly perceive things about objects from sound. And often, echoing what Josh said, it's the events or the interactions between objects that physically cause sounds. So what we're learning from sound is often--

GABRIEL KREIMAN: Maybe if I could ask-- I don't disagree with your definition of objects, a la Spelke and so on. But in the auditory domain, if I think about speech: are we talking about phonemes? Are we talking about words? If we talk about Lady Gaga or Vivaldi, are we talking about a whole piece of music, a measure, a tone, a frequency? These are things that--

JOSH TENENBAUM: So, structure-- structure more generally.

GABRIEL KREIMAN: What's the unit of computation that we should think about algorithmically? In the same way that Dan and us and many others think about algorithms that will eventually attach labels to objects, for example-- what are those fundamental units? Maybe the answer is all of them, but--

JOSH TENENBAUM: Well, speech is really interesting, because from one point of view you could think of it as an artifact, right? Speech is a thing that was created through biological and cultural evolution, manipulating a system to create these artificial event categories, which we call phonemes and words and sentences and so on. And surely there was audition before there was speech, right?
So it seems like speech is building on a system that was going to detect a more basic notion of events-- physical interactions, things like babbling brooks or fires or breezes, and then animal communication. And it hacks that system, basically, both on the production side and the perception side. So it's very interesting to ask: what's the right way to describe the structure in speech? It probably seems most analogous to something like gesture, you know? Gesture is a way to hack the visual system to create events visually-- salient changes in motion, whether for non-verbal communication or for something like sign language. It's super interesting. But, again, speech isn't a set of objects. It's a set of structured events, which have been created to be perceivable by a system that is evolutionarily much more ancient: one for perceiving object interactions and events.

JOSH MCDERMOTT: But I also think that, yes, there's a lot of focus on objects in vision, but it's certainly the case that vision is richer than just being about objects, right? In some sense, the fact that you are posing the question is a reflection of where a lot of the work has been concentrated. You have scenes; there's stuff, not just things, right? And the same is true in audition. The difference is just that there isn't really as much of a focus on things, only because those are not--

GABRIEL KREIMAN: Here's the fundamental question I'm trying to raise, along with the question about timing. In the visual domain, let's get away from objects and think about action recognition, for example. That's one domain where you would think that, well, you have to start thinking about time.
It's actually extremely challenging to come up with good stimuli where you cannot infer the actions from single frames. And I would argue--

JOSH TENENBAUM: Let's talk about events.

GABRIEL KREIMAN: But let me say one more thing-- and please correct me if I'm wrong. I would argue that in the auditory domain, it's the opposite. It's very hard to come up with things that you can recognize from a single instant.

JOSH MCDERMOTT: Sure.

GABRIEL KREIMAN: You need time. Time is inherent to the basic definition of everything. In the visual domain-- again, we've thought about time, and about what happens if you present parts of objects asynchronously, for example, and you can disrupt object recognition or action recognition that way. But, again, you can do a lot without time, or without thinking too seriously about time. I suspect time is probably not one of your main preoccupations in the visual domain.

HYNEK HERMANSKY: I'm not so sure, because one of the big things which always strikes me in vision is saccades-- the fact that we are moving our eyes, and the fact that it's possible to lose vision entirely if you really stabilize the image on the retina, and so on. So vision probably figured out different ways of introducing time into perception, basically by moving the eyes, whereas in sound the change over time is already happening out there in the world.

You know, I actually had one joint project where we tried to work on audio-visual recognition. It was a project about recognizing unexpected things. And that was a big pain initially, because, of course, the vision people were thinking one way and the auditory people another way. But eventually we ended up with time, with surprises, with the unexpected, and with priors. And there were a lot of similarities between the audio and visual worlds, you know?
So that's why I was saying in the beginning-- and now I'm looking at the students-- people should be more encouraged to look at both. Don't just say, I'm a vision person and I only want to know a little bit about speech or something. No. These things are very interesting. In the auditory world there are problems that are very similar to visual problems, and in the visual world there are problems very similar to auditory ones. Just take speech and writing, right? Be it handwriting or even printed text-- these things communicate messages, communicate information, in a very similar way.

I would just say-- I got a little bit excited because I finished my coffee-- let's look for the similarities rather than the differences, and let's be very serious about it, so we can say: oh, finally I found something. For instance, I'll give you one little example. We had a big problem with perceptual constancy when you get linear distortions in the signal. And I just accidentally read some paper by David Marr at the time, and I didn't understand it-- I have to say I actually missed [INAUDIBLE] a little bit. But still, it was a great inspiration, and I came up with an algorithm which ended up being a very good one. Well, at the time-- it has been beaten many times since. But, you know, let's look for the similarities. That's what I'm arguing, and that was also my question: I don't even know what is similar and different in auditory and visual signals.

At a certain level, it must be the same, right? The cortex is very, very similar. So I believe that, in the end, we are getting information into our brain that is being used for figuring out what's happening in the world. And then there are these big differences at the beginning-- I mean, the senses are so different.
JOSH TENENBAUM: Could I nominate one sort of thing that could be very interesting to study, that's very basic in both vision and audition, and where there are some analogies? Certain kinds of basic events that involve physical interaction between objects. I'll try to make one right here. OK-- so there was a visual event, and it has a low-level signal, a motion signal: there was some motion over here, and then there was some other motion that, in a sense, was caused. And there was a sound that went with it: the sound of the thing sliding on the table, and then the sound of the collision. We have all sorts of other examples like that, right? I can drop this object here, and it makes a certain sound.

So there are very salient, low-level detectable auditory and visual signals that have a common cause in the world: one thing hitting another. It's also the kind of thing that-- I don't know if Liz mentioned this in her lecture; I mentioned it a little bit-- even very young infants, even two-month-olds, understand something about this contact causality: that one object can cause another object to move. And it's the sort of thing Shimon Ullman has shown you can use, in a very basic way, to pick out primitive agents, like hands as movers.

So this is one basic kind of event that has interesting parallels between vision and audition, because there's a basic thing happening in the world-- an exertion of force when one moving object comes into contact with another-- and it creates simultaneously detectable events with analogous kinds of structure. I think a very basic question is: if we were to look at the cortical representation of a visual collision event and the auditory side of that, how do those work together?
771 00:30:40,330 --> 00:30:41,610 What are similarities or differences 772 00:30:41,610 --> 00:30:43,734 in the representation and computation of those kind 773 00:30:43,734 --> 00:30:45,180 of very basic events? 774 00:30:48,230 --> 00:30:50,980 HYNEK HERMANSKY: If I still may, obvious thing to use 775 00:30:50,980 --> 00:30:56,850 is use vision to transcribe the human communication by speech. 776 00:30:56,850 --> 00:31:00,240 If somebody wants a lot of money from Amazon or Microsoft 777 00:31:00,240 --> 00:31:04,110 or Google or government, you know, work on that. 778 00:31:04,110 --> 00:31:07,200 Because there is a clear visual channel, which is being 779 00:31:07,200 --> 00:31:09,617 used very heavily, you know? 780 00:31:09,617 --> 00:31:10,200 Not only that. 781 00:31:10,200 --> 00:31:13,350 I move the hands and that sort of thing. 782 00:31:13,350 --> 00:31:16,140 If somebody can help there, I mean, that would be great. 783 00:31:16,140 --> 00:31:18,969 And it's actually a relatively very straightforward problem. 784 00:31:18,969 --> 00:31:19,885 I'm not saying simple. 785 00:31:19,885 --> 00:31:21,300 But it's well defined. 786 00:31:21,300 --> 00:31:23,310 Because there is a message, which 787 00:31:23,310 --> 00:31:27,780 is being conveyed in a communication by speech. 788 00:31:27,780 --> 00:31:28,890 And it's being used. 789 00:31:28,890 --> 00:31:30,570 I mean, lips are definitely moving, 790 00:31:30,570 --> 00:31:32,610 unless you are working with a machine. 791 00:31:32,610 --> 00:31:35,440 And hands are moving unless you are a really calm person, which 792 00:31:35,440 --> 00:31:36,450 none of us is. 793 00:31:36,450 --> 00:31:38,940 And so this is one-- 794 00:31:38,940 --> 00:31:40,971 JOSH TENENBAUM: Just basic speech communication. 795 00:31:40,971 --> 00:31:42,804 HYNEK HERMANSKY: Basic speech communication, 796 00:31:42,804 --> 00:31:44,690 as Martin [INAUDIBLE] is saying. 797 00:31:44,690 --> 00:31:47,317 That would be great, really. 798 00:31:47,317 --> 00:31:49,650 JOSH MCDERMOTT: I mean, it's also worth saying, I think, 799 00:31:49,650 --> 00:31:51,904 you know, most of perception is multimodal, right? 800 00:31:51,904 --> 00:31:53,820 And you can certainly come up with these cases 801 00:31:53,820 --> 00:31:59,864 where you rely on sound and have basically no information 802 00:31:59,864 --> 00:32:01,280 from vision and vice versa, right? 803 00:32:01,280 --> 00:32:02,724 But most of the time, you get both 804 00:32:02,724 --> 00:32:04,640 and you don't even really think about the fact 805 00:32:04,640 --> 00:32:06,020 that you have two modalities. 806 00:32:06,020 --> 00:32:08,686 You're just, you know-- you want to know what to grab or whether 807 00:32:08,686 --> 00:32:11,860 to run or whether it's safe to cross the street and, you 808 00:32:11,860 --> 00:32:12,480 know-- 809 00:32:12,480 --> 00:32:14,021 HYNEK HERMANSKY: Of course, the thing 810 00:32:14,021 --> 00:32:17,315 is that you can switch off one modality without much damage. 811 00:32:17,315 --> 00:32:20,160 That's OK, because in most of the perception 812 00:32:20,160 --> 00:32:21,520 this is always the case. 813 00:32:21,520 --> 00:32:24,185 You don't need all the channels of communication. 814 00:32:24,185 --> 00:32:25,390 You only need some. 815 00:32:25,390 --> 00:32:28,160 But if you want to have a perfect communication, 816 00:32:28,160 --> 00:32:30,240 then you would like to use it. 817 00:32:30,240 --> 00:32:36,225 But I absolutely agree that the world is audiovisual. 
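One simple way the visual channel gets used in audiovisual speech recognition is late fusion: combine per-frame phoneme scores from an acoustic model and a lip-reading model. The sketch below is a hedged toy-- both models are stood in for by random arrays, and the fixed weighting is an assumption-- but it shows the shape of the idea: lips disambiguate some sounds while the audio carries most of the information.

```python
import numpy as np

def fuse_scores(audio_logp, video_logp, audio_weight=0.7):
    # audio_logp, video_logp: (n_frames, n_phonemes) log-probabilities
    # from two independent models.  A weighted sum of log-probabilities
    # is a product-of-experts style combination; down-weighting video
    # encodes the assumption that it is the less informative channel.
    return audio_weight * audio_logp + (1.0 - audio_weight) * video_logp

rng = np.random.default_rng(1)
n_frames, n_phonemes = 50, 40
audio = rng.normal(size=(n_frames, n_phonemes))   # stand-in model outputs
video = rng.normal(size=(n_frames, n_phonemes))
fused = fuse_scores(audio, video)
print(fused.argmax(axis=1)[:10])                  # per-frame decisions
```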
818 00:32:36,225 --> 00:32:37,600 JOSH TENENBAUM: This is a comment 819 00:32:37,600 --> 00:32:39,460 I was going to add to our discussion list, which I shared 820 00:32:39,460 --> 00:32:40,650 with Alex but maybe not the rest, 821 00:32:40,650 --> 00:32:42,775 is I think it's a really interesting question, what 822 00:32:42,775 --> 00:32:44,620 can be understood about the similarities 823 00:32:44,620 --> 00:32:47,080 and differences in each of these perceptual modalities 824 00:32:47,080 --> 00:32:49,390 by studying multimodal perception? 825 00:32:49,390 --> 00:32:52,240 And to put out a kind of a bold hypothesis, 826 00:32:52,240 --> 00:32:55,900 I think that, for reasons that you guys were just saying, 827 00:32:55,900 --> 00:32:58,532 because natural perception is inherently multimodal. 828 00:32:58,532 --> 00:32:59,740 And it's not just these ones. 829 00:32:59,740 --> 00:33:01,119 It also involves touch and so on. 830 00:33:01,119 --> 00:33:03,160 I think that's going to impose strong constraints 831 00:33:03,160 --> 00:33:05,830 on the representations and computations in both 832 00:33:05,830 --> 00:33:07,292 how vision and audition work. 833 00:33:07,292 --> 00:33:09,250 The fact that they have to be able to interface 834 00:33:09,250 --> 00:33:11,380 with a common system, what, you know, 835 00:33:11,380 --> 00:33:14,686 I would think of as a kind of physical object events system. 836 00:33:14,686 --> 00:33:16,060 But, however you want to describe 837 00:33:16,060 --> 00:33:19,450 it, the fact of multimodal perception's pervasiveness, 838 00:33:19,450 --> 00:33:21,970 the fact that you can switch on or off sense modalities 839 00:33:21,970 --> 00:33:24,520 and still do something, but that you can really just 840 00:33:24,520 --> 00:33:26,470 so fluently, naturally bring them together 841 00:33:26,470 --> 00:33:29,980 into a shared understanding of the world, that's something 842 00:33:29,980 --> 00:33:31,640 we can't ignore, I would say. 843 00:33:31,640 --> 00:33:36,510 GABRIEL KREIMAN: Why are people so sure that in everyday life, 844 00:33:36,510 --> 00:33:40,460 most things are multimodal? 845 00:33:40,460 --> 00:33:42,570 I'm not really sure how to quantify that. 846 00:33:42,570 --> 00:33:44,892 But is there any quantification of this? 847 00:33:44,892 --> 00:33:47,100 JOSH MCDERMOTT: No, I don't know of a quantification. 848 00:33:47,100 --> 00:33:53,930 All I mean is that, most of the time, I mean, you're listening 849 00:33:53,930 --> 00:33:55,490 and you're looking and you're doing 850 00:33:55,490 --> 00:33:57,100 everything you can to figure out what happened, right? 851 00:33:57,100 --> 00:33:58,430 I mean, it's like, you know, you want 852 00:33:58,430 --> 00:34:00,138 to know if there's traffic coming, right? 853 00:34:00,138 --> 00:34:01,850 I mean, there's noise that the cars make. 854 00:34:01,850 --> 00:34:03,445 You also look, you know? 855 00:34:03,445 --> 00:34:04,320 You do both of those. 856 00:34:04,320 --> 00:34:05,986 And you probably don't even really think 857 00:34:05,986 --> 00:34:07,400 about which of them you're doing. 858 00:34:07,400 --> 00:34:07,610 GABRIEL KREIMAN: No. 859 00:34:07,610 --> 00:34:09,609 I'm not talking about the most of the time part. 860 00:34:09,609 --> 00:34:12,690 Yes, that's a very good example of multimodal experience. 
861 00:34:12,690 --> 00:34:14,360 I can cite lots of other examples 862 00:34:14,360 --> 00:34:17,150 where I'm running and listening to music 863 00:34:17,150 --> 00:34:20,360 and they're completely decoupled. 864 00:34:20,360 --> 00:34:21,610 Or I'm working on my computer. 865 00:34:21,610 --> 00:34:23,318 JOSH TENENBAUM: You don't listen to music 866 00:34:23,318 --> 00:34:24,469 when you're driving, right? 867 00:34:24,469 --> 00:34:25,550 GABRIEL KREIMAN: I do, but-- 868 00:34:25,550 --> 00:34:25,980 JOSH TENENBAUM: No, but no. 869 00:34:25,980 --> 00:34:27,860 But, I mean, not in the way that, like-- 870 00:34:27,860 --> 00:34:29,580 sure, you listen to music, obviously. 871 00:34:29,580 --> 00:34:30,780 You listen to music when you're driving, 872 00:34:30,780 --> 00:34:32,955 but we try-- it's sort of important that it doesn't 873 00:34:32,955 --> 00:34:34,080 drown out all other sounds. 874 00:34:34,080 --> 00:34:36,454 GABRIEL KREIMAN: I'm just wondering to what extent this-- 875 00:34:36,454 --> 00:34:37,582 JOSH TENENBAUM: OK, fine. 876 00:34:37,582 --> 00:34:39,290 ALEX KELL: And how much of that is, like, 877 00:34:39,290 --> 00:34:41,351 kind of the particular, like, the modern-- like, 878 00:34:41,351 --> 00:34:43,850 in the contemporary world you can actually decorrelate these 879 00:34:43,850 --> 00:34:45,830 things in a way that in the natural world you can't. 880 00:34:45,830 --> 00:34:47,455 Like, if you are a monkey, these things 881 00:34:47,455 --> 00:34:49,900 would probably be a lot more correlated than they are for you 882 00:34:49,900 --> 00:34:52,719 as a human in the 21st century. 883 00:34:52,719 --> 00:34:55,610 Like, there would be a [INAUDIBLE] physical world 884 00:34:55,610 --> 00:34:57,950 causing the input to both your modalities in a way 885 00:34:57,950 --> 00:35:00,279 that you can break now, right? 886 00:35:00,279 --> 00:35:01,070 Like, I don't know. 887 00:35:01,070 --> 00:35:02,060 That feels-- 888 00:35:02,060 --> 00:35:03,476 GABRIEL KREIMAN: You may be right. 889 00:35:03,476 --> 00:35:05,461 I haven't really thought deeply about this. 890 00:35:05,461 --> 00:35:06,500 [INTERPOSING VOICES] 891 00:35:06,500 --> 00:35:07,490 GABRIEL KREIMAN: I'm not [INAUDIBLE] 892 00:35:07,490 --> 00:35:07,770 JOSH MCDERMOTT: It would be interesting to compute 893 00:35:07,770 --> 00:35:08,820 some statistics of this. 894 00:35:08,820 --> 00:35:10,861 GABRIEL KREIMAN: I'm not disputing the usefulness 895 00:35:10,861 --> 00:35:11,960 of multimodal perception. 896 00:35:11,960 --> 00:35:13,460 I think it's fantastic. 897 00:35:13,460 --> 00:35:14,870 I'm just wondering. 898 00:35:14,870 --> 00:35:18,006 I think vision can do very well without the auditory 899 00:35:18,006 --> 00:35:18,740 world. 900 00:35:18,740 --> 00:35:20,240 And vice versa. 901 00:35:20,240 --> 00:35:22,823 DAN YAMINS: We could just close our eyes right now, all of us, 902 00:35:22,823 --> 00:35:24,852 and we'd have a fine panel for a while. 903 00:35:24,852 --> 00:35:26,810 JOSH TENENBAUM: But many of the social dynamics 904 00:35:26,810 --> 00:35:29,590 would be invisible, literally. 905 00:35:29,590 --> 00:35:31,340 JOSH MCDERMOTT: No, I think you'd probably 906 00:35:31,340 --> 00:35:32,381 get a lot of reciprocity. 907 00:35:32,381 --> 00:35:33,470 It's an open question. 908 00:35:33,470 --> 00:35:35,810 JOSH TENENBAUM: You'd get some, but, like, there's 909 00:35:35,810 --> 00:35:38,660 a difference. 910 00:35:38,660 --> 00:35:40,620 Have you ever listened to a radio talk show?
911 00:35:40,620 --> 00:35:43,226 Sometimes these days the shows are broadcast on TV and also-- 912 00:35:43,226 --> 00:35:45,350 and it's, like, when you watch you're like, oh my-- 913 00:35:45,350 --> 00:35:46,430 like, you have a totally different view 914 00:35:46,430 --> 00:35:47,030 of what's going on. 915 00:35:47,030 --> 00:35:48,840 Or, like, if you're there in the studio. 916 00:35:48,840 --> 00:35:51,256 I mean, I totally agree that these are all open questions, 917 00:35:51,256 --> 00:35:53,480 and it would be nice to actually quantify, 918 00:35:53,480 --> 00:35:56,150 for example, what to me is this often subjective experience. 919 00:35:56,150 --> 00:35:59,249 Like, sometimes if the sound is, you know-- 920 00:35:59,249 --> 00:35:59,790 I don't know. 921 00:35:59,790 --> 00:36:01,248 You turn off the sound on something 922 00:36:01,248 --> 00:36:03,800 where you're used to having the sound, it 923 00:36:03,800 --> 00:36:05,366 changes your experience, right? 924 00:36:05,366 --> 00:36:06,740 Or you turn on the sound in a way 925 00:36:06,740 --> 00:36:08,840 that you had previously watched something, right? 926 00:36:08,840 --> 00:36:11,006 Like, you could do experiments where you show people 927 00:36:11,006 --> 00:36:13,515 a movie without the sound and then you turn on the sound. 928 00:36:13,515 --> 00:36:15,431 You know, in some ways transform what they see 929 00:36:15,431 --> 00:36:16,306 and in some ways not. 930 00:36:16,306 --> 00:36:18,880 So, maybe the right thing to say is more data is needed. 931 00:36:18,880 --> 00:36:21,770 DAN YAMINS: But don't you guys think, though, that, like, 932 00:36:21,770 --> 00:36:24,224 even independent of multimodal, there's still actually 933 00:36:24,224 --> 00:36:25,640 a lot of even more basic questions 934 00:36:25,640 --> 00:36:27,514 to be asked about similarity and differences? 935 00:36:27,514 --> 00:36:28,882 Like, I mean, just from a very-- 936 00:36:28,882 --> 00:36:31,340 from my point of view since that's the only one I'm usually 937 00:36:31,340 --> 00:36:32,840 able to take, like, you took a bunch 938 00:36:32,840 --> 00:36:34,700 of convolutional neural networks and you 939 00:36:34,700 --> 00:36:36,824 train some of them on vision tasks and some of them 940 00:36:36,824 --> 00:36:37,960 on audition tasks, right? 941 00:36:37,960 --> 00:36:39,140 And you figured out which architectures 942 00:36:39,140 --> 00:36:40,310 are good for audition tasks and which 943 00:36:40,310 --> 00:36:41,510 are good for vision tasks. 944 00:36:41,510 --> 00:36:43,340 See if the architectures are the same, 945 00:36:43,340 --> 00:36:45,620 and if indeed the architectures are fairly similar, 946 00:36:45,620 --> 00:36:47,204 then, like, looking at the differences 947 00:36:47,204 --> 00:36:48,911 between the features at different levels. 948 00:36:48,911 --> 00:36:50,810 I mean, I know that that's a very narrow way 949 00:36:50,810 --> 00:36:52,300 to interpret the question, but it's one. 950 00:36:52,300 --> 00:36:54,050 And there's probably a lot that can be-- 951 00:36:54,050 --> 00:36:55,220 JOSH TENENBAUM: You guys have been doing that. 952 00:36:55,220 --> 00:36:57,020 What have you learned from doing that? 953 00:36:57,020 --> 00:36:57,840 ALEX KELL: We haven't done it that exhaustively. 954 00:36:57,840 --> 00:36:59,150 DAN YAMINS: We haven't done it that exhaustively. 955 00:36:59,150 --> 00:37:00,858 But suffice it to say that the hints are, 956 00:37:00,858 --> 00:37:02,360 I think, very interesting. 
957 00:37:02,360 --> 00:37:04,610 Like, you begin to see places where 958 00:37:04,610 --> 00:37:07,400 there are clear similarities and clear differences and asking, 959 00:37:07,400 --> 00:37:09,350 like, where did the divergence occur? 960 00:37:09,350 --> 00:37:12,080 Are there any underlying principles about at what layers 961 00:37:12,080 --> 00:37:14,420 or what levels in the model those divergences 962 00:37:14,420 --> 00:37:15,500 start to occur? 963 00:37:15,500 --> 00:37:17,083 Can you see similarities at all layers 964 00:37:17,083 --> 00:37:18,770 or do you start to see sort of a kind 965 00:37:18,770 --> 00:37:20,371 of a clear branching point? 966 00:37:20,371 --> 00:37:20,870 Right? 967 00:37:20,870 --> 00:37:22,800 Moreover, like, what about lower layers, right? 968 00:37:22,800 --> 00:37:24,860 I mean, you start to actually see differences 969 00:37:24,860 --> 00:37:28,400 in sort of frequency content in auditory data and differences 970 00:37:28,400 --> 00:37:29,840 between that and visual data that 971 00:37:29,840 --> 00:37:31,970 seem to emerge very naturally from the underlying 972 00:37:31,970 --> 00:37:33,097 similarities-- 973 00:37:33,097 --> 00:37:35,430 you know, underlying differences between the statistics. 974 00:37:35,430 --> 00:37:37,310 But still, downstream from there, 975 00:37:37,310 --> 00:37:40,264 there are some deep similarities about extraction of objects 976 00:37:40,264 --> 00:37:41,180 of some kind or other. 977 00:37:41,180 --> 00:37:43,230 You know, auditory objects, potentially. 978 00:37:43,230 --> 00:37:45,022 And so I think that's a very narrow way 979 00:37:45,022 --> 00:37:45,980 of posing the question. 980 00:37:45,980 --> 00:37:48,521 And I don't say that everybody should pose it that way by any 981 00:37:48,521 --> 00:37:49,130 means. 982 00:37:49,130 --> 00:37:51,770 But I just think that before we get to multimodal interaction, 983 00:37:51,770 --> 00:37:53,520 which is interesting, I think there's just 984 00:37:53,520 --> 00:37:56,690 this huge space of clear, very concrete ways 985 00:37:56,690 --> 00:37:59,060 to ask the question of similarities and differences 986 00:37:59,060 --> 00:37:59,720 that are-- 987 00:37:59,720 --> 00:38:00,860 like, almost no matter what you'll find, 988 00:38:00,860 --> 00:38:02,300 you'll find something interesting. 989 00:38:02,300 --> 00:38:03,740 JOSH TENENBAUM: You're saying if we enlarge the discussion 990 00:38:03,740 --> 00:38:04,880 from just talking about vision and audition 991 00:38:04,880 --> 00:38:06,920 to other parts of cognition, then we'll 992 00:38:06,920 --> 00:38:09,530 see more of the similarities between these sense modalities, 993 00:38:09,530 --> 00:38:11,655 because what they share will stand out 994 00:38:11,655 --> 00:38:13,720 in relief with respect to the rest of cognition. 995 00:38:13,720 --> 00:38:17,030 Yeah, I mean, I think that's a valuable thing to do, 996 00:38:17,030 --> 00:38:19,020 and it connects to what these guys were saying, 997 00:38:19,020 --> 00:38:21,197 which is that there's a sense in which this-- 998 00:38:21,197 --> 00:38:23,780 you know, something like these deep convolutional architectures 999 00:38:23,780 --> 00:38:27,740 seem like really good ways to do pattern recognition, right? 1000 00:38:27,740 --> 00:38:30,590 This is what I would see as the common theme between where 1001 00:38:30,590 --> 00:38:34,776 a lot of the successes happened in vision and in audition.
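One hedged way to make the comparison Yamins describes concrete: extract layer activations from a vision-trained and an audition-trained network on a common stimulus set and score layer-by-layer similarity, for instance with linear centered kernel alignment (CKA). In the sketch below the activations are random stand-ins; in practice they would come from forward hooks on the two trained networks.

```python
import numpy as np

def linear_cka(x, y):
    # x, y: (n_stimuli, n_units) activation matrices for one layer each.
    # Linear CKA is ~1 when the two layers span the same representation
    # up to a linear transform, and near 0 when they are unrelated.
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    num = np.linalg.norm(x.T @ y, "fro") ** 2
    den = np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro")
    return num / den

# Hypothetical per-layer activations of the two networks on the same
# 200 stimuli (e.g., the frames and soundtracks of the same events).
rng = np.random.default_rng(0)
vision_acts = [rng.normal(size=(200, 128)) for _ in range(5)]
audio_acts = [rng.normal(size=(200, 128)) for _ in range(5)]

for layer, (v, a) in enumerate(zip(vision_acts, audio_acts)):
    print(f"layer {layer}: CKA = {linear_cka(v, a):.3f}")
```

A curve of such scores over depth is one way to ask where a branching point occurs: high similarity in early layers that falls off at some depth would be a candidate divergence point.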
1002 00:38:34,776 --> 00:38:37,400 And I don't think-- and, again, everybody here has heard me say 1003 00:38:37,400 --> 00:38:38,358 this a bunch of times-- 1004 00:38:38,358 --> 00:38:40,430 I think that pattern recognition does not 1005 00:38:40,430 --> 00:38:42,440 exhaust, by any means, intelligence 1006 00:38:42,440 --> 00:38:43,340 or even perception. 1007 00:38:43,340 --> 00:38:46,260 Like, I think even within vision and audition, 1008 00:38:46,260 --> 00:38:49,550 there's a lot we do that goes beyond, at least 1009 00:38:49,550 --> 00:38:51,729 on the surface, you know, pattern recognition 1010 00:38:51,729 --> 00:38:52,520 and classification. 1011 00:38:52,520 --> 00:38:54,410 It's something more like building a generative model. 1012 00:38:54,410 --> 00:38:55,410 Maybe this is a good time to-- 1013 00:38:55,410 --> 00:38:57,243 that's another theme you wanted to bring in. 1014 00:38:57,243 --> 00:38:59,390 But, you know, something about building 1015 00:38:59,390 --> 00:39:02,360 a rich model of the world and its physical interactions. 1016 00:39:02,360 --> 00:39:03,990 And, to me, you know, and, again, 1017 00:39:03,990 --> 00:39:06,680 something Dan and I have talked a lot about, and I think 1018 00:39:06,680 --> 00:39:07,367 it's-- 1019 00:39:07,367 --> 00:39:09,200 you know, you've heard some of this from me, 1020 00:39:09,200 --> 00:39:12,200 and Dan has got some really awesome work in a similar vein 1021 00:39:12,200 --> 00:39:14,300 of trying to understand how, basically, 1022 00:39:14,300 --> 00:39:15,899 deep pattern recognizers-- to me, 1023 00:39:15,899 --> 00:39:18,440 that's another way we could describe deep convolutional-- 1024 00:39:18,440 --> 00:39:21,667 or just deep invariant pattern recognizers, 1025 00:39:21,667 --> 00:39:23,750 where the invariance is over space or time windows 1026 00:39:23,750 --> 00:39:26,390 or whatever it is that deep convolutional-- 1027 00:39:26,390 --> 00:39:28,850 you know, these are obviously important tools. 1028 00:39:28,850 --> 00:39:30,620 They obviously have some connection 1029 00:39:30,620 --> 00:39:32,720 to not just the six-layer cortex architecture 1030 00:39:32,720 --> 00:39:35,810 but these multiple-- you know, the things that go on 1031 00:39:35,810 --> 00:39:37,700 in, like, the ventral stream, for example. 1032 00:39:37,700 --> 00:39:39,450 I don't know, the auditory system as well. 1033 00:39:39,450 --> 00:39:41,720 But it's going on from one cortical area to the next. 1034 00:39:41,720 --> 00:39:43,370 A hierarchy of processing. 1035 00:39:43,370 --> 00:39:47,000 That seems to be a way that cortex has been arranged 1036 00:39:47,000 --> 00:39:50,630 in these two sense modalities in particular to do 1037 00:39:50,630 --> 00:39:53,180 a really powerful kind of pattern recognition. 1038 00:39:53,180 --> 00:39:55,610 And then I think there's the question of, OK, 1039 00:39:55,610 --> 00:40:00,037 how does pattern recognition fit together with model building? 1040 00:40:00,037 --> 00:40:02,120 And, you know, I think in other areas of cognition 1041 00:40:02,120 --> 00:40:04,570 you see a similar kind of interchange, right? 1042 00:40:04,570 --> 00:40:07,580 It might be-- like, this has come up a little bit in action 1043 00:40:07,580 --> 00:40:09,560 planning-- like, model-based planning 1044 00:40:09,560 --> 00:40:11,660 versus more model-free reinforcement learning.
1045 00:40:11,660 --> 00:40:13,160 And those are, again, a place where 1046 00:40:13,160 --> 00:40:15,118 there might be two different systems that might 1047 00:40:15,118 --> 00:40:16,370 interact in some kind of way. 1048 00:40:16,370 --> 00:40:18,911 I think pattern recognition also is useful all over-- 1049 00:40:18,911 --> 00:40:20,660 you know, where cognition starts to become 1050 00:40:20,660 --> 00:40:22,410 different from perception, for example. 1051 00:40:22,410 --> 00:40:24,170 There's so many ways, but things like when you have a goal 1052 00:40:24,170 --> 00:40:26,600 and you're trying to solve a problem, do something. 1053 00:40:26,600 --> 00:40:30,050 Pattern recognition is often useful in guiding problem 1054 00:40:30,050 --> 00:40:31,790 solving, right? 1055 00:40:31,790 --> 00:40:35,690 But it's not the same as a plan, right? 1056 00:40:35,690 --> 00:40:37,190 So, I don't know if this is starting 1057 00:40:37,190 --> 00:40:41,060 to answer your question, but I think this idea of intelligence 1058 00:40:41,060 --> 00:40:43,030 more generally as something like-- 1059 00:40:43,030 --> 00:40:45,500 I mean, the way Laura put it for learning, the same idea, 1060 00:40:45,500 --> 00:40:49,160 she put it as, like, goal directed or goal constrained-- 1061 00:40:49,160 --> 00:40:50,420 how did she put it?-- 1062 00:40:50,420 --> 00:40:52,550 problem solving or something like that, right? 1063 00:40:52,550 --> 00:40:55,820 That's a good way to-- if you need one general purpose 1064 00:40:55,820 --> 00:40:59,930 definition of cognition, that's a good way to put it. 1065 00:40:59,930 --> 00:41:02,900 And then, on the other hand, there's pattern recognition. 1066 00:41:02,900 --> 00:41:05,840 And so you could ask, well, how does pattern recognition more 1067 00:41:05,840 --> 00:41:08,180 generally work and what have we learned 1068 00:41:08,180 --> 00:41:11,240 about how it works in the cortex or computationally 1069 00:41:11,240 --> 00:41:13,790 from studying the commonalities between these two sense 1070 00:41:13,790 --> 00:41:14,829 modalities? 1071 00:41:14,829 --> 00:41:16,370 And then how does pattern recognition 1072 00:41:16,370 --> 00:41:18,650 play into a larger system that is basically 1073 00:41:18,650 --> 00:41:21,230 trying to have goals, build models of the world, 1074 00:41:21,230 --> 00:41:24,770 use those goals to guide its action plans on those models? 1075 00:41:24,770 --> 00:41:27,530 ALEX KELL: On the topic of convolutional neural networks 1076 00:41:27,530 --> 00:41:31,809 and deep learning, like, they are reaching, like, 1077 00:41:31,809 --> 00:41:33,350 kind of impressive successes and they 1078 00:41:33,350 --> 00:41:35,490 might illuminate some similarities and differences 1079 00:41:35,490 --> 00:41:37,480 between the modalities. 1080 00:41:37,480 --> 00:41:40,090 But, in both cases, the learning algorithm 1081 00:41:40,090 --> 00:41:41,720 is extremely non-biological. 1082 00:41:41,720 --> 00:41:43,930 And I was wondering if any of you guys-- 1083 00:41:43,930 --> 00:41:45,650 like, infants don't need millions 1084 00:41:45,650 --> 00:41:51,242 of examples of labeled data to learn what words are. 1085 00:41:51,242 --> 00:41:52,700 So I was wondering if you guys have 1086 00:41:52,700 --> 00:41:56,340 any kind of thoughts on how to make that algorithm more 1087 00:41:56,340 --> 00:41:57,257 biologically plausible?
1088 00:41:57,257 --> 00:41:59,298 DAN YAMINS: I would go to what Josh said earlier, 1089 00:41:59,298 --> 00:42:01,185 which is, you look at the real physically 1090 00:42:01,185 --> 00:42:02,060 embodied environment. 1091 00:42:02,060 --> 00:42:05,850 You look for those low level cues that can be used to, like, 1092 00:42:05,850 --> 00:42:08,414 be a proxy for the higher level information, right? 1093 00:42:08,414 --> 00:42:09,830 And then what you really want is-- 1094 00:42:09,830 --> 00:42:11,080 ALEX KELL: Can you be a little more specific? 1095 00:42:11,080 --> 00:42:11,510 What do you mean? 1096 00:42:11,510 --> 00:42:12,500 DAN YAMINS: Well, do you want to be-- 1097 00:42:12,500 --> 00:42:14,625 JOSH TENENBAUM: I mean, some people have heard this 1098 00:42:14,625 --> 00:42:16,190 from Tommy and others here about, 1099 00:42:16,190 --> 00:42:18,590 like, sort of kinds of natural supervision, right? 1100 00:42:18,590 --> 00:42:20,840 I mean, several people have talked about this, right? 1101 00:42:20,840 --> 00:42:22,131 Is that what you're getting at? 1102 00:42:22,131 --> 00:42:24,590 The idea that, often, just tracking things 1103 00:42:24,590 --> 00:42:26,240 as they move in the world gives you 1104 00:42:26,240 --> 00:42:28,115 a lot of extra effectively labeled data. 1105 00:42:28,115 --> 00:42:30,490 You're getting lots of different views of this microphone 1106 00:42:30,490 --> 00:42:33,350 now, or whatever, from walking around the stage, or all 1107 00:42:33,350 --> 00:42:35,720 of our faces as we're rotating. 1108 00:42:35,720 --> 00:42:38,780 So, when you pointed to the biological implausibility 1109 00:42:38,780 --> 00:42:41,240 of the standard way of training deep networks, 1110 00:42:41,240 --> 00:42:43,130 I think a lot of people are realizing-- 1111 00:42:43,130 --> 00:42:45,560 and this was the main idea behind Tommy's conversion 1112 00:42:45,560 --> 00:42:47,900 to now be a strong prophet for deep learning instead 1113 00:42:47,900 --> 00:42:50,510 of being a critic, right?-- was that, the issue 1114 00:42:50,510 --> 00:42:52,520 of needing lots of labeled training data, that's 1115 00:42:52,520 --> 00:42:53,870 not the biggest issue. 1116 00:42:53,870 --> 00:42:55,680 There's other issues, like backpropagation 1117 00:42:55,680 --> 00:42:58,970 as a mechanism of actually propagating error gradients all 1118 00:42:58,970 --> 00:43:00,890 the way down through a deep network. 1119 00:43:00,890 --> 00:43:02,635 I think that troubles more people. 1120 00:43:02,635 --> 00:43:04,760 DAN YAMINS: I have quite the opposite view on that. 1121 00:43:04,760 --> 00:43:05,551 JOSH TENENBAUM: OK. 1122 00:43:05,551 --> 00:43:07,790 DAN YAMINS: Yes. 1123 00:43:07,790 --> 00:43:10,850 I agree that it's true that the specific biological 1124 00:43:10,850 --> 00:43:14,300 plausibility of a specific deep learning algorithm, 1125 00:43:14,300 --> 00:43:16,280 like backpropagation, is probably suspect. 1126 00:43:16,280 --> 00:43:18,210 But I suspect that by the same token, 1127 00:43:18,210 --> 00:43:20,570 there are somewhat inexact versions that 1128 00:43:20,570 --> 00:43:22,640 are biologically plausible or more plausible 1129 00:43:22,640 --> 00:43:24,380 anyway that could work pretty well. 1130 00:43:24,380 --> 00:43:26,042 I think that's less like-- 1131 00:43:26,042 --> 00:43:27,000 let me put it this way. 1132 00:43:27,000 --> 00:43:28,010 I think that's a flashy question.
1133 00:43:28,010 --> 00:43:29,540 I think if you actually end up solving that 1134 00:43:29,540 --> 00:43:30,770 both from an algorithm point of view 1135 00:43:30,770 --> 00:43:33,186 and maybe, more importantly, seeing how that's implemented 1136 00:43:33,186 --> 00:43:37,176 in a kind of real neural circuit, 1137 00:43:37,176 --> 00:43:38,300 you'll win the Nobel Prize. 1138 00:43:38,300 --> 00:43:41,192 But, I mean, I think that-- 1139 00:43:41,192 --> 00:43:43,400 I feel like that's something that will happen, right? 1140 00:43:43,400 --> 00:43:45,500 I think that there is a bigger question 1141 00:43:45,500 --> 00:43:48,650 out there, which is, you know-- 1142 00:43:48,650 --> 00:43:50,720 I do think that, from an algorithmic point of view, 1143 00:43:50,720 --> 00:43:52,820 there are things that people don't yet know how to do-- 1144 00:43:52,820 --> 00:43:55,850 how to replace, like, millions of heavily semantic training 1145 00:43:55,850 --> 00:43:58,525 examples with those other things, right? 1146 00:43:58,525 --> 00:44:01,025 Like, the things that you just mentioned a moment ago, like, 1147 00:44:01,025 --> 00:44:02,260 the extra data. 1148 00:44:02,260 --> 00:44:04,010 Like, it hasn't actually been demonstrated 1149 00:44:04,010 --> 00:44:04,926 how to really do that. 1150 00:44:04,926 --> 00:44:08,750 And I feel like the details of getting that right will tell us 1151 00:44:08,750 --> 00:44:11,942 a lot about the signals that babies and others are paying 1152 00:44:11,942 --> 00:44:14,150 attention to in a way that's really conceptually very 1153 00:44:14,150 --> 00:44:17,470 interesting and, I think, not so obvious at this point how 1154 00:44:17,470 --> 00:44:18,010 that's-- 1155 00:44:18,010 --> 00:44:19,280 I think it'll happen too, but it will 1156 00:44:19,280 --> 00:44:21,500 be conceptually interesting when it does in a way 1157 00:44:21,500 --> 00:44:21,980 that I think that-- 1158 00:44:21,980 --> 00:44:22,950 JOSH TENENBAUM: Both are pretty interesting. 1159 00:44:22,950 --> 00:44:23,180 DAN YAMINS: Yeah. 1160 00:44:23,180 --> 00:44:24,164 JOSH TENENBAUM: Some people are more 1161 00:44:24,164 --> 00:44:25,148 worried about one or the other. 1162 00:44:25,148 --> 00:44:25,640 But, yeah. 1163 00:44:25,640 --> 00:44:26,220 DAN YAMINS: Exactly. 1164 00:44:26,220 --> 00:44:28,520 And, personally, I would say that, from an algorithm 1165 00:44:28,520 --> 00:44:30,170 point of view, I'm more interested in that second one, 1166 00:44:30,170 --> 00:44:32,960 because I think that will be a place where the biology will 1167 00:44:32,960 --> 00:44:34,711 help teach us how to do better algorithms. 1168 00:44:34,711 --> 00:44:37,001 JOSH TENENBAUM: Learning about the biological mechanism 1169 00:44:37,001 --> 00:44:38,966 of backpropagation seems less likely to transform 1170 00:44:38,966 --> 00:44:39,590 our algorithms. 1171 00:44:39,590 --> 00:44:42,320 Although, again, if you ask Andrew Saxe-- he's one person. 1172 00:44:42,320 --> 00:44:43,400 He's been here. 1173 00:44:43,400 --> 00:44:45,830 He thinks-- that's the question he most wants to solve, 1174 00:44:45,830 --> 00:44:47,272 and he's a very smart person. 1175 00:44:47,272 --> 00:44:48,980 And I think he has some thoughts on that. 1176 00:44:48,980 --> 00:44:53,194 But I-- my sympathies are also with you there.
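The "natural supervision" idea mentioned above-- temporal persistence as a free label-- has a common modern form: a contrastive objective that pulls embeddings of temporally adjacent views of the same thing together and pushes other items in the batch apart. Below is a minimal NumPy sketch of such a loss (InfoNCE-style) with hypothetical encoder outputs; it is offered as one concrete reading of the idea, not as what any panelist specifically has in mind.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    # anchors, positives: (batch, dim) embeddings; positives[i] is a
    # temporally adjacent view of anchors[i] (the "free label" that
    # tracking provides), and the rest of the batch serves as negatives.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # correct pair = diagonal

# Hypothetical embeddings of video frames at t and t+1 from some encoder.
rng = np.random.default_rng(0)
z_t = rng.normal(size=(32, 64))
z_t1 = z_t + 0.1 * rng.normal(size=(32, 64))       # adjacent frames agree
print(info_nce(z_t, z_t1))                         # low loss: pairs match
```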
1177 00:44:53,194 --> 00:44:55,610 I think there are other things besides those that are both 1178 00:44:55,610 --> 00:44:58,430 biologically and cognitively implausible that need work, 1179 00:44:58,430 --> 00:45:01,230 too, so those-- but those are two of the main ones that-- 1180 00:45:01,230 --> 00:45:03,355 HYNEK HERMANSKY: I think you are touching something 1181 00:45:03,355 --> 00:45:06,230 very interesting, and one of the major problems with machine 1182 00:45:06,230 --> 00:45:08,160 learning in general, as I see it, 1183 00:45:08,160 --> 00:45:12,030 which is the use of transcribed or untranscribed data. 1184 00:45:12,030 --> 00:45:14,640 And I think that this is one direction, which actually, 1185 00:45:14,640 --> 00:45:17,660 specifically in speech, it's a big, big, big problem 1186 00:45:17,660 --> 00:45:21,220 because, of course, data is expensive, so-- 1187 00:45:21,220 --> 00:45:22,910 unless you are Google. 1188 00:45:22,910 --> 00:45:26,210 But even there, you will want to have transcribed data. 1189 00:45:26,210 --> 00:45:29,185 You want to know what is inside, and clearly, this is not what-- 1190 00:45:29,185 --> 00:45:31,310 JOSH TENENBAUM: You guys have this thing in speech. 1191 00:45:31,310 --> 00:45:33,410 I think this is-- I'd like to talk more about this, because you 1192 00:45:33,410 --> 00:45:35,450 guys have this thing, particularly at Hopkins, 1193 00:45:35,450 --> 00:45:37,824 in speech that you call zero-resource speech recognition. 1194 00:45:37,824 --> 00:45:39,202 HYNEK HERMANSKY: Right, that's-- 1195 00:45:39,202 --> 00:45:41,660 JOSH TENENBAUM: And I think this is a version of this idea, 1196 00:45:41,660 --> 00:45:44,840 but it's one of the places where studying not just neuroscience 1197 00:45:44,840 --> 00:45:46,970 but cognition and what young children do-- 1198 00:45:46,970 --> 00:45:48,830 the ability to get so much from so little-- 1199 00:45:48,830 --> 00:45:50,450 is a place where we really have a lot 1200 00:45:50,450 --> 00:45:52,533 to learn on the engineering side from the science. 1201 00:45:52,533 --> 00:45:54,590 HYNEK HERMANSKY: Yes, I mean, [INAUDIBLE] 1202 00:45:54,590 --> 00:45:57,210 that I could speak about it a little bit more in depth. 1203 00:45:57,210 --> 00:46:00,240 But definitely, this is the direction I'm thinking about, 1204 00:46:00,240 --> 00:46:02,420 which is like what do you do if you don't know what 1205 00:46:02,420 --> 00:46:05,140 is in the signal, but you know there is a structure, 1206 00:46:05,140 --> 00:46:08,270 and you know that there is information you need. 1207 00:46:08,270 --> 00:46:10,220 And you have to start from scratch, 1208 00:46:10,220 --> 00:46:13,910 figure out where the information is, how it is coded, 1209 00:46:13,910 --> 00:46:16,910 and then use it in the machine. 1210 00:46:16,910 --> 00:46:19,110 And I think it's a general problem, 1211 00:46:19,110 --> 00:46:21,519 the same thing as in vision. 1212 00:46:21,519 --> 00:46:23,060 JOSH TENENBAUM: Maybe-- you're asking 1213 00:46:23,060 --> 00:46:24,226 several different questions. 1214 00:46:24,226 --> 00:46:25,280 I mean, I don't know if-- 1215 00:46:25,280 --> 00:46:26,420 have people in the summer school talked 1216 00:46:26,420 --> 00:46:28,250 about these instabilities? It's an interesting question. 1217 00:46:28,250 --> 00:46:30,530 People are very much divided on what they say. 1218 00:46:30,530 --> 00:46:32,880 And I do think that generative models are going 1219 00:46:32,880 --> 00:46:34,130 to come out differently there.
1220 00:46:34,130 --> 00:46:37,370 But, again, I don't want to say generative models are 1221 00:46:37,370 --> 00:46:41,390 better than discriminatively trained pattern recognizers. 1222 00:46:41,390 --> 00:46:43,130 I think, particularly for perception 1223 00:46:43,130 --> 00:46:45,296 and a lot of other areas, what we need to understand 1224 00:46:45,296 --> 00:46:46,790 is how to combine the best of both. 1225 00:46:46,790 --> 00:46:49,892 So in an audience where people are just neural networks, 1226 00:46:49,892 --> 00:46:51,350 rah, rah, rah, rah, rah, and that's 1227 00:46:51,350 --> 00:46:53,540 all there is, then I'm going to be arguing for the other side. 1228 00:46:53,540 --> 00:46:55,490 But it's not that I think they are better. 1229 00:46:55,490 --> 00:46:57,980 I think they have complementary strengths and weaknesses. 1230 00:46:57,980 --> 00:46:58,760 This might be one. 1231 00:46:58,760 --> 00:47:02,300 I think pretty much any pattern classifier, whether it's 1232 00:47:02,300 --> 00:47:04,040 a neural network or something else, 1233 00:47:04,040 --> 00:47:06,980 will probably be susceptible to these kinds of pathologies 1234 00:47:06,980 --> 00:47:10,340 where you can basically hack up a stimulus that's arbitrarily 1235 00:47:10,340 --> 00:47:15,410 different from an actual member of the class but that 1236 00:47:15,410 --> 00:47:16,190 still gets classified as that class. 1237 00:47:16,190 --> 00:47:17,900 Basically, if you're trying to put 1238 00:47:17,900 --> 00:47:21,256 a separating surface between two or n finite classes-- 1239 00:47:21,256 --> 00:47:23,630 I was trying to see how to formulate this mathematically. 1240 00:47:23,630 --> 00:47:25,644 I think you can basically show that it's not 1241 00:47:25,644 --> 00:47:26,810 specific to neural networks. 1242 00:47:26,810 --> 00:47:28,893 It should be true for any kind of discriminatively 1243 00:47:28,893 --> 00:47:30,110 trained pattern classifier. 1244 00:47:30,110 --> 00:47:31,670 I think generative models have other sorts 1245 00:47:31,670 --> 00:47:32,930 of illusions and pathologies. 1246 00:47:32,930 --> 00:47:34,620 But they're definitely going to be-- 1247 00:47:34,620 --> 00:47:36,620 my sense is there are going to be some ways 1248 00:47:36,620 --> 00:47:38,720 that any pattern classifier is susceptible 1249 00:47:38,720 --> 00:47:40,210 that generative models won't be susceptible to. 1250 00:47:40,210 --> 00:47:42,626 And there will be others that they will be susceptible to. 1251 00:47:42,626 --> 00:47:44,820 But it's sort of an orthogonal issue. 1252 00:47:44,820 --> 00:47:47,660 But I think the illusions that generative models are 1253 00:47:47,660 --> 00:47:51,350 susceptible to are generally going 1254 00:47:51,350 --> 00:47:53,700 to have, like, interesting, rational interpretations. 1255 00:47:53,700 --> 00:47:54,930 It's going to tell you something. 1256 00:47:54,930 --> 00:47:57,020 They're less likely to be susceptible to just completely 1257 00:47:57,020 --> 00:47:59,061 bizarre pathologies that we look at and are like, 1258 00:47:59,061 --> 00:48:01,384 I don't understand why it's seeing that. 1259 00:48:01,384 --> 00:48:03,800 On the other hand, they're going to have other things that 1260 00:48:03,800 --> 00:48:04,550 will frustrate us. 1261 00:48:04,550 --> 00:48:06,230 Like: my inference algorithm is stuck. 1262 00:48:06,230 --> 00:48:09,320 I don't understand why my Markov chain isn't converging.
1263 00:48:09,320 --> 00:48:12,230 If that's the only way you're going to do inference, 1264 00:48:12,230 --> 00:48:14,690 you'll be very frustrated by the dynamics of inference. 1265 00:48:14,690 --> 00:48:16,273 And that's where, hopefully, some kind 1266 00:48:16,273 --> 00:48:19,040 of pattern recognition system will come to the rescue. 1267 00:48:19,040 --> 00:48:21,410 And if we just look at anecdotal experience, 1268 00:48:21,410 --> 00:48:23,737 both in speech and vision, there are certain kinds 1269 00:48:23,737 --> 00:48:25,820 of cases where, like, you know, in a passing rock, 1270 00:48:25,820 --> 00:48:28,520 you suddenly see a face, or a noise. 1271 00:48:28,520 --> 00:48:30,560 Or in a tree people will see Jesus's face 1272 00:48:30,560 --> 00:48:32,990 on arbitrary parts of the visual world. 1273 00:48:32,990 --> 00:48:34,640 And also sometimes in sound. 1274 00:48:34,640 --> 00:48:36,290 You hear something. 1275 00:48:36,290 --> 00:48:42,420 So this idea of seeing signal in noise, we know humans do that. 1276 00:48:42,420 --> 00:48:45,200 But, for example, there are ways to get deep convnets 1277 00:48:45,200 --> 00:48:47,417 in vision to see-- 1278 00:48:47,417 --> 00:48:49,250 you can start off with an arbitrary texture. 1279 00:48:49,250 --> 00:48:50,460 Have you seen this stuff? 1280 00:48:50,460 --> 00:48:53,140 And massage it to look like-- 1281 00:48:53,140 --> 00:48:55,380 like you can start off with a texture of green bars 1282 00:48:55,380 --> 00:48:57,289 and make it look like a dog to the network, 1283 00:48:57,289 --> 00:48:58,830 and it doesn't look like a dog to us. 1284 00:48:58,830 --> 00:49:01,290 And we're never going to see a dog in a periodic pattern 1285 00:49:01,290 --> 00:49:03,260 of green and polka dotted bars. 1286 00:49:03,260 --> 00:49:03,570 JOSH MCDERMOTT: But the reason you 1287 00:49:03,570 --> 00:49:05,135 can do that is because you have perfect access to the network. 1288 00:49:05,135 --> 00:49:05,500 Right? 1289 00:49:05,500 --> 00:49:07,320 And if you had perfect access to visual stimuli [INAUDIBLE]. 1290 00:49:07,320 --> 00:49:07,420 JOSH TENENBAUM: Sure. 1291 00:49:07,420 --> 00:49:07,470 Sure. 1292 00:49:07,470 --> 00:49:08,928 But I'm just saying-- these don't-- 1293 00:49:08,928 --> 00:49:09,890 well, I don't think so. 1294 00:49:09,890 --> 00:49:10,260 DAN YAMINS: Of course there are going 1295 00:49:10,260 --> 00:49:11,730 to be visual illusions in every case. 1296 00:49:11,730 --> 00:49:12,780 The question is whether or not they're 1297 00:49:12,780 --> 00:49:13,830 going to make sense to humans-- 1298 00:49:13,830 --> 00:49:14,100 JOSH TENENBAUM: Right. 1299 00:49:14,100 --> 00:49:15,390 And I think some of them-- 1300 00:49:15,390 --> 00:49:16,980 ALEX KELL: --as a test of whether or not that model-- 1301 00:49:16,980 --> 00:49:17,930 JOSH TENENBAUM: If you learned something from that. 1302 00:49:17,930 --> 00:49:19,020 DAN YAMINS: --is a real model. 1303 00:49:19,020 --> 00:49:20,395 JOSH TENENBAUM: Just to be clear, 1304 00:49:20,395 --> 00:49:22,650 the ones that the convnets are susceptible to that say 1305 00:49:22,650 --> 00:49:24,384 a generative model or a human isn't-- 1306 00:49:24,384 --> 00:49:26,550 they're not signs that they're fundamentally broken. 1307 00:49:26,550 --> 00:49:29,310 Rather, they're what I would predict for any discriminatively 1308 00:49:29,310 --> 00:49:30,150 trained pattern recognizer.
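The fooling-image recipe just described is, at bottom, gradient ascent on a class score. The sketch below runs it on a deliberately tiny stand-in-- a linear softmax "recognizer" with random weights, so the gradient has a closed form-- rather than on a real convnet; published attacks do the same thing with backpropagated gradients, and usually also constrain the pixel range, which this toy omits.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_classes, dog = 1024, 10, 3
W = rng.normal(size=(n_pixels, n_classes))   # stand-in "trained" weights

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

x = rng.random(n_pixels)                     # arbitrary starting texture
for _ in range(200):
    p = softmax(W.T @ x)
    # closed-form gradient of log p(dog | x) for a softmax-linear model
    x += 0.05 * (W[:, dog] - W @ p)

print(softmax(W.T @ x)[dog])                 # ~1.0: confidently "dog",
                                             # though x never resembled one
```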
1309 00:49:30,150 --> 00:49:32,890 Whether that's good or bad, 1310 00:49:32,890 --> 00:49:35,610 they're signs of the limitations of pattern recognition. 1311 00:49:35,610 --> 00:49:38,550 DAN YAMINS: Or signs of the limitations of the types of tasks 1312 00:49:38,550 --> 00:49:43,050 that are being used, on which the recognition is being done. 1313 00:49:43,050 --> 00:49:45,600 If you replaced something like categorization with something 1314 00:49:45,600 --> 00:49:51,030 like the ability to predict geometric and physical 1315 00:49:51,030 --> 00:49:53,910 interactions some amount of time in the future, 1316 00:49:53,910 --> 00:49:56,980 maybe you'd end up with quite different illusions. 1317 00:49:56,980 --> 00:49:57,480 Right? 1318 00:49:57,480 --> 00:49:59,605 There's something very brittle about categorization 1319 00:49:59,605 --> 00:50:02,720 that could lead to, sort of, the null space being very broad. 1320 00:50:02,720 --> 00:50:03,720 JOSH TENENBAUM: Exactly. 1321 00:50:03,720 --> 00:50:04,290 That's what I mean. 1322 00:50:04,290 --> 00:50:06,498 By pattern recognition, I mean pattern classification 1323 00:50:06,498 --> 00:50:08,280 in particular. 1324 00:50:08,280 --> 00:50:09,727 Not prediction but classification. 1325 00:50:09,727 --> 00:50:10,477 DAN YAMINS: Right. 1326 00:50:10,477 --> 00:50:13,650 But I don't think it's yet known whether the existence of these 1327 00:50:13,650 --> 00:50:17,790 sort of fooling images, or these kinds of weird illusions, means 1328 00:50:17,790 --> 00:50:19,650 the models are bad or do not pick out 1329 00:50:19,650 --> 00:50:21,204 the correct solutions. 1330 00:50:21,204 --> 00:50:22,120 I don't know whether-- 1331 00:50:22,120 --> 00:50:23,280 people are not totally sure whether it's 1332 00:50:23,280 --> 00:50:24,989 like, the networks need to have feedback, 1333 00:50:24,989 --> 00:50:27,071 and that will be what you really need to solve it. 1334 00:50:27,071 --> 00:50:28,860 It's at that broad level of mechanism. 1335 00:50:28,860 --> 00:50:30,937 Or is it like, the task is wrong? 1336 00:50:30,937 --> 00:50:32,520 So it's sort of a little bit less bad. 1337 00:50:32,520 --> 00:50:35,400 Or maybe, like Josh said, it's like the easiest thing would 1338 00:50:35,400 --> 00:50:37,060 be, well, actually if you just did this 1339 00:50:37,060 --> 00:50:39,150 with the neural system, you'd find exactly the same thing. 1340 00:50:39,150 --> 00:50:41,270 But we don't have access to it, so we're not finding it. 1341 00:50:41,270 --> 00:50:41,790 Right? 1342 00:50:41,790 --> 00:50:45,850 And so I think it's not totally clear where it is yet. 1343 00:50:45,850 --> 00:50:46,350 Right? 1344 00:50:46,350 --> 00:50:49,080 It's a great question, but I feel like the answers are murky 1345 00:50:49,080 --> 00:50:50,234 right now. 1346 00:50:50,234 --> 00:50:50,900 ALEX KELL: Yeah. 1347 00:50:50,900 --> 00:50:51,990 OK. 1348 00:50:51,990 --> 00:50:56,200 On the topic of feedback, I wanted to kind of move over-- 1349 00:50:56,200 --> 00:50:58,980 and Gabriel talked about feedback during his talk. 1350 00:50:58,980 --> 00:51:01,272 And there's really heavy kind of feedback 1351 00:51:01,272 --> 00:51:02,730 in both of these modalities, where, 1352 00:51:02,730 --> 00:51:04,800 like, in hearing, as Josh talked about, 1353 00:51:04,800 --> 00:51:06,930 it goes all the way back to where it can 1354 00:51:06,930 --> 00:51:09,690 alter the mechanics of the cochlea, 1355 00:51:09,690 --> 00:51:11,082 of the basilar membrane.
1356 00:51:11,082 --> 00:51:12,040 That's pretty shocking. 1357 00:51:12,040 --> 00:51:13,300 That's pretty interesting. 1358 00:51:13,300 --> 00:51:15,017 So what is the role-- 1359 00:51:15,017 --> 00:51:17,100 Gabriel talked about a couple of specific examples 1360 00:51:17,100 --> 00:51:21,160 where feedback would actually be useful. 1361 00:51:21,160 --> 00:51:23,219 Can you say something more broadly about, 1362 00:51:23,219 --> 00:51:25,260 in general, when is feedback useful across the two 1363 00:51:25,260 --> 00:51:25,750 modalities? 1364 00:51:25,750 --> 00:51:27,791 Do we think there are kind of specific instances-- 1365 00:51:27,791 --> 00:51:29,680 can we talk about specific instances in each? 1366 00:51:29,680 --> 00:51:31,554 GABRIEL KREIMAN: Throughout the visual system 1367 00:51:31,554 --> 00:51:36,600 there is feedback essentially all over except for the retina. 1368 00:51:36,600 --> 00:51:38,540 Throughout the auditory cortex, again, 1369 00:51:38,540 --> 00:51:43,580 there is feedback and there are recurrent connections all over. 1370 00:51:48,030 --> 00:51:52,620 We've been interested in a couple of specific situations 1371 00:51:52,620 --> 00:51:56,000 where feedback may be playing a role. 1372 00:51:56,000 --> 00:51:58,650 This includes visual search. 1373 00:51:58,650 --> 00:52:02,550 This includes pattern completion, 1374 00:52:02,550 --> 00:52:04,890 feature-based attention. 1375 00:52:04,890 --> 00:52:07,410 I believe, and hopefully Josh will expand on this, 1376 00:52:07,410 --> 00:52:09,570 that these are problems that at least at a very 1377 00:52:09,570 --> 00:52:12,420 superficial level also exist in the auditory domain, 1378 00:52:12,420 --> 00:52:15,900 and where it's tempting to think that feedback will also 1379 00:52:15,900 --> 00:52:18,670 play a role. 1380 00:52:18,670 --> 00:52:20,970 More generally, you can mathematically 1381 00:52:20,970 --> 00:52:24,000 demonstrate that any network with feedback 1382 00:52:24,000 --> 00:52:26,130 can be transformed into a feed-forward network 1383 00:52:26,130 --> 00:52:31,350 just by decomposing time into more layers. 1384 00:52:31,350 --> 00:52:39,000 So I think ultimately, feedback in the cortex 1385 00:52:39,000 --> 00:52:41,550 may have a lot to do with how many layers you can actually 1386 00:52:41,550 --> 00:52:47,230 fit into a system the size of the head 1387 00:52:47,230 --> 00:52:50,087 that it has to go through-- interesting places 1388 00:52:50,087 --> 00:52:50,670 at some point. 1389 00:52:50,670 --> 00:52:53,050 And there are sort of physical limitations 1390 00:52:53,050 --> 00:52:57,674 to that more than fundamental computational ones. 1391 00:52:57,674 --> 00:52:59,340 At the heart of this is the question of, 1392 00:52:59,340 --> 00:53:01,359 how many recurrent computations do you need? 1393 00:53:01,359 --> 00:53:03,900 How much feedback you actually need, how many recurrent loops 1394 00:53:03,900 --> 00:53:04,870 you need. 1395 00:53:04,870 --> 00:53:07,326 If that involves only two or three loops, 1396 00:53:07,326 --> 00:53:08,700 I think it's easy to convert that 1397 00:53:08,700 --> 00:53:11,830 into a feed-forward network that will do the same job. 1398 00:53:11,830 --> 00:53:14,640 If that involves hundreds of iterations and loops, 1399 00:53:14,640 --> 00:53:17,860 it's harder to think about a biological system that 1400 00:53:17,860 --> 00:53:19,620 will accomplish that.
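Kreiman's unrolling claim can be written down directly: running a feedback network for T iterations computes the same function as a T-layer feed-forward stack whose layers all share weights. A minimal sketch with made-up weights follows; the equivalence is exact, and the cost is that depth grows with the number of recurrent loops, which is his point about two or three loops versus hundreds.

```python
import numpy as np

def recurrent_settle(x, W_ff, W_fb, n_loops):
    # Feedback network settling on one static input:
    # h <- relu(W_ff @ x + W_fb @ h), iterated n_loops times.
    h = np.zeros(W_fb.shape[0])
    for _ in range(n_loops):
        h = np.maximum(0.0, W_ff @ x + W_fb @ h)
    return h

def unrolled_feedforward(x, W_ff, W_fb, n_layers):
    # The identical computation written as a feed-forward stack: one
    # layer per recurrent iteration, every layer sharing the same two
    # weight matrices.  Time has been decomposed into depth.
    h = np.zeros(W_fb.shape[0])
    for _layer in range(n_layers):
        h = np.maximum(0.0, W_ff @ x + W_fb @ h)
    return h

rng = np.random.default_rng(0)
W_ff = rng.normal(size=(16, 8))
W_fb = 0.1 * rng.normal(size=(16, 16))   # weak feedback, so it settles
x = rng.normal(size=8)
assert np.allclose(recurrent_settle(x, W_ff, W_fb, 3),
                   unrolled_feedforward(x, W_ff, W_fb, 3))
```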
1401 00:53:19,620 --> 00:53:21,480 But at least at the very superficial level, 1402 00:53:21,480 --> 00:53:22,396 I would imagine that-- 1403 00:53:22,396 --> 00:53:25,837 JOSH TENENBAUM: Can I ask a very focused version 1404 00:53:25,837 --> 00:53:27,420 of the same question, or try to, which 1405 00:53:27,420 --> 00:53:30,110 is, what is the computational role of feedback 1406 00:53:30,110 --> 00:53:31,770 in vision and audition? 1407 00:53:31,770 --> 00:53:33,630 Like when we talk about feedback, 1408 00:53:33,630 --> 00:53:36,180 maybe we mean something like top-down connections 1409 00:53:36,180 --> 00:53:39,781 in the brain or something like recurrent processing. 1410 00:53:39,781 --> 00:53:42,030 Just from a computational point of view of the problem 1411 00:53:42,030 --> 00:53:43,405 we're trying to solve, what do we 1412 00:53:43,405 --> 00:53:45,170 think its roles are in each of those? 1413 00:53:45,170 --> 00:53:45,750 GABRIEL KREIMAN: So more and more, 1414 00:53:45,750 --> 00:53:47,550 I think that's the wrong kind of question to ask. 1415 00:53:47,550 --> 00:53:48,000 If I ask you-- 1416 00:53:48,000 --> 00:53:48,630 JOSH TENENBAUM: Why? 1417 00:53:48,630 --> 00:53:51,171 GABRIEL KREIMAN: What's the role of feed-forward connections? 1418 00:53:51,171 --> 00:53:52,850 JOSH TENENBAUM: Pattern recognition. 1419 00:53:52,850 --> 00:53:53,665 GABRIEL KREIMAN: There is no role 1420 00:53:53,665 --> 00:53:54,870 of feed-forward connections. 1421 00:53:54,870 --> 00:53:55,010 JOSH TENENBAUM: No. 1422 00:53:55,010 --> 00:53:55,830 On the contrary. 1423 00:53:55,830 --> 00:53:56,872 Tommy has a theory of it. 1424 00:53:56,872 --> 00:53:58,288 You know, you have another theory. 1425 00:53:58,288 --> 00:53:59,760 Something like very quickly trying 1426 00:53:59,760 --> 00:54:01,800 to find invariant features-- 1427 00:54:01,800 --> 00:54:04,289 trying to very quickly find invariant features 1428 00:54:04,289 --> 00:54:05,580 of certain classes of patterns. 1429 00:54:05,580 --> 00:54:06,540 That's a hypothesis. 1430 00:54:06,540 --> 00:54:08,114 It's pretty well supported. 1431 00:54:08,114 --> 00:54:09,780 GABRIEL KREIMAN: There's a lot of things 1432 00:54:09,780 --> 00:54:10,770 that happen with feed-forward. 1433 00:54:10,770 --> 00:54:12,728 HYNEK HERMANSKY: If you want feedback so that you 1434 00:54:12,728 --> 00:54:15,140 can make things better somehow, I mean, you 1435 00:54:15,140 --> 00:54:17,280 need a measure of goodness first. 1436 00:54:17,280 --> 00:54:19,290 I mean, otherwise I mean, how to build-- 1437 00:54:19,290 --> 00:54:21,720 I agree with you that you can make 1438 00:54:21,720 --> 00:54:25,440 a very deep structure which will function as a feedback thing. 1439 00:54:25,440 --> 00:54:27,450 But always what worries me the most, 1440 00:54:27,450 --> 00:54:30,570 and in general in a number of cognitive problems, 1441 00:54:30,570 --> 00:54:36,870 is, how do I provide my machine with some mechanism which 1442 00:54:36,870 --> 00:54:40,230 tells the machine that the output is good or not? 1443 00:54:40,230 --> 00:54:43,830 If the output is bad, if my image is making no sense, 1444 00:54:43,830 --> 00:54:47,520 there is no dog but it's a kind of weird mix of green things 1445 00:54:47,520 --> 00:54:50,290 and it's telling me it's a dog, I need feedback. 1446 00:54:50,290 --> 00:54:52,410 That's the point at which I need the feedback, I believe. 1447 00:54:52,410 --> 00:54:53,600 And I need to fix things. 1448 00:54:53,600 --> 00:54:56,280 Josh is talking about tuning the cochlea.
1449 00:54:56,280 --> 00:54:59,880 Yeah, of course that means that sharpening the tuning 1450 00:54:59,880 --> 00:55:00,540 is possible. 1451 00:55:00,540 --> 00:55:02,880 But in communication, I mean, if things 1452 00:55:02,880 --> 00:55:05,190 are noisy I go ahead and close the door. 1453 00:55:05,190 --> 00:55:06,740 There's the feedback to me. 1454 00:55:06,740 --> 00:55:09,240 But I know, as a human being, I know 1455 00:55:09,240 --> 00:55:11,550 information is not getting through, 1456 00:55:11,550 --> 00:55:12,960 and I do something about it. 1457 00:55:12,960 --> 00:55:13,880 And this is what we-- 1458 00:55:13,880 --> 00:55:14,160 JOSH TENENBAUM: That's great. You just 1459 00:55:14,160 --> 00:55:16,372 gave two good examples-- I think, just 1460 00:55:16,372 --> 00:55:18,330 to generalize those, right, or just to say them 1461 00:55:18,330 --> 00:55:19,260 in more general terms. 1462 00:55:19,260 --> 00:55:21,839 One role of feedback that people have hypothesized 1463 00:55:21,839 --> 00:55:23,880 is like in the context of something like analysis 1464 00:55:23,880 --> 00:55:24,467 by synthesis. 1465 00:55:24,467 --> 00:55:26,550 If you have a high-level model of what's going on, 1466 00:55:26,550 --> 00:55:27,970 and you want to see, does that really make sense? 1467 00:55:27,970 --> 00:55:29,860 Does that really explain my low-level data? 1468 00:55:29,860 --> 00:55:32,400 Let me try it out and see that. 1469 00:55:32,400 --> 00:55:34,527 Another is, basically saying, the role 1470 00:55:34,527 --> 00:55:37,110 of the feed-forward connections is a kind of invariant pattern 1471 00:55:37,110 --> 00:55:40,290 recognizer, and tuning those, tuning the filters, 1472 00:55:40,290 --> 00:55:44,340 tuning the patterns to context or in particular 1473 00:55:44,340 --> 00:55:48,000 in a contextual way to make the features more 1474 00:55:48,000 --> 00:55:49,860 diagnostic in this particular context. 1475 00:55:49,860 --> 00:55:51,210 Those are two ideas. 1476 00:55:51,210 --> 00:55:51,810 And there are probably others. 1477 00:55:51,810 --> 00:55:54,018 HYNEK HERMANSKY: Yeah, you gave a wonderful example 1478 00:55:54,018 --> 00:55:55,860 with analysis by synthesis. 1479 00:55:55,860 --> 00:55:59,490 But even there, we need an error measure. 1480 00:55:59,490 --> 00:56:01,410 And I'm not saying that least mean 1481 00:56:01,410 --> 00:56:03,830 squared error between what I generate 1482 00:56:03,830 --> 00:56:05,700 and what I see is the right one. 1483 00:56:05,700 --> 00:56:06,380 JOSH TENENBAUM: So you think feedback 1484 00:56:06,380 --> 00:56:08,160 helps tune the error measure. 1485 00:56:08,160 --> 00:56:09,670 HYNEK HERMANSKY: Well, no. 1486 00:56:09,670 --> 00:56:11,160 It's a chicken-and-egg problem. 1487 00:56:11,160 --> 00:56:13,920 I think I need the error measure first, before I even 1488 00:56:13,920 --> 00:56:15,557 start using the feedback. 1489 00:56:15,557 --> 00:56:17,640 Because, you know, feedback, we can talk about it. 1490 00:56:17,640 --> 00:56:19,590 But if you want to implement it, you 1491 00:56:19,590 --> 00:56:22,680 have to figure out, what is the error measure, 1492 00:56:22,680 --> 00:56:23,985 or what is the criterion 1493 00:56:23,985 --> 00:56:28,422 by which I recognize my output is bad or not good enough? 1494 00:56:28,422 --> 00:56:29,880 Obviously it will be a little bit bad. 1495 00:56:29,880 --> 00:56:30,379 Right? 1496 00:56:30,379 --> 00:56:33,000 It is not good enough, and I have to do something about it.
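A toy version of the loop being discussed here: a generative model renders a signal from latent parameters, and "analysis" searches for the latents whose rendering minimizes an explicit error measure. Everything below is hypothetical-- a two-parameter linear model, with least squares chosen purely for illustration-- and Hermansky's worry lands on exactly one line: the whole loop is only as good as the choice of `error`.

```python
import numpy as np

t = np.linspace(0, 2 * np.pi, 200)

def render(latents):
    # Hypothetical world model: two sources with unknown amplitudes.
    return latents[0] * np.sin(t) + latents[1] * np.sin(2 * t)

def analyze(observation, n_iters=500, step=0.1):
    latents = np.zeros(2)                  # initial hypothesis
    for _ in range(n_iters):
        residual = render(latents) - observation
        error = np.mean(residual ** 2)     # the error measure (the crux;
                                           # the gradient below is its
                                           # derivative w.r.t. the latents)
        grad = 2 * np.array([np.mean(residual * np.sin(t)),
                             np.mean(residual * np.sin(2 * t))])
        latents -= step * grad
    return latents

obs = render(np.array([0.7, -1.2]))        # world state to be inferred
print(analyze(obs))                        # recovers ~[0.7, -1.2]
```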
1497 00:56:33,000 --> 00:56:36,600 Once I know it, I know what to do, I think. 1498 00:56:36,600 --> 00:56:39,559 Well maybe not me, but my students, whatever. 1499 00:56:39,559 --> 00:56:41,850 This is one of the big problems we are working 1500 00:56:41,850 --> 00:56:43,890 on, actually-- figuring out, how can I 1501 00:56:43,890 --> 00:56:46,590 tell that the output from my neural net or something 1502 00:56:46,590 --> 00:56:49,630 is good or bad? 1503 00:56:49,630 --> 00:56:52,014 And so far I don't have any answer. 1504 00:56:55,929 --> 00:56:58,470 JOSH TENENBAUM: But I think it's neat that those two or three 1505 00:56:58,470 --> 00:56:59,594 different kinds of things-- 1506 00:56:59,594 --> 00:57:02,880 they are totally parallel in vision and audition. 1507 00:57:02,880 --> 00:57:04,890 They're useful for both, engineering-wise, 1508 00:57:04,890 --> 00:57:07,890 and people have long proposed them both in psychology 1509 00:57:07,890 --> 00:57:09,084 and neuroscience. 1510 00:57:09,084 --> 00:57:10,500 GABRIEL KREIMAN: Generally, I think 1511 00:57:10,500 --> 00:57:12,458 there's a lot to be done in this area. 1512 00:57:12,458 --> 00:57:15,957 And the notion of adding-- 1513 00:57:15,957 --> 00:57:17,540 I mean, a lot of what's been happening 1514 00:57:17,540 --> 00:57:19,623 in convolutional networks is sort of tweaks and hacks 1515 00:57:19,623 --> 00:57:20,370 here and there. 1516 00:57:20,370 --> 00:57:22,290 If there are fundamental principles that 1517 00:57:22,290 --> 00:57:25,140 come from studying recurrent connections and feedback 1518 00:57:25,140 --> 00:57:27,252 from the auditory domain, from the visual domain, 1519 00:57:27,252 --> 00:57:29,460 those are the sort of things that, as Dan was saying, 1520 00:57:29,460 --> 00:57:33,940 could potentially sort of lead to major jumps in performance 1521 00:57:33,940 --> 00:57:36,660 or in our conceptual understanding of what 1522 00:57:36,660 --> 00:57:38,470 these networks are doing. 1523 00:57:38,470 --> 00:57:41,010 I think it's a very rich area of exploration, 1524 00:57:41,010 --> 00:57:45,920 both in vision and the auditory world. 1525 00:57:45,920 --> 00:57:49,632 ALEX KELL: And I wanted to ask Josh one more thing-- 1526 00:57:49,632 --> 00:57:50,840 the common constraints thing. 1527 00:57:50,840 --> 00:57:52,150 You were kind of saying like, it seems like there would 1528 00:57:52,150 --> 00:57:53,983 be common constraints and there are probably 1529 00:57:53,983 --> 00:57:56,390 consequences of those. 1530 00:57:56,390 --> 00:57:58,250 What are some kind of specific consequences 1531 00:57:58,250 --> 00:57:59,499 that you think would come out? 1532 00:57:59,499 --> 00:58:02,810 Like, how can we think about this? 1533 00:58:02,810 --> 00:58:05,120 To the extent that there is kind of a shared system, 1534 00:58:05,120 --> 00:58:06,590 what does it mean? 1535 00:58:11,500 --> 00:58:13,140 JOSH TENENBAUM: Well-- so again, I 1536 00:58:13,140 --> 00:58:15,330 can only, as several of the other speakers said, 1537 00:58:15,330 --> 00:58:18,720 give my very, very personal, subjectively 1538 00:58:18,720 --> 00:58:19,710 biased view. 1539 00:58:19,710 --> 00:58:22,290 But I think the brain has a way to think 1540 00:58:22,290 --> 00:58:23,874 about the physical world-- to represent, 1541 00:58:23,874 --> 00:58:25,164 to perceive the physical world. 1542 00:58:25,164 --> 00:58:27,540 And it's really like a physics engine in your head.
1543 00:58:27,540 --> 00:58:28,987 And that the different-- 1544 00:58:28,987 --> 00:58:30,570 it's like analysis by synthesis, which is 1545 00:58:30,570 --> 00:58:33,000 a familiar idea probably best developed classically 1546 00:58:33,000 --> 00:58:36,780 in speech, but in a way that's almost independent 1547 00:58:36,780 --> 00:58:37,690 of sense modality. 1548 00:58:37,690 --> 00:58:39,800 I think we have different sensory modalities, 1549 00:58:39,800 --> 00:58:41,550 and they're all just different projections 1550 00:58:41,550 --> 00:58:44,730 of an underlying physical representation of the world. 1551 00:58:44,730 --> 00:58:47,340 I think that, whether it's understanding 1552 00:58:47,340 --> 00:58:50,820 simple kinds of events as I was trying to illustrate here, 1553 00:58:50,820 --> 00:58:53,310 or many other kinds of things, I think, basically, 1554 00:58:53,310 --> 00:58:55,740 at some level, one of the key outputs of all 1555 00:58:55,740 --> 00:58:58,860 of these different sensory processing 1556 00:58:58,860 --> 00:59:01,350 pipelines has to be a shared system that 1557 00:59:01,350 --> 00:59:03,450 represents the world in three dimensions 1558 00:59:03,450 --> 00:59:07,410 with physical objects that have physical reality-- 1559 00:59:07,410 --> 00:59:11,000 properties like mass, or surface properties 1560 00:59:11,000 --> 00:59:13,730 that produce friction when it comes to motion-- roughness. 1561 00:59:13,730 --> 00:59:14,230 Right. 1562 00:59:14,230 --> 00:59:19,440 There's some way to think about the forces, whether the force 1563 00:59:19,440 --> 00:59:20,860 that one object exerts on another, 1564 00:59:20,860 --> 00:59:24,300 or the force that an object presents in resisting 1565 00:59:24,300 --> 00:59:28,110 when I reach for it-- either the rigidity that resists my grasp 1566 00:59:28,110 --> 00:59:30,418 and that [INAUDIBLE] with it, 1567 00:59:30,418 --> 00:59:32,000 or the weight of the object that 1568 00:59:32,000 --> 00:59:34,630 requires me to exert some force to lift it. 1569 00:59:34,630 --> 00:59:36,720 So I think there has to be a shared 1570 00:59:36,720 --> 00:59:40,050 representation that bridges perception to action. 1571 00:59:40,050 --> 00:59:42,480 And that it's a physical representation, 1572 00:59:42,480 --> 00:59:44,910 and that it has to bridge-- it's the same representation 1573 00:59:44,910 --> 00:59:47,160 that's going to bridge the different sense modalities. 1574 00:59:47,160 --> 00:59:49,077 DAN YAMINS: Yeah, but to make [INAUDIBLE]. 1575 00:59:49,077 --> 00:59:50,160 JOSH TENENBAUM: More what? 1576 00:59:50,160 --> 00:59:52,080 DAN YAMINS: More brain meat-oriented, 1577 00:59:52,080 --> 00:59:53,840 I think that there's a version of that 1578 00:59:53,840 --> 00:59:56,310 that could be a constraint that is so strong that you have 1579 00:59:56,310 --> 00:59:58,260 to have a special brain area that's 1580 00:59:58,260 --> 00:59:59,850 used as the clearinghouse for doing 1581 00:59:59,850 --> 01:00:00,610 that common representation. 1582 01:00:00,610 --> 01:00:01,530 JOSH TENENBAUM: I'm certainly not saying that. 1583 01:00:01,530 --> 01:00:01,770 DAN YAMINS: OK. 1584 01:00:01,770 --> 01:00:01,950 Right. 1585 01:00:01,950 --> 01:00:03,116 No, I didn't think you were. 1586 01:00:03,116 --> 01:00:04,659 But that would be a concrete result.
1587 01:00:04,659 --> 01:00:06,450 Another concrete result is, like, effectively 1588 01:00:06,450 --> 01:00:11,640 that the individual modality structures are constrained 1589 01:00:11,640 --> 01:00:13,350 in such a way that they have an API that 1590 01:00:13,350 --> 01:00:16,290 has access to information from the other modality, 1591 01:00:16,290 --> 01:00:19,620 so that message passing is efficient, among other things. 1592 01:00:19,620 --> 01:00:20,120 Right? 1593 01:00:20,120 --> 01:00:21,600 And I think that they can talk to each other. 1594 01:00:21,600 --> 01:00:22,680 JOSH TENENBAUM: I think it's often not direct. 1595 01:00:22,680 --> 01:00:25,290 I think some of it might be, but a lot of it is going to be-- 1596 01:00:25,290 --> 01:00:27,706 it's not like vision talking to audition, but each of them 1597 01:00:27,706 --> 01:00:29,870 talking to physics. 1598 01:00:29,870 --> 01:00:30,620 DAN YAMINS: Right. 1599 01:00:30,620 --> 01:00:31,485 Right. 1600 01:00:31,485 --> 01:00:33,360 JOSH TENENBAUM: It's very hard for a lot of-- 1601 01:00:33,360 --> 01:00:33,845 DAN YAMINS: Right. 1602 01:00:33,845 --> 01:00:34,870 And that's a third-- 1603 01:00:34,870 --> 01:00:35,260 JOSH TENENBAUM: Mid-level vision and mid-level audition 1604 01:00:35,260 --> 01:00:36,300 find it hard to talk to each other. 1605 01:00:36,300 --> 01:00:36,430 DAN YAMINS: Right. 1606 01:00:36,430 --> 01:00:38,010 And a third possibility is that it's actually not 1607 01:00:38,010 --> 01:00:40,301 really so much of-- it's that there's no particular brain 1608 01:00:40,301 --> 01:00:42,900 area, and they're not exactly talking to each other via an API 1609 01:00:42,900 --> 01:00:43,659 directly. 1610 01:00:43,659 --> 01:00:46,200 It's just that there is a common constraint in the world that 1611 01:00:46,200 --> 01:00:49,170 forces them to have similar structure 1612 01:00:49,170 --> 01:00:52,350 or sort of aligned structure for representing the things that 1613 01:00:52,350 --> 01:00:54,799 are caused by the same underlying phenomenon. 1614 01:00:54,799 --> 01:00:57,090 And that's, like, the weakest of the types of constraints 1615 01:00:57,090 --> 01:00:57,923 that you might have. 1616 01:00:57,923 --> 01:00:58,590 Right? 1617 01:00:58,590 --> 01:01:01,110 The first one is very strong. 1618 01:01:01,110 --> 01:01:03,610 JOSH TENENBAUM: But it's not vague or content-less. 1619 01:01:03,610 --> 01:01:05,520 So, Nancy Kanwisher and Jason Fischer 1620 01:01:05,520 --> 01:01:06,960 have a particular hypothesis. 1621 01:01:06,960 --> 01:01:08,910 They've been doing some preliminary studies 1622 01:01:08,910 --> 01:01:11,118 on the kind of intuitive physics engine in the brain. 1623 01:01:11,118 --> 01:01:13,320 And they could point to a network of brain areas, 1624 01:01:13,320 --> 01:01:15,390 some premotor, some parietal. 1625 01:01:15,390 --> 01:01:16,900 You know, it's possible. 1626 01:01:16,900 --> 01:01:17,400 Who knows. 1627 01:01:17,400 --> 01:01:18,275 It's very early days. 1628 01:01:18,275 --> 01:01:20,820 But this might be a candidate way 1629 01:01:20,820 --> 01:01:24,900 into a view of brain systems that 1630 01:01:24,900 --> 01:01:26,140 might be this physics engine. 1631 01:01:26,140 --> 01:01:26,590 DAN YAMINS: Right. 1632 01:01:26,590 --> 01:01:27,000 But you are-- 1633 01:01:27,000 --> 01:01:28,705 JOSH TENENBAUM: Also ventral stream areas.
1634 01:01:28,705 --> 01:01:30,455 DAN YAMINS: [INAUDIBLE] be something like, 1635 01:01:30,455 --> 01:01:33,330 if you optimized a network's set of parameters 1636 01:01:33,330 --> 01:01:37,190 to do the joint physics prediction interaction task, 1637 01:01:37,190 --> 01:01:39,440 you'll get a different result than if you sort of just 1638 01:01:39,440 --> 01:01:41,340 did each modality separately. 1639 01:01:41,340 --> 01:01:44,010 And that would be a better-- that new, different thing would 1640 01:01:44,010 --> 01:01:47,010 be a better match to the actual neural response patterns 1641 01:01:47,010 --> 01:01:47,870 in interesting-- 1642 01:01:47,870 --> 01:01:48,150 JOSH TENENBAUM: Yeah. 1643 01:01:48,150 --> 01:01:49,410 I think that would be a really cool thing to explore. 1644 01:01:49,410 --> 01:01:50,880 DAN YAMINS: And that's, I think, the concrete way it would 1645 01:01:50,880 --> 01:01:51,040 cash out. 1646 01:01:51,040 --> 01:01:52,350 And that certainly seems possible. 1647 01:01:52,350 --> 01:01:52,920 JOSH TENENBAUM: And it might be, you know, 1648 01:01:52,920 --> 01:01:55,450 a lot of things which are traditionally called 1649 01:01:55,450 --> 01:01:57,510 association cortex, right. 1650 01:01:57,510 --> 01:01:59,940 This is an old idea, and I'm not enough of a neuroscientist 1651 01:01:59,940 --> 01:02:02,139 to know, but a cartoon history is, 1652 01:02:02,139 --> 01:02:04,680 there are lots of parts of the brain that nobody could figure 1653 01:02:04,680 --> 01:02:06,680 out what they were doing, because they didn't respond 1654 01:02:06,680 --> 01:02:09,570 in an obvious selective way to one particular sense modality. 1655 01:02:09,570 --> 01:02:12,700 It wasn't exactly obvious what the deficit was. 1656 01:02:12,700 --> 01:02:16,280 And so they came to be called association cortex. 1657 01:02:16,280 --> 01:02:19,650 That connects to the study of cross-modal association 1658 01:02:19,650 --> 01:02:22,575 and association to semantics, and this idea of association. 1659 01:02:22,575 --> 01:02:24,450 It's this thing you say when you don't really 1660 01:02:24,450 --> 01:02:25,350 know what's going on. 1661 01:02:25,350 --> 01:02:28,170 But it's quite possible that big chunks 1662 01:02:28,170 --> 01:02:30,660 of the brain we're calling association cortex 1663 01:02:30,660 --> 01:02:33,150 are actually doing something like this. 1664 01:02:33,150 --> 01:02:36,540 They're this convergence zone for a shared 1665 01:02:36,540 --> 01:02:38,850 physical representation across different perceptual 1666 01:02:38,850 --> 01:02:42,200 modalities, bridging to action plans. 1667 01:02:42,200 --> 01:02:45,450 And a big open challenge is that we 1668 01:02:45,450 --> 01:02:49,050 can have what feels to us like a deep debate 1669 01:02:49,050 --> 01:02:51,784 that we can think of as, like, the central problem in neuroscience 1670 01:02:51,784 --> 01:02:53,950 of, how do we combine whatever you want to call it-- 1671 01:02:53,950 --> 01:02:55,741 I don't know, generative and discriminative, 1672 01:02:55,741 --> 01:02:59,580 or model-based analysis by synthesis and pattern 1673 01:02:59,580 --> 01:03:00,600 recognition. 1674 01:03:00,600 --> 01:03:02,413 But actually there's a lot of parts 1675 01:03:02,413 --> 01:03:04,750 of the brain that have to do with reward and goals.
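
[Yamins' suggestion a moment earlier-- optimize a network for the joint physics prediction task and compare against optimizing each modality separately-- can be cashed out in a small sketch. Everything below is an illustrative assumption: synthetic data in which a single latent "mass" drives both a visual and an auditory observation, tiny PyTorch encoders, and arbitrary sizes. It is not a description of any actual experiment.]

    # Sketch: train modality encoders (a) separately or (b) jointly through a
    # shared "physics" head, per Yamins' proposal. All choices are toy assumptions.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # Synthetic world: one latent physical variable generates both observations.
    n, latent_dim, vis_dim, aud_dim = 512, 1, 16, 16
    mass = torch.rand(n, latent_dim)
    vision = mass @ torch.randn(latent_dim, vis_dim) + 0.1 * torch.randn(n, vis_dim)
    audio = mass @ torch.randn(latent_dim, aud_dim) + 0.1 * torch.randn(n, aud_dim)

    def encoder(in_dim):
        return nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, 8))

    def train(model, inputs, target, steps=500):
        opt = torch.optim.Adam(model.parameters(), lr=1e-2)
        loss_fn = nn.MSELoss()
        for _ in range(steps):
            opt.zero_grad()
            loss = loss_fn(model(inputs), target)
            loss.backward()
            opt.step()
        return loss.item()

    # (a) Separate: each modality gets its own encoder and its own readout.
    vis_only = nn.Sequential(encoder(vis_dim), nn.Linear(8, 1))
    aud_only = nn.Sequential(encoder(aud_dim), nn.Linear(8, 1))
    print("vision-only loss:", train(vis_only, vision, mass))
    print("audio-only loss:", train(aud_only, audio, mass))

    # (b) Joint: both encoders feed one shared physics head, so optimization
    # pressures the two representations to align on the common latent cause.
    class Joint(nn.Module):
        def __init__(self):
            super().__init__()
            self.vis, self.aud = encoder(vis_dim), encoder(aud_dim)
            self.physics = nn.Linear(16, 1)  # shared head over concatenated codes
        def forward(self, v, a):
            return self.physics(torch.cat([self.vis(v), self.aud(a)], dim=1))

    joint = Joint()
    opt = torch.optim.Adam(joint.parameters(), lr=1e-2)
    loss_fn = nn.MSELoss()
    for _ in range(500):
        opt.zero_grad()
        loss = loss_fn(joint(vision, audio), mass)
        loss.backward()
        opt.step()
    print("joint loss:", loss.item())

[The concrete test would then be whether the jointly trained encoders' features match measured neural response patterns better than the separately trained ones do.]
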
1676 01:03:04,750 --> 01:03:05,790 And again, I thought Laura's talk was 1677 01:03:05,790 --> 01:03:07,440 a really good illustration of this: 1678 01:03:07,440 --> 01:03:09,602 understanding perception as representing 1679 01:03:09,602 --> 01:03:11,810 what's out there in the world, but that's clearly got 1680 01:03:11,810 --> 01:03:13,730 to be influenced by your goals. 1681 01:03:13,730 --> 01:03:17,360 And what is certainly a big problem 1682 01:03:17,360 --> 01:03:20,160 is the relation between those. 1683 01:03:20,160 --> 01:03:23,264 It says something that most of us who are studying perception 1684 01:03:23,264 --> 01:03:24,930 don't think we need to worry about that. 1685 01:03:24,930 --> 01:03:27,140 But I think we should, particularly those of us, 1686 01:03:27,140 --> 01:03:28,680 again, echoing what Laura said-- 1687 01:03:28,680 --> 01:03:30,650 if we're studying learning, we definitely, 1688 01:03:30,650 --> 01:03:34,014 I think, need to think about that more than we do. 1689 01:03:34,014 --> 01:03:35,930 HYNEK HERMANSKY: It may not be exactly related 1690 01:03:35,930 --> 01:03:38,070 to what you are asking, but I don't know. 1691 01:03:38,070 --> 01:03:39,800 I believe that we are carrying the model 1692 01:03:39,800 --> 01:03:43,160 of the world in our brain, and we are constantly 1693 01:03:43,160 --> 01:03:48,380 evaluating the fit of what we expect to what we are seeing. 1694 01:03:48,380 --> 01:03:52,190 And as long as we are seeing or hearing what we expect, 1695 01:03:52,190 --> 01:03:53,810 we don't work very hard, basically, 1696 01:03:53,810 --> 01:03:55,490 because the model is there. 1697 01:03:55,490 --> 01:03:57,480 You know what I'm going to say. 1698 01:03:57,480 --> 01:03:59,040 I'm not saying anything special. 1699 01:03:59,040 --> 01:04:01,790 I look reasonable, and so on and so on. 1700 01:04:01,790 --> 01:04:06,210 And when the model of the world is for some reason violated, 1701 01:04:06,210 --> 01:04:09,770 that may be one way to induce the feedback, 1702 01:04:09,770 --> 01:04:12,140 because then suddenly I know I should do something 1703 01:04:12,140 --> 01:04:13,700 about my perception. 1704 01:04:13,700 --> 01:04:15,380 Or I may just give up and say-- 1705 01:04:15,380 --> 01:04:18,260 or I become very, very interested. 1706 01:04:18,260 --> 01:04:20,730 But I think this model of the world-- the priors 1707 01:04:20,730 --> 01:04:23,025 which we all carry with us-- 1708 01:04:23,025 --> 01:04:26,180 is extremely important, and it helps 1709 01:04:26,180 --> 01:04:28,280 us to move through the world. 1710 01:04:28,280 --> 01:04:31,600 That's my feeling. 1711 01:04:31,600 --> 01:04:34,850 In speech we have a somewhat interesting situation 1712 01:04:34,850 --> 01:04:36,640 in that we can actually predict-- 1713 01:04:36,640 --> 01:04:39,740 say we are interested in estimating probabilities 1714 01:04:39,740 --> 01:04:41,120 of the speech sounds. 1715 01:04:41,120 --> 01:04:43,940 But we can also predict them from the language model. 1716 01:04:43,940 --> 01:04:45,710 Our language model is typically learned 1717 01:04:45,710 --> 01:04:49,205 very differently, from a lot of text 1718 01:04:49,205 --> 01:04:50,910 and a lot of other things. 1719 01:04:50,910 --> 01:04:54,020 And so we had quite a bit of success 1720 01:04:54,020 --> 01:04:57,190 in trying to determine whether the word recognizer 1721 01:04:57,190 --> 01:05:01,150 is working well or not, by comparing what it recognizes 1722 01:05:01,150 --> 01:05:04,190 and expects with what it sees.
1723 01:05:04,190 --> 01:05:08,180 And as long as these things go together well, it's fine. 1724 01:05:08,180 --> 01:05:12,050 If there is a problem between these two, 1725 01:05:12,050 --> 01:05:14,080 we have to start working. 1726 01:05:14,080 --> 01:05:15,830 JOSH TENENBAUM: I think, you know, physics 1727 01:05:15,830 --> 01:05:19,609 is a source of beautiful math and ideas. 1728 01:05:19,609 --> 01:05:21,650 I think it's an interesting thing to think about; 1729 01:05:21,650 --> 01:05:24,620 maybe some tuning of some low level mechanisms 1730 01:05:24,620 --> 01:05:26,180 in both sensory modalities might 1731 01:05:26,180 --> 01:05:27,960 be well thought of that way. 1732 01:05:27,960 --> 01:05:28,490 Right. 1733 01:05:28,490 --> 01:05:29,990 But I think it is dangerous to apply 1734 01:05:29,990 --> 01:05:31,100 too much of the physicist's approach 1735 01:05:31,100 --> 01:05:32,141 to the system as a whole-- 1736 01:05:32,141 --> 01:05:36,350 this idea that we're going to explain deep stuff about how 1737 01:05:36,350 --> 01:05:39,710 the brain works as some kind of emergent phenomenon, something 1738 01:05:39,710 --> 01:05:42,320 that just happened to work that way because of physics. 1739 01:05:42,320 --> 01:05:43,811 This is an engineered system. 1740 01:05:43,811 --> 01:05:44,310 Right. 1741 01:05:44,310 --> 01:05:46,109 Evolution engineered brains. 1742 01:05:46,109 --> 01:05:47,150 It's a very complicated-- 1743 01:05:47,150 --> 01:05:49,640 I mean, we have had a version of this discussion before. 1744 01:05:49,640 --> 01:05:51,634 But I think it's something that a lot of us 1745 01:05:51,634 --> 01:05:52,550 here are committed to. 1746 01:05:52,550 --> 01:05:53,383 Maybe not all of us. 1747 01:05:53,383 --> 01:05:58,040 But the way I see it is, this is a reverse engineering science. 1748 01:05:58,040 --> 01:05:59,780 And the brain isn't an accident. 1749 01:05:59,780 --> 01:06:00,720 It didn't just happen. 1750 01:06:00,720 --> 01:06:04,340 There were lots of forces over many different timescales 1751 01:06:04,340 --> 01:06:06,870 acting to shape it to have the function that it does. 1752 01:06:06,870 --> 01:06:10,430 So if it is the case that there are some basic mechanisms, 1753 01:06:10,430 --> 01:06:15,200 say maybe at the synaptic level, that could be described 1754 01:06:15,200 --> 01:06:17,350 that way, it's not an accident. 1755 01:06:17,350 --> 01:06:19,520 They were [INAUDIBLE]. 1756 01:06:19,520 --> 01:06:23,210 I would call that, you know, biology using the physics 1757 01:06:23,210 --> 01:06:24,500 to solve a problem. 1758 01:06:24,500 --> 01:06:27,590 And again, there's a long history of connecting free energy type 1759 01:06:27,590 --> 01:06:30,770 approaches to various elegant statistical inference 1760 01:06:30,770 --> 01:06:31,600 frameworks. 1761 01:06:31,600 --> 01:06:34,370 And it could be very sensible to say, yes, 1762 01:06:34,370 --> 01:06:36,380 at some levels you could describe that low level 1763 01:06:36,380 --> 01:06:40,815 sensory adaptation as doing that kind of physical resonance 1764 01:06:40,815 --> 01:06:41,315 process. 1765 01:06:41,315 --> 01:06:43,770 Or nonequilibrium stat mech could describe that. 1766 01:06:43,770 --> 01:06:47,030 But the reason why nature has basically 1767 01:06:47,030 --> 01:06:49,156 put that physics interface in there 1768 01:06:49,156 --> 01:06:50,780 is because it's actually a way to solve 1769 01:06:50,780 --> 01:06:53,585 a certain kind of adaptive statistical inference problem.
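
[Hermansky's earlier recipe-- compare what the recognizer's acoustics suggest with what its language model expects, and "start working" when the two disagree-- can be sketched in a few lines of Python. The toy word posteriors, the KL-divergence agreement score, and the threshold below are all assumptions for illustration; real confidence measures are considerably richer.]

    # Sketch: flag low recognizer confidence when acoustic evidence and
    # language-model expectation diverge. Distributions here are made up.
    import numpy as np

    def kl(p, q, eps=1e-12):
        """KL divergence D(p || q) between two discrete distributions."""
        p, q = np.asarray(p) + eps, np.asarray(q) + eps
        p, q = p / p.sum(), q / q.sum()
        return float(np.sum(p * np.log(p / q)))

    # Toy vocabulary of 4 words; posteriors over words at one decoding step.
    acoustic_clean = [0.85, 0.05, 0.05, 0.05]  # acoustics confident in word 0
    acoustic_noisy = [0.30, 0.25, 0.25, 0.20]  # acoustics flattened by noise
    lm_expectation = [0.80, 0.10, 0.05, 0.05]  # language model expects word 0

    for name, acoustic in [("clean", acoustic_clean), ("noisy", acoustic_noisy)]:
        disagreement = kl(acoustic, lm_expectation)
        print(f"{name}: KL(acoustic || LM) = {disagreement:.3f}")
        if disagreement > 0.5:  # threshold is arbitrary for this sketch
            print("  -> mismatch: flag low confidence and start working")

[As long as the two distributions "go together well," the recognizer carries on; a large divergence is the cue that something about the model or the input needs attention.]
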
1770 01:06:53,585 --> 01:06:54,460 ALEX KELL: All right. 1771 01:06:54,460 --> 01:06:54,610 Cool. 1772 01:06:54,610 --> 01:06:55,526 Let's thank our panel. 1773 01:06:55,526 --> 01:06:58,410 [APPLAUSE]