The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JOSH MCDERMOTT: I'm going to talk about hearing. I gather this is the first time in this class that, really, people have talked about audition. And usually, when I give talks and lectures, I find it's often helpful to just start out by doing some listening. So we're going to start out this morning by just listening to some examples of typical auditory input that you might encounter during the course of your day. So all you've got to do is just close your eyes and listen. Here we go.

[AUDIO PLAYBACK]

[INAUDIBLE]

[END PLAYBACK]

OK. So you guys could probably tell what all of those things were; in case not, here are some labels. So the first one is just a scene I recorded on my iPhone in a cafe. The second one was something from a sports bar.
Then there was a radio excerpt, and then some Barry White. And the point is that in all of those cases, I mean, they're all pretty different, but in all those cases, your brain was inferring an awful lot about what was happening in the world from the sound signal that was entering your ears. Right? And so what makes that kind of amazing and remarkable is that the sensory input it was getting, to first order, just looks like this. Right? So there was sound energy that traveled through the air. It was making your eardrums wiggle back and forth in some particular pattern that would be indicated by that waveform there, which plots the pressure at your eardrum as a function of time. And so from that particular funny pattern of wiggles, you were able to infer all those things. So in the case of the cafe scene, you could tell that there were a few people talking. You could probably hear a man and a woman, and you could tell there was music in the background. You could hear there were dishes clattering. You might have been able to tell somebody had an Irish accent.
You could probably tell, in the sports bar, that people were watching football and talking about it, and that there were a bunch of people there. You instantly recognized the radio excerpt as drivetime radio, that sort of, like, standard sound. And in the case of the music excerpt, you could probably hear six or seven, or maybe eight different instruments that were each, in their own way, contributing to the groove. So that's kind of audition at work. And so the task of the brain is to take the sound signal that arrives at your ears, that is then transduced into electrical signals by your ears, and then to interpret it and to figure out what's out there in the world. So what you're interested in, really, is not the sound itself. You're interested in whether it was a dog, or a train, or rain, or people singing, or whatever it was.
And so the interesting and hard problem that comes along with this is that most of the properties that we are interested in as listeners are not explicit in the waveform, in the sense that if I hand you the sound waveform itself, and you either look at it, or you run sort of standard machine classifiers on it, it would be very difficult to discern the kinds of things that you, with your brain, can very easily just report. And so that's really the question that I'm interested in and that our lab studies, namely, how is it that we derive information about the world from sound? And so there are lots of different aspects to the problem of audition. I'm going to give you a taste of a few of them over the course of today's lecture. A big one that lots of people have heard about is often known as the cocktail party problem, which refers to the fact that real world settings often involve concurrent sound. So if you're in a room full of busy people, you might be trying to have a conversation with one of them, but there'll be lots of other people talking, music in the background, and so on, and so forth.
And so from that kind of complicated mixed signal that enters your ears, your brain has to estimate the content of one particular source of interest, of the person that you're trying to converse with. So really, what you'd like to be hearing might be this.

AUDIO: She argues with her sister.

JOSH MCDERMOTT: But what might enter your ear could be this.

AUDIO: [INTERPOSING VOICES]

JOSH MCDERMOTT: Or maybe even this.

[INTERPOSING VOICES]

JOSH MCDERMOTT: Or maybe even this.

AUDIO: [INTERPOSING VOICES]

JOSH MCDERMOTT: So what I've plotted here next to the icons are spectrograms. That's a way of taking a sound signal and turning it into an image. So it's a plot of the frequency content over time. And you can see with the single utterance up here at the top, there's all kinds of structure, right? And we think that your brain uses that structure to understand what was being said.
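As an aside, a spectrogram like the ones described here is just the magnitude of a short-time Fourier transform. A minimal sketch in Python, assuming NumPy; the window and hop sizes are illustrative choices, not values from the lecture:

```python
import numpy as np

def spectrogram(x, fs, win=256, hop=128):
    """Magnitude STFT: rows = frequency bins, columns = time frames."""
    window = np.hanning(win)
    frames = [x[i:i + win] * window
              for i in range(0, len(x) - win + 1, hop)]
    # The real FFT of each windowed frame gives the frequency content
    # of that short slice of the signal.
    S = np.abs(np.fft.rfft(frames, axis=1)).T
    freqs = np.fft.rfftfreq(win, 1 / fs)
    return S, freqs

# A 1 kHz tone: its spectrogram should concentrate energy in one row.
fs = 8000
t = np.arange(fs) / fs
S, freqs = spectrogram(np.sin(2 * np.pi * 1000 * t), fs)
print(freqs[np.argmax(S[:, 0])])  # → 1000.0
```

Plotting `S` on a log scale (frequency on the y-axis, time on the x-axis) gives an image like the ones on the slide.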
And you can see that, as more and more people are added to the party, that structure becomes progressively more and more obscured, until, by the time you get to the bottom, it's kind of amazing that you can pull anything out at all. And yet, as you hopefully heard, your brain has this remarkable ability to attend to and understand the speech signal of interest. And this is an ability that still, to this day, is really unmatched by machines. So present day speech recognition algorithms are getting better by the minute, but this particular problem is still quite a significant challenge. And you've probably encountered this when you try to talk to your iPhone when you're in a car or wherever else.

Another kind of interesting complexity in hearing is that sound interacts with the environment on its way to your ears. So, you know, you typically think of yourself as listening to, say, a person talking or to some sound source, but in reality, what's happening is something like this picture here.
So there's a speaker in the upper right corner from which sound is emanating, but the sound takes a whole lot of different paths on its way to your ears. There's the direct path, which is shown in green, but then there are all these other paths where it can reflect off of the walls in the room. So the blue lines here indicate paths where there's a single reflection. And the red lines indicate paths where there are two reflections, and so you can see there are a lot of them. And so the consequence of this is that your brain gets all these delayed copies of the source signal. And what that amounts to is really massive distortion of the signal. And this is known as reverberation. So this is dry speech. Of course, you're hearing this in this auditorium that itself has lots of reverberation, so you're not actually going to hear it dry, but you'll still be able to hear a difference.

AUDIO: They ate the lemon pie. Father forgot the bread.
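The "delayed copies" description has an exact mathematical form: the reverberant signal is the dry signal convolved with the room's impulse response. A sketch of that, assuming NumPy; the delays and gains are made-up toy values, and noise stands in for the speech recording:

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 16000
dry = rng.standard_normal(fs)          # stand-in for a dry speech signal

# Toy room impulse response: the direct path plus two attenuated reflections.
ir = np.zeros(fs // 4)
ir[0] = 1.0                            # direct path
ir[int(0.02 * fs)] = 0.6               # reflection arriving 20 ms later
ir[int(0.05 * fs)] = 0.3               # reflection arriving 50 ms later

# Reverberant signal = convolution of the source with the impulse response,
# i.e. a superposition of delayed, scaled copies of the dry signal.
wet = np.convolve(dry, ir)
print(len(wet) == len(dry) + len(ir) - 1)  # → True
```

A real room's impulse response would have thousands of reflections blending into a dense, decaying tail, but the structure is the same.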
JOSH MCDERMOTT: And this is that signal with lots of reverberation added, as though you were listening to it in a cathedral or something. Of course, you're, again, hearing it in this room, as well.

AUDIO: They ate the lemon pie. Father forgot the bread.

JOSH MCDERMOTT: And you can still hear a difference. And if the reverb in this auditorium is swamping that, you can just look at the waveforms. And you can see that the waveforms of those two signals look pretty dramatically different, as do the spectrograms. All right? So the point is that the consequence of all of those delayed reflections massively distorts the signal, all right? Physically, there are two really different things in those two cases. But again, your ability to recognize what's being said is remarkably invariant to the presence of that reverberation. And again, this is an instance where humans really are outperforming machines to a considerable extent. This graph is a little bit dated. This is from, I think, three years ago, but it's a plot of five different speech recognition algorithms.
And the percent of errors they're making when given a speech signal, as a function of the amount of reverberation. And so zero means that there's no reverberation. That's the dry case, right? And so speech recognition works pretty well without any reverb. But when you add a little bit of reverberation, and this is measured in terms of the reverberation time, the amount of time that it takes the reverb to fall off by a certain specified amount, you can see it causes trouble. And 300 and 500 milliseconds are actually very, very modest. So in this auditorium, my guess would be that the reverberation time is maybe even a couple seconds. So this is, like, what you get in a small classroom, maybe. Maybe even less than that. But you can see that it causes major problems for speech recognition. And it's because the information in the speech signal gets blurred out over time, and again, it's just massive distortion. So your brain is doing something pretty complicated in order for you to be so robust to the presence of the reverberation.

So I run a research group where we study these kinds of problems.
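The "certain specified amount" is conventionally 60 dB of energy decay, which is why reverberation time is often written RT60. One standard way to estimate it from a measured impulse response is Schroeder's backward integration; a sketch under stated assumptions (NumPy, a synthetic exponentially decaying impulse response, and a T20-style fit, i.e. extrapolating from the -5 to -25 dB portion of the decay):

```python
import numpy as np

fs = 16000
t = np.arange(fs) / fs
rng = np.random.default_rng(1)

# Synthetic impulse response: noise under an exponential decay envelope,
# chosen so the energy falls 60 dB in about 0.5 s.
rt60_true = 0.5
ir = rng.standard_normal(fs) * 10 ** (-3 * t / rt60_true)

# Schroeder backward integration gives the energy decay curve in dB.
edc = np.cumsum(ir[::-1] ** 2)[::-1]
edc_db = 10 * np.log10(edc / edc[0])

# Fit the -5 to -25 dB portion and extrapolate to -60 dB.
mask = (edc_db <= -5) & (edc_db >= -25)
slope, intercept = np.polyfit(t[mask], edc_db[mask], 1)  # dB per second
rt60_est = -60 / slope
print(round(rt60_est, 2))  # close to 0.5
```

The fit range and extrapolation mirror how reverberation time is measured in practice, since the very start and the noisy tail of a real decay curve are unreliable.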
It's called the Laboratory for Computational Audition, in the Department of Brain and Cognitive Sciences at MIT. We operate at the intersection of psychology, and neuroscience, and engineering, where what we aspire to do is to understand how it is that people hear so well, in computational terms that would allow us to instantiate the answers in algorithms that we might replicate in machines. And so the research that we try to do involves hopefully symbiotic relationships between experiments in humans, auditory neuroscience, and machine algorithms. And the general approach that we take is to start with what the brain has to work with. And by that, I mean we try to work with representations like the ones that are in the early auditory system.

And so here's the plan for this morning. And this is subject to change, depending on what kind of feedback I get from you guys. But my general plan was to start out with an overview of the auditory system, because I gather there's sort of a diversity of backgrounds here, and nobody's talked about audition so far. So I was going to go through a little overview.
And then there's been a special request to talk about some texture perception. I gather that there were some earlier lectures on visual texture, and that might be a useful thing to talk about. It's also a nice way to understand auditory models a little bit better. I was then going to talk a little bit about the perception of individual sound sources, which is sort of the flip side of sound texture, and then conclude with a section on auditory scene analysis, so what your brain is able to do when it gets a complicated sound signal like you would normally get in the world, one that has contributions from multiple causes, and you have to infer those. OK. And so we'll take a break about halfway through, as I guess that's kind of standard. And I'm happy for people to interrupt and ask questions.

OK. So the general outline for hearing, right, is that sound is created when objects in the world vibrate. Usually, this is because something hits something else, or in the case of a biological organism, there is some energy imparted to the vocal cords. And the object vibrates.
That vibration gets transmitted to the air molecules around it, and you get a sound wave that travels through the air. And that sound wave then gets measured by the ears. And so the ear is a pretty complicated device that is designed to measure sound. It's typically divided up into three pieces. So there's the outer ear, consisting of the pinna and the eardrum. In functional terms, people usually think about this as a directional microphone. There's the middle ear. There are these three little bones in between the eardrum and the cochlea that are typically ascribed the functions of impedance matching and overload protection. I'm not going to talk about that today. And then there's the inner ear, the cochlea, which in very coarse engineering terms, we think of as doing some kind of frequency analysis. And so again, at kind of a high level, you've got your ears here. This is the inner ear on each side. And then those send feedforward input to the midbrain, and there are a few different way stations here.
The cochlear nucleus, the superior olivary complex, the inferior colliculus, and then the medial geniculate nucleus of the thalamus. And the thalamus then projects to the auditory cortex. And there are a couple things at a high level that are worth noting here. One is that the pathways here are actually pretty complicated, especially relative to the visual system that you guys have been hearing lots about. Right? So there are a bunch of different stops on the way to the cortex. Another interesting thing is that input from the two ears gets mixed at a pretty early stage.

OK. All right, so let's step back and talk about the cochlea for a moment. And I realize that some of you guys will know about this stuff, so we'll go through it kind of quick. Now one of the signature features of cochlear transduction is its frequency tuning. So this is an unwrapped version of the cochlea. So if we step back to here, all right, so we've got the outer ear, the ear canal, the eardrum, those three little bones that I told you about that connect to the cochlea.
And the cochlea is this thing that looks like a snail. And then if we unroll that snail and look at it like this, you can see that the cochlea consists of these tubes separated by a membrane. The membrane is called the basilar membrane. That's worth knowing about. So sound enters here at the base and sets up a traveling wave along this membrane. So this is really a mechanical thing that happens. So there's actually, like, a physical vibration that occurs in this membrane. And it's a wave that travels along the cochlea. And one of the signature discoveries about the cochlea is that that traveling wave peaks in different places, depending on the frequency content of the sound. And that's schematized in these drawings here on the right. So if the ear were to receive a high frequency sound, that traveling wave would peak near the base. If it were to receive a medium frequency sound, the wave would peak somewhere in the middle. And a low frequency sound would peak near the apex. And the frequency tuning is partly mechanical in origin, so that membrane varies in thickness and stiffness along its length.
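This place-to-frequency mapping is often summarized with Greenwood's function, an empirical fit relating position along the basilar membrane to the frequency that peaks there. A sketch in Python, using the commonly cited human parameter values (these are fitted constants, not anything from the lecture):

```python
def greenwood_cf(x):
    """Characteristic frequency (Hz) at relative position x along the
    basilar membrane, with x = 0 at the apex and x = 1 at the base.
    Human parameters from Greenwood (1990): F = A * (10**(a*x) - k)."""
    A, a, k = 165.4, 2.1, 0.88
    return A * (10 ** (a * x) - k)

# Apex codes low frequencies, base codes high frequencies:
print(round(greenwood_cf(0.0)))  # → 20 (roughly 20 Hz)
print(round(greenwood_cf(1.0)))  # roughly 20 kHz
```

The exponential form means equal distances along the membrane correspond to roughly equal ratios of frequency, which is one reason log-frequency axes are so common in auditory plots.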
And there's also a contribution that's non-linear and active, which we'll talk briefly about in a little bit. So this is a close-up. This is a cross-section. So imagine you took this diagram and cut it in the middle. This is a cross-section of the cochlea. So this here is the basilar membrane, and this is the organ of Corti that sits on top of the basilar membrane. And so if you look closely, you can see that there's this thing in here. This is the inner hair cell. And that's the guy that does the transduction, that takes the mechanical energy that's coming from the fact that this thing is vibrating up and down, and turns that into an electrical signal that gets sent to your brain. And so the way that that works is that there is this other membrane here called the tectorial membrane. And the hair cell has these cilia that stick out of it. And as it moves up and down, there's a shearing that's created between the two membranes. The hair cell body deforms, and that deformation causes a change in its membrane potential.
And that causes neurotransmitter to be released. All right, so that's the mechanism by which the brain takes that mechanical signal and turns it into an electrical signal that gets sent to your brain. The other thing to note here, and we'll return to this, is that there are these other three cells here that are labeled as outer hair cells. And so those kind of do what the inner hair cell does in reverse. So they get an electrical signal from your brain, and that causes the hair cell bodies to deform, and that actually alters the motion of the basilar membrane. So it's like a feedback system that we believe serves to amplify sounds and to sharpen their tuning. So there's feedback all the way out to the cochlea.

OK. So this is just another view. So here's the inner hair cell here. As this thing vibrates up and down, there's a shearing between these membranes. The inner hair cell membrane potential changes, and that causes neurotransmitter release. OK, and so here's the really important point.
So we just talked about how there's this traveling wave that gets set up that peaks in different places, depending on the frequency content of the sound. And so because, to first order, only part of the basilar membrane moves for a given frequency of sound, each hair cell, and the auditory nerve fiber that it synapses with, signals only particular frequencies of sound. And so this is sort of the classic textbook figure that you would see on this, where what's being plotted on the y-axis is the minimum sound intensity needed to elicit a neural response. And the x-axis is the frequency of a tone with which you would be stimulating the ear. So we have a little pure tone generator with a knob that allows you to change the frequency, and another knob that allows you to change the level. And you sit there recording from an auditory nerve fiber, varying the frequency, and then turning the level up and down until you get spikes out of the nerve fiber. And so for every nerve fiber, there will be some frequency, called the characteristic frequency, at which you can elicit spikes when you present the sound at a fairly low level.
And then as you change the frequency, either higher or lower, the level that is needed to elicit a response grows. And so you can think of this as like a tuning curve for that auditory nerve fiber. All right? And different nerve fibers have different characteristic frequencies. Here is just a picture that shows a handful of them. And so together, collectively, they kind of tile the space. And of course, given what I just told you, you can probably guess that each of these nerve fibers would synapse at a different location along the cochlea. The ones that have high characteristic frequencies would be near the base. The ones that have low characteristic frequencies would be near the apex.

OK. So in computational terms, the common way to think about this is to approximate auditory nerve fibers with bandpass filters. And so this would be the way that you would do this in a model. Each of these curves is a bandpass filter, so what you see on the y-axis is the response of the filter. The x-axis is frequency.
444 00:17:48,640 --> 00:17:51,702 So each filter has some particular frequency 445 00:17:51,702 --> 00:17:53,160 at which it gives a peak response, 446 00:17:53,160 --> 00:17:55,740 and then the signal is attenuated 447 00:17:55,740 --> 00:17:59,650 on either side of that peak frequency. 448 00:17:59,650 --> 00:18:04,090 And so one way to think about what the cochlea is doing 449 00:18:04,090 --> 00:18:08,170 to the signal is that it's taking the signal that 450 00:18:08,170 --> 00:18:10,360 enters the ears, this thing here-- so this 451 00:18:10,360 --> 00:18:12,700 is just a sound signal, so the amplitude just 452 00:18:12,700 --> 00:18:15,736 varies over time in some particular way. 453 00:18:15,736 --> 00:18:17,110 You take that signal, you pass it 454 00:18:17,110 --> 00:18:19,510 through this bank of bandpass filters, 455 00:18:19,510 --> 00:18:22,096 and then the output of each of these filters 456 00:18:22,096 --> 00:18:23,970 is a filtered version of the original signal. 457 00:18:23,970 --> 00:18:27,070 And so in engineering terms, we call that a subband. 458 00:18:27,070 --> 00:18:29,950 So this would be the result of taking that sound signal 459 00:18:29,950 --> 00:18:31,960 and filtering it with a filter that's 460 00:18:31,960 --> 00:18:35,110 tuned to relatively low frequencies, 350 to 520 461 00:18:35,110 --> 00:18:36,800 Hertz, in this particular case. 462 00:18:36,800 --> 00:18:38,920 And so you can see that the output of that filter 463 00:18:38,920 --> 00:18:41,260 is a signal that varies relatively slowly. 464 00:18:41,260 --> 00:18:43,480 So it wiggles up and down, but you 465 00:18:43,480 --> 00:18:48,550 can see that the wiggles are in some sort of confined range 466 00:18:48,550 --> 00:18:50,265 of frequencies. 467 00:18:50,265 --> 00:18:51,640 If we go a little bit further up, 468 00:18:51,640 --> 00:18:53,181 we get the output of something that's 469 00:18:53,181 --> 00:18:54,794 tuned to slightly higher frequencies.
470 00:18:54,794 --> 00:18:56,710 And you can see that the output of that filter 471 00:18:56,710 --> 00:18:59,710 is wiggling at a faster rate. 472 00:18:59,710 --> 00:19:01,900 And then if we go up further still, 473 00:19:01,900 --> 00:19:06,220 we get a different thing, that is again wiggling even faster. 474 00:19:06,220 --> 00:19:12,100 And so collectively, we can take this original broadband signal, 475 00:19:12,100 --> 00:19:15,430 and then represent it with a whole bunch of these subbands. 476 00:19:15,430 --> 00:19:17,680 Typically, you might use 30, or 40, or 50. 477 00:19:21,290 --> 00:19:23,540 So one thing to note here, you might 478 00:19:23,540 --> 00:19:25,700 have noticed that there's something 479 00:19:25,700 --> 00:19:29,180 funny about this picture and that the filters, here, which 480 00:19:29,180 --> 00:19:31,950 are indicated by these colored curves, are not uniform. 481 00:19:31,950 --> 00:19:32,450 Right? 482 00:19:32,450 --> 00:19:34,010 So the ones down here are very narrow, 483 00:19:34,010 --> 00:19:35,450 and the ones down here are very broad. 484 00:19:35,450 --> 00:19:36,575 And that's not an accident. 485 00:19:36,575 --> 00:19:38,900 That's roughly what you find when you actually 486 00:19:38,900 --> 00:19:41,090 look in the ear. 487 00:19:41,090 --> 00:19:45,600 And why things are that way is something 488 00:19:45,600 --> 00:19:48,010 that you could potentially debate. 489 00:19:48,010 --> 00:19:49,520 But it's very clear, empirically, 490 00:19:49,520 --> 00:19:51,380 that that's roughly what you find. 491 00:19:51,380 --> 00:19:52,910 I'm not going to say too much more about this now, 492 00:19:52,910 --> 00:19:54,350 but remember that because it will 493 00:19:54,350 --> 00:19:56,390 become important a little bit later on. 
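The subband decomposition described above is easy to sketch in code. Here is a minimal Python illustration (numpy assumed); it stands in an idealized brick-wall FFT filter for realistic cochlear filters, which have sloping skirts and non-uniform bandwidths, but the subband idea is the same. The 440 Hz and 2 kHz test tones are illustrative choices, and the 350 to 520 Hz band edges echo the example in the lecture.

```python
import numpy as np

def bandpass_subband(x, sr, lo, hi):
    """Idealized bandpass filter: keep only FFT bins between lo and hi Hz.
    Real cochlear filters have gradual rolloffs, but the principle is the same."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    spectrum[(freqs < lo) | (freqs >= hi)] = 0.0   # zero everything outside the band
    return np.fft.irfft(spectrum, n=len(x))

sr = 16000
t = np.arange(sr) / sr                             # 1 second of samples
# A toy "broadband" signal: a 440 Hz tone plus a 2 kHz tone.
x = np.sin(2 * np.pi * 440 * t) + np.sin(2 * np.pi * 2000 * t)

# The 350-520 Hz subband keeps the 440 Hz component and rejects the 2 kHz one,
# so it "wiggles" slowly; a higher-tuned filter would wiggle faster.
low_subband = bandpass_subband(x, sr, 350, 520)
```

A bank of 30 to 50 such filters with staggered passbands, applied to the same signal, yields the full set of subbands the lecture describes.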
494 00:19:56,390 --> 00:19:58,580 So we can take these filters, and turn that 495 00:19:58,580 --> 00:20:01,820 into an initial stage of an auditory model, the stuff 496 00:20:01,820 --> 00:20:04,160 that we think is happening in the early auditory system 497 00:20:04,160 --> 00:20:06,080 where we've got our sound signal that 498 00:20:06,080 --> 00:20:08,510 gets passed through this bank of bandpass filters. 499 00:20:08,510 --> 00:20:10,165 And you're now representing that signal 500 00:20:10,165 --> 00:20:12,290 as a bunch of different subbands, just two of which 501 00:20:12,290 --> 00:20:15,950 are shown here for clarity. 502 00:20:15,950 --> 00:20:18,917 And the frequency selectivity that you find in the ear 503 00:20:18,917 --> 00:20:20,750 has a whole host of perceptual consequences. 504 00:20:20,750 --> 00:20:23,330 I won't go through all of them exhaustively. 505 00:20:23,330 --> 00:20:26,540 It's one of the main determinants of what masks what. 506 00:20:26,540 --> 00:20:29,780 So for instance, when you're trying 507 00:20:29,780 --> 00:20:33,110 to compress a sound signal by turning it into an mp3, 508 00:20:33,110 --> 00:20:36,690 you have to really pay attention to the nature of these filters. 509 00:20:36,690 --> 00:20:38,690 And, you know, you don't need to represent parts 510 00:20:38,690 --> 00:20:39,960 of the filters that would be-- 511 00:20:39,960 --> 00:20:41,870 sorry, parts of the signal that would be masked, 512 00:20:41,870 --> 00:20:44,680 and these filters tell you a lot about that. 513 00:20:44,680 --> 00:20:47,470 One respect in which frequency selectivity is evident 514 00:20:47,470 --> 00:20:50,150 is in the ability to hear out individual frequency 515 00:20:50,150 --> 00:20:53,080 components of sounds that have lots of frequencies in them. 516 00:20:53,080 --> 00:20:55,100 So this is kind of a cool demonstration.
517 00:20:55,100 --> 00:20:59,420 And to kind of help us see what's going on, 518 00:20:59,420 --> 00:21:03,050 we're going to look at a spectrogram of what's 519 00:21:03,050 --> 00:21:04,324 coming in. 520 00:21:04,324 --> 00:21:05,490 So hopefully this will work. 521 00:21:05,490 --> 00:21:08,060 So what this little thing is doing is 522 00:21:08,060 --> 00:21:09,890 there's a microphone in the laptop, 523 00:21:09,890 --> 00:21:12,440 and it takes that microphone signal 524 00:21:12,440 --> 00:21:13,790 and turns it into a spectrogram. 525 00:21:13,790 --> 00:21:16,160 It's using a logarithmic frequency scale here, 526 00:21:16,160 --> 00:21:19,077 so it goes from about 100 Hertz up to 6400. 527 00:21:19,077 --> 00:21:20,660 And so if I don't say anything, you'll 528 00:21:20,660 --> 00:21:22,910 be able to hear the room noise, or see the room noise. 529 00:21:26,124 --> 00:21:27,540 All right, so that's the baseline. 530 00:21:27,540 --> 00:21:29,040 And also, the other thing to note 531 00:21:29,040 --> 00:21:31,170 is that the microphone doesn't have very good bass response. 532 00:21:31,170 --> 00:21:33,087 And so the very low frequencies won't show up. 533 00:21:33,087 --> 00:21:34,128 But everything else will. 534 00:21:34,128 --> 00:21:34,775 OK. 535 00:21:34,775 --> 00:21:35,441 [AUDIO PLAYBACK] 536 00:21:35,441 --> 00:21:37,220 - Canceled harmonics. 537 00:21:37,220 --> 00:21:39,600 A complex tone is presented, followed 538 00:21:39,600 --> 00:21:42,480 by several cancellations and restorations 539 00:21:42,480 --> 00:21:44,310 of a particular harmonic. 540 00:21:44,310 --> 00:21:49,110 This is done for harmonics 1 through 10. 541 00:21:49,110 --> 00:21:52,610 [LOUD TONE] 542 00:21:57,062 --> 00:22:00,506 [TONE] 543 00:22:05,426 --> 00:22:12,819 [TONE] 544 00:22:12,819 --> 00:22:13,360 [END PLAYBACK] 545 00:22:13,360 --> 00:22:13,860 OK.
546 00:22:13,860 --> 00:22:15,950 And just to be clear, the point of this, right, 547 00:22:15,950 --> 00:22:18,860 is that what's happening here-- 548 00:22:18,860 --> 00:22:21,110 just stop that for a second-- 549 00:22:21,110 --> 00:22:23,457 is you're hearing what's called a complex tone. 550 00:22:23,457 --> 00:22:25,790 That just means a tone that has more than one frequency. 551 00:22:25,790 --> 00:22:26,360 All right? 552 00:22:26,360 --> 00:22:28,160 That's what constitutes complexity 553 00:22:28,160 --> 00:22:29,590 for a psychoacoustician. 554 00:22:29,590 --> 00:22:31,940 So it's a complex tone. 555 00:22:31,940 --> 00:22:34,950 And each of the stripes, here, is one of the frequencies. 556 00:22:34,950 --> 00:22:36,200 So this is a harmonic complex. 557 00:22:36,200 --> 00:22:38,976 Notice that the fundamental frequency is 200 Hertz, 558 00:22:38,976 --> 00:22:41,600 and so all the other frequencies are integer multiples of that. 559 00:22:41,600 --> 00:22:47,230 So there's 400, 600, 800, 1,000, 1,200, and so on, and so forth. 560 00:22:47,230 --> 00:22:48,920 OK? 561 00:22:48,920 --> 00:22:51,680 Then what's happening is that in each little cycle 562 00:22:51,680 --> 00:22:54,012 of this demonstration, one of the harmonics 563 00:22:54,012 --> 00:22:55,220 is getting pulsed on and off. 564 00:22:55,220 --> 00:22:55,719 All right? 565 00:22:55,719 --> 00:22:57,761 And the consequence of it being pulsed on and off 566 00:22:57,761 --> 00:22:59,802 is that you're able to actually hear it as, like, 567 00:22:59,802 --> 00:23:00,590 a distinct thing. 568 00:23:00,590 --> 00:23:03,710 And the fact that that happens, that's not, itself, 569 00:23:03,710 --> 00:23:04,920 happening in the ear. 570 00:23:04,920 --> 00:23:06,774 That's something complicated and interesting 571 00:23:06,774 --> 00:23:08,690 that your brain is doing with that signal it's 572 00:23:08,690 --> 00:23:09,750 getting from the ear.
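The stimulus in the canceled-harmonics demo can be sketched numerically. This is a hedged illustration (numpy assumed), not the actual demo recording: it builds a 200 Hz harmonic complex with harmonics 1 through 10, cancels the 4th harmonic (800 Hz) by adding an out-of-phase copy of it, and then checks in the spectrum that that harmonic is gone while its neighbors survive.

```python
import numpy as np

sr, f0 = 16000, 200
t = np.arange(sr) / sr        # 1 second, so rfft bin k corresponds to k Hz

# Harmonic complex: frequencies at 200, 400, ..., 2000 Hz.
complex_tone = sum(np.sin(2 * np.pi * f0 * k * t) for k in range(1, 11))

# "Cancel" the 4th harmonic by adding an out-of-phase (pi-shifted) copy of it.
cancelled = complex_tone + np.sin(2 * np.pi * 4 * f0 * t + np.pi)

spectrum = np.abs(np.fft.rfft(cancelled))
# Bin 800 (the cancelled harmonic) should be near zero; bin 600 (the 3rd
# harmonic) should still carry its full energy.
```

In the demo, the cancellation and restoration are pulsed over time, which is what lets you hear the toggled harmonic pop out as a separate tone.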
573 00:23:09,750 --> 00:23:11,870 But the fact you're able to do that 574 00:23:11,870 --> 00:23:13,790 is only possible by virtue of the fact 575 00:23:13,790 --> 00:23:16,820 that the signal that your brain gets from the ear 576 00:23:16,820 --> 00:23:19,550 divides the signal up in a way that kind of preserves 577 00:23:19,550 --> 00:23:20,730 the individual frequencies. 578 00:23:20,730 --> 00:23:20,910 All right? 579 00:23:20,910 --> 00:23:23,120 And so this is just a demonstration that you're 580 00:23:23,120 --> 00:23:25,550 actually able, under appropriate circumstances, 581 00:23:25,550 --> 00:23:27,830 to hear out particular frequency components 582 00:23:27,830 --> 00:23:30,830 of this complicated thing, even though, if you just heard it by itself, 583 00:23:30,830 --> 00:23:32,360 it would just sound like one thing. 584 00:23:32,360 --> 00:23:36,500 So another kind of interesting and cool phenomenon 585 00:23:36,500 --> 00:23:39,280 that is related to frequency selectivity 586 00:23:39,280 --> 00:23:41,120 is the perception of beating. 587 00:23:41,120 --> 00:23:43,820 So how many people here know what beating is? 588 00:23:43,820 --> 00:23:44,660 Yeah, OK. 589 00:23:44,660 --> 00:23:49,190 So beating is a physical phenomenon 590 00:23:49,190 --> 00:23:53,090 that happens whenever you have multiple different frequencies 591 00:23:53,090 --> 00:23:54,660 that are present at the same time. 592 00:23:54,660 --> 00:23:56,930 So in this case, those are the red and blue curves 593 00:23:56,930 --> 00:23:58,390 up at the top. 594 00:23:58,390 --> 00:24:00,680 So those are sinusoids of two different frequencies. 595 00:24:00,680 --> 00:24:03,440 And the consequence of them being two different frequencies 596 00:24:03,440 --> 00:24:05,840 is that over time, they shift in and out of phase.
597 00:24:05,840 --> 00:24:08,180 And so there's this particular point here 598 00:24:08,180 --> 00:24:11,280 where the peaks of the waveforms are aligned, 599 00:24:11,280 --> 00:24:13,821 and then there's this point over here where the peak of one 600 00:24:13,821 --> 00:24:15,320 aligns with the trough of the other. 601 00:24:15,320 --> 00:24:16,790 It's just because they're two different frequencies 602 00:24:16,790 --> 00:24:18,371 and they slide in and out of phase. 603 00:24:18,371 --> 00:24:20,870 And so when you play those two frequencies at the same time, 604 00:24:20,870 --> 00:24:23,780 you get the black waveform, because they sum linearly. 605 00:24:23,780 --> 00:24:27,200 That's what sounds do when they're both present at once. 606 00:24:27,200 --> 00:24:29,342 And so the point at which the peaks align, 607 00:24:29,342 --> 00:24:30,800 there is constructive interference. 608 00:24:30,800 --> 00:24:32,620 And the point at which the peak and the trough align, 609 00:24:32,620 --> 00:24:34,036 there is destructive interference. 610 00:24:34,036 --> 00:24:37,160 And so over time, the amplitude waxes and wanes. 611 00:24:37,160 --> 00:24:40,070 And so physically, that's what's known as beating. 612 00:24:40,070 --> 00:24:42,440 And the interesting thing is that the audibility 613 00:24:42,440 --> 00:24:46,290 of the beating is very tightly constrained by the cochlea. 614 00:24:46,290 --> 00:24:47,360 So here's one frequency. 615 00:24:47,360 --> 00:24:48,731 [TONE] 616 00:24:48,731 --> 00:24:50,046 Here's the other. 617 00:24:50,046 --> 00:24:51,000 [TONE] 618 00:24:51,000 --> 00:24:53,075 And then you play them at the same time. 619 00:24:53,075 --> 00:24:54,082 [TONE] 620 00:24:54,082 --> 00:24:56,040 Can you hear that fluttering kind of sensation? 621 00:24:56,040 --> 00:24:56,539 All right. 622 00:24:56,539 --> 00:25:00,011 So that's amplitude modulation. 623 00:25:00,011 --> 00:25:00,510 OK.
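The beating itself is pure trigonometry, and a few lines of Python (numpy assumed) make the waxing and waning explicit. The 440/444 Hz pair is an arbitrary illustrative choice: by the sum-to-product identity, the sum of two sinusoids is a carrier at the mean frequency whose amplitude is modulated at the difference frequency, here 4 Hz.

```python
import numpy as np

sr = 16000
t = np.arange(sr) / sr
f1, f2 = 440, 444                      # two nearby frequencies, 4 Hz apart

# Sounds sum linearly: the black waveform in the figure is just the sum.
x = np.sin(2 * np.pi * f1 * t) + np.sin(2 * np.pi * f2 * t)

# Sum-to-product identity:
#   sin(a) + sin(b) = 2 * cos((a - b)/2) * sin((a + b)/2)
# so the sum is a 442 Hz carrier whose amplitude follows this envelope,
# beating at f2 - f1 = 4 Hz between constructive (2) and destructive (0).
envelope = 2 * np.abs(np.cos(np.pi * (f2 - f1) * t))
```

The waveform never exceeds the envelope, and at the constructive-interference points it nearly reaches amplitude 2, twice either tone alone.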
624 00:25:00,510 --> 00:25:04,170 And so I've just told you how we can 625 00:25:04,170 --> 00:25:07,260 think of the cochlea as this set of filters, 626 00:25:07,260 --> 00:25:11,310 and so it's an interesting empirical fact that you only 627 00:25:11,310 --> 00:25:14,580 hear beating if the two frequencies that are beating 628 00:25:14,580 --> 00:25:17,071 fall roughly within the same cochlear bandwidth. 629 00:25:17,071 --> 00:25:17,570 OK? 630 00:25:17,570 --> 00:25:20,514 And so when they're pretty close together, like, one semi-tone, 631 00:25:20,514 --> 00:25:21,680 the beating is very audible. 632 00:25:21,680 --> 00:25:23,397 [TONE] 633 00:25:23,397 --> 00:25:25,730 But as you move them further apart-- so three semi-tones 634 00:25:25,730 --> 00:25:29,150 is getting close to, roughly, what a typical cochlear filter 635 00:25:29,150 --> 00:25:32,887 bandwidth would be, and the beating is a lot less audible. 636 00:25:32,887 --> 00:25:33,719 [TONE] 637 00:25:33,719 --> 00:25:35,760 And then by the time you get to eight semi-tones, 638 00:25:35,760 --> 00:25:36,660 you just don't hear anything. 639 00:25:36,660 --> 00:25:37,909 It just sounds like two tones. 640 00:25:37,909 --> 00:25:39,670 [TONE] 641 00:25:39,670 --> 00:25:40,510 Very clear. 642 00:25:40,510 --> 00:25:41,986 So contrast that with-- 643 00:25:41,986 --> 00:25:43,920 [TONE] 644 00:25:43,920 --> 00:25:44,420 All right. 645 00:25:44,420 --> 00:25:47,240 So the important thing to emphasize is that in all three 646 00:25:47,240 --> 00:25:50,940 of these cases, physically, there's beating happening. 647 00:25:50,940 --> 00:25:51,440 All right? 648 00:25:51,440 --> 00:25:54,200 So if you actually were to look at what the eardrum was doing, 649 00:25:54,200 --> 00:25:56,210 you would see the amplitude modulation here, 650 00:25:56,210 --> 00:25:57,990 but you don't hear that. 
651 00:25:57,990 --> 00:26:00,500 So this is just another consequence of the way 652 00:26:00,500 --> 00:26:03,830 that your cochlea is filtering sound. 653 00:26:03,830 --> 00:26:04,520 OK. 654 00:26:04,520 --> 00:26:06,561 All right, so we've got our auditory model, here. 655 00:26:06,561 --> 00:26:07,860 What happens next? 656 00:26:07,860 --> 00:26:11,270 So there's a couple of important caveats about this. 657 00:26:11,270 --> 00:26:13,730 And I mentioned this, in part, because some of these things 658 00:26:13,730 --> 00:26:14,570 are-- 659 00:26:14,570 --> 00:26:16,947 we don't really know exactly what the true significance 660 00:26:16,947 --> 00:26:18,530 is of some of these things, especially 661 00:26:18,530 --> 00:26:19,500 in computational terms. 662 00:26:19,500 --> 00:26:21,920 So I've just told you about how we typically 663 00:26:21,920 --> 00:26:23,780 will model what the cochlea is doing 664 00:26:23,780 --> 00:26:25,790 as a set of linear bandpass filters. 665 00:26:25,790 --> 00:26:28,040 So you get a signal, you apply a linear filter, 666 00:26:28,040 --> 00:26:30,080 you get a subband. 667 00:26:30,080 --> 00:26:32,630 But in actuality, if you actually 668 00:26:32,630 --> 00:26:34,484 look at what the ear is doing, it's 669 00:26:34,484 --> 00:26:35,900 pretty clear that linear filtering 670 00:26:35,900 --> 00:26:38,990 provides only an approximate description of cochlear tuning. 671 00:26:38,990 --> 00:26:41,720 And in particular, this is evident 672 00:26:41,720 --> 00:26:43,320 when you change sound levels. 673 00:26:43,320 --> 00:26:45,740 So what this is a plot of is tuning curves 674 00:26:45,740 --> 00:26:48,440 that you would measure from an auditory nerve fiber.
675 00:26:48,440 --> 00:26:51,140 So we've got spikes per second on the y-axis, 676 00:26:51,140 --> 00:26:53,810 we've got the frequency of a pure tone stimulus 677 00:26:53,810 --> 00:26:56,330 on the x-axis, and each curve plots 678 00:26:56,330 --> 00:26:58,710 the response at a different stimulus intensity. 679 00:26:58,710 --> 00:26:59,210 All right? 680 00:26:59,210 --> 00:27:01,820 So down here at the bottom, we've 681 00:27:01,820 --> 00:27:06,959 got 35 dB SPL, so that's, like, a very, very low level, like, 682 00:27:06,959 --> 00:27:09,250 maybe if you rub your hands together or something, that 683 00:27:09,250 --> 00:27:12,110 would be close to 35. 684 00:27:12,110 --> 00:27:14,180 And 45-- and you can see here that the tuning 685 00:27:14,180 --> 00:27:16,460 is pretty narrow here at these low levels. 686 00:27:16,460 --> 00:27:17,810 So 45 dB SPL. 687 00:27:17,810 --> 00:27:19,895 So you get a pretty big response, here, 688 00:27:19,895 --> 00:27:22,610 at, it looks like, you know, 1,700 Hertz. 689 00:27:22,610 --> 00:27:27,380 And then by the time you go down half an octave or something, 690 00:27:27,380 --> 00:27:29,080 there's almost no response. 691 00:27:29,080 --> 00:27:30,590 But as the stimulus level increases, 692 00:27:30,590 --> 00:27:32,990 you can see that the tuning broadens really 693 00:27:32,990 --> 00:27:34,940 very considerably. 694 00:27:34,940 --> 00:27:39,890 And so up here at 75 or 85 dB, you're getting a response 695 00:27:39,890 --> 00:27:43,430 from anywhere from 500 Hertz out to, you know, over 2,000 Hertz. 696 00:27:43,430 --> 00:27:45,140 That's, like, a two octave range, right? 697 00:27:45,140 --> 00:27:49,190 So the bandwidth is growing pretty dramatically. 698 00:27:49,190 --> 00:27:50,630 And this is very typical. 699 00:27:50,630 --> 00:27:53,090 Here is a bunch of examples of different nerve fibers 700 00:27:53,090 --> 00:27:54,750 that are doing the same thing.
701 00:27:54,750 --> 00:27:56,900 So at high levels, the tuning is really broad. 702 00:27:56,900 --> 00:27:58,860 At low levels, it's kind of narrow. 703 00:27:58,860 --> 00:28:02,610 And so mechanistically, in terms of the biology, 704 00:28:02,610 --> 00:28:04,370 we have a pretty good understanding 705 00:28:04,370 --> 00:28:06,492 of why this is happening. 706 00:28:06,492 --> 00:28:08,450 So what's going on is that the outer hair cells 707 00:28:08,450 --> 00:28:12,350 are providing amplification of the frequencies kind 708 00:28:12,350 --> 00:28:14,960 of near the characteristic frequency of the nerve fiber 709 00:28:14,960 --> 00:28:17,452 at low levels, but not at high levels. 710 00:28:17,452 --> 00:28:19,160 And so at high levels, what you're seeing 711 00:28:19,160 --> 00:28:22,420 is just very broad kind of mechanical tuning. 712 00:28:22,420 --> 00:28:24,170 But what really sort of remains unclear is 713 00:28:24,170 --> 00:28:27,890 what the consequences of this are for hearing, and really, 714 00:28:27,890 --> 00:28:29,850 how to think about it in computational terms. 715 00:28:29,850 --> 00:28:32,450 So it's clear that the linear filtering view is not 716 00:28:32,450 --> 00:28:34,880 exactly right. 717 00:28:34,880 --> 00:28:36,410 And one of the interesting things 718 00:28:36,410 --> 00:28:39,590 is that it's not like you get a lot worse at hearing 719 00:28:39,590 --> 00:28:41,840 when you go up to 75 or 85 dB. 720 00:28:41,840 --> 00:28:44,630 In fact, if anything, most psychophysical phenomena, 721 00:28:44,630 --> 00:28:47,510 actually, are better in some sense. 722 00:28:47,510 --> 00:28:49,162 This phenomenon, as I said, is related 723 00:28:49,162 --> 00:28:51,620 to this distinction between inner hair cells and outer hair 724 00:28:51,620 --> 00:28:51,800 cells.
725 00:28:51,800 --> 00:28:53,090 So the inner hair cells are the ones 726 00:28:53,090 --> 00:28:55,256 that, actually, are responsible for the transduction 727 00:28:55,256 --> 00:28:58,400 of sound energy, and the outer hair cells we think of as part 728 00:28:58,400 --> 00:29:00,561 of a feedback system that amplifies 729 00:29:00,561 --> 00:29:02,810 the motion of the membrane and sharpens the tuning. 730 00:29:02,810 --> 00:29:06,260 And that amplification is selective for frequencies 731 00:29:06,260 --> 00:29:08,390 near the characteristic frequency, and really occurs 732 00:29:08,390 --> 00:29:11,790 at very low levels. 733 00:29:11,790 --> 00:29:12,290 OK. 734 00:29:12,290 --> 00:29:15,540 So this is related to what Hynek was mentioning 735 00:29:15,540 --> 00:29:17,660 and the question that was asked earlier. 736 00:29:17,660 --> 00:29:20,510 So there's this other important response property 737 00:29:20,510 --> 00:29:23,750 of the cochlea, which is that for frequencies that 738 00:29:23,750 --> 00:29:26,330 are sufficiently low, auditory nerve spikes are 739 00:29:26,330 --> 00:29:28,010 phase locked to the stimulus. 740 00:29:28,010 --> 00:29:31,370 And so what you're seeing here is a single trace 741 00:29:31,370 --> 00:29:33,470 of a recording from a nerve fiber that's up top, 742 00:29:33,470 --> 00:29:35,854 and then at the bottom is the stimulus 743 00:29:35,854 --> 00:29:37,520 that would be supplied to the ear, which 744 00:29:37,520 --> 00:29:39,860 is just a pure tone of some particular frequency. 745 00:29:39,860 --> 00:29:42,140 And so you can note, like, two interesting things 746 00:29:42,140 --> 00:29:44,124 about this response. 747 00:29:44,124 --> 00:29:46,040 The first is that the spikes are intermittent. 748 00:29:46,040 --> 00:29:49,460 You don't get a spike at every cycle of the frequency.
749 00:29:49,460 --> 00:29:51,160 But when the spikes occur-- 750 00:29:51,160 --> 00:29:53,660 sorry, they occur at a particular phase 751 00:29:53,660 --> 00:29:55,440 relative to the stimulus. 752 00:29:55,440 --> 00:29:55,940 Right? 753 00:29:55,940 --> 00:29:57,440 So they're kind of just a little bit 754 00:29:57,440 --> 00:30:00,340 behind the peak of the waveform in this particular case, right, 755 00:30:00,340 --> 00:30:02,390 and in every single case. 756 00:30:02,390 --> 00:30:04,850 All right, so this is known as phase locking. 757 00:30:04,850 --> 00:30:08,090 And it's a pretty robust phenomenon for frequencies 758 00:30:08,090 --> 00:30:09,786 under a few kilohertz. 759 00:30:09,786 --> 00:30:11,160 And this is in non-human animals. 760 00:30:11,160 --> 00:30:13,451 There are no measurements of the auditory nerve in humans 761 00:30:13,451 --> 00:30:15,650 because to do so is highly invasive, 762 00:30:15,650 --> 00:30:19,970 and just nobody's ever done it, and probably won't ever. 763 00:30:22,820 --> 00:30:25,010 So this is another example of the same thing, where 764 00:30:25,010 --> 00:30:27,910 again, this is sort of the input waveform, 765 00:30:27,910 --> 00:30:31,430 and this is a bunch of different recordings of the same nerve 766 00:30:31,430 --> 00:30:33,070 fiber that are time aligned. 767 00:30:33,070 --> 00:30:34,940 So you can see the spikes always occur 768 00:30:34,940 --> 00:30:38,190 at a particular region of phase space. 769 00:30:38,190 --> 00:30:40,287 So they're not uniformly distributed. 770 00:30:40,287 --> 00:30:41,870 And the figure here on the right shows 771 00:30:41,870 --> 00:30:43,910 that this phenomenon deteriorates 772 00:30:43,910 --> 00:30:44,880 at very high frequency. 773 00:30:44,880 --> 00:30:47,736 So this is the plot of the strength of the phase locking 774 00:30:47,736 --> 00:30:48,860 as a function of frequency.
775 00:30:48,860 --> 00:30:51,780 And so up to a kilohertz, it's quite strong, 776 00:30:51,780 --> 00:30:54,850 and then it starts to kind of drop off, and above about 4k 777 00:30:54,850 --> 00:30:56,180 there's not really a whole lot. 778 00:30:58,950 --> 00:30:59,450 OK. 779 00:30:59,450 --> 00:31:02,660 And so as I said, one of the other salient things 780 00:31:02,660 --> 00:31:05,570 is that the fibers don't fire with every cycle 781 00:31:05,570 --> 00:31:07,910 of the stimulus. 782 00:31:07,910 --> 00:31:10,220 And one interesting fact about the ear 783 00:31:10,220 --> 00:31:12,590 is that there are a whole bunch of auditory nerve fibers 784 00:31:12,590 --> 00:31:13,820 for every inner hair cell. 785 00:31:13,820 --> 00:31:16,130 And people think that one of the reasons for that 786 00:31:16,130 --> 00:31:18,500 is because of this phenomenon, here. 787 00:31:18,500 --> 00:31:21,080 And this is probably due to things like refractory periods 788 00:31:21,080 --> 00:31:23,259 and stuff, right? 789 00:31:23,259 --> 00:31:25,550 But if you have a whole bunch of nerve fibers synapsing 790 00:31:25,550 --> 00:31:27,550 onto the same hair cell, the idea 791 00:31:27,550 --> 00:31:29,930 is that, well, collectively, you'll get spikes 792 00:31:29,930 --> 00:31:32,424 at every single cycle. 793 00:31:32,424 --> 00:31:34,090 So here's just some interesting numbers. 794 00:31:34,090 --> 00:31:38,450 So the number of inner hair cells per ear in a human 795 00:31:38,450 --> 00:31:42,140 is estimated to be about 3500. 796 00:31:42,140 --> 00:31:46,110 There are about four times as many outer hair cells, 797 00:31:46,110 --> 00:31:49,350 or roughly 12,000. But here's the key number: 798 00:31:49,350 --> 00:31:53,540 coming out of every ear are roughly 30,000 auditory nerve 799 00:31:53,540 --> 00:31:54,040 fibers.
800 00:31:54,040 --> 00:31:57,500 So there's about 10 times as many auditory nerve fibers 801 00:31:57,500 --> 00:31:59,405 as there are inner hair cells. 802 00:31:59,405 --> 00:32:01,280 And it's interesting to compare those numbers 803 00:32:01,280 --> 00:32:03,080 to what you see in the eye. 804 00:32:03,080 --> 00:32:06,320 So these are estimates from a few years 805 00:32:06,320 --> 00:32:11,000 ago, from the eye of roughly 5 million cones per eye, 806 00:32:11,000 --> 00:32:12,690 lots of rods, obviously. 807 00:32:12,690 --> 00:32:14,450 But then you go to the optic nerve, 808 00:32:14,450 --> 00:32:16,170 and the number of optic nerve fibers 809 00:32:16,170 --> 00:32:18,560 is actually substantially less than the number of cones. 810 00:32:18,560 --> 00:32:19,551 So 1 and 1/2 million. 811 00:32:19,551 --> 00:32:21,050 So there's this big compression that 812 00:32:21,050 --> 00:32:23,760 happens when you go into the auditory nerve-- 813 00:32:23,760 --> 00:32:25,820 sorry, in the optic nerve, whereas there's 814 00:32:25,820 --> 00:32:29,860 an expansion that happens in the auditory nerve. 815 00:32:29,860 --> 00:32:32,659 And just for fun, these are rough estimates 816 00:32:32,659 --> 00:32:33,950 of what you find in the cortex. 817 00:32:33,950 --> 00:32:36,749 So in primary auditory cortex, this 818 00:32:36,749 --> 00:32:38,540 is a very crude estimate I got from someone 819 00:32:38,540 --> 00:32:43,380 a couple of days ago, of 60 million neurons per hemisphere. 820 00:32:43,380 --> 00:32:45,980 And in v1, the estimate that I was able to find 821 00:32:45,980 --> 00:32:47,210 was 140 million. 822 00:32:47,210 --> 00:32:50,030 So these are sort of roughly the same order of magnitude, 823 00:32:50,030 --> 00:32:53,600 although it's obviously smaller in the auditory system. 
824 00:32:53,600 --> 00:32:55,187 But there is something very different 825 00:32:55,187 --> 00:32:57,770 happening here, in terms of the way the information is getting 826 00:32:57,770 --> 00:33:00,950 piped from the periphery onto the brain. 827 00:33:00,950 --> 00:33:03,080 And one reason for this might just 828 00:33:03,080 --> 00:33:06,850 be this phenomenon here, where 829 00:33:06,850 --> 00:33:08,892 the spiking in an individual auditory nerve fiber 830 00:33:08,892 --> 00:33:10,849 is going to be intermittent because the signals 831 00:33:10,849 --> 00:33:12,800 that it has to convey are very, very fast, 832 00:33:12,800 --> 00:33:15,590 and so you kind of have to multiplex in this way. 833 00:33:15,590 --> 00:33:17,450 All right, so the big picture here 834 00:33:17,450 --> 00:33:20,630 is that if you look at the output of the cochlea, 835 00:33:20,630 --> 00:33:22,070 there are, in some sense, two cues 836 00:33:22,070 --> 00:33:24,229 to the frequencies that are contained in a sound. 837 00:33:24,229 --> 00:33:26,270 So there's what is often referred to as the place 838 00:33:26,270 --> 00:33:27,584 of excitation in the cochlea. 839 00:33:27,584 --> 00:33:29,000 So these are the nerve fibers that 840 00:33:29,000 --> 00:33:33,260 are firing the most, according to a rate code. 841 00:33:33,260 --> 00:33:35,270 And there's also the timing of the spikes that 842 00:33:35,270 --> 00:33:39,380 are fired, in the sense that, for frequencies below about 4k, 843 00:33:39,380 --> 00:33:40,790 you get phase locking. 844 00:33:40,790 --> 00:33:42,260 And so the inter-spike intervals 845 00:33:42,260 --> 00:33:44,510 will be stereotyped, depending on the frequencies that 846 00:33:44,510 --> 00:33:46,010 come in.
847 00:33:46,010 --> 00:33:47,660 And so it's one of these sort of-- 848 00:33:47,660 --> 00:33:49,400 I still find this kind of remarkable 849 00:33:49,400 --> 00:33:51,560 when I sort of step back and think 850 00:33:51,560 --> 00:33:53,060 about the state of things, that this 851 00:33:53,060 --> 00:33:56,870 is a very basic question about neural coding at the very 852 00:33:56,870 --> 00:33:59,030 front end of the system. 853 00:33:59,030 --> 00:34:01,400 And the importance of these things 854 00:34:01,400 --> 00:34:02,830 really remains unresolved. 855 00:34:02,830 --> 00:34:07,160 So people have been debating this for a really long time, 856 00:34:07,160 --> 00:34:09,620 and we still don't really have very clear answers 857 00:34:09,620 --> 00:34:12,560 to the extent to which the spike timing really 858 00:34:12,560 --> 00:34:15,170 is critical for inferring frequency. 859 00:34:15,170 --> 00:34:18,000 So I'm going to play you a demo that provides-- 860 00:34:18,000 --> 00:34:20,420 it's an example of some of the circumstantial evidence 861 00:34:20,420 --> 00:34:21,920 for the importance of phase locking. 862 00:34:21,920 --> 00:34:24,500 And broadly speaking, the evidence 863 00:34:24,500 --> 00:34:27,320 for the importance of phase locking in humans 864 00:34:27,320 --> 00:34:29,870 comes from the fact that the perception of frequency 865 00:34:29,870 --> 00:34:34,040 seems to change once you kind of get above about 4k. 866 00:34:34,040 --> 00:34:37,310 So for instance, if you give people 867 00:34:37,310 --> 00:34:39,860 a musical interval, [SINGING] da da, and then 868 00:34:39,860 --> 00:34:42,830 you ask them to replicate that, in, say, different octaves, 869 00:34:42,830 --> 00:34:45,889 people can do that pretty well until you get to about 4k. 870 00:34:45,889 --> 00:34:47,670 And above 4k, they just break down. 871 00:34:47,670 --> 00:34:49,969 They become very, very highly variable. 
872 00:34:49,969 --> 00:34:51,719 And that's evident in this demonstration. 873 00:34:51,719 --> 00:34:54,290 So what you're going to hear in this demo, 874 00:34:54,290 --> 00:34:57,260 and this is a demonstration I got from Peter Cariani. 875 00:34:57,260 --> 00:34:58,880 Thanks to him. 876 00:34:58,880 --> 00:35:01,670 It's a melody that is probably familiar to all of you that's 877 00:35:01,670 --> 00:35:03,590 being played with pure tones. 878 00:35:03,590 --> 00:35:05,780 And it will be played repeatedly, 879 00:35:05,780 --> 00:35:09,170 transposed from very high frequencies down in, 880 00:35:09,170 --> 00:35:12,040 I don't know, third octaves or something, 881 00:35:12,040 --> 00:35:13,520 to lower and lower frequencies. 882 00:35:13,520 --> 00:35:15,512 And what you will experience is that when 883 00:35:15,512 --> 00:35:17,345 you hear the very high frequencies, well, A, 884 00:35:17,345 --> 00:35:19,790 they'll be kind of annoying, so just bear with me, 885 00:35:19,790 --> 00:35:22,179 but the melody will also be unrecognizable. 886 00:35:22,179 --> 00:35:24,470 And so it'll only be when you get below a certain point 887 00:35:24,470 --> 00:35:26,180 that you'll say, aha, I know what that is. 888 00:35:26,180 --> 00:35:26,679 OK? 889 00:35:26,679 --> 00:35:31,460 And again, we can look at this in our spectrogram, which 890 00:35:31,460 --> 00:35:33,650 is still going. 891 00:35:33,650 --> 00:35:37,154 And you can actually see what's going on. 892 00:35:37,154 --> 00:35:40,591 [MELODIC TONES] 893 00:36:55,770 --> 00:36:56,754 OK. 894 00:36:56,754 --> 00:36:58,170 So by the end, hopefully everybody 895 00:36:58,170 --> 00:37:00,120 recognized what that was. 
896 00:37:00,120 --> 00:37:01,979 So let's just talk briefly about how 897 00:37:01,979 --> 00:37:04,020 these subbands that we were talking about earlier 898 00:37:04,020 --> 00:37:05,853 relate to what we see in the auditory nerve, 899 00:37:05,853 --> 00:37:10,380 and again, this relates to one of the earlier questions. 900 00:37:10,380 --> 00:37:13,480 So the subband is this blue signal here. 901 00:37:13,480 --> 00:37:15,960 This is the output of a linear filter. 902 00:37:15,960 --> 00:37:18,540 And one of the ways to characterize a subband 903 00:37:18,540 --> 00:37:22,140 like this that's band limited is by the instantaneous amplitude 904 00:37:22,140 --> 00:37:23,616 and the instantaneous phase. 905 00:37:23,616 --> 00:37:24,990 And these things, loosely, can be 906 00:37:24,990 --> 00:37:27,540 mapped onto a spike rate and spike 907 00:37:27,540 --> 00:37:30,990 timing in the auditory nerve. 908 00:37:30,990 --> 00:37:34,230 So again, this is an example of the phase locking that you see, 909 00:37:34,230 --> 00:37:36,570 where the spikes get fired at some particular point 910 00:37:36,570 --> 00:37:37,710 in the waveform. 911 00:37:37,710 --> 00:37:39,090 And so if you observe this, well, 912 00:37:39,090 --> 00:37:41,131 you know something about exactly what's happening 913 00:37:41,131 --> 00:37:44,190 in the waveform, namely, you know that there's energy 914 00:37:44,190 --> 00:37:46,494 there in the stimulus because you're getting spikes, 915 00:37:46,494 --> 00:37:48,660 but you also actually know the phase of the waveform 916 00:37:48,660 --> 00:37:52,740 because the spikes are happening in particular places. 917 00:37:52,740 --> 00:37:56,002 And so this issue of phase is sort of a tricky one, 918 00:37:56,002 --> 00:37:57,960 because it's often not something that we really 919 00:37:57,960 --> 00:38:00,930 know how to deal with. 
920 00:38:00,930 --> 00:38:03,030 And it's also just empirically the case 921 00:38:03,030 --> 00:38:06,660 that a lot of the information in sound is carried by the way 922 00:38:06,660 --> 00:38:08,970 that frequencies are modulated over time, 923 00:38:08,970 --> 00:38:12,420 as measured by the instantaneous amplitude in a subband. 924 00:38:12,420 --> 00:38:13,934 And so the instantaneous amplitude 925 00:38:13,934 --> 00:38:15,850 is measured by a quantity called the envelope. 926 00:38:15,850 --> 00:38:17,308 So that's the red curve, here, that 927 00:38:17,308 --> 00:38:20,359 shows how the amplitude waxes and wanes. 928 00:38:20,359 --> 00:38:22,650 And the envelope is easy to extract from auditory nerve 929 00:38:22,650 --> 00:38:25,590 response, just by computing the firing rate over local time 930 00:38:25,590 --> 00:38:26,250 windows. 931 00:38:26,250 --> 00:38:28,542 And in signal processing terms, we typically extract it 932 00:38:28,542 --> 00:38:30,374 with something called the Hilbert transform, 933 00:38:30,374 --> 00:38:32,490 by taking the magnitude of the analytic signal. 934 00:38:32,490 --> 00:38:34,114 So it's a pretty easy thing to pull out 935 00:38:34,114 --> 00:38:37,140 in MATLAB, for instance. 936 00:38:37,140 --> 00:38:40,490 So just to relate this to stuff that may be more familiar, 937 00:38:40,490 --> 00:38:42,240 the spectrograms that people have probably 938 00:38:42,240 --> 00:38:43,770 seen in the past-- again, these are 939 00:38:43,770 --> 00:38:45,330 pictures that take a sound waveform 940 00:38:45,330 --> 00:38:47,590 and plot the frequency content over time. 941 00:38:47,590 --> 00:38:49,980 One way to get a spectrogram is to have 942 00:38:49,980 --> 00:38:52,770 a bank of bandpass filters, to get a bunch of subbands, 943 00:38:52,770 --> 00:38:55,380 and then to extract the envelope of each subband, 944 00:38:55,380 --> 00:38:57,970 and just plot the envelope in grayscale, horizontally. 
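The envelope extraction just described — taking the magnitude of the analytic signal via the Hilbert transform — is a one-liner in most signal-processing environments. The lecture mentions MATLAB; here is a minimal Python sketch using scipy, with a made-up toy subband for illustration:

```python
import numpy as np
from scipy.signal import hilbert

def subband_envelope(subband):
    """Instantaneous amplitude of a band-limited signal:
    the magnitude of its analytic signal (Hilbert transform)."""
    return np.abs(hilbert(subband))

# Toy subband: a 100 Hz carrier whose amplitude waxes and wanes at 5 Hz.
fs = 1000
t = np.arange(fs) / fs
amplitude = 1.0 + 0.5 * np.sin(2 * np.pi * 5 * t)  # the true envelope
subband = amplitude * np.sin(2 * np.pi * 100 * t)
env = subband_envelope(subband)  # closely tracks `amplitude`
```

Because the modulation is much slower than the carrier, the Hilbert envelope recovers the amplitude curve almost exactly away from the signal edges.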
945 00:38:57,970 --> 00:38:58,470 All right? 946 00:38:58,470 --> 00:39:02,250 So a stripe through this picture is the envelope in grayscale. 947 00:39:02,250 --> 00:39:04,230 So it's black in the places where 948 00:39:04,230 --> 00:39:08,290 the energy is high, and white in the places where it's low. 949 00:39:08,290 --> 00:39:13,185 And so this is a spectrogram of this. 950 00:39:13,185 --> 00:39:16,012 [DRUMMING] 951 00:39:16,012 --> 00:39:17,720 All right, so that's what you just heard. 952 00:39:17,720 --> 00:39:19,610 It's just a drum break. 953 00:39:19,610 --> 00:39:21,152 And you can probably see that there 954 00:39:21,152 --> 00:39:23,360 are sort of events in the spectrogram that correspond 955 00:39:23,360 --> 00:39:27,170 to things like the drumbeats. 956 00:39:27,170 --> 00:39:29,540 Now, one of the other striking things about this picture 957 00:39:29,540 --> 00:39:30,590 is it looks like a mess, right? 958 00:39:30,590 --> 00:39:32,131 I mean, you listen to that drum break 959 00:39:32,131 --> 00:39:34,500 and it sounded kind of crisp and clean, 960 00:39:34,500 --> 00:39:37,220 and when you actually look at the instantaneous amplitude 961 00:39:37,220 --> 00:39:42,450 in each of the subbands, it just sort of looks messy and noisy. 962 00:39:42,450 --> 00:39:47,900 But one interesting fact is that this picture, for most signals, 963 00:39:47,900 --> 00:39:51,260 captures all of the information that matters perceptually 964 00:39:51,260 --> 00:39:55,790 in the following sense: if you have some sound signal 965 00:39:55,790 --> 00:39:57,680 and you let me generate this picture, 966 00:39:57,680 --> 00:39:59,600 and all you let me keep is that picture 967 00:39:59,600 --> 00:40:01,580 (we throw out the original sound waveform), 968 00:40:01,580 --> 00:40:03,440 then from that picture, I can generate 969 00:40:03,440 --> 00:40:05,730 a sound signal that will sound just like the original. 970 00:40:05,730 --> 00:40:06,230 All right?
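The recipe just described — a bank of bandpass filters, then each subband's envelope plotted as one horizontal stripe — can be sketched as follows. The Butterworth filters and bandwidths here are arbitrary illustration choices, not the cochlear-style filters an actual auditory model would use:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def envelope_spectrogram(x, fs, center_freqs, rel_bw=0.3):
    """Envelope 'spectrogram': split x into bandpass subbands, then
    take each subband's Hilbert envelope as one row of the image."""
    rows = []
    for fc in center_freqs:
        band = [fc * (1 - rel_bw / 2), fc * (1 + rel_bw / 2)]
        sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
        subband = sosfiltfilt(sos, x)
        rows.append(np.abs(hilbert(subband)))
    return np.array(rows)  # shape: (n_channels, n_samples)

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)  # a pure tone at 440 Hz
img = envelope_spectrogram(x, fs, center_freqs=[220, 440, 880])
```

For this pure tone, only the 440 Hz channel carries appreciable energy, so only its row of the image would be dark.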
971 00:40:06,230 --> 00:40:09,050 And in fact, I've done this. 972 00:40:09,050 --> 00:40:11,798 And here is a reconstruction from that picture. 973 00:40:11,798 --> 00:40:14,430 [DRUMMING] 974 00:40:14,430 --> 00:40:15,677 Here's the original. 975 00:40:15,677 --> 00:40:18,319 [DRUMMING] 976 00:40:18,319 --> 00:40:19,360 Sounded exactly the same. 977 00:40:19,360 --> 00:40:20,010 OK? 978 00:40:20,010 --> 00:40:22,840 And I'll tell you in a second how we do this, 979 00:40:22,840 --> 00:40:27,740 but the fact that this picture looks messy and noisy is, 980 00:40:27,740 --> 00:40:29,310 I think, mostly just due to the fact 981 00:40:29,310 --> 00:40:31,740 that your brain is not used to getting the sensory input 982 00:40:31,740 --> 00:40:32,540 in this format. 983 00:40:32,540 --> 00:40:33,039 Right? 984 00:40:33,039 --> 00:40:34,490 You're used to hearing it as sound. 985 00:40:34,490 --> 00:40:34,990 Right? 986 00:40:34,990 --> 00:40:37,740 And so your visual system is actually not optimally designed 987 00:40:37,740 --> 00:40:40,320 to interpret a spectrogram. 988 00:40:40,320 --> 00:40:42,780 I want to just briefly explain how you take this picture 989 00:40:42,780 --> 00:40:44,070 and generate a sound signal because this 990 00:40:44,070 --> 00:40:45,736 is sort of a useful thing to understand, 991 00:40:45,736 --> 00:40:48,370 and it will become relevant a little bit later. 992 00:40:48,370 --> 00:40:51,570 So the general game that we play here 993 00:40:51,570 --> 00:40:54,244 is you hand me this picture, right, 994 00:40:54,244 --> 00:40:55,660 and I want to synthesize a signal. 995 00:40:55,660 --> 00:40:58,860 And so usually what we do is we start out with a noise signal, 996 00:40:58,860 --> 00:41:01,754 and we transform that noise signal until it's in kind 997 00:41:01,754 --> 00:41:02,920 of the right representation. 998 00:41:02,920 --> 00:41:05,610 So we split it up into its subbands.
999 00:41:05,610 --> 00:41:09,000 And then we'll replace the envelopes of the noise subbands 1000 00:41:09,000 --> 00:41:10,990 by the envelopes from this picture. 1001 00:41:10,990 --> 00:41:11,490 OK? 1002 00:41:11,490 --> 00:41:13,410 And so the way that we do that is 1003 00:41:13,410 --> 00:41:15,450 we measure the envelope of each noise subband, 1004 00:41:15,450 --> 00:41:18,810 we divide it out, and then we multiply by the envelope 1005 00:41:18,810 --> 00:41:20,260 of the thing that we want. 1006 00:41:20,260 --> 00:41:22,380 And that gives us new subbands. 1007 00:41:22,380 --> 00:41:27,120 And we then add those up, and we get a new sound signal. 1008 00:41:27,120 --> 00:41:30,240 And for various reasons, this is a process 1009 00:41:30,240 --> 00:41:31,500 that needs to be iterated. 1010 00:41:31,500 --> 00:41:33,260 So you take that new sound signal, 1011 00:41:33,260 --> 00:41:37,200 you generate its subbands, you replace their envelopes 1012 00:41:37,200 --> 00:41:39,750 by the ones you want, and you add them back up 1013 00:41:39,750 --> 00:41:40,819 to get a sound signal. 1014 00:41:40,819 --> 00:41:43,110 And if you do this about 10 times, what you end up with 1015 00:41:43,110 --> 00:41:45,520 is something that has the envelopes that you want. 1016 00:41:45,520 --> 00:41:46,540 And so then you can listen to it. 1017 00:41:46,540 --> 00:41:47,790 And that's the thing that I played you. 1018 00:41:47,790 --> 00:41:48,570 OK? 1019 00:41:48,570 --> 00:41:50,778 So it's this iterative procedure, where you typically 1020 00:41:50,778 --> 00:41:53,670 start with noise, you project the signal 1021 00:41:53,670 --> 00:41:57,980 that you want onto the noise, collapse, and then iterate. 1022 00:41:57,980 --> 00:41:58,880 OK. 1023 00:41:58,880 --> 00:42:02,560 And we'll see some more examples of this in action. 
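The iterative procedure just described — start from noise, divide out each noise subband's envelope, multiply in the target envelope, re-sum, and repeat — can be sketched like this. It is a toy version under assumed filter choices (well-separated Butterworth bands rather than a full cochlear filter bank), purely to show the loop:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def subbands(x, fs, center_freqs, rel_bw=0.3):
    """Split x into bandpass subbands (toy stand-in for cochlear filters)."""
    sb = []
    for fc in center_freqs:
        band = [fc * (1 - rel_bw / 2), fc * (1 + rel_bw / 2)]
        sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
        sb.append(sosfiltfilt(sos, x))
    return np.array(sb)

def impose_envelopes(target_envs, fs, center_freqs, n_iter=10, seed=0):
    """Start from noise, then repeatedly: split into subbands, divide out
    each subband's own envelope, multiply in the target, and re-sum."""
    rng = np.random.default_rng(seed)
    y = rng.standard_normal(target_envs.shape[1])
    for _ in range(n_iter):
        sb = subbands(y, fs, center_freqs)
        env = np.abs(hilbert(sb, axis=1)) + 1e-12  # avoid divide-by-zero
        y = (sb / env * target_envs).sum(axis=0)
    return y

fs, n = 4000, 4000
t = np.arange(n) / fs
cfs = [400, 800, 1600]
# Target envelopes: strong slow modulation low, weaker energy above.
targets = np.array([1.0 + 0.9 * np.sin(2 * np.pi * 3 * t),
                    0.4 * np.ones(n),
                    0.1 * np.ones(n)])
y = impose_envelopes(targets, fs, cfs)
```

After a few iterations, re-measuring the subband envelopes of `y` gives values close to the targets, which is why the procedure converges on a signal that "has the envelopes that you want."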
1024 00:42:02,560 --> 00:42:07,434 So I just told you how the instantaneous amplitude 1025 00:42:07,434 --> 00:42:09,600 in each of the filter outputs, which we characterize 1026 00:42:09,600 --> 00:42:12,060 with the envelope, is an important thing 1027 00:42:12,060 --> 00:42:13,270 for sound representation. 1028 00:42:13,270 --> 00:42:16,050 And so in the auditory model that I'm building here, 1029 00:42:16,050 --> 00:42:17,700 we've got a second stage of processing 1030 00:42:17,700 --> 00:42:19,929 now, where we've taken the subbands 1031 00:42:19,929 --> 00:42:21,720 and we extract the instantaneous amplitude. 1032 00:42:21,720 --> 00:42:24,420 And there's one other thing here that I'll just mention, 1033 00:42:24,420 --> 00:42:26,670 which is that another important feature of the cochlea 1034 00:42:26,670 --> 00:42:28,378 is what's known as amplitude compression. 1035 00:42:28,378 --> 00:42:30,930 So the response of the cochlea as a function of sound level 1036 00:42:30,930 --> 00:42:34,117 is not linear, rather it's compressive. 1037 00:42:34,117 --> 00:42:35,700 And this is due to the fact that there 1038 00:42:35,700 --> 00:42:38,700 is selective amplification when sounds are quiet, 1039 00:42:38,700 --> 00:42:42,460 and not when they're very high in intensity. 1040 00:42:42,460 --> 00:42:43,920 And I'm not going to say anything 1041 00:42:43,920 --> 00:42:47,270 more about this for now, but it will become important later. 1042 00:42:47,270 --> 00:42:50,630 And this is something that has a lot of practical consequences. 1043 00:42:50,630 --> 00:42:53,359 So when people lose their hearing, 1044 00:42:53,359 --> 00:42:54,900 one of the common things that happens 1045 00:42:54,900 --> 00:42:58,179 is that the outer hair cells stop working correctly. 1046 00:42:58,179 --> 00:43:00,720 And the outer hair cells are one of the things that generates 1047 00:43:00,720 --> 00:43:01,840 that compressive response. 
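The compressive response just introduced is often modeled as a power law with an exponent well below 1 (the 0.3 below is a commonly used ballpark, assumed here for illustration). The key property is the selective amplification mentioned above: effective gain is large for quiet sounds and near unity for intense ones:

```python
import numpy as np

def compress(envelope, exponent=0.3):
    """Power-law amplitude compression, a simple stand-in for the
    cochlea's compressive response (the exponent is an assumption)."""
    return np.asarray(envelope, dtype=float) ** exponent

quiet, loud = 0.01, 1.0
gain_quiet = compress(quiet) / quiet  # large amplification at low level
gain_loud = compress(loud) / loud     # roughly unity gain at high level
```

With these numbers the quiet input is boosted about 25-fold while the loud input is passed through essentially unchanged, which is the compressive behavior that damaged outer hair cells fail to provide.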
1048 00:43:01,840 --> 00:43:03,780 So they're the nonlinear component 1049 00:43:03,780 --> 00:43:05,430 of processing in the cochlea. 1050 00:43:05,430 --> 00:43:07,200 And so when people lose their hearing, 1051 00:43:07,200 --> 00:43:10,710 the tuning broadens, and you get a linear response 1052 00:43:10,710 --> 00:43:12,930 to sound amplitude because you lose 1053 00:43:12,930 --> 00:43:15,856 that selective amplification, and that's 1054 00:43:15,856 --> 00:43:17,730 something that hearing aids try to replicate, 1055 00:43:17,730 --> 00:43:20,651 and that's hard to do. 1056 00:43:20,651 --> 00:43:21,150 OK. 1057 00:43:21,150 --> 00:43:22,885 So what happens next? 1058 00:43:22,885 --> 00:43:24,510 So everybody here-- does everybody here 1059 00:43:24,510 --> 00:43:26,190 know what a spike triggered average is? 1060 00:43:26,190 --> 00:43:26,460 Yeah. 1061 00:43:26,460 --> 00:43:28,251 Lots of people talked about this, probably. 1062 00:43:28,251 --> 00:43:28,860 OK. 1063 00:43:28,860 --> 00:43:31,920 So one of the standard ways that we investigate 1064 00:43:31,920 --> 00:43:34,950 sensory systems, when we have some reason 1065 00:43:34,950 --> 00:43:39,062 to think that things might be reasonably linear, 1066 00:43:39,062 --> 00:43:41,520 is by measuring something called a spike triggered average. 1067 00:43:41,520 --> 00:43:43,228 And so the way this experiment might work 1068 00:43:43,228 --> 00:43:45,686 is you would play a stimulus like this. 1069 00:43:45,686 --> 00:43:54,080 [NOISE SIGNAL] 1070 00:43:54,080 --> 00:43:55,760 So that's a type of a noise signal. 1071 00:43:55,760 --> 00:43:56,260 Right?
1072 00:43:56,260 --> 00:43:59,230 So you'd play your animal that signal, you'd 1073 00:43:59,230 --> 00:44:01,690 be recording spikes from a neuron 1074 00:44:01,690 --> 00:44:05,180 that you might be interested in, and every time there's a spike, 1075 00:44:05,180 --> 00:44:07,930 you would look back at what happens in the signal 1076 00:44:07,930 --> 00:44:10,900 and then you would take all the little histories that 1077 00:44:10,900 --> 00:44:11,800 preceded that spike. 1078 00:44:11,800 --> 00:44:14,470 You'd average them together, and you get the average stimulus 1079 00:44:14,470 --> 00:44:15,860 that preceded a spike. 1080 00:44:15,860 --> 00:44:17,530 And so in this particular case, we're 1081 00:44:17,530 --> 00:44:20,560 going to actually do this in the domain of the spectrogram. 1082 00:44:20,560 --> 00:44:22,840 And that's because you might hypothesize that, really, 1083 00:44:22,840 --> 00:44:24,465 what the neurons would care about would 1084 00:44:24,465 --> 00:44:27,100 be the instantaneous amplitude in the sound signal, 1085 00:44:27,100 --> 00:44:29,320 and not necessarily the phase. 1086 00:44:29,320 --> 00:44:32,500 So if you do that, for instance, in the inferior colliculus, 1087 00:44:32,500 --> 00:44:34,510 the stage of the midbrain, you see things 1088 00:44:34,510 --> 00:44:36,790 that are pretty stereotyped. 1089 00:44:36,790 --> 00:44:39,430 And we're going to refer to what we get out 1090 00:44:39,430 --> 00:44:42,670 of this procedure as a spectrotemporal receptive 1091 00:44:42,670 --> 00:44:43,390 field. 1092 00:44:43,390 --> 00:44:48,170 And that would often be referred to as a STRF for short. 1093 00:44:48,170 --> 00:44:51,850 So if you hear people talk about STRFs, this is what they mean. 1094 00:44:51,850 --> 00:44:52,660 OK. 1095 00:44:52,660 --> 00:44:54,212 So these are derived from methods 1096 00:44:54,212 --> 00:44:56,170 that would be like the spike triggered average. 
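The procedure just described — collect the stimulus history preceding each spike and average the snippets together — can be sketched with a toy threshold "neuron." Everything below is a made-up illustration, not real data; the neuron simply spikes whenever a smoothed noise stimulus crosses a threshold, so the resulting average rises toward the spike time:

```python
import numpy as np

def spike_triggered_average(stimulus, spike_times, window):
    """Average the `window` stimulus samples preceding each spike."""
    snippets = [stimulus[t - window:t]
                for t in spike_times if t >= window]
    return np.mean(snippets, axis=0)

# Toy experiment: smoothed Gaussian noise stimulus, threshold "neuron".
rng = np.random.default_rng(0)
stim = np.convolve(rng.standard_normal(20000),
                   np.ones(5) / 5, mode="same")
spikes = [t for t in range(50, len(stim)) if stim[t] > 1.0]
sta = spike_triggered_average(stim, spikes, window=50)
```

Because spikes here are triggered by high stimulus values, `sta` is near zero at long lags and climbs in the last few samples before the spike — the average stimulus that preceded a spike, exactly as in the lecture's description.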
1097 00:44:56,170 --> 00:44:58,586 People don't usually actually do a spike triggered average 1098 00:44:58,586 --> 00:45:02,230 for various reasons, but what you get out is similar. 1099 00:45:02,230 --> 00:45:06,296 And so what you see here is the average spectrogram that 1100 00:45:06,296 --> 00:45:07,420 would be preceding a spike. 1101 00:45:07,420 --> 00:45:09,130 And for this particular neuron, you 1102 00:45:09,130 --> 00:45:11,880 can see there's a bit of red and then a bit of blue. 1103 00:45:11,880 --> 00:45:14,770 And so at this particular frequency, which, in this case, 1104 00:45:14,770 --> 00:45:20,590 is something like 10 kilohertz or something like that, 1105 00:45:20,590 --> 00:45:23,050 the optimal stimulus that would generate a spike 1106 00:45:23,050 --> 00:45:25,660 is something that gives you an increase in energy, and then 1107 00:45:25,660 --> 00:45:28,070 a decrease. 1108 00:45:28,070 --> 00:45:32,890 And so what this corresponds to is amplitude modulation 1109 00:45:32,890 --> 00:45:34,250 at a particular rate. 1110 00:45:34,250 --> 00:45:36,583 And so you can see there's a characteristic timing here. 1111 00:45:36,583 --> 00:45:38,307 So the red thing has a certain duration, 1112 00:45:38,307 --> 00:45:39,640 and then there's the blue thing. 1113 00:45:39,640 --> 00:45:41,080 And so there is this very rapid increase 1114 00:45:41,080 --> 00:45:43,460 and decrease in energy at that particular frequency. 1115 00:45:43,460 --> 00:45:46,850 So that's known as amplitude modulation. 1116 00:45:46,850 --> 00:45:48,820 And so this is one way of looking 1117 00:45:48,820 --> 00:45:50,770 at this in the domain of the spectrogram. 1118 00:45:50,770 --> 00:45:52,780 Another way of looking at this would 1119 00:45:52,780 --> 00:45:55,910 be to generate a tuning function as a function of the modulation 1120 00:45:55,910 --> 00:45:56,410 rate. 
1121 00:45:56,410 --> 00:45:57,951 So you could actually change how fast 1122 00:45:57,951 --> 00:46:00,070 the amplitude is being modulated, 1123 00:46:00,070 --> 00:46:02,380 and you would see in a neuron like this, 1124 00:46:02,380 --> 00:46:04,250 that the response would exhibit tuning. 1125 00:46:04,250 --> 00:46:07,391 And so each one of the graphs here, 1126 00:46:07,391 --> 00:46:08,890 or each one of the plots, the dashed 1127 00:46:08,890 --> 00:46:13,570 curves in the lower right plot-- so each dashed line is 1128 00:46:13,570 --> 00:46:15,910 a tuning curve for one neuron. 1129 00:46:15,910 --> 00:46:18,370 So it's a plot of the response of that neuron 1130 00:46:18,370 --> 00:46:21,070 as a function of the temporal modulation frequencies. 1131 00:46:21,070 --> 00:46:24,650 So that's how fast the amplitude is changing with time. 1132 00:46:24,650 --> 00:46:27,580 And if you look at this particular case here, 1133 00:46:27,580 --> 00:46:30,280 you can see that there's a peak response when the modulation 1134 00:46:30,280 --> 00:46:34,540 frequency is maybe 70 Hertz, and then the response 1135 00:46:34,540 --> 00:46:36,940 decreases if you go in either direction. 1136 00:46:36,940 --> 00:46:40,930 This guy here has got a slightly lower preferred frequency, 1137 00:46:40,930 --> 00:46:42,940 and down here, lower still. 1138 00:46:42,940 --> 00:46:44,710 And so again, what you can see is 1139 00:46:44,710 --> 00:46:47,320 something that looks strikingly similar to the kinds of filter 1140 00:46:47,320 --> 00:46:49,992 banks that we just saw when we were looking at the cochlea. 1141 00:46:49,992 --> 00:46:51,700 But there's a key difference here, right, 1142 00:46:51,700 --> 00:46:55,030 that here we're seeing tuning to modulation frequency, 1143 00:46:55,030 --> 00:46:56,400 not to audio frequency. 1144 00:46:56,400 --> 00:46:59,170 So this is the rate at which the amplitude changes. 
1145 00:46:59,170 --> 00:47:02,910 That's the amplitude of the sound, not the audio frequency. 1146 00:47:02,910 --> 00:47:06,310 So there's a carrier frequency here, which is 10k, 1147 00:47:06,310 --> 00:47:08,140 but the frequency we're talking about here 1148 00:47:08,140 --> 00:47:09,670 is the rate at which this changes. 1149 00:47:09,670 --> 00:47:11,380 And so in this case here, you can 1150 00:47:11,380 --> 00:47:15,760 see the period here is maybe 5 milliseconds, and so this 1151 00:47:15,760 --> 00:47:19,685 would correspond to a modulation of, I guess, 200 Hertz. 1152 00:47:19,685 --> 00:47:20,795 Is that right? 1153 00:47:20,795 --> 00:47:21,670 I think that's right. 1154 00:47:21,670 --> 00:47:22,680 Yeah. 1155 00:47:22,680 --> 00:47:24,450 OK. 1156 00:47:24,450 --> 00:47:27,220 And so as early as the midbrain, you see stuff like this. 1157 00:47:27,220 --> 00:47:29,530 So in the inferior colliculus, there's lots of it. 1158 00:47:29,530 --> 00:47:35,200 And so this suggests a second, or in this case, 1159 00:47:35,200 --> 00:47:37,780 a third stage of processing, which are known as modulation 1160 00:47:37,780 --> 00:47:38,730 filters. 1161 00:47:38,730 --> 00:47:42,865 This is a very old idea in auditory science 1162 00:47:42,865 --> 00:47:44,740 that now has a fair bit of empirical support. 1163 00:47:44,740 --> 00:47:46,880 And so the idea is that in our model, 1164 00:47:46,880 --> 00:47:48,200 we've got our sound signal. 1165 00:47:48,200 --> 00:47:50,408 It gets passed through this bank of bandpass filters, 1166 00:47:50,408 --> 00:47:51,940 you get these subbands, you extract 1167 00:47:51,940 --> 00:47:54,440 the instantaneous amplitude, known as the envelope, 1168 00:47:54,440 --> 00:47:56,398 and then you take that envelope and you pass it 1169 00:47:56,398 --> 00:47:57,670 through another filter bank. 1170 00:47:57,670 --> 00:47:59,211 This time, these are filters that are 1171 00:47:59,211 --> 00:48:01,194 tuned in modulation frequency. 
1172 00:48:01,194 --> 00:48:02,860 And the output of those filters-- again, 1173 00:48:02,860 --> 00:48:05,950 it's exactly conceptually analogous to the output 1174 00:48:05,950 --> 00:48:08,210 of this first set of band pass filters. 1175 00:48:08,210 --> 00:48:10,120 So in this case, we have a filter 1176 00:48:10,120 --> 00:48:11,702 that's tuned to low modulation rates, 1177 00:48:11,702 --> 00:48:13,660 and so you can see what it outputs is something 1178 00:48:13,660 --> 00:48:15,476 that's fluctuating very slowly. 1179 00:48:15,476 --> 00:48:17,350 So it's just taken the very slow fluctuations 1180 00:48:17,350 --> 00:48:18,352 out of that envelope. 1181 00:48:18,352 --> 00:48:20,560 Here, you have a filter that's tuned to higher rates, 1182 00:48:20,560 --> 00:48:23,950 and so it's wiggling around at faster rates. 1183 00:48:23,950 --> 00:48:26,080 And you have a different set of these filters 1184 00:48:26,080 --> 00:48:28,880 for each cochlear channel, each thing coming out 1185 00:48:28,880 --> 00:48:30,910 of the cochlea. 1186 00:48:30,910 --> 00:48:33,280 So this picture here that we've gradually built up 1187 00:48:33,280 --> 00:48:36,490 gives us a model of the signal processing 1188 00:48:36,490 --> 00:48:40,330 that we think occurs between the cochlea and the midbrain 1189 00:48:40,330 --> 00:48:42,020 or the thalamus. 1190 00:48:42,020 --> 00:48:43,480 So you have the bandpass filtering 1191 00:48:43,480 --> 00:48:45,996 that happens in the cochlea, you get subbands, 1192 00:48:45,996 --> 00:48:48,370 you extract the instantaneous amplitude in the envelopes, 1193 00:48:48,370 --> 00:48:52,611 and then you filter that again with these modulation filters. 1194 00:48:52,611 --> 00:48:54,860 This is sort of a rough understanding of the front end 1195 00:48:54,860 --> 00:48:55,818 of the auditory system. 
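The second filtering stage described above — each cochlear channel's envelope passed through another bank of filters, this time tuned in modulation frequency — might be sketched like this. The octave-wide constant-Q bands and the specific modulation rates are simplifying assumptions for illustration:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def modulation_subbands(envelope, fs, mod_freqs, width_octaves=1.0):
    """Filter one channel's envelope into modulation bands, each an
    octave wide (an assumed choice) centered on mod_freqs."""
    out = []
    for fm in mod_freqs:
        band = [fm / 2 ** (width_octaves / 2),
                fm * 2 ** (width_octaves / 2)]
        sos = butter(2, band, btype="bandpass", fs=fs, output="sos")
        out.append(sosfiltfilt(sos, envelope))
    return np.array(out)

fs = 1000  # envelopes vary slowly, so a low sampling rate suffices
t = np.arange(2 * fs) / fs
envelope = 1.0 + 0.5 * np.sin(2 * np.pi * 4 * t)  # 4 Hz modulation
mods = modulation_subbands(envelope, fs, mod_freqs=[4.0, 32.0])
```

For this 4 Hz modulated envelope, the filter tuned near 4 Hz fluctuates strongly while the one tuned to 32 Hz barely responds — tuning to modulation frequency, not audio frequency.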
1196 00:48:55,818 --> 00:48:59,294 And the question is, given these representations, 1197 00:48:59,294 --> 00:49:01,710 how do we do the interesting things that we do with sound? 1198 00:49:01,710 --> 00:49:03,835 So how do we recognize things and their properties, 1199 00:49:03,835 --> 00:49:07,342 and do scene analysis, and so on, and so forth. 1200 00:49:07,342 --> 00:49:09,050 And this is something that is still 1201 00:49:09,050 --> 00:49:14,584 very much in its infancy in terms of our understanding. 1202 00:49:14,584 --> 00:49:16,500 One of the areas that I've spent a bit of time 1203 00:49:16,500 --> 00:49:19,609 on to try to get a handle on this is sound texture. 1204 00:49:19,609 --> 00:49:21,900 And I started working on this because I sort of thought 1205 00:49:21,900 --> 00:49:24,090 it would be a nice way in to understanding some 1206 00:49:24,090 --> 00:49:28,230 of these issues, and because I gather that Eero was here 1207 00:49:28,230 --> 00:49:29,820 talking about visual textures. 1208 00:49:29,820 --> 00:49:34,480 I was asked to feature this, and hopefully this will be useful. 1209 00:49:34,480 --> 00:49:36,527 So what are sound textures? 1210 00:49:36,527 --> 00:49:38,610 Textures are sounds that result from large numbers 1211 00:49:38,610 --> 00:49:40,830 of acoustic events, and they include things 1212 00:49:40,830 --> 00:49:43,020 that you hear all the time, like rain-- 1213 00:49:43,020 --> 00:49:46,120 [RAIN] 1214 00:49:46,120 --> 00:49:47,130 --or birds-- 1215 00:49:47,130 --> 00:49:52,820 [BIRDS CHIRPING] 1216 00:49:52,820 --> 00:49:54,647 --or running water-- 1217 00:49:54,647 --> 00:49:57,390 [RUNNING WATER] 1218 00:49:57,390 --> 00:49:57,990 --insects-- 1219 00:49:57,990 --> 00:50:02,490 [INSECTS] 1220 00:50:02,490 --> 00:50:03,030 --applause-- 1221 00:50:03,030 --> 00:50:06,250 [APPLAUSE] 1222 00:50:06,250 --> 00:50:07,888 --fire, so forth. 1223 00:50:07,888 --> 00:50:11,640 [FIRE BURNING] 1224 00:50:11,640 --> 00:50:12,160 OK.
1225 00:50:12,160 --> 00:50:14,620 So these are sounds that, typically, 1226 00:50:14,620 --> 00:50:17,230 are generated from large numbers of acoustic events. 1227 00:50:17,230 --> 00:50:18,604 They're very common in the world. 1228 00:50:18,604 --> 00:50:20,500 You hear these things all the time. 1229 00:50:20,500 --> 00:50:22,390 But they've been largely unstudied. 1230 00:50:22,390 --> 00:50:26,260 So there's been a long and storied history of research 1231 00:50:26,260 --> 00:50:29,800 on visual texture, and this really 1232 00:50:29,800 --> 00:50:32,950 had not been thought about very much until a few years ago. 1233 00:50:36,100 --> 00:50:39,160 Now, a lot of the things that people typically think about 1234 00:50:39,160 --> 00:50:40,610 in hearing are the sounds that are 1235 00:50:40,610 --> 00:50:41,860 produced by individual events. 1236 00:50:41,860 --> 00:50:44,110 And if we have time, we'll talk about this more later, 1237 00:50:44,110 --> 00:50:45,026 but stuff like this. 1238 00:50:45,026 --> 00:50:45,881 AUDIO: January. 1239 00:50:45,881 --> 00:50:46,880 JOSH MCDERMOTT: Or this. 1240 00:50:46,880 --> 00:50:47,736 [SQUEAK] 1241 00:50:47,736 --> 00:50:50,110 And these are the waveforms associated with those events. 1242 00:50:50,110 --> 00:50:51,550 And the point is that those sounds, 1243 00:50:51,550 --> 00:50:53,620 they have a beginning, and an end, and a temporal evolution. 1244 00:50:53,620 --> 00:50:55,180 And that temporal evolution is sort 1245 00:50:55,180 --> 00:50:58,335 of part of what makes the sound what it is, right? 1246 00:50:58,335 --> 00:50:59,710 And textures are a bit different. 1247 00:50:59,710 --> 00:51:01,866 So here's just the sound of rain.
1248 00:51:01,866 --> 00:51:04,250 [RAIN] 1249 00:51:04,250 --> 00:51:06,340 Now of course, at some point, the rain started 1250 00:51:06,340 --> 00:51:07,570 and hopefully at some point it will end, 1251 00:51:07,570 --> 00:51:08,590 but the start and the end are not 1252 00:51:08,590 --> 00:51:10,150 what makes it sound like rain, right? 1253 00:51:10,150 --> 00:51:12,566 The qualities that make it sound like rain are just there. 1254 00:51:12,566 --> 00:51:14,540 So the texture is stationary. 1255 00:51:14,540 --> 00:51:16,720 So the essential properties don't change over time. 1256 00:51:16,720 --> 00:51:18,230 And so I got interested in textures 1257 00:51:18,230 --> 00:51:20,560 because it seemed like stationarity would make them 1258 00:51:20,560 --> 00:51:22,330 a good starting point for understanding 1259 00:51:22,330 --> 00:51:24,288 auditory representation because, in some sense, 1260 00:51:24,288 --> 00:51:26,200 it sort of simplifies the kinds of things 1261 00:51:26,200 --> 00:51:27,199 you have to worry about. 1262 00:51:27,199 --> 00:51:31,004 You don't have to worry about time in quite the same way. 1263 00:51:31,004 --> 00:51:32,920 And so the question that we were interested in 1264 00:51:32,920 --> 00:51:35,910 is how people represent and recognize texture. 1265 00:51:35,910 --> 00:51:38,575 So just to make that concrete, listen to this. 1266 00:51:38,575 --> 00:51:40,900 [CHATTER] 1267 00:51:40,900 --> 00:51:42,095 And then this. 1268 00:51:42,095 --> 00:51:43,912 [CHATTER] 1269 00:51:43,912 --> 00:51:45,370 So it's immediately apparent to you 1270 00:51:45,370 --> 00:51:46,910 that those are the same kind of thing, right? 1271 00:51:46,910 --> 00:51:49,510 In fact, they're two different excerpts of the same recording. 1272 00:51:49,510 --> 00:51:51,610 But the waveform itself is totally different in the two 1273 00:51:51,610 --> 00:51:52,140 cases. 
1274 00:51:52,140 --> 00:51:54,223 So there's something that your brain is extracting 1275 00:51:54,223 --> 00:51:56,348 from those two excerpts that tells you that they're 1276 00:51:56,348 --> 00:51:58,639 the same kind of thing, and that, for instance, they're 1277 00:51:58,639 --> 00:51:59,666 different from this. 1278 00:51:59,666 --> 00:52:01,850 [BEES BUZZING] 1279 00:52:01,850 --> 00:52:04,310 So the question is, what is it that you extract and store 1280 00:52:04,310 --> 00:52:07,010 about those waveforms that tells you that certain things are 1281 00:52:07,010 --> 00:52:08,676 the same and other things are different, 1282 00:52:08,676 --> 00:52:11,390 and that allows you to recognize what things are? 1283 00:52:11,390 --> 00:52:13,669 And so the key theoretical proposal 1284 00:52:13,669 --> 00:52:15,710 that we made in this work is that because they're 1285 00:52:15,710 --> 00:52:17,840 stationary, textures can be captured 1286 00:52:17,840 --> 00:52:19,760 by statistics that are time averages 1287 00:52:19,760 --> 00:52:21,050 of acoustic measurements. 1288 00:52:21,050 --> 00:52:23,591 So the proposal is that when you recognize the sound of fire, 1289 00:52:23,591 --> 00:52:25,100 or rain, or what have you, you're 1290 00:52:25,100 --> 00:52:26,990 recognizing these statistics. 1291 00:52:26,990 --> 00:52:28,490 So what kinds of statistics might we 1292 00:52:28,490 --> 00:52:32,060 be measuring if you think this proposal has some plausibility? 1293 00:52:32,060 --> 00:52:34,010 And so part of the reason for walking you 1294 00:52:34,010 --> 00:52:36,350 through this auditory model is that whatever statistics 1295 00:52:36,350 --> 00:52:37,910 the auditory system measures are presumably 1296 00:52:37,910 --> 00:52:39,620 derived from representations like this, 1297 00:52:39,620 --> 00:52:42,320 right, that constitute the input that your brain is getting 1298 00:52:42,320 --> 00:52:44,630 from the auditory periphery.
1299 00:52:44,630 --> 00:52:47,480 And so we initially asked how far 1300 00:52:47,480 --> 00:52:50,240 one might get with representations consisting 1301 00:52:50,240 --> 00:52:53,210 of fairly generic statistics of these standard auditory 1302 00:52:53,210 --> 00:52:55,250 representations, things like marginal moments 1303 00:52:55,250 --> 00:52:59,250 and correlations. 1304 00:52:59,250 --> 00:53:01,070 So the statistics that we initially 1305 00:53:01,070 --> 00:53:03,500 considered were not, in any way, specifically tailored 1306 00:53:03,500 --> 00:53:05,990 to natural sounds, and really, ultimately, 1307 00:53:05,990 --> 00:53:08,480 what we'd like to do would be to actually learn statistics 1308 00:53:08,480 --> 00:53:11,650 from data that, actually, we think are good representations. 1309 00:53:11,650 --> 00:53:13,792 That's something we're working on. 1310 00:53:13,792 --> 00:53:16,250 But these statistics are simple and they involve operations 1311 00:53:16,250 --> 00:53:17,930 that you could instantiate in neurons, 1312 00:53:17,930 --> 00:53:19,679 so it seemed like maybe a reasonable place 1313 00:53:19,679 --> 00:53:21,740 to start to at least get a feel for what 1314 00:53:21,740 --> 00:53:24,230 the landscape was like. 1315 00:53:24,230 --> 00:53:25,850 And so what I want to do now is just 1316 00:53:25,850 --> 00:53:29,360 to give you some intuitions as to what sorts of things 1317 00:53:29,360 --> 00:53:32,250 might be captured by statistics of these representations. 1318 00:53:32,250 --> 00:53:34,717 And so at a minimum, to be useful for recognition, 1319 00:53:34,717 --> 00:53:36,800 well, statistics need to give you different values 1320 00:53:36,800 --> 00:53:37,675 for different sounds. 1321 00:53:37,675 --> 00:53:39,150 And so let's see what happens. 
1322 00:53:39,150 --> 00:53:42,140 So let's first have a look at what kinds of things 1323 00:53:42,140 --> 00:53:45,650 might be captured by marginal moments of amplitude envelopes 1324 00:53:45,650 --> 00:53:47,520 from bandpass filters. 1325 00:53:47,520 --> 00:53:48,240 OK? 1326 00:53:48,240 --> 00:53:49,760 So remember, the envelope, here, you 1327 00:53:49,760 --> 00:53:51,860 can think of as a stripe through a spectrogram, 1328 00:53:51,860 --> 00:53:54,980 right, so it's the instantaneous amplitude in a given frequency 1329 00:53:54,980 --> 00:53:55,916 channel. 1330 00:53:55,916 --> 00:53:57,290 So the blue thing is the subband, 1331 00:53:57,290 --> 00:53:59,090 the red thing here is the envelope. 1332 00:53:59,090 --> 00:54:01,490 And the marginal moments will describe the way 1333 00:54:01,490 --> 00:54:02,820 the envelope is distributed. 1334 00:54:02,820 --> 00:54:04,730 So imagine you took that red curve 1335 00:54:04,730 --> 00:54:07,359 and you collapsed that over time to get a histogram that 1336 00:54:07,359 --> 00:54:08,900 tells you the frequency of occurrence 1337 00:54:08,900 --> 00:54:13,434 of different amplitudes in that particular frequency channel. 1338 00:54:13,434 --> 00:54:15,850 And that's what it looks like for this particular example. 1339 00:54:15,850 --> 00:54:19,250 And you might think that, well, this can't possibly 1340 00:54:19,250 --> 00:54:21,880 be all that informative, right, because it's obviously 1341 00:54:21,880 --> 00:54:23,630 got some central tendency and some spread, 1342 00:54:23,630 --> 00:54:25,963 but when you do this business of collapsing across time, 1343 00:54:25,963 --> 00:54:28,110 you're throwing out all kinds of information. 
1344 00:54:28,110 --> 00:54:29,030 But one of the interesting things 1345 00:54:29,030 --> 00:54:30,950 is that when you look at these kinds of distributions 1346 00:54:30,950 --> 00:54:32,390 for different types of sounds, you 1347 00:54:32,390 --> 00:54:34,580 see that they vary a fair bit, and in particular, 1348 00:54:34,580 --> 00:54:36,170 that they're systematically different 1349 00:54:36,170 --> 00:54:38,790 for a lot of natural sounds than they are for noise signals. 1350 00:54:38,790 --> 00:54:41,520 So this is that same thing, so it's this histogram here, 1351 00:54:41,520 --> 00:54:42,840 just rotated 90 degrees. 1352 00:54:42,840 --> 00:54:44,780 So we've got the probability of occurrence 1353 00:54:44,780 --> 00:54:48,800 on the y-axis, and the magnitude in the envelope of one 1354 00:54:48,800 --> 00:54:50,929 particular frequency channel on the x-axis, 1355 00:54:50,929 --> 00:54:52,220 for three different recordings. 1356 00:54:52,220 --> 00:54:56,400 So the red plots what you get for a recording of noise, 1357 00:54:56,400 --> 00:55:00,260 the blue is a stream, and the green is a bunch of geese. 1358 00:55:00,260 --> 00:55:01,100 Geese don't quack. 1359 00:55:01,100 --> 00:55:01,640 Whatever the geese do. 1360 00:55:01,640 --> 00:55:02,020 AUDIENCE: Honk. 1361 00:55:02,020 --> 00:55:02,660 JOSH MCDERMOTT: Yeah, honking. 1362 00:55:02,660 --> 00:55:03,160 Yeah. 1363 00:55:03,160 --> 00:55:03,780 All right. 1364 00:55:03,780 --> 00:55:08,390 And this is a filter that's centered at 2,200 Hertz. 1365 00:55:08,390 --> 00:55:10,250 In particular, these examples were 1366 00:55:10,250 --> 00:55:12,920 chosen because the average value of the envelope 1367 00:55:12,920 --> 00:55:14,690 is very similar in the three cases. 1368 00:55:14,690 --> 00:55:16,273 But you can see that the distributions 1369 00:55:16,273 --> 00:55:17,460 have very different shapes. 1370 00:55:17,460 --> 00:55:20,960 So the noise one, here, has got pretty low variance. 
1371 00:55:20,960 --> 00:55:25,640 The stream has got larger variance, and the geese larger 1372 00:55:25,640 --> 00:55:27,530 still, and it's also positively skewed. 1373 00:55:27,530 --> 00:55:29,244 So it's got this kind of long tail, here. 1374 00:55:29,244 --> 00:55:31,160 And if you look at the spectrograms associated 1375 00:55:31,160 --> 00:55:33,870 with these sounds, you can see where this comes from. 1376 00:55:33,870 --> 00:55:37,329 So the spectrogram of noise is pretty gray, 1377 00:55:37,329 --> 00:55:38,870 so the noise signal kind of hangs out 1378 00:55:38,870 --> 00:55:40,910 around its average value most of the time, 1379 00:55:40,910 --> 00:55:43,340 whereas the stream has got more gray and more black, 1380 00:55:43,340 --> 00:55:45,440 and the geese actually has some white and some dark black. 1381 00:55:45,440 --> 00:55:47,314 So here, white would correspond to down here, 1382 00:55:47,314 --> 00:55:50,060 black would correspond to up here. 1383 00:55:50,060 --> 00:55:53,720 And the intuition here is that natural signals are sparse. 1384 00:55:53,720 --> 00:55:56,210 In particular, they're sparser than noise. 1385 00:55:56,210 --> 00:55:58,640 So we think of natural signals as often being made up 1386 00:55:58,640 --> 00:56:00,710 of events like raindrops, or geese 1387 00:56:00,710 --> 00:56:02,630 calls, and these events are infrequent, 1388 00:56:02,630 --> 00:56:05,380 but when they occur they produce large amplitudes in the signal. 1389 00:56:05,380 --> 00:56:07,550 And when they don't occur, the amplitude is lower. 1390 00:56:07,550 --> 00:56:11,390 In contrast, the noise signal doesn't really have those. 
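The sparsity intuition above can be checked numerically. Here is a minimal sketch (my own illustration, not code from the lecture or the associated paper) using NumPy/SciPy: a single "cochlear" channel is simulated with a Butterworth bandpass filter and a Hilbert amplitude envelope, and the envelope's marginal moments are compared for noise versus a sparse, click-like signal. All filter settings and signal parameters are arbitrary choices.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def band_envelope(x, fs, lo, hi):
    # One "cochlear" channel: bandpass filter, then Hilbert amplitude envelope.
    sos = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
    return np.abs(hilbert(sosfiltfilt(sos, x)))

def envelope_moments(env):
    # Marginal moments of the envelope distribution (collapsed over time).
    mu = env.mean()
    var = env.var()
    skew = ((env - mu) ** 3).mean() / var ** 1.5
    return mu, var, skew

fs = 16000
n = 2 * fs
rng = np.random.default_rng(0)

noise = rng.standard_normal(n)                   # stationary noise
sparse = 0.1 * rng.standard_normal(n)            # sparse signal: quiet background...
events = rng.choice(n, size=40, replace=False)
sparse[events] += 20 * rng.standard_normal(40)   # ...plus rare, large events

_, _, skew_noise = envelope_moments(band_envelope(noise, fs, 1800, 2600))
_, _, skew_sparse = envelope_moments(band_envelope(sparse, fs, 1800, 2600))
# The sparse signal's envelope histogram has the long positive tail
# described above, so its skew is much larger than the noise envelope's.
```

The event-like signal spends most of its time near zero with occasional large excursions, which is exactly what produces the skewed envelope histograms of the geese recording relative to noise.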
1391 00:56:11,390 --> 00:56:14,900 But the important point about this 1392 00:56:14,900 --> 00:56:17,060 from the standpoint of wanting to characterize 1393 00:56:17,060 --> 00:56:20,780 a signal with statistics is that this phenomenon of sparsity 1394 00:56:20,780 --> 00:56:22,520 is reflected in some pretty simple things 1395 00:56:22,520 --> 00:56:23,670 that you could compute from the signal, 1396 00:56:23,670 --> 00:56:24,989 like the variance and the skew. 1397 00:56:24,989 --> 00:56:27,530 So you can see that the variance varies across these signals, 1398 00:56:27,530 --> 00:56:30,150 as does the skew. 1399 00:56:30,150 --> 00:56:30,650 All right. 1400 00:56:30,650 --> 00:56:31,730 Let's take a quick look at what you 1401 00:56:31,730 --> 00:56:33,380 might get by measuring correlations 1402 00:56:33,380 --> 00:56:35,476 between these different channels. 1403 00:56:35,476 --> 00:56:37,100 So these things also vary across sounds. 1404 00:56:37,100 --> 00:56:40,370 And you can see them in the cochleogram here, 1405 00:56:40,370 --> 00:56:43,067 in this particular example, as reflected 1406 00:56:43,067 --> 00:56:44,150 in these vertical streaks. 1407 00:56:44,150 --> 00:56:45,890 So this is the cochleogram of fire, 1408 00:56:45,890 --> 00:56:48,034 and fire's got lots of crackles and pops. 1409 00:56:48,034 --> 00:56:52,120 [FIRE BURNING] 1410 00:56:52,120 --> 00:56:54,310 And those crackles and pops show up 1411 00:56:54,310 --> 00:56:56,530 as these vertical streaks in the cochleogram. 1412 00:56:56,530 --> 00:56:59,500 So a crackle and pop is like a click-like event. 1413 00:56:59,500 --> 00:57:02,136 Clicks have lots of different frequencies in them, 1414 00:57:02,136 --> 00:57:04,510 and so you see these vertical streaks in the spectrogram, 1415 00:57:04,510 --> 00:57:06,400 and that introduces statistical dependencies 1416 00:57:06,400 --> 00:57:08,230 between different frequency channels. 
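The cross-channel dependencies introduced by click-like events can likewise be illustrated with a small sketch (again my own toy example, not the lecture's code; the channel center frequencies and click counts are arbitrary): envelope correlations across a few bandpass channels come out near zero for noise but large for a click train, because each broadband click drives every channel at once.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def envelopes(x, fs, centers):
    # Envelopes of a few octave-spaced bandpass channels.
    envs = []
    for fc in centers:
        sos = butter(4, [fc / 1.2, fc * 1.2], btype="band", fs=fs, output="sos")
        envs.append(np.abs(hilbert(sosfiltfilt(sos, x))))
    return np.array(envs)

fs = 16000
n = 2 * fs
centers = [500, 1000, 2000, 4000]
rng = np.random.default_rng(1)

clicks = np.zeros(n)                                  # click train: broadband,
clicks[rng.choice(n, size=60, replace=False)] = 1.0   # click-like events
noise = rng.standard_normal(n)                        # plain noise

off_diag = ~np.eye(len(centers), dtype=bool)
corr_clicks = np.corrcoef(envelopes(clicks, fs, centers))[off_diag].mean()
corr_noise = np.corrcoef(envelopes(noise, fs, centers))[off_diag].mean()
# The click train's envelopes rise and fall together across channels,
# like fire's crackles and pops; the noise envelopes fluctuate independently.
```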
1417 00:57:08,230 --> 00:57:11,350 And so that can be measured by just computing correlations 1418 00:57:11,350 --> 00:57:14,140 between the envelopes of these different channels, 1419 00:57:14,140 --> 00:57:16,240 and that's reflected in this matrix here. 1420 00:57:16,240 --> 00:57:18,730 So every cell of this matrix is the correlation 1421 00:57:18,730 --> 00:57:21,160 between a pair of channels. 1422 00:57:21,160 --> 00:57:24,490 So we're going from low frequencies, 1423 00:57:24,490 --> 00:57:26,094 here, to high, and low to high. 1424 00:57:26,094 --> 00:57:28,510 The diagonal has got to be one, but the off-diagonal stuff 1425 00:57:28,510 --> 00:57:29,230 can be whatever. 1426 00:57:29,230 --> 00:57:32,020 You can see, for the example of fire, there's a lot of yellow 1427 00:57:32,020 --> 00:57:35,200 and a lot of red, which means that the amplitudes 1428 00:57:35,200 --> 00:57:37,450 of different channels tend to be correlated. 1429 00:57:37,450 --> 00:57:39,116 But this is not the case for everything. 1430 00:57:39,116 --> 00:57:41,590 So if you look at a water sound, like a stream-- 1431 00:57:41,590 --> 00:57:44,612 [STREAM] 1432 00:57:44,612 --> 00:57:46,820 --there are not very many things that are click-like, 1433 00:57:46,820 --> 00:57:49,505 and most of the correlations here are pretty close to zero. 1434 00:57:49,505 --> 00:57:51,130 So again, this is a pretty simple thing 1435 00:57:51,130 --> 00:57:53,213 that you can measure, but you get different values 1436 00:57:53,213 --> 00:57:55,300 for different sounds. 1437 00:57:55,300 --> 00:57:59,140 Similarly, if we look at the power coming out 1438 00:57:59,140 --> 00:58:01,540 of these modulation filters, you also see big differences 1439 00:58:01,540 --> 00:58:03,190 across sounds. 1440 00:58:03,190 --> 00:58:06,710 So that's plotted here for three different sound recordings, 1441 00:58:06,710 --> 00:58:09,190 insects, waves, and a stream. 
1442 00:58:09,190 --> 00:58:11,560 And so remember that these modulation filters, 1443 00:58:11,560 --> 00:58:13,810 we think of them as being applied to each cochlear 1444 00:58:13,810 --> 00:58:14,320 channel. 1445 00:58:14,320 --> 00:58:17,455 So the modulation power that you would get from all 1446 00:58:17,455 --> 00:58:18,880 your modulation filters is-- 1447 00:58:18,880 --> 00:58:20,780 you can see that in a 2D plot. 1448 00:58:20,780 --> 00:58:23,350 So this is the frequency of the cochlear channel, 1449 00:58:23,350 --> 00:58:25,570 and this is the rate of the modulation channel. 1450 00:58:25,570 --> 00:58:30,520 So these are slow modulations and these are fast modulations. 1451 00:58:30,520 --> 00:58:32,500 And so for the insects, you can see 1452 00:58:32,500 --> 00:58:34,800 that there are these, like, little blobs up here. 1453 00:58:34,800 --> 00:58:36,550 And that actually corresponds to the rates 1454 00:58:36,550 --> 00:58:39,280 at which the insects rub their wings together and make 1455 00:58:39,280 --> 00:58:41,200 the sound that they make, which kind of 1456 00:58:41,200 --> 00:58:42,847 gives it this shimmery quality. 1457 00:58:42,847 --> 00:58:46,750 [INSECTS] 1458 00:58:46,750 --> 00:58:49,000 In contrast, the waves have got most of the power 1459 00:58:49,000 --> 00:58:51,884 here at very slow modulations. 1460 00:58:51,884 --> 00:58:56,760 [WAVES] 1461 00:58:56,760 --> 00:58:59,540 And the stream is pretty broadband, 1462 00:58:59,540 --> 00:59:02,097 so there's modulations at a whole bunch of different rates. 1463 00:59:02,097 --> 00:59:04,970 [STREAM] 1464 00:59:04,970 --> 00:59:05,470 All right. 1465 00:59:05,470 --> 00:59:07,840 So just by measuring the power coming out of these channels, 1466 00:59:07,840 --> 00:59:09,400 we're potentially learning something 1467 00:59:09,400 --> 00:59:11,590 about what's in the signal. 1468 00:59:11,590 --> 00:59:13,602 And I'm going to skip this last example. 
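The modulation-power differences just described can be sketched the same way. In this toy example (my own, with arbitrary parameters; in the full model the modulation analysis is applied within each cochlear channel, whereas here, for brevity, it is applied to the broadband envelope), noise amplitude-modulated at 2 Hz versus 32 Hz concentrates its envelope power in the corresponding modulation band, much as the insects, waves, and stream differ in where their modulation power lies.

```python
import numpy as np
from scipy.signal import hilbert

fs = 16000
t = np.arange(4 * fs) / fs
rng = np.random.default_rng(2)
carrier = rng.standard_normal(t.size)

slow = (1 + np.sin(2 * np.pi * 2 * t)) * carrier     # noise modulated at 2 Hz
fast = (1 + np.sin(2 * np.pi * 32 * t)) * carrier    # noise modulated at 32 Hz

def modulation_power(x, fs, lo, hi):
    # Power of the amplitude envelope within one modulation-rate band,
    # measured here from the envelope's Fourier spectrum.
    env = np.abs(hilbert(x))
    env = env - env.mean()
    spec = np.abs(np.fft.rfft(env)) ** 2 / env.size
    f = np.fft.rfftfreq(env.size, 1 / fs)
    return spec[(f >= lo) & (f < hi)].sum()

# Each signal concentrates its envelope power at its own modulation rate.
slow_lo = modulation_power(slow, fs, 1, 4)
slow_hi = modulation_power(slow, fs, 16, 64)
fast_lo = modulation_power(fast, fs, 1, 4)
fast_hi = modulation_power(fast, fs, 16, 64)
```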
1469 00:59:13,602 --> 00:59:15,310 All right, so the point is just that when 1470 00:59:15,310 --> 00:59:17,650 you look at these statistics, they vary pretty substantially 1471 00:59:17,650 --> 00:59:18,220 across sound. 1472 00:59:18,220 --> 00:59:20,140 And so the question we were interested in 1473 00:59:20,140 --> 00:59:23,340 is whether they could plausibly account for the perception 1474 00:59:23,340 --> 00:59:26,680 of real world textures. 1475 00:59:26,680 --> 00:59:29,270 The key methodological proposal of this work 1476 00:59:29,270 --> 00:59:32,600 was that synthesis is a potentially very powerful way 1477 00:59:32,600 --> 00:59:34,700 to test a perceptual theory. 1478 00:59:34,700 --> 00:59:37,732 Now, maybe the kind of standard thing 1479 00:59:37,732 --> 00:59:39,190 that you might think that you might 1480 00:59:39,190 --> 00:59:41,289 try to do with this type of representation 1481 00:59:41,289 --> 00:59:42,830 is measure these statistics, and then 1482 00:59:42,830 --> 00:59:44,329 see whether, for instance, you could 1483 00:59:44,329 --> 00:59:47,900 discriminate between different signals or maybe [AUDIO OUT].. 1484 00:59:47,900 --> 00:59:49,370 And for various reasons, I actually 1485 00:59:49,370 --> 00:59:50,720 think that synthesis is potentially 1486 00:59:50,720 --> 00:59:51,620 a lot more powerful. 1487 00:59:51,620 --> 00:59:54,980 And the notion is this, that if your brain is representing 1488 00:59:54,980 --> 00:59:56,522 sounds with some set of measurements, 1489 00:59:56,522 --> 00:59:59,021 then signals that have the same values of those measurements 1490 00:59:59,021 --> 01:00:00,490 ought to sound the same to you. 
1491 01:00:00,490 --> 01:00:04,100 And so in particular, if we've got some real world recording, 1492 01:00:04,100 --> 01:00:05,790 and we synthesize a signal to cause 1493 01:00:05,790 --> 01:00:07,790 it to have the same measurements, the statistics 1494 01:00:07,790 --> 01:00:09,582 in this case, as that real world recording, 1495 01:00:09,582 --> 01:00:11,123 well, then the synthetic signal ought 1496 01:00:11,123 --> 01:00:12,860 to sound like the real world recording 1497 01:00:12,860 --> 01:00:15,170 if the measurements that we use are like the ones 1498 01:00:15,170 --> 01:00:17,810 that the brain is using to represent sound. 1499 01:00:17,810 --> 01:00:20,270 And so we can potentially use synthesis, 1500 01:00:20,270 --> 01:00:22,830 then, to test whether a candidate representation, 1501 01:00:22,830 --> 01:00:25,580 in this case, these statistics that we're measuring, 1502 01:00:25,580 --> 01:00:30,320 is a reasonable representation for the brain to be using. 1503 01:00:30,320 --> 01:00:32,466 So here's just a simple example to kind of walk you 1504 01:00:32,466 --> 01:00:33,216 through the logic. 1505 01:00:33,216 --> 01:00:36,680 So let's suppose that you had a relatively simple theory 1506 01:00:36,680 --> 01:00:38,319 of sound texture perception, which 1507 01:00:38,319 --> 01:00:40,610 is that texture perception might be rooted in the power 1508 01:00:40,610 --> 01:00:41,411 spectrum. 1509 01:00:41,411 --> 01:00:42,410 It's not so implausible. 1510 01:00:42,410 --> 01:00:44,034 Lots of people think the power spectrum 1511 01:00:44,034 --> 01:00:46,368 is sort of a useful way to characterize it [AUDIO OUT].. 1512 01:00:46,368 --> 01:00:48,367 And you might think that it would have something 1513 01:00:48,367 --> 01:00:50,390 to do with the way textures would sound. 
1514 01:00:50,390 --> 01:00:52,670 And so in the context of our auditory model, 1515 01:00:52,670 --> 01:00:55,430 the power spectrum is captured by the average value 1516 01:00:55,430 --> 01:00:56,220 of each envelope. 1517 01:00:56,220 --> 01:00:57,020 Remember, the envelope's telling you 1518 01:00:57,020 --> 01:00:59,480 the instantaneous amplitude, and so if you just average that, 1519 01:00:59,480 --> 01:01:01,979 you're going to find out how much power is in each frequency 1520 01:01:01,979 --> 01:01:03,160 channel. 1521 01:01:03,160 --> 01:01:05,685 So that's how you would do it in this framework. 1522 01:01:05,685 --> 01:01:08,060 So the way this would work, if you get some sound signal, 1523 01:01:08,060 --> 01:01:08,570 say, this-- 1524 01:01:08,570 --> 01:01:11,210 [BUBBLES] 1525 01:01:11,210 --> 01:01:12,810 --you would pass it through the model, 1526 01:01:12,810 --> 01:01:14,810 you'd measure the average value of the envelope, 1527 01:01:14,810 --> 01:01:16,520 so you get a set of 30 numbers. 1528 01:01:16,520 --> 01:01:19,422 Say, if you have 30 bandpass filters there, 1529 01:01:19,422 --> 01:01:20,880 at the output of each one of those, 1530 01:01:20,880 --> 01:01:22,713 you get the average of all of the envelopes, 1531 01:01:22,713 --> 01:01:23,750 so you get 30 numbers. 1532 01:01:23,750 --> 01:01:25,520 And then you take those 30 numbers 1533 01:01:25,520 --> 01:01:27,392 and you want to synthesize the signal, 1534 01:01:27,392 --> 01:01:29,600 subject to the constraint of it having the same value 1535 01:01:29,600 --> 01:01:30,920 for those 30 numbers. 1536 01:01:30,920 --> 01:01:33,104 And so in this case, it's pretty simple to do. 
1537 01:01:33,104 --> 01:01:34,520 So we take a noise signal, we want 1538 01:01:34,520 --> 01:01:35,978 to start out as random as possible, 1539 01:01:35,978 --> 01:01:38,724 so we take a noise signal, we generate its subbands, 1540 01:01:38,724 --> 01:01:40,640 and then we just scale the subbands up or down 1541 01:01:40,640 --> 01:01:42,420 so that they have the right amount of power. 1542 01:01:42,420 --> 01:01:44,750 And we add them back up, and we get a new sound signal. 1543 01:01:44,750 --> 01:01:46,180 And then we listen to it and we see whether it 1544 01:01:46,180 --> 01:01:47,304 sounds like the same thing. 1545 01:01:47,304 --> 01:01:48,114 OK? 1546 01:01:48,114 --> 01:01:49,280 Here's what they sound like. 1547 01:01:49,280 --> 01:01:51,781 And as you will hear, they just sound like noise, basically, 1548 01:01:51,781 --> 01:01:52,280 right? 1549 01:01:52,280 --> 01:01:54,170 So this is supposed to sound like rain-- 1550 01:01:54,170 --> 01:01:56,670 [RAIN] 1551 01:01:56,670 --> 01:01:57,796 --or a stream-- 1552 01:01:57,796 --> 01:01:59,780 [STREAM] 1553 01:01:59,780 --> 01:02:00,824 --or bubbles-- 1554 01:02:00,824 --> 01:02:02,640 [BUBBLES] 1555 01:02:02,640 --> 01:02:03,715 --or fire-- 1556 01:02:03,715 --> 01:02:05,575 [FIRE BURNING] 1557 01:02:05,575 --> 01:02:06,630 --or applause. 1558 01:02:06,630 --> 01:02:09,980 [APPLAUSE] 1559 01:02:09,980 --> 01:02:12,500 So you might notice that they sound different, right? 1560 01:02:12,500 --> 01:02:13,940 And you might have even been able to convince yourself, 1561 01:02:13,940 --> 01:02:16,160 well, that sounds a little bit like applause, right? 1562 01:02:16,160 --> 01:02:17,285 So there's something there. 1563 01:02:17,285 --> 01:02:19,917 But the point is that they don't sound anything like-- 1564 01:02:19,917 --> 01:02:21,378 [APPLAUSE] 1565 01:02:21,378 --> 01:02:22,352 --or-- 1566 01:02:22,352 --> 01:02:24,484 [FIRE BURNING] 1567 01:02:24,484 --> 01:02:25,400 --so on, and so forth. 
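The spectrum-matching synthesis just described can be sketched in a few lines (my own simplified illustration, not the actual synthesis code; it uses disjoint FFT bands rather than overlapping cochlear filters, which makes the matching exact, and the "target" sound is a stand-in): noise is rescaled band by band to take on a target sound's band powers.

```python
import numpy as np

fs, n = 16000, 16384
rng = np.random.default_rng(3)

# Stand-in "target" texture: noise with a strongly tilted (1/f^2-ish) spectrum.
target = np.cumsum(rng.standard_normal(n))
target = target - target.mean()
noise = rng.standard_normal(n)                   # starting point: flat noise

edges = [20, 100, 400, 1600, 6400]               # disjoint band edges in Hz
freqs = np.fft.rfftfreq(n, 1 / fs)
T = np.fft.rfft(target)
X = np.fft.rfft(noise)

for lo, hi in zip(edges[:-1], edges[1:]):
    band = (freqs >= lo) & (freqs < hi)
    power_target = np.sum(np.abs(T[band]) ** 2)
    power_noise = np.sum(np.abs(X[band]) ** 2)
    X[band] = X[band] * np.sqrt(power_target / power_noise)  # impose band power

synth = np.fft.irfft(X, n)
S = np.fft.rfft(synth)
# synth now carries the target's power in every band, but its fine time
# structure is still that of noise -- which is why spectrum-matched
# syntheses "just sound like noise."
```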
1568 01:02:25,400 --> 01:02:25,970 OK? 1569 01:02:25,970 --> 01:02:27,290 All right, so the point is that everything just 1570 01:02:27,290 --> 01:02:28,820 sounds like noise, and so what this 1571 01:02:28,820 --> 01:02:31,130 tells us is that our brains are not simply 1572 01:02:31,130 --> 01:02:34,750 registering the spectrum when we recognize these textures. 1573 01:02:34,750 --> 01:02:37,190 Question is whether additional simple statistics 1574 01:02:37,190 --> 01:02:38,340 will do any better. 1575 01:02:38,340 --> 01:02:40,370 And so we're going to play the same game, right, 1576 01:02:40,370 --> 01:02:43,430 except we have a souped up representation that's got 1577 01:02:43,430 --> 01:02:45,566 all these other things in it. 1578 01:02:45,566 --> 01:02:46,940 And so the consequence of this is 1579 01:02:46,940 --> 01:02:49,620 that the process of synthesizing something 1580 01:02:49,620 --> 01:02:53,360 is less straightforward. 1581 01:02:53,360 --> 01:02:56,780 So I was in Eero Simoncelli's lab at the time, 1582 01:02:56,780 --> 01:02:58,860 and spent a while trying to get this to work, 1583 01:02:58,860 --> 01:03:01,250 and eventually we got it to work. 1584 01:03:01,250 --> 01:03:03,390 But conceptually, the process is the same. 1585 01:03:03,390 --> 01:03:06,320 So you have some original sound recording, rain, 1586 01:03:06,320 --> 01:03:08,750 or what have you, you pass it through your auditory model, 1587 01:03:08,750 --> 01:03:11,420 and then you measure some set of statistics. 1588 01:03:11,420 --> 01:03:13,370 And then you start out with a noise signal, 1589 01:03:13,370 --> 01:03:14,540 you pass it through the same model, 1590 01:03:14,540 --> 01:03:16,956 and then measure its statistics, and those will in general 1591 01:03:16,956 --> 01:03:18,680 be different from the target values. 
1592 01:03:18,680 --> 01:03:20,960 And so you get some error signal here, 1593 01:03:20,960 --> 01:03:23,930 and you use that error signal to perform gradient descent 1594 01:03:23,930 --> 01:03:26,730 on the representation of the noise in the auditory model. 1595 01:03:26,730 --> 01:03:30,140 And so you cause the envelopes of the noise signal 1596 01:03:30,140 --> 01:03:33,290 to change in ways that cause their statistics to move 1597 01:03:33,290 --> 01:03:35,340 towards the target value. 1598 01:03:35,340 --> 01:03:38,050 And there's a procedure here by which this is iterated, 1599 01:03:38,050 --> 01:03:39,800 and I'm not going to get into the details. 1600 01:03:39,800 --> 01:03:41,460 If you want to play around with it, 1601 01:03:41,460 --> 01:03:45,020 there's a toolbox that's now available on the lab website 1602 01:03:45,020 --> 01:03:46,820 if you want to do that, and it's described 1603 01:03:46,820 --> 01:03:48,780 in more detail in that paper. 1604 01:03:48,780 --> 01:03:49,280 All right. 1605 01:03:49,280 --> 01:03:51,457 So the result, the whole point of this, 1606 01:03:51,457 --> 01:03:53,540 is that we get a signal that shares the statistics 1607 01:03:53,540 --> 01:03:54,950 of some real world sound. 1608 01:03:54,950 --> 01:03:55,700 How do they sound? 1609 01:03:55,700 --> 01:03:57,283 And remember, we're interested in this 1610 01:03:57,283 --> 01:04:00,530 because if these statistics, if this candidate representation 1611 01:04:00,530 --> 01:04:02,480 accounts for our perception of texture, well, 1612 01:04:02,480 --> 01:04:05,090 then the synthetic signals ought to sound like new examples 1613 01:04:05,090 --> 01:04:06,150 of the real thing. 1614 01:04:06,150 --> 01:04:08,540 And the cool thing, and rewarding part 1615 01:04:08,540 --> 01:04:11,150 of this whole thing, is that in many cases, they do. 1616 01:04:11,150 --> 01:04:13,546 So I'm just going to play the synthetic versions. 
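The iterative matching procedure can be caricatured with a toy version (my own sketch, far simpler than the actual algorithm, which adjusts subband envelopes to match the full set of texture statistics): here gradient descent on a plain noise signal drives just two statistics, the mean and the variance, to arbitrary target values, using the analytic gradients of the squared statistic errors.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal(4096)          # start from a noise signal

target_mean, target_var = 0.5, 4.0     # arbitrary target statistics
eta = 0.05                             # step size

for _ in range(1000):
    m, v = x.mean(), x.var()
    # Exact gradient of 0.5*(m - target_mean)**2 + 0.5*(v - target_var)**2:
    #   d(mean)/dx_i = 1/N,  d(var)/dx_i = 2*(x_i - m)/N
    grad = (m - target_mean) / x.size + (v - target_var) * 2 * (x - m) / x.size
    x = x - eta * x.size * grad        # descend toward the target statistics
# x now has (almost exactly) the target mean and variance, while remaining
# noise-like in its detailed time structure.
```

The same logic, scaled up to envelope moments, correlations, and modulation power across dozens of channels, is what the synthesis toolbox iterates.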
1617 01:04:13,546 --> 01:04:15,170 All of these were generated from noise, 1618 01:04:15,170 --> 01:04:17,211 just by causing the noise to match the statistics 1619 01:04:17,211 --> 01:04:18,680 of, in this case, rain-- 1620 01:04:18,680 --> 01:04:21,170 [RAIN] 1621 01:04:21,170 --> 01:04:21,700 --stream-- 1622 01:04:21,700 --> 01:04:24,040 [STREAM] 1623 01:04:24,040 --> 01:04:24,550 --bubbles-- 1624 01:04:24,550 --> 01:04:26,710 [BUBBLES] 1625 01:04:26,710 --> 01:04:27,290 --fire-- 1626 01:04:27,290 --> 01:04:29,810 [FIRE BURNING] 1627 01:04:29,810 --> 01:04:30,440 --applause-- 1628 01:04:30,440 --> 01:04:33,180 [APPLAUSE] 1629 01:04:33,180 --> 01:04:34,106 --wind-- 1630 01:04:34,106 --> 01:04:36,090 [WIND BLOWING] 1631 01:04:36,090 --> 01:04:37,082 --insects-- 1632 01:04:37,082 --> 01:04:42,042 [INSECTS] 1633 01:04:42,042 --> 01:04:43,530 --birds, oops-- 1634 01:04:43,530 --> 01:04:47,010 [BIRDS CHIRPING] 1635 01:04:47,010 --> 01:04:48,618 --and crowd noise. 1636 01:04:48,618 --> 01:04:52,442 [CHATTER] 1637 01:04:52,442 --> 01:04:53,727 All right. 1638 01:04:53,727 --> 01:04:55,560 It also works for a lot of unnatural sounds. 1639 01:04:55,560 --> 01:04:56,610 Here's rustling paper-- 1640 01:04:56,610 --> 01:04:59,570 [RUSTLING PAPER] 1641 01:04:59,570 --> 01:05:01,232 --and a jackhammer. 1642 01:05:01,232 --> 01:05:03,480 [JACKHAMMER] 1643 01:05:03,480 --> 01:05:04,020 All right. 1644 01:05:04,020 --> 01:05:06,280 And so the cool thing about this is 1645 01:05:06,280 --> 01:05:08,410 you can put whatever you want in there, right? 1646 01:05:08,410 --> 01:05:09,790 You can measure the statistics from anything, 1647 01:05:09,790 --> 01:05:12,040 and you can generate something that is statistically matched. 1648 01:05:12,040 --> 01:05:13,873 And when you do this with a lot of textures, 1649 01:05:13,873 --> 01:05:16,060 you tend to get something that captures some 1650 01:05:16,060 --> 01:05:18,640 of the qualitative properties. 
1651 01:05:18,640 --> 01:05:20,440 And so the success of this, and this 1652 01:05:20,440 --> 01:05:22,870 is, in this case, the reason why this is scientifically 1653 01:05:22,870 --> 01:05:24,670 interesting in addition to fun, is 1654 01:05:24,670 --> 01:05:27,520 that it lends plausibility to the notion 1655 01:05:27,520 --> 01:05:30,160 that these statistics could underlie the representation 1656 01:05:30,160 --> 01:05:32,760 and recognition of textures.