1 00:00:00,000 --> 00:00:01,948 [SQUEAKING] 2 00:00:01,948 --> 00:00:04,383 [RUSTLING] 3 00:00:04,383 --> 00:00:10,240 [CLICKING] 4 00:00:10,240 --> 00:00:12,880 NANCY KANWISHER: We are turning from our various other topics 5 00:00:12,880 --> 00:00:14,590 to talk about hearing today. 6 00:00:14,590 --> 00:00:18,190 And let's start by thinking about all the cool stuff 7 00:00:18,190 --> 00:00:21,940 that you can do just by listening. 8 00:00:21,940 --> 00:00:25,510 So just by listening, you can identify the scene 9 00:00:25,510 --> 00:00:27,310 that you're in and what's going on 10 00:00:27,310 --> 00:00:30,025 in it, like for example, this. 11 00:00:30,025 --> 00:00:35,887 [AUDIO PLAYBACK] 12 00:00:35,887 --> 00:00:36,470 [END PLAYBACK] 13 00:00:36,470 --> 00:00:38,780 OK, so you know what kind of room you're in and roughly 14 00:00:38,780 --> 00:00:42,680 what's going on, just from that little bit of sound. 15 00:00:42,680 --> 00:00:45,860 You can localize events and people and objects. 16 00:00:45,860 --> 00:00:47,510 So close your eyes, everyone. 17 00:00:47,510 --> 00:00:48,810 Keep them closed. 18 00:00:48,810 --> 00:00:52,130 And if you just listen to me talking, 19 00:00:52,130 --> 00:00:54,650 it's really very vivid, isn't it, 20 00:00:54,650 --> 00:00:58,040 exactly how obvious it is where I am. 21 00:00:58,040 --> 00:01:00,560 And I will refrain from the temptation 22 00:01:00,560 --> 00:01:03,162 of coming up and speaking in somebody's ears 23 00:01:03,162 --> 00:01:04,370 because it's just too creepy. 24 00:01:04,370 --> 00:01:05,540 OK, you can open your eyes. 25 00:01:05,540 --> 00:01:06,620 It's very vivid. 26 00:01:06,620 --> 00:01:10,340 Just from listening, you know where the sound source is. 27 00:01:10,340 --> 00:01:13,700 You can recognize sound sources, so for example, 28 00:01:13,700 --> 00:01:14,870 sounds like this-- 29 00:01:14,870 --> 00:01:18,072 [GLASS BREAKING] 30 00:01:18,072 --> 00:01:19,280 You know what happened there. 
31 00:01:19,280 --> 00:01:21,260 It's a whole vivid event just unfolded there 32 00:01:21,260 --> 00:01:24,860 in a whole second and a half, or a random series 33 00:01:24,860 --> 00:01:25,753 of sounds like this. 34 00:01:25,753 --> 00:01:26,420 [AUDIO PLAYBACK] 35 00:01:26,420 --> 00:01:29,729 - It's supposed to either rain or snow. 36 00:01:29,729 --> 00:01:34,776 [RANDOM SOUNDS] 37 00:01:34,776 --> 00:01:37,531 - Hannah is good at compromising. 38 00:01:37,531 --> 00:01:41,950 [RANDOM SOUNDS] 39 00:01:41,950 --> 00:01:44,337 [LAUGHTER] 40 00:01:44,337 --> 00:01:44,920 [END PLAYBACK] 41 00:01:44,920 --> 00:01:45,910 NANCY KANWISHER: Anyway, every one of those 42 00:01:45,910 --> 00:01:47,635 sounds you immediately recognize. 43 00:01:47,635 --> 00:01:50,680 You know exactly what it is. 44 00:01:50,680 --> 00:01:52,570 And that's environmental sounds, things 45 00:01:52,570 --> 00:01:55,540 that happen outdoors, speech, what is being said, 46 00:01:55,540 --> 00:01:57,700 voices, who is saying it. 47 00:01:57,700 --> 00:02:00,320 If you don't know the person, if they're male or female, 48 00:02:00,320 --> 00:02:02,140 young or old, much like faces-- 49 00:02:02,140 --> 00:02:05,980 if you know them, you'll recognize them pretty fast. 50 00:02:05,980 --> 00:02:09,449 You can selectively attend to one sound among others. 51 00:02:09,449 --> 00:02:14,460 Like if you had a little, hidden earphone that I didn't see, 52 00:02:14,460 --> 00:02:16,543 and you wanted to listen to your favorite podcast, 53 00:02:16,543 --> 00:02:18,293 you could listen to that occasionally when 54 00:02:18,293 --> 00:02:19,170 I was getting boring. 55 00:02:19,170 --> 00:02:21,690 And then you could turn back and listen to me. 56 00:02:21,690 --> 00:02:23,490 And you could just selectively choose 57 00:02:23,490 --> 00:02:27,870 which of those different audio inputs to listen to. 
58 00:02:27,870 --> 00:02:30,810 And we'll talk more in a moment about this classic problem 59 00:02:30,810 --> 00:02:33,570 in hearing, which is known as the "cocktail party effect." 60 00:02:33,570 --> 00:02:36,978 I guess it was named in the '50s when cocktail parties were big. 61 00:02:36,978 --> 00:02:38,520 And it consists in the fact that when 62 00:02:38,520 --> 00:02:41,010 there are multiple sound sources, such as many people 63 00:02:41,010 --> 00:02:44,370 talking in a room, you can tune in one channel 64 00:02:44,370 --> 00:02:46,620 and then tune in another channel. 65 00:02:46,620 --> 00:02:48,570 And you can just selectively attend 66 00:02:48,570 --> 00:02:51,000 to one of many sound sources, even 67 00:02:51,000 --> 00:02:54,270 though those sound sources are massively overlapping on top 68 00:02:54,270 --> 00:02:55,590 of each other in the input. 69 00:02:55,590 --> 00:02:57,660 And it's a big computational challenge, 70 00:02:57,660 --> 00:03:00,960 as we'll talk about shortly, to do that. 71 00:03:00,960 --> 00:03:03,610 You can enjoy music. 72 00:03:03,610 --> 00:03:05,898 And you can determine what things are made of. 73 00:03:05,898 --> 00:03:08,440 So close your eyes and I'm going to drop things on the table. 74 00:03:08,440 --> 00:03:08,940 Don't look. 75 00:03:08,940 --> 00:03:10,648 I'm going to do various things and you're 76 00:03:10,648 --> 00:03:11,710 going to identify them. 77 00:03:14,290 --> 00:03:17,000 So let's see-- don't open your eyes. 78 00:03:17,000 --> 00:03:19,640 See if you can tell what's being dropped on the table, 79 00:03:19,640 --> 00:03:20,890 or at least what it's made of. 80 00:03:20,890 --> 00:03:21,950 Close your eyes. 81 00:03:21,950 --> 00:03:23,750 That's cheating. 82 00:03:23,750 --> 00:03:24,720 Wood, exactly. 83 00:03:24,720 --> 00:03:25,220 Very good. 84 00:03:25,220 --> 00:03:26,120 OK, what is this? 85 00:03:26,120 --> 00:03:27,260 Keep your eyes closed. 
86 00:03:27,260 --> 00:03:30,180 What is this made of? 87 00:03:30,180 --> 00:03:30,680 - Plastic. 88 00:03:30,680 --> 00:03:31,310 NANCY KANWISHER: Yeah, good. 89 00:03:31,310 --> 00:03:32,240 Keep your eyes closed. 90 00:03:32,240 --> 00:03:34,730 What is this made of? 91 00:03:34,730 --> 00:03:36,843 STUDENT: [INAUDIBLE] 92 00:03:36,843 --> 00:03:37,760 NANCY KANWISHER: Yeah. 93 00:03:37,760 --> 00:03:39,080 OK, keep your eyes closed. 94 00:03:39,080 --> 00:03:42,044 What's this made of? 95 00:03:42,044 --> 00:03:42,990 STUDENT: [INAUDIBLE]. 96 00:03:42,990 --> 00:03:43,410 NANCY KANWISHER: Awesome. 97 00:03:43,410 --> 00:03:44,790 OK, you can open your eyes. 98 00:03:44,790 --> 00:03:45,420 Perfect. 99 00:03:45,420 --> 00:03:47,050 You guys are awesome. 100 00:03:47,050 --> 00:03:49,530 I just dropped these objects that I found from my kitchen 101 00:03:49,530 --> 00:03:52,740 this morning and you guys could tell what they're made of. 102 00:03:52,740 --> 00:03:55,860 That's amazing. 103 00:03:55,860 --> 00:03:59,160 OK, all of this that you guys just did 104 00:03:59,160 --> 00:04:02,125 happens from the simplest possible signal. 105 00:04:02,125 --> 00:04:04,500 We'll talk about what that signal is exactly in a moment, 106 00:04:04,500 --> 00:04:07,870 but it's just sound compression coming through the air. 107 00:04:07,870 --> 00:04:11,140 And it tells you all this rich stuff about your environment. 108 00:04:11,140 --> 00:04:13,540 So the question is, how do we do that? 109 00:04:13,540 --> 00:04:15,750 And the first question is, how do we 110 00:04:15,750 --> 00:04:17,970 start to think about how hearing works, how 111 00:04:17,970 --> 00:04:19,500 you're able to do all of that? 
112 00:04:19,500 --> 00:04:23,400 And you guys know we start with computational theory-- 113 00:04:23,400 --> 00:04:26,850 considering what the inputs are, what the outputs are, 114 00:04:26,850 --> 00:04:30,180 the physics of sound, what would be involved 115 00:04:30,180 --> 00:04:33,090 if we tried to code up a machine to take those audio inputs 116 00:04:33,090 --> 00:04:35,070 and deliver the output that you guys all 117 00:04:35,070 --> 00:04:37,860 just delivered with no trouble whatsoever. 118 00:04:37,860 --> 00:04:39,600 What cues are in the stimulus? 119 00:04:39,600 --> 00:04:41,760 What are the key computational challenges? 120 00:04:41,760 --> 00:04:45,222 And what makes those aspects of hearing challenging? 121 00:04:45,222 --> 00:04:46,680 And then after we do all that stuff 122 00:04:46,680 --> 00:04:48,480 at the level of computational theory, 123 00:04:48,480 --> 00:04:51,170 we can, of course, study hearing in other ways, 124 00:04:51,170 --> 00:04:52,420 like studying it behaviorally. 125 00:04:52,420 --> 00:04:53,850 What can people do and not do? 126 00:04:53,850 --> 00:04:54,360 What's hard? 127 00:04:54,360 --> 00:04:56,040 What's less hard? 128 00:04:56,040 --> 00:04:57,880 And we can measure neural responses. 129 00:04:57,880 --> 00:04:59,800 So we'll talk about all of that. 130 00:04:59,800 --> 00:05:03,900 But let's start with a little more on what sound is. 131 00:05:03,900 --> 00:05:07,620 So sound is just a single univariate signal 132 00:05:07,620 --> 00:05:08,837 coming into the ears. 133 00:05:08,837 --> 00:05:10,420 We'll say more about that in a second, 134 00:05:10,420 --> 00:05:12,900 but it's really, really simple. 135 00:05:12,900 --> 00:05:16,950 And from that, you get all this rich experience. 
136 00:05:16,950 --> 00:05:18,480 And so the question is, what goes 137 00:05:18,480 --> 00:05:20,700 on in that magic box in the middle 138 00:05:20,700 --> 00:05:23,250 to enable you to extract this kind of information 139 00:05:23,250 --> 00:05:26,070 from this really simple signal? 140 00:05:26,070 --> 00:05:28,470 So let's start with what is sound. 141 00:05:28,470 --> 00:05:32,640 Sound is just a set of longitudinal compressions 142 00:05:32,640 --> 00:05:36,600 and decompressions of the air coming from the source 143 00:05:36,600 --> 00:05:38,580 into your ear. 144 00:05:38,580 --> 00:05:41,340 So these waves travel from the source 145 00:05:41,340 --> 00:05:44,590 to the ear in little waves of compression 146 00:05:44,590 --> 00:05:46,950 where the air is just compressed, 147 00:05:46,950 --> 00:05:49,890 and rarefaction where the air is spread out. 148 00:05:52,530 --> 00:05:55,050 And just to give you a sense of how physical sound is, 149 00:05:55,050 --> 00:05:56,430 there's a silly video here. 150 00:05:56,430 --> 00:05:58,800 It's a speaker in a sink with a bunch of paint. 151 00:05:58,800 --> 00:06:01,320 And you can just see that the movement of the speaker-- 152 00:06:01,320 --> 00:06:04,230 normally, it makes those compressions and rarefactions 153 00:06:04,230 --> 00:06:05,860 of air, but if you stick paint on it, 154 00:06:05,860 --> 00:06:08,190 it's going to shove the paint up in the air, 155 00:06:08,190 --> 00:06:12,000 too, just to show you how physical it is. 156 00:06:12,000 --> 00:06:15,030 There's something called Schlieren photography, 157 00:06:15,030 --> 00:06:17,850 which is totally cool, and which is a way to visualize those 158 00:06:17,850 --> 00:06:20,603 compressions of the air to show you what's-- 159 00:06:20,603 --> 00:06:21,270 [VIDEO PLAYBACK] 160 00:06:21,270 --> 00:06:22,103 [INTERPOSING VOICES] 161 00:06:22,103 --> 00:06:23,910 - --use it to study aerodynamic flow. 
162 00:06:23,910 --> 00:06:27,390 And sound-- well, that's just another change in air density, 163 00:06:27,390 --> 00:06:29,250 a traveling compression wave. 164 00:06:29,250 --> 00:06:32,520 So Schlieren visualization, along with a high-speed camera, 165 00:06:32,520 --> 00:06:35,530 can be used to see it as well. 166 00:06:35,530 --> 00:06:38,430 Here's a book landing on a table, 167 00:06:38,430 --> 00:06:41,640 the end of a towel being snapped, 168 00:06:41,640 --> 00:06:49,260 a firecracker, an AK-47, and of course, a clap. 169 00:06:52,260 --> 00:06:53,160 [END PLAYBACK] 170 00:06:53,160 --> 00:06:55,440 NANCY KANWISHER: OK, so just compressions of air 171 00:06:55,440 --> 00:07:00,000 traveling from the source to your ears-- that's all it is. 172 00:07:00,000 --> 00:07:03,030 So natural sounds happen at lots of different frequencies. 173 00:07:03,030 --> 00:07:05,760 And one of the ways we describe sounds 174 00:07:05,760 --> 00:07:08,290 is by looking at those frequencies. 175 00:07:08,290 --> 00:07:12,820 So there's an awesome website that is here on your slides. 176 00:07:12,820 --> 00:07:14,160 You can play with it offline. 177 00:07:14,160 --> 00:07:17,040 But meanwhile, we're going to play with it a little bit right 178 00:07:17,040 --> 00:07:22,283 now because it is so cool. 179 00:07:22,283 --> 00:07:23,700 So what we're going to do is we're 180 00:07:23,700 --> 00:07:26,640 going to look at spectrograms of different sounds. 181 00:07:26,640 --> 00:07:28,710 Let's start with a person whistling. 182 00:07:28,710 --> 00:07:32,340 [WHISTLING] 183 00:07:32,340 --> 00:07:35,010 OK, so frequency is on this axis, 184 00:07:35,010 --> 00:07:38,070 higher frequencies up here, lower frequencies there. 185 00:07:38,070 --> 00:07:40,276 And it's going by in time. 186 00:07:40,276 --> 00:07:45,080 [WHISTLING] 187 00:07:45,080 --> 00:07:48,530 So whistling is unusual in that it's pretty much 188 00:07:48,530 --> 00:07:50,630 a single frequency at a time. 
189 00:07:50,630 --> 00:07:52,630 Many natural sounds are not like that. 190 00:07:52,630 --> 00:07:55,370 So you see not single, but a small, narrow band 191 00:07:55,370 --> 00:07:56,540 of frequencies at a time. 192 00:07:56,540 --> 00:07:58,700 OK, that's enough. 193 00:07:58,700 --> 00:08:00,260 [WHISTLING] 194 00:08:00,260 --> 00:08:01,310 Stop. 195 00:08:01,310 --> 00:08:01,910 All right. 196 00:08:01,910 --> 00:08:02,759 OK. 197 00:08:02,759 --> 00:08:10,830 [TROMBONE PLAYING] 198 00:08:10,830 --> 00:08:13,650 OK, so you see how with the trombone, 199 00:08:13,650 --> 00:08:16,770 there were many different bands of frequencies. 200 00:08:16,770 --> 00:08:18,940 In contrast-- this is me talking, by the way. 201 00:08:18,940 --> 00:08:20,950 We'll talk about that in a second. 202 00:08:20,950 --> 00:08:24,420 But with the whistling, you saw just a single band at a time. 203 00:08:24,420 --> 00:08:27,150 With the trombone, it has all of these harmonics, 204 00:08:27,150 --> 00:08:32,850 these parallel lines of multiples of frequencies. 205 00:08:32,850 --> 00:08:34,659 Those are called "pitched sounds." 206 00:08:34,659 --> 00:08:37,260 Sounds that have a pitch where you could sing back the tune 207 00:08:37,260 --> 00:08:40,605 have those bands of frequencies like that. 208 00:08:40,605 --> 00:08:42,480 And so that's what you see with the trombone. 209 00:08:42,480 --> 00:08:45,270 You see a little bit of this with natural speech here. 210 00:08:45,270 --> 00:08:48,030 You can see sets of bands, but mostly, you 211 00:08:48,030 --> 00:08:49,530 see vertical stripes. 212 00:08:49,530 --> 00:08:52,590 That's because I'm talking fast and mostly what's coming out 213 00:08:52,590 --> 00:08:53,640 is consonants. 214 00:08:53,640 --> 00:09:01,260 If I slowed down and stretched out the vowels, 215 00:09:01,260 --> 00:09:08,537 you would see more of those harmonics. 216 00:09:08,537 --> 00:09:09,120 Fun and games. 
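The contrast between the whistle (one frequency at a time) and the trombone (parallel bands at multiples of a frequency) can be reproduced with synthetic tones. The following is a minimal sketch, not the code behind the website shown in the lecture. The sample rate, the 500 Hz "whistle", the 100 Hz fundamental, and the function names are all illustrative assumptions.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Naive discrete Fourier transform; magnitudes for bins 0..N/2, scaled by 1/N."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(signal))
        mags.append(abs(acc) / n)
    return mags


def strong_frequencies(signal, sample_rate, threshold=0.05):
    """Frequencies (Hz) whose scaled magnitude exceeds the threshold."""
    n = len(signal)
    return [k * sample_rate / n
            for k, m in enumerate(dft_magnitudes(signal)) if m > threshold]


SAMPLE_RATE = 2000          # samples per second (assumed)
N = 400                     # 0.2 s of signal, so bins are 5 Hz apart
times = [i / SAMPLE_RATE for i in range(N)]

# "Whistle": a single sinusoid at 500 Hz (made-up value).
whistle = [math.sin(2 * math.pi * 500 * t) for t in times]

# "Pitched" tone: fundamental at 100 Hz plus harmonics at 2x, 3x, 4x,
# each weaker than the last (made-up amplitudes).
F0 = 100
pitched = [sum(math.sin(2 * math.pi * F0 * h * t) / h for h in (1, 2, 3, 4))
           for t in times]

whistle_peaks = strong_frequencies(whistle, SAMPLE_RATE)
pitched_peaks = strong_frequencies(pitched, SAMPLE_RATE)
print(whistle_peaks)   # one frequency
print(pitched_peaks)   # the fundamental and its integer multiples
```

The whistle-like tone yields a single strong frequency, while the pitched tone yields the fundamental plus its integer multiples, which is what appears as stacked parallel lines on a spectrogram.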
217 00:09:11,640 --> 00:09:13,430 OK, so that's what sound looks like. 218 00:09:13,430 --> 00:09:15,097 So everybody has to have a sense of this 219 00:09:15,097 --> 00:09:17,790 is showing you the energy at each frequency over time 220 00:09:17,790 --> 00:09:19,590 in response to natural speech. 221 00:09:19,590 --> 00:09:23,130 We'll play with this a little bit more later in the lecture. 222 00:09:31,360 --> 00:09:32,830 So we did all that. 223 00:09:32,830 --> 00:09:34,570 We'll do some of that other stuff later. 224 00:09:34,570 --> 00:09:37,240 So now that we have some sense of what sound is 225 00:09:37,240 --> 00:09:40,630 and what that input is, how are we 226 00:09:40,630 --> 00:09:43,840 going to think about how to extract information from it? 227 00:09:43,840 --> 00:09:45,970 What we want to do is think about how is it? 228 00:09:45,970 --> 00:09:48,910 Why is it challenging to get to that from this? 229 00:09:53,140 --> 00:09:55,360 There are several reasons that's challenging. 230 00:09:55,360 --> 00:09:58,120 First is invariance problems, much like we've 231 00:09:58,120 --> 00:10:00,910 discussed in the domain of vision and other domains 232 00:10:00,910 --> 00:10:03,980 already in this class. 233 00:10:03,980 --> 00:10:06,790 And so the way to think about that here is that a given sound 234 00:10:06,790 --> 00:10:10,930 source sounds really different in different situations. 235 00:10:10,930 --> 00:10:13,900 So if we have different people saying the same word, 236 00:10:13,900 --> 00:10:16,660 that will look very different on those spectrograms. 237 00:10:16,660 --> 00:10:18,250 The stimulus is actually different, 238 00:10:18,250 --> 00:10:21,730 even though we want to just know what word is being said. 239 00:10:21,730 --> 00:10:24,730 And conversely, if we have the same person 240 00:10:24,730 --> 00:10:27,620 saying two different words, that will look really different. 
241 00:10:27,620 --> 00:10:29,990 And even if we want to know just who's speaking, 242 00:10:29,990 --> 00:10:32,740 we have to deal with the invariance of generalizing 243 00:10:32,740 --> 00:10:35,680 across those very different ways, very different 244 00:10:35,680 --> 00:10:38,410 sounds that they produce when they say different things. 245 00:10:38,410 --> 00:10:40,570 So those are kind of flips of each other. 246 00:10:40,570 --> 00:10:43,840 To recognize voices, we want invariance of the voice 247 00:10:43,840 --> 00:10:45,130 with respect to the words. 248 00:10:45,130 --> 00:10:47,830 To recognize words, we want invariance for the words 249 00:10:47,830 --> 00:10:49,690 independent of the voice. 250 00:10:49,690 --> 00:10:51,385 And those are all tied up together. 251 00:10:54,040 --> 00:10:56,530 So we need to appreciate the sameness of those stimuli 252 00:10:56,530 --> 00:10:59,620 across those changes. 253 00:10:59,620 --> 00:11:01,777 Here's another reason that hearing is challenging-- 254 00:11:01,777 --> 00:11:02,860 I mentioned this briefly-- 255 00:11:05,890 --> 00:11:08,387 in normal situations-- it's pretty quiet in the room. 256 00:11:08,387 --> 00:11:09,970 There's some background noise, but not 257 00:11:09,970 --> 00:11:11,720 a whole lot of other noise, so it's mostly 258 00:11:11,720 --> 00:11:13,900 just me making the noise in here. 259 00:11:13,900 --> 00:11:17,600 But in many situations, there are multiple sound sources. 260 00:11:17,600 --> 00:11:19,090 For example, listen to this. 261 00:11:19,090 --> 00:11:20,050 [AUDIO PLAYBACK] 262 00:11:20,050 --> 00:11:21,280 [INTERPOSING VOICES] 263 00:11:21,280 --> 00:11:23,320 - All right, Debbie Whittaker, Sterling James, 264 00:11:23,320 --> 00:11:24,160 wrapping things up. 265 00:11:24,160 --> 00:11:24,850 [END PLAYBACK] 266 00:11:24,850 --> 00:11:27,058 NANCY KANWISHER: OK, little segment of radio, there's 267 00:11:27,058 --> 00:11:28,960 music, and a person speaking both at once. 
268 00:11:28,960 --> 00:11:30,970 And you had no problem hearing what 269 00:11:30,970 --> 00:11:32,680 the person was saying and knowing 270 00:11:32,680 --> 00:11:35,830 something about the gender and age of that person. 271 00:11:35,830 --> 00:11:38,080 You recognize the voice, the content of the speech, 272 00:11:38,080 --> 00:11:41,410 even though the music is right on top of it. 273 00:11:41,410 --> 00:11:45,040 So the music might be like this and the speech like that. 274 00:11:45,040 --> 00:11:47,770 And what you hear is this, with those things 275 00:11:47,770 --> 00:11:50,590 right on top of each other. 276 00:11:50,590 --> 00:11:53,200 So you need to go backwards to hear these things, 277 00:11:53,200 --> 00:11:55,962 even though that's all you get. 278 00:11:55,962 --> 00:11:57,670 Everybody see how that's a big challenge? 279 00:11:57,670 --> 00:12:00,430 If you had to write the code to take this and recover 280 00:12:00,430 --> 00:12:02,860 that, best of luck to you. 281 00:12:02,860 --> 00:12:04,630 Yeah, question? 282 00:12:04,630 --> 00:12:07,480 STUDENT: How does intensity or volume come into this picture 283 00:12:07,480 --> 00:12:08,923 again? 284 00:12:08,923 --> 00:12:10,840 NANCY KANWISHER: It's not really well depicted 285 00:12:10,840 --> 00:12:11,890 on these diagrams. 286 00:12:11,890 --> 00:12:14,470 This is just showing you the entire source. 287 00:12:14,470 --> 00:12:17,950 So the intensity I showed before essentially takes this and does 288 00:12:17,950 --> 00:12:20,860 a Fourier analysis of it so that it gives you the energy at each 289 00:12:20,860 --> 00:12:22,370 of those frequencies. 290 00:12:22,370 --> 00:12:24,370 So you could just do a Fourier analysis on this 291 00:12:24,370 --> 00:12:25,770 and you get a spectrogram. 
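The answer to the student's question, that a Fourier analysis of the waveform gives the energy at each frequency, can be sketched as a short-time analysis: cut the waveform into frames and compute the energy per frequency in each frame. This is a simplified sketch under stated assumptions. Real spectrogram tools typically use overlapping, windowed frames, and the sample rate, frame length, and test tones here are made up for illustration.

```python
import cmath
import math


def spectrogram(signal, frame_len):
    """Energy per frequency bin for each non-overlapping frame of the signal."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        energies = []
        for k in range(frame_len // 2 + 1):
            acc = sum(x * cmath.exp(-2j * math.pi * k * t / frame_len)
                      for t, x in enumerate(frame))
            energies.append(abs(acc) ** 2)   # energy = squared magnitude
        frames.append(energies)
    return frames


SAMPLE_RATE = 1000   # samples per second (assumed)
FRAME_LEN = 100      # 0.1 s per frame, so bins are 10 Hz apart

# A tone that jumps from 100 Hz to 300 Hz halfway through (made-up values).
signal = ([math.sin(2 * math.pi * 100 * i / SAMPLE_RATE) for i in range(200)]
          + [math.sin(2 * math.pi * 300 * i / SAMPLE_RATE) for i in range(200)])

spec = spectrogram(signal, FRAME_LEN)

# Frequency with the most energy in each frame.
dominant = [max(range(len(f)), key=f.__getitem__) * SAMPLE_RATE / FRAME_LEN
            for f in spec]
print(dominant)

# Intensity shows up as the size of the energy values:
# doubling the amplitude quadruples the energy at the peak.
loud = spectrogram([2 * x for x in signal], FRAME_LEN)
print(loud[0][10] / spec[0][10])
```

Each row of `spec` is one time slice and each column one frequency, so plotting it with time on one axis and frequency on the other, shaded by energy, gives the kind of picture shown in the lecture.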
292 00:12:29,460 --> 00:12:31,530 So the listener's usually interested 293 00:12:31,530 --> 00:12:34,200 in individual sources even though they're 294 00:12:34,200 --> 00:12:36,220 superimposed on other sources. 295 00:12:36,220 --> 00:12:37,720 And that's a real problem. 296 00:12:37,720 --> 00:12:38,970 So this is the input. 297 00:12:38,970 --> 00:12:43,890 They get added together, and the brain has to pull them apart. 298 00:12:43,890 --> 00:12:46,770 So this is a classic, ill-posed problem. 299 00:12:46,770 --> 00:12:52,140 That means just given this, we have no way 300 00:12:52,140 --> 00:12:55,260 to go backwards to that if that's all we have, 301 00:12:55,260 --> 00:12:58,170 because there's multiple possible solutions. 302 00:12:58,170 --> 00:13:01,080 It's like saying, "x plus y equals 9, 303 00:13:01,080 --> 00:13:03,030 now solve for x and y." 304 00:13:03,030 --> 00:13:05,550 And whenever we're in that situation 305 00:13:05,550 --> 00:13:09,060 of an ill-posed problem with multiple possible solutions, 306 00:13:09,060 --> 00:13:12,810 only one of which is right in any situation in the world, 307 00:13:12,810 --> 00:13:15,900 the usual answer is that we need to bring 308 00:13:15,900 --> 00:13:18,720 in some other assumptions or world knowledge or something, 309 00:13:18,720 --> 00:13:21,960 to constrain that problem and narrow that large, usually 310 00:13:21,960 --> 00:13:23,850 infinite, space of possible answers 311 00:13:23,850 --> 00:13:26,820 down to the one correct one. 312 00:13:26,820 --> 00:13:28,770 So this is a classic problem that people 313 00:13:28,770 --> 00:13:31,350 have talked about in audition for many decades. 314 00:13:31,350 --> 00:13:33,840 Josh McDermott in this department does a lot of work 315 00:13:33,840 --> 00:13:35,220 on it. 316 00:13:35,220 --> 00:13:38,460 And you can solve it in part by knowledge 317 00:13:38,460 --> 00:13:43,770 of natural sounds, which I won't talk about in detail here. 
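The "x plus y equals 9" analogy can be made concrete. In the sketch below the candidate values are restricted to the integers 0 through 9 so the solutions can be listed (over the real numbers there are infinitely many). The extra constraint is a hypothetical stand-in for prior knowledge about the sources, not a claim about how the auditory system actually does it.

```python
def decompositions(total, candidates):
    """All (x, y) pairs drawn from candidates with x + y == total."""
    return [(x, y) for x in candidates for y in candidates if x + y == total]


# The mixture alone: many source pairs are consistent with the same sum.
all_solutions = decompositions(9, range(10))
print(len(all_solutions), all_solutions)

# Hypothetical prior knowledge about the sources (made up for illustration):
# x is always a multiple of 4 and y is always a multiple of 5.
constrained = [(x, y) for x, y in all_solutions
               if x % 4 == 0 and y % 5 == 0]
print(constrained)
```

With the sum alone, ten pairs fit. With the added assumptions about what each source can be, only one pair remains, which is the sense in which assumptions or world knowledge narrow an ill-posed problem down to a single answer.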
318 00:13:43,770 --> 00:13:48,180 One more challenge for solving problems in audition 319 00:13:48,180 --> 00:13:49,860 comes from the fact that real world 320 00:13:49,860 --> 00:13:53,310 sounds, including the sound of my voice right now, 321 00:13:53,310 --> 00:13:54,330 have reverb. 322 00:13:54,330 --> 00:13:56,490 So "reverb" means-- this is an aerial view. 323 00:13:56,490 --> 00:13:59,530 That's a person, kind of hard to see in an aerial view. 324 00:13:59,530 --> 00:14:01,020 And that's a sound source. 325 00:14:01,020 --> 00:14:04,260 And some of the sound comes straight from the sound source 326 00:14:04,260 --> 00:14:06,030 to the person's ears. 327 00:14:06,030 --> 00:14:09,900 But a lot of the sound goes and ricochets off the walls, 328 00:14:09,900 --> 00:14:12,630 god knows how many times, before it hits the ears. 329 00:14:12,630 --> 00:14:15,360 And all of those different paths of sound 330 00:14:15,360 --> 00:14:17,730 are all kind of superimposed at the ears. 331 00:14:17,730 --> 00:14:19,590 And they arrive at different times, 332 00:14:19,590 --> 00:14:22,290 making a hell of a mess of the input sound. 333 00:14:22,290 --> 00:14:25,590 So instead of that nice, clean, straightforward input, 334 00:14:25,590 --> 00:14:28,170 you have the input plus a slightly delayed input, 335 00:14:28,170 --> 00:14:31,140 a more delayed input, another delayed input, all superimposed 336 00:14:31,140 --> 00:14:32,220 on top of each other. 337 00:14:32,220 --> 00:14:33,810 That's reverb. 338 00:14:33,810 --> 00:14:35,550 Is that clear, what the problem is? 339 00:14:35,550 --> 00:14:38,400 So now we have this really messed-up signal 340 00:14:38,400 --> 00:14:40,860 that we're trying to go backwards and understand 341 00:14:40,860 --> 00:14:42,610 what the input is. 342 00:14:42,610 --> 00:14:44,080 So I'll give you an example. 343 00:14:44,080 --> 00:14:46,770 This is a recording of what's known as "dry speech." 
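The description of reverb as the input plus several delayed copies, all superimposed, can be sketched directly. The delays and gains below are made-up values, and `add_reverb` is a hypothetical helper name. A real room produces a very large number of reflections rather than two.

```python
def add_reverb(dry, echoes):
    """Return the direct sound plus delayed, attenuated copies of it.

    echoes is a list of (delay_in_samples, gain) pairs, one per reflection path.
    """
    max_delay = max((delay for delay, _ in echoes), default=0)
    wet = [0.0] * (len(dry) + max_delay)
    for i, x in enumerate(dry):
        wet[i] += x                      # direct path from source to ear
    for delay, gain in echoes:
        for i, x in enumerate(dry):
            wet[i + delay] += gain * x   # reflected path arrives later and weaker
    return wet


dry = [1.0, 0.0, -1.0, 0.0]              # a tiny "dry" waveform
echoes = [(1, 0.5), (3, 0.25)]           # two reflections (made-up values)
wet = add_reverb(dry, echoes)
print(wet)
```

Even in this tiny case the later samples of the result mix the original waveform with its own echoes, so the values at the ear no longer match the dry signal sample for sample.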
344 00:14:46,770 --> 00:14:48,450 That means speech with no reverb. 345 00:14:48,450 --> 00:14:49,398 Sorry, question? 346 00:14:49,398 --> 00:14:51,690 STUDENT: I'm just having a little trouble understanding 347 00:14:51,690 --> 00:14:54,210 why reverb poses a problem. 348 00:14:54,210 --> 00:14:57,870 The stimulus isn't changing, it's just delayed over time. 349 00:14:57,870 --> 00:14:59,460 NANCY KANWISHER: Yeah, OK. 350 00:14:59,460 --> 00:15:01,418 Let's do a vision example. 351 00:15:01,418 --> 00:15:03,460 This is a little crazy, but let me just try this. 352 00:15:03,460 --> 00:15:05,842 Suppose we had a photograph of my face 353 00:15:05,842 --> 00:15:07,050 and you have to recognize it. 354 00:15:07,050 --> 00:15:07,560 OK, fine. 355 00:15:07,560 --> 00:15:09,580 Various visual algorithms can do that. 356 00:15:09,580 --> 00:15:12,360 But now suppose we took that photograph and we moved it 357 00:15:12,360 --> 00:15:16,207 over 10%, and we superimposed it and added them together, 358 00:15:16,207 --> 00:15:18,540 and then we moved it over again and added them together, 359 00:15:18,540 --> 00:15:20,457 and moved it over again and add them together. 360 00:15:20,457 --> 00:15:22,710 Pretty soon you have a blurry mess. 361 00:15:22,710 --> 00:15:24,900 And those things are all on top of each other, 362 00:15:24,900 --> 00:15:28,372 just as two people talking at once are on top of each other. 363 00:15:28,372 --> 00:15:30,330 And so you have a real problem going backwards. 364 00:15:30,330 --> 00:15:31,470 Does that make sense? 365 00:15:31,470 --> 00:15:32,400 OK. 366 00:15:32,400 --> 00:15:34,883 OK, so here's dry speech with no reverb. 367 00:15:34,883 --> 00:15:35,550 [AUDIO PLAYBACK] 368 00:15:35,550 --> 00:15:37,170 - They ate the lemon pie. 369 00:15:37,170 --> 00:15:38,790 Father forgot the bread. 
370 00:15:38,790 --> 00:15:39,390 [END PLAYBACK] 371 00:15:39,390 --> 00:15:41,182 NANCY KANWISHER: OK, here's the same speech 372 00:15:41,182 --> 00:15:42,220 but with lots of reverb. 373 00:15:42,220 --> 00:15:43,830 - They ate the lemon pie. 374 00:15:43,830 --> 00:15:45,184 Father forgot the bread. 375 00:15:45,184 --> 00:15:46,150 [END PLAYBACK] 376 00:15:46,150 --> 00:15:47,733 NANCY KANWISHER: OK, now you can still 377 00:15:47,733 --> 00:15:49,840 hear it because your auditory system knows 378 00:15:49,840 --> 00:15:52,120 how to solve this problem. 379 00:15:52,120 --> 00:15:54,160 But look what happens to the spectrogram. 380 00:15:54,160 --> 00:15:57,040 Here-- this is time this way, frequency this way, 381 00:15:57,040 --> 00:15:59,740 and the dark bits are where the energy is, where the power is. 382 00:15:59,740 --> 00:16:03,310 In the dry speech, you see all these nice, vertical things, 383 00:16:03,310 --> 00:16:06,910 and here you see a blurry mess. 384 00:16:06,910 --> 00:16:09,160 Nonetheless, you can hear it fine. 385 00:16:09,160 --> 00:16:12,280 And further, what else could you tell from the reverb? 386 00:16:16,890 --> 00:16:18,140 STUDENT: The size of the room. 387 00:16:18,140 --> 00:16:21,350 NANCY KANWISHER: Yeah, it's in a cathedral or something, right? 388 00:16:21,350 --> 00:16:25,370 So it's not just that it causes a problem. 389 00:16:25,370 --> 00:16:27,920 Reverb also tells us something about the location 390 00:16:27,920 --> 00:16:30,180 we're in, if we know how to extract it, 391 00:16:30,180 --> 00:16:32,790 which you guys' visual-- 392 00:16:32,790 --> 00:16:34,310 auditory systems do. 393 00:16:34,310 --> 00:16:37,070 You can see I'm a vision scientist. 394 00:16:37,070 --> 00:16:39,260 So how to study this? 395 00:16:39,260 --> 00:16:41,390 There's a very beautiful paper that Josh McDermott 396 00:16:41,390 --> 00:16:42,660 published a few years ago. 
397 00:16:42,660 --> 00:16:44,300 And I'm going to try to give you the gist of the paper 398 00:16:44,300 --> 00:16:45,770 without all the technical details, 399 00:16:45,770 --> 00:16:47,810 because I think it's just brilliant. 400 00:16:47,810 --> 00:16:52,520 So they wanted to characterize what exactly is reverb. 401 00:16:52,520 --> 00:16:54,800 And reverb is going to vary for different sounds. 402 00:16:54,800 --> 00:16:57,195 You heard the reverb in that cathedral-like space. 403 00:16:57,195 --> 00:16:59,570 That's very different from the reverb in this room, which 404 00:16:59,570 --> 00:17:00,230 also happens. 405 00:17:00,230 --> 00:17:02,300 It's harder to hear because it's less obvious. 406 00:17:02,300 --> 00:17:04,010 But you can tell a lot about the space 407 00:17:04,010 --> 00:17:06,530 you're in because the reverb properties are different. 408 00:17:06,530 --> 00:17:08,270 The distance to the walls are different. 409 00:17:08,270 --> 00:17:11,190 The reflective properties are different. 410 00:17:11,190 --> 00:17:12,849 And so there's information there. 411 00:17:12,849 --> 00:17:16,730 So you can characterize the nature of the reverb in any one 412 00:17:16,730 --> 00:17:20,990 location by making an instantaneous, brief click 413 00:17:20,990 --> 00:17:24,020 sound in that environment and recording what 414 00:17:24,020 --> 00:17:25,079 happens after that. 415 00:17:25,079 --> 00:17:29,090 And then you can collect all the reverberant reflections 416 00:17:29,090 --> 00:17:31,160 of that sound off the walls. 417 00:17:31,160 --> 00:17:32,660 So what they did is they went around 418 00:17:32,660 --> 00:17:35,720 to lots of natural locations and they played a click like this. 419 00:17:35,720 --> 00:17:36,780 [CLICK] 420 00:17:36,780 --> 00:17:38,150 That's it, just a click. 421 00:17:38,150 --> 00:17:40,130 And then they recorded. 
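Why a single click is enough to characterize a location can be seen in a toy model. The sketch below assumes the room behaves as a linear, time-invariant system, so that what is recorded is the played sound convolved with the room's impulse response. The response values and the function names are hypothetical, chosen only for illustration.

```python
def convolve(signal, kernel):
    """Discrete convolution: every input sample launches a scaled copy of the kernel."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


# Hypothetical room: direct sound followed by a few weaker reflections.
ROOM_IR = [1.0, 0.0, 0.5, 0.0, 0.25, 0.125]


def play_in_room(sound):
    """Simulated recording of a sound played in the room (assumed linear model)."""
    return convolve(sound, ROOM_IR)


# Playing an idealized click (a single unit sample) records the
# room's impulse response itself.
click = [1.0]
measured_ir = play_in_room(click)
print(measured_ir)

# The measured response then predicts how any other sound will come out.
other_sound = [1.0, -1.0, 0.5]
predicted = convolve(other_sound, measured_ir)
actual = play_in_room(other_sound)
print(predicted == actual)
```

In this model the recording of the click is the impulse response, and that one measurement is enough to predict how the room transforms any other sound.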
422 00:17:40,130 --> 00:17:42,380 So this is the initial click, but this 423 00:17:42,380 --> 00:17:46,410 is what you record in a single location, all this stuff. 424 00:17:46,410 --> 00:17:49,370 And those are all the reverberant reflections 425 00:17:49,370 --> 00:17:51,470 of that sound off the walls-- make sense? 426 00:17:51,470 --> 00:17:54,780 For one location. 427 00:17:54,780 --> 00:17:58,850 So then they did that in a whole bunch of locations. 428 00:17:58,850 --> 00:18:02,180 And the idea is that here is a description 429 00:18:02,180 --> 00:18:05,640 of the basic problems, just the same thing I said before, 430 00:18:05,640 --> 00:18:08,180 but slightly more detailed. 431 00:18:08,180 --> 00:18:11,915 So a sound source would be something like this. 432 00:18:11,915 --> 00:18:14,510 This looks like a person speaking, 433 00:18:14,510 --> 00:18:17,510 with those nice, harmonic, parallel bands, like you 434 00:18:17,510 --> 00:18:18,858 saw when I was speaking. 435 00:18:18,858 --> 00:18:19,775 Maybe it's a trombone. 436 00:18:22,520 --> 00:18:23,270 So that's time. 437 00:18:23,270 --> 00:18:24,330 That's the source. 438 00:18:24,330 --> 00:18:26,480 That's what you want to know. 439 00:18:26,480 --> 00:18:28,850 This is now the impulse response function 440 00:18:28,850 --> 00:18:31,760 for the location where that sound is being played, 441 00:18:31,760 --> 00:18:34,310 determined by doing that click and recording. 442 00:18:34,310 --> 00:18:37,070 I showed you just you do a Fourier 443 00:18:37,070 --> 00:18:40,550 analysis of that black curve in the previous slide 444 00:18:40,550 --> 00:18:42,080 and you get something like this. 445 00:18:42,080 --> 00:18:44,450 And that shows you all the echoes that happen 446 00:18:44,450 --> 00:18:45,770 in that sound in that location. 447 00:18:45,770 --> 00:18:47,270 And there are different time delays, 448 00:18:47,270 --> 00:18:51,440 and different intensities, and frequency dependence. 
449 00:18:51,440 --> 00:18:55,655 What comes to your ear is basically this times that. 450 00:18:58,450 --> 00:19:00,820 So you're given this and you have 451 00:19:00,820 --> 00:19:03,430 to go backwards and solve for that. 452 00:19:03,430 --> 00:19:06,250 Everybody see the problem? 453 00:19:06,250 --> 00:19:13,028 So what McDermott and Traer showed is that-- 454 00:19:13,028 --> 00:19:15,070 just to state the problem a little more clearly-- 455 00:19:15,070 --> 00:19:18,113 you're interested in the source and/or the environment. 456 00:19:18,113 --> 00:19:19,780 You might want to know what kind of room 457 00:19:19,780 --> 00:19:22,180 am I in, if somebody is dragging you around blindfolded. 458 00:19:22,180 --> 00:19:24,347 You might want to know if you're outside, or inside, 459 00:19:24,347 --> 00:19:27,880 or in a cathedral, or a closet, or what. 460 00:19:27,880 --> 00:19:30,910 And now this should seem very analogous to the problem 461 00:19:30,910 --> 00:19:32,440 of color vision. 462 00:19:32,440 --> 00:19:34,870 Remember the problem of color vision? 463 00:19:34,870 --> 00:19:38,980 We want to know the color of this object right here. 464 00:19:38,980 --> 00:19:40,600 So this little, purple patch here, 465 00:19:40,600 --> 00:19:44,080 we want to know that, but all we have is the light coming 466 00:19:44,080 --> 00:19:47,192 to our eyes from that patch. 467 00:19:47,192 --> 00:19:49,150 And the light coming to our eyes from the patch 468 00:19:49,150 --> 00:19:52,270 is a function not just of the property of the object, 469 00:19:52,270 --> 00:19:54,850 but of whatever light happens to be coming onto it 470 00:19:54,850 --> 00:19:58,180 and then reflecting to our eyes.
471 00:19:58,180 --> 00:19:59,830 And so in color vision, we have one set 472 00:19:59,830 --> 00:20:01,720 of tricks to try to solve that problem 473 00:20:01,720 --> 00:20:04,360 and recover the actual properties of the object, 474 00:20:04,360 --> 00:20:06,430 even though it's totally confounded in the input 475 00:20:06,430 --> 00:20:08,740 with the properties of the incident light. 476 00:20:08,740 --> 00:20:10,900 This is extremely analogous. 477 00:20:10,900 --> 00:20:14,110 We're trying to solve for what is the sound source. 478 00:20:14,110 --> 00:20:16,210 And we have to deal with this problem that 479 00:20:16,210 --> 00:20:19,340 is completely confounded with the reverberation of the room 480 00:20:19,340 --> 00:20:19,840 it's in. 481 00:20:19,840 --> 00:20:21,340 Does everybody see that analogy? 482 00:20:21,340 --> 00:20:26,790 They're both classic ill-posed problems in perception. 483 00:20:26,790 --> 00:20:29,670 So here's another way of putting it-- we're given that 484 00:20:29,670 --> 00:20:31,350 and we want to solve for at least one 485 00:20:31,350 --> 00:20:34,110 of these, ideally both of those. 486 00:20:34,110 --> 00:20:36,430 And you can't do that with just this. 487 00:20:36,430 --> 00:20:38,400 So we need to make assumptions about the room. 488 00:20:38,400 --> 00:20:40,680 And what Traer and McDermott showed 489 00:20:40,680 --> 00:20:43,290 is that first, they measured those impulse response 490 00:20:43,290 --> 00:20:45,000 functions in natural environments 491 00:20:45,000 --> 00:20:48,540 to characterize reverb in different environments. 492 00:20:48,540 --> 00:20:52,260 And they found that there's some systematic properties of reverb 493 00:20:52,260 --> 00:20:54,540 having to do with the decay function 494 00:20:54,540 --> 00:20:56,040 as a function of frequency. 495 00:20:56,040 --> 00:20:59,100 And those systematic properties are preserved 496 00:20:59,100 --> 00:21:02,260 across different environments. 
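[EDITOR'S SKETCH] To see why recovering the source from the sound at the ear is ill-posed, here is a small sketch under the same convolution assumption as the standard reverb model. Three different hypothetical (source, room) pairs produce exactly the same signal, so the signal alone cannot say which pair is the true one. All values and names are invented for illustration.

```python
def convolve(a, b):
    """Discrete linear convolution of two sample lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

# Explanation 1: a short source in an echoey room.
source_1, room_1 = [1.0, 0.5], [1.0, 0.0, 0.25]
heard_1 = convolve(source_1, room_1)

# Explanation 2: the roles swapped (convolution is commutative).
source_2, room_2 = [1.0, 0.0, 0.25], [1.0, 0.5]
heard_2 = convolve(source_2, room_2)

# Explanation 3: the whole signal was the source, in a room
# with no echoes at all.
source_3, room_3 = heard_1, [1.0]
heard_3 = convolve(source_3, room_3)

print(heard_1 == heard_2 == heard_3)
```

Extra constraints are needed to choose among these explanations, which is the role the lecture assigns to regularities in how real-world reverb decays.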
497 00:21:02,260 --> 00:21:07,440 And then they showed that your auditory system knows 498 00:21:07,440 --> 00:21:10,290 about the way reverb works, in the sense 499 00:21:10,290 --> 00:21:15,040 that if you make up a different, non-physical reverb property 500 00:21:15,040 --> 00:21:18,450 and you play it to people, it sounds weird, number one. 501 00:21:18,450 --> 00:21:22,020 And two, they can't recover the sound source. 502 00:21:22,020 --> 00:21:25,410 And what that means is, built into your auditory system 503 00:21:25,410 --> 00:21:27,660 is knowledge of the physics of sound, 504 00:21:27,660 --> 00:21:31,050 and in particular about the particulars of the decay 505 00:21:31,050 --> 00:21:33,420 function of reverb, such that you 506 00:21:33,420 --> 00:21:35,340 can use that knowledge of how reverb 507 00:21:35,340 --> 00:21:38,100 works in general to undo this problem, 508 00:21:38,100 --> 00:21:41,133 and constrain this problem, and solve for the sound source. 509 00:21:41,133 --> 00:21:42,550 I didn't give you all the details. 510 00:21:42,550 --> 00:21:43,890 But I want you to get the gist. 511 00:21:43,890 --> 00:21:47,110 Do you get the kind of idea? 512 00:21:47,110 --> 00:21:47,610 OK. 513 00:21:50,640 --> 00:21:53,640 But as I said, that's only true for reverb 514 00:21:53,640 --> 00:21:57,240 that has the reverb properties of real-world sound. 515 00:21:57,240 --> 00:21:59,760 If you make up fake reverb, it doesn't work. 516 00:21:59,760 --> 00:22:01,590 And people can't solve this problem. 517 00:22:01,590 --> 00:22:03,690 That tells you they're using their knowledge. 518 00:22:03,690 --> 00:22:06,510 Doesn't tell us whether that knowledge is built in innately, 519 00:22:06,510 --> 00:22:08,250 or whether they learned it, or what. 520 00:22:13,110 --> 00:22:14,100 All right, good. 
521 00:22:14,100 --> 00:22:17,220 So in other words, we solve the ill-posed problem 522 00:22:17,220 --> 00:22:20,070 of recovering the sound source despite reverb 523 00:22:20,070 --> 00:22:23,070 by building a knowledge of the physics of the world 524 00:22:23,070 --> 00:22:24,810 into our auditory system and using 525 00:22:24,810 --> 00:22:28,410 it to constrain the problem. 526 00:22:28,410 --> 00:22:32,160 So we just said, why is this computationally challenging? 527 00:22:32,160 --> 00:22:34,620 Invariance problems, appreciating the sameness 528 00:22:34,620 --> 00:22:36,630 of a voice across different words, 529 00:22:36,630 --> 00:22:42,060 appreciating the sameness of a word across different voices. 530 00:22:42,060 --> 00:22:44,310 Separating multiple sound sources that come 531 00:22:44,310 --> 00:22:47,100 in simultaneously and are just massively superimposed 532 00:22:47,100 --> 00:22:51,840 on the input-- the cocktail party problem, also ill-posed-- 533 00:22:51,840 --> 00:22:53,670 and the reverb problem. 534 00:22:53,670 --> 00:22:56,580 So everybody see how these are three really big challenges 535 00:22:56,580 --> 00:22:57,210 for audition? 536 00:22:57,210 --> 00:22:57,710 Yeah. 537 00:22:57,710 --> 00:23:01,890 STUDENT: So was brain imaging as well a part of the [INAUDIBLE]? 538 00:23:01,890 --> 00:23:03,470 NANCY KANWISHER: Nope. 539 00:23:03,470 --> 00:23:05,065 One could do that and ask questions 540 00:23:05,065 --> 00:23:06,690 about where that's solved in the brain. 541 00:23:06,690 --> 00:23:08,580 But the beauty of that study is that 542 00:23:08,580 --> 00:23:10,260 in a way, who cares where it's solved? 543 00:23:10,260 --> 00:23:11,635 I mean, it's kind of interesting, 544 00:23:11,635 --> 00:23:14,970 but it's such a beautiful story already just from-- actually, 545 00:23:14,970 --> 00:23:17,430 a big part of their study was measuring reverb. 546 00:23:17,430 --> 00:23:19,180 Nobody had done it before.
547 00:23:19,180 --> 00:23:21,120 They sent people out with speakers, 548 00:23:21,120 --> 00:23:24,750 and recording devices, and little random timers 549 00:23:24,750 --> 00:23:25,920 on their iPhones. 550 00:23:25,920 --> 00:23:27,060 And at random times-- 551 00:23:27,060 --> 00:23:28,560 how did this go-- oh, yeah, 552 00:23:28,560 --> 00:23:31,920 people had to mark the location they were in using their iPhone 553 00:23:31,920 --> 00:23:34,410 GPS. And then-- that's right-- they didn't send people out 554 00:23:34,410 --> 00:23:35,370 with recording devices. 555 00:23:35,370 --> 00:23:36,070 It's too hard. 556 00:23:36,070 --> 00:23:38,160 And so then they sampled what kind of places 557 00:23:38,160 --> 00:23:39,480 do people hang out in. 558 00:23:39,480 --> 00:23:43,170 And then they went back with their impulse sound 559 00:23:43,170 --> 00:23:45,030 source and the recording device, and they 560 00:23:45,030 --> 00:23:46,950 measured that impulse response function 561 00:23:46,950 --> 00:23:49,050 in lots and lots of different natural locations 562 00:23:49,050 --> 00:23:51,900 in order to characterize what is the nature of reverb 563 00:23:51,900 --> 00:23:52,510 in the world. 564 00:23:52,510 --> 00:23:53,850 Nobody had done that before. 565 00:23:53,850 --> 00:23:55,560 So that's why I tell you this, is that to me, it's 566 00:23:55,560 --> 00:23:57,185 just one of the most beautiful examples 567 00:23:57,185 --> 00:23:58,500 of computational theory-- 568 00:23:58,500 --> 00:24:00,600 no measurement in the brain. 569 00:24:00,600 --> 00:24:02,670 A big part of the study was just characterizing 570 00:24:02,670 --> 00:24:05,340 the physics of sound, and then some 571 00:24:05,340 --> 00:24:07,020 psychophysics to say actually, do people 572 00:24:07,020 --> 00:24:09,660 use that knowledge of how reverb works in the world? 573 00:24:09,660 --> 00:24:10,530 And yes, they do.
574 00:24:15,580 --> 00:24:17,810 So I've been talking about hearing in general, 575 00:24:17,810 --> 00:24:20,980 but let's talk about one of the most interesting examples 576 00:24:20,980 --> 00:24:24,070 of hearing, the one you're doing right now-- speech perception. 577 00:24:26,590 --> 00:24:28,440 So what do speech sounds look like? 578 00:24:28,440 --> 00:24:30,900 You saw a few of them briefly before. 579 00:24:30,900 --> 00:24:32,350 Here are a few spectra. 580 00:24:32,350 --> 00:24:34,740 So just to remind you, each one of these things 581 00:24:34,740 --> 00:24:38,670 has time going along the x-axis, frequency here. 582 00:24:38,670 --> 00:24:42,600 And the color shows you the intensity of energy 583 00:24:42,600 --> 00:24:44,670 at that frequency band. 584 00:24:44,670 --> 00:24:48,630 So this is a person saying, "hot," and "hat," and "hit," 585 00:24:48,630 --> 00:24:50,940 and "head." 586 00:24:50,940 --> 00:24:54,540 That's the same person saying these four things, a person 587 00:24:54,540 --> 00:24:56,100 with a high-pitched voice. 588 00:24:56,100 --> 00:24:58,920 And here's a person with a slightly lower-pitched voice 589 00:24:58,920 --> 00:25:00,390 saying the same things. 590 00:25:04,080 --> 00:25:06,270 So what do we notice here? 591 00:25:06,270 --> 00:25:10,460 Well, first of all, we see that vowels have regularly 592 00:25:10,460 --> 00:25:11,760 spaced harmonics. 593 00:25:11,760 --> 00:25:12,733 That's the red stripes. 594 00:25:12,733 --> 00:25:14,150 This is a vowel sound right there. 595 00:25:14,150 --> 00:25:17,240 See those perfectly regularly spaced harmonics? 596 00:25:17,240 --> 00:25:20,780 That makes a pitchy sound, so voices are pitchy. 597 00:25:20,780 --> 00:25:22,970 You may not think that there's a pitch to my voice 598 00:25:22,970 --> 00:25:24,860 right now because I'm talking, not singing, 599 00:25:24,860 --> 00:25:26,240 but there is a pitch. 
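[EDITOR'S SKETCH] One way to make "regularly spaced harmonics" concrete: in an idealized harmonic stack, every stripe sits at a whole-number multiple of one fundamental frequency, and that fundamental is what sets the pitch. The sketch below assumes perfectly harmonic tones with integer frequencies in Hz, which real voices only approximate, and the function name `fundamental_hz` is hypothetical. Real pitch estimation needs more robust methods than a greatest common divisor.

```python
from functools import reduce
from math import gcd

def fundamental_hz(harmonics_hz):
    """Largest frequency that divides every harmonic evenly.
    Assumes idealized integer-valued harmonics in Hz."""
    return reduce(gcd, harmonics_hz)

# Hypothetical higher-pitched voice: stripes every 200 Hz.
higher_voice = [200, 400, 600, 800]

# Hypothetical lower-pitched voice: stripes every 120 Hz.
lower_voice = [120, 240, 360, 480]

# The same spacing is recovered even if the lowest stripe is
# removed, so the spacing matters, not the lowest energy.
no_lowest_stripe = [400, 600, 800]

print(fundamental_hz(higher_voice))
print(fundamental_hz(lower_voice))
print(fundamental_hz(no_lowest_stripe))
```

The third case is the reason the spacing of the stack, rather than its lowest stripe, is the better description of pitch.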
600 00:25:26,240 --> 00:25:28,370 And you use that, actually, in the intonation 601 00:25:28,370 --> 00:25:31,220 of speech, as you guys read about in the assigned 602 00:25:31,220 --> 00:25:33,680 reading for yesterday. 603 00:25:33,680 --> 00:25:36,530 So each of these things with the stacked harmonics 604 00:25:36,530 --> 00:25:38,420 is a vowel sound. 605 00:25:38,420 --> 00:25:42,440 It's got a pitch and it lasts over a chunk of time. 606 00:25:42,440 --> 00:25:45,650 And the consonants are these kind of muckier things 607 00:25:45,650 --> 00:25:47,240 that happen before and after. 608 00:25:47,240 --> 00:25:49,310 And consonants don't have pitch. 609 00:25:49,310 --> 00:25:50,540 They don't have harmonics. 610 00:25:50,540 --> 00:25:51,740 They have kind of muck. 611 00:25:55,610 --> 00:25:57,680 So there are certain band-- people 612 00:25:57,680 --> 00:26:00,200 who study speech spend a lot of time staring at these things 613 00:26:00,200 --> 00:26:01,670 and characterizing them. 614 00:26:01,670 --> 00:26:07,160 And they like to talk about bands of frequency, of power. 615 00:26:07,160 --> 00:26:09,950 And so this band down here that's 616 00:26:09,950 --> 00:26:13,820 present in all of these speech sounds 617 00:26:13,820 --> 00:26:15,740 here is called a "formant." 618 00:26:15,740 --> 00:26:18,527 It's just a chunk of the frequency spectrum 619 00:26:18,527 --> 00:26:19,610 that you hear with speech. 620 00:26:23,030 --> 00:26:25,550 So that's a formant. 621 00:26:25,550 --> 00:26:29,390 And some of those frequency bands or formants 622 00:26:29,390 --> 00:26:33,680 are particularly diagnostic for different vowels. 
623 00:26:33,680 --> 00:26:35,780 So if you look in this range here, 624 00:26:35,780 --> 00:26:39,140 only in that mid-range here, only for "hat" 625 00:26:39,140 --> 00:26:41,210 and a little bit for "head" do 626 00:26:41,210 --> 00:26:44,270 you get an energy in that frequency band, 627 00:26:44,270 --> 00:26:47,700 not for "hot" or "hit." 628 00:26:47,700 --> 00:26:49,820 And that's true both for the high-pitched voice 629 00:26:49,820 --> 00:26:51,050 and the low-pitched voice. 630 00:26:51,050 --> 00:26:54,018 This frequency band here is really diagnostic of which 631 00:26:54,018 --> 00:26:55,310 of those vowels you're hearing. 632 00:26:59,790 --> 00:27:03,690 So we're going to play with that spectrogram 633 00:27:03,690 --> 00:27:08,800 again a little bit more, although I now 634 00:27:08,800 --> 00:27:09,940 have learned avoidance. 635 00:27:13,350 --> 00:27:20,110 So this is me speaking again, as you saw before. 636 00:27:20,110 --> 00:27:29,300 So I'm going to say an A, an E, an I-- 637 00:27:29,300 --> 00:27:34,460 look how different that one is, O, and U. 638 00:27:34,460 --> 00:27:35,840 And there's lots of other vowels. 639 00:27:35,840 --> 00:27:37,573 Do you see how that energy moves around 640 00:27:37,573 --> 00:27:38,615 for the different vowels? 641 00:27:41,330 --> 00:27:47,950 Now as I said before, if I do a long vowel like this, 642 00:27:47,950 --> 00:27:51,950 it makes a big, long bunch of harmonics. 643 00:27:51,950 --> 00:27:55,460 But a lot of the time, they're just these vertical lines. 644 00:27:55,460 --> 00:28:00,890 The vertical lines are consonants, t, p, k, r. 645 00:28:00,890 --> 00:28:03,288 If I don't say a vowel, you just see a vertical line. 646 00:28:03,288 --> 00:28:04,580 It's not quite a vertical line. 647 00:28:04,580 --> 00:28:08,450 They are different from each other in ways you can tell.
648 00:28:08,450 --> 00:28:12,440 So the consonants are those bands of energy 649 00:28:12,440 --> 00:28:14,120 that go vertically. 650 00:28:14,120 --> 00:28:16,760 And the vowels are the big, long harmonic structures 651 00:28:16,760 --> 00:28:17,990 that stretch between them. 652 00:28:20,580 --> 00:28:22,457 Now, I'm not sure you'll be able to do this. 653 00:28:22,457 --> 00:28:24,540 I'm going to need a volunteer in a second, and I'm 654 00:28:24,540 --> 00:28:26,280 going to pick on [? Iadun, ?] because he's 655 00:28:26,280 --> 00:28:27,447 most accessible right there. 656 00:28:27,447 --> 00:28:28,330 So come on up here. 657 00:28:28,330 --> 00:28:32,970 You know it won't be horrible or embarrassing. 658 00:28:32,970 --> 00:28:34,870 So you can stand here for a second. 659 00:28:34,870 --> 00:28:38,640 I'm going to say "ba's" and "pa's" and I'll tell you 660 00:28:38,640 --> 00:28:39,900 in a moment what to say. 661 00:28:39,900 --> 00:28:41,358 I'm not sure this is going to work. 662 00:28:41,358 --> 00:28:42,540 I tried it before. 663 00:28:42,540 --> 00:28:44,550 We're going to look at two different formants 664 00:28:44,550 --> 00:28:48,240 when I say "ba." 665 00:28:48,240 --> 00:28:56,790 Actually, I'm going to do it rising-- ba, pa, ba, pa. 666 00:28:56,790 --> 00:28:58,920 So there's two different formants, here and here, 667 00:28:58,920 --> 00:28:59,820 with both of those. 668 00:28:59,820 --> 00:29:00,960 I'm going to do it again. 669 00:29:00,960 --> 00:29:03,030 And there's just a tiny, little difference 670 00:29:03,030 --> 00:29:04,770 between a ba and a pa. 671 00:29:04,770 --> 00:29:08,370 And it has to do with the interval between the consonant, 672 00:29:08,370 --> 00:29:09,870 which is the first vertical thing, 673 00:29:09,870 --> 00:29:11,760 and the vowel, which is the horizontal stuff. 674 00:29:11,760 --> 00:29:13,260 So let's see if we can see it again. 675 00:29:13,260 --> 00:29:14,880 Here we go. 
676 00:29:14,880 --> 00:29:20,740 Ba, pa-- do you see how the pa starts earlier there 677 00:29:20,740 --> 00:29:23,230 and the ba is slightly delayed? 678 00:29:23,230 --> 00:29:25,370 I'll show you diagrams that show you more clearly. 679 00:29:25,370 --> 00:29:26,020 OK, great. 680 00:29:26,020 --> 00:29:28,240 Don't go away. 681 00:29:28,240 --> 00:29:30,040 So we're going to do the cocktail party 682 00:29:30,040 --> 00:29:33,430 thing with the recording devices here. 683 00:29:33,430 --> 00:29:34,345 What is this? 684 00:29:34,345 --> 00:29:36,220 This is just some boring administrative thing 685 00:29:36,220 --> 00:29:37,650 you can just read. 686 00:29:37,650 --> 00:29:40,150 I actually brought it to crumple and make a crumpling sound, 687 00:29:40,150 --> 00:29:42,580 but we'll do that afterwards. 688 00:29:42,580 --> 00:29:45,370 Right now, you will read from that 689 00:29:45,370 --> 00:29:47,740 and I will recite something boring. 690 00:29:47,740 --> 00:29:49,282 And we'll just do it simultaneously. 691 00:29:49,282 --> 00:29:50,740 So just focus on what you're doing. 692 00:29:50,740 --> 00:29:51,670 And everybody watch. 693 00:29:51,670 --> 00:29:53,600 You can see my voice here. 694 00:29:53,600 --> 00:29:56,830 And let's see what happens when we're both talking at once. 695 00:29:56,830 --> 00:29:58,610 OK, here we go. 696 00:29:58,610 --> 00:30:01,160 Four score and seven years ago-- 697 00:30:01,160 --> 00:30:03,980 oh, geez, I forget how it goes after that, so I'll just 698 00:30:03,980 --> 00:30:05,840 have to make up some other random garbage. 699 00:30:05,840 --> 00:30:06,170 [INTERPOSING VOICES] 700 00:30:06,170 --> 00:30:06,920 STUDENT: --outstanding-- 701 00:30:06,920 --> 00:30:07,400 NANCY KANWISHER: OK. 702 00:30:07,400 --> 00:30:08,840 STUDENT: --review the student's course-- 703 00:30:08,840 --> 00:30:09,140 [INTERPOSING VOICES] 704 00:30:09,140 --> 00:30:10,460 NANCY KANWISHER: That's great. 
705 00:30:10,460 --> 00:30:11,510 That's great. 706 00:30:11,510 --> 00:30:12,787 Don't go away. 707 00:30:12,787 --> 00:30:14,870 I don't know if you could tell that it got muckier 708 00:30:14,870 --> 00:30:16,610 when we were both talking. 709 00:30:16,610 --> 00:30:20,370 Maybe it's mucky enough with me talking fast to begin with. 710 00:30:20,370 --> 00:30:21,770 Let's try a few other things. 711 00:30:21,770 --> 00:30:23,780 Let's have me say words and you say words. 712 00:30:23,780 --> 00:30:25,430 And let's see how different they look. 713 00:30:25,430 --> 00:30:31,950 OK, so I'm going to say "mousetrap." 714 00:30:31,950 --> 00:30:32,805 STUDENT: Mousetrap. 715 00:30:32,805 --> 00:30:34,930 NANCY KANWISHER: You can see some similarity there, 716 00:30:34,930 --> 00:30:35,430 can't you? 717 00:30:35,430 --> 00:30:37,540 Let's do it again. 718 00:30:37,540 --> 00:30:39,260 Mousetrap. 719 00:30:39,260 --> 00:30:40,085 STUDENT: Mousetrap. 720 00:30:40,085 --> 00:30:41,460 NANCY KANWISHER: OK, that's good. 721 00:30:41,460 --> 00:30:43,230 It's funny, I see more low-frequency band here. 722 00:30:43,230 --> 00:30:44,910 I'm sure your voice is lower than mine. 723 00:30:44,910 --> 00:30:49,740 Pitch, interestingly, isn't just about how low the energy goes. 724 00:30:49,740 --> 00:30:52,320 It's an interesting, complicated property 725 00:30:52,320 --> 00:30:56,452 of the lowest common denominator of that whole frequency stack. 726 00:30:56,452 --> 00:30:57,660 So I'm not going to do pitch. 727 00:30:57,660 --> 00:31:00,550 It's complicated. 728 00:31:00,550 --> 00:31:01,810 What else do we want to do? 729 00:31:01,810 --> 00:31:06,167 Let's try some ba's and pa's. 730 00:31:06,167 --> 00:31:08,000 But let's stick them on the fronts of words. 731 00:31:08,000 --> 00:31:09,125 Maybe that'll work better-- 732 00:31:12,990 --> 00:31:16,960 pat, bat. 733 00:31:16,960 --> 00:31:18,278 STUDENT: Pat, bat. 
734 00:31:18,278 --> 00:31:20,570 NANCY KANWISHER: Oh, I could see the commonality there. 735 00:31:20,570 --> 00:31:22,090 Could you guys see that? 736 00:31:22,090 --> 00:31:24,010 Let's do it again. 737 00:31:24,010 --> 00:31:27,130 Pat, bat. 738 00:31:27,130 --> 00:31:29,352 STUDENT: Pat, bat. 739 00:31:29,352 --> 00:31:31,310 NANCY KANWISHER: Well, yours look more similar. 740 00:31:31,310 --> 00:31:31,850 All right. 741 00:31:31,850 --> 00:31:32,630 Anyway, thank you. 742 00:31:32,630 --> 00:31:33,130 That's good. 743 00:31:33,130 --> 00:31:35,990 That's all I need, just to show you how hard this is, 744 00:31:35,990 --> 00:31:38,360 and how there's variability across speakers saying 745 00:31:38,360 --> 00:31:42,140 the same thing, and very, very subtle differences 746 00:31:42,140 --> 00:31:46,700 between sounds that sound totally different to us. 747 00:31:46,700 --> 00:31:50,745 So back to lecture. 748 00:31:56,620 --> 00:32:01,990 So you saw the harmonics, those red stripes, during the vowels. 749 00:32:01,990 --> 00:32:07,330 You noticed that I showed the consonants 750 00:32:07,330 --> 00:32:08,277 and the ba's and pa's. 751 00:32:08,277 --> 00:32:09,110 So here's a diagram. 752 00:32:09,110 --> 00:32:10,540 I'm sorry, this is very abstracted 753 00:32:10,540 --> 00:32:11,998 away from those spectrograms, which 754 00:32:11,998 --> 00:32:13,930 are messy, as you can see. 755 00:32:13,930 --> 00:32:18,940 The idea is that a consonant vowel sound, a single syllable 756 00:32:18,940 --> 00:32:21,010 like ba or pa-- 757 00:32:21,010 --> 00:32:23,890 this is time this way-- has this big, long formant which 758 00:32:23,890 --> 00:32:27,100 is a band of energy that's the vowel, the ah sound. 759 00:32:27,100 --> 00:32:29,740 And it's these transitions that happen just before 760 00:32:29,740 --> 00:32:33,400 that that make the difference for different consonants. 
761 00:32:33,400 --> 00:32:36,580 And in particular, the difference 762 00:32:36,580 --> 00:32:38,620 between a ba and a pa-- 763 00:32:38,620 --> 00:32:40,900 this is a ba, that's a pa-- 764 00:32:40,900 --> 00:32:43,420 the difference we were looking for that didn't show up that 765 00:32:43,420 --> 00:32:45,190 clearly, but you can try it at home, 766 00:32:45,190 --> 00:32:48,040 maybe you can get it clearer than I just got it now-- 767 00:32:48,040 --> 00:32:51,160 has to do with that transition onto the first formant. 768 00:32:51,160 --> 00:32:55,810 So with a ba, the transitions happen in parallel. 769 00:32:55,810 --> 00:32:58,840 And with a pa, this transition happens 770 00:32:58,840 --> 00:33:00,700 before that lower formant. 771 00:33:00,700 --> 00:33:04,750 So that tiny, little-- it's a 65 millisecond delay 772 00:33:04,750 --> 00:33:08,440 in the case of pa that you don't have in the case of ba, 773 00:33:08,440 --> 00:33:10,430 is how you tell that difference. 774 00:33:10,430 --> 00:33:11,740 It's very, very subtle. 775 00:33:14,950 --> 00:33:17,860 So there's lots of different kinds of phonemes. 776 00:33:17,860 --> 00:33:21,160 We've been talking about vowels and consonants. 777 00:33:21,160 --> 00:33:25,570 Each vowel or consonant sound is called a "phoneme" 778 00:33:25,570 --> 00:33:29,070 if a distinction in that sound makes 779 00:33:29,070 --> 00:33:30,820 the difference between two different words 780 00:33:30,820 --> 00:33:33,140 in your language. 781 00:33:33,140 --> 00:33:36,160 And that means that what counts as a phoneme in one language 782 00:33:36,160 --> 00:33:38,440 may not be a phoneme in another language, 783 00:33:38,440 --> 00:33:39,910 because it won't make a distinction 784 00:33:39,910 --> 00:33:42,220 between different words. 785 00:33:42,220 --> 00:33:45,760 Many of the phonemes are shared across languages, but not all. 
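[EDITOR'S SKETCH] The ba/pa contrast described above comes down to a timing difference: the transitions start together for ba, while for pa the lower formant starts about 65 milliseconds after the upper transition. A toy version of that decision is sketched below. The 65 ms figure is from the lecture; the 30 ms boundary, the function names, and the idea of a hard threshold are assumptions made for illustration, not a claim about how listeners actually decide.

```python
# Hypothetical decision boundary in milliseconds (an assumption,
# placed between the two cases the lecture describes).
BOUNDARY_MS = 30.0

def lower_formant_lag_ms(upper_onset_ms, lower_onset_ms):
    """How long the lower formant starts after the upper transition."""
    return lower_onset_ms - upper_onset_ms

def classify_syllable(lag_ms, boundary_ms=BOUNDARY_MS):
    """Label a syllable 'pa' if the lower formant is delayed past
    the boundary, otherwise 'ba'."""
    return "pa" if lag_ms > boundary_ms else "ba"

# ba: both transitions begin in parallel (no lag).
ba_lag = lower_formant_lag_ms(upper_onset_ms=100.0, lower_onset_ms=100.0)

# pa: the lower formant begins 65 ms after the upper transition.
pa_lag = lower_formant_lag_ms(upper_onset_ms=100.0, lower_onset_ms=165.0)

print(classify_syllable(ba_lag))
print(classify_syllable(pa_lag))
```

A single timing number separating two different words is what makes this contrast count as a phoneme distinction in the sense defined above.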
786 00:33:45,760 --> 00:33:49,630 We've talked about R and L that aren't distinguished in Japanese, 787 00:33:49,630 --> 00:33:51,340 and two different D sounds that sound 788 00:33:51,340 --> 00:33:54,430 the same to me that are distinguished in Hindi, 789 00:33:54,430 --> 00:33:56,440 and lots of others. 790 00:33:56,440 --> 00:33:59,320 And so those are just variations across natural languages 791 00:33:59,320 --> 00:34:02,020 on which of those phonemes, which of those sounds, 792 00:34:02,020 --> 00:34:03,820 are used to discriminate different words, 793 00:34:03,820 --> 00:34:07,250 and hence count as phonemes in that language. 794 00:34:07,250 --> 00:34:11,170 So there's some particularly awesome phonemes 795 00:34:11,170 --> 00:34:13,120 that use a particular kind of consonant 796 00:34:13,120 --> 00:34:15,100 known as a click consonant. 797 00:34:15,100 --> 00:34:19,360 And these are common in some Southern African languages. 798 00:34:19,360 --> 00:34:23,530 And a year ago, I was traveling in Mozambique, which was just 799 00:34:23,530 --> 00:34:25,030 hit by a devastating flood. 800 00:34:25,030 --> 00:34:25,940 It's really awful. 801 00:34:25,940 --> 00:34:28,150 But anyway, I was there visiting a game park 802 00:34:28,150 --> 00:34:30,409 seeing all kinds of animals. 803 00:34:30,409 --> 00:34:35,080 And I met this guy, Test. 804 00:34:35,080 --> 00:34:35,857 And he's amazing. 805 00:34:35,857 --> 00:34:37,690 I mean, his knowledge of the natural history 806 00:34:37,690 --> 00:34:39,409 was mind blowing, but he also speaks, 807 00:34:39,409 --> 00:34:42,370 I think, six different languages fluently, one of which 808 00:34:42,370 --> 00:34:45,500 is Xhosa, or as he would say, [SPEAKING Xhosa] or something 809 00:34:45,500 --> 00:34:46,000 like that. 810 00:34:46,000 --> 00:34:47,690 You'll hear him say it in a moment. 811 00:34:47,690 --> 00:34:49,690 And so he was illustrating click languages.
812 00:34:49,690 --> 00:34:51,505 And I'll play this for you in a second. 813 00:34:51,505 --> 00:34:53,380 And he says there's a sentence in Xhosa which 814 00:34:53,380 --> 00:34:55,810 is a little bit crazy, but has all the different clicks. 815 00:34:55,810 --> 00:34:58,570 And it means, basically, "the skunk was rolling 816 00:34:58,570 --> 00:35:00,430 and accidentally got cut by the throat." 817 00:35:00,430 --> 00:35:02,200 Doesn't mean a whole lot, but listen 818 00:35:02,200 --> 00:35:04,930 to Test saying the sentence, first in English 819 00:35:04,930 --> 00:35:06,240 and then in Xhosa. 820 00:35:06,240 --> 00:35:07,120 [AUDIO PLAYBACK] 821 00:35:07,120 --> 00:35:09,940 - The phrase in English, it says skunk 822 00:35:09,940 --> 00:35:14,680 was rolling and accidentally got cut by the throat. 823 00:35:14,680 --> 00:35:19,667 In Xhosa, or in isiXhosa, [SPEAKING Xhosa]. 824 00:35:23,514 --> 00:35:24,340 [END PLAYBACK] 825 00:35:24,340 --> 00:35:25,120 NANCY KANWISHER: Isn't that awesome? 826 00:35:25,120 --> 00:35:27,120 I think we just have to crank it up a little bit 827 00:35:27,120 --> 00:35:28,390 and hear him again. 828 00:35:28,390 --> 00:35:33,970 [AUDIO PLAYBACK] 829 00:35:33,970 --> 00:35:37,420 - The phrase in English, it says the skunk was rolling 830 00:35:37,420 --> 00:35:41,500 and accidentally get cut by the throat. 831 00:35:41,500 --> 00:35:46,296 In Xhosa or in isiXhosa, [SPEAKING Xhosa]. 832 00:35:51,540 --> 00:35:52,410 [END PLAYBACK] 833 00:35:52,410 --> 00:35:54,035 NANCY KANWISHER: OK, for the most part, 834 00:35:54,035 --> 00:35:56,620 we don't have click consonants in English 835 00:35:56,620 --> 00:35:58,090 that count as phonemes in the sense 836 00:35:58,090 --> 00:35:59,680 of distinguishing different words. 837 00:35:59,680 --> 00:36:03,340 But we do have click consonants that we use in other domains. 838 00:36:03,340 --> 00:36:06,520 Anybody know what we use click consonants for?
839 00:36:06,520 --> 00:36:09,230 There's at least two. 840 00:36:09,230 --> 00:36:10,932 Know any click consonants? 841 00:36:10,932 --> 00:36:12,260 STUDENT: [INAUDIBLE] 842 00:36:12,260 --> 00:36:13,427 NANCY KANWISHER: Yeah, what? 843 00:36:13,427 --> 00:36:14,825 STUDENT: [INAUDIBLE] That's-- 844 00:36:14,825 --> 00:36:15,950 NANCY KANWISHER: Like what? 845 00:36:15,950 --> 00:36:18,200 STUDENT: [INAUDIBLE] 846 00:36:18,200 --> 00:36:21,047 NANCY KANWISHER: Yes, but that's a regular consonant. 847 00:36:21,047 --> 00:36:22,130 It's actually not a click. 848 00:36:22,130 --> 00:36:25,250 It's just a regular consonant. 849 00:36:25,250 --> 00:36:30,830 Well, one is when you go, tsk, tsk, tsk, the scolding sound. 850 00:36:30,830 --> 00:36:31,790 It's not a phoneme. 851 00:36:31,790 --> 00:36:35,440 It's not a word, but it has a very particular meaning. 852 00:36:35,440 --> 00:36:41,722 Another one is how you get a horse to giddy up. (CLICKS) 853 00:36:41,722 --> 00:36:43,930 So those are the click consonants we have in English. 854 00:36:43,930 --> 00:36:46,120 They're not phonemes, but we have them, 855 00:36:46,120 --> 00:36:48,340 and he's got a whole lot more. 856 00:36:48,340 --> 00:36:50,770 That was just for fun. 857 00:36:50,770 --> 00:36:54,430 So why is speech perception challenging? 858 00:36:54,430 --> 00:36:58,060 Well, one is the essence of it is that a given speech 859 00:36:58,060 --> 00:37:00,310 sound is highly variable. 860 00:37:00,310 --> 00:37:02,200 One way it's variable is that when 861 00:37:02,200 --> 00:37:05,140 you speak at different rates, all the frequencies 862 00:37:05,140 --> 00:37:07,030 go up and down and haywire, making 863 00:37:07,030 --> 00:37:10,720 them very different across different talking rates. 864 00:37:10,720 --> 00:37:13,460 Another is the context. 
865 00:37:13,460 --> 00:37:17,500 So a given phoneme, like a ba, or a pa sound, or a vowel, 866 00:37:17,500 --> 00:37:19,930 sounds totally different depending on what phonemes 867 00:37:19,930 --> 00:37:21,400 come before and after it. 868 00:37:21,400 --> 00:37:24,490 They're not little punctate, one at a time things. 869 00:37:24,490 --> 00:37:29,435 They all overlap and affect each other in a big mess. 870 00:37:29,435 --> 00:37:31,310 And the third is one we've already mentioned, 871 00:37:31,310 --> 00:37:33,670 which is the big differences across speakers 872 00:37:33,670 --> 00:37:35,140 in the language. 873 00:37:35,140 --> 00:37:39,190 So you have to recognize a ba sound 874 00:37:39,190 --> 00:37:41,410 even though it sounds quite different when 875 00:37:41,410 --> 00:37:44,230 spoken by different speakers. 876 00:37:44,230 --> 00:37:46,900 So all of these things make it very computationally 877 00:37:46,900 --> 00:37:48,820 challenging to understand speech. 878 00:37:48,820 --> 00:37:53,170 Here's an illustration of that talker variability. 879 00:37:53,170 --> 00:37:56,920 So what's shown here is not a whole spectrogram, but just 880 00:37:56,920 --> 00:37:59,320 the intensity of the first formant 881 00:37:59,320 --> 00:38:02,380 and the second formant, those bands of energy 882 00:38:02,380 --> 00:38:04,090 that I showed you in the spectrogram. 883 00:38:04,090 --> 00:38:09,550 And so each dot here is a different person 884 00:38:09,550 --> 00:38:11,890 pronouncing a vowel. 885 00:38:11,890 --> 00:38:15,040 And each color-- this is one vowel here in green, 886 00:38:15,040 --> 00:38:17,260 in that green ellipse, with lots of different people 887 00:38:17,260 --> 00:38:19,210 saying that vowel. 888 00:38:19,210 --> 00:38:21,040 Here's another vowel up here in red, 889 00:38:21,040 --> 00:38:24,670 with lots of different people saying that vowel. 890 00:38:24,670 --> 00:38:29,650 And what you see is they're really overlapping. 
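[EDITOR'S SKETCH] The overlapping ellipses can be mimicked with a toy model. Each vowel is given a hypothetical elliptical region in the space of the first two formants, and a measured point is checked against every region. The centers and radii below are invented numbers, not measured data, and the labels `vowel_a`, `vowel_b`, `vowel_c` are placeholders rather than real vowels.

```python
# Hypothetical vowel regions: ((F1 center, F2 center), (F1 radius, F2 radius)),
# all in Hz. These numbers are invented for illustration only.
VOWEL_REGIONS = {
    "vowel_a": ((500.0, 1500.0), (150.0, 400.0)),
    "vowel_b": ((600.0, 1300.0), (150.0, 400.0)),
    "vowel_c": ((400.0, 1700.0), (150.0, 400.0)),
}

def candidate_vowels(f1, f2, regions=VOWEL_REGIONS):
    """Return every vowel whose ellipse contains the point (f1, f2)."""
    matches = []
    for vowel, ((c1, c2), (r1, r2)) in regions.items():
        if ((f1 - c1) / r1) ** 2 + ((f2 - c2) / r2) ** 2 <= 1.0:
            matches.append(vowel)
    return matches

# A point in the overlap is consistent with several vowels at once.
ambiguous = candidate_vowels(500.0, 1500.0)

# A point out at the edge of the space fits only one region.
unambiguous = candidate_vowels(300.0, 1900.0)

print(ambiguous)
print(unambiguous)
```

When the first call returns more than one label, the formant values alone do not determine the vowel, which is the talker variability problem in miniature.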
891 00:38:29,650 --> 00:38:32,290 So that means you can't just go from the energy at those two 892 00:38:32,290 --> 00:38:35,500 formants, a point in that space, and know what the vowel is. 893 00:38:35,500 --> 00:38:37,000 What if you were right there? 894 00:38:37,000 --> 00:38:41,050 Well, then it could be any of four different vowels. 895 00:38:41,050 --> 00:38:43,820 So that's the problem of talker variability illustrated 896 00:38:43,820 --> 00:38:44,320 with vowels. 897 00:38:44,320 --> 00:38:45,195 Does that make sense? 898 00:38:48,260 --> 00:38:50,540 I think I just said all of this, blah, blah, blah-- 899 00:38:50,540 --> 00:38:53,210 another classic ill-posed problem in perception. 900 00:38:53,210 --> 00:38:54,870 You're given a point in this space. 901 00:38:54,870 --> 00:38:57,890 How do you tell which vowel it is? 902 00:38:57,890 --> 00:39:02,300 So one way we solve that is that we learn each other's voices. 903 00:39:02,300 --> 00:39:06,290 And we know how a given person pronounces 904 00:39:06,290 --> 00:39:08,720 a given set of vowels or words. 905 00:39:08,720 --> 00:39:11,300 And we use that to constrain what they're saying. 906 00:39:11,300 --> 00:39:12,740 Have you ever noticed, especially 907 00:39:12,740 --> 00:39:14,865 if you meet somebody new-- well, actually, you just 908 00:39:14,865 --> 00:39:16,250 experience this with Test. 909 00:39:16,250 --> 00:39:18,560 When he first speaks, his English is beautiful, 910 00:39:18,560 --> 00:39:21,170 but he's from Zimbabwe and he has kind of Zimbabwe, 911 00:39:21,170 --> 00:39:22,520 British-type accent. 912 00:39:22,520 --> 00:39:24,770 And at first it's hard to understand what he's saying. 913 00:39:24,770 --> 00:39:26,570 Did you all experience that briefly? 914 00:39:26,570 --> 00:39:29,108 I mean, that's why I put the text on the slide, 915 00:39:29,108 --> 00:39:31,400 so you would get used to his English and understand it. 
916 00:39:31,400 --> 00:39:33,800 If I hadn't, you probably wouldn't have understood 917 00:39:33,800 --> 00:39:35,060 that sentence he spoke first. 918 00:39:35,060 --> 00:39:37,430 That's because we don't know his voice yet. 919 00:39:37,430 --> 00:39:39,870 But did you notice, after even just a few words, 920 00:39:39,870 --> 00:39:42,620 you start to like tune right in and you can understand him? 921 00:39:42,620 --> 00:39:45,500 So learning about an individual's voice 922 00:39:45,500 --> 00:39:48,230 helps you pull apart the properties of the voice, 923 00:39:48,230 --> 00:39:50,270 and unconfound them from the sound 924 00:39:50,270 --> 00:39:52,550 so you can understand what that person is saying. 925 00:39:56,180 --> 00:39:59,500 So that's part of how we solve this ill-posed problem. 926 00:39:59,500 --> 00:40:02,500 And so evidence that we do that is 927 00:40:02,500 --> 00:40:07,090 that if you have people listen to voices they don't know 928 00:40:07,090 --> 00:40:09,460 or voices that are changing from word to word, 929 00:40:09,460 --> 00:40:11,650 it's much harder to understand speech. 930 00:40:11,650 --> 00:40:14,350 So you imagine you took the sentence I'm saying right now, 931 00:40:14,350 --> 00:40:17,050 and you spliced in a different person saying each word. 932 00:40:17,050 --> 00:40:19,540 Actually, I should make that demo. 933 00:40:19,540 --> 00:40:21,010 One of you guys send me an email-- 934 00:40:21,010 --> 00:40:22,980 make that demo of a different person speaking 935 00:40:22,980 --> 00:40:23,980 each word in a sentence. 936 00:40:23,980 --> 00:40:25,810 It'd be really hard to understand. 937 00:40:25,810 --> 00:40:27,570 Because you wouldn't have been able to fix 938 00:40:27,570 --> 00:40:28,945 this was a property of the voice, 939 00:40:28,945 --> 00:40:31,030 now we can kind of separate that from everything else. 940 00:40:31,030 --> 00:40:33,238 Because the damn voice will be changing on each word. 
941 00:40:33,238 --> 00:40:34,930 It'll be a mess. 942 00:40:34,930 --> 00:40:37,700 So that's one problem. 943 00:40:37,700 --> 00:40:42,470 So it turns out that the opposite is true, as well. 944 00:40:42,470 --> 00:40:47,320 And that is, your ability to recognize somebody's voice 945 00:40:47,320 --> 00:40:51,230 is a function of what you know about that language. 946 00:40:51,230 --> 00:40:54,320 So you can recognize voices better in a language 947 00:40:54,320 --> 00:40:57,740 you know than a language you don't because you're 948 00:40:57,740 --> 00:40:58,700 doing the opposite. 949 00:40:58,700 --> 00:41:01,310 You're using knowledge of the language and its speech 950 00:41:01,310 --> 00:41:05,450 properties that you already know to constrain 951 00:41:05,450 --> 00:41:08,255 the problem of figuring out who is this person's voice. 952 00:41:08,255 --> 00:41:09,380 So does everybody get this? 953 00:41:09,380 --> 00:41:11,990 These two things are affecting each other-- the speaker 954 00:41:11,990 --> 00:41:13,490 and what's being said. 955 00:41:13,490 --> 00:41:15,290 And because they're so confounded, 956 00:41:15,290 --> 00:41:17,300 massively confounded in the stimulus, 957 00:41:17,300 --> 00:41:20,660 to solve that, the more you know about the speaker, 958 00:41:20,660 --> 00:41:22,820 the better you can understand what's being said. 959 00:41:22,820 --> 00:41:25,610 And the more you know about the language and its properties, 960 00:41:25,610 --> 00:41:27,830 the more you can recognize the voice. 961 00:41:27,830 --> 00:41:30,020 Each one is a source of information about one 962 00:41:30,020 --> 00:41:34,100 of those two confounded variables. 963 00:41:34,100 --> 00:41:37,340 And so people have shown that psychophysically. 964 00:41:37,340 --> 00:41:39,050 And I think I have time to do this. 
965 00:41:39,050 --> 00:41:41,840 Here's a kind of cool corollary of this, 966 00:41:41,840 --> 00:41:46,730 and that is, it's commonly thought that dyslexia is most 967 00:41:46,730 --> 00:41:49,880 fundamentally a problem of auditory speech perception, not 968 00:41:49,880 --> 00:41:50,730 a visual problem. 969 00:41:50,730 --> 00:41:52,640 There may also be a bit of a visual problem, 970 00:41:52,640 --> 00:41:54,740 but it's thought that at core, it's 971 00:41:54,740 --> 00:41:58,760 a problem of auditory speech perception. 972 00:41:58,760 --> 00:42:01,430 So if that's true, then you might 973 00:42:01,430 --> 00:42:05,150 think that this ability to use knowledge of the language 974 00:42:05,150 --> 00:42:08,570 and its sounds to constrain voice recognition 975 00:42:08,570 --> 00:42:11,330 would be reduced in people with dyslexia, 976 00:42:11,330 --> 00:42:15,660 because they are less good at processing speech sounds. 977 00:42:15,660 --> 00:42:17,470 And it turns out that's true. 978 00:42:17,470 --> 00:42:21,190 So here's a beautiful study from Gabrieli Lab a few years ago. 979 00:42:21,190 --> 00:42:23,560 So first look at the bars in blue. 980 00:42:23,560 --> 00:42:26,380 So this is accuracy at voice recognition, 981 00:42:26,380 --> 00:42:28,920 which person is speaking. 982 00:42:28,920 --> 00:42:30,780 And this is native English speakers 983 00:42:30,780 --> 00:42:32,100 who don't speak Chinese. 984 00:42:32,100 --> 00:42:34,230 They are much more accurate recognizing 985 00:42:34,230 --> 00:42:36,480 who's speaking when they're speaking English than when 986 00:42:36,480 --> 00:42:39,210 they're speaking Chinese. 987 00:42:39,210 --> 00:42:40,290 So that's kind of cool. 988 00:42:40,290 --> 00:42:41,970 That shows you the way in which you 989 00:42:41,970 --> 00:42:45,480 use knowledge of the language to constrain recognition 990 00:42:45,480 --> 00:42:46,890 of the voice.
991 00:42:46,890 --> 00:42:49,200 But now look what happens in the dyslexics-- 992 00:42:49,200 --> 00:42:52,350 no effect, exactly as they predicted. 993 00:42:52,350 --> 00:42:55,380 Given that the dyslexics have a problem with speech perception, 994 00:42:55,380 --> 00:42:57,150 they're apparently not able to use 995 00:42:57,150 --> 00:43:00,150 that knowledge of the phonemes of the language 996 00:43:00,150 --> 00:43:02,280 to constrain the problem of voice recognition. 997 00:43:02,280 --> 00:43:05,220 They're just as bad at voice recognition-- 998 00:43:05,220 --> 00:43:07,890 I'm sorry, they're no better at voice recognition 999 00:43:07,890 --> 00:43:11,130 in their native language than in a foreign language. 1000 00:43:11,130 --> 00:43:14,070 They can't use that knowledge to constrain voice recognition. 1001 00:43:14,070 --> 00:43:16,220 Does that make sense? 1002 00:43:16,220 --> 00:43:18,920 Yeah, I love that study. 1003 00:43:18,920 --> 00:43:23,240 So we haven't done any brain stuff so far. 1004 00:43:23,240 --> 00:43:26,090 We were just thinking about the problem of hearing and speech 1005 00:43:26,090 --> 00:43:28,550 perception, and what we know from behavior. 1006 00:43:28,550 --> 00:43:30,350 And we've learned a lot already, but we'll 1007 00:43:30,350 --> 00:43:32,600 learn more by looking at the brain, and the meat, 1008 00:43:32,600 --> 00:43:33,390 and all of that. 1009 00:43:33,390 --> 00:43:36,230 So let's start with the ear. 1010 00:43:36,230 --> 00:43:41,150 Again, remember, compressions of air come into the ear. 1011 00:43:41,150 --> 00:43:43,040 They travel through the ear canal. 1012 00:43:43,040 --> 00:43:45,230 They hit the tympanic membrane. 1013 00:43:45,230 --> 00:43:49,160 They go through a whole series of transducers, these three 1014 00:43:49,160 --> 00:43:50,780 little ear bones here that connect 1015 00:43:50,780 --> 00:43:56,170 to this snail-shaped thing, which is called the "cochlea." 
1016 00:43:56,170 --> 00:43:57,530 Cochlea is really important. 1017 00:43:57,530 --> 00:43:58,780 You should remember that word. 1018 00:43:58,780 --> 00:44:03,130 It's the place where you transduce incoming sound 1019 00:44:03,130 --> 00:44:08,890 into neural impulses, way in there. 1020 00:44:08,890 --> 00:44:10,945 And the cochlea is really cool. 1021 00:44:13,600 --> 00:44:16,360 It's this, as I said, a snail-shaped thing. 1022 00:44:16,360 --> 00:44:20,500 And there are nerve endings all the way along this thing. 1023 00:44:20,500 --> 00:44:23,240 And because of the physics of the cochlea, 1024 00:44:23,240 --> 00:44:26,920 there are different resonant frequencies at different parts 1025 00:44:26,920 --> 00:44:28,700 of this snail. 1026 00:44:28,700 --> 00:44:34,520 So basically, here are some low-frequency sound waves. 1027 00:44:34,520 --> 00:44:37,600 This is the cochlea stretched out with the base and the apex. 1028 00:44:37,600 --> 00:44:38,560 This is the base. 1029 00:44:38,560 --> 00:44:39,880 That's the apex. 1030 00:44:39,880 --> 00:44:45,070 And what you see is the low frequencies 1031 00:44:45,070 --> 00:44:48,850 have transduced some energy at the base of the cochlea, 1032 00:44:48,850 --> 00:44:51,550 and also at the apex. 1033 00:44:51,550 --> 00:44:55,030 But midway range frequencies and high frequencies 1034 00:44:55,030 --> 00:44:56,470 do nothing at the apex. 1035 00:44:56,470 --> 00:45:00,010 This business, there's only physical fluctuations 1036 00:45:00,010 --> 00:45:04,150 happening up here for low frequency sounds. 1037 00:45:04,150 --> 00:45:05,650 So there's little nerve endings here 1038 00:45:05,650 --> 00:45:07,420 that detect those fluctuations up there 1039 00:45:07,420 --> 00:45:10,060 and send those signals up into the brain 1040 00:45:10,060 --> 00:45:11,480 through the auditory nerve. 
1041 00:45:11,480 --> 00:45:15,970 And so in the middle, here or something, 1042 00:45:15,970 --> 00:45:18,820 you have sensitivity to mid-range frequencies, 1043 00:45:18,820 --> 00:45:21,770 not high or low. 1044 00:45:21,770 --> 00:45:25,600 And at the base, it's sensitive more to high frequencies 1045 00:45:25,600 --> 00:45:27,520 than mid or low. 1046 00:45:27,520 --> 00:45:28,570 So everybody get that? 1047 00:45:28,570 --> 00:45:31,780 So basically, the cochlea is doing a Fourier transform 1048 00:45:31,780 --> 00:45:34,090 on the acoustic signal. 1049 00:45:34,090 --> 00:45:39,370 It's taking these compressions of air, and it's just saying, 1050 00:45:39,370 --> 00:45:41,830 let's separate those out into different frequencies, 1051 00:45:41,830 --> 00:45:44,200 just with this physical device. 1052 00:45:44,200 --> 00:45:47,620 It's like a physical Fourier transform that's saying, 1053 00:45:47,620 --> 00:45:50,110 let's just physically separate the energy 1054 00:45:50,110 --> 00:45:53,470 at each frequency range along the length of the cochlea. 1055 00:45:53,470 --> 00:45:55,520 Does that make sense? 1056 00:45:55,520 --> 00:45:58,030 And then once you get different parts of the cochlea that 1057 00:45:58,030 --> 00:46:00,580 are sensitive to different frequencies oscillating 1058 00:46:00,580 --> 00:46:03,700 to different degrees, then you stick some nerve cells there 1059 00:46:03,700 --> 00:46:07,120 to pick up those oscillations, go up the auditory nerve, 1060 00:46:07,120 --> 00:46:08,980 and travel into the brain. 1061 00:46:08,980 --> 00:46:10,815 Everybody have a gist of how this works? 1062 00:46:13,560 --> 00:46:14,580 So that's cool. 1063 00:46:14,580 --> 00:46:16,380 But now, let's go up to the brain. 1064 00:46:16,380 --> 00:46:18,880 So now, this is a view like this. 
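The "physical Fourier transform" idea above can be sketched in a few lines. This is only an illustration, not a model of real cochlear mechanics: the three "places" and their best frequencies (`PLACES_HZ`) are made-up values, and each place is treated as a perfectly sharp detector for one frequency, which the real cochlea is not.

```python
import math

def energy_at(signal, sample_rate, freq_hz):
    """Energy of `signal` at one probe frequency (a single Fourier projection)."""
    step = 2 * math.pi * freq_hz / sample_rate
    re = sum(s * math.cos(step * n) for n, s in enumerate(signal))
    im = sum(s * math.sin(step * n) for n, s in enumerate(signal))
    return (re * re + im * im) / len(signal) ** 2

SAMPLE_RATE = 16000
N = 1600  # 0.1 s of sound

# Hypothetical best frequencies for three places along the cochlea.
PLACES_HZ = {"apex": 200, "middle": 1000, "base": 4000}

# A sound containing a low tone and a high tone, but nothing mid-range.
sound = [
    math.sin(2 * math.pi * 200 * n / SAMPLE_RATE)
    + math.sin(2 * math.pi * 4000 * n / SAMPLE_RATE)
    for n in range(N)
]

# One energy value per place: the sound has been split up by frequency.
energies = {place: energy_at(sound, SAMPLE_RATE, hz) for place, hz in PLACES_HZ.items()}
for place, energy in energies.items():
    print(f"{place:>6}: {energy:.3f}")
```

Run on this two-tone sound, the apex and base entries carry the energy and the middle entry is essentially zero, which is the sense in which the signal has been separated into frequencies along the length of the cochlea.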
1065 00:46:18,880 --> 00:46:20,710 And so here are the cochleae-- 1066 00:46:20,710 --> 00:46:22,380 I guess that's the plural-- 1067 00:46:22,380 --> 00:46:25,800 on each side-- ears, ear canal, cochleae. 1068 00:46:25,800 --> 00:46:28,530 And the first thing to know, which is important, 1069 00:46:28,530 --> 00:46:32,250 is that the path between the cochlea and the first step 1070 00:46:32,250 --> 00:46:35,640 up in the cortex is much more complicated in hearing 1071 00:46:35,640 --> 00:46:38,040 than it is in vision. 1072 00:46:38,040 --> 00:46:40,530 Look at all these nuclei deep down 1073 00:46:40,530 --> 00:46:43,050 in the basement of the brain. 1074 00:46:43,050 --> 00:46:46,380 In contrast, in vision, how many synapses 1075 00:46:46,380 --> 00:46:48,120 do you have to make between the retina 1076 00:46:48,120 --> 00:46:49,365 and primary visual cortex? 1077 00:46:53,160 --> 00:46:54,480 Sorry. 1078 00:46:54,480 --> 00:46:55,470 One synapse. 1079 00:46:55,470 --> 00:46:56,110 Right? 1080 00:46:56,110 --> 00:46:57,030 STUDENT: Well, I was thinking-- 1081 00:46:57,030 --> 00:46:58,560 NANCY KANWISHER: Yeah, two, that's 1082 00:46:58,560 --> 00:47:01,290 right, so retinal ganglion cells send their axons straight 1083 00:47:01,290 --> 00:47:04,290 into the LGN in the thalamus, make a synapse. 1084 00:47:04,290 --> 00:47:06,480 And then those LGN neurons go straight 1085 00:47:06,480 --> 00:47:09,420 up to primary visual cortex, just one stop on the way. 1086 00:47:09,420 --> 00:47:13,380 Look at all the stops on the way here. 1087 00:47:13,380 --> 00:47:15,420 So audition is a really different beast 1088 00:47:15,420 --> 00:47:17,460 from vision in many ways.
1089 00:47:17,460 --> 00:47:20,040 Next time, we'll talk about how audition-- 1090 00:47:20,040 --> 00:47:22,950 not these parts of it, but after you get up to the cortex-- 1091 00:47:22,950 --> 00:47:26,970 audition, we in my lab and a few other labs 1092 00:47:26,970 --> 00:47:28,560 are really starting to suspect, is 1093 00:47:28,560 --> 00:47:33,880 profoundly different in humans from any non-human animal. 1094 00:47:33,880 --> 00:47:35,880 And I think that's for very interesting reasons, 1095 00:47:35,880 --> 00:47:38,160 but this part is pretty similar in animals, 1096 00:47:38,160 --> 00:47:40,560 just getting information up to the cortex. 1097 00:47:40,560 --> 00:47:43,740 And audition is already very different from vision just 1098 00:47:43,740 --> 00:47:46,600 in the number of relays going up to the brain. 1099 00:47:46,600 --> 00:47:50,400 So those structures down there do all kinds of awesome things. 1100 00:47:50,400 --> 00:47:52,110 And last year, I talked at great length 1101 00:47:52,110 --> 00:47:54,570 about how we detect the locations of sounds. 1102 00:47:54,570 --> 00:47:57,480 It's absolutely beautiful work, and elegant, and fun, 1103 00:47:57,480 --> 00:48:00,060 but I decided that was a little too much behavior. 1104 00:48:00,060 --> 00:48:01,680 We should get on to the brain. 1105 00:48:01,680 --> 00:48:05,430 But I recommend 9.35 if you want to learn more about audition-- 1106 00:48:05,430 --> 00:48:06,270 awesome course. 1107 00:48:06,270 --> 00:48:07,110 Did you take it? 1108 00:48:07,110 --> 00:48:08,370 Really awesome course. 1109 00:48:08,370 --> 00:48:09,700 Yeah, exactly. 1110 00:48:09,700 --> 00:48:12,450 And so you'll learn more about all that stuff. 1111 00:48:12,450 --> 00:48:14,820 So instead, we will just skip all that 1112 00:48:14,820 --> 00:48:18,030 and go straight up to cortex. 
1113 00:48:18,030 --> 00:48:20,940 So the first place that auditory information 1114 00:48:20,940 --> 00:48:23,790 hits the cortex coming up from the cochleae 1115 00:48:23,790 --> 00:48:26,910 is primary auditory cortex, just like the first place 1116 00:48:26,910 --> 00:48:30,010 visual information hits the cortex coming up from the eyes 1117 00:48:30,010 --> 00:48:32,730 is primary visual cortex. 1118 00:48:32,730 --> 00:48:37,590 So you can see in here that in a cross-sectional view like that, 1119 00:48:37,590 --> 00:48:39,280 this is primary auditory cortex. 1120 00:48:39,280 --> 00:48:41,592 It's in that sulcus right there. 1121 00:48:41,592 --> 00:48:43,050 That's kind of a drag, because when 1122 00:48:43,050 --> 00:48:46,110 we get occasional opportunities to test patients 1123 00:48:46,110 --> 00:48:50,010 who have grids of electrodes on the surface of their brain, 1124 00:48:50,010 --> 00:48:51,570 the grids don't usually go in there 1125 00:48:51,570 --> 00:48:54,900 and we can't see primary auditory cortex. 1126 00:48:54,900 --> 00:48:56,400 Although there are new methods where 1127 00:48:56,400 --> 00:48:58,110 they stick depth electrodes, which 1128 00:48:58,110 --> 00:49:00,570 is surprisingly, apparently, better on the patients. 1129 00:49:00,570 --> 00:49:03,180 And right now your TA, Dana [? Bobinger, ?] 1130 00:49:03,180 --> 00:49:08,160 is over at Children's Hospital recording from a 19-year-old 1131 00:49:08,160 --> 00:49:13,250 who has bad epilepsy and who has depth electrodes in his brain. 1132 00:49:13,250 --> 00:49:15,000 And he's listening to all kinds of sounds. 1133 00:49:15,000 --> 00:49:17,970 And she's recorded his neural activity with depth electrodes. 
1134 00:49:17,970 --> 00:49:20,287 And so we are hopeful, one, that we 1135 00:49:20,287 --> 00:49:21,870 can find some information that will be 1136 00:49:21,870 --> 00:49:23,670 relevant to the neurosurgeons-- 1137 00:49:23,670 --> 00:49:25,140 I don't know about that-- 1138 00:49:25,140 --> 00:49:27,568 but two, that we'll get some information 1139 00:49:27,568 --> 00:49:29,610 from those deep structures that you can't usually 1140 00:49:29,610 --> 00:49:31,777 see when you have just grids sitting on the surface. 1141 00:49:34,650 --> 00:49:37,390 So back to functional MRI-- 1142 00:49:37,390 --> 00:49:39,750 so this is primary auditory cortex. 1143 00:49:39,750 --> 00:49:40,937 It's quite stylized. 1144 00:49:40,937 --> 00:49:42,270 Let me remind you where you are. 1145 00:49:42,270 --> 00:49:45,180 This is an inflated view of the right hemisphere-- 1146 00:49:45,180 --> 00:49:48,420 back of the head, front of the head, temporal lobe, all funny 1147 00:49:48,420 --> 00:49:50,520 looking because it's been mathematically unfolded 1148 00:49:50,520 --> 00:49:53,460 so you can see stuff in the sulcus where I just showed you. 1149 00:49:53,460 --> 00:49:55,500 Primary auditory cortex is in the sulcus. 1150 00:49:55,500 --> 00:49:57,850 But we've inflated it so you can see it. 1151 00:49:57,850 --> 00:50:02,110 And so this is primary auditory cortex, this whole thing here. 1152 00:50:02,110 --> 00:50:04,810 And it shows you a property we've talked about before. 1153 00:50:04,810 --> 00:50:09,340 It's got a map, but the map in primary auditory cortex 1154 00:50:09,340 --> 00:50:12,010 is not a map of space like it is in the retina 1155 00:50:12,010 --> 00:50:13,490 for visual information. 1156 00:50:13,490 --> 00:50:16,280 It's a map of frequency. 1157 00:50:16,280 --> 00:50:18,770 And that makes sense because the input transducer 1158 00:50:18,770 --> 00:50:21,440 is a cochlea, which already physically creates 1159 00:50:21,440 --> 00:50:22,568 a map of frequency. 
1160 00:50:22,568 --> 00:50:24,110 And so that gets traveled through all 1161 00:50:24,110 --> 00:50:26,068 those intermediate stages down in the basement, 1162 00:50:26,068 --> 00:50:27,710 and it comes up to the brain, and makes 1163 00:50:27,710 --> 00:50:30,300 a map of frequency space. 1164 00:50:30,300 --> 00:50:32,210 So what this means, actually-- so here's 1165 00:50:32,210 --> 00:50:34,470 sensitivity to different frequencies. 1166 00:50:34,470 --> 00:50:37,760 And so the classic structure of primary auditory cortex 1167 00:50:37,760 --> 00:50:41,330 in humans is high, low, high-- high frequencies, 1168 00:50:41,330 --> 00:50:43,010 low frequencies, high frequencies, 1169 00:50:43,010 --> 00:50:45,950 in that V-shaped pattern. 1170 00:50:45,950 --> 00:50:47,323 So this is the right hemisphere. 1171 00:50:47,323 --> 00:50:48,740 This is the left hemisphere that's 1172 00:50:48,740 --> 00:50:51,110 been mirror flipped so you can compare them directly. 1173 00:50:51,110 --> 00:50:54,710 And you can see this highly stereotyped pattern of high, 1174 00:50:54,710 --> 00:50:55,310 low, high. 1175 00:50:55,310 --> 00:50:56,720 That's a tonotopic map. 1176 00:50:56,720 --> 00:50:59,810 Everybody clear on what a tonotopic map is? 1177 00:50:59,810 --> 00:51:02,510 And we've just discretized it into two chunks, 1178 00:51:02,510 --> 00:51:05,810 but it's actually a gradient of high to low to high, 1179 00:51:05,810 --> 00:51:08,150 which you can kind of see by those intermediate colors 1180 00:51:08,150 --> 00:51:08,870 in there. 1181 00:51:08,870 --> 00:51:10,690 Yeah. 1182 00:51:10,690 --> 00:51:18,042 STUDENT: [INAUDIBLE] why does the [INAUDIBLE]? 1183 00:51:22,240 --> 00:51:24,970 NANCY KANWISHER: Yeah, everything in the brain 1184 00:51:24,970 --> 00:51:28,390 rearranges everything in the input in multiple ways.
1185 00:51:28,390 --> 00:51:30,550 So we didn't talk about this, but in visual cortex, 1186 00:51:30,550 --> 00:51:31,360 you have-- 1187 00:51:31,360 --> 00:51:33,520 I don't know what the latest count is, at least 10, 1188 00:51:33,520 --> 00:51:36,640 probably more than that, separate retinotopic maps 1189 00:51:36,640 --> 00:51:39,460 in different patches of cortex-- map, map, map, map, loads 1190 00:51:39,460 --> 00:51:40,630 of them. 1191 00:51:40,630 --> 00:51:43,960 And so there's all kinds of transformations. 1192 00:51:43,960 --> 00:51:48,700 And so much less is known about the functional responses and 1193 00:51:48,700 --> 00:51:50,920 functional organization of auditory cortex 1194 00:51:50,920 --> 00:51:55,000 than visual cortex, especially in humans where we really 1195 00:51:55,000 --> 00:51:57,010 don't know a lot, in fact. 1196 00:51:57,010 --> 00:51:59,770 So there's no real answer to that, other 1197 00:51:59,770 --> 00:52:04,965 than it's not that shocking, in a way, 1198 00:52:04,965 --> 00:52:07,090 because you see that in vision and in other domains 1199 00:52:07,090 --> 00:52:10,120 anyway, with multiple maps that differentially represent 1200 00:52:10,120 --> 00:52:12,380 different parts of space. 1201 00:52:12,380 --> 00:52:14,080 And so yeah, I didn't say this, but many 1202 00:52:14,080 --> 00:52:18,160 of those dozen or so maps in visual cortex 1203 00:52:18,160 --> 00:52:21,988 have differential representation of different parts of space. 1204 00:52:21,988 --> 00:52:23,530 Some focus on the upper visual field, 1205 00:52:23,530 --> 00:52:25,150 some on the lower visual field. 1206 00:52:25,150 --> 00:52:29,470 And the whole question of is that really one thing 1207 00:52:29,470 --> 00:52:31,030 or is it two-- 1208 00:52:31,030 --> 00:52:33,100 this is all now getting into the kind 1209 00:52:33,100 --> 00:52:38,130 of cutting-edge, ambiguous state that we don't know. 
1210 00:52:38,130 --> 00:52:40,540 All right, everybody clear on tonotopy, primary auditory 1211 00:52:40,540 --> 00:52:41,040 cortex? 1212 00:52:41,040 --> 00:52:42,300 OK, good. 1213 00:52:42,300 --> 00:52:46,590 All right, the standard view from recording neurons 1214 00:52:46,590 --> 00:52:50,130 in primary auditory cortex in animals-- 1215 00:52:50,130 --> 00:52:54,030 monkeys, ferrets are big in auditory neuroscience, 1216 00:52:54,030 --> 00:52:56,280 other animals-- 1217 00:52:56,280 --> 00:52:58,770 is that the receptive fields of individual 1218 00:52:58,770 --> 00:53:01,110 neurons in primary auditory cortex 1219 00:53:01,110 --> 00:53:04,680 are linear filters in the following sense-- 1220 00:53:04,680 --> 00:53:06,780 so here's a spectrogram of a sound. 1221 00:53:06,780 --> 00:53:08,700 This is just a description of the stimulus. 1222 00:53:08,700 --> 00:53:11,790 As usual, time, frequency. 1223 00:53:11,790 --> 00:53:14,430 So it looks like it could be a speech sound with some vowels 1224 00:53:14,430 --> 00:53:14,670 there. 1225 00:53:14,670 --> 00:53:15,920 Or it might be something else. 1226 00:53:15,920 --> 00:53:17,850 Who knows. 1227 00:53:17,850 --> 00:53:19,800 So that's a sound. 1228 00:53:19,800 --> 00:53:25,020 So now, imagine an electrode sitting 1229 00:53:25,020 --> 00:53:28,770 next to a single neuron in primary auditory 1230 00:53:28,770 --> 00:53:32,100 cortex in, say, a ferret listening to that sound, 1231 00:53:32,100 --> 00:53:35,310 and characterizing what does that neuron respond to. 1232 00:53:35,310 --> 00:53:37,740 Well, the typical finding is that neurons 1233 00:53:37,740 --> 00:53:40,890 in primary auditory cortex have what are 1234 00:53:40,890 --> 00:53:45,090 known as spectrotemporal receptive fields, or STRFs 1235 00:53:45,090 --> 00:53:47,160 to their friends. 1236 00:53:47,160 --> 00:53:48,490 So what does that mean?
1237 00:53:48,490 --> 00:53:51,870 Here's an example of the receptive field that 1238 00:53:51,870 --> 00:53:55,830 is the response dependence of a given auditory cell, 1239 00:53:55,830 --> 00:53:59,680 again, with time on this axis and frequency on that axis. 1240 00:53:59,680 --> 00:54:03,795 So what kind of sound does that cell like? 1241 00:54:09,239 --> 00:54:11,168 Can you see just by looking at this? 1242 00:54:11,168 --> 00:54:11,960 What kind of sound? 1243 00:54:14,550 --> 00:54:15,840 STUDENT: Increasing frequency. 1244 00:54:15,840 --> 00:54:17,970 NANCY KANWISHER: Increasing frequency, yeah, 1245 00:54:17,970 --> 00:54:20,280 something like that right. 1246 00:54:20,280 --> 00:54:22,950 Here's one that also likes increasing frequency, 1247 00:54:22,950 --> 00:54:26,550 but slower, shallower increasing frequency. 1248 00:54:26,550 --> 00:54:28,890 Here's one that likes decreasing frequency. 1249 00:54:28,890 --> 00:54:31,080 Now, you may be wondering what the stripes are. 1250 00:54:31,080 --> 00:54:33,070 We didn't talk about this in visual cortex, 1251 00:54:33,070 --> 00:54:35,010 but this is a common property, that it 1252 00:54:35,010 --> 00:54:37,320 likes this particular set of frequencies 1253 00:54:37,320 --> 00:54:41,910 here, but is inhibited by adjacent frequencies. 1254 00:54:41,910 --> 00:54:45,660 So you also see something like that with orientation tuning 1255 00:54:45,660 --> 00:54:49,440 in primary visual cortex. 1256 00:54:49,440 --> 00:54:53,700 And so here, these ones are changing faster, 1257 00:54:53,700 --> 00:54:55,750 both increasing and decreasing. 1258 00:54:55,750 --> 00:54:58,950 So the idea is primary auditory cortex in animals, 1259 00:54:58,950 --> 00:55:02,580 and presumably in humans, is full of a bunch 1260 00:55:02,580 --> 00:55:05,580 of cells that are basically spectrotemporal filters 1261 00:55:05,580 --> 00:55:06,600 like this. 
1262 00:55:06,600 --> 00:55:09,300 They are picking out changes in frequency 1263 00:55:09,300 --> 00:55:12,420 over time that happen to different degrees, 1264 00:55:12,420 --> 00:55:17,802 and at different rates, and in different frequency ranges. 1265 00:55:17,802 --> 00:55:19,260 Does that make sense, more or less? 1266 00:55:19,260 --> 00:55:20,385 Yes, [INAUDIBLE] 1267 00:55:20,385 --> 00:55:21,510 STUDENT: I have a question. 1268 00:55:21,510 --> 00:55:22,427 NANCY KANWISHER: Yeah. 1269 00:55:22,427 --> 00:55:29,955 STUDENT: [INAUDIBLE] how would you tell that was [INAUDIBLE]? 1270 00:55:29,955 --> 00:55:32,080 NANCY KANWISHER: Yeah, how do they figure that out? 1271 00:55:32,080 --> 00:55:34,240 I usually spend all this time talking about the design 1272 00:55:34,240 --> 00:55:34,990 of the experiment. 1273 00:55:34,990 --> 00:55:37,970 I just skipped straight to the answer here. 1274 00:55:37,970 --> 00:55:40,720 Well, I don't know exactly what you do, but you probably-- 1275 00:55:40,720 --> 00:55:42,310 I mean, this has been a whole thing 1276 00:55:42,310 --> 00:55:44,630 that went on for decades for people to get at this. 1277 00:55:44,630 --> 00:55:46,380 So I'm guessing that somehow, they 1278 00:55:46,380 --> 00:55:48,130 got into that general space, and then they 1279 00:55:48,130 --> 00:55:51,053 generated stimuli that make all these different sounds. 1280 00:55:51,053 --> 00:55:52,720 And they just run through them, and they 1281 00:55:52,720 --> 00:55:55,620 find, for a given cell, you play all these different sounds. 1282 00:55:55,620 --> 00:56:01,210 You go-- [MAKES SOUNDS], et cetera. 1283 00:56:01,210 --> 00:56:03,250 I'll spare you more imitations. 1284 00:56:03,250 --> 00:56:05,830 You play all these different sounds to the animal 1285 00:56:05,830 --> 00:56:07,690 and you record the response of that neuron.
1286 00:56:07,690 --> 00:56:09,400 And you would find, for example, that it 1287 00:56:09,400 --> 00:56:11,380 responds much more when you play that sound than any 1288 00:56:11,380 --> 00:56:11,963 of the others. 1289 00:56:15,305 --> 00:56:16,180 Does that make sense? 1290 00:56:16,180 --> 00:56:17,897 STUDENT: No, it makes sense. 1291 00:56:17,897 --> 00:56:19,980 NANCY KANWISHER: But how do they ever hit on that? 1292 00:56:19,980 --> 00:56:22,640 STUDENT: No, what I was asking is 1293 00:56:22,640 --> 00:56:26,460 that are they using separate [INAUDIBLE]? 1294 00:56:29,430 --> 00:56:31,740 NANCY KANWISHER: Oh, the red and the blue? 1295 00:56:31,740 --> 00:56:34,860 How exactly they got-- rather than just the simple thing with 1296 00:56:34,860 --> 00:56:36,360 just that-- 1297 00:56:36,360 --> 00:56:39,870 how exactly they arrived on that, I'm not totally sure. 1298 00:56:42,770 --> 00:56:45,380 I mean, there are mathematical reasons why it makes sense 1299 00:56:45,380 --> 00:56:50,600 to have that whole thing rather than just a single stripe, 1300 00:56:50,600 --> 00:56:51,978 that I think are beyond the scope 1301 00:56:51,978 --> 00:56:53,270 of this lecture for the moment. 1302 00:56:53,270 --> 00:56:57,770 But anyway, it wasn't just a totally arbitrary thing to try. 1303 00:56:57,770 --> 00:56:59,990 Those are particularly useful kind 1304 00:56:59,990 --> 00:57:05,490 of receptive fields for representing the input. 1305 00:57:05,490 --> 00:57:07,800 So everybody sort of clear, approximately, 1306 00:57:07,800 --> 00:57:08,550 what this idea is? 1307 00:57:08,550 --> 00:57:10,920 So it's very low-level basic, just 1308 00:57:10,920 --> 00:57:13,590 are the frequencies going up or down, and which range, 1309 00:57:13,590 --> 00:57:14,472 and how fast? 1310 00:57:14,472 --> 00:57:15,930 That's what primary auditory cortex 1311 00:57:15,930 --> 00:57:20,055 does, organized in this map, this tonotopic map.
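The STRF idea described above, a cell as a linear filter over a time-by-frequency patch with an excitatory stripe and inhibitory flanks, can be sketched as follows. Everything here is a toy assumption for illustration: the 8-by-8 grid, the weights (+1 on the stripe, -0.5 beside it), and the helper names are invented, and the "response" is just a weighted sum at one time alignment rather than a recorded firing rate.

```python
N = 8  # toy grid: N frequency channels by N time steps

def make_sweep(direction):
    """Toy spectrogram (rows = frequency, columns = time) of one frequency sweep."""
    spec = [[0.0] * N for _ in range(N)]
    for t in range(N):
        f = t if direction == "up" else N - 1 - t
        spec[f][t] = 1.0
    return spec

def make_upward_strf():
    """Excitatory rising stripe (+1.0) flanked by inhibition (-0.5) at adjacent frequencies."""
    strf = [[0.0] * N for _ in range(N)]
    for t in range(N):
        strf[t][t] = 1.0
        for df in (-1, 1):
            if 0 <= t + df < N:
                strf[t + df][t] = -0.5
    return strf

def strf_response(strf, spec):
    """Linear filter: weight every time-frequency bin of the sound and add them up."""
    return sum(w * s for w_row, s_row in zip(strf, spec) for w, s in zip(w_row, s_row))

strf = make_upward_strf()
up_response = strf_response(strf, make_sweep("up"))
down_response = strf_response(strf, make_sweep("down"))
print("upward sweep:  ", up_response)
print("downward sweep:", down_response)
```

Playing a set of sounds through the filter and comparing the outputs mirrors the logic of the experiment: this hypothetical cell responds strongly to the rising sweep and is slightly suppressed by the falling one, because the falling sweep only crosses its inhibitory flanks.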
1312 00:57:25,770 --> 00:57:28,500 So think of primary auditory cortex as just 1313 00:57:28,500 --> 00:57:30,570 this bank, this big set of linear filters 1314 00:57:30,570 --> 00:57:34,170 for particular frequency changes over time. 1315 00:57:34,170 --> 00:57:38,220 So that's all based on data from animals, 1316 00:57:38,220 --> 00:57:40,290 from recording individual neurons. 1317 00:57:40,290 --> 00:57:43,260 But we want to know about humans, not just because that's 1318 00:57:43,260 --> 00:57:46,020 what this course is about, but we want to know about humans. 1319 00:57:46,020 --> 00:57:48,960 I mean, ferrets are nice, but really! 1320 00:57:48,960 --> 00:57:51,820 So is that true for humans? 1321 00:57:51,820 --> 00:57:54,900 Well, Josh McDermott and Sam Norman-Haignere 1322 00:57:54,900 --> 00:57:57,300 just published a paper a few months ago 1323 00:57:57,300 --> 00:57:59,010 in which they addressed this question 1324 00:57:59,010 --> 00:58:00,922 in a really interesting way. 1325 00:58:00,922 --> 00:58:03,130 So here's the logic-- this is a little bit technical. 1326 00:58:03,130 --> 00:58:04,140 I'm trying to give you the gist. 1327 00:58:04,140 --> 00:58:05,010 I hope it works. 1328 00:58:05,010 --> 00:58:06,190 Give it a try. 1329 00:58:06,190 --> 00:58:09,690 So they generated synthetically, computationally, 1330 00:58:09,690 --> 00:58:12,550 what they call "model-matched stimuli." 1331 00:58:12,550 --> 00:58:13,950 So the idea is this-- 1332 00:58:13,950 --> 00:58:16,470 the idea is if you present a natural sound-- 1333 00:58:16,470 --> 00:58:19,620 like a dog barking, or a person speaking, or a toilet flushing, 1334 00:58:19,620 --> 00:58:22,680 just some sound that you would hear in life-- 1335 00:58:22,680 --> 00:58:26,490 and then what they do is they make a synthetic signal that 1336 00:58:26,490 --> 00:58:29,940 matches that sound with respect to those STRFs 1337 00:58:29,940 --> 00:58:31,260 I just showed you.
1338 00:58:31,260 --> 00:58:34,170 That is, if you fed the original sound 1339 00:58:34,170 --> 00:58:37,170 and you fed this synthetic sound into the STRFs, 1340 00:58:37,170 --> 00:58:40,260 you'd get the same thing in the STRFs. 1341 00:58:40,260 --> 00:58:42,330 So this is a way of saying, we're 1342 00:58:42,330 --> 00:58:44,940 assuming that those STRFs are a good description of what 1343 00:58:44,940 --> 00:58:47,370 goes on in A1, so let's test that 1344 00:58:47,370 --> 00:58:49,572 by taking a big, fancy, real-world sound that 1345 00:58:49,572 --> 00:58:51,030 has meaning and people know what it 1346 00:58:51,030 --> 00:58:54,330 is, and let's make a control sound that 1347 00:58:54,330 --> 00:58:58,710 matches the in-STRF properties. 1348 00:58:58,710 --> 00:59:00,660 And let's see if we get the same response 1349 00:59:00,660 --> 00:59:03,870 in the brain in that region. 1350 00:59:03,870 --> 00:59:05,610 If that model is a good description 1351 00:59:05,610 --> 00:59:07,170 of what that region does, then you 1352 00:59:07,170 --> 00:59:08,820 should get a very similar response 1353 00:59:08,820 --> 00:59:11,790 when you give the synthetic sound and the original sound 1354 00:59:11,790 --> 00:59:13,830 that you recorded in the world. 1355 00:59:13,830 --> 00:59:17,010 So they tested this on a STRF-like model, 1356 00:59:17,010 --> 00:59:19,763 like this thing I just described before. 1357 00:59:19,763 --> 00:59:21,930 And so just to show you what these sounds are like-- 1358 00:59:21,930 --> 00:59:24,032 so here's an original sound just recorded 1359 00:59:24,032 --> 00:59:25,365 in the world of somebody typing. 1360 00:59:25,365 --> 00:59:26,032 [AUDIO PLAYBACK] 1361 00:59:26,032 --> 00:59:30,087 [TYPING] 1362 00:59:30,087 --> 00:59:30,670 [END PLAYBACK] 1363 00:59:30,670 --> 00:59:32,830 OK, OK, OK, that's enough. 
1364 00:59:32,830 --> 00:59:35,890 I know it's riveting, but so then they run that 1365 00:59:35,890 --> 00:59:37,000 through their STRF model. 1366 00:59:37,000 --> 00:59:39,010 They get a STRF description and they 1367 00:59:39,010 --> 00:59:42,900 generate a matched stimulus from their STRF description. 1368 00:59:42,900 --> 00:59:43,900 And it sounds like this. 1369 00:59:43,900 --> 00:59:44,567 [AUDIO PLAYBACK] 1370 00:59:44,567 --> 00:59:46,190 [TYPING] 1371 00:59:46,690 --> 00:59:47,758 Pretty good. 1372 00:59:47,758 --> 00:59:49,300 It's kind of hard to tell them apart. 1373 00:59:51,850 --> 00:59:53,320 Sorry, enough. 1374 00:59:53,320 --> 00:59:54,640 [END PLAYBACK] 1375 00:59:54,640 --> 00:59:56,733 All right. 1376 00:59:56,733 --> 00:59:58,150 And you can see their spectrograms 1377 00:59:58,150 --> 01:00:00,070 are really similar. 1378 01:00:00,070 --> 01:00:02,140 So for a textury thing like typing, 1379 01:00:02,140 --> 01:00:04,630 it really captures the essence of what's being heard. 1380 01:00:04,630 --> 01:00:07,420 We're just telling you what these control stimuli 1381 01:00:07,420 --> 01:00:08,373 sound like. 1382 01:00:08,373 --> 01:00:10,540 Let's take another sound, a person walking in heels. 1383 01:00:10,540 --> 01:00:12,310 And you can see all those verticals. 1384 01:00:12,310 --> 01:00:13,300 Those are the clicks. 1385 01:00:13,300 --> 01:00:17,135 Clicks have energy across lots of different frequencies. 1386 01:00:17,135 --> 01:00:18,760 And that's what a vertical line means-- 1387 01:00:18,760 --> 01:00:20,170 it means all those different-- remember, 1388 01:00:20,170 --> 01:00:21,462 this is frequency on this axis. 1389 01:00:21,462 --> 01:00:23,440 So a vertical line means energy at lots 1390 01:00:23,440 --> 01:00:27,580 of different frequencies not organized in harmonics, 1391 01:00:27,580 --> 01:00:29,050 so it's not pitchy. 1392 01:00:29,050 --> 01:00:29,590 Here we go. 
1393 01:00:29,590 --> 01:00:30,516 [AUDIO PLAYBACK] 1394 01:00:30,516 --> 01:00:36,747 [HEELS CLICKING] 1395 01:00:36,747 --> 01:00:37,330 [END PLAYBACK] 1396 01:00:37,330 --> 01:00:40,360 OK, here's the STRF version, the control stimulus. 1397 01:00:40,360 --> 01:00:41,211 [AUDIO PLAYBACK] 1398 01:00:41,211 --> 01:00:47,594 [CLICKING] 1399 01:00:47,594 --> 01:00:48,590 [END PLAYBACK] 1400 01:00:48,590 --> 01:00:52,330 So it captures some of it, but not all of it. 1401 01:00:52,330 --> 01:00:54,130 It captures the sound of each click, 1402 01:00:54,130 --> 01:00:56,110 but not the spacing between. 1403 01:00:56,110 --> 01:00:58,150 So it's getting the local properties, but not 1404 01:00:58,150 --> 01:01:00,340 all of the properties. 1405 01:01:00,340 --> 01:01:01,070 Yeah. 1406 01:01:01,070 --> 01:01:02,730 STUDENT: How did you say-- 1407 01:01:02,730 --> 01:01:03,910 like just the [INAUDIBLE]? 1408 01:01:03,910 --> 01:01:05,452 NANCY KANWISHER: How do they make it? 1409 01:01:05,452 --> 01:01:07,360 I didn't tell you because it's complicated. 1410 01:01:07,360 --> 01:01:10,360 They basically start with pink noise, or white noise, 1411 01:01:10,360 --> 01:01:11,440 or some kind of noise. 1412 01:01:11,440 --> 01:01:13,365 They run it through their STRF thing. 1413 01:01:13,365 --> 01:01:15,490 They run the original sound through the STRF thing. 1414 01:01:15,490 --> 01:01:16,270 They compare them. 1415 01:01:16,270 --> 01:01:17,620 And they say, how are we going to adjust the noise 1416 01:01:17,620 --> 01:01:18,703 to make it more like that? 1417 01:01:18,703 --> 01:01:24,000 And they just iterate a lot, and they end up with these stimuli. 1418 01:01:24,000 --> 01:01:25,800 And you can see just looking at it, 1419 01:01:25,800 --> 01:01:27,258 they ended up with something that's 1420 01:01:27,258 --> 01:01:30,090 pretty similar in terms of the spectrogram. 1421 01:01:30,090 --> 01:01:32,037 Let's listen to a person speaking. 
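[EDITOR'S SKETCH] The recipe just described (start from noise, run both signals through the model, compare, adjust, iterate) can be sketched in a few lines. This is a deliberately simplified stand-in and not the published procedure: the "model" here is an arbitrary random linear map `A` rather than a real STRF filter bank, the "sounds" are random vectors, and the update is plain gradient descent. All names and sizes are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 100, 20

# Hypothetical stand-in for the STRF model: a fixed linear map that keeps
# only 20 numbers about a 100-sample signal, so it discards information.
A = rng.standard_normal((n_features, n_samples))

def model(x):
    return A @ x

original = rng.standard_normal(n_samples)    # stands in for the natural sound
target = model(original)                     # its description in model space

synthetic = rng.standard_normal(n_samples)   # start from noise
step = 1.0 / np.linalg.norm(A, 2) ** 2       # conservative step for this quadratic loss

for _ in range(2000):
    error = model(synthetic) - target        # compare the two in model space
    synthetic -= step * (A.T @ error)        # nudge the noise to shrink the mismatch

feature_gap = np.linalg.norm(model(synthetic) - target)
signal_gap = np.linalg.norm(synthetic - original)
print(feature_gap, signal_gap)
```

After iterating, the synthetic signal produces essentially the same model output as the original while remaining a different signal, because the model throws away most of what distinguishes the two. That is the toy analogue of a control sound that matches the original only in the properties the model measures.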
1422 01:01:32,037 --> 01:01:33,120 Here's the original sound. 1423 01:01:33,120 --> 01:01:33,787 [AUDIO PLAYBACK] 1424 01:01:33,787 --> 01:01:36,840 - Is that art offers a time warp to the past, as well 1425 01:01:36,840 --> 01:01:37,365 as insight. 1426 01:01:37,365 --> 01:01:37,740 [END PLAYBACK] 1427 01:01:37,740 --> 01:01:39,540 NANCY KANWISHER: OK, now I'm going to turn it off. 1428 01:01:39,540 --> 01:01:40,950 Here's the synthetic version. 1429 01:01:40,950 --> 01:01:41,617 [AUDIO PLAYBACK] 1430 01:01:41,617 --> 01:01:46,428 [INAUDIBLE] 1431 01:01:46,428 --> 01:01:47,370 [END PLAYBACK] 1432 01:01:47,370 --> 01:01:49,000 OK, now we've lost something. 1433 01:01:49,000 --> 01:01:51,780 So does everybody see how with keyboard typing, 1434 01:01:51,780 --> 01:01:54,450 it really sounds the same, the synthetic version? 1435 01:01:54,450 --> 01:01:56,280 With walking in heels, kind of, sort of, 1436 01:01:56,280 --> 01:01:59,460 at least locally, but not globally, and with speech, 1437 01:01:59,460 --> 01:02:01,690 we've just totally lost it. 1438 01:02:01,690 --> 01:02:03,690 The stuff that you can capture with a STRF model 1439 01:02:03,690 --> 01:02:06,420 does not capture the full richness of speech. 1440 01:02:06,420 --> 01:02:08,970 There's something more in a speech stimulus 1441 01:02:08,970 --> 01:02:11,812 than you can capture with that just simple STRF model. 1442 01:02:11,812 --> 01:02:13,020 OK, let's listen to a violin. 1443 01:02:13,020 --> 01:02:13,687 [AUDIO PLAYBACK] 1444 01:02:13,687 --> 01:02:18,907 [MUSIC PLAYING] 1445 01:02:18,907 --> 01:02:19,490 [END PLAYBACK] 1446 01:02:19,490 --> 01:02:21,410 OK, what does the STRF model do with that? 1447 01:02:21,410 --> 01:02:22,077 [AUDIO PLAYBACK] 1448 01:02:22,077 --> 01:02:25,147 [MUDDY MUSIC PLAYING] 1449 01:02:25,147 --> 01:02:25,730 [END PLAYBACK] 1450 01:02:25,730 --> 01:02:26,230 I love that. 1451 01:02:26,230 --> 01:02:29,460 It sounds like a sea lion colony. 
1452 01:02:29,460 --> 01:02:32,870 Anyway, so what you see is the STRF model totally 1453 01:02:32,870 --> 01:02:35,150 fails to capture speech and music, 1454 01:02:35,150 --> 01:02:38,210 but it captures textury sounds like that. 1455 01:02:38,210 --> 01:02:44,630 And it loses some of the broader temporal scale information. 1456 01:02:44,630 --> 01:02:48,470 So that's the stimuli. 1457 01:02:48,470 --> 01:02:51,890 Then you scan people listening to these sounds. 1458 01:02:51,890 --> 01:02:55,710 Just pop them in the scanner and play those sounds. 1459 01:02:55,710 --> 01:02:58,910 And so then what they do is they just ask. 1460 01:02:58,910 --> 01:03:01,730 So this is, again, the white outline is primary auditory 1461 01:03:01,730 --> 01:03:04,190 cortex where you have that frequency map, mapped 1462 01:03:04,190 --> 01:03:06,800 in a separate experiment, and just plunk down on the brain 1463 01:03:06,800 --> 01:03:07,300 here. 1464 01:03:07,300 --> 01:03:09,890 We're zooming in on that part of the top of the temporal lobe. 1465 01:03:09,890 --> 01:03:13,250 And so what's shown here is, for each voxel, 1466 01:03:13,250 --> 01:03:16,850 they're showing the correlation of the response of that voxel 1467 01:03:16,850 --> 01:03:21,470 to the original sound and the synthetic, STRF-y sound. 1468 01:03:21,470 --> 01:03:23,360 And what you see is those correlations 1469 01:03:23,360 --> 01:03:26,510 are really high in primary auditory cortex. 1470 01:03:26,510 --> 01:03:29,090 In other words, primary auditory cortex 1471 01:03:29,090 --> 01:03:32,150 responds pretty much the same to the original sound 1472 01:03:32,150 --> 01:03:33,560 and the synthetic sound. 1473 01:03:33,560 --> 01:03:35,630 It doesn't detect that difference. 1474 01:03:35,630 --> 01:03:38,810 But as soon as you get outside of primary auditory cortex, 1475 01:03:38,810 --> 01:03:41,070 you get something totally different. 
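[EDITOR'S SKETCH] The per-voxel comparison described here can be written as a correlation, across sounds, between each voxel's response to the originals and its response to the synthetic versions. The sketch below uses a plain Pearson correlation and tiny invented arrays purely to show the computation; the numbers are not data from the study, the function name `voxelwise_correlation` is hypothetical, and the actual analysis may differ in details this lecture does not cover.

```python
import numpy as np

def voxelwise_correlation(resp_original, resp_synthetic):
    """Pearson r per voxel, computed across sounds.
    Both inputs have shape (n_sounds, n_voxels)."""
    a = resp_original - resp_original.mean(axis=0)
    b = resp_synthetic - resp_synthetic.mean(axis=0)
    return (a * b).sum(axis=0) / np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0))

# Invented toy responses: 4 sounds x 2 voxels.
# Column 0 behaves like a voxel that cannot tell original from synthetic;
# column 1 behaves like a voxel that responds very differently to the two.
resp_original = np.array([[1.0, 1.0],
                          [2.0, 4.0],
                          [3.0, 2.0],
                          [4.0, 3.0]])
resp_synthetic = np.array([[1.1, 3.0],
                           [2.0, 2.0],
                           [2.9, 1.0],
                           [4.2, 4.0]])

r = voxelwise_correlation(resp_original, resp_synthetic)
print(r)
```

A voxel whose correlation is near 1 is treating the original and model-matched sounds as equivalent; a voxel whose correlation is near 0 is not fooled by the match.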
1476 01:03:41,070 --> 01:03:43,190 And so that was exactly the prediction, 1477 01:03:43,190 --> 01:03:45,470 is that model that's being tested here 1478 01:03:45,470 --> 01:03:47,690 is a model of how they thought primary auditory 1479 01:03:47,690 --> 01:03:48,730 cortex worked-- 1480 01:03:48,730 --> 01:03:50,420 a bank of linear filters. 1481 01:03:50,420 --> 01:03:53,510 They test that model by generating a new set of stimuli 1482 01:03:53,510 --> 01:03:56,180 that are matched for those linear filters, 1483 01:03:56,180 --> 01:03:58,100 and they get pretty much the same response 1484 01:03:58,100 --> 01:03:59,730 in primary auditory cortex. 1485 01:03:59,730 --> 01:04:02,850 So check-- that's a good model of primary auditory cortex. 1486 01:04:02,850 --> 01:04:04,815 But also, the blue shows you much lower 1487 01:04:04,815 --> 01:04:05,690 correlation out here. 1488 01:04:05,690 --> 01:04:08,795 It is not a good model of stuff outside of auditory cortex. 1489 01:04:08,795 --> 01:04:09,930 Josh. 1490 01:04:09,930 --> 01:04:12,410 STUDENT: So isn't this kind of self-fulfilling, 1491 01:04:12,410 --> 01:04:17,030 in the sense that I build my synthetic stimuli based 1492 01:04:17,030 --> 01:04:19,340 on these kind of models, and then-- 1493 01:04:19,340 --> 01:04:22,190 NANCY KANWISHER: It is, except the models were all 1494 01:04:22,190 --> 01:04:24,890 based on animal work and this is human brains. 1495 01:04:24,890 --> 01:04:26,128 So this is a way-- 1496 01:04:26,128 --> 01:04:27,170 but that's exactly right. 1497 01:04:27,170 --> 01:04:32,000 It's a way of saying all this work from animals precisely 1498 01:04:32,000 --> 01:04:33,800 characterizing response properties 1499 01:04:33,800 --> 01:04:36,350 of individual neurons, which you can do in animals and mostly 1500 01:04:36,350 --> 01:04:38,600 not in humans, do we think that's 1501 01:04:38,600 --> 01:04:40,460 true of human primary auditory cortex? 1502 01:04:40,460 --> 01:04:41,135 And yes, it is. 
1503 01:04:41,135 --> 01:04:43,010 Does everybody get at least the gist of that? 1504 01:04:43,010 --> 01:04:45,098 I realize I skipped over lots of details 1505 01:04:45,098 --> 01:04:47,015 because I want you to get the general picture. 1506 01:04:49,610 --> 01:04:51,170 Yeah. 1507 01:04:51,170 --> 01:04:54,080 STUDENT: What are they trying to achieve by doing 1508 01:04:54,080 --> 01:04:55,440 this type of [INAUDIBLE]? 1509 01:04:55,440 --> 01:04:58,130 I mean, the hypothesis is that the human 1510 01:04:58,130 --> 01:05:02,673 and the animal auditory cortex is the same? 1511 01:05:02,673 --> 01:05:04,590 NANCY KANWISHER: Primary auditory cortex, yes. 1512 01:05:04,590 --> 01:05:05,540 Yes. 1513 01:05:05,540 --> 01:05:07,850 They're basically testing-- you derive that model 1514 01:05:07,850 --> 01:05:11,880 from the animal work, then you design a test of it, 1515 01:05:11,880 --> 01:05:14,480 which is making those synthetic stimuli. 1516 01:05:14,480 --> 01:05:16,280 And I left this out because actually, I 1517 01:05:16,280 --> 01:05:18,500 don't think they've done that, but presumably, 1518 01:05:18,500 --> 01:05:21,112 if you test those stimuli with single units in ferrets, 1519 01:05:21,112 --> 01:05:22,070 you get the same thing. 1520 01:05:22,070 --> 01:05:23,810 You get very, very similar responses 1521 01:05:23,810 --> 01:05:26,630 in primary auditory cortex to the original sound 1522 01:05:26,630 --> 01:05:29,680 and the synthetic version of it based on the STRF model. 1523 01:05:29,680 --> 01:05:32,180 STUDENT: It's predicated on the assumption that both of them 1524 01:05:32,180 --> 01:05:33,263 are structurally the same. 1525 01:05:33,263 --> 01:05:34,763 NANCY KANWISHER: Well, it's testing. 1526 01:05:34,763 --> 01:05:36,080 It's asking that question. 1527 01:05:36,080 --> 01:05:38,400 It's asking that question. 
1528 01:05:38,400 --> 01:05:42,110 Because I've occasionally in here lamented about how crappy 1529 01:05:42,110 --> 01:05:44,120 our methods are in human cognitive neuroscience. 1530 01:05:44,120 --> 01:05:44,953 I mean, they're fun. 1531 01:05:44,953 --> 01:05:47,340 We can do something, but we hit a wall pretty fast. 1532 01:05:47,340 --> 01:05:49,190 We want to see the actual neural code. 1533 01:05:49,190 --> 01:05:51,530 We don't have spatial and temporal resolution 1534 01:05:51,530 --> 01:05:52,820 at the same time. 1535 01:05:52,820 --> 01:05:54,800 We pretty much only get that in animals. 1536 01:05:54,800 --> 01:05:58,380 We can pretty much only do really careful causal tests 1537 01:05:58,380 --> 01:05:58,880 in animals. 1538 01:05:58,880 --> 01:06:01,790 We can pretty much only see connectivity in a precise way. 1539 01:06:01,790 --> 01:06:04,110 And all these things we can do only in animals. 1540 01:06:04,110 --> 01:06:07,220 And so we need to know if those animal models are 1541 01:06:07,220 --> 01:06:08,232 good models for humans. 1542 01:06:08,232 --> 01:06:09,440 And this is a way to test it. 1543 01:06:09,440 --> 01:06:11,750 And it passed with flying colors. 1544 01:06:11,750 --> 01:06:13,850 Make sense? 1545 01:06:13,850 --> 01:06:17,030 So primary auditory cortex seems in humans 1546 01:06:17,030 --> 01:06:18,860 that it's much like it is in ferrets, 1547 01:06:18,860 --> 01:06:21,980 a bank of linear filters with STRF-y properties. 1548 01:06:25,420 --> 01:06:27,520 What about everything else? 1549 01:06:27,520 --> 01:06:30,220 After all, you guys can hear the difference 1550 01:06:30,220 --> 01:06:33,250 between the original version and the synthetic version 1551 01:06:33,250 --> 01:06:35,410 of the woman talking and the violin. 
1552 01:06:35,410 --> 01:06:37,180 And if I played you all the other stimuli 1553 01:06:37,180 --> 01:06:39,347 of real-world sounds, you could hear the differences 1554 01:06:39,347 --> 01:06:40,880 in many of the other ones as well. 1555 01:06:40,880 --> 01:06:42,190 So what are you doing? 1556 01:06:42,190 --> 01:06:44,140 Well, there's lots of auditory cortex 1557 01:06:44,140 --> 01:06:46,900 beyond primary auditory cortex that could 1558 01:06:46,900 --> 01:06:48,180 represent that difference. 1559 01:06:48,180 --> 01:06:49,930 And what this is suggesting is, whatever's 1560 01:06:49,930 --> 01:06:52,330 going on out here is doing something really different 1561 01:06:52,330 --> 01:06:53,080 with those sounds. 1562 01:06:53,080 --> 01:06:53,818 It is not fooled. 1563 01:06:53,818 --> 01:06:56,110 It does not think the synthetic thing is the same thing 1564 01:06:56,110 --> 01:06:57,242 as the original thing. 1565 01:06:57,242 --> 01:06:58,825 That's what the low correlation means. 1566 01:07:02,210 --> 01:07:04,880 So I'll tell you about just one little patch of cortex 1567 01:07:04,880 --> 01:07:06,410 out there. 1568 01:07:06,410 --> 01:07:10,220 And that is-- again, this is just for reference. 1569 01:07:10,220 --> 01:07:13,580 We've zoomed in again on this is the little code 1570 01:07:13,580 --> 01:07:16,280 for separate mapping of high, low, high, primary auditory 1571 01:07:16,280 --> 01:07:17,990 cortex right there. 1572 01:07:17,990 --> 01:07:22,707 And what the yellow bands are is selective responses to speech. 1573 01:07:22,707 --> 01:07:24,290 So you compare a whole bunch of speech 1574 01:07:24,290 --> 01:07:26,840 sounds to a whole bunch of non-speech sounds, 1575 01:07:26,840 --> 01:07:29,720 and you get a band of activation right 1576 01:07:29,720 --> 01:07:31,910 below primary auditory cortex. 1577 01:07:31,910 --> 01:07:32,884 Yes. 
1578 01:07:32,884 --> 01:07:34,884 STUDENT: I thought the separation was low, high, 1579 01:07:34,884 --> 01:07:37,320 medium [INAUDIBLE]. 1580 01:07:37,320 --> 01:07:39,410 NANCY KANWISHER: High, low, high-- 1581 01:07:39,410 --> 01:07:40,863 I probably said it backwards. 1582 01:07:40,863 --> 01:07:41,780 That would be like me. 1583 01:07:41,780 --> 01:07:44,510 But it's-- wait, wait. 1584 01:07:44,510 --> 01:07:46,880 What the hell is it? 1585 01:07:46,880 --> 01:07:48,620 I'm pretty sure it's high, low, high. 1586 01:07:48,620 --> 01:07:50,938 Let's go back and look. 1587 01:07:50,938 --> 01:07:53,480 I might have screwed it up on the slide or said it backwards, 1588 01:07:53,480 --> 01:07:59,360 but I'm pretty sure it's high, low, high. 1589 01:07:59,360 --> 01:08:02,495 STUDENT: So the low frequency is the [INAUDIBLE]. 1590 01:08:02,495 --> 01:08:05,120 NANCY KANWISHER: Yeah, just like that's the code for frequency, 1591 01:08:05,120 --> 01:08:07,760 right there. 1592 01:08:07,760 --> 01:08:09,400 But ask me those questions because I'm 1593 01:08:09,400 --> 01:08:11,900 very capable of getting things backwards, as you've probably 1594 01:08:11,900 --> 01:08:12,567 already noticed. 1595 01:08:16,069 --> 01:08:20,300 So there is a band of speech-selective cortex just 1596 01:08:20,300 --> 01:08:22,460 outside of primary auditory cortex, 1597 01:08:22,460 --> 01:08:26,479 in that region that we just saw responds differently 1598 01:08:26,479 --> 01:08:31,310 to the original sound and the model-matched synthetic sound. 1599 01:08:31,310 --> 01:08:32,390 So that's pretty cool. 1600 01:08:32,390 --> 01:08:35,029 What do I mean by "speech-selective cortex?" 1601 01:08:35,029 --> 01:08:37,880 What I mean is-- 1602 01:08:37,880 --> 01:08:39,255 this is some of our data.
1603 01:08:39,255 --> 01:08:40,880 I tried to find you someone else's data 1604 01:08:40,880 --> 01:08:44,083 and I went down a 45-minute rabbit hole 1605 01:08:44,083 --> 01:08:45,250 trying to find a nice slide. 1606 01:08:45,250 --> 01:08:46,790 And I just couldn't find a good picture. 1607 01:08:46,790 --> 01:08:48,370 I finally said, screw it, I'll show you my data, 1608 01:08:48,370 --> 01:08:49,495 even though I'm trying to-- 1609 01:08:49,495 --> 01:08:51,245 we're not the only ones who've shown this. 1610 01:08:51,245 --> 01:08:52,390 We just have the best data. 1611 01:08:52,390 --> 01:08:55,510 Other people had tested it with four, five, six conditions. 1612 01:08:55,510 --> 01:08:59,830 We tested it with 165 sounds. 1613 01:08:59,830 --> 01:09:01,870 So this is the magnitude of response 1614 01:09:01,870 --> 01:09:06,790 in that yellow region to 165 different sounds, color coded 1615 01:09:06,790 --> 01:09:09,088 by condition shown down here. 1616 01:09:09,088 --> 01:09:10,630 And so what you see if you look at it 1617 01:09:10,630 --> 01:09:16,040 is all the top sounds are light green and dark green. 1618 01:09:16,040 --> 01:09:21,970 Speech-- notice, importantly, that the response 1619 01:09:21,970 --> 01:09:25,840 is very similar to English speech and foreign speech 1620 01:09:25,840 --> 01:09:29,670 which our subjects do not understand. 1621 01:09:29,670 --> 01:09:32,810 So that tells us that this is not about language. 1622 01:09:32,810 --> 01:09:35,840 This is not about the meaning of a sentence, or syntax, or any 1623 01:09:35,840 --> 01:09:37,170 of that stuff. 1624 01:09:37,170 --> 01:09:39,470 This is about phonemes, the difference 1625 01:09:39,470 --> 01:09:42,157 between a ba and a pa, which you can do on a foreign language, 1626 01:09:42,157 --> 01:09:44,240 even if there's a few phonemes that are different. 1627 01:09:44,240 --> 01:09:45,475 You get most of them. 
1628 01:09:45,475 --> 01:09:46,850 Does everybody get the difference 1629 01:09:46,850 --> 01:09:50,760 between speech and language? 1630 01:09:50,760 --> 01:09:53,010 Amazingly, the senior author of the paper 1631 01:09:53,010 --> 01:09:56,400 you read for last night does not understand that difference. 1632 01:09:56,400 --> 01:09:58,170 He published a beautiful paper. 1633 01:09:58,170 --> 01:10:00,000 Every time he comes here to speak, 1634 01:10:00,000 --> 01:10:02,760 he talks about language, language, language, language. 1635 01:10:02,760 --> 01:10:07,080 And I say, Eddie, have you ever presented a stimulus that's 1636 01:10:07,080 --> 01:10:08,038 in a foreign language? 1637 01:10:08,038 --> 01:10:10,080 He's, like, oh, no, that'd be really interesting. 1638 01:10:10,080 --> 01:10:13,290 It's like, Eddie, until you do that, you don't know if you're 1639 01:10:13,290 --> 01:10:14,730 studying language or speech. 1640 01:10:14,730 --> 01:10:16,800 Oh, yeah, really interesting. 1641 01:10:16,800 --> 01:10:19,112 And then he comes back four years later 1642 01:10:19,112 --> 01:10:21,570 and he doesn't seem to know the difference between language 1643 01:10:21,570 --> 01:10:22,070 and speech. 1644 01:10:22,070 --> 01:10:23,750 I'm, like, hello. 1645 01:10:23,750 --> 01:10:27,033 Anyway, he does beautiful experiments, but it's just-- 1646 01:10:27,033 --> 01:10:28,950 it's a blind spot, or it's a misuse of a word. 1647 01:10:28,950 --> 01:10:30,908 I don't know what it is, but it drives me nuts. 1648 01:10:30,908 --> 01:10:32,550 Can you tell? 1649 01:10:32,550 --> 01:10:34,800 Anyway, you guys get that difference even if Eddie 1650 01:10:34,800 --> 01:10:36,347 doesn't. 1651 01:10:36,347 --> 01:10:37,680 Let's look at some other things. 1652 01:10:37,680 --> 01:10:38,970 How about all this light blue stuff? 1653 01:10:38,970 --> 01:10:41,430 There's a lot of light blue stuff that's almost as high. 
1654 01:10:41,430 --> 01:10:44,130 Oh, that's music with people singing. 1655 01:10:44,130 --> 01:10:46,190 That also has speech. 1656 01:10:46,190 --> 01:10:47,930 The speech is slightly less intelligible 1657 01:10:47,930 --> 01:10:49,305 because it's singing, and there's 1658 01:10:49,305 --> 01:10:52,850 background instrumental music, so it's a little bit lower. 1659 01:10:52,850 --> 01:10:53,780 Oh, what's next? 1660 01:10:53,780 --> 01:10:55,910 We've got some light purple stuff 1661 01:10:55,910 --> 01:10:58,310 and some dark purple stuff. 1662 01:10:58,310 --> 01:11:00,560 This is non-speech vocalizations. 1663 01:11:00,560 --> 01:11:04,790 That's stuff like laughing, and crying, and sighing-- 1664 01:11:04,790 --> 01:11:07,100 pretty similar to speech but not speech. 1665 01:11:07,100 --> 01:11:09,650 It's the next highest thing, but it's well down 1666 01:11:09,650 --> 01:11:12,150 from the speech sounds. 1667 01:11:12,150 --> 01:11:14,122 And then we have dogs barking, and geese, 1668 01:11:14,122 --> 01:11:16,080 and stuff like that, that are yet further down. 1669 01:11:16,080 --> 01:11:18,630 And then we have all kinds of other stuff down there-- 1670 01:11:18,630 --> 01:11:21,930 sirens, and toilets, and stuff like that. 1671 01:11:21,930 --> 01:11:22,620 Yeah. 1672 01:11:22,620 --> 01:11:25,680 STUDENT: Is instrumental music perceived as speech? 1673 01:11:25,680 --> 01:11:27,622 I mean, I can't make out the colors. 1674 01:11:27,622 --> 01:11:28,455 NANCY KANWISHER: No. 1675 01:11:28,455 --> 01:11:31,637 The instrumental music is way down in here. 1676 01:11:31,637 --> 01:11:32,970 Yeah, it's a little hard to see. 1677 01:11:32,970 --> 01:11:35,640 That stuff up there is non-speech vocalizations. 1678 01:11:35,640 --> 01:11:38,250 It's not a perfect slide. 1679 01:11:38,250 --> 01:11:43,080 So that's pretty strong evidence that that band of cortex 1680 01:11:43,080 --> 01:11:44,760 is pretty selective for speech. 
1681 01:11:44,760 --> 01:11:47,650 Everybody get that? 1682 01:11:47,650 --> 01:11:48,150 Yeah. 1683 01:11:48,150 --> 01:11:49,567 STUDENT: So you're saying it's not 1684 01:11:49,567 --> 01:11:51,540 like it doesn't process like the other one, 1685 01:11:51,540 --> 01:11:54,720 so the violin stuff would still be that [INAUDIBLE] 1686 01:11:54,720 --> 01:11:56,700 NANCY KANWISHER: Yeah, right. 1687 01:11:56,700 --> 01:11:57,750 OK, good point. 1688 01:11:57,750 --> 01:12:00,480 Remember when I first showed you the fusiform face area, 1689 01:12:00,480 --> 01:12:03,270 I showed you that time where it's faces are like this, 1690 01:12:03,270 --> 01:12:05,910 staring at dot is like that, looking at objects 1691 01:12:05,910 --> 01:12:07,322 is like this. 1692 01:12:07,322 --> 01:12:09,780 So I said, OK, there's a little bit of a response to things 1693 01:12:09,780 --> 01:12:10,530 that aren't faces. 1694 01:12:10,530 --> 01:12:12,360 It's just much more to faces. 1695 01:12:12,360 --> 01:12:14,040 Now, you guys may not have noticed this 1696 01:12:14,040 --> 01:12:16,770 because it went by kind of fast, but when I showed you 1697 01:12:16,770 --> 01:12:19,740 intracranial data from the fusiform face 1698 01:12:19,740 --> 01:12:21,780 area in that patient who got stimulated there, 1699 01:12:21,780 --> 01:12:24,780 and saw the illusory faces, the intracranial data 1700 01:12:24,780 --> 01:12:28,660 showed zero response to things that are not faces. 1701 01:12:28,660 --> 01:12:32,580 So I think that that's because functional MRI is 1702 01:12:32,580 --> 01:12:35,430 the best we have in spatial resolution in the human brain, 1703 01:12:35,430 --> 01:12:37,800 except when we have intracranial data. 1704 01:12:37,800 --> 01:12:38,760 But it's still blurry. 1705 01:12:38,760 --> 01:12:43,380 It's blurry because there's blood flow and all of that. 1706 01:12:43,380 --> 01:12:46,870 So I would guess the same thing here. 
1707 01:12:46,870 --> 01:12:48,930 In fact, I guess it isn't in the paper you 1708 01:12:48,930 --> 01:12:51,390 read because he didn't have any non-speech sounds, 1709 01:12:51,390 --> 01:12:53,160 but I will show you. 1710 01:12:53,160 --> 01:12:55,590 Dana's recording them right now at Children's Hospital, 1711 01:12:55,590 --> 01:12:58,320 and we have some other ones that I will show you next time, 1712 01:12:58,320 --> 01:13:00,540 of intracranial electrodes. 1713 01:13:00,540 --> 01:13:03,030 And they will be even more selective than that. 1714 01:13:05,467 --> 01:13:06,800 But this is pretty good already. 1715 01:13:06,800 --> 01:13:07,940 Yeah, Nava. 1716 01:13:07,940 --> 01:13:09,770 STUDENT: What's the human non-vocal? 1717 01:13:09,770 --> 01:13:10,190 NANCY KANWISHER: I didn't hear. 1718 01:13:10,190 --> 01:13:10,690 What? 1719 01:13:10,690 --> 01:13:12,770 STUDENT: The human non-vocal? 1720 01:13:12,770 --> 01:13:18,150 NANCY KANWISHER: Oh, that's like clapping, and footsteps, 1721 01:13:18,150 --> 01:13:22,177 and I forget what else, things where you hear it and you know 1722 01:13:22,177 --> 01:13:24,010 that's a person, but it doesn't sound at all 1723 01:13:24,010 --> 01:13:26,260 like speaking or speech. 1724 01:13:26,260 --> 01:13:27,700 So if it was about the meaning, it 1725 01:13:27,700 --> 01:13:29,780 could have been all about the meaning of people, 1726 01:13:29,780 --> 01:13:31,420 could be something telling you there's a person there. 1727 01:13:31,420 --> 01:13:32,080 Deal with it. 1728 01:13:32,080 --> 01:13:33,250 But no, apparently not. 1729 01:13:36,720 --> 01:13:38,940 So we're not the first ones to see this. 1730 01:13:38,940 --> 01:13:40,690 We've just tested it with more conditions. 1731 01:13:40,690 --> 01:13:43,680 So our evidence for selectivity is stronger than everyone 1732 01:13:43,680 --> 01:13:45,600 else's. 
1733 01:13:45,600 --> 01:13:49,680 Given what I've told you today, can you think of a stronger way 1734 01:13:49,680 --> 01:13:51,450 to test this? 1735 01:13:51,450 --> 01:13:55,220 For example, suppose I was worried, 1736 01:13:55,220 --> 01:13:58,500 maybe the frequency composition of the speech 1737 01:13:58,500 --> 01:14:01,200 is different than the non-speech. 1738 01:14:01,200 --> 01:14:02,880 After all, those are just recordings 1739 01:14:02,880 --> 01:14:05,820 of natural sounds in the world that we went out and made, 1740 01:14:05,820 --> 01:14:11,030 or mostly got off the web, someone else made. 1741 01:14:11,030 --> 01:14:14,100 And maybe they differ in really low-level properties. 1742 01:14:14,100 --> 01:14:16,340 And so how do we know that that's really 1743 01:14:16,340 --> 01:14:19,670 speech selectivity, not just selectivity 1744 01:14:19,670 --> 01:14:22,320 for certain frequencies or frequency changes? 1745 01:14:22,320 --> 01:14:22,820 Yes. 1746 01:14:22,820 --> 01:14:25,490 STUDENT: You could run it with the McDermott generate-- 1747 01:14:25,490 --> 01:14:27,740 NANCY KANWISHER: Bingo, absolutely. 1748 01:14:27,740 --> 01:14:28,610 Everybody get that? 1749 01:14:31,810 --> 01:14:34,630 So then we'd know, because those are beautifully 1750 01:14:34,630 --> 01:14:38,320 designed to match all those acoustic properties, 1751 01:14:38,320 --> 01:14:43,820 match the spectrogram for all those lower level properties. 1752 01:14:43,820 --> 01:14:45,520 And McDermott and Norman-Haignere 1753 01:14:45,520 --> 01:14:46,450 have done that. 1754 01:14:46,450 --> 01:14:49,390 And this region does not respond strongly 1755 01:14:49,390 --> 01:14:51,310 to the model-matched version, so it's not 1756 01:14:51,310 --> 01:14:52,690 just the acoustic properties. 1757 01:14:52,690 --> 01:14:53,860 Yeah. 1758 01:14:53,860 --> 01:14:56,210 STUDENT: Can we also do something like [INAUDIBLE] play 1759 01:14:56,210 --> 01:14:57,610 speech backwards?
1760 01:14:57,610 --> 01:15:00,670 NANCY KANWISHER: Yes, people have done that, too. 1761 01:15:00,670 --> 01:15:03,490 It's a little bit complicated, because speech backward 1762 01:15:03,490 --> 01:15:06,610 sounds a lot like speech. 1763 01:15:06,610 --> 01:15:08,800 It's kind of in the intermediate zone. 1764 01:15:08,800 --> 01:15:11,320 So it balances many things, but one, it 1765 01:15:11,320 --> 01:15:14,020 doesn't balance all the acoustic properties. 1766 01:15:14,020 --> 01:15:16,480 So speech has certain onset properties. 1767 01:15:16,480 --> 01:15:18,770 I forget how it goes, but if you play it backwards, 1768 01:15:18,770 --> 01:15:19,478 there's lots of-- 1769 01:15:19,478 --> 01:15:23,050 [MAKING SOUNDS] You've heard backward speech played, right? 1770 01:15:23,050 --> 01:15:27,040 And so the STRF model would respond differently 1771 01:15:27,040 --> 01:15:30,730 to forward and backward speech, whereas the STRF model responds 1772 01:15:30,730 --> 01:15:34,000 the same to the original and the synthetic speech. 1773 01:15:34,000 --> 01:15:36,370 Make sense? 1774 01:15:36,370 --> 01:15:39,700 So there's a very speech-selective patch 1775 01:15:39,700 --> 01:15:41,710 of cortex. 1776 01:15:41,710 --> 01:15:46,840 And it's speech selective, not language selective. 1777 01:15:46,840 --> 01:15:48,310 And of course, we want to know-- 1778 01:15:48,310 --> 01:15:51,140 speech is lots of different things. 1779 01:15:51,140 --> 01:15:53,590 It's what words you're saying. 1780 01:15:53,590 --> 01:15:55,660 It's who's saying it. 1781 01:15:55,660 --> 01:15:57,760 It's your intonation-- are you making 1782 01:15:57,760 --> 01:15:59,440 a statement, or a question, or what 1783 01:15:59,440 --> 01:16:01,270 are you emphasizing in the sentence? 1784 01:16:01,270 --> 01:16:03,100 And it's lots of other things. 1785 01:16:03,100 --> 01:16:05,590 And the paper you read asked that question. 1786 01:16:05,590 --> 01:16:07,270 What's coded here about speech? 
1787 01:16:09,800 --> 01:16:11,810 And so I made a whole bunch of slides 1788 01:16:11,810 --> 01:16:15,005 to explain what the paper said because I thought people 1789 01:16:15,005 --> 01:16:16,130 would have trouble with it. 1790 01:16:16,130 --> 01:16:17,390 And everyone nailed it, so I'm not even 1791 01:16:17,390 --> 01:16:18,432 going to go through them. 1792 01:16:18,432 --> 01:16:20,450 Maybe I'll just show one in closing. 1793 01:16:20,450 --> 01:16:22,730 So one thing a few of you got wrong-- 1794 01:16:22,730 --> 01:16:24,620 and I totally get why, it didn't matter-- 1795 01:16:24,620 --> 01:16:26,940 is that here, this is one patient, 1796 01:16:26,940 --> 01:16:29,390 and this is the bank of electrodes placed 1797 01:16:29,390 --> 01:16:31,340 on the surface of the brain. 1798 01:16:31,340 --> 01:16:33,530 The red bits are the bits where you 1799 01:16:33,530 --> 01:16:35,720 could account for the neural responses in terms 1800 01:16:35,720 --> 01:16:37,130 of any of those models-- 1801 01:16:37,130 --> 01:16:40,340 intonation, speaker identity, sentence, 1802 01:16:40,340 --> 01:16:42,948 or any of the interactions between those things. 1803 01:16:42,948 --> 01:16:44,990 And so that just says that's where the action is, 1804 01:16:44,990 --> 01:16:46,790 is those electrodes there. 1805 01:16:46,790 --> 01:16:51,740 And that graph down here is from only three different-- 1806 01:16:51,740 --> 01:16:54,200 each one is a single electrode, just so you get this. 1807 01:16:54,200 --> 01:16:58,610 So this critical graph here, that shows electrode E1. 1808 01:16:58,610 --> 01:17:01,670 That's one of those electrodes in one patient. 1809 01:17:01,670 --> 01:17:05,750 An electrode is typically 2 millimeters on a side. 1810 01:17:05,750 --> 01:17:10,530 It's probably listening to a few tens of thousands of neurons.
1811 01:17:10,530 --> 01:17:12,590 So it's one or two orders of magnitude 1812 01:17:12,590 --> 01:17:14,383 better than a voxel with functional MRI, 1813 01:17:14,383 --> 01:17:15,800 but it's still averaging over lots 1814 01:17:15,800 --> 01:17:17,750 of neurons, not a single neuron. 1815 01:17:17,750 --> 01:17:24,210 STUDENT: The question [INAUDIBLE] 1816 01:17:24,210 --> 01:17:30,120 averaging over [INAUDIBLE] but it's averaged over [INAUDIBLE]. 1817 01:17:30,120 --> 01:17:33,210 NANCY KANWISHER: Yeah, that was the response of one electrode 1818 01:17:33,210 --> 01:17:34,860 listening to male and female. 1819 01:17:34,860 --> 01:17:37,440 I forget which is which. 1820 01:17:37,440 --> 01:17:39,540 But other than that, you guys totally nailed it. 1821 01:17:39,540 --> 01:17:44,010 And notice how precise, and specific, and fascinatingly 1822 01:17:44,010 --> 01:17:46,530 separated the responses of those electrodes 1823 01:17:46,530 --> 01:17:52,110 are, segregated for pitch contour, or speaker identity, 1824 01:17:52,110 --> 01:17:54,390 or what sentence was being spoken. 1825 01:17:54,390 --> 01:17:56,400 Those things seem to be segregated spatially 1826 01:17:56,400 --> 01:17:58,010 in the brain at a fine grain. 1827 01:17:58,010 --> 01:17:59,760 Whether you'd see it with functional MRI-- 1828 01:17:59,760 --> 01:18:00,840 you might, might not. 1829 01:18:00,840 --> 01:18:03,552 Many of you pointed out we might not have the resolution. 1830 01:18:03,552 --> 01:18:05,010 Think about other methods you might 1831 01:18:05,010 --> 01:18:08,940 use to look for that, even if we didn't have the resolution 1832 01:18:08,940 --> 01:18:11,370 with a simple binary contrast. 1833 01:18:11,370 --> 01:18:13,175 And it's 12:26 and I'm going to stop. 1834 01:18:13,175 --> 01:18:14,550 I will see you guys on Wednesday, 1835 01:18:14,550 --> 01:18:17,390 and we will talk about music.