The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JOSH MCDERMOTT: We're going to get started again. Where we stopped, I had just played you some of the results of this texture synthesis algorithm, and we all agreed that they sounded pretty realistic. The whole point of this was that it gives plausibility to the notion that you could be representing these textures with the sorts of statistics you can compute from a model of what we think encapsulates the signal processing in the early auditory system. And, again, I'll just underscore that the cool thing about doing the synthesis is that there's an infinite number of ways in which it can fail.
By listening to it and convincing yourself that those things actually sound pretty realistic, you get a pretty powerful sense that the representation is capturing most of what you hear when you listen to the natural sound. For instance, we could design a classification algorithm that could discriminate between all these different things, but the representation could still fail to capture all kinds of things that you would hear. By synthesizing, because the synthesis can potentially fail in any of the possible ways, and then listening to observe whether a failure occurs, you get a pretty powerful method.

But one thing you might be concerned about, and this is something that was annoying me, is that what we've done here is impose a whole bunch of statistical constraints. We're measuring this really large set of statistics from the model, and then generating things that have the same values of those statistics.
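That measure-then-impose logic can be sketched in a toy form. This is an illustrative simplification, not the actual synthesis algorithm described in the lecture: it imposes just one statistic, the amplitude histogram, on noise by rank-matching, whereas the real procedure imposes many statistics of sub-band envelopes. The function name is my own.

```python
import numpy as np

def impose_histogram(noise, target):
    # Rank-match: give the noise exactly the amplitude distribution
    # (one simple "statistic") of the target, while keeping the
    # noise's own temporal arrangement.
    out = np.empty_like(noise)
    out[np.argsort(noise)] = np.sort(target)
    return out

rng = np.random.default_rng(0)
target = rng.standard_normal(5000) ** 3        # stand-in "recording" (skewed values)
synth = impose_histogram(rng.standard_normal(5000), target)
```

The synthetic signal shares the measured statistic exactly, yet is a different signal, which is the sense in which synthesis generates new members of a statistically defined class.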
So there's this question of whether any set of statistics will do. We wondered what would happen if we measured statistics from a model that deviates from what we know about the biology of the ear. In particular, you'll remember that the model we set out had a bunch of different stages: an initial stage of bandpass filtering, the process of extracting the envelope and then applying amplitude compression, and then modulation filtering. In each of these cases, there are particular characteristics of the signal processing that are explicitly intended to mimic what we see in biology. In particular, as we noted, the kinds of filter banks that you see in biological systems are better approximated by something that's logarithmically spaced than by something that's linearly spaced. Remember that picture I showed at the start, where we saw that the filters up here were a lot broader than the filters down here. So we can ask: what happens if we swap in a filter bank that's linearly spaced?
That's more closely analogous to an FFT, for instance. Similarly, we can ask what happens if we get rid of the nonlinear function that's applied to the amplitude envelope and make the amplitude response linear instead. And so we did this. You can change the auditory model and play the exact same game: measure statistics from that model, synthesize something from those statistics, and then ask whether the results sound any different.

So we did an experiment. We would play people the original sound, and from that original sound, two synthetic versions: one generated from the statistics of the model that replicates biology as best we know how, and the other from a model that is altered in some way. And we would ask people which of the two synthetic versions sounds more realistic. There are four conditions in this experiment, because we could alter the model in three different ways. We could get rid of amplitude compression (that's the first bar). We could make the cochlear filters linearly spaced. Or we could make the modulation filters linearly spaced.
Or we could do all three, and that's the last condition. What's plotted on this axis (whoops, I gave it away) is the proportion of trials on which people said that the synthesis from the biologically plausible model was more realistic. So if it didn't matter what statistics you use, you should be right at this 50% mark in each of these cases. And as you can see, in every case people report, on average, that the synthesis from the biologically plausible model is more realistic.

I'll give you a couple of examples. Here's crowd noise synthesized from the biologically plausible auditory model.

[CROWD NOISE]

And here's the result of doing the exact same thing but from the altered model. This is from the condition where everything is different, and you'll hear that it just kind of sounds weird.

[CROWD NOISE]

It's kind of garbled in some way. Here's a helicopter synthesized from the biologically plausible model.

[HELICOPTER NOISE]

And here's the one from the altered model.
[HELICOPTER NOISE]

It doesn't sound like the modulations are quite as precise.

So the notion here is this: we're initializing the procedure with noise, so the output is a different sound in every case, sharing only the statistical properties. The statistics that we measure and use to do the synthesis define a class of sounds that includes the original as well as a whole bunch of others, and when you run the synthesis, you're generating one of these other examples. The notion is that if the statistics are measuring what the brain is measuring, then these examples ought to sound like another example of the original sound; you ought to be generating an equivalence class. And the idea is that when you synthesize from statistics of the non-biological model, it's a different set. Again, it's defined by the original, but it contains different things, and they don't sound like the original, presumably because the set is not defined by the measurements that the brain is making.
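The model variants being compared here, log- versus linearly spaced filters and compressive versus linear envelopes, can be sketched roughly as follows. This is a toy stand-in, not the lecture's actual implementation: the FFT-mask bandpass filter, the rectify-and-smooth envelope, and the 0.3 compression exponent are all illustrative choices of mine.

```python
import numpy as np

def center_freqs(n, lo, hi, spacing="log"):
    # Filter-bank center frequencies: logarithmic spacing (roughly
    # cochlea-like) vs. linear spacing (the "FFT-like" variant).
    if spacing == "log":
        return np.geomspace(lo, hi, n)
    return np.linspace(lo, hi, n)

def subband_envelope(x, fc, bw, fs, compress=True):
    # One stage of the toy model: crude bandpass by zeroing FFT bins,
    # envelope by rectify-and-smooth, then power-law compression
    # (exponent 0.3 as a stand-in for cochlear compression).
    f = np.fft.rfftfreq(len(x), 1 / fs)
    X = np.fft.rfft(x)
    X[(f < fc - bw / 2) | (f > fc + bw / 2)] = 0
    band = np.fft.irfft(X, len(x))
    env = np.abs(band)
    k = max(1, int(fs / fc))            # smooth over ~one carrier period
    env = np.convolve(env, np.ones(k) / k, mode="same")
    return env ** 0.3 if compress else env
```

Swapping `spacing="linear"` or `compress=False` gives the altered models whose statistics were used for the comparison syntheses.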
I just mentioned that the procedure will generate a different signal in each case. Here you can see the result of synthesizing from the statistics of a particular recording of waves; these are three different examples. If you inspect them, you can see that they're all different: they have peaks in amplitude in different places, and so on. But on the other hand, they all look the same in the sense that they have the same textural properties. And that's what's supposed to happen.

The fact that you have all of these different signals with the same statistical properties raises an interesting possibility: if the brain is just representing time-averaged statistics, we would predict that different exemplars of a texture ought to be difficult to discriminate. And that's what I'll show you next, an experiment that attempts to test whether this is the case, to test whether you really are representing these textures with statistics that summarize their properties by averaging over time.
In doing so, we're going to take advantage of a really simple statistical phenomenon: statistics measured from small samples are more variable than statistics measured from large samples. That's what's exemplified by the graph here on the bottom. What this graph is plotting is the result of an exercise where we took multiple excerpts of a given texture of a particular duration: 40 milliseconds, 80, 160, 320. We get a whole bunch of different excerpts of that length, and then we measure a particular statistic from each excerpt; in this case, it's a particular cross-correlation coefficient for the envelopes of a pair of sub-bands. We measure that statistic in the different excerpts, and then we look at how variable it is across excerpts. That's summarized with the standard deviation of the statistic, which is what's plotted here on the y-axis. And the point is that when the excerpts are short, the statistics are variable.
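This small-sample effect is easy to demonstrate numerically. A toy sketch: the two correlated noise channels below stand in for the pair of sub-band envelopes just described, and the durations and sampling rate are made-up numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 10000                              # samples per second (arbitrary)

def stat_std(duration_s, n_excerpts=200):
    # Standard deviation, across excerpts of a given duration, of a
    # simple statistic: the correlation between two channels that
    # share a common component (true correlation 0.5).
    n = int(duration_s * fs)
    vals = []
    for _ in range(n_excerpts):
        common = rng.standard_normal(n)
        a = common + rng.standard_normal(n)
        b = common + rng.standard_normal(n)
        vals.append(np.corrcoef(a, b)[0, 1])
    return np.std(vals)
```

Measured from 40 ms excerpts the statistic scatters much more than from 320 ms excerpts, mirroring the downward-sloping curve on the graph.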
You measure the statistic in one excerpt and then another and then another, and you don't get the same thing, so the standard deviation is high. As the excerpt duration increases, the statistics become more consistent. They converge to the true values of the underlying stationary process, and so the standard deviation shrinks. We're going to take advantage of this in the experiments that we'll do.

First, to give plausibility to the notion that people might be able to base judgments on long-term statistics, we asked people to discriminate different textures, that is, things that have different long-term statistics. In the experiment, people would hear three sounds, one of which would be from a particular texture, like rain, and two others which would be different examples of a different texture, like a stream. So you'd hear rain, stream one, stream two. And the task was to say which sound was produced by a different source.
In this case, the answer would be the first one. So we gave people this task, and we manipulated the duration of the excerpts. The notion here, given that graph, is that the statistics are very variable for short excerpts and become more consistent as the excerpt duration gets longer. So if you're basing your judgments on statistics computed across the excerpt, you ought to get better at telling whether the statistics are the same or different as the excerpt duration gets longer. What we're going to plot here is the proportion correct on this task as a function of the excerpt duration. And, indeed, we see that people get better as the duration gets longer. They're not very good when you give them a really short clip, but they get better and better as the duration increases. Now, of course, this is not a particularly exciting result. When you increase the duration, you give people more information.
And on pretty much any story, people ought to be getting better. But it's at least consistent with the notion that you might be basing your judgments on statistics.

The really critical experiment is the next one. In this experiment, we gave people different excerpts of the same texture and asked them to discriminate them. Again, on each trial you hear three sounds, but they're all excerpts from the same texture, and two of them are identical. In this case, the last two are physically identical excerpts of, for instance, rain, and the first one is a different excerpt of rain. You just have to say which one is different from the other two. Now, the null hypothesis here is maybe what you might expect if you gave this task to a computer algorithm that was just limited by sensor noise: as the excerpt duration gets longer, you're giving people more information with which to tell that this one is different from that one. So maybe if you listened to just the beginning, it would be hard, but as you got more information, it would get easier and easier.
If, in contrast, you think that what people represent when they hear these sounds are statistics that summarize the properties over time, well, I've just shown you how the statistics converge to fixed values as the duration increases. So if what people are representing are those statistics, you might, paradoxically, think that as the duration increases, they would get worse at this task.

And that's, in fact, what we find happens. People are good at this task when the excerpts are very short, on the order of 100 milliseconds; they can very easily tell you which of the excerpts is different. And then as the duration gets longer and longer, they get progressively worse and worse. So we think this is consistent with the idea that when you are hearing a texture, once the texture is a couple of seconds long, you're predominantly representing its statistical properties, averaging the properties over time.
And you lose access to the details that differentiate different examples of rain, the exact positions of the raindrops or the clicks of the fire, what have you.

Why should people be unable to discriminate two examples of rain? Well, you might think these textures are just homogeneous, that there's just not enough stuff there to differentiate them. We know that's not true, because if you chop out a little section at random, people can very easily tell you whether it's the same or different. So at a local time scale, the details are very easily discriminable. You might also imagine that what's happening over time is some kind of masking, or that the representation gets blurred together in some strange way. On the other hand, when you give people sounds that have different statistics, you find that they're just great: they get better and better as the stimulus duration increases.
In fact, the fact that they continue to get better seems to indicate that the detail streaming into your ears is being accrued into some representation that you have access to. So what we think is happening is that those details come in and are incorporated into your statistical estimates, but the fact that you can't tell apart these different excerpts means that the details are not otherwise retained. They're accrued into statistics, and then you lose access to the details on their own. The point is that the result as it stands, I think, provides evidence for a representation of time-averaged statistics: when the statistics are different, you can tell things are distinct; when they're the same, you can't. And it relates to this phenomenon of the variability of statistics as a function of sample size.

So, a couple of control experiments that are probably not exactly addressing the question you just raised, but maybe are related.
One obvious possibility is that the reason people are good at the exemplar discrimination when the excerpts are short and bad when they're long might be that your memory is decaying with time. The way we did this experiment originally, there was a fixed inter-stimulus interval; it was the same couple of hundred milliseconds in every case. So to tell that one excerpt is different from another, the bits that you would have to compare are separated by a shorter time interval in the short condition than in the long condition, and if you imagine that memory just decays with time, you might think that would make people worse. So we did a control experiment where we equated the inter-onset interval, so that the elapsed time between the stuff you would have to compare in order to tell whether something was different was the same in the two cases. And that basically makes no difference: you're still a lot better when the excerpts are short than when they're long.
And we went to pretty great lengths to try to help people do this with the long excerpts. You might also wonder: given that you can do this with the short excerpts, and the short excerpts are really just analogous to the very beginning of these longer excerpts, why can't you just listen to the beginning? So we tried to help people do just that. In this condition, we put a little gap between the very beginning of the excerpt and the rest of it, and we told people there's going to be this little segment at the start, just listen for that. And people can't do it. We also did it with the gap at the end, so again you get this little segment of the same length as in the short condition. And here's performance: people are good when it's short and a lot worse when it's longer, and the presence of a gap doesn't really seem to make a difference. So you have great trouble accessing these things.
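The converging-statistics account of this result can be illustrated with a toy observer. Suppose the observer retains only a few time-averaged statistics of each excerpt and calls excerpts "different" when those summaries differ by more than some fixed internal noise. The particular moments and numbers below are my own stand-ins, not the statistics from the lecture's model.

```python
import numpy as np

rng = np.random.default_rng(1)

def summary(x):
    # The toy observer's entire representation of an excerpt:
    # a few time-averaged moments.
    return np.array([np.mean(np.abs(x)), np.std(x), np.mean(x ** 3)])

def mean_summary_distance(n_samples, trials=300):
    # Average distance between the summaries of two different
    # excerpts drawn from the same "texture" process.
    d = []
    for _ in range(trials):
        a = rng.standard_normal(n_samples)
        b = rng.standard_normal(n_samples)
        d.append(np.linalg.norm(summary(a) - summary(b)))
    return np.mean(d)
```

As excerpt length grows, both summaries converge to the process's true values, the distance between them shrinks toward the internal noise floor, and exemplar discrimination gets harder, while two excerpts from processes with different long-term statistics stay separable.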
Another thing that's relevant and related is a set of experiments that resulted from our thinking about the fact that textures are normally generated not by our synthesis algorithm, but by the superposition of lots of different sources. We wondered what would happen to this phenomenon if we varied the number of sources in a texture. So we actually generated textures by superimposing different numbers of sources. In one case we did this with speakers. We wanted to get rid of linguistic effects, so we used German speech and listeners who didn't speak German; it's like a German cocktail party that we're going to generate. So we have one person, like this.

[FEMALE VOICE 1] [SPEAKING GERMAN]

And then 29.

[GROUP VOICE] [SPEAKING GERMAN]

A room full of people speaking German. And we do the exact same experiment, where we give people different exemplars of these textures and ask them to discriminate between them. What's plotted here is the proportion correct as a function of duration.
423 00:16:04,270 --> 00:16:07,186 Here, we've reduced it to just two durations-- short and long. 424 00:16:07,186 --> 00:16:08,560 And there's four different curves 425 00:16:08,560 --> 00:16:10,780 corresponding to different numbers of speakers 426 00:16:10,780 --> 00:16:11,800 in that signal, right. 427 00:16:11,800 --> 00:16:14,560 So the cyan here is what happens with a single speaker. 428 00:16:14,560 --> 00:16:16,310 And so with a single speaker, you actually 429 00:16:16,310 --> 00:16:18,490 get better at doing this as the duration increases. 430 00:16:18,490 --> 00:16:20,370 All right, and so that's, again, consistent 431 00:16:20,370 --> 00:16:22,870 with the null hypothesis that when there's more information, 432 00:16:22,870 --> 00:16:24,286 you're actually going to be better 433 00:16:24,286 --> 00:16:27,190 able to say whether something is the same or different. 434 00:16:27,190 --> 00:16:30,130 But as you increase the number of people at the cocktail 435 00:16:30,130 --> 00:16:30,730 party-- 436 00:16:30,730 --> 00:16:32,540 the density of the signal in some sense-- 437 00:16:32,540 --> 00:16:34,665 you can see that performance for the short excerpts 438 00:16:34,665 --> 00:16:35,590 doesn't really change. 439 00:16:35,590 --> 00:16:37,720 So you retain the ability to say whether these things are 440 00:16:37,720 --> 00:16:38,800 the same or different. 441 00:16:38,800 --> 00:16:40,510 But there's this huge interaction. 442 00:16:40,510 --> 00:16:44,630 And for the long excerpts, you get kind of worse and worse. 443 00:16:44,630 --> 00:16:46,240 So impairment at long durations is 444 00:16:46,240 --> 00:16:48,280 really specific to textures-- doesn't seem to be 445 00:16:48,280 --> 00:16:49,840 present for single sources. 446 00:16:49,840 --> 00:16:52,423 To make sure that phenomenon is not really specific to speech, 447 00:16:52,423 --> 00:16:55,780 we did the exact same thing with synthetic drum hits. 
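Both the multi-talker and drum-hit versions of this manipulation come down to superimposing independently generated events at a chosen rate. Here is an illustrative sketch of how drum-hit textures of different densities could be generated; the decaying noise burst stands in for an actual drum sample, and none of this is the stimulus code actually used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(1)
sr = 16000  # sample rate in Hz (arbitrary choice)

def drum_texture(rate_hz, dur_s=2.0):
    """Random 'drum hits': impulses at Poisson-distributed times,
    convolved with a short decaying noise burst that stands in for
    a real drum sample."""
    n = int(sr * dur_s)
    n_hits = rng.poisson(rate_hz * dur_s)
    impulses = np.zeros(n)
    impulses[rng.integers(0, n, size=n_hits)] = 1.0
    # 50 ms burst with a 10 ms exponential decay.
    t = np.arange(int(0.05 * sr)) / sr
    hit = rng.standard_normal(t.size) * np.exp(-t / 0.01)
    return np.convolve(impulses, hit)[:n]

sparse = drum_texture(5)    # roughly 5 hits per second
dense = drum_texture(50)    # roughly 50 hits per second
```

Raising `rate_hz` is the density manipulation: the waveform operations stay identical, but the result shifts from a sparse sequence of discrete events to a texture-like signal.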
448 00:16:55,780 --> 00:16:59,830 So we just varied the density of a bunch of random drum hits. 449 00:16:59,830 --> 00:17:01,840 Like, here's five hits per second. 450 00:17:01,840 --> 00:17:05,230 [DRUM SOUNDS] 451 00:17:05,230 --> 00:17:05,900 Here's 50. 452 00:17:05,900 --> 00:17:09,588 [DRUM SOUNDS] 453 00:17:09,588 --> 00:17:12,319 All right, and you see the exact same phenomenon. 454 00:17:12,319 --> 00:17:14,650 So for the very sparsest case, you 455 00:17:14,650 --> 00:17:17,619 get better as you go from the short excerpts to the long. 456 00:17:17,619 --> 00:17:19,060 But then as the density increases, 457 00:17:19,060 --> 00:17:20,740 you see this big interaction. 458 00:17:20,740 --> 00:17:22,270 And you get selectively worse here 459 00:17:22,270 --> 00:17:25,099 for the long duration case. 460 00:17:25,099 --> 00:17:28,600 OK, so, again, it's worth pointing out 461 00:17:28,600 --> 00:17:31,360 that the high performance with the short excerpts 462 00:17:31,360 --> 00:17:33,819 indicates that all the stimuli have discriminable variation. 463 00:17:33,819 --> 00:17:35,776 So it's not the case that these things are just 464 00:17:35,776 --> 00:17:37,000 like totally homogeneous, 465 00:17:37,000 --> 00:17:38,970 and that's why you can't do it. 466 00:17:38,970 --> 00:17:41,680 It seems to be a specific problem with retaining 467 00:17:41,680 --> 00:17:47,470 temporal detail when the signals are both long and texture-like. 468 00:17:47,470 --> 00:17:49,130 OK, so what does this mean? 469 00:17:49,130 --> 00:17:49,960 Well, go ahead. 470 00:17:49,960 --> 00:17:50,910 Here's the speculative framework. 471 00:17:50,910 --> 00:17:52,600 And this sort of gets back to these questions 472 00:17:52,600 --> 00:17:54,058 about working memory, and so forth. 473 00:17:54,058 --> 00:17:57,250 And so this is the way that I make sense of this stuff. 
474 00:17:57,250 --> 00:18:01,030 And each one of these things is pure speculation or almost 475 00:18:01,030 --> 00:18:01,810 pure speculation. 476 00:18:01,810 --> 00:18:03,518 But I actually think you need all of them 477 00:18:03,518 --> 00:18:05,380 to really totally make sense of the results. 478 00:18:05,380 --> 00:18:07,088 It's at least interesting to think about. 479 00:18:07,088 --> 00:18:10,690 So I think it's plausible that sounds 480 00:18:10,690 --> 00:18:14,800 are encoded both as sequences of features and with statistics 481 00:18:14,800 --> 00:18:17,450 that average information over time. 482 00:18:17,450 --> 00:18:21,220 And I think that the features with which we encode things 483 00:18:21,220 --> 00:18:25,210 are engineered to be sparse for typical natural sound sources. 484 00:18:25,210 --> 00:18:27,790 But they end up being dense for textures. 485 00:18:27,790 --> 00:18:29,560 So the signal comes in-- you're trying 486 00:18:29,560 --> 00:18:31,809 to model that with a whole bunch of different features 487 00:18:31,809 --> 00:18:34,070 that are in some dictionary you have in your head. 488 00:18:34,070 --> 00:18:36,470 And for a signal like speech, your dictionary features 489 00:18:36,470 --> 00:18:38,470 include things that might be related to phonemes 490 00:18:38,470 --> 00:18:39,100 and so forth. 491 00:18:39,100 --> 00:18:41,320 And so for like a single person talking, 492 00:18:41,320 --> 00:18:42,859 you end up with this representation 493 00:18:42,859 --> 00:18:43,900 that's relatively sparse. 494 00:18:43,900 --> 00:18:46,180 It's got sort of a small number of feature activations. 495 00:18:46,180 --> 00:18:47,380 But when you get a texture, in order 496 00:18:47,380 --> 00:18:48,859 to actually model that signal, you 497 00:18:48,859 --> 00:18:51,025 need lots and lots and lots of feature coefficients, 498 00:18:51,025 --> 00:18:53,590 all right, in order to actually model the signal. 
499 00:18:53,590 --> 00:18:56,440 And my hypothesis would be that memory capacity 500 00:18:56,440 --> 00:18:59,665 places limits on the number of features that can be retained. 501 00:18:59,665 --> 00:19:01,300 All right, so it's not really related 502 00:19:01,300 --> 00:19:03,940 to the duration of signal that you can encode, per se. 503 00:19:03,940 --> 00:19:05,410 It's a limit on the number of coefficients 504 00:19:05,410 --> 00:19:09,490 that you can retain to encode that signal. 505 00:19:09,490 --> 00:19:11,620 And the additional thing I would hypothesize 506 00:19:11,620 --> 00:19:13,990 is that sound is continuously and, this is critical, 507 00:19:13,990 --> 00:19:15,674 obligatorily encoded. 508 00:19:15,674 --> 00:19:17,590 All right, so this stuff comes into your ears. 509 00:19:17,590 --> 00:19:18,964 You're continuously projecting it 510 00:19:18,964 --> 00:19:21,790 onto this dictionary of features that you have-- all right. 511 00:19:21,790 --> 00:19:23,459 And you've got some memory buffer 512 00:19:23,459 --> 00:19:26,000 within which you can hang onto some number of those features. 513 00:19:26,000 --> 00:19:29,020 But then once the memory buffer gets exceeded, 514 00:19:29,020 --> 00:19:30,100 it gets overwritten. 515 00:19:30,100 --> 00:19:33,680 And so you just lose all the stuff that came before. 516 00:19:33,680 --> 00:19:36,070 So when your memory capacity for these feature sequences 517 00:19:36,070 --> 00:19:38,034 is reached, the memory is overwritten 518 00:19:38,034 --> 00:19:38,950 by the incoming sound. 519 00:19:38,950 --> 00:19:42,650 And the only thing you're left with are these statistics. 520 00:19:42,650 --> 00:19:45,610 So I'll give you one last experiment in the texture 521 00:19:45,610 --> 00:19:48,260 domain, and then we'll move on. 
522 00:19:48,260 --> 00:19:51,580 So this is an experiment where we presented people 523 00:19:51,580 --> 00:19:55,240 with an original recording, and then the synthetic version 524 00:19:55,240 --> 00:19:57,520 that we generated from the synthesis algorithm. 525 00:19:57,520 --> 00:19:59,290 And we just ask them to rate the realism 526 00:19:59,290 --> 00:20:00,565 of the synthetic example. 527 00:20:00,565 --> 00:20:03,190 And so this is just a summary of the results of that experiment 528 00:20:03,190 --> 00:20:07,432 where we did this for 170 different sounds. 529 00:20:07,432 --> 00:20:09,640 And this is a histogram of the average realism rating 530 00:20:09,640 --> 00:20:10,990 for each of those 170 sounds. 531 00:20:10,990 --> 00:20:13,406 And there's just two points to take away from this, right. 532 00:20:13,406 --> 00:20:15,744 The first is that there's a big peak up here. 533 00:20:15,744 --> 00:20:17,910 So they rated the realism on a scale of 1 to 7. 534 00:20:17,910 --> 00:20:19,630 And so the big peak, centered 535 00:20:19,630 --> 00:20:22,540 at about 6, means that the synthesis is working pretty well 536 00:20:22,540 --> 00:20:23,745 most of the time. 537 00:20:23,745 --> 00:20:25,372 And that's sort of encouraging. 538 00:20:25,372 --> 00:20:27,080 But there's this other interesting thing, 539 00:20:27,080 --> 00:20:29,329 which is that there's this long tail down here, right. 540 00:20:29,329 --> 00:20:32,560 And what this means is that people are telling us 541 00:20:32,560 --> 00:20:35,350 that this synthetic signal that is statistically 542 00:20:35,350 --> 00:20:37,540 matched to this original recording 543 00:20:37,540 --> 00:20:39,320 doesn't sound anything like it. 544 00:20:39,320 --> 00:20:41,830 And that's really interesting because it's statistically 545 00:20:41,830 --> 00:20:42,830 matched to the original. 546 00:20:42,830 --> 00:20:45,205 So it's matched in all these different dimensions, right. 
547 00:20:45,205 --> 00:20:47,746 And, yet, there's still things that are perceptually missing. 548 00:20:47,746 --> 00:20:49,390 And that tells us that there are things 549 00:20:49,390 --> 00:20:52,210 that are important to the brain that are not in our model. 550 00:20:52,210 --> 00:20:54,910 This is a list of the 15 or so sounds that 551 00:20:54,910 --> 00:20:56,460 got the lowest realism ratings. 552 00:20:56,460 --> 00:20:59,912 And just to make things easy on you, 553 00:20:59,912 --> 00:21:01,120 I'll put labels next to them. 554 00:21:01,120 --> 00:21:03,161 Because by and large, they tend to fall into sort 555 00:21:03,161 --> 00:21:04,900 of three different categories-- 556 00:21:04,900 --> 00:21:08,170 sounds that have some sort of pitch in them. 557 00:21:08,170 --> 00:21:10,630 Sounds that have some kind of rhythmic structure. 558 00:21:10,630 --> 00:21:12,160 And sounds that have reverberation. 559 00:21:12,160 --> 00:21:13,660 And I'll play you these examples, 560 00:21:13,660 --> 00:21:18,570 because they're really kind of spectacular failures. 561 00:21:18,570 --> 00:21:21,070 Here, I'll play the original version and then the synthetic. 562 00:21:21,070 --> 00:21:24,297 [RAILROAD CROSSING SOUNDS] 563 00:21:26,529 --> 00:21:27,570 And here's the synthetic. 564 00:21:27,570 --> 00:21:29,190 I'm just warning you-- it's bad. 565 00:21:29,190 --> 00:21:32,655 [SYNTHETIC RAILROAD CROSSING SOUNDS] 566 00:21:35,630 --> 00:21:38,360 Here's the tapping rhythm-- really simple but-- 567 00:21:38,360 --> 00:21:41,811 [TAPPING RHYTHM SOUNDS] 568 00:21:43,290 --> 00:21:44,490 And the synthetic version. 569 00:21:44,490 --> 00:21:47,934 [SYNTHETIC TAPPING RHYTHM SOUNDS] 570 00:21:49,910 --> 00:21:52,420 All right. 571 00:21:52,420 --> 00:21:54,040 This is what happens if you-- 572 00:21:54,040 --> 00:21:55,780 well, this is not going to work very well because we're 573 00:21:55,780 --> 00:21:56,290 in an auditorium. 
574 00:21:56,290 --> 00:21:57,370 But I'll try it anyways. 575 00:21:57,370 --> 00:21:59,140 This is a recording of somebody running up a stairwell that's 576 00:21:59,140 --> 00:21:59,931 pretty reverberant. 577 00:21:59,931 --> 00:22:03,325 [STAIR STEP SOUNDS] 578 00:22:05,094 --> 00:22:06,510 And here's the synthetic version. 579 00:22:06,510 --> 00:22:07,800 And it's almost as though the echoes 580 00:22:07,800 --> 00:22:09,758 don't get put in the right place, or something. 581 00:22:09,758 --> 00:22:12,687 [SYNTHETIC STAIR STEP SOUNDS] 582 00:22:15,150 --> 00:22:17,970 And it would sound even worse if this were not an auditorium. 583 00:22:17,970 --> 00:22:19,600 Here's what happens with music. 584 00:22:19,600 --> 00:22:22,785 [SALSA MUSIC PLAYING] 585 00:22:23,637 --> 00:22:24,720 And the synthetic version. 586 00:22:24,720 --> 00:22:27,975 [SALSA MUSIC PLAYING] 587 00:22:29,370 --> 00:22:31,290 And this is what happens with speech. 588 00:22:31,290 --> 00:22:33,570 [MALE VOICE 1] A boy fell from the window. 589 00:22:33,570 --> 00:22:34,920 The wife helped her husband. 590 00:22:34,920 --> 00:22:37,030 Big dogs can be dangerous. 591 00:22:37,030 --> 00:22:40,670 Her-- [INAUDIBLE]. 592 00:22:44,762 --> 00:22:47,610 All right, OK, so in some sense, this 593 00:22:47,610 --> 00:22:49,890 is the most informative thing that 594 00:22:49,890 --> 00:22:52,887 comes out of this whole effort, because, again, it 595 00:22:52,887 --> 00:22:55,220 makes it really clear what you don't understand-- right. 596 00:22:55,220 --> 00:23:00,270 And in all these cases, it was really not obvious, a priori, 597 00:23:00,270 --> 00:23:01,689 that things would be this bad. 598 00:23:01,689 --> 00:23:03,480 I actually thought it was sort of plausible 599 00:23:03,480 --> 00:23:05,063 that we might be able to capture pitch 600 00:23:05,063 --> 00:23:06,660 with some of these statistics. 601 00:23:06,660 --> 00:23:09,510 Same with reverb and certainly some of these simple rhythms. 
602 00:23:09,510 --> 00:23:11,670 I kind of thought that some of the modulation 603 00:23:11,670 --> 00:23:13,680 filter responses and their correlations 604 00:23:13,680 --> 00:23:14,887 would give this to you. 605 00:23:14,887 --> 00:23:17,220 And it's not until you actually test this with synthesis 606 00:23:17,220 --> 00:23:19,320 that you realize how bad this is, right? 607 00:23:19,320 --> 00:23:20,910 And so this really kind of tells you 608 00:23:20,910 --> 00:23:22,260 that there's something very important 609 00:23:22,260 --> 00:23:23,968 that your brain is measuring that we just 610 00:23:23,968 --> 00:23:26,460 don't yet understand and hasn't been built into our model. 611 00:23:26,460 --> 00:23:29,430 So it really sort of identifies the things you need to work on. 612 00:23:29,430 --> 00:23:32,495 OK, so just take home messages from this portion 613 00:23:32,495 --> 00:23:33,120 of the lecture. 614 00:23:33,120 --> 00:23:36,000 So I've argued that sound synthesis is 615 00:23:36,000 --> 00:23:38,940 a powerful tool that can help us test and explore theories 616 00:23:38,940 --> 00:23:41,210 of audition and that the variables that 617 00:23:41,210 --> 00:23:43,710 produce compelling synthesis are things that could plausibly 618 00:23:43,710 --> 00:23:45,720 underlie perception. 619 00:23:45,720 --> 00:23:47,766 And, conversely, that synthesis failures 620 00:23:47,766 --> 00:23:49,890 are things that point the way to new variables that 621 00:23:49,890 --> 00:23:53,617 might be important for the perceptual system. 622 00:23:53,617 --> 00:23:55,950 I've also argued that textures are a nice point of entry 623 00:23:55,950 --> 00:23:57,360 for real-world hearing. 624 00:23:57,360 --> 00:23:59,860 I think what's appealing about them is that you can actually 625 00:23:59,860 --> 00:24:02,940 work with actual real world-like signals and all 626 00:24:02,940 --> 00:24:06,360 of the complexity that at least exists in that domain. 
627 00:24:06,360 --> 00:24:08,710 And, yet, work with them and generate things 628 00:24:08,710 --> 00:24:11,880 that you feel like you can understand. 629 00:24:11,880 --> 00:24:13,950 And I've argued that many natural sounds may 630 00:24:13,950 --> 00:24:16,110 be recognized with relatively simple statistics 631 00:24:16,110 --> 00:24:17,680 of early auditory representation. 632 00:24:17,680 --> 00:24:20,667 So the very simplest kinds of statistical representations 633 00:24:20,667 --> 00:24:22,500 that you might construct that capture things 634 00:24:22,500 --> 00:24:23,492 like the spectrum. 635 00:24:23,492 --> 00:24:25,700 Well, that on its own is not really that informative. 636 00:24:25,700 --> 00:24:27,533 But if you just go a little bit more complex 637 00:24:27,533 --> 00:24:29,940 and into the domain of marginal moments and correlations, 638 00:24:29,940 --> 00:24:32,550 you get representations that are pretty powerful. 639 00:24:32,550 --> 00:24:34,890 And finally, I gave you some evidence 640 00:24:34,890 --> 00:24:36,510 that for textures of moderate length, 641 00:24:36,510 --> 00:24:39,911 statistics may be all that we retain. 642 00:24:39,911 --> 00:24:41,910 So there are a lot of interesting open questions 643 00:24:41,910 --> 00:24:43,530 in this domain. 644 00:24:43,530 --> 00:24:46,200 So one of the big ones, I think, is the locus 645 00:24:46,200 --> 00:24:47,940 of the time-averaging. 646 00:24:47,940 --> 00:24:50,890 So I told you about how we've got some evidence in the lab 647 00:24:50,890 --> 00:24:53,982 that the time scale of the integration 648 00:24:53,982 --> 00:24:55,440 process for computing statistics is 649 00:24:55,440 --> 00:24:56,670 on the order of several seconds. 650 00:24:56,670 --> 00:24:58,150 And that's a really long time scale 651 00:24:58,150 --> 00:25:00,662 relative to typical time scales in the auditory system. 
652 00:25:00,662 --> 00:25:02,620 And so where exactly that happens in the brain, 653 00:25:02,620 --> 00:25:05,484 I think, is very much an open question and kind 654 00:25:05,484 --> 00:25:06,400 of an interesting one. 655 00:25:06,400 --> 00:25:07,816 And so we'd like to sort of figure 656 00:25:07,816 --> 00:25:10,089 out how to get some leverage on that. 657 00:25:10,089 --> 00:25:11,880 There's also a lot of interesting questions 658 00:25:11,880 --> 00:25:13,860 about the relationship to scene analysis. 659 00:25:13,860 --> 00:25:16,080 So usually you're not hearing a texture in isolation. 660 00:25:16,080 --> 00:25:17,760 It's sort of the background to things 661 00:25:17,760 --> 00:25:20,176 that, maybe, you're actually more interested in-- somebody 662 00:25:20,176 --> 00:25:21,160 talking or what not. 663 00:25:21,160 --> 00:25:23,640 And so the relationship between these statistical 664 00:25:23,640 --> 00:25:26,904 representations and the extraction of individual source 665 00:25:26,904 --> 00:25:28,570 signals is something that's really open, 666 00:25:28,570 --> 00:25:31,964 and, I think, kind of interesting. 667 00:25:31,964 --> 00:25:34,380 And then these other questions of what kinds of statistics 668 00:25:34,380 --> 00:25:36,463 would you need to account for some of these really 669 00:25:36,463 --> 00:25:39,670 profound failures of synthesis. 670 00:25:39,670 --> 00:25:41,760 OK, so actually one-- 671 00:25:41,760 --> 00:25:44,809 I think this might be interesting to people. 672 00:25:44,809 --> 00:25:46,350 So I'll just talk briefly about this. 673 00:25:46,350 --> 00:25:47,340 And then we're going to have to figure out what 674 00:25:47,340 --> 00:25:48,590 to do for the last 20 minutes. 
675 00:25:48,590 --> 00:25:50,869 But one of the reasons, I think, I 676 00:25:50,869 --> 00:25:53,160 was requested to talk about this is because of the fact 677 00:25:53,160 --> 00:25:55,290 that there's been all this work on texture 678 00:25:55,290 --> 00:25:57,120 in the domain of vision. 679 00:25:57,120 --> 00:25:59,160 And so it's sort of an interesting case where 680 00:25:59,160 --> 00:26:01,860 we can kind of think about similarities and differences 681 00:26:01,860 --> 00:26:03,060 between sensory systems. 682 00:26:03,060 --> 00:26:04,914 And so back when we were doing this work-- 683 00:26:04,914 --> 00:26:07,080 as I said, this was joint work with Eero Simoncelli. 684 00:26:07,080 --> 00:26:09,645 I was a post-doc in his lab at NYU. 685 00:26:09,645 --> 00:26:11,520 And we thought it would be interesting to try 686 00:26:11,520 --> 00:26:15,510 to turn the kind of standard model of visual texture, which 687 00:26:15,510 --> 00:26:18,310 was done by Javier Portilla and Eero 688 00:26:18,310 --> 00:26:20,686 a long time ago, into sort of the same kind of diagram 689 00:26:20,686 --> 00:26:21,810 that I've been showing you. 690 00:26:21,810 --> 00:26:24,517 And so we actually did this in our paper. 691 00:26:24,517 --> 00:26:26,850 And so this is the one that you've been seeing the whole talk, 692 00:26:26,850 --> 00:26:27,349 right. 693 00:26:27,349 --> 00:26:29,470 So you've got a sound waveform-- 694 00:26:29,470 --> 00:26:30,610 a stage of filtering. 695 00:26:30,610 --> 00:26:33,090 This non-linearity to extract the envelope and compress it. 696 00:26:33,090 --> 00:26:34,530 And then another stage of filtering. 697 00:26:34,530 --> 00:26:35,740 And then there are statistical measurements 698 00:26:35,740 --> 00:26:38,130 at kind of the last two stages of representation. 699 00:26:38,130 --> 00:26:40,620 And this is an analogous diagram that you 700 00:26:40,620 --> 00:26:44,940 can make for this sort of standard visual texture model. 
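In code, the auditory side of that comparison amounts to something like the sketch below: split the signal into frequency bands, extract and compress each band's envelope, then measure marginal moments and cross-band correlations. This is a drastic simplification for illustration only; the actual model uses gammatone-like filter banks, a particular compressive nonlinearity, and an additional stage of modulation filtering, none of which are reproduced here:

```python
import numpy as np

rng = np.random.default_rng(2)
sr = 16000
x = rng.standard_normal(sr * 2)  # noise stand-in for a texture recording

def bandpass(sig, lo, hi):
    """Crude brick-wall bandpass via the FFT (the model itself uses
    cochlea-like filter banks, not this)."""
    spec = np.fft.rfft(sig)
    freqs = np.fft.rfftfreq(sig.size, 1 / sr)
    spec[(freqs < lo) | (freqs >= hi)] = 0
    return np.fft.irfft(spec, n=sig.size)

def envelope(sig, win=0.01):
    """Rectify, smooth over ~10 ms, and compress: a stand-in for the
    envelope-extraction nonlinearity."""
    k = int(win * sr)
    return np.convolve(np.abs(sig), np.ones(k) / k, mode="same") ** 0.3

edges = [200, 400, 800, 1600, 3200]  # arbitrary band edges in Hz
envs = np.array([envelope(bandpass(x, lo, hi))
                 for lo, hi in zip(edges[:-1], edges[1:])])

# Marginal moments of each band's envelope...
means = envs.mean(axis=1)
variances = envs.var(axis=1)
# ...and correlations between the envelopes of different bands.
corr = np.corrcoef(envs)
```

Synthesis then amounts to iteratively adjusting a noise signal until its measurements match those of the original; the visual model described next has the same overall shape, with space in place of time.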
701 00:26:44,940 --> 00:26:48,440 So we start out with images like beans. 702 00:26:48,440 --> 00:26:50,730 There's center-surround filtering of the sort 703 00:26:50,730 --> 00:26:53,640 that you would find in the retina or LGN 704 00:26:53,640 --> 00:26:56,070 that filters things into particular spatial frequency 705 00:26:56,070 --> 00:26:56,670 bands. 706 00:26:56,670 --> 00:26:58,650 And so that's what you get here. 707 00:26:58,650 --> 00:27:01,929 So these are sub-bands again. 708 00:27:01,929 --> 00:27:03,720 Then there's oriented filtering of the sort 709 00:27:03,720 --> 00:27:06,960 that you might get via simple cells in V1. 710 00:27:06,960 --> 00:27:09,390 So then you get the sub-bands divided up even finer 711 00:27:09,390 --> 00:27:12,000 into both spatial frequency and orientation. 712 00:27:12,000 --> 00:27:13,380 And then there's something that's 713 00:27:13,380 --> 00:27:15,660 analogous to the extraction of the envelope that 714 00:27:15,660 --> 00:27:17,650 would give you something like a complex cell. 715 00:27:17,650 --> 00:27:20,190 All right, and so this is sort of local amplitude 716 00:27:20,190 --> 00:27:22,500 in each of these different sub-bands-- right. 717 00:27:22,500 --> 00:27:24,790 So you can see, here, the contrast is very high. 718 00:27:24,790 --> 00:27:27,610 And so you get a high response in this particular point 719 00:27:27,610 --> 00:27:28,910 in the sub-band. 720 00:27:28,910 --> 00:27:31,891 So, again, this is in the dimension of space. 721 00:27:31,891 --> 00:27:33,890 That's a difference, right-- it's an image. 722 00:27:33,890 --> 00:27:37,902 So you've got x- and y-coordinates instead of time. 723 00:27:37,902 --> 00:27:39,860 But, again, there are statistical measurements, 724 00:27:39,860 --> 00:27:44,480 and you can actually relate a lot of them 725 00:27:44,480 --> 00:27:46,020 to some of the same functional form. 
726 00:27:46,020 --> 00:27:48,500 So there's marginal moments just like we 727 00:27:48,500 --> 00:27:51,110 were computing from sound. 728 00:27:51,110 --> 00:27:54,110 In the visual texture model, there's an auto correlation. 729 00:27:54,110 --> 00:27:56,150 So that's measuring spatial correlations 730 00:27:56,150 --> 00:27:59,000 which we don't actually have in the auditory model. 731 00:27:59,000 --> 00:28:01,850 But then these correlations across different frequency 732 00:28:01,850 --> 00:28:02,370 channels. 733 00:28:02,370 --> 00:28:04,880 So this is across different spatial frequencies 734 00:28:04,880 --> 00:28:08,900 to things tuned to the same orientation. 735 00:28:08,900 --> 00:28:15,026 And this is across orientations and in the energy domain. 736 00:28:15,026 --> 00:28:16,400 So a couple of interesting points 737 00:28:16,400 --> 00:28:18,320 to take from this if you just sort of 738 00:28:18,320 --> 00:28:21,350 look back and forth between these two pictures. 739 00:28:21,350 --> 00:28:24,020 The first is that the statistics that we ended up 740 00:28:24,020 --> 00:28:28,760 using in the domain of sound are kind of late in the game. 741 00:28:28,760 --> 00:28:32,450 All right, so they're sort of after this non-linear stage 742 00:28:32,450 --> 00:28:34,610 that extracts amplitude. 743 00:28:34,610 --> 00:28:36,140 Whereas in the visual texture model, 744 00:28:36,140 --> 00:28:37,449 the nonlinearity happens here. 745 00:28:37,449 --> 00:28:38,990 And there's all these statistics that 746 00:28:38,990 --> 00:28:41,250 are being measured at these earlier stages 747 00:28:41,250 --> 00:28:43,550 before you're extracting local amplitude. 748 00:28:43,550 --> 00:28:45,240 And that's an important difference, 749 00:28:45,240 --> 00:28:47,140 I think, between sounds and images 750 00:28:47,140 --> 00:28:49,310 and that a lot of the action and sound 751 00:28:49,310 --> 00:28:52,167 is in the kind of the local amplitude domain. 
752 00:28:52,167 --> 00:28:54,500 Whereas there's a lot of important structure 753 00:28:54,500 --> 00:28:56,834 in images that has to do with sort of local phase 754 00:28:56,834 --> 00:28:59,000 that you can't just get from kind of local amplitude 755 00:28:59,000 --> 00:29:01,610 measurements. 756 00:29:01,610 --> 00:29:05,900 But at sort of a coarse scale, the big picture 757 00:29:05,900 --> 00:29:08,600 is that we think of visual texture 758 00:29:08,600 --> 00:29:11,060 as being represented with statistical measurements 759 00:29:11,060 --> 00:29:13,140 that average across space. 760 00:29:13,140 --> 00:29:15,860 And we've been arguing that sound texture consists 761 00:29:15,860 --> 00:29:19,790 of statistical computations that average across time. 762 00:29:19,790 --> 00:29:21,950 That said, as I was alluding to earlier, 763 00:29:21,950 --> 00:29:24,124 I think it's totally plausible that we should really 764 00:29:24,124 --> 00:29:26,540 think about visual texture as something that's potentially 765 00:29:26,540 --> 00:29:30,080 dynamic if you're looking at a sheet blowing in the wind 766 00:29:30,080 --> 00:29:32,427 or people moving in a crowd. 767 00:29:32,427 --> 00:29:34,760 And so there might well be statistics in the time domain 768 00:29:34,760 --> 00:29:37,410 as well that people just haven't really thought about. 769 00:29:37,410 --> 00:29:42,949 OK, so auditory scene analysis is, 770 00:29:42,949 --> 00:29:44,990 loosely speaking, the process of inferring events 771 00:29:44,990 --> 00:29:46,281 in the world from sound, right. 772 00:29:46,281 --> 00:29:49,780 So in almost any kind of normal situation, 773 00:29:49,780 --> 00:29:52,100 there is this sound signal that comes into your ears. 774 00:29:52,100 --> 00:29:54,290 And that's the result of multiple causal factors 775 00:29:54,290 --> 00:29:54,980 in the world. 
776 00:29:54,980 --> 00:29:57,830 And those can be different things in the world that 777 00:29:57,830 --> 00:29:59,456 are making sound. 778 00:29:59,456 --> 00:30:00,830 As we discussed, the sound signal 779 00:30:00,830 --> 00:30:02,390 also interacts with the environment on the way 780 00:30:02,390 --> 00:30:03,230 to your ear. 781 00:30:03,230 --> 00:30:05,720 And so both of those things contribute. 782 00:30:05,720 --> 00:30:07,760 The classic instantiation of this 783 00:30:07,760 --> 00:30:09,410 is the cocktail party problem, where 784 00:30:09,410 --> 00:30:10,820 the notion is that there are 785 00:30:10,820 --> 00:30:14,120 multiple sources in the world, and the signals from those 786 00:30:14,120 --> 00:30:17,990 sources sum together into a mixture that enters your ear. 787 00:30:17,990 --> 00:30:19,594 And as a listener, you're usually 788 00:30:19,594 --> 00:30:21,260 interested in individual sources, maybe, 789 00:30:21,260 --> 00:30:23,509 one of those in particular-- like what somebody that you 790 00:30:23,509 --> 00:30:24,920 care about is saying. 791 00:30:24,920 --> 00:30:27,400 And so your brain has to take that mixed signal-- 792 00:30:27,400 --> 00:30:29,870 and from that infer the content of one or more 793 00:30:29,870 --> 00:30:31,274 of the sources. 794 00:30:31,274 --> 00:30:32,690 And so this is the classic example 795 00:30:32,690 --> 00:30:35,430 of an ill-posed problem. 796 00:30:35,430 --> 00:30:38,690 And by that I mean that it's ill-posed 797 00:30:38,690 --> 00:30:40,959 because many sets of possible sounds 798 00:30:40,959 --> 00:30:42,500 add up to equal the observed mixture. 799 00:30:42,500 --> 00:30:44,960 So all you have access to is this red guy here, right? 800 00:30:44,960 --> 00:30:47,211 And you'd like to infer the blue signals, 801 00:30:47,211 --> 00:30:49,460 which are the true sources that occurred in the world. 
802 00:30:49,460 --> 00:30:52,440 And the problem is that there are these green signals here, 803 00:30:52,440 --> 00:30:54,177 which also add up to the red signal. 804 00:30:54,177 --> 00:30:56,510 In fact, there's lots and lots and lots of these, right? 805 00:30:56,510 --> 00:30:58,176 So your brain has to take the red signal 806 00:30:58,176 --> 00:31:00,030 and somehow infer the blue ones. 807 00:31:00,030 --> 00:31:02,090 And so this is analogous to me telling you, 808 00:31:02,090 --> 00:31:03,890 x plus y equals 17-- 809 00:31:03,890 --> 00:31:05,060 please solve for x. 810 00:31:05,060 --> 00:31:06,720 And so, obviously, if you got this on a math test, 811 00:31:06,720 --> 00:31:09,020 you would complain because there is not a unique solution, 812 00:31:09,020 --> 00:31:09,519 right. 813 00:31:09,519 --> 00:31:12,970 You could have 1 and 16, and 2 and 15, and 3 and 14, 814 00:31:12,970 --> 00:31:14,220 and so on and so forth, right? 815 00:31:14,220 --> 00:31:16,136 But that's exactly the problem that your brain 816 00:31:16,136 --> 00:31:17,990 is solving all the time every day when 817 00:31:17,990 --> 00:31:20,090 you get a mixture of sounds. 818 00:31:20,090 --> 00:31:22,670 And the only way that you can solve problems of these sorts 819 00:31:22,670 --> 00:31:24,832 is by making assumptions about the sound sources. 820 00:31:24,832 --> 00:31:27,290 And the only way that you would be able to make assumptions 821 00:31:27,290 --> 00:31:29,390 about sound sources is if real-world sound sources 822 00:31:29,390 --> 00:31:31,436 have some degree of regularity. 823 00:31:31,436 --> 00:31:32,310 And in fact, they do. 824 00:31:32,310 --> 00:31:35,810 And one easy way to see this is by generating sounds 825 00:31:35,810 --> 00:31:37,420 that are fully random. 
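The x plus y equals 17 situation carries over directly to waveforms: for any observed mixture, infinitely many pairs of candidate sources sum to it exactly. A few illustrative lines make that concrete (all signals here are arbitrary noise stand-ins):

```python
import numpy as np

rng = np.random.default_rng(3)
mixture = rng.standard_normal(1000)  # the observed "red" signal

# One candidate pair of sources ("blue")...
blue_a = rng.standard_normal(1000)
blue_b = mixture - blue_a

# ...and an entirely different pair ("green") with the same sum.
green_a = rng.standard_normal(1000)
green_b = mixture - green_a

same_sum = np.allclose(blue_a + blue_b, green_a + green_b)
different_sources = not np.allclose(blue_a, green_a)
```

Any waveform at all can play the role of `blue_a`, and subtraction supplies a partner that completes the mixture, which is exactly why extra assumptions about sources are unavoidable.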
826 00:31:37,420 --> 00:31:39,200 And so the way that you would do this 827 00:31:39,200 --> 00:31:41,480 is you would have a random number generator-- 828 00:31:41,480 --> 00:31:42,890 you would draw numbers from that. 829 00:31:42,890 --> 00:31:46,080 And each of those numbers would form a particular sample 830 00:31:46,080 --> 00:31:47,012 in a sound signal. 831 00:31:47,012 --> 00:31:49,220 And then you could play that and listen to it, right. 832 00:31:49,220 --> 00:31:50,750 And so if you did that procedure, 833 00:31:50,750 --> 00:31:51,875 this is what you would get. 834 00:31:51,875 --> 00:31:54,802 [SPRAY SOUNDS] 835 00:31:55,756 --> 00:31:57,900 All right, so those are fully random sound signals. 836 00:31:57,900 --> 00:32:00,450 And so we could generate lots and lots of those. 837 00:32:00,450 --> 00:32:02,201 And the point is that with that procedure, 838 00:32:02,201 --> 00:32:03,783 you would have to sit there generating 839 00:32:03,783 --> 00:32:05,850 these random sounds for a very, very long time 840 00:32:05,850 --> 00:32:08,197 before you got something that sounded like a real-world 841 00:32:08,197 --> 00:32:08,780 sound, right? 842 00:32:08,780 --> 00:32:10,440 Real world sounds are like this. 843 00:32:10,440 --> 00:32:13,100 [ENGINE SOUND] 844 00:32:13,100 --> 00:32:13,610 Or this-- 845 00:32:13,610 --> 00:32:15,020 [DOOR BELL SOUND] 846 00:32:15,020 --> 00:32:15,960 Or this-- 847 00:32:15,960 --> 00:32:17,370 [BIRD SOUND] 848 00:32:17,370 --> 00:32:17,930 Or this-- 849 00:32:17,930 --> 00:32:20,120 [SCRUBBING SOUND] 850 00:32:20,120 --> 00:32:22,405 All right, so the point is that the set 851 00:32:22,405 --> 00:32:23,780 of sounds that occur in the world 852 00:32:23,780 --> 00:32:25,760 are a very, very, very small portion 853 00:32:25,760 --> 00:32:28,919 of the set of all physically realizable sound waveforms. 854 00:32:28,919 --> 00:32:31,460 And so the notion is that that's what enables you to hear. 
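The random-draw procedure described above is only a few lines; here it is written out to a playable WAV file (the sample rate, duration, and filename are arbitrary choices):

```python
import wave

import numpy as np

rng = np.random.default_rng(4)
sr = 16000

# Draw every sample independently from a random number generator:
samples = rng.uniform(-1.0, 1.0, size=sr * 2)  # 2 seconds of noise

# Convert to 16-bit PCM and write a playable WAV file.
pcm = (samples * 32767).astype(np.int16)
with wave.open("random_signal.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)  # 2 bytes per sample
    f.setframerate(sr)
    f.writeframes(pcm.tobytes())
```

Every run produces a different waveform, but they all sound like the same undifferentiated static: essentially none of the draws land in the tiny region of waveform space occupied by real-world sounds.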
855 00:32:31,460 --> 00:32:33,418 It's the fact that you've internalized the fact 856 00:32:33,418 --> 00:32:36,192 that the structure of real-world sounds is not random, 857 00:32:36,192 --> 00:32:38,150 such that when you get a mixture of sounds, 858 00:32:38,150 --> 00:32:39,774 you can actually make some good guesses 859 00:32:39,774 --> 00:32:41,170 as to what the sources are. 860 00:32:41,170 --> 00:32:44,760 All right, so we rely on these regularities in order to hear. 861 00:32:44,760 --> 00:32:47,780 So one intuitive view of inferring a target source 862 00:32:47,780 --> 00:32:50,240 from a mixture like this is that you have 863 00:32:50,240 --> 00:32:52,160 to do at least a couple things. 864 00:32:52,160 --> 00:32:55,340 One is to determine the grouping of the observed elements 865 00:32:55,340 --> 00:32:56,670 in the sound signal. 866 00:32:56,670 --> 00:32:58,970 And so what I've done here is for each 867 00:32:58,970 --> 00:33:01,820 of these-- this is that cocktail party problem demo that we 868 00:33:01,820 --> 00:33:03,130 heard at the start. 869 00:33:03,130 --> 00:33:04,730 So we've got one speaker-- 870 00:33:04,730 --> 00:33:07,025 two, three, and then seven. 871 00:33:07,025 --> 00:33:12,030 And in the spectrograms, I've coded the pixels 872 00:33:12,030 --> 00:33:16,160 either red or green, where the pixels are coded red 873 00:33:16,160 --> 00:33:18,990 if they come from something other than the target source, 874 00:33:18,990 --> 00:33:19,490 right. 875 00:33:19,490 --> 00:33:23,240 So this stuff up here is coming from this additional speaker. 876 00:33:23,240 --> 00:33:27,470 And then the green bits are the pixels in the target signal 877 00:33:27,470 --> 00:33:29,030 that are masked by the other signal-- 878 00:33:29,030 --> 00:33:32,000 where the other signal actually has higher intensity. 
879 00:33:32,000 --> 00:33:33,500 And so one notion is that, well, you 880 00:33:33,500 --> 00:33:35,666 have to be able to tell that the red things actually 881 00:33:35,666 --> 00:33:38,120 don't go with the gray things. 882 00:33:38,120 --> 00:33:40,700 But then you also need to take these parts that are green, 883 00:33:40,700 --> 00:33:41,750 where the other source is actually 884 00:33:41,750 --> 00:33:43,040 swamping the thing you're interested in, 885 00:33:43,040 --> 00:33:45,440 and then estimate the content of the target source. 886 00:33:45,440 --> 00:33:47,930 That's at least a very sort of naive intuitive view 887 00:33:47,930 --> 00:33:49,990 of what has to happen. 888 00:33:49,990 --> 00:33:52,640 And in both of these cases, the only way that you can do this 889 00:33:52,640 --> 00:33:55,130 is by taking advantage of statistical regularities 890 00:33:55,130 --> 00:33:56,660 in sounds. 891 00:33:56,660 --> 00:33:58,550 So one example of a regularity that we 892 00:33:58,550 --> 00:34:01,910 think might be used to group sound is harmonic frequencies. 893 00:34:01,910 --> 00:34:04,740 So voices and instruments and certain other sounds 894 00:34:04,740 --> 00:34:06,800 produce frequencies that are harmonics, i.e., 895 00:34:06,800 --> 00:34:08,000 multiples of a fundamental. 896 00:34:08,000 --> 00:34:09,620 So here's a schematic power spectrum 897 00:34:09,620 --> 00:34:13,469 of what might come out of your vocal cords. 898 00:34:13,469 --> 00:34:15,260 So there's the fundamental frequency here. 899 00:34:15,260 --> 00:34:17,320 And then all the different harmonics. 900 00:34:17,320 --> 00:34:21,228 And they exhibit this very regular structure. 901 00:34:21,228 --> 00:34:23,330 Here, similarly, this is A440 on the oboe. 902 00:34:23,330 --> 00:34:25,320 [OBOE SOUND] 903 00:34:25,320 --> 00:34:27,330 So the fundamental frequency is 440 hertz. 904 00:34:27,330 --> 00:34:28,759 That's concert A. 
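A harmonic complex like that schematic spectrum can be synthesized directly. In this sketch the 1/k amplitude falloff and ten-harmonic count are invented; only the 440 Hz concert-A fundamental comes from the lecture:

```python
import numpy as np

# A concert-A harmonic complex: energy only at integer multiples of
# the 440 Hz fundamental. Amplitudes (1/k) are an invented choice.
sr = 44100
t = np.arange(sr) / sr                     # one second of samples
f0 = 440.0
tone = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, 11))

# With a 1-second signal, FFT bins are 1 Hz apart, so the spectral
# peaks land exactly on 440, 880, 1320, ... Hz.
freqs = np.fft.rfftfreq(tone.size, 1 / sr)
spectrum = np.abs(np.fft.rfft(tone))
peak_hz = freqs[np.argmax(spectrum)]
```

The regular spacing of those peaks is exactly the structure the grouping mechanism is presumed to exploit.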
But if you look 905 00:34:28,759 --> 00:34:30,300 at the power spectrum of that signal, 906 00:34:30,300 --> 00:34:36,431 you get all of these integer multiples of that fundamental. 907 00:34:36,431 --> 00:34:38,639 All right, and so the way that this happens in speech 908 00:34:38,639 --> 00:34:39,680 is that there are these-- 909 00:34:39,680 --> 00:34:41,940 your vocal cords, which open and close 910 00:34:41,940 --> 00:34:43,469 in this periodic manner. 911 00:34:43,469 --> 00:34:45,396 They generate a series of sound pulses. 912 00:34:45,396 --> 00:34:46,770 And in the frequency domain, that 913 00:34:46,770 --> 00:34:49,005 translates to harmonic structure. 914 00:34:49,005 --> 00:34:50,880 Not going to go through this in great detail. 915 00:34:50,880 --> 00:34:53,280 Hynek's going to tell you about speech. 916 00:34:53,280 --> 00:34:56,460 All right, and so there's some classic evidence 917 00:34:56,460 --> 00:34:59,850 that your brain uses harmonicity as a grouping cue, which 918 00:34:59,850 --> 00:35:02,940 is that if you take a series of harmonic frequencies 919 00:35:02,940 --> 00:35:06,690 and you mistune one of them, your brain typically 920 00:35:06,690 --> 00:35:09,210 causes you to hear that as a distinct sound source 921 00:35:09,210 --> 00:35:10,860 once the mistuning becomes sufficient. 922 00:35:10,860 --> 00:35:12,620 And here's just a classic demo of that. 923 00:35:16,071 --> 00:35:20,170 [MALE VOICE 2] Demonstration 18-- isolation of a frequency 924 00:35:20,170 --> 00:35:22,860 component based on mistuning. 925 00:35:22,860 --> 00:35:26,590 You are to listen for the third harmonic of a complex tone. 926 00:35:26,590 --> 00:35:30,250 First, this component is played alone as a standard. 
927 00:35:30,250 --> 00:35:32,570 Then over a series of repetitions, 928 00:35:32,570 --> 00:35:35,170 it remains at a constant frequency 929 00:35:35,170 --> 00:35:38,200 while the rest of the components are gradually lowered 930 00:35:38,200 --> 00:35:41,904 as a group in steps of 1%. 931 00:35:41,904 --> 00:35:45,341 [BEEPING SOUNDS] 932 00:36:06,660 --> 00:36:08,480 [MALE VOICE 2] Now after two-- 933 00:36:08,480 --> 00:36:11,112 OK, and so what you should have heard-- 934 00:36:11,112 --> 00:36:13,320 and you can tell me whether this is the case or not-- 935 00:36:13,320 --> 00:36:15,600 is that as this thing is mistuned, at some point, 936 00:36:15,600 --> 00:36:17,300 you actually start to hear, kind of, two beeps. 937 00:36:17,300 --> 00:36:19,008 All right, there's the main tone and then 938 00:36:19,008 --> 00:36:21,070 there's this other little beep, right. 939 00:36:21,070 --> 00:36:22,880 And if you did it in the other direction, 940 00:36:22,880 --> 00:36:23,796 it would then reverse. 941 00:36:27,610 --> 00:36:32,190 OK, so one other consequence of harmonicity is-- 942 00:36:32,190 --> 00:36:35,520 and somebody was asking about this earlier-- 943 00:36:35,520 --> 00:36:38,610 is that your brain is able to use the harmonics of the sound 944 00:36:38,610 --> 00:36:40,350 in order to infer its pitch. 945 00:36:40,350 --> 00:36:43,630 So the pitch that you hear when you hear somebody talking 946 00:36:43,630 --> 00:36:46,470 is like a collective function of all the different harmonics. 947 00:36:46,470 --> 00:36:49,140 And so one interesting thing that 948 00:36:49,140 --> 00:36:51,030 happens when you mistune a harmonic 949 00:36:51,030 --> 00:36:53,520 is that for very small mistunings, 950 00:36:53,520 --> 00:36:56,964 that initially causes a bias in the perceived pitch. 951 00:36:56,964 --> 00:36:58,380 And so that's what's plotted here. 
952 00:36:58,380 --> 00:37:02,490 So this is a task where somebody hears this complex tone that 953 00:37:02,490 --> 00:37:04,600 has one of the harmonics mistuned by a little bit. 954 00:37:04,600 --> 00:37:05,695 And then they hear another complex tone. 955 00:37:05,695 --> 00:37:07,470 And they have to adjust the pitch of the other one 956 00:37:07,470 --> 00:37:08,610 until it sounds the same. 957 00:37:08,610 --> 00:37:11,160 All right, and so what's being plotted on the y-axis 958 00:37:11,160 --> 00:37:13,590 in this graph is the average amount 959 00:37:13,590 --> 00:37:17,160 of shift in the pitch match as a function of the shift 960 00:37:17,160 --> 00:37:18,810 in that particular harmonic. 961 00:37:18,810 --> 00:37:21,705 And for very small mistunings of a few percent, 962 00:37:21,705 --> 00:37:23,580 you can see that there's this linear increase 963 00:37:23,580 --> 00:37:24,750 in the perceived pitch. 964 00:37:24,750 --> 00:37:26,375 All right, so mistuning that harmonic 965 00:37:26,375 --> 00:37:28,420 causes the pitch to change. 966 00:37:28,420 --> 00:37:30,960 But then once the mistuning exceeds a certain amount, 967 00:37:30,960 --> 00:37:33,210 you can actually see that the effect reverses. 968 00:37:33,210 --> 00:37:35,040 And the pitch shift goes away. 969 00:37:35,040 --> 00:37:36,704 And so we think what's happening here 970 00:37:36,704 --> 00:37:38,370 is that the mechanism in your brain that 971 00:37:38,370 --> 00:37:41,190 is computing pitch from the harmonics 972 00:37:41,190 --> 00:37:44,370 somehow realizes that one of those harmonics is mistuned 973 00:37:44,370 --> 00:37:46,120 and is not part of the same thing. 974 00:37:46,120 --> 00:37:48,780 And so it's excluded from the computation of pitch. 975 00:37:48,780 --> 00:37:52,410 So the segregation of those sources must then somehow 976 00:37:52,410 --> 00:37:55,600 happen prior to or at the same time 977 00:37:55,600 --> 00:37:58,980 as the calculation of the pitch. 
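Stimuli of the mistuned-harmonic kind are straightforward to generate. This sketch is illustrative only; the 200 Hz fundamental, ten equal-amplitude harmonics, and 8% shift are invented values, not the parameters of the actual experiments:

```python
import numpy as np

# A harmonic complex in which one component can be shifted away from
# its harmonic frequency by a given percentage.
sr = 20000
t = np.arange(sr) / sr                     # one second

def complex_tone(f0, n_harmonics, mistuned_k=None, shift_pct=0.0):
    tone = np.zeros_like(t)
    for k in range(1, n_harmonics + 1):
        f = k * f0
        if k == mistuned_k:
            f *= 1.0 + shift_pct / 100.0   # mistune just this component
        tone += np.sin(2 * np.pi * f * t)
    return tone / n_harmonics

in_tune = complex_tone(200.0, 10)
shifted = complex_tone(200.0, 10, mistuned_k=3, shift_pct=8.0)

# The 3rd harmonic now sits at 648 Hz instead of 600 Hz; with 1 Hz
# FFT bins, that shows up as a displaced spectral peak.
freqs = np.fft.rfftfreq(shifted.size, 1 / sr)
spec = np.abs(np.fft.rfft(shifted))
```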
978 00:37:58,980 --> 00:38:01,360 Here's another classic demonstration 979 00:38:01,360 --> 00:38:04,770 of sound segregation related to harmonicity. 980 00:38:04,770 --> 00:38:07,410 This is called the Reynolds-McAdams Oboe-- 981 00:38:07,410 --> 00:38:09,960 some collaboration between Roger Reynolds and Steve McAdams. 982 00:38:09,960 --> 00:38:11,064 There's a complex tone-- 983 00:38:11,064 --> 00:38:12,480 and what's going to happen here is 984 00:38:12,480 --> 00:38:15,510 that the even harmonics-- two, four, six, eight, et cetera, 985 00:38:15,510 --> 00:38:18,690 will become frequency modulated in a way that's coherent. 986 00:38:18,690 --> 00:38:20,947 And so, initially, you'll hear this kind of one thing. 987 00:38:20,947 --> 00:38:23,280 And then it will sort of separate into these two voices. 988 00:38:23,280 --> 00:38:25,080 And it's called the oboe because the oboe 989 00:38:25,080 --> 00:38:27,270 is an instrument that has a lot of power at the odd harmonics. 990 00:38:27,270 --> 00:38:28,080 And so you'll hear something that 991 00:38:28,080 --> 00:38:30,330 sounds like an oboe along with something that, maybe, is 992 00:38:30,330 --> 00:38:31,580 like a voice that has vibrato. 993 00:38:34,215 --> 00:38:37,680 [OBOE AND VIBRATO SOUNDS] 994 00:38:45,190 --> 00:38:46,780 Did that work for everybody? 995 00:38:46,780 --> 00:38:49,070 So all these things are being affected 996 00:38:49,070 --> 00:38:52,450 in kind of interesting ways by the reverb in this auditorium, 997 00:38:52,450 --> 00:38:54,760 which will-- 998 00:38:54,760 --> 00:38:58,870 yeah, but that mostly works. 999 00:38:58,870 --> 00:39:00,400 So we've done a little bit of work 1000 00:39:00,400 --> 00:39:03,580 trying to test whether the brain uses harmonicity to segregate 1001 00:39:03,580 --> 00:39:04,370 actual speech. 1002 00:39:04,370 --> 00:39:06,430 And so very recently, it's become 1003 00:39:06,430 --> 00:39:10,180 possible to manipulate speech and change its harmonicity. 
1004 00:39:10,180 --> 00:39:12,670 And I'm not going to tell you in detail how this works. 1005 00:39:12,670 --> 00:39:15,610 But we can resynthesize speech in ways that 1006 00:39:15,610 --> 00:39:16,900 are either harmonic like this. 1007 00:39:16,900 --> 00:39:17,770 This sounds normal. 1008 00:39:17,770 --> 00:39:19,630 [FEMALE VOICE 2] She smiled and the teeth 1009 00:39:19,630 --> 00:39:23,080 gleamed in her beautifully modeled olive face. 1010 00:39:23,080 --> 00:39:25,640 But we can also resynthesize it so as to make it inharmonic. 1011 00:39:25,640 --> 00:39:26,740 And if you look at the spectrum here, 1012 00:39:26,740 --> 00:39:29,281 you can see that the harmonic spacing is no longer regular. 1013 00:39:29,281 --> 00:39:31,030 All right, so we've just added some jitter 1014 00:39:31,030 --> 00:39:32,740 to the frequencies of the harmonics. 1015 00:39:32,740 --> 00:39:34,210 And it makes it sound weird. 1016 00:39:34,210 --> 00:39:36,220 [FEMALE VOICE 2] She smiled and the teeth 1017 00:39:36,220 --> 00:39:39,480 gleamed in her beautifully modeled olive face. 1018 00:39:39,480 --> 00:39:42,310 But it's still perfectly intelligible, right. 1019 00:39:42,310 --> 00:39:44,590 And that's because the vocal tract filtering 1020 00:39:44,590 --> 00:39:46,089 that I think Hynek is probably going 1021 00:39:46,089 --> 00:39:48,600 to tell you about this afternoon remains unchanged. 1022 00:39:48,600 --> 00:39:50,641 And so the notion here is that if you're actually 1023 00:39:50,641 --> 00:39:53,770 using this harmonic structure to kind of tell you 1024 00:39:53,770 --> 00:39:56,980 what parts of the sound signal belong together-- 1025 00:39:56,980 --> 00:39:59,056 then if you've got a mixture of two speakers that 1026 00:39:59,056 --> 00:40:00,430 were inharmonic, you might think 1027 00:40:00,430 --> 00:40:03,340 that it would be harder to understand what was being said. 
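The lecture deliberately skips how the resynthesis works, so the following is only a toy illustration of the underlying idea: jitter each component's frequency so the spacing is no longer a constant multiple of the fundamental. Every parameter value here is invented:

```python
import numpy as np

# Harmonic partials sit at exact integer multiples of f0. Jittering
# each frequency makes the spacing irregular (inharmonic) while
# leaving the overall spectral region roughly intact.
rng = np.random.default_rng(2)
f0 = 150.0                                  # invented "voice" pitch
harmonic = f0 * np.arange(1, 21)            # 150, 300, 450, ...
jitter = rng.uniform(-0.3, 0.3, size=harmonic.size) * f0
inharmonic = harmonic + jitter

sr = 20000
t = np.arange(sr) / sr
signal = sum(np.sin(2 * np.pi * f * t) for f in inharmonic)

harmonic_spacing = np.diff(harmonic)        # all exactly f0
inharmonic_spacing = np.diff(inharmonic)    # irregular
```

The real manipulation preserves the vocal-tract envelope as well, which this sketch does not attempt.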
1028 00:40:03,340 --> 00:40:06,594 So we gave people this task where we played them 1029 00:40:06,594 --> 00:40:09,010 words, either one word at a time, or two concurrent words. 1030 00:40:09,010 --> 00:40:11,200 And we just asked them to type in what they heard. 1031 00:40:11,200 --> 00:40:14,560 And then we just score how much they got correct. 1032 00:40:14,560 --> 00:40:16,909 And we did this with a bunch of different conditions 1033 00:40:16,909 --> 00:40:18,200 where we increased the jitter. 1034 00:40:18,200 --> 00:40:19,615 So there's harmonic-- 1035 00:40:19,615 --> 00:40:21,460 [MALE VOICE 3] Finally, he asked, 1036 00:40:21,460 --> 00:40:24,040 do you object to petting? 1037 00:40:24,040 --> 00:40:26,860 I don't know why my RA chose this example. 1038 00:40:26,860 --> 00:40:29,410 But, whatever-- it's taken from a corpus 1039 00:40:29,410 --> 00:40:32,000 called TIMIT that has a lot of weird sentences. 1040 00:40:32,000 --> 00:40:33,540 [MALE VOICE 3] Finally, he asked, 1041 00:40:33,540 --> 00:40:35,810 do you object to petting? 1042 00:40:35,810 --> 00:40:39,300 Finally, he asked, do you object to petting? 1043 00:40:39,300 --> 00:40:42,240 Finally, he asked, do you object to petting? 1044 00:40:42,240 --> 00:40:44,740 All right, so it kind of gets stranger and stranger sounding 1045 00:40:44,740 --> 00:40:45,573 and then it bottoms out. 1046 00:40:45,573 --> 00:40:48,070 These are ratings of how weird it sounds. 1047 00:40:48,070 --> 00:40:50,530 And these are the results of the recognition experiment. 1048 00:40:50,530 --> 00:40:52,446 And so what's being plotted is the mean number 1049 00:40:52,446 --> 00:40:55,420 of correct words as a function of the deviation 1050 00:40:55,420 --> 00:40:56,200 from harmonicity. 1051 00:40:56,200 --> 00:40:59,557 So 0 here is perfectly harmonic, and this is increasing jitter. 
1052 00:40:59,557 --> 00:41:01,390 And so the interesting thing is that there's 1053 00:41:01,390 --> 00:41:04,042 no effect on the recognition of single words, which 1054 00:41:04,042 --> 00:41:06,250 is below ceiling, because these are single words that 1055 00:41:06,250 --> 00:41:07,450 are excised from sentences. 1056 00:41:07,450 --> 00:41:10,580 And so they are actually not that easy to understand. 1057 00:41:10,580 --> 00:41:12,370 But when you give people pairs of words, 1058 00:41:12,370 --> 00:41:15,750 you see that they get worse at recognizing what was said. 1059 00:41:15,750 --> 00:41:17,699 And then the effect kind of bottoms out. 1060 00:41:17,699 --> 00:41:19,240 So this is consistent with the notion 1061 00:41:19,240 --> 00:41:21,520 that your brain is actually relying, in part, 1062 00:41:21,520 --> 00:41:23,920 on the harmonic structure of the speech in order 1063 00:41:23,920 --> 00:41:26,380 to pull, say, two concurrent speakers apart. 1064 00:41:29,410 --> 00:41:31,924 And the other thing to note here, though, 1065 00:41:31,924 --> 00:41:34,090 is that the effect is actually pretty modest, right. 1066 00:41:34,090 --> 00:41:36,131 So you're going from, I don't know, this is like, 1067 00:41:36,131 --> 00:41:39,430 0.65 words correct on a trial down to 0.5. 1068 00:41:39,430 --> 00:41:41,020 So it's like a 20% reduction. 1069 00:41:41,020 --> 00:41:43,155 And the mistuning thing also works with speech. 1070 00:41:43,155 --> 00:41:44,030 This is kind of cool. 1071 00:41:44,030 --> 00:41:46,930 So here we've just taken a single harmonic 1072 00:41:46,930 --> 00:41:48,160 and mistuned it. 1073 00:41:48,160 --> 00:41:50,290 And if you listen to that, I think this is this-- 1074 00:41:50,290 --> 00:41:52,857 you'll basically-- you'll hear the spoken utterance. 1075 00:41:52,857 --> 00:41:54,940 And then it will sound like there's some whistling 1076 00:41:54,940 --> 00:41:56,230 sound on top of it. 
1077 00:41:56,230 --> 00:41:58,834 Because that's what the individual harmonic sounds 1078 00:41:58,834 --> 00:41:59,500 like on its own. 1079 00:41:59,500 --> 00:42:03,430 [FEMALE VOICE 3] Academic aptitude guarantees your diploma. 1080 00:42:03,430 --> 00:42:05,363 So you might have been able to hear-- 1081 00:42:05,363 --> 00:42:06,560 I think this is the-- 1082 00:42:06,560 --> 00:42:09,962 [WHISTLING SOUND] 1083 00:42:09,962 --> 00:42:12,750 That's a little quiet. 1084 00:42:12,750 --> 00:42:15,196 But if you listen again. 1085 00:42:15,196 --> 00:42:18,244 [FEMALE VOICE 3] Academic aptitude guarantees your diploma. 1086 00:42:18,244 --> 00:42:19,910 Yeah, so there's this little other thing 1087 00:42:19,910 --> 00:42:21,535 kind of hiding there in the background. 1088 00:42:21,535 --> 00:42:22,850 But it's kind of hard to hear. 1089 00:42:22,850 --> 00:42:26,510 And that's probably because, particularly in speech, 1090 00:42:26,510 --> 00:42:28,180 there are all these other factors that 1091 00:42:28,180 --> 00:42:29,679 are telling you that thing is speech 1092 00:42:29,679 --> 00:42:32,390 and that it belongs together. 1093 00:42:32,390 --> 00:42:34,370 And, all right, let me just wrap up here. 1094 00:42:34,370 --> 00:42:37,190 So there's a bunch of other demos of this character 1095 00:42:37,190 --> 00:42:39,740 that I could 1096 00:42:39,740 --> 00:42:41,217 tell you about. 1097 00:42:41,217 --> 00:42:43,300 Another thing that actually matters is repetition. 1098 00:42:43,300 --> 00:42:45,447 So if there's something that repeats in the signal, 1099 00:42:45,447 --> 00:42:47,780 your brain is very strongly biased to actually segregate 1100 00:42:47,780 --> 00:42:50,010 that from the background. 1101 00:42:50,010 --> 00:42:52,529 So this is a demonstration of that in action. 
1102 00:42:52,529 --> 00:42:54,320 So what I'm going to be presenting you with 1103 00:42:54,320 --> 00:42:58,010 is a sequence of mixtures of sounds that will 1104 00:42:58,010 --> 00:42:59,180 vary in how many there are. 1105 00:42:59,180 --> 00:43:00,679 And then at the end, you're actually 1106 00:43:00,679 --> 00:43:03,110 going to hear the target sound. 1107 00:43:03,110 --> 00:43:04,730 So if I just give you one-- 1108 00:43:04,730 --> 00:43:07,550 [WHOPPING SOUND] 1109 00:43:07,550 --> 00:43:09,020 All right, it doesn't sound-- 1110 00:43:09,020 --> 00:43:10,950 the sound at the end doesn't sound like what 1111 00:43:10,950 --> 00:43:13,214 you heard in the first thing. 1112 00:43:13,214 --> 00:43:15,380 But, here, you can probably start to hear something. 1113 00:43:15,380 --> 00:43:17,550 [WHOPPING SOUND] 1114 00:43:17,550 --> 00:43:19,010 And with here, you'll hear more. 1115 00:43:19,010 --> 00:43:21,457 [WHOPPING SOUND] 1116 00:43:21,457 --> 00:43:22,790 And with here, it's pretty easy. 1117 00:43:22,790 --> 00:43:26,150 [WHOPPING SOUND] 1118 00:43:26,150 --> 00:43:29,060 All right, so each time you're getting one of these mixtures-- 1119 00:43:29,060 --> 00:43:29,960 and if you just get a single mixture, 1120 00:43:29,960 --> 00:43:31,430 you can't hear anything, right. 1121 00:43:31,430 --> 00:43:33,800 But just by virtue of the fact that there is this latent 1122 00:43:33,800 --> 00:43:35,270 repeating structure in there. 1123 00:43:35,270 --> 00:43:36,800 Your brain is actually able to tell 1124 00:43:36,800 --> 00:43:39,470 that there's a consistent source and segregates that 1125 00:43:39,470 --> 00:43:41,420 from the background. 1126 00:43:41,420 --> 00:43:43,910 I started off by telling you that the only way that you 1127 00:43:43,910 --> 00:43:45,950 can actually solve this problem is 1128 00:43:45,950 --> 00:43:48,500 by incorporating your knowledge of the statistical structure 1129 00:43:48,500 --> 00:43:49,940 of the world. 
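One way to see why repetition is such a strong cue is a simple statistical toy, under the simplifying assumption that the target repeats identically while each background differs. This is not a model of what the brain does, just an illustration of why a repeating source is recoverable in principle:

```python
import numpy as np

# A fixed target is embedded in many mixtures, each with an
# independent random background. Averaging the mixtures makes the
# repeating target emerge, because the backgrounds cancel out.
rng = np.random.default_rng(3)
n = 2000
target = np.sin(2 * np.pi * 5 * np.arange(n) / n)

def mixture():
    return target + rng.standard_normal(n)  # fresh background each time

single = mixture()
average = np.mean([mixture() for _ in range(200)], axis=0)

err_single = np.mean((single - target) ** 2)   # about the noise variance
err_average = np.mean((average - target) ** 2) # shrinks with more mixtures
```

A single mixture tells you almost nothing about the target, but the consistent structure across many mixtures pins it down, mirroring the demo above.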
1130 00:43:49,940 --> 00:43:53,900 And, yet, so far the way that the field has really moved 1131 00:43:53,900 --> 00:43:55,834 has been to basically just use intuitions. 1132 00:43:55,834 --> 00:43:57,500 And so people would look at spectrograms 1133 00:43:57,500 --> 00:43:59,360 and say, oh yeah, there's harmonic structure. 1134 00:43:59,360 --> 00:44:00,235 There's common onset. 1135 00:44:00,235 --> 00:44:02,210 And so then you can do an experiment and show 1136 00:44:02,210 --> 00:44:04,620 that it has some effect. 1137 00:44:04,620 --> 00:44:06,260 But what we'd really like to understand 1138 00:44:06,260 --> 00:44:09,590 is how these so-called grouping cues relate to natural sound 1139 00:44:09,590 --> 00:44:10,605 statistics. 1140 00:44:10,605 --> 00:44:12,230 We'd like to know whether we're optimal 1141 00:44:12,230 --> 00:44:14,360 given the nature of real-world sounds. 1142 00:44:14,360 --> 00:44:15,740 We'd like to know whether these things are actually 1143 00:44:15,740 --> 00:44:17,240 learned from experience with sound-- 1144 00:44:17,240 --> 00:44:19,396 or whether you're born with them. 1145 00:44:19,396 --> 00:44:21,020 And the relative importance of these cues 1146 00:44:21,020 --> 00:44:24,370 compared to knowledge of particular sounds, like words. 1147 00:44:24,370 --> 00:44:26,951 And so I really regard this stuff as in its infancy. 1148 00:44:26,951 --> 00:44:28,700 But I think it's really kind of wide open. 1149 00:44:28,700 --> 00:44:31,100 And so the sort of take-home messages 1150 00:44:31,100 --> 00:44:34,580 here are that there are grouping cues 1151 00:44:34,580 --> 00:44:37,970 that the brain uses to take the sound energy that 1152 00:44:37,970 --> 00:44:40,640 comes into your ears and assign it to different sources that 1153 00:44:40,640 --> 00:44:42,890 are presumed to be related to statistical regularities 1154 00:44:42,890 --> 00:44:44,000 of natural sounds. 
1155 00:44:44,000 --> 00:44:45,470 Some of the ones that we know about 1156 00:44:45,470 --> 00:44:48,804 are, chiefly, harmonicity and common onset and repetition. 1157 00:44:48,804 --> 00:44:49,970 I didn't really get to this. 1158 00:44:49,970 --> 00:44:52,730 But we also know that the brain infers parts of source signals 1159 00:44:52,730 --> 00:44:54,800 that are masked by other sources, 1160 00:44:54,800 --> 00:44:56,260 again, using prior assumptions. 1161 00:44:56,260 --> 00:44:59,232 But we really need a proper theory in this domain, 1162 00:44:59,232 --> 00:45:01,190 I think, both to be able to predict and explain 1163 00:45:01,190 --> 00:45:02,480 real-world performance. 1164 00:45:02,480 --> 00:45:05,480 And also, I think, to be able to relate what humans are doing 1165 00:45:05,480 --> 00:45:07,190 in this domain to the machine algorithms 1166 00:45:07,190 --> 00:45:09,440 that we'd like to be able to develop to sort of replicate 1167 00:45:09,440 --> 00:45:10,880 this sort of competence. 1168 00:45:10,880 --> 00:45:13,550 And on the engineering side-- there was sort of a brief period of time 1169 00:45:13,550 --> 00:45:14,990 where there were some people in engineering that 1170 00:45:14,990 --> 00:45:17,245 were kind of trying to relate things to biology. 1171 00:45:17,245 --> 00:45:19,370 But by and large, the fields have sort of diverged. 1172 00:45:19,370 --> 00:45:21,360 And I think they really need to come back together. 1173 00:45:21,360 --> 00:45:22,985 And so this is going to be a good place 1174 00:45:22,985 --> 00:45:25,840 for bright young people to work.