The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

HYNEK HERMANSKY: So we have this wanted information and unwanted information. I call the unwanted information noise and the wanted information signal. Not all noises are created equal. There are some noises whose effects are at least partially understood, and I claim these are what we should strip off very quickly, like linear distortions or speaker dependencies. Those are the two things I will be talking about. You can easily do it in feature extraction.

There are noises which are expected, but whose effects may not be well understood. These should go into machine learning land. Whatever you don't know, you had better have the machine learn. It's always better to use a dumb machine with a lot of training data than to put in something which you don't know for sure. But what you know for sure should go here.

And then there is an interesting set of noises which you don't even know exist. These are the ones I'm especially interested in, because they cause us the biggest problems: noises you don't know exist, noises which somebody introduces that have never been talked about, and so on. So I think this is an interesting problem. Hopefully, I will get to it towards the end of the talk, at least a little bit.

So, some noises with known effects. One is like this. You have a speech sample.

VOICE RECORDING: You are yo-yo.

HYNEK HERMANSKY: And you have another speech sample which looks very different.

VOICE RECORDING: You are yo-yo!

HYNEK HERMANSKY: But it says the same thing, right?
I mean, this is a child and this is an adult. And you can tell: this was me, and this was my daughter when she was 4, not 30.

The problem is that different human beings have different vocal tracts. Especially when it comes to children, the vocal tract is much, much shorter. And I was showing you the effects: you get a very different set of formants, these dark lines, which a number of people believe we should look at if we want to understand what's being said. We have four formants here, but we have only two formants here. They are in approximately similar positions, but where you had a fourth formant, you have only a second formant here.

So what we want are techniques which would work more like human perception: not looking at the spectral envelopes, but mainly looking at the whole clusters. So here is a technique which was developed a long time ago, but I still mention it because it's an interesting way of going about things.

It uses several things. One is that it suppresses the signal at low frequencies. You basically use the equal-loudness curve, so you emphasize the parts of the signal which are heard well. The second thing it uses is critical bands, because the first step you want to take is to integrate over a critical band. The simplest way of processing within the band is to integrate what's happening inside.

So what you do is take your Fourier spectrum. This is the spectrum which has equal frequency resolution at all frequencies and a lot of detail, in this case because of the fundamental frequency. And here you integrate over these different frequency bands. They are narrower at low frequencies and get broader, and broader, and broader, very much as we learned from the experiments with simultaneous masking. So this is textbook knowledge. You get a different spectrum, which is unequally sampled.
So, of course, you go back to equal sampling, but you know that there are fewer samples at the high frequencies, because you are integrating more spectral energy at high frequencies than at low frequencies. And you multiply these outputs by the equal-loudness curve. So from the spectrum you get something whose resolution is more auditory-like. Then you apply the intensity-loudness power law, because you know that loudness depends on the cubic root of intensity. So you get a modified spectrum.

And then you find some approximation to this auditory spectrum, saying: I don't think that all these details have to be important. I would like to have some control over how much spectral detail I want to keep in.

So the whole thing looks like this. You start with the spectrum, you go through a number of steps, and you end up with a spectrum which is, of course, related to the original spectrum, but is much simpler. We eliminated information about fundamental frequency, we merged a number of formants, and so on. So we follow our philosophy: leave out the stuff which you think may not be important.

You don't know how much stuff you should leave out. So if you don't know something and you are an engineer, you run an experiment. You know, "research is what I'm doing when I don't know what I'm doing," supposedly. Wernher von Braun or somebody was saying that. So we didn't know how much smoothing we should do if we wanted a speaker-independent representation. So we ran an experiment over the amount of smoothing, the number of complex poles, which tells you how much smoothing you get from the autoregressive model. And there was a very distinct peak in the situations where we had training templates coming from one speaker and the test coming from another speaker.
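The chain just described is essentially perceptual linear prediction: critical-band integration on a Bark axis, equal-loudness weighting, cubic-root compression, and a low-order autoregressive fit. Here is a minimal sketch in Python; the Bark warping and the cubic root follow the talk, while the band shapes, the equal-loudness curve, and all sizes are simplified placeholders rather than the published PLP details.

```python
import numpy as np

def bark(f):
    # Bark frequency warping: 6 * asinh(f / 600)
    return 6.0 * np.arcsinh(f / 600.0)

def levinson(r, order):
    """Levinson-Durbin: autocorrelation -> AR polynomial coefficients."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i + 1] += k * np.concatenate([a[i - 1:0:-1], [1.0]])
        err *= 1.0 - k * k
    return a

def plp_like(power, sr, n_bands=15, order=5):
    """Toy PLP-style analysis of one frame's power spectrum."""
    freqs = np.linspace(0.0, sr / 2.0, len(power))
    z = bark(freqs)
    centers = np.linspace(0.5, z[-1] - 0.5, n_bands)
    # 1) critical-band integration (rectangular 1-Bark windows here,
    #    standing in for the trapezoidal PLP masking curves)
    bands = np.array([power[np.abs(z - c) < 0.5].sum() for c in centers])
    # 2) crude equal-loudness emphasis: suppress low frequencies
    fc = 600.0 * np.sinh(centers / 6.0)       # band centers back in Hz
    bands *= (fc ** 2 / (fc ** 2 + 1.6e5)) ** 2
    # 3) intensity-to-loudness compression: cubic root
    bands = bands ** 0.33
    # 4) low-order AR smoothing: inverse FFT of the (warped) spectrum
    #    gives an autocorrelation-like sequence -- a simplification
    r = np.fft.irfft(bands)[:order + 1]
    return bands, levinson(r, order)
```

The `order` argument is the number-of-poles knob from the talk: the experiment varied it and found a distinct optimum when training templates and test speech came from different speakers.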
Then we used this kind of representation in speech recognition, to derive the features from the speech. Suddenly, these two pictures start looking much more similar, because what this technique is doing is basically interpreting the spectrum the way hearing might be doing it. It has much lower resolution than people would normally use; it has only two peaks, right? But it was good enough for speech recognition.

What was more interesting, a little bit of interesting science, is that we also found that the difference between the production of adults and children might be just in the length of the pharynx. This is the back part of the vocal tract. The children may be producing speech in such a way that they are already putting the [AUDIO OUT] constriction into the right position against the palate. And because they know, or, well, whatever, mother nature taught them that the pharynx will grow during a lifetime, but the front part of the vocal tract is going to stay similar. So it is the front cavity which is speaker independent, and it is the back cavity, the rest of the vocal tract, which may be introducing speaker dependencies.

It's quite possible. If you ask people how they've been trained, like actors, how they are trained to generate different voices, they are trained to modify the back part of the vocal tract. Normally, we don't know how to do that. But there is some circumstantial evidence that this might be at least partially true.

What is nice is that when we synthesized speech and made sure that the front cavity was always in the same place, even when the formants were in different positions, we were getting very similar results. So we have this theory. The message is encoded in the shape of the front cavity. Through speaker-dependent vocal tracts, you generate the speech spectrum with all the formants.
But then there comes the speech perception part, which extracts what is called the perceptual second formant. Don't worry about that. Basically, [AUDIO OUT] at most two peaks from the spectrum. And this is used for decoding the signal, speaker independently.

However, I told you one thing, which is: don't use the textbook data and be exact-- [AUDIO OUT] And so I was challenged by my friend, the late Professor Fred Jelinek. He is claimed to have said, "airplanes don't flap wings," so why should we be putting the knowledge of hearing in? Actually, he said something quite different. This is what The New York Times quoted after he passed away, because that was supposedly one of his famous quotes. No, he said something else. Airplanes do not flap wings, but they have wings nevertheless. They use some knowledge from nature in order to get the job done. The flapping of the wings is not important. Having the wings is important if you want to create a machine which is heavier than air and flies.

So we should try to include everything we know about human perception, and production, and so on. However, we need to estimate the parameters from the data, because, don't trust the textbooks and that sort of thing. You have to derive it in a way that is relevant to your task. What I wanted to say is: you can use the data to derive similar knowledge. And I want to show it to you.

What you can do is use a technique, again known from the '30s, called Linear Discriminant Analysis. This is the statistician's friend. For this you need a within-class covariance matrix and a between-class covariance matrix. You need labeled data. And you need to make some assumptions, which it turns out are not very critical; they are approximately satisfied when you are working with the spectra.
So what we did was take this spectrogram and generate the spectral vectors from it. We would always cut out a part of the spectrum, a short-term spectrum, and assign it the label of the part of speech it came from. So this one would have the label "yo," right? And so you get a big box full of vectors, all of them labeled. So you can do LDA. And you can look at what the discriminants are telling you.

From LDA, you get the discriminant matrix, and each row (or column, whatever) of it creates a basis onto which you should project the whole spectrum, right? These are the four obvious ones here. You also get the amount of variability present in the discriminant matrix you started with.

What you observe, which is very interesting, is that these bases tend to project the spectrum at the beginning with more detail than the spectrum at the end. So essentially, in the first group, they appear to be emulating properties of human hearing, namely the non-equal spectral resolution, which has been verified in many, many ways. Among them was one I was showing you: the masking experiment of Harvey Fletcher. There are a number of reasons to believe that this is a good thing.

This is what you see. Essentially, if you look at the zero crossings of these bases (this is the first basis), they are getting broader and broader. So you are integrating more and more of the spectrum, right? This is all right, so I'll leave it. Oh, and this is from another experiment with a very large database: very much the same, a very similar thing. The eigenvalues quickly decay.
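For concreteness, here is a minimal sketch of the LDA computation just described: labeled spectral vectors in, discriminant bases out. The regularizing ridge and the array shapes are my additions, not anything from the talk.

```python
import numpy as np
from scipy.linalg import eigh

def lda_bases(X, y):
    """Linear discriminant analysis on labeled spectral vectors.

    X: (n_frames, n_freq) short-term spectra; y: one phoneme label per frame.
    Returns eigenvalues and discriminant bases, strongest first.
    """
    dim = X.shape[1]
    mean = X.mean(axis=0)
    Sw = np.zeros((dim, dim))            # within-class covariance
    Sb = np.zeros((dim, dim))            # between-class covariance
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mean, mc - mean)
    # generalized symmetric eigenproblem: Sb v = lambda * Sw v
    w, V = eigh(Sb, Sw + 1e-6 * np.eye(dim))   # small ridge for stability
    order = np.argsort(w)[::-1]
    return w[order], V[:, order]         # columns of V are the bases
```

The leading columns of `V` are the bases whose zero crossings are examined above, and the eigenvalues `w` are the "amount of variability" that decays quickly.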
And what is interesting is that you can actually formally ask, "What is your resolution?" by doing what is called perturbation analysis. So you take some signal, say a Gaussian, here. And you project it onto the LDA bases. Then you perturb it; you move it. And you ask: how much effect does this movement of this, say, simulated spectral element of speech have on the output, seen through the projection onto these many bases? And what you see, as I was suggesting, is that the sensitivity to movements of the formant is much higher at the beginning of the spectrum and much lower at the end of the spectrum.

You can actually compare it to what we had initially in the PLP analysis, when we integrated the spectrum based on the knowledge coming from the textbook. And it's very much the same. If there were just a plain cosine basis computing the mel cepstrum, the sensitivity would be the same at all frequencies. But these bases from the LDA are very much doing the thing which critical-band analysis would be doing.

You can look it up. It was a PhD thesis from the Oregon Graduate Institute by Naren Malayath, who is now a big-- you had better be friends with him. He's at Qualcomm. I think he's the head of the image processing department. [COUGHS] We had better be good friends with him. [INAUDIBLE]

[LAUGHTER]

OK. Another problem: linear distortions. Linear distortions once were a problem. They are not a problem anymore, but in the old days, they were. The problem shows up in a rather dramatic way, as follows. Here we have one sound.

VOICE RECORDING: Beat.

HYNEK HERMANSKY: Beat. So "buh-ee-tuh." Here is the very distinct E. Every phonetician would agree: this is E. A high formant, a cluster of high formants, and so on. Some vicious person, namely one of my graduate students, took this spectral envelope, designed a filter which is exactly its inverse, and put this speech through this inverse filter, so it looked like this.
There was a spectrum with nine formants; now it's entirely flat. And if you listen to it, you've probably already guessed what you will hear.

VOICE RECORDING: Beat.

HYNEK HERMANSKY: You'll hear the first speech, right? But you--

VOICE RECORDING: Beat.

HYNEK HERMANSKY: It's OK when this-- oops! That's what you would-- sorry.

VOICE RECORDING: Beat. Beat. Beat.

HYNEK HERMANSKY: But whoever doesn't hear E, don't spoil my talk. I think everybody has to hear E, even though any phonetician would get very upset, because they would say this is not E. Because, of course, what is happening is that human perception takes the percept relative to the neighboring sounds, right? And since we filtered everything with the same filter, the relative percept is still the same. So this is something which we needed to put into our machine. And we did.

Signal-processing-wise, things are actually very straightforward, because what you have is the speech signal convolved with the impulse response of the environment. So in the logarithmic domain, this is standard signal processing stuff: basically, you have the logarithmic spectrum of the signal plus the logarithmic spectrum of the environment, which is fixed. So what we are finding here is that if you somehow remove this environment, or if you make things invariant to this environment, then you may be winning.
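The additivity claim is easy to check numerically. This little sketch (with made-up random stand-ins for the speech and the fixed channel) convolves a signal with a channel and confirms that, in the log magnitude spectrum, the channel turns into an additive term.

```python
import numpy as np

rng = np.random.default_rng(0)
speech = rng.standard_normal(4096)                                # stand-in "speech"
channel = rng.standard_normal(64) * np.exp(-np.arange(64) / 8.0)  # fixed "room"

received = np.convolve(speech, channel)        # speech heard through the room
N = len(received)

def logmag(x):
    return np.log(np.abs(np.fft.rfft(x, N)) + 1e-12)

S, H, R = logmag(speech), logmag(channel), logmag(received)

# convolution in time = addition in the log spectral domain:
print(np.max(np.abs(R - (S + H))))             # numerically tiny
```

Subtracting a long-term average per frequency (cepstral mean subtraction) would remove `H`; the filtering idea described next achieves the same thing per band.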
The problem here is that at each frequency you have a different additive constant, because this is a spectrum, right? If it were just one constant at all frequencies, you would just subtract it. But in this case, you can use a trick. You remember what Josh told us this morning: hearing is doing spectral analysis. And what I was trying to tell you is that at each frequency, in each critical band, the trajectory of the spectral energy is, to a first approximation, independent of the others. You can do independent processing in each frequency band, and maybe not screw up too many things.

So this was the step which we took. We said, OK, we will treat each temporal trajectory separately, right? But we will filter out the stuff which is not changing. So for each frequency channel, we do independent processing. And the processing was that we would first take the logarithm, and then we would put each trajectory through a bandpass filter, the main point of which was to suppress DC and slowly changing components. Mainly it was suppressing anything slower than one hertz. And it also turned out to be useful to suppress things faster than about 15 hertz.

So this is what you get out. This was the original spectrogram; this was the modified spectrogram. This trajectory got a little bit smoother. Transitions got smoothed out because there was a bandpass filter; there was a high-pass element to it. Very much what we thought. Well, this is interesting: maybe this is what human hearing might be doing. To tell you the truth, we didn't know. For the people who are from MIT and who work in vision: it was inspired by some work on the perception of lightness, what David Marr called lightness. And here was the thing which I told you about, 6 by 749. David Marr was talking about processing in space; we applied it to processing in time. But it was still good enough that we definitely got rid of the problem. So here it is. The spectrograms, which looked very different, suddenly start looking very similar.
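This is the RASTA idea: bandpass-filter each log-energy trajectory so that the fixed channel (the DC term) and very fast changes are removed. A minimal sketch, assuming a 100 Hz frame rate and using an ordinary Butterworth bandpass as a stand-in for the published RASTA filter; only the roughly 1-15 Hz passband comes from the talk.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def rasta_like(log_energies, frame_rate=100.0, lo=1.0, hi=15.0):
    """Bandpass-filter each critical-band log-energy trajectory.

    log_energies: (n_frames, n_bands) array of log spectral energies.
    Each band (column) is filtered independently along time, which
    removes any constant per-band offset -- i.e., a fixed channel.
    """
    nyq = frame_rate / 2.0
    b, a = butter(2, [lo / nyq, hi / nyq], btype="band")
    # filtfilt is zero-phase (an offline convenience; the original
    # RASTA filter was causal, so it could run in real time)
    return filtfilt(b, a, log_energies, axis=0)
```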
And it was not just about how it looks here. Remember, I'm an engineer; I was working for a telephone company at the time. Was it also working better on some problems which we had before? We had a severely mismatched environment: getting the training data from the labs and testing it at US West, in Colorado. The recognizer didn't work at all; after this processing, everything was cool and dandy.

OK. So now we can do RASTA LDA. We can do the same trick; how about that? You take the spectral temporal vectors, and you label each of these vectors by the label of the phoneme which is at the center of the trajectory. And just to have some fun, we took a rather long vector; it was about one second. And we asked, well, what kind of projections would these temporal trajectories go onto if we wanted to get rid of speaker-dependent-- I mean, environment-dependent information?

Well, these were the impulse responses, and these were the frequency responses. Because in this case, you get FIR filters. These discriminants are FIR filters which are to be applied to the temporal trajectories of spectral energies, because this is basically a projection of the trajectory onto the basis, and the basis is one second long. This is the impulse response. It cannot be active all that long, because eventually the values become zero, right? Where the filter should do nothing, it does nothing. But you can see the active part is about a couple of hundred milliseconds, maybe a little bit more.

And these are bandpass filters, essentially passing frequencies between 1 hertz and 10 or 15 hertz, very similar at all frequencies. There was another thing we were very interested in: should we really do different things at different frequencies? The answer is pretty much no. And so that was very exciting.
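Since each temporal discriminant is just an FIR filter, you can inspect it in the modulation domain and apply it by convolution. A small sketch; the Gaussian-derivative `basis` here is a made-up placeholder for a real LDA discriminant (for example, one column from `lda_bases` above), and the 100 Hz frame rate is an assumption.

```python
import numpy as np
from scipy.signal import freqz

frame_rate = 100.0                               # frames per second (assumed)
taps = np.arange(101) - 50                       # one second of context
basis = np.gradient(np.exp(-0.5 * (taps / 10.0) ** 2))  # placeholder "discriminant"

# frequency response in the modulation domain (x-axis in Hz):
w, h = freqz(basis, worN=512, fs=frame_rate)
response_db = 20 * np.log10(np.abs(h) + 1e-12)   # should pass roughly 1-15 Hz

# applying the discriminant = FIR-filtering one band's log-energy trajectory:
trajectory = np.random.randn(1000)               # stand-in log energies
filtered = np.convolve(trajectory, basis, mode="same")
```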
Well, anyway, let me tell you about yet another experiment, which hasn't been published yet and is going to be presented next week. We wanted to move into the 21st century, so we built a convolutional neural network. And our convolutional network is maybe not what you are used to, where you have 2D convolutions. We just said: we will have a 1D filter as the first processing step in this deep neural network. So we postulated the filter at the input to the neural network. But in this case, we trained the whole thing together; it wasn't just LDA and that sort of thing.

We forced all filters at all frequencies to be the same, because we expected that's what we would want to get. And we were asking what these filters look like when they come out of the convolutional neural network. Well, again, I wouldn't be showing it if it weren't somehow supportive of what I want to say. They don't look all that different from what we were getting from LDA. They definitely enhance the important modulation frequencies around four hertz, right? They pass a number of them. I'm showing three here, somewhat arbitrarily; most of them look like that, and we used 16 of them. They pass between 1 and 10 hertz in the modulation spectral domain, so changes which happen 1 to 10 times a second. It's coming out in a paper, so you can look it up if you want.
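In modern terms, forcing the same 1D temporal filter at every frequency is a 2D convolution with a kernel of height one, shared across the (band, time) plane. A hedged PyTorch sketch; the 16 filters match the talk, but the kernel length, band count, and everything else are illustrative guesses.

```python
import torch
import torch.nn as nn

class SharedTemporalFilters(nn.Module):
    """Learned FIR filters applied along time only, with the same
    weights used in every frequency band (weight sharing)."""

    def __init__(self, n_filters=16, kernel_len=101):
        super().__init__()
        # kernel (1, kernel_len): slides over time, never across bands
        self.conv = nn.Conv2d(1, n_filters, kernel_size=(1, kernel_len),
                              padding=(0, kernel_len // 2))

    def forward(self, spectrogram):          # (batch, 1, bands, frames)
        return self.conv(spectrogram)        # (batch, filters, bands, frames)

x = torch.randn(8, 1, 15, 300)               # 15 bands, 3 s at 100 frames/s
y = SharedTemporalFilters()(x)               # same 16 filters in every band
```

Trained jointly with the rest of the network, the learned kernels are what the talk compares against the LDA filters.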
The last thing which I still wanted to do: I said, well, maybe it has something to do with hearing after all. We were deriving everything from speech. There was no knowledge about hearing in it, except that we said we think we should be looking at long segments of the signal, and we expected that the filtering would be very much the same at all frequencies. Actually, not even that; it came out automatically. There wasn't much knowledge from human hearing [AUDIO OUT] in. In the first one, when I was showing you the critical-band spectral resolution, we started with the full Fourier spectrum. We didn't tell it anything about human hearing. And what comes out is a property of human hearing. I mean, tell me if there is another such strong piece of evidence that speech is processed in a way that fits human hearing, because the only thing which was used here was the speech, labeled into the classes which we use for recognizing speech sounds.

So what we did, and that was with Nima Mesgarani and my students: we took a number of these cortical receptive fields, which we talked about a little bit before, about 2,000 or 3,000 of them, which we basically spread out on the floor at the University of Maryland, and computed principal components from these fields in both the spectral and the temporal domain. Here I'm showing the temporal domain. And how do they look? They are very much like the RASTA filter. This is what is happening: it's a bandpass, and the peak is somewhere around four hertz. Essentially, I'm showing you here what I understood might be a transfer function of the auditory cortex, derived with all the usual disclaimers, like: this is a linear approximation to the receptive fields, and there might have been problems with collecting them, and so on. But this is what we are getting as a possible transfer function of the auditory cortex.

I'm doing fine with the time, right? So you can do an experiment in this case. You can actually generate speech which has certain rates of change eliminated: by doing all this, computing the cepstrum, filtering each trajectory, and reconstructing the speech. And you ask people, what do they hear? How well do they recognize the speech? You can also ask a machine, "Do you recognize it?" For this you don't have to regenerate the speech; you just use the LPC cepstrum. This is the full experiment: this is what is called a residual-excited LPC vocoder.
But it's modified in such a way that you can artificially slow down or modify the temporal trajectories. If there is no filter, you simply get a replica of the original signal here.

So the bottom line of the experiment is this: if you start removing components which are somewhere between 1 and 16 hertz, you get hurt significantly. You get hurt most in performance when you remove components between 2 and 4 hertz; that's where you take the biggest hit. Here we are showing how much these bands contribute to recognition performance by humans (these are the white bars) and by the speech recognizer (those are the black bars).

So you can see that in machine recognition, you can safely remove the stuff between 0 and 1 hertz. It's not going to hurt you; it only helps you in this task. In speech perception, there is a little bit of a hit, but certainly not as much of a hit as you get when you move to the part where you hear the-- [AUDIO OUT] And certainly the components higher than 16 or 20 hertz are not important. That, Homer Dudley already knew in the 1930s, when he was designing his vocoder. But it was a nice experiment. It came out just recently, so you can look it up if you want to have a go.
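The quantity being manipulated in these experiments is the modulation spectrum: the spectrum of a band's log-energy trajectory, with modulation frequency on the x-axis. A minimal sketch of how one computes it, assuming a 100 Hz frame rate; for speech it typically peaks around 4 Hz, roughly the syllable rate.

```python
import numpy as np

def modulation_spectrum(log_energy, frame_rate=100.0):
    """Magnitude modulation spectrum of one band's log-energy trajectory."""
    x = log_energy - log_energy.mean()        # drop the 0 Hz (channel) term
    x = x * np.hanning(len(x))                # taper to reduce leakage
    freqs = np.fft.rfftfreq(len(x), d=1.0 / frame_rate)
    return freqs, np.abs(np.fft.rfft(x))
```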
Just to summarize what I have told you so far: Homer Dudley was telling us that the information about the message is in the slow modulations, the slow movements of the vocal tract, which modulate the carrier; the information about the message is in the slow modulations of the signal, the slow changes of the speech signal in individual frequency bands. Slow modulations imply long impulse responses, right? So 5 hertz suggests something around 200 milliseconds, my magic number, which we have observed in the summation of sub-threshold signals and in temporal masking. And in hearing there are a number of things, which I have listed.

Frequency discrimination improves with duration up to about 200 milliseconds; below 200 milliseconds of signal, you don't get such good frequency discrimination. Loudness increases up to 200 milliseconds, then it stays constant; it depends on amplitude. The effect of forward masking, which I was showing you, lasts about 200 milliseconds, independent of the amplitude of the masker. And sub-threshold integration shows the same thing.

So I'm suggesting there seems to be some temporal buffer in human hearing at some level; I suspect it's the cortical level at which this processing happens. Whatever happens within this buffer, it's a fair thing to treat as one element. So you can do filtering on it, you can integrate it: basically, all kinds of things. If things are happening outside this buffer, those parts should be treated [AUDIO OUT] in parts.

So how does it help us? You remember the story about the phonemes. You remember that phonemes don't look like this; they look like this. The length of the coarticulation pattern is about 200 milliseconds, perhaps more. So the good thing about it is that if you look at a sufficiently long segment of the signal, you will get the whole coarticulation pattern in. And then you have a chance that your classifier is getting all the information about the speech sound for finding that sound. And then you may have a chance to get a good estimate of the speech sounds. But you need to use these long temporal segments.

And here, I can say it even to YouTube: I think we should claim full victory here, because most speech recognition systems do this nowadays. They use long segments of the signal as the first step of the processing. So I can happily retire, telling my grandchildren: well, we knew it. We were the only ones.
Well, maybe not the only ones, but we were certainly using it for a long time, in such a way that we even designed several techniques around it. So this is classifying speech directly from the temporal patterns. We would take these long segments of the speech, through some processing, and put neural nets on every temporal trajectory, trying to estimate the sound at each frequency, each carrier frequency. And then we would fuse all these decisions from the different frequency bands. And then we would use the final vector of posterior probabilities.

This is unlike what people most often do, which is to take the short-term spectra, and then maybe take a longer segment, a block of these short-term spectra. We said: the short-term spectrum is good for nothing; we just cut it into pieces. And we classify each temporal trajectory individually in the first step. I'm telling you about it now because it may be useful later, when I will be telling you about dealing with some kinds of noises.

But you understand what we did here, right? Instead of using spectral temporal blocks, we would be using temporal trajectories at each critical band, very much along the lines of what we think hearing is doing with the speech signal. The first thing hearing does is take the signal and subdivide it into individual frequency bands; then it processes each temporal trajectory coming out of each of these cochlear filters to extract the information. And then it tries to figure out what to do with this information later, right?
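Here is a hedged sketch of that per-band architecture: one small net per critical band looks at about a second of that band's trajectory and estimates phoneme scores, and a merger net fuses the per-band decisions. The layer sizes, context length, and phoneme count are all made-up placeholders, and note the talk says the parts were originally trained separately, not jointly as a single network.

```python
import torch
import torch.nn as nn

class BandNet(nn.Module):
    """One band's classifier: ~1 s of log-energy trajectory -> phoneme scores."""

    def __init__(self, context=101, n_phones=40, hidden=200):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(context, hidden), nn.Tanh(),
                                 nn.Linear(hidden, n_phones))

    def forward(self, traj):                      # (batch, context)
        return self.net(traj)

class PerBandThenMerge(nn.Module):
    """Per-band temporal classifiers fused by a merger net."""

    def __init__(self, n_bands=15, context=101, n_phones=40):
        super().__init__()
        self.bands = nn.ModuleList([BandNet(context, n_phones)
                                    for _ in range(n_bands)])
        self.merger = nn.Sequential(nn.Linear(n_bands * n_phones, 300), nn.Tanh(),
                                    nn.Linear(300, n_phones))

    def forward(self, block):                     # (batch, n_bands, context)
        per_band = [net(block[:, i]) for i, net in enumerate(self.bands)]
        return self.merger(torch.cat(per_band, dim=1))   # fused scores
```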
Well, we have another technique, called MRASTA, just for people who are interested in cochlear-- I mean, cortical modeling. You take this data and project it onto a number of projections with variable resolution. So you get a huge vector of data coming from different parts of the spectrum, and then you feed it into the speech recognizer.

The filters look like this. They have different temporal resolutions and spectral resolutions. We are pretty much integrating, or differentiating, over three critical bands, following some of the filters coming from the old low-order PLP model and its three-Bark critical-band integration. So these ones look a bit like what people would call Gabor filters, but they are just put together, basically, from these two pieces in time and in frequency: different temporal resolutions enhancing different components of the modulation spectrum. Again, you may claim that this resembles the Thorston-- [AUDIO OUT] Josh was mentioning in the morning. It's cochlear filter banks-- [CLEARS THROAT] auditory, of course. I mixed up cochlear and cortical: cortical filter banks, modulation filter banks.
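A minimal sketch of a multi-resolution temporal filter bank in the spirit of MRASTA: first and second derivatives of Gaussian windows at several widths, each convolved along time with a critical-band trajectory. The specific widths and lengths are illustrative assumptions, not the published settings.

```python
import numpy as np

def gaussian_derivative_bank(sigmas=(1.0, 2.0, 4.0, 8.0), length=101):
    """Temporal filters: 1st and 2nd derivatives of Gaussians.

    sigmas are widths in frames (10 ms frames assumed); the wider the
    Gaussian, the lower the modulation frequencies the filter passes.
    """
    t = np.arange(length) - length // 2
    bank = []
    for s in sigmas:
        g = np.exp(-0.5 * (t / s) ** 2)
        g1 = (-t / s**2) * g                     # 1st derivative: bandpass
        g2 = (t**2 / s**4 - 1.0 / s**2) * g      # 2nd derivative: bandpass
        bank.append(g1 / np.abs(g1).sum())       # crude gain normalization
        bank.append(g2 / np.abs(g2).sum())
    return np.array(bank)                        # (2 * len(sigmas), length)

# each row is convolved with each band's log-energy trajectory, e.g.:
# features = [np.convolve(traj, f, mode="same") for f in gaussian_derivative_bank()]
```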
So there are some novel aspects in this type of processing that I want to stress. It was novel in 1998; fortunately, as I said, it is becoming less novel 15 years later. It uses a rather long temporal context of the signal as the input. It already uses hierarchical neural nets, so deep neural network processing, which wasn't around in 1998. And there was independent processing by the neural net estimators at the individual frequencies. The only thing which we didn't do at the time, and I don't know how important it is (I don't think it hurts anybody), is that we were training these parts of the system, this deep neural net, individually, and just concatenating the outputs. We never trained it all together, as we do now in convolutional nets and that sort of thing, because, simply, we didn't even dream about doing that; we didn't have the hardware. That was one thing which I tried to point out during the panel: a lot of the progress in neural net research and the success of neural nets comes from the fact that we have very, very powerful hardware, which we didn't have. So we didn't dream about doing many things, even when they might have made sense.

So, OK. Where are we? Oh, I see, one more thing. Coarticulation. This is a problem which has been known since people started looking at spectrograms. There are some consonants, like "kuh" or "huh," which are very dependent on what's following. So for "kuh": in front of "ee" it has a burst here, in front of "ooh" it has a burst here, and in front of "ah" there's a burst here. So the phonemes are very different depending on the environment.

When you start using these long temporal segments, with all the tricks, or some of the tricks, I showed you, what comes out is a posteriogram in which one "kuh" looks almost the same as another "kuh." Since it looks at the whole coarticulation pattern, the group of phonemes, in order to recognize the sound, it does the right thing. So I suspect that the success of these long temporal contexts, which people are using now in speech recognition, comes from the fact that they partially compensate for the problems with coarticulation. And what I also want to say is: coarticulation is not really a problem. It just spreads the information over a long period of time. If you know how to suck it out, it can be useful. But it's a terrible thing if you start looking only at individual frequency slices of the short-term spectrum.

So here is another deep net, from-- I don't know the name, sorry. It was already an almost legal deep net. You estimate the posteriogram from a short window in the first step, a window about 40 milliseconds long.
Oh, yes -- one more thing I want to stress. [LAUGHS] I'm sorry, I didn't want to show it all at the same time. But anyway, I don't think that there is anything terribly special about the short-term spectrum of speech. I think what really matters is how you process the temporal trajectories of the spectral energies. This is what human hearing seems to be doing, and it seems to do a good job in our speech recognizers. So essentially, this is one message which I want to leave you with: don't be afraid to treat different parts of the spectrum differently, individually -- you may get some advantages from that. It started with our work, but it shows up over and over again.

So go away from the short-term spectrum and start doing what hearing is doing -- start using the temporal trajectories of the spectral energies coming from your analysis. To the point that we did this work on [INAUDIBLE], going directly: don't get your time-frequency patterns from the short-term spectra; always think about how to get directly what you want. It turns out that there is a nice way of directly estimating the Hilbert envelopes of the signal in frequency bands, called frequency domain linear prediction. [STATIC] Marios -- this is his PhD thesis -- and we were working together for a couple of years.

So what you do: instead of applying autoregressive modeling -- LPC modeling -- with windows in time to get frequency vectors, you do it on a cosine transform of the signal. So you move the signal into the frequency domain.
And then you put the windows on this cosine transform of the signal, and you derive directly the all-pole approximations to the Hilbert envelopes of the signal in the sub-bands. You never do the Hilbert transform; you just use the usual techniques from autoregressive modeling. The only difference is that you work [AUDIO OUT] on the cosine transform of the signal, and your windowing determines which frequency range you are looking at. So, of course, you typically use longer windows at higher frequencies and shorter windows at lower frequencies -- you can do all these things. But this is a convenient way. [COUGHS] It's convenient, and this part is more and more just for fun, but maybe somebody might be interested in it.

So essentially, what you do is take the signal and separate out the modulation -- the AM component with which the signal is being modulated. This carries the information about the message; and then there is the carrier itself. And you can build what is called a channel vocoder, which we did, and you can listen to the signal. So this is, in some ways, interesting -- the original signal:

VOICE RECORDING: They are both trend-following methods.

HYNEK HERMANSKY: Oops. I tried to make it somehow -- [AUDIO OUT]

VOICE RECORDING: They are both trend-following methods.

HYNEK HERMANSKY: Somebody may recognize Jim Glass from MIT in that.

VOICE RECORDING: In an ideological argument, the participants tend to thump the table.

HYNEK HERMANSKY: So this is silly, right? Now you can listen to what you get if you keep just the modulations and excite them, you know, with white noise. Oops. Sorry. Oops! What am I doing? Oh, here.

VOICE RECORDING: (WHISPERING) They are both trend-following methods.

HYNEK HERMANSKY: Do you recognize Jim Glass? I can.
VOICE RECORDING: (WHISPERING) In an ideological argument the participants tend to thump the table.

HYNEK HERMANSKY: And then you can also listen to what is left after you eliminate the message.

VOICE RECORDING: Mm-hmm. Ha, ha.

[LAUGHTER]

HYNEK HERMANSKY: Maybe it's a male, right?

VOICE RECORDING: Mm-mm [VOCALIZING]

HYNEK HERMANSKY: Oh, this is fun. This is [CHUCKLES] fun. It may have some implications for speech recognition. But certainly, if I have ever seen a verification of what old Homer Dudley was telling us about where the message is -- this is it. All right?

Anyway, what is good about this is that once you get the all-pole envelope, [AUDIO OUT] it is relatively easy to compensate for linear distortions. The main effect of a linear distortion is basically to shift the energy in different frequency bands by different amounts. But all that information sits in the gain of the model -- one parameter, which you essentially ignore after you do this frequency domain linear prediction. And you get very similar trajectories for both: this is telephone speech and clean speech, which differ quite a bit.

And I hope that I have -- oh, this is for reverberant speech. There also seems to be some advantage there, because reverberation, to a first approximation, is a convolution with the impulse response of the room. So if you use truly long segments -- in this case we used about 10 seconds of the signal, approximated it by this all-pole model, and eliminated the DC component from that -- you seem to be getting some advantage. [AUDIO OUT]
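As a rough illustration of the recipe, here is a minimal FDLP sketch, assuming a single rectangular band of DCT coefficients and a textbook autocorrelation-method LPC; the published systems use Bark-spaced windows and more careful modeling. Dropping the model gain, as described above, is the one-parameter channel normalization.

```python
import numpy as np
from scipy.fft import dct
from scipy.signal import freqz

def levinson(r, order):
    # Levinson-Durbin: autocorrelation sequence -> LPC coefficients, gain.
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

def fdlp_envelope(x, band, order=20, n_points=512):
    # DCT the whole signal, keep one band of coefficients, and fit an
    # all-pole model to that coefficient sequence: its power response
    # approximates the Hilbert envelope of the sub-band signal in time.
    c = dct(x, type=2, norm="ortho")
    seq = c[band[0]:band[1]]
    r = np.correlate(seq, seq, mode="full")[len(seq) - 1:len(seq) + order]
    a, gain = levinson(r, order)
    # Ignoring `gain` here is what removes the linear channel effect.
    _, h = freqz([1.0], a, worN=n_points)
    return np.abs(h) ** 2

fs = 8000
t = np.arange(fs) / fs                       # 1 second of toy signal
x = np.sin(2 * np.pi * 440 * t) * (1 + 0.5 * np.sin(2 * np.pi * 3 * t))
env = fdlp_envelope(x, band=(400, 1200))     # illustrative DCT band
```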
So: known noise with unknown effects. I say, train the machine on that.

Here is one example. You have phoneme error rates under different noise conditions. If everything is good -- clean training, clean test -- you get about a 20% phoneme error rate. That is a state-of-the-art, reasonable result. But once you start adding noise, things quickly go south. The typical way of dealing with it is multi-style training: if you know which noises you are going to deal with, you train on them. And things get better, but you pay some price. Certainly you pay a price on clean speech, because your model basically became much mushier -- it's not a very sharp model anymore. So here we had a wonderful 21%, and we paid about 10% relative for getting the better performance on the noises.

What we observed is that you get much better results -- most noticeably better results -- if you have a different recognizer for each type of noise. But of course, the problem is that there are different types of noise, so you have this number of recognizers, and now you need to pick the best stream. And how do you do that? This is something, again, which I was also mentioning earlier -- something we are struggling with, and we don't know how to do well. If you are a human being, maybe you can just look at the outputs and keep switching until the message starts looking reasonable. But we want to do it fully automatically -- I don't know why we only want to build fully automatic recognizers, but that's what we are doing. So you want the system to pick the best stream.
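Multi-style training, in its simplest form, is just data augmentation. A sketch, where the noise recordings, the SNR grid, and the downstream trainer are placeholders:

```python
import numpy as np

def add_noise(clean, noise, snr_db):
    # Mix a noise recording into a clean utterance at a given SNR.
    noise = np.resize(noise, clean.shape)          # loop/trim the noise
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

def multistyle_corpus(clean_utts, noises, snrs_db=(0, 10, 20)):
    # Multi-style training set: every utterance paired with every
    # expected noise type at several SNRs, plus the clean original.
    corpus = list(clean_utts)
    for utt in clean_utts:
        for noise in noises:
            for snr in snrs_db:
                corpus.append(add_noise(utt, noise, snr))
    return corpus

rng = np.random.default_rng(0)
clean = [rng.standard_normal(16000) for _ in range(2)]   # stand-in utterances
noises = [rng.standard_normal(8000)]                     # stand-in noise
corpus = multistyle_corpus(clean, noises)                # 2 clean + 6 noisy
```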
So how do we do that? First thing, of course -- one way is to recognize the type of noise and use the appropriate recognizer. This is a typical system nowadays; BBN is doing it. My feeling is that it's somehow cleaner and more elegant to figure out what the right output is, because it's neither about the noise nor about the signal alone; it's about how the signal interacts with the classifier. So for this, we have to figure out what "best" means.

Here we have two posteriograms. If you look at them, knowing that these are trajectories of the posteriors of the speech sounds, you know this one is good and this one is not so good. Because the word is "nine" -- "n-ay-n" -- and here there is a lot of garbage. So I know that; now I want to do it automatically. Ideally, I would pick the stream which gives me the lowest error. But I don't know what the lowest error is, because I don't know what the correct answer is. That's the problem, right?

So one approach is to try to do what my eye just did: figure out which posteriogram is the cleanest. Another one follows this thinking: when I train a neural net on something, it's going to work well on the data on which it was trained. So I have some gold-standard output, and I can try to see how much my output differs when the test data are not the same as the data on which the recognizer was trained. We were using both of these tricks.

The first one uses a technique like this. You look at the differences between posteriors -- the KL divergence between posterior vectors a certain distance from each other -- and you slide this window, cumulatively covering as much data as you possibly can. What you observe is that on good, clean data this cumulative divergence keeps increasing, and after you cross the point where the coarticulation pattern ceases, you suddenly get a pretty much fixed tail of the cumulative KL divergence. On noisy data, the noise starts dominating these divergences and differences. Because it is the signal that carries the information, and the information is in the changes; noise creates "information" which doesn't have this segmental structure. So this is one technique which we use.
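A sketch of that first trick, assuming each stream delivers a (frames x phones) posterior matrix; the lag range and the exact divergence are illustrative:

```python
import numpy as np

def kl(p, q, eps=1e-10):
    # KL divergence between two posterior vectors (eps for stability).
    p = (p + eps) / np.sum(p + eps)
    q = (q + eps) / np.sum(q + eps)
    return float(np.sum(p * np.log(p / q)))

def cumulative_divergence(post, max_lag=30):
    # Mean divergence between posteriors dt frames apart, accumulated
    # over dt. On clean speech the curve rises and then flattens once
    # dt exceeds the coarticulation span; noise changes its shape.
    T = len(post)
    curve = np.zeros(max_lag + 1)
    for dt in range(1, max_lag + 1):
        mean_d = np.mean([kl(post[t], post[t + dt]) for t in range(T - dt)])
        curve[dt] = curve[dt - 1] + mean_d
    return curve  # compare curves (e.g., their tails) across streams

demo = cumulative_divergence(np.random.dirichlet(np.ones(40), size=200))
```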
Another technique, which is now even more popular, at least in my lab, is training another network. We trained an autoencoder on the output of a classifier -- on the output of the classifier as it is being used on its training data. The autoencoder thus learns what, on average, the output of the classifier looks like on its own training data. Then we apply it to the output of the classifier used on unknown data. The autoencoder is trained to predict its input at its output. So if the prediction is not very good, we say we are probably dealing with data for which the classifier is not good.

And that's how it works -- I mean, it's honest. If you look at the output of a neural net applied to its training data, or to test data that is matched to the training data, the prediction error is pretty much the same as it is on the training data. When you apply it to data for which the classifier wasn't trained, the error is, of course, much larger. So there is a double deep net: one is classifying, and another one is predicting its output. And the one which predicts the output is trained to predict the best output the classifier can possibly produce, which is its output on its training data. I don't know if you are still following me, or if this is becoming too complicated. [CHUCKLES] But essentially, we are trying to figure out whether the output looks the way it looks when the classifier is applied to its training data.
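A toy version of the idea: instead of a deep autoencoder, a linear (PCA-style) reconstruction of the classifier's posteriors, which is enough to show the selection logic; the bottleneck size is arbitrary.

```python
import numpy as np

class LinearAutoencoder:
    # Deliberately simple stand-in for the deep autoencoder: a low-rank
    # least-squares reconstruction fitted to posteriors on training data.
    def fit(self, P, rank=8):
        self.mean = P.mean(axis=0)
        _, _, Vt = np.linalg.svd(P - self.mean, full_matrices=False)
        self.V = Vt[:rank].T              # (phones, rank) bottleneck
        return self
    def reconstruction_error(self, P):
        Z = (P - self.mean) @ self.V
        R = Z @ self.V.T + self.mean
        return float(np.mean((P - R) ** 2))

def pick_stream(monitors, stream_posteriors):
    # Score each stream by how "training-like" its posteriors look and
    # keep the one its monitor reconstructs best.
    errs = [m.reconstruction_error(P)
            for m, P in zip(monitors, stream_posteriors)]
    return int(np.argmin(errs))

# Usage: fit one monitor per stream on that stream's training posteriors,
# e.g. monitors = [LinearAutoencoder().fit(P_train_i) for P_train_i in ...]
```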
So it seems to be working to some extent. Here we have the multi-style results; here we have the matched result, which is what we would like to achieve; and of course this is the oracle -- what we would be getting if we knew, ideally, which stream is best. But what we are actually getting is not that terribly bad. I mean, certainly, it is typically better than multi-style training. All right -- and we still have some way to go to the oracle, which is not too far from the matched case. Sometimes we are even there, because the decision is made on every utterance, so sometimes it can do quite well. So we were capable of picking up the good streams and leaving out the bad streams, based only on the output of the classifier.

How does it work on previously unseen noise? Fortunately, for this example, we still seem to be getting some advantage. We are using noise which has never been seen by any of the classifiers, but the system was still capable of picking the good classifier -- actually doing better than any individual classifier. So this seems to be good.

Another technique for dealing with unseen noises -- actually one which I like maybe even a bit more -- is to do the processing in frequency bands, hoping that the main effect of the different noises is in their spectral shape. If you are doing recognition in the sub-bands, then in each sub-band the noise starts looking more like white noise, just at different levels. Meaning: here, maybe, the signal-to-noise ratio is higher; here it is more miserable. But if I have a classifier which is trained on multiple levels of white noise in each frequency band, perhaps I can get some advantage. So I do what the cochlea might be doing: I divide the signal into a number of frequency bands, and then I have one fusion DNN which tries to put these things together. Each of these band nets is trained on multiple noise levels -- but this time of white noise. And then it gets applied to noises which are not white.
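A sketch of the architecture; the per-band classifiers and the fusion net are placeholders for trained DNNs, and, as a shortcut, white-noise training is imitated here by perturbing band energies at several levels rather than adding white noise to the waveform before the filter bank.

```python
import numpy as np

def subbands(spectrogram, n_bands=5):
    # Split a (bins x frames) spectrogram into contiguous sub-bands.
    return np.array_split(spectrogram, n_bands, axis=0)

def white_noise_variants(band_spec, snrs_db=(0, 10, 20)):
    # Training copies of one band at several white-noise levels.
    # (Shortcut: noise is added to band energies; the real recipe adds
    # white noise to the waveform first.)
    out = [band_spec]
    for snr in snrs_db:
        sigma = np.sqrt(np.mean(band_spec ** 2) / 10 ** (snr / 10))
        out.append(band_spec + sigma * np.random.randn(*band_spec.shape))
    return out

def fused_posteriors(band_nets, fusion_net, spectrogram):
    # Per-band posteriors, concatenated and merged by the fusion net.
    per_band = [net(b) for net, b in zip(band_nets, subbands(spectrogram))]
    return fusion_net(np.concatenate(per_band, axis=-1))

bands = subbands(np.abs(np.random.randn(40, 300)))   # stand-in input
```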
How it works you can see from what we did in the case of Aurora. So here we have the examples of how it works in matched situations; here is multi-style training; and here is what you get if you apply this technique. This is what you get with multi-style training, but with the sub-band recognition you are getting half the error rate -- from a simple trick which I think is reasonable: you do sub-band recognition with a number of parallel recognizers, each of them paying attention to a part of the spectrum, and each of them trained to handle white noise, simple white noise. You turn an, in some ways, arbitrary additive noise -- this car noise -- into white-like noise in each sub-band. And that's what you get.

So, in general, in dealing with unexpected noise you want to do adaptation -- you want to modify your classifier on the fly. You want to have parts of the classifier, some streams, which are doing well -- parts of the classifier which are still reliable -- and you want to pick up those streams which are reliable in the unseen situation. So this is what we call multi-stream recognition -- multi-stream adaptation to unknown noise. You assume that not all the streams are going to give you good results, but you assume that at least some of the streams will. And all these streams are trained on, say, clean speech or something like that.

So this is the multi-band processing, all right? This is what we do: we cover different frequency ranges, and then we use our performance monitor to pick the best stream. Here is the experiment which we did. We had 31 processing streams, created from all combinations of five frequency bands. One stream was looking at the full spectrum, and the others were looking only at parts of the spectrum.
The blacker ones cover more of the spectrum; the whiter ones less -- some of them look only at a single frequency band. So we have a decent number of processing channels, and we would hope that if the noise comes in here, maybe this stream is going to be good -- because a recognizer which only uses the bands that are not noisy is going to do well.

So this is the whole system; it was published [STATIC] at Interspeech. We have the sub-band recognition, the fusion, the performance monitor, and the selection of a stream. This is how it works -- again shown for car noise. Car noise is very nice because it mainly corrupts the low frequencies, so all these sub-band techniques work quite well. But you can see it's pretty impressive: if you didn't do anything, you get 50% error; with this one you get 38% error; and if you knew which bands to pick -- the oracle, the cheating experiment -- you would be getting about 35%. So that was, I thought, quite nice.
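The stream inventory in that experiment is just the 31 non-empty subsets of five bands. A sketch of the stream enumeration and monitor-based selection, with the per-stream recognizers and the monitor score left as placeholders:

```python
from itertools import combinations
import numpy as np

BANDS = range(5)          # five frequency bands -> 2**5 - 1 = 31 streams

def make_streams():
    # All non-empty subsets of the five bands, one stream each;
    # the full-spectrum stream is the subset containing all five.
    streams = []
    for k in range(1, len(BANDS) + 1):
        streams.extend(combinations(BANDS, k))
    return streams

def recognize(band_feats, stream_nets, monitor_score):
    # Run every stream's recognizer on its subset of bands and keep the
    # stream the performance monitor trusts most. `stream_nets` maps a
    # band subset to a posterior estimator; `monitor_score` returns
    # higher = more training-like (both are placeholders).
    best, best_score = None, -np.inf
    for subset in make_streams():
        feats = np.concatenate([band_feats[b] for b in subset], axis=-1)
        post = stream_nets[subset](feats)
        score = monitor_score(post)
        if score > best_score:
            best, best_score = post, score
    return best

assert len(make_streams()) == 31
```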
Just to conclude: the auditory system doesn't only work the way this picture suggests -- starting with the signal analysis and then reducing the bit rate; it also increases the number of views of the signal. This is based on the fact that there is a massive increase in the number of neurons at the level of the cortex. So there are many ways of describing the information at the higher levels of perception. Essentially, the signal doesn't go through one path; it goes through many, many paths. And then we need to have some means -- or we have some means -- to pick up the good ones and ignore the other ones, maybe switch them off entirely. It's the same in vision.

So this is the general path of processing the signal: you get different probability estimates for the different streams, and then you need to do some fusion, and decide at the level of the fusion. How can you create the streams? We were showing you probability estimators trained on different noises and on different aspects of the signal -- that is, on different parts of its spectrum. But you can go wild. You can start thinking about different modalities, because -- as we also talked about in the panel -- an audiovisual stream very often carries the same information about the same things, [STATIC] so you can do fusion of audio and visual streams. You can also imagine fusion of streams with different levels of priors -- different levels of hallucination. Basically, this is what I see human beings doing very often. If the signal is very noisy -- you are at a cocktail party -- you are guessing, because that's the best way to get through if the communication is not very important: it's not about your salary increase, but about the weather. So you are basically guessing what the other people are saying, especially if they speak the way I do, right, with a strong accent or something. So the priors are very important, and streams with priors are very important. We use this to some extent, as I was mentioning, by comparing streams with different priors to discover whether the signal is being biased in the wrong way by the priors.

So, stream formation -- there are a number of PhD theses right there, I think. Fusion -- or rather, selecting the best probability estimates: I tell you, this is the problem I was actually asking you to please help me solve, because we still don't know how to do it. I suspect that, especially in human communication, people do it like this: the message starts making sense when they use a certain processing strategy.
So people can tell whether the output of their perceptual system makes sense or not. Our machines don't know how to do that yet.

Conclusion. Some problems with noise are simple. You can deal with them at the signal-processing level -- by filtering the spectrum, filtering the trajectories -- because these effects are very predictable. And if you understand them, you should do it, because there is no need to train on that; you just do it, and things may work well. Unpredictable effects of noise are typically handled nowadays by multi-style training, and the amounts of training data are enormous. If you talk to Google people, they say: we are not deeply interested in what you are doing, because we can always collect more data from new environments. But I think it's not -- I shouldn't say dishonest; I'm sorry, scratch that. [LAUGHTER] It's not the best engineering way of dealing with these things, because I think the good engineering way is to get away with less training and that sort of thing, and maybe to follow what I believe human beings are doing. So: we have a lot of parallel experts working with the different aspects of the signal, giving us different pictures, and then we need to pick up the good ones.