The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

HYNEK HERMANSKY: So we have this wanted information and unwanted information. I call the unwanted information noise and the wanted information signal. Not all noises are created equal. There are some noises whose effects are at least partially understood, and I claim these are what we should strip off very quickly, like linear distortions or speaker dependencies. Those are the two things I will be talking about. You can easily do it in feature extraction.

There are noises which are expected, but whose effects may not be well understood. These should go into machine learning land. Whatever you don't know, you had better have the machine learn. It's always better to use a dumb machine with a lot of training data than to put in something which you don't know for sure. But what you know for sure should go here.

And then there is an interesting set of noises which you don't even know exist. These are the ones I'm especially interested in, because they cause us the biggest problems: noises you don't know exist, noises which somebody introduces that have never been talked about, and so on. So I think this is an interesting problem. Hopefully, I will get to it towards the end of the talk, at least a little bit.

So, some noises with known effects. One is like this. You have a speech sample.

VOICE RECORDING: You are yo-yo.

HYNEK HERMANSKY: And you have another speech sample which looks very different.

VOICE RECORDING: You are yo-yo!

HYNEK HERMANSKY: But it says the same thing, right?
I mean, this is a child and this is an adult. And you can tell: this was me, and this was my daughter when she was 4, not 30.

The problem is that different human beings have different vocal tracts. Especially when it comes to children, the vocal tract is much, much shorter. And I was showing you the effects: you get a very different set of formants, these dark lines, which a number of people believe we should look at if we want to understand what's being said. We have four formants here, but we have only two formants here. They are in approximately similar positions, but where you had a fourth formant, you have only a second formant here.

So what we want are techniques which would work more like human perception: not looking at the spectral envelopes, but mainly looking at the whole clusters. So here is a technique which was developed a long time ago, but I still mention it because it's an interesting way of going about things.

It uses several things. One is that it suppresses the signal at low frequencies. You basically use the equal-loudness curve, so you emphasize the parts of the signal which are heard well. The second thing it uses is critical bands, because the first step you want to take is to integrate over a critical band. The simplest way of processing within the band is to integrate what's happening inside.

So what you do is take your Fourier spectrum. This is the spectrum which has equal frequency resolution at all frequencies and a lot of detail, in this case because of the fundamental frequency. And here you integrate over these different frequency bands. They are narrower at low frequencies and get broader, and broader, and broader, very much as we learned from the experiments with simultaneous masking. So this is textbook knowledge. You get a different spectrum, which is unequally sampled.
So, of course, you go back to equal sampling, but you know that there are fewer samples at the high frequencies, because you are integrating more spectral energy at high frequencies than at low frequencies. And you multiply these outputs by the equal-loudness curve. So from the spectrum you get something whose resolution is more auditory-like. Then you apply the intensity-loudness power law, because you know that loudness depends on the cubic root of intensity. So you get a modified spectrum.

And then you find some approximation to this auditory spectrum, saying: I don't think that all these details have to be important. I would like to have some control over how much spectral detail I want to keep in.

So the whole thing looks like this. You start with the spectrum, you go through a number of steps, and you end up with a spectrum which is, of course, related to the original spectrum, but is much simpler. We eliminated information about fundamental frequency, we merged a number of formants, and so on. So we follow our philosophy: leave out the stuff which you think may not be important.

You don't know how much stuff you should leave out. So if you don't know something and you are an engineer, you run an experiment. You know, "research is what I'm doing when I don't know what I'm doing," supposedly. Wernher von Braun or somebody was saying that. So we didn't know how much smoothing we should do if we wanted a speaker-independent representation. So we ran an experiment over the amount of smoothing, the number of complex poles, which tells you how much smoothing you get from the autoregressive model. And there was a very distinct peak in the situations where we had training templates coming from one speaker and the test coming from another speaker.
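The chain just described is essentially perceptual linear prediction: critical-band integration on a Bark axis, equal-loudness weighting, cubic-root compression, and a low-order autoregressive fit. Here is a minimal sketch in Python; the Bark warping and the cubic root follow the talk, while the band shapes, the equal-loudness curve, and all sizes are simplified placeholders rather than the published PLP details.

```python
import numpy as np

def bark(f):
    # Bark frequency warping: 6 * asinh(f / 600)
    return 6.0 * np.arcsinh(f / 600.0)

def levinson(r, order):
    """Levinson-Durbin: autocorrelation -> AR polynomial coefficients."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i + 1] += k * np.concatenate([a[i - 1:0:-1], [1.0]])
        err *= 1.0 - k * k
    return a

def plp_like(power, sr, n_bands=15, order=5):
    """Toy PLP-style analysis of one frame's power spectrum."""
    freqs = np.linspace(0.0, sr / 2.0, len(power))
    z = bark(freqs)
    centers = np.linspace(0.5, z[-1] - 0.5, n_bands)
    # 1) critical-band integration (rectangular 1-Bark windows here,
    #    standing in for the trapezoidal PLP masking curves)
    bands = np.array([power[np.abs(z - c) < 0.5].sum() for c in centers])
    # 2) crude equal-loudness emphasis: suppress low frequencies
    fc = 600.0 * np.sinh(centers / 6.0)       # band centers back in Hz
    bands *= (fc ** 2 / (fc ** 2 + 1.6e5)) ** 2
    # 3) intensity-to-loudness compression: cubic root
    bands = bands ** 0.33
    # 4) low-order AR smoothing: inverse FFT of the (warped) spectrum
    #    gives an autocorrelation-like sequence -- a simplification
    r = np.fft.irfft(bands)[:order + 1]
    return bands, levinson(r, order)
```

The `order` argument is the number-of-poles knob from the talk: the experiment varied it and found a distinct optimum when training templates and test speech came from different speakers.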
Then we used this kind of representation in speech recognition, to derive the features from the speech. Suddenly, these two pictures start looking much more similar, because what this technique is doing is basically interpreting the spectrum the way hearing might be doing it. It has much lower resolution than people would normally use; it has only two peaks, right? But it was good enough for speech recognition.

What was more interesting, a little bit of interesting science, is that we also found that the difference between the production of adults and children might be just in the length of the pharynx. This is the back part of the vocal tract. The children may be producing speech in such a way that they are already putting the [AUDIO OUT] constriction into the right position against the palate. And because they know, or, well, whatever, mother nature taught them that the pharynx will grow during a lifetime, but the front part of the vocal tract is going to stay similar. So it is the front cavity which is speaker independent, and it is the back cavity, the rest of the vocal tract, which may be introducing speaker dependencies.

It's quite possible. If you ask people how they've been trained, like actors, how they are trained to generate different voices, they are trained to modify the back part of the vocal tract. Normally, we don't know how to do that. But there is some circumstantial evidence that this might be at least partially true.

What is nice is that when we synthesized speech and made sure that the front cavity was always in the same place, even when the formants were in different positions, we were getting very similar results. So we have this theory. The message is encoded in the shape of the front cavity. Through speaker-dependent vocal tracts, you generate the speech spectrum with all the formants.
But then there comes the speech perception part, which extracts what is called the perceptual second formant. Don't worry about that. Basically, [AUDIO OUT] at most two peaks from the spectrum. And this is used for decoding the signal, speaker independently.

However, I told you one thing, which is: don't use the textbook data and be exact-- [AUDIO OUT] And so I was challenged by my friend, the late Professor Fred Jelinek. He is claimed to have said, "airplanes don't flap wings," so why should we be putting the knowledge of hearing in? Actually, he said something quite different. This is what The New York Times quoted after he passed away, because that was supposedly one of his famous quotes. No, he said something else. Airplanes do not flap wings, but they have wings nevertheless. They use some knowledge from nature in order to get the job done. The flapping of the wings is not important. Having the wings is important if you want to create a machine which is heavier than air and flies.

So we should try to include everything we know about human perception, and production, and so on. However, we need to estimate the parameters from the data, because, don't trust the textbooks and that sort of thing. You have to derive it in a way that is relevant to your task. What I wanted to say is: you can use the data to derive similar knowledge. And I want to show it to you.

What you can do is use a technique, again known from the '30s, called Linear Discriminant Analysis. This is the statistician's friend. For this you need a within-class covariance matrix and a between-class covariance matrix. You need labeled data. And you need to make some assumptions, which it turns out are not very critical; they are approximately satisfied when you are working with the spectra.
So what we did was take this spectrogram and generate the spectral vectors from it. We would always cut out a part of the spectrum, a short-term spectrum, and assign it the label of the part of speech it came from. So this one would have the label "yo," right? And so you get a big box full of vectors, all of them labeled. So you can do LDA. And you can look at what the discriminants are telling you.

From LDA, you get the discriminant matrix, and each row (or column, whatever) of it creates a basis onto which you should project the whole spectrum, right? These are the four obvious ones here. You also get the amount of variability present in the discriminant matrix you started with.

What you observe, which is very interesting, is that these bases tend to project the spectrum at the beginning with more detail than the spectrum at the end. So essentially, in the first group, they appear to be emulating properties of human hearing, namely the non-equal spectral resolution, which has been verified in many, many ways. Among them was one I was showing you: the masking experiment of Harvey Fletcher. There are a number of reasons to believe that this is a good thing.

This is what you see. Essentially, if you look at the zero crossings of these bases (this is the first basis), they are getting broader and broader. So you are integrating more and more of the spectrum, right? This is all right, so I'll leave it. Oh, and this is from another experiment with a very large database: very much the same, a very similar thing. The eigenvalues quickly decay.
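For concreteness, here is a minimal sketch of the LDA computation just described: labeled spectral vectors in, discriminant bases out. The regularizing ridge and the array shapes are my additions, not anything from the talk.

```python
import numpy as np
from scipy.linalg import eigh

def lda_bases(X, y):
    """Linear discriminant analysis on labeled spectral vectors.

    X: (n_frames, n_freq) short-term spectra; y: one phoneme label per frame.
    Returns eigenvalues and discriminant bases, strongest first.
    """
    dim = X.shape[1]
    mean = X.mean(axis=0)
    Sw = np.zeros((dim, dim))            # within-class covariance
    Sb = np.zeros((dim, dim))            # between-class covariance
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mean, mc - mean)
    # generalized symmetric eigenproblem: Sb v = lambda * Sw v
    w, V = eigh(Sb, Sw + 1e-6 * np.eye(dim))   # small ridge for stability
    order = np.argsort(w)[::-1]
    return w[order], V[:, order]         # columns of V are the bases
```

The leading columns of `V` are the bases whose zero crossings are examined above, and the eigenvalues `w` are the "amount of variability" that decays quickly.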
And what is interesting is that you can actually formally ask, "What is your resolution?" by doing what is called perturbation analysis. So you take some signal, say a Gaussian, here. And you project it onto the LDA bases. Then you perturb it; you move it. And you ask: how much effect does this movement of this, say, simulated spectral element of speech have on the output, seen through the projection onto these many bases? And what you see, as I was suggesting, is that the sensitivity to movements of the formant is much higher at the beginning of the spectrum and much lower at the end of the spectrum.

You can actually compare it to what we had initially in the PLP analysis, when we integrated the spectrum based on the knowledge coming from the textbook. And it's very much the same. If there were just a plain cosine basis computing the mel cepstrum, the sensitivity would be the same at all frequencies. But these bases from the LDA are very much doing the thing which critical-band analysis would be doing.

You can look it up. It was a PhD thesis from the Oregon Graduate Institute by Naren Malayath, who is now a big-- you had better be friends with him. He's at Qualcomm. I think he's the head of the image processing department. [COUGHS] We had better be good friends with him. [INAUDIBLE]

[LAUGHTER]

OK. Another problem: linear distortions. Linear distortions once were a problem. They are not a problem anymore, but in the old days, they were. The problem shows up in a rather dramatic way, as follows. Here we have one sound.

VOICE RECORDING: Beat.

HYNEK HERMANSKY: Beat. So "buh-ee-tuh." Here is the very distinct E. Every phonetician would agree: this is E. A high formant, a cluster of high formants, and so on. Some vicious person, namely one of my graduate students, took this spectral envelope, designed a filter which is exactly its inverse, and put this speech through this inverse filter, so it looked like this.
There was a spectrum with nine formants; now it's entirely flat. And if you listen to it, you've probably already guessed what you will hear.

VOICE RECORDING: Beat.

HYNEK HERMANSKY: You'll hear the first speech, right? But you--

VOICE RECORDING: Beat.

HYNEK HERMANSKY: It's OK when this-- oops! That's what you would-- sorry.

VOICE RECORDING: Beat. Beat. Beat.

HYNEK HERMANSKY: But whoever doesn't hear E, don't spoil my talk. I think everybody has to hear E, even though any phonetician would get very upset, because they would say this is not E. Because, of course, what is happening is that human perception takes the percept relative to the neighboring sounds, right? And since we filtered everything with the same filter, the relative percept is still the same. So this is something which we needed to put into our machine. And we did.

Signal-processing-wise, things are actually very straightforward, because what you have is the speech signal convolved with the impulse response of the environment. So in the logarithmic domain, this is standard signal processing stuff: basically, you have the logarithmic spectrum of the signal plus the logarithmic spectrum of the environment, which is fixed. So what we are finding here is that if you somehow remove this environment, or if you make things invariant to this environment, then you may be winning.
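The additivity claim is easy to check numerically. This little sketch (with made-up random stand-ins for the speech and the fixed channel) convolves a signal with a channel and confirms that, in the log magnitude spectrum, the channel turns into an additive term.

```python
import numpy as np

rng = np.random.default_rng(0)
speech = rng.standard_normal(4096)                                # stand-in "speech"
channel = rng.standard_normal(64) * np.exp(-np.arange(64) / 8.0)  # fixed "room"

received = np.convolve(speech, channel)        # speech heard through the room
N = len(received)

def logmag(x):
    return np.log(np.abs(np.fft.rfft(x, N)) + 1e-12)

S, H, R = logmag(speech), logmag(channel), logmag(received)

# convolution in time = addition in the log spectral domain:
print(np.max(np.abs(R - (S + H))))             # numerically tiny
```

Subtracting a long-term average per frequency (cepstral mean subtraction) would remove `H`; the filtering idea described next achieves the same thing per band.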
The problem here is that at each frequency you have a different additive constant, because this is a spectrum, right? If it were just one constant at all frequencies, you would just subtract it. But in this case, you can use a trick. You remember what Josh told us this morning: hearing is doing spectral analysis. And what I was trying to tell you is that at each frequency, in each critical band, the trajectory of the spectral energy is, to a first approximation, independent of the others. You can do independent processing in each frequency band, and maybe not screw up too many things.

So this was the step which we took. We said, OK, we will treat each temporal trajectory separately, right? But we will filter out the stuff which is not changing. So for each frequency channel, we do independent processing. And the processing was that we would first take the logarithm, and then we would put each trajectory through a bandpass filter, the main point of which was to suppress DC and slowly changing components. Mainly it was suppressing anything slower than one hertz. And it also turned out to be useful to suppress things faster than about 15 hertz.

So this is what you get out. This was the original spectrogram; this was the modified spectrogram. This trajectory got a little bit smoother. Transitions got smoothed out because there was a bandpass filter; there was a high-pass element to it. Very much what we thought. Well, this is interesting: maybe this is what human hearing might be doing. To tell you the truth, we didn't know. For the people who are from MIT and who work in vision: it was inspired by some work on the perception of lightness, what David Marr called lightness. And here was the thing which I told you about, 6 by 749. David Marr was talking about processing in space; we applied it to processing in time. But it was still good enough that we definitely got rid of the problem. So here it is. The spectrograms, which looked very different, suddenly start looking very similar.
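This is the RASTA idea: bandpass-filter each log-energy trajectory so that the fixed channel (the DC term) and very fast changes are removed. A minimal sketch, assuming a 100 Hz frame rate and using an ordinary Butterworth bandpass as a stand-in for the published RASTA filter; only the roughly 1-15 Hz passband comes from the talk.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def rasta_like(log_energies, frame_rate=100.0, lo=1.0, hi=15.0):
    """Bandpass-filter each critical-band log-energy trajectory.

    log_energies: (n_frames, n_bands) array of log spectral energies.
    Each band (column) is filtered independently along time, which
    removes any constant per-band offset -- i.e., a fixed channel.
    """
    nyq = frame_rate / 2.0
    b, a = butter(2, [lo / nyq, hi / nyq], btype="band")
    # filtfilt is zero-phase (an offline convenience; the original
    # RASTA filter was causal, so it could run in real time)
    return filtfilt(b, a, log_energies, axis=0)
```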
And it was not just about how it looks here. Remember, I'm an engineer; I was working for a telephone company at the time. Was it also working better on some problems which we had before? We had a severely mismatched environment: getting the training data from the labs and testing it at US West, in Colorado. The recognizer didn't work at all; after this processing, everything was cool and dandy.

OK. So now we can do RASTA LDA. We can do the same trick; how about that? You take the spectral temporal vectors, and you label each of these vectors by the label of the phoneme which is at the center of the trajectory. And just to have some fun, we took a rather long vector; it was about one second. And we asked, well, what kind of projections would these temporal trajectories go onto if we wanted to get rid of speaker-dependent-- I mean, environment-dependent information?

Well, these were the impulse responses, and these were the frequency responses. Because in this case, you get FIR filters. These discriminants are FIR filters which are to be applied to the temporal trajectories of spectral energies, because this is basically a projection of the trajectory onto the basis, and the basis is one second long. This is the impulse response. It cannot be active all that long, because eventually the values become zero, right? Where the filter should do nothing, it does nothing. But you can see the active part is about a couple of hundred milliseconds, maybe a little bit more.

And these are bandpass filters, essentially passing frequencies between 1 hertz and 10 or 15 hertz, very similar at all frequencies. There was another thing we were very interested in: should we really do different things at different frequencies? The answer is pretty much no. And so that was very exciting.
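Since each temporal discriminant is just an FIR filter, you can inspect it in the modulation domain and apply it by convolution. A small sketch; the Gaussian-derivative `basis` here is a made-up placeholder for a real LDA discriminant (for example, one column from `lda_bases` above), and the 100 Hz frame rate is an assumption.

```python
import numpy as np
from scipy.signal import freqz

frame_rate = 100.0                               # frames per second (assumed)
taps = np.arange(101) - 50                       # one second of context
basis = np.gradient(np.exp(-0.5 * (taps / 10.0) ** 2))  # placeholder "discriminant"

# frequency response in the modulation domain (x-axis in Hz):
w, h = freqz(basis, worN=512, fs=frame_rate)
response_db = 20 * np.log10(np.abs(h) + 1e-12)   # should pass roughly 1-15 Hz

# applying the discriminant = FIR-filtering one band's log-energy trajectory:
trajectory = np.random.randn(1000)               # stand-in log energies
filtered = np.convolve(trajectory, basis, mode="same")
```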
Well, anyway, let me tell you about yet another experiment, which hasn't been published yet and is going to be presented next week. We wanted to move into the 21st century, so we built a convolutional neural network. And our convolutional network is maybe not what you are used to, where you have 2D convolutions. We just said: we will have a 1D filter as the first processing step in this deep neural network. So we postulated the filter at the input to the neural network. But in this case, we trained the whole thing together; it wasn't just LDA and that sort of thing.

We forced all filters at all frequencies to be the same, because we expected that's what we would want to get. And we were asking what these filters look like when they come out of the convolutional neural network. Well, again, I wouldn't be showing it if it weren't somehow supportive of what I want to say. They don't look all that different from what we were getting from LDA. They definitely enhance the important modulation frequencies around four hertz, right? They pass a number of them. I'm showing three here, somewhat arbitrarily; most of them look like that, and we used 16 of them. They pass between 1 and 10 hertz in the modulation spectral domain, so changes which happen 1 to 10 times a second. It's coming out in a paper, so you can look it up if you want.
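In modern terms, forcing the same 1D temporal filter at every frequency is a 2D convolution with a kernel of height one, shared across the (band, time) plane. A hedged PyTorch sketch; the 16 filters match the talk, but the kernel length, band count, and everything else are illustrative guesses.

```python
import torch
import torch.nn as nn

class SharedTemporalFilters(nn.Module):
    """Learned FIR filters applied along time only, with the same
    weights used in every frequency band (weight sharing)."""

    def __init__(self, n_filters=16, kernel_len=101):
        super().__init__()
        # kernel (1, kernel_len): slides over time, never across bands
        self.conv = nn.Conv2d(1, n_filters, kernel_size=(1, kernel_len),
                              padding=(0, kernel_len // 2))

    def forward(self, spectrogram):          # (batch, 1, bands, frames)
        return self.conv(spectrogram)        # (batch, filters, bands, frames)

x = torch.randn(8, 1, 15, 300)               # 15 bands, 3 s at 100 frames/s
y = SharedTemporalFilters()(x)               # same 16 filters in every band
```

Trained jointly with the rest of the network, the learned kernels are what the talk compares against the LDA filters.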
The last thing which I still wanted to do: I said, well, maybe it has something to do with hearing after all. We were deriving everything from speech. There was no knowledge about hearing in it, except that we said we think we should be looking at long segments of the signal, and we expected that the filtering would be very much the same at all frequencies. Actually, not even that; it came out automatically. There wasn't much knowledge from human hearing [AUDIO OUT] in. In the first one, when I was showing you the critical-band spectral resolution, we started with the full Fourier spectrum. We didn't tell it anything about human hearing. And what comes out is a property of human hearing. I mean, tell me if there is another such strong piece of evidence that speech is processed in a way that fits human hearing, because the only thing which was used here was the speech, labeled into the classes which we use for recognizing speech sounds.

So what we did, and that was with Nima Mesgarani and my students: we took a number of these cortical receptive fields, which we talked about a little bit before, about 2,000 or 3,000 of them, which we basically spread out on the floor at the University of Maryland, and computed principal components from these fields in both the spectral and the temporal domain. Here I'm showing the temporal domain. And how do they look? They are very much like the RASTA filter. This is what is happening: it's a bandpass, and the peak is somewhere around four hertz. Essentially, I'm showing you here what I understood might be a transfer function of the auditory cortex, derived with all the usual disclaimers, like: this is a linear approximation to the receptive fields, and there might have been problems with collecting them, and so on. But this is what we are getting as a possible transfer function of the auditory cortex.

I'm doing fine with the time, right? So you can do an experiment in this case. You can actually generate speech which has certain rates of change eliminated: by doing all this, computing the cepstrum, filtering each trajectory, and reconstructing the speech. And you ask people, what do they hear? How well do they recognize the speech? You can also ask a machine, "Do you recognize it?" For this you don't have to regenerate the speech; you just use the LPC cepstrum. This is the full experiment: this is what is called a residual-excited LPC vocoder.
But it's modified in such a way that you can artificially slow down or modify the temporal trajectories. If there is no filter, you simply get a replica of the original signal here.

So the bottom line of the experiment is this: if you start removing components which are somewhere between 1 and 16 hertz, you get hurt significantly. You get hurt most in performance when you remove components between 2 and 4 hertz; that's where you take the biggest hit. Here we are showing how much these bands contribute to recognition performance by humans (these are the white bars) and by the speech recognizer (those are the black bars).

So you can see that in machine recognition, you can safely remove the stuff between 0 and 1 hertz. It's not going to hurt you; it only helps you in this task. In speech perception, there is a little bit of a hit, but certainly not as much of a hit as you get when you move to the part where you hear the-- [AUDIO OUT] And certainly the components higher than 16 or 20 hertz are not important. That, Homer Dudley already knew in the 1930s, when he was designing his vocoder. But it was a nice experiment. It came out just recently, so you can look it up if you want to have a go.
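The quantity being manipulated in these experiments is the modulation spectrum: the spectrum of a band's log-energy trajectory, with modulation frequency on the x-axis. A minimal sketch of how one computes it, assuming a 100 Hz frame rate; for speech it typically peaks around 4 Hz, roughly the syllable rate.

```python
import numpy as np

def modulation_spectrum(log_energy, frame_rate=100.0):
    """Magnitude modulation spectrum of one band's log-energy trajectory."""
    x = log_energy - log_energy.mean()        # drop the 0 Hz (channel) term
    x = x * np.hanning(len(x))                # taper to reduce leakage
    freqs = np.fft.rfftfreq(len(x), d=1.0 / frame_rate)
    return freqs, np.abs(np.fft.rfft(x))
```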
Just to summarize what I have told you so far: Homer Dudley was telling us that the information about the message is in the slow modulations, the slow movements of the vocal tract, which modulate the carrier; the information about the message is in the slow modulations of the signal, the slow changes of the speech signal in individual frequency bands. Slow modulations imply long impulse responses, right? So 5 hertz suggests something around 200 milliseconds, my magic number, which we have observed in the summation of sub-threshold signals and in temporal masking. And in hearing there are a number of things, which I have listed.

Frequency discrimination improves with duration up to about 200 milliseconds; below 200 milliseconds of signal, you don't get such good frequency discrimination. Loudness increases up to 200 milliseconds, then it stays constant; it depends on amplitude. The effect of forward masking, which I was showing you, lasts about 200 milliseconds, independent of the amplitude of the masker. And sub-threshold integration shows the same thing.

So I'm suggesting there seems to be some temporal buffer in human hearing at some level; I suspect it's the cortical level at which this processing happens. Whatever happens within this buffer, it's a fair thing to treat as one element. So you can do filtering on it, you can integrate it: basically, all kinds of things. If things are happening outside this buffer, those parts should be treated [AUDIO OUT] in parts.

So how does it help us? You remember the story about the phonemes. You remember that phonemes don't look like this; they look like this. The length of the coarticulation pattern is about 200 milliseconds, perhaps more. So the good thing about it is that if you look at a sufficiently long segment of the signal, you will get the whole coarticulation pattern in. And then you have a chance that your classifier is getting all the information about the speech sound for finding that sound. And then you may have a chance to get a good estimate of the speech sounds. But you need to use these long temporal segments.

And here, I can say it even to YouTube: I think we should claim full victory here, because most speech recognition systems do this nowadays. They use long segments of the signal as the first step of the processing. So I can happily retire, telling my grandchildren: well, we knew it. We were the only ones.
Well, maybe not the only ones, but we were certainly using it for a long time, in such a way that we even designed several techniques around it. So this is classifying speech directly from the temporal patterns. We would take these long segments of the speech, through some processing, and put neural nets on every temporal trajectory, trying to estimate the sound at each frequency, each carrier frequency. And then we would fuse all these decisions from the different frequency bands. And then we would use the final vector of posterior probabilities.

This is unlike what people most often do, which is to take the short-term spectra, and then maybe take a longer segment, a block of these short-term spectra. We said: the short-term spectrum is good for nothing; we just cut it into pieces. And we classify each temporal trajectory individually in the first step. I'm telling you about it now because it may be useful later, when I will be telling you about dealing with some kinds of noises.

But you understand what we did here, right? Instead of using spectral temporal blocks, we would be using temporal trajectories at each critical band, very much along the lines of what we think hearing is doing with the speech signal. The first thing hearing does is take the signal and subdivide it into individual frequency bands; then it processes each temporal trajectory coming out of each of these cochlear filters to extract the information. And then it tries to figure out what to do with this information later, right?
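Here is a hedged sketch of that per-band architecture: one small net per critical band looks at about a second of that band's trajectory and estimates phoneme scores, and a merger net fuses the per-band decisions. The layer sizes, context length, and phoneme count are all made-up placeholders, and note the talk says the parts were originally trained separately, not jointly as a single network.

```python
import torch
import torch.nn as nn

class BandNet(nn.Module):
    """One band's classifier: ~1 s of log-energy trajectory -> phoneme scores."""

    def __init__(self, context=101, n_phones=40, hidden=200):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(context, hidden), nn.Tanh(),
                                 nn.Linear(hidden, n_phones))

    def forward(self, traj):                      # (batch, context)
        return self.net(traj)

class PerBandThenMerge(nn.Module):
    """Per-band temporal classifiers fused by a merger net."""

    def __init__(self, n_bands=15, context=101, n_phones=40):
        super().__init__()
        self.bands = nn.ModuleList([BandNet(context, n_phones)
                                    for _ in range(n_bands)])
        self.merger = nn.Sequential(nn.Linear(n_bands * n_phones, 300), nn.Tanh(),
                                    nn.Linear(300, n_phones))

    def forward(self, block):                     # (batch, n_bands, context)
        per_band = [net(block[:, i]) for i, net in enumerate(self.bands)]
        return self.merger(torch.cat(per_band, dim=1))   # fused scores
```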
Well, we have another technique, called MRASTA, just for people who are interested in cochlear-- I mean, cortical modeling. You take this data and project it onto a number of projections with variable resolution. So you get a huge vector of data coming from different parts of the spectrum, and then you feed it into the speech recognizer.

The filters look like this. They have different temporal resolutions and spectral resolutions. We are pretty much integrating, or differentiating, over three critical bands, following some of the filters coming from the old low-order PLP model and its three-Bark critical-band integration. So these ones look a bit like what people would call Gabor filters, but they are just put together, basically, from these two pieces in time and in frequency: different temporal resolutions enhancing different components of the modulation spectrum. Again, you may claim that this resembles the Thorston-- [AUDIO OUT] Josh was mentioning in the morning. It's cochlear filter banks-- [CLEARS THROAT] auditory, of course. I mixed up cochlear and cortical: cortical filter banks, modulation filter banks.
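A minimal sketch of a multi-resolution temporal filter bank in the spirit of MRASTA: first and second derivatives of Gaussian windows at several widths, each convolved along time with a critical-band trajectory. The specific widths and lengths are illustrative assumptions, not the published settings.

```python
import numpy as np

def gaussian_derivative_bank(sigmas=(1.0, 2.0, 4.0, 8.0), length=101):
    """Temporal filters: 1st and 2nd derivatives of Gaussians.

    sigmas are widths in frames (10 ms frames assumed); the wider the
    Gaussian, the lower the modulation frequencies the filter passes.
    """
    t = np.arange(length) - length // 2
    bank = []
    for s in sigmas:
        g = np.exp(-0.5 * (t / s) ** 2)
        g1 = (-t / s**2) * g                     # 1st derivative: bandpass
        g2 = (t**2 / s**4 - 1.0 / s**2) * g      # 2nd derivative: bandpass
        bank.append(g1 / np.abs(g1).sum())       # crude gain normalization
        bank.append(g2 / np.abs(g2).sum())
    return np.array(bank)                        # (2 * len(sigmas), length)

# each row is convolved with each band's log-energy trajectory, e.g.:
# features = [np.convolve(traj, f, mode="same") for f in gaussian_derivative_bank()]
```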
So there are some novel aspects in this type of processing that I want to stress. It was novel in 1998; fortunately, as I said, it is becoming less novel 15 years later. It uses a rather long temporal context of the signal as the input. It already uses hierarchical neural nets, so deep neural network processing, which wasn't around in 1998. And there was independent processing by the neural net estimators at the individual frequencies. The only thing which we didn't do at the time, and I don't know how important it is (I don't think it hurts anybody), is that we were training these parts of the system, this deep neural net, individually, and just concatenating the outputs. We never trained it all together, as we do now in convolutional nets and that sort of thing, because, simply, we didn't even dream about doing that; we didn't have the hardware. That was one thing which I tried to point out during the panel: a lot of the progress in neural net research and the success of neural nets comes from the fact that we have very, very powerful hardware, which we didn't have. So we didn't dream about doing many things, even when they might have made sense.

So, OK. Where are we? Oh, I see, one more thing. Coarticulation. This is a problem which has been known since people started looking at spectrograms. There are some consonants, like "kuh" or "huh," which are very dependent on what's following. So for "kuh": in front of "ee" it has a burst here, in front of "ooh" it has a burst here, and in front of "ah" there's a burst here. So the phonemes are very different depending on the environment.

When you start using these long temporal segments, with all the tricks, or some of the tricks, I showed you, what comes out is a posteriogram in which one "kuh" looks almost the same as another "kuh." Since it looks at the whole coarticulation pattern, the group of phonemes, in order to recognize the sound, it does the right thing. So I suspect that the success of these long temporal contexts, which people are using now in speech recognition, comes from the fact that they partially compensate for the problems with coarticulation. And what I also want to say is: coarticulation is not really a problem. It just spreads the information over a long period of time. If you know how to suck it out, it can be useful. But it's a terrible thing if you start looking only at individual frequency slices of the short-term spectrum.

So here is another deep net, from-- I don't know the name, sorry. It was already an almost legal deep net. You estimate the posteriogram from a short window in the first step, a window about 40 milliseconds long.
Oh, yes -- one more thing I want to stress. [LAUGHS] I'm sorry, I didn't want to show it all at the same time. But anyway, I don't think that there is anything terribly special about the short-term spectrum of speech. I think what really matters is how you process the temporal trajectories of the spectral energies. This is what human hearing seems to be doing, and it seems to do a good job in our speech recognizers. So essentially, this is one message which I want to leave you with: don't be afraid to treat different parts of the spectrum differently, individually -- you may get some advantages from that. It started with our work, but it shows up over and over again.

So go away from the short-term spectrum and start doing what hearing is doing -- start using the temporal trajectories of the spectral energies coming from your analysis. To the point that we did this work on [INAUDIBLE], going directly: don't get your time-frequency patterns from the short-term spectra; always think about how to get directly what you want. It turns out that there is a nice way of directly estimating the Hilbert envelopes of the signal in frequency bands, called frequency domain linear prediction. [STATIC] Marios -- this is his PhD thesis -- and we were working together for a couple of years.

So what you do: instead of applying autoregressive modeling -- LPC modeling -- with windows in time to get frequency vectors, you do it on a cosine transform of the signal. So you move the signal into the frequency domain.
And then you put the windows on this cosine transform of the signal, and you derive directly the all-pole approximations to the Hilbert envelopes of the signal in the sub-bands. You never do the Hilbert transform; you just use the usual techniques from autoregressive modeling. The only difference is that you work [AUDIO OUT] on the cosine transform of the signal, and your windowing determines which frequency range you are looking at. So, of course, you typically use longer windows at higher frequencies and shorter windows at lower frequencies -- you can do all these things. But this is a convenient way. [COUGHS] It's convenient, and this part is more and more just for fun, but maybe somebody might be interested in it.

So essentially, what you do is take the signal and separate out the modulation -- the AM component with which the signal is being modulated. This carries the information about the message; and then there is the carrier itself. And you can build what is called a channel vocoder, which we did, and you can listen to the signal. So this is, in some ways, interesting -- the original signal:

VOICE RECORDING: They are both trend-following methods.

HYNEK HERMANSKY: Oops. I tried to make it somehow -- [AUDIO OUT]

VOICE RECORDING: They are both trend-following methods.

HYNEK HERMANSKY: Somebody may recognize Jim Glass from MIT in that.

VOICE RECORDING: In an ideological argument, the participants tend to thump the table.

HYNEK HERMANSKY: So this is silly, right? Now you can listen to what you get if you keep just the modulations and excite them, you know, with white noise. Oops. Sorry. Oops! What am I doing? Oh, here.

VOICE RECORDING: (WHISPERING) They are both trend-following methods.

HYNEK HERMANSKY: Do you recognize Jim Glass? I can.
VOICE RECORDING: (WHISPERING) In an ideological argument the participants tend to thump the table.

HYNEK HERMANSKY: And then you can also listen to what is left after you eliminate the message.

VOICE RECORDING: Mm-hmm. Ha, ha.

[LAUGHTER]

HYNEK HERMANSKY: Maybe it's a male, right?

VOICE RECORDING: Mm-mm [VOCALIZING]

HYNEK HERMANSKY: Oh, this is fun. This is [CHUCKLES] fun. It may have some implications for speech recognition. But certainly, if I have ever seen a verification of what old Homer Dudley was telling us about where the message is -- this is it. All right?

Anyway, what is good about this is that once you get the all-pole envelope, [AUDIO OUT] it is relatively easy to compensate for linear distortions. The main effect of a linear distortion is basically to shift the energy in different frequency bands by different amounts. But all that information sits in the gain of the model -- one parameter, which you essentially ignore after you do this frequency domain linear prediction. And you get very similar trajectories for both: this is telephone speech and clean speech, which differ quite a bit.

And I hope that I have -- oh, this is for reverberant speech. There also seems to be some advantage there, because reverberation, to a first approximation, is a convolution with the impulse response of the room. So if you use truly long segments -- in this case we used about 10 seconds of the signal, approximated it by this all-pole model, and eliminated the DC component from that -- you seem to be getting some advantage. [AUDIO OUT]
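As a rough illustration of the recipe, here is a minimal FDLP sketch, assuming a single rectangular band of DCT coefficients and a textbook autocorrelation-method LPC; the published systems use Bark-spaced windows and more careful modeling. Dropping the model gain, as described above, is the one-parameter channel normalization.

```python
import numpy as np
from scipy.fft import dct
from scipy.signal import freqz

def levinson(r, order):
    # Levinson-Durbin: autocorrelation sequence -> LPC coefficients, gain.
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

def fdlp_envelope(x, band, order=20, n_points=512):
    # DCT the whole signal, keep one band of coefficients, and fit an
    # all-pole model to that coefficient sequence: its power response
    # approximates the Hilbert envelope of the sub-band signal in time.
    c = dct(x, type=2, norm="ortho")
    seq = c[band[0]:band[1]]
    r = np.correlate(seq, seq, mode="full")[len(seq) - 1:len(seq) + order]
    a, gain = levinson(r, order)
    # Ignoring `gain` here is what removes the linear channel effect.
    _, h = freqz([1.0], a, worN=n_points)
    return np.abs(h) ** 2

fs = 8000
t = np.arange(fs) / fs                       # 1 second of toy signal
x = np.sin(2 * np.pi * 440 * t) * (1 + 0.5 * np.sin(2 * np.pi * 3 * t))
env = fdlp_envelope(x, band=(400, 1200))     # illustrative DCT band
```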
So: known noise with unknown effects. I say, train the machine on that.

Here is one example. You have phoneme error rates under different noise conditions. If everything is good -- clean training, clean test -- you get about a 20% phoneme error rate. That is a state-of-the-art, reasonable result. But once you start adding noise, things quickly go south. The typical way of dealing with it is multi-style training: if you know which noises you are going to deal with, you train on them. And things get better, but you pay some price. Certainly you pay a price on clean speech, because your model basically became much mushier -- it's not a very sharp model anymore. So here we had a wonderful 21%, and we paid about 10% relative for getting the better performance on the noises.

What we observed is that you get much better results -- most noticeably better results -- if you have a different recognizer for each type of noise. But of course, the problem is that there are different types of noise, so you have this number of recognizers, and now you need to pick the best stream. And how do you do that? This is something, again, which I was also mentioning earlier -- something we are struggling with, and we don't know how to do well. If you are a human being, maybe you can just look at the outputs and keep switching until the message starts looking reasonable. But we want to do it fully automatically -- I don't know why we only want to build fully automatic recognizers, but that's what we are doing. So you want the system to pick the best stream.
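Multi-style training, in its simplest form, is just data augmentation. A sketch, where the noise recordings, the SNR grid, and the downstream trainer are placeholders:

```python
import numpy as np

def add_noise(clean, noise, snr_db):
    # Mix a noise recording into a clean utterance at a given SNR.
    noise = np.resize(noise, clean.shape)          # loop/trim the noise
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

def multistyle_corpus(clean_utts, noises, snrs_db=(0, 10, 20)):
    # Multi-style training set: every utterance paired with every
    # expected noise type at several SNRs, plus the clean original.
    corpus = list(clean_utts)
    for utt in clean_utts:
        for noise in noises:
            for snr in snrs_db:
                corpus.append(add_noise(utt, noise, snr))
    return corpus

rng = np.random.default_rng(0)
clean = [rng.standard_normal(16000) for _ in range(2)]   # stand-in utterances
noises = [rng.standard_normal(8000)]                     # stand-in noise
corpus = multistyle_corpus(clean, noises)                # 2 clean + 6 noisy
```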
So how do we do that? First thing, of course -- one way is to recognize the type of noise and use the appropriate recognizer. This is a typical system nowadays; BBN is doing it. My feeling is that it's somehow cleaner and more elegant to figure out what the right output is, because it's neither about the noise nor about the signal alone; it's about how the signal interacts with the classifier. So for this, we have to figure out what "best" means.

Here we have two posteriograms. If you look at them, knowing that these are trajectories of the posteriors of the speech sounds, you know this one is good and this one is not so good. Because the word is "nine" -- "n-ay-n" -- and here there is a lot of garbage. So I know that; now I want to do it automatically. Ideally, I would pick the stream which gives me the lowest error. But I don't know what the lowest error is, because I don't know what the correct answer is. That's the problem, right?

So one approach is to try to do what my eye just did: figure out which posteriogram is the cleanest. Another one follows this thinking: when I train a neural net on something, it's going to work well on the data on which it was trained. So I have some gold-standard output, and I can try to see how much my output differs when the test data are not the same as the data on which the recognizer was trained. We were using both of these tricks.

The first one uses a technique like this. You look at the differences between posteriors -- the KL divergence between posterior vectors a certain distance from each other -- and you slide this window, cumulatively covering as much data as you possibly can. What you observe is that on good, clean data this cumulative divergence keeps increasing, and after you cross the point where the coarticulation pattern ceases, you suddenly get a pretty much fixed tail of the cumulative KL divergence. On noisy data, the noise starts dominating these divergences and differences. Because it is the signal that carries the information, and the information is in the changes; noise creates "information" which doesn't have this segmental structure. So this is one technique which we use.
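A sketch of that first trick, assuming each stream delivers a (frames x phones) posterior matrix; the lag range and the exact divergence are illustrative:

```python
import numpy as np

def kl(p, q, eps=1e-10):
    # KL divergence between two posterior vectors (eps for stability).
    p = (p + eps) / np.sum(p + eps)
    q = (q + eps) / np.sum(q + eps)
    return float(np.sum(p * np.log(p / q)))

def cumulative_divergence(post, max_lag=30):
    # Mean divergence between posteriors dt frames apart, accumulated
    # over dt. On clean speech the curve rises and then flattens once
    # dt exceeds the coarticulation span; noise changes its shape.
    T = len(post)
    curve = np.zeros(max_lag + 1)
    for dt in range(1, max_lag + 1):
        mean_d = np.mean([kl(post[t], post[t + dt]) for t in range(T - dt)])
        curve[dt] = curve[dt - 1] + mean_d
    return curve  # compare curves (e.g., their tails) across streams

demo = cumulative_divergence(np.random.dirichlet(np.ones(40), size=200))
```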
Another technique, which is now even more popular, at least in my lab, is training another network. We trained an autoencoder on the output of a classifier -- on the output of the classifier as it is being used on its training data. The autoencoder thus learns what, on average, the output of the classifier looks like on its own training data. Then we apply it to the output of the classifier used on unknown data. The autoencoder is trained to predict its input at its output. So if the prediction is not very good, we say we are probably dealing with data for which the classifier is not good.

And that's how it works -- I mean, it's honest. If you look at the output of a neural net applied to its training data, or to test data that is matched to the training data, the prediction error is pretty much the same as it is on the training data. When you apply it to data for which the classifier wasn't trained, the error is, of course, much larger. So there is a double deep net: one is classifying, and another one is predicting its output. And the one which predicts the output is trained to predict the best output the classifier can possibly produce, which is its output on its training data. I don't know if you are still following me, or if this is becoming too complicated. [CHUCKLES] But essentially, we are trying to figure out whether the output looks the way it looks when the classifier is applied to its training data.
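A toy version of the idea: instead of a deep autoencoder, a linear (PCA-style) reconstruction of the classifier's posteriors, which is enough to show the selection logic; the bottleneck size is arbitrary.

```python
import numpy as np

class LinearAutoencoder:
    # Deliberately simple stand-in for the deep autoencoder: a low-rank
    # least-squares reconstruction fitted to posteriors on training data.
    def fit(self, P, rank=8):
        self.mean = P.mean(axis=0)
        _, _, Vt = np.linalg.svd(P - self.mean, full_matrices=False)
        self.V = Vt[:rank].T              # (phones, rank) bottleneck
        return self
    def reconstruction_error(self, P):
        Z = (P - self.mean) @ self.V
        R = Z @ self.V.T + self.mean
        return float(np.mean((P - R) ** 2))

def pick_stream(monitors, stream_posteriors):
    # Score each stream by how "training-like" its posteriors look and
    # keep the one its monitor reconstructs best.
    errs = [m.reconstruction_error(P)
            for m, P in zip(monitors, stream_posteriors)]
    return int(np.argmin(errs))

# Usage: fit one monitor per stream on that stream's training posteriors,
# e.g. monitors = [LinearAutoencoder().fit(P_train_i) for P_train_i in ...]
```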
So it seems to be working to some extent. Here we have the multi-style results; here we have the matched result, which is what we would like to achieve; and of course this is the oracle -- what we would be getting if we knew, ideally, which stream is best. But what we are actually getting is not that terribly bad. I mean, certainly, it is typically better than multi-style training. All right -- and we still have some way to go to the oracle, which is not too far from the matched case. Sometimes we are even there, because the decision is made on every utterance, so sometimes it can do quite well. So we were capable of picking up the good streams and leaving out the bad streams, based only on the output of the classifier.

How does it work on previously unseen noise? Fortunately, for this example, we still seem to be getting some advantage. We are using noise which has never been seen by any of the classifiers, but the system was still capable of picking the good classifier -- actually doing better than any individual classifier. So this seems to be good.

Another technique for dealing with unseen noises -- actually one which I like maybe even a bit more -- is to do the processing in frequency bands, hoping that the main effect of the different noises is in their spectral shape. If you are doing recognition in the sub-bands, then in each sub-band the noise starts looking more like white noise, just at different levels. Meaning: here, maybe, the signal-to-noise ratio is higher; here it is more miserable. But if I have a classifier which is trained on multiple levels of white noise in each frequency band, perhaps I can get some advantage. So I do what the cochlea might be doing: I divide the signal into a number of frequency bands, and then I have one fusion DNN which tries to put these things together. Each of these band nets is trained on multiple noise levels -- but this time of white noise. And then it gets applied to noises which are not white.
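A sketch of the architecture; the per-band classifiers and the fusion net are placeholders for trained DNNs, and, as a shortcut, white-noise training is imitated here by perturbing band energies at several levels rather than adding white noise to the waveform before the filter bank.

```python
import numpy as np

def subbands(spectrogram, n_bands=5):
    # Split a (bins x frames) spectrogram into contiguous sub-bands.
    return np.array_split(spectrogram, n_bands, axis=0)

def white_noise_variants(band_spec, snrs_db=(0, 10, 20)):
    # Training copies of one band at several white-noise levels.
    # (Shortcut: noise is added to band energies; the real recipe adds
    # white noise to the waveform first.)
    out = [band_spec]
    for snr in snrs_db:
        sigma = np.sqrt(np.mean(band_spec ** 2) / 10 ** (snr / 10))
        out.append(band_spec + sigma * np.random.randn(*band_spec.shape))
    return out

def fused_posteriors(band_nets, fusion_net, spectrogram):
    # Per-band posteriors, concatenated and merged by the fusion net.
    per_band = [net(b) for net, b in zip(band_nets, subbands(spectrogram))]
    return fusion_net(np.concatenate(per_band, axis=-1))

bands = subbands(np.abs(np.random.randn(40, 300)))   # stand-in input
```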
How it works you can see from what we did in the case of Aurora. So here we have the examples of how it works in matched situations; here is multi-style training; and here is what you get if you apply this technique. This is what you get with multi-style training, but with the sub-band recognition you are getting half the error rate -- from a simple trick which I think is reasonable: you do sub-band recognition with a number of parallel recognizers, each of them paying attention to a part of the spectrum, and each of them trained to handle white noise, simple white noise. You turn an, in some ways, arbitrary additive noise -- this car noise -- into white-like noise in each sub-band. And that's what you get.

So, in general, in dealing with unexpected noise you want to do adaptation -- you want to modify your classifier on the fly. You want to have parts of the classifier, some streams, which are doing well -- parts of the classifier which are still reliable -- and you want to pick up those streams which are reliable in the unseen situation. So this is what we call multi-stream recognition -- multi-stream adaptation to unknown noise. You assume that not all the streams are going to give you good results, but you assume that at least some of the streams will. And all these streams are trained on, say, clean speech or something like that.

So this is the multi-band processing, all right? This is what we do: we cover different frequency ranges, and then we use our performance monitor to pick the best stream. Here is the experiment which we did. We had 31 processing streams, created from all combinations of five frequency bands. One stream was looking at the full spectrum, and the others were looking only at parts of the spectrum.
The blacker ones cover more of the spectrum; the whiter ones less -- some of them look only at a single frequency band. So we have a decent number of processing channels, and we would hope that if the noise comes in here, maybe this stream is going to be good -- because a recognizer which only uses the bands that are not noisy is going to do well.

So this is the whole system; it was published [STATIC] at Interspeech. We have the sub-band recognition, the fusion, the performance monitor, and the selection of a stream. This is how it works -- again shown for car noise. Car noise is very nice because it mainly corrupts the low frequencies, so all these sub-band techniques work quite well. But you can see it's pretty impressive: if you didn't do anything, you get 50% error; with this one you get 38% error; and if you knew which bands to pick -- the oracle, the cheating experiment -- you would be getting about 35%. So that was, I thought, quite nice.
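The stream inventory in that experiment is just the 31 non-empty subsets of five bands. A sketch of the stream enumeration and monitor-based selection, with the per-stream recognizers and the monitor score left as placeholders:

```python
from itertools import combinations
import numpy as np

BANDS = range(5)          # five frequency bands -> 2**5 - 1 = 31 streams

def make_streams():
    # All non-empty subsets of the five bands, one stream each;
    # the full-spectrum stream is the subset containing all five.
    streams = []
    for k in range(1, len(BANDS) + 1):
        streams.extend(combinations(BANDS, k))
    return streams

def recognize(band_feats, stream_nets, monitor_score):
    # Run every stream's recognizer on its subset of bands and keep the
    # stream the performance monitor trusts most. `stream_nets` maps a
    # band subset to a posterior estimator; `monitor_score` returns
    # higher = more training-like (both are placeholders).
    best, best_score = None, -np.inf
    for subset in make_streams():
        feats = np.concatenate([band_feats[b] for b in subset], axis=-1)
        post = stream_nets[subset](feats)
        score = monitor_score(post)
        if score > best_score:
            best, best_score = post, score
    return best

assert len(make_streams()) == 31
```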
Just to conclude: the auditory system doesn't only work the way this picture suggests -- starting with the signal analysis and then reducing the bit rate; it also increases the number of views of the signal. This is based on the fact that there is a massive increase in the number of neurons at the level of the cortex. So there are many ways of describing the information at the higher levels of perception. Essentially, the signal doesn't go through one path; it goes through many, many paths. And then we need to have some means -- or we have some means -- to pick up the good ones and ignore the other ones, maybe switch them off entirely. It's the same in vision.

So this is the general path of processing the signal: you get different probability estimates for the different streams, and then you need to do some fusion, and decide at the level of the fusion. How can you create the streams? We were showing you probability estimators trained on different noises and on different aspects of the signal -- that is, on different parts of its spectrum. But you can go wild. You can start thinking about different modalities, because -- as we also talked about in the panel -- an audiovisual stream very often carries the same information about the same things, [STATIC] so you can do fusion of audio and visual streams. You can also imagine fusion of streams with different levels of priors -- different levels of hallucination. Basically, this is what I see human beings doing very often. If the signal is very noisy -- you are at a cocktail party -- you are guessing, because that's the best way to get through if the communication is not very important: it's not about your salary increase, but about the weather. So you are basically guessing what the other people are saying, especially if they speak the way I do, right, with a strong accent or something. So the priors are very important, and streams with priors are very important. We use this to some extent, as I was mentioning, by comparing streams with different priors to discover whether the signal is being biased in the wrong way by the priors.

So, stream formation -- there are a number of PhD theses right there, I think. Fusion -- or rather, selecting the best probability estimates: I tell you, this is the problem I was actually asking you to please help me solve, because we still don't know how to do it. I suspect that, especially in human communication, people do it like this: the message starts making sense when they use a certain processing strategy.
So people can tell whether the output of their perceptual system makes sense or not. Our machines don't know how to do that yet.

Conclusion. Some problems with noise are simple. You can deal with them at the signal-processing level -- by filtering the spectrum, filtering the trajectories -- because these effects are very predictable. And if you understand them, you should do it, because there is no need to train on that; you just do it, and things may work well. Unpredictable effects of noise are typically handled nowadays by multi-style training, and the amounts of training data are enormous. If you talk to Google people, they say: we are not deeply interested in what you are doing, because we can always collect more data from new environments. But I think it's not -- I shouldn't say dishonest; I'm sorry, scratch that. [LAUGHTER] It's not the best engineering way of dealing with these things, because I think the good engineering way is to get away with less training and that sort of thing, and maybe to follow what I believe human beings are doing. So: we have a lot of parallel experts working with the different aspects of the signal, giving us different pictures, and then we need to pick up the good ones.