The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

HYNEK HERMANSKY: I'm basically an engineer, and I work on speech recognition. And you may wonder what there is left to work on, because you have a cell phone in your pocket, you speak to it, and Siri answers you and everything. And the whole thing works on very basic principles. You start with a signal. It goes through signal processing. There is some pattern classification-- deep neural nets, as usual. And this recognizes the message; this recognizes what you are saying. So the question is, why rock the boat? Why not keep going, and just try to improve the error rates, improve them step by step? Because we have a good thing going. We have something which is already out there, and it's working.

But you may know the answer. Imagine that you are skiing, or going on a sled, and suddenly you come to a flat stretch and have to start pushing. You don't want to do that, but you do it for a reason, because there may be another slope going on out there. And that's the way I feel about speech recognition. Sometimes we need to push a little bit uphill, and maybe go slightly out of our comfort zone, in order to get further.

Speech is not the thing we use for communicating with Siri. Speech is this. Basically, people speak the way I do. They hesitate. There are a lot of fillers and interruptions. I don't finish sentences. I speak with a strong accent, and so on. I get excited, and so on. And we would like to put a machine there instead of the other person.
Basically, this is what speech recognition ultimately is, right? And actually, if you look at what governments are supporting, and what the big companies are working on, this is what we worry about. We worry about real speech, produced by real people, in real spoken communication. And I didn't even mention all the disturbing things like noises, and so on, but we will get into that.

So I believe that we don't only need signal processing, information theory, and machine learning; we also need the other disciplines. And this is where you guys come in. That's what I believe in: engineering and the life sciences working together. At least we should try. We engineers should at least try to be inspired by the life sciences.

And as far as inspiration is concerned, I have a story to start with. There was a guy who won the lottery by using the numbers 1, 2, 3, 6, 7, 49. And they said, well, this is of course an unusual sequence of numbers; how did you ever come up with it? He says, I'm the first child. My mother was in her second marriage and my father in his third. And I was born on the 6th of July. And of course, 6 times 7 is 49. That's sometimes how I feel about the inspiration I get from you people. I may not get it right, but as long as it works, I'm happy. You know, I'm not being paid for being smart and knowledgeable about biology. I'm really being paid for making something which works.

Anyway, this was just the warm-up. I thought you would still be drinking coffee, so I decided to start with a joke. But it's an inspiring joke; I mean, it's about inspiration. And I will point out some of the inspiration points, which I of course didn't get quite right, but which still worked.

Why do we have audition?
Josh already told us: because we want to survive in this world. Here is a little ferret, or whatever, and he is sensing something now. There is an object, and the ferret is wondering: is this something I should be friendly with, or something I should run away from? So what is the message in this signal? Is it a danger, or is it an opportunity?

In the same way, how do we survive in this world as human beings? There is my wife, who has some message in her head. She wants to tell me, eat vegetables, they are good for you, so she uses speech. And speech is actually an amazing mechanism for sharing experiences. Without speech, we wouldn't be where we are, I can guarantee you, because it allows us to tell other people what they should do without going through as much trouble as the ferret with the bird. We may not get eaten; maybe we just die a little early if we don't get this message. So she says this thing, and hopefully, I get the message.

So this is what speech is about, but I wanted to say that speech is an important thing because it allows us to communicate abstract ideas, like "good for you." It's not only the concrete thing, like "vegetable"; a lot of abstract ideas can be conveyed by speech. And that's why I think it's exciting.

Why do we work on machine recognition of speech? Well, the first reason is, as Edmund Hillary said, because it's there. They asked him, why did you climb Mount Everest? He said, well, because it's there. I mean, it's a challenge, right? Spoken language is one of the most amazing achievements of the human race, as I already told you, so it would be hell if we couldn't build a machine which understands it. And we are not having an easy time of it so far.
In addition, when you are addressing speech, you are really addressing the generic problems which we have in processing other cognitive signals. And we touched on this to some extent during the panel, because the problems we have in speech are similar to the problems in perceiving images and perceiving smells. At all these cognitive signals, machines are simply not very good. Let's face it: machines can add 10 billion numbers very quickly, but they cannot tell my grandmother from a monkey, right? So this is actually the important thing.

There are also practical applications, obviously: access to information, voice interaction with machines, extracting information from speech. Given how much speech is out there now-- I don't know how much we are adding every second through YouTube and that sort of thing, but there is a lot of speech out there-- it would be good if machines could actually extract information from it.

And I always tell the students, there is job security. It's not going to be solved during your lifetime, certainly not during mine. In addition-- I know this may end up on YouTube-- if you don't like it, you can get fantastic jobs. Half of the IBM group ended up on Wall Street making insane amounts of money. So the skills you acquire in working on speech can also be applied in other areas-- obviously in vision, and so on.

Speech has been produced to be perceived. Here is Roman Jakobson, the great Harvard and MIT man, who unfortunately passed away; he would be a hundred and something now. He said: we speak in order to be heard, in order to be understood. Speech has been produced to be perceived. And over the millennia of human evolution, it evolved so that it reflects the properties of human hearing. And so here I'm very much with Josh.
If you build a machine which recognizes speech, you may be verifying some of the theories of speech perception. And I'll point that out along the way.

How do I know that speech evolved to fit hearing, and not the other way around? I have had some big people arguing with me over that, because they say, you don't know. But I know. I think. Well, I think that I know, right? Every single organ which is used for speech production is also used for something much more useful-- typically, eating and breathing. These are the organs of speech production: the lungs, the lips, the teeth, the nose, the velum, and so on. Everything is being used for some life-sustaining function in addition to speaking. And I know that it's not the same with hearing. Hearing evolved for hearing. Maybe there are the organs of balance and that sort of thing, but mostly, the ear is there to hear. In speech, everything that is used is also used for something else, so clearly, we just learned how to speak because we had the appropriate hardware there, and we learned how to use it.

So, in order to get the message, you use some cognitive aspects, which I won't be talking much about. You have to use a common language. You have to have some context for the conversation. You have to have some common set of priors, some common experience, and so on. But mainly, what I will be talking about is that you need a reliable signal which carries the message, because the message is in the signal. It's also in your head, but the signal supports what is happening in your head.

So how much information is in the speech signal? This I have stolen, I believe, from George Miller. If you look at it through Shannon's theory, there would be about 80 kilobits per second. And indeed, you can generate a reasonable signal, without being very smart about it, just by coding it with 11 bits at an 8-kilohertz sampling rate: 88, or roughly 80, kilobits per second.
This verifies it. So this is how much information might be in the signal. How much is in the speech itself is actually not very clear, but we can at least estimate it to some extent. If you say, I would like to transcribe the signal in terms of speech sounds, phonemes, there are maybe about 40 phonemes or so. If you look at the entropy of that, at typical speaking rates it comes to about 80 bits per second. So there are three orders of magnitude of difference. If you push a little bit further-- indeed, if you speak with a vocabulary of about 150,000 words, that means about 17 bits per word, and at normal speaking rates it again comes to less than 100 bits per second. So, as I said, there are a number of ways you can argue about this amount of information. If you start thinking about the dependencies between phonemes, it can go as low as 10 or 20 bits per second. There is no question that there is much more information in the signal than there is in the useful message which we would like to get out. And we will get into that.
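These back-of-the-envelope rates are easy to check. Here is a minimal sketch in Python (an illustration, not from the lecture itself), assuming roughly 40 equiprobable phonemes at about 15 phonemes per second and a 150,000-word vocabulary at about 130 words per minute-- all assumed, order-of-magnitude figures:

```python
import math

# Raw signal: straightforward PCM coding of telephone-quality speech.
fs_hz = 8000           # sampling rate
bits_per_sample = 11   # quantization quoted in the lecture
signal_rate = fs_hz * bits_per_sample       # 88,000 bits/s, "about 80 kbit/s"

# Phonemic level: ~40 equiprobable phonemes, ~15 phonemes/s (assumed rates).
phoneme_bits = math.log2(40)                # ~5.3 bits per phoneme
phoneme_rate = 15 * phoneme_bits            # ~80 bits/s

# Word level: ~150,000-word vocabulary, ~130 words per minute (assumed).
word_bits = math.log2(150_000)              # ~17.2 bits per word
word_rate = (130 / 60) * word_bits          # ~37 bits/s, under 100 bits/s

print(f"signal:   {signal_rate:6.0f} bits/s")
print(f"phonemes: {phoneme_rate:6.1f} bits/s")
print(f"words:    {word_rate:6.1f} bits/s")
```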
Because what is in the signal is not only information about the message; there is a lot of other information. There is information about the health of the speaker, about which language the speaker is using, about emotions, about who is speaking-- speaker-dependent information-- about the mood, and so on. And there is a lot of noise coming from the surroundings, reverberations-- we talked about that quite a lot in the morning-- and all kinds of other noises. What I call noise, in general, is everything we don't want besides the signal, which, in speech recognition, is the message. So when I talk about noise, it can be information about who is speaking, about their emotions, about the fact that maybe my voice is going, and so on.

To my mind, the purpose of perception is to get the signal which carries the desired information, and to suppress, to eliminate, the noise. So the purpose of perception, to be a little bit vulgar about it, is to get rid of most of the information very quickly, because otherwise, your brain would go bananas. You basically want to focus on what you want to hear, and you want to ignore, if possible, everything else. And it's not easy, of course, but we discussed, again in the morning, some techniques for how to go about it. And I will mention a few more techniques which we are working on. But this is the key thing: the purpose of perception is to get what you need and not what you don't need, because otherwise, your brain would be too busy.

Speech happens in many, many environments, and there is a lot of stuff happening around it. A very simple example, which I actually used when I was giving a talk to some grandmothers in the Czech Republic: what you can already use is the fact that things happen at different levels, and they happen at different frequencies, so perception is selective. Every perceptual mode is selective and attends only to part of the world. You know, we don't see radio waves. And we don't hear ultrasound, or the very low frequencies, and so on. So, to a first approximation, there are different frequencies and different sound intensities you may use. If something is too weak, I don't care. If something has too-high frequencies, I don't care, and so on. There are also different spectral and temporal dynamics to speech, which we are learning quite a lot about. Speech happens at different locations in space; again, this is the reason why we have spatial directivity. That's why we have two ears. That's why we have specifically shaped ears, and so on. There are also other, cognitive aspects, like selective attention.
Again, we talked about this: people appear to be able to modify the properties of their cognitive processing depending on what they want to listen to. And my friend Nima Mesgarani, with Eddie Chang-- who was supposed to be here instead of me-- just had a major paper in Nature about that. There are a number of ways we can modify the selectivity; we talked about the sharpening of the cochlear filters depending on signals from the brain.

So speech happens like this. You start with a message. You have a linguistic code, maybe 50 bits per second. There are some motor controls, and speech production turns it into the speech signal, which has three orders of magnitude larger information content. Through speech perception and cognitive processes, we somehow get back to the linguistic code and extract the message. So this is important: from the low bit rate, to the high bit rate, back to the low bit rate.

In production, actually, I don't want to pretend it happens in such a linear way. There are also feedbacks. There is feedback from listening to yourself while you are speaking: you can control how you speak. And you can also actually change the code, because you realize, oh, I should have said that somehow differently. In speech perception, again, as we just discussed, if the message is not getting through, you may be able to tune the system in some ways. You may change things, you know? And you may also use very mechanical techniques, as I told you: close the window, or walk away. There is also feedback through the dialogue, between message and message: depending on what I'm hearing, I may ask a different kind of question, which also modifies the message of the sender.

How do we produce speech? So, we speak in order to be heard, in order to be understood. Very quickly, I want to go back to something which people have already largely forgotten, which is Homer Dudley.
He was a great researcher at Bell Laboratories before the Second World War. He retired, I think, sometime in the early '50s, and he passed away in the '60s. He was saying: the message is in the movements of the vocal tract, which modulate the carrier. So the message in speech is not in the fundamental frequency; it's not in the way you excite your vocal tract. The message is in how you shape the organs of speech production. A proof of that is that you can whisper and still be understood. How you excite the vocal tract, how you generate the audible carrier, is secondary; you know, you can even use an artificial larynx. So there is this idea: there is a message, and the message goes through a modulator onto a carrier, and comes out as speech.

This modulation principle was actually used a long time ago-- and excuse me for being maybe a little bit simplistic, but it is in some ways interesting. This was the speech production machine developed sometime in the 18th century by Wolfgang Ritter von Kempelen. And he actually had it right. The only problem is that nobody trusted him, because he also invented the Mechanical Turk, which played chess, and he was caught as a cheater; so when he was showing his synthesizer, nobody believed him. But anyway, he was definitely a smart guy.

So he already used the principle which is used now. This is the linear model of speech production, developed actually before the Second World War-- again, Bell Laboratories should get the credit; I believe this figure is stolen from Dudley's paper. There is a source, and you can switch it between periodic signals and random noise, depending on whether you are producing a voiced or an unvoiced signal. And then there is a resonance control, which goes into an amplifier, and it produces the speech. So this is the key here; this is the principle behind the device Dudley developed, called the Voder.
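As a rough sketch of that source-filter principle (a construction for illustration, not Dudley's actual circuit), one can excite a cascade of second-order resonators with a pulse train; the formant frequencies and bandwidths below are assumed values for a neutral vowel:

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000     # sampling rate, Hz
f0 = 100      # fundamental frequency of the voiced source, Hz
n = fs // 2   # half a second of signal

# Source: impulse train for a voiced sound; swap in noise for unvoiced.
source = np.zeros(n)
source[:: fs // f0] = 1.0        # a glottal pulse every 1/f0 seconds
# source = np.random.randn(n)    # unvoiced excitation instead

# "Resonance control": a cascade of second-order resonators at assumed
# formant frequencies and bandwidths (roughly a neutral vowel).
speech = source
for f, bw in [(500, 60), (1500, 90), (2500, 120)]:
    r = np.exp(-np.pi * bw / fs)   # pole radius from the bandwidth
    theta = 2 * np.pi * f / fs     # pole angle from the formant frequency
    speech = lfilter([1.0 - r], [1.0, -2 * r * np.cos(theta), r * r], speech)
```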
And he trained a lady, who spent a year or so learning to play it. It was played like an organ. She changed the resonance properties of the system here, and she created the excitation by pushing on a pitch pedal and switching the wrist bar. And if it works well, we may even be able to play the sound. This is a test.

[AUDIO PLAYBACK]

- Will you please make the Voder say, for our Eastern listeners, good evening, radio audience.

HYNEK HERMANSKY: This is a real-- speech.

- Good evening, radio audience.

- And now, for our Western listeners, say, good afternoon, radio audience.

- Good afternoon, radio audience.

[END PLAYBACK]

HYNEK HERMANSKY: Good enough, right? So already in 1939, this was demonstrated at the World's Fair. And the lady was trained so well that in the '50s, when Dudley was retiring, they brought her in-- she had already been retired a long time-- and she could still play it.

How does speech work? I wanted to skip this, but anyway, let's go through it very quickly. This is the speech signal; it is an acoustic signal. This is a sinusoid: high pressure, low pressure, high pressure, low pressure. If you put some barrier somewhere in the path, what happens is that you generate a standing wave. A standing wave stands in space, with alternating regions of high and low pressure, and the size of this standing wave depends on the frequency of the signal. So put it into something like the vocal tract, which we have here. This is the glottis; this is where the excitation happens. This is a very simple model of the vocal tract. And here are the lips. So it takes a certain time for the wave to propagate through the tube.
And the tube will have a maximum velocity at a certain point, so that it will be resonating at a quarter wavelength of the signal, at 3/4 of the wavelength, at 5/4 of the wavelength, and so on. So we can compute at which frequencies this tube will be resonating. This is a very simplistic way of producing speech, but you can generate reasonable speech sounds with it.

So we start putting a constriction in there somewhere, which emulates, very simply, the way we speak by moving the tongue against the palate and making constrictions in the tract. When the tube is open like this, it resonates at 500, 1,500, and 2,500 hertz, if the tube is 17 centimeters long, which is a typical length for the vocal tract. If I put a constriction here, everything moves down, because there is such a thing as perturbation theory, which says that if you put a constriction at the point of maximum velocity-- which is, of course, at the opening-- all the modes will go down. And as you go on, basically, the whole thing keeps changing. The point is that, in almost every position of, say, the tongue, all the resonance frequencies change, so the whole spectrum is affected. And that may become useful to explain something later. But we go on like this, and at the end, you end up at the same frequencies again.

These are called nomograms, and they were heavily worked on at the speech groups at MIT and in Stockholm. So you can see how the formants move. And you can see that, for every position of the [INAUDIBLE]-- here we have the distance of the constriction from the lips-- all the formants are moving. So information about what I'm doing with my vocal organs is actually at all frequencies-- at all audible frequencies, in different ways, but it's there everywhere. There is not a single frequency which would carry the information about something. All the audible frequencies carry information about speech. That's important.
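The quarter-wavelength arithmetic is easy to reproduce. A minimal sketch, assuming a uniform 17-centimeter tube closed at the glottis and open at the lips, with the speed of sound taken as about 340 meters per second:

```python
# Resonances of a uniform tube closed at one end (glottis) and open at
# the other (lips): odd multiples of a quarter wavelength fit inside.
c = 340.0   # speed of sound, m/s (approximate)
L = 0.17    # vocal tract length, m

for k in (1, 2, 3):
    f = (2 * k - 1) * c / (4 * L)   # quarter, 3/4, 5/4 wavelength modes
    print(f"resonance {k}: {f:.0f} Hz")
# prints 500, 1500, 2500 Hz -- the open-tube formants quoted above
```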
You can also look at it and ask where the front cavity resonates and where the back cavity resonates. Again, this front cavity resonance may become interesting a little bit later, if we get to that. This is a very simplistic model of speech production, but it pretty much contains all the basic elements of speech.

The point here is that the resonances depend on the length of the vocal tract. Even when you keep the constriction at the same position-- this axis is how long the front part, before the constriction, is-- all the resonances move. And a shorter vocal tract, like a child's vocal tract, or in a number of females, who typically have shorter vocal tracts than males, gives a different set of resonances. So if somebody tells you the information is in the formants of speech, question it, because it is actually impossible for two different people to generate the same formants, especially with two different lengths of the vocal tract. And we will get into this when we talk about speaker dependencies.

The second part of the equation is hearing. So, we speak in order to be heard, in order to be understood. And again, thanks to Josh, who spent more than sufficient time explaining what I wanted to say, I will just add some very, very small things. Just to summarize, Josh told you that the ear works basically like a bank of bandpass filters, with center frequencies changing along the cochlea and output depending on sound intensity. There are many caveats to that, but to a first approximation, I agree 100 percent; it is enough for us to follow for the rest of the talk.

The second thing which Josh mentioned very briefly, but which I want to stress because it is important, is firing rates-- because, you know, the cochlea communicates with the rest of the system through firings, through impulses. Firing rates on the auditory nerve are on the order of 1 kilohertz-- one spike every millisecond.
But as you go up and up in the system, already at the colliculus it is maybe an order of magnitude less, and at the level of the auditory cortex it is two orders of magnitude less. Of course, this is how the brain works. So here we go from the periphery up to the cortex. But also-- I think this was mentioned very briefly-- if you look at it, the number of neurons increases by more than the firing rates decrease. Again, these are just orders of magnitude: there are maybe 100,000 neurons at the level of the auditory nerve, or the cochlear nucleus, and maybe 100 million neurons at the level of the cortex. And this may come in handy later; if I get all the way to the end of the talk, I will recall this piece of information.

Another thing which was mentioned a number of times is that there are not only connections from the ear, from the periphery, to the brain. By some estimates-- the estimates vary, but this is something I have heard somewhere-- there are almost 10 times more connections going from the brain to the ear than from the ear to the brain. And nature hardly ever builds anything without a reason, so there must be some reason for that. Perhaps we will get into it.

Josh didn't talk much about the level of the cortex. So what is happening at the lower levels, at the periphery? There are just these simple increases of the firing rate. There is a certain enhancement of the changes. So at the beginning of a tone-- this is a tone-- there is more firing on the auditory nerve, and at the end of the tone, there is some deflection. But when you look at the higher level of the cortex, all these wonderful curves, which increase with intensity the way they would for a simple bandpass filter, start looking quite different.
From what I have heard, the majority of the cortical neurons are selective to certain levels. Basically, the firing increases up to a certain level, and then it decreases again. And different neurons are, of course, selective to different levels. Also, you don't see just the simple behavior, as here, where the firing starts as a tone starts. There are neurons like that, but there are also neurons which are interested only in the beginning of the signal, neurons which are interested in beginnings and ends, neurons which are interested only in the ends of signals, and so on.

Receptive fields, again, have been mentioned before. Just as we have receptive fields in the visual cortex, we also have receptive fields in the auditory cortex. Here, instead of x and y, we have frequency and time-- unlike the receptive fields which are typically the first thing you hear about when you talk about visual perception. They come in all kinds of colors. They tend to be quite long, meaning they can be sensitive over about a quarter of a second-- not all of them, but certainly, there are many, many different cortical receptive fields.

So some people suggest-- and given the richness of the neurons in the auditory cortex, it is a very legitimate thing to suggest-- that maybe sounds are processed in the following way: not only do you do the frequency analysis in the cochlea, but then, at the higher levels, you create many pictures of the outside world. And then the question is only which of them to use. This is from Murray Sachs' lab at Johns Hopkins, from 1988. They simply said "pattern recognition," but I believe there is a mechanism which picks up the best streams and leaves out the not-so-useful things. In any case, the concept has been around for a long time.

So this was physiology 101. Psychophysics means that you play signals to listeners, and you ask them what they hear.
We want to know the response of the organism to the incoming stimulus, so, simply, you play the stimulus and you ask for the response. The first thing you can ask is: do you hear something or not? And you will already discover some interesting stuff. Hearing is not equally sensitive everywhere; it is selective. It is most sensitive in the area somewhere between 1 and 4 kilohertz, and much less sensitive at the lower frequencies. This is the threshold curve.

At the threshold level, here is another interesting thing. If you apply two signals, then as long as they happen within a certain period, about a couple of hundred milliseconds, the thresholds are halved. Basically, neither of these signals would be heard if you applied only a single one, but if you apply both of them, you hear them. If you play signals at different frequencies, and the frequencies are close enough-- close, as Josh mentioned with the beats, so that they happen within one critical band-- again, neither the blue nor the green signal would be heard on its own, but if you play them together, you hear them. But if they are further apart in frequency, you don't hear them. The same thing in time: if these guys are further apart in time, you won't hear them. So this subthreshold perception is actually kind of interesting, and we will use it.

What we didn't talk much about is that there are obvious ways to modify the threshold of hearing. Here we have a target, and since it is above the threshold of hearing, you hear it. But if you play another sound, called a masker, you will not hear the target, because your threshold has basically been modified-- this is called the masked threshold-- and this target is suddenly not heard.
The target can be something useful, but in MP3 coding, the thing being masked is typically the coding noise, which can be pretty annoying; you try to figure out how you can mask the noise with the useful signal, computing these masked thresholds on the fly.

The initial experiment on what is called simultaneous masking was the following-- and again, it was Bell Labs, Fletcher and his people. They would figure out the threshold at a certain frequency without the noise. Then they would put noise around it, and the threshold went up, because there was noise, so there was masking. Then they made the noise band broader, and the threshold kept going up, as you would expect: there was more noise, so you had to make the signal stronger. But only up to a certain point. When you start making the band of noise too wide, suddenly it isn't happening anymore; there is no more additional masking. That's how they came up with the concept of the critical band. The idea is that what happens inside the critical band matters-- it influences the decoding of the signal within that critical band-- but what happens outside the critical band doesn't. So essentially, if signals are far apart in frequency, they don't interact with each other. And again, this is a useful thing for speech recognition people, who didn't much realize that this is the main outcome of the masking experiments.

Critical bands-- again, there are discussions here, but this is the Bark scale, which was developed in Germany by Zwicker and his colleagues. It is pretty much logarithmic from about 600 or 700 hertz up, and approximately constant below 600 or 700 hertz. If you talk to the Cambridge people, Brian Moore and that group, their scale is regarded as pretty much logarithmic everywhere. But remember the critical bands from the subthreshold experiments? Again, the critical band shows up in masking: things that happen within the critical band integrate; things that happen outside the critical band don't interact.
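For reference-- this formula is not in the lecture-- a common analytic approximation to the Bark scale is the Zwicker-Terhardt (1980) formula, which indeed comes out roughly linear in frequency at the bottom and roughly logarithmic above a few hundred hertz:

```python
import math

def hz_to_bark(f_hz):
    """Zwicker & Terhardt (1980) approximation of the Bark scale."""
    return 13.0 * math.atan(0.00076 * f_hz) \
         + 3.5 * math.atan((f_hz / 7500.0) ** 2)

# Critical bands are about 1 Bark wide: near-constant in hertz at low
# frequencies, widening roughly in proportion to frequency higher up.
for f in (100, 250, 500, 1000, 2000, 4000, 8000):
    print(f"{f:5d} Hz -> {hz_to_bark(f):5.2f} Bark")
```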
Another kind of masking is temporal masking. So you have a signal, and of course, if you put a masker on top of it, that is simultaneous masking: you have to make the signal much stronger in order to hear it. But the masker also influences things in time. This is what is called forward masking, and this is the one which is probably more interesting and more useful. There is also backward masking, when the masker happens after the signal, but that probably has a different origin, more cognitive rather than peripheral.

So there is still a masker: you have to make the signal stronger, up to a certain point. When the distance between the masker and the signal is more than 200 milliseconds, there is no temporal masking anymore; but there is within this interval of 200 milliseconds. If you make the masker stronger, the masking is initially stronger, but it also decays faster. And again, it has decayed after about 200 milliseconds. So whatever happens outside this critical interval, about a couple of hundred milliseconds, does not integrate; but things that happen inside this critical interval seem to influence each other. And again, as I said about subthreshold perception: if there are two tones which happen within 200 milliseconds, neither of which would be heard in isolation, they are heard if you play them together.

Another part which is kind of interesting is that loudness depends, of course, on the intensity of the sound, but it does not depend on it linearly. It goes with about the cubic root, so in order to make a signal twice as loud, you have to make it about 10 times more intense-- for stimuli which are longer than 200 milliseconds.
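That cube-root rule (Stevens' power law for loudness; the two-line check below is an added illustration) works out as follows: with loudness proportional to intensity to the power 0.3, doubling loudness requires an intensity ratio of 2^(1/0.3), which is about 10, i.e. about 10 dB:

```python
import math

exponent = 0.3                   # loudness ~ intensity ** 0.3
ratio = 2.0 ** (1.0 / exponent)  # intensity ratio needed to double loudness
print(f"intensity ratio: {ratio:.1f}x")                      # ~10.1x
print(f"level increase:  {10 * math.log10(ratio):.1f} dB")   # ~10 dB
```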
774 00:36:46,710 --> 00:36:50,440 Equal loudness curves: this is the threshold curve, 775 00:36:50,440 --> 00:36:55,830 but these equal loudness curves are telling you what 776 00:36:55,830 --> 00:36:58,260 the intensity of the sound 777 00:36:58,260 --> 00:37:02,100 would need to be in order to hear it equally loud. 778 00:37:02,100 --> 00:37:06,080 So it's saying that, if you have a 40 dB signal at 1 kilohertz, 779 00:37:06,080 --> 00:37:08,750 and you want to make it equally loud at 100 hertz, 780 00:37:08,750 --> 00:37:13,800 you have to make it 60 dB, and so on. 781 00:37:13,800 --> 00:37:16,805 These curves become flatter and flatter at higher levels-- the effect is most pronounced 782 00:37:16,805 --> 00:37:19,470 at the threshold, at lower levels-- but they are there. 783 00:37:19,470 --> 00:37:22,410 And they are actually kind of interesting and important. 784 00:37:22,410 --> 00:37:25,410 Hearing is rather non-linear. 785 00:37:25,410 --> 00:37:28,915 Its properties depend on the intensity. 786 00:37:28,915 --> 00:37:31,290 Speech, of course, is happening somewhere around here, where 787 00:37:31,290 --> 00:37:32,540 the hearing is more sensitive. 788 00:37:32,540 --> 00:37:34,350 That was the point here. 789 00:37:34,350 --> 00:37:36,630 Modulations, again, we didn't talk much about that, 790 00:37:36,630 --> 00:37:38,730 but modulations are very important. 791 00:37:38,730 --> 00:37:42,460 Since 1923, it's known that hearing 792 00:37:42,460 --> 00:37:45,090 is the most sensitive to a certain rate of modulations, 793 00:37:45,090 --> 00:37:48,050 around 4, 5 hertz. 794 00:37:48,050 --> 00:37:52,740 These are experiments from Bell Labs, repeated a number of times. 795 00:37:52,740 --> 00:37:55,350 So this is for amplitude modulations. 796 00:37:55,350 --> 00:37:58,020 In this experiment, what you do is that you modulate a signal, 797 00:37:58,020 --> 00:38:00,840 and change the depth, and change the frequency. 798 00:38:00,840 --> 00:38:03,090 And you are asking, do you hear the modulation, 799 00:38:03,090 --> 00:38:05,040 or don't you hear the modulation? 800 00:38:05,040 --> 00:38:07,260 Very interesting-- the interesting thing 801 00:38:07,260 --> 00:38:09,780 is, if you look at-- again, I mean, 802 00:38:09,780 --> 00:38:12,360 I refer to what Josh was telling you in the morning. 803 00:38:12,360 --> 00:38:15,605 If you just take one trajectory of the spectrum, 804 00:38:15,605 --> 00:38:18,660 treat it as a time domain signal, remove the mean, 805 00:38:18,660 --> 00:38:21,360 and compute its Fourier components-- frequency 806 00:38:21,360 --> 00:38:24,810 components, they peak somewhere around 4 hertz, 807 00:38:24,810 --> 00:38:28,380 just where the hearing is the most sensitive. 808 00:38:28,380 --> 00:38:31,210 So hearing is not very sensitive, obviously, 809 00:38:31,210 --> 00:38:34,080 to signals which are non-modulated, 810 00:38:34,080 --> 00:38:37,950 but also there are almost no components 811 00:38:37,950 --> 00:38:40,890 in the signal which would be non-modulated, because when I 812 00:38:40,890 --> 00:38:42,490 talk to you, I move the mouth. 813 00:38:42,490 --> 00:38:44,010 I mean, I change the things. 814 00:38:44,010 --> 00:38:48,300 And I change the things about four times a second, mainly. 815 00:38:51,820 --> 00:38:54,280 When it comes to speech, you can also compute-- 816 00:38:54,280 --> 00:38:57,210 for music, you can also figure out what 817 00:38:57,210 --> 00:39:00,370 are the natural rhythms in the music.
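[The modulation-spectrum computation just described is easy to sketch: take one band's energy trajectory, remove the mean, and Fourier-transform it. A minimal sketch with a toy trajectory modulated at 4 hertz; all numbers are illustrative.]

```python
import numpy as np

def modulation_spectrum(band_energy, frame_rate_hz=100.0):
    """Modulation spectrum of one spectral trajectory.

    band_energy: energy of a single frequency band sampled at the frame
    rate (e.g. every 10 ms). Remove the mean, take the Fourier transform,
    and look where the modulation energy peaks.
    """
    x = np.asarray(band_energy, dtype=float)
    x = x - x.mean()                          # remove the DC component
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / frame_rate_hz)
    return freqs, spectrum

# Toy trajectory modulated at 4 Hz (roughly the syllabic rate):
t = np.arange(0, 2.0, 0.01)                   # 100 Hz frame rate, 2 seconds
energy = 1.0 + 0.5 * np.sin(2 * np.pi * 4.0 * t)
freqs, spec = modulation_spectrum(energy)
print("peak at %.1f Hz" % freqs[spec.argmax()])   # -> 4.0 Hz
```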
818 00:39:00,370 --> 00:39:04,210 I stole this from, I believe, the Munich group, 819 00:39:04,210 --> 00:39:08,020 from [INAUDIBLE]. 820 00:39:08,020 --> 00:39:10,100 He played 60 pieces of music. 821 00:39:10,100 --> 00:39:14,230 And then he asked people to tap to the rhythm of the music. 822 00:39:14,230 --> 00:39:16,300 And this is the histogram of the tapping. 823 00:39:16,300 --> 00:39:18,790 For most of the people, for most of the music, 824 00:39:18,790 --> 00:39:21,190 the tapping was about four times a second. 825 00:39:21,190 --> 00:39:23,890 This is where the hearing is most sensitive. 826 00:39:23,890 --> 00:39:29,950 And this is the modulation frequency of this music. 827 00:39:29,950 --> 00:39:33,180 So people play music in such a way 828 00:39:33,180 --> 00:39:35,580 that we hear it well, that it basically 829 00:39:35,580 --> 00:39:38,510 resonates with the natural frequency which 830 00:39:38,510 --> 00:39:40,020 we are perceiving. 831 00:39:40,020 --> 00:39:42,250 You can also ask a similar thing. 832 00:39:42,250 --> 00:39:45,120 So, in speech, you can play the speech sentences. 833 00:39:45,120 --> 00:39:47,860 And you ask people to tap to the rhythm of the sentences. 834 00:39:47,860 --> 00:39:49,990 Of course, what comes out is the syllabic rate. 835 00:39:49,990 --> 00:39:53,210 And the syllabic rate is about 4 hertz. 836 00:39:53,210 --> 00:39:55,530 Where is the information in speech? 837 00:39:55,530 --> 00:39:58,140 Well, we know what the ear is doing. 838 00:39:58,140 --> 00:40:02,940 It analyzes the signal into individual frequency bands. 839 00:40:02,940 --> 00:40:06,360 We know what Homer Dudley was telling us: 840 00:40:06,360 --> 00:40:08,880 the message is in the modulations of these 841 00:40:08,880 --> 00:40:10,920 frequency bands-- as a matter of fact, 842 00:40:10,920 --> 00:40:14,250 that was the basis of his vocoder. 843 00:40:14,250 --> 00:40:17,190 What he also did was that he designed-- actually, 844 00:40:17,190 --> 00:40:18,544 it wasn't only him. 845 00:40:18,544 --> 00:40:19,710 There was another technique. 846 00:40:19,710 --> 00:40:22,500 This one is, kind of, a somehow cleaner thing, 847 00:40:22,500 --> 00:40:24,930 which is called the spectrograph, which tells you 848 00:40:24,930 --> 00:40:27,480 about the spectrum of frequency components 849 00:40:27,480 --> 00:40:30,272 of the acoustic signal. 850 00:40:30,272 --> 00:40:31,230 So you take the signal. 851 00:40:31,230 --> 00:40:33,750 You put it through a bank of bandpass filters. 852 00:40:33,750 --> 00:40:37,470 And then here, you basically display, 853 00:40:37,470 --> 00:40:42,420 on the z-axis, the intensity in each frequency band. 854 00:40:42,420 --> 00:40:45,200 This was, I heard, used for listening 855 00:40:45,200 --> 00:40:48,300 for German submarines, because they 856 00:40:48,300 --> 00:40:54,270 wanted to-- they knew that acoustic signatures were 857 00:40:54,270 --> 00:40:57,120 different for friendly submarines and enemy 858 00:40:57,120 --> 00:40:58,260 submarines. 859 00:40:58,260 --> 00:41:00,570 People listened for it, but also people 860 00:41:00,570 --> 00:41:03,310 realized it may be useful to look at the 861 00:41:03,310 --> 00:41:04,335 acoustic signal somehow. 862 00:41:04,335 --> 00:41:07,000 The waveform wasn't making all that much sense, 863 00:41:07,000 --> 00:41:09,270 but the spectrogram was. 864 00:41:09,270 --> 00:41:13,140 The danger there was that the people who were working in speech 865 00:41:13,140 --> 00:41:14,850 got hold of it.
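[A minimal sketch of the filterbank-style spectrograph just described: a bank of bandpass filters, then the smoothed energy of each band over time. The band edges, filter order, and frame step are made-up illustration values, not parameters from the talk.]

```python
import numpy as np
from scipy.signal import butter, sosfilt

def filterbank_spectrogram(x, fs, bands, hop=0.010):
    """Spectrogram the original way: a bank of bandpass filters,
    then the averaged energy of each band, frame by frame.

    bands: list of (low_hz, high_hz) edges; hop: frame step in seconds.
    """
    step = int(hop * fs)
    rows = []
    for lo, hi in bands:
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        env = sosfilt(sos, x) ** 2                   # instantaneous energy
        n = len(env) // step                         # average per frame
        rows.append(env[: n * step].reshape(n, step).mean(axis=1))
    return np.array(rows)                            # (n_bands, n_frames)

fs = 16000
x = np.random.randn(fs)                              # 1 s of noise as a stand-in
bands = [(200, 400), (400, 800), (800, 1600), (1600, 3200)]
print(filterbank_spectrogram(x, fs, bands).shape)    # (4, 100)
```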
866 00:41:14,850 --> 00:41:17,280 And then they started, sort of, looking at the spectrograms. 867 00:41:17,280 --> 00:41:19,950 And they say, haha, we are seeing the information here. 868 00:41:19,950 --> 00:41:23,130 We are seeing the information in waves. 869 00:41:23,130 --> 00:41:26,910 The spectrum is changing. Not only was this 870 00:41:26,910 --> 00:41:28,800 the way the spectrogram 871 00:41:28,800 --> 00:41:32,880 was originally developed, that you were displaying changes 872 00:41:32,880 --> 00:41:36,710 in energy in individual frequency bands, 873 00:41:36,710 --> 00:41:38,000 but you can also look at it this way. 874 00:41:38,000 --> 00:41:40,375 This is when you get to what is called the short-term spectrum 875 00:41:40,375 --> 00:41:42,030 of speech. 876 00:41:42,030 --> 00:41:44,540 And people said, oh, this short-term spectrum 877 00:41:44,540 --> 00:41:46,410 looks different for R than for E, 878 00:41:46,410 --> 00:41:49,440 so maybe this is the way to recognize speech. 879 00:41:49,440 --> 00:41:51,000 So indeed, I mean, those are two ways 880 00:41:51,000 --> 00:41:52,870 of generating the spectrograms. 881 00:41:52,870 --> 00:41:56,210 I mean, this was the original one, a bank of bandpass filters. 882 00:41:56,210 --> 00:41:59,790 And you were displaying the energy as a function of time. 883 00:41:59,790 --> 00:42:02,130 This is what your ear is doing. 884 00:42:02,130 --> 00:42:03,330 That's what I'm saying. 885 00:42:03,330 --> 00:42:05,900 This is not what your ear is doing: 886 00:42:05,900 --> 00:42:08,850 you take short segments of the signal, 887 00:42:08,850 --> 00:42:10,890 you compute the Fourier transform, 888 00:42:10,890 --> 00:42:15,400 and then you display the Fourier transform one frame at a time. 889 00:42:15,400 --> 00:42:18,480 But this is the way most of the speech recognition systems 890 00:42:18,480 --> 00:42:19,950 work. 891 00:42:19,950 --> 00:42:24,750 And I'm suggesting that maybe we should think about other ways. 892 00:42:27,270 --> 00:42:30,260 So now we have to deal with all these problems. 893 00:42:30,260 --> 00:42:35,060 So we have a number of things coming 894 00:42:35,060 --> 00:42:40,040 in, in the form of the message with all this junk around it. 895 00:42:40,040 --> 00:42:42,030 And machine recognition of speech 896 00:42:42,030 --> 00:42:45,170 would like to transcribe the code which carries the message. 897 00:42:45,170 --> 00:42:47,870 This is a typical example of the application 898 00:42:47,870 --> 00:42:48,830 of speech recognition. 899 00:42:48,830 --> 00:42:50,690 I'm not saying this is the only one. 900 00:42:50,690 --> 00:42:53,840 There are attempts to recognize just some key words. 901 00:42:53,840 --> 00:42:55,790 There are attempts to actually generate 902 00:42:55,790 --> 00:42:58,580 an understanding of what people are saying, and so on, 903 00:42:58,580 --> 00:43:01,340 but we would be happy, in most cases, 904 00:43:01,340 --> 00:43:05,340 just to transcribe the speech. 905 00:43:05,340 --> 00:43:07,860 Speech has been produced to be perceived. 906 00:43:07,860 --> 00:43:09,080 We already talked about it. 907 00:43:09,080 --> 00:43:13,570 It evolved over millennia to fit the properties of hearing. 908 00:43:13,570 --> 00:43:16,470 So this is-- I'm sort of seconding what Josh was saying. 909 00:43:16,470 --> 00:43:19,350 Josh was saying, you can learn about the hearing 910 00:43:19,350 --> 00:43:21,150 by synthesizing stuff.
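[For contrast with the filterbank sketch above, here is the other recipe just described: the frame-by-frame Fourier spectrogram that most recognizers use, even though it is not what the ear does. Again a minimal sketch; the 25 ms frame and 10 ms hop are conventional values, not figures from the talk.]

```python
import numpy as np

def stft_spectrogram(x, fs, frame=0.025, hop=0.010):
    """The other spectrogram: short windowed segments of the signal,
    one Fourier transform per frame."""
    n, step = int(frame * fs), int(hop * fs)
    window = np.hamming(n)
    frames = [x[i:i + n] * window for i in range(0, len(x) - n, step)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1)) ** 2  # (frames, bins)

fs = 16000
x = np.sin(2 * np.pi * 1000 * np.arange(fs) / fs)    # 1 kHz tone, 1 second
print(stft_spectrogram(x, fs).shape)
```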
911 00:43:21,150 --> 00:43:23,700 I'm saying you can learn about hearing by trying 912 00:43:23,700 --> 00:43:25,940 to recognize the stuff. 913 00:43:25,940 --> 00:43:31,470 So if you put something in and it works, and it supports 914 00:43:31,470 --> 00:43:36,090 some theory of hearing, you may be kind of reasonably confident 915 00:43:36,090 --> 00:43:38,980 that it was something which has been useful. 916 00:43:38,980 --> 00:43:41,730 Actually, there's a paper about that, of which, of course, 917 00:43:41,730 --> 00:43:43,860 I'm a co-author, but I didn't want to show that. 918 00:43:43,860 --> 00:43:45,270 I thought I would leave this one, 919 00:43:45,270 --> 00:43:48,750 but I didn't do it at the last minute. 920 00:43:48,750 --> 00:43:52,340 Anyways, speech recognition-- the speech signal 921 00:43:52,340 --> 00:43:54,730 has a high bit-rate; it comes 922 00:43:54,730 --> 00:43:57,240 into the recognizer; out comes information at a low bit-rate. 923 00:43:57,240 --> 00:43:59,040 So what you are doing here is you are trying 924 00:43:59,040 --> 00:44:01,020 to reorganize your stuff. 925 00:44:01,020 --> 00:44:04,240 You are trying to reduce the entropy. 926 00:44:04,240 --> 00:44:06,510 If you are reducing the entropy, you better 927 00:44:06,510 --> 00:44:09,360 know what you are doing, because otherwise, you 928 00:44:09,360 --> 00:44:10,990 get real garbage. 929 00:44:10,990 --> 00:44:13,410 I mean, that's, kind of, like, one of these common sense 930 00:44:13,410 --> 00:44:15,210 things, right? 931 00:44:15,210 --> 00:44:16,910 So you want to use some knowledge. 932 00:44:16,910 --> 00:44:18,930 You have plenty of knowledge in this recognizer. 933 00:44:18,930 --> 00:44:20,722 Where does this knowledge come from? 934 00:44:20,722 --> 00:44:22,180 We keep discussing it all the time. 935 00:44:22,180 --> 00:44:27,310 It came from textbooks, teachers, intuitions, beliefs, 936 00:44:27,310 --> 00:44:28,530 and so on. 937 00:44:28,530 --> 00:44:31,110 And it's a good thing about that, that you 938 00:44:31,110 --> 00:44:34,950 can hardwire this knowledge so that you 939 00:44:34,950 --> 00:44:39,630 don't have to relearn it next time based on the data. 940 00:44:39,630 --> 00:44:43,170 Of course, the problem is that this knowledge may be incomplete, 941 00:44:43,170 --> 00:44:47,550 irrelevant, or can be plain wrong, because, you know, 942 00:44:47,550 --> 00:44:49,470 who can say that whatever teachers tell you, 943 00:44:49,470 --> 00:44:53,880 or textbooks tell you, or your intuitions or beliefs, is always 944 00:44:53,880 --> 00:44:54,840 true? 945 00:44:54,840 --> 00:44:58,200 Much more often now, what people are using 946 00:44:58,200 --> 00:45:01,770 is knowledge that comes directly from the data. 947 00:45:01,770 --> 00:45:05,355 Such knowledge is relevant and unbiased, 948 00:45:05,355 --> 00:45:09,020 but the problem is that you need a lot of training data. 949 00:45:09,020 --> 00:45:13,650 And it's very hard to get the architecture of the recognizer 950 00:45:13,650 --> 00:45:16,560 from the data; at least, I don't know 951 00:45:16,560 --> 00:45:18,640 quite well how to do it yet. 952 00:45:18,640 --> 00:45:19,710 So these are two things. 953 00:45:19,710 --> 00:45:22,680 And again, I mean, let me go back to the '50s. 954 00:45:22,680 --> 00:45:26,900 The first knowledge-based recognizer was based on the spectrograms. 955 00:45:26,900 --> 00:45:28,820 There was Richard Galt.
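[Going back to the bit-rate point above, a back-of-the-envelope calculation makes the entropy reduction concrete. All numbers here are illustrative assumptions (standard PCM rates and a rough phoneme rate), not figures from the talk.]

```python
import math

# Waveform side: 16 kHz, 16-bit PCM.
signal_bits = 16000 * 16                 # 256,000 bit/s

# Message side: ~41 phonemes at roughly 12 phonemes per second.
message_bits = 12 * math.log2(41)        # ~64 bit/s

print(f"signal:  {signal_bits} bit/s")
print(f"message: {message_bits:.0f} bit/s")
print(f"reduction: ~{signal_bits / message_bits:.0f}x")   # thousands-fold
```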
956 00:45:28,820 --> 00:45:30,870 And he was looking at spectrograms 957 00:45:30,870 --> 00:45:33,660 and trying to figure out what this short-term spectrum looked 958 00:45:33,660 --> 00:45:35,740 like for different speech sounds. 959 00:45:35,740 --> 00:45:38,130 Then he thought he would make this finite state machine, 960 00:45:38,130 --> 00:45:41,100 which would generate the text. 961 00:45:41,100 --> 00:45:43,500 Needless to say, it didn't work too well. 962 00:45:43,500 --> 00:45:48,270 He got beaten by the data-driven approach, where people 963 00:45:48,270 --> 00:45:51,760 took high-pass filtered speech and low-pass filtered speech, 964 00:45:51,760 --> 00:45:55,200 and displayed energies from these two channels 965 00:45:55,200 --> 00:45:58,510 on, at the time, an oscilloscope. 966 00:45:58,510 --> 00:46:00,960 And they tried to figure out what are the patterns. 967 00:46:00,960 --> 00:46:02,490 They tried to memorize the patterns, 968 00:46:02,490 --> 00:46:06,000 make the templates from the training data. 969 00:46:06,000 --> 00:46:09,650 And they tried to match them against the test data. 970 00:46:09,650 --> 00:46:11,210 The task was recognizing ten digits. 971 00:46:11,210 --> 00:46:12,930 And it was working reasonably well, 972 00:46:12,930 --> 00:46:16,440 better than 90% of the time for a single speaker, and so on, 973 00:46:16,440 --> 00:46:17,500 and so on. 974 00:46:17,500 --> 00:46:21,200 But it's interesting that, already in the '50s, 975 00:46:21,200 --> 00:46:25,230 the knowledge-based approach got beat 976 00:46:25,230 --> 00:46:29,250 by the data-driven approach, because the knowledge maybe wasn't 977 00:46:29,250 --> 00:46:31,500 exactly what you needed to use. 978 00:46:31,500 --> 00:46:34,220 You were looking at the shapes of the short-term spectra, 979 00:46:34,220 --> 00:46:36,920 basically. 980 00:46:36,920 --> 00:46:40,460 Of course, now, we are in the 21st century, finally. 981 00:46:40,460 --> 00:46:43,460 A number of people say, this is the real way 982 00:46:43,460 --> 00:46:44,990 of recognizing speech. 983 00:46:44,990 --> 00:46:48,470 You take the signal as it comes from the microphone. 984 00:46:48,470 --> 00:46:50,930 You take the neural net. 985 00:46:50,930 --> 00:46:53,960 You put in a lot of training data, which 986 00:46:53,960 --> 00:46:58,000 contains all sources of unwanted variability, 987 00:46:58,000 --> 00:46:58,630 basically 988 00:46:58,630 --> 00:47:01,490 all possible ways 989 00:47:01,490 --> 00:47:08,700 you can disturb the speech, and out comes the speech message. 990 00:47:08,700 --> 00:47:11,080 The key thing is, I'm not saying that this is wrong, 991 00:47:11,080 --> 00:47:14,330 but I'm saying that maybe this is not the most efficient way 992 00:47:14,330 --> 00:47:17,010 of going about it, because, in this case, 993 00:47:17,010 --> 00:47:19,455 you would have to retrain the recognizer every time. 994 00:47:19,455 --> 00:47:21,205 It's a little bit like, sort of, you know, 995 00:47:21,205 --> 00:47:24,760 if you look at the hearing system, or the simple animal 996 00:47:24,760 --> 00:47:25,430 system-- 997 00:47:25,430 --> 00:47:26,780 this is a moth here. 998 00:47:29,990 --> 00:47:33,610 Here it converts changes in acoustic pressure 999 00:47:33,610 --> 00:47:36,780 to changes in firing rate. 1000 00:47:36,780 --> 00:47:39,590 It goes to a very simple brain, a very small one. 1001 00:47:39,590 --> 00:47:42,430 You know, this is not the way the human hearing is working. 1002 00:47:42,430 --> 00:47:44,870 Human hearing is much more complex.
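[A rough sketch of the 1950s two-band template recognizer described above. The actual system was analog hardware; this is just the idea in modern form, with made-up frame counts and a plain nearest-template match.]

```python
import numpy as np

def two_band_pattern(x, fs, n_frames=20):
    """Crude stand-in for the 1950s digit-recognizer front end:
    energies of the low and high halves of the spectrum, resampled
    to a fixed number of frames."""
    step = len(x) // n_frames
    pattern = []
    for i in range(n_frames):
        frame = x[i * step:(i + 1) * step]
        spec = np.abs(np.fft.rfft(frame)) ** 2
        half = len(spec) // 2
        pattern.append([spec[:half].sum(), spec[half:].sum()])
    return np.log(np.array(pattern) + 1e-10)          # (n_frames, 2)

def recognize(x, fs, templates):
    """Nearest template in Euclidean distance; one template per digit,
    templates made from the training data as in the talk."""
    p = two_band_pattern(x, fs)
    return min(templates, key=lambda d: np.sum((p - templates[d]) ** 2))
```

Usage would be: build `templates` as a dict mapping each digit to `two_band_pattern(training_utterance, fs)`, then call `recognize` on a test utterance.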
1003 00:47:44,870 --> 00:47:47,190 And again, Josh already told us a lot about it, 1004 00:47:47,190 --> 00:47:48,830 so I won't spend much time. 1005 00:47:48,830 --> 00:47:52,580 The point here is, the human hearing is frequency-selective. 1006 00:47:52,580 --> 00:47:54,760 It goes through a number of levels. 1007 00:47:54,760 --> 00:47:58,940 This is very much along the lines of deep nets and that sort of thing. 1008 00:47:58,940 --> 00:48:01,220 But still, there is a lot of structure 1009 00:48:01,220 --> 00:48:04,070 there in the hearing system. 1010 00:48:04,070 --> 00:48:07,370 So it makes at least some sense to me that, 1011 00:48:07,370 --> 00:48:10,470 if you want to do what people are doing more and more, 1012 00:48:10,470 --> 00:48:13,410 and there will be a whole special session next week 1013 00:48:13,410 --> 00:48:17,600 at Interspeech on how to train the things directly 1014 00:48:17,600 --> 00:48:20,750 from the data, probably you want to have 1015 00:48:20,750 --> 00:48:23,330 a highly structured environment. 1016 00:48:23,330 --> 00:48:26,690 You want to have convolutional pre-processing, recursive 1017 00:48:26,690 --> 00:48:30,420 structures, and so on, and long short-term memory. 1018 00:48:30,420 --> 00:48:32,540 Yeah, here are actually some, and all these things 1019 00:48:32,540 --> 00:48:33,360 are being used. 1020 00:48:33,360 --> 00:48:37,310 And I think this is the direction to go. 1021 00:48:37,310 --> 00:48:39,810 But I still argue that maybe 1022 00:48:39,810 --> 00:48:42,060 there's a better way to go about it. 1023 00:48:42,060 --> 00:48:44,980 A better way to go about it is that you 1024 00:48:44,980 --> 00:48:49,270 first try to do some pre-processing of the signal 1025 00:48:49,270 --> 00:48:53,110 and derive some way of describing the signal more 1026 00:48:53,110 --> 00:48:58,750 efficiently, using the features, and so on, and so on. 1027 00:48:58,750 --> 00:49:02,920 Here you put all the knowledge which you possibly 1028 00:49:02,920 --> 00:49:04,320 may want to use-- 1029 00:49:04,320 --> 00:49:05,950 which you already have. 1030 00:49:05,950 --> 00:49:10,830 This knowledge can be derived from some development data, 1031 00:49:10,830 --> 00:49:14,020 but you don't want to use the speech signal directly 1032 00:49:14,020 --> 00:49:15,295 every time you are using-- 1033 00:49:19,390 --> 00:49:21,010 you don't want to retrain, basically, 1034 00:49:21,010 --> 00:49:23,110 every time, directly from the speech signal. 1035 00:49:23,110 --> 00:49:26,440 You want to reserve your training 1036 00:49:26,440 --> 00:49:29,860 data, the task-specific training data, 1037 00:49:29,860 --> 00:49:32,120 to deal with the effects of the noise which 1038 00:49:32,120 --> 00:49:33,610 you don't understand. 1039 00:49:33,610 --> 00:49:36,462 This is where the machine learning comes in. 1040 00:49:36,462 --> 00:49:38,920 I'm not saying that this is not a part of machine learning, 1041 00:49:38,920 --> 00:49:40,240 but, I mean, 1042 00:49:40,240 --> 00:49:44,240 there are two different things which you are going to do. 1043 00:49:44,240 --> 00:49:45,990 I was just looking for some support. 1044 00:49:45,990 --> 00:49:49,270 This one came from Stu Geman from Brown University 1045 00:49:49,270 --> 00:49:51,010 and his colleagues.
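[To make "highly structured" concrete before the Geman quote: a minimal sketch of such a net in PyTorch, with convolutional pre-processing over the waveform followed by a long short-term memory layer. All sizes and layer choices are made up for illustration; this is not the architecture of any particular system mentioned in the talk.]

```python
import torch
import torch.nn as nn

class StructuredAcousticModel(nn.Module):
    """Sketch of a structured net trained directly from the waveform:
    a learned filterbank-like convolutional stage, then an LSTM,
    ending in per-frame phoneme posteriors."""

    def __init__(self, n_phonemes=41):
        super().__init__()
        self.frontend = nn.Sequential(
            # ~25 ms kernels with a 10 ms hop at 16 kHz:
            nn.Conv1d(1, 64, kernel_size=400, stride=160),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(64, 128, batch_first=True)
        self.out = nn.Linear(128, n_phonemes)

    def forward(self, wave):                    # wave: (batch, samples)
        z = self.frontend(wave.unsqueeze(1))    # (batch, 64, frames)
        h, _ = self.lstm(z.transpose(1, 2))     # (batch, frames, 128)
        return self.out(h).log_softmax(dim=-1)  # per-frame log posteriors

posteriors = StructuredAcousticModel()(torch.randn(2, 16000))
print(posteriors.shape)                         # (2, frames, 41)
```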
1046 00:49:51,010 --> 00:49:53,780 Stu Geman is a machine learning person, definitely, 1047 00:49:53,780 --> 00:49:58,570 but he says, we feel that the meat is in the features 1048 00:49:58,570 --> 00:50:01,150 rather than in the machine learning, 1049 00:50:01,150 --> 00:50:03,250 because they go overboard, basically, 1050 00:50:03,250 --> 00:50:06,970 explaining that, if you just rely on machine learning, sure, 1051 00:50:06,970 --> 00:50:09,040 you have a neural net which can approximate just 1052 00:50:09,040 --> 00:50:11,380 about any function, given that you 1053 00:50:11,380 --> 00:50:14,980 have an infinite amount of data and an infinitely large neural net. 1054 00:50:14,980 --> 00:50:18,250 And they say, infinity is kind of not a useful engineering 1055 00:50:18,250 --> 00:50:19,240 concept. 1056 00:50:19,240 --> 00:50:23,470 So they feel that the representations actually matter-- 1057 00:50:23,470 --> 00:50:25,030 I hope they still feel the same. 1058 00:50:25,030 --> 00:50:27,040 I haven't talked to them recently, but it 1059 00:50:27,040 --> 00:50:30,340 seems like there is some support for this notion, for what 1060 00:50:30,340 --> 00:50:31,794 I'm saying. 1061 00:50:31,794 --> 00:50:33,460 But of course, the problem with the features 1062 00:50:33,460 --> 00:50:38,480 is the following: the features are a bottleneck. 1063 00:50:38,480 --> 00:50:39,820 Whatever you strip off, 1064 00:50:39,820 --> 00:50:43,990 whatever you decide is not important, is lost forever. 1065 00:50:43,990 --> 00:50:46,000 You will never recover from it, right? 1066 00:50:46,000 --> 00:50:48,590 Because I'm asking for feature extraction. 1067 00:50:48,590 --> 00:50:52,270 I'm asking for this emulation of the human perception, which 1068 00:50:52,270 --> 00:50:55,360 strips out a lot of information, but I still 1069 00:50:55,360 --> 00:50:56,830 think that we need to do it if we 1070 00:50:56,830 --> 00:51:01,150 want to design useful engineering representations. 1071 00:51:01,150 --> 00:51:05,410 The other problem, of course, is whatever you leave in-- 1072 00:51:05,410 --> 00:51:09,520 the noise, the information which is not relevant to your task-- 1073 00:51:09,520 --> 00:51:11,950 you will have to deal with it later. 1074 00:51:11,950 --> 00:51:15,160 You will need to train the whole machine on that, 1075 00:51:15,160 --> 00:51:16,990 so you want to be very, very careful. 1076 00:51:16,990 --> 00:51:20,272 You are walking a thin line here. 1077 00:51:20,272 --> 00:51:21,730 What is it that I should leave out? 1078 00:51:21,730 --> 00:51:23,640 What is it that I should keep in? 1079 00:51:23,640 --> 00:51:27,390 It's always safer to keep a little bit more in, obviously. 1080 00:51:27,390 --> 00:51:30,710 But this is the goal which we have here. 1081 00:51:30,710 --> 00:51:33,130 And I wanted to say, features can be designed 1082 00:51:33,130 --> 00:51:35,000 using development data. 1083 00:51:35,000 --> 00:51:37,510 And when I'm saying use the development data, I mean: 1084 00:51:37,510 --> 00:51:39,940 design your features and use them. 1085 00:51:39,940 --> 00:51:42,650 Don't use this development data anymore. 1086 00:51:42,650 --> 00:51:45,550 We have a lot of data for the designing of good features. 1087 00:51:45,550 --> 00:51:47,897 And I think that, again, is happening in the field-- 1088 00:51:50,806 --> 00:51:51,306 good. 1089 00:51:54,230 --> 00:51:58,200 How speech recognition was done in the 20th century-- 1090 00:51:58,200 --> 00:52:02,840 this is what I know maybe the best, so we'll spend some time on it.
1091 00:52:02,840 --> 00:52:06,110 And it's still largely done this way-- 1092 00:52:06,110 --> 00:52:10,070 there are some variants of this recognition that are still done. 1093 00:52:10,070 --> 00:52:11,120 You take the signal. 1094 00:52:11,120 --> 00:52:13,700 And you derive the features. 1095 00:52:13,700 --> 00:52:15,620 In the first place, you derive what 1096 00:52:15,620 --> 00:52:17,810 are called short-term features, so you 1097 00:52:17,810 --> 00:52:20,000 take short segments of the signal, about 10 1098 00:52:20,000 --> 00:52:21,560 to 20 milliseconds. 1099 00:52:21,560 --> 00:52:25,040 And you derive some features from that. 1100 00:52:25,040 --> 00:52:26,670 That was the 20th century. 1101 00:52:26,670 --> 00:52:28,400 Now we are taking much longer segments, 1102 00:52:28,400 --> 00:52:29,930 but we'll get into that. 1103 00:52:29,930 --> 00:52:32,150 But you derive them at about 100 hertz, 1104 00:52:32,150 --> 00:52:35,060 sampling every 10 milliseconds, so you 1105 00:52:35,060 --> 00:52:39,520 turn the one-dimensional signal into a two-dimensional signal. 1106 00:52:39,520 --> 00:52:42,060 And here, typically, the first step is the frequency analysis, 1107 00:52:42,060 --> 00:52:45,420 so those may be-- imagine those are frequency vectors, 1108 00:52:45,420 --> 00:52:47,920 or something derived from frequency vectors, 1109 00:52:47,920 --> 00:52:49,770 cepstra or stuff like that. 1110 00:52:49,770 --> 00:52:53,130 Those are just tricks, signal processing tricks 1111 00:52:53,130 --> 00:52:54,295 which people use-- 1112 00:52:54,295 --> 00:52:57,300 but one-dimensional to two-dimensional. 1113 00:52:57,300 --> 00:53:01,700 The next thing is, you estimate the likelihood of the sounds every 1114 00:53:01,700 --> 00:53:03,430 10 milliseconds. 1115 00:53:03,430 --> 00:53:04,500 So here, what I-- 1116 00:53:04,500 --> 00:53:08,490 imagine that here we have different, say, speech sounds, 1117 00:53:08,490 --> 00:53:13,050 maybe 41 phonemes, maybe 3,000 context-dependent phonemes, 1118 00:53:13,050 --> 00:53:14,370 and so on, depends on-- 1119 00:53:14,370 --> 00:53:19,610 but those are parts of speech which make some sense. 1120 00:53:19,610 --> 00:53:23,190 And they come, typically, from phonetics theory. 1121 00:53:23,190 --> 00:53:25,700 And we know that you can generate 1122 00:53:25,700 --> 00:53:29,550 different words by putting phonemes together in different ways, 1123 00:53:29,550 --> 00:53:30,950 and so on, and so on. 1124 00:53:30,950 --> 00:53:33,200 So suppose for simplicity that 1125 00:53:33,200 --> 00:53:35,060 there are 41 phonemes. 1126 00:53:35,060 --> 00:53:38,960 And so if there is a red one, red 1127 00:53:38,960 --> 00:53:44,090 means a high posterior probability of the-- 1128 00:53:44,090 --> 00:53:45,440 actually, we need something more. 1129 00:53:45,440 --> 00:53:47,420 We need the likelihoods rather than posteriors, 1130 00:53:47,420 --> 00:53:52,690 so with the posteriors, we just divide them by the priors 1131 00:53:52,690 --> 00:53:57,470 to get the likelihoods, meaning that this phoneme has 1132 00:53:57,470 --> 00:53:59,480 a high likelihood and the white ones don't 1133 00:53:59,480 --> 00:54:02,700 have a high likelihood at this time. 1134 00:54:02,700 --> 00:54:07,980 So the next step is that you do the search on it. 1135 00:54:07,980 --> 00:54:09,140 This is a painful part. 1136 00:54:09,140 --> 00:54:10,910 And I won't be spending much time on that. 1137 00:54:10,910 --> 00:54:13,910 I just want to give you some flavor of this.
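[The divide-by-the-priors step just described is small enough to write out. A minimal sketch; the class counts and numbers are invented, and the constant p(frame) is ignored because it cancels in the search.]

```python
import numpy as np

def scaled_likelihoods(posteriors, priors):
    """Hybrid-recognizer trick: the net gives P(phone | frame), but the
    search wants p(frame | phone). By Bayes' rule they differ by the
    class prior (and a constant p(frame) that cancels in the search),
    so we divide the posteriors by the priors."""
    return posteriors / priors

# Toy example with 3 classes, one row per 10 ms frame:
posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.8, 0.1]])
priors = np.array([0.5, 0.3, 0.2])   # estimated from the training labels
print(scaled_likelihoods(posteriors, priors))
```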
1138 00:54:13,910 --> 00:54:19,790 You try to find the best path through this lattice 1139 00:54:19,790 --> 00:54:22,260 of the likelihoods. 1140 00:54:22,260 --> 00:54:24,930 And if you are lucky, the best path, then, 1141 00:54:24,930 --> 00:54:28,240 is going to represent your speech sounds. 1142 00:54:28,240 --> 00:54:32,090 So then the next thing is only that you look it up and transcribe it, 1143 00:54:32,090 --> 00:54:36,420 to go from the phonemic representation 1144 00:54:36,420 --> 00:54:39,720 into the lexical representation, basically, 1145 00:54:39,720 --> 00:54:41,910 because you know there is typically a one-to-one 1146 00:54:41,910 --> 00:54:43,456 relation-- 1147 00:54:43,456 --> 00:54:45,130 well, we should be careful with one-to-one, 1148 00:54:45,130 --> 00:54:51,810 but it is a known relation between the phonemes 1149 00:54:51,810 --> 00:54:54,250 and the transcription. 1150 00:54:54,250 --> 00:54:56,760 So we know what has been said. 1151 00:54:56,760 --> 00:54:58,878 So this is how the speech recognition is done. 1152 00:55:03,280 --> 00:55:05,320 Talking about this part, I mean, here we 1153 00:55:05,320 --> 00:55:09,070 have to deal with one major problem, which is 1154 00:55:09,070 --> 00:55:12,790 that the speech doesn't come out this way. 1155 00:55:12,790 --> 00:55:16,900 It doesn't come out as sequences of individual speech 1156 00:55:16,900 --> 00:55:20,860 sounds. Since I'm talking to you, I'm moving the mouth. 1157 00:55:20,860 --> 00:55:23,390 I'm moving the mouth continuously. 1158 00:55:23,390 --> 00:55:27,580 The first thing is that I can make certain sounds longer, 1159 00:55:27,580 --> 00:55:30,070 certain sounds shorter. 1160 00:55:30,070 --> 00:55:33,080 And then I add some noise to it. 1161 00:55:33,080 --> 00:55:37,600 Finally, because of what is called co-articulation, 1162 00:55:37,600 --> 00:55:43,660 each target phoneme gets spread in time, so you get a mess. 1163 00:55:43,660 --> 00:55:46,150 But people say-- sometimes, people 1164 00:55:46,150 --> 00:55:49,360 like to say, in speech recognition, this is our biggest problem. 1165 00:55:49,360 --> 00:55:52,670 I claim that this is not a problem. 1166 00:55:52,670 --> 00:55:53,690 It is a feature. 1167 00:55:53,690 --> 00:55:58,330 And the feature is important, because it comes in quite handy 1168 00:55:58,330 --> 00:55:58,930 later. 1169 00:55:58,930 --> 00:56:01,285 Hopefully, I will convince you about it. 1170 00:56:01,285 --> 00:56:05,570 But what we get is a mess, so this is not easy to recognize, 1171 00:56:05,570 --> 00:56:06,070 right? 1172 00:56:06,070 --> 00:56:07,111 We have co-articulations. 1173 00:56:07,111 --> 00:56:11,330 We have speaker dependencies, noise from the environment, 1174 00:56:11,330 --> 00:56:13,180 and so on, and so on. 1175 00:56:13,180 --> 00:56:15,730 So the way to deal with it is to recognize 1176 00:56:15,730 --> 00:56:18,970 that different people may sound different, 1177 00:56:18,970 --> 00:56:21,720 the communication environment may differ, 1178 00:56:21,720 --> 00:56:24,610 so the features will be dependent on a number 1179 00:56:24,610 --> 00:56:27,580 of things, on environmental problems, 1180 00:56:27,580 --> 00:56:29,920 on who is saying things, and so on. 1181 00:56:29,920 --> 00:56:32,970 People say the same things at different speeds. 1182 00:56:32,970 --> 00:56:34,990 I can speak faster, I can speak slower; 1183 00:56:34,990 --> 00:56:39,790 still, the message is the same.
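[The search for the best path through the lattice of likelihoods mentioned above is typically done with dynamic programming. A minimal Viterbi sketch; the transition matrix is an assumption standing in for whatever pronunciation and duration model a real system would use.]

```python
import numpy as np

def viterbi(log_likes, log_trans):
    """Best path through the lattice.

    log_likes: (n_frames, n_states) per-frame log-likelihoods.
    log_trans: (n_states, n_states) transition log-probabilities.
    Returns the best state sequence.
    """
    n_frames, n_states = log_likes.shape
    score = log_likes[0].copy()
    back = np.zeros((n_frames, n_states), dtype=int)
    for t in range(1, n_frames):
        cand = score[:, None] + log_trans      # score of each predecessor
        back[t] = cand.argmax(axis=0)          # best predecessor per state
        score = cand.max(axis=0) + log_likes[t]
    path = [int(score.argmax())]
    for t in range(n_frames - 1, 0, -1):       # trace the path back
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```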
1184 00:56:39,790 --> 00:56:46,060 So we use what is called the Hidden Markov Model, where 1185 00:56:46,060 --> 00:56:53,350 you try to find such a sequence of phonemes which optimizes 1186 00:56:53,350 --> 00:57:00,250 the conditional probability of the model, given the data. 1187 00:57:00,250 --> 00:57:02,830 And the models you generate on the fly, 1188 00:57:02,830 --> 00:57:05,999 as many models as possible, actually, an infinite number 1189 00:57:05,999 --> 00:57:07,540 of models, but, of course, again, you 1190 00:57:07,540 --> 00:57:10,810 can't do it infinitely, so you do it in some smart ways. 1191 00:57:10,810 --> 00:57:15,370 And this is being computed through a modified Bayes' rule. 1192 00:57:15,370 --> 00:57:18,760 Modified because, for one, I mean, 1193 00:57:18,760 --> 00:57:22,100 you would need the prior probability of the signal, 1194 00:57:22,100 --> 00:57:22,600 and so on. 1195 00:57:22,600 --> 00:57:23,840 We don't use that. 1196 00:57:23,840 --> 00:57:29,010 But also, what we are doing is we somehow arbitrarily scale 1197 00:57:29,010 --> 00:57:32,020 the thing which is called the language model, because this 1198 00:57:32,020 --> 00:57:35,650 is the prior probability of the particular utterance. 1199 00:57:35,650 --> 00:57:39,790 These are the likelihoods coming from the data; 1200 00:57:39,790 --> 00:57:43,960 combining these two things together, and finding the best 1201 00:57:43,960 --> 00:57:50,000 match, you get the output which best matches the data. 1202 00:57:50,000 --> 00:57:53,950 Model parameters are typically derived from the training data. 1203 00:57:53,950 --> 00:57:57,700 The problem is how to find the unknown utterance. 1204 00:57:57,700 --> 00:58:00,010 You don't know what is the form of the model. 1205 00:58:00,010 --> 00:58:03,410 And you don't know what is the data. 1206 00:58:03,410 --> 00:58:05,740 So we are dealing with what is called a doubly 1207 00:58:05,740 --> 00:58:09,140 stochastic model, a Hidden Markov Model. 1208 00:58:09,140 --> 00:58:13,620 Speech is a sequence-- it's a sequence of hidden states. 1209 00:58:13,620 --> 00:58:15,780 You don't see these hidden states. 1210 00:58:15,780 --> 00:58:20,626 And also, you don't know what comes from any state. 1211 00:58:20,626 --> 00:58:24,180 So somehow-- you don't know for sure in which state 1212 00:58:24,180 --> 00:58:24,960 you are in. 1213 00:58:24,960 --> 00:58:28,580 You don't know for sure what comes out, but you know that-- 1214 00:58:28,580 --> 00:58:30,540 well, you know, you assume that this 1215 00:58:30,540 --> 00:58:32,160 is how the speech looks. 1216 00:58:32,160 --> 00:58:34,950 So here I have a little picture. 1217 00:58:34,950 --> 00:58:37,080 I apologize for being trivial about this, 1218 00:58:37,080 --> 00:58:39,630 but imagine that you have a string of-- 1219 00:58:39,630 --> 00:58:40,800 a group of people. 1220 00:58:40,800 --> 00:58:43,670 Some are female, some are male. 1221 00:58:43,670 --> 00:58:46,740 There are groups of males, groups of females. 1222 00:58:46,740 --> 00:58:48,220 And each of them says something. 1223 00:58:48,220 --> 00:58:49,120 Say, hi. 1224 00:58:49,120 --> 00:58:50,370 And you can measure something. 1225 00:58:50,370 --> 00:58:52,260 This is the fundamental frequency. 1226 00:58:52,260 --> 00:58:56,594 You get some measurement out of that, but you don't see them. 1227 00:58:56,594 --> 00:58:59,170 But what you know is that they interleave, basically.
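[To pin down the modified Bayes' rule described above in symbols: a sketch, where $X$ is the observed signal, $W$ a candidate word (or phoneme) sequence, $p(X \mid W)$ the acoustic likelihood from the data, $P(W)$ the language-model prior, and $\gamma$ the somewhat arbitrary language-model scaling factor. The prior probability of the signal, $p(X)$, is dropped because it does not affect the maximization. The $\gamma$ notation is an assumption for illustration, not a symbol used in the talk.]

$$\hat{W} = \operatorname*{arg\,max}_{W} \; p(X \mid W)\, P(W)^{\gamma}$$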
1228 00:58:59,170 --> 00:59:01,420 For a while, there is a group of males, 1229 00:59:01,420 --> 00:59:04,780 then the speech switches to a group of females. 1230 00:59:04,780 --> 00:59:07,540 And then you stay for a while in the group of females, 1231 00:59:07,540 --> 00:59:08,650 and so on, and so on. 1232 00:59:08,650 --> 00:59:12,490 So basically, you know what is 1233 00:59:12,490 --> 00:59:14,680 the distribution of the fundamental frequency 1234 00:59:14,680 --> 00:59:16,380 for males, some distribution. 1235 00:59:16,380 --> 00:59:18,850 And you know what is the distribution of the fundamental frequency 1236 00:59:18,850 --> 00:59:20,860 for females. 1237 00:59:20,860 --> 00:59:24,790 You know what is the probability of the first group being male. 1238 00:59:24,790 --> 00:59:28,790 Subsequently, you also know what is the probability of the 1239 00:59:28,790 --> 00:59:30,275 [AUDIO OUT] 1240 00:59:34,235 --> 00:59:36,924 Because, to me, the features are the important part. As I 1241 00:59:36,924 --> 00:59:41,020 told you, we take out what we don't need, 1242 00:59:41,020 --> 00:59:44,476 but we don't want to take out stuff that you may need. 1243 00:59:47,927 --> 00:59:49,948 I told you that one important role 1244 00:59:49,948 --> 00:59:51,820 of the perception is to eliminate 1245 00:59:51,820 --> 00:59:53,692 some of this information. 1246 00:59:53,692 --> 00:59:57,804 Basically, that's to eliminate the irrelevant stuff 1247 00:59:57,804 --> 01:00:00,600 and focus on the relevant stuff. 1248 01:00:00,600 --> 01:00:05,250 So this is where I feel the properties of perception 1249 01:00:05,250 --> 01:00:09,680 can come in very strongly, because this is what emulates 1250 01:00:09,680 --> 01:00:12,500 this basic process of the speech, 1251 01:00:12,500 --> 01:00:15,826 of the extraction of information [INAUDIBLE]. 1252 01:00:15,826 --> 01:00:18,742 Especially what the Hidden Markov models assume, 1253 01:00:18,742 --> 01:00:22,180 that speech consists of sequences of sounds 1254 01:00:22,180 --> 01:00:24,895 and they can be produced at different speeds, 1255 01:00:24,895 --> 01:00:25,970 and other things. 1256 01:00:25,970 --> 01:00:27,010 It's important. 1257 01:00:27,010 --> 01:00:32,830 But here, we can use a lot of our knowledge in the features, 1258 01:00:32,830 --> 01:00:38,612 which can also be designed based on the data. 1259 01:00:38,612 --> 01:00:41,057 And what comes out, if it works, is probably going 1260 01:00:41,057 --> 01:00:42,640 to be relevant to speech perception, 1261 01:00:42,640 --> 01:00:48,924 so this is my point for how you can use your engineering 1262 01:00:48,924 --> 01:00:53,326 to verify our theories of speech perception. 1263 01:00:53,326 --> 01:00:57,240 We largely use, nowadays, neural 1264 01:00:57,240 --> 01:01:01,150 nets to derive the features. 1265 01:01:01,150 --> 01:01:04,450 So how we do it is that we sort of-- because we know 1266 01:01:04,450 --> 01:01:11,123 that the best set of features are the posteriors of the classes we want 1267 01:01:11,123 --> 01:01:15,175 to recognize, our speech sounds, maybe it's going to be useful. 1268 01:01:15,175 --> 01:01:16,945 If you do a good job, actually, you 1269 01:01:16,945 --> 01:01:19,840 can get the sounds reasonably well. 1270 01:01:19,840 --> 01:01:22,295 So you take a signal, you do some signal processing-- 1271 01:01:22,295 --> 01:01:25,860 and I will be talking about signal processing quite a lot.
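[The male/female story above is already a complete doubly stochastic model, so before moving on to the neural net part, here is a sketch of it with the forward algorithm, which sums over all the hidden state sequences. Every number is invented for illustration.]

```python
import numpy as np
from scipy.stats import norm

# Two hidden states: male and female speakers; each "hi" yields one
# fundamental-frequency measurement. All numbers are illustrative.
init = np.array([0.5, 0.5])            # probability the first group is male/female
trans = np.array([[0.8, 0.2],          # groups interleave: mostly stay,
                  [0.2, 0.8]])         # sometimes switch
emit = [norm(120, 20), norm(210, 30)]  # F0 distributions (Hz): male, female

def forward(observations):
    """Total likelihood of the F0 sequence under the doubly
    stochastic model (you never see who is speaking)."""
    alpha = init * [d.pdf(observations[0]) for d in emit]
    for f0 in observations[1:]:
        alpha = (alpha @ trans) * [d.pdf(f0) for d in emit]
    return alpha.sum()

print(forward([115, 130, 125, 220, 205]))   # a male run, then a female run
```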
1272 01:01:25,860 --> 01:01:30,625 But then it goes into a neural net, nowadays a deep neural net, 1273 01:01:30,625 --> 01:01:34,150 and you estimate posterior probabilities of different speech sounds. 1274 01:01:34,150 --> 01:01:36,350 And then whatever comes out 1275 01:01:36,350 --> 01:01:39,175 with the high posterior probability is the phoneme, 1276 01:01:39,175 --> 01:01:44,540 so you get [INAUDIBLE] sequence of the phonemes. 1277 01:01:48,150 --> 01:01:50,360 As the classes, you can use directly context- 1278 01:01:50,360 --> 01:01:55,020 independent phonemes, in this example, a small number. 1279 01:01:55,020 --> 01:01:57,880 You can use context-dependent phonemes, which 1280 01:01:57,880 --> 01:02:01,180 are used quite a lot, because they try to account 1281 01:02:01,180 --> 01:02:03,450 for the fact that how a phoneme is produced 1282 01:02:03,450 --> 01:02:07,052 depends on what happens in the neighborhood, 1283 01:02:07,052 --> 01:02:09,680 [INAUDIBLE] 1284 01:02:13,150 --> 01:02:17,390 These posteriors can be directly used in the search. 1285 01:02:17,390 --> 01:02:22,949 This is the search through the lattice of the likelihoods 1286 01:02:22,949 --> 01:02:24,280 in recognition. 1287 01:02:24,280 --> 01:02:26,236 And again, I mean, it's coming back. 1288 01:02:26,236 --> 01:02:29,750 This was the late 1990s, but this is the way 1289 01:02:29,750 --> 01:02:32,500 that most of these recognizers work. 1290 01:02:32,500 --> 01:02:35,325 This is the major way now how you do this recognition. 1291 01:02:35,325 --> 01:02:37,820 There's another way, which is called bottleneck or tandem-- 1292 01:02:37,820 --> 01:02:40,420 we were involved in that too-- 1293 01:02:40,420 --> 01:02:43,720 which was a way to make the neural nets friendly to people 1294 01:02:43,720 --> 01:02:48,280 who were used to the old generative HMM models, 1295 01:02:48,280 --> 01:02:50,440 because you basically convert 1296 01:02:50,440 --> 01:02:52,690 your outputs from the posteriors 1297 01:02:52,690 --> 01:02:56,520 into some features which your generative HMM 1298 01:02:56,520 --> 01:02:59,050 model would like. 1299 01:02:59,050 --> 01:03:01,270 What you did was you decorrelated them, 1300 01:03:01,270 --> 01:03:05,410 you Gaussianized them so that they have a normal distribution, 1301 01:03:05,410 --> 01:03:07,070 and used them as features. 1302 01:03:07,070 --> 01:03:11,350 And the bottom line is, if you get good posteriors, 1303 01:03:11,350 --> 01:03:13,110 you will get good features. 1304 01:03:13,110 --> 01:03:14,700 And we know how to use them. 1305 01:03:14,700 --> 01:03:17,910 And this is pretty much the mainstream now.
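[A sketch of that tandem trick: take the per-frame posteriors, make them more Gaussian with a log, and decorrelate them with PCA before handing them to a conventional generative HMM system. The dimensions and the Dirichlet fake posteriors are illustrative assumptions, not details from the talk.]

```python
import numpy as np

def tandem_features(posteriors, n_keep=13):
    """Tandem feature sketch: log-warp the phoneme posteriors toward
    a Gaussian shape, then decorrelate with PCA and keep the top
    components as features for a generative HMM system."""
    logp = np.log(posteriors + 1e-10)            # log warps toward Gaussian
    logp = logp - logp.mean(axis=0)              # zero-mean per dimension
    cov = np.cov(logp, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_keep]   # top principal components
    return logp @ eigvecs[:, order]              # decorrelated features

frames = np.random.dirichlet(np.ones(41), size=500)  # fake per-frame posteriors
print(tandem_features(frames).shape)                 # (500, 13)
```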