The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JOSH MCDERMOTT: We're going to get started again. Where we stopped, I had just played you some of the results of this texture synthesis algorithm, and we all agreed that they sounded pretty realistic. The whole point of this was that it gives plausibility to the notion that you could be representing these textures with the sorts of statistics you can compute from a model of what we think encapsulates the signal processing in the early auditory system. And, again, I'll just underscore that the cool thing about doing the synthesis is that there's an infinite number of ways in which it can fail.
By listening to it and convincing yourself that those things actually sound pretty realistic, you get a pretty powerful sense that the representation is capturing most of what you hear when you listen to the natural sound. For instance, we could design a classification algorithm that could discriminate between all these different things, but the representation could still fail to capture all kinds of things that you would hear. By synthesizing, because the synthesis can potentially fail in any of the possible ways, and then listening to observe whether a failure occurs, you get a pretty powerful method.

But one thing you might be concerned about, and this is something that was annoying me, is that what we've done here is impose a whole bunch of statistical constraints. We're measuring this really large set of statistics from the model, and then generating things that have the same values of those statistics.
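That measure-then-impose logic can be sketched in a toy form. This is an illustrative simplification, not the actual synthesis algorithm described in the lecture: it imposes just one statistic, the amplitude histogram, on noise by rank-matching, whereas the real procedure imposes many statistics of sub-band envelopes. The function name is my own.

```python
import numpy as np

def impose_histogram(noise, target):
    # Rank-match: give the noise exactly the amplitude distribution
    # (one simple "statistic") of the target, while keeping the
    # noise's own temporal arrangement.
    out = np.empty_like(noise)
    out[np.argsort(noise)] = np.sort(target)
    return out

rng = np.random.default_rng(0)
target = rng.standard_normal(5000) ** 3        # stand-in "recording" (skewed values)
synth = impose_histogram(rng.standard_normal(5000), target)
```

The synthetic signal shares the measured statistic exactly, yet is a different signal, which is the sense in which synthesis generates new members of a statistically defined class.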
So there's this question of whether any set of statistics will do. We wondered what would happen if we measured statistics from a model that deviates from what we know about the biology of the ear. In particular, you'll remember that the model we set out had a bunch of different stages: an initial stage of bandpass filtering, the process of extracting the envelope and then applying amplitude compression, and then modulation filtering. In each of these cases, there are particular characteristics of the signal processing that are explicitly intended to mimic what we see in biology. In particular, as we noted, the kinds of filter banks that you see in biological systems are better approximated by something that's logarithmically spaced than by something that's linearly spaced. Remember that picture I showed at the start, where we saw that the filters up here were a lot broader than the filters down here. So we can ask: what happens if we swap in a filter bank that's linearly spaced?
That's more closely analogous to an FFT, for instance. Similarly, we can ask what happens if we get rid of the nonlinear function that's applied to the amplitude envelope and make the amplitude response linear instead. And so we did this. You can change the auditory model and play the exact same game: measure statistics from that model, synthesize something from those statistics, and then ask whether the results sound any different.

So we did an experiment. We would play people the original sound, and from that original sound, two synthetic versions: one generated from the statistics of the model that replicates biology as best we know how, and the other from a model that is altered in some way. And we would ask people which of the two synthetic versions sounds more realistic. There are four conditions in this experiment, because we could alter the model in three different ways. We could get rid of amplitude compression (that's the first bar). We could make the cochlear filters linearly spaced. Or we could make the modulation filters linearly spaced.
Or we could do all three, and that's the last condition. What's plotted on this axis (whoops, I gave it away) is the proportion of trials on which people said that the synthesis from the biologically plausible model was more realistic. So if it didn't matter what statistics you use, you should be right at this 50% mark in each of these cases. And as you can see, in every case people report, on average, that the synthesis from the biologically plausible model is more realistic.

I'll give you a couple of examples. Here's crowd noise synthesized from the biologically plausible auditory model.

[CROWD NOISE]

And here's the result of doing the exact same thing but from the altered model. This is from the condition where everything is different, and you'll hear that it just kind of sounds weird.

[CROWD NOISE]

It's kind of garbled in some way. Here's a helicopter synthesized from the biologically plausible model.

[HELICOPTER NOISE]

And here's the one from the altered model.
[HELICOPTER NOISE]

It doesn't sound like the modulations are quite as precise.

So the notion here is this: we're initializing the procedure with noise, so the output is a different sound in every case, sharing only the statistical properties. The statistics that we measure and use to do the synthesis define a class of sounds that includes the original as well as a whole bunch of others, and when you run the synthesis, you're generating one of these other examples. The notion is that if the statistics are measuring what the brain is measuring, then these examples ought to sound like another example of the original sound; you ought to be generating an equivalence class. And the idea is that when you synthesize from statistics of the non-biological model, it's a different set. Again, it's defined by the original, but it contains different things, and they don't sound like the original, presumably because the set is not defined by the measurements that the brain is making.
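The model variants being compared here, log- versus linearly spaced filters and compressive versus linear envelopes, can be sketched roughly as follows. This is a toy stand-in, not the lecture's actual implementation: the FFT-mask bandpass filter, the rectify-and-smooth envelope, and the 0.3 compression exponent are all illustrative choices of mine.

```python
import numpy as np

def center_freqs(n, lo, hi, spacing="log"):
    # Filter-bank center frequencies: logarithmic spacing (roughly
    # cochlea-like) vs. linear spacing (the "FFT-like" variant).
    if spacing == "log":
        return np.geomspace(lo, hi, n)
    return np.linspace(lo, hi, n)

def subband_envelope(x, fc, bw, fs, compress=True):
    # One stage of the toy model: crude bandpass by zeroing FFT bins,
    # envelope by rectify-and-smooth, then power-law compression
    # (exponent 0.3 as a stand-in for cochlear compression).
    f = np.fft.rfftfreq(len(x), 1 / fs)
    X = np.fft.rfft(x)
    X[(f < fc - bw / 2) | (f > fc + bw / 2)] = 0
    band = np.fft.irfft(X, len(x))
    env = np.abs(band)
    k = max(1, int(fs / fc))            # smooth over ~one carrier period
    env = np.convolve(env, np.ones(k) / k, mode="same")
    return env ** 0.3 if compress else env
```

Swapping `spacing="linear"` or `compress=False` gives the altered models whose statistics were used for the comparison syntheses.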
I just mentioned that the procedure will generate a different signal in each case. Here you can see the result of synthesizing from the statistics of a particular recording of waves; these are three different examples. If you inspect them, you can see that they're all different: they have peaks in amplitude in different places, and so on. But on the other hand, they all look the same in the sense that they have the same textural properties. And that's what's supposed to happen.

The fact that you have all of these different signals with the same statistical properties raises an interesting possibility: if the brain is just representing time-averaged statistics, we would predict that different exemplars of a texture ought to be difficult to discriminate. And that's what I'll show you next, an experiment that attempts to test whether this is the case, to test whether you really are representing these textures with statistics that summarize their properties by averaging over time.
In doing so, we're going to take advantage of a really simple statistical phenomenon: statistics measured from small samples are more variable than statistics measured from large samples. That's what's exemplified by the graph here on the bottom. What this graph is plotting is the result of an exercise where we took multiple excerpts of a given texture of a particular duration: 40 milliseconds, 80, 160, 320. We get a whole bunch of different excerpts of that length, and then we measure a particular statistic from each excerpt; in this case, it's a particular cross-correlation coefficient for the envelopes of a pair of sub-bands. We measure that statistic in the different excerpts, and then we look at how variable it is across excerpts. That's summarized with the standard deviation of the statistic, which is what's plotted here on the y-axis. And the point is that when the excerpts are short, the statistics are variable.
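This small-sample effect is easy to demonstrate numerically. A toy sketch: the two correlated noise channels below stand in for the pair of sub-band envelopes just described, and the durations and sampling rate are made-up numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 10000                              # samples per second (arbitrary)

def stat_std(duration_s, n_excerpts=200):
    # Standard deviation, across excerpts of a given duration, of a
    # simple statistic: the correlation between two channels that
    # share a common component (true correlation 0.5).
    n = int(duration_s * fs)
    vals = []
    for _ in range(n_excerpts):
        common = rng.standard_normal(n)
        a = common + rng.standard_normal(n)
        b = common + rng.standard_normal(n)
        vals.append(np.corrcoef(a, b)[0, 1])
    return np.std(vals)
```

Measured from 40 ms excerpts the statistic scatters much more than from 320 ms excerpts, mirroring the downward-sloping curve on the graph.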
You measure the statistic in one excerpt and then another and then another, and you don't get the same thing, so the standard deviation is high. As the excerpt duration increases, the statistics become more consistent. They converge to the true values of the underlying stationary process, and so the standard deviation shrinks. We're going to take advantage of this in the experiments that we'll do.

First, to give plausibility to the notion that people might be able to base judgments on long-term statistics, we asked people to discriminate different textures, that is, things that have different long-term statistics. In the experiment, people would hear three sounds, one of which would be from a particular texture, like rain, and two others which would be different examples of a different texture, like a stream. So you'd hear rain, stream one, stream two. And the task was to say which sound was produced by a different source.
In this case, the answer would be the first one. So we gave people this task, and we manipulated the duration of the excerpts. The notion here, given that graph, is that the statistics are very variable for short excerpts and become more consistent as the excerpt duration gets longer. So if you're basing your judgments on statistics computed across the excerpt, you ought to get better at telling whether the statistics are the same or different as the excerpt duration gets longer. What we're going to plot here is the proportion correct on this task as a function of the excerpt duration. And, indeed, we see that people get better as the duration gets longer. They're not very good when you give them a really short clip, but they get better and better as the duration increases. Now, of course, this is not a particularly exciting result. When you increase the duration, you give people more information.
And on pretty much any story, people ought to be getting better. But it's at least consistent with the notion that you might be basing your judgments on statistics.

The really critical experiment is the next one. In this experiment, we gave people different excerpts of the same texture and asked them to discriminate them. Again, on each trial you hear three sounds, but they're all excerpts from the same texture, and two of them are identical. In this case, the last two are physically identical excerpts of, for instance, rain, and the first one is a different excerpt of rain. You just have to say which one is different from the other two. Now, the null hypothesis here is maybe what you might expect if you gave this task to a computer algorithm that was just limited by sensor noise: as the excerpt duration gets longer, you're giving people more information with which to tell that this one is different from that one. So maybe if you listened to just the beginning, it would be hard, but as you got more information, it would get easier and easier.
If, in contrast, you think that what people represent when they hear these sounds are statistics that summarize the properties over time, well, I've just shown you how the statistics converge to fixed values as the duration increases. So if what people are representing are those statistics, you might, paradoxically, think that as the duration increases, they would get worse at this task.

And that's, in fact, what we find happens. People are good at this task when the excerpts are very short, on the order of 100 milliseconds; they can very easily tell you which of the excerpts is different. And then as the duration gets longer and longer, they get progressively worse and worse. So we think this is consistent with the idea that when you are hearing a texture, once the texture is a couple of seconds long, you're predominantly representing its statistical properties, averaging the properties over time.
And you lose access to the details that differentiate different examples of rain, the exact positions of the raindrops or the clicks of the fire, what have you.

Why should people be unable to discriminate two examples of rain? Well, you might think these textures are just homogeneous, that there's just not enough stuff there to differentiate them. We know that's not true, because if you chop out a little section at random, people can very easily tell you whether it's the same or different. So at a local time scale, the details are very easily discriminable. You might also imagine that what's happening over time is some kind of masking, or that the representation gets blurred together in some strange way. On the other hand, when you give people sounds that have different statistics, you find that they're just great: they get better and better as the stimulus duration increases.
In fact, the fact that they continue to get better seems to indicate that the detail streaming into your ears is being accrued into some representation that you have access to. So what we think is happening is that those details come in and are incorporated into your statistical estimates, but the fact that you can't tell apart these different excerpts means that the details are not otherwise retained. They're accrued into statistics, and then you lose access to the details on their own. The point is that the result as it stands, I think, provides evidence for a representation of time-averaged statistics: when the statistics are different, you can tell things are distinct; when they're the same, you can't. And it relates to this phenomenon of the variability of statistics as a function of sample size.

So, a couple of control experiments that are probably not exactly addressing the question you just raised, but maybe are related.
One obvious possibility is that the reason people are good at the exemplar discrimination when the excerpts are short and bad when they're long might be that your memory is decaying with time. The way we did this experiment originally, there was a fixed inter-stimulus interval; it was the same couple of hundred milliseconds in every case. So to tell that one excerpt is different from another, the bits that you would have to compare are separated by a shorter time interval in the short condition than in the long condition, and if you imagine that memory just decays with time, you might think that would make people worse. So we did a control experiment where we equated the inter-onset interval, so that the elapsed time between the stuff you would have to compare in order to tell whether something was different was the same in the two cases. And that basically makes no difference: you're still a lot better when the excerpts are short than when they're long.
And we went to pretty great lengths to try to help people do this with the long excerpts. You might also wonder: given that you can do this with the short excerpts, and the short excerpts are really just analogous to the very beginning of these longer excerpts, why can't you just listen to the beginning? So we tried to help people do just that. In this condition, we put a little gap between the very beginning of the excerpt and the rest of it, and we told people there's going to be this little segment at the start, just listen for that. And people can't do it. We also did it with the gap at the end, so again you get this little segment of the same length as in the short condition. And here's performance: people are good when it's short and a lot worse when it's longer, and the presence of a gap doesn't really seem to make a difference. So you have great trouble accessing these things.
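The converging-statistics account of this result can be illustrated with a toy observer. Suppose the observer retains only a few time-averaged statistics of each excerpt and calls excerpts "different" when those summaries differ by more than some fixed internal noise. The particular moments and numbers below are my own stand-ins, not the statistics from the lecture's model.

```python
import numpy as np

rng = np.random.default_rng(1)

def summary(x):
    # The toy observer's entire representation of an excerpt:
    # a few time-averaged moments.
    return np.array([np.mean(np.abs(x)), np.std(x), np.mean(x ** 3)])

def mean_summary_distance(n_samples, trials=300):
    # Average distance between the summaries of two different
    # excerpts drawn from the same "texture" process.
    d = []
    for _ in range(trials):
        a = rng.standard_normal(n_samples)
        b = rng.standard_normal(n_samples)
        d.append(np.linalg.norm(summary(a) - summary(b)))
    return np.mean(d)
```

As excerpt length grows, both summaries converge to the process's true values, the distance between them shrinks toward the internal noise floor, and exemplar discrimination gets harder, while two excerpts from processes with different long-term statistics stay separable.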
Another thing that's relevant and related is a set of experiments that resulted from our thinking about the fact that textures are normally generated not by our synthesis algorithm, but by the superposition of lots of different sources. We wondered what would happen to this phenomenon if we varied the number of sources in a texture. So we actually generated textures by superimposing different numbers of sources. In one case we did this with speakers. We wanted to get rid of linguistic effects, so we used German speech and listeners who didn't speak German; it's like a German cocktail party that we're going to generate. So we have one person, like this.

[FEMALE VOICE 1] [SPEAKING GERMAN]

And then 29.

[GROUP VOICE] [SPEAKING GERMAN]

A room full of people speaking German. And we do the exact same experiment, where we give people different exemplars of these textures and ask them to discriminate between them. What's plotted here is the proportion correct as a function of duration.
423 00:16:04,270 --> 00:16:07,186 Here, we've reduced it to just two durations-- short and long. 424 00:16:07,186 --> 00:16:08,560 And there's four different curves 425 00:16:08,560 --> 00:16:10,780 corresponding to different numbers of speakers 426 00:16:10,780 --> 00:16:11,800 in that signal, right. 427 00:16:11,800 --> 00:16:14,560 So the cyan here is what happens with a single speaker. 428 00:16:14,560 --> 00:16:16,310 And so with a single speaker, you actually 429 00:16:16,310 --> 00:16:18,490 get better at doing this as the duration increases. 430 00:16:18,490 --> 00:16:20,370 All right, and so that's, again, consistent 431 00:16:20,370 --> 00:16:22,870 with the null hypothesis that when there's more information, 432 00:16:22,870 --> 00:16:24,286 you're actually going to be better 433 00:16:24,286 --> 00:16:27,190 able to say whether something is the same or different. 434 00:16:27,190 --> 00:16:30,130 But as you increase the number of people at the cocktail 435 00:16:30,130 --> 00:16:30,730 party-- 436 00:16:30,730 --> 00:16:32,540 the density of the signal in some sense-- 437 00:16:32,540 --> 00:16:34,665 you can see that performance for the short excerpts 438 00:16:34,665 --> 00:16:35,590 doesn't really change. 439 00:16:35,590 --> 00:16:37,720 So you retain the ability to say whether these things are 440 00:16:37,720 --> 00:16:38,800 the same or different. 441 00:16:38,800 --> 00:16:40,510 But there's this huge interaction. 442 00:16:40,510 --> 00:16:44,630 And for the long excerpts, you get kind of worse and worse. 443 00:16:44,630 --> 00:16:46,240 So impairment at long durations is 444 00:16:46,240 --> 00:16:48,280 really specific to textures-- doesn't seem to be 445 00:16:48,280 --> 00:16:49,840 present for single sources. 446 00:16:49,840 --> 00:16:52,423 To make sure that phenomenon is not really specific to speech, 447 00:16:52,423 --> 00:16:55,780 we did the exact same thing with synthetic drum hits. 
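Both the multi-talker and drum-hit versions of this manipulation come down to superimposing independently generated events at a chosen rate. Here is an illustrative sketch of how drum-hit textures of different densities could be generated; the decaying noise burst stands in for an actual drum sample, and none of this is the stimulus code actually used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(1)
sr = 16000  # sample rate in Hz (arbitrary choice)

def drum_texture(rate_hz, dur_s=2.0):
    """Random 'drum hits': impulses at Poisson-distributed times,
    convolved with a short decaying noise burst that stands in for
    a real drum sample."""
    n = int(sr * dur_s)
    n_hits = rng.poisson(rate_hz * dur_s)
    impulses = np.zeros(n)
    impulses[rng.integers(0, n, size=n_hits)] = 1.0
    # 50 ms burst with a 10 ms exponential decay.
    t = np.arange(int(0.05 * sr)) / sr
    hit = rng.standard_normal(t.size) * np.exp(-t / 0.01)
    return np.convolve(impulses, hit)[:n]

sparse = drum_texture(5)    # roughly 5 hits per second
dense = drum_texture(50)    # roughly 50 hits per second
```

Raising `rate_hz` is the density manipulation: the waveform operations stay identical, but the result shifts from a sparse sequence of discrete events to a texture-like signal.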
448 00:16:55,780 --> 00:16:59,830 So we just varied the density of a bunch of random drum hits. 449 00:16:59,830 --> 00:17:01,840 Like, here's five hits per second. 450 00:17:01,840 --> 00:17:05,230 [DRUM SOUNDS] 451 00:17:05,230 --> 00:17:05,900 Here's 50. 452 00:17:05,900 --> 00:17:09,588 [DRUM SOUNDS] 453 00:17:09,588 --> 00:17:12,319 All right, and you see the exact same phenomenon. 454 00:17:12,319 --> 00:17:14,650 So for the very sparsest case, you 455 00:17:14,650 --> 00:17:17,619 get better as you go from the short excerpts to the long. 456 00:17:17,619 --> 00:17:19,060 But then as the density increases, 457 00:17:19,060 --> 00:17:20,740 you see this big interaction. 458 00:17:20,740 --> 00:17:22,270 And you get selectively worse here 459 00:17:22,270 --> 00:17:25,099 for the long duration case. 460 00:17:25,099 --> 00:17:28,600 OK, so, again, it's worth pointing out 461 00:17:28,600 --> 00:17:31,360 that the high performance with the short excerpts 462 00:17:31,360 --> 00:17:33,819 indicates that all the stimuli have discriminable variation. 463 00:17:33,819 --> 00:17:35,776 So it's not the case that these things are just 464 00:17:35,776 --> 00:17:37,000 like totally homogeneous, 465 00:17:37,000 --> 00:17:38,970 and that's why you can't do it. 466 00:17:38,970 --> 00:17:41,680 It seems to be a specific problem with retaining 467 00:17:41,680 --> 00:17:47,470 temporal detail when the signals are both long and texture-like. 468 00:17:47,470 --> 00:17:49,130 OK, so what does this mean? 469 00:17:49,130 --> 00:17:49,960 Well, go ahead. 470 00:17:49,960 --> 00:17:50,910 Here's the speculative framework. 471 00:17:50,910 --> 00:17:52,600 And this sort of gets back to these questions 472 00:17:52,600 --> 00:17:54,058 about working memory, and so forth. 473 00:17:54,058 --> 00:17:57,250 And so this is the way that I make sense of this stuff. 
474 00:17:57,250 --> 00:18:01,030 And each one of these things is pure speculation or almost 475 00:18:01,030 --> 00:18:01,810 pure speculation. 476 00:18:01,810 --> 00:18:03,518 But I actually think you need all of them 477 00:18:03,518 --> 00:18:05,380 to really totally make sense of the results. 478 00:18:05,380 --> 00:18:07,088 It's at least interesting to think about. 479 00:18:07,088 --> 00:18:10,690 So I think it's plausible that sounds 480 00:18:10,690 --> 00:18:14,800 are encoded both as sequences of features and with statistics 481 00:18:14,800 --> 00:18:17,450 that average information over time. 482 00:18:17,450 --> 00:18:21,220 And I think that the features with which we encode things 483 00:18:21,220 --> 00:18:25,210 are engineered to be sparse for typical natural sound sources. 484 00:18:25,210 --> 00:18:27,790 But they end up being dense for textures. 485 00:18:27,790 --> 00:18:29,560 So the signal comes in-- you're trying 486 00:18:29,560 --> 00:18:31,809 to model that with a whole bunch of different features 487 00:18:31,809 --> 00:18:34,070 that are in some dictionary you have in your head. 488 00:18:34,070 --> 00:18:36,470 And for a signal like speech, your dictionary features 489 00:18:36,470 --> 00:18:38,470 include things that might be related to phonemes 490 00:18:38,470 --> 00:18:39,100 and so forth. 491 00:18:39,100 --> 00:18:41,320 And so for like a single person talking, 492 00:18:41,320 --> 00:18:42,859 you end up with this representation 493 00:18:42,859 --> 00:18:43,900 that's relatively sparse. 494 00:18:43,900 --> 00:18:46,180 It's got sort of a small number of feature activations. 495 00:18:46,180 --> 00:18:47,380 But when you get a texture, in order 496 00:18:47,380 --> 00:18:48,859 to actually model that signal, you 497 00:18:48,859 --> 00:18:51,025 need lots and lots and lots of feature coefficients, 498 00:18:51,025 --> 00:18:53,590 all right, in order to actually model the signal. 
499 00:18:53,590 --> 00:18:56,440 And my hypothesis would be that memory capacity 500 00:18:56,440 --> 00:18:59,665 places limits on the number of features that can be retained. 501 00:18:59,665 --> 00:19:01,300 All right, so it's not really related 502 00:19:01,300 --> 00:19:03,940 to the duration of signal that you can encode, per se. 503 00:19:03,940 --> 00:19:05,410 It's a limit on the number of coefficients 504 00:19:05,410 --> 00:19:09,490 that you can retain to encode that signal. 505 00:19:09,490 --> 00:19:11,620 And the additional thing I would hypothesize 506 00:19:11,620 --> 00:19:13,990 is that sound is continuously and, this is critical, 507 00:19:13,990 --> 00:19:15,674 obligatorily encoded. 508 00:19:15,674 --> 00:19:17,590 All right, so this stuff comes into your ears. 509 00:19:17,590 --> 00:19:18,964 You're continuously projecting it 510 00:19:18,964 --> 00:19:21,790 onto this dictionary of features that you have-- all right. 511 00:19:21,790 --> 00:19:23,459 And you've got some memory buffer 512 00:19:23,459 --> 00:19:26,000 within which you can hang onto some number of those features. 513 00:19:26,000 --> 00:19:29,020 But then once the memory buffer gets exceeded, 514 00:19:29,020 --> 00:19:30,100 it gets overwritten. 515 00:19:30,100 --> 00:19:33,680 And so you just lose all the stuff that came before. 516 00:19:33,680 --> 00:19:36,070 So when your memory capacity for these feature sequences 517 00:19:36,070 --> 00:19:38,034 is reached, the memory is overwritten 518 00:19:38,034 --> 00:19:38,950 by the incoming sound. 519 00:19:38,950 --> 00:19:42,650 And the only thing you're left with are these statistics. 520 00:19:42,650 --> 00:19:45,610 So I'll give you one last experiment in the texture 521 00:19:45,610 --> 00:19:48,260 domain, and then we'll move on. 
522 00:19:48,260 --> 00:19:51,580 So this is an experiment where we presented people 523 00:19:51,580 --> 00:19:55,240 with an original recording, and then the synthetic version 524 00:19:55,240 --> 00:19:57,520 that we generated from the synthesis algorithm. 525 00:19:57,520 --> 00:19:59,290 And we just ask them to rate the realism 526 00:19:59,290 --> 00:20:00,565 of the synthetic example. 527 00:20:00,565 --> 00:20:03,190 And so this is just a summary of the results of that experiment 528 00:20:03,190 --> 00:20:07,432 where we did this for 170 different sounds. 529 00:20:07,432 --> 00:20:09,640 And this is a histogram of the average realism rating 530 00:20:09,640 --> 00:20:10,990 for each of those 170 sounds. 531 00:20:10,990 --> 00:20:13,406 And there's just two points to take away from this, right. 532 00:20:13,406 --> 00:20:15,744 The first is that there's a big peak up here. 533 00:20:15,744 --> 00:20:17,910 So they rated the realism on a scale of 1 to 7. 534 00:20:17,910 --> 00:20:19,630 And so the big peak, centered 535 00:20:19,630 --> 00:20:22,540 at about 6, means that the synthesis is working pretty well 536 00:20:22,540 --> 00:20:23,745 most of the time. 537 00:20:23,745 --> 00:20:25,372 And that's sort of encouraging. 538 00:20:25,372 --> 00:20:27,080 But there's this other interesting thing, 539 00:20:27,080 --> 00:20:29,329 which is that there's this long tail down here, right. 540 00:20:29,329 --> 00:20:32,560 And what this means is that people are telling us 541 00:20:32,560 --> 00:20:35,350 that this synthetic signal that is statistically 542 00:20:35,350 --> 00:20:37,540 matched to this original recording 543 00:20:37,540 --> 00:20:39,320 doesn't sound anything like it. 544 00:20:39,320 --> 00:20:41,830 And that's really interesting because it's statistically 545 00:20:41,830 --> 00:20:42,830 matched to the original. 546 00:20:42,830 --> 00:20:45,205 So it's matched in all these different dimensions, right. 
547 00:20:45,205 --> 00:20:47,746 And, yet, there's still things that are perceptually missing. 548 00:20:47,746 --> 00:20:49,390 And that tells us that there are things 549 00:20:49,390 --> 00:20:52,210 that are important to the brain that are not in our model. 550 00:20:52,210 --> 00:20:54,910 This is a list of the 15 or so sounds that 551 00:20:54,910 --> 00:20:56,460 got the lowest realism ratings. 552 00:20:56,460 --> 00:20:59,912 And just to make things easy on you, 553 00:20:59,912 --> 00:21:01,120 I'll put labels next to them. 554 00:21:01,120 --> 00:21:03,161 Because by and large, they tend to fall into sort 555 00:21:03,161 --> 00:21:04,900 of three different categories-- 556 00:21:04,900 --> 00:21:08,170 sounds that have some sort of pitch in them. 557 00:21:08,170 --> 00:21:10,630 Sounds that have some kind of rhythmic structure. 558 00:21:10,630 --> 00:21:12,160 And sounds that have reverberation. 559 00:21:12,160 --> 00:21:13,660 And I'll play you these examples, 560 00:21:13,660 --> 00:21:18,570 because they're really kind of spectacular failures. 561 00:21:18,570 --> 00:21:21,070 Here, I'll play the original version and then the synthetic. 562 00:21:21,070 --> 00:21:24,297 [RAILROAD CROSSING SOUNDS] 563 00:21:26,529 --> 00:21:27,570 And here's the synthetic. 564 00:21:27,570 --> 00:21:29,190 I'm just warning you-- it's bad. 565 00:21:29,190 --> 00:21:32,655 [SYNTHETIC RAILROAD CROSSING SOUNDS] 566 00:21:35,630 --> 00:21:38,360 Here's the tapping rhythm-- really simple but-- 567 00:21:38,360 --> 00:21:41,811 [TAPPING RHYTHM SOUNDS] 568 00:21:43,290 --> 00:21:44,490 And the synthetic version. 569 00:21:44,490 --> 00:21:47,934 [SYNTHETIC TAPPING RHYTHM SOUNDS] 570 00:21:49,910 --> 00:21:52,420 All right. 571 00:21:52,420 --> 00:21:54,040 This is what happens if you-- 572 00:21:54,040 --> 00:21:55,780 well, this is not going to work very well because we're 573 00:21:55,780 --> 00:21:56,290 in an auditorium. 
574 00:21:56,290 --> 00:21:57,370 But I'll try it anyways. 575 00:21:57,370 --> 00:21:59,140 This is a recording of somebody running up a stairwell that's 576 00:21:59,140 --> 00:21:59,931 pretty reverberant. 577 00:21:59,931 --> 00:22:03,325 [STAIR STEP SOUNDS] 578 00:22:05,094 --> 00:22:06,510 And here's the synthetic version. 579 00:22:06,510 --> 00:22:07,800 And it's almost as though the echoes 580 00:22:07,800 --> 00:22:09,758 don't get put in the right place, or something. 581 00:22:09,758 --> 00:22:12,687 [SYNTHETIC STAIR STEP SOUNDS] 582 00:22:15,150 --> 00:22:17,970 And it would sound even worse if this were not an auditorium. 583 00:22:17,970 --> 00:22:19,600 Here's what happens with music. 584 00:22:19,600 --> 00:22:22,785 [SALSA MUSIC PLAYING] 585 00:22:23,637 --> 00:22:24,720 And the synthetic version. 586 00:22:24,720 --> 00:22:27,975 [SALSA MUSIC PLAYING] 587 00:22:29,370 --> 00:22:31,290 And this is what happens with speech. 588 00:22:31,290 --> 00:22:33,570 [MALE VOICE 1] A boy fell from the window. 589 00:22:33,570 --> 00:22:34,920 The wife helped her husband. 590 00:22:34,920 --> 00:22:37,030 Big dogs can be dangerous. 591 00:22:37,030 --> 00:22:40,670 Her-- [INAUDIBLE]. 592 00:22:44,762 --> 00:22:47,610 All right, OK, so in some sense, this 593 00:22:47,610 --> 00:22:49,890 is the most informative thing that 594 00:22:49,890 --> 00:22:52,887 comes out of this whole effort, because, again, it 595 00:22:52,887 --> 00:22:55,220 makes it really clear what you don't understand-- right. 596 00:22:55,220 --> 00:23:00,270 And in all these cases, it was really not obvious, a priori, 597 00:23:00,270 --> 00:23:01,689 that things would be this bad. 598 00:23:01,689 --> 00:23:03,480 I actually thought it was sort of plausible 599 00:23:03,480 --> 00:23:05,063 that we might be able to capture pitch 600 00:23:05,063 --> 00:23:06,660 with some of these statistics. 601 00:23:06,660 --> 00:23:09,510 Same with reverb and certainly some of these simple rhythms. 
602 00:23:09,510 --> 00:23:11,670 I kind of thought that some of the modulation 603 00:23:11,670 --> 00:23:13,680 filter responses and their correlations 604 00:23:13,680 --> 00:23:14,887 would give this to you. 605 00:23:14,887 --> 00:23:17,220 And it's not until you actually test this with synthesis 606 00:23:17,220 --> 00:23:19,320 that you realize how bad this is, right? 607 00:23:19,320 --> 00:23:20,910 And so this really kind of tells you 608 00:23:20,910 --> 00:23:22,260 that there's something very important 609 00:23:22,260 --> 00:23:23,968 that your brain is measuring that we just 610 00:23:23,968 --> 00:23:26,460 don't yet understand and hasn't been built into our model. 611 00:23:26,460 --> 00:23:29,430 So it really sort of identifies the things you need to work on. 612 00:23:29,430 --> 00:23:32,495 OK, so just take home messages from this portion 613 00:23:32,495 --> 00:23:33,120 of the lecture. 614 00:23:33,120 --> 00:23:36,000 So I've argued that sound synthesis is 615 00:23:36,000 --> 00:23:38,940 a powerful tool that can help us test and explore theories 616 00:23:38,940 --> 00:23:41,210 of audition and that the variables that 617 00:23:41,210 --> 00:23:43,710 produce compelling synthesis are things that could plausibly 618 00:23:43,710 --> 00:23:45,720 underlie perception. 619 00:23:45,720 --> 00:23:47,766 And, conversely, that synthesis failures 620 00:23:47,766 --> 00:23:49,890 are things that point the way to new variables that 621 00:23:49,890 --> 00:23:53,617 might be important for the perceptual system. 622 00:23:53,617 --> 00:23:55,950 I've also argued that textures are a nice point of entry 623 00:23:55,950 --> 00:23:57,360 for real-world hearing. 624 00:23:57,360 --> 00:23:59,860 I think what's appealing about them is that you can actually 625 00:23:59,860 --> 00:24:02,940 work with actual real world-like signals and all 626 00:24:02,940 --> 00:24:06,360 of the complexity that at least exists in that domain. 
627 00:24:06,360 --> 00:24:08,710 And, yet, work with them and generate things 628 00:24:08,710 --> 00:24:11,880 that you feel like you can understand. 629 00:24:11,880 --> 00:24:13,950 And I've argued that many natural sounds may 630 00:24:13,950 --> 00:24:16,110 be recognized with relatively simple statistics 631 00:24:16,110 --> 00:24:17,680 of early auditory representation. 632 00:24:17,680 --> 00:24:20,667 So the very simplest kinds of statistical representations 633 00:24:20,667 --> 00:24:22,500 that you might construct that capture things 634 00:24:22,500 --> 00:24:23,492 like the spectrum. 635 00:24:23,492 --> 00:24:25,700 Well, that on its own is not really that informative. 636 00:24:25,700 --> 00:24:27,533 But if you just go a little bit more complex 637 00:24:27,533 --> 00:24:29,940 and into the domain of marginal moments and correlations, 638 00:24:29,940 --> 00:24:32,550 you get representations that are pretty powerful. 639 00:24:32,550 --> 00:24:34,890 And finally, I gave you some evidence 640 00:24:34,890 --> 00:24:36,510 that for textures of moderate length, 641 00:24:36,510 --> 00:24:39,911 statistics may be all that we retain. 642 00:24:39,911 --> 00:24:41,910 So there are a lot of interesting open questions 643 00:24:41,910 --> 00:24:43,530 in this domain. 644 00:24:43,530 --> 00:24:46,200 So one of the big ones, I think, is the locus 645 00:24:46,200 --> 00:24:47,940 of the time-averaging. 646 00:24:47,940 --> 00:24:50,890 So I told you about how we've got some evidence in the lab 647 00:24:50,890 --> 00:24:53,982 that the time scale of the integration 648 00:24:53,982 --> 00:24:55,440 process for computing statistics is 649 00:24:55,440 --> 00:24:56,670 on the order of several seconds. 650 00:24:56,670 --> 00:24:58,150 And that's a really long time scale 651 00:24:58,150 --> 00:25:00,662 relative to typical time scales in the auditory system. 
652 00:25:00,662 --> 00:25:02,620 And so where exactly that happens in the brain, 653 00:25:02,620 --> 00:25:05,484 I think, is very much an open question and kind 654 00:25:05,484 --> 00:25:06,400 of an interesting one. 655 00:25:06,400 --> 00:25:07,816 And so we'd like to sort of figure 656 00:25:07,816 --> 00:25:10,089 out how to get some leverage on that. 657 00:25:10,089 --> 00:25:11,880 There's also a lot of interesting questions 658 00:25:11,880 --> 00:25:13,860 about the relationship to scene analysis. 659 00:25:13,860 --> 00:25:16,080 So usually you're not hearing a texture in isolation. 660 00:25:16,080 --> 00:25:17,760 It's sort of the background to things 661 00:25:17,760 --> 00:25:20,176 that, maybe, you're actually more interested in-- somebody 662 00:25:20,176 --> 00:25:21,160 talking or what not. 663 00:25:21,160 --> 00:25:23,640 And so the relationship between these statistical 664 00:25:23,640 --> 00:25:26,904 representations and the extraction of individual source 665 00:25:26,904 --> 00:25:28,570 signals is something that's really open, 666 00:25:28,570 --> 00:25:31,964 and, I think, kind of interesting. 667 00:25:31,964 --> 00:25:34,380 And then these other questions of what kinds of statistics 668 00:25:34,380 --> 00:25:36,463 would you need to account for some of these really 669 00:25:36,463 --> 00:25:39,670 profound failures of synthesis. 670 00:25:39,670 --> 00:25:41,760 OK, so actually one-- 671 00:25:41,760 --> 00:25:44,809 I think this might be interesting to people. 672 00:25:44,809 --> 00:25:46,350 So I'll just talk briefly about this. 673 00:25:46,350 --> 00:25:47,340 And then we're going to have to figure out what 674 00:25:47,340 --> 00:25:48,590 to do for the last 20 minutes. 
675 00:25:48,590 --> 00:25:50,869 But one of the reasons, I think, I 676 00:25:50,869 --> 00:25:53,160 was requested to talk about this is because of the fact 677 00:25:53,160 --> 00:25:55,290 that there's been all this work on texture 678 00:25:55,290 --> 00:25:57,120 in the domain of vision. 679 00:25:57,120 --> 00:25:59,160 And so it's sort of an interesting case where 680 00:25:59,160 --> 00:26:01,860 we can kind of think about similarities and differences 681 00:26:01,860 --> 00:26:03,060 between sensory systems. 682 00:26:03,060 --> 00:26:04,914 And so back when we were doing this work-- 683 00:26:04,914 --> 00:26:07,080 as I said, this was joint work with Eero Simoncelli. 684 00:26:07,080 --> 00:26:09,645 I was a post-doc in his lab at NYU. 685 00:26:09,645 --> 00:26:11,520 And we thought it would be interesting to try 686 00:26:11,520 --> 00:26:15,510 to turn the kind of standard model of visual texture, which 687 00:26:15,510 --> 00:26:18,310 was done by Javier Portilla and Eero 688 00:26:18,310 --> 00:26:20,686 a long time ago, into sort of the same kind of diagram 689 00:26:20,686 --> 00:26:21,810 that I've been showing you. 690 00:26:21,810 --> 00:26:24,517 And so we actually did this in our paper. 691 00:26:24,517 --> 00:26:26,850 And so this is the one that you've been seeing the whole talk, 692 00:26:26,850 --> 00:26:27,349 right. 693 00:26:27,349 --> 00:26:29,470 So you've got a sound waveform-- 694 00:26:29,470 --> 00:26:30,610 a stage of filtering. 695 00:26:30,610 --> 00:26:33,090 This non-linearity to extract the envelope and compress it. 696 00:26:33,090 --> 00:26:34,530 And then another stage of filtering. 697 00:26:34,530 --> 00:26:35,740 And then there are statistical measurements 698 00:26:35,740 --> 00:26:38,130 at kind of the last two stages of representation. 699 00:26:38,130 --> 00:26:40,620 And this is an analogous diagram that you 700 00:26:40,620 --> 00:26:44,940 can make for this sort of standard visual texture model. 
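In code, the auditory side of that comparison amounts to something like the sketch below: split the signal into frequency bands, extract and compress each band's envelope, then measure marginal moments and cross-band correlations. This is a drastic simplification for illustration only; the actual model uses gammatone-like filter banks, a particular compressive nonlinearity, and an additional stage of modulation filtering, none of which are reproduced here:

```python
import numpy as np

rng = np.random.default_rng(2)
sr = 16000
x = rng.standard_normal(sr * 2)  # noise stand-in for a texture recording

def bandpass(sig, lo, hi):
    """Crude brick-wall bandpass via the FFT (the model itself uses
    cochlea-like filter banks, not this)."""
    spec = np.fft.rfft(sig)
    freqs = np.fft.rfftfreq(sig.size, 1 / sr)
    spec[(freqs < lo) | (freqs >= hi)] = 0
    return np.fft.irfft(spec, n=sig.size)

def envelope(sig, win=0.01):
    """Rectify, smooth over ~10 ms, and compress: a stand-in for the
    envelope-extraction nonlinearity."""
    k = int(win * sr)
    return np.convolve(np.abs(sig), np.ones(k) / k, mode="same") ** 0.3

edges = [200, 400, 800, 1600, 3200]  # arbitrary band edges in Hz
envs = np.array([envelope(bandpass(x, lo, hi))
                 for lo, hi in zip(edges[:-1], edges[1:])])

# Marginal moments of each band's envelope...
means = envs.mean(axis=1)
variances = envs.var(axis=1)
# ...and correlations between the envelopes of different bands.
corr = np.corrcoef(envs)
```

Synthesis then amounts to iteratively adjusting a noise signal until its measurements match those of the original; the visual model described next has the same overall shape, with space in place of time.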
701 00:26:44,940 --> 00:26:48,440 So we start out with images like beans. 702 00:26:48,440 --> 00:26:50,730 There's center-surround filtering of the sort 703 00:26:50,730 --> 00:26:53,640 that you would find in the retina or LGN 704 00:26:53,640 --> 00:26:56,070 that filters things into particular spatial frequency 705 00:26:56,070 --> 00:26:56,670 bands. 706 00:26:56,670 --> 00:26:58,650 And so that's what you get here. 707 00:26:58,650 --> 00:27:01,929 So these are sub-bands again. 708 00:27:01,929 --> 00:27:03,720 Then there's oriented filtering of the sort 709 00:27:03,720 --> 00:27:06,960 that you might get via simple cells in V1. 710 00:27:06,960 --> 00:27:09,390 So then you get the sub-bands divided up even finer 711 00:27:09,390 --> 00:27:12,000 into both spatial frequency and orientation. 712 00:27:12,000 --> 00:27:13,380 And then there's something that's 713 00:27:13,380 --> 00:27:15,660 analogous to the extraction of the envelope that 714 00:27:15,660 --> 00:27:17,650 would give you something like a complex cell. 715 00:27:17,650 --> 00:27:20,190 All right, and so this is sort of local amplitude 716 00:27:20,190 --> 00:27:22,500 in each of these different sub-bands-- right. 717 00:27:22,500 --> 00:27:24,790 So you can see, here, the contrast is very high. 718 00:27:24,790 --> 00:27:27,610 And so you get a high response in this particular point 719 00:27:27,610 --> 00:27:28,910 in the sub-band. 720 00:27:28,910 --> 00:27:31,891 So, again, this is in the dimension of space. 721 00:27:31,891 --> 00:27:33,890 That's a difference, right-- it's an image. 722 00:27:33,890 --> 00:27:37,902 So you've got x- and y-coordinates instead of time. 723 00:27:37,902 --> 00:27:39,860 But, again, there are statistical measurements, 724 00:27:39,860 --> 00:27:44,480 and you can actually relate a lot of them 725 00:27:44,480 --> 00:27:46,020 to some of the same functional form. 
726 00:27:46,020 --> 00:27:48,500 So there's marginal moments just like we 727 00:27:48,500 --> 00:27:51,110 were computing from sound. 728 00:27:51,110 --> 00:27:54,110 In the visual texture model, there's an auto correlation. 729 00:27:54,110 --> 00:27:56,150 So that's measuring spatial correlations 730 00:27:56,150 --> 00:27:59,000 which we don't actually have in the auditory model. 731 00:27:59,000 --> 00:28:01,850 But then these correlations across different frequency 732 00:28:01,850 --> 00:28:02,370 channels. 733 00:28:02,370 --> 00:28:04,880 So this is across different spatial frequencies 734 00:28:04,880 --> 00:28:08,900 to things tuned to the same orientation. 735 00:28:08,900 --> 00:28:15,026 And this is across orientations and in the energy domain. 736 00:28:15,026 --> 00:28:16,400 So a couple of interesting points 737 00:28:16,400 --> 00:28:18,320 to take from this if you just sort of 738 00:28:18,320 --> 00:28:21,350 look back and forth between these two pictures. 739 00:28:21,350 --> 00:28:24,020 The first is that the statistics that we ended up 740 00:28:24,020 --> 00:28:28,760 using in the domain of sound are kind of late in the game. 741 00:28:28,760 --> 00:28:32,450 All right, so they're sort of after this non-linear stage 742 00:28:32,450 --> 00:28:34,610 that extracts amplitude. 743 00:28:34,610 --> 00:28:36,140 Whereas in the visual texture model, 744 00:28:36,140 --> 00:28:37,449 the nonlinearity happens here. 745 00:28:37,449 --> 00:28:38,990 And there's all these statistics that 746 00:28:38,990 --> 00:28:41,250 are being measured at these earlier stages 747 00:28:41,250 --> 00:28:43,550 before you're extracting local amplitude. 748 00:28:43,550 --> 00:28:45,240 And that's an important difference, 749 00:28:45,240 --> 00:28:47,140 I think, between sounds and images 750 00:28:47,140 --> 00:28:49,310 and that a lot of the action and sound 751 00:28:49,310 --> 00:28:52,167 is in the kind of the local amplitude domain. 
752 00:28:52,167 --> 00:28:54,500 Whereas there's a lot of important structure 753 00:28:54,500 --> 00:28:56,834 in images that has to do with sort of local phase 754 00:28:56,834 --> 00:28:59,000 that you can't just get from kind of local amplitude 755 00:28:59,000 --> 00:29:01,610 measurements. 756 00:29:01,610 --> 00:29:05,900 But at sort of a coarse scale, the big picture 757 00:29:05,900 --> 00:29:08,600 is that we think of visual texture 758 00:29:08,600 --> 00:29:11,060 as being represented with statistical measurements 759 00:29:11,060 --> 00:29:13,140 that average across space. 760 00:29:13,140 --> 00:29:15,860 And we've been arguing that sound texture consists 761 00:29:15,860 --> 00:29:19,790 of statistical computations that average across time. 762 00:29:19,790 --> 00:29:21,950 That said, as I was alluding to earlier, 763 00:29:21,950 --> 00:29:24,124 I think it's totally plausible that we should really 764 00:29:24,124 --> 00:29:26,540 think about visual texture as something that's potentially 765 00:29:26,540 --> 00:29:30,080 dynamic if you're looking at a sheet blowing in the wind 766 00:29:30,080 --> 00:29:32,427 or people moving in a crowd. 767 00:29:32,427 --> 00:29:34,760 And so there might well be statistics in the time domain 768 00:29:34,760 --> 00:29:37,410 as well that people just haven't really thought about. 769 00:29:37,410 --> 00:29:42,949 OK, so auditory scene analysis is, 770 00:29:42,949 --> 00:29:44,990 loosely speaking, the process of inferring events 771 00:29:44,990 --> 00:29:46,281 in the world from sound, right. 772 00:29:46,281 --> 00:29:49,780 So in almost any kind of normal situation, 773 00:29:49,780 --> 00:29:52,100 there is this sound signal that comes into your ears. 774 00:29:52,100 --> 00:29:54,290 And that's the result of multiple causal factors 775 00:29:54,290 --> 00:29:54,980 in the world. 
776 00:29:54,980 --> 00:29:57,830 And those can be different things in the world that 777 00:29:57,830 --> 00:29:59,456 are making sound. 778 00:29:59,456 --> 00:30:00,830 As we discussed, the sound signal 779 00:30:00,830 --> 00:30:02,390 also interacts with the environment on the way 780 00:30:02,390 --> 00:30:03,230 to your ear. 781 00:30:03,230 --> 00:30:05,720 And so both of those things contribute. 782 00:30:05,720 --> 00:30:07,760 The classic instantiation of this 783 00:30:07,760 --> 00:30:09,410 is the cocktail party problem, where 784 00:30:09,410 --> 00:30:10,820 the notion is that there are 785 00:30:10,820 --> 00:30:14,120 multiple sources in the world, and the signals from those 786 00:30:14,120 --> 00:30:17,990 sources sum together into a mixture that enters your ear. 787 00:30:17,990 --> 00:30:19,594 And as a listener, you're usually 788 00:30:19,594 --> 00:30:21,260 interested in individual sources, maybe, 789 00:30:21,260 --> 00:30:23,509 one of those in particular-- like what somebody that you 790 00:30:23,509 --> 00:30:24,920 care about is saying. 791 00:30:24,920 --> 00:30:27,400 And so your brain has to take that mixed signal-- 792 00:30:27,400 --> 00:30:29,870 and from that infer the content of one or more 793 00:30:29,870 --> 00:30:31,274 of the sources. 794 00:30:31,274 --> 00:30:32,690 And so this is the classic example 795 00:30:32,690 --> 00:30:35,430 of an ill-posed problem. 796 00:30:35,430 --> 00:30:38,690 And by that I mean that it's ill-posed 797 00:30:38,690 --> 00:30:40,959 because many sets of possible sounds 798 00:30:40,959 --> 00:30:42,500 add up to equal the observed mixture. 799 00:30:42,500 --> 00:30:44,960 So all you have access to is this red guy here, right? 800 00:30:44,960 --> 00:30:47,211 And you'd like to infer the blue signals, 801 00:30:47,211 --> 00:30:49,460 which are the true sources that occurred in the world. 
802 00:30:49,460 --> 00:30:52,440 And the problem is that there are these green signals here, 803 00:30:52,440 --> 00:30:54,177 which also add up to the red signal. 804 00:30:54,177 --> 00:30:56,510 In fact, there's lots and lots and lots of these, right? 805 00:30:56,510 --> 00:30:58,176 So your brain has to take the red signal 806 00:30:58,176 --> 00:31:00,030 and somehow infer the blue ones. 807 00:31:00,030 --> 00:31:02,090 And so this is analogous to me telling you, 808 00:31:02,090 --> 00:31:03,890 x plus y equals 17-- 809 00:31:03,890 --> 00:31:05,060 please solve for x. 810 00:31:05,060 --> 00:31:06,720 And so, obviously, if you got this on a math test, 811 00:31:06,720 --> 00:31:09,020 you would complain because there is not a unique solution, 812 00:31:09,020 --> 00:31:09,519 right. 813 00:31:09,519 --> 00:31:12,970 You could have 1 and 16, and 2 and 15, and 3 and 14, 814 00:31:12,970 --> 00:31:14,220 and so on and so forth, right? 815 00:31:14,220 --> 00:31:16,136 But that's exactly the problem that your brain 816 00:31:16,136 --> 00:31:17,990 is solving all the time every day when 817 00:31:17,990 --> 00:31:20,090 you get a mixture of sounds. 818 00:31:20,090 --> 00:31:22,670 And the only way that you can solve problems of these sorts 819 00:31:22,670 --> 00:31:24,832 is by making assumptions about the sound sources. 820 00:31:24,832 --> 00:31:27,290 And the only way that you would be able to make assumptions 821 00:31:27,290 --> 00:31:29,390 about sound sources is if real-world sound sources 822 00:31:29,390 --> 00:31:31,436 have some degree of regularity. 823 00:31:31,436 --> 00:31:32,310 And in fact, they do. 824 00:31:32,310 --> 00:31:35,810 And one easy way to see this is by generating sounds 825 00:31:35,810 --> 00:31:37,420 that are fully random. 
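The x plus y equals 17 situation carries over directly to waveforms: for any observed mixture, infinitely many pairs of candidate sources sum to it exactly. A few illustrative lines make that concrete (all signals here are arbitrary noise stand-ins):

```python
import numpy as np

rng = np.random.default_rng(3)
mixture = rng.standard_normal(1000)  # the observed "red" signal

# One candidate pair of sources ("blue")...
blue_a = rng.standard_normal(1000)
blue_b = mixture - blue_a

# ...and an entirely different pair ("green") with the same sum.
green_a = rng.standard_normal(1000)
green_b = mixture - green_a

same_sum = np.allclose(blue_a + blue_b, green_a + green_b)
different_sources = not np.allclose(blue_a, green_a)
```

Any waveform at all can play the role of `blue_a`, and subtraction supplies a partner that completes the mixture, which is exactly why extra assumptions about sources are unavoidable.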
826 00:31:37,420 --> 00:31:39,200 And so the way that you would do this 827 00:31:39,200 --> 00:31:41,480 is you would have a random number generator-- 828 00:31:41,480 --> 00:31:42,890 you would draw numbers from that. 829 00:31:42,890 --> 00:31:46,080 And each of those numbers would form a particular sample 830 00:31:46,080 --> 00:31:47,012 in a sound signal. 831 00:31:47,012 --> 00:31:49,220 And then you could play that and listen to it, right. 832 00:31:49,220 --> 00:31:50,750 And so if you did that procedure, 833 00:31:50,750 --> 00:31:51,875 this is what you would get. 834 00:31:51,875 --> 00:31:54,802 [SPRAY SOUNDS] 835 00:31:55,756 --> 00:31:57,900 All right, so those are fully random sound signals. 836 00:31:57,900 --> 00:32:00,450 And so we could generate lots and lots of those. 837 00:32:00,450 --> 00:32:02,201 And the point is that with that procedure, 838 00:32:02,201 --> 00:32:03,783 you would have to sit there generating 839 00:32:03,783 --> 00:32:05,850 these random sounds for a very, very long time 840 00:32:05,850 --> 00:32:08,197 before you got something that sounded like a real-world 841 00:32:08,197 --> 00:32:08,780 sound, right? 842 00:32:08,780 --> 00:32:10,440 Real world sounds are like this. 843 00:32:10,440 --> 00:32:13,100 [ENGINE SOUND] 844 00:32:13,100 --> 00:32:13,610 Or this-- 845 00:32:13,610 --> 00:32:15,020 [DOOR BELL SOUND] 846 00:32:15,020 --> 00:32:15,960 Or this-- 847 00:32:15,960 --> 00:32:17,370 [BIRD SOUND] 848 00:32:17,370 --> 00:32:17,930 Or this-- 849 00:32:17,930 --> 00:32:20,120 [SCRUBBING SOUND] 850 00:32:20,120 --> 00:32:22,405 All right, so the point is that the set 851 00:32:22,405 --> 00:32:23,780 of sounds that occur in the world 852 00:32:23,780 --> 00:32:25,760 are a very, very, very small portion 853 00:32:25,760 --> 00:32:28,919 of the set of all physically realizable sound waveforms. 854 00:32:28,919 --> 00:32:31,460 And so the notion is that that's what enables you to hear. 
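The random-draw procedure described above is only a few lines; here it is written out to a playable WAV file (the sample rate, duration, and filename are arbitrary choices):

```python
import wave

import numpy as np

rng = np.random.default_rng(4)
sr = 16000

# Draw every sample independently from a random number generator:
samples = rng.uniform(-1.0, 1.0, size=sr * 2)  # 2 seconds of noise

# Convert to 16-bit PCM and write a playable WAV file.
pcm = (samples * 32767).astype(np.int16)
with wave.open("random_signal.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)  # 2 bytes per sample
    f.setframerate(sr)
    f.writeframes(pcm.tobytes())
```

Every run produces a different waveform, but they all sound like the same undifferentiated static: essentially none of the draws land in the tiny region of waveform space occupied by real-world sounds.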
855 00:32:31,460 --> 00:32:33,418 It's the fact that you've internalized the fact 856 00:32:33,418 --> 00:32:36,192 that the structure of real-world sounds is not random, 857 00:32:36,192 --> 00:32:38,150 such that when you get a mixture of sounds, 858 00:32:38,150 --> 00:32:39,774 you can actually make some good guesses 859 00:32:39,774 --> 00:32:41,170 as to what the sources are. 860 00:32:41,170 --> 00:32:44,760 All right, so we rely on these regularities in order to hear. 861 00:32:44,760 --> 00:32:47,780 So one intuitive view of inferring a target source 862 00:32:47,780 --> 00:32:50,240 from a mixture like this is that you have 863 00:32:50,240 --> 00:32:52,160 to do at least a couple things. 864 00:32:52,160 --> 00:32:55,340 One is to determine the grouping of the observed elements 865 00:32:55,340 --> 00:32:56,670 in the sound signal. 866 00:32:56,670 --> 00:32:58,970 And so what I've done here is for each 867 00:32:58,970 --> 00:33:01,820 of these-- this is that cocktail party problem demo that we 868 00:33:01,820 --> 00:33:03,130 heard at the start. 869 00:33:03,130 --> 00:33:04,730 So we've got one speaker-- 870 00:33:04,730 --> 00:33:07,025 two, three, and then seven. 871 00:33:07,025 --> 00:33:12,030 And in the spectrograms, I've coded the pixels 872 00:33:12,030 --> 00:33:16,160 either red or green, where the pixels are coded red 873 00:33:16,160 --> 00:33:18,990 if they come from something other than the target source, 874 00:33:18,990 --> 00:33:19,490 right. 875 00:33:19,490 --> 00:33:23,240 So this stuff up here is coming from this additional speaker. 876 00:33:23,240 --> 00:33:27,470 And then the green bits are the pixels in the target signal 877 00:33:27,470 --> 00:33:29,030 that are masked by the other signal-- 878 00:33:29,030 --> 00:33:32,000 where the other signal actually has higher intensity. 
879 00:33:32,000 --> 00:33:33,500 And so one notion is that, well, you 880 00:33:33,500 --> 00:33:35,666 have to be able to tell that the red things actually 881 00:33:35,666 --> 00:33:38,120 don't go with the gray things. 882 00:33:38,120 --> 00:33:40,700 But then you also need to take these parts that are green, 883 00:33:40,700 --> 00:33:41,750 where the other source is actually 884 00:33:41,750 --> 00:33:43,040 swamping the thing you're interested in, 885 00:33:43,040 --> 00:33:45,440 and then estimate the content of the target source. 886 00:33:45,440 --> 00:33:47,930 That's at least a very sort of naive intuitive view 887 00:33:47,930 --> 00:33:49,990 of what has to happen. 888 00:33:49,990 --> 00:33:52,640 And in both of these cases, the only way that you can do this 889 00:33:52,640 --> 00:33:55,130 is by taking advantage of statistical regularities 890 00:33:55,130 --> 00:33:56,660 in sounds. 891 00:33:56,660 --> 00:33:58,550 So one example of a regularity that we 892 00:33:58,550 --> 00:34:01,910 think might be used to group sound is harmonic frequencies. 893 00:34:01,910 --> 00:34:04,740 So voices and instruments and certain other sounds 894 00:34:04,740 --> 00:34:06,800 produce frequencies that are harmonics, i.e., 895 00:34:06,800 --> 00:34:08,000 multiples of a fundamental. 896 00:34:08,000 --> 00:34:09,620 So here's a schematic power spectrum 897 00:34:09,620 --> 00:34:13,469 of what might come out of your vocal cords. 898 00:34:13,469 --> 00:34:15,260 So there's the fundamental frequency here. 899 00:34:15,260 --> 00:34:17,320 And then all the different harmonics. 900 00:34:17,320 --> 00:34:21,228 And they exhibit this very regular structure. 901 00:34:21,228 --> 00:34:23,330 Here, similarly, this is A440 on the oboe. 902 00:34:23,330 --> 00:34:25,320 [OBOE SOUND] 903 00:34:25,320 --> 00:34:27,330 So the fundamental frequency is 440 hertz. 904 00:34:27,330 --> 00:34:28,759 That's concert A. 
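A harmonic complex like that schematic spectrum can be synthesized directly. In this sketch the 1/k amplitude falloff and ten-harmonic count are invented; only the 440 Hz concert-A fundamental comes from the lecture:

```python
import numpy as np

# A concert-A harmonic complex: energy only at integer multiples of
# the 440 Hz fundamental. Amplitudes (1/k) are an invented choice.
sr = 44100
t = np.arange(sr) / sr                     # one second of samples
f0 = 440.0
tone = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, 11))

# With a 1-second signal, FFT bins are 1 Hz apart, so the spectral
# peaks land exactly on 440, 880, 1320, ... Hz.
freqs = np.fft.rfftfreq(tone.size, 1 / sr)
spectrum = np.abs(np.fft.rfft(tone))
peak_hz = freqs[np.argmax(spectrum)]
```

The regular spacing of those peaks is exactly the structure the grouping mechanism is presumed to exploit.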
But if you look 905 00:34:28,759 --> 00:34:30,300 at the power spectrum of that signal, 906 00:34:30,300 --> 00:34:36,431 you get all of these integer multiples of that fundamental. 907 00:34:36,431 --> 00:34:38,639 All right, and so the way that this happens in speech 908 00:34:38,639 --> 00:34:39,680 is that there are these-- 909 00:34:39,680 --> 00:34:41,940 your vocal cords, which open and close 910 00:34:41,940 --> 00:34:43,469 in this periodic manner. 911 00:34:43,469 --> 00:34:45,396 They generate a series of sound pulses. 912 00:34:45,396 --> 00:34:46,770 And in the frequency domain, that 913 00:34:46,770 --> 00:34:49,005 translates to harmonic structure. 914 00:34:49,005 --> 00:34:50,880 Not going to go through this in great detail. 915 00:34:50,880 --> 00:34:53,280 Hynek's going to tell you about speech. 916 00:34:53,280 --> 00:34:56,460 All right, and so there's some classic evidence 917 00:34:56,460 --> 00:34:59,850 that your brain uses harmonicity as a grouping cue, which 918 00:34:59,850 --> 00:35:02,940 is that if you take a series of harmonic frequencies 919 00:35:02,940 --> 00:35:06,690 and you mistune one of them, your brain typically 920 00:35:06,690 --> 00:35:09,210 causes you to hear that as a distinct sound source 921 00:35:09,210 --> 00:35:10,860 once the mistuning becomes sufficient. 922 00:35:10,860 --> 00:35:12,620 And here's just a classic demo of that. 923 00:35:16,071 --> 00:35:20,170 [MALE VOICE 2] Demonstration 18-- isolation of a frequency 924 00:35:20,170 --> 00:35:22,860 component based on mistuning. 925 00:35:22,860 --> 00:35:26,590 You are to listen for the third harmonic of a complex tone. 926 00:35:26,590 --> 00:35:30,250 First, this component is played alone as a standard. 
927 00:35:30,250 --> 00:35:32,570 Then over a series of repetitions, 928 00:35:32,570 --> 00:35:35,170 it remains at a constant frequency 929 00:35:35,170 --> 00:35:38,200 while the rest of the components are gradually lowered 930 00:35:38,200 --> 00:35:41,904 as a group in steps of 1%. 931 00:35:41,904 --> 00:35:45,341 [BEEPING SOUNDS] 932 00:36:06,660 --> 00:36:08,480 [MALE VOICE 2] Now after two-- 933 00:36:08,480 --> 00:36:11,112 OK, and so what you should have heard-- 934 00:36:11,112 --> 00:36:13,320 and you can tell me whether this is the case or not-- 935 00:36:13,320 --> 00:36:15,600 is that as this thing is mistuned, at some point, 936 00:36:15,600 --> 00:36:17,300 you actually start to hear, kind of, two beeps. 937 00:36:17,300 --> 00:36:19,008 All right, there's the main tone and then 938 00:36:19,008 --> 00:36:21,070 there's this other little beep, right. 939 00:36:21,070 --> 00:36:22,880 And if you did it in the other direction, 940 00:36:22,880 --> 00:36:23,796 it would then reverse. 941 00:36:27,610 --> 00:36:32,190 OK, so one other consequence of harmonicity is-- 942 00:36:32,190 --> 00:36:35,520 and somebody was asking about this earlier-- 943 00:36:35,520 --> 00:36:38,610 is that your brain is able to use the harmonics of the sound 944 00:36:38,610 --> 00:36:40,350 in order to infer its pitch. 945 00:36:40,350 --> 00:36:43,630 So the pitch that you hear when you hear somebody talking 946 00:36:43,630 --> 00:36:46,470 is like a collective function of all the different harmonics. 947 00:36:46,470 --> 00:36:49,140 And so one interesting thing that 948 00:36:49,140 --> 00:36:51,030 happens when you mistune a harmonic 949 00:36:51,030 --> 00:36:53,520 is that for very small mistunings, 950 00:36:53,520 --> 00:36:56,964 that initially causes a bias in the perceived pitch. 951 00:36:56,964 --> 00:36:58,380 And so that's what's plotted here. 
952 00:36:58,380 --> 00:37:02,490 So this is a task where somebody hears this complex tone that 953 00:37:02,490 --> 00:37:04,600 has one of the harmonics mistuned by a little bit. 954 00:37:04,600 --> 00:37:05,695 And then they hear another complex tone. 955 00:37:05,695 --> 00:37:07,470 And they have to adjust the pitch of the other one 956 00:37:07,470 --> 00:37:08,610 until it sounds the same. 957 00:37:08,610 --> 00:37:11,160 All right, and so what's being plotted on the y-axis 958 00:37:11,160 --> 00:37:13,590 in this graph is the average amount 959 00:37:13,590 --> 00:37:17,160 of shift in the pitch match as a function of the shift 960 00:37:17,160 --> 00:37:18,810 in that particular harmonic. 961 00:37:18,810 --> 00:37:21,705 And for very small mistunings of a few percent, 962 00:37:21,705 --> 00:37:23,580 you can see that there's this linear increase 963 00:37:23,580 --> 00:37:24,750 in the perceived pitch. 964 00:37:24,750 --> 00:37:26,375 All right, so mistuning that harmonic 965 00:37:26,375 --> 00:37:28,420 causes the pitch to change. 966 00:37:28,420 --> 00:37:30,960 But then once the mistuning exceeds a certain amount, 967 00:37:30,960 --> 00:37:33,210 you can actually see that the effect reverses. 968 00:37:33,210 --> 00:37:35,040 And the pitch shift goes away. 969 00:37:35,040 --> 00:37:36,704 And so we think what's happening here 970 00:37:36,704 --> 00:37:38,370 is that the mechanism in your brain that 971 00:37:38,370 --> 00:37:41,190 is computing pitch from the harmonics 972 00:37:41,190 --> 00:37:44,370 somehow realizes that one of those harmonics is mistuned 973 00:37:44,370 --> 00:37:46,120 and is not part of the same thing. 974 00:37:46,120 --> 00:37:48,780 And so it's excluded from the computation of pitch. 975 00:37:48,780 --> 00:37:52,410 So the segregation of those sources must then somehow 976 00:37:52,410 --> 00:37:55,600 happen prior to or at the same time 977 00:37:55,600 --> 00:37:58,980 as the calculation of the pitch. 
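Stimuli of the mistuned-harmonic kind are straightforward to generate. This sketch is illustrative only; the 200 Hz fundamental, ten equal-amplitude harmonics, and 8% shift are invented values, not the parameters of the actual experiments:

```python
import numpy as np

# A harmonic complex in which one component can be shifted away from
# its harmonic frequency by a given percentage.
sr = 20000
t = np.arange(sr) / sr                     # one second

def complex_tone(f0, n_harmonics, mistuned_k=None, shift_pct=0.0):
    tone = np.zeros_like(t)
    for k in range(1, n_harmonics + 1):
        f = k * f0
        if k == mistuned_k:
            f *= 1.0 + shift_pct / 100.0   # mistune just this component
        tone += np.sin(2 * np.pi * f * t)
    return tone / n_harmonics

in_tune = complex_tone(200.0, 10)
shifted = complex_tone(200.0, 10, mistuned_k=3, shift_pct=8.0)

# The 3rd harmonic now sits at 648 Hz instead of 600 Hz; with 1 Hz
# FFT bins, that shows up as a displaced spectral peak.
freqs = np.fft.rfftfreq(shifted.size, 1 / sr)
spec = np.abs(np.fft.rfft(shifted))
```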
978 00:37:58,980 --> 00:38:01,360 Here's another classic demonstration 979 00:38:01,360 --> 00:38:04,770 of sound segregation related to harmonicity. 980 00:38:04,770 --> 00:38:07,410 This is called the Reynolds-McAdams Oboe-- 981 00:38:07,410 --> 00:38:09,960 some collaboration between Roger Reynolds and Steve McAdams. 982 00:38:09,960 --> 00:38:11,064 There's a complex tone-- 983 00:38:11,064 --> 00:38:12,480 and what's going to happen here is 984 00:38:12,480 --> 00:38:15,510 that the even harmonics-- two, four, six, eight, et cetera, 985 00:38:15,510 --> 00:38:18,690 will become frequency modulated in a way that's coherent. 986 00:38:18,690 --> 00:38:20,947 And so, initially, you'll hear this kind of one thing. 987 00:38:20,947 --> 00:38:23,280 And then it will sort of separate into these two voices. 988 00:38:23,280 --> 00:38:25,080 And it's called the oboe because the oboe 989 00:38:25,080 --> 00:38:27,270 is an instrument that has a lot of power at the odd harmonics. 990 00:38:27,270 --> 00:38:28,080 And so you'll hear something that 991 00:38:28,080 --> 00:38:30,330 sounds like an oboe along with something that, maybe, is 992 00:38:30,330 --> 00:38:31,580 like a voice that has vibrato. 993 00:38:34,215 --> 00:38:37,680 [OBOE AND VIBRATO SOUNDS] 994 00:38:45,190 --> 00:38:46,780 Did that work for everybody? 995 00:38:46,780 --> 00:38:49,070 So all these things are being affected 996 00:38:49,070 --> 00:38:52,450 in kind of interesting ways by the reverb in this auditorium, 997 00:38:52,450 --> 00:38:54,760 which will-- 998 00:38:54,760 --> 00:38:58,870 yeah, but that mostly works. 999 00:38:58,870 --> 00:39:00,400 So we've done a little bit of work 1000 00:39:00,400 --> 00:39:03,580 trying to test whether the brain uses harmonicity to segregate 1001 00:39:03,580 --> 00:39:04,370 actual speech. 1002 00:39:04,370 --> 00:39:06,430 And so very recently, it's become 1003 00:39:06,430 --> 00:39:10,180 possible to manipulate speech and change its harmonicity. 
1004 00:39:10,180 --> 00:39:12,670 And I'm not going to tell you in detail how this works. 1005 00:39:12,670 --> 00:39:15,610 But we can resynthesize speech in ways that 1006 00:39:15,610 --> 00:39:16,900 are either harmonic like this. 1007 00:39:16,900 --> 00:39:17,770 This sounds normal. 1008 00:39:17,770 --> 00:39:19,630 [FEMALE VOICE 2] She smiled and the teeth 1009 00:39:19,630 --> 00:39:23,080 gleamed in her beautifully modeled olive face. 1010 00:39:23,080 --> 00:39:25,640 But we can also resynthesize it so as to make it inharmonic. 1011 00:39:25,640 --> 00:39:26,740 And if you look at the spectrum here, 1012 00:39:26,740 --> 00:39:29,281 you can see that the harmonic spacing is no longer regular. 1013 00:39:29,281 --> 00:39:31,030 All right, so we've just added some jitter 1014 00:39:31,030 --> 00:39:32,740 to the frequencies of the harmonics. 1015 00:39:32,740 --> 00:39:34,210 And it makes it sound weird. 1016 00:39:34,210 --> 00:39:36,220 [FEMALE VOICE 2] She smiled and the teeth 1017 00:39:36,220 --> 00:39:39,480 gleamed in her beautifully modeled olive face. 1018 00:39:39,480 --> 00:39:42,310 But it's still perfectly intelligible, right. 1019 00:39:42,310 --> 00:39:44,590 And that's because the vocal tract filtering 1020 00:39:44,590 --> 00:39:46,089 that I think Hynek is probably going 1021 00:39:46,089 --> 00:39:48,600 to tell you about this afternoon remains unchanged. 1022 00:39:48,600 --> 00:39:50,641 And so the notion here is that if you're actually 1023 00:39:50,641 --> 00:39:53,770 using this harmonic structure to kind of tell you 1024 00:39:53,770 --> 00:39:56,980 what parts of the sound signal belong together-- 1025 00:39:56,980 --> 00:39:59,056 then if you've got a mixture of two speakers that 1026 00:39:59,056 --> 00:40:00,430 were inharmonic, you might think 1027 00:40:00,430 --> 00:40:03,340 that it would be harder to understand what was being said. 
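The lecture deliberately skips how the resynthesis works, so the following is only a toy illustration of the underlying idea: jitter each component's frequency so the spacing is no longer a constant multiple of the fundamental. Every parameter value here is invented:

```python
import numpy as np

# Harmonic partials sit at exact integer multiples of f0. Jittering
# each frequency makes the spacing irregular (inharmonic) while
# leaving the overall spectral region roughly intact.
rng = np.random.default_rng(2)
f0 = 150.0                                  # invented "voice" pitch
harmonic = f0 * np.arange(1, 21)            # 150, 300, 450, ...
jitter = rng.uniform(-0.3, 0.3, size=harmonic.size) * f0
inharmonic = harmonic + jitter

sr = 20000
t = np.arange(sr) / sr
signal = sum(np.sin(2 * np.pi * f * t) for f in inharmonic)

harmonic_spacing = np.diff(harmonic)        # all exactly f0
inharmonic_spacing = np.diff(inharmonic)    # irregular
```

The real manipulation preserves the vocal-tract envelope as well, which this sketch does not attempt.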
1028 00:40:03,340 --> 00:40:06,594 So we gave people this task where we played them 1029 00:40:06,594 --> 00:40:09,010 words, either one word at a time, or two concurrent words. 1030 00:40:09,010 --> 00:40:11,200 And we just asked them to type in what they heard. 1031 00:40:11,200 --> 00:40:14,560 And then we just score how much they got correct. 1032 00:40:14,560 --> 00:40:16,909 And we did this with a bunch of different conditions 1033 00:40:16,909 --> 00:40:18,200 where we increased the jitter. 1034 00:40:18,200 --> 00:40:19,615 So there's harmonic-- 1035 00:40:19,615 --> 00:40:21,460 [MALE VOICE 3] Finally, he asked, 1036 00:40:21,460 --> 00:40:24,040 do you object to petting? 1037 00:40:24,040 --> 00:40:26,860 I don't know why my RA chose this example. 1038 00:40:26,860 --> 00:40:29,410 But, whatever-- it's taken from a corpus 1039 00:40:29,410 --> 00:40:32,000 called TIMIT that has a lot of weird sentences. 1040 00:40:32,000 --> 00:40:33,540 [MALE VOICE 3] Finally, he asked, 1041 00:40:33,540 --> 00:40:35,810 do you object to petting? 1042 00:40:35,810 --> 00:40:39,300 Finally, he asked, do you object to petting? 1043 00:40:39,300 --> 00:40:42,240 Finally, he asked, do you object to petting? 1044 00:40:42,240 --> 00:40:44,740 All right, so it kind of gets stranger and stranger sounding 1045 00:40:44,740 --> 00:40:45,573 and then it bottoms out. 1046 00:40:45,573 --> 00:40:48,070 These are ratings of how weird it sounds. 1047 00:40:48,070 --> 00:40:50,530 And these are the results of the recognition experiment. 1048 00:40:50,530 --> 00:40:52,446 And so what's being plotted is the mean number 1049 00:40:52,446 --> 00:40:55,420 of correct words as a function of the deviation 1050 00:40:55,420 --> 00:40:56,200 from harmonicity. 1051 00:40:56,200 --> 00:40:59,557 So 0 here is perfectly harmonic, and this is increasing jitter. 
1052 00:40:59,557 --> 00:41:01,390 And so the interesting thing is that there's 1053 00:41:01,390 --> 00:41:04,042 no effect on the recognition of single words, which 1054 00:41:04,042 --> 00:41:06,250 is below ceiling, because these are single words that 1055 00:41:06,250 --> 00:41:07,450 are excised from sentences. 1056 00:41:07,450 --> 00:41:10,580 And so they are actually not that easy to understand. 1057 00:41:10,580 --> 00:41:12,370 But when you give people pairs of words, 1058 00:41:12,370 --> 00:41:15,750 you see that they get worse at recognizing what was said. 1059 00:41:15,750 --> 00:41:17,699 And then the effect kind of bottoms out. 1060 00:41:17,699 --> 00:41:19,240 So this is consistent with the notion 1061 00:41:19,240 --> 00:41:21,520 that your brain is actually relying, in part, 1062 00:41:21,520 --> 00:41:23,920 on the harmonic structure of the speech in order 1063 00:41:23,920 --> 00:41:26,380 to pull, say, two concurrent speakers apart. 1064 00:41:29,410 --> 00:41:31,924 And the other thing to note here, though, 1065 00:41:31,924 --> 00:41:34,090 is that the effect is actually pretty modest, right. 1066 00:41:34,090 --> 00:41:36,131 So you're going from, I don't know, this is like, 1067 00:41:36,131 --> 00:41:39,430 0.65 words correct on a trial down to 0.5. 1068 00:41:39,430 --> 00:41:41,020 So it's like a 20% reduction. 1069 00:41:41,020 --> 00:41:43,155 And the mistuning thing also works with speech. 1070 00:41:43,155 --> 00:41:44,030 This is kind of cool. 1071 00:41:44,030 --> 00:41:46,930 So here we've just taken a single harmonic 1072 00:41:46,930 --> 00:41:48,160 and mistuned it. 1073 00:41:48,160 --> 00:41:50,290 And if you listen to that, I think this is this-- 1074 00:41:50,290 --> 00:41:52,857 you'll basically-- you'll hear the spoken utterance. 1075 00:41:52,857 --> 00:41:54,940 And then it will sound like there's some whistling 1076 00:41:54,940 --> 00:41:56,230 sound on top of it. 
1077 00:41:56,230 --> 00:41:58,834 Because that's what the individual harmonic sounds 1078 00:41:58,834 --> 00:41:59,500 like on its own. 1079 00:41:59,500 --> 00:42:03,430 [FEMALE VOICE 3] Academic aptitude guarantees your diploma. 1080 00:42:03,430 --> 00:42:05,363 So you might have been able to hear-- 1081 00:42:05,363 --> 00:42:06,560 I think this is the-- 1082 00:42:06,560 --> 00:42:09,962 [WHISTLING SOUND] 1083 00:42:09,962 --> 00:42:12,750 That's a little quiet. 1084 00:42:12,750 --> 00:42:15,196 But if you listen again. 1085 00:42:15,196 --> 00:42:18,244 [FEMALE VOICE 3] Academic aptitude guarantees your diploma. 1086 00:42:18,244 --> 00:42:19,910 Yeah, so there's this little other thing 1087 00:42:19,910 --> 00:42:21,535 kind of hiding there in the background. 1088 00:42:21,535 --> 00:42:22,850 But it's kind of hard to hear. 1089 00:42:22,850 --> 00:42:26,510 And that's probably because, particularly in speech, 1090 00:42:26,510 --> 00:42:28,180 there are all these other factors that 1091 00:42:28,180 --> 00:42:29,679 are telling you that thing is speech 1092 00:42:29,679 --> 00:42:32,390 and that it belongs together. 1093 00:42:32,390 --> 00:42:34,370 And, all right, let me just wrap up here. 1094 00:42:34,370 --> 00:42:37,190 So there's a bunch of other demos of this character 1095 00:42:37,190 --> 00:42:39,740 that I could 1096 00:42:39,740 --> 00:42:41,217 tell you about. 1097 00:42:41,217 --> 00:42:43,300 Another thing that actually matters is repetition. 1098 00:42:43,300 --> 00:42:45,447 So if there's something that repeats in the signal, 1099 00:42:45,447 --> 00:42:47,780 your brain is very strongly biased to actually segregate 1100 00:42:47,780 --> 00:42:50,010 that from the background. 1101 00:42:50,010 --> 00:42:52,529 So this is a demonstration of that in action. 
1102 00:42:52,529 --> 00:42:54,320 So what I'm going to be presenting you with 1103 00:42:54,320 --> 00:42:58,010 is a sequence of mixtures of sounds that will 1104 00:42:58,010 --> 00:42:59,180 vary in how many there are. 1105 00:42:59,180 --> 00:43:00,679 And then at the end, you're actually 1106 00:43:00,679 --> 00:43:03,110 going to hear the target sound. 1107 00:43:03,110 --> 00:43:04,730 So if I just give you one-- 1108 00:43:04,730 --> 00:43:07,550 [WHOPPING SOUND] 1109 00:43:07,550 --> 00:43:09,020 All right, it doesn't sound-- 1110 00:43:09,020 --> 00:43:10,950 the sound at the end doesn't sound like what 1111 00:43:10,950 --> 00:43:13,214 you heard in the first thing. 1112 00:43:13,214 --> 00:43:15,380 But, here, you can probably start to hear something. 1113 00:43:15,380 --> 00:43:17,550 [WHOPPING SOUND] 1114 00:43:17,550 --> 00:43:19,010 And with here, you'll hear more. 1115 00:43:19,010 --> 00:43:21,457 [WHOPPING SOUND] 1116 00:43:21,457 --> 00:43:22,790 And with here, it's pretty easy. 1117 00:43:22,790 --> 00:43:26,150 [WHOPPING SOUND] 1118 00:43:26,150 --> 00:43:29,060 All right, so each time you're getting one of these mixtures-- 1119 00:43:29,060 --> 00:43:29,960 and if you just get a single mixture, 1120 00:43:29,960 --> 00:43:31,430 you can't hear anything, right. 1121 00:43:31,430 --> 00:43:33,800 But just by virtue of the fact that there is this latent 1122 00:43:33,800 --> 00:43:35,270 repeating structure in there. 1123 00:43:35,270 --> 00:43:36,800 Your brain is actually able to tell 1124 00:43:36,800 --> 00:43:39,470 that there's a consistent source and segregates that 1125 00:43:39,470 --> 00:43:41,420 from the background. 1126 00:43:41,420 --> 00:43:43,910 I started off by telling you that the only way that you 1127 00:43:43,910 --> 00:43:45,950 can actually solve this problem is 1128 00:43:45,950 --> 00:43:48,500 by incorporating your knowledge of the statistical structure 1129 00:43:48,500 --> 00:43:49,940 of the world. 
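One way to see why repetition is such a strong cue is a simple statistical toy, under the simplifying assumption that the target repeats identically while each background differs. This is not a model of what the brain does, just an illustration of why a repeating source is recoverable in principle:

```python
import numpy as np

# A fixed target is embedded in many mixtures, each with an
# independent random background. Averaging the mixtures makes the
# repeating target emerge, because the backgrounds cancel out.
rng = np.random.default_rng(3)
n = 2000
target = np.sin(2 * np.pi * 5 * np.arange(n) / n)

def mixture():
    return target + rng.standard_normal(n)  # fresh background each time

single = mixture()
average = np.mean([mixture() for _ in range(200)], axis=0)

err_single = np.mean((single - target) ** 2)   # about the noise variance
err_average = np.mean((average - target) ** 2) # shrinks with more mixtures
```

A single mixture tells you almost nothing about the target, but the consistent structure across many mixtures pins it down, mirroring the demo above.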
1130 00:43:49,940 --> 00:43:53,900 And, yet, so far the way that the field has really moved 1131 00:43:53,900 --> 00:43:55,834 has been to basically just use intuitions. 1132 00:43:55,834 --> 00:43:57,500 And so people would look at spectrograms 1133 00:43:57,500 --> 00:43:59,360 and say, oh yeah, there's harmonic structure. 1134 00:43:59,360 --> 00:44:00,235 There's common onset. 1135 00:44:00,235 --> 00:44:02,210 And so then you can do an experiment and show 1136 00:44:02,210 --> 00:44:04,620 that it has some effect. 1137 00:44:04,620 --> 00:44:06,260 But what we'd really like to understand 1138 00:44:06,260 --> 00:44:09,590 is how these so-called grouping cues relate to natural sound 1139 00:44:09,590 --> 00:44:10,605 statistics. 1140 00:44:10,605 --> 00:44:12,230 We'd like to know whether we're optimal 1141 00:44:12,230 --> 00:44:14,360 given the nature of real-world sounds. 1142 00:44:14,360 --> 00:44:15,740 We'd like to know whether these things are actually 1143 00:44:15,740 --> 00:44:17,240 learned from experience with sound-- 1144 00:44:17,240 --> 00:44:19,396 or whether you're born with them. 1145 00:44:19,396 --> 00:44:21,020 And the relative importance of these cues 1146 00:44:21,020 --> 00:44:24,370 compared to knowledge of particular sounds, like words. 1147 00:44:24,370 --> 00:44:26,951 And so I really regard this stuff as in its infancy. 1148 00:44:26,951 --> 00:44:28,700 But I think it's really kind of wide open. 1149 00:44:28,700 --> 00:44:31,100 And so the sort of take-home messages 1150 00:44:31,100 --> 00:44:34,580 here are that there are grouping cues 1151 00:44:34,580 --> 00:44:37,970 that the brain uses to take the sound energy that 1152 00:44:37,970 --> 00:44:40,640 comes into your ears and assign it to different sources that 1153 00:44:40,640 --> 00:44:42,890 are presumed to be related to statistical regularities 1154 00:44:42,890 --> 00:44:44,000 of natural sounds. 
1155 00:44:44,000 --> 00:44:45,470 Some of the ones that we know about 1156 00:44:45,470 --> 00:44:48,804 are, chiefly, harmonicity and common onset and repetition. 1157 00:44:48,804 --> 00:44:49,970 I didn't really get to this. 1158 00:44:49,970 --> 00:44:52,730 But we also know that the brain infers parts of source signals 1159 00:44:52,730 --> 00:44:54,800 that are masked by other sources, 1160 00:44:54,800 --> 00:44:56,260 again, using prior assumptions. 1161 00:44:56,260 --> 00:44:59,232 But we really need a proper theory in this domain, 1162 00:44:59,232 --> 00:45:01,190 I think, both to be able to predict and explain 1163 00:45:01,190 --> 00:45:02,480 real-world performance. 1164 00:45:02,480 --> 00:45:05,480 And also, I think, to be able to relate what humans are doing 1165 00:45:05,480 --> 00:45:07,190 in this domain to the machine algorithms 1166 00:45:07,190 --> 00:45:09,440 that we'd like to be able to develop to sort of replicate 1167 00:45:09,440 --> 00:45:10,880 this sort of competence. 1168 00:45:10,880 --> 00:45:13,550 And on the engineering side-- there was sort of a brief period of time 1169 00:45:13,550 --> 00:45:14,990 where there were some people in engineering that 1170 00:45:14,990 --> 00:45:17,245 were kind of trying to relate things to biology. 1171 00:45:17,245 --> 00:45:19,370 But by and large, the fields have sort of diverged. 1172 00:45:19,370 --> 00:45:21,360 And I think they really need to come back together. 1173 00:45:21,360 --> 00:45:22,985 And so this is going to be a good place 1174 00:45:22,985 --> 00:45:25,840 for bright young people to work.