WEBVTT

00:00:00.050 --> 00:00:02.500
The following content is
provided under a Creative

00:00:02.500 --> 00:00:04.010
Commons license.

00:00:04.010 --> 00:00:06.350
Your support will help
MIT OpenCourseWare

00:00:06.350 --> 00:00:10.720
continue to offer high quality
educational resources for free.

00:00:10.720 --> 00:00:13.330
To make a donation or
view additional materials

00:00:13.330 --> 00:00:17.205
from hundreds of MIT courses,
visit MIT OpenCourseWare

00:00:17.205 --> 00:00:17.830
at ocw.mit.edu.

00:00:21.742 --> 00:00:22.950
BROWN WESTRICK: Good morning.

00:00:22.950 --> 00:00:24.366
My name is Brown
Westrick, and I'm

00:00:24.366 --> 00:00:28.450
going to be talking to you about
the speech synthesis project.

00:00:31.850 --> 00:00:33.820
Our main goal for the
speech synthesis project

00:00:33.820 --> 00:00:38.990
was to create
simulated speech using

00:00:38.990 --> 00:00:43.690
a model of the vocal tract
in which we would model

00:00:43.690 --> 00:00:45.170
the flow of air over time.

00:00:48.680 --> 00:00:51.790
There's existing software
called new speech

00:00:51.790 --> 00:00:53.400
that already does this.

00:00:53.400 --> 00:00:57.030
And we want to deport it to
cell and then improve the speech

00:00:57.030 --> 00:00:59.050
quality that it would
afford us by using

00:00:59.050 --> 00:01:00.385
additional computational cycles.

00:01:04.430 --> 00:01:07.540
So again, new speech
was originally

00:01:07.540 --> 00:01:09.060
developed for
linguistics research

00:01:09.060 --> 00:01:13.100
but now it's available for free
under the new public license.

00:01:13.100 --> 00:01:17.105
It already models airflow in
the vocal tract in real time.

00:01:17.105 --> 00:01:19.480
What this means is that there
are no pre-recorded sounds.

00:01:19.480 --> 00:01:22.780
Many speech
synthesizers nowadays

00:01:22.780 --> 00:01:25.340
have very large
dictionaries of sounds

00:01:25.340 --> 00:01:27.090
that they just piece
together and then try

00:01:27.090 --> 00:01:29.740
to smooth the
transition between them.

00:01:29.740 --> 00:01:33.784
However, this model
attempts to actually do

00:01:33.784 --> 00:01:35.950
what the vocal tract is
doing instead of just trying

00:01:35.950 --> 00:01:38.990
to imitate the end result.

00:01:40.860 --> 00:01:44.130
The quality of the speech,
of this synthesizer,

00:01:44.130 --> 00:01:49.060
the way that it exists is not
as high as the current ones that

00:01:49.060 --> 00:01:50.670
use the recorded libraries.

00:01:50.670 --> 00:01:53.540
But it has potential to
be much better because you

00:01:53.540 --> 00:01:56.090
have so much finer control over
all the different parameters.

00:02:00.310 --> 00:02:03.660
Our goal was that we
have this software that

00:02:03.660 --> 00:02:05.540
would make an acceptable
speech in real time.

00:02:05.540 --> 00:02:07.540
We are hoping it would
be able to take advantage

00:02:07.540 --> 00:02:09.889
of the additional
computational power of cell

00:02:09.889 --> 00:02:13.560
to be able to get an
increase in speech quality.

00:02:13.560 --> 00:02:16.490
And I will it over to
Drew who will tell you

00:02:16.490 --> 00:02:18.620
about the new speech system.

00:02:31.412 --> 00:02:33.257
DREW ALTSCHUL: So
new speech is made up

00:02:33.257 --> 00:02:35.300
of three different major parts.

00:02:35.300 --> 00:02:37.634
The first of which is just
called the new speech engine.

00:02:37.634 --> 00:02:40.050
The second of which is, which
is probably the largest part

00:02:40.050 --> 00:02:41.390
of it, which is called Monet.

00:02:41.390 --> 00:02:43.120
And then the tube
resonance model,

00:02:43.120 --> 00:02:46.840
which is the final part that
actually outputs the sound.

00:02:46.840 --> 00:02:49.515
And as [INAUDIBLE],
the basic processes

00:02:49.515 --> 00:02:52.190
is you take a text
input standard string

00:02:52.190 --> 00:02:54.030
and the new speech
engine will take care of

00:02:54.030 --> 00:02:56.900
and transform it into basic
phonetic information, which

00:02:56.900 --> 00:03:00.675
they will then deal with
and will take that string

00:03:00.675 --> 00:03:04.070
and eventually convert it into
these what we call vocal tract

00:03:04.070 --> 00:03:07.320
parameters, which are basically
parameters that can be sent

00:03:07.320 --> 00:03:09.540
to the tube resonance model.

00:03:09.540 --> 00:03:12.140
And those parameters
will define exactly how

00:03:12.140 --> 00:03:14.800
this tube, which represents
the throat and the nasal tract,

00:03:14.800 --> 00:03:18.060
changes over time to
represent speech sounds.

00:03:18.060 --> 00:03:20.570
And with those parameters, you
can send a signal through it

00:03:20.570 --> 00:03:23.520
and create a voice.

00:03:23.520 --> 00:03:25.747
So the first part
of the example will

00:03:25.747 --> 00:03:28.330
take a perfectly normal string,
like "all your base are belong

00:03:28.330 --> 00:03:31.740
to us," and transform
it into what we

00:03:31.740 --> 00:03:34.694
call the phonetic format of it.

00:03:34.694 --> 00:03:36.578
And you can see it highlighted.

00:03:36.578 --> 00:03:39.060
The actual sounds
are highlighted,

00:03:39.060 --> 00:03:42.450
whereas various parameters are
also included in the output

00:03:42.450 --> 00:03:46.377
string, like /w, and you can
determine where the words are.

00:03:46.377 --> 00:03:49.510
And [INAUDIBLE] which
determine where sentences

00:03:49.510 --> 00:03:52.166
and various phrases end.

00:03:52.166 --> 00:03:55.795
And basically, Gnuspeech
makes uses of dictionary files

00:03:55.795 --> 00:03:58.285
as well as some basic
linguistic models in order

00:03:58.285 --> 00:04:03.386
to create this phonetic output
from the basic input string.

00:04:06.240 --> 00:04:07.990
So having created
that phonetic model,

00:04:07.990 --> 00:04:11.602
you can then send it to Monet,
which is by far the largest

00:04:11.602 --> 00:04:14.120
part of the program,
which in turn will take

00:04:14.120 --> 00:04:17.200
the phonetic information,
and as I said,

00:04:17.200 --> 00:04:21.040
use what a basic
diphone file, which

00:04:21.040 --> 00:04:27.430
takes a very large range
of sounds and characters

00:04:27.430 --> 00:04:29.630
and will then transform
these phonetics

00:04:29.630 --> 00:04:32.650
into direct parameters,
which can represent

00:04:32.650 --> 00:04:36.310
the changing of a throat
the entire nasal tract

00:04:36.310 --> 00:04:38.860
as you voice your own speech.

00:04:38.860 --> 00:04:43.100
So Monet has to go through a
long process of calculating

00:04:43.100 --> 00:04:46.580
these phrases given the whole--
the rhythm, the intonation

00:04:46.580 --> 00:04:49.340
of the phrase that's
being given to Monet.

00:04:49.340 --> 00:04:52.010
And also, a very important
part of the Monet process

00:04:52.010 --> 00:04:55.330
is by taking each phrase
and the postures--

00:04:55.330 --> 00:04:57.770
is what we call the output.

00:04:57.770 --> 00:05:02.870
Looking at the phrase and piece
by piece examining the sounds

00:05:02.870 --> 00:05:07.190
and realizing that as the
postures of the throat change,

00:05:07.190 --> 00:05:10.050
there are important changes
being made between there.

00:05:10.050 --> 00:05:17.320
And basically having some sort
of-- basically a slow change

00:05:17.320 --> 00:05:19.250
between them, not a
gradual conversion

00:05:19.250 --> 00:05:22.450
as opposed to a sudden
change in the actual shape.

00:05:22.450 --> 00:05:25.630
So then having outputted
the basic postures,

00:05:25.630 --> 00:05:28.832
you finally send it to the
Tube Resonance Model, or TRM,

00:05:28.832 --> 00:05:31.290
which will take the vocal tract
which is divided into eight

00:05:31.290 --> 00:05:32.790
sections in this model.

00:05:32.790 --> 00:05:37.440
And send a signal
off of a sine wave

00:05:37.440 --> 00:05:40.450
through the modified
tube resonance model.

00:05:40.450 --> 00:05:43.740
And therefore, all the changes
that occur as time goes on

00:05:43.740 --> 00:05:46.405
and the postures
which change the width

00:05:46.405 --> 00:05:49.360
of the tube at various
stages will then

00:05:49.360 --> 00:05:52.770
cause different speech patterns
to come out and basically

00:05:52.770 --> 00:05:56.560
create an actual speech pattern,
which is usually recognized.

00:05:56.560 --> 00:06:01.880
So basically you have these
three parts from a basic string

00:06:01.880 --> 00:06:05.260
to phonetics to throat
postures until finally you

00:06:05.260 --> 00:06:08.250
get the actual speech out.

00:06:08.250 --> 00:06:11.880
So now I'm handing
it over to Joyce

00:06:11.880 --> 00:06:15.345
to talk a little bit about
the resources and algorithms.

00:06:21.780 --> 00:06:24.007
JOYCE CHEN: Well, before
I talk about the resources

00:06:24.007 --> 00:06:26.895
and algorithms, I'll talk a
little bit about the TRM, which

00:06:26.895 --> 00:06:28.740
is the tube resonance model.

00:06:28.740 --> 00:06:31.370
So we already talked
about how Monet outputs

00:06:31.370 --> 00:06:34.380
like two parameters
based on transitions

00:06:34.380 --> 00:06:37.940
between different words
and postures and so on.

00:06:37.940 --> 00:06:39.760
And the tube resonance
model actually

00:06:39.760 --> 00:06:42.840
simulates the physics
of the vocal tract.

00:06:42.840 --> 00:06:44.770
First you have a glottal source.

00:06:44.770 --> 00:06:47.264
If you have done
any linguistics,

00:06:47.264 --> 00:06:48.930
you might have heard
the little clicking

00:06:48.930 --> 00:06:50.720
sound the glottal source makes.

00:06:50.720 --> 00:06:53.710
There are different ways to
simulate the glottal source.

00:06:53.710 --> 00:06:57.520
Now, ideally, the way you have
a good, natural glottal source

00:06:57.520 --> 00:07:00.750
is you have a simulation
of the physics between two

00:07:00.750 --> 00:07:03.750
oscillating masses as you
air passes through them.

00:07:03.750 --> 00:07:05.620
Now, back in the days
when, say, people

00:07:05.620 --> 00:07:08.760
were doing speech
research on gnu speech,

00:07:08.760 --> 00:07:12.780
actually simulating the physics
of glottis was not possible.

00:07:12.780 --> 00:07:15.010
So what they did
instead was, you know,

00:07:15.010 --> 00:07:17.050
they would try a half
sine wave or they

00:07:17.050 --> 00:07:20.260
would research the most natural
sounding glottal pulse shape,

00:07:20.260 --> 00:07:22.950
initialize a wave table,
and do table lookup

00:07:22.950 --> 00:07:26.060
on it, updating it
with the amplitude

00:07:26.060 --> 00:07:28.380
and so on to change
it a little by little.

00:07:28.380 --> 00:07:31.490
So one of our goals
was possibly to--

00:07:31.490 --> 00:07:34.770
because we harnessed the
additional computational power

00:07:34.770 --> 00:07:37.760
and make more natural
sounding speech.

00:07:37.760 --> 00:07:40.680
And now I'll talk about
allocating the resources.

00:07:43.300 --> 00:07:46.275
For example, in
new speech Monet,

00:07:46.275 --> 00:07:48.430
there is not as
much computation.

00:07:48.430 --> 00:07:49.660
Monet has a lot of rules.

00:07:49.660 --> 00:07:51.500
For example, between
postures and postures,

00:07:51.500 --> 00:07:53.150
like different shapes
of vocal tracts,

00:07:53.150 --> 00:07:55.300
you can't just do an
linear interpolation

00:07:55.300 --> 00:07:56.600
to smoothly change.

00:07:56.600 --> 00:07:58.710
There are different
rules beach in order

00:07:58.710 --> 00:08:00.750
to change between
the postures which

00:08:00.750 --> 00:08:03.970
greatly affects the speech.

00:08:03.970 --> 00:08:10.030
This was much harder to improve
on, like to parallelize.

00:08:10.030 --> 00:08:11.790
Then the tube resonance model.

00:08:11.790 --> 00:08:13.040
It had a lot more computation.

00:08:13.040 --> 00:08:14.581
In fact, the thing
that took probably

00:08:14.581 --> 00:08:17.590
most computation was
after we got our signal

00:08:17.590 --> 00:08:21.210
data from the mouth
end of the simulation,

00:08:21.210 --> 00:08:23.520
we would have to up
sample or down sample it.

00:08:23.520 --> 00:08:25.710
And that was something
that had a lot of potential

00:08:25.710 --> 00:08:27.260
to be parallelized.

00:08:27.260 --> 00:08:29.730
However, when you were
simulating the tube presence

00:08:29.730 --> 00:08:33.400
model, you could only update the
signal inside the vocal tract

00:08:33.400 --> 00:08:34.659
incrementally.

00:08:34.659 --> 00:08:37.440
If you were to break it
up, there was a possibility

00:08:37.440 --> 00:08:40.260
that there would be a lot
of pops in between when

00:08:40.260 --> 00:08:41.850
you try to space them together.

00:08:41.850 --> 00:08:44.507
We thought about trying to
resolve that with interpolation

00:08:44.507 --> 00:08:45.090
between forms.

00:08:47.910 --> 00:08:49.590
There were nested loops.

00:08:49.590 --> 00:08:52.080
The main synthesized
thing had nested loops.

00:08:52.080 --> 00:08:53.480
You had a posture.

00:08:53.480 --> 00:08:55.890
And then you simulate
on the posture

00:08:55.890 --> 00:08:57.470
and between the postures.

00:08:57.470 --> 00:09:00.680
And that took the
most computation,

00:09:00.680 --> 00:09:03.570
as well as updating
the glottal wave table.

00:09:03.570 --> 00:09:04.920
All right.

00:09:04.920 --> 00:09:07.840
Now I will hand it off to Omari
to explain the challenges.

00:09:11.770 --> 00:09:14.490
SPEAKER 5: So the
TRM algorithm, which

00:09:14.490 --> 00:09:16.440
is-- we're most
focused on trying

00:09:16.440 --> 00:09:28.270
to-- so we most tried to
focus on parallelizing the TRM

00:09:28.270 --> 00:09:29.010
algorithm.

00:09:29.010 --> 00:09:32.910
Because both Gnuspeech and
Monet are almost entirely

00:09:32.910 --> 00:09:34.390
just dictionary
lookups involving

00:09:34.390 --> 00:09:37.020
large amounts of memory with
not that much computation.

00:09:37.020 --> 00:09:41.300
So there wasn't really
that much potential

00:09:41.300 --> 00:09:42.790
for parallelizing those.

00:09:42.790 --> 00:09:49.425
So we looked at the tasks
that were being done on TRM

00:09:49.425 --> 00:09:51.340
and profiled them.

00:09:51.340 --> 00:09:54.620
And you can see what
took most of the time

00:09:54.620 --> 00:09:57.940
was the noise generator
part, like the attempt

00:09:57.940 --> 00:10:01.810
at the glottal source
that was being put

00:10:01.810 --> 00:10:05.850
into the tubes and
the actual updates

00:10:05.850 --> 00:10:08.750
where the tubes are supposed
to be as they were shifting.

00:10:08.750 --> 00:10:14.260
And so unfortunately, the
main loop as this was updating

00:10:14.260 --> 00:10:15.630
was very, very fast.

00:10:15.630 --> 00:10:17.690
It was about 15 microseconds.

00:10:17.690 --> 00:10:24.500
So it would be pretty difficult
to update, for example,

00:10:24.500 --> 00:10:28.490
several SPUs as fast
as we needed them to,

00:10:28.490 --> 00:10:32.950
considering how communication
costs affect them.

00:10:35.730 --> 00:10:42.950
So parallelism was not
very simple to use.

00:10:42.950 --> 00:10:46.710
Our main, original idea for
this exploiting parallelism

00:10:46.710 --> 00:10:50.350
was to make a pipeline of the
various parts of the TRM model,

00:10:50.350 --> 00:10:53.690
maybe using three or
four SPUs, like one

00:10:53.690 --> 00:10:56.080
for each part of the
throat as a sound would

00:10:56.080 --> 00:10:57.820
go from one to the next.

00:10:57.820 --> 00:11:00.090
So that all of them would
be engaged simultaneously,

00:11:00.090 --> 00:11:06.380
like going from one posture to
the next in a linear fashion.

00:11:06.380 --> 00:11:09.210
But unfortunately,
the timing for this

00:11:09.210 --> 00:11:12.430
was very, very fast, in the
order of about 70 kilohertz,

00:11:12.430 --> 00:11:15.074
which is many times
a second for SPUs

00:11:15.074 --> 00:11:17.240
to be transferring data
back and forth to each other

00:11:17.240 --> 00:11:19.820
with mailboxes and memory.

00:11:19.820 --> 00:11:42.490
So that was somewhat difficult.

00:11:42.490 --> 00:11:45.080
OMARI: Unfortunately,
with this project,

00:11:45.080 --> 00:11:48.670
we faced a number of
challenges, the first

00:11:48.670 --> 00:11:52.220
and foremost being
that Gnuspeech

00:11:52.220 --> 00:11:55.260
is written in a programming
language most of us

00:11:55.260 --> 00:11:56.720
weren't familiar with.

00:11:56.720 --> 00:11:59.060
And it's huge.

00:11:59.060 --> 00:12:02.330
Monet, for example,
is 30,000 lines.

00:12:02.330 --> 00:12:03.390
It's hardly documented.

00:12:06.090 --> 00:12:08.780
And it took a fair
amount of time

00:12:08.780 --> 00:12:12.360
just reading through and
figuring out what was going on.

00:12:12.360 --> 00:12:14.630
Additionally, because
it uses Gnustep,

00:12:14.630 --> 00:12:19.660
which is a GUI library,
the calls are asynchronous

00:12:19.660 --> 00:12:25.670
and it makes it tremendously
difficult to debug as well.

00:12:25.670 --> 00:12:30.370
I had tried to convert part
of it to C++ to try to get

00:12:30.370 --> 00:12:32.270
the tube running
on one of the SPEs,

00:12:32.270 --> 00:12:34.710
and that took three
days in and of itself.

00:12:34.710 --> 00:12:38.910
And I ended up having to
toss that attempt out.

00:12:38.910 --> 00:12:42.330
Another problem we had was
dynamic pointer alignment.

00:12:42.330 --> 00:12:45.830
In Objective C,
most of the objects

00:12:45.830 --> 00:12:49.130
are stored as dynamic pointers.

00:12:49.130 --> 00:12:55.190
And basically in
Objective C, there's

00:12:55.190 --> 00:12:59.300
also no malloc alignment
or anything of that nature.

00:12:59.300 --> 00:13:02.390
So we couldn't really
transfer any of the objects

00:13:02.390 --> 00:13:08.730
from Objective C memory area
to the SPUs to work on the data

00:13:08.730 --> 00:13:12.190
and then send them back.

00:13:12.190 --> 00:13:13.930
So what is working now?

00:13:13.930 --> 00:13:21.160
We are able to do line buffered
text in the Gnuspeech engine

00:13:21.160 --> 00:13:23.680
and translate that
to utterances so

00:13:23.680 --> 00:13:27.350
the-- phonetic pronunciations.

00:13:27.350 --> 00:13:31.930
And get to the point where we
would execute the tube model.

00:13:31.930 --> 00:13:37.420
Unfortunately, one of
the-- a bug in Gnuspeech

00:13:37.420 --> 00:13:40.060
potentially is preventing
us from properly executing

00:13:40.060 --> 00:13:41.470
the tube model right now.

00:13:44.490 --> 00:13:47.870
So that's one thing we're
having problems with.

00:13:47.870 --> 00:13:52.230
Additionally, the
tube runs on the PPE.

00:13:52.230 --> 00:13:54.930
We've been trying to get
the tube to run on the SPE,

00:13:54.930 --> 00:14:00.220
but it's not going well, partly
because of the dynamic pointer

00:14:00.220 --> 00:14:04.280
alignment issue and partly
because of some other things

00:14:04.280 --> 00:14:07.090
we've run into.

00:14:07.090 --> 00:14:09.400
Currently not working.

00:14:09.400 --> 00:14:14.000
As Drew had mentioned,
there are a lot

00:14:14.000 --> 00:14:18.510
of dictionary lookups
in the preprocessing

00:14:18.510 --> 00:14:22.760
stage of the pipeline.

00:14:22.760 --> 00:14:25.700
And there's a bug
in Gnustep where

00:14:25.700 --> 00:14:31.360
it won't parse the dictionary
if it's above a certain size.

00:14:31.360 --> 00:14:35.230
The dictionary has I
believe 70,000 entries,

00:14:35.230 --> 00:14:37.430
it takes up almost 3 megabytes.

00:14:37.430 --> 00:14:44.500
But if there are more
than like 3,000 entries

00:14:44.500 --> 00:14:46.740
in the dictionary, it
just doesn't parse.

00:14:46.740 --> 00:14:48.160
And we have no idea why.

00:14:52.340 --> 00:14:58.870
So to conclude, this was a
tremendously difficult problem.

00:14:58.870 --> 00:15:02.130
There are a bunch
of data dependencies

00:15:02.130 --> 00:15:07.160
and the synchronization
is very, very close.

00:15:07.160 --> 00:15:11.420
However, we feel
that with more time

00:15:11.420 --> 00:15:14.030
and with more experience
with the code base,

00:15:14.030 --> 00:15:16.870
we would have been
able to parallelize it.

00:15:16.870 --> 00:15:20.970
And the parallelization
can almost

00:15:20.970 --> 00:15:22.620
definitely help
the vocal quality

00:15:22.620 --> 00:15:24.820
in terms of naturalness.

00:15:24.820 --> 00:15:26.960
Getting, for example,
as Joyce mentioned,

00:15:26.960 --> 00:15:29.820
a higher quality glottal source.

00:15:29.820 --> 00:15:33.620
Speaker identification
and vowel identification.

00:15:33.620 --> 00:15:39.630
For example, when you
pronounce different vowels,

00:15:39.630 --> 00:15:43.400
sometimes quality the
glottal source changes.

00:15:43.400 --> 00:15:46.780
And lastly, it would be worth
the time to re-write the whole

00:15:46.780 --> 00:15:51.000
thing from scratch, skipping
Gnustep, skipping Objective C,

00:15:51.000 --> 00:15:55.650
and going with C++ or most
likely C for the whole thing.

00:15:55.650 --> 00:15:56.150
Thank you.

00:15:56.150 --> 00:15:56.805
Any questions?

00:16:02.227 --> 00:16:03.602
AUDIENCE: It sounds
like you guys

00:16:03.602 --> 00:16:05.685
are taking a fine-grained
approach in which you're

00:16:05.685 --> 00:16:09.280
splitting the application
across different units.

00:16:09.280 --> 00:16:12.410
Since you're synthesizing
completely independent words,

00:16:12.410 --> 00:16:16.125
let's say, could you just
run the whole application

00:16:16.125 --> 00:16:17.690
on an SPU?

00:16:17.690 --> 00:16:20.000
There's engineering there,
but from a parallelization

00:16:20.000 --> 00:16:22.367
standpoint, can you just
take the whole application

00:16:22.367 --> 00:16:24.200
and run it, for example,
on different words?

00:16:28.090 --> 00:16:30.340
OMARI: So I believe
you're suggesting

00:16:30.340 --> 00:16:32.420
we run the tube
on different SPEs

00:16:32.420 --> 00:16:36.280
and then feed data to the
separate instances of the tube

00:16:36.280 --> 00:16:37.540
from the PPE?

00:16:37.540 --> 00:16:40.114
AUDIENCE: Well, including--
the whole processing backline.

00:16:40.114 --> 00:16:42.530
I mean, order the sentences
from one sentence to the next?

00:16:42.530 --> 00:16:44.820
OMARI: So the big
stumbling block for us

00:16:44.820 --> 00:16:48.420
was that there isn't
currently an Objective C

00:16:48.420 --> 00:16:54.710
compiler for the SPEs, so
we can't run the Objective C

00:16:54.710 --> 00:16:55.948
code on the SPEs at all.

00:16:55.948 --> 00:16:57.656
AUDIENCE: But if you
did it from scratch,

00:16:57.656 --> 00:17:00.349
if you were to write your
own-- just throw away

00:17:00.349 --> 00:17:03.525
all your Objective C
and start from scratch,

00:17:03.525 --> 00:17:05.483
would that be a better
parallelization strategy

00:17:05.483 --> 00:17:06.706
than a fine-grained one?

00:17:11.740 --> 00:17:13.240
OMARI: Possibly.

00:17:13.240 --> 00:17:18.859
So one of the disadvantages
of splitting up words

00:17:18.859 --> 00:17:22.099
is that there is
state that connects

00:17:22.099 --> 00:17:24.839
the different-- [INAUDIBLE]
continuous state that connects

00:17:24.839 --> 00:17:28.240
the different postures
vocal tract all

00:17:28.240 --> 00:17:29.880
the way through the utterance.

00:17:29.880 --> 00:17:34.090
And so we would have to do
possibly some prediction,

00:17:34.090 --> 00:17:36.680
possibly some
interpolation to figure out

00:17:36.680 --> 00:17:39.480
how to connect the different,
separate utterances which

00:17:39.480 --> 00:17:41.475
should been produced
consecutively.

00:17:41.475 --> 00:17:42.850
AUDIENCE: Permitting
one sentence

00:17:42.850 --> 00:17:44.270
at a time or something?

00:17:44.270 --> 00:17:45.360
OMARI: Yeah.

00:17:45.360 --> 00:17:48.620
That might be an option, yes.