WEBVTT

00:00:07.000 --> 00:00:12.000
We only have four more lectures
left, and what Professor Demaine

00:00:12.000 --> 00:00:18.000
and I have decided to do is give
two series of lectures on sort

00:00:18.000 --> 00:00:22.000
of advanced topics.
So, today and Wednesday we're

00:00:22.000 --> 00:00:27.000
going to talk about parallel
algorithms, algorithms where you

00:00:27.000 --> 00:00:34.000
have more than one processor
whacking away on your problem.

00:00:34.000 --> 00:00:38.000
And this is a very hot topic
right now because all of the

00:00:38.000 --> 00:00:42.000
chip manufacturers are now
producing so-called multicore

00:00:42.000 --> 00:00:47.000
processors where you have more
than one processor per chip.

00:00:47.000 --> 00:00:50.000
So, knowing something about
that is good.

00:00:50.000 --> 00:00:55.000
The second topic we're going to
cover is going to be caching,

00:00:55.000 --> 00:01:00.000
and how you design algorithms
for systems with cache.

00:01:00.000 --> 00:01:03.000
Right now, we've sort of
programmed everything as if it

00:01:03.000 --> 00:01:07.000
were just a single level of
memory, and for some problems

00:01:07.000 --> 00:01:10.000
that's not an entirely realistic
model.

00:01:10.000 --> 00:01:14.000
You'd like to have some model
for how the caching hierarchy

00:01:14.000 --> 00:01:18.000
works, and how you can take
advantage of that.

00:01:18.000 --> 00:01:22.000
And there's been a lot of
research in that area as well.

00:01:22.000 --> 00:01:26.000
So, both of those actually turn
out to be my area of research.

00:01:26.000 --> 00:01:30.000
So, this is actually fun for
me.

00:01:30.000 --> 00:01:33.000
Actually, most of it's fun
anyway.

00:01:33.000 --> 00:01:37.000
So, today we'll talk about
parallel algorithms.

00:01:37.000 --> 00:01:43.000
And the particular topic,
it turns out that there are

00:01:43.000 --> 00:01:49.000
lots of models for parallel
algorithms, and for parallelism.

00:01:49.000 --> 00:01:54.000
And it's one of the reasons
that, whereas for serial

00:01:54.000 --> 00:02:00.000
algorithms, most people sort of
have this basic model that we've

00:02:00.000 --> 00:02:04.000
been using.
It's sometimes called a random

00:02:04.000 --> 00:02:08.000
access machine model,
which is what we've been using

00:02:08.000 --> 00:02:11.000
to analyze things,
whereas in the parallel space,

00:02:11.000 --> 00:02:15.000
there's just a huge number of
models, and there is no general

00:02:15.000 --> 00:02:19.000
agreement on what is the best
model because there are

00:02:19.000 --> 00:02:23.000
different machines that are made
with different configurations,

00:02:23.000 --> 00:02:24.000
etc.
and people haven't,

00:02:24.000 --> 00:02:27.000
sort of, agreed on,
even how parallel machines

00:02:27.000 --> 00:02:32.000
should be organized.
So, we're going to deal with a

00:02:32.000 --> 00:02:37.000
particular model,
which goes under the rubric of

00:02:37.000 --> 00:02:42.000
dynamic multithreading,
which is appropriate for the

00:02:42.000 --> 00:02:48.000
multicore machines that are now
being built for shared memory

00:02:48.000 --> 00:02:52.000
programming.
It's not appropriate for what's

00:02:52.000 --> 00:02:57.000
called distributed memory
programming, particularly because

00:02:57.000 --> 00:03:03.000
it assumes the processors are able to
access things in a common memory.

00:03:03.000 --> 00:03:06.000
And for those,
you need more involved models.

00:03:06.000 --> 00:03:10.000
And so, let me start just by
giving an example of how one

00:03:10.000 --> 00:03:14.000
would write something.
I'm going to give you a program

00:03:14.000 --> 00:03:18.000
for calculating the nth
Fibonacci number in this model.

00:03:18.000 --> 00:03:23.000
This is actually a really bad
algorithm that I'm going to give

00:03:23.000 --> 00:03:28.000
you because it's going to be the
exponential time algorithm,

00:03:28.000 --> 00:03:32.000
whereas we know from week one
or two that you can calculate

00:03:32.000 --> 00:03:37.000
the nth Fibonacci number and how
much time?

00:03:37.000 --> 00:03:40.000
Log n time.
So, this is two exponentials

00:03:40.000 --> 00:03:46.000
off what you should be able to
get, OK, two exponentials off.

00:03:46.000 --> 00:03:49.000
OK, so here's the code.

00:04:36.000 --> 00:04:40.000
OK, so this is essentially the
pseudocode we would write.

00:04:40.000 --> 00:04:44.000
And let me just explain a
little bit about,

00:04:44.000 --> 00:04:48.000
we have a couple of key words
here we haven't seen before:

00:04:48.000 --> 00:04:52.000
in particular,
spawn and sync.

00:04:52.000 --> 00:04:58.000
OK, so spawn,
this basically says that the

00:04:58.000 --> 00:05:07.000
subroutine that you're calling,
you use it as a keyword before

00:05:07.000 --> 00:05:14.000
a subroutine,
that it can execute at the same

00:05:14.000 --> 00:05:21.000
time as its parent.
So, here, when we say x equals

00:05:21.000 --> 00:05:29.000
spawn fib of n minus one,
we immediately go on to the next

00:05:29.000 --> 00:05:36.000
statement.
And now, while we're executing

00:05:36.000 --> 00:05:42.000
fib of n minus one,
we can also be executing,

00:05:42.000 --> 00:05:49.000
now, this statement which
itself will spawn something off.

00:05:49.000 --> 00:05:54.000
OK, and we continue,
and then we hit the sync

00:05:54.000 --> 00:05:58.000
statement.
And, what sync says is,

00:05:58.000 --> 00:06:04.000
wait until all children are
done.

00:06:04.000 --> 00:06:09.000
OK, so it says once you get to
this point, you've got to wait

00:06:09.000 --> 00:06:15.000
until everything here has
completed before you execute the

00:06:15.000 --> 00:06:21.000
x plus y because otherwise
you're going to try to execute

00:06:21.000 --> 00:06:26.000
the calculation of x plus y
without having computed it yet.

00:06:26.000 --> 00:06:31.000
OK, so that's the basic
structure.
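
NOTE Editor's sketch, not from the lecture: the fib pseudocode above,
rendered as runnable code under the assumption of an OpenCilk-style C
toolchain (cilk_spawn and cilk_sync are the OpenCilk spellings of the
spawn and sync keywords used on the board):
    #include <cilk/cilk.h>
    long fib(long n) {
        if (n < 2) return n;             /* base case */
        long x = cilk_spawn fib(n - 1);  /* child may run in parallel with the parent */
        long y = cilk_spawn fib(n - 2);  /* second spawn, as on the board */
        cilk_sync;                       /* wait until all spawned children are done */
        return x + y;                    /* safe: x and y are ready after the sync */
    }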

00:06:31.000 --> 00:06:33.000
What this describes,
notice in here we never said

00:06:33.000 --> 00:06:36.000
how many processors or anything
we are running on.

00:06:36.000 --> 00:06:40.000
OK, so this actually is just
describing logical parallelism

00:06:40.000 --> 00:06:41.000
--

00:06:51.000 --> 00:07:02.000
-- not the actual parallelism
when we execute it.

00:07:02.000 --> 00:07:11.000
And so, what we need is a
scheduler, OK,

00:07:11.000 --> 00:07:25.000
to determine how to map this
dynamically unfolding execution

00:07:25.000 --> 00:07:37.000
onto whatever processors you
have available.

00:07:37.000 --> 00:07:45.000
OK, and so, today actually
we're going to talk mostly about

00:07:45.000 --> 00:07:48.000
scheduling.
OK, and then,

00:07:48.000 --> 00:07:56.000
next time we're going to talk
about specific application

00:07:56.000 --> 00:08:01.000
algorithms, and how you analyze
them.

00:08:01.000 --> 00:08:11.000
OK, so you can view the actual
multithreaded computation.

00:08:11.000 --> 00:08:16.000
If you take a look at the
parallel instruction stream,

00:08:16.000 --> 00:08:21.000
it's just a directed acyclic
graph, OK?

00:08:21.000 --> 00:08:25.000
So, let me show you how that
works.

00:08:25.000 --> 00:08:30.000
So, normally when we have an
instruction stream,

00:08:30.000 --> 00:08:36.000
I look at each instruction
being executed.

00:08:36.000 --> 00:08:38.000
If I'm in a loop,
I'm not looking at it as a

00:08:38.000 --> 00:08:40.000
loop.
I'm just looking at the

00:08:40.000 --> 00:08:42.000
sequence of instructions that
actually executed.

00:08:42.000 --> 00:08:45.000
I can do that just as a chain.
Before I execute one

00:08:45.000 --> 00:08:48.000
instruction, I have to execute
the one before it.

00:08:48.000 --> 00:08:51.000
Before I execute that,
I've got to execute the one

00:08:51.000 --> 00:08:53.000
before it.
At least, that's the

00:08:53.000 --> 00:08:55.000
abstraction.
If you've studied processors,

00:08:55.000 --> 00:08:58.000
you know that there are a lot
of tricks there in figuring out

00:08:58.000 --> 00:09:02.000
instruction level parallelism,
and how you can actually make

00:09:02.000 --> 00:09:07.000
that serial instruction stream
actually execute in parallel.

00:09:07.000 --> 00:09:15.000
But what we are going to be
mostly talking about is the

00:09:15.000 --> 00:09:22.000
logical parallelism here,
and what we can do in that

00:09:22.000 --> 00:09:26.000
context.
So, in this DAG,

00:09:26.000 --> 00:09:34.000
the vertices are threads,
which are maximal sequences of

00:09:34.000 --> 00:09:40.000
instructions not containing --

00:09:47.000 --> 00:09:52.000
-- parallel control.
And by parallel control,

00:09:52.000 --> 00:09:58.000
I just mean spawn,
sync, and return from a spawned

00:09:58.000 --> 00:10:02.000
procedure.
So, let's just mark the,

00:10:02.000 --> 00:10:06.000
so the vertices are threads.
So, let's just mark what the

00:10:06.000 --> 00:10:10.000
vertices are here,
OK, what the threads are here.

00:10:10.000 --> 00:10:16.000
So, when we enter the function
here, we basically execute up to

00:10:16.000 --> 00:10:18.000
the point where,
basically, here,

00:10:18.000 --> 00:10:24.000
let's call that thread A where
we are just doing a sequential

00:10:24.000 --> 00:10:29.000
execution up to either returning
or starting to do the spawn,

00:10:29.000 --> 00:10:33.000
fib of n minus one.
So actually,

00:10:33.000 --> 00:10:38.000
thread A would include the
calculation of n minus one right

00:10:38.000 --> 00:10:43.000
up to the point where you
actually make the subroutine

00:10:43.000 --> 00:10:45.000
jump.
That's thread A.

00:10:45.000 --> 00:10:49.000
Thread B would be the stuff
that you would do,

00:10:49.000 --> 00:10:54.000
executing from fib of,
sorry, B would be from the,

00:10:54.000 --> 00:10:57.000
right.
We'd go up to the spawn.

00:10:57.000 --> 00:11:03.000
So, we've done the spawn.
I'm really looking at this.

00:11:03.000 --> 00:11:05.000
So, B would be up to the spawn
of y.

00:11:05.000 --> 00:11:09.000
OK, spawn of fib of n minus two
to compute y,

00:11:09.000 --> 00:11:12.000
and then we'd have essentially
an empty thread.

00:11:12.000 --> 00:11:17.000
So, I'll ignore that for now,
but really then we have after

00:11:17.000 --> 00:11:22.000
the sync up to the point that we
get to the return of x plus y.

00:11:22.000 --> 00:11:25.000
So basically,
we're just looking at maximal

00:11:25.000 --> 00:11:30.000
sequences of instructions that
are all serial.

00:11:30.000 --> 00:11:34.000
And every time I do a parallel
instruction, OK,

00:11:34.000 --> 00:11:37.000
spawn or a sync,
or return from it,

00:11:37.000 --> 00:11:40.000
that terminates the current
thread.

00:11:40.000 --> 00:11:45.000
OK, so we can look at that as a
bunch of small threads.

00:11:45.000 --> 00:11:50.000
So those of you who are
familiar with threads from Java

00:11:50.000 --> 00:11:54.000
threads, or POSIX threads,
OK, so-called P threads,

00:11:54.000 --> 00:12:00.000
those are sort of heavyweight
static threads.

00:12:00.000 --> 00:12:04.000
This is a much lighter weight
notion of thread,

00:12:04.000 --> 00:12:08.000
OK, that we are using in this
model.

00:12:08.000 --> 00:12:13.000
OK, so these are the vertices.
And now, let me map out a

00:12:13.000 --> 00:12:19.000
little bit how this works,
so we can see where the edges come

00:12:19.000 --> 00:12:21.000
from.
So, let's imagine we're

00:12:21.000 --> 00:12:26.000
executing fib of four.
So, I'm going to draw a

00:12:26.000 --> 00:12:31.000
horizontal oval.
That's going to correspond to

00:12:31.000 --> 00:12:36.000
the procedure execution.
And, in this procedure,

00:12:36.000 --> 00:12:39.000
there are essentially three
threads.

00:12:39.000 --> 00:12:44.000
We start out with A,
so this is our initial thread

00:12:44.000 --> 00:12:49.000
is this guy here.
And then, when he executes a

00:12:49.000 --> 00:12:55.000
spawn, OK, he's going to execute
a spawn, we are going to create

00:12:55.000 --> 00:13:00.000
a new procedure,
and he's going to execute a new

00:13:00.000 --> 00:13:05.000
A recursively within that
procedure.

00:13:05.000 --> 00:13:09.000
But at the same time,
we're also going to be,

00:13:09.000 --> 00:13:14.000
now, allowed to go on and execute
B in the parent,

00:13:14.000 --> 00:13:18.000
we have parallelism here when I
do a spawn.

00:13:18.000 --> 00:13:21.000
OK, and so there's an edge
here.

00:13:21.000 --> 00:13:25.000
This edge we are going to call
a spawn edge,

00:13:25.000 --> 00:13:31.000
and this is called a
continuation edge because it's

00:13:31.000 --> 00:13:37.000
just simply continuing the
procedure execution.

00:13:37.000 --> 00:13:41.000
OK, now at this point,
this guy, we now have two

00:13:41.000 --> 00:13:45.000
things that can execute at the
same time.

00:13:45.000 --> 00:13:49.000
Once I've executed A,
I now have two things that can

00:13:49.000 --> 00:13:52.000
execute.
OK, so this one,

00:13:52.000 --> 00:13:56.000
for example,
may spawn another thread here.

00:13:56.000 --> 00:13:59.000
Oh, so this is fib of three,
right?

00:13:59.000 --> 00:14:07.000
And this is now fib of two.
OK, so he spawns another guy

00:14:07.000 --> 00:14:15.000
here, and simultaneously,
he can go on and execute B

00:14:15.000 --> 00:14:22.000
here, OK, with a continuation edge.
And B, in fact,

00:14:22.000 --> 00:14:32.000
can also spawn at this point.
OK, and this is now fib of two

00:14:32.000 --> 00:14:36.000
also.
And now, at this point,

00:14:36.000 --> 00:14:44.000
we can't execute C yet here
even though I've spawned things

00:14:44.000 --> 00:14:48.000
off.
And the reason is because C

00:14:48.000 --> 00:14:54.000
won't execute until we've
executed the sync statement,

00:14:54.000 --> 00:15:01.000
which can't occur until A and B
have both been executed,

00:15:01.000 --> 00:15:06.000
OK?
So, he just sort of sits there

00:15:06.000 --> 00:15:12.000
waiting, OK, and a scheduler
shouldn't try to schedule him.

00:15:12.000 --> 00:15:18.000
Or if he does,
then nothing's going to happen

00:15:18.000 --> 00:15:21.000
here, OK?
So, we can go on.

00:15:21.000 --> 00:15:25.000
Let's see, here we could call
fib of one.

00:15:25.000 --> 00:15:34.000
The fib of one is only going to
execute an A statement here.

00:15:34.000 --> 00:15:39.000
OK, of course it can't continue
here because A is the only

00:15:39.000 --> 00:15:45.000
thing, when I execute fib of
one, if we look at the code,

00:15:45.000 --> 00:15:50.000
it never executes B or C.
OK, and similarly here,

00:15:50.000 --> 00:15:55.000
this guy here to do fib of one.
OK, and this guy,

00:15:55.000 --> 00:16:01.000
I guess, could execute A here
of fib of one.

00:16:10.000 --> 00:16:17.000
OK, and maybe now this guy
calls another fib of one,

00:16:17.000 --> 00:16:25.000
and this guy does another one.
This is going to be fib of

00:16:25.000 --> 00:16:31.000
zero, right?
I keep drawing that arrow to

00:16:31.000 --> 00:16:35.000
the wrong place,
OK?

00:16:35.000 --> 00:16:38.000
And now, once these guys
return, well,

00:16:38.000 --> 00:16:42.000
let's say these guys return
here, I can now execute C.

00:16:42.000 --> 00:16:47.000
But I can't execute this one
until both of these guys are

00:16:47.000 --> 00:16:52.000
done, and that guy is done.
So, you see that we get a

00:16:52.000 --> 00:16:56.000
synchronization point here
before executing C.

00:16:56.000 --> 00:17:01.000
And then, similarly here,
now that we've executed this

00:17:01.000 --> 00:17:06.000
and this, we can now execute
this guy here.

00:17:06.000 --> 00:17:11.000
And so, those returns go to
there.

00:17:11.000 --> 00:17:17.000
Likewise here,
this guy can now execute his C,

00:17:17.000 --> 00:17:26.000
and now once both of those are
done, we can execute this guy

00:17:26.000 --> 00:17:30.000
here.
And then we are done.

00:17:30.000 --> 00:17:41.000
This is our final thread.
So, I should have labeled also

00:17:41.000 --> 00:17:53.000
that when I get one of these
guys here, that's a return edge.

00:17:53.000 --> 00:18:01.000
So, the three types of edges
are spawn, return,

00:18:01.000 --> 00:18:08.000
and continuation.
OK, and by describing it in

00:18:08.000 --> 00:18:11.000
this way, I essentially get a
DAG that unfolds.

00:18:11.000 --> 00:18:15.000
So, rather than having just a
serial execution trace,

00:18:15.000 --> 00:18:19.000
I get something where I have
still some serial dependencies.

00:18:19.000 --> 00:18:23.000
There are still some things
that have to be done before

00:18:23.000 --> 00:18:27.000
other things,
but there are also things that

00:18:27.000 --> 00:18:31.000
can be done at the same time.
So how are we doing?

00:18:31.000 --> 00:18:35.000
Yeah, question?
Is every spawn covered by

00:18:35.000 --> 00:18:38.000
a sync? Effectively,
yeah, yeah, effectively.

00:18:38.000 --> 00:18:43.000
There's actually a null thread
that gets executed in there,

00:18:43.000 --> 00:18:45.000
which I hadn't bothered to
show.

00:18:45.000 --> 00:18:50.000
But yes, basically you would
then not have any parallelism,

00:18:50.000 --> 00:18:54.000
OK, because you would spawn it
off, but then you're not doing

00:18:54.000 --> 00:18:58.000
anything in the parent.
So it's pretty much the same,

00:18:58.000 --> 00:19:03.000
yeah, as if it had executed
serially.

00:19:03.000 --> 00:19:06.000
Yep, OK, so you can see that
basically what we had here in

00:19:06.000 --> 00:19:09.000
some sense is a DAG embedded in
a tree.

00:19:09.000 --> 00:19:13.000
OK, so you have a tree that's
sort of the procedure structure,

00:19:13.000 --> 00:19:16.000
but in there you have a DAG,
and that DAG can actually get

00:19:16.000 --> 00:19:20.000
to be pretty complicated.
OK, now what I want to do is

00:19:20.000 --> 00:19:23.000
now that we understand that
we've got an underlying DAG,

00:19:23.000 --> 00:19:27.000
I want to switch to trying to
study the performance attributes

00:19:27.000 --> 00:19:31.000
of a particular DAG execution,
so looking at performance

00:19:31.000 --> 00:19:33.000
measures.

00:19:45.000 --> 00:19:55.000
So, the notation that we'll use
is we'll let T_P be the running

00:19:55.000 --> 00:20:05.000
time of whatever our computation
is on P processors.

00:20:05.000 --> 00:20:07.000
OK, so, T_P is,
how long does it take to

00:20:07.000 --> 00:20:10.000
execute this on P processors?
Now, in general,

00:20:10.000 --> 00:20:13.000
this is not going to be just a
particular number,

00:20:13.000 --> 00:20:17.000
OK, because different
scheduling disciplines

00:20:17.000 --> 00:20:20.000
would lead me to get different numbers for
T_P, OK?

00:20:20.000 --> 00:20:22.000
But when we talk about the
running time,

00:20:22.000 --> 00:20:26.000
we'll still sort of use this
notation, and I'll try to be

00:20:26.000 --> 00:20:30.000
careful as we go through to make
sure that there's no confusion

00:20:30.000 --> 00:20:34.000
about what that means in
context.

00:20:34.000 --> 00:20:38.000
There are a couple of them,
though, which are fairly well

00:20:38.000 --> 00:20:40.000
defined.
One is based on this.

00:20:40.000 --> 00:20:43.000
One is T_1.
So, T_1 is the running time on

00:20:43.000 --> 00:20:46.000
one processor.
OK, so if I were to execute

00:20:46.000 --> 00:20:49.000
this on one processor,
you can imagine it's just as if

00:20:49.000 --> 00:20:53.000
I had just gotten rid of the
spawn, and syncs,

00:20:53.000 --> 00:20:55.000
and everything,
and just executed it.

00:20:55.000 --> 00:21:00.000
That will give me a particular
running time.

00:21:00.000 --> 00:21:06.000
We call that running time on
one processor the work.

00:21:06.000 --> 00:21:10.000
It's essentially the serial
time.

00:21:10.000 --> 00:21:16.000
OK, so when we talk about the
work of a computation,

00:21:16.000 --> 00:21:22.000
we just mean essentially the
serial running time.

00:21:22.000 --> 00:21:30.000
OK, the other measure that ends
up being interesting is what we

00:21:30.000 --> 00:21:35.000
call T infinity.
OK, and this is the critical

00:21:35.000 --> 00:21:40.000
pathlength, OK,
which is essentially the

00:21:40.000 --> 00:21:46.000
longest path in the DAG.
So, for example,

00:21:46.000 --> 00:21:50.000
if we look at the fib of four
in this example,

00:21:50.000 --> 00:21:54.000
it has T of one equal to,
so let's assume we have unit

00:21:54.000 --> 00:21:58.000
time threads.
I know they're not unit time,

00:21:58.000 --> 00:22:01.000
but let's just imagine,
for the purposes of

00:22:01.000 --> 00:22:06.000
understanding this,
that every thread costs me one

00:22:06.000 --> 00:22:12.000
unit of time to execute.
What would be the work of this

00:22:12.000 --> 00:22:16.000
particular computation?
17, right, OK,

00:22:16.000 --> 00:22:21.000
because all we do is just add
up three, six,

00:22:21.000 --> 00:22:24.000
nine, 12, 13,
14, 15, 16, 17.

00:22:24.000 --> 00:22:32.000
So, the work is 17 in this case
if it were unit time threads.

00:22:32.000 --> 00:22:35.000
In general, you would add up
how many instructions or

00:22:35.000 --> 00:22:39.000
whatever were in there.
OK, and then T infinity is the

00:22:39.000 --> 00:22:42.000
longest path.
So, this is the longest

00:22:42.000 --> 00:22:44.000
sequence.
It's like, if you had an

00:22:44.000 --> 00:22:48.000
infinite number of processors,
you still can't just do

00:22:48.000 --> 00:22:52.000
everything at once because some
things have to come before other

00:22:52.000 --> 00:22:55.000
things.
But if you had an infinite

00:22:55.000 --> 00:22:59.000
number of processors,
as many processors as you want,

00:22:59.000 --> 00:23:04.000
what's the fastest you could
possibly execute this?

00:23:04.000 --> 00:23:07.000
A little trickier.
Seven?

00:23:07.000 --> 00:23:12.000
So, what's your seven?
So, one, two,

00:23:12.000 --> 00:23:17.000
three, four,
five, six, seven,

00:23:17.000 --> 00:23:22.000
eight, yeah,
eight is the longest path.
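
NOTE Editor's note, restating the worked example in symbols: with
unit-time threads, fib(4) has work $T_1 = 17$ (one unit per thread)
and critical-path length $T_\infty = 8$ (the longest chain of
threads), so even with unboundedly many processors this computation
needs at least 8 steps.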

00:23:22.000 --> 00:23:30.000
So, the work and the critical
path length, as we'll see,

00:23:30.000 --> 00:23:38.000
are key attributes of any
computation.

00:23:38.000 --> 00:23:44.000
And abstractly,
and this is just for the

00:23:44.000 --> 00:23:50.000
notes, if they're unit time
threads.

00:23:50.000 --> 00:23:59.000
OK, so we can use these two
measures to derive lower bounds

00:23:59.000 --> 00:24:07.000
on T_P for P that fall between
one and infinity,

00:24:07.000 --> 00:24:09.000
OK?

00:24:20.000 --> 00:24:30.000
OK, so the first lower bound we
can derive is that T_P has got

00:24:30.000 --> 00:24:39.000
to be at least T_1 over P.
OK, so why is that a lower

00:24:39.000 --> 00:24:42.000
bound?
Yeah?

00:24:42.000 --> 00:24:57.000
But if I have P processors,
and, OK, and why would I have

00:24:57.000 --> 00:25:05.000
this lower bound?
OK, yeah, you've got the right

00:25:05.000 --> 00:25:07.000
idea.
So, but can we be a little bit

00:25:07.000 --> 00:25:10.000
more articulate about it?
So, that's right,

00:25:10.000 --> 00:25:13.000
so you want to use all of
the processors.

00:25:13.000 --> 00:25:17.000
If you could use all of
the processors, why couldn't I use

00:25:17.000 --> 00:25:20.000
all the processors,
though, and have T_P be less

00:25:20.000 --> 00:25:23.000
than this?
Why does it have to be at least

00:25:23.000 --> 00:25:27.000
as big as T_1 over P?
I'm just asking for a little

00:25:27.000 --> 00:25:31.000
more precision in the answer.
You've got exactly the right

00:25:31.000 --> 00:25:35.000
idea, but I need a little more
precision if we're going to

00:25:35.000 --> 00:25:41.000
persuade the rest of the class
that this is the lower bound.

00:25:41.000 --> 00:25:42.000
Yeah?

00:25:50.000 --> 00:25:53.000
Yeah, that's another way of
looking at it.

00:25:53.000 --> 00:25:56.000
If you were to serialize the
computation, OK,

00:25:56.000 --> 00:25:59.000
so whatever things you execute
on each step,

00:25:59.000 --> 00:26:02.000
you do P of them,
and so if you serialized it,

00:26:02.000 --> 00:26:07.000
somehow then it would take you
P steps to execute one step of a

00:26:07.000 --> 00:26:09.000
P-way machine, a machine with P
processors.

00:26:09.000 --> 00:26:11.000
So then, OK,
yeah?

00:26:11.000 --> 00:26:13.000
OK, maybe a little more
precise.

00:26:13.000 --> 00:26:15.000
David?

00:26:28.000 --> 00:26:33.000
Yeah, good, so let me just
state this a little bit.

00:26:33.000 --> 00:26:38.000
So, P processors,
so what are we relying on?

00:26:38.000 --> 00:26:43.000
P processors can do,
at most, P work in one step,

00:26:43.000 --> 00:26:47.000
right?
So, in one step they do,

00:26:47.000 --> 00:26:52.000
at most P work.
They can't do more than P work.

00:26:52.000 --> 00:26:58.000
And so, if they can do,
at most P work in one step,

00:26:58.000 --> 00:27:02.000
then if the number of steps
was, in fact,

00:27:02.000 --> 00:27:08.000
less than T_1 over P,
they would be able to do more

00:27:08.000 --> 00:27:15.000
than T_1 work in P steps.
And, there's only T_1 work to

00:27:15.000 --> 00:27:19.000
be done.
OK, I just stated that almost

00:27:19.000 --> 00:27:22.000
as badly as all the responses I
got.

00:27:22.000 --> 00:27:25.000
[LAUGHTER] OK,
P processors can do,

00:27:25.000 --> 00:27:30.000
at most, P work in one step,
right?

00:27:30.000 --> 00:27:34.000
So, if there's T_1 work to be
done, the number of steps is

00:27:34.000 --> 00:27:37.000
going to be at least T_1 over P,
OK?

00:27:37.000 --> 00:27:40.000
There we go.
OK, it wasn't that hard.

00:27:40.000 --> 00:27:43.000
It's just like,
I've got a certain amount of,

00:27:43.000 --> 00:27:46.000
I've got T_1 work to do.
I can knock off,

00:27:46.000 --> 00:27:49.000
at most, P on every step.
How many steps?

00:27:49.000 --> 00:27:53.000
Just divide.
OK, so it's going to have to be

00:27:53.000 --> 00:27:55.000
at least that amount.
OK, good.

00:27:55.000 --> 00:27:59.000
The other lower bound is T_P is
greater than or equal to T

00:27:59.000 --> 00:28:04.000
infinity.
Somebody explain to me why that

00:28:04.000 --> 00:28:06.000
might be true.
Yeah?

00:28:06.000 --> 00:28:10.000
Yeah, if you have an infinite
number of processors,

00:28:10.000 --> 00:28:13.000
you have P.
So if you could do it in a

00:28:13.000 --> 00:28:18.000
certain amount of time with P,
you can certainly do it in that

00:28:18.000 --> 00:28:21.000
time with an infinite number of
processors.

00:28:21.000 --> 00:28:25.000
OK, this is in this model
where, you know,

00:28:25.000 --> 00:28:29.000
there is lots of stuff that
this model doesn't model like

00:28:29.000 --> 00:28:32.000
communication costs and
interference,

00:28:32.000 --> 00:28:37.000
and all sorts of things.
But it is a simple model,

00:28:37.000 --> 00:28:41.000
which actually in practice
works out pretty well,

00:28:41.000 --> 00:28:45.000
OK, you're not going to be able
to do more work with P

00:28:45.000 --> 00:28:51.000
processors than you are with an
infinite number of processors.
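
NOTE Editor's note, the two lower bounds just argued, collected in
one line:
$$T_P \;\ge\; \max\!\left(\frac{T_1}{P},\; T_\infty\right),$$
since P processors can do at most P work per step (giving
$T_P \ge T_1/P$), and P processors can never beat infinitely many
(giving $T_P \ge T_\infty$).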

00:29:06.000 --> 00:29:12.000
OK, so those are helpful bounds
to understand when we are trying

00:29:12.000 --> 00:29:17.000
to make something go faster,
it's nice to know what you

00:29:17.000 --> 00:29:23.000
could possibly hope to achieve,
OK, as opposed to beating your

00:29:23.000 --> 00:29:28.000
head against a wall,
how come I can't get it to go

00:29:28.000 --> 00:29:33.000
much faster?
Maybe it's because one of these

00:29:33.000 --> 00:29:39.000
lower bounds is operating.
OK, well, we're interested in

00:29:39.000 --> 00:29:44.000
how fast we can go.
That's the main reason for

00:29:44.000 --> 00:29:51.000
using multiple processors is you
hope you're going to go faster

00:29:51.000 --> 00:29:55.000
than you could with one
processor.

00:29:55.000 --> 00:30:03.000
So, we define T_1 over T_P to
be the speedup on P processors.

00:30:03.000 --> 00:30:09.000
OK, so we say,
how much faster is it on P

00:30:09.000 --> 00:30:14.000
processors than on one
processor?

00:30:14.000 --> 00:30:22.000
OK, that's the speed up.
If T_1 over T_P is order P,

00:30:22.000 --> 00:30:27.000
we say that it's linear
speedup.

00:30:27.000 --> 00:30:32.000
OK, in other words,
why?

00:30:32.000 --> 00:30:38.000
Because that says that it means
that if I've thrown P processors

00:30:38.000 --> 00:30:44.000
at the job I'm going to get a
speedup that's proportional to

00:30:44.000 --> 00:30:46.000
P.
OK, so when I throw P

00:30:46.000 --> 00:30:51.000
processors at the job and I get
T_P, if that's order P,

00:30:51.000 --> 00:30:57.000
that means that in some sense
my processors each contributed

00:30:57.000 --> 00:31:04.000
within a constant factor its
full measure of support.

00:31:04.000 --> 00:31:08.000
If this, in fact,
were equal to P,

00:31:08.000 --> 00:31:13.000
we'd call that perfect linear
speedup.

00:31:13.000 --> 00:31:20.000
OK, so but here we're looking
at giving ourselves,

00:31:20.000 --> 00:31:27.000
for theoretical purposes,
a little bit of a constant

00:31:27.000 --> 00:31:34.000
buffer here, perhaps.
If T_1 over T_P is greater than

00:31:34.000 --> 00:31:41.000
P, we call that super linear
speedup.

00:31:41.000 --> 00:31:45.000
OK, so can somebody tell me,
when can I get super linear

00:31:45.000 --> 00:31:46.000
speedup?

00:31:56.000 --> 00:31:59.000
When can I get super linear
speed up?

00:31:59.000 --> 00:32:01.000
Never.
OK, why never?

00:32:01.000 --> 00:32:06.000
Yeah, if we buy these lower
bounds, the first lower bound

00:32:06.000 --> 00:32:11.000
there, it is T_P is greater than
or equal to T_1 over P.

00:32:11.000 --> 00:32:17.000
And, if I just take T_1 over
T_P, that says it's less than or

00:32:17.000 --> 00:32:19.000
equal to P.
So, this is never,

00:32:19.000 --> 00:32:25.000
OK, not possible in this model.
OK, there are other models

00:32:25.000 --> 00:32:30.000
where it is possible to get
super linear speed up due to

00:32:30.000 --> 00:32:36.000
caching effects,
and things of that nature.

00:32:36.000 --> 00:32:43.000
But in this simple model that
we are dealing with,

00:32:43.000 --> 00:32:50.000
it's not possible to get super
linear speedup.

00:32:50.000 --> 00:32:57.000
OK, not possible.
Now, the maximum possible

00:32:57.000 --> 00:33:06.000
speedup, given some amount of
work and critical path length is

00:33:06.000 --> 00:33:13.000
what?
What's the maximum possible

00:33:13.000 --> 00:33:20.000
speed up I could get over any
number of processors?

00:33:20.000 --> 00:33:26.000
What's the maximum I could
possibly get?

00:33:26.000 --> 00:33:32.000
No, I'm saying,
no matter how many processors,

00:33:32.000 --> 00:33:40.000
what's the most speedup that I
could get?

00:33:40.000 --> 00:33:44.000
T_1 over T infinity,
because this is the,

00:33:44.000 --> 00:33:49.000
so T_1 over T infinity is the
maximum I could possibly get.

00:33:49.000 --> 00:33:55.000
OK, if I threw an infinite
number of processors at the

00:33:55.000 --> 00:34:00.000
problem, that's going to give me
my biggest speedup.

00:34:00.000 --> 00:34:05.000
OK, and we call that the
parallelism.

00:34:05.000 --> 00:34:08.000
OK, so that's defined to be the
parallelism.

00:34:08.000 --> 00:34:11.000
So the parallelism of the
particular algorithm is

00:34:11.000 --> 00:34:16.000
essentially the work divided by
the critical path length.

00:34:16.000 --> 00:34:31.000
Another way of viewing it is
that this is the average amount

00:34:31.000 --> 00:34:46.000
of work that can be done in
parallel along each step of the

00:34:46.000 --> 00:34:57.000
critical path.
And, we denote it often by P

00:34:57.000 --> 00:35:01.000
bar.
So, do not get confused.

00:35:01.000 --> 00:35:05.000
P bar does not have anything to
do with P at some level.

00:35:05.000 --> 00:35:10.000
OK, P is going to be a certain
number of processors you're

00:35:10.000 --> 00:35:13.000
running.
P bar is defined just in terms

00:35:13.000 --> 00:35:17.000
of the computation you're
executing, not in terms of the

00:35:17.000 --> 00:35:21.000
machine you're running it on.
OK, it's just the average

00:35:21.000 --> 00:35:25.000
amount of work that can be done
in parallel along each step of

00:35:25.000 --> 00:35:30.000
the critical path.
OK, questions so far?
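
NOTE Editor's note collecting the definitions: speedup on P
processors is $T_1/T_P$, and the first lower bound gives
$T_1/T_P \le P$, which is why superlinear speedup is impossible in
this model. The maximum speedup over any number of processors is the
parallelism $\bar{P} = T_1/T_\infty$; for the unit-time fib(4)
example above, $\bar{P} = 17/8 \approx 2.1$.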

00:35:30.000 --> 00:35:33.000
So mostly we're just doing
definitions so far.

00:35:33.000 --> 00:35:37.000
OK, now we get into,
OK, so it's helpful to know

00:35:37.000 --> 00:35:41.000
what the parallelism is,
because the parallelism is

00:35:41.000 --> 00:35:46.000
going to, there's no real point
in trying to get speed up bigger

00:35:46.000 --> 00:35:50.000
than the parallelism.
OK, so if you are given a

00:35:50.000 --> 00:35:53.000
particular computation,
you'll be able to say,

00:35:53.000 --> 00:35:58.000
oh, it doesn't go any faster.
You're throwing more processors

00:35:58.000 --> 00:36:03.000
at it.
Why isn't it going any

00:36:03.000 --> 00:36:07.000
faster?
And the answer could be,

00:36:07.000 --> 00:36:14.000
no more parallelism.
OK, let's see what I want to,

00:36:14.000 --> 00:36:20.000
yeah, I think we can erase the
example here.

00:36:20.000 --> 00:36:25.000
We'll talk more about this
model.

00:36:25.000 --> 00:36:31.000
Mostly, now,
we're going to just talk about

00:36:31.000 --> 00:36:35.000
DAG's.
So, we'll talk about the

00:36:35.000 --> 00:36:43.000
programming model next time.
So, let's talk about

00:36:43.000 --> 00:36:48.000
scheduling.
The goal of a scheduler is to map

00:36:48.000 --> 00:36:55.000
the computation to P processors.
And this is typically done by a

00:36:55.000 --> 00:36:59.000
runtime system,
which, if you will,

00:36:59.000 --> 00:37:06.000
is an algorithm that is running
underneath the language layer

00:37:06.000 --> 00:37:12.000
that I showed you.
OK, so the programmer designs

00:37:12.000 --> 00:37:15.000
an algorithm using spawns,
and syncs, and so forth.

00:37:15.000 --> 00:37:19.000
Then, underneath that,
there's an algorithm that has

00:37:19.000 --> 00:37:24.000
to actually map that executing
program onto the processors of

00:37:24.000 --> 00:37:27.000
the machine as it executes.
And that's the scheduler.

00:37:27.000 --> 00:37:31.000
OK, so it's done by the
language runtime system,

00:37:31.000 --> 00:37:37.000
typically.
OK, so it turns out that online

00:37:37.000 --> 00:37:42.000
schedulers, let me just say
they're complex.

00:37:42.000 --> 00:37:49.000
OK, they're not necessarily
easy things to build.

00:37:49.000 --> 00:37:53.000
OK, they're not too bad
actually.

00:37:53.000 --> 00:38:01.000
But, we are not going to go
there because we only have two

00:38:01.000 --> 00:38:07.000
lectures to do this.
Instead, what we're going to do is

00:38:07.000 --> 00:38:16.000
we'll illustrate the ideas using
off-line scheduling.

00:38:16.000 --> 00:38:20.000
OK, so you'll get an idea out
of this for what a scheduler

00:38:20.000 --> 00:38:24.000
does, and it turns out that
doing these things online is

00:38:24.000 --> 00:38:27.000
another level of complexity
beyond that.

00:38:27.000 --> 00:38:31.000
And typically,
the online schedulers that are

00:38:31.000 --> 00:38:35.000
good, these days,
are randomized schedulers.

00:38:35.000 --> 00:38:42.000
And they have very strong
proofs of their ability to

00:38:42.000 --> 00:38:46.000
perform.
But we're not going to go

00:38:46.000 --> 00:38:50.000
there.
We'll keep it simple.

00:38:50.000 --> 00:38:56.000
And in particular,
we're going to look at a

00:38:56.000 --> 00:39:05.000
particular type of scheduler
called a greedy scheduler.

00:39:05.000 --> 00:39:09.000
So, if you have a DAG to
execute, so the basic rules of

00:39:09.000 --> 00:39:15.000
the scheduler is you can't
execute a node until all of the

00:39:15.000 --> 00:39:19.000
nodes that precede it in the DAG
have executed.

00:39:19.000 --> 00:39:24.000
OK, so you've got to wait until
everything is executed.

00:39:24.000 --> 00:39:29.000
So, a greedy scheduler,
what it says is let's just try

00:39:29.000 --> 00:39:34.000
to do as much as possible on
every step, OK?

00:39:50.000 --> 00:39:52.000
In other words,
it says I'm never going to try

00:39:52.000 --> 00:39:56.000
to guess that it's worthwhile
delaying doing something.

00:39:56.000 --> 00:40:00.000
If I could do something now,
I'm going to do it.

00:40:00.000 --> 00:40:08.000
And so, each step is going to
correspond to be one of two

00:40:08.000 --> 00:40:13.000
types.
The first type is what we'll

00:40:13.000 --> 00:40:21.000
call a complete step.
And this is a step in which

00:40:21.000 --> 00:40:27.000
there are at least P threads
ready to run.

00:40:27.000 --> 00:40:34.000
And, I'm executing on P
processors.

00:40:34.000 --> 00:40:38.000
There are at least P threads
ready to run.

00:40:38.000 --> 00:40:42.000
So, what's a greedy strategy
here?

00:40:42.000 --> 00:40:48.000
I've got P processors.
I've got at least P threads.

00:40:48.000 --> 00:40:52.000
Run any P.
Yeah, first P would be if you

00:40:52.000 --> 00:40:57.000
had a notion of ordering.
That would be perfectly

00:40:57.000 --> 00:41:02.000
reasonable.
Here, we are just going to

00:41:02.000 --> 00:41:07.000
execute any P.
We might make a mistake there,

00:41:07.000 --> 00:41:10.000
because there may be a
particular one that if we

00:41:10.000 --> 00:41:14.000
execute now, that'll enable more
parallelism later on.

00:41:14.000 --> 00:41:18.000
We might not execute that one.
We don't know.

00:41:18.000 --> 00:41:21.000
OK, but basically,
what we're going to do is just

00:41:21.000 --> 00:41:24.000
execute any P willy-nilly.
So, there's some,

00:41:24.000 --> 00:41:27.000
if you will,
non-determinism in this step

00:41:27.000 --> 00:41:32.000
here because which one you
execute may or may not be a good

00:41:32.000 --> 00:41:38.000
choice.
OK, the second type of step

00:41:38.000 --> 00:41:45.000
we're going to have is an
incomplete step.

00:41:45.000 --> 00:41:55.000
And this is a situation where
we have fewer than P threads

00:41:55.000 --> 00:42:04.000
ready to run.
So, what's our strategy there?

00:42:04.000 --> 00:42:10.000
Execute all of them.
OK, if it's greedy,

00:42:10.000 --> 00:42:19.000
no point in not executing.
OK, so if I've got more than P

00:42:19.000 --> 00:42:25.000
threads ready to run,
I execute any P.

00:42:25.000 --> 00:42:32.000
If I have fewer than P threads
ready to run,

00:42:32.000 --> 00:42:39.000
we execute all of them.
So, it turns out this is a good

00:42:39.000 --> 00:42:42.000
strategy.
It's not a perfect strategy.

00:42:42.000 --> 00:42:48.000
In fact, the strategy of trying
to schedule optimally a DAG on P

00:42:48.000 --> 00:42:53.000
processors is NP complete,
meaning it's very difficult.

00:42:53.000 --> 00:42:57.000
So, those of you going to take
6.045 or 6.840,

00:42:57.000 --> 00:43:01.000
I highly recommend these
courses, and we'll talk more

00:43:01.000 --> 00:43:06.000
about that in the last lecture
as we talked a little bit about

00:43:06.000 --> 00:43:13.000
what's coming up in the theory
engineering concentration.

00:43:13.000 --> 00:43:16.000
You can learn about NP
completeness and about how you

00:43:16.000 --> 00:43:19.000
show that certain problems,
there are no good algorithms

00:43:19.000 --> 00:43:22.000
for them, OK,
that we are aware of,

00:43:22.000 --> 00:43:24.000
OK, and what exactly that
means.

00:43:24.000 --> 00:43:28.000
So, it turns out that this type
of scheduling problem turns out

00:43:28.000 --> 00:43:32.000
to be a very difficult problem
to get it optimal.

00:43:32.000 --> 00:43:46.000
But, there's nice theorem,
due independently to Graham and

00:43:46.000 --> 00:43:53.000
Brent.
It says, essentially,

00:43:53.000 --> 00:44:05.000
a greedy scheduler executes any
computation,

00:44:05.000 --> 00:44:15.000
G, with work,
T_1, and critical path length,

00:44:15.000 --> 00:44:27.000
T infinity in time,
T_P, less than or equal to T_1

00:44:27.000 --> 00:44:34.000
over P plus T infinity --

00:44:44.000 --> 00:44:49.000
-- on a computer with P
processors.

00:44:49.000 --> 00:44:56.000
OK, so, it says that I can
achieve T_1 over P plus T

00:44:56.000 --> 00:45:02.000
infinity.
So, what does that say?

00:45:02.000 --> 00:45:09.000
If we take a look and compare
this with our lower bounds on

00:45:09.000 --> 00:45:16.000
runtime, how efficient is this?
How does this compare with the

00:45:16.000 --> 00:45:22.000
optimal execution?
Yeah, it's 2-competitive.

00:45:22.000 --> 00:45:30.000
It's within a factor of two of
optimal because this is a lower

00:45:30.000 --> 00:45:37.000
bound and this is a lower bound.
And so, if I take twice the max

00:45:37.000 --> 00:45:41.000
of these two,
twice the maximum of these two,

00:45:41.000 --> 00:45:44.000
that's going to be bigger than
the sum.

00:45:44.000 --> 00:45:49.000
So, I'm within a factor of two
of which ever is the stronger,

00:45:49.000 --> 00:45:54.000
lower bound for any situation.
So, this says you get within a

00:45:54.000 --> 00:45:58.000
factor of two of efficiency of
scheduling in terms of the

00:45:58.000 --> 00:46:04.000
runtime on P processors.
OK, does everybody see that?
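
NOTE Editor's note, the Graham/Brent bound in symbols: a greedy
scheduler achieves
$$T_P \;\le\; \frac{T_1}{P} + T_\infty \;\le\; 2\max\!\left(\frac{T_1}{P},\; T_\infty\right) \;\le\; 2\,T_P^{\mathrm{opt}},$$
since each of $T_1/P$ and $T_\infty$ is itself a lower bound on the
optimal time $T_P^{\mathrm{opt}}$; hence greedy is 2-competitive.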

00:46:04.000 --> 00:46:10.000
So, let's prove this theorem.
It's quite an elegant theorem.

00:46:10.000 --> 00:46:15.000
It's not a hard theorem.
One of the nice things,

00:46:15.000 --> 00:46:20.000
by the way, about this week,
is that nothing is very hard.

00:46:20.000 --> 00:46:25.000
It just requires you to think
differently.

00:46:25.000 --> 00:46:31.000
OK, so the proof has to do with
counting up how many complete

00:46:31.000 --> 00:46:35.000
steps we have,
and how many incomplete steps

00:46:35.000 --> 00:46:41.000
we have.
OK, so we'll start with the

00:46:41.000 --> 00:46:49.000
number of complete steps.
So, can somebody tell me what's

00:46:49.000 --> 00:46:58.000
the largest number of complete
steps I could possibly have?

00:46:58.000 --> 00:47:05.000
Yeah, I heard somebody mumble
it back there.

00:47:05.000 --> 00:47:08.000
T_1 over P.
Why is that?

00:47:08.000 --> 00:47:17.000
Yeah, so the number of complete
steps is, at most,

00:47:17.000 --> 00:47:25.000
T_1 over P because why?
Yeah, once you've had this

00:47:25.000 --> 00:47:32.000
many, you've done T_1 work,
OK?

00:47:32.000 --> 00:47:36.000
So, every complete step I'm
getting P work done.

00:47:36.000 --> 00:47:41.000
So, if I did more than T_1 over
P steps, there would be no more

00:47:41.000 --> 00:47:45.000
work to be done.
So, the number of complete

00:47:45.000 --> 00:47:49.000
steps can't be bigger than T_1
over P.

00:48:10.000 --> 00:48:16.000
OK, so that's this piece.
OK, now we're going to count up

00:48:16.000 --> 00:48:21.000
the incomplete steps,
and show it's bounded by T

00:48:21.000 --> 00:48:25.000
infinity.
OK, so let's consider an

00:48:25.000 --> 00:48:31.000
incomplete step.
And, let's see what happens.

00:48:39.000 --> 00:48:57.000
And, let's let G prime be the
subgraph of G that remains to be

00:48:57.000 --> 00:49:02.000
executed.
OK, so we'll draw a picture

00:49:02.000 --> 00:49:04.000
here.
So, imagine we have,

00:49:04.000 --> 00:49:07.000
let's draw it on a new board.

00:49:26.000 --> 00:49:32.000
So here, we're going to have a
graph, our graph,

00:49:32.000 --> 00:49:36.000
G.
We're going to do actually P

00:49:36.000 --> 00:49:40.000
equals three as our example
here.

00:49:40.000 --> 00:49:45.000
So, imagine that this is the
graph, G.

00:49:45.000 --> 00:49:52.000
And, I'm not showing the
procedures here because this

00:49:52.000 --> 00:50:00.000
actually is a theorem that works
for any DAG.

00:50:00.000 --> 00:50:09.000
And, the procedure outlines are
not necessary.

00:50:09.000 --> 00:50:16.000
All we care about is the
threads.

00:50:16.000 --> 00:50:25.000
I missed one.
OK, so imagine that's my DAG,

00:50:25.000 --> 00:50:38.000
G, and imagine that I have
executed up to this point.

00:50:38.000 --> 00:50:47.000
Which ones have I executed?
Yeah, I've executed these guys.

00:50:47.000 --> 00:50:57.000
So, the things that are in G
prime are just the things that

00:50:57.000 --> 00:51:04.000
have yet to be executed.
And these guys are the ones

00:51:04.000 --> 00:51:09.000
that are already executed.
And, we'll imagine that all of

00:51:09.000 --> 00:51:14.000
them are unit time threads
without loss of generality.

00:51:14.000 --> 00:51:19.000
The theorem would go through,
even if each of these had a

00:51:19.000 --> 00:51:23.000
particular time associated with
it.

00:51:23.000 --> 00:51:27.000
The same scheduling algorithm
will work just fine.

00:51:27.000 --> 00:51:32.000
So, how can I characterize the
threads that are ready to be

00:51:32.000 --> 00:51:38.000
executed?
Which are the threads that are

00:51:38.000 --> 00:51:42.000
ready to be executed here?
Let's just see.

00:51:42.000 --> 00:51:46.000
So, that one?
No, that's not ready to be

00:51:46.000 --> 00:51:48.000
executed.
Why?

00:51:48.000 --> 00:51:52.000
Because it's got a predecessor
here, this guy.

00:51:52.000 --> 00:51:59.000
OK, so this guy is ready to be
executed, and this guy is ready

00:51:59.000 --> 00:52:04.000
to be executed.
OK, so those two threads are

00:52:04.000 --> 00:52:08.000
ready to be, how can I
characterize this?

00:52:08.000 --> 00:52:12.000
What's their property?
What's a graph theoretic

00:52:12.000 --> 00:52:17.000
property in G prime that tells
me whether or not something is

00:52:17.000 --> 00:52:21.000
ready to be executed?
It has no predecessor,

00:52:21.000 --> 00:52:24.000
but what's another way of
saying that?

00:52:24.000 --> 00:52:29.000
It's got no predecessor in G
prime.

00:52:29.000 --> 00:52:38.000
What does it mean for a node
not to have a predecessor in a

00:52:38.000 --> 00:52:43.000
graph?
Its in degree is zero,

00:52:43.000 --> 00:52:46.000
right?
Same thing.

00:52:46.000 --> 00:52:56.000
OK, the threads with in-degree
zero in G prime are the ones

00:52:56.000 --> 00:53:06.000
that are ready to be executed.
OK, and if it's incomplete

00:53:06.000 --> 00:53:11.000
step, what do I do?
The greedy strategy says,

00:53:11.000 --> 00:53:17.000
if it's an incomplete step,
I execute all of them.

00:53:17.000 --> 00:53:24.000
OK, so I execute all of these.
OK, now when I execute all of the

00:53:24.000 --> 00:53:30.000
in-degree zero threads,
what happens to the critical

00:53:30.000 --> 00:53:38.000
path length of the graph that
remains to be executed?

00:53:38.000 --> 00:53:48.000
It decreases by one.
OK, so the critical path length

00:53:48.000 --> 00:54:00.000
of what remains to be executed,
G prime, is reduced by one.

00:54:00.000 --> 00:54:04.000
So, what's left to be executed
on every incomplete step,

00:54:04.000 --> 00:54:08.000
what's left to be executed
always reduces by one.

00:54:08.000 --> 00:54:12.000
Notice the next step here is
going to be a complete step,

00:54:12.000 --> 00:54:16.000
because I've got four things
that are ready to go.

00:54:16.000 --> 00:54:21.000
And, I can execute them in such
a way that the critical path

00:54:21.000 --> 00:54:24.000
length doesn't get reduced on
that step.

00:54:24.000 --> 00:54:29.000
OK, but if I had to execute all
of them, then it does reduce the

00:54:29.000 --> 00:54:33.000
critical path length.
Now, of course,

00:54:33.000 --> 00:54:38.000
both could happen,
OK, at the same time,

00:54:38.000 --> 00:54:43.000
OK, but any time that I have an
incomplete step,

00:54:43.000 --> 00:54:50.000
I'm guaranteed to reduce the
critical path length by one.

00:54:50.000 --> 00:54:56.000
OK, so that implies that the
number of incomplete steps is,

00:54:56.000 --> 00:55:01.000
at most, T infinity.
And so, therefore,

00:55:01.000 --> 00:55:05.000
T_P is, at most,
the number of complete steps

00:55:05.000 --> 00:55:08.000
plus the number of incomplete
steps.

00:55:08.000 --> 00:55:12.000
And we get our bound.
This is sort of an amortized

00:55:12.000 --> 00:55:17.000
argument if you want to think of
it that way, OK,

00:55:17.000 --> 00:55:22.000
that at every step I'm either
amortizing the step against the

00:55:22.000 --> 00:55:26.000
work, or I'm amortizing it
against the critical path

00:55:26.000 --> 00:55:32.000
length, or possibly both.
But I'm at least doing one of

00:55:32.000 --> 00:55:35.000
those for every step,
OK, and so, in the end,

00:55:35.000 --> 00:55:39.000
I just have to add up the two
contributions.
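
NOTE Editor's sketch, not from the lecture: a tiny C simulation of a
greedy scheduler on a hypothetical unit-time DAG (the DAG, P = 2, and
all names here are invented for illustration). Each step runs
min(P, #ready) threads, where a thread is ready when its in-degree in
the unexecuted subgraph G' is zero, and the step count is checked
against the T_1/P + T_infinity bound from the theorem:
    #include <stdio.h>
    #define N 6  /* work T_1 = 6 unit-time threads */
    #define P 2  /* processors */
    int main(void) {
        /* hypothetical DAG: edge[u][v] means u must finish before v;
           the longest path is 0,1,3,4, so the span T_infinity is 4 */
        int edge[N][N] = {0};
        edge[0][1] = edge[0][2] = edge[1][3] = edge[2][3] = 1;
        edge[3][4] = edge[3][5] = 1;
        int indeg[N] = {0}, done[N] = {0}, remaining = N, steps = 0;
        for (int u = 0; u < N; u++)
            for (int v = 0; v < N; v++)
                if (edge[u][v]) indeg[v]++;
        while (remaining > 0) {
            int batch[N], nready = 0;
            for (int v = 0; v < N; v++)        /* ready = in-degree 0 in G' */
                if (!done[v] && indeg[v] == 0) batch[nready++] = v;
            int run = nready < P ? nready : P; /* greedy: run any P of them */
            for (int i = 0; i < run; i++) {
                int u = batch[i];
                done[u] = 1; remaining--;
                for (int v = 0; v < N; v++)    /* retire u's out-edges */
                    if (edge[u][v]) indeg[v]--;
            }
            steps++;
            printf("step %d: ran %d thread(s), %s step\n", steps, run,
                   nready >= P ? "complete" : "incomplete");
        }
        printf("T_P = %d; bound T_1/P + T_inf = %d\n", steps, N / P + 4);
        return 0;
    }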

00:55:39.000 --> 00:55:42.000
Any questions about that?
So this, by the way,

00:55:42.000 --> 00:55:46.000
is the fundamental theorem of
all scheduling.

00:55:46.000 --> 00:55:50.000
If ever you study anything
having to do with scheduling,

00:55:50.000 --> 00:55:55.000
this basic result is sort of
the foundation of a huge number

00:55:55.000 --> 00:55:58.000
of things.
And then what people do is they

00:55:58.000 --> 00:56:01.000
gussy it up, like,
let's do this online,

00:56:01.000 --> 00:56:05.000
OK, with a scheduler,
etc., that everybody's trying

00:56:05.000 --> 00:56:09.000
to match these bounds,
OK, of what an omniscient

00:56:09.000 --> 00:56:14.000
greedy scheduler would achieve,
OK, and there are all kinds of

00:56:14.000 --> 00:56:19.000
other things.
But this is sort of the basic

00:56:19.000 --> 00:56:25.000
theorem that just pervades the
whole area of scheduling.

00:56:25.000 --> 00:56:32.000
OK, let's do a quick corollary.
I'm not going to erase those.

00:56:32.000 --> 00:56:37.000
Those are just too important.
I don't want to erase those.

00:56:37.000 --> 00:56:42.000
Let's not erase those.
I don't want to erase that either.

00:56:42.000 --> 00:56:45.000
We're going to go back to the
top.

00:56:45.000 --> 00:56:51.000
Actually, we'll put the
corollary here because that's

00:56:51.000 --> 00:56:54.000
just one line.
OK.

00:57:11.000 --> 00:57:17.000
The corollary says you get
linear speed up if the number of

00:57:17.000 --> 00:57:24.000
processors that you allocate,
that you run your job on is

00:57:24.000 --> 00:57:31.000
order, the parallelism.
OK, so greedy scheduler gives

00:57:31.000 --> 00:57:37.000
you linear speed up if you're
running on essentially

00:57:37.000 --> 00:57:46.000
parallelism or fewer processors.
OK, so let's see why that is.

00:57:46.000 --> 00:57:51.000
And I hope I'll fit this,
OK?

00:57:51.000 --> 00:57:58.000
So, P bar is T_1 over T
infinity.

00:57:58.000 --> 00:58:04.000
And that implies that if P
equals order T_1 over T

00:58:04.000 --> 00:58:10.000
infinity, then that says just
bringing those around,

00:58:10.000 --> 00:58:17.000
T infinity is order T_1 over P.
So, everybody with me?

00:58:17.000 --> 00:58:22.000
It's just algebra.
So, it says this is the

00:58:22.000 --> 00:58:28.000
definition of parallelism,
T_1 over T infinity,

00:58:28.000 --> 00:58:35.000
and so, if P is order
parallelism, then it's order T_1

00:58:35.000 --> 00:58:43.000
over T infinity.
And now, just bring it around.

00:58:43.000 --> 00:58:49.000
It says T infinity is order T_1
over P.

00:58:49.000 --> 00:58:56.000
So, that says T infinity is
order T_1 over P.

00:58:56.000 --> 00:59:03.000
OK, and so, therefore,
continue the proof here,

00:59:03.000 --> 00:59:12.000
thus T_P is at most T_1 over P
plus T infinity.

00:59:12.000 --> 00:59:23.000
Well, if this is order T_1 over
P, the whole thing is order T_1

00:59:23.000 --> 00:59:29.000
over P.
OK, and so, now I have T_P is

00:59:29.000 --> 00:59:37.000
order T_1 over P,
and what we need is to compute

00:59:37.000 --> 00:59:45.000
T_1 over T_P,
and that's going to be order

00:59:45.000 --> 00:59:48.000
P.
OK?

00:59:48.000 --> 00:59:51.000
Does everybody see that?
So what that says is that if I

00:59:51.000 --> 00:59:54.000
have a certain amount of
parallelism, if I run

00:59:54.000 --> 00:59:58.000
essentially on fewer processors
than that parallelism,

00:59:58.000 --> 01:00:02.000
I get linear speed up if I use
greedy scheduling.
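
NOTE Editor's note, the corollary's algebra in one chain: if
$P = O(\bar{P}) = O(T_1/T_\infty)$, then $T_\infty = O(T_1/P)$, so
greedy gives $T_P \le T_1/P + T_\infty = O(T_1/P)$; combined with the
lower bound $T_P \ge T_1/P$, the speedup is $T_1/T_P = \Theta(P)$,
i.e., linear speedup.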

01:00:02.000 --> 01:00:05.077
OK, if I run on more processors
than the parallelism,

01:00:05.077 --> 01:00:07.859
in some sense I'm being
wasteful because I can't

01:00:07.859 --> 01:00:11.529
possibly get enough speed up to
justify those extra processors.

01:00:11.529 --> 01:00:15.021
So, understanding parallelism
of a job says that's sort of a

01:00:15.021 --> 01:00:17.862
limit on the number of
processors I want to have.

01:00:17.862 --> 01:00:19.757
And, in fact,
I can achieve that.

01:00:19.757 --> 01:00:21.000
Question?

01:00:39.000 --> 01:00:41.008
Yeah, really,
in some sense,

01:00:41.008 --> 01:00:43.611
this is saying it should be
omega P.

01:00:43.611 --> 01:00:46.586
Yeah, so that's fine.
It's a question of,

01:00:46.586 --> 01:00:48.000
so ask again.

01:01:03.000 --> 01:01:06.495
No, no, it's only if it's
bounded above by a constant.

01:01:06.495 --> 01:01:08.804
T_1 and T infinity aren't
constants.

01:01:08.804 --> 01:01:12.497
They're variables in this.
So, we are doing multivariable

01:01:12.497 --> 01:01:15.795
asymptotic analysis.
So, any of these things can be

01:01:15.795 --> 01:01:19.555
a function of anything else,
and can be growing as much as

01:01:19.555 --> 01:01:22.127
we want.
So, the fact that we say we are

01:01:22.127 --> 01:01:26.019
given it for a particular thing,
we're really not given that

01:01:26.019 --> 01:01:28.327
number.
We're given a whole class of

01:01:28.327 --> 01:01:31.889
DAG's or whatever of various
sizes is really what we're

01:01:31.889 --> 01:01:37.788
talking about.
So, I can look at the growth.

01:01:37.788 --> 01:01:45.626
Here, where it's talking about
the growth of the parallelism,

01:01:45.626 --> 01:01:52.941
sorry, the growth of the
runtime T_P as a function of T_1

01:01:52.941 --> 01:01:58.689
and T infinity.
So, I am talking about things

01:01:58.689 --> 01:02:03.000
that are growing here,
OK?

01:02:03.000 --> 01:02:06.018
OK, so let's put this to work,
OK?

01:02:06.018 --> 01:02:09.951
And, in fact,
so now I'm going to go back to

01:02:09.951 --> 01:02:13.243
here.
Now I'm going to tell you about

01:02:13.243 --> 01:02:18.913
a little bit of my own research,
and how we use this in some of

01:02:18.913 --> 01:02:23.030
the work that we did.
OK, so we've developed a

01:02:23.030 --> 01:02:28.426
dynamic multithreaded language
called Cilk, spelled with a C

01:02:28.426 --> 01:02:33.000
because it's based on the
language, C.

01:02:33.000 --> 01:02:39.837
And, it's not an acronym
because silk is like nice

01:02:39.837 --> 01:02:46.953
threads, OK, although at one
point my students had a

01:02:46.953 --> 01:02:53.651
competition for what the acronym
Cilk could mean.

01:02:53.651 --> 01:03:01.046
The winner, turns out,
was Charles' Idiotic Linguistic

01:03:01.046 --> 01:03:06.214
Kluge.
So anyway, if you want to take

01:03:06.214 --> 01:03:10.714
a look at it,
you can find some stuff on it

01:03:10.714 --> 01:03:12.000
here.
OK,

01:03:20.000 --> 01:03:28.412
OK, and what it uses is
actually one of these more

01:03:28.412 --> 01:03:36.480
complicated schedulers.
It's a randomized online

01:03:36.480 --> 01:03:44.206
scheduler, OK,
and if you look at its expected

01:03:44.206 --> 01:03:53.476
runtime on P processors,
it gets effectively T_1 over P

01:03:53.476 --> 01:04:01.428
plus O of T infinity provably.
OK, and empirically,

01:04:01.428 --> 01:04:05.714
if you actually look at what
kind of runtimes you get to find

01:04:05.714 --> 01:04:09.285
out what's hidden in the big O
there, it turns out,

01:04:09.285 --> 01:04:13.785
in fact, it's T_1 over P plus T
infinity with the constants here

01:04:13.785 --> 01:04:16.285
being very close to one
empirically.

01:04:16.285 --> 01:04:19.428
So, no guarantees,
but this turns out to be a

01:04:19.428 --> 01:04:22.142
pretty good bound.
Sometimes, you see a

01:04:22.142 --> 01:04:26.214
coefficient on T infinity that's
up maybe close to four or

01:04:26.214 --> 01:04:29.385
something.
But generally,

01:04:29.385 --> 01:04:34.533
you don't see something that's
much bigger than that.

01:04:34.533 --> 01:04:39.680
And mostly, it tends to be
around, if you do a linear

01:04:39.680 --> 01:04:44.729
regression curve fit,
you get that the constant here

01:04:44.729 --> 01:04:48.094
is close to one.
And so, with this,

01:04:48.094 --> 01:04:54.331
if you use
this formula as a model for your

01:04:54.331 --> 01:04:57.795
runtime,
you get near perfect linear

01:04:57.795 --> 01:05:03.339
speed up if the number of
processors you're running on is

01:05:03.339 --> 01:05:07.892
much less than your average
parallelism, which,

01:05:07.892 --> 01:05:14.029
of course, is the same thing as
if T infinity is much less than

01:05:14.029 --> 01:05:19.481
T_1 over P.
So, what happens here is that

01:05:19.481 --> 01:05:23.247
when P is much less than P
bar, the parallelism, that is,

01:05:23.247 --> 01:05:28.297
T infinity is much less than
T_1 over P, this term ceases to

01:05:28.297 --> 01:05:32.319
matter very much,
and you get very good speedup,

01:05:32.319 --> 01:05:36.000
OK, in fact,
almost perfect speedup.

01:05:36.000 --> 01:05:42.357
So, each additional processor gives you
another processor's worth of work as long

01:05:42.357 --> 01:05:48.503
as you are in the range where the
number of processors is much

01:05:48.503 --> 01:05:52.211
less than the parallelism.
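
To make the formula concrete, here is a small self-contained C
sketch, my own illustration rather than anything from the Cilk
system, that evaluates the approximation T_P = T_1/P + T_infinity
along with the parallelism T_1/T_infinity:

    #include <stdio.h>

    /* The empirical model: T_P is about T_1/P + T_inf, with both
       constants taken to be one. */
    double predicted_time(double work, double span, double procs)
    {
        return work / procs + span;
    }

    int main(void)
    {
        double work = 2048.0, span = 1.0;  /* example work and critical path */
        printf("parallelism = %.0f\n", work / span);
        for (double p = 1; p <= 512; p *= 2) {
            double t = predicted_time(work, span, p);
            printf("P = %3.0f  T_P = %7.2f  speedup = %6.1f\n",
                   p, t, work / t);
        }
        return 0;
    }

As long as P stays well below the parallelism of 2,048, the printed
speedup tracks P closely, which is the near-perfect linear speedup
just described.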

01:05:52.211 --> 01:05:58.463
Now, with this language many
years ago, which now seems like

01:05:58.463 --> 01:06:03.231
many years ago,
OK, it turned out we competed.

01:06:03.231 --> 01:06:08.000
We built a bunch of chess
programs.

01:06:08.000 --> 01:06:11.962
And, among our programs were
Starsocrates,

01:06:11.962 --> 01:06:16.312
and Cilkchess,
and we also had several others.

01:06:16.312 --> 01:06:19.501
And these were,
I would call them,

01:06:19.501 --> 01:06:22.014
world-class.
In particular,

01:06:22.014 --> 01:06:26.750
we tied for first in the 1995
World Computer Chess

01:06:26.750 --> 01:06:32.066
Championship in Hong Kong,
and then we had a playoff and

01:06:32.066 --> 01:06:35.860
we lost.
It was really a shame.

01:06:35.860 --> 01:06:39.157
We almost won,
running on a big parallel

01:06:39.157 --> 01:06:41.778
machine.
Incidentally,

01:06:41.778 --> 01:06:47.020
some of you may know about the
Deep Blue chess playing program.

01:06:47.020 --> 01:06:52.008
That was the last time, before
they faced then-world champion

01:06:52.008 --> 01:06:55.728
Kasparov, that they competed
against other programs.

01:06:55.728 --> 01:06:58.941
They tied for third in that
tournament.

01:06:58.941 --> 01:07:03.000
OK, so we actually out-placed
them.

01:07:03.000 --> 01:07:07.159
However, in the head-to-head
competition, we lost to them.

01:07:07.159 --> 01:07:11.099
So we had one loss in the
tournament up to the point of

01:07:11.099 --> 01:07:13.872
the finals.
They had a loss and a draw.

01:07:13.872 --> 01:07:17.375
Most people aren't aware that
Deep Blue, in fact,

01:07:17.375 --> 01:07:21.608
was not the reigning World
Computer Chess Champion when

01:07:21.608 --> 01:07:24.964
they faced Kasparov.
The reason that they faced

01:07:24.964 --> 01:07:30.000
Kasparov was because IBM was
willing to put up the money.

01:07:30.000 --> 01:07:38.029
OK, so we developed these chess
programs, and the way we

01:07:38.029 --> 01:07:44.747
developed them,
let me in particular talk about

01:07:44.747 --> 01:07:51.172
Starsocrates.
We had this interesting anomaly

01:07:51.172 --> 01:07:55.699
come up.
We were running on a 32

01:07:55.699 --> 01:08:03.000
processor computer at MIT for
development.

01:08:03.000 --> 01:08:07.463
And, we had access to a 512
processor computer for the

01:08:07.463 --> 01:08:11.505
tournament at NCSA at the
University of Illinois.

01:08:11.505 --> 01:08:16.389
So, we had this big machine.
Of course, they didn't want to

01:08:16.389 --> 01:08:20.852
give it to us very much,
but we had the same machine,

01:08:20.852 --> 01:08:22.872
just a smaller one,
at MIT.

01:08:22.872 --> 01:08:27.756
So, we would develop on the small
machine, and occasionally we'd be able

01:08:27.756 --> 01:08:31.126
to run on the big one,
and the big machine was what we were

01:08:31.126 --> 01:08:37.719
developing for, with its 512 processors.
So, let me show you sort of the

01:08:37.719 --> 01:08:40.000
anomaly that came up,
OK?

01:08:48.000 --> 01:08:55.974
So, we had a version of a
program that I'll call the

01:08:55.974 --> 01:09:02.854
original program,
OK, and we had an optimized

01:09:02.854 --> 01:09:12.236
program that included some new
features that were supposed to

01:09:12.236 --> 01:09:20.992
make the program go faster.
And so, we timed it on our 32

01:09:20.992 --> 01:09:28.341
processor machine.
And, it took us 65 seconds to

01:09:28.341 --> 01:09:33.839
run it.
OK, and then we timed this new

01:09:33.839 --> 01:09:37.340
program.
So, I'll call that T prime

01:09:37.340 --> 01:09:42.261
sub 32 on our 32 processor
machine, and it ran in 40

01:09:42.261 --> 01:09:45.952
seconds to do this particular
benchmark.

01:09:45.952 --> 01:09:50.399
Now, let me just say,
I've lied about the actual

01:09:50.399 --> 01:09:54.375
numbers here to make the
calculations easy.

01:09:54.375 --> 01:10:01.000
But, the same idea happened.
Just the numbers were messier.

01:10:01.000 --> 01:10:07.275
OK, so this looks like a
significant improvement in

01:10:07.275 --> 01:10:12.421
runtime, but we rejected the
optimization.

01:10:12.421 --> 01:10:19.574
OK, and the reason we rejected
it is because we understood

01:10:19.574 --> 01:10:24.846
the issues of work and
critical path.

01:10:24.846 --> 01:10:30.368
So, let me show you the
analysis that we did,

01:10:30.368 --> 01:10:33.813
OK?
So the analysis,

01:10:33.813 --> 01:10:37.441
it turns out,
if we looked at our

01:10:37.441 --> 01:10:42.089
instrumentation,
the work in this case was

01:10:42.089 --> 01:10:46.170
2,048 seconds.
And, the critical path was one

01:10:46.170 --> 01:10:50.931
second, whereas
over here with the optimized

01:10:50.931 --> 01:10:55.125
program, the work was,
in fact, 1,024.

01:10:55.125 --> 01:11:00.000
But the critical path was
eight seconds.

01:11:00.000 --> 01:11:07.375
So, if we plug into our simple
model here, the one I have up

01:11:07.375 --> 01:11:14.625
there with the approximation
there, I have T_32 is equal to

01:11:14.625 --> 01:11:20.625
T_1 over 32 plus T infinity,
and that's equal to,

01:11:20.625 --> 01:11:25.250
well, the work is 2,048 divided
by 32.

01:11:25.250 --> 01:11:30.125
What's that?
64, good, plus the critical

01:11:30.125 --> 01:11:37.625
path, one, that's 65.
So, that checks out with what

01:11:37.625 --> 01:11:40.000
we saw.
OK, in fact,

01:11:40.000 --> 01:11:43.875
we did that,
and it checked out.

01:11:43.875 --> 01:11:48.375
OK, it was very close.
OK, over here,

01:11:48.375 --> 01:11:54.875
T prime sub 32 is T prime sub
one over 32 plus T infinity

01:11:54.875 --> 01:12:02.750
prime, and that's equal to 1,024
divided by 32 is 32 plus eight,

01:12:02.750 --> 01:12:07.981
the critical path here.
That's 40.

01:12:07.981 --> 01:12:13.377
So, that checked out too.
So, now what we did is we said

01:12:13.377 --> 01:12:17.596
OK,
let's extrapolate to our big

01:12:17.596 --> 01:12:21.422
machine.
How fast are these things going

01:12:21.422 --> 01:12:25.445
to run on our big machine?
Well, for that,

01:12:25.445 --> 01:12:29.958
we want T of 512.
And, that's equal to T_1 over

01:12:29.958 --> 01:12:36.913
512 plus T infinity.
And so, what's 2,048 divided by

01:12:36.913 --> 01:12:41.079
512?
It's four, plus T infinity is

01:12:41.079 --> 01:12:44.235
one.
That's equal to five.

01:12:44.235 --> 01:12:48.401
So, it would go quite a bit
faster on this machine.

01:12:48.401 --> 01:12:55.471
But here, T prime of 512 is
equal to T one prime over 512

01:12:55.471 --> 01:13:03.172
plus T infinity prime is equal
to, well, 1,024 divided by

01:13:03.172 --> 01:13:11.000
512 is two, plus the critical path
of eight; that's ten.

01:13:11.000 --> 01:13:15.913
OK, and so, you see that on the
big machine, we would have been

01:13:15.913 --> 01:13:19.163
running twice as slow had we
adopted that,

01:13:19.163 --> 01:13:23.205
quote, "optimization",
OK, because we had run out of

01:13:23.205 --> 01:13:27.009
parallelism, and this was making
the path longer.

01:13:27.009 --> 01:13:31.447
We needed to have a way of
doing it where we could reduce

01:13:31.447 --> 01:13:34.459
the work.
Yeah, it's good to reduce the

01:13:34.459 --> 01:13:39.135
work, but not if lengthening the
critical path ends up getting rid of the

01:13:39.135 --> 01:13:45.000
parallelism that we hoped to be
able to use at runtime.

01:13:45.000 --> 01:13:48.186
So, it's twice as slow,
OK, twice as slow.
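
As a sanity check, a few lines of C, my own illustration using the
lecture's rounded numbers, reproduce both predictions and show where
the parallelism ran out:

    #include <stdio.h>

    /* T_P = T_1/P + T_inf, applied to the two versions of the program. */
    static double t(double work, double span, double procs)
    {
        return work / procs + span;
    }

    int main(void)
    {
        /* original: work 2048, span 1; "optimized": work 1024, span 8 */
        printf("T_32  = %4.0f   T'_32  = %4.0f\n",
               t(2048, 1, 32),  t(1024, 8, 32));    /* 65 vs 40 */
        printf("T_512 = %4.0f   T'_512 = %4.0f\n",
               t(2048, 1, 512), t(1024, 8, 512));   /* 5 vs 10 */
        printf("parallelism = %4.0f vs %4.0f\n",
               2048.0 / 1.0, 1024.0 / 8.0);         /* 2048 vs 128 */
        return 0;
    }

The optimized version's parallelism is only 1024/8 = 128, well below
the 512 processors of the tournament machine, which is precisely the
sense in which it had run out of parallelism.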

01:13:48.186 --> 01:13:52.927
So the moral is that the work
and critical path length predict

01:13:52.927 --> 01:13:56.968
the performance better than the
execution time alone,

01:13:56.968 --> 01:14:00.000
OK, when you look at
scalability.

01:14:00.000 --> 01:14:03.600
And a big issue on a lot of
these machines is scalability;

01:14:03.600 --> 01:14:07.263
not always, sometimes you're
not worried about scalability.

01:14:07.263 --> 01:14:10.421
Sometimes you just care about the
one machine you're running on.
Had we been running in the

01:14:10.421 --> 01:14:14.210
competition on a 32 processor
machine, we would have accepted

01:14:14.210 --> 01:14:16.926
this optimization.
It would have been a good

01:14:16.926 --> 01:14:19.515
trade-off.
OK, but because we knew that we

01:14:19.515 --> 01:14:22.800
were running on a machine with a
lot more processors,

01:14:22.800 --> 01:14:26.336
and that we were close to
running out of the parallelism,

01:14:26.336 --> 01:14:29.936
it didn't make sense to be
increasing the critical path at

01:14:29.936 --> 01:14:33.726
that point, because that was
just reducing the parallelism of

01:14:33.726 --> 01:14:36.887
our calculation.
OK, next time,

01:14:36.887 --> 01:14:39.041
any questions about that first?
No?

01:14:39.041 --> 01:14:40.626
OK.
Next time, now that we

01:14:40.626 --> 01:14:44.111
understand the model for
execution, we're going to start

01:14:44.111 --> 01:14:47.786
looking at the performance of
particular algorithms when we

01:14:47.786 --> 01:14:50.701
code them up in a dynamic
multithreaded style,

01:14:50.701 --> 01:14:53.000
OK?