WEBVTT

00:00:07.000 --> 00:00:08.000
-- week of 6.046.
Woohoo!

00:00:08.000 --> 00:00:13.000
The topic of this final week,
among our advanced topics,

00:00:13.000 --> 00:00:18.000
is cache oblivious algorithms.
This is a particularly fun

00:00:18.000 --> 00:00:22.000
area, one dear to my heart
because I've done a lot of

00:00:22.000 --> 00:00:26.000
research in this area.
This is an area co-founded by

00:00:26.000 --> 00:00:29.000
Professor Leiserson.
So, in fact,

00:00:29.000 --> 00:00:34.000
the first context in which I
met Professor Leiserson was him

00:00:34.000 --> 00:00:38.000
giving a talk about cache
oblivious algorithms at WADS '99

00:00:38.000 --> 00:00:41.000
in Vancouver I think.
Yeah, that has to be an odd

00:00:41.000 --> 00:00:44.000
year.
So, I learned about cache

00:00:44.000 --> 00:00:48.000
oblivious algorithms then,
started working in the area,

00:00:48.000 --> 00:00:50.000
and it's been a fun place to
play.

00:00:50.000 --> 00:00:54.000
But this topic in some sense
was also developed in the

00:00:54.000 --> 00:00:58.000
context of this class.
I think there was one semester,

00:00:58.000 --> 00:01:02.000
probably also '98-'99 where all
of the problem sets were about

00:01:02.000 --> 00:01:07.000
cache oblivious algorithms.
And they were,

00:01:07.000 --> 00:01:10.000
in particular,
working out the research ideas

00:01:10.000 --> 00:01:13.000
at the same time.
So, it must have been a fun

00:01:13.000 --> 00:01:15.000
semester.
We considered doing that this

00:01:15.000 --> 00:01:18.000
semester, but we kept it
simple.

00:01:18.000 --> 00:01:23.000
We know a lot more about cache
oblivious algorithms by now as

00:01:23.000 --> 00:01:26.000
you might expect.
Right, I think that's all the

00:01:26.000 --> 00:01:29.000
setting.
I mean, it was kind of

00:01:29.000 --> 00:01:33.000
developed also with a bunch of
MIT students, in particular an

00:01:33.000 --> 00:01:35.000
M.Eng.
student, Harald Prokop.

00:01:35.000 --> 00:01:36.000
It was his M.Eng.
thesis.

00:01:36.000 --> 00:01:39.000
Those are all the citations I
will give for now.

00:01:39.000 --> 00:01:43.000
I haven't posted yet,
but there are some lecture

00:01:43.000 --> 00:01:45.000
notes that are already on my
webpage.

00:01:45.000 --> 00:01:49.000
But I will link to them from
the course website; they give

00:01:49.000 --> 00:01:53.000
all the references for all the
results I'll be talking about.

00:01:53.000 --> 00:01:56.000
They've all been done in the
last five years or so,

00:01:56.000 --> 00:01:59.000
in particular,
starting in '99 when the first

00:01:59.000 --> 00:02:03.000
paper was published.
But I won't give the specific

00:02:03.000 --> 00:02:08.000
citations in lecture.
And, this topic is related to

00:02:08.000 --> 00:02:11.000
the topic of last week,
multithreaded algorithms,

00:02:11.000 --> 00:02:14.000
although at a somewhat high
level.

00:02:14.000 --> 00:02:18.000
And then it's also dealing with
parallelism in modern machines.

00:02:18.000 --> 00:02:22.000
And, throughout all of
these last two lectures,

00:02:22.000 --> 00:02:26.000
we've had this very simple
model of a computer where we

00:02:26.000 --> 00:02:30.000
have random access.
You can access memory at a cost

00:02:30.000 --> 00:02:33.000
of one.
You can read and write a word

00:02:33.000 --> 00:02:36.000
of memory.
There are some details on how

00:02:36.000 --> 00:02:39.000
big a word can be and whatnot.
It's pretty basic,

00:02:39.000 --> 00:02:41.000
simple, flat model.
And, the multithreaded

00:02:41.000 --> 00:02:45.000
algorithm idea is that,
well, maybe you have multiple

00:02:45.000 --> 00:02:48.000
threads of computation running
at once, but you still have this

00:02:48.000 --> 00:02:51.000
very flat memory.
Everyone can access anything in

00:02:51.000 --> 00:02:54.000
memory at a constant cost.
We're going to change that

00:02:54.000 --> 00:02:58.000
model now.
And we are going to realize

00:02:58.000 --> 00:03:03.000
that a real machine,
the memory of a real machine is

00:03:03.000 --> 00:03:06.000
some hierarchy.
You have some CPU,

00:03:06.000 --> 00:03:10.000
you have some cache,
probably on the same chip,

00:03:10.000 --> 00:03:14.000
level 1 cache,
you have some level 2 cache,

00:03:14.000 --> 00:03:18.000
if you're lucky,
maybe you have some level 3

00:03:18.000 --> 00:03:21.000
cache, before you get to main
memory.

00:03:21.000 --> 00:03:26.000
And then, you probably have a
really big disk and probably

00:03:26.000 --> 00:03:31.000
there's even some cache out
here, but I won't even think

00:03:31.000 --> 00:03:35.000
about that.
So, the point is,

00:03:35.000 --> 00:03:38.000
you have lots of different
levels of memory and what's

00:03:38.000 --> 00:03:42.000
changing here is that things
that are very close to the CPU

00:03:42.000 --> 00:03:46.000
are very fast to access.
Usually level 1 cache you can

00:03:46.000 --> 00:03:48.000
access in one clock cycle or a
few.

00:03:48.000 --> 00:03:50.000
And then, things get slower and
slower.

00:03:50.000 --> 00:03:54.000
Main memory still costs like 70 ns
or so to access a chunk.

00:03:54.000 --> 00:03:57.000
And that's a long time.
70 ns is, of course,

00:03:57.000 --> 00:04:01.000
a very long time.
So, as we go out here,

00:04:01.000 --> 00:04:04.000
we get slower.
But we also get bigger.

00:04:04.000 --> 00:04:07.000
I mean, if we could put
everything at level 1 cache,

00:04:07.000 --> 00:04:11.000
the problem would be solved.
That would be a flat

00:04:11.000 --> 00:04:13.000
memory.
Accessing everything in here,

00:04:13.000 --> 00:04:16.000
we assumed takes the same
amount of time.

00:04:16.000 --> 00:04:18.000
But usually,
we can't afford,

00:04:18.000 --> 00:04:22.000
it's not even possible to put
everything in level 1 cache.

00:04:22.000 --> 00:04:26.000
I mean, there's a reason why
there is a memory hierarchy.

00:04:26.000 --> 00:04:32.000
Does anyone have a suggestion
on what that reason might be?

00:04:32.000 --> 00:04:35.000
It's like one of these limits
in life.

00:04:35.000 --> 00:04:37.000
Yeah?
Fast memory is expensive.

00:04:37.000 --> 00:04:40.000
That's the practical
limitation, indeed,

00:04:40.000 --> 00:04:45.000
that you could try to build
more and more level 1 cache

00:04:45.000 --> 00:04:48.000
and maybe you could try to,
well, yeah.

00:04:48.000 --> 00:04:51.000
Expense is a good reason,
and practically,

00:04:51.000 --> 00:04:55.000
that's maybe why the
sizes are what they are.

00:04:55.000 --> 00:05:01.000
But suppose really fast memory
were really cheap.

00:05:01.000 --> 00:05:04.000
There is a physical limitation
of what's going on,

00:05:04.000 --> 00:05:05.000
yeah?
The speed of light.

00:05:05.000 --> 00:05:08.000
Yeah, that's a bit of a
problem, right?

00:05:08.000 --> 00:05:11.000
No matter how much,
let's suppose you can only fit

00:05:11.000 --> 00:05:15.000
so many bits in an atom.
You can only fit so many bits

00:05:15.000 --> 00:05:18.000
in a particular amount of space.
If you want more bits,

00:05:18.000 --> 00:05:22.000
you need more space,
and the more space you have,

00:05:22.000 --> 00:05:25.000
the longer it's going to take
for a round-trip.

00:05:25.000 --> 00:05:28.000
So, if you assume your CPU is
like this point in space,

00:05:28.000 --> 00:05:32.000
so it's relatively small and it
has to get the data in,

00:05:32.000 --> 00:05:37.000
the bigger the data,
the farther it has to be away.

00:05:37.000 --> 00:05:40.000
But, you can have these shells
around the CPU that are,

00:05:40.000 --> 00:05:44.000
we usually live in 3-D,
and chips are usually 2-D,

00:05:44.000 --> 00:05:46.000
but never mind.
You can have the sphere that's

00:05:46.000 --> 00:05:49.000
closer to the CPU that's a lot
faster to access.

00:05:49.000 --> 00:05:52.000
And as you get further away it
costs more.

00:05:52.000 --> 00:05:55.000
And that's essentially what
this model is representing,

00:05:55.000 --> 00:05:59.000
although it's a bit
abstracted from the intrinsic

00:05:59.000 --> 00:06:02.000
physics and geometry and
whatnot.

00:06:02.000 --> 00:06:05.000
But that's the idea.
The latency,

00:06:05.000 --> 00:06:11.000
the round-trip time to get to some
of this memory has to be big.

00:06:11.000 --> 00:06:17.000
In general, the cost to access
memory is made up of two things.

00:06:17.000 --> 00:06:21.000
There's the latency,
the round-trip time,

00:06:21.000 --> 00:06:26.000
which in particular is limited
by the speed of light.

00:06:26.000 --> 00:06:32.000
And, plus the round-trip time,
you also have to get the data

00:06:32.000 --> 00:06:36.000
out.
And depending on how much data

00:06:36.000 --> 00:06:40.000
you want, it could take longer.
OK, so there's something.

00:06:40.000 --> 00:06:42.000
There could be,
let me get this right,

00:06:42.000 --> 00:06:46.000
let's say, the amount of data
divided by the bandwidth.

00:06:46.000 --> 00:06:51.000
OK, the bandwidth is the rate
at which you can get the data out.

00:06:51.000 --> 00:06:54.000
And if you look at the
bandwidth of these various

00:06:54.000 --> 00:06:58.000
levels of memory,
it's all pretty much the same.

00:06:58.000 --> 00:07:02.000
If you have a well-designed
computer the bandwidths should

00:07:02.000 --> 00:07:07.000
all be the same.
OK, you can still get data

00:07:07.000 --> 00:07:08.000
off disk really,
really fast,

00:07:08.000 --> 00:07:13.000
usually at about the speed of
your bus, and the bus gets to

00:07:13.000 --> 00:07:16.000
the CPU hopefully as fast as
everything else.

00:07:16.000 --> 00:07:20.000
So, even though they're slower,
they're really only slower in

00:07:20.000 --> 00:07:23.000
terms of latency.
And so, this part is maybe

00:07:23.000 --> 00:07:26.000
reasonable.
The bandwidth looks pretty much

00:07:26.000 --> 00:07:29.000
the same universally.
It's the latency that's going

00:07:29.000 --> 00:07:32.000
up.
So, if the latency is going up

00:07:32.000 --> 00:07:36.000
but we still get to divide by
the same amount of bandwidth,

00:07:36.000 --> 00:07:40.000
what should we do to make the
access cost at all these levels

00:07:40.000 --> 00:07:45.000
about the same?
This is fixed.

00:07:45.000 --> 00:07:53.000
Let's say this is increasing,
but this is still staying big.

00:07:53.000 --> 00:07:59.000
What could we do to balance
this formula?

00:07:59.000 --> 00:08:05.000
Change the amount.
As the latency goes up,

00:08:05.000 --> 00:08:10.000
if we increase the amount,
then the amortized cost to

00:08:10.000 --> 00:08:16.000
access one element will go down.
So, this is amortization in a

00:08:16.000 --> 00:08:21.000
very simple sense.
So, this was to access a whole

00:08:21.000 --> 00:08:26.000
block, let's say,
and this amount was the size of

00:08:26.000 --> 00:08:30.000
the block.
So, the amortized cost,

00:08:30.000 --> 00:08:36.000
then, to access one element is
going to be the latency divided

00:08:36.000 --> 00:08:41.000
by the size of the block,
(the amount), plus one over the

00:08:41.000 --> 00:08:45.000
bandwidth.
OK, so this is what you should

00:08:45.000 --> 00:08:49.000
implicitly be thinking in your
head.

00:08:49.000 --> 00:08:55.000
So, I'm just dividing here by
the amounts because the amount

00:08:55.000 --> 00:09:02.000
is how many elements you get in
one access, let's suppose.

00:09:02.000 --> 00:09:04.000
OK, so we get this formula for
the amortized cost.
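
NOTE
In symbols (my notation, not written out in the lecture): with latency L,
bandwidth beta, and block size B, fetching one block costs
  $L + B/\beta$,
so the amortized cost per element is
  $L/B + 1/\beta$.
Balancing the two terms means choosing $B \approx L \beta$, which makes
the per-element cost about $2/\beta$ at every level.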

00:09:04.000 --> 00:09:08.000
The one over bandwidth is going
to be good no matter what level

00:09:08.000 --> 00:09:11.000
we are on, I claim.
There's no real fundamental

00:09:11.000 --> 00:09:14.000
limitation there except it might
be expensive.

00:09:14.000 --> 00:09:17.000
And the latency we get
amortized out by the amount,

00:09:17.000 --> 00:09:21.000
so whatever the latency is,
as the latency gets bigger out

00:09:21.000 --> 00:09:24.000
here, we just get more and more
stuff and then we make these two

00:09:24.000 --> 00:09:27.000
terms equal, let's say.
That would be a good way to

00:09:27.000 --> 00:09:30.000
balance things.
So, in particular,

00:09:30.000 --> 00:09:34.000
disk has a really high latency.
Not only are there speed of

00:09:34.000 --> 00:09:37.000
light issues here,
but there's actually the speed

00:09:37.000 --> 00:09:39.000
of the head moving on the tracks
of the disk.

00:09:39.000 --> 00:09:42.000
That takes a long time.
There's a physical motion.

00:09:42.000 --> 00:09:45.000
Everything else here doesn't
usually have physical motion.

00:09:45.000 --> 00:09:47.000
It's just electric.
So, this is really,

00:09:47.000 --> 00:09:51.000
really slow in latency,
so when you read something out

00:09:51.000 --> 00:09:54.000
of disk, you might as well read
a lot of data from disk,

00:09:54.000 --> 00:09:57.000
like a megabyte or so.
That's probably even small these

00:09:57.000 --> 00:09:58.000
days.
Maybe you read multiple

00:09:58.000 --> 00:10:02.000
megabytes when you read anything
from disk if you want these to

00:10:02.000 --> 00:10:06.000
be matched.
OK, there's a bit of a problem

00:10:06.000 --> 00:10:10.000
with doing that.
Any suggestions what the

00:10:10.000 --> 00:10:14.000
problem would be?
So, you have this algorithm.

00:10:14.000 --> 00:10:17.000
And, whenever it reads
something off of disk,

00:10:17.000 --> 00:10:22.000
it reads an entire megabyte of
stuff around the element it

00:10:22.000 --> 00:10:26.000
asked for.
So the amortized cost to access

00:10:26.000 --> 00:10:31.000
is going to be reasonable,
but that's actually sort of

00:10:31.000 --> 00:10:34.000
assuming something.
Yeah?

00:10:34.000 --> 00:10:38.000
Right.
I'm assuming I'm actually going to

00:10:38.000 --> 00:10:43.000
use the rest of that data.
If I'm going to read 10 MB

00:10:43.000 --> 00:10:49.000
around the one element that
I asked for, I access A[i],

00:10:49.000 --> 00:10:55.000
and I get 10 million items of
A around i, it would be kind of

00:10:55.000 --> 00:11:00.000
good if the algorithm actually
used that data for something.

00:11:00.000 --> 00:11:06.000
It seems reasonable.
So, this would be spatial

00:11:06.000 --> 00:11:08.000
locality.
So, we want,

00:11:08.000 --> 00:11:15.000
I mean the goal of this world
in cache oblivious algorithms

00:11:15.000 --> 00:11:20.000
and cache efficient algorithms
in general is that you want

00:11:20.000 --> 00:11:26.000
algorithms that perform well
when this is happening.

00:11:26.000 --> 00:11:31.000
So, this is the idea of
blocking.

00:11:31.000 --> 00:11:36.000
And we want the algorithm to
use all or at least most of the

00:11:36.000 --> 00:11:41.000
elements in a block,
a consecutive chunk of memory.

00:11:41.000 --> 00:11:45.000
So, this is spatial locality.

00:11:55.000 --> 00:11:57.000
Ideally, we'd use all of them
right then.

00:11:57.000 --> 00:11:59.000
But I mean, depending on your
algorithm, that's a little bit

00:11:59.000 --> 00:12:01.000
tricky.
There is another issue,

00:12:01.000 --> 00:12:03.000
though.
So, you read your thing in,

00:12:03.000 --> 00:12:05.000
your 10 MB, into main
memory, let's say,

00:12:05.000 --> 00:12:07.000
and your memory,
let's say, is at least,

00:12:07.000 --> 00:12:10.000
these days you should have a 4
GB memory or something.

00:12:10.000 --> 00:12:13.000
So, you could actually read
a lot of different blocks into

00:12:13.000 --> 00:12:15.000
main memory.
What you'd like is that you can

00:12:15.000 --> 00:12:17.000
use those blocks for as long as
possible.

00:12:17.000 --> 00:12:20.000
Maybe you don't even reuse them.
If you have a linear time

00:12:20.000 --> 00:12:23.000
algorithm, you're probably only
going to visit each element a

00:12:23.000 --> 00:12:25.000
constant number of times.
So, this is enough.

00:12:25.000 --> 00:12:27.000
But if your algorithm is more
than linear time,

00:12:27.000 --> 00:12:32.000
you're going to be accessing
elements more than once.

00:12:32.000 --> 00:12:37.000
So, it would be a good idea not
only to use all the elements of

00:12:37.000 --> 00:12:43.000
the blocks, but use them as many
times as you can before you have

00:12:43.000 --> 00:12:47.000
to throw the block out.
That's temporal locality.

00:12:47.000 --> 00:12:52.000
So ideally, you even reuse
blocks as much as possible.

00:12:52.000 --> 00:12:55.000
So, I mean, we have all these
caches.

00:12:55.000 --> 00:13:01.000
So, I didn't write this word.
Just in case you don't know how

00:13:01.000 --> 00:13:07.000
to spell it, it's not the money.
We should use those caches for

00:13:07.000 --> 00:13:09.000
something.
I mean, the fact that they

00:13:09.000 --> 00:13:13.000
store more than one block,
each cache can store several

00:13:13.000 --> 00:13:14.000
blocks.
How many?

00:13:14.000 --> 00:13:17.000
Well, we'll get to that in a
second.

00:13:17.000 --> 00:13:20.000
OK, so this is the general
motivation, but at this point

00:13:20.000 --> 00:13:23.000
the model is still pretty damn
ugly.

00:13:23.000 --> 00:13:27.000
If you wanted to design an
algorithm that runs well on this

00:13:27.000 --> 00:13:30.000
kind of machine directly,
it's possible but pretty

00:13:30.000 --> 00:13:34.000
difficult, and essentially never
done, let's say,

00:13:34.000 --> 00:13:39.000
even though this is what real
machines look like.

00:13:39.000 --> 00:13:42.000
At least in theory,
and pretty much in practice,

00:13:42.000 --> 00:13:47.000
the main thing to think about
is two levels at a time.

00:13:47.000 --> 00:13:51.000
So, this is a simplification
where we can say a lot more

00:13:51.000 --> 00:13:55.000
about algorithms,
a simplification over this

00:13:55.000 --> 00:13:57.000
model.
So, in this model,

00:13:57.000 --> 00:14:01.000
each of these levels has
different block sizes,

00:14:01.000 --> 00:14:06.000
and different total sizes,
it's a mess to deal with and

00:14:06.000 --> 00:14:10.000
design algorithms for.
If you just think about two

00:14:10.000 --> 00:14:17.000
levels, it's relatively easy.
So, we have our CPU which we

00:14:17.000 --> 00:14:22.000
assume has a constant number of
registers only.

00:14:22.000 --> 00:14:27.000
So, you know,
once it has a couple of data

00:14:27.000 --> 00:14:31.000
items, you can add them and
whatnot.

00:14:31.000 --> 00:14:35.000
Then we have this really fast
pipe.

00:14:35.000 --> 00:14:41.000
So, I draw it thick to some
cache.

00:14:41.000 --> 00:14:49.000
So this is cache.
And, we have a relatively

00:14:49.000 --> 00:14:58.000
narrow pipe to some really big
other storage,

00:14:58.000 --> 00:15:06.000
which I will call main memory.
So, I mean, that's the general

00:15:06.000 --> 00:15:09.000
picture.
Now, this could represent any

00:15:09.000 --> 00:15:12.000
two of these levels.
It could be between L3 cache

00:15:12.000 --> 00:15:14.000
and main memory.
That's maybe what

00:15:14.000 --> 00:15:16.000
the naming corresponds to best.

00:15:16.000 --> 00:15:20.000
Or cache could in fact be main
memory, what we consider the RAM

00:15:20.000 --> 00:15:23.000
of the machine,
and what's called main memory over

00:15:23.000 --> 00:15:26.000
there could be the disk.
It's whatever you care about.

00:15:26.000 --> 00:15:28.000
And usually,
if you have a program,

00:15:28.000 --> 00:15:34.000
we assume
everything fits in main memory.

00:15:34.000 --> 00:15:36.000
Then you care about the caching
behavior.

00:15:36.000 --> 00:15:39.000
So you probably look between
these two levels.

00:15:39.000 --> 00:15:42.000
That's probably what matters
the most in a program because

00:15:42.000 --> 00:15:46.000
the cost differential here is
really big relative to the cost

00:15:46.000 --> 00:15:49.000
differential here.
If your data doesn't even fit

00:15:49.000 --> 00:15:51.000
it main memory,
and you have to go to disk,

00:15:51.000 --> 00:15:54.000
then you really care about this
level because the cost

00:15:54.000 --> 00:15:57.000
differential here is huge.
It's like six orders of

00:15:57.000 --> 00:16:00.000
magnitude, let's say.
So, in practice you may think

00:16:00.000 --> 00:16:05.000
of just two memory levels that
are the most relevant.

00:16:05.000 --> 00:16:09.000
OK, now I'm going to define
some parameters.

00:16:09.000 --> 00:16:14.000
I'm going to call them cache
and main memory just for clarity

00:16:14.000 --> 00:16:20.000
because I like to think of main
memory just the way it used to

00:16:20.000 --> 00:16:23.000
be.
And now all we have to worry

00:16:23.000 --> 00:16:26.000
about is this extra thing called
cache.

00:16:26.000 --> 00:16:31.000
It has some bounded size,
and there's a block size.

00:16:31.000 --> 00:16:36.000
The block size is B,
and the number of blocks is M

00:16:36.000 --> 00:16:41.000
over B.
So, the total size of the cache

00:16:41.000 --> 00:16:44.000
is M.
OK, main memory is also blocked

00:16:44.000 --> 00:16:49.000
into blocks of size B.
And we assume that it has

00:16:49.000 --> 00:16:55.000
essentially infinite size.
We don't care about its size in

00:16:55.000 --> 00:16:59.000
this picture.
It's whatever is big enough to

00:16:59.000 --> 00:17:04.000
hold the size of your algorithm,
or data structure,

00:17:04.000 --> 00:17:09.000
or whatever.
OK, so that's the general

00:17:09.000 --> 00:17:11.000
model.
And for strange,

00:17:11.000 --> 00:17:15.000
historical reasons,
which I don't want to get into,

00:17:15.000 --> 00:17:20.000
these things are called capital
M and capital B.

00:17:20.000 --> 00:17:25.000
Even though M sounds a lot like
memory, it's really for cache,

00:17:25.000 --> 00:17:29.000
and don't ask.
OK, this is to preserve

00:17:29.000 --> 00:17:32.000
history.
OK, now what do we do with this

00:17:32.000 --> 00:17:34.000
model?
It seems nice,

00:17:34.000 --> 00:17:36.000
but now what do we measure
about it?

00:17:36.000 --> 00:17:39.000
What I'm going to assume is
that the cache is really fast.

00:17:39.000 --> 00:17:43.000
So the CPU can access cache
essentially instantaneously.

00:17:43.000 --> 00:17:46.000
I still have to pay for the
computation that the CPU is

00:17:46.000 --> 00:17:50.000
doing, but I'm assuming cache is
close enough that I don't care.

00:17:50.000 --> 00:17:54.000
And that main memory is so big
that it has to be far away,

00:17:54.000 --> 00:17:56.000
and therefore,
this pipe is a problem.

00:17:56.000 --> 00:17:59.000
I mean, what I should really
draw is that pipe is still

00:17:59.000 --> 00:18:04.000
thick, but really long.
So, the latency is high.

00:18:04.000 --> 00:18:07.000
The bandwidth is still high.
OK, and all transfers here

00:18:07.000 --> 00:18:10.000
happened as blocks.
So, when you don't have

00:18:10.000 --> 00:18:12.000
something, the idea is the CPU
asks for A of I,

00:18:12.000 --> 00:18:15.000
asks for something in memory,
if it's in the cache,

00:18:15.000 --> 00:18:17.000
it gets it.
That's free.

00:18:17.000 --> 00:18:21.000
Otherwise, it has to grab the
entire block containing that

00:18:21.000 --> 00:18:23.000
element from main memory,
brings it into cache,

00:18:23.000 --> 00:18:26.000
maybe kicks somebody out if the
cache was full,

00:18:26.000 --> 00:18:29.000
and then the CPU can use that
data and keep going.

00:18:29.000 --> 00:18:33.000
Until it accesses something
else that's not in cache,

00:18:33.000 --> 00:18:37.000
then it has to grab it from
main memory.

00:18:37.000 --> 00:18:43.000
When you kick something out,
you're actually writing back to

00:18:43.000 --> 00:18:46.000
memory.
That's the model.
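
NOTE
A minimal sketch in Python (mine, not from the lecture) of the model just
described: a hit in the cache is free, a miss fetches the whole block of
size B into one of the M/B slots, evicting some resident block if the
cache is full. The eviction policy is pluggable; the ideal choice is
discussed shortly.
class TwoLevelCache:
    def __init__(self, M, B, evict=None):
        self.B, self.slots = B, M // B      # M/B block slots in the cache
        self.resident = set()               # block ids currently cached
        self.transfers = 0                  # MT: count of block transfers
        self.evict = evict or (lambda blocks: next(iter(blocks)))
    def access(self, i):
        b = i // self.B                     # block containing element i
        if b in self.resident:
            return                          # cache hit: free
        if len(self.resident) == self.slots:
            self.resident.remove(self.evict(self.resident))
        self.resident.add(b)
        self.transfers += 1                 # one memory transfer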

00:18:46.000 --> 00:18:51.000
So, we suppose the accesses to
cache are free.

00:18:51.000 --> 00:18:56.000
But we can still think about
the running time of the

00:18:56.000 --> 00:19:01.000
algorithm.
I'm not going to change the

00:19:01.000 --> 00:19:05.000
definition of running time.
This would be the computation

00:19:05.000 --> 00:19:10.000
time, or the work if you want to
use multithreaded lingo,

00:19:10.000 --> 00:19:13.000
computation time.
OK, so we still have time,

00:19:13.000 --> 00:19:17.000
and T of N will still mean what
it did before.

00:19:17.000 --> 00:19:22.000
This is just an extra level of
refinement of understanding of

00:19:22.000 --> 00:19:24.000
what's going on.
Essentially,

00:19:24.000 --> 00:19:29.000
measuring the parallelism that
we can exploit out of the memory

00:19:29.000 --> 00:19:34.000
system, that when you access
something you actually get B

00:19:34.000 --> 00:19:39.000
items.
So, this is the old stuff.

00:19:39.000 --> 00:19:47.000
Now, what I want to do is count
memory transfers.

00:19:47.000 --> 00:19:56.000
These are transfers of blocks,
so I should say block memory

00:19:56.000 --> 00:20:04.000
transfers between the two
levels, so, between the cache

00:20:04.000 --> 00:20:12.000
and main memory.
So, memory transfers are either

00:20:12.000 --> 00:20:19.000
reads or writes.
Maybe I should say that.

00:20:19.000 --> 00:20:29.000
These are number of block reads
and writes from and to the main

00:20:29.000 --> 00:20:33.000
memory.
OK, so I'm going to introduce

00:20:33.000 --> 00:20:35.000
some notation.
This is new notation,

00:20:35.000 --> 00:20:40.000
so we'll see how it works out.
MT of N I want to represent the

00:20:40.000 --> 00:20:44.000
number of memory transfers
instead of just normal time, for

00:20:44.000 --> 00:20:49.000
a problem of size N.
Really, this is a function that

00:20:49.000 --> 00:20:52.000
depends not only on N but also
on these parameters,

00:20:52.000 --> 00:20:56.000
B and M, in our model.
So, this is what it should be,

00:20:56.000 --> 00:21:00.000
MT_B,M(N), but that's obviously
pretty messy,

00:21:00.000 --> 00:21:04.000
so I'm going to stick to MT of
N.

00:21:04.000 --> 00:21:07.000
But this will always,
because mainly I care about the

00:21:07.000 --> 00:21:09.000
growth in terms of N.
Well, I care about the growth

00:21:09.000 --> 00:21:12.000
in terms of all things,
but the only thing I could

00:21:12.000 --> 00:21:14.000
change is N.
So, most of the time I only

00:21:14.000 --> 00:21:17.000
think about, like when we are
writing recurrences,

00:21:17.000 --> 00:21:20.000
only N is changing.
I can't recurse on the block

00:21:20.000 --> 00:21:22.000
size.
I can't recurse on the size of

00:21:22.000 --> 00:21:24.000
cache.
Those are given to me.

00:21:24.000 --> 00:21:26.000
They're fixed.
OK, so we'll be changing N

00:21:26.000 --> 00:21:28.000
mainly.
But B and M always matter here.

00:21:28.000 --> 00:21:31.000
They're not constants.
They're parameters of the

00:21:31.000 --> 00:21:34.000
model.
OK, easy enough.

00:21:34.000 --> 00:21:39.000
This is something called the
disk access model,

00:21:39.000 --> 00:21:44.000
if you like DAM models,
or the external memory model,

00:21:44.000 --> 00:21:50.000
or the cache aware model.
Maybe I should mention that;

00:21:50.000 --> 00:21:55.000
this is the cache aware.
In general, you have some

00:21:55.000 --> 00:22:01.000
algorithm that runs on this kind
of model, machine model.

00:22:01.000 --> 00:22:07.000
That's a cache aware algorithm.
OK, we're not too interested in

00:22:07.000 --> 00:22:10.000
cache aware algorithms.
We've seen one,

00:22:10.000 --> 00:22:12.000
B trees.
B trees are a cache aware data

00:22:12.000 --> 00:22:14.000
structure.
You assume that there is some

00:22:14.000 --> 00:22:15.000
block size, B,
underlying.

00:22:15.000 --> 00:22:18.000
Maybe you didn't see exactly
this model.

00:22:18.000 --> 00:22:20.000
In particular,
it didn't really matter how big

00:22:20.000 --> 00:22:23.000
the cache was because you just
wanted to know.

00:22:23.000 --> 00:22:26.000
When I read B items,
I can use all of them as much

00:22:26.000 --> 00:22:29.000
as possible and figure out where
I fit among those B items,

00:22:29.000 --> 00:22:32.000
and that gives me log base B of
N memory transfers instead of

00:22:32.000 --> 00:22:36.000
log N, which is what you would
get if you just used your favorite

00:22:36.000 --> 00:22:41.000
balanced binary search tree.
So, log base B of N is

00:22:41.000 --> 00:22:46.000
definitely better than log base
2 of N.

00:22:46.000 --> 00:22:51.000
B trees are a cache aware
algorithm.

00:22:51.000 --> 00:22:58.000
OK, what we would like to do
today and next lecture is get

00:22:58.000 --> 00:23:06.000
cache oblivious algorithms.
So, there's essentially only

00:23:06.000 --> 00:23:12.000
one difference between cache
aware algorithms and cache

00:23:12.000 --> 00:23:18.000
oblivious algorithms.
In cache oblivious algorithms,

00:23:18.000 --> 00:23:22.000
the algorithm doesn't know what
B and M are.

00:23:22.000 --> 00:23:30.000
So this is a bit of a subtle
point, but very cool idea.

00:23:30.000 --> 00:23:32.000
You assume that this is the
model of the machine,

00:23:32.000 --> 00:23:36.000
and you care about the number
of memory transfers between this

00:23:36.000 --> 00:23:39.000
cache of size M with blocking B,
and main memory with blocking

00:23:39.000 --> 00:23:41.000
B.
But you don't actually know

00:23:41.000 --> 00:23:43.000
what the model is.
You don't know the other

00:23:43.000 --> 00:23:45.000
parameters of the model.
It looks like this,

00:23:45.000 --> 00:23:48.000
but you don't know the width.
You don't know the height.

00:23:48.000 --> 00:23:50.000
Why not?
So, the analysis knows what B

00:23:50.000 --> 00:23:52.000
and M are.
We are going to write some

00:23:52.000 --> 00:23:56.000
algorithms which look just like
boring old algorithms that we've

00:23:56.000 --> 00:24:00.000
seen throughout the lecture.
That's one of the nice things

00:24:00.000 --> 00:24:03.000
about this model.
Every algorithm we have seen is

00:24:03.000 --> 00:24:06.000
a cache oblivious algorithm,
all right, because we didn't

00:24:06.000 --> 00:24:08.000
even know the word cache in this
class until today.

00:24:08.000 --> 00:24:11.000
So, we already have lots of
algorithms to choose from.

00:24:11.000 --> 00:24:13.000
The thing is,
some of them will perform well

00:24:13.000 --> 00:24:15.000
in this model,
and some of them won't.

00:24:15.000 --> 00:24:18.000
So, we would like to design
algorithms that look just like our

00:24:18.000 --> 00:24:21.000
old algorithms but that happen to
perform well in this context,

00:24:21.000 --> 00:24:24.000
no matter what B and M are.
So, another way to say this is: the

00:24:24.000 --> 00:24:27.000
same algorithm should work well
for all values of B and M if you

00:24:27.000 --> 00:24:31.000
have a good cache oblivious
algorithm.

00:24:31.000 --> 00:24:33.000
OK, there are a few
consequences to this assumption.

00:24:33.000 --> 00:24:36.000
In a cache aware algorithm,
you can explicitly say,

00:24:36.000 --> 00:24:39.000
OK, I'm blocking my memory into
chunks of size B.

00:24:39.000 --> 00:24:42.000
Here they are.
I'm going to store these B

00:24:42.000 --> 00:24:44.000
elements here,
these B elements here,

00:24:44.000 --> 00:24:46.000
because you know B,
you can do that.

00:24:46.000 --> 00:24:48.000
You can say,
well, OK, now I want to read

00:24:48.000 --> 00:24:51.000
these B items into my cache,
and then write out these ones

00:24:51.000 --> 00:24:53.000
over here.
You can explicitly maintain

00:24:53.000 --> 00:24:55.000
your cache.
With cache oblivious

00:24:55.000 --> 00:25:00.000
algorithms, you can't because
you don't know what it is.

00:25:00.000 --> 00:25:04.000
So, it's got to be all
implicit.

00:25:04.000 --> 00:25:11.000
And this is pretty much how
caches work anyway except for

00:25:11.000 --> 00:25:15.000
disk.
So, it's a pretty reasonable

00:25:15.000 --> 00:25:18.000
model.
In particular,

00:25:18.000 --> 00:25:24.000
when you access an element
that's not in cache,

00:25:24.000 --> 00:25:33.000
you automatically fetch the
block containing that element.

00:25:33.000 --> 00:25:38.000
And you pay one memory transfer
for that if it wasn't already

00:25:38.000 --> 00:25:41.000
there.
Another bit of a catch here is,

00:25:41.000 --> 00:25:45.000
what if your cache is full?
Then you've got to kick some

00:25:45.000 --> 00:25:50.000
block out of your cache.
And then, so we need some model

00:25:50.000 --> 00:25:55.000
of which block gets kicked out
because we can't control that.

00:25:55.000 --> 00:26:00.000
We have no knowledge of what
the blocks are in our algorithm.

00:26:00.000 --> 00:26:05.000
So what we're going to assume
in this model is the ideal

00:26:05.000 --> 00:26:10.000
thing, that when you fetch a new
block, if your cache is full,

00:26:10.000 --> 00:26:17.000
you evict a block that will be
used farthest in the future.

00:26:17.000 --> 00:26:21.000
Sorry, the furthest.
Farthest is distance.

00:26:21.000 --> 00:26:25.000
Furthest is time.
Furthest in the future.

00:26:25.000 --> 00:26:31.000
OK, this would be the best
possible thing to do.

00:26:31.000 --> 00:26:35.000
It's a little bit hard to do in
practice because you don't know

00:26:35.000 --> 00:26:38.000
the future generally,
unless you're omniscient.

00:26:38.000 --> 00:26:41.000
So, this is a bit of an
idealized model.

00:26:41.000 --> 00:26:45.000
But it's pretty reasonable in
the sense that if you've read

00:26:45.000 --> 00:26:49.000
the reading handout number 20,
this paper by Sleator and

00:26:49.000 --> 00:26:52.000
Tarjan, they introduce the idea
of competitive algorithms.

00:26:52.000 --> 00:26:56.000
So, we only talked about a
small portion of that paper, the

00:26:56.000 --> 00:27:01.000
move-to-front heuristic for
storing a list.

00:27:01.000 --> 00:27:03.000
But, it also proves that there
are strategies,

00:27:03.000 --> 00:27:06.000
and maybe you heard this in
recitation.

00:27:06.000 --> 00:27:08.000
Some people covered it;
some didn't,

00:27:08.000 --> 00:27:10.000
that these are called paging
strategies.

00:27:10.000 --> 00:27:13.000
So, you want to maintain some
cache of pages or blocks,

00:27:13.000 --> 00:27:17.000
and you pay whenever you have
to access a block that's not in

00:27:17.000 --> 00:27:19.000
your cache.
The best thing to do is to

00:27:19.000 --> 00:27:23.000
always kick out the block that
will be used farthest in the

00:27:23.000 --> 00:27:27.000
future because that way you'll
use all the blocks that are in

00:27:27.000 --> 00:27:28.000
there.
This turns out to be the

00:27:28.000 --> 00:27:33.000
offline optimal strategy if you
knew the future.

00:27:33.000 --> 00:27:35.000
But, there are algorithms that
are essentially constant

00:27:35.000 --> 00:27:37.000
competitive against this
strategy.

00:27:37.000 --> 00:27:40.000
I don't want to get into
details because they're not

00:27:40.000 --> 00:27:43.000
exactly constant competitive.
But they are sufficiently

00:27:43.000 --> 00:27:46.000
constant competitive for the
purposes of this lecture that we

00:27:46.000 --> 00:27:49.000
can assume this,
and not have to worry about it.

00:27:49.000 --> 00:27:51.000
Most of the time,
we don't even really use this

00:27:51.000 --> 00:27:53.000
assumption.
But there it is.

00:27:53.000 --> 00:27:55.000
That's the cache oblivious
model.

00:27:55.000 --> 00:27:58.000
It makes things cleaner to
think about: anything that

00:27:58.000 --> 00:28:01.000
should be done,
will be done.

00:28:01.000 --> 00:28:05.000
And you can simulate that with
least recently used or whatever

00:28:05.000 --> 00:28:10.000
good heuristic that you want
that's competitive against the

00:28:10.000 --> 00:28:12.000
optimal.
OK, that's pretty much the

00:28:12.000 --> 00:28:16.000
cache oblivious model.
Once you have the two level

00:28:16.000 --> 00:28:20.000
model, you just assume you don't
know B and M.

00:28:20.000 --> 00:28:24.000
You have this automatic
requesting and writing, and whatnot.

00:28:24.000 --> 00:28:28.000
A little bit more to say,
I guess, it may be obvious at

00:28:28.000 --> 00:28:34.000
this point, but I've been
drawing everything as tables.

00:28:34.000 --> 00:28:37.000
So, it's not really clear what
the linear order is.

00:28:37.000 --> 00:28:40.000
Linear order is just the
reading order.

00:28:40.000 --> 00:28:44.000
So, although we don't
explicitly say it most of the

00:28:44.000 --> 00:28:48.000
time, a typical model is that
memory is a linear array.

00:28:48.000 --> 00:28:53.000
Everything that you ever store
in your program is written in

00:28:53.000 --> 00:28:57.000
this linear array.
If you've ever programmed in

00:28:57.000 --> 00:29:01.000
Assembly or whatever,
that's the model.

00:29:01.000 --> 00:29:04.000
You have the address space,
and any number between here and

00:29:04.000 --> 00:29:08.000
here, that's where you can
actually, this is physical

00:29:08.000 --> 00:29:11.000
memory.
This is all you can write to.

00:29:11.000 --> 00:29:15.000
So, it starts at zero and goes
out to, let's call it infinity

00:29:15.000 --> 00:29:17.000
over here.
And, if you allocate some

00:29:17.000 --> 00:29:20.000
array, maybe it occupies some
space in the middle.

00:29:20.000 --> 00:29:23.000
Who knows?
OK, we usually don't think

00:29:23.000 --> 00:29:26.000
about that much.
What I care about now is that

00:29:26.000 --> 00:29:29.000
memory itself is blocked in this
view.

00:29:29.000 --> 00:29:31.000
So, however your stuff is
stored in memory,

00:29:31.000 --> 00:29:36.000
it's blocked into clusters of
length B.

00:29:36.000 --> 00:29:39.000
So, if this is,
let me call it one and be a

00:29:39.000 --> 00:29:41.000
little bit nicer.
This is B.

00:29:41.000 --> 00:29:46.000
This is position B plus one.
This is 2B, and 2B plus one,

00:29:46.000 --> 00:29:49.000
and so on.
These are the indexes into

00:29:49.000 --> 00:29:51.000
memory.
This is how the blocking

00:29:51.000 --> 00:29:54.000
happens.
If you access something here,

00:29:54.000 --> 00:29:59.000
you get that chunk:
round it down to the previous

00:29:59.000 --> 00:30:02.000
multiple of B,
round it up to the next

00:30:02.000 --> 00:30:06.000
multiple of B.
That's what you always get.
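
NOTE
A minimal sketch (mine) of the blocking arithmetic just described, using
the board's 1-indexed positions: touching element i fetches everything
between the surrounding multiples of B.
def block_span(i, B):
    lo = ((i - 1) // B) * B + 1   # round down to the previous multiple of B
    return (lo, lo + B - 1)       # ...and up to the next one
# e.g. block_span(11, 8) == (9, 16): accessing position 11 fetches 9..16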

00:30:06.000 --> 00:30:11.000
OK, so if you think about some
array that's maybe allocated

00:30:11.000 --> 00:30:15.000
here, OK, you have to keep in
mind that that array may not be

00:30:15.000 --> 00:30:18.000
perfectly aligned with the
blocks.

00:30:18.000 --> 00:30:21.000
But more or less it will be so
we don't care too much.

00:30:21.000 --> 00:30:24.000
But that's a bit of a subtlety
there.

00:30:24.000 --> 00:30:28.000
OK, so that's pretty much the
model.

00:30:28.000 --> 00:30:32.000
So every algorithm we've seen,
except B trees,

00:30:32.000 --> 00:30:36.000
is a cache oblivious algorithm.
And our question is,

00:30:36.000 --> 00:30:41.000
now, we know how everything
runs in terms of running time.

00:30:41.000 --> 00:30:46.000
Now we want to measure the
number of memory transfers,

00:30:46.000 --> 00:30:49.000
MT of N.
I want to mention one other

00:30:49.000 --> 00:30:53.000
fact or theorem.
I'll put it in brackets because

00:30:53.000 --> 00:30:58.000
I don't want to state it
precisely.

00:30:58.000 --> 00:31:04.000
But if you have an algorithm
that is efficient on two levels,

00:31:04.000 --> 00:31:08.000
so in other words,
what we're looking at,

00:31:08.000 --> 00:31:14.000
if we just think about the two
level world and your algorithm

00:31:14.000 --> 00:31:18.000
is cache oblivious,
then it is efficient on any

00:31:18.000 --> 00:31:23.000
number of levels in your memory
hierarchy, say,

00:31:23.000 --> 00:31:27.000
L levels.
So, I don't want to define what

00:31:27.000 --> 00:31:31.000
efficient means.
But the intuition is,

00:31:31.000 --> 00:31:34.000
if your machine really looks
like this and you have a cache

00:31:34.000 --> 00:31:36.000
oblivious algorithm,
you can apply the cache

00:31:36.000 --> 00:31:38.000
oblivious analysis for all B and
M.

00:31:38.000 --> 00:31:41.000
So you can analyze the number
of memory transfers here,

00:31:41.000 --> 00:31:43.000
here, here, here,
and here.

00:31:43.000 --> 00:31:45.000
And if you have a good cache
oblivious algorithm,

00:31:45.000 --> 00:31:48.000
the performances at all those
levels has to be good.

00:31:48.000 --> 00:31:51.000
And therefore,
the whole performance is good.

00:31:51.000 --> 00:31:54.000
Good here means asymptotically
optimal up to constant factors,

00:31:54.000 --> 00:31:57.000
something like that.
OK, so I don't want to prove

00:31:57.000 --> 00:32:01.000
that, and you can read the cache
oblivious papers.

00:32:01.000 --> 00:32:04.000
That's a nice fact about cache
oblivious algorithms.

00:32:04.000 --> 00:32:08.000
If you have a cache aware
algorithm that tunes to a

00:32:08.000 --> 00:32:12.000
particular value of B,
and a particular value of M,

00:32:12.000 --> 00:32:15.000
you're not going to have that
problem.

00:32:15.000 --> 00:32:19.000
So, this is one nice feature of
cache obliviousness.

00:32:19.000 --> 00:32:23.000
Another nice feature is when
you are coding the algorithm,

00:32:23.000 --> 00:32:26.000
you don't have to put in B and
M.

00:32:26.000 --> 00:32:28.000
So, that simplifies things a
bit.

00:32:28.000 --> 00:32:34.000
So, let's do some algorithms.
Enough about models.

00:32:34.000 --> 00:32:40.000
OK, we're going to start out
with some really simple things

00:32:40.000 --> 00:32:45.000
just to get warmed up on the
analysis side.

00:32:45.000 --> 00:32:52.000
The most basic thing you can do
that's good in a cache oblivious

00:32:52.000 --> 00:32:57.000
world is scanning.
So, scanning is just visiting

00:32:57.000 --> 00:33:03.000
the items in an array in order.
So, visit A_1 up to A_N in

00:33:03.000 --> 00:33:06.000
order.
For some notion of visit,

00:33:06.000 --> 00:33:09.000
this is presumably some
constant time operation.

00:33:09.000 --> 00:33:12.000
For example,
suppose you want to compute the

00:33:12.000 --> 00:33:16.000
aggregate of the array.
You want to sum all the

00:33:16.000 --> 00:33:19.000
elements in the array.
So, you have one extra variable

00:33:19.000 --> 00:33:23.000
you're using, but you can store that in
a register or whatever,

00:33:23.000 --> 00:33:27.000
so that's one simple example.
Sum the array.
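
NOTE
A minimal sketch (mine) of the scan example: the visit is O(1) work per
element, and the scan touches about N/B blocks, plus up to two partly
wasted blocks at the unaligned ends, hence O(N/B + 1) memory transfers.
def scan_sum(A):
    total = 0          # the one extra variable, kept in a register
    for x in A:        # visit A_1 up to A_N in order
        total += x
    return total
def scan_transfers(N, B):
    return N // B + 2  # upper bound: N/B full blocks plus two end blocks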

00:33:27.000 --> 00:33:31.000
OK, so here's the picture.
We have our memory.

00:33:31.000 --> 00:33:36.000
Each of these cells represents
one item, one element,

00:33:36.000 --> 00:33:38.000
log N bits, one word,
whatever.

00:33:38.000 --> 00:33:43.000
Our array is somewhere in here.
Maybe it's there.

00:33:43.000 --> 00:33:47.000
And we go from here to here to
here to here.

00:33:47.000 --> 00:33:50.000
OK, and so on.
So, what does this cost?

00:33:50.000 --> 00:33:53.000
What is the number of memory
transfers?

00:33:53.000 --> 00:33:57.000
We know that this is a linear
time algorithm.

00:33:57.000 --> 00:34:03.000
It takes order N time.
What does it cost in terms of

00:34:03.000 --> 00:34:07.000
memory transfers?
N over B, pretty much.

00:34:07.000 --> 00:34:12.000
We like to say it's order N
over B, plus two, or plus one in the

00:34:12.000 --> 00:34:15.000
big O.
This is a bit of a worry.

00:34:15.000 --> 00:34:18.000
I mean, N could be smaller than
B.

00:34:18.000 --> 00:34:21.000
We really want to think about
all the cases,

00:34:21.000 --> 00:34:26.000
especially because usually
you're not doing this on

00:34:26.000 --> 00:34:31.000
something of size N.
You're doing it on something of

00:34:31.000 --> 00:34:37.000
size k, where we don't really
know much about k.

00:34:37.000 --> 00:34:40.000
But in general,
it's N over B plus one because

00:34:40.000 --> 00:34:43.000
we always need at least one
memory transfer to look at

00:34:43.000 --> 00:34:46.000
something, unless N is zero.
And in particular,

00:34:46.000 --> 00:34:49.000
it's plus two if you care about
the constants.

00:34:49.000 --> 00:34:53.000
If I don't write the big O,
then it would be plus two at

00:34:53.000 --> 00:34:57.000
most because you could
essentially waste the first

00:34:57.000 --> 00:35:01.000
block, and then everything is
fine for a while.

00:35:01.000 --> 00:35:05.000
And then, if you're unlucky,
you essentially waste the last

00:35:05.000 --> 00:35:08.000
block.
There is just one element in

00:35:08.000 --> 00:35:12.000
that block, and you're not
getting much out of it.

00:35:12.000 --> 00:35:16.000
Everything in the middle,
though, every block between the

00:35:16.000 --> 00:35:19.000
first and last block has to be
full.

00:35:19.000 --> 00:35:22.000
So, you're using all of those
elements.

00:35:22.000 --> 00:35:26.000
So out of the N elements,
you only have N over B blocks

00:35:26.000 --> 00:35:28.000
because each block has B
elements.

00:35:28.000 --> 00:35:33.000
OK, that's pretty trivial.
Let me do something slightly

00:35:33.000 --> 00:35:38.000
more interesting,
which is two scans at once.

00:35:38.000 --> 00:35:41.000
OK, here we are not assuming
anything about M.

00:35:41.000 --> 00:35:45.000
We're not assuming anything
about the size of the cache,

00:35:45.000 --> 00:35:48.000
just that it can hold a single
block.

00:35:48.000 --> 00:35:51.000
The last block that we visited
has to be there.

00:35:51.000 --> 00:35:55.000
OK, you can also do a constant
number of parallel scans.

00:35:55.000 --> 00:36:00.000
This is not really parallel in
the sense of multithreaded,

00:36:00.000 --> 00:36:06.000
but simulated parallelism.
I mean, if you have a constant

00:36:06.000 --> 00:36:09.000
number, do one,
do the other,

00:36:09.000 --> 00:36:12.000
do the other,
come back, come back,

00:36:12.000 --> 00:36:18.000
come back, all right,
visit them in turn round robin,

00:36:18.000 --> 00:36:20.000
whatever.
For example,

00:36:20.000 --> 00:36:26.000
here's a cute piece of code.
If you want to reverse an

00:36:26.000 --> 00:36:33.000
array, OK, then you can do it.
This is a good puzzle.

00:36:33.000 --> 00:36:38.000
You can do it by essentially
two scans where you repeatedly

00:36:38.000 --> 00:36:42.000
swapped the first and last
element.

00:36:42.000 --> 00:36:46.000
So I'm swapping A_i with
A_(N minus i plus one),

00:36:46.000 --> 00:36:51.000
with i starting at one.
So, here's your array.

00:36:51.000 --> 00:36:54.000
Suppose this is actually my
array.

00:36:54.000 --> 00:36:59.000
I swap these two guys,
and I saw these two guys,

00:36:59.000 --> 00:37:04.000
and so on.
That will reverse my array,

00:37:04.000 --> 00:37:08.000
and hopefully this handles the
middle element as well if N is odd.

00:37:08.000 --> 00:37:13.000
It should not do anything to it.
And you can view this as two

00:37:13.000 --> 00:37:16.000
scans.
There is one scan that's coming

00:37:16.000 --> 00:37:19.000
in this way.
There's also a reverse scan,

00:37:19.000 --> 00:37:23.000
ooh, some more sophisticated,
coming back this way.

00:37:23.000 --> 00:37:26.000
Of course, reverse scan has the
same analysis.
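
NOTE
A minimal sketch (mine) of the reversal just described: a forward scan
and a reverse scan, swapping as they go. As noted next, this costs
O(N/B + 1) transfers as long as the cache holds at least two blocks.
def reverse(A):
    i, j = 0, len(A) - 1          # forward scan and reverse scan
    while i < j:
        A[i], A[j] = A[j], A[i]   # swap A[i] with A[N-1-i] (0-indexed)
        i, j = i + 1, j - 1       # the middle element, if N is odd, stays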

00:37:26.000 --> 00:37:31.000
And as long as your cache is
big enough to store at least two

00:37:31.000 --> 00:37:35.000
blocks, which is a pretty
reasonable assumption,

00:37:35.000 --> 00:37:40.000
so let's write it.
Assuming the number of blocks

00:37:40.000 --> 00:37:43.000
in the cache,
which is M over B,

00:37:43.000 --> 00:37:49.000
is at least two in this
algorithm, the number of memory

00:37:49.000 --> 00:37:53.000
transfers is still order N over
B plus one.

00:37:53.000 --> 00:37:58.000
OK, the constant goes up maybe,
but in this case it probably

00:37:58.000 --> 00:38:02.000
doesn't.
But who cares.

00:38:02.000 --> 00:38:06.000
OK, as long as you're doing a
constant number of scans,

00:38:06.000 --> 00:38:11.000
and some constant number of
arrays, maybe one of

00:38:11.000 --> 00:38:15.000
them is reversed,
whatever, it will take what

00:38:15.000 --> 00:38:20.000
we call linear time.
It's linear in the number of

00:38:20.000 --> 00:38:22.000
blocks in your input.
OK, great.

00:38:22.000 --> 00:38:26.000
So now you can reverse an
array: exciting.

00:38:26.000 --> 00:38:32.000
Let's try another simple
algorithm on another board.

00:38:47.000 --> 00:38:50.000
Let's try binary search.
So just like last week,

00:38:50.000 --> 00:38:53.000
we're going back to our basics
here.

00:38:53.000 --> 00:38:57.000
Scanning we didn't even talk
about in this class.

00:38:57.000 --> 00:39:02.000
Binary search is something we
talked about a little bit.

00:39:02.000 --> 00:39:04.000
It was a simple divide and
conquer algorithm.

00:39:04.000 --> 00:39:08.000
I hope you all remember it.
And if we look at an array,

00:39:08.000 --> 00:39:11.000
and I'm not going to draw the
cells here because I want to

00:39:11.000 --> 00:39:14.000
imagine a really big array,
binary search,

00:39:14.000 --> 00:39:16.000
but suppose it always goes to
left.

00:39:16.000 --> 00:39:19.000
It starts by visiting this
element in the middle.

00:39:19.000 --> 00:39:23.000
Then it goes to the quarter mark.
Then it goes to the one eighth

00:39:23.000 --> 00:39:25.000
mark.
OK, this is one hypothetical

00:39:25.000 --> 00:39:29.000
execution of a binary search.
OK, and eventually it finds the

00:39:29.000 --> 00:39:32.000
element it's looking for.
It finds where it fits at

00:39:32.000 --> 00:39:35.000
least.
So x is over here.

00:39:35.000 --> 00:39:38.000
So, we know that it takes log N
time.

00:39:38.000 --> 00:39:41.000
How many memory transfers does
it take?

00:39:41.000 --> 00:39:45.000
Now, I blocked this array into
chunks of size B,

00:39:45.000 --> 00:39:49.000
blocks of size B.
How many blocks do I touch?

00:39:49.000 --> 00:39:53.000
This one's a little bit more
subtle.

00:40:18.000 --> 00:40:21.000
It depends on the relative
sizes of N and B,

00:40:21.000 --> 00:40:23.000
yeah.
Log base B of N would be a good

00:40:23.000 --> 00:40:25.000
guess.
We would like it to be,

00:40:25.000 --> 00:40:29.000
let's say the hope
is that it's log base B of N

00:40:29.000 --> 00:40:33.000
because we know that B trees can
search in what's essentially a

00:40:33.000 --> 00:40:38.000
sorted list of N items in log
base B of N time.

00:40:38.000 --> 00:40:42.000
That turns out to be optimal in
the cache oblivious model or in

00:40:42.000 --> 00:40:46.000
the two level model; you've got
to pay log base B of N.

00:40:46.000 --> 00:40:51.000
I won't prove that here.
It's the same reason you need log N

00:40:51.000 --> 00:40:55.000
comparisons to do a binary
search in the normal model.

00:40:55.000 --> 00:41:00.000
In fact, it is possible to get log
base B of N even without knowing

00:41:00.000 --> 00:41:06.000
B.
But, binary search does not do

00:41:06.000 --> 00:41:09.000
it.
Log of N over B,

00:41:09.000 --> 00:41:13.000
yes.
So the number of memory

00:41:13.000 --> 00:41:22.000
transfers on N items is log of N
over B, plus one,

00:41:22.000 --> 00:41:31.000
let's say,
also known as log N minus log

00:41:31.000 --> 00:41:35.000
B.
OK, whereas log base B of N is

00:41:35.000 --> 00:41:39.000
log N divided by log B,
OK, clearly dividing is much better

00:41:39.000 --> 00:41:42.000
than subtracting.
So, this would be good,

00:41:42.000 --> 00:41:45.000
but this is bad.
Most of the time,

00:41:45.000 --> 00:41:47.000
this is log N,
which is no better,

00:41:47.000 --> 00:41:51.000
I mean, you're not using blocks
at all essentially.

00:41:51.000 --> 00:41:53.000
The idea is,
out here, I mean,

00:41:53.000 --> 00:41:57.000
there's some little,
tiny block that contains this

00:41:57.000 --> 00:42:00.000
thing.
I mean, tiny depends on how big

00:42:00.000 --> 00:42:03.000
B is.
But, each of these items will

00:42:03.000 --> 00:42:06.000
be in a different block until
you get essentially within one

00:42:06.000 --> 00:42:09.000
block worth of x.
When you get within one block

00:42:09.000 --> 00:42:12.000
worth of x, there's only like a
constant number of blocks that

00:42:12.000 --> 00:42:15.000
matter, and so all of these
accesses are indeed within the

00:42:15.000 --> 00:42:17.000
same block.
But, how many are there?

00:42:17.000 --> 00:42:21.000
Well, just log B because you're
only spending log B within a,

00:42:21.000 --> 00:42:24.000
if you're within an interval of
size k, you're only going to

00:42:24.000 --> 00:42:27.000
spend log k steps in it.
So, you're saving log B in

00:42:27.000 --> 00:42:30.000
here, but overall you're paying
log N, so you only get log N

00:42:30.000 --> 00:42:34.000
minus log B plus some constant.
OK, so this is bad news for

00:42:34.000 --> 00:42:37.000
binary search.
So, not all of the algorithms

00:42:37.000 --> 00:42:40.000
we've seen are going to work
well in this model.
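
NOTE
A minimal sketch (mine) that counts the distinct blocks a binary search
for x touches: for N = 2**20 and B = 2**10 it reports 11, about
log2(N) - log2(B) + 1, whereas a B tree would pay log base B of N = 2.
def binary_search_blocks(N, B, x):
    lo, hi, blocks = 0, N, set()
    while lo < hi:
        mid = (lo + hi) // 2
        blocks.add(mid // B)      # block holding the probed element
        if mid < x:
            lo = mid + 1
        else:
            hi = mid
    return len(blocks)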

00:42:40.000 --> 00:42:43.000
We need a lot more thinking
before we can solve what is

00:42:43.000 --> 00:42:47.000
essentially the binary search
problem, finding an element in a

00:42:47.000 --> 00:42:50.000
sorted list, in log base B of N
without knowing B.

00:42:50.000 --> 00:42:52.000
OK, we know we could use B
trees.

00:42:52.000 --> 00:42:53.000
If you knew B,
great, that works,

00:42:53.000 --> 00:42:56.000
and that's optimal.
But without knowing B,

00:42:56.000 --> 00:43:02.000
it's a little bit harder.
And this gets us into the world

00:43:02.000 --> 00:43:06.000
of divide and conquer.
Also like last week,

00:43:06.000 --> 00:43:13.000
and like the first few weeks of
this class, divide and conquer

00:43:13.000 --> 00:43:17.000
is your friend.
And, it turns out divide and

00:43:17.000 --> 00:43:23.000
conquer is not the only tool,
but it's a really useful tool

00:43:23.000 --> 00:43:27.000
in designing cache oblivious
algorithms.

00:43:27.000 --> 00:43:31.000
And, let me say why.

00:43:43.000 --> 00:43:47.000
So, we'll see a bunch of divide
and conquer based algorithms,

00:43:47.000 --> 00:43:50.000
cache oblivious.
And, the intuition is that we

00:43:50.000 --> 00:43:54.000
can take all the favorite
algorithms we have,

00:43:54.000 --> 00:43:56.000
obviously it doesn't always
work.

00:43:56.000 --> 00:43:59.000
Binary search was a divide and
conquer algorithm.

00:43:59.000 --> 00:44:03.000
It's not so great.
But, in general,

00:44:03.000 --> 00:44:07.000
the idea is that your algorithm
can just do the normal divide

00:44:07.000 --> 00:44:08.000
and conquer thing,
right?

00:44:08.000 --> 00:44:12.000
You divide your problem into
subproblems of smaller size

00:44:12.000 --> 00:44:15.000
repeatedly, all the way down to
problems of constant size,

00:44:15.000 --> 00:44:19.000
OK, just like before.
But, if you're recursively

00:44:19.000 --> 00:44:21.000
dividing your problem into
smaller things,

00:44:21.000 --> 00:44:24.000
at some point you can think
about it and say,

00:44:24.000 --> 00:44:27.000
well, wait, I mean,
the algorithm divides all the

00:44:27.000 --> 00:44:31.000
way, but we can think about the
point at which the problem fits

00:44:31.000 --> 00:44:36.000
in a block or fits in cache.
OK, and that's the analysis.

00:44:36.000 --> 00:44:40.000
OK, we'll think about the time
when your problem is small

00:44:40.000 --> 00:44:43.000
enough that we can analyze it in
some other way.

00:44:43.000 --> 00:44:46.000
So, usually,
we analyze it recursively.

00:44:46.000 --> 00:44:48.000
We get a recurrence.
What we're changing,

00:44:48.000 --> 00:44:50.000
essentially,
is the base case.

00:44:50.000 --> 00:44:54.000
So, in the base case,
we don't want to go down to a

00:44:54.000 --> 00:44:56.000
constant size.
That's too far.

00:44:56.000 --> 00:45:02.000
I'll show you some examples.
We want to consider the point

00:45:02.000 --> 00:45:09.000
in recursion at which either the
problem fits in cache,

00:45:09.000 --> 00:45:17.000
so it has size less than or
equal to M, or it fits in order

00:45:17.000 --> 00:45:22.000
one blocks.
That's another natural time to

00:45:22.000 --> 00:45:27.000
do it.
Order one blocks would be even

00:45:27.000 --> 00:45:35.000
better than fitting in cache.
So, this means a size order B.
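
NOTE
In symbols (my paraphrase of the two base cases just named): a divide and
conquer recurrence keeps its usual shape, say
  $MT(N) = a \cdot MT(N/b) + O(N/B + 1)$,
but instead of bottoming out at constant size, the analysis stops at
  $MT(N) = O(N/B)$ when $N \le cM$ (the problem fits in cache), or
  $MT(N) = O(1)$ when $N = O(B)$ (the problem fits in O(1) blocks).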

00:45:35.000 --> 00:45:41.000
OK, this will change the base
case of the recurrence,

00:45:41.000 --> 00:45:48.000
and it will turn out to give us
good answers instead of bad

00:45:48.000 --> 00:45:52.000
ones.
So, let's do a simple example.

00:45:52.000 --> 00:45:57.000
Our good friend order
statistics, in particular,

00:45:57.000 --> 00:46:04.000
for finding medians.
So, I hope you all know this by

00:46:04.000 --> 00:46:08.000
heart.
Remember the worst case linear

00:46:08.000 --> 00:46:12.000
time, median finding algorithm
by Blum et al.

00:46:12.000 --> 00:46:17.000
I'll write this fast.
We partition our array.

00:46:17.000 --> 00:46:21.000
It turns out,
this is a good algorithm as it

00:46:21.000 --> 00:46:24.000
is.
We partition our array

00:46:24.000 --> 00:46:30.000
conceptually into N over five
five-tuples, little groups

00:46:30.000 --> 00:46:36.000
of five.
This may not have been exactly

00:46:36.000 --> 00:46:40.000
how I wrote it last time.
I didn't check.

00:46:40.000 --> 00:46:46.000
But, it's the same algorithm.
You compute the median of each

00:46:46.000 --> 00:46:49.000
five tuple.
Then you recursively compute

00:46:49.000 --> 00:46:55.000
the median, call it x, of
these medians.

00:47:11.000 --> 00:47:15.000
Then, you partition around x.
So, that gave us some element

00:47:15.000 --> 00:47:20.000
that was roughly in the middle.
It was within the middle half,

00:47:20.000 --> 00:47:22.000
I think.
Partition around x,

00:47:22.000 --> 00:47:27.000
and then we show that you could
always recurse on just one of

00:47:27.000 --> 00:47:29.000
the sides.
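
A minimal Python sketch of the whole algorithm, as an editorial
illustration rather than code from the lecture (the helper names are
made up):

    # Worst-case linear-time selection (Blum et al.), written so
    # that each phase is a constant number of scans over arrays.
    def select(a, k):
        """Return the k-th smallest element of a (0-indexed)."""
        if len(a) <= 5:
            return sorted(a)[k]
        # Scan the input; write one median per five-tuple to an
        # output array: two parallel scans, O(N/B + 1) transfers.
        medians = [sorted(a[i:i + 5])[len(a[i:i + 5]) // 2]
                   for i in range(0, len(a), 5)]
        # Recursively compute the median x of the medians.
        x = select(medians, len(medians) // 2)
        # Partition around x: again a constant number of scans.
        less = [y for y in a if y < x]
        equal = [y for y in a if y == x]
        greater = [y for y in a if y > x]
        if k < len(less):
            return select(less, k)
        if k < len(less) + len(equal):
            return x
        return select(greater, k - len(less) - len(equal))

For the median itself, call select(a, len(a) // 2).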

00:47:38.000 --> 00:47:41.000
OK, this was our good old
friend for computing,

00:47:41.000 --> 00:47:43.000
order statistics,
or medians, or whatnot.

00:47:43.000 --> 00:47:47.000
OK, so how much time does this,
well, we know how much time

00:47:47.000 --> 00:47:50.000
this takes.
It should be linear time.

00:47:50.000 --> 00:47:52.000
But how many memory transfers
does this take?

00:47:52.000 --> 00:47:56.000
Well, conceptually partitioning
that, I can do,

00:47:56.000 --> 00:47:58.000
in zero memory transfers.
Maybe I have to compute N over

00:47:58.000 --> 00:48:02.000
five, no big deal here.
We're not thinking about

00:48:02.000 --> 00:48:05.000
computation.
I have to find the median of

00:48:05.000 --> 00:48:07.000
each tuple.
So, here it matters how my

00:48:07.000 --> 00:48:10.000
array is laid out.
But, what I'm going to do is

00:48:10.000 --> 00:48:13.000
take my array,
take the first five elements,

00:48:13.000 --> 00:48:16.000
and then the next five elements
and so on.

00:48:16.000 --> 00:48:20.000
Those will be my five tuples.
So, I can implement this just

00:48:20.000 --> 00:48:23.000
by scanning, and then computing
the median on those five

00:48:23.000 --> 00:48:27.000
elements, which I stored in the
five registers on my CPU.

00:48:27.000 --> 00:48:32.000
I'll assume that there are
enough registers for that.

00:48:32.000 --> 00:48:35.000
And, I compute the median,
write it out to some array out

00:48:35.000 --> 00:48:38.000
here.
So, it's going to be one

00:48:38.000 --> 00:48:40.000
element.
So, the median of here goes

00:48:40.000 --> 00:48:43.000
into there.
The median of these guys goes

00:48:43.000 --> 00:48:46.000
into there, and so on.
So, I'm scanning in here,

00:48:46.000 --> 00:48:50.000
and in parallel,
I'm scanning an output in here.

00:48:50.000 --> 00:48:54.000
So, it's two parallel scans.
So, that takes linear time.

00:48:54.000 --> 00:48:59.000
So, this takes order N over B
plus one memory transfers.

00:48:59.000 --> 00:49:03.000
OK, then we have recursively
compute the median of the

00:49:03.000 --> 00:49:06.000
medians.
This step used to be T of N

00:49:06.000 --> 00:49:09.000
over five.
Now it's MT of N over five,

00:49:09.000 --> 00:49:12.000
OK, with the same values of B
and M.

00:49:12.000 --> 00:49:17.000
Then we partition around x.
Partitioning is also like three

00:49:17.000 --> 00:49:19.000
parallel scans if you work it
out.

00:49:19.000 --> 00:49:24.000
So, this is also going to take
linear memory transfers,

00:49:24.000 --> 00:49:28.000
N over B plus one.
And then, we recurse on one of

00:49:28.000 --> 00:49:33.000
the sides, and this is the fun
part of the analysis which I

00:49:33.000 --> 00:49:37.000
won't repeat here.
But, we get MT of,

00:49:37.000 --> 00:49:42.000
like, three quarters N.
I think originally it was seven

00:49:42.000 --> 00:49:45.000
tenths, so we simplified to
three quarters,

00:49:45.000 --> 00:49:49.000
which is hopefully bigger than
seven tenths.

00:49:49.000 --> 00:49:52.000
Yeah, it is.
OK, so this is the new

00:49:52.000 --> 00:49:55.000
analysis.
Now we get a recurrence.

00:49:55.000 --> 00:49:58.000
So, let's do that.

00:50:16.000 --> 00:50:22.000
So, the analysis is we get this
MT of N is MT of N over five

00:50:22.000 --> 00:50:29.000
plus MT of three quarters N
plus order N over B plus one; this is just as before.

00:50:29.000 --> 00:50:35.000
Before we had linear work here.
And now, we have what we call

00:50:35.000 --> 00:50:39.000
linear number of memory
transfers, linear number of

00:50:39.000 --> 00:50:41.000
blocks.
OK, I'll sort of ignore this

00:50:41.000 --> 00:50:44.000
plus one.
It's not too critical.
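
Written out (an editorial rendering of what is on the board):

    MT(N) = MT(N/5) + MT(3N/4) + O(N/B + 1)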

00:50:44.000 --> 00:50:48.000
So, this is our recurrence.
Now, it depends what our base

00:50:48.000 --> 00:50:51.000
case is.
And, usually we would use a

00:50:51.000 --> 00:50:55.000
base case of constant size.
So, let's see what happens if

00:50:55.000 --> 00:51:00.000
we use a base case of constant
size just so that it's clear why

00:51:00.000 --> 00:51:05.000
this base case is so important.
OK, what this describes is

00:51:05.000 --> 00:51:07.000
one of these hairy
recurrences.

00:51:07.000 --> 00:51:09.000
And, I don't want to use
substitution.

00:51:09.000 --> 00:51:12.000
I just want the intuition of
why this is going to solve to

00:51:12.000 --> 00:51:14.000
something rather big.
OK, and for me,

00:51:14.000 --> 00:51:17.000
the best intuition always comes
from recursion trees.

00:51:17.000 --> 00:51:20.000
If you don't know the solution
to recurrence and you need a

00:51:20.000 --> 00:51:24.000
good guess, use recursion trees.
And today, I will only give you

00:51:24.000 --> 00:51:26.000
good guesses.
I don't want to prove anything

00:51:26.000 --> 00:51:31.000
with substitution because I want
to get to the bigger ideas.

00:51:31.000 --> 00:51:34.000
So, this is even messy from a
recursion tree point of view

00:51:34.000 --> 00:51:38.000
because you have these
unbalanced sizes where you start

00:51:38.000 --> 00:51:40.000
at the root with something of size N
over B.

00:51:40.000 --> 00:51:44.000
Then you split it into
something size one fifth N over

00:51:44.000 --> 00:51:47.000
B, and something of size three
quarters N over B,

00:51:47.000 --> 00:51:51.000
which is annoying because now
this subtree will be a lot

00:51:51.000 --> 00:51:54.000
bigger than this one,
or this one will terminate

00:51:54.000 --> 00:51:56.000
faster.
So, it's pretty unbalanced.

00:51:56.000 --> 00:52:00.000
But, summing per level doesn't
really tell you a lot at this

00:52:00.000 --> 00:52:02.000
point.
But let's just look at the

00:52:02.000 --> 00:52:07.000
bottom level.
Look at all the leaves in this

00:52:07.000 --> 00:52:10.000
recursion tree.
So, that's the base cases.

00:52:10.000 --> 00:52:13.000
How many base cases are there?
This is an interesting

00:52:13.000 --> 00:52:16.000
question.
We've never thought about it in

00:52:16.000 --> 00:52:21.000
the context of this recurrence.
It gives a somewhat surprising

00:52:21.000 --> 00:52:23.000
answer.
It was surprising to me the

00:52:23.000 --> 00:52:27.000
first time I worked it out.
So, how many leaves does this

00:52:27.000 --> 00:52:32.000
recursion tree have?
Well, we can write a

00:52:32.000 --> 00:52:35.000
recurrence.
The number of leaves in a

00:52:35.000 --> 00:52:41.000
problem of size N,
it's going to be the number of

00:52:41.000 --> 00:52:47.000
leaves in this problem plus the
number of leaves in this problem

00:52:47.000 --> 00:52:52.000
plus zero.
So, that's another recurrence.

00:52:52.000 --> 00:52:57.000
We'll call this L of N.
OK, now the base case is really

00:52:57.000 --> 00:53:02.000
relevant.
It determines the solution to

00:53:02.000 --> 00:53:04.000
this recurrence.
And let's, again,

00:53:04.000 --> 00:53:08.000
assume that in a problem of
size one, we have one leaf.
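
In symbols (editorial rendering), the leaf-counting recurrence is:

    L(N) = L(N/5) + L(3N/4),    L(1) = 1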

00:53:08.000 --> 00:53:12.000
That's our only base case.
Well, it turns out,

00:53:12.000 --> 00:53:14.000
and here you need to guess,
I think.

00:53:14.000 --> 00:53:17.000
This is not particularly
obvious.

00:53:17.000 --> 00:53:21.000
Any of the TA's have guesses of
the form of this solution?

00:53:21.000 --> 00:53:25.000
Or anybody, not just TA's.
But this is open to everyone.

00:53:25.000 --> 00:53:28.000
If Charles were here,
I would ask him.

00:53:28.000 --> 00:53:31.000
I had to think for a while,
and it's not linear,

00:53:31.000 --> 00:53:37.000
right, because you're somehow
decreasing quite a bit.

00:53:37.000 --> 00:53:42.000
So, it's smaller than linear,
but it's more than a constant.

00:53:42.000 --> 00:53:47.000
OK, it's actually more than
polylog, so what's your favorite

00:53:47.000 --> 00:53:50.000
function in the middle?
N over log N,

00:53:50.000 --> 00:53:53.000
that's still too big.
Keep going.

00:53:53.000 --> 00:53:57.000
You have an oracle here,
so you can just guess: N to the k,

00:53:57.000 --> 00:54:00.000
yeah, close.
I mean, k is usually an

00:54:00.000 --> 00:54:04.000
integer.
N to the alpha for some real

00:54:04.000 --> 00:54:09.000
number between zero and one.
Yeah, that's what you meant.

00:54:09.000 --> 00:54:11.000
Sorry.
It's like the shortest

00:54:11.000 --> 00:54:15.000
mathematical joke.
Let epsilon be less than zero

00:54:15.000 --> 00:54:18.000
or for a sufficiently large
epsilon.

00:54:18.000 --> 00:54:21.000
I don't know.
So, you've got to use the right

00:54:21.000 --> 00:54:25.000
letters.
So, let's suppose that it's N

00:54:25.000 --> 00:54:28.000
to the alpha.
Then we would get this N over

00:54:28.000 --> 00:54:32.000
five to the alpha,
and we'd get three quarters N

00:54:32.000 --> 00:54:36.000
to the alpha.
When you have a nice recurrence

00:54:36.000 --> 00:54:40.000
like this, you can just try
plugging in a guess and see

00:54:40.000 --> 00:54:42.000
whether it works,
OK, and of course this will

00:54:42.000 --> 00:54:46.000
work only depending on alpha.
So, we should get an equation

00:54:46.000 --> 00:54:49.000
on alpha here.
So, everything has an N to the

00:54:49.000 --> 00:54:51.000
alpha, in fact,
all of these terms.

00:54:51.000 --> 00:54:53.000
So, I can divide through by N
to the alpha.

00:54:53.000 --> 00:54:56.000
That's assuming that it's not
zero or something.

00:54:56.000 --> 00:54:59.000
That seems reasonable.
So, we have one equals one

00:54:59.000 --> 00:55:04.000
fifth to the alpha plus three
quarters to the alpha.

00:55:04.000 --> 00:55:10.000
This is something you won't get
on a final because I don't know

00:55:10.000 --> 00:55:15.000
any good way to solve this
except with like Maple or

00:55:15.000 --> 00:55:19.000
Mathematica.
If you're smart I'm sure you

00:55:19.000 --> 00:55:24.000
could compute it in a nicer way,
but alpha is about 0.8,

00:55:24.000 --> 00:55:28.000
it turns out.
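
As a quick numeric check (an editorial aside, easy to reproduce), one
can solve 1 = (1/5)^alpha + (3/4)^alpha by bisection, since the
right-hand side decreases in alpha. With the simplified constants 1/5
and 3/4 the root comes out closer to 0.91; with the original 1/5 and
7/10 it is about 0.84, which is presumably the "about 0.8" quoted from
memory here:

    def solve_alpha(p, q, lo=0.0, hi=1.0, iters=60):
        # Find alpha with p**alpha + q**alpha == 1 (for 0 < p, q < 1
        # the left side is decreasing in alpha, so bisection works).
        f = lambda a: p ** a + q ** a - 1.0
        for _ in range(iters):
            mid = (lo + hi) / 2
            if f(mid) > 0:
                lo = mid   # sum still too big: alpha must grow
            else:
                hi = mid
        return (lo + hi) / 2

    print(solve_alpha(1 / 5, 3 / 4))   # ~0.91, simplified constants
    print(solve_alpha(1 / 5, 7 / 10))  # ~0.84, original constants
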
So, the number of leaves is

00:55:28.000 --> 00:55:34.000
this sort of in between constant
and linear.

00:55:34.000 --> 00:55:37.000
Usually polynomial means you
have an integer power.

00:55:37.000 --> 00:55:40.000
Let's call it a polynomial.
Why not?

00:55:40.000 --> 00:55:43.000
There's a lot of leaves,
is the point,

00:55:43.000 --> 00:55:47.000
and if we say that each leaf
costs a constant number of

00:55:47.000 --> 00:55:50.000
memory transfers,
we're in trouble because then

00:55:50.000 --> 00:55:54.000
the number of memory transfers
has to be at least this.

00:55:54.000 --> 00:55:58.000
If it's at least that,
that's potentially bigger than

00:55:58.000 --> 00:56:02.000
N over B, I mean,
bigger than in an asymptotic

00:56:02.000 --> 00:56:06.000
sense.
This is little omega of N over

00:56:06.000 --> 00:56:10.000
B if B is big.
If B is at least N to the 0.2

00:56:10.000 --> 00:56:14.000
something, OK,
or one seventh something.

00:56:14.000 --> 00:56:18.000
But if, in particular,
B is at least N to the 0.2,

00:56:18.000 --> 00:56:22.000
then this should be bigger than
that.

00:56:22.000 --> 00:56:27.000
So, this is a bad analysis
because we're not going to get

00:56:27.000 --> 00:56:32.000
the answer we want,
which is N over B.

00:56:32.000 --> 00:56:35.000
The best you can do for median
is N over B because you have to

00:56:35.000 --> 00:56:38.000
read all the elements,
and you should spend linear

00:56:38.000 --> 00:56:40.000
time.
So, we want to get N over B.

00:56:40.000 --> 00:56:42.000
This algorithm is N over B plus
one.

00:56:42.000 --> 00:56:45.000
So, this is why you need a good
base case, all right?

00:56:45.000 --> 00:56:48.000
So that makes the point.
So, the question is,

00:56:48.000 --> 00:56:51.000
what base case should I use?

00:57:04.000 --> 00:57:06.000
So, we have this recurrence.

00:57:21.000 --> 00:57:25.000
What base case should I use?
Constant was too small.

00:57:25.000 --> 00:57:30.000
We have a couple of choices
listed up here.

00:57:46.000 --> 00:57:55.000
Any suggestions?
B, OK, MT of B is?

00:57:55.000 --> 00:58:01.000
The hard part.
So, if my problem,

00:58:01.000 --> 00:58:07.000
if the size of my array fits in
a block and I do all this stuff

00:58:07.000 --> 00:58:11.000
on it, how many memory transfers
could that take?

00:58:11.000 --> 00:58:15.000
One, or a constant,
depending on alignment.

00:58:15.000 --> 00:58:20.000
OK, maybe it takes two memory
transfers, but constant.

00:58:20.000 --> 00:58:23.000
Good.
That's clearly a lot better

00:58:23.000 --> 00:58:27.000
than this base case,
MT of one equals order one,

00:58:27.000 --> 00:58:30.000
clearly stronger.
So, hopefully,

00:58:30.000 --> 00:58:36.000
it gives the right answer,
and now indeed it does.

00:58:36.000 --> 00:58:39.000
I love this analysis.
So, I'm going to wave my hands.

00:58:39.000 --> 00:58:43.000
OK, but in particular,
what this gives us,

00:58:43.000 --> 00:58:47.000
if we do the previous analysis,
what is the number of leaves?

00:58:47.000 --> 00:58:51.000
So, in the leaves,
now L of B equals one instead

00:58:51.000 --> 00:58:54.000
of L of one equals one.
So, this stops earlier.

00:58:54.000 --> 00:58:59.000
When does it stop?
Well, instead of getting N to

00:58:59.000 --> 00:59:02.000
the order of 0.8,
whatever, we get N over B to

00:59:02.000 --> 00:59:06.000
the power of 0.8 whatever.
OK, so it turns out the number

00:59:06.000 --> 00:59:10.000
of leaves is N over B to the
alpha, which is little o of N

00:59:10.000 --> 00:59:12.000
over B.
So, we don't care.

00:59:12.000 --> 00:59:15.000
It's tiny.
And, if you look, the root

00:59:15.000 --> 00:59:17.000
cost is N over B in the
recursion tree,

00:59:17.000 --> 00:59:22.000
the leaf cost is little o of N
over B, and if you wave your

00:59:22.000 --> 00:59:26.000
hands, and close your eyes,
and squint, the cost should be

00:59:26.000 --> 00:59:29.000
geometrically decreasing as we
go down, I hope,

00:59:29.000 --> 00:59:34.000
more or less.
It's a bit messy because of all

00:59:34.000 --> 00:59:39.000
the things terminating,
but let's say cost is roughly

00:59:39.000 --> 00:59:42.000
geometric.
Don't do this in the final,

00:59:42.000 --> 00:59:47.000
but you won't have any messy
recurrences like this.

00:59:47.000 --> 00:59:50.000
So, don't worry.
Going down the tree,

00:59:50.000 --> 00:59:55.000
you'd have to prove this
formally, but I claim that the

00:59:55.000 --> 01:00:01.000
root cost dominates.
And, the root cost is N over B.

01:00:13.000 --> 01:00:16.591
So, we get N over B.
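
The whole analysis, condensed (editorial rendering):

    MT(N) = MT(N/5) + MT(3N/4) + O(N/B),    MT(O(B)) = O(1)
    leaves = (N/B)^\alpha = o(N/B)  =>  MT(N) = O(N/B)
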
OK, so this is a nice,

01:00:16.591 --> 01:00:21.892
linear cache oblivious algorithm
for order statistics.

01:00:21.892 --> 01:00:24.970
Great.
This may turn you off a little

01:00:24.970 --> 01:00:29.758
bit, but even though this is
like the simplest algorithm,

01:00:29.758 --> 01:00:34.460
it's also probably the most
complicated analysis that we

01:00:34.460 --> 01:00:36.846
will do.
In the future,

01:00:36.846 --> 01:00:40.234
our algorithms will be more
complicated, and the analyses

01:00:40.234 --> 01:00:42.533
will be relatively simple.
And usually,

01:00:42.533 --> 01:00:45.255
it's that way with cache
oblivious algorithms.

01:00:45.255 --> 01:00:48.824
So, I'm giving you this sort of
as the intuition of why this

01:00:48.824 --> 01:00:51.425
should be enough.
Then you have to prove it.

01:00:51.425 --> 01:00:54.933
OK, let's go to another problem
where divide and conquer is

01:00:54.933 --> 01:00:57.716
useful, our good friend,
matrix multiplication.

01:00:57.716 --> 01:01:01.164
I don't know how many times
we've seen this in this class,

01:01:01.164 --> 01:01:04.370
but in particular we saw it
last week with a recursive

01:01:04.370 --> 01:01:08.000
matrix multiply,
multithreaded algorithm.

01:01:08.000 --> 01:01:11.708
So, I won't give you the
algorithm yet again,

01:01:11.708 --> 01:01:16.176
but we're going to analyze it
in a very different way.

01:01:16.176 --> 01:01:20.475
So, we have C and we have A,
and B. Actually, it's up to you.

01:01:20.475 --> 01:01:24.521
So, I could cover standard
matrix multiplication,

01:01:24.521 --> 01:01:30.000
which is when you do it row by
row, and column by column.

01:01:30.000 --> 01:01:32.331
And, we could see why that's
bad.

01:01:32.331 --> 01:01:36.485
And then, we could do the
recursive one and see why that's

01:01:36.485 --> 01:01:39.036
good.
Or, we could skip the standard

01:01:39.036 --> 01:01:41.951
algorithm.
So, how many people would like

01:01:41.951 --> 01:01:44.866
to see why the standard
algorithm is bad?

01:01:44.866 --> 01:01:47.198
Because it's not totally
obvious.

01:01:47.198 --> 01:01:49.603
One, two, three,
four, five, half?

01:01:49.603 --> 01:01:53.611
Wow, that's a lot of votes.
Now, how many people want to

01:01:53.611 --> 01:01:55.433
skip to the chase?
No one.

01:01:55.433 --> 01:01:58.129
One, OK.
And, everyone else is asleep.

01:01:58.129 --> 01:02:01.190
So, that's pretty good,
50% awake, not bad.

01:02:01.190 --> 01:02:06.000
OK, then, so standard matrix
multiplication.

01:02:06.000 --> 01:02:10.036
I'll do this fast because it
is, I mean, you all know the

01:02:10.036 --> 01:02:13.207
algorithm, right?
To compute this value of C,

01:02:13.207 --> 01:02:17.099
in A, you take this row,
and in B you take this column.

01:02:17.099 --> 01:02:19.477
Sorry, I did that a little bit
sloppily.

01:02:19.477 --> 01:02:21.927
But this is supposed to be
aligned.

01:02:21.927 --> 01:02:24.378
Right?
So I take all of this stuff,

01:02:24.378 --> 01:02:27.837
I multiply it with all of the
stuff, add them up,

01:02:27.837 --> 01:02:31.949
the dot product.
That gives me this element.

01:02:31.949 --> 01:02:35.487
And, let's say I do them in
this order row by row.

01:02:35.487 --> 01:02:39.241
So for every item in C,
I loop over this row in A and this

01:02:39.241 --> 01:02:41.624
column in B, multiply them
together.

01:02:41.624 --> 01:02:44.151
That is an access pattern in
memory.

01:02:44.151 --> 01:02:48.555
So, exactly how much that costs
depends how these matrices are

01:02:48.555 --> 01:02:51.732
laid out in memory.
OK, this is a subtlety we

01:02:51.732 --> 01:02:55.703
haven't had to worry about
before because everything was

01:02:55.703 --> 01:02:58.519
uniform.
I'm going to assume to give the

01:02:58.519 --> 01:03:02.057
standard algorithm the best
chances of being good,

01:03:02.057 --> 01:03:05.956
I'm going to store C in row
major order, A in row major

01:03:05.956 --> 01:03:10.000
order, and B in column major
order.

01:03:10.000 --> 01:03:14.983
So, everything is nice and
you're scanning.

01:03:14.983 --> 01:03:19.254
So then this inner product is a
scan.

01:03:19.254 --> 01:03:21.389
Cool.
Sounds great,

01:03:21.389 --> 01:03:24.711
doesn't it?
It's bad, though.

01:03:24.711 --> 01:03:31.000
Assume A is row major,
and B is column major.

01:03:31.000 --> 01:03:33.911
And C, you could assume is
really either way,

01:03:33.911 --> 01:03:37.750
but if I'm doing it row by row,
I'll assume it's row major.

01:03:37.750 --> 01:03:41.257
So, this is what I call the
layout, the memory layout,

01:03:41.257 --> 01:03:43.904
of these matrices.
OK, it's good for this

01:03:43.904 --> 01:03:46.551
algorithm, but the algorithm is
not good.

01:03:46.551 --> 01:03:49.000
So, it won't be that great.
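
Here is a short Python sketch of the algorithm under discussion, as an
editorial illustration; plain lists of lists stand in for the flat
row-major and column-major layouts:

    def matmul_standard(A, B, n):
        # A is row major (A[i] is a contiguous row); precompute the
        # columns of B so each one is contiguous too, which is the
        # column-major layout being assumed for B.
        Bcols = [[B[i][j] for i in range(n)] for j in range(n)]
        C = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                # Two parallel scans: Theta(n/B + 1) memory
                # transfers per entry of C.
                C[i][j] = sum(a * b for a, b in zip(A[i], Bcols[j]))
        return C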

01:04:12.000 --> 01:04:16.227
So, how long does this take?
How many memory transfers?

01:04:16.227 --> 01:04:20.533
We know it takes N^3 time.
Not going to try and beat N^3

01:04:20.533 --> 01:04:22.882
here.
Just going to try and get

01:04:22.882 --> 01:04:26.249
standard matrix multiplication
going faster.

01:04:26.249 --> 01:04:30.711
So, well, for each item over
here I pay N over B to do the

01:04:30.711 --> 01:04:36.801
scans and get the inner product.
So, N over B per item.

01:04:36.801 --> 01:04:42.659
So, it's N over B,
or we could go with the plus

01:04:42.659 --> 01:04:49.408
one here, to compute each c_ij.
So that would suggest,

01:04:49.408 --> 01:04:54.883
as an upper bound at least,
it's N^3 over B.

01:04:54.883 --> 01:05:00.996
OK, and indeed that is the
right bound, so theta.

01:05:00.996 --> 01:05:08.000
This is memory transfers,
not time, obviously.

01:05:08.000 --> 01:05:12.349
That is indeed the case because
if you look at consecutive,

01:05:12.349 --> 01:05:14.525
I do this c_ij,
then this one,

01:05:14.525 --> 01:05:18.125
this one, this one,
this one, keep incrementing j

01:05:18.125 --> 01:05:20.074
and keeping i fixed,
right?

01:05:20.074 --> 01:05:23.824
So, the row that I use stays
fixed for a long time.

01:05:23.824 --> 01:05:27.875
I get to reuse that if it
happens, say, to fit in a

01:05:27.875 --> 01:05:32.150
block maybe, I get to reuse that
row several times if that

01:05:32.150 --> 01:05:36.631
happens to fit in cache.
But the column is changing

01:05:36.631 --> 01:05:39.642
every single time.
OK, so every time I moved here

01:05:39.642 --> 01:05:43.093
and compute the next c_ij,
even if a column could fit in

01:05:43.093 --> 01:05:45.790
cache, I can't fit all the
columns in cache.

01:05:45.790 --> 01:05:48.174
And the columns that I'm
visiting move,

01:05:48.174 --> 01:05:50.119
you know, they just scan
across.

01:05:50.119 --> 01:05:52.942
So, I'm scanning this whole
matrix every time.

01:05:52.942 --> 01:05:55.766
And unless your entire matrix
fits in cache,

01:05:55.766 --> 01:05:58.840
in which case you could do
anything, I don't care,

01:05:58.840 --> 01:06:02.353
it will take constant time,
or you'll take M over B time,

01:06:02.353 --> 01:06:05.302
enough to read it into the
cache, do your stuff,

01:06:05.302 --> 01:06:09.989
and write it back out.
Except in that boring case,

01:06:09.989 --> 01:06:14.115
you're going to have to pay N^2
over B for every row here

01:06:14.115 --> 01:06:18.242
because you have to scan the
whole collection of columns.

01:06:18.242 --> 01:06:22.589
You have to read this entire
matrix for every row over here.

01:06:22.589 --> 01:06:26.494
So, you really do need N^3 over
B for the whole thing.

01:06:26.494 --> 01:06:30.043
So, it's usually a theta.
So, you might say,

01:06:30.043 --> 01:06:32.766
well, that's great.
It's the size of my problem,

01:06:32.766 --> 01:06:34.852
the usual running time,
divided by B.

01:06:34.852 --> 01:06:38.329
And that was the case when we
were thinking about linear time,

01:06:38.329 --> 01:06:41.168
N versus N over B.
It's hard to beat N over B when

01:06:41.168 --> 01:06:44.066
your problem is of size N.
But now we have a cubed.

01:06:44.066 --> 01:06:47.137
And, this gets back to,
we have good spatial locality.

01:06:47.137 --> 01:06:49.687
When we read a block,
we use the whole thing.

01:06:49.687 --> 01:06:51.019
Great.
It seems optimal.

01:06:51.019 --> 01:06:53.337
But we don't have good temporal
locality.

01:06:53.337 --> 01:06:56.350
It could be that maybe if we
stored the right things,

01:06:56.350 --> 01:06:59.074
we kept them around,
we could use them several times

01:06:59.074 --> 01:07:04.000
because we're using each element
like a cubed number of times.

01:07:04.000 --> 01:07:08.990
That's not the right way of
saying it, but we're reusing the

01:07:08.990 --> 01:07:11.951
matrices a lot,
reusing those items.

01:07:11.951 --> 01:07:16.942
If we are doing N^3 work on N^2
things, we're reusing a lot.

01:07:16.942 --> 01:07:21.933
So, we want to do better than
this, and that's the recursive

01:07:21.933 --> 01:07:26.416
algorithm, which we've seen.
So, we know the algorithm

01:07:26.416 --> 01:07:29.800
pretty much.
I just have to tell you what

01:07:29.800 --> 01:07:36.588
the layout is.
So, we're going to take C,

01:07:36.588 --> 01:07:42.941
partition it into C_11,
C_12, and so on.

01:07:42.941 --> 01:07:52.647
So, I have an N by N matrix,
and I'm partitioning into N

01:07:52.647 --> 01:08:02.176
over 2 by N over 2 submatrices,
all three of them, C, A, and B,

01:08:02.176 --> 01:08:07.377
whatever.
And, I could write this out yet

01:08:07.377 --> 01:08:11.058
again but I won't.
OK, we can recursively compute

01:08:11.058 --> 01:08:15.200
this thing with eight matrix
multiplies, and a bunch of

01:08:15.200 --> 01:08:18.191
matrix additions.
I don't care how many,

01:08:18.191 --> 01:08:22.256
but a constant number.
We see that at least twice now,

01:08:22.256 --> 01:08:26.091
so I won't show it again.
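
For reference, a minimal Python sketch of that recursive multiply (an
editorial illustration, assuming N is a power of two):

    def matmul_recursive(A, B):
        n = len(A)
        if n == 1:
            return [[A[0][0] * B[0][0]]]
        h = n // 2
        def quad(M):   # split M into its four quadrants
            return ([row[:h] for row in M[:h]], [row[h:] for row in M[:h]],
                    [row[:h] for row in M[h:]], [row[h:] for row in M[h:]])
        def add(X, Y):  # one parallel scan: O(n^2/B) transfers
            return [[x + y for x, y in zip(rx, ry)]
                    for rx, ry in zip(X, Y)]
        A11, A12, A21, A22 = quad(A)
        B11, B12, B21, B22 = quad(B)
        # The eight recursive multiplies and four additions.
        C11 = add(matmul_recursive(A11, B11), matmul_recursive(A12, B21))
        C12 = add(matmul_recursive(A11, B12), matmul_recursive(A12, B22))
        C21 = add(matmul_recursive(A21, B11), matmul_recursive(A22, B21))
        C22 = add(matmul_recursive(A21, B12), matmul_recursive(A22, B22))
        return ([r1 + r2 for r1, r2 in zip(C11, C12)] +
                [r1 + r2 for r1, r2 in zip(C21, C22)])
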
Now, how do I lay out the

01:08:26.091 --> 01:08:29.005
matrices?
Any suggestions how I lay out

01:08:29.005 --> 01:08:32.979
the matrices?
I could lay them out in row

01:08:32.979 --> 01:08:35.693
major order.
Or column major order.

01:08:35.693 --> 01:08:38.185
But that might be less natural
now.

01:08:38.185 --> 01:08:42.000
We're not doing anything by
rows or by columns.

01:08:59.000 --> 01:09:03.014
So, what layout should I use?
Yeah?

01:09:03.014 --> 01:09:08.446
Quartet major order,
maybe quadrant major order

01:09:08.446 --> 01:09:12.933
unless you're musically
inclined, yeah.

01:09:12.933 --> 01:09:17.420
Good idea.
You've never seen this order

01:09:17.420 --> 01:09:21.671
before, so it's maybe not so
natural.

01:09:21.671 --> 01:09:26.158
Somehow I want to cluster it by
blocks.

01:09:26.158 --> 01:09:33.402
OK, I think that's about all.
So, I mean, it's a recursive

01:09:33.402 --> 01:09:36.576
layout.
This was not an easy question.

01:09:36.576 --> 01:09:39.751
It's OK.
Store matrices or lay out the

01:09:39.751 --> 01:09:44.899
matrices recursively by block.
OK, I'm cheating a little bit.

01:09:44.899 --> 01:09:49.961
I'm redefining the problem to
say, assume that your matrices

01:09:49.961 --> 01:09:54.680
are laid out in this way.
But, it doesn't really matter.

01:09:54.680 --> 01:09:56.568
We can cheat,
can't we?

01:09:56.568 --> 01:10:02.276
In fact, it doesn't matter.
You can turn a matrix into this

01:10:02.276 --> 01:10:06.315
layout without too much
work, almost linear work.

01:10:06.315 --> 01:10:07.637
Log factors,
maybe.

01:10:07.637 --> 01:10:11.676
OK, so if I want to store my
matrix A as a linear thing,

01:10:11.676 --> 01:10:15.274
I'm going to recursively
define that layout to be

01:10:15.274 --> 01:10:19.019
recursively store the upper left
corner, then store,

01:10:19.019 --> 01:10:21.442
let's say, the upper right
corner.

01:10:21.442 --> 01:10:24.380
It doesn't matter which order I
do these.

01:10:24.380 --> 01:10:28.492
I should have drawn this wider,
then store the lower left

01:10:28.492 --> 01:10:34.000
corner, and then store the lower
right corner recursively.

01:10:34.000 --> 01:10:38.025
So, how do you store this?
Well, you divide it in four,

01:10:38.025 --> 01:10:40.634
and lay out the top left,
and so on.

01:10:40.634 --> 01:10:44.511
OK, this is a recursive
definition of how the element

01:10:44.511 --> 01:10:47.046
should be stored in a linear
array.
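
As a concrete instance (an editorial sketch, assuming N is a power of
two), here is the position that entry (i, j) gets in the flat array
under this layout, with the quadrants taken in the order just
described:

    def index(i, j, n):
        # Recursive (quadrant-major) layout: all of the top-left
        # quadrant first, then top-right, then bottom-left, then
        # bottom-right, each laid out recursively the same way.
        if n == 1:
            return 0
        half = n // 2
        quadrant = 2 * (i >= half) + (j >= half)   # 0, 1, 2, or 3
        return quadrant * half * half + index(i % half, j % half, half)

For a 2 by 2 matrix this gives positions 0, 1, 2, 3, and every
submatrix in the recursion occupies a contiguous interval of the
array, which is exactly the property used below.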

01:10:47.046 --> 01:10:50.326
It's a weird one,
but this is a very powerful

01:10:50.326 --> 01:10:52.861
idea in cache oblivious
algorithms.

01:10:52.861 --> 01:10:57.408
We'll use this multiple times.
OK, so now all we have to do is

01:10:57.408 --> 01:11:00.241
analyze the number of memory
transfers.

01:11:00.241 --> 01:11:05.066
How hard could it be?
So, we're going to store all

01:11:05.066 --> 01:11:08.978
the matrices in this order,
and we want to compute the

01:11:08.978 --> 01:11:12.373
number of memory transfers on an
N by N matrix.

01:11:12.373 --> 01:11:15.547
See, I lapsed and I switched to
lowercase n.

01:11:15.547 --> 01:11:19.902
I should, throughout this week,
be using uppercase N because

01:11:19.902 --> 01:11:23.666
for historical reasons,
any external memory kinds of

01:11:23.666 --> 01:11:28.095
algorithms, to level algorithms,
always talk about capital N.

01:11:28.095 --> 01:11:31.785
And, don't ask why.
You should see what they define

01:11:31.785 --> 01:11:37.995
little n to be.
OK, so, any suggestions on what

01:11:37.995 --> 01:11:45.342
the recurrence should be now?
All this fancy setup, and the

01:11:45.342 --> 01:11:49.724
recurrence is actually pretty
easy.

01:11:49.724 --> 01:11:57.071
So, definitely it involves
multiplying matrices that are N

01:11:57.071 --> 01:12:03.000
over 2 by N over 2.
So, what goes here?

01:12:03.000 --> 01:12:05.752
Eight, thank you.
That you should know.

01:12:05.752 --> 01:12:08.793
And that the tricky part is
what goes here.

01:12:08.793 --> 01:12:12.487
OK, what goes here is,
now, the fact that I can even

01:12:12.487 --> 01:12:15.384
write this, this is the matrix
additions.

01:12:15.384 --> 01:12:18.788
Ignore those for now.
Suppose there weren't any.

01:12:18.788 --> 01:12:21.323
I just have to recursively
multiply.

01:12:21.323 --> 01:12:25.740
The fact that this actually is
eight times memory transfers of

01:12:25.740 --> 01:12:30.670
N over 2 relies on this layout.
Right, I'm assuming that the

01:12:30.670 --> 01:12:34.129
arrays that I'm given are given
as contiguous intervals and

01:12:34.129 --> 01:12:35.442
memory.
If they aren't,

01:12:35.442 --> 01:12:38.066
I mean, if they're scattered
all over memory,

01:12:38.066 --> 01:12:40.273
I'm screwed.
There's nothing I can do.

01:12:40.273 --> 01:12:43.434
So, but by assuming that I have
this recursive layout,

01:12:43.434 --> 01:12:46.835
I know that the recursive
multiplies will always deal with

01:12:46.835 --> 01:12:49.519
three consecutive chunks of
memory, one for A,

01:12:49.519 --> 01:12:52.202
one for B, one for C,
OK, no matter what I do.

01:12:52.202 --> 01:12:54.470
Because these are stored
consecutively,

01:12:54.470 --> 01:12:56.438
recursively I have that
invariant.

01:12:56.438 --> 01:12:59.540
And I can keep recursing.
And I'm always dealing with

01:12:59.540 --> 01:13:03.000
three consecutive chunks of
memory.

01:13:03.000 --> 01:13:08.327
That's why I need this layout
is to be able to say this.

01:13:08.327 --> 01:13:11.332
OK, Now what does addition
cost?

01:13:11.332 --> 01:13:14.335
I'll just give you two
matrices.

01:13:14.335 --> 01:13:19.858
They're stored in some linear
order, the same linear order

01:13:19.858 --> 01:13:25.186
among the three of them.
Do I care what the linear order

01:13:25.186 --> 01:13:28.384
is?
How should I add two matrices,

01:13:28.384 --> 01:13:31.000
get the output?

01:13:42.000 --> 01:13:43.000
Yeah?

01:13:51.000 --> 01:13:54.850
Right, if each of the three
arrays I'm dealing with are

01:13:54.850 --> 01:13:58.559
stored in the same order,
I can just scan in parallel

01:13:58.559 --> 01:14:02.909
through all three of them and
just add corresponding elements,

01:14:02.909 --> 01:14:07.045
and output it to the third.
So, I don't care what the order

01:14:07.045 --> 01:14:10.682
is, as long as it's consistent
and I get N^2 over B.

01:14:10.682 --> 01:14:14.390
I'll ignore plus one here.
That's just looking at the

01:14:14.390 --> 01:14:16.529
entire matrix.
So, there we go:

01:14:16.529 --> 01:14:19.667
another recurrence.
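
In symbols (editorial rendering):

    MT(N) = 8 MT(N/2) + O(N^2/B)
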
We've seen this with N^2,

01:14:19.667 --> 01:14:23.090
and we just got N^3.
But, it turns out now we get

01:14:23.090 --> 01:14:26.371
something cooler if we use the
right base case.

01:14:26.371 --> 01:14:30.008
So now we get to the base case,
ah, the tricky part.

01:14:30.008 --> 01:14:35.000
So, any suggestions what base
case I should use?

01:14:35.000 --> 01:14:36.672
The block size,
good suggestion.

01:14:36.672 --> 01:14:38.829
So, if we have something of
size order B,

01:14:38.829 --> 01:14:41.850
we know that takes a constant
number of memory transfers.

01:14:41.850 --> 01:14:44.871
It turns out that's not enough.
That won't solve it here.

01:14:44.871 --> 01:14:46.381
But good guess.
In this case,

01:14:46.381 --> 01:14:49.294
it's not the right answer.
I'll give you some intuition

01:14:49.294 --> 01:14:51.182
why.
We are trying to improve on N^3

01:14:51.182 --> 01:14:53.178
over B.
If you were just trying to get

01:14:53.178 --> 01:14:55.443
it divided by B,
this is a great base case.

01:14:55.443 --> 01:14:58.572
But here, we know that just the
improvement afforded by the

01:14:58.572 --> 01:15:03.244
block size is not enough.
We have to somehow use the fact

01:15:03.244 --> 01:15:06.864
that the cache is big.
It's M, so however big M is,

01:15:06.864 --> 01:15:09.977
it's that big.
OK, so if we want to get some

01:15:09.977 --> 01:15:13.307
improvement on this,
we've got to have M in the

01:15:13.307 --> 01:15:16.276
formula somewhere,
and there's no M's yet.

01:15:16.276 --> 01:15:19.027
So, it's got to involve M.
What's that?

01:15:19.027 --> 01:15:21.271
MT of M over B?
That would work,

01:15:21.271 --> 01:15:25.108
but MT of M is also OK,
I mean, some constant times M,

01:15:25.108 --> 01:15:27.859
let's say.
I want to make this constant

01:15:27.859 --> 01:15:33.000
small enough so that the entire
problem fits in cache.

01:15:33.000 --> 01:15:37.006
So, it's like one third.
I think it's actually,

01:15:37.006 --> 01:15:40.837
oh wait, is it the square root
of M actually?

01:15:40.837 --> 01:15:43.537
Right, this is an N by N
matrix.

01:15:43.537 --> 01:15:47.456
So, it should be C times the
square root of M.

01:15:47.456 --> 01:15:50.330
Sorry.
So, the square root of M by

01:15:50.330 --> 01:15:53.552
square root of M matrix has M
entries.

01:15:53.552 --> 01:15:58.603
If I make C like one third or
something, then I can fit all

01:15:58.603 --> 01:16:04.372
three matrices in memory.
Actually, one over square root

01:16:04.372 --> 01:16:06.903
of three would do,
but who cares?

01:16:06.903 --> 01:16:10.621
So, for some constant,
C, now everything fits in

01:16:10.621 --> 01:16:13.548
memory.
How many memory transfers does

01:16:13.548 --> 01:16:14.497
it take?
One?

01:16:14.497 --> 01:16:18.451
It's a bit too small,
because I do have to read the

01:16:18.451 --> 01:16:20.587
problem in.
And now, I mean,

01:16:20.587 --> 01:16:24.621
here was one because there's
only one block to read.

01:16:24.621 --> 01:16:27.548
Now how many blocks are there
to read?

01:16:27.548 --> 01:16:30.000
Constants?
No.

01:16:30.000 --> 01:16:30.369
B?
No.

01:16:30.369 --> 01:16:33.255
M over B, good.
Get it right eventually.

01:16:33.255 --> 01:16:37.102
That's the great thing about
thinking with an oracle.

01:16:37.102 --> 01:16:41.318
You can just keep guessing.
M over B because we have cache

01:16:41.318 --> 01:16:43.908
size M.
There are M over B blocks in

01:16:43.908 --> 01:16:46.201
that cache to read each one,
OK?

01:16:46.201 --> 01:16:49.382
Maybe
you forgot what M was because

01:16:49.382 --> 01:16:51.897
we haven't used it for a long
time.

01:16:51.897 --> 01:16:54.857
But M is the number of elements
in cache.

01:16:54.857 --> 01:16:59.000
This is the number of blocks in
cache.
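
So the base case, as a formula (editorial rendering, with the
constant c chosen so all three submatrices fit in cache at once):

    MT(c \sqrt{M}) = O(M/B)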

01:16:59.000 --> 01:17:02.537
OK, some of you were saying B,
and it's reasonable to assume

01:17:02.537 --> 01:17:05.943
that M over B is about B.
That's like a square cache,

01:17:05.943 --> 01:17:08.892
but in general,
we don't make that assumption.

01:17:08.892 --> 01:17:11.381
OK, where are we?
We're hopefully done,

01:17:11.381 --> 01:17:14.460
just about, good,
because we have three minutes.

01:17:14.460 --> 01:17:17.800
So, that's our base case.
I have a square root here;

01:17:17.800 --> 01:17:20.815
I just forgot it.
Now we just have to solve it.

01:17:20.815 --> 01:17:23.434
Now, this is an easier
recurrence, right?

01:17:23.434 --> 01:17:27.497
I don't want to use the master
method, because master method is

01:17:27.497 --> 01:17:31.296
not going to handle these B's
and M's, and these crazy base

01:17:31.296 --> 01:17:35.271
cases.
OK, master method would prove

01:17:35.271 --> 01:17:36.054
N^3.
Great.

01:17:36.054 --> 01:17:40.282
Master method doesn't really
think about these kinds of

01:17:40.282 --> 01:17:42.789
cases.
But with recursion trees,

01:17:42.789 --> 01:17:47.331
if you remember way back to the
proof of the master method,

01:17:47.331 --> 01:17:52.030
just look at the recursion tree
as geometric up or down, or where

01:17:52.030 --> 01:17:55.945
everything is equal,
and then you just add them up,

01:17:55.945 --> 01:17:59.000
every level.
The point is that this is a

01:17:59.000 --> 01:18:02.680
nice recurrence.
All of the sub problems are the

01:18:02.680 --> 01:18:05.891
same size, and that analysis
always works,

01:18:05.891 --> 01:18:12.000
I say, when everything has the
same size, all the children.

01:18:12.000 --> 01:18:18.857
So, here's the recursion tree.
We have N^2 over B at the top.

01:18:18.857 --> 01:18:24.114
We split into eight subproblems
where each one,

01:18:24.114 --> 01:18:27.657
the cost is one half N^2 over
B.

01:18:27.657 --> 01:18:32.000
I'm not going to write them
all.

01:18:32.000 --> 01:18:34.716
There they are.
You add them up.

01:18:34.716 --> 01:18:38.921
How much do you get?
Well, there's eight of them.

01:18:38.921 --> 01:18:41.637
Eight times a half is two.
Four.

01:18:41.637 --> 01:18:44.265
[LAUGHTER] Thanks.
Four, right?

01:18:44.265 --> 01:18:48.909
OK, I'm bad at arithmetic.
I probably already said it,

01:18:48.909 --> 01:18:52.675
but there are three kinds of
mathematicians,

01:18:52.675 --> 01:18:56.006
those who can add,
and those who can't.

01:18:56.006 --> 01:19:01.000
OK, why am I looking at this?
It's obvious.

01:19:01.000 --> 01:19:03.800
OK, so we keep going.
This looks geometrically

01:19:03.800 --> 01:19:04.858
increasing.
Right?

01:19:04.858 --> 01:19:08.405
You just know in your heart
that if you work out the first

01:19:08.405 --> 01:19:12.263
two levels, you can tell whether
it's geometrically increasing,

01:19:12.263 --> 01:19:15.437
decreasing, or they're all
equal, or something else.

01:19:15.437 --> 01:19:18.984
And then you better think.
But I see this as geometrically

01:19:18.984 --> 01:19:21.412
increasing.
It will indeed be like 16 at

01:19:21.412 --> 01:19:22.843
the next level,
I guess.

01:19:22.843 --> 01:19:25.145
OK, it should be.
So, it's increasing.

01:19:25.145 --> 01:19:30.000
That means the leaves matter.
So, let's work out the leaves.

01:19:30.000 --> 01:19:33.960
And, this is where we use our
base case.

01:19:33.960 --> 01:19:38.630
So, we have a problem of size
square root of M.

01:19:38.630 --> 01:19:41.981
And so, yeah,
you have a question?

01:19:41.981 --> 01:19:45.840
Oh, indeed.
I knew there was something.

01:19:45.840 --> 01:19:50.003
I knew it was supposed to be
two out here.

01:19:50.003 --> 01:19:53.150
Thanks.
This is why you're here.

01:19:53.150 --> 01:19:57.110
It's actually N over two
squared over B.

01:19:57.110 --> 01:20:00.867
Thanks.
I'm substituting N over 2 into

01:20:00.867 --> 01:20:04.900
this.
OK, so this is actually N^2

01:20:04.900 --> 01:20:06.519
over 4 B.
So, I get two,

01:20:06.519 --> 01:20:09.546
because there are eight times
one over four.

01:20:09.546 --> 01:20:13.416
OK, I wasn't that far off then.
It's still geometrically

01:20:13.416 --> 01:20:15.529
increasing, still the case,
OK?

01:20:15.529 --> 01:20:17.992
But now, it actually doesn't
matter.

01:20:17.992 --> 01:20:21.371
Whatever the cost is,
as long as it's bigger than

01:20:21.371 --> 01:20:23.975
one, great.
Now we look at the leaves.

01:20:23.975 --> 01:20:26.157
The leaves are root M by root
M.

01:20:26.157 --> 01:20:29.958
I substitute root M into this:
I get M over B with some

01:20:29.958 --> 01:20:32.903
constants.
Who cares?

01:20:32.903 --> 01:20:36.787
So, each leaf is M over B,
OK, lots of them.

01:20:36.787 --> 01:20:40.038
How many are there?
This is the thing when you

01:20:40.038 --> 01:20:45.006
deal with recursion trees:
counting the number of leaves

01:20:45.006 --> 01:20:48.709
is always the annoying part.
Oh boy, well,

01:20:48.709 --> 01:20:53.948
we start with an N by N matrix.
We stop when we get down to

01:20:53.948 --> 01:21:00.000
root M by root M matrix.
So, that sounds like something.

01:21:00.000 --> 01:21:04.141
Oh boy, I'm cheating here.
Really?

01:21:04.141 --> 01:21:07.905
That many?
It sounds plausible.

01:21:07.905 --> 01:21:11.921
OK, the claim is,
and I'll cheat.

01:21:11.921 --> 01:21:19.450
So I'm going to use the oracle
here, and we'll figure out why

01:21:19.450 --> 01:21:24.470
this is the case.
(N over root M)^3 leaves,

01:21:24.470 --> 01:21:27.231
hey what?
I think here,

01:21:27.231 --> 01:21:33.979
it's hard to see the tree.
But it's easy to see in the

01:21:33.979 --> 01:21:36.178
matrix.
Let's enter the matrix.

01:21:36.178 --> 01:21:39.256
We have our big matrix.
We divide it in half.

01:21:39.256 --> 01:21:43.654
We recursively divide in half.
We recursively divide in half.

01:21:43.654 --> 01:21:45.120
You get the idea,
OK?

01:21:45.120 --> 01:21:49.151
Now, at some point these
sectors, let's say one of these

01:21:49.151 --> 01:21:52.743
sectors, and each of these
sectors, fits in cache.

01:21:52.743 --> 01:21:56.994
And three of them fit in cache.
So, that's when we stop the

01:21:56.994 --> 01:22:02.320
recursion in the analysis.
The algorithm goes all the way.

01:22:02.320 --> 01:22:05.538
But in the analysis,
let's say we stop at M.

01:22:05.538 --> 01:22:08.981
OK, now, how many leaves or
problems are there?

01:22:08.981 --> 01:22:11.451
Oh man, this is still not
obvious.

01:22:11.451 --> 01:22:14.669
OK, the number of leaf chunks
here is, like,

01:22:14.669 --> 01:22:19.010
I mean, the number of these
things is something like N over

01:22:19.010 --> 01:22:21.629
root M, right,
the number of chunks.

01:22:21.629 --> 01:22:26.195
But, it's a little less clear
because I have so many of these.

01:22:26.195 --> 01:22:28.964
But, all right,
so let's just suppose,

01:22:28.964 --> 01:22:32.856
now, I think of normal,
boring, matrix multiplication

01:22:32.856 --> 01:22:38.119
on chunks of this size.
That's essentially what the

01:22:38.119 --> 01:22:42.200
leaves should tell me.
I start with this big problem,

01:22:42.200 --> 01:22:45.261
I recurse out to all these
little, tiny,

01:22:45.261 --> 01:22:48.950
multiply this by that,
OK, this root M by root M

01:22:48.950 --> 01:22:51.305
chunk.
OK, how many operations,

01:22:51.305 --> 01:22:54.680
how many multiplies do I do on
those things?

01:22:54.680 --> 01:22:57.034
N^3.
But now, N, the size of my

01:22:57.034 --> 01:23:00.488
matrix in terms of these little
sub matrices,

01:23:00.488 --> 01:23:05.859
is N over root M.
So, it should be N over root

01:23:05.859 --> 01:23:10.760
M^3 subproblems of this size.
If you work it out,

01:23:10.760 --> 01:23:16.478
normally we go down to things
of constant size and we get

01:23:16.478 --> 01:23:21.278
exactly N^3 of them.
Now we are stopping at this

01:23:21.278 --> 01:23:26.485
short point in saying,
well, it's however many there

01:23:26.485 --> 01:23:30.161
are, cubed.
OK, this is a bit of hand

01:23:30.161 --> 01:23:35.352
waving.
You could work it out with the

01:23:35.352 --> 01:23:39.151
recurrence on the number of
leaves.

01:23:39.151 --> 01:23:44.180
But there it is.
So, the total here is N over,

01:23:44.180 --> 01:23:49.656
let's work it out.
N^3 over M to the three halves,

01:23:49.656 --> 01:23:56.025
that's this number of leaves,
times the cost at each leaf,

01:23:56.025 --> 01:24:01.054
which is M over B.
So, some of the M's cancel,

01:24:01.054 --> 01:24:07.759
and we get N^3 over B root M,
which is a root M factor better

01:24:07.759 --> 01:24:13.433
than N^3 over B.
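
The arithmetic, in one line (editorial rendering):

    (N/\sqrt{M})^3 \cdot O(M/B) = O(N^3/(B\sqrt{M}))
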
It's actually quite a lot,

01:24:13.433 --> 01:24:16.522
the square root of the cache
size.

01:24:16.522 --> 01:24:20.359
That is optimal.
The best two level matrix

01:24:20.359 --> 01:24:26.162
multiplication algorithm is N^3
over B root M memory transfers.

01:24:26.162 --> 01:24:30.000
Pretty amazing,
and I'm over time.

01:24:30.000 --> 01:24:34.979
You can generalize this into
all sorts of great things,

01:24:34.979 --> 01:24:39.959
but the bottom line is this is
a great way to do matrix

01:24:39.959 --> 01:24:45.308
multiplication as a recursion.
We'll see more recursion for

01:24:45.308 --> 01:24:48.000
cache oblivious algorithms on
Wednesday.