WEBVTT

00:00:00.060 --> 00:00:02.500
The following content is
provided under a Creative

00:00:02.500 --> 00:00:04.010
Commons license.

00:00:04.010 --> 00:00:06.360
Your support will help
MIT OpenCourseWare

00:00:06.360 --> 00:00:10.730
continue to offer high quality
educational resources for free.

00:00:10.730 --> 00:00:13.330
To make a donation or
view additional materials

00:00:13.330 --> 00:00:17.236
from hundreds of MIT courses,
visit MIT OpenCourseWare

00:00:17.236 --> 00:00:21.690
at ocw.mit.edu.

00:00:21.690 --> 00:00:24.524
ERIK DEMAINE: Welcome to
the final week of 6.046.

00:00:24.524 --> 00:00:25.190
Are you excited?

00:00:25.190 --> 00:00:28.581
[CHEERING] Yeah,
today-- AUDIENCE: Oh.

00:00:28.581 --> 00:00:30.080
ERIK DEMAINE: Well,
and sad, I know.

00:00:30.080 --> 00:00:30.640
It's tough.

00:00:30.640 --> 00:00:32.630
But we've got two more lectures.

00:00:32.630 --> 00:00:36.720
They're on that one topic, which
is cache oblivious algorithms.

00:00:36.720 --> 00:00:39.470
And this is a
really cool concept.

00:00:39.470 --> 00:00:40.970
It was actually
originally developed

00:00:40.970 --> 00:00:45.980
in the context of 6.046, as
sort of an interesting way

00:00:45.980 --> 00:00:48.560
to teach cache
efficient algorithms.

00:00:48.560 --> 00:00:50.710
But it turned into a
whole research program

00:00:50.710 --> 00:00:55.240
in the late '90s, and
now it's its own thing.

00:00:55.240 --> 00:00:59.870
It's kind of funny to
bring it back to 6.046.

00:00:59.870 --> 00:01:04.670
The whole idea is in all of
the algorithms we have seen,

00:01:04.670 --> 00:01:06.520
except maybe
distributed algorithms,

00:01:06.520 --> 00:01:11.289
we've had this view that all
of the data that we can access

00:01:11.289 --> 00:01:13.280
is the same cost.

00:01:13.280 --> 00:01:15.360
If we have an array,
like a hash table,

00:01:15.360 --> 00:01:19.020
accessing anything in a hash
table is equally costly.

00:01:19.020 --> 00:01:20.840
If we have a binary
search tree, every node

00:01:20.840 --> 00:01:23.110
costs the same to access.

00:01:23.110 --> 00:01:26.120
But this is not real.

00:01:26.120 --> 00:01:27.680
Let me give you
some idea of what

00:01:27.680 --> 00:01:29.350
a real computer looks like.

00:01:29.350 --> 00:01:32.880
You probably know this, but
we've not yet thought about it

00:01:32.880 --> 00:01:42.300
in an algorithmic context.

00:01:42.300 --> 00:01:45.390
These are caches, what are
typically called caches,

00:01:45.390 --> 00:01:48.539
in your computer.

00:01:48.539 --> 00:01:52.640
Then you have what we've
mostly been thinking about,

00:01:52.640 --> 00:02:00.660
which is main memory, your RAM.

00:02:00.660 --> 00:02:02.260
And then there's
probably more stuff.

00:02:02.260 --> 00:02:05.512
These days you probably
have some big flash.

00:02:05.512 --> 00:02:06.970
If you have a
fancier computer, you

00:02:06.970 --> 00:02:09.030
have flash, which
is maybe caching

00:02:09.030 --> 00:02:13.000
your disk, which is huge.

00:02:13.000 --> 00:02:15.110
And then maybe there's
the internet at the end,

00:02:15.110 --> 00:02:16.860
if you like.

00:02:16.860 --> 00:02:22.370
So the point is all the data in
the world is not on your CPU.

00:02:22.370 --> 00:02:26.040
And there's this big thing which
is called the memory hierarchy,

00:02:26.040 --> 00:02:29.410
which dictates which
things are fast

00:02:29.410 --> 00:02:32.290
and which things are
slow, not exactly which

00:02:32.290 --> 00:02:35.530
data items; that's up to you.

00:02:35.530 --> 00:02:37.820
But the idea is that
on board your CPU

00:02:37.820 --> 00:02:43.570
you have probably, these days,
up to four levels of cache.

00:02:43.570 --> 00:02:47.500
As I've tried to draw them,
they get increasingly big.

00:02:47.500 --> 00:02:49.440
Typical values--
a level one cache

00:02:49.440 --> 00:02:52.860
is something on the order
of 10K to 32K, whatever.

00:02:52.860 --> 00:02:54.840
Level four cache these
days, as introduced

00:02:54.840 --> 00:02:58.724
by, like, Haswell architectures,
has about 100 megabytes.

00:02:58.724 --> 00:03:01.140
Main memory you know; this is
the thing you usually think

00:03:01.140 --> 00:03:01.470
about.

00:03:01.470 --> 00:03:02.400
It's in the gigabytes.

00:03:02.400 --> 00:03:05.600
These days you can buy computers
with a terabyte of RAM.

00:03:05.600 --> 00:03:06.830
It's not crazy.

00:03:06.830 --> 00:03:08.520
Flash gets bigger.

00:03:08.520 --> 00:03:12.317
Disk-- these days you can
buy a 4-terabyte single disk,

00:03:12.317 --> 00:03:13.900
but if you have a
whole RAID of disks,

00:03:13.900 --> 00:03:17.079
you can have petabytes
of data on one computer.

00:03:17.079 --> 00:03:20.780
So things are getting bigger
as we go farther to the right.

00:03:20.780 --> 00:03:23.329
But they're also
getting slower.

00:03:23.329 --> 00:03:25.310
And the point of cache
efficient algorithms

00:03:25.310 --> 00:03:27.710
is to deal with the fact
that things get slow

00:03:27.710 --> 00:03:29.462
when they get far away.

00:03:29.462 --> 00:03:31.420
And this makes sense from
a physics standpoint.

00:03:31.420 --> 00:03:33.860
If you think about
how much data can

00:03:33.860 --> 00:03:36.240
you store in a cubic
inch or something

00:03:36.240 --> 00:03:41.017
and how much could possibly be
near your CPU, at some point,

00:03:41.017 --> 00:03:42.600
you're just going
to run out of space,

00:03:42.600 --> 00:03:44.090
and you've got to
go farther away.

00:03:44.090 --> 00:03:47.184
And to go farther away is
going to take more time.

00:03:47.184 --> 00:03:48.850
So you can think of
it-- I mean, there's

00:03:48.850 --> 00:03:50.850
the speed of light
argument, that things

00:03:50.850 --> 00:03:52.525
that are farther
away in your computer

00:03:52.525 --> 00:03:54.710
are going to take longer.

00:03:54.710 --> 00:03:56.630
Typical computers
are not anywhere near

00:03:56.630 --> 00:03:59.350
the speed of light, so there's
a more real issue, which

00:03:59.350 --> 00:04:01.160
is how long are your traces.

00:04:01.160 --> 00:04:04.640
And then when you have physical
moving parts, like a disk,

00:04:04.640 --> 00:04:06.840
I don't know if you know,
but disks actually spin,

00:04:06.840 --> 00:04:09.814
and there's a head, and
it has to move around.

00:04:09.814 --> 00:04:10.980
And that's called seek time.

00:04:10.980 --> 00:04:13.550
Moving a head around on
the disk is really slow,

00:04:13.550 --> 00:04:15.360
on the order of milliseconds.

00:04:15.360 --> 00:04:18.152
Whereas reading
from on chip cache,

00:04:18.152 --> 00:04:19.610
that's on the order
of nanoseconds,

00:04:19.610 --> 00:04:23.180
whatever your clock rate is, so
a few billion times a second.

00:04:23.180 --> 00:04:26.340
So there's a big
spread of like a factor

00:04:26.340 --> 00:04:31.930
of a million or 10 million from
level one cache to disk speed.

00:04:31.930 --> 00:04:33.100
That sucks.

00:04:33.100 --> 00:04:35.576
And so you might think,
well, if your data's big,

00:04:35.576 --> 00:04:36.409
you're just screwed.

00:04:36.409 --> 00:04:39.909
You've got to deal with
disk, and disk is slow.

00:04:39.909 --> 00:04:42.430
But that's not true.

00:04:42.430 --> 00:04:44.960
Life is not so bad.

00:04:44.960 --> 00:05:07.720
So, in general, there's
two notions of speed,

00:05:07.720 --> 00:05:09.560
and I've been kind
of vague on them.

00:05:09.560 --> 00:05:13.300
One notion is latency,
which is if right now I

00:05:13.300 --> 00:05:17.250
have the idea that I really
need to fetch memory location 2

00:05:17.250 --> 00:05:21.100
billion and 73, how long does
it take for that data-- say,

00:05:21.100 --> 00:05:23.190
one word of data-- to come back?

00:05:23.190 --> 00:05:25.260
That's latency.

00:05:25.260 --> 00:05:28.710
But there's another
issue, which is bandwidth;

00:05:28.710 --> 00:05:32.720
how fat are these pipes?

00:05:32.720 --> 00:05:35.500
What's my rate of information
that I could pump?

00:05:35.500 --> 00:05:38.570
If I said, please give me
all of main memory in order,

00:05:38.570 --> 00:05:40.507
how fast could it pump it back?

00:05:40.507 --> 00:05:56.370
And that's actually really good.

00:05:56.370 --> 00:05:57.974
So latency is like
your start up cost.

00:05:57.974 --> 00:05:59.390
When I ask for
something, how long

00:05:59.390 --> 00:06:01.270
does it take for that
one thing to come?

00:06:01.270 --> 00:06:04.660
But then there's a data rate.

00:06:04.660 --> 00:06:08.620
And bandwidth you can
generally make really large.

00:06:08.620 --> 00:06:12.110
For example, in disk, bandwidth
of a disk is pretty big.

00:06:12.110 --> 00:06:16.780
But even if it weren't big, you
could just add 100 more disks.

00:06:16.780 --> 00:06:18.410
And then when you
ask for some data,

00:06:18.410 --> 00:06:23.032
all 100 disks could give
you data at the same speed,

00:06:23.032 --> 00:06:24.740
and provided you don't
overload your bus,

00:06:24.740 --> 00:06:27.480
so you've got to also
make more buses and so on.

00:06:27.480 --> 00:06:30.567
You can actually get a really huge
amount of data per second,

00:06:30.567 --> 00:06:33.150
but still the time to get there
and the time for all the disks

00:06:33.150 --> 00:06:34.770
to seek their
heads, that's slow.

00:06:34.770 --> 00:06:36.770
It doesn't add up, actually,
because they're all

00:06:36.770 --> 00:06:38.420
doing it in parallel.

00:06:38.420 --> 00:06:42.620
So you can't reduce latency,
but you can increase bandwidth.

00:06:42.620 --> 00:06:45.630
And let's say-- it
doesn't match physics,

00:06:45.630 --> 00:06:48.480
but we can get pretty close
to arbitrarily high bandwidth.

00:06:48.480 --> 00:06:50.590
And so in a well
designed computer,

00:06:50.590 --> 00:06:52.840
the fatnesses of
these pipes are going

00:06:52.840 --> 00:06:56.520
to increase, or could
increase, if you want.

00:06:56.520 --> 00:07:01.510
So you can move
lots of data around.

00:07:01.510 --> 00:07:03.840
But latency we can't get
rid of, and this is annoying

00:07:03.840 --> 00:07:05.673
because from an algorithmic
standpoint, when

00:07:05.673 --> 00:07:07.952
we ask for something,
we'd like it immediately.

00:07:07.952 --> 00:07:09.910
In a sequential algorithm,
we can't do anything

00:07:09.910 --> 00:07:11.530
until that data arrives.

00:07:11.530 --> 00:07:21.260
So cache efficiency is going
to fix this by blocking.

00:07:21.260 --> 00:07:30.692
This is an old idea, since
caches were introduced.

00:07:30.692 --> 00:07:31.900
There's the idea of blocking.

00:07:31.900 --> 00:07:36.300
So when you ask for a
single word in main memory,

00:07:36.300 --> 00:07:38.090
you don't get one word.

00:07:38.090 --> 00:07:41.960
You get maybe 32
kilobytes of information,

00:07:41.960 --> 00:08:09.620
not just 4 bytes or 8 bytes.

00:08:09.620 --> 00:08:12.490
And we're kind of free to
choose these block sizes however

00:08:12.490 --> 00:08:15.490
we want, when we
designed the system.

00:08:15.490 --> 00:08:21.940
So we can set them, in a
certain sense, to hide latency.

00:08:21.940 --> 00:08:34.799
So if you think of amortizing
the cost over the block,

00:08:34.799 --> 00:08:41.990
then you have something like
amortized cost over block.

00:08:41.990 --> 00:08:45.600
This is per word.

00:08:45.600 --> 00:08:54.380
Essentially, we divide the
latency by the block size.

00:08:54.380 --> 00:08:57.330
And we have to pay
one over bandwidth.

00:08:57.330 --> 00:08:59.900
Bandwidth is how
many words a second

00:08:59.900 --> 00:09:02.390
you can read, say,
from your memory.

00:09:02.390 --> 00:09:06.240
So one over bandwidth is
going to be your cost.
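
NOTE
The board formula being described in the surrounding cues, restated (a sketch; B is the block size):
    \[ \text{amortized cost per word} \;\approx\; \frac{\text{latency}}{B} \;+\; \frac{1}{\text{bandwidth}} \]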

00:09:06.240 --> 00:09:07.800
So this we can't change, but

00:09:07.800 --> 00:09:10.280
by adding enough disks
or adding enough things

00:09:10.280 --> 00:09:12.300
and making these
pipes fat enough,

00:09:12.300 --> 00:09:15.500
you can basically make this big.

00:09:15.500 --> 00:09:18.600
Latency is the thing
we can't control.

00:09:18.600 --> 00:09:24.260
But if this block
is sort of useful,

00:09:24.260 --> 00:09:27.240
then we're paying the initial
start up time, say, hey,

00:09:27.240 --> 00:09:29.600
give me this block, and then
waiting for the response.

00:09:29.600 --> 00:09:32.520
That latency we only pay
once for the entire block.

00:09:32.520 --> 00:09:38.080
So if there are block size
words in that block, per item,

00:09:38.080 --> 00:09:40.380
we're effectively dividing
latency by block size.

00:09:40.380 --> 00:09:44.110
This is kind of rough,
but this is the idea

00:09:44.110 --> 00:09:45.790
of how to reduce latency.

00:09:45.790 --> 00:10:00.230
Now, for this to actually work,
we need better algorithms.

00:10:00.230 --> 00:10:03.550
Pretty much every algorithm you've
seen in this class so far works

00:10:03.550 --> 00:10:05.050
horribly in this model.

00:10:05.050 --> 00:10:13.410
So the point of today
and next class is to fix that.

00:10:13.410 --> 00:10:20.070
For this kind of
amortization to work,

00:10:20.070 --> 00:10:23.370
I'm using "useful" in a
vague sense so far.

00:10:23.370 --> 00:10:24.930
We'll make it
formal in a moment.

00:10:24.930 --> 00:10:28.500
When I fetch an entire
block, all of the elements

00:10:28.500 --> 00:10:30.184
in that block should be useful.

00:10:30.184 --> 00:10:32.100
We should be able to
compute something on them

00:10:32.100 --> 00:10:33.510
that we needed to compute.

00:10:33.510 --> 00:10:36.915
Otherwise, if I only needed
the one item that I read out

00:10:36.915 --> 00:10:41.250
of the block, that's not
going to help me so much.

00:10:41.250 --> 00:10:45.382
So I really want to structure
my data in such a way

00:10:45.382 --> 00:10:46.840
that when I access
one element, I'm

00:10:46.840 --> 00:10:49.890
also going to access
the elements nearby it.

00:10:49.890 --> 00:10:51.980
Then this blocking will
actually be useful.

00:10:51.980 --> 00:11:04.190
This is a property normally
called spatial locality.

00:11:04.190 --> 00:11:08.180
And the other thing
we'd like-- these caches

00:11:08.180 --> 00:11:11.360
have some size, so I can store
more than just one block.

00:11:11.360 --> 00:11:13.110
It's not like I read
one block, and I just

00:11:13.110 --> 00:11:16.630
finish processing it, and then
I read the next block and go on.

00:11:16.630 --> 00:11:18.505
Some of these caches
are actually pretty big.

00:11:18.505 --> 00:11:21.470
If you think of main memory
as a cache to your disk,

00:11:21.470 --> 00:11:22.740
that can be really big.

00:11:22.740 --> 00:11:27.355
So ideally, the blocks
that I'm using here

00:11:27.355 --> 00:11:28.730
relate to each
other in some way,

00:11:28.730 --> 00:11:30.990
or when I access
the block, I'm going

00:11:30.990 --> 00:11:34.030
to access it for a while,
along with other blocks.

00:11:34.030 --> 00:11:36.020
So the way this
is usually said is

00:11:36.020 --> 00:11:41.902
that we'd like to reuse the
existing blocks in the cache

00:11:41.902 --> 00:11:45.290
as much as possible.

00:11:45.290 --> 00:11:51.000
And this you can think
of as temporal locality.

00:11:51.000 --> 00:11:52.530
When I access a
particular block,

00:11:52.530 --> 00:11:54.710
I'm going to access
it again fairly soon.

00:11:54.710 --> 00:11:57.690
That way it's actually useful
to bring it into my cache,

00:11:57.690 --> 00:11:59.197
and then I use it many times.

00:11:59.197 --> 00:12:00.280
That would be even better.

00:12:00.280 --> 00:12:01.738
I don't have to
have both of these,

00:12:01.738 --> 00:12:03.500
and exactly to
what extent I have

00:12:03.500 --> 00:12:05.740
them is going to dictate
what the overall time

00:12:05.740 --> 00:12:07.570
it's going to take
to run my algorithm.

00:12:07.570 --> 00:12:09.280
But these are sort of
the ideal properties

00:12:09.280 --> 00:12:11.640
you want in a very
informal sense.

00:12:11.640 --> 00:12:16.000
Now, in the rest of today, we're
going to make this formal,

00:12:16.000 --> 00:12:18.840
and then we're going to develop
some algorithms for this model.

00:12:18.840 --> 00:12:20.740
But this is the motivation.

00:12:20.740 --> 00:12:24.585
In reality, we're free to
choose block size in the system.

00:12:24.585 --> 00:12:26.460
Though, in a moment,
I'm going to assume that

00:12:26.460 --> 00:12:27.870
it's given to us.

00:12:27.870 --> 00:12:29.290
You'd normally
set the block size

00:12:29.290 --> 00:12:32.570
so that these two terms
come out roughly equal.

00:12:32.570 --> 00:12:35.400
Because if you're spending
the latency time to go and get

00:12:35.400 --> 00:12:41.530
something, you might as well
get a whole chunk of something,

00:12:41.530 --> 00:12:43.200
according to whatever
your bandwidth is.

00:12:43.200 --> 00:12:44.710
If it only cost
you, say, twice as

00:12:44.710 --> 00:12:48.350
much to fetch an entire
block as to fetch one word,

00:12:48.350 --> 00:12:50.420
that seems like a
pretty good block size.

00:12:50.420 --> 00:12:54.160
So for something like
disk, that block size

00:12:54.160 --> 00:12:57.140
is on the order of
megabytes, maybe even

00:12:57.140 --> 00:12:58.970
bigger-- hundreds of megabytes.
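
NOTE
A back-of-the-envelope version of this block-size choice, as a tiny Python sketch. The 10 ms seek latency and 100 MB/s bandwidth are illustrative assumptions for a spinning disk, not figures from the lecture.
    # Pick the block size so the latency term and the transfer term
    # of the amortized cost are roughly balanced.
    latency = 10e-3       # assumed: ~10 ms disk seek
    bandwidth = 100e6     # assumed: ~100 MB/s sustained transfer
    block_size = latency * bandwidth
    print(block_size)     # 1e6 bytes -- on the order of a megabyte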

00:12:58.970 --> 00:13:01.170
So think of the block
sizes as really big.

00:13:01.170 --> 00:13:04.950
We really want all that data
to be useful in some way.

00:13:04.950 --> 00:13:08.980
Now it's really hard
to think about a memory

00:13:08.980 --> 00:13:12.150
hierarchy with so many levels.

00:13:12.150 --> 00:13:14.810
So we're going to focus
on two levels at a time--

00:13:14.810 --> 00:13:18.530
the sort of
cheap and small cache

00:13:18.530 --> 00:13:21.010
versus the huge thing,
which I'll call disk,

00:13:21.010 --> 00:13:26.420
just for emphasis.

00:13:26.420 --> 00:13:30.386
So I'm going to call this two
level model the external memory

00:13:30.386 --> 00:13:30.890
model.

00:13:30.890 --> 00:13:33.920
It was originally
introduced as a model

00:13:33.920 --> 00:13:35.626
for main memory versus disk.

00:13:35.626 --> 00:13:37.500
But you could apply it
to any pair of levels.

00:13:37.500 --> 00:13:40.720
In general, you have
your problem size N,

00:13:40.720 --> 00:13:45.040
choose the smallest level
that fits N. Typically that's

00:13:45.040 --> 00:13:45.680
main memory.

00:13:45.680 --> 00:13:46.900
Maybe it's disk.

00:13:46.900 --> 00:13:52.160
And just think of the level
between that and the previous,

00:13:52.160 --> 00:13:57.120
so the last level and
the next to last level.

00:13:57.120 --> 00:13:58.567
Often that's what matters.

00:13:58.567 --> 00:14:00.650
Like if you run a program,
and you run out of RAM,

00:14:00.650 --> 00:14:02.983
and you start swapping to
disk, that's when everything

00:14:02.983 --> 00:14:05.080
just slows to a crawl.

00:14:05.080 --> 00:14:07.305
You can see that difference
at each of these levels,

00:14:07.305 --> 00:14:08.930
but it's probably
most dramatic at disk

00:14:08.930 --> 00:14:13.910
just because it's so slow-- a
million times slower than RAM,

00:14:13.910 --> 00:14:17.190
or at least 1,000 times
slower than RAM, I should say.

00:14:17.190 --> 00:14:23.730
Anyway, so we have
just two levels.

00:14:23.730 --> 00:14:25.760
So let me draw a
more precise picture.

00:14:25.760 --> 00:14:27.020
We have the CPU.

00:14:27.020 --> 00:14:29.160
This is where all
of our operations

00:14:29.160 --> 00:14:31.210
are being done, where we
add numbers and so on.

00:14:31.210 --> 00:14:33.668
We'll think of it as having a
constant number of registers.

00:14:33.668 --> 00:14:36.660
Each register is one word.

00:14:36.660 --> 00:14:42.440
And then we have a really
fat pipe, low latency pipe,

00:14:42.440 --> 00:14:48.490
to the cache.

00:14:48.490 --> 00:14:53.810
Cache is going to be
divided into blocks.

00:14:53.810 --> 00:14:58.800
So let's say there's
B words per block.

00:14:58.800 --> 00:15:00.930
Instead of writing
block size, I'll

00:15:00.930 --> 00:15:04.910
just write capital B.
And the number of blocks

00:15:04.910 --> 00:15:16.300
I'm going to call M over B. So
the total size of your cache

00:15:16.300 --> 00:15:22.650
is capital M. And then there
is a relatively thin and slow

00:15:22.650 --> 00:15:28.060
connection-- this one's fast.

00:15:28.060 --> 00:15:35.460
This one's slow-- to your disk.

00:15:35.460 --> 00:15:38.800
Disk we'll think of as huge,
essentially infinite size.

00:15:38.800 --> 00:15:51.110
It's also divided into blocks
of size B, so same block size.

00:15:51.110 --> 00:15:52.840
So this is the picture.

00:15:52.840 --> 00:15:56.190
And so, initially,
all of the input

00:15:56.190 --> 00:15:59.050
is over here, all of your
N data items, whatever.

00:15:59.050 --> 00:16:01.010
So you want to sort those items.

00:16:01.010 --> 00:16:04.130
And in order to
access those items,

00:16:04.130 --> 00:16:06.670
you first have to
bring them into cache.

00:16:06.670 --> 00:16:12.260
That's going to be slow, but
it's done in a blocked manner.

00:16:12.260 --> 00:16:16.230
So when I want to access
an individual item here,

00:16:16.230 --> 00:16:19.190
I have to request
the entire block.

00:16:19.190 --> 00:16:21.300
When I request that block,
it gets sent over here.

00:16:21.300 --> 00:16:22.409
It takes a while.

00:16:22.409 --> 00:16:24.200
And then I get to choose
where to store it.

00:16:24.200 --> 00:16:25.960
Maybe I'll put it here.

00:16:25.960 --> 00:16:29.000
And then maybe I'll
grab this block

00:16:29.000 --> 00:16:32.100
and then store it
here and so on.

00:16:32.100 --> 00:16:34.870
Each of those is a block read,
so these are new instructions

00:16:34.870 --> 00:16:36.990
the CPU can do.

00:16:36.990 --> 00:16:39.680
And eventually, this
cache will get full.

00:16:39.680 --> 00:16:41.430
And then before I
bring in a new block,

00:16:41.430 --> 00:16:42.990
I have to kick out an old block.

00:16:42.990 --> 00:16:44.910
Meaning I need to
take one these blocks

00:16:44.910 --> 00:16:49.175
and write it to some position,
maybe to the same place.

00:16:49.175 --> 00:16:50.800
I think, in fact, we
will always assume

00:16:50.800 --> 00:16:53.100
that you write to the
same place, overwrite

00:16:53.100 --> 00:16:54.340
what was on the disk.

00:16:54.340 --> 00:16:56.656
You made some changes
here, send it back.

00:16:56.656 --> 00:16:58.280
And, in general, what
we're going to do

00:16:58.280 --> 00:17:01.100
is count how many times
we read and write blocks.

00:17:01.100 --> 00:17:02.474
Question?

00:17:02.474 --> 00:17:05.400
AUDIENCE: When you talked about
how fast the connection is,

00:17:05.400 --> 00:17:07.108
you're just talking
about latency, right?

00:17:07.108 --> 00:17:08.858
ERIK DEMAINE: Yes,
sorry, this is latency.

00:17:08.858 --> 00:17:11.710
AUDIENCE: Yeah, so like
the [INAUDIBLE] connections

00:17:11.710 --> 00:17:13.710
[? just don't have ?]
[INAUDIBLE]?

00:17:13.710 --> 00:17:16.020
ERIK DEMAINE: Right, this
could have huge bandwidth.

00:17:16.020 --> 00:17:19.089
So in this model, we're assuming
the block size is fixed,

00:17:19.089 --> 00:17:21.250
and then the latency
versus bandwidth

00:17:21.250 --> 00:17:23.781
is not-- we're not going
to think about bandwidth.

00:17:23.781 --> 00:17:25.280
We'll assume the
block size has been

00:17:25.280 --> 00:17:27.036
chosen in some reasonable way.

00:17:27.036 --> 00:17:29.410
And then all we need to do is
count the number of blocks.

00:17:29.410 --> 00:17:33.640
But underneath, yeah, you
have some kind of bandwidth.

00:17:33.640 --> 00:17:35.990
Presumably you
set the block size

00:17:35.990 --> 00:17:37.841
to make these two
things roughly equal,

00:17:37.841 --> 00:17:39.216
and so then latency
and bandwidth

00:17:39.216 --> 00:17:41.010
are kind of the same thing.

00:17:41.010 --> 00:17:41.939
That's the idea.

00:17:41.939 --> 00:17:44.480
But really, we're just going to
think about counting latency,

00:17:44.480 --> 00:17:45.870
which is how many
times do I have

00:17:45.870 --> 00:17:48.400
to request to block and
wait for it to come over,

00:17:48.400 --> 00:17:51.010
and how much does it
cost to write a block?

00:17:51.010 --> 00:17:52.650
How many times do
I write a block?

00:17:52.650 --> 00:17:55.357
I'm not going to worry about
how much physical time it

00:17:55.357 --> 00:17:56.940
takes me to do either
of those things;

00:17:56.940 --> 00:17:59.000
I'm just going to count
them and assume that that

00:17:59.000 --> 00:18:02.410
is what I need to minimize.

00:18:02.410 --> 00:18:09.750
So I'm going to count-- we
call these memory transfers--

00:18:09.750 --> 00:18:13.334
transfers of blocks
between levels,

00:18:13.334 --> 00:18:17.150
between these two levels.

00:18:17.150 --> 00:18:32.380
This is the number of blocks
read from or written to disk.
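
NOTE
A minimal Python sketch of this cost model (the class name and interface are illustrative, not from the lecture): the algorithm explicitly reads and writes blocks, each such transfer is counted, and work on words already in the cache is free. It assumes the disk length is a multiple of B.
    class ExternalMemory:
        """Toy external-memory model: count block transfers between cache and disk."""
        def __init__(self, B, M, disk):
            self.B = B                  # words per block
            self.capacity = M // B      # the cache holds M/B blocks
            self.disk = disk            # the "disk": one long list of words
            self.cache = {}             # block index -> list of B words
            self.transfers = 0          # the quantity the analysis counts
        def read_block(self, i):
            # Explicitly bring block i into the cache, evicting if necessary.
            if i in self.cache:
                return self.cache[i]    # already resident: free
            if len(self.cache) >= self.capacity:
                victim = next(iter(self.cache))   # evict the oldest block (FIFO-ish)
                self.write_block(victim)
            self.cache[i] = list(self.disk[i * self.B:(i + 1) * self.B])
            self.transfers += 1
            return self.cache[i]
        def write_block(self, i):
            # Explicitly write block i back to the same place on disk.
            self.disk[i * self.B:(i + 1) * self.B] = self.cache.pop(i)
            self.transfers += 1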

00:18:32.380 --> 00:18:40.110
We're going to view accesses
to the cache as free.

00:18:40.110 --> 00:18:45.284
I'm not going to count those.

00:18:45.284 --> 00:18:46.700
You don't need to
worry about that

00:18:46.700 --> 00:18:52.550
so much because we can still
count the number of operations

00:18:52.550 --> 00:18:59.520
that we do on the
computer, on the CPU.

00:18:59.520 --> 00:19:02.120
We still can think
about how much time,

00:19:02.120 --> 00:19:04.730
regular time, it takes
to do the computation--

00:19:04.730 --> 00:19:07.570
how many comparisons, how many
additions, things like that.

00:19:07.570 --> 00:19:10.030
And that would include things
like reading and writing

00:19:10.030 --> 00:19:12.890
elements from cache--
individual things.

00:19:12.890 --> 00:19:15.360
But we're going to view
this connection-- let's say,

00:19:15.360 --> 00:19:17.120
these are on the same chip.

00:19:17.120 --> 00:19:19.897
So reading cache is just as
fast as reading from registers.

00:19:19.897 --> 00:19:21.730
So we're not going to
worry about that time.

00:19:21.730 --> 00:19:24.230
What we're focusing on, for
the purpose of this model,

00:19:24.230 --> 00:19:26.700
is between these two levels.

00:19:26.700 --> 00:19:30.355
So these are essentially
one level combined.

00:19:30.355 --> 00:19:31.730
I'll change that
in a little bit.

00:19:31.730 --> 00:19:34.240
But for now, just think
about the two levels.

00:19:34.240 --> 00:19:36.210
And we're counting how
many memory transfers

00:19:36.210 --> 00:19:41.660
do we have between these
two levels, cache and disk.

00:19:41.660 --> 00:19:43.480
So we want to minimize that.

00:19:43.480 --> 00:19:45.360
Now, just like before,
we want to minimize

00:19:45.360 --> 00:19:49.090
the running time in the
usual traditional measure.

00:19:49.090 --> 00:19:51.570
And we want to minimize space
and all the usual things

00:19:51.570 --> 00:19:52.240
we minimize.

00:19:52.240 --> 00:19:53.740
But now we have a
new measure, which

00:19:53.740 --> 00:19:56.073
is number of memory transfers,
and we want our algorithm

00:19:56.073 --> 00:19:59.460
to minimize that too,
for a given block size

00:19:59.460 --> 00:20:05.930
and for a given cache size.

00:20:05.930 --> 00:20:10.180
And at this point-- I'm going
to change this in a moment--

00:20:10.180 --> 00:20:15.480
the algorithm that we would
write in this external memory

00:20:15.480 --> 00:20:19.590
model explicitly
manages the blocks.

00:20:19.590 --> 00:20:31.444
It has to explicitly
read and write blocks.

00:20:31.444 --> 00:20:32.860
And there's a
software system that

00:20:32.860 --> 00:20:35.350
implements this model,
particularly for disk,

00:20:35.350 --> 00:20:37.510
and lets you do this in
a nice controlled way,

00:20:37.510 --> 00:20:40.460
maintain your memory, maintain
reading and writing disk.

00:20:40.460 --> 00:20:42.160
The operating system
tries to do this,

00:20:42.160 --> 00:20:45.070
but it usually does a really
bad job with swapping.

00:20:45.070 --> 00:20:47.280
But there are
software systems that

00:20:47.280 --> 00:20:53.900
let you take control
and do much better.

00:20:53.900 --> 00:20:54.980
So that's a good model.

00:20:54.980 --> 00:21:02.280
External memory model is
especially good for disk.

00:21:02.280 --> 00:21:04.630
It's not going to capture
the finesse of all

00:21:04.630 --> 00:21:07.900
these other levels, and
it's a little bit annoying

00:21:07.900 --> 00:21:10.245
to write algorithms in
this way-- explicitly

00:21:10.245 --> 00:21:11.370
reading and writing blocks.

00:21:11.370 --> 00:21:14.230
Today I will not write
any such algorithms.

00:21:14.230 --> 00:21:16.890
Although, you could
think about them.

00:21:16.890 --> 00:21:20.765
I personally love
this other model,

00:21:20.765 --> 00:21:29.920
which is cache obliviousness.

00:21:29.920 --> 00:21:32.840
It's going to lead to, in
some sense, cleaner algorithms.

00:21:32.840 --> 00:21:36.380
Although, it's more of a magic
trick to get them to work.

00:21:36.380 --> 00:21:38.140
But writing the
algorithms is very simple.

00:21:38.140 --> 00:21:40.500
Analyzing them is more work.

00:21:40.500 --> 00:21:44.740
And it will capture, in some
sense, all of these levels.

00:21:44.740 --> 00:21:48.320
But, in fact, it is basically
exactly this model, almost

00:21:48.320 --> 00:21:49.400
the same.

00:21:49.400 --> 00:21:52.270
We're going to change
one thing, which is

00:21:52.270 --> 00:21:54.510
where the oblivious comes from.

00:21:54.510 --> 00:21:59.200
We're going to say that
the algorithm doesn't

00:21:59.200 --> 00:22:02.430
know the cache parameters.

00:22:02.430 --> 00:22:08.215
It doesn't know B or M.
So this is a little weird.

00:22:08.215 --> 00:22:13.490
We're going to have to make some
other changes to make it work.

00:22:13.490 --> 00:22:16.800
From an analysis perspective, I
want to count memory transfers

00:22:16.800 --> 00:22:19.350
and analyze my algorithm
with respect to this memory

00:22:19.350 --> 00:22:20.980
hierarchy.

00:22:20.980 --> 00:22:22.930
But the algorithm
itself isn't allowed

00:22:22.930 --> 00:22:25.190
to know what that memory
hierarchy looks like.

00:22:25.190 --> 00:22:27.250
Another way to say this
is that the algorithm

00:22:27.250 --> 00:22:29.765
has to work simultaneously
for all values of B

00:22:29.765 --> 00:22:34.167
and all values of M.
As you might imagine,

00:22:34.167 --> 00:22:35.000
this is not so easy.

00:22:35.000 --> 00:22:37.166
But there are some simple
things where this is easy,

00:22:37.166 --> 00:22:40.150
and more complicated things
where this is possible.

00:22:40.150 --> 00:22:42.800
And it gives you all
sorts of cool things.

00:22:42.800 --> 00:22:46.290
Let me first formalize
the model a little bit.

00:22:46.290 --> 00:22:48.780
The other nice thing about
cache oblivious algorithms

00:22:48.780 --> 00:22:52.620
is it corresponds
much more closely

00:22:52.620 --> 00:22:55.740
to how these caches work.

00:22:55.740 --> 00:22:57.310
When you write code
on your CPU, you

00:22:57.310 --> 00:22:58.560
may have noticed
you don't usually

00:22:58.560 --> 00:23:00.476
do block reads and block
writes, unless you're

00:23:00.476 --> 00:23:02.760
dealing with flash or disk.

00:23:02.760 --> 00:23:04.590
All of this is
taking care for you.

00:23:04.590 --> 00:23:06.500
It's all done internal
to the processor.

00:23:06.500 --> 00:23:09.360
When you access a word,
behind the scenes,

00:23:09.360 --> 00:23:13.260
magically, the
system, the computer,

00:23:13.260 --> 00:23:16.100
finds which word to read
or which block to read.

00:23:16.100 --> 00:23:19.200
It moves the entire block
into a higher level cache,

00:23:19.200 --> 00:23:21.930
and then it's just serving
you words out of that block.

00:23:21.930 --> 00:23:25.200
And you don't have
explicit control over that.

00:23:25.200 --> 00:23:32.930
So the way that works is when
you access a word in memory--

00:23:32.930 --> 00:23:36.420
and I'm going to think
of memory as everything;

00:23:36.420 --> 00:23:42.246
this is what's stored
in the disk, say.

00:23:42.246 --> 00:23:44.370
This is the entire memory
system, the entire memory

00:23:44.370 --> 00:23:45.280
hierarchy.

00:23:45.280 --> 00:23:46.840
And, as usual in
this class, we're

00:23:46.840 --> 00:23:48.360
going to think of
the entire memory

00:23:48.360 --> 00:23:56.940
as a giant array of words.

00:23:56.940 --> 00:24:01.660
Each of these
squares is one word.

00:24:01.660 --> 00:24:06.040
But then also, the memory
is now divided into blocks.

00:24:06.040 --> 00:24:07.440
So let's say every four.

00:24:07.440 --> 00:24:09.160
Let's say B equals 4.

00:24:09.160 --> 00:24:13.820
Every four words is
a block boundary,

00:24:13.820 --> 00:24:17.960
just for the sake
of drawing a figure.

00:24:17.960 --> 00:24:20.450
So this is B equals 4.

00:24:20.450 --> 00:24:23.650
When you access a single
word, like this one,

00:24:23.650 --> 00:24:30.100
you get the entire block
containing the word.

00:24:30.100 --> 00:24:32.720
Let's say, to emphasize,
it's not you personally;

00:24:32.720 --> 00:24:47.307
the system somehow fetches the
block containing that word.

00:24:47.307 --> 00:24:48.640
It has to do this automatically.

00:24:48.640 --> 00:24:51.120
We can't explicitly read and
write blocks in this model

00:24:51.120 --> 00:24:53.050
because we don't know
how big the blocks are.

00:24:53.050 --> 00:24:55.230
So it couldn't even name them.

00:24:55.230 --> 00:24:59.510
But internally, on the real
system and in your analysis,

00:24:59.510 --> 00:25:01.760
you're going to think of
whenever you touch something,

00:25:01.760 --> 00:25:03.630
you actually get all
this into the cache.

00:25:03.630 --> 00:25:06.100
So you hope that you will use
things nearby because you've

00:25:06.100 --> 00:25:07.300
already read them in.

00:25:07.300 --> 00:25:08.300
Ideally, they're useful.

00:25:08.300 --> 00:25:10.091
But you don't know how
many you've read in.

00:25:10.091 --> 00:25:13.457
You've read in B, and
you don't what B is.

00:25:13.457 --> 00:25:17.200
The algorithm doesn't know.

00:25:17.200 --> 00:25:19.670
One more detail--
the cache is going

00:25:19.670 --> 00:25:21.462
to get full pretty quickly.

00:25:21.462 --> 00:25:23.170
And so then, whenever
you read something,

00:25:23.170 --> 00:25:24.461
you have to kick something out.

00:25:24.461 --> 00:25:26.690
In steady state,
cache might as well

00:25:26.690 --> 00:25:29.250
always stay full-- no reason
to leave anything empty.

00:25:29.250 --> 00:25:34.884
So which block do you kick out?

00:25:34.884 --> 00:25:35.550
Any suggestions?

00:25:35.550 --> 00:25:37.870
Which block should I kick out?

00:25:37.870 --> 00:25:40.710
If I've been reading
and writing some blocks,

00:25:40.710 --> 00:25:46.228
reading and writing to
words within these blocks.

00:25:46.228 --> 00:25:46.728
Yeah?

00:25:46.728 --> 00:25:48.477
AUDIENCE: [INAUDIBLE].

00:25:48.477 --> 00:25:51.060
ERIK DEMAINE: The block that was
fetched farthest in the past?

00:25:51.060 --> 00:25:53.480
Yeah that is usually
called First In, First Out.

00:25:53.480 --> 00:25:54.610
That's FIFO.

00:25:54.610 --> 00:25:57.890
And that is a good strategy.

00:25:57.890 --> 00:25:59.352
Any other suggestions?

00:25:59.352 --> 00:25:59.852
Yeah.

00:25:59.852 --> 00:26:02.692
AUDIENCE: [INAUDIBLE].

00:26:02.692 --> 00:26:04.900
ERIK DEMAINE: The block has
been least recently used.

00:26:04.900 --> 00:26:07.670
So maybe you fetched
it a long time ago,

00:26:07.670 --> 00:26:10.999
but you use it
every clock cycle.

00:26:10.999 --> 00:26:12.790
That one you should
probably not throw away

00:26:12.790 --> 00:26:13.831
because you use it a lot.

00:26:13.831 --> 00:26:18.730
That's called LRU, and that
is also a good strategy.

00:26:18.730 --> 00:26:19.720
Other suggestions?

00:26:19.720 --> 00:26:21.150
Those are two good ones.

00:26:21.150 --> 00:26:23.180
If you go beyond that,
I'm worried I won't know.

00:26:23.180 --> 00:26:24.596
But there are some
bad strategies.

00:26:24.596 --> 00:26:25.890
Yeah?

00:26:25.890 --> 00:26:27.050
AUDIENCE: Just random.

00:26:27.050 --> 00:26:32.435
ERIK DEMAINE: Random-- yeah,
random is probably pretty good.

00:26:32.435 --> 00:26:33.310
I don't know offhand.

00:26:33.310 --> 00:26:34.810
There are some
randomized strategies

00:26:34.810 --> 00:26:36.160
that beat both of those.

00:26:36.160 --> 00:26:38.250
But from this perspective,
both are good.

00:26:38.250 --> 00:26:42.760
We've got lots of Frisbees
to go through, so.

00:26:42.760 --> 00:26:43.690
That's a good answer.

00:26:43.690 --> 00:26:44.950
Random is definitely
a good idea.

00:26:44.950 --> 00:26:46.930
I know there's a randomized
strategy called [? bit, ?]

00:26:46.930 --> 00:26:48.760
that in certain senses
is a little bit better.

00:26:48.760 --> 00:26:51.051
But from my perspective, I
think all of those are good.

00:26:51.051 --> 00:26:53.600
Random, I have to double check
whether you lose a log factor.

00:26:53.600 --> 00:26:57.520
In expectation it should be fine.

00:26:57.520 --> 00:27:00.600
So all of those
strategies will work.

00:27:00.600 --> 00:27:02.945
You could define this
model with any of them.

00:27:02.945 --> 00:27:04.820
I think it would work
fine, except with randomization,

00:27:04.820 --> 00:27:08.640
you'd get an expectation bound.

00:27:08.640 --> 00:27:24.480
So the system evicts, let's say,
the least recently used page.

00:27:24.480 --> 00:27:26.770
The least recently loaded
page would also work fine.

00:27:26.770 --> 00:27:28.136
That's FIFO.
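
NOTE
A small Python sketch of the replacement policy being discussed: the program only touches word addresses, the simulated system fetches blocks automatically and evicts the least recently used one, and we count the transfers. The function and its interface are made up for illustration; B and M are known to the simulator (and the analysis), not to the algorithm.
    from collections import OrderedDict
    def count_transfers(addresses, B, M):
        """Count memory transfers for a sequence of word accesses under LRU."""
        cache = OrderedDict()       # block index -> None, ordered by recency
        capacity = M // B           # the cache holds M/B blocks
        transfers = 0
        for a in addresses:
            block = a // B          # which block this word lives in
            if block in cache:
                cache.move_to_end(block)        # hit: refresh recency, no transfer
            else:
                transfers += 1                  # miss: one block read from disk
                if len(cache) >= capacity:
                    cache.popitem(last=False)   # evict the least recently used block
                cache[block] = None
        return transfers
    # For example, scanning 1000 contiguous words with B=16, M=256:
    # count_transfers(range(1000), 16, 256) returns 63, about N/B.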

00:27:28.136 --> 00:27:31.740
Sorry I'm switching to page, but
I've been calling them blocks.

00:27:31.740 --> 00:27:36.200
Blocks and pages are the
same thing for this lecture.

00:27:36.200 --> 00:27:40.690
And either at the end of this
lecture or beginning of next,

00:27:40.690 --> 00:27:42.900
I'll tell you why
that's an OK thing.

00:27:42.900 --> 00:27:51.440
But let's not worry
about it at this point.

00:27:51.440 --> 00:27:55.000
So now we have a model--
cache oblivious.

00:27:55.000 --> 00:27:58.186
We have two models, actually.

00:27:58.186 --> 00:28:00.060
But I think now that
the cache oblivious

00:28:00.060 --> 00:28:03.040
model is complete,
we're going to analyze.

00:28:03.040 --> 00:28:06.460
Again, we're still counting
the number of memory transfers

00:28:06.460 --> 00:28:07.260
in this thing.

00:28:07.260 --> 00:28:09.275
The algorithm's just not
allowed to know B and M,

00:28:09.275 --> 00:28:10.900
and so we had to
change the model

00:28:10.900 --> 00:28:13.890
to make the reading
and writing of blocks

00:28:13.890 --> 00:28:15.812
automatic because
the algorithm's not

00:28:15.812 --> 00:28:16.520
allowed to do it.

00:28:16.520 --> 00:28:18.950
So someone's got to.

00:28:18.950 --> 00:28:20.980
The cool thing about
cache oblivious model

00:28:20.980 --> 00:28:23.870
is every algorithm
you see in this class,

00:28:23.870 --> 00:28:26.260
or most of the algorithms
you see in this class,

00:28:26.260 --> 00:28:28.510
are in a certain sense
cache oblivious algorithms.

00:28:28.510 --> 00:28:32.390
They weren't aware of B
and M before, still not.

00:28:32.390 --> 00:28:35.930
What changes is now you can
analyze them in this new way,

00:28:35.930 --> 00:28:37.300
in this new model.

00:28:37.300 --> 00:28:39.815
Now, as I said, all the
algorithms we've seen

00:28:39.815 --> 00:28:44.309
are not going to perform well
in this model-- almost all.

00:28:44.309 --> 00:28:45.725
But that makes
things interesting,

00:28:45.725 --> 00:28:50.870
and that's why we
have some work to do.

00:28:50.870 --> 00:28:53.180
I have some reasons why
cache obliviousness--

00:28:53.180 --> 00:28:55.470
why would you tie your
hands behind your back

00:28:55.470 --> 00:28:57.060
and not know B or M?

00:28:57.060 --> 00:29:00.150
Reason one, it's cool.

00:29:00.150 --> 00:29:02.390
I think it's pretty amazing
you can actually do this.

00:29:02.390 --> 00:29:03.880
I guess that's reason
two is you can actually

00:29:03.880 --> 00:29:05.930
do it for a lot of
problems we care about.

00:29:05.930 --> 00:29:08.900
Cache oblivious algorithms
exist that are just as good.

00:29:08.900 --> 00:29:10.630
So, I mean, of
course they exist.

00:29:10.630 --> 00:29:12.700
But there are ones
that are optimal.

00:29:12.700 --> 00:29:15.230
They're within a constant
factor of the best algorithm

00:29:15.230 --> 00:29:18.665
when you know B or M.
So that's surprising.

00:29:18.665 --> 00:29:22.019
That's the cool part.

00:29:22.019 --> 00:29:23.560
In general, the
algorithms are easier

00:29:23.560 --> 00:29:27.540
to write down because we can use
pseudo code just like before.

00:29:27.540 --> 00:29:30.930
We don't need to worry about
blocking in the algorithm.

00:29:30.930 --> 00:29:34.530
The analysis is going to be
harder, but that's unavoidable.

00:29:34.530 --> 00:29:37.510
In some sense, it makes
it easier to write code.

00:29:37.510 --> 00:29:40.820
And it's also a little easier
to distribute your code

00:29:40.820 --> 00:29:43.160
because every computer
has different block

00:29:43.160 --> 00:29:44.200
sizes that matter.

00:29:44.200 --> 00:29:46.500
Also, as you change
your value of N,

00:29:46.500 --> 00:29:49.390
a different level in the memory
hierarchy's going to matter.

00:29:49.390 --> 00:29:52.520
And so it's annoying-- each of
these levels, I didn't mention,

00:29:52.520 --> 00:29:54.640
has a different block
size and, of course,

00:29:54.640 --> 00:29:56.360
has a different cache size.

00:29:56.360 --> 00:29:59.840
So tuning your code every
time to a different B or M

00:29:59.840 --> 00:30:01.740
is annoying.

00:30:01.740 --> 00:30:03.980
The big gain here,
though, I think,

00:30:03.980 --> 00:30:08.030
is that you capture the
entire hierarchy, in a sense.

00:30:08.030 --> 00:30:11.930
So in the real world,
each of these pipes

00:30:11.930 --> 00:30:12.945
has its own latency.

00:30:12.945 --> 00:30:15.110
And let's just
think about latency.

00:30:15.110 --> 00:30:17.490
And you'd like to minimize
the number of block transfers

00:30:17.490 --> 00:30:18.630
between here and here.

00:30:18.630 --> 00:30:20.810
You'd like to minimize the
number of block transfers between here and here.

00:30:20.810 --> 00:30:22.434
Well, OK, I can't
minimize all of them.

00:30:22.434 --> 00:30:24.580
That's a multi
dimensional problem.

00:30:24.580 --> 00:30:27.190
What I'd like to minimize
is some weighted average

00:30:27.190 --> 00:30:30.589
of those things-- latency
times number of blocks here,

00:30:30.589 --> 00:30:32.380
plus the latency times
the number of blocks

00:30:32.380 --> 00:30:34.254
here, plus latency times
the number of blocks

00:30:34.254 --> 00:30:36.910
here, and so on.
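
NOTE
The objective just described, written out (a sketch; here \ell_i is the latency of the connection between level i and level i+1, and T_i is the number of block transfers across it):
    \[ \text{total cost} \;\approx\; \sum_i \ell_i \, T_i \]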

00:30:36.910 --> 00:30:41.410
If you can find an optimal cache
oblivious algorithm and analyze

00:30:41.410 --> 00:30:44.680
it just with respect
to two levels,

00:30:44.680 --> 00:30:47.455
because the algorithm's not
allowed to know B and M,

00:30:47.455 --> 00:30:50.130
it has to work for all levels.

00:30:50.130 --> 00:30:54.140
It has to minimize the number
of block transfers between all

00:30:54.140 --> 00:30:55.680
these levels, and
so, in particular,

00:30:55.680 --> 00:30:59.175
will minimize the
weighted sum of them.

00:30:59.175 --> 00:31:00.050
It's a bit hand wavy.

00:31:00.050 --> 00:31:01.520
You have to prove
something there.

00:31:01.520 --> 00:31:06.680
But you can prove it.

00:31:06.680 --> 00:31:09.930
So there's a paper
about this from 1999

00:31:09.930 --> 00:31:15.390
by Frigo, Leiserson,
Prokop, and Ramachandran.

00:31:15.390 --> 00:31:17.710
It's old enough that I
remember all the names.

00:31:17.710 --> 00:31:20.387
After about 2001, when
I became a professor,

00:31:20.387 --> 00:31:21.470
I can't remember anything.

00:31:21.470 --> 00:31:24.400
But before that, I can
remember everything.

00:31:24.400 --> 00:31:28.450
So Frigo, we've talked about
him in the context of FFTW.

00:31:28.450 --> 00:31:30.780
That was the fastest Fourier
Transform in the West.

00:31:30.780 --> 00:31:31.870
So he was a student here.

00:31:31.870 --> 00:31:35.810
And FFTW uses a cache oblivious
Fast Fourier Transform

00:31:35.810 --> 00:31:37.770
algorithm.

00:31:37.770 --> 00:31:40.490
Leiserson, you've probably seen
on the cover of your textbook

00:31:40.490 --> 00:31:42.340
or walking around Stata.

00:31:42.340 --> 00:31:45.370
Professor Leiserson here at MIT.

00:31:45.370 --> 00:31:48.270
And Prokop, this is actually
his MEng thesis.

00:31:48.270 --> 00:31:52.220
So pretty awesome
MEng thesis.

00:31:52.220 --> 00:31:56.180
All right, so cool, I
think I said all the things

00:31:56.180 --> 00:31:58.450
I wanted to say.

00:31:58.450 --> 00:32:00.195
So if you want to see
the proof that you

00:32:00.195 --> 00:32:02.140
can solve the entire
memory hierarchy,

00:32:02.140 --> 00:32:04.020
you can read their paper.

00:32:04.020 --> 00:32:05.740
You have to make a
couple of assumptions,

00:32:05.740 --> 00:32:07.200
but it's intuitive.

00:32:07.200 --> 00:32:09.090
Cache oblivious has to
work for all B and M,

00:32:09.090 --> 00:32:12.220
so it's going to optimize all
the levels simultaneously.

00:32:12.220 --> 00:32:15.220
Doing that explicitly, with all
the different B's and M's, that

00:32:15.220 --> 00:32:19.157
would be really messy
code, probably also slower.

00:32:19.157 --> 00:32:20.740
Cache oblivious is
just going to do it

00:32:20.740 --> 00:32:23.046
for free with the same code.

00:32:23.046 --> 00:32:29.480
All right, let's
do some algorithms.

00:32:29.480 --> 00:32:31.660
There's one easy
algorithm which works

00:32:31.660 --> 00:32:37.770
great from a cache oblivious
perspective, which is scanning.

00:32:37.770 --> 00:32:48.540
Let me give you
some Python code.

00:32:48.540 --> 00:32:50.850
For historical
reasons, in this field,

00:32:50.850 --> 00:32:52.730
N is written with
a capital letter.

00:32:52.730 --> 00:32:55.707
Don't ask, or don't
worry about it.

00:32:55.707 --> 00:32:57.040
So here's some very simple code.

00:32:57.040 --> 00:32:59.230
Suppose you want to
accumulate an array.

00:32:59.230 --> 00:33:01.700
You want to add up all
of the items in the array

00:33:01.700 --> 00:33:04.180
or multiply them or take
the min or whatever.

00:33:04.180 --> 00:33:06.900
This is a typical kind of thing.
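
NOTE
The code shown in lecture is not reproduced in the transcript; this is a minimal Python version of the kind of accumulation loop being described (one left-to-right scan summing the N items of a list A).
    def summation(A):
        total = 0          # the running sum lives in a register
        for x in A:        # touches A[0], A[1], ..., A[N-1] in order
            total += x
        return total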

00:33:06.900 --> 00:33:11.200
Again, an array, we're
going to think of-- so here

00:33:11.200 --> 00:33:12.895
was my memory.

00:33:12.895 --> 00:33:14.270
We're going to
think of the array

00:33:14.270 --> 00:33:19.040
as being stored
as some contiguous

00:33:19.040 --> 00:33:23.610
segment of memory,
let's say, this segment.

00:33:23.610 --> 00:33:25.092
So this is important.

00:33:25.092 --> 00:33:37.090
Assume the array is stored
contiguously, with no holes,

00:33:37.090 --> 00:33:41.760
relative to how it's
mapped on to memory.

00:33:41.760 --> 00:33:43.350
And this is a
realistic assumption.

00:33:43.350 --> 00:33:45.860
When you allocate
a block of memory,

00:33:45.860 --> 00:33:48.120
the promise by the system
is that it's essentially

00:33:48.120 --> 00:33:53.390
a contiguous chunk of
memory or disk, or whatever.

00:33:53.390 --> 00:33:58.170
And when Python makes
an array, it does this.

00:33:58.170 --> 00:34:01.160
It guarantees that these things
will be stored contiguously.

00:34:01.160 --> 00:34:03.160
If you use a dictionary,
this would not be true.

00:34:03.160 --> 00:34:05.780
But for regular arrays and
lists, this is true.

00:34:05.780 --> 00:34:10.530
So I'm accessing the items
in the array in order,

00:34:10.530 --> 00:34:12.500
and so I start
here at item zero.

00:34:12.500 --> 00:34:15.780
I end up with item N minus 1.

00:34:15.780 --> 00:34:17.949
That seems good because
I read this one.

00:34:17.949 --> 00:34:19.114
I get the whole block.

00:34:19.114 --> 00:34:19.989
Then I read this one.

00:34:19.989 --> 00:34:21.031
I already had that block.

00:34:21.031 --> 00:34:21.710
It's free.

00:34:21.710 --> 00:34:22.560
This one's free.

00:34:22.560 --> 00:34:23.389
This one's free.

00:34:23.389 --> 00:34:25.260
Here I have to read a new block.

00:34:25.260 --> 00:34:26.650
But then this one's free.

00:34:26.650 --> 00:34:29.130
So the first item I
access in each block

00:34:29.130 --> 00:34:33.610
costs one, but as long as my
cache stores at least one

00:34:33.610 --> 00:34:35.820
block, that's enough.

00:34:35.820 --> 00:34:38.010
And let's say the
sum is a register;

00:34:38.010 --> 00:34:39.684
that's enough to
remember that block so

00:34:39.684 --> 00:34:43.250
that the next operation
I do will be free.

00:34:43.250 --> 00:34:52.840
So the cost is going
to be-- actually,

00:34:52.840 --> 00:35:02.800
be a little more precise--
ceiling of N over B almost.

00:35:02.800 --> 00:35:09.170
Without the big O here, this
is right in the external memory

00:35:09.170 --> 00:35:16.330
model, but not quite right
in the cache oblivious model.

00:35:16.330 --> 00:35:18.154
Can someone tell me why?

00:35:18.154 --> 00:35:19.540
Yeah?

00:35:19.540 --> 00:35:21.388
AUDIENCE: If N is
two, you could have it

00:35:21.388 --> 00:35:23.240
beyond a border [INAUDIBLE].

00:35:23.240 --> 00:35:25.320
ERIK DEMAINE: Good,
N could be two.

00:35:25.320 --> 00:35:26.970
But it could span
a block boundary.

00:35:26.970 --> 00:35:28.500
Remember, the
algorithm has no idea

00:35:28.500 --> 00:35:29.791
where the block boundaries are.

00:35:29.791 --> 00:35:32.750
And again, in reality,
there are block boundaries

00:35:32.750 --> 00:35:35.390
all over the place, and
there's no way to know.

00:35:35.390 --> 00:35:38.400
You can't request that
when you allocate an array

00:35:38.400 --> 00:35:40.230
it always begins in
a block boundary.

00:35:40.230 --> 00:35:48.066
So great, you can span block
boundaries in-- oh, way off.

00:35:48.066 --> 00:35:52.200
I just spanned a
block boundary, sorry.

00:35:52.200 --> 00:35:56.290
So it's going to be,
at most, ceiling of N

00:35:56.290 --> 00:36:00.470
over B plus 1, cache obliviously.

00:36:00.470 --> 00:36:02.060
So it's just going
to hurt you by one.

00:36:02.060 --> 00:36:04.226
But I want to point out,
there's a slight difference

00:36:04.226 --> 00:36:07.010
between the two models,
even with this very simple

00:36:07.010 --> 00:36:08.560
algorithm.

00:36:08.560 --> 00:36:10.860
In general, I'm just
going to think of this

00:36:10.860 --> 00:36:15.680
as big O of N over B plus 1.

00:36:15.680 --> 00:36:17.790
There's some additive constant.

00:36:17.790 --> 00:36:20.470
I guess you could even say
it's N over B plus big O 1,

00:36:20.470 --> 00:36:23.960
but we won't worry about
constant factors today.

00:36:23.960 --> 00:36:26.820
So that's scanning, cache
oblivious external memory,

00:36:26.820 --> 00:36:28.500
both great.

00:36:28.500 --> 00:36:52.740
Slightly more interesting--
AUDIENCE: [INAUDIBLE]?

00:36:52.740 --> 00:36:55.949
ERIK DEMAINE: Yeah, in the
external memory algorithm,

00:36:55.949 --> 00:36:57.990
because you're explicitly
controlling the blocks,

00:36:57.990 --> 00:36:59.970
you're explicitly
reading and writing them.

00:36:59.970 --> 00:37:01.920
And you know where the
block boundaries are.

00:37:01.920 --> 00:37:04.680
You could, if you wanted
to, you don't have to,

00:37:04.680 --> 00:37:07.070
but you could choose
the array to be aligned,

00:37:07.070 --> 00:37:09.370
to be starting at
a block boundary.

00:37:09.370 --> 00:37:10.490
So that's the distinction.

00:37:10.490 --> 00:37:12.570
In the cache oblivious,
you can't control that,

00:37:12.570 --> 00:37:15.319
so you have to worry
about the worst case.

00:37:15.319 --> 00:37:16.860
External memory you
could control it,

00:37:16.860 --> 00:37:19.240
and you could do better,
and maybe you'd want to.

00:37:19.240 --> 00:37:23.240
It will hurt you by
a constant factor.

00:37:23.240 --> 00:37:25.130
And in disks, for
example, you want

00:37:25.130 --> 00:37:28.182
things to be track
aligned because if you

00:37:28.182 --> 00:37:30.640
have to go to an adjacent track,
it's a lot more expensive.

00:37:30.640 --> 00:37:32.320
You've got to move the head.

00:37:32.320 --> 00:37:35.220
A track is a circle, what
you can read without moving

00:37:35.220 --> 00:37:42.170
the head, so great.

00:37:42.170 --> 00:37:44.110
So slightly more
interesting is you

00:37:44.110 --> 00:37:47.360
can do a constant number
of parallel scans.

00:37:47.360 --> 00:37:50.380
So that was one scan.

00:37:50.380 --> 00:38:02.810
Here's an example of two scans.

00:38:02.810 --> 00:38:10.360
Again, we have one array
of size N. Python notation,

00:38:10.360 --> 00:38:13.370
that would be the whole thing.

00:38:13.370 --> 00:38:21.170
And what I want to do is swap
Ai with-- this is not Python,

00:38:21.170 --> 00:38:25.780
but it's, I think,
textbook notation.

00:38:25.780 --> 00:38:28.840
But you know what swap means.
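
NOTE
A Python reconstruction of the two-finger swap loop being described on the board (a sketch; the exact indexing shown in lecture may differ slightly):
    def reverse(A):
        N = len(A)
        for i in range(N // 2):
            # swap A[i] with A[N-1-i]: two scans, one from each end of the array
            A[i], A[N - 1 - i] = A[N - 1 - i], A[i]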

00:38:28.840 --> 00:38:35.980
What does this do, assuming
I got my minus ones right?

00:38:35.980 --> 00:38:36.480
Yeah?

00:38:36.480 --> 00:38:37.813
AUDIENCE: It reverses the array.

00:38:37.813 --> 00:38:39.860
ERIK DEMAINE: It
reverses the array, good.

00:38:39.860 --> 00:38:42.424
We'll just run through
these Frisbees.

00:38:42.424 --> 00:38:43.840
So this is a very
simple algorithm

00:38:43.840 --> 00:38:44.575
for reversing the array.

00:38:44.575 --> 00:38:46.060
It was originally
by John Bentley,

00:38:46.060 --> 00:38:48.120
who was Charles
Leiserson's adviser-- PhD

00:38:48.120 --> 00:38:50.670
adviser-- back in the day.

00:38:50.670 --> 00:38:53.085
So very simple, but
what's cool about it,

00:38:53.085 --> 00:38:56.630
if you think about the array
and the order in which you're

00:38:56.630 --> 00:39:03.210
accessing things, it's
like I have two fingers--

00:39:03.210 --> 00:39:06.190
and I should have
made this smaller.

00:39:06.190 --> 00:39:08.340
So here, we'll go down here.

00:39:08.340 --> 00:39:10.250
I start at the very
beginning of the array

00:39:10.250 --> 00:39:11.660
and the very end of the array.

00:39:11.660 --> 00:39:14.830
Then I go to the second
element, next to last element,

00:39:14.830 --> 00:39:17.740
and I advance like this.

00:39:17.740 --> 00:39:22.256
So as long as M over B, the
number of blocks in the cache

00:39:22.256 --> 00:39:24.130
is at least two, which
is totally reasonable.

00:39:24.130 --> 00:39:27.930
You can assume this is
at least 100, typically.

00:39:27.930 --> 00:39:30.100
You've got at least
100 blocks, say.

00:39:30.100 --> 00:39:32.837
So for any fixed constant,
we're going to assume M over B

00:39:32.837 --> 00:39:33.920
is bigger than a constant.

00:39:33.920 --> 00:39:35.628
We'll only need like
two or three or four

00:39:35.628 --> 00:39:38.320
for the algorithms we cover.

00:39:38.320 --> 00:39:40.660
Then great, when I
access this item,

00:39:40.660 --> 00:39:43.410
I will load in the
block that contains it.

00:39:43.410 --> 00:39:47.640
I don't know how it's aligned,
but don't care so much.

00:39:47.640 --> 00:39:50.120
And then I load in the block
that contains this item.

00:39:50.120 --> 00:39:52.250
And then the next
accesses are free until I

00:39:52.250 --> 00:39:53.470
advance to the next block.

00:39:53.470 --> 00:39:56.340
But once I advance to the next
block on the left or the right,

00:39:56.340 --> 00:39:58.020
I'll never have to
access the old ones.

00:39:58.020 --> 00:40:01.020
And so again, the
cost here is just

00:40:01.020 --> 00:40:03.290
going to be equal to the
number of blocks, which

00:40:03.290 --> 00:40:07.540
is big O of N over B plus 1.

00:40:07.540 --> 00:40:09.840
So a constant number
of parallel scans

00:40:09.840 --> 00:40:14.690
is going to be basically the
number of blocks in the array.

00:40:14.690 --> 00:40:18.340
So if N is smaller than B, this
is a bad idea or not so hot.

00:40:18.340 --> 00:40:19.760
But when N is
bigger than B, this

00:40:19.760 --> 00:40:21.590
is just N over B.
That's how much it takes

00:40:21.590 --> 00:40:26.670
to read in the data-- big deal.

00:40:26.670 --> 00:40:29.890
So these are boring cache
oblivious algorithms.

00:40:29.890 --> 00:40:31.830
Let's do interesting ones.

00:40:31.830 --> 00:40:34.830
And I would say the
central idea in cache

00:40:34.830 --> 00:40:38.360
oblivious algorithms is
to use divide and conquer.

00:40:38.360 --> 00:40:42.010
This goes back to the first
few lectures in this class.

00:40:42.010 --> 00:40:46.390
And so we will go back
to examples from there.

00:40:46.390 --> 00:40:48.910
Today we're going to
do the median finding,

00:40:48.910 --> 00:40:52.790
in particular, which
we did in lecture two,

00:40:52.790 --> 00:40:54.620
so really a blast from the past.

00:40:54.620 --> 00:40:57.040
But it's good review because
the final covers everything,

00:40:57.040 --> 00:40:59.570
so you've got to remember that.

00:40:59.570 --> 00:41:02.430
Matrix multiplication,
we've talked about, but not

00:41:02.430 --> 00:41:06.920
usually-- well, I guess we did
actually use divide and conquer

00:41:06.920 --> 00:41:08.440
for Strassen's algorithm.

00:41:08.440 --> 00:41:11.310
We're going to use divide and conquer
even for the boring algorithm

00:41:11.310 --> 00:41:12.139
today.

00:41:12.139 --> 00:41:14.680
And then next class, we're going
to go back to van Emde Boas,

00:41:14.680 --> 00:41:16.150
but in a completely
different way.

00:41:16.150 --> 00:41:18.280
So if you don't
like van Emde Boas,

00:41:18.280 --> 00:41:21.800
don't worry; it's much simpler.

00:41:21.800 --> 00:41:24.930
So let's do median finding.

00:41:24.930 --> 00:41:29.808
Or actually, sorry, let
me first talk about divide

00:41:29.808 --> 00:41:33.860
and conquer in general.

00:41:33.860 --> 00:41:35.360
You know what divide
and conquer is.

00:41:35.360 --> 00:41:36.600
You take your problem.

00:41:36.600 --> 00:41:39.390
You split it into non
overlapping subproblems,

00:41:39.390 --> 00:41:42.665
recursively solve
them, combine them.

00:41:42.665 --> 00:41:44.040
But what I want
to stress here is

00:41:44.040 --> 00:41:47.110
what it's going to look like
in a cache oblivious context.

00:41:47.110 --> 00:41:51.598
So the algorithm is going to
look like a regular divide

00:41:51.598 --> 00:41:53.380
and conquer algorithm.

00:41:53.380 --> 00:42:01.800
So, in particular, the algorithm
will recurse all the way to,

00:42:01.800 --> 00:42:05.090
let's say, constant
size problems,

00:42:05.090 --> 00:42:11.900
whatever the base case is.

00:42:11.900 --> 00:42:18.410
So same as usual, but what's
different is the analysis.

00:42:18.410 --> 00:42:23.610
When we analyze a cache
oblivious algorithm,

00:42:23.610 --> 00:42:25.274
then we get to know
what B and M are.

00:42:25.274 --> 00:42:27.190
In some sense, we're
analyzing for all B and M.

00:42:27.190 --> 00:42:29.669
But let's suppose B and
M are given to us; then the analysis

00:42:29.669 --> 00:42:32.451
will tell you how many
memory transfers you need.

00:42:32.451 --> 00:42:33.950
This kind of bound,
you need to know

00:42:33.950 --> 00:42:37.110
what B is to know what the
value of this bound is.

00:42:37.110 --> 00:42:39.740
But you learn it as a
function of B and, in general,

00:42:39.740 --> 00:42:41.500
a function of B
and M, and that's

00:42:41.500 --> 00:42:46.390
the best you could hope for as
a complete characterization.

00:42:46.390 --> 00:42:49.470
So in the analysis, let's
just look at one value of B

00:42:49.470 --> 00:42:57.540
and one value of M. So
analysis knows B and M,

00:42:57.540 --> 00:43:11.620
and it's going to look at,
let's say, the recursive level,

00:43:11.620 --> 00:43:15.230
where one of two things happens.

00:43:15.230 --> 00:43:28.660
Either the problem size
fits in order one blocks.

00:43:28.660 --> 00:43:32.790
So meaning it's order B size.

00:43:32.790 --> 00:43:34.690
That's an interesting level.

00:43:34.690 --> 00:43:39.650
Another interesting level,
the more obvious one probably,

00:43:39.650 --> 00:43:44.045
is that it fits in cache.

00:43:44.045 --> 00:43:46.545
So that means that the size is
less than or equal to capital

00:43:46.545 --> 00:43:52.870
M. Everything here is
counted in terms of words.

00:43:52.870 --> 00:43:54.200
This is the more obvious one.

00:43:54.200 --> 00:43:57.247
For a lot of problems, the
cache size isn't so relevant.

00:43:57.247 --> 00:43:58.830
What really matters
is the block size.

00:43:58.830 --> 00:44:01.430
For example, scanning, you're
only looking through the data

00:44:01.430 --> 00:44:01.930
once.

00:44:01.930 --> 00:44:03.721
So it doesn't matter
how big your cache is,

00:44:03.721 --> 00:44:05.970
as long as it's not super tiny.

00:44:05.970 --> 00:44:08.420
As long as it has
a few blocks, then

00:44:08.420 --> 00:44:13.360
it's just a function of
B and N, no M involved.

00:44:13.360 --> 00:44:15.500
So for that kind of
problem this would

00:44:15.500 --> 00:44:19.800
be more useful-- constant
number of blocks.

00:44:19.800 --> 00:44:22.740
Because I think of the
cache M as being larger

00:44:22.740 --> 00:44:27.140
than any constant times B,
this is strictly smaller,

00:44:27.140 --> 00:44:30.870
or this is smaller than or equal
to the problem fitting in cache.

00:44:30.870 --> 00:44:32.780
So when M is
relevant, we'll look

00:44:32.780 --> 00:44:35.600
at this level and maybe
the adjacent levels

00:44:35.600 --> 00:44:37.580
in the recursion.

00:44:37.580 --> 00:44:40.150
So the algorithm doesn't know
what B and M are, so it's

00:44:40.150 --> 00:44:42.610
got to recurse all
the way down-- turtles

00:44:42.610 --> 00:44:44.040
all the way down.

00:44:44.040 --> 00:44:45.890
But the analysis,
because we're only

00:44:45.890 --> 00:44:47.780
thinking about one
value B and M at a time,

00:44:47.780 --> 00:44:49.830
we can afford to just
consider that one level,

00:44:49.830 --> 00:44:51.496
and that will be like
the critical place

00:44:51.496 --> 00:44:52.599
where all the cost is.

00:44:52.599 --> 00:44:55.140
Because once things fit in cache
and you've loaded things in,

00:44:55.140 --> 00:44:56.570
the cost will be zero.

00:44:56.570 --> 00:44:58.899
So below that, the base
case is kind of trivial.

00:44:58.899 --> 00:45:00.440
So basically what
this is going to do

00:45:00.440 --> 00:45:02.410
is make our base cases larger.

00:45:02.410 --> 00:45:04.510
Instead of our base
case being constant,

00:45:04.510 --> 00:45:11.650
it's going to be order B or M.

00:45:11.650 --> 00:45:21.839
What don't I need?

00:45:21.839 --> 00:45:45.420
So now let's go on
to median finding.

00:45:45.420 --> 00:45:47.990
Median finding, you're
given an unsorted array.

00:45:47.990 --> 00:45:50.560
You want to find the median.

00:45:50.560 --> 00:45:55.440
And in lecture two,
we had a linear time

00:45:55.440 --> 00:46:00.900
worst case algorithm for this.

00:46:00.900 --> 00:46:04.230
And so my goal today is to
make it this running time.

00:46:04.230 --> 00:46:05.890
This is what you
might call linear time

00:46:05.890 --> 00:46:08.360
in the cache oblivious model
because that's how long it

00:46:08.360 --> 00:46:12.850
takes just to read the data.

00:46:12.850 --> 00:46:15.680
It turns out basically
the same algorithm works.

00:46:15.680 --> 00:46:17.810
First, you've got to
remember the algorithm.

00:46:17.810 --> 00:46:20.510
So let me write it down quickly.

00:46:20.510 --> 00:46:25.250
This is the sort-of
5 by N over 5 array.

00:46:25.250 --> 00:46:29.816
So think of the array as
being partitioned into, I'll

00:46:29.816 --> 00:46:35.540
call them, five columns.

00:46:35.540 --> 00:46:42.990
So this picture of five dots
by N over 5 dots-- this is

00:46:42.990 --> 00:46:44.361
dot, dot, dot.

00:46:44.361 --> 00:46:46.240
So this is five.

00:46:46.240 --> 00:46:48.025
Now, we didn't
talk about it then,

00:46:48.025 --> 00:46:50.150
and there's a few different
ways you could actually

00:46:50.150 --> 00:46:52.710
implement it, but let's say
these-- the actual array is

00:46:52.710 --> 00:46:53.810
one-dimensional.

00:46:53.810 --> 00:46:55.610
Let's say these are
the first five items.

00:46:55.610 --> 00:46:57.020
These are the next five items.

00:46:57.020 --> 00:47:01.610
So, in other words, this matrix
is stored column by column.

00:47:01.610 --> 00:47:03.150
This is just a conceptual view.

00:47:03.150 --> 00:47:05.880
So we can define it either
way, however we want.

00:47:05.880 --> 00:47:08.070
So I'm going to
view it that way.

00:47:08.070 --> 00:47:12.090
And then what the rest of the
algorithm did was to sort

00:47:12.090 --> 00:47:16.150
each column, it's
only five items,

00:47:16.150 --> 00:47:19.137
so you can sort it in
constant time, each one.

00:47:19.137 --> 00:47:20.720
But, in particular,
what we care about

00:47:20.720 --> 00:47:24.010
is the median of
those five items.

00:47:24.010 --> 00:47:32.370
Then we recursively found
the median of the medians.

00:47:32.370 --> 00:47:41.150
This is the step we're going
to have to change a little bit.

00:47:41.150 --> 00:47:46.350
Then we-- leave a
little bit of space.

00:47:46.350 --> 00:47:52.580
Then we partition
the array by x.

00:47:52.580 --> 00:47:55.190
Meaning we split the
array into items less than

00:47:55.190 --> 00:48:00.350
or equal to x and
things greater than x.

00:48:00.350 --> 00:48:03.040
We probably assumed there was
only one value equal to x,

00:48:03.040 --> 00:48:05.160
but it doesn't matter.

00:48:05.160 --> 00:48:13.760
And finally, we recurse on
one of those two halves.

00:48:13.760 --> 00:48:16.369
So this is a pretty crazy divide
and conquer algorithm, one

00:48:16.369 --> 00:48:17.867
of the more sophisticated ones.

00:48:17.867 --> 00:48:19.700
You don't need to know
all the details here,

00:48:19.700 --> 00:48:22.980
just that it worked and
it ran in linear time.

00:48:22.980 --> 00:48:26.030
What's crazy about it is
there are two recursive calls.

00:48:26.030 --> 00:48:27.460
Usually, like in
merge sort, where

00:48:27.460 --> 00:48:30.090
you do two recursive calls
and spend linear time

00:48:30.090 --> 00:48:32.300
to do the stuff,
like this partition,

00:48:32.300 --> 00:48:34.470
you get n log n time,
like merge sort.

00:48:34.470 --> 00:48:37.570
Here, because this
array is a lot smaller,

00:48:37.570 --> 00:48:39.690
this is a size N over 5.

00:48:39.690 --> 00:48:41.610
And this one was
reasonably small;

00:48:41.610 --> 00:48:48.800
it was like 7/10
N. Because 7/10 plus 1/5

00:48:48.800 --> 00:48:52.730
is strictly less than
1, this ends up being

00:48:52.730 --> 00:48:54.480
linear time instead of n log n.

00:48:54.480 --> 00:48:56.820
That's just review.
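
For reference, here is a compact Python sketch of the selection algorithm just reviewed. This is my code, not the lecture's: it partitions into less-than, equal, and greater-than piles (which handles repeated values), and it returns the k-th smallest element, so the median is k equal to N over 2.

import random

def select(a, k):
    """Return the k-th smallest element (0-indexed) of the list a."""
    if len(a) <= 5:
        return sorted(a)[k]
    # Steps 1-2: view a as columns of five and take each column's median.
    medians = [sorted(a[i:i + 5])[len(a[i:i + 5]) // 2]
               for i in range(0, len(a), 5)]
    # Step 3: recursively find the median of the medians, x.
    x = select(medians, len(medians) // 2)
    # Step 4: partition around x (a constant number of scans).
    less = [v for v in a if v < x]
    equal = [v for v in a if v == x]
    greater = [v for v in a if v > x]
    # Step 5: recurse on the side that contains the k-th element.
    if k < len(less):
        return select(less, k)
    if k < len(less) + len(equal):
        return x
    return select(greater, k - len(less) - len(equal))

a = list(range(101))
random.shuffle(a)
assert select(a, 50) == 50  # the median of 0..100 is 50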

00:48:56.820 --> 00:49:05.270
Now, what I'd like to do is
the same thing, same analysis,

00:49:05.270 --> 00:49:08.490
or same algorithm, but
now I want to analyze it

00:49:08.490 --> 00:49:10.410
in this two-level model.

00:49:10.410 --> 00:49:28.740
So actually, I will
erase this board.

00:49:28.740 --> 00:49:33.820
So now my array has been
partitioned into blocks

00:49:33.820 --> 00:49:36.414
of size B, like this picture.

00:49:36.414 --> 00:49:37.580
In fact, it's quite similar.

00:49:37.580 --> 00:49:39.730
Here, we're partitioning
things into blocks,

00:49:39.730 --> 00:49:40.670
but they're size five.

00:49:40.670 --> 00:49:41.620
That's different.

00:49:41.620 --> 00:49:44.990
Now someone has partitioned my
array into blocks of size B.

00:49:44.990 --> 00:49:46.840
I need to count how
many things I access.

00:49:46.840 --> 00:49:49.720
Well, let's just look
line by line at this code

00:49:49.720 --> 00:49:50.530
and see what we do.

00:49:50.530 --> 00:49:52.490
Step one, we do
absolutely nothing.

00:49:52.490 --> 00:49:56.440
This is a conceptual
picture, so zero cost, great.

00:49:56.440 --> 00:50:00.600
Step one is zero,
my favorite answer.

00:50:00.600 --> 00:50:03.630
Step two, we sort each column.

00:50:03.630 --> 00:50:04.630
How long does this take?

00:50:04.630 --> 00:50:12.230
What am I doing?

00:50:12.230 --> 00:50:17.420
It's right above me.

00:50:17.420 --> 00:50:18.350
AUDIENCE: N over B.

00:50:18.350 --> 00:50:21.024
ERIK DEMAINE: N over B
because this is a scan.

00:50:21.024 --> 00:50:22.440
It's a little bit
weird of a scan.

00:50:22.440 --> 00:50:25.520
We look at five
items, and then we

00:50:25.520 --> 00:50:28.370
look at the next five items,
and then the next five items.

00:50:28.370 --> 00:50:29.550
But it's basically a scan.

00:50:29.550 --> 00:50:32.559
You could think of it as almost
five parallel scans, I suppose,

00:50:32.559 --> 00:50:34.100
or you could just
break into the case

00:50:34.100 --> 00:50:37.540
where maybe if B
is a constant, then

00:50:37.540 --> 00:50:38.790
it doesn't matter what you do.

00:50:38.790 --> 00:50:42.681
But if B bigger than a constant,
then reading five items,

00:50:42.681 --> 00:50:44.680
those are all probably
going to be in one block,

00:50:44.680 --> 00:50:47.290
except the ones that straddle
the block boundaries.

00:50:47.290 --> 00:50:51.086
So in all cases,
for step two-- maybe

00:50:51.086 --> 00:50:53.600
I should rewrite
step one-- zero cost.

00:50:53.600 --> 00:51:01.240
Step two is order N over
B plus 1, to be careful.

00:51:01.240 --> 00:51:03.636
That's a scan.

00:51:03.636 --> 00:51:05.010
Actually, it's
two parallel scans

00:51:05.010 --> 00:51:09.490
because we have to write
out these medians somewhere,

00:51:09.490 --> 00:51:10.620
so we'll have to.

00:51:10.620 --> 00:51:14.320
Step three is recursively
find the medians.

00:51:14.320 --> 00:51:22.140
Now, before, we had
T of N is T of N over 5

00:51:22.140 --> 00:51:30.110
plus T of 7/10 N plus linear.

00:51:30.110 --> 00:51:34.002
In this new world-- this
is regular running time.

00:51:34.002 --> 00:51:35.460
In this new world,
I'm going to use

00:51:35.460 --> 00:51:38.640
a different notation for
the recurrence, MT of N

00:51:38.640 --> 00:51:40.630
for memory transfers.

00:51:40.630 --> 00:51:42.490
This is a good old
fashioned time,

00:51:42.490 --> 00:51:44.900
and this is our new
modern notion of time--

00:51:44.900 --> 00:51:47.760
how many block transfers do I
need to do for problem size N.

00:51:47.760 --> 00:51:54.020
So this is a recursion, and
should be MT of N over 5.

00:51:54.020 --> 00:52:00.500
But, and this is
important, for this

00:52:00.500 --> 00:52:02.850
to be a same problem
of the same type,

00:52:02.850 --> 00:52:05.750
I need to know that the
array that I'm recursing on

00:52:05.750 --> 00:52:08.660
is stored contiguously.

00:52:08.660 --> 00:52:10.800
Before, I didn't
need to do that.

00:52:10.800 --> 00:52:14.200
I could say, well, let's put
the medians in the middle.

00:52:14.200 --> 00:52:18.341
So now every fifth item in
this array is my new subarray.

00:52:18.341 --> 00:52:20.090
And so I could recursively
call this thing

00:52:20.090 --> 00:52:22.690
and say, OK, here's my
array, but really only think

00:52:22.690 --> 00:52:23.830
about every fifth item.

00:52:23.830 --> 00:52:25.650
That's like a
stride in the array.

00:52:25.650 --> 00:52:27.460
And then the next
recursive level, oh, only

00:52:27.460 --> 00:52:28.950
worry about every 25th item.

00:52:28.950 --> 00:52:32.260
And every 5 cubed item, I'm
going to stop computing,

00:52:32.260 --> 00:52:33.870
and so on.

00:52:33.870 --> 00:52:37.360
And that would be fine
for regular running time.

00:52:37.360 --> 00:52:39.554
But when my stride
gets bigger and bigger,

00:52:39.554 --> 00:52:40.970
at some point,
every item is going

00:52:40.970 --> 00:52:42.130
to be in a different block.

00:52:42.130 --> 00:52:43.130
That's bad.

00:52:43.130 --> 00:52:44.260
I don't want to do that.

00:52:44.260 --> 00:52:48.270
So when I find these
medians, or when I recurse,

00:52:48.270 --> 00:52:50.290
I need that the medians
that I'm recursing on

00:52:50.290 --> 00:52:51.670
are stored in a
contiguous array.

00:52:51.670 --> 00:52:52.690
Now, this is easy to do.

00:52:52.690 --> 00:52:54.023
But we didn't have to do it before.

00:52:54.023 --> 00:52:57.790
That's the key difference.

00:52:57.790 --> 00:53:07.060
Make sure they are
stored contiguously.

00:53:07.060 --> 00:53:11.760
I can do that because when I
sort each column in one scan,

00:53:11.760 --> 00:53:14.250
I can have a second scan
which is the output, which

00:53:14.250 --> 00:53:16.360
is the array of medians.

00:53:16.360 --> 00:53:17.960
So as I'm scanning
through the input,

00:53:17.960 --> 00:53:19.290
I'm going to output the median.

00:53:19.290 --> 00:53:21.059
It's going to be 1/5 the size.

00:53:21.059 --> 00:53:22.850
Then I've got all the
medians nicely stored

00:53:22.850 --> 00:53:25.070
in a contiguous array.

00:53:25.070 --> 00:53:27.430
So with order one
parallel scans,

00:53:27.430 --> 00:53:33.810
same time here, this is actually
a legitimate recursive call.

00:53:33.810 --> 00:53:35.120
Then we partition.

00:53:35.120 --> 00:53:42.920
Partition, again, is a bunch of
parallel scans, I think, three.

00:53:42.920 --> 00:53:44.640
You've got one
reading scan, which

00:53:44.640 --> 00:53:46.190
is you're reading
through the array,

00:53:46.190 --> 00:53:47.380
and you've got two writing scans.

00:53:47.380 --> 00:53:49.420
You're writing out the elements
less than or equal to x,

00:53:49.420 --> 00:53:51.630
and you're writing out the
elements greater than x.

00:53:51.630 --> 00:53:53.050
But again, all of
those are scans.

00:53:53.050 --> 00:53:55.120
You're always writing
the next element right

00:53:55.120 --> 00:53:56.460
after the previous one.

00:53:56.460 --> 00:53:58.350
So if you already have
that block in memory

00:53:58.350 --> 00:54:01.750
and if you assume that the
number of blocks in cache

00:54:01.750 --> 00:54:06.910
is at least three, then
three parallel scans is fine.

00:54:06.910 --> 00:54:09.190
It's different from the
CLRS partition algorithm.

00:54:09.190 --> 00:54:11.260
That one was fancy
to be in place.

00:54:11.260 --> 00:54:13.720
We're not trying to be
in place or fancy at all.

00:54:13.720 --> 00:54:16.310
Let's just do it with
a bunch of scans.

00:54:16.310 --> 00:54:18.524
So now we have two arrays--
the elements less than x,

00:54:18.524 --> 00:54:19.690
the elements greater than x.

00:54:19.690 --> 00:54:22.070
Then we recurse on one of
them, and those elements

00:54:22.070 --> 00:54:24.260
are consecutive
already, so good.

00:54:24.260 --> 00:54:26.630
This is a regular
recursive call.

00:54:26.630 --> 00:54:28.320
Again, we're
maintaining the invariant

00:54:28.320 --> 00:54:32.350
that the array is
stored contiguously.

00:54:32.350 --> 00:54:37.430
And by the old analysis, that
array is sized at most 7/10 N.

00:54:37.430 --> 00:54:45.918
So I get a new recurrence, which
is MT of N is MT of N over 5

00:54:45.918 --> 00:54:51.090
plus MT of 7/10 N-- this analysis
feels very "empty--"

00:54:51.090 --> 00:54:57.150
plus N over B-- sorry, bad
joke-- N over B plus 1.

00:54:57.150 --> 00:55:01.010
So basically the same
recurrence, but now N over B

00:55:01.010 --> 00:55:03.955
plus 1 for what
we're doing here.

00:55:03.955 --> 00:55:05.330
But I had to change
the algorithm

00:55:05.330 --> 00:55:07.970
a little bit for this
recurrence to be correct,

00:55:07.970 --> 00:55:10.430
for it to correctly reflect
the number of memory transfers.

00:55:10.430 --> 00:55:13.920
Now all we need to do
is solve the recurrence.

00:55:13.920 --> 00:55:18.100
And actually, in some
sense, more importantly,

00:55:18.100 --> 00:55:20.520
we need to figure out
what the base case is.

00:55:20.520 --> 00:55:25.630
Because we could say, all right,
here's the usual base case.

00:55:25.630 --> 00:55:27.170
If I have a constant
sized problem,

00:55:27.170 --> 00:55:29.140
well, that's going
to be constant.

00:55:29.140 --> 00:55:32.260
This is our base case for every
recurrence we've ever done.

00:55:32.260 --> 00:55:34.220
And that's enough usually.

00:55:34.220 --> 00:55:36.900
It's going to give us a
really bad answer here.

00:55:36.900 --> 00:56:01.060
So let's go off to the side
here and solve that recurrence.

00:56:01.060 --> 00:56:05.534
So if that's my base case,
well, in particular-- so

00:56:05.534 --> 00:56:06.700
this is some recursion tree.

00:56:06.700 --> 00:56:09.650
It's very uneven, so it's
kind of annoying to draw.

00:56:09.650 --> 00:56:13.500
But what I know
with this base case,

00:56:13.500 --> 00:56:17.160
this overall MT of N is
going to be at least the number

00:56:17.160 --> 00:56:19.880
of leaves in the recursion tree.

00:56:19.880 --> 00:56:22.955
So let's say MT
of N is at least L

00:56:22.955 --> 00:56:29.300
of N, number of leaves
in the recursion.

00:56:29.300 --> 00:56:31.520
So this is really if
I run the algorithm,

00:56:31.520 --> 00:56:35.630
how many base cases of
constant size do I get?

00:56:35.630 --> 00:56:46.050
And that satisfies-- so it's
not obvious what that is.

00:56:46.050 --> 00:56:47.199
There's no plus here.

00:56:47.199 --> 00:56:49.490
Number of leaves is just how
many leaves are over here,

00:56:49.490 --> 00:56:52.560
how many leaves are over here,
and L of 1 equals 1, say,

00:56:52.560 --> 00:56:55.120
or some constant
equals constant.

00:56:55.120 --> 00:56:59.440
I happen to know, because
I saw lots of recurrences,

00:56:59.440 --> 00:57:03.260
this solves to some
N to the alpha.

00:57:03.260 --> 00:57:08.730
I claim that L of N is N to the
alpha for some constant alpha.

00:57:08.730 --> 00:57:09.440
Why?

00:57:09.440 --> 00:57:11.650
I'll just prove that it works.

00:57:11.650 --> 00:57:16.250
So this is now N
over 5 to the alpha,

00:57:16.250 --> 00:57:19.950
and this is 7/10 N to the alpha.

00:57:19.950 --> 00:57:24.640
If it's going to work, this
recurrence should be satisfied.

00:57:24.640 --> 00:57:26.830
And now, if you look
at this equation,

00:57:26.830 --> 00:57:30.080
there's a lot of N to the
alphas, and they all cancel.

00:57:30.080 --> 00:57:37.305
So I get 1 equals 1/5 to the
alpha plus 7/10 to the alpha.

00:57:37.305 --> 00:57:38.680
It's confusing
because I was just

00:57:38.680 --> 00:57:42.930
watching the TV show
Alphas, but no relation.

00:57:42.930 --> 00:57:45.370
So this is now something
purely in terms of alpha.

00:57:45.370 --> 00:57:47.580
You just need to check that
there is a real solution.

00:57:47.580 --> 00:57:48.440
There is one.

00:57:48.440 --> 00:57:51.630
You have to plug it into
Wolfram Alpha or something,

00:57:51.630 --> 00:57:53.000
no pun intended.

00:57:53.000 --> 00:57:55.967
Wow, they're just
coming out today.

00:57:55.967 --> 00:57:58.947
And then alpha
is... next page...

00:57:58.947 --> 00:58:01.947
I can't do this by hand.

00:58:01.947 --> 00:58:08.207
Something like .83978.
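
If you want to check that number, here is a quick numerical sanity check, mine and not part of the lecture: bisect for the alpha that solves 1 = (1/5)^alpha + (7/10)^alpha.

def f(alpha):
    # Positive when alpha is too small, negative when alpha is too large.
    return (1 / 5) ** alpha + (7 / 10) ** alpha - 1

lo, hi = 0.0, 1.0  # f(0) = 1 > 0 and f(1) = -0.1 < 0, so a root lies in between
for _ in range(60):
    mid = (lo + hi) / 2
    if f(mid) > 0:
        lo = mid
    else:
        hi = mid
print((lo + hi) / 2)  # prints roughly 0.8398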

00:58:08.207 --> 00:58:15.487
So we get L of N is, say, at
least N to the 0.8 or bigger.

00:58:15.487 --> 00:58:21.247
It's sublinear and that was
enough when we cared about time

00:58:21.247 --> 00:58:23.687
but now it's bad news
because N over B...

00:58:23.687 --> 00:58:29.087
our goal was to get N over B+1.

00:58:29.087 --> 00:58:33.507
If B is huge, if B is
bigger than N to the 0.2,

00:58:33.507 --> 00:58:35.843
then we are not
achieving this bound.

00:58:35.843 --> 00:58:36.343
Right.

00:58:36.343 --> 00:58:38.947
We are always paying
at least N to the 0.8.

00:58:38.947 --> 00:58:43.247
For example, if B is roughly
N, we are way off!

00:58:43.247 --> 00:58:45.247
But that's because we
used the wrong base case.

00:58:45.247 --> 00:58:49.360
Turns out if you use a better
base case, things just work.

00:58:49.360 --> 00:58:51.024
So let's do that.

00:58:51.024 --> 00:58:53.940
I think it's going to be smaller.

00:58:53.940 --> 00:58:55.912
So... the next base...

00:58:55.912 --> 00:58:56.616
I mean...

00:58:56.616 --> 00:58:58.532
When you are doing cache
oblivious analysis

00:58:58.532 --> 00:58:59.760
you never use this base case.

00:58:59.760 --> 00:59:03.100
The first one you should
think about is this one.

00:59:03.100 --> 00:59:04.812
If you have a
problem of size that

00:59:04.812 --> 00:59:06.320
fits in a constant
number of blocks.

00:59:06.320 --> 00:59:08.518
Well of course that's
going to take...

00:59:08.518 --> 00:59:10.684
once they are read
into the cache,

00:59:10.684 --> 00:59:12.100
you are not going
to pay anything.

00:59:12.100 --> 00:59:14.524
How long does it take to read
a constant number of blocks

00:59:14.524 --> 00:59:15.024
into cache?

00:59:15.024 --> 00:59:16.940
Constant number of
memory transfers.

00:59:16.940 --> 00:59:19.415
Okay, this is obviously a
strictly better base case

00:59:19.415 --> 00:59:20.240
than this one.

00:59:20.240 --> 00:59:23.090
Because we have the same
thing on the right hand

00:59:23.090 --> 00:59:26.240
side as a constant but we've
solved a larger problem.

00:59:26.240 --> 00:59:29.600
So clearly you should cut
here, instead of there.

00:59:29.600 --> 00:59:34.740
Then the number of leaves
in this recursion...

00:59:34.740 --> 00:59:38.998
So same recurrence-
different base case.

00:59:38.998 --> 00:59:41.164
So we'd stop recursing
conceptually in the analysis,

00:59:41.164 --> 00:59:42.980
the algorithm goes
all the way down,

00:59:42.980 --> 00:59:45.092
but in the analysis
we stop recursing when

00:59:45.092 --> 00:59:46.940
we reach a problem of size B.

00:59:46.940 --> 00:59:54.168
The number of leaves in that new
recursion tree will be N over B

00:59:54.168 --> 00:59:55.900
to the alpha.

00:59:55.900 --> 00:59:56.870
That's good!

00:59:56.870 --> 00:59:59.780
That's smaller than N over B.
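
Written out, my summary of the step just made: with the base case at size order B, the leaf count satisfies
\[
L(N) = \Theta\!\left(\left(\frac{N}{B}\right)^{\alpha}\right) = o\!\left(\frac{N}{B}\right)
\quad\text{since } \alpha \approx 0.8398 < 1,
\]
so the leaves no longer dominate the N over B scanning cost at the root.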

00:59:59.780 --> 01:00:02.380
OK, now I'm going to wave
my hands a little bit

01:00:02.380 --> 01:00:10.780
and say, MT of N is
order N over B plus 1.

01:00:10.780 --> 01:00:13.090
I guess to do that,
you want to prove it

01:00:13.090 --> 01:00:16.390
the same way we did before
when we solved this recurrence,

01:00:16.390 --> 01:00:17.770
which is by substitution.

01:00:17.770 --> 01:00:19.900
You assume this is
true, you plug it in,

01:00:19.900 --> 01:00:22.750
verify it can actually be
done with some constants.

01:00:22.750 --> 01:00:25.750
The intuition of what's going on
is, in general, this recurrence

01:00:25.750 --> 01:00:27.550
is dominated by the root.

01:00:27.550 --> 01:00:31.759
The root cost for this
recursion is N over B plus 1.

01:00:31.759 --> 01:00:32.800
So this is the root cost.

01:00:32.800 --> 01:00:34.630
I claim that, up to
constant factors,

01:00:34.630 --> 01:00:35.920
that is the overall cost.

01:00:35.920 --> 01:00:38.440
Roughly because, as you go
down the recursion tree,

01:00:38.440 --> 01:00:41.431
the cost is decreasing
geometrically.

01:00:41.431 --> 01:00:43.180
But that's not obvious
for this recurrence

01:00:43.180 --> 01:00:44.694
because it's so uneven.

01:00:44.694 --> 01:00:47.110
But it's kind of like the
master method, a little fancier.

01:00:47.110 --> 01:00:52.230
Intuitively, this
should be obvious.

01:00:52.230 --> 01:00:54.490
There's the root cost and
then there's the other ones.

01:00:54.490 --> 01:00:56.990
But to actually prove it, you
should do substitution method.

01:00:56.990 --> 01:01:02.320
I want to go to more
interesting algorithms instead,

01:01:02.320 --> 01:01:07.610
but any questions
before we continue?

01:01:07.610 --> 01:01:08.150
All right.

01:01:08.150 --> 01:01:11.390
So next algorithm,
that was median, now

01:01:11.390 --> 01:01:18.670
we're going to do matrix
multiplication via divide

01:01:18.670 --> 01:01:32.620
and conquer.

01:01:32.620 --> 01:01:34.060
So what we just
saw was an example

01:01:34.060 --> 01:01:37.464
where, in divide and
conquer, in the analysis

01:01:37.464 --> 01:01:39.130
we think about the
case where things fit

01:01:39.130 --> 01:01:40.810
in a constant number of blocks.

01:01:40.810 --> 01:01:42.280
That was sort of case one.

01:01:42.280 --> 01:01:44.372
The next example,
matrix multiplication,

01:01:44.372 --> 01:01:45.330
will be the other case.

01:01:45.330 --> 01:01:53.650
So you get to see both types.

01:01:53.650 --> 01:01:55.890
So multiplying
matrices, something

01:01:55.890 --> 01:01:57.270
we've done many times.

01:01:57.270 --> 01:02:01.020
For example, in the FFT
lecture and in the Strassen's

01:02:01.020 --> 01:02:03.764
algorithm, just to remind you.

01:02:03.764 --> 01:02:05.430
I'm just thinking
about the square case,

01:02:05.430 --> 01:02:07.860
although this generalizes.

01:02:07.860 --> 01:02:16.140
We have two square
matrices, N by N.

01:02:16.140 --> 01:02:18.420
Normally, I would say
C equals A times B,

01:02:18.420 --> 01:02:20.970
but I realized we
used B for block side.

01:02:20.970 --> 01:02:26.125
So this is going to
be s equals x times y.

01:02:26.125 --> 01:02:31.020
Hopefully that doesn't conflict
with anything else, but no B's.

01:02:31.020 --> 01:02:33.510
All right, so standard matrix multiplication.

01:02:33.510 --> 01:02:42.090
Let's start with the
standard algorithm.

01:02:42.090 --> 01:02:43.920
Let's start by analyzing that.

01:02:43.920 --> 01:02:46.770
Because if you're
reasonably clever,

01:02:46.770 --> 01:02:50.740
this, the standard
algorithm, is not so bad.

01:02:50.740 --> 01:02:54.010
So in general, this
won't matter too much.

01:02:54.010 --> 01:02:57.150
Let's suppose we're
doing z row by row,

01:02:57.150 --> 01:03:03.150
and let's say we're currently
computing this product cell.

01:03:03.150 --> 01:03:06.780
So that product cell
is the dot product--

01:03:06.780 --> 01:03:15.410
this ZIJ here is the dot product
of this row with this column.

01:03:15.410 --> 01:03:17.370
How do I compute dot products?

01:03:17.370 --> 01:03:18.190
Two parallel scans.

01:03:18.190 --> 01:03:18.690
Right?

01:03:18.690 --> 01:03:20.520
I scan through this
row and I parallel

01:03:20.520 --> 01:03:22.020
scan through this column.

01:03:22.020 --> 01:03:26.160
Now, it depends the order
in which you store x and y,

01:03:26.160 --> 01:03:30.000
but let's suppose we can
store x in row major order,

01:03:30.000 --> 01:03:33.240
meaning row by row, and we
store y in column major order,

01:03:33.240 --> 01:03:34.530
meaning column-by-column.

01:03:34.530 --> 01:03:36.178
Then this will be an
honest to goodness

01:03:36.178 --> 01:03:37.560
scan of a contiguous array.

01:03:37.560 --> 01:03:41.550
Again, the order we store
things in memory really matters.

01:03:41.550 --> 01:03:42.990
So let's make our life ideal.

01:03:42.990 --> 01:03:48.120
Let's say that
this is row by row

01:03:48.120 --> 01:03:52.740
and this one is column
by column, then hey,

01:03:52.740 --> 01:03:55.500
this is two parallel
scans so order N over B

01:03:55.500 --> 01:03:58.300
to compute this cell.

01:03:58.300 --> 01:04:07.500
OK, I claim that
computing ZIJ costs

01:04:07.500 --> 01:04:12.360
N over B, so maybe plus 1.

01:04:12.360 --> 01:04:14.187
Again, these are
N by N matrices,

01:04:14.187 --> 01:04:24.480
so total size N squared, which
means the total cost is what?

01:04:24.480 --> 01:04:30.480
N cubed over B plus
N squared, I guess.

01:04:30.480 --> 01:04:31.350
Seems pretty good.

01:04:31.350 --> 01:04:34.200
I mean, we had a running
time of N cubed before

01:04:34.200 --> 01:04:37.470
and we divided by B. How
could you possibly do better?

01:04:37.470 --> 01:04:40.140
Well, by being smarter.

01:04:40.140 --> 01:04:46.500
This is not optimal,
you can do better.

01:04:46.500 --> 01:04:50.702
It's not obvious,
but let me just spend

01:04:50.702 --> 01:04:53.160
a little more time convincing
you this is the right answer.

01:04:53.160 --> 01:04:56.607
Not only is this big O, but
for appropriate settings--

01:04:56.607 --> 01:05:01.110
in the worst case this
is going to be theta.

01:05:01.110 --> 01:05:03.800
Because if you think of the
order in which we're-- see,

01:05:03.800 --> 01:05:06.680
we look at these
rows several times.

01:05:06.680 --> 01:05:09.330
And if you look at, when I
compute this cell and this cell

01:05:09.330 --> 01:05:12.420
and this cell of the z
matrix, or the product matrix,

01:05:12.420 --> 01:05:15.750
each of them uses
the same row of x.

01:05:15.750 --> 01:05:18.300
So maybe you could reuse that.

01:05:18.300 --> 01:05:21.300
You could reuse that row of x.

01:05:21.300 --> 01:05:23.550
That might actually
be free, depending

01:05:23.550 --> 01:05:24.910
on how B and N relate.

01:05:24.910 --> 01:05:30.930
But the columns of y, those
are different every time.

01:05:30.930 --> 01:05:32.920
When I compute this one,
I use the first column

01:05:32.920 --> 01:05:35.760
of y, when I compute this one
I use the second column of y.

01:05:35.760 --> 01:05:38.010
Unless the cache
is so big that it

01:05:38.010 --> 01:05:40.380
can store all of
y, which is like,

01:05:40.380 --> 01:05:42.779
you could store the
entire problem in cache

01:05:42.779 --> 01:05:44.460
that's unrealistic.

01:05:44.460 --> 01:05:48.900
So unless M is bigger
than N squared,

01:05:48.900 --> 01:05:52.290
in this algorithm at least, you
have to read a new column of y

01:05:52.290 --> 01:05:53.970
every single time.

01:05:53.970 --> 01:05:55.960
So that's why it's
theta N over B plus 1.

01:05:55.960 --> 01:06:00.430
You need to spend
N over B, assuming

01:06:00.430 --> 01:06:06.160
M is less than N squared.
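
A small sketch, mine, of the standard algorithm under the layout assumed above: x stored row by row, y stored column by column, so each output entry is the dot product of two contiguous length-N arrays, i.e., two parallel scans costing about N over B plus 1 blocks.

def standard_mult(x_rows, y_cols):
    # x_rows[i] is row i of x; y_cols[j] is column j of y.
    n = len(x_rows)
    z = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # z[i][j] = dot product of row i of x with column j of y.
            z[i][j] = sum(a * b for a, b in zip(x_rows[i], y_cols[j]))
    return z

x_rows = [[1, 2], [3, 4]]   # x in row-major order
y_cols = [[5, 7], [6, 8]]   # the columns of y = [[5, 6], [7, 8]]
assert standard_mult(x_rows, y_cols) == [[19, 22], [43, 50]]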

01:06:06.160 --> 01:06:06.660
OK.

01:06:06.660 --> 01:06:09.120
And I claim this is not the
best you can do because we're

01:06:09.120 --> 01:06:10.380
going to do better.

01:06:10.380 --> 01:06:28.490
And we're going to do better
by divide and conquer.

01:06:28.490 --> 01:06:30.540
Now, you've already seen
divide and conquer used

01:06:30.540 --> 01:06:36.930
for matrix multiplication
to get Strassen's algorithm,

01:06:36.930 --> 01:06:44.100
and the idea there
is to use blocks.

01:06:44.100 --> 01:06:46.740
So this is sort of an
algorithm you've already seen.

01:06:46.740 --> 01:06:55.080
I'm going to divide the
matrix z into N over 2

01:06:55.080 --> 01:06:57.900
by N over 2 sub-matrices.

01:06:57.900 --> 01:07:02.190
Each of these ZIJs is an N
over 2 by N over 2 matrix.

01:07:02.190 --> 01:07:15.400
And I do the same
thing for x and y.

01:07:15.400 --> 01:07:16.770
Numbers are right.

01:07:16.770 --> 01:07:19.190
x1,2, y2,1, and so on.

01:07:19.190 --> 01:07:22.190
And you can write
this out explicitly.

01:07:22.190 --> 01:07:25.345
I prefer not to do all of
it, but let's do one of them.

01:07:25.345 --> 01:07:27.470
You can just think of these
as two by two matrices,

01:07:27.470 --> 01:07:29.570
because matrix
multiplication is associative

01:07:29.570 --> 01:07:30.740
and good things happen.

01:07:30.740 --> 01:07:32.450
I can just take
these two elements--

01:07:32.450 --> 01:07:34.640
but they're actually
matrices, sorry.

01:07:34.640 --> 01:07:38.192
I might take these two and
dot product with these two.

01:07:38.192 --> 01:07:46.790
And I get x1,1 y1,1
plus x1,2 y2,1,

01:07:46.790 --> 01:07:50.220
and that's what I
should set z1,1 to.

01:07:50.220 --> 01:07:54.020
So this is a formula, but it's
also a recursive algorithm.

01:07:54.020 --> 01:07:57.470
It says, if I want to
compute z I'm going to say,

01:07:57.470 --> 01:07:59.510
well, there are
four sub problems.

01:07:59.510 --> 01:08:01.399
The first one is
to compute z1,1,

01:08:01.399 --> 01:08:03.440
and I'm going to do that
by recursively computing

01:08:03.440 --> 01:08:06.500
the product of x1,1 and
y1,1, recursively computing

01:08:06.500 --> 01:08:10.034
the product of x1,2 y2,1 and
then adding them together.

01:08:10.034 --> 01:08:10.950
This is not recursive.

01:08:10.950 --> 01:08:13.080
Addition is easy.

01:08:13.080 --> 01:08:13.580
OK.

01:08:13.580 --> 01:08:15.800
And there's two products
here, two products here,

01:08:15.800 --> 01:08:17.341
two products here,
two products here,

01:08:17.341 --> 01:08:18.830
a total of eight
products, so we're

01:08:18.830 --> 01:08:25.055
going to have eight recursive
calls in size N over 2.
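
Here is a sketch of that recursion in Python. It is my code, not the lecture's: it works on ordinary lists of lists just to show the structure of the eight recursive calls plus the additions, and it assumes N is a power of 2. The cache analysis additionally needs the recursive layout described next.

def mat_mult(x, y):
    n = len(x)
    if n == 1:
        return [[x[0][0] * y[0][0]]]
    h = n // 2

    def split(m):
        # Return the four N/2 by N/2 quadrants of m.
        return ([row[:h] for row in m[:h]], [row[h:] for row in m[:h]],
                [row[:h] for row in m[h:]], [row[h:] for row in m[h:]])

    def add(a, b):
        return [[a[i][j] + b[i][j] for j in range(h)] for i in range(h)]

    x11, x12, x21, x22 = split(x)
    y11, y12, y21, y22 = split(y)
    # z11 = x11 y11 + x12 y21, and so on: eight recursive multiplications.
    z11 = add(mat_mult(x11, y11), mat_mult(x12, y21))
    z12 = add(mat_mult(x11, y12), mat_mult(x12, y22))
    z21 = add(mat_mult(x21, y11), mat_mult(x22, y21))
    z22 = add(mat_mult(x21, y12), mat_mult(x22, y22))
    # Concatenate the four quadrants back into one matrix.
    return ([z11[i] + z12[i] for i in range(h)] +
            [z21[i] + z22[i] for i in range(h)])

assert mat_mult([[1, 2], [3, 4]], [[5, 6], [7, 8]]) == [[19, 22], [43, 50]]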

01:08:25.055 --> 01:08:26.930
If we look at the number
of memory transfers,

01:08:26.930 --> 01:08:31.689
this is 8 times recursive
call on N over 2 by N

01:08:31.689 --> 01:08:37.550
over 2 sub matrices plus
the cost of addition.

01:08:37.550 --> 01:08:41.300
And I claim the cost of addition
is at most N squared over B

01:08:41.300 --> 01:08:46.461
plus 1, because addition is
basically parallel scans.

01:08:46.461 --> 01:08:50.390
I can scan through
x, scan through y.

01:08:50.390 --> 01:08:52.609
As long as they're
stored in the same order,

01:08:52.609 --> 01:08:55.350
I just am adding them
element by element,

01:08:55.350 --> 01:08:59.960
and there's a third scan, which
is writing out the z vector

01:08:59.960 --> 01:09:02.500
once things are linearized.

01:09:02.500 --> 01:09:06.100
Now, for this to work, for
this to be a true recursion,

01:09:06.100 --> 01:09:12.279
I need that, say, x1,1 and y1,1
are stored as contiguous things

01:09:12.279 --> 01:09:13.690
in memory.

01:09:13.690 --> 01:09:19.809
So this means that the
layout of a matrix,

01:09:19.809 --> 01:09:24.430
let's consider the matrix z, is
going to be like the following.

01:09:24.430 --> 01:09:27.370
I'm going to recursively lay
out 1,1-- so when I say lay out,

01:09:27.370 --> 01:09:29.920
I mean what order do I store
the elements in memory?

01:09:29.920 --> 01:09:32.050
What order do I store
the cells in memory?

01:09:32.050 --> 01:09:36.370
And what I'm going to say
is, recursively lay out

01:09:36.370 --> 01:09:40.859
the pieces-- there's four
pieces-- recursively call

01:09:40.859 --> 01:09:46.282
layout of those and then
concatenate them together.

01:09:46.282 --> 01:09:46.990
That's my layout.

01:09:46.990 --> 01:09:48.430
So I'm going to store
all of these items,

01:09:48.430 --> 01:09:50.221
then I'm going to store
all of these items,

01:09:50.221 --> 01:09:52.510
and then all of these
items, then all these items.

01:09:52.510 --> 01:09:55.480
How do I store these
items, in what order?

01:09:55.480 --> 01:09:56.107
Recursively.

01:09:56.107 --> 01:09:57.690
So I'm going to
divide them like this,

01:09:57.690 --> 01:09:59.650
store these before these
before these before these,

01:09:59.650 --> 01:10:00.640
how do I store these?

01:10:00.640 --> 01:10:01.440
Recursively.

01:10:01.440 --> 01:10:02.455
OK, same recursion.

01:10:02.455 --> 01:10:04.510
So it's a really
weird order, it's

01:10:04.510 --> 01:10:06.820
a divide and conquer order.

01:10:06.820 --> 01:10:08.380
There's only four things here.

01:10:08.380 --> 01:10:10.510
In what order should I
combine the four things?

01:10:10.510 --> 01:10:11.716
Doesn't matter.

01:10:11.716 --> 01:10:13.590
All that matters is that
this is consecutive,

01:10:13.590 --> 01:10:15.730
this is consecutive,
and this is consecutive,

01:10:15.730 --> 01:10:18.530
so that when I recurse, I'm
recursing on consecutive chunks

01:10:18.530 --> 01:10:19.030
of memory.

01:10:19.030 --> 01:10:21.070
Otherwise the analysis
just won't work.

01:10:21.070 --> 01:10:24.690
So for this to be right,
got to have this layout.
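
To make the layout concrete, here is a tiny sketch, mine, that lists the (row, column) cells of an N by N matrix in the recursive, quadrant-by-quadrant storage order just described, assuming N is a power of 2.

def recursive_layout(rows, cols):
    # Return the cells of the submatrix rows x cols in recursive storage order.
    if len(rows) == 1:
        return [(rows[0], cols[0])]
    h = len(rows) // 2
    order = []
    for r in (rows[:h], rows[h:]):      # top half, then bottom half
        for c in (cols[:h], cols[h:]):  # left half, then right half
            order += recursive_layout(r, c)
    return order

# For a 4 by 4 matrix, each 2 by 2 quadrant ends up contiguous in memory.
print(recursive_layout(list(range(4)), list(range(4))))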

01:10:24.690 --> 01:10:26.500
OK.

01:10:26.500 --> 01:10:32.110
Now we just need to solve the
recurrence, and we're done.

01:10:32.110 --> 01:10:34.960
I already told you, the
base case we're going to use

01:10:34.960 --> 01:10:35.860
is this one.

01:10:35.860 --> 01:10:37.526
We're going to use
this one because it's

01:10:37.526 --> 01:10:40.360
stronger and better, and
we'll need it, in this case,

01:10:40.360 --> 01:10:43.442
to get a better analysis.

01:10:43.442 --> 01:10:45.400
You could solve it using
the weaker base cases,

01:10:45.400 --> 01:10:47.330
you'll get larger numbers.

01:10:47.330 --> 01:10:51.309
But if you use the strongest
base case, MT of-- it's

01:10:51.309 --> 01:10:54.180
not M. Got to be
a little careful.

01:10:54.180 --> 01:10:56.980
Because N here is actually
just one side length.

01:10:56.980 --> 01:11:00.445
This is an N by N
matrix, so the total size

01:11:00.445 --> 01:11:04.720
is N squared-- actually the
total size is 3N squared,

01:11:04.720 --> 01:11:09.460
so this is going to be the
square root of M over 3,

01:11:09.460 --> 01:11:12.880
or some constant times the
square root of M. It actually

01:11:12.880 --> 01:11:14.510
doesn't matter what
the constant is.

01:11:14.510 --> 01:11:16.240
But this is the
size of-- this is

01:11:16.240 --> 01:11:18.400
the value of N for which
all three matrices will

01:11:18.400 --> 01:11:20.890
fit in cache.

01:11:20.890 --> 01:11:27.309
So I claim we know this costs at
most M over B memory transfers,

01:11:27.309 --> 01:11:33.550
because we got kind of
lucky here, because we know

01:11:33.550 --> 01:11:35.110
that all of these
guys fit in cache

01:11:35.110 --> 01:11:37.540
and because we know that they
are stored consecutively

01:11:37.540 --> 01:11:40.840
in memory, well three
consecutive chunks.

01:11:40.840 --> 01:11:45.940
Once, no matter what I do, there
are only M over B blocks there,

01:11:45.940 --> 01:11:47.660
and so at worst I
read them all in.

01:11:47.660 --> 01:11:50.470
But once the cache
is filled with them,

01:11:50.470 --> 01:11:52.540
for the duration
of this recursion,

01:11:52.540 --> 01:11:54.100
I won't be reading
any other blocks,

01:11:54.100 --> 01:11:56.990
and so the cache will just
stay full with the problem.

01:11:56.990 --> 01:11:59.440
And so I never pay
more than this.

01:11:59.440 --> 01:12:00.830
So that's the base case.

01:12:00.830 --> 01:12:05.140
Easy, but you have to think
about it for a second.

01:12:05.140 --> 01:12:05.650
Cool.

01:12:05.650 --> 01:12:08.640
Now we have a recurrence
and a base case,

01:12:08.640 --> 01:12:11.210
and now we have a good old
fashioned recursion tree.

01:12:11.210 --> 01:12:13.809
This one I can actually
draw, because it's-- well,

01:12:13.809 --> 01:12:17.120
partly because it's
nice and uniform.

01:12:17.120 --> 01:12:19.870
It just explodes rather fast.

01:12:19.870 --> 01:12:25.960
So at the top we have a cost
of N squared over B plus 1,

01:12:25.960 --> 01:12:28.750
and we have eight
recursive calls.

01:12:28.750 --> 01:12:31.960
And the recursive calls
are to something in size

01:12:31.960 --> 01:12:39.970
N over 2 squared over B, also
known as N squared over 4B.

01:12:39.970 --> 01:12:42.170
OK, so if I add up
everything on this level,

01:12:42.170 --> 01:12:46.180
I get N squared over B, and if I
add up everything on this level

01:12:46.180 --> 01:12:53.200
I'm going to get 8 times
N squared over 4B-- is that right?

01:12:53.200 --> 01:12:53.740
Yeah.

01:12:53.740 --> 01:12:58.120
So 2 times N squared over B.

01:12:58.120 --> 01:12:58.870
OK.

01:12:58.870 --> 01:13:02.170
I did that in order to verify
that the cost per level

01:13:02.170 --> 01:13:06.430
is increasing geometrically,
so all that will matter

01:13:06.430 --> 01:13:08.770
is the leaf level.

01:13:08.770 --> 01:13:11.520
This is the proof of
the master theorem.

01:13:11.520 --> 01:13:13.300
When things are
doubling at every step--

01:13:13.300 --> 01:13:15.059
and this was just
a special case,

01:13:15.059 --> 01:13:18.529
but every level would look
the same-- every level

01:13:18.529 --> 01:13:20.070
of recursion, if
you add them all up,

01:13:20.070 --> 01:13:22.653
you're getting twice as much as
you had at the previous level.

01:13:22.653 --> 01:13:26.150
So all that will matter
is the leaf level.

01:13:26.150 --> 01:13:29.430
OK, the leaf level.

01:13:29.430 --> 01:13:32.610
Actually, maybe I'll
do it over here.

01:13:32.610 --> 01:13:34.800
First question is how
many leaves are there?

01:13:34.800 --> 01:13:37.900
The leaves are this thing.

01:13:37.900 --> 01:13:41.300
So the way I would think about
this is, because everything

01:13:41.300 --> 01:13:44.499
is nice and uniform, is 8 to the
power of the number of levels.

01:13:44.499 --> 01:13:50.790
What's the number of levels?

01:13:50.790 --> 01:13:53.320
Well, we're dividing
by 2 each time,

01:13:53.320 --> 01:13:56.559
so it's going to be
log of something,

01:13:56.559 --> 01:14:00.130
but it's no longer log N
because we're stopping early.

01:14:00.130 --> 01:14:03.450
We're stopping when
N reaches this value.

01:14:03.450 --> 01:14:10.856
So it turns out that is
log of N divided by that value.

01:14:10.856 --> 01:14:12.230
This is, how many
times do I have

01:14:12.230 --> 01:14:14.775
to multiply by 2 before
I get to this, which

01:14:14.775 --> 01:14:17.150
is the same thing as how many
times do I have to divide N

01:14:17.150 --> 01:14:19.636
by 2 before I get that?

01:14:19.636 --> 01:14:20.780
Think about it.

01:14:20.780 --> 01:14:22.320
OK, but 8 to the log.

01:14:22.320 --> 01:14:24.980
This is 2 to the 3 times log.

01:14:24.980 --> 01:14:27.690
2 to the log is just the thing.

01:14:27.690 --> 01:14:36.830
So this is N over root M
over B-- so many overs--

01:14:36.830 --> 01:14:39.017
to the third power.

01:14:39.017 --> 01:14:40.600
OK, this is starting
to look familiar.

01:14:40.600 --> 01:14:46.080
This is N cubed, that
should appear somewhere,

01:14:46.080 --> 01:14:48.480
divided by square
root of M over B.

01:14:48.480 --> 01:14:50.090
This is the number of leaves.

01:14:50.090 --> 01:14:55.880
Now, for each leaf
we're paying this cost,

01:14:55.880 --> 01:15:03.020
so the overall cost of MT of N
is going to be this times this.

01:15:03.020 --> 01:15:11.930
So let's do that and simplify.

01:15:11.930 --> 01:15:18.062
So MT of N is going to be big
O, because we're taking the leaf

01:15:18.062 --> 01:15:20.020
level but there's some
other things that's just

01:15:20.020 --> 01:15:23.530
going to lose us a factor of 2.

01:15:23.530 --> 01:15:26.175
We have this thing
multiplied by this thing.

01:15:26.175 --> 01:15:34.220
So we've got N cubed over
square root of M over B

01:15:34.220 --> 01:15:40.798
times M over B.

01:15:40.798 --> 01:15:43.360
AUDIENCE: You-- PROFESSOR:
I made a mistake.

01:15:43.360 --> 01:15:44.720
Yea, thank you.

01:15:44.720 --> 01:15:45.970
This was supposed to be cubed.

01:15:45.970 --> 01:15:49.600
So this was M over B to
the 1/2, so now we have,

01:15:49.600 --> 01:15:52.720
down here, M over B to the 3/2.

01:15:52.720 --> 01:15:59.850
Thank you, thought
that looked weird.

01:15:59.850 --> 01:16:01.890
All right.

01:16:01.890 --> 01:16:06.531
M over B to the 3/2.

01:16:06.531 --> 01:16:07.031
OK.

01:16:07.031 --> 01:16:18.430
AUDIENCE: [INAUDIBLE]
PROFESSOR: Yeah.

01:16:18.430 --> 01:16:20.180
What was I doing here?

01:16:20.180 --> 01:16:21.630
This is supposed to be M over 3.

01:16:21.630 --> 01:16:23.605
I was not missing a
stroke, thank you.

01:16:23.605 --> 01:16:27.824
M over 3, this is
supposed to be M over 3.

01:16:27.824 --> 01:16:29.240
Wow.

01:16:29.240 --> 01:16:32.260
OK, so this is M over 3.

01:16:32.260 --> 01:16:34.780
I'm just going to drop the--
well, I'll put it here.

01:16:34.780 --> 01:16:37.340
But then I'm just
going to write theta

01:16:37.340 --> 01:16:39.670
so I can forget about the
3, because that's just

01:16:39.670 --> 01:16:41.360
a square root of 3 factor.

01:16:41.360 --> 01:16:47.405
So now this is going
to be M to the 3/2.

01:16:47.405 --> 01:16:50.546
That makes me much happier.

01:16:50.546 --> 01:16:53.410
Did I get it right this time?

01:16:53.410 --> 01:16:54.430
Let's double-check.

01:16:54.430 --> 01:16:57.910
So this is square root
of M to the 3 power,

01:16:57.910 --> 01:17:00.840
so that's M to the 1/2
cubed, which is M to the 3/2.

01:17:00.840 --> 01:17:04.850
I think that's good, this base
case was square root of M.

01:17:04.850 --> 01:17:07.929
OK, get it right.

01:17:07.929 --> 01:17:10.282
So now this is M to the 3/2.

01:17:10.282 --> 01:17:11.740
There is a square
root that's going

01:17:11.740 --> 01:17:16.270
to come back, there's M to the
3/2 and there's an M upstairs,

01:17:16.270 --> 01:17:18.270
so the one cancels.

01:17:18.270 --> 01:17:23.439
We're going to be left with
N cubed over square root of M

01:17:23.439 --> 01:17:26.112
times B. OK.

01:17:26.112 --> 01:17:28.570
There was a lower order term
because I dropped this plus 1,

01:17:28.570 --> 01:17:31.180
but let's not worry
about that right now.
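
Putting the leaf count and the per-leaf cost together in one line, my restatement of the bound just derived, for the case 3 N squared bigger than M (otherwise everything fits in cache):
\[
MT(N) = \Theta\!\left(\left(\frac{N}{\sqrt{M}}\right)^{3} \cdot \frac{M}{B}\right)
      = \Theta\!\left(\frac{N^{3}}{B\sqrt{M}}\right).
\]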

01:17:31.180 --> 01:17:33.730
Here we had N cubed
divided by B, that

01:17:33.730 --> 01:17:35.200
was the standard algorithm.

01:17:35.200 --> 01:17:39.160
Now we've got N cubed divided by
B divided by square root of M.

01:17:39.160 --> 01:17:40.635
That's big.

01:17:40.635 --> 01:17:42.010
I mean, this is
basically, you're

01:17:42.010 --> 01:17:45.450
dividing by-- well, square
root of your cache size.

01:17:45.450 --> 01:17:46.030
Wow.

01:17:46.030 --> 01:17:49.230
So who knows how big
that is, but say,

01:17:49.230 --> 01:17:52.430
between memory and disk,
we're talking gigabytes.

01:17:52.430 --> 01:17:54.330
So this is like billions.

01:17:54.330 --> 01:17:57.450
Square root of a billion
is still pretty big,

01:17:57.450 --> 01:18:01.324
like 10,000 to 100,000, so this
is a huge amount faster

01:18:01.324 --> 01:18:02.490
than the standard algorithm.

01:18:02.490 --> 01:18:04.920
You can do way
better than scans.

01:18:04.920 --> 01:18:07.650
Basically because we're reusing
the same rows and columns

01:18:07.650 --> 01:18:08.726
over and over.

01:18:08.726 --> 01:18:10.559
Now, this is standard
matrix multiplication.

01:18:10.559 --> 01:18:12.610
You might ask, what about
Strassen's algorithm?

01:18:12.610 --> 01:18:13.770
Well, same thing works.

01:18:13.770 --> 01:18:15.960
You can do the same analysis
for Strassen, of course.

01:18:15.960 --> 01:18:19.020
You get a similar
improvement over Strassen.

01:18:19.020 --> 01:18:21.960
You can do this for non
square matrices and all

01:18:21.960 --> 01:18:23.580
those good things.

01:18:23.580 --> 01:18:25.170
And one minute left.

01:18:25.170 --> 01:18:27.100
And it's going to
be enough, I think,

01:18:27.100 --> 01:18:31.330
to cover LRU block replacement.

01:18:31.330 --> 01:18:39.604
So here's what I want to say
about LRU block replacement.

01:18:39.604 --> 01:18:41.520
So in the beginning, we
said the model is LRU,

01:18:41.520 --> 01:18:44.670
or it could have been FIFO.

01:18:44.670 --> 01:18:45.870
Remember that?

01:18:45.870 --> 01:18:48.210
And this algorithm will
work just fine from an LRU

01:18:48.210 --> 01:18:49.620
perspective or a
FIFO perspective

01:18:49.620 --> 01:18:51.570
if you think about
it, but how do

01:18:51.570 --> 01:18:53.700
we know that LRU is
as good as anything?

01:18:53.700 --> 01:18:58.790
I claim, if you look at some
sequence of block accesses--

01:18:58.790 --> 01:19:01.890
so suppose you know what
B is-- and you count,

01:19:01.890 --> 01:19:07.020
for a cache of size M, how many
memory transfers does LRU do,

01:19:07.020 --> 01:19:09.930
it's going to be within a
factor of 2 of the optimal.

01:19:09.930 --> 01:19:12.780
But not the optimal
for a cache of size M,

01:19:12.780 --> 01:19:15.732
the optimal for a
cache of size M over 2.

01:19:15.732 --> 01:19:17.190
This is a bit of
a weird statement.

01:19:17.190 --> 01:19:19.950
I have a factor of 2 here
and a factor of 2 here.

01:19:19.950 --> 01:19:27.450
This is a cool idea called
resource augmentation,

01:19:27.450 --> 01:19:30.380
fancy word for a simple idea.

01:19:30.380 --> 01:19:31.200
This we're used to.

01:19:31.200 --> 01:19:33.370
This is approximation
algorithms.

01:19:33.370 --> 01:19:36.090
OK, but this is an
approximation in cost.

01:19:36.090 --> 01:19:38.040
Here we're approximating
the resources

01:19:38.040 --> 01:19:39.240
available to the algorithm.

01:19:39.240 --> 01:19:42.900
We're changing the machine
model, dividing M by 2,

01:19:42.900 --> 01:19:46.540
and we get a nice result.
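
To make the quantity in the claim concrete, here is a toy counter, my sketch and not from the lecture, for how many memory transfers LRU does on a given sequence of block accesses when the cache holds a given number of blocks. The theorem compares this count against the optimal offline policy running with half as many blocks.

from collections import OrderedDict

def lru_misses(accesses, capacity):
    cache = OrderedDict()  # keys are the resident blocks, oldest first
    misses = 0
    for block in accesses:
        if block in cache:
            cache.move_to_end(block)       # hit: just refresh recency
        else:
            misses += 1                    # miss: one memory transfer
            if len(cache) == capacity:
                cache.popitem(last=False)  # evict the least recently used
            cache[block] = True
    return misses

trace = [0, 1, 2, 0, 1, 3, 0, 1, 2, 3]
print(lru_misses(trace, 3), lru_misses(trace, 4))  # prints 6 4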

01:19:46.540 --> 01:19:47.670
Why is this OK?

01:19:47.670 --> 01:19:49.559
Because, if you look
at a bound like this,

01:19:49.559 --> 01:19:51.300
if you change M
by a factor of 2,

01:19:51.300 --> 01:19:53.130
it will not change the
bound by more than a

01:19:53.130 --> 01:19:54.880
factor of square root of 2.

01:19:54.880 --> 01:19:56.910
So as long as you
have at most, say,

01:19:56.910 --> 01:19:59.670
a linear or polynomial
dependence on M,

01:19:59.670 --> 01:20:01.530
changing M by a
constant factor will not

01:20:01.530 --> 01:20:03.110
change the overall
running time of the cache

01:20:03.110 --> 01:20:03.690
oblivious algorithm.

01:20:03.690 --> 01:20:05.430
This is why we can
assume it's LRU.

01:20:05.430 --> 01:20:07.680
The same is true for
FIFO, it's probably

01:20:07.680 --> 01:20:11.660
true in expectation
for random sequences.

01:20:11.660 --> 01:20:16.170
And I will leave it at that.

01:20:16.170 --> 01:20:17.670
If you want to see
the-- do you want

01:20:17.670 --> 01:20:22.080
to see the proof
of this theorem?

01:20:22.080 --> 01:20:22.680
Tomorrow?

01:20:22.680 --> 01:20:24.120
Or, Thursday?

01:20:24.120 --> 01:20:24.620
Yes.

01:20:24.620 --> 01:20:27.370
OK, we'll cover it on Thursday.