WEBVTT

00:00:01.550 --> 00:00:03.920
The following content is
provided under a Creative

00:00:03.920 --> 00:00:05.310
Commons license.

00:00:05.310 --> 00:00:07.520
Your support will help
MIT OpenCourseWare

00:00:07.520 --> 00:00:11.610
continue to offer high quality
educational resources for free.

00:00:11.610 --> 00:00:14.180
To make a donation or to
view additional materials

00:00:14.180 --> 00:00:18.140
from hundreds of MIT courses,
visit MIT OpenCourseWare

00:00:18.140 --> 00:00:19.026
at ocw.mit.edu.

00:00:21.632 --> 00:00:22.590
JULIAN SHUN: All right.

00:00:22.590 --> 00:00:25.920
So we've talked a little
bit about caching before,

00:00:25.920 --> 00:00:30.330
but today we're going to talk in
much more detail about caching

00:00:30.330 --> 00:00:34.680
and how to design
cache-efficient algorithms.

00:00:34.680 --> 00:00:38.070
So first, let's look
at the caching hardware

00:00:38.070 --> 00:00:41.830
on modern machines today.

00:00:41.830 --> 00:00:43.710
So here's what the
cache hierarchy looks

00:00:43.710 --> 00:00:46.140
like for a multicore chip.

00:00:46.140 --> 00:00:49.310
We have a whole
bunch of processors.

00:00:49.310 --> 00:00:53.040
They all have their
own private L1 caches

00:00:53.040 --> 00:00:56.220
for both the data, as
well as the instruction.

00:00:56.220 --> 00:00:58.050
They also have a
private L2 cache.

00:00:58.050 --> 00:01:01.480
And then they share a last
level cache, or L3 cache,

00:01:01.480 --> 00:01:05.129
which is also called LLC.

00:01:05.129 --> 00:01:07.080
They're all connected
to a memory controller

00:01:07.080 --> 00:01:09.480
that can access DRAM.

00:01:09.480 --> 00:01:12.810
And then, oftentimes,
you'll have multiple chips

00:01:12.810 --> 00:01:16.710
on the same server,
and these chips

00:01:16.710 --> 00:01:18.880
would be connected
through a network.

00:01:18.880 --> 00:01:20.910
So here we have a bunch
of multicore chips

00:01:20.910 --> 00:01:24.130
that are connected together.

00:01:24.130 --> 00:01:27.300
So we can see that there are
different levels of memory

00:01:27.300 --> 00:01:30.160
here.

00:01:30.160 --> 00:01:32.520
And the sizes of each one
of these levels of memory

00:01:32.520 --> 00:01:33.750
are different.

00:01:33.750 --> 00:01:36.690
So the sizes tend to
go up as you move up

00:01:36.690 --> 00:01:39.480
the memory hierarchy.

00:01:39.480 --> 00:01:44.970
The L1 caches tend to
be about 32 kilobytes.

00:01:44.970 --> 00:01:47.327
In fact, these are the
specifications for the machines

00:01:47.327 --> 00:01:48.660
that you're using in this class.

00:01:48.660 --> 00:01:51.660
So 32 kilobytes for
both the L1 data cache

00:01:51.660 --> 00:01:54.540
and the L1 instruction cache.

00:01:54.540 --> 00:01:57.580
256 kilobytes for the L2 cache.

00:01:57.580 --> 00:02:01.200
So the L2 cache tends
be about 8 to 10 times

00:02:01.200 --> 00:02:03.570
larger than the L1 cache.

00:02:03.570 --> 00:02:06.790
And then the last level cache,
the size is 30 megabytes.

00:02:06.790 --> 00:02:10.610
So this is typically on the
order of tens of megabytes.

00:02:10.610 --> 00:02:14.250
And then DRAM is on
the order of gigabytes.

00:02:14.250 --> 00:02:18.320
So here we have
128 gigabyte DRAM.

00:02:18.320 --> 00:02:21.480
And nowadays, you can
actually get machines

00:02:21.480 --> 00:02:25.440
that have terabytes of DRAM.

00:02:25.440 --> 00:02:29.880
So the associativity tends
to go up as you move up

00:02:29.880 --> 00:02:30.780
the cache hierarchy.

00:02:30.780 --> 00:02:32.970
And I'll talk more
about associativity

00:02:32.970 --> 00:02:34.980
on the next couple of slides.

00:02:34.980 --> 00:02:37.800
The time to access the
memory also tends to go up.

00:02:37.800 --> 00:02:39.870
So the latency tends
to go up as you move up

00:02:39.870 --> 00:02:41.020
the memory hierarchy.

00:02:41.020 --> 00:02:44.490
So the L1 caches are
the quickest to access,

00:02:44.490 --> 00:02:48.270
about two nanoseconds,
just rough numbers.

00:02:48.270 --> 00:02:50.380
The L2 cache is a
little bit slower--

00:02:50.380 --> 00:02:52.810
so say four nanoseconds.

00:02:52.810 --> 00:02:55.410
Last level cache,
maybe six nanoseconds.

00:02:55.410 --> 00:02:57.240
And then when you
have to go to DRAM,

00:02:57.240 --> 00:03:00.930
it's about an order of magnitude
slower-- so 50 nanoseconds

00:03:00.930 --> 00:03:03.280
in this example.

00:03:03.280 --> 00:03:09.420
And the reason why the memories
further down in the cache

00:03:09.420 --> 00:03:11.070
hierarchy are faster
is that they're

00:03:11.070 --> 00:03:14.650
using more expensive materials
to manufacture these things.

00:03:14.650 --> 00:03:18.120
But since they tend to be more
expensive, we can't fit as much

00:03:18.120 --> 00:03:19.720
of that on the machines.

00:03:19.720 --> 00:03:22.620
So that's why the faster
memories are smaller

00:03:22.620 --> 00:03:24.690
than the slower memories.

00:03:24.690 --> 00:03:26.880
But if we're able to take
advantage of locality

00:03:26.880 --> 00:03:31.167
in our programs, then we can
make use of the fast memory

00:03:31.167 --> 00:03:32.000
as much as possible.

00:03:32.000 --> 00:03:36.730
And we'll talk about ways to
do that in this lecture today.

00:03:36.730 --> 00:03:39.000
There's also the latency
across the network, which

00:03:39.000 --> 00:03:42.660
tends to be cheaper than
going to main memory

00:03:42.660 --> 00:03:47.475
but slower than doing a
last level cache access.

00:03:50.520 --> 00:03:52.410
And there's a lot
of work in trying

00:03:52.410 --> 00:03:55.770
to get the cache coherence
protocols right, as we

00:03:55.770 --> 00:03:56.860
mentioned before.

00:03:56.860 --> 00:03:59.730
So since these processors
all have private caches,

00:03:59.730 --> 00:04:01.200
we need to make
sure that they all

00:04:01.200 --> 00:04:03.510
see a consistent
view of memory when

00:04:03.510 --> 00:04:05.670
they're trying to
access the same memory

00:04:05.670 --> 00:04:08.290
addresses in parallel.

00:04:08.290 --> 00:04:11.340
So we talked about the
MSI cache protocol before.

00:04:11.340 --> 00:04:13.500
And there are many other
protocols out there,

00:04:13.500 --> 00:04:16.510
and you can read more
about these things online.

00:04:16.510 --> 00:04:18.730
But these are very
hard to get right,

00:04:18.730 --> 00:04:20.700
and there's a lot of
verification involved

00:04:20.700 --> 00:04:23.110
in trying to prove that the
cache coherence protocols are

00:04:23.110 --> 00:04:23.610
correct.

00:04:27.490 --> 00:04:29.050
So any questions so far?

00:04:33.600 --> 00:04:34.100
OK.

00:04:34.100 --> 00:04:38.210
So let's talk about the
associativity of a cache.

00:04:38.210 --> 00:04:41.690
So here I'm showing you a
fully associative cache.

00:04:41.690 --> 00:04:43.700
And in a fully
associative cache,

00:04:43.700 --> 00:04:47.060
a cache block can reside
anywhere in the cache.

00:04:47.060 --> 00:04:50.760
And the basic unit of movement
here is a cache block.

00:04:50.760 --> 00:04:53.750
In this example, the cache
block size is 4 bytes,

00:04:53.750 --> 00:04:57.050
but on the machines that
we're using for this class,

00:04:57.050 --> 00:05:00.110
the cache block
size is 64 bytes.

00:05:00.110 --> 00:05:04.470
But for this example, I'm going
to use a four byte cache line.

00:05:04.470 --> 00:05:07.160
So each row here corresponds
to one cache line.

00:05:07.160 --> 00:05:10.310
And a fully associative cache
means that each line here

00:05:10.310 --> 00:05:13.225
can go anywhere in the cache.

00:05:13.225 --> 00:05:14.600
And then here
we're also assuming

00:05:14.600 --> 00:05:17.420
a cache size of 32 bytes.

00:05:17.420 --> 00:05:19.450
So, in total, it can
store eight cache lines,

00:05:19.450 --> 00:05:21.245
since each cache line is 4 bytes.

00:05:24.970 --> 00:05:28.840
So to find a block in a
fully associative cache,

00:05:28.840 --> 00:05:30.970
you have to actually
search the entire cache,

00:05:30.970 --> 00:05:35.740
because a cache line can
appear anywhere in the cache.

00:05:35.740 --> 00:05:38.860
And there's a tag associated
with each of these cache lines

00:05:38.860 --> 00:05:42.610
here that basically
specifies which

00:05:42.610 --> 00:05:45.670
of the memory addresses
in virtual memory space

00:05:45.670 --> 00:05:47.740
it corresponds to.

00:05:47.740 --> 00:05:49.440
So for the fully
associative cache,

00:05:49.440 --> 00:05:51.940
we're actually going to use
most of the bits of that address

00:05:51.940 --> 00:05:53.160
as a tag.

00:05:53.160 --> 00:05:54.910
We don't actually need
the two lower order

00:05:54.910 --> 00:05:56.980
bits, because the
things are being

00:05:56.980 --> 00:06:00.010
moved at the granularity
of cache lines, which

00:06:00.010 --> 00:06:00.670
are four bytes.

00:06:00.670 --> 00:06:03.190
So the two lower order bits are
always going to be the same,

00:06:03.190 --> 00:06:05.560
but we're just going to
use the rest of the bits

00:06:05.560 --> 00:06:07.070
to store the tag.

00:06:07.070 --> 00:06:09.640
So if our address
space is 64 bits,

00:06:09.640 --> 00:06:12.550
then we're going to use 62 bits
to store the tag in a fully

00:06:12.550 --> 00:06:14.800
associative caching scheme.

00:06:14.800 --> 00:06:18.010
And when a cache
becomes full, a block

00:06:18.010 --> 00:06:22.000
has to be evicted to make
room for a new block.

00:06:22.000 --> 00:06:24.790
And there are various
ways that you can

00:06:24.790 --> 00:06:26.660
decide how to evict a block.

00:06:26.660 --> 00:06:29.260
So this is known as
the replacement policy.

00:06:29.260 --> 00:06:32.820
One common replacement policy
is LRU, or Least Recently Used.

00:06:32.820 --> 00:06:34.720
So you basically kick
the thing out that

00:06:34.720 --> 00:06:39.020
has been used the
farthest in the past.

00:06:39.020 --> 00:06:41.980
There are other schemes, such
as second chance and clock

00:06:41.980 --> 00:06:44.020
replacement, but we're
not going to talk

00:06:44.020 --> 00:06:47.080
too much about the different
replacement schemes today.

00:06:47.080 --> 00:06:50.780
But you can feel free to read
about these things online.

00:06:53.470 --> 00:06:55.450
So what's a disadvantage
of this scheme?

00:07:05.170 --> 00:07:05.670
Yes?

00:07:05.670 --> 00:07:07.270
AUDIENCE: It's slow.

00:07:07.270 --> 00:07:08.020
JULIAN SHUN: Yeah.

00:07:08.020 --> 00:07:08.830
Why is it slow?

00:07:08.830 --> 00:07:12.440
AUDIENCE: Because you have to
go all the way [INAUDIBLE]..

00:07:12.440 --> 00:07:13.190
JULIAN SHUN: Yeah.

00:07:13.190 --> 00:07:15.590
So the disadvantage
is that searching

00:07:15.590 --> 00:07:18.200
for a cache line in the cache
can be pretty slow, because you

00:07:18.200 --> 00:07:21.380
have to search entire
cache in the worst case,

00:07:21.380 --> 00:07:25.370
since a cache block can
reside anywhere in the cache.

00:07:25.370 --> 00:07:28.010
So even though the search can
go on in parallel in hardware,

00:07:28.010 --> 00:07:31.340
it's still expensive in terms
of power and performance

00:07:31.340 --> 00:07:35.030
to have to search most
of the cache every time.

00:07:35.030 --> 00:07:37.580
So let's look at
another extreme.

00:07:37.580 --> 00:07:40.010
This is a direct mapped cache.

00:07:40.010 --> 00:07:42.860
So in a direct mapped
cache, each cache block

00:07:42.860 --> 00:07:45.690
can only go in one
place in the cache.

00:07:45.690 --> 00:07:48.890
So I've color-coded
these cache blocks here.

00:07:48.890 --> 00:07:53.990
So the red blocks can only go
in the first row of this cache,

00:07:53.990 --> 00:07:57.030
the orange ones can only go
in the second row, and so on.

00:08:00.380 --> 00:08:06.110
And the position that a
cache block can go into

00:08:06.110 --> 00:08:09.140
is known as that
cache block's set.

00:08:09.140 --> 00:08:11.240
So the set determines
the location

00:08:11.240 --> 00:08:14.480
in the cache for each
particular block.

00:08:14.480 --> 00:08:19.927
So let's look at how the virtual
memory address is divided up

00:08:19.927 --> 00:08:21.510
into and which of
the bits we're going

00:08:21.510 --> 00:08:24.380
to use to figure out
where a cache block should

00:08:24.380 --> 00:08:25.610
go in the cache.

00:08:25.610 --> 00:08:29.450
So we have the offset,
we have the set,

00:08:29.450 --> 00:08:31.820
and then the tag fields.

00:08:31.820 --> 00:08:35.179
The offset just tells
us which position

00:08:35.179 --> 00:08:37.669
we want to access
within a cache block.

00:08:37.669 --> 00:08:40.010
So since a cache
block has B bytes,

00:08:40.010 --> 00:08:43.850
we only need log base 2
of B bits as the offset.

00:08:43.850 --> 00:08:45.350
And the reason why
we have the offset

00:08:45.350 --> 00:08:47.383
is because we're not
always accessing something

00:08:47.383 --> 00:08:48.800
at the beginning
of a cache block.

00:08:48.800 --> 00:08:50.800
We might want to access
something in the middle.

00:08:50.800 --> 00:08:52.190
And that's why we
need the offset

00:08:52.190 --> 00:08:54.755
to specify where in the cache
block we want to access.

00:08:57.730 --> 00:08:59.530
Then there's a set field.

00:08:59.530 --> 00:09:05.020
And the set field is going
to determine which position

00:09:05.020 --> 00:09:08.110
in the cache that cache
block can go into.

00:09:08.110 --> 00:09:12.790
So there are eight possible
positions for each cache block.

00:09:12.790 --> 00:09:16.240
And therefore, we only
need log base 2 of 8 bits--

00:09:16.240 --> 00:09:19.120
so three bits for the
set in this example.

00:09:19.120 --> 00:09:23.200
And more generally, it's going
to be log base 2 of M over B.

00:09:23.200 --> 00:09:25.815
And here, M over B is 8.

00:09:25.815 --> 00:09:27.940
And then, finally, we're
going to use the remaining

00:09:27.940 --> 00:09:29.030
bits as a tag.

00:09:29.030 --> 00:09:32.800
So w minus log base 2
of M bits for the tag.

00:09:32.800 --> 00:09:36.250
And that gets stored along with
the cache block in the cache.

00:09:36.250 --> 00:09:39.070
And that's going to
uniquely identify

00:09:39.070 --> 00:09:44.560
which of the memory blocks
the cache block corresponds to

00:09:44.560 --> 00:09:47.430
in virtual memory.

00:09:47.430 --> 00:09:53.110
And you can verify that the
sum of all these quantities

00:09:53.110 --> 00:09:55.190
here sums to w bits.

00:09:55.190 --> 00:09:58.120
So in total, we have
a w bit address space.

00:09:58.120 --> 00:09:59.863
And the sum of those
three things is w.
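
NOTE
A minimal C sketch (not from the lecture) of the offset/set/tag split just described, using the toy parameters from this example: B = 4-byte lines and an M = 32-byte direct-mapped cache, so log2(B) = 2 offset bits, log2(M/B) = 3 set bits, and w - log2(M) = 59 tag bits of a 64-bit address.
#include <stdint.h>
#include <stdio.h>
#define B 4ULL             // cache line size in bytes (toy value)
#define M 32ULL            // cache size in bytes (toy value)
#define OFFSET_BITS 2      // log2(B)
#define SET_BITS 3         // log2(M / B), i.e. 8 lines
int main(void) {
    uint64_t addr = 0x1234ABCDULL;  // an arbitrary example address
    uint64_t offset = addr & (B - 1);                      // low 2 bits
    uint64_t set = (addr >> OFFSET_BITS) & ((M / B) - 1);  // next 3 bits
    uint64_t tag = addr >> (OFFSET_BITS + SET_BITS);       // remaining 59 bits
    printf("offset=%llu set=%llu tag=0x%llx\n",
           (unsigned long long)offset, (unsigned long long)set,
           (unsigned long long)tag);
    return 0;
}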

00:10:03.034 --> 00:10:06.880
So what's the advantage and
disadvantage of this scheme?

00:10:16.880 --> 00:10:19.760
So first, what's a good thing
about this scheme compared

00:10:19.760 --> 00:10:21.990
to the previous
scheme that we saw?

00:10:21.990 --> 00:10:22.490
Yes?

00:10:22.490 --> 00:10:23.250
AUDIENCE: Faster.

00:10:23.250 --> 00:10:24.000
JULIAN SHUN: Yeah.

00:10:24.000 --> 00:10:26.240
It's fast because you only
have to check one place.

00:10:26.240 --> 00:10:27.620
Because each cache
block can only

00:10:27.620 --> 00:10:30.410
go in one place in a cache,
and that's the only place

00:10:30.410 --> 00:10:32.810
you have to check when
you try to do a lookup.

00:10:32.810 --> 00:10:34.740
If the cache block is
there, then you find it.

00:10:34.740 --> 00:10:38.750
If it's not, then you know
it's not in the cache.

00:10:38.750 --> 00:10:42.020
What's the downside
to this scheme?

00:10:42.020 --> 00:10:42.520
Yeah?

00:10:42.520 --> 00:10:44.437
AUDIENCE: You only end
up putting the red ones

00:10:44.437 --> 00:10:47.350
into the cache and you have
mostly every [INAUDIBLE], which

00:10:47.350 --> 00:10:48.520
is totally [INAUDIBLE].

00:10:48.520 --> 00:10:49.270
JULIAN SHUN: Yeah.

00:10:49.270 --> 00:10:50.440
So good answer.

00:10:50.440 --> 00:10:54.630
So the downside is that you
might, for example, just

00:10:54.630 --> 00:10:58.740
be accessing the
red cache blocks

00:10:58.740 --> 00:11:01.260
and then not accessing any
of the other cache blocks.

00:11:01.260 --> 00:11:04.140
They'll all get mapped to the
same location in the cache,

00:11:04.140 --> 00:11:06.240
and then they'll keep
evicting each other,

00:11:06.240 --> 00:11:09.150
even though there's a lot
of empty space in the cache.

00:11:09.150 --> 00:11:11.130
And this is known
as a conflict miss.

00:11:11.130 --> 00:11:15.000
And these can be very
bad for performance

00:11:15.000 --> 00:11:16.760
and very hard to debug.

00:11:16.760 --> 00:11:19.140
So one downside
of a direct-mapped

00:11:19.140 --> 00:11:22.950
cache is that you can get these
conflict misses where you have

00:11:22.950 --> 00:11:25.050
to evict things from the
cache even though there's

00:11:25.050 --> 00:11:26.205
empty space in the cache.

00:11:29.720 --> 00:11:32.330
So as we said, finding
a block is very fast.

00:11:32.330 --> 00:11:35.810
Only a single location in
the cache has to be searched.

00:11:35.810 --> 00:11:38.390
But you might
suffer from conflict

00:11:38.390 --> 00:11:40.620
misses if you keep accessing
things in the same set

00:11:40.620 --> 00:11:45.140
repeatedly without accessing
the things in the other sets.

00:11:45.140 --> 00:11:46.220
So any questions?

00:11:53.030 --> 00:11:53.530
OK.

00:11:53.530 --> 00:11:58.870
So these are sort of the two
extremes for cache design.

00:11:58.870 --> 00:12:01.060
There's actually
a hybrid solution

00:12:01.060 --> 00:12:03.872
called set associative cache.

00:12:03.872 --> 00:12:07.180
And in a set associative
cache, you still have sets,

00:12:07.180 --> 00:12:11.200
but each of the sets contains
more than one line now.

00:12:11.200 --> 00:12:14.970
So all the red blocks
still map to the red set,

00:12:14.970 --> 00:12:16.990
but there's actually
two possible locations

00:12:16.990 --> 00:12:20.020
for the red blocks now.

00:12:20.020 --> 00:12:24.730
So in this case, this is known
as a two-way associative cache

00:12:24.730 --> 00:12:28.870
since there are two possible
locations inside each set.

00:12:28.870 --> 00:12:33.670
And again, a cache block's set
determines k possible cache

00:12:33.670 --> 00:12:35.140
locations for that block.

00:12:35.140 --> 00:12:38.440
So within a set it's
fully associative,

00:12:38.440 --> 00:12:42.040
but each block can only
go in one of the sets.

00:12:44.590 --> 00:12:48.190
So let's look again
at how the bits are

00:12:48.190 --> 00:12:50.680
divided up in the address.

00:12:50.680 --> 00:12:53.770
So we still have the tag,
set, and offset fields.

00:12:53.770 --> 00:12:58.180
The offset field is
still log base 2 of B bits.

00:12:58.180 --> 00:13:04.510
The set field is going to take
log base 2 of M over kB bits.

00:13:04.510 --> 00:13:07.320
So the number of sets
we have is M over kB.

00:13:07.320 --> 00:13:11.080
So we need log base
2 of that number

00:13:11.080 --> 00:13:14.230
to represent the set of a block.

00:13:14.230 --> 00:13:17.590
And then, finally, we use
the remaining bits as a tag,

00:13:17.590 --> 00:13:22.730
so it's going to be w minus
log base 2 of M over k.

00:13:22.730 --> 00:13:25.900
And now, to find a
block in the cache,

00:13:25.900 --> 00:13:30.400
only k locations of its
set must be searched.

00:13:30.400 --> 00:13:33.970
So you basically find which
set the cache block maps to,

00:13:33.970 --> 00:13:36.130
and then you check
all k locations

00:13:36.130 --> 00:13:41.320
within that set to see if
that cache block is there.
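
NOTE
A minimal C sketch (not from the lecture) of this lookup: compute the block's set from its address, then search only the k ways of that set. The parameters and the line_t layout are hypothetical.
#include <stdbool.h>
#include <stdint.h>
#define LINE 64ULL   // cache line size in bytes
#define WAYS 4       // k = 4-way set associative
#define SETS 128ULL  // number of sets = M / (k * B)
typedef struct { bool valid; uint64_t tag; } line_t;
static line_t cache[SETS][WAYS];
// Returns true on a hit; only WAYS locations are ever examined.
bool lookup(uint64_t addr) {
    uint64_t set = (addr / LINE) % SETS;  // the set index bits
    uint64_t tag = addr / (LINE * SETS);  // the remaining high-order bits
    for (int w = 0; w < WAYS; w++)
        if (cache[set][w].valid && cache[set][w].tag == tag)
            return true;  // hit in one of the k possible locations
    return false;         // miss: something in this same set must be evicted
}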

00:13:41.320 --> 00:13:44.358
And whenever

00:13:44.358 --> 00:13:46.900
you try to put something in the
cache because it's not there,

00:13:46.900 --> 00:13:48.067
you have to evict something.

00:13:48.067 --> 00:13:51.010
And you evict something from
the same set as the block

00:13:51.010 --> 00:13:53.780
that you're placing
into the cache.

00:13:53.780 --> 00:13:56.410
So for this example, I showed
a two-way associative cache.

00:13:56.410 --> 00:13:59.200
But in practice, the
associativity is usually bigger--

00:13:59.200 --> 00:14:04.090
say eight-way, 16-way,
or sometimes 20-way.

00:14:04.090 --> 00:14:09.490
And as you keep increasing
the associativity,

00:14:09.490 --> 00:14:13.130
it's going to look more and more
like a fully associative cache.

00:14:13.130 --> 00:14:15.460
And if you have a one-way
associative cache,

00:14:15.460 --> 00:14:17.050
then that's just
a direct-mapped cache.

00:14:17.050 --> 00:14:21.310
So this is sort of a hybrid in
between-- a fully associative

00:14:21.310 --> 00:14:24.325
cache and a
direct-mapped cache.

00:14:27.620 --> 00:14:30.650
So any questions on set
associative caches?

00:14:38.310 --> 00:14:38.810
OK.

00:14:38.810 --> 00:14:43.340
So let's go over a taxonomy
of different types of cache

00:14:43.340 --> 00:14:45.510
misses that you can incur.

00:14:45.510 --> 00:14:48.620
So the first type of cache
miss is called a cold miss.

00:14:48.620 --> 00:14:50.150
And this is the
cache miss that you

00:14:50.150 --> 00:14:53.705
have to incur the first time
you access a cache block.

00:14:53.705 --> 00:14:55.580
And if you need to access
this piece of data,

00:14:55.580 --> 00:14:58.220
there's no way to get around
getting a cold miss for this.

00:14:58.220 --> 00:15:01.225
Because your cache starts
out not having this block,

00:15:01.225 --> 00:15:02.600
and the first time
you access it,

00:15:02.600 --> 00:15:06.960
you have to bring it into cache.

00:15:06.960 --> 00:15:09.860
Then there are capacity misses.

00:15:09.860 --> 00:15:12.660
So capacity misses
are cache misses

00:15:12.660 --> 00:15:14.690
you get because
the cache is full

00:15:14.690 --> 00:15:16.590
and it can't fit all
of the cache blocks

00:15:16.590 --> 00:15:18.870
that you want to access.

00:15:18.870 --> 00:15:21.540
So you get a capacity miss
when the previously cached

00:15:21.540 --> 00:15:23.970
copy would have been
evicted even with a fully

00:15:23.970 --> 00:15:24.870
associative scheme.

00:15:24.870 --> 00:15:28.260
So even if all of the possible
locations in your cache

00:15:28.260 --> 00:15:31.230
could be used for a
particular cache line,

00:15:31.230 --> 00:15:33.750
that cache line still has to
be evicted because there's not

00:15:33.750 --> 00:15:34.420
enough space.

00:15:34.420 --> 00:15:37.530
So that's what's
called a capacity miss.

00:15:37.530 --> 00:15:41.010
And you can deal
with capacity misses

00:15:41.010 --> 00:15:44.610
by introducing more locality
into your code, both spatial

00:15:44.610 --> 00:15:46.440
and temporal locality.

00:15:46.440 --> 00:15:48.690
And we'll look at ways
to reduce the capacity

00:15:48.690 --> 00:15:51.420
misses of algorithms
later on in this lecture.

00:15:53.930 --> 00:15:55.830
Then there are conflict misses.

00:15:55.830 --> 00:16:00.000
And conflict misses happen
in set-associative caches

00:16:00.000 --> 00:16:06.420
when you have too many blocks
from the same set wanting

00:16:06.420 --> 00:16:08.640
to go into the cache.

00:16:08.640 --> 00:16:10.770
And some of these
have to be evicted,

00:16:10.770 --> 00:16:14.130
because the set can't
fit all of the blocks.

00:16:14.130 --> 00:16:15.720
And these blocks
wouldn't have been

00:16:15.720 --> 00:16:18.540
evicted if you had a fully
associative scheme, so these

00:16:18.540 --> 00:16:21.750
are what's called
conflict misses.

00:16:21.750 --> 00:16:25.800
For example, if you
have 16 things in a set

00:16:25.800 --> 00:16:29.820
and you keep accessing 17 things
that all belong in the set,

00:16:29.820 --> 00:16:32.310
something's going
to get kicked out

00:16:32.310 --> 00:16:35.340
every time you want
to access something.

00:16:35.340 --> 00:16:38.280
And these cache
evictions might not

00:16:38.280 --> 00:16:41.115
have happened if you had
a fully associative cache.

00:16:44.600 --> 00:16:46.460
And then, finally,
there are sharing misses.

00:16:46.460 --> 00:16:50.810
So sharing misses only
happen in a parallel context.

00:16:50.810 --> 00:16:52.940
And we talked a little
bit about true sharing

00:16:52.940 --> 00:16:56.300
and false sharing misses
in prior lectures.

00:16:56.300 --> 00:16:59.270
So let's just
review this briefly.

00:16:59.270 --> 00:17:03.860
So a sharing miss can happen
if multiple processors are

00:17:03.860 --> 00:17:06.619
accessing the same cache
line and at least one of them

00:17:06.619 --> 00:17:08.869
is writing to that cache line.

00:17:08.869 --> 00:17:10.460
If all of the
processors are just

00:17:10.460 --> 00:17:13.010
reading from the cache line,
then the cache coherence

00:17:13.010 --> 00:17:16.250
protocol knows how to make
it work so that you don't get

00:17:16.250 --> 00:17:16.880
misses.

00:17:16.880 --> 00:17:19.670
They can all access the same
cache line at the same time

00:17:19.670 --> 00:17:22.099
if nobody's modifying it.

00:17:22.099 --> 00:17:24.290
But if at least one
processor is modifying it,

00:17:24.290 --> 00:17:26.359
you could get either
true sharing misses

00:17:26.359 --> 00:17:28.250
or false sharing misses.

00:17:28.250 --> 00:17:31.580
So a true sharing miss is
when two processors are

00:17:31.580 --> 00:17:36.590
accessing the same data
on the same cache line.

00:17:36.590 --> 00:17:38.750
And as you recall from
a previous lecture,

00:17:38.750 --> 00:17:41.150
if one of the two processors
is writing to this cache

00:17:41.150 --> 00:17:43.640
line, whenever it
does a write it

00:17:43.640 --> 00:17:46.370
needs to acquire the cache
line in exclusive mode

00:17:46.370 --> 00:17:51.710
and then invalidate that cache
line in all other caches.

00:17:51.710 --> 00:17:54.020
So then when
another processor

00:17:54.020 --> 00:17:55.820
tries to access the
same memory location,

00:17:55.820 --> 00:17:58.130
it has to bring it back
into its own cache,

00:17:58.130 --> 00:18:02.260
and then you get a
cache miss there.

00:18:02.260 --> 00:18:04.430
A false sharing miss
happens if two processors

00:18:04.430 --> 00:18:07.070
are accessing different data
that just happened to reside

00:18:07.070 --> 00:18:08.870
on the same cache line.

00:18:08.870 --> 00:18:10.670
Because the basic
unit of movement

00:18:10.670 --> 00:18:13.580
is a cache line in
the architecture.

00:18:13.580 --> 00:18:15.860
So even if you're
accessing different things,

00:18:15.860 --> 00:18:17.480
if they are on the
same cache line,

00:18:17.480 --> 00:18:20.810
you're still going to
get a sharing miss.

00:18:20.810 --> 00:18:22.940
And false sharing is
pretty hard to deal with,

00:18:22.940 --> 00:18:26.030
because, in general,
you don't know what data

00:18:26.030 --> 00:18:28.282
gets placed on what cache line.

00:18:28.282 --> 00:18:29.990
There are certain
heuristics you can use.

00:18:29.990 --> 00:18:32.510
For example, if you're
mallocing a big memory region,

00:18:32.510 --> 00:18:35.430
you know that that memory
region is contiguous,

00:18:35.430 --> 00:18:37.670
so you can space your
accesses far enough apart

00:18:37.670 --> 00:18:40.310
by different processors so
they don't touch the same cache

00:18:40.310 --> 00:18:41.110
line.

00:18:41.110 --> 00:18:43.910
But if you're just declaring
local variables on the stack,

00:18:43.910 --> 00:18:45.710
you don't know
where the compiler

00:18:45.710 --> 00:18:50.810
is going to decide to
place these variables

00:18:50.810 --> 00:18:54.480
in the virtual
memory address space.
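
NOTE
A minimal C sketch (not from the lecture) of the spacing heuristic just mentioned for heap-allocated data: give each processor's counter its own 64-byte line so that concurrent writes never share a cache line. The struct and function names are hypothetical.
#include <stdlib.h>
#define CACHE_LINE 64
typedef struct {
    long count;
    char pad[CACHE_LINE - sizeof(long)];  // fill the rest of the 64-byte line
} padded_counter_t;
// One line per thread: writes by different threads never share a line.
padded_counter_t *make_counters(int nthreads) {
    padded_counter_t *c = aligned_alloc(CACHE_LINE,
        (size_t)nthreads * sizeof(padded_counter_t));
    if (c) for (int i = 0; i < nthreads; i++) c[i].count = 0;
    return c;
}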

00:18:54.480 --> 00:18:57.050
So these are four
different types of cache

00:18:57.050 --> 00:19:00.150
misses that you
should know about.

00:19:00.150 --> 00:19:02.690
And there's many
models out there

00:19:02.690 --> 00:19:05.840
for analyzing the cache
performance of algorithms.

00:19:05.840 --> 00:19:08.720
And some of the models ignore
some of these different types

00:19:08.720 --> 00:19:10.640
of cache misses.

00:19:10.640 --> 00:19:13.940
So just be aware of this when
you're looking at algorithm

00:19:13.940 --> 00:19:16.010
analysis, because
not all of the models

00:19:16.010 --> 00:19:18.120
will capture all of these
different types of cache

00:19:18.120 --> 00:19:18.620
misses.

00:19:22.830 --> 00:19:27.540
So let's look at a bad
case for conflict misses.

00:19:27.540 --> 00:19:33.270
So here I want to access a
submatrix within a larger

00:19:33.270 --> 00:19:34.440
matrix.

00:19:34.440 --> 00:19:39.540
And recall that matrices are
stored in row-major order.

00:19:39.540 --> 00:19:44.850
And let's say our matrix is
4,096 columns by 4,096 rows

00:19:44.850 --> 00:19:47.670
and it stores doubles.

00:19:47.670 --> 00:19:50.190
So therefore, each
row here is going

00:19:50.190 --> 00:19:55.140
to contain 2 to the 15th
bytes, because 4,096

00:19:55.140 --> 00:19:58.800
is 2 to the 12th,
and we have doubles,

00:19:58.800 --> 00:20:00.110
which takes eight bytes.

00:20:00.110 --> 00:20:03.390
So 2 to the 12th times 2 to the
3rd, which is 2 to the 15th.

00:20:06.750 --> 00:20:11.280
We're going to assume the word
width is 64 bits, which is standard.

00:20:11.280 --> 00:20:15.060
We're going to assume that
we have a cache size of 32k.

00:20:15.060 --> 00:20:19.710
And the cache block size is
64, which, again, is standard.

00:20:19.710 --> 00:20:22.125
And let's say we have a
four-way associative cache.

00:20:26.520 --> 00:20:31.860
So let's look at how the
bits are divided up.

00:20:31.860 --> 00:20:36.270
So again we have
this offset, which

00:20:36.270 --> 00:20:38.867
takes log base 2 of B bits.

00:20:38.867 --> 00:20:41.325
So how many bits do we have
for the offset in this example?

00:20:48.300 --> 00:20:48.800
Right.

00:20:48.800 --> 00:20:50.030
So we have 6 bits.

00:20:50.030 --> 00:20:53.930
So it's just log base 2 of 64.

00:20:53.930 --> 00:20:56.180
What about for the set?

00:20:56.180 --> 00:20:59.030
How many bits do
we have for that?

00:20:59.030 --> 00:21:00.350
7.

00:21:00.350 --> 00:21:02.280
Who said 7?

00:21:02.280 --> 00:21:02.780
Yeah.

00:21:02.780 --> 00:21:04.220
So it is 7.

00:21:04.220 --> 00:21:10.130
So M is 32k, which
is 2 to the 15th.

00:21:10.130 --> 00:21:17.310
And then k is 2 to the
2nd, and B is 2 to the 6th.

00:21:17.310 --> 00:21:21.050
So it's 2 to the 15th divided by
2 to the 8th, which is 2 to the 7th.

00:21:21.050 --> 00:21:23.930
And log base 2 of that is 7.

00:21:23.930 --> 00:21:25.940
And finally, what
about the tag field?

00:21:29.660 --> 00:21:31.990
AUDIENCE: 51.

00:21:31.990 --> 00:21:33.100
JULIAN SHUN: 51.

00:21:33.100 --> 00:21:33.730
Why is that?

00:21:33.730 --> 00:21:36.330
AUDIENCE: 64 minus 13.

00:21:36.330 --> 00:21:37.080
JULIAN SHUN: Yeah.

00:21:37.080 --> 00:21:43.880
So it's just 64 minus
7 minus 6, which is 51.

00:21:43.880 --> 00:21:44.380
OK.

00:21:44.380 --> 00:21:47.890
So let's say that we want
to access a submatrix

00:21:47.890 --> 00:21:49.710
within this larger matrix.

00:21:49.710 --> 00:21:52.660
Let's say we want to access
a 32 by 32 submatrix.

00:21:52.660 --> 00:21:57.220
And this is pretty common
in matrix algorithms, where

00:21:57.220 --> 00:21:59.810
you want to access submatrices,
especially in divide

00:21:59.810 --> 00:22:01.591
and conquer algorithms.

00:22:04.240 --> 00:22:09.850
And let's say we want to access
a column of this submatrix A.

00:22:09.850 --> 00:22:13.180
So the addresses of the elements
that we're going to access

00:22:13.180 --> 00:22:14.050
are as follows--

00:22:14.050 --> 00:22:17.290
so let's say the first
element in the column

00:22:17.290 --> 00:22:19.600
is stored at address x.

00:22:19.600 --> 00:22:21.280
Then the second
element in the column

00:22:21.280 --> 00:22:24.640
is going to be stored at
address x plus 2 to the 15th,

00:22:24.640 --> 00:22:27.910
because each row has
2 to the 15th bytes,

00:22:27.910 --> 00:22:29.650
and we're skipping
over an entire row

00:22:29.650 --> 00:22:34.490
here to get to the element in
the next row of the submatrix.

00:22:34.490 --> 00:22:36.460
So we're going to
add 2 to the 15th.

00:22:36.460 --> 00:22:38.020
And then to get
the third element,

00:22:38.020 --> 00:22:40.660
we're going to add 2
times 2 to the 15th.

00:22:40.660 --> 00:22:43.420
And so on, until we get
to the last element,

00:22:43.420 --> 00:22:48.490
which is x plus 31
times 2 to the 15th.

00:22:48.490 --> 00:22:50.350
So which fields
of the address are

00:22:50.350 --> 00:22:54.850
changing as we go through
one column of this submatrix?

00:23:05.586 --> 00:23:09.002
AUDIENCE: You're just adding
multiple [INAUDIBLE] tag

00:23:09.002 --> 00:23:10.000
the [INAUDIBLE].

00:23:10.000 --> 00:23:10.750
JULIAN SHUN: Yeah.

00:23:10.750 --> 00:23:13.490
So it's just going to be
the tag that's changing.

00:23:13.490 --> 00:23:17.360
The set and the offset are going
to stay the same, because we're

00:23:17.360 --> 00:23:22.190
just using the lower 13 bits
to store the set and the offset.

00:23:22.190 --> 00:23:24.890
And therefore, when we
increment by 2 to the 15th,

00:23:24.890 --> 00:23:28.920
we're not going to touch
the set and the offset.

00:23:28.920 --> 00:23:32.060
So all of these addresses
fall into the same set.
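
NOTE
A small C sketch (not from the lecture) that checks this claim numerically: with the machine parameters above (6 offset bits, 7 set bits), a stride of 2 to the 15th bytes never changes the lower 13 bits, so every element of the column maps to the same set. The base address x is hypothetical.
#include <stdint.h>
#include <stdio.h>
#define OFFSET_BITS 6  // log2(64-byte lines)
#define SET_BITS 7     // log2(32 KB / (4 ways * 64 bytes)) = log2(128 sets)
int main(void) {
    uint64_t x = 0x7f0000100000ULL;  // hypothetical address of the column's top
    for (int i = 0; i < 32; i++) {   // the 32 elements of one submatrix column
        uint64_t addr = x + (uint64_t)i * (1ULL << 15);  // row stride = 2^15 bytes
        uint64_t set = (addr >> OFFSET_BITS) & ((1ULL << SET_BITS) - 1);
        printf("element %2d -> set %llu\n", i, (unsigned long long)set);
    }
    return 0;  // all 32 lines land in one 4-way set, so they conflict
}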

00:23:32.060 --> 00:23:35.640
And this is a problem,
because our cache

00:23:35.640 --> 00:23:37.160
is only four-way associative.

00:23:37.160 --> 00:23:42.860
So we can only fit four
cache lines in each set.

00:23:42.860 --> 00:23:45.860
And here, we're accessing
32 of these things.

00:23:45.860 --> 00:23:50.510
So by the time we get
to the next column of A,

00:23:50.510 --> 00:23:53.280
all the things that we access
in the current column of A

00:23:53.280 --> 00:23:56.360
are going to be evicted
from cache already.

00:23:56.360 --> 00:23:58.970
And this is known
as a conflict miss,

00:23:58.970 --> 00:24:01.850
because if you had a
fully associative cache

00:24:01.850 --> 00:24:04.730
this might not have happened,
because you could actually

00:24:04.730 --> 00:24:09.940
use any location in the cache
to store these cache blocks.

00:24:09.940 --> 00:24:13.720
So does anybody have
any questions on why

00:24:13.720 --> 00:24:15.060
we get conflict misses here?

00:24:22.860 --> 00:24:27.110
So anybody have any
ideas on how to fix this?

00:24:27.110 --> 00:24:29.300
So what can I do to
make it so that I'm not

00:24:29.300 --> 00:24:32.990
incrementing by exactly
2 to the 15th every time?

00:24:39.696 --> 00:24:40.654
Yeah.

00:24:40.654 --> 00:24:43.050
AUDIENCE: So pad the matrix?

00:24:43.050 --> 00:24:44.020
JULIAN SHUN: Yeah.

00:24:44.020 --> 00:24:46.270
So one solution is
to pad the matrix.

00:24:46.270 --> 00:24:49.060
You can add some
constant amount of space

00:24:49.060 --> 00:24:50.920
to the end of the matrix.

00:24:50.920 --> 00:24:53.320
So each row is going
to be longer than 2

00:24:53.320 --> 00:24:54.550
to the 15th bytes.

00:24:54.550 --> 00:24:57.400
So maybe you add some
small constant like 17.

00:24:57.400 --> 00:25:00.130
So add 17 bytes to
the end of each row.

00:25:00.130 --> 00:25:04.090
And now, when you access a
column of this submatrix,

00:25:04.090 --> 00:25:07.000
you're not just incrementing
by 2 to the 15th,

00:25:07.000 --> 00:25:10.570
you're also adding
some small integer.

00:25:10.570 --> 00:25:14.535
And that's going to cause
the set and the offset fields

00:25:14.535 --> 00:25:15.910
to change as well,
and you're not

00:25:15.910 --> 00:25:18.640
going to get as many
conflict misses.

00:25:18.640 --> 00:25:22.610
So that's one way to
solve the problem.
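
NOTE
A minimal C sketch (not from the lecture) of the padding fix: allocate rows with a little extra slack so that the row stride is no longer exactly 2 to the 15th bytes. The pad of 17 doubles is a hypothetical choice in the spirit of the small constant mentioned above.
#include <stdlib.h>
#define N 4096
#define PAD 17  // small constant of extra doubles per row
double *alloc_padded_matrix(void) {
    // N rows, each N + PAD doubles long; row starts now hit different sets.
    return malloc(sizeof(double) * (size_t)N * (N + PAD));
}
// Always index with the padded stride, not with N.
static inline double *elem(double *a, int i, int j) {
    return &a[(size_t)i * (N + PAD) + j];
}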

00:25:22.610 --> 00:25:25.570
It turns out that if you're
doing a matrix multiplication

00:25:25.570 --> 00:25:27.910
algorithm, that's a
cubic work algorithm,

00:25:27.910 --> 00:25:31.630
and you can basically
afford to copy the submatrix

00:25:31.630 --> 00:25:34.270
into a temporary
32 by 32 matrix,

00:25:34.270 --> 00:25:36.580
do all the operations
on the temporary matrix,

00:25:36.580 --> 00:25:39.760
and then copy it back out
to the original matrix.

00:25:39.760 --> 00:25:42.610
The copying only
takes quadratic work

00:25:42.610 --> 00:25:45.160
to do across the
whole algorithm.

00:25:45.160 --> 00:25:48.070
And since the whole
algorithm takes cubic work,

00:25:48.070 --> 00:25:50.620
the quadratic work is
a lower order term.

00:25:50.620 --> 00:25:54.790
So you can use temporary
space to make sure that you

00:25:54.790 --> 00:25:56.050
don't get conflict misses.

00:25:58.560 --> 00:25:59.490
Any questions?

00:26:06.030 --> 00:26:09.340
So this was conflict misses.

00:26:09.340 --> 00:26:10.900
So conflict misses
are important.

00:26:10.900 --> 00:26:13.180
But usually, we're going
to be first concerned

00:26:13.180 --> 00:26:15.820
about getting good spatial
and temporal locality,

00:26:15.820 --> 00:26:19.240
because those are
usually the higher order

00:26:19.240 --> 00:26:21.070
factors in the
performance of a program.

00:26:21.070 --> 00:26:24.250
And once we get good spatial
and temporal locality

00:26:24.250 --> 00:26:25.840
in our program,
we can then start

00:26:25.840 --> 00:26:28.720
worrying about conflict
misses, for example,

00:26:28.720 --> 00:26:32.860
by using temporary space
or padding our data

00:26:32.860 --> 00:26:35.650
by some small constants
so that we don't

00:26:35.650 --> 00:26:37.210
have as many conflict misses.

00:26:41.120 --> 00:26:43.170
So now, I want to
talk about a model

00:26:43.170 --> 00:26:45.270
that we can use to
analyze the cache

00:26:45.270 --> 00:26:46.530
performance of algorithms.

00:26:46.530 --> 00:26:51.010
And this is called
the ideal-cache model.

00:26:51.010 --> 00:26:57.030
So in this model, we have a
two-level cache hierarchy.

00:26:57.030 --> 00:27:01.440
So we have the cache
and then main memory.

00:27:01.440 --> 00:27:05.205
The cache is of size
M, and the cache line size

00:27:05.205 --> 00:27:06.750
is B bytes.

00:27:06.750 --> 00:27:10.245
And therefore, we can fit M over
B cache lines inside our cache.

00:27:13.020 --> 00:27:15.930
This model assumes that the
cache is fully associative,

00:27:15.930 --> 00:27:18.920
so any cache block can
go anywhere in the cache.

00:27:18.920 --> 00:27:23.070
And it also assumes an optimal
omniscient replacement policy.

00:27:23.070 --> 00:27:25.140
So this means that when
we want to evict a cache

00:27:25.140 --> 00:27:26.600
block from the
cache, we're going

00:27:26.600 --> 00:27:28.410
to pick the thing to
evict that gives us

00:27:28.410 --> 00:27:30.060
the best performance overall.

00:27:30.060 --> 00:27:31.830
It gives us the
lowest number of cache

00:27:31.830 --> 00:27:34.210
misses throughout
our entire algorithm.

00:27:34.210 --> 00:27:36.960
So we're assuming that we know
the sequence of memory requests

00:27:36.960 --> 00:27:38.858
throughout the entire algorithm.

00:27:38.858 --> 00:27:41.400
And that's why it's called the
omniscient replacement

00:27:41.400 --> 00:27:41.900
policy.

00:27:45.370 --> 00:27:49.000
And if something is in cache,
you can operate on it for free.

00:27:49.000 --> 00:27:51.040
And if something
is in main memory,

00:27:51.040 --> 00:27:52.810
you have to bring it
into cache and then

00:27:52.810 --> 00:27:54.070
you incur a cache miss.

00:27:56.990 --> 00:27:59.880
So two performance measures
that we care about--

00:27:59.880 --> 00:28:01.890
first, we care about
the ordinary work,

00:28:01.890 --> 00:28:04.830
which is just the ordinary
running time of a program.

00:28:04.830 --> 00:28:07.740
So this is the
same as before when

00:28:07.740 --> 00:28:09.360
we were analyzing algorithms.

00:28:09.360 --> 00:28:11.160
It's just a total
number of operations

00:28:11.160 --> 00:28:13.690
that the program does.

00:28:13.690 --> 00:28:15.420
And the number of
cache misses is

00:28:15.420 --> 00:28:17.190
going to be the
number of lines we

00:28:17.190 --> 00:28:21.893
have to transfer between the
main memory and the cache.

00:28:21.893 --> 00:28:23.310
So the number of
cache misses just

00:28:23.310 --> 00:28:24.930
counts the number of
cache transfers,

00:28:24.930 --> 00:28:27.570
whereas the work counts
all the operations that you

00:28:27.570 --> 00:28:29.227
have to do in the algorithm.

00:28:32.640 --> 00:28:35.490
So ideally, we would
like to come up

00:28:35.490 --> 00:28:38.970
with algorithms that have a
low number of cache misses

00:28:38.970 --> 00:28:42.540
without increasing the work
from the traditional standard

00:28:42.540 --> 00:28:44.550
algorithm.

00:28:44.550 --> 00:28:47.060
Sometimes we can do that,
sometimes we can't do that.

00:28:47.060 --> 00:28:49.470
And then there's a
trade-off between the work

00:28:49.470 --> 00:28:51.210
and the number of cache misses.

00:28:51.210 --> 00:28:53.850
And it's a trade-off
that you have

00:28:53.850 --> 00:28:56.910
to decide whether it's
worthwhile as a performance

00:28:56.910 --> 00:28:57.960
engineer.

00:28:57.960 --> 00:28:59.790
Today, we're going to
look at an algorithm

00:28:59.790 --> 00:29:01.915
where you can actually
reduce the number of cache

00:29:01.915 --> 00:29:03.780
misses without
increasing the work.

00:29:03.780 --> 00:29:06.090
So you basically get
the best of both worlds.

00:29:08.880 --> 00:29:11.430
So any questions on
this ideal cache model?

00:29:19.430 --> 00:29:23.810
So this model is just used
for analyzing algorithms.

00:29:23.810 --> 00:29:27.530
You can't actually buy one
of these caches at the store.

00:29:27.530 --> 00:29:31.760
So this is a very ideal
cache, and they don't exist.

00:29:31.760 --> 00:29:35.000
But it turns out that this
optimal omniscient replacement

00:29:35.000 --> 00:29:38.580
policy has nice
theoretical properties.

00:29:38.580 --> 00:29:43.970
And this is a very important
lemma that was proved in 1985.

00:29:43.970 --> 00:29:46.720
It's called the LRU lemma.

00:29:46.720 --> 00:29:48.770
It was proved by
Sleator and Tarjan.

00:29:48.770 --> 00:29:51.950
And the lemma says, suppose
that an algorithm incurs

00:29:51.950 --> 00:29:56.540
Q cache misses on an ideal
cache of size M. Then,

00:29:56.540 --> 00:30:01.280
on a fully associative cache
of size 2M, that uses the LRU,

00:30:01.280 --> 00:30:04.760
or Least Recently Used
replacement policy,

00:30:04.760 --> 00:30:08.900
it incurs at most
2Q cache misses.
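
NOTE
Stating the lemma in symbols (the subscripted Q notation is ours, not the lecture's): if $Q_{\mathrm{OPT}}(M)$ denotes the misses on an ideal cache of size $M$ and $Q_{\mathrm{LRU}}(2M)$ the misses on a fully associative LRU cache of size $2M$, then
$$ Q_{\mathrm{LRU}}(2M) \le 2\,Q_{\mathrm{OPT}}(M). $$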

00:30:08.900 --> 00:30:12.980
So what this says is if I
can show the number of cache

00:30:12.980 --> 00:30:16.700
misses for an algorithm
on the ideal cache,

00:30:16.700 --> 00:30:19.820
then if I take a fully
associative cache that's twice

00:30:19.820 --> 00:30:23.220
the size and use the
LRU replacement policy,

00:30:23.220 --> 00:30:25.280
which is a pretty
practical policy,

00:30:25.280 --> 00:30:26.900
then the algorithm
is going to incur,

00:30:26.900 --> 00:30:31.160
at most, twice the
number of cache misses.

00:30:31.160 --> 00:30:33.890
And the implication
of this lemma

00:30:33.890 --> 00:30:36.590
is that for asymptotic
analyses, you

00:30:36.590 --> 00:30:40.040
can assume either the optimal
replacement policy or the LRU

00:30:40.040 --> 00:30:41.930
replacement policy
as convenient.

00:30:41.930 --> 00:30:46.010
Because the number
of cache misses

00:30:46.010 --> 00:30:50.270
is just going to be within a
constant factor of each other.

00:30:50.270 --> 00:30:52.610
So this is a very
important lemma.

00:30:52.610 --> 00:30:54.650
It says that this
basically makes

00:30:54.650 --> 00:31:00.306
it much easier for us to analyze
our cache misses in algorithms.

00:31:03.780 --> 00:31:06.240
And here's a software
engineering principle

00:31:06.240 --> 00:31:08.770
that I want to point out.

00:31:08.770 --> 00:31:13.480
So first, when you're trying
to get good performance,

00:31:13.480 --> 00:31:16.540
you should come up with a
theoretically good algorithm

00:31:16.540 --> 00:31:20.670
that has good bounds on the
work and the cache complexity.

00:31:20.670 --> 00:31:23.130
And then after you come up
with an algorithm that's

00:31:23.130 --> 00:31:26.040
theoretically good, then
you start engineering

00:31:26.040 --> 00:31:27.150
for detailed performance.

00:31:27.150 --> 00:31:30.630
You start worrying about the
details such as real world

00:31:30.630 --> 00:31:34.770
caches not being fully
associative, and, for example,

00:31:34.770 --> 00:31:37.080
loads and stores having
different costs with respect

00:31:37.080 --> 00:31:39.090
to bandwidth and latency.

00:31:39.090 --> 00:31:41.340
But coming up with a
theoretically good algorithm

00:31:41.340 --> 00:31:43.980
is the first order bit to
getting good performance.

00:31:48.840 --> 00:31:49.812
Questions?

00:31:58.090 --> 00:32:00.550
So let's start analyzing
the number of cache

00:32:00.550 --> 00:32:02.320
misses in a program.

00:32:02.320 --> 00:32:04.090
So here's a lemma.

00:32:04.090 --> 00:32:07.990
So the lemma says, suppose that
a program reads a set of r data

00:32:07.990 --> 00:32:13.480
segments, where the i-th segment
consists of s sub i bytes.

00:32:13.480 --> 00:32:17.110
And suppose that the sum of
the sizes of all the segments

00:32:17.110 --> 00:32:22.360
is equal to N. And we're going
to assume that N is less than M

00:32:22.360 --> 00:32:23.120
over 3.

00:32:23.120 --> 00:32:26.260
So the sum of the
size of the segments

00:32:26.260 --> 00:32:30.100
is less than the cache
size divided by 3.

00:32:30.100 --> 00:32:32.320
We're also going to
assume that N over r

00:32:32.320 --> 00:32:34.870
is greater than or
equal to B. So recall

00:32:34.870 --> 00:32:38.650
that r is the number of
data segments we have,

00:32:38.650 --> 00:32:41.090
and N is the total
size of the segment.

00:32:41.090 --> 00:32:46.080
So what does N over
r mean, semantically?

00:32:46.080 --> 00:32:46.580
Yes.

00:32:46.580 --> 00:32:47.950
AUDIENCE: Average [INAUDIBLE].

00:32:47.950 --> 00:32:48.700
JULIAN SHUN: Yeah.

00:32:48.700 --> 00:32:53.390
So N over r is the just the
average size of a segment.

00:32:53.390 --> 00:32:56.390
And here we're saying that
the average size of a segment

00:32:56.390 --> 00:33:01.790
is at least B-- so at least
the size of a cache line.

00:33:01.790 --> 00:33:04.830
So if these two assumptions
hold, then all of the segments

00:33:04.830 --> 00:33:07.590
are going to fit into cache,
and the number of cache

00:33:07.590 --> 00:33:13.590
misses to read them all is,
at most, 3 times N over B.

00:33:13.590 --> 00:33:20.490
So if you had just a
single array of size N,

00:33:20.490 --> 00:33:21.990
then the number of
cache misses you

00:33:21.990 --> 00:33:24.180
would need to read
that array into cache

00:33:24.180 --> 00:33:25.920
is going to be N
over B. And this

00:33:25.920 --> 00:33:29.280
is saying that, even
if our data is divided

00:33:29.280 --> 00:33:32.040
into a bunch of segments, as
long as the average length

00:33:32.040 --> 00:33:35.580
of the segments is large enough,
then the number of cache misses

00:33:35.580 --> 00:33:41.550
is just a constant factor worse
than reading a single array.

00:33:41.550 --> 00:33:44.160
So let's try to prove
this cache miss lemma.

00:33:48.000 --> 00:33:50.220
So here's the proof.

00:33:50.220 --> 00:33:52.290
A single segment of
s sub i bytes is going

00:33:52.290 --> 00:33:58.350
to incur at most s sub i
over B plus 2 cache misses.

00:33:58.350 --> 00:34:01.800
So does anyone want to tell me
where the s sub i over B plus 2

00:34:01.800 --> 00:34:02.370
comes from?

00:34:09.540 --> 00:34:13.170
So let's say this is a
segment that we're analyzing,

00:34:13.170 --> 00:34:16.320
and this is how it's
aligned in virtual memory.

00:34:21.900 --> 00:34:22.400
Yes?

00:34:22.400 --> 00:34:25.310
AUDIENCE: How many blocks
it could overlap worst case.

00:34:25.310 --> 00:34:26.060
JULIAN SHUN: Yeah.

00:34:26.060 --> 00:34:29.870
So s sub i over B plus 2 is
the number of blocks that could

00:34:29.870 --> 00:34:32.610
overlap with in the worst case.

00:34:32.610 --> 00:34:36.949
So you need s sub i
over B cache misses just

00:34:36.949 --> 00:34:39.949
to load those s sub i bytes.

00:34:39.949 --> 00:34:43.400
But then the beginning and
the end of that segment

00:34:43.400 --> 00:34:47.360
might not be perfectly aligned
with a cache line boundary.

00:34:47.360 --> 00:34:49.670
And therefore, you could
waste, at most, one block

00:34:49.670 --> 00:34:51.320
on each side of the segment.

00:34:51.320 --> 00:34:55.310
So that's where the
plus 2 comes from.

00:34:55.310 --> 00:34:57.560
So to get the total
number of cache

00:34:57.560 --> 00:35:03.170
misses, we just have to sum this
quantity from i equals 1 to r.

00:35:03.170 --> 00:35:06.620
So if I sum s sub i over
B from i equals 1 to r,

00:35:06.620 --> 00:35:08.810
I just get N over
B, by definition.

00:35:08.810 --> 00:35:12.640
And then I sum 2
from i equals 1 to r.

00:35:12.640 --> 00:35:14.840
So that just gives me 2r.

00:35:14.840 --> 00:35:17.180
Now, I'm going to multiply
the top and the bottom

00:35:17.180 --> 00:35:21.080
of the second term by
B. So 2r B over B now.

00:35:21.080 --> 00:35:24.200
And then that's less
than or equal to N over B

00:35:24.200 --> 00:35:29.730
plus 2N over B. So where did
I get this inequality here?

00:35:29.730 --> 00:35:32.420
Why do I know that 2r B is
less than or equal to 2N?

00:35:35.500 --> 00:35:36.000
Yes?

00:35:36.000 --> 00:35:38.760
AUDIENCE: You know that the N
is greater than or equal to B r.

00:35:38.760 --> 00:35:38.940
JULIAN SHUN: Yeah.

00:35:38.940 --> 00:35:41.250
So you know that N is
greater than or equal to B

00:35:41.250 --> 00:35:43.380
r by this assumption up here.

00:35:43.380 --> 00:35:46.830
So therefore, r B is
less than or equal to N.

00:35:46.830 --> 00:35:51.450
And then, N over B plus 2N over
B just sums up to 3N over B.

00:35:51.450 --> 00:35:55.335
So in the worst case, we're
going to incur 3N over B cache

00:35:55.335 --> 00:35:55.835
misses.

00:36:00.800 --> 00:36:03.340
So any questions on
this cache miss lemma?

00:36:07.620 --> 00:36:11.520
So the important thing to
remember here is that if you

00:36:11.520 --> 00:36:14.070
have a whole bunch of data
segments and the average length

00:36:14.070 --> 00:36:15.780
of your segments
is large enough--

00:36:15.780 --> 00:36:18.540
bigger than a cache block size--

00:36:18.540 --> 00:36:21.690
then you can access all
of these segments just

00:36:21.690 --> 00:36:24.360
like a single array.

00:36:24.360 --> 00:36:25.980
It only increases
the number of cache

00:36:25.980 --> 00:36:27.810
misses by a constant factor.

00:36:27.810 --> 00:36:29.892
And if you're doing an
asymptotic analysis,

00:36:29.892 --> 00:36:30.850
then it doesn't matter.

00:36:30.850 --> 00:36:33.360
So we're going to be using
this cache miss lemma later

00:36:33.360 --> 00:36:35.160
on when we analyze algorithms.

00:36:40.720 --> 00:36:44.200
So another assumption
that we're going to need

00:36:44.200 --> 00:36:46.840
is called the tall
cache assumption.

00:36:46.840 --> 00:36:49.450
And the tall cache
assumption basically

00:36:49.450 --> 00:36:52.390
says that the cache is
taller than it is wide.

00:36:52.390 --> 00:36:55.750
So it says that B
squared is less than c M

00:36:55.750 --> 00:36:58.750
for some sufficiently
small constant c less than

00:36:58.750 --> 00:37:02.050
or equal to 1.

00:37:02.050 --> 00:37:05.830
So in other words, it says
that the number of cache lines

00:37:05.830 --> 00:37:13.660
M over B you have is
going to be bigger than B.

00:37:13.660 --> 00:37:16.330
And this tall cache
assumption is usually

00:37:16.330 --> 00:37:17.650
satisfied in practice.

00:37:17.650 --> 00:37:22.090
So here are the cache
line sizes and the cache

00:37:22.090 --> 00:37:24.460
sizes on the machines
that we're using.

00:37:24.460 --> 00:37:28.990
So cache line size is 64
bytes, and the L1 cache size

00:37:28.990 --> 00:37:31.390
is 32 kilobytes.

00:37:31.390 --> 00:37:36.400
So 64 bytes squared,
that's 2 to the 12th.

00:37:36.400 --> 00:37:39.420
And 32 kilobytes is
2 to the 15th bytes.

00:37:39.420 --> 00:37:41.510
So 2 to the 12th is
less than 2 to the 15th,

00:37:41.510 --> 00:37:44.530
so it satisfies the
tall cache assumption.

00:37:44.530 --> 00:37:46.540
And as we go up the
memory hierarchy,

00:37:46.540 --> 00:37:49.990
the cache size increases,
but the cache line length

00:37:49.990 --> 00:37:51.080
stays the same.

00:37:51.080 --> 00:37:53.230
So the cache has
become even taller

00:37:53.230 --> 00:37:57.160
as we move up the
memory hierarchy.

00:37:57.160 --> 00:38:00.468
So let's see why this
tall cache assumption is

00:38:00.468 --> 00:38:01.260
going to be useful.

00:38:04.550 --> 00:38:06.300
To see that, we're
going to look at what's

00:38:06.300 --> 00:38:07.770
wrong with a short cache.

00:38:07.770 --> 00:38:11.580
So in a short cache, our lines
are going to be very wide,

00:38:11.580 --> 00:38:14.190
and they're wider than
the number of lines

00:38:14.190 --> 00:38:18.200
that we can have in our cache.

00:38:18.200 --> 00:38:19.950
And let's say we're
working with an m

00:38:19.950 --> 00:38:24.120
by n submatrix stored
in row-major order.

00:38:24.120 --> 00:38:27.810
If you have a short cache,
then even if n squared

00:38:27.810 --> 00:38:29.700
is less than c M,
meaning that you

00:38:29.700 --> 00:38:33.540
can fit all the bytes of
the submatrix in cache,

00:38:33.540 --> 00:38:37.620
you might still not be able
to fit it into a short cache.

00:38:37.620 --> 00:38:40.650
And this picture sort
of illustrates this.

00:38:40.650 --> 00:38:43.050
So we have m rows here.

00:38:43.050 --> 00:38:46.290
But we can only fit M over
B of the rows in the cache,

00:38:46.290 --> 00:38:48.960
because the cache
lines are so long,

00:38:48.960 --> 00:38:51.045
and we're actually
wasting a lot of space

00:38:51.045 --> 00:38:52.170
on each of the cache lines.

00:38:52.170 --> 00:38:54.570
We're only using a very small
fraction of each cache line

00:38:54.570 --> 00:38:58.690
to store the row
of this submatrix.

00:38:58.690 --> 00:39:00.960
If this were the
entire matrix, then

00:39:00.960 --> 00:39:05.250
it would actually be OK,
because consecutive rows

00:39:05.250 --> 00:39:08.850
are going to be placed together
consecutively in memory.

00:39:08.850 --> 00:39:10.740
But if this is a
submatrix, then we

00:39:10.740 --> 00:39:14.070
can't be guaranteed that the
next row is going to be placed

00:39:14.070 --> 00:39:17.220
right after the current row.

00:39:17.220 --> 00:39:19.290
And oftentimes, we have
to deal with submatrices

00:39:19.290 --> 00:39:22.110
when we're doing recursive
matrix algorithms.

00:39:25.330 --> 00:39:27.760
So this is what's wrong
with short caches.

00:39:27.760 --> 00:39:32.340
And that's why we want to assume
the tall cache assumption.

00:39:32.340 --> 00:39:34.210
And we can assume that,
because it's usually

00:39:34.210 --> 00:39:35.185
satisfied in practice.

00:39:37.945 --> 00:39:40.080
The TLB actually
tends to be short.

00:39:40.080 --> 00:39:42.550
It only has a couple of
entries, so it might not satisfy

00:39:42.550 --> 00:39:44.020
the tall cache assumption.

00:39:44.020 --> 00:39:50.060
But all of the other caches
will satisfy this assumption.

00:39:50.060 --> 00:39:51.100
Any questions?

00:39:54.630 --> 00:39:56.797
OK.

00:39:56.797 --> 00:39:58.880
So here's another lemma
that's going to be useful.

00:39:58.880 --> 00:40:03.220
This is called the
submatrix caching lemma.

00:40:03.220 --> 00:40:06.310
So suppose that we
have an n by n matrix,

00:40:06.310 --> 00:40:08.650
and it's read into
a tall cache that

00:40:08.650 --> 00:40:13.190
satisfies B squared less than c
M for some constant c less than

00:40:13.190 --> 00:40:15.580
or equal to 1.

00:40:15.580 --> 00:40:19.840
And suppose that n squared
is less than M over 3,

00:40:19.840 --> 00:40:24.280
but it's greater than
or equal to c M. Then

00:40:24.280 --> 00:40:27.580
A is going to fit into cache,
and the number of cache

00:40:27.580 --> 00:40:31.600
misses required to read all
of A's elements into cache is,

00:40:31.600 --> 00:40:38.470
at most, 3n squared over B.

00:40:38.470 --> 00:40:42.900
So let's see why this is true.

00:40:42.900 --> 00:40:45.120
So we're going to
let big N denote

00:40:45.120 --> 00:40:48.930
the total number of bytes
that we need to access.

00:40:48.930 --> 00:40:50.940
So big N is going to
be equal to n squared.

00:40:53.800 --> 00:40:56.550
And we're going to use the
cache miss lemma, which

00:40:56.550 --> 00:40:59.160
says that if the average
length of our segments

00:40:59.160 --> 00:41:02.310
is large enough, then we
can read all of the segments

00:41:02.310 --> 00:41:05.770
in just like it were a
single contiguous array.

00:41:05.770 --> 00:41:09.930
So the lengths of our segments
here are going to be little n.

00:41:09.930 --> 00:41:13.230
So r is going to be little n.

00:41:13.230 --> 00:41:16.470
And also, the number of segments
is going to be little n.

00:41:16.470 --> 00:41:18.040
And the segment
length is also going

00:41:18.040 --> 00:41:21.660
to be little n, since we're
working with a square submatrix

00:41:21.660 --> 00:41:24.090
here.

00:41:24.090 --> 00:41:30.120
And then we also have the
cache block size B is less than

00:41:30.120 --> 00:41:32.310
or equal to n.

00:41:32.310 --> 00:41:36.090
And that's equal
to big N over r.

00:41:36.090 --> 00:41:39.750
And where do we get this
property that B is less than

00:41:39.750 --> 00:41:42.600
or equal to n?

00:41:42.600 --> 00:41:46.110
So I made some
assumptions up here,

00:41:46.110 --> 00:41:50.070
which I can use to infer that
B is less than or equal to n.

00:41:50.070 --> 00:41:53.150
Does anybody see where?

00:41:53.150 --> 00:41:53.922
Yeah.

00:41:53.922 --> 00:41:55.850
AUDIENCE: So B squared
is less than c M,

00:41:55.850 --> 00:41:57.300
and c M is [INAUDIBLE]

00:41:57.300 --> 00:41:58.050
JULIAN SHUN: Yeah.

00:41:58.050 --> 00:42:00.435
So I know that B
squared is less than c

00:42:00.435 --> 00:42:02.820
M. C M is less than
or equal to n squared.

00:42:02.820 --> 00:42:05.250
So therefore, B squared
is less than n squared,

00:42:05.250 --> 00:42:09.360
and B is less than n.

00:42:09.360 --> 00:42:15.060
So now, I also have
that N is less than M

00:42:15.060 --> 00:42:18.450
over 3, just by assumption.

00:42:18.450 --> 00:42:20.810
And therefore, I can use
the cache miss lemma.

00:42:20.810 --> 00:42:23.700
So the cache miss lemma
tells me that I only

00:42:23.700 --> 00:42:26.610
need a total of 3n
squared over B cache

00:42:26.610 --> 00:42:28.120
misses to read this
whole thing in.
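To spell that out, here is a compact restatement of the instantiation of the cache-miss lemma used above, in the lecture's notation:

\[
N = n^{2}, \qquad r = n \ \ (n \text{ segments of length } n),
\]
\[
B \le n = N/r \quad (\text{since } B^{2} \le cM \le n^{2}), \qquad N = n^{2} < M/3,
\]
\[
\Longrightarrow \quad \text{cache misses} \;\le\; \frac{3N}{B} \;=\; \frac{3n^{2}}{B}.
\]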

00:42:32.780 --> 00:42:35.150
Any questions on the
submatrix caching lemma?

00:42:48.980 --> 00:42:53.198
So now, let's analyze
matrix multiplication.

00:42:53.198 --> 00:42:55.490
How many of you have seen
matrix multiplication before?

00:42:59.250 --> 00:43:00.130
So a couple of you.

00:43:03.340 --> 00:43:07.150
So here's what the
code looks like

00:43:07.150 --> 00:43:11.260
for the standard cubic
work matrix multiplication

00:43:11.260 --> 00:43:12.980
algorithm.

00:43:12.980 --> 00:43:15.430
So we have two input
matrices, A and B.

00:43:15.430 --> 00:43:18.610
And we're going to
store the result in C.

00:43:18.610 --> 00:43:22.930
And the height and the
width of our matrix is n.

00:43:22.930 --> 00:43:25.798
We're just going to deal
with square matrices here,

00:43:25.798 --> 00:43:27.340
but what I'm going
to talk about also

00:43:27.340 --> 00:43:30.770
extends to non-square matrices.

00:43:30.770 --> 00:43:33.450
And then we just have
three loops here.

00:43:33.450 --> 00:43:37.600
We're going to loop through i
from 0 to n minus 1, j from 0

00:43:37.600 --> 00:43:40.540
to n minus 1, and k
from 0 to n minus 1.

00:43:40.540 --> 00:43:43.225
And then we're going to
let C of i n plus j

00:43:43.225 --> 00:43:48.280
be incremented by A of i n
plus k times B of k n plus j.

00:43:48.280 --> 00:43:53.200
So that's just the standard
code for matrix multiply.
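For reference, here is a minimal C sketch of the loop nest just described (the signature, types, and restrict qualifiers are illustrative assumptions, not necessarily the course's exact code):

    // Standard cubic-work matrix multiply: C += A * B, where A, B, C are
    // n x n matrices stored in row-major order.
    void matmul(double *restrict C, const double *restrict A,
                const double *restrict B, int n) {
      for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
          for (int k = 0; k < n; ++k)
            C[i*n + j] += A[i*n + k] * B[k*n + j];
    }

Calling matmul(C, A, B, n) on a zero-initialized C computes the product.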

00:43:53.200 --> 00:43:57.105
So what's the work
of this algorithm?

00:43:57.105 --> 00:44:02.140
It should be review
for all of you.

00:44:02.140 --> 00:44:02.740
n cubed.

00:44:05.790 --> 00:44:08.850
So now, let's analyze
the number of cache

00:44:08.850 --> 00:44:11.400
misses this algorithm
is going to incur.

00:44:11.400 --> 00:44:13.680
And again, we're going to
assume that the matrix is

00:44:13.680 --> 00:44:16.770
in row-major order, and
we satisfy the tall cache

00:44:16.770 --> 00:44:17.760
assumption.

00:44:20.640 --> 00:44:23.100
We're also going to
analyze the number of cache

00:44:23.100 --> 00:44:25.723
misses in matrix B,
because it turns out

00:44:25.723 --> 00:44:27.390
that the number of
cache misses incurred

00:44:27.390 --> 00:44:29.850
by matrix B is going to
dominate the number of cache

00:44:29.850 --> 00:44:31.470
misses overall.

00:44:31.470 --> 00:44:33.720
And there are three cases
we need to consider.

00:44:33.720 --> 00:44:37.110
The first case is when
n is greater than c M

00:44:37.110 --> 00:44:39.570
over B for some constant c.

00:44:42.890 --> 00:44:44.900
And we're going to analyze
matrix B, as I said.

00:44:44.900 --> 00:44:48.650
And we're also going to
assume LRU, because we can.

00:44:48.650 --> 00:44:50.300
If you recall,
the LRU lemma says

00:44:50.300 --> 00:44:52.390
that whatever we
analyze using the LRU

00:44:52.390 --> 00:44:55.160
is just going to be a constant
factor within what we analyze

00:44:55.160 --> 00:44:56.420
using the ideal cache.

00:45:01.220 --> 00:45:07.460
So to do this matrix
multiplication,

00:45:07.460 --> 00:45:10.940
I'm going to go through one
row of A and one column of B

00:45:10.940 --> 00:45:12.740
and do the dot product there.

00:45:12.740 --> 00:45:17.460
This is what happens
in the innermost loop.

00:45:17.460 --> 00:45:19.010
And how many cache
misses am I going

00:45:19.010 --> 00:45:24.110
to incur when I go down
one column of B here?

00:45:24.110 --> 00:45:29.120
So here, I have the case where
n is greater than c M over B.

00:45:29.120 --> 00:45:38.430
So I can't fit one block
from each row into the cache.

00:45:38.430 --> 00:45:40.490
So how many cache misses
do I have the first time

00:45:40.490 --> 00:45:41.840
I go down a column of B?

00:45:44.440 --> 00:45:45.990
So how many rows of B do I have?

00:45:48.820 --> 00:45:49.700
n.

00:45:49.700 --> 00:45:54.850
Yeah, and how many cache
misses do I need for each row?

00:45:54.850 --> 00:45:55.350
One.

00:45:55.350 --> 00:45:58.590
So in total, I'm going
to need n cache misses

00:45:58.590 --> 00:46:02.280
for the first column of B.

00:46:02.280 --> 00:46:04.020
What about the
second column of B?

00:46:08.980 --> 00:46:12.090
So recall that I'm assuming the
LRU replacement policy here.

00:46:12.090 --> 00:46:13.590
So when the cache
is full, I'm going

00:46:13.590 --> 00:46:17.030
to evict the thing that
was least recently used--

00:46:17.030 --> 00:46:18.610
used the furthest in the past.

00:46:26.932 --> 00:46:28.140
Sorry, could you repeat that?

00:46:28.140 --> 00:46:29.080
AUDIENCE: [INAUDIBLE].

00:46:29.080 --> 00:46:29.830
JULIAN SHUN: Yeah.

00:46:29.830 --> 00:46:30.997
So it's still going to be n.

00:46:30.997 --> 00:46:33.462
Why is that?

00:46:33.462 --> 00:46:38.350
AUDIENCE: Because there
are [INAUDIBLE] integer.

00:46:38.350 --> 00:46:39.822
JULIAN SHUN: Yeah.

00:46:39.822 --> 00:46:41.280
It's still going
to be n, because I

00:46:41.280 --> 00:46:45.030
can't fit one cache block
from each row into my cache.

00:46:45.030 --> 00:46:48.630
And by the time I get back
to the top of my matrix B,

00:46:48.630 --> 00:46:52.130
the top block has already
been evicted from the cache,

00:46:52.130 --> 00:46:53.410
and I have to load it back in.

00:46:53.410 --> 00:46:56.070
And this is the same for every
other block that I access.

00:46:56.070 --> 00:46:58.680
So I'm, again, going
to need n cache misses

00:46:58.680 --> 00:47:01.200
for the second
column of B. And this

00:47:01.200 --> 00:47:05.400
is going to be the same
for all the columns of B.

00:47:05.400 --> 00:47:09.790
And then I have to do this
again for the second row of A.

00:47:09.790 --> 00:47:13.120
So in total, I'm going
to need theta of n

00:47:13.120 --> 00:47:15.730
cubed number of cache misses.

00:47:15.730 --> 00:47:21.710
And this is one cache miss
per entry that I access in B.

00:47:21.710 --> 00:47:25.420
And this is not very good,
because the total work was also

00:47:25.420 --> 00:47:26.270
theta of n cubed.

00:47:26.270 --> 00:47:29.170
So I'm not gaining anything
from having any locality

00:47:29.170 --> 00:47:32.900
in this algorithm here.

00:47:32.900 --> 00:47:36.440
So any questions
on this analysis?

00:47:36.440 --> 00:47:39.410
So this is just case 1.

00:47:39.410 --> 00:47:41.580
Let's look at case 2.

00:47:41.580 --> 00:47:46.130
So in this case, n is
less than c M over B.

00:47:46.130 --> 00:47:50.270
So I can fit one block from
each row of B into cache.

00:47:50.270 --> 00:47:55.370
And then n is also greater than
another constant, c prime times

00:47:55.370 --> 00:48:00.080
square root of M, so I can't
fit the whole matrix into cache.

00:48:00.080 --> 00:48:02.600
And again, let's analyze
the number of cache

00:48:02.600 --> 00:48:07.432
misses incurred by
accessing B, assuming LRU.

00:48:07.432 --> 00:48:08.890
So how many cache
misses am I going

00:48:08.890 --> 00:48:12.882
to incur for the
first column of B?

00:48:12.882 --> 00:48:13.382
AUDIENCE: n.

00:48:13.382 --> 00:48:14.007
JULIAN SHUN: n.

00:48:14.007 --> 00:48:15.530
So that's the same as before.

00:48:15.530 --> 00:48:18.470
What about the
second column of B?

00:48:18.470 --> 00:48:24.260
So by the time I get to the
beginning of the matrix here,

00:48:24.260 --> 00:48:26.690
is the top block
going to be in cache?

00:48:29.940 --> 00:48:33.330
So who thinks the block is
still going to be in cache when

00:48:33.330 --> 00:48:35.410
I get back to the beginning?

00:48:35.410 --> 00:48:35.910
Yeah.

00:48:35.910 --> 00:48:37.320
So a couple of people.

00:48:37.320 --> 00:48:39.000
Who thinks it's going
to be out of cache?

00:48:42.550 --> 00:48:46.660
So it turns out it is going
to be in cache, because I

00:48:46.660 --> 00:48:50.710
can fit one block for every
row of B into my cache

00:48:50.710 --> 00:48:53.980
since I have n less
than c M over B.

00:48:53.980 --> 00:48:58.668
So therefore, when I get to the
beginning of the second column,

00:48:58.668 --> 00:49:01.210
that block is still going to be
in cache, because I loaded it

00:49:01.210 --> 00:49:03.050
in when I was accessing
the first column.

00:49:03.050 --> 00:49:04.800
So I'm not going to
incur any cache misses

00:49:04.800 --> 00:49:07.450
for the second column.

00:49:07.450 --> 00:49:14.230
And, in general, if I can fit
B columns or some constant

00:49:14.230 --> 00:49:19.540
times B columns
into cache, then I

00:49:19.540 --> 00:49:23.830
can reduce the number of cache
misses I have by a factor of B.

00:49:23.830 --> 00:49:26.365
So I only need to incur a
cache miss the first time I

00:49:26.365 --> 00:49:29.190
access a block and not for
all the subsequent accesses.

00:49:33.250 --> 00:49:37.740
And the same is true
for the second row of A.

00:49:37.740 --> 00:49:40.500
And since I have
n rows of A, I'm

00:49:40.500 --> 00:49:44.850
going to have n times theta of
n squared over B cache misses.

00:49:44.850 --> 00:49:46.530
For each row of A,
I'm going to incur

00:49:46.530 --> 00:49:49.260
n squared over B cache misses.

00:49:49.260 --> 00:49:52.750
So the overall number of cache
misses is n cubed over B.

00:49:52.750 --> 00:49:55.110
And this is because
inside matrix B

00:49:55.110 --> 00:49:56.850
I can exploit spatial locality.

00:49:56.850 --> 00:50:00.000
Once I load in a block, I
can reuse it the next time

00:50:00.000 --> 00:50:02.280
I traverse down a
column that's nearby.

00:50:06.780 --> 00:50:08.400
Any questions on this analysis?

00:50:16.640 --> 00:50:18.530
So let's look at the third case.

00:50:18.530 --> 00:50:23.120
And here, n is less than c
prime times square root of M.

00:50:23.120 --> 00:50:27.810
So this means that the entire
matrix fits into cache.

00:50:27.810 --> 00:50:30.350
So let's analyze the number
of cache misses for matrix B

00:50:30.350 --> 00:50:32.150
again, assuming LRU.

00:50:32.150 --> 00:50:34.100
So how many cache
misses do I have now?

00:50:36.950 --> 00:50:39.300
So let's count the
total number of cache

00:50:39.300 --> 00:50:50.750
misses I have for every time
I go through a row of A. Yes.

00:50:50.750 --> 00:50:53.540
AUDIENCE: Is it just n
for the first column?

00:50:56.030 --> 00:50:56.780
JULIAN SHUN: Yeah.

00:50:56.780 --> 00:51:00.110
So for the first column,
it's going to be n.

00:51:00.110 --> 00:51:04.000
What about the second column?

00:51:04.000 --> 00:51:05.950
AUDIENCE: [INAUDIBLE]
the second [INAUDIBLE]..

00:51:05.950 --> 00:51:07.420
JULIAN SHUN: Right.

00:51:07.420 --> 00:51:11.042
So basically, for
the first row of A,

00:51:11.042 --> 00:51:13.000
the analysis is going to
be the same as before.

00:51:13.000 --> 00:51:16.870
I need n squared over B cache
misses to load the matrix in.

00:51:16.870 --> 00:51:18.750
What about the second row of A?

00:51:18.750 --> 00:51:20.500
How many cache misses
am I going to incur?

00:51:27.262 --> 00:51:30.230
AUDIENCE: [INAUDIBLE].

00:51:30.230 --> 00:51:30.980
JULIAN SHUN: Yeah.

00:51:30.980 --> 00:51:32.420
So for the second
row of A, I'm not

00:51:32.420 --> 00:51:33.770
going to incur any cache misses.

00:51:33.770 --> 00:51:36.173
Because once I
load B into cache,

00:51:36.173 --> 00:51:37.340
it's going to stay in cache.

00:51:37.340 --> 00:51:39.470
Because the entire
matrix can fit in cache,

00:51:39.470 --> 00:51:44.870
since I assumed that n is less than c
prime times square root of M.

00:51:44.870 --> 00:51:46.340
So total number
of cache misses I

00:51:46.340 --> 00:51:50.900
need for matrix B is theta of n
squared over B since everything

00:51:50.900 --> 00:51:51.660
fits in cache.

00:51:51.660 --> 00:51:54.770
And I just apply the submatrix
caching lemma from before.

00:51:58.100 --> 00:52:00.290
Overall, this is not
a very good algorithm.

00:52:00.290 --> 00:52:02.360
Because as you
recall, in case 1 I

00:52:02.360 --> 00:52:06.410
needed a cubic number
of cache misses.

00:52:09.200 --> 00:52:12.980
What happens if I swap the
order of the inner two loops?

00:52:12.980 --> 00:52:16.850
So recall that this was one of
the optimizations in lecture 1,

00:52:16.850 --> 00:52:19.910
when Charles was talking
about matrix multiplication

00:52:19.910 --> 00:52:22.250
and how to speed it up.

00:52:22.250 --> 00:52:26.450
So if I swapped the order
of the two inner loops,

00:52:26.450 --> 00:52:31.190
then, for every
iteration, what I'm doing

00:52:31.190 --> 00:52:35.450
is I'm actually going over
a row of C and a row of B,

00:52:35.450 --> 00:52:40.520
and the entry of A stays fixed inside
the innermost iteration.

00:52:40.520 --> 00:52:42.950
So now, when I analyze
the number of cache

00:52:42.950 --> 00:52:45.920
misses of matrix
B, assuming LRU,

00:52:45.920 --> 00:52:47.840
I'm going to benefit
from spatial locality,

00:52:47.840 --> 00:52:49.970
since I'm going row by
row and the matrix is

00:52:49.970 --> 00:52:53.030
stored in row-major order.

00:52:53.030 --> 00:52:55.700
So across all of
the rows, I'm just

00:52:55.700 --> 00:53:00.380
going to require theta of n
squared over B cache misses.

00:53:00.380 --> 00:53:04.142
And I have to do this n
times for the outer loop.

00:53:04.142 --> 00:53:05.600
So in total, I'm
going to get theta

00:53:05.600 --> 00:53:08.450
of n cubed over B cache misses.

00:53:08.450 --> 00:53:10.700
So if you swap the order
of the inner two loops

00:53:10.700 --> 00:53:13.697
this significantly improves
the locality of your algorithm

00:53:13.697 --> 00:53:15.530
and you can benefit
from spatial locality.

00:53:15.530 --> 00:53:18.500
That's why we saw a significant
performance improvement

00:53:18.500 --> 00:53:23.750
in the first lecture when we
swapped the order of the loops.
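Concretely, the interchanged version looks something like this (same illustrative signature as the sketch above):

    // Loop order i, k, j: A[i*n + k] stays fixed in the innermost loop,
    // while C and B are scanned along rows, matching row-major layout.
    void matmul_interchanged(double *restrict C, const double *restrict A,
                             const double *restrict B, int n) {
      for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
          for (int j = 0; j < n; ++j)
            C[i*n + j] += A[i*n + k] * B[k*n + j];
    }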

00:53:23.750 --> 00:53:24.560
Any questions?

00:53:31.280 --> 00:53:34.210
So does anybody think
we can do better than n

00:53:34.210 --> 00:53:36.140
cubed over B cache misses?

00:53:36.140 --> 00:53:39.440
Or do you think that
it's the best you can do?

00:53:39.440 --> 00:53:41.510
So how many people
think you can do better?

00:53:46.010 --> 00:53:46.510
Yeah.

00:53:46.510 --> 00:53:49.480
And how many people think
this is the best you can do?

00:53:53.780 --> 00:53:55.970
And how many people don't care?

00:54:00.660 --> 00:54:03.960
So it turns out
you can do better.

00:54:03.960 --> 00:54:06.060
And we're going to
do better by using

00:54:06.060 --> 00:54:09.870
an optimization called tiling.

00:54:09.870 --> 00:54:12.210
So how this is going
to work is instead

00:54:12.210 --> 00:54:13.910
of just having
three for loops, I'm

00:54:13.910 --> 00:54:15.570
going to have six for loops.

00:54:15.570 --> 00:54:19.220
And I'm going to
loop over tiles.

00:54:19.220 --> 00:54:22.070
So I've got a loop over
s by s submatrices.

00:54:22.070 --> 00:54:24.110
And within each
submatrix, I'm going

00:54:24.110 --> 00:54:27.050
to do all of the computation
I need for that submatrix

00:54:27.050 --> 00:54:30.270
before moving on to
the next submatrix.

00:54:30.270 --> 00:54:32.840
So the three innermost
loops are going

00:54:32.840 --> 00:54:36.710
to loop inside a submatrix,
and the three outermost loops

00:54:36.710 --> 00:54:39.110
are going to loop within
the larger matrix,

00:54:39.110 --> 00:54:42.710
one submatrix at a time.
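Here is a hedged C sketch of that six-loop structure (assuming for simplicity that the tile size s divides n; the names are illustrative):

    // Tiled matrix multiply: the three outer loops walk over s x s tiles,
    // and the three inner loops do all the work within one tile triple
    // before moving on to the next.
    void matmul_tiled(double *restrict C, const double *restrict A,
                      const double *restrict B, int n, int s) {
      for (int ih = 0; ih < n; ih += s)
        for (int jh = 0; jh < n; jh += s)
          for (int kh = 0; kh < n; kh += s)
            for (int i = ih; i < ih + s; ++i)
              for (int j = jh; j < jh + s; ++j)
                for (int k = kh; k < kh + s; ++k)
                  C[i*n + j] += A[i*n + k] * B[k*n + j];
    }

With s chosen so that three s by s tiles fit in cache at once, each tile triple is loaded once and then fully reused.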

00:54:42.710 --> 00:54:45.330
So let's analyze the
work of this algorithm.

00:54:48.150 --> 00:54:54.380
So the work that we need to
do for a submatrix of size

00:54:54.380 --> 00:54:58.610
s by s is going to be s cubed,
since that's just a bound

00:54:58.610 --> 00:55:00.950
for matrix multiplication.

00:55:00.950 --> 00:55:04.160
And then the number of times I
have to operate on submatrices

00:55:04.160 --> 00:55:07.590
is going to be n over s cubed.

00:55:07.590 --> 00:55:11.210
And you can see this if you just
consider each submatrix to be

00:55:11.210 --> 00:55:13.820
a single element, and then
using the same cubic work

00:55:13.820 --> 00:55:18.740
analysis on the smaller matrix.

00:55:18.740 --> 00:55:22.710
So the work is n over
s cubed times s cubed,

00:55:22.710 --> 00:55:24.620
which is equal to
theta of n cubed.

00:55:24.620 --> 00:55:27.800
So the work of this tiled
matrix multiply is the same

00:55:27.800 --> 00:55:31.820
as the version that
didn't do tiling.

00:55:31.820 --> 00:55:34.040
And now, let's analyze the
number of cache misses.

00:55:38.390 --> 00:55:42.020
So we're going to tune s so
that the submatrices just

00:55:42.020 --> 00:55:43.100
fit into cache.

00:55:43.100 --> 00:55:46.250
So we're going to set
s to be equal to theta

00:55:46.250 --> 00:55:53.990
of square root of M. We
actually need to make this 1/3

00:55:53.990 --> 00:55:55.760
square root of M,
because we need to fit

00:55:55.760 --> 00:55:57.800
three submatrices in the cache.

00:55:57.800 --> 00:55:59.780
But it's going to be some
constant times square

00:55:59.780 --> 00:56:02.780
root of M.

00:56:02.780 --> 00:56:07.190
The submatrix caching lemma
implies that for each submatrix

00:56:07.190 --> 00:56:10.550
we're going to need s squared
over B misses to load it in.

00:56:10.550 --> 00:56:13.850
And once we've loaded it into cache,
it fits entirely into cache,

00:56:13.850 --> 00:56:16.430
so we can do all of our
computations within cache

00:56:16.430 --> 00:56:18.230
and not incur any
more cache misses.

00:56:21.530 --> 00:56:23.540
So therefore, the
total number of cache

00:56:23.540 --> 00:56:26.027
misses we're going
to incur is

00:56:26.027 --> 00:56:27.860
going to be the number
of subproblems, which

00:56:27.860 --> 00:56:30.860
is n over s cubed, times
the number of cache

00:56:30.860 --> 00:56:35.210
misses per subproblem,
which is s squared over B.

00:56:35.210 --> 00:56:37.530
And if you multiply
this out, you're

00:56:37.530 --> 00:56:43.070
going to get n cubed over
B times square root of M.

00:56:43.070 --> 00:56:45.500
So here, I plugged in
square root of M for s.
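Written as a worked equation, the same computation is:

\[
\Big(\frac{n}{s}\Big)^{3} \cdot \Theta\!\Big(\frac{s^{2}}{B}\Big)
= \Theta\!\Big(\frac{n^{3}}{sB}\Big)
= \Theta\!\Big(\frac{n^{3}}{B\sqrt{M}}\Big)
\qquad \text{when } s = \Theta(\sqrt{M}).
\]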

00:56:48.440 --> 00:56:49.940
And this is a
pretty cool result,

00:56:49.940 --> 00:56:51.950
because it says that you
can actually do better

00:56:51.950 --> 00:56:53.540
than the n cubed over B bound.

00:56:53.540 --> 00:56:58.520
You can improve this bound by
a factor of square root of M.

00:56:58.520 --> 00:57:00.950
And in practice,
square root of M

00:57:00.950 --> 00:57:04.230
is actually not insignificant.

00:57:04.230 --> 00:57:07.250
So, for example, if you're
looking at the last level

00:57:07.250 --> 00:57:10.290
cache, the size of that is
on the order of megabytes.

00:57:10.290 --> 00:57:12.080
So a square root
of M is going to be

00:57:12.080 --> 00:57:13.340
on the order of thousands.

00:57:13.340 --> 00:57:15.710
So this significantly
improves the performance

00:57:15.710 --> 00:57:18.110
of the matrix
multiplication code

00:57:18.110 --> 00:57:20.750
if you tune s so that
the submatrices just

00:57:20.750 --> 00:57:23.540
fit in the cache.

00:57:23.540 --> 00:57:26.180
It turns out that
this bound is optimal.

00:57:26.180 --> 00:57:30.590
So this was shown by Hong and Kung in 1981.

00:57:30.590 --> 00:57:32.760
So for cubic work
matrix multiplication,

00:57:32.760 --> 00:57:33.950
this is the best you can do.

00:57:33.950 --> 00:57:35.960
If you use another matrix
multiply algorithm,

00:57:35.960 --> 00:57:40.380
like Strassen's algorithm,
you can do better.

00:57:40.380 --> 00:57:42.230
So I want you to
remember this bound.

00:57:42.230 --> 00:57:44.910
It's a very important
bound to know.

00:57:44.910 --> 00:57:48.050
It says that for a
matrix multiplication

00:57:48.050 --> 00:57:51.440
you can benefit both from
spatial locality as well

00:57:51.440 --> 00:57:53.160
as temporal locality.

00:57:53.160 --> 00:57:58.820
So I get spatial locality in
the B term in the denominator.

00:57:58.820 --> 00:58:00.500
And then the square
root of M term

00:58:00.500 --> 00:58:02.510
comes from temporal
locality, since I'm

00:58:02.510 --> 00:58:04.730
doing all of the work
inside a submatrix

00:58:04.730 --> 00:58:07.310
before I evict that
submatrix from cache.

00:58:10.190 --> 00:58:13.250
Any questions on this analysis?

00:58:13.250 --> 00:58:15.640
So what's one issue with
this algorithm here?

00:58:19.920 --> 00:58:20.697
Yes.

00:58:20.697 --> 00:58:23.030
AUDIENCE: It's not portable,
like different architecture

00:58:23.030 --> 00:58:24.120
[INAUDIBLE].

00:58:24.120 --> 00:58:24.870
JULIAN SHUN: Yeah.

00:58:24.870 --> 00:58:27.930
So the problem here
is I have to tune s

00:58:27.930 --> 00:58:30.910
for my particular machine.

00:58:30.910 --> 00:58:32.670
And I call this a
voodoo parameter.

00:58:32.670 --> 00:58:36.420
It's sort of like a magic
number I put into my program

00:58:36.420 --> 00:58:39.900
so that it fits in the cache
on the particular machine I'm

00:58:39.900 --> 00:58:40.920
running on.

00:58:40.920 --> 00:58:42.630
And this makes the
code not portable,

00:58:42.630 --> 00:58:46.200
because if I try to run this
code on another machine,

00:58:46.200 --> 00:58:49.480
the cache sizes might
be different there,

00:58:49.480 --> 00:58:51.450
and then I won't get
the same performance

00:58:51.450 --> 00:58:53.130
as I did on my machine.

00:58:55.710 --> 00:58:57.840
And this is also an issue
even if you're running it

00:58:57.840 --> 00:58:59.423
on the same machine,
because you might

00:58:59.423 --> 00:59:01.620
have other programs
running at the same time

00:59:01.620 --> 00:59:03.330
and using up part of the cache.

00:59:03.330 --> 00:59:06.540
So you don't actually
know how much of the cache

00:59:06.540 --> 00:59:10.020
your program actually gets
to use in a multiprogramming

00:59:10.020 --> 00:59:11.036
environment.

00:59:14.610 --> 00:59:17.280
And then this was also just
for one level of cache.

00:59:17.280 --> 00:59:20.550
If we want to optimize
for two levels of caches,

00:59:20.550 --> 00:59:23.910
we're going to have two
voodoo parameters, s and t.

00:59:23.910 --> 00:59:27.370
We're going to have submatrices
and sub-submatrices.

00:59:27.370 --> 00:59:29.970
And then we have to tune
both of these parameters

00:59:29.970 --> 00:59:32.310
to get the best
performance on our machine.

00:59:32.310 --> 00:59:34.410
And multi-dimensional
tuning optimization

00:59:34.410 --> 00:59:36.790
can't be done simply
with binary search.

00:59:36.790 --> 00:59:38.790
So if you're just tuning
for one level of cache,

00:59:38.790 --> 00:59:41.220
you can do a binary
search on the parameter s,

00:59:41.220 --> 00:59:43.470
but here you can't
do binary search.

00:59:43.470 --> 00:59:47.910
So it's much more
expensive to optimize here.

00:59:47.910 --> 00:59:51.180
And the code becomes
a little bit messier.

00:59:51.180 --> 00:59:55.580
You have nine for
loops instead of six.

00:59:55.580 --> 00:59:59.330
And how many levels of caches
do we have on the machines

00:59:59.330 --> 01:00:00.870
that we're using today?

01:00:00.870 --> 01:00:01.630
AUDIENCE: Three.

01:00:01.630 --> 01:00:02.810
JULIAN SHUN: Three.

01:00:02.810 --> 01:00:06.920
So for a three-level cache, you
have three voodoo parameters.

01:00:06.920 --> 01:00:08.510
You have 12 nested for loops.

01:00:08.510 --> 01:00:11.480
This code becomes very ugly.

01:00:11.480 --> 01:00:13.310
And you have to tune
these parameters

01:00:13.310 --> 01:00:15.300
for your particular machine.

01:00:15.300 --> 01:00:17.870
And this makes the
code not very portable,

01:00:17.870 --> 01:00:19.970
as one student pointed out.

01:00:19.970 --> 01:00:21.650
And in a multiprogramming
environment,

01:00:21.650 --> 01:00:23.990
you don't actually know
the effective cache size

01:00:23.990 --> 01:00:25.490
that your program has access to.

01:00:25.490 --> 01:00:28.073
Because other jobs are running
at the same time, and therefore

01:00:28.073 --> 01:00:30.948
it's very easy to
mistune the parameters.

01:00:30.948 --> 01:00:31.740
Was there a question?

01:00:31.740 --> 01:00:33.130
No?

01:00:33.130 --> 01:00:35.310
So any questions?

01:00:35.310 --> 01:00:35.810
Yeah.

01:00:35.810 --> 01:00:37.563
AUDIENCE: Is there a way to
programmatically get

01:00:37.563 --> 01:00:38.850
the size of the cache?

01:00:38.850 --> 01:00:40.120
[INAUDIBLE]

01:00:40.120 --> 01:00:40.870
JULIAN SHUN: Yeah.

01:00:40.870 --> 01:00:43.610
So you can auto-tune
your program

01:00:43.610 --> 01:00:47.090
so that it's optimized
for the cache sizes

01:00:47.090 --> 01:00:48.283
of your particular machine.

01:00:48.283 --> 01:00:49.658
AUDIENCE: [INAUDIBLE]
instruction

01:00:49.658 --> 01:00:52.640
to get the size of
the cache [INAUDIBLE]..

01:00:52.640 --> 01:00:56.473
JULIAN SHUN: Instruction to
get the size of your cache.

01:00:56.473 --> 01:00:57.390
I'm not actually sure.

01:00:57.390 --> 01:00:57.890
Do you know?

01:00:57.890 --> 01:00:59.172
AUDIENCE: [INAUDIBLE] in--

01:00:59.172 --> 01:01:00.534
AUDIENCE: [INAUDIBLE].

01:01:00.534 --> 01:01:02.410
AUDIENCE: Yeah, in the proc--

01:01:07.595 --> 01:01:09.400
JULIAN SHUN: Yeah, proc cpuinfo.

01:01:09.400 --> 01:01:10.980
AUDIENCE: Yeah. proc cpuinfo
or something like that.

01:01:10.980 --> 01:01:11.730
JULIAN SHUN: Yeah.

01:01:11.730 --> 01:01:14.260
So you can probably
get that as well.

01:01:14.260 --> 01:01:16.367
AUDIENCE: And I
think if you google,

01:01:16.367 --> 01:01:17.950
I think you'll find
it pretty quickly.

01:01:17.950 --> 01:01:18.300
JULIAN SHUN: Yeah.

01:01:18.300 --> 01:01:18.925
AUDIENCE: Yeah.

01:01:23.400 --> 01:01:25.710
But even if you do
that, and you're

01:01:25.710 --> 01:01:27.960
running this program when
other jobs are running,

01:01:27.960 --> 01:01:30.570
you don't actually know how much
cache your program has access

01:01:30.570 --> 01:01:30.780
to.
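As an aside, one way to query cache sizes programmatically on Linux is via sysconf. This is a sketch; the _SC_ names below are glibc extensions and may not be available on every platform:

    /* Query cache geometry at runtime (glibc-specific sysconf names;
       sysconf may return 0 or -1 if a value is unknown). */
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
      long l1d  = sysconf(_SC_LEVEL1_DCACHE_SIZE);     /* L1 data cache, bytes */
      long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE); /* cache line, bytes */
      long l3   = sysconf(_SC_LEVEL3_CACHE_SIZE);      /* LLC, bytes */
      printf("L1d = %ld B, line = %ld B, L3 = %ld B\n", l1d, line, l3);
      return 0;
    }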

01:01:30.780 --> 01:01:31.280
Yes?

01:01:31.280 --> 01:01:34.140
AUDIENCE: Is cache architecture
and stuff like that

01:01:34.140 --> 01:01:37.110
optimized around
matrix problems?

01:01:37.110 --> 01:01:38.355
JULIAN SHUN: No.

01:01:38.355 --> 01:01:41.370
They're actually
general purpose.

01:01:41.370 --> 01:01:43.320
Today, we're just looking
at matrix multiply,

01:01:43.320 --> 01:01:46.290
but on Thursday's
lecture we'll actually

01:01:46.290 --> 01:01:47.880
be looking at many
other problems

01:01:47.880 --> 01:01:50.848
and how to optimize them
for the cache hierarchy.

01:01:56.180 --> 01:01:57.312
Other questions?

01:02:01.790 --> 01:02:06.500
So this was a good algorithm
in terms of cache performance,

01:02:06.500 --> 01:02:07.935
but it wasn't very portable.

01:02:07.935 --> 01:02:09.310
So let's see if
we can do better.

01:02:09.310 --> 01:02:12.050
Let's see if we can come
up with a simpler design

01:02:12.050 --> 01:02:15.390
where we still get pretty
good cache performance.

01:02:15.390 --> 01:02:19.250
So we're going to turn
to divide and conquer.

01:02:19.250 --> 01:02:21.770
We're going to look at the
recursive matrix multiplication

01:02:21.770 --> 01:02:24.750
algorithm that we saw before.

01:02:24.750 --> 01:02:26.750
Again, we're going to
deal with square matrices,

01:02:26.750 --> 01:02:30.330
but the results generalize
to non-square matrices.

01:02:30.330 --> 01:02:33.800
So how this works is
we're going to split

01:02:33.800 --> 01:02:37.340
our [INAUDIBLE] matrices
into four submatrices or four

01:02:37.340 --> 01:02:38.990
quadrants.

01:02:38.990 --> 01:02:41.220
And then for each quadrant
of the output matrix,

01:02:41.220 --> 01:02:45.110
it's just going to be the sum
of two matrix multiplies on n

01:02:45.110 --> 01:02:46.700
over 2 by n over 2 matrices.

01:02:46.700 --> 01:02:51.260
So C 1 1 is going
to be A 1 1 times B 1 1,

01:02:51.260 --> 01:02:54.530
plus A 1 2 times B 2 1.

01:02:54.530 --> 01:02:56.900
And then we're going
to do this recursively.

01:02:56.900 --> 01:03:00.140
So every level of
recursion we're

01:03:00.140 --> 01:03:04.070
going to get eight
multiply-adds of n over 2

01:03:04.070 --> 01:03:07.580
by n over 2 matrices.

01:03:07.580 --> 01:03:10.440
Here's what the recursive
code looks like.

01:03:10.440 --> 01:03:14.660
You can see that we have
eight recursive calls here.

01:03:14.660 --> 01:03:17.060
The base case here is of size 1.

01:03:17.060 --> 01:03:19.760
In practice, you want to coarsen
the base case to overcome

01:03:19.760 --> 01:03:20.930
function call overheads.

01:03:23.690 --> 01:03:27.480
Let's also look at what these
values here correspond to.

01:03:27.480 --> 01:03:31.890
So I've color coded these
so that they correspond

01:03:31.890 --> 01:03:33.570
to particular elements
in the submatrix

01:03:33.570 --> 01:03:36.330
that I'm looking
at on the right.

01:03:36.330 --> 01:03:39.060
So these values here
correspond to the index

01:03:39.060 --> 01:03:41.700
of the first element in
each of my quadrants.

01:03:41.700 --> 01:03:43.920
So the first element
in my first quadrant

01:03:43.920 --> 01:03:47.250
is just going to
have an offset of 0.

01:03:47.250 --> 01:03:50.370
And then the first element
of my second quadrant,

01:03:50.370 --> 01:03:51.870
that's going to
be on the same row

01:03:51.870 --> 01:03:54.120
as the first element
in my first quadrant.

01:03:54.120 --> 01:04:02.790
So I just need to add the
width of my quadrant, which

01:04:02.790 --> 01:04:04.410
is n over 2.

01:04:04.410 --> 01:04:09.480
And then to get the first
element in quadrant 2 1,

01:04:09.480 --> 01:04:12.850
I'm going to jump over
n over 2 rows.

01:04:12.850 --> 01:04:16.140
And each row has
a length row size,

01:04:16.140 --> 01:04:18.930
so it's just going to be
n over 2 times row size.

01:04:18.930 --> 01:04:23.400
And then to get the first
element in quadrant 2 2,

01:04:23.400 --> 01:04:27.810
it's just the first element
in quadrant 2 1 plus n over 2.

01:04:27.810 --> 01:04:30.450
So that's n over 2
times row size plus 1.
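Putting those offsets together, here is a hedged C sketch of the recursion (THRESHOLD and the coarsened base case are illustrative assumptions -- the lecture's base case is size 1 -- and n is assumed to be a power of 2):

    #define THRESHOLD 32  /* coarsen the base case to amortize call overhead */

    // C += A * B on n x n submatrices of a matrix whose full row length is
    // rowsize; the quadrant offsets match the values described above.
    void mm_rec(double *restrict C, const double *restrict A,
                const double *restrict B, int n, int rowsize) {
      if (n <= THRESHOLD) {
        for (int i = 0; i < n; ++i)
          for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)
              C[i*rowsize + j] += A[i*rowsize + k] * B[k*rowsize + j];
        return;
      }
      int h = n / 2;               // quadrant side length
      int o12 = h;                 // offset of quadrant (1,2)
      int o21 = h * rowsize;       // offset of quadrant (2,1)
      int o22 = h * rowsize + h;   // offset of quadrant (2,2)
      mm_rec(C,       A,       B,       h, rowsize);  // C11 += A11*B11
      mm_rec(C,       A + o12, B + o21, h, rowsize);  // C11 += A12*B21
      mm_rec(C + o12, A,       B + o12, h, rowsize);  // C12 += A11*B12
      mm_rec(C + o12, A + o12, B + o22, h, rowsize);  // C12 += A12*B22
      mm_rec(C + o21, A + o21, B,       h, rowsize);  // C21 += A21*B11
      mm_rec(C + o21, A + o22, B + o21, h, rowsize);  // C21 += A22*B21
      mm_rec(C + o22, A + o21, B + o12, h, rowsize);  // C22 += A21*B12
      mm_rec(C + o22, A + o22, B + o22, h, rowsize);  // C22 += A22*B22
    }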

01:04:34.540 --> 01:04:38.390
So let's analyze the
work of this algorithm.

01:04:38.390 --> 01:04:41.930
So what's the recurrence
for this algorithm--

01:04:41.930 --> 01:04:44.750
for the work of this algorithm?

01:04:44.750 --> 01:04:46.300
So how many
subproblems do we have?

01:04:46.300 --> 01:04:47.078
AUDIENCE: Eight

01:04:47.078 --> 01:04:47.870
JULIAN SHUN: Eight.

01:04:47.870 --> 01:04:53.840
And what's the size of
each subproblem? n over 2.

01:04:53.840 --> 01:04:57.800
And how much work are we doing
to set up the recursive calls?

01:05:00.887 --> 01:05:03.250
A constant amount of work.

01:05:03.250 --> 01:05:06.580
So the recurrence is
W of n is equal to 8 W

01:05:06.580 --> 01:05:09.280
of n over 2 plus theta of 1.

01:05:09.280 --> 01:05:12.560
And what does that solve to?

01:05:12.560 --> 01:05:13.440
n cubed.

01:05:13.440 --> 01:05:16.500
So it's one of the three
cases of the master theorem.

01:05:20.850 --> 01:05:24.360
We're actually going to
analyze this in more detail

01:05:24.360 --> 01:05:25.920
by drawing out the
recursion tree.

01:05:25.920 --> 01:05:29.190
And this is going to give
us more intuition about why

01:05:29.190 --> 01:05:32.540
the master theorem is true.

01:05:32.540 --> 01:05:35.480
So at the top level
of my recursion tree,

01:05:35.480 --> 01:05:38.320
I'm going to have a
problem of size n.

01:05:38.320 --> 01:05:41.570
And then I'm going to branch
into eight subproblems of size

01:05:41.570 --> 01:05:42.217
n over 2.

01:05:42.217 --> 01:05:44.300
And then I'm going to do
a constant amount of work

01:05:44.300 --> 01:05:45.950
to set up the recursive calls.

01:05:45.950 --> 01:05:47.570
Here, I'm just
labeling this with one.

01:05:47.570 --> 01:05:48.820
So I'm ignoring the constants.

01:05:48.820 --> 01:05:52.670
But it's not going to matter
for asymptotic analysis.

01:05:52.670 --> 01:05:54.560
And then I'm going
to branch again

01:05:54.560 --> 01:05:58.250
into eight subproblems
of size n over 4.

01:05:58.250 --> 01:06:01.790
And eventually, I'm going
to get down to the leaves.

01:06:01.790 --> 01:06:06.342
And how many levels do I have
until I get to the leaves?

01:06:11.510 --> 01:06:12.010
Yes?

01:06:12.010 --> 01:06:12.750
AUDIENCE: Log n.

01:06:12.750 --> 01:06:13.500
JULIAN SHUN: Yeah.

01:06:13.500 --> 01:06:17.790
So log n-- what's
the base of the log?

01:06:17.790 --> 01:06:18.290
Yeah.

01:06:18.290 --> 01:06:21.000
So it's log base 2 of n,
because I'm dividing my problem

01:06:21.000 --> 01:06:22.470
size by 2 every time.

01:06:24.942 --> 01:06:26.400
And therefore, the
number of leaves

01:06:26.400 --> 01:06:28.950
I have is going to be 8
to the log base 2 of n,

01:06:28.950 --> 01:06:31.500
because I'm branching it
eight ways every time.

01:06:31.500 --> 01:06:35.400
8 to the log base 2 of n is
the same as n to the log base

01:06:35.400 --> 01:06:37.230
2 of 8, which is n cubed.

01:06:40.660 --> 01:06:44.740
The amount of work I'm doing
at the top level is constant.

01:06:44.740 --> 01:06:47.530
So I'm just going to say 1 here.

01:06:47.530 --> 01:06:52.450
At the next level, it's
eight times, then 64.

01:06:52.450 --> 01:06:54.210
And then when I
get to the leaves,

01:06:54.210 --> 01:06:55.900
it's going to be
theta of n cubed,

01:06:55.900 --> 01:06:58.330
since I have n cubed
leaves, and they're all

01:06:58.330 --> 01:07:01.090
doing constant work.

01:07:01.090 --> 01:07:04.060
And the work is geometrically
increasing as I go down

01:07:04.060 --> 01:07:05.020
the recursion tree.

01:07:05.020 --> 01:07:07.780
So the overall work is
just dominated by the work

01:07:07.780 --> 01:07:09.850
I need to do at the leaves.

01:07:09.850 --> 01:07:13.780
So the overall work is just
going to be theta of n cubed.

01:07:13.780 --> 01:07:15.430
And this is the
same as the looping

01:07:15.430 --> 01:07:18.100
versions of matrix multiply--

01:07:18.100 --> 01:07:20.410
they're all cubic work.

01:07:20.410 --> 01:07:22.990
Now, let's analyze the number
of cache misses of this divide

01:07:22.990 --> 01:07:26.260
and conquer algorithm.

01:07:26.260 --> 01:07:29.540
So now, my recurrence is
going to be different.

01:07:29.540 --> 01:07:34.400
My base case now is when the
submatrix fits in the cache--

01:07:34.400 --> 01:07:38.200
so when n squared is less than
c M. And when that's true,

01:07:38.200 --> 01:07:40.690
I just need to load that
submatrix into cache,

01:07:40.690 --> 01:07:43.300
and then I don't incur
any more cache misses.

01:07:43.300 --> 01:07:45.390
So I need theta of n
squared over B cache

01:07:45.390 --> 01:07:49.840
misses when n squared is less
than c M for some sufficiently

01:07:49.840 --> 01:07:52.360
small constant c, less
than or equal to 1.

01:07:52.360 --> 01:07:56.680
And then, otherwise, I recurse
into 8 subproblems of size n

01:07:56.680 --> 01:07:57.460
over 2.

01:07:57.460 --> 01:07:59.290
And then I add theta
of 1, because I'm

01:07:59.290 --> 01:08:03.740
doing a constant amount of work
to set up the recursive calls.

01:08:03.740 --> 01:08:06.700
And I get this theta of
n squared over B term

01:08:06.700 --> 01:08:08.935
from the submatrix
caching lemma.

01:08:08.935 --> 01:08:12.430
It says I can just load the
entire matrix into cache

01:08:12.430 --> 01:08:15.020
with this many cache misses.

01:08:15.020 --> 01:08:18.359
So the difference between
the cache analysis here

01:08:18.359 --> 01:08:20.859
and the work analysis before
is that I have a different base

01:08:20.859 --> 01:08:22.510
case.

01:08:22.510 --> 01:08:24.460
And I think in all
of the algorithms

01:08:24.460 --> 01:08:26.979
that you've seen before,
the base case was always

01:08:26.979 --> 01:08:27.970
of a constant size.

01:08:27.970 --> 01:08:29.800
But here, we're working
with a base case

01:08:29.800 --> 01:08:31.350
that's not of a constant size.

01:08:34.359 --> 01:08:36.790
So let's try to analyze this
using the recursion tree

01:08:36.790 --> 01:08:38.390
approach.

01:08:38.390 --> 01:08:42.260
So at the top level, I
have a problem of size n

01:08:42.260 --> 01:08:44.649
that I'm going to branch
into eight problems of size n

01:08:44.649 --> 01:08:45.160
over 2.

01:08:45.160 --> 01:08:48.170
And then I'm also going to
incur a constant number of cache

01:08:48.170 --> 01:08:48.670
misses.

01:08:48.670 --> 01:08:51.580
I'm just going to say 1 here.

01:08:51.580 --> 01:08:54.850
Then I'm going to branch again.

01:08:54.850 --> 01:08:58.210
And then, eventually, I'm
going to get to the base case

01:08:58.210 --> 01:09:01.840
where n squared
is less than c M.

01:09:01.840 --> 01:09:05.649
And when n squared is less than
c M, then the number of cache

01:09:05.649 --> 01:09:07.300
misses that I'm going
to incur is going

01:09:07.300 --> 01:09:12.460
to be theta of c M over B. So
I can just plug in c M here

01:09:12.460 --> 01:09:15.790
for n squared.

01:09:15.790 --> 01:09:17.830
And the number of
levels of recursion

01:09:17.830 --> 01:09:22.340
I have in this recursion tree is
no longer just log base 2 of n.

01:09:22.340 --> 01:09:27.370
I'm going to have log base
2 of n minus log base 2

01:09:27.370 --> 01:09:31.149
of square root of c M
number of levels, which

01:09:31.149 --> 01:09:33.850
is the same as log base
2 of n minus 1/2 times

01:09:33.850 --> 01:09:40.390
log base 2 of c M. And then,
the number of leaves I get

01:09:40.390 --> 01:09:44.710
is going to be 8 to this
number of levels here.

01:09:44.710 --> 01:09:50.680
So it's 8 to log base 2 of n
minus 1/2 of log base 2 of c M.

01:09:50.680 --> 01:09:56.400
And this is equal to the theta
of n cubed over M to the 3/2.

01:09:56.400 --> 01:10:00.580
So the n cubed comes from the
8 to the log base 2 of n term.

01:10:00.580 --> 01:10:07.450
And then if I do 8 to the
negative 1/2 of log base 2

01:10:07.450 --> 01:10:12.520
of c M, that's just going
to give me M to the 3/2

01:10:12.520 --> 01:10:13.480
in the denominator.
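As a worked equation, that leaf count, following the same algebra as in the prose, is:

\[
8^{\log_2 n \,-\, \frac{1}{2}\log_2(cM)}
= \frac{8^{\log_2 n}}{\big(2^{\log_2(cM)}\big)^{3/2}}
= \frac{n^{3}}{(cM)^{3/2}}
= \Theta\!\Big(\frac{n^{3}}{M^{3/2}}\Big).
\]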

01:10:16.210 --> 01:10:19.160
So any questions on how I
computed the number of levels

01:10:19.160 --> 01:10:20.627
of this recursion tree here?

01:10:29.400 --> 01:10:32.110
So I'm basically dividing
my problem size by 2

01:10:32.110 --> 01:10:35.410
until I get to a problem
size that fits in the cache.

01:10:35.410 --> 01:10:40.180
So that means n is less
than square root of c M.

01:10:40.180 --> 01:10:42.310
So therefore, I can
subtract that many levels

01:10:42.310 --> 01:10:43.556
for my recursion tree.

01:10:46.248 --> 01:10:47.790
And then to get the
number of leaves,

01:10:47.790 --> 01:10:49.320
since I'm branching
eight ways, I

01:10:49.320 --> 01:10:52.630
just do 8 to the power of
the number of levels I have.

01:10:52.630 --> 01:10:54.713
And then that gives me the
total number of leaves.

01:10:58.580 --> 01:11:00.320
So now, let's analyze
a number of cache

01:11:00.320 --> 01:11:03.440
misses I need each level
of this recursion tree.

01:11:03.440 --> 01:11:05.630
At the top level, I
have a constant number

01:11:05.630 --> 01:11:06.710
of cache misses--

01:11:06.710 --> 01:11:08.240
let's just say 1.

01:11:08.240 --> 01:11:12.530
The next level, I have 8, 64.

01:11:12.530 --> 01:11:14.540
And then at the
leaves, I'm going

01:11:14.540 --> 01:11:18.050
to have theta of n cubed over
B times square root of M cache

01:11:18.050 --> 01:11:18.960
misses.

01:11:18.960 --> 01:11:21.620
And I got this quantity
just by multiplying

01:11:21.620 --> 01:11:23.660
the number of
leaves by the number

01:11:23.660 --> 01:11:25.040
of cache misses per leaf.

01:11:25.040 --> 01:11:28.730
So number of leaves is n
cubed over M to the 3/2.

01:11:28.730 --> 01:11:32.150
The cache misses per leaf
is theta of c M over B.

01:11:32.150 --> 01:11:35.640
So I lose one factor of
M in the denominator.

01:11:35.640 --> 01:11:37.940
I'm left with the square
root of M at the bottom.

01:11:37.940 --> 01:11:41.450
And then I also divide
by the block size B.

01:11:41.450 --> 01:11:45.110
So overall, I get n cubed over
B times square root of M cache

01:11:45.110 --> 01:11:46.070
misses.

01:11:46.070 --> 01:11:48.440
And again, this is
a geometric series.

01:11:48.440 --> 01:11:50.690
And the number of cache
misses at the leaves

01:11:50.690 --> 01:11:53.372
dominates all of
the other levels.

01:11:53.372 --> 01:11:54.830
So the total number
of cache misses

01:11:54.830 --> 01:11:57.980
I have is going to
be theta of n cubed

01:11:57.980 --> 01:12:00.896
over B times square root of M.

01:12:00.896 --> 01:12:04.630
And notice that I'm getting
the same number of cache

01:12:04.630 --> 01:12:07.330
misses as I did with the
tiling version of the code.

01:12:07.330 --> 01:12:09.710
But here, I don't actually
have to tune my code

01:12:09.710 --> 01:12:12.510
for the particular cache size.

01:12:12.510 --> 01:12:14.958
So what cache sizes
does this code work for?

01:12:22.130 --> 01:12:24.481
So is this code going
to work on your machine?

01:12:27.920 --> 01:12:30.700
Is it going to get
good cache performance?

01:12:30.700 --> 01:12:33.340
So this code is going to
work for all cache sizes,

01:12:33.340 --> 01:12:38.370
because I didn't tune it for
any particular cache size.

01:12:38.370 --> 01:12:42.250
And this is what's known as
a cache-oblivious algorithm.

01:12:42.250 --> 01:12:44.300
It doesn't have any
voodoo tuning parameters,

01:12:44.300 --> 01:12:47.030
it has no explicit
knowledge of the caches,

01:12:47.030 --> 01:12:49.540
and it's essentially
passively auto-tuning itself

01:12:49.540 --> 01:12:53.710
for the particular cache
size of your machine.

01:12:53.710 --> 01:12:56.620
It can also work for
multi-level caches

01:12:56.620 --> 01:12:59.470
automatically, because I never
specified what level of cache

01:12:59.470 --> 01:13:00.940
I'm analyzing this for.

01:13:00.940 --> 01:13:03.170
I can analyze it for
any level of cache,

01:13:03.170 --> 01:13:06.330
and it's still going to give
me good cache complexity.

01:13:06.330 --> 01:13:08.680
And this is also good in
multiprogramming environments,

01:13:08.680 --> 01:13:10.490
where you might have
other jobs running

01:13:10.490 --> 01:13:12.410
and you don't know your
effective cache size.

01:13:12.410 --> 01:13:14.660
This is just going to passively
auto-tune for whatever

01:13:14.660 --> 01:13:15.700
cache size is available.

01:13:18.780 --> 01:13:21.620
It turns out that the best
cache-oblivious codes to date

01:13:21.620 --> 01:13:24.150
work on arbitrary
rectangular matrices.

01:13:24.150 --> 01:13:26.480
I just talked about
square matrices,

01:13:26.480 --> 01:13:29.000
but the best codes work
on rectangular matrices.

01:13:29.000 --> 01:13:30.440
And they perform
binary splitting

01:13:30.440 --> 01:13:32.000
instead of eight-way splitting.

01:13:32.000 --> 01:13:37.130
And they split on the
largest of i, j, and k.

01:13:37.130 --> 01:13:39.590
So this is what the best
cache-oblivious matrix

01:13:39.590 --> 01:13:41.060
multiplication algorithm does.

01:13:44.970 --> 01:13:46.101
Any questions?

01:13:50.940 --> 01:13:54.440
So I only talked about
the serial setting so far.

01:13:54.440 --> 01:13:56.090
I was assuming that
these algorithms

01:13:56.090 --> 01:13:58.190
ran on just a single thread.

01:13:58.190 --> 01:14:02.674
What happens if I go
to multiple processors?

01:14:02.674 --> 01:14:05.340
It turns out that the
results do generalize

01:14:05.340 --> 01:14:08.380
to a parallel context.

01:14:08.380 --> 01:14:10.770
So this is the recursive
parallel matrix multiply

01:14:10.770 --> 01:14:13.710
code that we saw before.

01:14:13.710 --> 01:14:17.040
And notice that we're executing
four sub calls in parallel,

01:14:17.040 --> 01:14:19.620
doing a sync, and then
doing four more sub

01:14:19.620 --> 01:14:20.385
calls in parallel.
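In Cilk, that structure looks roughly like this (a sketch using the same illustrative offsets and THRESHOLD coarsening as the serial sketch above; n is assumed to be a power of 2):

    #include <cilk/cilk.h>

    #define THRESHOLD 32

    void mm_par(double *restrict C, const double *restrict A,
                const double *restrict B, int n, int rowsize) {
      if (n <= THRESHOLD) {        // serial base case, as in the serial sketch
        for (int i = 0; i < n; ++i)
          for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)
              C[i*rowsize + j] += A[i*rowsize + k] * B[k*rowsize + j];
        return;
      }
      int h = n / 2;
      int o12 = h, o21 = h * rowsize, o22 = h * rowsize + h;
      // First four quadrant updates run in parallel:
      cilk_spawn mm_par(C,       A,       B,       h, rowsize); // C11 += A11*B11
      cilk_spawn mm_par(C + o12, A,       B + o12, h, rowsize); // C12 += A11*B12
      cilk_spawn mm_par(C + o21, A + o21, B,       h, rowsize); // C21 += A21*B11
                 mm_par(C + o22, A + o21, B + o12, h, rowsize); // C22 += A21*B12
      cilk_sync; // the next four update the same quadrants, so wait first
      cilk_spawn mm_par(C,       A + o12, B + o21, h, rowsize); // C11 += A12*B21
      cilk_spawn mm_par(C + o12, A + o12, B + o22, h, rowsize); // C12 += A12*B22
      cilk_spawn mm_par(C + o21, A + o22, B + o21, h, rowsize); // C21 += A22*B21
                 mm_par(C + o22, A + o22, B + o22, h, rowsize); // C22 += A22*B22
      cilk_sync;
    }

The two spawn/sync phases of half-size problems are what give the span recurrence analyzed below.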

01:14:23.310 --> 01:14:25.920
So let's try to analyze
the number of cache

01:14:25.920 --> 01:14:27.540
misses in this parallel code.

01:14:27.540 --> 01:14:30.210
And to do that, we're going
to use this theorem, which

01:14:30.210 --> 01:14:32.910
says that let Q sub p
be the number of cache

01:14:32.910 --> 01:14:34.980
misses in a deterministic
Cilk computation

01:14:34.980 --> 01:14:39.000
when run on P processors, each
with a private cache of size M.

01:14:39.000 --> 01:14:41.610
And let S sub p be the
number of successful steals

01:14:41.610 --> 01:14:43.830
during the computation.

01:14:43.830 --> 01:14:46.800
In the ideal cache model,
the number of cache

01:14:46.800 --> 01:14:50.970
misses we're going to have
is Q sub p equal to Q sub 1

01:14:50.970 --> 01:14:55.830
plus big O of number of
steals times M over B.

01:14:55.830 --> 01:14:59.520
So the number of cache misses
in the parallel context is

01:14:59.520 --> 01:15:02.730
equal to the number of cache
misses when you run it serially

01:15:02.730 --> 01:15:05.970
plus this term here, which
is the number of steals

01:15:05.970 --> 01:15:09.670
times M over B.

01:15:09.670 --> 01:15:13.650
And the proof for this goes
as follows-- so recall,

01:15:13.650 --> 01:15:16.200
in the Cilk runtime
system, we can

01:15:16.200 --> 01:15:18.900
have workers steal
tasks from other workers

01:15:18.900 --> 01:15:20.580
when they don't have work to do.

01:15:20.580 --> 01:15:23.520
And after a worker steals
a task from another worker,

01:15:23.520 --> 01:15:26.700
its cache becomes completely
cold in the worst case,

01:15:26.700 --> 01:15:29.790
because it wasn't actually
working on that subproblem

01:15:29.790 --> 01:15:31.080
before.

01:15:31.080 --> 01:15:33.750
But after M over B
cold cache misses,

01:15:33.750 --> 01:15:36.630
its cache is going to become
identical to what it would

01:15:36.630 --> 01:15:38.500
be in the serial execution.

01:15:38.500 --> 01:15:40.590
So we just need to
pay M over B cache

01:15:40.590 --> 01:15:44.130
misses to make it so that
the cache looks the same as

01:15:44.130 --> 01:15:47.010
if it were executing serially.

01:15:47.010 --> 01:15:48.630
And the same is
true when a worker

01:15:48.630 --> 01:15:52.380
resumes a stolen subcomputation
after a Cilk sync.

01:15:52.380 --> 01:15:55.230
And the number of times that
these two situations can happen

01:15:55.230 --> 01:15:57.795
is 2 times S sub p--

01:15:57.795 --> 01:16:00.270
2 times the number of steals.

01:16:00.270 --> 01:16:03.780
And each time, we have to
pay M over B cache misses.

01:16:03.780 --> 01:16:06.870
And this is where this additive
term comes from-- order

01:16:06.870 --> 01:16:13.260
S sub p times M over B.

01:16:13.260 --> 01:16:16.920
We also know that the number
of steals in a Cilk program

01:16:16.920 --> 01:16:21.770
is upper-bounded by
P times T infinity,

01:16:21.770 --> 01:16:24.150
in expectation, where P
is the number of processors

01:16:24.150 --> 01:16:27.390
and T infinity is the
span of your computation.

01:16:27.390 --> 01:16:30.060
So if you can minimize the
span of your computation,

01:16:30.060 --> 01:16:34.170
then this also gives
you a good cache bounds.

01:16:34.170 --> 01:16:37.140
So moral of the story
here is that minimizing

01:16:37.140 --> 01:16:41.010
the number of cache
misses in a serial elision

01:16:41.010 --> 01:16:44.370
essentially minimizes them
in the parallel execution

01:16:44.370 --> 01:16:46.080
for a low span algorithm.

01:16:48.690 --> 01:16:51.660
So in this recursive matrix
multiplication algorithm,

01:16:51.660 --> 01:16:55.910
the span of this is as follows--

01:16:55.910 --> 01:16:58.920
so T infinity of n is 2T
infinity of n over 2

01:16:58.920 --> 01:17:01.260
plus theta of 1.

01:17:01.260 --> 01:17:02.670
Since we're doing
a sync here, we

01:17:02.670 --> 01:17:06.960
have to pay the critical
path length of two sub calls.

01:17:06.960 --> 01:17:09.180
This solves to theta of n.

01:17:09.180 --> 01:17:12.150
And applying to previous
lemma, this gives us

01:17:12.150 --> 01:17:17.190
a cache miss bound of theta of
n cubed over B square root of M.

01:17:17.190 --> 01:17:20.550
This is just the same
as the serial execution.

01:17:20.550 --> 01:17:24.150
And then this additive term is
going to be order P times n

01:17:24.150 --> 01:17:29.570
times M over B-- that is, the number
of steals times M over B per steal.
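Written out, using the theorem above, the steal bound S sub p = O(P T infinity), and the span T infinity(n) = Theta(n) derived above:

\[
Q_p \;=\; Q_1 + O\!\Big(S_p \cdot \frac{M}{B}\Big)
\;=\; \Theta\!\Big(\frac{n^{3}}{B\sqrt{M}}\Big) + O\!\Big(P\,T_\infty \cdot \frac{M}{B}\Big)
\;=\; \Theta\!\Big(\frac{n^{3}}{B\sqrt{M}} + \frac{P\,n\,M}{B}\Big).
\]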

01:17:29.570 --> 01:17:35.510
So that was a parallel
algorithm for matrix multiply.

01:17:35.510 --> 01:17:39.320
And we saw that we can also
get good cache bounds there.

01:17:39.320 --> 01:17:41.430
So here's a summary of
what we talked about today.

01:17:41.430 --> 01:17:45.950
We talked about associativity
in caches, different ways

01:17:45.950 --> 01:17:47.790
you can design a cache.

01:17:47.790 --> 01:17:49.520
We talked about the
ideal cache model

01:17:49.520 --> 01:17:52.940
that's useful for
analyzing algorithms.

01:17:52.940 --> 01:17:55.910
We talked about
cache-aware algorithms

01:17:55.910 --> 01:17:58.110
that have explicit
knowledge of the cache.

01:17:58.110 --> 01:18:01.850
And the example we used
was tiled matrix multiply.

01:18:01.850 --> 01:18:03.980
Then we came up with a
much simpler algorithm

01:18:03.980 --> 01:18:09.290
that was cache-oblivious
using divide and conquer.

01:18:09.290 --> 01:18:11.510
And then on Thursday's
lecture, we'll

01:18:11.510 --> 01:18:14.730
actually see much more on
cache-oblivious algorithm

01:18:14.730 --> 01:18:15.230
design.

01:18:15.230 --> 01:18:16.897
And then you'll also
have an opportunity

01:18:16.897 --> 01:18:20.150
to analyze the cache
efficiency of some algorithms

01:18:20.150 --> 01:18:22.690
in the next homework.