1
00:00:08 --> 00:00:13
The last lecture of 6.046.
We are here today to talk more
2
00:00:13 --> 00:00:17
about cache oblivious
algorithms.
3
00:00:17 --> 00:00:30
4
00:00:30 --> 00:00:34
Last class, we saw several
cache oblivious algorithms,
5
00:00:34 --> 00:00:37
although none of them were too
difficult.
6
00:00:37 --> 00:00:42
Today we will see two difficult
cache oblivious algorithms,
7
00:00:42 --> 00:00:46
a little bit more advanced.
I figure we should do something
8
00:00:46 --> 00:00:51
advanced for the last class just
to get to some exciting climax.
9
00:00:51 --> 00:00:55
So without further ado,
let's get started.
10
00:00:55 --> 00:01:00
Last time, we looked at the
binary search problem.
11
00:01:00 --> 00:01:02
Or, we looked at binary search,
rather.
12
00:01:02 --> 00:01:06
And so, the binary search did
not do so well in the cache
13
00:01:06 --> 00:01:10
oblivious context.
And, some people asked me after
14
00:01:10 --> 00:01:14
class, is it possible to do
binary search well, cache
15
00:01:14 --> 00:01:16
obliviously?
And, indeed it is with
16
00:01:16 --> 00:01:19
something called static search
trees.
17
00:01:19 --> 00:01:21
So, this is really binary
search.
18
00:01:21 --> 00:01:25
So, I mean, the abstract
problem is I give you N items,
19
00:01:25 --> 00:01:28
say presorted,
build some static data
20
00:01:28 --> 00:01:34
structure so that you can search
among those N items quickly.
21
00:01:34 --> 00:01:37
And quickly,
I claim, means log base B of N.
22
00:01:37 --> 00:01:41
We know that with B trees,
our goal is to get log base B
23
00:01:41 --> 00:01:44
of N.
We know that we can achieve
24
00:01:44 --> 00:01:46
that with B trees when we know
B.
25
00:01:46 --> 00:01:49
We'd like to do that when we
don't know B.
26
00:01:49 --> 00:01:54
And that's what cache oblivious
static search trees achieve.
27
00:01:54 --> 00:01:58
So here's what we're going to
do.
28
00:01:58 --> 00:02:02
As you might suspect,
we're going to use a tree.
29
00:02:02 --> 00:02:07
So, we're going to store our N
elements in a complete binary
30
00:02:07 --> 00:02:10
tree.
We can't use B trees because we
31
00:02:10 --> 00:02:15
don't know what B is.
So, we'll use a binary tree.
32
00:02:15 --> 00:02:19
And the key is how we lay out a
binary tree.
33
00:02:19 --> 00:02:22
The binary tree will have N
nodes.
34
00:02:22 --> 00:02:25
Or, you can put the data in the
leaves.
35
00:02:25 --> 00:02:30
It doesn't really matter.
So, here's our tree.
36
00:02:30 --> 00:02:33
There are the N nodes.
And we're storing them,
37
00:02:33 --> 00:02:35
I didn't say,
in order, you know,
38
00:02:35 --> 00:02:38
in the usual way,
in order in a binary tree,
39
00:02:38 --> 00:02:41
which makes it a binary search
tree.
40
00:02:41 --> 00:02:43
So now we do a search in this
thing.
41
00:02:43 --> 00:02:47
So, the search will just start
at the root and walk down some
42
00:02:47 --> 00:02:51
root-to-leaf path.
OK, and each point you know
43
00:02:51 --> 00:02:54
whether to go left or to go
right because things are in
44
00:02:54 --> 00:02:57
order.
So we're assuming here that we
45
00:02:57 --> 00:03:01
have an ordered universe of
keys.
46
00:03:01 --> 00:03:04
So that's easy.
We know that that will take log
47
00:03:04 --> 00:03:07
N time.
The question is how many memory
48
00:03:07 --> 00:03:10
transfers?
We'd like a lot of the nodes
49
00:03:10 --> 00:03:13
near the root to be somehow
closer, in one block.
50
00:03:13 --> 00:03:16
But we don't know what the
block size is.
51
00:03:16 --> 00:03:21
So what we're going to do is carve the
tree in the middle level.
52
00:03:21 --> 00:03:25
We're going to use divide and
conquer for our layout of the
53
00:03:25 --> 00:03:28
tree, how we order the nodes in
memory.
54
00:03:28 --> 00:03:33
And the divide and conquer is
based on cutting in the middle,
55
00:03:33 --> 00:03:38
which is a bit weird.
It's not our usual divide and
56
00:03:38 --> 00:03:42
conquer.
And we'll see this more than
57
00:03:42 --> 00:03:45
once today.
So, when you cut on the middle
58
00:03:45 --> 00:03:50
level, if the height of your
original tree is log N,
59
00:03:50 --> 00:03:55
maybe log N plus one or
something, it's roughly log N,
60
00:03:55 --> 00:04:00
then the top half will be log N
over two.
61
00:04:00 --> 00:04:05
And at the height of the bottom
pieces will be log N over two.
62
00:04:05 --> 00:04:10
How many nodes will there be in
the top tree?
63
00:04:10 --> 00:04:12
N over two?
Not quite.
64
00:04:12 --> 00:04:16
Two to the log N over two,
square root of N.
65
00:04:16 --> 00:04:21
OK, so it would be about root N
nodes over here.
66
00:04:21 --> 00:04:24
And therefore,
there will be about root N
67
00:04:24 --> 00:04:28
subtrees down here,
one for each,
68
00:04:28 --> 00:04:34
or a couple for each leaf.
OK, so we have these subtrees
69
00:04:34 --> 00:04:38
of root N, and there are about
root N of them.
70
00:04:38 --> 00:04:42
OK, this is how we are carving
our tree.
71
00:04:42 --> 00:04:46
Now, we're going to recurse on
each of the pieces.
72
00:04:46 --> 00:04:50
I'd like to redraw this
slightly, sorry,
73
00:04:50 --> 00:04:53
just to make it a little bit
clearer.
74
00:04:53 --> 00:04:58
These triangles are really
trees, and they are connected by
75
00:04:58 --> 00:05:04
edges to this tree up here.
So what we are really doing is
76
00:05:04 --> 00:05:08
carving in the middle level of
edges in the tree.
77
00:05:08 --> 00:05:12
And if N is not exactly a power
of two, you have to round your
78
00:05:12 --> 00:05:15
level by taking floors or
ceilings.
79
00:05:15 --> 00:05:19
But you cut roughly in the
middle level of edges.
80
00:05:19 --> 00:05:23
There are a lot of edges here.
You conceptually slice there.
81
00:05:23 --> 00:05:27
That gives you a top tree and
the bottom tree,
82
00:05:27 --> 00:05:32
several bottom trees,
each of size roughly root N.
83
00:05:32 --> 00:05:39
OK, and then we are going to
recursively lay out these root N
84
00:05:39 --> 00:05:45
plus one subtrees,
and then concatenate.
85
00:05:45 --> 00:05:50
So, this is the idea of the
recursive layout.
86
00:05:50 --> 00:05:57
We saw recursive layouts
with matrices last time.
87
00:05:57 --> 00:06:04
This is doing the same thing
for a tree.
88
00:06:04 --> 00:06:07
So, I want to recursively
lay out the top tree.
89
00:06:07 --> 00:06:11
So here's the top tree.
And I imagine it being somehow
90
00:06:11 --> 00:06:14
squashed down into a linear
array recursively.
91
00:06:14 --> 00:06:18
And then I do the same thing
for each of the bottom trees.
92
00:06:18 --> 00:06:21
So here are all the bottom
trees.
93
00:06:21 --> 00:06:25
And I squashed each of them
down into some linear order.
94
00:06:25 --> 00:06:28
And then I concatenate those
linear orders.
95
00:06:28 --> 00:06:32
That's the linear order of the
tree.
96
00:06:32 --> 00:06:35
And you need a base case.
And the base case,
97
00:06:35 --> 00:06:39
just a single node,
is stored in the only order of
98
00:06:39 --> 00:06:43
a single node there is.
OK, so that's a recursive
99
00:06:43 --> 00:06:48
layout of a binary search tree.
It turns out this works really
100
00:06:48 --> 00:06:51
well.
And let's quickly do a little
101
00:06:51 --> 00:06:56
example just so it's completely
clear what this layout is
102
00:06:56 --> 00:07:02
because it's a bit bizarre maybe
the first time you see it.
103
00:07:02 --> 00:07:05
So let me draw my favorite
picture.
104
00:07:05 --> 00:07:15
105
00:07:15 --> 00:07:19
So here's a tree of height four
or three depending on how you
106
00:07:19 --> 00:07:22
count.
We divide in the middle level,
107
00:07:22 --> 00:07:25
and we say, OK,
that's the top tree.
108
00:07:25 --> 00:07:27
And then these are the bottom
trees.
109
00:07:27 --> 00:07:32
So there's four bottom trees.
So there are four children
110
00:07:32 --> 00:07:36
hanging off the root tree.
They each have the same size in
111
00:07:36 --> 00:07:39
this case.
They should all roughly be the
112
00:07:39 --> 00:07:41
same size.
And first we lay out the top
113
00:07:41 --> 00:07:43
thing where we divide on the
middle level.
114
00:07:43 --> 00:07:47
We say, OK, this comes first.
And then, the bottom subtrees
115
00:07:47 --> 00:07:50
come next, two and three.
So, I'm writing down the order
116
00:07:50 --> 00:07:52
in which these nodes are stored
in an array.
117
00:07:52 --> 00:07:55
And then, we visit this tree so
we get four, five,
118
00:07:55 --> 00:07:57
six.
And then we visit this one so
119
00:07:57 --> 00:08:00
we get seven,
eight, nine.
120
00:08:00 --> 00:08:03
And then the subtree,
10, 11, 12, and then the last
121
00:08:03 --> 00:08:06
subtree.
So that's the order in which
122
00:08:06 --> 00:08:09
you store these 15 nodes.
And you can build that up
123
00:08:09 --> 00:08:13
recursively.
OK, so the structure is fairly
124
00:08:13 --> 00:08:17
simple, just a binary structure
which we know and love,
125
00:08:17 --> 00:08:19
but store it in this funny
order.
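The recursive layout just described can be sketched in a few lines (a sketch under my own conventions, not code from the lecture: nodes are labeled 1 through N in the usual BFS/heap order, N + 1 is a power of two, and the top height is rounded down):

```python
def veb_layout(root, height):
    # List the BFS (heap-order) labels of a complete binary tree in the
    # recursive "carve at the middle level" storage order.
    if height == 1:
        return [root]                    # base case: a single node
    top_h = height // 2                  # cut (roughly) at the middle level
    bot_h = height - top_h
    order = veb_layout(root, top_h)      # lay out the top tree first...
    # ...then each bottom tree, concatenated left to right.  The bottom
    # roots are the BFS descendants top_h levels below the root.
    for r in range(root << top_h, (root + 1) << top_h):
        order += veb_layout(r, bot_h)
    return order
```

On the 15-node example from the board, `veb_layout(1, 4)` places the 3-node top tree first and then each 3-node bottom subtree in turn, matching the 1 through 15 numbering of storage positions.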
126
00:08:19 --> 00:08:22
This is not depth-first search
order or level order,
127
00:08:22 --> 00:08:27
lots of natural things you
might try, none of which work in
128
00:08:27 --> 00:08:31
cache oblivious context.
This is pretty much the only
129
00:08:31 --> 00:08:33
thing that works.
And the intuition is,
130
00:08:33 --> 00:08:36
well, we are trying to mimic
all kinds of B trees.
131
00:08:36 --> 00:08:39
So, if you want a binary tree,
well, that's the original tree.
132
00:08:39 --> 00:08:41
It doesn't matter how you store
things.
133
00:08:41 --> 00:08:44
If you want a tree where the
branching factor is four,
134
00:08:44 --> 00:08:47
well, then here it is.
These blocks give you a
135
00:08:47 --> 00:08:50
branching factor of four.
If we had more leaves down
136
00:08:50 --> 00:08:53
here, there would be four
children hanging off of that
137
00:08:53 --> 00:08:54
node.
And these are all clustered
138
00:08:54 --> 00:08:56
together consecutively in
memory.
139
00:08:56 --> 00:08:59
So, if your block size happens
to be three, then this is a
140
00:08:59 --> 00:09:04
perfect way to store things for
a block size of three.
141
00:09:04 --> 00:09:07
If your block size happens to
be, say, 15,
142
00:09:07 --> 00:09:12
right, if we count the number
of, right, the number of nodes
143
00:09:12 --> 00:09:16
in here is 15,
if your block size happens to
144
00:09:16 --> 00:09:21
be 15, then this recursion will
give you a perfect blocking in
145
00:09:21 --> 00:09:23
terms of 15.
And in general,
146
00:09:23 --> 00:09:27
it's actually mimicking block
sizes of 2^K-1.
147
00:09:27 --> 00:09:32
Think powers of two.
OK, that's the intuition.
148
00:09:32 --> 00:09:37
Let me give you the formal
analysis to make it clearer.
149
00:09:37 --> 00:09:42
So, we claim that there are
order, log base B of N memory
150
00:09:42 --> 00:09:45
transfers.
That's what we want to prove no
151
00:09:45 --> 00:09:49
matter what B is.
So here's what we're going to
152
00:09:49 --> 00:09:52
do.
You may recall last time when
153
00:09:52 --> 00:09:57
we analyzed divide and conquer
algorithms, we wrote our
154
00:09:57 --> 00:10:03
recurrence, and that the base
case was the key.
155
00:10:03 --> 00:10:05
Here, in fact,
we are only going to think
156
00:10:05 --> 00:10:07
about the base case in a certain
sense.
157
00:10:07 --> 00:10:08
We don't have,
really, recursion in the
158
00:10:08 --> 00:10:10
algorithm.
The algorithm is just walking
159
00:10:10 --> 00:10:13
down some root-to-leaf path.
We only have a recursion in a
160
00:10:13 --> 00:10:16
definition of the layout.
So, we can be a little bit more
161
00:10:16 --> 00:10:18
flexible.
We don't have to look at our
162
00:10:18 --> 00:10:20
recurrence.
We are just going to think
163
00:10:20 --> 00:10:22
about the base case.
I want to imagine,
164
00:10:22 --> 00:10:24
you start with the big
triangle.
165
00:10:24 --> 00:10:26
That you cut it in the middle;
you get smaller triangles.
166
00:10:26 --> 00:10:31
Imagine the point at which you
keep recursively cutting.
167
00:10:31 --> 00:10:34
So imagine this process.
Big triangles halve in height
168
00:10:34 --> 00:10:37
each time.
They're getting smaller and
169
00:10:37 --> 00:10:41
smaller, stop cutting at the
point where a triangle fits in a
170
00:10:41 --> 00:10:44
block.
OK, and look at that time.
171
00:10:44 --> 00:10:48
OK, the recursion actually goes
all the way, but in the analysis
172
00:10:48 --> 00:10:53
let's think about the point
where the chunk fits in a block
173
00:10:53 --> 00:10:57
when one of these triangles,
one of these boxes, fits in a
174
00:10:57 --> 00:10:59
block.
So, I'm going to call this a
175
00:10:59 --> 00:11:05
recursive level.
So, I'm imagining expanding all
176
00:11:05 --> 00:11:10
of the recursions in parallel.
This is some level of detail,
177
00:11:10 --> 00:11:16
some level of refinement of the
trees at which the tree you're
178
00:11:16 --> 00:11:19
looking at, the triangle,
has size at most B.
179
00:11:19 --> 00:11:24
In other words,
the number of nodes in
180
00:11:24 --> 00:11:29
that triangle is less than or
equal to B.
181
00:11:29 --> 00:11:34
OK, so let me draw a picture.
So, I want to draw sort of this
182
00:11:34 --> 00:11:39
picture but where instead of
nodes, I have little triangles
183
00:11:39 --> 00:11:41
of size, at most,
B.
184
00:11:41 --> 00:11:44
So, the picture looks something
like this.
185
00:11:44 --> 00:11:48
We have a little triangle of
size, at most,
186
00:11:48 --> 00:11:50
B.
It has a bunch of children
187
00:11:50 --> 00:11:55
which are subtrees of size,
at most, B, the same size.
188
00:11:55 --> 00:12:00
And then, these are in a chunk,
and then we have other chunks
189
00:12:00 --> 00:12:06
that look like that in recursion
potentially.
190
00:12:06 --> 00:12:29
191
00:12:29 --> 00:12:31
OK, so I haven't drawn
everything.
192
00:12:31 --> 00:12:34
There would be a whole bunch
of, between B and B^2,
193
00:12:34 --> 00:12:37
in fact, subtrees,
other triangles of this size.
194
00:12:37 --> 00:12:40
So here, I had to refine the
entire tree here.
195
00:12:40 --> 00:12:44
And then I refined each of the
subtrees here and here at these
196
00:12:44 --> 00:12:47
levels.
And then it turned out after
197
00:12:47 --> 00:12:50
these two recursive levels,
everything fits in a block.
198
00:12:50 --> 00:12:54
Everything has the same size,
so at some point they will all
199
00:12:54 --> 00:12:57
fit within a block.
And they might actually be
200
00:12:57 --> 00:12:59
quite a bit smaller than the
block.
201
00:12:59 --> 00:13:05
How small?
So, what I'm doing is cutting
202
00:13:05 --> 00:13:09
the number of levels in half at
each point.
203
00:13:09 --> 00:13:15
And I stop when the height of
one of these trees is
204
00:13:15 --> 00:13:21
essentially at most log B
because that's when the number
205
00:13:21 --> 00:13:25
of nodes there will be B
roughly.
206
00:13:25 --> 00:13:30
So, how small can the height
be?
207
00:13:30 --> 00:13:32
I keep dividing in half and
stopping when it's,
208
00:13:32 --> 00:13:34
at most, log B.
Log B over two.
209
00:13:34 --> 00:13:37
So it's, at most,
log B, it's at least half log
210
00:13:37 --> 00:13:39
B.
Therefore, the number of nodes
211
00:13:39 --> 00:13:42
it here could be between the
square root of B and B.
212
00:13:42 --> 00:13:46
So, this could be a lot smaller
than a block, by more than a
213
00:13:46 --> 00:13:49
constant factor, but I claim
that doesn't matter.
214
00:13:49 --> 00:13:51
It's OK.
This could be as small as square
215
00:13:51 --> 00:13:53
root of B.
I'm not even going to write
216
00:13:53 --> 00:13:57
that it could be as small as square
root of B because that doesn't
217
00:13:57 --> 00:14:00
play a role in the analysis.
It's a worry,
218
00:14:00 --> 00:14:04
but it's OK essentially because
our bound only involves log B.
219
00:14:04 --> 00:14:09
It doesn't involve B.
So, here's what we do.
220
00:14:09 --> 00:14:16
We know that the height of one
of these triangles of
221
00:14:16 --> 00:14:20
size, at most,
B is at least a half log B.
222
00:14:20 --> 00:14:25
And therefore,
if you look at a search path,
223
00:14:25 --> 00:14:32
so, when we do a search in this
tree, we're going to start up
224
00:14:32 --> 00:14:36
here.
And I'm going to mess up the
225
00:14:36 --> 00:14:39
diagram now.
We're going to follow some
226
00:14:39 --> 00:14:42
path, maybe I should have drawn
it going down here.
227
00:14:42 --> 00:14:46
We visit through some of these
triangles, but it's a
228
00:14:46 --> 00:14:51
root-to-node path in the tree.
So, how many of the triangles
229
00:14:51 --> 00:14:54
could it visit?
Well, the height of the tree
230
00:14:54 --> 00:14:58
divided by the height of one of
the triangles.
231
00:14:58 --> 00:15:01
So, this visits,
at most, log N over half log B
232
00:15:01 --> 00:15:07
triangles, which looks good.
This is log base B of N,
233
00:15:07 --> 00:15:12
modulo a factor of two.
Now, what we worry about is how
234
00:15:12 --> 00:15:15
many blocks does a triangle
occupy?
235
00:15:15 --> 00:15:19
One of these triangles should
fit in a block.
236
00:15:19 --> 00:15:23
We know by the recursive
layout, it is stored in a
237
00:15:23 --> 00:15:28
consecutive region in memory.
So, how many blocks could it
238
00:15:28 --> 00:15:32
occupy?
Two, because of alignment,
239
00:15:32 --> 00:15:35
it might fall across the
boundary of a block,
240
00:15:35 --> 00:15:37
but at most,
one boundary.
241
00:15:37 --> 00:15:42
So, it fits in two blocks.
So, each triangle fits in one
242
00:15:42 --> 00:15:45
block, but is in,
at most, two blocks,
243
00:15:45 --> 00:15:49
memory blocks,
size B depending on alignment.
244
00:15:49 --> 00:15:53
So, the number of memory
transfers, in other words,
245
00:15:53 --> 00:15:57
a number of blocks we read,
because all we are doing here
246
00:15:57 --> 00:16:01
is reading in a search,
is at most two blocks per
247
00:16:01 --> 00:16:05
triangle.
There are this many triangles,
248
00:16:05 --> 00:16:07
so it's at most,
4 log base B of N,
249
00:16:07 --> 00:16:09
OK, which is order log base B
of N.
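Here's a small simulation of that analysis (my own harness, not from the lecture): lay a complete tree out recursively, follow a root-to-leaf path, and count how many distinct aligned size-B blocks the path's positions touch. For every B tried, the count stays within the 4 log base B of N bound. Note the branching below is random rather than driven by key comparisons; only the memory-access pattern matters for the count.

```python
import math
import random

def veb_layout(root, height):
    # Recursive layout: top tree of half the height, then the bottom trees.
    if height == 1:
        return [root]
    top_h = height // 2
    order = veb_layout(root, top_h)
    for r in range(root << top_h, (root + 1) << top_h):
        order += veb_layout(r, height - top_h)
    return order

def blocks_touched(height, B, rng):
    # Distinct aligned size-B blocks read on one root-to-leaf search.
    layout = veb_layout(1, height)
    pos = {node: i for i, node in enumerate(layout)}  # node -> array index
    node, blocks = 1, set()
    while node <= len(layout):               # nodes are numbered 1 .. 2^h - 1
        blocks.add(pos[node] // B)           # block holding this node
        node = 2 * node + rng.randint(0, 1)  # branch left or right
    return len(blocks)

rng = random.Random(0)
h = 14                                       # N = 2^14 - 1 nodes
N = 2 ** h - 1
for B in (4, 16, 64, 256):
    assert blocks_touched(h, B, rng) <= 4 * math.log(N, B)
```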
250
00:16:09 --> 00:16:13
And there are papers about
decreasing this constant 4 with
251
00:16:13 --> 00:16:15
more sophisticated data
structures.
252
00:16:15 --> 00:16:18
You can get it down to a little
bit less than two I think.
253
00:16:18 --> 00:16:21
So, there you go.
So not quite as good as B trees
254
00:16:21 --> 00:16:24
in terms of the constant,
but pretty good.
255
00:16:24 --> 00:16:27
And what's good is that this
data structure works for all B
256
00:16:27 --> 00:16:32
at the same time.
This analysis works for all B.
257
00:16:32 --> 00:16:37
So, we have a multilevel memory
hierarchy, no problem.
258
00:16:37 --> 00:16:41
Any questions about this data
structure?
259
00:16:41 --> 00:16:44
This is already pretty
sophisticated,
260
00:16:44 --> 00:16:48
but we are going to get even
more sophisticated.
261
00:16:48 --> 00:16:51
Next, OK, good,
no questions.
262
00:16:51 --> 00:16:56
This is either perfectly clear,
or a little bit difficult,
263
00:16:56 --> 00:16:59
or both.
So, now, I debated with myself
264
00:16:59 --> 00:17:05
what exactly I would cover next.
There are two natural things I
265
00:17:05 --> 00:17:08
could cover, both of which are
complicated.
266
00:17:08 --> 00:17:11
My first result in the cache
oblivious world is making this
267
00:17:11 --> 00:17:14
data structure dynamic.
So, there is a dynamic B tree
268
00:17:14 --> 00:17:18
that's cache oblivious that
works for all values of B.
269
00:17:18 --> 00:17:20
And it gets log base B of N,
insert, delete,
270
00:17:20 --> 00:17:23
and search.
So, this just gets search in
271
00:17:23 --> 00:17:25
log base B of N.
That data structure,
272
00:17:25 --> 00:17:28
our first paper was damn
complicated, and then it got
273
00:17:28 --> 00:17:31
simplified.
It's now not too hard,
274
00:17:31 --> 00:17:35
but it takes a couple of
lectures in an advanced
275
00:17:35 --> 00:17:40
algorithms class to teach it.
So, I'm not going to do that.
276
00:17:40 --> 00:17:42
But there you go.
It exists.
277
00:17:42 --> 00:17:47
Instead, we're going to cover
our favorite problem, sorting, in
278
00:17:47 --> 00:17:52
the cache oblivious context.
And this is quite complicated,
279
00:17:52 --> 00:17:56
more than you'd expect,
OK, much more complicated than
280
00:17:56 --> 00:18:01
it is in a multithreaded setting
to get the right answer,
281
00:18:01 --> 00:18:05
anyway.
Maybe to get the best answer in
282
00:18:05 --> 00:18:08
a multithreaded setting is also
complicated.
283
00:18:08 --> 00:18:11
The version we got last week
was pretty easy.
284
00:18:11 --> 00:18:13
But before we go to cache
oblivious sorting,
285
00:18:13 --> 00:18:18
let me talk about cache aware
sorting because we need to know
286
00:18:18 --> 00:18:21
what bound we are aiming for.
And just to warn you,
287
00:18:21 --> 00:18:24
I may not get to the full
analysis of the full cache
288
00:18:24 --> 00:18:28
oblivious sorting.
But I want to give you an idea
289
00:18:28 --> 00:18:31
of what goes into it because
it's pretty cool,
290
00:18:31 --> 00:18:35
I think, a lot of ideas.
So, how might you sort?
291
00:18:35 --> 00:18:39
So, cache aware,
we assume we can do everything.
292
00:18:39 --> 00:18:41
Basically, this means we have B
trees.
293
00:18:41 --> 00:18:44
That's the only other structure
we know.
294
00:18:44 --> 00:18:49
How would you sort N numbers,
given that that's the only data
295
00:18:49 --> 00:18:52
structure you have?
Right, just add them into the B
296
00:18:52 --> 00:18:55
tree, and then do an in-order
traversal.
297
00:18:55 --> 00:19:00
That's one way to sort,
perfectly reasonable.
298
00:19:00 --> 00:19:04
We'll call it repeated
insertion into a B tree.
299
00:19:04 --> 00:19:08
OK, we know in the usual
setting, in BST sort,
300
00:19:08 --> 00:19:13
where you use a balanced binary
search tree, like red-black
301
00:19:13 --> 00:19:17
trees, that takes N log N time,
log N per operation,
302
00:19:17 --> 00:19:22
and that's an optimal sorting
algorithm in the comparison
303
00:19:22 --> 00:19:28
model, only thinking about
comparison model here.
304
00:19:28 --> 00:19:39
So, how many memory transfers
does this data structure take?
305
00:19:39 --> 00:19:45
Sorry, this algorithm for
sorting?
306
00:19:45 --> 00:19:54
The number of memory transfers
is a function of N,
307
00:19:54 --> 00:20:01
and MT of N is?
This is easy.
308
00:20:01 --> 00:20:07
N insertions,
OK, you have to think about an
309
00:20:07 --> 00:20:13
in-order traversal.
You have to remember back your
310
00:20:13 --> 00:20:20
analysis of B trees,
but this is not too hard.
311
00:20:20 --> 00:20:27
How long does the insertion
take, the N insertions?
312
00:20:27 --> 00:20:32
N log base B of N.
How long does the traversal
313
00:20:32 --> 00:20:33
take?
Less time.
314
00:20:33 --> 00:20:37
If we think about it,
you can get away with N over B
315
00:20:37 --> 00:20:40
memory transfers,
so quite a bit less than this.
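In symbols, writing MT(N) for the number of memory transfers, the counts just discussed come to (a one-line summary in my own notation):

```latex
\mathrm{MT}(N)
  = \underbrace{O\!\left(N \log_B N\right)}_{N\ \text{insertions}}
  + \underbrace{O\!\left(\tfrac{N}{B}\right)}_{\text{in-order traversal}}
  = O\!\left(N \log_B N\right).
```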
316
00:20:40 --> 00:20:44
This is bigger than N,
which is actually pretty bad.
317
00:20:44 --> 00:20:47
N memory transfers means
essentially you're doing random
318
00:20:47 --> 00:20:51
access, visiting every element
in some random order.
319
00:20:51 --> 00:20:54
It's even worse.
There's even a log factor.
320
00:20:54 --> 00:20:57
Now, the log factor goes down
by this log B factor.
321
00:20:57 --> 00:21:02
But, this is actually a really
bad sorting bound.
322
00:21:02 --> 00:21:06
So, unlike normal algorithms,
where using a search tree is a
323
00:21:06 --> 00:21:10
good way to sort,
in cache oblivious or cache
324
00:21:10 --> 00:21:13
aware sorting it's really,
really bad.
325
00:21:13 --> 00:21:17
So, what's another natural
algorithm you might try,
326
00:21:17 --> 00:21:22
given what we know for sorting?
And, even cache oblivious,
327
00:21:22 --> 00:21:26
all the algorithms we've seen
are cache oblivious.
328
00:21:26 --> 00:21:30
So, what's a good one to try?
Merge sort.
329
00:21:30 --> 00:21:34
OK, we did merge sort in
multithreaded algorithms.
330
00:21:34 --> 00:21:37
Let's try a merge sort,
a good divide and conquer
331
00:21:37 --> 00:21:40
thing.
So, I'm going to call it binary
332
00:21:40 --> 00:21:44
merge sort because it splits the
array into two pieces,
333
00:21:44 --> 00:21:46
and it recurses on the two
pieces.
334
00:21:46 --> 00:21:49
So, you get a binary recursion
tree.
335
00:21:49 --> 00:21:52
So, let's analyze it.
So the number of memory
336
00:21:52 --> 00:21:56
transfers on N elements,
so I mean it has a pretty good
337
00:21:56 --> 00:21:57
recursive layout,
right?
338
00:21:57 --> 00:22:02
The two subarrays that we get
when we partition our array are
339
00:22:02 --> 00:22:05
consecutive.
So, we're recursing on this,
340
00:22:05 --> 00:22:10
recursing on this.
So, it's a nice cache oblivious
341
00:22:10 --> 00:22:13
layout.
And this is even for cache
342
00:22:13 --> 00:22:15
aware.
This is a pretty good
343
00:22:15 --> 00:22:19
algorithm, a lot better than
this one, as we'll see.
344
00:22:19 --> 00:22:22
But, what is the recurrence we
get?
345
00:22:22 --> 00:22:27
So, here we have to go back to
last lecture when we were
346
00:22:27 --> 00:22:31
thinking about recurrences for
recursive cache oblivious
347
00:22:31 --> 00:22:34
algorithms.
348
00:22:34 --> 00:22:46
349
00:22:46 --> 00:22:50
I mean, the first part should
be pretty easy.
350
00:22:50 --> 00:22:55
There's an O.
Well, OK, let's put the O at
351
00:22:55 --> 00:23:00
the end, the divide and the
conquer part at the end.
352
00:23:00 --> 00:23:06
The recursion is 2MT of N over
two, good.
353
00:23:06 --> 00:23:09
All right, that's just like the
merge sort recurrence,
354
00:23:09 --> 00:23:12
and that's the additive term
that you're thinking about.
355
00:23:12 --> 00:23:15
OK, so normally,
we would pay a linear additive
356
00:23:15 --> 00:23:19
term here, order N because
merging takes order N time.
357
00:23:19 --> 00:23:22
Now, we are merging,
which is three parallel scans,
358
00:23:22 --> 00:23:26
the two inputs and the output.
OK, they're not quite parallel
359
00:23:26 --> 00:23:28
interleaved.
They're a bit funnily
360
00:23:28 --> 00:23:31
interleaved, but as long as your
cache stores at least three
361
00:23:31 --> 00:23:35
blocks, this is also linear time
in this setting,
362
00:23:35 --> 00:23:38
which means you visit each
block a constant number of
363
00:23:38 --> 00:23:41
times.
OK, that's the recurrence.
364
00:23:41 --> 00:23:44
Now, we also need a base case,
of course.
365
00:23:44 --> 00:23:47
We've seen two base cases,
one MT of B,
366
00:23:47 --> 00:23:50
and the other,
MT of whatever fits in cache.
367
00:23:50 --> 00:23:53
So, let's look at that one
because it's better.
368
00:23:53 --> 00:23:56
So, for some constant,
C, if I have an array of size
369
00:23:56 --> 00:24:00
M, this fits in cache,
actually, probably C is one
370
00:24:00 --> 00:24:03
here, but I'll just be careful.
For some constant,
371
00:24:03 --> 00:24:10
this fits in cache.
A problem of this size fits in
372
00:24:10 --> 00:24:18
cache, and in that case,
the number of memory transfers
373
00:24:18 --> 00:24:25
is, anyone remember?
We've used this base case more
374
00:24:25 --> 00:24:31
than once before.
Do you remember?
375
00:24:31 --> 00:24:32
Sorry?
CM over B.
376
00:24:32 --> 00:24:33
I've got a big O,
so M over B.
377
00:24:33 --> 00:24:37
Order M over B because this is
the size of the data.
378
00:24:37 --> 00:24:40
So, I mean, just to read it all
in takes M over B.
379
00:24:40 --> 00:24:43
Once it's in cache,
it doesn't really matter what I
380
00:24:43 --> 00:24:47
do as long as I use linear space
for the right constant here.
381
00:24:47 --> 00:24:50
As long as I use linear space
in that algorithm,
382
00:24:50 --> 00:24:53
I'll stay in cache,
and therefore,
383
00:24:53 --> 00:24:57
not have to write anything out
until the very end and I spend M
384
00:24:57 --> 00:25:02
over B to write it out.
OK, so I can't really spend
385
00:25:02 --> 00:25:07
more than M over B almost no
matter what algorithm I have,
386
00:25:07 --> 00:25:09
so long as it uses linear
space.
387
00:25:09 --> 00:25:14
So, this is a base case that's
useful pretty much in any
388
00:25:14 --> 00:25:17
algorithm.
OK, that's a recurrence.
389
00:25:17 --> 00:25:22
Now we just have to solve it.
OK, let's see how good binary
390
00:25:22 --> 00:25:24
merge sort is.
OK, and again,
391
00:25:24 --> 00:25:29
I'm going to just give the
intuition behind the solution to
392
00:25:29 --> 00:25:33
this recurrence.
And I won't use the
393
00:25:33 --> 00:25:36
substitution method to prove it
formally.
394
00:25:36 --> 00:25:38
But this one's actually pretty
simple.
395
00:25:38 --> 00:25:41
So, we have,
at the top, actually I'm going
396
00:25:41 --> 00:25:44
to write it over here.
Otherwise I won't be able to
397
00:25:44 --> 00:25:46
see.
So, at the top of the
398
00:25:46 --> 00:25:48
recursion, we have N over B
costs.
399
00:25:48 --> 00:25:52
I'll ignore the constants.
There is probably also an
400
00:25:52 --> 00:25:55
additive one,
which I'm ignoring here.
401
00:25:55 --> 00:25:58
Then we split into two problems
of half the size.
402
00:25:58 --> 00:26:03
So, we get a half N over B,
and a half N over B.
403
00:26:03 --> 00:26:05
OK, usually this was N,
half N, half N.
404
00:26:05 --> 00:26:08
You should recognize it from
lecture one.
405
00:26:08 --> 00:26:10
So, the total on this level is
N over B.
406
00:26:10 --> 00:26:12
The total on this level is N
over B.
407
00:26:12 --> 00:26:16
And, you can prove by
induction, that every level is N
408
00:26:16 --> 00:26:18
over B.
The question is how many levels
409
00:26:18 --> 00:26:20
are there?
Well, at the bottom,
410
00:26:20 --> 00:26:23
so, dot, dot,
dot, at the bottom of this
411
00:26:23 --> 00:26:26
recursion tree we should get
something of size M,
412
00:26:26 --> 00:26:30
and then we're paying M over B.
Actually here we're paying M
413
00:26:30 --> 00:26:34
over B.
So, it's a good thing those
414
00:26:34 --> 00:26:35
match.
They should.
415
00:26:35 --> 00:26:40
So here, we have a bunch of
leaves, all costing M over B.
416
00:26:40 --> 00:26:44
You can also compute the number
of leaves here is N over M.
417
00:26:44 --> 00:26:49
If you want to be extra sure,
you should always check the
418
00:26:49 --> 00:26:51
leaf level.
It's a good idea.
419
00:26:51 --> 00:26:55
So we have N over M leaves,
each costing M over B.
420
00:26:55 --> 00:27:00
This is an M.
So, this is N over B also.
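Checking the leaf level, as suggested, in symbols:

```latex
\underbrace{\frac{N}{M}}_{\text{number of leaves}}
\cdot
\underbrace{O\!\left(\frac{M}{B}\right)}_{\text{cost per leaf}}
= O\!\left(\frac{N}{B}\right).
```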
421
00:27:00 --> 00:27:04
So, every level here is N over
B memory transfers.
422
00:27:04 --> 00:27:08
And the number of levels is one
N over B?
423
00:27:08 --> 00:27:11
Log N over B.
Yep, that's right.
424
00:27:11 --> 00:27:16
I just didn't hear it right.
OK, we are starting at N.
425
00:27:16 --> 00:27:21
We're getting down to M.
So, you can think of it as log
426
00:27:21 --> 00:27:26
N for the whole binary tree
minus log M for the subtrees,
427
00:27:26 --> 00:27:31
and that's the same as log N
over M, OK, or however you want
428
00:27:31 --> 00:27:37
to think about it.
The point is that this is a log
429
00:27:37 --> 00:27:40
base two.
That's where we are not doing
430
00:27:40 --> 00:27:42
so great.
So this is actually a pretty
431
00:27:42 --> 00:27:46
good algorithm.
So let me write the solution
432
00:27:46 --> 00:27:48
over here.
So, the number of memory
433
00:27:48 --> 00:27:53
transfers on N items is going to
be the number of levels times
434
00:27:53 --> 00:27:56
the cost of each level.
So, this is N over B times log
435
00:27:56 --> 00:28:00
base two of N over M,
which is a lot better than
436
00:28:00 --> 00:28:04
repeated insertion into a B
tree.
437
00:28:04 --> 00:28:07
Here, we were getting N times
log N over log B,
438
00:28:07 --> 00:28:12
OK, so N log N over log B.
We're getting a log B savings
439
00:28:12 --> 00:28:16
over not doing anything,
and here we are getting a
440
00:28:16 --> 00:28:19
factor of B savings,
N log N over B.
441
00:28:19 --> 00:28:24
In fact, we even made it a
little bit smaller by dividing
442
00:28:24 --> 00:28:28
this N by M.
That doesn't matter too much.
443
00:28:28 --> 00:28:32
This dividing by B is a big
one.
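To make these savings concrete, here is a small numeric sanity check of the three bounds just discussed. The parameter values are made up purely for illustration:

```python
import math

def naive_bound(N):
    # plain N log N memory transfers: one transfer per comparison
    return N * math.log2(N)

def btree_bound(N, B):
    # repeated insertion into a B-tree: N log N / log B
    return N * math.log2(N) / math.log2(B)

def mergesort_bound(N, M, B):
    # binary merge sort: N/B transfers per level, log2(N/M) levels
    return (N / B) * math.log2(N / M)

# Illustrative (made-up) parameters: 2^20 items, cache of 2^16 words,
# blocks of 2^8 words.
N, M, B = 2**20, 2**16, 2**8
```

With these numbers the B-tree saves a factor of log B = 8 over the naive bound, while merge sort saves a factor of roughly B = 256, matching the discussion above.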
444
00:28:32 --> 00:28:35
OK, so we're almost there.
This is almost an optimal
445
00:28:35 --> 00:28:37
algorithm.
It's even cache oblivious,
446
00:28:37 --> 00:28:40
which is pretty cool.
And that extra little step,
447
00:28:40 --> 00:28:43
which is that you should be
able to get another log B
448
00:28:43 --> 00:28:46
factor improvement,
I want to combine these two
449
00:28:46 --> 00:28:48
ideas.
I want to keep this factor B
450
00:28:48 --> 00:28:51
improvement over N log N,
and I want to keep this factor
451
00:28:51 --> 00:28:54
log B improvement over N log N,
and get them together.
452
00:28:54 --> 00:28:57
So, first, before we do that
cache obliviously,
453
00:28:57 --> 00:29:03
let's do it cache aware.
So, this is the third cache
454
00:29:03 --> 00:29:07
aware algorithm.
This one was also cache
455
00:29:07 --> 00:29:11
oblivious.
So, how should I modify a merge
456
00:29:11 --> 00:29:18
sort in order to do better?
I mean, I have this log base
457
00:29:18 --> 00:29:22
two, and I want a log base B,
more or less.
458
00:29:22 --> 00:29:27
So, how would I do that with
merge sort?
459
00:29:27 --> 00:29:30
Yeah?
Split into B subarrays,
460
00:29:30 --> 00:29:32
yeah.
Instead of doing binary merge
461
00:29:32 --> 00:29:35
sort, this is what I was hinting
at here, instead of splitting it
462
00:29:35 --> 00:29:37
into two pieces,
and recursing on the two
463
00:29:37 --> 00:29:40
pieces, and then merging them,
I could split potentially into
464
00:29:40 --> 00:29:42
more pieces.
OK, and to do that,
465
00:29:42 --> 00:29:45
I'm going to use my cache.
So the idea is B pieces.
466
00:29:45 --> 00:29:48
This is actually not the best
thing to do, although B pieces
467
00:29:48 --> 00:29:50
does work.
And, it's what I was hinting at
468
00:29:50 --> 00:29:52
because I was saying I want a
log B.
469
00:29:52 --> 00:29:55
It's actually not quite log B.
It's log M over B.
470
00:29:55 --> 00:29:57
OK, but let's see.
So, what is the most pieces I
471
00:29:57 --> 00:30:01
could split into?
Right, well,
472
00:30:01 --> 00:30:06
I could split into N pieces.
That would be good,
473
00:30:06 --> 00:30:11
wouldn't it,
at only one recursive level?
474
00:30:11 --> 00:30:14
I can't split into N pieces.
Why?
475
00:30:14 --> 00:30:19
What happens wrong when I split
into N pieces?
476
00:30:19 --> 00:30:24
That would be the ultimate.
You can't merge,
477
00:30:24 --> 00:30:27
exactly.
So, if I have N pieces,
478
00:30:27 --> 00:30:33
you can't merge in cache.
I mean, so in order to merge in
479
00:30:33 --> 00:30:37
cache, what I need is to be able
to store an entire block from
480
00:30:37 --> 00:30:40
each of the lists that I'm
merging.
481
00:30:40 --> 00:30:43
If I can store an entire block
in cache for each of the lists,
482
00:30:43 --> 00:30:46
then it's a bunch of parallel
scans.
483
00:30:46 --> 00:30:49
So this is like testing the
limit of parallel scanning
484
00:30:49 --> 00:30:52
technology.
If you have K parallel scans,
485
00:30:52 --> 00:30:55
and you can fit K blocks in
cache, then all is well because
486
00:30:55 --> 00:30:58
you can scan through each of
those K arrays,
487
00:30:58 --> 00:31:02
and have one block from each of
the K arrays in cache at the
488
00:31:02 --> 00:31:05
same time.
So, that's the idea.
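The K parallel scans can be sketched as an ordinary K-way merge. This in-memory sketch leaves the blocks implicit, but each input list is consumed as one sequential scan, which is exactly what makes it cache efficient when K blocks fit in cache:

```python
import heapq

def kway_merge(lists):
    """Merge K sorted lists by repeatedly extracting the minimum head.

    Each input list is read left to right (a sequential scan), so if the
    cache can hold one block per list, every block is fetched only once.
    """
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)
        out.append(val)
        if j + 1 < len(lists[i]):
            heapq.heappush(heap, (lists[i][j + 1], i, j + 1))
    return out
```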
489
00:31:05 --> 00:31:09
Now, how many blocks can I fit
in cache?
490
00:31:09 --> 00:31:13
M over B.
That's the biggest I could do.
491
00:31:13 --> 00:31:18
So this will give the best
running time among these kinds
492
00:31:18 --> 00:31:24
of merge sort algorithms.
This is an M over B way merge
493
00:31:24 --> 00:31:27
sort.
OK, so now we get somewhat
494
00:31:27 --> 00:31:31
better recurrence.
We split into M over B
495
00:31:31 --> 00:31:34
subproblems now,
each of size,
496
00:31:34 --> 00:31:38
well, it's N divided by M over
B without thinking.
497
00:31:38 --> 00:31:43
And, the claim is that the
merge time is still linear
498
00:31:43 --> 00:31:48
because we have barely enough,
OK, maybe I should describe
499
00:31:48 --> 00:31:50
this algorithm.
So, we divide,
500
00:31:50 --> 00:31:55
because we've never really done
non-binary merge sort.
501
00:31:55 --> 00:32:00
We divide into M over B equal
size subarrays instead of two.
502
00:32:00 --> 00:32:06
Here, we are clearly doing a
cache aware algorithm.
503
00:32:06 --> 00:32:11
We are assuming we know what M
over B is.
504
00:32:11 --> 00:32:17
So, then we recursively sort
each subarray,
505
00:32:17 --> 00:32:21
and then we conquer.
We merge.
506
00:32:21 --> 00:32:29
And, the reason merge works is
because we can afford one block
507
00:32:29 --> 00:32:34
in cache.
So, let's call it one cache
508
00:32:34 --> 00:32:36
block per subarray.
OK, actually,
509
00:32:36 --> 00:32:40
if you're careful,
you also need one block for the
510
00:32:40 --> 00:32:44
output of the merged array
before you write it out.
511
00:32:44 --> 00:32:47
So, it should be M over B minus
one.
512
00:32:47 --> 00:32:50
But, let's ignore some additive
constants.
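A sketch of the multiway merge sort just described, with the fan-out passed in as a parameter. A cache-aware caller would set it to about M over B; the choice of value here is an assumption, not part of the algorithm:

```python
import heapq

def multiway_mergesort(a, fanout):
    # Base case: small arrays are "sorted in cache".
    if len(a) <= fanout:
        return sorted(a)
    # Divide into `fanout` equal-size subarrays...
    step = -(-len(a) // fanout)          # ceiling division
    runs = [multiway_mergesort(a[i:i + step], fanout)
            for i in range(0, len(a), step)]
    # ...recursively sort each, then merge them all at once.
    return list(heapq.merge(*runs))
```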
513
00:32:50 --> 00:32:53
OK, so this is the recurrence
we get.
514
00:32:53 --> 00:32:59
The base case is the same.
And, what improves here?
515
00:32:59 --> 00:33:02
I mean, the per level cost
doesn't change,
516
00:33:02 --> 00:33:06
I claim, because at the top we
get N over B.
517
00:33:06 --> 00:33:09
Just as before.
Then we split into M over B
518
00:33:09 --> 00:33:15
subproblems, each of which costs
a one over M over B factor times
519
00:33:15 --> 00:33:18
N over B.
OK, so you add all those up,
520
00:33:18 --> 00:33:23
you still get N over B because
we are not decreasing the number
521
00:33:23 --> 00:33:26
of elements.
We're just splitting them.
522
00:33:26 --> 00:33:31
There's now M over B
subproblems, each of one over M
523
00:33:31 --> 00:33:36
over B the size.
So, just like before,
524
00:33:36 --> 00:33:39
each level will sum to N over
B.
525
00:33:39 --> 00:33:44
What changes is the number of
levels because now we have
526
00:33:44 --> 00:33:49
bigger branching factor.
Instead of log base two,
527
00:33:49 --> 00:33:53
it's now log base the branching
factor.
528
00:33:53 --> 00:33:59
So, the height of this tree is
log base M over B of N over M,
529
00:33:59 --> 00:34:03
I believe.
Let me make sure that agrees
530
00:34:03 --> 00:34:06
with me.
Yeah.
531
00:34:06 --> 00:34:12
OK, and if you're careful,
this counts not quite the
532
00:34:12 --> 00:34:18
number of levels,
but the number of levels minus
533
00:34:18 --> 00:34:22
one.
So, I'm going to add a plus one
534
00:34:22 --> 00:34:26
here.
And the reason why is this is
535
00:34:26 --> 00:34:37
not quite the bound that I want.
So, we have log base M over B.
536
00:34:37 --> 00:34:45
What I really want,
actually, is N over B in the log.
537
00:34:45 --> 00:34:55
I claim that these are the same
because we have this minus one,
538
00:34:55 --> 00:35:01
yeah, that's good.
OK, this may seem rather
539
00:35:01 --> 00:35:05
mysterious, but it's because I
know what the sorting bound
540
00:35:05 --> 00:35:07
should be as I'm doing this
arithmetic.
541
00:35:07 --> 00:35:10
So, I'm taking log base M over
B of N over M.
542
00:35:10 --> 00:35:12
I'm not changing the base of
the log.
543
00:35:12 --> 00:35:14
I'm just saying,
well, N over M,
544
00:35:14 --> 00:35:17
that is N over B divided by M
over B because then the B's
545
00:35:17 --> 00:35:20
cancel, and the M goes on the
bottom.
546
00:35:20 --> 00:35:23
So, if I do that in the logs,
I get log of N over B minus log
547
00:35:23 --> 00:35:26
of M over B minus,
because I'm dividing.
548
00:35:26 --> 00:35:30
OK, now, log base M over B of M
over B is one.
549
00:35:30 --> 00:35:33
So, these cancel,
and I get log base M over B,
550
00:35:33 --> 00:35:36
N over B, which is what I was
aiming for.
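Written out, the algebra just performed is:

```latex
\log_{M/B} \frac{N}{M}
  = \log_{M/B}\!\left( \frac{N/B}{M/B} \right)
  = \log_{M/B} \frac{N}{B} - \log_{M/B} \frac{M}{B}
  = \log_{M/B} \frac{N}{B} - 1,
```

so the plus one added to the level count cancels the minus one, giving exactly $\log_{M/B}(N/B)$.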
551
00:35:36 --> 00:35:39
Why?
Because that's the right bound
552
00:35:39 --> 00:35:43
as it's normally written.
OK, that's what we will be
553
00:35:43 --> 00:35:48
trying to get cache obliviously.
So, that's the height of the
554
00:35:48 --> 00:35:53
search tree, and at each level
we are paying N over B memory
555
00:35:53 --> 00:35:56
transfers.
So, the overall number of
556
00:35:56 --> 00:36:01
memory transfers for this M over
B way merge sort is the sorting
557
00:36:01 --> 00:36:03
bound.
558
00:36:03 --> 00:36:13
559
00:36:13 --> 00:36:19
This is, I'll put it in a box.
This is the sorting bound,
560
00:36:19 --> 00:36:25
and it's very special because
it is the optimal number of
561
00:36:25 --> 00:36:31
memory transfers for sorting N
items cache aware.
562
00:36:31 --> 00:36:33
This has been known since,
like, 1983.
563
00:36:33 --> 00:36:35
OK, this is the best thing to
do.
564
00:36:35 --> 00:36:38
It's a really weird bound,
but if you ignore all the
565
00:36:38 --> 00:36:41
divided by B's,
it's sort of like N times log
566
00:36:41 --> 00:36:44
base M of N.
So, that's little bit more
567
00:36:44 --> 00:36:46
reasonable.
But, there's lots of divided by
568
00:36:46 --> 00:36:49
B's.
So, the number of the blocks in
569
00:36:49 --> 00:36:53
the input times log base the
number of blocks in the cache of
570
00:36:53 --> 00:36:55
the number of blocks in the
input.
571
00:36:55 --> 00:36:57
That's a little bit more
intuitive.
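As a quick numeric reading of the sorting bound, with made-up parameters N = 2^30, M = 2^20, B = 2^10, there are 2^20 blocks in the input and 2^10 blocks in cache:

```python
import math

def sorting_bound(N, M, B):
    # (N/B) * log base (M/B) of (N/B): blocks in the input, times log
    # base (blocks in the cache) of (blocks in the input)
    return (N / B) * (math.log2(N / B) / math.log2(M / B))

# With the illustrative parameters above, this comes out to
# 2^20 * log_{2^10}(2^20) = 2^20 * 2 = 2^21 memory transfers.
```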
572
00:36:57 --> 00:37:02
That is the bound.
And that's what we are aiming
573
00:37:02 --> 00:37:04
for.
So, this algorithm,
574
00:37:04 --> 00:37:08
crucially, assume that we knew
what M over B was.
575
00:37:08 --> 00:37:12
Now, we are going to try and do
it without knowing M over B,
576
00:37:12 --> 00:37:17
do it cache obliviously.
And that is the result of only
577
00:37:17 --> 00:37:19
a few years ago.
Are you ready?
578
00:37:19 --> 00:37:23
Everything clear so far?
It's a pretty natural
579
00:37:23 --> 00:37:26
algorithm.
We were going to try to mimic
580
00:37:26 --> 00:37:31
it essentially and do a merge
sort, but not M over B way merge
581
00:37:31 --> 00:37:36
sort because we don't know how.
We're going to try and do it
582
00:37:36 --> 00:37:39
essentially a square root of N
way merge sort.
583
00:37:39 --> 00:37:43
If you play around,
that's the natural thing to do.
584
00:37:43 --> 00:37:46
The tricky part is that it's
hard to merge square root of N
585
00:37:46 --> 00:37:50
lists at the same time,
in a cache efficient way.
586
00:37:50 --> 00:37:54
We know that if the square root
of N is bigger than M over B,
587
00:37:54 --> 00:37:57
you're hosed if you just do a
straightforward merge.
588
00:37:57 --> 00:38:02
So, we need a fancy merge.
We are going to do a divide and
589
00:38:02 --> 00:38:05
conquer merge.
It's a lot like the
590
00:38:05 --> 00:38:10
multithreaded algorithms of last
week, try and do a divide and
591
00:38:10 --> 00:38:14
conquer merge so that no matter
how many lists are merging,
592
00:38:14 --> 00:38:18
as long as it's less than the
square root of N,
593
00:38:18 --> 00:38:23
or actually cubed root of N,
we can do it cache efficiently,
594
00:38:23 --> 00:38:24
OK?
So, we'll do this,
595
00:38:24 --> 00:38:28
we need a bit of setup.
But that's where we're going,
596
00:38:28 --> 00:38:33
cache oblivious sorting.
So, we want to get the sorting
597
00:38:33 --> 00:38:36
bound, and, yeah.
It turns out,
598
00:38:36 --> 00:38:40
to do cache oblivious sorting,
you need an assumption about
599
00:38:40 --> 00:38:42
the cache size.
This is kind of annoying,
600
00:38:42 --> 00:38:45
because we said,
well, cache oblivious
601
00:38:45 --> 00:38:49
algorithms should work for all
values of B and all values of M.
602
00:38:49 --> 00:38:53
But, you can actually prove you
need an additional assumption in
603
00:38:53 --> 00:38:55
order to get this bound cache
obliviously.
604
00:38:55 --> 00:38:58
That's the result of,
like, last year by Gerth
605
00:38:58 --> 00:39:01
Brodal.
So, and the assumption is,
606
00:39:01 --> 00:39:04
well, the assumption is fairly
weak.
607
00:39:04 --> 00:39:07
That's the good news.
OK, we've actually made an
608
00:39:07 --> 00:39:10
assumption several times.
We said, well,
609
00:39:10 --> 00:39:13
assuming the cache can store at
least three blocks,
610
00:39:13 --> 00:39:17
or assuming the cache can store
at least four blocks,
611
00:39:17 --> 00:39:21
yeah, it's reasonable to say
the cache can store at least
612
00:39:21 --> 00:39:25
four blocks, or at least any
constant number of blocks.
613
00:39:25 --> 00:39:29
This is that the number of
blocks that your cache can store
614
00:39:29 --> 00:39:33
is at least B to the epsilon
blocks.
615
00:39:33 --> 00:39:36
This is saying your cache
isn't, like, really narrow.
616
00:39:36 --> 00:39:37
It's about as tall as it is
wide.
617
00:39:37 --> 00:39:40
This actually gives you a lot
of slack.
618
00:39:40 --> 00:39:42
And, we're going to use a
simple version of this
619
00:39:42 --> 00:39:44
assumption that M is at least
B^2.
620
00:39:44 --> 00:39:48
OK, this is pretty natural.
It's saying that your cache is
621
00:39:48 --> 00:39:51
at least as tall as it is wide
where these are the blocks.
622
00:39:51 --> 00:39:54
OK, the number of blocks is it
least the size of a block.
623
00:39:54 --> 00:39:57
That's a pretty reasonable
assumption, and if you look at
624
00:39:57 --> 00:40:00
caches these days,
they all satisfy this,
625
00:40:00 --> 00:40:04
at least for some epsilon.
Pretty much universally,
626
00:40:04 --> 00:40:08
M is at least B^2 or so.
OK, and in fact,
627
00:40:08 --> 00:40:12
if you think from our speed of
light arguments from last time,
628
00:40:12 --> 00:40:16
B^2 or B^3 is actually the
right thing to do.
629
00:40:16 --> 00:40:18
As you go out,
I guess in 3-D,
630
00:40:18 --> 00:40:23
B^2 would be the surface area
of the sphere out there.
631
00:40:23 --> 00:40:27
OK, so this is actually the
natural thing of how much space
632
00:40:27 --> 00:40:32
you should have at a particular
distance.
633
00:40:32 --> 00:40:35
Assuming we live in a constant
dimensional space,
634
00:40:35 --> 00:40:40
that assumption would be true.
This even allows going up to 42
635
00:40:40 --> 00:40:43
dimensions or whatever,
OK, so a pretty reasonable
636
00:40:43 --> 00:40:44
assumption.
Good.
637
00:40:44 --> 00:40:47
Now, we are going to achieve
this bound.
638
00:40:47 --> 00:40:52
And what we are going to try to
do is use an N to the epsilon
639
00:40:52 --> 00:40:56
way merge sort for some epsilon.
And, if we assume that M is at
640
00:40:56 --> 00:41:02
least B^2, the epsilon will be
one third, it turns out.
641
00:41:02 --> 00:41:08
So, we are going to do the
cubed root of N way merge sort.
642
00:41:08 --> 00:41:14
I'll start by giving you and
analyzing the sorting
643
00:41:14 --> 00:41:20
algorithms, assuming that we
know how to do merge in a
644
00:41:20 --> 00:41:25
particular bound.
OK, then we'll do the merge.
645
00:41:25 --> 00:41:31
The merge is the hard part.
OK, so the merge,
646
00:41:31 --> 00:41:34
I'm going to give you the black
box first of all.
647
00:41:34 --> 00:41:36
First of all,
what does merge do?
648
00:41:36 --> 00:41:40
The K way merger is called the
K funnel just because it looks
649
00:41:40 --> 00:41:42
like a funnel,
which you'll see.
650
00:41:42 --> 00:41:45
So, a K funnel is a data
structure, or is an algorithm,
651
00:41:45 --> 00:41:48
let's say, that looks like a
data structure.
652
00:41:48 --> 00:41:52
And it merges K sorted lists.
So, supposing you already have
653
00:41:52 --> 00:41:56
K lists, and they're sorted,
and assuming that the lists are
654
00:41:56 --> 00:41:59
relatively long,
so we need some additional
655
00:41:59 --> 00:42:03
assumptions for this black box
to work, and we'll be able to
656
00:42:03 --> 00:42:09
get them as we sort.
We want the total size of those
657
00:42:09 --> 00:42:12
lists.
You add up all the elements,
658
00:42:12 --> 00:42:17
and all the lists should have
size at least K^3 is the
659
00:42:17 --> 00:42:21
assumption.
Then, it merges these lists
660
00:42:21 --> 00:42:25
using essentially the sorting
bound.
661
00:42:25 --> 00:42:30
Actually, I should really say
theta K^3.
662
00:42:30 --> 00:42:36
I also don't want it to be too
much bigger than K^3.
663
00:42:36 --> 00:42:42
Sorry about that.
So, the number of memory
664
00:42:42 --> 00:42:50
transfers that this funnel
merger uses is the sorting bound
665
00:42:50 --> 00:42:57
on K^3, so K^3 over B,
log base M over B of K^3 over
666
00:42:57 --> 00:43:03
B, plus another K memory
transfers.
667
00:43:03 --> 00:43:06
Now, K memory transfers is
pretty reasonable.
668
00:43:06 --> 00:43:09
You've got to at least start
reading each list,
669
00:43:09 --> 00:43:12
so you got to pay one memory
transfer per list.
670
00:43:12 --> 00:43:16
OK, but our challenge in some
sense will be getting rid of
671
00:43:16 --> 00:43:19
this plus K.
This is how fast we can merge.
672
00:43:19 --> 00:43:22
We'll do that after.
Now, assuming we have this,
673
00:43:22 --> 00:43:26
let me tell you how to sort.
This is, eventually enough,
674
00:43:26 --> 00:43:31
called funnel sort.
But in a certain sense,
675
00:43:31 --> 00:43:36
it's really cubed root of N way
merge sort.
676
00:43:36 --> 00:43:41
OK, but we'll analyze it using
this.
677
00:43:41 --> 00:43:47
OK, so funnel sort,
we are going to define K to be
678
00:43:47 --> 00:43:52
N to the one third,
and apply this merger.
679
00:43:52 --> 00:43:56
So, what do we do?
It's just like here.
680
00:43:56 --> 00:44:05
We're going to divide our array
into N to the one third.
681
00:44:05 --> 00:44:09
I mean, they should be
consecutive subarrays.
682
00:44:09 --> 00:44:13
I'll call them segments of the
array.
683
00:44:13 --> 00:44:18
OK, for cache oblivious,
it's really crucial how these
684
00:44:18 --> 00:44:22
things are laid out.
We're going to cut and get
685
00:44:22 --> 00:44:28
consecutive chunks of the array,
N to the one third of them.
686
00:44:28 --> 00:44:34
Then I'm going to recursively
sort them, and then I'm going to
687
00:44:34 --> 00:44:37
merge.
OK, and I'm going to merge
688
00:44:37 --> 00:44:41
using the K funnel,
the N to the one third funnel
689
00:44:41 --> 00:44:43
because, now,
why do I use one third?
690
00:44:43 --> 00:44:48
Well, because of this three.
OK, in order to use the N to
691
00:44:48 --> 00:44:51
the one third funnel,
I need to guarantee that the
692
00:44:51 --> 00:44:55
total number of elements that
I'm merging is at least the cube
693
00:44:55 --> 00:44:57
of this number,
K^3.
694
00:44:57 --> 00:45:01
The cube of this number is N.
That's exactly how many
695
00:45:01 --> 00:45:05
elements I have in total.
OK, so this is exactly what I
696
00:45:05 --> 00:45:08
can apply the funnel.
It's going to require that I
697
00:45:08 --> 00:45:11
have at least K^3 elements,
so that I can only use an N to
698
00:45:11 --> 00:45:14
the one third funnel.
I mean, if it didn't have this
699
00:45:14 --> 00:45:17
requirement, I could just say,
well, I have N lists each of
700
00:45:17 --> 00:45:20
size one.
OK, that's clearly not going to
701
00:45:20 --> 00:45:23
work very well for our merger,
I mean, intuitively because
702
00:45:23 --> 00:45:26
this plus K will kill you.
That will be a plus N, which is
703
00:45:26 --> 00:45:30
way too big.
But we can use an N to the one
704
00:45:30 --> 00:45:35
third funnel,
and this is how we would sort.
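The top level of funnel sort can be sketched as follows. The K-funnel is the hard part and has not been described yet, so this sketch stands in an ordinary K-way merge for it; it shows the recursion structure only, not the cache behavior:

```python
import heapq

def funnel_sort(a):
    n = len(a)
    if n <= 8:                        # small base case, sorted "in cache"
        return sorted(a)
    # Divide into about N^(1/3) consecutive segments...
    k = max(2, round(n ** (1 / 3)))
    step = -(-n // k)                 # ceiling division
    # ...recursively sort each segment...
    segments = [funnel_sort(a[i:i + step]) for i in range(0, n, step)]
    # ...then merge. A real K-funnel goes here; heapq.merge is a stand-in.
    return list(heapq.merge(*segments))
```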
705
00:45:35 --> 00:45:38
So, let's analyze this
algorithm.
706
00:45:38 --> 00:45:42
Hopefully, it will give the
sorting bound if I did
707
00:45:42 --> 00:45:47
everything correctly.
OK, this is pretty easy.
708
00:45:47 --> 00:45:52
The only thing that makes this
messy is that I have to write the
709
00:45:52 --> 00:45:58
sorting bound over and over.
OK, this is the cost of the
710
00:45:58 --> 00:46:02
merge.
So that's at the root.
711
00:46:02 --> 00:46:07
But K^3 in this case is N.
So at the root of the
712
00:46:07 --> 00:46:11
recursion, let me write the
recurrence first.
713
00:46:11 --> 00:46:15
Sorry.
So, we have memory transfers on
714
00:46:15 --> 00:46:19
N elements is N to the one
third.
715
00:46:19 --> 00:46:24
Let me get this right.
Yeah, N to the one third
716
00:46:24 --> 00:46:28
recursions, each of size N to
the two thirds,
717
00:46:28 --> 00:46:34
OK, plus this time,
except K^3 is N.
718
00:46:34 --> 00:46:40
So, this is plus N over B,
log base M over B of N over B
719
00:46:40 --> 00:46:46
plus cubed root of N.
This is additive plus K term.
720
00:46:46 --> 00:46:52
OK, so that's my recurrence.
The base case will be the
721
00:46:52 --> 00:46:57
usual.
MT is some constant times M is
722
00:46:57 --> 00:47:02
order M over B.
So, we sort of know what we
723
00:47:02 --> 00:47:06
should get here.
Well, not really.
724
00:47:06 --> 00:47:09
So, in all the previous
recurrences,
725
00:47:09 --> 00:47:15
we have the same costs at every
level, and that's where we got
726
00:47:15 --> 00:47:20
our log factor.
Now, we already have a log
727
00:47:20 --> 00:47:24
factor, so we better not get
another one.
728
00:47:24 --> 00:47:28
Right, this is the bound we
want to prove.
729
00:47:28 --> 00:47:33
So, let me cheat here for a
second.
730
00:47:33 --> 00:47:36
All right, indeed.
You may already be wondering,
731
00:47:36 --> 00:47:39
this N to the one third seems
rather large.
732
00:47:39 --> 00:47:43
If it's bigger than this,
we are already in trouble at
733
00:47:43 --> 00:47:45
the very top level of the
recursion.
734
00:47:45 --> 00:47:49
So, I claim that that's OK.
Let's look at N to the one
735
00:47:49 --> 00:47:51
third.
OK, there is a base case here
736
00:47:51 --> 00:47:54
which covers all values of N
that are, at most,
737
00:47:54 --> 00:47:58
some constant times M.
So, if I'm not in that case,
738
00:47:58 --> 00:48:02
I know that N is at least as
big as the cache up to some
739
00:48:02 --> 00:48:06
constant.
OK, now the cache is it least
740
00:48:06 --> 00:48:10
B^2, we've assumed.
And you can do this with B to
741
00:48:10 --> 00:48:13
the one plus epsilon if you're
more careful.
742
00:48:13 --> 00:48:15
So, N is at least B^2,
OK?
743
00:48:15 --> 00:48:19
And then, I always have trouble
with these.
744
00:48:19 --> 00:48:23
So this means that N divided by
B is omega root N.
745
00:48:23 --> 00:48:26
OK, there's many things you
could say here,
746
00:48:26 --> 00:48:30
and only one of them is right.
So, why?
747
00:48:30 --> 00:48:34
So this says that the square
root of N is at least B,
748
00:48:34 --> 00:48:38
and so N divided by B is at
least N divided by square root of
749
00:48:38 --> 00:48:41
N.
So that's at least the square
750
00:48:41 --> 00:48:43
root of N if you check that all
out.
751
00:48:43 --> 00:48:48
I'm going to go through this
arithmetic relatively quickly
752
00:48:48 --> 00:48:50
because it's tedious but
necessary.
753
00:48:50 --> 00:48:54
OK, the square root of N is
strictly bigger than cubed root
754
00:48:54 --> 00:48:57
of N.
OK, so that means that N over B
755
00:48:57 --> 00:49:02
is strictly bigger than N to the
one third.
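Collecting the chain of inequalities just argued (ignoring the constant in the base case):

```latex
N \ge M \text{ and } M \ge B^2
\;\Longrightarrow\; B \le \sqrt{N}
\;\Longrightarrow\; \frac{N}{B} \ge \frac{N}{\sqrt{N}} = \sqrt{N} > N^{1/3},
```

which is why the N over B term dominates the additive N to the one third term above the base case.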
756
00:49:02 --> 00:49:05
Here we have N over B times
something that's bigger than
757
00:49:05 --> 00:49:07
one.
So this term definitely
758
00:49:07 --> 00:49:10
dominates this term in this
case.
759
00:49:10 --> 00:49:14
As long as I'm not in the base
case, I know N is at least order
760
00:49:14 --> 00:49:16
M.
This term disappears from my
761
00:49:16 --> 00:49:18
recurrence.
OK, so, good.
762
00:49:18 --> 00:49:21
That was a bit close.
Now, what we want to get is
763
00:49:21 --> 00:49:25
this running time overall.
So, the recursive cost better
764
00:49:25 --> 00:49:29
be small, better be less than
the constant factor increase
765
00:49:29 --> 00:49:35
over this.
So, let's write the recurrence.
766
00:49:35 --> 00:49:39
So, we get N over B,
log base M over B,
767
00:49:39 --> 00:49:44
N over B at the root.
Then, we split into a lot of
768
00:49:44 --> 00:49:49
subproblems, N to the one third
subproblems here,
769
00:49:49 --> 00:49:55
and each one costs essentially
this but with N replaced by N to
770
00:49:55 --> 00:50:00
the two thirds.
OK, so N to the two thirds log
771
00:50:00 --> 00:50:04
base M over B,
oops I forgot to divide it by B
772
00:50:04 --> 00:50:11
out here, of N to the two thirds
divided by B.
773
00:50:11 --> 00:50:14
That's the cost of one of these
nodes, N to the one third of
774
00:50:14 --> 00:50:17
them.
What should they add up to?
775
00:50:17 --> 00:50:20
Well, there is N to the one
third, and there's an N to the
776
00:50:20 --> 00:50:23
two thirds here that multiplies
out to N.
777
00:50:23 --> 00:50:25
So, we get N over B.
This looks bad.
778
00:50:25 --> 00:50:28
This looks the same.
And we don't want to lose
779
00:50:28 --> 00:50:31
another log factor.
But the good news is we have
780
00:50:31 --> 00:50:35
two thirds in here.
OK, this is what we get in
781
00:50:35 --> 00:50:38
total at this level.
It looks like the sorting
782
00:50:38 --> 00:50:41
bound, but in the log there's
still a two thirds.
783
00:50:41 --> 00:50:45
Now, a power of two thirds in
a log comes out as a factor of
784
00:50:45 --> 00:50:48
two thirds.
So, this is in fact two thirds
785
00:50:48 --> 00:50:51
times N over B,
log base M over B of N over B,
786
00:50:51 --> 00:50:54
the sorting bound.
So, this is two thirds of the
787
00:50:54 --> 00:50:57
sorting bound.
And this is the sorting bound,
788
00:50:57 --> 00:51:01
one times the sorting bound.
So, it's going down
789
00:51:01 --> 00:51:02
geometrically,
yea!
790
00:51:02 --> 00:51:05
OK, I'm not going to prove it,
but it's true.
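Taking the two-thirds decrease per level on faith, as the lecture does, the levels sum as a geometric series:

```latex
\sum_{i \ge 0} \left(\tfrac{2}{3}\right)^{i} \frac{N}{B}\log_{M/B}\frac{N}{B}
  = 3 \cdot \frac{N}{B}\log_{M/B}\frac{N}{B}
  = O\!\left(\frac{N}{B}\log_{M/B}\frac{N}{B}\right).
```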
791
00:51:05 --> 00:51:08
This went down by a factor of
two thirds.
792
00:51:08 --> 00:51:12
The next one will also go down
by a factor of two thirds by
793
00:51:12 --> 00:51:14
induction.
OK, if you prove it at one
794
00:51:14 --> 00:51:17
level, it should be true at all
of them.
795
00:51:17 --> 00:51:19
And I'm going to skip the
details there.
796
00:51:19 --> 00:51:23
So, we could check the leaf
level just to make sure.
797
00:51:23 --> 00:51:25
That's always a good sanity
check.
798
00:51:25 --> 00:51:30
At the leaves,
we know our cost is M over B.
799
00:51:30 --> 00:51:32
OK, and how many leaves are
there?
800
00:51:32 --> 00:51:34
Just like before,
in some sense,
801
00:51:34 --> 00:51:38
we have N/M leaves.
OK, so in fact the total cost
802
00:51:38 --> 00:51:41
at the bottom is N over B.
And it turns out that that's
803
00:51:41 --> 00:51:44
what you get.
So, you essentially,
804
00:51:44 --> 00:51:47
it looks funny,
because you'd think that this
805
00:51:47 --> 00:51:51
would actually be smaller than
this at some intuitive level.
806
00:51:51 --> 00:51:54
It's not.
In fact, what's happening is
807
00:51:54 --> 00:51:57
you have this N over B times
this log thing,
808
00:51:57 --> 00:52:00
whatever the log thing is.
We don't care too much.
809
00:52:00 --> 00:52:05
Let's just call it log.
What you are taking at the next
810
00:52:05 --> 00:52:08
level is two thirds times that
log.
811
00:52:08 --> 00:52:11
And at the next level,
it's four ninths times that log
812
00:52:11 --> 00:52:13
and so on.
So, it's geometrically
813
00:52:13 --> 00:52:16
decreasing until the log gets
down to one.
814
00:52:16 --> 00:52:17
And then you stop the
recursion.
815
00:52:17 --> 00:52:21
And that's what you get N over
B here with no log.
816
00:52:21 --> 00:52:23
So, what you're doing is
decreasing the log,
817
00:52:23 --> 00:52:27
not the N over B stuff.
The two thirds should really be
818
00:52:27 --> 00:52:29
over here.
In fact, the number of levels
819
00:52:29 --> 00:52:34
here is log log N.
It's the number of times you
820
00:52:34 --> 00:52:39
have to divide a log by three
halves before you get down to
821
00:52:39 --> 00:52:42
one, OK?
So, we don't actually need
822
00:52:42 --> 00:52:45
that.
We don't care how many levels
823
00:52:45 --> 00:52:49
are because it's geometrically
decreasing.
824
00:52:49 --> 00:52:52
It could be infinitely many
levels.
825
00:52:52 --> 00:52:58
It's geometrically decreasing,
and we get this as our running
826
00:52:58 --> 00:53:01
time.
MT of N is the sorting bound
827
00:53:01 --> 00:53:05
for funnel sort.
So, this is great.
828
00:53:05 --> 00:53:09
As long as we can get a funnel
that merges this quickly,
829
00:53:09 --> 00:53:14
we get a sorting algorithm that
sorts as fast as it possibly
830
00:53:14 --> 00:53:17
can.
I didn't write that on the
831
00:53:17 --> 00:53:20
board that this is
asymptotically optimal.
832
00:53:20 --> 00:53:25
Even if you knew what B and M
were, this is the best that you
833
00:53:25 --> 00:53:28
could hope to do.
And here, we are doing it no
834
00:53:28 --> 00:53:32
matter what B and M are.
Good.
835
00:53:32 --> 00:53:35
Get ready for the funnel.
The funnel will be another
836
00:53:35 --> 00:53:37
recursion.
So, this is a recursive
837
00:53:37 --> 00:53:39
algorithm in a recursive
algorithm.
838
00:53:39 --> 00:53:43
It's another divide and
conquer, kind of like the static
839
00:53:43 --> 00:53:46
search trees we saw at the
beginning of this lecture.
840
00:53:46 --> 00:53:49
So, these all tie together.
841
00:53:49 --> 00:54:03
842
00:54:03 --> 00:54:06
All right, the K funnel,
so, I'm calling it K funnel
843
00:54:06 --> 00:54:10
because I want to think of it at
some recursive level,
844
00:54:10 --> 00:54:14
not just N to the one third.
OK, we're going to recursively
845
00:54:14 --> 00:54:17
use, in fact,
the square root of K funnel.
846
00:54:17 --> 00:54:21
So, here's, and I need to
achieve that bound.
847
00:54:21 --> 00:54:24
So, the recursion is like the
static search tree,
848
00:54:24 --> 00:54:27
and a little bit hard to draw
on one board,
849
00:54:27 --> 00:54:34
but here we go.
So, we have a square root of K
850
00:54:34 --> 00:54:37
funnel.
Recursively,
851
00:54:37 --> 00:54:44
we have a buffer up here.
This is called the output
852
00:54:44 --> 00:54:50
buffer, and it has size K^3,
and just for kicks,
853
00:54:50 --> 00:54:57
let's suppose it has filled up
a little bit.
854
00:54:57 --> 00:55:06
And, we have some more buffers.
And, let's suppose they've been
855
00:55:06 --> 00:55:13
filled up by different amounts.
And each of these has size K to
856
00:55:13 --> 00:55:16
the three halves,
of course.
857
00:55:16 --> 00:55:21
These are called
buffers, let's say,
858
00:55:21 --> 00:55:28
the intermediate buffers.
And, then hanging off of them,
859
00:55:28 --> 00:55:34
we have more funnels,
the square root of K funnel
860
00:55:34 --> 00:55:40
here, and a square root of K
funnel here, one for each
861
00:55:40 --> 00:55:47
buffer, one for each child of
this funnel.
862
00:55:47 --> 00:55:53
OK, and then hanging off of
these funnels are the input
863
00:55:53 --> 00:55:54
arrays.
864
00:55:54 --> 00:56:07
865
00:56:07 --> 00:56:12
OK, I'm not going to draw all K
of them, but there are K input
866
00:56:12 --> 00:56:16
arrays, input lists let's call
them down at the bottom.
867
00:56:16 --> 00:56:21
OK, so the idea is we are going
to merge bottom-up in this
868
00:56:21 --> 00:56:23
picture.
We start with our K input
869
00:56:23 --> 00:56:26
arrays of total size at least
K^3.
870
00:56:26 --> 00:56:31
That's what we're assuming we
have up here.
871
00:56:31 --> 00:56:34
We are clustering them into
groups of size square root of K,
872
00:56:34 --> 00:56:37
so, the square root of K
groups, throw each of them into
873
00:56:37 --> 00:56:40
a square root of K funnel that
recursively merges those square
874
00:56:40 --> 00:56:43
root of K lists.
The output of those funnels we
875
00:56:43 --> 00:56:46
are putting into a buffer to
sort of accumulate what the
876
00:56:46 --> 00:56:49
answer should be.
These buffers have size
877
00:56:49 --> 00:56:52
exactly K to the three halves,
which might not be perfect
878
00:56:52 --> 00:56:55
because we know that on average,
there should be K to the three
879
00:56:55 --> 00:56:59
halves elements in each of these
because there's K^3 total,
880
00:56:59 --> 00:57:02
and the square root of K
groups.
881
00:57:02 --> 00:57:05
So, it should be K^3 divided by
the square root of K,
882
00:57:05 --> 00:57:07
which is K to the three halves
on average.
883
00:57:07 --> 00:57:09
But some of these will be
bigger.
884
00:57:09 --> 00:57:12
Some of them will be smaller.
I've drawn it here.
885
00:57:12 --> 00:57:15
Some of them had emptied a bit
more depending on how you merge
886
00:57:15 --> 00:57:16
things.
But on average,
887
00:57:16 --> 00:57:18
these will all fill at the same
time.
888
00:57:18 --> 00:57:22
And then, we plug them into a
square root of K funnel,
889
00:57:22 --> 00:57:24
and then we get the output of
size K^3.
890
00:57:24 --> 00:57:28
So, that is roughly what we
should have happen.
891
00:57:28 --> 00:57:31
OK, but in fact,
some of these might fill first,
892
00:57:31 --> 00:57:36
and we have to do some merging
in order to empty a buffer,
893
00:57:36 --> 00:57:39
make room for more stuff coming
up.
894
00:57:39 --> 00:57:43
That's the picture.
Now, before I actually tell you
895
00:57:43 --> 00:57:47
what the algorithm is,
or analyze the algorithm,
896
00:57:47 --> 00:57:51
let's first just think about
space, a very simple warm-up
897
00:57:51 --> 00:57:54
analysis.
So, let's look at the space
898
00:57:54 --> 00:58:00
excluding the inputs and
outputs, those buffers.
899
00:58:00 --> 00:58:02
OK, why do I want to exclude
input and output buffers?
900
00:58:02 --> 00:58:05
Well, because I want to only
count each buffer once,
901
00:58:05 --> 00:58:09
and this buffer is actually the
input to this one and the output
902
00:58:09 --> 00:58:11
to this one.
So, in order to recursively
903
00:58:11 --> 00:58:14
count all the buffers exactly
once, I'm only going to count
904
00:58:14 --> 00:58:16
these middle buffers.
And then separately,
905
00:58:16 --> 00:58:20
I'm going to have to think of
the overall output and input
906
00:58:20 --> 00:58:22
buffers.
But those are sort of given.
907
00:58:22 --> 00:58:23
I mean, I need K^3 for the
output.
908
00:58:23 --> 00:58:26
I need K^3 for the input.
So ignore those overall.
909
00:58:26 --> 00:58:29
And that if I count the middle
buffers recursively,
910
00:58:29 --> 00:58:34
I'll get all the buffers.
So, then we get a very simple
911
00:58:34 --> 00:58:39
recurrence for space.
S of K is roughly square root
912
00:58:39 --> 00:58:45
of K plus one times S of square
root of K plus order K^2,
913
00:58:45 --> 00:58:51
K^2 because we have the square
root of K of these buffers,
914
00:58:51 --> 00:58:54
each of size K to the three
halves.
915
00:58:54 --> 00:58:58
Work that out,
does that sound right?
916
00:58:58 --> 00:59:02
That sounds an awful lot like
K^3, but maybe,
917
00:59:02 --> 00:59:06
all right.
Oh, no, that's right.
918
00:59:06 --> 00:59:09
It's K to the three halves
times the square root of K,
919
00:59:09 --> 00:59:13
which is K to the three halves
plus a half, which is K to the
920
00:59:13 --> 00:59:16
four halves, which is K^2.
Phew, OK, good.
921
00:59:16 --> 00:59:18
I'm just bad with my arithmetic
here.
922
00:59:18 --> 00:59:20
OK, so K^2 total buffering
here.
923
00:59:20 --> 00:59:23
You add them up for each level,
each recursion,
924
00:59:23 --> 00:59:27
and the plus one here is to
take into account the top guy,
925
00:59:27 --> 00:59:31
the square root of K bottom
guys, so the square root of K
926
00:59:31 --> 00:59:33
plus one.
If this were,
927
00:59:33 --> 00:59:36
well, let me just draw the
recurrence tree.
928
00:59:36 --> 00:59:39
There's many ways you could
solve this recurrence.
929
00:59:39 --> 00:59:41
A natural one is instead of
looking at K,
930
00:59:41 --> 00:59:44
you look at log K,
because here log K is
931
00:59:44 --> 00:59:47
getting divided by two.
I'm just going to draw the
932
00:59:47 --> 00:59:50
recursion trees,
so you can see the intuition.
933
00:59:50 --> 00:59:53
But if you are going to solve
it, you should probably take the
934
00:59:53 --> 00:59:57
logs and substitute.
So, we have the square root of
935
00:59:57 --> 1:00:00
K
plus one branching factor.
936
1:00:00 --> 1:00:03.729
And then, the problem is size
square root of K,
937
1:00:03.729 --> 1:00:08.108
so this is going to be K,
I believe, for each of these.
938
1:00:08.108 --> 1:00:12.324
This is square root of K
squared is the cost of these
939
1:00:12.324 --> 1:00:14.513
levels.
And, you keep going.
940
1:00:14.513 --> 1:00:19.54
I don't particularly care what
the bottom looks like because at
941
1:00:19.54 --> 1:00:23.351
the top we have K^2.
Then we have K times root K
942
1:00:23.351 --> 1:00:28.297
plus one cost at the next level.
This is K to the three halves
943
1:00:28.297 --> 1:00:32.664
plus K.
OK, so we go from K^2 to K to
944
1:00:32.664 --> 1:00:37.257
the three halves plus K.
This is a super-geometric.
945
1:00:37.257 --> 1:00:41.207
It's like an exponential
geometric decrease.
946
1:00:41.207 --> 1:00:45.8
This is decreasing really fast.
So, it's order K^2.
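The recurrence just stated, S(K) = (sqrt(K) + 1) * S(sqrt(K)) + Theta(K^2), can be sanity-checked numerically. This is a minimal sketch, assuming an arbitrary constant c = 1 and a constant-space base case (both illustrative assumptions, not from the lecture):

```python
def S(k, c=1.0):
    """Space of a K-funnel excluding its input and output buffers:
    S(K) = (sqrt(K) + 1) * S(sqrt(K)) + c * K^2."""
    if k <= 2:               # assumed constant-space base case
        return c
    root = k ** 0.5
    return (root + 1) * S(root, c) + c * k * k

for k in [16, 256, 65536]:
    print(k, round(S(k) / (k * k), 3))   # ratio S(K)/K^2
# The ratio stays bounded (1.371, 1.091, 1.004): S(K) = Theta(K^2),
# matching the super-geometric decrease of the lower recursion levels.
```

The top-level Theta(K^2) term dominates, which is exactly the hand-waving argument in the lecture.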
947
1:00:45.8 --> 1:00:51.22
That's my hand-waving argument.
OK, so the cost is basically
948
1:00:51.22 --> 1:00:56.456
the size of the buffers at the
top level, the total space.
949
1:00:56.456 --> 1:01:01.601
We're going to need this.
It's actually theta K^2 because
950
1:01:01.601 --> 1:01:06.398
I have a theta K^2 here.
We are going to need this in
951
1:01:06.398 --> 1:01:09.249
order to analyze the time.
That's why I mentioned it.
952
1:01:09.249 --> 1:01:12.368
It's not just a good feeling
that the space is not too big.
953
1:01:12.368 --> 1:01:15.595
In fact, the funnel is a lot
smaller than the total input size.
954
1:01:15.595 --> 1:01:18.177
The input size is K^3.
But that's not so crucial.
955
1:01:18.177 --> 1:01:21.243
What's crucial is that it's
K^2, and we'll use that in the
956
1:01:21.243 --> 1:01:22.48
analysis.
OK, naturally,
957
1:01:22.48 --> 1:01:24.308
this thing is laid out
recursively.
958
1:01:24.308 --> 1:01:26.675
You recursively store the
funnel, the top funnel.
959
1:01:26.675 --> 1:01:29.256
Then, for example,
you write out each buffer as a
960
1:01:29.256 --> 1:01:32
consecutive array,
in this case.
961
1:01:32 --> 1:01:34.748
There's no recursion there.
So just write them all out one
962
1:01:34.748 --> 1:01:36.243
by one.
Don't interleave them or
963
1:01:36.243 --> 1:01:37.642
anything.
Store them in order.
964
1:01:37.642 --> 1:01:40.005
And then, you write out
recursively these funnels,
965
1:01:40.005 --> 1:01:41.934
the bottom funnels.
OK, any way you do it
966
1:01:41.934 --> 1:01:44.634
recursively, as long as each
funnel remains a consecutive
967
1:01:44.634 --> 1:01:46.418
chunk of memory,
each buffer remains a
968
1:01:46.418 --> 1:01:49.167
consecutive chunk of memory,
the time analysis that we are
969
1:01:49.167 --> 1:01:51
about to do will work.
970
1:01:51 --> 1:02:14
971
1:02:14 --> 1:02:18.062
OK, let me actually give you
the algorithm that we're
972
1:02:18.062 --> 1:02:21.265
analyzing.
In order to make the funnel go,
973
1:02:21.265 --> 1:02:25.015
what we do is say,
initially, all the buffers are
974
1:02:25.015 --> 1:02:27.671
empty.
Everything is at the bottom.
975
1:02:27.671 --> 1:02:32.125
And what we are going to do is,
say, fill the root buffer.
976
1:02:32.125 --> 1:02:36.04
Fill this one.
And, that's a recursive
977
1:02:36.04 --> 1:02:41.542
algorithm, which I'll define in
a second, how to fill a buffer.
978
1:02:41.542 --> 1:02:45.713
Once it's filled,
that means everything has been
979
1:02:45.713 --> 1:02:50.682
pulled up, and then it's merged.
OK, so that's how we get
980
1:02:50.682 --> 1:02:53.522
started.
So, the merge
981
1:02:53.522 --> 1:02:58.402
algorithm is: fill the topmost
buffer, the topmost output
982
1:02:58.402 --> 1:03:01.002
buffer.
OK, and now,
983
1:03:01.002 --> 1:03:04.678
here's how you fill a buffer.
So, in general,
984
1:03:04.678 --> 1:03:08.355
if you expand out this
recursion all the way,
985
1:03:08.355 --> 1:03:12.114
in the base case,
which I didn't mention, you sort of
986
1:03:12.114 --> 1:03:16.71
get a little node there.
So, if you look at an arbitrary
987
1:03:16.71 --> 1:03:20.386
buffer in this picture that you
want to fill,
988
1:03:20.386 --> 1:03:23.979
so this one's empty and you
want to fill it,
989
1:03:23.979 --> 1:03:28.407
then immediately below it will
be a vertex who has two
990
1:03:28.407 --> 1:03:34.434
children, two other buffers.
OK, maybe they look like this.
991
1:03:34.434 --> 1:03:39.141
You have no idea how big they
are, except they are the same
992
1:03:39.141 --> 1:03:41.981
size.
It could be a lot smaller than
993
1:03:41.981 --> 1:03:44.984
this one, a lot bigger,
we don't know.
994
1:03:44.984 --> 1:03:48.554
But in the end,
you do get a binary structure
995
1:03:48.554 --> 1:03:53.261
out of this just like we did
with the binary search tree at
996
1:03:53.261 --> 1:03:56.913
the beginning.
So, how do we fill this buffer?
997
1:03:56.913 --> 1:04:03
Well, we just merge these two
child buffers as long as we can.
998
1:04:03 --> 1:04:08.854
So, we merge the two children
buffers as long as they are both
999
1:04:08.854 --> 1:04:11.253
non-empty.
So, in general,
1000
1:04:11.253 --> 1:04:16.82
the invariant will be that this
buffer, let me write down a
1001
1:04:16.82 --> 1:04:19.795
sentence.
As long as a buffer is
1002
1:04:19.795 --> 1:04:25.17
non-empty, whatever is in
that buffer and hasn't been
1003
1:04:25.17 --> 1:04:29.009
used already
is a prefix of the merged
1004
1:04:29.009 --> 1:04:34
output of the entire subtree
beneath it.
1005
1:04:34 --> 1:04:37.567
OK, so this is a partially
merged subsequence of everything
1006
1:04:37.567 --> 1:04:39.781
down here.
This is a partially merged
1007
1:04:39.781 --> 1:04:41.933
subsequence of everything down
here.
1008
1:04:41.933 --> 1:04:44.824
I can just merge element by
element off the top,
1009
1:04:44.824 --> 1:04:48.453
and that will give me outputs
to put there until one of them
1010
1:04:48.453 --> 1:04:51.097
gets emptied.
And, we have no idea which one
1011
1:04:51.097 --> 1:04:54.357
will empty first just because it
depends on the order.
1012
1:04:54.357 --> 1:04:57.801
OK, whenever one of them
empties, we recursively fill it,
1013
1:04:57.801 --> 1:05:01
and that's it.
That's the algorithm.
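The fill-and-merge procedure described here can be modeled in a few lines. This is a control-flow sketch only, assuming plain in-memory Python lists; the `Node` class and its field names are illustrative inventions, and it ignores the recursive memory layout that makes the real funnel cache-efficient:

```python
from collections import deque

class Node:
    def __init__(self, capacity, children=None, source=None):
        self.buf = deque()               # partially merged prefix of the subtree below
        self.capacity = capacity
        self.children = children or []   # two child nodes for an internal merger
        self.source = source             # a leaf reads directly from an input list
        self.exhausted = False

def fill(node):
    """Merge the two children into node.buf while both are non-empty,
    recursively refilling whichever child empties, and stop once the
    buffer fills or the whole subtree below is exhausted."""
    if node.source is not None:          # base case: read straight off an input list
        while len(node.buf) < node.capacity and node.source:
            node.buf.append(node.source.pop(0))
        node.exhausted = not node.source
        return
    left, right = node.children
    while len(node.buf) < node.capacity:
        for child in (left, right):      # refill an emptied, non-exhausted child
            if not child.buf and not child.exhausted:
                fill(child)
        if not left.buf and not right.buf:
            node.exhausted = True        # everything below has been merged
            return
        # merge element by element off the tops of the two children
        if not right.buf or (left.buf and left.buf[0] <= right.buf[0]):
            node.buf.append(left.buf.popleft())
        else:
            node.buf.append(right.buf.popleft())
```

Calling `fill` on the root node corresponds to "fill the topmost output buffer"; each child buffer holds a prefix of the merged output of its own subtree, which is the invariant stated in the lecture.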
1014
1:05:01 --> 1:05:05
Whenever one empties --
1015
1:05:05 --> 1:05:16
1016
1:05:16 --> 1:05:20.391
-- we recursively fill it.
And at the base case at the
1017
1:05:20.391 --> 1:05:23.456
leaves, there's sort of nothing
to do.
1018
1:05:23.456 --> 1:05:27.847
I believe you just sort of
directly read from an input
1019
1:05:27.847 --> 1:05:30.167
list.
So, at the very bottom,
1020
1:05:30.167 --> 1:05:34.807
if you have some node here
that's trying to merge between
1021
1:05:34.807 --> 1:05:39.198
these two, that's just a
straightforward merge between
1022
1:05:39.198 --> 1:05:42.595
two lists.
We know how to do that with two
1023
1:05:42.595 --> 1:05:44.832
parallel scans.
So, in fact,
1024
1:05:44.832 --> 1:05:49.886
we can merge the entire thing
here and just spit it out to the
1025
1:05:49.886 --> 1:05:52.786
buffer.
Well, it depends how big the
1026
1:05:52.786 --> 1:05:56.1
buffer is.
We can only merge it until the
1027
1:05:56.1 --> 1:06:01.445
buffer fills.
Whenever a buffer is full,
1028
1:06:01.445 --> 1:06:05.394
we stop and we pop up the
recursive layers.
1029
1:06:05.394 --> 1:06:11.131
OK, so we keep doing this merge
until the buffer we are trying
1030
1:06:11.131 --> 1:06:14.047
to fill fills,
and then we stop,
1031
1:06:14.047 --> 1:06:17.338
pop up.
OK, that's the algorithm for
1032
1:06:17.338 --> 1:06:20.724
merging.
Now, we just have to analyze
1033
1:06:20.724 --> 1:06:24.579
the algorithm.
It's actually not too hard,
1034
1:06:24.579 --> 1:06:29
but it's a pretty clever
analysis.
1035
1:06:29 --> 1:06:31.898
And, to top it off,
it's an amortization,
1036
1:06:31.898 --> 1:06:35.159
your favorite.
OK, so we get one last practice
1037
1:06:35.159 --> 1:06:39.072
at amortized analysis in the
context of cache oblivious
1038
1:06:39.072 --> 1:06:41.971
algorithms.
So, this is going to be a bit
1039
1:06:41.971 --> 1:06:45.231
sophisticated.
We are going to combine all the
1040
1:06:45.231 --> 1:06:48.492
ideas we've seen.
The main analysis idea we've
1041
1:06:48.492 --> 1:06:52.84
seen is that we are doing this
recursion in the construction,
1042
1:06:52.84 --> 1:06:55.666
and if we imagine,
we take our K funnel,
1043
1:06:55.666 --> 1:06:59.507
we split it in the middle
level, make a whole bunch of
1044
1:06:59.507 --> 1:07:03.202
square root of K funnels,
and so on, and then we cut
1045
1:07:03.202 --> 1:07:07.188
those in the middle level,
get fourth root of K funnels,
1046
1:07:07.188 --> 1:07:10.666
and so on, and so on,
at some point the funnel we
1047
1:07:10.666 --> 1:07:15.816
look at fits in cache.
OK, before we said if it's in a
1048
1:07:15.816 --> 1:07:17.984
block.
Now, we're going to say that at
1049
1:07:17.984 --> 1:07:20.914
some point, one of these funnels
will fit in cache.
1050
1:07:20.914 --> 1:07:24.253
Each of the funnels at that
recursive level of detail will
1051
1:07:24.253 --> 1:07:26.656
fit in cache.
We are going to analyze that
1052
1:07:26.656 --> 1:07:29
level.
We'll call that level J.
1053
1:07:29 --> 1:07:37.266
So, consider the first
recursive level of detail,
1054
1:07:37.266 --> 1:07:45.877
and I'll call it J,
at which every J funnel we have
1055
1:07:45.877 --> 1:07:53.8
fits, let's say,
not only does it fit in cache,
1056
1:07:53.8 --> 1:08:02.337
but four of them fit in cache.
It fits in one quarter of the
1057
1:08:02.337 --> 1:08:05.158
cache.
OK, but we need to leave some
1058
1:08:05.158 --> 1:08:07.899
cache extra for doing other
things.
1059
1:08:07.899 --> 1:08:11.607
But I want to make sure that
the J funnel fits.
1060
1:08:11.607 --> 1:08:16.04
OK, now what does that mean?
Well, we've analyzed space.
1061
1:08:16.04 --> 1:08:19.989
We know that the space of a J
funnel is about J^2,
1062
1:08:19.989 --> 1:08:24.02
some constant times J^2.
We'll call it C times J^2.
1063
1:08:24.02 --> 1:08:27.969
OK, so this is saying that C
times J^2 is at most,
1064
1:08:27.969 --> 1:08:32
M over 4, one quarter of the
cache.
1065
1:08:32 --> 1:08:35.915
OK, that means a J funnel of
that size fits in a
1066
1:08:35.915 --> 1:08:38.803
quarter of the cache.
OK, at some point in the
1067
1:08:38.803 --> 1:08:41.884
recursion, we'll have this big
tree of J funnels,
1068
1:08:41.884 --> 1:08:44.515
with all sorts of buffers in
between them,
1069
1:08:44.515 --> 1:08:46.697
and each of the J funnels will
fit.
1070
1:08:46.697 --> 1:08:49.521
So, let's think about one of
those J funnels.
1071
1:08:49.521 --> 1:08:51.96
Suppose J is like the square
root of K.
1072
1:08:51.96 --> 1:08:55.619
So, this is the picture because
otherwise I have to draw a
1073
1:08:55.619 --> 1:08:58.314
bigger one.
So, suppose this is a J funnel.
1074
1:08:58.314 --> 1:09:03
It has a bunch of input
buffers, has one output buffer.
1075
1:09:03 --> 1:09:06.366
So, we just want to think about
how the J funnel executes.
1076
1:09:06.366 --> 1:09:09.259
And, for a long time,
as long as these buffers are
1077
1:09:09.259 --> 1:09:12.33
all full, this is just a merger.
It's doing something
1078
1:09:12.33 --> 1:09:14.515
recursively, but we don't really
care.
1079
1:09:14.515 --> 1:09:17.468
As soon as this whole thing
swaps in, and actually,
1080
1:09:17.468 --> 1:09:20.244
I should be drawing this,
as soon as the funnel,
1081
1:09:20.244 --> 1:09:23.019
the output buffer,
and the input buffer swap in,
1082
1:09:23.019 --> 1:09:25.677
in other words,
you bring all those blocks in,
1083
1:09:25.677 --> 1:09:28.452
you can just merge,
and you can go on your merry
1084
1:09:28.452 --> 1:09:33
way merging until something
empties or you fill the output.
1085
1:09:33 --> 1:09:36.323
So, let's analyze that.
Suppose everything is in
1086
1:09:36.323 --> 1:09:40.707
memory, because we know it fits.
OK, well I have to be a little
1087
1:09:40.707 --> 1:09:43.676
bit careful.
The input buffers are actually
1088
1:09:43.676 --> 1:09:48.202
pretty big in total size because
the total size is K to the three
1089
1:09:48.202 --> 1:09:50.747
halves here versus K to the one
half.
1090
1:09:50.747 --> 1:09:54.848
Actually, this is of size K.
Let me draw a general picture.
1091
1:09:54.848 --> 1:09:57.676
We have a J funnel,
because otherwise the
1092
1:09:57.676 --> 1:10:01
arithmetic is going to get
messy.
1093
1:10:01 --> 1:10:04.854
We have a J funnel.
Its size is C times J^2,
1094
1:10:04.854 --> 1:10:08.619
we're supposing.
The number of inputs is J,
1095
1:10:08.619 --> 1:10:11.666
and the size of them is pretty
big.
1096
1:10:11.666 --> 1:10:15.61
Where did we define that?
We have a K funnel.
1097
1:10:15.61 --> 1:10:20.719
The total input size is K^3.
So, the total input size here
1098
1:10:20.719 --> 1:10:24.663
would be J^3.
We can't afford to put all that
1099
1:10:24.663 --> 1:10:27.98
in cache.
That's an extra factor of J.
1100
1:10:27.98 --> 1:10:33
But, we can afford one block
per input.
1101
1:10:33 --> 1:10:35.035
And for merging,
that's all we need.
1102
1:10:35.035 --> 1:10:38.176
I claim that I can fit the
first block of each of these
1103
1:10:38.176 --> 1:10:41.724
input arrays in cache at the same
time along with the J funnel.
1104
1:10:41.724 --> 1:10:44.864
And so, for that duration,
as long as all of that is in
1105
1:10:44.864 --> 1:10:48.238
cache, this thing can merge at
full speed just like we were
1106
1:10:48.238 --> 1:10:51.204
doing parallel scans.
You use up all the blocks down
1107
1:10:51.204 --> 1:10:54.752
here, and one of them empties.
You go to the next block in the
1108
1:10:54.752 --> 1:10:57.602
input buffer and so on,
just like the normal merge
1109
1:10:57.602 --> 1:11:00.859
analysis of parallel arrays,
at this point we assume that
1110
1:11:00.859 --> 1:11:04
everything here is fitting in
cache.
1111
1:11:04 --> 1:11:08.485
So, it's just like before.
Of course, in fact,
1112
1:11:08.485 --> 1:11:13.668
it's recursive but we are
analyzing it at this level.
1113
1:11:13.668 --> 1:11:19.25
OK, I need to prove that you
can fit one block per input.
1114
1:11:19.25 --> 1:11:22.839
It's not hard.
It's just computation.
1115
1:11:22.839 --> 1:11:28.72
And, it's basically the way
that these funnels were designed
1116
1:11:28.72 --> 1:11:35
was so that you could fit one
block per input buffer.
1117
1:11:35 --> 1:11:41.607
And, here's the argument.
So, the claim is you can also
1118
1:11:41.607 --> 1:11:47.725
fit one memory block in the
cache per input buffer.
1119
1:11:47.725 --> 1:11:52.497
So, this is in addition to one
J funnel.
1120
1:11:52.497 --> 1:11:59.594
You could also fit one block
for each of its input buffers.
1121
1:11:59.594 --> 1:12:06.23
OK, this is of the J funnel.
It's not any funnel because
1122
1:12:06.23 --> 1:12:10.938
bigger funnels are way too big.
OK, so here's how we prove
1123
1:12:10.938 --> 1:12:13.581
that.
J^2 is at most a quarter M.
1124
1:12:13.581 --> 1:12:16.967
That's what we assumed here,
actually C times J^2.
1125
1:12:16.967 --> 1:12:21.675
I'm not going to bother with
the C because that's going to
1126
1:12:21.675 --> 1:12:25.887
make my life even harder.
OK, I think this is even a
1127
1:12:25.887 --> 1:12:29.522
weaker constraint.
So, the size of our funnel
1128
1:12:29.522 --> 1:12:35.11
is about J^2.
That's at most a quarter of the
1129
1:12:35.11 --> 1:12:37.719
cache.
That implies that J,
1130
1:12:37.719 --> 1:12:43.941
if we take square roots of both
sides, is at most a half square
1131
1:12:43.941 --> 1:12:47.955
root of M.
OK, also, we know that B is at
1132
1:12:47.955 --> 1:12:53.273
most square root of M because M
is at least B squared.
1133
1:12:53.273 --> 1:12:58.993
So, we put these together,
and we get J times B is at most
1134
1:12:58.993 --> 1:13:02.611
a half M.
OK, now I claim that what we
1135
1:13:02.611 --> 1:13:05.718
are asking for here is J times B
because in a J funnel,
1136
1:13:05.718 --> 1:13:08.825
there are J input arrays.
And so, if you want one block
1137
1:13:08.825 --> 1:13:10.781
each, that costs a space of B
each.
1138
1:13:10.781 --> 1:13:13.831
So, for each input buffer we
have one block of size B,
1139
1:13:13.831 --> 1:13:16.938
and the claim is that that
whole thing fits in half the
1140
1:13:16.938 --> 1:13:19.009
cache.
And, we've only used a quarter
1141
1:13:19.009 --> 1:13:20.448
of the cache.
So in total,
1142
1:13:20.448 --> 1:13:23.843
we use three quarters of the
cache and that's all we'll use.
1143
1:13:23.843 --> 1:13:26.95
OK, so that's good news.
We can also fit one more block
1144
1:13:26.95 --> 1:13:30
for the output.
Not too big a deal.
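The fitting claim just argued, a J funnel in a quarter of the cache plus one block per input buffer in another half, can be checked with a few lines of arithmetic. The parameter values below are arbitrary samples; the tall-cache assumption M >= B^2 is taken from the lecture:

```python
import math

def fits_three_quarters(M, B, c=1.0):
    """Check: a J funnel using at most M/4 of the cache, plus one
    size-B block per input buffer (J of them), stays within 3/4 of M."""
    assert M >= B * B, "tall-cache assumption from the lecture"
    J = math.floor(math.sqrt(M / (4 * c)))   # largest J with c*J^2 <= M/4
    funnel_space = c * J * J
    input_blocks = J * B                     # one block per input buffer
    return funnel_space + input_blocks <= 0.75 * M

print(all(fits_three_quarters(M, B)
          for M in [1 << 10, 1 << 16, 1 << 20]
          for B in [8, 16, 32]))
# → True
```

This mirrors the chain J <= sqrt(M)/2 and B <= sqrt(M), hence J*B <= M/2, so funnel plus input blocks use at most three quarters of the cache.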
1145
1:13:30 --> 1:13:33.401
So now, as long as this J
funnel is running,
1146
1:13:33.401 --> 1:13:36.012
if it's all in cache,
all is well.
1147
1:13:36.012 --> 1:13:39.889
What does that mean?
Let me first analyze how long
1148
1:13:39.889 --> 1:13:42.895
it takes for us to swap in this
funnel.
1149
1:13:42.895 --> 1:13:47.563
OK, so how long does it take
for us to read all the stuff in
1150
1:13:47.563 --> 1:13:50.806
a J funnel and one block per
input buffer?
1151
1:13:50.806 --> 1:13:55
That's what it would take to
get started.
1152
1:13:55 --> 1:14:02.344
So, this is swapping in a J
funnel, which means reading the
1153
1:14:02.344 --> 1:14:09.435
J funnel in its entirety,
and reading one block per input
1154
1:14:09.435 --> 1:14:14.12
buffer.
OK, the cost of the swap in is
1155
1:14:14.12 --> 1:14:19.818
pretty natural.
The size of the funnel divided
1156
1:14:19.818 --> 1:14:27.542
by B, because that's just sort
of a linear scan to read it in,
1157
1:14:27.542 --> 1:14:34
and we need to read one block
per buffer.
1158
1:14:34 --> 1:14:38.463
These buffers could be all over
the place because they're pretty
1159
1:14:38.463 --> 1:14:40.942
big.
So, let's say we pay one memory
1160
1:14:40.942 --> 1:14:45.264
transfer for each input buffer
just to get started to read the
1161
1:14:45.264 --> 1:14:47.318
first block.
OK, the claim is,
1162
1:14:47.318 --> 1:14:50.365
and here we need to do some
more arithmetic.
1163
1:14:50.365 --> 1:14:52.348
This is, at most,
J^3 over B.
1164
1:14:52.348 --> 1:14:54.757
OK, why is it,
at most, J^3 over B?
1165
1:14:54.757 --> 1:15:00
Well, this was the first level
at which things fit in cache.
1166
1:15:00 --> 1:15:04.119
That means the next level
bigger, which is J^2,
1167
1:15:04.119 --> 1:15:08.328
which has size J^4,
should be bigger than cache.
1168
1:15:08.328 --> 1:15:11.552
Otherwise we would have stopped
then.
1169
1:15:11.552 --> 1:15:14.686
OK, so this is just more
arithmetic.
1170
1:15:14.686 --> 1:15:19.164
You can either believe me or
follow the arithmetic.
1171
1:15:19.164 --> 1:15:23.731
We know that J^4 is at least M.
So, this means that,
1172
1:15:23.731 --> 1:15:26.776
and we know that M is at least
B^2.
1173
1:15:26.776 --> 1:15:29.462
Therefore, J^2,
instead of J^4,
1174
1:15:29.462 --> 1:15:36
we take the square root of both
sides, J^2 is at least B.
1175
1:15:36 --> 1:15:39.379
OK, so certainly J^2 over B is
at most J^3 over B.
1176
1:15:39.379 --> 1:15:43.379
But also J is at most J^3 over
B because J^2 is at least B.
1177
1:15:43.379 --> 1:15:46.896
Hopefully that should be clear.
That's just algebra.
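The algebra just walked through, that both parts of the swap-in cost are at most J^3 over B once J^4 >= M >= B^2, can be spot-checked. The sample values of J and B below are arbitrary, chosen only to satisfy the derived condition J^2 >= B:

```python
def swap_in_within_bound(J, B):
    """If J^4 >= M and M >= B^2 then J^2 >= B, so reading the funnel
    (about J^2/B) and touching one block per input buffer (J transfers)
    are each at most J^3/B."""
    assert J * J >= B        # the consequence of J^4 >= M >= B^2
    return J * J / B <= J ** 3 / B and J <= J ** 3 / B

print(all(swap_in_within_bound(J, B)
          for J in [4, 16, 64]
          for B in [2, 4, 16]))
# → True
```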
1178
1:15:46.896 --> 1:15:50.965
OK, so we're not going to use
this bound because that's kind
1179
1:15:50.965 --> 1:15:53.655
of complicated.
We're just going to say,
1180
1:15:53.655 --> 1:15:56.689
well, it costs J^3 over B to
get swapped in.
1181
1:15:56.689 --> 1:16:00
Now, why is J^3 over B a good
thing?
1182
1:16:00 --> 1:16:03.972
Because we know the total size
of inputs to the J funnel is
1183
1:16:03.972 --> 1:16:06.232
J^3.
So, to read all of the inputs
1184
1:16:06.232 --> 1:16:08.424
to the J funnel takes J^3 over
B.
1185
1:16:08.424 --> 1:16:12.054
So, this is really just a
linear extra cost to get the
1186
1:16:12.054 --> 1:16:14.657
whole thing swapped in.
It sounds good.
1187
1:16:14.657 --> 1:16:17.671
To do the merging would also
cost J^3 over B.
1188
1:16:17.671 --> 1:16:21.438
So, the swap-in costs J^3 over
B to merge all these J^3
1189
1:16:21.438 --> 1:16:24.041
elements.
If they were all there in the
1190
1:16:24.041 --> 1:16:28.013
inputs, it would take J^3 over B
because once everything is
1191
1:16:28.013 --> 1:16:31.78
there, you're merging at full
speed, B items per
1192
1:16:31.78 --> 1:16:36.859
memory transfer on average.
OK, the problem is you're going
1193
1:16:36.859 --> 1:16:39.26
to swap out, which you may have
imagined.
1194
1:16:39.26 --> 1:16:41.899
As soon as one of your input
buffers empties,
1195
1:16:41.899 --> 1:16:45.199
let's say this one's almost
gone, as soon as it empties,
1196
1:16:45.199 --> 1:16:48.439
you're going to totally
obliterate that funnel and swap
1197
1:16:48.439 --> 1:16:51.38
in this one in order to merge
all the stuff there,
1198
1:16:51.38 --> 1:16:54.92
and fill this buffer back up.
This is where the amortization
1199
1:16:54.92 --> 1:16:56.96
comes in.
And this is where the log
1200
1:16:56.96 --> 1:17:00.68
factor comes in because so far
it we've basically paid a linear
1201
1:17:00.68 --> 1:17:07.034
cost.
We are almost done.
1202
1:17:07.034 --> 1:17:17.897
So, we charge,
sorry, I'm jumping ahead of
1203
1:17:17.897 --> 1:17:26.111
myself.
So, when an input buffer
1204
1:17:26.111 --> 1:17:35.169
empties, we swap out.
And we recursively fill that
1205
1:17:35.169 --> 1:17:37.881
buffer.
OK, I'm going to assume that
1206
1:17:37.881 --> 1:17:42.065
there is absolutely no reuse,
that is, the recursive filling
1207
1:17:42.065 --> 1:17:46.481
completely swapped everything
out and I have to start from
1208
1:17:46.481 --> 1:17:50.046
scratch for this funnel.
So, when that happens,
1209
1:17:50.046 --> 1:17:53.92
I fill this buffer,
and then I come back and I say,
1210
1:17:53.92 --> 1:17:58.026
well, I go swap it back in.
So when the recursive call
1211
1:17:58.026 --> 1:18:01.978
finishes, I swap back in.
OK, so I recursively fill,
1212
1:18:01.978 --> 1:18:08.031
and then I swap back in.
And, the swapping back in
1213
1:18:08.031 --> 1:18:13.012
costs J^3 over B.
I'm going to charge that cost
1214
1:18:13.012 --> 1:18:16.91
to the elements that just got
filled.
1215
1:18:16.91 --> 1:18:22
So this is an amortized
charging argument.
1216
1:18:22 --> 1:18:48
1217
1:18:48 --> 1:18:51.322
How many are there?
It's the only question.
1218
1:18:51.322 --> 1:18:54.169
It turns out,
things are really good,
1219
1:18:54.169 --> 1:18:59.073
like here, for the square root
of K funnel, we have each buffer
1220
1:18:59.073 --> 1:19:04.063
has size K to the three halves.
OK, so this is a bit
1221
1:19:04.063 --> 1:19:08.395
complicated.
But I claim that the number of
1222
1:19:08.395 --> 1:19:12.624
elements here that fill the
buffer is J^3.
1223
1:19:12.624 --> 1:19:18.401
So, if you have a J funnel,
each of the input buffers has
1224
1:19:18.401 --> 1:19:22.114
size J^3.
It should be correct if you
1225
1:19:22.114 --> 1:19:26.137
work it out.
So, we're charging this J^3
1226
1:19:26.137 --> 1:19:31.501
over B cost to J^3 elements,
which sounds like you're
1227
1:19:31.501 --> 1:19:38
charging, essentially,
one over B to each element.
1228
1:19:38 --> 1:19:39.951
Sounds great.
That means that,
1229
1:19:39.951 --> 1:19:43.718
so you're thinking overall,
I mean, there are N elements,
1230
1:19:43.718 --> 1:19:46.678
and to each one you charge a
one over B cost.
1231
1:19:46.678 --> 1:19:50.11
That sounds like the total
running time is N over B.
1232
1:19:50.11 --> 1:19:52.195
It's a bit too fast for
sorting.
1233
1:19:52.195 --> 1:19:55.559
We lost the log factor.
So, what's going on is that
1234
1:19:55.559 --> 1:20:00
we're actually charging to one
element more than once.
1235
1:20:00 --> 1:20:02.729
And, this is something that we
don't normally do,
1236
1:20:02.729 --> 1:20:05.913
never done it in this class,
but you can do it as long as
1237
1:20:05.913 --> 1:20:08.471
you bound that the number of
times you charge.
1238
1:20:08.471 --> 1:20:10.916
OK, and wherever you do a
charging argument,
1239
1:20:10.916 --> 1:20:13.304
you say, well,
this doesn't happen too many
1240
1:20:13.304 --> 1:20:16.09
times because whenever this
happens, this happens.
1241
1:20:16.09 --> 1:20:18.705
You should say,
you should prove that the thing
1242
1:20:18.705 --> 1:20:21.775
that you're charging to
isn't charged to very
1243
1:20:21.775 --> 1:20:24.107
many times.
So here, I have a quantifiable
1244
1:20:24.107 --> 1:20:26.153
thing that I'm charging to:
elements.
1245
1:20:26.153 --> 1:20:29.394
So, I'm saying that for each
element that happened to come
1246
1:20:29.394 --> 1:20:31.953
into this buffer,
I'm going to charge it a one
1247
1:20:31.953 --> 1:20:35.992
over B cost.
How many times does one element
1248
1:20:35.992 --> 1:20:38.755
get charged?
Well, each time it gets charged
1249
1:20:38.755 --> 1:20:40.812
to, it's moved into a new
buffer.
1250
1:20:40.812 --> 1:20:43.254
How many buffers could it move
through?
1251
1:20:43.254 --> 1:20:45.632
Well, it's just going up all
the time.
1252
1:20:45.632 --> 1:20:49.102
Merging always goes up.
So, we start here and you go to
1253
1:20:49.102 --> 1:20:52.059
the next buffer,
and you go to the next buffer.
1254
1:20:52.059 --> 1:20:55.143
The number of buffers you visit
is the right log,
1255
1:20:55.143 --> 1:20:59
it turns out.
I don't know which log that is.
1256
1:20:59 --> 1:21:05.199
So, the number of charges of a
one over B cost to each element
1257
1:21:05.199 --> 1:21:11.196
is the number of buffers it
visits, and that's a log factor.
1258
1:21:11.196 --> 1:21:17.193
That's where we get an extra
log factor on the running time.
1259
1:21:17.193 --> 1:21:23.291
That is, it's the number of
levels of J funnels that you can
1260
1:21:23.291 --> 1:21:26.849
visit.
So, it's log K divided by log
1261
1:21:26.849 --> 1:21:33.228
J, if I got it right.
OK, and we're almost done.
1262
1:21:33.228 --> 1:21:38.442
Let's wrap up a bit.
Just a little bit more
1263
1:21:38.442 --> 1:21:44.278
arithmetic, unfortunately.
So, log K over log J.
1264
1:21:44.278 --> 1:21:47.63
Now, J^2 is like M,
roughly.
1265
1:21:47.63 --> 1:21:54.956
It might be square root of M.
But, log J is basically log M.
1266
1:21:54.956 --> 1:22:02.281
There's some constants there.
So, the number of charges here
1267
1:22:02.281 --> 1:22:08.299
is theta of log K over log M.
So, now this is a bit,
1268
1:22:08.299 --> 1:22:11.135
we haven't seen this in
amortization necessarily,
1269
1:22:11.135 --> 1:22:14.265
but we just need to count up
total amount of charging.
1270
1:22:14.265 --> 1:22:17.219
All work gets charged to
somebody, except we didn't
1271
1:22:17.219 --> 1:22:20.054
charge the very initial swapping
in to anybody.
1272
1:22:20.054 --> 1:22:23.244
But, every time we do some
swapping in, we charge it to
1273
1:22:23.244 --> 1:22:25.075
someone.
So, how many times does
1274
1:22:25.075 --> 1:22:27.97
everything get charged?
Well, there are N elements.
1275
1:22:27.97 --> 1:22:31.632
Each gets charged a one over
B cost, and the number of times
1276
1:22:31.632 --> 1:22:35
it gets charged is this log K
over log M.
1277
1:22:35 --> 1:22:39.246
So therefore,
the total cost is number of
1278
1:22:39.246 --> 1:22:44.342
elements times a one over B
times this log thing.
1279
1:22:44.342 --> 1:22:49.65
OK, it's actually plus K.
We forgot about a plus K,
1280
1:22:49.65 --> 1:22:55.171
but that's just to get started
in the very beginning,
1281
1:22:55.171 --> 1:22:58.886
and start on all of the input
lists.
1282
1:22:58.886 --> 1:23:06
OK, this is an amortization
analysis to prove this bound.
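The bound being proved, roughly (K^3 / B) times (log K / log M) memory transfers plus K to start the input lists, can be evaluated for concrete parameters. This is a back-of-the-envelope sketch with arbitrary sample values, ignoring constant factors:

```python
import math

def funnel_merge_transfers(K, B, M):
    """Charging-argument total: K^3 elements, each charged 1/B per
    buffer it visits, across about log K / log M levels of J-funnels,
    plus K transfers to touch every input list once."""
    levels = math.log(K) / math.log(M)   # buffer levels an element crosses
    return (K ** 3 / B) * levels + K

# With K = 1024, B = 64, M = 2^20, this is about 8.4 million transfers.
cost = funnel_merge_transfers(1024, 64, 1 << 20)
```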
1283
1:23:06 --> 1:23:10.914
Sorry, what was N here?
I assumed that I started out
1284
1:23:10.914 --> 1:23:14.286
with K cubed elements at the
bottom.
1285
1:23:14.286 --> 1:23:19.682
The total number of elements in
the bottom was theta of K^3.
1286
1:23:19.682 --> 1:23:23.343
OK, so I should have written
K^3, not N.
1287
1:23:23.343 --> 1:23:28.835
This should be almost the same
as this, OK, but not quite.
1288
1:23:28.835 --> 1:23:34.039
This is log based M of K,
and if you do a little bit of
1289
1:23:34.039 --> 1:23:39.82
arithmetic, this should be K^3
over B times log base M over B
1290
1:23:39.82 --> 1:23:45.747
of K over B plus K.
That's what I want to prove.
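Reconstructing the board notation, the target bound for merging with a K-funnel is:

```latex
O\!\left(\frac{K^3}{B}\,\log_{M/B}\frac{K}{B} \;+\; K\right)
\qquad \text{on } N = \Theta(K^3) \text{ elements.}
```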
1291
1:23:45.747 --> 1:23:49.867
Actually there's a K^3 here
instead of a K,
1292
1:23:49.867 --> 1:23:53.105
but that's just a factor of
three.
1293
1:23:53.105 --> 1:23:58.6
And this follows because we
assume we are not in the base
1294
1:23:58.6 --> 1:24:01.052
case.
So, K is at least M,
1295
1:24:01.052 --> 1:24:06.252
which is at least B^2,
and therefore K over B is omega
1296
1:24:06.252 --> 1:24:10.716
square root of K.
OK, so K over B is basically
1297
1:24:10.716 --> 1:24:13.045
the same as K when you put it in
a log.
1298
1:24:13.045 --> 1:24:16.354
So here we have log base M.
I turned it into log base M
1299
1:24:16.354 --> 1:24:17.887
over B.
That's even worse.
1300
1:24:17.887 --> 1:24:20.277
It doesn't matter.
And, I have log of K.
1301
1:24:20.277 --> 1:24:23.525
I replaced it with K over B,
but K over B is basically
1302
1:24:23.525 --> 1:24:25.303
square root of K.
So in a log,
1303
1:24:25.303 --> 1:24:30.261
that's just a factor of a half.
So that concludes the analysis
1304
1:24:30.261 --> 1:24:33.654
of the funnel.
We get this crazy running time,
1305
1:24:33.654 --> 1:24:37.424
which is basically the sorting
bound plus a little bit.
1306
1:24:37.424 --> 1:24:40.817
We plug that into our funnel
sort, and we get,
1307
1:24:40.817 --> 1:24:44.964
magically, optimal cache
oblivious sorting just in time.
1308
1:24:44.964 --> 1:24:48.809
Tuesday is the final.
The final is more in the style
1309
1:24:48.809 --> 1:24:53.107
of quiz one, so not too much
creativity, mostly mastery of
1310
1:24:53.107 --> 1:24:55.369
material.
It covers everything.
1311
1:24:55.369 --> 1:24:59.591
You don't have to worry about
the details of funnel sort,
1312
1:24:59.591 --> 1:25:03.285
but everything else.
So it's like quiz one but for
1313
1:25:03.285 --> 1:25:07.664
the entire class.
It's three hours long,
1314
1:25:07.664 --> 1:25:10.766
and good luck.
It's been a pleasure having
1315
1:25:10.766 --> 1:25:14.247
you, all the students.
I'm sure Charles agrees,
1316
1:25:14.247 --> 1:25:17
so thanks everyone.
It was a lot of fun.