1 00:00:08 --> 00:00:13 The last lecture of 6.046. We are here today to talk more 2 00:00:13 --> 00:00:17 about cache oblivious algorithms. 3 00:00:17 --> 00:00:30 4 00:00:30 --> 00:00:34 Last class, we saw several cache oblivious algorithms, 5 00:00:34 --> 00:00:37 although none of them was too difficult. 6 00:00:37 --> 00:00:42 Today we will see two difficult cache oblivious algorithms, 7 00:00:42 --> 00:00:46 a little bit more advanced. I figure we should do something 8 00:00:46 --> 00:00:51 advanced for the last class just to get to some exciting climax. 9 00:00:51 --> 00:00:55 So without further ado, let's get started. 10 00:00:55 --> 00:01:00 Last time, we looked at the binary search problem. 11 00:01:00 --> 00:01:02 Or, we looked at binary search, rather. 12 00:01:02 --> 00:01:06 And so, the binary search did not do so well in the cache 13 00:01:06 --> 00:01:10 oblivious context. And, some people asked me after 14 00:01:10 --> 00:01:14 class, is it possible to do binary search cache 15 00:01:14 --> 00:01:16 obliviously? And, indeed it is with 16 00:01:16 --> 00:01:19 something called static search trees. 17 00:01:19 --> 00:01:21 So, this is really binary search. 18 00:01:21 --> 00:01:25 So, I mean, the abstract problem is I give you N items, 19 00:01:25 --> 00:01:28 say presorted, build some static data 20 00:01:28 --> 00:01:34 structure so that you can search among those N items quickly. 21 00:01:34 --> 00:01:37 And quickly, I claim, means log base B of N. 22 00:01:37 --> 00:01:41 Our goal is to get log base B 23 00:01:41 --> 00:01:44 of N. We know that we can achieve 24 00:01:44 --> 00:01:46 that with B trees when we know B. 25 00:01:46 --> 00:01:49 We'd like to do that when we don't know B. 26 00:01:49 --> 00:01:54 And that's what cache oblivious static search trees achieve. 27 00:01:54 --> 00:01:58 So here's what we're going to do. 28 00:01:58 --> 00:02:02 As you might suspect, we're going to use a tree. 
29 00:02:02 --> 00:02:07 So, we're going to store our N elements in a complete binary 30 00:02:07 --> 00:02:10 tree. We can't use B trees because we 31 00:02:10 --> 00:02:15 don't know what B is. So, we'll use a binary tree. 32 00:02:15 --> 00:02:19 And the key is how we lay out a binary tree. 33 00:02:19 --> 00:02:22 The binary tree will have N nodes. 34 00:02:22 --> 00:02:25 Or, you can put the data in the leaves. 35 00:02:25 --> 00:02:30 It doesn't really matter. So, here's our tree. 36 00:02:30 --> 00:02:33 There are the N nodes. And we're storing them, 37 00:02:33 --> 00:02:35 I didn't say, in order, you know, 38 00:02:35 --> 00:02:38 in the usual way, in order in a binary tree, 39 00:02:38 --> 00:02:41 which makes it a binary search tree. 40 00:02:41 --> 00:02:43 So now we have to search in this thing. 41 00:02:43 --> 00:02:47 So, the search will just start at the root and walk down some 42 00:02:47 --> 00:02:51 root-to-leaf path. OK, and at each point you know 43 00:02:51 --> 00:02:54 whether to go left or to go right because things are in 44 00:02:54 --> 00:02:57 order. So we're assuming here that we 45 00:02:57 --> 00:03:01 have an ordered universe of keys. 46 00:03:01 --> 00:03:04 So that's easy. We know that that will take log 47 00:03:04 --> 00:03:07 N time. The question is how many memory 48 00:03:07 --> 00:03:10 transfers? We'd like a lot of the nodes 49 00:03:10 --> 00:03:13 near the root to be somehow close together in one block. 50 00:03:13 --> 00:03:16 But we don't know what the block size is. 51 00:03:16 --> 00:03:21 So what we're going to do is carve the tree at the middle level. 52 00:03:21 --> 00:03:25 We're going to use divide and conquer for our layout of the 53 00:03:25 --> 00:03:28 tree, how we order the nodes in memory. 54 00:03:28 --> 00:03:33 And the divide and conquer is based on cutting in the middle, 55 00:03:33 --> 00:03:38 which is a bit weird. It's not our usual divide and 56 00:03:38 --> 00:03:42 conquer. 
And we'll see this more than 57 00:03:42 --> 00:03:45 once today. So, when you cut on the middle 58 00:03:45 --> 00:03:50 level, if the height of your original tree is log N, 59 00:03:50 --> 00:03:55 maybe log N plus one or something, it's roughly log N, 60 00:03:55 --> 00:04:00 then the height of the top half will be log N over two. 61 00:04:00 --> 00:04:05 And the height of the bottom pieces will be log N over two. 62 00:04:05 --> 00:04:10 How many nodes will there be in the top tree? 63 00:04:10 --> 00:04:12 N over two? Not quite. 64 00:04:12 --> 00:04:16 Two to the log N over two, square root of N. 65 00:04:16 --> 00:04:21 OK, so it would be about root N nodes over here. 66 00:04:21 --> 00:04:24 And therefore, there will be about root N 67 00:04:24 --> 00:04:28 subtrees down here, one for each, 68 00:04:28 --> 00:04:34 or a couple for each leaf. OK, so we have these subtrees 69 00:04:34 --> 00:04:38 of size root N, and there are about root N of them. 70 00:04:38 --> 00:04:42 OK, this is how we are carving our tree. 71 00:04:42 --> 00:04:46 Now, we're going to recurse on each of the pieces. 72 00:04:46 --> 00:04:50 I'd like to redraw this slightly, sorry, 73 00:04:50 --> 00:04:53 just to make it a little bit clearer. 74 00:04:53 --> 00:04:58 These triangles are really trees, and they are connected by 75 00:04:58 --> 00:05:04 edges to this tree up here. So what we are really doing is 76 00:05:04 --> 00:05:08 carving in the middle level of edges in the tree. 77 00:05:08 --> 00:05:12 And if N is not exactly a power of two, you have to round your 78 00:05:12 --> 00:05:15 level by taking floors or ceilings. 79 00:05:15 --> 00:05:19 But you cut roughly in the middle level of edges. 80 00:05:19 --> 00:05:23 There are a lot of edges here. You conceptually slice there. 81 00:05:23 --> 00:05:27 That gives you a top tree and the bottom tree, 82 00:05:27 --> 00:05:32 several bottom trees, each of size roughly root N. 
83 00:05:32 --> 00:05:39 OK, and then we are going to recursively lay out these root N 84 00:05:39 --> 00:05:45 plus one subtrees, and then concatenate. 85 00:05:45 --> 00:05:50 So, this is the idea of the recursive layout. 86 00:05:50 --> 00:05:57 We saw recursive layouts with matrices last time. 87 00:05:57 --> 00:06:04 This is doing the same thing for a tree. 88 00:06:04 --> 00:06:07 So, I want to recursively lay out the top tree. 89 00:06:07 --> 00:06:11 So here's the top tree. And I imagine it being somehow 90 00:06:11 --> 00:06:14 squashed down into a linear array recursively. 91 00:06:14 --> 00:06:18 And then I do the same thing for each of the bottom trees. 92 00:06:18 --> 00:06:21 So here are all the bottom trees. 93 00:06:21 --> 00:06:25 And I squash each of them down into some linear order. 94 00:06:25 --> 00:06:28 And then I concatenate those linear orders. 95 00:06:28 --> 00:06:32 That's the linear order of the tree. 96 00:06:32 --> 00:06:35 And you need a base case. And the base case, 97 00:06:35 --> 00:06:39 just a single node, is stored in the only order of 98 00:06:39 --> 00:06:43 a single node there is. OK, so that's a recursive 99 00:06:43 --> 00:06:48 layout of a binary search tree. It turns out this works really 100 00:06:48 --> 00:06:51 well. And let's quickly do a little 101 00:06:51 --> 00:06:56 example just so it's completely clear what this layout is 102 00:06:56 --> 00:07:02 because it's a bit bizarre maybe the first time you see it. 103 00:07:02 --> 00:07:05 So let me draw my favorite picture. 104 00:07:05 --> 00:07:15 105 00:07:15 --> 00:07:19 So here's a tree of height four or three depending on how you 106 00:07:19 --> 00:07:22 count. We divide in the middle level, 107 00:07:22 --> 00:07:25 and we say, OK, that's the top tree. 108 00:07:25 --> 00:07:27 And then these are the bottom trees. 109 00:07:27 --> 00:07:32 So there's four bottom trees. So there are four children 110 00:07:32 --> 00:07:36 hanging off the root tree. 
They each have the same size in 111 00:07:36 --> 00:07:39 this case. They should all roughly be the 112 00:07:39 --> 00:07:41 same size. And first we lay out the top 113 00:07:41 --> 00:07:43 thing, where we divide on the middle level. 114 00:07:43 --> 00:07:47 We say, OK, this comes first. And then, the bottom subtrees 115 00:07:47 --> 00:07:50 come next, two and three. So, I'm writing down the order 116 00:07:50 --> 00:07:52 in which these nodes are stored in an array. 117 00:07:52 --> 00:07:55 And then, we visit this tree so we get four, five, 118 00:07:55 --> 00:07:57 six. And then we visit this one so 119 00:07:57 --> 00:08:00 we get seven, eight, nine. 120 00:08:00 --> 00:08:03 And then the subtree, 10, 11, 12, and then the last 121 00:08:03 --> 00:08:06 subtree. So that's the order in which 122 00:08:06 --> 00:08:09 you store these 15 nodes. And you can build that up 123 00:08:09 --> 00:08:13 recursively. OK, so the structure is fairly 124 00:08:13 --> 00:08:17 simple, just a binary structure which we know and love, 125 00:08:17 --> 00:08:19 but store it in this funny order. 126 00:08:19 --> 00:08:22 This is not depth-first search order or level order, 127 00:08:22 --> 00:08:27 lots of natural things you might try, none of which work in 128 00:08:27 --> 00:08:31 the cache oblivious context. This is pretty much the only 129 00:08:31 --> 00:08:33 thing that works. And the intuition is, 130 00:08:33 --> 00:08:36 well, we are trying to mimic all kinds of B trees. 131 00:08:36 --> 00:08:39 So, if you want a binary tree, well, that's the original tree. 132 00:08:39 --> 00:08:41 It doesn't matter how you store things. 133 00:08:41 --> 00:08:44 If you want a tree where the branching factor is four, 134 00:08:44 --> 00:08:47 well, then here it is. These blocks give you a 135 00:08:47 --> 00:08:50 branching factor of four. If we had more leaves down 136 00:08:50 --> 00:08:53 here, there would be four children hanging off of that 137 00:08:53 --> 00:08:54 node. 
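The recursive layout of the 15-node example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: nodes are named by their usual heap (level-order) numbers, so node r has children 2r and 2r+1; the function name `veb_order` is made up here; and the top height is rounded down, which is one of the two roundings the lecture allows. The function returns the heap numbers in the order they would be stored in the array.

```python
def veb_order(root, height):
    """Return heap-numbered nodes of a complete binary tree in recursive-layout order.

    root   -- heap number of the subtree root (children of r are 2r and 2r+1)
    height -- number of levels in the subtree (a single node has height 1)
    """
    if height == 1:
        # Base case: a single node has only one possible order.
        return [root]
    top_h = height // 2        # levels in the top tree (rounded down, an assumption)
    bot_h = height - top_h     # levels in each bottom tree
    # Lay out the top tree first.
    order = veb_order(root, top_h)
    # The bottom-tree roots are the descendants of `root` exactly top_h levels down.
    first = root << top_h
    for r in range(first, first + (1 << top_h)):
        # Lay out each bottom tree recursively and concatenate.
        order += veb_order(r, bot_h)
    return order


if __name__ == "__main__":
    print(veb_order(1, 4))
```

For height 4 the top tree (heap nodes 1, 2, 3) is stored first, followed by the four bottom trees rooted at heap nodes 4, 5, 6 and 7, each stored as root, left child, right child. The positions 1 through 15 written on the board are the indices into this list.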
And these are all clustered 138 00:08:54 --> 00:08:56 together consecutively in memory. 139 00:08:56 --> 00:08:59 So, if your block size happens to be three, then this is a 140 00:08:59 --> 00:09:04 perfect way to store things for a block size of three. 141 00:09:04 --> 00:09:07 If your block size happens to be probably 15, 142 00:09:07 --> 00:09:12 right, if we count the number of, right, the number of nodes 143 00:09:12 --> 00:09:16 in here is 15, if your block size happens to 144 00:09:16 --> 00:09:21 be 15, then this recursion will give you a perfect blocking in 145 00:09:21 --> 00:09:23 terms of 15. And in general, 146 00:09:23 --> 00:09:27 it's actually mimicking block sizes of 2^K-1. 147 00:09:27 --> 00:09:32 Think powers of two. OK, that's the intuition. 148 00:09:32 --> 00:09:37 Let me give you the formal analysis to make it clearer. 149 00:09:37 --> 00:09:42 So, we claim that there are order log base B of N memory 150 00:09:42 --> 00:09:45 transfers. That's what we want to prove no 151 00:09:45 --> 00:09:49 matter what B is. So here's what we're going to 152 00:09:49 --> 00:09:52 do. You may recall last time when 153 00:09:52 --> 00:09:57 we analyzed divide and conquer algorithms, we wrote our 154 00:09:57 --> 00:10:03 recurrence, and the base case was the key. 155 00:10:03 --> 00:10:05 Here, in fact, we are only going to think 156 00:10:05 --> 00:10:07 about the base case in a certain sense. 157 00:10:07 --> 00:10:08 We don't have, really, recursion in the 158 00:10:08 --> 00:10:10 algorithm. The algorithm is just walking 159 00:10:10 --> 00:10:13 down some root-to-leaf path. We only have a recursion in the 160 00:10:13 --> 00:10:16 definition of the layout. So, we can be a little bit more 161 00:10:16 --> 00:10:18 flexible. We don't have to look at our 162 00:10:18 --> 00:10:20 recurrence. We are just going to think 163 00:10:20 --> 00:10:22 about the base case. I want to imagine, 164 00:10:22 --> 00:10:24 you start with the big triangle. 
165 00:10:24 --> 00:10:26 Then you cut it in the middle; you get smaller triangles. 166 00:10:26 --> 00:10:31 Imagine that you keep recursively cutting. 167 00:10:31 --> 00:10:34 So imagine this process. Big triangles halve in height 168 00:10:34 --> 00:10:37 each time. They're getting smaller and 169 00:10:37 --> 00:10:41 smaller, stop cutting at the point where a triangle fits in a 170 00:10:41 --> 00:10:44 block. OK, and look at that time. 171 00:10:44 --> 00:10:48 OK, the recursion actually goes all the way, but in the analysis 172 00:10:48 --> 00:10:53 let's think about the point where the chunk fits in a block, 173 00:10:53 --> 00:10:57 where one of these triangles, one of these boxes, fits in a 174 00:10:57 --> 00:10:59 block. So, I'm going to call this a 175 00:10:59 --> 00:11:05 recursive level. So, I'm imagining expanding all 176 00:11:05 --> 00:11:10 of the recursions in parallel. This is some level of detail, 177 00:11:10 --> 00:11:16 some level of refinement of the trees at which the tree you're 178 00:11:16 --> 00:11:19 looking at, the triangle, has size at most B. 179 00:11:19 --> 00:11:24 In other words, the number of nodes in 180 00:11:24 --> 00:11:29 that triangle is less than or equal to B. 181 00:11:29 --> 00:11:34 OK, so let me draw a picture. So, I want to draw sort of this 182 00:11:34 --> 00:11:39 picture but where instead of nodes, I have little triangles 183 00:11:39 --> 00:11:41 of size, at most, B. 184 00:11:41 --> 00:11:44 So, the picture looks something like this. 185 00:11:44 --> 00:11:48 We have a little triangle of size, at most, 186 00:11:48 --> 00:11:50 B. It has a bunch of children 187 00:11:50 --> 00:11:55 which are subtrees of size, at most, B, the same size. 188 00:11:55 --> 00:12:00 And then, these are in a chunk, and then we have other chunks 189 00:12:00 --> 00:12:06 that look like that in recursion potentially. 190 00:12:06 --> 00:12:29 191 00:12:29 --> 00:12:31 OK, so I haven't drawn everything. 
192 00:12:31 --> 00:12:34 There would be a whole bunch of, between B and B^2, 193 00:12:34 --> 00:12:37 in fact, subtrees, other squares of this size. 194 00:12:37 --> 00:12:40 So here, I had to refine the entire tree here. 195 00:12:40 --> 00:12:44 And then I refined each of the subtrees here and here at these 196 00:12:44 --> 00:12:47 levels. And then it turned out after 197 00:12:47 --> 00:12:50 these two recursive levels, everything fits in a block. 198 00:12:50 --> 00:12:54 Everything has the same size, so at some point they will all 199 00:12:54 --> 00:12:57 fit within a block. And they might actually be 200 00:12:57 --> 00:12:59 quite a bit smaller than the block. 201 00:12:59 --> 00:13:05 How small? So, what I'm doing is cutting 202 00:13:05 --> 00:13:09 the number of levels in half at each point. 203 00:13:09 --> 00:13:15 And I stop when the height of one of these trees is 204 00:13:15 --> 00:13:21 essentially at most log B because that's when the number 205 00:13:21 --> 00:13:25 of nodes in there will be B roughly. 206 00:13:25 --> 00:13:30 So, how small can the height be? 207 00:13:30 --> 00:13:32 I keep dividing in half and stopping when it's, 208 00:13:32 --> 00:13:34 at most, log B. Log B over two. 209 00:13:34 --> 00:13:37 So it's, at most, log B, it's at least half log 210 00:13:37 --> 00:13:39 B. Therefore, the number of nodes 211 00:13:39 --> 00:13:42 in here could be between the square root of B and B. 212 00:13:42 --> 00:13:46 So, this could be a lot smaller than a constant fraction 213 00:13:46 --> 00:13:49 of a block. I claim that doesn't matter. 214 00:13:49 --> 00:13:51 It's OK. This could be as small as square 215 00:13:51 --> 00:13:53 root of B. I'm not even going to write 216 00:13:53 --> 00:13:57 that it could be as small as square root of B because that doesn't 217 00:13:57 --> 00:14:00 play a role in the analysis. It's a worry, 218 00:14:00 --> 00:14:04 but it's OK essentially because our bound only involves log B. 
219 00:14:04 --> 00:14:09 It doesn't involve B. So, here's what we do. 220 00:14:09 --> 00:14:16 We know that the height of each of these triangles of 221 00:14:16 --> 00:14:20 size, at most, B is at least a half log B. 222 00:14:20 --> 00:14:25 And therefore, if you look at a search path, 223 00:14:25 --> 00:14:32 so, when we do a search in this tree, we're going to start up 224 00:14:32 --> 00:14:36 here. And I'm going to mess up the 225 00:14:36 --> 00:14:39 diagram now. We're going to follow some 226 00:14:39 --> 00:14:42 path, maybe I should have drawn it going down here. 227 00:14:42 --> 00:14:46 We visit some of these triangles, but it's a 228 00:14:46 --> 00:14:51 root-to-node path in the tree. So, how many of the triangles 229 00:14:51 --> 00:14:54 could it visit? Well, the height of the tree 230 00:14:54 --> 00:14:58 divided by the height of one of the triangles. 231 00:14:58 --> 00:15:01 So, this visits, at most, log N over half log B 232 00:15:01 --> 00:15:07 triangles, which looks good. This is log base B of N, 233 00:15:07 --> 00:15:12 modulo a factor of two. Now, what we worry about is how 234 00:15:12 --> 00:15:15 many blocks does a triangle occupy? 235 00:15:15 --> 00:15:19 One of these triangles should fit in a block. 236 00:15:19 --> 00:15:23 We know by the recursive layout, it is stored in a 237 00:15:23 --> 00:15:28 consecutive region in memory. So, how many blocks could it 238 00:15:28 --> 00:15:32 occupy? Two, because of alignment, 239 00:15:32 --> 00:15:35 it might fall across the boundary of a block, 240 00:15:35 --> 00:15:37 but at most, one boundary. 241 00:15:37 --> 00:15:42 So, it fits in two blocks. So, each triangle fits in one 242 00:15:42 --> 00:15:45 block, but is in, at most, two blocks, 243 00:15:45 --> 00:15:49 memory blocks, size B depending on alignment. 
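The pieces stated so far can be chained into one line of arithmetic. This is a sketch of the counting argument only; the symbol h for the height of a triangle is introduced here for convenience, and logarithms without a base are base two.

```latex
\[
\tfrac{1}{2}\log B \;\le\; h \;\le\; \log B
\quad\Longrightarrow\quad
\#\{\text{triangles on a root-to-leaf path}\}
\;\le\; \frac{\log N}{\tfrac{1}{2}\log B}
\;=\; 2\log_B N ,
\]
\[
\#\{\text{memory transfers}\}
\;\le\; 2 \cdot 2\log_B N
\;=\; 4\log_B N
\;=\; O(\log_B N),
\]
```

where the extra factor of two is the at most two memory blocks that a single triangle can straddle.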
244 00:15:49 --> 00:15:53 So, the number of memory transfers, in other words, 245 00:15:53 --> 00:15:57 the number of blocks we read, because all we are doing here 246 00:15:57 --> 00:16:01 is reading in a search, is at most two blocks per 247 00:16:01 --> 00:16:05 triangle. There are this many triangles, 248 00:16:05 --> 00:16:07 so it's at most, 4 log base B of N, 249 00:16:07 --> 00:16:09 OK, which is order log base B of N. 250 00:16:09 --> 00:16:13 And there are papers about decreasing this constant 4 with 251 00:16:13 --> 00:16:15 more sophisticated data structures. 252 00:16:15 --> 00:16:18 You can get it down to a little bit less than two I think. 253 00:16:18 --> 00:16:21 So, there you go. So not quite as good as B trees 254 00:16:21 --> 00:16:24 in terms of the constant, but pretty good. 255 00:16:24 --> 00:16:27 And what's good is that this data structure works for all B 256 00:16:27 --> 00:16:32 at the same time. This analysis works for all B. 257 00:16:32 --> 00:16:37 So, if we have a multilevel memory hierarchy, no problem. 258 00:16:37 --> 00:16:41 Any questions about this data structure? 259 00:16:41 --> 00:16:44 This is already pretty sophisticated, 260 00:16:44 --> 00:16:48 but we are going to get even more sophisticated. 261 00:16:48 --> 00:16:51 Next, OK, good, no questions. 262 00:16:51 --> 00:16:56 This is either perfectly clear, or a little bit difficult, 263 00:16:56 --> 00:16:59 or both. So, now, I debated with myself 264 00:16:59 --> 00:17:05 what exactly I would cover next. There are two natural things I 265 00:17:05 --> 00:17:08 could cover, both of which are complicated. 266 00:17:08 --> 00:17:11 My first result in the cache oblivious world is making this 267 00:17:11 --> 00:17:14 data structure dynamic. So, there is a dynamic B tree 268 00:17:14 --> 00:17:18 that's cache oblivious that works for all values of B. 269 00:17:18 --> 00:17:20 And it gets log base B of N, insert, delete, 270 00:17:20 --> 00:17:23 and search. 
So, this just gets search in 271 00:17:23 --> 00:17:25 log base B of N. That data structure, 272 00:17:25 --> 00:17:28 our first paper was damn complicated, and then it got 273 00:17:28 --> 00:17:31 simplified. It's now not too hard, 274 00:17:31 --> 00:17:35 but it takes a couple of lectures in an advanced 275 00:17:35 --> 00:17:40 algorithms class to teach it. So, I'm not going to do that. 276 00:17:40 --> 00:17:42 But there you go. It exists. 277 00:17:42 --> 00:17:47 Instead, we're going to cover our favorite problem sorting in 278 00:17:47 --> 00:17:52 the cache oblivious context. And this is quite complicated, 279 00:17:52 --> 00:17:56 more than you'd expect, OK, much more complicated than 280 00:17:56 --> 00:18:01 it is in a multithreaded setting to get the right answer, 281 00:18:01 --> 00:18:05 anyway. Maybe to get the best answer in 282 00:18:05 --> 00:18:08 a multithreaded setting is also complicated. 283 00:18:08 --> 00:18:11 The version we got last week was pretty easy. 284 00:18:11 --> 00:18:13 But before we go to cache oblivious sorting, 285 00:18:13 --> 00:18:18 let me talk about cache aware sorting because we need to know 286 00:18:18 --> 00:18:21 what bound we are aiming for. And just to warn you, 287 00:18:21 --> 00:18:24 I may not get to the full analysis of the full cache 288 00:18:24 --> 00:18:28 oblivious sorting. But I want to give you an idea 289 00:18:28 --> 00:18:31 of what goes into it because it's pretty cool, 290 00:18:31 --> 00:18:35 I think, a lot of ideas. So, how might you sort? 291 00:18:35 --> 00:18:39 So, cache aware, we assume we can do everything. 292 00:18:39 --> 00:18:41 Basically, this means we have B trees. 293 00:18:41 --> 00:18:44 That's the only other structure we know. 294 00:18:44 --> 00:18:49 How would you sort N numbers, given that that's the only data 295 00:18:49 --> 00:18:52 structure you have? Right, just add them into the B 296 00:18:52 --> 00:18:55 tree, and then do an in-order traversal. 
297 00:18:55 --> 00:19:00 That's one way to sort, perfectly reasonable. 298 00:19:00 --> 00:19:04 We'll call it repeated insertion into a B tree. 299 00:19:04 --> 00:19:08 OK, we know in the usual setting, in BST sort, 300 00:19:08 --> 00:19:13 where you use a balanced binary search tree, like red-black 301 00:19:13 --> 00:19:17 trees, that takes N log N time, log N per operation, 302 00:19:17 --> 00:19:22 and that's an optimal sorting algorithm in the comparison 303 00:19:22 --> 00:19:28 model, only thinking about comparison model here. 304 00:19:28 --> 00:19:39 So, how many memory transfers does this data structure take? 305 00:19:39 --> 00:19:45 Sorry, this algorithm for sorting? 306 00:19:45 --> 00:19:54 The number of memory transfers is a function of N, 307 00:19:54 --> 00:20:01 and MT of N is? This is easy. 308 00:20:01 --> 00:20:07 N insertions, OK, you have to think about the in- 309 00:20:07 --> 00:20:13 order traversal. You have to remember back your 310 00:20:13 --> 00:20:20 analysis of B trees, but this is not too hard. 311 00:20:20 --> 00:20:27 How long does the insertion take, the N insertions? 312 00:20:27 --> 00:20:32 N log base B of N. How long does the traversal 313 00:20:32 --> 00:20:33 take? Less time. 314 00:20:33 --> 00:20:37 If we think about it, you can get away with N over B 315 00:20:37 --> 00:20:40 memory transfers, so quite a bit less than this. 316 00:20:40 --> 00:20:44 This is bigger than N, which is actually pretty bad. 317 00:20:44 --> 00:20:47 N memory transfers means essentially you're doing random 318 00:20:47 --> 00:20:51 access, visiting every element in some random order. 319 00:20:51 --> 00:20:54 It's even worse. There's even a log factor. 320 00:20:54 --> 00:20:57 Now, the log factor goes down by this log B factor. 321 00:20:57 --> 00:21:02 But, this is actually a really bad sorting bound. 
322 00:21:02 --> 00:21:06 So, unlike normal algorithms, where using a search tree is a 323 00:21:06 --> 00:21:10 good way to sort, in cache oblivious or cache 324 00:21:10 --> 00:21:13 aware sorting it's really, really bad. 325 00:21:13 --> 00:21:17 So, what's another natural algorithm you might try, 326 00:21:17 --> 00:21:22 given what we know for sorting? And, even cache oblivious, 327 00:21:22 --> 00:21:26 all the algorithms we've seen are cache oblivious. 328 00:21:26 --> 00:21:30 So, what's a good one to try? Merge sort. 329 00:21:30 --> 00:21:34 OK, we did merge sort in multithreaded algorithms. 330 00:21:34 --> 00:21:37 Let's try a merge sort, a good divide and conquer 331 00:21:37 --> 00:21:40 thing. So, I'm going to call it binary 332 00:21:40 --> 00:21:44 merge sort because it splits the array into two pieces, 333 00:21:44 --> 00:21:46 and it recurses on the two pieces. 334 00:21:46 --> 00:21:49 So, you get a binary recursion tree. 335 00:21:49 --> 00:21:52 So, let's analyze it. So the number of memory 336 00:21:52 --> 00:21:56 transfers on N elements, so I mean it has a pretty good 337 00:21:56 --> 00:21:57 recursive layout, right? 338 00:21:57 --> 00:22:02 The two subarrays that we get when we partition our array are 339 00:22:02 --> 00:22:05 consecutive. So, we're recursing on this, 340 00:22:05 --> 00:22:10 recursing on this. So, it's a nice cache oblivious 341 00:22:10 --> 00:22:13 layout. And this is even for cache 342 00:22:13 --> 00:22:15 aware. This is a pretty good 343 00:22:15 --> 00:22:19 algorithm, a lot better than this one, as we'll see. 344 00:22:19 --> 00:22:22 But, what is the recurrence we get? 345 00:22:22 --> 00:22:27 So, here we have to go back to last lecture when we were 346 00:22:27 --> 00:22:31 thinking about recurrences for recursive cache oblivious 347 00:22:31 --> 00:22:34 algorithms. 348 00:22:34 --> 00:22:46 349 00:22:46 --> 00:22:50 I mean, the first part should be pretty easy. 350 00:22:50 --> 00:22:55 There's an O. 
Well, OK, let's put the O at 351 00:22:55 --> 00:23:00 the end, the divide and the conquer part at the end. 352 00:23:00 --> 00:23:06 The recursion is 2MT of N over two, good. 353 00:23:06 --> 00:23:09 All right, that's just like the merge sort recurrence, 354 00:23:09 --> 00:23:12 and that's the additive term that you're thinking about. 355 00:23:12 --> 00:23:15 OK, so normally, we would pay a linear additive 356 00:23:15 --> 00:23:19 term here, order N because merging takes order N time. 357 00:23:19 --> 00:23:22 Now, we are merging, which is three parallel scans, 358 00:23:22 --> 00:23:26 the two inputs and the output. OK, they're not quite parallel 359 00:23:26 --> 00:23:28 interleaved. They're a bit funnily 360 00:23:28 --> 00:23:31 interleaved, but as long as your cache stores at least three 361 00:23:31 --> 00:23:35 blocks, this is also linear time in this setting, 362 00:23:35 --> 00:23:38 which means you visit each block a constant number of 363 00:23:38 --> 00:23:41 times. OK, that's the recurrence. 364 00:23:41 --> 00:23:44 Now, we also need a base case, of course. 365 00:23:44 --> 00:23:47 We've seen two base cases, one MT of B, 366 00:23:47 --> 00:23:50 and the other, MT of whatever fits in cache. 367 00:23:50 --> 00:23:53 So, let's look at that one because it's better. 368 00:23:53 --> 00:23:56 So, for some constant, C, if I have an array of size 369 00:23:56 --> 00:24:00 M, this fits in cache, actually, probably C is one 370 00:24:00 --> 00:24:03 here, but I'll just be careful. For some constant, 371 00:24:03 --> 00:24:10 this fits in cache. A problem of this size fits in 372 00:24:10 --> 00:24:18 cache, and in that case, the number of memory transfers 373 00:24:18 --> 00:24:25 is, anyone remember? We've used this base case more 374 00:24:25 --> 00:24:31 than once before. Do you remember? 375 00:24:31 --> 00:24:32 Sorry? CM over B. 376 00:24:32 --> 00:24:33 I've got a big O, so M over B. 
377 00:24:33 --> 00:24:37 Order M over B because this is the size of the data. 378 00:24:37 --> 00:24:40 So, I mean, just to read it all in takes M over B. 379 00:24:40 --> 00:24:43 Once it's in cache, it doesn't really matter what I 380 00:24:43 --> 00:24:47 do as long as I use linear space for the right constant here. 381 00:24:47 --> 00:24:50 As long as I use linear space in that algorithm, 382 00:24:50 --> 00:24:53 I'll stay in cache, and therefore, 383 00:24:53 --> 00:24:57 not have to write anything out until the very end and I spend M 384 00:24:57 --> 00:25:02 over B to write it out. OK, so I can't really spend 385 00:25:02 --> 00:25:07 more than M over B almost no matter what algorithm I have, 386 00:25:07 --> 00:25:09 so long as it uses linear space. 387 00:25:09 --> 00:25:14 So, this is a base case that's useful pretty much in any 388 00:25:14 --> 00:25:17 algorithm. OK, that's a recurrence. 389 00:25:17 --> 00:25:22 Now we just have to solve it. OK, let's see how good binary 390 00:25:22 --> 00:25:24 merge sort is. OK, and again, 391 00:25:24 --> 00:25:29 I'm going to just give the intuition behind the solution to 392 00:25:29 --> 00:25:33 this recurrence. And I won't use the 393 00:25:33 --> 00:25:36 substitution method to prove it formally. 394 00:25:36 --> 00:25:38 But this one's actually pretty simple. 395 00:25:38 --> 00:25:41 So, we have, at the top, actually I'm going 396 00:25:41 --> 00:25:44 to write it over here. Otherwise I won't be able to 397 00:25:44 --> 00:25:46 see. So, at the top of the 398 00:25:46 --> 00:25:48 recursion, we have N over B costs. 399 00:25:48 --> 00:25:52 I'll ignore the constants. There is probably also an 400 00:25:52 --> 00:25:55 additive one, which I'm ignoring here. 401 00:25:55 --> 00:25:58 Then we split into two problems of half the size. 402 00:25:58 --> 00:26:03 So, we get a half N over B, and a half N over B. 403 00:26:03 --> 00:26:05 OK, usually this was N, half N, half N. 
404 00:26:05 --> 00:26:08 You should recognize it from lecture one. 405 00:26:08 --> 00:26:10 So, the total on this level is N over B. 406 00:26:10 --> 00:26:12 The total on this level is N over B. 407 00:26:12 --> 00:26:16 And, you can prove by induction, that every level is N 408 00:26:16 --> 00:26:18 over B. The question is how many levels 409 00:26:18 --> 00:26:20 are there? Well, at the bottom, 410 00:26:20 --> 00:26:23 so, dot, dot, dot, at the bottom of this 411 00:26:23 --> 00:26:26 recursion tree we should get something of size M, 412 00:26:26 --> 00:26:30 and then we're paying M over B. Actually here we're paying M 413 00:26:30 --> 00:26:34 over B. So, it's a good thing those 414 00:26:34 --> 00:26:35 match. They should. 415 00:26:35 --> 00:26:40 So here, we have a bunch of leaves, all of cost M over B. 416 00:26:40 --> 00:26:44 You can also compute the number of leaves here is N over M. 417 00:26:44 --> 00:26:49 If you want to be extra sure, you should always check the 418 00:26:49 --> 00:26:51 leaf level. It's a good idea. 419 00:26:51 --> 00:26:55 So we have N over M leaves, each costing M over B. 420 00:26:55 --> 00:27:00 The Ms cancel. So, this is N over B also. 421 00:27:00 --> 00:27:04 So, every level here is N over B memory transfers. 422 00:27:04 --> 00:27:08 And the number of levels is? 423 00:27:08 --> 00:27:11 Log N over M. Yep, that's right. 424 00:27:11 --> 00:27:16 I just didn't hear it right. OK, we are starting at N. 425 00:27:16 --> 00:27:21 We're getting down to M. So, you can think of it as log 426 00:27:21 --> 00:27:26 N, the whole binary tree minus the subtrees log M, 427 00:27:26 --> 00:27:31 and that's the same as log N over M, OK, or however you want 428 00:27:31 --> 00:27:37 to think about it. The point is that this is a log 429 00:27:37 --> 00:27:40 base two. That's where we are not doing 430 00:27:40 --> 00:27:42 so great. So this is actually a pretty 431 00:27:42 --> 00:27:46 good algorithm. 
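The recurrence and the recursion-tree argument for binary merge sort can be written out as follows. This is a sketch of the intuition just given, not a formal proof; constants are ignored, c is the constant from the base case, and N is assumed to be somewhat larger than M.

```latex
\[
MT(N) = 2\,MT\!\left(\tfrac{N}{2}\right) + O\!\left(\tfrac{N}{B}\right),
\qquad
MT(cM) = O\!\left(\tfrac{M}{B}\right).
\]
\[
\text{cost per level} = \frac{N}{B},
\qquad
\text{leaf level} = \frac{N}{M}\cdot\frac{M}{B} = \frac{N}{B},
\qquad
\text{number of levels} = \log_2 N - \log_2 M = \log_2\frac{N}{M},
\]
\[
MT(N) = O\!\left(\frac{N}{B}\,\log_2\frac{N}{M}\right).
\]
```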
So let me write the solution 432 00:27:46 --> 00:27:48 over here. So, the number of memory 433 00:27:48 --> 00:27:53 transfers on N items is going to be the number of levels times 434 00:27:53 --> 00:27:56 the cost of each level. So, this is N over B times log 435 00:27:56 --> 00:28:00 base two of N over M, which is a lot better than 436 00:28:00 --> 00:28:04 repeated insertion into a B tree. 437 00:28:04 --> 00:28:07 Here, we were getting N times log N over log B, 438 00:28:07 --> 00:28:12 OK, so N log N over log B. We're getting a log B savings 439 00:28:12 --> 00:28:16 over N log N, and here we are getting a 440 00:28:16 --> 00:28:19 factor of B savings, N log N over B. 441 00:28:19 --> 00:28:24 In fact, we even made it a little bit smaller by dividing 442 00:28:24 --> 00:28:28 this N by M. That doesn't matter too much. 443 00:28:28 --> 00:28:32 This dividing by B is a big one. 444 00:28:32 --> 00:28:35 OK, so we're almost there. This is almost an optimal 445 00:28:35 --> 00:28:37 algorithm. It's even cache oblivious, 446 00:28:37 --> 00:28:40 which is pretty cool. And there's an extra little step, 447 00:28:40 --> 00:28:43 which is that you should be able to get another log B 448 00:28:43 --> 00:28:46 factor improvement. I want to combine these two 449 00:28:46 --> 00:28:48 ideas. I want to keep this factor B 450 00:28:48 --> 00:28:51 improvement over N log N, and I want to keep this factor 451 00:28:51 --> 00:28:54 log B improvement over N log N, and get them together. 452 00:28:54 --> 00:28:57 So, first, before we do that cache obliviously, 453 00:28:57 --> 00:29:03 let's do it cache aware. So, this is the third cache 454 00:29:03 --> 00:29:07 aware algorithm. This one was also cache 455 00:29:07 --> 00:29:11 oblivious. So, how should I modify a merge 456 00:29:11 --> 00:29:18 sort in order to do better? I mean, I have this log base 457 00:29:18 --> 00:29:22 two, and I want a log base B, more or less. 
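To get a feel for the gap between the two bounds compared above, one can plug in numbers. The sketch below evaluates both formulas with all constant factors dropped, so the outputs are rough orders of magnitude rather than real transfer counts. The function names and the sample parameters (N = 2^30 items, cache of M = 2^20 items, blocks of B = 2^10 items) are hypothetical choices for illustration.

```python
import math


def btree_sort_transfers(n, b):
    """Repeated insertion into a B tree: about N * log_B(N) memory transfers."""
    return n * (math.log2(n) / math.log2(b))


def binary_merge_sort_transfers(n, m, b):
    """Binary merge sort: about (N / B) * log_2(N / M) memory transfers."""
    return (n / b) * math.log2(n / m)


if __name__ == "__main__":
    # Hypothetical sizes, measured in items rather than bytes.
    n, m, b = 2**30, 2**20, 2**10
    print("B-tree insertion sort:", btree_sort_transfers(n, b))
    print("binary merge sort:    ", binary_merge_sort_transfers(n, m, b))
```

With these sample sizes the B-tree bound comes to 3 * 2^30 and the merge sort bound to 10 * 2^20, a gap of roughly a factor of 300, which is the "dividing by B is a big one" effect.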
458 00:29:22 --> 00:29:27 So, how would I do that with merge sort? 459 00:29:27 --> 00:29:30 Yeah? Split into B subarrays, 460 00:29:30 --> 00:29:32 yeah. Instead of doing binary merge 461 00:29:32 --> 00:29:35 sort, this is what I was hinting at here, instead of splitting it 462 00:29:35 --> 00:29:37 into two pieces, and recursing on the two 463 00:29:37 --> 00:29:40 pieces, and then merging them, I could split potentially into 464 00:29:40 --> 00:29:42 more pieces. OK, and to do that, 465 00:29:42 --> 00:29:45 I'm going to use my cache. So the idea is B pieces. 466 00:29:45 --> 00:29:48 This is actually not the best thing to do, although B pieces 467 00:29:48 --> 00:29:50 does work. And, it's what I was hinting at 468 00:29:50 --> 00:29:52 because I was saying I want a log B. 469 00:29:52 --> 00:29:55 It's actually not quite log B. It's log M over B. 470 00:29:55 --> 00:29:57 OK, but let's see. So, what is the most pieces I 471 00:29:57 --> 00:30:01 could split into? Right, well, 472 00:30:01 --> 00:30:06 I could split into N pieces. That would be good, 473 00:30:06 --> 00:30:11 wouldn't it, at only one recursive level? 474 00:30:11 --> 00:30:14 I can't split into N pieces. Why? 475 00:30:14 --> 00:30:19 What happens wrong when I split into N pieces? 476 00:30:19 --> 00:30:24 That would be the ultimate. You can't merge, 477 00:30:24 --> 00:30:27 exactly. So, if I have N pieces, 478 00:30:27 --> 00:30:33 you can't merge in cache. I mean, so in order to merge in 479 00:30:33 --> 00:30:37 cache, what I need is to be able to store an entire block from 480 00:30:37 --> 00:30:40 each of the lists that I'm merging. 481 00:30:40 --> 00:30:43 If I can store an entire block in cache for each of the lists, 482 00:30:43 --> 00:30:46 then it's a bunch of parallel scans. 483 00:30:46 --> 00:30:49 So this is like testing the limit of parallel scanning 484 00:30:49 --> 00:30:52 technology. 
If you have K parallel scans, 485 00:30:52 --> 00:30:55 and you can fit K blocks in cache, then all is well because 486 00:30:55 --> 00:30:58 you can scan through each of those K arrays, 487 00:30:58 --> 00:31:02 and have one block from each of the K arrays in cache at the 488 00:31:02 --> 00:31:05 same time. So, that's the idea. 489 00:31:05 --> 00:31:09 Now, how many blocks can I fit in cache? 490 00:31:09 --> 00:31:13 M over B. That's the biggest I could do. 491 00:31:13 --> 00:31:18 So this will give the best running time among these kinds 492 00:31:18 --> 00:31:24 of merge sort algorithms. This is an M over B way merge 493 00:31:24 --> 00:31:27 sort. OK, so now we get somewhat 494 00:31:27 --> 00:31:31 better recurrence. We split into M over B 495 00:31:31 --> 00:31:34 subproblems now, each of size, 496 00:31:34 --> 00:31:38 well, it's N divided by M over B without thinking. 497 00:31:38 --> 00:31:43 And, the claim is that the merge time is still linear 498 00:31:43 --> 00:31:48 because we have barely enough, OK, maybe I should describe 499 00:31:48 --> 00:31:50 this algorithm. So, we divide, 500 00:31:50 --> 00:31:55 because we've never really done non-binary merge sort. 501 00:31:55 --> 00:32:00 We divide into M over B equal size subarrays instead of two. 502 00:32:00 --> 00:32:06 Here, we are clearly doing a cache aware algorithm. 503 00:32:06 --> 00:32:11 We are assuming we know what M over B is. 504 00:32:11 --> 00:32:17 So, then we recursively sort each subarray, 505 00:32:17 --> 00:32:21 and then we conquer. We merge. 506 00:32:21 --> 00:32:29 And, the reason merge works is because we can afford one block 507 00:32:29 --> 00:32:34 in cache. So, let's call it one cache 508 00:32:34 --> 00:32:36 block per subarray. OK, actually, 509 00:32:36 --> 00:32:40 if you're careful, you also need one block for the 510 00:32:40 --> 00:32:44 output of the merged array before you write it out. 511 00:32:44 --> 00:32:47 So, it should be M over B minus one. 
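The divide, recurse, and merge steps just described can be sketched as follows. This is a minimal illustration, not a cache-aware implementation: the parameter `k` stands in for M over B, the function name `multiway_merge_sort` is hypothetical, and the standard-library `heapq.merge` plays the role of the K parallel scans by keeping one current element per run (the analogue of one cache block per subarray).

```python
import heapq

def multiway_merge_sort(items, k):
    """Sort by splitting into at most k consecutive pieces, recursing on
    each piece, and merging the sorted runs in a single pass.

    In the lecture's setting k would be about M/B; here it is a parameter.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    n = len(items)
    if n <= 1:
        return list(items)
    size = -(-n // k)  # ceil(n / k), so there are at most k pieces
    pieces = [items[i:i + size] for i in range(0, n, size)]
    runs = [multiway_merge_sort(piece, k) for piece in pieces]
    # k-way merge: one "current" element per run is held at any time
    return list(heapq.merge(*runs))
```

For example, `multiway_merge_sort(data, 4)` splits `data` into four consecutive pieces at each level, so the recursion has roughly log base 4 of N levels instead of log base 2.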
512 00:32:47 --> 00:32:50 But, let's ignore some additive constants. 513 00:32:50 --> 00:32:53 OK, so this is the recurrence we get. 514 00:32:53 --> 00:32:59 The base case is the same. And, what improves here? 515 00:32:59 --> 00:33:02 I mean, the per level cost doesn't change, 516 00:33:02 --> 00:33:06 I claim, because at the top we get N over B, 517 00:33:06 --> 00:33:09 just as before. Then we split into M over B 518 00:33:09 --> 00:33:15 subproblems, each of which costs a one over M over B factor times 519 00:33:15 --> 00:33:18 N over B. OK, so you add all those up, 520 00:33:18 --> 00:33:23 you still get N over B because we are not decreasing the number 521 00:33:23 --> 00:33:26 of elements. We're just splitting them. 522 00:33:26 --> 00:33:31 There are now M over B subproblems, each of one over M 523 00:33:31 --> 00:33:36 over B the size. So, just like before, 524 00:33:36 --> 00:33:39 each level will sum to N over B. 525 00:33:39 --> 00:33:44 What changes is the number of levels because now we have 526 00:33:44 --> 00:33:49 a bigger branching factor. Instead of log base two, 527 00:33:49 --> 00:33:53 it's now log base the branching factor. 528 00:33:53 --> 00:33:59 So, the height of this tree is log base M over B of N over M, 529 00:33:59 --> 00:34:03 I believe. Let me make sure that agrees 530 00:34:03 --> 00:34:06 with me. Yeah. 531 00:34:06 --> 00:34:12 OK, and if you're careful, this counts not quite the 532 00:34:12 --> 00:34:18 number of levels, but the number of levels minus 533 00:34:18 --> 00:34:22 one. So, I'm going to put a plus one 534 00:34:22 --> 00:34:26 here. And the reason why is this is 535 00:34:26 --> 00:34:37 not quite the bound that I want. So, we have log base M over B. 536 00:34:37 --> 00:34:45 What I really want inside the log, actually, is N over B. 537 00:34:45 --> 00:34:55 I claim that these are the same because we have minus, 538 00:34:55 --> 00:35:01 yeah, that's good.
OK, this should come as rather 539 00:35:01 --> 00:35:05 mysterious, but it's because I know what the sorting bound 540 00:35:05 --> 00:35:07 should be as I'm doing this arithmetic. 541 00:35:07 --> 00:35:10 So, I'm taking log base M over B of N over M. 542 00:35:10 --> 00:35:12 I'm not changing the base of the log. 543 00:35:12 --> 00:35:14 I'm just saying, well, N over M, 544 00:35:14 --> 00:35:17 that is N over B divided by M over B because then the B's 545 00:35:17 --> 00:35:20 cancel, and the M goes on the bottom. 546 00:35:20 --> 00:35:23 So, if I do that in the logs, I get log of N over B minus log 547 00:35:23 --> 00:35:26 of M over B minus, because I'm dividing. 548 00:35:26 --> 00:35:30 OK, now, log base M over B of M over B is one. 549 00:35:30 --> 00:35:33 So, these cancel, and I get log base M over B, 550 00:35:33 --> 00:35:36 N over B, which is what I was aiming for. 551 00:35:36 --> 00:35:39 Why? Because that's the right bound 552 00:35:39 --> 00:35:43 as it's normally written. OK, that's what we will be 553 00:35:43 --> 00:35:48 trying to get cache obliviously. So, that's the height of the 554 00:35:48 --> 00:35:53 search tree, and at each level we are paying N over B memory 555 00:35:53 --> 00:35:56 transfers. So, the overall number of 556 00:35:56 --> 00:36:01 memory transfers for this M over B way merge sort is the sorting 557 00:36:01 --> 00:36:03 bound. 558 00:36:03 --> 00:36:13 559 00:36:13 --> 00:36:19 This is, I'll put it in a box. This is the sorting bound, 560 00:36:19 --> 00:36:25 and it's very special because it is the optimal number of 561 00:36:25 --> 00:36:31 memory transfers for sorting N items cache aware. 562 00:36:31 --> 00:36:33 This has been known since, like, 1983. 563 00:36:33 --> 00:36:35 OK, this is the best thing to do. 564 00:36:35 --> 00:36:38 It's a really weird bound, but if you ignore all the 565 00:36:38 --> 00:36:41 divided by B's, it's sort of like N times log 566 00:36:41 --> 00:36:44 base M of N. 
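Written out, the log arithmetic above is the following (a restatement in symbols, using the lecture's N, M, and B):

```latex
\[
\log_{M/B}\frac{N}{M}
  = \log_{M/B}\frac{N/B}{M/B}
  = \log_{M/B}\frac{N}{B} - \log_{M/B}\frac{M}{B}
  = \log_{M/B}\frac{N}{B} - 1 ,
\]
so the number of levels is
\[
\log_{M/B}\frac{N}{M} + 1 = \log_{M/B}\frac{N}{B},
\qquad\text{and}\qquad
MT(N) = \Theta\!\left(\frac{N}{B}\,\log_{M/B}\frac{N}{B}\right).
\]
```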
So, that's little bit more 567 00:36:44 --> 00:36:46 reasonable. But, there's lots of divided by 568 00:36:46 --> 00:36:49 B's. So, the number of the blocks in 569 00:36:49 --> 00:36:53 the input times log base the number of blocks in the cache of 570 00:36:53 --> 00:36:55 the number of blocks in the input. 571 00:36:55 --> 00:36:57 That's a little bit more intuitive. 572 00:36:57 --> 00:37:02 That is the bound. And that's what we are aiming 573 00:37:02 --> 00:37:04 for. So, this algorithm, 574 00:37:04 --> 00:37:08 crucially, assume that we knew what M over B was. 575 00:37:08 --> 00:37:12 Now, we are going to try and do it without knowing M over B, 576 00:37:12 --> 00:37:17 do it cache obliviously. And that is the result of only 577 00:37:17 --> 00:37:19 a few years ago. Are you ready? 578 00:37:19 --> 00:37:23 Everything clear so far? It's a pretty natural 579 00:37:23 --> 00:37:26 algorithm. We were going to try to mimic 580 00:37:26 --> 00:37:31 it essentially and do a merge sort, but not M over B way merge 581 00:37:31 --> 00:37:36 sort because we don't know how. We're going to try and do it 582 00:37:36 --> 00:37:39 essentially a square root of N way merge sort. 583 00:37:39 --> 00:37:43 If you play around, that's the natural thing to do. 584 00:37:43 --> 00:37:46 The tricky part is that it's hard to merge square root of N 585 00:37:46 --> 00:37:50 lists at the same time, in a cache efficient way. 586 00:37:50 --> 00:37:54 We know that if the square root of N is bigger than M over B, 587 00:37:54 --> 00:37:57 you're hosed if you just do a straightforward merge. 588 00:37:57 --> 00:38:02 So, we need a fancy merge. We are going to do a divide and 589 00:38:02 --> 00:38:05 conquer merge. 
It's a lot like the 590 00:38:05 --> 00:38:10 multithreaded algorithms of last week, try and do a divide and 591 00:38:10 --> 00:38:14 conquer merge so that no matter how many lists we are merging, 592 00:38:14 --> 00:38:18 as long as it's less than the square root of N, 593 00:38:18 --> 00:38:23 or actually the cube root of N, we can do it cache efficiently, 594 00:38:23 --> 00:38:24 OK? So, to do this, 595 00:38:24 --> 00:38:28 we need a bit of setup. But that's where we're going, 596 00:38:28 --> 00:38:33 cache oblivious sorting. So, we want to get the sorting 597 00:38:33 --> 00:38:36 bound, and, yeah. It turns out, 598 00:38:36 --> 00:38:40 to do cache oblivious sorting, you need an assumption about 599 00:38:40 --> 00:38:42 the cache size. This is kind of annoying, 600 00:38:42 --> 00:38:45 because we said, well, cache oblivious 601 00:38:45 --> 00:38:49 algorithms should work for all values of B and all values of M. 602 00:38:49 --> 00:38:53 But, you can actually prove you need an additional assumption in 603 00:38:53 --> 00:38:55 order to get this bound cache obliviously. 604 00:38:55 --> 00:38:58 That's a result of, like, last year by Gerth 605 00:38:58 --> 00:39:01 Brodal. So, and the assumption is, 606 00:39:01 --> 00:39:04 well, the assumption is fairly weak. 607 00:39:04 --> 00:39:07 That's the good news. OK, we've actually made an 608 00:39:07 --> 00:39:10 assumption several times. We said, well, 609 00:39:10 --> 00:39:13 assuming the cache can store at least three blocks, 610 00:39:13 --> 00:39:17 or assuming the cache can store at least four blocks, 611 00:39:17 --> 00:39:21 yeah, it's reasonable to say the cache can store at least 612 00:39:21 --> 00:39:25 four blocks, or at least any constant number of blocks. 613 00:39:25 --> 00:39:29 The assumption here is that the number of blocks that your cache can store 614 00:39:29 --> 00:39:33 is at least B to the epsilon blocks. 615 00:39:33 --> 00:39:36 This is saying your cache isn't, like, really narrow.
616 00:39:36 --> 00:39:37 It's about as tall as it is wide. 617 00:39:37 --> 00:39:40 This actually gives you a lot of slack. 618 00:39:40 --> 00:39:42 And, we're going to use a simple version of this 619 00:39:42 --> 00:39:44 assumption that M is at least B^2. 620 00:39:44 --> 00:39:48 OK, this is pretty natural. It's saying that your cache is 621 00:39:48 --> 00:39:51 at least as tall as it is wide where these are the blocks. 622 00:39:51 --> 00:39:54 OK, the number of blocks is at least the size of a block. 623 00:39:54 --> 00:39:57 That's a pretty reasonable assumption, and if you look at 624 00:39:57 --> 00:40:00 caches these days, they all satisfy this, 625 00:40:00 --> 00:40:04 at least for some epsilon. Pretty much universally, 626 00:40:04 --> 00:40:08 M is at least B^2 or so. OK, and in fact, 627 00:40:08 --> 00:40:12 if you think from our speed of light arguments from last time, 628 00:40:12 --> 00:40:16 B^2 or B^3 is actually the right thing to do. 629 00:40:16 --> 00:40:18 As you go out, I guess in 3-D, 630 00:40:18 --> 00:40:23 B^2 would be the surface area of the sphere out there. 631 00:40:23 --> 00:40:27 OK, so this is actually the natural thing of how much space 632 00:40:27 --> 00:40:32 you should have at a particular distance. 633 00:40:32 --> 00:40:35 Assuming we live in a constant dimensional space, 634 00:40:35 --> 00:40:40 that assumption would be true. This even allows going up to 42 635 00:40:40 --> 00:40:43 dimensions or whatever, OK, so a pretty reasonable 636 00:40:43 --> 00:40:44 assumption. Good. 637 00:40:44 --> 00:40:47 Now, we are going to achieve this bound. 638 00:40:47 --> 00:40:52 And what we are going to try to do is use an N to the epsilon 639 00:40:52 --> 00:40:56 way merge sort for some epsilon. And, if we assume that M is at 640 00:40:56 --> 00:41:02 least B^2, the epsilon will be one third, it turns out. 641 00:41:02 --> 00:41:08 So, we are going to do the cube root of N way merge sort.
642 00:41:08 --> 00:41:14 I'll start by giving you and analyzing the sorting 643 00:41:14 --> 00:41:20 algorithms, assuming that we know how to do merge in a 644 00:41:20 --> 00:41:25 particular bound. OK, then we'll do the merge. 645 00:41:25 --> 00:41:31 The merge is the hard part. OK, so the merge, 646 00:41:31 --> 00:41:34 I'm going to give you the black box first of all. 647 00:41:34 --> 00:41:36 First of all, what does merge do? 648 00:41:36 --> 00:41:40 The K way merger is called the K funnel just because it looks 649 00:41:40 --> 00:41:42 like a funnel, which you'll see. 650 00:41:42 --> 00:41:45 So, a K funnel is a data structure, or is an algorithm, 651 00:41:45 --> 00:41:48 let's say, that looks like a data structure. 652 00:41:48 --> 00:41:52 And it merges K sorted lists. So, supposing you already have 653 00:41:52 --> 00:41:56 K lists, and they're sorted, and assuming that the lists are 654 00:41:56 --> 00:41:59 relatively long, so we need some additional 655 00:41:59 --> 00:42:03 assumptions for this black box to work, and we'll be able to 656 00:42:03 --> 00:42:09 get them as we sort. We want the total size of those 657 00:42:09 --> 00:42:12 lists. You add up all the elements, 658 00:42:12 --> 00:42:17 and all the lists should have size at least K^3 is the 659 00:42:17 --> 00:42:21 assumption. Then, it merges these lists 660 00:42:21 --> 00:42:25 using essentially the sorting bound. 661 00:42:25 --> 00:42:30 Actually, I should really say theta K^3. 662 00:42:30 --> 00:42:36 I also don't want to be too much bigger than K^3. 663 00:42:36 --> 00:42:42 Sorry about that. So, the number of memory 664 00:42:42 --> 00:42:50 transfers that this funnel merger uses is the sorting bound 665 00:42:50 --> 00:42:57 on K^3, so K^3 over B, log base M over B of K^3 over 666 00:42:57 --> 00:43:03 B, plus another K memory transfers. 667 00:43:03 --> 00:43:06 Now, K memory transfers is pretty reasonable. 
668 00:43:06 --> 00:43:09 You've got to at least start reading each list, 669 00:43:09 --> 00:43:12 so you got to pay one memory transfer per list. 670 00:43:12 --> 00:43:16 OK, but our challenge in some sense will be getting rid of 671 00:43:16 --> 00:43:19 this plus K. This is how fast we can merge. 672 00:43:19 --> 00:43:22 We'll do that after. Now, assuming we have this, 673 00:43:22 --> 00:43:26 let me tell you how to sort. This is, eventually enough, 674 00:43:26 --> 00:43:31 called funnel sort. But in a certain sense, 675 00:43:31 --> 00:43:36 it's really cubed root of N way merge sort. 676 00:43:36 --> 00:43:41 OK, but we'll analyze it using this. 677 00:43:41 --> 00:43:47 OK, so funnel sort, we are going to define K to be 678 00:43:47 --> 00:43:52 N to the one third, and apply this merger. 679 00:43:52 --> 00:43:56 So, what do we do? It's just like here. 680 00:43:56 --> 00:44:05 We're going to divide our array into N to the one third. 681 00:44:05 --> 00:44:09 I mean, it they should be consecutive subarrays. 682 00:44:09 --> 00:44:13 I'll call them segments of the array. 683 00:44:13 --> 00:44:18 OK, for cache oblivious, it's really crucial how these 684 00:44:18 --> 00:44:22 things are laid out. We're going to cut and get 685 00:44:22 --> 00:44:28 consecutive chunks of the array, N to the one third of them. 686 00:44:28 --> 00:44:34 Then I'm going to recursively sort them, and then I'm going to 687 00:44:34 --> 00:44:37 merge. OK, and I'm going to merge 688 00:44:37 --> 00:44:41 using the K funnel, the N to the one third funnel 689 00:44:41 --> 00:44:43 because, now, why do I use one third? 690 00:44:43 --> 00:44:48 Well, because of this three. OK, in order to use the N to 691 00:44:48 --> 00:44:51 the one third funnel, I need to guarantee that the 692 00:44:51 --> 00:44:55 total number of elements that I'm merging is at least the cube 693 00:44:55 --> 00:44:57 of this number, K^3. 694 00:44:57 --> 00:45:01 The cube of this number is N. 
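The three steps of funnel sort described above (cut into about N to the one third consecutive segments, sort each recursively, merge them all at once) can be sketched as below. This shows only the recursive structure: the K-funnel is replaced by an ordinary K-way merge from the standard library, which is not cache oblivious, and the name `funnel_sort_sketch` and the base-case cutoff of 8 are assumptions made for illustration.

```python
import heapq

def funnel_sort_sketch(items):
    """Structure of funnel sort: about n^(1/3) consecutive segments of
    size about n^(2/3), sorted recursively and then merged together."""
    n = len(items)
    if n <= 8:  # hypothetical small base case
        return sorted(items)
    k = max(2, round(n ** (1 / 3)))  # number of segments, K = n^(1/3)
    size = -(-n // k)                # segment length, about n^(2/3)
    segments = [items[i:i + size] for i in range(0, n, size)]
    runs = [funnel_sort_sketch(segment) for segment in segments]
    # Stand-in for the K-funnel: a plain k-way merge (NOT cache oblivious).
    return list(heapq.merge(*runs))
```

The choice K = N to the one third is what makes the total number of elements being merged equal K^3, which is the precondition of the funnel.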
That's exactly how many 695 00:45:01 --> 00:45:05 elements I have in total. OK, so this is exactly when I 696 00:45:05 --> 00:45:08 can apply the funnel. It's going to require that I 697 00:45:08 --> 00:45:11 have at least K^3 elements, so I can only use an N to 698 00:45:11 --> 00:45:14 the one third funnel. I mean, if it didn't have this 699 00:45:14 --> 00:45:17 requirement, I could just say, well, I have N lists each of 700 00:45:17 --> 00:45:20 size one. OK, that's clearly not going to 701 00:45:20 --> 00:45:23 work very well for our merger, I mean, intuitively because 702 00:45:23 --> 00:45:26 this plus K will kill you. That will be a plus N which is 703 00:45:26 --> 00:45:30 way too big. But we can use an N to the one 704 00:45:30 --> 00:45:35 third funnel, and this is how we would sort. 705 00:45:35 --> 00:45:38 So, let's analyze this algorithm. 706 00:45:38 --> 00:45:42 Hopefully, it will give the sorting bound if I did 707 00:45:42 --> 00:45:47 everything correctly. OK, this is pretty easy. 708 00:45:47 --> 00:45:52 The only thing that makes this messy is I have to write the 709 00:45:52 --> 00:45:58 sorting bound over and over. OK, this is the cost of the 710 00:45:58 --> 00:46:02 merge. So that's at the root. 711 00:46:02 --> 00:46:07 But K^3 in this case is N. So at the root of the 712 00:46:07 --> 00:46:11 recursion, let me write the recurrence first. 713 00:46:11 --> 00:46:15 Sorry. So, we have memory transfers on 714 00:46:15 --> 00:46:19 N elements is N to the one third. 715 00:46:19 --> 00:46:24 Let me get this right. Yeah, N to the one third 716 00:46:24 --> 00:46:28 recursions, each of size N to the two thirds, 717 00:46:28 --> 00:46:34 OK, plus this merge cost, except K^3 is N. 718 00:46:34 --> 00:46:40 So, this is plus N over B, log base M over B of N over B 719 00:46:40 --> 00:46:46 plus cube root of N. This is the additive plus K term. 720 00:46:46 --> 00:46:52 OK, so that's my recurrence.
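In symbols, the recurrence just stated reads as follows, with the funnel's cost evaluated at K = N^{1/3}, so that K^3 = N and the additive K term is N^{1/3}:

```latex
\[
MT(N) \;=\; N^{1/3}\, MT\!\left(N^{2/3}\right)
      \;+\; O\!\left(\frac{N}{B}\,\log_{M/B}\frac{N}{B} \;+\; N^{1/3}\right).
\]
```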
The base case will be the 721 00:46:52 --> 00:46:57 usual. MT of some constant times M is 722 00:46:57 --> 00:47:02 order M over B. So, we sort of know what we 723 00:47:02 --> 00:47:06 should get here. Well, not really. 724 00:47:06 --> 00:47:09 So, in all the previous recurrences, 725 00:47:09 --> 00:47:15 we had the same cost at every level, and that's where we got 726 00:47:15 --> 00:47:20 our log factor. Now, we already have a log 727 00:47:20 --> 00:47:24 factor, so we better not get another one. 728 00:47:24 --> 00:47:28 Right, this is the bound we want to prove. 729 00:47:28 --> 00:47:33 So, let me cheat here for a second. 730 00:47:33 --> 00:47:36 All right, indeed. You may already be wondering, 731 00:47:36 --> 00:47:39 this N to the one third seems rather large. 732 00:47:39 --> 00:47:43 If it's bigger than this, we are already in trouble at 733 00:47:43 --> 00:47:45 the very top level of the recursion. 734 00:47:45 --> 00:47:49 So, I claim that that's OK. Let's look at N to the one 735 00:47:49 --> 00:47:51 third. OK, there is a base case here 736 00:47:51 --> 00:47:54 which covers all values of N that are, at most, 737 00:47:54 --> 00:47:58 some constant times M. So, if I'm not in the base case, 738 00:47:58 --> 00:48:02 I know that N is at least as big as the cache up to some 739 00:48:02 --> 00:48:06 constant. OK, now the cache is at least 740 00:48:06 --> 00:48:10 B^2, we've assumed. And you can do this with B to 741 00:48:10 --> 00:48:13 the one plus epsilon if you're more careful. 742 00:48:13 --> 00:48:15 So, N is at least B^2, OK? 743 00:48:15 --> 00:48:19 And then, I always have trouble with these. 744 00:48:19 --> 00:48:23 So this means that N divided by B is omega root N. 745 00:48:23 --> 00:48:26 OK, there's many things you could say here, 746 00:48:26 --> 00:48:30 and only one of them is right. So, why?
747 00:48:30 --> 00:48:34 So this says that the square root of N is at least B, 748 00:48:34 --> 00:48:38 and so N divided by B is at least N divided by square root of 749 00:48:38 --> 00:48:41 N. So that's at least the square 750 00:48:41 --> 00:48:43 root of N if you check that all out. 751 00:48:43 --> 00:48:48 I'm going to go through this arithmetic relatively quickly 752 00:48:48 --> 00:48:50 because it's tedious but necessary. 753 00:48:50 --> 00:48:54 OK, the square root of N is strictly bigger than the cube root 754 00:48:54 --> 00:48:57 of N. OK, so that means that N over B 755 00:48:57 --> 00:49:02 is strictly bigger than N to the one third. 756 00:49:02 --> 00:49:05 Here we have N over B times something that's bigger than 757 00:49:05 --> 00:49:07 one. So this term definitely 758 00:49:07 --> 00:49:10 dominates this term in this case. 759 00:49:10 --> 00:49:14 As long as I'm not in the base case, I know N is at least order 760 00:49:14 --> 00:49:16 M. This term disappears from my 761 00:49:16 --> 00:49:18 recurrence. OK, so, good. 762 00:49:18 --> 00:49:21 That was a bit close. Now, what we want to get is 763 00:49:21 --> 00:49:25 this running time overall. So, the recursive cost better 764 00:49:25 --> 00:49:29 be small, better be less than a constant factor increase 765 00:49:29 --> 00:49:35 over this. So, let's write the recurrence. 766 00:49:35 --> 00:49:39 So, we get N over B, log base M over B, 767 00:49:39 --> 00:49:44 N over B at the root. Then, we split into a lot of 768 00:49:44 --> 00:49:49 subproblems, N to the one third subproblems here, 769 00:49:49 --> 00:49:55 and each one costs essentially this but with N replaced by N to 770 00:49:55 --> 00:50:00 the two thirds. OK, so N to the two thirds log 771 00:50:00 --> 00:50:04 base M over B, oops I forgot to divide it by B 772 00:50:04 --> 00:50:11 out here, of N to the two thirds divided by B.
773 00:50:11 --> 00:50:14 That's the cost of one of these nodes, N to the one third of 774 00:50:14 --> 00:50:17 them. What should they add up to? 775 00:50:17 --> 00:50:20 Well, there is N to the one third, and there's an N to the 776 00:50:20 --> 00:50:23 two thirds here that multiplies out to N. 777 00:50:23 --> 00:50:25 So, we get N over B. This looks bad. 778 00:50:25 --> 00:50:28 This looks the same. And we don't want to lose 779 00:50:28 --> 00:50:31 another log factor. But the good news is we have 780 00:50:31 --> 00:50:35 two thirds in here. OK, this is what we get in 781 00:50:35 --> 00:50:38 total at this level. It looks like the sorting 782 00:50:38 --> 00:50:41 bound, but in the log there's still a two thirds. 783 00:50:41 --> 00:50:45 Now, a power of two thirds and a log comes out as a multiple of 784 00:50:45 --> 00:50:48 two thirds. So, this is in fact two thirds 785 00:50:48 --> 00:50:51 times N over B, log base M over B of N over B, 786 00:50:51 --> 00:50:54 the sorting bound. So, this is two thirds of the 787 00:50:54 --> 00:50:57 sorting bound. And this is the sorting bound, 788 00:50:57 --> 00:51:01 one times the sorting bound. So, it's going down 789 00:51:01 --> 00:51:02 geometrically, yea! 790 00:51:02 --> 00:51:05 OK, I'm not going to prove it, but it's true. 791 00:51:05 --> 00:51:08 This went down by a factor of two thirds. 792 00:51:08 --> 00:51:12 The next one will also go down by a factor of two thirds by 793 00:51:12 --> 00:51:14 induction. OK, if you prove it at one 794 00:51:14 --> 00:51:17 level, it should be true at all of them. 795 00:51:17 --> 00:51:19 And I'm going to skip the details there. 796 00:51:19 --> 00:51:23 So, we could check the leaf level just to make sure. 797 00:51:23 --> 00:51:25 That's always a good sanity check. 798 00:51:25 --> 00:51:30 At the leaves, we know our cost is M over B. 799 00:51:30 --> 00:51:32 OK, and how many leaves are there? 
800 00:51:32 --> 00:51:34 Just like before, in some sense, 801 00:51:34 --> 00:51:38 we have N/M leaves. OK, so in fact the total cost 802 00:51:38 --> 00:51:41 at the bottom is N over B. And it turns out that that's 803 00:51:41 --> 00:51:44 what you get. So, you essentially, 804 00:51:44 --> 00:51:47 it looks funny, because you'd think that this 805 00:51:47 --> 00:51:51 would actually be smaller than this at some intuitive level. 806 00:51:51 --> 00:51:54 It's not. In fact, what's happening is 807 00:51:54 --> 00:51:57 you have this N over B times this log thing, 808 00:51:57 --> 00:52:00 whatever the log thing is. We don't care too much. 809 00:52:00 --> 00:52:05 Let's just call it log. What you are taking at the next 810 00:52:05 --> 00:52:08 level is two thirds times that log. 811 00:52:08 --> 00:52:11 And at the next level, it's four ninths times that log 812 00:52:11 --> 00:52:13 and so on. So, it's geometrically 813 00:52:13 --> 00:52:16 decreasing until the log gets down to one. 814 00:52:16 --> 00:52:17 And then you stop the recursion. 815 00:52:17 --> 00:52:21 And that's what you get N over B here with no log. 816 00:52:21 --> 00:52:23 So, what you're doing is decreasing the log, 817 00:52:23 --> 00:52:27 not the N over B stuff. The two thirds should really be 818 00:52:27 --> 00:52:29 over here. In fact, the number of levels 819 00:52:29 --> 00:52:34 here is log, log N. It's the number of times you 820 00:52:34 --> 00:52:39 have to divide a log by three halves before you get down to 821 00:52:39 --> 00:52:42 one, OK? So, we don't actually need 822 00:52:42 --> 00:52:45 that. We don't care how many levels 823 00:52:45 --> 00:52:49 are because it's geometrically decreasing. 824 00:52:49 --> 00:52:52 It could be infinitely many levels. 825 00:52:52 --> 00:52:58 It's geometrically decreasing, and we get this as our running 826 00:52:58 --> 00:53:01 time. MT of N is the sorting bound 827 00:53:01 --> 00:53:05 for funnel sort. So, this is great. 
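The geometric decrease argued above can be written out as follows. This is a sketch of the level-by-level sum, with constants suppressed and with the lecture's "two thirds" step stated as an inequality, since subtracting log B inside the log only helps:

```latex
\[
\underbrace{N^{1/3}}_{\text{subproblems}}
\cdot \frac{N^{2/3}}{B}\,\log_{M/B}\frac{N^{2/3}}{B}
\;\le\; \frac{2}{3}\cdot\frac{N}{B}\,\log_{M/B}\frac{N}{B},
\]
because
$\log\frac{N^{2/3}}{B} = \tfrac{2}{3}\log N - \log B
 \le \tfrac{2}{3}\left(\log N - \log B\right)$.
Applying the same step at every level and summing,
\[
MT(N) \;\le\; \sum_{i \ge 0} \left(\frac{2}{3}\right)^{i}
      \frac{N}{B}\,\log_{M/B}\frac{N}{B}
\;=\; 3\cdot\frac{N}{B}\,\log_{M/B}\frac{N}{B}
\;=\; O\!\left(\frac{N}{B}\,\log_{M/B}\frac{N}{B}\right).
\]
```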
828 00:53:05 --> 00:53:09 As long as we can get a funnel that merges this quickly, 829 00:53:09 --> 00:53:14 we get a sorting algorithm that sorts as fast as it possibly 830 00:53:14 --> 00:53:17 can. I didn't write that on the 831 00:53:17 --> 00:53:20 board that this is asymptotically optimal. 832 00:53:20 --> 00:53:25 Even if you knew what B and M were, this is the best that you 833 00:53:25 --> 00:53:28 could hope to do. And here, we are doing it no 834 00:53:28 --> 00:53:32 matter what, B and M are. Good. 835 00:53:32 --> 00:53:35 Get ready for the funnel. The funnel will be another 836 00:53:35 --> 00:53:37 recursion. So, this is a recursive 837 00:53:37 --> 00:53:39 algorithm in a recursive algorithm. 838 00:53:39 --> 00:53:43 It's another divide and conquer, kind of like the static 839 00:53:43 --> 00:53:46 search trees we saw at the beginning of this lecture. 840 00:53:46 --> 00:53:49 So, these all tie together. 841 00:53:49 --> 00:54:03 842 00:54:03 --> 00:54:06 All right, the K funnel, so, I'm calling it K funnel 843 00:54:06 --> 00:54:10 because I want to think of it at some recursive level, 844 00:54:10 --> 00:54:14 not just N to the one third. OK, we're going to recursively 845 00:54:14 --> 00:54:17 use, in fact, the square root of K funnel. 846 00:54:17 --> 00:54:21 So, here's, and I need to achieve that bound. 847 00:54:21 --> 00:54:24 So, the recursion is like the static search tree, 848 00:54:24 --> 00:54:27 and a little bit hard to draw on one board, 849 00:54:27 --> 00:54:34 but here we go. So, we have a square root of K 850 00:54:34 --> 00:54:37 funnel. Recursively, 851 00:54:37 --> 00:54:44 we have a buffer up here. This is called the output 852 00:54:44 --> 00:54:50 buffer, and it has size K^3, and just for kicks, 853 00:54:50 --> 00:54:57 let's suppose it that filled up a little bit. 854 00:54:57 --> 00:55:06 And, we have some more buffers. And, let's suppose they've been 855 00:55:06 --> 00:55:13 filled up by different amounts. 
And each of these has size K to 856 00:55:13 --> 00:55:16 the three halves, of course. 857 00:55:16 --> 00:55:21 These are called buffers, let's say, 858 00:55:21 --> 00:55:28 the intermediate buffers. And, then hanging off of them, 859 00:55:28 --> 00:55:34 we have more funnels, a square root of K funnel 860 00:55:34 --> 00:55:40 here, and a square root of K funnel here, one for each 861 00:55:40 --> 00:55:47 buffer, one for each child of this funnel. 862 00:55:47 --> 00:55:53 OK, and then hanging off of these funnels are the input 863 00:55:53 --> 00:55:54 arrays. 864 00:55:54 --> 00:56:07 865 00:56:07 --> 00:56:12 OK, I'm not going to draw all K of them, but there are K input 866 00:56:12 --> 00:56:16 arrays, input lists let's call them, down at the bottom. 867 00:56:16 --> 00:56:21 OK, so the idea is we are going to merge bottom-up in this 868 00:56:21 --> 00:56:23 picture. We start with our K input 869 00:56:23 --> 00:56:26 arrays of total size at least K^3. 870 00:56:26 --> 00:56:31 That's what we're assuming we have up here. 871 00:56:31 --> 00:56:34 We are clustering them into groups of size square root of K, 872 00:56:34 --> 00:56:37 so, the square root of K groups, throw each of them into 873 00:56:37 --> 00:56:40 a square root of K funnel that recursively merges those square 874 00:56:40 --> 00:56:43 root of K lists. The output of those funnels we 875 00:56:43 --> 00:56:46 are putting into a buffer to sort of accumulate what the 876 00:56:46 --> 00:56:49 answer should be. These buffers have size 877 00:56:49 --> 00:56:52 exactly K to the three halves, which might not be perfect 878 00:56:52 --> 00:56:55 because we know that on average, there should be K to the three 879 00:56:55 --> 00:56:59 halves elements in each of these because there's K^3 total, 880 00:56:59 --> 00:57:02 and the square root of K groups.
881 00:57:02 --> 00:57:05 So, it should be K^3 divided by the square root of K, 882 00:57:05 --> 00:57:07 which is K to the three halves on average. 883 00:57:07 --> 00:57:09 But some of these will be bigger. 884 00:57:09 --> 00:57:12 Some of them will be smaller. I've drawn it here. 885 00:57:12 --> 00:57:15 Some of them had emptied a bit more depending on how you merge 886 00:57:15 --> 00:57:16 things. But on average, 887 00:57:16 --> 00:57:18 these will all fill at the same time. 888 00:57:18 --> 00:57:22 And then, we plug them into a square root of K funnel, 889 00:57:22 --> 00:57:24 and that we get the output of size K^3. 890 00:57:24 --> 00:57:28 So, that is roughly what we should have happen. 891 00:57:28 --> 00:57:31 OK, but in fact, some of these might fill first, 892 00:57:31 --> 00:57:36 and we have to do some merging in order to empty a buffer, 893 00:57:36 --> 00:57:39 make room for more stuff coming up. 894 00:57:39 --> 00:57:43 That's the picture. Now, before I actually tell you 895 00:57:43 --> 00:57:47 what the algorithm is, or analyze the algorithm, 896 00:57:47 --> 00:57:51 let's first just think about space, a very simple warm-up 897 00:57:51 --> 00:57:54 analysis. So, let's look at the space 898 00:57:54 --> 00:58:00 excluding the inputs and outputs, those buffers. 899 00:58:00 --> 00:58:02 OK, why do I want to exclude input and output buffers? 900 00:58:02 --> 00:58:05 Well, because I want to only count each buffer once, 901 00:58:05 --> 00:58:09 and this buffer is actually the input to this one and the output 902 00:58:09 --> 00:58:11 to this one. So, in order to recursively 903 00:58:11 --> 00:58:14 count all the buffers exactly once, I'm only going to count 904 00:58:14 --> 00:58:16 these middle buffers. And then separately, 905 00:58:16 --> 00:58:20 I'm going to have to think of the overall output and input 906 00:58:20 --> 00:58:22 buffers. But those are sort of given. 907 00:58:22 --> 00:58:23 I mean, I need K^3 for the output. 
908 00:58:23 --> 00:58:26 I need K^3 for the input. So ignore those overall. 909 00:58:26 --> 00:58:29 And then if I count the middle buffers recursively, 910 00:58:29 --> 00:58:34 I'll get all the buffers. So, then we get a very simple 911 00:58:34 --> 00:58:39 recurrence for space. S of K is roughly square root 912 00:58:39 --> 00:58:45 of K plus one times S of square root of K plus order K^2, 913 00:58:45 --> 00:58:51 K^2 because we have the square root of K of these buffers, 914 00:58:51 --> 00:58:54 each of size K to the three halves. 915 00:58:54 --> 00:58:58 Work that out, does that sound right? 916 00:58:58 --> 00:59:02 That sounds an awful lot like K^3, but maybe, 917 00:59:02 --> 00:59:06 all right. Oh, no, that's right. 918 00:59:06 --> 00:59:09 It's K to the three halves times the square root of K, 919 00:59:09 --> 00:59:13 which is K to the three halves plus a half, which is K to the 920 00:59:13 --> 00:59:16 four halves, which is K^2. Phew, OK, good. 921 00:59:16 --> 00:59:18 I'm just bad with my arithmetic here. 922 00:59:18 --> 00:59:20 OK, so K^2 total buffering here. 923 00:59:20 --> 00:59:23 You add them up for each level, each recursion, 924 00:59:23 --> 00:59:27 and the plus one here is to take into account the top guy, 925 00:59:27 --> 00:59:31 the square root of K bottom guys, so the square root of K 926 00:59:31 --> 00:59:33 plus one. If this were, 927 00:59:33 --> 00:59:36 well, let me just draw the recurrence tree. 928 00:59:36 --> 00:59:39 There's many ways you could solve this recurrence. 929 00:59:39 --> 00:59:41 A natural one is instead of looking at K, 930 00:59:41 --> 00:59:44 you look at log K, because here log K is 931 00:59:44 --> 00:59:47 getting divided by two. I'm just going to draw the 932 00:59:47 --> 00:59:50 recursion trees, so you can see the intuition. 933 00:59:50 --> 00:59:53 But if you are going to solve it, you should probably take the 934 00:59:53 --> 00:59:57 logs, substitute by log.
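The space recurrence S(K) = (sqrt(K) + 1) S(sqrt(K)) + K^2 can be sanity-checked numerically. The following is a minimal sketch, not part of the lecture: the function name `funnel_space`, the constant 1 hidden in the order K^2 term, and the base case S(k) = 1 for k <= 2 are all assumptions made for illustration.

```python
import math


def funnel_space(k: float) -> float:
    """Evaluate S(k) = (sqrt(k) + 1) * S(sqrt(k)) + k**2.

    Assumptions (not from the lecture): the hidden constant in the
    order-K^2 term is 1, and a funnel with k <= 2 inputs costs 1 unit.
    """
    if k <= 2:
        return 1.0
    r = math.sqrt(k)
    # sqrt(k) bottom funnels plus one top funnel, plus the middle buffers
    return (r + 1) * funnel_space(r) + k ** 2


if __name__ == "__main__":
    for k in (4, 16, 256, 65536):
        print(k, funnel_space(k) / k ** 2)
```

Under these assumptions the ratio S(K) / K^2 stays below 1.5 for K = 4, 16, 256, 65536, which is consistent with the claim that the top-level buffers dominate the space.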
So, we have the square root of 935 00:59:57 --> 1:00:00 K plus one branching factor. 936 1:00:00 --> 1:00:03.729 And then, the problem size is square root of K, 937 1:00:03.729 --> 1:00:08.108 so this is going to be K, I believe, for each of these. 938 1:00:08.108 --> 1:00:12.324 This is square root of K squared is the cost of these 939 1:00:12.324 --> 1:00:14.513 levels. And, you keep going. 940 1:00:14.513 --> 1:00:19.54 I don't particularly care what the bottom looks like because at 941 1:00:19.54 --> 1:00:23.351 the top we have K^2. Then we have K times root K 942 1:00:23.351 --> 1:00:28.297 plus one cost at the next level. This is K to the three halves 943 1:00:28.297 --> 1:00:32.664 plus K. OK, so we go from K^2 to K to 944 1:00:32.664 --> 1:00:37.257 the three halves plus K. This is super-geometric. 945 1:00:37.257 --> 1:00:41.207 It's like an exponential geometric decrease. 946 1:00:41.207 --> 1:00:45.8 This is decreasing really fast. So, it's order K^2. 947 1:00:45.8 --> 1:00:51.22 That's my hand-waving argument. OK, so the cost is basically 948 1:00:51.22 --> 1:00:56.456 the size of the buffers at the top level, the total space. 949 1:00:56.456 --> 1:01:01.601 We're going to need this. It's actually theta K^2 because 950 1:01:01.601 --> 1:01:06.398 I have a theta K^2 here. We are going to need this in 951 1:01:06.398 --> 1:01:09.249 order to analyze the time. That's why I mentioned it. 952 1:01:09.249 --> 1:01:12.368 It's not just a good feeling that the space is not too big. 953 1:01:12.368 --> 1:01:15.595 In fact, the funnel is a lot smaller than the total input size. 954 1:01:15.595 --> 1:01:18.177 The input size is K^3. But that's not so crucial. 955 1:01:18.177 --> 1:01:21.243 What's crucial is that it's K^2, and we'll use that in the 956 1:01:21.243 --> 1:01:22.48 analysis. OK, naturally, 957 1:01:22.48 --> 1:01:24.308 this thing is laid out recursively. 958 1:01:24.308 --> 1:01:26.675 You recursively store the funnel, top funnel.
959 1:01:26.675 --> 1:01:29.256 Then, for example, you write out each buffer as a 960 1:01:29.256 --> 1:01:32 consecutive array, in this case. 961 1:01:32 --> 1:01:34.748 There's no recursion there. So just write them all out one 962 1:01:34.748 --> 1:01:36.243 by one. Don't interleave them or 963 1:01:36.243 --> 1:01:37.642 anything. Store them in order. 964 1:01:37.642 --> 1:01:40.005 And then, you write out recursively these funnels, 965 1:01:40.005 --> 1:01:41.934 the bottom funnels. OK, any way you do it 966 1:01:41.934 --> 1:01:44.634 recursively, as long as each funnel remains a consecutive 967 1:01:44.634 --> 1:01:46.418 chunk of memory, each buffer remains a 968 1:01:46.418 --> 1:01:49.167 consecutive chunk of memory, the time analysis that we are 969 1:01:49.167 --> 1:01:51 about to do will work. 970 1:01:51 --> 1:02:14 971 1:02:14 --> 1:02:18.062 OK, let me actually give you the algorithm that we're 972 1:02:18.062 --> 1:02:21.265 analyzing. In order to make the funnel go, 973 1:02:21.265 --> 1:02:25.015 what we do is say, initially, all the buffers are 974 1:02:25.015 --> 1:02:27.671 empty. Everything is at the bottom. 975 1:02:27.671 --> 1:02:32.125 And what we are going to do is, say, fill the root buffer. 976 1:02:32.125 --> 1:02:36.04 Fill this one. And, that's a recursive 977 1:02:36.04 --> 1:02:41.542 algorithm, which I'll define in a second, how to fill a buffer. 978 1:02:41.542 --> 1:02:45.713 Once it's filled, that means everything has been 979 1:02:45.713 --> 1:02:50.682 pulled up, and then it's merged. OK, so that's how we get 980 1:02:50.682 --> 1:02:53.522 started. So, the merge 981 1:02:53.522 --> 1:02:58.402 algorithm is: fill the topmost buffer, the topmost output 982 1:02:58.402 --> 1:03:01.002 buffer. OK, and now, 983 1:03:01.002 --> 1:03:04.678 here's how you fill a buffer.
So, in general, 984 1:03:04.678 --> 1:03:08.355 if you expand out this recursion all the way, 985 1:03:08.355 --> 1:03:12.114 in the base case, I didn't mention you sort of 986 1:03:12.114 --> 1:03:16.71 get a little node there. So, if you look at an arbitrary 987 1:03:16.71 --> 1:03:20.386 buffer in this picture that you want to fill, 988 1:03:20.386 --> 1:03:23.979 so this one's empty and you want to fill it, 989 1:03:23.979 --> 1:03:28.407 then immediately below it will be a vertex who has two 990 1:03:28.407 --> 1:03:34.434 children, two other buffers. OK, maybe they look like this. 991 1:03:34.434 --> 1:03:39.141 You have no idea how big they are, except they are the same 992 1:03:39.141 --> 1:03:41.981 size. It could be a lot smaller than 993 1:03:41.981 --> 1:03:44.984 this one, a lot bigger, we don't know. 994 1:03:44.984 --> 1:03:48.554 But in the end, you do get a binary structure 995 1:03:48.554 --> 1:03:53.261 out of this just like we did with the binary search tree at 996 1:03:53.261 --> 1:03:56.913 the beginning. So, how do we fill this buffer? 997 1:03:56.913 --> 1:04:03 Well, we just merge these two child buffers as long as we can. 998 1:04:03 --> 1:04:08.854 So, we merge the two children buffers as long as they are both 999 1:04:08.854 --> 1:04:11.253 non-empty. So, in general, 1000 1:04:11.253 --> 1:04:16.82 the invariant will be that this buffer, let me write down a 1001 1:04:16.82 --> 1:04:19.795 sentence. As long as a buffer is 1002 1:04:19.795 --> 1:04:25.17 non-empty, or whatever is in that buffer, and hasn't been 1003 1:04:25.17 --> 1:04:29.009 used already, it's a prefix of the merged 1004 1:04:29.009 --> 1:04:34 output of the entire subtree beneath it. 1005 1:04:34 --> 1:04:37.567 OK, so this is a partially merged subsequence of everything 1006 1:04:37.567 --> 1:04:39.781 down here. This is a partially merged 1007 1:04:39.781 --> 1:04:41.933 subsequence of everything down here. 
1008 1:04:41.933 --> 1:04:44.824 I can just merge element by element off the top, 1009 1:04:44.824 --> 1:04:48.453 and that will give me outputs to put there until one of them 1010 1:04:48.453 --> 1:04:51.097 gets emptied. And, we have no idea which one 1011 1:04:51.097 --> 1:04:54.357 will empty first just because it depends on the order. 1012 1:04:54.357 --> 1:04:57.801 OK, whenever one of them empties, we recursively fill it, 1013 1:04:57.801 --> 1:05:01 and that's it. That's the algorithm. 1014 1:05:01 --> 1:05:05 Whenever one empties -- 1015 1:05:05 --> 1:05:16 1016 1:05:16 --> 1:05:20.391 -- we recursively fill it. And at the base case at the 1017 1:05:20.391 --> 1:05:23.456 leaves, there's sort of nothing to do. 1018 1:05:23.456 --> 1:05:27.847 I believe you just sort of directly read from an input 1019 1:05:27.847 --> 1:05:30.167 list. So, at the very bottom, 1020 1:05:30.167 --> 1:05:34.807 if you have some node here that's trying to merge between 1021 1:05:34.807 --> 1:05:39.198 these two, that's just a straightforward merge between 1022 1:05:39.198 --> 1:05:42.595 two lists. We know how to do that with two 1023 1:05:42.595 --> 1:05:44.832 parallel scans. So, in fact, 1024 1:05:44.832 --> 1:05:49.886 we can merge the entire thing here and just spit it out to the 1025 1:05:49.886 --> 1:05:52.786 buffer. Well, it depends how big the 1026 1:05:52.786 --> 1:05:56.1 buffer is. We can only merge it until the 1027 1:05:56.1 --> 1:06:01.445 buffer fills. Whenever a buffer is full, 1028 1:06:01.445 --> 1:06:05.394 we stop and we pop up the recursive layers. 1029 1:06:05.394 --> 1:06:11.131 OK, so we keep doing this merge until the buffer we are trying 1030 1:06:11.131 --> 1:06:14.047 to fill fills, and then we stop, 1031 1:06:14.047 --> 1:06:17.338 pop up. OK, that's the algorithm for 1032 1:06:17.338 --> 1:06:20.724 merging. Now, we just have to analyze 1033 1:06:20.724 --> 1:06:24.579 the algorithm.
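The fill procedure just described (merge the two child buffers into the parent, recursively refill a child whenever it empties, stop when the parent is full) can be sketched in a few lines. This is an illustration under simplifying assumptions, not the lecture's exact structure: every internal buffer gets the same small capacity `cap` instead of the K to the three halves sizes of the real funnel, the tree is a plain balanced binary tree with no recursive memory layout, and the names `Node`, `build`, `fill`, and `funnel_merge` are made up for this sketch.

```python
from collections import deque


class Node:
    """A binary merger together with its output buffer `buf`."""

    def __init__(self, cap, left=None, right=None, items=()):
        self.cap = cap                # capacity of this node's output buffer
        self.left, self.right = left, right
        self.buf = deque(items)       # a leaf's buffer is its whole input list
        self.done = False             # True once the subtree is exhausted


def build(lists, cap):
    """Balanced binary tree over a non-empty list of sorted input lists."""
    if len(lists) == 1:
        return Node(cap=len(lists[0]), items=lists[0])
    mid = len(lists) // 2
    return Node(cap, build(lists[:mid], cap), build(lists[mid:], cap))


def fill(node):
    """Fill node.buf until it is full or the subtree below runs dry."""
    if node.left is None:             # leaf: nothing to do, read input directly
        return
    while len(node.buf) < node.cap:
        for child in (node.left, node.right):
            if not child.buf and not child.done:
                fill(child)           # whenever one empties, refill it
                if not child.buf:
                    child.done = True
        l, r = node.left.buf, node.right.buf
        if not l and not r:
            break                     # both subtrees exhausted
        src = l if (l and (not r or l[0] <= r[0])) else r
        node.buf.append(src.popleft())


def funnel_merge(lists, cap=4):
    """Merge sorted lists by filling the root (topmost output) buffer."""
    root = build(lists, cap)
    root.cap = sum(len(x) for x in lists)
    fill(root)
    return list(root.buf)
```

For example, `funnel_merge([[1, 4, 9], [2, 3, 10], [0, 5], [6, 7, 8, 11]], cap=2)` should return the twelve values in sorted order. The sketch only demonstrates the control flow and the prefix invariant; the cache behavior of the real funnel depends on the buffer sizes and the recursive layout, which are not modeled here.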
It's actually not too hard, 1034 1:06:24.579 --> 1:06:29 but it's a pretty clever analysis. 1035 1:06:29 --> 1:06:31.898 And, to top it off, it's an amortization, 1036 1:06:31.898 --> 1:06:35.159 your favorite. OK, so we get one last practice 1037 1:06:35.159 --> 1:06:39.072 at amortized analysis in the context of cache oblivious 1038 1:06:39.072 --> 1:06:41.971 algorithms. So, this is going to be a bit 1039 1:06:41.971 --> 1:06:45.231 sophisticated. We are going to combine all the 1040 1:06:45.231 --> 1:06:48.492 ideas we've seen. The main analysis idea we've 1041 1:06:48.492 --> 1:06:52.84 seen is that we are doing this recursion in the construction, 1042 1:06:52.84 --> 1:06:55.666 and if we imagine, we take our K funnel, 1043 1:06:55.666 --> 1:06:59.507 we split it in the middle level, make a whole bunch of 1044 1:06:59.507 --> 1:07:03.202 square root of K funnels, and so on, and then we cut 1045 1:07:03.202 --> 1:07:07.188 those in the middle level, get fourth root of K funnels, 1046 1:07:07.188 --> 1:07:10.666 and so on, and so on, at some point the funnel we 1047 1:07:10.666 --> 1:07:15.816 look at fits in cache. OK, before we said it fits in a 1048 1:07:15.816 --> 1:07:17.984 block. Now, we're going to say that at 1049 1:07:17.984 --> 1:07:20.914 some point, one of these funnels will fit in cache. 1050 1:07:20.914 --> 1:07:24.253 Each of the funnels at that recursive level of detail will 1051 1:07:24.253 --> 1:07:26.656 fit in cache. We are going to analyze that 1052 1:07:26.656 --> 1:07:29 level. We'll call that level J. 1053 1:07:29 --> 1:07:37.266 So, consider the first recursive level of detail, 1054 1:07:37.266 --> 1:07:45.877 and I'll call it J, at which every J funnel we have 1055 1:07:45.877 --> 1:07:53.8 fits, let's say, not only does it fit in cache, 1056 1:07:53.8 --> 1:08:02.337 but four of them fit in cache. It fits in one quarter of the 1057 1:08:02.337 --> 1:08:05.158 cache.
OK, but we need to leave some 1058 1:08:05.158 --> 1:08:07.899 cache extra for doing other things. 1059 1:08:07.899 --> 1:08:11.607 But I want to make sure that the J funnel fits. 1060 1:08:11.607 --> 1:08:16.04 OK, now what does that mean? Well, we've analyzed space. 1061 1:08:16.04 --> 1:08:19.989 We know that the space of a J funnel is about J^2, 1062 1:08:19.989 --> 1:08:24.02 some constant times J^2. We'll call it C times J^2. 1063 1:08:24.02 --> 1:08:27.969 OK, so this is saying that C times J^2 is at most 1064 1:08:27.969 --> 1:08:32 M over 4, one quarter of the cache. 1065 1:08:32 --> 1:08:35.915 OK, that means a J funnel of that size fits in a 1066 1:08:35.915 --> 1:08:38.803 quarter of the cache. OK, at some point in the 1067 1:08:38.803 --> 1:08:41.884 recursion, we'll have this big tree of J funnels, 1068 1:08:41.884 --> 1:08:44.515 with all sorts of buffers in between them, 1069 1:08:44.515 --> 1:08:46.697 and each of the J funnels will fit. 1070 1:08:46.697 --> 1:08:49.521 So, let's think about one of those J funnels. 1071 1:08:49.521 --> 1:08:51.96 Suppose J is like the square root of K. 1072 1:08:51.96 --> 1:08:55.619 So, this is the picture because otherwise I have to draw a 1073 1:08:55.619 --> 1:08:58.314 bigger one. So, suppose this is a J funnel. 1074 1:08:58.314 --> 1:09:03 It has a bunch of input buffers, has one output buffer. 1075 1:09:03 --> 1:09:06.366 So, we just want to think about how the J funnel executes. 1076 1:09:06.366 --> 1:09:09.259 And, for a long time, as long as these buffers are 1077 1:09:09.259 --> 1:09:12.33 all full, this is just a merger. It's doing something 1078 1:09:12.33 --> 1:09:14.515 recursively, but we don't really care.
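The choice of the level J can be made concrete: starting from K, keep taking square roots until C times J^2 is at most M over 4. The following is a small sketch with made-up names (`first_fitting_level`, and the default `c = 1` for the constant C), and it uses integer square roots, so it is only exact when K is of the form 2^(2^i). The analysis itself never computes J; this is purely to make the definition tangible.

```python
import math


def first_fitting_level(k: int, m: int, c: int = 1) -> int:
    """Return the largest J among K, sqrt(K), sqrt(sqrt(K)), ... such that
    a J funnel of assumed size c * J**2 fits in a quarter of a cache of size m.
    """
    j = k
    # Stop at tiny funnels so the loop always terminates.
    while 4 * c * j * j > m and j > 2:
        j = math.isqrt(j)
    return j


if __name__ == "__main__":
    print(first_fitting_level(2 ** 16, 2 ** 20))
    print(first_fitting_level(2 ** 16, 2 ** 10))
```

With K = 2^16 this gives J = 256 for M = 2^20 and J = 16 for M = 2^10 under the assumption c = 1. In both cases the next larger level, J^2, no longer fits in a quarter of the cache, which is the fact the swap-in bound later relies on.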
1079 1:09:14.515 --> 1:09:17.468 As soon as this whole thing swaps in, and actually, 1080 1:09:17.468 --> 1:09:20.244 I should be drawing this, as soon as the funnel, 1081 1:09:20.244 --> 1:09:23.019 the output buffer, and the input buffer swap in, 1082 1:09:23.019 --> 1:09:25.677 in other words, you bring all those blocks in, 1083 1:09:25.677 --> 1:09:28.452 you can just merge, and you can go on your merry 1084 1:09:28.452 --> 1:09:33 way merging until something empties or you fill the output. 1085 1:09:33 --> 1:09:36.323 So, let's analyze that. Suppose everything is in 1086 1:09:36.323 --> 1:09:40.707 memory, because we know it fits. OK, well I have to be a little 1087 1:09:40.707 --> 1:09:43.676 bit careful. The input buffers are actually 1088 1:09:43.676 --> 1:09:48.202 pretty big in total size because the total size is K to the three 1089 1:09:48.202 --> 1:09:50.747 halves here versus K to the one half. 1090 1:09:50.747 --> 1:09:54.848 Actually, this is of size K. Let me draw a general picture. 1091 1:09:54.848 --> 1:09:57.676 We have a J funnel, because otherwise the 1092 1:09:57.676 --> 1:10:01 arithmetic is going to get messy. 1093 1:10:01 --> 1:10:04.854 We have a J funnel. Its size is C times J^2, 1094 1:10:04.854 --> 1:10:08.619 we're supposing. The number of inputs is J, 1095 1:10:08.619 --> 1:10:11.666 and the size of them is pretty big. 1096 1:10:11.666 --> 1:10:15.61 Where did we define that? We have a K funnel. 1097 1:10:15.61 --> 1:10:20.719 The total input size is K^3. So, the total input size here 1098 1:10:20.719 --> 1:10:24.663 would be J^3. We can't afford to put all that 1099 1:10:24.663 --> 1:10:27.98 in cache. That's an extra factor of J. 1100 1:10:27.98 --> 1:10:33 But, we can afford one block per input.
1102 1:10:35.035 --> 1:10:38.176 I claim that I can fit the first block of each of these 1103 1:10:38.176 --> 1:10:41.724 input arrays in cash at the same time along with the J funnel. 1104 1:10:41.724 --> 1:10:44.864 And so, for that duration, as long as all of that is in 1105 1:10:44.864 --> 1:10:48.238 cache, this thing can merge at full speed just like we were 1106 1:10:48.238 --> 1:10:51.204 doing parallel scans. You use up all the blocks down 1107 1:10:51.204 --> 1:10:54.752 here, and one of them empties. You go to the next block in the 1108 1:10:54.752 --> 1:10:57.602 input buffer and so on, just like the normal merge 1109 1:10:57.602 --> 1:11:00.859 analysis of parallel arrays, at this point we assume that 1110 1:11:00.859 --> 1:11:04 everything here is fitting in cache. 1111 1:11:04 --> 1:11:08.485 So, it's just like before. Of course, in fact, 1112 1:11:08.485 --> 1:11:13.668 it's recursive but we are analyzing it at this level. 1113 1:11:13.668 --> 1:11:19.25 OK, I need to prove that you can fit one block per input. 1114 1:11:19.25 --> 1:11:22.839 It's not hard. It's just computation. 1115 1:11:22.839 --> 1:11:28.72 And, it's basically the way that these funnels were designed 1116 1:11:28.72 --> 1:11:35 was so that you could fit one block per input buffer. 1117 1:11:35 --> 1:11:41.607 And, here's the argument. So, the claim is you can also 1118 1:11:41.607 --> 1:11:47.725 fit one memory block in the cache per input buffer. 1119 1:11:47.725 --> 1:11:52.497 So, this is in addition to one J funnel. 1120 1:11:52.497 --> 1:11:59.594 You could also fit one block for each of its input buffers. 1121 1:11:59.594 --> 1:12:06.23 OK, this is of the J funnel. It's not any funnel because 1122 1:12:06.23 --> 1:12:10.938 bigger funnels are way too big. OK, so here's how we prove 1123 1:12:10.938 --> 1:12:13.581 that. J^2 is at most a quarter M. 1124 1:12:13.581 --> 1:12:16.967 That's what we assumed here, actually CJ2. 
1125 1:12:16.967 --> 1:12:21.675 I'm not going to bother with the C because that's going to 1126 1:12:21.675 --> 1:12:25.887 make my life even harder. OK, I think this is even a 1127 1:12:25.887 --> 1:12:29.522 weaker constraint. So, the size of our funnel 1128 1:12:29.522 --> 1:12:35.11 is about J^2. That's at most a quarter of the 1129 1:12:35.11 --> 1:12:37.719 cache. That implies that J, 1130 1:12:37.719 --> 1:12:43.941 if we take square roots of both sides, is at most a half square 1131 1:12:43.941 --> 1:12:47.955 root of M. OK, also, we know that B is at 1132 1:12:47.955 --> 1:12:53.273 most square root of M because M is at least B squared. 1133 1:12:53.273 --> 1:12:58.993 So, we put these together, and we get J times B is at most 1134 1:12:58.993 --> 1:13:02.611 a half M. OK, now I claim that what we 1135 1:13:02.611 --> 1:13:05.718 are asking for here is J times B because in a J funnel, 1136 1:13:05.718 --> 1:13:08.825 there are J input arrays. And so, if you want one block 1137 1:13:08.825 --> 1:13:10.781 each, that costs a space of B each. 1138 1:13:10.781 --> 1:13:13.831 So, for each input buffer we have one block of size B, 1139 1:13:13.831 --> 1:13:16.938 and the claim is that that whole thing fits in half the 1140 1:13:16.938 --> 1:13:19.009 cache. And, we've only used a quarter 1141 1:13:19.009 --> 1:13:20.448 of the cache. So in total, 1142 1:13:20.448 --> 1:13:23.843 we use three quarters of the cache and that's all we'll use. 1143 1:13:23.843 --> 1:13:26.95 OK, so that's good news. We can also fit one more block 1144 1:13:26.95 --> 1:13:30 to the output. Not too big a deal. 1145 1:13:30 --> 1:13:33.401 So now, as long as this J funnel is running, 1146 1:13:33.401 --> 1:13:36.012 if it's all in cache, all is well. 1147 1:13:36.012 --> 1:13:39.889 What does that mean? Let me first analyze how long 1148 1:13:39.889 --> 1:13:42.895 it takes for us to swap in this funnel.
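Written out, the "one block per input buffer" argument is the following chain of inequalities. This is a restatement of the spoken arithmetic, dropping the constant C as the lecture does (which is harmless if C is at least 1) and using the tall-cache assumption M >= B^2.

```latex
\begin{align*}
J^2 \le \tfrac{1}{4} M
  &\;\Longrightarrow\; J \le \tfrac{1}{2}\sqrt{M}, \\
M \ge B^2
  &\;\Longrightarrow\; B \le \sqrt{M}, \\
\text{hence}\quad J \cdot B
  &\le \tfrac{1}{2}\sqrt{M}\cdot\sqrt{M} = \tfrac{1}{2} M, \\
\underbrace{J^2}_{\text{the $J$ funnel}}
  + \underbrace{J \cdot B}_{\text{one block per input}}
  &\le \tfrac{1}{4} M + \tfrac{1}{2} M = \tfrac{3}{4} M .
\end{align*}
```

The remaining quarter of the cache leaves room for a block of the output buffer.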
1149 1:13:42.895 --> 1:13:47.563 OK, so how long does it take for us to read all the stuff in 1150 1:13:47.563 --> 1:13:50.806 a J funnel and one block per input buffer? 1151 1:13:50.806 --> 1:13:55 That's what it would take to get started. 1152 1:13:55 --> 1:14:02.344 So, this is swapping in a J funnel, which means reading the 1153 1:14:02.344 --> 1:14:09.435 J funnel in its entirety, and reading one block per input 1154 1:14:09.435 --> 1:14:14.12 buffer. OK, the cost of the swap in is 1155 1:14:14.12 --> 1:14:19.818 pretty natural. The size of the buffer divided 1156 1:14:19.818 --> 1:14:27.542 by B, because that's just sort of a linear scan to read it in, 1157 1:14:27.542 --> 1:14:34 and we need to read one block per buffer. 1158 1:14:34 --> 1:14:38.463 These buffers could be all over the place because they're pretty 1159 1:14:38.463 --> 1:14:40.942 big. So, let's say we pay one memory 1160 1:14:40.942 --> 1:14:45.264 transfer for each input buffer just to get started to read the 1161 1:14:45.264 --> 1:14:47.318 first block. OK, the claim is, 1162 1:14:47.318 --> 1:14:50.365 and here we need to do some more arithmetic. 1163 1:14:50.365 --> 1:14:52.348 This is, at most, J^3 over B. 1164 1:14:52.348 --> 1:14:54.757 OK, why is it, at most, J^3 over B? 1165 1:14:54.757 --> 1:15:00 Well, this was the first level at which things fit in cache. 1166 1:15:00 --> 1:15:04.119 That means the next level bigger, which is J^2, 1167 1:15:04.119 --> 1:15:08.328 which has size J^4, should be bigger than cache. 1168 1:15:08.328 --> 1:15:11.552 Otherwise we would have stopped then. 1169 1:15:11.552 --> 1:15:14.686 OK, so this is just more arithmetic. 1170 1:15:14.686 --> 1:15:19.164 You can either believe me or follow the arithmetic. 1171 1:15:19.164 --> 1:15:23.731 We know that J^4 is at least M. So, this means that, 1172 1:15:23.731 --> 1:15:26.776 and we know that M is at least B^2. 
1173 1:15:26.776 --> 1:15:29.462 Therefore, J^2, instead of J^4, 1174 1:15:29.462 --> 1:15:36 we take the square root of both sides, J^2 is at least B. 1175 1:15:36 --> 1:15:39.379 OK, so certainly J^2 over B is at most J^3 over B. 1176 1:15:39.379 --> 1:15:43.379 But also J is at most J^3 over B because J^2 is at least B. 1177 1:15:43.379 --> 1:15:46.896 Hopefully that should be clear. That's just algebra. 1178 1:15:46.896 --> 1:15:50.965 OK, so we're not going to use this bound because that's kind 1179 1:15:50.965 --> 1:15:53.655 of complicated. We're just going to say, 1180 1:15:53.655 --> 1:15:56.689 well, it costs J^3 over B to 1181 1:15:56.689 --> 1:16:00 get swapped in. Now, why is J^3 over B a good thing? 1182 1:16:00 --> 1:16:03.972 Because we know the total size of inputs to the J funnel is 1183 1:16:03.972 --> 1:16:06.232 J^3. So, to read all of the inputs 1184 1:16:06.232 --> 1:16:08.424 to the J funnel takes J^3 over B. 1185 1:16:08.424 --> 1:16:12.054 So, this is really just a linear extra cost to get the 1186 1:16:12.054 --> 1:16:14.657 whole thing swapped in. It sounds good. 1187 1:16:14.657 --> 1:16:17.671 To do the merging would also cost J^3 over B. 1188 1:16:17.671 --> 1:16:21.438 So, the swap-in costs J^3 over B. To merge all these J^3 1189 1:16:21.438 --> 1:16:24.041 elements, if they were all there in the 1190 1:16:24.041 --> 1:16:28.013 inputs, it would take J^3 over B because once everything is 1191 1:16:28.013 --> 1:16:31.78 there, you're merging at full speed, B items per 1192 1:16:31.78 --> 1:16:36.859 memory transfer on average. OK, the problem is you're going 1193 1:16:36.859 --> 1:16:39.26 to swap out, which you may have imagined.
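The swap-in bound can be summarized as follows. This is a restatement of the spoken argument with constants ignored, as in the lecture: the funnel occupies about J^2 cells, there are J input buffers, and J is the first level that fits, so the next larger level (a J^2 funnel, of size about J^4) does not fit in cache.

```latex
\begin{align*}
\text{swap-in cost} &= O\!\left(\frac{J^2}{B} + J\right), \\
J^4 \ge M \ge B^2 &\;\Longrightarrow\; J^2 \ge B, \\
\frac{J^2}{B} \le \frac{J^3}{B}
  \quad\text{and}\quad
  J &= \frac{J \cdot B}{B} \le \frac{J \cdot J^2}{B} = \frac{J^3}{B}, \\
\text{so}\quad \text{swap-in cost} &= O\!\left(\frac{J^3}{B}\right).
\end{align*}
```

This matches the cost of scanning the J^3 input elements once, which is why the swap-in counts as only a linear extra cost.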
1194 1:16:39.26 --> 1:16:41.899 As soon as one of your input buffers empties, 1195 1:16:41.899 --> 1:16:45.199 let's say this one's almost gone, as soon as it empties, 1196 1:16:45.199 --> 1:16:48.439 you're going to totally obliterate that funnel and swap 1197 1:16:48.439 --> 1:16:51.38 in this one in order to merge all the stuff there, 1198 1:16:51.38 --> 1:16:54.92 and fill this buffer back up. This is where the amortization 1199 1:16:54.92 --> 1:16:56.96 comes in. And this is where the log 1200 1:16:56.96 --> 1:17:00.68 factor comes in because so far we've basically paid a linear 1201 1:17:00.68 --> 1:17:07.034 cost. We are almost done. 1202 1:17:07.034 --> 1:17:17.897 So, we charge, sorry, I'm jumping ahead of 1203 1:17:17.897 --> 1:17:26.111 myself. So, when an input buffer 1204 1:17:26.111 --> 1:17:35.169 empties, we swap out. And we recursively fill that 1205 1:17:35.169 --> 1:17:37.881 buffer. OK, I'm going to assume that 1206 1:17:37.881 --> 1:17:42.065 there is absolutely no reuse, that this recursive filling 1207 1:17:42.065 --> 1:17:46.481 completely swapped everything out and I have to start from 1208 1:17:46.481 --> 1:17:50.046 scratch for this funnel. So, when that happens, 1209 1:17:50.046 --> 1:17:53.92 I fill this buffer, and then I come back and I say, 1210 1:17:53.92 --> 1:17:58.026 well, I go swap it back in. So when the recursive call 1211 1:17:58.026 --> 1:18:01.978 finishes, I swap back in. OK, so I recursively fill, 1212 1:18:01.978 --> 1:18:08.031 and then I swap back in. And the swapping back in 1213 1:18:08.031 --> 1:18:13.012 costs J^3 over B. I'm going to charge that cost 1214 1:18:13.012 --> 1:18:16.91 to the elements that just got filled. 1215 1:18:16.91 --> 1:18:22 So this is an amortized charging argument. 1216 1:18:22 --> 1:18:48 1217 1:18:48 --> 1:18:51.322 How many are there? It's the only question.
1218 1:18:51.322 --> 1:18:54.169 It turns out, things are really good, 1219 1:18:54.169 --> 1:18:59.073 like here, for the square root of K funnel, we have each buffer 1220 1:18:59.073 --> 1:19:04.063 has size K to the three halves. OK, so this is a bit 1221 1:19:04.063 --> 1:19:08.395 complicated. But I claim that the number of 1222 1:19:08.395 --> 1:19:12.624 elements here that fill the buffer is J^3. 1223 1:19:12.624 --> 1:19:18.401 So, if you have a J funnel, each of the input buffers has 1224 1:19:18.401 --> 1:19:22.114 size J^3. It should be correct if you 1225 1:19:22.114 --> 1:19:26.137 work it out. So, we're charging this J^3 1226 1:19:26.137 --> 1:19:31.501 over B cost to J^3 elements, which sounds like you're 1227 1:19:31.501 --> 1:19:38 charging, essentially, one over B to each element. 1228 1:19:38 --> 1:19:39.951 Sounds great. That means that, 1229 1:19:39.951 --> 1:19:43.718 so you're thinking overall, I mean, there are N elements, 1230 1:19:43.718 --> 1:19:46.678 and to each one you charge a one over B cost. 1231 1:19:46.678 --> 1:19:50.11 That sounds like the total running time is N over B. 1232 1:19:50.11 --> 1:19:52.195 It's a bit too fast for sorting. 1233 1:19:52.195 --> 1:19:55.559 We lost the log factor. So, what's going on is that 1234 1:19:55.559 --> 1:20:00 we're actually charging to one element more than once. 1235 1:20:00 --> 1:20:02.729 And, this is something that we don't normally do, 1236 1:20:02.729 --> 1:20:05.913 never done it in this class, but you can do it as long as 1237 1:20:05.913 --> 1:20:08.471 you bound the number of times you charge. 1238 1:20:08.471 --> 1:20:10.916 OK, and whenever you do a charging argument, 1239 1:20:10.916 --> 1:20:13.304 you say, well, this doesn't happen too many 1240 1:20:13.304 --> 1:20:16.09 times because whenever this happens, this happens.
1241 1:20:16.09 --> 1:20:18.705 You should say, you should prove that the thing 1242 1:20:18.705 --> 1:20:21.775 that you're charging to isn't charged very 1243 1:20:21.775 --> 1:20:24.107 many times. So here, I have a quantifiable 1244 1:20:24.107 --> 1:20:26.153 thing that I'm charging to: elements. 1245 1:20:26.153 --> 1:20:29.394 So, I'm saying that for each element that happened to come 1246 1:20:29.394 --> 1:20:31.953 into this buffer, I'm going to charge it a one 1247 1:20:31.953 --> 1:20:35.992 over B cost. How many times does one element 1248 1:20:35.992 --> 1:20:38.755 get charged? Well, each time it gets charged 1249 1:20:38.755 --> 1:20:40.812 to, it's moved into a new buffer. 1250 1:20:40.812 --> 1:20:43.254 How many buffers could it move through? 1251 1:20:43.254 --> 1:20:45.632 Well, it's just going up all the time. 1252 1:20:45.632 --> 1:20:49.102 Merging always goes up. So, we start here and you go to 1253 1:20:49.102 --> 1:20:52.059 the next buffer, and you go to the next buffer. 1254 1:20:52.059 --> 1:20:55.143 The number of buffers you visit is the right log, 1255 1:20:55.143 --> 1:20:59 it turns out. I don't know which log that is. 1256 1:20:59 --> 1:21:05.199 So, the number of charges of a one over B cost to each element 1257 1:21:05.199 --> 1:21:11.196 is the number of buffers it visits, and that's a log factor. 1258 1:21:11.196 --> 1:21:17.193 That's where we get an extra log factor on the running time. 1259 1:21:17.193 --> 1:21:23.291 It is, this is the number of levels of J funnels that you can 1260 1:21:23.291 --> 1:21:26.849 visit. So, it's log K divided by log 1261 1:21:26.849 --> 1:21:33.228 J, if I got it right. OK, and we're almost done. 1262 1:21:33.228 --> 1:21:38.442 Let's wrap up a bit. Just a little bit more 1263 1:21:38.442 --> 1:21:44.278 arithmetic, unfortunately. So, log K over log J. 1264 1:21:44.278 --> 1:21:47.63 Now, J^2 is like M, roughly. 1265 1:21:47.63 --> 1:21:54.956 It might be square root of M.
But, log J is basically log M. 1266 1:21:54.956 --> 1:22:02.281 There's some constants there. So, the number of charges here 1267 1:22:02.281 --> 1:22:08.299 is theta, log K over log M. So, now this is a bit, 1268 1:22:08.299 --> 1:22:11.135 we haven't seen this in amortization necessarily, 1269 1:22:11.135 --> 1:22:14.265 but we just need to count up total amount of charging. 1270 1:22:14.265 --> 1:22:17.219 All work gets charged to somebody, except we didn't 1271 1:22:17.219 --> 1:22:20.054 charge the very initial swapping in to everybody. 1272 1:22:20.054 --> 1:22:23.244 But, every time we do some swapping in, we charge it to 1273 1:22:23.244 --> 1:22:25.075 someone. So, how many times does 1274 1:22:25.075 --> 1:22:27.97 everything get charged? Well, there are N elements. 1275 1:22:27.97 --> 1:22:31.632 Each gets charged a one over B cost, and the number of times 1276 1:22:31.632 --> 1:22:35 it gets charged is log K over log M. 1277 1:22:35 --> 1:22:39.246 So therefore, the total cost is number of 1278 1:22:39.246 --> 1:22:44.342 elements times a one over B times this log thing. 1279 1:22:44.342 --> 1:22:49.65 OK, it's actually plus K. We forgot about a plus K, 1280 1:22:49.65 --> 1:22:55.171 but that's just to get started in the very beginning, 1281 1:22:55.171 --> 1:22:58.886 and start on all of the input lists. 1282 1:22:58.886 --> 1:23:06 OK, this is an amortization analysis to prove this bound. 1283 1:23:06 --> 1:23:10.914 Sorry, what was N here? I assumed that I started out 1284 1:23:10.914 --> 1:23:14.286 with K cubed elements at the bottom. 1285 1:23:14.286 --> 1:23:19.682 The total number of elements in the bottom was theta K^3. 1286 1:23:19.682 --> 1:23:23.343 OK, so I should have written K^3, not N. 1287 1:23:23.343 --> 1:23:28.835 This should be almost the same as this, OK, but not quite.
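Putting the charging argument together, the total count can be written as below. This is a summary of what was just said, with constants suppressed: the total input is theta K^3 elements, each charge costs one over B, each element is charged once per J-funnel level it passes through, and log J is theta log M because J lies roughly between the fourth root and the square root of M.

```latex
\begin{align*}
\#\text{charges per element}
  &= \Theta\!\left(\frac{\log K}{\log J}\right)
   = \Theta\!\left(\frac{\log K}{\log M}\right)
   = \Theta\!\left(\log_M K\right), \\
\text{total memory transfers}
  &= O\!\left(K^3 \cdot \frac{1}{B} \cdot \log_M K \;+\; K\right),
\end{align*}
```

where the additive K accounts for touching the first block of each of the K input lists at the very start.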
1288 1:23:28.835 --> 1:23:34.039 This is log base M of K, and if you do a little bit of 1289 1:23:34.039 --> 1:23:39.82 arithmetic, this should be K^3 over B times log base M over B 1290 1:23:39.82 --> 1:23:45.747 of K over B plus K. That's what I want to prove. 1291 1:23:45.747 --> 1:23:49.867 Actually there's a K^3 here instead of a K, 1292 1:23:49.867 --> 1:23:53.105 but that's just a factor of three. 1293 1:23:53.105 --> 1:23:58.6 And this follows because we assume we are not in the base 1294 1:23:58.6 --> 1:24:01.052 case. So, K is at least M, 1295 1:24:01.052 --> 1:24:06.252 which is at least B^2, and therefore K over B is omega 1296 1:24:06.252 --> 1:24:10.716 square root of K. OK, so K over B is basically 1297 1:24:10.716 --> 1:24:13.045 the same as K when you put it in a log. 1298 1:24:13.045 --> 1:24:16.354 So here we have log base M. I turned it into log base M 1299 1:24:16.354 --> 1:24:17.887 over B. That's even worse. 1300 1:24:17.887 --> 1:24:20.277 It doesn't matter. And, I have log of K. 1301 1:24:20.277 --> 1:24:23.525 I replaced it with K over B, but K over B is basically 1302 1:24:23.525 --> 1:24:25.303 square root of K. So in a log, 1303 1:24:25.303 --> 1:24:30.261 that's just a factor of a half. So that concludes the analysis 1304 1:24:30.261 --> 1:24:33.654 of the funnel. We get this crazy running time, 1305 1:24:33.654 --> 1:24:37.424 which is basically sorting bound plus a little bit. 1306 1:24:37.424 --> 1:24:40.817 We plug that into our funnel sort, and we get, 1307 1:24:40.817 --> 1:24:44.964 magically, optimal cache oblivious sorting just in time. 1308 1:24:44.964 --> 1:24:48.809 Tuesday is the final. The final is more in the style 1309 1:24:48.809 --> 1:24:53.107 of quiz one, so not too much creativity, mostly mastery of 1310 1:24:53.107 --> 1:24:55.369 material. It covers everything. 1311 1:24:55.369 --> 1:24:59.591 You don't have to worry about the details of funnel sort, 1312 1:24:59.591 --> 1:25:03.285 but everything else.
So it's like quiz one but for 1313 1:25:03.285 --> 1:25:07.664 the entire class. It's three hours long, 1314 1:25:07.664 --> 1:25:10.766 and good luck. It's been a pleasure having 1315 1:25:10.766 --> 1:25:14.247 you, all the students. I'm sure Charles agrees, 1316 1:25:14.247 --> 1:25:17 so thanks everyone. It was a lot of fun.