1 00:00:07 --> 00:00:08 -- week of 6.046. Woohoo! 2 00:00:08 --> 00:00:13 The topic of this final week, among our advanced topics, 3 00:00:13 --> 00:00:18 is cache oblivious algorithms. This is a particularly fun 4 00:00:18 --> 00:00:22 area, one dear to my heart because I've done a lot of 5 00:00:22 --> 00:00:26 research in this area. This is an area co-founded by 6 00:00:26 --> 00:00:29 Professor Leiserson. So, in fact, 7 00:00:29 --> 00:00:34 the first context in which I met Professor Leiserson was him 8 00:00:34 --> 00:00:38 giving a talk about cache oblivious algorithms at WADS '99 9 00:00:38 --> 00:00:41 in Vancouver I think. Yeah, that has to be an odd 10 00:00:41 --> 00:00:44 year. So, I learned about cache 11 00:00:44 --> 00:00:48 oblivious algorithms then, started working in the area, 12 00:00:48 --> 00:00:50 and it's been a fun place to play. 13 00:00:50 --> 00:00:54 But this topic in some sense was also developed in the 14 00:00:54 --> 00:00:58 context of this class. I think there was one semester, 15 00:00:58 --> 00:01:02 probably also '98-'99 where all of the problem sets were about 16 00:01:02 --> 00:01:07 cache oblivious algorithms. And they were, 17 00:01:07 --> 00:01:10 in particular, working out the research ideas 18 00:01:10 --> 00:01:13 at the same time. So, it must have been a fun 19 00:01:13 --> 00:01:15 semester. We considered doing that this 20 00:01:15 --> 00:01:18 semester, but we kept it simple. 21 00:01:18 --> 00:01:23 We know a lot more about cache oblivious algorithms by now as 22 00:01:23 --> 00:01:26 you might expect. Right, I think that's all the 23 00:01:26 --> 00:01:29 setting. I mean, it was kind of 24 00:01:29 --> 00:01:33 developed also with a bunch of MIT students in particular, 25 00:01:33 --> 00:01:35 M.Eng. student, Harald Prokop. 26 00:01:35 --> 00:01:36 It was his M.Eng. thesis. 27 00:01:36 --> 00:01:39 Those are all the citations I will give for now.
28 00:01:39 --> 00:01:43 I haven't posted yet, but there are some lecture 29 00:01:43 --> 00:01:45 notes that are already on my webpage. 30 00:01:45 --> 00:01:49 But I will link to them from the course website that gives 31 00:01:49 --> 00:01:53 all the references for all the results I'll be talking about. 32 00:01:53 --> 00:01:56 They've all been done in the last five years or so, 33 00:01:56 --> 00:01:59 in particular, starting in '99 when the first 34 00:01:59 --> 00:02:03 paper was published. But I won't give the specific 35 00:02:03 --> 00:02:08 citations in lecture. And, this topic is related to 36 00:02:08 --> 00:02:11 the topic of last week, multithreaded algorithms, 37 00:02:11 --> 00:02:14 although at a somewhat high level. 38 00:02:14 --> 00:02:18 And then it's also dealing with parallelism in modern machines. 39 00:02:18 --> 00:02:22 And we've had throughout all of these last two lectures, 40 00:02:22 --> 00:02:26 we've had this very simple model of a computer where we 41 00:02:26 --> 00:02:30 have random access. You can access memory at a cost 42 00:02:30 --> 00:02:33 of one. You can read and write a word 43 00:02:33 --> 00:02:36 of memory. There is some details on how 44 00:02:36 --> 00:02:39 big a word can be and whatnot. It's pretty basic, 45 00:02:39 --> 00:02:41 simple, flat model. And, at the multithreaded 46 00:02:41 --> 00:02:45 algorithm is the idea that, well, maybe you have multiple 47 00:02:45 --> 00:02:48 threads of computation running at once, but you still have this 48 00:02:48 --> 00:02:51 very flat memory. Everyone can access anything in 49 00:02:51 --> 00:02:54 memory at a constant cost. We're going to change that 50 00:02:54 --> 00:02:58 model now. And we are going to realize 51 00:02:58 --> 00:03:03 that a real machine, the memory of a real machine is 52 00:03:03 --> 00:03:06 some hierarchy. 
You have some CPU, 53 00:03:06 --> 00:03:10 you have some cache, probably on the same chip, 54 00:03:10 --> 00:03:14 level 1 cache, you have some level 2 cache, 55 00:03:14 --> 00:03:18 if you're lucky, maybe you have some level 3 56 00:03:18 --> 00:03:21 cache, before you get to main memory. 57 00:03:21 --> 00:03:26 And then, you probably have a really big disk and probably 58 00:03:26 --> 00:03:31 there's even some cache out here, but I won't even think 59 00:03:31 --> 00:03:35 about that. So, the point is, 60 00:03:35 --> 00:03:38 you have lots of different levels of memory and what's 61 00:03:38 --> 00:03:42 changing here is that things that are very close to the CPU 62 00:03:42 --> 00:03:46 are very fast to access. Usually level 1 cache you can 63 00:03:46 --> 00:03:48 access in one clock cycle or a few. 64 00:03:48 --> 00:03:50 And then, things get slower and slower. 65 00:03:50 --> 00:03:54 Memory still costs like 70 ns or so to access a chunk out of. 66 00:03:54 --> 00:03:57 And that's a long time. 70 ns is, of course, 67 00:03:57 --> 00:04:01 a very long time. So, as we go out here, 68 00:04:01 --> 00:04:04 we get slower. But we also get bigger. 69 00:04:04 --> 00:04:07 I mean, if we could put everything in level 1 cache, 70 00:04:07 --> 00:04:11 the problem would be solved. That would be a flat 71 00:04:11 --> 00:04:13 memory. Accessing everything in here, 72 00:04:13 --> 00:04:16 we assumed takes the same amount of time. 73 00:04:16 --> 00:04:18 But usually, we can't afford, 74 00:04:18 --> 00:04:22 it's not even possible to put everything in level 1 cache. 75 00:04:22 --> 00:04:26 I mean, there's a reason why there is a memory hierarchy. 76 00:04:26 --> 00:04:32 Does anyone have a suggestion on what that reason might be? 77 00:04:32 --> 00:04:35 It's like one of these limits in life. 78 00:04:35 --> 00:04:37 Yeah? Fast memory is expensive.
79 00:04:37 --> 00:04:40 That's the practical limitation indeed, 80 00:04:40 --> 00:04:45 that you could try to build more and more level 1 cache 81 00:04:45 --> 00:04:48 and maybe you could try to, well, yeah. 82 00:04:48 --> 00:04:51 Expense is a good reason, and practically, 83 00:04:51 --> 00:04:55 that's maybe why the sizes are what they are. 84 00:04:55 --> 00:05:01 But suppose really fast memory were really cheap. 85 00:05:01 --> 00:05:04 There is a physical limitation of what's going on, 86 00:05:04 --> 00:05:05 yeah? The speed of light. 87 00:05:05 --> 00:05:08 Yeah, that's a bit of a problem, right? 88 00:05:08 --> 00:05:11 No matter how much, let's suppose you can only fit 89 00:05:11 --> 00:05:15 so many bits in an atom. You can only fit so many bits 90 00:05:15 --> 00:05:18 in a particular amount of space. If you want more bits, 91 00:05:18 --> 00:05:22 then you need more space, and the more space you have, 92 00:05:22 --> 00:05:25 the longer it's going to take for a round-trip. 93 00:05:25 --> 00:05:28 So, if you assume your CPU is like this point in space, 94 00:05:28 --> 00:05:32 so it's relatively small and it has to get the data in, 95 00:05:32 --> 00:05:37 the bigger the data, the farther it has to be away. 96 00:05:37 --> 00:05:40 But, you can have these cores around the CPU that are, 97 00:05:40 --> 00:05:44 we don't usually live in 3-D, and chips are usually in 2-D, 98 00:05:44 --> 00:05:46 but never mind. You can have the sphere that's 99 00:05:46 --> 00:05:49 closer to the CPU that's a lot faster to access. 100 00:05:49 --> 00:05:52 And as you get further away it costs more. 101 00:05:52 --> 00:05:55 And that's essentially what this model is representing, 102 00:05:55 --> 00:05:59 although it's a bit approximated from the intrinsic 103 00:05:59 --> 00:06:02 physics and geometry and whatnot. 104 00:06:02 --> 00:06:05 But that's the idea.
The latency, 105 00:06:05 --> 00:06:11 the round-trip time to get some of this memory has to be big. 106 00:06:11 --> 00:06:17 In general, the cost to access memory is made up of two things. 107 00:06:17 --> 00:06:21 There's the latency, the round-trip time, 108 00:06:21 --> 00:06:26 which in particular is limited by the speed of light. 109 00:06:26 --> 00:06:32 And, plus the round-trip time, you also have to get the data 110 00:06:32 --> 00:06:36 out. And depending on how much data 111 00:06:36 --> 00:06:40 you want, it could take longer. OK, so there's something. 112 00:06:40 --> 00:06:42 There could be, get this right, 113 00:06:42 --> 00:06:46 let's say, the amount of data divided by the bandwidth. 114 00:06:46 --> 00:06:51 OK, the bandwidth is at what rate can you get the data out? 115 00:06:51 --> 00:06:54 And if you look at the bandwidth of these various 116 00:06:54 --> 00:06:58 levels of memory, it's all pretty much the same. 117 00:06:58 --> 00:07:02 If you have a well-designed computer the bandwidths should 118 00:07:02 --> 00:07:07 all be the same. OK, you can still get data 119 00:07:07 --> 00:07:08 off disk really, really fast, 120 00:07:08 --> 00:07:13 usually at about the speed of your bus, and the bus gets 121 00:07:13 --> 00:07:16 to the CPU hopefully as fast as everything else. 122 00:07:16 --> 00:07:20 So, even though they're slower, they're really only slower in 123 00:07:20 --> 00:07:23 terms of latency. And so, this part is maybe 124 00:07:23 --> 00:07:26 reasonable. The bandwidth looks pretty much 125 00:07:26 --> 00:07:29 the same universally. It's the latency that's going 126 00:07:29 --> 00:07:32 up. So, if the latency is going up 127 00:07:32 --> 00:07:36 but we still get to divide by the same amount of bandwidth, 128 00:07:36 --> 00:07:40 what should we do to make the access cost at all these levels 129 00:07:40 --> 00:07:45 about the same? This is fixed.
130 00:07:45 --> 00:07:53 Let's say this is increasing, but this is still staying big. 131 00:07:53 --> 00:07:59 What could we do to balance this formula? 132 00:07:59 --> 00:08:05 Change the amount. As the latency goes up, 133 00:08:05 --> 00:08:10 if we increase the amount, then the amortized cost to 134 00:08:10 --> 00:08:16 access one element will go down. So, this is amortization in a 135 00:08:16 --> 00:08:21 very simple sense. So, this was to access a whole 136 00:08:21 --> 00:08:26 block, let's say, and this amount was the size of 137 00:08:26 --> 00:08:30 the block. So, the amortized cost, 138 00:08:30 --> 00:08:36 then, to access one element is going to be the latency divided 139 00:08:36 --> 00:08:41 by the size of the block, the amount, plus one over the 140 00:08:41 --> 00:08:45 bandwidth. OK, so this is what you should 141 00:08:45 --> 00:08:49 implicitly be thinking in your head. 142 00:08:49 --> 00:08:55 So, I'm just dividing here by the amount because the amount 143 00:08:55 --> 00:09:02 is how many elements you get in one access, let's suppose. 144 00:09:02 --> 00:09:04 OK, so we get this formula for the amortized cost. 145 00:09:04 --> 00:09:08 The one over bandwidth is going to be good no matter what level 146 00:09:08 --> 00:09:11 we are on, I claim. There's no real fundamental 147 00:09:11 --> 00:09:14 limitation there except it might be expensive. 148 00:09:14 --> 00:09:17 And the latency we get to amortize out by the amount, 149 00:09:17 --> 00:09:21 so whatever the latency is, as the latency gets bigger out 150 00:09:21 --> 00:09:24 here, we just get more and more stuff and then we make these two 151 00:09:24 --> 00:09:27 terms equal, let's say. That would be a good way to 152 00:09:27 --> 00:09:30 balance things. So, in particular, 153 00:09:30 --> 00:09:34 disk has a really high latency.
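The amortized cost formula just described (latency divided by the block size, plus one over the bandwidth) can be sketched in a few lines of Python. The function names and the numbers below are illustrative assumptions, not values from the lecture; the point is only that picking the block size equal to latency times bandwidth makes the two terms equal.

```python
def amortized_cost_per_element(latency, bandwidth, block_size):
    """Amortized cost of one element when a whole block is fetched.

    Fetching the block costs latency + block_size / bandwidth, so the
    per-element cost is latency / block_size + 1 / bandwidth.
    """
    return latency / block_size + 1.0 / bandwidth


def balancing_block_size(latency, bandwidth):
    """Block size at which latency / B equals 1 / bandwidth."""
    return latency * bandwidth


# Hypothetical numbers (assumptions, not measurements):
# latency in nanoseconds, bandwidth in elements per nanosecond.
latency_ns = 1000.0
bandwidth = 2.0

b = balancing_block_size(latency_ns, bandwidth)
print(b)                                                    # block size that balances the terms
print(amortized_cost_per_element(latency_ns, bandwidth, b)) # both terms contribute equally
print(amortized_cost_per_element(latency_ns, bandwidth, 1)) # one element per fetch: latency dominates
```

With these made-up numbers, a level whose latency is ten times larger would need a block ten times larger to keep the same amortized cost, which is the balancing argument made above.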
Not only are there speed of 154 00:09:34 --> 00:09:37 light issues here, but there's actually the speed 155 00:09:37 --> 00:09:39 of the head moving on the tracks of the disk. 156 00:09:39 --> 00:09:42 That takes a long time. There's a physical motion. 157 00:09:42 --> 00:09:45 Everything else here doesn't usually have physical motion. 158 00:09:45 --> 00:09:47 It's just electric. So, this is really, 159 00:09:47 --> 00:09:51 really slow in latency, so when you read something out 160 00:09:51 --> 00:09:54 of disk, you might as well read a lot of data from disk, 161 00:09:54 --> 00:09:57 like a megabyte or so. It's probably even old these 162 00:09:57 --> 00:09:58 days. Maybe you read multiple 163 00:09:58 --> 00:10:02 megabytes when you read anything from disk if you want these to 164 00:10:02 --> 00:10:06 be matched. OK, there's a bit of a problem 165 00:10:06 --> 00:10:10 with doing that. Any suggestions what the 166 00:10:10 --> 00:10:14 problem would be? So, you have this algorithm. 167 00:10:14 --> 00:10:17 And, whenever it reads something off of disk, 168 00:10:17 --> 00:10:22 it reads an entire megabyte of stuff around the element it 169 00:10:22 --> 00:10:26 asked for. So the amortized cost to access 170 00:10:26 --> 00:10:31 is going to be reasonable, but that's actually sort of 171 00:10:31 --> 00:10:34 assuming something. Yeah? 172 00:10:34 --> 00:10:38 Right. I'm assuming I'm ever going to 173 00:10:38 --> 00:10:43 use the rest of that data. If I'm going to read 10 MB 174 00:10:43 --> 00:10:49 around the one element that I asked for, I access A bracket I, 175 00:10:49 --> 00:10:55 and I get 10 million items from A around I, it would be kind of 176 00:10:55 --> 00:11:00 good if the algorithm actually used that data for something. 177 00:11:00 --> 00:11:06 It seems reasonable. So, this would be spatial 178 00:11:06 --> 00:11:08 locality.
So, we want, 179 00:11:08 --> 00:11:15 I mean the goal of this world in cache oblivious algorithms 180 00:11:15 --> 00:11:20 and cache efficient algorithms in general is you want 181 00:11:20 --> 00:11:26 algorithms that perform well when this is happening. 182 00:11:26 --> 00:11:31 So, this is the idea of blocking. 183 00:11:31 --> 00:11:36 And we want the algorithm to use all or at least most of the 184 00:11:36 --> 00:11:41 elements in a block, a consecutive chunk of memory. 185 00:11:41 --> 00:11:45 So, this is spatial locality. 186 00:11:45 --> 00:11:55 187 00:11:55 --> 00:11:57 Ideally, we'd use all of them right then. 188 00:11:57 --> 00:11:59 But I mean, depending on your algorithm, that's a little bit 189 00:11:59 --> 00:12:01 tricky. There is another issue, 190 00:12:01 --> 00:12:03 though. So, you read in your thing 191 00:12:03 --> 00:12:05 into, read your 10 MB into main memory, let's say, 192 00:12:05 --> 00:12:07 and your memory, let's say, is at least, 193 00:12:07 --> 00:12:10 these days you should have a 4 GB memory or something. 194 00:12:10 --> 00:12:13 So, you could read and actually a lot of different blocks into 195 00:12:13 --> 00:12:15 main memory. What you'd like is that you can 196 00:12:15 --> 00:12:17 use those blocks for as long as possible. 197 00:12:17 --> 00:12:20 Maybe you don't even use them. If you have a linear time 198 00:12:20 --> 00:12:23 algorithm, you're probably only going to visit each element a 199 00:12:23 --> 00:12:25 constant number of times. So, this is enough. 200 00:12:25 --> 00:12:27 But if your algorithm is more than linear time, 201 00:12:27 --> 00:12:32 you're going to be accessing elements more than once. 202 00:12:32 --> 00:12:37 So, it would be a good idea not only to use all the elements of 203 00:12:37 --> 00:12:43 the blocks, but use them as many times as you can before you have 204 00:12:43 --> 00:12:47 to throw the block out. That's temporal locality. 
205 00:12:47 --> 00:12:52 So ideally, you even reuse blocks as much as possible. 206 00:12:52 --> 00:12:55 So, I mean, we have all these caches. 207 00:12:55 --> 00:13:01 So, I didn't write this word. Just in case I don't know how 208 00:13:01 --> 00:13:07 to spell it, it's not the money. We should use those caches for 209 00:13:07 --> 00:13:09 something. I mean, the fact that they 210 00:13:09 --> 00:13:13 store more than one block, each cache can store several 211 00:13:13 --> 00:13:14 blocks. How many? 212 00:13:14 --> 00:13:17 Well, we'll get to that in a second. 213 00:13:17 --> 00:13:20 OK, so this is the general motivation, but at this point 214 00:13:20 --> 00:13:23 the model is still pretty damn ugly. 215 00:13:23 --> 00:13:27 If you wanted to design an algorithm that runs well on this 216 00:13:27 --> 00:13:30 kind of machine directly, it's possible but pretty 217 00:13:30 --> 00:13:34 difficult, and essentially never done, let's say, 218 00:13:34 --> 00:13:39 even though this is what real machines look like. 219 00:13:39 --> 00:13:42 At least in theory, and pretty much in practice, 220 00:13:42 --> 00:13:47 the main thing to think about is two levels at a time. 221 00:13:47 --> 00:13:51 So, this is a simplification where we can say a lot more 222 00:13:51 --> 00:13:55 about algorithms, a simplification over this 223 00:13:55 --> 00:13:57 model. So, in this model, 224 00:13:57 --> 00:14:01 each of these levels has different block sizes, 225 00:14:01 --> 00:14:06 and a different total sizes, it's a mess to deal with and 226 00:14:06 --> 00:14:10 design algorithms for. If you just think about two 227 00:14:10 --> 00:14:17 levels, it's relatively easy. So, we have our CPU which we 228 00:14:17 --> 00:14:22 assume has a constant number of registers only. 229 00:14:22 --> 00:14:27 So, you know, once it has a couple of data 230 00:14:27 --> 00:14:31 items, you can add them and whatnot. 231 00:14:31 --> 00:14:35 Then we have this really fast pipe. 
232 00:14:35 --> 00:14:41 So, I draw it thick to some cache. 233 00:14:41 --> 00:14:49 So this is cache. And, we have a relatively 234 00:14:49 --> 00:14:58 narrow pipe to some really big other storage, 235 00:14:58 --> 00:15:06 which I will call main memory. So, I mean, that's the general 236 00:15:06 --> 00:15:09 picture. Now, this could represent any 237 00:15:09 --> 00:15:12 two of these levels. It could be between L3 cache 238 00:15:12 --> 00:15:14 and main memory. That's maybe 239 00:15:14 --> 00:15:16 what the naming corresponds to best. 240 00:15:16 --> 00:15:20 Or cache could in fact be main memory, what we consider the RAM 241 00:15:20 --> 00:15:23 of the machine, and what's called memory over 242 00:15:23 --> 00:15:26 there to be the disk. It's whatever you care about. 243 00:15:26 --> 00:15:28 And usually, if you have a program, 244 00:15:28 --> 00:15:34 we usually assume everything fits in main memory. 245 00:15:34 --> 00:15:36 Then you care about the caching behavior. 246 00:15:36 --> 00:15:39 So you probably look between these two levels. 247 00:15:39 --> 00:15:42 That's probably what matters the most in a program because 248 00:15:42 --> 00:15:46 the cost differential here is really big relative to the cost 249 00:15:46 --> 00:15:49 differential here. If your data doesn't even fit 250 00:15:49 --> 00:15:51 in main memory, and you have to go to disk, 251 00:15:51 --> 00:15:54 then you really care about this level because the cost 252 00:15:54 --> 00:15:57 differential here is huge. It's like six orders of 253 00:15:57 --> 00:16:00 magnitude, let's say. So, in practice you may think 254 00:16:00 --> 00:16:05 of just two memory levels that are the most relevant. 255 00:16:05 --> 00:16:09 OK, now I'm going to define some parameters. 256 00:16:09 --> 00:16:14 I'm going to call them cache and main memory just for clarity 257 00:16:14 --> 00:16:20 because I like to think of main memory just the way it used to 258 00:16:20 --> 00:16:23 be.
And now all we have to worry 259 00:16:23 --> 00:16:26 about is this extra thing called cache. 260 00:16:26 --> 00:16:31 It has some bounded size, and there's a block size. 261 00:16:31 --> 00:16:36 The block size is B, and the number of blocks is M 262 00:16:36 --> 00:16:41 over B. So, the total size of the cache 263 00:16:41 --> 00:16:44 is M. OK, main memory is also blocked 264 00:16:44 --> 00:16:49 into blocks of size B. And we assume that it has 265 00:16:49 --> 00:16:55 essentially infinite size. We don't care about its size in 266 00:16:55 --> 00:16:59 this picture. It's whatever is big enough to 267 00:16:59 --> 00:17:04 hold the size of your algorithm, or data structure, 268 00:17:04 --> 00:17:09 or whatever. OK, so that's the general 269 00:17:09 --> 00:17:11 model. And for strange, 270 00:17:11 --> 00:17:15 historical reasons, which I don't want to get into, 271 00:17:15 --> 00:17:20 these things are called capital M and capital B. 272 00:17:20 --> 00:17:25 Even though M sounds a lot like memory, it's really for cache, 273 00:17:25 --> 00:17:29 and don't ask. OK, this is to preserve 274 00:17:29 --> 00:17:32 history. OK, now what do we do with this 275 00:17:32 --> 00:17:34 model? It seems nice, 276 00:17:34 --> 00:17:36 but now what do we measure about it? 277 00:17:36 --> 00:17:39 What I'm going to assume is that the cache is really fast. 278 00:17:39 --> 00:17:43 So the CPU can access cache essentially instantaneously. 279 00:17:43 --> 00:17:46 I still have to pay for the computation that the CPU is 280 00:17:46 --> 00:17:50 doing, but I'm assuming cache is close enough that I don't care. 281 00:17:50 --> 00:17:54 And that main memory is so big that it has to be far away, 282 00:17:54 --> 00:17:56 and therefore, this pipe is a problem. 283 00:17:56 --> 00:17:59 I mean, what I should really draw is that pipe is still 284 00:17:59 --> 00:18:04 thick, but is really long. So, the latency is high. 285 00:18:04 --> 00:18:07 The bandwidth is still high.
OK, and all transfers here 286 00:18:07 --> 00:18:10 happen as blocks. So, when you don't have 287 00:18:10 --> 00:18:12 something, so the idea is CPU asks for A of I, 288 00:18:12 --> 00:18:15 asks for something in memory, if it's in the cache, 289 00:18:15 --> 00:18:17 it gets it. That's free. 290 00:18:17 --> 00:18:21 Otherwise, it has to grab the entire block containing that 291 00:18:21 --> 00:18:23 element from main memory, brings it into cache, 292 00:18:23 --> 00:18:26 maybe kicks somebody out if the cache was full, 293 00:18:26 --> 00:18:29 and then the CPU can use that data and keep going. 294 00:18:29 --> 00:18:33 Until it accesses something else that's not in cache, 295 00:18:33 --> 00:18:37 then it has to grab it from main memory. 296 00:18:37 --> 00:18:43 When you kick something out, you're actually writing back to 297 00:18:43 --> 00:18:46 memory. That's the model. 298 00:18:46 --> 00:18:51 So, we suppose the accesses to cache are free. 299 00:18:51 --> 00:18:56 But we can still think about the running time of the 300 00:18:56 --> 00:19:01 algorithm. I'm not going to change the 301 00:19:01 --> 00:19:05 definition of running time. This would be the computation 302 00:19:05 --> 00:19:10 time, or the work if you want to use multithreaded lingo, 303 00:19:10 --> 00:19:13 computation time. OK, so we still have time, 304 00:19:13 --> 00:19:17 and T of N will still mean what it did before. 305 00:19:17 --> 00:19:22 This is just an extra level of refinement of understanding of 306 00:19:22 --> 00:19:24 what's going on. Essentially, 307 00:19:24 --> 00:19:29 measuring the parallelism that we can exploit out of the memory 308 00:19:29 --> 00:19:34 system, that when you access something you actually get B 309 00:19:34 --> 00:19:39 items. So, this is the old stuff. 310 00:19:39 --> 00:19:47 Now, what I want to do is count memory transfers.
311 00:19:47 --> 00:19:56 These are transfers of blocks, so I should say block memory 312 00:19:56 --> 00:20:04 transfers between the two levels, so, between the cache 313 00:20:04 --> 00:20:12 and main memory. So, memory transfers are either 314 00:20:12 --> 00:20:19 reads or writes. Maybe I should say that. 315 00:20:19 --> 00:20:29 These are number of block reads and writes from and to the main 316 00:20:29 --> 00:20:33 memory. OK, so I'm going to introduce 317 00:20:33 --> 00:20:35 some notation. This is new notation, 318 00:20:35 --> 00:20:40 so we'll see how it works out. MT of N I want to represent the 319 00:20:40 --> 00:20:44 number of memory transfers instead of just normal time for 320 00:20:44 --> 00:20:49 a problem of size N. Really, this is a function that 321 00:20:49 --> 00:20:52 depends not only on N but also on these parameters, 322 00:20:52 --> 00:20:56 B and M, in our model. So, this is what it should be, 323 00:20:56 --> 00:21:00 MT_B,M(N), but that's obviously pretty messy, 324 00:21:00 --> 00:21:04 so I'm going to stick to MT of N. 325 00:21:04 --> 00:21:07 But this will always, because mainly I care about the 326 00:21:07 --> 00:21:09 growth in terms of N. Well, I care about the growth 327 00:21:09 --> 00:21:12 in terms of all things, but the only thing I could 328 00:21:12 --> 00:21:14 change is N. So, most of the time I only 329 00:21:14 --> 00:21:17 think about, like when we are writing recurrences, 330 00:21:17 --> 00:21:20 only N is changing. I can't recurse on the block 331 00:21:20 --> 00:21:22 size. I can't recurse on the size of 332 00:21:22 --> 00:21:24 cache. Those are given to me. 333 00:21:24 --> 00:21:26 They're fixed. OK, so we'll be changing N 334 00:21:26 --> 00:21:28 mainly. But B and M always matter here. 335 00:21:28 --> 00:21:31 They're not constants. They're parameters of the 336 00:21:31 --> 00:21:34 model. OK, easy enough.
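To make the quantity MT(N) concrete, here is a minimal sketch of the two-level model as a toy simulator that counts block reads from main memory. The class name `TwoLevelMemory`, the helper `count_transfers`, and the parameter values are all hypothetical. Two simplifications are assumptions of this sketch rather than part of the model as stated: eviction is first-in-first-out, and write-backs are not counted.

```python
from collections import OrderedDict


class TwoLevelMemory:
    """Toy cache of M // B blocks in front of a blocked main memory.

    Assumptions of this sketch: FIFO eviction, and only block reads
    are counted (write-backs are ignored).
    """

    def __init__(self, B, M):
        self.B = B
        self.capacity = M // B      # number of blocks the cache can hold
        self.cache = OrderedDict()  # block index -> None, in arrival order
        self.transfers = 0          # block reads from main memory

    def access(self, address):
        block = address // self.B   # block containing this address
        if block in self.cache:
            return                  # cache hit: free in this model
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)  # evict the oldest block
        self.cache[block] = None
        self.transfers += 1


def count_transfers(addresses, B, M):
    """Run an access sequence through the toy model and return MT."""
    mem = TwoLevelMemory(B, M)
    for a in addresses:
        mem.access(a)
    return mem.transfers


N, B, M = 1024, 16, 64
sequential = range(N)
# Column-major walk over an (N // B) x B row-major layout:
# consecutive accesses land in different blocks.
strided = [row * B + col for col in range(B) for row in range(N // B)]

print(count_transfers(sequential, B, M))  # about N / B transfers
print(count_transfers(strided, B, M))     # about N transfers
```

Both sequences touch the same N elements, so T(N) is the same; only the number of memory transfers differs, and it depends on B and M as well as N.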
337 00:21:34 --> 00:21:39 This is something called the disk access model, 338 00:21:39 --> 00:21:44 or, if you like, the DAM model, or the external memory model, 339 00:21:44 --> 00:21:50 or the cache aware model. Maybe I should mention that; 340 00:21:50 --> 00:21:55 this is the cache aware. In general, you have some 341 00:21:55 --> 00:22:01 algorithm that runs on this kind of model, machine model. 342 00:22:01 --> 00:22:07 That's a cache aware algorithm. OK, we're not too interested in 343 00:22:07 --> 00:22:10 cache aware algorithms. We've seen one, 344 00:22:10 --> 00:22:12 B-trees. B-trees are a cache aware data 345 00:22:12 --> 00:22:14 structure. You assume that there is some 346 00:22:14 --> 00:22:15 block size, B, underlying. 347 00:22:15 --> 00:22:18 Maybe you didn't see exactly this model. 348 00:22:18 --> 00:22:20 In particular, it didn't really matter how big 349 00:22:20 --> 00:22:23 the cache was because you just wanted to know: 350 00:22:23 --> 00:22:26 when I read B items, I can use all of them as much 351 00:22:26 --> 00:22:29 as possible and figure out where I fit among those B items, 352 00:22:29 --> 00:22:32 and that gives me log base B of N memory transfers instead of 353 00:22:32 --> 00:22:36 log N, which is what you would get if you just used your favorite 354 00:22:36 --> 00:22:41 balanced binary search tree. So, log base B of N is 355 00:22:41 --> 00:22:46 definitely better than log base 2 of N. 356 00:22:46 --> 00:22:51 B-trees are a cache aware algorithm. 357 00:22:51 --> 00:22:58 OK, what we would like to do today and next lecture is get 358 00:22:58 --> 00:23:06 cache oblivious algorithms. So, there's essentially only 359 00:23:06 --> 00:23:12 one difference between cache aware algorithms and cache 360 00:23:12 --> 00:23:18 oblivious algorithms. In cache oblivious algorithms, 361 00:23:18 --> 00:23:22 the algorithm doesn't know what B and M are. 362 00:23:22 --> 00:23:30 So this is a bit of a subtle point, but very cool idea.
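The gap between log base B of N and log base 2 of N for searching can be checked with a small calculation. This is an idealized sketch: it assumes each B-tree node has fanout exactly B and costs one memory transfer, and that each step of a binary search tree costs one transfer, ignoring constant factors. The function name `levels` and the sample values are hypothetical.

```python
def levels(n, fanout):
    """Smallest h with fanout**h >= n, computed with integers only.

    This is the number of levels a search descends in a perfectly
    balanced tree with the given fanout over n keys.
    """
    h, reach = 0, 1
    while reach < n:
        reach *= fanout
        h += 1
    return h


n = 10**9   # hypothetical number of keys
B = 1024    # hypothetical block size, in keys

print(levels(n, 2))  # about log base 2 of n levels for a binary tree
print(levels(n, B))  # about log base B of n levels for a B-tree
```

The ratio between the two counts is roughly log base 2 of B, which is why a larger block size helps a cache aware search structure.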
363 00:23:30 --> 00:23:32 You assume that this is the model of the machine, 364 00:23:32 --> 00:23:36 and you care about the number of memory transfers between this 365 00:23:36 --> 00:23:39 cache of size M with blocking B, and main memory with blocking 366 00:23:39 --> 00:23:41 B. But you don't actually know 367 00:23:41 --> 00:23:43 what the model is. You don't know the other 368 00:23:43 --> 00:23:45 parameters of the model. It looks like this, 369 00:23:45 --> 00:23:48 but you don't know the width. You don't know the height. 370 00:23:48 --> 00:23:50 Why not? So, the analysis knows what B 371 00:23:50 --> 00:23:52 and M are. We are going to write some 372 00:23:52 --> 00:23:56 algorithms which look just like boring old algorithms that we've 373 00:23:56 --> 00:24:00 seen throughout the lecture. That's one of the nice things 374 00:24:00 --> 00:24:03 about this model. Every algorithm we have seen is 375 00:24:03 --> 00:24:06 a cache oblivious algorithm, all right, because we didn't 376 00:24:06 --> 00:24:08 even know the word cache in this class until today. 377 00:24:08 --> 00:24:11 So, we already have lots of algorithms to choose from. 378 00:24:11 --> 00:24:13 The thing is, some of them will perform well 379 00:24:13 --> 00:24:15 in this model, and some of them won't. 380 00:24:15 --> 00:24:18 So, we would like to design algorithms that just like our 381 00:24:18 --> 00:24:21 old algorithms that happened to perform well in this context, 382 00:24:21 --> 00:24:24 no matter what B and M are. So, another way this is the 383 00:24:24 --> 00:24:27 same algorithm should work well for all values of B and M if you 384 00:24:27 --> 00:24:31 have a good cache oblivious algorithm. 385 00:24:31 --> 00:24:33 OK, there are a few consequences to this assumption. 386 00:24:33 --> 00:24:36 In a cache aware algorithm, you can explicitly say, 387 00:24:36 --> 00:24:39 OK, I'm blocking my memory into chunks of size B. 388 00:24:39 --> 00:24:42 Here they are. 
I was going to store these B 389 00:24:42 --> 00:24:44 elements here, these B elements here, 390 00:24:44 --> 00:24:46 because you know B, you can do that. 391 00:24:46 --> 00:24:48 You can say, well, OK, now I want to read 392 00:24:48 --> 00:24:51 these B items into my cache, and then write out these ones 393 00:24:51 --> 00:24:53 over here. You can explicitly maintain 394 00:24:53 --> 00:24:55 your cache. With cache oblivious 395 00:24:55 --> 00:25:00 algorithms, you can't because you don't know what it is. 396 00:25:00 --> 00:25:04 So, it's got to be all implicit. 397 00:25:04 --> 00:25:11 And this is pretty much how caches work anyway except for 398 00:25:11 --> 00:25:15 disk. So, it's a pretty reasonable 399 00:25:15 --> 00:25:18 model. In particular, 400 00:25:18 --> 00:25:24 when you access an element that's not in cache, 401 00:25:24 --> 00:25:33 you automatically fetch the block containing that element. 402 00:25:33 --> 00:25:38 And you pay one memory transfer for that if it wasn't already 403 00:25:38 --> 00:25:41 there. Another bit of a catch here is, 404 00:25:41 --> 00:25:45 what if your cache is full? Then you've got to kick some 405 00:25:45 --> 00:25:50 block out of your cache. And then, so we need some model 406 00:25:50 --> 00:25:55 of which block gets kicked out because we can't control that. 407 00:25:55 --> 00:26:00 We have no knowledge of what the blocks are in our algorithm. 408 00:26:00 --> 00:26:05 So what we're going to assume in this model is the ideal 409 00:26:05 --> 00:26:10 thing, that when you fetch a new block, if your cache is full, 410 00:26:10 --> 00:26:17 you evict a block that will be used farthest in the future. 411 00:26:17 --> 00:26:21 Sorry, the furthest. Farthest is distance. 412 00:26:21 --> 00:26:25 Furthest is time. Furthest in the future. 413 00:26:25 --> 00:26:31 OK, this would be the best possible thing to do. 
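The ideal replacement rule described here, evicting the block whose next use lies furthest in the future, can be sketched as a small offline simulation. The function name and the request sequence are made up for illustration, and the sketch works on block identifiers directly rather than on addresses. It needs the whole request sequence in advance, which is exactly why a real cache cannot do this.

```python
def furthest_in_future_misses(requests, capacity):
    """Count block fetches under the ideal (offline) eviction rule.

    `requests` is the full sequence of block identifiers, known in
    advance; `capacity` is the number of blocks the cache holds.
    """
    cache = set()
    misses = 0
    for i, block in enumerate(requests):
        if block in cache:
            continue                # hit: free in the model
        misses += 1                 # miss: one memory transfer
        if len(cache) >= capacity:
            def next_use(b):
                # Index of the next request for block b, or infinity
                # if b is never requested again.
                for j in range(i + 1, len(requests)):
                    if requests[j] == b:
                        return j
                return float("inf")
            victim = max(cache, key=next_use)
            cache.remove(victim)
        cache.add(block)
    return misses


# Hypothetical request sequence over four blocks, cache of three blocks.
requests = [1, 2, 3, 1, 2, 4, 1, 2, 3, 4]
print(furthest_in_future_misses(requests, capacity=3))
```

The linear scan in `next_use` keeps the sketch short; a faster version would precompute next-use positions, but the eviction decision would be the same.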
414 00:26:31 --> 00:26:35 It's a little bit hard to do in practice because you don't know 415 00:26:35 --> 00:26:38 the future generally, unless you're omniscient. 416 00:26:38 --> 00:26:41 So, this is a bit of an idealized model. 417 00:26:41 --> 00:26:45 But it's pretty reasonable in the sense that if you've read 418 00:26:45 --> 00:26:49 the reading handout number 20, this paper by Sleator and 419 00:26:49 --> 00:26:52 Tarjan, they introduce the idea of competitive algorithms. 420 00:26:52 --> 00:26:56 So, we only talked about a small portion of that paper, the 421 00:26:56 --> 00:27:01 move-to-front heuristic for storing a list. 422 00:27:01 --> 00:27:03 But, it also proves that there are strategies, 423 00:27:03 --> 00:27:06 and maybe you heard this in recitation. 424 00:27:06 --> 00:27:08 Some people covered it; some didn't, 425 00:27:08 --> 00:27:10 that these are called paging strategies. 426 00:27:10 --> 00:27:13 So, you want to maintain some cache of pages or blocks, 427 00:27:13 --> 00:27:17 and you pay whenever you have to access a block that's not in 428 00:27:17 --> 00:27:19 your cache. The best thing to do is to 429 00:27:19 --> 00:27:23 always kick out the block that will be used farthest in the 430 00:27:23 --> 00:27:27 future because that way you'll use all the blocks that are in 431 00:27:27 --> 00:27:28 there. This turns out to be the 432 00:27:28 --> 00:27:33 offline optimal strategy if you knew the future. 433 00:27:33 --> 00:27:35 But, there are algorithms that are essentially constant 434 00:27:35 --> 00:27:37 competitive against this strategy. 435 00:27:37 --> 00:27:40 I don't want to get into details because they're not 436 00:27:40 --> 00:27:43 exactly constant competitive. But they are sufficiently 437 00:27:43 --> 00:27:46 constant competitive for the purposes of this lecture that we 438 00:27:46 --> 00:27:49 can assume this, not have to worry about it.
439 00:27:49 --> 00:27:51 Most of the time, we don't even really use this 440 00:27:51 --> 00:27:53 assumption. But there it is. 441 00:27:53 --> 00:27:55 That's the cache oblivious model. 442 00:27:55 --> 00:27:58 It makes things cleaner to think about: just anything that 443 00:27:58 --> 00:28:01 should be done, will be done. 444 00:28:01 --> 00:28:05 And you can simulate that with least recently used or whatever 445 00:28:05 --> 00:28:10 good heuristic you want that's competitive against the 446 00:28:10 --> 00:28:12 optimal. OK, that's pretty much the 447 00:28:12 --> 00:28:16 cache oblivious algorithm. Once you have the two level 448 00:28:16 --> 00:28:20 model, you just assume you don't know B and M. 449 00:28:20 --> 00:28:24 You have this automatic requesting and writing, and whatnot. 450 00:28:24 --> 00:28:28 A little bit more to say, I guess, it may be obvious at 451 00:28:28 --> 00:28:34 this point, but I've been drawing everything as tables. 452 00:28:34 --> 00:28:37 So, it's not really clear what the linear order is. 453 00:28:37 --> 00:28:40 Linear order is just the reading order. 454 00:28:40 --> 00:28:44 So, although we don't explicitly say it most of the 455 00:28:44 --> 00:28:48 time, a typical model is that memory is a linear array. 456 00:28:48 --> 00:28:53 Everything that you ever store in your program is written in 457 00:28:53 --> 00:28:57 this linear array. If you've ever programmed in 458 00:28:57 --> 00:29:01 Assembly or whatever, that's the model. 459 00:29:01 --> 00:29:04 You have the address space, and any number between here and 460 00:29:04 --> 00:29:08 here, that's where you can actually, this is physical 461 00:29:08 --> 00:29:11 memory. This is all you can write to. 462 00:29:11 --> 00:29:15 So, it starts at zero and goes out to, let's call it infinity 463 00:29:15 --> 00:29:17 over here. And, if you allocate some 464 00:29:17 --> 00:29:20 array, maybe it occupies some space in the middle. 465 00:29:20 --> 00:29:23 Who knows?
OK, we usually don't think 466 00:29:23 --> 00:29:26 about that much. What I care about now is that 467 00:29:26 --> 00:29:29 memory itself is blocked in this view. 468 00:29:29 --> 00:29:31 So, however your stuff is stored in memory, 469 00:29:31 --> 00:29:36 it's blocked into clusters of length B. 470 00:29:36 --> 00:29:39 So, if this is, let me call it one and be a 471 00:29:39 --> 00:29:41 little bit nicer. This is B. 472 00:29:41 --> 00:29:46 This is position B plus one. This is 2B, and 2B plus one, 473 00:29:46 --> 00:29:49 and so on. These are the indexes into 474 00:29:49 --> 00:29:51 memory. This is how the blocking 475 00:29:51 --> 00:29:54 happens. If you access something here, 476 00:29:54 --> 00:29:59 you get that whole chunk: round the position down to the previous 477 00:29:59 --> 00:30:02 multiple of B, round it up to the next 478 00:30:02 --> 00:30:06 multiple of B. That's what you always get. 479 00:30:06 --> 00:30:11 OK, so if you think about some array that's maybe allocated 480 00:30:11 --> 00:30:15 here, OK, you have to keep in mind that that array may not be 481 00:30:15 --> 00:30:18 perfectly aligned with the blocks. 482 00:30:18 --> 00:30:21 But more or less it will be so we don't care too much. 483 00:30:21 --> 00:30:24 But that's a bit of a subtlety there. 484 00:30:24 --> 00:30:28 OK, so that's pretty much the model. 485 00:30:28 --> 00:30:32 So every algorithm we've seen, except B trees, 486 00:30:32 --> 00:30:36 is a cache oblivious algorithm. And our question is, 487 00:30:36 --> 00:30:41 now, we know how everything runs in terms of running time. 488 00:30:41 --> 00:30:46 Now we want to measure the number of memory transfers, 489 00:30:46 --> 00:30:49 MT of N. I want to mention one other 490 00:30:49 --> 00:30:53 fact or theorem. I'll put it in brackets because 491 00:30:53 --> 00:30:58 I don't want to state it precisely.
492 00:30:58 --> 00:31:04 But if you have an algorithm that is efficient on two levels, 493 00:31:04 --> 00:31:08 so in other words, what we're looking at, 494 00:31:08 --> 00:31:14 if we just think about the two level world and your algorithm 495 00:31:14 --> 00:31:18 is cache oblivious, then it is efficient on any 496 00:31:18 --> 00:31:23 number of levels in your memory hierarchy, say, 497 00:31:23 --> 00:31:27 L levels. So, I don't want to define what 498 00:31:27 --> 00:31:31 efficient means. But the intuition is, 499 00:31:31 --> 00:31:34 if your machine really looks like this and you have a cache 500 00:31:34 --> 00:31:36 oblivious algorithm, you can apply the cache 501 00:31:36 --> 00:31:38 oblivious analysis for all B and M. 502 00:31:38 --> 00:31:41 So you can analyze the number of memory transfers here, 503 00:31:41 --> 00:31:43 here, here, here, and here. 504 00:31:43 --> 00:31:45 And if you have a good cache oblivious algorithm, 505 00:31:45 --> 00:31:48 the performances at all those levels has to be good. 506 00:31:48 --> 00:31:51 And therefore, the whole performance is good. 507 00:31:51 --> 00:31:54 Good here means asymptotically optimal up to constant factors, 508 00:31:54 --> 00:31:57 something like that. OK, so I don't want to prove 509 00:31:57 --> 00:32:01 that, and you can read the cache oblivious papers. 510 00:32:01 --> 00:32:04 That's a nice fact about cache oblivious algorithms. 511 00:32:04 --> 00:32:08 If you have a cache aware algorithm that tunes to a 512 00:32:08 --> 00:32:12 particular value of B, and a particular value of M, 513 00:32:12 --> 00:32:15 you're not going to have that problem. 514 00:32:15 --> 00:32:19 So, this is one nice feature of cache obliviousness. 515 00:32:19 --> 00:32:23 Another nice feature is when you are coding the algorithm, 516 00:32:23 --> 00:32:26 you don't have to put in B and M. 517 00:32:26 --> 00:32:28 So, that simplifies things a bit. 518 00:32:28 --> 00:32:34 So, let's do some algorithms. 
Enough about models. 519 00:32:34 --> 00:32:40 OK, we're going to start out with some really simple things 520 00:32:40 --> 00:32:45 just to get warmed up on the analysis side. 521 00:32:45 --> 00:32:52 The most basic thing you can do that's good in a cache oblivious 522 00:32:52 --> 00:32:57 world is scanning. So, scanning is just visiting 523 00:32:57 --> 00:33:03 the items in an array in order. So, visit A_1 up to A_N in 524 00:33:03 --> 00:33:06 order. For some notion of visit, 525 00:33:06 --> 00:33:09 this is presumably some constant time operation. 526 00:33:09 --> 00:33:12 For example, suppose you want to compute the 527 00:33:12 --> 00:33:16 aggregate of the array. You want to sum all the 528 00:33:16 --> 00:33:19 elements in the array. So, you're using one extra 529 00:33:19 --> 00:33:23 variable, but you can store that in a register or whatever, 530 00:33:23 --> 00:33:27 so that's one simple example. Sum the array. 531 00:33:27 --> 00:33:31 OK, so here's the picture. We have our memory. 532 00:33:31 --> 00:33:36 Each of these cells represents one item, one element, 533 00:33:36 --> 00:33:38 log N bits, one word, whatever. 534 00:33:38 --> 00:33:43 Our array is somewhere in here. Maybe it's there. 535 00:33:43 --> 00:33:47 And we go from here to here to here to here. 536 00:33:47 --> 00:33:50 OK, and so on. So, what does this cost? 537 00:33:50 --> 00:33:53 What is the number of memory transfers? 538 00:33:53 --> 00:33:57 We know that this is a linear time algorithm. 539 00:33:57 --> 00:34:03 It takes order N time. What does it cost in terms of 540 00:34:03 --> 00:34:07 memory transfers? N over B, pretty much. 541 00:34:07 --> 00:34:12 We like to say it's order N over B plus two, or plus one in the 542 00:34:12 --> 00:34:15 big O. This is a bit of a worry. 543 00:34:15 --> 00:34:18 I mean, N could be smaller than B.
544 00:34:18 --> 00:34:21 We really want to think about all the cases, 545 00:34:21 --> 00:34:26 especially because usually you're not doing this on 546 00:34:26 --> 00:34:31 something of size N. You're doing it on something of 547 00:34:31 --> 00:34:37 size k, where we don't really know much about k. 548 00:34:37 --> 00:34:40 But in general, it's N over B plus one because 549 00:34:40 --> 00:34:43 we always need at least one memory transfer to look at 550 00:34:43 --> 00:34:46 something, unless N is zero. And in particular, 551 00:34:46 --> 00:34:49 it's plus two if you care about the constants. 552 00:34:49 --> 00:34:53 If I don't write the big O, then it would be plus two at 553 00:34:53 --> 00:34:57 most because you could essentially waste the first 554 00:34:57 --> 00:35:01 block and then everything is fine for a while. 555 00:35:01 --> 00:35:05 And then, if you're unlucky, you essentially waste the last 556 00:35:05 --> 00:35:08 block. There is just one element in 557 00:35:08 --> 00:35:12 that block, and you're not getting much out of it. 558 00:35:12 --> 00:35:16 Everything in the middle, though, every block between the 559 00:35:16 --> 00:35:19 first and last block has to be full. 560 00:35:19 --> 00:35:22 So, you're using all of those elements. 561 00:35:22 --> 00:35:26 So out of the N elements, you only have N over B blocks 562 00:35:26 --> 00:35:28 because the block has B elements. 563 00:35:28 --> 00:35:33 OK, that's pretty trivial. Let me do something slightly 564 00:35:33 --> 00:35:38 more interesting, which is two scans at once. 565 00:35:38 --> 00:35:41 OK, here we are not assuming anything about M. 566 00:35:41 --> 00:35:45 We're not assuming anything about the size of the cache, 567 00:35:45 --> 00:35:48 just that it can hold a single block. 568 00:35:48 --> 00:35:51 The last block that we visited has to be there. 569 00:35:51 --> 00:35:55 OK, you can also do a constant number of parallel scans.
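The "N over B plus two at most" count for a single scan can be checked with a short sketch. It assumes zero-based addresses with block k covering addresses k*B through k*B + B - 1 (the lecture draws blocks starting at position one, which only shifts the picture). The function name and the sample values are illustrative.

```python
def blocks_touched_by_scan(start, n, B):
    """Number of distinct size-B blocks covering addresses start .. start + n - 1."""
    if n == 0:
        return 0
    first_block = start // B            # round the first address down to its block
    last_block = (start + n - 1) // B   # block holding the last address
    return last_block - first_block + 1

B = 8
for start, n in [(0, 64), (5, 64), (7, 2), (3, 1)]:
    touched = blocks_touched_by_scan(start, n, B)
    # A misaligned array wastes part of the first and last block,
    # so the count never exceeds n / B + 2.
    print(start, n, touched, touched <= n / B + 2)
```

An aligned array of 64 elements touches exactly 64 / 8 blocks; shifting the start by five elements adds one partially used block.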
570 00:35:55 --> 00:36:00 This is not really parallel in the sense of multithreaded, 571 00:36:00 --> 00:36:06 but simulated parallelism. I mean, if you have a constant 572 00:36:06 --> 00:36:09 number, do one, do the other, 573 00:36:09 --> 00:36:12 do the other, come back, come back, 574 00:36:12 --> 00:36:18 come back, all right, visit them in turn round robin, 575 00:36:18 --> 00:36:20 whatever. For example, 576 00:36:20 --> 00:36:26 here's a cute piece of code. If you want to reverse an 577 00:36:26 --> 00:36:33 array, OK, then you can do it. This is a good puzzle. 578 00:36:33 --> 00:36:38 You can do it by essentially two scans where you repeatedly 579 00:36:38 --> 00:36:42 swap the first and last element. 580 00:36:42 --> 00:36:46 So I swap A_i with A_(N minus i plus one), 581 00:36:46 --> 00:36:51 with i starting at one. So, here's your array. 582 00:36:51 --> 00:36:54 Suppose this is actually my array. 583 00:36:54 --> 00:36:59 I swap these two guys, and I swap these two guys, 584 00:36:59 --> 00:37:04 and so on. That will reverse my array, 585 00:37:04 --> 00:37:08 and this should work, hopefully, in the middle as well if N is odd. 586 00:37:08 --> 00:37:13 It should not do anything. And you can view this as two 587 00:37:13 --> 00:37:16 scans. There is one scan that's coming 588 00:37:16 --> 00:37:19 in this way. There's also a reverse scan, 589 00:37:19 --> 00:37:23 ooh, somewhat more sophisticated, coming back this way. 590 00:37:23 --> 00:37:26 Of course, reverse scan has the same analysis. 591 00:37:26 --> 00:37:31 And as long as your cache is big enough to store at least two 592 00:37:31 --> 00:37:35 blocks, which is a pretty reasonable assumption, 593 00:37:35 --> 00:37:40 so let's write it. Assuming the number of blocks 594 00:37:40 --> 00:37:43 in the cache, which is M over B, 595 00:37:43 --> 00:37:49 is at least two in this algorithm, the number of memory 596 00:37:49 --> 00:37:53 transfers is still order N over B plus one.
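The reversal described above can be sketched as follows. This is a minimal version using zero-based indices (the lecture uses A_1 through A_N), and the function name is an assumption. The two indices play the role of the forward scan and the reverse scan.

```python
def reverse_in_place(a):
    """Reverse list a by swapping ends and moving inward (two simultaneous scans)."""
    i, j = 0, len(a) - 1      # i scans forward, j scans backward
    while i < j:              # for odd length, the middle element is left alone
        a[i], a[j] = a[j], a[i]
        i += 1
        j -= 1
    return a

print(reverse_in_place([1, 2, 3, 4, 5]))
```

Nothing in the code mentions B or M; the bound of order N over B plus one comes entirely from the access pattern, provided the cache can hold the two blocks currently under the scans.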
597 00:37:53 --> 00:37:58 OK, the constant goes up maybe, but in this case it probably 598 00:37:58 --> 00:38:02 doesn't. But who cares. 599 00:38:02 --> 00:38:06 OK, as long as you're doing a constant number of scans, 600 00:38:06 --> 00:38:11 and some constant number of arrays, it happens to be one of 601 00:38:11 --> 00:38:15 them's reversed, whatever, it will take, 602 00:38:15 --> 00:38:20 we call this linear time. It's linear in the number of 603 00:38:20 --> 00:38:22 blocks in your input. OK, great. 604 00:38:22 --> 00:38:26 So now you can reverse an array: exciting. 605 00:38:26 --> 00:38:32 Let's try another simple algorithm on another board. 606 00:38:32 --> 00:38:47 607 00:38:47 --> 00:38:50 Let's try binary search. So just like last week, 608 00:38:50 --> 00:38:53 we're going back to our basics here. 609 00:38:53 --> 00:38:57 Scanning we didn't even talk about in this class. 610 00:38:57 --> 00:39:02 Binary search is something we talked about a little bit. 611 00:39:02 --> 00:39:04 It was a simple divide and conquer algorithm. 612 00:39:04 --> 00:39:08 I hope you all remember it. And if we look at an array, 613 00:39:08 --> 00:39:11 and I'm not going to draw the cells here because I want to 614 00:39:11 --> 00:39:14 imagine a really big array, binary search, 615 00:39:14 --> 00:39:16 but suppose it always goes to the left. 616 00:39:16 --> 00:39:19 It starts by visiting this element in the middle. 617 00:39:19 --> 00:39:23 Then it goes to the quarter mark. Then it goes to the one eighth 618 00:39:23 --> 00:39:25 mark. OK, this is one hypothetical 619 00:39:25 --> 00:39:29 execution of a binary search. OK, and eventually it finds the 620 00:39:29 --> 00:39:32 element it's looking for. It finds where it fits at 621 00:39:32 --> 00:39:35 least. So x is over here. 622 00:39:35 --> 00:39:38 So, we know that it takes log N time. 623 00:39:38 --> 00:39:41 How many memory transfers does it take?
624 00:39:41 --> 00:39:45 Now, I blocked this array into chunks of size B, 625 00:39:45 --> 00:39:49 blocks of size B. How many blocks do I touch? 626 00:39:49 --> 00:39:53 This one's a little bit more subtle. 627 00:39:53 --> 00:40:18 628 00:40:18 --> 00:40:21 It depends on the relative sizes of N and B, 629 00:40:21 --> 00:40:23 yeah. Log base B of N would be a good 630 00:40:23 --> 00:40:25 guess. We would like it to be, 631 00:40:25 --> 00:40:29 let's say, hope, is that it's log base B of N 632 00:40:29 --> 00:40:33 because we know that B trees can search in what's essentially a 633 00:40:33 --> 00:40:38 sorted list of N items in log base B of N time. 634 00:40:38 --> 00:40:42 That turns out to be optimal in the cache oblivious model or in 635 00:40:42 --> 00:40:46 the two level model you've got to pay log base B of N. 636 00:40:46 --> 00:40:51 I won't prove that here. The same reason you need log N 637 00:40:51 --> 00:40:55 comparisons to do a binary search in the normal model. 638 00:40:55 --> 00:41:00 Alas, it is possible to get log base B of N even without knowing 639 00:41:00 --> 00:41:06 B. But, binary search does not do 640 00:41:06 --> 00:41:09 it. Log of N over B, 641 00:41:09 --> 00:41:13 yes. So the number of memory 642 00:41:13 --> 00:41:22 transfers on N items is log of N over B also known as, 643 00:41:22 --> 00:41:31 let's say, plus one, also known as log N minus log 644 00:41:31 --> 00:41:35 B. OK, whereas log base B of N is 645 00:41:35 --> 00:41:39 log N divided by log B, OK, clearly this is much better 646 00:41:39 --> 00:41:42 than subtracting. So, this would be good, 647 00:41:42 --> 00:41:45 but this is bad. Most of the time, 648 00:41:45 --> 00:41:47 this is log N, which is no better, 649 00:41:47 --> 00:41:51 I mean, you're not using blocks at all essentially. 650 00:41:51 --> 00:41:53 The idea is, out here, I mean, 651 00:41:53 --> 00:41:57 there's some little, tiny block that contains this 652 00:41:57 --> 00:42:00 thing. 
I mean, tiny depends on how big 653 00:42:00 --> 00:42:03 B is. But, each of these items will 654 00:42:03 --> 00:42:06 be in a different block until you get essentially within one 655 00:42:06 --> 00:42:09 block worth of x. When you get within one block 656 00:42:09 --> 00:42:12 worth of x, there's only like a constant number of blocks that 657 00:42:12 --> 00:42:15 matter, and so all of these accesses are indeed within the 658 00:42:15 --> 00:42:17 same block. But, how many are there? 659 00:42:17 --> 00:42:21 Well, just log B because you're only spending log B within a, 660 00:42:21 --> 00:42:24 if you're within an interval of size k, you're only going to 661 00:42:24 --> 00:42:27 spend log k steps in it. So, you're saving log B in 662 00:42:27 --> 00:42:30 here, but overall you're paying log N, so you only get log N 663 00:42:30 --> 00:42:34 minus log B plus some constant. OK, so this is bad news for 664 00:42:34 --> 00:42:37 binary search. So, not all of the algorithms 665 00:42:37 --> 00:42:40 we've seen are going to work well in this model. 666 00:42:40 --> 00:42:43 We need a lot more thinking before we can solve what is 667 00:42:43 --> 00:42:47 essentially the binary search problem, finding an element in a 668 00:42:47 --> 00:42:50 sorted list, in log base B of N without knowing B. 669 00:42:50 --> 00:42:52 OK, we know we could use B trees. 670 00:42:52 --> 00:42:53 If you knew B, great, that works, 671 00:42:53 --> 00:42:56 and that's optimal. But without knowing B, 672 00:42:56 --> 00:43:02 it's a little bit harder. And this gets us into the world 673 00:43:02 --> 00:43:06 of divide and conquer. Also like last week, 674 00:43:06 --> 00:43:13 and like the first few weeks of this class, divide and conquer 675 00:43:13 --> 00:43:17 is your friend. And, it turns out divide and 676 00:43:17 --> 00:43:23 conquer is not the only tool, but it's a really useful tool 677 00:43:23 --> 00:43:27 in designing cache oblivious algorithms. 
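The gap between log N minus log B and log base B of N shows up in a small experiment. The sketch below runs binary search over index positions only, counts how many distinct size-B blocks the probes land in, and prints both quantities for comparison. It assumes zero-based indices and block k covering positions k*B through k*B + B - 1. The function name and the values of N and B are illustrative.

```python
import math

def binary_search_blocks(n, target_index, B):
    """Binary search over positions 0 .. n-1 for target_index.

    Returns (number of probes, number of distinct blocks probed).
    """
    lo, hi = 0, n - 1
    probes = 0
    blocks = set()
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        blocks.add(mid // B)          # block containing the probed position
        if mid == target_index:
            break
        if mid < target_index:
            lo = mid + 1
        else:
            hi = mid - 1
    return probes, len(blocks)

n, B = 1 << 20, 1 << 6
# Target 0 is the "always goes to the left" execution from the lecture.
probes, blocks = binary_search_blocks(n, 0, B)
print("probes:", probes)
print("blocks touched:", blocks)
print("log N - log B:", math.log2(n) - math.log2(B))
print("log_B N:", math.log2(n) / math.log2(B))
```

Only the last few probes, once the search interval is about one block wide, share a block; every earlier probe costs its own memory transfer.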
678 00:43:27 --> 00:43:31 And, let me say why. 679 00:43:31 --> 00:43:43 680 00:43:43 --> 00:43:47 So, we'll see a bunch of divide and conquer based algorithms, 681 00:43:47 --> 00:43:50 cache oblivious. And, the intuition is that we 682 00:43:50 --> 00:43:54 can take all the favorite algorithms we have, 683 00:43:54 --> 00:43:56 obviously it doesn't always work. 684 00:43:56 --> 00:43:59 Binary search was a divide and conquer algorithm. 685 00:43:59 --> 00:44:03 It's not so great. But, in general, 686 00:44:03 --> 00:44:07 the idea is that your algorithm can just do the normal divide 687 00:44:07 --> 00:44:08 and conquer thing, right? 688 00:44:08 --> 00:44:12 You divide your problem into subproblems of smaller size 689 00:44:12 --> 00:44:15 repeatedly, all the way down to problems of constant size, 690 00:44:15 --> 00:44:19 OK, just like before. But, if you're recursively 691 00:44:19 --> 00:44:21 dividing your problem into smaller things, 692 00:44:21 --> 00:44:24 at some point you can think about it and say, 693 00:44:24 --> 00:44:27 well, wait, I mean, the algorithm divides all the 694 00:44:27 --> 00:44:31 way, but we can think about the point at which the problem fits 695 00:44:31 --> 00:44:36 in a block or fits in cache. OK, and that's the analysis. 696 00:44:36 --> 00:44:40 OK, we'll think about the time when your problem is small 697 00:44:40 --> 00:44:43 enough that we can analyze it in some other way. 698 00:44:43 --> 00:44:46 So, usually, we analyze it recursively. 699 00:44:46 --> 00:44:48 We get a recurrence. What we're changing, 700 00:44:48 --> 00:44:50 essentially, is the base case. 701 00:44:50 --> 00:44:54 So, in the base case, we don't want to go down to a 702 00:44:54 --> 00:44:56 constant size. That's too far. 703 00:44:56 --> 00:45:02 I'll show you some examples. 
We want to consider the point 704 00:45:02 --> 00:45:09 in recursion at which either the problem fits in cache, 705 00:45:09 --> 00:45:17 so it has size less than or equal to M, or it fits in order 706 00:45:17 --> 00:45:22 one blocks. That's another natural time to 707 00:45:22 --> 00:45:27 do it. Order one blocks would be even 708 00:45:27 --> 00:45:35 better than fitting in cache. So, this means a size order B. 709 00:45:35 --> 00:45:41 OK, this will change the base case of the recurrence, 710 00:45:41 --> 00:45:48 and it will turn out to give us good answers instead of bad 711 00:45:48 --> 00:45:52 ones. So, let's do a simple example. 712 00:45:52 --> 00:45:57 Our good friend order statistics, in particular, 713 00:45:57 --> 00:46:04 for finding medians. So, I hope you all know this by 714 00:46:04 --> 00:46:08 heart. Remember the worst case linear 715 00:46:08 --> 00:46:12 time median finding algorithm by Blum et al. 716 00:46:12 --> 00:46:17 I'll write this fast. We partition our array. 717 00:46:17 --> 00:46:21 It turns out, this is a good algorithm as it 718 00:46:21 --> 00:46:24 is. We partition our array 719 00:46:24 --> 00:46:30 conceptually into N over five 5-tuples, into little groups 720 00:46:30 --> 00:46:36 of five. This may not have been exactly 721 00:46:36 --> 00:46:40 how I wrote it last time. I didn't check. 722 00:46:40 --> 00:46:46 But, it's the same algorithm. You compute the median of each 723 00:46:46 --> 00:46:49 5-tuple. Then you recursively compute 724 00:46:49 --> 00:46:55 the median x of these medians. 725 00:46:55 --> 00:47:11 726 00:47:11 --> 00:47:15 Then, you partition around x. So, that gave us some element 727 00:47:15 --> 00:47:20 that was roughly in the middle. It was within the middle half, 728 00:47:20 --> 00:47:22 I think. Partition around x, 729 00:47:22 --> 00:47:27 and then we show that you could always recurse on just one of 730 00:47:27 --> 00:47:29 the sides.
731 00:47:29 --> 00:47:38 732 00:47:38 --> 00:47:41 OK, this was our good old friend for computing 733 00:47:41 --> 00:47:43 order statistics, or medians, or whatnot. 734 00:47:43 --> 00:47:47 OK, so how much time does this, well, we know how much time 735 00:47:47 --> 00:47:50 this takes. It should be linear time. 736 00:47:50 --> 00:47:52 But how many memory transfers does this take? 737 00:47:52 --> 00:47:56 Well, conceptually partitioning, that I can do 738 00:47:56 --> 00:47:58 in zero. Maybe I have to compute N over 739 00:47:58 --> 00:48:02 five, no big deal here. We're not thinking about 740 00:48:02 --> 00:48:05 computation. I have to find the median of 741 00:48:05 --> 00:48:07 each tuple. So, here it matters how my 742 00:48:07 --> 00:48:10 array is laid out. But, what I'm going to do is 743 00:48:10 --> 00:48:13 take my array, take the first five elements, 744 00:48:13 --> 00:48:16 and then the next five elements and so on. 745 00:48:16 --> 00:48:20 Those will be my 5-tuples. So, I can implement this just 746 00:48:20 --> 00:48:23 by scanning, and then computing the median on those five 747 00:48:23 --> 00:48:27 elements, which I stored in the five registers on my CPU. 748 00:48:27 --> 00:48:32 I'll assume that there are enough registers for that. 749 00:48:32 --> 00:48:35 And, I compute the median, write it out to some array out 750 00:48:35 --> 00:48:38 here. So, it's going to be one 751 00:48:38 --> 00:48:40 element. So, the median of here goes 752 00:48:40 --> 00:48:43 into there. The median of these guys goes 753 00:48:43 --> 00:48:46 into there, and so on. So, I'm scanning in here, 754 00:48:46 --> 00:48:50 and in parallel, I'm scanning an output in here. 755 00:48:50 --> 00:48:54 So, it's two parallel scans. So, that takes linear time. 756 00:48:54 --> 00:48:59 So, this takes order N over B plus one memory transfers. 757 00:48:59 --> 00:49:03 OK, then we have to recursively compute the median of the 758 00:49:03 --> 00:49:06 medians.
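The group-median step, viewed as two parallel scans, might look like the sketch below: one pass reads the input five elements at a time while a second output array is filled left to right. The function name and the handling of a short final group (taking its lower median) are assumptions for illustration, not details from the lecture.

```python
def medians_of_five(a):
    """Scan a once, appending the median of each consecutive group of five."""
    out = []                                   # second array, written in order
    for start in range(0, len(a), 5):
        group = sorted(a[start:start + 5])     # at most five items: constant work
        out.append(group[(len(group) - 1) // 2])   # lower median of the group
    return out

print(medians_of_five([5, 1, 4, 2, 3, 9, 7, 8]))
```

Both the reads and the writes move through memory in order, which is what makes the step cost order N over B plus one memory transfers.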
This step used to be T of N 759 00:49:06 --> 00:49:09 over five. Now it's MT of N over five, 760 00:49:09 --> 00:49:12 OK, with the same values of B and M. 761 00:49:12 --> 00:49:17 Then we partition around x. Partitioning is also like three 762 00:49:17 --> 00:49:19 parallel scans if you work it out. 763 00:49:19 --> 00:49:24 So, this is also going to take linear memory transfers, 764 00:49:24 --> 00:49:28 N over B plus one. And then, we recurse on one of 765 00:49:28 --> 00:49:33 the sides, and this is the fun part of the analysis which I 766 00:49:33 --> 00:49:37 won't repeat here. But, we get MT of, 767 00:49:37 --> 00:49:42 like, three quarters N. I think originally it was seven 768 00:49:42 --> 00:49:45 tenths, so we simplified to three quarters, 769 00:49:45 --> 00:49:49 which is hopefully bigger than seven tenths. 770 00:49:49 --> 00:49:52 Yeah, it is. OK, so this is the new 771 00:49:52 --> 00:49:55 analysis. Now we get a recurrence. 772 00:49:55 --> 00:49:58 So, let's do that. 773 00:49:58 --> 00:50:16 774 00:50:16 --> 00:50:22 So, the analysis is we get this MT of N is MT of N over five 775 00:50:22 --> 00:50:29 plus MT of three quarters N plus, this is just as before. 776 00:50:29 --> 00:50:35 Before we had linear work here. And now, we have what we call 777 00:50:35 --> 00:50:39 linear number of memory transfers, linear number of 778 00:50:39 --> 00:50:41 blocks. OK, I'll sort of ignore this 779 00:50:41 --> 00:50:44 plus one. It's not too critical. 780 00:50:44 --> 00:50:48 So, this is our recurrence. Now, it depends what our base 781 00:50:48 --> 00:50:51 case is. And, usually we would use a 782 00:50:51 --> 00:50:55 base case of constant size. So, let's see what happens if 783 00:50:55 --> 00:51:00 we use a base case of constant size just so that it's clear why 784 00:51:00 --> 00:51:05 this base case is so important. OK, this describes a recurrence 785 00:51:05 --> 00:51:07 as one of these hairy recurrences. 
786 00:51:07 --> 00:51:09 And, I don't want to use substitution. 787 00:51:09 --> 00:51:12 I just want the intuition of why this is going to solve to 788 00:51:12 --> 00:51:14 something rather big. OK, and for me, 789 00:51:14 --> 00:51:17 the best intuition always comes from recursion trees. 790 00:51:17 --> 00:51:20 If you don't know the solution to recurrence and you need a 791 00:51:20 --> 00:51:24 good guess, use recursion trees. And today, I will only give you 792 00:51:24 --> 00:51:26 good guesses. I don't want to prove anything 793 00:51:26 --> 00:51:31 with substitution because I want to get to the bigger ideas. 794 00:51:31 --> 00:51:34 So, this is even messy from a recursion tree point of view 795 00:51:34 --> 00:51:38 because you have these unbalanced sizes where you start 796 00:51:38 --> 00:51:40 at the root with some of size N over B. 797 00:51:40 --> 00:51:44 Then you split it into something size one fifth N over 798 00:51:44 --> 00:51:47 B, and something of size three quarters N over B, 799 00:51:47 --> 00:51:51 which is annoying because now this subtree will be a lot 800 00:51:51 --> 00:51:54 bigger than this one, or this one will terminate 801 00:51:54 --> 00:51:56 faster. So, it's pretty unbalanced. 802 00:51:56 --> 00:52:00 But, summing per level doesn't really tell you a lot at this 803 00:52:00 --> 00:52:02 point. But let's just look at the 804 00:52:02 --> 00:52:07 bottom level. Look at all the leaves in this 805 00:52:07 --> 00:52:10 recursion tree. So, that's the base cases. 806 00:52:10 --> 00:52:13 How many base cases are there? This is an interesting 807 00:52:13 --> 00:52:16 question. We've never thought about it in 808 00:52:16 --> 00:52:21 the context of this recurrence. It gives a somewhat surprising 809 00:52:21 --> 00:52:23 answer. It was surprising to me the 810 00:52:23 --> 00:52:27 first time I worked it out. So, how many leaves does this 811 00:52:27 --> 00:52:32 recursion tree have? 
Well, we can write a 812 00:52:32 --> 00:52:35 recurrence. The number of leaves in a 813 00:52:35 --> 00:52:41 problem of size N, it's going to be the number of 814 00:52:41 --> 00:52:47 leaves in this problem plus the number of leaves in this problem 815 00:52:47 --> 00:52:52 plus zero. So, that's another recurrence. 816 00:52:52 --> 00:52:57 We'll call this L of N. OK, now the base case is really 817 00:52:57 --> 00:53:02 relevant. It determines the solution to 818 00:53:02 --> 00:53:04 this recurrence. And let's, again, 819 00:53:04 --> 00:53:08 assume that in a problem of size one, we have one leaf. 820 00:53:08 --> 00:53:12 That's our only base case. Well, it turns out, 821 00:53:12 --> 00:53:14 and here you need to guess, I think. 822 00:53:14 --> 00:53:17 This is not particularly obvious. 823 00:53:17 --> 00:53:21 Any of the TA's have guesses of the form of this solution? 824 00:53:21 --> 00:53:25 Or anybody, not just TA's. But this is open to everyone. 825 00:53:25 --> 00:53:28 If Charles were here, I would ask him. 826 00:53:28 --> 00:53:31 I had to think for a while, and it's not linear, 827 00:53:31 --> 00:53:37 right, because you're somehow decreasing quite a bit. 828 00:53:37 --> 00:53:42 So, it's smaller than linear, but it's more than a constant. 829 00:53:42 --> 00:53:47 OK, it's actually more than polylog, so what's your favorite 830 00:53:47 --> 00:53:50 function in the middle? N over log N, 831 00:53:50 --> 00:53:53 that's still too big. Keep going. 832 00:53:53 --> 00:53:57 You have an oracle here, so you can, N to the k, 833 00:53:57 --> 00:54:00 yeah, close. I mean, k is usually an 834 00:54:00 --> 00:54:04 integer. N to the alpha for some real 835 00:54:04 --> 00:54:09 number between zero and one. Yeah, that's what you meant. 836 00:54:09 --> 00:54:11 Sorry. It's like the shortest 837 00:54:11 --> 00:54:15 mathematical joke. Let epsilon be less than zero 838 00:54:15 --> 00:54:18 or for a sufficiently large epsilon. 
839 00:54:18 --> 00:54:21 I don't know. So, you've got to use the right 840 00:54:21 --> 00:54:25 letters. So, let's suppose that it's N 841 00:54:25 --> 00:54:28 to the alpha. Then we would get this N over 842 00:54:28 --> 00:54:32 five to the alpha, and we'd get three quarters N 843 00:54:32 --> 00:54:36 to the alpha. When you have a nice recurrence 844 00:54:36 --> 00:54:40 like this, you can just try plugging in a guess and see 845 00:54:40 --> 00:54:42 whether it works, OK, and of course this will 846 00:54:42 --> 00:54:46 work only depending on alpha. So, we should get an equation 847 00:54:46 --> 00:54:49 on alpha here. So, everything has an N to the 848 00:54:49 --> 00:54:51 alpha, in fact, all of these terms. 849 00:54:51 --> 00:54:53 So, I can divide through by N to the alpha. 850 00:54:53 --> 00:54:56 That's assuming that it's not zero or something. 851 00:54:56 --> 00:54:59 That seems reasonable. So, we have one equals one 852 00:54:59 --> 00:55:04 fifth to the alpha plus three quarters to the alpha. 853 00:55:04 --> 00:55:10 This is something you won't get on a final because I don't know 854 00:55:10 --> 00:55:15 any good way to solve this except with like Maple or 855 00:55:15 --> 00:55:19 Mathematica. If you're smart I'm sure you 856 00:55:19 --> 00:55:24 could compute it in a nicer way, but alpha is about 0.8, 857 00:55:24 --> 00:55:28 it turns out. So, the number of leaves is 858 00:55:28 --> 00:55:34 this sort of in between constant and linear. 859 00:55:34 --> 00:55:37 Usually polynomial means you have an integer power. 860 00:55:37 --> 00:55:40 Let's call it a polynomial. Why not? 861 00:55:40 --> 00:55:43 There's a lot of leaves, is the point, 862 00:55:43 --> 00:55:47 and if we say that each leaf costs a constant number of 863 00:55:47 --> 00:55:50 memory transfers, we're in trouble because then 864 00:55:50 --> 00:55:54 the number of memory transfers has to be at least this.
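The equation for alpha has no obvious closed form, but it is easy to solve numerically without Maple or Mathematica. The sketch below uses plain bisection; the function name is an assumption. By my calculation the root for the three-quarters recurrence is roughly 0.91, while the original seven-tenths recurrence gives roughly 0.84, which is closer to the "about 0.8" quoted in the lecture. Either way alpha lies strictly between zero and one, which is all the argument needs.

```python
def solve_alpha(a, b, iters=60):
    """Bisection for alpha in (0, 1) with a**alpha + b**alpha == 1.

    Assumes 0 < a, b < 1 and a + b < 1, so the left side minus one is
    positive at alpha = 0, negative at alpha = 1, and decreasing in between.
    """
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if a ** mid + b ** mid - 1.0 > 0:
            lo = mid      # sum still too large: alpha must be bigger
        else:
            hi = mid
    return (lo + hi) / 2

print(solve_alpha(1 / 5, 3 / 4))    # recurrence with the 3/4 simplification
print(solve_alpha(1 / 5, 7 / 10))   # recurrence with the original 7/10
```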
865 00:55:54 --> 00:55:58 If it's at least that, that's potentially bigger than 866 00:55:58 --> 00:56:02 N over B, I mean, bigger than in an asymptotic 867 00:56:02 --> 00:56:06 sense. This is little omega of N over 868 00:56:06 --> 00:56:10 B if B is big. If B is at least N to the 0.2 869 00:56:10 --> 00:56:14 something, OK, or one seventh something. 870 00:56:14 --> 00:56:18 But if, in particular, B is at least N to the 0.2, 871 00:56:18 --> 00:56:22 then this should be bigger than that. 872 00:56:22 --> 00:56:27 So, this is a bad analysis because we're not going to get 873 00:56:27 --> 00:56:32 the answer we want, which is N over B. 874 00:56:32 --> 00:56:35 The best you can do for median is N over B because you have to 875 00:56:35 --> 00:56:38 read all the elements, and you should spend linear 876 00:56:38 --> 00:56:40 time. So, we want to get N over B. 877 00:56:40 --> 00:56:42 This algorithm is N over B plus one. 878 00:56:42 --> 00:56:45 So, this is why you need a good base case, all right? 879 00:56:45 --> 00:56:48 So that makes the point. So, the question is, 880 00:56:48 --> 00:56:51 what base case should I use? 881 00:56:51 --> 00:57:04 882 00:57:04 --> 00:57:06 So, we have this recurrence. 883 00:57:06 --> 00:57:21 884 00:57:21 --> 00:57:25 What base case should I use? Constant was too small. 885 00:57:25 --> 00:57:30 We have a couple of choices listed up here. 886 00:57:30 --> 00:57:46 887 00:57:46 --> 00:57:55 Any suggestions? B, OK, MT of B is? 888 00:57:55 --> 00:58:01 The hard part. So, if my problem, 889 00:58:01 --> 00:58:07 if the size of my array fits in a block and I do all this stuff 890 00:58:07 --> 00:58:11 on it, how many memory transfers could that take? 891 00:58:11 --> 00:58:15 One, or a constant, depending on alignment. 892 00:58:15 --> 00:58:20 OK, maybe it takes two memory transfers, but constant. 893 00:58:20 --> 00:58:23 Good.
That's clearly a lot better 894 00:58:23 --> 00:58:27 than this base case, MT of one equals order one, 895 00:58:27 --> 00:58:30 clearly stronger. So, hopefully, 896 00:58:30 --> 00:58:36 it gives the right answer, and now indeed it does. 897 00:58:36 --> 00:58:39 I love this analysis. So, I'm going to wave my hands. 898 00:58:39 --> 00:58:43 OK, but in particular, what this gives us, 899 00:58:43 --> 00:58:47 if we do the previous analysis, what is the number of leaves? 900 00:58:47 --> 00:58:51 So, in the leaves, now L of B equals one instead 901 00:58:51 --> 00:58:54 of L of one equals one. So, this stops earlier. 902 00:58:54 --> 00:58:59 When does it stop? Well, instead of getting N to 903 00:58:59 --> 00:59:02 the order of 0.8, whatever, we get N over B to 904 00:59:02 --> 00:59:06 the power of 0.8 whatever. OK, so it turns out the number 905 00:59:06 --> 00:59:10 of leaves is N over B to the alpha, which is little o of N 906 00:59:10 --> 00:59:12 over B. So, we don't care. 907 00:59:12 --> 00:59:15 It's tiny. And, if you look at the root 908 00:59:15 --> 00:59:17 cost is N over B in the recursion tree, 909 00:59:17 --> 00:59:22 the leaf cost is little o of N over B, and if you wave your 910 00:59:22 --> 00:59:26 hands, and close your eyes, and squint, the cost should be 911 00:59:26 --> 00:59:29 geometrically decreasing as we go down, I hope, 912 00:59:29 --> 00:59:34 more or less. It's a bit messy because of all 913 00:59:34 --> 00:59:39 the things terminating, but let's say cost is roughly 914 00:59:39 --> 00:59:42 geometric. Don't do this in the final, 915 00:59:42 --> 00:59:47 but you won't have any messy recurrences like this. 916 00:59:47 --> 00:59:50 So, don't worry. Down the tree, 917 00:59:50 --> 00:59:55 so you'd have to prove this formally, but I claim that the 918 00:59:55 --> 1:00:01 root cost dominates. And, the root cost is N over B. 919 1:00:01 --> 1:00:13 920 1:00:13 --> 1:00:16.591 So, we get N over B. 
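Written out, the leaf count with the improved base case goes roughly as follows. This is a sketch of the guess-and-check step only: L(N) denotes the number of leaves, alpha is the exponent satisfying the one-fifth and three-quarters equation, and the floors and additive constants in the real recurrence are ignored.

```latex
\begin{align*}
L(N) &= L\!\left(\tfrac{N}{5}\right) + L\!\left(\tfrac{3N}{4}\right),
  \qquad L(B) = 1 \\
\text{Guess } L(N) &= \left(\tfrac{N}{B}\right)^{\alpha}: \quad
  \left(\tfrac{N}{5B}\right)^{\alpha} + \left(\tfrac{3N}{4B}\right)^{\alpha}
  = \left(\tfrac{N}{B}\right)^{\alpha}
    \left[\left(\tfrac{1}{5}\right)^{\alpha} + \left(\tfrac{3}{4}\right)^{\alpha}\right]
  = \left(\tfrac{N}{B}\right)^{\alpha} \\
\alpha < 1 \;&\Longrightarrow\; L(N) = \left(\tfrac{N}{B}\right)^{\alpha}
  = o\!\left(\tfrac{N}{B}\right)
\end{align*}
```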
OK, so this is a nice, 921 1:00:16.591 --> 1:00:21.892 linear time algorithm for order statistics for cache oblivious. 922 1:00:21.892 --> 1:00:24.97 Great. This may turn you off a little 923 1:00:24.97 --> 1:00:29.758 bit, but even though this is like the simplest algorithm, 924 1:00:29.758 --> 1:00:34.46 it's also probably the most complicated analysis that we 925 1:00:34.46 --> 1:00:36.846 will do. In the future, 926 1:00:36.846 --> 1:00:40.234 our algorithms will be more complicated, and the analyses 927 1:00:40.234 --> 1:00:42.533 will be relatively simple. And usually, 928 1:00:42.533 --> 1:00:45.255 it's that way with cache oblivious algorithms. 929 1:00:45.255 --> 1:00:48.824 So, I'm giving you this sort of as the intuition of why this 930 1:00:48.824 --> 1:00:51.425 should be enough. Then you have to prove it. 931 1:00:51.425 --> 1:00:54.933 OK, let's go to another problem where divide and conquer is 932 1:00:54.933 --> 1:00:57.716 useful, our good friend, matrix multiplication. 933 1:00:57.716 --> 1:01:01.164 I don't know how many times we've seen this in this class, 934 1:01:01.164 --> 1:01:04.37 but in particular we saw it last week with a recursive 935 1:01:04.37 --> 1:01:08 matrix multiply, multithreaded algorithm. 936 1:01:08 --> 1:01:11.708 So, I won't give you the algorithm yet again, 937 1:01:11.708 --> 1:01:16.176 but we're going to analyze it in a very different way. 938 1:01:16.176 --> 1:01:20.475 So, we have C and we have A, and actually up to you. 939 1:01:20.475 --> 1:01:24.521 So, I could cover standard matrix multiplication, 940 1:01:24.521 --> 1:01:30 which is when you do it row by row, and column by column. 941 1:01:30 --> 1:01:32.331 And, we could see why that's bad. 942 1:01:32.331 --> 1:01:36.485 And then, we could do the recursive one and see why that's 943 1:01:36.485 --> 1:01:39.036 good. Or, we could skip the standard 944 1:01:39.036 --> 1:01:41.951 algorithm. 
So, how many people would like 945 1:01:41.951 --> 1:01:44.866 to see why the standard algorithm is bad? 946 1:01:44.866 --> 1:01:47.198 Because it's not totally obvious. 947 1:01:47.198 --> 1:01:49.603 One, two, three, four, five, half? 948 1:01:49.603 --> 1:01:53.611 Wow, that's a lot of votes. Now, how many people want to 949 1:01:53.611 --> 1:01:55.433 skip to the chase? No one. 950 1:01:55.433 --> 1:01:58.129 One, OK. And, everyone else is asleep. 951 1:01:58.129 --> 1:02:01.19 So, that's pretty good, 50% awake, not bad. 952 1:02:01.19 --> 1:02:06 OK, then, so standard matrix multiplication. 953 1:02:06 --> 1:02:10.036 I'll do this fast because it is, I mean, you all know the 954 1:02:10.036 --> 1:02:13.207 algorithm, right? To compute this value of C; 955 1:02:13.207 --> 1:02:17.099 in A, you take this row, and in B you take this column. 956 1:02:17.099 --> 1:02:19.477 Sorry I did a little bit sloppily. 957 1:02:19.477 --> 1:02:21.927 But this is supposed to be aligned. 958 1:02:21.927 --> 1:02:24.378 Right? So I take all of this stuff, 959 1:02:24.378 --> 1:02:27.837 I multiply it with all of the stuff, add them up, 960 1:02:27.837 --> 1:02:31.949 the dot product. That gives me this element. 961 1:02:31.949 --> 1:02:35.487 And, let's say I do them in this order row by row. 962 1:02:35.487 --> 1:02:39.241 So for every item in C, I loop over this row and this 963 1:02:39.241 --> 1:02:41.624 column, B, multiply them together. 964 1:02:41.624 --> 1:02:44.151 That is an access pattern in memory. 965 1:02:44.151 --> 1:02:48.555 So, exactly how much that costs depends how these matrices are 966 1:02:48.555 --> 1:02:51.732 laid out in memory. OK, this is a subtlety we 967 1:02:51.732 --> 1:02:55.703 haven't had to worry about before because everything was 968 1:02:55.703 --> 1:02:58.519 uniform. 
I'm going to assume to give the 969 1:02:58.519 --> 1:03:02.057 standard algorithm the best chances of being good, 970 1:03:02.057 --> 1:03:05.956 I'm going to store C in row major order, A in row major 971 1:03:05.956 --> 1:03:10 order, and B in column major order. 972 1:03:10 --> 1:03:14.983 So, everything is nice and you're scanning. 973 1:03:14.983 --> 1:03:19.254 So then this inner product is a scan. 974 1:03:19.254 --> 1:03:21.389 Cool. Sounds great, 975 1:03:21.389 --> 1:03:24.711 doesn't it? It's bad, though. 976 1:03:24.711 --> 1:03:31 Assume A is row major, and B is column major. 977 1:03:31 --> 1:03:33.911 And C, you could assume is really either way, 978 1:03:33.911 --> 1:03:37.75 but if I'm doing it row by row, I'll assume it's row major. 979 1:03:37.75 --> 1:03:41.257 So, this is what I call the layout, the memory layout, 980 1:03:41.257 --> 1:03:43.904 of these matrices. OK, it's good for this 981 1:03:43.904 --> 1:03:46.551 algorithm, but the algorithm is not good. 982 1:03:46.551 --> 1:03:49 So, it won't be that great. 983 1:03:49 --> 1:04:12 984 1:04:12 --> 1:04:16.227 So, how long does this take? How many memory transfers? 985 1:04:16.227 --> 1:04:20.533 We know it takes N^3 time. Not going to try and beat N^3 986 1:04:20.533 --> 1:04:22.882 here. Just going to try and get 987 1:04:22.882 --> 1:04:26.249 standard matrix multiplication going faster. 988 1:04:26.249 --> 1:04:30.711 So, well, for each item over here I pay N over B to do the 989 1:04:30.711 --> 1:04:36.801 scans and get the inner product. So, N over B per item. 990 1:04:36.801 --> 1:04:42.659 So, it's N over B, or we could go with the plus 991 1:04:42.659 --> 1:04:49.408 one here, to compute each c_ij. So that would suggest, 992 1:04:49.408 --> 1:04:54.883 as an upper bound at least, it's N^3 over B. 993 1:04:54.883 --> 1:05:00.996 OK, and indeed that is the right bound, so theta. 994 1:05:00.996 --> 1:05:08 This is memory transfers, not time, obviously.
995 1:05:08 --> 1:05:12.349 That is indeed the case because if you look at consecutive, 996 1:05:12.349 --> 1:05:14.525 I do this c_ij, then this one, 997 1:05:14.525 --> 1:05:18.125 this one, this one, this one, keep incrementing j 998 1:05:18.125 --> 1:05:20.074 and keeping i fixed, right? 999 1:05:20.074 --> 1:05:23.824 So, the row that I use stays fixed for a long time. 1000 1:05:23.824 --> 1:05:27.875 I get to reuse that if it happens, say that that fits a 1001 1:05:27.875 --> 1:05:32.15 block maybe, I get to reuse that row several times if that 1002 1:05:32.15 --> 1:05:36.631 happens to fit in cache. But the column is changing 1003 1:05:36.631 --> 1:05:39.642 every single time. OK, so every time I move here 1004 1:05:39.642 --> 1:05:43.093 and compute the next c_ij, even if a column could fit in 1005 1:05:43.093 --> 1:05:45.79 cache, I can't fit all the columns in cache. 1006 1:05:45.79 --> 1:05:48.174 And the columns that I'm visiting move, 1007 1:05:48.174 --> 1:05:50.119 you know, they just scan across. 1008 1:05:50.119 --> 1:05:52.942 So, I'm scanning this whole matrix every time. 1009 1:05:52.942 --> 1:05:55.766 And unless your entire matrix fits in cache, 1010 1:05:55.766 --> 1:05:58.84 in which case you could do anything, I don't care, 1011 1:05:58.84 --> 1:06:02.353 it will take constant time, or you'll take M over B time, 1012 1:06:02.353 --> 1:06:05.302 enough to read it into the cache, do your stuff, 1013 1:06:05.302 --> 1:06:09.989 and write it back out. Except in that boring case, 1014 1:06:09.989 --> 1:06:14.115 you're going to have to pay N^2 over B for every row here 1015 1:06:14.115 --> 1:06:18.242 because you have to scan the whole collection of columns. 1016 1:06:18.242 --> 1:06:22.589 You have to read this entire matrix for every row over here. 1017 1:06:22.589 --> 1:06:26.494 So, you really do need N^3 over B for the whole thing. 1018 1:06:26.494 --> 1:06:30.043 So, it's usually a theta.
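To make the N^3 over B count concrete, here is a minimal sketch that replays the standard algorithm's access pattern through a tiny simulated cache and counts block transfers. Everything in it is an assumption for illustration: the names `LRUCache` and `naive_transfers` are hypothetical, the cache uses LRU replacement as a stand-in for the ideal cache of the model, and the three matrices are simply placed back to back in one address space, with A and C row major and B column major as in the lecture.

```python
from collections import OrderedDict


class LRUCache:
    """Toy cache: at most num_blocks blocks of block_size words, LRU eviction."""

    def __init__(self, block_size, num_blocks):
        self.block_size = block_size
        self.num_blocks = num_blocks
        self.blocks = OrderedDict()   # block id -> True, oldest first
        self.transfers = 0            # number of blocks read from slow memory

    def touch(self, address):
        block = address // self.block_size
        if block in self.blocks:
            self.blocks.move_to_end(block)        # mark as most recently used
            return
        self.transfers += 1                       # miss: one memory transfer
        self.blocks[block] = True
        if len(self.blocks) > self.num_blocks:
            self.blocks.popitem(last=False)       # evict least recently used


def naive_transfers(n, block_size, cache_words):
    """Count block transfers for the row-by-row triple loop on n-by-n matrices."""
    cache = LRUCache(block_size, cache_words // block_size)
    a_base, b_base, c_base = 0, n * n, 2 * n * n
    for i in range(n):
        for j in range(n):
            for k in range(n):
                cache.touch(a_base + i * n + k)   # A[i][k], row major
                cache.touch(b_base + j * n + k)   # B[k][j], column major
            cache.touch(c_base + i * n + j)       # C[i][j], row major
    return cache.transfers


if __name__ == "__main__":
    n, block_size, cache_words = 32, 8, 256
    print("transfers:", naive_transfers(n, block_size, cache_words))
    print("n^3 / B  :", n ** 3 // block_size)
```

With n = 32, B = 8 and M = 256, the B matrix alone is 1024 words, four times the cache, so every pass over its columns misses on every block and the count is at least 32^3 / 8 = 4096. When all three matrices fit in cache, only the compulsory reads remain.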
So, you might say, 1019 1:06:30.043 --> 1:06:32.766 well, that's great. It's the size of my problem, 1020 1:06:32.766 --> 1:06:34.852 the usual running time, divided by B. 1021 1:06:34.852 --> 1:06:38.329 And that was the case when we were thinking about linear time, 1022 1:06:38.329 --> 1:06:41.168 N versus N over B. It's hard to beat N over B when 1023 1:06:41.168 --> 1:06:44.066 your problem is of size N. But now we have a cubed. 1024 1:06:44.066 --> 1:06:47.137 And, this gets back to, we have good spatial locality. 1025 1:06:47.137 --> 1:06:49.687 When we read a block, we use the whole thing. 1026 1:06:49.687 --> 1:06:51.019 Great. It seems optimal. 1027 1:06:51.019 --> 1:06:53.337 But we don't have good temporal locality. 1028 1:06:53.337 --> 1:06:56.35 It could be that maybe if we stored the right things, 1029 1:06:56.35 --> 1:06:59.074 we kept them around, we could use them several times 1030 1:06:59.074 --> 1:07:04 because we're using each element like a cubed number of times. 1031 1:07:04 --> 1:07:08.99 That's not the right way of saying it, but we're reusing the 1032 1:07:08.99 --> 1:07:11.951 matrices a lot, reusing those items. 1033 1:07:11.951 --> 1:07:16.942 If we are doing N^3 work on N^2 things, we're reusing a lot. 1034 1:07:16.942 --> 1:07:21.933 So, we want to do better than this, and that's the recursive 1035 1:07:21.933 --> 1:07:26.416 algorithm, which we've seen. So, we know the algorithm 1036 1:07:26.416 --> 1:07:29.8 pretty much. I just have to tell you what 1037 1:07:29.8 --> 1:07:36.588 the layout is. So, we're going to take C, 1038 1:07:36.588 --> 1:07:42.941 partition it into C_11, C_12, and so on. 1039 1:07:42.941 --> 1:07:52.647 So, I have an N by N matrix, and I'm partitioning into N 1040 1:07:52.647 --> 1:08:02.176 over 2 by N over 2 submatrices, all three of them times 1041 1:08:02.176 --> 1:08:07.377 whatever. And, I could write this out yet 1042 1:08:07.377 --> 1:08:11.058 again but I won't.
OK, we can recursively compute 1043 1:08:11.058 --> 1:08:15.2 this thing with eight matrix multiplies, and a bunch of 1044 1:08:15.2 --> 1:08:18.191 matrix additions. I don't care how many, 1045 1:08:18.191 --> 1:08:22.256 but a constant number. We see that at least twice now, 1046 1:08:22.256 --> 1:08:26.091 so I won't show it again. Now, how do I lay out the 1047 1:08:26.091 --> 1:08:29.005 matrices? Any suggestions how I lay out 1048 1:08:29.005 --> 1:08:32.979 the matrices? I could lay them out in row 1049 1:08:32.979 --> 1:08:35.693 major order. I'll call it major order. 1050 1:08:35.693 --> 1:08:38.186 But that might be less natural now. 1051 1:08:38.186 --> 1:08:42 We're not doing anything by rows or by columns. 1052 1:08:42 --> 1:08:59 1053 1:08:59 --> 1:09:03.014 So, what layout should I use? Yeah? 1054 1:09:03.014 --> 1:09:08.446 Quartet major order, maybe quadrant major order 1055 1:09:08.446 --> 1:09:12.933 unless you're musically inclined, yeah. 1056 1:09:12.933 --> 1:09:17.42 Good idea. You've never seen this order 1057 1:09:17.42 --> 1:09:21.671 before, so it's maybe not so natural. 1058 1:09:21.671 --> 1:09:26.158 Somehow I want to cluster it by blocks. 1059 1:09:26.158 --> 1:09:33.402 OK, I think that's about all. So, I mean, it's a recursive 1060 1:09:33.402 --> 1:09:36.576 layout. This was not an easy question. 1061 1:09:36.576 --> 1:09:39.751 It's OK. Store matrices or lay out the 1062 1:09:39.751 --> 1:09:44.899 matrices recursively by block. OK, I'm cheating a little bit. 1063 1:09:44.899 --> 1:09:49.961 I'm redefining the problem to say, assume that your matrices 1064 1:09:49.961 --> 1:09:54.68 are laid out in this way. But, it doesn't really matter. 1065 1:09:54.68 --> 1:09:56.568 We can cheat, can't we? 1066 1:09:56.568 --> 1:10:02.276 In fact, it doesn't matter. You can turn a matrix into this 1067 1:10:02.276 --> 1:10:06.315 layout without too much linear work, almost linear work. 1068 1:10:06.315 --> 1:10:07.637 Log factors, maybe. 
1069 1:10:07.637 --> 1:10:11.676 OK, so if I want to store my matrix A as a linear thing, 1070 1:10:11.676 --> 1:10:15.274 I'm going to recursively define that layout to be: 1071 1:10:15.274 --> 1:10:19.019 recursively store the upper left corner, then store, 1072 1:10:19.019 --> 1:10:21.442 let's say, the upper right corner. 1073 1:10:21.442 --> 1:10:24.38 It doesn't matter which order I do these. 1074 1:10:24.38 --> 1:10:28.492 I should have drawn this wider, then store the lower left 1075 1:10:28.492 --> 1:10:34 corner, and then store the lower right corner recursively. 1076 1:10:34 --> 1:10:38.025 So, how do you store this? Well, you divide it in four, 1077 1:10:38.025 --> 1:10:40.634 and lay out the top left, and so on. 1078 1:10:40.634 --> 1:10:44.511 OK, this is a recursive definition of how the elements 1079 1:10:44.511 --> 1:10:47.046 should be stored in a linear array. 1080 1:10:47.046 --> 1:10:50.326 It's a weird one, but this is a very powerful 1081 1:10:50.326 --> 1:10:52.861 idea in cache oblivious algorithms. 1082 1:10:52.861 --> 1:10:57.408 We'll use this multiple times. OK, so now all we have to do is 1083 1:10:57.408 --> 1:11:00.241 analyze the number of memory transfers. 1084 1:11:00.241 --> 1:11:05.066 How hard could it be? So, we're going to store all 1085 1:11:05.066 --> 1:11:08.978 the matrices in this order, and we want to compute the 1086 1:11:08.978 --> 1:11:12.373 number of memory transfers on an N by N matrix. 1087 1:11:12.373 --> 1:11:15.547 See, I lapsed and I switched to lowercase n. 1088 1:11:15.547 --> 1:11:19.902 I should, throughout this week, be using uppercase N because 1089 1:11:19.902 --> 1:11:23.666 for historical reasons, any external memory kinds of 1090 1:11:23.666 --> 1:11:28.095 algorithms, two-level algorithms, always talk about capital N. 1091 1:11:28.095 --> 1:11:31.785 And, don't ask why. You should see what they define 1092 1:11:31.785 --> 1:11:37.995 little n to be.
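The recursive layout can be sketched as a function from a matrix position to an offset in the linear array. This is a minimal illustration, assuming N is a power of two and using the quadrant order from the lecture (upper left, upper right, lower left, lower right); `recursive_index` is a hypothetical name, not something defined in the course.

```python
def recursive_index(i, j, n):
    """Offset of entry (i, j) in the recursive block layout of an n-by-n matrix.

    Assumes n is a power of two. Quadrants are stored in the order
    upper left, upper right, lower left, lower right, each laid out recursively.
    """
    if n == 1:
        return 0
    half = n // 2
    quadrant = 2 * (i >= half) + (j >= half)      # 0 = UL, 1 = UR, 2 = LL, 3 = LR
    # Skip the quadrants stored before this one, then recurse inside it.
    return quadrant * half * half + recursive_index(i % half, j % half, half)


if __name__ == "__main__":
    n = 4
    for i in range(n):
        print([recursive_index(i, j, n) for j in range(n)])
```

Each quadrant occupies one contiguous run of the array, which is the property the analysis relies on: every recursive subproblem touches a consecutive chunk of memory.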
OK, so, any suggestions on what 1093 1:11:37.995 --> 1:11:45.342 the recurrence should be now? With all this fancy setup, the 1094 1:11:45.342 --> 1:11:49.724 recurrence is actually pretty easy. 1095 1:11:49.724 --> 1:11:57.071 So, definitely it involves multiplying matrices that are N 1096 1:11:57.071 --> 1:12:03 over 2 by N over 2. So, what goes here? 1097 1:12:03 --> 1:12:05.752 Eight, thank you. That you should know. 1098 1:12:05.752 --> 1:12:08.793 And the tricky part is what goes here. 1099 1:12:08.793 --> 1:12:12.487 OK, what goes here is, now, the fact that I can even 1100 1:12:12.487 --> 1:12:15.384 write this, this is the matrix additions. 1101 1:12:15.384 --> 1:12:18.788 Ignore those for now. Suppose there weren't any. 1102 1:12:18.788 --> 1:12:21.323 I just have to recursively multiply. 1103 1:12:21.323 --> 1:12:25.74 The fact that this actually is eight times memory transfers of 1104 1:12:25.74 --> 1:12:30.67 N over 2 relies on this layout. Right, I'm assuming that the 1105 1:12:30.67 --> 1:12:34.129 arrays that I'm given are given as contiguous intervals in 1106 1:12:34.129 --> 1:12:35.442 memory. If they aren't, 1107 1:12:35.442 --> 1:12:38.066 I mean, if they're scattered all over memory, 1108 1:12:38.066 --> 1:12:40.273 I'm screwed. There's nothing I can do. 1109 1:12:40.273 --> 1:12:43.435 So, but by assuming that I have this recursive layout, 1110 1:12:43.435 --> 1:12:46.835 I know that the recursive multiplies will always deal with 1111 1:12:46.835 --> 1:12:49.519 three consecutive chunks of memory, one for A, 1112 1:12:49.519 --> 1:12:52.203 one for B, one for C, OK, no matter what I do. 1113 1:12:52.203 --> 1:12:54.47 Because these are stored consecutively, 1114 1:12:54.47 --> 1:12:56.438 recursively I have that invariant. 1115 1:12:56.438 --> 1:12:59.54 And I can keep recursing. And I'm always dealing with 1116 1:12:59.54 --> 1:13:03 three consecutive chunks of memory.
1117 1:13:03 --> 1:13:08.328 That's why I need this layout is to be able to say this. 1118 1:13:08.328 --> 1:13:11.332 OK, Now what does addition cost? 1119 1:13:11.332 --> 1:13:14.335 I'll just give you two matrices. 1120 1:13:14.335 --> 1:13:19.858 They're stored in some linear order, the same linear order 1121 1:13:19.858 --> 1:13:25.186 among the three of them. Do I care what the linear order 1122 1:13:25.186 --> 1:13:28.384 is? How should I add two matrices, 1123 1:13:28.384 --> 1:13:31 get the output? 1124 1:13:31 --> 1:13:42 1125 1:13:42 --> 1:13:43 Yeah? 1126 1:13:43 --> 1:13:51 1127 1:13:51 --> 1:13:54.85 Right, if each of the three arrays I'm dealing with are 1128 1:13:54.85 --> 1:13:58.559 stored in the same order, I can just scan in parallel 1129 1:13:58.559 --> 1:14:02.909 through all three of them and just add corresponding elements, 1130 1:14:02.909 --> 1:14:07.045 and output it to the third. So, I don't care what the order 1131 1:14:07.045 --> 1:14:10.682 is, as long as it's consistent and I get N^2 over B. 1132 1:14:10.682 --> 1:14:14.39 I'll ignore plus one here. That's just looking at the 1133 1:14:14.39 --> 1:14:16.529 entire matrix. So, there we go: 1134 1:14:16.529 --> 1:14:19.667 another recurrence. We've seen this with N^2, 1135 1:14:19.667 --> 1:14:23.09 and we just got N^3. But, it turns out now we get 1136 1:14:23.09 --> 1:14:26.371 something cooler if we use the right base case. 1137 1:14:26.371 --> 1:14:30.008 So now we get to the base case, ah, the tricky part. 1138 1:14:30.008 --> 1:14:35 So, any suggestions what base case I should use? 1139 1:14:35 --> 1:14:36.672 The block size, good suggestion. 1140 1:14:36.672 --> 1:14:38.829 So, if we have something of size order B, 1141 1:14:38.829 --> 1:14:41.85 we know that takes a constant number of memory transfers. 1142 1:14:41.85 --> 1:14:44.871 It turns out that's not enough. That won't solve it here. 1143 1:14:44.871 --> 1:14:46.381 But good guess. 
In this case, 1144 1:14:46.381 --> 1:14:49.294 it's not the right answer. I'll give you some intuition 1145 1:14:49.294 --> 1:14:51.182 why. We are trying to improve on N^3 1146 1:14:51.182 --> 1:14:53.178 over B. If you were just trying to get 1147 1:14:53.178 --> 1:14:55.443 it divided by B, this is a great base case. 1148 1:14:55.443 --> 1:14:58.572 But here, we know that just the improvement afforded by the 1149 1:14:58.572 --> 1:15:03.244 block size is not enough. We have to somehow use the fact 1150 1:15:03.244 --> 1:15:06.864 that the cache is big. It's M, so however big M is, 1151 1:15:06.864 --> 1:15:09.977 it's that big. OK, so if we want to get some 1152 1:15:09.977 --> 1:15:13.307 improvement on this, we've got to have M in the 1153 1:15:13.307 --> 1:15:16.276 formula somewhere, and there's no M's yet. 1154 1:15:16.276 --> 1:15:19.027 So, it's got to involve M. What's that? 1155 1:15:19.027 --> 1:15:21.271 MT of M over B? That would work, 1156 1:15:21.271 --> 1:15:25.108 but MT of M is also OK, I mean, some constant times M, 1157 1:15:25.108 --> 1:15:27.859 let's say. I want to make this constant 1158 1:15:27.859 --> 1:15:33 small enough so that the entire problem fits in cache. 1159 1:15:33 --> 1:15:37.006 So, it's like one third. I think it's actually, 1160 1:15:37.006 --> 1:15:40.837 oh wait, is it the square root of M actually? 1161 1:15:40.837 --> 1:15:43.537 Right, this is an N by N matrix. 1162 1:15:43.537 --> 1:15:47.456 So, it should be C times the square root of M. 1163 1:15:47.456 --> 1:15:50.33 Sorry. So, the square root of M by 1164 1:15:50.33 --> 1:15:53.552 square root of M matrix has M entries. 1165 1:15:53.552 --> 1:15:58.603 If I make C like one third or something, then I can fit all 1166 1:15:58.603 --> 1:16:04.372 three matrices in memory. Actually, one over square root 1167 1:16:04.372 --> 1:16:06.903 of three would do, but who cares? 
1168 1:16:06.903 --> 1:16:10.621 So, for some constant, C, now everything fits in 1169 1:16:10.621 --> 1:16:13.548 memory. How many memory transfers does 1170 1:16:13.548 --> 1:16:14.497 it take? One? 1171 1:16:14.497 --> 1:16:18.451 It's a bit too small, because I do have to read the 1172 1:16:18.451 --> 1:16:20.587 problem in. And now, I mean, 1173 1:16:20.587 --> 1:16:24.621 here was one because there's only one block to read. 1174 1:16:24.621 --> 1:16:27.548 Now how many blocks are there to read? 1175 1:16:27.548 --> 1:16:30 Constant? No. 1176 1:16:30 --> 1:16:30.369 B? No. 1177 1:16:30.369 --> 1:16:33.255 M over B, good. Get it right eventually. 1178 1:16:33.255 --> 1:16:37.102 That's the great thing about thinking with an oracle. 1179 1:16:37.102 --> 1:16:41.318 You can just keep guessing. M over B because we have cache 1180 1:16:41.318 --> 1:16:43.908 size M. There are M over B blocks in 1181 1:16:43.908 --> 1:16:46.201 that cache to read each one, OK? 1182 1:16:46.201 --> 1:16:49.382 This is maybe, you forgot what M was because 1183 1:16:49.382 --> 1:16:51.897 we haven't used it for a long time. 1184 1:16:51.897 --> 1:16:54.857 But M is the number of elements in cache. 1185 1:16:54.857 --> 1:16:59 This is the number of blocks in cache. 1186 1:16:59 --> 1:17:02.537 OK, someone was saying B, and it's reasonable to assume 1187 1:17:02.537 --> 1:17:05.944 that M over B is about B. That's like a square cache, 1188 1:17:05.944 --> 1:17:08.892 but in general, we don't make that assumption. 1189 1:17:08.892 --> 1:17:11.381 OK, where are we? We're hopefully done, 1190 1:17:11.381 --> 1:17:14.46 just about, good, because we have three minutes. 1191 1:17:14.46 --> 1:17:17.801 So, that's our base case. I have a square root here; 1192 1:17:17.801 --> 1:17:20.815 I just forgot it. Now we just have to solve it. 1193 1:17:20.815 --> 1:17:23.435 Now, this is an easier recurrence, right?
1194 1:17:23.435 --> 1:17:27.497 I don't want to use the master method, because master method is 1195 1:17:27.497 --> 1:17:31.296 not going to handle these B's and M's, and these crazy base 1196 1:17:31.296 --> 1:17:35.271 cases. OK, master method would prove 1197 1:17:35.271 --> 1:17:36.054 N^3. Great. 1198 1:17:36.054 --> 1:17:40.283 Master method doesn't really think about these kinds of 1199 1:17:40.283 --> 1:17:42.789 cases. But with recursion trees, 1200 1:17:42.789 --> 1:17:47.331 if you remember way back to the proof of the master method, 1201 1:17:47.331 --> 1:17:52.03 just look at the recursion tree as geometric up or down or where 1202 1:17:52.03 --> 1:17:55.945 everything is equal, and then you just add them up, 1203 1:17:55.945 --> 1:17:59 every level. The point is that this is a 1204 1:17:59 --> 1:18:02.68 nice recurrence. All of the sub problems are the 1205 1:18:02.68 --> 1:18:05.891 same size, and that analysis always works, 1206 1:18:05.891 --> 1:18:12 I say, when everything has the same size, all the children. 1207 1:18:12 --> 1:18:18.857 So, here's the recursion tree. We have N^2 over B at the top. 1208 1:18:18.857 --> 1:18:24.114 We split into eight subproblems where each one, 1209 1:18:24.114 --> 1:18:27.657 the cost is one half N^2 over B. 1210 1:18:27.657 --> 1:18:32 I'm not going to write them all. 1211 1:18:32 --> 1:18:34.716 There they are. You add them up. 1212 1:18:34.716 --> 1:18:38.921 How much do you get? Well, there's eight of them. 1213 1:18:38.921 --> 1:18:41.637 Eight times a half is two. Four. 1214 1:18:41.637 --> 1:18:44.265 [LAUGHTER] Thanks. Four, right? 1215 1:18:44.265 --> 1:18:48.909 OK, I'm bad at arithmetic. I probably already said it, 1216 1:18:48.909 --> 1:18:52.676 but there are three kinds of mathematicians, 1217 1:18:52.676 --> 1:18:56.006 those who can add, and those who can't. 1218 1:18:56.006 --> 1:19:01 OK, why am I looking at this? It's obvious. 1219 1:19:01 --> 1:19:03.8 OK, so we keep going.
This looks geometrically 1220 1:19:03.8 --> 1:19:04.858 increasing. Right? 1221 1:19:04.858 --> 1:19:08.405 You just know in your heart that if you work out the first 1222 1:19:08.405 --> 1:19:12.263 two levels, you can tell whether it's geometrically increasing, 1223 1:19:12.263 --> 1:19:15.437 decreasing, or they're all equal, or something else. 1224 1:19:15.437 --> 1:19:18.984 And then you better think. But I see this as geometrically 1225 1:19:18.984 --> 1:19:21.412 increasing. It will indeed be like 16 at 1226 1:19:21.412 --> 1:19:22.843 the next level, I guess. 1227 1:19:22.843 --> 1:19:25.145 OK, it should be. So, it's increasing. 1228 1:19:25.145 --> 1:19:30 That means the leaves matter. So, let's work out the leaves. 1229 1:19:30 --> 1:19:33.96 And, this is where we use our base case. 1230 1:19:33.96 --> 1:19:38.63 So, we have a problem of size square root of M. 1231 1:19:38.63 --> 1:19:41.981 And so, yeah, you have a question? 1232 1:19:41.981 --> 1:19:45.84 Oh, indeed. I knew there was something. 1233 1:19:45.84 --> 1:19:50.003 I knew it was supposed to be two out here. 1234 1:19:50.003 --> 1:19:53.15 Thanks. This is why you're here. 1235 1:19:53.15 --> 1:19:57.11 It's actually N over two squared over B. 1236 1:19:57.11 --> 1:20:00.867 Thanks. I'm substituting N over 2 into 1237 1:20:00.867 --> 1:20:04.9 this. OK, so this is actually N^2 1238 1:20:04.9 --> 1:20:06.519 over 4 B. So, I get two, 1239 1:20:06.519 --> 1:20:09.546 because there are eight times one over four. 1240 1:20:09.546 --> 1:20:13.417 OK, I wasn't that far off then. It's still geometrically 1241 1:20:13.417 --> 1:20:15.529 increasing, still the case, OK? 1242 1:20:15.529 --> 1:20:17.992 But now, it actually doesn't matter. 1243 1:20:17.992 --> 1:20:21.371 Whatever the cost is, as long as it's bigger than 1244 1:20:21.371 --> 1:20:23.975 one, great. Now we look at the leaves. 1245 1:20:23.975 --> 1:20:26.157 The leaves are root M by root M. 
1246 1:20:26.157 --> 1:20:29.958 I substitute root M into this: I get M over B with some 1247 1:20:29.958 --> 1:20:32.903 constants. Who cares? 1248 1:20:32.903 --> 1:20:36.787 So, each leaf is M over B, OK, lots of them. 1249 1:20:36.787 --> 1:20:40.038 How many are there? Whenever you 1250 1:20:40.038 --> 1:20:45.006 deal with recursion trees, counting the number of leaves 1251 1:20:45.006 --> 1:20:48.709 is always the annoying part. Oh boy, well, 1252 1:20:48.709 --> 1:20:53.948 we start with an N by N matrix. We stop when we get down to 1253 1:20:53.948 --> 1:21:00 a root M by root M matrix. So, that sounds like something. 1254 1:21:00 --> 1:21:04.141 Oh boy, I'm cheating here. Really? 1255 1:21:04.141 --> 1:21:07.905 That many? It sounds plausible. 1256 1:21:07.905 --> 1:21:11.921 OK, the claim is, and I'll cheat. 1257 1:21:11.921 --> 1:21:19.45 So I'm going to use the oracle here, and we'll figure out why 1258 1:21:19.45 --> 1:21:24.47 this is the case. N over root M, cubed, leaves, 1259 1:21:24.47 --> 1:21:27.231 hey what? I think here, 1260 1:21:27.231 --> 1:21:33.979 it's hard to see the tree. But it's easy to see in the 1261 1:21:33.979 --> 1:21:36.178 matrix. Let's enter the matrix. 1262 1:21:36.178 --> 1:21:39.256 We have our big matrix. We divide it in half. 1263 1:21:39.256 --> 1:21:43.654 We recursively divide in half. We recursively divide in half. 1264 1:21:43.654 --> 1:21:45.12 You get the idea, OK? 1265 1:21:45.12 --> 1:21:49.151 Now, at some point these sectors, let's say one of these 1266 1:21:49.151 --> 1:21:52.743 sectors, and each of these sectors, fits in cache. 1267 1:21:52.743 --> 1:21:56.994 And three of them fit in cache. So, that's when we stop the 1268 1:21:56.994 --> 1:22:02.32 recursion in the analysis. The algorithm goes all the way. 1269 1:22:02.32 --> 1:22:05.538 But in the analysis, let's say we stop at M. 1270 1:22:05.538 --> 1:22:08.981 OK, now, how many leaves or problems are there?
1271 1:22:08.981 --> 1:22:11.451 Oh man, this is still not obvious. 1272 1:22:11.451 --> 1:22:14.669 OK, the number of leaf chunks here is, like, 1273 1:22:14.669 --> 1:22:19.01 I mean, the number of these things is something like N over 1274 1:22:19.01 --> 1:22:21.629 root M, right, the number of chunks. 1275 1:22:21.629 --> 1:22:26.195 But, it's a little less clear because I have so many of these. 1276 1:22:26.195 --> 1:22:28.964 But, all right, so let's just suppose, 1277 1:22:28.964 --> 1:22:32.856 now, I think of normal, boring, matrix multiplication 1278 1:22:32.856 --> 1:22:38.119 on chunks of this size. That's essentially what the 1279 1:22:38.119 --> 1:22:42.2 leaves should tell me. I start with this big problem, 1280 1:22:42.2 --> 1:22:45.261 I recurse out to all these little, tiny, 1281 1:22:45.261 --> 1:22:48.95 multiply this by that, OK, this root M by root M 1282 1:22:48.95 --> 1:22:51.305 chunk. OK, how many operations, 1283 1:22:51.305 --> 1:22:54.68 how many multiplies do I do on those things? 1284 1:22:54.68 --> 1:22:57.034 N^3. But now, N, the size of my 1285 1:22:57.034 --> 1:23:00.488 matrix in terms of these little sub matrices, 1286 1:23:00.488 --> 1:23:05.859 is N over root M. So, it should be N over root 1287 1:23:05.859 --> 1:23:10.76 M^3 subproblems of this size. If you work it out, 1288 1:23:10.76 --> 1:23:16.478 normally we go down to things of constant size and we get 1289 1:23:16.478 --> 1:23:21.278 exactly N^3 of them. Now we are stopping at this 1290 1:23:21.278 --> 1:23:26.485 short point in saying, well, it's however many there 1291 1:23:26.485 --> 1:23:30.161 are, cubed. OK, this is a bit of hand 1292 1:23:30.161 --> 1:23:35.352 waving. You could work it out with the 1293 1:23:35.352 --> 1:23:39.151 recurrence on the number of leaves. 1294 1:23:39.151 --> 1:23:44.18 But there it is. So, the total here is N over, 1295 1:23:44.18 --> 1:23:49.656 let's work it out. 
N^3 over M to the three halves, 1296 1:23:49.656 --> 1:23:56.025 that's this number of leaves, times the cost at each leaf, 1297 1:23:56.025 --> 1:24:01.054 which is M over B. So, some of the M's cancel, 1298 1:24:01.054 --> 1:24:07.759 and we get N^3 over B root M, which is a root M factor better 1299 1:24:07.759 --> 1:24:13.433 than N^3 over B. It's actually quite a lot, 1300 1:24:13.433 --> 1:24:16.522 the square root of the cache size. 1301 1:24:16.522 --> 1:24:20.359 That is optimal. The best two-level matrix 1302 1:24:20.359 --> 1:24:26.162 multiplication algorithm is N^3 over B root M memory transfers. 1303 1:24:26.162 --> 1:24:30 Pretty amazing, and I'm over time. 1304 1:24:30 --> 1:24:34.979 You can generalize this into all sorts of great things, 1305 1:24:34.979 --> 1:24:39.959 but the bottom line is this is a great way to do matrix 1306 1:24:39.959 --> 1:24:45.308 multiplication as a recursion. We'll see more recursion for 1307 1:24:45.308 --> 1:24:48 cache oblivious algorithms on Wednesday.
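For reference, the whole recursion-tree argument can be written out in one place. This is a sketch of the hand-waved version from the lecture, not a formal proof: c is the small constant that makes three submatrices fit in cache, k is the level in the tree, and k* is the depth at which the subproblems reach size c root M.

```latex
\begin{align*}
MT(N) &= 8\,MT\!\left(\tfrac{N}{2}\right) + \Theta\!\left(\tfrac{N^2}{B}\right),
  \qquad MT\!\left(c\sqrt{M}\right) = \Theta\!\left(\tfrac{M}{B}\right) \\
\text{level } k: \quad & 8^k \cdot \frac{(N/2^k)^2}{B} = 2^k\,\frac{N^2}{B}
  \quad \text{(geometrically increasing, so the leaves dominate)} \\
\text{depth } k^*: \quad & \frac{N}{2^{k^*}} = c\sqrt{M}
  \;\Longrightarrow\; \#\text{leaves} = 8^{k^*}
  = \left(\frac{N}{c\sqrt{M}}\right)^{3}
  = \Theta\!\left(\frac{N^3}{M^{3/2}}\right) \\
MT(N) &= \Theta\!\left(\frac{N^3}{M^{3/2}}\right) \cdot \Theta\!\left(\frac{M}{B}\right)
  = \Theta\!\left(\frac{N^3}{B\sqrt{M}}\right)
\end{align*}
```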