The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

CHARLES LEISERSON: So today, more parallel programming, as we will do for the next couple of lectures as well. Today we're going to look at how to analyze multithreaded algorithms, and I'm going to start out with a review of what I hope most of you know from 6.006 or 6.046, which is how to solve divide-and-conquer recurrences. Now, we know that we can solve them with recursion trees, and that gets tedious after a while, so I want to go through the so-called Master Method to begin with, and then we'll get into the content of the lecture. It will be very helpful, since we're going to do so many divide-and-conquer recurrences. The difference between these divide-and-conquer recurrences and the ones for caching is that caching is made tricky by the base condition.
Here, all the recurrences are going to be nice and clean, just like you learned in your algorithms class. So we'll start by talking about the method, and then we'll go through several examples of analyzing algorithms. It will also tell us something about what we need to do to make our code go fast.

The main method we're going to use is called the Master Method. It's for solving recurrences of the form T(n) = a T(n/b) + f(n), with some technical conditions: a >= 1, b > 1, and f is asymptotically positive, meaning that as n gets large, f(n) becomes positive. When we give a recurrence like this, if the base case is order 1, it's conventional not to give it; we just assume that when n is small enough, the result is constant. As I say, that's the place where this differs from the way we solve recurrences for caching, where you have to worry about what the base case of the recurrence is.

The way to solve this is, in fact, a way we've seen before: a recursion tree.
We start with T(n), and then we replace T(n) by the right-hand side, just by substitution. What's always going to be in the tree as we develop it is the total amount of work. So we basically replace it with f(n) plus a copies of T(n/b). Then each of those we replace with a copies of T(n/b^2), and so forth, continually replacing until we get down to T(1). At T(1) we can no longer substitute, but we know that T(1) is order 1. Now what we do is add across the rows. We get f(n), then a f(n/b), then a^2 f(n/b^2), and we keep going to the height of the tree. Since we divide the argument by b each time, the height, to get down to 1, is just log base b of n. The number of leaves, since this is a regular a-ary tree, is a to the height, which is a to the log base b of n. And for each of those leaves, we're paying T(1), which is order 1. Now, it turns out that if I add up all these terms, there's no closed-form solution.
But there are three common situations that occur in practice. (And yes, the number of leaves is just n to the log base b of a; just that term, not the sum.) So the three cases have to do with comparing the number of leaves, times order 1, with f(n).

The first case is when n to the log base b of a is bigger than f(n). So whenever you're given a recurrence, compute n to the log base b of a. (I hope this is a repeat for most people. If not, that's fine; hopefully it'll get you caught up.) If n to the log base b of a is much bigger than f(n), then the terms level by level are geometrically increasing, and since they're geometrically increasing, all that matters is the base case, the leaves. In fact, it has to be not just greater, it's got to be greater by a polynomial amount, by a factor of n to the epsilon for some epsilon greater than 0. So it might be n to the 1/2, it might be n to the 1/3, it could be n to the 100th. But what it can't be is log n, because log n is less than any polynomial amount. So it's got to exceed f(n) by at least n to the epsilon for some epsilon.
In that case, geometrically increasing, the answer is just what's at the leaves. So that's case one: geometrically increasing. Case two is when things are actually fairly equal on every level, and the general case we'll look at is when it's arithmetically increasing. In particular, this occurs when f(n) is n to the log base b of a times log to the k-th power of n, for some constant k that's at least 0. If k equals 0, it just says that f(n) is exactly the same as the number of leaves. In that case, it turns out that every level has almost exactly the same amount, and since there are log n levels, you tack on an extra log n for the solution. In fact, the solution is one more log: it turns out that if it's growing arithmetically with the layer, you basically tack on one extra log. It's like taking the integral of an arithmetic series: if you're adding up terms like i squared, the result is like the cube.
If you have a summation that goes from i = 1 to n of i squared, the result is proportional to n cubed. And similarly, if it's i to any power k, the result is going to be proportional to n to the k + 1. That's basically what's going on here.

And then the third case is when it's geometrically decreasing, when the amount at the root dominates. In this case, n to the log base b of a is much less than f(n); specifically, it's smaller by at least a factor of n to the epsilon, for some constant epsilon. It turns out that in addition you need f(n) to satisfy a regularity condition, but this regularity condition is satisfied by all the normal functions that we're going to come up against. It's not satisfied by things like n to the sine n, which oscillates like crazy, and it also isn't satisfied by exponentially growing functions. But it is satisfied by anything that's polynomial, or polynomial times a logarithm, or what have you. So generally, we don't really have to check it too carefully. And the answer there is just order f(n): because the levels are geometrically decreasing, f(n) dominates. So is this review for everybody?
Pretty much, yeah? You can do this in your head? Because we're going to ask you to do this in your head during the lecture. Yeah, we're all good? OK, good.

One of the things that students who learn this in an algorithms class don't recognize is that it also tells you where in your recursive program you should bother to try to eke out constant factors. So if you think about it: in case three here, for example, it's geometrically decreasing. Does it make sense to try to optimize the leaves? No, because very little time is spent there. It makes sense to optimize what's going on at the root, and to save anything you can at the root. And sometimes the root in particular has special properties that aren't true of the internal nodes, properties you can take advantage of there that you may not be able to take advantage of in general. But since the time is going to be dominated by the root, trying to save at the root makes sense. Correspondingly, if we're in case one, it's absolutely critical that you coarsen the recursion, because all the work is down at the leaf level.
And so if you want to get additional performance, you basically want to move the base case up high enough that you can cut off that constant overhead and get factors of two, three, sometimes more, out of your code. So understanding the structure of the recursion allows you to figure out where it is that you should optimize your code. Of course, with loops it's much easier. Where do you spend your time with loops to make code go fast? The innermost loop, right, because that's the one that's executing the most; the outer loops are not that important. This is the corresponding principle for recursion: figure out where the recursion is spending its time, and that's where you spend your time eking out extra factors.

Here's the cheat sheet. If f(n) is n to the log base b of a minus epsilon, the answer is n to the log base b of a. If it's n to the log base b of a plus epsilon, it's order f(n). And if it's n to the log base b of a times a logarithmic factor log^k n, where the exponent k is greater than or equal to 0, you add 1 to the exponent of the log. This is not all of the situations.
There are situations this doesn't cover. OK, quick quiz. T(n) = 4T(n/2) + n. What's the solution? n squared, good. So n to the log base b of a is n to the log base 2 of 4, which is n squared. That's much bigger than n; it's bigger by a factor of n. Here an epsilon of 1 would do, and so would an epsilon of 1/2 or 1/4, but in particular an epsilon of 1 would do. That's case one. The n squared dominates, so the answer is order n squared. The basic idea is that in case one and case three, whichever side dominates is the answer. Here we go, what about this one? n squared log n, because the two sides are about the same size: it's n squared times log to the 0 of n, so tack on the extra log. How about this one? n cubed. How about this one?

AUDIENCE: [INAUDIBLE]. Master Theorem [INAUDIBLE]?

CHARLES LEISERSON: Yeah, the Master Theorem does not apply to this one.
It looks like it's case two with an exponent of minus 1, but that's bogus, because the exponent of the log must be greater than or equal to 0. So the Master Theorem does not apply. The recurrence actually has the solution n squared log log n, but that's not covered by the Master Theorem; you can have an infinite hierarchy of ever-narrower cases like this. So if you have something that looks like a Master Theorem type of recurrence but the theorem doesn't give you a solution, what's your best strategy for solving it?

AUDIENCE: [INAUDIBLE].

CHARLES LEISERSON: What's that?

AUDIENCE: [INAUDIBLE].

CHARLES LEISERSON: So a recursion tree can be good, but actually the best is the substitution method, basically proving it by induction. The recursion tree can be very helpful in giving you a good guess for what you think the answer is, but the most reliable way to prove any of these things is the substitution method. Good enough. So that was review for, I hope, most people? Yeah? OK, good.

Let's talk about parallel programming. We're going to start out with loops.
So last time, we talked about how the Cilk++ runtime system is based, essentially, on implementing spawns and syncs using the work-stealing algorithm, and we talked about scheduling and so forth. We didn't talk about how loops are implemented, except to mention that they are implemented with divide and conquer. So here I want to go into the subtleties of loops, because probably most parallel programs that occur in the real world these days are programs where people just simply say, make this a parallel loop. That's it.

So let's take the example of in-place matrix transpose, where we're basically trying to flip everything along the main diagonal. I've used this figure before, I think. Let's do it not cache-efficiently. The cache-efficient, divide-and-conquer algorithm actually parallelizes beautifully as well, but let's look at a looping version to understand what's going on. And once again, as I did before, I'm going to make the indices for my implementation run from 0, not 1.
Basically, I have an outer loop that goes from i = 1 up to n - 1, and an inner loop that goes from j = 0 up to i - 1, and then I do a little swap in there. The outer loop I've parallelized; the inner loop is running serially.

So let's analyze this particular piece of code to understand what's going on. The way this actually gets implemented is as follows. Here's the code on the left. What actually happens is that the Cilk++ compiler converts the loop into recursion, divide-and-conquer recursion. What it does is work on a range from lo to hi; we're going to call it on 1 to n - 1, because those are the indices I've given to the cilk_for loop. If I have a range of values of i to do divide and conquer on, I basically divide that range in half. Then I recursively execute the first half, then the second half, and then cilk_sync, so the two halves are going off in parallel. And if I'm at the base case, then I go through the inner loop and do the swap of the values.
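As a sketch of what that compilation looks like, here is a plain-C version of the divide-and-conquer loop for the transpose. The names transpose_range and init_matrix and the fixed size N are my own; the real compiler-generated code uses cilk_spawn and cilk_sync where the comments indicate, and this serial version only models the control structure.

```c
#include <assert.h>

#define N 8                      /* illustrative matrix size (assumption) */
static double A[N][N];

static void init_matrix(void) {  /* fill A[i][j] = i*N + j for checking */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = i * N + j;
}

/* The compiler turns "cilk_for (int i = lo; i < hi; ++i) body;" into
   divide-and-conquer over the range [lo, hi).  Here the halves run
   serially; the Cilk keywords are shown as comments. */
static void transpose_range(int lo, int hi) {
    if (hi - lo > 1) {
        int mid = lo + (hi - lo) / 2;
        /* cilk_spawn */ transpose_range(lo, mid);   /* first half, in parallel */
        transpose_range(mid, hi);                    /* second half */
        /* cilk_sync;     wait for the spawned half */
        return;
    }
    /* base case: a single value of i; the inner loop stays serial */
    for (int i = lo, j = 0; j < i; j++) {
        double t = A[i][j]; A[i][j] = A[j][i]; A[j][i] = t;
    }
}
```

Calling transpose_range(1, N) corresponds to cilk_for (int i = 1; i < N; ++i) over the triangular iteration space.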
So the outer loop is the one that's the parallel loop; that's the one we're doing divide and conquer on. We basically recursively spawn the first half, execute the second, and then each of those recursively does the same thing until all the iterations have been done. Any questions about how that operates?

This is the way all parallel loops are done: basically this strategy. Now, I should mention that this base-case test here is in fact coarsened. We don't want to go all the way down to ranges of a single iteration, because then we'd pay the recursion-call overhead on every call. So what actually happens is that you go down to some grain size of some number of iterations, and at that point the code just runs through them as an ordinary serial for loop, in order not to pay the function-call overhead all the way down. We're going to look at exactly that issue.

So let's look at this using the DAG model that I introduced last time. Remember that the rectangles here count as activation frames, stack frames on the call stack.
And the circles here are strands, which are sequences of serial code. So what's happening here is that, essentially, I'm running the code that divides the range into two parts, and I spawn one part. Then this guy spawns the other and waits for the return, and then these guys come back. And then I keep doing that recursively, and when I get to the bottom, I run through the innermost loop, which starts out with just one element to do, then two, then three. So, for example, in this case where I've done eight, I go through eight elements at the bottom here, if this were an eight-by-eight matrix that I was transposing. So there's more work in these guys than there is over here. It's not something you can just map onto processors in some naive fashion; it does take some load balancing to parallelize this loop. Any questions about what's going on here? Yeah?

AUDIENCE: Why is it that it's one, two, three, four, up to eight?

CHARLES LEISERSON: Take a look: the inner loop goes from j = 0 up to i.
So this guy does just one iteration of the inner loop, this guy does two, this guy does three, all the way up to this guy doing eight iterations, if it were an eight-by-eight matrix. And in general, if it's n by n, the leaves go from one unit of work up to n units of work. Because I'm basically iterating through a triangular iteration space to transpose the matrix, and this is basically swapping row by row. Questions? Is that good? Everybody see what's going on?

So now let's analyze this for work and span. What is the work of this in terms of n, if I have an n-by-n matrix? What's the work? The work is the ordinary serial running time, right? It's n squared. Good. So basically, it's order n squared, because these guys all add up: this is an arithmetic sequence going up to n, so the total amount in here is order n squared. What about this part up here? How much does that cost us in work?
How much is in the control overhead of doing that outer loop? So, asymptotically, how much is in here? The total is going to be n squared, that I guarantee you. But what's going on up here? How do I count that up?

AUDIENCE: I'm assuming that each strand is going to be constant time?

CHARLES LEISERSON: Yeah, and in this case it is constant time for these up here, because what am I doing? All I'm doing is the recursion code, where I divide the range and then spawn off two things. That takes me only a constant amount of manipulation. So this is all order 1 per node, order n in total. The reason is that there are, in some sense, n leaves here, and if you have a full binary tree, meaning every node is either a leaf or has two children, then the number of internal nodes of the tree is one less than the number of leaves. That's a basic property of full binary trees: the number of internal nodes here is going to be n - 1. In particular, we have 7 here. Is that good?
So this part doesn't contribute significantly to the work; just this part contributes to the work. Is that good?

What about the span for this? What's the span?

AUDIENCE: Log n.

CHARLES LEISERSON: It's not log n, but your heads are in the right place.

AUDIENCE: The longest path is going [INAUDIBLE].

CHARLES LEISERSON: Which is the longest path going to be here, starting here and ending there? Which way do we go?

AUDIENCE: Go all the way down.

CHARLES LEISERSON: Which way?

AUDIENCE: To the right.

CHARLES LEISERSON: Down to the right, over, down through this guy. How big is this guy? n. Then back up this way. So how much is in the part going down?

AUDIENCE: Log n.

CHARLES LEISERSON: Going down and up is log n, but this guy is n. Good. So it's basically order n plus order log n: order n down here plus order log n up and down. That's order n.
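To sanity-check that count, here is a small model computation of my own (not lecture code): charge 1 to each control strand and i to the leaf for iteration i, then compute work and span of the divide-and-conquer DAG.

```c
#include <assert.h>

typedef struct { long work; long span; } cost;

/* Model costs for the outer-loop-only parallel transpose: each control
   node costs 1, and the leaf for iteration i costs i (its serial inner
   loop).  Work adds the children; span takes the max, since the two
   halves execute in parallel. */
static cost loop_cost(long lo, long hi) {
    if (hi - lo == 1) {
        cost leaf = { lo, lo };          /* one iteration of the outer loop */
        return leaf;
    }
    long mid = lo + (hi - lo) / 2;
    cost l = loop_cost(lo, mid);
    cost r = loop_cost(mid, hi);
    cost c;
    c.work = l.work + r.work + 1;                     /* control adds O(1) work */
    c.span = (l.span > r.span ? l.span : r.span) + 1; /* halves run in parallel */
    return c;
}
```

Doubling the range roughly quadruples the work, Theta(n squared), while only doubling the span, Theta(n), so the parallelism, their ratio, grows like Theta(n).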
451 00:23:44,490 --> 00:23:48,840 So the parallelism is the ratio of those two things, 452 00:23:48,840 --> 00:23:51,530 which is order n. 453 00:23:51,530 --> 00:23:54,260 So that's got good parallelism. 454 00:23:54,260 --> 00:23:56,280 And so if you imagine doing this in a large number of 455 00:23:56,280 --> 00:24:02,630 processors, very easy to get sort of your benchmark of 456 00:24:02,630 --> 00:24:05,040 maybe 10 times more parallelism than the number of 457 00:24:05,040 --> 00:24:08,570 processors that you're running on. 458 00:24:08,570 --> 00:24:09,820 Everybody follow this? 459 00:24:09,820 --> 00:24:12,240 460 00:24:12,240 --> 00:24:14,460 Good. 461 00:24:14,460 --> 00:24:16,700 So the span of the loop control is order log n. 462 00:24:16,700 --> 00:24:23,310 And in general, when you have a for loop with n iterations, 463 00:24:23,310 --> 00:24:27,700 the loop control itself is going to add log n, which is 464 00:24:27,700 --> 00:24:30,000 going to have to be added to the span every time you hit a 465 00:24:30,000 --> 00:24:35,570 loop-- log of whatever the number of iterations is. 466 00:24:35,570 --> 00:24:38,680 And then we have the maximum span of the body. 467 00:24:38,680 --> 00:24:41,710 Well, the worst case for this thing in the body is when it's 468 00:24:41,710 --> 00:24:44,680 doing the whole thing, because whenever we're looking at 469 00:24:44,680 --> 00:24:48,490 spans, we're always looking at what's the maximum of things 470 00:24:48,490 --> 00:24:50,740 that are operating in parallel. 471 00:24:50,740 --> 00:24:53,060 Everybody good? 472 00:24:53,060 --> 00:24:54,310 Questions? 473 00:24:54,310 --> 00:24:56,330 474 00:24:56,330 --> 00:24:58,440 Great. 475 00:24:58,440 --> 00:25:05,590 So now let's do something a little more parallel. 476 00:25:05,590 --> 00:25:09,380 Let's make both loops be parallel.
477 00:25:09,380 --> 00:25:12,510 So here we have a cilk_for loop here, and then another 478 00:25:12,510 --> 00:25:13,990 cilk_for loop on the interior. 479 00:25:13,990 --> 00:25:16,570 480 00:25:16,570 --> 00:25:18,480 And let's see what we get here. 481 00:25:18,480 --> 00:25:20,868 So what's the work for this? 482 00:25:20,868 --> 00:25:21,816 AUDIENCE: n squared. 483 00:25:21,816 --> 00:25:23,000 CHARLES LEISERSON: Yeah, n squared. 484 00:25:23,000 --> 00:25:25,230 That's not going to change. 485 00:25:25,230 --> 00:25:26,480 That's not going to change. 486 00:25:26,480 --> 00:25:29,650 487 00:25:29,650 --> 00:25:30,900 What's the span? 488 00:25:30,900 --> 00:25:34,992 489 00:25:34,992 --> 00:25:37,990 AUDIENCE: log n. 490 00:25:37,990 --> 00:25:40,490 CHARLES LEISERSON: Yeah, log n. 491 00:25:40,490 --> 00:25:44,580 So it's log n because the span of the outer control loop is 492 00:25:44,580 --> 00:25:46,490 going to add log n. 493 00:25:46,490 --> 00:25:50,750 The max span of the inner control loop, well, it's going 494 00:25:50,750 --> 00:25:58,850 from log of 1 up to log of i, but the maximum of those is 495 00:25:58,850 --> 00:26:01,960 going to be proportional to log n, 496 00:26:01,960 --> 00:26:05,420 even though it's not regular. 497 00:26:05,420 --> 00:26:09,490 And the span of the body now is going to be order 1. 498 00:26:09,490 --> 00:26:16,470 And so we add the logs because those things are in series. 499 00:26:16,470 --> 00:26:17,810 We don't multiply them. 500 00:26:17,810 --> 00:26:21,250 501 00:26:21,250 --> 00:26:22,310 What we're doing is we're looking at, 502 00:26:22,310 --> 00:26:23,170 what's the worst case? 503 00:26:23,170 --> 00:26:26,380 The worst case is I have to do the control for this, plus the 504 00:26:26,380 --> 00:26:30,230 control for this, plus the worst iteration here, which in 505 00:26:30,230 --> 00:26:32,000 this case is just order one. 506 00:26:32,000 --> 00:26:35,090 So the total is order log n.
507 00:26:35,090 --> 00:26:38,810 That can be confusing for people, why it is that we add 508 00:26:38,810 --> 00:26:45,210 here rather than multiply or do something else. 509 00:26:45,210 --> 00:26:47,460 So let me pause here for some questions 510 00:26:47,460 --> 00:26:51,020 if people have questions. 511 00:26:51,020 --> 00:26:52,020 Everybody with us? 512 00:26:52,020 --> 00:26:55,550 Anybody want clarification or make a point that would lead 513 00:26:55,550 --> 00:26:56,800 to clarification? 514 00:26:56,800 --> 00:26:59,630 515 00:26:59,630 --> 00:27:00,780 Yes, question. 516 00:27:00,780 --> 00:27:02,989 AUDIENCE: If you were going to draw a tree like the previous 517 00:27:02,989 --> 00:27:04,239 slide, what would it look like? 518 00:27:04,239 --> 00:27:09,670 519 00:27:09,670 --> 00:27:10,910 CHARLES LEISERSON: Let's see. 520 00:27:10,910 --> 00:27:14,175 I had wanted to do that and it got out of control. 521 00:27:14,175 --> 00:27:17,510 522 00:27:17,510 --> 00:27:20,220 So what it would look like is if we go back to the previous 523 00:27:20,220 --> 00:27:25,550 slide, it basically would look like this, except where each 524 00:27:25,550 --> 00:27:30,450 one of these guys is replaced by a tree that looks like this 525 00:27:30,450 --> 00:27:38,900 with as many leaves as the number here indicates. 526 00:27:38,900 --> 00:27:42,310 So once again, this would be the one with the longest span 527 00:27:42,310 --> 00:27:44,990 because this would be log of the largest number. 528 00:27:44,990 --> 00:27:48,720 But basically, each one of these would be a tree that 529 00:27:48,720 --> 00:27:49,640 came from this. 530 00:27:49,640 --> 00:27:50,200 Is that clear? 531 00:27:50,200 --> 00:27:52,070 That's a great question. 532 00:27:52,070 --> 00:27:53,580 Anybody else have as illuminating 533 00:27:53,580 --> 00:27:54,670 questions as those? 
534 00:27:54,670 --> 00:27:57,630 Everybody understand that explanation, what the tree 535 00:27:57,630 --> 00:27:58,310 would look like? 536 00:27:58,310 --> 00:27:59,960 OK, good. 537 00:27:59,960 --> 00:28:07,420 538 00:28:07,420 --> 00:28:10,140 539 00:28:10,140 --> 00:28:15,022 So the parallelism here is n squared over log n. 540 00:28:15,022 --> 00:28:17,790 Now it's tempting, when you do parallel programming, to say 541 00:28:17,790 --> 00:28:20,160 therefore, this is better parallel code. 542 00:28:20,160 --> 00:28:24,190 543 00:28:24,190 --> 00:28:26,760 And the reason is, well, it does asymptotically have more 544 00:28:26,760 --> 00:28:27,430 parallelism. 545 00:28:27,430 --> 00:28:31,110 But generally when you're programming, you're not trying 546 00:28:31,110 --> 00:28:33,120 to get the most parallelism. 547 00:28:33,120 --> 00:28:37,840 What you're trying to do is get sufficient parallelism. 548 00:28:37,840 --> 00:28:43,470 So if n is sufficiently large, it's going to be way more-- 549 00:28:43,470 --> 00:28:44,710 if n is a million-- 550 00:28:44,710 --> 00:28:47,540 which is a typical problem size for a loop, for example, for a 551 00:28:47,540 --> 00:28:51,180 big loop, or even if it's a few thousand or whatever-- 552 00:28:51,180 --> 00:28:55,820 it may be just fine to have parallelism on the order of 553 00:28:55,820 --> 00:29:00,720 1,000, which is what the first one gives you. 554 00:29:00,720 --> 00:29:03,520 So 1,000 iterations is generally a small number of 555 00:29:03,520 --> 00:29:05,310 iterations. 556 00:29:05,310 --> 00:29:08,150 So 1,000 by 1,000 matrix is going to generate 557 00:29:08,150 --> 00:29:09,470 parallelism of 1,000. 558 00:29:09,470 --> 00:29:11,480 Here, we're going to get a parallelism of 1 million 559 00:29:11,480 --> 00:29:18,010 divided by about 20-- log of a million is about 20-- so like 100,000.
560 00:29:18,010 --> 00:29:28,520 But if I have 1,000 by 1,000 matrix, the difference between 561 00:29:28,520 --> 00:29:33,140 having parallelism of 1,000 and parallelism of 100,000, 562 00:29:33,140 --> 00:29:38,440 when I'm running on 100 cores, let's say, it doesn't matter. 563 00:29:38,440 --> 00:29:40,740 Up to 100 cores, it doesn't matter. 564 00:29:40,740 --> 00:29:43,540 And in fact, running this on 100 cores, that's really a 565 00:29:43,540 --> 00:29:46,410 tiny problem compared to the amount of memory 566 00:29:46,410 --> 00:29:48,670 you're going to get. 567 00:29:48,670 --> 00:29:55,840 1,000 by 1,000 matrix is tiny when it comes to the size of 568 00:29:55,840 --> 00:29:58,390 memory that you're going to have access to and so forth. 569 00:29:58,390 --> 00:30:00,760 So for big problems and so forth you really want to look 570 00:30:00,760 --> 00:30:08,650 at this and say, of things that have ample parallelism, 571 00:30:08,650 --> 00:30:12,570 which ones really are going to give me the best bang for the 572 00:30:12,570 --> 00:30:15,070 buck for reasonable machine sizes? 573 00:30:15,070 --> 00:30:18,300 574 00:30:18,300 --> 00:30:20,950 That's different from things like work or 575 00:30:20,950 --> 00:30:22,240 serial running time. 576 00:30:22,240 --> 00:30:25,880 Usually less running time is better, 577 00:30:25,880 --> 00:30:27,970 and it's always better. 578 00:30:27,970 --> 00:30:30,170 But here parallelism-- 579 00:30:30,170 --> 00:30:35,150 yes, it's good to minimize your span, but you don't have 580 00:30:35,150 --> 00:30:38,680 to minimize it extremely. 581 00:30:38,680 --> 00:30:43,220 You just have to get it small enough, whereas the work term, 582 00:30:43,220 --> 00:30:45,680 that you really want to minimize, because that's what 583 00:30:45,680 --> 00:30:47,250 you're going to have to do, even in a serial 584 00:30:47,250 --> 00:30:48,420 implementation. 585 00:30:48,420 --> 00:30:49,780 Question. 
586 00:30:49,780 --> 00:30:52,105 AUDIENCE: So are you suggesting that the 587 00:30:52,105 --> 00:30:55,800 other code was OK? 588 00:30:55,800 --> 00:30:59,050 CHARLES LEISERSON: We're going to look a little bit closer at 589 00:30:59,050 --> 00:31:02,210 the issue of overheads. 590 00:31:02,210 --> 00:31:04,200 We're now going to take a look at what's the difference 591 00:31:04,200 --> 00:31:05,930 between these two codes? 592 00:31:05,930 --> 00:31:08,320 We'll come back to that in a minute. 593 00:31:08,320 --> 00:31:12,620 The way I want to do it is take a look at the issue of 594 00:31:12,620 --> 00:31:16,590 overheads with a simpler example, where we can see 595 00:31:16,590 --> 00:31:18,380 what's really going on. 596 00:31:18,380 --> 00:31:22,850 So here, what I've done is I've got a loop that is 597 00:31:22,850 --> 00:31:25,520 basically just doing vector addition. 598 00:31:25,520 --> 00:31:32,110 It's adding for i equals 0 to n minus 1, add b 599 00:31:32,110 --> 00:31:33,450 of i into a of i. 600 00:31:33,450 --> 00:31:36,120 601 00:31:36,120 --> 00:31:38,020 Pretty simple code, and we want to make that be a 602 00:31:38,020 --> 00:31:40,660 parallel loop. 603 00:31:40,660 --> 00:31:44,290 So I get a recursion tree that looks like this, where I have 604 00:31:44,290 --> 00:31:47,170 constant work at every step there. 605 00:31:47,170 --> 00:31:49,460 And of course, the work is order n, 606 00:31:49,460 --> 00:31:52,500 because I've got n leaves. 607 00:31:52,500 --> 00:31:55,670 And the number of internal nodes, the control, is all 608 00:31:55,670 --> 00:31:57,600 constant size strands there. 609 00:31:57,600 --> 00:32:00,170 So this is all just order n for work. 610 00:32:00,170 --> 00:32:03,830 And the span is basically log n, as we've seen, by going 611 00:32:03,830 --> 00:32:06,000 down one of these paths, for example. 612 00:32:06,000 --> 00:32:11,000 And so the parallelism for this is order n over log n. 
613 00:32:11,000 --> 00:32:12,550 So a very simple problem. 614 00:32:12,550 --> 00:32:17,280 But now let's take a look more closely at the overheads here. 615 00:32:17,280 --> 00:32:19,540 So the problem is that this work term 616 00:32:19,540 --> 00:32:22,570 contains substantial overhead. 617 00:32:22,570 --> 00:32:25,440 In other words, if I really was doing that, if I hadn't 618 00:32:25,440 --> 00:32:29,160 coarsened the recursion at all in the implementation of 619 00:32:29,160 --> 00:32:31,940 cilk_for, if the developers hadn't done that, then I've 620 00:32:31,940 --> 00:32:37,210 got a function call, I've got n function calls here for 621 00:32:37,210 --> 00:32:43,000 doing a single addition of values at the leaves. 622 00:32:43,000 --> 00:32:47,035 I've got n minus one of these guys, that's approximately n, 623 00:32:47,035 --> 00:32:48,940 and I've got n of these guys. 624 00:32:48,940 --> 00:32:51,360 And which are bigger, these guys or these guys? 625 00:32:51,360 --> 00:32:54,540 626 00:32:54,540 --> 00:32:55,690 These guys are way bigger. 627 00:32:55,690 --> 00:32:58,040 They've got a function call in there. 628 00:32:58,040 --> 00:32:59,590 This guy right here just has what? 629 00:32:59,590 --> 00:33:02,300 One floating point addition. 630 00:33:02,300 --> 00:33:06,530 And so if I really was doing my divide and conquer down to 631 00:33:06,530 --> 00:33:13,230 a single element, this would be way slower on one processor 632 00:33:13,230 --> 00:33:15,790 than if I just ran it with a for loop. 633 00:33:15,790 --> 00:33:17,620 Because if I do a for loop, it's just going to go through, 634 00:33:17,620 --> 00:33:24,440 and the overhead it has is incrementing i and testing for 635 00:33:24,440 --> 00:33:25,100 termination. 636 00:33:25,100 --> 00:33:26,160 That's it. 
637 00:33:26,160 --> 00:33:30,130 And of course, that's a predictable branch, because it 638 00:33:30,130 --> 00:33:34,070 almost never terminates until it actually terminates, and so 639 00:33:34,070 --> 00:33:36,150 that's exactly the sort of thing that's going to have a 640 00:33:36,150 --> 00:33:38,850 really, really tight loop with very few instructions. 641 00:33:38,850 --> 00:33:40,750 But in the parallel implementation, there's going 642 00:33:40,750 --> 00:33:44,590 to be this function call overhead everywhere. 643 00:33:44,590 --> 00:33:47,310 And so therefore, this cilk_for loop in principle 644 00:33:47,310 --> 00:33:49,430 would not be as efficient. 645 00:33:49,430 --> 00:33:52,290 It actually is, but we're going to explain why it is, 646 00:33:52,290 --> 00:33:56,760 what goes on in the runtime system, to understand that. 647 00:33:56,760 --> 00:34:00,290 So here's the idea, and you can 648 00:34:00,290 --> 00:34:02,230 control this with a pragma. 649 00:34:02,230 --> 00:34:06,200 So a pragma is a statement to the compiler 650 00:34:06,200 --> 00:34:08,070 that gives it a hint. 651 00:34:08,070 --> 00:34:12,880 And here, the pragma says, you can name a grain size and give 652 00:34:12,880 --> 00:34:15,389 it a value of g. 653 00:34:15,389 --> 00:34:19,000 And what that says is rather than just doing one element 654 00:34:19,000 --> 00:34:22,590 when you get down to the bottom here, do g elements in 655 00:34:22,590 --> 00:34:26,750 a for loop when you get down to the bottom. 656 00:34:26,750 --> 00:34:29,880 And that way, you halt the recursion earlier. 657 00:34:29,880 --> 00:34:32,570 You have fewer of these internal nodes. 658 00:34:32,570 --> 00:34:36,760 And if you make the grain size sufficiently large, the cost 659 00:34:36,760 --> 00:34:40,600 of the recursion at the top you won't be able to see. 660 00:34:40,600 --> 00:34:43,989 So let's analyze what happens when we do this. 
661 00:34:43,989 --> 00:34:49,190 So we can understand this vis-à-vis this equation. 662 00:34:49,190 --> 00:34:54,380 So the idea here is, if I look at my work, imagine that t 663 00:34:54,380 --> 00:34:58,290 iter is the time for one iteration of the loop, the 664 00:34:58,290 --> 00:35:00,490 basic time for one iteration of the loop. 665 00:35:00,490 --> 00:35:04,340 So the amount of work that I have to do is n times the time 666 00:35:04,340 --> 00:35:06,580 for the iterations of the loop. 667 00:35:06,580 --> 00:35:10,430 And then depending upon my grain size, I've got to do 668 00:35:10,430 --> 00:35:13,220 things having to do with the internal nodes, and there's 669 00:35:13,220 --> 00:35:20,180 basically going to be n over g of those, times the time for a 670 00:35:20,180 --> 00:35:24,050 spawn, which, I'm saying, is the time to execute one of 671 00:35:24,050 --> 00:35:26,110 these things. 672 00:35:26,110 --> 00:35:29,640 So if these are batched into groups of g, then there are n 673 00:35:29,640 --> 00:35:32,560 over g such leaves. 674 00:35:32,560 --> 00:35:34,840 There's a minus 1 in here, but it doesn't matter. 675 00:35:34,840 --> 00:35:39,250 It's basically n over g times the time for 676 00:35:39,250 --> 00:35:40,700 the internal nodes. 677 00:35:40,700 --> 00:35:42,740 So everybody see where I'm getting this? 678 00:35:42,740 --> 00:35:44,820 So I'm trying to account for the constants in the 679 00:35:44,820 --> 00:35:46,620 implementation. 680 00:35:46,620 --> 00:35:48,000 People follow where I'm getting this? 681 00:35:48,000 --> 00:35:49,180 Ask questions. 682 00:35:49,180 --> 00:35:53,220 I see a couple of people who are sort of going, not sure I 683 00:35:53,220 --> 00:35:54,470 understand. 684 00:35:54,470 --> 00:35:57,480 685 00:35:57,480 --> 00:35:58,260 Yes? 686 00:35:58,260 --> 00:36:01,040 AUDIENCE: The constants [INAUDIBLE]. 687 00:36:01,040 --> 00:36:01,520 CHARLES LEISERSON: Yes.
688 00:36:01,520 --> 00:36:05,940 So basically, the constants are these t iter and t spawn. 689 00:36:05,940 --> 00:36:10,310 So t spawn is the time to execute all that mess. 690 00:36:10,310 --> 00:36:15,250 t iter is the time to execute one iteration within here. 691 00:36:15,250 --> 00:36:17,670 I'm doing, in this case, g of them. 692 00:36:17,670 --> 00:36:22,430 So I have n over g leaves, but each one is doing g, so it's n 693 00:36:22,430 --> 00:36:26,240 over g times g, which is a total of n iterations, which 694 00:36:26,240 --> 00:36:26,750 makes sense. 695 00:36:26,750 --> 00:36:28,980 I should be doing n iterations if I'm 696 00:36:28,980 --> 00:36:30,460 adding two vectors here. 697 00:36:30,460 --> 00:36:34,200 So that's accounting for all the work in these guys. 698 00:36:34,200 --> 00:36:37,470 Then in addition, I've got all of the work for all the 699 00:36:37,470 --> 00:36:41,790 spawning, which is n over g times t spawn. 700 00:36:41,790 --> 00:36:43,950 And as I say, you can play with this yourself, play with 701 00:36:43,950 --> 00:36:46,640 grain size yourself, by just sticking in different grain 702 00:36:46,640 --> 00:36:47,830 size directives. 703 00:36:47,830 --> 00:36:52,410 Otherwise it turns out that the cilk runtime system will 704 00:36:52,410 --> 00:36:56,930 pick what it deems to be a good grain size. 705 00:36:56,930 --> 00:37:02,080 And it usually does a good job, except sometimes. 706 00:37:02,080 --> 00:37:05,350 And that's why there's a parameter here. 707 00:37:05,350 --> 00:37:06,570 So if there's a parameter there, you 708 00:37:06,570 --> 00:37:10,160 can get rid of that. 709 00:37:10,160 --> 00:37:11,410 Yes? 710 00:37:11,410 --> 00:37:13,462 711 00:37:13,462 --> 00:37:16,086 AUDIENCE: Is the pragma something that is enforced, or 712 00:37:16,086 --> 00:37:18,382 is it something that says, hey-- 713 00:37:18,382 --> 00:37:19,366 CHARLES LEISERSON: It's a hint. 714 00:37:19,366 --> 00:37:20,350 AUDIENCE: It's a hint. 
715 00:37:20,350 --> 00:37:21,885 CHARLES LEISERSON: Yes, it's a hint. 716 00:37:21,885 --> 00:37:23,310 In other words, compiler could ignore it. 717 00:37:23,310 --> 00:37:23,389 [? 718 00:37:23,389 --> 00:37:24,972 AUDIENCE: The compiler is ?] going to be like, oh, that's 719 00:37:24,972 --> 00:37:26,975 the total [INAUDIBLE] 720 00:37:26,975 --> 00:37:28,000 constant. 721 00:37:28,000 --> 00:37:29,340 CHARLES LEISERSON: It's supposed to be something that 722 00:37:29,340 --> 00:37:32,240 gives a hint for performance reasons but does not affect 723 00:37:32,240 --> 00:37:35,120 the correctness of the program. 724 00:37:35,120 --> 00:37:37,650 So the program is going to be doing the same thing 725 00:37:37,650 --> 00:37:38,350 regardless. 726 00:37:38,350 --> 00:37:40,800 The question is, here's a hint to the compiler and the 727 00:37:40,800 --> 00:37:44,020 runtime system. 728 00:37:44,020 --> 00:37:46,788 And so then it's picked at-- 729 00:37:46,788 --> 00:37:49,268 yeah? 730 00:37:49,268 --> 00:37:53,265 AUDIENCE: My question is, so there's these cases where you 731 00:37:53,265 --> 00:37:56,460 say that the runtime system fails to find an appropriate 732 00:37:56,460 --> 00:37:57,873 value for that [INAUDIBLE]. 733 00:37:57,873 --> 00:38:00,699 I mean, basically, chooses one that's not as good. 734 00:38:00,699 --> 00:38:03,211 If you put a pragma on it, will the runtime system choose 735 00:38:03,211 --> 00:38:04,890 the one that you give it, or still choose-- 736 00:38:04,890 --> 00:38:06,290 CHARLES LEISERSON: No, if you give it, the 737 00:38:06,290 --> 00:38:07,840 runtime system will-- 738 00:38:07,840 --> 00:38:11,040 in the current implementation, it always picks whatever you 739 00:38:11,040 --> 00:38:12,340 say is here. 740 00:38:12,340 --> 00:38:13,215 And that can be an expression. 741 00:38:13,215 --> 00:38:15,490 You can evaluate something there. 742 00:38:15,490 --> 00:38:16,540 It's not just a constant. 
743 00:38:16,540 --> 00:38:20,070 It could be maximum of this and that times 744 00:38:20,070 --> 00:38:22,655 whatever, et cetera. 745 00:38:22,655 --> 00:38:25,670 Is that good? 746 00:38:25,670 --> 00:38:27,390 So this is a description of the work. 747 00:38:27,390 --> 00:38:30,400 Now let's get a description with the constants 748 00:38:30,400 --> 00:38:33,840 again of the span. 749 00:38:33,840 --> 00:38:35,825 So what is going to be the constants for the span? 750 00:38:35,825 --> 00:38:40,040 751 00:38:40,040 --> 00:38:43,275 Well, I'm executing this part in here now serially. 752 00:38:43,275 --> 00:38:46,320 753 00:38:46,320 --> 00:38:49,120 So for the span part, we're basically going to go down on 754 00:38:49,120 --> 00:38:52,220 one of these paths and back up I'm not sure which one, but 755 00:38:52,220 --> 00:38:55,330 they're basically all fairly symmetric. 756 00:38:55,330 --> 00:38:56,860 But then when I get to the leaf, I'm 757 00:38:56,860 --> 00:39:00,110 executing the leaf serially. 758 00:39:00,110 --> 00:39:03,790 So I'm going to have whatever the cost is, g times the time 759 00:39:03,790 --> 00:39:07,520 per iteration, is going to be executed serially, plus now 760 00:39:07,520 --> 00:39:12,850 log of n over g-- 761 00:39:12,850 --> 00:39:15,680 n over g is the number of things I have here-- 762 00:39:15,680 --> 00:39:17,435 times the cost of the spawn, basically. 763 00:39:17,435 --> 00:39:21,152 764 00:39:21,152 --> 00:39:22,402 Does that make sense? 765 00:39:22,402 --> 00:39:25,470 766 00:39:25,470 --> 00:39:28,270 So the idea is, what do we want to have here if I want a 767 00:39:28,270 --> 00:39:29,640 good parallel code? 768 00:39:29,640 --> 00:39:33,230 We would like the work to be as small as possible. 769 00:39:33,230 --> 00:39:34,530 How do I make the work small? 770 00:39:34,530 --> 00:39:38,270 771 00:39:38,270 --> 00:39:42,758 How can I set g to make the work small? 772 00:39:42,758 --> 00:39:44,580 AUDIENCE: [INAUDIBLE]. 
773 00:39:44,580 --> 00:39:45,520 CHARLES LEISERSON: Make g-- 774 00:39:45,520 --> 00:39:46,540 AUDIENCE: Square root of n. 775 00:39:46,540 --> 00:39:51,090 CHARLES LEISERSON: Well, make g big or little? 776 00:39:51,090 --> 00:39:58,770 If you want this term to be small, you want g to be big. 777 00:39:58,770 --> 00:40:02,380 But we also want to have a lot of parallelism. 778 00:40:02,380 --> 00:40:05,530 So I want this term to be what? 779 00:40:05,530 --> 00:40:09,290 Small, which means I need to make g what? 780 00:40:09,290 --> 00:40:14,366 Well, we got an n over g here, but it's in a log. 781 00:40:14,366 --> 00:40:15,910 It's minus log. 782 00:40:15,910 --> 00:40:20,882 So really, to get this small, I want g to be small. 783 00:40:20,882 --> 00:40:24,670 So I have tension, trade off. 784 00:40:24,670 --> 00:40:26,090 I have trade off. 785 00:40:26,090 --> 00:40:27,640 So let's analyze this a little bit. 786 00:40:27,640 --> 00:40:30,210 787 00:40:30,210 --> 00:40:34,465 Essentially, if I look at this, I want g to be bigger-- 788 00:40:34,465 --> 00:40:41,860 789 00:40:41,860 --> 00:40:44,130 from this one I want g to be small. 790 00:40:44,130 --> 00:40:47,170 But here, what I would like is to make it so that this term 791 00:40:47,170 --> 00:40:48,480 dominates this term. 792 00:40:48,480 --> 00:40:51,390 793 00:40:51,390 --> 00:40:56,470 If the first term here dominates the second term, 794 00:40:56,470 --> 00:40:59,400 then the work is going to be the same as if I did an 795 00:40:59,400 --> 00:41:05,450 ordinary for loop to within a few percent. 796 00:41:05,450 --> 00:41:09,570 So therefore, I want t span over t iter, if I take the 797 00:41:09,570 --> 00:41:14,110 ratio of these things, I want g to be bigger than the time 798 00:41:14,110 --> 00:41:17,450 to spawn divided by the time to iterate. 
799 00:41:17,450 --> 00:41:20,960 If I get it much bigger than that, then this term will be 800 00:41:20,960 --> 00:41:22,580 much bigger than that term and I don't have to 801 00:41:22,580 --> 00:41:25,760 worry about this term. 802 00:41:25,760 --> 00:41:29,660 So I want it to be much bigger, but I want it to be as 803 00:41:29,660 --> 00:41:31,850 small as possible. 804 00:41:31,850 --> 00:41:35,630 There's no point in making it much bigger than that which 805 00:41:35,630 --> 00:41:38,340 causes this term to essentially be wiped out. 806 00:41:38,340 --> 00:41:39,590 People follow that? 807 00:41:39,590 --> 00:41:44,830 808 00:41:44,830 --> 00:41:49,580 So basically, the idea is we pick a grain size that's large 809 00:41:49,580 --> 00:41:53,180 but not too large, is what you generally want to do, so that 810 00:41:53,180 --> 00:41:55,930 you have enough parallelism, but you don't pay too much overhead. 811 00:41:55,930 --> 00:41:59,690 The way that the runtime system does it is it has a 812 00:41:59,690 --> 00:42:02,780 somewhat complicated heuristic, but it actually 813 00:42:02,780 --> 00:42:06,660 looks to see how many processors you're running on. 814 00:42:06,660 --> 00:42:10,970 And it uses a heuristic that says, let's make sure there's 815 00:42:10,970 --> 00:42:13,550 at least parallelism 10 times more than the number of 816 00:42:13,550 --> 00:42:14,820 processors. 817 00:42:14,820 --> 00:42:18,330 But there's no point in having more iterations than like 500 818 00:42:18,330 --> 00:42:23,350 or something, because at 500 iterations, you can't see the 819 00:42:23,350 --> 00:42:25,940 spawn overhead regardless. 820 00:42:25,940 --> 00:42:29,900 So basically, it uses a formula of that nature to 821 00:42:29,900 --> 00:42:31,590 pick this automatically. 822 00:42:31,590 --> 00:42:34,150 But you're free to pick this yourself.
823 00:42:34,150 --> 00:42:37,540 But you can see the point is that although it's doing 824 00:42:37,540 --> 00:42:41,470 divide and conquer, you do this issue of coarsening and 825 00:42:41,470 --> 00:42:46,760 you do want to make sure that you have enough work to do in 826 00:42:46,760 --> 00:42:49,100 any of the leaves of the computation. 827 00:42:49,100 --> 00:42:51,470 And as I say, usually it'll guess right. 828 00:42:51,470 --> 00:42:54,550 But if you have trouble with that, you have a parameter you 829 00:42:54,550 --> 00:42:56,760 can play with. 830 00:42:56,760 --> 00:42:59,990 Let's take a look at another implementation just to try to 831 00:42:59,990 --> 00:43:01,240 understand this issue. 832 00:43:01,240 --> 00:43:04,490 833 00:43:04,490 --> 00:43:06,100 Suppose I'm going to do a vector add. 834 00:43:06,100 --> 00:43:10,110 So here I have a vector add of two arrays, where I'm 835 00:43:10,110 --> 00:43:17,750 basically saying ai gets the value of b added into it. 836 00:43:17,750 --> 00:43:20,260 That's kind of the code we had before. 837 00:43:20,260 --> 00:43:25,440 But now, what I want to do is I'm going to implement a 838 00:43:25,440 --> 00:43:26,950 vector add using cilk spawn. 839 00:43:26,950 --> 00:43:30,560 840 00:43:30,560 --> 00:43:34,160 So rather than using a cilk_for loop, I'm going to 841 00:43:34,160 --> 00:43:37,660 parallelize this loop by hand using cilk spawn. 842 00:43:37,660 --> 00:43:41,240 What I'm going to do is I'm going to say for j equals 0 up 843 00:43:41,240 --> 00:43:42,970 to-- and I'm going to jump by whatever my 844 00:43:42,970 --> 00:43:45,020 grain size is here-- 845 00:43:45,020 --> 00:43:50,610 and spawn off the addition of things of size, essentially, 846 00:43:50,610 --> 00:43:53,180 g, unless I get close to the end of the array. 847 00:43:53,180 --> 00:43:57,440 But basically, I'm always spawning off the next g 848 00:43:57,440 --> 00:44:00,200 iterations to do that in parallel. 
849 00:44:00,200 --> 00:44:03,280 And then I sync all these spawns. 850 00:44:03,280 --> 00:44:06,180 So everybody understand the code? 851 00:44:06,180 --> 00:44:07,270 I see nods. 852 00:44:07,270 --> 00:44:09,740 I want to see everybody nod, actually, when I do this. 853 00:44:09,740 --> 00:44:12,690 Otherwise what happens is I see three people nod, and I 854 00:44:12,690 --> 00:44:13,770 assume that people are nodding. 855 00:44:13,770 --> 00:44:15,760 Because if you don't do it, you can shake your head, and I 856 00:44:15,760 --> 00:44:18,410 promise none of your friends will see that you're 857 00:44:18,410 --> 00:44:21,280 shaking your head. 858 00:44:21,280 --> 00:44:23,880 And since the TAs are doing the grading and they're facing 859 00:44:23,880 --> 00:44:26,450 this way, they won't see either. 860 00:44:26,450 --> 00:44:29,900 And so it's perfectly safe to let me know, and that way I 861 00:44:29,900 --> 00:44:31,150 can make sure you understand. 862 00:44:31,150 --> 00:44:33,590 863 00:44:33,590 --> 00:44:37,290 So everybody understand what this does? 864 00:44:37,290 --> 00:44:38,500 OK, so I see a few more. 865 00:44:38,500 --> 00:44:38,820 No. 866 00:44:38,820 --> 00:44:39,910 OK, question? 867 00:44:39,910 --> 00:44:43,540 Do you have a question, or should I just explain again? 868 00:44:43,540 --> 00:44:49,490 So this is basically doing a vector add of b into a, of n 869 00:44:49,490 --> 00:44:51,970 iterations here. 870 00:44:51,970 --> 00:44:54,910 And we're going to call it here, when I do a vector add, 871 00:44:54,910 --> 00:44:57,490 of basically g iterations. 872 00:44:57,490 --> 00:45:00,670 So what it's doing is it's going to take my array of size 873 00:45:00,670 --> 00:45:05,590 n, bust it into chunks of size g, and spawn off the first 874 00:45:05,590 --> 00:45:08,230 one, spawn off the second one, spawn off the third one, each 875 00:45:08,230 --> 00:45:11,310 one to do g iterations. 876 00:45:11,310 --> 00:45:13,340 That make sense? 
877 00:45:13,340 --> 00:45:14,700 We'll see it. 878 00:45:14,700 --> 00:45:17,330 So here's sort of the instruction stream 879 00:45:17,330 --> 00:45:18,980 for the code here. 880 00:45:18,980 --> 00:45:22,810 So basically, it says here is one, we spawn off something of 881 00:45:22,810 --> 00:45:27,370 size g, then we go on, we spawn off something else of 882 00:45:27,370 --> 00:45:28,870 size g, et cetera. 883 00:45:28,870 --> 00:45:32,740 We keep going up there until we hit the cilk sync. 884 00:45:32,740 --> 00:45:34,480 That make sense? 885 00:45:34,480 --> 00:45:38,610 Each of these is doing a vector add of size g using 886 00:45:38,610 --> 00:45:40,015 this serial routine. 887 00:45:40,015 --> 00:45:42,610 888 00:45:42,610 --> 00:45:46,420 So let's analyze this to understand the efficiency of 889 00:45:46,420 --> 00:45:49,980 this type of looping structure. 890 00:45:49,980 --> 00:45:52,910 So let's assume for this analysis that g equals 1, to 891 00:45:52,910 --> 00:45:54,690 make it easy, so we don't have to worry about it. 892 00:45:54,690 --> 00:45:57,840 So we're simply spawning off one thing here, one thing 893 00:45:57,840 --> 00:46:01,760 here, one iteration here, all the way to the end. 894 00:46:01,760 --> 00:46:05,370 So what is the work for this, if I spawn off things of size 895 00:46:05,370 --> 00:46:09,370 one, asymptotic work? 896 00:46:09,370 --> 00:46:12,850 It's order n, because I've got n leaves, and I've got n guys 897 00:46:12,850 --> 00:46:13,710 that I'm spawning off. 898 00:46:13,710 --> 00:46:15,720 So it's order n. 899 00:46:15,720 --> 00:46:18,019 What's the span? 900 00:46:18,019 --> 00:46:20,800 AUDIENCE: [INAUDIBLE]. 901 00:46:20,800 --> 00:46:27,130 CHARLES LEISERSON: Yeah, it's also order n, because the 902 00:46:27,130 --> 00:46:29,110 critical path goes something like brrrup, brrrup, brrrup. 903 00:46:29,110 --> 00:46:33,620 904 00:46:33,620 --> 00:46:35,850 That's order n length. 
905 00:46:35,850 --> 00:46:37,760 It's not this, because that's only order 906 00:46:37,760 --> 00:46:38,820 one length, all those. 907 00:46:38,820 --> 00:46:42,130 The longest path is order n. 908 00:46:42,130 --> 00:46:49,220 So that says the parallelism is order one. 909 00:46:49,220 --> 00:46:53,720 Conclusion, at least with grain size one, this is a 910 00:46:53,720 --> 00:46:57,950 really bad way to implement a parallel loop. 911 00:46:57,950 --> 00:47:01,080 However, I guarantee, it may not be the people in this 912 00:47:01,080 --> 00:47:07,130 room, but some fraction of students in this class will 913 00:47:07,130 --> 00:47:12,080 write this rather than doing a cilk for. 914 00:47:12,080 --> 00:47:15,440 Bad idea. 915 00:47:15,440 --> 00:47:17,270 Bad idea. 916 00:47:17,270 --> 00:47:19,950 Generally, bad idea. 917 00:47:19,950 --> 00:47:20,862 Question? 918 00:47:20,862 --> 00:47:22,308 AUDIENCE: Do you think you could find a constant factor, 919 00:47:22,308 --> 00:47:23,558 not just [INAUDIBLE]? 920 00:47:23,558 --> 00:47:26,164 921 00:47:26,164 --> 00:47:29,360 CHARLES LEISERSON: Well here, actually, with grain size one, 922 00:47:29,360 --> 00:47:31,960 this is really bad, because I've got this overhead of 923 00:47:31,960 --> 00:47:35,450 doing a spawn, and then I'm only doing one iteration. 924 00:47:35,450 --> 00:47:38,250 So the ideal thing would be that I really am only paying 925 00:47:38,250 --> 00:47:41,170 for the leaves, and the internal nodes, I don't have 926 00:47:41,170 --> 00:47:42,405 to pay anything for. 927 00:47:42,405 --> 00:47:44,182 Yeah, Eric? 928 00:47:44,182 --> 00:47:45,820 AUDIENCE: Shouldn't there be a sort of keyword 929 00:47:45,820 --> 00:47:46,560 in the b add too? 930 00:47:46,560 --> 00:47:47,150 CHARLES LEISERSON: In the where? 931 00:47:47,150 --> 00:47:48,175 AUDIENCE: In the b add? 932 00:47:48,175 --> 00:47:49,470 CHARLES LEISERSON: No, that's serial. 933 00:47:49,470 --> 00:47:51,190 That's a serial code. 
934 00:47:51,190 --> 00:47:53,066 AUDIENCE: No, but if you were going to call it with cilk 935 00:47:53,066 --> 00:47:56,140 spawn, don't you have to declare it cilk? 936 00:47:56,140 --> 00:47:58,581 Is that not the case? 937 00:47:58,581 --> 00:47:59,038 CHARLES LEISERSON: No. 938 00:47:59,038 --> 00:48:00,288 AUDIENCE: Never mind. 939 00:48:00,288 --> 00:48:02,820 940 00:48:02,820 --> 00:48:03,900 CHARLES LEISERSON: Yes, question. 941 00:48:03,900 --> 00:48:05,884 AUDIENCE: If g is [INAUDIBLE], isn't that good enough? 942 00:48:05,884 --> 00:48:08,420 943 00:48:08,420 --> 00:48:09,290 CHARLES LEISERSON: Yeah, so let's take a look. 944 00:48:09,290 --> 00:48:10,540 That's actually the next slide. 945 00:48:10,540 --> 00:48:12,980 946 00:48:12,980 --> 00:48:17,036 This is basically what we call puny parallelism. 947 00:48:17,036 --> 00:48:20,580 We don't like puny parallelism. 948 00:48:20,580 --> 00:48:22,470 It doesn't have to be spectacular. 949 00:48:22,470 --> 00:48:25,680 It has to be good enough. 950 00:48:25,680 --> 00:48:28,455 And this is not good enough for most applications. 951 00:48:28,455 --> 00:48:31,960 952 00:48:31,960 --> 00:48:34,250 So here's another implementation. 953 00:48:34,250 --> 00:48:35,620 Here's another way of doing it. 954 00:48:35,620 --> 00:48:40,710 Now let's analyze it where we have control over g. 955 00:48:40,710 --> 00:48:44,430 So we'll analyze it in terms of g, and then see whether 956 00:48:44,430 --> 00:48:47,270 there's a setting for which this makes sense. 957 00:48:47,270 --> 00:48:49,690 So if I analyze it in terms of g, now I have to do a little 958 00:48:49,690 --> 00:48:51,600 bit more careful analysis here. 959 00:48:51,600 --> 00:48:57,798 How much work do I have here in terms of n and g? 960 00:48:57,798 --> 00:48:59,200 AUDIENCE: It's the same. 961 00:48:59,200 --> 00:49:00,120 CHARLES LEISERSON: Yeah, the work is still 962 00:49:00,120 --> 00:49:01,370 asymptotically order n.
963 00:49:01,370 --> 00:49:05,220 964 00:49:05,220 --> 00:49:07,940 Because I always have n work in the leaves, even if I do 965 00:49:07,940 --> 00:49:09,190 more iterations here. 966 00:49:09,190 --> 00:49:14,091 What increasing g does is it shrinks this, right? 967 00:49:14,091 --> 00:49:17,350 It shrinks this. 968 00:49:17,350 --> 00:49:18,850 The span for this is what? 969 00:49:18,850 --> 00:49:23,240 970 00:49:23,240 --> 00:49:25,820 So I heard somebody say it. 971 00:49:25,820 --> 00:49:27,448 n over g plus g. 972 00:49:27,448 --> 00:49:30,560 973 00:49:30,560 --> 00:49:32,240 And it corresponds to this path. 974 00:49:32,240 --> 00:49:34,750 975 00:49:34,750 --> 00:49:37,660 So this is the n over g part up here, and 976 00:49:37,660 --> 00:49:39,126 this is the plus g. 977 00:49:39,126 --> 00:49:41,720 978 00:49:41,720 --> 00:49:47,060 So what we want to do to minimize this, is we can 979 00:49:47,060 --> 00:49:47,820 minimize this. 980 00:49:47,820 --> 00:49:50,630 This has the smallest value when these two terms are 981 00:49:50,630 --> 00:49:55,370 equal, which you can either know as a basic fact of the 982 00:49:55,370 --> 00:49:58,300 summation of these kinds of things, or you could take 983 00:49:58,300 --> 00:50:02,210 derivatives and so forth. 984 00:50:02,210 --> 00:50:05,050 Or you can just eyeball it and say, gee, if g is bigger than 985 00:50:05,050 --> 00:50:08,710 square root of n, then this is going to be the dominant, and 986 00:50:08,710 --> 00:50:11,500 if g is smaller than square root of n, then this is going 987 00:50:11,500 --> 00:50:12,630 to be dominant. 988 00:50:12,630 --> 00:50:15,130 And so when they're equal, that sounds like about when it 989 00:50:15,130 --> 00:50:17,720 should be the smallest, which is true. 990 00:50:17,720 --> 00:50:20,270 So we pick it to be about square root of n, to 991 00:50:20,270 --> 00:50:22,970 minimize the span. 992 00:50:22,970 --> 00:50:25,620 Since g, I don't have anything to minimize here. 
993 00:50:25,620 --> 00:50:30,200 So pick it around square root of n, then the span is around 994 00:50:30,200 --> 00:50:31,520 square root of n. 995 00:50:31,520 --> 00:50:37,680 And so then the parallelism is order square root of n. 996 00:50:37,680 --> 00:50:38,880 So that's pretty good. 997 00:50:38,880 --> 00:50:40,150 So that's not bad. 998 00:50:40,150 --> 00:50:42,950 So for something that's a big array, array of size 1 999 00:50:42,950 --> 00:50:46,310 million, parallelism might be 1,000. 1000 00:50:46,310 --> 00:50:50,300 That might be just hunky dory. 1001 00:50:50,300 --> 00:50:51,142 Question. 1002 00:50:51,142 --> 00:50:51,594 What's that? 1003 00:50:51,594 --> 00:50:54,510 AUDIENCE: I don't see where-- 1004 00:50:54,510 --> 00:50:56,170 CHARLES LEISERSON: We've picked g to be equal to 1005 00:50:56,170 --> 00:50:57,986 square root of n. 1006 00:50:57,986 --> 00:51:00,944 AUDIENCE: [INAUDIBLE] plus n over g, plus g. 1007 00:51:00,944 --> 00:51:02,430 I don't see where [INAUDIBLE]. 1008 00:51:02,430 --> 00:51:05,870 CHARLES LEISERSON: You don't see where this g came from? 1009 00:51:05,870 --> 00:51:09,540 This g comes from, because I'm doing g iterations here. 1010 00:51:09,540 --> 00:51:11,780 So remember that these are now of size g. 1011 00:51:11,780 --> 00:51:14,510 I'm doing g iterations in each leaf here, if 1012 00:51:14,510 --> 00:51:15,860 I set g to be large. 1013 00:51:15,860 --> 00:51:21,960 So I'm doing n over g pieces here, plus g iterations in my 1014 00:51:21,960 --> 00:51:22,610 [INAUDIBLE]. 1015 00:51:22,610 --> 00:51:24,090 Is that clear? 1016 00:51:24,090 --> 00:51:26,120 So the n over g is this part. 1017 00:51:26,120 --> 00:51:28,710 This size here, this is not one. 1018 00:51:28,710 --> 00:51:30,620 This has g iterations in it. 1019 00:51:30,620 --> 00:51:33,252 So the total span is g plus n over g. 1020 00:51:33,252 --> 00:51:37,370 1021 00:51:37,370 --> 00:51:39,280 Any other questions about this? 
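The minimization above can be checked numerically (a sketch with constant factors dropped): model the span as n/g for the serial chain of spawns plus g for the iterations in one leaf, and it bottoms out when g is about the square root of n.

```cpp
// Span of the chunked loop, up to constant factors: n/g spawns along
// the spine, plus g serial iterations in a single leaf.
double span(double n, double g) { return n / g + g; }
```

For n of a million, g = 1,000 gives a span of about 2,000, while both extremes (g = 1 and g = n) give a span of about a million.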
1022 00:51:39,280 --> 00:51:41,750 So basically, I get order square root of n. 1023 00:51:41,750 --> 00:51:44,270 1024 00:51:44,270 --> 00:51:49,370 And so this is not necessarily a bad way of doing it, but the 1025 00:51:49,370 --> 00:51:51,985 cilk for is a far more reliable way of making sure 1026 00:51:51,985 --> 00:51:54,230 that you get the parallelism than spawning 1027 00:51:54,230 --> 00:51:55,710 things off one by one. 1028 00:51:55,710 --> 00:52:00,000 One of the things, by the way, in this, I've seen people 1029 00:52:00,000 --> 00:52:03,370 write code where their first instinct is to write something 1030 00:52:03,370 --> 00:52:06,600 like this, where this that they're spawning off is only 1031 00:52:06,600 --> 00:52:07,660 constant time. 1032 00:52:07,660 --> 00:52:11,810 And they say, gee, I spawned off n things. 1033 00:52:11,810 --> 00:52:14,340 That's really parallel. 1034 00:52:14,340 --> 00:52:18,270 When in fact, their parallelism is order one. 1035 00:52:18,270 --> 00:52:22,600 So it's really seductive to think that you can get 1036 00:52:22,600 --> 00:52:23,880 parallelism by this, [? right. ?] 1037 00:52:23,880 --> 00:52:27,160 It's much better to do divide and conquer, and cilk for does 1038 00:52:27,160 --> 00:52:29,140 that for you automatically. 1039 00:52:29,140 --> 00:52:31,060 If you're going to do it by hand, sometimes you do want to 1040 00:52:31,060 --> 00:52:33,160 do it by hand, then you probably want to think more 1041 00:52:33,160 --> 00:52:35,900 about divide and conquer to generate parallelism, because 1042 00:52:35,900 --> 00:52:38,230 you'll have a small span, than doing many 1043 00:52:38,230 --> 00:52:39,480 things one at a time. 1044 00:52:39,480 --> 00:52:41,800 1045 00:52:41,800 --> 00:52:47,520 So here's some tips for performance. 1046 00:52:47,520 --> 00:52:52,090 So you want to minimize the span, so the parallelism is 1047 00:52:52,090 --> 00:52:53,470 the work over the span. 
1048 00:52:53,470 --> 00:52:57,880 So you want to minimize the span to maximize parallelism. 1049 00:52:57,880 --> 00:53:00,770 And in general, you should try to generate something like 10 1050 00:53:00,770 --> 00:53:04,290 times more parallelism than processors, if you want to get 1051 00:53:04,290 --> 00:53:05,510 near perfect linear speed-up. 1052 00:53:05,510 --> 00:53:09,200 In other words, a parallel slackness of 10 or better is 1053 00:53:09,200 --> 00:53:10,450 usually adequate. 1054 00:53:10,450 --> 00:53:13,340 1055 00:53:13,340 --> 00:53:16,190 If you can get more, you're now talking that you can get 1056 00:53:16,190 --> 00:53:22,720 more performance, but now you're getting performance 1057 00:53:22,720 --> 00:53:27,110 increases in the range of 5% or so, 5% to 10%, 1058 00:53:27,110 --> 00:53:29,890 something like that. 1059 00:53:29,890 --> 00:53:33,520 Second thing is if you have plenty of parallelism, try to 1060 00:53:33,520 --> 00:53:36,970 trade some of it off to reduce work overhead. 1061 00:53:36,970 --> 00:53:38,060 So this is a general case. 1062 00:53:38,060 --> 00:53:42,800 This is what actually goes on underneath in the cilk++ 1063 00:53:42,800 --> 00:53:45,790 runtime system, is they are trying to do this themselves. 1064 00:53:45,790 --> 00:53:49,640 But you in your own code can play exactly the same game. 1065 00:53:49,640 --> 00:53:52,330 Whenever you have a problem and it says, whoa, look at all 1066 00:53:52,330 --> 00:53:55,130 this parallelism, think about ways that you could reduce the 1067 00:53:55,130 --> 00:54:00,140 parallelism and get something back in the efficiency of the 1068 00:54:00,140 --> 00:54:03,490 work term, because the performance in the end is 1069 00:54:03,490 --> 00:54:08,070 going to be something like t1 over p plus t infinity. 1070 00:54:08,070 --> 00:54:11,400 If t infinity is small, it's like t1 over p, and so 1071 00:54:11,400 --> 00:54:16,060 anything you save in the t1 term is saving you overall. 
1072 00:54:16,060 --> 00:54:19,630 It's going to be a savings for you overall. 1073 00:54:19,630 --> 00:54:22,200 Use divide and conquer recursion on parallel loops 1074 00:54:22,200 --> 00:54:24,870 rather than spawning off one small thing after another. 1075 00:54:24,870 --> 00:54:28,930 In other words, do this not this, generally. 1076 00:54:28,930 --> 00:54:33,620 1077 00:54:33,620 --> 00:54:36,220 And here's some more tips. 1078 00:54:36,220 --> 00:54:39,520 Another thing that can happen that we looked at here was 1079 00:54:39,520 --> 00:54:42,300 make sure that the amount of work you're doing is 1080 00:54:42,300 --> 00:54:45,390 reasonably large compared to the number of spawns. 1081 00:54:45,390 --> 00:54:47,950 You could also say this is true when you do recursion for 1082 00:54:47,950 --> 00:54:49,280 function calls. 1083 00:54:49,280 --> 00:54:52,040 Make sure if you're just in serial programming, you always 1084 00:54:52,040 --> 00:54:54,180 want to make sure that the amount of work you're doing is 1085 00:54:54,180 --> 00:54:57,500 large compared to the number of function calls you're doing if 1086 00:54:57,500 --> 00:55:00,120 you can, and that'll make things go faster. 1087 00:55:00,120 --> 00:55:08,050 So same thing here, you want to have a lot of work compared 1088 00:55:08,050 --> 00:55:09,620 to the total number of spawns that you're 1089 00:55:09,620 --> 00:55:11,780 doing in your program. 1090 00:55:11,780 --> 00:55:14,520 So spawns, by the way, in this system, are about three or 1091 00:55:14,520 --> 00:55:19,500 four times the cost of a function call. 1092 00:55:19,500 --> 00:55:22,800 They're sort of the same order of magnitude as a function 1093 00:55:22,800 --> 00:55:26,670 call, a little bit heavier than a function call. 1094 00:55:26,670 --> 00:55:31,960 So you can spawn pretty readily, as long as the total 1095 00:55:31,960 --> 00:55:37,350 number of spawns you're doing isn't dominating your work.

1096 00:55:37,350 --> 00:55:40,210 Generally parallelize outer loops as opposed to inner 1097 00:55:40,210 --> 00:55:43,620 loops if you're forced to make a choice. 1098 00:55:43,620 --> 00:55:47,090 So it's always better to have an outer loop that runs in 1099 00:55:47,090 --> 00:55:51,000 parallel rather than an inner loop that runs in parallel, 1100 00:55:51,000 --> 00:55:54,750 because when you do an inner loop that runs in parallel, 1101 00:55:54,750 --> 00:55:56,590 you've got a lot of overhead to overcome. 1102 00:55:56,590 --> 00:56:00,980 But in an outer loop, you've got all of the inner loop to 1103 00:56:00,980 --> 00:56:03,930 amortize against the cost of the spawns that are being used 1104 00:56:03,930 --> 00:56:06,570 to parallelize the outer loop. 1105 00:56:06,570 --> 00:56:10,195 So you'll do many fewer spawns in the implementation. 1106 00:56:10,195 --> 00:56:12,810 1107 00:56:12,810 --> 00:56:14,510 Watch out for scheduling overheads. 1108 00:56:14,510 --> 00:56:18,620 1109 00:56:18,620 --> 00:56:21,990 So if you do something like this-- 1110 00:56:21,990 --> 00:56:27,310 so here we're parallelizing an inner loop rather than an 1111 00:56:27,310 --> 00:56:27,640 outer loop. 1112 00:56:27,640 --> 00:56:30,470 Now this turns out, it doesn't matter which order we're going 1113 00:56:30,470 --> 00:56:33,000 in or whatever. 1114 00:56:33,000 --> 00:56:35,650 It's generally not desirable to do this because I'm paying 1115 00:56:35,650 --> 00:56:40,230 scheduling overhead n times through this loop, whereas 1116 00:56:40,230 --> 00:56:43,930 here, I pay for scheduling overhead just twice. 1117 00:56:43,930 --> 00:56:46,920 1118 00:56:46,920 --> 00:56:50,010 So it is generally better, if I have n pieces of work to do, 1119 00:56:50,010 --> 00:56:52,400 rather than, in this case, parallelizing-- 1120 00:56:52,400 --> 00:56:55,200 1121 00:56:55,200 --> 00:56:57,510 let me slow down here. 1122 00:56:57,510 --> 00:56:58,980 So let's look at what this code does.
1123 00:56:58,980 --> 00:57:00,835 This says, go for two iterations. 1124 00:57:00,835 --> 00:57:03,740 1125 00:57:03,740 --> 00:57:05,980 Do something for which it is going to take n 1126 00:57:05,980 --> 00:57:09,840 iterations for j. 1127 00:57:09,840 --> 00:57:12,235 So two iterations for i, n iterations for j. 1128 00:57:12,235 --> 00:57:15,710 1129 00:57:15,710 --> 00:57:17,970 If you look at the parallelism of this, what is the 1130 00:57:17,970 --> 00:57:20,560 parallelism of this assuming that f is constant time? 1131 00:57:20,560 --> 00:57:23,830 1132 00:57:23,830 --> 00:57:25,220 What's the parallelism of this code? 1133 00:57:25,220 --> 00:57:33,460 1134 00:57:33,460 --> 00:57:35,210 Two. 1135 00:57:35,210 --> 00:57:37,570 The parallelism of two, because I've got two things on 1136 00:57:37,570 --> 00:57:39,930 the outer loop here, and then each is n. 1137 00:57:39,930 --> 00:57:43,420 So my span is essentially n. 1138 00:57:43,420 --> 00:57:46,770 My work is like 2n, something like that. 1139 00:57:46,770 --> 00:57:49,250 So it's got a parallelism of two. 1140 00:57:49,250 --> 00:57:50,800 What's the parallelism of this code? 1141 00:57:50,800 --> 00:58:04,970 1142 00:58:04,970 --> 00:58:05,790 What's the parallelism? 1143 00:58:05,790 --> 00:58:08,220 It's not n, because I'm basically going through this 1144 00:58:08,220 --> 00:58:10,070 serially, the outer loop serially. 1145 00:58:10,070 --> 00:58:18,680 1146 00:58:18,680 --> 00:58:20,490 What's the theoretical parallelism of this? 1147 00:58:20,490 --> 00:58:24,270 1148 00:58:24,270 --> 00:58:29,430 So for each iteration here, the parallelism is two. 1149 00:58:29,430 --> 00:58:31,170 No, not n. 1150 00:58:31,170 --> 00:58:34,980 It can't be n, because I'm basically only parallelizing 1151 00:58:34,980 --> 00:58:37,530 two things, and I'm doing them serially.
1152 00:58:37,530 --> 00:58:40,462 1153 00:58:40,462 --> 00:58:44,540 The outer loop is going serially through the code and 1154 00:58:44,540 --> 00:58:47,480 it's spawning off two things, two things, two things, two 1155 00:58:47,480 --> 00:58:48,530 things, two things. 1156 00:58:48,530 --> 00:58:50,380 And waiting for them to be done, two things, wait for it 1157 00:58:50,380 --> 00:58:52,540 to be done, two things, wait for it to be done. 1158 00:58:52,540 --> 00:58:54,930 So the parallelism is two. 1159 00:58:54,930 --> 00:58:56,180 These have the same parallelism. 1160 00:58:56,180 --> 00:58:58,690 1161 00:58:58,690 --> 00:59:03,190 However if you run this, this one will give you a speedup of 1162 00:59:03,190 --> 00:59:07,530 two on two cores, very close to it. 1163 00:59:07,530 --> 00:59:09,990 Because there's the scheduling overhead here, you've only 1164 00:59:09,990 --> 00:59:12,390 paid once for the scheduling overhead, and then you're 1165 00:59:12,390 --> 00:59:14,640 doing a whole bunch of stuff. 1166 00:59:14,640 --> 00:59:17,140 So remember, to schedule it, it's got to be migrated, it's 1167 00:59:17,140 --> 00:59:19,910 got to be moved to another processor, et cetera. 1168 00:59:19,910 --> 00:59:25,410 This one, it's not even worth it probably to steal each of 1169 00:59:25,410 --> 00:59:26,250 these individual things. 1170 00:59:26,250 --> 00:59:28,770 You're spawning off things that are so small, this may 1171 00:59:28,770 --> 00:59:33,395 even have parallelism that's less than 1 in practice. 1172 00:59:33,395 --> 00:59:35,880 And if you look at the cilkview tool, this will show 1173 00:59:35,880 --> 00:59:38,180 you a high burden parallelism. 1174 00:59:38,180 --> 00:59:41,450 Because the cilkview tool, the burden parallelism tells you 1175 00:59:41,450 --> 00:59:46,980 what the overhead is from scheduling, as well as what 1176 00:59:46,980 --> 00:59:48,310 the actual parallelism is. 
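The two loop orderings being contrasted can be sketched like this (plain serial C++; comments mark where the cilk_for would go in the lecture's version, and f is a stand-in for the constant-time body):

```cpp
int calls = 0;
void f(int i, int j) { ++calls; }  // stand-in for the constant-time body

// Outer loop parallel: the cilk_for on i pays scheduling overhead just
// twice, and each branch then does n iterations of real serial work.
void outer_parallel(int n) {
    for (int i = 0; i < 2; ++i)        // cilk_for in the lecture's version
        for (int j = 0; j < n; ++j)
            f(i, j);
}

// Inner loop parallel: same theoretical parallelism of 2, but the
// cilk_for on j pays scheduling overhead on all n outer iterations.
void inner_parallel(int n) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < 2; ++j)    // cilk_for in the lecture's version
            f(i, j);
}
```

Both do the same 2n calls to f; only the placement of the scheduling overhead differs.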
1177 00:59:48,310 --> 00:59:51,010 And it recognizes that oh, gee whiz. 1178 00:59:51,010 --> 00:59:53,625 This thing really has very small-- 1179 00:59:53,625 --> 01:00:00,670 1180 01:00:00,670 --> 01:00:02,200 there's almost no work in here. 1181 01:00:02,200 --> 01:00:04,000 So you're trying to parallelize something where 1182 01:00:04,000 --> 01:00:06,960 the work is so small, it's not even worth migrating it to 1183 01:00:06,960 --> 01:00:10,380 take advantage of it. 1184 01:00:10,380 --> 01:00:12,740 So those are some tips. 1185 01:00:12,740 --> 01:00:15,140 Now let's go through and analyze a bunch of algorithms 1186 01:00:15,140 --> 01:00:16,730 reasonably quickly. 1187 01:00:16,730 --> 01:00:21,670 We'll start with matrix multiplication. 1188 01:00:21,670 --> 01:00:22,935 People seen this problem before? 1189 01:00:22,935 --> 01:00:28,400 1190 01:00:28,400 --> 01:00:31,900 Here's the matrix multiplication problem. 1191 01:00:31,900 --> 01:00:33,780 And let's assume for simplicity that n 1192 01:00:33,780 --> 01:00:35,030 is a power of 2. 1193 01:00:35,030 --> 01:00:38,470 1194 01:00:38,470 --> 01:00:44,250 So basically, let's start out with just our looping version. 1195 01:00:44,250 --> 01:00:46,390 In fact, this isn't even a very good looping version, 1196 01:00:46,390 --> 01:00:49,340 because I've got the order of the loops wrong, I think. 1197 01:00:49,340 --> 01:00:52,380 But it is just illustrative. 1198 01:00:52,380 --> 01:00:55,080 Basically let's parallelize the outer two loops. 1199 01:00:55,080 --> 01:00:57,070 I can't parallelize the inner loop. 1200 01:00:57,070 --> 01:00:58,140 Why not? 1201 01:00:58,140 --> 01:01:00,180 What happens if I tried to parallelize the inner loop 1202 01:01:00,180 --> 01:01:03,095 with a cilk_for in this implementation? 1203 01:01:03,095 --> 01:01:07,580 1204 01:01:07,580 --> 01:01:11,040 Why can't I just put a cilk_for there? 1205 01:01:11,040 --> 01:01:12,408 Yes, somebody said it. 
1206 01:01:12,408 --> 01:01:14,600 AUDIENCE: It does that in cij. 1207 01:01:14,600 --> 01:01:17,220 CHARLES LEISERSON: Yeah, we get a race condition. 1208 01:01:17,220 --> 01:01:19,980 We have more than two things in parallel trying to update 1209 01:01:19,980 --> 01:01:25,070 the same cij, and we'll have a race condition. 1210 01:01:25,070 --> 01:01:29,000 So always run cilkview to check your performance. 1211 01:01:29,000 --> 01:01:33,695 But always, always, run cilkscreen to tell whether or not 1212 01:01:33,695 --> 01:01:35,060 you've got races in your code. 1213 01:01:35,060 --> 01:01:38,500 1214 01:01:38,500 --> 01:01:40,650 So yeah, you'll have a race condition if you try to 1215 01:01:40,650 --> 01:01:43,160 naively parallelize the loop here. 1216 01:01:43,160 --> 01:01:46,570 1217 01:01:46,570 --> 01:01:47,990 So the work of this is what? 1218 01:01:47,990 --> 01:01:53,460 1219 01:01:53,460 --> 01:01:58,090 It's order n cubed, just three nested loops each going to n. 1220 01:01:58,090 --> 01:01:59,340 What's the span of this? 1221 01:01:59,340 --> 01:02:12,180 1222 01:02:12,180 --> 01:02:13,430 What's the span of this? 1223 01:02:13,430 --> 01:02:20,990 1224 01:02:20,990 --> 01:02:26,900 It's order n, because it's log n for this loop, log n for 1225 01:02:26,900 --> 01:02:30,290 this loop, plus the maximum of this, well, that's n. 1226 01:02:30,290 --> 01:02:34,860 Log n plus log n plus n is order n. 1227 01:02:34,860 --> 01:02:37,130 So order n span, which says 1228 01:02:37,130 --> 01:02:42,340 parallelism is order n squared. 1229 01:02:42,340 --> 01:02:45,080 So for 1,000 by 1,000 matrices, the parallelism is 1230 01:02:45,080 --> 01:02:49,376 on the order of a million. 1231 01:02:49,376 --> 01:02:50,626 Wow. 1232 01:02:50,626 --> 01:02:52,430 1233 01:02:52,430 --> 01:02:53,680 That's great.
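The looping version on the slide corresponds to something like the following serial sketch. In the parallel version, the two outer loops become cilk_for; the inner k loop must stay serial, because all n of its iterations update the same entry of c.

```cpp
// C += A * B for n x n matrices stored in row-major order.
void mm_loops(double* C, const double* A, const double* B, int n) {
    for (int i = 0; i < n; ++i)          // cilk_for in the parallel version
        for (int j = 0; j < n; ++j)      // cilk_for in the parallel version
            for (int k = 0; k < n; ++k)  // must stay serial: every k
                C[i * n + j] += A[i * n + k] * B[k * n + j];  // updates C[i][j]
}
```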
1234 01:02:53,680 --> 01:02:56,050 1235 01:02:56,050 --> 01:03:00,600 However, it's on the order of a million, but as we know, 1236 01:03:00,600 --> 01:03:04,630 this doesn't use cache very effectively. 1237 01:03:04,630 --> 01:03:06,890 So one of the nice things about doing divide and conquer 1238 01:03:06,890 --> 01:03:08,860 is, as you know, that's a really good way to take 1239 01:03:08,860 --> 01:03:11,850 advantage of caching. 1240 01:03:11,850 --> 01:03:15,100 And this works in parallel, too. 1241 01:03:15,100 --> 01:03:18,920 In particular because whenever you have sufficient 1242 01:03:18,920 --> 01:03:23,880 parallelism, these processors are executing the code just as 1243 01:03:23,880 --> 01:03:26,160 if they were executing serial code. 1244 01:03:26,160 --> 01:03:28,750 So you get all the same cache locality you would get in the 1245 01:03:28,750 --> 01:03:32,280 serial code in the parallel code, except for the times 1246 01:03:32,280 --> 01:03:34,300 that you're actually migrating work. 1247 01:03:34,300 --> 01:03:35,710 And if you have sufficient parallelism, 1248 01:03:35,710 --> 01:03:38,590 that isn't too often. 1249 01:03:38,590 --> 01:03:40,740 So let's take a look at recursive divide and conquer 1250 01:03:40,740 --> 01:03:41,890 multiplication. 1251 01:03:41,890 --> 01:03:45,770 So we're familiar with this, too. 1252 01:03:45,770 --> 01:03:48,520 So this is eight multiplications of n over 2 by 1253 01:03:48,520 --> 01:03:51,350 n over 2 matrices, and one addition of n by n matrices. 1254 01:03:51,350 --> 01:03:55,970 So here's some code using a little bit of C++ism. 1255 01:03:55,970 --> 01:04:01,560 So I've made the type a variable t. 1256 01:04:01,560 --> 01:04:07,710 So we're going to do matrix multiplication of an array, a, 1257 01:04:07,710 --> 01:04:10,220 the result is going to go in c, and we're going to 1258 01:04:10,220 --> 01:04:13,630 basically have a and b, and we're going to add 1259 01:04:13,630 --> 01:04:15,240 the result into c.
1260 01:04:15,240 --> 01:04:19,950 We have n, which is the side of the submatrix that we're 1261 01:04:19,950 --> 01:04:23,080 working on, and we're also going to have an n size, which 1262 01:04:23,080 --> 01:04:26,105 is the length of the row in the original matrix. 1263 01:04:26,105 --> 01:04:29,870 So remember when we do matrix things, if I take a submatrix, 1264 01:04:29,870 --> 01:04:31,240 it's not contiguous in memory. 1265 01:04:31,240 --> 01:04:34,470 So I have to know the row size of the matrix that I'm in in 1266 01:04:34,470 --> 01:04:38,710 order to be able to calculate what the elements are. 1267 01:04:38,710 --> 01:04:41,370 So the way it's going to work is I'm going to assign this 1268 01:04:41,370 --> 01:04:46,060 temporary d, by using the new-- 1269 01:04:46,060 --> 01:04:49,096 which is basically memory allocation C++-- 1270 01:04:49,096 --> 01:04:52,410 array of size n by n. 1271 01:04:52,410 --> 01:04:58,670 And what we're going to do is then do four of the recursive 1272 01:04:58,670 --> 01:05:05,470 multiplications, these guys here, into the elements of c, 1273 01:05:05,470 --> 01:05:12,000 and then four of them also into d using the temporary. 1274 01:05:12,000 --> 01:05:14,690 And then we're going to sync, after we get all that parallel 1275 01:05:14,690 --> 01:05:18,440 work done, and then we're going to add d into c, and 1276 01:05:18,440 --> 01:05:22,880 then we'll delete d, because we allocated it up here. 1277 01:05:22,880 --> 01:05:25,920 Everybody understand the code? 1278 01:05:25,920 --> 01:05:27,080 So we're doing this, it's just we're 1279 01:05:27,080 --> 01:05:30,490 going to do it in parallel. 1280 01:05:30,490 --> 01:05:31,520 Good? 1281 01:05:31,520 --> 01:05:33,620 Questions? 1282 01:05:33,620 --> 01:05:36,260 OK. 
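A serial sketch of the structure being described, under a few assumptions: doubles in place of the template type T, a helper lambda for the quadrant index calculation, and the eight recursive calls written serially where the lecture's code cilk_spawns them and syncs before the final add.

```cpp
// C += A * B, where each argument is an n x n submatrix embedded in a
// parent matrix whose rows have length 'size' (submatrices are not
// contiguous in memory, which is why 'size' is passed along).
void mm_dac(double* C, double* A, double* B, int n, int size) {
    if (n == 1) {               // base case; real code would coarsen
        *C += *A * *B;
        return;
    }
    int h = n / 2;
    // Quadrant (r, c) of a matrix whose rows have length rs.
    auto quad = [h](double* M, int r, int c, int rs) {
        return M + r * h * rs + c * h;
    };
    double* D = new double[n * n]();  // zero-initialized temporary
    for (int r = 0; r < 2; ++r)
        for (int c = 0; c < 2; ++c) {
            // Four products accumulate into C and four into D; in the
            // lecture these eight calls run in parallel via cilk_spawn.
            mm_dac(quad(C, r, c, size), quad(A, r, 0, size),
                   quad(B, 0, c, size), h, size);
            mm_dac(quad(D, r, c, n), quad(A, r, 1, size),
                   quad(B, 1, c, size), h, n);  // D's rows have length n
        }
    // After the sync: add D into C, then free the temporary.
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            C[i * size + j] += D[i * n + j];
    delete[] D;
}
```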
1283 01:05:36,260 --> 01:05:38,680 So this is the row length of the matrices so that I can do 1284 01:05:38,680 --> 01:05:41,590 the base cases, and in particular, partition the 1285 01:05:41,590 --> 01:05:43,350 matrices effectively. 1286 01:05:43,350 --> 01:05:45,720 I haven't shown that code. 1287 01:05:45,720 --> 01:05:47,980 And of course, the base case, normally, we would want to 1288 01:05:47,980 --> 01:05:49,560 coarsen for efficiency. 1289 01:05:49,560 --> 01:05:52,390 I would want to go down to something like maybe an eight 1290 01:05:52,390 --> 01:05:57,090 by eight or 16 by 16 matrix, and at that point switch to 1291 01:05:57,090 --> 01:06:01,880 something that's going to use the processor pipeline better. 1292 01:06:01,880 --> 01:06:04,360 The base cases, once again, I want to emphasize this because 1293 01:06:04,360 --> 01:06:07,430 a couple people on the quiz misunderstood this. 1294 01:06:07,430 --> 01:06:11,250 The reason you coarsen has nothing to do with caches. 1295 01:06:11,250 --> 01:06:14,410 The reason you coarsen is to overcome the overhead of the 1296 01:06:14,410 --> 01:06:18,440 function calls, and the coarsening is generally chosen 1297 01:06:18,440 --> 01:06:21,280 independent of what the size of the caches are. 1298 01:06:21,280 --> 01:06:25,590 It's not a parameter that has to be tuned to cache size. 1299 01:06:25,590 --> 01:06:28,190 It's a parameter that has to be tuned to the cost of a function 1300 01:06:28,190 --> 01:06:33,080 call versus ALU instructions, and what that balance is. 1301 01:06:33,080 --> 01:06:34,464 Question? 1302 01:06:34,464 --> 01:06:37,446 AUDIENCE: I mean, I understand that's true, but I thought-- 1303 01:06:37,446 --> 01:06:39,434 I mean, maybe I [? heard the call ?]
wrong, 1304 01:06:39,434 --> 01:06:42,416 but I thought we wanted, in general, in terms of caching, 1305 01:06:42,416 --> 01:06:45,895 that you would choose it somehow so that all of the 1306 01:06:45,895 --> 01:06:48,380 data that you have would somehow fit-- 1307 01:06:48,380 --> 01:06:50,060 CHARLES LEISERSON: That's what the divide and conquer does 1308 01:06:50,060 --> 01:06:52,210 automatically. 1309 01:06:52,210 --> 01:06:54,550 The divide and conquer keeps halving it until it fits in 1310 01:06:54,550 --> 01:06:56,450 whatever size cache you have. 1311 01:06:56,450 --> 01:06:58,330 And in fact, we have three caches on the 1312 01:06:58,330 --> 01:06:59,770 machines we're using. 1313 01:06:59,770 --> 01:07:03,363 AUDIENCE: Yeah, but I'm saying if your coarsened constant is 1314 01:07:03,363 --> 01:07:05,160 too big, that's not going to happen. 1315 01:07:05,160 --> 01:07:07,030 CHARLES LEISERSON: If the coarsened constant is too big, 1316 01:07:07,030 --> 01:07:08,180 that's not going to happen. 1317 01:07:08,180 --> 01:07:12,120 But generally, the caches are much bigger than what you need 1318 01:07:12,120 --> 01:07:14,410 to do to amortize the cost. 1319 01:07:14,410 --> 01:07:16,660 But you're right, that is an assumption. 1320 01:07:16,660 --> 01:07:19,490 The caches are generally much bigger than the size that you 1321 01:07:19,490 --> 01:07:21,680 need in order to overcome function call overhead. 1322 01:07:21,680 --> 01:07:24,170 Function call overhead is not that high. 1323 01:07:24,170 --> 01:07:24,930 OK? 1324 01:07:24,930 --> 01:07:27,020 Good. 1325 01:07:27,020 --> 01:07:29,280 I'm glad I raised that issue again. 1326 01:07:29,280 --> 01:07:31,690 And so we're going to determine the submatrices by 1327 01:07:31,690 --> 01:07:33,330 index calculation. 
1328 01:07:33,330 --> 01:07:36,400 And then we have to implement this parallel add, and that 1329 01:07:36,400 --> 01:07:42,270 I'm going to do just with a doubly nested for loop to add 1330 01:07:42,270 --> 01:07:43,520 the things. 1331 01:07:43,520 --> 01:07:45,530 1332 01:07:45,530 --> 01:07:48,650 There's no cache behavior I can really take advantage of 1333 01:07:48,650 --> 01:07:51,750 here except for spatial locality. 1334 01:07:51,750 --> 01:07:53,760 There's no temporal locality because I'm just adding two 1335 01:07:53,760 --> 01:07:59,270 matrices once, so there's no real temporal locality that 1336 01:07:59,270 --> 01:08:00,726 I'll get out of it. 1337 01:08:00,726 --> 01:08:02,440 And here, I've actually done the index 1338 01:08:02,440 --> 01:08:03,690 calculations by hand. 1339 01:08:03,690 --> 01:08:08,550 1340 01:08:08,550 --> 01:08:11,050 So let's analyze this. 1341 01:08:11,050 --> 01:08:13,790 So to analyze the multiplication program, I have 1342 01:08:13,790 --> 01:08:16,590 to start by analyzing the addition program. 1343 01:08:16,590 --> 01:08:19,890 So this should be, I think, fairly straightforward. 1344 01:08:19,890 --> 01:08:26,290 What's the work for adding two n by n matrices here? 1345 01:08:26,290 --> 01:08:29,689 n squared, good, just doubly nested loop. 1346 01:08:29,689 --> 01:08:32,612 What's the span? 1347 01:08:32,612 --> 01:08:35,609 AUDIENCE: [INAUDIBLE]. 1348 01:08:35,609 --> 01:08:37,859 CHARLES LEISERSON: Yeah, here it's log n, very good. 1349 01:08:37,859 --> 01:08:40,830 Because I've got log n plus log n plus order one. 1350 01:08:40,830 --> 01:08:44,600 1351 01:08:44,600 --> 01:08:46,470 I'm not going to analyze the parallelism, because I really 1352 01:08:46,470 --> 01:08:48,260 don't care about the parallelism of the addition. 1353 01:08:48,260 --> 01:08:51,180 I really care about the parallelism of the matrix 1354 01:08:51,180 --> 01:08:51,859 multiplication. 
1355 01:08:51,859 --> 01:08:54,859 But we'll plug those values in now. 1356 01:08:54,859 --> 01:08:57,430 What is the work of the matrix multiplication? 1357 01:08:57,430 --> 01:09:02,359 Well for this, what we want to do is get a recurrence that we 1358 01:09:02,359 --> 01:09:03,140 can then solve. 1359 01:09:03,140 --> 01:09:04,920 So what's the recurrence that we want to get 1360 01:09:04,920 --> 01:09:06,450 for m of 1 of n? 1361 01:09:06,450 --> 01:09:11,050 1362 01:09:11,050 --> 01:09:17,439 Yeah, it's going to be 8m sub 1 of n over 2, that 1363 01:09:17,439 --> 01:09:22,399 corresponds to these things, plus some constant stuff, plus 1364 01:09:22,399 --> 01:09:23,649 the work of the addition. 1365 01:09:23,649 --> 01:09:26,300 1366 01:09:26,300 --> 01:09:28,040 Does that make sense? 1367 01:09:28,040 --> 01:09:29,390 We analyze the work of the addition. 1368 01:09:29,390 --> 01:09:32,000 What's the work of the addition? 1369 01:09:32,000 --> 01:09:33,630 Order n squared. 1370 01:09:33,630 --> 01:09:36,100 So that's going to dominate that constant 1371 01:09:36,100 --> 01:09:38,569 there, so we get 8. 1372 01:09:38,569 --> 01:09:41,090 And what's the solution to this? 1373 01:09:41,090 --> 01:09:43,319 Back to Master Theorem. 1374 01:09:43,319 --> 01:09:46,210 Now we're going to start pulling out the Master Theorem 1375 01:09:46,210 --> 01:09:49,850 multiple times per slide for the rest of the lecture. 1376 01:09:49,850 --> 01:09:52,700 n cubed, because we have log base 2 of 8. 1377 01:09:52,700 --> 01:09:57,330 That's n cubed compared with n squared. 1378 01:09:57,330 --> 01:10:00,450 So we get a solution which is n cubed-- 1379 01:10:00,450 --> 01:10:02,390 Case 3 of the Master Theorem. 1380 01:10:02,390 --> 01:10:03,670 So that's good. 1381 01:10:03,670 --> 01:10:06,140 The work we're doing is the same asymptotic work we're 1382 01:10:06,140 --> 01:10:09,220 doing for the triply nested loop. 
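As a numerical sanity check of that Case 3 answer (this is an illustration, not lecture code): the work recurrence M1(n) = 8 M1(n/2) + n^2 with M1(1) = 1 has the exact closed form 2n^3 - n^2 on powers of two, i.e. Theta(n^3), since n^(log base 2 of 8) = n^3 dominates the n^2 addition term.

```cpp
#include <cstdint>

// Evaluate the work recurrence M1(n) = 8*M1(n/2) + n^2, M1(1) = 1,
// for n a power of two. Unwinding it gives exactly 2n^3 - n^2,
// which is Theta(n^3), matching Case 3 of the Master Theorem.
std::uint64_t work(std::uint64_t n) {
    if (n == 1) return 1;               // base case: constant work
    return 8 * work(n / 2) + n * n;     // 8 subproblems + Theta(n^2) add
}
```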
1383 01:10:09,220 --> 01:10:11,380 Now let's take a look at the span. 1384 01:10:11,380 --> 01:10:14,170 1385 01:10:14,170 --> 01:10:15,790 So what's the span for this? 1386 01:10:15,790 --> 01:10:21,030 1387 01:10:21,030 --> 01:10:23,420 So once again, we want a recurrence. 1388 01:10:23,420 --> 01:10:24,670 What's the recurrence look like? 1389 01:10:24,670 --> 01:10:31,930 1390 01:10:31,930 --> 01:10:35,444 So the span of this is going to be the span of-- 1391 01:10:35,444 --> 01:10:37,345 it's going to be the sum of some things. 1392 01:10:37,345 --> 01:10:39,990 1393 01:10:39,990 --> 01:10:43,950 But the key observation is that it's going to be-- we 1394 01:10:43,950 --> 01:10:45,800 want the maximum of these guys. 1395 01:10:45,800 --> 01:10:52,970 1396 01:10:52,970 --> 01:10:55,850 So we're going to basically have the allocation as 1397 01:10:55,850 --> 01:11:00,900 constant time, we have the maximum of these, which is m 1398 01:11:00,900 --> 01:11:03,900 infinity of n over 2, and then we have 1399 01:11:03,900 --> 01:11:07,160 the span of the add. 1400 01:11:07,160 --> 01:11:10,060 So we get this recurrence. 1401 01:11:10,060 --> 01:11:12,930 m infinity of n over 2, because we have only to worry 1402 01:11:12,930 --> 01:11:15,740 about the worst of these guys. 1403 01:11:15,740 --> 01:11:17,400 The worst of them is-- 1404 01:11:17,400 --> 01:11:19,620 they're all symmetric, so it's basically the same. 1405 01:11:19,620 --> 01:11:21,800 We have a of n, and then there's a constant amount of 1406 01:11:21,800 --> 01:11:24,290 other overhead here. 1407 01:11:24,290 --> 01:11:28,320 Any questions about where I pulled that out of, why that's 1408 01:11:28,320 --> 01:11:29,570 the recurrence? 1409 01:11:29,570 --> 01:11:32,710 1410 01:11:32,710 --> 01:11:36,390 So this is the addition, the span of the addition of this 1411 01:11:36,390 --> 01:11:37,570 guy that we analyzed already. 1412 01:11:37,570 --> 01:11:41,000 What is the span of the addition?
1413 01:11:41,000 --> 01:11:42,900 What did we decide that was? 1414 01:11:42,900 --> 01:11:44,530 log n. 1415 01:11:44,530 --> 01:11:46,730 So basically, that dominates the order one. 1416 01:11:46,730 --> 01:11:48,920 So we get this term, and what's the solution of this 1417 01:11:48,920 --> 01:11:49,530 recurrence? 1418 01:11:49,530 --> 01:11:50,780 AUDIENCE: [INAUDIBLE]. 1419 01:11:50,780 --> 01:11:54,270 1420 01:11:54,270 --> 01:11:55,727 CHARLES LEISERSON: What case is this? 1421 01:11:55,727 --> 01:11:57,515 AUDIENCE: [INAUDIBLE] 1422 01:11:57,515 --> 01:11:58,860 log n squared. 1423 01:11:58,860 --> 01:12:02,500 CHARLES LEISERSON: Yes, it's log squared n. 1424 01:12:02,500 --> 01:12:04,530 So basically, it's case two. 1425 01:12:04,530 --> 01:12:07,880 So if I do n to the log base b of a, that's n to the log base 1426 01:12:07,880 --> 01:12:12,330 2 of 1, that's just 1. 1427 01:12:12,330 --> 01:12:15,900 And so this is basically a logarithmic factor times the 1428 01:12:15,900 --> 01:12:18,250 1, so we add an extra log. 1429 01:12:18,250 --> 01:12:19,500 We get log squared n. 1430 01:12:19,500 --> 01:12:22,330 1431 01:12:22,330 --> 01:12:24,300 That's just Master Theorem plugging in. 1432 01:12:24,300 --> 01:12:26,640 So here, the span is order log squared n. 1433 01:12:26,640 --> 01:12:29,370 1434 01:12:29,370 --> 01:12:33,840 And so we have the work of n cubed, the span of log squared 1435 01:12:33,840 --> 01:12:37,150 n, so the parallelism is the ratio, which is n cubed over 1436 01:12:37,150 --> 01:12:38,400 log squared n. 1437 01:12:38,400 --> 01:12:40,810 1438 01:12:40,810 --> 01:12:43,150 Not too bad: for 1,000 by 1,000 matrices, the 1439 01:12:43,150 --> 01:12:49,300 parallelism is about 10 million. 1440 01:12:49,300 --> 01:12:50,550 Plenty of parallelism.
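A back-of-the-envelope check of that "about 10 million" claim (an illustration, not lecture code): parallelism is work over span, Theta(n^3) over Theta(log^2 n), so for n = 1000, ignoring constant factors, n^3 / (log2 n)^2 comes out on the order of 10^7.

```cpp
#include <cmath>

// Estimate the parallelism n^3 / (log2 n)^2, dropping the hidden
// constants in the Theta bounds. For n = 1000 this is roughly
// 10^9 / 100, i.e. about 10 million.
double parallelism(double n) {
    double lg = std::log2(n);
    return (n * n * n) / (lg * lg);
}
```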
1441 01:12:50,550 --> 01:12:52,940 1442 01:12:52,940 --> 01:12:55,760 So let's use the fact that we have plenty of parallelism to 1443 01:12:55,760 --> 01:12:58,500 say, let's get rid of some of that parallelism and put it 1444 01:12:58,500 --> 01:13:02,180 back into making our code more efficient. 1445 01:13:02,180 --> 01:13:09,520 So in particular, this code uses an extra temporary d, 1446 01:13:09,520 --> 01:13:14,080 which it allocates here and it deletes here. 1447 01:13:14,080 --> 01:13:16,640 And generally, there's a good rule that says, if you use 1448 01:13:16,640 --> 01:13:18,570 more storage you're going to use more time, because you're 1449 01:13:18,570 --> 01:13:20,770 going to have to look at that storage, it's going to take up 1450 01:13:20,770 --> 01:13:23,220 space in your cache, and it's generally 1451 01:13:23,220 --> 01:13:24,300 going to make you slower. 1452 01:13:24,300 --> 01:13:28,470 So things that use less storage are generally faster. 1453 01:13:28,470 --> 01:13:30,240 Not always the case, sometimes there's a trade off. 1454 01:13:30,240 --> 01:13:34,290 But often it's the case, use more storage, it runs slower. 1455 01:13:34,290 --> 01:13:36,380 So let's get rid of this guy. 1456 01:13:36,380 --> 01:13:37,630 How do we get rid of this guy? 1457 01:13:37,630 --> 01:13:44,440 1458 01:13:44,440 --> 01:13:45,754 Yeah? 1459 01:13:45,754 --> 01:13:47,004 AUDIENCE: [INAUDIBLE PHRASE]. 1460 01:13:47,004 --> 01:13:55,140 1461 01:13:55,140 --> 01:13:55,530 CHARLES LEISERSON: You're going to do this serially, 1462 01:13:55,530 --> 01:13:55,860 you're saying? 1463 01:13:55,860 --> 01:13:58,510 AUDIENCE: Yeah, you do those serially in add. 
1464 01:13:58,510 --> 01:14:00,880 CHARLES LEISERSON: If you do this serially in add, it turns 1465 01:14:00,880 --> 01:14:03,290 out if you do that, you're going to be in trouble because 1466 01:14:03,290 --> 01:14:07,120 you're going to not have very much parallelism, 1467 01:14:07,120 --> 01:14:09,330 unfortunately. 1468 01:14:09,330 --> 01:14:11,920 Actually, analyzing exactly what the parallelism is there 1469 01:14:11,920 --> 01:14:13,330 is actually pretty good. 1470 01:14:13,330 --> 01:14:15,130 It's a good puzzle. 1471 01:14:15,130 --> 01:14:18,740 Maybe we'll do that on the quiz, the take home problem 1472 01:14:18,740 --> 01:14:21,110 set we're calling it now, right? 1473 01:14:21,110 --> 01:14:23,170 We're going to have a take home problem set, maybe that's 1474 01:14:23,170 --> 01:14:26,100 a good one. 1475 01:14:26,100 --> 01:14:29,865 Yeah, so the idea is, you can sync. 1476 01:14:29,865 --> 01:14:36,810 And in particular, why not compute these, then sync, and 1477 01:14:36,810 --> 01:14:39,780 then compute these, adding their results into the places 1478 01:14:39,780 --> 01:14:41,030 where we added these in? 1479 01:14:41,030 --> 01:14:43,850 1480 01:14:43,850 --> 01:14:47,370 So it's making the program more serial, because I'm 1481 01:14:47,370 --> 01:14:50,420 putting in a sync. 1482 01:14:50,420 --> 01:14:52,300 That shouldn't have an impact on the work, but it will have 1483 01:14:52,300 --> 01:14:54,845 an impact on the span. 1484 01:14:54,845 --> 01:14:58,430 1485 01:14:58,430 --> 01:15:00,880 So we're going to trade it off, and the way we'll do that 1486 01:15:00,880 --> 01:15:04,470 is by putting essentially a sync in the middle. 1487 01:15:04,470 --> 01:15:06,700 And since they're adding it in, I don't even have to call 1488 01:15:06,700 --> 01:15:10,800 the addition routine, because it's just going to 1489 01:15:10,800 --> 01:15:12,800 add it in in place.
1490 01:15:12,800 --> 01:15:16,190 So I spawn off these four guys, putting their results 1491 01:15:16,190 --> 01:15:20,260 into c, then I spawn off these four guys, and they add their 1492 01:15:20,260 --> 01:15:21,920 results into c. 1493 01:15:21,920 --> 01:15:25,060 Is that clear what the code is? 1494 01:15:25,060 --> 01:15:26,415 So let's analyze this. 1495 01:15:26,415 --> 01:15:32,050 1496 01:15:32,050 --> 01:15:41,570 So the work for this is order n cubed. 1497 01:15:41,570 --> 01:15:43,580 It's the same as anything else, we can come up with a 1498 01:15:43,580 --> 01:15:46,380 recurrence, slightly different from before because I only 1499 01:15:46,380 --> 01:15:48,510 have an order one there, but it doesn't really matter. 1500 01:15:48,510 --> 01:15:51,570 The answer is order n cubed. 1501 01:15:51,570 --> 01:15:55,850 The span, now this gets a little trickier. 1502 01:15:55,850 --> 01:15:57,200 What's the recurrence of the span? 1503 01:15:57,200 --> 01:16:04,430 1504 01:16:04,430 --> 01:16:06,570 AUDIENCE: [INAUDIBLE]. 1505 01:16:06,570 --> 01:16:07,706 CHARLES LEISERSON: What is that? 1506 01:16:07,706 --> 01:16:09,690 AUDIENCE: Twice the span of m of n over 2. 1507 01:16:09,690 --> 01:16:12,780 CHARLES LEISERSON: Twice the span of m of n 1508 01:16:12,780 --> 01:16:15,510 over 2, that's right. 1509 01:16:15,510 --> 01:16:18,560 So basically, we have the maximum of these guys, the 1510 01:16:18,560 --> 01:16:20,870 maximum of these guys, and then this is making those 1511 01:16:20,870 --> 01:16:23,090 things be in series. 1512 01:16:23,090 --> 01:16:25,650 So things that are in parallel I take the max, if it's in 1513 01:16:25,650 --> 01:16:26,940 series, I have to add them. 1514 01:16:26,940 --> 01:16:31,916 So I end up with 2m infinity of n over 2 plus order one. 1515 01:16:31,916 --> 01:16:35,270 Does that make sense? 1516 01:16:35,270 --> 01:16:36,250 OK, good. 1517 01:16:36,250 --> 01:16:37,550 So let's solve that recurrence. 
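That spawn/sync structure can be sketched as follows; this is my reconstruction in plain C++, not the slide's exact code. The spawn and sync points are marked in comments (so this is the serial elision), the base case is n = 1 rather than a coarsened one, and s is the row length (stride) of the full matrices for the index calculation:

```cpp
#include <cstddef>

// In-place divide-and-conquer multiply: C += A * B with no temporary.
// n is a power of two; all three matrices are n-by-n submatrices of
// row-major arrays with row length s.
void matmul_add(double* C, const double* A, const double* B,
                std::size_t n, std::size_t s) {
    if (n == 1) {                      // base case; real code would coarsen
        C[0] += A[0] * B[0];
        return;
    }
    std::size_t h = n / 2;
    // Quadrants by index calculation: X(i,j) starts at X + i*h*s + j*h.
    double       *C00 = C,         *C01 = C + h,
                 *C10 = C + h * s, *C11 = C + h * s + h;
    const double *A00 = A,         *A01 = A + h,
                 *A10 = A + h * s, *A11 = A + h * s + h;
    const double *B00 = B,         *B01 = B + h,
                 *B10 = B + h * s, *B11 = B + h * s + h;
    // First four products (each would be cilk_spawn'ed):
    matmul_add(C00, A00, B00, h, s);
    matmul_add(C01, A00, B01, h, s);
    matmul_add(C10, A10, B00, h, s);
    matmul_add(C11, A10, B01, h, s);
    // cilk_sync here; the second four add into the same quadrants:
    matmul_add(C00, A01, B10, h, s);
    matmul_add(C01, A01, B11, h, s);
    matmul_add(C10, A11, B10, h, s);
    matmul_add(C11, A11, B11, h, s);
}
```

The sync in the middle is what serializes the two groups of four, which is exactly where the span recurrence 2 M_infinity(n/2) + Theta(1) discussed next comes from.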
1518 01:16:37,550 --> 01:16:40,260 What's the answer to that one? 1519 01:16:40,260 --> 01:16:41,120 That's order n. 1520 01:16:41,120 --> 01:16:44,290 Which case is it? 1521 01:16:44,290 --> 01:16:46,540 I never know what the cases are. 1522 01:16:46,540 --> 01:16:49,410 I know two, but one and three, it's like-- 1523 01:16:49,410 --> 01:16:52,030 they're the same thing, it's just which side it's in, so I 1524 01:16:52,030 --> 01:16:53,370 never remember what the number is. 1525 01:16:53,370 --> 01:16:55,250 But anyway, case one, yes. 1526 01:16:55,250 --> 01:16:57,720 Case one. 1527 01:16:57,720 --> 01:17:00,710 It's the one where this thing is bigger, so that's order n. 1528 01:17:00,710 --> 01:17:04,410 1529 01:17:04,410 --> 01:17:07,200 Good. 1530 01:17:07,200 --> 01:17:13,010 So then the work is n cubed, the span is order n, the 1531 01:17:13,010 --> 01:17:15,870 parallelism is order n squared. 1532 01:17:15,870 --> 01:17:18,660 So for 1,000 by 1,000 matrices, I get parallelism on 1533 01:17:18,660 --> 01:17:21,600 the order of a million, instead of before, I had 1534 01:17:21,600 --> 01:17:25,710 parallelism on the order of 10 million. 1535 01:17:25,710 --> 01:17:29,540 So this turns out to be way better code than the previous one 1536 01:17:29,540 --> 01:17:33,840 because it avoids the temporary and therefore runs 1537 01:17:33,840 --> 01:17:37,280 faster; you get a constant factor improvement for that, and it's 1538 01:17:37,280 --> 01:17:42,300 still, on 12 cores, going to run pretty fast. 1539 01:17:42,300 --> 01:17:45,530 And in practice, this is a much better way to do it. 1540 01:17:45,530 --> 01:17:49,190 The actual best code that I know for doing this 1541 01:17:49,190 --> 01:17:52,590 essentially does divide and conquer in only one 1542 01:17:52,590 --> 01:17:54,580 dimension at a time.
1543 01:17:54,580 --> 01:17:57,270 So basically, it looks to see what's the long dimension, and 1544 01:17:57,270 --> 01:18:00,880 whatever the long dimension is, it slices it in half and 1545 01:18:00,880 --> 01:18:04,920 then recurses, and just does that as a binary thing. 1546 01:18:04,920 --> 01:18:06,450 And it basically is the same work, et cetera. 1547 01:18:06,450 --> 01:18:07,755 It's a little bit more tricky to analyze. 1548 01:18:07,755 --> 01:18:14,820 1549 01:18:14,820 --> 01:18:17,940 Let me quick do merge sort. 1550 01:18:17,940 --> 01:18:18,900 So you know merge sort. 1551 01:18:18,900 --> 01:18:23,470 There's merging two sorted arrays, we saw this before. 1552 01:18:23,470 --> 01:18:26,320 If I spend all this time doing animations, I might as well 1553 01:18:26,320 --> 01:18:29,830 get my mileage out of it. 1554 01:18:29,830 --> 01:18:30,480 There we go. 1555 01:18:30,480 --> 01:18:33,820 So you merge, that's basically what this code does. 1556 01:18:33,820 --> 01:18:37,090 Order n time to merge. 1557 01:18:37,090 --> 01:18:38,470 So here's merge sort. 1558 01:18:38,470 --> 01:18:40,880 So what I'll do in merge sort is the same thing I normally 1559 01:18:40,880 --> 01:18:52,210 do, except that I'll make recursive 1560 01:18:52,210 --> 01:18:53,500 routines go in parallel. 1561 01:18:53,500 --> 01:18:58,500 So when I do that, it basically divide and conquers 1562 01:18:58,500 --> 01:19:03,760 down, and then it sort of does this to merge things together. 1563 01:19:03,760 --> 01:19:08,710 So we saw this before, except now, I've got the fact that I 1564 01:19:08,710 --> 01:19:11,450 can sort two things in parallel rather than sorting 1565 01:19:11,450 --> 01:19:13,020 them serially. 1566 01:19:13,020 --> 01:19:14,375 So let's take a look at the work. 1567 01:19:14,375 --> 01:19:15,900 What's the work of merge sort? 1568 01:19:15,900 --> 01:19:18,770 We know that. 1569 01:19:18,770 --> 01:19:20,370 n log n, right? 
1570 01:19:20,370 --> 01:19:26,790 2t of n over 2 plus order n, so that's order n log n. 1571 01:19:26,790 --> 01:19:29,590 The span is what? 1572 01:19:29,590 --> 01:19:30,840 What's the recurrence of the span? 1573 01:19:30,840 --> 01:19:36,150 1574 01:19:36,150 --> 01:19:38,890 So we're going to take the maximum of these two guys. 1575 01:19:38,890 --> 01:19:44,050 So we only have one term that involves t infinity, and then 1576 01:19:44,050 --> 01:19:46,570 the merge costs us order n, so we get this recurrence. 1577 01:19:46,570 --> 01:19:49,440 1578 01:19:49,440 --> 01:19:54,990 So that says that the solution is order n. 1579 01:19:54,990 --> 01:20:01,590 So therefore, the work is n log n, the span is order n, 1580 01:20:01,590 --> 01:20:03,858 and so the parallelism is order log n. 1581 01:20:03,858 --> 01:20:06,546 1582 01:20:06,546 --> 01:20:07,950 Puny. 1583 01:20:07,950 --> 01:20:08,410 Puny parallelism. 1584 01:20:08,410 --> 01:20:11,750 Log n is like, you can run it, and it'll work fine on a few 1585 01:20:11,750 --> 01:20:14,250 cores, but it's not going to be something that generally will 1586 01:20:14,250 --> 01:20:17,570 scale and give you a lot of parallelism. 1587 01:20:17,570 --> 01:20:20,630 So it's pretty clear from this that the bottleneck-- 1588 01:20:20,630 --> 01:20:22,330 where's all the span going to? 1589 01:20:22,330 --> 01:20:23,580 It's going to that merge. 1590 01:20:23,580 --> 01:20:26,730 1591 01:20:26,730 --> 01:20:28,960 So when you understand that that's the structure of it, 1592 01:20:28,960 --> 01:20:31,410 now you say if you want to get parallelism, you've got to go 1593 01:20:31,410 --> 01:20:32,230 after the merge. 1594 01:20:32,230 --> 01:20:34,480 So here's how we parallelize the merge. 1595 01:20:34,480 --> 01:20:36,615 So we're going to look at merging of two arrays that are 1596 01:20:36,615 --> 01:20:38,920 of possibly different length.
1597 01:20:38,920 --> 01:20:41,220 So one we'll call A, and one we'll call B, 1598 01:20:41,220 --> 01:20:42,810 with na and nb elements. 1599 01:20:42,810 --> 01:20:46,540 And let me assume without loss of generality that na is 1600 01:20:46,540 --> 01:20:48,830 greater than or equal to nb, because otherwise I can just 1601 01:20:48,830 --> 01:20:51,280 switch the roles of A and B. 1602 01:20:51,280 --> 01:20:53,370 So the way that I'm going to do it is I'm going to find the 1603 01:20:53,370 --> 01:20:56,490 middle element of A. These are sorted arrays that 1604 01:20:56,490 --> 01:20:58,300 I'm going to merge. 1605 01:20:58,300 --> 01:21:01,990 I find the middle element of A, so these guys are less 1606 01:21:01,990 --> 01:21:04,540 than or equal to a of ma, and these are greater 1607 01:21:04,540 --> 01:21:05,930 than or equal to. 1608 01:21:05,930 --> 01:21:09,570 And now I binary search and find out where that middle 1609 01:21:09,570 --> 01:21:14,840 element would fall in the array B. So that costs me log 1610 01:21:14,840 --> 01:21:16,460 n time to binary search. 1611 01:21:16,460 --> 01:21:17,710 Remember binary search? 1612 01:21:17,710 --> 01:21:22,420 1613 01:21:22,420 --> 01:21:25,560 Then what I'm going to do is recursively merge these guys, 1614 01:21:25,560 --> 01:21:28,490 because these are sorted and less than or equal to a of ma, 1615 01:21:28,490 --> 01:21:31,950 recursively merge those and put this guy in the middle. 1616 01:21:31,950 --> 01:21:35,140 1617 01:21:35,140 --> 01:21:42,180 So when I do that, the key question when we analyze-- 1618 01:21:42,180 --> 01:21:45,310 it turns out the work is going to basically be the same, but 1619 01:21:45,310 --> 01:21:49,170 the key thing is going to be what happens to the span? 1620 01:21:49,170 --> 01:21:52,080 And the idea here is that the total number of elements in 1621 01:21:52,080 --> 01:21:59,690 the larger of these two things is going to be at most what?
1622 01:21:59,690 --> 01:22:04,015 Another way of looking at it is in the smaller partition, 1623 01:22:04,015 --> 01:22:06,230 if n is the total number of elements, the smaller 1624 01:22:06,230 --> 01:22:09,310 partition has how many elements at 1625 01:22:09,310 --> 01:22:11,010 least relative to n? 1626 01:22:11,010 --> 01:22:13,750 1627 01:22:13,750 --> 01:22:17,620 No matter where this binary search finds itself. 1628 01:22:17,620 --> 01:22:21,060 So the worst case is sort of going to come when this guy is 1629 01:22:21,060 --> 01:22:24,900 like at one end or the other. 1630 01:22:24,900 --> 01:22:28,070 And then the point is that because A is the larger array, 1631 01:22:28,070 --> 01:22:30,640 at least a quarter of the elements will still be in the 1632 01:22:30,640 --> 01:22:33,340 smaller partition. 1633 01:22:33,340 --> 01:22:37,280 Of all the elements here, at least a quarter will be in the 1634 01:22:37,280 --> 01:22:39,840 smaller partition, which will occur when B is equal in 1635 01:22:39,840 --> 01:22:45,170 size to A. So the number, in the larger of the recursive 1636 01:22:45,170 --> 01:22:46,790 merges, is at most 3/4 n. 1637 01:22:46,790 --> 01:22:49,350 1638 01:22:49,350 --> 01:22:50,750 Sound good? 1639 01:22:50,750 --> 01:22:54,410 That's the main, key idea behind this. 1640 01:22:54,410 --> 01:22:57,420 So here's the parallel merge. 1641 01:22:57,420 --> 01:23:01,290 Basically you do binary search, you spawn, then, the 1642 01:23:01,290 --> 01:23:02,660 two merges. 1643 01:23:02,660 --> 01:23:04,720 Here's one merge, and here's the other merge, 1644 01:23:04,720 --> 01:23:06,140 and then you sync. 1645 01:23:06,140 --> 01:23:09,590 So that's the code for doing the parallel merge. 1646 01:23:09,590 --> 01:23:11,430 And now you want to incorporate that parallel 1647 01:23:11,430 --> 01:23:14,750 merge into the parallel merge sort. 1648 01:23:14,750 --> 01:23:16,805 Of course, you coarsen the base cases for efficiency.
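The parallel merge just described can be sketched as follows; this is a reconstruction, not the slide's exact code. It is written as the serial elision, with comments marking where the spawn and sync would go, and without the base-case coarsening just mentioned:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>

// Merge sorted arrays A[0..na) and B[0..nb) into C. Find the middle
// element of the larger array, binary-search its position in the
// other, place it, and recursively merge the two halves on either
// side (those two merges run in parallel in the Cilk version).
void p_merge(const int* A, std::size_t na,
             const int* B, std::size_t nb, int* C) {
    if (na < nb) {                 // WLOG make A the larger array
        std::swap(A, B);
        std::swap(na, nb);
    }
    if (na == 0) return;           // both empty; real code coarsens here
    std::size_t ma = na / 2;       // middle element of A
    // Binary search, Theta(log n): where would A[ma] land in B?
    std::size_t mb = std::lower_bound(B, B + nb, A[ma]) - B;
    C[ma + mb] = A[ma];            // its final position in the output
    // cilk_spawn the first recursive merge in the parallel version:
    p_merge(A, ma, B, mb, C);
    p_merge(A + ma + 1, na - ma - 1, B + mb, nb - mb, C + ma + mb + 1);
    // cilk_sync here
}
```

Because A is the larger array, each recursive merge gets at most 3/4 of the elements, which is exactly the 3/4 n bound used in the span analysis.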
1649 01:23:16,805 --> 01:23:21,190 1650 01:23:21,190 --> 01:23:24,190 So let's analyze the span of this. 1651 01:23:24,190 --> 01:23:29,470 So the span is basically then the span of something of 3/4, 1652 01:23:29,470 --> 01:23:36,380 at most 3/4, the size plus the log n for the binary search. 1653 01:23:36,380 --> 01:23:40,500 So the span of parallel merge is therefore order log squared 1654 01:23:40,500 --> 01:23:44,300 n, because the important thing is, I'm whacking off a 1655 01:23:44,300 --> 01:23:46,415 constant fraction here every time. 1656 01:23:46,415 --> 01:23:52,050 So I get log squared n as the span, and for the work I get this 1657 01:23:52,050 --> 01:23:58,270 hairy recurrence, that it's t of alpha n plus t of 1 minus alpha 1658 01:23:58,270 --> 01:24:03,920 n plus log n, where alpha falls in this range. 1659 01:24:03,920 --> 01:24:07,700 This does not satisfy the Master Theorem. 1660 01:24:07,700 --> 01:24:10,080 You can actually do this pretty easily with a recursion 1661 01:24:10,080 --> 01:24:13,270 tree, but the way to verify it is-- 1662 01:24:13,270 --> 01:24:16,370 we call this technically a hairy recurrence. 1663 01:24:16,370 --> 01:24:20,360 That's the technical term for it. 1664 01:24:20,360 --> 01:24:23,830 So it turns out, this has order n, just like ordinary 1665 01:24:23,830 --> 01:24:28,010 merge, order n time. 1666 01:24:28,010 --> 01:24:30,540 You can use the substitution method, and I 1667 01:24:30,540 --> 01:24:32,230 won't drag you through it, but you can look 1668 01:24:32,230 --> 01:24:35,510 at it in the notes. 1669 01:24:35,510 --> 01:24:39,180 And this should be very familiar to you as having all 1670 01:24:39,180 --> 01:24:43,270 aced 6006, right? 1671 01:24:43,270 --> 01:24:44,560 Otherwise you wouldn't be here, right? 1672 01:24:44,560 --> 01:24:47,930 1673 01:24:47,930 --> 01:24:51,640 So the parallelism of the parallel merge is something 1674 01:24:51,640 --> 01:24:55,670 like n over log squared n.
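Written out explicitly (the notation PM for the parallel merge is mine; the lecture just says "t"), the two recurrences are:

```latex
\begin{align*}
  \mathrm{PM}_{\infty}(n) &\le \mathrm{PM}_{\infty}(3n/4) + \Theta(\log n)
    &&\Rightarrow\ \mathrm{PM}_{\infty}(n) = \Theta(\log^2 n),\\
  \mathrm{PM}_{1}(n) &= \mathrm{PM}_{1}(\alpha n)
        + \mathrm{PM}_{1}((1-\alpha)n) + \Theta(\log n),
    \quad \tfrac14 \le \alpha \le \tfrac34
    &&\Rightarrow\ \mathrm{PM}_{1}(n) = \Theta(n),
\end{align*}
```

where the span bound follows because a constant fraction is removed at every level, and the work bound is shown by substitution rather than the Master Theorem.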
1675 01:24:55,670 --> 01:25:00,510 So that's much better than having an order n bound. 1676 01:25:00,510 --> 01:25:03,340 And now, we can plug it into merge sort. 1677 01:25:03,340 --> 01:25:05,920 So the work is going to be the same as before, because I just 1678 01:25:05,920 --> 01:25:08,780 have the work of the merge, which is still order n. 1679 01:25:08,780 --> 01:25:12,190 So the work is order n log n, once again pulling out the 1680 01:25:12,190 --> 01:25:13,550 Master Theorem. 1681 01:25:13,550 --> 01:25:21,660 And then the span is n over 2 plus log squared n, because basically, 1682 01:25:21,660 --> 01:25:27,590 I have the span of a problem of half the size plus the span 1683 01:25:27,590 --> 01:25:28,910 that I need to merge things. 1684 01:25:28,910 --> 01:25:30,660 That's order log squared n. 1685 01:25:30,660 --> 01:25:32,300 This I want to pause on for a moment. 1686 01:25:32,300 --> 01:25:35,520 People get this recurrence? 1687 01:25:35,520 --> 01:25:38,230 Because this is the span of the merge. 1688 01:25:38,230 --> 01:25:45,310 And so what I end up with is I get another log, log cubed n. 1689 01:25:45,310 --> 01:25:50,350 And so the total parallelism is n over log squared n. 1690 01:25:50,350 --> 01:25:55,440 And this is actually quite a practical thing to implement, 1691 01:25:55,440 --> 01:25:59,980 to get the n over log squared n parallelism versus just a 1692 01:25:59,980 --> 01:26:02,770 log n parallelism. 1693 01:26:02,770 --> 01:26:04,200 We're not going to do tableau construction. 1694 01:26:04,200 --> 01:26:07,170 You can read that up, that's in the notes that are online, 1695 01:26:07,170 --> 01:26:11,010 but you should read through that part of it. 1696 01:26:11,010 --> 01:26:13,520 It's got some nice animations which you don't get to see.
1697 01:26:13,520 --> 01:26:20,630 1698 01:26:20,630 --> 01:26:23,240 This is like when you do longest common subsequence and 1699 01:26:23,240 --> 01:26:25,460 stuff like that, how you would solve that type 1700 01:26:25,460 --> 01:26:27,110 of problem in parallel. 1701 01:26:27,110 --> 01:26:28,360 OK, great. 1702 01:26:28,360 --> 01:26:30,912