1 00:00:00,000 --> 00:00:00,030 2 00:00:00,030 --> 00:00:02,420 The following content is provided under a Creative 3 00:00:02,420 --> 00:00:03,850 Commons license. 4 00:00:03,850 --> 00:00:06,860 Your support will help MIT OpenCourseWare continue to 5 00:00:06,860 --> 00:00:10,550 offer high quality educational resources for free. 6 00:00:10,550 --> 00:00:13,420 To make a donation or view additional materials from 7 00:00:13,420 --> 00:00:17,510 hundreds of MIT courses, visit MIT OpenCourseWare at 8 00:00:17,510 --> 00:00:18,760 ocw.mit.edu. 9 00:00:18,760 --> 00:00:21,140 10 00:00:21,140 --> 00:00:23,450 PROFESSOR: So we'll get started. 11 00:00:23,450 --> 00:00:27,240 So today we are going to dive into some parallel 12 00:00:27,240 --> 00:00:28,610 architectures. 13 00:00:28,610 --> 00:00:36,070 So the way, if you look at the big world, is there's -- 14 00:00:36,070 --> 00:00:39,370 just counting parallelism, you can do it implicitly, either 15 00:00:39,370 --> 00:00:40,770 by hardware or the compiler. 16 00:00:40,770 --> 00:00:42,220 So the user won't see it. 17 00:00:42,220 --> 00:00:44,870 It will be done behind the user's back, but can be done 18 00:00:44,870 --> 00:00:46,070 by hardware or compiler. 19 00:00:46,070 --> 00:00:48,980 Or explicitly, visible to the user. 20 00:00:48,980 --> 00:00:53,440 So the hardware part is done in superscalar processors, and 21 00:00:53,440 --> 00:00:55,550 all those things will have explicitly parallel 22 00:00:55,550 --> 00:00:56,590 architecture. 23 00:00:56,590 --> 00:01:02,010 So what I am going to do is spend some time just talking 24 00:01:02,010 --> 00:01:05,320 about implicitly parallel superscalar processors. 25 00:01:05,320 --> 00:01:08,290 Because probably the entire time from when you guys were born till 26 00:01:08,290 --> 00:01:11,220 now, this has been the mainstream, people are 27 00:01:11,220 --> 00:01:13,220 building these things, and we are used to it. 
28 00:01:13,220 --> 00:01:14,980 And now we are kind of doing a switch. 29 00:01:14,980 --> 00:01:17,480 Then we'll go into explicitly parallel processors and 30 00:01:17,480 --> 00:01:20,310 kind of look at different types in there, and get a feel 31 00:01:20,310 --> 00:01:23,270 for the big picture. 32 00:01:23,270 --> 00:01:26,140 So let's start with implicitly parallel superscalar 33 00:01:26,140 --> 00:01:27,350 processors. 34 00:01:27,350 --> 00:01:29,160 So there are two types of superscalar processors. 35 00:01:29,160 --> 00:01:31,140 One is what we call statically scheduled. 36 00:01:31,140 --> 00:01:34,330 Those are kind of simpler ones, where you use compiler 37 00:01:34,330 --> 00:01:36,620 techniques to figure out where the parallelism is. 38 00:01:36,620 --> 00:01:40,280 And what happens is the computer keeps executing, 39 00:01:40,280 --> 00:01:42,600 instead of one instruction at a time, a few instructions 40 00:01:42,600 --> 00:01:44,440 next to each other in one bunch. 41 00:01:44,440 --> 00:01:47,180 Like a bundle after bundle type thing. 42 00:01:47,180 --> 00:01:49,570 On the other hand, dynamically scheduled processors -- 43 00:01:49,570 --> 00:01:52,260 things like the current Pentiums -- are a lot more 44 00:01:52,260 --> 00:01:52,850 complicated. 45 00:01:52,850 --> 00:01:55,540 They have to extract instruction level parallelism. 46 00:01:55,540 --> 00:01:58,020 ILP doesn't mean integer linear programming, it's 47 00:01:58,020 --> 00:02:00,290 instruction level parallelism. 48 00:02:00,290 --> 00:02:03,840 They schedule instructions as soon as operands become available, 49 00:02:03,840 --> 00:02:06,440 when the data is there to run these instructions. 50 00:02:06,440 --> 00:02:08,520 Then there's just a bunch of things that get you more 51 00:02:08,520 --> 00:02:10,610 parallelism, things like renaming registers to eliminate 52 00:02:10,610 --> 00:02:11,840 some dependences. 53 00:02:11,840 --> 00:02:14,170 You execute things out of order. 
54 00:02:14,170 --> 00:02:16,580 If a later instruction's operands become available 55 00:02:16,580 --> 00:02:19,300 early, you'll get those things done instead of waiting. 56 00:02:19,300 --> 00:02:20,770 You can execute speculatively. 57 00:02:20,770 --> 00:02:23,430 I'll go through a little bit in detail to kind of explain 58 00:02:23,430 --> 00:02:24,680 what these things might be. 59 00:02:24,680 --> 00:02:27,774 60 00:02:27,774 --> 00:02:29,680 Why is this not going down. 61 00:02:29,680 --> 00:02:32,180 Oops. 62 00:02:32,180 --> 00:02:36,540 So if you look at a normal pipeline. 63 00:02:36,540 --> 00:02:40,230 So this is a 6.004 type pipeline. 64 00:02:40,230 --> 00:02:43,790 What I have is a very simplistic four 65 00:02:43,790 --> 00:02:45,210 stage pipeline in there. 66 00:02:45,210 --> 00:02:48,050 So a normal microprocessor, a single-issue, will do 67 00:02:48,050 --> 00:02:50,080 something like this. 68 00:02:50,080 --> 00:02:52,050 And if you look at it, there's still a little bit of 69 00:02:52,050 --> 00:02:53,680 parallelism here. 70 00:02:53,680 --> 00:02:56,320 Because you don't wait till the first thing finishes to go 71 00:02:56,320 --> 00:02:58,250 to the second thing. 72 00:02:58,250 --> 00:03:01,670 If you look at a superscalar, you have something like this. 73 00:03:01,670 --> 00:03:03,760 This is an in-order superscalar. 74 00:03:03,760 --> 00:03:07,020 What happens is in every cycle instead of doing one, you 75 00:03:07,020 --> 00:03:10,360 fetch two, you decode two, you execute two, and 76 00:03:10,360 --> 00:03:12,080 so on and so forth. 77 00:03:12,080 --> 00:03:15,070 In an out-of-order superscalar, these are not 78 00:03:15,070 --> 00:03:16,670 going in these very nice boundaries. 79 00:03:16,670 --> 00:03:19,930 You have a fetch unit that fetches like hundreds ahead, 80 00:03:19,930 --> 00:03:22,580 and it keeps issuing as soon as things are fetched and 81 00:03:22,580 --> 00:03:24,540 decoded to the execute unit. 
82 00:03:24,540 --> 00:03:26,600 And it's a lot more of a complex picture in there. 83 00:03:26,600 --> 00:03:29,260 I'm not going to show too much of the picture there, because 84 00:03:29,260 --> 00:03:31,830 it's a very complicated thing. 85 00:03:31,830 --> 00:03:35,910 So the first thing the processor has to do is, it has 86 00:03:35,910 --> 00:03:38,890 to look for true data dependences. 87 00:03:38,890 --> 00:03:42,300 True data dependence says that this instruction in fact is 88 00:03:42,300 --> 00:03:45,990 using something produced by the previous guy. 89 00:03:45,990 --> 00:03:50,395 So this is important because if the two instructions are 90 00:03:50,395 --> 00:03:53,410 data dependent, they cannot be executed simultaneously. 91 00:03:53,410 --> 00:03:54,590 You have to wait till the first guy finishes to 92 00:03:54,590 --> 00:03:55,470 get the second guy. 93 00:03:55,470 --> 00:03:58,480 They cannot be completely overlapped, and you can't 94 00:03:58,480 --> 00:03:59,910 execute them out of order. 95 00:03:59,910 --> 00:04:01,930 You have to make sure the data comes in before you 96 00:04:01,930 --> 00:04:03,180 actually use it. 97 00:04:03,180 --> 00:04:05,190 98 00:04:05,190 --> 00:04:09,360 In computer architecture jargon, this is called a 99 00:04:09,360 --> 00:04:10,340 pipeline hazard. 100 00:04:10,340 --> 00:04:12,120 And this is called a Read After Write 101 00:04:12,120 --> 00:04:13,750 hazard, or RAW hazard. 102 00:04:13,750 --> 00:04:18,370 What that means is that the write has to finish before you 103 00:04:18,370 --> 00:04:20,800 can do the read. 104 00:04:20,800 --> 00:04:23,660 In a microprocessor, people try very hard to minimize the 105 00:04:23,660 --> 00:04:26,490 time you have to wait to do that, and you really have to 106 00:04:26,490 --> 00:04:27,740 honor that. 
107 00:04:27,740 --> 00:04:32,780 108 00:04:32,780 --> 00:04:34,550 In hardware/software what you have to do is you have to 109 00:04:34,550 --> 00:04:38,330 preserve this program ordering. 110 00:04:38,330 --> 00:04:41,820 The program has to be executed sequentially, as determined by 111 00:04:41,820 --> 00:04:42,630 the source program. 112 00:04:42,630 --> 00:04:44,560 So if the source program says some order of doing things, 113 00:04:44,560 --> 00:04:44,900 you better -- 114 00:04:44,900 --> 00:04:46,330 if there's some reason for doing that, you better 115 00:04:46,330 --> 00:04:48,200 actually adhere to that order. 116 00:04:48,200 --> 00:04:51,390 You can't go and just do things in a haphazard way. 117 00:04:51,390 --> 00:04:55,410 And dependences are basically a fact of the 118 00:04:55,410 --> 00:04:56,940 program, so that's what you've got. 119 00:04:56,940 --> 00:04:58,570 If you're lucky you'll get a program without too many 120 00:04:58,570 --> 00:05:01,160 dependences, but most probably you'll get programs that have 121 00:05:01,160 --> 00:05:02,010 a lot of dependences. 122 00:05:02,010 --> 00:05:03,260 That's normal. 123 00:05:03,260 --> 00:05:06,050 124 00:05:06,050 --> 00:05:08,170 Data dependence is important in several ways. 125 00:05:08,170 --> 00:05:10,120 It indicates the possibility of these hazards, how these 126 00:05:10,120 --> 00:05:11,590 dependences have to work. 127 00:05:11,590 --> 00:05:14,020 And it determines the order in which the results might be 128 00:05:14,020 --> 00:05:18,755 calculated, because if you need the result of that to do 129 00:05:18,755 --> 00:05:21,700 the next, you have what you call a dependency chain. 130 00:05:21,700 --> 00:05:23,730 And you have to execute that in that order. 131 00:05:23,730 --> 00:05:26,930 And because of the dependency chain, it sets an upper bound 132 00:05:26,930 --> 00:05:29,980 on how much parallelism can possibly be expected. 
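The upper bound set by a dependency chain can be made concrete with a small sketch. The instruction list below is hypothetical (it is not from the lecture slides), every instruction is assumed to take one cycle, and each register is written only once so that only true (read-after-write) dependences matter.

```python
# Hypothetical program: each entry is (destination register, source registers).
# Assumption: every instruction takes exactly one cycle.
program = [
    ("r1", []),            # r1 = load
    ("r2", ["r1"]),        # r2 = r1 + 1   (needs r1)
    ("r3", ["r2"]),        # r3 = r2 * 2   (needs r2)
    ("r4", []),            # r4 = load     (independent of the chain above)
    ("r5", ["r4"]),        # r5 = r4 + 1   (needs r4)
    ("r6", ["r3", "r5"]),  # r6 = r3 + r5  (needs both chains)
]


def critical_path_length(instructions):
    """Length, in cycles, of the longest chain of true dependences."""
    ready_at = {}  # register -> cycle at the end of which its value exists
    longest = 0
    for dest, sources in instructions:
        # An instruction can start only after all of its producers finish.
        start = max((ready_at[s] for s in sources if s in ready_at), default=0)
        ready_at[dest] = start + 1
        longest = max(longest, ready_at[dest])
    return longest


def max_parallelism(instructions):
    """Average instructions per cycle on an ideal, infinitely wide machine."""
    return len(instructions) / critical_path_length(instructions)


print(critical_path_length(program))  # longest chain: r1 -> r2 -> r3 -> r6
print(max_parallelism(program))       # 6 instructions over that chain
```

Under these assumptions no hardware, however wide, finishes the six instructions in fewer cycles than the longest chain, so the average parallelism is capped at the instruction count divided by the chain length.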
133 00:05:29,980 --> 00:05:32,230 If you can say in all your program there's nothing 134 00:05:32,230 --> 00:05:35,190 dependent -- every instruction just can go any time -- 135 00:05:35,190 --> 00:05:38,390 then you can say the best computer will get done in one 136 00:05:38,390 --> 00:05:40,180 cycle, because everything can run. 137 00:05:40,180 --> 00:05:43,285 But if you say the next instruction is dependent on 138 00:05:43,285 --> 00:05:44,760 the previous one, the next instruction is dependent on 139 00:05:44,760 --> 00:05:46,610 the previous one, you have a chain. 140 00:05:46,610 --> 00:05:48,620 And no matter how good the hardware, you have to wait 141 00:05:48,620 --> 00:05:50,580 till that chain finishes. 142 00:05:50,580 --> 00:05:54,310 And you don't get that much parallelism. 143 00:05:54,310 --> 00:05:57,150 So the goal is to exploit parallelism by preserving the 144 00:05:57,150 --> 00:06:01,290 program order where it affects the outcome of the program. 145 00:06:01,290 --> 00:06:04,190 So if we want to have a look and feel like the program is 146 00:06:04,190 --> 00:06:07,620 run on a nice single-issue machine that does one 147 00:06:07,620 --> 00:06:09,760 instruction after another after another, that's the 148 00:06:09,760 --> 00:06:10,690 world we are looking in. 149 00:06:10,690 --> 00:06:13,730 And then we are doing all this underneath to kind of get 150 00:06:13,730 --> 00:06:17,540 performance, but give that abstraction. 151 00:06:17,540 --> 00:06:21,370 So there are other dependences that we can do better. 152 00:06:21,370 --> 00:06:23,850 There are two types of name dependences. 153 00:06:23,850 --> 00:06:29,450 That means there's no real program use of data, but there 154 00:06:29,450 --> 00:06:31,340 are limited resources in the program. 155 00:06:31,340 --> 00:06:33,130 And you have resource contentions. 156 00:06:33,130 --> 00:06:38,430 So the two types of resources are registers and memory. 
157 00:06:38,430 --> 00:06:40,610 So linear resource contentions. 158 00:06:40,610 --> 00:06:45,830 The first name dependence is what we call anti-dependence. 159 00:06:45,830 --> 00:06:48,230 Anti-dependence means that -- 160 00:06:48,230 --> 00:06:54,840 161 00:06:54,840 --> 00:06:57,640 what I need to do is, I want to write this register. 162 00:06:57,640 --> 00:06:59,180 But in the previous instruction I'm actually 163 00:06:59,180 --> 00:07:02,110 reading the register. 164 00:07:02,110 --> 00:07:03,460 Because I'm writing the next one, I'm not 165 00:07:03,460 --> 00:07:05,270 really using the value. 166 00:07:05,270 --> 00:07:08,220 But I cannot write it until I have read that value. 167 00:07:08,220 --> 00:07:10,960 Because the minute I write it, I lose the previous value. 168 00:07:10,960 --> 00:07:14,270 And if I haven't used it, I'm out of luck. 169 00:07:14,270 --> 00:07:18,070 So there might be a case that I have a register, that I'm 170 00:07:18,070 --> 00:07:20,740 reading the register and rewriting it some new value. 171 00:07:20,740 --> 00:07:22,990 But I have to wait till the reading is done before I do 172 00:07:22,990 --> 00:07:24,240 this new write. 173 00:07:24,240 --> 00:07:26,180 And that's called anti-dependence. 174 00:07:26,180 --> 00:07:30,900 So what that means is we have to wait to run this 175 00:07:30,900 --> 00:07:33,410 instruction until this is all -- you can't do 176 00:07:33,410 --> 00:07:36,960 it all before that. 177 00:07:36,960 --> 00:07:41,230 So this is called a Write After Read, as I said, in the 178 00:07:41,230 --> 00:07:42,460 architecture jargon. 179 00:07:42,460 --> 00:07:44,850 The other dependences have what you call output 180 00:07:44,850 --> 00:07:46,470 dependence. 181 00:07:46,470 --> 00:07:50,550 Two guys are writing the register, and 182 00:07:50,550 --> 00:07:51,720 then I'm reading it. 183 00:07:51,720 --> 00:07:55,710 So I want to read the value the last guy wrote. 
184 00:07:55,710 --> 00:07:59,080 So if I reorder that, I get a wrong value. 185 00:07:59,080 --> 00:08:01,020 Actually you can even do better in here. 186 00:08:01,020 --> 00:08:02,806 How can you do better in here? 187 00:08:02,806 --> 00:08:03,640 AUDIENCE: You can eliminate I. 188 00:08:03,640 --> 00:08:03,812 PROFESSOR: Yeah. 189 00:08:03,812 --> 00:08:05,580 You can eliminate the first one, because nobody's using 190 00:08:05,580 --> 00:08:06,330 that value. 191 00:08:06,330 --> 00:08:09,740 So you can go even further and further, but 192 00:08:09,740 --> 00:08:10,680 this is also a hazard. 193 00:08:10,680 --> 00:08:15,730 This is called a Write After Write hazard. 194 00:08:15,730 --> 00:08:20,260 And the interesting thing is by doing what you call 195 00:08:20,260 --> 00:08:23,650 register renaming, you can eliminate these things. 196 00:08:23,650 --> 00:08:26,420 So why do both have to use the same register? 197 00:08:26,420 --> 00:08:29,050 In these two, if I use a different register I don't 198 00:08:29,050 --> 00:08:30,770 have that dependency. 199 00:08:30,770 --> 00:08:35,720 And so a lot of times in software, and also in modern 200 00:08:35,720 --> 00:08:39,200 superscalar hardware, there's this huge amount of hardware 201 00:08:39,200 --> 00:08:41,650 resources that actually do register renaming. 202 00:08:41,650 --> 00:08:43,400 So they realized that there are anti-dependences and output 203 00:08:43,400 --> 00:08:44,220 dependences, and said -- "Wait a minute. 204 00:08:44,220 --> 00:08:45,280 Why do I even have to do that? 205 00:08:45,280 --> 00:08:47,450 I can use a different register." So even 206 00:08:47,450 --> 00:08:49,130 though you have -- 207 00:08:49,130 --> 00:08:52,260 Intel basically [UNINTELLIGIBLE] 208 00:08:52,260 --> 00:08:54,340 accessory only have eight registers. 209 00:08:54,340 --> 00:08:56,060 There are about 100 registers behind. 
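The three hazards named so far, and the way renaming removes the two name dependences, can be sketched in a few lines. The tuple encoding of instructions, the register names, and the `p0`, `p1`, ... naming of physical registers are all assumptions made for illustration; real renaming hardware uses a free list and a mapping table, which this sketch only imitates.

```python
def classify(first, second):
    """Hazards between two instructions, each given as (dest, sources).

    `first` comes earlier in program order than `second`.
    """
    dest1, sources1 = first
    dest2, sources2 = second
    hazards = set()
    if dest1 in sources2:
        hazards.add("RAW")  # true dependence: second reads what first wrote
    if dest2 in sources1:
        hazards.add("WAR")  # anti-dependence: second overwrites what first reads
    if dest1 == dest2:
        hazards.add("WAW")  # output dependence: both write the same register
    return hazards


def rename(instructions):
    """Give every write a fresh physical register; reads follow the latest mapping."""
    mapping = {}  # architectural register -> current physical register
    renamed = []
    for index, (dest, sources) in enumerate(instructions):
        new_sources = [mapping.get(s, s) for s in sources]
        new_dest = f"p{index}"  # hypothetical physical register name
        mapping[dest] = new_dest
        renamed.append((new_dest, new_sources))
    return renamed


# Hypothetical three-instruction program.
program = [
    ("r1", ["r2"]),  # r1 = r2 + 1
    ("r2", ["r3"]),  # r2 = r3 * 2   WAR with instruction 0 (r2)
    ("r1", ["r2"]),  # r1 = r2 - 4   WAW with instruction 0 (r1), RAW with instruction 1 (r2)
]

renamed = rename(program)
for a in range(len(program)):
    for b in range(a + 1, len(program)):
        print(a, b, classify(program[a], program[b]), classify(renamed[a], renamed[b]))
```

After renaming, only the read-after-write pair remains: the true dependence is a property of the program, while the write-after-read and write-after-write hazards came purely from reusing register names.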
210 00:08:56,060 --> 00:08:58,190 Hardware registers just basically let you do this 211 00:08:58,190 --> 00:09:01,120 reordering and renaming -- register renaming. 212 00:09:01,120 --> 00:09:03,670 213 00:09:03,670 --> 00:09:05,660 So the other type of dependence is control 214 00:09:05,660 --> 00:09:07,170 dependence. 215 00:09:07,170 --> 00:09:11,150 So what that means is if you have a program like this, you 216 00:09:11,150 --> 00:09:13,630 have to preserve the program ordering. 217 00:09:13,630 --> 00:09:19,300 And what that means is S1 is control dependent on p1. 218 00:09:19,300 --> 00:09:22,475 Because what p1 is determines whether this one 219 00:09:22,475 --> 00:09:23,870 gets executed. 220 00:09:23,870 --> 00:09:27,370 S2 is control dependent on p2, but not p1. 221 00:09:27,370 --> 00:09:32,550 So it doesn't matter what p1 does, S2 will execute only if 222 00:09:32,550 --> 00:09:34,800 p2 is true. 223 00:09:34,800 --> 00:09:36,260 So there's a control dependence in there. 224 00:09:36,260 --> 00:09:39,880 225 00:09:39,880 --> 00:09:42,900 Another interesting thing is control dependence may -- you 226 00:09:42,900 --> 00:09:45,190 don't need to preserve it all the time. 227 00:09:45,190 --> 00:09:48,250 You might be able to do things out of this order. 228 00:09:48,250 --> 00:09:51,050 Basically, what you can do is if you are willing to do more 229 00:09:51,050 --> 00:09:53,440 work, you can say -- "Well, I will do this. 230 00:09:53,440 --> 00:09:55,590 I don't know that I really need it, because I don't know 231 00:09:55,590 --> 00:09:56,950 whether the p2 is true or not. 232 00:09:56,950 --> 00:09:58,210 But I'll just keep doing it. 233 00:09:58,210 --> 00:10:02,800 And then if I really need it, I'll actually have the results 234 00:10:02,800 --> 00:10:07,170 ready for me." And that's called speculative execution. 235 00:10:07,170 --> 00:10:08,550 So you can do speculation. 
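A rough software analogy for speculating past a control dependence is to compute the result of S2 into a temporary buffer before p2 is known, and to copy it into the visible state only if p2 turns out to be true. The register names, the choice of S2 as an addition, and the `execute` helper are all hypothetical; real hardware does this with structures such as a reorder buffer, which this sketch does not model.

```python
def execute(registers, p2_outcome):
    """Run `if p2: r3 = r1 + r2` speculatively.

    Returns the committed register state and the number of wasted results.
    """
    committed = dict(registers)

    # S2 is executed before p2 is resolved; its result is held aside.
    pending = {"r3": committed["r1"] + committed["r2"]}

    wasted = 0
    if p2_outcome:
        committed.update(pending)  # commit: the speculative result becomes visible
    else:
        wasted = len(pending)      # squash: the work is thrown away
    return committed, wasted


start = {"r1": 2, "r2": 3, "r3": 0}
print(execute(start, True))   # r3 updated, nothing wasted
print(execute(start, False))  # r3 untouched, one result discarded
```

Either way the visible state matches what an in-order machine would produce; the cost of a wrong guess shows up only as discarded work.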
236 00:10:08,550 --> 00:10:10,220 You speculatively think that you will need 237 00:10:10,220 --> 00:10:11,470 something, and go do it. 238 00:10:11,470 --> 00:10:14,320 239 00:10:14,320 --> 00:10:18,000 Speculation provides you with a lot of increased ILP, 240 00:10:18,000 --> 00:10:21,320 because it can overcome control dependence by 241 00:10:21,320 --> 00:10:24,620 executing through branches, before you even know where the 242 00:10:24,620 --> 00:10:25,700 branch is going. 243 00:10:25,700 --> 00:10:28,120 And a lot of times you can go through both directions, and 244 00:10:28,120 --> 00:10:29,900 say -- "Wait a minute, I don't know which way I'm going. 245 00:10:29,900 --> 00:10:33,210 I'll do both sides." And I know at least one side you are 246 00:10:33,210 --> 00:10:34,230 going, and that will be useful. 247 00:10:34,230 --> 00:10:37,090 And you can go more and more, and soon you see that you are 248 00:10:37,090 --> 00:10:39,170 doing so much more work than actually will be useful. 249 00:10:39,170 --> 00:10:41,890 250 00:10:41,890 --> 00:10:45,710 So the first level of speculation is -- speculation 251 00:10:45,710 --> 00:10:48,780 basically says, you go, you fetch, issue, and execute 252 00:10:48,780 --> 00:10:49,240 everything. 253 00:10:49,240 --> 00:10:52,060 You do the entire thing, just without committing; you 254 00:10:52,060 --> 00:10:55,160 wait until the commit to make sure that the right thing 255 00:10:55,160 --> 00:10:56,000 actually happened. 256 00:10:56,000 --> 00:10:58,800 So this is the full speculation. 257 00:10:58,800 --> 00:11:02,140 There's a slightly lesser form of speculation called dynamic 258 00:11:02,140 --> 00:11:02,580 scheduling. 259 00:11:02,580 --> 00:11:04,760 If you look at a microprocessor, one of the 260 00:11:04,760 --> 00:11:09,120 biggest causes of a pipeline stall is a branch. 
261 00:11:09,120 --> 00:11:12,430 You can't even keep a pipeline going, even in a single-issue 262 00:11:12,430 --> 00:11:14,520 machine, if there's a branch, because the branch condition 263 00:11:14,520 --> 00:11:15,470 gets resolved 264 00:11:15,470 --> 00:11:18,750 only after the next instruction has to get fetched. 265 00:11:18,750 --> 00:11:21,100 So if you do a normal thing, you just have to 266 00:11:21,100 --> 00:11:22,870 stall the pipeline. 267 00:11:22,870 --> 00:11:29,800 So what dynamic scheduling or a branch predictor sometimes 268 00:11:29,800 --> 00:11:31,880 does is, it will say I will predict where 269 00:11:31,880 --> 00:11:33,660 the branch is going. 270 00:11:33,660 --> 00:11:35,730 So I might not fetch both directions, but I will 271 00:11:35,730 --> 00:11:38,340 speculatively go fetch down one path, because it looks 272 00:11:38,340 --> 00:11:39,620 like that is the way it's going. 273 00:11:39,620 --> 00:11:42,890 Many times, like for example in a loop, 99% of the 274 00:11:42,890 --> 00:11:44,935 time you are going on the back edge, because you don't go 275 00:11:44,935 --> 00:11:45,450 through that. 276 00:11:45,450 --> 00:11:46,750 And then if you predict that you are mostly 277 00:11:46,750 --> 00:11:47,580 [UNINTELLIGIBLE]. 278 00:11:47,580 --> 00:11:49,730 So the branch predictors are pretty good at finding these 279 00:11:49,730 --> 00:11:50,870 kind of cases. 280 00:11:50,870 --> 00:11:53,710 There are very few branches that are kind of 50-50. 281 00:11:53,710 --> 00:11:56,260 Most branches have a preferred path. 282 00:11:56,260 --> 00:11:58,780 If you find the preferred path you can go through that, and 283 00:11:58,780 --> 00:12:00,200 you don't pay any penalty. 284 00:12:00,200 --> 00:12:01,860 The penalty is if you made a mistake, you have to kind of 285 00:12:01,860 --> 00:12:03,450 back up a few times. 286 00:12:03,450 --> 00:12:05,490 So you can at least go in one direction. 
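The point about loop branches having a preferred path can be illustrated with a very small predictor model. The lecture does not say which prediction scheme is used; the two-bit saturating counter below is a common textbook scheme swapped in as an assumption, and the branch outcome sequences are made up for the example.

```python
def count_correct(outcomes, counter=2):
    """Simulate a two-bit saturating counter on a sequence of branch outcomes.

    Counter states 0 and 1 predict "not taken"; states 2 and 3 predict "taken".
    Returns the number of correct predictions.
    """
    correct = 0
    for taken in outcomes:
        prediction = counter >= 2
        if prediction == taken:
            correct += 1
        # Move toward "taken" or "not taken", saturating at 3 and 0.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct


# A loop back edge: taken 99 times, then falls through once at loop exit.
loop_branch = [True] * 99 + [False]

# A branch that alternates every time, which is roughly 50-50.
alternating_branch = [True, False] * 50

print(count_correct(loop_branch), "of", len(loop_branch))
print(count_correct(alternating_branch), "of", len(alternating_branch))
```

In this model the loop branch is mispredicted only at the exit, while the alternating branch is wrong about half the time, which is why a branch with no preferred path is the expensive case.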
287 00:12:05,490 --> 00:12:08,240 Most hardware do that, even the simplest things do that. 288 00:12:08,240 --> 00:12:10,550 But if you do good speculation you go both. 289 00:12:10,550 --> 00:12:13,150 You say -- "Eh, there's a chance if I go down that path 290 00:12:13,150 --> 00:12:13,900 I'm going to lose a lot. 291 00:12:13,900 --> 00:12:18,920 So I'll do that, too." So that does a lot of expensive stuff. 292 00:12:18,920 --> 00:12:23,080 And basically this is more for data flow model. 293 00:12:23,080 --> 00:12:26,160 So as soon as data get available you don't think too 294 00:12:26,160 --> 00:12:30,150 much about control, you keep firing that. 295 00:12:30,150 --> 00:12:36,780 So today's superscalar processors have huge amount of 296 00:12:36,780 --> 00:12:37,460 speculation. 297 00:12:37,460 --> 00:12:39,290 You speculate on everything. 298 00:12:39,290 --> 00:12:40,170 Branch prediction. 299 00:12:40,170 --> 00:12:42,690 You assume all the branches, multilevel down you predict, 300 00:12:42,690 --> 00:12:43,470 and go that. 301 00:12:43,470 --> 00:12:44,360 Value prediction. 302 00:12:44,360 --> 00:12:45,960 You look at it and say -- "Hey, I think it's going to be 303 00:12:45,960 --> 00:12:50,450 two." And in fact there's a paper that says about 80% of 304 00:12:50,450 --> 00:12:51,700 program values are zero. 305 00:12:51,700 --> 00:12:55,130 306 00:12:55,130 --> 00:12:56,060 And then you say -- "OK. 307 00:12:56,060 --> 00:12:57,510 I'll think it's zero, and it'll go on. 308 00:12:57,510 --> 00:12:59,530 And if it is not zero, I'll have to come back and do 309 00:12:59,530 --> 00:13:00,610 that." So things like that. 310 00:13:00,610 --> 00:13:02,437 AUDIENCE: Do you know what percentage of the time it has 311 00:13:02,437 --> 00:13:03,870 to go back? 312 00:13:03,870 --> 00:13:08,350 PROFESSOR: A lot of times I think it is probably an 80-20 313 00:13:08,350 --> 00:13:11,420 type thing, but if you do too much you're always backing up. 
314 00:13:11,420 --> 00:13:13,310 But you can at least do a few things down 315 00:13:13,310 --> 00:13:14,650 assuming it's zero. 316 00:13:14,650 --> 00:13:16,260 So things like that. 317 00:13:16,260 --> 00:13:21,530 People try to take advantage of the statistical nature of 318 00:13:21,530 --> 00:13:24,690 programs. And you are mining every day. 319 00:13:24,690 --> 00:13:29,160 So basically there's no -- 320 00:13:29,160 --> 00:13:30,420 it's almost at the entropy. 321 00:13:30,420 --> 00:13:33,030 So every bit of information in the program is kind of taken advantage of, 322 00:13:33,030 --> 00:13:37,370 but what that means is you are wasting a lot of 323 00:13:37,370 --> 00:13:38,470 time cycles. 324 00:13:38,470 --> 00:13:40,740 So the conventional wisdom was -- 325 00:13:40,740 --> 00:13:42,610 "You have Moore's Law. 326 00:13:42,610 --> 00:13:43,920 You keep getting these transistors. 327 00:13:43,920 --> 00:13:47,680 There's nothing to do with them, so let me do more other work. 328 00:13:47,680 --> 00:13:50,080 We'll predicate, we'll do additional work, we'll go 329 00:13:50,080 --> 00:13:52,560 through multiple branches, we'll assume things are zero. 330 00:13:52,560 --> 00:13:54,110 Because what's wasted? 331 00:13:54,110 --> 00:13:57,580 Because it's extra work, if it is wrong we just give it up." 332 00:13:57,580 --> 00:14:00,380 So that's the way it went, and the thing is it's very 333 00:14:00,380 --> 00:14:00,895 inefficient. 334 00:14:00,895 --> 00:14:03,900 Because a lot of times you are doing -- think about even a 335 00:14:03,900 --> 00:14:04,960 simple cache. 336 00:14:04,960 --> 00:14:07,580 If you have a 4-way associative cache. 337 00:14:07,580 --> 00:14:09,700 Every cycle when you're doing a memory fetch, you are 338 00:14:09,700 --> 00:14:14,140 fetching on all four, assuming one of them will hit. 
339 00:14:14,140 --> 00:14:17,480 Even if you have a cache hit, only one bank is hit, 340 00:14:17,480 --> 00:14:19,350 and all the other three banks are not hit. 341 00:14:19,350 --> 00:14:21,750 So you are just doing a lot more extra work 342 00:14:21,750 --> 00:14:23,340 just to get one thing. 343 00:14:23,340 --> 00:14:26,580 Of course because if you wait to figure out which bank, it's 344 00:14:26,580 --> 00:14:28,000 going to add a little bit more delay. 345 00:14:28,000 --> 00:14:28,710 So you want to do it in parallel. 346 00:14:28,710 --> 00:14:30,790 You know that it's going to be one of the lines, so you 347 00:14:30,790 --> 00:14:32,900 just go fetch everything and then later decide 348 00:14:32,900 --> 00:14:33,840 which one you want. 349 00:14:33,840 --> 00:14:38,390 So things like that really waste energy. 350 00:14:38,390 --> 00:14:41,560 And what has been happening in the last 10 years is you 351 00:14:41,560 --> 00:14:44,320 double the amount of transistors, and you add 5% 352 00:14:44,320 --> 00:14:46,060 more performance gain. 353 00:14:46,060 --> 00:14:49,470 Because statistically you have mined most of the 354 00:14:49,470 --> 00:14:51,260 low-hanging fruit, there's nothing much left. 355 00:14:51,260 --> 00:14:53,800 So you're getting to a point that has a little bit of a 356 00:14:53,800 --> 00:14:56,280 statistical significance, and you go after that. 357 00:14:56,280 --> 00:14:59,200 So of course, most of the time it's wrong. 358 00:14:59,200 --> 00:15:03,060 So this leads to this chart that actually yesterday I also 359 00:15:03,060 --> 00:15:03,730 pointed out. 360 00:15:03,730 --> 00:15:06,220 So you are going from hot plate to nuclear reactor, to 361 00:15:06,220 --> 00:15:08,790 rocket nozzle. 362 00:15:08,790 --> 00:15:10,400 We tend to be going in that direction. 363 00:15:10,400 --> 00:15:12,390 That is the path, because we are just doing all these 364 00:15:12,390 --> 00:15:14,450 wasteful things. 
365 00:15:14,450 --> 00:15:18,230 And right now, the power consumption on processors is 366 00:15:18,230 --> 00:15:21,420 significant enough in both things like laptops -- 367 00:15:21,420 --> 00:15:24,110 because the battery's not getting faster -- as well as 368 00:15:24,110 --> 00:15:25,220 things like Google. 369 00:15:25,220 --> 00:15:28,360 So doing this extra useless work is 370 00:15:28,360 --> 00:15:29,610 actually starting to impact. 371 00:15:29,610 --> 00:15:32,670 372 00:15:32,670 --> 00:15:34,980 So for example, if you look at something like Pentium. 373 00:15:34,980 --> 00:15:40,310 You have 11 stages of instructions. 374 00:15:40,310 --> 00:15:45,350 You can execute 3 x86 instructions per cycle. 375 00:15:45,350 --> 00:15:49,770 So you're doing this huge superscalar thing, but 376 00:15:49,770 --> 00:15:52,750 something that had been creeping in lately is also 377 00:15:52,750 --> 00:15:55,700 some amount of explicit parallelism. 378 00:15:55,700 --> 00:15:58,780 So they introduced things like MMX and SSE instructions. 379 00:15:58,780 --> 00:16:01,280 They are explicit parallelism, visible to the user. 380 00:16:01,280 --> 00:16:03,670 So it's not hiding trying to get parallelism. 381 00:16:03,670 --> 00:16:06,980 So we have been slowly moving to this kind of model, saying 382 00:16:06,980 --> 00:16:09,670 if you want performance you have to do something manual. 383 00:16:09,670 --> 00:16:11,580 So people who really cared about performance had 384 00:16:11,580 --> 00:16:12,490 to deal with that. 385 00:16:12,490 --> 00:16:17,450 And of course, we put multiple chips together to build a 386 00:16:17,450 --> 00:16:19,250 multiprocessor -- 387 00:16:19,250 --> 00:16:22,120 it's not in a single chip -- that actually do parallel 388 00:16:22,120 --> 00:16:22,800 processing. 389 00:16:22,800 --> 00:16:28,270 So for about three, four years if you buy a workstation it 390 00:16:28,270 --> 00:16:30,320 had two processors sitting in there. 
391 00:16:30,320 --> 00:16:32,650 So dual processor, quad processor machines came about, 392 00:16:32,650 --> 00:16:33,820 and people started using that. 393 00:16:33,820 --> 00:16:37,240 So it's not like we are doing this shift abruptly, we have 394 00:16:37,240 --> 00:16:39,770 been going in that direction. 395 00:16:39,770 --> 00:16:41,880 People who really cared about performance actually 396 00:16:41,880 --> 00:16:43,770 had to deal with that and were actually using that. 397 00:16:43,770 --> 00:16:46,960 398 00:16:46,960 --> 00:16:47,580 OK. 399 00:16:47,580 --> 00:16:49,380 So let's switch gears a little bit and do explicit 400 00:16:49,380 --> 00:16:50,220 parallelism. 401 00:16:50,220 --> 00:16:51,980 So this is kind of where we are -- 402 00:16:51,980 --> 00:16:55,500 where we are today, where we are switching. 403 00:16:55,500 --> 00:17:00,740 So basically, these are the machines where parallelism 404 00:17:00,740 --> 00:17:02,410 is exposed to software -- for example, to the compiler. 405 00:17:02,410 --> 00:17:05,890 So you might not see it as a user, but it is exposed to some 406 00:17:05,890 --> 00:17:07,020 layer of software. 407 00:17:07,020 --> 00:17:09,210 And there are many different forms of it. 408 00:17:09,210 --> 00:17:15,110 From very loosely coupled multiprocessors sitting on a 409 00:17:15,110 --> 00:17:19,610 board, or even sitting on multiple machines -- things 410 00:17:19,610 --> 00:17:22,460 like a cluster of workstations -- 411 00:17:22,460 --> 00:17:24,030 to very tightly coupled machines. 412 00:17:24,030 --> 00:17:26,290 So we'll go through, and figure out what are all the 413 00:17:26,290 --> 00:17:27,590 flavors of these things. 414 00:17:27,590 --> 00:17:28,625 AUDIENCE: Excuse me. 415 00:17:28,625 --> 00:17:29,142 PROFESSOR: Mhmm? 
416 00:17:29,142 --> 00:17:31,830 AUDIENCE: So does it mean that since there being the level 417 00:17:31,830 --> 00:17:35,900 parallelism, the processor can exploit the fact that the 418 00:17:35,900 --> 00:17:37,740 compiler knows the higher level instructions? 419 00:17:37,740 --> 00:17:38,950 Does that make any difference? 420 00:17:38,950 --> 00:17:40,410 PROFESSOR: It goes both ways. 421 00:17:40,410 --> 00:17:45,730 So what the processor knows is it know values for everything. 422 00:17:45,730 --> 00:17:49,200 So it has full exact knowledge of what's going on. 423 00:17:49,200 --> 00:17:51,620 Compiler is an abstraction. 424 00:17:51,620 --> 00:17:54,200 In that sense, processor wins in those. 425 00:17:54,200 --> 00:17:56,730 On the other hand, compile time doesn't 426 00:17:56,730 --> 00:17:58,160 affect the run time. 427 00:17:58,160 --> 00:18:00,670 So the compiler has a much bigger view of the world. 428 00:18:00,670 --> 00:18:03,440 429 00:18:03,440 --> 00:18:05,750 Even the most aggressive processor can't look ahead 430 00:18:05,750 --> 00:18:07,940 more than 100 instructions. 431 00:18:07,940 --> 00:18:09,760 On the other hand, the compiler sees ahead of 432 00:18:09,760 --> 00:18:11,600 millions of instructions. 433 00:18:11,600 --> 00:18:14,280 And so the compiler has the ability to kind of get the big 434 00:18:14,280 --> 00:18:16,960 picture and do things -- global kind of things. 435 00:18:16,960 --> 00:18:19,360 But on the other hand, it loses out when it doesn't have 436 00:18:19,360 --> 00:18:20,650 information. 437 00:18:20,650 --> 00:18:23,130 Whereas when you do the hardware parallelism, you have 438 00:18:23,130 --> 00:18:23,870 full information. 439 00:18:23,870 --> 00:18:24,840 AUDIENCE: You don't have to give up one at 440 00:18:24,840 --> 00:18:27,290 the loss of the other. 441 00:18:27,290 --> 00:18:29,490 PROFESSOR: The thing is, I don't think we have a good way 442 00:18:29,490 --> 00:18:31,540 of combining both very well. 
443 00:18:31,540 --> 00:18:34,350 Because the thing is, sometimes global optimization 444 00:18:34,350 --> 00:18:36,140 needs local information, and that's not 445 00:18:36,140 --> 00:18:38,150 available at run time. 446 00:18:38,150 --> 00:18:40,396 And global optimization is very costly, so you can't say 447 00:18:40,396 --> 00:18:43,860 -- "OK, I'm going to do it any time." So I think it's kind of 448 00:18:43,860 --> 00:18:45,720 even hybrid things. 449 00:18:45,720 --> 00:18:47,800 There's no nice mesh in there. 450 00:18:47,800 --> 00:18:51,960 451 00:18:51,960 --> 00:18:55,540 So if you think a little bit about parallelism, one 452 00:18:55,540 --> 00:18:58,100 interesting thing is this Little's Law. 453 00:18:58,100 --> 00:19:05,500 Little's Law says parallelism is a multiplication of 454 00:19:05,500 --> 00:19:07,020 throughput vs. latency. 455 00:19:07,020 --> 00:19:09,840 456 00:19:09,840 --> 00:19:14,980 So the way to think about that is the parallelism is dictated 457 00:19:14,980 --> 00:19:16,500 by the program in some sense. 458 00:19:16,500 --> 00:19:19,610 The program has a certain amount of parallelism. 459 00:19:19,610 --> 00:19:22,735 So if you have a thing that has a lot of latency to get to 460 00:19:22,735 --> 00:19:27,870 the result, what that means is there's a certain amount of 461 00:19:27,870 --> 00:19:30,850 throughput you can satisfy. 462 00:19:30,850 --> 00:19:34,380 Whereas if you have a thing that has a very low latency 463 00:19:34,380 --> 00:19:37,910 operation, you can go much wider. 464 00:19:37,910 --> 00:19:40,460 So if you look at Intel processors, what they have 465 00:19:40,460 --> 00:19:42,450 done is the superscalars -- 466 00:19:42,450 --> 00:19:45,320 they have actually, to get things faster they have a very 467 00:19:45,320 --> 00:19:46,380 long latency. 468 00:19:46,380 --> 00:19:48,210 Because they know they couldn't go more than 469 00:19:48,210 --> 00:19:49,980 three or four wide. 
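Editor's sketch: the Little's Law relation stated here (parallelism = throughput x latency) can be checked with a few lines of Python. This is a minimal illustration, not anything shown in the lecture. The function name `in_flight` is made up for the example, the 3-wide, 55-stage figures are the ones quoted in this part of the lecture, and the second design point is a hypothetical contrast.

```python
def in_flight(issue_width: int, pipeline_depth: int) -> int:
    """Little's Law: parallelism = throughput x latency.

    issue_width    -- instructions started per cycle (throughput)
    pipeline_depth -- cycles each instruction spends in the pipeline (latency)
    """
    return issue_width * pipeline_depth


# Deep and narrow: the 3-wide, 55-stage figures quoted in the lecture.
deep_narrow = in_flight(issue_width=3, pipeline_depth=55)

# Hypothetical shallow-and-wide design, numbers chosen only for contrast.
shallow_wide = in_flight(issue_width=6, pipeline_depth=10)

print(deep_narrow, shallow_wide)
```

Under this model a machine described as only "3-issue" still needs 165 independent instructions in the pipeline to stay full, which is the point the discussion is making.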
470 00:19:49,980 --> 00:19:52,630 So they went like 55-deep pipeline, three wide. 471 00:19:52,630 --> 00:19:55,130 472 00:19:55,130 --> 00:19:58,140 Because you can go fast, so they assume the 473 00:19:58,140 --> 00:19:59,400 parallelism fits here. 474 00:19:59,400 --> 00:20:00,510 So still you need a lot of parallelism. 475 00:20:00,510 --> 00:20:01,870 So you say -- "Three, why should [UNINTELLIGIBLE] 476 00:20:01,870 --> 00:20:02,940 issue machine. 477 00:20:02,940 --> 00:20:05,600 [UNINTELLIGIBLE] three it's no big deal." But no, if you have 478 00:20:05,600 --> 00:20:12,210 55-deep pipeline you need to have 165 parallel instructions 479 00:20:12,210 --> 00:20:14,560 on the fly any given time. 480 00:20:14,560 --> 00:20:15,410 So that's the thing. 481 00:20:15,410 --> 00:20:17,960 Even in the modern machine, there's about hundreds of 482 00:20:17,960 --> 00:20:19,180 instructions on the fly, because the 483 00:20:19,180 --> 00:20:22,230 pipeline is so large. 484 00:20:22,230 --> 00:20:24,250 So if you said 3-issue, it's not that. 485 00:20:24,250 --> 00:20:25,890 I mean this happens in there. 486 00:20:25,890 --> 00:20:29,280 So this gives designers a lot of flexibility in where you 487 00:20:29,280 --> 00:20:30,540 are expanding. 488 00:20:30,540 --> 00:20:34,380 And in some ways you can have a lot -- 489 00:20:34,380 --> 00:20:36,930 there are some machines that are a lot wider, but the 490 00:20:36,930 --> 00:20:38,070 latency is -- 491 00:20:38,070 --> 00:20:41,020 For example, if you look at an Itanium. 492 00:20:41,020 --> 00:20:46,290 Its clock cycle is about half the time of the Pentium, 493 00:20:46,290 --> 00:20:51,160 because it has a lot less latency but a lot wider. 494 00:20:51,160 --> 00:20:52,580 So you can do these kind of tradeoffs. 495 00:20:52,580 --> 00:20:55,240 496 00:20:55,240 --> 00:20:57,690 Types of parallelism. 497 00:20:57,690 --> 00:21:00,750 There are four categorizations here.
498 00:21:00,750 --> 00:21:03,800 So one categorization is, you have pipeline. 499 00:21:03,800 --> 00:21:07,620 You do the same thing in a pipeline fashion. 500 00:21:07,620 --> 00:21:09,310 So you do the same instruction. 501 00:21:09,310 --> 00:21:12,450 You do a little bit, and you start another copy of another 502 00:21:12,450 --> 00:21:13,250 copy of another copy. 503 00:21:13,250 --> 00:21:15,710 So you kind of pipeline the same thing down here. 504 00:21:15,710 --> 00:21:17,550 Kind of a vector machine -- we'll go through categories 505 00:21:17,550 --> 00:21:19,840 that kind of fit in here. 506 00:21:19,840 --> 00:21:22,390 Another category is data-level parallelism. 507 00:21:22,390 --> 00:21:29,130 What that means is, in a given cycle you do the same thing 508 00:21:29,130 --> 00:21:30,980 many many many many -- 509 00:21:30,980 --> 00:21:33,610 the same instructions for many many things. 510 00:21:33,610 --> 00:21:35,620 And then next cycle you do something 511 00:21:35,620 --> 00:21:37,133 different, stuff like that. 512 00:21:37,133 --> 00:21:39,320 Thread-level parallelism breaks in the other way. 513 00:21:39,320 --> 00:21:41,360 Thread-level parallelism says -- 514 00:21:41,360 --> 00:21:43,980 "I am not connecting the cycles, they are independent. 515 00:21:43,980 --> 00:21:48,590 Each thread can go do something different." 516 00:21:48,590 --> 00:21:50,470 And instruction-level parallelism is kind of a 517 00:21:50,470 --> 00:21:51,280 combination. 518 00:21:51,280 --> 00:21:54,865 What you are doing is, you are doing cycle by cycle -- they 519 00:21:54,865 --> 00:21:57,820 are connected -- and each cycle you do some kind of a 520 00:21:57,820 --> 00:21:59,320 combination of operations. 521 00:21:59,320 --> 00:22:01,170 So if you look at this closely. 522 00:22:01,170 --> 00:22:03,090 So pipelining hits here. 523 00:22:03,090 --> 00:22:05,590 Data parallel execution, things like SIMD 524 00:22:05,590 --> 00:22:06,870 execution hits here. 
525 00:22:06,870 --> 00:22:08,110 Thread-level parallelism. 526 00:22:08,110 --> 00:22:09,520 Instruction-level parallelism. 527 00:22:09,520 --> 00:22:11,530 So the four models of parallelism, what software 528 00:22:11,530 --> 00:22:18,390 people see kind of fits also in this architecture picture. 529 00:22:18,390 --> 00:22:21,440 So when you are designing a parallel machine, what do you 530 00:22:21,440 --> 00:22:22,800 have to worry about? 531 00:22:22,800 --> 00:22:24,700 The first thing is communication. 532 00:22:24,700 --> 00:22:26,060 That's the begin -- 533 00:22:26,060 --> 00:22:27,140 the problem in here. 534 00:22:27,140 --> 00:22:30,930 How do parallel operations communicate the data results? 535 00:22:30,930 --> 00:22:33,490 Because it's not only an issue of bandwidth, 536 00:22:33,490 --> 00:22:35,550 it's an issue of latency. 537 00:22:35,550 --> 00:22:38,300 The thing about bandwidth is that had been increasing by 538 00:22:38,300 --> 00:22:38,990 Moore's Law. 539 00:22:38,990 --> 00:22:40,600 Latency, speed of light. 540 00:22:40,600 --> 00:22:42,770 So as I pointed out, there's no Moore's Law on speed of 541 00:22:42,770 --> 00:22:46,540 light, and you have to deal with that. 542 00:22:46,540 --> 00:22:47,650 Synchronization. 543 00:22:47,650 --> 00:22:50,510 So when people do different things, how do you synchronize 544 00:22:50,510 --> 00:22:50,990 at some point? 545 00:22:50,990 --> 00:22:53,550 Because you can't keep going on different paths, at some 546 00:22:53,550 --> 00:22:54,680 point you have to come together. 547 00:22:54,680 --> 00:22:56,270 What's the cost? 548 00:22:56,270 --> 00:22:57,700 What are the processes of going it? 549 00:22:57,700 --> 00:23:01,670 Some stuff it's very explicit -- 550 00:23:01,670 --> 00:23:03,160 you have to deal with that. 551 00:23:03,160 --> 00:23:06,680 Some machines it's built in, so every cycle you are 552 00:23:06,680 --> 00:23:08,840 synchronizing.
553 00:23:08,840 --> 00:23:14,300 So sometimes it makes it easier for you, sometimes it 554 00:23:14,300 --> 00:23:15,940 makes it more inefficient. 555 00:23:15,940 --> 00:23:18,500 So you have to figure out what is in here. 556 00:23:18,500 --> 00:23:20,932 Resource management. 557 00:23:20,932 --> 00:23:23,920 The thing about parallelism is you have a lot of things going 558 00:23:23,920 --> 00:23:28,480 on, and managing that is a very important issue. 559 00:23:28,480 --> 00:23:33,970 Because sometimes if you put things in the wrong place, the 560 00:23:33,970 --> 00:23:36,260 cost of doing that might be much higher. 561 00:23:36,260 --> 00:23:40,890 That really reduces the benefit of doing that. 562 00:23:40,890 --> 00:23:43,010 And finally the scalability. 563 00:23:43,010 --> 00:23:48,070 How do you build processors that not only can do 2x 564 00:23:48,070 --> 00:23:50,110 parallelism, but can do a thousand? 565 00:23:50,110 --> 00:23:52,750 How can you keep growing with Moore's Law? 566 00:23:52,750 --> 00:23:55,960 So there are some ways you can get really good numbers, small 567 00:23:55,960 --> 00:23:58,240 numbers, but as you go bigger and bigger you can't scale. 568 00:23:58,240 --> 00:24:01,880 569 00:24:01,880 --> 00:24:02,850 So here's a classic 570 00:24:02,850 --> 00:24:04,340 classification of parallel machines. 571 00:24:04,340 --> 00:24:07,610 This has been [? divided ?] up by Mike Flynn in 1966. 572 00:24:07,610 --> 00:24:10,310 So he came up with four ways of classifying a machine. 573 00:24:10,310 --> 00:24:12,560 First he looked at how 574 00:24:12,560 --> 00:24:15,100 instruction and data is issued. 575 00:24:15,100 --> 00:24:18,240 So one thing is single instruction, single data. 576 00:24:18,240 --> 00:24:21,080 So there's single instruction given each cycle, and it 577 00:24:21,080 --> 00:24:22,040 affects single data. 578 00:24:22,040 --> 00:24:25,560 This is your conventional uniprocessor.
579 00:24:25,560 --> 00:24:28,360 Then came a SIMD machine -- single 580 00:24:28,360 --> 00:24:30,160 instruction, multiple data. 581 00:24:30,160 --> 00:24:32,520 So what that means is the given instruction affects 582 00:24:32,520 --> 00:24:34,700 multiple data in here. 583 00:24:34,700 --> 00:24:38,640 So things like -- there are two types, distributed memory 584 00:24:38,640 --> 00:24:39,390 and shared memory. 585 00:24:39,390 --> 00:24:41,270 I'll go to this distinction later. 586 00:24:41,270 --> 00:24:43,120 So there are a bunch of machines. 587 00:24:43,120 --> 00:24:46,010 In the good old times this was a useful trick, because the 588 00:24:46,010 --> 00:24:48,620 sequencer -- or what ran the instructions -- was a pretty 589 00:24:48,620 --> 00:24:50,930 substantial piece of hardware. 590 00:24:50,930 --> 00:24:54,780 So you build one of them and use it for many, many data. 591 00:24:54,780 --> 00:24:57,030 Even today data in a Pentium if you are doing a SIMD 592 00:24:57,030 --> 00:24:59,640 instruction, you just issue one instruction, it affects 593 00:24:59,640 --> 00:25:04,400 multiple data, and you can get a nice reuse 594 00:25:04,400 --> 00:25:06,190 of instruction decoding. 595 00:25:06,190 --> 00:25:10,920 Reduce the instruction bandwidth by doing SIMD. 596 00:25:10,920 --> 00:25:13,600 Then you go to MIMD, which is Multiple 597 00:25:13,600 --> 00:25:14,920 Instruction, Multiple Data. 598 00:25:14,920 --> 00:25:17,600 So we have multiple instruction streams each 599 00:25:17,600 --> 00:25:20,510 affecting its own data. 600 00:25:20,510 --> 00:25:23,240 So each data streams, instruction streams 601 00:25:23,240 --> 00:25:23,820 separately. 602 00:25:23,820 --> 00:25:27,250 So things like message passing machines, coherent and 603 00:25:27,250 --> 00:25:28,390 non-coherent shared memory. 604 00:25:28,390 --> 00:25:30,060 I'll go into details of coherence and 605 00:25:30,060 --> 00:25:31,180 non-coherence later. 
606 00:25:31,180 --> 00:25:35,060 There are multiple categories within that too. 607 00:25:35,060 --> 00:25:38,090 And then finally, there's kind of a misnomer, MISD. 608 00:25:38,090 --> 00:25:39,520 There hasn't been a single machine. 609 00:25:39,520 --> 00:25:41,595 It doesn't make sense to have multiple instructions work on 610 00:25:41,595 --> 00:25:42,400 single data. 611 00:25:42,400 --> 00:25:46,000 So this classification, right now -- question? 612 00:25:46,000 --> 00:25:49,070 AUDIENCE: I've heard that [INAUDIBLE] 613 00:25:49,070 --> 00:25:51,040 PROFESSOR: Multiple instruction, single data? 614 00:25:51,040 --> 00:25:53,140 I don't know. 615 00:25:53,140 --> 00:25:55,490 You can try to fit something there just to have something, 616 00:25:55,490 --> 00:26:00,340 but it doesn't fit really well into this kind of thinking. 617 00:26:00,340 --> 00:26:02,340 So I don't like that thinking. 618 00:26:02,340 --> 00:26:04,640 I was thinking how should I do it, so I came up with a new 619 00:26:04,640 --> 00:26:05,830 way of classifying. 620 00:26:05,830 --> 00:26:09,390 So what my classification is, what's the last 621 00:26:09,390 --> 00:26:10,350 thing you are sharing? 622 00:26:10,350 --> 00:26:13,440 Because when you are running something, if it is some 623 00:26:13,440 --> 00:26:16,150 single machine, some thing has to be shared, and some things 624 00:26:16,150 --> 00:26:17,170 have to be separated. 625 00:26:17,170 --> 00:26:19,830 So are you sharing instructions, are you sharing 626 00:26:19,830 --> 00:26:22,740 the sequencer, are you sharing the memory, are you sharing 627 00:26:22,740 --> 00:26:23,310 the network? 628 00:26:23,310 --> 00:26:27,290 So this kind of fits many things nicely into this model. 629 00:26:27,290 --> 00:26:29,670 So let's go through this model and see 630 00:26:29,670 --> 00:26:30,920 different things in there. 631 00:26:30,920 --> 00:26:34,960 632 00:26:34,960 --> 00:26:38,630 So let's look at shared instruction processors. 
633 00:26:38,630 --> 00:26:43,130 So there had been a lot of work in the good old days. 634 00:26:43,130 --> 00:26:48,290 Did anybody know Goodyear actually made supercomputers? 635 00:26:48,290 --> 00:26:50,260 Not only did they make tires, for a long time they were 636 00:26:50,260 --> 00:26:53,390 actually making processors. 637 00:26:53,390 --> 00:26:56,460 GE made processors, stuff like that. 638 00:26:56,460 --> 00:27:00,550 And so a long time ago this was a very interesting 639 00:27:00,550 --> 00:27:04,090 proposition, because there was a huge amount of hardware that 640 00:27:04,090 --> 00:27:08,150 has to be dedicated to doing the sequence and running the 641 00:27:08,150 --> 00:27:08,830 instruction. 642 00:27:08,830 --> 00:27:11,840 So just to share that was a really interesting concept. 643 00:27:11,840 --> 00:27:14,470 So people built machines that basically -- 644 00:27:14,470 --> 00:27:17,090 single instruction stream affecting 645 00:27:17,090 --> 00:27:18,340 multiple data in there. 646 00:27:18,340 --> 00:27:21,900 I think very well-known machines are things like 647 00:27:21,900 --> 00:27:26,720 Thinking Machines CM-1, Maspar MP-1 -- 648 00:27:26,720 --> 00:27:31,100 which had 16,000 processors. 649 00:27:31,100 --> 00:27:32,310 Small processors -- 650 00:27:32,310 --> 00:27:35,410 4-bit processors, you can only do 4-bit computation. 651 00:27:35,410 --> 00:27:39,100 And then every cycle you can do 16,000 of them, 4-bit 652 00:27:39,100 --> 00:27:40,640 things in here. 653 00:27:40,640 --> 00:27:43,250 It really fits in to the kind of things they could build in 654 00:27:43,250 --> 00:27:45,430 hardware those days. 655 00:27:45,430 --> 00:27:47,400 And there's one controller in there. 656 00:27:47,400 --> 00:27:49,230 So it is just a neat thing, because you can do a lot of 657 00:27:49,230 --> 00:27:51,710 work if you actually can match it in that form. 
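Editor's sketch: the shared-instruction idea described here, one controller driving many small processing elements, can be modeled in a few lines of Python. This is an illustrative toy under stated assumptions, not the instruction set of any real machine: the class `ProcessingElement`, the function `broadcast`, and the two operations `add` and `and` are all hypothetical. Only the 4-bit width and the single controller come from the lecture.

```python
MASK_4BIT = 0xF  # each processing element only handles 4-bit values


class ProcessingElement:
    """One tiny processor with its own private state (hypothetical model)."""

    def __init__(self, value: int):
        self.acc = value & MASK_4BIT

    def execute(self, op: str, operand: int) -> None:
        if op == "add":
            self.acc = (self.acc + operand) & MASK_4BIT  # wraps at 4 bits
        elif op == "and":
            self.acc = self.acc & operand & MASK_4BIT
        else:
            raise ValueError(f"unknown op: {op}")


def broadcast(pes, program):
    """The single controller: each instruction is decoded once and applied to every PE."""
    for op, operand in program:   # one instruction per cycle
        for pe in pes:            # in hardware, all PEs do this in the same cycle
            pe.execute(op, operand)


# 8 PEs for readability; the lecture mentions machines with about 16,000.
pes = [ProcessingElement(v) for v in range(8)]
broadcast(pes, [("add", 9), ("and", 0b0111)])
print([pe.acc for pe in pes])
```

The loop over `pes` is sequential only because this is a software model. The hardware point is that the sequencing logic exists once, while the data paths are replicated.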
658 00:27:51,710 --> 00:27:55,660 659 00:27:55,660 --> 00:27:58,760 So the way you look at that is, to run this array you have 660 00:27:58,760 --> 00:28:00,570 this array controller. 661 00:28:00,570 --> 00:28:04,040 And then you have processing elements, a 662 00:28:04,040 --> 00:28:04,750 huge amount of them. 663 00:28:04,750 --> 00:28:07,125 And you have each processor mainly had distributed memory 664 00:28:07,125 --> 00:28:08,790 -- each has its own memory. 665 00:28:08,790 --> 00:28:12,150 And so given instruction, everybody did the same thing 666 00:28:12,150 --> 00:28:15,310 to memory or arithmetic in there. 667 00:28:15,310 --> 00:28:18,000 And then you had also interconnect network, so you 668 00:28:18,000 --> 00:28:20,350 can actually send it. 669 00:28:20,350 --> 00:28:21,670 A lot of these things have the [? near-enabled ?] 670 00:28:21,670 --> 00:28:22,580 communication. 671 00:28:22,580 --> 00:28:24,860 You can send data you near enable, so everybody kind of 672 00:28:24,860 --> 00:28:29,900 shifts the 2-D or some kind of torus mapping in there. 673 00:28:29,900 --> 00:28:33,240 And if you can program that, you can get really good 674 00:28:33,240 --> 00:28:34,810 performance in there. 675 00:28:34,810 --> 00:28:38,150 676 00:28:38,150 --> 00:28:39,880 And each cycle, it's very synchronous. 677 00:28:39,880 --> 00:28:42,110 So each cycle everybody does the same thing -- go to the 678 00:28:42,110 --> 00:28:43,360 next thing, do the same thing. 679 00:28:43,360 --> 00:28:45,860 680 00:28:45,860 --> 00:28:49,840 So the next very interesting machine is this Cray-1. 681 00:28:49,840 --> 00:28:51,710 I think this is one of the first successful 682 00:28:51,710 --> 00:28:53,400 supercomputers out there. 683 00:28:53,400 --> 00:28:57,250 So here's the Cray-1, it is this kind of round seat type 684 00:28:57,250 --> 00:28:59,340 thing sitting in here. 685 00:28:59,340 --> 00:29:01,914 Everybody know what was under the seat? 
686 00:29:01,914 --> 00:29:03,330 AUDIENCE: Cooling. 687 00:29:03,330 --> 00:29:04,030 PROFESSOR: Cooling. 688 00:29:04,030 --> 00:29:05,520 So here's a photo. 689 00:29:05,520 --> 00:29:08,120 I don't think you can see that -- you can probably look at it 690 00:29:08,120 --> 00:29:09,700 when I put this on the web -- was this 691 00:29:09,700 --> 00:29:10,880 entire cooling mechanism. 692 00:29:10,880 --> 00:29:14,350 In fact Seymour Cray at one time said one of his most 693 00:29:14,350 --> 00:29:16,505 important innovations in this machine is 694 00:29:16,505 --> 00:29:19,000 how to cool the thing. 695 00:29:19,000 --> 00:29:20,270 And this is a generation, again, that 696 00:29:20,270 --> 00:29:22,420 power was a big thing. 697 00:29:22,420 --> 00:29:25,590 So each of these columns had this huge amount of boards 698 00:29:25,590 --> 00:29:28,990 going, and in the middle had all the wiring going. 699 00:29:28,990 --> 00:29:31,900 So we had this huge mess of wiring in the middle -- 700 00:29:31,900 --> 00:29:32,490 [UNINTELLIGIBLE] 701 00:29:32,490 --> 00:29:33,845 -- and then you had all these boards in 702 00:29:33,845 --> 00:29:35,130 there in each of these. 703 00:29:35,130 --> 00:29:37,580 So this is the Cray-1 processor. 704 00:29:37,580 --> 00:29:39,190 AUDIENCE: Do you know your little -- 705 00:29:39,190 --> 00:29:43,630 your laptop is way faster than that Cray -- 706 00:29:43,630 --> 00:29:45,010 PROFESSOR: Yeah. 707 00:29:45,010 --> 00:29:48,655 Did you have the clock speed in here? 708 00:29:48,655 --> 00:29:49,690 [INTERPOSING VOICES] 709 00:29:49,690 --> 00:29:51,930 AUDIENCE: 80 MHz. 710 00:29:51,930 --> 00:29:54,510 PROFESSOR: So, yeah. 711 00:29:54,510 --> 00:29:58,520 And that cost like $10 million or something like 712 00:29:58,520 --> 00:30:01,470 that at that time. 713 00:30:01,470 --> 00:30:03,360 Moore's Law, it's just amazing. 
714 00:30:03,360 --> 00:30:05,640 If you think if you apply Moore's Law to any other thing 715 00:30:05,640 --> 00:30:08,330 we have, it can't do the comparison. 716 00:30:08,330 --> 00:30:11,290 We are very fortunate to be part of that generation. 717 00:30:11,290 --> 00:30:13,040 AUDIENCE: But did it have PowerPoint? 718 00:30:13,040 --> 00:30:16,580 PROFESSOR: So what it had, was it had these 719 00:30:16,580 --> 00:30:17,690 three types of registers. 720 00:30:17,690 --> 00:30:19,550 It had scalar registers, address 721 00:30:19,550 --> 00:30:21,470 registers, and vector registers. 722 00:30:21,470 --> 00:30:23,880 The key thing there is the vector register. 723 00:30:23,880 --> 00:30:28,160 So if you want to do things fast -- 724 00:30:28,160 --> 00:30:29,250 no, fast is not the word. 725 00:30:29,250 --> 00:30:32,510 You can do a lot of computation in a short amount 726 00:30:32,510 --> 00:30:35,840 of time by using the vector registers. 727 00:30:35,840 --> 00:30:40,210 So the way to look at that is normally when you go to the 728 00:30:40,210 --> 00:30:42,350 execute stage you do one thing. 729 00:30:42,350 --> 00:30:44,670 In a vector register what happens is it got pipelined. 730 00:30:44,670 --> 00:30:47,790 So execute stage happened one word next, next, next. 731 00:30:47,790 --> 00:30:51,660 You can do up to 64 or even bigger. 732 00:30:51,660 --> 00:30:53,380 I think it was 64 length, length 64 things. 733 00:30:53,380 --> 00:30:55,360 So you can -- so that instruction. 734 00:30:55,360 --> 00:30:58,640 So you do a few of these, and then this stage keeps going on 735 00:30:58,640 --> 00:31:00,590 and on and on, for 64. 736 00:31:00,590 --> 00:31:02,920 And then you can pipeline in the way that you can start 737 00:31:02,920 --> 00:31:04,170 another one. 738 00:31:04,170 --> 00:31:06,080 739 00:31:06,080 --> 00:31:08,220 Actually, this will use the same execution unit, so you have 740 00:31:08,220 --> 00:31:12,230 to wait till that finishes to start.
741 00:31:12,230 --> 00:31:15,430 So you can pipeline to get a huge amount of things going 742 00:31:15,430 --> 00:31:17,200 through the pipeline. 743 00:31:17,200 --> 00:31:20,750 And so each cycle you can graduate many, many 744 00:31:20,750 --> 00:31:21,300 things going on. 745 00:31:21,300 --> 00:31:22,873 AUDIENCE: Can I ask you a quick question? 746 00:31:22,873 --> 00:31:24,446 Something I'm trying to get straight in my head. 747 00:31:24,446 --> 00:31:26,960 My notion -- and I don't think I'm right on this, that's why 748 00:31:26,960 --> 00:31:31,305 I'm asking you -- is machines like the Cray, I know you were 749 00:31:31,305 --> 00:31:34,285 talking about some of the vector operations, those were 750 00:31:34,285 --> 00:31:36,980 by and large a relatively small set of operations. 751 00:31:36,980 --> 00:31:39,840 Like dot products, and vector times scalar. 752 00:31:39,840 --> 00:31:41,514 On the other hand, when you look at the SIMD machines, 753 00:31:41,514 --> 00:31:43,860 they had a much richer set of operations. 754 00:31:43,860 --> 00:31:49,110 PROFESSOR: I think with scatter-gather and things like 755 00:31:49,110 --> 00:31:53,560 conditional execution, I think vector machines could be a 756 00:31:53,560 --> 00:31:54,670 fairly large -- 757 00:31:54,670 --> 00:31:58,298 I mean it's painful. 758 00:31:58,298 --> 00:32:01,230 AUDIENCE: [INAUDIBLE] 759 00:32:01,230 --> 00:32:03,660 PROFESSOR: The SIMD instructions in Pentium. 760 00:32:03,660 --> 00:32:08,410 I think that is mainly targeting signal processing 761 00:32:08,410 --> 00:32:09,660 type stuff. 762 00:32:09,660 --> 00:32:14,050 763 00:32:14,050 --> 00:32:15,260 They don't have real scatter-gather. 764 00:32:15,260 --> 00:32:17,460 AUDIENCE: And the Cell processor? 765 00:32:17,460 --> 00:32:20,370 PROFESSOR: Cell is distributed memory.
766 00:32:20,370 --> 00:32:22,946 AUDIENCE: Yeah, but on one the -- what do they 767 00:32:22,946 --> 00:32:23,490 call them, the -- 768 00:32:23,490 --> 00:32:26,140 PROFESSOR: I don't think you can scatter-gather either. 769 00:32:26,140 --> 00:32:31,260 It's just basically, you have to align words in, word out. 770 00:32:31,260 --> 00:32:33,770 IBM is always about doing align. 771 00:32:33,770 --> 00:32:37,030 So in even AltiVec, you can't even do unaligned access. 772 00:32:37,030 --> 00:32:38,000 You had to do aligned access. 773 00:32:38,000 --> 00:32:40,795 So if there is no run align, you had to pay a 774 00:32:40,795 --> 00:32:43,700 big penalty in there. 775 00:32:43,700 --> 00:32:46,140 So if you look at how this happens. 776 00:32:46,140 --> 00:32:49,320 So you have this entire pipeline thing. 777 00:32:49,320 --> 00:32:52,830 When things get started the first value is at this point 778 00:32:52,830 --> 00:32:54,200 done in one clock cycle. 779 00:32:54,200 --> 00:32:56,250 The next value is halfway through that. 780 00:32:56,250 --> 00:32:58,460 Another value is in some part of a -- 781 00:32:58,460 --> 00:33:00,550 is also pipelined, the alias pipeline. 782 00:33:00,550 --> 00:33:03,840 And other values are kind of feeding nicely into that. 783 00:33:03,840 --> 00:33:06,940 So if you have one -- this is called one lane. 784 00:33:06,940 --> 00:33:10,520 You can have multiple lanes, and then what you can do is 785 00:33:10,520 --> 00:33:13,230 each cycle you get 40 [UNINTELLIGIBLE] 786 00:33:13,230 --> 00:33:15,000 And the next ones are in the middle of that, 787 00:33:15,000 --> 00:33:16,020 next ones are in middle. 788 00:33:16,020 --> 00:33:19,310 So what you have is a very pipelined machine, so you can 789 00:33:19,310 --> 00:33:21,290 kind of pipeline things in there. 790 00:33:21,290 --> 00:33:23,290 So you can have either one lane, or multiple lanes 791 00:33:23,290 --> 00:33:25,090 pipeline coming out. 
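Editor's sketch: the lane picture described here can be turned into a rough cycle count. This is a simplified textbook-style model and an assumption on the editor's part, not a timing specification for the Cray-1: the first group of elements takes the full pipeline depth, and each later group finishes one cycle after the previous one. The function name `vector_cycles` and the pipeline depth of 6 are hypothetical. The vector length of 64 is the figure mentioned in the lecture.

```python
import math


def vector_cycles(vector_length: int, pipeline_depth: int, lanes: int = 1) -> int:
    """Cycles until the last element of one vector instruction completes.

    Simplified model: elements enter the pipeline `lanes` at a time,
    one group per cycle, and each group needs `pipeline_depth` cycles.
    """
    groups = math.ceil(vector_length / lanes)
    return pipeline_depth + groups - 1


depth = 6  # hypothetical functional-unit depth, for illustration only

unpipelined = 64 * depth                       # one element at a time, no overlap
one_lane = vector_cycles(64, depth, lanes=1)   # fully pipelined, single lane
four_lanes = vector_cycles(64, depth, lanes=4)

print(unpipelined, one_lane, four_lanes)
```

In this model a single lane already approaches one result per cycle once the pipeline fills, and extra lanes divide the number of groups that have to stream through.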
792 00:33:25,090 --> 00:33:27,720 So if you look at the architecture, what you had is 793 00:33:27,720 --> 00:33:30,230 you have some kind of vector registers feeding into these 794 00:33:30,230 --> 00:33:32,220 kind of functional units. 795 00:33:32,220 --> 00:33:34,910 So at a given time, in this one you might be able to get 796 00:33:34,910 --> 00:33:38,030 eight results out, because everything gets pipelined. 797 00:33:38,030 --> 00:33:42,330 But the same thing is happening in there. 798 00:33:42,330 --> 00:33:44,720 Clear how vector machines work? 799 00:33:44,720 --> 00:33:46,880 So it's not really parallelism, it's basically -- 800 00:33:46,880 --> 00:33:50,780 especially if you are one -- it's a superpipelined thing. 801 00:33:50,780 --> 00:33:53,740 But given one instruction, it will crank out many, many, 802 00:33:53,740 --> 00:33:57,960 many things for that instruction. 803 00:33:57,960 --> 00:34:00,220 And doing parallelism is easy in here too, because it's the 804 00:34:00,220 --> 00:34:02,750 same thing happening to very regular data sets. 805 00:34:02,750 --> 00:34:05,230 So there's no notion of asynchronizations and all 806 00:34:05,230 --> 00:34:06,160 these weird things. 807 00:34:06,160 --> 00:34:08,980 It's just a very simple pattern. 808 00:34:08,980 --> 00:34:13,030 So the next thing is the shared sequencer processor. 809 00:34:13,030 --> 00:34:16,990 So here it's similar to the vector machines because each 810 00:34:16,990 --> 00:34:20,840 cycle you issue a single instruction. 811 00:34:20,840 --> 00:34:24,560 But the instruction is a wide instruction. 812 00:34:24,560 --> 00:34:28,410 It had multiple operations in these same instructions. 813 00:34:28,410 --> 00:34:29,490 So what it says is -- 814 00:34:29,490 --> 00:34:32,190 "I have multiple execution units, I have memory in a 815 00:34:32,190 --> 00:34:35,280 separate unit, and each instruction I will tell each 816 00:34:35,280 --> 00:34:40,060 unit what to do." 
And so something you might have -- 817 00:34:40,060 --> 00:34:43,450 two integer units, two memory/load store units, two 818 00:34:43,450 --> 00:34:44,360 floating-point units. 819 00:34:44,360 --> 00:34:47,190 Each cycle you tell each of them what to do. 820 00:34:47,190 --> 00:34:49,210 So you just kind of keep issuing an instruction that 821 00:34:49,210 --> 00:34:50,330 affects many of them. 822 00:34:50,330 --> 00:34:54,430 So sometimes what happens is if this has latency of four, 823 00:34:54,430 --> 00:34:56,590 you might have to wait till this is done to do the next 824 00:34:56,590 --> 00:34:56,940 instruction. 825 00:34:56,940 --> 00:34:59,560 So if one guy takes long, everybody has to kind 826 00:34:59,560 --> 00:35:00,700 of wait till that. 827 00:35:00,700 --> 00:35:02,180 So it's very synchronous going. 828 00:35:02,180 --> 00:35:04,150 So things like synchronization stuff were 829 00:35:04,150 --> 00:35:05,401 not an issue in here. 830 00:35:05,401 --> 00:35:09,250 831 00:35:09,250 --> 00:35:12,430 So if you look at a pipeline, this is what happens. 832 00:35:12,430 --> 00:35:13,970 So you have this instruction. 833 00:35:13,970 --> 00:35:16,800 It's an instruction, but you are fetching a wide 834 00:35:16,800 --> 00:35:17,120 instruction. 835 00:35:17,120 --> 00:35:18,430 You are not researching a simple instruction. 836 00:35:18,430 --> 00:35:20,630 You decode the entire thing, but you can decode it 837 00:35:20,630 --> 00:35:20,980 separately. 838 00:35:20,980 --> 00:35:23,984 And then you go execute on each execution unit. 839 00:35:23,984 --> 00:35:26,770 840 00:35:26,770 --> 00:35:28,980 One interesting problem here was this 841 00:35:28,980 --> 00:35:31,410 was not really scalable. 842 00:35:31,410 --> 00:35:36,530 What happened here is each functional unit, if you had 843 00:35:36,530 --> 00:35:40,020 one single register file, has to access the register file. 
844 00:35:40,020 --> 00:35:42,670 So each function would say -- "I am using register R1," "I 845 00:35:42,670 --> 00:35:46,060 am using R3," "I am using R5." So what has to happen is the 846 00:35:46,060 --> 00:35:48,990 register file has to have -- 847 00:35:48,990 --> 00:35:53,450 basically, if you have eight functional units, 16 outports 848 00:35:53,450 --> 00:35:55,190 and 8 inports coming in. 849 00:35:55,190 --> 00:35:57,270 And then of course, when you build a register file it has a 850 00:35:57,270 --> 00:36:01,880 scale, so it had huge scalability issues. 851 00:36:01,880 --> 00:36:04,960 So it's a quadratically scalable register function. 852 00:36:04,960 --> 00:36:05,476 Question? 853 00:36:05,476 --> 00:36:07,540 AUDIENCE: The sequencer [INAUDIBLE PHRASE] 854 00:36:07,540 --> 00:36:10,120 855 00:36:10,120 --> 00:36:11,370 PROFESSOR: Yeah. 856 00:36:11,370 --> 00:36:13,270 857 00:36:13,270 --> 00:36:15,820 It's basically you had to wait till everybody's done, there's 858 00:36:15,820 --> 00:36:17,820 nothing going out of any order. 859 00:36:17,820 --> 00:36:19,150 And memory also. 860 00:36:19,150 --> 00:36:21,950 Since everybody's going to memory, this is not scalable. 861 00:36:21,950 --> 00:36:26,880 So people try to build -- you can do four, eight wide, but 862 00:36:26,880 --> 00:36:30,760 beyond that this register and memory interconnect became a 863 00:36:30,760 --> 00:36:32,770 big mess to build. 864 00:36:32,770 --> 00:36:36,830 And so one kind of modification thing people did 865 00:36:36,830 --> 00:36:39,690 was called Clustered VLIW. 866 00:36:39,690 --> 00:36:43,560 So what happens is you have a very wide instruction in here. 867 00:36:43,560 --> 00:36:46,730 It goes to not one cluster, but different clusters. 868 00:36:46,730 --> 00:36:49,940 Each cluster has its own register file, its own kind of 869 00:36:49,940 --> 00:36:52,160 memory interconnect going on there. 
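Editor's sketch: the register-file scaling problem described here can be made concrete by counting ports. The model below assumes every operation reads two source registers and writes one result, and it uses the common rule of thumb that register-file area grows roughly with the square of the port count. Both are assumptions made for illustration, and the names `register_ports` and `relative_area` are hypothetical.

```python
def register_ports(functional_units: int, reads_per_op: int = 2, writes_per_op: int = 1):
    """Read and write ports needed if every unit can issue an operation each cycle."""
    return functional_units * reads_per_op, functional_units * writes_per_op


def relative_area(functional_units: int) -> int:
    """Rule-of-thumb cost: area grows with the square of the total port count."""
    reads, writes = register_ports(functional_units)
    return (reads + writes) ** 2


# One shared register file serving eight functional units.
print(register_ports(8))       # read ports, write ports
monolithic = relative_area(8)

# Four clusters of two units, each cluster with its own small register file.
clustered = 4 * relative_area(2)

print(monolithic, clustered)
```

Under these assumptions the clustered layout needs about a quarter of the register-file area of the monolithic one, at the price of a separate and slower path for values that cross clusters.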
870 00:36:52,160 --> 00:36:55,750 And what that means is if you want to do intercluster 871 00:36:55,750 --> 00:36:58,000 communication, you have to go to a very special 872 00:36:58,000 --> 00:37:00,060 communication network. 873 00:37:00,060 --> 00:37:03,000 So you don't have this bandwidth expansion register. 874 00:37:03,000 --> 00:37:06,180 So you only have, we'll say two execution units, so you 875 00:37:06,180 --> 00:37:10,430 only have to have four out and one in to the 876 00:37:10,430 --> 00:37:11,900 register file each cycle. 877 00:37:11,900 --> 00:37:15,030 And then if you want other communication, you have a much 878 00:37:15,030 --> 00:37:17,600 lower bandwidth interconnect that you'll have 879 00:37:17,600 --> 00:37:18,640 to go through that. 880 00:37:18,640 --> 00:37:23,070 So what this does is you kind of expose more complexity to 881 00:37:23,070 --> 00:37:28,110 the compiler and software, and the rationale here is most 882 00:37:28,110 --> 00:37:31,380 programs have locality. 883 00:37:31,380 --> 00:37:33,210 It's not like everybody always wants to communicate with 884 00:37:33,210 --> 00:37:35,670 everybody else, so there is some locality in here. 885 00:37:35,670 --> 00:37:38,610 So you can basically cluster things that are local together 886 00:37:38,610 --> 00:37:41,360 and put it in here, and then when other things have to be 887 00:37:41,360 --> 00:37:43,880 communicated you can use this communication and go about 888 00:37:43,880 --> 00:37:44,210 doing that. 889 00:37:44,210 --> 00:37:48,540 So this is kind of the state of the art in this technology. 890 00:37:48,540 --> 00:37:49,510 And something like -- 891 00:37:49,510 --> 00:37:50,410 what I didn't put -- 892 00:37:50,410 --> 00:37:52,710 Itanium kind of fits in here. 893 00:37:52,710 --> 00:37:55,830 Itanium processor. 894 00:37:55,830 --> 00:37:59,810 So then we go to shared network. 895 00:37:59,810 --> 00:38:01,570 There has been a lot of work in here.
896 00:38:01,570 --> 00:38:05,410 People have been building multiprocessors for a long 897 00:38:05,410 --> 00:38:07,000 time, because it's a very easy thing to build. 898 00:38:07,000 --> 00:38:09,870 So what you do is -- 899 00:38:09,870 --> 00:38:13,490 if you look at it, you have a processor unit that connects 900 00:38:13,490 --> 00:38:15,000 to its own memory. 901 00:38:15,000 --> 00:38:16,340 And it's like a multiple [UNINTELLIGIBLE] 902 00:38:16,340 --> 00:38:19,840 Then it has a very tightly connected network interface 903 00:38:19,840 --> 00:38:21,820 that goes to the interconnect network. 904 00:38:21,820 --> 00:38:26,170 So we can even think about a workstation farm as this type 905 00:38:26,170 --> 00:38:27,110 of a machine. 906 00:38:27,110 --> 00:38:33,200 But of course, the network is a pretty slow one that requires 907 00:38:33,200 --> 00:38:34,180 an Ethernet connection. 908 00:38:34,180 --> 00:38:35,930 But people build things that have much 909 00:38:35,930 --> 00:38:39,060 faster networks in there. 910 00:38:39,060 --> 00:38:41,890 This was designed in a way you can build hundreds and 911 00:38:41,890 --> 00:38:43,580 thousands of these things -- 912 00:38:43,580 --> 00:38:44,610 nodes in here. 913 00:38:44,610 --> 00:38:48,760 So today if you look at the top 500 supercomputers, a 914 00:38:48,760 --> 00:38:51,530 bunch of them fit into this category because it's very 915 00:38:51,530 --> 00:38:54,510 easy to scale and build very large. 916 00:38:54,510 --> 00:38:56,647 AUDIENCE: Are you doing SMPs in this list, 917 00:38:56,647 --> 00:38:57,670 or some other place? 918 00:38:57,670 --> 00:39:00,020 PROFESSOR: SMP is mostly shared 919 00:39:00,020 --> 00:39:01,750 memory, so shared network. 920 00:39:01,750 --> 00:39:03,000 I'll do shared memory next. 921 00:39:03,000 --> 00:39:06,500 922 00:39:06,500 --> 00:39:09,180 But there are problems with it.
923 00:39:09,180 --> 00:39:12,860 All the data layout has to be handled by software, or by the 924 00:39:12,860 --> 00:39:15,670 programmer basically. 925 00:39:15,670 --> 00:39:18,100 If you want something outside your memory, you had to do 926 00:39:18,100 --> 00:39:19,310 very explicit communication. 927 00:39:19,310 --> 00:39:21,470 Not only you, the other guy who has the data actually has 928 00:39:21,470 --> 00:39:23,420 to cooperate to send it to you. 929 00:39:23,420 --> 00:39:26,320 And he needs to know that now you have the data. 930 00:39:26,320 --> 00:39:29,480 All of that management is your problem. 931 00:39:29,480 --> 00:39:34,020 And that makes programming these kinds of things very 932 00:39:34,020 --> 00:39:36,040 difficult, which you'll probably figure out by the 933 00:39:36,040 --> 00:39:37,080 time you're done with Cell. 934 00:39:37,080 --> 00:39:41,930 So Cell has a lot of these issues, too. 935 00:39:41,930 --> 00:39:45,980 The problem here is not dealing with most of the data, 936 00:39:45,980 --> 00:39:48,200 but the kind of corner cases that you don't 937 00:39:48,200 --> 00:39:49,520 know about that much. 938 00:39:49,520 --> 00:39:51,695 There's no nice safe way of saying -- "I don't know where, 939 00:39:51,695 --> 00:39:52,850 who's going to access it. 940 00:39:52,850 --> 00:39:54,430 I'll let the hardware take care of it." There's no 941 00:39:54,430 --> 00:39:58,160 hardware, you have to take care of it yourself. 942 00:39:58,160 --> 00:40:02,060 And also message passing has a very high overhead. 943 00:40:02,060 --> 00:40:04,980 Most of the time in order to send a message, you have to invoke 944 00:40:04,980 --> 00:40:06,130 some kind of a kernel thing. 945 00:40:06,130 --> 00:40:08,240 You have to actually do a kernel switch that will call 946 00:40:08,240 --> 00:40:09,400 the network -- 947 00:40:09,400 --> 00:40:11,990 it's an operating-system-involved process, basically, to get a 948 00:40:11,990 --> 00:40:13,850 message in there.
949 00:40:13,850 --> 00:40:16,250 And also when you get a message out you have to do 950 00:40:16,250 --> 00:40:21,110 some kind of interrupt or polling, and that's a bunch of 951 00:40:21,110 --> 00:40:22,140 copies out of kernel. 952 00:40:22,140 --> 00:40:25,040 And this became a pretty expensive proposition. 953 00:40:25,040 --> 00:40:27,800 So you can't send messages the size of one [UNINTELLIGIBLE] 954 00:40:27,800 --> 00:40:29,970 so you had to accumulate a huge amount of things to send 955 00:40:29,970 --> 00:40:31,730 out to amortize the cost of doing that. 956 00:40:31,730 --> 00:40:37,430 957 00:40:37,430 --> 00:40:39,590 Sending can be somewhat cheap, but receiving 958 00:40:39,590 --> 00:40:41,180 is a lot more expensive. 959 00:40:41,180 --> 00:40:42,690 Because when receiving you have to multiplex. 960 00:40:42,690 --> 00:40:44,280 You have no idea who it's coming to. 961 00:40:44,280 --> 00:40:46,070 So you have to receive, you have to figure out who is 962 00:40:46,070 --> 00:40:47,380 supposed to get it. 963 00:40:47,380 --> 00:40:49,455 Especially if you are running multiple applications, it 964 00:40:49,455 --> 00:40:50,570 might be for someone else's application. 965 00:40:50,570 --> 00:40:51,810 You had to contact [UNINTELLIGIBLE] 966 00:40:51,810 --> 00:40:53,060 So it's a big mess. 967 00:40:53,060 --> 00:40:55,640 968 00:40:55,640 --> 00:40:58,800 That is where people went to shared memory processors, 969 00:40:58,800 --> 00:41:02,040 because it became an easier method to use. 970 00:41:02,040 --> 00:41:05,480 So that is basically the SMPs Alan was talking about. 971 00:41:05,480 --> 00:41:09,350 972 00:41:09,350 --> 00:41:12,160 The nice thing is it will work with any data placement. 973 00:41:12,160 --> 00:41:15,390 It might work very slowly, but at least it will work.
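The amortization point made about message passing above can be put into a small cost model. This is only a sketch: the 1000-cycle fixed overhead per message, the 1-cycle per-word transfer cost, and the 10-cycle cost of a direct memory access are assumed figures picked for illustration, not measurements, and the function names are hypothetical.

```python
def cycles_per_word_message(words, overhead_cycles=1000, per_word_cycles=1):
    """Average cost per word when `words` words are batched into one message."""
    return (overhead_cycles + per_word_cycles * words) / words


def break_even_words(overhead_cycles=1000, per_word_cycles=1, direct_cycles=10):
    """Smallest batch size at which a message costs no more per word than a direct access."""
    words = 1
    while cycles_per_word_message(words, overhead_cycles, per_word_cycles) > direct_cycles:
        words += 1
    return words


if __name__ == "__main__":
    print(cycles_per_word_message(1))    # 1001.0: one word pays the whole overhead
    print(cycles_per_word_message(100))  # 11.0: overhead spread over 100 words
    print(break_even_words())            # 112
```

With these assumed numbers a sender has to accumulate on the order of a hundred words before the fixed overhead stops dominating, which is why fine-grained communication is impractical on such machines.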
974 00:41:15,390 --> 00:41:18,860 So it makes it very easy to take your existing application 975 00:41:18,860 --> 00:41:21,200 and first get it working, because it's 976 00:41:21,200 --> 00:41:22,880 just working there. 977 00:41:22,880 --> 00:41:25,700 You can choose to optimize only critical sections. 978 00:41:25,700 --> 00:41:27,210 You can say -- "OK, this section I 979 00:41:27,210 --> 00:41:28,290 know it's very important. 980 00:41:28,290 --> 00:41:30,380 I will do the right thing, I will place 981 00:41:30,380 --> 00:41:33,320 everything properly." And the rest of it I can just leave alone, and 982 00:41:33,320 --> 00:41:35,730 it will go and get the data and do it right. 983 00:41:35,730 --> 00:41:38,020 You can run sequentially, of course, but at least the 984 00:41:38,020 --> 00:41:39,390 memory part I don't have to deal with. 985 00:41:39,390 --> 00:41:43,090 If some other memory just once in a while accesses that data 986 00:41:43,090 --> 00:41:44,940 that you have actually parallelized, it 987 00:41:44,940 --> 00:41:46,010 will actually work. 988 00:41:46,010 --> 00:41:47,690 So you only have to worry about the [UNINTELLIGIBLE] 989 00:41:47,690 --> 00:41:48,940 that you are parallelizing. 990 00:41:48,940 --> 00:41:51,130 991 00:41:51,130 --> 00:41:54,470 And you can communicate using load/store instructions. 992 00:41:54,470 --> 00:41:56,710 You don't have to get the OS involved in order to do that. 993 00:41:56,710 --> 00:41:57,970 And it's a lot lower overhead. 994 00:41:57,970 --> 00:42:02,000 So 5 to 10 cycles, instead of hundreds to thousands of cycles 995 00:42:02,000 --> 00:42:03,030 to do that. 996 00:42:03,030 --> 00:42:05,840 And most of these machines actually support some 997 00:42:05,840 --> 00:42:08,230 instructions to do this communication very fast. 998 00:42:08,230 --> 00:42:10,430 There's a thing called fetch&op, and a thing called 999 00:42:10,430 --> 00:42:12,580 load linked/store conditional operations.
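To give a feel for what load linked/store conditional and fetch&op mean, here is a single-threaded Python simulation of their semantics. It is a sketch only: real machines implement these as hardware instructions, and the `Word` class, its version counter, and the `cpu` identifiers are all hypothetical devices used to model "has anyone written this word since I loaded it?".

```python
class Word:
    """One memory word with a simulated load-linked / store-conditional pair."""

    def __init__(self, value=0):
        self.value = value
        self._version = 0   # bumped on every successful store
        self._links = {}    # cpu id -> version observed at load_linked

    def load_linked(self, cpu):
        """Read the word and remember which version this cpu saw."""
        self._links[cpu] = self._version
        return self.value

    def store_conditional(self, cpu, new_value):
        """Store only if nobody has written since this cpu's load_linked."""
        if self._links.pop(cpu, None) != self._version:
            return False    # someone else got in first; caller must retry
        self.value = new_value
        self._version += 1
        return True


def fetch_and_add(word, cpu, delta):
    """A fetch&op built from LL/SC: retry until the update lands atomically."""
    while True:
        old = word.load_linked(cpu)
        if word.store_conditional(cpu, old + delta):
            return old


if __name__ == "__main__":
    counter = Word(0)
    seen = counter.load_linked(cpu=0)                   # cpu 0 starts an update
    fetch_and_add(counter, cpu=1, delta=5)              # cpu 1 sneaks in first
    print(counter.store_conditional(0, seen + 1))       # False: cpu 0 lost the race
    print(fetch_and_add(counter, cpu=0, delta=1))       # 5: retry succeeds
    print(counter.value)                                # 6
```

The point of the retry loop is that no lock is held while waiting: when there is no contention the update completes in one pass, which is what makes these operations cheap in the common case.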
1000 00:42:12,580 --> 00:42:16,125 There are these very special operations where if you are 1001 00:42:16,125 --> 00:42:19,760 waiting for somebody else, you can do it very fast. So if two 1002 00:42:19,760 --> 00:42:21,430 people are communicating. 1003 00:42:21,430 --> 00:42:24,550 So people came up with these very fast operations that are 1004 00:42:24,550 --> 00:42:26,320 low cost, as long as -- 1005 00:42:26,320 --> 00:42:28,230 if the data's available it will happen very fast. 1006 00:42:28,230 --> 00:42:29,480 Synchronization. 1007 00:42:29,480 --> 00:42:31,260 1008 00:42:31,260 --> 00:42:34,820 And when you are starting to build a large system, you can 1009 00:42:34,820 --> 00:42:37,820 actually give a logically shared view of memory, but the 1010 00:42:37,820 --> 00:42:41,120 underlying hardware can still be distributed memory. 1011 00:42:41,120 --> 00:42:42,260 So there's a thing called -- 1012 00:42:42,260 --> 00:42:45,060 I will get into it when we do synchronization -- 1013 00:42:45,060 --> 00:42:46,290 directory-based cache coherence. 1014 00:42:46,290 --> 00:42:48,630 So you give a nice, simple view of memory. 1015 00:42:48,630 --> 00:42:50,250 But of course memory is really distributed. 1016 00:42:50,250 --> 00:42:52,790 So that kind of gives the best of both worlds. 1017 00:42:52,790 --> 00:42:55,150 So you can keep scaling and build large machines, but the 1018 00:42:55,150 --> 00:42:59,450 view is a very simple view of machines. 1019 00:42:59,450 --> 00:43:00,920 So there are two categories in here. 1020 00:43:00,920 --> 00:43:03,660 One is non-cache coherent, and then hardware cache coherent. 1021 00:43:03,660 --> 00:43:08,450 So non-cache coherence kind of gives a view of memory as a 1022 00:43:08,450 --> 00:43:10,260 single address space.
1023 00:43:10,260 --> 00:43:13,020 But you had to deal with the fact that if you write something, to get 1024 00:43:13,020 --> 00:43:14,510 it over to someone else you had to explicitly say -- 1025 00:43:14,510 --> 00:43:17,580 "Now send it to that person." But we're still in a single 1026 00:43:17,580 --> 00:43:19,380 address space. 1027 00:43:19,380 --> 00:43:21,790 It doesn't give the full benefits of a 1028 00:43:21,790 --> 00:43:22,600 shared memory machine. 1029 00:43:22,600 --> 00:43:24,610 It's kind of in between shared and distributed memory. 1030 00:43:24,610 --> 00:43:26,100 In distributed memory basically everybody's in a 1031 00:43:26,100 --> 00:43:27,830 different address space, so you had to map 1032 00:43:27,830 --> 00:43:28,760 by sending a message. 1033 00:43:28,760 --> 00:43:30,550 Here, you just say I have to flush and send it 1034 00:43:30,550 --> 00:43:31,800 to the other guy. 1035 00:43:31,800 --> 00:43:36,360 1036 00:43:36,360 --> 00:43:39,080 Some of the early machines, as well as some big machines, 1037 00:43:39,080 --> 00:43:42,070 had no hardware cache coherence. 1038 00:43:42,070 --> 00:43:44,440 Things like supercomputers were built in this way because 1039 00:43:44,440 --> 00:43:45,980 it's very easy to build. 1040 00:43:45,980 --> 00:43:49,900 And the nice thing here is if you know your applications 1041 00:43:49,900 --> 00:43:54,280 well, if you are running good parallel large applications, 1042 00:43:54,280 --> 00:43:55,980 and you actually know what the communication 1043 00:43:55,980 --> 00:43:57,760 patterns are -- you can actually do it. 1044 00:43:57,760 --> 00:44:00,430 And you don't have to pay the hardware overhead to have this 1045 00:44:00,430 --> 00:44:02,470 nice hardware support in there.
1046 00:44:02,470 --> 00:44:07,230 However, a lot of small scale machines -- for example, most 1047 00:44:07,230 --> 00:44:12,360 people's workstations and stuff, it's probably now two 1048 00:44:12,360 --> 00:44:14,240 Pentium Quad machines -- 1049 00:44:14,240 --> 00:44:15,430 are actually shared memory. 1050 00:44:15,430 --> 00:44:20,430 Because if you are trying to do the starting things it's 1051 00:44:20,430 --> 00:44:21,540 much easier to do shared memory. 1052 00:44:21,540 --> 00:44:24,840 And also it's easier to build small shared memory machines. 1053 00:44:24,840 --> 00:44:32,480 And people talk about using a bus-based machine, and also 1054 00:44:32,480 --> 00:44:33,560 using a large scale 1055 00:44:33,560 --> 00:44:34,818 directory-based machine in here. 1056 00:44:34,818 --> 00:44:38,170 1057 00:44:38,170 --> 00:44:42,540 So for bus-based machines, how do you do shared memory? 1058 00:44:42,540 --> 00:44:46,880 So there's a protocol, what we call a snoopy cache protocol. 1059 00:44:46,880 --> 00:44:51,050 What that means is, every time you modify a location 1060 00:44:51,050 --> 00:44:54,120 somewhere -- so of course you have that in your cache -- 1061 00:44:54,120 --> 00:44:57,070 you tell everybody in the world who's using the bus, "I 1062 00:44:57,070 --> 00:45:03,460 modified that." And then if somebody else also has that 1063 00:45:03,460 --> 00:45:04,470 memory location, 1064 00:45:04,470 --> 00:45:06,390 that person says, "Oops, he modified it." Either he 1065 00:45:06,390 --> 00:45:09,160 invalidates it or gets the modified copy. 1066 00:45:09,160 --> 00:45:12,340 If you are using something new, you have to go and snoop. 1067 00:45:12,340 --> 00:45:15,040 And you can ask everybody and say -- "Wait a minute, does 1068 00:45:15,040 --> 00:45:19,160 anybody have a copy of this?"
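The basic snooping idea just described (announce every write on the bus so other caches drop their stale copy) can be sketched as a toy simulation. This is a deliberately simplified model under stated assumptions: caches are write-through so memory is always current, there are no read-only or exclusive states, and the `Bus` and `Cache` classes are hypothetical names, not any real machine's design.

```python
class Bus:
    """Shared bus: every write is broadcast to all attached caches."""

    def __init__(self, memory):
        self.memory = memory    # dict: address -> value
        self.caches = []

    def attach(self, cache):
        self.caches.append(cache)

    def broadcast_write(self, writer, addr):
        # Everyone except the writer snoops the bus and reacts.
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_invalidate(addr)


class Cache:
    """Private cache that invalidates a line when it snoops another writer."""

    def __init__(self, name, bus):
        self.name = name
        self.bus = bus
        self.lines = {}         # dict: address -> cached value
        bus.attach(self)

    def read(self, addr):
        if addr not in self.lines:              # miss: fetch from memory
            self.lines[addr] = self.bus.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.bus.broadcast_write(self, addr)    # tell everyone "I modified that"
        self.lines[addr] = value
        self.bus.memory[addr] = value           # write-through (simplifying assumption)

    def snoop_invalidate(self, addr):
        self.lines.pop(addr, None)              # "Oops, he modified it"


if __name__ == "__main__":
    bus = Bus({0x10: 1})
    a, b = Cache("A", bus), Cache("B", bus)
    print(a.read(0x10), b.read(0x10))   # 1 1: both caches hold a copy
    a.write(0x10, 2)
    print(0x10 in b.lines)              # False: B's copy was invalidated
    print(b.read(0x10))                 # 2: B re-fetches the new value
```

Because every write crosses the single bus, this scheme only works for a small number of processors, which is the scalability limit the lecture returns to later.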
And some more complicated 1069 00:45:19,160 --> 00:45:22,680 protocols have states saying -- "I don't have any, I have a copy 1070 00:45:22,680 --> 00:45:24,540 but it's only read-only. 1071 00:45:24,540 --> 00:45:26,470 So I'm just reading it, I'm not modifying it." Then 1072 00:45:26,470 --> 00:45:28,940 multiple people can have the same copy, because everybody's 1073 00:45:28,940 --> 00:45:29,830 reading and it's OK. 1074 00:45:29,830 --> 00:45:31,840 And then there's the next thing -- "OK, I am actually 1075 00:45:31,840 --> 00:45:33,550 trying to modify this thing." And then only I 1076 00:45:33,550 --> 00:45:35,080 can have the copy. 1077 00:45:35,080 --> 00:45:37,830 So some data you can give to multiple people as a read 1078 00:45:37,830 --> 00:45:40,380 copy, and then when you are trying to write everybody gets 1079 00:45:40,380 --> 00:45:42,140 invalidated, and only the person who has write 1080 00:45:42,140 --> 00:45:43,090 permission has access to it. 1081 00:45:43,090 --> 00:45:45,315 And there are a lot of complicated protocols for how, if 1082 00:45:45,315 --> 00:45:46,870 you write it, and then somebody else wants to write 1083 00:45:46,870 --> 00:45:48,680 it, how do you get it to that person? 1084 00:45:48,680 --> 00:45:50,990 And of course you have to keep it consistent with memory. 1085 00:45:50,990 --> 00:45:53,420 So there is a lot of work in how to get these things all 1086 00:45:53,420 --> 00:45:55,720 working, but that's the kind of basic idea. 1087 00:45:55,720 --> 00:45:59,300 1088 00:45:59,300 --> 00:46:01,730 So directory-based machines are very different. 1089 00:46:01,730 --> 00:46:05,060 In directory-based machines mainly there's a 1090 00:46:05,060 --> 00:46:06,820 notion of a home node. 1091 00:46:06,820 --> 00:46:10,540 So everybody has local space in memory, you keep some part 1092 00:46:10,540 --> 00:46:10,820 of your memory. 1093 00:46:10,820 --> 00:46:12,720 And of course you have a cache also.
1094 00:46:12,720 --> 00:46:16,130 So you have a notion that this memory belongs to you. 1095 00:46:16,130 --> 00:46:18,470 And every time I want to do something with that memory I 1096 00:46:18,470 --> 00:46:19,390 had to ask you. 1097 00:46:19,390 --> 00:46:20,380 I had to get your permission. 1098 00:46:20,380 --> 00:46:22,560 "I want that memory, can you give it to me?" 1099 00:46:22,560 --> 00:46:24,610 And so there are two things. 1100 00:46:24,610 --> 00:46:26,670 That person has a directory [UNINTELLIGIBLE] say -- "OK, 1101 00:46:26,670 --> 00:46:28,150 this memory is in me. 1102 00:46:28,150 --> 00:46:31,480 I am the one who right now owns it, and I have the copy." 1103 00:46:31,480 --> 00:46:32,420 Or it will say -- 1104 00:46:32,420 --> 00:46:36,120 "You want to copy that memory to this other guy to write, 1105 00:46:36,120 --> 00:46:38,380 and here is that person's address or that machine's 1106 00:46:38,380 --> 00:46:41,650 name." Or if multiple people have taken this copy and are 1107 00:46:41,650 --> 00:46:42,730 reading it. 1108 00:46:42,730 --> 00:46:45,240 So when somebody asks me for a copy -- 1109 00:46:45,240 --> 00:46:49,220 assume you ask to read this copy. 1110 00:46:49,220 --> 00:46:52,890 And if I have given it to nobody to read, or if I have 1111 00:46:52,890 --> 00:46:54,410 given it to other people to read, so I say -- 1112 00:46:54,410 --> 00:46:55,330 "OK, here's a copy. 1113 00:46:55,330 --> 00:46:58,610 Go read." And I add that person is reading that, and I 1114 00:46:58,610 --> 00:47:00,190 keep that in my directory. 1115 00:47:00,190 --> 00:47:01,910 Or if somebody's writing that. 1116 00:47:01,910 --> 00:47:04,010 I say sure, "I can't give it to read because somebody's 1117 00:47:04,010 --> 00:47:05,750 writing that." So I can do two things. 1118 00:47:05,750 --> 00:47:07,750 I can tell that person, saying -- 1119 00:47:07,750 --> 00:47:11,350 "You have to get it from the person who's writing. 
1120 00:47:11,350 --> 00:47:12,860 So go directly get it from there. 1121 00:47:12,860 --> 00:47:16,190 And I will mark that now you own it as a read value." Or, I 1122 00:47:16,190 --> 00:47:17,630 can tell the person who's writing -- 1123 00:47:17,630 --> 00:47:19,400 "Look, you have to give up your write privilege. 1124 00:47:19,400 --> 00:47:21,990 If you have modified it, give me the data back." And that 1125 00:47:21,990 --> 00:47:23,950 person goes back to the read or no 1126 00:47:23,950 --> 00:47:25,330 privileges with that data. 1127 00:47:25,330 --> 00:47:26,860 When I get that data, I'll send it back to this 1128 00:47:26,860 --> 00:47:27,240 person and say -- 1129 00:47:27,240 --> 00:47:29,600 "Here, you can read." And the same thing if you ask for 1130 00:47:29,600 --> 00:47:30,690 write permission. 1131 00:47:30,690 --> 00:47:33,090 If anybody has [UNINTELLIGIBLE] 1132 00:47:33,090 --> 00:47:34,010 I have to tell everybody -- 1133 00:47:34,010 --> 00:47:35,250 "Now you can't read it anymore. 1134 00:47:35,250 --> 00:47:37,760 Go invalidate, because somebody's about to write." 1135 00:47:37,760 --> 00:47:39,825 Get the invalidate request coming back, and then when 1136 00:47:39,825 --> 00:47:42,250 you've done that I say, "OK, you can write that." So 1137 00:47:42,250 --> 00:47:45,000 everybody keeps part of the memory, and then 1138 00:47:45,000 --> 00:47:45,720 all of that in there. 1139 00:47:45,720 --> 00:47:48,762 So because of that you can really scale this thing. 1140 00:47:48,762 --> 00:47:52,860 1141 00:47:52,860 --> 00:47:54,700 So if you look at a bus-based machine. 1142 00:47:54,700 --> 00:47:55,930 This is the kind of way it looks like. 1143 00:47:55,930 --> 00:47:59,410 You have a cache in here, microprocessor, central 1144 00:47:59,410 --> 00:48:01,120 memory, and you have a bus in here. 
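The home-node bookkeeping described above for directory-based machines can be sketched as follows. This is a minimal model, not any real protocol: it takes the second of the two options mentioned (the home node recalls the data from the current writer, who is left with no privileges), it handles one home node only, and the class names, the `caches` dictionary, and the message log are all hypothetical.

```python
class DirectoryEntry:
    """What the home node records for one memory block."""

    def __init__(self, value=0):
        self.value = value      # home node's copy of the data
        self.sharers = set()    # nodes holding a read-only copy
        self.owner = None       # node holding the writable copy, if any


class HomeNode:
    """Home node: every read or write request for its memory goes through here."""

    def __init__(self):
        self.directory = {}     # address -> DirectoryEntry
        self.log = []           # protocol messages sent, for inspection

    def _entry(self, addr):
        return self.directory.setdefault(addr, DirectoryEntry())

    def _recall(self, entry, addr, caches):
        """Make the current writer give up its copy and return the data."""
        if entry.owner is not None:
            self.log.append(("recall", entry.owner, addr))
            entry.value = caches[entry.owner].pop(addr)
            entry.owner = None

    def read(self, node, addr, caches):
        entry = self._entry(addr)
        if entry.owner == node:             # requester already holds it
            return caches[node][addr]
        self._recall(entry, addr, caches)
        entry.sharers.add(node)             # remember who is reading
        caches[node][addr] = entry.value
        return entry.value

    def write(self, node, addr, value, caches):
        entry = self._entry(addr)
        if entry.owner != node:
            self._recall(entry, addr, caches)
            for sharer in sorted(entry.sharers - {node}):
                self.log.append(("invalidate", sharer, addr))
                caches[sharer].pop(addr, None)
            entry.sharers.clear()
            entry.owner = node              # only the writer may hold it now
        caches[node][addr] = value


if __name__ == "__main__":
    home = HomeNode()
    caches = {"A": {}, "B": {}, "C": {}}    # each node's private cache
    home.read("A", 0, caches)
    home.read("B", 0, caches)
    home.write("C", 0, 42, caches)          # A and B get invalidated
    print(home.log)
    print(home.read("A", 0, caches))        # 42, after recalling it from C
```

Unlike the bus-based scheme, each request here involves only the home node and the nodes named in its directory entry, so no message has to be broadcast to the whole machine.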
1145 00:48:01,120 --> 00:48:04,560 And a lot of small machines, including most people's 1146 00:48:04,560 --> 00:48:06,770 desktops, basically fit in this category. 1147 00:48:06,770 --> 00:48:09,040 And you have a snoopy bus in here. 1148 00:48:09,040 --> 00:48:10,200 So a little bit of a bigger machine, 1149 00:48:10,200 --> 00:48:12,730 something like a Sun Starfire. 1150 00:48:12,730 --> 00:48:17,230 Basically it had four processors in the board, four 1151 00:48:17,230 --> 00:48:20,250 caches, and had an interconnect that actually has 1152 00:48:20,250 --> 00:48:21,560 multiple buses going. 1153 00:48:21,560 --> 00:48:23,450 So it can actually get a little bit of scalability, 1154 00:48:23,450 --> 00:48:24,290 because here's the bottleneck. 1155 00:48:24,290 --> 00:48:25,780 The bus becomes the bottleneck. 1156 00:48:25,780 --> 00:48:27,400 Everybody has to go through the bus. 1157 00:48:27,400 --> 00:48:29,570 And so you actually get multiple buses to get 1158 00:48:29,570 --> 00:48:32,810 bottleneck, and it actually had some distributed memory 1159 00:48:32,810 --> 00:48:35,160 going through a crossbar here. 1160 00:48:35,160 --> 00:48:36,583 So this cache coherent protocol has 1161 00:48:36,583 --> 00:48:38,400 to deal with that. 1162 00:48:38,400 --> 00:48:41,100 And going to the other extreme, 1163 00:48:41,100 --> 00:48:43,310 something like SGI Origin. 1164 00:48:43,310 --> 00:48:46,930 1165 00:48:46,930 --> 00:48:50,170 In this machine there are two processors, and it had 1166 00:48:50,170 --> 00:48:52,090 actually a little bit of processors and a lot of memory 1167 00:48:52,090 --> 00:48:52,830 dealing with the directory. 1168 00:48:52,830 --> 00:48:55,040 So you keep the data, and you actually keep all the 1169 00:48:55,040 --> 00:48:56,550 directory information in there -- 1170 00:48:56,550 --> 00:48:57,070 in this. 
1171 00:48:57,070 --> 00:48:58,850 And then it goes -- 1172 00:48:58,850 --> 00:49:02,740 then after that it almost uses a normal message passing type 1173 00:49:02,740 --> 00:49:05,420 network to communicate with that. 1174 00:49:05,420 --> 00:49:07,520 And they use the crane to connect networks, so we can 1175 00:49:07,520 --> 00:49:09,660 have a very large machine built out of that. 1176 00:49:09,660 --> 00:49:12,720 1177 00:49:12,720 --> 00:49:14,450 So now let's switch to multicore processors. 1178 00:49:14,450 --> 00:49:18,200 1179 00:49:18,200 --> 00:49:21,930 If you look at the way we have been dealing with VLSI, every 1180 00:49:21,930 --> 00:49:24,920 generation we are getting more and more transistors. 1181 00:49:24,920 --> 00:49:27,470 So at the beginning when you have enough transistors to 1182 00:49:27,470 --> 00:49:29,860 deal with, people actually start dealing with bit-level 1183 00:49:29,860 --> 00:49:30,960 parallelism. 1184 00:49:30,960 --> 00:49:35,270 So you didn't have -- you can do 16-bit, 32-bit machines. 1185 00:49:35,270 --> 00:49:36,990 You can do wider machines, because you have enough 1186 00:49:36,990 --> 00:49:37,850 transistors. 1187 00:49:37,850 --> 00:49:39,610 Because at the beginning you have like 8-bit processors, 1188 00:49:39,610 --> 00:49:41,110 16-bit, 32-bit. 1189 00:49:41,110 --> 00:49:43,790 And then at some point that I have still more transistors, I 1190 00:49:43,790 --> 00:49:47,660 start doing instruction-level parallelism in a die. 1191 00:49:47,660 --> 00:49:50,080 So even in a bit-level parallelism, in order to get 1192 00:49:50,080 --> 00:49:53,830 64-bit you had to actually have multiple chips. 1193 00:49:53,830 --> 00:49:57,135 So in this regime in order to get parallelism, you need to 1194 00:49:57,135 --> 00:49:58,150 have multiple processors -- 1195 00:49:58,150 --> 00:49:59,370 multiprocessors. 
1196 00:49:59,370 --> 00:50:02,860 So in the good old days you actually built a processor, 1197 00:50:02,860 --> 00:50:03,950 things like a minicomputer. 1198 00:50:03,950 --> 00:50:06,620 Basically you had one processor dealing 1199 00:50:06,620 --> 00:50:07,380 with a 1-bit slice. 1200 00:50:07,380 --> 00:50:10,700 So in the 4-bit slice, dealing with that amount, you could 1201 00:50:10,700 --> 00:50:12,230 fit in a chip. 1202 00:50:12,230 --> 00:50:14,550 And a multichip made a single processor. 1203 00:50:14,550 --> 00:50:17,870 Here a multichip made a multiprocessor. 1204 00:50:17,870 --> 00:50:20,510 We are hitting a regime where a multichip -- 1205 00:50:20,510 --> 00:50:22,870 what [? it ?] will be multiprocessor -- now fits in 1206 00:50:22,870 --> 00:50:26,030 one piece of silicon, because you have more transistors. 1207 00:50:26,030 --> 00:50:29,560 So we are going into a time where multicore is basically 1208 00:50:29,560 --> 00:50:31,630 multiple processors on a die -- 1209 00:50:31,630 --> 00:50:33,790 on a chip. 1210 00:50:33,790 --> 00:50:35,140 So I showed this slide. 1211 00:50:35,140 --> 00:50:39,650 We are getting there, and it's getting pretty fast. You had 1212 00:50:39,650 --> 00:50:41,450 something like this, and suddenly we accelerated. 1213 00:50:41,450 --> 00:50:46,530 We added more and more cores on a die. 1214 00:50:46,530 --> 00:50:50,000 So I categorized multicores also the way I categorized 1215 00:50:50,000 --> 00:50:51,020 them previously. 1216 00:50:51,020 --> 00:50:54,850 There are shared memory multicores. 1217 00:50:54,850 --> 00:50:56,180 Here are some examples. 1218 00:50:56,180 --> 00:50:59,100 Then there are shared network multicores. 1219 00:50:59,100 --> 00:51:01,930 The Cell processor is one, and at MIT we are 1220 00:51:01,930 --> 00:51:04,440 also building the Raw processor. 1221 00:51:04,440 --> 00:51:07,700 And there is another part, what they call crippled or 1222 00:51:07,700 --> 00:51:08,550 mini-cores.
1223 00:51:08,550 --> 00:51:15,000 So the reason in this graph you can have 512 is because 1224 00:51:15,000 --> 00:51:17,130 it's not Pentium-sized things sitting in there. 1225 00:51:17,130 --> 00:51:20,940 You are putting very simple small cores, and a 1226 00:51:20,940 --> 00:51:21,940 huge amount of them. 1227 00:51:21,940 --> 00:51:24,890 So for some class of applications, that's also useful. 1228 00:51:24,890 --> 00:51:29,120 So if you look at shared memory multicores, basically 1229 00:51:29,120 --> 00:51:32,730 this is an evolution path for current processors. 1230 00:51:32,730 --> 00:51:35,890 So if you look at it, what they did was they took their 1231 00:51:35,890 --> 00:51:38,160 years' worth and billions of dollars' worth of 1232 00:51:38,160 --> 00:51:42,880 engineering building a single superscalar processor. 1233 00:51:42,880 --> 00:51:45,456 Then they slapped a few of them on the same die, and said 1234 00:51:45,456 --> 00:51:48,390 -- "Hey, we've got a multicore." And of course they 1235 00:51:48,390 --> 00:51:54,450 were always doing shared memory at the network level. 1236 00:51:54,450 --> 00:51:56,220 They said -- "OK, I'll put the shared memory bus also into 1237 00:51:56,220 --> 00:51:58,340 the same die, and I got a multicore." So this is 1238 00:51:58,340 --> 00:52:00,440 basically what all these things are all about. 1239 00:52:00,440 --> 00:52:03,170 So this is kind of gluing these things together, it's a 1240 00:52:03,170 --> 00:52:04,240 first generation. 1241 00:52:04,240 --> 00:52:07,740 However, you didn't build a core completely from scratch. 1242 00:52:07,740 --> 00:52:11,330 You just kind of integrated what we had in multiple chips 1243 00:52:11,330 --> 00:52:15,880 into one chip, and basically got that. 1244 00:52:15,880 --> 00:52:19,640 So to go a little bit beyond, I think you can do better. 1245 00:52:19,640 --> 00:52:24,260 So for example, this AMD multicore.
1246 00:52:24,260 --> 00:52:31,240 Basically you have CPUs in there, actually have a full 1247 00:52:31,240 --> 00:52:34,400 snoopy controller in there, and can have some other 1248 00:52:34,400 --> 00:52:35,280 interface with that. 1249 00:52:35,280 --> 00:52:38,900 So you can actually start building more and more uni 1250 00:52:38,900 --> 00:52:41,440 CPU, thinking that you're building a multicore. 1251 00:52:41,440 --> 00:52:43,745 Instead of saying, "I had this thing on my shelf, I'm going 1252 00:52:43,745 --> 00:52:45,480 to plop it here, and then kind of [INAUDIBLE]" 1253 00:52:45,480 --> 00:52:46,950 And you'll see, I think, a lot of 1254 00:52:46,950 --> 00:52:48,100 interesting things happening. 1255 00:52:48,100 --> 00:52:52,310 Because now as they're connected closely in the same 1256 00:52:52,310 --> 00:52:56,170 die, you can do more things than what you could do in a 1257 00:52:56,170 --> 00:52:57,000 multiprocessor. 1258 00:52:57,000 --> 00:52:59,300 So in the last lecture we talked a little bit about what 1259 00:52:59,300 --> 00:53:01,530 the future could be in this kind of regime. 1260 00:53:01,530 --> 00:53:10,040 1261 00:53:10,040 --> 00:53:11,290 Come on. 1262 00:53:11,290 --> 00:53:13,930 1263 00:53:13,930 --> 00:53:14,500 OK. 1264 00:53:14,500 --> 00:53:18,560 So one thing we have been doing at MIT for -- now this 1265 00:53:18,560 --> 00:53:23,190 project has ended, we started about eight years ago -- is to 1266 00:53:23,190 --> 00:53:28,050 figure out when you have all the silicon, how can you build 1267 00:53:28,050 --> 00:53:30,460 a multicore if you were to start from scratch. 1268 00:53:30,460 --> 00:53:33,120 So we built this Raw processor where each -- 1269 00:53:33,120 --> 00:53:37,100 we have 16, these small cores, identical ones in here. 1270 00:53:37,100 --> 00:53:40,260 And the interesting thing is what we said was, we have all 1271 00:53:40,260 --> 00:53:41,500 this bandwidth.
1272 00:53:41,500 --> 00:53:44,060 It's not just going from pins to memory, we have all this 1273 00:53:44,060 --> 00:53:45,580 bandwidth sitting next to each other. 1274 00:53:45,580 --> 00:53:48,990 So can we really take advantage of that to do a lot 1275 00:53:48,990 --> 00:53:50,240 of communication? 1276 00:53:50,240 --> 00:53:52,300 And also the other thing is that to build something like a 1277 00:53:52,300 --> 00:53:54,850 bus, you need a lot of long wires. 1278 00:53:54,850 --> 00:53:56,940 And it's really hard to build long wires. 1279 00:53:56,940 --> 00:54:00,770 So in Raw processor it's something like each chip, a 1280 00:54:00,770 --> 00:54:05,430 large amount of part, is into this eight 32-bit buses. 1281 00:54:05,430 --> 00:54:06,940 So you have a huge amount of communication 1282 00:54:06,940 --> 00:54:07,950 next to each other. 1283 00:54:07,950 --> 00:54:10,320 And we don't have any kind of global memory because that 1284 00:54:10,320 --> 00:54:12,400 requires, right now, either do a directory, which you didn't 1285 00:54:12,400 --> 00:54:15,750 want to build, or have a bus, which will require long wires. 1286 00:54:15,750 --> 00:54:19,570 So we did in a way that all wires -- no wires longer than 1287 00:54:19,570 --> 00:54:22,830 one of the cores. 1288 00:54:22,830 --> 00:54:25,980 So we can do short wires, but we came up with a lot of 1289 00:54:25,980 --> 00:54:29,380 communications for each of these, what we called tile 1290 00:54:29,380 --> 00:54:32,170 those days, are very tightly coupled. 1291 00:54:32,170 --> 00:54:35,730 So this is kind of a direction where people perhaps might go, 1292 00:54:35,730 --> 00:54:39,580 because now we have all this bandwidth in here. 1293 00:54:39,580 --> 00:54:41,260 And how would you take advantage of that bandwidth? 1294 00:54:41,260 --> 00:54:43,720 So this is a different way of looking at that. 1295 00:54:43,720 --> 00:54:47,970 And in some sense the Cell fits somewhere in this regime. 
1296 00:54:47,970 --> 00:54:51,070 Because what Cell did was instead of -- it says, "I'm 1297 00:54:51,070 --> 00:54:52,300 not building a bus, I am actually 1298 00:54:52,300 --> 00:54:53,750 building a ring network. 1299 00:54:53,750 --> 00:54:57,000 I'm keeping distributed memory, and provide to Cell a 1300 00:54:57,000 --> 00:54:58,910 ring." I'm not going to go through Cell, because actually 1301 00:54:58,910 --> 00:55:03,457 you had a full lecture the day before yesterday on this. 1302 00:55:03,457 --> 00:55:04,888 AUDIENCE: Saman, can I ask you a question? 1303 00:55:04,888 --> 00:55:07,325 Is there a conclusion that I should be reaching in that I 1304 00:55:07,325 --> 00:55:09,405 look at the multicores you can buy today are still by and 1305 00:55:09,405 --> 00:55:11,085 large two and four processors. 1306 00:55:11,085 --> 00:55:12,280 There are people that have done more. 1307 00:55:12,280 --> 00:55:15,480 The Verano has 16 and the Dell has 8. 1308 00:55:15,480 --> 00:55:19,530 And the conclusion that I want to reach is that as an 1309 00:55:19,530 --> 00:55:21,635 engineering tradeoff, if you throw away the shared memory 1310 00:55:21,635 --> 00:55:23,070 you can add processors. 1311 00:55:23,070 --> 00:55:24,120 Is that a straightforward tradeoff? 1312 00:55:24,120 --> 00:55:26,140 PROFESSOR: I don't think it's a shared memory. 1313 00:55:26,140 --> 00:55:29,600 You can still have things like directory-based 1314 00:55:29,600 --> 00:55:32,200 cache coherent things. 1315 00:55:32,200 --> 00:55:34,940 What's missing right now is what people have done is just 1316 00:55:34,940 --> 00:55:37,570 basically took parts in their shelves, and kind of put it 1317 00:55:37,570 --> 00:55:39,230 into the chip. 1318 00:55:39,230 --> 00:55:43,830 If you look at it, if you put two chips next to each other 1319 00:55:43,830 --> 00:55:46,370 on a board, there's a certain amount of communication 1320 00:55:46,370 --> 00:55:48,020 bandwidth going here. 
1321 00:55:48,020 --> 00:55:51,640 And if you put those things into the same die, there's 1322 00:55:51,640 --> 00:55:55,430 about five orders of magnitude possibility to communicate. 1323 00:55:55,430 --> 00:55:58,080 We haven't figured out how to take advantage of that. 1324 00:55:58,080 --> 00:56:00,770 In some sense, we can almost say I want to copy the entire 1325 00:56:00,770 --> 00:56:04,180 cache from this machine to another machine in a cycle. 1326 00:56:04,180 --> 00:56:06,440 I don't think you even would want to do that, but you can 1327 00:56:06,440 --> 00:56:09,280 have that level of huge amount of communication. 1328 00:56:09,280 --> 00:56:11,530 We are still kind of doing this evolutionary path in 1329 00:56:11,530 --> 00:56:15,600 there [UNINTELLIGIBLE] but I don't think we know what cool 1330 00:56:15,600 --> 00:56:16,660 things we can do with that. 1331 00:56:16,660 --> 00:56:19,050 There's a lot of opportunity in that in some sense. 1332 00:56:19,050 --> 00:56:20,760 AUDIENCE: [INAUDIBLE] 1333 00:56:20,760 --> 00:56:23,240 PROFESSOR: Yeah, because the interesting thing is -- 1334 00:56:23,240 --> 00:56:26,920 the way I would say it is, in the good old days 1335 00:56:26,920 --> 00:56:29,190 parallelization sometimes was a scary prospect. 1336 00:56:29,190 --> 00:56:31,510 Because the minute you distribute data, if you don't 1337 00:56:31,510 --> 00:56:35,610 do it right it's a lot slower than sequential execution. 1338 00:56:35,610 --> 00:56:39,100 Because your access time becomes so large, and you're 1339 00:56:39,100 --> 00:56:40,540 basically dead in the water. 1340 00:56:40,540 --> 00:56:42,610 In this kind of machine you don't have to. 1341 00:56:42,610 --> 00:56:44,950 There's so much bandwidth in here. 1342 00:56:44,950 --> 00:56:47,130 Latency was still -- latency would be better than going to 1343 00:56:47,130 --> 00:56:49,800 the outside memory.
1344 00:56:49,800 --> 00:56:51,610 And we don't know how to take advantage of 1345 00:56:51,610 --> 00:56:53,040 that bandwidth yet. 1346 00:56:53,040 --> 00:56:57,310 And my feeling is as we go about trying to rebuild from 1347 00:56:57,310 --> 00:57:02,440 scratch multicore processors, we'll try to figure out 1348 00:57:02,440 --> 00:57:03,060 different ways. 1349 00:57:03,060 --> 00:57:10,510 So for example, people are coming up with much more rich 1350 00:57:10,510 --> 00:57:14,860 semantics for speculation and stuff like that, and we can 1351 00:57:14,860 --> 00:57:16,580 take advantage of that. 1352 00:57:16,580 --> 00:57:20,980 So I think there's a lot of interesting hardware, 1353 00:57:20,980 --> 00:57:24,910 microprocessor, and then kind of programming research now. 1354 00:57:24,910 --> 00:57:27,770 Because I don't think anybody had anything in there saying, 1355 00:57:27,770 --> 00:57:30,130 "Here's how we would take it down to this bandwidth." I 1356 00:57:30,130 --> 00:57:31,810 think that'll happen. 1357 00:57:31,810 --> 00:57:35,480 Now the next [? thing ?] is these mini-cores. 1358 00:57:35,480 --> 00:57:38,070 So for example, this PicoChip has an array of 1359 00:57:38,070 --> 00:57:39,720 322 processing elements. 1360 00:57:39,720 --> 00:57:43,010 They have 16-bit RISC, so it's not even a 32-bit. 1361 00:57:43,010 --> 00:57:44,950 Piddling little things, 3-way issue in. 1362 00:57:44,950 --> 00:57:48,980 And they had like 240 standard -- 1363 00:57:48,980 --> 00:57:50,370 basically, nothing more than just a 1364 00:57:50,370 --> 00:57:52,850 multiplier, and add in there. 1365 00:57:52,850 --> 00:57:56,880 64 memory tiles, full control, and 14 some special 1366 00:57:56,880 --> 00:57:58,480 [UNINTELLIGIBLE] function accelerator. 1367 00:57:58,480 --> 00:58:03,240 So this is kind of what people call heterogeneous systems.
1368 00:58:03,240 --> 00:58:05,505 Where what this is -- you have all these cores, why do you 1369 00:58:05,505 --> 00:58:07,160 make everything the same? 1370 00:58:07,160 --> 00:58:09,450 I can make something that's good doing graphics, something 1371 00:58:09,450 --> 00:58:11,110 that's good doing networking. 1372 00:58:11,110 --> 00:58:13,540 So I can kind of customize in these things. 1373 00:58:13,540 --> 00:58:15,350 Because what we have in excess is silicon. 1374 00:58:15,350 --> 00:58:17,080 We don't have power in excess. 1375 00:58:17,080 --> 00:58:21,250 So in the future you can't assume everything is working 1376 00:58:21,250 --> 00:58:22,600 all the time, because that will still 1377 00:58:22,600 --> 00:58:24,310 create too much heat. 1378 00:58:24,310 --> 00:58:27,710 So you kind of say -- the best efficiencies, for each type of 1379 00:58:27,710 --> 00:58:30,170 computation you have some few special purpose units. 1380 00:58:30,170 --> 00:58:34,680 So we kind of say if I'm doing graphics, I fit to my graphics 1381 00:58:34,680 --> 00:58:35,500 optimized code. 1382 00:58:35,500 --> 00:58:36,190 So I will do that. 1383 00:58:36,190 --> 00:58:38,570 And the minute I want to do a little bit of arithmetic I'll 1384 00:58:38,570 --> 00:58:39,620 switch to that. 1385 00:58:39,620 --> 00:58:43,190 And sometimes I am doing TCP, I'll switch to my TCP offload. 1386 00:58:43,190 --> 00:58:43,770 Stuff like that. 1387 00:58:43,770 --> 00:58:46,040 Can you do some kind of mixed in there? 1388 00:58:46,040 --> 00:58:48,880 The problem there is you need to understand what the mix is. 1389 00:58:48,880 --> 00:58:50,600 So we need to have a good understanding of 1390 00:58:50,600 --> 00:58:51,880 what that mix is. 1391 00:58:51,880 --> 00:58:54,360 The advantage is it will be a lot more memory efficient. 1392 00:58:54,360 --> 00:58:56,930 So this is kind of going in that direction. 
1393 00:58:56,930 --> 00:59:00,550 And so in some sense, if you want to communicate you have 1394 00:59:00,550 --> 00:59:03,280 these special communication elements. 1395 00:59:03,280 --> 00:59:04,280 You have to go through that. 1396 00:59:04,280 --> 00:59:06,540 And the processor can do some work, and there are some 1397 00:59:06,540 --> 00:59:07,340 memory elements. 1398 00:59:07,340 --> 00:59:08,630 So on and so forth. 1399 00:59:08,630 --> 00:59:11,950 So that's one push, people are pushing more for embedded very 1400 00:59:11,950 --> 00:59:13,120 low power in. 1401 00:59:13,120 --> 00:59:15,770 AUDIENCE: Is this starting to look more and more like FPGA, 1402 00:59:15,770 --> 00:59:16,830 which is [UNINTELLIGIBLE] 1403 00:59:16,830 --> 00:59:20,660 PROFESSOR: Yeah, it's a kind of a combination. 1404 00:59:20,660 --> 00:59:25,300 Because the thing about FPGA is, it's just done 1-bit lot. 1405 00:59:25,300 --> 00:59:27,950 That doesn't make sense to do any arithmetic. 1406 00:59:27,950 --> 00:59:30,550 So this is saying -- "OK, instead of 1 bit I am doing 16 1407 00:59:30,550 --> 00:59:34,660 bits." Because then I can very efficiently build 1408 00:59:34,660 --> 00:59:35,760 [UNINTELLIGIBLE] 1409 00:59:35,760 --> 00:59:36,960 Because I don't have to build [UNINTELLIGIBLE] 1410 00:59:36,960 --> 00:59:38,890 out of scratch. 1411 00:59:38,890 --> 00:59:42,140 So I think that an interesting convergence is happening. 1412 00:59:42,140 --> 00:59:45,930 Because what happened, I think, for a long time was 1413 00:59:45,930 --> 00:59:47,860 things like architecture and programming languages, and 1414 00:59:47,860 --> 00:59:50,220 stuff like that, kind of got stuck in a rut. 1415 00:59:50,220 --> 00:59:52,320 Because things there are so very efficiently and 1416 00:59:52,320 --> 00:59:56,270 incremental -- it's like doing research in airplanes. 1417 00:59:56,270 --> 00:59:58,760 Things are so efficient, so complex.
1418 00:59:58,760 --> 01:00:05,020 Here AeroAstro can't build an airplane, because it's a $9 1419 01:00:05,020 --> 01:00:10,000 billion job to build a good airplane in there. 1420 01:00:10,000 --> 01:00:11,380 And it became like that. 1421 01:00:11,380 --> 01:00:13,350 Universities could not build it because if you want to 1422 01:00:13,350 --> 01:00:16,610 build a superscalar it's, again, a $9 billion type 1423 01:00:16,610 --> 01:00:19,130 endeavor to do that -- thousands of people, was very, 1424 01:00:19,130 --> 01:00:20,020 very customized. 1425 01:00:20,020 --> 01:00:22,670 But now it's kind of hitting the end of the road. 1426 01:00:22,670 --> 01:00:24,562 Everybody's going back and saying -- "Jeez, what's the 1427 01:00:24,562 --> 01:00:26,090 new thing?" And I think there's a lot of opportunity 1428 01:00:26,090 --> 01:00:29,270 to kind of figure out is there some radically different thing 1429 01:00:29,270 --> 01:00:30,340 you can do. 1430 01:00:30,340 --> 01:00:33,640 So this is what I have for my first lecture. 1431 01:00:33,640 --> 01:00:35,130 Some conclusions basically. 1432 01:00:35,130 --> 01:00:38,530 I think for a lot of people who are programmers, there was 1433 01:00:38,530 --> 01:00:42,210 a time that you never cared about what's under the hood. 1434 01:00:42,210 --> 01:00:44,200 You knew it was going to go fast, and in a 1435 01:00:44,200 --> 01:00:45,290 year it will go faster. 1436 01:00:45,290 --> 01:00:47,420 I think that's kind of coming to an end. 1437 01:00:47,420 --> 01:00:49,480 And there's a lot of variations/choices in 1438 01:00:49,480 --> 01:00:51,900 hardware, and I think software people should understand and 1439 01:00:51,900 --> 01:00:54,970 know what they can choose in here. 1440 01:00:54,970 --> 01:00:57,630 And many have performance implications. 1441 01:00:57,630 --> 01:01:01,710 And if you know these things you will be able to get high 1442 01:01:01,710 --> 01:01:03,070 performance of software built easy.
1443 01:01:03,070 --> 01:01:05,570 You can't do high performance software without knowing what 1444 01:01:05,570 --> 01:01:07,190 it's running on. 1445 01:01:07,190 --> 01:01:09,860 However, there's a note of caution. 1446 01:01:09,860 --> 01:01:13,550 If you become too much attached to your hardware, we 1447 01:01:13,550 --> 01:01:16,270 go back to the old days of assembly language programming. 1448 01:01:16,270 --> 01:01:19,910 So you say -- "I got every performance out of a -- now 1449 01:01:19,910 --> 01:01:24,090 the Cell says you have seven SPEs." So in two years, they 1450 01:01:24,090 --> 01:01:25,290 come with 16 SPEs. 1451 01:01:25,290 --> 01:01:26,080 And what's going to happen? 1452 01:01:26,080 --> 01:01:28,920 Your thing is still working on seven SPEs very well, but it 1453 01:01:28,920 --> 01:01:31,020 might not work on 16 SPEs, even with that. 1454 01:01:31,020 --> 01:01:33,700 But of course, you really customize for Cell too. 1455 01:01:33,700 --> 01:01:36,780 And I guarantee it will not run good with the Intel -- 1456 01:01:36,780 --> 01:01:39,670 probably Quad, Xeon processor -- because it will be doing 1457 01:01:39,670 --> 01:01:41,040 something very different. 1458 01:01:41,040 --> 01:01:44,950 And so there's this tension that's coming back again. 1459 01:01:44,950 --> 01:01:48,540 How to do something that is general, portable, malleable, 1460 01:01:48,540 --> 01:01:52,255 and at the same time get good performance with hardware 1461 01:01:52,255 --> 01:01:52,770 being exposed. 1462 01:01:52,770 --> 01:01:54,020 I don't think there's an answer for that. 1463 01:01:54,020 --> 01:01:55,870 And in this class we are going to go to one extreme. 1464 01:01:55,870 --> 01:01:58,710 We are going to go low level and really understand the 1465 01:01:58,710 --> 01:02:01,540 hardware, and take advantage of that. 
1466 01:02:01,540 --> 01:02:04,340 But at some point we have to probably come out of that and 1467 01:02:04,340 --> 01:02:06,420 figure out how to be, again, high level. 1468 01:02:06,420 --> 01:02:09,137 And I think that these are open questions. 1469 01:02:09,137 --> 01:02:10,965 AUDIENCE: Do you have any thoughts, and this may be 1470 01:02:10,965 --> 01:02:15,620 unanswerable, but how could Cell really [INAUDIBLE]. 1471 01:02:15,620 --> 01:02:18,970 And not Cell only, but some of these other ones that are out 1472 01:02:18,970 --> 01:02:22,870 there today, given how hard they are to program. 1473 01:02:22,870 --> 01:02:25,200 PROFESSOR: So I have this talk that I'm 1474 01:02:25,200 --> 01:02:25,860 giving at all the places. 1475 01:02:25,860 --> 01:02:28,320 I said the third software crisis is due 1476 01:02:28,320 --> 01:02:30,340 to multicore menace. 1477 01:02:30,340 --> 01:02:35,090 I termed it a menace, because it will create this thing that 1478 01:02:35,090 --> 01:02:36,000 people will have to change. 1479 01:02:36,000 --> 01:02:38,410 Something has to change, something has to give. 1480 01:02:38,410 --> 01:02:40,300 I don't know who's going to give. 1481 01:02:40,300 --> 01:02:42,560 Either people will say -- "This is too complicated, I am 1482 01:02:42,560 --> 01:02:44,050 happy with the current performance. 1483 01:02:44,050 --> 01:02:46,550 I will live for the next 20 years at today's level of 1484 01:02:46,550 --> 01:02:51,070 performance." I doubt that will happen. 1485 01:02:51,070 --> 01:02:53,290 The other end is saying -- "Jeez, you know I am going to 1486 01:02:53,290 --> 01:02:56,410 learn parallel programming, and I will deal with locks and 1487 01:02:56,410 --> 01:02:58,060 semaphores, and all those things. 1488 01:02:58,060 --> 01:03:00,080 And I am going to jump in there." That's not going to 1489 01:03:00,080 --> 01:03:01,040 happen either. 1490 01:03:01,040 --> 01:03:02,790 So there has to be something in the middle. 
1491 01:03:02,790 --> 01:03:04,380 And the neat thing is, I don't think anybody 1492 01:03:04,380 --> 01:03:07,650 knows what it is. 1493 01:03:07,650 --> 01:03:12,120 Being in industry, it makes them terrified, because they 1494 01:03:12,120 --> 01:03:13,190 have no idea what's happening. 1495 01:03:13,190 --> 01:03:14,360 But in a university, it's a fun time. 1496 01:03:14,360 --> 01:03:17,220 [LAUGHTER] 1497 01:03:17,220 --> 01:03:18,650 AUDIENCE: Good question. 1498 01:03:18,650 --> 01:03:18,890 PROFESSOR: OK. 1499 01:03:18,890 --> 01:03:21,850 So we'll take about a five minutes break, and switch 1500 01:03:21,850 --> 01:03:24,490 gears into concurrent programming.