1 00:00:00,000 --> 00:00:00,030 2 00:00:00,030 --> 00:00:03,155 The following content is provided under a Creative 3 00:00:03,155 --> 00:00:04,000 Commons license. 4 00:00:04,000 --> 00:00:06,920 Your support will help MIT OpenCourseWare continue to 5 00:00:06,920 --> 00:00:08,660 offer high quality, educational 6 00:00:08,660 --> 00:00:10,560 resources for free. 7 00:00:10,560 --> 00:00:13,450 To make a donation or view additional materials from 8 00:00:13,450 --> 00:00:16,610 hundreds of MIT courses visit MIT OpenCourseWare at 9 00:00:16,610 --> 00:00:17,860 ocw.mit.edu. 10 00:00:17,860 --> 00:00:22,010 11 00:00:22,010 --> 00:00:23,260 MICHAEL PERRONE: So my name's Michael Perrone. 12 00:00:23,260 --> 00:00:30,170 13 00:00:30,170 --> 00:00:34,460 I'm at the T.J. Watson Research Center, IBM research. 14 00:00:34,460 --> 00:00:38,330 Doing all kinds of things for research, but most recently-- 15 00:00:38,330 --> 00:00:39,630 that's not what I want. 16 00:00:39,630 --> 00:00:40,660 There we go. 17 00:00:40,660 --> 00:00:43,290 Most recently I've been working with the cell 18 00:00:43,290 --> 00:00:46,590 processor for the past three years or so. 19 00:00:46,590 --> 00:00:47,840 I don't want that. 20 00:00:47,840 --> 00:00:51,170 21 00:00:51,170 --> 00:00:53,380 How's that? 22 00:00:53,380 --> 00:00:56,320 And because I do have to run out for a flight, I have my 23 00:00:56,320 --> 00:00:59,330 e-mail here if you want to ask me questions, 24 00:00:59,330 --> 00:01:02,820 feel free to do that. 25 00:01:02,820 --> 00:01:05,640 What I'm going to do in this presentation is as Saman 26 00:01:05,640 --> 00:01:09,270 suggested, talk in depth about the cell processor, but really 27 00:01:09,270 --> 00:01:11,140 it's still going to be just the very surface because you're 28 00:01:11,140 --> 00:01:12,950 going to have a month to go into a lot more detail.
29 00:01:12,950 --> 00:01:16,300 But I want to give you a sense for why it was created, the 30 00:01:16,300 --> 00:01:19,180 way it was created, what it's capable of doing, and what are 31 00:01:19,180 --> 00:01:22,640 the programming considerations that have to be taken in mind 32 00:01:22,640 --> 00:01:24,120 when you program. 33 00:01:24,120 --> 00:01:30,520 34 00:01:30,520 --> 00:01:33,490 Here's the agenda just for this section, 35 00:01:33,490 --> 00:01:34,800 Mike, of this class. 36 00:01:34,800 --> 00:01:35,990 I'll give you some motivation. 37 00:01:35,990 --> 00:01:37,840 This is going to be a bit of a repeat, so I'll go through it 38 00:01:37,840 --> 00:01:38,940 fairly quickly. 39 00:01:38,940 --> 00:01:43,460 I'll talk about the design concepts, hardware overview, 40 00:01:43,460 --> 00:01:46,070 performance characteristics, application affinity-- 41 00:01:46,070 --> 00:01:49,920 what good is this device? 42 00:01:49,920 --> 00:01:53,290 Talk about the software and this I imagine is one of the 43 00:01:53,290 --> 00:01:55,330 areas where you're going to go into a lot of detail in the 44 00:01:55,330 --> 00:01:59,200 next month because as you suggested, the software really 45 00:01:59,200 --> 00:02:01,470 is the issue and I would actually go a little further 46 00:02:01,470 --> 00:02:05,520 and say, why do people drive such large cars in the U.S.? 47 00:02:05,520 --> 00:02:07,560 Why do they waste so much energy? 48 00:02:07,560 --> 00:02:08,360 The answer is very simple. 49 00:02:08,360 --> 00:02:09,660 It's because it's cheap. 50 00:02:09,660 --> 00:02:12,840 Even at $3 a gallon, it's cheap compared to say, Europe 51 00:02:12,840 --> 00:02:15,450 and other places. 52 00:02:15,450 --> 00:02:17,790 The truth is it's the same thing with programmers. 53 00:02:17,790 --> 00:02:20,480 Why did programmers program the way they did in the past 54 00:02:20,480 --> 00:02:22,240 10, 20 years? 55 00:02:22,240 --> 00:02:23,490 Because cycles were cheap. 
56 00:02:23,490 --> 00:02:26,190 They knew Moore's law was going to keep going and so you 57 00:02:26,190 --> 00:02:28,710 could implement some algorithm, you didn't have to 58 00:02:28,710 --> 00:02:31,310 worry about the details, as long as you got the right 59 00:02:31,310 --> 00:02:35,560 power law-- if you got your n squared or n cubed or n log n, 60 00:02:35,560 --> 00:02:37,720 whatever behavior. 61 00:02:37,720 --> 00:02:41,390 The details, if the multiplying factor was 10 or 62 00:02:41,390 --> 00:02:42,170 100 it didn't matter. 63 00:02:42,170 --> 00:02:44,170 Eventually Moore's law would solve that problem for you, so 64 00:02:44,170 --> 00:02:45,410 you didn't have to be efficient. 65 00:02:45,410 --> 00:02:49,090 And I think I've spent the better part of three years 66 00:02:49,090 --> 00:02:52,510 trying to fight against that and you're going to learn in 67 00:02:52,510 --> 00:02:54,630 this class that, particularly for multicore you have to 68 00:02:54,630 --> 00:02:57,660 think very hard about how you're going to get 69 00:02:57,660 --> 00:02:58,910 performance. 70 00:02:58,910 --> 00:03:00,990 71 00:03:00,990 --> 00:03:04,260 This is actually the take home message that I want to give. 72 00:03:04,260 --> 00:03:06,630 I think it's just one or two slides, but we really need to 73 00:03:06,630 --> 00:03:10,340 get to these because that's where I want to get you 74 00:03:10,340 --> 00:03:11,740 thinking along the right lines. 75 00:03:11,740 --> 00:03:12,960 And then there's a hardware 76 00:03:12,960 --> 00:03:16,650 consideration, we can skip that. 77 00:03:16,650 --> 00:03:19,790 All right, so where have all the gigahertz gone, right? 78 00:03:19,790 --> 00:03:24,220 We saw Moore's law, things getting faster and faster and 79 00:03:24,220 --> 00:03:26,926 the answer is I have a different chart that's 80 00:03:26,926 --> 00:03:28,220 basically the same thing. 
81 00:03:28,220 --> 00:03:31,200 You have relative device performance on this axis and 82 00:03:31,200 --> 00:03:32,210 you've got the year here. 83 00:03:32,210 --> 00:03:35,300 And different technologies were growing, growing, 84 00:03:35,300 --> 00:03:37,210 growing, but now you see they're thresholding. 85 00:03:37,210 --> 00:03:42,100 And you go to conferences now, architecture conferences, and 86 00:03:42,100 --> 00:03:45,500 people are saying, Moore's law is dead. 87 00:03:45,500 --> 00:03:47,660 Now, I don't know if I would go that far and I know there 88 00:03:47,660 --> 00:03:50,110 are true believers out there who say, well maybe the 89 00:03:50,110 --> 00:03:54,280 silicon on the insulator technology is dead, but 90 00:03:54,280 --> 00:03:55,140 there'll be something else. 91 00:03:55,140 --> 00:03:59,330 And maybe that's true and maybe that is multicore, but 92 00:03:59,330 --> 00:04:02,800 unless we get the right programming models in place 93 00:04:02,800 --> 00:04:04,050 it's not going to be multicore. 94 00:04:04,050 --> 00:04:07,320 95 00:04:07,320 --> 00:04:08,730 Here's this power density graph. 96 00:04:08,730 --> 00:04:11,460 Here we have the nuclear reactor power up here and you 97 00:04:11,460 --> 00:04:12,660 see Pentiums going up now. 98 00:04:12,660 --> 00:04:16,870 Of course, this is a log plot, so we're far away, but on this 99 00:04:16,870 --> 00:04:18,450 axis we're not far away. 100 00:04:18,450 --> 00:04:22,320 This is how much we shrink the technology, the size of those 101 00:04:22,320 --> 00:04:23,940 transistors. 102 00:04:23,940 --> 00:04:30,670 So if we're kind of going down by 2 every 18 months or so, 103 00:04:30,670 --> 00:04:33,080 maybe it's 2 years now, we're not so far away from that 104 00:04:33,080 --> 00:04:34,500 nuclear reactor output. 105 00:04:34,500 --> 00:04:37,140 And that's a problem. 106 00:04:37,140 --> 00:04:39,800 And what's really causing that problem?
107 00:04:39,800 --> 00:04:42,680 Here's a picture of one of these gates magnified a lot 108 00:04:42,680 --> 00:04:46,300 and here's the interface magnified even further and you 109 00:04:46,300 --> 00:04:49,330 see here's this dielectric that's insulating between the 110 00:04:49,330 --> 00:04:51,680 2 sides of the gate-- 111 00:04:51,680 --> 00:04:52,860 we're reaching a fundamental limit. 112 00:04:52,860 --> 00:04:54,000 A few atomic layers. 113 00:04:54,000 --> 00:04:56,880 You see here it's like 11 angstroms. What's that? 114 00:04:56,880 --> 00:05:00,040 10, 11 atoms across? 115 00:05:00,040 --> 00:05:02,700 If you go back to basic physics you know that quantum 116 00:05:02,700 --> 00:05:06,700 mechanical properties like electrons, they tunnel, right? 117 00:05:06,700 --> 00:05:08,560 And they tunnel through barriers with kind of an 118 00:05:08,560 --> 00:05:09,890 exponential decay. 119 00:05:09,890 --> 00:05:12,800 So whenever you shrink this further you get more and more 120 00:05:12,800 --> 00:05:15,630 leakage, so the current is leaking through. 121 00:05:15,630 --> 00:05:19,040 In this graph, what you see here is that as this size gets 122 00:05:19,040 --> 00:05:22,780 smaller, the leakage current is getting equivalent to the 123 00:05:22,780 --> 00:05:23,510 active power. 124 00:05:23,510 --> 00:05:29,050 So even when it's not doing anything, this 65 nanometer, 125 00:05:29,050 --> 00:05:31,110 the technology is leaking as much power 126 00:05:31,110 --> 00:05:32,550 as it actually uses. 127 00:05:32,550 --> 00:05:35,430 And eventually, as we get smaller, smaller we're going 128 00:05:35,430 --> 00:05:38,720 to be using more power, just leaking stuff away and that's 129 00:05:38,720 --> 00:05:43,200 really bad because as Saman suggested we have people like 130 00:05:43,200 --> 00:05:45,390 Google putting this stuff near the Coulee Dam so that they 131 00:05:45,390 --> 00:05:46,140 can get power. 
132 00:05:46,140 --> 00:05:49,600 I deal with a lot of customers who have tens of thousands of 133 00:05:49,600 --> 00:05:54,450 nodes, 50,000 processors, 100,000 processors. 134 00:05:54,450 --> 00:05:56,930 They're using 20 gigabytes-- 135 00:05:56,930 --> 00:05:58,620 sorry, megahertz. 136 00:05:58,620 --> 00:06:02,090 No, megawatts, that's what I want to say. 137 00:06:02,090 --> 00:06:03,340 It's too early in the morning. 138 00:06:03,340 --> 00:06:06,150 139 00:06:06,150 --> 00:06:09,460 Tens of megawatts to power their installations and 140 00:06:09,460 --> 00:06:12,130 they're choosing sites specifically to get that power 141 00:06:12,130 --> 00:06:12,940 and they're limited. 142 00:06:12,940 --> 00:06:15,300 So they come to me, they come to people at IBM and they say, 143 00:06:15,300 --> 00:06:16,390 what can we do about power? 144 00:06:16,390 --> 00:06:18,810 Power is a problem. 145 00:06:18,810 --> 00:06:21,630 And that's why we're not seeing 146 00:06:21,630 --> 00:06:25,190 increasing the gigahertz. 147 00:06:25,190 --> 00:06:26,590 Has this ever happened before? 148 00:06:26,590 --> 00:06:29,560 Well, I'm going to go through this quickly, yes. 149 00:06:29,560 --> 00:06:33,230 Here we see the power output of a steam iron, right 150 00:06:33,230 --> 00:06:36,520 there per unit area. 151 00:06:36,520 --> 00:06:39,300 152 00:06:39,300 --> 00:06:42,750 And something's messed up here. 153 00:06:42,750 --> 00:06:46,960 You see as the technology changed from bipolar to CMOS 154 00:06:46,960 --> 00:06:52,220 we were able to improve the performance, but the heat flux 155 00:06:52,220 --> 00:06:55,700 got higher again and that begs the question, what's going to 156 00:06:55,700 --> 00:06:56,750 happen next? 157 00:06:56,750 --> 00:07:00,310 And of course, IBM, Intel, AMD, they're all 158 00:07:00,310 --> 00:07:03,250 betting on this multicore. 159 00:07:03,250 --> 00:07:06,090 And so there's an opportunity from a business point of view.
160 00:07:06,090 --> 00:07:09,650 So now, that's the intro. 161 00:07:09,650 --> 00:07:12,540 Multicore: how do you deal with it? 162 00:07:12,540 --> 00:07:17,060 Here's a picture of the chip, the cell processor. 163 00:07:17,060 --> 00:07:19,930 You can see these 8 little black dots. 164 00:07:19,930 --> 00:07:23,570 They're local memory for each one of 8 special purpose 165 00:07:23,570 --> 00:07:27,400 processors, as well as a big chunk over here, which is a 166 00:07:27,400 --> 00:07:28,200 ninth processor. 167 00:07:28,200 --> 00:07:32,770 So this chip has 9 processors on board and the trick is to 168 00:07:32,770 --> 00:07:35,720 design it so that it addresses lots of issues 169 00:07:35,720 --> 00:07:38,080 that we just discussed. 170 00:07:38,080 --> 00:07:43,570 So let me put this in context, cell was created for the Sony 171 00:07:43,570 --> 00:07:44,780 Playstation 3. 172 00:07:44,780 --> 00:07:48,590 It started in about 2000 and there's a long development 173 00:07:48,590 --> 00:07:53,530 here until it was finally announced over here. 174 00:07:53,530 --> 00:07:54,380 Where was it first announced? 175 00:07:54,380 --> 00:07:58,680 It was announced several years later and IBM recently 176 00:07:58,680 --> 00:08:02,190 announced a cell blade about a year back and we're pushing 177 00:08:02,190 --> 00:08:05,280 these blades and we're very much struggling with the 178 00:08:05,280 --> 00:08:06,660 programming model. 179 00:08:06,660 --> 00:08:09,040 How do you get performance while making something 180 00:08:09,040 --> 00:08:09,610 programmable? 181 00:08:09,610 --> 00:08:11,790 If you go to customers and they have 4 million lines of 182 00:08:11,790 --> 00:08:19,240 code, you can't tell them just port it and it'll be 80 person 183 00:08:19,240 --> 00:08:22,030 years to get it ported, 100 person years more. 184 00:08:22,030 --> 00:08:23,740 And then you have to optimize it. 185 00:08:23,740 --> 00:08:27,950 So there are problems and we'll talk about that. 
186 00:08:27,950 --> 00:08:32,360 But it was created in this context and because of that, 187 00:08:32,360 --> 00:08:35,510 this chip in particular, is a commodity processor. 188 00:08:35,510 --> 00:08:39,070 Meaning that it's going to be selling millions and millions. 189 00:08:39,070 --> 00:08:44,920 Sony Playstation 2 sold an average of 20 million units 190 00:08:44,920 --> 00:08:47,360 each year for 5 years and we expect the same for the 191 00:08:47,360 --> 00:08:48,440 Playstation 3. 192 00:08:48,440 --> 00:08:53,600 So the cell has a big advantage over other multicore 193 00:08:53,600 --> 00:08:57,340 processors like the Intel Woodcrest, which has a street 194 00:08:57,340 --> 00:09:01,930 price of about $2000 and the cell around 100. 195 00:09:01,930 --> 00:09:04,790 So not only do we have big performance improvements, we 196 00:09:04,790 --> 00:09:06,800 have price advantages too because of 197 00:09:06,800 --> 00:09:09,660 that commodity market. 198 00:09:09,660 --> 00:09:14,450 All right, let's talk about the design concept. 199 00:09:14,450 --> 00:09:16,580 Here's a little bit of a rehash of what we discussed 200 00:09:16,580 --> 00:09:18,550 with some interesting words here. 201 00:09:18,550 --> 00:09:20,570 We're talking about a power wall, a memory wall and a 202 00:09:20,570 --> 00:09:21,320 frequency wall. 203 00:09:21,320 --> 00:09:22,900 So we've talked about this frequency wall. 204 00:09:22,900 --> 00:09:26,300 We're hitting this wall because of the power really 205 00:09:26,300 --> 00:09:28,840 and the power wall people just don't have enough power coming 206 00:09:28,840 --> 00:09:32,140 into their buildings to keep these things going. 207 00:09:32,140 --> 00:09:35,680 But memory wall, Saman didn't actually use that term, but 208 00:09:35,680 --> 00:09:38,140 that's the fact that as the clock frequencies get higher 209 00:09:38,140 --> 00:09:41,160 and higher, memory appeared further and further away. 
210 00:09:41,160 --> 00:09:44,200 The more cycles that I have to go as a processor before the 211 00:09:44,200 --> 00:09:45,180 data came in. 212 00:09:45,180 --> 00:09:47,620 And so that changes the whole paradigm, how you have to 213 00:09:47,620 --> 00:09:48,400 think about it. 214 00:09:48,400 --> 00:09:53,000 We have processors with lots of cache, but is cache really 215 00:09:53,000 --> 00:09:54,320 what you want? 216 00:09:54,320 --> 00:09:56,020 Well, it depends. 217 00:09:56,020 --> 00:09:59,490 If you have a very localized process where you're going to 218 00:09:59,490 --> 00:10:02,920 bring something into cache and the data is going to be reused 219 00:10:02,920 --> 00:10:04,950 then that's really a good thing to do. 220 00:10:04,950 --> 00:10:07,980 But what if you have random gather and scatter of data? 221 00:10:07,980 --> 00:10:13,040 You know, you're doing some transactional processing or 222 00:10:13,040 --> 00:10:16,040 whatever mathematical function you're calculating is very 223 00:10:16,040 --> 00:10:17,820 distributed like an FFT. 224 00:10:17,820 --> 00:10:21,210 So you have to do all sorts of accesses through memory and it 225 00:10:21,210 --> 00:10:23,770 doesn't fit in that cache. 226 00:10:23,770 --> 00:10:26,080 Well, then you can start thrashing cache. 227 00:10:26,080 --> 00:10:29,770 You bring in one integer and then you ask the cache for the 228 00:10:29,770 --> 00:10:32,400 next thing, it's not there, so it has to go in and so you 229 00:10:32,400 --> 00:10:36,380 spend all this time wasting time getting stuff into cache. 230 00:10:36,380 --> 00:10:40,270 So what we're pushing for multicore, especially for cell 231 00:10:40,270 --> 00:10:43,260 is the notion of a shopping list. And this is where 232 00:10:43,260 --> 00:10:46,830 programmability comes in and programming models come in.
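The cache-thrashing point above can be illustrated with a toy model. The sketch below is not a model of any real processor: the cache size, line size, and the two access patterns are made-up values chosen only to contrast a localized walk, where each fetched line is reused, with a scattered walk, where every access lands on a new line.

```python
from collections import OrderedDict

def count_misses(addresses, cache_lines=4, line_size=8):
    """Count misses in a tiny fully associative LRU cache (toy model only)."""
    cache = OrderedDict()  # line tags, ordered from least to most recently used
    misses = 0
    for addr in addresses:
        tag = addr // line_size
        if tag in cache:
            cache.move_to_end(tag)         # hit: mark line as most recently used
        else:
            misses += 1                    # miss: the whole line must be fetched
            cache[tag] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return misses

# Localized access: 64 consecutive elements, so each fetched line is used 8 times.
sequential = list(range(64))
# Scattered access: each step jumps a full line, so no fetched line is ever reused.
scattered = [i * 8 for i in range(64)]

sequential_misses = count_misses(sequential)
scattered_misses = count_misses(scattered)
print(sequential_misses, scattered_misses)
```

Both patterns touch 64 elements, but in this model the scattered one pays a full line fetch for every single element, which is the "bring in one integer, then miss on the next thing" behavior described in the talk.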
233 00:10:46,830 --> 00:10:50,310 You really need to think ahead of time about what your 234 00:10:50,310 --> 00:10:53,380 shopping list is going to be and the analogy that people 235 00:10:53,380 --> 00:10:56,170 have been using is you're fixing something in your 236 00:10:56,170 --> 00:10:57,690 house, your pipe breaks. 237 00:10:57,690 --> 00:10:59,120 So you go and say, oh, I need a new pipe. 238 00:10:59,120 --> 00:11:00,570 So you go to the store, you get a pipe. 239 00:11:00,570 --> 00:11:02,870 You bring it back and say, oh, I need some putty. 240 00:11:02,870 --> 00:11:04,090 So you go to the store, you get some putty. 241 00:11:04,090 --> 00:11:05,420 And oh, I need a wrench. 242 00:11:05,420 --> 00:11:07,760 Go to the store-- that's what cache is. 243 00:11:07,760 --> 00:11:10,790 So you figure out what you need when you need it. 244 00:11:10,790 --> 00:11:12,580 In the cell processor you have to think ahead and make a 245 00:11:12,580 --> 00:11:14,850 shopping list. If I'm going to do this calculation I need all 246 00:11:14,850 --> 00:11:15,630 these things. 247 00:11:15,630 --> 00:11:17,240 I'm going to bring them all in, I'm going to start 248 00:11:17,240 --> 00:11:18,040 calculating. 249 00:11:18,040 --> 00:11:20,090 When I'm calculating on that, I'm going to get my other 250 00:11:20,090 --> 00:11:23,190 shopping list. So that I can have some concurrency of the 251 00:11:23,190 --> 00:11:24,750 data load with the computes. 252 00:11:24,750 --> 00:11:31,230 253 00:11:31,230 --> 00:11:33,260 I'm going to skip this here. 254 00:11:33,260 --> 00:11:37,850 255 00:11:37,850 --> 00:11:41,340 You can read that later, it's not that important. 256 00:11:41,340 --> 00:11:45,230 Cell synergy, now this is kind of you know, apple pie, 257 00:11:45,230 --> 00:11:48,000 motherhood kind of thing. 258 00:11:48,000 --> 00:11:50,610 The cell processor was specifically designed so that 259 00:11:50,610 --> 00:11:52,990 those 9 cores are synergistic.
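The shopping-list idea described a moment ago (fetch the next batch of data while computing on the current one) is commonly called double buffering. The following is a minimal timing sketch under simplifying assumptions: every batch takes the same load time and the same compute time, and a load can overlap fully with a compute. The function names and the numbers are illustrative, not taken from the talk.

```python
def serial_time(batches, load, compute):
    """Fetch on demand: each batch is loaded, then computed, with no overlap."""
    return batches * (load + compute)

def double_buffered_time(batches, load, compute):
    """Shopping-list style: load batch i+1 while computing on batch i."""
    if batches == 0:
        return 0
    # The first load cannot be hidden, and the last compute has nothing left
    # to overlap with; in between, each step costs the slower of the two.
    return load + (batches - 1) * max(load, compute) + compute

# Hypothetical workload: 10 batches, 2 time units to load, 3 to compute.
print(serial_time(10, 2, 3))           # time with no overlap
print(double_buffered_time(10, 2, 3))  # time with loads hidden behind computes
```

In this model the benefit is largest when load and compute times are similar; when one dominates, the total approaches the dominant cost alone.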
260 00:11:52,990 --> 00:11:55,670 That they interoperate very efficiently. 261 00:11:55,670 --> 00:11:59,040 Now I told you we have 8 identical processors, we call 262 00:11:59,040 --> 00:11:59,440 those SPEs. 263 00:11:59,440 --> 00:12:02,760 And the ninth processor is the PPE. 264 00:12:02,760 --> 00:12:06,100 It's been designed so that the PPE is running the OS and it's 265 00:12:06,100 --> 00:12:09,630 doing all the transaction file systems and what not so that 266 00:12:09,630 --> 00:12:11,650 these SPEs can focus on what they're good 267 00:12:11,650 --> 00:12:12,900 at, which is compute. 268 00:12:12,900 --> 00:12:15,510 269 00:12:15,510 --> 00:12:19,290 The whole thing is pulled together with an element 270 00:12:19,290 --> 00:12:22,530 interconnect bus and we'll talk about that. 271 00:12:22,530 --> 00:12:25,070 It's very, very efficient, very high bandwidth bus. 272 00:12:25,070 --> 00:12:28,940 273 00:12:28,940 --> 00:12:30,450 Now we're going to talk about the detailed hardware 274 00:12:30,450 --> 00:12:31,340 components. 275 00:12:31,340 --> 00:12:35,440 And Rodric somewhere, there you are, asked me to actually 276 00:12:35,440 --> 00:12:39,370 dig down into more of the hardware. 277 00:12:39,370 --> 00:12:40,200 I would love to do that. 278 00:12:40,200 --> 00:12:43,470 Honestly, I'm not a hardware person. 279 00:12:43,470 --> 00:12:47,220 I'll do the best I can, perhaps at the end of the talk 280 00:12:47,220 --> 00:12:50,410 we'll dig down and show me which slides you want. 281 00:12:50,410 --> 00:12:54,180 But I've been dealing with this for so long that I can do 282 00:12:54,180 --> 00:12:55,010 a decent job. 283 00:12:55,010 --> 00:12:57,620 Here's another picture of the chip. 284 00:12:57,620 --> 00:12:59,140 It has lots of transistors. 285 00:12:59,140 --> 00:13:00,340 This is the size.
286 00:13:00,340 --> 00:13:03,200 We talked about the 9 cores, it has 10 threads because this 287 00:13:03,200 --> 00:13:06,080 power processor, the PPE has 2 threads. 288 00:13:06,080 --> 00:13:08,780 Each of these are single threaded. 289 00:13:08,780 --> 00:13:10,480 And this is the wow factor. 290 00:13:10,480 --> 00:13:15,350 We have 200 gigaflops, over 200 gigaflops of single 291 00:13:15,350 --> 00:13:18,260 precision performance on these chips. 292 00:13:18,260 --> 00:13:21,820 And over 20 gigaflops of double precision and that will 293 00:13:21,820 --> 00:13:24,670 be going up to 100 gigaflops by the end of this year. 294 00:13:24,670 --> 00:13:27,840 295 00:13:27,840 --> 00:13:32,430 The bandwidth to main memory is 25 gigabytes per second and 296 00:13:32,430 --> 00:13:35,640 up to 75 gigabytes per second of I/O bandwidth. 297 00:13:35,640 --> 00:13:38,840 Now this chip really has tremendous bandwidth, but what 298 00:13:38,840 --> 00:13:40,800 we've seen so far-- particularly with the Sony 299 00:13:40,800 --> 00:13:44,170 Playstation and I think you may have lots of them here, 300 00:13:44,170 --> 00:13:46,780 the board is not designed to really take 301 00:13:46,780 --> 00:13:48,310 advantage of that bandwidth. 302 00:13:48,310 --> 00:13:53,860 And even the blades that IBM sells really can't get that 303 00:13:53,860 --> 00:13:55,830 type of bandwidth off the blade. 304 00:13:55,830 --> 00:13:57,870 And so if you're keeping everything local on the blade 305 00:13:57,870 --> 00:14:00,640 or on the Playstation 3 then you have lots of bandwidth 306 00:14:00,640 --> 00:14:01,500 internally. 307 00:14:01,500 --> 00:14:06,320 But off blade or off board you really have to survive with 308 00:14:06,320 --> 00:14:12,060 something like a gigabyte, 2 gigabytes in the future. 309 00:14:12,060 --> 00:14:14,330 And this element interconnect bus I mentioned before has a 310 00:14:14,330 --> 00:14:20,630 tremendous bandwidth, over 300 gigabytes per second. 
311 00:14:20,630 --> 00:14:23,280 The top frequency in the lab was over 4 gigabytes-- 312 00:14:23,280 --> 00:14:24,380 gigahertz, sorry. 313 00:14:24,380 --> 00:14:28,060 And it's currently running when you 314 00:14:28,060 --> 00:14:30,220 buy them at 3.2 gigahertz. 315 00:14:30,220 --> 00:14:33,720 And actually the Playstation 3's that you're buying today, 316 00:14:33,720 --> 00:14:38,560 I think, they only use 7 out of the 8 SPEs. 317 00:14:38,560 --> 00:14:40,460 And that was a design consideration from the 318 00:14:40,460 --> 00:14:43,870 hardware point of view because as these chips get bigger and 319 00:14:43,870 --> 00:14:46,890 bigger, which is if you can't ratchet up the gigahertz you 320 00:14:46,890 --> 00:14:48,980 have to spread out. 321 00:14:48,980 --> 00:14:51,830 And so as they get bigger, flaws in the manufacturing 322 00:14:51,830 --> 00:14:54,630 process lead to faulty units. 323 00:14:54,630 --> 00:14:57,600 So instead of just throwing away things, if one of these 324 00:14:57,600 --> 00:15:01,110 SPEs is bad we don't use it and we just do 7. 325 00:15:01,110 --> 00:15:06,410 As the design process gets better by the end of this year 326 00:15:06,410 --> 00:15:09,020 they'll be using 8. 327 00:15:09,020 --> 00:15:14,160 The blades that IBM sells, they're all set up for 8. 328 00:15:14,160 --> 00:15:18,440 OK, so here's a schematic view of what you just saw on the 329 00:15:18,440 --> 00:15:20,010 previous slide. 330 00:15:20,010 --> 00:15:22,110 You have these 8 SPEs. 331 00:15:22,110 --> 00:15:25,270 You have the PPE here with this L1 and L2 cache. 332 00:15:25,270 --> 00:15:26,800 You have the element interconnect bus connecting 333 00:15:26,800 --> 00:15:29,610 all of these pieces together to a memory interface 334 00:15:29,610 --> 00:15:32,550 controller and a bus interface controller.
335 00:15:32,550 --> 00:15:38,120 And so this MIC is what has the 25.6 gigabytes per second 336 00:15:38,120 --> 00:15:43,240 and this BIC has potentially 75 going out here. 337 00:15:43,240 --> 00:15:48,060 Each of these SPEs has its own local store. 338 00:15:48,060 --> 00:15:49,940 Those are those little black dots that you saw, those 8 339 00:15:49,940 --> 00:15:50,830 black dots. 340 00:15:50,830 --> 00:15:54,340 It's not very large, it's a quarter of a megabyte, but 341 00:15:54,340 --> 00:15:57,750 it's very fast to this SXU, this processing unit. 342 00:15:57,750 --> 00:16:02,520 It's only 6 cycles away from that unit. 343 00:16:02,520 --> 00:16:06,520 And it's a fully pipelined 6 so that if you feed that 344 00:16:06,520 --> 00:16:09,810 pipeline you can get data every cycle. 345 00:16:09,810 --> 00:16:11,480 And here, the thing that you can't read because it's 346 00:16:11,480 --> 00:16:14,190 probably too dark is the DMA engine. 347 00:16:14,190 --> 00:16:17,960 So one of the interesting things about this is that each 348 00:16:17,960 --> 00:16:21,610 one of these is a full fledged processor. 349 00:16:21,610 --> 00:16:25,930 It can access main memory independent of this PPE. 350 00:16:25,930 --> 00:16:30,990 So you can have 9 processes or 10 if you're running 2 threads 351 00:16:30,990 --> 00:16:34,470 here, all going simultaneously, all 352 00:16:34,470 --> 00:16:36,210 independent of one another. 353 00:16:36,210 --> 00:16:37,800 And that allows for a tremendous amount of 354 00:16:37,800 --> 00:16:41,650 flexibility in the types of algorithms you can implement. 355 00:16:41,650 --> 00:16:45,000 And because of this bus here you can see it's 96 bytes per 356 00:16:45,000 --> 00:16:49,390 cycle and we're at 3.2 gigahertz. 357 00:16:49,390 --> 00:16:54,930 I think that's 288 gigabytes per second. 
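The bandwidth figures quoted here can be checked with a line of arithmetic. The sketch below uses only numbers stated in the talk (96 bytes per cycle, a 3.2 gigahertz clock, 25.6 gigabytes per second for the memory interface controller) and assumes, as the speaker does, that the bus figure is multiplied by the full processor clock. Under that assumption the product comes out a little above 300 gigabytes per second, which agrees with the "over 300" figure given earlier rather than with the 288 estimated on the spot.

```python
BYTES_PER_CYCLE = 96   # element interconnect bus figure quoted in the talk
CLOCK_HZ = 3.2e9       # shipping clock quoted in the talk
MIC_GB_PER_S = 25.6    # memory interface controller bandwidth quoted in the talk

# Peak bus bandwidth, assuming the bus moves 96 bytes on every processor cycle.
eib_gb_per_s = BYTES_PER_CYCLE * CLOCK_HZ / 1e9

# The same arithmetic run backwards for main memory: bytes per processor cycle.
mic_bytes_per_cycle = MIC_GB_PER_S * 1e9 / CLOCK_HZ

# How many times more bandwidth the on-chip bus has than the path to main memory.
bus_to_memory_ratio = eib_gb_per_s / MIC_GB_PER_S

print(eib_gb_per_s, mic_bytes_per_cycle, bus_to_memory_ratio)
```

The ratio is the practical takeaway: on these figures, traffic that stays on the bus between local stores has roughly an order of magnitude more bandwidth available than traffic that goes out to main memory.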
358 00:16:54,930 --> 00:16:57,650 These guys can communicate to one another across this bus 359 00:16:57,650 --> 00:17:00,210 without ever going out to main memory and so they can get 360 00:17:00,210 --> 00:17:03,400 much faster access to their local memories. 361 00:17:03,400 --> 00:17:06,630 So if you're doing lots of computes internally here you 362 00:17:06,630 --> 00:17:11,880 can scream on this processing; really, really go fast. And 363 00:17:11,880 --> 00:17:13,530 you can do the same if you're going out to the memory 364 00:17:13,530 --> 00:17:16,410 interface controller here to main memory, if you 365 00:17:16,410 --> 00:17:19,890 sufficiently hide that memory access. 366 00:17:19,890 --> 00:17:21,140 So we'll talk about that. 367 00:17:21,140 --> 00:17:24,030 368 00:17:24,030 --> 00:17:28,100 All right, this is the PPE that I mentioned before. 369 00:17:28,100 --> 00:17:32,020 It's based on the IBM power family of processors, it's a 370 00:17:32,020 --> 00:17:34,680 watered down version to reduce the power consumption. 371 00:17:34,680 --> 00:17:39,190 So it doesn't have the horse power that you see in say a 372 00:17:39,190 --> 00:17:42,470 Pentium 4 or even-- 373 00:17:42,470 --> 00:17:44,730 actually, I don't have an exact comparison point for 374 00:17:44,730 --> 00:17:47,930 this processor, but if you take the code that runs today 375 00:17:47,930 --> 00:17:51,910 on your Intel or AMD, whatever your power and you recompile 376 00:17:51,910 --> 00:17:54,960 it on cell it'll run today-- 377 00:17:54,960 --> 00:17:57,810 maybe you have to change the library or two, but it'll run 378 00:17:57,810 --> 00:17:59,470 today here, no problem. 379 00:17:59,470 --> 00:18:04,150 But it'll be about 60% slower, 50% slower and so people say, 380 00:18:04,150 --> 00:18:07,620 oh my god this cell processor's terrible. 381 00:18:07,620 --> 00:18:10,980 But that's because you're only using that one piece. 
382 00:18:10,980 --> 00:18:12,490 So let's look at the other-- 383 00:18:12,490 --> 00:18:14,070 OK, so now we go into details of the PPE. 384 00:18:14,070 --> 00:18:16,960 385 00:18:16,960 --> 00:18:20,490 Half a megabyte of L2 cache here, coherent load stores. 386 00:18:20,490 --> 00:18:24,270 It does have a VMX unit, so you can do some SIMD 387 00:18:24,270 --> 00:18:27,670 operations, single instruction multiple data instructions. 388 00:18:27,670 --> 00:18:29,530 Two-way hardware multithreaded here. 389 00:18:29,530 --> 00:18:33,180 390 00:18:33,180 --> 00:18:36,960 Then there's an EIB that goes around here. 391 00:18:36,960 --> 00:18:41,780 It's composed of four 16 byte data rings. 392 00:18:41,780 --> 00:18:44,510 And you can have multiple, simultaneous transfers per 393 00:18:44,510 --> 00:18:48,030 ring for a total of over 100 outstanding requests 394 00:18:48,030 --> 00:18:49,280 simultaneously. 395 00:18:49,280 --> 00:18:53,390 396 00:18:53,390 --> 00:18:54,830 But this slide doesn't-- this kind of hides 397 00:18:54,830 --> 00:18:55,620 it under the rug. 398 00:18:55,620 --> 00:18:57,720 There's a certain topology here. 399 00:18:57,720 --> 00:18:59,910 And so these things are going to be 400 00:18:59,910 --> 00:19:05,160 connected to those 8 SPEs. 401 00:19:05,160 --> 00:19:08,850 And depending on which way you send things, you'll have 402 00:19:08,850 --> 00:19:10,340 better or worse performance. 403 00:19:10,340 --> 00:19:14,845 So some of these buses are going around this way and some 404 00:19:14,845 --> 00:19:16,470 are going counterclockwise. 405 00:19:16,470 --> 00:19:18,960 And because of that you have to know who you're 406 00:19:18,960 --> 00:19:22,230 communicating with if you want to have real high efficiency.
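Why direction matters on a ring can be seen by counting hops each way. The sketch below is a generic ring model, not the actual bus layout: the number of stops on the ring (12 here) and the positions used in the example are assumptions made for illustration, since the talk does not give the physical ordering of the units.

```python
def ring_hops(src, dst, stops):
    """Return (clockwise_hops, counterclockwise_hops) between two ring positions."""
    clockwise = (dst - src) % stops
    counterclockwise = (src - dst) % stops
    return clockwise, counterclockwise

def best_direction(src, dst, stops):
    """Pick the direction with fewer hops; ties go to clockwise."""
    clockwise, counterclockwise = ring_hops(src, dst, stops)
    if clockwise <= counterclockwise:
        return "clockwise", clockwise
    return "counterclockwise", counterclockwise

STOPS = 12  # hypothetical number of units attached to the ring

# Hypothetical positions: sending from stop 1 to stop 10.
print(ring_hops(1, 10, STOPS))       # hop counts in each direction
print(best_direction(1, 10, STOPS))  # the shorter way around
```

In this example the long way around costs three times as many hops as the short way, which is the kind of gap the speaker alludes to when he mentions sending things all the way around and waiting.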
407 00:19:22,230 --> 00:19:25,890 I haven't seen personally cases where it made a really 408 00:19:25,890 --> 00:19:27,700 big difference, but I do know that there's some people who 409 00:19:27,700 --> 00:19:34,720 found, if I'm going from here to here I want to make sure 410 00:19:34,720 --> 00:19:38,220 I'm sending things the right way because of that 411 00:19:38,220 --> 00:19:39,160 connectivity. 412 00:19:39,160 --> 00:19:40,880 Or else I could be sending things all the 413 00:19:40,880 --> 00:19:42,540 way around and waiting. 414 00:19:42,540 --> 00:19:43,880 AUDIENCE: Just a quick question. 415 00:19:43,880 --> 00:19:46,020 MICHAEL PERRONE: Yes. 416 00:19:46,020 --> 00:19:47,282 AUDIENCE: Just like you said you could compile anything on 417 00:19:47,282 --> 00:19:48,740 the power processor would be slower, but you can. 418 00:19:48,740 --> 00:19:51,380 Now you also said the cell processor is in itself a 419 00:19:51,380 --> 00:19:53,440 [INAUDIBLE] processor. 420 00:19:53,440 --> 00:19:58,300 Can I compile it in a C code just for that as well? 421 00:19:58,300 --> 00:19:59,580 MICHAEL PERRONE: C code would compile. 422 00:19:59,580 --> 00:20:03,000 There's issues with libraries because the libraries wouldn't 423 00:20:03,000 --> 00:20:05,230 be ported to the SPE necessarily. 424 00:20:05,230 --> 00:20:08,620 If it had been then yes. 425 00:20:08,620 --> 00:20:10,440 This is actually a very good question. 426 00:20:10,440 --> 00:20:11,700 It opens up lots of things. 427 00:20:11,700 --> 00:20:14,320 I don't know if I should take that later. 428 00:20:14,320 --> 00:20:15,480 PROFESSOR: Take it later. 429 00:20:15,480 --> 00:20:18,990 MICHAEL PERRONE: Bottom line is this chip has two different 430 00:20:18,990 --> 00:20:21,230 processors and therefore you need two different compilers 431 00:20:21,230 --> 00:20:26,440 and it generates two different source codes.
432 00:20:26,440 --> 00:20:30,220 In principle, SPEs can run a full OS, but they're not 433 00:20:30,220 --> 00:20:32,980 designed to do that and no one's ever actually tried. 434 00:20:32,980 --> 00:20:36,200 So you could imagine having 8 or 9 OSes running on this 435 00:20:36,200 --> 00:20:38,190 processor if you wanted. 436 00:20:38,190 --> 00:20:41,280 Terrible waste from my perspective, but OK, so let's 437 00:20:41,280 --> 00:20:42,780 talk about these a little bit. 438 00:20:42,780 --> 00:20:47,190 Each of these SPEs has, like I mentioned this memory flow 439 00:20:47,190 --> 00:20:52,110 controller here, an atomic update unit, the local store, 440 00:20:52,110 --> 00:20:54,900 and the SPU, which is actually the processing unit. 441 00:20:54,900 --> 00:21:01,370 Each SPU has a register file with 128 registers. 442 00:21:01,370 --> 00:21:04,140 Each register is 128 bits. 443 00:21:04,140 --> 00:21:09,340 So they're native SIMD, there are no scalar registers here 444 00:21:09,340 --> 00:21:12,060 for the user to play with. 445 00:21:12,060 --> 00:21:15,220 If you want to do scalar ops they'll be running in those 446 00:21:15,220 --> 00:21:18,420 full vector registers, but you'll just be wasting some 447 00:21:18,420 --> 00:21:19,670 portion of that register. 448 00:21:19,670 --> 00:21:22,340 449 00:21:22,340 --> 00:21:25,760 It has IEEE double precision floating point, but it doesn't 450 00:21:25,760 --> 00:21:29,100 have IEEE single precision floating point. 451 00:21:29,100 --> 00:21:32,950 It's curiosity, but that was again, came from the history. 452 00:21:32,950 --> 00:21:36,420 The processor was designed for the gaming industry and the 453 00:21:36,420 --> 00:21:38,850 gamers, they didn't care if it had IEEE. 454 00:21:38,850 --> 00:21:39,910 Who cares IEEE? 455 00:21:39,910 --> 00:21:42,020 What I want is to have good monsters right on the screen. 
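A small sketch of the register arithmetic just described, using only the 128-register, 128-bit figures from the talk. The function names are made up for illustration.

```python
REGISTER_BITS = 128   # width of each SPU register, as stated in the talk
NUM_REGISTERS = 128   # registers per SPU, as stated in the talk


def lanes(element_bits):
    """How many elements of the given width fit in one register."""
    return REGISTER_BITS // element_bits


def scalar_utilization(element_bits):
    """Fraction of a register doing useful work when it holds one scalar."""
    return element_bits / REGISTER_BITS


# Total register file size in bytes: 128 registers * 16 bytes each.
register_file_bytes = NUM_REGISTERS * REGISTER_BITS // 8

for bits in (8, 16, 32, 64):
    print(bits, "bit elements:", lanes(bits), "lanes")
print("register file:", register_file_bytes, "bytes")
print("scalar float uses", scalar_utilization(32), "of a register")
```

Under these figures a single-precision scalar occupies one quarter of a register, which is the wasted portion the speaker refers to.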
456 00:21:42,020 --> 00:21:45,590 457 00:21:45,590 --> 00:21:51,500 And so those SIMD registers can operate bitwise on bytes, 458 00:21:51,500 --> 00:21:57,020 on shorts, on four words at a time or two doubles at a time. 459 00:21:57,020 --> 00:21:59,950 460 00:21:59,950 --> 00:22:06,210 The DMA engines here, each DMA engine can have up to 16 461 00:22:06,210 --> 00:22:09,430 outstanding requests in its queue before it stalls. 462 00:22:09,430 --> 00:22:12,680 So you can imagine you're writing something, some code 463 00:22:12,680 --> 00:22:15,210 and you're sending things out to the DMA and then all of a 464 00:22:15,210 --> 00:22:18,060 sudden you see really bad performance, it could be that 465 00:22:18,060 --> 00:22:20,210 your DMA engine has stalled the entire processor. 466 00:22:20,210 --> 00:22:23,300 If you try to write to that thing and then that queue is 467 00:22:23,300 --> 00:22:27,230 full, it just waits until the next open slot is available. 468 00:22:27,230 --> 00:22:31,040 So those are the kinds of considerations. 469 00:22:31,040 --> 00:22:34,460 AUDIENCE: You mean [UNINTELLIGIBLE PHRASE] 470 00:22:34,460 --> 00:22:35,352 MICHAEL PERRONE: Yes. 471 00:22:35,352 --> 00:22:37,360 AUDIENCE: It's not the global one? 472 00:22:37,360 --> 00:22:37,900 MICHAEL PERRONE: Right. 473 00:22:37,900 --> 00:22:39,590 That's correct. 474 00:22:39,590 --> 00:22:42,000 But there is a global address space. 475 00:22:42,000 --> 00:22:45,070 AUDIENCE: 16 slots each in each SPU. 476 00:22:45,070 --> 00:22:45,910 MICHAEL PERRONE: Right. 477 00:22:45,910 --> 00:22:46,450 Exactly. 478 00:22:46,450 --> 00:22:51,570 Each MFC has its own 16 slots. 479 00:22:51,570 --> 00:22:54,450 And they all address the same memory. 480 00:22:54,450 --> 00:22:57,540 They can have a transparent memory space or they can have 481 00:22:57,540 --> 00:22:59,280 a partitioned memory space depending on 482 00:22:59,280 --> 00:22:59,920 how you set it up.
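The 16-slot DMA queue stall described above can be modeled with a toy simulation. This is only a sketch under stated assumptions: one request is issued per time step, every request holds its slot for a fixed `latency` (an invented parameter), and the issuer blocks when the queue is full. It is not a model of the real MFC.

```python
from collections import deque

QUEUE_DEPTH = 16  # per-MFC queue depth quoted in the talk


def simulate_issue(n_requests, latency, depth=QUEUE_DEPTH):
    """Return how many time steps the issuer spends stalled on a full queue.

    Assumptions (hypothetical): one issue attempt per step, and each
    request occupies its slot for exactly `latency` steps.
    """
    in_flight = deque()  # completion times, oldest first
    now = 0
    stalled = 0
    for _ in range(n_requests):
        # Retire every request that has already completed.
        while in_flight and in_flight[0] <= now:
            in_flight.popleft()
        if len(in_flight) == depth:
            # Queue is full: wait for the oldest request to finish.
            stalled += in_flight[0] - now
            now = in_flight[0]
            in_flight.popleft()
        in_flight.append(now + latency)
        now += 1
    return stalled


print(simulate_issue(16, 100))   # queue never overflows
print(simulate_issue(17, 100))   # the 17th request has to wait
print(simulate_issue(100, 10))   # short latency keeps the queue shallow
```

In this model the issuer never stalls as long as fewer than 16 requests are in flight, which is why pacing DMA issues against their completion matters.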
483 00:22:59,920 --> 00:23:03,809 AUDIENCE: Each SPU doesn't have its own-- the DMA goes 484 00:23:03,809 --> 00:23:05,267 onto the bus, [UNINTELLIGIBLE] 485 00:23:05,267 --> 00:23:07,985 486 00:23:07,985 --> 00:23:10,850 that goes to a connection to the [UNINTELLIGIBLE]. 487 00:23:10,850 --> 00:23:14,235 488 00:23:14,235 --> 00:23:16,570 PROFESSOR: You can add this data in the SPUs too. 489 00:23:16,570 --> 00:23:18,530 You don't have to always go to outside memory. 490 00:23:18,530 --> 00:23:20,690 You can do SPU to SPU communication basically. 491 00:23:20,690 --> 00:23:21,250 MICHAEL PERRONE: Right. 492 00:23:21,250 --> 00:23:23,700 So I can do a DMA that transfers memory from this 493 00:23:23,700 --> 00:23:27,760 local store to this one if I wanted to and vice versa. 494 00:23:27,760 --> 00:23:29,590 And I can pull stuff in through the-- 495 00:23:29,590 --> 00:23:32,920 496 00:23:32,920 --> 00:23:34,350 yeah, I mentioned this stuff. 497 00:23:34,350 --> 00:23:37,800 498 00:23:37,800 --> 00:23:43,710 Now this broadband interface controller, the BIC, this is 499 00:23:43,710 --> 00:23:47,660 how you get off the blade or off the board. 500 00:23:47,660 --> 00:23:51,570 It has 20 gigabytes per second here on I/O IF. 501 00:23:51,570 --> 00:23:54,790 502 00:23:54,790 --> 00:23:56,410 In 10 over here--I'm sorry, 5 over here. 503 00:23:56,410 --> 00:24:00,700 I'm trying to remember how we get up to 70. 504 00:24:00,700 --> 00:24:04,260 This is actually two-way and one is 25 and 505 00:24:04,260 --> 00:24:04,990 the other one's 30. 506 00:24:04,990 --> 00:24:08,100 That gets you to 55. 507 00:24:08,100 --> 00:24:09,920 This should be 10 and now, what's going on here? 508 00:24:09,920 --> 00:24:14,310 509 00:24:14,310 --> 00:24:16,790 It adds up to 75, I'm sure. 510 00:24:16,790 --> 00:24:18,040 I'm sure about that. 511 00:24:18,040 --> 00:24:20,790 512 00:24:20,790 --> 00:24:22,850 I don't know why that says that. 
513 00:24:22,850 --> 00:24:25,730 But the interesting thing about this over here, this I/O 514 00:24:25,730 --> 00:24:30,670 IF zero is that you can use it to connect two 515 00:24:30,670 --> 00:24:32,130 cell processors together. 516 00:24:32,130 --> 00:24:35,180 So this is why I know it's really 25.6 because it's 517 00:24:35,180 --> 00:24:38,110 matched to this one. 518 00:24:38,110 --> 00:24:42,690 So you have 25.6 going out to main memory, but this one can 519 00:24:42,690 --> 00:24:45,240 go to another processor, so now you have these two 520 00:24:45,240 --> 00:24:49,140 processors side-by-side connected at 25.6 gigabytes 521 00:24:49,140 --> 00:24:49,880 per second. 522 00:24:49,880 --> 00:24:52,360 And now I can do a memory access through here to the 523 00:24:52,360 --> 00:24:56,270 memory that's on this processor and vice versa. 524 00:24:56,270 --> 00:24:59,090 However, If I'm going straight out to my memory it's going to 525 00:24:59,090 --> 00:25:01,300 be faster than if I go out to this memory. 526 00:25:01,300 --> 00:25:04,220 So you have a slight NUMA architecture and nonuniform 527 00:25:04,220 --> 00:25:05,320 memory access. 528 00:25:05,320 --> 00:25:09,220 And you can hide that with sufficient multibuffering. 529 00:25:09,220 --> 00:25:12,090 530 00:25:12,090 --> 00:25:14,910 So I know that this is 25 and I know the other one's 30. 531 00:25:14,910 --> 00:25:17,070 I don't know why it's written as 20 there. 532 00:25:17,070 --> 00:25:18,970 AUDIENCE: Can the SPUs write to the 533 00:25:18,970 --> 00:25:21,600 [UNINTELLIGIBLE PHRASE]? 534 00:25:21,600 --> 00:25:24,370 MICHAEL PERRONE: Yes, they can read from it. 535 00:25:24,370 --> 00:25:27,220 I don't know if they can write to it. 536 00:25:27,220 --> 00:25:29,790 In fact, that leads to a bottleneck occurring. 537 00:25:29,790 --> 00:25:34,850 So I happily start a process on my PPE and then I tell all 538 00:25:34,850 --> 00:25:37,340 my SPEs, start doing some number crunching. 
539 00:25:37,340 --> 00:25:38,420 So they do that. 540 00:25:38,420 --> 00:25:41,690 They get access to memory, but they find the memory is in L2. 541 00:25:41,690 --> 00:25:44,440 So they start pulling from L2, but now all 8 are pulling from 542 00:25:44,440 --> 00:25:47,820 L2 and it's only 7 gigabytes per second instead of 25 and 543 00:25:47,820 --> 00:25:49,180 so you get a bottleneck. 544 00:25:49,180 --> 00:25:51,660 And so what I tell everybody is if you're going to 545 00:25:51,660 --> 00:25:54,520 initialize data with that PPE make sure you flush your cache 546 00:25:54,520 --> 00:25:59,210 before you start the SPEs. 547 00:25:59,210 --> 00:26:02,010 And then you don't want to be touching that memory because 548 00:26:02,010 --> 00:26:04,380 you really want to keep things-- stuff that the SPEs 549 00:26:04,380 --> 00:26:06,330 are dealing with-- you want to keep it out of L2 cache. 550 00:26:06,330 --> 00:26:12,380 551 00:26:12,380 --> 00:26:14,020 Here there's an interrupt controller. 552 00:26:14,020 --> 00:26:17,050 553 00:26:17,050 --> 00:26:19,540 An I/O bus master translation unit. 554 00:26:19,540 --> 00:26:22,850 And you know, these allow for messaging and message passing 555 00:26:22,850 --> 00:26:24,340 and interrupts and things of that nature. 556 00:26:24,340 --> 00:26:27,450 557 00:26:27,450 --> 00:26:29,130 So that's the hardware overview. 558 00:26:29,130 --> 00:26:30,820 Any questions before I move on? 559 00:26:30,820 --> 00:26:37,950 560 00:26:37,950 --> 00:26:39,900 So why's the cell processor so fast? 561 00:26:39,900 --> 00:26:43,250 Well, 3.2 gigahertz, that's one. 562 00:26:43,250 --> 00:26:45,630 But there's also the fact that we have 8 SPEs. 563 00:26:45,630 --> 00:26:51,140 Each 8 SPEs have SIMD units, registers that are running so 564 00:26:51,140 --> 00:26:56,090 they can do this parallel processing on a chip. 
565 00:26:56,090 --> 00:27:01,440 We have 8 SPEs and each one are doing up to 8 ops per 566 00:27:01,440 --> 00:27:03,760 cycle if you're doing a mul-add. 567 00:27:03,760 --> 00:27:07,730 So you have four mul-adds for single precision. 568 00:27:07,730 --> 00:27:15,340 So you've got 8, that's 64 ops per cycle times 3.2. 569 00:27:15,340 --> 00:27:20,040 You get up to 200 gigaflops, 204.8. 570 00:27:20,040 --> 00:27:23,970 So that's really the main reason. 571 00:27:23,970 --> 00:27:25,650 We've talked about this stuff here. 572 00:27:25,650 --> 00:27:29,810 This is an image of why it's faster. 573 00:27:29,810 --> 00:27:32,160 Instead of staging and bringing the data through the 574 00:27:32,160 --> 00:27:34,740 L2, which is kind of what we were just discussing and 575 00:27:34,740 --> 00:27:39,220 having this PU, this processing unit, the PPE 576 00:27:39,220 --> 00:27:42,640 manage the data coming in, each one can do it themselves 577 00:27:42,640 --> 00:27:45,030 and bypass this bottleneck. 578 00:27:45,030 --> 00:27:47,410 So that's something you have to keep in the back of your 579 00:27:47,410 --> 00:27:48,380 mind when you're programming. 580 00:27:48,380 --> 00:27:52,140 You really want to make sure that you get this processor 581 00:27:52,140 --> 00:27:52,720 out of there. 582 00:27:52,720 --> 00:27:54,230 You don't want it in your way. 583 00:27:54,230 --> 00:27:56,540 Let these guys do as much of their own work as they can. 584 00:27:56,540 --> 00:27:59,780 585 00:27:59,780 --> 00:28:03,030 Here's a comparison of theoretical peak performance 586 00:28:03,030 --> 00:28:08,200 of cell versus Freescale, AMD, Intel over here. 587 00:28:08,200 --> 00:28:08,720 Very nice. 588 00:28:08,720 --> 00:28:11,170 That's the wow chart. 589 00:28:11,170 --> 00:28:15,860 The theoretical peak, this is in practice, what did we see?
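The peak single-precision figure quoted a moment ago can be reproduced with a few lines. The function and parameter names are illustrative; the numbers (8 SPEs, four single-precision lanes, a multiply-add counted as two operations, 3.2 GHz) are the ones given in the talk.

```python
def peak_gflops(n_spes=8, lanes=4, ops_per_lane=2, clock_ghz=3.2):
    """Theoretical peak: SPEs * SIMD lanes * ops per lane per cycle * GHz.

    ops_per_lane=2 counts a fused multiply-add as two floating-point ops.
    """
    ops_per_cycle = n_spes * lanes * ops_per_lane
    return ops_per_cycle * clock_ghz


print(peak_gflops())          # all 8 SPEs
print(peak_gflops(n_spes=1))  # a single SPE
```

This is a ceiling that assumes every SPE issues a multiply-add on every cycle; measured numbers like the ones on the following chart will be lower.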
590 00:28:15,860 --> 00:28:18,410 I don't know if you can read these numbers but what you 591 00:28:18,410 --> 00:28:20,750 really want to focus on is the first and last columns. 592 00:28:20,750 --> 00:28:23,460 This is the type of calculation, high performance 593 00:28:23,460 --> 00:28:26,470 computing like matrix multiplication, 594 00:28:26,470 --> 00:28:28,910 bioinformatics, graphics, security, it was really 595 00:28:28,910 --> 00:28:31,150 designed for graphics. 596 00:28:31,150 --> 00:28:33,850 Security, communication, video processing and over here you 597 00:28:33,850 --> 00:28:40,470 see the advantage against an IA 32, a G5 processor. 598 00:28:40,470 --> 00:28:46,510 And you see 8x, 12x, 15, 10, 18x. 599 00:28:46,510 --> 00:28:48,270 Very considerable improvement in performance. 600 00:28:48,270 --> 00:28:49,557 In the back-- question? 601 00:28:49,557 --> 00:28:51,841 AUDIENCE: [UNINTELLIGIBLE] previous slide, how did it 602 00:28:51,841 --> 00:28:55,140 compare to high [UNINTELLIGIBLE PHRASE]? 603 00:28:55,140 --> 00:28:57,020 MICHAEL PERRONE: All right, so you're thinking like a peak 604 00:28:57,020 --> 00:28:58,833 stream or something like that? 605 00:28:58,833 --> 00:29:01,400 AUDIENCE: Any particular [UNINTELLIGIBLE PHRASE]. 606 00:29:01,400 --> 00:29:05,506 The design of the SPUs is very reminiscent of 607 00:29:05,506 --> 00:29:06,860 [UNINTELLIGIBLE PHRASE]. 608 00:29:06,860 --> 00:29:11,480 MICHAEL PERRONE: So I believe, and I'm not well versed in all 609 00:29:11,480 --> 00:29:12,560 of the processors that are out there. 610 00:29:12,560 --> 00:29:14,090 I think that we still have a performance 611 00:29:14,090 --> 00:29:17,850 advantage in that space. 612 00:29:17,850 --> 00:29:19,260 You know, I don't know about Xilinx and 613 00:29:19,260 --> 00:29:20,490 those kind of things-- 614 00:29:20,490 --> 00:29:25,850 FPGAs I don't know, but what I tell people this 615 00:29:25,850 --> 00:29:26,890 is there's a spectrum. 
616 00:29:26,890 --> 00:29:29,150 And at one end you have your general purpose processors. 617 00:29:29,150 --> 00:29:32,390 You've got your Intel, you've got your Opteron whatever, 618 00:29:32,390 --> 00:29:33,540 your power processor. 619 00:29:33,540 --> 00:29:37,410 And then at the other end you've got your FPGAs and DSPs 620 00:29:37,410 --> 00:29:39,960 and then maybe over here, somewhere in the middle you've 621 00:29:39,960 --> 00:29:42,230 got graphical processing units. 622 00:29:42,230 --> 00:29:43,970 Like Nvidia kind of things. 623 00:29:43,970 --> 00:29:47,210 And then somewhere between those graphics processing 624 00:29:47,210 --> 00:29:49,060 processors and the general purpose 625 00:29:49,060 --> 00:29:52,360 processors you've got cell. 626 00:29:52,360 --> 00:29:57,040 You get a significant improvement in performance, 627 00:29:57,040 --> 00:29:59,340 but you have to pay some pain in programming. 628 00:29:59,340 --> 00:30:01,350 But not nearly as much as you have to do with the graphics 629 00:30:01,350 --> 00:30:06,150 processors and nowhere near the FPGAs, which are just 630 00:30:06,150 --> 00:30:08,220 every time you write something you have to rewrite 631 00:30:08,220 --> 00:30:10,980 everything. 632 00:30:10,980 --> 00:30:11,520 Question? 633 00:30:11,520 --> 00:30:13,848 AUDIENCE: Somewhat related to the previous question, but 634 00:30:13,848 --> 00:30:16,253 with a different angle. 635 00:30:16,253 --> 00:30:19,540 I always figured anyone could do a [INAUDIBLE], so that's 636 00:30:19,540 --> 00:30:21,010 why I ask about FFTs. 637 00:30:21,010 --> 00:30:25,590 Are they captured on the front or otherwise [UNINTELLIGIBLE] 638 00:30:25,590 --> 00:30:27,640 MICHAEL PERRONE: Yeah, so this is actually one of the things 639 00:30:27,640 --> 00:30:29,660 I spent a lot of time on for FFTs. 640 00:30:29,660 --> 00:30:32,750 I spent a lot of time with the petroleum industry.
641 00:30:32,750 --> 00:30:36,590 They take these enormous boats, they have these arrays 642 00:30:36,590 --> 00:30:39,460 that go 5 kilometers back and 1 kilometer wide, they drag 643 00:30:39,460 --> 00:30:41,800 them over the ocean, and they make these noises and they 644 00:30:41,800 --> 00:30:43,240 record the echo. 645 00:30:43,240 --> 00:30:45,010 And they have to do this enormous FFT and it 646 00:30:45,010 --> 00:30:47,580 takes them 6 months. 647 00:30:47,580 --> 00:30:49,690 Depending on the size of the FFT it can be anywhere from a 648 00:30:49,690 --> 00:30:51,665 week to 6 months, literally. 649 00:30:51,665 --> 00:30:52,270 AUDIENCE: [UNINTELLIGIBLE]. 650 00:30:52,270 --> 00:30:52,860 MICHAEL PERRONE: Sorry? 651 00:30:52,860 --> 00:30:55,740 AUDIENCE: Is this a PD FFT? 652 00:30:55,740 --> 00:31:00,690 MICHAEL PERRONE: Sometimes I do too, but they do both. 653 00:31:00,690 --> 00:31:03,250 I've become somewhat of an expert on these FFTs. 654 00:31:03,250 --> 00:31:06,610 For cell the best performance number I know of is about 90 655 00:31:06,610 --> 00:31:08,390 gigaflops of FFT performance. 656 00:31:08,390 --> 00:31:11,960 657 00:31:11,960 --> 00:31:14,630 You know, that's very good. 658 00:31:14,630 --> 00:31:17,590 Yeah, it's like 50% of peak performance. 659 00:31:17,590 --> 00:31:21,320 You know, it's easy to get 98% with Linpack 660 00:31:21,320 --> 00:31:22,890 or DGEMM 661 00:31:22,890 --> 00:31:28,320 on a processor like this and we have. We get 97% of peak 662 00:31:28,320 --> 00:31:31,845 performance, but it's a lot harder to get FFTs up to that. 663 00:31:31,845 --> 00:31:34,005 AUDIENCE: Well, then I'll [INAUDIBLE] the next questions 664 00:31:34,005 --> 00:31:36,529 then which is somehow or another you get the FFT 665 00:31:36,529 --> 00:31:39,435 performance, you've got to get the data at the right 666 00:31:39,435 --> 00:31:39,535 place at the time.
667 00:31:39,535 --> 00:31:39,560 [UNINTELLIGIBLE] 668 00:31:39,560 --> 00:31:42,940 So you've personally done that or been involved with that? 669 00:31:42,940 --> 00:31:44,560 MICHAEL PERRONE: Right, so we do a lot of tricks. 670 00:31:44,560 --> 00:31:47,080 I can show you another slide or another presentation that 671 00:31:47,080 --> 00:31:51,880 we talk about this, but typically the FFTs that we 672 00:31:51,880 --> 00:31:58,920 work with are somewhere from a 1024 to 2048, that's square. 673 00:31:58,920 --> 00:32:04,700 And so it's possible to take say, the top 4 rows-- 674 00:32:04,700 --> 00:32:08,540 in the case of 1024, four rows complex, single precision I 675 00:32:08,540 --> 00:32:11,300 think is 16 kilobytes. 676 00:32:11,300 --> 00:32:13,340 That fits into the local store very nicely. 677 00:32:13,340 --> 00:32:14,690 So you can start multibuffering. 678 00:32:14,690 --> 00:32:16,620 You bring in one, you start computing on it. 679 00:32:16,620 --> 00:32:19,530 While you're computing on those 4 in a SIMD fashion 680 00:32:19,530 --> 00:32:21,530 across the SIMD registers you're 681 00:32:21,530 --> 00:32:22,900 bringing in the next one. 682 00:32:22,900 --> 00:32:24,670 And then when that one's finished you're writing that 683 00:32:24,670 --> 00:32:26,840 one out while you're computing on the one that arrived and 684 00:32:26,840 --> 00:32:28,140 while you're getting the next one. 685 00:32:28,140 --> 00:32:33,760 And since you can get the entire 1024 or 2000 into local 686 00:32:33,760 --> 00:32:38,600 store, you're only 6 cycles away from any element in it. 687 00:32:38,600 --> 00:32:41,470 So it's much, much faster. 688 00:32:41,470 --> 00:32:45,610 We also did the 16 million element FFT. 689 00:32:45,610 --> 00:32:48,120 690 00:32:48,120 --> 00:32:52,550 1D, yeah and we did some tricks there to make it 691 00:32:52,550 --> 00:32:53,980 efficient, but it was a lot slower.
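The fetch, compute, write-back overlap described above can be laid out as a simple schedule. This is a sketch only: the function names are invented, each stage is assumed to take one time step, and a complex single-precision sample is assumed to be 8 bytes (two 4-byte floats). Under that assumption four 1024-point rows come to 32 KB rather than the 16 KB recalled from memory in the talk, so either figure should be treated as approximate.

```python
def block_bytes(n_points, rows=4, bytes_per_complex=8):
    """Size of one block of rows, assuming 8-byte complex samples."""
    return n_points * rows * bytes_per_complex


def pipeline_schedule(n_blocks):
    """List (fetch, compute, store) block indices for each time step.

    While block t is being fetched, block t-1 is being computed and
    block t-2 is being written back. None means that stage is idle.
    """
    steps = []
    for t in range(n_blocks + 2):
        fetch = t if t < n_blocks else None
        compute = t - 1 if 0 <= t - 1 < n_blocks else None
        store = t - 2 if 0 <= t - 2 < n_blocks else None
        steps.append((fetch, compute, store))
    return steps


print(block_bytes(1024), "bytes per 4-row block")
for step in pipeline_schedule(3):
    print(step)
```

In steady state all three stages are busy on different blocks, so three buffers of this size need to be resident at once.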
692 00:32:53,980 --> 00:32:56,810 693 00:32:56,810 --> 00:32:59,180 AUDIENCE: [UNINTELLIGIBLE PHRASE] 694 00:32:59,180 --> 00:33:01,156 would have to be a lot slower by the need for the problem. 695 00:33:01,156 --> 00:33:03,970 [UNINTELLIGIBLE PHRASE] 696 00:33:03,970 --> 00:33:05,870 MICHAEL PERRONE: What I remember it was fifteen times 697 00:33:05,870 --> 00:33:08,660 faster than a power 5. 698 00:33:08,660 --> 00:33:12,970 699 00:33:12,970 --> 00:33:16,160 It might have been a power 4, I don't remember, sorry. 700 00:33:16,160 --> 00:33:22,010 701 00:33:22,010 --> 00:33:22,716 I might want to skip this one. 702 00:33:22,716 --> 00:33:25,436 I think I'm going to skip this one. 703 00:33:25,436 --> 00:33:27,340 AUDIENCE: [UNINTELLIGIBLE PHRASE] 704 00:33:27,340 --> 00:33:28,590 MICHAEL PERRONE: Right. 705 00:33:28,590 --> 00:33:32,330 706 00:33:32,330 --> 00:33:34,360 Let's talk about what is the cell good for. 707 00:33:34,360 --> 00:33:36,935 You kind of have a sense of the architecture and how it 708 00:33:36,935 --> 00:33:38,510 all fits together. 709 00:33:38,510 --> 00:33:41,690 You may have some sense of the gotchas and the problems that 710 00:33:41,690 --> 00:33:44,300 might be there, but what have we actually applied it to? 711 00:33:44,300 --> 00:33:48,120 I mean you saw some of that here. 712 00:33:48,120 --> 00:33:52,405 Here's a list of things that either we've already proven to 713 00:33:52,405 --> 00:33:56,460 ourselves that it works well or we're very confident that it 714 00:33:56,460 --> 00:33:58,320 works well or we're working to demonstrate 715 00:33:58,320 --> 00:33:59,570 that it works well. 716 00:33:59,570 --> 00:34:01,700 717 00:34:01,700 --> 00:34:04,280 Signal processing, image processing, audio resampling, 718 00:34:04,280 --> 00:34:04,990 noise generation. 719 00:34:04,990 --> 00:34:06,920 I mean, you can read through this list, there's a long 720 00:34:06,920 --> 00:34:11,010 list.
And I guess there are a few characteristics that 721 00:34:11,010 --> 00:34:14,030 really make it suitable for cell. 722 00:34:14,030 --> 00:34:16,460 Things that are in single precision because you've got 723 00:34:16,460 --> 00:34:20,210 200 gigaflops single and only 20 of double, but that will 724 00:34:20,210 --> 00:34:23,360 change as I mentioned. 725 00:34:23,360 --> 00:34:26,580 Things that are streaming, streaming through and so 726 00:34:26,580 --> 00:34:29,770 signal processing is ideal where the data comes through 727 00:34:29,770 --> 00:34:32,350 and you do your compute and then you throw it away or you 728 00:34:32,350 --> 00:34:33,830 give out your results and you throw it away. 729 00:34:33,830 --> 00:34:35,080 Those are good. 730 00:34:35,080 --> 00:34:39,770 731 00:34:39,770 --> 00:34:42,410 And things that are compute intensive, where you bring the 732 00:34:42,410 --> 00:34:44,590 data in and you're going to crunch on it for a long time, 733 00:34:44,590 --> 00:34:48,170 so things like cryptography where you're either generating 734 00:34:48,170 --> 00:34:53,320 something from a key and there's virtually no input. 735 00:34:53,320 --> 00:34:57,120 You're just generating streams of random numbers that's very 736 00:34:57,120 --> 00:34:58,160 well suited for this thing. 737 00:34:58,160 --> 00:34:59,770 You see FFTs listed here. 738 00:34:59,770 --> 00:35:02,630 739 00:35:02,630 --> 00:35:04,210 TCP/IP offload. 740 00:35:04,210 --> 00:35:06,790 I didn't put that there. 741 00:35:06,790 --> 00:35:11,590 There's actually a problem with cell today that we're 742 00:35:11,590 --> 00:35:15,330 working to fix that the TCP/IP performance is not very good. 743 00:35:15,330 --> 00:35:19,970 And so what I tell people to use is Open MPI. 744 00:35:19,970 --> 00:35:23,450 You know, so that over InfiniBand. 745 00:35:23,450 --> 00:35:26,930 The PPE processor really doesn't have the horsepower 746 00:35:26,930 --> 00:35:29,750 to drive a full TCP/IP stack.
747 00:35:29,750 --> 00:35:33,300 I'm not sure it has the horsepower to do a full MPI stack 748 00:35:33,300 --> 00:35:36,680 either, but at least you have more control in that case. 749 00:35:36,680 --> 00:35:42,130 750 00:35:42,130 --> 00:35:45,170 The game physics, physical simulations-- 751 00:35:45,170 --> 00:35:47,130 I can show you a demo, but I don't know that we'll have 752 00:35:47,130 --> 00:35:50,500 time where a company called RapidMind, which is 753 00:35:50,500 --> 00:35:55,380 developing software to ease programmability for cell. 754 00:35:55,380 --> 00:35:57,980 Basically you take your existing scalar code and you 755 00:35:57,980 --> 00:36:03,010 instrument it with C++ classes that are kind of SPE aware. 756 00:36:03,010 --> 00:36:07,650 And by doing that, just write your scalar code and you get 757 00:36:07,650 --> 00:36:11,010 the SPE performance advantage. 758 00:36:11,010 --> 00:36:12,470 They have this wonderful demo of these chickens. 759 00:36:12,470 --> 00:36:15,770 They've got 16,000 chickens in a chicken yard. 760 00:36:15,770 --> 00:36:18,870 You know, the chicken yard has varying topologies and the 761 00:36:18,870 --> 00:36:22,310 chickens move around and all 16,000 are being processed in 762 00:36:22,310 --> 00:36:24,470 real time with a single cell processor. 763 00:36:24,470 --> 00:36:30,080 In fact, the Nvidia card that was used to render that 764 00:36:30,080 --> 00:36:33,480 couldn't keep up with what was coming out of the SPEs. 765 00:36:33,480 --> 00:36:34,470 We were impressed with that. 766 00:36:34,470 --> 00:36:35,000 We're happy with that. 767 00:36:35,000 --> 00:36:37,710 We showed it around at the game conferences and the 768 00:36:37,710 --> 00:36:40,300 gamers saw all these chickens and were like, 769 00:36:40,300 --> 00:36:40,630 this is really cool. 770 00:36:40,630 --> 00:36:41,880 How do I shoot them? 771 00:36:41,880 --> 00:36:44,260 772 00:36:44,260 --> 00:36:45,740 So we said, you can't.
773 00:36:45,740 --> 00:36:48,250 But maybe in the next version. 774 00:36:48,250 --> 00:36:51,780 But the idea that we've designed this so that it can 775 00:36:51,780 --> 00:36:55,050 do physical simulations, and this is maybe an entree for 776 00:36:55,050 --> 00:36:56,740 some of you people when you're doing your stuff. 777 00:36:56,740 --> 00:36:58,680 I don't know what kinds of things you want to try to do 778 00:36:58,680 --> 00:37:02,630 on cell, but I've seen people do lots of things that really 779 00:37:02,630 --> 00:37:04,430 have no business doing well on cell and they 780 00:37:04,430 --> 00:37:05,240 did very, very well. 781 00:37:05,240 --> 00:37:08,010 Like pointer chasing. 782 00:37:08,010 --> 00:37:13,260 783 00:37:13,260 --> 00:37:14,100 I'm trying to remember. 784 00:37:14,100 --> 00:37:15,230 There are two pieces of work. 785 00:37:15,230 --> 00:37:22,860 One done by Fabrizio Petrini at PNNL and he did a graph 786 00:37:22,860 --> 00:37:24,690 traversal algorithm. 787 00:37:24,690 --> 00:37:29,340 It was very much random access and he was able to parallelize 788 00:37:29,340 --> 00:37:31,120 that very nicely on Cell. 789 00:37:31,120 --> 00:37:34,900 And then there was another guy at Georgia Tech who did 790 00:37:34,900 --> 00:37:37,010 something similar for linked lists. 791 00:37:37,010 --> 00:37:41,170 And you know, I expect things to work well on cell if 792 00:37:41,170 --> 00:37:44,310 they're streaming and they have very compute intensive 793 00:37:44,310 --> 00:37:46,870 kernels that are working on things, but those are two 794 00:37:46,870 --> 00:37:50,600 examples where they're not very compute intensive and 795 00:37:50,600 --> 00:37:51,350 not very streaming. 796 00:37:51,350 --> 00:37:54,710 They're kind of random access and they work very well. 797 00:37:54,710 --> 00:37:56,410 Over here, target applications.
798 00:37:56,410 --> 00:37:58,980 There are lots of areas where we're trying 799 00:37:58,980 --> 00:38:02,260 to push cell forward. 800 00:38:02,260 --> 00:38:04,110 Clearly it works in the gaming industry, but 801 00:38:04,110 --> 00:38:04,960 where else can it work? 802 00:38:04,960 --> 00:38:08,360 So medical imaging, there's a lot of success there. 803 00:38:08,360 --> 00:38:11,580 The seismic imaging for petroleum, aerospace and 804 00:38:11,580 --> 00:38:13,080 defense for radar and sonar-- 805 00:38:13,080 --> 00:38:16,190 these are all signal processing apps. 806 00:38:16,190 --> 00:38:18,510 We're also looking at digital content creation 807 00:38:18,510 --> 00:38:20,220 for computer animation. 808 00:38:20,220 --> 00:38:21,470 Very well suited for cell. 809 00:38:21,470 --> 00:38:24,470 810 00:38:24,470 --> 00:38:28,040 This is kind of just what I said. 811 00:38:28,040 --> 00:38:29,380 Did I leave out anything? 812 00:38:29,380 --> 00:38:33,870 Finance-- once we have double precision we'll be doing 813 00:38:33,870 --> 00:38:35,700 things with finance. 814 00:38:35,700 --> 00:38:37,940 We actually demonstrated that things work very well. 815 00:38:37,940 --> 00:38:41,690 You know, Metropolis algorithms, Monte Carlo, Black-Scholes 816 00:38:41,690 --> 00:38:43,620 algorithms if you're familiar with these kind of 817 00:38:43,620 --> 00:38:47,140 things from finance. 818 00:38:47,140 --> 00:38:48,960 They tell us they need double precision and we're like, you 819 00:38:48,960 --> 00:38:51,240 don't really need double precision, come on. 820 00:38:51,240 --> 00:38:56,460 I mean, what you have is some mathematical calculation that 821 00:38:56,460 --> 00:38:57,900 you're doing and you're doing it over and over and over. 822 00:38:57,900 --> 00:39:00,190 And Monte Carlo there's so much noise, we say to these 823 00:39:00,190 --> 00:39:01,150 people, why do you need double precision?
824 00:39:01,150 --> 00:39:06,040 It turns out with decimal notation you can only go up to 825 00:39:06,040 --> 00:39:08,730 like a billion or something in single precision. 826 00:39:08,730 --> 00:39:11,180 So they have more dollars than that, so they need double, for 827 00:39:11,180 --> 00:39:13,060 that reason alone. 828 00:39:13,060 --> 00:39:15,620 But this gets back to the sloppiness of programmers. 829 00:39:15,620 --> 00:39:18,030 And I'm guilty of this myself. 830 00:39:18,030 --> 00:39:18,910 They said, oh we have double. 831 00:39:18,910 --> 00:39:19,930 Let's use double. 832 00:39:19,930 --> 00:39:21,720 They didn't need to, but they did it anyway. 833 00:39:21,720 --> 00:39:24,990 And now their legacy code is stuck with double. 834 00:39:24,990 --> 00:39:28,840 They could convert it all to single, but it's too painful. 835 00:39:28,840 --> 00:39:32,410 Down on Wall Street to build a new data center is like $100 836 00:39:32,410 --> 00:39:34,090 million proposition. 837 00:39:34,090 --> 00:39:37,330 And they do it regularly, all of the banks. 838 00:39:37,330 --> 00:39:40,050 They'll be generating a new data center every year, 839 00:39:40,050 --> 00:39:43,700 sometimes multiple times a year and they just don't have 840 00:39:43,700 --> 00:39:47,010 time or the resources to go through and redo all their 841 00:39:47,010 --> 00:39:49,270 code to make it run on something like cell. 842 00:39:49,270 --> 00:39:54,690 So we're making double precision cell. 843 00:39:54,690 --> 00:39:56,170 That's the short of it. 844 00:39:56,170 --> 00:40:00,210 All right, now software environment. 845 00:40:00,210 --> 00:40:03,640 This is stuff that you can find on the web and actually, 846 00:40:03,640 --> 00:40:06,260 it's changing a lot lately because we just 847 00:40:06,260 --> 00:40:09,430 released the 2.0 SDK.
848 00:40:09,430 --> 00:40:12,950 And so the stuff that's in the slide might not actually be 849 00:40:12,950 --> 00:40:16,480 the latest and greatest, but it's going to be epsilon away, 850 00:40:16,480 --> 00:40:17,970 so don't worry about it too much. 851 00:40:17,970 --> 00:40:20,020 But you really shouldn't trust these slides, you should go to 852 00:40:20,020 --> 00:40:23,300 the website and the website you want to go to is 853 00:40:23,300 --> 00:40:26,981 www.ibm.com/alphaworks. 854 00:40:26,981 --> 00:40:30,100 PROFESSOR: Tomorrow we are going to have a recitation 855 00:40:30,100 --> 00:40:32,310 session talking about the environment 856 00:40:32,310 --> 00:40:33,960 that we have created. 857 00:40:33,960 --> 00:40:36,500 I think we just got, probably just set up the latest 858 00:40:36,500 --> 00:40:39,815 environment and then we increase it through the three 859 00:40:39,815 --> 00:40:41,000 weeks we've got. 860 00:40:41,000 --> 00:40:44,180 This is changing faster than a three week cycle. 861 00:40:44,180 --> 00:40:45,430 So [UNINTELLIGIBLE PHRASE] 862 00:40:45,430 --> 00:40:47,590 863 00:40:47,590 --> 00:40:51,620 So this will give you a preview of what's going to be. 864 00:40:51,620 --> 00:40:52,460 MICHAEL PERRONE: Then you go to alphaworks, you go to 865 00:40:52,460 --> 00:40:55,510 search on alphaworks for cell and you get more information 866 00:40:55,510 --> 00:40:57,810 than you could ever possibly read. 867 00:40:57,810 --> 00:41:01,370 We have a programmer's manual that's 900 pages long, it's 868 00:41:01,370 --> 00:41:04,260 really good reading. 869 00:41:04,260 --> 00:41:07,730 Actually there's one thing in that 800, 900 pages 870 00:41:07,730 --> 00:41:08,460 that you really should read. 871 00:41:08,460 --> 00:41:10,600 It's called the cell programming tips chapter. 872 00:41:10,600 --> 00:41:14,450 It's a really nice chapter.
873 00:41:14,450 --> 00:41:17,140 But there are many, many publications and things like 874 00:41:17,140 --> 00:41:23,110 that, more than just the SDK and the OS and whatnot, so I 875 00:41:23,110 --> 00:41:25,410 encourage you to look at that. 876 00:41:25,410 --> 00:41:28,430 All right, so this is kind of the pyramid, the 877 00:41:28,430 --> 00:41:29,520 cell software pyramid. 878 00:41:29,520 --> 00:41:32,990 We've got the standards under here, the application binary 879 00:41:32,990 --> 00:41:36,710 interface, language extensions. 880 00:41:36,710 --> 00:41:39,380 And over here we have development tools and we'll 881 00:41:39,380 --> 00:41:42,130 talk about each of these pieces briefly. 882 00:41:42,130 --> 00:41:45,080 883 00:41:45,080 --> 00:41:49,350 These specifications define what's actually the reference 884 00:41:49,350 --> 00:41:52,030 implementation for the cell. 885 00:41:52,030 --> 00:41:56,480 C++ and C, they have language extensions in a similar way 886 00:41:56,480 --> 00:42:01,090 to the extensions for VMX or SSE on Intel. 887 00:42:01,090 --> 00:42:05,000 You have C extensions for cell that allow you to use 888 00:42:05,000 --> 00:42:12,200 intrinsics that actually run as SIMD instructions on cell. 889 00:42:12,200 --> 00:42:15,540 For example, you can say SPU underscore mul-add, and it's 890 00:42:15,540 --> 00:42:17,670 going to do a vector mul-add. 891 00:42:17,670 --> 00:42:24,060 So you can get assembly language level control over 892 00:42:24,060 --> 00:42:28,390 your code without having to use any assembly language. 893 00:42:28,390 --> 00:42:30,890 And then there's that. 894 00:42:30,890 --> 00:42:34,180 There is a full system simulator. 895 00:42:34,180 --> 00:42:40,050 The simulator is very, very accurate for things that do 896 00:42:40,050 --> 00:42:43,040 not run out to main memory.
897 00:42:43,040 --> 00:42:44,910 They've been working to improve this so I don't know 898 00:42:44,910 --> 00:42:47,810 if recently they have made it more accurate, but if you're 899 00:42:47,810 --> 00:42:52,090 doing compute intensive stuff, if you're compute bound the 900 00:42:52,090 --> 00:42:55,000 simulator can give you accuracies within 99%. 901 00:42:55,000 --> 00:42:58,120 You know, within 1% of the real value. 902 00:42:58,120 --> 00:43:02,050 I've only seen one thing on the simulator more than 1% off 903 00:43:02,050 --> 00:43:04,930 and that was 4%, so the simulator is very-- excuse 904 00:43:04,930 --> 00:43:06,220 me-- very reliable. 905 00:43:06,220 --> 00:43:08,260 And I encourage you to use it if you can't 906 00:43:08,260 --> 00:43:09,510 get access to hardware. 907 00:43:09,510 --> 00:43:12,600 908 00:43:12,600 --> 00:43:14,240 What else? 909 00:43:14,240 --> 00:43:16,710 The simulator has all kinds of tools in there. 910 00:43:16,710 --> 00:43:21,820 And I'm not going to go through the software stack in 911 00:43:21,820 --> 00:43:23,070 simulation. 912 00:43:23,070 --> 00:43:31,280 913 00:43:31,280 --> 00:43:33,090 This gives you a sense for-- 914 00:43:33,090 --> 00:43:35,330 you've got your hardware running here. 915 00:43:35,330 --> 00:43:38,280 You can run this on any one of these platforms. PowerPC, 916 00:43:38,280 --> 00:43:42,910 Intel with these OS's. 917 00:43:42,910 --> 00:43:46,560 The whole thing is written in Tcl, the simulator. 918 00:43:46,560 --> 00:43:48,930 And it has all these kinds of simulators. 919 00:43:48,930 --> 00:43:54,300 It's simulating the DMAs, it's simulating the caches and then 920 00:43:54,300 --> 00:43:56,300 you get a graphical user interface and a command line 921 00:43:56,300 --> 00:43:58,590 interface to that simulator. 922 00:43:58,590 --> 00:44:01,940 The graphical user interface is convenient, but the command 923 00:44:01,940 --> 00:44:03,160 line gives you much more control.
924 00:44:03,160 --> 00:44:04,860 You can tweak parameters. 925 00:44:04,860 --> 00:44:09,790 926 00:44:09,790 --> 00:44:14,850 This gives you a view of what the graphical 927 00:44:14,850 --> 00:44:17,600 user interface looks like. 928 00:44:17,600 --> 00:44:19,660 It says mambo zebra because that was a different project, 929 00:44:19,660 --> 00:44:21,360 but now it'd probably say system sim or 930 00:44:21,360 --> 00:44:23,780 something like that. 931 00:44:23,780 --> 00:44:26,040 And you'll see the PPC-- 932 00:44:26,040 --> 00:44:28,190 this is the PPE, I don't know why they changed it. 933 00:44:28,190 --> 00:44:32,090 And then you have SPE 0, SPE 1 going down and it 934 00:44:32,090 --> 00:44:35,240 gives you some access to these parameters. 935 00:44:35,240 --> 00:44:41,310 The model here, it says pipeline and then there's I 936 00:44:41,310 --> 00:44:43,090 think, functional mode or pipeline mode. 937 00:44:43,090 --> 00:44:45,570 Pipeline mode is where it's really simulating everything 938 00:44:45,570 --> 00:44:47,280 and it's much slower. 939 00:44:47,280 --> 00:44:48,760 But it's accurate. 940 00:44:48,760 --> 00:44:50,590 And then the other is functional mode just to test 941 00:44:50,590 --> 00:44:51,960 the code actually works as it's supposed to. 942 00:44:51,960 --> 00:44:55,136 PROFESSOR: I guess one point in the class what we'll try 943 00:44:55,136 --> 00:44:58,340 and do is since each group has access to the hardware, 944 00:44:58,340 --> 00:45:01,930 you can do most of the things in the real hardware and use 945 00:45:01,930 --> 00:45:03,430 the debugger in the hardware that's 946 00:45:03,430 --> 00:45:04,300 probably been talked about.
947 00:45:04,300 --> 00:45:07,950 But if things get really bad and you can't understand, use 948 00:45:07,950 --> 00:45:11,030 the simulator as a very accurate debugger only when it's 949 00:45:11,030 --> 00:45:13,250 needed because there you can look at every 950 00:45:13,250 --> 00:45:14,870 little detail inside. 951 00:45:14,870 --> 00:45:17,980 This is kind of a thing, a last resort type thing. 952 00:45:17,980 --> 00:45:19,930 MICHAEL PERRONE: Yeah, I agree. 953 00:45:19,930 --> 00:45:21,390 Like I said, I've been doing this for three years. 954 00:45:21,390 --> 00:45:23,590 Three years ago we didn't even have hardware. 955 00:45:23,590 --> 00:45:27,120 So the simulator was all we had, so we relied on it a lot. 956 00:45:27,120 --> 00:45:29,880 But I think that usage of it makes a lot of sense. 957 00:45:29,880 --> 00:45:33,550 958 00:45:33,550 --> 00:45:34,900 This is the graphical interface. 959 00:45:34,900 --> 00:45:36,720 You know, it's just a Tcl interface. 960 00:45:36,720 --> 00:45:41,240 961 00:45:41,240 --> 00:45:42,440 I'm going to skip through these things. 962 00:45:42,440 --> 00:45:47,350 It just shows you how you can look at memory with this more 963 00:45:47,350 --> 00:45:48,970 memory access. 964 00:45:48,970 --> 00:45:49,830 You get some graphical 965 00:45:49,830 --> 00:45:51,630 representation of various pieces. 966 00:45:51,630 --> 00:45:52,660 You know, how many stalls? 967 00:45:52,660 --> 00:45:53,740 How many loads? 968 00:45:53,740 --> 00:45:55,590 How many DMA transactions? 969 00:45:55,590 --> 00:45:57,320 So you can see what's going on at that level. 970 00:45:57,320 --> 00:46:00,270 971 00:46:00,270 --> 00:46:02,090 And all of this can be pulled together into 972 00:46:02,090 --> 00:46:05,240 this UART window here. 973 00:46:05,240 --> 00:46:09,680 OK, so the Linux, it's pretty standard Linux, but it has 974 00:46:09,680 --> 00:46:12,410 some extensions. 975 00:46:12,410 --> 00:46:14,820 Let's see.
976 00:46:14,820 --> 00:46:16,930 Provided as a patch, yeah. 977 00:46:16,930 --> 00:46:17,730 That might be wrong. 978 00:46:17,730 --> 00:46:21,490 I don't know where we are currently. 979 00:46:21,490 --> 00:46:24,980 You have this SPE thread API for creating 980 00:46:24,980 --> 00:46:28,020 threads from the PPEs. 981 00:46:28,020 --> 00:46:30,850 Let's see. 982 00:46:30,850 --> 00:46:32,330 What do I want to tell you here? 983 00:46:32,330 --> 00:46:35,680 There's a better slide for this kind of information. 984 00:46:35,680 --> 00:46:39,220 They share the memory space, we talked about that. 985 00:46:39,220 --> 00:46:41,830 There's error event and signal handling. 986 00:46:41,830 --> 00:46:45,630 So there are multiple ways you communicate. 987 00:46:45,630 --> 00:46:50,030 You can communicate with the interrupts and the event and 988 00:46:50,030 --> 00:46:53,770 signaling that way or you can use these mailboxes. 989 00:46:53,770 --> 00:46:56,640 So each SPE has its own mailbox and inbox and an 990 00:46:56,640 --> 00:46:59,750 outbox so you can post something to your outbox and 991 00:46:59,750 --> 00:47:01,770 then the PPE will read it when it's ready. 992 00:47:01,770 --> 00:47:05,030 Or you can read from your inbox waiting on the PPE to 993 00:47:05,030 --> 00:47:05,790 write something. 994 00:47:05,790 --> 00:47:07,960 You have to be careful because you can stall there. 995 00:47:07,960 --> 00:47:11,970 If the PPE hasn't written you will stall waiting for 996 00:47:11,970 --> 00:47:12,770 something to fill up. 997 00:47:12,770 --> 00:47:14,460 So you can do a check. 998 00:47:14,460 --> 00:47:16,150 There are ways to get around that, but these are kind of 999 00:47:16,150 --> 00:47:18,040 common gotchas that you have to watch out for. 1000 00:47:18,040 --> 00:47:22,410 1001 00:47:22,410 --> 00:47:25,360 Then you have the mailboxes, you have the interrupts, you 1002 00:47:25,360 --> 00:47:26,100 also have DMAs. 
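The mailbox gotcha described above (a read on an empty inbox stalls, so check first) can be sketched with a small model. This is a plain Python simulation, not the Cell SDK API: the `Mailbox` class, the `spe_poll` helper and the capacity value are all hypothetical names and numbers chosen for illustration.

```python
from collections import deque


class Mailbox:
    """Toy model of one direction of an SPE mailbox: a small bounded queue."""

    def __init__(self, capacity):
        self.capacity = capacity          # assumed depth, not taken from the talk
        self.entries = deque()

    def count(self):
        # Number of messages waiting; checking this never blocks.
        return len(self.entries)

    def write(self, value):
        # Return False instead of blocking when the mailbox is full.
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(value)
        return True

    def try_read(self):
        # Non-blocking read: None means nothing has been posted yet.
        if not self.entries:
            return None
        return self.entries.popleft()


def spe_poll(inbox, do_other_work):
    """Check the inbox first; only read when a message is actually waiting."""
    if inbox.count() == 0:
        do_other_work()                   # keep computing instead of stalling
        return None
    return inbox.try_read()


inbox = Mailbox(capacity=4)
work_log = []

first = spe_poll(inbox, lambda: work_log.append("computed while waiting"))
inbox.write(42)                           # the PPE side posts a message
second = spe_poll(inbox, lambda: work_log.append("computed while waiting"))
```

The first poll finds the inbox empty and does other work; the second poll returns the posted value. On real hardware the same idea applies: test the mailbox status before issuing a read that could block.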
1003 00:47:26,100 --> 00:47:28,300 You can do communication with DMAs so you have at least 1004 00:47:28,300 --> 00:47:29,900 three different ways that you communicate 1005 00:47:29,900 --> 00:47:33,580 between the SPEs on cell. 1006 00:47:33,580 --> 00:47:37,250 And which one is going to be best really depends on the 1007 00:47:37,250 --> 00:47:40,050 algorithm you're running. 1008 00:47:40,050 --> 00:47:42,330 So these are the extensions to Linux. 1009 00:47:42,330 --> 00:47:43,800 This is going to show you a bunch of things that you 1010 00:47:43,800 --> 00:47:46,800 probably won't be able to read, but there's something 1011 00:47:46,800 --> 00:47:51,580 called SPUFS, the file system that has a bunch of open, 1012 00:47:51,580 --> 00:47:53,900 read, write, and close functionality. 1013 00:47:53,900 --> 00:47:57,450 1014 00:47:57,450 --> 00:48:01,630 And then we also have this signaling and the mailboxes 1015 00:48:01,630 --> 00:48:03,650 that I mentioned to you previously. 1016 00:48:03,650 --> 00:48:04,870 And this you can't even read. 1017 00:48:04,870 --> 00:48:05,850 I can't even read this one. 1018 00:48:05,850 --> 00:48:08,300 What is it? 1019 00:48:08,300 --> 00:48:10,060 Ah, this is perhaps the most important one. 1020 00:48:10,060 --> 00:48:13,790 It says SPU create thread. 1021 00:48:13,790 --> 00:48:19,370 So the SPEs from the Linux point of view are just threads 1022 00:48:19,370 --> 00:48:20,440 that are running. 1023 00:48:20,440 --> 00:48:23,290 The Linux doesn't really know that they're special purpose 1024 00:48:23,290 --> 00:48:25,890 hardware, it just knows it's a thread and you can do things 1025 00:48:25,890 --> 00:48:29,775 like spawn a thread, kill a thread, wait on a thread-- all 1026 00:48:29,775 --> 00:48:33,490 the usual things that you can do with threads. 1027 00:48:33,490 --> 00:48:34,970 So it's a lot like pthreads, but it's 1028 00:48:34,970 --> 00:48:36,980 not actually pthreads.
1029 00:48:36,980 --> 00:48:40,590 So here you could see these things are more useful. 1030 00:48:40,590 --> 00:48:42,710 This is SPE create groups. 1031 00:48:42,710 --> 00:48:46,370 So you can create a thread and thread group so that threads 1032 00:48:46,370 --> 00:48:49,200 that are part of the same group know about one another. 1033 00:48:49,200 --> 00:48:51,620 So you can partition your system and have three SPEs 1034 00:48:51,620 --> 00:48:53,740 doing one thing and five doing another. 1035 00:48:53,740 --> 00:48:56,060 So that you can split it up however you like. 1036 00:48:56,060 --> 00:48:58,940 You have get and set affinity so that you can choose which 1037 00:48:58,940 --> 00:49:01,750 SPEs are running which tasks, so that you can get more 1038 00:49:01,750 --> 00:49:05,800 efficient use of that element interconnect bus. 1039 00:49:05,800 --> 00:49:10,260 Kill and waits, open, close, writing signals, the usual. 1040 00:49:10,260 --> 00:49:15,110 1041 00:49:15,110 --> 00:49:17,490 Let me check my time here. 1042 00:49:17,490 --> 00:49:22,410 I really don't have a lot more time, so I'm going to say that 1043 00:49:22,410 --> 00:49:24,030 we have this thread management library. 1044 00:49:24,030 --> 00:49:26,660 It has the functionality that I just mentioned. 1045 00:49:26,660 --> 00:49:28,470 In the next month or so you're going to go through that in a 1046 00:49:28,470 --> 00:49:29,990 lot more detail. 1047 00:49:29,990 --> 00:49:35,860 1048 00:49:35,860 --> 00:49:38,340 The SPE comes with a lot of sample libraries. 1049 00:49:38,340 --> 00:49:41,410 These are not necessarily the very best implementation of 1050 00:49:41,410 --> 00:49:43,440 these libraries and they're not even fully functional 1051 00:49:43,440 --> 00:49:46,500 libraries, but they're suggestive of first of all, 1052 00:49:46,500 --> 00:49:50,900 how things can be written to cell, how to use cell, and in 1053 00:49:50,900 --> 00:49:53,000 some cases how to optimize cell. 
1054 00:49:53,000 --> 00:49:55,790 Like the basic matrix operations, there's some 1055 00:49:55,790 --> 00:49:56,670 optimization. 1056 00:49:56,670 --> 00:49:58,970 The FFTs are very tightly optimized, so if you want to 1057 00:49:58,970 --> 00:50:01,470 take a look at that and understand how to do that type 1058 00:50:01,470 --> 00:50:04,010 of memory manipulation. 1059 00:50:04,010 --> 00:50:08,940 So there are sample codes out there that can be very useful. 1060 00:50:08,940 --> 00:50:10,240 We'll skip that. 1061 00:50:10,240 --> 00:50:12,400 Oh, this is that FFT 16 million. 1062 00:50:12,400 --> 00:50:15,940 There's an example, it's on the SDK. 1063 00:50:15,940 --> 00:50:18,340 Actually, I don't know if you've got PS3s if all these 1064 00:50:18,340 --> 00:50:20,070 things can run. 1065 00:50:20,070 --> 00:50:20,900 They should run. 1066 00:50:20,900 --> 00:50:23,820 Yeah, they should run. 1067 00:50:23,820 --> 00:50:25,850 There may be some memory issues out to main memory that 1068 00:50:25,850 --> 00:50:29,090 I'm not aware of. 1069 00:50:29,090 --> 00:50:32,040 There are all kinds of demos there that you can play with, 1070 00:50:32,040 --> 00:50:35,620 which are good for learning how to spawn threads and 1071 00:50:35,620 --> 00:50:38,030 things like that. 1072 00:50:38,030 --> 00:50:41,360 You have your basic GNU binutils tools. 1073 00:50:41,360 --> 00:50:43,670 There's GCC out there. 1074 00:50:43,670 --> 00:50:45,150 There's also XLC. 1075 00:50:45,150 --> 00:50:48,530 You can download XLC. 1076 00:50:48,530 --> 00:50:51,420 In some cases, one will be better than the other, but I 1077 00:50:51,420 --> 00:50:53,780 think in most cases XLC's a little better. 1078 00:50:53,780 --> 00:50:57,210 Or in some cases, actually a lot better. 1079 00:50:57,210 --> 00:50:59,240 So you can get that. 1080 00:50:59,240 --> 00:51:00,820 I'd recommend that.
1081 00:51:00,820 --> 00:51:04,110 There's a debugger which provides application source 1082 00:51:04,110 --> 00:51:06,160 level debugging. 1083 00:51:06,160 --> 00:51:08,790 PPE multithreading, SPE multithreading, the 1084 00:51:08,790 --> 00:51:11,310 interaction between these guys. 1085 00:51:11,310 --> 00:51:15,430 There are three modes for the debugger: stand alone and then 1086 00:51:15,430 --> 00:51:17,750 attached to SPE threads. 1087 00:51:17,750 --> 00:51:19,000 Sounds like two. 1088 00:51:19,000 --> 00:51:22,270 1089 00:51:22,270 --> 00:51:26,120 That's problematic. 1090 00:51:26,120 --> 00:51:28,130 There's this nice static analysis tool. 1091 00:51:28,130 --> 00:51:30,140 This is good for looking for really tightly, 1092 00:51:30,140 --> 00:51:31,330 optimizing your code. 1093 00:51:31,330 --> 00:51:33,070 You have to be able to read assembly, but it shows you 1094 00:51:33,070 --> 00:51:34,810 graphically-- 1095 00:51:34,810 --> 00:51:36,430 kind of-- 1096 00:51:36,430 --> 00:51:38,800 where the stalls are happening and you can try and 1097 00:51:38,800 --> 00:51:40,890 reorganize your code. 1098 00:51:40,890 --> 00:51:44,720 And then like Saman suggested, the dynamic analysis using the 1099 00:51:44,720 --> 00:51:48,880 simulator is a good way to really get cycle by cycle 1100 00:51:48,880 --> 00:51:51,190 stepping through the code. 1101 00:51:51,190 --> 00:51:54,220 And someone was very excited when they made this chart 1102 00:51:54,220 --> 00:51:55,720 because they put these big explosions here. 1103 00:51:55,720 --> 00:51:58,500 1104 00:51:58,500 --> 00:52:02,790 You've got some compiler here that's going to be generating 1105 00:52:02,790 --> 00:52:07,270 two pieces of code, the PPE binary and the SPE binary. 
1106 00:52:07,270 --> 00:52:11,210 When you go through the cell tutorials for training on how 1107 00:52:11,210 --> 00:52:14,900 to program cell you'll see that this code is actually 1108 00:52:14,900 --> 00:52:17,900 plugged into-- linked into the PPE code. 1109 00:52:17,900 --> 00:52:21,170 And when the PPE code spawns a thread it's going to take a 1110 00:52:21,170 --> 00:52:25,030 pointer to this code and basically DMA that code into 1111 00:52:25,030 --> 00:52:27,540 the SPE and tell the SPE to start running. 1112 00:52:27,540 --> 00:52:31,180 Once it's done that, that thread is independent. 1113 00:52:31,180 --> 00:52:34,220 The PPE could kill it, but it could just let it run to its 1114 00:52:34,220 --> 00:52:37,060 natural termination or this thing could terminate itself 1115 00:52:37,060 --> 00:52:41,370 or it could be interrupted by some other communication. 1116 00:52:41,370 --> 00:52:42,890 But that's the basic process, you have these 1117 00:52:42,890 --> 00:52:45,900 two pieces of code. 1118 00:52:45,900 --> 00:52:51,070 OK, so now this is really what I wanted to get to. 1119 00:52:51,070 --> 00:52:54,620 So I want lots of questions here. 1120 00:52:54,620 --> 00:52:59,800 There are 4 levels of parallelism in cell. 1121 00:52:59,800 --> 00:53:02,680 On the cell blade, the IBM blade you have two cell 1122 00:53:02,680 --> 00:53:04,270 processors per blade. 1123 00:53:04,270 --> 00:53:06,570 So that's one level of parallelism. 1124 00:53:06,570 --> 00:53:08,160 At chip level we know there are 9 cores and they're all 1125 00:53:08,160 --> 00:53:08,900 running independently. 1126 00:53:08,900 --> 00:53:11,050 That's another level of parallelism. 1127 00:53:11,050 --> 00:53:14,170 On the instruction level each of the SPEs has two 1128 00:53:14,170 --> 00:53:18,010 instruction pipelines, so it's an odd and an even pipeline. 
1129 00:53:18,010 --> 00:53:19,860 One pipeline is doing things-- 1130 00:53:19,860 --> 00:53:23,370 the odd pipeline is doing loads and stores, DMA 1131 00:53:23,370 --> 00:53:30,840 transactions, interrupts, branches and it's doing 1132 00:53:30,840 --> 00:53:33,610 something called shuffle byte or the shuffle operation. 1133 00:53:33,610 --> 00:53:36,270 So shuffle operation's a very, very useful operation that 1134 00:53:36,270 --> 00:53:41,140 allows you to take two registers as data, a third 1135 00:53:41,140 --> 00:53:44,730 register as a pattern register, and the fourth 1136 00:53:44,730 --> 00:53:46,530 register as output. 1137 00:53:46,530 --> 00:53:50,040 It then, from this pattern, will choose arbitrarily the 1138 00:53:50,040 --> 00:53:53,210 bytes that are in these two and reconstitute them into 1139 00:53:53,210 --> 00:53:54,990 this fourth register. 1140 00:53:54,990 --> 00:53:58,350 It's wonderful for doing manipulations and shuffling 1141 00:53:58,350 --> 00:53:59,360 things around. 1142 00:53:59,360 --> 00:54:02,870 Like shuffling a deck of cards, you could take all of 1143 00:54:02,870 --> 00:54:04,820 these and ignore this or you could take the first one here, 1144 00:54:04,820 --> 00:54:07,410 replicate it 16 times or you could take a random sampling 1145 00:54:07,410 --> 00:54:09,120 from these, put into that register. 1146 00:54:09,120 --> 00:54:12,172 AUDIENCE: Do you use that specifically for the 1147 00:54:12,172 --> 00:54:13,630 [UNINTELLIGIBLE]? 1148 00:54:13,630 --> 00:54:14,670 MICHAEL PERRONE: We do use it, yeah. 1149 00:54:14,670 --> 00:54:18,010 Yeah, you take a look, you'll see we use shuffle a lot. 1150 00:54:18,010 --> 00:54:20,540 It's surprising how valuable shuffle can be. 1151 00:54:20,540 --> 00:54:23,280 However, then you have to worry now, you've got the 1152 00:54:23,280 --> 00:54:28,300 shuffle here, if you're doing like matrix transpose, it's 1153 00:54:28,300 --> 00:54:30,350 all shuffles. 
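The shuffle operation described above (two data registers, one pattern register, one output register) can be modeled in a few lines. This is a simplified Python sketch, not the hardware instruction: `shuffle_bytes` is a hypothetical name, registers are modeled as 16-byte sequences, and any special pattern codes the real instruction may support are not modeled.

```python
def shuffle_bytes(a, b, pattern):
    """Model of a byte shuffle: each pattern entry picks one byte of a + b.

    a, b and pattern are 16-byte sequences. Pattern values 0-15 select a
    byte from a, values 16-31 select a byte from b.
    """
    if not (len(a) == len(b) == len(pattern) == 16):
        raise ValueError("all three registers must be 16 bytes")
    source = bytes(a) + bytes(b)
    return bytes(source[p & 0x1F] for p in pattern)


a = bytes(range(0xA0, 0xB0))    # A0 A1 ... AF
b = bytes(range(0xB0, 0xC0))    # B0 B1 ... BF

# Take all of a and ignore b.
copy_a = shuffle_bytes(a, b, list(range(16)))

# Replicate the first byte of a 16 times.
splat = shuffle_bytes(a, b, [0] * 16)

# Interleave the low halves of a and b, the kind of step a transpose uses.
interleave = shuffle_bytes(
    a, b, [0, 16, 1, 17, 2, 18, 3, 19, 4, 20, 5, 21, 6, 22, 7, 23]
)
```

The three patterns correspond to the cases mentioned in the talk: take everything from one input, replicate one byte across the register, or pick an arbitrary mix from both inputs.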
1154 00:54:30,350 --> 00:54:32,090 But what's a matrix transpose? 1155 00:54:32,090 --> 00:54:34,490 It's really bandwidth bound, right? 1156 00:54:34,490 --> 00:54:36,940 Because you're pulling data in, shuffling it around and 1157 00:54:36,940 --> 00:54:37,350 sending it out. 1158 00:54:37,350 --> 00:54:39,640 Well, where are the reads and writes? 1159 00:54:39,640 --> 00:54:40,590 They're on the odd pipeline. 1160 00:54:40,590 --> 00:54:41,360 Where are the shuffles? 1161 00:54:41,360 --> 00:54:42,970 They're on the odd pipeline. 1162 00:54:42,970 --> 00:54:45,390 So now you can have a situation where it's all 1163 00:54:45,390 --> 00:54:50,360 shuffle, shuffle, shuffle, shuffle and then the 1164 00:54:50,360 --> 00:54:53,950 instruction pre-fetch buffer gets starved and so it stalls 1165 00:54:53,950 --> 00:54:56,840 for 15, 17 cycles while I have to load. 1166 00:54:56,840 --> 00:54:59,900 Basically, it's a tiny little loop. 1167 00:54:59,900 --> 00:55:01,710 But you get stalls and you get really bad performance. 1168 00:55:01,710 --> 00:55:04,480 So then you have to tell the compiler-- 1169 00:55:04,480 --> 00:55:05,880 actually, the compiler is getting 1170 00:55:05,880 --> 00:55:07,170 better at these things. 1171 00:55:07,170 --> 00:55:10,550 Much better than it used to be or by hand you can force it to 1172 00:55:10,550 --> 00:55:12,910 leave a slot for the pre-fetch. 1173 00:55:12,910 --> 00:55:14,690 These are gotchas that programmers 1174 00:55:14,690 --> 00:55:17,470 have to be aware of. 1175 00:55:17,470 --> 00:55:20,800 On the other pipeline you have all your normal operations. 1176 00:55:20,800 --> 00:55:25,620 So you have your mul-adds, your bit operations, all the 1177 00:55:25,620 --> 00:55:28,060 shift and things like that, they're all over there.
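The odd/even pipeline split above explains why a loop made only of loads, shuffles and stores cannot dual issue. The toy model below counts issue cycles under a deliberately simple assumption: an even-pipe instruction immediately followed by an odd-pipe instruction issues in one cycle, and everything else issues alone. The instruction names, the pairing rule and the `issue_cycles` helper are illustrative assumptions; dependencies, latencies and instruction fetch are ignored.

```python
# Pipeline assignment as described in the talk (instruction names are illustrative).
ODD = {"load", "store", "shuffle", "branch"}
EVEN = {"madd", "add", "and", "shift"}


def issue_cycles(instructions):
    """Count issue cycles for an in-order machine with two pipelines.

    Assumed rule: an even-pipe instruction directly followed by an odd-pipe
    instruction issues as a pair; anything else issues on its own.
    """
    cycles = 0
    i = 0
    while i < len(instructions):
        pair = instructions[i:i + 2]
        if len(pair) == 2 and pair[0] in EVEN and pair[1] in ODD:
            i += 2          # dual issue
        else:
            i += 1          # single issue
        cycles += 1
    return cycles


transpose_like = ["load", "shuffle", "shuffle", "store"] * 4   # all odd pipe
balanced = ["madd", "load", "add", "store"] * 4                # alternating pipes

transpose_cycles = issue_cycles(transpose_like)
balanced_cycles = issue_cycles(balanced)
```

Both loops have 16 instructions, but in this model the all-odd loop needs 16 issue cycles while the interleaved one needs 8, which is the reason interleaving work across the two pipelines matters.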
1178 00:55:28,060 --> 00:55:30,500 There is one other operation on the odd pipeline and I 1179 00:55:30,500 --> 00:55:32,730 think it's a quad word rotate or 1180 00:55:32,730 --> 00:55:36,560 something, but I don't remember. 1181 00:55:36,560 --> 00:55:40,710 So that's instruction level dual issue parallelism. 1182 00:55:40,710 --> 00:55:43,280 AUDIENCE: [UNINTELLIGIBLE PHRASE] 1183 00:55:43,280 --> 00:55:44,280 MICHAEL PERRONE: Everything is in order on 1184 00:55:44,280 --> 00:55:45,340 this processor, yeah. 1185 00:55:45,340 --> 00:55:47,080 And that was done for power reasons, right? 1186 00:55:47,080 --> 00:55:49,760 Get rid of all the space and all the transistors that are 1187 00:55:49,760 --> 00:55:51,730 doing all this fancy, out of order 1188 00:55:51,730 --> 00:55:53,600 processing to save power. 1189 00:55:53,600 --> 00:55:54,850 AUDIENCE: [UNINTELLIGIBLE PHRASE] 1190 00:55:54,850 --> 00:56:18,050 1191 00:56:18,050 --> 00:56:19,270 MICHAEL PERRONE: That's a really good point. 1192 00:56:19,270 --> 00:56:22,810 When you're doing scalar processing you think well, 1193 00:56:22,810 --> 00:56:25,465 you're thinking I'm going to-- kind of conceptually, you want 1194 00:56:25,465 --> 00:56:27,050 to have all the things that are doing the same thing 1195 00:56:27,050 --> 00:56:27,960 together, right? 1196 00:56:27,960 --> 00:56:30,160 That's how I used to program. 1197 00:56:30,160 --> 00:56:32,590 You put all this stuff here then you do maybe all your 1198 00:56:32,590 --> 00:56:35,320 reads or whatever and then you do all your computes and you 1199 00:56:35,320 --> 00:56:36,290 can't do it that way. 1200 00:56:36,290 --> 00:56:38,370 You have to really think about how are you going to interleave 1201 00:56:38,370 --> 00:56:39,600 these things.
1202 00:56:39,600 --> 00:56:43,990 Now the compiler will help you, but to get really high 1203 00:56:43,990 --> 00:56:46,680 performance you have to have better tools and we don't have 1204 00:56:46,680 --> 00:56:47,550 those tools yet. 1205 00:56:47,550 --> 00:56:50,140 And so I'm hoping that you guys are the ones that are 1206 00:56:50,140 --> 00:56:52,380 going to come up with the new tools, the new ideas that are 1207 00:56:52,380 --> 00:56:54,420 going to really help people improve 1208 00:56:54,420 --> 00:56:57,970 programmability in cell. 1209 00:56:57,970 --> 00:57:00,930 Then at the lowest level you have the register level 1210 00:57:00,930 --> 00:57:05,320 parallelism where you can have four single precision float 1211 00:57:05,320 --> 00:57:08,720 ops going simultaneously. 1212 00:57:08,720 --> 00:57:11,250 So when you're programming cell you have to keep all of 1213 00:57:11,250 --> 00:57:13,140 these levels of hierarchy in your head. 1214 00:57:13,140 --> 00:57:15,860 It's not straight scalar programming anymore. 1215 00:57:15,860 --> 00:57:18,070 And if you think of it that way you're just not going to 1216 00:57:18,070 --> 00:57:20,910 get the performance that you're looking for, period. 1217 00:57:20,910 --> 00:57:24,600 1218 00:57:24,600 --> 00:57:26,960 Another consideration is this local store. 1219 00:57:26,960 --> 00:57:30,880 Each local store is 256 kilobytes. 1220 00:57:30,880 --> 00:57:32,130 That's not a lot of space. 1221 00:57:32,130 --> 00:57:35,110 1222 00:57:35,110 --> 00:57:37,760 You have to think about how are you going to bring the 1223 00:57:37,760 --> 00:57:41,680 data in so that the chunks are big enough, but not too big 1224 00:57:41,680 --> 00:57:43,050 because if they're too big then you won't be able 1225 00:57:43,050 --> 00:57:44,300 to get multibuffering. 1226 00:57:44,300 --> 00:57:48,120 1227 00:57:48,120 --> 00:57:49,930 Let's back up a little bit more.
1228 00:57:49,930 --> 00:57:54,640 The local store holds the data, but it also holds the 1229 00:57:54,640 --> 00:57:56,730 code that you're running. 1230 00:57:56,730 --> 00:58:02,350 So if you have 200 kilobytes of code then you only have 56 1231 00:58:02,350 --> 00:58:03,950 kilobytes of data space. 1232 00:58:03,950 --> 00:58:06,080 And if you want to have double buffering that means you only 1233 00:58:06,080 --> 00:58:15,400 have 25 kilobytes and then as Saman correctly points out 1234 00:58:15,400 --> 00:58:17,950 there's a problem with stack. 1235 00:58:17,950 --> 00:58:20,390 So if you're going to have recursion in your code or 1236 00:58:20,390 --> 00:58:23,550 something nasty like that, you're going to start pushing 1237 00:58:23,550 --> 00:58:25,630 stack variables off the register file. 1238 00:58:25,630 --> 00:58:27,020 So where do they go? 1239 00:58:27,020 --> 00:58:29,130 They go in the local store. 1240 00:58:29,130 --> 00:58:34,200 What prevents the stack from overwriting your data? 1241 00:58:34,200 --> 00:58:35,520 Nothing. 1242 00:58:35,520 --> 00:58:38,160 Nothing at all and that's a big gotcha. 1243 00:58:38,160 --> 00:58:42,620 I've seen over the past three years maybe 30 separate 1244 00:58:42,620 --> 00:58:46,470 algorithms implemented on cell and I know of only one that 1245 00:58:46,470 --> 00:58:48,030 was definitely doing that. 1246 00:58:48,030 --> 00:58:51,080 But you know, if there are 30 in this class maybe you're 1247 00:58:51,080 --> 00:58:52,420 going to be the one that that happens to. 1248 00:58:52,420 --> 00:58:57,970 So you have to be aware of that and you have 1249 00:58:57,970 --> 00:58:58,400 to deal with it.
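The local store budget above is simple arithmetic, and writing it down makes the squeeze visible. The sketch below is only an illustration: the `buffer_budget` helper, the 6-kilobyte stack reserve and the choice to round buffers down to a multiple of 128 bytes are assumptions, not figures from the talk.

```python
LOCAL_STORE_BYTES = 256 * 1024


def buffer_budget(code_bytes, stack_reserve_bytes, num_buffers):
    """Bytes available per data buffer after code and a stack reserve."""
    free = LOCAL_STORE_BYTES - code_bytes - stack_reserve_bytes
    if free <= 0:
        raise ValueError("code and stack alone do not fit in local store")
    per_buffer = free // num_buffers
    # Round down to a multiple of 128 bytes (an assumed alignment choice).
    return per_buffer - per_buffer % 128


# 200 KB of code, no stack reserve, double buffering: 28 KB per buffer.
no_stack = buffer_budget(200 * 1024, 0, 2)

# Same code with a hypothetical 6 KB set aside for the stack: 25 KB per buffer.
with_stack = buffer_budget(200 * 1024, 6 * 1024, 2)
```

With 200 kilobytes of code, two buffers get 28 kilobytes each before any stack is accounted for, and less once some space is held back for the stack.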
1250 00:58:58,400 --> 00:59:02,240 So what you can do, is in the local store put some DEADBEEF 1251 00:59:02,240 --> 00:59:07,400 thing in there so that you can look for an overwrite and that 1252 00:59:07,400 --> 00:59:10,240 will let you know that either you have to make your code 1253 00:59:10,240 --> 00:59:14,890 smaller or your data smaller or get rid of recursion. 1254 00:59:14,890 --> 00:59:18,350 On SPEs, recursion is kind of anathema. 1255 00:59:18,350 --> 00:59:19,900 Inlining is good. 1256 00:59:19,900 --> 00:59:25,220 Inlining really can accelerate your code's performance. 1257 00:59:25,220 --> 00:59:28,310 Oh yeah, it says stack right there. 1258 00:59:28,310 --> 00:59:30,330 You're reading ahead on me here. 1259 00:59:30,330 --> 00:59:32,340 Yes, so all three are in there and you have 1260 00:59:32,340 --> 00:59:33,780 to be aware of that. 1261 00:59:33,780 --> 00:59:37,000 Now there is a memory management library, very 1262 00:59:37,000 --> 00:59:39,960 lightweight library on the SPE and it's going to prevent your 1263 00:59:39,960 --> 00:59:42,930 data from overwriting your code because once the code's 1264 00:59:42,930 --> 00:59:45,820 loaded that memory management library knows where it is and 1265 00:59:45,820 --> 00:59:47,320 it will stop 1266 00:59:47,320 --> 00:59:50,830 you from allocating, doing a [? malloc ?] 1267 00:59:50,830 --> 00:59:52,150 over this code. 1268 00:59:52,150 --> 00:59:53,850 But the stack's up for grabs. 1269 00:59:53,850 --> 00:59:56,270 And that was again done because of power 1270 00:59:56,270 --> 00:59:58,220 considerations and real estate on the chip. 1271 00:59:58,220 --> 01:00:02,640 If you want to have a chip that's this big you can have 1272 01:00:02,640 --> 01:00:05,950 anything you want, but manufacturing it is impossible.
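The marker trick mentioned above (put a known value in the local store and look for an overwrite) can be sketched as follows. The local store is modeled as a Python `bytearray`; the guard offset, the helper names and the simulated overflow are hypothetical and only illustrate the check.

```python
import struct

CANARY = 0xDEADBEEF
CANARY_BYTES = struct.pack(">I", CANARY)    # 4-byte big-endian marker


def plant_canary(memory, offset):
    """Write the marker at the boundary we want to watch."""
    memory[offset:offset + 4] = CANARY_BYTES


def canary_intact(memory, offset):
    """True while nothing has written over the marker."""
    return bytes(memory[offset:offset + 4]) == CANARY_BYTES


local_store = bytearray(256 * 1024)
guard = 200 * 1024          # hypothetical boundary between data and stack

plant_canary(local_store, guard)
ok_before = canary_intact(local_store, guard)

# Simulate the stack growing past the guard and clobbering it.
local_store[guard - 16:guard + 16] = bytes([0x55]) * 32
ok_after = canary_intact(local_store, guard)
```

Checking the marker at convenient points, for example after each work item, turns a silent stack overwrite into something that can be detected and reported.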
1273 01:00:05,950 --> 01:00:08,170 So things were removed and that was one of the things 1274 01:00:08,170 --> 01:00:09,440 that's removed and that's one of the things you have to 1275 01:00:09,440 --> 01:00:11,040 watch out for. 1276 01:00:11,040 --> 01:00:14,010 And communication, we've talked about this quite a bit. 1277 01:00:14,010 --> 01:00:17,380 1278 01:00:17,380 --> 01:00:20,460 I didn't mention this: the DMA transactions-- oh, 1279 01:00:20,460 --> 01:00:21,685 question in the back? 1280 01:00:21,685 --> 01:00:25,151 AUDIENCE: Is there any reasonable possibility of 1281 01:00:25,151 --> 01:00:26,665 doing things dynamically? 1282 01:00:26,665 --> 01:00:32,670 1283 01:00:32,670 --> 01:00:39,000 Is it at all conceivable to have [? bunks ?] that fetch in 1284 01:00:39,000 --> 01:00:42,100 new code or an allocator that shuffles somehow? 1285 01:00:42,100 --> 01:00:45,572 Or is it basically as soon as you get to that point your 1286 01:00:45,572 --> 01:00:46,510 performance is going to go to hell. 1287 01:00:46,510 --> 01:00:48,330 MICHAEL PERRONE: Yes, well if you don't do anything about 1288 01:00:48,330 --> 01:00:50,510 it, yes your performance will go to hell. 1289 01:00:50,510 --> 01:00:52,070 So there are two ways. 1290 01:00:52,070 --> 01:00:57,240 In research we came up with an overlay mechanism. 1291 01:00:57,240 --> 01:00:59,810 So this is what people used to do 20 years ago when 1292 01:00:59,810 --> 01:01:00,820 processors were simple. 1293 01:01:00,820 --> 01:01:03,630 Well, these processors are simple, so going back to the 1294 01:01:03,630 --> 01:01:07,570 old technologies is actually a good thing to do. 1295 01:01:07,570 --> 01:01:13,580 So we had a video processing algorithm where we took video 1296 01:01:13,580 --> 01:01:17,070 images, we had to decode them with one SPE, we had to do 1297 01:01:17,070 --> 01:01:19,630 some background subtraction to the next SPE. 1298 01:01:19,630 --> 01:01:21,300 We had to do some edge detection. 
1299 01:01:21,300 --> 01:01:24,300 And so each SPE was doing a different thing, but even then 1300 01:01:24,300 --> 01:01:27,850 the code was very big, the chunks of code were large. 1301 01:01:27,850 --> 01:01:32,080 And we were spending 27% of the time swapping code out and 1302 01:01:32,080 --> 01:01:33,370 bringing in new code. 1303 01:01:33,370 --> 01:01:34,740 Bad, very bad. 1304 01:01:34,740 --> 01:01:36,580 Oh, and I should tell you, spawning SPE 1305 01:01:36,580 --> 01:01:37,830 threads is very painful. 1306 01:01:37,830 --> 01:01:40,660 1307 01:01:40,660 --> 01:01:43,790 500,000 cycles, a million cycles-- 1308 01:01:43,790 --> 01:01:44,490 I don't know. 1309 01:01:44,490 --> 01:01:48,040 It varies depending on how the SPE feels that particular day. 1310 01:01:48,040 --> 01:01:51,080 And it's something to avoid. 1311 01:01:51,080 --> 01:01:53,030 You really want to spawn a thread and keep it running for 1312 01:01:53,030 --> 01:01:54,240 a long time. 1313 01:01:54,240 --> 01:01:58,290 So context switching is painful on cell. 1314 01:01:58,290 --> 01:02:03,420 Using an overlay we got that 27% overhead down to 1%. 1315 01:02:03,420 --> 01:02:04,970 So yes, you can do that. 1316 01:02:04,970 --> 01:02:07,410 That tool is not in the SDK. 1317 01:02:07,410 --> 01:02:09,640 It's on my to-do list to put it in the SDK, but the 1318 01:02:09,640 --> 01:02:11,750 compiler team at IBM tells me that the XLC 1319 01:02:11,750 --> 01:02:14,040 compiler now does overlays. 1320 01:02:14,040 --> 01:02:18,310 But it only does overlays at the function level, so if the 1321 01:02:18,310 --> 01:02:20,800 function still doesn't fit in the SPE 1322 01:02:20,800 --> 01:02:22,070 you're dead in the water. 
1323 01:02:22,070 --> 01:02:24,800 And I think the compiler will say, when it compiles it it'll 1324 01:02:24,800 --> 01:02:28,010 say this doesn't fit quietly and you'll never see that 1325 01:02:28,010 --> 01:02:29,450 until you run and it doesn't load and you don't know 1326 01:02:29,450 --> 01:02:30,360 what's going on. 1327 01:02:30,360 --> 01:02:33,570 So read your compiler outputs. 1328 01:02:33,570 --> 01:02:35,530 The DMA granularity is 128 bytes. 1329 01:02:35,530 --> 01:02:38,770 This is the same, the data transactions for Intel, for 1330 01:02:38,770 --> 01:02:41,950 AMD they're all 128 byte data envelopes. 1331 01:02:41,950 --> 01:02:45,690 So if you're doing a memory access that's 4 bytes you're 1332 01:02:45,690 --> 01:02:48,180 still using 128 bytes of bandwidth. 1333 01:02:48,180 --> 01:02:50,790 So this comes back to this notion of getting a shopping 1334 01:02:50,790 --> 01:02:53,740 list. You really want to think ahead what you want to get, 1335 01:02:53,740 --> 01:02:56,130 bring it in, then use it so that you don't waste 1336 01:02:56,130 --> 01:02:58,750 bandwidth; if you're bandwidth bound. 1337 01:02:58,750 --> 01:03:01,380 If you're not then you can be a little more wasteful. 1338 01:03:01,380 --> 01:03:04,100 But there's a guy, Mike Acton-- 1339 01:03:04,100 --> 01:03:07,050 you can find his website, I think he has a website called 1340 01:03:07,050 --> 01:03:11,060 www.cellperformance.org? 1341 01:03:11,060 --> 01:03:11,480 Net? 1342 01:03:11,480 --> 01:03:11,820 Com? 1343 01:03:11,820 --> 01:03:12,100 I don't know. 1344 01:03:12,100 --> 01:03:15,010 AUDIENCE: Just a quick comment [UNINTELLIGIBLE PHRASE]. 1345 01:03:15,010 --> 01:03:16,410 MICHAEL PERRONE: Oh, he's good. 1346 01:03:16,410 --> 01:03:17,410 He's much better than me. 1347 01:03:17,410 --> 01:03:20,470 You're really going to like him.
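To make the 128-byte granularity point above concrete, here is a minimal sketch of the arithmetic. It is an illustration only, not Cell SDK code: the names `DMA_GRANULE`, `bytes_moved`, and `efficiency` are hypothetical, and the model assumes every access is aligned to a 128-byte boundary.

```python
# Hypothetical model of the 128-byte DMA granularity described in the talk.
# Assumption: every access starts on a 128-byte boundary.
DMA_GRANULE = 128  # bytes of bandwidth consumed per granule

def bytes_moved(requested_bytes):
    """Bandwidth actually consumed when requesting `requested_bytes`."""
    granules = -(-requested_bytes // DMA_GRANULE)  # ceiling division
    return granules * DMA_GRANULE

def efficiency(requested_bytes):
    """Fraction of the moved bytes that the program asked for."""
    if requested_bytes <= 0:
        raise ValueError("requested_bytes must be positive")
    return requested_bytes / bytes_moved(requested_bytes)

# The "shopping list" idea: 32 separate 4-byte reads versus one batched read.
scattered = 32 * bytes_moved(4)   # each tiny read still costs a full granule
batched = bytes_moved(32 * 4)     # the same 128 bytes fetched in one go
```

In this model the 32 scattered reads consume 32 times the bandwidth of the single batched read, which is the waste the shopping-list approach is meant to avoid when the code is bandwidth bound.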
1348 01:03:20,470 --> 01:03:24,460 His belief, and I believe him wholeheartedly, is it's all 1349 01:03:24,460 --> 01:03:26,030 about the data. 1350 01:03:26,030 --> 01:03:32,930 We're coming to a point in computer science where the 1351 01:03:32,930 --> 01:03:35,150 code doesn't matter as much as getting the data 1352 01:03:35,150 --> 01:03:36,310 where you need it. 1353 01:03:36,310 --> 01:03:40,300 This is because of the latency out to main memory. 1354 01:03:40,300 --> 01:03:43,790 Memory's getting so far away that having all these cycles 1355 01:03:43,790 --> 01:03:46,210 is not that useful anymore if you can't get the data. 1356 01:03:46,210 --> 01:03:47,940 So he always pushes this point, you 1357 01:03:47,940 --> 01:03:48,830 have to get the data. 1358 01:03:48,830 --> 01:03:51,510 You have to think about the data, good code starts with 1359 01:03:51,510 --> 01:03:54,180 the data, good code ends with the data, good data structures 1360 01:03:54,180 --> 01:03:55,000 start with the data. 1361 01:03:55,000 --> 01:03:58,520 You have to think data, data, data. 1362 01:03:58,520 --> 01:04:00,590 And I can't emphasize that enough because it's really 1363 01:04:00,590 --> 01:04:03,625 very, very true for this processor and I believe, for 1364 01:04:03,625 --> 01:04:05,310 all the multicore processors you're going to be seeing. 1365 01:04:05,310 --> 01:04:08,730 1366 01:04:08,730 --> 01:04:15,090 The DMAs that you issue can be 128 bytes or multiples of 128 1367 01:04:15,090 --> 01:04:17,890 bytes, up to 16 kilobytes per single DMA. 1368 01:04:17,890 --> 01:04:20,570 There's also something called a DMA list, which is a list of 1369 01:04:20,570 --> 01:04:26,140 DMAs in local store and you tell the DMA queue OK, here 1370 01:04:26,140 --> 01:04:29,490 are these 100 DMAs, spawn them off.
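The size limits just described (multiples of 128 bytes, at most 16 kilobytes per DMA) mean a larger transfer has to be cut into pieces, and a list of such pieces is the kind of thing a DMA list holds. The sketch below only shows the splitting arithmetic. The function name `split_transfer` and the `(offset, size)` tuple layout are hypothetical and are not the SDK's list-element format.

```python
# Hypothetical helper: split one large transfer into DMA-sized pieces,
# following the sizing rules stated in the talk.
DMA_GRANULE = 128          # transfers are multiples of 128 bytes
MAX_DMA_BYTES = 16 * 1024  # upper limit for a single DMA

def split_transfer(total_bytes):
    """Return a list of (offset, size) pairs covering `total_bytes`."""
    if total_bytes <= 0 or total_bytes % DMA_GRANULE != 0:
        raise ValueError("size must be a positive multiple of 128 bytes")
    pieces = []
    offset = 0
    while offset < total_bytes:
        size = min(MAX_DMA_BYTES, total_bytes - offset)
        pieces.append((offset, size))
        offset += size
    return pieces
```

For example, `split_transfer(40 * 1024)` yields two full 16 KB pieces followed by one 8 KB piece.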
1371 01:04:29,490 --> 01:04:32,760 That only takes one slot in the DMA queue so it's an 1372 01:04:32,760 --> 01:04:36,210 efficient way of loading the queue without 1373 01:04:36,210 --> 01:04:39,200 overloading the queue. 1374 01:04:39,200 --> 01:04:46,080 Traffic controls, this is perhaps one of the trickier 1375 01:04:46,080 --> 01:04:48,020 things with cell because the simulator doesn't help very 1376 01:04:48,020 --> 01:04:51,560 much and the tools don't help very much. 1377 01:04:51,560 --> 01:04:53,530 Thinking about synchronization, DMA latency 1378 01:04:53,530 --> 01:04:54,860 handling-- all those things are important. 1379 01:04:54,860 --> 01:04:59,390 1380 01:04:59,390 --> 01:05:01,690 OK, so this is the last slide that I'm going to do and then 1381 01:05:01,690 --> 01:05:02,940 I have to run off. 1382 01:05:02,940 --> 01:05:05,820 1383 01:05:05,820 --> 01:05:09,780 I want to give you a sense for the process by which people-- 1384 01:05:09,780 --> 01:05:12,320 my group in particular went through, especially when we 1385 01:05:12,320 --> 01:05:15,490 didn't even have hardware and we didn't have compilers that 1386 01:05:15,490 --> 01:05:17,880 worked nearly as well as they do now and it's really very 1387 01:05:17,880 --> 01:05:21,140 ugly knives and stones and sticks. 1388 01:05:21,140 --> 01:05:23,750 You know, just kind of stone knives. 1389 01:05:23,750 --> 01:05:26,580 That's what I'm thinking, very primitive. 1390 01:05:26,580 --> 01:05:30,970 But this way of thinking is still very much true. 1391 01:05:30,970 --> 01:05:32,570 You have to think about your code this way. 1392 01:05:32,570 --> 01:05:34,940 You want to start, you have your application, whatever it 1393 01:05:34,940 --> 01:05:35,900 happens to be; you want to do an 1394 01:05:35,900 --> 01:05:38,080 algorithmic complexity study. 1395 01:05:38,080 --> 01:05:41,140 Is this order n squared, is this log n? 1396 01:05:41,140 --> 01:05:42,260 Where are the bottlenecks?
1397 01:05:42,260 --> 01:05:45,160 What do I expect to be bottlenecks? 1398 01:05:45,160 --> 01:05:48,390 Then I want to do data layout/locality. 1399 01:05:48,390 --> 01:05:50,360 Now this is the data, data, data approach of Mike Acton. 1400 01:05:50,360 --> 01:05:52,950 1401 01:05:52,950 --> 01:05:54,430 You want to think about the data. 1402 01:05:54,430 --> 01:05:55,540 Where is it? 1403 01:05:55,540 --> 01:05:57,810 How can you structure your data so that it's going to be 1404 01:05:57,810 --> 01:06:01,550 efficiently positioned for when you need it? 1405 01:06:01,550 --> 01:06:04,400 And then you start with an experimental partitioning of 1406 01:06:04,400 --> 01:06:05,340 the algorithm. 1407 01:06:05,340 --> 01:06:08,050 You want to break it up between the pieces that you 1408 01:06:08,050 --> 01:06:12,320 believe are scalar and remain scalar, leave those on the PPE 1409 01:06:12,320 --> 01:06:14,460 and the ones that can be parallelized. 1410 01:06:14,460 --> 01:06:17,810 Those are the ones that are going to go on the SPE. 1411 01:06:17,810 --> 01:06:19,430 You have to think conceptually about 1412 01:06:19,430 --> 01:06:21,730 partitioning that out. 1413 01:06:21,730 --> 01:06:24,980 And then run it on the PPE anyway. 1414 01:06:24,980 --> 01:06:27,390 You want to have a baseline there. 1415 01:06:27,390 --> 01:06:31,370 Then you have this PPE scalar code and PPE control code. 1416 01:06:31,370 --> 01:06:35,230 This PPE scalar code you want to then push down to the SPEs. 1417 01:06:35,230 --> 01:06:39,060 So now you're going to add stuff for communication, 1418 01:06:39,060 --> 01:06:40,440 synchronization, and latency handling. 1419 01:06:40,440 --> 01:06:42,420 So you have to spawn threads. 1420 01:06:42,420 --> 01:06:43,640 The [? RAIDs ?]
1421 01:06:43,640 --> 01:06:47,110 have to be told where the data is, they have to get their 1422 01:06:47,110 --> 01:06:49,320 code, they have to run their code, they have to then start 1423 01:06:49,320 --> 01:06:51,490 pulling in the data, synchronize with the other 1424 01:06:51,490 --> 01:06:55,620 SPEs and then latency handling with multibuffering of the 1425 01:06:55,620 --> 01:06:59,090 data so that you can be doing computing and reading data 1426 01:06:59,090 --> 01:07:01,020 simultaneously. 1427 01:07:01,020 --> 01:07:06,970 Then you have your first parallel code that's running. 1428 01:07:06,970 --> 01:07:12,400 Now the compiler, the XLC compiler, GCC compiler-- 1429 01:07:12,400 --> 01:07:14,900 well, the XLC compiler I know for certain will do some 1430 01:07:14,900 --> 01:07:16,370 automatic SIMDization 1431 01:07:16,370 --> 01:07:18,080 if you put the auto SIMD flag on. 1432 01:07:18,080 --> 01:07:19,550 Does GCC compiler do that? 1433 01:07:19,550 --> 01:07:20,800 PROFESSOR: [UNINTELLIGIBLE PHRASE] 1434 01:07:20,800 --> 01:07:23,300 1435 01:07:23,300 --> 01:07:24,860 MICHAEL PERRONE: OK, so I don't know if the GCC 1436 01:07:24,860 --> 01:07:27,190 compiler does that. 1437 01:07:27,190 --> 01:07:33,690 So that can be done by hand, but sometimes that works, 1438 01:07:33,690 --> 01:07:34,670 sometimes it doesn't. 1439 01:07:34,670 --> 01:07:36,690 And it really depends on how complex the algorithm is. 1440 01:07:36,690 --> 01:07:39,530 If it's a very regular code, like a matrix-matrix multiply, 1441 01:07:39,530 --> 01:07:43,980 you'll see that the compiler can do fairly well if the 1442 01:07:43,980 --> 01:07:45,590 block sizes are right and all. 1443 01:07:45,590 --> 01:07:50,090 But if you have something that's more irregular then you 1444 01:07:50,090 --> 01:07:53,360 may find that doing it by hand is really required.
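The multibuffering mentioned a moment ago (computing on one buffer while the next block of data is being read) can be sketched as follows. This is a plain sequential model rather than SPE code: `process_blocks` and `pipeline_time` are hypothetical names, the list copy merely stands in for an asynchronous DMA get, and the timing formula assumes every block has the same transfer and compute time.

```python
def process_blocks(blocks, compute):
    """Double-buffered loop: request block i+1 before computing on block i."""
    if not blocks:
        return []
    buffers = [None, None]            # two local buffers
    buffers[0] = list(blocks[0])      # prime the pipeline with the first fetch
    results = []
    for i in range(len(blocks)):
        cur, nxt = i % 2, (i + 1) % 2
        if i + 1 < len(blocks):
            # Stand-in for starting an asynchronous fetch into the idle buffer.
            buffers[nxt] = list(blocks[i + 1])
        # On real hardware this compute would overlap with the fetch above.
        results.append(compute(buffers[cur]))
    return results

def pipeline_time(n_blocks, t_dma, t_compute, double_buffered):
    """Idealized total time, assuming uniform per-block costs."""
    if n_blocks == 0:
        return 0
    if not double_buffered:
        return n_blocks * (t_dma + t_compute)   # fetch, then compute, in series
    # First fetch and last compute are exposed; the rest overlap.
    return t_dma + (n_blocks - 1) * max(t_dma, t_compute) + t_compute
```

Under this model, four blocks with equal transfer and compute costs of 10 units take 80 units single-buffered and 50 units double-buffered, and the gain grows as the number of blocks increases.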
1445 01:07:53,360 --> 01:07:56,270 And so this step here could be done with the compiler 1446 01:07:56,270 --> 01:07:58,700 initially to see if you're getting the performance that 1447 01:07:58,700 --> 01:08:00,780 you think you should be getting from that algorithmic 1448 01:08:00,780 --> 01:08:02,380 complexity study. 1449 01:08:02,380 --> 01:08:04,420 You should see that type of scaling. 1450 01:08:04,420 --> 01:08:06,880 You can look at the CPI and see how many cycles per 1451 01:08:06,880 --> 01:08:08,480 instruction you're getting. 1452 01:08:08,480 --> 01:08:11,200 Each SPE should be getting 0.5. 1453 01:08:11,200 --> 01:08:13,590 You should be able to get two instructions per cycle. 1454 01:08:13,590 --> 01:08:16,310 1455 01:08:16,310 --> 01:08:19,480 Very few codes actually get exactly-- 1456 01:08:19,480 --> 01:08:27,180 you can get down to 0.58 or something like that, but I 1457 01:08:27,180 --> 01:08:29,830 think if you can get to 1 you're doing well. 1458 01:08:29,830 --> 01:08:32,390 If you get to 2 there's probably more you can be doing 1459 01:08:32,390 --> 01:08:33,870 and if you're above 2 there's something 1460 01:08:33,870 --> 01:08:36,200 wrong with your code. 1461 01:08:36,200 --> 01:08:37,020 It may be the algorithm. 1462 01:08:37,020 --> 01:08:39,400 It may be just a poorly chosen algorithm. 1463 01:08:39,400 --> 01:08:42,120 1464 01:08:42,120 --> 01:08:44,020 But that's where you can talk to me. 1465 01:08:44,020 --> 01:08:46,010 I want to make myself available to everyone in the 1466 01:08:46,010 --> 01:08:48,460 class or in my department as well. 1467 01:08:48,460 --> 01:08:53,170 We're very enthusiastic about working with research groups 1468 01:08:53,170 --> 01:08:59,230 in universities to develop new tools, new methods and if you 1469 01:08:59,230 --> 01:09:00,180 can help me, I can help you. 1470 01:09:00,180 --> 01:09:01,850 I think it works very well.
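The cycles-per-instruction rule of thumb given above can be written down as a small check. This is only a sketch: `cpi` and `assess` are hypothetical helper names, the cycle and instruction counts are assumed to come from whatever profiler or simulator is available, and the thresholds are the speaker's informal guidance, not hard limits.

```python
def cpi(cycles, instructions):
    """Cycles per instruction; an SPE's ideal is 0.5 (two instructions per cycle)."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cycles / instructions

def assess(cpi_value):
    """Map a CPI value onto the informal guidance from the talk."""
    if cpi_value <= 1.0:
        return "doing well"
    if cpi_value <= 2.0:
        return "room to improve"
    return "something is wrong"
```

For instance, a kernel that retires 2,000,000 instructions in 1,000,000 cycles sits at the ideal CPI of 0.5, while one that needs 5,000,000 cycles for the same instruction count is at 2.5 and deserves a second look at the algorithm.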
1471 01:09:01,850 --> 01:09:04,710 1472 01:09:04,710 --> 01:09:07,440 Then once you've done this, you may find that what you 1473 01:09:07,440 --> 01:09:11,000 originally thought for the complexity or the layout 1474 01:09:11,000 --> 01:09:13,840 wasn't quite accurate, so you need to then go do some 1475 01:09:13,840 --> 01:09:14,970 additional rebalancing. 1476 01:09:14,970 --> 01:09:17,060 Maybe change your block sizes. 1477 01:09:17,060 --> 01:09:20,960 You know, maybe you had 64 by 64 blocks, now you need 32 by 1478 01:09:20,960 --> 01:09:25,800 64 or 48 by whatever-- some readjustment to match what you 1479 01:09:25,800 --> 01:09:30,610 have. And then you may want to reevaluate the data movement. 1480 01:09:30,610 --> 01:09:33,100 And then you know, in many cases you'll be done, but 1481 01:09:33,100 --> 01:09:35,620 you're looking at your cycles per instruction or your speed 1482 01:09:35,620 --> 01:09:39,960 up and you're not seeing exactly what you expected, so 1483 01:09:39,960 --> 01:09:42,830 you can start looking at other optimization considerations. 1484 01:09:42,830 --> 01:09:46,210 Like using the vector unit, the VMX unit on the cell 1485 01:09:46,210 --> 01:09:49,840 processor, on the PPE. 1486 01:09:49,840 --> 01:09:53,760 Looking for system bottlenecks and this is actually, I have 1487 01:09:53,760 --> 01:09:56,400 found the biggest problem. 1488 01:09:56,400 --> 01:09:59,730 Trying to identify where the DMA bottlenecks are happening 1489 01:09:59,730 --> 01:10:02,980 is kind of devilishly hard. 1490 01:10:02,980 --> 01:10:05,100 We don't have good tools for that, so you really have to 1491 01:10:05,100 --> 01:10:08,100 think hard and come up with interesting kind of 1492 01:10:08,100 --> 01:10:11,260 experiments for your code to track down these bottlenecks. 1493 01:10:11,260 --> 01:10:13,990 1494 01:10:13,990 --> 01:10:15,160 And then load balancing.
1495 01:10:15,160 --> 01:10:17,850 If you look at these SPEs, I told you they're completely 1496 01:10:17,850 --> 01:10:18,520 independent. 1497 01:10:18,520 --> 01:10:20,850 You can have them all running the same code or they could be 1498 01:10:20,850 --> 01:10:22,170 running all different code. 1499 01:10:22,170 --> 01:10:24,310 They could be daisy chained so that this one feeds, this one 1500 01:10:24,310 --> 01:10:25,940 feeds that one, feeds that one. 1501 01:10:25,940 --> 01:10:28,020 If you do that daisy chaining you may find out there's a 1502 01:10:28,020 --> 01:10:28,400 bottleneck. 1503 01:10:28,400 --> 01:10:31,540 That this SPE takes three times as long 1504 01:10:31,540 --> 01:10:33,110 as any of the others. 1505 01:10:33,110 --> 01:10:38,540 So make that use 3 SPEs and have this SPE feed these 3. 1506 01:10:38,540 --> 01:10:41,430 So you have to do some load balancing and thinking about 1507 01:10:41,430 --> 01:10:43,460 how many SPEs really need to be dedicated 1508 01:10:43,460 --> 01:10:46,510 to each of the tasks. 1509 01:10:46,510 --> 01:10:50,920 Now that's the end of my talk. 1510 01:10:50,920 --> 01:10:54,900 I think that gives you a good sense of where we have been, 1511 01:10:54,900 --> 01:10:57,190 where we are now, and where we're going. 1512 01:10:57,190 --> 01:11:01,420 And I hope that it was good, educational, and I'll make 1513 01:11:01,420 --> 01:11:03,260 myself available to you guys in the future. 1514 01:11:03,260 --> 01:11:04,680 And if you have questions-- 1515 01:11:04,680 --> 01:11:05,170 PROFESSOR: Thank you. 1516 01:11:05,170 --> 01:11:10,140 I know you have to catch a flight. 1517 01:11:10,140 --> 01:11:11,810 How much time do you have for questions? 1518 01:11:11,810 --> 01:11:13,210 MICHAEL PERRONE: Not much. 1519 01:11:13,210 --> 01:11:14,890 I leave at 1:10. 1520 01:11:14,890 --> 01:11:16,750 So I should be there by 12:00. 1521 01:11:16,750 --> 01:11:16,970 PROFESSOR: OK.
1522 01:11:16,970 --> 01:11:18,080 So [UNINTELLIGIBLE] 1523 01:11:18,080 --> 01:11:18,810 at some time. 1524 01:11:18,810 --> 01:11:19,450 MICHAEL PERRONE: My car is out-- 1525 01:11:19,450 --> 01:11:22,770 PROFESSOR: OK, so we'll have about 5 minutes questions. 1526 01:11:22,770 --> 01:11:25,630 OK, so I know this talk is early. 1527 01:11:25,630 --> 01:11:27,750 We haven't gotten a lot of basics so there might be a lot 1528 01:11:27,750 --> 01:11:30,940 of things kind of going above your head, but we'll slowly 1529 01:11:30,940 --> 01:11:32,030 get back to it. 1530 01:11:32,030 --> 01:11:34,990 So questions? 1531 01:11:34,990 --> 01:11:38,190 AUDIENCE: You mentioned that SPEs would 1532 01:11:38,190 --> 01:11:40,910 be able to run kernel. 1533 01:11:40,910 --> 01:11:43,517 Is there a microkernel that you could install on them so 1534 01:11:43,517 --> 01:11:45,660 that you could begin experimenting with MPI type 1535 01:11:45,660 --> 01:11:47,240 structures? 1536 01:11:47,240 --> 01:11:49,450 MICHAEL PERRONE: Not that I'm aware of. 1537 01:11:49,450 --> 01:11:52,240 We did look at something called MicroMPI, where we were 1538 01:11:52,240 --> 01:11:57,290 using kind of a very watered down MPI implementation for 1539 01:11:57,290 --> 01:12:00,030 the SPEs in the transactions. 1540 01:12:00,030 --> 01:12:01,000 I don't recommend it. 1541 01:12:01,000 --> 01:12:07,060 What I recommend is you have a cluster say, a thousand node 1542 01:12:07,060 --> 01:12:10,570 cluster and the code today, the legacy code that's out 1543 01:12:10,570 --> 01:12:14,400 there runs some process on this node. 1544 01:12:14,400 --> 01:12:19,360 Take that process, don't try to push MPI further down, but 1545 01:12:19,360 --> 01:12:24,940 just try to subpartition that process and let the PPE handle 1546 01:12:24,940 --> 01:12:31,190 all the communication off board, off node. 1547 01:12:31,190 --> 01:12:32,130 That's my recommendation. 
1548 01:12:32,130 --> 01:12:35,960 AUDIENCE: So MPI is running on [UNINTELLIGIBLE]? 1549 01:12:35,960 --> 01:12:37,890 MICHAEL PERRONE: Yeah, Open MPI. 1550 01:12:37,890 --> 01:12:39,840 It's an open source MPI. 1551 01:12:39,840 --> 01:12:42,310 It's just a recompile and it hasn't 1552 01:12:42,310 --> 01:12:44,960 been tuned or optimized. 1553 01:12:44,960 --> 01:12:48,480 And it doesn't know anything about the SPEs. 1554 01:12:48,480 --> 01:12:50,990 You know, you let the PPE do all the communication or 1555 01:12:50,990 --> 01:12:52,080 handle the communications. 1556 01:12:52,080 --> 01:12:55,180 When it finishes the task at hand then it can 1557 01:12:55,180 --> 01:12:56,967 issue its MPI process. 1558 01:12:56,967 --> 01:12:58,217 AUDIENCE: [UNINTELLIGIBLE PHRASE] 1559 01:12:58,217 --> 01:13:00,010 1560 01:13:00,010 --> 01:13:03,850 MICHAEL PERRONE: OpenMP is the methodology where you take 1561 01:13:03,850 --> 01:13:08,260 existing scalar code and you insert compiler pragmas to say 1562 01:13:08,260 --> 01:13:10,490 this for loop can be parallelized. 1563 01:13:10,490 --> 01:13:13,330 And you know, these data structures are disjoint, so we 1564 01:13:13,330 --> 01:13:17,410 don't have to worry about any kind of interference, side 1565 01:13:17,410 --> 01:13:19,950 effects of the data manipulation. 1566 01:13:19,950 --> 01:13:24,090 The compiler, the XLC compiler implements OpenMP. 1567 01:13:24,090 --> 01:13:27,360 There's several components that are required. 1568 01:13:27,360 --> 01:13:30,980 One was a software cache where they implemented a little 1569 01:13:30,980 --> 01:13:32,460 cache on the local store. 1570 01:13:32,460 --> 01:13:36,250 And if it misses in that local cache it goes and gets it. 1571 01:13:36,250 --> 01:13:41,910 I don't know how well that performs yet, but it exists. 1572 01:13:41,910 --> 01:13:43,150 There's the SIMDization.
1573 01:13:43,150 --> 01:13:45,830 For a while, OpenMP wasn't working with auto SIMDization 1574 01:13:45,830 --> 01:13:48,830 but now it does. 1575 01:13:48,830 --> 01:13:53,310 So it's getting there, for so C it's there. 1576 01:13:53,310 --> 01:13:55,205 I don't know what type of performance hit 1577 01:13:55,205 --> 01:13:55,950 you take for that. 1578 01:13:55,950 --> 01:13:59,820 AUDIENCE: Probably runs [UNINTELLIGIBLE PHRASE] 1579 01:13:59,820 --> 01:14:00,450 MICHAEL PERRONE: It's 1580 01:14:00,450 --> 01:14:03,110 XLC version that does that. 1581 01:14:03,110 --> 01:14:06,700 I don't know if GCC does it. 1582 01:14:06,700 --> 01:14:10,500 But my recommendation is if you want to use OpenMP, go 1583 01:14:10,500 --> 01:14:14,010 ahead, take your scalar code, implement it with those 1584 01:14:14,010 --> 01:14:17,340 pragmas, see what type of improvement you get. 1585 01:14:17,340 --> 01:14:18,330 Play around with it a little. 1586 01:14:18,330 --> 01:14:21,710 If you find something that you expect should be 10x better 1587 01:14:21,710 --> 01:14:25,100 and it's only 3x take that bottleneck and 1588 01:14:25,100 --> 01:14:26,350 implement it by hand. 1589 01:14:26,350 --> 01:14:31,726 1590 01:14:31,726 --> 01:14:32,976 AUDIENCE: [UNINTELLIGIBLE PHRASE] 1591 01:14:32,976 --> 01:14:34,945 1592 01:14:34,945 --> 01:14:39,340 with the memory models and such that the SPEs certainly 1593 01:14:39,340 --> 01:14:41,293 went back a couple of generations to a simpler 1594 01:14:41,293 --> 01:14:41,781 [INAUDIBLE]. 1595 01:14:41,781 --> 01:14:44,512 How come you went so far back rather to just say, 1596 01:14:44,512 --> 01:14:45,580 segmentation. 1597 01:14:45,580 --> 01:14:46,760 MICHAEL PERRONE: I don't know the answer. 1598 01:14:46,760 --> 01:14:48,010 I'm sorry. 1599 01:14:48,010 --> 01:14:50,650 1600 01:14:50,650 --> 01:14:53,860 I suspect and most of these answers come down to the same 1601 01:14:53,860 --> 01:14:57,210 thing, it comes back to Sony.
1602 01:14:57,210 --> 01:14:59,990 Sony contracted with IBM, gave us a lot of money 1603 01:14:59,990 --> 01:15:00,700 to make this thing. 1604 01:15:00,700 --> 01:15:02,330 And they said we need a Playstation 3. 1605 01:15:02,330 --> 01:15:03,740 We need this, this, this, this. 1606 01:15:03,740 --> 01:15:06,870 And so IBM was very focused on providing those things. 1607 01:15:06,870 --> 01:15:10,650 Now that that is delivered, Playstation 3 is being sold 1608 01:15:10,650 --> 01:15:11,740 we're looking at other options. 1609 01:15:11,740 --> 01:15:17,560 And if that's something that you're interested in pursuing 1610 01:15:17,560 --> 01:15:18,020 you should talk to me. 1611 01:15:18,020 --> 01:15:20,057 AUDIENCE: Among other things it seems to me that the 1612 01:15:20,057 --> 01:15:23,622 lightweight mechanism for keeping the stack from 1613 01:15:23,622 --> 01:15:27,940 stomping on other things -- 1614 01:15:27,940 --> 01:15:33,950 PROFESSOR: I think that this is very new area. 1615 01:15:33,950 --> 01:15:36,190 Before you put things in hardware, you need to have 1616 01:15:36,190 --> 01:15:39,190 some kind of consensus, what's the right way to do it? 1617 01:15:39,190 --> 01:15:42,690 This is a bare metal that gives you huge amount of 1618 01:15:42,690 --> 01:15:44,320 opportunity but you give enough rope to hang yourself. 1619 01:15:44,320 --> 01:15:46,850 1620 01:15:46,850 --> 01:15:49,250 And the key thing is you can get all this performance and 1621 01:15:49,250 --> 01:15:52,380 what will happen perhaps, in the next few years is people 1622 01:15:52,380 --> 01:15:53,730 come up to consensus saying, look, 1623 01:15:53,730 --> 01:15:54,790 everybody has to do this. 1624 01:15:54,790 --> 01:15:57,060 Everybody needs MPI, everybody needs this cache. 
1625 01:15:57,060 --> 01:16:00,180 And slowly, some of those features will do a little bit 1626 01:16:00,180 --> 01:16:02,130 of a feature creep, so you're going to have a little bit 1627 01:16:02,130 --> 01:16:04,390 of overhead, little bit less power efficient. 1628 01:16:04,390 --> 01:16:05,630 But it will be much easier to program. 1629 01:16:05,630 --> 01:16:08,590 But this is kind of the bare metal thing that to get and in 1630 01:16:08,590 --> 01:16:12,410 some sense, it's a nice time because I think in 5 years if 1631 01:16:12,410 --> 01:16:17,400 you look at cell you won't have this level of access. 1632 01:16:17,400 --> 01:16:20,940 You'll have all this nice build on top up in doing this 1633 01:16:20,940 --> 01:16:22,840 so, this is a unique positioning there. 1634 01:16:22,840 --> 01:16:25,840 It's very hard to deal with, but also on the other hand you 1635 01:16:25,840 --> 01:16:27,640 get to see underneath. 1636 01:16:27,640 --> 01:16:30,650 You get to see without any kind of these sort 1637 01:16:30,650 --> 01:16:31,310 of things in there. 1638 01:16:31,310 --> 01:16:33,700 So my feeling is in a few years you'll get all those 1639 01:16:33,700 --> 01:16:34,800 things put back. 1640 01:16:34,800 --> 01:16:37,210 When and if we figure out how to deal with things like 1641 01:16:37,210 --> 01:16:40,640 segmentation on the multicore with very fine grain 1642 01:16:40,640 --> 01:16:42,640 communication and there's a lot of issues here that you 1643 01:16:42,640 --> 01:16:43,370 need to figure out. 1644 01:16:43,370 --> 01:16:44,950 But right now all those issues are [INAUDIBLE]. 1645 01:16:44,950 --> 01:16:46,450 It's like OK, we don't know how to do it. 1646 01:16:46,450 --> 01:16:52,460 Well, you go figure it out OK? 1647 01:16:52,460 --> 01:16:53,070 MICHAEL PERRONE: Thank you very much. 1648 01:16:53,070 --> 01:16:54,320 PROFESSOR: Thank you. 1649 01:16:54,320 --> 01:16:56,390 1650 01:16:56,390 --> 01:16:58,070 I don't have that much more material.
1651 01:16:58,070 --> 01:17:00,690 So I have about 10, 15 minutes. 1652 01:17:00,690 --> 01:17:03,160 Do you guys need a break or should we just go 1653 01:17:03,160 --> 01:17:03,740 directly to the end? 1654 01:17:03,740 --> 01:17:06,430 How many people say we want a break?