Lecture 1: Introduction and Matrix Multiplication

Flash and JavaScript are required for this feature.

Download the video from Internet Archive.

Description: Professor Leiserson introduces 6.172 Performance Engineering of Software Systems. The class examines an example of code optimization using matrix multiplication and discusses the differences between programming languages Python, Java, and C.

Instructor: Charles Leiserson

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

CHARLES LEISERSON: So welcome to 6.172. My name is Charles Leiserson, and I am one of the two lecturers this term. The other is Professor Julian Shun. We're both in EECS and in CSAIL on the 7th floor of the Gates Building. If you don't know it, you are in Performance Engineering of Software Systems, so if you found yourself in the wrong place, now's the time to exit.

I want to start today by talking a little bit about why we do performance engineering, and then I'll do a little bit of administration, and then sort of dive into sort of a case study that will give you a good sense of some of the things that we're going to do during the term. I put the administration in the middle, because it's like, if from me telling you about the course you don't want to do the course, then it's like, why should you listen to the administration, right? It's like--

So let's just dive right in, OK? So the first thing to always understand whenever you're doing something is a perspective on what matters in what you're doing. So the whole term, we're going to do software performance engineering. And so this is kind of interesting, because it turns out that performance is usually not at the top of what people are interested in when they're building software. OK? What are some of the things that are more important than performance? Yeah?

AUDIENCE: Deadlines.




AUDIENCE: Correctness.


AUDIENCE: Extensibility.

CHARLES LEISERSON: Extensibility. Yeah, maybe we could go on and on. And I think that you folks could probably make a pretty long list. I made a short list of all the kinds of things that are more important than performance. So then, if programmers are so willing to sacrifice performance for these properties, why do we study performance? OK? So this is kind of a bit of a paradox and a bit of a puzzle. Why do you study something that clearly isn't at the top of the list of what most people care about when they're developing software?

I think the answer to that is that performance is the currency of computing. OK? You use performance to buy these other properties. So you'll say something like, gee, I want to make it easy to program, and so therefore I'm willing to sacrifice some performance to make something easy to program. I'm willing to sacrifice some performance to make sure that my system is secure. OK?

And all those things come out of your performance budget. And clearly if performance degrades too far, your stuff becomes unusable. OK? When I talk with people, with programmers, and I say-- you know, people are fond of saying, ah, performance. Oh, you do performance? Performance doesn't matter. I never think about it.

Then I talk with people who use computers, and I ask, what's your main complaint about the computing systems you use? Answer? Too slow. [CHUCKLES] OK? So it's interesting, whether you're the producer or whatever. But the real answer is that performance is like currency. It's something you spend.


If you look-- you know, would I rather have $100 or a gallon of water? Well, water is indispensable to life. There are circumstances, certainly, where I would prefer to have the water. OK? Than $100. OK? But in our modern society, I can buy water for much less than $100. OK? So even though water is essential to life and far more important than money, money is a currency, and so I prefer to have the money because I can just buy the things I need.

And that's the same kind of analogy of performance. It has no intrinsic value, but it contributes to things. You can use it to buy things that you care about like usability or testability or what have you. OK? Now, in the early days of computing, software performance engineering was common because machine resources were limited. If you look at these machines from 1964 to 1977-- I mean, look at how many bytes they have on them, right? In '64, there is a computer with 524 kilobytes. OK?

That was a big machine back then. That's kilobytes. That's not megabytes, that's not gigabytes. That's kilobytes. OK? And many programs would strain the machine resources, OK? The clock rate for that machine was 33 kilohertz. What's a typical clock rate today?

AUDIENCE: 4 gigahertz.


AUDIENCE: 4 gigahertz.

CHARLES LEISERSON: 4 gigahertz, 3 gigahertz, 2 gigahertz, somewhere up there. Yeah, somewhere in that range, OK? And here they were operating with kilohertz. So many programs would not fit without intense performance engineering. And one of the things, also-- there's a lot of sayings that came out of that era. Donald Knuth, who's a Turing Award winner, absolutely fabulous computer scientist in all respects, wrote, "Premature optimization is the root of all evil."

And I invite you, by the way, to look that quote up because it's actually taken out of context. OK? So trying to optimize stuff too early he was worried about. OK? Bill Wulf, who designed the BLISS language and worked on the PDP-11 and such, said, "More computing sins are committed in the name of efficiency without necessarily achieving it than for any other single reason, including blind stupidity." OK? And Michael Jackson said, "The first rule of program optimization-- don't do it. Second rule of program optimization, for experts only-- don't do it yet." [CHUCKLES]

OK? So everybody warning away, because when you start trying to make things fast, your code becomes unreadable. OK? Making code that is readable and fast-- now that's where the art is, and hopefully we'll learn a little bit about doing that. OK? And indeed, there was no real point in working too hard on performance engineering for many years.

If you look at technology scaling and you look at how many transistors are on various processor designs, up until about 2004, we had Moore's law in full throttle, OK? With chip densities doubling every two years. And really quite amazing. And along with that, as they shrunk the dimensions of chips-- because by miniaturization-- the clock speed would go up correspondingly, as well.

And so, if you found something was too slow, wait a couple of years. OK? Wait a couple of years. It'll be faster. OK? And so, you know, if you were going to do something with software and make your software ugly, there wasn't a real good payoff compared to just simply waiting around. And in that era, there was something called Dennard scaling, which, as things shrunk, allowed the clock speeds to get larger, basically by reducing power. You could reduce power and still keep everything fast, and we'll talk about that in a minute.

So if you look at what happened to from 1977 to 2004-- here are Apple computers with similar price tags, and you can see the clock rate really just skyrocketed. 1 megahertz, 400 megahertz, 1.8 gigahertz, OK? And the data paths went from 8 bits to 30 to 64. The memory, correspondingly, grows. Cost? Approximately the same.

And that's the legacy from Moore's law and the tremendous advances in semiconductor technology. And so, until 2004, Moore's law and the scaling of clock frequency, so-called Dennard scaling, was essentially a printing press for the currency of performance. OK? You didn't have to do anything. You just made the hardware go faster. Very easy.

And all that came to an end-- well, some of it came to an end-- in 2004 when clock speeds plateaued. OK? So if you look at this, around 2005, you can see all the speeds-- we hit, you know, 2 to 4 gigahertz, and we have not been able to make chips go faster than that in any practical way since then. But the densities have kept going great. Now, the reason that the clock speed flattened was because of power density.

And this is a slide from Intel from that era, looking at the growth of power density. And what they were projecting was that the junction temperatures of the transistors on the chip, if they just keep scaling the way they had been scaling, would start to approach, first of all, the temperature of a nuclear reactor, then the temperature of a rocket nozzle, and then the sun's surface. OK? So that we're not going to build little technology that cools that very well.

And even if you could solve it for a little bit, the writing was on the wall. We cannot scale clock frequencies any more. The reason for that is that, originally, clock frequency was scaled assuming that most of the power was dynamic power, which was going when you switched the circuit. And what happened as we kept reducing that and reducing that is, something that used to be in the noise, namely the leakage currents, OK, started to become significant to the point where-- now, today-- the dynamic power is far less of a concern than the static power from just the circuit sitting there leaking, and when you miniaturize, you can't stop that effect from happening.

So what did the vendors do in 2004 and 2005 and since is, they said, oh, gosh, we've got all these transistors to use, but we can't use the transistors to make stuff run faster. So what they did is, they introduced parallelism in the form of multicore processors. They put more than one processing core in a chip. And to scale performance, they would, you know, have multiple cores, and each generation of Moore's law now was potentially doubling the number of cores.

And so if you look at what happened for processor cores, you see that around 2004, 2005, we started to get multiple processing cores per chip, to the extent that today, it's basically impossible to find a single-core chip for a laptop or a workstation or whatever. Everything is multicore. You can't buy just one. You have to buy a parallel processor.

And so the impact of that was that performance was no longer free. You couldn't just speed up the hardware. Now if you wanted to use that potential, you had to do parallel programming, and that's not something that anybody in the industry really had done. So today, there are a lot of other things that happened in that intervening time. We got vector units as common parts of our machines; we got GPUs; we got steeper cache hierarchies; we have configurable logic on some machines; and so forth.

And now it's up to the software to adapt to it. And so, although we don't want to have to deal with performance, today you have to deal with performance. And in your lifetimes, you will have to deal with performance in software if you're going to have effective software. OK? You can see what happened, also-- this is a study that we did looking at software bugs in a variety of open-source projects where they're mentioning the word "performance." And you can see that in 2004, the numbers start going up. You know, some of them-- it's not as convincing for some things as others, but generally there's a trend of, after 2004, people started worrying more about performance.

If you look at software developer jobs, as of around early, mid-2000s-- the 2000 "oh oh's," I guess, OK-- you see once again the mention of "performance" in jobs is going up. And anecdotally, I can tell you, I had one student who came to me after the spring, after he'd taken 6.172, and he said, you know, I went and I applied for five jobs. And every job asked me, at every job interview, they asked me a question I couldn't have answered if I hadn't taken 6.172, and I got five offers. OK?

And when I compared those offers, they tended to be 20% to 30% larger than people who are just web monkeys. OK? So anyway.


That's not to say that you should necessarily take this class, OK? But I just want to point out that what we're going to learn is going to be interesting from a practical point of view, i.e., your futures. OK? As well as theoretical points of view and technical points of view. OK?

So modern processors are really complicated, and the big question is, how do we write software to use that modern hardware efficiently? OK? I want to give you an example of performance engineering of a very well-studied problem, namely matrix multiplication. Who has never seen this problem? [LAUGHS] Yeah. OK, so we got some jokers in the class, I can see. OK.

So, you know, it takes n cubed operations, because you're basically computing n squared dot products. OK? So essentially, if you add up the total number of operations, it's about 2n cubed because there is essentially a multiply and an add for every pair of terms that need to be accumulated. OK? So it's basically 2n cubed. We're going to look at it assuming for simplicity that our n is an exact power of 2, OK?

Now, the machine that we're going to look at is going to be one of the ones that you'll have access to in AWS. OK? It's a compute-optimized machine, which has a Haswell microachitecture running at 2.9 gigahertz. There are 2 processor chips for each of these machines and 9 processing cores per chip, so a total of 18 cores. So that's the amount of parallel processing.

It does two-way hyperthreading, which we're actually going to not deal a lot with. Hyperthreading gives you a little bit more performance, but it also makes it really hard to measure things, so generally we will turn off hyperthreading. But the performance that you get tends to be correlated with what you get when you hyperthread. For floating-point unit there, it is capable of doing 8 double-precision precision operations. That's 64-bit floating-point operations, including a fused-multiply-add per core, per cycle. OK?

So that's a vector unit. So basically, each of these 18 cores can do 8 double-precision operations, including a fused-multiply-add, which is actually 2 operations. OK? The way that they count these things, OK? It has a cache-line size of 64 bytes. The icache is 32 kilobytes, which is 8-way set associative. We'll talk about some of these things. If you don't know all the terms, it's OK. We're going to cover most of these terms later on.

It's got a dcache of the same size. It's got an L2-cache of 256 kilobytes, and it's got an L3-cache or what's sometimes called an LLC, Last-Level Cache, of 25 megabytes. And then it's got 60 gigabytes of DRAM. So this is a honking big machine. OK? This is like-- you can get things to sing on this, OK?

If you look at the peak performance, it's the clock speed times 2 processor chips times 9 processing cores per chip, each capable of, if you can use both the multiply and the add, 16 floating-point operations, and that goes out to just short of teraflop, OK? 836 gigaflops. So that's a lot of power, OK? That's a lot of power. These are fun machines, actually, OK?

Especially when we get into things like the game-playing AI and stuff that we do for the fourth project. You'll see. They're really fun. You can have a lot of compute, OK? Now here's the basic code. This is the full code for Python for doing matrix multiplication. Now, generally, in Python, you wouldn't use this code because you just call a library subroutine that does matrix multiplication.

But sometimes you have a problem. I'm going to illustrate with matrix multiplication, but sometimes you have a problem for which you have to write the code. And I want to give you an idea of what kind of performance you get out of Python, OK? In addition, somebody has to write-- if there is a library routine, somebody had to write it, and that person was a performance engineer, because they wrote it to be as fast as possible. And so this will give you an idea of what you can do to make code run fast, OK?

So when you run this code-- so you can see that the start time-- you know, before the triply-nested loop-- right here, before the triply-nested loop, we take a time measurement, and then we take another time measurement at the end, and then we print the difference. And then that's just this classic triply-nested loop for matrix multiplication. And so, when you run this, how long is this run for, you think? Any guesses?

Let's see. How about-- let's do this. It runs for 6 microseconds. Who thinks 6 microseconds? How about 6 milliseconds? How about-- 6 milliseconds. How about 6 seconds? How about 6 minutes? OK. How about 6 hours?


How about 6 days?


OK. Of course, it's important to know what size it is. It's 4,096 by 4,096, as it shows in the code, OK? And those of you who didn't vote-- wake up. Let's get active. This is active learning. Put yourself out there, OK? It doesn't matter whether you're right or wrong. There'll be a bunch of people who got the right answer, but they have no idea why.


OK? So it turns out, it takes about 21,000 seconds, which is about 6 hours. OK? Amazing. Is this fast?



CHARLES LEISERSON: Yeah, right. Duh, right? Is this fast? No. You know, how do we tell whether this is fast or not? OK? You know, what should we expect from our machine? So let's do a back-of-the-envelope calculation of--


--of how many operations there are and how fast we ought to be able to do it. We just went through and said what are all the parameters of the machine. So there are 2n cubed operations that need to be performed. We're not doing Strassen's algorithm or anything like that. We're just doing a straight triply-nested loop.

So that's 2 to the 37 floating point operations, OK? The running time is 21,000 seconds, so that says that we're getting about 6.25 megaflops out of our machine when we run that code, OK? Just by dividing it out, how many floating-point operations per second do we get? We take the number of operations divided by the time, OK?

The peak, as you recall, was about 836 gigaflops, OK? And we're getting 6.25 megaflops, OK? So we're getting about 0.00075% of peak, OK? This is not fast. OK? This is not fast. So let's do something really simple. Let's code it in Java rather than Python, OK? So we take just that loop. The code is almost the same, OK?

Just the triply-nested loop. We run it in Java, OK? And the running time now, it turns out, is about just under 3,000 seconds, which is about 46 minutes. The same code. Python, Java, OK? We got almost a 9 times speedup just simply coding it in a different language, OK? Well, let's try C. That's the language we're going to be using here.

What happens when you code it in C? It's exactly the same thing, OK? We're going to use the Clang/LLVM 5.0 compiler. I believe we're using 6.0 this term, is that right? Yeah. OK, I should have rerun these numbers for 6.0, but I didn't. So now, it's basically 1,100 seconds, which is about 19 minutes, so we got, then, about-- it's twice as fast as Java and about 18 times faster than Python, OK?

So here's where we stand so far. OK? We have the running time of these various things, OK? And the relative speedup is how much faster it is than the previous row, and the absolute speedup is how it is compared to the first row, and now we're managing to get, now, 0.014% of peak. So we're still slow, but before we go and try to optimize it further-- like, why is Python so slow and C so fast? Does anybody know?

AUDIENCE: Python is interpreted, so it has to-- so there's a C program that basically parses Python pycode instructions, which takes up most of the time.

CHARLES LEISERSON: OK. That's kind of on the right track. Anybody else have any-- articulate a little bit why Python is so slow? Yeah?

AUDIENCE: When you write, like, multiplying and add, those aren't the only instructions Python's doing. It's doing lots of code for, like, going through Python objects and integers and blah-blah-blah.

CHARLES LEISERSON: Yeah, yeah. OK, good. So the big reason why Python is slow and C is so fast is that Python is interpreted and C is compiled directly to machine code. And Java is somewhere in the middle because Java is compiled to bytecode, which is then interpreted and then just-in-time compiled into machine codes. So let me talk a little bit about these things.

So interpreters, such as in Python, are versatile but slow. It's one of these things where they said, we're going to take some of our performance and use it to make a more flexible, easier-to-program environment. OK? The interpreter basically reads, interprets, and performs each program statement and then updates the machine state. So it's not just-- it's actually going through and, each time, reading your code, figuring out what it does, and then implementing it.

So there's like all this overhead compared to just doing its operations. So interpreters can easily support high-level programming features, and they can do things like dynamic code alteration and so forth at the cost of performance. So that, you know, typically the cycle for an interpreter is, you read the next statement, you interpret the statement. You then perform the statement, and then you update the state of the machine, and then you fetch the next instruction. OK?

And you're going through that each time, and that's done in software. OK? When you have things compiled to machine code, it goes through a similar thing, but it's highly optimized just for the things that machines are done. OK? And so when you compile, you're able to take advantage of the hardware and interpreter of machine instructions, and that's much, much lower overhead than the big software overhead you get with Python.

Now, JIT is somewhere in the middle, what's used in Java. JIT compilers can recover some of the performance, and in fact it did a pretty good job in this case. The idea is, when the code is first executed, it's interpreted, and then the runtime system keeps track of how often the various pieces of code are executed. And whenever it finds that there's some piece of code that it's executing frequently, it then calls the compiler to compile that piece of code, and then subsequent to that, it runs the compiled code.

So it tries to get the big advantage of performance by only compiling the things that are necessary-- you know, for which it's actually going to pay off to invoke the compiler to do. OK? So anyway, so that's the big difference with those kinds of things. One of the reasons we don't use Python in this class is because the performance model is hard to figure out.

OK? C is much closer to the metal, much closer to the silicon, OK? And so it's much easier to figure out what's going on in that context. OK? But we will have a guest lecture that we're going to talk about performance in managed languages like Python, so it's not that we're going to ignore the topic. But we will learn how to do performance engineering in a place where it's easier to do it. OK?

Now, one of the things that a good compiler will do is-- you know, once you get to-- let's say we have the C version, which is where we're going to move from this point because that's the fastest we got so far-- is, it turns out that you can change the order of loops in this program without affecting the correctness. OK? So here we went-- you know, for i, for j, for k, do the update. OK?

We could otherwise do-- we could do, for i, for k, for j, do the update, and it computes exactly the same thing. Or we could do, for k, for j, for i, do the updates. OK? So we can change the order without affecting the correctness, OK?

And so do you think the order of loops matters for performance? Duh. You know, this is like this leading question. Yeah? Question?

AUDIENCE: Maybe for cache localities.

CHARLES LEISERSON: Yeah. OK. And you're exactly right. Cache locality is what it is. So when we do that, we get-- the loop order affects the running time by a factor of 18. Whoa. Just by switching the order. OK? What's going on there? OK? What's going on?

So we're going to talk about this in more depth, so I'm just going to fly through this, because this is just sort of showing you the kinds of considerations that you do. So in hardware, each processor reads and writes main memory in contiguous blocks called cache lines. OK? Previously accessed cache lines are stored in a small memory called cache that sits near the processor.

When the processor accesses something, if it's in the cache, you get a hit. That's very cheap, OK, and fast. If you miss, you have to go out to either a deeper level cache or all the way out to main memory. That is much, much slower, and we'll talk about that kind of thing. So what happens for this matrix problem is, the matrices are laid out in memory in row-major order. That means you take-- you know, you have a two-dimensional matrix. It's laid out in the linear order of the addresses of memory by essentially taking row 1, and then, after row 1, stick row 2, and after that, stick row 3, and so forth, and unfolding it.

There's another order that things could have been laid out-- in fact, they are in Fortran-- which is called column-major order. OK? So it turns out C and Fortran operate in different orders. OK? And it turns out it affects performance, which way it does it. So let's just take a look at the access pattern for order i, j, k. OK? So what we're doing is, once we figure out what i and what j is, we're going to go through and cycle through k. And as we cycle through k, OK, C i, j stays the same for everything.

We get for that excellent spatial locality because we're just accessing the same location. Every single time, it's going to be in cache. It's always going to be there. It's going to be fast to access C. For A, what happens is, we go through in a linear order, and we get good spatial locality. But for B, it's going through columns, and those points are distributed far away in memory, so the processor is going to be bringing in 64 bytes to operate on a particular datum. OK?

And then it's ignoring 7 of the 8 floating-point words on that cache line and going to the next one. So it's wasting an awful lot, OK? So this one has good spatial locality in that it's all adjacent and you would use the cache lines effectively. This one, you're going 4,096 elements apart. It's got poor spatial locality, OK? And that's for this one.

So then if we look at the different other ones-- this one, the order i, k, j-- it turns out you get good spatial locality for both C and B and excellent for A. OK? And if you look at even another one, you don't get nearly as good as the other ones, so there's a whole range of things. OK? This one, you're doing optimally badly in both, OK? And so you can just measure the different ones, and it turns out that you can use a tool to figure this out.

And the tool that we'll be using is called Cachegrind. And it's one of the Valgrind suites of caches. And what it'll do is, it will tell you what the miss rates are for the various pieces of code. OK? And you'll learn how to use that tool and figure out, oh, look at that. We have a high miss rate for some and not for others, so that may be why my code is running slowly. OK?

So when you pick the best one of those, OK, we then got a relative speedup of about 6 and 1/2. So what other simple changes can we try? There's actually a collection of things that we could do that don't even have us touching the code. What else could we do, for people who've played with compilers and such? Hint, hint. Yeah?

AUDIENCE: You could change the compiler flags.

CHARLES LEISERSON: Yeah, change the compiler flags, OK? So Clang, which is the compiler that we'll be using, provides a collection of optimization switches, and you can specify a switch to the compiler to ask it to optimize. So you do minus O, and then a number. And 0, if you look at the documentation, it says, "Do not optimize." 1 says, "Optimize." 2 says, "Optimize even more." 3 says, "Optimize yet more." OK?

In this case, it turns out that even though it optimized more in O3, it turns out O2 was a better setting. OK? This is one of these cases. It doesn't happen all the time. Usually, O3 does better than O2, but in this case O2 actually optimized better than O3, because the optimizations are to some extent heuristic. OK? And there are also other kinds of optimization. You can have it do profile-guided optimization, where you look at what the performance was and feed that back into the code, and then the compiler can be smarter about how it optimizes.

And there are a variety of other things. So with this simple technology, choosing a good optimization flag-- in this case, O2-- we got for free, basically, a factor of 3.25, OK? Without having to do much work at all, OK? And now we're actually starting to approach 1% of peak performance. We've got point 0.3% of peak performance, OK?

So what's causing the low performance? Why aren't we getting most of the performance out of this machine? Why do you think? Yeah?

AUDIENCE: We're not using all the cores.

CHARLES LEISERSON: Yeah, we're not using all the cores. So far we're using just one core, and how many cores do we have?


CHARLES LEISERSON: 18, right? 18 cores. Ah! 18 cores just sitting there, 17 sitting idle, while we are trying to optimize one. OK. So multicore. So we have 9 cores per chip, and there are 2 of these chips in our test machine. So we're running on just one of them, so let's use them all. To do that, we're going to use the Cilk infrastructure, and in particular, we can use what's called a parallel loop, which in Cilk, you'd call cilk_for, and so you just relay that outer loop-- for example, in this case, you say cilk_for, it says, do all those iterations in parallel.

The compiler and runtime system are free to schedule them and so forth. OK? And we could also do it for the inner loop, OK? And it turns out you can't also do it for the middle loop, if you think about it. OK? So I'll let you do that is a little bit of a homework problem-- why can't I just do a cilk_for of the inner loop? OK? So the question is, which parallel version works best?

So we can parallel the i loop, we can parallel the j loop, and we can do i and j together. You can't do k just with a parallel loop and expect to get the right thing. OK? And that's this one. So if you look-- wow! What a spread of running times, right? OK? If I parallelize the just the i loop, it's 3.18 seconds, and if I parallelize the j loop, it actually slows down, I think, right? And then, if I do both i and j, it's still bad. I just want to do the outer loop there.

This has to do, it turns out, with scheduling overhead, and we'll learn about scheduling overhead and how you predict that and such. So the rule of thumb here is, parallelize outer loops rather than inner loops, OK? And so when we do parallel loops, we get an almost 18x speedup on 18 cores, OK? So let me assure you, not all code is that easy to parallelize. OK? But this one happens to be.

So now we're up to, what, about just over 5% of peak. OK? So where are we losing time here? OK, why are we getting just 5%? Yeah?

AUDIENCE: So another area of the parallelism that [INAUDIBLE]. So we could, for example, vectorize the multiplication.

CHARLES LEISERSON: Yep. Good. So that's one, and there's one other that we're not using very effectively. OK. That's one, and those are the two optimizations we're going to do to get a really good code here. So what's the other one? Yeah?

AUDIENCE: The multiply and add operation.

CHARLES LEISERSON: That's actually related to the same question, OK? But there's another completely different source of opportunity here. Yeah?

AUDIENCE: We could also do a lot better on our handling of cache misses.

CHARLES LEISERSON: Yeah. OK, we can actually manage the cache misses better. OK? So let's go back to hardware caches, and let's restructure the computation to reuse data in the cache as much as possible. Because cache misses are slow, and hits are fast. And try to make the most of the cache by reusing the data that's already there. OK? So let's just take a look.

Suppose that we're going to just compute one row of C, OK? So we go through one row of C. That's going to take us-- since it's a 4,096-long vector there, that's going to basically be 4,096 writes that we're going to do. OK? And we're going to get some spatial locality there, which is good, but we're basically doing-- the processor's doing 4,096 writes. Now, to compute that row, I need to access 4,096 reads from A, OK? And I need all of B, OK?

Because I go each column of B as I'm going through to fully compute C. Do people see that? OK. So I need-- to just compute one row of C, I need to access one row of A and all of B. OK? Because the first element of C needs the whole first column of B. The second element of C needs the whole second column of B. Once again, don't worry if you don't fully understand this, because right now I'm just ripping through this at high speed. We're going to go into this in much more depth in the class, and there'll be plenty of time to master this stuff.

But the main thing to understand is, you're going through all of B, then I want to compute another row of C. I'm going to do the same thing. I'm going to go through one row of A and all of B again, so that when I'm done we do about 16 million, 17 million memory accesses total. OK? That's a lot of memory access.

So what if, instead of doing that, I do things in blocks, OK? So what if I want to compute a 64 by 64 block of C rather than a row of C? So let's take a look at what happens there. So remember, by the way, this number-- 16, 17 million, OK? Because we're going to compare with it. OK?

So what about to compute a block? So if I look at a block, that is going to take-- 64 by 64 also takes 4,096 writes to C. Same number, OK? But now I have to do about 200,000 reads from A because I need to access all those rows. OK? And then for B, I need to access 64 columns of B, OK? And that's another 262,000 reads from B, OK? Which ends up being half a million memory accesses total. OK?

So I end up doing way fewer accesses, OK, if those blocks will fit in my cache. OK? So I do much less to compute the same size footprint if I compute a block rather than computing a row. OK? Much more efficient. OK? And that's a scheme called tiling, and so if you do tiled matrix multiplication, what you do is you bust your matrices into, let's say, 64 by 64 submatrices, and then you do two levels of matrix multiply.

You do an outer level of multiplying of the blocks using the same algorithm, and then when you hit the inner, to do a 64 by 64 matrix multiply, I then do another three-nested loops. So you end up with 6 nested loops. OK? And so you're basically busting it like this. And there's a tuning parameter, of course, which is, you know, how big do I make my tile size?

You know, if it's s by s, what should I do at the leaves there? Should it be 64? Should it be 128? What number should I use there? How do we find the right value of s, this tuning parameter? OK? Ideas of how we might find it?

AUDIENCE: You could figure out how much there is in the cache.

CHARLES LEISERSON: You could do that. You might get a number, but who knows what else is going on in the cache while you're doing this.

AUDIENCE: Just test a bunch of them.

CHARLES LEISERSON: Yeah, test a bunch of them. Experiment! OK? Try them! See which one gives you good numbers. And when you do that, it turns out that 32 gives you the best performance, OK, for this particular problem. OK? So you can block it, and then you can get faster, and when you do that, that gave us a speedup of about 1.7.

OK? So we're now up to-- what? We're almost 10% of peak, OK? And the other thing is that if you use Cachegrind or a similar tool, you can figure out how many cache references there are and so forth, and you can see that, in fact, it's dropped quite considerably when you do the tiling versus just the straight parallel loops. OK? So once again, you can use tools to help you figure this out and to understand the cause of what's going on.

Well, it turns out that our chips don't have just one cache. They've got three levels of caches. OK? There's L1-cache, OK? And there's data and instructions, so we're thinking about data here, for the data for the matrix. And it's got an L2-cache, which is also private to the processor, and then a shared L3-cache, and then you go out to the DRAM-- you also can go to your neighboring processors and such. OK?

And they're of different sizes. You can see they grow in size-- 32 kilobytes, 256 kilobytes, to 25 megabytes, to main memory, which is 60 gigabytes. So what you can do is, if you want to do two-level tiling, OK, you can have two tuning parameters, s and t. And now you get to do-- you can't do binary search to find it, unfortunately, because it's multi-dimensional.

You kind of have to do it exhaustively. And when you do that, you end up with--



--with 9 nested loops, OK? But of course, we don't really want to do that. We have three levels of caching, OK? Can anybody figure out the inductive number? For three levels of caching, how many levels of tiling do we have to do? This is a gimme, right?


CHARLES LEISERSON: 12! Good, 12, OK? Yeah, you then do 12. And man, you know, when I say the code gets ugly when you start making things go fast. OK? Right? This is like, ughhh!



OK? OK, but it turns out there's a trick. You can tile for every power of 2 simultaneously by just solving the problem recursively. So the idea is that you do divide and conquer. You divide each of the matrices into 4 submatrices, OK? And then, if you look at the calculations you need to do, you have to solve 8 subproblems of half the size, and then do an addition. OK? And so you have 8 multiplications of size n over 2 by n over 2 and 1 addition of n by n matrices, and that gives you your answer.

But then, of course, what you going to do is solve each of those recursively, OK? And that's going to give you, essentially, the same type of performance. Here's the code. I don't expect that you understand this, but we've written this using in parallel, because it turns out you can do 4 of them in parallel. And the Cilk spawn here says, go and do this subroutine, which is basically a subproblem, and then, while you're doing that, you're allowed to go and execute the next statement-- which you'll do another spawn and another spawn and finally this.

And then this statement says, ah, but don't start the next phase until you finish the first phase. OK? And we'll learn about this stuff. OK, when we do that, we get a running time of about 93 seconds, which is about 50 times slower than the last version. We're using cache much better, but it turns out, you know, nothing is free, nothing is easy, typically, in performance engineering. You have to be clever.


What happened here? Why did this get worse, even though-- it turns out, if you actually look at the caching numbers, you're getting great hits on cache. I mean, very few cache misses, lots of hits on cache, but we're still slower. Why do you suppose that is? Let me get someone-- yeah?

AUDIENCE: Overhead to start functions and [INAUDIBLE].

CHARLES LEISERSON: Yeah, the overhead to the start of the function, and in particular the place that it matters is at the leaves of the computation. OK? So what we do is, we have a very small base case. We're doing this overhead all the way down to n equals 1. So there's a function call overhead even when you're multiplying 1 by 1. So hey, let's pick a threshold, and below that threshold, let's just use a standard good algorithm for that threshold. And if we're above that, we'll do divide and conquer, OK?

So what we do is we call-- if we're less than the threshold, we call a base case, and the base case looks very much like just ordinary matrix multiply. OK? And so, when you do that, you can once again look to see what's the best value for the base case, and it turns out in this case, I guess, it's 64. OK? We get down to 1.95 seconds.

I didn't do the base case of 1, because I tried that, and that was the one that gave us terrible performance.


CHARLES LEISERSON: Sorry. 32-- oh, yeah. 32 is even better. 1.3. Good, yeah, so we picked 32. I think I even-- oh, I didn't highlight it. OK. I should have highlighted that on the slide. So then, when we do that, we now are getting 12% of peak. OK?

And if you count up how many cache misses we have, you can see that-- here's the data cache for L1, and with parallel divide and conquer it's the lowest, but also now so is the last-level caching. OK? And then the total number of references is small, as well. So divide and conquer turns out to be a big win here. OK?

Now the other thing that we mentioned, which was we're not using the vector hardware. All of these things have vectors that we can operate on, OK? They have vector hardware that process data in what's called SIMD fashion, which means Single-Instruction stream, Multiple-Data. That means you give one instruction, and it does operations on a vector, OK? And as we mentioned, we have 8 floating-point units per core, of which we can also do a fused-multiply-add, OK?

So each vector register holds multiple words. I believe in the machine we're using this term, it's 4 words. I think so. OK? But it's important when you use these-- you can't just use them willy-nilly. You have to operate on the data as one chunk of vector data. You can't, you know, have this lane of the vector unit doing one thing and a different lane doing something else. They all have to be doing essentially the same thing, the only difference being the indexing of memory. OK?

So when you do that-- so already we've actually been taking advantage of it. But you can produce a vectorization report by asking for that, and the system will tell you what kinds of things are being vectorized, which things are being vectorized, which aren't. And we'll talk about how you vectorize things that the compiler doesn't want to vectorize. OK? And in particular, most machines don't support the newest sets of vector instructions, so the compiler uses vector instructions conservatively by default.

So if you're compiling for a particular machine, you can say, use that particular machine. And here's some of the vectorization flags. You can say, use the AVX instructions if you have AVX. You can use AVX2. You can use the fused-multiply-add vector instructions. You can give a string that tells you the architecture that you're running on, on that special thing. And you can say, well, use whatever machine I'm currently compiling on, OK, and it'll figure out which architecture is that. OK?

Now, floating-point numbers, as we'll talk about, turn out to have some undesirable properties, like they're not associative, so if you do A times B times C, how you parenthesize that can give you two different numbers. And so if you give a specification of a code, typically, the compiler will not change the order of associativity because it says, I want to get exactly the same result.

But you can give it a flag called fast math, minus ffast-math, which will allow it to do that kind of reordering. OK? If it's not important to you, then it will be the same as the default ordering, OK? And when you use that-- and particularly using architecture native and fast math, we actually get about double the performance out of vectorization, just having the compiler vectorize it. OK? Yeah, question.

AUDIENCE: Are the data types in our matrix, are they 32-bit?

CHARLES LEISERSON: They're 64-bit. Yep. These days, 64-bit is pretty standard. They call that double precision, but it's pretty standard. Unless you're doing AI applications, in which case you may want to do lower-precision arithmetic.

AUDIENCE: So float and double are both the same?

CHARLES LEISERSON: No, float is 32, OK? So generally, people who are doing serious linear algebra calculations use 64 bits. But actually sometimes they can use less, and then you can get more performance if you discover you can use fewer bits in your representation. We'll talk about that, too. OK?

So last thing that we're going to do is, you can actually use the instructions, the vector instructions, yourself rather than rely on the compiler to do it. And there's a whole manual of intrinsic instructions that you can call from C that allow you to do, you know, the specific vector instructions that you might want to do it. So the compiler doesn't have to figure that out.

And you can also use some more insights to do things like-- you can do preprocessing, and you can transpose the matrices, which turns out to help, and do data alignment. And there's a lot of other things using clever algorithms for the base case, OK? And you do more performance engineering. You think about you're doing, you code, and then you run, run, run to test, and that's one nice reason to have the cloud, because you can do tests in parallel.

So it takes you less time to do your tests in terms of your, you know, sitting around time when you're doing something. You say, oh, I want to do 10 tests. Let's spin up 10 machines and do all the tests at the same time. When you do that-- and the main one we're getting out of this is the AVX intrinsics-- we get up to 0.41 of peak, so 41% of peak, and get about 50,000 speedup. OK?

And it turns out that's where we quit. OK? And the reason is because we beat Intel's professionally engineered Math Kernel Library at that point.


OK? You know, a good question is, why aren't we getting all of peak? And you know, I invite you to try to figure that out, OK? It turns out, though, Intel MKL is better than what we did because we assumed it was a power of 2. Intel doesn't assume that it's a power of 2, and they're more robust, although we win on the 496 by 496 matrices, they win on other sizes of matrices, so it's not all things.

But the end of the story is, what have we done? We have just done a factor of 50,000, OK? If you looked at the gas economy, OK, of a jumbo jet and getting the kind of performance that we just got in terms of miles per gallon, you would be able to run a jumbo jet on a little Vespa scooter or whatever type of scooter that is, OK? That's how much we've been able to do it.

You generally-- let me just caution you-- won't see the magnitude of a performance improvement that we obtained for matrix multiplication. OK? That turns out to be one where-- it's a really good example because it's so dramatic. But we will see some substantial numbers, and in particular in 6.172 you'll learn how to print this currency of performance all by yourself so that you don't have to take somebody else's library. You can, you know, say, oh, no, I'm an engineer of that.

Let me mention one other thing. In this course, we're going to focus on multicore computing. We are not, in particular, going to be doing GPUs or file systems or network performance. In the real world, those are hugely important, OK? What we found, however, is that it's better to learn a particular domain, and in particular, this particular domain-- people who master multicore performance engineering, in fact, go on to do these other things and are really good at it, OK? Because you've learned the sort of the core, the basis, the foundation--