Topics covered: Performance
Instructors: Prof. Hari Balakrishnan and Prof. Samuel Madden
Let's go ahead and get started. OK, so today we have one topic to finish up very briefly from last time. If you remember, when we finished off last time, we were talking about the example of a multithreaded Web server. The example that we were talking about -- and this is an example that I'm going to use throughout the lecture today -- consists of a Web server with three main modules or components: a networking module, a Web server module -- which is in charge of generating, for example, HTML pages -- and then a disk module, which is in charge of reading data off a disk. OK, so this thing is going to be communicating with the disk, which I've drawn as a cylinder here.

So what happens is the client requests come in to this Web server. They come in to the network module. The network module forwards those requests on to the Web server. The Web server is in charge of generating, say, the HTML page that corresponds to the request. And in order to do that, it may need to read some data off of the disk. So it forwards this request on to the disk module, which goes and actually gets the page from the disk, and at some point later, the disk returns the page to the Web server. The Web server returns the page to the network module, and then the network module sends the answer back over the network to the user. So this is a very simple example of a Web server. It should be familiar to you, since you have just spent a while studying the Flash Web server, and you can see that this is a simplified description of what a Web server does.

So if you think about how you would actually go about designing a Web server like this, of course it's not the case that there is only one request moving between these modules at any one point in time. In fact, there may be multiple client requests that come in to the network module. And the network module may want to have multiple outstanding pages that it's asking the Web server to generate. And the Web server itself might be requesting multiple items from the disk. And so, in turn, that means that at any point in time there could be multiple results: there could be results streaming back in from the disk, going into the Web server, which is chewing on results and producing pages for the network module. And so it's possible for there to be queues building up between these modules, on both the send and the receive sides. So I'm going to draw a queue as a box with vertical arrows through it. There is some buffering happening between the incoming requests and the outgoing requests on these modules. OK, and this buffering is a good thing.
And we're going to talk more about this throughout the lecture today, because what it allows us to do is decouple the operations of these different modules. So, for example, the disk module can be reading a page from disk while the HTML server is simultaneously generating an HTML page that it wants to return to the client. But in this architecture, you can see that when the Web server wants to produce a result, it can only do so when the disk pages that it needs are actually available. So the Web server is dependent on some result from the disk module being available.

So if we were to look at just this part of the Web server -- I'm going to call these the HTML thread and the disk thread, the two threads on the right side of this diagram that I've drawn here -- and look at the code that was running in these things, which we saw last time, the code might look something like this. The HTML thread is just going to sit in a loop, continually trying to dequeue information from this queue that is shared between it and the disk thread. And the disk thread is going to be in a loop where it continually reads blocks off the disk and then enqueues them onto this queue.

So this design at first seems like it might be fine. But if you start thinking about what's really going on here, there could be a problem. Suppose, for example, that the queue is of a finite length: it only has a certain number of elements in it. Now, when we keep calling enqueue over and over and over again, it's possible that if the HTML thread isn't consuming these pages off the queue fast enough, the queue could fill up, and it could overflow, right? So that's a condition that we would want to explicitly check for in the code. And we can do that by adding a set of conditions like this. What you see here is that I have augmented the code with two additional variables, used and free, where used indicates the number of blocks in the queue that are currently in use, and free indicates the number of blocks in the queue that are currently free. So the disk thread only wants to enqueue something onto the queue when there are some free blocks: it has a while loop that just spins, waiting, while there are no free blocks, OK? And similarly, the HTML thread is just going to wait while there are no used blocks, OK? And then, when the disk thread enqueues a block onto the queue, it's going to decrement the free count, because it has reduced the number of free slots in the queue, and it's going to increment the used count, because now there is one additional thing available in the queue, OK?
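The code itself isn't reproduced in the transcript, so here is a minimal sketch of what this busy-waiting version might look like in C. All of the helper names and types are assumptions for illustration, not the lecture's actual code (the lecture's variable free is renamed free_slots, since free() is already taken in C):

```c
#define QUEUE_SIZE 64
typedef int block_t;                        /* stand-in for a disk block */

extern block_t read_block_from_disk(void);  /* assumed helpers, not      */
extern void generate_html_page(block_t b);  /* shown in the transcript   */
extern void enqueue(block_t b);
extern block_t dequeue(void);

volatile int used = 0;                 /* blocks in the queue in use */
volatile int free_slots = QUEUE_SIZE;  /* empty slots in the queue   */

/* Disk thread: the producer. */
void disk_thread(void) {
    for (;;) {
        block_t b = read_block_from_disk();
        while (free_slots == 0)
            ;                          /* busy-wait: burns CPU while spinning */
        enqueue(b);
        free_slots--;
        used++;
    }
}

/* HTML thread: the consumer. */
void html_thread(void) {
    for (;;) {
        while (used == 0)
            ;                          /* busy-wait */
        block_t b = dequeue();
        used--;
        free_slots++;
        generate_html_page(b);
    }
}
```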
So this is a simple way in which we've now made these things wait for each other. They are coordinating with each other by use of these two shared variables, used and free. OK, so these two threads share these variables, and that's fine. But if you think about this from a scheduling point of view, there still is a little bit of a problem with this approach. In particular, when one of these threads enters one of these while loops, it's just going to sit there checking the condition over and over and over again, right? So whenever the thread scheduler schedules that thread, it's going to repeatedly check this condition, and that's maybe not so desirable. Suppose, for example, that the HTML thread enters this loop and starts looping because there's no data available. What we would really like is for the disk thread to get a chance to run, so that it can produce some data and the HTML thread can then go ahead and operate. But with this while loop there, we can't quite do that; we just waste the CPU during the time we are in the while loop. So instead, what we are going to do is introduce a set of what we call sequence coordination operators.

In order to introduce these, we're going to add a new kind of data type that we call an eventcount. An eventcount you can just think of as an integer that indicates the number of times that something has occurred; it's a running counter variable. And we're going to introduce two new routines, called wait and notify. Wait takes two arguments: one of these eventcount variables, and a value. What wait says is: check the value of this eventcount, and see whether, at the moment we check it, the value of the eventcount is less than or equal to value. If the eventcount is less than or equal to value, then it waits. And what it means for it to wait is that it tells the thread scheduler that it no longer wants to be scheduled until somebody later calls the notify routine on this same eventcount variable.

OK, so wait says: wait if this condition is true. And notify says: wake up everybody who's waiting on this variable. We can use these routines in the following way in this code, and it's really very straightforward. We simply change our while loops into wait statements. We're going to have the HTML thread wait until the value of used becomes greater than zero, and we're going to have our disk thread wait until the value of free becomes greater than zero. And then the only other thing we have to add to this is simply a call to notify.
What notify does is indicate to any other thread that is waiting on a particular variable that it can run. So the HTML thread will notify free, which will tell the disk thread that it can now begin running if it had been waiting on the variable free. OK, so this emulates the behavior of the while loop that we had before, except that rather than letting a thread sit in an infinite while loop, the thread scheduler simply doesn't schedule the HTML thread or the disk thread while it's waiting in one of these wait statements.
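Here is the same producer/consumer sketch with the busy-wait loops replaced by sequence coordination, following the wait/notify semantics just described. The primitives are renamed ec_wait and ec_notify to avoid colliding with POSIX wait(); they are not standard C functions, and this sketch ignores the lost-wakeup subtleties a real implementation would have to handle:

```c
typedef volatile int eventcount_t;     /* a running counter, per the lecture */

extern void ec_wait(eventcount_t *ec, int value);  /* sleep while *ec <= value   */
extern void ec_notify(eventcount_t *ec);           /* wake threads waiting on ec */

eventcount_t used = 0;
eventcount_t free_slots = QUEUE_SIZE;

void disk_thread(void) {
    for (;;) {
        block_t b = read_block_from_disk();
        ec_wait(&free_slots, 0);       /* sleep until free_slots > 0 */
        enqueue(b);
        free_slots--;
        used++;
        ec_notify(&used);              /* wake the HTML thread if it's waiting */
    }
}

void html_thread(void) {
    for (;;) {
        ec_wait(&used, 0);             /* sleep until used > 0 */
        block_t b = dequeue();
        used--;
        free_slots++;
        ec_notify(&free_slots);        /* wake the disk thread if it's waiting */
        generate_html_page(b);
    }
}
```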
OK, so what we're going to talk about for the rest of the lecture today is related to this, and I think you will see why as we go. The topic for today is performance. What we've looked at so far in this class are various ways of structuring complex programs: how to break them up into several modules, the client/server paradigm, how threads work, how a thread scheduler works -- all of these big topics about how you design a system. But we haven't said anything about how you take a system design and, in an ordered, systematic way, think about making that system run efficiently. So that's what we're going to try to get at today. We're going to look at a set of techniques that we can use to make a computer system more efficient. There are really three techniques that we're going to look at. The first is called concurrency. Concurrency is about allowing the system to perform multiple operations simultaneously. For example, in our sample Web server, we may be able to read pages from the disk at the same time that the CPU generates some Web pages to output to the client. OK, so that's what concurrency is about. We are also going to look at a technique called caching, which you should all have seen before. Caching is about saving off some previous work -- some computation we've already done, or a disk page we've already read in -- so that we can reuse it again at a later time. And finally, we are going to look at something called scheduling. Scheduling is about the fact that when we have multiple requests to process, we might be able to order those requests in a certain way, or group them together in a certain way, so that we can make the system more efficient. It's about choosing the order in which we do things so the system runs more efficiently. And throughout, I'm going to use this example of the Web server that we've been talking about to motivate each of these performance techniques.

So in order to get to the point where we can understand how these performance techniques work, we need to talk a little bit about what we mean by performance. How do we measure the performance of a system, and how do we understand where the bottlenecks in a system might be? The first thing we need to do is define a set of performance metrics: a set of terms and definitions that we can use to talk about what the performance of a system is.

The first metric we might be interested in is the capacity of a system. Capacity is simply some measure of the amount of resource in a system. That sounds kind of abstract, but what we mean by a resource is some sort of thing that we compute with: a disk, or a CPU, or a network. So we might, for example, say that the capacity of a disk is its size in gigabytes, or that the capacity of a processor is the number of instructions it can execute per second. OK, so once we have capacity, we can start talking about how much of the system we are actually using: the utilization. Utilization is simply the percentage of the capacity we're using. So we might have used up 80% of the disk blocks on our computer.

Now, there are two metrics that are very commonly used in computer systems to characterize performance. The first is latency: the time for a request (REQ) to complete. And we can also talk about what at first will seem like the inverse of this, which is throughput: the number of requests per second that we can process.

When you first see these definitions, it's tempting to think that throughput is simply the inverse of latency, right? If it takes 10 ms for a request to complete, then I must be able to complete 100 requests per second. And that's true in the very simple case where I have a single module that can process one request at a time -- a single computational resource that can only do one thing at a time. If this thing has an infinite set of inputs, and it takes 10 ms to process each input, we'll see 100 results per second coming out, OK? If something takes 10 ms to do, you can do 100 of them per second. So we could say the throughput of this system is 100 per second, and the latency is 10 ms.
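In symbols, and only for this single-stage, one-request-at-a-time case:

    throughput = 1 / latency = 1 / (10 ms) = 100 requests per second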
What we're going to see throughout this talk is that in fact a strict relationship between latency and throughput doesn't hold. You have probably already seen the notion of pipelining in 6.004, and you understand that pipelining is a way to improve the throughput of a system without necessarily changing the latency. We'll talk about that more carefully as this talk goes on.

OK, so given these metrics, suppose I have some system and some set of goals for it: I want the system to be able to process a certain number of requests per second, or I want its latency to be under some amount. You are given this computer system, you sit down, and you want to measure it. What do you expect to find? In the design of computer systems, it turns out there are some well-known performance pitfalls, so-called performance bottlenecks. The goal of doing performance analysis of a system is to look at the system and figure out where the bottlenecks are. Typically, in the design of a big computer system, what we're worried about is which of the individual modules within the system is most responsible for slowing it down -- and then, once we've identified that module, thinking about how to make that slow module run faster. That's really what finding performance bottlenecks is about.

And there's a classic bottleneck that occurs in computer systems that you all need to know about: the so-called IO bottleneck. What the IO bottleneck says is fairly straightforward. If you think about a computer system, it has a hierarchy of memory devices in it, OK? These memory devices, or storage devices, start with the CPU. The CPU has some set of registers on it: a small number of them, say, for example, 32. And you can access those registers very, very fast -- say, once per instruction, once per cycle on the computer. So, for example, if your CPU is one gigahertz, you may be able to access one of these registers in 1 ns. So at the top level of this pyramid, we have a small amount of storage that is fast, OK? As we go down this pyramid, adding new layers to the storage hierarchy, we're going to see that things get bigger and slower. Just below the CPU, we may have some processor cache, and this might be, for example, 512 kB; it might take 20 ns to access a single, say, block of this memory. And then we're going to have the RAM, the main memory of the device, which on a modern machine might be 1 GB.
And it might take 100 ns to access. Below that, you take a big step up in size and a big step down in performance: you typically have a disk. A disk might be as big as 100 GB, right? But its performance is very slow: it's a mechanical thing that has to spin, and it only spins so fast. A typical access time for a block on the disk might be as high as 10 ms, or even higher. And sometimes people will place the network as a level below that in this hierarchy. If something isn't available on the local disk -- for example, on our Web server -- we might actually have to go out onto the network and fetch it. And if this network is the Internet, the Internet has a huge amount of data -- who knows how much; it's certainly orders of terabytes -- and it could take a long time to get a page over it. It might take 100 ms to reach some remote site on the Internet.

All right, so the point about this IO bottleneck is that the disparity in the performance of these different levels is going to be a very common source of performance problems in our computers. In particular, look at the access times: 1 ns at the top, 100 ms at the bottom. That is a factor of ten to the eighth -- a 100-million-times difference between the fastest and the slowest thing here. So if the CPU has to wait for something to come over the network, you're waiting a very long time in terms of the amount of time the CPU takes to, say, read a single word of memory. So when we look at the performance of a computer system, we're going to see that often this IO bottleneck is the problem.

So if we look, for example, at our Web server, with its three stages, where this stage is the one that goes to disk, this is the HTML stage, which maybe can be computed entirely in memory, and this is the network stage: we might be talking about 10 ms of latency for the disk stage; just 1 ms for the HTML stage, because all it has to do is some computation in memory; and 100 ms for the network stage, because it has to send some data out to some remote site. So if, in order to process a single request, you have to go through each of these steps in sequence, then the total time to process a single request is going to be the sum of these three things: 111 ms, OK? And if you look at the system and ask what the performance bottleneck is, it is clearly this network stage, because it takes the longest to run.
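Gathering the lecture's rough, illustrative numbers in one place:

    registers   ~32 words     ~1 ns
    cache       ~512 kB       ~20 ns
    RAM         ~1 GB         ~100 ns
    disk        ~100 GB       ~10 ms
    network     terabytes+    ~100 ms

    sequential latency = 10 ms + 1 ms + 100 ms = 111 ms per request
    bottleneck stage   = network (100 ms)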
And so if we want to answer the question of where we should optimize the system, one place we might think to optimize is within this network stage. And we'll see later an example of a simple optimization, based on this notion of concurrency, that improves the performance of the networking stage. So as I just said, the notion of concurrency is going to be the way that we go about eliminating these IO bottlenecks. The idea is that we want to overlap the use of some other resource during the time that we are waiting for one of these slow IO devices to complete. And we are going to look at two types of concurrency: concurrency between modules, and concurrency within a module, OK? Our networking module, for example, may be composed of multiple threads, each of which can be accessing the network; that's an example of concurrency within a module. And we're going to look at the case of between-module concurrency where, for example, the HTML module can be generating an HTML page while the disk module is reading a page for another client at the same time. The idea behind concurrency is really going to be that by using it, we can hide the latency of one of these slow IO stages.

OK, so the first kind of concurrency we're going to talk about is concurrency between modules, and the primary technique we use for this is pipelining. The idea with pipelining is as follows. Suppose we have our Web server again, and this time let's draw it as I drew it at first, with queues between each of the modules, OK? So we have our Web server with its three stages. And suppose we have some set of requests -- an infinite queue of requests -- queued up at the disk thread, and the disk thread is producing results and sending them through. We want to look at how many pages come out the other end per second, and what the latency of each page is. So if we have some list of requests, suppose these requests are numbered R1 through RN, OK? What's going to happen is that the first request is going to start being processed by the disk thread: it's going to start processing R1. Now, in a pipelined system, what we're going to want is to have each one of these threads working on a different request -- each one of these modules working on a different request at each point in time. And because the disk is an independent resource from the CPU, which is an independent resource from the network, this is going to be OK: these three modules aren't actually going to contend with each other too much. So what's going to happen is this guy's going to start processing R1, right?
And then after 10 ms, he's going to pass R1 up to here and start working on R2, OK? And now, 1 ms after that, this guy is going to finish R1 and send it on. And then, 9 ms after that, R2 is going to come up here, and this guy can start processing R3. OK, so does everybody see where those numbers are coming from? OK. [LAUGHTER] Good.

So now, let's look at time, starting at time zero, in terms of the requests that come in and out of this last network thread, so we can get a sense of how fast this thing is processing. R1 enters this network stage after 11 ms, right? It takes 10 ms to get through the disk stage and 1 ms to get through the HTML stage, and then the network thread starts processing R1; I'm going to write plus R1 to indicate that we start processing it here. The next time this module can do anything is 100 ms after it first started processing R1. So at time 111 ms, it can output R1; it's done processing it. And by that time, of course, R2 and R3 -- some set of requests -- have already queued up in this queue waiting for it, so it can immediately begin processing R2, OK? Then, clearly, after 211 ms it's going to output R2 and begin processing R3, OK? (So there should be a plus there and a plus there.) And similarly, at 311 ms we're going to move on to the next one.

So if we look now at the system, we've done something pretty interesting. The time for a request to travel through this whole pipeline is still 111 ms, but if you look at the inter-arrival time between each of these successive outputs, it is only 100 ms, right? We are only waiting as long as it takes the network thread to process one result in order to produce answers. So by pipelining the system in this way -- having the Web server thread and the disk thread do their processing on later requests while the network thread is processing its request -- we can increase the throughput of the system. So in this case, we get an arrival every 100 ms: the throughput is now one result every 100 ms, or ten results per second, OK? So even though the latency is still 111 ms, the throughput is no longer one over the latency, because we have separated the two by pipelining.

OK, so that was good. That was nice. We improved the performance of the system a little bit. But we didn't really improve it very much, right? We increased the throughput of this thing a little bit, but we haven't really addressed what we identified earlier as being the bottleneck, which is the fact that this network stage takes 100 ms to process each request.
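Laying the pipelined timeline out explicitly (times in ms, using the stage latencies of 10, 1, and 100 ms):

    t = 0      disk thread starts R1
    t = 10     disk hands R1 to the HTML thread, starts R2
    t = 11     network thread starts R1
    t = 111    network thread outputs R1, starts R2
    t = 211    network thread outputs R2, starts R3
    t = 311    network thread outputs R3, ...

    latency    = 10 + 1 + 100 = 111 ms per request
    throughput = 1 per 100 ms = 10 requests per second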
And in general, when we have a pipelined system like this, we can say that the throughput of the system is limited by its slowest stage: any time you have a pipeline, the throughput of the system is going to be the throughput of the slowest stage. So in this case, the throughput is 10 results per second, and that's the throughput of the whole system. If we want to increase the throughput any more than this, we're going to have to somehow improve the performance of this network module. And the way that we're going to do that is also by exploiting concurrency -- this time, concurrency within a module.

If you think about how a Web server works, or how a network works, typically when we are sending these requests to a client, we are not using up all of the available bandwidth of the network. You may be able to send 100 MB per second out over your network, or, if you're connected to a machine here, you may be able to send 10 MB a second across the country to some other university. The issue is that it takes a relatively long time for a request to propagate, especially when that request is propagating out over the Internet: the latency can be quite high. But you may not be using all the bandwidth when you are, say, sending an HTML page. So in particular, it is the case that multiple applications -- multiple threads -- can be simultaneously sending data out over the network. And if that doesn't make sense to you right now, we're going to spend the whole next four lectures talking about network performance, and it should make sense then. So just take my word for it: one of the properties of the network is that its latency may be relatively high, yet we are not actually using all the bandwidth that's available to us. That suggests there is an idle resource: some network bandwidth that we could be using but are not. We'd like to take advantage of that in the design of our system.

We can do this in a relatively simple way, which is simply to say: within our networking module, rather than having only one thread sending out requests at a time, let's have multiple threads -- say, 10 threads: thread 1, thread 2, ..., thread 10, OK? We're going to allow these all to be using the network at once. They are all going to be reading from the same queue that's connected to the same HTML module, which is connected to the same disk module, and there's a queue between those as well. OK, so now when we think about the performance of this, let's see what happens when we start running requests through this pipeline.
And let's see how frequently we get requests coming out of the other end. We draw our timeline again. You can see that R1 is going to come in here; after 10 ms it's going to move to here; and after 11 ms it'll arrive here, and we'll start processing request one. Now, the second request, R2, is going to have 9 ms of processing left on it in the disk stage after R1 gets sent on to the next thread. So R2 is going to be in the disk stage for 9 more ms, and in the HTML stage for 1 ms; thus 10 ms after R1 arrives at the network module, R2 is going to arrive there too. So what we have is: at 11 ms, we have R1; 10 ms later, we have R2. OK, so now you can see that suddenly this system is able to process multiple requests at the same time. 10 ms after that, R3 is going to start being processed, and so on. What that means is that after some passage of time, we're going to have R10 in here; it's going to go in after 101 ms, right? So we're going to get R10, and now all ten network threads are busy.

Now we've pushed all of these through, and suppose we start processing R11. R11 is going to flow through this pipeline, and at time 111, R11 is going to be ready to be processed. But notice that at time 111, we are finished processing R1, right? So at this time, we can add R11 to the system, and we can output R1. And now, every 10 ms after this, another request is going to arrive, and we're going to be able to output the next one. And this is just going to continue. So now you see what we've managed to do: we've made this system so that, after this startup time of 111 ms, we are producing results every 10 ms, right? So we are actually going to get 100 results per second; this is going to be the throughput of this system now.

OK, so that was kind of neat. How did we do that? What have we done here? Well, effectively, what we've done is make it so that this network module can process 10 times as many requests as it could before: this module itself now has 10 times the throughput that it had. And we said before that the throughput of the system is the throughput of the slowest stage. So what we've managed to do is increase the throughput of the slowest stage, and so now the system is running 10 times as fast. Notice that the disk thread and the network threads now both take 10 ms per request, so the throughput of each of them is 100 per second. We now have two stages that have been equalized in their throughput, and if we wanted to further increase the performance of the system, we would have to increase the performance of both of these stages, not just one of them. OK, so that was a nice result, right?
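As a concrete sketch of this within-a-module concurrency, here is what a pool of ten network worker threads might look like using POSIX threads. All of the names and helpers are assumptions; the point is only that several workers share one input queue, so while one worker is stalled on a slow client, the other nine keep sending:

```c
#include <pthread.h>

#define NUM_NET_THREADS 10

typedef int page_t;                     /* stand-in for an HTML page           */
extern page_t dequeue_page(void);       /* assumed: take from the shared queue */
extern void send_to_client(page_t p);   /* assumed: blocking network send      */

/* Each worker loops forever: take the next finished page and send it.
   Note that sharing the queue among workers is exactly what creates
   the race condition discussed next. */
static void *net_worker(void *arg) {
    (void)arg;
    for (;;)
        send_to_client(dequeue_page());
    return NULL;
}

void start_network_module(void) {
    pthread_t tid[NUM_NET_THREADS];
    for (int i = 0; i < NUM_NET_THREADS; i++)
        pthread_create(&tid[i], NULL, net_worker, NULL);
}
```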
It seems like we've shown that we can use this notion of concurrency to increase the performance of a system. But we've introduced a little bit of a problem. In particular, the problem we've introduced is as follows. Remember, we said we had this set of threads, 1 through, say, 10, that are all sharing this queue data structure, which is connected up to our HTML thread. The problem is that what we've done is introduce what's called a race condition on this queue. And I'll show you what I mean by that. If we look at our code snippet up here -- for example, what's happening in our HTML thread -- we see that what it does is call dequeue, right? So the problem we can have is that multiple of these modules may be executing simultaneously, and they may simultaneously both call dequeue. Depending on how dequeue is implemented, we can get some weird results.

So let me give you a very simple possible implementation of dequeue. Suppose the queue is managed by two variables that keep track of its current state: a variable called first, which points to the first used element in the queue, and a variable called last, which points to the last used element. The elements that are in use in the queue at any one time are the ones between first and last, OK? What's going to happen is that when we dequeue, we're going to move first over by one, freeing up this cell; and when we enqueue, we'll move last down by one. And when last reaches the end, we wrap it around. This is a fairly standard implementation of a queue, called a circular buffer. And if first is equal to last, then we know that the queue is full; so that's the condition we can check. We are not going to go into too many more details of how this thing is actually implemented.

But let's look at a very simple example of how dequeue might work. Remember, we have these two shared variables, first and last, that are shared among all the threads accessing this queue. What dequeue might do is read a page from this queue -- read the next HTML page to output -- into a local variable called page. Let's call this queue buf, B-U-F, and we'll use array notation for accessing it. So dequeue is going to read buf[first], then it's going to increment first -- first gets first plus one -- and then it's going to return page. OK, that seems like a straightforward implementation of dequeue.
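In C, this unsafe dequeue might look like the following sketch (buf, first, last, and page are the lecture's names; the type and size are assumptions, and the wrap-around is written out explicitly):

```c
#define QUEUE_SIZE 64
typedef int page_t;                 /* stand-in for an HTML page */

page_t buf[QUEUE_SIZE];             /* the shared circular buffer           */
int first = 0;                      /* index of first used element (shared) */
int last  = 0;                      /* index of last used element (shared)  */

/* UNSAFE under concurrency: nothing prevents two threads from
   interleaving between the read and the increment. */
page_t dequeue(void) {
    page_t page = buf[first];            /* (1) read the next page      */
    first = (first + 1) % QUEUE_SIZE;    /* (2) advance first, wrapping */
    return page;                         /* (3)                         */
}
```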
And so we have one thread that's doing this. Now, suppose we have another thread that's doing exactly the same thing at the same time -- it runs exactly the same code. And remember that these two threads are sharing the variables buf and first.

OK, so if you think about these two threads running at the same time, there is an interesting problem that can arise. One thing that might happen is that the thread scheduler first starts running thread 1: it runs the first instruction of T1, then the second instruction, then the return. And then it might come over here and start running T2 -- it stops running T1, starts running T2, and executes its three instructions. If the thread scheduler does this, there's nothing wrong; it's not a problem, right? Each of these threads read its value from the queue and incremented first: T1 read one thing from the queue, and then T2 read the next thing from the queue. So clearly, some of the time this is going to work fine. So let's make a list of possible outcomes: the first possible outcome was OK.

But let's look at a different situation. Suppose what happens is that the first thing the thread scheduler does is schedule T1, and T1 executes this first instruction; and just after that, the thread scheduler decides to preempt T1 and allow T2 to start running. In particular, it allows T2 to execute its dequeue to the end, and then it comes back over here and runs the rest of T1. OK, so what's the problem now? Yeah? Right, OK: they both read the same page variable. So now both of these threads have dequeued the same page. The variable first, when T1 read it, was pointing here; then we switched, and it was still pointing here, right? So both of these guys have read the same page. And now they are both, at some point, going to increment first: you're going to increment it once, then you're going to increment it again. So this second element here in the queue has been skipped. OK, so this is a problem. We don't want this to happen, because the system is not outputting all the pages that it was supposed to output.

So what can we do to fix this? The way that we fix this is by introducing something we call isolation primitives. And the basic idea is that we want to introduce an operation that will make it so that any time the page variable gets read out of the queue, we also, at the same time, increment first, without any other thread's accesses to this queue being interleaved with our dequeue.
In technical terms, what we say is that we want these two things -- the reading of page and the incrementing of first -- to be so-called atomic. And the way that we're going to make these things atomic is by isolating them from each other: by isolating these two threads from each other while they are executing the enqueue and dequeue operations. These two terms we're going to come back to in a few months, towards the end of the class. But all you need to understand here is that there is this race condition, and we want some way to prevent it.

The way that we're going to prevent it is by using these isolation primitives, which in this case are called locks. The idea is that a lock is simply a variable that can be in one of two states: it can be either set or unset. And we have two operations that we can apply to a lock: we can acquire it, and we can release it. Acquire and release have the following behavior. What acquire says is: check the state of the lock, and if the lock is unset, then change the state to set; but if the lock is set, then wait until the lock becomes unset. What release says is simply: change the state of the lock from set to unset.

So let's see how we can use these two routines in our code. Let's go back to our example of enqueue and dequeue, and introduce a lock variable; we'll call it TL, for thread lock. What we're going to do is simply put an acquire and a release around the two operations that access the queue -- the reading of page and the modification of first. So we have ACQ on this thread lock, and we have release on this thread lock.

OK, so this seems fine; it looks like we've done it. But we've been positing the existence of this acquire procedure that just does the right thing. If you think about this for a minute, it seems like we can have the same race condition problem in the acquire function as well, right? What if two guys both try to acquire the lock at the same time? How are we going to avoid this problem? There are a couple of ways of avoiding this problem that are well understood in practice, and they're talked about in the book. I'm just going to introduce the simplest of them now, which is to add a special instruction to the microprocessor that allows us to do this acquire efficiently. It turns out that most modern microprocessors have an equivalent instruction. We're going to call this instruction RSL, for read and set lock.

The idea with RSL is as follows. In the implementation of acquire, if we don't have the lock, we want to loop until we have the lock. So the implementation of acquire may look as follows. We'll have a local variable called held, which we initially set to false; and in a while loop, while we don't hold the lock, we're going to use this RSL instruction: held = RSL(TL), OK? What the RSL instruction does is look at the state of the lock: if the lock is unset, then it sets the lock and returns true; and if the lock is already set, then it returns false. So it has the property that it can both read and set the lock within a single instruction, right?
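Here is a sketch that puts the pieces together, reusing the buf/first queue from the earlier sketch. C11's atomic_flag stands in for the lecture's hypothetical RSL instruction; note that atomic_flag_test_and_set atomically sets the flag and returns its previous state, which is the opposite sense of RSL as described, hence the negation:

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_flag tl = ATOMIC_FLAG_INIT;   /* the thread lock TL, initially unset */

/* acquire: spin until we hold the lock, as in the lecture's pseudocode. */
void acquire(atomic_flag *lock) {
    bool held = false;
    while (!held)
        held = !atomic_flag_test_and_set(lock);  /* true only if it was unset */
}

/* release: set -> unset. */
void release(atomic_flag *lock) {
    atomic_flag_clear(lock);
}

/* dequeue with its critical section isolated. */
page_t dequeue(void) {
    acquire(&tl);
    page_t page = buf[first];            /* read...                        */
    first = (first + 1) % QUEUE_SIZE;    /* ...and increment, atomically   */
    release(&tl);                        /* with respect to other callers  */
    return page;
}
```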
And we're going to use this read-and-set-lock primitive, this basic thing, as a way to build up the more complicated acquire function, which we can then use to build up these locks. OK, so any time you're designing a multithreaded system in this way, or a system with lots of concurrency, you should be worrying about whether you have race conditions. And if you have race conditions, you need to think about how to use locks in order to prevent them.

All right, so there are a couple of other topics related to performance that appear in the text. One of those topics is caching, and I just want to spend one very brief minute on it. You have presumably already seen caching in the context of 6.004, with processor caches. And you might want to sit down and think through, as an exercise, how you would use a cache to improve the performance of our Web server. One thing that you might do to improve the performance of the Web server is to put a cache in the disk thread, and use it instead of going to disk, in order to reduce the latency of a disk access. At the beginning of class next time, I'll take you through a very simple example of how we can actually do that in the disk thread. But you should think about this a little bit on your own.

So, barring that little digression that we'll have next time, this takes us to the end of our discussion of modularity, abstraction, and performance. What we're going to start talking about next time is networking, and how networks work. But I want you to make sure you keep in mind all of the topics that we've talked about, because these are going to be the fundamental tools that we use throughout the class in the design of computer systems. Just because we've finished this module doesn't mean that it's OK to stop thinking about this stuff; you need to keep all of it in mind at the same time. So we'll see you all on Wednesday.