1 00:00:00,770 --> 00:00:05,600 Okay, one more cache design decision to make, then we're done! 2 00:00:05,600 --> 00:00:09,130 How should we handle memory writes in the cache? 3 00:00:09,130 --> 00:00:15,640 Ultimately we'll need update main memory with the new data, but when should that happen? 4 00:00:15,640 --> 00:00:18,290 The most obvious choice is to perform the write immediately. 5 00:00:18,290 --> 00:00:24,590 In other words, whenever the CPU sends a write request to the cache, the cache then performs 6 00:00:24,590 --> 00:00:27,280 the same write to main memory. 7 00:00:27,280 --> 00:00:30,050 This is called "write-through". 8 00:00:30,050 --> 00:00:34,640 That way main memory always has the most up-to-date value for all locations. 9 00:00:34,640 --> 00:00:40,140 But this can be slow if the CPU has to wait for a DRAM write access - writes could become 10 00:00:40,140 --> 00:00:42,440 a real bottleneck! 11 00:00:42,440 --> 00:00:48,120 And what if the program is constantly writing a particular memory location, e.g., updating 12 00:00:48,120 --> 00:00:52,260 the value of a local variable in the current stack frame? 13 00:00:52,260 --> 00:00:56,880 In the end we only need to write the last value to main memory. 14 00:00:56,880 --> 00:01:00,960 Writing all the earlier values is waste of memory bandwidth. 15 00:01:00,960 --> 00:01:05,649 Suppose we let the CPU continue execution while the cache waits for the write to main 16 00:01:05,649 --> 00:01:09,729 memory to complete - this is called "write-behind". 17 00:01:09,729 --> 00:01:14,179 This will overlap execution of the program with the slow writes to main memory. 18 00:01:14,179 --> 00:01:19,549 Of course, if there's another cache miss while the write is still pending, everything will 19 00:01:19,549 --> 00:01:24,090 have to wait at that point until both the write and subsequent refill read finish, since 20 00:01:24,090 --> 00:01:29,929 the CPU can't proceed until the cache miss is resolved. 21 00:01:29,929 --> 00:01:34,479 The best strategy is called "write-back" where the contents of the cache are updated and 22 00:01:34,479 --> 00:01:36,561 the CPU continues execution immediately. 23 00:01:36,561 --> 00:01:43,060 The updated cache value is only written to main memory when the cache line is chosen 24 00:01:43,060 --> 00:01:47,029 as the replacement line for a cache miss. 25 00:01:47,029 --> 00:01:52,189 This strategy minimizes the number of accesses to main memory, preserving the memory bandwidth 26 00:01:52,189 --> 00:01:54,549 for other operations. 27 00:01:54,549 --> 00:01:58,810 This is the strategy used by most modern processors. 28 00:01:58,810 --> 00:02:00,450 Write-back is easy to implement. 29 00:02:00,450 --> 00:02:05,289 Returning to our original cache recipe, we simply eliminate the start of the write to 30 00:02:05,289 --> 00:02:08,348 main memory when there's a write request to the cache. 31 00:02:08,348 --> 00:02:12,220 We just update the cache contents and leave it at that. 32 00:02:12,220 --> 00:02:17,790 However, replacing a cache line becomes a more complex operation, since we can't reuse 33 00:02:17,790 --> 00:02:22,780 the cache line without first writing its contents back to main memory in case they had been 34 00:02:22,780 --> 00:02:26,180 modified by an earlier write access. 35 00:02:26,180 --> 00:02:27,450 Hmm. 36 00:02:27,450 --> 00:02:31,890 Seems like this does a write-back of all replaced cache lines whether or not they've been written 37 00:02:31,890 --> 00:02:34,470 to. 38 00:02:34,470 --> 00:02:39,500 We can avoid unnecessary write-backs by adding another state bit to each cache line: the 39 00:02:39,500 --> 00:02:41,329 "dirty" bit. 40 00:02:41,329 --> 00:02:47,480 The dirty bit is set to 0 when a cache line is filled during a cache miss. 41 00:02:47,480 --> 00:02:52,380 If a subsequent write operation changes the data in a cache line, the dirty bit is set 42 00:02:52,380 --> 00:02:58,380 to 1, indicating that value in the cache now differs from the value in main memory. 43 00:02:58,380 --> 00:03:03,080 When a cache line is selected for replacement, we only need to write its data back to main 44 00:03:03,080 --> 00:03:06,220 memory if its dirty bit is 1. 45 00:03:06,220 --> 00:03:11,150 So a write-back strategy with a dirty bit gives an elegant solution that minimizes the 46 00:03:11,150 --> 00:03:17,360 number of writes to main memory and only delays the CPU on a cache miss if a dirty cache line 47 00:03:17,360 --> 00:03:20,860 needs to be written back to memory. 48 00:03:20,860 --> 00:03:25,780 That concludes our discussion of caches, which was motivated by our desire to minimize the 49 00:03:25,780 --> 00:03:31,980 average memory access time by building a hierarchical memory system that had both low latency and 50 00:03:31,980 --> 00:03:34,510 high capacity. 51 00:03:34,510 --> 00:03:39,000 There were a number of strategies we employed to achieve our goal. 52 00:03:39,000 --> 00:03:45,920 Increasing the number of cache lines decreases AMAT by decreasing the miss ratio. 53 00:03:45,920 --> 00:03:50,650 Increasing the block size of the cache let us take advantage of the fast column accesses 54 00:03:50,650 --> 00:03:56,400 in a DRAM to efficiently load a whole block of data on a cache miss. 55 00:03:56,400 --> 00:04:01,890 The expectation was that this would improve AMAT by increasing the number of hits in the 56 00:04:01,890 --> 00:04:07,300 future as accesses were made to nearby locations. 57 00:04:07,300 --> 00:04:12,190 Increasing the number of ways in the cache reduced the possibility of cache line conflicts, 58 00:04:12,190 --> 00:04:14,500 lowering the miss ratio. 59 00:04:14,500 --> 00:04:19,548 Choosing the least-recently used cache line for replacement minimized the impact of replacement 60 00:04:19,548 --> 00:04:21,009 on the hit ratio. 61 00:04:21,009 --> 00:04:27,430 And, finally, we chose to handle writes using a write-back strategy with dirty bits. 62 00:04:27,430 --> 00:04:31,069 How do we make the tradeoffs among all these architectural choices? 63 00:04:31,069 --> 00:04:38,009 As usual, we'll simulate different cache organizations and chose the architectural mix that provides 64 00:04:38,009 --> 00:04:40,629 the best performance on our benchmark programs.