1 00:00:00,930 --> 00:00:05,180 A fully-associative (FA) cache has a tag comparator for each cache line. 2 00:00:05,180 --> 00:00:09,980 So the tag field of *every* cache line in a FA cache is compared with the tag field 3 00:00:09,980 --> 00:00:12,740 of the incoming address. 4 00:00:12,740 --> 00:00:18,290 Since all cache lines are searched, a particular memory location can be held in any cache line, 5 00:00:18,290 --> 00:00:24,660 which eliminates the problems of address conflicts causing conflict misses. 6 00:00:24,660 --> 00:00:31,119 The cache shown here can hold 4 different 4-word blocks, regardless of their address. 7 00:00:31,119 --> 00:00:35,850 The example from the end of the previous segment required a cache that could hold two 3-word 8 00:00:35,850 --> 00:00:40,640 blocks, one for the instructions in the loop, and one for the data words. 9 00:00:40,640 --> 00:00:47,510 This FA cache would use two of its cache lines to perform that task and achieve a 100% hit 10 00:00:47,510 --> 00:00:53,110 ratio regardless of the addresses of the instruction and data blocks. 11 00:00:53,110 --> 00:00:59,840 FA caches are very flexible and have high hit ratios for most applications. 12 00:00:59,840 --> 00:01:03,020 Their only downside is cost. 13 00:01:03,020 --> 00:01:07,710 The inclusion of a tag comparator for each cache line to implement the parallel search 14 00:01:07,710 --> 00:01:13,500 for a tag match adds substantially the amount of circuitry required when there are many 15 00:01:13,500 --> 00:01:15,680 cache lines. 16 00:01:15,680 --> 00:01:21,560 Even the use of hybrid storage/comparison circuitry, called a content-addressable memory, 17 00:01:21,560 --> 00:01:27,160 doesn't make a big dent in the overall cost of a FA cache. 18 00:01:27,160 --> 00:01:31,310 DM caches searched only a single cache line. 19 00:01:31,310 --> 00:01:34,770 FA caches search all cache lines. 20 00:01:34,770 --> 00:01:39,229 Is there a happy middle ground where some small number of cache lines are searched in 21 00:01:39,229 --> 00:01:40,229 parallel? 22 00:01:40,229 --> 00:01:41,229 Yes! 23 00:01:41,229 --> 00:01:46,799 If you look closely at the diagram of the FA cache shown here, you'll see it looks like 24 00:01:46,799 --> 00:01:52,189 four 1-line DM caches operating in parallel. 25 00:01:52,189 --> 00:01:58,810 What would happen if we designed a cache with four multi-line DM caches operating in parallel? 26 00:01:58,810 --> 00:02:03,590 The result would be what we call an 4-way set-associative (SA) cache. 27 00:02:03,590 --> 00:02:10,288 An N-way SA cache is really just N DM caches (let's call them sub-caches) operating in 28 00:02:10,288 --> 00:02:11,939 parallel. 29 00:02:11,939 --> 00:02:16,889 Each of the N sub-caches compares the tag field of the incoming address with the tag 30 00:02:16,889 --> 00:02:22,629 field of the cache line selected by the index bits of the incoming address. 31 00:02:22,629 --> 00:02:28,769 The N cache lines searched on a particular request form a search "set" and the desired 32 00:02:28,769 --> 00:02:33,129 location might be held in any member of the set. 33 00:02:33,129 --> 00:02:39,950 The 4-way SA cache shown here has 8 cache lines in each sub-cache, so each set contains 34 00:02:39,950 --> 00:02:45,849 4 cache lines (one from each sub-cache) and there are a total of 8 sets (one for each 35 00:02:45,849 --> 00:02:50,040 line of the sub-caches). 36 00:02:50,040 --> 00:02:55,569 An N-way SA cache can accommodate up to N blocks whose addresses map to the same cache 37 00:02:55,569 --> 00:02:57,709 index. 38 00:02:57,709 --> 00:03:03,089 So access to up to N blocks with conflicting addresses can still be accommodated in this 39 00:03:03,089 --> 00:03:05,760 cache without misses. 40 00:03:05,760 --> 00:03:11,370 This a big improvement over a DM cache where an address conflict will cause the current 41 00:03:11,370 --> 00:03:16,930 resident of a cache line to be evicted in favor of the new request. 42 00:03:16,930 --> 00:03:22,639 And an N-way SA cache can have a very large number of cache lines but still only have 43 00:03:22,639 --> 00:03:26,540 to pay the cost of N tag comparators. 44 00:03:26,540 --> 00:03:31,809 This is a big improvement over a FA cache where a large number of cache lines would 45 00:03:31,809 --> 00:03:35,349 require a large number of comparators. 46 00:03:35,349 --> 00:03:43,019 So N-way SA caches are a good compromise between a conflict-prone DM cache and the flexible 47 00:03:43,019 --> 00:03:47,089 but very expensive FA cache. 48 00:03:47,089 --> 00:03:53,159 Here's a slightly more detailed diagram, in this case of a 3-way 8-set cache. 49 00:03:53,159 --> 00:03:57,299 Note that there's no constraint that the number of ways be a power of two since we aren't 50 00:03:57,299 --> 00:04:01,249 using any address bits to select a particular way. 51 00:04:01,249 --> 00:04:07,569 This means the cache designer can fine tune the cache capacity to fit her space budget. 52 00:04:07,569 --> 00:04:12,309 Just to review the terminology: the N cache lines that will be searched for a particular 53 00:04:12,309 --> 00:04:14,459 cache index are called a set. 54 00:04:14,459 --> 00:04:19,829 And each of N sub-caches is called a way. 55 00:04:19,829 --> 00:04:24,840 The hit logic in each "way" operates in parallel with the logic in other ways. 56 00:04:24,840 --> 00:04:30,160 Is it possible for a particular address to be matched by more than one way? 57 00:04:30,160 --> 00:04:36,300 That possibility isn't ruled out by the hardware, but the SA cache is managed so that doesn't 58 00:04:36,300 --> 00:04:37,530 happen. 59 00:04:37,530 --> 00:04:42,590 Assuming we write the data fetched from DRAM during a cache miss into a single sub-cache 60 00:04:42,590 --> 00:04:46,439 - we'll talk about how to choose that way in a minute - 61 00:04:46,439 --> 00:04:52,710 there's no possibility that more than one sub-cache will ever match an incoming address. 62 00:04:52,710 --> 00:04:54,680 How many ways to do we need? 63 00:04:54,680 --> 00:05:00,780 We'd like enough ways to avoid the cache line conflicts we experienced with the DM cache. 64 00:05:00,780 --> 00:05:06,349 Looking at the graph we saw earlier of memory accesses vs. time, we see that in any time 65 00:05:06,349 --> 00:05:12,110 interval there are only so many potential address conflicts that we need to worry about. 66 00:05:12,110 --> 00:05:16,780 The mapping from addresses to cache lines is designed to avoid conflicts between neighboring 67 00:05:16,780 --> 00:05:18,460 locations. 68 00:05:18,460 --> 00:05:25,150 So we only need to worry about conflicts between the different regions: code, stack and data. 69 00:05:25,150 --> 00:05:29,949 In the examples shown here there are three such regions, maybe 4 if you need two data 70 00:05:29,949 --> 00:05:33,470 regions to support copying from one data region to another. 71 00:05:33,470 --> 00:05:37,729 If the time interval is particularly large, we might need double that number to avoid 72 00:05:37,729 --> 00:05:43,460 conflicts between accesses early in the time interval and accesses late in the time interval. 73 00:05:43,460 --> 00:05:48,969 The point is that a small number of ways should be sufficient to avoid most cache line conflicts 74 00:05:48,969 --> 00:05:51,490 in the cache. 75 00:05:51,490 --> 00:05:56,380 As with block size, it's possible to have too much of a good thing: there's an optimum 76 00:05:56,380 --> 00:06:01,069 number of ways that minimizes the AMAT. 77 00:06:01,069 --> 00:06:05,020 Beyond that point, the additional circuity needed to combine the hit signals from a large 78 00:06:05,020 --> 00:06:10,819 number of ways will start have a significant propagation delay of its own, adding directly 79 00:06:10,819 --> 00:06:15,009 to the cache hit time and the AMAT. 80 00:06:15,009 --> 00:06:18,939 More to the point, the chart on the left shows that there's little additional impact on the 81 00:06:18,939 --> 00:06:22,270 miss ratio beyond 4 to 8 ways. 82 00:06:22,270 --> 00:06:27,530 For most programs, an 8-way set-associative cache with a large number of sets will perform 83 00:06:27,530 --> 00:06:34,880 on a par with the much more-expensive FA cache of equivalent capacity. 84 00:06:34,880 --> 00:06:40,180 There's one final issue to resolve with SA and FA caches. 85 00:06:40,180 --> 00:06:44,550 When there's a cache miss, which cache line should be chosen to hold the data that will 86 00:06:44,550 --> 00:06:47,840 be fetched from main memory? 87 00:06:47,840 --> 00:06:53,120 That's not an issue with DM caches, since each data block can only be held in one particular 88 00:06:53,120 --> 00:06:56,510 cache line, determined by its address. 89 00:06:56,510 --> 00:07:02,340 But in N-way SA caches, there are N possible cache lines to choose from, one in each of 90 00:07:02,340 --> 00:07:03,680 the ways. 91 00:07:03,680 --> 00:07:08,310 And in a FA cache, any of the cache lines can be chosen. 92 00:07:08,310 --> 00:07:10,120 So, how to choose? 93 00:07:10,120 --> 00:07:15,240 Our goal is to choose to replace the contents of the cache line which will minimize the 94 00:07:15,240 --> 00:07:18,930 impact on the hit ratio in the future. 95 00:07:18,930 --> 00:07:24,120 The optimal choice is to replace the block that is accessed furthest in the future (or 96 00:07:24,120 --> 00:07:26,080 perhaps is never accessed again). 97 00:07:26,080 --> 00:07:29,759 But that requires knowing the futureā€¦ 98 00:07:29,759 --> 00:07:35,009 Here's an idea: let's predict future accesses by looking a recent accesses and applying 99 00:07:35,009 --> 00:07:36,629 the principle of locality. 100 00:07:36,629 --> 00:07:41,919 d7.36 If a block has not been recently accessed, it's less likely to be accessed in the near 101 00:07:41,919 --> 00:07:44,039 future. 102 00:07:44,039 --> 00:07:50,650 That suggests the least-recently-used replacement strategy, usually referred to as LRU: replace 103 00:07:50,650 --> 00:07:54,090 the block that was accessed furthest in the past. 104 00:07:54,090 --> 00:08:00,720 LRU works well in practice, but requires us to keep a list ordered by last use for each 105 00:08:00,720 --> 00:08:06,729 set of cache lines, which would need to be updated on each cache access. 106 00:08:06,729 --> 00:08:10,979 When we needed to choose which member of a set to replace, we'd choose the last cache 107 00:08:10,979 --> 00:08:13,300 line on this list. 108 00:08:13,300 --> 00:08:21,330 For an 8-way SA cache there are 8! possible orderings, so we'd need log2(8!) = 16 state 109 00:08:21,330 --> 00:08:24,689 bits to encode the current ordering. 110 00:08:24,689 --> 00:08:28,550 The logic to update these state bits on each access isn't cheap. 111 00:08:28,550 --> 00:08:34,799 Basically you need a lookup table to map the current 16-bit value to the next 16-bit value. 112 00:08:34,799 --> 00:08:39,610 So most caches implement an approximation to LRU where the update function is much simpler 113 00:08:39,610 --> 00:08:42,539 to compute. 114 00:08:42,539 --> 00:08:47,660 There are other possible replacement policies: First-in, first-out, where the oldest cache 115 00:08:47,660 --> 00:08:50,460 line is replaced regardless of when it was last accessed. 116 00:08:50,460 --> 00:08:57,670 And Random, where some sort of pseudo-random number generator is used to select the replacement. 117 00:08:57,670 --> 00:09:01,520 All replacement strategies except for random can be defeated. 118 00:09:01,520 --> 00:09:05,330 If you know a cache's replacement strategy you can design a program that will have an 119 00:09:05,330 --> 00:09:10,760 abysmal hit rate by accessing addresses you know the cache just replaced. 120 00:09:10,760 --> 00:09:16,090 I'm not sure I care about how well a program designed to get bad performance runs on my 121 00:09:16,090 --> 00:09:21,430 system, but the point is that most replacement strategies will occasionally cause a particular 122 00:09:21,430 --> 00:09:25,949 program to execute much more slowly than expected. 123 00:09:25,949 --> 00:09:31,060 When all is said and done, an LRU replacement strategy or a close approximation is a reasonable 124 00:09:31,060 --> 00:09:31,490 choice.