We are going to compare the behavior of 3 different cache configurations on a benchmark program to better understand the impact of the cache configuration on the performance of the benchmark. The first cache we will consider is a 64-line direct mapped cache. The second is a 2-way set associative cache that uses the LRU, or least recently used, replacement strategy and has a total of 64 lines. The third is a 4-way set associative cache that uses the LRU replacement strategy and also has a total of 64 lines. Note that all three caches have the same capacity in that they can store a total of 64 words of data.

In a direct mapped cache, any particular memory address maps to exactly one line in the cache. Let's assume that our data is 32 bits, or 4 bytes, wide. This means that consecutive addresses are 4 bytes apart, so we treat the bottom two address bits as always being 00, which keeps our address on a data word boundary. Next, we want to determine which cache line this particular address maps to. Since there are 64 lines in this cache, we need 6 bits to select one of the 64 lines. These 6 bits are called the index. In this example, the index is 000011, so this particular address maps to line 3 of the cache. What gets stored in the cache line is the tag portion of the address plus the 32 bits of data. The tag is used for comparison when checking whether a particular address is in the cache or not: it uniquely identifies the particular memory address. In addition, each cache line has a valid bit that tells you whether the data in that line is currently valid. This is important upon startup because without this bit there is no way to know whether the data in the cache is garbage or real data.

In a 2-way set associative cache, the cache is divided into 2 sets, each with half the number of lines, so we have two sets with 32 lines each. Since each set has only 32 lines, we now need only a 5-bit index to select the line. However, any given index maps to two distinct locations, one in each set. This also means that when the tag comparisons are done, two comparisons are required, one per set. In a 4-way set associative cache, the cache is divided into 4 sets, each with 16 lines, so the index that selects the cache line is now 4 bits wide. Here, selecting a line identifies one of 4 words as possible locations for reading or writing the associated data. This also implies that 4 tags need to be compared when reading from the cache to determine whether the desired address is stored in the cache or not.
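To make the address splitting concrete, here is a small Python sketch (my own illustration, not something shown in the lecture) that splits a 32-bit byte address into its tag and index for each of the three 64-word configurations. It assumes one 4-byte word per cache line, and the example address 0x100C is my own choice: its direct mapped index works out to 000011, i.e. line 3, matching the example above.

```python
TOTAL_LINES = 64                                 # every configuration holds 64 words

def split_address(addr, ways):
    """ways = 1 (direct mapped), 2, or 4; each set holds 64 // ways lines."""
    lines_per_set = TOTAL_LINES // ways          # 64, 32, or 16 lines
    index_bits = lines_per_set.bit_length() - 1  # 6, 5, or 4 index bits
    word = addr >> 2                             # drop the two byte-offset bits (always 00)
    index = word & (lines_per_set - 1)           # low bits select the line within each set
    tag = word >> index_bits                     # remaining high bits identify the address
    return tag, index

for ways in (1, 2, 4):
    tag, index = split_address(0x100C, ways)
    print(f"{ways}-way: index {index}, tag {tag:#x}")
# 1-way: index 3, tag 0x10
# 2-way: index 3, tag 0x20
# 4-way: index 3, tag 0x40
```

Note how, as the number of sets doubles, one bit moves from the index into the tag.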
The test program begins by defining a few constants: J, A, B, and N. J specifies the address where the program lives. A is the starting address of data region 1, and B is the starting address of data region 2. Finally, N specifies the size of the data regions. Since one word consists of 4 bytes, 16 bytes of data means that there are 4 data elements per region. Next, the assembler is told that the beginning of the program is at address 0x1000.

The green rectangle identifies the outer loop, and the yellow rectangle identifies the inner loop of the code. Before entering the outer loop, a loop counter, which is stored in register R6, is initialized to 1,000. Then each time through the outer loop, R6 is decremented by 1, and the loop is repeated as long as R6 is not equal to 0. The outer loop also resets R0 to N each time through the loop. R0 is used to hold the desired array offset. Since the last element of each array is stored at offset N - 4, the first step of the inner loop is to decrement R0 by 4. R1 is then loaded with the value at address A + N - 4, which is the address of A[3] because array indices begin at 0. R2 is loaded with B[3]. As long as R0 is not equal to 0, the loop repeats itself, each time accessing the previous element of each array until the first element (index 0) is loaded. Then the outer loop decrements R6 and repeats the entire thing 1,000 times.
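Here is a rough Python rendering of that loop structure (the lecture's actual program is assembly; the control flow below is reconstructed from the description above, so treat the details as a sketch rather than the original code).

```python
J = 0x1000          # address where the program lives
A = 0x2000          # starting address of data region 1
B = 0x3000          # starting address of data region 2
N = 16              # bytes per region: 16 / 4 = 4 data elements

def run_benchmark(load):
    """load(addr) stands in for LD: it returns the word at byte address addr."""
    r6 = 1000                    # loop counter, set before entering the outer loop
    while True:                  # outer loop (green rectangle)
        r0 = N                   # reset the array offset
        while True:              # inner loop (yellow rectangle)
            r0 -= 4              # SUBC: step back one word (first pass: N - 4)
            r1 = load(A + r0)    # LD: first pass reads A[3], then A[2], ...
            r2 = load(B + r0)    # LD: first pass reads B[3], then B[2], ...
            if r0 == 0:          # inner BNE: keep looping until the offset is 0
                break
        r6 -= 1
        if r6 == 0:              # outer BNE: repeat the whole thing 1,000 times
            break

run_benchmark(lambda addr: 0)    # e.g. run it with a load that always returns 0
```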
Now that we understand the configuration of our three caches and the behavior of our test benchmark, we can begin comparing the behavior of this benchmark on the three caches. The first thing we want to ask ourselves is which of the three cache configurations gets the highest hit ratio. Here we are not asked to calculate an actual hit ratio; instead we just need to realize that there are 3 distinct regions of data in this benchmark. The first holds the instructions, the second holds array A, and the third holds array B. If we think about the addresses of each of these regions in memory, we see that the first instruction is at address 0x1000. This address results in an index of 0 regardless of which cache you consider, so for all three caches the first instruction maps to the first line of the cache. Similarly, the first elements of arrays A and B are at addresses 0x2000 and 0x3000. These addresses also result in an index of 0 regardless of which of the three caches you consider. So we see that the 1st CMOVE, A[0], and B[0] would all map to line 0 of the cache. Similarly, the 2nd CMOVE, whose address is 0x1004, would map to line 1 of the cache, as would array elements A[1] and B[1]. This tells us that if we use the direct mapped cache or the 2-way set associative cache, we will have cache collisions between the instructions and the array elements. Collisions in the cache imply cache misses, as we replace one piece of data with another in the cache. However, if we use the 4-way set associative cache, then each region of memory can go in a distinct set in the cache, thus avoiding collisions and resulting in a 100% hit rate after the first time through the loop. Note that the first time through the loop, each instruction and data access will result in a cache miss because the data needs to be brought into the cache initially. But when the loop is repeated, the data is already there and the accesses result in cache hits.
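The index claim is easy to check numerically (again, just an illustration, not part of the lecture): the first instruction, A[0], and B[0] all produce index 0 whether a set has 64, 32, or 16 lines.

```python
for addr in (0x1000, 0x2000, 0x3000):            # 1st CMOVE, A[0], B[0]
    word = addr >> 2                             # drop the two byte-offset bits
    print(hex(addr), [word % lines for lines in (64, 32, 16)])   # always [0, 0, 0]
```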
Now suppose that we make a minor modification to our test program by changing B from 0x3000 to 0x2000. This means that array A and array B now refer to the same locations in memory. We want to determine which caches' hit rates will show a noticeable improvement as a result of this change. The difference between our original benchmark and this modified one is that we now have two distinct regions of memory to access: one for the instructions and one for the data. This means that the 2-way set associative cache will no longer experience collisions, so its hit rate will be significantly better than with the original benchmark.

Now suppose that we change our benchmark once again, this time making J, A, and B all equal to 0, and changing N to 64. This means that we now have 16 elements in our arrays instead of 4. It also means that the array values we are loading for arrays A and B are actually the same words as the instructions of the program. Another way of thinking about this is that we now have only one distinct region of memory being accessed.

What we want to determine now is the total number of cache misses that will occur for each of the cache configurations. Let's begin by considering the direct mapped cache. In the direct mapped cache, we first access the first CMOVE instruction. Since this instruction is not yet in the cache, our first access is a cache miss. We now bring the binary equivalent of this instruction into line 0 of our cache. Next, we want to access the second CMOVE instruction. Once again the instruction is not in our cache, so we get another cache miss. This results in our loading the 2nd CMOVE instruction into line 1 of our cache. We continue in the same manner with the SUBC instruction and the first LD instruction. Again we get cache misses for each of those instructions, which in turn causes us to load those instructions into lines 2 and 3 of our cache.

Now we are ready to execute our first load operation. This operation wants to load A[15] into R1. Because the beginning of array A is at address 0, A[15] maps to line 15 of our cache. Since we have not yet loaded anything into line 15, our first data access is a miss. We continue with the second load instruction. This instruction is not yet in the cache, so we get a cache miss and then load it into line 4 of our cache. We then try to access B[15]. B[15] corresponds to the same piece of data as A[15], so this data is already in the cache, resulting in a data hit for B[15]. So far we have gotten 5 instruction misses, 1 data miss, and 1 data hit. Next we need to access the BNE instruction. Once again we get a cache miss, which results in loading the BNE instruction into line 5 of our cache.

The inner loop is now repeated with R0 = 60; after the SUBC, R0 = 56, which corresponds to element 14 of the arrays. This time through the loop, all the instructions are already in the cache and result in instruction hits. A[14], which maps to line 14 of our cache, results in a data miss because it is not yet present in our cache, so we bring A[14] into the cache. Then, as before, when we try to access B[14] we get a data hit because it corresponds to the same piece of data as A[14]. So in total we have now seen 6 instruction misses and 2 data misses; the rest of the accesses have all been hits.

This process repeats itself, with a data miss for array element A[i] and a data hit for array element B[i], until we get to A[5], which actually results in a hit because it corresponds to the location in memory that holds the inner loop's BNE instruction, already in the cache at line 5. From then on, the remaining data accesses all result in hits. At this point we have completed the inner loop and proceed to the remaining instructions in the outer loop. These are the second SUBC and the second BNE instructions. They correspond to the data that is already in lines 6 and 7 of the cache, thus resulting in hits. The outer loop then repeats 1,000 times in all, but on every subsequent pass all the instructions and all the data are already in the cache, so they all result in hits. So the total number of misses that we get when executing this benchmark on a direct mapped cache is 16. These are known as compulsory misses, which are misses that occur when you first load the data into your cache.
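As an aside (my own sketch, not part of the lecture), the access pattern just traced can be written down as a short Python generator and replayed through a few lines of cache simulation, which reproduces the count of 16 misses. The 8-word instruction layout and the fetch/load ordering are reconstructed from the walkthrough above, so treat them as assumptions.

```python
def trace():
    """Word addresses touched by the modified benchmark (J = A = B = 0, N = 64):
    instruction fetches interleaved with the data words they load."""
    yield 0                              # first CMOVE (word 0), fetched once
    for _ in range(1000):                # outer loop
        yield 1                          # second CMOVE (word 1)
        for r0 in range(60, -4, -4):     # inner loop: offsets 60, 56, ..., 0
            yield 2                      # SUBC (word 2)
            yield 3                      # first LD (word 3)
            yield r0 // 4                # data word for A[r0 / 4] (A = 0)
            yield 4                      # second LD (word 4)
            yield r0 // 4                # same data word again, since B = A here
            yield 5                      # inner BNE (word 5)
        yield 6                          # second SUBC (word 6)
        yield 7                          # outer BNE (word 7)

def direct_mapped_misses(words, lines=64):
    tags = [None] * lines                # None plays the role of a cleared valid bit
    misses = 0
    for w in words:
        index, tag = w % lines, w // lines
        if tags[index] != tag:           # valid-bit / tag check fails: a miss
            misses += 1
            tags[index] = tag            # bring the word into the cache
    return misses

print(direct_mapped_misses(trace()))     # prints 16
```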
Recall that the direct mapped cache has one set of 64 lines. The 2-way set associative cache has 2 sets of 32 lines each, and the 4-way set associative cache has 4 sets of 16 lines each. Only 16 lines are required to fit all the instructions and data associated with this benchmark, so effectively only one set will be used in the set associative caches. And because even a single set of the 4-way set associative cache has 16 lines, once the data is loaded into the cache it never needs to be replaced with other data. So after the first miss per line, the remaining accesses in the entire benchmark will be hits, and the total number of misses in the 2-way and 4-way set associative caches is also 16.
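As a final check (again my own sketch, reusing the trace() generator defined above), a small LRU model gives the same totals: with 1, 2, or 4 sets it reports 16 misses in every case.

```python
def lru_misses(words, ways, total_lines=64):
    """Count misses for a cache with `ways` sets of 64 // ways lines, LRU replacement."""
    lines_per_set = total_lines // ways
    sets = [[] for _ in range(lines_per_set)]    # per index: resident tags, LRU first
    misses = 0
    for w in words:
        index, tag = w % lines_per_set, w // lines_per_set
        resident = sets[index]
        if tag in resident:
            resident.remove(tag)                 # hit: just refresh its LRU position
        else:
            misses += 1                          # miss: bring the word in ...
            if len(resident) == ways:
                resident.pop(0)                  # ... evicting the least recently used tag
        resident.append(tag)                     # most recently used tag goes last
    return misses

for ways in (1, 2, 4):
    print(f"{ways}-way: {lru_misses(trace(), ways)} misses")   # 16 misses each time
```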