1 00:00:00,500 --> 00:00:03,330 Now let’s turn our attention to control hazards, 2 00:00:03,330 --> 00:00:06,675 illustrated by the code fragment shown here. 3 00:00:06,675 --> 00:00:08,800 Which instruction should be executed after the BNE? 4 00:00:08,800 --> 00:00:14,520 If the value in R3 is non-zero, ADDC should be executed. 5 00:00:14,520 --> 00:00:19,780 If the value in R3 is zero, the next instruction should be SUB. 6 00:00:19,780 --> 00:00:23,090 If the current instruction is an explicit transfer of control 7 00:00:23,090 --> 00:00:27,260 (i.e., JMPs or branches), the choice of the next instruction 8 00:00:27,260 --> 00:00:30,900 depends on the execution of the current instruction. 9 00:00:30,900 --> 00:00:33,150 What are the implications of this dependency 10 00:00:33,150 --> 00:00:36,680 on our execution pipeline? 11 00:00:36,680 --> 00:00:38,840 How does the unpipelined implementation 12 00:00:38,840 --> 00:00:41,190 determine the next instruction? 13 00:00:41,190 --> 00:00:44,310 For branches (BEQ or BNE), the value 14 00:00:44,310 --> 00:00:47,760 to be loaded into the program counter depends on 15 00:00:47,760 --> 00:00:50,740 (1) the opcode, i.e., whether the instruction 16 00:00:50,740 --> 00:00:51,920 is a BEQ or a BNE, 17 00:00:51,920 --> 00:00:56,990 (2) the current value of the program counter since that’s 18 00:00:56,990 --> 00:00:59,330 used in the offset calculation, and 19 00:00:59,330 --> 00:01:02,860 (3) the value stored in the register specified by the RA 20 00:01:02,860 --> 00:01:05,810 field of the instruction since that’s the value tested 21 00:01:05,810 --> 00:01:07,990 by the branch. 22 00:01:07,990 --> 00:01:11,180 For JMP instructions, the next value of the program counter 23 00:01:11,180 --> 00:01:15,140 depends once again on the opcode field and the value of the RA 24 00:01:15,140 --> 00:01:17,030 register. 25 00:01:17,030 --> 00:01:20,340 For all other instructions, the next PC value depends only 26 00:01:20,340 --> 00:01:25,540 the opcode of the instruction and the value PC+4. 27 00:01:25,540 --> 00:01:28,560 Exceptions also change the program counter. 28 00:01:28,560 --> 00:01:32,010 We’ll deal with them later in the lecture. 29 00:01:32,010 --> 00:01:34,870 The control hazard is triggered by JMP and branches 30 00:01:34,870 --> 00:01:38,610 since their execution depends on the value in the RA register, 31 00:01:38,610 --> 00:01:42,330 i.e., they need to read from the register file, which happens 32 00:01:42,330 --> 00:01:44,320 in the RF pipeline stage. 33 00:01:44,320 --> 00:01:47,090 Our bypass mechanisms ensure that we’ll use the correct 34 00:01:47,090 --> 00:01:50,540 value for the RA register even if it’s not yet written 35 00:01:50,540 --> 00:01:52,580 into the register file. 36 00:01:52,580 --> 00:01:55,690 What we’re concerned about here is that the address 37 00:01:55,690 --> 00:01:58,380 of the instruction following the JMP or branch will be loaded 38 00:01:58,380 --> 00:02:01,890 into program counter at the end of the cycle when the JMP 39 00:02:01,890 --> 00:02:05,070 or branch is in the RF stage. 40 00:02:05,070 --> 00:02:08,130 But what should the IF stage be doing while all this is 41 00:02:08,130 --> 00:02:11,290 going on in RF stage? 42 00:02:11,290 --> 00:02:14,770 The answer is that in the case of JMPs and taken branches, 43 00:02:14,770 --> 00:02:17,810 we don’t know what the IF stage should be doing until those 44 00:02:17,810 --> 00:02:21,690 instructions are able to access the value of the RA register 45 00:02:21,690 --> 00:02:23,890 in the RF stage. 46 00:02:23,890 --> 00:02:27,310 One solution is to stall the IF stage until the RF stage can 47 00:02:27,310 --> 00:02:30,200 compute the necessary result. 48 00:02:30,200 --> 00:02:32,410 This was the first of our general strategies 49 00:02:32,410 --> 00:02:34,280 for dealing with hazards. 50 00:02:34,280 --> 00:02:36,610 How would this work? 51 00:02:36,610 --> 00:02:40,510 If the opcode in the RF stage is JMP, BEQ, or BNE, 52 00:02:40,510 --> 00:02:43,790 stall the IF stage for one cycle. 53 00:02:43,790 --> 00:02:47,460 In the example code shown here, assume that the value in R3 54 00:02:47,460 --> 00:02:50,630 is non-zero when the BNE is executed, i.e., 55 00:02:50,630 --> 00:02:53,290 that the instruction following BNE 56 00:02:53,290 --> 00:02:57,380 should be the ADDC at the top of the loop. 57 00:02:57,380 --> 00:03:00,750 The pipeline diagram shows the effect we’re trying to achieve: 58 00:03:00,750 --> 00:03:05,790 a NOP is inserted into the pipeline in cycles 4 and 8. 59 00:03:05,790 --> 00:03:08,310 Then execution resumes in the next cycle 60 00:03:08,310 --> 00:03:12,430 after the RF stage determines what instruction comes next. 61 00:03:12,430 --> 00:03:15,140 Note, by the way, that we’re relying on our bypass logic 62 00:03:15,140 --> 00:03:19,310 to deliver the correct value for R3 from the MEM stage since 63 00:03:19,310 --> 00:03:22,480 the ADDC instruction that wrote into R3 is still 64 00:03:22,480 --> 00:03:28,170 in the pipeline, i.e., we have a data hazard to deal with too! 65 00:03:28,170 --> 00:03:31,930 Looking at, say, the WB stage in the pipeline diagram, 66 00:03:31,930 --> 00:03:35,560 we see it takes 4 cycles to execute one iteration 67 00:03:35,560 --> 00:03:38,060 of our 3-instruction loop. 68 00:03:38,060 --> 00:03:44,450 So the effective CPI is 4/3, an increase of 33%. 69 00:03:44,450 --> 00:03:46,700 Using stall to deal with control hazards 70 00:03:46,700 --> 00:03:49,520 has had an impact on the instruction throughput 71 00:03:49,520 --> 00:03:50,820 of our execution pipeline. 72 00:03:53,350 --> 00:03:56,380 We’ve already seen the logic needed to introduce NOPs 73 00:03:56,380 --> 00:03:57,900 into the pipeline. 74 00:03:57,900 --> 00:04:00,430 In this case, we add a mux to the instruction path 75 00:04:00,430 --> 00:04:05,380 in the IF stage, controlled by the IRSrc_IF signal. 76 00:04:05,380 --> 00:04:07,150 We use the superscript on the control 77 00:04:07,150 --> 00:04:09,710 signals to indicate which pipeline stage holds 78 00:04:09,710 --> 00:04:10,850 the logic they control. 79 00:04:13,390 --> 00:04:16,769 If the opcode in the RF stage is JMP, BEQ, or BNE 80 00:04:16,769 --> 00:04:20,870 we set IRSrc_IF to 1, which causes 81 00:04:20,870 --> 00:04:23,110 a NOP to replace the instruction that was 82 00:04:23,110 --> 00:04:25,330 being read from main memory. 83 00:04:25,330 --> 00:04:28,010 And, of course, we’ll be setting the PCSEL control signals 84 00:04:28,010 --> 00:04:30,180 to select the correct next PC value, 85 00:04:30,180 --> 00:04:34,030 so the IF stage will fetch the desired follow-on instruction 86 00:04:34,030 --> 00:04:36,400 in the next cycle. 87 00:04:36,400 --> 00:04:39,070 If we replace an instruction with NOP, 88 00:04:39,070 --> 00:04:42,880 we say we “annulled” the instruction. 89 00:04:42,880 --> 00:04:45,100 The branch instructions in the Beta ISA 90 00:04:45,100 --> 00:04:47,480 make their branch decision in the RF stage 91 00:04:47,480 --> 00:04:51,180 since they only need the value in register RA. 92 00:04:51,180 --> 00:04:54,360 But suppose the ISA had a branch where the branch decision was 93 00:04:54,360 --> 00:04:57,480 made in ALU stage. 94 00:04:57,480 --> 00:04:59,890 When the branch decision is made in the ALU stage, 95 00:04:59,890 --> 00:05:02,570 we need to introduce two NOPs into the pipeline, 96 00:05:02,570 --> 00:05:06,250 replacing the now unwanted instructions in the RF and IF 97 00:05:06,250 --> 00:05:07,930 stages. 98 00:05:07,930 --> 00:05:12,610 This would increase the effective CPI even further. 99 00:05:12,610 --> 00:05:15,860 But the tradeoff is that the more complex branches 100 00:05:15,860 --> 00:05:20,000 may reduce the number of instructions in the program. 101 00:05:20,000 --> 00:05:22,970 If we annul instructions in all the earlier pipeline stages, 102 00:05:22,970 --> 00:05:25,650 this is called “flushing the pipeline”. 103 00:05:25,650 --> 00:05:28,440 Since flushing the pipeline has a big impact on the effective 104 00:05:28,440 --> 00:05:33,320 CPI, we do it when it’s the only way to ensure the correct 105 00:05:33,320 --> 00:05:37,850 behavior of the execution pipeline. 106 00:05:37,850 --> 00:05:40,140 We can be smarter about when we choose 107 00:05:40,140 --> 00:05:42,880 to flush the pipeline when executing branches. 108 00:05:42,880 --> 00:05:45,430 If the branch is not taken, it turns out 109 00:05:45,430 --> 00:05:47,550 that the pipeline has been doing the right thing 110 00:05:47,550 --> 00:05:51,020 by fetching the instruction following the branch. 111 00:05:51,020 --> 00:05:54,110 Starting execution of an instruction even when we’re 112 00:05:54,110 --> 00:05:57,030 unsure whether we really want it executed is called 113 00:05:57,030 --> 00:05:59,310 “speculation”. 114 00:05:59,310 --> 00:06:01,960 Speculative execution is okay if we’re able to annul 115 00:06:01,960 --> 00:06:05,730 the instruction before it has an effect on the CPU state, e.g., 116 00:06:05,730 --> 00:06:08,830 by writing into the register file or main memory. 117 00:06:08,830 --> 00:06:12,010 Since these state changes (called “side effects”) happen 118 00:06:12,010 --> 00:06:15,610 in the later pipeline stages, an instruction can progress 119 00:06:15,610 --> 00:06:19,470 through the IF, RF, and ALU stages before we have to make 120 00:06:19,470 --> 00:06:23,490 a final decision about whether it should be annulled. 121 00:06:23,490 --> 00:06:26,350 How does speculation help with control hazards? 122 00:06:26,350 --> 00:06:30,290 Guessing that the next value of the program counter is PC+4 is 123 00:06:30,290 --> 00:06:34,260 correct for all but JMPs and taken branches. 124 00:06:34,260 --> 00:06:37,120 Here’s our example again, but this time let’s assume that 125 00:06:37,120 --> 00:06:42,660 the BNE is not taken, i.e., that the value in R3 is zero. 126 00:06:42,660 --> 00:06:44,590 The SUB instruction enters the pipeline 127 00:06:44,590 --> 00:06:46,560 at the start of cycle 4. 128 00:06:46,560 --> 00:06:49,040 At the end of cycle 4, we know whether or not 129 00:06:49,040 --> 00:06:50,990 to annul the SUB. 130 00:06:50,990 --> 00:06:53,660 If the branch is not taken, we want 131 00:06:53,660 --> 00:06:55,870 to execute the SUB instruction, so we just 132 00:06:55,870 --> 00:06:58,510 let it continue down the pipeline. 133 00:06:58,510 --> 00:07:00,730 In other words, instead of always annulling 134 00:07:00,730 --> 00:07:02,600 the instruction following branch, 135 00:07:02,600 --> 00:07:05,980 we only annul it if the branch was taken. 136 00:07:05,980 --> 00:07:07,950 If the branch is not taken, the pipeline 137 00:07:07,950 --> 00:07:10,710 has speculated correctly and no instructions 138 00:07:10,710 --> 00:07:13,020 need to be annulled. 139 00:07:13,020 --> 00:07:15,740 However if the BNE is taken, the SUB 140 00:07:15,740 --> 00:07:17,980 is annulled at the end of cycle 4 141 00:07:17,980 --> 00:07:20,230 and a NOP is executed in cycle 5. 142 00:07:20,230 --> 00:07:24,160 So we only introduce a bubble in the pipeline when there’s 143 00:07:24,160 --> 00:07:25,780 a taken branch. 144 00:07:25,780 --> 00:07:28,560 Fewer bubbles will decrease the impact of annulment 145 00:07:28,560 --> 00:07:31,620 on the effective CPI. 146 00:07:31,620 --> 00:07:34,420 We’ll be using the same data path circuitry as before, 147 00:07:34,420 --> 00:07:37,150 we’ll just be a bit more clever about when the value 148 00:07:37,150 --> 00:07:41,560 of the IRSrc_IF control signal is set to 1. 149 00:07:41,560 --> 00:07:44,050 Instead of setting it to 1 for all branches, 150 00:07:44,050 --> 00:07:49,000 we only set it to 1 when the branch is taken. 151 00:07:49,000 --> 00:07:51,580 Our naive strategy of always speculating that the next 152 00:07:51,580 --> 00:07:55,210 instruction comes from PC+4 is wrong for JMPs and taken 153 00:07:55,210 --> 00:07:56,840 branches. 154 00:07:56,840 --> 00:07:59,580 Looking at simulated execution traces, 155 00:07:59,580 --> 00:08:03,160 we’ll see that this error in speculation leads to about 10% 156 00:08:03,160 --> 00:08:05,940 higher effective CPI. 157 00:08:05,940 --> 00:08:08,110 Can we do better? 158 00:08:08,110 --> 00:08:11,890 This is an important question for CPUs with deep pipelines. 159 00:08:11,890 --> 00:08:16,100 For example, Intel’s Nehalem processor from 2009 resolves 160 00:08:16,100 --> 00:08:19,190 the more complex x86 branch instructions quite late 161 00:08:19,190 --> 00:08:20,940 in the pipeline. 162 00:08:20,940 --> 00:08:23,980 Since Nehalem is capable of executing multiple instructions 163 00:08:23,980 --> 00:08:26,750 each cycle, flushing the pipeline in Nehalem 164 00:08:26,750 --> 00:08:29,700 actually annuls the execution of many instructions, 165 00:08:29,700 --> 00:08:34,539 resulting in a considerable hit on the CPI. 166 00:08:34,539 --> 00:08:37,250 Like many modern processor implementations, 167 00:08:37,250 --> 00:08:41,299 Nehalem has a much more sophisticated speculation 168 00:08:41,299 --> 00:08:43,140 mechanism. 169 00:08:43,140 --> 00:08:46,280 Rather than always guessing the next instruction is at PC+4, 170 00:08:46,280 --> 00:08:49,720 it only does that for non-branch instructions. 171 00:08:49,720 --> 00:08:51,670 For branches, it predicts the behavior 172 00:08:51,670 --> 00:08:54,750 of each individual branch based on what the branch did 173 00:08:54,750 --> 00:08:57,700 last time it was executed and some knowledge of how 174 00:08:57,700 --> 00:08:59,630 the branch is being used. 175 00:08:59,630 --> 00:09:02,910 For example, backward branches at the end of loops, 176 00:09:02,910 --> 00:09:06,420 which are taken for all but the final iteration of the loop, 177 00:09:06,420 --> 00:09:11,120 can be identified by their negative branch offset values. 178 00:09:11,120 --> 00:09:13,740 Nehalem can even determine if there’s correlation between 179 00:09:13,740 --> 00:09:17,050 branch instructions, using the results of an another, 180 00:09:17,050 --> 00:09:19,550 earlier branch to speculate on the branch decision 181 00:09:19,550 --> 00:09:21,630 of the current branch. 182 00:09:21,630 --> 00:09:23,710 With these sophisticated strategies, 183 00:09:23,710 --> 00:09:27,950 Nehalem’s speculation is correct 95% to 99% of the time, 184 00:09:27,950 --> 00:09:33,890 greatly reducing the impact of branches on the effective CPI. 185 00:09:33,890 --> 00:09:37,180 There’s also the lazy option of changing the ISA to deal with 186 00:09:37,180 --> 00:09:38,660 control hazards. 187 00:09:38,660 --> 00:09:40,320 For example, we could change the ISA 188 00:09:40,320 --> 00:09:43,840 to specify that the instruction following a jump or branch 189 00:09:43,840 --> 00:09:45,610 is always executed. 190 00:09:45,610 --> 00:09:48,250 In other words the transfer of control happens *after* 191 00:09:48,250 --> 00:09:50,230 the next instruction. 192 00:09:50,230 --> 00:09:53,620 This change ensures that the guess of PC+4 as the address 193 00:09:53,620 --> 00:09:56,840 of the next instruction is always correct! 194 00:09:56,840 --> 00:10:00,280 In the example shown here, assuming we changed the ISA, 195 00:10:00,280 --> 00:10:03,450 we can reorganize the execution order of the loop to place 196 00:10:03,450 --> 00:10:06,320 the MUL instruction after the BNE instruction, 197 00:10:06,320 --> 00:10:09,170 in the so-called “branch delay slot”. 198 00:10:09,170 --> 00:10:11,420 Since the instruction in the branch delay slot 199 00:10:11,420 --> 00:10:13,760 is always executed, the MUL instruction 200 00:10:13,760 --> 00:10:17,320 will be executed during each iteration of the loop. 201 00:10:17,320 --> 00:10:21,470 The resulting execution is shown in this pipeline diagram. 202 00:10:21,470 --> 00:10:23,740 Assuming we can find an appropriate instruction 203 00:10:23,740 --> 00:10:26,140 to place in the delay slot, the branch 204 00:10:26,140 --> 00:10:28,720 will have zero impact on the effective CPI. 205 00:10:28,720 --> 00:10:33,700 Are branch delay slots a good idea? 206 00:10:33,700 --> 00:10:36,130 Seems like they reduce the negative impact 207 00:10:36,130 --> 00:10:40,240 that branches might have on instruction throughput. 208 00:10:40,240 --> 00:10:42,600 The downside is that only half the time can 209 00:10:42,600 --> 00:10:45,933 we find instructions to move to the branch delay slot. 210 00:10:45,933 --> 00:10:47,350 The other half of the time we have 211 00:10:47,350 --> 00:10:49,930 to fill it with an explicit NOP instruction, 212 00:10:49,930 --> 00:10:52,580 increasing the size of the code. 213 00:10:52,580 --> 00:10:55,390 And if we make the branch decision later in the pipeline, 214 00:10:55,390 --> 00:10:57,910 there are more branch delay slots, which 215 00:10:57,910 --> 00:11:00,410 would be even harder to fill. 216 00:11:00,410 --> 00:11:03,120 In practice, it turns out that branch prediction 217 00:11:03,120 --> 00:11:05,650 works better than delay slots in reducing 218 00:11:05,650 --> 00:11:07,830 the impact of branches. 219 00:11:07,830 --> 00:11:11,650 So, once again we see that it’s problematic to alter the ISA 220 00:11:11,650 --> 00:11:15,930 to improve the throughput of pipelined execution. 221 00:11:15,930 --> 00:11:19,520 ISAs outlive implementations, so it’s best not to change 222 00:11:19,520 --> 00:11:23,010 the execution semantics to deal with performance issues created 223 00:11:23,010 --> 00:11:25,346 by a particular implementation.