The modern world has an insatiable appetite for computation, so system architects are always thinking about ways to make programs run faster. The running time of a program is the product of three terms: the number of instructions in the program, multiplied by the average number of processor cycles required to execute each instruction (CPI), multiplied by the time required for each processor cycle (t_CLK). To decrease the running time we need to decrease one or more of these terms.

The number of instructions per program is determined by the ISA and by the compiler that produced the sequence of assembly language instructions to be executed. Both are fair game, but for this discussion let's work on reducing the other two terms.

As we've seen, pipelining reduces t_CLK by dividing instruction execution into a sequence of steps, each of which can complete its task in a shorter t_CLK. What about reducing CPI? In our 5-stage pipelined implementation of the Beta, we designed the hardware to complete the execution of one instruction every clock cycle, so CPI_ideal is 1. But sometimes the hardware has to introduce "NOP bubbles" into the pipeline to delay execution of a pipeline stage if the required operation couldn't (yet) be completed. This happens on taken branch instructions, when attempting to immediately use a value loaded from memory by the LD instruction, and when waiting for a cache miss to be satisfied from main memory.

CPI_stall accounts for the cycles lost to the NOPs introduced into the pipeline. Its value depends on the frequency of taken branches and of immediate uses of LD results. Typically it's some fraction of a cycle. For example, if a 6-instruction loop with a LD takes 8 cycles to complete, CPI_stall for the loop would be 2/6, i.e., 2 extra cycles for every 6 instructions. (There's a small worked sketch of this arithmetic just below.)

Our classic 5-stage pipeline is an effective compromise that allows for a substantial reduction of t_CLK while keeping CPI_stall to a reasonably modest value. But there is room for improvement. Since each stage works on only one instruction at a time, CPI_ideal can never be less than 1. Slow operations - e.g., completing a multiply in the ALU stage, or accessing a large cache in the IF or MEM stages - force t_CLK to be large to accommodate all the work that has to be done in one cycle.
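To keep these terms concrete, here is a tiny back-of-the-envelope sketch in Python. The stage delays and the instruction count are made up purely for illustration (they aren't measurements of any real Beta implementation); the sketch just shows how the slowest stage sets t_CLK, and how the loop example above turns into a CPI_stall value and a total run time.

    # Illustrative numbers only, not from any real implementation.
    # t_CLK is set by the slowest pipeline stage: one slow stage drags
    # down the clock for everyone.
    stage_delay_ns = {"IF": 0.9, "RF": 0.7, "ALU": 1.0, "MEM": 1.3, "WB": 0.5}
    t_clk = max(stage_delay_ns.values())            # 1.3 ns per cycle

    # CPI_stall from the loop example: 6 instructions, 8 cycles per pass,
    # so 2 extra cycles for every 6 instructions.
    instructions_per_iter = 6
    cycles_per_iter = 8
    cpi_ideal = 1.0
    cpi_stall = (cycles_per_iter - instructions_per_iter) / instructions_per_iter

    # Running time = instructions x CPI x t_CLK
    instruction_count = 1_000_000                   # hypothetical program size
    run_time_ns = instruction_count * (cpi_ideal + cpi_stall) * t_clk
    print(f"CPI_stall = {cpi_stall:.2f}, CPI = {cpi_ideal + cpi_stall:.2f}")
    print(f"run time = {run_time_ns / 1e6:.2f} ms")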
The order of the instructions in the pipeline is also fixed. If, say, a LD instruction is delayed in the MEM stage because of a cache miss, all the instructions in earlier stages are also delayed even though their execution may not depend on the value produced by the LD. The order of instructions in the pipeline always reflects the order in which they were fetched by the IF stage.

Let's look into what it would take to relax these constraints and hopefully improve program runtimes.

Increasing the number of pipeline stages should allow us to decrease the clock cycle time. We'd add stages to break up performance bottlenecks, e.g., adding additional pipeline stages (MEM1 and MEM2) to allow a longer time for memory operations to complete. This comes at a cost to CPI_stall, since each additional MEM stage means that more NOP bubbles have to be introduced when there's a LD data hazard. Deeper pipelines also mean that the processor will be executing more instructions in parallel.

Let's interrupt enumerating our performance shopping list to think about limits to pipeline depth. Each additional pipeline stage adds some overhead to the time budget. We have to account for the propagation, setup, and hold times of the pipeline registers. We usually have to allow a bit of extra time for clock skew, i.e., the difference in arrival time of the clock edge at each register. And, finally, since we can't always divide the work exactly evenly between the pipeline stages, there will be some wasted time in the stages that have less work. We'll capture all of these effects as an additional per-stage time overhead of O. If the original clock period was T, then with N pipeline stages the clock period will be T/N + O. In the limit, as N becomes large, the speedup approaches T/O. In other words, the overhead starts to dominate as the time spent on useful work in each stage becomes smaller and smaller, and at some point adding additional pipeline stages has almost no impact on the clock period.

As a data point, Intel's x86 chips based on the Nehalem microarchitecture have a 14-stage execution pipeline.
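To see the diminishing returns numerically, here's a quick sketch of the clock-period formula just described. The values of T and O below are arbitrary placeholders, chosen only to make the trend visible.

    # T = combinational delay of the unpipelined datapath, O = per-stage
    # overhead (register timing, clock skew, uneven division of work).
    T = 10.0    # ns, made up for illustration
    O = 0.5     # ns, made up for illustration

    for N in (1, 2, 4, 8, 16, 32, 64):
        t_clk = T / N + O        # clock period with N pipeline stages
        speedup = T / t_clk      # throughput relative to the unpipelined machine
        print(f"N={N:3d}  t_CLK={t_clk:6.3f} ns  speedup={speedup:5.2f}")

    print(f"limit as N grows: T/O = {T / O:.1f}")

Each doubling of N buys less and less additional speedup, and the T/O limit is never actually reached.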
Okay, back to our performance shopping list…

There may be times when we can arrange to execute two or more instructions in parallel, assuming that their executions are independent of each other. This would decrease CPI_ideal, at the cost of increasing the complexity of each pipeline stage to deal with concurrent execution of multiple instructions.

If there's an instruction stalled in the pipeline by a data hazard, there may be following instructions whose execution could still proceed. Allowing instructions to pass each other in the pipeline is called out-of-order execution. We'd have to be careful to ensure that changing the execution order didn't affect the values produced by the program.

More pipeline stages and wider pipeline stages increase the amount of work that has to be discarded on control hazards, potentially increasing CPI_stall. So it's important to minimize the number of control hazards by predicting the outcome of each branch (i.e., taken or not taken), increasing the chances that the instructions in the pipeline are the ones we'll actually want to execute.

Our ability to exploit wider pipelines and out-of-order execution depends on finding instructions that can be executed in parallel or in different orders. Collectively these properties are called "instruction-level parallelism" (ILP). Here's an example that will let us explore the amount of ILP that might be available. On the left is an unoptimized loop that computes the product of the first N integers. On the right, we've rewritten the code, placing instructions that could be executed concurrently on the same line.

First notice the red line following the BF instruction. Instructions below the line should only be executed if the BF is *not* taken. That doesn't mean we can't start executing them before the outcome of the branch is known, but if we do, we have to be prepared to throw away their results if the branch turns out to be taken.

The possible execution order is constrained by the read-after-write (RAW) dependencies shown by the red arrows. We recognize these as the potential data hazards that occur when an operand value for one instruction depends on the result of an earlier instruction. In our 5-stage pipeline, we were able to resolve many of these hazards by bypassing values from the ALU, MEM, and WB stages back to the RF stage where operand values are determined. Of course, bypassing only works once the earlier instruction has executed, so that its result is available to bypass! So, in this case, the arrows are showing us the constraints on execution order that guarantee bypassing will be possible.
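Here's a minimal sketch of what finding the RAW constraints amounts to. The instruction sequence below is a made-up, Beta-flavored stand-in, not the actual code from the slide; each entry lists an instruction's destination register and source registers, and for each source we look for the most recent earlier writer - the instruction whose result must be available (in the register file or on a bypass path) before this one can execute.

    # Hypothetical instruction list: (text, destination, sources).
    program = [
        ("MUL(R2,R1,R2)",    "R2",  ["R2", "R1"]),
        ("SUBC(R1,1,R1)",    "R1",  ["R1"]),
        ("CMPLT(R31,R1,R0)", "R0",  ["R31", "R1"]),
        ("BF(R0,done)",      None,  ["R0"]),
    ]

    # A RAW dependence: instruction i reads a register whose most recent
    # writer is an earlier instruction j.
    for i, (text, _, sources) in enumerate(program):
        for src in sources:
            for j in range(i - 1, -1, -1):
                if program[j][1] == src:
                    print(f"RAW: {text} reads {src} written by {program[j][0]}")
                    break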
There are other constraints on execution order. The green arrow identifies a write-after-write (WAW) constraint between two instructions with the same destination register. In order to ensure the correct value is in R2 at the end of the loop, the LD(r,R2) instruction has to store its result into the register file after the result of the CMPLT instruction is stored into the register file. Similarly, the blue arrow shows a write-after-read (WAR) constraint that ensures that the correct values are used when accessing a register. In this case, LD(r,R2) must store into R2 after the Ra operand for the BF has been read from R2.

As it turns out, WAW and WAR constraints can be eliminated if we give each instruction result a unique register name. This can actually be done relatively easily by the hardware, using a generous supply of temporary registers, but we won't go into the details of renaming here (there's a small sketch of the idea below). The use of temporary registers also makes it easy to discard the results of instructions executed before we know the outcomes of branches.

In this example, we discovered that the potential concurrency was actually pretty good for the instructions following the BF. To take advantage of this potential concurrency, we'll need to modify the pipeline to execute some number N of instructions in parallel. If we can sustain that rate of execution, CPI_ideal would then be 1/N, since we'd complete the execution of N instructions in each clock cycle as they exited the final pipeline stage. So what value should we choose for N?

Instructions that are executed by different ALU hardware are easy to execute in parallel, e.g., ADDs and SHIFTs, or integer and floating-point operations. Of course, if we provided multiple adders, we could execute multiple integer arithmetic instructions concurrently. Having separate hardware for address arithmetic (called LD/ST units) would support concurrent execution of LD/ST instructions and integer arithmetic instructions. This set of lecture slides from Duke gives a nice overview of the techniques used in each pipeline stage to support concurrent execution.
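Coming back to renaming for a moment, here is a minimal sketch of the idea, assuming an unbounded supply of temporary registers T0, T1, … and a tiny made-up fragment modeled on the CMPLT/BF/LD(r,R2) example above. Real renamers manage a finite physical register file in hardware, but the effect is the same: once every result has a unique destination, only the true RAW dependences remain.

    from itertools import count

    def rename(program):
        fresh = (f"T{i}" for i in count())   # unbounded temporaries (an assumption)
        latest = {}                          # architectural reg -> newest temp name
        renamed = []
        for text, dest, sources in program:
            srcs = [latest.get(s, s) for s in sources]   # read the newest version
            new_dest = next(fresh) if dest else None     # unique name per result
            if dest:
                latest[dest] = new_dest
            renamed.append((text, new_dest, srcs))
        return renamed

    # CMPLT writes R2, BF reads R2, LD(r,R2) overwrites R2: a WAW (CMPLT vs. LD)
    # and a WAR (BF vs. LD) on the same architectural register.
    program = [
        ("CMPLT(R31,R1,R2)", "R2", ["R31", "R1"]),
        ("BF(R2,done)",      None, ["R2"]),
        ("LD(r,R2)",         "R2", []),
    ]
    for text, dest, srcs in rename(program):
        print(f"{text:18s} -> writes {dest}, reads {srcs}")

After renaming, CMPLT writes one temporary and LD writes a different one, so the two writes no longer collide and the LD no longer has to wait for the BF to read R2.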
Basically, by increasing the number of functional units in the ALU and the number of ports on the register file and main memory, we would have what it takes to support concurrent execution of multiple instructions. So what's the right tradeoff between increased circuit costs and increased concurrency? As a data point, the Intel Nehalem core can complete up to 4 micro-operations per cycle, where each micro-operation corresponds to one of our simple RISC instructions.

Here's a simplified diagram of a modern out-of-order superscalar processor. Instruction fetch and decode handle, say, 4 instructions at a time. The ability to sustain this execution rate depends heavily on the ability to predict the outcome of branch instructions, ensuring that the wide pipeline will be mostly filled with instructions we actually want to execute. Good branch prediction requires using the history of previous branches, and there's been a lot of cleverness devoted to getting good predictions from the least amount of hardware! If you're interested in the details, search for "branch predictor" on Wikipedia.

Register renaming happens during instruction decode, after which the instructions are ready to be dispatched to the functional units. If an instruction needs the result of an earlier instruction as an operand, the dispatcher identifies which functional unit will be producing that result. The instruction waits in a queue until the indicated functional unit produces the result; when all of its operand values are known, the instruction is finally taken from the queue and executed. Since instructions are executed by different functional units as soon as their operands are available, the order of execution may not be the same as in the original program.

After execution, the functional units broadcast their results so that waiting instructions know when to proceed. The results are also collected in a large reorder buffer so that they can be retired (i.e., write their results into the register file) in the correct order.

Whew! There's a lot of circuitry involved in keeping the functional units fed with instructions, knowing when instructions have all their operands, and organizing the execution results into the correct order.
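To make that retirement step a little more concrete, here is a toy model of a reorder buffer. The four ROB entries and the completion order are invented purely for illustration; the point is simply that results can be produced out of order by the functional units but only leave the buffer, and update the register file, in program order.

    from collections import OrderedDict

    # ROB entries in program order; each will eventually hold a result.
    rob = OrderedDict()
    for idx, dest in enumerate(["R1", "R2", "R3", "R4"]):
        rob[idx] = {"dest": dest, "value": None, "done": False}

    completion_order = [2, 0, 3, 1]    # functional units finish out of order
    regfile = {}

    for finished in completion_order:
        rob[finished]["value"] = 100 + finished   # stand-in for a real result
        rob[finished]["done"] = True
        # Retire from the head of the ROB only while the oldest entry is done,
        # so the register file is always updated in program order.
        while rob and next(iter(rob.values()))["done"]:
            head = rob.pop(next(iter(rob)))
            regfile[head["dest"]] = head["value"]
            print("retired", head["dest"], "=", head["value"])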
So how much speedup should we expect from all this machinery? The effective CPI is very program-specific, depending as it does on cache hit rates, successful branch prediction, available ILP, and so on. Given the architecture described here, the best speedup we could hope for is a factor of 4. Googling around, it seems that the reality is an average speedup of 2, maybe slightly less, over what would be achieved by an in-order, single-issue processor.

What can we expect for future performance improvements in out-of-order, superscalar pipelines? Increases in pipeline depth cause CPI_stall and the timing overheads to rise. At current pipeline depths, the increase in CPI_stall outweighs the gain from a smaller t_CLK, so further increases in depth are unlikely. A similar tradeoff exists between using more out-of-order execution to increase ILP and the increase in CPI_stall caused by mis-predicted branches and the inability to run main memories any faster. Power consumption grows more quickly than the performance gained from a lower t_CLK and additional out-of-order execution logic. And the additional complexity required to enable further improvements in branch prediction and concurrent execution seems very daunting.

All of these factors suggest that it is unlikely we'll see substantial future improvements in the performance of out-of-order superscalar pipelined processors. So system architects have turned their attention to exploiting data-level parallelism (DLP) and thread-level parallelism (TLP). These are our next two topics.