WEBVTT
00:00:00.500 --> 00:00:03.100
The most straightforward
way to improve performance
00:00:03.100 --> 00:00:06.070
is to reduce the propagation
delay of a circuit.
00:00:06.070 --> 00:00:08.500
Let’s look at a perennial
performance bottleneck:
00:00:08.500 --> 00:00:10.910
the ripple-carry adder.
00:00:10.910 --> 00:00:14.040
To fix it, we first have to
figure out the path from inputs
00:00:14.040 --> 00:00:17.470
to outputs that has the largest
propagation delay, i.e.,
00:00:17.470 --> 00:00:21.880
the path that’s determining
the overall t_PD.
00:00:21.880 --> 00:00:24.770
In this case that path
is the long carry chain
00:00:24.770 --> 00:00:27.440
following the carry-in
to carry-out path
00:00:27.440 --> 00:00:30.310
through each full adder module.
00:00:30.310 --> 00:00:35.940
To trigger the path add -1 and
1 by setting the A inputs to all
00:00:35.940 --> 00:00:39.260
1’s and the B input to all 0’s
except for the low-order bit
00:00:39.260 --> 00:00:41.660
which is 1.
00:00:41.660 --> 00:00:45.370
The final answer is 0, but
notice that each full adder has
00:00:45.370 --> 00:00:47.960
to wait for the carry-in
from the previous stage
00:00:47.960 --> 00:00:50.740
before it produces
0 on its sum output
00:00:50.740 --> 00:00:54.150
and generates a carry-out
for the next full adder.
00:00:54.150 --> 00:00:57.100
The carry really does
ripple through the circuit
00:00:57.100 --> 00:01:01.330
as each full adder in
turn does its thing.
00:01:01.330 --> 00:01:03.690
To total propagation
delay along this path
00:01:03.690 --> 00:01:06.840
is N-1 times the carry-in
to carry-out delay
00:01:06.840 --> 00:01:09.180
of each full adder,
plus the delay
00:01:09.180 --> 00:01:12.560
to produce the final
bit of the sum.
00:01:12.560 --> 00:01:14.660
How would the overall
latency change
00:01:14.660 --> 00:01:17.970
if we, say, doubled the
size of the operands, i.e.,
00:01:17.970 --> 00:01:21.080
made N twice as large?
00:01:21.080 --> 00:01:24.580
It’s useful to summarize the
dependency of the latency on N
00:01:24.580 --> 00:01:28.430
using the “order-of” notation
to give us the big picture.
00:01:28.430 --> 00:01:30.560
Clearly as N gets
larger the delay
00:01:30.560 --> 00:01:33.110
of the XOR gate at the end
becomes less significant,
00:01:33.110 --> 00:01:36.730
so the order-of notation ignores
terms that are relatively less
00:01:36.730 --> 00:01:39.160
important as N grows.
00:01:39.160 --> 00:01:41.590
In this example, the
latency is order N,
00:01:41.590 --> 00:01:43.500
which tells us that
the latency would
00:01:43.500 --> 00:01:45.410
be expected to
essentially double
00:01:45.410 --> 00:01:48.320
if we made N twice as large.
00:01:48.320 --> 00:01:50.770
The order-of notation,
which theoreticians
00:01:50.770 --> 00:01:53.320
call asymptotic
analysis, tells us
00:01:53.320 --> 00:01:57.060
the term that would dominate
the result as N grows.
00:01:57.060 --> 00:01:59.570
The yellow box contains
the official definition,
00:01:59.570 --> 00:02:02.570
but an example might make it
easier to understand what’s
00:02:02.570 --> 00:02:03.860
happening.
00:02:03.860 --> 00:02:06.330
Suppose we want to characterize
the growth in the value
00:02:06.330 --> 00:02:11.170
of the equation n^2 +
2n + 3 as n gets larger.
00:02:11.170 --> 00:02:15.060
The dominant term is clearly n^2
and the value of our equation
00:02:15.060 --> 00:02:18.920
is bounded above and below
by simple multiples of n^2,
00:02:18.920 --> 00:02:23.250
except for finitely
many values of n.
00:02:23.250 --> 00:02:25.540
The lower bound is
always true for n greater
00:02:25.540 --> 00:02:27.530
than or equal to 0.
00:02:27.530 --> 00:02:30.540
And in this case, the upper
bound doesn’t hold only for n
00:02:30.540 --> 00:02:32.510
equal to 0, 1, 2, or 3.
00:02:32.510 --> 00:02:37.430
For all other positive values of
n the upper inequality is true.
00:02:37.430 --> 00:02:40.460
So we’d say this
equation was “order N^2”.
00:02:40.460 --> 00:02:45.500
There are actually two variants
for the order-of notation.
00:02:45.500 --> 00:02:47.840
We use the Theta
notation to indicate
00:02:47.840 --> 00:02:50.970
that g(n) is bounded above AND
below by multiples of f(n).
00:02:50.970 --> 00:02:55.550
The “big O” notation is used
when g(n) is only bounded above
00:02:55.550 --> 00:02:57.750
by a multiple of f(n).
00:02:57.750 --> 00:03:00.140
Here’s a first attempt
at improving the latency
00:03:00.140 --> 00:03:02.070
of our addition circuit.
00:03:02.070 --> 00:03:04.030
The trouble with the
ripple-carry adder
00:03:04.030 --> 00:03:06.190
is that the high-order
bits have to wait
00:03:06.190 --> 00:03:09.010
for the carry-in from
the low-order bits.
00:03:09.010 --> 00:03:12.460
Is there a way in which we can
get high half the adder working
00:03:12.460 --> 00:03:15.090
in parallel with the low half?
00:03:15.090 --> 00:03:18.520
Suppose we wanted to
build a 32-bit adder.
00:03:18.520 --> 00:03:21.500
Let’s make two copies of the
high 16 bits of the adder,
00:03:21.500 --> 00:03:24.360
one assuming the carry-in
from the low bits is 0,
00:03:24.360 --> 00:03:27.530
and the other assuming
the carry-in is 1.
00:03:27.530 --> 00:03:30.420
So now we have three
16-bit adders, all of which
00:03:30.420 --> 00:03:35.550
can operate in parallel on
newly arriving A and B inputs.
00:03:35.550 --> 00:03:37.660
Once the 16-bit
additions are complete,
00:03:37.660 --> 00:03:40.320
we can use the actual
carry-out from the low-half
00:03:40.320 --> 00:03:43.670
to select the answer from the
particular high-half adder that
00:03:43.670 --> 00:03:45.890
used the matching
carry-in value.
00:03:45.890 --> 00:03:47.650
This type of adder
is appropriately
00:03:47.650 --> 00:03:50.410
named the carry-select adder.
00:03:50.410 --> 00:03:52.180
The latency of this
carry-select adder
00:03:52.180 --> 00:03:53.900
is just a little
more than the latency
00:03:53.900 --> 00:03:56.670
of a 16-bit
ripple-carry addition.
00:03:56.670 --> 00:03:58.710
This is approximately
half the latency
00:03:58.710 --> 00:04:01.910
of the original 32-bit
ripple-carry adder.
00:04:01.910 --> 00:04:04.810
So at a cost of about
50% more circuitry,
00:04:04.810 --> 00:04:07.620
we’ve halved the latency!
00:04:07.620 --> 00:04:09.980
As a next step, we could
apply the same strategy
00:04:09.980 --> 00:04:12.820
to halve the latency
of the 16-bit adders.
00:04:12.820 --> 00:04:15.500
And then again to halve the
latency of the 8-bit adders
00:04:15.500 --> 00:04:17.640
used in the previous step.
00:04:17.640 --> 00:04:22.100
At each step we halve the adder
latency and add a MUX delay.
00:04:22.100 --> 00:04:28.190
After log2(N) steps, N
will be 1 and we’re done.
00:04:28.190 --> 00:04:30.860
At this point the latency
would be some constant cost
00:04:30.860 --> 00:04:35.200
to do a 1-bit addition, plus
log2(N) times the MUX latency
00:04:35.200 --> 00:04:37.500
to select the right answers.
00:04:37.500 --> 00:04:40.070
So the overall latency
of the carry-select adder
00:04:40.070 --> 00:04:42.200
is order log(N).
00:04:42.200 --> 00:04:44.320
Note that log2(N)
and log(N) only
00:04:44.320 --> 00:04:46.430
differ by a constant
factor, so we
00:04:46.430 --> 00:04:50.420
ignore the base of the
log in order-of notation.
00:04:50.420 --> 00:04:54.090
The carry-select adder shows a
clear performance-size tradeoff
00:04:54.090 --> 00:04:56.370
available to the designer.
00:04:56.370 --> 00:04:59.550
Since adders play a big role
in many digital systems,
00:04:59.550 --> 00:05:02.440
here’s a more carefully
engineered version of a 32-bit
00:05:02.440 --> 00:05:03.810
carry-select adder.
00:05:03.810 --> 00:05:07.190
You could try this
in your ALU design!
00:05:07.190 --> 00:05:08.960
The size of the
adder blocks has been
00:05:08.960 --> 00:05:11.950
chosen so that the trial
sums and the carry-in
00:05:11.950 --> 00:05:15.100
from the previous stage
arrive at the carry-select MUX
00:05:15.100 --> 00:05:17.680
at approximately the same time.
00:05:17.680 --> 00:05:20.590
Note that since the select
signal for the MUXes is heavily
00:05:20.590 --> 00:05:23.920
loaded we’ve included a buffer
to make the select signal
00:05:23.920 --> 00:05:26.540
transitions faster.
00:05:26.540 --> 00:05:29.690
This carry-select adder is about
two-and-a-half times faster
00:05:29.690 --> 00:05:33.530
than a 32-bit ripple-carry
adder at the cost of about twice
00:05:33.530 --> 00:05:35.770
as much circuitry.
00:05:35.770 --> 00:05:38.150
A great design to remember
when you’re looking to double
00:05:38.150 --> 00:05:40.580
the speed of your ALU!