WEBVTT

00:00:00.719 --> 00:00:05.759
Let's see if we can improve the throughput
of the original combinational multiplier design.

00:00:05.759 --> 00:00:10.120
We'll use our patented pipelining process
to divide the processing into stages with

00:00:10.120 --> 00:00:15.610
the expectation of achieving a smaller clock
period and higher throughput.

00:00:15.610 --> 00:00:21.159
The number to beat is approximately 1 output
every 2N, where N is the number of bits in

00:00:21.159 --> 00:00:23.989
each of the operands.

00:00:23.989 --> 00:00:28.089
Our first step is to draw a contour across
all the outputs.

00:00:28.089 --> 00:00:33.500
This creates a 1-pipeline, which gets us started
but doesn't improve the throughput.

00:00:33.500 --> 00:00:37.350
Let's add another contour, dividing the computations
about in half.

00:00:37.350 --> 00:00:41.600
If we're on the right track, we hope to see
some improvement in the throughput.

00:00:41.600 --> 00:00:44.550
And indeed we do: the throughput has doubled.

00:00:44.550 --> 00:00:48.650
Yet both the before and after throughputs
are order 1/N.

00:00:48.650 --> 00:00:52.420
Is there any hope of a dramatically better
throughput?

00:00:52.420 --> 00:00:58.050
The necessary insight is that as long as an
entire row is inside a single pipeline stage,

00:00:58.050 --> 00:01:02.980
the latency of the stage will be order N since
we have to leave time for the N-bit ripple-carry

00:01:02.980 --> 00:01:05.790
add to complete.

00:01:05.790 --> 00:01:08.360
There are several ways to tackle this problem.

00:01:08.360 --> 00:01:12.670
The technique illustrated here will be useful
in our next task.

00:01:12.670 --> 00:01:15.690
In this schematic we've redrawn the carry
chains.

00:01:15.690 --> 00:01:20.260
Carry-outs are still connected to a module
one column to the left, but, in this case,

00:01:20.260 --> 00:01:22.580
a module that's down a row.

00:01:22.580 --> 00:01:27.300
So all the additions that need to happen in
a specific column still happen in that column,

00:01:27.300 --> 00:01:31.470
we've just reorganized which row does the
adding.

00:01:31.470 --> 00:01:36.658
Let's pipeline this revised diagram, creating
stages with approximately two module's worth

00:01:36.658 --> 00:01:38.658
of propagation delay.

00:01:38.658 --> 00:01:43.710
The horizontal contours now break the long
carry chains and the latency of each stage

00:01:43.710 --> 00:01:49.670
is now constant, independent of N.
Note that we had to add order N extra rows

00:01:49.670 --> 00:01:54.240
to take of the propagating the carries all
the way to the end - the extra circuitry is

00:01:54.240 --> 00:01:57.130
shown in the grey box.

00:01:57.130 --> 00:02:03.280
To achieve a latency that's independent of
N in each stage, we'll need order N contours.

00:02:03.280 --> 00:02:09.259
This means the latency is constant, which
in order-of notation we write as "order 1".

00:02:09.259 --> 00:02:14.510
But this means the clock period is now independent
of N, as is the throughput - they are both

00:02:14.510 --> 00:02:16.450
order 1.

00:02:16.450 --> 00:02:23.040
With order N contours, there are order N pipeline
stages, so the system latency is order N.

00:02:23.040 --> 00:02:27.450
The hardware cost is still order N^2.

00:02:27.450 --> 00:02:32.170
So the pipelined carry-save multiplier has
dramatically better throughput than the original

00:02:32.170 --> 00:02:36.910
circuit, another design tradeoff we can remember
for future use.

00:02:36.910 --> 00:02:41.680
We'll use the carry-save technique in our
next optimization, which is to implement the

00:02:41.680 --> 00:02:45.400
multiplier using only order N hardware.

00:02:45.400 --> 00:02:50.170
This sequential multiplier design computes
a single partial product in each step and

00:02:50.170 --> 00:02:52.820
adds it to the accumulating sum.

00:02:52.820 --> 00:02:57.000
It will take order N steps to perform the
complete multiplication.

00:02:57.000 --> 00:03:01.790
In each step, the next bit of the multiplier,
found in the low-order bit of the B register,

00:03:01.790 --> 00:03:06.410
is ANDed with the multiplicand to form the
next partial product.

00:03:06.410 --> 00:03:10.650
This is sent to the N-bit carry-save adder
to be added to the accumulating sum in the

00:03:10.650 --> 00:03:12.770
P register.

00:03:12.770 --> 00:03:18.260
The value in the P register and the output
of the adder are in "carry-save format".

00:03:18.260 --> 00:03:24.120
This means there are 32 data bits, but, in
addition, 31 saved carries, to be added to

00:03:24.120 --> 00:03:26.849
the appropriate column in the next cycle.

00:03:26.849 --> 00:03:31.280
The output of the carry-save adder is saved
in the P register, then in preparation for

00:03:31.280 --> 00:03:36.129
the next step both P and B are shifted right
by 1 bit.

00:03:36.129 --> 00:03:41.450
So each cycle one bit of the accumulated sum
is retired to the B register since it can

00:03:41.450 --> 00:03:44.990
no longer be affected by the remaining partial
products.

00:03:44.990 --> 00:03:49.910
Think of it this way: instead of shifting
the partial products left to account for the

00:03:49.910 --> 00:03:56.550
weight of the current multiplier bit, we're
shifting the accumulated sum right!

00:03:56.550 --> 00:04:00.540
The clock period needed for the sequential
logic is quite small, and, more importantly

00:04:00.540 --> 00:04:05.860
is independent of N.
Since there's no carry propagation, the latency

00:04:05.860 --> 00:04:12.420
of the carry-save adder is very small, i.e.,
only enough time for the operation of a single

00:04:12.420 --> 00:04:15.080
full adder module.

00:04:15.080 --> 00:04:19.839
After order N steps, we've generated the necessary
partial products, but will need to continue

00:04:19.839 --> 00:04:25.860
for another order N steps to finish propagating
the carries through the carry-save adder.

00:04:25.860 --> 00:04:30.979
But even at 2N steps, the overall latency
of the multiplier is still order N.

00:04:30.979 --> 00:04:36.560
And at the end of the 2N steps, we produce
the answer in the P and B registers combined,

00:04:36.560 --> 00:04:41.880
so the throughput is order 1/N.
The big change is in the hardware cost at

00:04:41.880 --> 00:04:47.970
order N, a dramatic improvement over the order
N^2 hardware cost of the original combinational

00:04:47.970 --> 00:04:50.430
multiplier.

00:04:50.430 --> 00:04:53.860
This completes our little foray into multiplier
designs.

00:04:53.860 --> 00:04:58.330
We've seen that with a little cleverness we
can create designs with order 1 throughput,

00:04:58.330 --> 00:05:01.520
or designs with only order N hardware.

00:05:01.520 --> 00:05:06.690
The technique of carry-save addition is useful
in many situations and its use can improve

00:05:06.690 --> 00:05:11.800
throughput at constant hardware cost, or save
hardware at a constant throughput.