1
00:00:00,719 --> 00:00:05,759
Let's see if we can improve the throughput
of the original combinational multiplier design.

2
00:00:05,759 --> 00:00:10,120
We'll use our patented pipelining process
to divide the processing into stages with

3
00:00:10,120 --> 00:00:15,610
the expectation of achieving a smaller clock
period and higher throughput.

4
00:00:15,610 --> 00:00:21,159
The number to beat is approximately 1 output
every 2N, where N is the number of bits in

5
00:00:21,159 --> 00:00:23,989
each of the operands.

6
00:00:23,989 --> 00:00:28,089
Our first step is to draw a contour across
all the outputs.

7
00:00:28,089 --> 00:00:33,500
This creates a 1-pipeline, which gets us started
but doesn't improve the throughput.

8
00:00:33,500 --> 00:00:37,350
Let's add another contour, dividing the computations
about in half.

9
00:00:37,350 --> 00:00:41,600
If we're on the right track, we hope to see
some improvement in the throughput.

10
00:00:41,600 --> 00:00:44,550
And indeed we do: the throughput has doubled.

11
00:00:44,550 --> 00:00:48,650
Yet both the before and after throughputs
are order 1/N.

12
00:00:48,650 --> 00:00:52,420
Is there any hope of a dramatically better
throughput?

13
00:00:52,420 --> 00:00:58,050
The necessary insight is that as long as an
entire row is inside a single pipeline stage,

14
00:00:58,050 --> 00:01:02,980
the latency of the stage will be order N since
we have to leave time for the N-bit ripple-carry

15
00:01:02,980 --> 00:01:05,790
add to complete.

16
00:01:05,790 --> 00:01:08,360
There are several ways to tackle this problem.

17
00:01:08,360 --> 00:01:12,670
The technique illustrated here will be useful
in our next task.

18
00:01:12,670 --> 00:01:15,690
In this schematic we've redrawn the carry
chains.

19
00:01:15,690 --> 00:01:20,260
Carry-outs are still connected to a module
one column to the left, but, in this case,

20
00:01:20,260 --> 00:01:22,580
a module that's down a row.

21
00:01:22,580 --> 00:01:27,300
So all the additions that need to happen in
a specific column still happen in that column;

22
00:01:27,300 --> 00:01:31,470
we've just reorganized which row does the
adding.
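
The rearranged carry chains can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the circuit from the schematic: the names carry_save_add and carry_save_multiply are hypothetical, and Python integers stand in for rows of full adder modules. Each row adds three numbers with independent full adders, and every carry-out is handed to the column one to the left in the next row instead of rippling along the current row.

```python
def carry_save_add(a, b, c):
    """One row of independent full adders (hypothetical helper).

    Returns (sum_bits, carries) such that a + b + c == sum_bits + carries.
    """
    sum_bits = a ^ b ^ c                          # per-column sum, no ripple
    carries = ((a & b) | (a & c) | (b & c)) << 1  # carry-outs move one column left
    return sum_bits, carries


def carry_save_multiply(x, y, n):
    """Multiply two n-bit unsigned values by adding one partial product per row."""
    sum_bits, carries = 0, 0
    for i in range(n):
        # Partial product for multiplier bit i, shifted to its column weight.
        partial = (x << i) if (y >> i) & 1 else 0
        sum_bits, carries = carry_save_add(sum_bits, carries, partial)
    # Extra rows: keep passing the saved carries down until none remain.
    while carries:
        sum_bits, carries = carry_save_add(sum_bits, carries, 0)
    return sum_bits


print(carry_save_multiply(13, 11, 4))
```

Every column still receives the same bits to add; only the row in which a carry is consumed has changed, which is why the final loop is needed to flush the carries that are still saved.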

23
00:01:31,470 --> 00:01:36,658
Let's pipeline this revised diagram, creating
stages with approximately two modules' worth

24
00:01:36,658 --> 00:01:38,658
of propagation delay.

25
00:01:38,658 --> 00:01:43,710
The horizontal contours now break the long
carry chains and the latency of each stage

26
00:01:43,710 --> 00:01:49,670
is now constant, independent of N.
Note that we had to add order N extra rows

27
00:01:49,670 --> 00:01:54,240
to take care of propagating the carries all
the way to the end; the extra circuitry is

28
00:01:54,240 --> 00:01:57,130
shown in the grey box.

29
00:01:57,130 --> 00:02:03,280
To achieve a latency that's independent of
N in each stage, we'll need order N contours.

30
00:02:03,280 --> 00:02:09,259
This means the latency of each stage is constant, which
in order-of notation we write as "order 1".

31
00:02:09,259 --> 00:02:14,510
That in turn means the clock period is now independent
of N, as is the throughput - they are both

32
00:02:14,510 --> 00:02:16,450
order 1.

33
00:02:16,450 --> 00:02:23,040
With order N contours, there are order N pipeline
stages, so the system latency is order N.

34
00:02:23,040 --> 00:02:27,450
The hardware cost is still order N^2.
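
To make these orders of growth concrete, here is a rough back-of-the-envelope model in Python. Everything numeric in it is an assumption made for illustration: the function name pipelined_metrics, a unit module delay, roughly 2N rows of N modules (N rows of partial products plus N extra carry rows), two rows per stage, and no pipeline register overhead.

```python
def pipelined_metrics(n, t_module=1.0, rows_per_stage=2):
    """Toy cost model for the pipelined multiplier (all figures are assumptions)."""
    rows = 2 * n                             # assumed: n product rows + n carry rows
    stages = -(-rows // rows_per_stage)      # ceiling division
    clock_period = rows_per_stage * t_module # set by one stage, not by n
    return {
        "stages": stages,                    # grows like N
        "clock_period": clock_period,        # constant
        "throughput": 1.0 / clock_period,    # constant
        "latency": stages * clock_period,    # grows like N
        "modules": rows * n,                 # grows like N^2
    }


for n in (8, 16, 32):
    print(n, pipelined_metrics(n))
```

Doubling N in this model leaves the clock period and throughput unchanged, doubles the latency, and quadruples the module count.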

35
00:02:27,450 --> 00:02:32,170
So the pipelined carry-save multiplier has
dramatically better throughput than the original

36
00:02:32,170 --> 00:02:36,910
circuit, another design tradeoff we can remember
for future use.

37
00:02:36,910 --> 00:02:41,680
We'll use the carry-save technique in our
next optimization, which is to implement the

38
00:02:41,680 --> 00:02:45,400
multiplier using only order N hardware.

39
00:02:45,400 --> 00:02:50,170
This sequential multiplier design computes
a single partial product in each step and

40
00:02:50,170 --> 00:02:52,820
adds it to the accumulating sum.

41
00:02:52,820 --> 00:02:57,000
It will take order N steps to perform the
complete multiplication.

42
00:02:57,000 --> 00:03:01,790
In each step, the next bit of the multiplier,
found in the low-order bit of the B register,

43
00:03:01,790 --> 00:03:06,410
is ANDed with the multiplicand to form the
next partial product.

44
00:03:06,410 --> 00:03:10,650
This is sent to the N-bit carry-save adder
to be added to the accumulating sum in the

45
00:03:10,650 --> 00:03:12,770
P register.

46
00:03:12,770 --> 00:03:18,260
The value in the P register and the output
of the adder are in "carry-save format".

47
00:03:18,260 --> 00:03:24,120
This means there are 32 data bits, but, in
addition, 31 saved carries, to be added to

48
00:03:24,120 --> 00:03:26,849
the appropriate column in the next cycle.

49
00:03:26,849 --> 00:03:31,280
The output of the carry-save adder is saved
in the P register, then in preparation for

50
00:03:31,280 --> 00:03:36,129
the next step both P and B are shifted right
by 1 bit.

51
00:03:36,129 --> 00:03:41,450
So each cycle one bit of the accumulated sum
is retired to the B register since it can

52
00:03:41,450 --> 00:03:44,990
no longer be affected by the remaining partial
products.

53
00:03:44,990 --> 00:03:49,910
Think of it this way: instead of shifting
the partial products left to account for the

54
00:03:49,910 --> 00:03:56,550
weight of the current multiplier bit, we're
shifting the accumulated sum right!
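
The shift-right idea can be sketched in Python as follows. This is a simplified illustration, not the circuit itself: the function name sequential_multiply is hypothetical, and ordinary integer addition is swapped in for the carry-save adder so that only the register shifting is on display.

```python
def sequential_multiply(a, b, n):
    """Multiply n-bit unsigned a (multiplicand) by n-bit unsigned b (multiplier)."""
    p = 0                                  # accumulating sum (the P register)
    for _ in range(n):
        if b & 1:                          # low-order bit of B selects the partial product
            p += a                         # plain add; the real design uses carry-save
        # Shift P and B right together: the low bit of P can no longer
        # change, so it is retired into the vacated high bit of B.
        b = (b >> 1) | ((p & 1) << (n - 1))
        p >>= 1
    return (p << n) | b                    # answer is P and B combined


print(sequential_multiply(13, 11, 4))
```

After n steps every multiplier bit has been shifted out of B and replaced by a retired product bit, so B holds the low half of the product and P the high half.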

55
00:03:56,550 --> 00:04:00,540
The clock period needed for the sequential
logic is quite small, and, more importantly,

56
00:04:00,540 --> 00:04:05,860
is independent of N.
Since there's no carry propagation, the latency

57
00:04:05,860 --> 00:04:12,420
of the carry-save adder is very small, i.e.,
only enough time for the operation of a single

58
00:04:12,420 --> 00:04:15,080
full adder module.

59
00:04:15,080 --> 00:04:19,839
After order N steps, we've generated the necessary
partial products, but will need to continue

60
00:04:19,839 --> 00:04:25,860
for another order N steps to finish propagating
the carries through the carry-save adder.
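
A minimal Python sketch of the 2N-step schedule, under some simplifying assumptions: the function name carry_save_sequential_multiply is hypothetical, Python integers model the sum bits and saved carries held in P, and the retired bits are collected in a single integer rather than in the B register. Each step uses only bitwise operations, so no carry ever propagates across columns within a step.

```python
def carry_save_sequential_multiply(a, b, n):
    """Multiply two n-bit unsigned values in 2n carry-save steps."""
    s = 0        # sum bits held in P
    c = 0        # saved carries held in P
    result = 0   # retired product bits (simplification: not stored in B)
    for step in range(2 * n):
        # First n steps: AND the next multiplier bit with the multiplicand.
        # Last n steps: feed zeros so the saved carries can drain.
        pp = a if (step < n and (b >> step) & 1) else 0
        new_s = s ^ c ^ pp                     # one full adder per column
        new_c = (s & c) | (s & pp) | (c & pp)  # carry-outs, one column heavier
        result |= (new_s & 1) << step          # retire the finished low bit
        s = new_s >> 1                         # shift the sum bits right
        c = new_c                              # the shift realigns the carries
    assert s == 0 and c == 0                   # everything has drained
    return result


print(carry_save_sequential_multiply(13, 11, 4))
```

The carries are not shifted because each carry-out already belongs one column to the left, and the right shift of the sum bits puts the two back in alignment for the next cycle.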

61
00:04:25,860 --> 00:04:30,979
But even at 2N steps, the overall latency
of the multiplier is still order N.

62
00:04:30,979 --> 00:04:36,560
And at the end of the 2N steps, we produce
the answer in the P and B registers combined,

63
00:04:36,560 --> 00:04:41,880
so the throughput is order 1/N.
The big change is in the hardware cost at

64
00:04:41,880 --> 00:04:47,970
order N, a dramatic improvement over the order
N^2 hardware cost of the original combinational

65
00:04:47,970 --> 00:04:50,430
multiplier.

66
00:04:50,430 --> 00:04:53,860
This completes our little foray into multiplier
designs.

67
00:04:53,860 --> 00:04:58,330
We've seen that with a little cleverness we
can create designs with order 1 throughput,

68
00:04:58,330 --> 00:05:01,520
or designs with only order N hardware.

69
00:05:01,520 --> 00:05:06,690
The technique of carry-save addition is useful
in many situations and its use can improve

70
00:05:06,690 --> 00:05:11,800
throughput at constant hardware cost, or save
hardware at a constant throughput.