1 00:00:00,719 --> 00:00:05,759 Let's see if we can improve the throughput of the original combinational multiplier design. 2 00:00:05,759 --> 00:00:10,120 We'll use our patented pipelining process to divide the processing into stages with 3 00:00:10,120 --> 00:00:15,610 the expectation of achieving a smaller clock period and higher throughput. 4 00:00:15,610 --> 00:00:21,159 The number to beat is approximately 1 output every 2N, where N is the number of bits in 5 00:00:21,159 --> 00:00:23,989 each of the operands. 6 00:00:23,989 --> 00:00:28,089 Our first step is to draw a contour across all the outputs. 7 00:00:28,089 --> 00:00:33,500 This creates a 1-pipeline, which gets us started but doesn't improve the throughput. 8 00:00:33,500 --> 00:00:37,350 Let's add another contour, dividing the computations about in half. 9 00:00:37,350 --> 00:00:41,600 If we're on the right track, we hope to see some improvement in the throughput. 10 00:00:41,600 --> 00:00:44,550 And indeed we do: the throughput has doubled. 11 00:00:44,550 --> 00:00:48,650 Yet both the before and after throughputs are order 1/N. 12 00:00:48,650 --> 00:00:52,420 Is there any hope of a dramatically better throughput? 13 00:00:52,420 --> 00:00:58,050 The necessary insight is that as long as an entire row is inside a single pipeline stage, 14 00:00:58,050 --> 00:01:02,980 the latency of the stage will be order N since we have to leave time for the N-bit ripple-carry 15 00:01:02,980 --> 00:01:05,790 add to complete. 16 00:01:05,790 --> 00:01:08,360 There are several ways to tackle this problem. 17 00:01:08,360 --> 00:01:12,670 The technique illustrated here will be useful in our next task. 18 00:01:12,670 --> 00:01:15,690 In this schematic we've redrawn the carry chains. 19 00:01:15,690 --> 00:01:20,260 Carry-outs are still connected to a module one column to the left, but, in this case, 20 00:01:20,260 --> 00:01:22,580 a module that's down a row. 21 00:01:22,580 --> 00:01:27,300 So all the additions that need to happen in a specific column still happen in that column, 22 00:01:27,300 --> 00:01:31,470 we've just reorganized which row does the adding. 23 00:01:31,470 --> 00:01:36,658 Let's pipeline this revised diagram, creating stages with approximately two module's worth 24 00:01:36,658 --> 00:01:38,658 of propagation delay. 25 00:01:38,658 --> 00:01:43,710 The horizontal contours now break the long carry chains and the latency of each stage 26 00:01:43,710 --> 00:01:49,670 is now constant, independent of N. Note that we had to add order N extra rows 27 00:01:49,670 --> 00:01:54,240 to take of the propagating the carries all the way to the end - the extra circuitry is 28 00:01:54,240 --> 00:01:57,130 shown in the grey box. 29 00:01:57,130 --> 00:02:03,280 To achieve a latency that's independent of N in each stage, we'll need order N contours. 30 00:02:03,280 --> 00:02:09,259 This means the latency is constant, which in order-of notation we write as "order 1". 31 00:02:09,259 --> 00:02:14,510 But this means the clock period is now independent of N, as is the throughput - they are both 32 00:02:14,510 --> 00:02:16,450 order 1. 33 00:02:16,450 --> 00:02:23,040 With order N contours, there are order N pipeline stages, so the system latency is order N. 34 00:02:23,040 --> 00:02:27,450 The hardware cost is still order N^2. 35 00:02:27,450 --> 00:02:32,170 So the pipelined carry-save multiplier has dramatically better throughput than the original 36 00:02:32,170 --> 00:02:36,910 circuit, another design tradeoff we can remember for future use. 37 00:02:36,910 --> 00:02:41,680 We'll use the carry-save technique in our next optimization, which is to implement the 38 00:02:41,680 --> 00:02:45,400 multiplier using only order N hardware. 39 00:02:45,400 --> 00:02:50,170 This sequential multiplier design computes a single partial product in each step and 40 00:02:50,170 --> 00:02:52,820 adds it to the accumulating sum. 41 00:02:52,820 --> 00:02:57,000 It will take order N steps to perform the complete multiplication. 42 00:02:57,000 --> 00:03:01,790 In each step, the next bit of the multiplier, found in the low-order bit of the B register, 43 00:03:01,790 --> 00:03:06,410 is ANDed with the multiplicand to form the next partial product. 44 00:03:06,410 --> 00:03:10,650 This is sent to the N-bit carry-save adder to be added to the accumulating sum in the 45 00:03:10,650 --> 00:03:12,770 P register. 46 00:03:12,770 --> 00:03:18,260 The value in the P register and the output of the adder are in "carry-save format". 47 00:03:18,260 --> 00:03:24,120 This means there are 32 data bits, but, in addition, 31 saved carries, to be added to 48 00:03:24,120 --> 00:03:26,849 the appropriate column in the next cycle. 49 00:03:26,849 --> 00:03:31,280 The output of the carry-save adder is saved in the P register, then in preparation for 50 00:03:31,280 --> 00:03:36,129 the next step both P and B are shifted right by 1 bit. 51 00:03:36,129 --> 00:03:41,450 So each cycle one bit of the accumulated sum is retired to the B register since it can 52 00:03:41,450 --> 00:03:44,990 no longer be affected by the remaining partial products. 53 00:03:44,990 --> 00:03:49,910 Think of it this way: instead of shifting the partial products left to account for the 54 00:03:49,910 --> 00:03:56,550 weight of the current multiplier bit, we're shifting the accumulated sum right! 55 00:03:56,550 --> 00:04:00,540 The clock period needed for the sequential logic is quite small, and, more importantly 56 00:04:00,540 --> 00:04:05,860 is independent of N. Since there's no carry propagation, the latency 57 00:04:05,860 --> 00:04:12,420 of the carry-save adder is very small, i.e., only enough time for the operation of a single 58 00:04:12,420 --> 00:04:15,080 full adder module. 59 00:04:15,080 --> 00:04:19,839 After order N steps, we've generated the necessary partial products, but will need to continue 60 00:04:19,839 --> 00:04:25,860 for another order N steps to finish propagating the carries through the carry-save adder. 61 00:04:25,860 --> 00:04:30,979 But even at 2N steps, the overall latency of the multiplier is still order N. 62 00:04:30,979 --> 00:04:36,560 And at the end of the 2N steps, we produce the answer in the P and B registers combined, 63 00:04:36,560 --> 00:04:41,880 so the throughput is order 1/N. The big change is in the hardware cost at 64 00:04:41,880 --> 00:04:47,970 order N, a dramatic improvement over the order N^2 hardware cost of the original combinational 65 00:04:47,970 --> 00:04:50,430 multiplier. 66 00:04:50,430 --> 00:04:53,860 This completes our little foray into multiplier designs. 67 00:04:53,860 --> 00:04:58,330 We've seen that with a little cleverness we can create designs with order 1 throughput, 68 00:04:58,330 --> 00:05:01,520 or designs with only order N hardware. 69 00:05:01,520 --> 00:05:06,690 The technique of carry-save addition is useful in many situations and its use can improve 70 00:05:06,690 --> 00:05:11,800 throughput at constant hardware cost, or save hardware at a constant throughput.