## 2.6.4   Multipliers

Figure 2.27 shows a symmetric 6-bit array multiplier (an n -bit multiplier multiplies two n -bit numbers; we shall use n -bit by m -bit multiplier if the lengths are different). Adders a0f0 may be eliminated, which then eliminates adders a1a6, leaving an asymmetric CSA array of 30 (5  ¥   6) adders (including one half adder). An n -bit array multiplier has a delay proportional to n plus the delay of the CPA (adders b6f6 in Figure 2.27). There are two items we can attack to improve the performance of a multiplier: the number of partial products and the addition of the partial products. FIGURE 2.27  Multiplication. A 6-bit array multiplier using a final carry-propagate adder (full-adder cells a6f6, a ripple-carry adder). Apart from the generation of the summands this multiplier uses the same structure as the carry-save adder of Figure 2.23(d).

Suppose we wish to multiply 15 (the multiplicand ) by 19 (the multiplier ) mentally. It is easier to calculate 15 ¥ 20 and subtract 15. In effect we complete the multiplication as 15 ¥ (20  1) and we could write this as 15 ¥ , with the overbar representing a minus sign. Now suppose we wish to multiply an 8-bit binary number, A, by B = 00010111 (decimal 16 + 4 + 2 + 1 = 23). It is easier to multiply A by the canonical signed-digit vector ( CSD vector ) D = (decimal  32  8 + 1 =  23) since this requires only three add or subtract operations (and a subtraction is as easy as an addition). We say B has a weight of 4 and D has a weight of 3. By using D instead of B we have reduced the number of partial products by 1 (= 4  3).

We can recode (or encode) any binary number, B, as a CSD vector, D, as follows (canonical means there is only one CSD vector for any number):

D i  = B i  + C i    2C i  + 1 ,(2.58)

where C i  + 1 is the carry from the sum of B i   + 1  + B i  + C i (we start with C 0  = 0).

As another example, if B = 011 (B 2  = 0, B 1  = 1, B 0  = 1; decimal 3), then, using Eq. 2.58,

D 0  = B 0  + C 0   2C 1  = 1 + 0  2 = ,

D 1  = B 1  + C 1   2C 2  = 1 + 1  2 = 0,

D 2  = B 2  + C 2   2C 3  = 0 + 1  0 = 1,(2.59)

so that D = (decimal 4  1 = 3). CSD vectors are useful to represent fixed coefficients in digital filters, for example.

We can recode using a radix other than 2. Suppose B is an ( n  + 1)-digit two's complement number,

B = B 0  + B 1 2 + B 2 2 2  + . . .  + B i 2 i  +   . . .  + B n   1 2 n   1   B n 2 n .(2.60)

We can rewrite the expression for B using the following sleight-of-hand:

2B  B  = B = B 0  + (B 0   B 1 )2 + . . .  + (B i   1   B i )2 i  + . . .  + B n   1 2 n   1   B n 2 n

=  (2B 1  + B 0 )2 0  + (2B 3  + B 2  + B 1 )2 2  + . . .

+ (2B i  + B i   1  + B i  2 )2 i  1  +  (2B i   + 2  + B i   + 1  + B i   )2 i  + 1   + . . .

+ (2B n + B i   1  + B i   2 )2 n 1 .(2.61)

This is very useful. Consider B = 101001 (decimal 9  32 = 23, n  = 5),

B = 101001 = (2B 1  + B 0 )2 0  + (2B 3  + B 2  + B 1 )2 2   + (2B 5 + B 4  + B 3 )2 4

= ((2  ¥  0) + 1)2 0  + ((2  ¥  1) + 0 + 0)2 2  + ((2  ¥  1) + 0 + 1)2 4  .(2.62)

Equation 2.61 tells us how to encode B as a radix-4 signed digit, E = (decimal 16  8 + 1 = 23). To multiply by B encoded as E we only have to perform a multiplication by 2 (a shift) and three add/subtract operations.

Using Eq. 2.61 we can encode any number by taking groups of three bits at a time and calculating

E j  = 2B i  + B i   1  + B i   2 , E j  + 1  = 2B i  + 2  + B i  + 1  + B i , . . . ,(2.63)

where each 3-bit group overlaps by one bit. We pad B with a zero, B n  . . . B 1 B 0 0, to match the first term in Eq. 2.61. If B has an odd number of bits, then we extend the sign: B n B n  . . . B 1 B 0 0. For example, B = 01011 (eleven), encodes to E = (16  4  1); and B = 101 is E = . This is called Booth encoding and reduces the number of partial products by a factor of two and thus considerably reduces the area as well as increasing the speed of our multiplier [Booth, 1951].

Next we turn our attention to improving the speed of addition in the CSA array. Figure 2.28(a) shows a section of the 6-bit array multiplier from Figure 2.27. We can collapse the chain of adders a0f5 (5 adder delays) to the Wallace tree consisting of adders 5.15.4 (4 adder delays) shown in Figure 2.28(b). FIGURE 2.28  Tree-based multiplication. (a) The portion of Figure 2.27 that calculates the sum bit, P 5 , using a chain of adders (cells a0f5). (b) We can collapse this chain to a Wallace tree (cells 5.15.5). (c) The stages of multiplication.

Figure 2.28(c) pictorially represents multiplication as a sort of golf course. Each link corresponds to an adder. The holes or dots are the outputs of one stage (and the inputs of the next). At each stage we have the following three choices: (1) sum three outputs using a full adder (denoted by a box enclosing three dots); (2) sum two outputs using a half adder (a box with two dots); (3) pass the outputs directly to the next stage. The two outputs of an adder are joined by a diagonal line (full adders use black dots, half adders white dots). The object of the game is to choose (1), (2), or (3) at each stage to maximize the performance of the multiplier. In tree-based multipliers there are two ways to do thisworking forward and working backward.

In a Wallace-tree multiplier we work forward from the multiplier inputs, compressing the number of signals to be added at each stage [Wallace, 1960]. We can view an FA as a 3:2 compressor or (3, 2) counter it counts the number of '1's on the inputs. Thus, for example, an input of '101' (two '1's) results in an output '10' (2). A half adder is a (2, 2) counter . To form P 5 in Figure 2.29 we must add 6 summands (S 05 , S 14 , S 23 , S 32 , S 41 , and S 50 ) and 4 carries from the P 4 column. We add these in stages 17, compressing from 6:3:2:2:3:1:1. Notice that we wait until stage 5 to add the last carry from column P 4  , and this means we expand (rather than compress) the number of signals (from 2 to 3) between stages 3 and 5. The maximum delay through the CSA array of Figure 2.29 is 6 adder delays. To this we must add the delay of the 4-bit (9 inputs) CPA (stage 7). There are 26 adders (6 half adders) plus the 4 adders in the CPA. FIGURE 2.29  A 6-bit Wallace-tree multiplier. The carry-save adder (CSA) requires 26 adders (cells 126, six are half adders). The final carry-propagate adder (CPA) consists of 4 adder cells (2730). The delay of the CSA is 6 adders. The delay of the CPA is 4 adders.

In a Dadda multiplier (Figure 2.30) we work backward from the final product [Dadda, 1965]. Each stage has a maximum of 2, 3, 4, 6, 9, 13, 19, . . . outputs (each successive stage is 3/2 times largerrounded down to an integer). Thus, for example, in Figure 2.28(d) we require 3 stages (with 3 adder delaysplus the delay of a 10-bit output CPA) for a 6-bit Dadda multiplier. There are 19 adders (4 half adders) in the CSA plus the 10 adders (2 half adders) in the CPA. A Dadda multiplier is usually faster and smaller than a Wallace-tree multiplier. FIGURE 2.30  The 6-bit Dadda multiplier. The carry-save adder (CSA) requires 20 adders (cells 120, four are half adders). The carry-propagate adder (CPA, cells 2130) is a ripple-carry adder (RCA). The CSA is smaller (20 versus 26 adders), faster (3 adder delays versus 6 adder delays), and more regular than the Wallace-tree CSA of Figure 2.29. The overall speed of this implementation is approximately the same as the Wallace-tree multiplier of Figure 2.29; however, the speed may be increased by substituting a faster CPA.

In general, the number of stages and thus delay (in units of an FA delayexcluding the CPA) for an n -bit tree-based multiplier using (3, 2) counters is

log 1.5   n  = log 10   n /log 10  1.5 = log 10   n /0.176.(2.64)

Figure 2.31(a) shows how the partial-product array is constructed in a conventional 4-bit multiplier. The FerrariStefanelli multiplier (Figure 2.31b) "nests" multipliersthe 2-bit submultipliers reduce the number of partial products [Ferrari and Stefanelli, 1969]. FIGURE 2.31  FerrariStefanelli multiplier. (a) A conventional 4-bit array multiplier using AND gates to calculate the summands with (2, 2) and (3, 2) counters to sum the partial products. (b) A 4-bit FerrariStefanelli multiplier using 2-bit submultipliers to construct the partial product array. (c) A circuit implementation for an inverting 2-bit submultiplier.

There are several issues in deciding between parallel multiplier architectures:

1. Since it is easier to fold triangles rather than trapezoids into squares, a Wallace-tree multiplier is more suited to full-custom layout, but is slightly larger, than a Dadda multiplierboth are less regular than an array multiplier. For cell-based ASICs, a Dadda multiplier is smaller than a Wallace-tree multiplier.
2. The overall multiplier speed does depend on the size and architecture of the final CPA, but this may be optimized independently of the CSA array. This means a Dadda multiplier is always at least as fast as the Wallace-tree version.
3. The low-order bits of any parallel multiplier settle first and can be added in the CPA before the remaining bits settle. This allows multiplication and the final addition to be overlapped in time.
4. Any of the parallel multiplier architectures may be pipelined. We may also use a variably pipelined approach that tailors the register locations to the size of the multiplier.
5. Using (4, 2), (5, 3), (7, 3), or (15, 4) counters increases the stage compression and permits the size of the stages to be tuned. Some ASIC cell libraries contain a (7, 3) countera 2-bit full-adder . A (15, 4) counter is a 3-bit full adder. There is a trade-off in using these counters between the speed and size of the logic cells and the delay as well as area of the interconnect.
6. Power dissipation is reduced by the tree-based structures. The simplified carry-save logic produces fewer signal transitions and the tree structures produce fewer glitches than a chain.
7. None of the multiplier structures we have discussed take into account the possibility of staggered arrival times for different bits of the multiplicand or the multiplier. Optimization then requires a logic-synthesis tool.