
Figure 2.27 shows a symmetric 6-bit array **multiplier** (an n-bit multiplier multiplies two n-bit numbers; we shall use n-bit by m-bit multiplier if the lengths are different). Adders a0–f0 may be eliminated, which then eliminates adders a1–a6, leaving an asymmetric CSA array of 30 (5 × 6) adders (including one half adder). An n-bit array multiplier has a delay proportional to n plus the delay of the CPA (adders b6–f6 in Figure 2.27). There are two items we can attack to improve the performance of a multiplier: the number of partial products and the addition of the partial products.

Suppose we wish to multiply 15 (the **multiplicand**) by 19 (the **multiplier**) mentally. It is easier to calculate 15 × 20 and subtract 15. In effect we complete the multiplication as 15 × (20 − 1) and we could write this as 15 × 21̄, with the overbar representing a minus sign. Now suppose we wish to multiply an 8-bit binary number, A, by B = 00010111 (decimal 16 + 4 + 2 + 1 = 23). It is easier to multiply A by the **canonical signed-digit vector** (**CSD vector**) D = 00101̄001̄ (decimal 32 − 8 − 1 = 23) since this requires only three add or subtract operations (and a subtraction is as easy as an addition). We say B has a **weight** of 4 and D has a weight of 3. By using D instead of B we have reduced the number of partial products by 1 (= 4 − 3).

We can **recode** (or encode) any binary number, B, as a CSD vector, D, as follows (canonical means there is only one CSD vector for any number):

D_i = B_i + C_i − 2C_{i+1} ,  (2.58)

where C_{i+1} is the carry from the sum of B_{i+1} + B_i + C_i (we start with C_0 = 0).

As another example, if B = 011 (B_2 = 0, B_1 = 1, B_0 = 1; decimal 3), then, using Eq. 2.58,

D_0 = B_0 + C_0 − 2C_1 = 1 + 0 − 2 = 1̄ ,

D_1 = B_1 + C_1 − 2C_2 = 1 + 1 − 2 = 0 ,

D_2 = B_2 + C_2 − 2C_3 = 0 + 1 − 0 = 1 ,  (2.59)

so that D = 101̄ (decimal 4 − 1 = 3). CSD vectors are useful to represent fixed coefficients in digital filters, for example.
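The recoding of Eq. 2.58 is easy to automate. The following is an illustrative Python sketch (the function name `csd_recode` and the LSB-first bit ordering are our own choices, not from the text):

```python
def csd_recode(bits):
    """Recode a bit list (LSB first) into canonical signed digits using
    D_i = B_i + C_i - 2*C_{i+1} (Eq. 2.58), starting with C_0 = 0, where
    C_{i+1} is the carry out of the sum B_{i+1} + B_i + C_i."""
    b = bits + [0, 0]               # so B_{i+1} exists at the top end
    c, d = 0, []                    # C_0 = 0
    for i in range(len(bits) + 1):
        c_next = (b[i + 1] + b[i] + c) // 2   # carry out of the 3-bit sum
        d.append(b[i] + c - 2 * c_next)       # digit in {-1, 0, +1}
        c = c_next
    return d

# B = 011 (decimal 3, LSB first) recodes to D = 101-bar, i.e. 4 - 1 = 3
print(csd_recode([1, 1, 0]))        # -> [-1, 0, 1, 0]
```

Recoding B = 00010111 (weight 4) with this sketch yields a signed-digit vector of weight 3, matching the example above.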

We can recode using a **radix** other than 2. Suppose B is an (n + 1)-digit two's complement number,

B = B_0 + B_1 2 + B_2 2^2 + . . . + B_i 2^i + . . . + B_{n−1} 2^{n−1} − B_n 2^n .  (2.60)
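As a quick check of Eq. 2.60, a few lines of Python (an illustrative helper of our own; bits are given MSB first) evaluate a two's complement bit list:

```python
def twos_complement_value(bits):
    """Evaluate Eq. 2.60 for bits B_n ... B_0 (MSB first):
    B = B_0 + B_1*2 + ... + B_{n-1}*2^(n-1) - B_n*2^n."""
    n = len(bits) - 1
    weighted = sum(b * 2 ** (n - 1 - i) for i, b in enumerate(bits[1:]))
    return weighted - bits[0] * 2 ** n      # sign bit carries weight -2^n

print(twos_complement_value([1, 0, 1, 0, 0, 1]))   # 101001 -> -23
```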

We can rewrite the expression for B using the following sleight-of-hand:

2B − B = B = −B_0 + (B_0 − B_1)2 + . . . + (B_{i−1} − B_i)2^i + . . . + (B_{n−1} − B_n)2^n

= (−2B_1 + B_0)2^0 + (−2B_3 + B_2 + B_1)2^2 + . . .

+ (−2B_i + B_{i−1} + B_{i−2})2^{i−1} + (−2B_{i+2} + B_{i+1} + B_i)2^{i+1} + . . .

+ (−2B_n + B_{n−1} + B_{n−2})2^{n−1} .  (2.61)

This is very useful. Consider B = 101001 (decimal 9 − 32 = −23, n = 5),

B = 101001 = (−2B_1 + B_0)2^0 + (−2B_3 + B_2 + B_1)2^2 + (−2B_5 + B_4 + B_3)2^4

= ((−2 × 0) + 1)2^0 + ((−2 × 1) + 0 + 0)2^2 + ((−2 × 1) + 0 + 1)2^4

= 1 − 8 − 16 = −23 .  (2.62)

Equation 2.61 tells us how to encode B as a radix-4 signed-digit vector, E = 1̄2̄1 (decimal −16 − 8 + 1 = −23). To multiply by B encoded as E we only have to perform a multiplication by 2 (a shift) and three add/subtract operations.

Using Eq. 2.61 we can encode any number by taking groups of three bits at a time and calculating

E_j = −2B_i + B_{i−1} + B_{i−2} , E_{j+1} = −2B_{i+2} + B_{i+1} + B_i , . . . ,  (2.63)

where each 3-bit group overlaps the next by one bit. We pad B with a zero, B_n . . . B_1 B_0 0, to match the first term in Eq. 2.61. If B has an odd number of bits, then we extend the sign: B_n B_n . . . B_1 B_0 0. For example, B = 01011 (eleven) encodes to E = 11̄1̄ (16 − 4 − 1 = 11); and B = 101 is E = 1̄1. This is called **Booth encoding**; it reduces the number of partial products by a factor of two and thus considerably reduces the area, as well as increases the speed, of our multiplier [Booth, 1951].
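The padding, sign extension, and overlapping 3-bit groups of Eq. 2.63 can be sketched in a few lines of Python (our own illustrative code; digit lists are returned least significant digit first):

```python
def booth_radix4(bits):
    """Radix-4 Booth encoding of a two's-complement bit list (MSB first):
    sign-extend to an even number of bits, pad a zero on the right, then
    form overlapping 3-bit groups E_j = -2*B_i + B_(i-1) + B_(i-2)."""
    b = bits[:]
    if len(b) % 2:                 # odd number of bits: extend the sign
        b = [b[0]] + b
    b = b + [0]                    # pad with B_(-1) = 0 on the right
    digits = []                    # radix-4 digits, least significant first
    for i in range(len(b) - 3, -1, -2):
        digits.append(-2 * b[i] + b[i + 1] + b[i + 2])
    return digits

print(booth_radix4([0, 1, 0, 1, 1]))   # B = 01011 (eleven) -> [-1, -1, 1]
```

Each digit lies in {−2, −1, 0, 1, 2}, so each partial product is one of 0, ±A, ±2A (a shift and a possible negation), and there are half as many of them.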

Next we turn our attention to improving the speed of addition in the CSA array. Figure 2.28(a) shows a section of the 6-bit array multiplier from Figure 2.27. We can collapse the chain of adders a0–f5 (5 adder delays) to the **Wallace tree** consisting of adders 5.1–5.4 (4 adder delays) shown in Figure 2.28(b).

Figure 2.28(c) pictorially
represents multiplication as a sort of golf course. Each link corresponds
to an adder. The holes or dots are the outputs of one stage (and the inputs
of the next). At each stage we have the following three choices: (1) sum
three outputs using a full adder (denoted by a box enclosing three dots);
(2) sum two outputs using a half adder (a box with two dots); (3) pass
the outputs directly to the next stage. The two outputs of an adder are
joined by a diagonal line (full adders use black dots, half adders white
dots). The object of the game is to choose (1), (2), or (3) at each stage
to maximize the performance of the multiplier. In **tree-based multipliers** there are two ways to do this: working forward and working backward.

In a **Wallace-tree multiplier** we work forward from the multiplier inputs, compressing the number of signals to be added at each stage [Wallace, 1960]. We can view an FA as a **3:2 compressor** or **(3, 2) counter**: it counts the number of '1's on the inputs. Thus, for example, an input of '101' (two '1's) results in an output '10' (2). A half adder is a **(2, 2) counter**. To form P_5 in Figure 2.29 we must add 6 summands (S_05, S_14, S_23, S_32, S_41, and S_50) and 4 carries from the P_4 column. We add these in stages 1–7, compressing from 6:3:2:2:3:1:1. Notice that we wait until stage 5 to add the last carry from column P_4, and this means we expand (rather than compress) the number of signals (from 2 to 3) between stages 3 and 5. The maximum delay through the CSA array of Figure 2.29 is 6 adder delays. To this we must add the delay of the 4-bit (9 inputs) CPA (stage 7). There are 26 adders (6 half adders) plus the 4 adders in the CPA.
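The carry-save idea behind these trees can be made concrete with a short Python sketch (illustrative only; this word-level model compresses whole summand rows rather than the single columns of Figure 2.29): a full adder applied bitwise acts as a (3, 2) counter, turning three rows into two while preserving their sum.

```python
def csa(a, b, c):
    """One layer of full adders used as (3, 2) counters on whole words:
    returns (sum_row, carry_row) with a + b + c == sum_row + carry_row."""
    sum_row = a ^ b ^ c
    carry_row = ((a & b) | (a & c) | (b & c)) << 1   # carries move left
    return sum_row, carry_row

def tree_multiply(x, y, n=6):
    """Sum the n partial products of x*y through CSA stages (3 rows -> 2
    per group, extras passed through), finishing with one CPA."""
    rows = [(x << i) if (y >> i) & 1 else 0 for i in range(n)]
    stages = 0
    while len(rows) > 2:
        nxt = []
        for j in range(0, len(rows) - 2, 3):
            nxt.extend(csa(*rows[j:j + 3]))
        nxt.extend(rows[len(rows) - (len(rows) % 3):])  # pass-throughs
        rows, stages = nxt, stages + 1
    return rows[0] + rows[1], stages

print(tree_multiply(27, 45))   # (1215, 3): 27*45 via 3 CSA stages + CPA
```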

In a **Dadda multiplier** (Figure 2.30) we work backward from the final product [Dadda, 1965]. Each stage has a maximum of 2, 3, 4, 6, 9, 13, 19, . . . outputs (each successive stage is 3/2 times larger, rounded down to an integer). Thus, for example, in Figure 2.28(d) we require 3 stages (with 3 adder delays, plus the delay of a 10-bit output CPA) for a 6-bit Dadda multiplier. There are 19 adders (4 half adders) in the CSA plus the 10 adders (2 half adders) in the CPA. A Dadda multiplier is usually faster and smaller than a Wallace-tree multiplier.

In general, the number of stages (and thus the delay, in units of an FA delay, excluding the CPA) for an n-bit tree-based multiplier using (3, 2) counters is

log_1.5 n = log_10 n / log_10 1.5 = log_10 n / 0.176 .  (2.64)
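The backward-working limit sequence and this stage count are easy to tabulate (a Python sketch; the names `dadda_heights` and `csa_stages` are our own):

```python
import math

def dadda_heights(limit):
    """Maximum outputs allowed per Dadda stage, working backward from the
    final two rows: each value is 3/2 times the last, rounded down."""
    h = [2]
    while h[-1] < limit:
        h.append(math.floor(1.5 * h[-1]))
    return h

def csa_stages(n):
    """Stages of (3, 2) counters needed to compress n summands to 2:
    the number of sequence values below n (roughly log base 1.5 of n,
    in the spirit of Eq. 2.64)."""
    return sum(1 for h in dadda_heights(n) if h < n)

print(dadda_heights(19))   # [2, 3, 4, 6, 9, 13, 19]
print(csa_stages(6))       # 3 stages for a 6-bit Dadda multiplier
```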

Figure 2.31(a) shows how the partial-product array is constructed in a conventional 4-bit multiplier. The **Ferrari–Stefanelli multiplier** (Figure 2.31b) "nests" multipliers: the 2-bit submultipliers reduce the number of partial products [Ferrari and Stefanelli, 1969].

There are several issues in deciding between parallel multiplier architectures:

- Since it is easier to fold triangles rather than trapezoids into squares, a Wallace-tree multiplier is more suited to full-custom layout, but is slightly larger, than a Dadda multiplier; both are less regular than an array multiplier. For cell-based ASICs, a Dadda multiplier is smaller than a Wallace-tree multiplier.
- The overall multiplier speed does depend on the size and architecture of the final CPA, but this may be optimized independently of the CSA array. This means a Dadda multiplier is always at least as fast as the Wallace-tree version.
- The low-order bits of any parallel multiplier settle first and can be added in the CPA before the remaining bits settle. This allows multiplication and the final addition to be overlapped in time.
- Any of the parallel multiplier architectures may be pipelined. We may also use a **variably pipelined** approach that tailors the register locations to the size of the multiplier.
- Using (4, 2), (5, 3), (7, 3), or (15, 4) counters increases the stage compression and permits the size of the stages to be tuned. Some ASIC cell libraries contain a (7, 3) counter, a **2-bit full adder**. A (15, 4) counter is a 3-bit full adder. There is a trade-off in using these counters between the speed and size of the logic cells and the delay, as well as the area, of the interconnect.
- Power dissipation is reduced by the tree-based structures. The simplified carry-save logic produces fewer signal transitions and the tree structures produce fewer glitches than a chain.
- None of the multiplier structures we have discussed take into account the possibility of staggered arrival times for different bits of the multiplicand or the multiplier. Optimization then requires a logic-synthesis tool.
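An (m, k) counter, as used in the points above, can be modeled as a population count (an illustrative Python sketch, not a library cell):

```python
def mk_counter(inputs, k):
    """An (m, k) counter: output the number of '1's on its m inputs as a
    k-bit binary word (MSB first); k bits suffice when 2**k > m."""
    ones = sum(inputs)
    assert ones < 2 ** k, "too many '1's for the output width"
    return [(ones >> i) & 1 for i in range(k - 1, -1, -1)]

# A (7, 3) counter: five '1's on seven inputs -> '101'
print(mk_counter([1, 1, 1, 1, 1, 0, 0], 3))   # [1, 0, 1]
```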