|
by Ingo Cyliax
Start ý Implementing
ý Optimization ý Signed
and Unsigned Multipliers ý Sources and
PDF
IMPLEMENTING
There are three approaches to implementing
hardware-based multiplier circuits. You can either implement them
in a parallel circuit, look-up tables (LUTs), or iterative circuits.
Parallel and look-up tables are fast but take a lot of logic. Iterative
multipliers are compact, but they have to iterate, usually over the
n bits. However, there are some tricks to speed up this process.
Look-up tables are obvious, you just
have a large ROM that you index with the multiplier and multiplicand
to arrive at the product (see Figure 1). The multiplier is one of
the arguments (usually the right or bottom), and the multiplicand
is the other argument.
|

|
| Figure 1ýA look-up table-based
multiplier is perhaps the easiest multiplier. You simply look
up the product based on the multiplier and multiplicand values. |
Look-up table-based multipliers are only
practical with small bit sizes because the ROM grows exponentially
(22n). Because FPGAs use LUTs for their function generator,
it makes sense to use them for small bit sizes. A single 2 ý 2 multiplier
slice will fit into a 4-input LUT nicely. So, a complete 2 ý 2 multiplier
will take four LUTs (the output of a multiplier is 2n-bits wide).
A 2 ý 2 multiplier by itself is not useful,
unless you need to build a fast circuit that can operate with small
word sizes. An example of this is a correlator. For most DSP functions,
there is an interest in multiplying large word sizes together. A tree-based
multiplier can be used for this (see Figure 2).
 |
| Figure 2ýA tree-based multiplier
uses small multipliers to form partial products, which are then
added to form the product. Multiplier trees come in different
configurations. This example shows a tree that is optimized
for FPGA implementation. |
You can use the smaller look-up table-based
multipliers to compute partial products and then sum them to calculate
the final product. These multipliers are fast because they operate
on the whole word in parallel.
The size of a parallel adder can be computed.
There are (n/m)2 m-bit multipliers and (2n/m) ý
1 n-bit adders. n is the word size of the multiplicand and
the multiplier and m is the size of the first level multiplier.
When parallel-based adders get big, they
start taking up area and the combinatorial delays get worse. Multiplier
structures are regular and donýt require feedback paths, making them
a natural choice for pipelining.
When you stick a register at the output
of every node, the combinatorial delay before the register becomes
small, and you can clock the whole multiplier at fast rates. However,
this comes at a cost. Although a pipelined multiplier (see Figure
3) will provide an answer at every clock tick, the number of clock
ticks it takes for a result to come out at the output will introduce
latency. In many applications, the latency does not hurt, but you
have to be aware of this in feedback-based applications like control
loops.
 |
| Figure 3ýBy adding a register
to the output of each node, you can speed up the multiplier
tree clock rate, but at the cost of introducing latency. The
result will show up several clock ticks later on the outputs.
However, the total throughput is good. |
Finally, parallel adders lend themselves
to constant optimization. When one of the arguments is a constant,
some of the partial products become zero. Remember that constants
in digital logic are absorbed when you minimize the logic, so you
would arrive at compact circuits. An extreme case is when you multiply
by constants that are multiples of two. After minimization, you simply
end up with a shifter.
Building parallel multipliers is tedious,
especially if you want different word size requirements. For example,
if you need a 6 ý 8 multiplier, you might waste some logic if you
use an 8 ý 8 multiplier. To make it easier to multiply in your FPGA
designs, vendors provide libraries of multipliers. Also, some vendors
provide core generators. These are actually software-based synthesis
tools. You select the type of circuit module youýd like to include
in your design.
Photo 1 shows a screen shot of CoreGen,
one of these core-generating tools. You can select from several math
functions like divide and integrate. I selected multipliers and got
a submenu for parallel multipliers with different properties (area
or speed optimized).
| Photo
1ýThis shows the core browser in CoreGen, a Java-based FPGA
core generation tool. You select the core module you need by
function from a list of modules in its library. There are also
datasheets attached to the modules that describe the module,
its interface, and implementation sizes and speed. |
After selecting which kind of multiplier
you need, a dialog screen pops up where you can parameterize the core
module (see Photo 2). In this case, you can select whether you want
it to implement a signed or unsigned adder and how wide the multiplier
inputs are. The tool then generates an optimized multiplier module
that can be linked into your design.
| Photo
2ýAfter you have selected the optimized parallel multiplier,
you have to parameterize it. You can select the word sizes and
whether you want to perform signed or unsigned multiplication. |
High-end VHDL and Verilog synthesizers
also know how to infer and implement multipliers. As one of the design
constraints, you can specify whether you want these multipliers to
be optimized for either area or speed.
PREVIOUS
NEXT
Circuit Cellar provides up-to-date information for engineers. Visit
www.circuitcellar.com for
more information and additional articles.
For subscription information, call (860) 875-2199, subscribe@circuitcellar.com
or subscribe online.
ýCircuit Cellar, the Magazine for Computer Applications. Posted with
permission. |