Pipelined Design for CPLD Improves System Performance
By Jimmy Gao (jimmy_gao@latticesemi.com), HDL Applications Specialist, Lattice Semiconductor Corp.
Pipelined design has been used extensively in high-performance, computationally intensive systems such as CPUs (Central Processing Units). Today's popular CPUs, such as Intel's Pentium processors, make full use of pipelining techniques in their instruction fetch and execute cycles to improve performance. High-performance DSP (Digital Signal Processing) systems also use pipelined design techniques in their building-block functions. Using basic computationally intensive building-block functions such as adders and multipliers, this article discusses pipelining concepts and compares the performance and logic utilization tradeoffs between combinatorial and pipelined designs for CPLDs (Complex Programmable Logic Devices).
Pipelining Design Concepts for CPLD
The advantage of pipelined design is increased throughput. Let's assume the same propagation delay, Tpd, for T1, T3 and T2. For the combinatorial design, the delay is 2*Tpd. For the pipelined design, the clock period is Tpd + Tco. Latency is the amount of time it takes to traverse the initial or longest path, and throughput is the amount of time required for each repeated operation. For the combinatorial design, the latency and the throughput are the same at 2*Tpd. For the pipelined design, the latency is 2*(Tpd+Tco) and the throughput is Tpd+Tco. If the CPLD's Tco is shorter than its Tpd, the pipelined design will always have better throughput than a functionally equivalent combinatorial design. A typical register-intensive CPLD such as the Lattice ispLSI 8840 has a Tpd of 8.5ns and a Tco of 6ns.
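The formulas above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python; the function names and the two-stage assumption are illustrative only, and the Tpd and Tco values are the ispLSI 8840 figures quoted above.

```python
def combinatorial_timing(tpd_ns, stages=2):
    """Combinatorial path: latency and time per result are both stages * Tpd."""
    delay = stages * tpd_ns
    return {"latency_ns": delay, "period_ns": delay}


def pipelined_timing(tpd_ns, tco_ns, stages=2):
    """Pipelined path: one result per clock period of Tpd + Tco."""
    period = tpd_ns + tco_ns
    return {"latency_ns": stages * period, "period_ns": period}


TPD_NS = 8.5  # Tpd quoted for the ispLSI 8840
TCO_NS = 6.0  # Tco quoted for the ispLSI 8840

comb = combinatorial_timing(TPD_NS)
pipe = pipelined_timing(TPD_NS, TCO_NS)

print("combinatorial:", comb)
print("pipelined:    ", pipe)
print("throughput gain:", comb["period_ns"] / pipe["period_ns"])
```

With these values the combinatorial path needs 17 ns per result, while the pipelined version delivers a result every 14.5 ns at the cost of a 29 ns latency. The gain disappears if Tco is not smaller than Tpd.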
The price of a pipelined design's performance is register utilization. For combinatorial designs with a simple data path, such as the example shown in this section, there is little or no increase in macrocell utilization for the pipelined design. As the combinatorial design gets more complicated, additional registers must be added to keep the intermediate computational results aligned within the same clock cycle. The combination of a register-rich architecture and predictable delays makes CPLDs an attractive alternative to FPGAs for implementing complex pipelined designs with much improved performance.
Pipelined Adder vs. Combinatorial Adder
Three levels of registers or register banks are inserted at the output of each logic stage of an n-bit combinatorial full adder to convert it to the n-bit pipelined full adder, as shown in Figure 3. Since the carry C-1 is used as an input for the first stage of logic as well as the second stage of logic, two levels of registers are used to pipeline the carry C-1 for the first two logic stages. Similarly, the generate function is registered one more time before being used as an input for the sum unit. The carry out function, Cout, is also registered twice so that it reaches the same pipeline level as the output of the sum unit.
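To make the register alignment concrete, here is a small cycle-level behavioral model in Python. It is only a sketch: the split into a propagate/generate stage, a carry stage, and a sum stage is an assumption made for illustration and is not taken from Figure 3, and the class and register names are hypothetical.

```python
class PipelinedAdder:
    """Behavioral model of an n-bit adder with three register levels.

    Assumed stage split (illustrative, not the article's schematic):
      stage 1: propagate p = a ^ b, generate g = a & b
      stage 2: ripple the carries through p and g, produce Cout
      stage 3: sum = p ^ carries
    """

    def __init__(self, width=16):
        self.width = width
        self.mask = (1 << width) - 1
        self.r1 = None  # (p, g, cin) after stage 1
        self.r2 = None  # (p, carries, cout) after stage 2
        self.r3 = None  # (sum, cout) after stage 3

    def clock(self, a, b, cin):
        """Apply one clock edge; all register banks update together."""
        # Stage 3 uses p (registered a second time) and the carries.
        n3 = None
        if self.r2 is not None:
            p, carries, cout = self.r2
            n3 = ((p ^ carries) & self.mask, cout)  # cout registered again

        # Stage 2 uses the registered p, g and carry-in.
        n2 = None
        if self.r1 is not None:
            p, g, c = self.r1
            carries = 0
            for i in range(self.width):
                carries |= c << i                      # carry into bit i
                c = ((g >> i) & 1) | (((p >> i) & 1) & c)
            n2 = (p, carries, c)                       # c is now Cout

        # Stage 1 uses the primary inputs.
        n1 = ((a ^ b) & self.mask, (a & b) & self.mask, cin & 1)

        self.r1, self.r2, self.r3 = n1, n2, n3
        return self.r3


adder = PipelinedAdder(16)
inputs = [(0xFFFF, 0x0001, 0), (0x1234, 0x4321, 1), (0, 0, 0), (0, 0, 0)]
outputs = [adder.clock(a, b, cin) for a, b, cin in inputs]
print(outputs)
```

In this model the first result appears on the third clock edge, and a new result follows on every edge after that. Note how p and Cout each pass through an extra register so that they arrive at the output in step with the sum.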
The Lattice ispLSI 8840, which has a total of 840 macrocells organized in 42 GLBs (Global Logic Blocks) and 312 I/O cells with register capability, is used to compare the implementation results of the 16-bit combinatorial full adder and the 16-bit pipelined full adder. For the 16-bit combinatorial full adder, the VHDL implementation consumes 34 macrocells in 5 GLBs. It is implemented in 3 GLB levels of logic with a maximum propagation delay of 45.6ns. In contrast, the 16-bit pipelined full adder VHDL implementation consumes 81 macrocells in 6 GLBs. It is implemented in 1 GLB level, which can operate at a clock period of 15.10ns. The latency of this implementation is three clock cycles.
Pipelined Multiplier vs. Combinatorial Multiplier
As shown in Figure 4, a combinatorial 4x4 multiplication can be decomposed into several additions of partial product vectors from sixteen 1x1 multipliers. Instead of directly registering each stage of the combinatorial 4x4 multiplication for pipelined conversion, 1x4 multipliers are used to generate all the partial product vectors. The resulting two-stage pipelined design has lower latency than one built from 1x1 multipliers. The 3-level pipelined adder, similar to the example shown in Figure 3, is used to implement each stage of the pipelined addition.
Similar to the 4x4 pipelined multiplier shown in Figure 4, a more complicated 6x10 pipelined multiplier is used to compare the performance of the pipelined and non-pipelined multipliers. As shown in Figure 5, the 6x10 pipelined multiplier uses six 10-bit multiplexors to implement the 1x10 multiplications: a0 x b(9,0), a1 x b(9,0), ..., a5 x b(9,0). Since ai is either 1 or 0, the 1x10 multiplication result is either b(9,0) or 0. This explains why the two inputs of each multiplexor are b(9,0) and 0.
The outputs of the six multiplexors are separated into three groups of two members each and added by three 3-level pipelined adders. Each group pairs multiplexor outputs that are three positions apart (3-hop alternate). For this example the groups are organized as follows: Group[a5, a2], Group[a4, a1], and Group[a3, a0]. Group[a5, a2] indicates that the first multiplexor output M (10-bit) and the fourth multiplexor output N (10-bit) are used for one pipelined addition O. Similarly, the others are added in the pipelined additions P and Q. The 3-hop alternate grouping is very efficient for removing extra partial product terms from the addition process. A generic equation used in the example of Group[a5, a2] is shown below.
The Lattice ispLSI 8840 is used here again to compare the implementation results of the 6x10 combinatorial multiplier and the 6x10 pipelined multiplier. For the 6x10 combinatorial multiplier, the VHDL implementation consumes 93 macrocells in 14 GLBs. It is implemented in 5 GLB levels with a maximum propagation delay of 73.5ns. In contrast, the 6x10 pipelined multiplier VHDL implementation consumes 360 macrocells in 22 GLBs. It is implemented in only 1 GLB level, which can operate at a clock period of 15.30ns, more than four times faster than its combinatorial counterpart. The latency of this design is nine clock cycles.
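The multiplexor-based partial products and the 3-hop grouping can be illustrated with a short behavioral model in Python. This is a sketch under stated assumptions: the shift weights (a 3-bit shift inside each group, then shifts of 2, 1 and 0 between groups) follow from ordinary binary weighting and are not copied from the article's equation, the function names are hypothetical, and the model ignores pipeline registers and timing.

```python
def mux_partial_product(a_bit, b):
    """1x10 multiply as a two-input multiplexor: select b when a_bit is 1, else 0."""
    return b if a_bit else 0


def multiply_6x10(a, b):
    """6x10 multiply built from six multiplexors and 3-hop grouped additions."""
    if not (0 <= a < 64 and 0 <= b < 1024):
        raise ValueError("a must fit in 6 bits and b in 10 bits")

    # pp[i] is the multiplexor output for a_i x b(9,0).
    pp = [mux_partial_product((a >> i) & 1, b) for i in range(6)]

    # Each group adds two outputs whose bit weights differ by 2**3.
    o = (pp[5] << 3) + pp[2]  # Group[a5, a2]
    p = (pp[4] << 3) + pp[1]  # Group[a4, a1]
    q = (pp[3] << 3) + pp[0]  # Group[a3, a0]

    # Restore each group's base weight (2**2, 2**1, 2**0) and combine.
    return (o << 2) + (p << 1) + q


print(multiply_6x10(63, 1023))  # largest operands, equals 63 * 1023
```

Because every group uses the same 3-bit offset between its two operands, the three first-level additions have identical structure, which is one plausible reading of why the grouping is described as efficient.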
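The reported figures for the adder and the multiplier can be put side by side with a little arithmetic. The Python sketch below uses only the numbers quoted in this article; the dictionary keys and the `summarize` helper are illustrative names.

```python
# Implementation results quoted in the article for the ispLSI 8840.
designs = {
    "16-bit adder": {
        "comb_delay_ns": 45.6, "clock_ns": 15.10, "latency_cycles": 3,
        "comb_macrocells": 34, "pipe_macrocells": 81,
    },
    "6x10 multiplier": {
        "comb_delay_ns": 73.5, "clock_ns": 15.30, "latency_cycles": 9,
        "comb_macrocells": 93, "pipe_macrocells": 360,
    },
}


def summarize(d):
    """Derive throughput gain, total pipeline latency and macrocell cost ratio."""
    return {
        "throughput_gain": d["comb_delay_ns"] / d["clock_ns"],
        "pipeline_latency_ns": d["latency_cycles"] * d["clock_ns"],
        "macrocell_ratio": d["pipe_macrocells"] / d["comb_macrocells"],
    }


summary = {name: summarize(d) for name, d in designs.items()}
for name, s in summary.items():
    print(f"{name}: gain {s['throughput_gain']:.2f}x, "
          f"latency {s['pipeline_latency_ns']:.1f} ns, "
          f"macrocells {s['macrocell_ratio']:.2f}x")
```

The throughput gain is roughly 3.0x for the adder and 4.8x for the multiplier, paid for with about 2.4x and 3.9x the macrocells. The total latency of the pipelined multiplier (nine cycles of 15.30ns) is longer than the combinatorial delay, so the benefit applies to streams of operations rather than to a single result.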
Summary
Further information on the ispLSI 8840 device used in these pipelined designs can be found at http://www.latticesemi.com
Copyright © 2003 ChipCenter-QuestLink