ChipCenter Questlink

Pipelined Design for CPLD Improves System Performance

By Jimmy Gao (jimmy_gao@latticesemi.com), HDL Applications Specialist, Lattice Semiconductor Corp.

Pipelined design is used extensively in high-performance, computationally intensive systems such as central processing units (CPUs). Today's popular CPUs, such as Intel's Pentium processors, make full use of pipelining techniques in their instruction fetch and execute cycles to improve performance. High-performance digital signal processing (DSP) systems also use pipelined design techniques in their building-block functions. Using basic computationally intensive building-block functions such as adders and multipliers, this article discusses pipelining concepts and compares the performance and logic-utilization tradeoffs between combinatorial and pipelined designs for complex programmable logic devices (CPLDs).

Pipelining Design Concepts for CPLD
Pipelining a design means inserting registers at every stage of the circuit. A K-stage pipeline is an acyclic circuit with exactly K registers (one register for every stage of the path) from an input to an output. Figure 1 shows the migration from a combinatorial design to a pipelined design. The combinatorial design contains two stages. The delay of the first stage is the larger of T1 and T3, and the delay of the second stage is T2. To obtain one computation result from the combinatorial design, one has to wait for a propagation delay of max(T1, T3) + T2. After registers are placed at every stage of the design from inputs to output, the first register stage of the pipelined design has a total delay of max(T1, T3) plus the Tco (clock-to-output time) of the register. The second register stage similarly has a delay of T2 plus Tco. The overall clock period of the pipelined design is therefore max(max(T1, T3) + Tco, T2 + Tco). The pipelined design takes two clock cycles to produce the very first computation result and one clock cycle for each subsequent result. The number of clock cycles needed to obtain the first result is called the latency of the pipelined design. For CPLDs, the delays through the device, such as T1, T2 and T3, are long relative to the Tco of the device, and the register setup time Tsu is much shorter than the delay through the device. These basic assumptions about the hardware timing must hold for the pipelined design to show a performance improvement over a functionally equivalent combinatorial design.

The advantage of pipelined design is increased throughput. Let's assume the same propagation delay, Tpd, for T1, T2 and T3. For the combinatorial design, the delay is 2*Tpd. For the pipelined design, the clock period is Tpd + Tco. Latency, as described above, is the time taken to produce the first result along the longest path; throughput is expressed here as the time required for each repeated operation. For the combinatorial design, the latency and the throughput are the same, 2*Tpd. For the pipelined design, by comparison, the latency is 2*(Tpd + Tco) and the throughput is Tpd + Tco. As long as the CPLD hardware timing gives Tco an advantage over Tpd (that is, Tco is shorter than Tpd), the pipelined design will always have better throughput than a functionally equivalent combinatorial design. A typical register-intensive CPLD such as the Lattice ispLSI 8840 has a Tpd of 8.5ns and a Tco of 6ns.
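The timing relationships above can be checked with a few lines of arithmetic. The following Python sketch is only an illustration: the function names are hypothetical, and, like the text, it ignores the register setup time Tsu. It evaluates the two-stage example with the quoted ispLSI 8840 figures (Tpd = 8.5ns, Tco = 6ns).

```python
def combinatorial_delay(t1, t2, t3):
    """Propagation delay of the two-stage combinatorial design: max(T1, T3) + T2."""
    return max(t1, t3) + t2


def pipelined_clock_period(t1, t2, t3, tco):
    """Clock period of the pipelined design: the slower of the two register stages."""
    return max(max(t1, t3) + tco, t2 + tco)


def pipelined_latency(t1, t2, t3, tco, stages=2):
    """Time until the first result appears: one clock period per pipeline stage."""
    return stages * pipelined_clock_period(t1, t2, t3, tco)


# Figures quoted in the text for the ispLSI 8840, in nanoseconds.
TPD = 8.5
TCO = 6.0

if __name__ == "__main__":
    print("combinatorial delay:", combinatorial_delay(TPD, TPD, TPD), "ns")
    print("pipelined clock period:", pipelined_clock_period(TPD, TPD, TPD, TCO), "ns")
    print("pipelined latency:", pipelined_latency(TPD, TPD, TPD, TCO), "ns")
```

With these numbers the pipelined design delivers a new result every 14.5ns instead of every 17ns, at the cost of a 29ns wait for the first result.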

The performance and utilization tradeoff for a pipelined design is register utilization. For combinatorial designs with a simple data path, such as the example shown in this section, pipelining causes little or no increase in macrocell utilization. As the combinatorial design gets more complicated, additional registers must be added to keep the intermediate computational results aligned to the same clock cycle. The combination of a register-rich architecture and predictable delays makes CPLDs an attractive alternative to FPGAs for implementing complex pipelined designs with much improved performance.

Figure 1. Combinatorial Design to Pipelined Design Conversion

Pipelined Adder vs. Combinatorial Adder
Pipelining is the key to increased throughput and thus increased performance. Take the example of an n-bit full adder: the addition function is implemented in three stages, as shown in Figure 2. First, the generate and propagate functions depend on the adder's inputs. Second, the look-ahead carries depend on the generate and propagate functions. Third, the sum function depends on both the propagate/generate functions and the look-ahead carries.

Figure 2. n-bit full adder equations

Three levels of registers, or register banks, are inserted at the output of each logic stage of an n-bit combinatorial full adder to convert it to the n-bit pipelined full adder shown in Figure 3. Since the carry C-1 is an input to both the first and the second logic stage, two levels of registers are used to pipeline C-1 through the first two logic stages. Similarly, the generate function is registered one more time before being used as an input to the sum unit. The carry-out function, Cout, is also registered twice so that it has the same pipeline depth as the output of the sum unit.
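The three-stage structure can be modeled behaviorally in software. The Python sketch below is not the article's VHDL implementation; the function names are hypothetical, and it assumes generate = A and B, propagate = A xor B, so that the sum stage needs only the propagate vector and the carries. Each loop iteration stands for one clock edge on which all three register banks update together.

```python
def stage1_generate_propagate(a, b, cin):
    """Stage 1: generate and propagate vectors from the adder inputs."""
    return (a & b, a ^ b, cin)


def stage2_lookahead_carries(g, p, cin, n):
    """Stage 2: carry out of every bit position.

    The loop evaluates c[j] = g[j] or (p[j] and c[j-1]), which yields the same
    values as the fully expanded look-ahead equations.
    """
    carries = 0
    carry = cin
    for j in range(n):
        carry = ((g >> j) & 1) | (((p >> j) & 1) & carry)
        carries |= carry << j
    return (p, carries, cin)


def stage3_sum(p, carries, cin, n):
    """Stage 3: s[k] = p[k] xor c[k-1], with the carry-in used for bit 0."""
    mask = (1 << n) - 1
    carry_into_each_bit = ((carries << 1) | cin) & mask
    cout = (carries >> (n - 1)) & 1
    return (p ^ carry_into_each_bit, cout)


def run_pipelined_adder(operands, n=16):
    """Feed one (a, b, cin) tuple per clock; return the output register per clock."""
    r1 = r2 = r3 = None  # the three register banks, empty at start
    outputs = []
    # Two extra idle clocks flush the last operands through the pipeline.
    for item in list(operands) + [None, None]:
        next_r3 = stage3_sum(*r2, n) if r2 is not None else None
        next_r2 = stage2_lookahead_carries(*r1, n) if r1 is not None else None
        next_r1 = stage1_generate_propagate(*item) if item is not None else None
        r1, r2, r3 = next_r1, next_r2, next_r3  # all registers clock together
        outputs.append(r3)
    return outputs


if __name__ == "__main__":
    for clock, out in enumerate(run_pipelined_adder([(1, 2, 0), (0xFFFF, 1, 0)]), 1):
        print("clock", clock, "->", out)
```

The first result appears on the third clock, and one new result follows on every clock after that, which is the latency and throughput behavior described above.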

The Lattice ispLSI 8840, which has a total of 840 macrocells organized in 42 GLBs (Global Logic Blocks) and 312 I/O cells with register capability, is used to compare the implementation results of the 16-bit combinatorial full adder and the 16-bit pipelined full adder. The VHDL implementation of the 16-bit combinatorial full adder consumes 34 macrocells in 5 GLBs. It is implemented in 3 GLB levels of logic, with a maximum propagation delay of 45.6ns. In contrast, the VHDL implementation of the 16-bit pipelined full adder consumes 81 macrocells in 6 GLBs. It is implemented in 1 GLB level and can operate at a clock period of 15.10ns. The latency of this implementation is three clock cycles.

Figure 3. n-bit Combinatorial Full Adder vs. n-bit Pipelining Full Adder

Pipelined Multiplier vs. Combinatorial Multiplier
For the multiplier, a 4x4 example is shown first to illustrate the basic concept of multiplication with partial products. A more complicated 6x10 multiplier is then used to compare the performance of a combinatorial and a pipelined multiplier implementation.

As shown in Figure 4, a combinatorial 4x4 multiplication can be broken down into several additions of partial product vectors from sixteen 1x1 multipliers. Instead of directly registering each stage of the combinatorial 4x4 multiplication for the pipelined conversion, 1x4 multipliers are used to generate all the partial product vectors. The resulting two-stage pipelined design has lower latency than one built from the 1x1 multipliers. The 3-level pipelined adder, similar to the example shown in Figure 3, is used to implement each stage of the pipelined addition.

Similar to the 4x4 pipelined multiplier shown in Figure 4, a more complicated 6x10 pipelined multiplier is used to compare the performance of the pipelined and non-pipelined multipliers. As shown in Figure 5, the 6x10 pipelined multiplier uses six 10-bit multiplexors to implement the 1x10 multiplications: a0 x b(9,0), a1 x b(9,0), ..., a5 x b(9,0). Since ai is either 1 or 0, the result of each 1x10 multiplication is either b(9,0) or 0. This is why the two inputs of each multiplexor are b(9,0) and 0.
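The multiplexor view of the 1x10 multiplication can be sketched in Python as follows. This is an illustrative behavioral model, not the article's VHDL; the function names are hypothetical. Each bit of a selects either b or 0, and the selected value is weighted by that bit's position before the partial products are summed.

```python
def mux_partial_product(a_bit, b):
    """Two-input multiplexor: select b when a_bit is 1, otherwise 0."""
    return b if a_bit else 0


def multiply_by_partial_products(a, b, a_width=6):
    """Sum the partial products a_i x b, each shifted left by its bit position i."""
    total = 0
    for i in range(a_width):
        a_bit = (a >> i) & 1
        total += mux_partial_product(a_bit, b) << i
    return total


if __name__ == "__main__":
    # 6-bit a and 10-bit b, as in the 6x10 multiplier.
    print(multiply_by_partial_products(0b101101, 0b1100110011))
    print(0b101101 * 0b1100110011)
```

Both printed values are the same, since summing the shifted multiplexor outputs is exactly the shift-and-add definition of binary multiplication.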

The outputs of the six multiplexors are divided into three groups of two, and each group is summed by one of three 3-level pipelined adders. The two multiplexor outputs in each group are three hops apart (3-hop alternate). For this example the groups are organized as follows: Group[a5, a2], Group[a4, a1], and Group[a3, a0]. Group[a5, a2] indicates that the first multiplexor output M (10-bit) and the fourth multiplexor output N (10-bit) are used for one pipelined addition, O. Similarly, the other groups are summed by the pipelined additions P and Q. The 3-hop alternate grouping is very efficient at removing extra partial product terms from the addition process. The generic equations for the Group[a5, a2] example are shown below.


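\[ G(j,0) = \{000,\ M(i,0)\} \ \mathrm{and}\ \{N(i,0),\ 000\} \]

\[ P(j,0) = \{000,\ M(i,0)\} \ \mathrm{xor}\ \{N(i,0),\ 000\} \qquad (0 \le i \le 9,\ 0 \le j \le 12) \]

\[ C_j = G_j \ \mathrm{or}\ G_{j-1}P_j \ \mathrm{or}\ G_{j-2}P_{j-1}P_j \ \mathrm{or}\ \cdots \ \mathrm{or}\ G_0P_1P_2\cdots P_j \qquad (0 \le j \le 12) \]

\[ S_k = P_k \ \mathrm{xor}\ C_{k-1} \qquad (0 \le k \le 13) \]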

The three most significant bits of {000, M} and the three least significant bits of {N, 000} must all be zero because M and N are three hops apart. Therefore G0, G1, G2 and G10, G11, G12 must all be zero, because G is the AND of these two vectors. The carry computation can then be simplified because some of the generate terms are always zero, and as the carry computation is simplified, the sum computation can be simplified as well. Similarly, the inputs for the pipelined additions P and Q are also 3-hop alternate. The inputs for the pipelined additions T and S are 1-hop alternate and 2-hop alternate, respectively. Three levels of register banks are inserted between Q and S to match the number of pipeline stages of T, since the adder for T is a 3-level pipelined adder.
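The 3-hop grouping and the zero generate bits can be checked with a small behavioral model. The Python sketch below makes several assumptions that are not confirmed by the text: the names are hypothetical, the lower-weight partial product of each group plays the role of M and the higher-weight one the role of N in the equations above, and the final combination of the three group sums is written as a single expression rather than as the article's T and S adders, whose exact wiring is only visible in Figure 5.

```python
A_WIDTH = 6   # bits in operand a
HOP = 3       # distance between the two members of each group
ZERO_MASK = 0b1110000000111  # bits 0-2 and 10-12 of the 13-bit generate vector


def partial_products(a, b):
    """Multiplexor outputs: b where the corresponding bit of a is 1, else 0."""
    return [b if (a >> i) & 1 else 0 for i in range(A_WIDTH)]


def group_generate_propagate(m, n):
    """G and P for one group: {000, M} combined with {N, 000}."""
    x = m            # {000, M}: M zero-extended at the top
    y = n << HOP     # {N, 000}: N shifted up by three positions
    return x & y, x ^ y


def multiply_3hop(a, b):
    """6x10 product formed from three 3-hop group sums."""
    pp = partial_products(a, b)
    o = pp[2] + (pp[5] << HOP)   # Group[a5, a2]
    p = pp[1] + (pp[4] << HOP)   # Group[a4, a1]
    q = pp[0] + (pp[3] << HOP)   # Group[a3, a0]
    # Assumed final combination: weight each group sum by its lowest bit position.
    return q + (p << 1) + (o << 2)


if __name__ == "__main__":
    a, b = 0b111111, 0b1111111111
    pp = partial_products(a, b)
    g, _ = group_generate_propagate(pp[2], pp[5])
    print(format(g, "013b"))            # low and high three bits are zero
    print(multiply_3hop(a, b) == a * b)
```

Because the generate vector can only be nonzero in bits 3 through 9, the carry terms involving G0, G1, G2 and G10, G11, G12 drop out, which is the simplification the text describes.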

Figure 4. 4-bit Combinatorial Multiplier vs. 4-bit Pipelining Multiplier

Figure 5. 6x10 Pipelining Multiplier

The Lattice ispLSI 8840 is used here again to compare the implementation results of the 6x10 combinatorial multiplier and the 6x10 pipelined multiplier. The VHDL implementation of the 6x10 combinatorial multiplier consumes 93 macrocells in 14 GLBs. It is implemented in 5 GLB levels, with a maximum propagation delay of 73.5ns. In contrast, the VHDL implementation of the 6x10 pipelined multiplier consumes 360 macrocells in 22 GLBs. It is implemented in only 1 GLB level and can operate at a clock period of 15.30ns, more than four times faster than its combinatorial counterpart. The latency of this design is nine clock cycles.
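The tradeoff in the reported figures can be put side by side with a short calculation. The Python sketch below uses only the numbers quoted in this article for the 16-bit adder and the 6x10 multiplier; the dictionary keys and the function name are hypothetical, and the derived ratios are simple arithmetic rather than additional measurements.

```python
REPORTED = {
    "16-bit adder": {
        "comb_delay_ns": 45.6, "clock_ns": 15.10, "latency_cycles": 3,
        "comb_macrocells": 34, "pipe_macrocells": 81,
    },
    "6x10 multiplier": {
        "comb_delay_ns": 73.5, "clock_ns": 15.30, "latency_cycles": 9,
        "comb_macrocells": 93, "pipe_macrocells": 360,
    },
}


def summarize(figures):
    """Derive throughput gain, first-result latency and macrocell cost ratio."""
    return {
        "throughput_gain": figures["comb_delay_ns"] / figures["clock_ns"],
        "latency_ns": figures["latency_cycles"] * figures["clock_ns"],
        "macrocell_ratio": figures["pipe_macrocells"] / figures["comb_macrocells"],
    }


if __name__ == "__main__":
    for name, figures in REPORTED.items():
        s = summarize(figures)
        print(f"{name}: {s['throughput_gain']:.2f}x throughput, "
              f"{s['latency_ns']:.1f} ns to first result, "
              f"{s['macrocell_ratio']:.2f}x macrocells")
```

For both functions the throughput gain comes with a longer wait for the first result and a larger macrocell count, which is the register-for-performance tradeoff the article describes.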

Summary
Pipelining is the key to increasing throughput and thus performance. The tradeoff for this performance is the use of more registers. The register-rich architecture of the Lattice ispLSI 8840 CPLD makes it possible to implement these computationally intensive functions, which would not be possible with CPLDs that have a smaller register count.

Further information on the ispLSI 8840 device used in these pipelined designs can be found at http://www.latticesemi.com



Copyright © 2003 ChipCenter-QuestLink