Combining FPGA Cores To Extend The Performance Of DSP Designs
By Paul Laity (paul.laity@xilinx.com) and Sabine Lam
Cores Solutions Group, Xilinx, Inc.
Forward-thinking designers have been using field programmable gate arrays in high-performance digital signal processing systems for some time. That's because many communications applications just couldn't be done with existing DSP processors for reasons of cost, power dissipation or time-to-market. FPGAs, which are uniquely suited to repetitive DSP tasks such as multiply and accumulate (MAC), can perform those operations in parallel and often outperform general-purpose DSP processors. For example, benchmarks on a 16-bit finite impulse response (FIR) filter show that a single, medium-density FPGA with about 3,000 logic cells can perform almost 2 billion MACs a second, compared to less than 500 million MACs a second for a state-of-the-art DSP processor. Use a larger FPGA, and the performance scales upward very quickly, to more than 6 billion MACs a second for today's largest FPGA with more than 10,000 logic cells. Adding multiple DSPs to get this level of performance would be impractical if not impossible.
These levels of performance in FPGA-based designs, however, have come at a cost: DSP designers, who live in a world of software programming, have been required to make the leap into the world of electrical engineering and a new vocabulary of flip-flops, gates and VHDL code to deal with FPGAs. But that situation is changing as FPGAs become more "DSP friendly."
Two reasons account for this. First, intellectual property blocks, or cores, are now available for FPGAs. These predefined functions, whose parameters designers can set and change as they choose, perform a wide range of standard DSP functions common in wireless communication equipment such as wireless local loops and the base stations that tie them together. These functions include filters, transforms, correlators, memories, sine/cosine building blocks and math functions. Cores help reduce time to market, and they make it easier for DSP designers to tap into the high-performance capability of FPGAs.
Second, designers now have new tools at their disposal to make the job of incorporating cores into their DSP designs easier than ever before. For example, the new Xilinx CORE Generator tool allows designers to pick their cores from a library, set the parameters, and produce the design with a guaranteed level of performance in an FPGA -- with the push of a button. In addition, mainstream system-level modeling tools familiar to DSP designers -- SystemView by Elanix, for instance -- are just now available with support for FPGAs. Other DSP tool vendors are expected to offer support for FPGAs in their software later this year.
Today, the serial distributed arithmetic (SDA) FIR filter core is the workhorse for FPGA-based DSP design. These SDA FIR filters are used to implement low-cost, high-data-rate applications such as ADSL, satellite modems and wireless base stations, where the required performance is beyond what a single DSP processor can achieve.
SDA cores are the most efficient way to implement FIR filters for sample rates in the range of one to 10 million samples per second. To go beyond this has required using a fully parallel distributed arithmetic (PDA) filter core, but parallel cores require much larger and more expensive FPGA devices.
By combining multiple serial cores, it is possible to fill in the performance gap between serial and parallel cores and reach desired performance levels. This approach allows the use of a more cost-effective FPGA device using existing standard cores as building blocks. The same techniques can be used to triple or quadruple the data rate using three or four SDA cores in a single FPGA.
Filling the Performance Gap
Parallel distributed arithmetic FIR filter cores are tap-parallel and bit-parallel. The calculations for all of the bits and all of the taps are completed in one clock cycle. This high performance approach can handle 80 MHz data rates, but it requires many times more resources than SDA FIR cores and is not practical for a large number of taps.
Data rates for many applications exceed SDA rates but are far less than the 60 MHz to 100 MHz PDA rates. What is needed in many applications is a core that processes two or more bits per clock. This can be achieved easily by combining two SDA FIR filter cores with an adder core.
Non-mathematical Approach to Distributed Arithmetic
To understand serial distributed arithmetic, first consider the simplest case, where the number of taps is equal to one. In Fig. 1, 10-bit parallel data enters the filter and is then shifted out, one bit at a time, least significant bit first. This serial data is used to address a two-word look-up table (LUT) that stores the constant value C0 at location 1 and a zero at location 0. The process repeats for ten clocks while the LUT output is accumulated in a scaling accumulator. The result is a serial multiplier that uses a LUT to store the constant.
Figure 1 - This 1-tap filter is a serial multiplier that uses an FPGA look up table to store the constant.
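The 1-tap case can be modeled in a few lines of software. The following is a minimal sketch, not the core's actual implementation: the function name `serial_multiply` is hypothetical, the sample is assumed to be in twos complement form (so the last, most significant bit is subtracted rather than added), and the scaling accumulator is modeled by weighting each LUT output with its bit position instead of shifting the accumulator.

```python
def serial_multiply(sample, coeff, bits=10):
    """Multiply a signed `bits`-wide sample by `coeff`, one data bit per clock.

    Software model only: names and the twos complement handling are assumptions.
    """
    lut = [0, coeff]                      # location 0 holds zero, location 1 holds C0
    word = sample & ((1 << bits) - 1)     # twos complement bit pattern of the sample
    acc = 0
    for clock in range(bits):             # least significant bit first
        bit = (word >> clock) & 1         # serial data bit addresses the LUT
        term = lut[bit] << clock          # scaling: weight by the bit position
        if clock == bits - 1:
            acc -= term                   # the sign bit carries negative weight
        else:
            acc += term
    return acc


product = serial_multiply(37, 5)          # ten clocks later: 185
```

Note that no multiplier is used anywhere: the product is built from ten LUT reads and ten additions.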
Now keep the same structure but add more taps. Fig. 2 illustrates how to implement a 3-tap FIR filter. The size of the LUT doubles for each additional tap, and the LUT now contains pre-calculated sums of the coefficients, one sum for every combination of address bits. Now it starts to get interesting: three taps are processed in the same number of clocks as one, with little increase in circuit complexity or resources.
Figure 2 - With little increase in FPGA resources over those in Fig. 1, 3 taps can be processed in the same number of clocks.
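To make the LUT contents concrete, here is a rough software sketch of a 3-tap distributed arithmetic filter. The helper names `build_lut` and `da_fir_output` and the coefficient values are illustrative assumptions, and samples are again assumed to be twos complement. Each clock, one bit from each of the three samples forms a 3-bit address, and the addressed LUT word is the sum of the coefficients whose bit is set.

```python
def build_lut(coeffs):
    """Entry `addr` holds the sum of the coefficients whose address bit is 1."""
    n = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << n)]    # 2**n words: doubles with every tap


def da_fir_output(samples, coeffs, bits=10):
    """One filter output, sum(coeffs[k] * samples[k]), computed bit-serially."""
    lut = build_lut(coeffs)
    mask = (1 << bits) - 1
    words = [s & mask for s in samples]   # twos complement bit patterns
    acc = 0
    for clock in range(bits):             # still only `bits` clocks, whatever the tap count
        addr = 0
        for k, w in enumerate(words):
            addr |= ((w >> clock) & 1) << k
        term = lut[addr] << clock         # scaling by bit position
        acc += -term if clock == bits - 1 else term   # sign bit is subtracted
    return acc


coeffs = [3, -5, 7]                       # hypothetical coefficients
lut = build_lut(coeffs)                   # [0, 3, -5, -2, 7, 10, 2, 5]
result = da_fir_output([100, -200, 17], coeffs)
```

The result matches the ordinary sum of products, 3*100 + (-5)*(-200) + 7*17 = 1419, while the loop still runs for only ten clocks.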
Increasing the number of taps to four produces the optimal building block for the Xilinx XC4000 FPGA family. This 4-tap structure is then repeated to implement as many taps as required. And then all of the 4-tap slices are added together to complete the filter.
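The slice structure can be sketched the same way. In this hedged model (the names `slice_output` and `sliced_fir` are made up for illustration), each slice handles at most four taps, so its LUT never grows beyond 16 words, and the slice outputs are simply summed to form the complete filter output.

```python
def slice_output(samples, coeffs, bits):
    """One slice of up to 4 taps: a LUT of at most 16 words, one address bit per tap."""
    lut = [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
           for addr in range(1 << len(coeffs))]
    mask = (1 << bits) - 1
    acc = 0
    for clock in range(bits):
        addr = 0
        for k, s in enumerate(samples):
            addr |= (((s & mask) >> clock) & 1) << k
        term = lut[addr] << clock
        acc += -term if clock == bits - 1 else term   # twos complement sign bit
    return acc


def sliced_fir(samples, coeffs, bits=10, slice_taps=4):
    """Add the outputs of all 4-tap slices to complete the filter."""
    total = 0
    for i in range(0, len(coeffs), slice_taps):
        total += slice_output(samples[i:i + slice_taps],
                              coeffs[i:i + slice_taps], bits)
    return total


coeffs = [1, -2, 3, -4, 5, -6, 7, -8]                 # hypothetical 8-tap filter
samples = [10, -20, 30, -40, 50, -60, 70, -80]
result = sliced_fir(samples, coeffs)                  # two slices, summed
```

In hardware the slices run side by side, so adding taps costs area rather than clocks. A single monolithic LUT for eight taps would need 256 words, whereas two slices need 16 words each.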
The registers, look-up tables and adders are implemented in field programmable gate array configurable logic blocks (CLBs), each of which consists of two 16-by-1 LUTs plus two flip-flops. The LUTs can also double as distributed RAM memories and are used to buffer the data samples in the SDA FIR filter. This distributed arithmetic approach to implementing FIR filters produces more efficient results than more traditional approaches. With the cost per CLB at less than two cents for the new low-cost Xilinx Spartan FPGA family, a 100-tap filter can be built for less than $5.
Exploiting Symmetry
The Xilinx CORE Generator, an application that resides on the designer's PC or workstation, is the delivery vehicle for the SDA FIR filter core. The CORE Generator accepts the core parameters and then generates the FPGA implementation for the core. Fig. 3 shows the CORE Generator parameterization screen that pops up when the SDA FIR is selected from the hierarchical list of available cores.
Figure 3 - The CORE Generator tool accepts DSP core parameters and produces the implementation for an FPGA
You enter an instance name for the core, the input data width, the output data width, the number of taps, the symmetry type, and the width of the coefficients. When you click on the generate button, the CORE Generator reads the coefficient values that you have defined in a text file and builds the customized core. A netlist file with relative placement information (core floorplan) and an HDL behavioral model are then placed in your project directory for use with traditional VHDL, Verilog or schematic capture design tools. The performance of the core is specified in the online core data sheet.
The two tables in Fig. 4 show the number of CLBs -- the size of the core -- and the maximum data rate in MHz.
Figure 4 - On-line data sheets in the CORE Generator include information on FPGA resources required to implement a DSP core (upper table) and the maximum data rate (lower table)
For example, a 32-tap SDA FIR filter core with 10-bit data and 10-bit symmetrical coefficients requires 118 CLBs and has a maximum data sample rate of 80 / 11 = 7.3 MHz.
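The arithmetic behind that figure is simply the clock rate divided by the number of clocks spent on each sample. The sketch below reproduces it; the helper name and the interpretation of the numbers are assumptions, not taken from the data sheet. In particular, reading the 11 as ten data bits plus one extra clock for a symmetrical filter is an inference from the 80 / 11 expression above.

```python
def sda_sample_rate_mhz(clock_mhz, data_bits, extra_clocks=1):
    """Maximum sample rate of a bit-serial filter: one data bit per clock.

    `extra_clocks=1` is an assumption inferred from the 80 / 11 figure
    (10-bit data, symmetrical coefficients); it is not a documented parameter.
    """
    clocks_per_sample = data_bits + extra_clocks
    return clock_mhz / clocks_per_sample


rate = sda_sample_rate_mhz(80, 10)        # 80 / 11, about 7.3 MHz
```

Because the sample rate falls as the data word gets wider, reducing the number of bits each core has to process is the lever that the next section pulls.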
Implementing two bits-per-clock
Two SDA cores are used: each filter processes half of the 10-bit data word, and their outputs are then added together. Each of the modules needed to implement this design is available as a parameterizable core in the Xilinx CORE Generator. Note that the size (number of CLBs) of an SDA FIR filter is determined by the number of taps and the bit width of the coefficients. The bit width of the input data determines the performance but does not affect the size of the core. There are four basic components to the filter shown in Fig. 5: the upper half filter, the lower half filter, the output register, and an adder.
Figure 5 - This design illustrates how two serial distributed arithmetic (SDA) cores can be combined in one FPGA to double performance
Upper Half SDA Filter: Used to filter the most significant 5 bits of the data sample. This half contains the sign bit, so the parameters are the following: 30 taps, 5-bit data, SIGNED, 10-bit coefficient, symmetrical. The output of the core is 20 bits wide after all of the multiply and add operations are performed.
Lower Half SDA Filter: Used to filter the least significant 5 bits of the data sample. The MSB of the lower half is NOT a sign bit, and hence the input data should be treated as unsigned data. The parameters and coefficients are the same as for the upper half SDA filter except for the SIGNED/UNSIGNED attribute: 30 taps, 5-bit data, UNSIGNED, 10-bit coefficient, symmetrical.
Output Register: The RDY signal goes high when the result (RSLT) data is available. The registers are used to save the partial results before adding them together. The width of the register depends on the RSLT output width, which varies with the number of taps, coefficient width, input data width and symmetry. Keeping full resolution would, in our example, generate a 20-bit output. Note: These registers are not mandatory and could be removed if maximum performance is not required. In that case, the RDY signal should be connected directly to the Registered Adder CE.
Registered Adder: There are two important things to remember in regard to adding the partial results together. The first is that the incoming data carries signed information and is therefore in twos complement form. The second is that the upper half of the data is 5 bits more significant than the lower half. This relative positioning must be recreated when adding the partial results. An efficient way to obtain the final result is to shift the least significant result 5 places to the right before adding. Notice that when this shift occurs, the least significant bits that are output from the lower half SDA filter have nothing to add, so we can just save them in a register.
The numbers being added, however, are both in twos complement form, and carry signed data in their most significant bit. To maintain the sign value for both numbers, it is necessary to sign extend the least significant result 5 places. To sign extend, the most significant bit (now bit 15) is replicated onto newly created bits 16, 17, 18, 19 and 20.
Lower bit register: Used to register the shifted data. The size is data_sample_width/2 and in this example it is equal to 5.
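The split-and-recombine scheme described above can be checked with a short software model. This is a sketch under stated assumptions: the function names are hypothetical, each SDA core is modeled as a plain sum of products rather than bit-serially, and Python's `>>` on integers is used as the sign-extending right shift.

```python
def split_sample(x, bits=10):
    """Split a twos complement sample into a signed upper and unsigned lower half."""
    half = bits // 2
    lower = x & ((1 << half) - 1)         # lower half: treated as UNSIGNED
    upper = x >> half                     # upper half: keeps the sign (SIGNED)
    return upper, lower


def fir(samples, coeffs):
    """Stand-in for one SDA FIR core: an ordinary sum of products."""
    return sum(c * s for c, s in zip(coeffs, samples))


def combine(upper_rslt, lower_rslt, half=5):
    """Registered adder plus lower bit register."""
    low_bits = lower_rslt & ((1 << half) - 1)   # saved as-is in the lower bit register
    shifted = lower_rslt >> half                # shift right 5 places, sign extended
    return ((upper_rslt + shifted) << half) | low_bits


coeffs = [3, -5, 7, -5, 3]                      # hypothetical symmetrical coefficients
samples = [100, -200, 17, -512, 511]            # 10-bit signed data
halves = [split_sample(s) for s in samples]
upper_rslt = fir([u for u, _ in halves], coeffs)
lower_rslt = fir([l for _, l in halves], coeffs)
result = combine(upper_rslt, lower_rslt)        # equals fir(samples, coeffs)
```

The check works because every sample equals 32 times its upper half plus its lower half, and the filter is linear, so the same relation holds between the two partial results and the full result.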
The CORE Generator is used to build each of the cores, which are then interconnected with schematic capture or HDL. This same technique can be used to implement three bits per clock by dividing the input data sample word into three parts. This is practical when word widths are 15 bits or more and will increase the data rate by a factor of three.
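Extending the idea to three parts can be sketched as follows. This is an illustrative model only: `split_parts` and `multi_bit_fir` are made-up names, each core is modeled as a sum of products, and the word width is assumed to divide evenly by the number of parts. Only the most significant part is signed; the others are unsigned.

```python
def split_parts(x, bits, parts):
    """Split a twos complement word into `parts` fields, least significant first."""
    width = bits // parts                 # assumes bits is a multiple of parts
    mask = (1 << width) - 1
    fields = [(x >> (i * width)) & mask for i in range(parts - 1)]   # unsigned fields
    fields.append(x >> ((parts - 1) * width))                        # signed top field
    return fields


def multi_bit_fir(samples, coeffs, bits=15, parts=3):
    """Run one filter per field, then add the results at their relative weights."""
    width = bits // parts
    split = [split_parts(s, bits, parts) for s in samples]
    total = 0
    for i in range(parts):
        partial = sum(c * fields[i] for c, fields in zip(coeffs, split))
        total += partial << (i * width)   # restore the field's significance
    return total


coeffs = [9, -4, -4, 9]                   # hypothetical coefficients
samples = [-16384, 16383, 1234, -4321]    # 15-bit signed data
result = multi_bit_fir(samples, coeffs)   # equals the direct sum of products
```

With 15-bit data, each core sees a 5-bit word, so each sample needs roughly a third of the clocks of a single serial core, at the cost of three filter cores and a wider adder stage.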
Summary
Clearly, FPGAs will not replace DSPs for many applications. Rather, FPGAs are likely to co-exist in the same system with DSP processors and offer designers the edge they need to attain the highest level of performance while staying within their cost constraints. Thanks to the availability of new tools and a growing library of intellectual property, the use of cores should make this job easier for more and more designers of wireless systems in which DSP functions play a growing role.
Copyright © 2003 ChipCenter-QuestLink