|
||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||
|
|
A Design Methodology for The Efficient Implementation of Johannes Steensma The DSP Revolution - Increasingly electronics applications require some form of digital signal processing. DSP applications take analog data, convert it to a digital format, process it, filter it, packetize it and/or transmit it, and then turn it back into an analog format for human consumption. Today, mobile phones, hearing aids, Internet telephones, medical imaging systems, voice-activated appliances and toys, televisions, video recording and audio systems are heavily DSP-based. Simultaneously, the heat is on to make these systems ever smaller, frequently integrating the entire system functionality on a single piece of silicon, or system-on-a-chip (SoC). After all, nobody wants a hearing aid the size of a Palm Pilot. Moving complex DSP algorithms into systems-on-chip requires new design methodologies that are just beginning to emerge. FPGAs vs ASICs - Most systems-on-chips are implemented in custom ASICs. However with the advent of very large system-level FPGAs, SoC designs are also making their way into FPGAs. Putting a system on an ASIC is a challenging proposition. Putting one on an FPGA is more challenging because of the structure of FPGA architectures. A design that is optimized for an FPGA is generally very different from that which would be used in an ASIC because the two types of devices are structured differently. In contrast to gate arrays that have a single gate as the basic logic unit, FPGAs have the equivalent of 20 to 40 ASIC gates per logic unit, each of which typically contains a 4-input look-up table (LUT), carry logic, muxes and a flip-flop (register). Newer FPGA architectures also have blocks of RAM distributed throughout the array or located on the periphery of the device. Although FPGA logic units can be used to implement memory, doing so is usually silicon inefficient and expensive. As a result, it is very important to achieve the most efficient use of the on-FPGA memory. Sadly, most FPGA design and logic synthesis tools provide little or no support for the embedded SRAM blocks. Until now, they have had to be designed-in manually, taking a lot of extra time and effort. From Algorithm to RT-level -Virtually all DSP-based system designs are developed purely as software in some variation of the C-language, using floating-point math and design tools such as MATLAB, SPW or COSSAP. Initially these programs are developed, executed and debugged using very fast PCs or DSP processors. However, DSP processors are large, expensive and use a lot of power. Eventually, these designs must be implemented in hardware, frequently with the objective of shrinking them into a single high-performance integrated circuit that consumes as little power as possible. Unfortunately, while software is written in C, hardware designs are written in hardware description languages such as Verilog and VHDL. Historically, the design methodology for implementing a C-language DSP algorithm in hardware has been to manually re-render the algorithm in a fixed-point format, and then manually re-write it again in a register transfer (RT) level hardware description language. This is a daunting task that can take weeks or months. Whats more, since the architecture of the hardware design is explicit in any RT-level description, this methodology does not allow the designer to explore alternative hardware architectures to ensure that the hardware design is optimal for the particular application (e.g. low-power, high performance, etc.). When FPGAs are the silicon medium, the design challenge becomes greater because the logic structures are fixed and the available on-chip memory is limited. The Advent of Architectural Synthesis - A new class of EDA tools has been introduced in the past year that helps designers define hardware architectures for FPGAs and ASICs from SystemC or ANSI-C software. One such tool, A|RT Designer, from Frontier Design, helps designers create and explore multiple hardware architectures directly from the SystemC or ANSI C-language software. The designer has complete control of the process and can specify which resources will be used to execute the DSP programs. These can include ALUs, multipliers, adders, RAM, ROM, registers and special custom resources that execute entire algorithms (e.g. FFT or DCT) in a single clock cycle. When the optimum architecture is found, the tool automatically generates the RT-level Verilog or VHDL, but the original design remains in the C-language so it may be re-used and its hardware architecture re-optimized for other applications. For example, a high performance hardware architecture can be created from a C-language Viterbi algorithm for a high speed satellite application. Then, without any modification, the same C-language Viterbi algorithm can be re-rendered in a low-power architecture for a mobile phone. The C-language algorithm never changes; only the hardware architecture does. A technique called "dataflow analysis" in used during architectural synthesis to automatically identify and exploit data dependencies in the code. Operations are assigned to the datapath resources to maximize parallelism in the hardware design. The architectural synthesis tool then generates a hardware architecture that includes a VLIW controller and datapath resources, and schedules the algorithmic operations on the hardware resources. The designer may modify the architecture, scheduling, or resource allocation at will. Once the designer is satisfied, the tool automatically generates the RT-level Verilog or VHDL hardware description. In contrast to HDL-based designs, that can take weeks or months for a single iteration, C-language-based design can be iterated in as little as a few minutes using an architectural synthesis methodology. Because design iterations are so fast, designers have the freedom to explore and analyze many different architectures to achieve a highly optimized design. Implementing DSP Algorithms in FPGAs - DSP algorithms have several characteristics that affect their implementation in hardware, and especially in FPGAs. First, they are computationally intensive and require lots of registers. Second, they tend to use non-standard word-widths (e.g. 13- or 33-bits) that frequently do not fit in the fixed memory architectures of on-FPGA SRAM (e.g. 8-, 16-, 32-bits). Third, since they must perform repetitive calculations on samples or frames in real-time, they demand very high processing throughput. If the processing of the first sample or frame is not completed in time, the next one will be lost. Finally, DSP algorithms must often store the results of one frame or sample for comparison to the next frame or sample. Since FPGAs have limited memory, these storage requirements can force the use of external, and slower, off-chip memory. As a result, on-FPGA memory is at a premium and needs to be conserved whenever possible. These characteristics mandate special design techniques and tools for FPGA implementations. Register Intensive Nature of Digital Signal Processing - The computationally intensive nature of DSP algorithms means they rely heavily on the use of registers to store intermediate values. In fact, every datapath resource (ALU, multiplier, etc.) in a DSP-like architecture is likely to have at least one register file associated with it. Implementing a one-bit register in a gate array, takes just about 8 or 9 gates for PGAs, however, have pre-defined flip-flops included in their high-level logic units. Xilinxs Virtex II and Alteras APEX devices both have one flip-flop per logic cell (LC) or logic element (LE). Registers that are larger than one-bit can be built by chaining together several logic units. If the registers are specified in the HDL description as registers, the logic synthesis tool will synthesize them directly into the registers in the logic units on the FPGA. For example, a 16 x 16 multiplication will need two sixteen bit registers to store the left and the right operands. Each 16-bit register can be created by chaining together 16 LCs or LEs and using the flip-flop in each logic unit to store one-bit of the 16-bit words. This solution requires a total of 32 logic units to store the multiplication. However, this can be a very expensive solution because, in most cases, the logic resources are no longer available for logic operations. In other words, using logic elements just for their register can waste valuable logic resources. A more efficient means of implementing the 16-bit registers is to use a single LUT for each register. Since each LUT can be configured as a 16-bit RAM, a single LUT can be used to store each 16-bit operand. A|RT Designer 2.3 automatically performs this optimization. The tool evaluates the number and size of the required registers in the design and optimizes their implementation between built-in and LUT RAM registers. In this example, A|RT Designer reduces the number of LUTs required for the two 16-bit registers from 32 logic units to only two logic units - a 93% improvement. Implementing the register in this way can waste all the other resources in the logic element (e.g. the LUT, carry logic and muxes). Since an FPGA logic element contains the equivalent of between 25 and 40 gates, at least 15 equivalent gates of logic resources can be wasted if the registers in the FPGA logic units are used to implement registers in the design. A|RT Designer automates this use of LUT RAM for registers, whenever appropriate. The tool looks at all the register files in the hardware architecture and optimizes their implementation in the LUT RAMs and/or logic unit registers. For example, the 3GPP turbocoder IP core developed by Frontier Designs design service group has a total of 1,504 1-bit registers. If these one-bit registers were specified as such, in the HDL description of the design, it would require 94 VirtexII slices just for the implementation of the registers, plus about 4,700 additional slices to implement the datapath logic and microcode ROM. In fact, the complete turbocoder requires only 3,071 slices and 16 block RAMS in an XCV400E. Non-standard Word-widths - DSP algorithms typically are developed using floating-point math because it is infinitely accurate. However, when the algorithms are implemented in hardware, they are converted to a fixed-point representation to conserve silicon. Since every bit in every word takes up silicon, the fixed-point words are kept as small as possible. However, the words must also be wide enough, with enough precision to achieve the behavior of the floating-point algorithm within the specified dynamic range and signal-to-noise ratio. The smallest possible fixed-point word-width that has enough bits to provide an accurate result, may not be a standard 8-, 16- or 32-bits wide. The fixed-point words could be any width, 23-, 17-, or 51-bits. In addition, depending on what is required at any point in the execution of the algorithm, the word-width may change within the same algorithm. Using a non-standard word-width to perform the arithmetic will result in a variable value that must be stored in a non-standard word-width. Storing variables with unusual word-widths is not a problem in an ASIC because the designer has complete flexibility to configure the silicon any way he or she likes. In an FPGA, however, the on-chip memory is configured in fixed standard widths of 1-, 2-, 4-, 8-, 16-, or, in some cases, 32-bits, that are implemented in discrete blocks of 4096-bits in the case of Xilinx, or 2048-bits in the case of Altera. Storing words that do not fall within these standard architectures can result in a less than optimal use of the FPGA memory. In fact, it can be extremely wasteful. For example, suppose a designer specifies a 33-bit wide word in the HDL description. Since the widest possible word in either an APEX or Virtex FPGA is only 32-bits wide, a 33-bit word wont fit in a single ESB or Block RAM. The default solution would use two 32-bit wide SRAM blocks, with the first 32-bits of each word in stored in the first RAM block, and the 33rd bit of each word stored in the other RAM block. For every 33-bit word that is stored that way, 31-bits of memory are wasted - a very expensive proposition. Another, more efficient option is to use a combination of LUT RAM and Block RAM or ESBs to store the words. Since FPGA LUTs can be configured as SRAM with sixteen 1-bit words each, a LUT RAM can be used to store the 33rd bit of each word. Although designers can and do configure FPGA memory manually to achieve a better result, the process can be very time consuming and the result can be less than perfect. The A|RT Designer architectural synthesis tool automates the process completely by searching the code for the various words to be stored, and computing an optimized combination of LUT RAM and larger RAM blocks to minimize the amount of memory used. In this example, the tool would configure the memory to store the first 32-bits of the 33-bit word in a 32-bit wide block of RAM and store the 33rd bit in a LUT RAM. In this case, A|RT Designer would reduce the required amount of RAM by nearly 50% from the worst case implementation. This feature is particularly important in algorithms that must store massive amounts of data, such as ADPCM or speech recognition. Repetitive Calculations - DSP designs are packed with multiple algorithms, such as fast Fourier transform (FFT) butterflies or discrete cosine transforms (DCTs), that must be executed thousands of times a second. Parallelizing these operations is mandatory to achieve the required throughput in any implementation. With a maximum realistic clock of 80 MHz, FPGAs are much slower than 200 MHz ASICs or 400 MHz DSP processors and throughput becomes even more critical. A speech recognition algorithm that requires 90% of a 500 MHz Pentium III will require massive parallelization to execute in even the fastest FPGA. Architectural synthesis allows designers to achieve the highest level of parallelism by letting them create special resources, based on complex C-code algorithms or super instructions, (e.g. FFTs or DCTs), that execute in a single or a few clock cycles. For example, an FFT butterfly that requires two multiplications, an add and an accumulation would take at least four clock cycles to complete using conventional ALUs, multipliers and registers. By creating a special resource from a super instruction that includes the two multiplications, the addition, and the accumulation, the butterfly FFT can be executed in a single cycle instead of four. A|RT Designer automates the creation of these single-cycle, multiple-instruction resources at the designers request, and then automatically searches the C-code for any instance of the super instruction and instantiates the special resource to execute that code segment in just one clock cycle. Much larger super instructions can also be implemented in single-cycle special resources, using A|RT Designer 2.3. Frontier Designs design services group created a special resource for its 3GPP wideband CDMA decoder that executes 38 Log MAX calculations (for a 514-bit data block), in a single cycle. The tools unique capability to create this special resource allowed a more computationally intensive algorithm to be used with no penalty that delivers the 2 Mbits/second throughput required by the standard with a signal-to-noise ratio that is 0.5 dB better than the standard while executing 40% more iterations than the standard requires.
Freeing Up FPGA Memory Usage By Minimizing Code Storage - The implementation of any algorithm in hardware requires that a controller and microcode be created to schedule the execution of the datapath operations on the hardware resources. The microcode can become quite long in complex designs, taking up a substantial amount of the FPGAs limited memory resources. At the same time, certain algorithms that must compare the value of the previous sample to the value of the current sample can require substantial amounts of memory to store the data. Since on-FPGA memory is limited, conserving memory can be paramount. If there is not sufficient on-chip RAM, an external memory will be required that will degrade performance and increase board size, power consumption and system costs. Minimizing the storage requirement for the microcode can make the difference between a single-chip FPGA solution or a solution with external memory. Fortunately, the LUT architecture of FPGAs can be exploited to reduce the total memory requirements for microcode storage. Frequently, there is enough overlap within the instructions that not all the bits are required to identify them. For example, in a design with 13-bit instructions, it may be possible to identify each unique instruction using just four of the bits. In this situation, the code can be stored as 4-bit words and then decoded prior to execution using the FPGA LUTs. One LUT is used to decode each of the bits in the original instruction word. The memory savings can be significant. In a complex design, the microcode can be as long as 5,000 lines. In the hypothetical example with the 13-bit instructions, 5,000 lines of code would require 65,000 bits of SRAM for storage. That is the equivalent 16 of the block RAMs in a Virtex or 32 of the ESBs in an Altera APEX. By using LUTs to create small decoders, and storing the instructions in 4-bits of memory each, the required memory can be reduced to only 20,000-bits, plus 13 LUTs - a 70% reduction in the memory required for the code. This implementation of this code compaction and decoding process can be handled automatically by A|RT Designer. It evaluates the code and determines the smallest number of bits required to identify the instructions and then automatically builds the decoders in the FPGA LUTs, minimizing the total code size. Conclusion - Implementing DSP software in hardware can be a daunting task. The predefined logic structures, fixed memory architecture, limited on-chip memory and slow throughput of FPGAs make this task even more difficult when an FPGA is the target medium. Architectural synthesis tools, such as A|RT Designer, can substantially improve the silicon efficiency of SoC implementations by automating and optimizing the configuration of register files and on-FPGA memory. These tools can also achieve high levels of throughput required of DSP applications by helping the designer to introduce an extraordinary degree of parallelism in the hardware implementation - a critical requirement in relatively slow FPGAs.
|
|||||||||||||||||||||||||||||||||
|
Copyright © 2003 ChipCenter-QuestLink About ChipCenter-Questlink |
||||||||||||||||||||||||||||||||||