Frontier Design

INTRODUCTION - FROM C TO SILICON
As ASIC densities soar to the multi-million-gate level, there are enough transistors to put entire systems on a single IC. Designing a million gates of anything is a daunting task and, with mask charges exceeding $1 million per design, mistakes are very expensive. As a result, these large systems are typically prototyped, tested and debugged in software, usually written in the C language. The software is executed and its functionality tested on very fast Pentiums or high-performance RISC or DSP processors.

In the best of worlds, that would be the end of it: end-products would be manufactured using these high-end processors. However, processors are expensive, costing as much as $1000 each. Also, the high clock rates needed to get the required performance result in excessive power consumption. Running a 300 MHz DSP in a mobile phone would reduce its battery life to a few hours. So, for cost and power considerations, these C-language prototype systems are usually migrated to a masked ASIC or a system-on-a-chip (SoC).

The question is how to get a very large piece of software into a highly parallelized and optimized hardware architecture. Until now, there has not been any really acceptable solution. Software is a set of sequential operations that are written without regard for resources, parallelism or timing. To get a piece of software into dedicated hardware, specific hardware resources, such as ALUs, multipliers, adders, RAM, ROM and registers, must be allocated. Then, all the individual operations that make up the software must be assigned to those resources. Finally, the operations must be scheduled in such a way that the data flow of the software is respected. For example, if the result of a multiplication is the input to an add operation, then the multiplication must be scheduled first. Eventually, the hardware must be implemented in a hardware description language that describes the register transfers.

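The three steps described above (allocate resources, assign operations to them, schedule the operations so that data dependencies are respected) can be illustrated with a small sketch in C. This is only a simple greedy, cycle-by-cycle scheduler written for illustration; it is not presented as the algorithm any particular tool uses. The names `Op`, `schedule`, `RES_MUL` and `RES_ADD` are hypothetical, and the sketch assumes that every operation takes one cycle and that operations are listed with producers before consumers.

```c
#include <assert.h>
#include <stddef.h>

enum { RES_MUL, RES_ADD, RES_KINDS };   /* hypothetical resource kinds */

typedef struct {
    int resource;   /* kind of resource this operation is assigned to */
    int deps[2];    /* indices of producing operations, -1 for a primary input */
} Op;

/* Fills cycle[i] with the cycle in which operation i runs and returns the
   total number of cycles. avail[k] is the number of allocated units of kind k. */
int schedule(const Op *ops, size_t n, const int avail[RES_KINDS], int *cycle)
{
    for (size_t i = 0; i < n; i++) {
        assert(avail[ops[i].resource] > 0);   /* otherwise the op could never run */
        cycle[i] = -1;                        /* -1 marks "not yet scheduled" */
    }
    size_t done = 0;
    int t = 0;
    while (done < n) {
        int used[RES_KINDS] = {0};            /* units already busy in cycle t */
        for (size_t i = 0; i < n; i++) {
            const Op *op = &ops[i];
            if (cycle[i] >= 0 || used[op->resource] >= avail[op->resource])
                continue;                     /* already placed, or no free unit */
            int ready = 1;
            for (int d = 0; d < 2; d++) {
                int p = op->deps[d];
                assert(p < (int)i);           /* producers must precede consumers */
                if (p >= 0 && (cycle[p] < 0 || cycle[p] == t))
                    ready = 0;                /* operand not available before cycle t */
            }
            if (ready) {
                cycle[i] = t;
                used[op->resource]++;
                done++;
            }
        }
        t++;
    }
    return t;
}
```

As a usage example, take a hypothetical three-operation fragment: two independent multiplications (operations 0 and 1) whose results feed one addition (operation 2). With one multiplier and one adder allocated, `schedule` returns three cycles; allocating a second multiplier lets both multiplications run in the same cycle and the result drops to two cycles.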
Today, designers define resources a priori and then build the design in a bottom-up fashion, starting from scratch in either VHDL or Verilog. Until now, this has been the only available design flow. However, it ignores a critical component in the optimization of the design: its architecture. Architecture is key to the optimization because it drives both the performance and the silicon area.

In software, a concrete notion of architecture is absent. The majority of existing software is targeted at "standard" platforms, such as Pentiums or advanced RISC processors. Most standard platforms have very similar architectures with limited parallelism (e.g. parallel multiply/add on DSP processors). They are typically highly pipelined to allow a high clock frequency. A processor's computational horsepower scales more or less linearly with the clock frequency (e.g. a 600 MHz Pentium is roughly twice as fast as a 300 MHz Pentium). The increasing performance of these architectures allows increasingly complex software to be run faster and faster.

However, the advent of Digital Signal Processing has fostered a new type of "software" that has a very hard constraint: real-time operation. DSP algorithms process real-time data that is sampled at a given rate (e.g. 44,100 samples per second for CD audio, or 50 frames of 160 samples per second for GSM speech). The designer does not have the freedom to choose between running the software fast on a high-end processor or running it a bit slower on a more low-end device. The processing has to keep pace with the specified data rate; otherwise the application will not work at all. DSP algorithms can easily become so computation-intensive that even very highly clocked microprocessors, RISCs or DSPs cannot run them in real time. The only way to achieve real-time operation in these systems is to design a dedicated processor that is capable of executing multiple instructions per clock cycle.

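The real-time constraint can be made concrete as a cycle budget: the number of clock cycles available to process one sample is the clock rate divided by the sample rate. The following is a minimal sketch in C; the function name `cycles_per_sample` is hypothetical, and the clock rates used in the note after it are illustrative choices rather than figures for any specific device.

```c
#include <assert.h>

/* Whole clock cycles available to process one sample in real time. */
long cycles_per_sample(long clock_hz, long sample_rate_hz)
{
    assert(clock_hz > 0 && sample_rate_hz > 0);
    return clock_hz / sample_rate_hz;   /* integer division rounds down */
}
```

For CD audio at 44,100 samples per second, a 300 MHz clock leaves 6,802 whole cycles per sample, while a 40 MHz clock leaves only 907. For GSM speech (50 frames of 160 samples, i.e. 8,000 samples per second), a 40 MHz clock leaves 5,000 cycles per sample. If the algorithm needs more operations per sample than this budget, the only remaining option is to execute several operations in each cycle.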
Other constraints, like battery life or limitations in the silicon itself, can restrict clock rates to such low levels that real-time operation is impossible. An application that can be run in real time on a high-end DSP at 160 MHz stops working if, for some reason, it must be executed on a processor that can only run at 40 MHz. In this particular example, the only way to solve the problem is to change the architecture so that at least four instructions can be executed every clock cycle. This might be accomplished by adding resources that make it possible to perform multiple operations in parallel, in one clock cycle.

Imagine a hypothetical algorithm with a loop that consists of a multiplication followed by an addition, which must be iterated 1,000 times. Using a single multiply/accumulate unit (MAC), the loop would take 1,000 cycles to complete. However, by increasing the number of MACs from one to four, the number of required cycles can be reduced to only 250. By introducing parallelism into the system architecture, the number of instructions executed per clock cycle has been effectively quadrupled, and a clock that is four times slower can be accommodated.

Although the example is hypothetical, the concept is not. The algorithms required for Layer 1 of a GSM phone can be executed on a TMS320C6200 DSP running at 300 MHz. However, using the 300 MHz DSP in a GSM handset would cut its battery life far below acceptable levels. The architecture of the hardware that executes the GSM algorithms must be adapted to achieve the required performance at a clock rate that is consistent with battery operation.

This type of problem is further compounded by the growing use of FPGAs for SoC design. FPGAs consume more power and have substantially lower effective clock rates than masked ASICs. It is not even an option to run an FPGA with a 300 MHz clock.
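The arithmetic behind the 160 MHz to 40 MHz example and the MAC loop above can be sketched in C. The function names are hypothetical, and the sketch assumes that the loop iterations can be spread evenly over the MAC units; it ignores any extra cycles needed to combine partial results or to fill and drain the datapath.

```c
#include <assert.h>

/* Cycles needed to run `iterations` multiply/accumulate steps on
   `num_macs` parallel units (rounded up to whole cycles). */
long loop_cycles(long iterations, long num_macs)
{
    assert(iterations >= 0 && num_macs > 0);
    return (iterations + num_macs - 1) / num_macs;
}

/* Minimum number of operations per cycle needed to keep the same
   throughput when the clock drops from fast_hz to slow_hz. */
long required_ops_per_cycle(long fast_hz, long slow_hz)
{
    assert(fast_hz > 0 && slow_hz > 0);
    return (fast_hz + slow_hz - 1) / slow_hz;
}
```

With these definitions, `loop_cycles(1000, 1)` is 1,000 and `loop_cycles(1000, 4)` is 250, and `required_ops_per_cycle(160000000, 40000000)` is 4, matching the figures in the text.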
If power consumption is a consideration, the FPGA will need a proportionately slower clock than an ASIC to achieve the target. Thus, even more care must be taken in determining the architectural implementation in an FPGA than in an ASIC.

Unfortunately, there is no formal way to determine the optimum architecture. Typically an educated trial-and-error process is used, whereby the algorithm is evaluated on a number of alternative architectures. This is a very time-consuming activity and no more than a few such experiments can be done. As a result, the final architecture is much closer to a compromise than to an optimum.

The problem is that, until now, there have been no EDA tools that explicitly support the exploration of alternative hardware architectures. There is no practical way to engage in an intermediate architectural exploration between software and silicon. Writing the design in an HDL requires an explicit architecture, because an HDL describes the actual register transfers in a specific architecture. Writing a complex design in an HDL is too difficult and time-consuming to do more than once. As a result, the architectures of most systems-on-a-chip are the result of unintended compromises.

In response to this problem, Frontier Design has developed the world's first EDA tool that is capable of directly mapping software to multiple hardware architectures that can be quickly evaluated and refined to achieve a highly optimized solution, before going to the RT level.

A|RT DESIGNER - THE WORLD'S FIRST ARCHITECTURAL SYNTHESIS TOOL
Frontier Design's A|RT Designer provides the missing step in the design flow from software to silicon. The tool automatically maps C code to multiple hardware architectures that include a variety of datapath resources, such as ALUs, multipliers, adders, RAM, ROM and registers, before going to a register-transfer-level HDL description. A|RT Designer lets designers interactively add or remove resources, re-assign operations or alter their scheduling, and then analyze the resulting performance characteristics.

Implementing Software in a Hardware Architecture

Resources - Software consists of a sequential series of operations and intermediate variables. The first step in determining a datapath architecture is to allocate hardware resources that can perform these operations. In addition, muxes, registers and interconnect need to be generated for the storage of intermediate variables and for data routing.

Operation Assignment - The next step is the assignment of the individual software operations to the appropriate datapath resources. In a "hardwired" implementation, a unique resource is assigned to every operation. For very small designs, this is acceptable. But for large compute-intensive systems, the resulting SoC would be much too large. In order to handle a larger design efficiently, resources must be shared and the hardware must be configured to perform multiple tasks.

Scheduling - The next step is scheduling the operations on the shared resources so that the data flow is respected and parallelism is exploited where possible. A controller physically imposes this schedule on the datapath.

A processor-oriented architecture, as described above, is the only efficient means of implementing a complex system in a single piece of silicon.

SYNTHESIZING THE ARCHITECTURE WITH A|RT DESIGNER
Data Flow Analysis - Prior to synthesizing the architecture, A|RT Designer uses a patented data flow analysis to analyze the data dependencies in the code. This information is used to reconstruct the parallelism that is inherent in the design, but has been hidden by the sequential nature of the software description. For example, an algorithm could contain three consecutive assignments, beginning with

    p1 = a * b;

followed by a second multiplication that produces "p2" and a final statement that computes "s" from "p1" and "p2". In the C language, "p1" will be computed first, then "p2", and finally "s". On a classical processor this code fragment will require three cycles to execute. In hardware, the operations can be executed simultaneously unless there are direct data dependencies. A|RT Designer's data flow analysis would show that the value of the variable "s" depends on the values of both "p1" and "p2", so "s" cannot be calculated until those values are known. In contrast, the values of "p1" and "p2" are independent of each other and can therefore be calculated in parallel. A|RT Designer would use this information to compute "p1" and "p2" in parallel, provided that two multipliers are available in the architecture, reducing the number of required cycles from three to two without changing the behavior.

A|RT Designer's data flow analysis is not limited to isolated groups of primitive operations, as in this simple example. A|RT Designer dissects the entire program and automatically reconstructs as much parallelism as possible. This goes well beyond basic datapath operations and also covers address computations, loop folding, loop-invariant-code detection, etc.

Resource Assignment - At this stage, the optimization steps initiated by A|RT Designer are represented by an intermediate C-code transformation, which serves as the basis for the allocation of the variables to the various memory types, the assignment of operations to datapath resources, and their translation to register transfers.
Prior to this point in the design flow, register transfers have not been described at all. The result of this operation is an untimed register transfer representation of the C source code on the target architecture.

Scheduling - During scheduling, A|RT Designer orders the register transfers on the time line in as few machine cycles as possible, while taking into account timing and hardware constraints. All variables are assigned to individual register fields in such a way that the overall register requirements are minimized. During the scheduling process important optimizations are performed, such as loop folding, peephole optimization and speculation.

Loop Folding - A|RT Designer performs loop folding, in which loops are restructured to increase the available parallelism within every iteration. For example, an operation inside the loop can be moved outside the loop for the first iteration and can then be folded with another operation inside the loop to cut the number of required cycles. Consider a loop of the form FOR i=1 to 10 whose body contains two operations: the computation of "p[i]" and the computation of "s", which uses "p[i]". As there is a direct data dependency between the two operations in the loop ("p[i]" needs to be computed before "s" can be computed), the whole loop will take 10 x 2 cycles = 20 cycles to execute. However, the loop can be restructured to break this direct data dependency, without changing the behavior but with a higher degree of parallelism. The restructured code begins with a pre-computation outside the loop:

    p[1] = c[1] * in[1];

By pre-computing "p[1]", "s" is no longer dependent on the "p" element that is computed in the current iteration. Rather, it depends on the "p" value that was computed in the previous iteration. Hence the two operations can be executed in parallel. The new cycle count is 1 + (10 x 1) + 1 = 12 cycles. A|RT Designer can perform this loop folding automatically over as many iterations as required or possible. This typically results in a drastic acceleration of the computations, possibly at the cost of increased controller size and additional registers, a consequence of the increased parallelism.
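The restructuring can be sketched in C as follows. The loop body used here is an assumption: the product is taken to be `c[i] * in[i]`, as in the pre-computed statement above, and "s" is assumed to be a running sum of the products. The function names are hypothetical, and the sketch only demonstrates that the folded loop computes the same result; it does not model cycle counts.

```c
#include <assert.h>

#define N 10

/* Plain loop: the addition needs the product of the same iteration. */
long mac_plain(const int c[N], const int in[N])
{
    assert(c && in);
    long s = 0;
    for (int i = 0; i < N; i++) {
        long p = (long)c[i] * in[i];
        s += p;                            /* depends on p from this iteration */
    }
    return s;
}

/* Folded loop: each iteration adds the product of the previous iteration
   while computing the next one, so the two statements are independent. */
long mac_folded(const int c[N], const int in[N])
{
    assert(c && in);
    long s = 0;
    long p = (long)c[0] * in[0];           /* prologue: first product */
    for (int i = 1; i < N; i++) {
        long p_next = (long)c[i] * in[i];  /* independent of the add below */
        s += p;                            /* uses the previous product */
        p = p_next;
    }
    s += p;                                /* epilogue: last accumulation */
    return s;
}
```

In hardware, the multiplication and the addition inside the folded loop body can share a cycle because neither reads a value the other writes in that iteration. The extra variable `p_next` hints at the additional register that folding may require.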
Because of these costs, the user can selectively apply loop folding to the loops of his or her choice.

Peephole Optimization - After scheduling, peephole optimization performs a local optimization (like looking through a "peephole") on the schedule in order to reduce the cycle count. The optimization works on register transfers that read constants from ROM. When two fetches of the same constant to the same register file happen shortly after one another, the second read is eliminated and the constant is kept in the register file for a longer time. This optimization may save several cycles, which is especially important in critical loops. A possible drawback of the technique is that it may increase the number of register fields that are required. A|RT Designer lets the user control where and when peephole optimization is used.

Speculation - When the input description contains a conditional statement, A|RT Designer will, by default, evaluate the condition and then execute the selected branch. The evaluation of the condition happens on the controller, and the execution of the branch has to wait for the evaluation to ripple through, which may cost some cycles. Speculation follows a different approach. All the different branches are executed regardless of the condition. Only when the branches have been executed is the decision made about which end values are kept for the variables involved in the branches. The condition is evaluated on the controller while the branches execute, so the decision about which values to keep can be made immediately. A|RT Designer lets the designer apply the speculation technique to all branches or to selected branches. The result of speculation is a faster schedule, possibly at the expense of more registers. The designer may also have to allocate more resources in order to be able to execute the different branches in parallel.

INTERACTIVE ARCHITECTURAL CONTROL
Although A|RT Designer can be used to synthesize an architecture automatically, push-button style, with minimal designer intervention, it has been developed to be used interactively. A wide variety of reports is available to help the designer analyze the characteristics of the design. Based on the design constraints, the designer has a variety of global optimization options available, plus a variety of pragmas that can be used to refine the architecture at a very fine-grained level. Each design iteration can take as little as a few minutes, so the designer has the freedom to explore multiple architectures very quickly.

Design Analysis - A wide array of graphical analysis tools and reports assists the designer in identifying areas where the design may be improved. Multiple "views" are available for this purpose. The most important views are cross-linked so that the user can easily correlate the activity of the architecture with the source code. Thus, clicking on a location in the load view window where the number of cycles is very high will highlight the corresponding source code and the pertinent register transfers in the schedule report.

Load View - The "load view" displays for each resource (shown on the vertical axis) when it is active (horizontal axis) and what operation it is performing (left mouse-click). An "x" represents an operation on a resource. A second graph at the top of the load view is a histogram showing the number of iterations for which a given operation is repeated (loops).

Lifetime View - The "lifetime view" shows the lifetime of variables, from the moment of creation to final consumption, plus the intermediate time points where the variable is consumed. Variables that are "alive" require registers to store them, which can take up a lot of valuable silicon.
By analyzing this view, the designer may decide to move data from registers to less expensive RAM or to selectively re-compute variables when needed, instead of keeping them in expensive registers.

Optimizing the Architecture - A|RT Designer provides a number of means of improving the results. Pragmas may be used 1) to change, add or remove resources (e.g. an additional ALU for pipelining); 2) to assign operations differently (e.g. move a 16-bit add to a different adder, so that an 8-bit adder can be used in its place); 3) to schedule operations differently; or 4) to optimize the assignment of registers, busses or muxes. In addition, there is a series of global options for scheduling operations and generating the RT-level HDL. For example, loop folding may be invoked globally or for a particular loop with a large number of iterations. Register optimization may be invoked globally to reduce the total number of registers, or individual variables that are not frequently used may be stored in local RAM to free up registers. In this way the designer has complete control to optimize the architecture to the specific requirements of the design.

Scheduling Options - A|RT Designer provides the designer with multiple scheduling strategies, including ASAP, ALAP and ALAP Greedy. The scheduler in A|RT Designer is a list scheduler. This means that, for every consecutive time step, the scheduler creates a candidate list of register transfers (RTs) that are ready to be executed. If enough hardware is available, all RTs on the list are scheduled for this clock cycle. If there is a conflict, the RT with the longest path to the end of the schedule is scheduled for that clock cycle, and the other operations are scheduled in subsequent clock cycles. Based on this principle, A|RT Designer offers three variants. The ASAP variant (As Soon As Possible) builds the candidate list starting from the inputs and proceeding towards the outputs.
The ALAP variant (As Late As Possible) builds the list starting from the outputs and going backward to the inputs. The ALAP Greedy variant is like the ALAP variant, but considers complete paths rather than individual RTs. The result is that the most critical path is scheduled first, without interfering with other paths. No general rule can be given about which of these variants produces the fastest schedule; it depends on the input description. In general, though, the ALAP Greedy variant and, to a lesser extent, the ALAP variant will produce a solution with fewer registers than the ASAP variant.

Pragmas - Pragmas include the ability to control loop folding for trade-offs between controller area and execution speed, and to specify a maximum number of cycles to run part of the register transfer graph.

FREQUENTLY ASKED QUESTIONS
Is C the best language for high-level design?

Whether or not C (or a specific version of C) is the best language for high-level design is irrelevant to the synthesis of the architecture. Architectural synthesis is an important step in the design of complex, computationally intensive SoCs that has previously been ignored. Any language could serve as the input for A|RT Designer. The C language is currently used as the input to A|RT Designer because the vast majority of software and system-level designs are written in C. The input to A|RT Designer could as easily be Java, or even French. And if the majority of designers begin doing their software designs in French, French will become an input to A|RT Designer.

Is A|RT Designer a behavioral synthesis tool?

No, A|RT Designer is NOT a behavioral synthesis tool in the classical sense. It is an architectural synthesis tool. Behavioral synthesis tools use a behavioral HDL as their input and attempt to generate the register-transfer-level implementation that is optimal in resources; they are constrained by a target process and physical libraries. Although the designer may specify constraints, such as clock speed, he or she has very limited direct control over resource allocation or scheduling. Although behavioral synthesis can yield acceptable results for designs of fewer than 1,000 operations, it becomes rather ineffective for larger designs. The problem to be solved just becomes too big. Architectural synthesis is different in that it gives the designer complete control over the datapath architecture, and then maps the behavior, expressed in the C code, onto the architecture in an optimal way. It provides a unique combination of advanced compiler techniques and behavioral synthesis techniques.

What is the difference between A|RT Builder and A|RT Designer?

A|RT Builder translates fixed-point C algorithms to VHDL or Verilog.
The translation is "literal" in the sense that the HDL output conforms exactly to the structure of the C code. A|RT Designer is an interactive architectural synthesis tool that gives designers the ability to interactively optimize the hardware architecture prior to going to the RT level. A|RT Builder transparently translates the architecture created with A|RT Designer to Verilog or VHDL.

How does A|RT relate to SystemC and other C-based initiatives?

The design of Systems-on-a-Chip introduces new concepts, like the need to express behavior with different models of computation and the need to express communication between processes. In order to accommodate this design style, a new language is needed. At present there are several proposals for such a system-level design language:

SystemC -- SystemC is a language being developed by the Open C Initiative. It enhances C with C++ classes to express concurrency, clock cycle information and fixed-point data types. The approach is rather pragmatic and is strongly backed by EDA vendors: Synopsys, CoWare and Frontier are contributors.

Rosetta -- Rosetta is a proposal from the System Level Design Language (SLDL) Committee of VHDL International. The committee is defining a semantic framework, which could be applied to many languages, and which focuses first on the expression of design constraints. Rosetta appears to be a more fundamental approach, but a longer-term effort. A problem is that Rosetta appears to be biased toward VHDL.

Open Verilog International -- The Open Verilog International (OVI) proposal is similar to Rosetta, but is biased toward Verilog. Recently some dialog between Rosetta and OVI supporters has been initiated.

Superlog -- Superlog is a new language developed by Co-Design that is based on Verilog with some C-like extensions. The problem with any new design language is that, as history has shown with ADA, DFL and others, it is very difficult to gain acceptance for a new language.

Esterel -- Cadence has the exotic Esterel language and the accompanying Polis tool set. Currently, though, Cadence seems to be moving toward C++.

Others -- Cynapps has proposed C++ classes. A cooperation between Easics and C-Level Design also proposes C++ classes. ICL has proposed VHDL+. LavaLogic has proposed Java. IMEC has proposed TIPSY (C++ classes) and Arexsys has proposed SDL.

The acceptance of a standard system-level design language is necessary for the implementation of systems-on-chips and for hardware-software co-design. It is not a question of "if" a standard system-level language will evolve. It is a question of which of the many proposals will succeed.

Currently, the output of A|RT Designer is Verilog or VHDL. Later this year, the tools will also output SystemC. Frontier is currently backing SystemC because C(++) is very well suited for expressing behavior and it is the primary language used for system design. Currently Frontier is cooperating with Synopsys and other members of the Open C Initiative to define OpenSystemC's fixed-point data types for Version 1 of the OpenSystemC standard. As soon as the OpenSystemC standard is released, it will be supported by Frontier's tools. A|RT Designer will be the first EDA tool that generates OpenSystemC.

More importantly, however, A|RT Designer is relatively language-independent. The optimized VLIW architectures that the tool synthesizes can be output in any language. Therefore, whatever language ultimately evolves as the system-level design language, A|RT Designer will support it.

Copyright © 2003 ChipCenter-QuestLink