Meeting the Synthesis Challenge of Complex Programmable Devices at 100K gates and Beyond
By Darron May (dmay@alt-tech.com), Applications Manager,
ALT Technologies (UK Distributors for Synplicity, Inc)
Introduction
Finally, this paper will open up discussion on what is needed to solve the problems presented by complex programmable logic devices as they surpass 100K usable gates. This will include tighter integration between the synthesis technology and the place & route tools so that there is a stronger link between logical and physical design. The timing estimates used by the timing driven synthesis engine need to be accurate, and the only way this can happen is if the synthesis tools understand the tradeoffs between the placement and routing choices of a particular solution. Hardware Description Languages (HDLs) and synthesis have become the preferred way of describing highly complex programmable devices. Therefore, to achieve the performance these devices offer, new tools are needed to allow partitioning and floorplanning at the Register Transfer Level (RTL). The paper will also discuss the impact of failing to consider floorplanning and hierarchy when using an HDL.
Understanding Architectures
For successful mapping from RTL to the target architecture, the synthesis tool requires built-in knowledge of the vendor's technology. A number of techniques are required to ensure that the logic elements within the technology are filled as efficiently as possible. Firstly, the synthesis tool has to do the logic element packing itself instead of leaving it to the place & route tools. This allows the synthesis tool to make tradeoffs as close to the RTL source as possible. Direct Synthesis Technology is a technique that uses the building blocks from the vendor's technology to tile the design. Tradeoffs in area and speed are made directly with these building blocks rather than with gate primitives, which would require the logic cells to be re-packed. Other facilities in these devices, such as wide decoders and flip-flops in the I/O, also have to be used to improve performance and area utilisation.
As was mentioned above, the synthesis tool needs to be able to infer the usage of arithmetic functions directly from the HDL source. Context Sensitive Module Generation is another technique required to make the best use of these devices and their dedicated resources, such as carry logic. Module Generation is the ability to generate a function in the vendor's technology to match the HDL function. Context Sensitive means that the tool understands how the module is used within the design. If the synthesis tool generates the module itself, it understands how the module is constructed and can adapt the module to fit the application. A simple approach to module generation is shown below on the left-hand side in Figure 1.
Figure 1 - Context Sensitive (Xilinx).
In the simple approach, the synthesis tool does not understand the module generated for the addition operator. The module is a black box that implements the addition operator using the carry chain within the Xilinx device. The 'ELSE' part of the design infers that a 2-to-1 multiplexer is required to switch between the output of the addition function and the input B. This is achieved with another four look-up tables. The adder and the look-up tables are written into the netlist to be passed to the back-end place & route tools. On the right-hand side we can see one of the effects of context sensitive module generation. A separate optimisation stage called boundary merging occurs, which relies on the fact that the synthesis tool understands the connectivity of cells within the module. During boundary merging, the cells within the module that connect to the module's inputs and outputs are analysed, along with the cells that they drive, and are driven by, in the rest of the circuit. It can be seen from the diagram on the left in Figure 1 that there is spare capacity in the look-up tables within the module. Boundary merging detects this and merges the 2-to-1 muxes, implemented as separate look-up tables, into the look-up tables within the module itself. This step improves both the area utilisation and the performance of the design. In some technologies the context sensitive aspect could also mean generating a different implementation for the same function depending on the constraints placed on the design. For example, within the Actel architecture it is possible to generate a slow ripple carry adder that uses less area in the device than a fast carry look-ahead adder. Within Altera FLEX10K it is possible to switch the Logic Element (LE) into different modes to use different facilities. If we were to implement the same code shown in Figure 1, the synthesis tool would generate the module with the LE switched to a different mode.
The simple approach would be to have one implementation of the adder. The context sensitive approach would be to switch the LE into up/down counter mode to allow the extra capacity in the LE to be used by the 2-to-1 mux, as shown below in Figure 2.
Figure 2 - Altera Example.
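The effect of boundary merging on the Figure 1 example can be illustrated with a small behavioural model. The sketch below is not taken from the original figure: the 4-bit width is inferred from the four multiplexer look-up tables mentioned in the text, and the function names and the trick of gating input A with the select line are assumptions chosen to show one way a mux can be absorbed into the adder's own look-up tables without changing the function.

```python
# Illustrative model only: the bit width, the function names and the gating
# trick are assumptions, not a reproduction of Figure 1.
WIDTH = 4
MASK = (1 << WIDTH) - 1


def black_box_version(a, b, sel):
    """Opaque adder module followed by a separate 2-to-1 mux (extra LUT per bit)."""
    total = (a + b) & MASK      # adder built as a black box on the carry chain
    return total if sel else b  # mux implemented in separate look-up tables


def merged_version(a, b, sel):
    """Mux folded into the adder: each adder LUT also sees the select line."""
    gated_a = a if sel else 0   # per-bit gating of A, absorbed into the adder LUT
    return (gated_a + b) & MASK


def equivalent():
    """Exhaustively compare both forms over every input combination."""
    return all(
        black_box_version(a, b, s) == merged_version(a, b, s)
        for a in range(1 << WIDTH)
        for b in range(1 << WIDTH)
        for s in (0, 1)
    )


if __name__ == "__main__":
    print("merged form matches black-box form:", equivalent())
```

Both forms compute the same result, but the merged form needs no look-up tables outside the adder, which is the area and speed gain the text attributes to boundary merging.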
Another important technique for getting the best from the target architecture during synthesis is resource sharing. Automatic Resource Sharing is the ability to recognise parts of the design where arithmetic functions are mutually exclusive and could therefore be shared. The HDL code could be written in such a way that resource sharing is explicit; however, the designer then has to do this as the code is developed. If resource sharing is done automatically by the synthesis tool, there is a good chance that more opportunities for sharing arithmetic functions will be found within the design. The tool can also decide what is best for the architecture depending on the design goals. The result of resource sharing is a reduction in the logic used for the arithmetic functions and an increase in the logic required to multiplex between their inputs. For example, suppose a design is required to compute A + B and A + C and select the answer depending on the value of a single select line. Without resource sharing, two adders would be built to carry out the additions, along with a multiplexer controlled by the select line to output the selected result. With resource sharing only one adder is used, and the select line is used to select between inputs C and B. Input A is connected to one input of the adder and the multiplexer output is connected to the other input. The general trend for resource sharing on four technologies is shown below in Figure 3.
Figure 3 - Resource Sharing.
Four designs were implemented on the four technologies, and each design is shown as a coloured circle on each of the eight lines. In each graph the bottom line represents the designs with resource sharing and the top line the designs without it. In general the designs with resource sharing took less area, but the performance of the circuit was slightly slower. The only exception to this rule was Altera, where the circuits with resource sharing were smaller but performed the same as the solutions without resource sharing.
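The A + B / A + C example from the resource sharing discussion can be written as a short behavioural sketch. The 4-bit width, the function names and the polarity of the select line (1 selects C) are assumptions made for illustration; the text does not specify them.

```python
# Behavioural sketch of resource sharing. WIDTH, the names and the select
# polarity are illustrative assumptions.
WIDTH = 4
MASK = (1 << WIDTH) - 1


def without_sharing(a, b, c, sel):
    """Two adders, then a multiplexer on the two results."""
    sum_ab = (a + b) & MASK
    sum_ac = (a + c) & MASK
    return sum_ac if sel else sum_ab


def with_sharing(a, b, c, sel):
    """One multiplexer on the operands, then a single shared adder."""
    operand = c if sel else b
    return (a + operand) & MASK


def same_function():
    """Check both structures against each other for every input value."""
    values = range(1 << WIDTH)
    return all(
        without_sharing(a, b, c, s) == with_sharing(a, b, c, s)
        for a in values for b in values for c in values for s in (0, 1)
    )


if __name__ == "__main__":
    print("shared and unshared forms agree:", same_function())
```

The shared form trades one adder for a multiplexer that now sits in front of the adder, which is consistent with the trend described for Figure 3: less area, with the possibility of a slightly longer path.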
Timing Driven
Some synthesis tools have area / speed tradeoff switches; this should not be confused with Timing Driven Synthesis. A problem with this kind of approach is that when the tool is set to map logic for speed, it tries to reduce the logic between all registers. In most designs there are paths that are critical and paths that are not, so it is a waste of resources for the synthesis tool to reduce levels of logic on non-critical paths by increasing the number of logic elements used. Timing Driven Synthesis will only restructure logic on paths defined by the timing constraints, ensuring that extra logic is only used on these critical paths. Some synthesis tools only pass constraints on to the place & route tools without using the constraints to restructure logic. This means that the constraint is only used to affect the routing part of the delay and not the levels of logic. The simple example that follows illustrates why timing driven synthesis is so important and why using constraints only in the place & route tools to achieve design performance may not be very effective. The code is shown below in Figure 4.
Figure 4 - HDL Code Example
The output SELECT_A becomes '1' when the three 3-bit vectors are set to three specific values. There are a number of possible solutions if this design is implemented in a Xilinx XC4000 device. If the synthesis tool is not timing driven, only one solution will be chosen and there is no way of influencing the tool to change the implementation. In Figure 5, below, we can see two solutions.
Figure 5 - Timing Driven Synthesis
In both solutions two FMAPs (4-input look-up tables) and one HMAP (3-input look-up table) are used. In the first solution PORT_B_ENABLE has the longest delay to the output; however, it is also the latest arriving input in the design, so its delay needs to be reduced. In the second solution the synthesis tool has been given a timing constraint on PORT_B_ENABLE to speed up the path to the output. This has caused the logic to be restructured so that PORT_B_ENABLE is an input to the look-up table that drives the output, and hence has the shortest path. The kinds of manipulation possible within the place & route tools are the swapping of inputs to a look-up table and the choice of routing resources used for the implementation. Therefore, using constraints only within the place & route tool would not be enough in many cases to get the design to run at the desired speed. In the example shown, if solution #1 were output from the synthesis tool, the three look-up table delays between PORT_B_ENABLE and the output may mean that there is not much the place & route tool can do to achieve the desired constraint on this path. This is not to say that the constraints in place & route do not play a part in the overall result. The first two look-up tables shown in the example have been given relational information to ensure that they are placed in the same CLB (Configurable Logic Block). However, the last look-up table does not have any CLB assignment. This look-up table could be placed in the same CLB as the first two or it could be placed in a separate CLB. This decision is better made by the place & route tool, based on its knowledge of the routing resources and the topology of the design.
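The difference between the two solutions can be made concrete with a simple arrival-time model. The sketch below assumes a fixed delay per look-up table level and ignores routing delay. The numeric values, the `early_inputs` group name and the function name are hypothetical; only PORT_B_ENABLE and its position (three look-up tables from the output in solution #1, one in solution #2) come from the text.

```python
# Simple arrival-time model. LUT_DELAY_NS, the arrival times and the
# "early_inputs" group are hypothetical illustration values.
LUT_DELAY_NS = 5.0


def output_arrival(arrival_ns, lut_levels):
    """Latest arrival at the output: input arrival plus LUT levels on its path."""
    return max(
        arrival_ns[name] + lut_levels[name] * LUT_DELAY_NS
        for name in arrival_ns
    )


# PORT_B_ENABLE arrives late; every other input is grouped as "early_inputs".
arrival_ns = {"PORT_B_ENABLE": 10.0, "early_inputs": 0.0}

# Worst-case number of look-up tables between each input and the output.
solution_1 = {"PORT_B_ENABLE": 3, "early_inputs": 3}
solution_2 = {"PORT_B_ENABLE": 1, "early_inputs": 3}

if __name__ == "__main__":
    print("solution 1 output arrival (ns):", output_arrival(arrival_ns, solution_1))
    print("solution 2 output arrival (ns):", output_arrival(arrival_ns, solution_2))
```

Both structures use the same number of look-up tables, yet under these assumed numbers the second one settles earlier because the late signal passes through fewer levels. Swapping look-up table inputs or choosing different routing in place & route cannot turn the first structure into the second.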
Another important factor in timing driven synthesis is how Timing Budget Management is achieved. All designs use hierarchy, so critical paths can, and do, pass through more than one hierarchical block. Allocating performance goals to each of the blocks that make up the critical paths within the design is called Timing Budget Management. Commercially available synthesis tools employ two methodologies for Timing Budget Management. The first method relies on the designer manually managing the allocation of timing constraints to hierarchical blocks. This kind of tool normally supports block-by-block synthesis, plus features that allow the designer's hierarchical blocks to be grouped and ungrouped. Creating new hierarchy allows the synthesis tool to synthesize the new group as one block. This is necessary because this kind of tool cannot automatically manage the distribution of a timing constraint between hierarchical blocks. It is up to the designer to manage the timing budget by grouping together blocks that carry the critical paths that need to be constrained, thus helping the synthesis tool to optimise these paths. Some vendors make a further recommendation about the way the HDL code should be written: if all inputs and outputs of a hierarchical block are registered, there will be no combinational paths between hierarchical blocks, which eases the problem of Timing Budget Management.
The second method of Timing Budget Management involves the tool being given the complete design hierarchy plus a set of constraints at the top level. Any propagation of constraints through the design hierarchy is carried out by the tool automatically. If it is advantageous for a number of hierarchical blocks to be merged, the tool will do so automatically to achieve the best results.
Figure 6 - Design Hierarchy
If distribution of the timing goals between the hierarchical blocks is necessary, the tool will do it automatically. The advantages of this approach are clear: the designer does not have to manipulate the design hierarchy to help the synthesis tool's algorithms, and does not have to architect the code in such a way that all inputs and outputs are registered. The following example shows how automatic Timing Budget Management saves time for the designer. In Figure 6 we can see a simple design hierarchy where module A is a sub-module of the top level and itself has three sub-modules, labelled B, C and D. There is a critical path that passes through all three of these blocks. The detail inside the blocks is shown below in Figure 7.
Figure 7 - Timing Budget Management
Block B has a registered input that drives a combinational path in block C, which in turn drives more combinational logic before a registered output in block D. A designer using a tool that leaves timing budget management to the user has three choices when faced with this problem. The first, simple answer would be to set the area versus speed control to optimise for speed in all blocks that carry the critical path. In this case the tool will reduce the levels of logic in all of the blocks and choose the fastest solution for each. To make the comparisons easier, we assume that all tools can achieve the same tradeoff; in reality packing algorithms differ, so the tradeoffs will differ too. In the example, using the fastest solution in each block would mean that 9 levels of logic are used, so with an estimate of 5 nanoseconds per level the path would take 45 nanoseconds. The 60 nanosecond constraint used in this example would be met, but at the expense of using the largest amount of logic in each block. The second choice would be to split the 60 nanosecond constraint between the blocks. In this case the designer would need to analyse the content of each block, as a block may otherwise end up over-constrained, causing the synthesis tool to struggle to meet the constraint. For example, if a 20 nanosecond constraint were given to each of the blocks in the example, block D would be over-constrained because the tradeoff shows that its fastest implementation takes 25 nanoseconds. The last choice is for the three blocks to be grouped and a single 60 nanosecond constraint applied. This approach relies on the designer understanding the hierarchy and being able to decide which blocks to group. The process is straightforward with a tool that manages the timing budget automatically. The design constraint is defined at the top level as 60 nanoseconds and the tool calculates the constraint for each of the blocks.
As the tradeoff within each of the blocks is unknown until the synthesis tool starts optimising, the advantage of this approach is that the tool can recalculate the budget for each block during synthesis to ensure the constraint is met with the least amount of logic. In the example the smallest implementations, with the greatest levels of logic, were used for blocks B and D. The largest implementation, with the lowest levels of logic, was used for block C. During the synthesis of the design, the tool knows the constraint it needs to meet and the relationship between the three blocks. Firstly, the constraint is shared out between the three blocks. After processing block B there is a 5 nanosecond slack that can be taken into consideration during the processing of block C. The tool automatically uses the slack from one block to allow it to meet the constraint across the complete path, so the constraint is met with the least amount of logic being used.
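The budgeting example above can be explored with a small model. The 60 nanosecond constraint, the 5 nanoseconds per level estimate, the 9-level (45 nanosecond) fastest path and block D's 25 nanosecond fastest implementation come from the text. The individual tradeoff points and area figures below are hypothetical, since Figure 7 is not reproduced here; they were chosen only to be consistent with those numbers. The exhaustive search is also a stand-in for illustration and is not a description of how any particular synthesis tool redistributes slack.

```python
from itertools import product

NS_PER_LEVEL = 5      # delay estimate per level of logic, from the text
CONSTRAINT_NS = 60    # top-level timing constraint, from the text

# Hypothetical tradeoff points per block as (levels of logic, relative area).
# Only D's fastest point (5 levels = 25 ns) and the 9-level total are from the text.
TRADEOFFS = {
    "B": [(2, 8), (3, 5)],
    "C": [(2, 9), (4, 6)],
    "D": [(5, 14), (7, 6)],
}


def path_delay(choice):
    """Delay of the B -> C -> D path for one implementation per block."""
    return sum(levels for levels, _ in choice) * NS_PER_LEVEL


def path_area(choice):
    """Total relative area of the chosen implementations."""
    return sum(area for _, area in choice)


def fastest():
    """Speed-optimise every block (the first manual choice in the text)."""
    return min(product(*TRADEOFFS.values()), key=path_delay)


def smallest_meeting(constraint_ns):
    """Smallest-area combination that still meets the path constraint."""
    candidates = [
        choice for choice in product(*TRADEOFFS.values())
        if path_delay(choice) <= constraint_ns
    ]
    return min(candidates, key=path_area) if candidates else None


if __name__ == "__main__":
    best = smallest_meeting(CONSTRAINT_NS)
    print("fastest path (ns):", path_delay(fastest()), "area:", path_area(fastest()))
    print("budgeted path (ns):", path_delay(best), "area:", path_area(best))
    print("levels chosen for B, C, D:", [levels for levels, _ in best])
```

With these assumed tradeoff points, the search picks the slower, smaller implementations for B and D and the fastest implementation for C, matching the outcome described in the text, whereas a fixed 20 nanosecond budget per block could not be met by block D at all.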
Embedded Synthesis
Figure 8 - Comparing Algorithms
The capacity and performance of current synthesis technology are the single biggest obstacle to overcome before it can be embedded into these new tools. Synplify from Synplicity uses algorithms that have been pioneered to enable fast run times and high quality of results. These algorithms are called the B.E.S.T™ (Behaviour Extracting Synthesis Technology) algorithms and are made up of the following four components.
Conservation Of Abstraction: the synthesis tool recognises circuit behaviour and designer intent at a high level. A comparison of the synthesis process between Synplify and other synthesis technology is shown in Figure 8. After language compilation, other tools start their optimisations on gates and then carry out technology mapping. This makes the tool sensitive to the style of the HDL code, which can impact the technology mapping stage. Synplify extracts the behaviour at a higher level, for example by recognising state machines and building state transition tables, or by producing parallel mux structures. By working on abstracted data, further global optimisations can be carried out much faster and larger designs can be handled more easily. Using a state transition table means that reachability analysis and next-state logic optimisation can be carried out instead of simply mapping combinational logic from the HDL. A typical example of Synplify's insensitivity to coding styles can be seen in the synthesis of a simple 4-to-1 mux, shown in Figure 9.
Figure 9 - Coding Styles
There are four coding examples of the 4-to-1 mux using different combinations of 'CASE' and 'IF' statements. After the compilation stage Synplify represents the design internally at the RTL level in exactly the same format. It uses a parallel mux structure that can have as many data inputs and data enables as necessary to describe the function; in this case there are four data inputs and four enables. The technology mapping stage of synthesis works directly on this internal format and is therefore insensitive to the way the original code was written. A 4-to-1 mux has 1 output and 6 inputs, so when it is implemented in the target architecture it should not matter how the code was written: a 6-input function should be efficiently implemented. For a synthesis tool based around optimisation of gates to be insensitive to the coding style, the work has to be done at the mapping stage, which makes technology mapping more complicated and therefore makes the process take longer. Another important requirement for embedded synthesis is the ability to cross-reference between the implementation and the source HDL. This is necessary for linking the post place & route timing results back to the HDL source. Synplify is able to keep cross-reference information because of its conservation of abstraction approach, so circuit debugging does not become a time-consuming task through the synthesis tool destroying the link to the source code.
Integrated Module Generation: allows the synthesis tool to identify different types of logic structure, such as datapath and control logic, and to share their implementation within the same logic resource in the target architecture. Integrated Module Generation is described in the "Understanding Architectures" section of this paper under the title of Context Sensitive Module Generation. The important feature of this kind of module generation is that the synthesis tool constructs the module itself and therefore understands the implementation completely. This makes it possible to merge logic between the module and the surrounding logic, which results in better device utilisation and performance.
Automatic Hierarchy Optimisation: allows the complete design to be processed in a single pass instead of having to split the design into separate manageable chunks. The initial design hierarchy is analysed and then further optimised by creating new "virtual" hierarchy where this is beneficial from a logic synthesis point of view. A timing budget for each block along a critical path is then generated automatically, allowing the timing goal to be converged on very quickly. The benefits of automatic timing budget management are described in the timing driven synthesis section of this paper.
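The coding-style argument for the 4-to-1 mux can be checked with a small model. The sketch below writes the same mux in an IF-style and a CASE-style form, alongside a parallel mux form with four data inputs and four decoded enables, and compares their truth tables over all 64 input combinations. It is a behavioural illustration in Python, not the code from Figure 9 and not Synplify's internal representation; all function names are hypothetical.

```python
from itertools import product


def mux_if_chain(d0, d1, d2, d3, s1, s0):
    """4-to-1 mux written in the style of nested IF statements."""
    if s1 == 0:
        if s0 == 0:
            return d0
        return d1
    if s0 == 0:
        return d2
    return d3


def mux_case(d0, d1, d2, d3, s1, s0):
    """4-to-1 mux written in the style of a CASE statement on the select bits."""
    return {(0, 0): d0, (0, 1): d1, (1, 0): d2, (1, 1): d3}[(s1, s0)]


def parallel_mux(d0, d1, d2, d3, s1, s0):
    """Four data inputs, each ANDed with its own enable, then ORed together."""
    enables = [(1 - s1) & (1 - s0), (1 - s1) & s0, s1 & (1 - s0), s1 & s0]
    result = 0
    for data, enable in zip((d0, d1, d2, d3), enables):
        result |= data & enable
    return result


def truth_table(func):
    """Output column of a 6-input, 1-output function over all 64 input rows."""
    return tuple(func(*bits) for bits in product((0, 1), repeat=6))


if __name__ == "__main__":
    tables = {f.__name__: truth_table(f) for f in (mux_if_chain, mux_case, parallel_mux)}
    print("all styles describe the same function:", len(set(tables.values())) == 1)
```

Because all three forms reduce to the same 6-input function, a mapper that starts from the common form has no reason to produce different logic for different source styles.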
Figure 10 - Linear Compile Times
Linear Compile Times: allow predictable run times that increase in a linear fashion as the size of the design gets larger. This is achieved by ensuring not only that the mapping algorithms are efficient but also that the algorithms that manage the design are efficient. This has the added benefit that the use of system resources, i.e. memory, also scales in a linear fashion with the size of the design. The graph in Figure 10 shows the difference in run times between traditional algorithms and the ones used within Synplify as the complexity of the design increases. A recent benchmark showed that a design of over 200K usable gates compiled in just over 10 minutes. In comparison, the same design took a number of hours to compile on competitive tools, with worse results.
Floorplanning
Floorplanning is a design task that can benefit from Embedded Synthesis. Embedding synthesis makes it possible to share high-level design information between synthesis and floorplanning so that logic optimisation and placement can occur simultaneously. The need for this kind of technology arises from the deep submicron effects that have changed the design methodology for large ASIC devices. The threshold, in terms of gate count, is lower for complex programmable logic devices than for ASICs, since the ratio of routing resources to logic resources is lower. The actual threshold at which floorplanning becomes necessary to achieve performance goals is difficult to state, as designs can vary significantly. However, 40-50K usable gates seems to be the threshold where floorplanning becomes necessary to achieve performance design goals. Circuit performance may improve by as much as 20% compared with a methodology of performing floorplanning and synthesis as separate tasks. Smaller devices, in the 20K gate range, can also benefit significantly from being floorplanned if they contain datapath-intensive applications.
Figure 11 - Floorplanning
The basic flow of a floorplanner with embedded synthesis can be seen in Figure 11. After language compilation, the floorplanner allows the designer to guide the physical implementation, which gives the synthesis tool detailed knowledge of the timing and therefore lets it run interactive mapping to improve performance. Technology mapping is finally carried out and the netlist, including placement information, is passed to the vendor's place & route tools. The floorplanner allows large designs with lots of hierarchy to be interactively managed and guided by the designer, matching the logical hierarchy to a physical hierarchy. This has the effect of both shortening place & route times and reducing the number of iterations necessary to achieve the timing goal, as very accurate estimates of timing can be used during synthesis. Place & route times are increasing as designs get larger, mainly because there are so many combinations of placement to try for any given implementation. Having timing driven place & route tools helps to achieve timing goals; however, putting more constraints on the placement can result in the tool having to try too many combinations. Allowing these constraints to be achieved via embedded synthesis means that decisions can be made at a higher level, easing some of the work of the place & route tool. One of the most difficult questions to answer prior to place & route is "will the design route successfully in the target device?". A floorplanner should be able to give reports on routability, detailing any routing congestion and what device would be necessary to route the design.
One of the most important factors in an HDL design is the use of hierarchy. Almost every design uses hierarchy, and correct use of it can avoid many design bottlenecks, including in optimisation and debugging. Conversely, poor use of hierarchy can lead to extended design cycles. The first stages of an HDL design should always include a decomposition of the design into a hierarchy. A number of factors will influence this decomposition, including the size of the design, the complexity of different parts of the design, the number of designers, parts that need to be re-used, the architecture of the target device and blocks that may have been pre-designed. Correct choice of hierarchical structure early in the design phase can drastically reduce the overall design time by simplifying the coding, compilation, simulation, floorplanning and optimisation steps of the design. When deciding upon a design hierarchy it is important to take into consideration how the place & route tools work. They are influenced by the net connections between logic blocks, which means that they will naturally keep together the blocks that have the most net connections. So if the design is broken into blocks that are grouped together, it is important to minimise the wire connections between the blocks. Routing 50 wires from one corner of the chip to the other is more difficult than routing just 10 wires. Minimising these connections has the added benefit of making the design faster. The best approach is to design all of the hierarchical blocks in the HDL design before you begin coding. Determine what the functionality and pinout of each block will be before you write your first equation. This will save a great deal of time later on in the design cycle.
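The advice to minimise connections between hierarchical blocks can be quantified by counting the nets that cross block boundaries for a candidate grouping. The sketch below is a minimal illustration; the cell names, net list and the two groupings are hypothetical.

```python
def crossing_nets(nets, block_of):
    """Count nets whose pins are spread over more than one hierarchical block."""
    return sum(1 for pins in nets if len({block_of[pin] for pin in pins}) > 1)


# Hypothetical netlist: each net is the tuple of cells it connects.
nets = [
    ("u1", "u2"), ("u2", "u3"), ("u1", "u3"),
    ("u3", "u4"),
    ("u4", "u5"), ("u5", "u6"), ("u4", "u6"),
]

# Grouping A keeps tightly connected cells together; grouping B interleaves them.
grouping_a = {"u1": "X", "u2": "X", "u3": "X", "u4": "Y", "u5": "Y", "u6": "Y"}
grouping_b = {"u1": "X", "u2": "Y", "u3": "X", "u4": "Y", "u5": "X", "u6": "Y"}

if __name__ == "__main__":
    print("nets between blocks, grouping A:", crossing_nets(nets, grouping_a))
    print("nets between blocks, grouping B:", crossing_nets(nets, grouping_b))
```

A grouping with fewer crossing nets gives the place & route tool fewer long wires to route between blocks, which is the reason the text recommends fixing block functionality and pinout before coding starts.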
Partitioning
Figure 12 - Partitioning
Embedding synthesis into the partitioner and allowing the designer to interactively move HDL blocks between devices enables the tool to give accurate area and speed reports. As embedded synthesis is fast even when handling very large designs, it allows the designer to explore many partitioning alternatives to find the best solution. Figure 12 shows that the best partition is not always on the logical hierarchical boundaries.
Conclusion
Copyright © 2003 ChipCenter-QuestLink