|
|||||||||||||||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||||||||||||
|
|
DSP Main | Archives | Feedback Page 1 of 4 Memory Techniques Lead the Way by Rob Oshana, This article is the second of a two-part series that summarizes some of the techniques used to gain orders-of-magnitude speed increases from high-performance DSPs. The previous article focused on methods related to the optimum use of memories; this one looks to exploit architectural features on the DSP itself.
First, however, let's repeat a key concept from the last article. In evaluating a program, you'll generally find just a few pieces of software that execute many times. If you can save even a cycle or two out of these pieces of code, the savings will be significant. This is where you should spend your time during the tuning and optimization process.
Parallelism is the Key
Now let's move on to our new topics. The standard rule when programming superscalar and VLIW devices is Keep the pipelines full! A full pipe means efficient code. In order to determine how full the pipelines are, you should inspect the assembly code generated by the compiler. You can usually spot inefficient code by an abundance of NOPs, which indicate inefficient use of pipelines.
To demonstrate the advantages of parallelism in VLIW-based superscalar machines, start with a simple looping program (Program 1).
If you were to write a serial assembly-language implementation, the code would be similar to that in Program 2. The loop in that second version uses one of the two available sides of the superscalar machine. By counting up the instructions and the NOPs, you can see that it takes 26 cycles to execute each loop iteration. It's possible to do much better.
Towards that goal, be sure to notice two things in this example. Many of the execution units are sitting idleand that's a waste of processor hardware. Second, this piece of assembly code has many delay slots (20, to be exact) where the CPU is stalled waiting for data to be loaded or stored. When the CPU is stalled, nothing happens. This is the worst thing you can do to a processor when trying to crunch large amounts of data.
Let's consider ways to keep the CPU busy while it's waiting for data to arrive. For example, it does other operations that don't depend on the data you're waiting for. You can also use both sides of the superscalar architecture to help load and store other data values.
The code in Program 3 is an improvement over the serial version, a fact that becomes clear when you see that it reduces the number of NOPs from 20 to 5. It also performs some steps in parallel. Lines 4 and 5 execute two loads at the same time into each of the device's two load units (D1 and D2). This code also performs the branch operation earlier in the loop, and then takes advantage of the delays associated with that operation to complete operations on the current cycle. Notice that the code listing contains a new column, one that specifies which execution unit you want to use for a particular operation. This flexibility to specify execution units allows you to manage operations better.
|
||||||||||||||||||||||||||||||||||||||
|
Copyright © 2003 ChipCenter-QuestLink About ChipCenter-Questlink |
|||||||||||||||||||||||||||||||||||||||