ChipCenter Questlink
SEARCH CHIPCENTER
Search Type:
Search for:




Knowledge Centers
Product Reviews
Data Sheets
Guides & Experts
News
International
Ask Us
Circuit Cellar Online
App Notes
NetSeminars
Careers
Resources
FAQ
EE Times Network
Electronics Group Sites

  DSP

    Technical Note

DSP Main | Archives | Feedback

Page 1 of 2

Memory Techniques Lead the Way
to Optimizing DSP Applications

Part 1: Guidelines to Develop Applications

by Rob Oshana,
Texas Instruments

Programming DSPs involves understanding the application, the hardware architecture, and the code-generation tools used to produce efficient real-time software that meets the system deadlines. This article, the first of a two-part series that reviews some important DSP software and system optimization techniques, explains some of the guidelines to use to develop efficient embedded applications with these powerful processors.

First Rule of Optimization—Don't!

Before beginning any optimization effort, you must know where to look. From a performance point of view, all software is not created equal! You must first understand where the bottlenecks are. Once you have the application profiled, then the process of tuning the code can begin.

Profiling an application means measuring the time spent (or memory used, or power consumed) in each part of the code. Certain pieces of software are executed only once (initialization) or a few times. It doesn't make sense to spend a lot of time trying to optimize this part of the code because the overall savings to be gained are relatively minor. Most likely, there are a few pieces of software that execute many times and, although the code itself might be small, the fact that the code executes often means the total cycles spent in the code can be significant. If you can save even a cycle or two out of these pieces of code, the savings are significant. This is where you should spend your time during the tuning and optimization process.

Memory Dependencies

Processors store instructions and data in memory. Although many innovative ways have been created to fetch instructions and data from memory, there is still a penalty to pay when accessing instructions and data. This is pure overhead. Anything you can do to reduce the time waiting for instruction or data accesses improves the overall performance of the application. Hardware cache systems, for example, have been proven to increase the overall performance by putting as many instructions as possible close to the CPU so access is fast, generally a single cycle. DSPs have on-chip memory that holds instructions as well as data. The placement of data and instructions in on-chip memory, however, is not automatic. The programmer must manage this aspect, but if they perform this management efficiently, the DSP can use on-chip memory to improve performance significantly.

The embedded-system memory hierarchy consists of several levels (Figure 1). The first level consists of chip registers. This memory holds temporary and intermediate data. The compiler uses these registers when scheduling instructions. This memory is the fastest and most expensive (the more registers on the device, the larger the device, which means fewer devices on the silicon wafer, which means more wafers to get the same amount of devices…you get the message).

Figure 1

Figure 1—Memory Hierarchy for DSP Devices

The next level of memory is the cache system. This is also fast (and expensive), and is used to move instructions and data close to the CPU just before that function block uses the instructions or data.

The next level is "external" or "off-chip" memory, which tends to be slower (and less expensive) than the other memory types. This is generally where data and instructions are kept when not being used (longer term storage). Accessing information from this memory includes more handshaking and control, and therefore takes longer.

The primary goal for real-time embedded designers is to get anything the system will use as close to the CPU as possible. The means getting information from external memory into the faster memories, and using techniques such as Direct Memory Access (DMA), compilation techniques, or architectural techniques.

Engineers have turned to hardware-architecture techniques to enhance the performance of processors using pipelining concepts. The principle of a pipelined processor isn't much different than an automobile assembly line—each car moves through the line, being constructed step by step. There are multiple cars on the assembly line at the same time, each car at a different point in the assembly process. At the end of the assembly line a new car emerges, followed closely by another new car, and so on. It was discovered a long time ago that it's more cost-effective to start putting a new car together before the previous one was completed. It was a way to keep the available workers busy doing more work and less time idle.

The same effect holds true in pipelined processors. A pipelined processor can start a new task before it has completed a previous task. The completion rate becomes a matter of how often a new instruction can be introduced. As shown in Figures 2a and 2b, the completion time of an instruction doesn't change, but the completion rate of instructions improves.

To improve performance even further, a system can employ multiple pipelines. This approach is called superscalar, and further exploits the concept of parallelism (Figure 2c). Some of today's high-performance DSPs such as the Intel i860 have a superscalar design.

Figure 2

Figure 2—Execution Timeline of Nonpipelined (a), Pipelined (b),
and Superscalar (c) Architectures

Exploiting that parallelism in DSPs that have multiple independent executions units by executing multiple independent instructions at the same time provides an immediate boost to performance. The key is finding the "n" different instructions that are independent. Sometimes this job is done by the hardware and other times by software (compilation). Very Long Instruction Word (VLIW) processors (such as the TI C6200 DSP generation, Figure 3) use compilation techniques to schedule as many as eight independent instructions on eight orthogonal processor execution units. Data dependencies between the instructions often limit this to something less than the maximum rate, but significant performance is still realized. Many times you can restructure algorithms to take advantage of the architecture in order to realize the benefits of the multiple execution units.

Figure 3

Figure 3—Superscalar Architecture for TMS320C6200

A superscalar architecture offers more parallelism than a pipelined processor. But unless the code contains an algorithm or function that can exploit this parallelism, the extra pipe can go unused, reducing the amount of parallelism achieved. An algorithm written to run fast on a pipelined processor might not run nearly as efficiently on a superscalar processor. As an example, consider the algorithm in Figure 4a, which is written to take advantage of a pipelined processor. It shows a common way of computing a polynomial on a serial processor because it eliminates the need to compute p**8, p**7, and so on. This saves cycles as well as registers to hold intermediate values.

However, this isn't the most optimum way to calculate the expression for a superscalar device. The parentheses in this algorithm unfairly constrain the compiler to compute this expression in order, which eliminates any possibility of parallelism. If you instead decompose this expression into a number of independent expressions, the compiler can schedule these independent expressions on the parallel pipelines of a superscalar device in any convenient order. The resulting calculation uses fewer instruction cycles and a few more registers (Figure 4b).

(a)
rp = (((((((R8*p + R7) * p + R6) * p + R5) * p + R4) * p + R3) * p + R2) * p + R1) * p
 
(b)
p2 = p * p
p3 = p * p * p
.
.
.
p8 = p * p * p * p * p * p * p * p
---------------------------------------------
R1p1 = R1 * p
R2p2 = R2 * p2
.
.
.
R8p8 = R8 * p8
---------------------------------------------
rp = 0.0F
rp += R1p1
.
.
.
rp += R8p8

Figure 4—Algorithm Written to Run Fast on a Pipelined Processor (a), and
the Same Algorithm Modified to Run Fast on a Superscalar Processor (b)

This example shows why the programmer must have an understanding of the device architecture, the compiler, and the algorithm in order to determine the fastest way of executing any particular function. But now let's look at other ways to speed up the computation of functions using these high-performance devices.

Next >>

Click here to get your listing up.

Copyright © 2003 ChipCenter-QuestLink
About ChipCenter-Questlink  Contact Us  Privacy Statement   Advertising Information  FAQ