ChipCenter Questlink
SEARCH CHIPCENTER
Search Type:
Search for:




Knowledge Centers
Product Reviews
Data Sheets
Guides & Experts
News
International
Ask Us
Circuit Cellar Online
App Notes
NetSeminars
Careers
Resources
FAQ
EE Times Network
Electronics Group Sites

  DSP

    Technical Note

DSP Main | Archives | Feedback

Page 1 of 4

Memory Techniques Lead the Way
to Optimizing DSP Applications

Part 2: DSP Program Optimization

by Rob Oshana,
Texas Instruments

This article is the second of a two-part series that summarizes some of the techniques used to gain orders-of-magnitude speed increases from high-performance DSPs. The previous article focused on methods related to the optimum use of memories; this one looks to exploit architectural features on the DSP itself.

First, however, let's repeat a key concept from the last article. In evaluating a program, you'll generally find just a few pieces of software that execute many times. If you can save even a cycle or two out of these pieces of code, the savings will be significant. This is where you should spend your time during the tuning and optimization process.

Parallelism is the Key

Now let's move on to our new topics. The standard rule when programming superscalar and VLIW devices is Keep the pipelines full! A full pipe means efficient code. In order to determine how full the pipelines are, you should inspect the assembly code generated by the compiler. You can usually spot inefficient code by an abundance of NOPs, which indicate inefficient use of pipelines.

To demonstrate the advantages of parallelism in VLIW-based superscalar machines, start with a simple looping program (Program 1).

Program 1 - A Simple for Loop in C
1 void example1(float *out, float *input1, float *input2)
2 {
3   int i;
4
5   for(i = 0; i < 100; i++)
6    {
7      out[i] = input1[i] * input2[i];
8    }
9 }

If you were to write a serial assembly-language implementation, the code would be similar to that in Program 2. The loop in that second version uses one of the two available sides of the superscalar machine. By counting up the instructions and the NOPs, you can see that it takes 26 cycles to execute each loop iteration. It's possible to do much better.

Program 2 - Serial Assembly-Language Implementation of the for Loop from Program 1
1  ;
2  ;  serial implementation of loop
   ;  (26 cycles per iteration)
3  ;
4  L1:    LDW    *B++,B5  ; load B[i] into B5
5         NOP    4        ; wait for load to complete
6
7         LDW    *A++,A4  ; load A[i] into A4
8         NOP    4        ; wait for load to complete
9
10        MPYSP  B5,A4,A4 ; A4 = A4 * B5
11        NOP    3        ; wait for mult to complete
12
13        STW    A4,*C++  ; store A4 in C[i]
14        NOP    4        ; wait got store to complete
15
16        SUB    i,1,i    ; decrement i
17   [i]  B      L1       ; if i != 0, goto L1
18        NOP    5        ; delay for branch

Towards that goal, be sure to notice two things in this example. Many of the execution units are sitting idle—and that's a waste of processor hardware. Second, this piece of assembly code has many delay slots (20, to be exact) where the CPU is stalled waiting for data to be loaded or stored. When the CPU is stalled, nothing happens. This is the worst thing you can do to a processor when trying to crunch large amounts of data.

Let's consider ways to keep the CPU busy while it's waiting for data to arrive. For example, it does other operations that don't depend on the data you're waiting for. You can also use both sides of the superscalar architecture to help load and store other data values.

The code in Program 3 is an improvement over the serial version, a fact that becomes clear when you see that it reduces the number of NOPs from 20 to 5. It also performs some steps in parallel. Lines 4 and 5 execute two loads at the same time into each of the device's two load units (D1 and D2). This code also performs the branch operation earlier in the loop, and then takes advantage of the delays associated with that operation to complete operations on the current cycle. Notice that the code listing contains a new column, one that specifies which execution unit you want to use for a particular operation. This flexibility to specify execution units allows you to manage operations better.

Program 3 - A More Parallel Implementation of the for Loop
1  ; using delay slots and duplicate execution units
   ; of the device
2  ; 10 cycles per iteration
3
4  L1:    LDW   .D2   *B++,B5  ; load B[i] into B5
5  ||     LDW   .D1   *A++,A4  ; load A[i] into A4
6
7         NOP         2        ; wait load to complete
8         SUB   .L2   i,1,i    ; decrement i
9    [i]  B     .S1   L1       ; if i != 0, goto L1
10
11        MPYSP .M1X  B5,A4,A4 ; A4 = A4 * B5
12        NOP         3        ; wait mpy to complete
13
14        STW   .D1   A4,*C++  ; store A4 into C[i]

Next >>

 
Click here to get your listing up.

Copyright © 2003 ChipCenter-QuestLink
About ChipCenter-Questlink  Contact Us  Privacy Statement   Advertising Information  FAQ