Introduction
I've mentioned before that
the first step in code optimization is choosing the right tool for the job. Many
times, though, a programmer doesn't have that luxury. Wouldn't it be nice if we
could program all of our graphics-intensive apps with the Sun V9 ISA and VIS extensions?
The choice of processor architecture made at the outset is often a compromise
in itself, which is why it is up to the programmer to optimize the code of
a given application for the platform at hand.
Let's take a look at a common set of math operations that are used in video
applications: the transform function. How could we optimize code for such
calculations using the Intel platform? Should we use MMX technology, or SSE,
or SSE2? Is one ISA extension any better than the others at this type of thing?
The Transform Operation
First, to put everything in context, let's briefly examine the operations we
will study here. In the field of 3D graphic animation, transformation refers
to the process of creating a new set of coordinate points as an object moves
through space, using the previous coordinates and other parameters as input.
Within the transformation process itself it is often necessary to translate,
scale, rotate, or change perspective of a coordinate set, and sometimes all
of these procedures must be done simultaneously. The transformation process
is based on a set of matrices. The input matrix contains each set of coordinates
and a scaling factor for perspective correction. Therefore, when working in
three dimensions, each coordinate set has 4 values: x, y, z, and the scaling factor. There are also individual
matrices (the transform matrices) for performing translation, scaling, rotation,
etc. To effect the operation, the matrices are multiplied together, resulting
in a 4 × 4 matrix-vector operation. Note that it is possible to re-use
the transform matrices if their values still apply to the current operation.
Keep in mind that within these operands it is possible to introduce other mathematical
operations, such as sine and cosine during a rotation. Multiplication of inverse
values (division) is also used for the scaling factor of the coordinates, and
during a change of perspective. Final results are achieved through repetitive
additions.
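As a rough illustration of the arithmetic described above, here is a minimal scalar sketch in C. The function names (`mat4_mul`, `mat4_transform`, `perspective_divide`), the row-major storage, and the column-vector convention are assumptions made for this example, not something prescribed by any particular graphics library.

```c
#include <assert.h>

/* All matrices are 4 x 4 and row-major: element (row r, column c) is
 * m[4 * r + c]. These names and conventions are illustrative assumptions. */

/* out = a * b. Applying `out` to a point equals applying b first, then a,
 * so several transform matrices can be combined once and then re-used. */
void mat4_mul(const float a[16], const float b[16], float out[16])
{
    float tmp[16];  /* temporary so that out may alias a or b */
    for (int r = 0; r < 4; ++r) {
        for (int c = 0; c < 4; ++c) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k)
                sum += a[4 * r + k] * b[4 * k + c];
            tmp[4 * r + c] = sum;
        }
    }
    for (int i = 0; i < 16; ++i)
        out[i] = tmp[i];
}

/* out = m * in, where in = (x, y, z, w) is treated as a column vector.
 * Each output value is four multiplications followed by additions. */
void mat4_transform(const float m[16], const float in[4], float out[4])
{
    float tmp[4];  /* temporary so that out may alias in */
    for (int r = 0; r < 4; ++r)
        tmp[r] = m[4 * r + 0] * in[0] + m[4 * r + 1] * in[1]
               + m[4 * r + 2] * in[2] + m[4 * r + 3] * in[3];
    for (int r = 0; r < 4; ++r)
        out[r] = tmp[r];
}

/* Apply the scaling factor: one division to get 1/w, then multiplications. */
void perspective_divide(float p[4])
{
    assert(p[3] != 0.0f);
    float inv_w = 1.0f / p[3];
    p[0] *= inv_w;
    p[1] *= inv_w;
    p[2] *= inv_w;
    p[3] = 1.0f;
}
```

Combining a translation and a scale with `mat4_mul` and then calling `mat4_transform` once per point is the kind of re-use mentioned above: the matrix product is paid for once, and each point then costs 16 multiplications and 12 additions.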
Our Options
The MMX instructions allow us to perform "packed integer" operations by reading
the contents of a single 64-bit register as multiple operands: eight 8-bit,
four 16-bit, or two 32-bit integers. The main disadvantage
to this is that we cannot use the processor's floating-point unit
at the same time, since the MMX registers are mapped onto the floating-point
register file. Switching between the two has a cost, so careful programming is in order.
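To make "packed integer" concrete, the following is a portable scalar model, in C, of what the MMX `PMADDWD` instruction computes: it multiplies four pairs of signed 16-bit values and adds adjacent products into two 32-bit results. The function names `pmaddwd_model` and `dot4_s16` are hypothetical, and this is a sketch of the arithmetic only, not actual MMX code.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of PMADDWD: a and b each stand for one 64-bit MMX register
 * viewed as four signed 16-bit values; out stands for the result register
 * viewed as two signed 32-bit values. */
void pmaddwd_model(const int16_t a[4], const int16_t b[4], int32_t out[2])
{
    for (int i = 0; i < 2; ++i) {
        int64_t sum = (int64_t)a[2 * i] * b[2 * i]
                    + (int64_t)a[2 * i + 1] * b[2 * i + 1];
        /* The only overflow case (all four inputs equal to -32768) wraps
         * around on real hardware; this model simply excludes it. */
        assert(sum <= INT32_MAX);
        out[i] = (int32_t)sum;
    }
}

/* One row of a 4 x 4 integer transform matrix times one coordinate set is a
 * 4-element dot product: one PMADDWD-style step plus one final addition.
 * The result is widened to 64 bits here so the last addition cannot
 * overflow. */
int64_t dot4_s16(const int16_t row[4], const int16_t vec[4])
{
    int32_t halves[2];
    pmaddwd_model(row, vec, halves);
    return (int64_t)halves[0] + halves[1];
}
```

Seen this way, a full 4 × 4 transform of one coordinate set is four such dot products, which is why the 16-bit packed multiply is a natural fit for this workload.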
Intel's Streaming SIMD Extensions (SSE), introduced with the Pentium III processor,
enable the programmer to perform single-precision (32-bit) floating-point vector
operations on several operands with a single instruction. This works by
reading the contents of a single 128-bit register as four 32-bit values, therefore
making more efficient use of registers and issuing fewer instructions to do
the same job. The advantage here is that resource contention has been eliminated
by giving SSE instructions their own register file.
SSE2 instructions use the same 128-bit register file, and extend it to
"packed floating-point" operations in double precision (two 64-bit values per
register) as well as packed integer operations. These
extensions are included with the Pentium 4 processor.
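For comparison, here is a hedged sketch of the same transform written with SSE intrinsics, assuming the 128-bit SSE registers each hold four single-precision values. The function name `transform_point_ps` and the column-major layout are assumptions chosen for the example. The intrinsics used (`_mm_loadu_ps`, `_mm_set1_ps`, `_mm_mul_ps`, `_mm_add_ps`, `_mm_storeu_ps`) come from `<xmmintrin.h>`. A plain C fallback is included so that the code still builds where SSE is not available.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics */
#endif

/* cols holds a 4 x 4 matrix in column-major order: column c occupies
 * cols[4 * c] .. cols[4 * c + 3]. Computes out = M * in by scaling each
 * column by one input component and summing the four scaled columns. */
void transform_point_ps(const float cols[16], const float in[4], float out[4])
{
    assert(cols != NULL && in != NULL && out != NULL);
#if defined(__SSE__)
    /* Each multiply and each add below works on four floats at once. */
    __m128 acc = _mm_mul_ps(_mm_loadu_ps(cols + 0), _mm_set1_ps(in[0]));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(cols + 4), _mm_set1_ps(in[1])));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(cols + 8), _mm_set1_ps(in[2])));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(cols + 12), _mm_set1_ps(in[3])));
    _mm_storeu_ps(out, acc);
#else
    /* Scalar fallback with the same evaluation order. */
    float tmp[4];
    for (int r = 0; r < 4; ++r)
        tmp[r] = cols[r] * in[0] + cols[4 + r] * in[1]
               + cols[8 + r] * in[2] + cols[12 + r] * in[3];
    for (int r = 0; r < 4; ++r)
        out[r] = tmp[r];
#endif
}
```

The column-major layout is a design choice for this sketch: it lets each matrix column be loaded straight into one register, so the whole transform takes four multiplies and three adds with no shuffling, and no conversion to integer format is needed.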
Our Mission, Should We Choose To Accept It
In developing and optimizing software, we have a couple of important issues
to take into account. The first is the "installed base" of the instruction set
we choose to work with, and the second is the amount of information available
to help us with optimization for that platform. In order to take advantage of
broader product availability, and the availability of more detailed information
for use in optimization, we might elect to stay a "notch" or two behind "leading
edge," so to speak. To create highly effective software that will run on some
P5 chips, as well as all chips of the P6 architecture, we might want to target
our code optimization to achieve leverage with the MMX extensions. This topic
will be our focus here. For general information on optimizing for the Intel
P6 Microarchitecture, please refer to my previous article, Understanding
the Intel P6 Microarchitecture.
The Challenges
The decision to use MMX Technology for 3D graphics applications is not one
that should be taken lightly. There are several issues intrinsic to this method
that present themselves during the course of programming, especially when hand-optimizing
at the assembly language level. MMX instructions by nature limit us to a precision
of 16 bits and the use of integer math, although this may not be enough of a
problem to deter us. What is the best we can do with these limitations? According
to Intel, "angles can be resolved to about 1/1000°, and screen coordinates
to about 1/50 pixel." This is quite adequate for most 3D graphics applications.
Another issue to consider is that popular off-the-shelf software applications,
such as CAD, produce transform data in floating-point format. This means that
we will need to convert this data to integer math for MMX processing. In order
to maintain the best possible accuracy, the conversion process should be performed
as few times as possible. One reason for this is that many significant digits
are "shifted out" during the conversion, leaving us with less precise data.
Unfortunately, the conversion must take place every time the transform matrix
is re-calculated. This particular issue would make a good case for the use of
SSE or SSE2 instructions, which were made available with the introduction of
the Pentium III and Pentium 4 processors, respectively.
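The conversion step can be sketched as follows. This is a minimal example, assuming a signed 16-bit fixed-point format with a caller-chosen number of fractional bits (14 is used in the note below, which suits rotation coefficients between -1 and 1). The function names and the choice of format are assumptions for illustration, not Intel's published method.

```c
#include <assert.h>
#include <stdint.h>

/* Convert one float to signed 16-bit fixed point with `frac_bits` fractional
 * bits, rounding to nearest and saturating on overflow. NaN inputs are not
 * handled. */
int16_t float_to_fixed16(float value, int frac_bits)
{
    assert(frac_bits >= 0 && frac_bits <= 15);
    float scaled = value * (float)(1 << frac_bits);
    scaled += (scaled >= 0.0f) ? 0.5f : -0.5f;   /* round half away from zero */
    if (scaled >= 32767.0f)
        return INT16_MAX;
    if (scaled <= -32768.0f)
        return INT16_MIN;
    return (int16_t)scaled;                      /* truncation after the bias */
}

/* Inverse conversion, useful for checking how much precision was lost. */
float fixed16_to_float(int16_t value, int frac_bits)
{
    assert(frac_bits >= 0 && frac_bits <= 15);
    return (float)value / (float)(1 << frac_bits);
}

/* Convert a whole 4 x 4 transform matrix (16 floats) in one pass. This is
 * the work that has to be repeated whenever the matrix is re-calculated. */
void matrix_to_fixed16(const float src[16], int16_t dst[16], int frac_bits)
{
    for (int i = 0; i < 16; ++i)
        dst[i] = float_to_fixed16(src[i], frac_bits);
}
```

With 14 fractional bits, a value such as 0.5 maps to 8192 and converts back exactly, while most values come back with an error of roughly half of one step (about 0.00003). Everything below that step size is what gets "shifted out" in the conversion.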
Copyright © 2003 ChipCenter-QuestLink