Debugging a PCI Bus With a Mixed-Signal Oscilloscope
by Vivian Patlin
Though the Peripheral Component Interconnect (PCI) bus has been a popular parallel bus for years, identifying signal-integrity issues on it can still tie up a project, sometimes for months.
It's axiomatic that the faster an engineering team can complete a design, the sooner the product hits the market, and the sooner
revenue begins arriving. One of the most time-consuming tasks is debugging the signal anomalies that pop up. In the past, designers have
found that it generally required several test instruments to fully test and debug a parallel bus such as the PCI bus.
Well-Known Challenges
The challenges of debugging a PCI-based system are well known. For starters, there are many bus lines to monitor, and that means
too many signals to view at once.
That situation can be exacerbated by difficulties in detecting analog anomalies riding on digital signals. As such, it's crucial for
digital designers to be able to examine the analog characteristics of digital signals.
Since no single instrument sufficed to do this in the past, debugging often required the use of a number of instruments to detect anomalies.
But many signal defects were still missed, sometimes because of either slow display update rates or instruments with too much deadtime at their
inputs.
Some Pitfalls
A logic analyzer is fine for looking at all bus lines simultaneously, but since it's strictly a digital signal tool, it captures data as ones
and zeros. You can't view detailed signal behavior such as ringing or bounce, or measure rise and fall times, as you can with an oscilloscope.
Sometimes a PCI exerciser and analyzer are used. The combination can connect to all the lines of a PCI bus and provide a number of timing checks
for bus events. Timing violations can be detected and isolated more easily with this tool. But again, it doesn't enable you to view and analyze an
error in detail.
To view and analyze signal integrity, digital storage oscilloscopes (DSOs) have, until recently, been considered the best tools for the
job. They're designed specifically to look at signal characteristics in detail. However, since their channel count is typically limited to four
channels, it's sometimes difficult to trigger properly on PCI bus events. This means that you need to cross-trigger the DSO with either an exerciser/analyzer tool or a
logic analyzer to look at signals where there's complex multi-line triggering. That can be cumbersome. A mixed-signal oscilloscope (MSO) can fill
the gaps.
MSOs combine the signal analysis capabilities of oscilloscopes with the multi-timing measurement capabilities of logic analyzers. In effect, an
MSO unites, in a single enclosure, the best features of an oscilloscope with a logic analyzer. With an MSO you can both trigger and view the
signal-integrity issues on a PCI bus.
The Agilent Technologies Model 54832D is an example of a mixed-signal oscilloscope (MSO). It's called mixed signal because it provides several analog inputs and sufficient digital inputs to enable viewing many signals at once, and to permit triggering on any of many bus lines.
This example MSO can measure and display two analog signals (represented in the upper portion of the screen) and 16 digital channels (represented in the lower portion of the screen) at once, and with all 18 channels time-aligned.
Each of the two analog channels of this MSO provides 600 MHz of bandwidth. The MSO's standard acquisition memory enables capturing up to 8 Mbytes.
This instrument thereby combines the detailed signal-analysis capability of a scope with the multi-timing measurements of a logic analyzer. Intended for designs with lots of digital signals, it enables you to see complex interrelationships among all displayed signals. Its high-definition display is mapped into 32 levels of intensity that disclose subtle details instantaneously.
A Real-World Example
To see how this is done, let's look at a real problem the author and her design team faced while debugging a PCI parallel bus. In the early stages of this project, work moved along smoothly. Prototypes came back from manufacturing and seemed to function properly. The system's firmware development was on schedule, and the project was cruising toward completion.
Then a fly landed in the honey. Some new boards started to fail sporadically. Everything would seem to be working fine, and then suddenly systems would crash, and crash hard. A full system shutdown and reboot were required to get up and running again.
This posed a huge problem for our project team. What's more, our manufacturing-line folks couldn't consistently get new systems to run through all the required parametric tests.
Other engineers on our team couldn't proceed through environmental testing. The project's firmware engineers, for example, had to reboot frequently,
sometimes several times each day. That caused delays in firmware quality assurance.
We studied the problem and initially deduced that the PCI bus was locking up intermittently. It would run fine for a while and then would just hang up. It appeared that it was going into some sort of deadlock situation where everything seemed to be up, but no work got done.
Making the Bugs Repeatable
Since the problem was intermittent, the first step was to find a way to trigger it on demand. After some poking around and collaboration between our software and hardware teams, we discovered a way to make the problem appear more frequently, if not quite in a reproducible and repeatable way.
Studying the problem intensely, we discovered that it would occur more frequently if certain paths in the system's software were exercised heavily. Specifically, running a software test cycle that exercised the PCI bus and the devices connected to it caused problems. Now it was a question of where
and why.
Scrutinizing the Hardware
The hardware consisted of a printed-circuit board (PCB) loaded with lots of custom components and ASICs. Our area of interest was the 32-bit, 33-MHz PCI bus, which had five to seven devices connected to it. A large firmware base was driving it.
A typical 32-bit PCI device requires 47 pins if it acts only as a Target and 49 pins if it can also act as a Master. Our components all used 49 lines, since all
the devices were required to behave as Masters occasionally.
Of the 49 lines, 32 were multiplexed Address and Data lines. Two lines were used for error reporting, and one line was a parity bit for the Address/Data
lines. The rest of the lines were control lines used to coordinate the use of the PCI bus by multiple devices. Since the problem we were facing was a lockup,
our interest focused on the interaction of those control lines.
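The line counts above can be checked with a little arithmetic. The following is only a sketch built from the numbers quoted in the text; the constant and function names are ours, chosen for illustration.

```python
# Pin budget for a 32-bit PCI device, using only the counts quoted in the text.
# All names here are illustrative, not taken from any PCI tool or library.
TOTAL_MASTER_PINS = 49   # device that can act as a Master
TOTAL_TARGET_PINS = 47   # Target-only device
ADDRESS_DATA_LINES = 32  # multiplexed Address/Data lines
ERROR_LINES = 2          # error-reporting lines
PARITY_LINES = 1         # parity bit for the Address/Data lines


def control_line_count(total_pins: int) -> int:
    """Lines left for bus control after Address/Data, error, and parity lines."""
    return total_pins - ADDRESS_DATA_LINES - ERROR_LINES - PARITY_LINES


master_control = control_line_count(TOTAL_MASTER_PINS)
target_control = control_line_count(TOTAL_TARGET_PINS)
print(master_control, target_control)  # 14 12
```

By this count, a Master-capable device leaves 14 lines in the "rest" category, which is a small enough set to probe with a 16-channel digital input.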
Enter the MSO
To help get a handle on the problem, we decided to use an Agilent Model 54832D 600 MHz Deep-Memory MSO. It provided 16 digital timing channels and four
analog channels.
By running basic Write and Readout tests, we noticed that the address lines of one of the PCB's devices would occasionally receive the wrong address. That
is, the sequence returned was not always what was sent.
For example, an ABCDEF sequence of addresses sent to the device would sporadically be read back as ABCFEF. As such, it made sense to look closely
at the address phase of a PCI bus transaction. The MSO's state trigger handled this nicely.
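A Write and Readout check of this kind boils down to comparing what was sent with what came back. The snippet below is a minimal sketch of that comparison, not our actual test code; the function name is hypothetical, and the two sequences are the ones from the example above.

```python
def find_mismatches(sent, read_back):
    """Return (index, sent_value, read_value) for every position that differs."""
    return [
        (i, s, r)
        for i, (s, r) in enumerate(zip(sent, read_back))
        if s != r
    ]


sent = list("ABCDEF")       # address sequence written to the device
read_back = list("ABCFEF")  # sequence sporadically read back

mismatches = find_mismatches(sent, read_back)
print(mismatches)  # [(3, 'D', 'F')]
```

Logging the position and values of each mismatch, rather than a simple pass/fail, helps show whether the failures cluster on particular addresses.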
To begin, we hooked up several control lines from the PCI bus: FRAME#, IRDY#, TRDY#, DEVSEL#, GNT0, and
CLK. We then set the oscilloscope to trigger in its advanced AND state/pattern mode.
As CLK provides the basic timing for the PCI bus, all the other lines we connected were sampled on the rising edge of CLK. CLK was used
as the clock in the state trigger.
FRAME#, asserted when a transaction occurs, needed to be asserted (low) in our trigger since we weren't interested in non-transaction phases.
IRDY# and TRDY# are asserted when the initiator (Master) and the Target, respectively, are ready for data transfer.
Since we weren't interested in the data phases of the transaction, we wanted both IRDY# and TRDY# to be de-asserted (high). DEVSEL#
indicates when the device has decoded its address. Since we were interested in the address phase itself, it was set up to be de-asserted (high). This prevented
triggering in the middle of a data phase where neither the Master nor the Target device was ready.
GNT0 is an arbitration line used to grant devices the right to drive the bus. We toggled it from asserted (low) to de-asserted (high) so we could
control whether or not we triggered when Device 1 was driving the bus.
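The trigger condition described above can be modeled in software. The sketch below is an illustration only, assuming each sample holds the logic level (0 = low, 1 = high) of every probed line; the `Sample` class and `address_phase_triggers` function are hypothetical names, not part of the oscilloscope's interface.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Sample:
    """Logic levels (0 = low, 1 = high) of the probed lines at one instant."""
    clk: int
    frame_n: int   # FRAME#
    irdy_n: int    # IRDY#
    trdy_n: int    # TRDY#
    devsel_n: int  # DEVSEL#
    gnt0: int      # GNT0


def address_phase_triggers(samples: Iterable[Sample],
                           gnt0_level: Optional[int] = None) -> List[int]:
    """Return sample indices where the AND pattern is true on a rising CLK edge.

    Pattern: FRAME# low, IRDY# high, TRDY# high, DEVSEL# high.
    If gnt0_level is given, GNT0 must also match it.
    """
    hits = []
    prev_clk = None
    for i, s in enumerate(samples):
        rising = prev_clk == 0 and s.clk == 1
        prev_clk = s.clk
        if not rising:
            continue
        pattern = (s.frame_n == 0 and s.irdy_n == 1
                   and s.trdy_n == 1 and s.devsel_n == 1)
        if gnt0_level is not None:
            pattern = pattern and s.gnt0 == gnt0_level
        if pattern:
            hits.append(i)
    return hits


# Made-up capture: idle bus, then an address phase, then a data phase.
capture = [
    Sample(0, 1, 1, 1, 1, 0),  # idle, CLK low
    Sample(1, 1, 1, 1, 1, 0),  # rising edge, but FRAME# high: no trigger
    Sample(0, 0, 1, 1, 1, 0),  # FRAME# asserted
    Sample(1, 0, 1, 1, 1, 0),  # rising edge in address phase: trigger
    Sample(0, 0, 0, 0, 0, 0),  # data phase begins
    Sample(1, 0, 0, 0, 0, 0),  # rising edge in data phase: no trigger
]
print(address_phase_triggers(capture))                # [3]
print(address_phase_triggers(capture, gnt0_level=1))  # []
```

Switching `gnt0_level` between 0 and 1 mirrors the way we changed the GNT0 term in the pattern to include or exclude Device 1's transactions.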
Time for Infinite Persistence
The address phase of a PCI bus transaction starts on the CLK edge following FRAME# being asserted (going low). After looking around a while, it became
apparent to us that the CLK signal itself might very well have a signal-integrity problem. So at this point we turned on the infinite-persistence feature
on the oscilloscope so that we could see any issues with the CLK signal.
The address phases of all devices, other than Device 1, were shown. These are displayed as the lower eight digital traces in the figure here.
Figure 1 - Address Phases of All Devices, Except Device 1
Note that GNT0 in the above state trigger was de-asserted (high). Basically we were examining CLK integrity when Device 1 was quiescent. The
markers were set to the Vin and Vout levels of the CLK. At this point, everything looked fine.
Triggering on the address phase of Device 1, however, revealed a problem with the clock pulse that preceded the address-phase clock. This is the clock that
samples FRAME# when it's first asserted.
There It Is!
As shown in the next figure, an anomaly can be clearly seen in the upper analog trace as it drops below the trigger level and the Vout marker.
Figure 2 - Anomaly in Address Phases
Now we had a viable suspect! We then added circuitry to enhance the coupling on boards that weren't failing to see if they would fail. They did.
The address Write and Readout tests were occasionally failing because we were violating setup and hold times. This was occurring because the anomalous
CLK signal was double-clocking, causing the address to be read in sooner than expected. Basically, the address was being clocked when the abnormal
dip in the CLK went high rather than on a normal edge.
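To see why a dip in the clock produces double-clocking, it helps to count threshold crossings. The following sketch uses made-up voltage samples and a made-up 1.5 V threshold, not measurements from our boards; the `rising_edges` function is a hypothetical helper.

```python
def rising_edges(samples, threshold):
    """Return indices where the waveform crosses `threshold` going upward."""
    return [
        i
        for i in range(1, len(samples))
        if samples[i - 1] < threshold <= samples[i]
    ]


THRESHOLD = 1.5  # volts; illustrative input threshold, not a measured value

# Two clock periods, sampled coarsely. Values are invented for illustration.
clean = [0.0, 3.3, 3.3, 3.3, 3.3, 0.0, 0.0, 3.3, 3.3, 3.3, 3.3, 0.0]
# Same clock, but the high level of the first pulse dips to 1.0 V.
glitchy = [0.0, 3.3, 3.3, 1.0, 3.3, 0.0, 0.0, 3.3, 3.3, 3.3, 3.3, 0.0]

print(rising_edges(clean, THRESHOLD))    # [1, 7]
print(rising_edges(glitchy, THRESHOLD))  # [1, 4, 7]
```

The extra crossing at index 4 is the recovery from the dip. A receiver sees it as one more rising edge and clocks the address in early, which is the behavior we observed.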
By changing the circuitry to reduce the coupling between the activity on Device 1 and the CLK, we eliminated the intermittent lock-up problems and
proved that our MSO was an effective tool for looking at signal integrity on the PCI bus.
Looking at the same problem with a conventional oscilloscope would have required external circuitry to be built, or the use of a logic analyzer to cross-trigger
from. But either approach would have made it difficult to look at the signals we were triggering on, as well as the signal we were checking for integrity problems.
Both approaches would've also required significantly more time to set up, and reducing the amount of time it took to look at this problem was essential to
meeting our schedule requirements. As it stands, the problem described took several engineers several weeks to isolate.
Copyright © 2003 ChipCenter-QuestLink