Fibre Channel Multimedia Adapter With PCI Bus Interface Using ORCA FPGAs

Alan Cunningham, Lead Applications Engineer, Lucent Technologies ("http://i.cmpnet.com/chipcenter/pld/bertrand_leigh@latticesemi.com")

Introduction
A 1.0625 Gb/s Fibre Channel-based Multimedia Adapter was needed to move real-time, uncompressed video data and to support a minimum of three studio-quality video streams at 18.5 Mbytes/sec per video stream. Studio quality, which is defined as 640 x 480 x 16-bit pixels, requires about 600 Kbytes (614,400 bytes) of memory per frame and must be refreshed 30 times a second. High-density FPGAs were a good choice for moving data between the PCI bus and the internal bus, and for controlling data being transferred into and out of the VRAM buffer. This article discusses the design methodology and tool support used to create a successful design.
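
The per-stream figure can be checked from the frame geometry. The short sketch below is only an arithmetic check; the variable names are illustrative and do not come from the original design.

```python
# Frame size and bandwidth for 640 x 480 pixels at 16 bits, 30 frames/s.
WIDTH, HEIGHT = 640, 480
BYTES_PER_PIXEL = 2          # 16-bit pixels
FRAMES_PER_SECOND = 30
STREAMS = 3                  # minimum number of streams the adapter must carry

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL      # bytes in one frame
stream_rate = frame_bytes * FRAMES_PER_SECOND       # bytes per second, one stream
total_rate = stream_rate * STREAMS                  # bytes per second, all streams

print(f"frame size : {frame_bytes} bytes")
print(f"one stream : {stream_rate / 1e6:.2f} MB/s")
print(f"{STREAMS} streams  : {total_rate / 1e6:.2f} MB/s")
```

One stream works out to roughly 18.4 MB/s, in line with the 18.5 Mbytes/sec quoted above, and three streams need about 55 MB/s.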

The adapter design utilized an integrated fiber-optic transceiver, an embedded RISC processor, a serializer/deserializer component, and a Fibre Channel ASIC. An embedded super VGA adapter used multi-ported VRAM to accept the three video streams simultaneously.

The product needed to handle all of the physical and link-level formatting requirements for Class 1, 2, 3 and Intermix Fibre Channel communication for a 32-bit data stream. Maximum sustained throughput is 200 megabytes per second (Mbytes/sec) for client/server adapters and 133 MB/s for PCI communications adapters.

The PCI bus could support carrying the three video streams from the Fibre Channel to the VGA adapter. However, by developing a 64-bit back-end bus (which could support 266 MB/s) running at the same 33 MHz PCI bus speed, a portion of the PCI bus bandwidth was conserved for other functions.
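
The bus figures follow from bus width times clock rate. The sketch below is a rough check that assumes one transfer per clock and a nominal 33.33 MHz clock; the function name is hypothetical.

```python
def bus_bandwidth_mb_s(width_bits: int, clock_mhz: float) -> float:
    """Peak transfer rate in MB/s, assuming one transfer per clock."""
    return width_bits / 8 * clock_mhz

PCI_CLOCK_MHZ = 33.33      # assumed nominal value of the "33 MHz" PCI clock
VIDEO_MB_S = 3 * 18.5      # three streams at the quoted per-stream rate

pci = bus_bandwidth_mb_s(32, PCI_CLOCK_MHZ)
backend = bus_bandwidth_mb_s(64, PCI_CLOCK_MHZ)

print(f"32-bit PCI     : {pci:.1f} MB/s peak")
print(f"64-bit back end: {backend:.1f} MB/s peak")
print(f"video load     : {VIDEO_MB_S:.1f} MB/s ({VIDEO_MB_S / pci:.0%} of PCI peak)")
```

Under these assumptions the three streams would consume a large share of the 32-bit PCI peak rate, which is consistent with the decision to carry them on a separate 64-bit bus.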

More Sophisticated FPGAs Meet PCI Timing Requirements
Advances in FPGA technology now make it possible to support the seven-nanosecond (ns) setup time and the 11 ns clock-to-output signal delay required by the PCI bus specification and PCI-bus compatible I/O buffers. A high-pin-count FPGA was chosen that could support multiple 32- and 64-bit buses, three separate interfaces, four DMA controllers, and VRAM control logic, with SRAM for internal RAM and FIFO architectures. Lucent's ORCA device was determined by the customer to be the best fit, in part because of the sophisticated FPGA development tools available that supported user-defined timing parameters. Without timing-driven development tools, meeting the stringent PCI-bus timing would have been difficult, if not impossible.

The design used a pair of 15,000-gate, SRAM-based FPGAs to control data transfer between the Fibre Channel ASIC and the VRAM serial port, and to control data transfer on the VRAM parallel port. The design also used an FPGA as an interface between the PCI bus and the internal bus. To keep costs down, a -4 speed device was used because it could pass the PCI requirements, even though faster devices from -5 to -7 are available for this same FPGA today.

Defining Fibre Channel
Fibre Channel is a local area and metropolitan area network that combines the advantages of a high-bandwidth channel with the low-latency characteristics of a traditional network. A Fibre Channel node, called an N-port, interfaces to a Fibre Channel fabric, which routes data to other nodes. Unlike traditional networks, Fibre Channel can use a two-dimensional fabric, which has both a time-division and a space-division switching component--connected in parallel. In addition, much like a telephone system, each of the nodes may be simultaneously sending and receiving traffic.

This allows the design of networks whose bandwidth scales linearly with the number of attached nodes. For example, a simple, 16-node fabric switch provides an aggregate bandwidth of about 17 Gb/s (i.e., 16 x 1.0625 Gb/s).

Fibre Channel is an ANSI standard that allows for a wide range of data transfer rates ranging from 133 Mb/s up to 17 Gb/s. Currently, shipping products are clustered at two frequencies, 266 Mb/s (commonly called 1/4 speed) and 1.0625 Gb/s, for which off-the-shelf fiber-optic transceivers are available. At one end of the spectrum, disk drive manufacturers are providing Fibre Channel interfaced disk drives as low-cost, higher-speed replacements for SCSI-2 and SCSI-3. At the other end of the spectrum, Fibre Channel provides switched networks over distances of up to 10 kilometers.

Traffic Flow
For this application, the PCI adapter had to manage two types of traffic. The first type, ten to twenty percent of the packets, required 99 percent of the fiber bandwidth because of its large packet size. The second type, which accounted for 80-90 percent of the packets traveling over the network, consisted of small, low-latency packets and consumed less than one percent of the fiber bandwidth.

A Fibre Channel Class 1 connection-oriented service worked well with the first type of traffic encountered. Class 1 requires a microseconds-long period to set up a physical connection between the originator and the destination and then provides a continuous stream of data at the maximum data rate of the Fibre Channel fabric.

The remaining one percent of the traffic was best sent by Fibre Channel Class 2 and 3 connectionless services. The 1.25 µs required to transmit a short 128-byte packet at 1.0625 Gb/s did not justify the set-up time for a Class 1 connection.

A Fibre Channel benefit is its intermix capability, which allows a Class 1 connection to share the link with connectionless traffic so that traffic requiring low latency is not suspended while a large data block is being moved.

Adapter Design
The high volume of Type 1 traffic flowing through the system required a set of buffers between the Fibre Channel ASIC and the PCI bus. The Fibre Channel expects continuous data, but the PCI bus experiences latencies between packets. An embedded RISC processor was used to manage the flow of data through the Fibre Channel ASIC with the assistance of the FPGA-based multithreaded DMA controller.

The adapter uses separate 1.0625 Gb/s data links for receiving and transmitting to support full-duplex operation. Minimum buffer size is dictated by the maximum latency of the internal bus and the PCI bus. After factoring in internal bus latency and the worst-case PCI latency of 30 ms, it was calculated that the communications buffer must be able to hold at least two full video frames in each direction. An off-the-shelf VRAM was chosen to meet the resulting requirement of at least 4 MB of buffer with a bandwidth of 400 MB/s.

FPGA Functional Requirements

Serial Port Functions
The FPGA uses two serial-port DMA controllers to transfer data between the Fibre Channel ASIC and the VRAM serial port. Both the receiving and transmitting DMA controllers have an address pointer to a location within the VRAM, a length counter and associated control circuitry to start and stop the data transfer, and interrupt circuitry to inform the local processor when the transfer has completed.

At the start of a transfer, the DMA circuitry arbitrates for the parallel VRAM port and then performs a full serial register transfer to establish the base address and the direction for the subsequent split serial register transfers.

The FPGA generates the serial port clock and maintains a serial register position count. The VRAM has a serial register length of 256 words (64 bits each). After every 128 words, the DMA control circuitry within the FPGA arbitrates for the parallel VRAM port and performs a split serial register transfer, which moves the entire one-Kbyte chunk to or from the DRAM. At the 1.0625 Gb/s data rate, the DMA circuitry must perform one split serial register transfer approximately every 10 µs in each direction. The FPGA is always at least half a serial register ahead of the incoming data, so it has up to 10 µs to respond to a transfer request before an overrun or underrun condition occurs.

The Fibre Channel ASIC contains a synchronous 2 KB FIFO with an external interface consisting of a FIFO length counter, a direction signal, and a transfer enable signal. Once a transfer is enabled, the FPGA continuously moves data between the Fibre Channel ASIC and the VRAM serial register port. All transfers are synchronized to the clock supplied by the Fibre Channel ASIC, which provides a 36-bit word (32 bits plus parity) at 100 MB/s.
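
The interval between split serial register transfers can be estimated from the serial register geometry and the 100 MB/s rate of the Fibre Channel ASIC. The sketch below is a back-of-the-envelope check with illustrative constant names.

```python
WORD_BYTES = 8                  # 64-bit serial register words
REGISTER_WORDS = 256            # full serial register length
HALF_WORDS = REGISTER_WORDS // 2
PAYLOAD_BYTES_PER_S = 100e6     # data rate quoted for the Fibre Channel ASIC

chunk_bytes = HALF_WORDS * WORD_BYTES             # bytes moved per split transfer
interval_s = chunk_bytes / PAYLOAD_BYTES_PER_S    # time to fill or drain one half

print(f"split transfer size: {chunk_bytes} bytes")
print(f"transfer interval  : {interval_s * 1e6:.2f} microseconds")
```

Each half of the serial register holds 1,024 bytes, so at 100 MB/s a split transfer is due roughly every 10 microseconds in each direction.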

Parallel Port Functions
The FPGA pair also controls the transfer of data on the VRAM parallel port, up to a maximum fast-page mode cycle time of 35 ns. A word size of 64 bits was chosen, which made it possible to support a 230 MB/s internal bus rate. This ensures an easy transition to the 64-bit PCI.
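
The internal bus rate follows from the word size and the page-mode cycle time. The sketch below assumes one word is moved per 35 ns cycle; the function name is hypothetical.

```python
def page_mode_bandwidth_mb_s(word_bits: int, cycle_ns: float) -> float:
    """Peak bandwidth in MB/s when one word moves per page-mode cycle."""
    bytes_per_cycle = word_bits / 8
    return bytes_per_cycle / cycle_ns * 1000    # bytes/ns -> MB/s (1 MB = 1e6 bytes)

wide = page_mode_bandwidth_mb_s(64, 35)
narrow = page_mode_bandwidth_mb_s(32, 35)

print(f"64-bit words at 35 ns: {wide:.1f} MB/s")
print(f"32-bit words at 35 ns: {narrow:.1f} MB/s")
```

The 64-bit case comes out just under the 230 MB/s quoted above; a 32-bit word at the same cycle time would give half that rate.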

The FPGA control logic arbitrates between the following requestors for the VRAM parallel port interface: refresh, split serial register transfers, full serial transfers, bus mastering DMA transfers to/from the PCI bus, slave access by a host on the PCI bus, and direct access by the local processor.

An internal FPGA counter generates refresh requests at the appropriate intervals. Once a refresh request gains control of the VRAM port, the FPGA synthesizes a standard CAS-before-RAS refresh cycle.

The FPGA counts each word that is transmitted or received. At each half serial register boundary during the transfer cycle, the FPGA generates a split serial register transfer request and a target address. When the programmed number of transfers has occurred, the FPGA inhibits further transfers and sends a completion status to the local processor.
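
The word-counting behavior can be modeled in a few lines. The sketch below is a toy software model, not the FPGA logic: the class and field names are hypothetical, and the choice of target address (the byte address just past the words already moved) is an assumption.

```python
from dataclasses import dataclass, field

HALF_REGISTER_WORDS = 128    # half of the 256-word serial register
WORD_BYTES = 8               # 64-bit words

@dataclass
class SerialDma:
    """Toy model of one serial-port DMA channel (names are hypothetical)."""
    base_address: int                 # VRAM byte address where the transfer starts
    length_words: int                 # programmed transfer length in words
    words_done: int = 0
    complete: bool = False
    requests: list = field(default_factory=list)   # target addresses of split transfers

    def clock_word(self) -> None:
        """Account for one word moved through the serial port."""
        if self.complete:
            return                    # further transfers are inhibited
        self.words_done += 1
        if self.words_done % HALF_REGISTER_WORDS == 0:
            # Half-register boundary: request a split transfer (assumed address rule).
            self.requests.append(self.base_address + self.words_done * WORD_BYTES)
        if self.words_done == self.length_words:
            self.complete = True      # completion status for the local processor

dma = SerialDma(base_address=0x10000, length_words=300)
for _ in range(400):                  # clocks after completion are ignored
    dma.clock_word()
print([hex(a) for a in dma.requests], dma.words_done, dma.complete)
```

A 300-word transfer crosses two half-register boundaries, so the model issues two split transfer requests and then stops counting.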

DMA Controller Functions
The FPGA also acts as a multi-threaded host DMA controller, which controls bus-mastering transfer of data between the PCI bus and the data communications buffer. The FPGA maintains a pointer to a DMA control block within the local processor's memory. When a DMA operation is completed, the FPGA retrieves the next set of DMA control words from memory and initiates the transfer. The control words specify the VRAM base address, the host memory base address, and the length of the transfer in bytes.

By interrogating the FPGA, the local processor can easily determine which DMA task is currently in progress and which tasks have been completed. Task overhead on the local processor is minimized by storing the DMA control blocks in the local processor's memory, where the FPGA fetches them directly.
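
The control-block mechanism resembles a linked list of descriptors that the controller walks without processor involvement. The sketch below illustrates that idea in software; the field layout, the `next` link, and the single VRAM-to-host direction are assumptions made for the example, not details of the original design.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ControlBlock:
    """One DMA descriptor; the layout and `next` link are assumptions."""
    vram_address: int
    host_address: int
    length_bytes: int
    next: Optional["ControlBlock"] = None

def run_chain(first: Optional[ControlBlock], vram: bytearray, host: bytearray) -> int:
    """Copy VRAM -> host for each block in turn; return the blocks completed."""
    completed = 0
    block = first
    while block is not None:
        src = vram[block.vram_address: block.vram_address + block.length_bytes]
        host[block.host_address: block.host_address + block.length_bytes] = src
        completed += 1            # a status the local processor could poll
        block = block.next        # fetch the next set of control words
    return completed

vram = bytearray(range(16))
host = bytearray(16)
chain = ControlBlock(0, 8, 4, next=ControlBlock(12, 0, 4))
done = run_chain(chain, vram, host)
print(done, list(host))
```

The completed count stands in for the status the local processor reads back when it interrogates the FPGA.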

The back-end host DMA controller is decoupled from the front-end PCI bus mastering logic for independent optimization and reusability.

Each section has its own clock. The PCI front end must use the PCI system clock. The DMA controller section, however, is tightly coupled to the VRAM and uses a clock that is optimized for the particular DRAM speed chosen. A small internal FIFO of sixteen 32-bit words provides the necessary speed matching between the front and back ends.

The FPGA provides byte alignment and big-endian or little-endian conversion logic, so the host DMA controller can start and end a transfer on any byte boundary, with a transfer length that is specified in bytes. Data moving from the host bus to the local communications buffer is realigned through a 32-bit barrel shifter.
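
Byte realignment and endian conversion can be illustrated in software. The sketch below assumes a little-endian stream of 32-bit words and hypothetical function names; it shows how each realigned word is built from two adjacent input words, which is the job a barrel shifter does in hardware.

```python
MASK32 = 0xFFFFFFFF

def swap_endian32(word: int) -> int:
    """Reverse the byte order of a 32-bit word."""
    return int.from_bytes(word.to_bytes(4, "little"), "big")

def realign(words, byte_offset: int):
    """Drop the first byte_offset bytes (0..3) of a little-endian word stream
    and repack the remainder into 32-bit words.

    Each output word combines the upper bytes of one input word with the
    lower bytes of the next one.
    """
    if byte_offset == 0:
        return list(words)
    shift = 8 * byte_offset
    out = []
    for low, high in zip(words, words[1:]):
        out.append(((low >> shift) | (high << (32 - shift))) & MASK32)
    return out

stream = [0x33221100, 0x77665544, 0xBBAA9988]   # bytes 00 11 22 ... BB in memory order
aligned = realign(stream, 1)
print([hex(w) for w in aligned])
print(hex(swap_endian32(0x11223344)))
```

Starting one byte into the stream, the first realigned word holds bytes 11 22 33 44, so a transfer can begin on any byte boundary.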

Up to three DMA functions may be simultaneously active:

  1. the receiver DMA, which moves data from the Fibre-Channel receive buffer to the display buffer,
  2. the transmitter DMA, which moves data from the camera buffer to the Fibre-Channel transmit buffer,
  3. the host bus DMA, which moves data between any of the on-board buffers and host memory.

Separately implementing these three DMA controllers would require seven 32-bit registers for source and destination addresses and length, as well as a number of 32-bit multiplexers. A simpler and more flexible approach was chosen and implemented: a DMA register file using the internal RAM capabilities of ORCA FPGAs and a single set of counters. For each activity, the appropriate set of DMA counters is fetched from the register file. If the DMA function is suspended because a higher priority request has pre-empted, the current values in the DMA counters are stored in the DMA register file until the function is restarted. This approach allowed the addition of new DMA functions with minor changes.
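
The register-file approach amounts to a context switch between DMA functions that share one set of counters. The sketch below models it with a dictionary standing in for the internal RAM; all names are hypothetical, and counting in words is an assumption.

```python
class DmaEngine:
    """One shared set of counters plus a register file (hypothetical names)."""

    def __init__(self):
        self.register_file = {}    # channel name -> saved (src, dst, remaining)
        self.active = None         # channel currently loaded in the counters
        self.src = self.dst = self.remaining = 0

    def program(self, channel, src, dst, length):
        """Store the initial counter values for a channel."""
        self.register_file[channel] = (src, dst, length)

    def switch_to(self, channel):
        """Save the running channel's counters, then load the requested channel."""
        if self.active is not None:
            self.register_file[self.active] = (self.src, self.dst, self.remaining)
        self.src, self.dst, self.remaining = self.register_file[channel]
        self.active = channel

    def step(self, words=1):
        """Advance the shared counters by up to `words` transfers."""
        n = min(words, self.remaining)
        self.src += n
        self.dst += n
        self.remaining -= n

engine = DmaEngine()
engine.program("host", src=0x1000, dst=0x8000, length=100)
engine.program("receiver", src=0x0, dst=0x4000, length=10)
engine.switch_to("host")
engine.step(40)
engine.switch_to("receiver")    # higher-priority request pre-empts the host DMA
engine.step(10)
engine.switch_to("host")        # host DMA resumes where it left off
print(hex(engine.src), hex(engine.dst), engine.remaining)
```

Adding a new DMA function in this model only requires another entry in the register file, which mirrors the flexibility described above.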

Slave Access
The FPGA must also function as a full DRAM controller for slave accesses by either the host processor or the local processor. These requests have medium priority, lower than refresh and serial register transfers but higher than DMA transfers, since slave access should not be suspended during long DMA transfer operations. The on-board, embedded RISC processor uses the 32-bit FPGA interface to directly access internal FPGA control and status registers as well as on-board VRAM.
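
The arbitration for the VRAM parallel port can be sketched as a fixed-priority selection. The grouping below (refresh and serial register transfers first, slave accesses next, DMA last) follows the description above; the order inside each group and the requestor names are assumptions.

```python
# Highest priority first. Only the grouping comes from the text;
# the order inside each group is an assumption.
PRIORITY = [
    "refresh",
    "split_serial_transfer",
    "full_serial_transfer",
    "host_slave_access",
    "local_processor_access",
    "bus_master_dma",
]

def grant(pending):
    """Return the highest-priority pending requestor, or None if idle."""
    for requestor in PRIORITY:
        if requestor in pending:
            return requestor
    return None

print(grant({"bus_master_dma", "host_slave_access"}))   # slave access beats DMA
print(grant({"bus_master_dma", "refresh"}))             # refresh beats DMA
print(grant(set()))                                     # nothing pending
```

A fixed-priority scheme like this keeps a long DMA transfer from holding off a slave access, at the cost of making DMA wait whenever anything else is pending.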

Master/Target Interface
The final function of the FPGA is to provide a full master/target PCI interface. Although this interface requires only a small percentage of the total FPGA silicon available, it imposes some of the most stringent timing and drive requirements. Unlike many other bus technologies, PCI uses a reflected-wave rather than an incident-wave switching mechanism, so I/O driver selection must be based on a complete V/I specification rather than simply on dc current sink or source specifications. The PCI spec also has stringent input capacitance and signal trace length requirements and stringent timing, including an input setup time of 7 ns and a clock-to-output delay of 11 ns.

With careful design techniques and a sophisticated development tool set, PCI requirements were met using less than ten percent of the ORCA device, which left plenty of room for the other functions mentioned above.

Design Methodology
As the design progressed, it was clear that what was needed was a timing-driven place and route system to allow specification of critical dependencies such as delay paths and frequency. The ORCA FPGA Foundry package met the customer's requirements with timing-driven place and route functionality.

ORCA Foundry's Preference Language allows specification of timing preferences such as clock frequency, path delays, multicycle paths, setup requirements, and clock-to-output constraints. Placement and routing take user-specified preferences into account, with each iteration providing information on remaining violations. This was a major benefit over place and route packages that may be internally optimized to support arbitrary timing goals, but which could have unpredictable effects on any given design.

The design concerns included the maximum operating frequency of several clocks and the setup and hold times required by the PCI bus specification. The ORCA FPGA met the clock-to-output timing requirement through its direct register to pad output connection. Layout constraints were primarily for the data path input and output elements; the remainder of the design was unconstrained, except through timing preferences entered in the preference file.

ORCA Foundry offers a set of adjustments for placement that trade off placement time for placement efficiency, as well as a setting for the maximum number of routing iterations. Analysis of the results from different projects and different settings indicated that the major contributor to longer runs was overly constrained preference files. The tool was usually set for a mid-point that would provide reasonably optimal results in a rapid period of time, with the rule of thumb being that if the tool is not close to completing its routing in the first half-hour, then the last change made should be re-examined.

A four-part methodology was used for timing-driven place and route. The first step was to analyze and assign pinouts. The second was to hard-place the necessary pins; in this design, most of the lock-downs were PCI pins. The third was to lock down the internal logic and register sets that interface to the locked-down pins and have difficult-to-meet timing.

Finally, the preference files were checked to make sure they reflected the placement plan, without overly constraining the placer. Note that blocking non-critical nets -- such as pullups, pulldowns, or signals that may take up to two clocks to propagate -- assists the router, since it has no way to determine their significance.

Without timing-driven place and route, the layout tools could not achieve more than a 20 MHz main clock frequency, and the result was extremely susceptible to small logic changes. With the timing-driven tools, the 33 MHz clock frequency was routinely achieved.

The timing-driven tools routinely beat the best manual placement operating frequency, even when nothing was locked down. This shows that the tool's ability to intelligently place and route to achieve timing requirements was essential in meeting the 33 MHz performance required in the PCI specification.


Copyright © 2003 ChipCenter-QuestLink