Fibre Channel Multimedia Adapter With PCI Bus Interface Using ORCA FPGAs
Alan Cunningham, Lead Applications Engineer, Lucent Technologies ("http://i.cmpnet.com/chipcenter/pld/bertrand_leigh@latticesemi.com")
Introduction
The adapter design utilized an integrated fiber-optic transceiver, an embedded RISC processor, a serializer/deserializer component, and a Fibre Channel ASIC. An embedded super VGA adapter used multi-ported VRAM to accept the three video streams simultaneously.
The product needed to handle all of the physical and link-level formatting requirements for Class 1, 2, 3, and Intermix Fibre Channel communication for a 32-bit data stream. Maximum sustained throughput is 200 MB/s for client/server adapters and 133 MB/s for PCI communications adapters.
The PCI bus could support carrying the three video streams from the Fibre Channel to the VGA adapter. However, by developing a 64-bit back-end bus (which could support 266 MB/s) running at the same 33 MHz speed as the PCI bus, a portion of the PCI bus bandwidth was conserved for other functions.
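The bus figures follow from width-times-clock arithmetic. The Python sketch below assumes one transfer per clock with no wait states and a nominal 33.33 MHz clock; the function name is illustrative and does not come from the design.

```python
def peak_bandwidth_mb_s(bus_width_bits: int, clock_mhz: float) -> float:
    """Peak transfer rate in MB/s, assuming one transfer on every clock."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * clock_mhz


# Assumed nominal PCI clock of 33.33 MHz.
pci_32 = peak_bandwidth_mb_s(32, 33.33)      # 32-bit PCI bus
backend_64 = peak_bandwidth_mb_s(64, 33.33)  # 64-bit back-end bus

print(f"32-bit PCI:      {pci_32:.0f} MB/s")
print(f"64-bit back end: {backend_64:.0f} MB/s")
```

Doubling the width at the same clock doubles the peak rate, which is why the wider back-end bus can carry the video traffic without raising the clock frequency.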
More Sophisticated FPGAs Meet PCI Timing Requirements
The design used a pair of 15,000-gate, SRAM-based FPGAs to control data transfer between the Fibre Channel ASIC and the VRAM serial port, and to control data transfer on the VRAM parallel port. The design also used an FPGA as an interface between the PCI bus and the internal bus. To keep costs down, a -4 speed device was used because it could pass the PCI requirements, even though faster devices from -5 to -7 are available for this same FPGA today.
Defining Fibre Channel
This allows the design of networks with bandwidth that scales linearly with the number of attached nodes. For example, a simple 16-node fabric switch provides an aggregate bandwidth of about 17 Gb/s (i.e., 16 x 1.0625 Gb/s).
Fibre Channel is an ANSI standard that allows for a wide range of data transfer rates, from 133 Mb/s up to 17 Gb/s. Currently, shipping products are clustered at two rates: 266 Mb/s (commonly called 1/4 speed) and 1.0625 Gb/s, for which off-the-shelf fiber-optic transceivers are available.
At one end of the spectrum, disk drive manufacturers are providing Fibre Channel-interfaced disk drives as low-cost, higher-speed replacements for SCSI-2 and SCSI-3. At the other end of the spectrum, Fibre Channel provides switched networks over distances of up to 10 kilometers.
Traffic Flow
A Fibre Channel Class 1 connection-oriented service worked well with the first type of traffic encountered. Class 1 requires a microseconds-long period to set up a physical connection between the originator and the destination and then provides a continuous stream of data at the maximum data rate of the Fibre Channel fabric.
The remaining one percent of the traffic was best sent by Fibre Channel Class 2 and 3 connectionless services. The 1.25 µs required to transmit a short 128-byte packet at 1.0625 Gb/s did not justify the set-up time for a Class 1 connection.
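The wire time for such a short packet can be estimated from the line rate. The sketch below assumes the 8b/10b encoding used by Fibre Channel (each byte travels as a 10-bit code) and counts payload bytes only; frame delimiters, header, and CRC are ignored, so a real frame takes slightly longer.

```python
LINE_RATE_BAUD = 1.0625e9    # full-speed Fibre Channel line rate
BITS_PER_BYTE_ON_WIRE = 10   # 8b/10b encoding: 8 data bits become 10 line bits


def payload_time_us(payload_bytes: int) -> float:
    """Time in microseconds to serialize the payload bytes alone."""
    line_bits = payload_bytes * BITS_PER_BYTE_ON_WIRE
    return line_bits / LINE_RATE_BAUD * 1e6


t = payload_time_us(128)
print(f"128-byte payload: {t:.2f} us on the wire")
```

With a transmit time on the order of a microsecond, a connection set-up that itself takes microseconds would dominate the cost of sending the packet.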
A Fibre Channel benefit is its intermix capability, which allows a Class 1 connection to share the link with connectionless traffic so that traffic requiring low latency is not suspended while a large data block is being moved.
Adapter Design
The adapter uses separate 1.0625 Gb/s data links for receiving and transmitting to support full-duplex operation.
Minimum buffer size is dictated by the maximum latency of the internal bus and the PCI bus. After factoring in internal bus latency and the worst-case PCI latency of 30 ms, it was calculated that the communications buffer must be able to hold at least two full video frames in each direction. To provide a minimum capacity of 4 MB with a bandwidth of 400 MB/s, off-the-shelf VRAM was chosen.
FPGA Functional Requirements
Serial Port Functions
At the start of a transfer, the DMA circuitry arbitrates for the parallel VRAM port and then performs a full serial register transfer to establish the base address and the direction for the subsequent split serial register transfers.
The FPGA generates the serial port clock and maintains a serial register position count. The VRAM has a serial register length of 256 words (64 bits each). After every 128 words, the DMA control circuitry within the FPGA arbitrates for the parallel VRAM port and performs a split serial register transfer, which moves the entire 1 KB chunk to or from the DRAM. At the 1.0625 Gb/s data rate, the DMA circuitry must perform one split serial register transfer approximately every 10 µs in each direction. The FPGA is always at least half a serial register ahead of the incoming data, so it has up to 10 µs to respond to a transfer request before an overrun or underrun condition occurs.
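The split-transfer interval can be checked from the register geometry. The sketch below assumes the 100 MB/s payload rate delivered by the Fibre Channel ASIC and treats 1 MB as 10^6 bytes; the constant and function names are illustrative.

```python
REGISTER_WORDS = 256   # serial register length in words
WORD_BYTES = 8         # each word is 64 bits
ASIC_RATE_MB_S = 100   # payload rate; 100 MB/s equals 100 bytes per microsecond


def split_transfer_bytes() -> int:
    """Bytes moved by one split transfer (half of the serial register)."""
    return REGISTER_WORDS // 2 * WORD_BYTES


def split_transfer_interval_us() -> float:
    """Microseconds between split transfers at the sustained payload rate."""
    return split_transfer_bytes() / ASIC_RATE_MB_S


print(split_transfer_bytes(), "bytes per split transfer")
print(f"{split_transfer_interval_us():.2f} us between split transfers")
```

Half a register holds 1,024 bytes, so at 100 bytes per microsecond a new split transfer is due roughly every 10 microseconds, which is also the response window before the other half of the register runs out.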
The Fibre Channel ASIC contains a synchronous 2 KB FIFO with an external interface consisting of a FIFO length counter, a direction signal, and a transfer enable signal. Once a transfer is enabled, the FPGA continuously moves data between the Fibre Channel ASIC and the VRAM serial register port. All transfers are synchronized to the clock supplied by the Fibre Channel ASIC, which provides a 36-bit word (32 bits plus parity) at 100 MB/s.
Parallel Port Functions
The FPGA control logic arbitrates between the following requestors for the VRAM parallel port interface: refresh, split serial register transfers, full serial transfers, bus mastering DMA transfers to/from the PCI bus, slave access by a host on the PCI bus, and direct access by the local processor.
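A fixed-priority arbiter is one simple way to model this kind of port sharing. The article does not state the actual priority order or arbitration scheme, so the ordering below (time-critical refresh and serial transfers first) is an assumption made purely for illustration, as are all the names.

```python
from enum import IntEnum
from typing import Iterable, Optional


class Requestor(IntEnum):
    """Requestors of the VRAM parallel port; lower value = higher priority.

    The priority order is hypothetical; the source only lists the requestors.
    """
    REFRESH = 0
    SPLIT_SERIAL_TRANSFER = 1
    FULL_SERIAL_TRANSFER = 2
    PCI_MASTER_DMA = 3
    PCI_SLAVE_ACCESS = 4
    LOCAL_PROCESSOR = 5


def grant(pending: Iterable[Requestor]) -> Optional[Requestor]:
    """Return the highest-priority pending requestor, or None if idle."""
    return min(pending, default=None)


winner = grant({Requestor.LOCAL_PROCESSOR, Requestor.SPLIT_SERIAL_TRANSFER})
print(winner.name)
```

In hardware the same decision would be a priority encoder evaluated whenever the port becomes free; the Python version only shows the selection rule.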
An internal FPGA counter generates refresh requests at the appropriate intervals. Once a refresh request gains control of the VRAM port, the FPGA synthesizes a standard CAS-before-RAS refresh cycle.
The FPGA counts each word that is transmitted or recovered. At each half serial register boundary during the transfer cycle, the FPGA generates a split serial register transfer request and a target address. When the programmed number of transfers has occurred, the FPGA inhibits further transfers and sends a completion status to the local processor.
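The counting behavior can be modeled in a few lines. This is a behavioral sketch, not the FPGA logic: the class name, the event tuples, and the choice of reporting the word address of the half register just completed are all assumptions.

```python
class TransferCounter:
    """Counts words and raises events at half-register boundaries."""

    def __init__(self, base_address: int, total_words: int,
                 register_words: int = 256):
        self.half = register_words // 2   # 128 words per half register
        self.base_address = base_address  # hypothetical word address
        self.total_words = total_words
        self.count = 0
        self.done = False

    def clock_word(self) -> list:
        """Count one word and return any events it triggers."""
        if self.done:
            return []                     # further transfers are inhibited
        self.count += 1
        events = []
        if self.count % self.half == 0:
            # Address of the half register that has just been completed.
            target = self.base_address + self.count - self.half
            events.append(("split_transfer", target))
        if self.count == self.total_words:
            self.done = True
            events.append(("complete", self.count))
        return events


counter = TransferCounter(base_address=0x1000, total_words=256)
events = [e for _ in range(300) for e in counter.clock_word()]
print(events)
```

Clocking the model past the programmed length produces no further events, mirroring the way the FPGA inhibits transfers once the count is reached.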
DMA Controller Functions
By interrogating the FPGA, the local processor can easily determine which DMA task is currently in progress and which tasks have been completed. Task overhead on the local processor is minimized by storing the DMA control blocks in the local processor.
The back-end host DMA controller is decoupled from the front-end PCI bus mastering logic for independent optimization and reusability.
Each section has its own clock. The PCI front end must use the PCI system clock. The DMA controller section, however, is tightly coupled to the VRAM and uses a clock that is optimized for the particular DRAM speed chosen. A small internal FIFO of sixteen 32-bit words provides the necessary speed matching between the front and back ends.
The FPGA provides byte alignment and big-endian or little-endian conversion logic, so the host DMA controller can start and end a transfer on any byte boundary, with a transfer length that is specified in bytes. Data moving from the host bus to the local communications buffer is realigned through a 32-bit barrel shifter.
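The two operations can be illustrated on 32-bit integers. The sketch below is a software analogy only: the function names are hypothetical, and `realign` shows one assumed way a shifter can combine two consecutive words to extract a word that starts at an arbitrary byte offset (big-endian byte numbering).

```python
MASK32 = 0xFFFFFFFF


def swap_endian32(word: int) -> int:
    """Reverse the byte order of a 32-bit word."""
    return int.from_bytes((word & MASK32).to_bytes(4, "big"), "little")


def barrel_rotate_left_bytes(word: int, byte_shift: int) -> int:
    """Rotate a 32-bit word left by a whole number of bytes."""
    bits = (byte_shift % 4) * 8
    word &= MASK32
    return ((word << bits) | (word >> (32 - bits))) & MASK32


def realign(prev_word: int, next_word: int, byte_offset: int) -> int:
    """Extract the 32-bit word starting byte_offset (0-3) bytes into prev_word."""
    if not 0 <= byte_offset <= 3:
        raise ValueError("byte_offset must be 0..3")
    pair = ((prev_word & MASK32) << 32) | (next_word & MASK32)
    return (pair >> (32 - 8 * byte_offset)) & MASK32


print(hex(swap_endian32(0x11223344)))
print(hex(barrel_rotate_left_bytes(0x11223344, 1)))
print(hex(realign(0x11223344, 0x55667788, 1)))
```

Because the shift amount is always a multiple of eight bits, a hardware implementation needs only four shift positions per word, which keeps the shifter small.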
Up to three DMA functions may be simultaneously active:
Separately implementing these three DMA controllers would require seven 32-bit registers for source and destination addresses and length, as well as a number of 32-bit multiplexers. A simpler and more flexible approach was chosen and implemented: a DMA register file using the internal RAM capabilities of ORCA FPGAs and a single set of counters. For each activity, the appropriate set of DMA counters is fetched from the register file. If the DMA function is suspended because a higher priority request has pre-empted, the current values in the DMA counters are stored in the DMA register file until the function is restarted. This approach allowed the addition of new DMA functions with minor changes.
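The register-file approach amounts to context switching a single set of counters. The following behavioral sketch assumes each DMA function is described by a source address, a destination address, and a remaining length; the class names, field names, and function identifiers are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass
class DmaContext:
    source: int
    destination: int
    remaining: int


class DmaEngine:
    """One set of working counters backed by a register file of contexts."""

    def __init__(self):
        self.register_file = {}   # stands in for the FPGA's internal RAM
        self.active_id = None
        self.counters = None      # the single shared set of counters

    def program(self, function_id, source, destination, length):
        self.register_file[function_id] = DmaContext(source, destination, length)

    def switch_to(self, function_id):
        """Save the pre-empted function's counters, then fetch the new set."""
        if self.active_id is not None:
            self.register_file[self.active_id] = replace(self.counters)
        self.counters = replace(self.register_file[function_id])
        self.active_id = function_id

    def step(self, words):
        """Advance the active function by up to `words` transfers."""
        n = min(words, self.counters.remaining)
        self.counters.source += n
        self.counters.destination += n
        self.counters.remaining -= n


engine = DmaEngine()
engine.program("rx", source=0, destination=100, length=10)
engine.program("tx", source=1000, destination=2000, length=4)
engine.switch_to("rx")
engine.step(3)            # "rx" is pre-empted after three words
engine.switch_to("tx")
engine.step(4)
engine.switch_to("rx")    # "rx" resumes where it left off
print(engine.counters)
```

Adding another DMA function in this model is just another entry in the register file, which reflects why the approach made new functions cheap to add.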
Slave Access
Master/Target Interface
With careful design techniques and a sophisticated development tool set, PCI requirements were met using less than ten percent of the ORCA device, which left plenty of room for the other functions mentioned above.
Design Methodology
ORCA Foundry's Preference Language allows specification of timing preferences such as clock frequency, path delays, multicycle paths, setup requirements, and clock-to-output constraints. Placement and routing take user-specified preferences into account, with each iteration providing information on remaining violations. This was a major benefit over place and route packages that may be internally optimized to support arbitrary timing goals, but which could have unpredictable effects on any given design.
The design concerns included the maximum operating frequency of several clocks and the setup and hold times required by the PCI bus specification. The ORCA FPGA met the clock-to-output timing requirement through its direct register to pad output connection. Layout constraints were primarily for the data path input and output elements; the remainder of the design was unconstrained, except through timing preferences entered in the preference file.
ORCA Foundry offers a set of adjustments for placement that trade off placement time for placement efficiency, as well as a setting for the maximum number of routing iterations. Analysis of the results from different projects and different settings indicated that the major contributor to longer runs was overly constrained preference files. The tool was usually set to a mid-point that would provide reasonably optimal results quickly. The rule of thumb was that if the tool is not close to completing its routing in the first half-hour, the last change made should be re-examined.
A four-part methodology was used for timing-driven place and route. The first step is to analyze and assign pinouts. The second is to hard place the necessary pins. In this design, most of the lock-downs were PCI pins. Third, the internal logic and register sets that interface to the locked-down pins and have difficult-to-meet timing are locked down.
Finally, the preference files were checked to make sure they reflected the placement plan, without overly constraining the placer. Note that blocking non-critical nets -- such as pullups, pulldowns, or signals that may take up to two clocks to propagate -- assists the router, since it has no way to determine their significance.
Without timing-driven place and route, the layout tools were unable to achieve more than a 20 MHz main clock frequency, and the result was extremely susceptible to small logic changes. With the timing-driven tools, the 33 MHz clock frequency was routinely achieved.
The timing-driven tools routinely beat the best manual placement operating frequency, even when nothing was locked down. This shows that the tool's ability to intelligently place and route to achieve timing requirements was essential in meeting the 33 MHz performance required in the PCI specification.
Copyright © 2003 ChipCenter-QuestLink