Improvements in the performance of low-cost workstations have stimulated growth in the area of cluster computing. These low-cost clusters have emerged as the price-performance winner and allow individual research groups and departments to purchase balanced supercomputing power at PC prices. Balance is a function of processor speed, memory size, memory speed, and interprocessor communication.
As processor performance has increased, network speed has become a potential bottleneck to cluster performance. The introduction of Fast Ethernet has done much to improve performance, and many groups have built clusters of Pentium Pro based PCs using Fast Ethernet. These clusters have achieved GFLOP class performance using 16 PCs at a total price below $50,000.
Small collections of Pentium Pro workstations obtain adequate speed from 100 Mbit/s Ethernet for a limited range of applications. Experiments have shown, however, that larger ensembles and certain communication-intensive applications require multiple Fast Ethernet connections per PC for acceptable price-performance. The inexorable increase in microprocessor speeds will exacerbate this requirement. There are several proprietary solutions for gigabit interconnects, including Myrinet from Myricom and Memory Channel from DEC, each having its own advantages and disadvantages. Standards based solutions have primarily focused on ATM and Ethernet. The Gigabit Ethernet Alliance has nearly completed the specifications extending Ethernet into the gigabit range and the cost-performance of this technology when it achieves commodity status looks truly promising.
This paper will examine the performance of Gigabit Ethernet and the performance of clustering workstations using this technology.
In today's networks, Ethernet accounts for approximately 80% of all LAN connections, with numbers steadily increasing as Gigabit Ethernet delivers what customers need. Gigabit Ethernet is the latest speed extension of the ubiquitous Ethernet technology. It is currently in the draft stage of IEEE standardization and is known as IEEE P802.3z draft standard.
Ethernet is successful for several reasons. The technology is simple, and this translates into high reliability and low cost of maintenance. Ethernet continues to evolve to meet the needs of its customers, as evidenced by the rapid push to standardize Gigabit Ethernet. In addition, Gigabit Ethernet offers higher performance, lower maintenance costs, lower cost of entry and increased scalability when compared to other high-speed technologies.
One leap in the drive to standardize Gigabit Ethernet came through adapting existing technology. When creating a new high-speed networking system, the largest obstacle to overcome is developing a physical layer which can reliably deliver data at increased speeds. To circumvent this problem and advance the entire process more quickly, the HSSG (High Speed Study Group) proposed a crossover of technologies. The well-characterized and well-understood IEEE 802.3 MAC was layered on top of the already developed and tested physical layer of the ANSI standard Fibre-Channel. In this manner, a high-speed system could be engineered without the risk of developing a completely new and untried physical layer, or Phy.
However, Fibre-Channel, running at full speed, can only supply a signaling speed of 1.0625 Gbps, yielding a nominal delivered bit rate of 850 Mbps. In order to deliver a true bit rate of 1 Gbps, the Fibre-Channel optical components had to be enhanced. The simple and inexpensive modifications brought the signaling speed up to 1.25 Gbps, supplying a full 1.0 Gbps data rate to the user.
With the most difficult portion of the process (developing the Phy) already out of the way, it was a simple matter to scale the Media Access Controller (MAC) up by a factor of 10. Gigabit Ethernet also takes advantage of a system of block coding known as 8B/10B. This type of block coding takes in a group of 8 data bits and retranslates them into a different larger group of 10 signal bits. This scheme simplifies circuitry, since bits can be received in serial and then processed as a parallel group at a rate slower than the line speed of 1000 Mbps. This is significant since logic for high speed circuitry is considerably more expensive than logic that processes data at a slower rate.
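The rate figures quoted above follow directly from the 8-to-10 ratio of the block code. The short sketch below reproduces that arithmetic; the function names are illustrative, and the 125 MHz group rate is a derived figure that assumes each 10-bit group is handed to the parallel logic once per group.

```python
from fractions import Fraction

# 8B/10B maps every 8 data bits onto 10 signal bits on the wire.
CODE_EFFICIENCY = Fraction(8, 10)


def data_rate_mbps(signaling_rate_mbaud):
    """Return the user data rate for a given 8B/10B signaling rate."""
    return Fraction(signaling_rate_mbaud) * CODE_EFFICIENCY


def parallel_group_rate_mhz(signaling_rate_mbaud, group_bits=10):
    """Rate at which whole code groups arrive once the serial stream
    has been collected into parallel 10-bit groups (an assumed model)."""
    return Fraction(signaling_rate_mbaud) / group_bits


# Signaling rates taken from the text: 1.0625 Gbps and 1.25 Gbps.
fibre_channel = data_rate_mbps(Fraction("1062.5"))
gigabit_ethernet = data_rate_mbps(1250)
group_rate = parallel_group_rate_mhz(1250)

print(f"Fibre-Channel data rate:    {float(fibre_channel):.0f} Mbps")
print(f"Gigabit Ethernet data rate: {float(gigabit_ethernet):.0f} Mbps")
print(f"Parallel group rate:        {float(group_rate):.0f} MHz")
```

Under these assumptions the logic behind the deserializer handles one code group at a time at a rate far below the serial line rate, which is the cost advantage the paragraph describes.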
Along with the advent of Gigabit Ethernet comes a new class of device called the full-duplex repeater. The full-duplex repeater meshes together features found in switches and characteristics of a traditional repeater. It features full-duplex links, packet buffering, and utilizes 802.3x flow control methods. Most importantly, using these features, the full-duplex repeater achieves nearly 100% throughput in a shared domain for all packet sizes, rivaling the performance of gigabit switches. At nearly a fifth of the cost of a Gigabit Ethernet switch, the full-duplex repeater is an ideal clustering device.
The advent of Gigabit Ethernet and its 1000 Mbps transmission speed provides network designers with an exciting new tool, the full-duplex repeater, for building higher-performance networks. For high-performance servers, 100 Mbps Fast Ethernet cannot keep up with the 300-400 Mbps transmission capability commonly found today. Because the deployment of Gigabit Ethernet ensures that the network is not the bottleneck, the servers can operate at maximum efficiency.
Full-duplex repeaters provide an important option to anyone building high-speed networks. They add a new performance point in the scalability of Ethernet between the performance of 100 Mbps and 1 Gbps switched links. Full-duplex repeaters are an evolutionary improvement from traditional half-duplex repeaters and are fully compliant with Ethernet standards. In addition, they offer full performance at all packet sizes, improved topology coverage and consistent link behavior.
A Gigabit Ethernet cluster network may be constructed using a gigabit switch or a full-duplex repeater. A switch can cost as much as five times more than a full-duplex repeater with the same number of ports, which makes the repeater the more cost-efficient device. The full-duplex repeater fulfills the needs of a cluster environment with its high throughput, natural broadcast capability, and simple deployment, and is the obvious choice for this application.
Full-duplex repeaters can be deployed in many network environments, such as:

Full-duplex repeaters introduce three new elements to repeaters: an arbitration mechanism completely contained in the repeater, a storage buffer on the input of each port and a MAC on each port.
The most significant element involves placing the arbitration mechanism inside the repeater, which removes the mechanism from the link and eliminates the time delay associated with the link. There is still a delay present; however, it is now the short delay in the internal electronics between the ports. Instead of arbitrating between computer systems at the end of links, we now arbitrate between the buffers on each port. As a result, the links can be of any practical length, representing another advantage for the full-duplex repeater over traditional repeaters.
Full-duplex repeaters also utilize storage buffers. When a full-duplex repeater receives a packet, the packet is stored in a buffer. If the repeater is not busy, the arbitration mechanism will flood the packet out through all ports, except the port that first received the packet. When several packets are received at the same time on different ports, the buffer on each port is used to store the incoming packet. Arbitration is then performed to see which packet will be transmitted first. This is a performance improvement over traditional repeaters, where a computer attempting to send a packet would defer for a random period and try again.
Another full-duplex repeater element involves placing a MAC on each port. The incoming bit stream from the Phy (transceiver) is converted into an Ethernet frame by the receive logic and then stored in the local First-In-First-Out (FIFO) storage buffer memory. The arbitration logic determines the order in which incoming frames from the ports are flooded to all the other ports. The full-duplex repeater core receives the selected frame and floods it to the transmit logic of each port. The transmit logic converts the Ethernet frame back into a bit stream and passes it to the Phy for transmission over the media.
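The receive, buffer, arbitrate, and flood path described above can be pictured with a toy model. The sketch below is only an illustration: the class and method names are hypothetical, and round-robin selection is an assumed arbitration policy, since the text does not say which policy the repeater uses.

```python
from collections import deque


class FullDuplexRepeater:
    """Toy model: one input FIFO per port, internal arbitration, flooding."""

    def __init__(self, num_ports):
        self.fifos = [deque() for _ in range(num_ports)]
        self.next_port = 0  # round-robin pointer (assumed policy)

    def receive(self, port, frame):
        """Store a frame arriving on `port` in that port's input FIFO."""
        self.fifos[port].append(frame)

    def arbitrate(self):
        """Select the next non-empty FIFO in round-robin order and flood its
        oldest frame to every port except the one it arrived on.
        Returns (ingress_port, frame, egress_ports), or None if idle."""
        n = len(self.fifos)
        for offset in range(n):
            port = (self.next_port + offset) % n
            if self.fifos[port]:
                frame = self.fifos[port].popleft()
                self.next_port = (port + 1) % n
                egress = [p for p in range(n) if p != port]
                return port, frame, egress
        return None


repeater = FullDuplexRepeater(num_ports=4)
repeater.receive(2, "frame-A")
repeater.receive(0, "frame-B")
repeater.receive(2, "frame-C")

order = []
result = repeater.arbitrate()
while result is not None:
    order.append(result)
    result = repeater.arbitrate()

for ingress, frame, egress in order:
    print(f"{frame} from port {ingress} flooded to ports {egress}")
```

Because contention is resolved between the buffers inside the box, no station ever has to back off and retry, which is the improvement over a traditional repeater noted earlier.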
Full-duplex repeaters make use of full-duplex connections to the end stations. They must be able to buffer a stream of incoming frames, and each port must be able to buffer several frames. Because the design economics of repeaters require that memory cost be minimized, which restricts the number of frames that can be stored, flow control is used. Each port's FIFO buffer has a high and a low watermark. When the high watermark on a port's FIFO is reached, a flow control frame is transmitted to the computer or end station requesting that it stop transmitting further frames.
Eventually the frames in the port's FIFO are transmitted and the low watermark is reached. A flow control PAUSE frame is then sent to the end station indicating that it may continue transmitting frames. The frame contains the PAUSE command and a count to indicate how long the PAUSE should be. When the timer completes its count (times-out), the link partner resumes transmission of data frames. Using flow control in this way results in significant savings in memory costs while ensuring close to 100% efficiency in throughput.
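The watermark mechanism can be sketched as a small state machine on each port's FIFO. This is a simplified model under stated assumptions: the class name, the watermark values, and the pause count of 0xFFFF are hypothetical, and the "continue" message is modeled as a PAUSE frame carrying a count of zero.

```python
from collections import deque


class PortFifo:
    """Input FIFO whose high/low watermarks trigger flow-control frames."""

    def __init__(self, high_watermark, low_watermark, pause_count=0xFFFF):
        if low_watermark >= high_watermark:
            raise ValueError("low watermark must be below high watermark")
        self.high = high_watermark
        self.low = low_watermark
        self.pause_count = pause_count  # hypothetical count value
        self.frames = deque()
        self.paused = False  # True while the end station is held off
        self.sent = []       # flow-control frames sent to the end station

    def enqueue(self, frame):
        """A frame arrives from the end station."""
        self.frames.append(frame)
        if not self.paused and len(self.frames) >= self.high:
            self.paused = True
            self.sent.append(("PAUSE", self.pause_count))

    def dequeue(self):
        """The repeater core takes the oldest frame for flooding."""
        frame = self.frames.popleft()
        if self.paused and len(self.frames) <= self.low:
            self.paused = False
            self.sent.append(("PAUSE", 0))  # zero count: resume sending
        return frame


fifo = PortFifo(high_watermark=4, low_watermark=1)
for i in range(4):
    fifo.enqueue(f"frame-{i}")
paused_after_fill = fifo.paused

for _ in range(3):
    fifo.dequeue()

print(fifo.sent)
```

The gap between the two watermarks keeps the port from toggling the end station on and off for every frame, so a small buffer is enough to keep the link busy.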
Full-duplex repeaters use a form of flow control known as asymmetric flow control. Whereas symmetric flow control allows the communication of PAUSE frames between both devices involved, asymmetric flow control only allows PAUSE frames to be sent from one device to another but not vice versa. For example, if one link partner was an end device and one was a full-duplex repeater, it is undesirable to have the computer end station pause the repeater. The asymmetry prevents an end device from halting the repeater and creating a log-jam in the network infrastructure.
The Ethernet full-duplex operation standard (IEEE 802.3x) was completed by the IEEE in early 1997. The work may be applied to Ethernet at any speed of operation.
Instead of a shared environment where all end stations compete for access to the network, as in half-duplex operation, full-duplex operation ensures that collisions do not occur. This is accomplished since two transmission paths exist between the two ends of the link. One end may transmit to the other end at any time, and both ends may transmit at the same time without restriction.
With the introduction of full-duplex repeaters, the arbitration mechanism is placed in the wiring-closet box, which retains the fair arbitration method of Ethernet. But this change also has the benefit of removing link distance as a factor in the arbitration mechanism. As a result, we can use links of any acceptable length in a full-duplex repeater. This benefit provides more flexibility in network configuration and a wider geographic cover for any particular repeater.
The full-duplex repeater perfectly bridges the gap between unintelligent, traditional repeaters and expensive switches, and is a major evolutionary advancement for networks.
Packet Engines'® full-duplex repeater, the FDR12®, offers 12 ports of 1000BASE-SX Gigabit Ethernet at $17,900, and the 64-bit PCI bus G-NIC® Network Interface Card, priced at $1,295, provides a maximum possible throughput of 2 Gbps. In addition, Packet Engines' Getting Started Kit, at the attractive price of $995 per port, comes complete with one FDR12, six G-NICs, and six pairs of fiber optic cables to connect the network components together.
We offer a detailed examination of Gigabit Ethernet performance.
For purposes of evaluating network performance, we use the NetPIPE benchmark. NetPIPE [1] is a network performance analysis tool developed at Ames Laboratory. It maps the performance of a network link in fine detail and presents the full range of performance data, not just the peak performance. Just as the performance of a computer cannot be accurately described using a single-sized computation, neither can the performance of a network be described using a single-sized communication transfer. NetPIPE increases the transfer block size from a single byte to large blocks until transmission time exceeds 1 second. This allows a more accurate assessment of network performance and comparison. NetPIPE has been very useful in revealing performance variations for certain block sizes.
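The idea of sweeping block sizes until a transfer takes longer than one second can be sketched as follows. This is not NetPIPE's code: the doubling schedule is an assumption, and `model_link` is a hypothetical stand-in (fixed latency plus wire time) for a real timed transfer, with illustrative rather than measured parameters.

```python
def sweep(transfer_time, max_seconds=1.0):
    """Grow the block size from 1 byte until a single transfer takes longer
    than `max_seconds`. Returns a list of (block_bytes, seconds, mbps)."""
    results = []
    block = 1
    while True:
        seconds = transfer_time(block)
        mbps = block * 8 / seconds / 1e6
        results.append((block, seconds, mbps))
        if seconds > max_seconds:
            break
        block *= 2  # assumed schedule; the text does not give the step rule
    return results


def model_link(latency_s, bandwidth_mbps):
    """Hypothetical link model: startup latency plus time on the wire."""
    def transfer_time(block_bytes):
        return latency_s + block_bytes * 8 / (bandwidth_mbps * 1e6)
    return transfer_time


# Illustrative parameters only, not measurements from this paper.
curve = sweep(model_link(latency_s=100e-6, bandwidth_mbps=1000))

for block, seconds, mbps in curve[::6]:
    print(f"{block:>10} bytes  {seconds:.6f} s  {mbps:8.2f} Mbps")
```

In a real measurement `transfer_time` would time an actual send and receive between two hosts. Even in this simple model, small blocks are dominated by latency and only large blocks approach the link rate, which is why a single block size cannot characterize a network.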
Figure 1: A comparison of Ethernet, FDDI and ATM using NetPIPE.
Figures 1 and 2 show the strengths of NetPIPE. Figure 1 portrays the performance differences between ATM, FDDI, and Ethernet. The ATM data was taken using Classic IP as the protocol. The graph shows the advantages and disadvantages of each medium.
Figure 2: The evolution of a Fast Ethernet driver.
Since Gigabit Ethernet is a developing standard, we will also be able to track the performance of products through beta and release stages. NetPIPE is useful in this regard as well. The graph in Figure 2 shows the progression of performance for the developing Fast Ethernet driver for Linux. In the beginning the driver had many dropouts in performance, as evidenced by the graph. These performance dips were soon eliminated as solutions were found. Likewise, NetPIPE has been useful in determining performance anomalies in the early stages of Gigabit Ethernet. Figure 3 shows the evolution of Donald Becker's Packet Engines Gigabit Ethernet Linux driver in conjunction with hardware improvements.
Figure 3: Evolution of Gigabit Ethernet.
Interestingly, the Gigabit Ethernet driver for the Packet Engines G-NIC® developed at SCL for the FreeBSD operating system shows a nice curve under NetPIPE, with no severe dropouts (a slight modification was made to the FreeBSD kernel to force TCP to not delay ACKs). Figure 4 shows the comparison of the FreeBSD driver to the Linux driver, revealing a slightly higher performance for Linux for small transfer sizes, but FreeBSD overtakes Linux and shows a smoother performance curve for larger transfers.
Figure 4: NetPIPE Comparison of Gigabit Ethernet for FreeBSD and Linux
Performance comparisons between operating systems were made on the same PC systems built with ASUS P/I-P65UP5 (rev 1.41) motherboards (using the Intel 440FX chipset), a single 200MHz Pentium Pro CPU, 256KB L2 cache, and 512MB EDO RAM, with a 33MHz 32-bit PCI bus. The Linux version used was 2.0.29 and the FreeBSD version was 2.2.2.
NetPIPE was used to compare ATM (OC-3c), FDDI, Ethernet, Fast Ethernet, and Gigabit Ethernet. Figure 5 shows the significant performance advantages of Gigabit Ethernet over ATM, FDDI, and Fast Ethernet.
Figure 5: Comparison of Ethernet, FDDI, ATM, Fast Ethernet, and Gigabit Ethernet
Gigabit Ethernet strongly outperforms competing high-speed network technologies, even with early versions of the hardware and software. More interesting is the advantage in throughput for small block sizes that Gigabit Ethernet shows over ATM and FDDI, as the competing technologies begin performing well only at large block sizes (around 100KB).
A final performance issue clearly represented by the network signature graphs (time versus throughput) from NetPIPE data is latency. Figure 6 shows that, while Gigabit Ethernet and Fast Ethernet both start at a latency that is an order of magnitude better than ATM, FDDI, and Ethernet, Gigabit Ethernet's acceleration curve is much better than those of the other technologies and quickly rises to its peak. The dips in the Gigabit Ethernet and Fast Ethernet curves seem to be due to the Linux TCP protocol implementation, as FreeBSD was seen in Figure 4 above to have a clean performance curve.
Figure 6: Latency of Ethernet, FDDI, ATM, Fast Ethernet, and Gigabit Ethernet
A true picture of cluster performance can only be seen in application performance. Using four applications developed by Ames Laboratory scientists, we will compare the performance of our cluster of Pentium Pro workstations using Fast Ethernet and a cluster using Gigabit Ethernet. We will also present a comparison to low end supercomputers such as the IBM SP2 and the SGI Power Challenge. The applications of choice are: the HINT benchmark, a global illumination solver called Photon [2], Ames Laboratory's classical molecular dynamics code, and a tight binding molecular dynamics code.
The HINT [3] performance metric was developed at Ames Laboratory to gauge the overall performance of a wide variety of computing machines. Unlike other benchmarks which determine the amount of work to be done a priori, HINT fixes neither the problem size nor the execution time of the problem to be solved. Consequently, it measures the performance of a computer across all memory regimes. The output of a HINT performance measurement produces a graph which is a rigorous measurement of the QUality Improvement Per Second (QUIPS) in an answer. The graph in Figure 7 reveals the performance range of the tested computer from burst speed for very small problems to endurance speed for large problems that may have to use mass storage.
Figure 7: Comparison of cluster computing vs. traditional supercomputers using HINT.
Any digital computer, from a calculator to the largest parallel supercomputer, can be measured and accurately compared using HINT. Comparisons across different architectures are valid and the differences are readily visible using the HINT graphs. Figure 7 clearly shows the differences in performance of various popular supercomputers. Clearly, the HINT graph gives a visualization of computer performance and architecture differences previously unavailable.
The design of parallel HINT does not make use of interprocessor communication during computation, so HINT performance on our cluster depends mostly on the startup time of the computation. While HINT shows that clusters of workstations tied together with Fast Ethernet or Gigabit Ethernet perform on par with inexpensive supercomputers, HINT is clearly CPU-bound and Gigabit Ethernet is expected to provide little if any improvement.
Unfortunately, issues with the Linux kernel, the Gigabit device driver, or MPI precluded proper operation of HINT on any more than 2 systems clustered with Gigabit Ethernet. Figure 8 shows the HINT comparison of Fast Ethernet and Gigabit Ethernet on two of the clustered nodes, which shows a close performance relationship likely due to the similar latency of the two technologies.
Figure 8: HINT Using Fast Ethernet and Gigabit Ethernet.
The Photon [2] parallelized scene rendering program developed at Ames Laboratory uses a Monte Carlo light transport simulation to provide a scalable solution to Kajiya's Rendering Equation. Photon was parallelized using the mpich implementation of MPI from Argonne National Labs.
During Photon's execution, it regularly reports the photons per second that it has traced on the input scene description. Photon was asked to trace up to 5*10^6 photons using a common scene description, and the output photons per second were averaged to determine points for the graph below. It was intended to run Photon on two, four, and eight processors for each network type, but for technical reasons Myrinet runs were limited to three processors and for logistical reasons Cabletron results were limited to four processors.
Figure 9 shows the results of the Photon test of Fast Ethernet through two different Ethernet switches, Gigabit Ethernet through the FDR, and Myrinet. All tests were conducted using Linux 2.0.29 on the ASUS P/I-P65UP5 (rev 1.41) systems mentioned in the NetPIPE section with Beta Packet Engines Gigabit Ethernet cards using Donald Becker's yellowfin driver version 0.03, SMC 9332 EtherPower 10/100 Fast Ethernet cards using Donald Becker's tulip driver versions 0.70, 0.76, or 0.78, and Myrinet using the Myrinet supplied driver version 3.09.
Figure 9: Photon Using Fast Ethernet, Gigabit Ethernet, and Myrinet.
This work is supported by the U.S. Department of Energy under Contract W-7405-Eng-82 to Ames Laboratory which is operated by Iowa State University.
Ames Laboratory Technical Report IS-5126
Contact: Steve Elbert
+1-515-294-1307 elbert@AmesLab.gov
The URL for this document is http://www.scl.ameslab.gov
Revised