MP_Lite: Channel Bonding Ethernet between PCs

It is common to use Gigabit Ethernet to connect PCs in low-cost clusters due to its low price of around $100 per machine (~$40 for the network card and ~$60 per port for the switch). Gigabit Ethernet has a theoretical maximum of 1000 Mbps and can reach around 900 Mbps in practice with some tuning of the system.
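What that tuning involves varies from system to system, but on a Linux node it typically means enlarging the kernel socket buffers and, where the hardware supports it, raising the MTU. The commands below are a sketch rather than a prescription; the sysctl names are standard, but the values and the eth0 interface name are illustrative assumptions.

# Raise the socket buffer limits so TCP can keep a Gigabit link full
# (values are illustrative; size them to your memory and latency).
sysctl -w net.core.rmem_max=1048576
sysctl -w net.core.wmem_max=1048576
sysctl -w net.ipv4.tcp_rmem="4096 87380 1048576"
sysctl -w net.ipv4.tcp_wmem="4096 65536 1048576"

# If the cards and switch support jumbo frames, a larger MTU cuts
# per-packet overhead (eth0 is an assumed interface name).
ifconfig eth0 mtu 9000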

This provides a low-cost parallel computing system, but one that is much more unbalanced than traditional MPP systems, which use similar processors but have communication systems an order of magnitude faster. This imbalance limits the types of applications that are suitable for PC clusters. Faster networking such as Myrinet, Quadrics, SCI, or InfiniBand can be used to connect the PCs, but doing so usually doubles the cost of the cluster.

Channel bonding is a method where the data in each message is striped across multiple network cards installed in each machine. The figure above shows a small PC cluster with 2 network cards per machine. The graph below demonstrates that channel bonding 2 Gigabit Ethernet cards per PC using MP_Lite doubles the communication rate while adding only about 10% to the overall cost of the cluster (the second card and switch port add roughly another $100 to a node that typically costs on the order of $1000). Adding a 3rd card provides little additional benefit.

Channel bonding using the Linux kernel bonding.c module currently does not work at Gigabit speeds, providing worse performance than a single Gigabit Ethernet card. Proper tuning of this module should allow for the efficient use of more Gigabit Ethernet cards per machine in the future. It should be possible to get much closer to the 4 Gbps limit of the 64-bit 66 MHz PCI bus by channel bonding at this low level.
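For reference, the kernel-level approach stripes packets below the TCP layer and is configured roughly as follows. This is a sketch assuming the 2.4-era bonding driver and the ifenslave utility; the interface names and address are illustrative.

# Load the bonding driver, which creates a virtual bond0 interface.
modprobe bonding

# Assign the bonded interface a single IP address, then enslave the
# physical Gigabit cards to it; the kernel stripes traffic across them.
ifconfig bond0 192.168.1.10 netmask 255.255.255.0 up
ifenslave bond0 eth0 eth1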

Directions for setup and use

First, a few warnings: try channel bonding between 2 machines before you build an entire cluster around this approach. You will get different results depending on the network cards, and possibly on the main memory bandwidth of your machines. The performance curves above show a fairly uniform doubling of the communication rate for messages above a reasonable size. Applications using only small messages that are latency bound will see no benefit, since small messages that fit within a single Ethernet packet of 1500 bytes are sent over one network card only.

You do not need to make any changes to your code or to the way you compile MP_Lite for TCP (make tcp). You will need to set up your system to use the multiple network cards by assigning a separate IP address and name to each interface; for example, use something like node0.ge1, node0.ge2, node1.ge1, node1.ge2, etc. Since each interface has its own IP address and name, all connections can be linked to a single switch as in the diagram above, or each set of network cards can be connected to a separate switch. A sketch of this setup is given below.
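The following is a minimal sketch of that setup, assuming two interfaces (eth0 and eth1) per node; the device names and private subnets are illustrative assumptions.

# /etc/hosts on every node: one name and address per interface.
# A separate subnet per set of cards keeps routing unambiguous
# whether the cards share one switch or use two.
192.168.1.10   node0.ge1
192.168.2.10   node0.ge2
192.168.1.11   node1.ge1
192.168.2.11   node1.ge2

# On node0, bring up both Gigabit cards:
ifconfig eth0 192.168.1.10 netmask 255.255.255.0 up
ifconfig eth1 192.168.2.10 netmask 255.255.255.0 up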

The command to start a run on these nodes would then be:

mprun -np 2 -nics 2 -h node0.ge1 node0.ge2 node1.ge1 node1.ge2 program

The setup is therefore very minimal, and the only change for the user is the need to specify the multiple interface names at run time.
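The same pattern should extend to larger runs by listing the interfaces for each node in turn, as in this sketch of a 4-node run (assuming the per-node grouping of the 2-node example above carries over):

mprun -np 4 -nics 2 -h node0.ge1 node0.ge2 node1.ge1 node1.ge2 node2.ge1 node2.ge2 node3.ge1 node3.ge2 program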