EMP: Ethernet message passing for programmable gigabit ethernet NICs

Status

Broadcom has not been forthcoming with documentation for the Tigon 3 NIC (as found on the 3Com 3c996 et al.), hence only the Tigon 2 NIC is supported by EMP. If you have any of the Supported NICs listed below, you are welcome to try EMP. It works in production here at OSC on our Itanium cluster, but no active development is underway on EMP for the Tigon 2, mainly due to the lack of documented, available, programmable commodity NIC chipsets. Bugs will be fixed as they arise, though.

The source is available in a tarball here:

It includes both the EMP firmware and MPICH device.

You'll also need the gcc and binutils patches to be able to cross-compile executables for the MIPS-variant processor in the Tigon chip on Alteon-chipset gigabit ethernet cards:

There is no mailing list currently, but I'll try to help with questions, and there's some Q/A below from other email discussions.

Papers

EMP: zero-copy OS-bypass NIC-driven gigabit ethernet message passing
Piyush Shivam, Pete Wyckoff and D. K. Panda
Proceedings of SC '01, Denver, CO, November 2001
PDF

Modern interconnects like Myrinet and Gigabit Ethernet offer Gb/s speeds, which has put the onus of reducing the communication latency on messaging software. This has led to the development of OS-bypass protocols, which remove the kernel from the critical path and hence reduce the end-to-end latency. With the advent of programmable NICs, many aspects of protocol processing can be offloaded from user space to the NIC, leaving the host processor free to dedicate more cycles to the application. Many host-offload messaging systems exist for Myrinet; however, nothing similar exists for Gigabit Ethernet. In this paper we propose Ethernet Message Passing (EMP), a completely new zero-copy, OS-bypass messaging layer for Gigabit Ethernet on Alteon NICs, where the entire protocol processing is done at the NIC. This messaging system delivers very good performance (latency of 23 us and throughput of 880 Mb/s). To the best of our knowledge, this is the first NIC-level implementation of a zero-copy message passing layer for Gigabit Ethernet.

Can user level protocols take advantage of multi-CPU NICs?
Piyush Shivam, Pete Wyckoff and D. K. Panda
Proceedings of IPDPS '02, Ft. Lauderdale, FL, April 2002
Full paper PDF
Shorter conference proceedings PDF

Modern high speed interconnects such as Myrinet and Gigabit Ethernet have shifted the bottleneck in communication from the interconnect to the messaging software at the sending and receiving ends. The development of user-level protocols and their implementations on smart and programmable network interface cards (NICs) have been alleviating this communication bottleneck. Most of the user-level protocols developed so far have been based on single-CPU NICs. One of the more popular current-generation Gigabit Ethernet NICs includes two CPUs, though. This raises an open challenge: can the performance of user-level protocols be improved by taking advantage of a multi-CPU NIC? In this paper, we analyze the intrinsic issues associated with such a challenge and explore different parallelization and pipelining schemes to enhance the performance of our earlier-developed EMP protocol for single-CPU Alteon NICs. Four different strategies are proposed and implemented on our testbed. Performance evaluation results indicate that parallelizing the receive path of the protocol can deliver 964 Mbps of bandwidth, close to the maximum achievable on Gigabit Ethernet. This scheme also delivers up to 8% improvement in latency for a range of message sizes. Parallelizing the send path leads to 17% improvement in bidirectional bandwidth. To the best of our knowledge, this is the first research in the literature to exploit the capabilities of multi-CPU NICs to improve the performance of user-level protocols. Results of this research demonstrate significant potential to design scalable and high performance clusters with Gigabit Ethernet.

High performance user-level sockets over gigabit ethernet
Pavan Balaji, Piyush Shivam, Pete Wyckoff and D. K. Panda
Proceedings of Cluster '02, Chicago, IL, September 2002
PDF

While a number of user-level protocols have been developed to reduce the gap between the performance capabilities of the physical network and the performance actually available, applications that have already been developed on high-latency protocols such as TCP have largely been ignored. There is a need to make these existing TCP applications take advantage of modern user-level protocols such as EMP or VIA, which feature both low latency and high bandwidth. In this paper, we have designed, implemented and evaluated a scheme to support such applications written using the sockets API to run over EMP without any changes to the application itself. Using this scheme, we are able to achieve a latency of 28.5 us for Datagram sockets and 37 us for Data Streaming sockets, compared to a latency of 120 us obtained by TCP for 4-byte messages. This scheme attains a peak bandwidth of around 840 Mbps. Both the latency and the throughput numbers are close to those achievable by native EMP. The FTP application shows twice as much benefit on our sockets interface, while the web server application shows up to six times the performance of TCP. To the best of our knowledge, this is the first such design and implementation for Gigabit Ethernet.

Supported NICs

The following NICs are built around the Alteon Tigon 2, and should work with EMP:

Watch out for "improved versions" of the discontinued ones. For example, the Netgear GA622 is based on the terrible ns83820 chipset. There are a few around based on the follow-on Tigon 3 chipset (Broadcom BCM5700/1), which might work someday, including the 3Com 3C996, but not yet.

An old thread on GigE NIC performance:

OSC usage

Details of how to run MPI/EMP programs on the OSC Itanium cluster follow.

Before doing anything else, you'll have to get yourself onto the ia64 cluster (not any of the other OSC clusters). The frontend of this cluster is ia64.osc.edu. Use ssh to log in to that machine, then proceed with the following.

First, get a PBS allocation of nodes which have gigabit ethernet cards and are connected to the switch. You must allocate both processors on the node since nobody else will be able to use the gigabit ethernet card when it is running your EMP job.

qsub -I -l nodes=4:ppn=2:emp -l walltime=8:00:00

The PBS prologue and epilogue scripts take care of setting up and tearing down emp in the gigabit ethernet cards for you. They look for the emp module in /home/$USER/emp/emp_alteon.o, and if one is not found there, they grab it from the default location /home/pw/gige/emp/emp_alteon.o.
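For reference, the swap the prologue does is conceptually something like the sketch below. This is only an illustration of the idea, not the actual OSC script; the interface name (eth1) and the stock driver name (acenic) are assumptions here.

MOD=/home/$USER/emp/emp_alteon.o
test -f $MOD || MOD=/home/pw/gige/emp/emp_alteon.o
/sbin/ifdown eth1        # take the gigabit interface out of normal IP service (assumed name)
/sbin/rmmod acenic       # unload the stock Tigon 2 driver, if loaded (assumed name)
/sbin/insmod $MOD        # load the EMP module in its place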

If you are curious to see what nodes you've been assigned, do

node13$ uniq $PBS_NODEFILE

The uniq command filters out duplicate entries, since you've actually got both processors on each node, but can only use one with emp (currently).
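For example, with the four-node allocation above, the raw nodefile lists each node twice (once per processor), and uniq collapses it to something like this (the node names here are hypothetical):

node13$ uniq $PBS_NODEFILE
node12
node13
node14
node15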

Now to build an executable, you must link with the MPICH/EMP libraries, which currently live only in an obscure place. You might as well set your path, too, to use these things and other debugging tools for EMP.

MPI_CFLAGS="-I/home/pw/src/mpich/include -I/home/pw/gige/emp"
MPI_CXXFLAGS="-I/home/pw/src/mpich/include -I/home/pw/gige/emp"
MPI_FFLAGS="-I/home/pw/src/mpich/include -I/home/pw/gige/emp"
MPI_F90FLAGS="-I/home/pw/src/mpich/include -I/home/pw/gige/emp"
MPI_LIBS="-L/home/pw/src/mpich/lib -lmpich -L/home/pw/gige/emp -lemp"
PATH=/home/pw/src/mpich/bin:/home/pw/gige/emp:$PATH
MANPATH=/home/pw/src/mpich/man

Using mpicc from the first part of the PATH above should work fine too.
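As a concrete sketch, building a hypothetical ring.c in the same shell where the variables above were set could look like the following; whether mpicc already links in -lemp depends on how that MPICH build was configured, which is why the explicit form is shown as well.

node13$ cc $MPI_CFLAGS -c ring.c
node13$ cc -o ring ring.o $MPI_LIBS
node13$ mpicc -o ring ring.c     # wrapper from the PATH above; may need $MPI_LIBS appended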

To run a parallel code, the preferred way is to use the program mpiexec, which queries the PBS environment and uses that to start jobs on the appropriate nodes. Use it like this for EMP jobs:

mpiexec -comm=emp -pernode -gige a.out args

There is also a standard mpirun script in the MPICH distribution which understands PBS_NODEFILE and should do the same thing as mpiexec, only using rsh to start the jobs. It is slower and less reliable:

mpirun a.out args

Finally, for debugging, you can bypass all of the above and directly start the codes by hand, perhaps inside a debugger. Set the environment variable EMPHOSTS to be a space-separated string of the machines on which you wish to run. It is important to use the machine names corresponding to the gigabit ethernet interfaces if you choose to do this:

# In this example, node12 will be MPI ID 0, and node14 will be ID 1
node12$ export EMPHOSTS="gige12 gige14"
node12$ ./a.out args
...
node14$ export EMPHOSTS="gige12 gige14"
node14$ ./a.out args

When it breaks

Chances are that things won't work perfectly every time. If not, there is no permanent damage and nothing to clean up (usually). Each new run will reset the state of the gigabit ethernet cards on all the nodes, and the ifdown/ifup procedure can always yank out the EMP driver and replace it with the standard IP one. So don't fear the EMP.
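If you ever do need to put a node back into normal IP service by hand, the reverse of the prologue sketch above should do it (again, the interface and stock driver names are assumptions):

/sbin/ifdown eth1
/sbin/rmmod emp_alteon
/sbin/modprobe acenic
/sbin/ifup eth1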

You can always debug using gdb, but be aware that there is another program running on the NIC itself, and it can be a bit trickier to debug. For post-mortem analysis of the NIC code, a good thing to do is

io -d 10

to show the last 10 lines of trace output on the NIC, which might end with an entry with asterisks that will give you a hint about what died. But it will probably look like gibberish to the uninitiated. There's no way to get mpiexec to run commands in serial yet, so your best bet for getting the trace output on all nodes would involve something icky like:

all -$(uniq $PBS_NODEFILE | sed s/node// | tr \\012 , | sed s/,\$//) io -d 10
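If the all helper isn't handy, a plain loop over the nodefile does the same job, assuming you can rsh or ssh to the compute nodes and that io is on the default PATH there:

for n in $(uniq $PBS_NODEFILE); do ssh $n io -d 10; done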

You can grab all the debugging output into a file to mail to me with

all -$(uniq $PBS_NODEFILE | sed s/node// | tr \\012 , | sed s/,\$//) io -d > out

but do it quickly since the NIC may continue to log other events it sees which are unrelated to your failed code, like arp request packets on the network.

TODO

Previously (not necessarily frequently) asked questions

1. Since you have connected back-to-back, have you also tested/considered channel bonding?

No channel bonding on the horizon. It would require user apps (the library) to talk once to each card to arrange for striping of messages. Our goal is to reduce communication overhead and let the app get back to doing computation. Another 0.8 us might not be worth the effort, depending on your code. If you want sheer bandwidth and don't care too much about latency or CPU utilization, I'd recommend sticking with TCP/IP.

2. What would be your scaling estimate for EMP in terms of working well for what max machine number?

We designed EMP to scale well. There is minimal static connection state on the NIC: currently 28 bytes to keep a struct for each other host in the cluster. Each in-progress or pending message also requires some state: roughly 150 bytes on each side. This is okay since we need to flow control the app/NIC interface somehow, and we currently give slots for 640 messages to the host (think outstanding MPI_Isend or MPI_Irecv operations).
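As a back-of-the-envelope check using those numbers (the 1024-node cluster size is just an example, not a tested configuration):

echo $(( 1024 * 28 ))    # per-host connection state for 1024 nodes: 28672 bytes
echo $(( 640 * 150 ))    # 640 outstanding messages at ~150 bytes each: 96000 bytes

so even at that scale the static and per-message state together come to roughly 125 KB on the NIC.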

3. How many machines/processors did you successfully connect and possibly benchmark, say, in terms of the nominal top500 double precision FLOPs benchmark HPL?

Ahem, four machines. But keep posted. I'm almost done getting the MPI implementation to pass all the Intel stress tests (well, at least the ones that broke due to my ADI implementation). We're hoping to publish some real live performance numbers soon and get this code into production.

Benchmarks

Pointers to a few low-level and application benchmarks.

Last updated 18 Mar 2005.