Selected papers

2008

Data Structure Consistency Using Atomic Operations in Storage Devices
Ananth Devulapalli, Dennis Dalessandro, Pete Wyckoff
Proceedings of MSST'08, SNAPI Workshop, Baltimore, MD, September 2008, to appear

Managing concurrency is a fundamental requirement for any multi-threaded system, frequently implemented by serializing critical code regions or using object locks on shared resources. Storage systems are one case of this, where multiple clients may wish to access or modify on-disk objects concurrently yet safely. Data consistency may be provided by an inter-client protocol, or it can be implemented in the file system server or storage device.

We look at the design of algorithms for consistency using atomic operations, in particular compare-and-swap and fetch-and-add. These can be used as primitives to form a variety of distributed algorithms, including locking. The operations are also simple enough to consider implementing directly in storage devices, obviating the need for dedicated servers in the data and metadata access paths.

An OSD-based Approach to Managing Directory Operations in Parallel File Systems
Nawab Ali, Ananth Devulapalli, Dennis Dalessandro, Pete Wyckoff, P. Sadayappan
Proceedings of Cluster'08, Tsukuba, Japan, September 2008, to appear

Distributed file systems that use multiple servers to store data in parallel are becoming commonplace. Much work has already gone into such systems to maximize data throughput. However, metadata management has historically been treated as an afterthought. In previous work we focused on improving metadata management techniques by placing file metadata along with data on Object-based Storage Devices (OSDs). However, we did not investigate directory operations. This work looks at the possibility of designing directory structures directly on OSDs, without the need for intervening servers. In particular, the need for atomicity is a fundamental requirement that we explore in depth. Through performance results of benchmarks and applications we show the feasibility of using OSDs directly for metadata, including directory operations.

Non-Contiguous I/O Support for Object-Based Storage
Dennis Dalessandro, Ananth Devulapalli, Pete Wyckoff
Proceedings of ICPP'08, P2S2 Workshop, Portland, OR, September 2008, to appear
Paper (PDF)

The access patterns performed by disk-intensive applications vary widely, from simple contiguous reads or writes through an entire file to completely unpredictable random access. Often, applications will be able to access multiple disconnected sections of a file in a single operation. Application programming interfaces such as POSIX and MPI encourage the use of non-contiguous access with calls that process I/O vectors.

Under the level of the programming interface, most storage protocols do not implement I/O vector operations (also known as scatter/gather). These protocols, including NFSv3 and block-based SCSI devices, must instead issue multiple independent operations to complete the single I/O vector operation specified by the application, at a cost of a much slower overall transfer time.

Scatter/gather I/O is critical to the performance of many parallel applications, hence protocols designed for this area do tend to support I/O vectors. Parallel Virtual File System (PVFS) in particular does so; however, a recent specification for object-based storage devices (OSD) does not.

Using a software implementation of an OSD as storage devices in a Parallel Virtual File System (PVFS) framework, we show the advantages of providing direct support for non-contiguous data transfers. We also implement the feature in OSDs in a way that is both efficient for performance and appropriate for inclusion in future specification documents.

Tapping into the Fountain of CPUs—On Operating System Support for Programmable Devices
Yaron Weinsberg, Danny Dolev, Tal Anker, Muli Ben-Yehuda, Pete Wyckoff
Proceedings of ASPLOS'08, Seattle, WA, March 2008
Paper (PDF)

The constant race for faster and more powerful CPUs is drawing to a close. No longer is it feasible to significantly increase the speed of the CPU without paying a crushing penalty in power consumption and production costs. Instead of increasing single thread performance, the industry is turning to multiple CPU threads or cores (such as SMT and CMP) and heterogeneous CPU architectures (such as the Cell Broadband Engine). While this is a step in the right direction, in every modern PC there is a wealth of untapped compute resources. The NIC has a CPU; the disk controller is programmable; some high-end graphics adapters are already more powerful than host CPUs. Some of these CPUs can perform some functions more efficiently than the host CPUs. Our operating systems and programming abstractions should be expanded to let applications tap into these computational resources and make the best use of them.

Therefore, we propose the Hydra framework, which lets application developers use the combined power of every compute resource in a coherent way. Hydra is a programming model and a runtime support layer which enables utilization of host processors as well as various programmable peripheral devices' processors. We present the framework and its application for a demonstrative use-case, as well as provide a thorough evaluation of its capabilities. Using Hydra we were able to cut down the development cost of a system that uses multiple heterogenous compute resources significantly.

2007

Integrating Parallel File Systems with Object-Based Storage Devices
Ananth Devulapalli, Dennis Dalessandro, Pete Wyckoff, Nawab Ali and P. Sadayappan
Proceedings of SC'07, Reno, NV, November 2007
Paper (PDF)

As storage systems grow larger and more complex, the traditional block-based design of current disks can no longer satisfy workloads that are increasingly metadata intensive. A new approach is offered by object-based storage devices (OSDs). By moving more of the responsibility of storage management onto each OSD, improvements in performance, scalability and manageability are possible.

Since this technology is new, no physical object-based storage device is currently available. In this work we describe a software emulator that enables a storage server to behave as an object-based disk. We focus on the design of the attribute storage, which is used to hold metadata associated with particular data objects. Alternative designs are discussed, and performance results for an SQL implementation are presented.

iSER Storage Target for Object-based Storage Devices
Dennis Dalessandro, Ananth Devulapalli, Pete Wyckoff
Proceedings of MSST'07, SNAPI Workshop, San Diego, CA, September 2007
Paper (PDF)
Presentation (PDF)

In order to increase client capacity and performance, storage systems have begun to utilize the advantages offered by modern interconnects. Previously storage has been transported over costly fibre channel networks or ubiquitous but low performance Ethernet networks. However, with the adoption of the iSCSI extensions for RDMA (iSER) it is now possible to use RDMA based interconnects for storage while leveraging existing iSCSI tools and deployments.

Building on previous work with an object-based storage device emulator using the iSCSI transport, we extend its functionality to include iSER. Using an RDMA transport brings with it many design issues including the need register memory to be used by the network, and how to bridge the quite different RDMA completion semantics with existing event management based on file descriptors.

Experiments demonstrate reduced latency and greatly increased throughput compared to iSCSI implementations both on Gigabit Ethernet and on IP over InfiniBand.

Attribute Storage Design for Object-based Storage Devices
Ananth Devulapalli, Dennis Dalessandro, Nawab Ali and Pete Wyckoff
Proceedings of MSST'07, San Diego, CA, September 2007
Paper (PDF)

Memory Management Strategies for Data Serving with RDMA
Dennis Dalessandro and Pete Wyckoff
Proceedings of HotI'07, Palo Alto, CA, August 2007
Paper (PDF)
Presentation (PDF)

Using remote direct memory access (RDMA) to ship data is becoming a very popular technique in network architectures. As these networks are adopted by the broader computing market, new challenges arise in transitioning existing code to use RDMA APIs. One particular class of applications that map poorly to RDMA are those that act as servers of file data. In order to access file data and send it over the network, an application must copy it to user-space buffers, and the operating system must register those buffers with the network adapter. Ordinary sockets-based networks can achieve higher performance by using the "sendfile" mechanism to avoid copying file data into user-space buffers.

In this work we revisit time-honored approaches to sending file data, but adapted to RDMA networks. In particular, both pipelining and sendfile can be used, albeit with modifications to handle memory registration issues. However, memory registration is not well-integrated in current operating systems, leading to difficulties in adapting the sendfile mechanism. These two techniques make it feasible to create RDMA-based applications that serve file data and still maintain a high level of performance.

Accelerating Web Protocols Using RDMA
Dennis Dalessandro and Pete Wyckoff
Proceedings of NCA'07, Cambridge, MA, July 2007
Paper (PDF)

High-performance computing, just like the world at large, is continually discovering new uses for the Internet. Interesting applications rely on server-generated content, severely taxing the capabilities of web servers. Thus it is common for multiple servers to run a single site. In our work, we use a novel network feature known as RDMA to vastly improve performance and scalability of a single server. Using an unmodified Apache web server with a dynamic module to enable iWARP (RDMA over TCP), we can handle more clients with lower CPU utilization, and higher throughput.

Accelerating distributed computing applications using a network offloading framework
Yaron Weinsberg, Danny Dolev, Pete Wyckoff and Tal Anker
Proceedings of IPDPS'07, Long Beach, CA, March 2007
Paper (PDF)

During the last two decades, a considerable amount of academic research has been conducted in the field of distributed computing. Typically, distributed applications require frequent network communication, which becomes a dominate factor in the overall runtime overhead. The recent proliferation of programmable peripheral devices for computer systems may be utilized in order to improve the performance of such applications. Offloading application-specific network functions to peripheral devices can improve performance and reduce host CPU utilization. Due to the peculiarities of each particular device and the difficulty of programming an outboard CPU, the need for an abstracted offloading framework is apparent. This paper proposes a novel offloading framework, called HYDRA, that enables utilization of such devices. The framework enables an application developer to design the offloading aspects of the application by specifying an "offloading layout", which is enforced by the runtime during application deployment. The performance of a variety of distributed algorithms can be significantly improved by utilizing such a framework. We demonstrate this claim by evaluating several offloaded applications: a distributed total message ordering algorithm, a packet generator, and an embedded firewall.

File creation strategies in a distributed metadata file system
Ananth Devulapalli and Pete Wyckoff
Proceedings of IPDPS'07, Long Beach, CA, March 2007
Paper (PDF)

As computing breaches petascale limits both in processor performance and storage capacity, the only way that current and future gains in performance can be achieved is by increasing the parallelism of the system. Gains in storage performance remain low due to the use of traditional distributed file systems such as NFS, where although multiple clients can access files at the same time, only one node can serve files to the clients. New file systems that distribute load across multiple data servers are being developed; however, most implementations concentrate all the metadata load at a single server still. Distributing metadata load is important to accommodate growing numbers of more powerful clients.

Scaling metadata performance is more complex than scaling raw I/O performance, and with distributed metadata the complexity increases further. In this paper we present strategies for file creation in distributed metadata file systems. Using the PVFS distributed file system as our testbed, we present designs that are able to reduce the message complexity of the create operation and increase performance. Compared to the basecase create protocol implemented in PVFS, our design delivers near constant operation latency as the system scales, does not degenerate under high contention situations, and increases throughput linearly as the number of metadata servers increase. The design schemes are applicable to any distributed file system implementation.

2006

Accelerating web protocols using RDMA
Dennis Dalessandro and Pete Wyckoff
Proceedings of SC'06, Poster, Tampa, FL, November 2006

As high performance computing finds increased usage in the Internet, we should expect to see larger data transfers and an increase in the processing of dynamic web content by HPC web servers, particularly where grid applications are concerned. The high cost of processing TCP/IP on the CPU will be too much for today's web servers to handle. As an answer to this problem, we look to iWARP, which is RDMA over TCP. The benefits of RDMA have long been realized in technologies such as InfiniBand, and iWARP makes this possible over ordinary Ethernet. In implementing an iWARP module for the Apache web server, we create an experimental platform for testing the feasibility of RDMA in the web.

HYDRA: a novel framework for making high-performance computing offload capable
Yaron Weinsberg, Danny Dolev, Tal Anker and Pete Wyckoff
Proceedings of LCN'06, Tampa, FL, November 2006
Short paper (PDF)

The proliferation of programmable peripheral devices for computer systems opens new possibilities for academic research that will influence system designs in the near future. Programmability is a key feature that enables application-specific extensions to improve performance and offer new features. Increasing transistor density and decreasing cost provide excess computational power in devices such as disk controllers, network interfaces and video cards.

This paper proposes an innovative programming model and runtime support that enables utilization of such devices by providing a generic code offloading framework. The framework enables an application developer to design the offloading aspects of the application by specifying an "offloading layout", which is enforced by the runtime during application deployment. The framework also provides the necessary development tools and programming constructs for developing such applications.

We test our framework by implementing a packet generator on a programmable network card for network testing. The offloaded application produces traffic at five times the rate, and with inter-packet variability that is many orders of magnitude smaller than the non-offloaded version.

Initial performance evaluation of the NetEffect 10 Gigabit iWARP Adapter
Dennis Dalessandro, Pete Wyckoff and Gary Montry
Proceedings of IEEE Cluster '06, RAIT Workshop, Barcelona, Spain, Sep 2006
Paper (PDF)

Interconnect speeds currently surpass the abilities of today's processors to satisfy their demands. The throughput rate provided by the network simply generates too much protocol work for the processor to keep up with. Remote Direct Memory Access has long been studied as a way to alleviate the strain from the processors. The problem is that until recently RDMA interconnects were limited to proprietary or specialty interconnects that are incompatible with existing networking hardware. iWARP, or RDMA over TCP/IP, changes this situation. iWARP brings all of the advantages of RDMA, but is compatible with existing network infrastructure, namely TCP/IP over Ethernet. The drawback to iWARP up to now has been the lack of availability of hardware capable of meeting the performance of specialty RDMA interconnects. Recently, however, 10 Gigabit iWARP adapters are beginning to appear on the market. This paper demonstrates the performance of one such 10 Gigabit iWARP implementation and compares it to a popular specialty RDMA interconnect, InfiniBand.

Hardware/Software Integration for FPGA-based All-Pairs Shortest-Paths
Uday Bondhugula, Ananth Devulapalli, James Dinan, Joseph Fernando, Pete Wyckoff, Eric Stahlberg and P. Sadayappan
Proceedings of FCCM '06, Napa Valley, California, Apr 2006
Paper (PDF)

Field-Programmable Gate Arrays (FPGAs) are being employed in high performance computing systems owing to their potential to accelerate a wide variety of long-running routines. Parallel FPGA-based designs often yield a very high speedup. Applications using these designs on reconfigurable supercomputers involve software on the system managing computation on the FPGA. To extract maximum performance from an FPGA design at the application level, it becomes necessary to minimize associated data movement costs on the system. We address this hardware/software integration challenge in the context of the All-Pairs Shortest-Paths (APSP) problem in a directed graph. We employ a parallel FPGA-based design using a blocked algorithm to solve large instances of APSP. With appropriate design choices and optimizations, experimental results on the Cray XD1 show that the FPGA-based implementation sustains an application-level speedup of 15 over an optimized CPU-based implementation.

Experimental analysis of a mass storage system
Shahid Bokhari, Benjamin Rutt, Pete Wyckoff and Paul Buerger
Concurrency and Computation: Practice and Experience, volume 18, number 15, pages 1929--1950, April 2006
Paper (PDF)

Mass storage systems (MSSs) play a key role in data-intensive parallel computing. Most contemporary MSSs are implemented as redundant arrays of independent/inexpensive disks (RAID) arrays in which commodity disks are tied together with proprietary controller hardware. The performance of such systems can be difficult to predict because most internal details of the controller behavior are not public. We present a systematic method for empirically evaluating MSS performance by obtaining measurements on a series of RAID configurations of increasing size and complexity. We apply this methodology to a large MSS at Ohio Supercomputer Center that has 16 input/output processors, each connected to four 8+1 RAID5 units and provides 128 Tbytes of storage (of which 116.8 Tbytes are usable when formatted). Our methodology permits storage system designers to evaluate empirically the performance of their systems with considerable confidence. Although we have carried out our expirements in the context of a specific system, our methodology is applicable to all large MSSs. The measurements obtained using our methods permit application programmers to be aware of the limits to performance of their codes.

iWarp Protocol Kernel Space Software Implementation
Dennis Dalessandro, Ananth Devulapalli and Pete Wyckoff
Proceedings of IPDPS '06, CAC Workshop, Rhodes Island, Greece, April 2006
Paper (PDF)

Zero-copy, RDMA, and protocol offload are three very important characteristics of high performance interconnects. Previous networks that made use of these techniques were built upon proprietary, and often expensive, hardware. With the introduction of iWarp, it is now possible to achieve all three over existing low-cost TCP/IP networks.

iWarp is a step in the right direction, but currently requires an expensive RNIC to enable zero-copy, RDMA, and protocol offload. While the hardware is expensive at present, given that iWarp is based on a commodity interconnect, prices will surely fall. In the meantime only the most critical of servers will likely make use of iWarp, but in order to take advantage of the RNIC both sides must be so equipped.

It is for this reason that we have implemented the iWarp protocol in software. This allows a server equipped with an RNIC to exploit its advantages even if the client does not have an RNIC. While throughput and latency do not improve by doing this, the server with the RNIC does experience a dramatic reduction in system load. This means that the server is much more scalable, and can handle many more clients than would otherwise be possible with the usual sockets/TCP/IP protocol stack.

Parallel FPGA-based all-pairs shortest-paths in a directed graph
Uday Bondhugula, Ananth Devulapalli, Joseph Fernando, Pete Wyckoff and P. Sadayappan
Proceedings of IPDPS '06, Rhodes Island, Greece, April 2006
Paper (PDF)

With rapid advances in VLSI technology, Field Programmable Gate Arrays (FPGAs) are receiving the attention of the Parallel and High Performance Computing community. In this paper, we propose a highly parallel FPGA design for the Floyd-Warshall algorithm to solve the All-Pairs Shortest-Paths problem in a directed graph. Our work is motivated by a computationally intensive bio-informatics application that employs this algorithm. The design we propose efficiently utilizes the large amount of resources available on an FPGA to maximize parallelism in the presence of significant data dependences. Experimental results from a working implementation of our design on the Cray XD1 show a speedup of 24x over a modern general-purpose microprocessor.

2005

Design and Implementation of the iWarp Protocol in Software
Dennis Dalessandro, Ananth Devulapalli and Pete Wyckoff
Proceedings of PDCS '05, Phoenix, AZ, November 2005
Paper (PDF)
Presentation (PDF)

The term iWarp indicates a set of published protocol specifications that provide remote read and write access to user applications, without operating system intervention or intermediate data copies. The iWarp protocol provides for higher bandwidth and lower latency transfers over existing, widely deployed TCP/IP networks. While hardware implementations of iWarp are starting to emerge, there is a need for software implementations to enable offload on servers as a transition mechanism, for protocol testing, and for future protocol research.

The work presented here allows a server with an iWarp network card to utilize it fully by implementing the iWarp protocol in software on the non-accelerated clients. While throughput does not improve, the true benefit of reduced load on the server machine is realized. Experiments show that sender system load is reduced from 35% to 5% and receiver load is reduced from 90% to under 5%. These gains allow a server to scale to handle many more simultaneous client connections.

A performance analysis of the Ammasso RDMA enabled ethernet adapter and its iWARP API
Dennis Dalessandro and Pete Wyckoff
Proceedings of RAIT Workshop, Cluster '05, Burlington, MA, September 2005
Paper (PDF)

Network speeds are increasing well beyond the capabilities of today's CPUs to efficiently handle the traffic. This bottleneck at the CPU causes the processor to spend more of its time handling communication and less time on actual processing. As network speeds reach 10 Gb/s and more, the CPU simply can not keep up with the data. Various methods have been proposed to solve this problem. High performance interconnects, such as Infiniband, have been developed that rely on RDMA and protocol offload in order to achieve higher throughput and lower latency. In this paper we evaluate the feasibility of a similar approach which, unlike existing high performance interconnects, requires no special infrastructure. RDMA over Ethernet, otherwise known as iWARP, facilitates the zero copy exchange of data over ordinary local area networks. Since it is based on TCP, iWARP enables RDMA in the wide area network as well. This paper provides a look into the performance of one of the earliest commodity implementations of this emerging technology, the Ammasso 1100 RNIC.

Distributed queue-based locking using advanced network features
Ananth Devulapalli and Pete Wyckoff
Proceedings of ICPP '05, Oslo, Normay, June 2005
Paper (PDF)

A Distributed Lock Manager (DLM) provides advisory locking services to applications that run on distributed systems such as databases and file systems. Lock management at the server is implemented using First-In-First-Out (FIFO) queues. In this paper, we demonstrate a novel way of delegating the lock management to the participating lock-requesting nodes, using advanced network features such as Remote Direct Memory Access (RDMA) and Atomic operations. This nicely complements the original idea of DLM, where management of the lock space is distributed. Our implementation achieves better load balancing, reduction in server load and improved throughput over traditional designs.

Memory registration caching correctness
Pete Wyckoff and Jiesheng Wu
Proceedings of CCGrid '05, Cardiff, UK, May 2005
Paper (PDF)

Fast and powerful networks are becoming more popular on clusters to support applications including message passing, file systems, and databases. These networks require special treatment by the operating system to obtain high throughput and low latency. In particular, application memory must be pinned and registered in advance of use. However, popular communication libraries such as MPI have interfaces that do not require explicit registration calls from the user, thus the libraries must manage this aspect themselves.

Registration caching is a necessary and effective tool to reuse memory registrations and avoid the overheads of pinning and unpinning pages around every send or receive. Current memory registration caching schemes do not take into account the fact thet the user has access to a variety of operating system calls that can alter memory layout and destroy earlier cached registrations. The work presented in this paper fixes that problem by providing a mechanism for the operating system to notify the communication library of changes in the memory layout of a process while preserving existing application semantics. This permits the safe and accurate use of memory registration caching.

A Hypergraph Partitioning Based Approach for Scheduling of Tasks with Batch-shared I/O
Gaurav Khanna, N. Vydyanathan, Tahsin Kurc, Umit Catalyurek, Pete Wyckoff, Joel Saltz and P. Sadayappan
Proceedings of CCGrid '05, Cardiff, UK, May 2005
Paper (PDF)

This paper proposes a novel, hypergraph partitioning based strategy to schedule multiple data analysis tasks with batchshared I/O behavior. This strategy formulates the sharing of ﬁles among tasks as a hypergraph to minimize the I/O overheads due to transferring of the same set of ﬁles multiple times and employs a dynamic scheme for ﬁle transfers to reduce contention on the storage system. We experimentally evaluate the proposed approach using application emulators from two application domains; analysis of remotelysensed data and biomedical imaging.

Fast scalable file distribution over Infiniband
Dennis Dalessandro and Pete Wyckoff
First Workshop on System Management Tools for Large-Scale Parallel Systems, Proceedings of IPDPS '05, Denver, CO, April 2005
Paper (PDF)

One of the first steps in starting a program on a cluster is to get the executable, which generally resides on some network file server. This creates not only contention on the network, but causes unnecessary strain on the network file system as well, which is busy serving other requests at the same time. This approach is certainly not scalable as clusters grow larger. We present a new approach that uses a high speed interconnect, novel network features, and a scalable design. We provide a fast, efficient, and scalable solution to the distribution of executable files on production parallel machines.

BMI: a network abstraction layer for parallel I/O
Philip H. Carns, Walter B. Ligon III, Robert Ross and Pete Wyckoff
Workshop on Communication Architecture for Clusters, Proceedings of IPDPS '05, Denver, CO, April 2005
Paper (PDF)

As high-performance computing increases in popularity and performance, the demand for similarly capable input and output systems rises. Parallel I/O takes advantage of many data server machines to provide linearly scaling performance to parallel applications that access storage over the system area network. The demands placed on the network by a parallel storage system are considerably different than those imposed by message-passing algorithms or data-center operations; and, there are many popular and varied networks in use in modern parallel machines. These considerations lead us to develop a network abstraction layer for parallel I/O which is efficient and thread-safe, provides operations specifically required for I/O processing, and supports multiple networks. The Buffered Message Interface (BMI) has low processor overhead, minimal impact on latency, and can improve throughput for parallel file system workloads by as much as 40% compared to other more generic network abstractions.

2004

A Parallel I/O Mechanism for Distributed Systems
Troy Baer and Pete Wyckoff
Proceedings of Cluster '04, San Diego, CA, September 2004
Paper (PDF)

Access to shared data is critical to the long term success of grids of distributed systems. As more parallel applications are being used on these grids, the need for some kind of parallel I/O facility across distributed systems increases. However, grid middleware has thus far had only limited support for distributed parallel I/O.

In this paper, we present an implementation of the MPI-2 I/O interface using the Globus GridFTP client API. MPI is widely used for parallel computing, and its I/O interface maps onto a large variety of storage systems. The limitations of using GridFTP as an MPI-I/O transport mechanism are described, as well as support for parallel access to scientific data formats such as HDF and NetCDF. We compare the performance of GridFTP to that of NFS on the same network using several parallel I/O benchmarks. Our tests indicate that GridFTP can be a workable transport for parallel I/O, particularly for distributed read-only access to shared data sets.

Unifier: unifying cache management and communication buffer management for PVFS over InfiniBand
Jiesheng Wu, Pete Wyckoff, D. K. Panda and Rob Ross
Proceedings of CCGrid '04, Chicago, IL, April 2004
Paper (PDF)

The advent of networking technologies and high performance transport protocols facilitates the service of storage over networks. However, they pose challenges in integration and interaction among storage server application components and system components. In this paper, we put forward a component, called Unifier, to provide more efficient integration and better interaction among these components. Unifier has three notable features. (1) Unifier integrates cache management and communication buffer management. It offers a single copy data sharing among all components in a server application safely and concurrently. (2) It reduces memory registration and deregistration costs to enable applications to take full advantage of RDMA operations. (3) It provides means to achieve adaptation, application-specific optimization, and better cooperation among different components. This paper presents the design and implementation of Unifier. This component has been deployed and evaluated in a version of PVFS1 implementation over InfiniBand. Experimental results show performance improvements between 30% and 70% over other approaches. Better scalability is also achieved by the PVFS I/O servers.

High performance implementation of MPI derived datatype communication over InfiniBand
Jiesheng Wu, Pete Wyckoff and D. K. Panda
Proceedings of IPDPS '04, Santa Fe, NM, April 2004
Paper (PDF)

In this paper, a systematic study of two main types of approach for MPI datatype communication (Pack/Unpackbased approaches and Copy-Reduced approaches) is carried out on the InfiniBand network. We focus on overlapping packing, network communication, and unpacking in the Pack/Unpack-based approaches. We use RDMA operations to avoid packing and/or unpacking in the CopyReduced approaches. Four schemes (Buffer-Centric Segment Pack/Unpack, RDMA Write Gather With Unpack, Pack with RDMA Read Scatter, and Multiple RDMA Writes have been proposed. Three of them have been implemented and evaluated based on one MPI implementation over InfiniBand. Performance results of a vector microbenchmark demonstrate that latency is improved by a factor of up to 3.4 and bandwidth by a factor of up to 3.6 compared to the current datatype communication implementation. Collective operations like MPI Alltoall are demonstrated to benefit. A factor of up to 2.0 improvement has been seen in our measurements of those collective operations on an 8-node system.

Design and implementation of MPICH2 over InfiniBand with RDMA support
Jiuxing Liu, Weihang Jiang, Pete Wyckoff, D. K. Panda, David Ashton, Darius Buntinas, William Gropp and Brian Toonen
Proceedings of IPDPS '04, Santa Fe, NM, April 2004
Paper (PDF)

For several years, MPI has been the de facto standard for writing parallel applications. One of the most popular MPI implementations is MPICH. Its successor, MPICH2, features a completely new design that provides more performance and flexibility. To ensure portability, it has a hierarchical structure based on which porting can be done at different levels.

In this paper, we present our experiences in designing and implementing MPICH2 over InfiniBand. Because of its high performance and open standard, InfiniBand is gaining popularity in the area of high-performance computing. Our study focuses on optimizing the performance of MPI-1 functions in MPICH2. One of our objectives is to exploit Remote Direct Memory Access (RDMA) in InfiniBand to achieve high performance. We have based our design on the RDMA Channel interface provided by MPICH2, which encapsulates architecture-dependent communication functionalities into a very small set of functions.

Starting with a basic design, we apply different optimizations and also propose a zero-copy-based design. We characterize the impact of our optimizations and designs using microbenchmarks. We have also performed an applicationlevel evaluation using the NAS Parallel Benchmarks. Our optimized MPICH2 implementation achieves 7.6 s latency and 857 MB/s bandwidth, which are close to the raw performance of the underlying InfiniBand layer. Our study shows that the RDMA Channel interface in MPICH2 provides a simple, yet powerful, abstraction that enables implementations with high performance by exploiting RDMA operations in InfiniBand. To the best of our knowledge, this is the first high-performance design and implementation of MPICH2 on InfiniBand using RDMA support.

2003

Supporting efficient noncontiguous access in PVFS over InfiniBand
Jiesheng Wu, Pete Wyckoff and D. K. Panda
Proceedings of Cluster '03, Hong Kong, China, December 2003
Paper (PDF)

Noncontiguous I/O access is the main access pattern in many scientific applications. Noncontiguity exists both in access to files and in access to target memory regions on the client. This characteristic imposes a requirement of native noncontiguous I/O access support in cluster file systems for high performance. In this paper, we address noncontiguous data transmission between the client and the I/O server in cluster file systems over a high performance network.

We propose a novel approach, RDMA Gather/Scatter, to transfer noncontiguous data for such I/O accesses. We also propose a new scheme, Optimistic Group Registration, to reduce memory registration costs associated with this approach. We have designed and incorporated this approach in a version of PVFS over InfiniBand. Through a range of PVFS and MPI-IO micro-benchmarks, and the NAS BTIO benchmark, we demonstrate that our approach attains significant performance gains compared to other existing approaches.

Performance comparisons of MPI implementations over InfiniBand, Myrinet and Quadrics
Jiuxing Liu, Balasubramanian Chandrasekaran, Jiesheng Wu, Weihang Jiang, Sushmitha Kini, Weikuan Yu, Darius Buntinas, Pete Wyckoff and D. K. Panda
Proceedings of SC '03, Phoenix, AZ, November 2003
Paper (PDF)

In this paper, we present a comprehensive performance comparison of MPI implementations over InfiniBand, Myrinet and Quadrics. Our performance evaluation consists of two major parts. The first part consists of a set of MPI level micro-benchmarks that characterize different aspects of MPI implementations. The second part of the performance evaluation consists of application level benchmarks. We have used the NAS Parallel Benchmarks and the sweep3D benchmark. We not only present the overall performance results, but also relate application communication characteristics to the information we acquired from the micro-benchmarks. Our results show that the three MPI implementations all have their advantages and disadvantages. For our 8-node cluster, InfiniBand can offer significant performance improvements for a number of applications compared with Myrinet and Quadrics when using the PCI-X bus. Even with just the PCI bus, InfiniBand can still perform better if the applications are bandwidth-bound.

Fast and scalable barrier using RDMA and multicast mechanisms for InfiniBand-based clusters
Sushmitha P. Kini, Jiuxing Liu, Jiesheng Wu, Pete Wyckoff and D. K. Panda
Proceedings of EuroPVM/MPI '03, Venice, Italy, October 2003
Lecture Notes in Computer Science, volume 2840, September 2003, pages 369-378
Paper (PDF)

This paper describes a methodology for efficiently implementing the collective operations, in this case the barrier, on clusters with the emerging InfiniBand Architecture (IBA). IBA provides hardware level support for the Remote Direct Memory Access (RDMA) message passing model as well as the multicast operation. Exploiting these features of InfiniBand to efficiently implement the barrier operation is a challenge in itself. This paper describes the design, implementation and evaluation of three barrier algorithms that leverage these mechanisms. Performance evaluation studies indicate that considerable benefits can be achieved using these mechanisms compared to the traditional implementation based on the point-to-point message passing model. Our experimental results show a performance benefit of up to 1.29 times for a 16-node barrier and up to 1.71 times for non-powers-of-2 group size barriers. Each proposed algorithm performs the best for certain ranges of group sizes and the optimal algorithm can be chosen based on this range. To the best of our knowledge, this is the first attempt to characterize the multicast performance in IBA and to demonstrate the benefits achieved by combining it with RDMA operations for efficient implementations of barrier.

Jiesheng Wu, Pete Wyckoff and Dhabaleswar Panda
PVFS over InfiniBand: design and performance evaluation
Proceedings of ICPP '03, Kaohsiung, Taiwan, October 2003
Best paper award
Longer technical report (PDF)
Eight-page conference paper (PDF)

I/O is quickly emerging as the main bottleneck limiting performance in modern day clusters. The need for scalable parallel I/O and file systems is becoming more and more urgent. In this paper, we examine the feasibility of leveraging InfiniBand technologies to improve I/O performance and scalability of cluster file systems. We use PVFS as a basis for exploring these features.

We design and implement a PVFS version over InfiniBand by taking advantage of InfiniBand features and resolving many challenging issues. In this paper, we design and test: a transport layer customized for the PVFS protocol by trading transparency and generality for performance, buffer management for flow control and efficient memory registration and deregistration, and communication management for reducing network congestion and achieving differentiated services.

Compared to a PVFS implementation over standard TCP/IP on the same InfiniBand network, our implementation offers three times the bandwidth if workloads are not disk-bound and 40% improvement in bandwidth in the disk-bound case. Client CPU utilization is reduced to 1.5% from 91% on TCP/IP.

Demotion-based exclusive caching through demote buffering: design and evaluations over different networks
Jiesheng Wu, Pete Wyckoff and D. K. Panda
Proceedings of SNAPI '03, New Orleans, LA, September 2003
Paper (PDF)

Multi-level buffer cache architecture has been widely deployed in today's multiple-tier computing environments. However, caches in different levels are inclusive. To make better use of these caches and to achieve the expected performance commensurate to the aggregate cache size, exclusive caching has been proposed. Demotion-based exclusive caching introduces a DEMOTE operation to transfer blocks discarded by a upper level cache to a lower level cache. In this paper, we propose a DEMOTE buffering mechanism over storage networks to reduce the visible costs of DEMOTE operations and provides more flexibility for optimizations. We evaluate the performance of DEMOTE buffering using simulations across both synthetic and real-life workloads on different networks. Our results show that DEMOTE buffering can effectively hide demotion costs. A maximum speedup of 1.4x over the original DEMOTE approach is achieved for some workloads. 1.08-1.15x speedups are achieved for two real-life workloads. The vast performance gains results from overlapping demotions and other activities, reduced communication operations and high utilization of the network bandwidth.

Balasubramanian Chandrasekaran, Pete Wyckoff and Dhabaleswar K. Panda
MBIB: A micro-benchmark suite for evaluating InfiniBand architecture implementations
Proceedings of Peformance Tools '03, Urbana, IL, September 2003
Lecture Notes in Computer Science, volume 2794, September 2003, pages 29-46
Paper (PDF)

Recently, the InfiniBand architecture has been proposed as the next generation interconnect for I/O and inter-process communication. The main idea behind this industry standard is to use a scalable switched fabric to design the next generation clusters and servers with high performance and scalability. This architecture provides various types of new mechanisms and services. Different implementation choices may lead to different design strategies for efficient implementation of higher-level communication layers such as MPI, sockets and distributed shared memory. They also have an impact on the performance of applications.

Currently there is no framework for evaluating different design choices and for obtaining insights about the choices made in a particular implementation. In this paper we address these issues by proposing a new microbenchmark suite to evaluate the Infiniband architecture implementations. MIBA consists of several microbenchmarks that are divided into two major categories: non-data transfer related and data transfer related. By using the new microbenchmark suite, the performance of Infiniband implementations can be evaluated under different communication scenarios.

Microbenchmark performance comparison of high-speed cluster interconnects
Jiuxing Liu, Balasubramanian Chandrasekaran, Weikuan Yu, Jiesheng Wu, Darius Buntinas, Sushmitha Kini, Dhabaleswar K. Panda and Pete Wyckoff
Proceedings of Hot Interconnects '03, Palo Alto, CA, August 2003
Hot Interconnects paper (PDF)
IEEE Micro, volume 24, number 1, January/February 2004, pages 42-51
IEEE Micro paper (PDF)

In this paper, we present a comprehensive performance evaluation of three high speed cluster interconnects: InfiniBand, Myrinet and Quadrics. We propose a set of microbenchmarks to characterize different performance aspects of these interconnects. Our microbenchmark suite includes not only traditional tests and performance parameters, but also those specifically tailored to the interconnects' advanced features such as user-level access for performing communication and remote direct memory access. In order to explore the full communication capability of the interconnects, we have implemented the microbenchmark suite at the low level messaging layer provided by each interconnect. Our performance results show that all three interconnects achieve low latency and high bandwidth at the expense of low host overhead. However, they show quite different performance behaviors when handling completion notification, unbalanced communication patterns and different communication buffer reuse patterns.

High performance RDMA-based MPI implementation over InfiniBand
Jiuxing Liu, Jiesheng Wu, Sushmitha Kini, Pete Wyckoff and Dhabaleswar K. Panda
Proceedings of ICS '03, San Francisco, CA, June 2003
Paper (PDF)

Although InfiniBand Architecture is relatively new in the high performance computing area, it offers many features which help us to improve the performance of the communication subsystem. In this paper, we have proposed a new design of MPI over InfiniBand which brings the benefit of RDMA to not only large messages, but also small and control messages. In the mean time, we have achieved scalability by exploiting application communication pattern and combining send/receive operations with RDMA operations. Our RDMA-based MPI implementation currently delivers a latency of 7.5 microseconds for small messages and a peak bandwidth of 857 million bytes per second. Performance evaluation at the MPI level shows that for small messages, our RDMA based design can reduce the latency by 20%, increase the bandwidth by over 70%, and reduce the host overhead by 15%. For large messages, we improve performance by reducing the time for transferring control messages. We have also shown that our new design also benefits MPI collective communication and NAS Parallel Benchmarks.

2002

Impact of on-demand connection management in MPI over VIA
Jiesheng Wu, Jiuxing Liu, Pete Wyckoff and D. K. Panda
Proceedings of Cluster '02, Chicago, IL, September 2002
Paper (PDF)

Designing scalable and efficient Message Passing Interface (MPI) implementations for emerging cluster interconnects such as VIA-based networks and InfiniBand are important building next generation clusters. In this paper, we address the scalability issue in the implementation of MPI over VIA by on-demand connection management mechanism. The on-demand connection management is designed to limit the use of resources to what applications absolutely require.

We address the design issues of incorporating the ondemand connection mechanism into an implementation of MPI over VIA. A complete implementation was done for MVICH over both cLAN VIA and Berkeley VIA. Performance evaluation on a set of microbenchmarks and NAS parallel benchmarks demonstrates that the on-demand mechanism can increase the scalability of MPI implementations by limiting the use of resources as needed by applications. It also shows that the on-demand mechanism delivers comparable or better performance as the static mechanism in which a fully-connected process model usually exists in the MPI implementations. These results demonstrate that the on-demand connection mechanism is a feasible solution to increase the scalability of MPI implementations over VIA- and InfiniBand-based networks.

Back to index

Last updated 16 Jan 2008.