Data Structure Consistency Using Atomic Operations in Storage Devices
Ananth Devulapalli, Dennis Dalessandro, Pete Wyckoff
Proceedings of MSST'08, SNAPI Workshop, Baltimore, MD, September 2008, to appear
Managing concurrency is a fundamental requirement for any multi-threaded
system, frequently implemented by serializing critical code regions or using
object locks on shared resources. Storage systems are one case of this,
where multiple clients may wish to access or modify on-disk objects
concurrently yet safely. Data consistency may be provided by an
inter-client protocol, or it can be implemented in the file system server or
storage device.
We look at the design of algorithms for consistency using atomic operations,
in particular compare-and-swap and fetch-and-add. These can be used as
primitives to form a variety of distributed algorithms, including locking.
The operations are also simple enough to consider implementing directly in
storage devices, obviating the need for dedicated servers in the data and
metadata access paths.
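To make the two primitives concrete, here is a minimal Python sketch (an illustrative simulation, not the paper's on-device design): `AtomicCell` models a storage-device word, with a mutex standing in for the device's internal atomicity, and `CASLock` is the classic spinlock built from compare-and-swap.

```python
import threading, time

class AtomicCell:
    """Simulated storage-device word; the mutex stands in for the device's
    internal atomicity when executing a single command."""
    def __init__(self, value=0):
        self._value = value
        self._mutex = threading.Lock()

    def compare_and_swap(self, expected, new):
        """Install `new` only if the cell holds `expected`; return the prior value."""
        with self._mutex:
            old = self._value
            if old == expected:
                self._value = new
            return old

    def fetch_and_add(self, delta):
        """Add `delta`; return the value before the addition."""
        with self._mutex:
            old = self._value
            self._value += delta
            return old

class CASLock:
    """Minimal spinlock from compare-and-swap: 0 = free, 1 = held."""
    def __init__(self):
        self.cell = AtomicCell(0)

    def acquire(self):
        while self.cell.compare_and_swap(0, 1) != 0:
            time.sleep(0)  # yield; a real client would back off and retry

    def release(self):
        self.cell.compare_and_swap(1, 0)
```

`fetch_and_add` similarly supports distributed algorithms such as handing out FIFO tickets, the basis of queue-style locks.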
An OSD-based Approach to Managing Directory Operations in Parallel File
Systems
Nawab Ali, Ananth Devulapalli, Dennis Dalessandro, Pete Wyckoff,
P. Sadayappan
Proceedings of Cluster'08, Tsukuba, Japan, September 2008, to appear
Distributed file systems that use multiple servers to store data in parallel are becoming commonplace. Much work has already gone into such systems to maximize data throughput. However, metadata management has historically been treated as an afterthought. In previous work we focused on improving metadata management techniques by placing file metadata along with data on Object-based Storage Devices (OSDs). However, we did not investigate directory operations. This work looks at the possibility of designing directory structures directly on OSDs, without the need for intervening servers. In particular, the need for atomicity is a fundamental requirement that we explore in depth. Through performance results of benchmarks and applications we show the feasibility of using OSDs directly for metadata, including directory operations.
Non-Contiguous I/O Support for Object-Based Storage
Dennis Dalessandro, Ananth Devulapalli, Pete Wyckoff
Proceedings of ICPP'08, P2S2 Workshop, Portland, OR, September 2008, to appear
Paper (PDF)
The access patterns performed by disk-intensive applications vary widely,
from simple contiguous reads or writes through an entire file to completely
unpredictable random access. Often, applications will be able to access
multiple disconnected sections of a file in a single operation. Application
programming interfaces such as POSIX and MPI encourage the use of
non-contiguous access with calls that process I/O vectors.
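The memory-side half of such vectored interfaces can be seen in POSIX's own gather/scatter calls (this sketch uses Python's Unix-only `os.pwritev`/`os.preadv` wrappers; the paper's OSD extension goes further by vectorizing the file side of the transfer as well):

```python
import os, tempfile

# One gather write and one scatter read, each a single system call, standing
# in for what an I/O-vector-aware storage protocol carries as one request.
fd, path = tempfile.mkstemp()

# pwritev() gathers three separate memory buffers into one contiguous write;
# without vector support this would be three independent operations.
written = os.pwritev(fd, [b"alpha", b"beta", b"gamma"], 0)

# preadv() scatters one contiguous file region into two separate buffers.
first, second = bytearray(5), bytearray(9)
nread = os.preadv(fd, [first, second], 0)

os.close(fd)
os.unlink(path)
```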
Below the level of the programming interface, most storage protocols do
not implement I/O vector operations (also known as scatter/gather). These
protocols, including NFSv3 and block-based SCSI devices, must instead issue
multiple independent operations to complete the single I/O vector operation
specified by the application, at a cost of a much slower overall transfer
time.
Scatter/gather I/O is critical to the performance of many parallel
applications, hence protocols designed for this area do tend to support
I/O vectors. Parallel Virtual File System (PVFS) in particular does so;
however, a recent specification for object-based storage devices (OSD) does
not.
Using a software implementation of an OSD as the storage device in a
Parallel Virtual File System (PVFS) framework, we show the advantages of
providing direct support for non-contiguous data transfers. We also
implement the feature in OSDs in a way that is both efficient for
performance and appropriate for inclusion in future specification documents.
Tapping into the Fountain of CPUs—On Operating System Support for
Programmable Devices
Yaron Weinsberg, Danny Dolev, Tal Anker, Muli Ben-Yehuda, Pete Wyckoff
Proceedings of ASPLOS'08, Seattle, WA, March 2008
Paper (PDF)
The constant race for faster and more powerful CPUs is drawing to a
close. No longer is it feasible to significantly increase the speed of
the CPU without paying a crushing penalty in power consumption and
production costs. Instead of increasing single thread performance, the
industry is turning to multiple CPU threads or cores (such as SMT and
CMP) and heterogeneous CPU architectures (such as the Cell Broadband
Engine). While this is a step in the right direction, in every modern
PC there is a wealth of untapped compute resources. The NIC has a CPU;
the disk controller is programmable; some high-end graphics adapters
are already more powerful than host CPUs. Some of these CPUs can
perform some functions more efficiently than the host CPUs. Our
operating systems and programming abstractions should be expanded to
let applications tap into these computational resources and make the
best use of them.
Therefore, we propose the Hydra framework, which lets application
developers use the combined power of every compute resource in a
coherent way. Hydra is a programming model and a runtime support
layer which enables utilization of host processors as well as various
programmable peripheral devices' processors. We present the framework
and its application for a demonstrative use-case, as well as provide a
thorough evaluation of its capabilities. Using Hydra we were able to
cut down the development cost of a system that uses multiple
heterogeneous compute resources significantly.
Integrating Parallel File Systems with Object-Based Storage Devices
Ananth Devulapalli, Dennis Dalessandro, Pete Wyckoff, Nawab Ali
and P. Sadayappan
Proceedings of SC'07, Reno, NV, November 2007
Paper (PDF)
As storage systems grow larger and more complex, the traditional block-based
design of current disks can no longer satisfy workloads that are
increasingly metadata intensive. A new approach is offered by object-based
storage devices (OSDs). By moving more of the responsibility of storage
management onto each OSD, improvements in performance, scalability and
manageability are possible.
Since this technology is new, no physical object-based storage device is
currently available. In this work we describe a software emulator that
enables a storage server to behave as an object-based disk. We focus on
the design of the attribute storage, which is used to hold metadata
associated with particular data objects. Alternative designs are discussed,
and performance results for an SQL implementation are presented.
iSER Storage Target for Object-based Storage Devices
Dennis Dalessandro, Ananth Devulapalli, Pete Wyckoff
Proceedings of MSST'07, SNAPI Workshop, San Diego, CA, September 2007
Paper (PDF)
Presentation (PDF)
In order to increase client capacity and performance, storage systems have
begun to utilize the advantages offered by modern interconnects. Previously
storage has been transported over costly Fibre Channel networks or ubiquitous
but low-performance Ethernet networks. However, with the adoption of the iSCSI
extensions for RDMA (iSER) it is now possible to use RDMA based interconnects
for storage while leveraging existing iSCSI tools and
deployments.
Building on previous work with an object-based storage device emulator using
the iSCSI transport, we extend its functionality to include iSER. Using an
RDMA transport brings with it many design issues, including the need to
register memory used by the network, and how to bridge the quite different RDMA
completion semantics with existing event management based on file descriptors.
Experiments demonstrate reduced latency and greatly increased throughput
compared to iSCSI implementations both on Gigabit Ethernet and
on IP over InfiniBand.
Attribute Storage Design for Object-based Storage Devices
Ananth Devulapalli, Dennis Dalessandro, Nawab Ali and Pete Wyckoff
Proceedings of MSST'07, San Diego, CA, September 2007
Paper (PDF)
As storage systems grow larger and more complex, the traditional block-based
design of current disks can no longer satisfy workloads that are
increasingly metadata intensive. A new approach is offered by object-based
storage devices (OSDs). By moving more of the responsibility of storage
management onto each OSD, improvements in performance, scalability and
manageability are possible.
Since this technology is new, no physical object-based storage device is
currently available. In this work we describe a software emulator that
enables a storage server to behave as an object-based disk. We focus on
the design of the attribute storage, which is used to hold metadata
associated with particular data objects. Alternative designs are discussed,
and performance results for an SQL implementation are presented.
Memory Management Strategies for Data Serving with RDMA
Dennis Dalessandro and Pete Wyckoff
Proceedings of HotI'07, Palo Alto, CA, August 2007
Paper (PDF)
Presentation (PDF)
Using remote direct memory access (RDMA) to ship data is becoming a very
popular technique in network architectures. As these networks are adopted
by the broader computing market, new challenges arise in transitioning
existing code to use RDMA APIs.
One particular class of applications that maps poorly to RDMA is that of
file data servers. In order to access file data and send it over
the network, an application must copy it to user-space buffers, and the
operating system must register those buffers with the network adapter.
Ordinary sockets-based networks can achieve higher performance by using the
"sendfile" mechanism to avoid copying file data into user-space buffers.
In this work we revisit time-honored approaches to sending file data, but
adapted to RDMA networks. In particular, both pipelining and sendfile can
be used, albeit with modifications to handle memory registration issues.
However, memory registration is not well-integrated in current operating
systems, leading to difficulties in adapting the sendfile mechanism.
These two techniques make it feasible to create RDMA-based applications that
serve file data and still maintain a high level of performance.
Accelerating Web Protocols Using RDMA
Dennis Dalessandro and Pete Wyckoff
Proceedings of NCA'07, Cambridge, MA, July 2007
Paper (PDF)
High-performance computing, just like the world at large, is continually discovering new uses for the Internet. Interesting applications rely on server-generated content, severely taxing the capabilities of web servers. Thus it is common for multiple servers to run a single site. In our work, we use a novel network feature known as RDMA to vastly improve performance and scalability of a single server. Using an unmodified Apache web server with a dynamic module to enable iWARP (RDMA over TCP), we can handle more clients with lower CPU utilization, and higher throughput.
Accelerating distributed computing applications using a network
offloading framework
Yaron Weinsberg, Danny Dolev, Pete Wyckoff and Tal Anker
Proceedings of IPDPS'07, Long Beach, CA, March 2007
Paper (PDF)
During the last two decades, a considerable amount of academic research has been conducted in the field of distributed computing. Typically, distributed applications require frequent network communication, which becomes a dominant factor in the overall runtime overhead. The recent proliferation of programmable peripheral devices for computer systems may be utilized in order to improve the performance of such applications. Offloading application-specific network functions to peripheral devices can improve performance and reduce host CPU utilization. Due to the peculiarities of each particular device and the difficulty of programming an outboard CPU, the need for an abstracted offloading framework is apparent. This paper proposes a novel offloading framework, called HYDRA, that enables utilization of such devices. The framework enables an application developer to design the offloading aspects of the application by specifying an "offloading layout", which is enforced by the runtime during application deployment. The performance of a variety of distributed algorithms can be significantly improved by utilizing such a framework. We demonstrate this claim by evaluating several offloaded applications: a distributed total message ordering algorithm, a packet generator, and an embedded firewall.
File creation strategies in a distributed metadata file system
Ananth Devulapalli and Pete Wyckoff
Proceedings of IPDPS'07, Long Beach, CA, March 2007
Paper (PDF)
As computing breaches petascale limits both in processor performance and
storage capacity, the only way that current and future gains in performance
can be achieved is by increasing the parallelism of the system. Gains in
storage performance remain low due to the use of traditional distributed file
systems such as NFS, where although multiple clients can access files at the
same time, only one node can serve files to the clients. New file systems
that distribute load across multiple data servers are being developed;
however, most implementations still concentrate all the metadata load at a
single server. Distributing metadata load is important to accommodate growing
numbers of more powerful clients.
Scaling metadata performance is more complex than scaling raw I/O
performance, and with distributed metadata the complexity increases further.
In this paper we present strategies for file creation in distributed
metadata file systems. Using the PVFS distributed file system as our
testbed, we present designs that are able to reduce the message complexity
of the create operation and increase performance. Compared to the base-case
create protocol implemented in PVFS, our design delivers near constant
operation latency as the system scales, does not degenerate under high
contention situations, and increases throughput linearly as the number of
metadata servers increases. The design schemes are applicable to any
distributed file system implementation.
Accelerating web protocols using RDMA
Dennis Dalessandro and Pete Wyckoff
Proceedings of SC'06, Poster, Tampa, FL, November 2006
As high performance computing finds increased usage in the Internet, we should expect to see larger data transfers and an increase in the processing of dynamic web content by HPC web servers, particularly where grid applications are concerned. The high cost of processing TCP/IP on the CPU will be too much for today's web servers to handle. As an answer to this problem, we look to iWARP, which is RDMA over TCP. The benefits of RDMA have long been realized in technologies such as InfiniBand, and iWARP makes this possible over ordinary Ethernet. In implementing an iWARP module for the Apache web server, we create an experimental platform for testing the feasibility of RDMA in the web.
HYDRA: a novel framework for making high-performance computing offload
capable
Yaron Weinsberg, Danny Dolev, Tal Anker and Pete Wyckoff
Proceedings of LCN'06, Tampa, FL, November 2006
Short paper (PDF)
The proliferation of programmable peripheral devices for computer
systems opens new possibilities for academic research that will
influence system designs in the near future. Programmability is a
key feature that enables application-specific extensions to improve
performance and offer new features. Increasing transistor density
and decreasing cost provide excess computational power in devices
such as disk controllers, network interfaces and video cards.
This paper proposes an innovative programming model and
runtime support that enables utilization of such devices
by providing a generic code offloading framework. The framework
enables an application developer to design the offloading aspects of
the application by specifying an "offloading layout",
which is enforced by the runtime during application deployment. The
framework also provides the necessary development tools and
programming constructs for developing such applications.
We test our framework by implementing a packet generator on a
programmable network card for network testing. The offloaded
application produces traffic at five times the rate of the non-offloaded
version, with inter-packet variability that is many orders of
magnitude smaller.
Initial performance evaluation of the NetEffect 10 Gigabit iWARP Adapter
Dennis Dalessandro, Pete Wyckoff and Gary Montry
Proceedings of IEEE Cluster '06, RAIT Workshop, Barcelona, Spain, Sep 2006
Paper (PDF)
Interconnect speeds currently surpass the abilities of today's processors to satisfy their demands. The throughput rate provided by the network simply generates too much protocol work for the processor to keep up with. Remote Direct Memory Access has long been studied as a way to alleviate the strain from the processors. The problem is that until recently RDMA interconnects were limited to proprietary or specialty interconnects that are incompatible with existing networking hardware. iWARP, or RDMA over TCP/IP, changes this situation. iWARP brings all of the advantages of RDMA, but is compatible with existing network infrastructure, namely TCP/IP over Ethernet. The drawback to iWARP up to now has been the lack of availability of hardware capable of meeting the performance of specialty RDMA interconnects. Recently, however, 10 Gigabit iWARP adapters are beginning to appear on the market. This paper demonstrates the performance of one such 10 Gigabit iWARP implementation and compares it to a popular specialty RDMA interconnect, InfiniBand.
Hardware/Software Integration for FPGA-based All-Pairs Shortest-Paths
Uday Bondhugula, Ananth Devulapalli, James Dinan, Joseph Fernando, Pete
Wyckoff, Eric Stahlberg and P. Sadayappan
Proceedings of FCCM '06, Napa Valley, California, Apr 2006
Paper (PDF)
Field-Programmable Gate Arrays (FPGAs) are being employed in high performance computing systems owing to their potential to accelerate a wide variety of long-running routines. Parallel FPGA-based designs often yield a very high speedup. Applications using these designs on reconfigurable supercomputers involve software on the system managing computation on the FPGA. To extract maximum performance from an FPGA design at the application level, it becomes necessary to minimize associated data movement costs on the system. We address this hardware/software integration challenge in the context of the All-Pairs Shortest-Paths (APSP) problem in a directed graph. We employ a parallel FPGA-based design using a blocked algorithm to solve large instances of APSP. With appropriate design choices and optimizations, experimental results on the Cray XD1 show that the FPGA-based implementation sustains an application-level speedup of 15 over an optimized CPU-based implementation.
Experimental analysis of a mass storage system
Shahid Bokhari, Benjamin Rutt, Pete Wyckoff and Paul Buerger
Concurrency and Computation: Practice and Experience, volume 18, number 15,
pages 1929--1950, April 2006
Paper (PDF)
Mass storage systems (MSSs) play a key role in data-intensive parallel computing. Most contemporary MSSs are implemented as redundant arrays of independent/inexpensive disks (RAID) in which commodity disks are tied together with proprietary controller hardware. The performance of such systems can be difficult to predict because most internal details of the controller behavior are not public. We present a systematic method for empirically evaluating MSS performance by obtaining measurements on a series of RAID configurations of increasing size and complexity. We apply this methodology to a large MSS at Ohio Supercomputer Center that has 16 input/output processors, each connected to four 8+1 RAID5 units, and provides 128 Tbytes of storage (of which 116.8 Tbytes are usable when formatted). Our methodology permits storage system designers to evaluate empirically the performance of their systems with considerable confidence. Although we have carried out our experiments in the context of a specific system, our methodology is applicable to all large MSSs. The measurements obtained using our methods permit application programmers to be aware of the limits to performance of their codes.
iWarp Protocol Kernel Space Software Implementation
Dennis Dalessandro, Ananth Devulapalli and Pete Wyckoff
Proceedings of IPDPS '06, CAC Workshop, Rhodes Island, Greece, April 2006
Paper (PDF)
Zero-copy, RDMA, and protocol offload are three very important
characteristics of high performance interconnects. Previous networks
that made use of these techniques were built upon proprietary, and often
expensive, hardware. With the introduction of iWarp, it is now possible to
achieve all three over existing low-cost TCP/IP networks.
iWarp is a step in the right direction, but currently requires an expensive
RNIC to enable zero-copy, RDMA, and protocol offload. While the hardware is
expensive at present, given that iWarp is based on a commodity interconnect,
prices will surely fall. In the meantime, likely only the most critical
servers will make use of iWarp; yet to benefit from the RNIC, both sides
of a connection must be so equipped.
It is for this reason that we have implemented the iWarp protocol
in software. This allows a server equipped with an RNIC to exploit its
advantages even if the client does not have an RNIC. While throughput
and latency do not improve by doing this, the server with the RNIC does
experience a dramatic reduction in system load. This means that the server
is much more scalable, and can handle many more clients than would otherwise
be possible with the usual sockets/TCP/IP protocol stack.
Parallel FPGA-based all-pairs shortest-paths in a directed graph
Uday Bondhugula, Ananth Devulapalli, Joseph Fernando, Pete Wyckoff
and P. Sadayappan
Proceedings of IPDPS '06, Rhodes Island, Greece, April 2006
Paper (PDF)
With rapid advances in VLSI technology, Field Programmable Gate Arrays (FPGAs) are receiving the attention of the Parallel and High Performance Computing community. In this paper, we propose a highly parallel FPGA design for the Floyd-Warshall algorithm to solve the All-Pairs Shortest-Paths problem in a directed graph. Our work is motivated by a computationally intensive bio-informatics application that employs this algorithm. The design we propose efficiently utilizes the large amount of resources available on an FPGA to maximize parallelism in the presence of significant data dependences. Experimental results from a working implementation of our design on the Cray XD1 show a speedup of 24x over a modern general-purpose microprocessor.
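For reference, the Floyd-Warshall kernel that the FPGA design parallelizes is shown below as a plain Python sketch; the k-loop carries the data dependence, while the (i, j) updates inside each k step are independent and are what the parallel design exploits.

```python
INF = float("inf")

def floyd_warshall(dist):
    """In-place all-pairs shortest paths on an n x n adjacency matrix,
    where dist[i][j] is the edge weight (INF if no edge, 0 on the diagonal)."""
    n = len(dist)
    for k in range(n):          # sequential: iteration k depends on k-1
        for i in range(n):      # the (i, j) updates below are independent
            dik = dist[i][k]
            for j in range(n):
                if dik + dist[k][j] < dist[i][j]:
                    dist[i][j] = dik + dist[k][j]
    return dist
```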
Design and Implementation of the iWarp Protocol in Software
Dennis Dalessandro, Ananth Devulapalli and Pete Wyckoff
Proceedings of PDCS '05, Phoenix, AZ, November 2005
Paper (PDF)
Presentation (PDF)
The term iWarp indicates a set of published protocol specifications that
provide remote read and write access to user applications, without operating
system intervention or intermediate data copies. The iWarp protocol provides
for higher bandwidth and lower latency transfers over existing, widely
deployed TCP/IP networks. While hardware implementations of iWarp are
starting to emerge, there is a need for software implementations to enable
offload on servers as a transition mechanism, for protocol testing, and for
future protocol research.
The work presented here allows a server with an iWarp network card to
utilize it fully by implementing the iWarp protocol in software on the
non-accelerated clients. While throughput does not improve, the true
benefit of reduced load on the server machine is realized. Experiments show
that sender system load is reduced from 35% to 5% and receiver load is
reduced from 90% to under 5%. These gains allow a server to scale to
handle many more simultaneous client connections.
A performance analysis of the Ammasso RDMA enabled ethernet adapter and its iWARP API
Dennis Dalessandro and Pete Wyckoff
Proceedings of RAIT Workshop, Cluster '05, Burlington, MA, September 2005
Paper (PDF)
Network speeds are increasing well beyond the capabilities of today's CPUs to efficiently handle the traffic. This bottleneck at the CPU causes the processor to spend more of its time handling communication and less time on actual processing. As network speeds reach 10 Gb/s and more, the CPU simply cannot keep up with the data. Various methods have been proposed to solve this problem. High performance interconnects, such as InfiniBand, have been developed that rely on RDMA and protocol offload in order to achieve higher throughput and lower latency. In this paper we evaluate the feasibility of a similar approach which, unlike existing high performance interconnects, requires no special infrastructure. RDMA over Ethernet, otherwise known as iWARP, facilitates the zero copy exchange of data over ordinary local area networks. Since it is based on TCP, iWARP enables RDMA in the wide area network as well. This paper provides a look into the performance of one of the earliest commodity implementations of this emerging technology, the Ammasso 1100 RNIC.
Distributed queue-based locking using advanced network features
Ananth Devulapalli and Pete Wyckoff
Proceedings of ICPP '05, Oslo, Norway, June 2005
Paper (PDF)
A Distributed Lock Manager (DLM) provides advisory locking services to applications that run on distributed systems such as databases and file systems. Lock management at the server is implemented using First-In-First-Out (FIFO) queues. In this paper, we demonstrate a novel way of delegating the lock management to the participating lock-requesting nodes, using advanced network features such as Remote Direct Memory Access (RDMA) and Atomic operations. This nicely complements the original idea of DLM, where management of the lock space is distributed. Our implementation achieves better load balancing, reduction in server load and improved throughput over traditional designs.
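The FIFO queueing that the abstract delegates to clients can be sketched as a ticket lock built on fetch-and-add (simulated here with Python threads; on the real system the two counters would live in registered server memory, updated and polled by clients with RDMA atomics and reads):

```python
import threading, time

class TicketLock:
    """Sketch of FIFO queue-based locking via fetch-and-add."""
    def __init__(self):
        self._next = 0                  # fetch-and-add target: next ticket
        self._serving = 0               # ticket currently allowed to proceed
        self._mutex = threading.Lock()  # stands in for the network atomic

    def acquire(self):
        with self._mutex:               # fetch-and-add: take a ticket
            ticket = self._next
            self._next += 1
        while self._serving != ticket:  # wait for our turn, in FIFO order
            time.sleep(0)               # yield; a real client polls via RDMA read
        return ticket

    def release(self):
        with self._mutex:
            self._serving += 1          # hand the lock to the next ticket
```

Because each waiter spins on its own ticket value rather than retrying an acquire, the server never re-arbitrates contended requests, which is the source of the reduced server load reported above.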
Memory registration caching correctness
Pete Wyckoff and Jiesheng Wu
Proceedings of CCGrid '05, Cardiff, UK, May 2005
Paper (PDF)
Fast and powerful networks are becoming more popular on clusters to support
applications including message passing, file systems, and databases. These
networks require special treatment by the operating system to obtain high
throughput and low latency. In particular, application memory must be
pinned and registered in advance of use. However, popular communication
libraries such as MPI have interfaces that do not require explicit
registration calls from the user, thus the libraries must manage this aspect
themselves.
Registration caching is a necessary and effective tool to reuse memory
registrations and avoid the overheads of pinning and unpinning pages around
every send or receive. Current memory registration caching schemes do not
take into account the fact that the user has access to a variety of
operating system calls that can alter memory layout and destroy earlier
cached registrations. The work presented in this paper fixes that problem
by providing a mechanism for the operating system to notify the
communication library of changes in the memory layout of a process while
preserving existing application semantics. This permits the safe and
accurate use of memory registration caching.
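The cache-plus-notification scheme described above can be sketched as follows (the `register`/`deregister` callbacks are hypothetical stand-ins for the NIC driver calls, and `invalidate` is the hook the operating-system notification would drive):

```python
class RegistrationCache:
    """Sketch of a pin-down cache for memory registrations."""
    def __init__(self, register, deregister):
        self._register = register      # pins pages and registers with the NIC
        self._deregister = deregister  # releases a registration handle
        self._cache = {}               # (addr, length) -> registration handle

    def lookup(self, addr, length):
        key = (addr, length)
        if key not in self._cache:     # miss: pin and register once
            self._cache[key] = self._register(addr, length)
        return self._cache[key]        # hit: reuse, no pin/unpin cost

    def invalidate(self, addr, length):
        """Driven by OS notification of layout changes (e.g. munmap/free);
        without it, a stale entry could alias remapped pages."""
        handle = self._cache.pop((addr, length), None)
        if handle is not None:
            self._deregister(handle)
```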
A Hypergraph Partitioning Based Approach for Scheduling of Tasks
with Batch-shared I/O
Gaurav Khanna, N. Vydyanathan, Tahsin Kurc, Umit Catalyurek, Pete Wyckoff,
Joel Saltz and P. Sadayappan
Proceedings of CCGrid '05, Cardiff, UK, May 2005
Paper (PDF)
This paper proposes a novel hypergraph-partitioning-based strategy to schedule multiple data analysis tasks with batch-shared I/O behavior. This strategy formulates the sharing of files among tasks as a hypergraph to minimize the I/O overheads due to transferring the same set of files multiple times and employs a dynamic scheme for file transfers to reduce contention on the storage system. We experimentally evaluate the proposed approach using application emulators from two application domains: analysis of remotely sensed data and biomedical imaging.
Fast scalable file distribution over InfiniBand
Dennis Dalessandro and Pete Wyckoff
First Workshop on System Management Tools for Large-Scale Parallel Systems,
Proceedings of IPDPS '05, Denver, CO, April 2005
Paper (PDF)
One of the first steps in starting a program on a cluster is to get the executable, which generally resides on some network file server. This creates not only contention on the network, but causes unnecessary strain on the network file system as well, which is busy serving other requests at the same time. This approach is certainly not scalable as clusters grow larger. We present a new approach that uses a high speed interconnect, novel network features, and a scalable design. We provide a fast, efficient, and scalable solution to the distribution of executable files on production parallel machines.
BMI: a network abstraction layer for parallel I/O
Philip H. Carns, Walter B. Ligon III, Robert Ross and Pete Wyckoff
Workshop on Communication Architecture for Clusters,
Proceedings of IPDPS '05, Denver, CO, April 2005
Paper (PDF)
As high-performance computing increases in popularity and performance, the demand for similarly capable input and output systems rises. Parallel I/O takes advantage of many data server machines to provide linearly scaling performance to parallel applications that access storage over the system area network. The demands placed on the network by a parallel storage system are considerably different than those imposed by message-passing algorithms or data-center operations; and, there are many popular and varied networks in use in modern parallel machines. These considerations lead us to develop a network abstraction layer for parallel I/O which is efficient and thread-safe, provides operations specifically required for I/O processing, and supports multiple networks. The Buffered Message Interface (BMI) has low processor overhead, minimal impact on latency, and can improve throughput for parallel file system workloads by as much as 40% compared to other more generic network abstractions.
A Parallel I/O Mechanism for Distributed Systems
Troy Baer and Pete Wyckoff
Proceedings of Cluster '04, San Diego, CA, September 2004
Paper (PDF)
Access to shared data is critical to the long term success of grids of
distributed systems. As more parallel applications are being used on
these grids, the need for some kind of parallel I/O facility across
distributed systems increases. However, grid middleware has thus far
had only limited support for distributed parallel I/O.
In this paper, we present an implementation of the MPI-2 I/O interface
using the Globus GridFTP client API. MPI is widely used for parallel
computing, and its I/O interface maps onto a large variety of storage
systems. The limitations of using GridFTP as an MPI-I/O transport
mechanism are described, as well as support for parallel access to
scientific data formats such as HDF and NetCDF. We compare the
performance of GridFTP to that of NFS on the same network using
several parallel I/O benchmarks. Our tests indicate that GridFTP can
be a workable transport for parallel I/O, particularly for distributed
read-only access to shared data sets.
Unifier: unifying cache management and communication buffer
management for PVFS over InfiniBand
Jiesheng Wu, Pete Wyckoff, D. K. Panda and Rob Ross
Proceedings of CCGrid '04, Chicago, IL, April 2004
Paper (PDF)
The advent of networking technologies and high performance transport protocols facilitates the service of storage over networks. However, they pose challenges in integration and interaction among storage server application components and system components. In this paper, we put forward a component, called Unifier, to provide more efficient integration and better interaction among these components. Unifier has three notable features. (1) Unifier integrates cache management and communication buffer management. It offers single-copy data sharing among all components in a server application safely and concurrently. (2) It reduces memory registration and deregistration costs to enable applications to take full advantage of RDMA operations. (3) It provides means to achieve adaptation, application-specific optimization, and better cooperation among different components. This paper presents the design and implementation of Unifier. This component has been deployed and evaluated in a PVFS1 implementation over InfiniBand. Experimental results show performance improvements between 30% and 70% over other approaches. Better scalability is also achieved by the PVFS I/O servers.
High performance implementation of MPI derived datatype
communication over InfiniBand
Jiesheng Wu, Pete Wyckoff and D. K. Panda
Proceedings of IPDPS '04, Santa Fe, NM, April 2004
Paper (PDF)
In this paper, a systematic study of two main types of approaches for MPI datatype communication (Pack/Unpack-based approaches and Copy-Reduced approaches) is carried out on the InfiniBand network. We focus on overlapping packing, network communication, and unpacking in the Pack/Unpack-based approaches. We use RDMA operations to avoid packing and/or unpacking in the Copy-Reduced approaches. Four schemes (Buffer-Centric Segment Pack/Unpack, RDMA Write Gather With Unpack, Pack with RDMA Read Scatter, and Multiple RDMA Writes) have been proposed. Three of them have been implemented and evaluated in one MPI implementation over InfiniBand. Performance results of a vector microbenchmark demonstrate that latency is improved by a factor of up to 3.4 and bandwidth by a factor of up to 3.6 compared to the current datatype communication implementation. Collective operations like MPI_Alltoall are demonstrated to benefit: a factor of up to 2.0 improvement has been seen in our measurements of those collective operations on an 8-node system.
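The segment-oriented packing idea behind the Pack/Unpack-based approaches can be illustrated on a strided "vector" datatype: the noncontiguous source is packed into fixed-size segments, so sending one segment can overlap with packing the next. This is a minimal sketch with an illustrative segment size, not the paper's implementation:

```python
# Sketch of segment-based pack/unpack for a vector datatype: count blocks
# of blocklen elements, spaced stride elements apart. Each yielded segment
# would be handed to the network layer, overlapping with further packing.

def pack_vector_segments(src, count, blocklen, stride, seg_size):
    """Yield contiguous segments of a strided vector datatype."""
    seg = []
    for i in range(count):
        for x in src[i * stride : i * stride + blocklen]:
            seg.append(x)
            if len(seg) == seg_size:
                yield seg           # segment ready: send while we keep packing
                seg = []
    if seg:
        yield seg                   # final partial segment

def unpack_vector_segments(segments, dst, count, blocklen, stride):
    """Scatter received segments back into a strided destination buffer."""
    flat = [x for seg in segments for x in seg]
    pos = 0
    for i in range(count):
        dst[i * stride : i * stride + blocklen] = flat[pos : pos + blocklen]
        pos += blocklen

src = list(range(12))
segs = list(pack_vector_segments(src, count=3, blocklen=2, stride=4, seg_size=4))
dst = [0] * 12
unpack_vector_segments(segs, dst, count=3, blocklen=2, stride=4)
```

The Copy-Reduced schemes avoid this packing copy entirely by describing the noncontiguous layout directly to the RDMA gather/scatter hardware.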
Design and implementation of MPICH2 over InfiniBand with RDMA support
Jiuxing Liu, Weihang Jiang, Pete Wyckoff, D. K. Panda, David Ashton,
Darius Buntinas, William Gropp and Brian Toonen
Proceedings of IPDPS '04, Santa Fe, NM, April 2004
Paper (PDF)
For several years, MPI has been the de facto standard for writing parallel
applications. One of the most popular MPI implementations is MPICH. Its
successor, MPICH2, features a completely new design that provides more
performance and flexibility. To ensure portability, it has a hierarchical
structure that allows porting to be done at different levels.
In this paper, we present our experiences in designing and implementing MPICH2
over InfiniBand. Because of its high performance and open standard, InfiniBand
is gaining popularity in the area of high-performance computing. Our study
focuses on optimizing the performance of MPI-1 functions in MPICH2. One of our
objectives is to exploit Remote Direct Memory Access (RDMA) in InfiniBand to
achieve high performance. We have based our design on the RDMA Channel
interface provided by MPICH2, which encapsulates architecture-dependent
communication functionalities into a very small set of functions.
Starting with a basic design, we apply different optimizations and also
propose a zero-copy-based design. We characterize the impact of our
optimizations and designs using microbenchmarks. We have also performed an
application-level evaluation using the NAS Parallel Benchmarks. Our optimized
MPICH2 implementation achieves 7.6 µs latency and 857 MB/s bandwidth, which are
close to the raw performance of the underlying InfiniBand layer. Our study
shows that the RDMA Channel interface in MPICH2 provides a simple, yet
powerful, abstraction that enables implementations with high performance by
exploiting RDMA operations in InfiniBand. To the best of our knowledge, this is
the first high-performance design and implementation of MPICH2 on InfiniBand
using RDMA support.
Supporting efficient noncontiguous access in PVFS over InfiniBand
Jiesheng Wu, Pete Wyckoff and D. K. Panda
Proceedings of Cluster '03, Hong Kong, China, December 2003
Paper (PDF)
Noncontiguous I/O access is the main access pattern in many scientific
applications. Noncontiguity exists both in access to files and in access to
target memory regions on the client. This characteristic imposes a requirement
of native noncontiguous I/O access support in cluster file systems for high
performance. In this paper, we address noncontiguous data transmission between
the client and the I/O server in cluster file systems over a high performance
network.
We propose a novel approach, RDMA Gather/Scatter, to transfer noncontiguous
data for such I/O accesses. We also propose a new scheme, Optimistic Group
Registration, to reduce memory registration costs associated with this
approach. We have designed and incorporated this approach in a version of PVFS
over InfiniBand. Through a range of PVFS and MPI-IO micro-benchmarks, and the
NAS BTIO benchmark, we demonstrate that our approach attains significant
performance gains compared to other existing approaches.
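The registration-cost problem that Optimistic Group Registration targets comes from registering many small noncontiguous segments individually. A simplified sketch of the grouping idea (with a hypothetical `register` hook in place of the real verbs call) is to register one region spanning all segments, trading some extra pinned memory for a single registration call:

```python
# Illustrative sketch of grouped registration for noncontiguous I/O:
# rather than one registration per segment, register a single region
# covering every segment's address range. The register callable is a
# hypothetical stand-in for ibv_reg_mr.

def group_register(segments, register):
    """segments: list of (addr, length). Returns one handle covering all."""
    start = min(addr for addr, _ in segments)
    end = max(addr + length for addr, length in segments)
    return register(start, end - start)   # one call instead of len(segments)

calls = []
handle = group_register(
    [(0x2000, 100), (0x2400, 50), (0x2100, 200)],
    lambda a, l: calls.append((a, l)) or (a, l),
)
```

The "optimistic" aspect in the paper concerns when this grouped registration is attempted; this sketch only shows the grouping itself.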
Performance comparisons of MPI implementations over InfiniBand, Myrinet
and Quadrics
Jiuxing Liu, Balasubramanian Chandrasekaran, Jiesheng Wu, Weihang Jiang,
Sushmitha Kini, Weikuan Yu, Darius Buntinas, Pete Wyckoff
and D. K. Panda
Proceedings of SC '03, Phoenix, AZ, November 2003
Paper (PDF)
In this paper, we present a comprehensive performance comparison of MPI implementations over InfiniBand, Myrinet and Quadrics. Our performance evaluation consists of two major parts. The first part consists of a set of MPI level micro-benchmarks that characterize different aspects of MPI implementations. The second part of the performance evaluation consists of application level benchmarks. We have used the NAS Parallel Benchmarks and the sweep3D benchmark. We not only present the overall performance results, but also relate application communication characteristics to the information we acquired from the micro-benchmarks. Our results show that the three MPI implementations all have their advantages and disadvantages. For our 8-node cluster, InfiniBand can offer significant performance improvements for a number of applications compared with Myrinet and Quadrics when using the PCI-X bus. Even with just the PCI bus, InfiniBand can still perform better if the applications are bandwidth-bound.
Fast and scalable barrier using RDMA and multicast mechanisms for
InfiniBand-based clusters
Sushmitha P. Kini, Jiuxing Liu, Jiesheng Wu, Pete Wyckoff and D. K. Panda
Proceedings of EuroPVM/MPI '03, Venice, Italy, October 2003
Lecture Notes in Computer Science, volume 2840, September 2003,
pages 369-378
Paper (PDF)
This paper describes a methodology for efficiently implementing collective operations, in this case the barrier, on clusters with the emerging InfiniBand Architecture (IBA). IBA provides hardware-level support for the Remote Direct Memory Access (RDMA) message passing model as well as the multicast operation. Exploiting these features of InfiniBand to efficiently implement the barrier operation is a challenge in itself. This paper describes the design, implementation and evaluation of three barrier algorithms that leverage these mechanisms. Performance evaluation studies indicate that considerable benefits can be achieved using these mechanisms compared to the traditional implementation based on the point-to-point message passing model. Our experimental results show a performance benefit of up to 1.29 times for a 16-node barrier and up to 1.71 times for non-powers-of-2 group size barriers. Each proposed algorithm performs best for certain ranges of group sizes, and the optimal algorithm can be chosen based on this range. To the best of our knowledge, this is the first attempt to characterize the multicast performance in IBA and to demonstrate the benefits achieved by combining it with RDMA operations for efficient implementations of the barrier operation.
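A gather-and-release barrier of the kind these algorithms build on can be modeled with shared flags: each rank "writes" an arrival flag into the root's buffer (as an RDMA write would), the root polls until all flags are set, then sets a release flag that every rank polls on, standing in for the hardware multicast. This thread-based sketch is only a model of the control flow, not the paper's IBA implementation:

```python
# Model of an RDMA-write gather + multicast-release barrier using threads
# and shared flags. Real RDMA writes and the IBA multicast are simulated
# by plain assignments into shared lists that the peers poll.

import threading
import time

def run_barrier(nprocs):
    arrival = [0] * nprocs   # root's buffer; rank i "RDMA-writes" slot i
    release = [0]            # set once by the root, polled by every rank
    order = []               # ranks record when they leave the barrier

    def rank(i):
        arrival[i] = 1                  # signal arrival at the barrier
        if i == 0:                      # root: gather, then release
            while not all(arrival):
                time.sleep(0.001)       # poll arrival flags, as on a NIC buffer
            release[0] = 1              # "multicast" the release
        while not release[0]:
            time.sleep(0.001)           # every rank polls the release flag
        order.append(i)                 # past the barrier

    threads = [threading.Thread(target=rank, args=(i,)) for i in range(nprocs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order

order = run_barrier(4)   # no rank exits before all four have arrived
```

The hardware multicast replaces the root's O(N) release writes with a single operation, which is where the measured speedups come from.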
PVFS over InfiniBand: design and performance evaluation
Jiesheng Wu, Pete Wyckoff and Dhabaleswar Panda
Proceedings of ICPP '03, Kaohsiung, Taiwan, October 2003
Best paper award
Longer technical report (PDF)
Eight-page conference paper (PDF)
I/O is quickly emerging as the main bottleneck limiting performance in modern
day clusters. The need for scalable parallel I/O and file systems is becoming
more and more urgent. In this paper, we examine the feasibility of leveraging
InfiniBand technologies to improve I/O performance and scalability of cluster
file systems. We use PVFS as a basis for exploring these features.
We design and implement a PVFS version over InfiniBand by taking advantage of
InfiniBand features and resolving many challenging issues. In this paper, we
design and test: a transport layer customized for the PVFS protocol by trading
transparency and generality for performance, buffer management for flow control
and efficient memory registration and deregistration, and communication
management for reducing network congestion and achieving differentiated
services.
Compared to a PVFS implementation over standard TCP/IP on the same InfiniBand
network, our implementation offers three times the bandwidth if workloads are
not disk-bound and 40% improvement in bandwidth in the disk-bound case.
Client CPU utilization is reduced to 1.5% from 91% on TCP/IP.
Demotion-based exclusive caching through demote buffering: design and
evaluations over different networks
Jiesheng Wu, Pete Wyckoff and D. K. Panda
Proceedings of SNAPI '03, New Orleans, LA, September 2003
Paper (PDF)
Multi-level buffer cache architectures have been widely deployed in today's multiple-tier computing environments. However, caches at different levels are typically inclusive. To make better use of these caches and to achieve performance commensurate with the aggregate cache size, exclusive caching has been proposed. Demotion-based exclusive caching introduces a DEMOTE operation to transfer blocks discarded by an upper-level cache to a lower-level cache. In this paper, we propose a DEMOTE buffering mechanism over storage networks that reduces the visible costs of DEMOTE operations and provides more flexibility for optimizations. We evaluate the performance of DEMOTE buffering using simulations across both synthetic and real-life workloads on different networks. Our results show that DEMOTE buffering can effectively hide demotion costs. A maximum speedup of 1.4x over the original DEMOTE approach is achieved for some workloads, and 1.08-1.15x speedups are achieved for two real-life workloads. The performance gains result from overlapping demotions with other activities, fewer communication operations, and high utilization of the network bandwidth.
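The core buffering idea can be sketched simply: instead of issuing a blocking DEMOTE to the lower-level cache for every evicted block, evictions are staged locally and shipped in batches, so demotion traffic overlaps with other work. The buffer capacity and the `transfer` hook below are illustrative choices, not values from the paper:

```python
# Sketch of DEMOTE buffering: evicted blocks are staged in a local buffer
# and sent downstream in batches, hiding the per-demotion network cost.

class DemoteBuffer:
    def __init__(self, transfer, capacity=8):
        self.transfer = transfer   # ships a batch of blocks to the lower cache
        self.capacity = capacity
        self.pending = []

    def demote(self, block):
        """Called on eviction; returns immediately to the upper-level cache."""
        self.pending.append(block)
        if len(self.pending) >= self.capacity:
            self.flush()           # one network operation covers many blocks

    def flush(self):
        if self.pending:
            self.transfer(self.pending)
            self.pending = []

batches = []
db = DemoteBuffer(lambda b: batches.append(list(b)), capacity=3)
for blk in range(7):
    db.demote(blk)
db.flush()   # push out the final partial batch
```

A real implementation must also handle reads of blocks still sitting in the demote buffer; this sketch omits that lookup path.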
MIBA: a micro-benchmark suite for evaluating InfiniBand architecture
implementations
Balasubramanian Chandrasekaran, Pete Wyckoff and Dhabaleswar K. Panda
Proceedings of Performance Tools '03, Urbana, IL, September 2003
Lecture Notes in Computer Science, volume 2794, September 2003, pages 29-46
Paper (PDF)
Recently, the InfiniBand architecture has been proposed as the next generation
interconnect for I/O and inter-process communication. The main idea behind
this industry standard is to use a scalable switched fabric to design the next
generation clusters and servers with high performance and scalability. This
architecture provides various types of new mechanisms and services. Different
implementation choices may lead to different design strategies for efficient
implementation of higher-level communication layers such as MPI, sockets and
distributed shared memory. They also have an impact on the performance of
applications.
Currently there is no framework for evaluating different design choices and
for obtaining insights about the choices made in a particular implementation.
In this paper we address these issues by proposing a new microbenchmark suite
to evaluate InfiniBand architecture implementations. MIBA consists of
several microbenchmarks that are divided into two major categories: non-data
transfer related and data transfer related. By using the new microbenchmark
suite, the performance of InfiniBand implementations can be evaluated under
different communication scenarios.
Microbenchmark performance comparison of high-speed cluster interconnects
Jiuxing Liu, Balasubramanian Chandrasekaran, Weikuan Yu, Jiesheng Wu,
Darius Buntinas, Sushmitha Kini, Dhabaleswar K. Panda and Pete Wyckoff
Proceedings of Hot Interconnects '03, Palo Alto, CA, August 2003
Hot Interconnects paper (PDF)
IEEE Micro, volume 24, number 1, January/February 2004, pages 42-51
IEEE Micro paper (PDF)
In this paper, we present a comprehensive performance evaluation of three high-speed cluster interconnects: InfiniBand, Myrinet and Quadrics. We propose a set of microbenchmarks to characterize different performance aspects of these interconnects. Our microbenchmark suite includes not only traditional tests and performance parameters, but also those specifically tailored to the interconnects' advanced features such as user-level access for performing communication and remote direct memory access. In order to explore the full communication capability of the interconnects, we have implemented the microbenchmark suite at the low-level messaging layer provided by each interconnect. Our performance results show that all three interconnects achieve low latency and high bandwidth while maintaining low host overhead. However, they show quite different performance behaviors when handling completion notification, unbalanced communication patterns and different communication buffer reuse patterns.
High performance RDMA-based MPI implementation over InfiniBand
Jiuxing Liu, Jiesheng Wu, Sushmitha Kini, Pete Wyckoff
and Dhabaleswar K. Panda
Proceedings of ICS '03, San Francisco, CA, June 2003
Paper (PDF)
Although the InfiniBand Architecture is relatively new in the high performance computing area, it offers many features which help us to improve the performance of the communication subsystem. In this paper, we propose a new design of MPI over InfiniBand which brings the benefit of RDMA not only to large messages, but also to small and control messages. At the same time, we achieve scalability by exploiting application communication patterns and combining send/receive operations with RDMA operations. Our RDMA-based MPI implementation currently delivers a latency of 7.5 microseconds for small messages and a peak bandwidth of 857 million bytes per second. Performance evaluation at the MPI level shows that for small messages, our RDMA-based design can reduce the latency by 20%, increase the bandwidth by over 70%, and reduce the host overhead by 15%. For large messages, we improve performance by reducing the time for transferring control messages. We have also shown that our new design benefits MPI collective communication and the NAS Parallel Benchmarks.
Impact of on-demand connection management in MPI over VIA
Jiesheng Wu, Jiuxing Liu, Pete Wyckoff and D. K. Panda
Proceedings of Cluster '02, Chicago, IL, September 2002
Paper (PDF)
Designing scalable and efficient Message Passing Interface (MPI)
implementations for emerging cluster interconnects such as VIA-based networks
and InfiniBand is important for building next-generation clusters. In this
paper, we address the scalability issue in the implementation of MPI over VIA
by means of an on-demand connection management mechanism. On-demand connection
management is designed to limit the use of resources to what applications
absolutely require.
We address the design issues of incorporating the on-demand connection
mechanism into an implementation of MPI over VIA. A complete implementation
was done for MVICH over both cLAN VIA and Berkeley VIA. Performance evaluation
on a set of microbenchmarks and NAS parallel benchmarks demonstrates that the
on-demand mechanism can increase the scalability of MPI implementations by
limiting resource use to what applications need. It also shows that the
on-demand mechanism delivers performance comparable to or better than the
static mechanism, in which MPI implementations usually assume a
fully-connected process model. These results demonstrate that the on-demand
connection mechanism is a feasible way to increase the scalability of MPI
implementations over VIA- and InfiniBand-based networks.
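The lazy-establishment pattern at the heart of on-demand connection management can be sketched in a few lines: a connection to a peer is set up only the first time a message is actually addressed to it, rather than building the full N-1 mesh at startup. The `connect` and `wire_send` hooks below are hypothetical stand-ins for the VIA/InfiniBand connection handshake and post-send path:

```python
# Sketch of on-demand connection management: connections are created
# lazily on first use instead of eagerly at MPI initialization, so a
# process that talks to few peers holds few connections.

class ConnectionManager:
    def __init__(self, connect):
        self.connect = connect   # performs the real connection setup
        self.conns = {}          # peer rank -> established connection

    def send(self, peer, msg, wire_send):
        if peer not in self.conns:               # first message to this peer
            self.conns[peer] = self.connect(peer)
        wire_send(self.conns[peer], msg)         # reuse on every later send

connected, sent = [], []
cm = ConnectionManager(lambda p: connected.append(p) or ("conn", p))
cm.send(3, "a", lambda c, m: sent.append((c, m)))
cm.send(3, "b", lambda c, m: sent.append((c, m)))   # reuses the connection
cm.send(5, "c", lambda c, m: sent.append((c, m)))
```

Since many MPI applications exhibit sparse, mostly-static communication patterns, most processes end up establishing far fewer than N-1 connections, which is the source of the scalability gain the paper measures.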
Last updated 16 Jan 2008.