Title: High-Performance Transport Protocols

Los Alamos National Laboratory

PI: Wu-chun Feng

Los Alamos National Laboratory

P. O. Box 1663, MS D451

Los Alamos, New Mexico 87545

Tel: 505-665-2730, Email: feng@lanl.gov

 

 

Executive Summary:

The explosive growth of high-speed computer networks, combined with rapid and unpredictable developments in applications and workloads, has crippled the performance of today’s ubiquitous transport protocol, TCP. Without tedious manual tuning by network experts, TCP performance over the wide-area network (WAN) in support of distributed computational grids is so abysmal that visualization scientists at Los Alamos National Laboratory (LANL) currently dump 100 GB of data to tape and send it to Sandia National Laboratories (SNL) via Federal Express, because doing so is actually faster than transferring the data electronically via TCP over a 155-Mb/s (OC-3) WAN backbone: 100 GB / 1 MB/s ≈ 28 hours, where 1 MB/s is the measured throughput between LANL and SNL. Even with manual tuning of the end-host buffer sizes, TCP achieves only 2 MB/s over the same WAN. In short, critical high-end applications such as remote visualization and high-capacity data transfers routinely fail to meet end-to-end performance requirements when deployed over high-speed networks that rely on TCP.
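The transfer-time arithmetic above can be checked with a short sketch. It assumes decimal units (1 GB = 1000 MB) and, for the last row, an idealized OC-3 line rate of 155 Mb/s with no protocol overhead; the function name and the line-rate row are illustrative additions, not figures from our measurements.

```python
def transfer_hours(data_gb: float, throughput_mb_per_s: float) -> float:
    """Hours needed to move data_gb gigabytes at throughput_mb_per_s MB/s (decimal units)."""
    seconds = (data_gb * 1000.0) / throughput_mb_per_s
    return seconds / 3600.0


# Assumption: idealized OC-3 payload rate, 155 Mb/s expressed in MB/s.
OC3_MB_PER_S = 155.0 / 8.0

scenarios = [
    ("measured TCP", 1.0),          # 1 MB/s measured between LANL and SNL
    ("hand-tuned TCP", 2.0),        # 2 MB/s after manual buffer tuning
    ("OC-3 line rate", OC3_MB_PER_S),
]

for label, rate in scenarios:
    print(f"{label:>15}: {transfer_hours(100.0, rate):5.1f} hours")
```

Under these assumptions, the untuned transfer takes roughly 28 hours and the hand-tuned one roughly 14 hours, while the same 100 GB would take about an hour and a half if the link could be driven at line rate.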

Recent studies blame the self-similar nature of aggregate network traffic for TCP’s poor performance because such traffic is not readily amenable to statistical multiplexing in the Internet, and hence in computational grids. Researchers attribute the self-similar behavior to heavy-tailed distributions in file size, packet inter-arrival time, and transfer duration. However, our recent research presented at SC 2000 identifies a previously ignored source of self-similarity, one that is readily controllable: TCP itself. Even when aggregate application traffic is rigged to smooth out as more applications’ traffic is multiplexed, TCP induces burstiness in the aggregate traffic load, thus adversely impacting network performance. In addition, our results indicate that TCP performance will dramatically worsen as WAN speeds continue to increase.

To simultaneously address many of the problems with TCP for high-performance networking in a distributed computational grid, we propose three inter-related research tasks to directly address the "high-performance transport protocols" initiative in the Scientific Discovery through Advanced Computing Program: [1] dynamic right-sizing, [2] high-performance IP, and [3] RAPID: Rate-Adjusting Protocol for Internet Delivery.

Major Goals and Technical Challenges

Frustrated at the pitiful networking performance provided by the TCP/IP protocol suite in system-area networks (SANs) for clusters and supercomputers, researchers in high-performance computing proposed the notion of "user-level network interfaces" or ULNIs (also known as OS-bypass protocols) in the mid-1990s. These ULNIs reduced end-to-end latency by two orders of magnitude and increased realizable bandwidth by up to an order of magnitude in the SAN. The wild success of these high-performance ULNIs in SANs provided the impetus for new research directions including, but not limited to, the application of ULNI techniques to the I/O file system and proposals for a standard ULNI, e.g., Virtual Interface Architecture (VIA).

However, the recent rise of computational grids for high-performance computing necessitates high-performance networking over the WAN, not just the SAN. While the performance of TCP/IP may be abysmal in the WAN, the protocol suite seamlessly scales from the SAN to the WAN. In contrast, while the performance of a ULNI is orders of magnitude better than that of TCP/IP, a ULNI currently does not scale beyond the SAN and, furthermore, does not support congestion control. Thus, the high-performance networking community finds itself at a crossroads: [1] Do we leverage the lessons learned from ULNIs and incorporate them into TCP/IP (or a new high-performance transport protocol that sits on top of IP) in order to achieve better performance? Or [2] do we "re-implement" portions of IP and TCP (specifically, routing and congestion control) in the ULNI so that it will scale to the WAN? Based on the lessons learned from ULNIs, the research on multifractal network traffic analysis and modeling, and our proposed research with Rice U. and SLAC on network inference and tomography, we believe that the answer is the former.

In addition, based on our traffic-characterization research, we have already identified two fundamental problems with TCP in computational grids. First, the static flow-control window is set to a default value (e.g., 32 KB) that is only a small fraction of the WAN bandwidth-delay product (e.g., 32 KB / 1024 KB = 3.125%). Second, the congestion-control mechanism induces burstiness and thus adversely impacts end-to-end network performance. These problems result in abysmal performance over the WAN, e.g., less than 1 MB/s (or 8 Mb/s) of end-to-end bandwidth with a 30-ms round-trip time over a 155-Mb/s WAN backbone between Los Alamos National Laboratory and Sandia National Laboratories, which are located only sixty miles apart. Furthermore, our preliminary results indicate that end-to-end TCP performance will dramatically worsen as WAN speeds continue to increase.
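The first problem can be made concrete with a minimal sketch, assuming that a sender can have at most one flow-control window of unacknowledged data in flight per round trip. The link speed, round-trip time, and 32-KB default window come from the text above; the function names are illustrative, and the 1024-KB bandwidth-delay product quoted above is treated as a separate illustrative figure rather than recomputed.

```python
def bandwidth_delay_product_bytes(link_bits_per_s: float, rtt_s: float) -> float:
    """Bytes that must be in flight to keep the link full for one round trip."""
    return link_bits_per_s * rtt_s / 8.0


def window_limited_throughput(window_bytes: float, rtt_s: float) -> float:
    """Upper bound on throughput (bytes/s) when one window is sent per round trip."""
    return window_bytes / rtt_s


LINK_BITS_PER_S = 155e6      # OC-3 backbone between the two laboratories
RTT_S = 0.030                # 30-ms round-trip time
WINDOW_BYTES = 32 * 1024     # default static flow-control window

bdp = bandwidth_delay_product_bytes(LINK_BITS_PER_S, RTT_S)
ceiling = window_limited_throughput(WINDOW_BYTES, RTT_S)

print(f"bandwidth-delay product : {bdp / 1024:.0f} KB")
print(f"window-limited ceiling  : {ceiling / 1e6:.2f} MB/s")
print(f"fraction of link usable : {ceiling * 8 / LINK_BITS_PER_S:.1%}")
```

With these inputs the window alone caps throughput at roughly 1.1 MB/s regardless of the available link capacity, which is of the same order as the measured figure of under 1 MB/s.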

Thus, our ultimate goal is to provide a high-performance TCP (or a new, high-performance transport protocol) for computational grids. This goal consists of three sub-goals. First, by recognizing and thoroughly understanding the manual tasks needed to size the buffers (or flow-control windows) optimally, we propose a technique called dynamic right-sizing that automates the sizing of these buffers so that they change dynamically and never artificially limit bandwidth. Second, we leverage the idea of protocol off-loading from the ULNI community and propose a high-performance IP that likewise off-loads many of its standard tasks to the network interface card (NIC), thus freeing up processor cycles on the main CPU. Third, we propose a transport protocol with user-settable reliability. This is a longer-term research project aimed at providing a spectrum of qualities of service, from the unreliable UDP to the reliable TCP.
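One way the first sub-goal might work is sketched below. This is a hypothetical illustration, not the proposed implementation: it assumes the receiver can count the bytes that arrive in each round trip, treats that count as an estimate of the sender's current window, and grows its buffer to twice that estimate so the sender always has room to increase its rate. The class name, the factor of two, the 8-MB cap, and the toy sender that doubles its window every round trip are all assumptions made for the example.

```python
class RightSizingReceiver:
    """Hypothetical receiver-side buffer tuner (illustrative sketch only)."""

    def __init__(self, initial_bytes: int = 32 * 1024, max_bytes: int = 8 * 1024 * 1024):
        self.buffer_bytes = initial_bytes   # currently advertised window
        self.max_bytes = max_bytes          # assumed system-wide cap

    def on_rtt_elapsed(self, bytes_received_this_rtt: int) -> int:
        # Bytes received in one round trip approximate the sender's window.
        # Advertise twice that amount so the buffer never becomes the limit.
        target = 2 * bytes_received_this_rtt
        if target > self.buffer_bytes:
            self.buffer_bytes = min(target, self.max_bytes)
        return self.buffer_bytes


def simulate(rtts: int = 10, bdp_bytes: int = 581_250):
    """Toy sender that doubles its window each RTT, capped by buffer and path."""
    receiver = RightSizingReceiver()
    sender_window = 16 * 1024
    delivered_per_rtt = []
    for _ in range(rtts):
        delivered = min(sender_window, receiver.buffer_bytes, bdp_bytes)
        delivered_per_rtt.append(delivered)
        receiver.on_rtt_elapsed(delivered)
        sender_window *= 2
    return delivered_per_rtt, receiver.buffer_bytes


delivered, final_buffer = simulate()
for i, d in enumerate(delivered, start=1):
    print(f"RTT {i:2d}: {d / 1024:7.1f} KB delivered")
print(f"final buffer: {final_buffer / 1024:.1f} KB")
```

In this toy run the advertised buffer grows from the 32-KB default until the path, not the buffer, limits how much data is delivered per round trip; with a static 32-KB window the delivered amount would stay at 32 KB per round trip.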

Major Milestones and Activities

Current Connections with Other SciDAC Projects