A High-Performance Data Grid Toolkit:

Enabling Technology for Wide Area Data-Intensive Applications

http://www.mcs.anl.gov/dsl/scidac/datagrid

PIs: Ian Foster (foster@mcs.anl.gov), Carl Kesselman (carl@isi.edu), Miron Livny (miron@cs.wisc.edu)

Executive Summary

September 2001

Vision:

One of the most demanding and important challenges in constructing the distributed computing machinery required to support SciDAC goals is the efficient, high-performance, reliable, secure, and policy-aware management of large-scale data movement. This problem is fundamental to application domains as diverse as experimental physics (high energy physics, nuclear physics, light sources), simulation science (climate, computational chemistry, fusion, astrophysics), and large-scale collaboration. In each case, highly distributed user communities require high-speed access to valuable data, whether for visualization or analysis. The quantities of data involved (terabytes to petabytes), the scale of the demand (hundreds or thousands of users, data-intensive analyses, real-time constraints), and the complexity of the infrastructure that must be managed (networks, tertiary storage systems, network caches, computers, visualization systems) make the problem extremely challenging.

In the past several years, we have seen considerable consensus emerge across different scientific disciplines that an effective solution to these problems requires an integrated "Data Grid" infrastructure that recognizes and exploits commonalities in two independent dimensions: (1) among different levels of the "service stack" so that network, middleware, and application-specific services are developed in a coordinated fashion; and (2) across different applications, so that we develop infrastructure that is truly general purpose.

We now propose to exploit our existing base of experience and code and move to the next level of Data Grid technology, via the construction of a high-performance Data Grid Toolkit aimed at providing the fundamental mechanisms required for efficient, high-performance, reliable, secure, and policy-aware management of large-scale data movement.

Major Goals and Technical Challenges:

Clearly, a project of this scale cannot hope to address the whole range of Data Grid middleware challenges. Fortunately, we are able to identify a relatively small set of critical issues that, if addressed, will have a major impact on a wide range of Data Grid projects. We believe that we have solid technical ideas in each area and that this project will hence be extremely effective in moving Data Grid technology and applications forward.

The general problem that we propose to focus on is the reliable, high-performance movement of data between Data Grid components. In tackling this problem, we focus on three key questions:

  1. Data Movement: How do we move data efficiently and reliably between two Data Grid components?
  2. Data Transfer Management: How do we coordinate end-to-end data transfers so as to meet performance goals and make efficient use of available resources?
  3. Collective Data Management: How do we structure, schedule, and manage data movement operations to satisfy higher-level goals?

To resolve these issues, we will:

We plan four distinct classes of contributions from the proposed work:

  1. Technical innovations. We will advance knowledge in high-performance protocols and Data Grid technologies via the definition and evaluation of new approaches to protocol design, parallel transfer strategies, resource management, scheduling, and other key technical problems.
  2. High-quality software. We will incorporate both our new techniques and "best of breed" techniques from other sources into high-quality software toolkits and libraries. We will leverage our leadership position in Data Grid software to achieve wide dissemination and uptake of this software.
  3. Innovative application concepts. We will work with DOE application groups to apply our Data Grid technologies in new and challenging ways, hence both contributing to the development of secure, workable DOE Collaboratory applications and obtaining guidance on our technology R&D directions.
  4. Standards. We will continue to work actively with the Global Grid Forum (GGF) and Internet Engineering Task Force (IETF) to transition new protocol ideas into standards wherever possible, and with industry to achieve adoption of those standards in products.

Major Milestones and Activities:

YEAR 1

Deliver GridFTP servers and clients (non-striped)

Deliver replica management service

YEAR 2

Deliver GridFTP servers and clients (striped and non-striped)

Deploy striped GridFTP into a real Data Grid

Demonstrate coordinated use of storage manager, information service, and transfer service by replica management service

YEAR 3

Deliver distributed replica catalog

Deliver extended storage resource manager and information service

Demonstrate coordination of replica management service with storage resource services, for reliable, high-performance, scheduled data replication

YEAR 4

Deliver extended resource and replication services

Demonstrate alternative data management approaches

YEAR 5

Deliver alternative data management approaches (GridHSM, Grid filesystem)

Deliver multiple, advanced Data Grid management systems using common storage resource services