A High-Performance Data Grid Toolkit:
Enabling Technology for Wide Area Data-Intensive Applications
http://www.mcs.anl.gov/dsl/scidac/datagrid
PIs: Ian Foster (foster@mcs.anl.gov), Carl Kesselman (carl@isi.edu), Miron Livny (miron@cs.wisc.edu)
Executive Summary
September 2001
Vision:
One of the most demanding and important challenges in constructing the distributed computing machinery required to support SciDAC goals is the efficient, high-performance, reliable, secure, and policy-aware management of large-scale data movement. This problem is fundamental to application domains as diverse as experimental physics (high energy physics, nuclear physics, light sources), simulation science (climate, computational chemistry, fusion, astrophysics), and large-scale collaboration. In each case, highly distributed user communities require high-speed access to valuable data, whether for visualization or analysis. The quantities of data involved (terabytes to petabytes), the scale of the demand (hundreds or thousands of users, data-intensive analyses, real-time constraints), and the complexity of the infrastructure that must be managed (networks, tertiary storage systems, network caches, computers, visualization systems) make the problem extremely challenging.
In the past several years, we have seen considerable consensus emerge across different scientific disciplines that an effective solution to these problems requires an integrated "Data Grid" infrastructure that recognizes and exploits commonalities in two independent dimensions: (1) among different levels of the "service stack" so that network, middleware, and application-specific services are developed in a coordinated fashion; and (2) across different applications, so that we develop infrastructure that is truly general purpose.
We now propose to exploit our existing base of experience and code and move to the next level of Data Grid technology, via the construction of a high-performance Data Grid Toolkit aimed at providing the fundamental mechanisms required for efficient, high-performance, reliable, secure, and policy-aware management of large-scale data movement.
Major Goals and Technical Challenges:
Clearly, a project of this scale cannot hope to address the whole range of Data Grid middleware challenges. Fortunately, we can identify a relatively small set of critical issues that, if addressed, will have a major impact on a wide range of Data Grid projects. We believe that we have solid technical ideas in each area and that this project will therefore be extremely effective in moving Data Grid technology and applications forward.
The general problem that we propose to focus on is the reliable, high-performance movement of data between Data Grid components. In tackling this problem, we focus on three key questions:
To resolve these issues, we will:
We plan four distinct classes of contributions from the proposed work:
Major Milestones and Activities:
YEAR 1
  Deliver GridFTP servers and clients (non-striped)
  Deliver replica management service

YEAR 2
  Deliver GridFTP servers and clients (striped and non-striped)
  Deploy striped GridFTP into real Data Grid
  Demonstrate coordinated use of storage manager, information service, and transfer service by replica management service

YEAR 3
  Deliver distributed replica catalog
  Deliver extended storage resource manager and information service
  Demonstrate coordination of replica management service with storage resource services, for reliable, high-performance, scheduled data replication

YEAR 4
  Deliver extended resource and replication services
  Demonstrate alternative data management approaches

YEAR 5
  Deliver alternative data management approaches (GridHSM, Grid filesystem)
  Deliver multiple, advanced Data Grid management systems using common storage resource services