Storage Resource Management for Data Grid Applications

PI: Arie Shoshani (shoshani@lbl.gov)

Co-PI: Don Petravick (petravick@fnal.gov)

Vision

Terascale computing often generates petascale data: petabytes that must be collected and managed. Such data-intensive applications already overwhelm scientists, who spend much of their time managing data rather than concentrating on scientific investigations. Because such data are now vital to large scientific collaborations dispersed over wide-area networks, there is growing activity in developing a grid infrastructure to support these applications. Initially, the grid infrastructure emphasized the computational aspect: supporting large distributed computational tasks and optimizing the use of the network through bandwidth reservation techniques. In this proposal we complement this with middleware components that manage the storage resources on the grid. Just as compute and network resources need to be carefully scheduled and managed, the management of storage resources is critical for data-intensive applications, since access to data can become the main bottleneck. This situation arises because data files may be dispersed over multiple disk caches, and much of the data are stored on tertiary (tape) storage systems, which are slow because of their mechanical nature. Data replication in multiple disk caches is also limited by the volume of data. Finally, the transfer of large data files over wide-area networks needs to be minimized. Based on previous experience with several prototype storage management systems, we propose to enhance the features of storage resource managers (SRMs), to develop common application interfaces to them, and to deploy them in real grid applications. We target both tape-based and disk-based systems, and design their interfaces so that they interoperate.

The technology developed will provide the following benefits:

Major Goals and Technical Challenges:

In the past, storage management issues were avoided by pre-allocating space, replicating a great deal of data, and pre-partitioning the data. For example, high energy physics data were pre-selected into subsets (referred to as "micro-DSTs" or "streams"), which were stored at pre-selected sites where the scientists interested in each subset perform their analysis. Because the data are stored according to the access patterns anticipated when the experiment was implemented, analysis requiring new access patterns is more difficult, since the time to get data other than the subset at hand is prohibitively long. Given the volume of data being collected and simulated, this mode of operation is no longer effective. Scientists accessing the data can be at multiple remote institutions, and the amount of data is only expected to grow. To improve access to the data under these conditions, data grids are now being developed, and storage resource management is a necessary, integral part of data grid middleware.

The purpose of this project is to address the problem of managing access to large amounts of data distributed over the sites of the network. A Storage Resource Manager (SRM) must be associated with each storage resource on the grid in order to achieve coordinated and optimized usage of these resources. The term "storage resource" refers to any storage system that can be shared by multiple clients. The goal is to use shared resources on the grid to minimize data staging from tape systems, as well as to minimize the amount of data transferred over the network. The main advantages of developing SRMs as part of the grid architecture are:

  1. Support for local policy. Each storage resource can be managed independently of other storage resources. Thus, each site can have its own policy on which files to keep in its resource and for how long.
  2. Temporary locking. Files residing in one storage system can be temporarily locked before being transferred to another system that needs them. This provides the flexibility to read frequently accessed files from disk caches on the grid, rather than reading them repeatedly from the archival tape system.
  3. Advance reservations. SRMs are the components that manage the storage content dynamically. Therefore, they can be used to plan storage system usage by making advance reservations.
  4. Dynamic space management. SRMs are essential for the dynamic management of replicas, placing them where they are needed most (based on access patterns).
  5. Estimates for planning. SRMs are essential for planning the execution of a request. They can provide estimates of space availability and of the time until a file will be accessed. These estimates are needed for planning and for providing dynamic status information on the progress of multi-file requests.
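
To make temporary locking and space estimates concrete, the following is a minimal sketch of a disk cache manager, not the project's SRM design or interface. The class name `DiskCacheSketch`, its methods, and the least-recently-used eviction policy are all assumptions chosen for illustration; a real site would substitute its own local policy.

```python
from collections import OrderedDict


class DiskCacheSketch:
    """Toy disk cache manager (hypothetical): pinned files are never evicted."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.files = OrderedDict()  # name -> size, least recently used first
        self.pins = {}              # name -> number of active pins

    def free_space(self):
        """Space available right now, without evicting anything."""
        return self.capacity - sum(self.files.values())

    def reclaimable_space(self):
        """Free space plus the space held by files nobody has pinned."""
        unpinned = sum(size for name, size in self.files.items()
                       if self.pins.get(name, 0) == 0)
        return self.free_space() + unpinned

    def pin(self, name):
        """Temporarily lock a file so it stays in the cache."""
        if name not in self.files:
            raise KeyError(name)
        self.pins[name] = self.pins.get(name, 0) + 1
        self.files.move_to_end(name)  # mark as most recently used

    def release(self, name):
        """Drop one pin; the file becomes evictable when no pins remain."""
        if self.pins.get(name, 0) == 0:
            raise ValueError(f"{name} is not pinned")
        self.pins[name] -= 1

    def add(self, name, size):
        """Stage a file, evicting unpinned files in LRU order if needed.

        Returns the list of evicted file names.
        """
        if name in self.files:
            self.files.move_to_end(name)
            return []
        if size > self.reclaimable_space():
            raise RuntimeError("not enough unpinned space")
        evicted = []
        for victim in list(self.files):
            if self.free_space() >= size:
                break
            if self.pins.get(victim, 0) == 0:
                del self.files[victim]
                evicted.append(victim)
        self.files[name] = size
        return evicted
```

In this sketch, a client pins a file before reading or transferring it and releases it afterwards, while `free_space()` and `reclaimable_space()` stand in for the kind of space estimates a request planner could query.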

The major challenges are:

• Manage storage resources in a large, unreliable, heterogeneous distributed system
• Support long-lasting, data-intensive transactions when we cannot afford to restart jobs
• Support reliable handling of data that we cannot afford to lose, especially experimental data
• Support storage management when multiple types of failures may occur
    • Storage system failures
        • Mass Storage System (MSS)
        • Disk system
    • Server failures
    • Network failures
• Support heterogeneous systems
    • Different operating systems
    • Different Mass Storage Systems, such as HPSS, Castor, Enstore, etc.
    • Different disk systems: system-attached, network-attached, parallel
• Optimization issues
    • How to avoid extra file transfers: what to keep in each disk cache over time
    • How to maximize sharing for multiple users
    • Global optimization
    • Multi-tier storage system optimization

Major Milestones and Activities:

Year 1: Design and develop Storage Resource Manager (SRM) prototypes

- Design functionality and interfaces

- Implement prototype Disk Resource Manager (DRM)

- Implement prototype Hierarchical (tape and disk) Resource Manager (HRM)

Year 2: Deploy in Particle Physics Data Grid (PPDG) STAR project

- Installation at LBNL and BNL

- Start continuous automated file replication from BNL’s RCF-HPSS to LBNL’s NERSC-HPSS

Year 3: Design and develop robustness features for Storage Resource Managers (SRMs)

- Design and implement dynamic saving of SRM state and automatic recovery

- Recover from SRM failures, network failures and storage system failures
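
One possible way to save state dynamically and recover it automatically is an append-only journal that is replayed after a restart. The sketch below illustrates that general technique only; it is not the project's design. The names `RequestJournal` and `pending_files`, the JSON line format, and the status strings are assumptions.

```python
import json
import os


class RequestJournal:
    """Append-only log of per-file status changes (hypothetical format)."""

    def __init__(self, path):
        self.path = path

    def record(self, request_id, file_name, status):
        """Durably append one status change before the caller proceeds."""
        entry = {"request": request_id, "file": file_name, "status": status}
        with open(self.path, "a", encoding="utf-8") as log:
            log.write(json.dumps(entry) + "\n")
            log.flush()
            os.fsync(log.fileno())  # force the entry to stable storage

    def recover(self):
        """Replay the log; the last entry recorded for each file wins."""
        state = {}
        if not os.path.exists(self.path):
            return state
        with open(self.path, encoding="utf-8") as log:
            for line in log:
                try:
                    entry = json.loads(line)
                except json.JSONDecodeError:
                    break  # torn final write from a crash; ignore the tail
                files = state.setdefault(entry["request"], {})
                files[entry["file"]] = entry["status"]
        return state


def pending_files(state, request_id):
    """Files of a request that still need work after recovery."""
    return sorted(name for name, status in state.get(request_id, {}).items()
                  if status != "done")
```

After a failure, a restarted manager could call `recover()` and resume only the files reported by `pending_files`, so a long multi-file request does not have to start over.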

Current Connections with Other SciDAC Projects: