Scientific Data Management Center

http://sdm.lbl.gov/sdmcenter

PI: Arie Shoshani (shoshani@lbl.gov)

 

Participating institutions and their coordinating PIs

DOE Laboratories:

ANL: Bill Gropp

LBNL: Arie Shoshani

LLNL: Terence Critchlow

ORNL: Thomas Potok

Universities:

Georgia Institute of Technology: Calton Pu

North Carolina State University: Mladen Vouk

Northwestern University: Alok Choudhary

UC San Diego (Supercomputer Center): Reagan Moore

Vision

Terascale computing and large scientific experiments produce enormous quantities of data that require effective and efficient management. The task of managing scientific data is so overwhelming that scientists spend much of their time developing special-purpose data management solutions rather than pursuing scientific investigation and discovery. The goal of this center is to provide a coordinated framework for the unification, development, deployment, and reuse of scientific data management software. We have identified four main areas that are essential to scientific data management, emphasizing efficient management and data mining of very large, heterogeneous, distributed datasets. In addition, we have identified technologies that fall into four tier levels: storage, file, dataset, and dataset federation. This "area and tier" framework is the basis for a management structure that will ensure the vertical integration of technologies in each area across these tier levels, as well as cross-utilization of area technologies. The expected results are efficient, well-integrated, robust scientific data management software modules that provide end-to-end solutions to multiple scientific applications. In the short term, we expect to enhance the robustness of previously developed technologies and deploy them to specific scientific applications. In the longer term, we will conduct research and development based on an iterative cycle of experience gained from deploying our technology.

Major Goals and Technical Challenges

Recent advances in scientific data management and scientific data mining have made it possible to manage and analyze massive amounts (terabytes) of simulated or collected data to facilitate scientific analysis and discovery. Still, many barriers remain that force application scientists to spend an increasing amount of their time performing basic data management tasks instead of pursuing their research. Major bottlenecks inherent to most of our targeted scientific application areas include:

  1. Current I/O techniques are too slow to provide a desirable data access rate. Accessing the data in the fastest possible manner is crucial for many scientific activities, especially real-time scientific visualization, required by such application areas as astrophysics (different forms), cosmology, and computational fluid dynamics.
  2. Retrieving data subsets of interest from storage systems (especially tertiary storage systems) is a major bottleneck in analyzing the simulated/collected data. This is a crucial step for scientific discovery in high energy and nuclear physics as well as climate modeling and other earth science investigations.
  3. Navigating between heterogeneous, distributed data sources (e.g., on the Internet) and transferring data between various interfaces is a painstaking, largely manual, and time-consuming process. The increasing number of these data sources, especially in the computational biology community, is creating a crisis.
  4. Network bandwidth continues to be a major bottleneck in transferring very large datasets, especially across WANs. The migration of massive (terabyte-scale) datasets to standard archives, such as PCMDI in the global climate research community, is an essential part of the collaborative research process. However, often only a subset of the data is required at the destination, so minimizing the amount of data transferred over WANs will remain an important need for the foreseeable future.
  5. Existing data management and data mining techniques do not scale up to petabytes of scientific data, a typical volume for simulations on terascale computers.

Planned Activities

To address the above bottlenecks, we have assembled a team with expertise in areas that address these problems. Below is a short list of technologies that address each bottleneck.

Bottleneck #1:

Bottleneck #2:

Bottleneck #3:

Bottleneck #4:

Bottleneck #5:

Major Milestones

The activities of the center will be organized as projects, each initially targeting a specific application area. Initially, the targeted areas include climate modeling, astrophysics, genomics and proteomics, and high energy and nuclear physics. The specific milestones are too numerous to list here; they can be found on the web site listed above. In general, the goal for the first year is to integrate and apply existing technology to a particular application area and to identify open research problems. The goals for the second year are to expand to additional users, apply the technology to new application areas, and develop prototype solutions to the research problems identified in the first year. The goals for the third year are to package many of the tools and make them sufficiently robust for routine use.