PI: Arie Shoshani (shoshani@lbl.gov)
- Executive Summary
- September 2001
Participating institutions and their coordinating PIs
DOE Laboratories:
- ANL: Bill Gropp
- LBNL: Arie Shoshani
- LLNL: Terence Critchlow
- ORNL: Thomas Potok
Universities:
- Georgia Institute of Technology: Calton Pu
- North Carolina State University: Mladen Vouk
- Northwestern University: Alok Choudhary
- UC San Diego (Supercomputer Center): Reagan Moore
Vision
Terascale computing and large scientific experiments produce enormous
quantities of data that require effective and efficient management. The task of managing
scientific data is so overwhelming that scientists spend much of their time managing the
data by developing special purpose solutions, rather than using their time effectively for
scientific investigation and discovery. The goal of this center is to provide a
coordinated framework for the unification, development, deployment, and reuse of
scientific data management software. We have identified four main areas that are essential
to scientific data management, emphasizing efficient management and data mining of very
large, heterogeneous, distributed datasets. In addition, we have identified technologies
that fall into four tier levels: storage, file, dataset, and dataset federation. This
"area and tier" framework is the basis for a management structure that will
ensure the vertical integration of technologies in each area over these tier levels, as
well as cross utilization of area technologies. The results expected are efficient,
well-integrated, robust scientific data management software modules that will provide
end-to-end solutions to multiple scientific applications. In the short term, we expect to
enhance the robustness of previously developed technologies and deploy them to specific
scientific applications. In the longer term, we will conduct research and development
based on an iterative experience cycle with the deployment of our technology.
Major Goals and Technical Challenges
Recent advances in scientific data management and scientific data
mining have made it possible to manage and analyze massive amounts of simulated/collected
data (terabytes) to facilitate scientific analysis and discovery. Still, many barriers
exist that force application scientists to spend an increasing amount of their time
performing basic data management tasks instead of pursuing their research. Major
bottlenecks inherent to most of our targeted scientific application areas include:
- Current I/O techniques are too slow to provide a desirable data access rate. Accessing
the data in the fastest possible manner is crucial for many scientific activities,
especially real-time scientific visualization, required by such application areas as
astrophysics (different forms), cosmology, and computational fluid dynamics.
- Retrieving data subsets of interest from storage systems (especially tertiary storage
systems) is a major bottleneck in analyzing the simulated/collected data. This is a
crucial step for scientific discovery in high energy and nuclear physics as well as
climate modeling and other earth science investigations.
- Navigating between heterogeneous, distributed data sources (e.g., on the Internet) and
transferring data between various interfaces is a painstaking, largely manual, and
time-consuming process. The increasing number of these data sources, especially in the
computational biology community, is creating a crisis.
- Network bandwidth continues to be a major bottleneck in transferring very large
datasets, especially across WANs. The migration of massive (terabyte-scale) datasets to
standard archives, such as PCMDI in the global climate research community, is an essential
part of the collaborative research process. However, often only a subset of the data is
required at the destination, so minimizing the amount of data transferred over WANs will
continue to be an important need for the foreseeable future.
- Existing data management and data mining techniques do not scale up to petabytes of
scientific data, the typical volume produced by simulations on terascale
computers.
Planned activities
To address the above bottlenecks, we have assembled a team with
expertise in areas that address these problems. Below is a short list of technologies that
address each bottleneck.
Bottleneck #1:
- To overcome I/O bottlenecks in cluster systems, our strategy is to let various
"hints", or additional information about intended access or intended use of
data, determine physical data layout on disk, thus optimizing subsequent access to these
data. The integration of such hints will be performed at both the Parallel Virtual File
System (PVFS) layer and the MPI-IO layer.
- To optimize shared access to tertiary storage, we plan to improve the retrieval
of subsets of data by using a shared front-end disk and managing its content. By
maximizing the sharing of files by users, we can minimize repeated file accesses from
robotic tape systems.
- To optimize access in distributed systems, our strategy is to provide a ROMIO
interface to grid technologies to bridge the gap between traditional methods of file
access and the ones provided by grid I/O facilities (e.g., GridFTP).
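The shared front-end disk idea in the second item above can be illustrated with a minimal sketch. It assumes a least-recently-used (LRU) replacement policy, which the text does not specify; the class name `SharedDiskCache`, the `fetch_from_tape` callback, and the file names and sizes are all hypothetical.

```python
from collections import OrderedDict


class SharedDiskCache:
    """Files staged from tape onto a shared disk, evicted in LRU order (assumed policy)."""

    def __init__(self, capacity_bytes, fetch_from_tape):
        self.capacity_bytes = capacity_bytes
        self.fetch_from_tape = fetch_from_tape  # hypothetical callable: name -> size in bytes
        self.files = OrderedDict()  # name -> size, least recently used first
        self.used_bytes = 0
        self.tape_reads = 0

    def request(self, name):
        """Return True on a disk hit, False if the file had to be read from tape."""
        if name in self.files:
            self.files.move_to_end(name)  # mark as most recently used
            return True
        size = self.fetch_from_tape(name)
        self.tape_reads += 1
        if size > self.capacity_bytes:
            return False  # too large to keep on the shared disk
        # Evict least recently used files until the new file fits.
        while self.used_bytes + size > self.capacity_bytes:
            _, evicted_size = self.files.popitem(last=False)
            self.used_bytes -= evicted_size
        self.files[name] = size
        self.used_bytes += size
        return False


# Illustrative workload: two users asking for overlapping files.
sizes = {"run1.dat": 40, "run2.dat": 40, "run3.dat": 40}
cache = SharedDiskCache(100, sizes.__getitem__)
requests = ["run1.dat", "run2.dat", "run1.dat", "run3.dat", "run2.dat"]
hits = [cache.request(name) for name in requests]
```

In this toy workload the third request is served from disk instead of tape; a real content manager would likely weigh file size, pending requests, and tape mount cost rather than recency alone.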
Bottleneck #2:
- To improve the retrieval of chunks of data distributed across multiple files and/or multiple
storage media, we plan to use a high-dimensional indexing scheme to identify the location
of millions of objects distributed over large datasets. Our strategy is to use
bitmap-based indexing schemes because they compress very well and are
efficient for such data.
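As a rough illustration of bitmap-based indexing, the sketch below keeps one bitmap per value bin of each attribute and answers a multi-attribute range query with bitwise OR and AND. It is a simplified, uncompressed version under stated assumptions: Python integers stand in for bitmaps, and the class `BitmapIndex`, the attribute names, and the sample values are hypothetical, not the center's implementation.

```python
class BitmapIndex:
    """One bitmap per bin of an attribute; bit i is set if object i falls in that bin."""

    def __init__(self, values, bin_edges):
        self.bin_edges = bin_edges
        self.bitmaps = [0] * (len(bin_edges) - 1)
        for i, value in enumerate(values):
            b = self._bin_of(value)
            if b is not None:
                self.bitmaps[b] |= 1 << i

    def _bin_of(self, value):
        # Bins are half-open intervals [edge[b], edge[b + 1]).
        for b in range(len(self.bin_edges) - 1):
            if self.bin_edges[b] <= value < self.bin_edges[b + 1]:
                return b
        return None

    def bins_in_range(self, lo_bin, hi_bin):
        """OR together the bitmaps of bins lo_bin..hi_bin (inclusive)."""
        result = 0
        for b in range(lo_bin, hi_bin + 1):
            result |= self.bitmaps[b]
        return result


def matching_ids(bitmap):
    """List the object positions whose bit is set."""
    ids = []
    position = 0
    while bitmap:
        if bitmap & 1:
            ids.append(position)
        bitmap >>= 1
        position += 1
    return ids


# Hypothetical objects with two attributes each.
energy = [5, 12, 27, 33, 18, 9]
pt = [1.0, 2.5, 0.3, 3.1, 2.2, 0.8]
energy_index = BitmapIndex(energy, [0, 10, 20, 30, 40])
pt_index = BitmapIndex(pt, [0.0, 1.0, 2.0, 3.0, 4.0])

# Query: 10 <= energy < 30 AND 2.0 <= pt < 4.0, answered without scanning the data.
selected = energy_index.bins_in_range(1, 2) & pt_index.bins_in_range(2, 3)
result_ids = matching_ids(selected)
```

Each additional attribute in a query costs only one more bitwise AND, which is why this style of index suits high-dimensional selections; production schemes also compress the bitmaps, which this sketch omits.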
Bottleneck #3:
- To facilitate navigation and exploration of heterogeneous and distributed data sources,
we plan to use an automated, meta-data based framework in which wrappers are created
directly from interface specific meta-data and linked in to the query engine.
- To address the multiplicity of data sources, we plan to use an automated wrapper
generator based on XWrap for accessing large numbers of sources.
- To address semantic mismatch of datasets, we plan to use an F-logic based query engine
for data and knowledge representation and semantic integration.
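The first item above, creating wrappers directly from interface metadata and linking them into a query engine, might look roughly like the toy sketch below. The metadata format, the `make_wrapper` and `QueryEngine` names, and the sample sources are all invented for illustration; this is not XWrap or the F-logic engine.

```python
def make_wrapper(meta):
    """Build a wrapper function from interface metadata (hypothetical format)."""
    delimiter = meta["delimiter"]
    # Rename source-specific columns to the names used in the shared schema.
    columns = [meta["rename"].get(column, column) for column in meta["columns"]]

    def wrapper(raw_text):
        records = []
        for line in raw_text.strip().splitlines():
            records.append(dict(zip(columns, line.split(delimiter))))
        return records

    return wrapper


class QueryEngine:
    """Holds one generated wrapper per registered source."""

    def __init__(self):
        self.sources = {}  # source name -> (wrapper, raw text standing in for a remote source)

    def register(self, name, meta, raw_text):
        self.sources[name] = (make_wrapper(meta), raw_text)

    def select(self, field, value):
        """Return (source name, record) pairs where record[field] == value."""
        results = []
        for name, (wrapper, raw_text) in self.sources.items():
            for record in wrapper(raw_text):
                if record.get(field) == value:
                    results.append((name, record))
        return results


engine = QueryEngine()
engine.register(
    "source_a",
    {"delimiter": ",", "columns": ["acc", "organism"], "rename": {"acc": "gene_id"}},
    "G1,human\nG2,mouse\n",
)
engine.register(
    "source_b",
    {"delimiter": "\t", "columns": ["gene_id", "species"], "rename": {"species": "organism"}},
    "G3\thuman\n",
)
matches = engine.select("organism", "human")
```

Adding a source here means writing a metadata description rather than parsing code, which is the property that makes automated wrapper generation attractive when the number of sources is large.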
Bottleneck #4:
- To overcome network bandwidth bottlenecks, we plan to achieve more effective
high-bandwidth transfers of very large datasets by: (1) optimizing block sizes and
equipment configuration to improve transfer rates, (2) implementing a reliable UDP
service with capabilities to recover from dropped packets, (3) testing a beta version of
the AIX operating system with improved TCP/IP performance over lossy networks, and (4)
investigating new data transfer protocols, such as Scheduled Transfer.
- A second activity aims to minimize the amount of data that needs to be transferred and
stored. An adaptable software package, ASPECT, built on top of HARNESS and VIPAR will be
integrated into the SDM Center framework. Its main purpose is to let simulation scientists
plug a data analysis component of their preference into ASPECT's job
monitoring and decision-making framework.
Bottleneck #5:
- To address the scalability problem, we plan to use agent technology and our unique
development testbed, called Probe. Probe sites at ORNL and NERSC, with their
state-of-the-art equipment and facilities, will be our "place to be" for initial
development and prototyping of the research activities of the SDM Center. The expanded
ORMAC agent framework will be our "glue" to ensure integration and
interoperation between various components.
Major Milestones:
The activities of the center will be organized as projects, each
initially targeting a specific application area. Initially, the targeted areas include:
climate modeling, astrophysics, genomics and proteomics, and high energy and nuclear
physics. The specific milestones are too numerous to list here; they can be found on the
center's web site. Generally, the goals for the first year are to integrate and apply
existing technology to a particular application area and to identify open research
problems. The goals for the second year will be to expand to additional users, to
apply the technology to new application areas, and to develop prototypes of solutions
that address research problems identified in the first year. The goals for the third year
are to package many of the tools and make them sufficiently robust for routine use.