Integrated Data Analysis Pipelines for Large-Scale Data
Management, HPC, and Machine Learning

The DAPHNE project aims to define and build an open and extensible system infrastructure for integrated data analysis pipelines, including data management and processing, high-performance computing (HPC), and machine learning (ML) training and scoring. This vision stems from several key observations in this research field:

  1. Systems in these areas share many compilation and runtime techniques.
  2. There is a trend towards complex data analysis pipelines that combine these systems.
  3. The underlying, increasingly heterogeneous hardware infrastructure is converging as well.
  4. Yet the programming paradigms, cluster resource management, and data formats and representations differ substantially.

An example data analysis pipeline illustrates the typical challenges researchers are confronted with when building and executing such pipelines.
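For illustration, consider a minimal Python sketch of such a pipeline as it is typically built today, stitched together from separate systems. The data and the least-squares workload are purely illustrative; the point is the hand-off in step (2), where crossing a system boundary forces a change of data representation.

```python
# Illustrative sketch (not DAPHNE code): an integrated pipeline stitched
# together from separate systems with different data representations.
import numpy as np
import pandas as pd

# (1) Data management: relational-style preprocessing in pandas.
events = pd.DataFrame({
    "sensor": ["a", "a", "b", "b"],
    "value":  [1.0, 2.0, 3.0, 4.0],
})
features = events.groupby("sensor")["value"].agg(["mean", "max"])

# (2) Hand-off: crossing the system boundary forces a format conversion
#     (DataFrame -> dense ndarray).
X = features.to_numpy()

# (3) ML training: a least-squares model fit in numpy.
y = np.array([0.0, 1.0])
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# (4) Scoring kernel: here just a matrix-vector product.
print(X @ w)
```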

Therefore, this project – with a joint consortium of experts from the data management, ML systems, and HPC communities – aims to systematically investigate the system infrastructure, language abstractions, compilation and runtime techniques, and tools necessary to increase productivity when building such data analysis pipelines, and to eliminate unnecessary performance bottlenecks.

Objectives

DAPHNE has three main objectives (O): establishing a system architecture with APIs and a DSL; developing hierarchical scheduling and task planning strategies; and benchmarking the newly developed framework on high-level, real-life use cases:

System Architecture, APIs and DSL (O1)

Improve productivity in developing integrated data analysis pipelines via appropriate APIs and a domain-specific language, as well as an overall system architecture for seamless integration with existing data processing frameworks, HPC libraries, and ML systems. A major goal is an open, extensible reference implementation of the necessary compiler and runtime infrastructure to simplify the integration of current and future state-of-the-art methods.
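As a thought experiment of what O1 is after, the following hypothetical Python mock-up folds the stages of the earlier sketch behind one abstraction; every name here (Pipeline, aggregate, fit_lm, score) is invented for illustration and is neither the actual DAPHNE API nor DaphneDSL.

```python
# Hypothetical mock-up: one unified pipeline abstraction instead of three
# manually stitched systems. All names are illustrative, not DAPHNE's API.
import numpy as np
import pandas as pd

class Pipeline:
    """Toy unified pipeline: all stages share one runtime, so format
    conversions between systems are hidden from the user."""
    def __init__(self, frame):
        self.frame = frame

    def aggregate(self, key, col, funcs):
        self.frame = self.frame.groupby(key)[col].agg(funcs)
        return self

    def fit_lm(self, y):
        X = self.frame.to_numpy()  # conversion now an internal detail
        self.w, *_ = np.linalg.lstsq(X, np.asarray(y), rcond=None)
        return self

    def score(self):
        return self.frame.to_numpy() @ self.w

events = pd.DataFrame({"sensor": ["a", "a", "b", "b"],
                       "value":  [1.0, 2.0, 3.0, 4.0]})
print(Pipeline(events)
      .aggregate("sensor", "value", ["mean", "max"])
      .fit_lm([0.0, 1.0])
      .score())
```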

Hierarchical Scheduling and Task Planning (O2)

Improve the utilization of existing computing clusters, multiple heterogeneous hardware devices, and capabilities of modern storage and memory technologies through improved scheduling as well as static (compile time) task planning. In this context, we also aim to automatically leverage interesting data characteristics such as the sorting order, degree of redundancy, and matrix/tensor sparsity.
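A minimal sketch of such a data-characteristic-driven decision, assuming a simple density threshold: the planner-style check below picks a sparse or dense kernel depending on matrix sparsity. The 0.1 threshold and the kernel choice are assumptions for illustration, not DAPHNE's actual planning logic.

```python
# Illustrative sparsity-aware kernel selection (not DAPHNE's planner).
import numpy as np
import scipy.sparse as sp

def matvec(A: np.ndarray, x: np.ndarray) -> np.ndarray:
    density = np.count_nonzero(A) / A.size
    if density < 0.1:  # threshold is an assumed tuning parameter
        # Sparse path: CSR storage skips the zero entries.
        return sp.csr_matrix(A) @ x
    # Dense path: BLAS-backed matrix-vector product.
    return A @ x

A = np.eye(1000)   # 0.1% dense, so the sparse kernel is chosen
x = np.ones(1000)
print(matvec(A, x)[:3])
```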

Use Cases and Benchmarking (O3)

The technological results will be evaluated on a variety of real-world use cases and datasets, as well as on a new benchmark developed as part of the DAPHNE project. We aim to improve the accuracy and runtime of these use cases, which combine data management, machine learning, and HPC – this exploratory analysis serves as a qualitative study of productivity improvements. The variety of real-world use cases will further be generalized into a benchmark for integrated data analysis pipelines, quantifying progress compared to the state of the art.
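To make the quantification concrete, here is a toy stage-wise timing harness in the spirit of such a benchmark; the stage names and the workload are illustrative assumptions, not the planned benchmark's actual design.

```python
# Toy stage-wise timing harness (illustrative only).
import time
import numpy as np

def timed(name, fn, *args):
    start = time.perf_counter()
    result = fn(*args)
    print(f"{name}: {time.perf_counter() - start:.4f} s")
    return result

rng = np.random.default_rng(0)
X = timed("data generation", rng.standard_normal, (2000, 200))
G = timed("ML kernel (gram matrix)", lambda M: M.T @ M, X)
_ = timed("HPC kernel (eigendecomposition)", np.linalg.eigh, G)
```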

An overview of the project work plan shows that a work package is dedicated to each relevant field, and how the packages are bundled to contribute to the three main objectives.

Management of the work packages is distributed among all partners, academic and industrial. All work packages run in parallel, with the results and findings of each feeding into the others. To ensure efficient execution of the project as well as widespread dissemination of the results and adoption of the open-source DAPHNE implementation, dedicated work packages cover project management as well as dissemination and exploitation.