Presentation, July 15, 2022, 15:00 – 15:15
Development of a flexible, multi-stage data pipeline for enhanced automation, quality control and observability

Werner, Christian1, Lorenz, C.1
  1. Karlsruhe Institute of Technology

Facilitating and monitoring the ingestion and processing of continuous data streams is a challenging exercise that is often addressed only for individual scientific projects and/or stations, which results in a heterogeneous data environment.

To reduce duplication and enhance data quality, we built a prototypical data ingestion pipeline from open-source frameworks with three goals: a) unify the data flow for various data sources, b) enhance observability at all stages of the pipeline, and c) introduce a multi-stage QA/QC procedure that increases data quality and shortens the time needed to detect data degradation or failures. The system is orchestrated with Prefect, QA/QC is handled by Great Expectations and SaQC, and the SensorThings API and THREDDS Data Server facilitate data access and integration with other services.
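The abstract does not include implementation details, but a minimal sketch can illustrate how such a multi-stage flow might be wired together, assuming Prefect 2.x, Great Expectations' pandas interface, and SaQC 2.x. The station name, file path, column names, and value ranges below are hypothetical placeholders, not the actual pipeline code.

    # Sketch of a multi-stage ingestion flow: ingest -> validate -> flag.
    # All identifiers below (station name, path, column names, ranges)
    # are made-up placeholders for illustration.
    import pandas as pd
    import great_expectations as ge
    from prefect import flow, task
    from saqc import SaQC

    @task
    def ingest(station: str) -> pd.DataFrame:
        # Placeholder for pulling raw observations from a station's stream.
        return pd.read_csv(f"/data/raw/{station}.csv", parse_dates=["timestamp"])

    @task
    def validate(df: pd.DataFrame) -> pd.DataFrame:
        # Stage 1 QA: structural and plausibility checks with Great Expectations.
        gdf = ge.from_pandas(df)
        gdf.expect_column_values_to_not_be_null("timestamp")
        gdf.expect_column_values_to_be_between("air_temp_c", -60, 60)
        result = gdf.validate()
        if not result.success:
            # Failing loudly here is what makes problems observable early,
            # instead of surfacing only after aggregation.
            raise ValueError("raw data failed plausibility checks")
        return df

    @task
    def flag(df: pd.DataFrame):
        # Stage 2 QC: time-series tests with SaQC (a simple range test here).
        qc = SaQC(data=df.set_index("timestamp"))
        qc = qc.flagRange("air_temp_c", min=-60, max=60)
        # Return the data together with per-value quality flags.
        return qc.data, qc.flags

    @flow
    def ingest_station(station: str):
        df = ingest(station)
        df = validate(df)
        return flag(df)

    if __name__ == "__main__":
        ingest_station("station-001")

In this sketch the Great Expectations stage rejects structurally broken input outright, while SaQC flags suspect values without dropping them, mirroring the multi-stage QA/QC idea described above.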

The prototype workflow also features a human-in-the-loop component, so scientific PIs can act on problems in incoming data early and with little effort. The framework is flexible enough that the specific needs of individual projects can be addressed while still using a common platform. The final outcomes of the pipeline are aggregated data products that are served to scientists and/or the public via data catalogues. In the future, we plan to add more data flows at our institute. This will help us further standardize processing and QA/QC, thereby increasing data quality and availability, and hopefully also reduce the overall maintenance burden.
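On the access side, the OGC SensorThings API exposes observations through a REST/OData interface, so downstream consumers can query served data streams directly. The snippet below is a sketch of such a query; the endpoint URL and datastream id are hypothetical.

    # Sketch: fetch the ten most recent observations of one datastream
    # via the OGC SensorThings API. URL and id are placeholders.
    import requests

    STA_ROOT = "https://sensors.example.org/FROST-Server/v1.1"

    resp = requests.get(
        f"{STA_ROOT}/Datastreams(42)/Observations",
        params={"$orderby": "phenomenonTime desc", "$top": 10},
        timeout=30,
    )
    resp.raise_for_status()
    for obs in resp.json()["value"]:
        print(obs["phenomenonTime"], obs["result"])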