Data Science Symposium No. 7 programme


June 27, 2022

Current and future data-driven scientific applications, from explorative methods to operational systems, require suitable data strategies. The requirements for this are presented in the context of Earth system science and their embedding in POF 4 and NFDI. In particular, examples are presented that illustrate consistent data management as a data strategy from data collection to the data product, including the data processing steps.

Those who operate autonomous vehicles under or above water want to know where they are, whether they are doing well and whether they are doing what they are supposed to do. Sometimes it makes sense to intervene in the mission or to request a status report from a vehicle. When the GEOMAR AUV team was asked to bring two Hover AUVs (part of the MOSES Helmholtz Research Infrastructure Program, “Modular Observation Solutions for Earth Systems”) into operational service, that was exactly what we wanted. We decided to create our own tool: BELUGA.
After almost three years, BELUGA is a core tool for our operational work with our GIRONA 500 AUVs and their acoustic seafloor beacons. BELUGA allows communication and data exchange between these devices and the ship.
Every component inside this ad hoc network has an extended driver installed. This driver handles the messages and their content and decides which communication channel to use: Wi-Fi, satellite or acoustic communication. The driver of the shipborne component additionally has a module for a data model, which allows more network devices, sensors and messages to be added in the future. The user works on a web-based graphical interface, which visualizes all the …
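For illustration, the channel-selection idea can be sketched as follows; this is not BELUGA's actual driver code, and the channel names, preference order and payload limits are assumptions.

```python
# Hypothetical sketch of the channel-selection idea described above; not BELUGA's
# actual driver code. Channel names, ordering and limits are assumptions.
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Channel:
    name: str
    is_available: Callable[[], bool]   # e.g. link checks for Wi-Fi, satellite, acoustics
    max_payload_bytes: int

def pick_channel(channels: List[Channel], payload_size: int) -> Optional[Channel]:
    """Return the first available channel (in order of preference) that can carry the message."""
    for channel in channels:
        if channel.is_available() and payload_size <= channel.max_payload_bytes:
            return channel
    return None

channels = [
    Channel("wifi", lambda: True, 10_000_000),
    Channel("satellite", lambda: True, 50_000),
    Channel("acoustic", lambda: True, 256),
]
best = pick_channel(channels, payload_size=1_024)
print(best.name if best else "no channel available")
```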

Long-term archiving of bathymetry data from multibeam echosounders in compliance with the FAIR data principles – a high added value for the data life cycle – is a challenging task in the data information system PANGAEA. To cope with the increasing amount of data (“bathymetry puzzle pieces”) acquired from research vessels and the demand for easy “map-based” means of finding valuable data, new semi-automated processes and standard operating procedures (SOPs) for bathymetry data publishing and simultaneous visualization are currently being developed.

This research is part of the “Underway Research Data” project, an initiative of the German Marine Research Alliance (Deutsche Allianz Meeresforschung e.V., DAM). The cross-institutional DAM “Underway Research Data” project started in mid-2019. Its aim is to improve and unify the constant data flow from German research vessels to data repositories like PANGAEA. This comprises multibeam echosounders and other permanently installed scientific devices and sensors, following FAIR data management principles, and thus exploits the full potential of German research vessels as mobile “underway” scientific measuring platforms.

In an ongoing effort within the Helmholtz Association to make research data FAIR, i.e. findable, accessible, interoperable and reusable, we also need to make biological samples visible and searchable. To achieve this, a first crucial step is to inventory already available samples, connect them to relevant metadata and assess the requirements for various sample types (e.g. experimental, time series, cruise samples). This high diversity is challenging when creating standardized workflows that provide a uniform metadata collection with complete and meaningful metadata for each sample. As part of the Helmholtz DataHub at GEOMAR, the Biosample Information System (BIS) has been set up, turning the formerly decentralized sample management into a fully digital and centrally managed long-term sample storage.

The BIS is based on the open-source research data management system CaosDB, which offers a framework for managing diverse and heterogeneous data. It supports fine-grained access permissions and regular backups, and has a powerful search engine, different APIs and an extendable WebUI. We have designed a flexible data model and multiple WebUI modules to support scientists, technicians and data managers in digitizing and centralizing sample metadata and making the metadata visible in data portals (e.g. https://marine-data.de).
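For illustration, metadata for a single sample could be registered and queried through the CaosDB Python client roughly as follows; the record type "BiologicalSample" and its properties are invented for this sketch and do not reproduce the actual BIS data model.

```python
# Illustrative sketch using the CaosDB Python client (pycaosdb); the record type
# "BiologicalSample" and its properties are hypothetical, not the real BIS data model.
import caosdb as db

sample = db.Record(name="GEOMAR-sample-0001")
sample.add_parent(name="BiologicalSample")                  # record type from the data model
sample.add_property(name="CollectionDate", value="2022-06-27")
sample.add_property(name="StorageLocation", value="Freezer A, Box 12")
sample.insert()                                             # store the record centrally

# The same metadata is then searchable, e.g. for curation or portal export:
results = db.execute_query("FIND Record BiologicalSample WITH CollectionDate='2022-06-27'")
```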

The system allows us to manage …

The International Generic Sample Number (IGSN) is a globally unique and persistent identifier for physical objects, such as samples. IGSNs make it possible to cite, track and locate physical samples and to link samples to corresponding data and publications. The metadata schema is modular and can be extended by domain-specific metadata.

Within the FAIR WISH project funded by the Helmholtz Metadata Collaboration, domain-specific metadata schemes for different sample types within Earth and Environment are developed based on three different use cases. These use cases represent all stages of digitization, from hand-written notes to information stored in relational databases. For all stages of digitization, workflows and templates to generate machine-readable IGSN metadata are being developed, which allow automatic IGSN registration. These workflows and templates will be published and will contribute to the standardization of IGSN metadata.
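As a rough illustration of such a template-driven workflow, the snippet below turns one row of a sample spreadsheet into a machine-readable metadata payload; the field names are simplified stand-ins and do not reproduce the FAIR WISH templates or the official IGSN schema.

```python
# Hypothetical illustration: mapping one spreadsheet row to an IGSN-style metadata
# payload. Field names are illustrative only, not the official IGSN schema.
import json

spreadsheet_row = {
    "sample_name": "HE570-12-3",
    "material": "sediment core",
    "latitude": 54.18,
    "longitude": 7.89,
    "collection_date": "2021-05-04",
}

igsn_metadata = {
    "title": spreadsheet_row["sample_name"],
    "material": spreadsheet_row["material"],
    "geoLocation": {"lat": spreadsheet_row["latitude"], "lon": spreadsheet_row["longitude"]},
    "collectionDate": spreadsheet_row["collection_date"],
}
print(json.dumps(igsn_metadata, indent=2))   # payload a registration workflow could submit
```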

Facilitating and monitoring the ingestion and processing of continuous data streams is a challenging exercise that is often only addressed for individual scientific projects and/or stations and thus results in a heterogeneous data environment.

In order to reduce duplication and to enhance data quality, we built a prototypical data ingestion pipeline using open-source frameworks with the goals of a) unifying the data flow for various data sources, b) enhancing observability at all stages of the pipeline, and c) introducing a multi-stage QA/QC procedure to increase data quality and reduce the lag in detecting data degradation or data failure. The system is orchestrated using Prefect, QA/QC is handled by Great Expectations and SaQC, and the SensorThings API and THREDDS Data Server are used to facilitate data access and integration with other services.

The prototype workflow also features a human-in-the-loop aspect so that scientific PIs can act on incoming data problems early and with little effort. The framework is flexible enough that the specific needs of individual projects can be addressed while still using a common platform. The final outputs of the pipeline are aggregated data products that are served to the scientists and/or the public via data catalogues. …
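As a rough sketch of how such an orchestrated flow can look, assuming Prefect 2.x: the station name, the ingest stub and the simple range check below are placeholders for the real data sources and the Great Expectations / SaQC test suites.

```python
# Minimal sketch of the ingestion idea, assuming Prefect 2.x. The range check stands in
# for the Great Expectations / SaQC suites used in the actual pipeline; station names
# and limits are made up.
from prefect import flow, task

@task
def ingest(station: str) -> list[dict]:
    # In the real pipeline this would pull raw observations from the station's data source.
    return [{"station": station, "temperature": 11.8}, {"station": station, "temperature": 99.9}]

@task
def quality_check(records: list[dict]) -> list[dict]:
    # Multi-stage QA/QC would run Great Expectations / SaQC here; this is a single plausibility test.
    return [r for r in records if -5.0 <= r["temperature"] <= 45.0]

@task
def publish(records: list[dict]) -> None:
    # Stand-in for pushing curated observations to a SensorThings API / THREDDS endpoint.
    print(f"publishing {len(records)} quality-controlled records")

@flow
def ingestion_pipeline(station: str = "demo-station"):
    raw = ingest(station)
    good = quality_check(raw)
    publish(good)

if __name__ == "__main__":
    ingestion_pipeline()
```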

The NFDI4Earth Academy is a network of early career scientists interested in linking Earth System and Data Sciences beyond institutional borders. The research networks Geo.X, Geoverbund ABC/J, and DAM offer an open science and learning environment that covers specialized training courses, collaborations within the NFDI4Earth consortium and access to all NFDI4Earth innovations and services. Fellows of the Academy advance their research projects by exploring and integrating new methods and connect with like-minded scientists in an agile, bottom-up, and peer-mentored community. We support young scientists in developing the skills and mindset for open and data-driven science across disciplinary boundaries.

The recent introduction of machine learning techniques in the field of numerical geophysical prediction has expanded the scope so far assigned to data assimilation, in particular through efficient automatic differentiation, optimisation and nonlinear functional representations. Data assimilation, together with machine learning techniques, can help estimate not only the state vector but also the physical system dynamics or some of the model parametrisations. This addresses a major issue of numerical weather prediction: model error.
I will discuss from a theoretical perspective how to combine data assimilation and deep learning techniques to assimilate noisy and sparse observations, with the goal of estimating both the state and the dynamics and, when possible, obtaining a proper estimation of residual model error. I will review several ways to accomplish this, using for instance offline variational algorithms and online sequential filters. The skill of these solutions will be illustrated on low-order and intermediate chaotic dynamical systems, as well as on data from meteorological models and real observations.
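As a generic illustration of this coupling (a standard weak-constraint formulation, not necessarily the exact one used in the talk), the state trajectory and the dynamics can be estimated jointly by minimising a cost function of the form

```latex
J(x_{0:K}, p) \;=\; \tfrac{1}{2}\,\lVert x_0 - x^{\mathrm b}\rVert^2_{\mathbf B^{-1}}
\;+\; \tfrac{1}{2}\sum_{k=0}^{K} \lVert y_k - \mathcal H_k(x_k)\rVert^2_{\mathbf R_k^{-1}}
\;+\; \tfrac{1}{2}\sum_{k=1}^{K} \lVert x_k - \mathcal M_p(x_{k-1})\rVert^2_{\mathbf Q^{-1}},
```

where $\mathcal M_p$ is a (for instance neural-network-based) representation of the model dynamics with parameters $p$, and the last term absorbs residual model error.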

Examples will be taken from collaborations with J. Brajard, A. Carrassi, L. Bertino, A. Farchi, Q. Malartic, M. Bonavita, P. Laloyaux, and M. Chrust.

Mantle convection plays a fundamental role in the long-term thermal evolution of terrestrial planets like Earth, Mars, Mercury and Venus. The buoyancy-driven creeping flow of silicate rocks in the mantle is modeled as a highly viscous fluid over geological time scales and quantified using partial differential equations (PDEs) for conservation of mass, momentum and energy. Yet, key parameters and initial conditions to these PDEs are poorly constrained and often require a large sampling of the parameter space to find constraints from observational data.

Since it is not computationally feasible to solve hundreds of thousands of forward models in 2D or 3D, scaling laws have been the go-to alternative. These are computationally efficient, but ultimately limited in the amount of physics they can model (e.g., depth-dependent material properties). More recently, machine learning techniques have been used for advanced surrogate modeling. For example, Agarwal et al. (2020) used feedforward neural networks to predict the evolution of the entire 1D laterally averaged temperature profile in time from five parameters: reference viscosity, enrichment factor of the crust in heat-producing elements, initial mantle temperature, and activation energy and activation volume of the diffusion creep. In Agarwal et al. (2021), we extended that study to predict the …
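As a purely illustrative sketch (not the architecture of Agarwal et al. 2020), such a surrogate can be set up as a small feedforward network mapping the five scalar parameters to a discretised temperature profile; the layer sizes and profile resolution below are assumptions.

```python
# Illustrative only: a feedforward surrogate mapping the five parameters named in the
# abstract to a discretised 1D laterally averaged temperature profile.
import torch
import torch.nn as nn

N_PARAMS = 5          # reference viscosity, crustal enrichment, initial temperature,
                      # activation energy, activation volume
N_DEPTH_POINTS = 128  # assumed resolution of the 1D temperature profile

surrogate = nn.Sequential(
    nn.Linear(N_PARAMS, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, N_DEPTH_POINTS),
)

params = torch.rand(8, N_PARAMS)       # a batch of 8 sampled parameter sets
profiles = surrogate(params)           # predicted temperature profiles, shape (8, 128)
print(profiles.shape)
```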

For some time, large-scale analyses and data-driven approaches have become increasingly popular in all research fields of hydrology. Many advantages are seen in the ability to achieve good predictive accuracy with comparatively little time and financial investment. Previous studies have shown that complex hydrogeological processes can be learned by artificial neural networks, whereby deep learning demonstrates its strengths particularly in combination with large data sets. However, such methods are limited in the interpretability and transferability of their predictions. Furthermore, most groundwater data are not yet ready for data-driven applications, and data availability often remains insufficient for training neural networks. The larger the scale, the more difficult it becomes to obtain sufficient information and data on local processes and environmental drivers in addition to groundwater data. For example, groundwater dynamics are very sensitive to pumping activities, but information on their local effects and magnitude – especially in combination with natural fluctuations – is often missing or inaccurate. Coastal regions are often particularly water-stressed. Exemplified by the important coastal aquifers, novel data-driven approaches are presented that have the potential to contribute both to the process understanding of groundwater dynamics and to groundwater level prediction on large …

Together, the creatures of the oceans and the physical features of their habitat play a significant role in sequestering carbon and taking it out of the atmosphere. Through the biological processes of photosynthesis, predation, decomposition, and the physical movements of the currents, the oceans take in more carbon than they release. With sediment accumulation in the deep seafloor, carbon gets stored for a long time, making oceans big carbon sinks, and protecting our planet from the devastating effects of climate change.


Despite the significance of seafloor sediments as a major global carbon sink, direct observations of the mass accumulation rates (MAR) of sediments are sparse. The existing sparse data set is inadequate to quantify the change in the composition of carbon and other constituents at the seabed on a global scale. Machine learning techniques such as the k-nearest neighbours algorithm have provided predictions of sparse sediment accumulation rates by correlating them with known features (predictors) such as bathymetry, bottom currents, distance to coasts and river mouths, etc.

In my current work, global maps of the sediment accumulation rates at the seafloor are predicted from the known feature maps and the sparse dataset of sediment accumulation rates using multi-layer perceptrons (supervised models). Despite a good …
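Schematically, and with entirely synthetic numbers, such a supervised prediction can be set up as follows; the predictor count and network size are assumptions, not the configuration used in this work.

```python
# Schematic example: fit an MLP on locations with known accumulation rates, then apply
# it to global predictor maps. Data here are random placeholders.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((500, 4))   # predictors (e.g. bathymetry, bottom currents, distance to coast)
y = rng.random(500)        # sparse observed mass accumulation rates (MAR)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
model.fit(X_train, y_train)
print("R^2 on held-out points:", model.score(X_test, y_test))

# The fitted model would then be evaluated on every grid cell of the global predictor
# maps to produce a global MAR map.
```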

The Lagrangian perspective on ocean currents describes trajectories of individual virtual or physical particles which move passively or semi-actively with the currents. The analysis of such trajectory data offers insights into pathways and connectivity within the ocean. To date, studies using trajectory data typically identify pathways and connections between regions of interest in a manual way. Hence, they are limited in their capability to find previously unknown structures, since the person analyzing the data set cannot foresee them. An unsupervised approach to trajectories could allow the potential of such collections to be used to a fuller extent.

This study aims at identifying and subsequently quantifying pathways based on collections of millions of simulated Lagrangian trajectories. It develops a stepwise multi-resolution clustering approach, which substantially reduces the computational complexity of quantifying similarity between pairs of trajectories and allows for parallelized cluster construction.

It is found that the multi-resolution clustering approach makes unsupervised analysis of large collections of trajectories feasible. Moreover, it is demonstrated for selected example research questions how the unsupervised results can be applied.
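The following sketch conveys the stepwise idea with synthetic data; it is a conceptual illustration, not the clustering algorithm developed in this study.

```python
# Conceptual sketch of stepwise multi-resolution clustering: cluster trajectories at a
# coarse temporal resolution first, then refine each coarse cluster at full resolution.
# This avoids one huge all-pairs similarity computation and lets the refinement run in
# parallel per coarse cluster. Data and cluster counts are arbitrary.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
trajectories = rng.random((1000, 200, 2))   # 1000 synthetic trajectories, 200 steps, lon/lat

# Step 1: coarse clustering on heavily subsampled trajectories.
coarse = KMeans(n_clusters=5, random_state=0).fit_predict(
    trajectories[:, ::20, :].reshape(len(trajectories), -1)
)

# Step 2: refine each coarse cluster using the full-resolution positions.
fine = np.empty(len(trajectories), dtype=int)
for c in range(5):
    idx = np.where(coarse == c)[0]
    if len(idx) == 0:
        continue
    k = min(3, len(idx))                    # guard against very small coarse clusters
    sub = KMeans(n_clusters=k, random_state=0).fit_predict(
        trajectories[idx].reshape(len(idx), -1)
    )
    fine[idx] = c * 3 + sub                 # combined multi-resolution cluster label
```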

The frequent presence of cloud cover in polar regions limits the use of the Moderate Resolution Imaging Spectroradiometer (MODIS) and similar instruments for the investigation and monitoring of sea-ice polynyas compared to passive-microwave-based sensors. The very low thermal contrast between present clouds and the sea-ice surface in combination with the lack of available visible and near-infrared channels during polar nighttime results in deficiencies in the MODIS cloud mask and dependent MODIS data products. This leads to frequent misclassifications of (i) present clouds as sea ice or open water (false negative) and (ii) open-water and/or thin-ice areas as clouds (false positive), which results in an underestimation of actual polynya area and subsequently derived information. Here, we present a novel machine-learning-based approach using a deep neural network that is able to reliably discriminate between clouds, sea-ice, and open-water and/or thin-ice areas in a given swath solely from thermal-infrared MODIS channels and derived additional information. Compared to the reference MODIS sea-ice product for the year 2017, our data result in an overall increase of 20 % in annual swath-based coverage for the Brunt Ice Shelf polynya, attributed to an improved cloud-cover discrimination and the reduction of false-positive classifications. At the same time, the …
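As an illustration of the classification task only (not the published network), a per-pixel classifier could look like the sketch below; the number of input channels and the layer sizes are assumptions.

```python
# Minimal sketch, not the published network: a per-pixel classifier mapping thermal-infrared
# MODIS channels (plus derived quantities) to cloud, sea ice, or open water / thin ice.
import torch
import torch.nn as nn

N_CHANNELS = 7   # assumed number of thermal-IR channels and derived inputs
classifier = nn.Sequential(
    nn.Linear(N_CHANNELS, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 3),                       # logits for the three surface/cloud classes
)

pixels = torch.rand(1024, N_CHANNELS)       # a batch of swath pixels
class_id = classifier(pixels).argmax(dim=1) # predicted class per pixel
```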

Funded by the Helmholtz Association, the Artificial Intelligence for COld REgions (AI-CORE) project aims to develop methods of artificial intelligence for solving some of the most challenging questions in cryosphere research, exemplified by four use cases. These use cases are of high relevance in the context of climate change but are very difficult to tackle with common image processing techniques. Therefore, different AI-based imaging techniques are applied to the diverse, extensive, and inhomogeneous input data sets.

In a collaborative approach, the German Aerospace Center, the Alfred Wegener Institute, and the Technical University of Dresden work together to address not only the methodology of how to solve these questions, but also how to implement procedures for data integration on the infrastructures of the partners. Competences in data science, AI implementation, and processing infrastructures already exist within the individual Helmholtz centers, but they are decentralized and distributed. AI-CORE therefore aims at bringing these experts together to jointly develop state-of-the-art tools to analyze and quantify processes currently occurring in the cryosphere. The presentation will give a brief overview of the geoscientific use cases and then address the different challenges that emerged so …

June 28, 2022

This presentation reports on support provided under the aegis of Helmholtz AI for a wide range of machine-learning-based solutions to research questions related to Earth and environmental sciences. We will give insight into typical problem statements from Earth observation and Earth system modeling that are good candidates for experimentation with ML methods and report on our accumulated experience tackling such challenges in individual support projects. We address these projects in an agile, iterative manner, and during the definition phase we direct special attention towards assembling practically meaningful demonstrators within a couple of months. A recent focus of our work lies on tackling software engineering concerns for building ML-ESM hybrids.

Our implementation workflow covers stages from data exploration to model tuning. A project often starts with evaluating the available data and deciding on basic feasibility, apparent limitations such as biases or a lack of labels, and the split into training and test data. Setting up a data processing workflow to subselect and compile training data is often the next step, followed by setting up a model architecture. We have had good experience with automatic tooling to tune hyperparameters and to test and optimize network architectures. In typical implementation projects, these stages …
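As an example of the kind of automated hyperparameter search referred to above (illustrative only, on a toy dataset):

```python
# Toy example of automated hyperparameter tuning with scikit-learn's GridSearchCV;
# the dataset, model and parameter grid are placeholders.
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=400, n_features=10, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    cv=3,
)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```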

Underwater images are used to explore and monitor ocean habitats, generating huge datasets with unusual data characteristics that preclude traditional data management strategies. Due to the lack of universally adopted data standards, image data collected from the marine environment are increasing in heterogeneity, preventing objective comparison. The extraction of actionable information thus remains challenging, particularly for researchers not directly involved with the image data collection. Standardized formats and procedures are needed to enable sustainable image analysis and processing tools, as are solutions for image publication in long-term repositories to ensure reuse of data. The FAIR principles (Findable, Accessible, Interoperable, Reusable) provide a framework for such data management goals. We propose the use of image FAIR Digital Objects (iFDOs) and present an infrastructure environment to create and exploit such FAIR digital objects. We show how these iFDOs can be created, validated, managed, and stored, and which data associated with imagery should be curated. The goal is to reduce image management overheads while simultaneously creating visibility for image acquisition and publication efforts and to provide a standardized interface to image (meta)data for data science applications such as annotation, visualization, digital twins or machine learning.
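As a rough illustration of the idea, the snippet below assembles a minimal iFDO-style metadata record; the field names are simplified stand-ins, and the authoritative field list is defined by the iFDO specification.

```python
# Simplified, hypothetical illustration of an iFDO-style metadata record; field names
# are stand-ins, the authoritative schema is the iFDO specification.
import json
import uuid

ifdo_like_record = {
    "image-set-name": "example dive 42 still images",   # name of the image set
    "image-set-uuid": str(uuid.uuid4()),                 # persistent identifier for the set
    "images": [
        {
            "file-name": "dive042_000123.jpg",
            "acquisition-time": "2022-03-14T09:21:05Z",
            "latitude": 11.85,
            "longitude": -117.03,
            "depth-m": 4120.0,
        }
    ],
}
print(json.dumps(ifdo_like_record, indent=2))
```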

Numerical simulations of Earth's weather and climate require substantial amounts of computation. This has led to a growing interest in replacing subroutines that explicitly compute physical processes with approximate machine learning (ML) methods that are fast at inference time. Within weather and climate models, atmospheric radiative transfer (RT) calculations are especially expensive. This has made them a popular target for neural network-based emulators. However, prior work is hard to compare due to the lack of a comprehensive dataset and standardized best practices for ML benchmarking.
To fill this gap, we introduce the ClimART dataset, which is based on the Canadian Earth System Model, and comes with more than 10 million samples from present, pre-industrial, and future climate conditions.

ClimART poses several methodological challenges for the ML community, such as multiple out-of-distribution test sets, underlying domain physics, and a trade-off between accuracy and inference speed. We also present several novel baselines that indicate shortcomings of the datasets and network architectures used in prior work.

The impacts of anthropogenic climate change are felt most directly through extremes. However, the existing research skill in assessing future changes in impacts from climate extremes is still limited. Although a multitude of climate simulations is now available that allows the analysis of such climatic events, available approaches have not yet sufficiently analysed the complex and dynamic aspects that are relevant to estimate what climate extremes mean for society in terms of impacts and damages. Machine learning (ML) algorithms can model multivariate and nonlinear relationships, with possibilities for non-parametric regression and classification, and are therefore well suited to model the highly complex relations between climate extremes and their impacts.

In this presentation, I will highlight some recent ML applications, focussing on monetary damages from floods and windstorms. For these extremes, ML models are built using observational datasets of the events and their impacts. I will also address the sample selection bias that arises between the many observed moderate-impact events and the more extreme events sampled in current observations and projected future data. This can be addressed by adjusting the weighting of such variable values, as is demonstrated for extreme windstorm events.
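The reweighting idea can be illustrated with a toy example (synthetic data, not the presenter's model): rare, extreme events receive larger weights when fitting a damage model so that the sparsely sampled tail is not dominated by the many moderate events.

```python
# Toy illustration of reweighting against sample selection bias; data are synthetic.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
wind_speed = rng.gamma(shape=2.0, scale=8.0, size=2000)           # mostly moderate storms
damage = wind_speed ** 3 * (1 + 0.2 * rng.standard_normal(2000))  # heavy-tailed damages

# Give the rarest, most extreme events a larger weight in the fit.
weights = np.where(wind_speed > np.quantile(wind_speed, 0.95), 10.0, 1.0)
model = GradientBoostingRegressor(random_state=0)
model.fit(wind_speed.reshape(-1, 1), damage, sample_weight=weights)
```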

Another application focusses on health outcomes, in this case …

The ocean economy is growing, and pressures on our seas and the ocean, including from overexploitation, pollution and climate change, have placed significant stresses on the marine system. Digital twins are rich, virtual representations of objects and systems, in this case the ocean system or a part of it. They allow us to track how and why the things we care about are changing and to simulate what their futures could be, including by exploring ‘what if?’ scenarios. They can provide critical knowledge to plan and guide human activities in the ocean and coastal spaces to safeguard a healthy ocean and support a sustainable green-blue economy.

Digital twins depend upon: an integrated and sustainable ocean observing system; well-managed, accessible and interoperable data and software; predictive process- or data-driven models with which users can interact to support their needs; and the sharing of good and best practice, training, education and outreach.


The connection between a digital twin and its real-world counterpart requires a well-formulated interface between the digital twin, environmental and societal data, and the user. User interaction is an essential function embedded in the design of digital twins, to ensure maximum information value for investment in ocean …


Digital hydromorphological twin of the Trilateral Wadden Sea

The project “Digital hydromorphological twin of the Trilateral Wadden Sea” focuses on cooperation in cross-border data innovations between the Netherlands, Germany and Denmark, and on the provision and harmonization of data together with a new digital geodata and analysis infrastructure for the trilateral Wadden Sea World Heritage Site. These data and information are linked via web portals and services to form a versatile assistance system.

Different demands, requirements and restrictive environmental legislation pose major challenges for the planning and maintenance of transport infrastructure in the marine environment. TrilaWatt aims at developing and implementing a powerful spatial data and analysis infrastructure based on a homogenized database comprising the Trilateral Wadden Sea area of the Netherlands, Germany and Denmark.

High-resolution hydrographic data series are locally available for the German Bight. These can be combined into spatial models. Due to the high mapping effort, area-wide morphological and especially sedimentological surveys can only be conducted at intervals of several years or decades. On the other hand, current issues require much higher temporal resolution for the analysis and assessment of environmental impacts in the entire Wadden Sea. Pressures are addressed in the MSFD Annex III among others as descriptors "seabed …

Today's Earth-System Models (ESM) are not designed to be interactive. They follow a configure, setup, run, and data analysis scheme. Thus, researchers often have to wait for hours or days until they can inspect model runtime diagnostic data. This causes time-consuming round trips between setup and analysis and limits interaction and insights into an ESM at runtime.

We aim to overcome this static scientific modeling process with interactive exploration of ESMs. It will allow the state of the simulation to be monitored via dashboards presenting real-time diagnostics within a digital twin world. It will make it possible to halt simulations, move back in time, and explore divergent setups at any given point in the simulation. Therefore, we have to include code in ESMs to access, store, and change data in every part, and make it available to interactive visualization dashboards.
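A hypothetical sketch of such a runtime hook, not taken from any existing ESM: a simulation loop checkpoints its state at every step and asks a controller (e.g. driven by a dashboard) whether to continue, halt, or rewind.

```python
# Hypothetical sketch of an interactive simulation loop; everything here is illustrative.
import copy

def run_interactive(model_state: dict, step_fn, controller, n_steps: int):
    """Advance the model while allowing halt/rewind commands from a controller."""
    checkpoints = [copy.deepcopy(model_state)]
    step = 0
    while step < n_steps:
        command = controller(step, model_state)        # e.g. fed by a dashboard
        if command == "halt":
            break
        if isinstance(command, tuple) and command[0] == "rewind":
            step = command[1]
            model_state = copy.deepcopy(checkpoints[step])
            continue
        model_state = step_fn(model_state)             # one model time step
        checkpoints.append(copy.deepcopy(model_state))
        step += 1
    return model_state, checkpoints
```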

Today's ESMs and other scientific models are implemented as monoliths, so we have to modularize them and discover or recover the interfaces between these modules. The modularization not only helps with restructuring existing ESMs, it also allows additional scientific domains to be integrated into the interactive simulation environment.

We apply a domain-driven modularization approach utilizing reverse engineering techniques combining static and dynamic analysis of …

The Digital Earth Viewer is a visualisation platform capable of ingesting data from heterogeneous sources and performing spatial and temporal contextualisation upon them. Its web-based nature enables several users to access and visualise large geo-scientific datasets. Here we present the latest development of this viewer: collaborative capabilities that allow parallel, live exploration and annotation of four-dimensional environments by multiple remote users.

A comprehensive study of the Earth system and its different environments requires understanding of multi-dimensional data acquired with a multitude of different sensors or produced by various complex models. Geoscientists use state-of-the-art instruments and techniques to acquire and analyse these data, which stands in stark contrast with the outdated means that are often chosen to present the resulting findings: today's most popular presentation software choices (PowerPoint, Keynote, etc.) were developed to support a presentation style that has seen little to no development over the last 70 years. Here I present a software framework for creating dynamic data presentations. This combination of different web-based resources enables a new paradigm in data visualisation for scientific presentations.