Given the complexity of data exploration, provenance management is a key component for guaranteeing scientific data discovery, reproducibility and interpretation of results. Provenance management should define a set of metadata capturing the derivation history of any stored data item, including the original raw data sources, the intermediate data products and the steps applied to produce them.
To enable well-defined data provenance across the scientific experiment workflow, from data production to data usage, the main challenge addressed by this task is the management of data provenance within the material science/nanoscience and climate science communities.
For the material science community, we focused on the wide community behind the NFFA-Europe PILOT Project (NEP), where scientific data are collected from more than 150 experimental techniques. We planned to define general procedures and recommendations on how to manage data provenance for the NEP user community and beyond.
Climate research relies on large volumes of data from the modelling and observational climate communities. In this domain, provenance management plays a key role both for numerical end-to-end simulations at the data center level and within the inner data analytics workflows.
Provenance enforcement procedures were identified, which will contribute to the climate data science notebooks of Use Case 2 (Agile FAIR data for environment and earth system communities); these procedures can later be generalised to other domains. In addition, the approach includes a discussion of possible use cases of PID collections in this context, thus providing a link to Use Case 8 (Virtual definition of big datasets at seismological data centres according to RDA recommendations).
To provide in-depth provenance support within the climate processing services, a second-level provenance management layer complying with the W3C PROV standard has been elaborated, addressing open science challenges (reproducibility, reusability, etc.) at a finer granularity.
The material science use case benefits from EOSC-Pillar in that the NEP user community and the project itself can establish a strong connection with the EOSC network, favouring a continuous exchange between the two communities. Such an exchange is of mutual interest: on one side, it allows NEP and the data services built within the project to be EOSC-compliant; on the other, EOSC gains a large and committed user community that can provide useful suggestions and case studies for the overall EOSC implementation.
The climate use cases are built on top of the ENES Climate Analytics Service (ECAS), the server-side compute infrastructure exploited in Use Case 2. ECAS, one of the EOSC-hub thematic services, allows data analysis experiments to be performed on large volumes of multidimensional data through a PID-enabled, server-side and parallel approach. In this way, scientific users can run their experiments and obtain fine-grained provenance information, captured at the second level of a data analytics workflow according to the W3C PROV specifications. This makes it possible to retrieve the data lineage of an object, including the entire analytics workflow associated with it, which is particularly valuable for data discovery and the reproducibility dimension of the FAIR principles.
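As an illustration of how such lineage information can be represented and navigated, the short sketch below uses the Python prov package to model a hypothetical two-step server-side analysis; the ecas namespace, task names and identifiers are placeholders, not the actual ECAS vocabulary.

```python
from prov.model import ProvDocument

# Hypothetical sketch: each analytics task is a PROV activity whose output
# entity is derived from its input, so the lineage of any result can be
# recovered by walking the wasDerivedFrom chain back to the raw input.
steps = [
    # (activity, input entity, output entity) -- illustrative names only
    ("ecas:subset",       "ecas:cmip-input",    "ecas:subset-output"),
    ("ecas:time-average", "ecas:subset-output", "ecas:final-map"),
]

doc = ProvDocument()
doc.add_namespace("ecas", "https://ecas.example.org/")  # placeholder namespace

derived_from = {}
for activity, inp, out in steps:
    doc.activity(activity)
    doc.entity(out)
    doc.used(activity, inp)
    doc.wasGeneratedBy(out, activity)
    doc.wasDerivedFrom(out, inp)
    derived_from[out] = inp
doc.entity("ecas:cmip-input")  # the PID-identified raw input dataset

# Data lineage of the final product: follow the derivations backwards.
node = "ecas:final-map"
while node in derived_from:
    print(node, "<-- wasDerivedFrom --", derived_from[node])
    node = derived_from[node]
```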
The case study within the material science/nanoscience community concerns the data management practices and services developed to make a scientific archive of Scanning Tunneling Microscopy (STM) images FAIR-compliant [1]. The objective was to organise a dataset of STM images so that researchers can reconstruct the provenance of the data. For each image in the dataset, instrument metadata was extracted and enriched with information on the structure and composition of the surface by means of a pipeline that leverages human annotation, machine learning techniques and instrument metadata filtering. To visually explore both images and metadata, and to improve the accessibility and usability of the dataset, the "STM explorer" web service was developed and integrated within the Trieste Advanced Data Services (TriDAS) website. On top of these data services and tools, the W3C PROV standard was implemented to describe the provenance metadata of the STM images.
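To make the idea concrete, the following sketch shows how the provenance of a single STM image could be expressed with the W3C PROV data model using the Python prov package; the stm namespace and all identifiers are hypothetical, not the actual archive schema.

```python
from prov.model import ProvDocument

# Hypothetical PROV view of one image: instrument metadata is extracted
# from the raw image, then enriched by the annotation/ML pipeline.
doc = ProvDocument()
doc.add_namespace("stm", "https://tridas.example.org/stm/")  # placeholder

doc.entity("stm:image-0042", {"prov:label": "raw STM image"})
doc.entity("stm:metadata-0042", {"prov:label": "instrument metadata"})
doc.entity("stm:enriched-0042", {"prov:label": "enriched surface metadata"})

doc.activity("stm:metadata-extraction")
doc.activity("stm:enrichment-pipeline")  # human annotation + ML + filtering
doc.agent("stm:researcher")
doc.agent("stm:ml-classifier")

doc.used("stm:metadata-extraction", "stm:image-0042")
doc.wasGeneratedBy("stm:metadata-0042", "stm:metadata-extraction")
doc.used("stm:enrichment-pipeline", "stm:metadata-0042")
doc.wasGeneratedBy("stm:enriched-0042", "stm:enrichment-pipeline")
doc.wasDerivedFrom("stm:enriched-0042", "stm:image-0042")
doc.wasAssociatedWith("stm:enrichment-pipeline", "stm:researcher")
doc.wasAssociatedWith("stm:enrichment-pipeline", "stm:ml-classifier")

print(doc.get_provn())  # human-readable PROV-N rendering
```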
The web service workflow, as reported in the image below, offers users a first exploration of the dataset, which can then be downloaded for further analysis and image visualization. The scatter plot supports the selection of a specific metadata combination, which leads to a new page containing a table with the metadata fields of each image in that subset. On this page, researchers can select, order, filter and search images based on their metadata values. The ID column holds each image's unique identifier in the database; clicking on it renders the corresponding STM image on a new page, where a download feature provides the data, metadata, plot and provenance metadata for that image.
The web service workflow on the TriDAS website.
In the climate science domain, two distinct demonstrators have been implemented to address provenance tracking at two different levels: a first, coarse-grained level covering the end-to-end workflow, and a second, fine-grained level covering the individual analytics operators within a task.
The main idea is to navigate the first-level provenance to gain a complete understanding of the overall high-level end-to-end workflow, and then drill down into specific analytics tasks to obtain more detailed information about their internal workflow in terms of micro-tasks or analytics operators. In this context we can speak of two-level workflows and, accordingly, of two-level provenance management.
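One natural way to express such a drill-down in W3C PROV is with named bundles: the first level records each task as a single activity, while a bundle expands it into its internal operators. The sketch below, with purely illustrative names, assumes the Python prov package.

```python
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "https://example.org/workflow/")  # placeholder

# First level: the analytics task appears as one opaque activity.
doc.activity("ex:trend-analysis")
doc.entity("ex:input")
doc.entity("ex:output")
doc.used("ex:trend-analysis", "ex:input")
doc.wasGeneratedBy("ex:output", "ex:trend-analysis")

# Second level: a named bundle expands the same task into micro-operators,
# which is the part a user drills down into when more detail is needed.
detail = doc.bundle("ex:trend-analysis-detail")
detail.activity("ex:op-subset")
detail.activity("ex:op-reduce")
detail.entity("ex:intermediate")
detail.used("ex:op-subset", "ex:input")
detail.wasGeneratedBy("ex:intermediate", "ex:op-subset")
detail.used("ex:op-reduce", "ex:intermediate")
detail.wasGeneratedBy("ex:output", "ex:op-reduce")
```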
The first demonstrator provides a set of easy-to-use templates and tools that can be used in climate science notebooks to enable researchers to produce standards-conforming provenance descriptions for their data products. Specifically, users can analyse and visualize data of the Coupled Model Intercomparison Project Phase 6 (CMIP6) by choosing one variable from multiple experiments and comparing the results of different models. In particular, the user can analyse the historical experiment in combination with one of the Shared Socioeconomic Pathway (SSP) experiments. In addition to the scientific results, provenance records are created in the form of a provenance graph.
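As a rough illustration of the analysis pattern (not the actual notebook templates), the sketch below uses xarray to join the historical experiment with an SSP scenario for one variable and compare models; file names, model list and variable are placeholders.

```python
import xarray as xr

models = ["model-A", "model-B"]  # placeholder model names
variable = "tas"                 # e.g. near-surface air temperature

series = {}
for model in models:
    # Placeholder file names; real CMIP6 data would come from an archive.
    hist = xr.open_dataset(f"{model}_historical_{variable}.nc")[variable]
    ssp = xr.open_dataset(f"{model}_ssp585_{variable}.nc")[variable]
    full = xr.concat([hist, ssp], dim="time")  # historical + scenario
    # Simple comparable diagnostic: global annual mean (unweighted for brevity).
    series[model] = full.mean(dim=("lat", "lon")).resample(time="YS").mean()

for model, ts in series.items():
    print(model, float(ts.isel(time=-1)))
```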
The second demonstrator is mainly aimed at scientific users from the core climate modelling community who are interested in performing scientific data analysis workflows with stronger (so-called second-level) provenance support, able to track analytics activity at the level of single operators.
In this context, the multi-model Precipitation Trend Analysis (PTA) was selected as a pilot case. It has been implemented as an ECAS analytics workflow and executed on 17 climate models from the CMIP5 experiment, for a total of about 300 tasks. Based on the information about the executed analytics operators tracked by the Ophidia analytics engine [3], scientific users can run their analytics workflows and obtain fine-grained (second-level) provenance information represented according to the W3C PROV specifications. In more detail, the Python application that was developed allows users to retrieve the provenance documents related to a specific analytics workflow in several formats, such as XML, JSON, RDF or a graphical format.
Fine-grained provenance information captured at the second level of the PTA workflow based on W3C PROV specifications
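For illustration, the prov Python package already supports serialising a PROV document into the formats listed above; the sketch below builds a toy document (hypothetical identifiers) and exports it. The graphical export assumes the optional pydot/Graphviz dependencies are available.

```python
from prov.model import ProvDocument
from prov.dot import prov_to_dot  # requires pydot and Graphviz

# Toy document standing in for the provenance retrieved for a workflow.
doc = ProvDocument()
doc.add_namespace("ex", "https://example.org/pta/")  # placeholder
doc.activity("ex:precipitation-trend")
doc.entity("ex:trend-map")
doc.wasGeneratedBy("ex:trend-map", "ex:precipitation-trend")

doc.serialize("pta.json")                                  # PROV-JSON (default)
doc.serialize("pta.xml", format="xml")                     # PROV-XML
doc.serialize("pta.ttl", format="rdf", rdf_format="ttl")   # RDF (PROV-O)
prov_to_dot(doc).write_png("pta.png")                      # graphical rendering
```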
[1] T. Rodani, E. Osmenaj, A. Cazzaniga, M. Panighel, C. Africh and S. Cozzini, "Dataset of Scanning Tunneling Microscopy (STM) images of graphene on nickel (1.1)" [Data set], Zenodo, 2021. https://doi.org/10.5281/zenodo.5808724
[2] L. Moreau et al., "A Templating System to Generate Provenance", IEEE Transactions on Software Engineering. https://nms.kcl.ac.uk/luc.moreau/papers/provtemplate-tse17.pdf
[3] S. Fiore, D. Elia, C. Palazzo, F. Antonio, A. D'Anca, I. Foster and G. Aloisio, "Towards High Performance Data Analytics for Climate Change", ISC High Performance 2019, LNCS, Springer, 2019.