Detecting common scientific workflow fragments using templates and execution provenance

This page represents a Research Object containing additional materials for a paper accepted at the K-CAP 2013 conference. Its purpose is to provide a summary of the paper, supporting links, and short descriptions of the contents used as input and generated as output of the described work. A copy of the paper is available here. The software described in the paper is under development and available on GitHub.

Abstract

Provenance plays a major role when understanding and reusing the methods applied in a scientific experiment, as it provides a record of inputs, the processes carried out and the use and generation of intermediate and final results. In the specific case of in-silico scientific experiments, a large variety of scientific workflow systems (e.g., Wings, Taverna, Galaxy, Vistrails) have been created to support scientists. All of these systems produce some sort of provenance about the executions of the workflows that encode scientific experiments. However, provenance is normally recorded at a very low level of detail, which complicates the understanding of what happened during execution. In this paper we propose an approach to automatically obtain abstractions from low-level provenance data by finding common workflow fragments on workflow execution provenance and relating them to templates. We have tested our approach with a dataset of workflows published by the Wings workflow system. Our results show that by using these kinds of abstractions we can highlight the most common abstract methods used in the executions of a repository, relating different runs and workflow templates with each other.

Inputs and examples of the analysis

We selected a dataset of 22 workflow templates specified with the Wings workflow system. We also use a dataset of 30 workflow execution provenance traces, obtained by executing those 22 templates and annotated according to the Open Provenance Model for Workflows (OPMW). Both datasets belong to the text analytics domain; all templates from that domain were retrieved from http://opmw.org/sparql.
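As an illustration of how templates might be requested from the endpoint above, the following minimal sketch builds a SPARQL protocol GET URL. It is not the query used in the paper: the `opmw:` prefix IRI and the `opmw:WorkflowTemplate` class are assumptions about the OPMW vocabulary, the helper name `build_request_url` is hypothetical, and the query does not filter by domain. The sketch only constructs the URL and performs no network call.

```python
from urllib.parse import urlencode

# Endpoint named in the text above.
ENDPOINT = "http://opmw.org/sparql"

# Assumption: templates are typed as opmw:WorkflowTemplate and the
# prefix IRI below is the OPMW namespace. No domain filter is applied.
QUERY = """
PREFIX opmw: <http://www.opmw.org/ontology/>
SELECT DISTINCT ?template WHERE {
  ?template a opmw:WorkflowTemplate .
}
"""


def build_request_url(endpoint: str, query: str) -> str:
    """Return a GET URL that passes the query in the standard 'query' parameter."""
    return f"{endpoint}?{urlencode({'query': query.strip()})}"


if __name__ == "__main__":
    # Print the URL; fetching it is left to the reader's HTTP client of choice.
    print(build_request_url(ENDPOINT, QUERY))
```

The resulting URL can be opened with any HTTP client; the response format depends on the endpoint's content negotiation.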

The processed template dataset is available here (without inference) and here (with inference).

The trace dataset is available here (without inference) and here (with inference).

Results of the analysis

The following Excel files summarize the results, which are available here.

  1. File summarizing the performance of the SUBDUE algorithm on the dataset when finding external macros.
  2. File summarizing the performance of the SUBDUE algorithm on the dataset when finding internal macros.

About the authors

Daniel Garijo is a PhD student in the Ontology Engineering Group at the Artificial Intelligence Department of the Computer Science Faculty of Universidad Politécnica de Madrid. His research activities focus on e-Science and the Semantic Web, specifically on how to increase the understandability of scientific workflows using provenance, metadata, intermediate results and Linked Data.
Oscar Corcho is an Associate Professor at Departamento de Inteligencia Artificial (Facultad de Informática, Universidad Politécnica de Madrid), and he belongs to the Ontology Engineering Group. His research activities focus on Semantic e-Science and Real World Internet. In these areas, he has participated in a number of EU projects (Wf4Ever, PlanetData, SemsorGrid4Env, ADMIRE, OntoGrid, Esperonto, Knowledge Web and OntoWeb) and Spanish Research and Development projects (CENITS mIO!, España Virtual and Buscamedia, myBigData, GeoBuddies). He has also participated in privately funded projects such as ICPS (International Classification of Patient Safety), funded by the World Health Organisation, and HALO, funded by Vulcan Inc.
Yolanda Gil is Director of Knowledge Technologies at the Information Sciences Institute of the University of Southern California, and Research Professor in the Computer Science Department. Her research interests include intelligent user interfaces, social knowledge collection, provenance and assessment of trust, and knowledge management in science. Her most recent work focuses on intelligent workflow systems to support collaborative data analytics at scale.