Research Object for the thesis "Mining Abstractions in Scientific Workflows"

This page describes a Research Object for my PhD thesis (written in PDF). The Research Object includes the inputs, outputs and material used in the experiments and evaluation of the thesis, including other Research Objects referencing past publications. All the contents are semantically described (i.e., they can be retrieved automatically by machines), linked to each other and have a DOI in order to facilitate their accessibility.

If you want an RDF description of the contents presented in this document, just parse it with your favourite RDF parser. Alternatively, you can use content negotiation on its id (http://w3id.org/dgarijo/ro/mining-abstractions-in-scientific-wfs) to retrieve it in TTL (link), RDF/XML (link) or JSON-LD (link) format.
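
As a minimal sketch of the content negotiation described above, the following Python snippet builds a request for the Research Object id with an Accept header for each serialization. The media types are the standard ones for Turtle, RDF/XML and JSON-LD; the function names `build_request` and `fetch` are hypothetical helpers introduced here for illustration, and whether the server still answers each format is an assumption.

```python
import urllib.request

# Identifier of the Research Object, as given in the text above.
RO_ID = "http://w3id.org/dgarijo/ro/mining-abstractions-in-scientific-wfs"

# Standard media types for the three serializations mentioned above.
MEDIA_TYPES = {
    "ttl": "text/turtle",
    "rdfxml": "application/rdf+xml",
    "jsonld": "application/ld+json",
}


def build_request(fmt, url=RO_ID):
    """Build (but do not send) a GET request asking for one RDF serialization."""
    return urllib.request.Request(url, headers={"Accept": MEDIA_TYPES[fmt]})


def fetch(fmt, url=RO_ID, timeout=30):
    """Send the request and return the body as text (requires network access)."""
    with urllib.request.urlopen(build_request(fmt, url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling `fetch("ttl")` would then return the Turtle description, provided the redirect behind the id is still in place.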

The PDF associated with this Research Object is available online, in UPM's archive.

Abstract

(Abstract extracted from the thesis document) "Scientific workflows have been adopted in the last decade to represent the computational methods used in in silico scientific experiments and their associated research products. Scientific workflows have demonstrated to be useful for sharing and reproducing scientific experiments, allowing scientists to visualize, debug and save time when re-executing previous work. However, scientific workflows may be difficult to understand and reuse. The large amount of available workflows in repositories, together with their heterogeneity and lack of documentation and usage examples may become an obstacle for a scientist aiming to reuse the work from other scientists. Furthermore, given that it is often possible to implement a method using different algorithms or techniques, seemingly disparate workflows may be related at a higher level of abstraction, based on their common functionality. In this thesis we address the issue of reusability and abstraction by exploring how workflows relate to one another in a workflow repository, mining abstractions that may be helpful for workflow reuse. In order to do so, we propose a simple model for representing and relating workflows and their executions, we analyze the typical common abstractions that can be found in workflow repositories, we explore the current practices of users regarding workflow reuse and we describe a method for discovering useful abstractions for workflows based on existing graph mining techniques. Our results expose the common abstractions and practices of users in terms of workflow reuse, and show how our proposed abstractions have potential to become useful for users designing new workflows".

Aggregated resources

The work described in this thesis is an aggregation and refinement of previous work, which has also been published as Research Objects. In this section we briefly explain some of the contributions of the thesis, pointing to the aggregated resources and the Research Objects related to them.

  1. The Open Provenance Model for Workflows (OPMW) [1]: One of the objectives of the thesis is to address workflow representation. To this end, we define a model to represent scientific workflows and their executions in a simple way. The model is available online, with content negotiation enabled. This means that you can simply access it through its URL and import it as RDF or TTL.
  2. A repository of workflows published as Linked Data: Thanks to the model, we published a corpus of workflows on the Web [3]. The repository can be queried online at http://opmw.org/sparql, and it is available for download at https://github.com/wf4ever/provenance-corpus/tree/master/WINGS_repository/Wings_corpus (the download is available mainly in PROV-O, the W3C standard for interchanging provenance on the Web). The repository contains workflow templates and their respective executions following the OPMW model (which extends both OPM and W3C PROV).
  3. An empirical analysis of more than 260 workflows [2] [5]: In order to find the common domain-independent abstractions of several workflow corpora, we analyzed more than 260 workflows from Wings, Taverna, Vistrails and Galaxy. The corpus is described in detail in the Research Object associated with the publication (http://purl.org/net/ro-motifPaper), although we have also uploaded it to FigShare to be downloaded as a bundle (http://dx.doi.org/10.6084/m9.figshare.1598104). As a result of the analysis, we created a vocabulary for common workflow motifs (wf-motifs), which describes the most common operations among the workflow steps based on their functionality (reformat data, combine data, etc.). Motifs are divided into two main groups: those that refer to the type of operation being undertaken in the workflow step and those that refer to how that operation was undertaken (e.g., with a sub-workflow, by issuing a job to an external service, etc.). For more details, have a look at the previous Research Object or the associated publications [2] [5].
  4. An automatic reuse analysis of scientific workflows [6] [7]: In the thesis we have performed two types of workflow reuse analysis. The first one is automated. It uses four corpora from the LONI Pipeline. Three of these corpora are described in this Research Object (http://purl.org/net/escience2014), and are available in FigShare as a bundle. The results of the analysis extend the previous Research Object with an additional corpus (WC4), and are available here (link). The second reuse analysis consists of a user survey, already described in this Research Object.
  5. FragFlow, an approach to mine workflows for commonly occurring patterns [6]: One of the main goals of this thesis is to mine abstractions and patterns from a corpus of scientific workflows. To this end, we have developed FragFlow, an approach for mining commonly occurring workflow fragments based on graph mining techniques. FragFlow is available on GitHub (https://github.com/dgarijo/FragFlow). An overview of the FragFlow approach can be found in the following Research Object (http://purl.org/net/escience2014).
  6. Evaluations of FragFlow [4] [6]: We have evaluated FragFlow by analyzing two main aspects. The first one is whether FragFlow is able to find the motifs related to the reusability of workflows, such as the internal macro (http://purl.org/net/wf-motifs#InternalMacro) or workflows that are part of a composite workflow motif (http://purl.org/net/wf-motifs#CompositeWorkflow). In this evaluation we also vary the level of abstraction at which our approach generates the patterns, based on the domain knowledge of the workflows we are dealing with. This evaluation is based on the inputs used in this Research Object (http://purl.org/net/kcap2013RO). However, we re-ran the evaluation with some new additions. Below you can find pointers to each of the updated resources: The second evaluation consisted of assessing whether the fragments provided by FragFlow were equal (or significantly similar) to the groupings of workflows defined by users. The contents of this evaluation are an evolution of those in the Research Object associated with a prior publication. The corpus used consisted of the four corpora introduced previously. The results are measured in terms of precision and recall, and are available here (http://figshare.com/articles/Results_of_the_FragFlow_evaluation_performed_on_the_LONI_Pipeline/1603473), including the resultant fragments as a bundle. Additionally, we asked some domain experts for their opinion regarding the usability of the discovered fragments. The responses are available here (http://dx.doi.org/10.6084/m9.figshare.1603474).
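
The SPARQL endpoint listed in item 2 can be queried over plain HTTP. The sketch below is only an illustration: it assumes that workflow templates in the repository are typed with `opmw:WorkflowTemplate` under the namespace `http://www.opmw.org/ontology/`, and that the endpoint accepts the standard SPARQL protocol `query` parameter and can return JSON results. The helper names `build_query_url` and `list_templates` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Endpoint given in the list above.
ENDPOINT = "http://opmw.org/sparql"

# Assumption: templates are instances of opmw:WorkflowTemplate in this namespace.
QUERY = """
PREFIX opmw: <http://www.opmw.org/ontology/>
SELECT DISTINCT ?template WHERE {
  ?template a opmw:WorkflowTemplate .
} LIMIT 10
"""


def build_query_url(query=QUERY, endpoint=ENDPOINT):
    """Encode the query as a SPARQL protocol GET URL (no request is sent)."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})


def list_templates(query=QUERY, endpoint=ENDPOINT, timeout=30):
    """Run the query and return the template URIs (requires network access)."""
    req = urllib.request.Request(
        build_query_url(query, endpoint),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        results = json.load(resp)
    # Each binding maps the variable name to a dict holding its value.
    return [b["template"]["value"] for b in results["results"]["bindings"]]
```

If the endpoint is reachable, `list_templates()` would return up to ten template URIs, which can then be dereferenced to explore their executions.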

Authors and Contributors

Daniel Garijo Verdejo (Author): Daniel Garijo is a PhD student in the Ontology Engineering Group at the Artificial Intelligence Department of the Computer Science Faculty of Universidad Politécnica de Madrid. His research activities focus on e-Science and the Semantic Web, specifically on how to increase the understandability of scientific workflows using provenance, metadata, intermediate results and Linked Data.
Oscar Corcho (Supervisor): Oscar Corcho is an Associate Professor at Departamento de Inteligencia Artificial (Facultad de Informática, Universidad Politécnica de Madrid), and he belongs to the Ontology Engineering Group. His research activities are focused on Semantic e-Science and Real World Internet. In these areas, he has participated in a number of EU projects (Wf4Ever, PlanetData, SemsorGrid4Env, ADMIRE, OntoGrid, Esperonto, Knowledge Web and OntoWeb), Spanish Research and Development projects (CENITS mIO!, España Virtual and Buscamedia, myBigData, GeoBuddies), and has also participated in privately-funded projects like ICPS (International Classification of Patient Safety), funded by the World Health Organisation, and HALO, funded by Vulcan Inc.
Yolanda Gil (Supervisor): Yolanda Gil is Director of Knowledge Technologies at the Information Sciences Institute of the University of Southern California, and Research Professor in the Computer Science Department. Her research interests include intelligent user interfaces, social knowledge collection, provenance and assessment of trust, and knowledge management in science. Her most recent work focuses on intelligent workflow systems to support collaborative data analytics at scale.

Bibliography

Every Research Object mentioned in this page is associated with a previous publication. Some of them have been enhanced and expanded for the final thesis work. Below is a list of the papers mentioned in this document.
  1. Daniel Garijo and Yolanda Gil. A New Approach for Publishing Workflows: Abstractions, Standards, and Linked Data. Proceedings of the 6th Workshop on Workflows in Support of Large-Scale Science, pages 47-56, Seattle, USA. 2011.
  2. Daniel Garijo, Pinar Alper, Khalid Belhajjame, Oscar Corcho, Yolanda Gil, and Carole Goble. Common Motifs in Scientific Workflows: An Empirical Analysis. 8th IEEE International Conference on eScience 2012, pages 1-8, Chicago, USA. 2012.
  3. Khalid Belhajjame, Jun Zhao, Daniel Garijo, Aleix Garrido, Stian Soiland-Reyes, Pinar Alper, and Oscar Corcho. A Workflow PROV-Corpus Based on Taverna and Wings. Proceedings of the Joint EDBT/ICDT 2013 Workshops, pages 331-332, Genova, Italy. 2013.
  4. Daniel Garijo, Oscar Corcho, and Yolanda Gil. Detecting Common Scientific Workflow Fragments Using Templates and Execution Provenance. Seventh International Conference on Knowledge Capture (K-CAP), pages 33-40, Banff, Alberta, Canada. 2013.
  5. Daniel Garijo, Pinar Alper, Khalid Belhajjame, Oscar Corcho, Yolanda Gil, and Carole Goble. Common Motifs in Scientific Workflows: An Empirical Analysis (Extension of 2012's paper for a journal). Future Generation Computer Systems, volume 36, pages 338-351. 2014.
  6. Daniel Garijo, Oscar Corcho, Yolanda Gil, Boris A. Gutman, Ivo D. Dinov, Paul Thompson, and Arthur W. Toga. FragFlow: Automated Fragment Detection in Scientific Workflows. 10th IEEE International Conference on eScience 2014, pages 281-289, Guaruja, Brazil. 2014.
  7. Daniel Garijo, Oscar Corcho, Yolanda Gil, Meredith N. Braskie, Derrek Hibar, Xue Hua, Neda Jahanshad, Paul Thompson, and Arthur W. Toga. Workflow Reuse in Practice: A Study of Neuroimaging Pipeline Users. 10th IEEE International Conference on eScience 2014, pages 90-99, Guaruja, Brazil. 2014.