Additional materials for the paper "Automated Hypothesis Testing with Large Scientific Data Repositories"

This page describes the additional materials used for the publication "Automated Hypothesis Testing with Large Scientific Data Repositories", which is currently under review at the ACS2016 conference.

If you want to get an RDF description of the contents presented in this document, just parse it with your favourite RDFa parser. Alternatively, you can use content negotiation on its id to retrieve it in TTL, RDF/XML or JSON-LD format.
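As a minimal sketch of the content negotiation described above, the following Python snippet builds an HTTP request whose Accept header asks for one of the three serializations. The media types are the standard ones for Turtle, RDF/XML and JSON-LD; the `PAGE_ID` value and the `build_rdf_request` helper are assumptions made for illustration, since the id of this page is not reproduced here.

```python
from urllib.request import Request

# Standard media types for the three serializations named above.
RDF_MEDIA_TYPES = {
    "ttl": "text/turtle",
    "rdfxml": "application/rdf+xml",
    "jsonld": "application/ld+json",
}

# Placeholder only (assumption): substitute the real id (URL) of this page.
PAGE_ID = "http://example.org/disk-materials"


def build_rdf_request(url: str, fmt: str) -> Request:
    """Build, but do not send, a GET request asking the server for `fmt`."""
    if fmt not in RDF_MEDIA_TYPES:
        raise ValueError(f"unknown format: {fmt!r}")
    return Request(url, headers={"Accept": RDF_MEDIA_TYPES[fmt]})


# Ask for the Turtle serialization and show the header that would be sent.
request = build_rdf_request(PAGE_ID, "ttl")
print(request.get_header("Accept"))
```

Passing the resulting request to `urllib.request.urlopen` would perform the actual retrieval, provided the server honours the Accept header.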

A link to the PDF of the paper will be provided here once the review process is done.


Summary extracted from the submitted paper: "The automation of important aspects of scientific data analysis would significantly accelerate the pace of science and innovation. Although important aspects of data analysis can be automated, the hypothesize-test-evaluate discovery cycle is largely carried out by hand by researchers. This introduces a significant human bottleneck, which is inefficient and can lead to erroneous and incomplete explorations. We introduce a novel approach to automate the hypothesize-test-evaluate discovery cycle with an intelligent system that a scientist can task to test hypotheses of interest in a data repository. Our approach captures three types of data analytics knowledge: 1) common data analytic methods represented as semantic workflows; 2) meta-analysis methods that aggregate those results, represented as meta-workflows; and 3) data analysis strategies that specify for a type of hypothesis what data and methods to use, represented as lines of inquiry. Given a hypothesis specified by a scientist, appropriate lines of inquiry are triggered, which lead to retrieving relevant datasets, running relevant workflows on that data, and finally running meta-workflows on workflow results. The scientist is then presented with a level of confidence on the initial hypothesis (or a revised hypothesis) based on the data and methods applied. We have implemented this approach in the DISK system, and applied it to multi-omics data analysis."
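The cycle outlined in the summary can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the DISK implementation: the `LineOfInquiry` class, the `evaluate_hypothesis` function, the hypothesis type string and the numeric values are all hypothetical, and the "confidence" here is simply the mean of made-up workflow results.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

Dataset = Dict[str, float]
Workflow = Callable[[Dataset], float]          # one dataset -> one result
MetaWorkflow = Callable[[List[float]], float]  # many results -> confidence


@dataclass
class LineOfInquiry:
    """Hypothetical strategy: which data and methods suit a hypothesis type."""
    hypothesis_type: str
    select_data: Callable[[List[Dataset]], List[Dataset]]
    workflows: List[Workflow]
    meta_workflow: MetaWorkflow


def evaluate_hypothesis(hypothesis_type: str,
                        repository: List[Dataset],
                        lines_of_inquiry: List[LineOfInquiry]) -> List[float]:
    """Return one confidence value per line of inquiry that is triggered."""
    confidences = []
    for loi in lines_of_inquiry:
        if loi.hypothesis_type != hypothesis_type:
            continue  # this line of inquiry is not triggered
        datasets = loi.select_data(repository)                         # retrieve data
        results = [wf(ds) for ds in datasets for wf in loi.workflows]  # run workflows
        if results:
            confidences.append(loi.meta_workflow(results))             # aggregate
    return confidences


# Toy repository with made-up values (assumption, for illustration only).
repository = [{"protein_x": 0.75}, {"protein_x": 0.25}, {"gene_y": 0.1}]

loi = LineOfInquiry(
    hypothesis_type="protein-expression",
    select_data=lambda repo: [ds for ds in repo if "protein_x" in ds],
    workflows=[lambda ds: ds["protein_x"]],
    meta_workflow=mean,
)

confidences = evaluate_hypothesis("protein-expression", repository, [loi])
print(confidences)
```

Keeping data selection, workflows and the meta-workflow as separate fields mirrors the three kinds of knowledge listed in the summary, so each can be swapped without touching the others.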

Aggregated resources

The paper associated with this page describes the DISK framework, which aims to automate the hypothesis test-refine life cycle. Below is a list of the materials that we have used to test and demonstrate our approach.

  1. List of the workflows will go here [1]: One of the objectives in the thesis is to address workflow representation. To this end, we define a model to represent scientific workflows and their executions in a simple way. The model is available online, with content negotiation enabled. This means that you can access it through its URL and import it in RDF or TTL.
  2. Proteomics workflows (2): pictures and links to inputs and outputs.
  3. Link to the Zhang et al. paper [Zhang et al 2014], which is used as reference for the evaluations.
  4. Any other workflows that were reproduced from the paper
  5. Metaworkflows
  6. Portal

Authors and Contributors

Yolanda Gil is Director of Knowledge Technologies at the Information Sciences Institute of the University of Southern California, and Research Professor in the Computer Science Department. Her research interests include intelligent user interfaces, social knowledge collection, provenance and assessment of trust, and knowledge management in science. Her most recent work focuses on intelligent workflow systems to support collaborative data analytics at scale.
Daniel Garijo is a postdoctoral researcher at the Information Sciences Institute of the University of Southern California. His research activities focus on e-Science and the Semantic Web, specifically on how to increase the understandability of scientific workflows using provenance, metadata, intermediate results and Linked Data.
Varun Ratnakar is a research programmer at the Information Sciences Institute of the University of Southern California. He is the main developer of the Wings workflow system.
Rajiv Mayani is a programmer analyst at the Information Sciences Institute of the University of Southern California.
Parag Mallik is an assistant professor (Research) in Radiology. Parag is also a member of the Stanford Cancer Institute and a faculty fellow of Stanford ChEM-H. After completing his PhD, he trained with Ruedi Aebersold in clinical proteomics and systems biology at the Institute for Systems Biology.
Ravali Adusumilli is a bioinformaticist at the Mallik Lab of Stanford University. She is interested in developing tools and pipelines for multi-omic analysis.
Hunter Boyce is a postdoctoral researcher at the Mallik Lab of Stanford University.


On this page we make use of the following references:
  1. [Zhang et al 2014]: Bing Zhang, Jing Wang, Xiaojing Wang, Jing Zhu, Qi Liu, et al. “Proteogenomic characterization of human colon and rectal cancer.” Nature 513, 382–387, 18 September 2014.