This page describes the additional materials used for the publication "Towards Continuous Scientific Data Analysis and Hypothesis Evolution", Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), 2017.
How to cite the materials used on this page:
Adusumilli, Ravali; Garijo, Daniel; Gil, Yolanda; Mallick, Parag (2016). Additional materials used in the paper "Towards Continuous Scientific Data Analysis and Hypothesis Evolution" [Data set]. Zenodo. https://doi.org/10.5281/zenodo.190374
Extracted from the paper: "Scientific data is continuously generated throughout the world. However, analyses of these data are typically performed exactly once and on a small fragment of recently generated data. Ideally, data analysis would be a continuous process that uses all the data available at the time, and would be automatically re-run and updated when new data appears. We present a framework for automated discovery from data repositories that tests user-provided hypotheses using expert-grade data analysis strategies, and reassesses hypotheses when more data becomes available. Novel contributions of this approach include a framework to trigger new analyses appropriate for the available data through lines of inquiry that support progressive hypothesis evolution, and a representation of hypothesis revisions with provenance records that can be used to inspect the results. We implemented our approach in the DISK framework, and evaluated it using two scenarios from cancer multi-omics: 1) data for new patients becomes available over time, 2) new types of data for the same patients are released. We show that in all scenarios DISK updates the confidence on the original hypotheses as it automatically analyzes new data."
The DISK software can be accessed on Zenodo.
We used three different workflows in the evaluation of this paper. Each workflow was executed once on 84 datasets from 42 samples used in [Zhang et al. 2014]. The datasets were taken from TCGA and CPTAC, and are available from those repositories. Below you can find the three workflow templates with a sample execution and pointers to the software used in each workflow step. Note that the 126 executions (42 samples × 3 workflows) of these workflows have not been made public due to their size and TCGA/CPTAC data access restrictions.
Workflow designed to estimate the confidence of the results in proteomic data. A snapshot of a workflow execution can be seen below:
The workflow has two main steps. CalcNSAFNode uses a script to perform its main functionality. The second step, labkeyProtSearchNode, uses a script to invoke the LabKey software. The scripts used for the integration with WINGS are also available online.
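The CalcNSAFNode step computes Normalized Spectral Abundance Factors (NSAF), a standard spectral-counting measure of relative protein abundance. The linked script is the authoritative implementation; as an illustration of the underlying NSAF formula only (the function name and inputs below are hypothetical, not taken from the DISK scripts), each protein's length-normalized spectral count is divided by the sum over all proteins:

```python
def calc_nsaf(spectral_counts, lengths):
    """Normalized Spectral Abundance Factor per protein.

    NSAF_i = (SpC_i / L_i) / sum_j (SpC_j / L_j), where SpC_i is the
    spectral count of protein i and L_i is its length in residues.
    """
    # Length-normalized spectral counts (SAF values)
    saf = [spc / length for spc, length in zip(spectral_counts, lengths)]
    total = sum(saf)
    # Normalize so the NSAF values sum to 1 across the sample
    return [s / total for s in saf]
```

Because NSAF values sum to 1 within a sample, they can be compared across runs with different total spectral counts.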
Workflow designed to estimate the confidence of results in genomic data. A snapshot of a workflow execution can be seen below:
The workflow has 7 steps: wTopHat2 (see invocation script), MpileUpCaller (see invocation script), bcftoolsCallVariants (see invocation script), varFilterSAMTools (see invocation script), SAMToolsBAMIndex (see invocation script), CustomPRODB (see invocation script), and CalCRPKMValues (from CustomPRODB) (see invocation script).
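The final step derives RPKM values (reads per kilobase of transcript per million mapped reads), a standard RNA-seq expression measure. As an illustration of the standard RPKM formula only (not the linked invocation script; the function name is hypothetical):

```python
def calc_rpkm(mapped_reads, transcript_length_bp, total_mapped_reads):
    """RPKM = (reads mapped to the transcript * 10^9)
            / (transcript length in bp * total mapped reads in the sample)."""
    return mapped_reads * 1e9 / (transcript_length_bp * total_mapped_reads)
```

For example, 1,000 reads on a 2 kb transcript in a library of 10 million mapped reads gives an RPKM of 50.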
Workflow designed to estimate the confidence of results in genomic and proteomic data combined. A snapshot of the workflow can be seen below:
The workflow has 9 steps: wTopHat2 (see invocation script), MpileUpCaller (see invocation script), bcftoolsCallVariants (see invocation script), varFilterSAMTools (see invocation script), SAMToolsBAMIndex (see invocation script), CustomPRODB (see invocation script), joinFASTAFiles (see invocation script), LabKey (see invocation script), and CalcNSAFNode (see invocation script).
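The joinFASTAFiles step merges protein FASTA files (presumably the sample-specific sequences produced by CustomPRODB with a reference database) before the LabKey search. The actual script is linked above; a minimal sketch of such a merge, assuming plain FASTA concatenation and a hypothetical function name:

```python
def join_fasta_files(input_paths, output_path):
    """Concatenate FASTA files so records from one file do not run
    into the '>' header line of the next."""
    with open(output_path, "w") as out:
        for path in input_paths:
            with open(path) as src:
                text = src.read()
            if text and not text.endswith("\n"):
                text += "\n"  # keep the next '>' header on its own line
            out.write(text)
```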
|Yolanda Gil||Yolanda Gil is Director of Knowledge Technologies at the Information Sciences Institute of the University of Southern California, and Research Professor in the Computer Science Department. Her research interests include intelligent user interfaces, social knowledge collection, provenance and assessment of trust, and knowledge management in science. Her most recent work focuses on intelligent workflow systems to support collaborative data analytics at scale.|
|Daniel Garijo Verdejo||Daniel Garijo is a postdoctoral researcher at the Information Sciences Institute of the University of Southern California. His research activities focus on e-Science and the Semantic Web, specifically on how to increase the understandability of scientific workflows using provenance, metadata, intermediate results, and Linked Data.|
|Varun Ratnakar||Varun Ratnakar is a research programmer at the Information Sciences Institute of the University of Southern California. He is the main developer of the Wings workflow system.|
|Rajiv Mayani||Rajiv Mayani is a programmer analyst at the Information Sciences Institute of the University of Southern California.|
|Ravali Adusumilli||Ravali Adusumilli is a bioinformaticist in the Mallick Lab at Stanford University. She is interested in developing tools and pipelines for multi-omic analysis.|
|Hunter Boyce||Hunter Boyce is a graduate student in the Stanford Biomedical Informatics (BMI) program.|
|Arunima Srivastava||Arunima Srivastava is a graduate student at the Ohio State University.|
|Parag Mallick||Parag Mallick is an assistant professor (Research) in Radiology. Parag is also a member of the Stanford Cancer Institute and a faculty fellow of Stanford ChEM-H. After completing his PhD, he trained with Ruedi Aebersold in clinical proteomics and systems biology at the Institute for Systems Biology.|