Research Object for paper: A Framework for Creating Knowledge Graphs of Scientific Software Metadata

This research object provides pointers to the materials used in a paper under review at the journal Quantitative Science Studies.

Abstract (extracted from the submitted paper): An increasing number of researchers rely on computational methods to generate or manipulate the results described in their scientific publications. Software created to this end--scientific software--is key to understanding, reproducing, and reusing existing work in many disciplines, ranging from Geosciences to Astronomy or Artificial Intelligence. However, scientific software is usually challenging to find, set up, and compare to similar software due to its disconnected documentation (dispersed in manuals, readme files, web sites, and code comments) and the lack of structured metadata to describe it. As a result, researchers have to manually inspect existing tools in order to understand their differences and incorporate them into their work. This approach scales poorly with the number of publications and tools made available every year. In this paper we address these issues by introducing a framework for automatically extracting scientific software metadata from its documentation (in particular, their readme files); a methodology for structuring the extracted metadata in a Knowledge Graph (KG) of scientific software; and an exploitation framework for browsing, comparing and exploring the contents of the generated KG. We demonstrate our approach by creating a prototype with metadata from over ten thousand scientific software entries from public code repositories.

Resources.

The paper associated with this research object describes a framework to 1) extract metadata from scientific software repositories; 2) create knowledge graphs of connected scientific software; and 3) exploit the created knowledge graph. Our approach produced the following resources:

The SOftware Metadata Extraction Framework (SOMEF): Given a GitHub repository URL, SOMEF extracts up to 25 different metadata categories and exports the metadata in JSON or Turtle format, following the Codemeta vocabulary and the Software Description Ontology. SOMEF releases are available in Zenodo under the following DOI: https://zenodo.org/badge/latestdoi/190487675
SOMEF: Training corpus for supervised classification. A corpus of 89 manually annotated repositories, used to determine which sections describe installation instructions, description, citation and invocation.
SOMEF: Trained classifiers. The classifiers that yielded the best results for detecting scientific software metadata.
Header analysis evaluation annotated corpus. 898 headers annotated with their respective header categories.
Header evaluation results. As of July 4th, 2021.
SOMEF example notebook
SOMEF Dockerfile, also available in DockerHub under the id: kcapd/somef
SOSEN-KG repository. SOSEN is available in Zenodo, under the following DOI: https://zenodo.org/record/4574224
SOSEN-KG Notebook
SOSEN-KG dump: Turtle files with the results of the metadata extraction from over 10,000 scientific software repositories
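
To give a feel for what the header analysis resources listed above are about, the following is a minimal sketch of assigning README headers to the four annotated categories (installation, description, citation, invocation). It is not SOMEF's implementation: the keyword lists, the function names and the restriction to Markdown "#" headers are assumptions made for illustration, and SOMEF's trained classifiers work on section text rather than on simple keyword matching.

```python
import re

# Hypothetical keyword lists chosen for illustration only; they are not
# taken from SOMEF or from the annotated header corpus.
CATEGORY_KEYWORDS = {
    "installation": ("install", "setup", "requirement", "dependenc"),
    "description": ("description", "overview", "about", "introduction"),
    "citation": ("cite", "citation", "reference"),
    "invocation": ("usage", "example", "running", "invocation"),
}

# Matches Markdown headers written with leading '#' characters, e.g. "## Usage".
# Lines starting with '#' inside code blocks would also match (a known limitation).
HEADER_PATTERN = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def extract_headers(readme_text):
    """Return the text of every '#'-style header found in a README."""
    return [match.group(1) for match in HEADER_PATTERN.finditer(readme_text)]


def categorize_header(header):
    """Return the categories whose keywords appear in the header (may be empty)."""
    lowered = header.lower()
    return [
        category
        for category, keywords in CATEGORY_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    ]


if __name__ == "__main__":
    sample = "# MyTool\n\n## Installation\n\n## How to cite\n\n## Usage examples\n"
    for header in extract_headers(sample):
        print(header, "->", categorize_header(header))
```

A header that matches no keyword gets an empty list, which mirrors the fact that not every README section falls into one of the four categories.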

Example SPARQL queries and additional documentation on how to access the SOSEN KG are available in the README file of the main SOSEN repository.
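
As a rough illustration of the kind of query involved, the sketch below builds a keyword search over software descriptions and the corresponding HTTP GET URL, using only the Python standard library. Several things here are assumptions rather than facts about the SOSEN KG: the endpoint address is a placeholder (the real one is documented in the SOSEN README), and the use of schema:name and schema:description is a guess based on Codemeta reusing schema.org terms; the KG may use other properties.

```python
from urllib.parse import urlencode

# Placeholder endpoint (assumption); replace it with the address given
# in the SOSEN repository README.
SPARQL_ENDPOINT = "https://example.org/sosen/sparql"


def build_keyword_query(keyword, limit=10):
    """Build a SPARQL query for software whose description contains a keyword."""
    # Escape backslashes and double quotes so the keyword stays inside
    # the SPARQL string literal.
    safe = keyword.replace("\\", "\\\\").replace('"', '\\"')
    # Assumed vocabulary: schema.org name/description properties.
    return f"""
PREFIX schema: <https://schema.org/>
SELECT ?software ?name ?description WHERE {{
  ?software schema:name ?name ;
            schema:description ?description .
  FILTER(CONTAINS(LCASE(STR(?description)), LCASE("{safe}")))
}}
LIMIT {int(limit)}
""".strip()


def build_request_url(endpoint, query):
    """Encode the query as the 'query' parameter of a SPARQL Protocol GET request."""
    return f"{endpoint}?{urlencode({'query': query})}"


if __name__ == "__main__":
    query = build_keyword_query("hydrology", limit=5)
    print(query)
    print(build_request_url(SPARQL_ENDPOINT, query))
```

The URL can then be fetched with any HTTP client, typically with an Accept header such as application/sparql-results+json to request JSON results.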

About the authors.

Aidan Kelley

Student visitor

Student at Washington University in St. Louis. Aidan participated in the NSF Research Exchange Undergraduate program in the summer of 2020.

Daniel Garijo

Researcher

Researcher at the Information Sciences Institute of the University of Southern California. Daniel's research activities focus on e-Science and the Semantic Web, specifically on how to increase the understandability of software and scientific workflows using provenance, metadata, intermediate results and Linked Data.