Supplementary materials for paper: Creating and Querying Personalized Versions of Wikidata on a Laptop (Under review)

This page provides pointers to the materials (datasets, software and notebooks) used in a paper currently under review.

Summary: Application developers today have three choices for exploiting the knowledge present in Wikidata [1]: they can download the Wikidata dumps in JSON or RDF format, they can use the Wikidata API to get data about individual entities, or they can use the Wikidata SPARQL endpoint. None of these methods supports complex, yet common, query use cases, such as retrieving large amounts of data or aggregating over large fractions of Wikidata. This paper introduces KGTK Kypher, a query language and processor that allows users to create personalized variants of Wikidata on a laptop. We present several use cases that illustrate the types of analyses that Kypher enables users to run on the full Wikidata KG on a laptop, combining it with data from external resources such as DBpedia. The Kypher queries for these use cases run much faster on a laptop than the equivalent SPARQL queries on a Wikidata clone running on a powerful server with 24h time-out limits.

This page and the materials described on it (excluding external references) are available at a permanent URL: https://w3id.org/kgtk_kypher.

A preprint of the paper will soon be available here.

Datasets.

Input Datasets

We used the following datasets for our paper, available in Zenodo under DOI https://doi.org/10.5281/zenodo.5139550:

claims.time.tsv.gz: time-related assertions
claims.wikibase-item.tsv-006.gz: item-related assertions
derived.P279.tsv.gz: "subclass of" (P279) statements
derived.P279star.tsv.gz: "subclass of" (P279) statements, including their transitive chains
derived.P31.tsv.gz: "instance of" (P31) statements
labels.en.tsv-004.gz: labels in English
claims.external-id.tsv-005.gz: external identifiers for each item
ulan.tsv: ULAN ids (used to link external identifiers to Wikidata identifiers)
wikidata_infobox.tsv.gz: information about DBpedia infoboxes
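
The files listed above are tab-separated tables, most of them gzip-compressed. As a minimal sketch for previewing one of them after download, using only the Python standard library: the helper name `preview_tsv` is hypothetical and not part of KGTK, and the sketch assumes only that the first line of each file is a header row; it makes no assumption about specific column names.

```python
import csv
import gzip
import itertools


def preview_tsv(path, limit=5):
    """Return the header and the first `limit` data rows of a TSV file.

    Hypothetical helper for illustration. Files ending in ".gz" are read
    through gzip; anything else (e.g. ulan.tsv) is read as plain text.
    """
    path = str(path)
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, mode="rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quote characters as literal data instead of
        # treating them as CSV field delimiters.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader, [])
        rows = list(itertools.islice(reader, limit))
    return header, rows
```

For example, `header, rows = preview_tsv("derived.P31.tsv.gz")` reads only the beginning of the file, so it can be used to check the column layout without loading the whole dataset into memory.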

The cache (sqlite) used to store the queries is available at the following DOI: https://doi.org/10.5281/zenodo.5146407
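
Since the cache is an SQLite database, it can be inspected with standard tooling before use. The following is a minimal sketch that lists the tables in such a file with Python's built-in `sqlite3` module. The helper name `list_tables` is hypothetical, and nothing is assumed here about the name of the downloaded file or about the tables it contains.

```python
import os
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in an SQLite database file.

    Hypothetical helper for illustration; it makes no assumption about
    which tables the cache actually holds.
    """
    # sqlite3.connect would silently create an empty database for a
    # missing path, so fail early instead.
    if not os.path.exists(db_path):
        raise FileNotFoundError(db_path)
    connection = sqlite3.connect(db_path)
    try:
        cursor = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
        return [row[0] for row in cursor]
    finally:
        connection.close()
```

Calling `list_tables` with the path of the downloaded cache file returns the table names in alphabetical order, which is a quick way to confirm that the download is a readable database.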

The results of the analysis are the query times comparing KGTK and SPARQL, as reported in the paper.

Software and Notebooks.

The pointers for the main software used can be found below:

The Knowledge Graph Toolkit, a framework for large Knowledge Graph manipulation. This is an external tool used by the authors for the analysis.
Repository with the Jupyter Notebook and SPARQL queries used in the analysis.

Bibliography.

[1] Vrandečić, D., Krötzsch, M.: Wikidata: a free collaborative knowledgebase. Communications of the ACM 57(10), 78–85 (2014)

About the authors.

Hans Chalupsky

Research Lead

Research Lead at the Information Sciences Institute, University of Southern California.

Pedro Szekely

Research Director

Research Director at the Center on Knowledge Graphs, Information Sciences Institute, University of Southern California.

Filip Ilievski

Research Scientist

Researcher at the Information Sciences Institute, University of Southern California.

Daniel Garijo

Distinguished Researcher

Researcher at the Universidad Politécnica de Madrid.

Kartik Shenoy

Student worker

Master's student at the University of Southern California.