Supplementary materials for paper: A Study of the Quality of Wikidata (Under review)
This page provides pointers to the materials (datasets, software, and notebooks) used in a paper currently under review. More details about the publication will be announced at a later stage.
Summary: The increasing adoption of Wikidata in a wide range of applications (entity linking, question answering, link prediction, etc.) motivates the need for high-quality knowledge to support them. However, we currently lack an understanding of the quality of the semantic information captured in Wikidata. In this paper, we explore two notions of data quality in Wikidata: 1) a community-based notion which captures the ongoing community consensus on the recorded knowledge, assuming that statements that have been removed and not added back are implicitly agreed to be of low quality by the community; and 2) a constraint-based notion which encodes Wikidata constraints efficiently, and detects their violations in the data. Our analysis reveals that low-quality statements can be detected with both strategies, while their cause ranges from modeling errors and factual errors, to constraint incompleteness. These findings can complement ongoing efforts by the Wikidata community to improve data quality based on games and suggestions, aiming to make it easier for users and editors to find and correct mistakes.
We used the following datasets for our paper. We will deposit them in Zenodo upon paper acceptance (in order to preserve anonymity requirements). We note that some of the files may not have the same source date, as we performed the analysis as new data became available. However, since Wikidata is continuously evolving, all files are sufficiently close in time so as to be compatible with one another.
- Claims file - Wikidata dump of Dec 7, 2020. Reference dump containing the `claims.tsv.gz` file with 1.15B statements, used for finding constraint violations.
- Labels file - Wikidata dump of Dec 7, 2020. File with all labels of the nodes used in the analysis.
- Claim properties file - Wikidata dump of Dec 7, 2020. File used to determine constraints.
- Qualifiers file - Wikidata dump of Dec 7, 2020. File used to fetch more details on the constraints for each property.
- Wikidata permanently removed statements from 2014 to Jan 2021. File computed by downloading 311 weekly Wikidata dumps from the Internet Archive and community contributors, calculating the statements removed between each pair of dumps, and then verifying that each previously deleted statement was still absent from the most recent dump.
- Wikidata deprecated statements by Jan 2021. All statements tagged as deprecated in the Knowledge Graph.
- Wikidata instanceOf, subclassOf, isa-relation dumps of Feb 15, 2021. The combination of these files is used to determine the ancestors of each node one or more levels above it.
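As an illustration, retrieving the ancestors of a node from the combined instanceOf/subclassOf edges amounts to a breadth-first transitive closure. The sketch below uses a toy edge map with hand-picked QIDs for illustration only; the actual analysis operates on the full dump files with the Knowledge Graph Toolkit:

```python
# Toy upward edge map: child QID -> set of parent QIDs (illustrative only,
# standing in for the combined instanceOf/subclassOf/isa-relation files).
subclass_of = {
    "Q146": {"Q39201"},   # house cat -> pet
    "Q39201": {"Q729"},   # pet -> animal
}

def ancestors(node, edges):
    """All nodes reachable by following edges upward (BFS transitive closure)."""
    seen, frontier = set(), [node]
    while frontier:
        nxt = []
        for n in frontier:
            for parent in edges.get(n, ()):
                if parent not in seen:
                    seen.add(parent)
                    nxt.append(parent)
        frontier = nxt
    return seen

# ancestors("Q146", subclass_of) yields both the direct parent and its parent.
```

Following edges to a fixed point rather than a fixed depth is what makes "1 or more levels above" well defined even for deep class hierarchies.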
Software and Notebooks.
Pointers to the main software used can be found below:
- The Knowledge Graph Toolkit, a framework for large Knowledge Graph manipulation. This is an external tool used by the authors for the analysis.
- Zip with all the Jupyter notebooks used in the analysis, i.e., removed-statement analysis, constraint validation, and deprecated-statement analysis. Some of the notebooks use relative paths pointing to the datasets specified above.
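The core of the removed-statement computation described above can be sketched as a set difference between two dumps, followed by a check that each removal never reappeared in the most recent dump. This is a minimal sketch over toy statement triples; the actual analysis runs over the full weekly dump files:

```python
def removed_statements(older, newer):
    """Statements present in the older dump but missing from the newer one."""
    return older - newer

# Toy dumps: each statement represented as a (subject, property, value) triple.
dump_week1 = {("Q1", "P31", "Q5"), ("Q2", "P31", "Q5"), ("Q3", "P106", "Q82955")}
dump_week2 = {("Q1", "P31", "Q5"), ("Q3", "P106", "Q82955")}
latest_dump = {("Q1", "P31", "Q5")}

removed = removed_statements(dump_week1, dump_week2)
# Keep only removals that are still absent from the most recent dump,
# i.e., statements the community never added back.
permanently_removed = {s for s in removed if s not in latest_dump}
```

Representing each dump as a set of triples keeps both steps (pairwise difference and the final absence check) simple set operations, which scale to the 311 weekly dumps when applied pair by pair.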