Supplementary materials for the paper: A Study of the Quality of Wikidata (under review)


This page provides pointers to the materials (datasets, software and notebooks) used in a paper currently under review.

Summary: The increasing adoption of Wikidata [1] in a wide range of applications (entity linking, question answering, link prediction, etc.) motivates the need for high-quality knowledge to support them. However, we currently lack an understanding of the quality of the semantic information captured in Wikidata. In this paper, we explore two notions of data quality in Wikidata: 1) a community-based notion which captures the ongoing community consensus on the recorded knowledge, assuming that statements that have been removed and not added back are implicitly agreed to be of low quality by the community; and 2) a constraint-based notion which encodes Wikidata constraints efficiently, and detects their violations in the data. Our analysis reveals that low-quality statements can be detected with both strategies, while their cause ranges from modeling errors and factual errors, to constraint incompleteness. These findings can complement ongoing efforts by the Wikidata community to improve data quality based on games and suggestions, aiming to make it easier for users and editors to find and correct mistakes.
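
To make the community-based notion concrete: in KGTK (the toolkit used in the snippets below), statements that appear in an older snapshot but not in a newer one can be isolated with a single filter. The following sketch is illustrative rather than taken from the paper's notebooks, and the snapshot file names are hypothetical:

# Hypothetical sketch of the community-based notion: keep the rows of
# an older claims file that no longer appear in a newer claims file,
# i.e., statements that were removed and not added back.
kgtk ifnotexists
  -i claims.2020.tsv          # older snapshot of Wikidata claims (hypothetical)
  --filter-on claims.2021.tsv # newer snapshot (hypothetical)
  -o claims.removed.tsv       # candidate low-quality statements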

This page and the materials described on it (excluding external references) are available at a permanent URL: https://w3id.org/wd_quality.

A preprint of the paper is available on arXiv: https://arxiv.org/abs/2107.00156.

Datasets.


Input Datasets

We used the following datasets for our paper. We note that some of the files may not have the same source date, as we performed the analysis as new data became available. However, although Wikidata is continuously evolving, all files are sufficiently close in time to be compatible with one another.

Output Datasets

The notebooks generate constraint violation sets per property, which have been aggregated and summarized in the following files. These files are used to create the figures of the paper, and are all available under DOI http://doi.org/10.5281/zenodo.5121276:

Software and Notebooks.


Pointers to the main software we used can be found below:

An instantiated query template in detail

The following snippet explains, step by step, how one instance of the item-requires-statement constraint is validated in Wikidata. This is just one of the constraint types we support in our analysis, and one of the most popular ones:

# Item-requires-statement constraint on property P1321:
# place of origin (Switzerland).
# The constraint indicates that entities with property
# P1321 should be instances of (P31) human (Q5) and have
# country of citizenship (P27) Switzerland (Q39),
# with the exception of Hans von Flachslanden (Q1583384)
kgtk query            # This query retrieves the statements that satisfy the constraint
  -i claims.P1321.tsv # Statements with property P1321
     claims.P31.tsv   # Statements with property P31
     claims.P27.tsv   # Statements with property P27
  --match
    'P1321: (node1)-[nodeProp]->(node2),
     P31: (node1)-[]->(node2_P31),
     P27: (node1)-[]->(node2_P27)'
  --where 'node2_P31 in ["Q5"] # Constraint indicates the property subject has
                               # to be human (Q5)
    and node2_P27 in ["Q39"]'  # Constraint indicates the property subject
                               # should have country
                               # of citizenship (P27) Switzerland (Q39)
  --return 'distinct nodeProp.id, node1 as `node1`,
    nodeProp.label as `label`,
    node2 as `node2`'
  -o claims.P1321.correct_wo_exceptions.tsv
  --graph-cache cache.db;

kgtk --debug ifnotexists      # This query retrieves the violations of P1321.
  -i claims.P1321.tsv
  --filter-on claims.P1321.correct_wo_exceptions.tsv
  -o claims.P1321.incorrect_wo_exceptions.tsv

kgtk --debug query   # This query calculates the exceptions to the
                     # constraint, i.e., Hans von Flachslanden (Q1583384)
  -i claims.P1321.incorrect_wo_exceptions.tsv
  --match
    '(node1)-[]->()' --where 'node1 in ["Q1583384"]'
  -o claims.P1321.incorrect_w_exceptions.tsv
  --graph-cache cache.db;

kgtk --debug ifnotexists  # Filter exceptions from incorrect file
  -i claims.P1321.incorrect_wo_exceptions.tsv
  --filter-on claims.P1321.incorrect_w_exceptions.tsv
  -o claims.P1321.incorrect.tsv;

kgtk cat   # Aggregate correct results.
  -i claims.P1321.correct_wo_exceptions.tsv
     claims.P1321.incorrect_w_exceptions.tsv
  -o claims.P1321.correct.tsv
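
Once the correct and incorrect files have been produced, a per-property violation ratio can be derived by simply counting rows. The snippet below is a hypothetical follow-up step rather than part of the released pipeline; it assumes each KGTK TSV file has a single header line:

# Hypothetical sketch: violation ratio for P1321 from the files above.
# Each wc -l count is reduced by 1 to discard the TSV header row.
correct=$(($(wc -l < claims.P1321.correct.tsv) - 1))
incorrect=$(($(wc -l < claims.P1321.incorrect.tsv) - 1))
echo "P1321: $incorrect violations out of $((correct + incorrect)) statements"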

Bibliography.


  1. Vrandečić, D., Krötzsch, M.: Wikidata: a free collaborative knowledgebase. Communications of the ACM 57(10), 78–85 (2014)

About the authors.


Kartik Shenoy

Student worker

Master's student at the University of Southern California.

Filip Ilievski

Research Scientist

Researcher at the Information Sciences Institute, University of Southern California.

Daniel Garijo

Distinguished Researcher

Researcher at the Universidad Politécnica de Madrid.

Daniel Schwabe

Invited researcher

Professor at the Pontifícia Universidade Católica do Rio de Janeiro and invited researcher at the Information Sciences Institute, University of Southern California.

Pedro Szekely

Research Director

Research Director at the Center on Knowledge Graphs, Information Sciences Institute, University of Southern California.
