This page provides pointers to the materials (datasets, software, and notebooks) used in a paper currently under review.
Summary: The increasing adoption of Wikidata [1] in a wide range of applications (entity linking, question answering, link prediction, etc.) motivates the need for high-quality knowledge to support them. However, we currently lack an understanding of the quality of the semantic information captured in Wikidata. In this paper, we explore two notions of data quality in Wikidata: 1) a community-based notion, which captures the ongoing community consensus on the recorded knowledge, assuming that statements that have been removed and not added back are implicitly agreed to be of low quality by the community; and 2) a constraint-based notion, which encodes Wikidata constraints efficiently and detects their violations in the data. Our analysis reveals that low-quality statements can be detected with both strategies, while their causes range from modeling errors and factual errors to constraint incompleteness. These findings can complement ongoing efforts by the Wikidata community to improve data quality based on games and suggestions, aiming to make it easier for users and editors to find and correct mistakes.
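As a concrete illustration of the community-based notion, the hedged sketch below uses KGTK to flag statements that appear in an earlier claims file but not in the current one; under this notion, such removed-and-not-re-added statements are treated as implicitly low quality. The file names claims.old.tsv, claims.current.tsv, and claims.removed.tsv are placeholders for illustration, not the actual files used in the paper.

# Hedged sketch (placeholder file names): statements present in an older
# dump but absent from the current dump are candidates for the
# community-based low-quality set.
kgtk ifnotexists \
    -i claims.old.tsv \
    --filter-on claims.current.tsv \
    -o claims.removed.tsv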
This page and the materials described on it (excluding external references) are available at a permanent URL: https://w3id.org/wd_quality.
A preprint of the paper is available on arXiv: https://arxiv.org/abs/2107.00156.
The notebooks generate constraint-violation sets per property, which have been aggregated and summarized into a set of files used to create the figures of the paper; these files are all available under DOI http://doi.org/10.5281/zenodo.5121276.
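For readers who want to work with the per-property outputs directly, the hedged sketch below shows one way the individual violation files could be concatenated into a single file with KGTK; the property files named here (P1321, P27) and the output name claims.all_violations.tsv are only examples, not the exact pipeline used to produce the released summaries.

# Hedged sketch (example property files only): concatenate per-property
# violation files into a single aggregated file.
kgtk cat \
    -i claims.P1321.incorrect.tsv claims.P27.incorrect.tsv \
    -o claims.all_violations.tsv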
The following snippet explains, step by step, how one instance of the item-requires-statement constraint is validated in Wikidata. This is just one of the constraint types we support in our analysis, and one of the most popular ones:
# Item-requires-statement constraint on property P1321:
# place of origin (Switzerland).
# The constraint indicates that entities with property P1321 should be
# (P31) human (Q5) and have country of citizenship (P27) Switzerland (Q39),
# with the exception of Hans von Flachslanden (Q1583384).

# This query retrieves the valid entities. Inputs: statements with
# property P1321, statements with property P31, and statements with
# property P27.
kgtk query \
    -i claims.P1321.tsv claims.P31.tsv claims.P27.tsv \
    --match 'P1321: (node1)-[nodeProp]->(node2),
             P31: (node1)-[]->(node2_P31),
             P27: (node1)-[]->(node2_P27)' \
    --where 'node2_P31 in ["Q5"] and node2_P27 in ["Q39"]' \
    --return 'distinct nodeProp.id, node1 as `node1`,
              nodeProp.label as `label`, node2 as `node2`' \
    -o claims.P1321.correct_wo_exceptions.tsv \
    --graph-cache cache.db

# This query retrieves the violations of P1321.
kgtk --debug ifnotexists \
    -i claims.P1321.tsv \
    --filter-on claims.P1321.correct_wo_exceptions.tsv \
    -o claims.P1321.incorrect_wo_exceptions.tsv

# This query retrieves the exceptions to the constraint,
# i.e., Hans von Flachslanden (Q1583384).
kgtk --debug query \
    -i claims.P1321.incorrect_wo_exceptions.tsv \
    --match '(node1)-[]->()' \
    --where 'node1 in ["Q1583384"]' \
    -o claims.P1321.incorrect_w_exceptions.tsv \
    --graph-cache cache.db

# Filter the exceptions out of the incorrect file.
kgtk --debug ifnotexists \
    -i claims.P1321.incorrect_wo_exceptions.tsv \
    --filter-on claims.P1321.incorrect_w_exceptions.tsv \
    -o claims.P1321.incorrect.tsv

# Aggregate the correct results (valid statements plus exceptions).
kgtk cat \
    -i claims.P1321.correct_wo_exceptions.tsv claims.P1321.incorrect_w_exceptions.tsv \
    -o claims.P1321.correct.tsv
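Once the pipeline above has run, the outputs can be sanity-checked with ordinary shell tools, as in the hedged sketch below; the commands only illustrate how one might inspect the files and do not reproduce any numbers from the paper.

# Hedged sketch: count correct and incorrect statements for P1321
# (subtracting the header line of each KGTK TSV file).
echo "correct:   $(($(wc -l < claims.P1321.correct.tsv) - 1))"
echo "incorrect: $(($(wc -l < claims.P1321.incorrect.tsv) - 1))"

# Peek at the first few flagged violations.
head -n 5 claims.P1321.incorrect.tsv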
Student worker
Master's student at the University of Southern California.
Research Scientist
Researcher at the Information Sciences Institute, University of Southern California.
Invited researcher
Professor at Pontifícia Universidade Católica do Rio de Janeiro and invited researcher at the Information Sciences Institute, University of Southern California.
Research Director
Research Director at the Center on Knowledge Graphs, Information Sciences Institute, University of Southern California.