Supplementary materials for paper: A Study of the Quality of Wikidata (Under review)
This page provides pointers to the materials (datasets, software, and notebooks) used in a paper currently under review. More details about the publication will be announced at a later stage.
Summary: The increasing adoption of Wikidata in a wide range of applications (entity linking, question answering, link prediction, etc.) motivates the need for high-quality knowledge to support them. However, we currently lack an understanding of the quality of the semantic information captured in Wikidata. In this paper, we explore two notions of data quality in Wikidata: 1) a community-based notion which captures the ongoing community consensus on the recorded knowledge, assuming that statements that have been removed and not added back are implicitly agreed to be of low quality by the community; and 2) a constraint-based notion which encodes Wikidata constraints efficiently, and detects their violations in the data. Our analysis reveals that low-quality statements can be detected with both strategies, while their cause ranges from modeling errors and factual errors, to constraint incompleteness. These findings can complement ongoing efforts by the Wikidata community to improve data quality based on games and suggestions, aiming to make it easier for users and editors to find and correct mistakes.
We used the following datasets for our paper. We will deposit them in Zenodo upon paper acceptance (in order to preserve anonymity requirements). We note that some of the files may not have the same source date, as we performed the analysis as new data became available. However, since Wikidata is continuously evolving, all files are sufficiently close in time so as to be compatible with one another.
- Claims file - Wikidata dump of Dec 7, 2020. Reference dump containing the `claims.tsv.gz` file with 1.15B statements, used for finding constraint violations.
- Labels file - Wikidata dump of Dec 7, 2020. File with all labels of the nodes used in the analysis.
- Claim properties file - Wikidata dump of Dec 7, 2020. File used to determine constraints.
- Qualifiers file - Wikidata dump of Dec 7, 2020. File used to fetch more details on the constraints for each property.
- Wikidata permanently removed statements from 2014 to Jan 2021. File computed by downloading 311 weekly Wikidata dumps from the Internet Archive and community contributors, calculating the statements removed between each pair of dumps, and then verifying that each previously deleted statement was still absent from the most recent dump.
- Wikidata deprecated statements by Jan 2021. All statements tagged as deprecated in the Knowledge Graph.
- Wikidata instanceOf, subclassOf, isa-relation dumps of Feb 15, 2021. The combination of these files is used to determine the ancestors of each node one or more levels above it.
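As an illustration, retrieving the ancestors of a node from the combined instanceOf/subclassOf edges amounts to a breadth-first transitive closure. The sketch below uses a toy edge map with hand-picked QIDs for illustration only; the actual analysis operates on the full dump files with the Knowledge Graph Toolkit:

```python
# Toy upward edge map: child QID -> set of parent QIDs (illustrative only,
# standing in for the combined instanceOf/subclassOf/isa-relation files).
subclass_of = {
    "Q146": {"Q39201"},   # house cat -> pet
    "Q39201": {"Q729"},   # pet -> animal
}

def ancestors(node, edges):
    """All nodes reachable by following edges upward (BFS transitive closure)."""
    seen, frontier = set(), [node]
    while frontier:
        nxt = []
        for n in frontier:
            for parent in edges.get(n, ()):
                if parent not in seen:
                    seen.add(parent)
                    nxt.append(parent)
        frontier = nxt
    return seen

# ancestors("Q146", subclass_of) yields both the direct parent and its parent.
```

Following edges to a fixed point rather than a fixed depth is what makes "1 or more levels above" well defined even for deep class hierarchies.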
Software and Notebooks.
Pointers to the main software used can be found below:
- The Knowledge Graph Toolkit, a framework for large Knowledge Graph manipulation. This is an external tool used by the authors for the analysis.
- Zip with all the Jupyter notebooks used in the analysis, i.e., removed-statement analysis, constraint validation, and deprecated-statement analysis. Some of the notebooks use relative paths pointing to the datasets specified above.
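The core of the removed-statement computation described above can be sketched as a set difference between two dumps, followed by a check that each removal never reappeared in the most recent dump. This is a minimal sketch over toy statement triples; the actual analysis runs over the full weekly dump files:

```python
def removed_statements(older, newer):
    """Statements present in the older dump but missing from the newer one."""
    return older - newer

# Toy dumps: each statement represented as a (subject, property, value) triple.
dump_week1 = {("Q1", "P31", "Q5"), ("Q2", "P31", "Q5"), ("Q3", "P106", "Q82955")}
dump_week2 = {("Q1", "P31", "Q5"), ("Q3", "P106", "Q82955")}
latest_dump = {("Q1", "P31", "Q5")}

removed = removed_statements(dump_week1, dump_week2)
# Keep only removals that are still absent from the most recent dump,
# i.e., statements the community never added back.
permanently_removed = {s for s in removed if s not in latest_dump}
```

Representing each dump as a set of triples keeps both steps (pairwise difference and the final absence check) simple set operations, which scale to the 311 weekly dumps when applied pair by pair.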