CONTROLLED VOCABULARIES - String matching algorithms in OpenRefine clustering and reconciliation functions - a case study of person name matching

ZBW - Leibniz-Informationszentrum Wirtschaft

Hochschulbibliothekszentrum des Landes Nordrhein-Westfalen (hbz)

Klaes, Christiane

Formal Metadata

Title

CONTROLLED VOCABULARIES - String matching algorithms in OpenRefine clustering and reconciliation functions - a case study of person name matching

Series title

SWIB21 - Semantic Web in Libraries

Number of parts

Author

Klaes, Christiane

Contributors

Khan, Huda (Moderator)

License

CC Attribution 3.0 Germany:
You are free to use, adapt, copy, distribute and make publicly available the work or its content, in unchanged or adapted form, for any legal purpose, provided that you credit the author/rights holder in the manner they have specified.

Identifiers

10.5446/60264 (DOI)

Publisher

ZBW - Leibniz-Informationszentrum Wirtschaft

Hochschulbibliothekszentrum des Landes Nordrhein-Westfalen (hbz)

Release year

2021

Language

English

Content Metadata

Subject area

Other

Genre

Conference/Talk

Abstract

Person entities are important linking nodes both within and between Linked Open Data resources across different domains and use cases. Efficient identity management is therefore a crucial part of resource development and maintenance. This case study is concerned with the semi-automatic population of a newly developed domain knowledge graph, LexBib Wikibase, with high-quality person data. We aim to transform person name literals taken from publication metadata into Semantic Web entities, enabling improved retrieval and entity enrichment for the domain-specific discovery portal ElexiFinder. In a prototype workflow, the open source tool OpenRefine is used as a one-tool solution to perform deduplication, disambiguation and reconciliation of person names with reference datasets, using a sample of 3,104 name literals taken from the LexBib bibliography. We closely examine OpenRefine’s clustering functions and their underlying string matching algorithms, focusing on their ability to account for error types that frequently occur in person name matching, such as spelling errors, phonetic variations, initials, or double names. Following the same approach, we examine the string matching processes implemented in two widely used reconciliation services, for Wikidata and VIAF. We also analyse how useful OpenRefine's other features are for further processing of the algorithmic output. The results of this case study may contribute to a better understanding and further development of interlinking features in OpenRefine and adjoining reconciliation services. By offering empirical data on OpenRefine’s underlying string matching algorithms, the study’s results supplement existing guides and tutorials on clustering and reconciliation, especially for person name matching projects.
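
To illustrate the kind of string matching the abstract refers to, the following is a minimal Python sketch of a key-collision clustering step in the style of a "fingerprint" key: names are normalised to a key, and names sharing a key fall into one cluster. It is a simplified approximation written for illustration, not OpenRefine's actual implementation and not the workflow used in the talk. The function names `fingerprint` and `cluster` and the sample names are hypothetical.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(name: str) -> str:
    """Build a simplified fingerprint-style key for a name (illustrative only)."""
    # Fold accented characters to their ASCII base letters (e.g. "ü" -> "u").
    folded = unicodedata.normalize("NFKD", name)
    ascii_name = folded.encode("ascii", "ignore").decode("ascii")
    # Lowercase and strip ASCII punctuation such as commas and periods.
    cleaned = ascii_name.lower().translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, drop duplicate tokens, sort, and rejoin.
    return " ".join(sorted(set(cleaned.split())))


def cluster(names):
    """Group names whose keys collide; keep only groups with more than one member."""
    groups = defaultdict(list)
    for n in names:
        groups[fingerprint(n)].append(n)
    return [g for g in groups.values() if len(g) > 1]


# Hypothetical sample literals, not taken from the LexBib data.
sample = ["Müller, Anna", "Anna Muller", "anna  müller", "Muller, A."]
clusters = cluster(sample)
print(clusters)
```

In this sketch, token order, case, punctuation and diacritics do not affect the key, so the first three literals collide. The initialised form "Muller, A." receives a different key and stays outside the cluster, which is one of the error types the abstract names as a challenge for person name matching.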