
CONTROLLED VOCABULARIES - String matching algorithms in OpenRefine clustering and reconciliation functions - a case study of person name matching


Formal Metadata

Title
CONTROLLED VOCABULARIES - String matching algorithms in OpenRefine clustering and reconciliation functions - a case study of person name matching
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Person entities are important linking nodes both within and between Linked Open Data resources across different domains and use cases. Therefore, efficient identity management is a crucial part of resource development and maintenance. This case study is concerned with the task of semi-automatic population of a newly developed domain knowledge graph, the LexBib Wikibase, with high-quality person data. We aim to transform person name literals taken from publication metadata into Semantic Web entities, to enable improved retrieval and entity enrichment for the domain-specific discovery portal ElexiFinder. In a prototype workflow, the open source tool OpenRefine is used as a one-tool solution to perform deduplication, disambiguation and reconciliation of person names with reference datasets, using a sample of 3,104 name literals taken from the LexBib bibliography. We closely examine OpenRefine's clustering functions and their underlying string matching algorithms, focusing on their ability to account for different error types that frequently occur in person name matching, such as spelling errors, phonetic variations, initials, or double names. Following the same approach, string matching processes implemented in two widely used reconciliation services, for Wikidata and VIAF, are examined. We also analyse the usefulness of OpenRefine features that support further processing of algorithmic output. The results of this case study may contribute to a better understanding and subsequent further development of interlinking features in OpenRefine and adjoining reconciliation services. By offering empirical data on OpenRefine's underlying string matching algorithms, the study's results supplement existing guides and tutorials on clustering and reconciliation, especially for person name matching projects.
Transcript: English (auto-generated)
Hi, everyone. I'm happy to present something from my master thesis today, which was written at the University of Hildesheim in the summer. Currently, I'm working at the University Library in Braunschweig, just to clarify the two affiliations that you see here. My master thesis was written for a real project called LexBib, where a new domain knowledge base was created, and for this, string matching was necessary at several stages in the process. First, I'll give you a brief assessment of string matching measures for person names, as they occur in our use case, and then an assessment of the clustering algorithms in OpenRefine and of the matching algorithms in reconciliation, which use similar techniques to do string matching for person names. We started off with a bibliography in Zotero, where publication
metadata on scholarly records was kept. This was our main source for building up the domain knowledge base, which is hosted as a Wikibase instance. Of course, publication metadata are literals at the stage when they are ingested into Zotero. We needed to clean these data to deduplicate synonymous entries for person names; we also had to get rid of errors, assign preferred labels, and so on, to get unique person names which could then be turned into person entities, which you need, of course, if you have to maintain a domain knowledge base. To enrich these person entities, we wanted to reconcile them with relevant external data sets like Wikidata and VIAF, where we expected a high coverage of the scholars of the domain of lexicography. OpenRefine just seemed like a good solution to do both: the data cleaning and pre-processing, and connecting this with the necessary reconciliation. The reconciliation services you see here in one box are actually separate services for each reference data set; I've just simplified this a little bit.
Now, person names in publication metadata can be really noisy. Being literals, they cannot be harmonized in a simple way, and automating this process is a big win for workflows and routines that you have to keep up. Now, this is not an everyday example, because this person, Sue Atkins, actually goes by a huge number of name variants; normally, you only have two or three. As you can see, not all of them are consistent: at least one of those variations has to be an error, while the others are just variations of the same name. You can see here that the order of her given names varies; it's sometimes Sue B. T. Atkins and sometimes B. T. Sue Atkins.
So part of the task is not only to decide which one is the preferred name, but also which of the remaining name variants should be kept as valid alternative names, and which of them should be rejected as actual errors. Typical error types, as you can see in this example, can be of different kinds, like the use of initials and the use of double names, of which not every component is always present. People use nicknames, and there are simple spelling errors as well.
And you have to determine which kind of error it is. For these separate error types, different string matching measures have been developed in the past, also accounting for special features that you find in specific languages. So as you can easily imagine, it's not easy to find the one best string matching measure to automate the deduplication process, because with a data set of international origin, as in our case, you would probably be hard pressed to find the one perfect measure. Levenshtein distance is one that is widely used; it is an edit distance, accounting for deviations in characters and spelling, while n-grams and skip-grams have been shown to work well for short strings in general, and person names are mostly quite short strings. Phonetic measures are strongly language-dependent and have to be developed for every language, because they work with sound tables. And Jaro and Jaro-Winkler have been developed especially for person names, with the heuristic that errors typically occur more towards the middle or the end of a name; they give more weight to a matching beginning of the name. So most studies come to the conclusion that it's best to combine a chosen set
of relevant string matching measures. And as we'll see in a minute, OpenRefine does just that. OpenRefine implements a wide choice of different approaches to string matching to account for these different error types that you can find in any kind of given data set.
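
To make these measures concrete, here is a minimal sketch comparing a few of them on typical name variants. It assumes the third-party jellyfish library (function names follow recent versions and have changed over time); the example pairs are invented for illustration.

```python
# A minimal sketch comparing several of the string matching measures
# discussed above on typical name variants. Assumes the third-party
# `jellyfish` library (pip install jellyfish); function names follow
# recent versions and have changed across releases.
import jellyfish

variants = [
    ("Atkins, Sue B. T.", "Atkins, B. T. Sue"),  # reordered given names
    ("Atkins, Sue",       "Atkins, Su"),         # spelling error
    ("Meyer, Hans",       "Meier, Hans"),        # phonetic variation
]

for a, b in variants:
    print(f"{a!r} vs {b!r}")
    # Edit distance: number of insertions/deletions/substitutions.
    print("  Levenshtein distance:", jellyfish.levenshtein_distance(a, b))
    # Jaro-Winkler: similarity in 0..1, weighting a matching prefix more.
    print("  Jaro-Winkler:", round(jellyfish.jaro_winkler_similarity(a, b), 3))
    # Metaphone: phonetic key, optimized for English sounds.
    print("  Metaphone keys:", jellyfish.metaphone(a), "/", jellyfish.metaphone(b))
```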
OpenRefine, of course, is not specialized in person name matching, but supports any kind of string matching that can occur; this could be locations or subject headings or organizations. So I just wanted to see how well these generic string matching approaches work for person names, and how efficiently the tasks that I outlined just now, setting a preferred name and handling the remaining name variants, can be handled. In OpenRefine, you are supposed to apply these different string matching algorithms in a strict order,
going from conservative measures that find small deviations and are quite precise, on to more liberal measures; here we find the Levenshtein distance again, and also the PPM measure. The latter two allow you to set important parameters, like the similarity threshold that two strings must reach to be put in the same cluster, which is quite good: you can fine-tune your own clustering algorithm. But after these automatic processes, manual work is still necessary, not only to determine whether we really have true synonymy in the cluster, that is, whether the cluster is free of outliers or not. In our example, this is probably the same person, which leaves us with the decision which of these name forms should be the preferred name form and what to do with the rest, the remaining name variants. Are they valid alternative name forms that should be kept, and maybe explicitly stated as such, or are they just treated as errors and eliminated from our data set?
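
To illustrate the conservative end of that ordering, here is a rough re-implementation of OpenRefine's key collision "fingerprint" method, which comes up again below. The sketch follows the documented keying steps; OpenRefine's actual code does some additional normalization.

```python
# A rough re-implementation of OpenRefine's "fingerprint" keying
# function (key collision clustering): names that produce the same
# key end up in the same cluster. Sketch only.
import string
import unicodedata

def fingerprint(name: str) -> str:
    name = name.strip().lower()
    # Rough ASCII folding of accented characters.
    name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    # Remove punctuation, then split, deduplicate and sort the tokens.
    name = name.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(name.split()))
    return " ".join(tokens)

# Word-order variants collide on the same key and cluster together:
print(fingerprint("Atkins, Sue B. T."))  # -> "atkins b sue t"
print(fingerprint("Sue B. T. Atkins"))   # -> "atkins b sue t"
```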
Now, here are a few of the string matching approaches that worked best for our data set. These are not all of them, not all the ones that are implemented in OpenRefine, but the ones that worked best for my data set. As you can see, different error types are covered, like a deviation of only one character per name.
The phonetic measures work quite differently, and I could see that even though Metaphone is optimized for English and Cologne phonetic is optimized for German, they still worked for names from other languages too. You can see that 66 clusters were found by Metaphone alone, followed by 42 additional ones found by Cologne. This is quite a huge share of all the clusters found in total, so I would say the phonetic measures are an important addition to the very precise fingerprint ones. And the lower precision, as seen with Metaphone and Cologne, is not really a problem during the cluster validation process, because you can see straight away which clusters are correct and which are not. I can even recommend the PPM clustering, because it finds the clusters with the bigger deviations, which is a huge benefit.
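
As an illustration of how such a tunable, more liberal method works, here is a naive nearest-neighbour clustering sketch with a Levenshtein distance radius, again using jellyfish. This is a simplification: OpenRefine's real implementation adds blocking so it does not compare every pair.

```python
# Naive nearest-neighbour clustering with a tunable distance radius,
# the kind of "more liberal" method discussed above. Sketch only.
import jellyfish

def cluster(names, radius=2):
    clusters = []
    for name in names:
        for c in clusters:
            # Join the first cluster whose representative is within range.
            if jellyfish.levenshtein_distance(name, c[0]) <= radius:
                c.append(name)
                break
        else:
            clusters.append([name])
    return clusters

names = ["Atkins, Sue", "Atkins, Su", "Adkins, Sue", "Zgusta, Ladislav"]
print(cluster(names))
# -> [['Atkins, Sue', 'Atkins, Su', 'Adkins, Sue'], ['Zgusta, Ladislav']]
```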
I was really surprised how many clusters could be found automatically, as opposed to the ones that had to be found manually afterwards. So this is really encouraging for doing it the automatic way, at least for our use case. It's also important to note which name form you put into the clustering: following the name pattern "surname, given name" is more precise than following the pattern "given name surname".
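
A minimal sketch of that normalization step, assuming the naive case where the surname is the final token (multi-token surnames like "van der Berg" would need extra handling):

```python
# Reorder "Given Surname" to "Surname, Given Names" before clustering,
# since the surname-first pattern clustered more precisely. Assumes the
# surname is the final token, which is a simplification.
def surname_first(name: str) -> str:
    if "," in name:                      # already "Surname, Given"
        return name.strip()
    *given, surname = name.split()
    return f"{surname}, {' '.join(given)}" if given else surname

print(surname_first("Sue B. T. Atkins"))  # -> "Atkins, Sue B. T."
```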
Now that we have done all this, let's go on to reconciliation. We now have unique name forms, but they still need to be disambiguated, because we could have homonyms, and we still need to enrich our person entities. And now, after working with so many different string matching algorithms, combining them and modifying them, in the reconciliation services for Wikidata and VIAF you only have the Levenshtein distance, or at least a variation of it. So, just one string measure.
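
For context, these services speak the Reconciliation Service API that OpenRefine uses under the hood. Here is a sketch of querying one directly; the endpoint URL is an assumption (a community-run Wikidata reconciliation service), and the request and response shapes follow the API specification.

```python
# Query a reconciliation service directly over HTTP, following the
# Reconciliation Service API that OpenRefine speaks. The endpoint URL
# is an assumption; any conforming endpoint works the same way.
import json
import requests

ENDPOINT = "https://wikidata.reconci.link/en/api"  # assumed endpoint

queries = {
    "q0": {
        "query": "Atkins, Sue B. T.",
        "type": "Q5",   # restrict candidates to humans
        "limit": 5,
    }
}

response = requests.post(ENDPOINT, data={"queries": json.dumps(queries)})
for candidate in response.json()["q0"]["result"]:
    # Each candidate carries a similarity score and a "match" flag that
    # marks results above the automatic-matching threshold.
    print(candidate["score"], candidate["match"], candidate["name"], candidate["id"])
```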
Let's see what the validation of these reconciliation results brings. Taking a first look at this chart, the blue bars are VIAF and the yellow ones are Wikidata. VIAF reconciliation got many more matches automatically; these are the ones with a perfect similarity score, while for Wikidata a huge share of person names could not be matched at all. For validation, I used 100 automatic matches and another 100 candidates,
which were below the threshold for automatic matching, to see how precise these reconciliation results are. I was really surprised that, with only the name literal and nothing else to disambiguate these entities, we still got a precision of 0.9 for Wikidata and a similar result for VIAF, which is very high, higher than I expected. But still, for building up a domain knowledge graph, you want near-perfect data quality, so in the future we will still have to validate these automatic matches. For Wikidata, we won't validate the below-threshold linking candidates to increase recall, but for VIAF this could be feasible, because among these linking candidates many important, relevant and correct matches could be found. And this might be due to
special features that come with library data, like the birth year that is added to a name, or, as in the lower example with Ladislav Zgusta, the full stop that is put after the name. That is one additional character in this case, and with the birth year it's even five additional characters, which increase the edit distance in such a way that the names, even though they are otherwise exactly the same, don't get the perfect similarity score and fall out of the automatic matching. This is really something that might be worked on.
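
A small sketch of that effect, using a simple normalized Levenshtein score; the scoring formula here is my assumption, not the service's actual one.

```python
# Extra characters from library data conventions (a trailing full stop,
# an appended birth year) inflate the edit distance, so a normalized
# similarity score for otherwise identical names drops below 1.0 and
# the pair misses the automatic-match threshold.
import jellyfish

def similarity(a: str, b: str) -> float:
    # Normalized Levenshtein similarity in [0, 1]; 1.0 means identical.
    distance = jellyfish.levenshtein_distance(a, b)
    return 1 - distance / max(len(a), len(b))

print(similarity("Zgusta, Ladislav", "Zgusta, Ladislav"))       # 1.0
print(similarity("Zgusta, Ladislav", "Zgusta, Ladislav."))      # ~0.94
print(similarity("Zgusta, Ladislav", "Zgusta, Ladislav 1924"))  # ~0.76
```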
These are only a few of the things that could be said. There is still a lot to discuss about finding a better way, maybe by looking at the VIAF data or at the reconciliation services.
In that sense, I'm looking forward to your questions and comments, and thank you for listening.