CONTROLLED VOCABULARIES - String matching algorithms in OpenRefine clustering and reconciliation functions - a case study of person name matching
Formal Metadata
Number of Parts: 14
License: CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
DOI: 10.5446/60264
Transcript: English (auto-generated)
00:00
Hi, everyone. I'm happy to present something from my master's thesis today, which I completed at the University of Hildesheim this summer. Currently, I'm working at the University Library in Braunschweig, just to clarify the two affiliations that you see here. My master's thesis was written for a real project called LexBib, where a new domain
00:25
knowledge base was created. And for this, string matching was necessary at several stages of the process. First, I'll give you a brief assessment of string matching measures for person names, as required in our use case. And then I'll do an assessment of the clustering algorithms
00:45
in OpenRefine, and also of the matching algorithms, which use similar techniques to do string matching for person names. We started off with a bibliography in Zotero, where publication
01:03
metadata on scholarly records was kept. And this was our main source for building up this domain knowledge base, which is hosted as a Wikibase instance. And of course, publication metadata are literals at this stage, when they are ingested into Zotero. We needed to clean these
01:24
data to deduplicate synonymous entries for person names. And of course, we had to get rid of errors and assign preferred labels, all of that to get unique person names, which could then be turned into person entities, which are needed, of course, if you want to maintain a domain
01:45
knowledge base. To enrich these person entities, we wanted to reconcile them to relevant external data sets like Wikidata and VIAF, where we expected a high coverage of the scholars of the domain of lexicography. And OpenRefine just seemed like a good solution to do both, to do
02:06
the data cleaning, the pre-processing, and to connect this with the necessary reconciliation. The reconciliation services you see here in one box are actually separate services for each reference data set. I'll just try to simplify this a little bit.
02:26
Now, person names in publication metadata can be really noisy. Apart from being literals, they cannot be harmonized in a simple way. And automating this process is a big win
02:42
for workflows and for routines that you have to maintain. Now, this is not an everyday example, because this person, Sue Atkins, actually goes by a huge number of name variants. Normally, you only have two or three.
03:06
As you can see, not all of them are consistent. At least one of those variations has to be an error, while the others are just variations of the same name. You can see here that the order of her given names varies: sometimes it's Sue B.T. Atkins and sometimes B.T. Sue Atkins.
03:25
So part of the task is not only to decide which one is the preferred name, but also which of the remaining name variants should be kept as valid alternative names and which of them should be rejected as actual errors. Well, typical error types, as you can see in this example, can
03:45
be of different kinds, like the use of initials and the use of double names, of which not every component is always present. People use nicknames, and there are simple spelling errors as well.
04:01
And you have to determine which kind of error it is. Different string matching measures have been developed in the past to account for these separate error types and also to account for special features that you find in particular languages. So as you can easily imagine, it's not easy to find the one best string matching
04:27
measure to automate the deduplication process, because in a data set like ours, which contains names of international origin, you would probably be hard-pressed
04:42
to find the one perfect measure. Levenshtein distance is one that is widely used, which accounts for deviations in characters and spellings. It's an edit distance, while n-grams and skip-grams have been shown to work well for short names or for short strings
05:00
in general. But person names are mostly quite short strings by comparison. Phonetic measures are strongly language dependent and have to be developed for each language, because they work with sound tables. And Jaro and Jaro-Winkler have been developed especially
05:20
for person names, with the heuristic that errors typically occur more towards the middle or the end of a name, and they give more weight to a matching beginning of the name. So most studies come to the conclusion that it's best to combine a chosen set
05:42
of relevant string matching measures. And as we'll see in a minute, OpenRefine does just that. OpenRefine implements a wide choice of different approaches to string matching to account for these different error types that you can find in any kind of given data set.
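To make two of the measures mentioned above more concrete, here is a minimal sketch in plain Python (not code from OpenRefine or from the thesis): the Levenshtein edit distance and a character-bigram overlap, applied to the Atkins name variants from the earlier example. The scores are for illustration only.

# Minimal sketch, not OpenRefine code: two of the measures discussed above,
# applied to name variants. Pure Python, no external dependencies.

def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def bigram_dice(a: str, b: str) -> float:
    """Character-bigram overlap (Dice coefficient), an n-gram measure that suits short strings."""
    grams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    ga, gb = grams(a.lower()), grams(b.lower())
    return 2 * len(ga & gb) / (len(ga) + len(gb)) if ga and gb else 0.0

for variant in ["Sue B.T. Atkins", "B.T. Sue Atkins", "Sue Atkins"]:
    print(variant, levenshtein("Sue Atkins", variant), round(bigram_dice("Sue Atkins", variant), 2))

Jaro-Winkler and the phonetic keys would be used the same way; libraries such as jellyfish provide ready-made implementations of them.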
06:03
OpenRefine, of course, is not specialized for person name matching, but handles any kind of string matching that can occur. This could be locations or subject headings or organizations. So I just wanted to see how well these generic string matching approaches work
06:23
for person names, and how efficiently the tasks I outlined just now can be handled: setting a preferred name and handling the remaining name variants. In OpenRefine, you are supposed to apply these different string matching algorithms in a strict order,
06:43
going from conservative measures that find small deviations and are quite precise, then going on to more liberal measures, like here we find the Levenshtein distance again, and also the PPM measure. The latter two allow you to set important parameters,
07:03
like the threshold of similarity between two strings that still allows two or three names to be put into the same cluster, which is quite useful. So you can fine-tune your own clustering algorithm. But after these automatic processes, manual work is still
07:25
necessary, first to determine whether we really have true synonymy in the cluster, that is, whether the cluster is free of outliers or not. In our example, of course, this is probably the same person, which leaves us with the decision which of these name forms should be the preferred
07:45
name form and what to do with the rest, so to say, what to do with the remaining name variants. Are they valid alternative name forms that should be kept, and maybe explicitly stated as such, or are they just treated as errors and eliminated from our data set?
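As a rough illustration of the clustering order just described (and explicitly not OpenRefine's actual implementation), the following sketch runs a conservative fingerprint key collision first and then a more liberal nearest-neighbour pass with a tunable threshold. OpenRefine's nearest-neighbour methods use Levenshtein or PPM; the standard-library SequenceMatcher ratio and the threshold value below are only stand-ins to keep the sketch self-contained.

# Rough sketch of the two clustering stages described above, not OpenRefine code:
# a conservative fingerprint key collision, then a more liberal nearest-neighbour
# pass. SequenceMatcher is a stand-in for Levenshtein/PPM similarity.
from collections import defaultdict
from difflib import SequenceMatcher
import unicodedata

def fingerprint(name: str) -> str:
    """Fingerprint-style key: normalise, tokenise, sort and deduplicate the tokens."""
    ascii_form = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in ascii_form.lower())
    return " ".join(sorted(set(cleaned.split())))

def key_collision_clusters(names):
    """Conservative stage: names with an identical fingerprint form a cluster."""
    buckets = defaultdict(list)
    for n in names:
        buckets[fingerprint(n)].append(n)
    return [group for group in buckets.values() if len(group) > 1]

def nearest_neighbour_clusters(names, threshold=0.8):
    """Liberal stage: greedily attach a name to a cluster whose representative it
    resembles closely enough; the threshold is the tunable parameter."""
    clusters = []
    for n in names:
        for cluster in clusters:
            if SequenceMatcher(None, n.lower(), cluster[0].lower()).ratio() >= threshold:
                cluster.append(n)
                break
        else:
            clusters.append([n])
    return [c for c in clusters if len(c) > 1]

variants = ["Atkins, Sue B.T.", "Atkins, B.T. Sue", "Atkins, Sue"]
print(key_collision_clusters(variants))
print(nearest_neighbour_clusters(variants))

In practice you would run the conservative stage first, merge the obvious clusters, and only then loosen the threshold, which mirrors the strict order of methods described above.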
08:07
Now, here are a few of those string matching approaches that worked best for our data set. These are not all of the approaches that are implemented in OpenRefine, but they are the ones
08:20
that worked best for my data set. As you can see, different error types are covered, for example a deviation of only a single character per name.
08:43
The phonetic measures work quite differently, and I could see that even though Metaphone is optimized for English and Cologne phonetics is optimized for German, they still worked for names from other languages too. And you can see that 66 clusters were found by Metaphone alone,
09:04
supplemented by 42 additional ones from Cologne phonetics. This is quite a huge share of all the clusters found in total, so I would say the phonetic measures are an important addition to the very precise fingerprint ones. And the lower precision, as seen with Metaphone and Cologne phonetics, is not really
09:23
a problem during the cluster validation process, because you can see right away which clusters are correct and which are not. And I can even recommend the PPM clustering, which is a huge benefit, because it finds the clusters with the larger deviations.
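For the phonetic measures, the principle is again a key collision, only with a phonetic key instead of a fingerprint. The sketch below assumes the third-party jellyfish library for its Metaphone implementation; a Cologne phonetics key would work the same way but would need a separate implementation, and the example names are invented.

# Simplified sketch of phonetic key collision, not OpenRefine's implementation,
# assuming the third-party jellyfish library is installed.
from collections import defaultdict
import jellyfish

def metaphone_key(name: str) -> str:
    """Metaphone code per token, tokens sorted so that word order does not matter."""
    return " ".join(sorted(jellyfish.metaphone(tok) for tok in name.split()))

def phonetic_clusters(names):
    buckets = defaultdict(list)
    for n in names:
        buckets[metaphone_key(n)].append(n)
    return [group for group in buckets.values() if len(group) > 1]

# Invented spelling variants; print the keys to see whether they collide.
print(metaphone_key("Katja Schmidt"), metaphone_key("Katia Schmitt"))
print(phonetic_clusters(["Katja Schmidt", "Katia Schmitt"]))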
09:47
I was really surprised how many clusters could be found automatically, as opposed to the ones that were found manually afterwards. So this really encourages doing it
10:01
the automatic way, for our use case at least. It's also important to note that the name form you put into the clustering matters: following the name pattern 'surname, given name' is more precise than following the pattern 'given name surname'.
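As a small, purely hypothetical illustration of that observation, a helper like the following could rewrite names into the surname-first pattern before clustering. It naively treats the last token as the surname, so multi-part surnames would need extra care.

# Hypothetical helper, not part of OpenRefine: rewrite "Given Name Surname"
# into "Surname, Given Name" before clustering.
def surname_first(name: str) -> str:
    if "," in name:                      # already in "Surname, Given Name" form
        return name.strip()
    tokens = name.split()
    if len(tokens) < 2:
        return name.strip()
    return f"{tokens[-1]}, {' '.join(tokens[:-1])}"

print(surname_first("Sue B.T. Atkins"))  # -> "Atkins, Sue B.T."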
10:28
Now that we have done all this, let's go on to reconciliation. We now have unique name forms, but they still need to be disambiguated, because we could have homonyms, and we still need to enrich our
10:41
person entities. And now, after working with so many different string matching algorithms, combining and modifying them, in the reconciliation services for Wikidata and VIAF you only have the Levenshtein distance, or at least a variation of it. So just one string measure. And let's see what the validation of these reconciliation results will bring. Taking a first look at
11:08
this chart, we can see that the blue ones are VIAF and the yellow ones are Wikidata. VIAF reconciliation got many more matches automatically. These are the ones with a
11:21
perfect similarity score, while for Wikidata a huge share of person names could not be matched at all. For validation, I used 100 automatic matches and another 100 candidates,
11:40
which were below the threshold for automatic matching, to see how precise these reconciliation results are. I was really surprised, because we only had the name literal and nothing else to disambiguate these entities, that we still got a precision of 0.9 for Wikidata and a
12:04
similar result for VIAF, which is very high, higher than I expected. But still, for building up a domain knowledge graph, you want near-perfect data quality. So in the future we will still have to validate these automatic matches. For Wikidata, we won't validate the linking candidates
12:25
to increase recall, but for VIAF this could be feasible, because among these linking candidates many relevant and correct matches could be found. And this might be due to
12:40
special features that come with library data, like the birth year that is added to a name, or, as in the lower example with Ladislav Zgusta, the full stop that is put after the name. This is one additional character in this case, and here it's even five
13:02
additional characters which increase the edit distance in such a way that the names, even though they are exactly the same, don't get the perfect similarity score and fall out of the automatic matching, which is really something that might be worked on. And
13:26
these are only a few of the things that could be said. There is still a lot to discuss about where to find a better way, maybe by looking at the VIAF data or at the reconciliation services.
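As a worked illustration of the edit-distance effect described above, the sketch below shows how a few extra characters that are typical for library data, such as appended life dates or a trailing full stop, push an otherwise identical name below a perfect similarity score, and how stripping such suffixes before scoring could recover the match. The names, dates, regular expression and normalisation are invented for illustration and are not taken from the actual VIAF data or reconciliation services.

# Worked example (illustrative only): extra characters typical for library data
# lower a purely edit-distance-based similarity score for an otherwise identical name.
import re

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalised edit-distance similarity in [0, 1]."""
    return 1 - levenshtein(a, b) / max(len(a), len(b), 1)

def strip_library_suffix(name: str) -> str:
    """Remove trailing life dates and punctuation before scoring (sketch only)."""
    return re.sub(r"[,.\s]*\(?\d{4}(\s*-\s*\d{0,4})?\)?\.?$", "", name).strip(" ,.")

query = "Doe, Jane"                    # invented name
candidate = "Doe, Jane, 1901-1990."    # same name with library-style life dates
print(similarity(query, query))                            # 1.0 -> automatic match
print(similarity(query, candidate))                        # well below 1.0 -> drops out of automatic matching
print(similarity(query, strip_library_suffix(candidate)))  # back to 1.0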
13:43
In that sense, I'm looking forward to your questions and comments, and thank you for listening.