Improving data quality at Europeana
Formal Metadata
Title: Improving data quality at Europeana
Number of Parts: 16
License: CC Attribution - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.
Identifiers: 10.5446/47570 (DOI)
Transcript: English (auto-generated)
00:08
And our next speaker is Peter Kiraly from Europeana on improving data quality at Europeana.
00:22
And while we are setting up, the full title is "Improving data quality at Europeana: new requirements and methods for better measuring metadata quality."
00:53
Yeah, in the meantime I have started. So the project I will be talking about is how we can improve data quality in Europeana.
01:10
And this has lots of parts. The first part is how to measure. Okay, so how to measure data quality in general and specifically in Europeana.
01:49
This is a joint venture of different parties. The biggest part is Europeana, but there's a community called the Europeana Network, whose members
02:02
are voluntary, pro bono members. I'm one of the members of this community, and at the beginning of this year we formed a specific group focusing on these questions.
02:24
Yeah, so first, some of you already know this, but for understanding what I will talk about, this is the big picture of the data workflow in Europeana. There's the Europeana Data Model, the schema Europeana uses.
02:44
But in order to get this, Europeana collects records in different kinds of metadata schemas from more than 3,000 organizations, so the schemas are Dublin Core, LIDO, EAD, MARC and
03:05
so forth. There are lots of them, and some are custom, or custom variations of these standards. The second step is the data aggregators, so there's a layer between the organizations
03:22
and Europeana, which collects the data either regionally or in a domain-specific way. And then there's the Europeana ingestion process, which collects all this data and enriches it with
03:41
the help of semantic technologies. So, as you can see, there are lots of transformations during this process, and yeah, that's one of the reasons we are doing this research. The problem is that there are good and bad metadata records, we'll see some examples,
04:08
but it's not easy to evaluate whether a record is good or bad, and we don't have a clear metric. A clear metric would be something like this: we have a scale
04:21
of functional requirements, and if a record fulfills the requirements it's good, if it doesn't fulfill them it's bad, and in between there's an acceptable region. So some examples.
04:40
This is a semantic problem: we have non-informative and informative titles. For example, we have several thousands of photographs with the title "photograph" without any additional information, so a photograph of what, and where was it created, and so forth,
05:03
and there's no description either. On the other side, there are good examples: "Photograph of Sir Douglas Clerk" is a photograph of a specific person; that's quite good, understandable and searchable. Another kind of problem is that some records are created via templates, or during the transformation
05:30
some templating system was used, so sometimes there's a real record where the title is "unknown unknown", and the subject is also "unknown", and the description is also "unknown",
05:44
so that's a problem. Last year there was a report on metadata quality which collects similar issues in 60 pages, so if you are looking for more details, you should read that report.
06:05
So our aim is fitness for use, yeah, the purpose is data usage. Europeana is a meta-collection in the sense that it collects only metadata, and the purpose is that the final users go
06:26
to the individual collections and find the data; but if the metadata is bad, not searchable, not findable and so forth, then in the end there will be no data usage. There's a very nice working group at the W3C, Data on the Web Best Practices; they
06:53
listed lots of examples about this purpose. And in January this year, this group called the Data Quality Committee was formed, about 50
07:12
members, well, about 10 or 12 active members, and bi-weekly we discuss things, we
07:25
find examples, we try to define the problems and so forth. One of our hypotheses is that by measuring structural elements, we can predict metadata
07:43
record quality, but this is not a direct signal: what we measure is structure, and somehow there's a relationship between structure and data quality, but it's not a direct connection. So what we measure is a kind of metadata smell; some of you are developers,
08:07
so you might know the term "code smell", which means that we see something in the code structure which might lead to some problems in the future. We don't know yet, but there's
08:26
a chance. With the measurements, we have the same situation. It's not quite true that we found real problems, but there's a chance that something is there and it's
08:42
worth examining by humans. So the purpose is, yeah, to improve the metadata. If we improve the metadata, then we will have good data and we can create better services on the good data. And also
09:05
to find the weak points, so we can improve the schema itself and its documentation. And finally, if we can find good examples, we can propagate them: please follow
09:24
those examples. What do we measure? There are three different things. The first is structural and semantic features. These can be measured schema-independently, which means
09:44
that the methods we find here can be applied to MARC or whatever metadata schema, and the tool is built in that way. The second one is discovery scenarios. So we collected the
10:02
most important functionalities of Europeana and examined what metadata features should be there to support those functionalities. And finally, we have a catalog of anti-patterns, which
10:20
are known bad practices, or things that just happen, but we can somehow find them. So this is the catalog of discovery scenarios. We collected the 14 most important functionalities
10:42
of Europeana, like cross-language recall, entity-based facets and so forth. And for each we created a kind of user story that describes the purpose of this functionality,
11:03
what kind of analysis we should do on the metadata level, and what the measurement rules are. And here is the problem catalog. It's an ongoing project, so we will find more
11:22
and more problems, like the title and the description being the same, or bad Unicode characters and so forth. I guess these are familiar to everybody, independently of what kind of schema they use. And yeah, we did a somewhat similar analysis for these patterns as well.
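Some of these anti-patterns, such as a title equal to the description, placeholder values like "unknown", or bad Unicode characters, can be expressed as simple record-level checks. The sketch below is purely illustrative: the field names, the placeholder list, and the rules are my assumptions, not Europeana's actual implementation.

```python
import unicodedata

# Illustrative placeholder terms; not Europeana's actual list.
PLACEHOLDERS = {"unknown", "photograph", "untitled", "n/a"}

def find_antipatterns(record):
    """Return the anti-pattern labels found in one flat metadata record
    (a dict mapping field names to string values)."""
    problems = []
    title = (record.get("title") or "").strip()
    description = (record.get("description") or "").strip()

    # Non-informative or templated title: empty, or built only from
    # placeholder words ("photograph", "unknown unknown", ...).
    if not title or all(w.lower() in PLACEHOLDERS for w in title.split()):
        problems.append("non-informative title")

    # Title and description are the same.
    if title and title.lower() == description.lower():
        problems.append("title equals description")

    # Bad Unicode: control or replacement characters in any value.
    for value in record.values():
        if any(c == "\ufffd" or (unicodedata.category(c) == "Cc" and c not in "\t\n")
               for c in (value or "")):
            problems.append("bad Unicode characters")
            break
    return problems
```

Run over all records, the counts per problem label would feed the kind of collection-level views described in the talk.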
11:51
And finally, the measurements. Right now we measure lots of things. From one record we extract about 500 features, and in Europeana we create an overall view of these features
12:14
which gives you one picture of the whole situation for every record. Then there's
12:23
a collection-level view, so each organization can check their own features. And finally there's a record-level view, so for each record you can check whether that record has a problem or not, and so forth. And yeah, I did not mention so far that measuring
12:50
metadata quality is not a new thing in science. There are some metrics
13:02
which were created in the literature, and we follow those metrics. Sometimes we modify them and introduce new ones. So here are some slides from the results. This is one
13:21
field in this example, dcterms:alternative, the alternative title. You can see that some collections don't have it at all, some collections have an alternative title for all records, and there are collections in between. And this is an interactive visualization, so you
13:50
can filter out some values. All these things are constantly a work in progress, so these
14:02
are the first results, and we will improve all these things in the user interface. We also provide some detailed statistics. This is called cardinality: how many
14:21
field instances there are in one record. In this example it is dc:subject; there are minimum and maximum values and all the basic statistics, and we draw histograms and
14:40
we point to the minimal and maximal records, so you can find examples at the lower and upper ends of the ranges. In Europeana a very important feature
15:05
is multilinguality; several of the discovery scenarios build on it. It means that one of the main ideas behind Europeana is that the content should be available
15:24
in multiple languages. Europeana is based on RDF syntax, which means that there are three types of values. One is a simple literal, where you cannot see the language.
15:44
The second one is the language-tagged string: a string plus we tell, or the organizations tell, that it's written in English or French or whatever. And the third one is when the
16:02
value is a resource URI, pointing to a linked open data dictionary entry, which is hopefully multilingual. We created statistics about the languages themselves, so yeah,
16:23
unfortunately most of the strings don't have a language specification, but about one third do, and there's a vast range of language notations: more than 400 different
16:46
notations of languages. Some, unfortunately, are bad; for example, the most prominent, most frequent language is English, and English is encoded in six different ways, "en", "En", "EN"
17:04
and so forth. There's an ongoing discussion about what we should do, but it's relatively easy to improve and fix. The second thing about multilinguality is the level of multilinguality,
17:23
so not the list of individual languages, but whether one instance has multiple language tags. We set up a scale: at the lower end there is the missing field, and
17:42
text without any language tags; if the string, or the field, has multiple languages, so we suppose that it's a translation, then we give more points; and finally,
18:05
if it's a link to a controlled dictionary, that's the best. And there's a penalty if there's a mix, so translated values together with values without translation get
18:21
a penalty. We try to visualize this in different ways. This is a heat map, the darker the better: more language tags. And yeah, this is an interactive
18:49
map, so you can click and you can see the scores. Another metric is the information content. The basic idea is that the more frequent a term, like "photograph", the
19:07
less valuable it is; the less frequent, the more unique, the more important. And yeah, we have box plots and histograms and Q-Q plots to find outliers and the distribution of the
19:29
quality scores. And finally, this is the architecture graph. The most important part is that we use big data analysis tools, Apache Hadoop and Spark, in order to process this whole
19:49
record set. Right now there are 53 million records in Europeana, which is more than 400 gigabytes; it takes a lot of time to process them. These big data technologies
20:07
help us distribute the task across processors and nodes. And the further steps: there are human and technical steps we plan, so turning the results into
20:24
documentation and recommendations to the organizations, communication with the data providers; we would like to do some human evaluation of these scores for some selected sets, and
20:43
we would like to cooperate with other projects like DPLA or other similar institutions. And one thing, for me quite important, is the Shapes Constraint Language (SHACL), which is a W3C draft right
21:07
now, or perhaps already published, I don't know, to define the data patterns. And here are
21:21
some links if you are interested. Thank you very much. Thank you, Peter. And now questions.
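The multilinguality scale described in the talk (missing field, text without a tag, tagged text, multiple translations, link to a controlled vocabulary, with a penalty for mixing tagged and untagged values) might be scored roughly along these lines. The point values and the URI heuristic are invented for illustration, not Europeana's actual scoring.

```python
def multilinguality_score(values):
    """Score one field instance: a list of (text, language_tag_or_None) pairs.
    A text starting with 'http' is treated as a resource URI."""
    if not values:
        return 0                          # missing field
    if any(text.startswith("http") for text, _ in values):
        return 4                          # link to a controlled, hopefully multilingual, vocabulary
    tags = {lang for _, lang in values if lang}
    untagged = sum(1 for _, lang in values if lang is None)
    if not tags:
        return 1                          # text without any language tag
    score = 3 if len(tags) > 1 else 2     # translations score higher than a single tag
    if untagged:
        score -= 1                        # penalty: mix of tagged and untagged values
    return score
```

Applied per field and averaged per record, such scores would feed the heat-map view mentioned above.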
21:43
Thank you. My name is Susanti, I'm from the Indonesian Parliament Library. I would like to ask whether the validation process for metadata quality, by measuring the structural elements, can be implemented or adopted by another application outside of the Europeana collection,
22:01
or is this scope only for the Europeana collection? The data quality scope is Europeana, but I'm also a researcher, so I have another hat, and yeah, I'm working with MARC as well, and with research data as well.
22:22
So it is a possibility for another journal or another collection or set of references? Thank you very much. Do you have by chance any metric for provenance of the information? Provenance, where it originated?
22:42
Yes, yes. Or some score of quality of the source? Yeah, in the literature there are specific metrics; we haven't introduced them yet, but yes, the
23:00
basic idea about provenance is that you measure the other metrics and find which source is the best, and so forth. And then in the next round, when the same source provides
23:20
new records, it automatically gets a little bit higher score. And how many sources do you have so far? 3,500. Okay, thank you. All right, any more questions? The other side of the room? Oh, okay.
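The provenance idea sketched in this answer, giving a source whose earlier records scored well a small bonus on its new records, might look like the following. The class name and the 0.1 weight are my own illustration, not a metric from the literature.

```python
from collections import defaultdict

class SourcePrior:
    """Keep a running mean quality score per data source and use it
    as a small prior bonus when scoring new records from that source."""

    def __init__(self, weight=0.1):
        self.weight = weight              # arbitrary illustrative weight
        self.totals = defaultdict(float)  # sum of observed scores per source
        self.counts = defaultdict(int)    # number of observed records per source

    def observe(self, source, score):
        """Record the quality score of one processed record."""
        self.totals[source] += score
        self.counts[source] += 1

    def adjusted(self, source, raw_score):
        """Raw score plus a bonus proportional to the source's mean so far."""
        if self.counts[source] == 0:
            return raw_score              # unknown source: no prior
        mean = self.totals[source] / self.counts[source]
        return raw_score + self.weight * mean
```

A source with a good track record thus starts each new round slightly ahead, which matches the "a little bit higher score" behaviour described above.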
23:46
Did you try to apply some kind of automatic language recognition to the fields where you didn't have any language tags? Yes. We have a student who just yesterday defended his thesis, and in his research he did it.
24:14
And if I remember correctly, the language of more than half of the unspecified strings can be detected.
24:25
But we haven't had time to evaluate the results. The problem is that language detection usually works better on longer texts.
24:42
So the longer the text, the more reliable the results. And usually in metadata there are extreme values, but the normal values are small, around 100 characters. All right, we have time for some more questions.
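The point about string length can be illustrated with a toy stopword-based guesser, nothing like the student's actual detector: with only a few words there are often zero stopword hits, so short metadata values stay undecided. The stopword lists here are tiny, hand-picked examples.

```python
# Toy stopword-based language guesser. Real detectors use character
# n-gram models, which also degrade on short strings such as typical
# metadata values (titles of ~100 characters or less).
STOPWORDS = {
    "en": {"the", "of", "and", "a", "in"},
    "fr": {"le", "la", "et", "de", "un"},
    "de": {"der", "die", "und", "das", "ein"},
}

def guess_language(text, min_hits=1):
    """Return the language whose stopwords overlap the text most,
    or None if no language reaches min_hits."""
    words = set(text.lower().split())
    best, hits = None, 0
    for lang, stops in STOPWORDS.items():
        n = len(words & stops)
        if n > hits:
            best, hits = lang, n
    return best if hits >= min_hits else None
```

A one-word title like "Foto" yields no hits and stays undecided, which is exactly the short-text problem raised in the question.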
25:08
Could you give us some examples of metadata quality improvements you achieved as a consequence of your analysis?
25:24
Well, it's hard to communicate because I don't want to hurt anybody. It's very important that when we prepare slides on these matters, it's a little bit of a political thing.
25:50
The purpose is not to find bad things. The purpose is to improve things and communicate directly with the data providers.
26:05
So, yeah, we found good examples and bad examples, yes. We haven't started with this communication. The idea is that this tool will be built into Europeana and it will be part of the so-called ingestion process.
26:27
So when Europeana pulls in data, it will run the process and find the problems and they will communicate directly with the data providers and provide suggestions for them to improve this or that.
26:51
And what are the most important things that data providers can do to improve their data quality at Europeana?
27:01
Yeah, right now, one thing is multilinguality, so that's an area to improve. The other thing is that there are lots of examples where I suppose there were too many transformations between the steps:
27:28
they started from Dublin Core, which provides about 15 fields, and at the end Europeana provides 150 fields.
27:42
So the information is very dense; the granularity is different. So it would be good if everybody, well, I don't want to say that everybody should leave Dublin Core, but try to provide more granular metadata.
28:09
All right, and is there one more question? All right, there were a lot of questions already. Thanks a lot. Thank you.