
Improving data quality at Europeana


Formal Metadata

Title
Improving data quality at Europeana
Title of Series
Number of Parts
16
Author
License
CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Europeana aggregates metadata from a wide variety of institutions, a significant proportion of which is of inconsistent or low quality. This low-quality metadata acts as a limiting factor for functionality, affecting e.g. information retrieval and usability. Europeana is accordingly implementing a user- and functionality-based framework for assessing and improving metadata quality. Currently, the metadata is validated (against the EDM XML schema) prior to being loaded into the Europeana database. However, some technical choices with regard to the expression of rules impose limitations on the constraints that can be checked. Furthermore, Europeana and its partners sense that more than simple validation is needed: finer-grained indicators for the 'fitness for use' of metadata would help Europeana and its data providers to detect and solve potential shortcomings in the data. At the beginning of 2016, Europeana created a Data Quality Committee to work on data quality issues and to propose recommendations for its data providers, seeking to employ new technology and innovate metadata-related processes. This presentation will describe more specifically the activities of the Committee with respect to data quality checks:
- Definition of new data quality requirements and measurements, such as metadata completeness measures;
- Assessment of (new) technologies for data validation and quantification, such as SHACL for defining data patterns;
- Recommendations to data providers, and integration of the results into the Europeana data aggregation workflow.
Transcript: English (auto-generated)
And our next speaker is Peter Kiraly from Europeana on improving data quality at Europeana.
And while we are setting up, the full title is "Improving data quality at Europeana: new requirements and methods for better measuring metadata quality".
Yeah, in the meantime I have started. So the project I will be talking about is how we can improve data quality in Europeana.
This has lots of parts. The first part is how to measure: how to measure data quality in general, and specifically in Europeana.
This is a joint venture of different parties. The biggest part is Europeana, but there's also a community called the Europeana Network, whose members are volunteers working pro bono. I'm one of the members of this community, and at the beginning of this year we formed a specific group focusing on these questions.
Yeah, so first, some of you already know this, but for understanding what I will talk about, this is the big picture of the data workflow in Europeana. There's the Europeana Data Model, which is the schema Europeana uses.
But in order to get there, Europeana collects records in different kinds of metadata schemas from more than 3,000 organizations; the schemas are Dublin Core, LIDO, EAD, MARC and so forth. There are lots of them, and some are custom, or custom variations of these standards. The second step is the data aggregators: there's a layer between the organizations and Europeana which collects the data, either regionally or in a domain-specific way. And then there's the Europeana ingestion process, which collects all this data and enriches it with the help of semantic technologies.
So, as you can see, there are lots of transformations during this process, and that's one of the reasons we are doing this research. The problem is that there are good and bad metadata records (we'll see some examples), but it's not easy to evaluate whether a record is good or bad, and we don't have a clear metric. What we have instead is a scale of functional requirements: if a record fulfils the requirements, it's good; if it doesn't fulfil them, it's bad; and in between there's an acceptable region. So, some examples.
This is a semantic problem: we have non-informative and informative titles. For example, we have several thousand photographs with the title "photograph" and no additional information (a photograph of what, and where was it created, and so forth), and there's no description either. On the other side, there are good examples, such as "photograph of Sir Douglas Clerk", which is a photograph of a specific person; that's quite good, understandable and searchable. Another kind of problem is that some records are created via templates, or a templating system is used during the transformation, so sometimes there's a real record whose title is "unknown unknown", whose subject is also "unknown", and whose description is also "unknown".
So that's a problem. Last year there was a report on metadata quality which collected similar issues in 60 pages, so if you are looking for more details, you should read that report.
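As a rough illustration (not Europeana's actual implementation; the word list and the flat record structure are assumptions), non-informative and template titles like the ones above could be flagged along these lines:

```python
# Hypothetical sketch: flag records whose title carries no real information,
# e.g. a bare media type ("Photograph") or a template value ("unknown").
NON_INFORMATIVE = {"photograph", "unknown", "untitled", "image", "object"}

def title_is_informative(record: dict) -> bool:
    """Return True if the record's title looks informative.

    `record` is assumed to be a simple dict with a 'title' key; real EDM
    records are RDF, so this only illustrates the rule, not the data model.
    """
    title = (record.get("title") or "").strip().lower()
    if not title:
        return False
    # A title made up solely of non-informative words is suspicious.
    words = set(title.replace(",", " ").split())
    return not words.issubset(NON_INFORMATIVE)

print(title_is_informative({"title": "Photograph"}))                       # False
print(title_is_informative({"title": "Photograph of Sir Douglas Clerk"}))  # True
print(title_is_informative({"title": "Unknown Unknown"}))                  # False
```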
So our focus is fitness for use; the purpose is data usage. Europeana is a meta-collection in the sense that it collects only metadata, and the idea is that the final users go to the individual collections and find the objects. But if the metadata is bad, the records are not searchable, not findable and so forth, and in the end there will be no data usage. There's a very nice document from the W3C Data on the Web Best Practices working group, which lists lots of examples about this purpose.
And this year, in January, the group called the Data Quality Committee was formed: about 50 members, well, about 10 or 12 active members, and bi-weekly we discuss things, we find examples, we try to define the problems and so forth.
One of our hypotheses is that by measuring structural elements, we can predict the quality of a metadata record. But this is not a direct signal: what we measure is structure, and somehow there's a relationship between structure and data quality, but it's not a direct connection. So what we measure is a kind of metadata smell. Some of you are developers, so you might know the term code smell, which means that we see something in the code structure which might lead to problems in the future; we don't know yet, but there's a chance. With our measurements we have the same situation: it's not necessarily true that we have found real problems, but there's a chance that something is there, and it deserves to be examined by humans.
So the purpose is, yeah, to improve the metadata. If we improve the metadata, then we will have good data and we can create better services on that good data. We also want to find the weak points, so we can improve the schema itself and its documentation. And finally, if we can find good examples, we can propagate them and say: please follow these examples.
What do we measure? There are three different things. The first is structural and semantic features, which can be measured schema-independently.
It means that the methods we define here can be applied to MARC or whatever metadata schema, and the tool is built that way. The second one is discovery scenarios: we collected the most important functionalities of Europeana and examined what metadata features should be there to support those functionalities. And finally, we have a catalog of anti-patterns, which are either known bad practices or things that just happen, but which we can somehow detect.
This is the catalog of discovery scenarios. We collected the 14 most important functionalities of Europeana, such as cross-language recall, entity-based facets and so forth, and for each we created a kind of user story that describes the purpose of the functionality, what kind of analysis we should do at the metadata level, and what the measurement rules are. And here are the problem catalogs. It's an ongoing project, so we will find more and more problems, like the title and the description being the same, or bad Unicode characters, and so forth. I guess these are familiar to everybody, regardless of what kind of schema they use. And yeah, we did a somewhat similar analysis for these patterns as well.
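Two of the anti-patterns just mentioned, a description that merely repeats the title and broken Unicode characters, could be checked roughly like this (a hypothetical Python sketch; the flat record structure and field names are assumptions, not the Committee's implementation):

```python
import unicodedata

def title_equals_description(record: dict) -> bool:
    """Anti-pattern: the description just repeats the title."""
    title = (record.get("title") or "").strip().lower()
    desc = (record.get("description") or "").strip().lower()
    return bool(title) and title == desc

def has_suspicious_characters(text: str) -> bool:
    """Anti-pattern: replacement or control characters usually indicate
    a broken character-encoding conversion somewhere upstream."""
    for ch in text:
        if ch == "\ufffd":                              # U+FFFD REPLACEMENT CHARACTER
            return True
        if unicodedata.category(ch) == "Cc" and ch not in "\n\r\t":
            return True
    return False

print(title_equals_description({"title": "Bridge", "description": "Bridge"}))  # True
print(has_suspicious_characters("Caf\ufffd"))                                   # True
```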
And finally, the measurements. Right now we measure lots of things: from one record we extract about 500 features, and for Europeana we create an overall view of these features, which gives one picture of the whole situation. Then there's a collection-level view, so each organization can check its own features. And finally there's a record-level view, so for each record you can check whether that record has a given problem or not, and so forth. And yeah, I did not mention so far that measuring metadata quality is not a new thing in science: there are metrics which were created in the literature, and we follow those metrics; sometimes we modify them and introduce new ones.
So here are some slides from the results. In this example the field is dcterms:alternative, the alternative title. You can see that some collections don't have it at all, some collections have an alternative title for all records, and there are collections in between. This is an interactive visualization, so you can filter out some values. All of this is a constant work in progress, so these are the first results; we will improve all these things on the user interface.
We also provide some detailed statistics. This one is called cardinality: how many instances of a field there are in one record. In this example it is dc:subject; there are minimum and maximum values and all the basic statistics, we draw histograms, and we point to the minimal and maximal records, so you can find examples at the lower and the upper end of the range.
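A minimal sketch of how cardinality statistics for a single field such as dc:subject might be computed (illustrative Python only; records are modelled as plain dictionaries here, whereas real EDM records are RDF):

```python
from statistics import mean, median

def cardinality_stats(records, field):
    """Basic cardinality report for one field across a set of records."""
    counts = [len(rec.get(field, [])) for rec in records]
    present = sum(1 for c in counts if c > 0)
    return {
        "records": len(counts),
        "present_in": present,          # how many records have the field at all
        "min": min(counts),
        "max": max(counts),
        "mean": mean(counts),
        "median": median(counts),
    }

records = [
    {"dc:subject": ["architecture", "bridges"]},
    {"dc:subject": []},
    {"dc:subject": ["portraits"]},
]
print(cardinality_stats(records, "dc:subject"))
# something like: {'records': 3, 'present_in': 2, 'min': 0, 'max': 2, ...}
```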
In Europeana there's a very important feature: multilinguality. Several of the discovery scenarios build on it, because one of the main ideas behind Europeana is that the content should be available in multiple languages. Europeana is based on RDF, which means that there are three types of values. One is a simple literal, where you cannot see the language. The second one is a literal with a language tag: a string plus the organization telling us that it's written in English or French or whatever. And the third one is when the value is a resource URI, which points to a linked open data dictionary entry that is hopefully multilingual. We created statistics about the languages themselves: most of the strings unfortunately don't have a language specification, but about one third do, and there is a vast number of language notations, more than 400 different ones. Some are unfortunately bad; for example, the most frequent language is English, and English is encoded in six different ways ("en", capital "EN" and so forth).
There's an ongoing discussion about what we should do, but this is relatively easy to improve and fix. The second thing about multilinguality is the level of multilinguality: not the list of individual languages, but whether one field instance has multiple language tags. We set up a scale: at the lower end there is the missing field, then text without any language tag; if the field has multiple languages, so we suppose it's a translation, we give more points; and if it's a link to a controlled dictionary, that's the best. There's also a penalty for mixing, so translated values together with untagged values get a penalty. We try to visualize this in different ways. This is a heat map, the darker the better: darker means more language tags. It's an interactive map, so you can click on a cell and see the scores.
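The multilinguality scale described above could be scored roughly as follows; the numeric weights are illustrative assumptions, since the talk does not give the Committee's exact values:

```python
def multilinguality_score(values):
    """Score one property of one record.

    `values` is a list of (value, language_tag_or_None, is_uri) triples.
    Weights are illustrative only.
    """
    if not values:
        return 0.0                                   # missing field
    if any(is_uri for _, _, is_uri in values):
        return 3.0                                   # link to a (hopefully multilingual) vocabulary
    tags = {lang for _, lang, _ in values if lang}
    untagged = sum(1 for _, lang, _ in values if not lang)
    if not tags:
        return 0.5                                   # text without any language tag
    score = 1.0 + (1.0 if len(tags) > 1 else 0.0)    # extra point for translations
    if untagged:
        score -= 0.5                                 # penalty for mixing tagged and untagged values
    return score

# dc:title given in English and French, plus one untagged string:
print(multilinguality_score([("Bridge", "en", False),
                             ("Pont", "fr", False),
                             ("Brücke", None, False)]))   # 1.5
```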
Another metric is the information content. The basic idea is that the more frequent a term is, like "photograph", the less valuable it is; the less frequent and more unique, the more important. And yeah, we have box plots, histograms and Q-Q plots to find outliers and to look at the distribution of the quality scores.
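The information-content idea resembles an IDF-style weighting; here is a minimal sketch (the Committee's actual formula may differ):

```python
import math
from collections import Counter

def idf_weights(documents):
    """`documents` is a list of token lists; returns term -> IDF weight,
    so that ubiquitous terms score near zero and rare terms score higher."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

docs = [["photograph"],
        ["photograph", "bridge"],
        ["photograph", "sir", "douglas", "clerk"]]
weights = idf_weights(docs)
print(weights["photograph"])   # 0.0  -> appears everywhere, carries little information
print(weights["clerk"])        # about 1.1 -> rare, carries more information
```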
And finally, this is the architecture diagram. The most important part is that we use big data analysis tools, Apache Hadoop and Spark, in order to process the whole record set. Right now there are 53 million records in Europeana, which is more than 400 gigabytes, and it takes quite some time to process them. These big data technologies help us distribute the task across processors and nodes.
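A minimal PySpark sketch of distributing a per-record measurement over a large record set; the input path, the JSON layout and the extracted features are hypothetical, and only the overall pattern (map each record to its features, write the results out) follows the talk:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("metadata-quality").getOrCreate()

# Hypothetical input: one JSON document per record.
records = spark.read.json("hdfs:///europeana/records/*.json")

def extract_features(row):
    rec = row.asDict()
    title = rec.get("title") or ""
    # A couple of toy features; the real tool extracts about 500 per record.
    return (rec.get("id"), bool(title), len(title))

features = records.rdd.map(extract_features).toDF(["id", "has_title", "title_length"])
features.write.parquet("hdfs:///europeana/quality-features")   # hypothetical output path
```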
As for further steps, there are human and technical steps we plan: turning the results into documentation and recommendations for the organizations, and communication with the data providers. We would like to do some human evaluation of the scores for some selected sets, and we would like to cooperate with other projects like DPLA or similar institutions. One thing that is quite important, for me at least, is the Shapes Constraint Language (SHACL); this is a W3C draft right now, or perhaps already published, I don't know, for defining data patterns.
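A minimal sketch of validating a data pattern with SHACL from Python, using the rdflib and pyshacl libraries; the namespaces and the concrete shape are illustrative assumptions, not the Committee's actual shapes:

```python
from rdflib import Graph
from pyshacl import validate

# An illustrative shape: every object must have a dc:title of at least 2 characters.
shapes_ttl = """
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix ex: <http://example.org/> .

ex:TitleShape a sh:NodeShape ;
    sh:targetClass ex:ProvidedCHO ;
    sh:property [
        sh:path dc:title ;
        sh:minCount 1 ;
        sh:minLength 2 ;
    ] .
"""

data_ttl = """
@prefix ex: <http://example.org/> .

ex:record1 a ex:ProvidedCHO .
"""

conforms, _, report_text = validate(
    Graph().parse(data=data_ttl, format="turtle"),
    shacl_graph=Graph().parse(data=shapes_ttl, format="turtle"),
)
print(conforms)      # False: record1 has no dc:title
print(report_text)   # human-readable validation report
```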
Here are some links if you are interested. Thank you very much. Thank you, Peter. And now, questions.
Thank you. My name is Susanti, I'm from the Indonesian Parliament Library. I would like to ask whether the validation process for metadata quality by measuring structural elements can be implemented or adopted by other collections outside of Europeana, or whether the scope is only the Europeana collection? The scope of the data quality work is Europeana, but I'm also a researcher, so I have another hat, and I'm working with MARC as well, and with research data as well.
So it is a possibility for other journals or other collections or reference sets? Thank you very much. Do you have, by chance, any metric for the provenance of the information? Provenance, as in where it originated?
Yes, yes. Or some quality score for the source? Yeah, in the literature there are specific metrics for that; we haven't introduced them yet. But the basic idea about provenance is that you measure the other metrics and find which source is the best and so forth, and then in the next round, when the same source provides new records, it automatically gets a slightly higher score. And how many sources do you have so far? 3,500. Okay, thank you. Alright, any more questions? The other side of the room? Oh, okay.
Did you try to apply some kind of automatic language recognition to the fields where you didn't have any language tags? Yes. We have a student who just yesterday defended his thesis, and in his research he did that. If I remember correctly, more than half of the unspecified strings could be assigned a detected language, but we haven't had time to evaluate the results. The problem is that language detection usually works better on longer texts: the longer the text, the more reliable the results. And in metadata there are extreme values, but the typical values are short, around 100 characters.
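For illustration, language detection on untagged literals could look like this with the langdetect library (an assumption; the thesis mentioned in the answer may have used a different tool), keeping in mind that short strings make the result unreliable:

```python
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0   # make the detection deterministic between runs

for literal in ["Photograph of Sir Douglas Clerk",
                "Photographie du pont de pierre à Bordeaux"]:
    print(literal, "->", detect(literal))
# likely output: "... Clerk -> en", "... Bordeaux -> fr"
```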
Alright, we have time for some more questions. Could you give us some examples of metadata quality improvements you achieved as a consequence of your analysis?
Well, it's hard to communicate, because I don't want to hurt anybody. That's very important when we prepare slides on these matters; it's a little bit of a political thing.
The purpose is not to point out bad things; the purpose is to improve things and to communicate directly with the data providers.
So, yeah, we found good examples and bad examples, yes, but we haven't started this communication yet. The idea is that this tool will be built into Europeana and will be part of the so-called ingestion process: when Europeana pulls in data, it will run the process and find the problems, and then they will communicate directly with the data providers and give them suggestions to improve this or that.
And what are the most important things that data providers can do to improve their data quality at Europeana?
Yeah, right now, one thing is multilinguality; that is something to improve. The other thing is that there are lots of examples where, I suppose, there were too many transformations between the steps: they started from Dublin Core, which provides about 15 fields, while at the end Europeana provides 150 fields.
So the information is very dense and the granularity is different. It would be good if everybody, and I don't want to say that everybody should leave Dublin Core, tried to provide more granular metadata.
All right, and is there one more question? All right, there were a lot of questions already. Thanks a lot. Thank you.