
Entitifying Europeana: building an ecosystem of networked references for cultural objects

Formal Metadata

Title
Entitifying Europeana: building an ecosystem of networked references for cultural objects
Title of Series
Number of Parts
16
Author
License
CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor, and the work or content is shared, also in adapted form, only under the conditions of this license.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
In the past years, the number of references to places, peoples, concepts and time in Europeana’s metadata has grown considerably and with it new challenges have arisen. These contextual entities are provided as references as part of the metadata delivered to Europeana or selected by Europeana for semantic enrichment or crowdsourcing. However their diversity in terms of semantic and multilingual coverage and their very variable quality make it difficult for Europeana to fully exploit this rich information. Pursuing its efforts towards the creation of a semantic network around cultural heritage objects and intending in this way to further enhance its data and retrieval across languages, Europeana is now working on a long term strategy for entities. The cornerstone of this strategy is a “semantic entity collection” that acts as a centralised point of reference and access to data about contextual entities, which is based on the cached and curated data from the wider Linked Open Data cloud. While Europeana will have to address the technical challenges of integration and representation of the various sources, it will also have to define a content and curation plan for its maintenance. This presentation will highlight the design principles of the Europeana Entity Collection and its challenges. We will detail our plans regarding its curation and maintenance while providing the first examples of its use in Europeana users' services. We will also reflect on how our goals can fit our partners' processes and how can organizations like national cultural heritage portals and smaller institutions contribute to (and benefit from) such a project as a network.
Transcript: English(auto-generated)
is Hugo Manguinhas. Manguinhas? All right, I'm gonna work on that. Hugo works as a technical coordinator for research and development at Europeana. I'm very excited to have him here talking about Entitifying Europeana: Building an Ecosystem of Networked References
for Cultural Objects. Hi everyone, I'm Hugo Manguinhas. I come from the Europeana Foundation, and I'll be speaking about the work that we've been doing with regard to entities.
For a long time, we have been gathering a lot of information about these entities, and we never had a real strategy to deal with it and actually make use of it. So that's what I'll be speaking about today. For those of you that don't know, though I guess most of you actually do know,
Europeana is a platform for digital cultural heritage, even though we normally speak of ourselves as the portal, and the portal is what most people are used to. We like to see ourselves as a platform much more than as a portal. We aggregate metadata from all EU countries:
3,500 galleries, libraries, archives and museums, more than 53 million objects so far, and in about 15 languages. As part of this data, there is a huge amount of references to places, agents, concepts and time spans.
And this is what I'll be speaking about today: what we call contextual entities. For a long time, Europeana has been committing effort to linked data, so here I'll speak about some of the efforts
and lines of work that we've been working on. One of them is the EDM, which, as all of you know, is the base of our linked data. Another is the enrichment that we do, linking the source data to reference data. Another is that we encourage data providers to actually submit the information they have
about contextual entities, so that we can benefit from the information that they can give us. We also encourage activities for aligning local vocabularies with reference vocabularies, so that we can
take even more advantage of this information. Some of these lines of work we have already presented at SWIB in past conferences. So, as I said before, we're now looking at building a strategy
for dealing with entities. The cornerstone of this strategy is something that we call the entity collection. An easy way to describe it is as a database, but it's much more than a database: it's actually a service that gathers all the information
that we have for entities. Until now, all the information that we had was scattered around in several places. We had the enrichment database, which has information from external sources, and we have the information in our collection,
which is the data that the providers give us. All of this was scattered, and now we're starting to have a place where we can actually gather all this information, store it, and make it accessible. The idea is that we cache and curate the data as well,
or at least we hope to curate it, making small corrections. This caching and gathering is done from the Linked Open Data cloud. It's a sort of knowledge graph for entities and objects.
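A minimal sketch of this cache-and-curate idea, in Python. The data model and field names here are invented for illustration: cached descriptions come from external Linked Open Data sources, and a thin layer of curated corrections is applied on top without discarding the cached data.

```python
# Hypothetical sketch: apply small curated corrections over cached LOD data
# without losing the original cached snapshot.

def apply_curation(cached: dict, corrections: dict) -> dict:
    """Return the entity view: cached fields, overridden by curated fixes."""
    entity = dict(cached)               # start from the cached snapshot
    entity.update(corrections)          # curated values win
    entity["_cached"] = cached          # original source data is preserved
    return entity

cached = {
    "prefLabel": {"en": "Mozart", "de": "Mozart"},
    "birthPlace": "Salzburg, Austria",  # verbose value as delivered by the source
}
corrections = {"birthPlace": "Salzburg"}  # a small curated fix

entity = apply_curation(cached, corrections)
print(entity["birthPlace"])               # -> Salzburg
print(entity["_cached"]["birthPlace"])    # -> Salzburg, Austria (nothing lost)
```

The point of keeping the `_cached` layer is that a later re-harvest of the source can be merged in without clobbering the curated corrections.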
So our main motivation is to improve the user experience: to support better ways of searching and navigating through the contents, eliminating ambiguity, so that when you have the name of a person you can actually see which person that is and what information we have
about that entity. We can also adapt: considering that a lot of these descriptions have labels in several languages, we can adapt the portal to the language that the user is most used to reading. And we can do this by improving the interlinking of data:
it brings more context to objects, it helps to address ambiguity issues, it expands language coverage, as I already mentioned, and it also makes our contribution to the web of data. So now some of the use cases I'll speak about
that we are considering so far. One, looking at the provider side, is to have a vocabulary that providers can use to identify their own references to entities, and also, for example, to link to our data. Another is crowdsourcing,
which was actually presented last year at SWIB: the work on annotations, where we open to all users the ability to annotate the records and contribute to a better description of the data. The hope is that users can pick entities
from this entity collection and actually use those, actual entities and not just keywords or names. Another is the Europeana Collections portal, which I already spoke about on the previous slide. And the last is the publication
and reuse for other projects that can make use of the information that we hold. All of this is powered by the entity collection, or at least we envision it to be. So here I want to show a couple of things
that we have on our roadmap and are going to implement. One of them is the annotation side. This is just an initial mockup; I'll show a bit more about this at the end. The idea is that, in the portal, I can type in keywords and, while I'm typing,
get a list of the entities that are in the entity collection. And since we have this information in the entity collection, together with the links from the data to it, we can also rank the list based on the number of items we have
that link to each entity. This makes it easier to surface the most relevant entities to the top and the less relevant ones to the bottom, which is one advantage of having it within our data. Another thing is the annotations,
for which, for example, I'm showing here the Pundit annotation client. Here is one of its functions where the metadata is being annotated and sources can be pulled in. You can see several of them; DBpedia is one, but there is also Europeana, so they can click and pick the entities
that they want to use to annotate the record, and this can be done at the provider side. Another example, this one from the Europeana Food and Drink project, is to power entity facets. So instead of having facets that are automatically built,
we can create our own facets based on the entity collection. The last thing is entity pages. This one was actually stolen from Google, but the idea is that we'll have a similar page where we gather all the information and showcase the objects that are most relevant
for that entity, and also take advantage of, for example, the family relations that we can get from Wikidata and all this kind of web-of-data information, in a page that showcases it all. This is also on our roadmap. So I'll speak now about how we choose our vocabularies. This was presented, again, last year,
and we defined a set of criteria to choose the vocabularies, within the Europeana Task Force on enrichment and evaluation. This is the list.
I'll not go through all of the criteria, but I'll highlight multilinguality, which for us is very important, because we have a portal and our ambition is to support 50 languages. Another is an open license, which is also very important for us.
Then there are the other aspects: technical availability and the quality of the representation of the data. The other criteria are less important in a way, but still important. So, starting with what we had done until now: for each of the four kinds of entities,
we have gathered from different sources. These are the ones that, for historical reasons, come from the semantic enrichment work, which will now move to the entity collection, and we will start to expand it to cover other vocabularies.
So, for example, for places we chose GeoNames, a natural choice. We also chose some specific feature types, because we don't want to have all of GeoNames in: we select populated places, some administrative regions, islands and other feature types,
and we gather these for European countries only. For agents, we chose DBpedia, selecting the DBpedia artists. We also filter out a lot of the artists, because we recognize that there are actually
a lot of artists, at least for Europeana, that we would not hold any information for: there are a lot of rap artists and all sorts of people that you can think of. So we have some filters that look for patterns in the descriptions and filter them out, so that we are left
with the ones that we feel we will probably have content about. We also don't look only at the English DBpedia: we collected them from the 45 language editions that match the 50 languages. One is missing from the ones that we support.
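The two steps just described, filtering out agents we are unlikely to hold objects for, and merging labels gathered from several DBpedia language editions, could be sketched like this. The exclusion patterns and sample data are invented for illustration; the talk only says that filtering is pattern-based on the descriptions.

```python
import re

# Hypothetical exclusion patterns; the real filters are not specified in the talk.
EXCLUDE = [re.compile(p, re.I) for p in (r"\brapper\b", r"\bwrestler\b")]

def keep_agent(description: str) -> bool:
    """Keep an agent unless its description matches an exclusion pattern."""
    return not any(p.search(description) for p in EXCLUDE)

def merge_labels(editions: dict) -> dict:
    """Merge {lang: label} pairs collected from DBpedia language editions,
    dropping empty labels."""
    return {lang: label for lang, label in editions.items() if label}

agents = [
    {"desc": "Austrian composer of the Classical era",
     "labels": {"en": "Wolfgang Amadeus Mozart", "de": "Wolfgang Amadeus Mozart"}},
    {"desc": "American rapper and record producer",
     "labels": {"en": "Some Rapper"}},
]
kept = [a for a in agents if keep_agent(a["desc"])]
print(len(kept))  # -> 1 (the rapper is filtered out)
```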
So we gather all of them, select the specific datasets, and then convert and normalize them on the ingestion side. In terms of concepts, we picked a couple of concept schemes. The concepts that we pick are actually dictated
by the needs of the Europeana Collections portal. Right now we have the music collection, and the art and history collections will dictate more concepts, and also concept schemes that we will bring in, identified by these collections.
So these are the ones that are expected to expand in a very short time. And there are also time spans, for which we use Semium Time, a vocabulary that was not initially developed for this, but then moved within Europeana and we took it on.
It mostly has chronological periods, but we're also looking at PeriodO, for example, for historical periods. This is around 3,500 chronological periods: basically things like 19th century, 18th century,
first quarter of the 18th century, things like that, and then the dates. So here, for example, is a look at the entity collection: in red you can see all the entities that are in the entity collection, divided by language, that is, the existence of a language description for each entity,
and in blue you can see the ones that are actually used. English is of course the one with the most; that's normal, because the English DBpedia is much wider than the others.
We still have some languages that are quite well represented, and entities described in almost all the languages that we support. You can also see the difference between the entities present in the entity collection and the ones that are effectively used. This may happen because the enrichment
simply didn't pick up an entity, which is natural; also, our enrichment is not that good, and we hope to improve it in a short time. But we also want to keep a bit of a buffer
of entities that we don't have any objects for, but may eventually have objects for in the future. We may still need the ability for, for example, the annotations or the enrichments to have those in the collection so that they can actually be picked.
So are these target vocabularies enough? They are not, for several reasons. One is that they don't have coreference information to other vocabularies, especially the domain and local vocabularies that come in from the data providers and the aggregators.
One of them, for example, is MIMO, the musical instrument vocabulary used, for example, in the Europeana Sounds project. Another thing is that labels and values are not always accurate and normalized. Wikipedia is a bit better, but DBpedia is really amazing: if you look at the dates, the forms in which dates are represented are very creative.
So we actually need a better representation for this; VIAF can be a good option. We're also missing a lot of information, for instance on professions, at least in a normalized way, because they are described in a very arbitrary way.
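The date problem just mentioned can be illustrated with a very rough normalizer. This is a hypothetical sketch, not Europeana's actual pipeline: it tries a few common spellings found in DBpedia-style data and emits an ISO 8601 date, leaving anything unrecognized untouched.

```python
from datetime import datetime

# Assumed candidate spellings; real DBpedia data is far more varied.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%d %B %Y", "%B %d, %Y"]

def normalise_date(value: str) -> str:
    """Try known date spellings; return an ISO date or the value unchanged."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return value  # leave unrecognised forms as-is for later curation

print(normalise_date("27 January 1756"))  # -> 1756-01-27
print(normalise_date("circa 1750"))       # -> circa 1750 (untouched)
```

Keeping unrecognized values untouched, rather than guessing, matches the curation-first approach described earlier: a human can still fix them later.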
And we need to expand the coverage to other types of entities, such as works and events, which we don't support so far. So we are investigating strategies for integrating new vocabularies that can improve the descriptions and the multilingual coverage, so that in that distribution
the blue part becomes a bit higher and more evenly distributed, and also to improve the linking between entities. We are also integrating alignments coming from alignment efforts that are typically done outside the entity collection but can be brought in as coreference information.
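Folding externally produced alignments in as coreference information amounts to clustering URIs that different efforts declare equivalent. A hypothetical sketch using union-find (all URIs below are invented placeholders, not real identifiers):

```python
# Union-find over URIs: cluster identifiers that alignments declare equivalent.
parent: dict = {}

def find(u: str) -> str:
    """Return the cluster representative for a URI."""
    parent.setdefault(u, u)
    while parent[u] != u:
        parent[u] = parent[parent[u]]  # path halving keeps chains short
        u = parent[u]
    return u

def union(u: str, v: str) -> None:
    """Record that two URIs refer to the same entity."""
    parent[find(u)] = find(v)

# Invented alignment pairs from two separate alignment efforts:
alignments = [
    ("urn:ex:europeana/agent/1", "urn:ex:dbpedia/AgentX"),
    ("urn:ex:dbpedia/AgentX", "urn:ex:viaf/123"),
]
for a, b in alignments:
    union(a, b)

# All three URIs now share one cluster representative.
same = find("urn:ex:europeana/agent/1") == find("urn:ex:viaf/123")
print(same)  # -> True
```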
We also want to support manual curation. At the beginning we expect only minor fixes, but ultimately we could also support the providers going in and actually changing the descriptions, if they want to or are willing to.
Another thing is to keep the information up to date, both the information that we're getting from the external sources and the data that we curate, without losing any information; we're looking at strategies to deal with this. So far, we have already minted Europeana URIs for all the entities
that we were using for enrichment, we did a mass re-index of our whole collection, and we are now pointing to Europeana entities. We also developed the Entity API, which is now in alpha version, and I'll show a bit more about that. We have started to make use of the API in the Collections portal,
initially with the entity suggest, where you can type in, see the entities, and then fire a query that will use the URI and not the labels. We will also implement support for new vocabularies and entity types. So this is the API that we have now.
For now it supports entity retrieval, returning the descriptive information for the entity, and also the suggest call, where you can type in; I'll show a bit about that. There are more methods that will be implemented,
for creation, update and deletion, and also for URI resolution: if I ask the API with a DBpedia URI, I'll get back the Europeana URI. Here is an example of the output of our API for a resource coming from DBpedia,
for which we minted a Europeana URI. You see, for example, the labels in 48 languages, and interlinking information that we hope will also point to Europeana entities. It's a bit cumbersome; for example, the place of birth
should ideally be only Salzburg, not Austria, because that information can be obtained by just dereferencing the referenced entity. Including it is somewhat verbose, and it can cause issues if it is actually wrong or duplicates the place.
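The URI-resolution behaviour described above, external URI in, Europeana URI out, can be sketched as a reverse index over the stored coreference links. This is a toy illustration, not the Entity API implementation, and every URI below is made up:

```python
# Minted Europeana URI -> coreferenced external URIs (all invented).
coreferences = {
    "urn:ex:europeana/agent/1": [
        "urn:ex:dbpedia/AgentX",
        "urn:ex:viaf/123",
    ],
}

# Reverse index: external URI -> Europeana URI.
reverse = {ext: eu for eu, exts in coreferences.items() for ext in exts}

def resolve(uri: str):
    """Return the Europeana URI for any known URI, or None if unknown."""
    if uri in coreferences:   # already a Europeana URI
        return uri
    return reverse.get(uri)

print(resolve("urn:ex:dbpedia/AgentX"))  # -> urn:ex:europeana/agent/1
```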
There are also coreference links to six other datasets. The links here don't show, but for Freebase, for example, you get all the links that you get from DBpedia. So here is the suggest call. Our suggest call is based on language, so we have an index for each language. Here, for example, is a search,
and you see all the entity types. The mock-up here is not the one that we'll use; we'll make some effort to make it better. The icons here are just to show where the information lies; we will not use them. This is just a test,
and once the user experience designer has worked on it, it will change to a more production-ready version. On the right you can also see what is output: we use some of the things from LDP, and you can also see how the entities are ranked.
By the way, for ranking we use two things: how many objects link to the entity, and also the Wikipedia click-throughs. We take that information, weight the two,
and then rank the entity, taking into consideration those two dimensions besides the text-matching score given by Solr.
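A hypothetical scoring function along the lines just described: combine the count of linked objects with the Wikipedia click-through count, weight the two, and fold in the text-match score from Solr. The weights, the log damping, and the way the scores are combined are all assumptions made for the sketch; the talk only says the two signals are weighted together.

```python
import math

# Assumed weights; the real values used by Europeana are not stated in the talk.
W_USAGE, W_CLICKS = 0.6, 0.4

def entity_score(text_score: float, linked_objects: int, clickthroughs: int) -> float:
    """Boost a text-match score by a popularity signal built from
    object-link counts and Wikipedia click-throughs."""
    usage = math.log1p(linked_objects)    # log damping avoids huge counts dominating
    clicks = math.log1p(clickthroughs)
    popularity = W_USAGE * usage + W_CLICKS * clicks
    return text_score * (1.0 + popularity)

candidates = [
    ("entity-A", entity_score(1.0, 5000, 120000)),  # heavily used and clicked
    ("entity-B", entity_score(1.0, 3, 10)),         # barely used
]
best = max(candidates, key=lambda c: c[1])[0]
print(best)  # -> entity-A
```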
So, concluding the presentation: we started with a strategy for entities; we were missing this, and it is really a must for Europeana. There is no one-size-fits-all vocabulary: you will need to look at many vocabularies, not just Wikidata, not just VIAF, but other vocabularies as well.
And we still have a long way to go, but hopefully we are making progress, and we also take on a lot of things that we hope will be developed in other projects. That's it.
We have time for a couple quick questions or one question before we break for lunch. Does anyone have a question?
Thank you. It's exciting to see work on creating an entity database that you can refer to from Europeana projects and from elsewhere. One short comment: in annotating textual information
with mentions of entities, we at the National Library of Latvia also saw the same need for an entity database to refer to. And a question for you: is this database also available in other forms?
Say, if I wanted to explore what is inside it, is there a data dump available, something we can play with? At the moment, we only have these two methods. We could make available the data that we have, but actually the data that we have so far
is mostly what we harvest from other sources; for now we are not improving it, so it may perhaps become a good resource for you in the future. But we can make it available as dumps,
and we could do that if you're interested. For now, we just have these two methods. We will also have methods for general search and for going through the whole collection, but those will come later,
and we will generate dumps as well, just not at this moment. All right, well, thank you very much for this presentation. We'll give him another round of applause.