
AUTOMATING LOD - Transformations for aggregating Linked Open Data


Formal Metadata

Title
AUTOMATING LOD - Transformations for aggregating Linked Open Data
Title of Series
Number of Parts
16
Author
Contributors
License
CC Attribution - ShareAlike 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.
Identifiers
Publisher
Release Date
Language
Production Place
Bonn, Germany

Content Metadata

Subject Area
Genre
Abstract
Linked Open Data is usually provided as-is. Institutions make choices about how to model their data, including properties, blank nodes and URIs, for valid reasons. If you want to combine data, there are generally two options: 1) run distributed queries and inference over the data, or 2) aggregate the data into a new, single endpoint. Distribution enables the use of all available data structures; aggregation enables easier-to-use data and better performance. For aggregation it is good practice to apply transformations to obtain the needed convenience. We give an overview of the transformation types needed, learned in the AdamNet Library Association project AdamLink, a collaboration of the Amsterdam City Archives, Amsterdam Museum, University of Amsterdam Library, Public Library of Amsterdam and International Institute of Social History. The objective is to create a linked open data infrastructure connecting the member institutions' collections on the topic of "Amsterdam", targeted at reuse by researchers, teachers, students, the creative industry and the general public. We discuss the (dis)advantages of creating an aggregation vs. distributing queries. Every transformation type should solve a distribution problem to be useful, but transformation probably reduces the querying options on the data; we therefore need to find the best trade-off between complexity and usability. An interesting option to investigate is a caching-node mechanism, which could combine the best of both worlds. We distinguish six types of transformation: mapping ontologies; mapping and adding thesauri and authority lists; mapping and adding object types; adding our own statements; restructuring data; and data typing. We illustrate the transformations with real examples. We also discuss the issues with feeding the enriched data in the cache or aggregation back to the original data sources.
Lecture/Conference
Transcript: English (auto-generated)
Hello? Ah, okay, this one works. Good. Hello. We made it. I'll introduce myself: my name is MJ Suhonos, I'm part of the program committee, from Ryerson Library in Canada.
We've made it to the last afternoon, so hang in there. This morning we heard a lot about what happens when we have people working on linked open data, and this afternoon we're going to switch and talk about what happens when we use machines and automation for linked open data.
So we'll start with the first talk, from Lukas Koster and Ivo Zandhuis of the University of Amsterdam, who will tell us about transformations for aggregating linked open data.
Good afternoon. So this will be a two-part presentation. I will start first. My name is Lukas Koster. I work for the Library at the University of Amsterdam, and this is Ivo over here.
He will take over. He is a self-employed consultant, and he was hired as an independent external project manager for this particular project. So the title we submitted was Transformations for Aggregating Linked Open Data, which we decided to change a bit into Transformations for Aggregated Linked Open Data. I'm not sure why that was, but there was some subtle difference. Right, so I will say something about the background and the context of this project
and why we're trying to do this, and then Ivo will take over and explain the actual transformations that we think were necessary. So this is a project based in Amsterdam, organized by the AdamNet Foundation; Adam is short for Amsterdam. It originally was a library consortium, a sort of collaboration organization, which currently consists of 33 member institutions, very different in nature: large and small museums like the Rijksmuseum, a university library like mine, the public library, archives, the city archive, et cetera; a very diverse and heterogeneous collection. The goal of this project was, instead of just focusing on the traditional library book lending, which was the original objective of this foundation, to see if we can make use of the digital heritage collections that a lot of the member institutions also have. The museums have lots of stuff online already, like the Rijksmuseum, everybody knows that example, and the university library also has a large special collection. The goal was to try to connect all these collections on the data level, using linked data, on the topic of Amsterdam: everything about Amsterdam, which can be anything. And the objective was to only do this on the data level, so there were no applications involved.
Building applications was not part of the project, although of course we needed some kind of showcases and pilots to show that it actually works. The target audiences are also very diverse: researchers, higher and primary education, so teachers and students, the general public, people who might be interested in making apps, et cetera, and the creative industry. This also means that we want to make what we deliver as usable as possible; basically, before we heard about the term LOUD, linked open usable data, I think we were already trying to reach that goal. We started off with an initial collection of five participating institutions: the Amsterdam Museum, the City Archives, the International Institute of Social History, my university and the public library. They all have very different kinds of image databases, texts, books, et cetera; everything you can imagine, we have there. So we needed to make a selection of what to put in the first year of the project.
We chose Triply, a linked open data platform, as the platform where we would bring all this together. Triply is not only a triple store; it also gives you SPARQL endpoints and APIs, and you can manage the data between different institutions, et cetera.
And they offer a hosted solution, which was interesting for us, because the AdamNet Foundation doesn't have staff of its own at the moment; they're working on it. So we did that, and we could do this because we had a grant from the Netherlands PICA Foundation for one year, and we're very grateful for that. I think it's all gone now, right? So, again, I think this has been the topic of a number of talks here: the challenge, as Mia said, of purity against practicality,
and others have also mentioned this usability, or the lack of it, for developers, et cetera. If you are very pure and do linked data as it was intended, you have everything distributed, but there are so many problems; we have seen that already. So we have a number of reasons here. Often there is no live linked open data at all: none of the participating institutions had working linked data already, except the Social History Institute, which was working on their own Triply instance. There are so many different vocabularies, ontologies, data types and levels of data quality, and joining the data between these collections is very complicated. And, as we've seen in Nigeria, you can have performance issues, or no connection at all. One of the requirements of this PICA grant was that we had to follow, as much as possible,
the Dutch national program NDE, the Digital Heritage Network. This is an image of what they think the whole Dutch digital heritage network should look like. At the bottom are the local collections providing linked data. The green middle is the knowledge graph: registry information on datasets and organizations. And the top is something like service portals, which are thematically organized. You see on the side there is still a dotted box, the aggregator, because as long as we don't have all these locally distributed linked data servers, there is still a need for aggregators. And basically what we have been doing is a combination, I guess, of the aggregator function and the service portal, at this moment.
A service portal is something that NDE describes as a service platform that combines and enriches heritage information and makes it usable in a specific context. Similarly, Ruben Verborgh, whom you might know, said at the ELAG conference something about aggregators moving in the direction of being transparent layers and nodes in a network of nodes, because of performance issues, et cetera. So I think these are the things that we took into account, and this looks a lot like what you've seen before in a couple of other presentations.
Down there we have the local legacy catalogs and databases. We transform from, for instance, MARC, or whatever there is, into RDF, and import it into Triply, where the collections are individually recognizable as separate datasets, but we also combine them into one dataset. At the top you have endpoints, but also APIs, et cetera. This is where you can find the result at the moment. At the top left is the combined dataset with all the separate aggregated datasets. On the right is a geography authority file that we have created ourselves, because the main linking points are locations in Amsterdam: streets, buildings, et cetera. It's becoming a new authority file of its own. And down there are several individual collections from the different partners.
This is a little bit about the workflows that we have used; all these different partners have very different systems. Down left we have an image database with MARC records from our ILS, which also originate from WorldCat. They have to be converted into RDF, which Ivo will tell you more about after my part. Museums, in the Netherlands at least, tend to use Adlib databases, with completely different formats and data quality. So there are individual workflows for each to turn legacy data into workable, reconciled RDF in the middle.
So, the linking points that we have identified: the locations I mentioned, and you can see a screenshot down at the bottom of adamlink.nl. You can go there; I think it's still in Dutch, but you can find everything about locations there, and also information about people. On the right, as you can see, we also use Wikidata URIs to identify sameAs relationships, and other governmental URIs. We also do people, types, so object types, and subjects.
And I think Ivo will take over now and explain how this works in reality. Thanks, Lukas. I think this list of street names and locations in Amsterdam was the main result of our project. You could have all these different datasets from the different cultural heritage institutes, but we had to link them together; it's linked open data, right? So, for that reason, we introduced our own list of street names in Amsterdam, including street names that no longer exist. And we linked that up to the governmental administration and to Wikidata, but also to geometry, and that made it possible to do all kinds of stuff on maps.
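A minimal Turtle sketch of what such a street entry could look like; all identifiers are invented, and whether AdamLink expresses geometry with GeoSPARQL is an assumption here:

    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix geo:  <http://www.opengis.net/ont/geosparql#> .

    # An AdamLink street URI (id invented), labelled, linked to Wikidata,
    # and carrying a geometry so it can be drawn on a map.
    <https://adamlink.nl/geo/street/keizersgracht/123>
        rdfs:label "Keizersgracht" ;
        owl:sameAs <http://www.wikidata.org/entity/Q99999> ;  # Wikidata id invented
        geo:hasGeometry [ geo:asWKT "LINESTRING (4.884 52.373, 4.890 52.366)"^^geo:wktLiteral ] .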
But if we wanted to make one dataset out of all the various datasets that we identified and that wanted to participate in this project, then we had to transform them into one very simple dataset, so that people can use it. It's the same thing a social scientist does in his research: he connects all kinds of datasets, imports them into one small set, transforms it, makes visualizations, and then writes his paper on the data. So we're not the only people doing this. We needed one endpoint with very simple data, so we worked on that for months, doing it over and over again, until we reached a point where it was very good to use for lots of people.
And afterwards we were thinking: well, if we reflect on that, what did we actually do to get this transformation right? There were six things that we did. But first I'll show you an example, of this canal in Amsterdam where you could skate. You can skate on the canals in Amsterdam; that was a very famous thing last winter, I guess. And this is the idea that we created: we wanted to have as many URIs as possible, no strings, no literals, but as many URIs as possible. Then we could link up the data and do the stuff that we wanted to do.
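A minimal Turtle sketch of a record in that spirit, with URIs wherever possible; all identifiers and the title are invented for illustration:

    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    @prefix dct:  <http://purl.org/dc/terms/> .
    @prefix edm:  <http://www.europeana.eu/schemas/edm/> .

    <https://example.org/object/123>                        # object URI invented
        a edm:ProvidedCHO ;
        rdfs:label "Skaters on a frozen canal" ;            # title invented
        foaf:depiction <https://example.org/image/123.jpg> ;
        dct:spatial <https://adamlink.nl/geo/street/keizersgracht/123> ;  # street URI from the sketch above
        dct:creator <https://adamlink.nl/data/person/456> . # person URI invented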
So, reflecting on that, we did six things. The first thing is ontology alignment. Well, you could call it a mapping. We got data in the Europeana Data Model, we got data in Dublin Core format and in schema.org format, and we had to reconcile that into one format. We didn't choose one ontology, but mixed up all kinds of different parts. So we used rdf:type, foaf:depiction and rdfs:label; those three were forced upon us by the application, which says: if you have these three things in your record, then we can do a simple visualization of your data record. And besides that, we used a lot of Dublin Core terms, especially dcterms:spatial, where we put in the streets that we introduced, and did some dating, obviously.
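A hedged sketch of this kind of property mapping as a SPARQL CONSTRUCT; the source properties vary per institution, and the exact set AdamLink mapped may differ:

    PREFIX dc:     <http://purl.org/dc/elements/1.1/>
    PREFIX schema: <http://schema.org/>
    PREFIX rdfs:   <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX foaf:   <http://xmlns.com/foaf/0.1/>

    CONSTRUCT {
      ?cho rdfs:label ?title .      # the unified label the viewer needs
      ?cho foaf:depiction ?img .    # the unified image link
    }
    WHERE {
      { ?cho dc:title ?title } UNION { ?cho schema:name ?title }
      OPTIONAL { { ?cho foaf:depiction ?img } UNION { ?cho schema:image ?img } }
    }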
So that was one thing: we had to do some shifting of properties. The second thing was that we had to do some alignment of the authorities. Some of the datasets from the museums had a big list of artists, a Dutch list called RKDartists, but the library obviously had VIAF, and there was no alignment between those two lists. We had to do that ourselves, and some of the persons were in Wikidata as well. So we introduced a new URI for every person that was in our... Yeah, that was my question as well. So I asked: do we have to? Yeah, there was not one system that contained all the persons that we needed. So we introduced our own URIs and hooked all the stuff up to those. We'll see how that works in the future.
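A Turtle sketch of that idea; the vocabulary URIs are real, but every identifier and the name are invented:

    @prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix owl:    <http://www.w3.org/2002/07/owl#> .
    @prefix schema: <http://schema.org/> .

    <https://adamlink.nl/data/person/456>       # new AdamLink person URI (id invented)
        a schema:Person ;
        rdfs:label "Jan Jansen" ;               # name invented
        owl:sameAs <https://rkd.nl/explore/artists/99999> ,      # RKDartists (id invented)
                   <https://viaf.org/viaf/99999999> ,            # VIAF (id invented)
                   <http://www.wikidata.org/entity/Q99999999> .  # Wikidata (id invented)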
Subjects were mostly spatial; that's the main thing that we wanted to elaborate on. And for types we could use the Art & Architecture Thesaurus. That was fine for us, and some of the institutions used it already, so that was easy. So, as I said already, we reused the AdamLink URIs for creators and for subjects as well: if a person was depicted in a painting or something, then we could use that URI as well. Okay, thanks. All the records we typed as ProvidedCHO, the cultural heritage object class of the Europeana Data Model. So we had to do that in the rdf:type, and we used dc:type with the Art & Architecture Thesaurus for differentiating between the different types of cultural heritage objects that we had. We chose schema:Person instead of foaf:Person, which was a choice. And we had our own... In the Netherlands there's this ontology on places in time, called Histograph, and we reused its concepts and properties for street, building and district.
So our idea was that we created a new dataset: we transformed all those different datasets into one big new dataset, and for that reason we wanted to give that dataset a name and state that the newly created records are in this particular dataset. So we used void:inDataset to do that, and some other DCAT and VoID statements to describe the datasets. We had to add those statements to the data that we already had.
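A sketch of those added statements; the dataset URI and title are invented:

    @prefix void: <http://rdfs.org/ns/void#> .
    @prefix dct:  <http://purl.org/dc/terms/> .

    <https://example.org/object/123>
        void:inDataset <https://adamlink.nl/data/dataset/1> .   # membership statement

    <https://adamlink.nl/data/dataset/1>
        a void:Dataset ;
        dct:title "AdamLink aggregated collections" .           # title invented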
Sometimes we had to restructure the data. The Europeana Data Model can be very elaborate, and we wanted to have it as simple as possible. So if you want to reach the depiction of a cultural heritage object in EDM, you have to follow a path through the data, and we restructured that into one direct foaf:depiction.
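A hedged sketch of that flattening as a SPARQL CONSTRUCT, assuming the usual EDM path from the aggregation to the image; the exact path in the source data may differ:

    PREFIX ore:  <http://www.openarchives.org/ore/terms/>
    PREFIX edm:  <http://www.europeana.eu/schemas/edm/>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>

    CONSTRUCT { ?cho foaf:depiction ?img . }   # one direct hop
    WHERE {
      ?agg a ore:Aggregation ;
           edm:aggregatedCHO ?cho ;            # the object the aggregation is about
           edm:isShownBy ?img .                # the image, reached via the aggregation
    }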
And data typing: we were not there yet, but we noticed that some of the data we got was in string format while it was actually a date, or in the wrong integer format. So that is something we should still do, to get a better dataset. But on the other hand, maybe we could stimulate our providing institutes to do that in a better way themselves.
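A sketch of what such a data-typing step could look like; the property and the date pattern are illustrative:

    PREFIX dct: <http://purl.org/dc/terms/>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

    CONSTRUCT { ?cho dct:date ?typed . }
    WHERE {
      ?cho dct:date ?raw .
      FILTER(datatype(?raw) = xsd:string)                       # plain string values only
      FILTER(regex(str(?raw), "^[0-9]{4}-[0-9]{2}-[0-9]{2}$"))  # already ISO-shaped
      BIND(STRDT(str(?raw), xsd:date) AS ?typed)                # recast as xsd:date
    }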
So, in the discussion of decentral versus central: we want all the participating institutes to include the URIs themselves, in their own datasets. Then we only need to do the transformations that make the dataset easier for people to use. To give one example of that use: we had a week with students building, on top of the SPARQL endpoint, all kinds of different examples of what you can do if you have this data available, with queries like the one sketched below.
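For instance, a query of the kind the aggregated endpoint makes easy; the street URI is the invented one from the sketches above:

    PREFIX dct:  <http://purl.org/dc/terms/>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    # All objects located on one street, with a label and an image to show.
    SELECT ?cho ?label ?img
    WHERE {
      ?cho dct:spatial <https://adamlink.nl/geo/street/keizersgracht/123> ;
           rdfs:label ?label ;
           foaf:depiction ?img .
    }
    LIMIT 50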
And they made over 30 different application prototypes on top of it. So we were very happy to see that the data could be used, and in all kinds of different ways. I think we made progress in that sense, and in the sense that we convinced a lot of directors of cultural heritage institutes that this is the way to go, and that they have to implement their own linked data infrastructure to provide linkable data to us. So that's all. And we have a URL where you can download a document that describes the things we did. Thank you.
Maybe two if they're quick. Yes. Let's see if this works.
Hello. Thank you for the presentation. At the BnF we are working on the problem of places in time, and I saw that you did stuff in that area. I'd like to know more about the way you coin a name, a street name. You talked about street names in history that don't exist anymore, and how you managed that. And I didn't hear the name of the ontology for places in time very well. Cool. So we hired a programmer, a data scientist actually,
who collected all the data that was available on old streets in Amsterdam, which was in all kinds of formats, and maybe even on paper. And he made a database containing all that data; it contains over 6,000 streets, small places, all kinds of stuff, with different naming through time and, where we know it, the geography, a geometry. But it's work, just a lot of work: calling people, asking "Do you have a dataset on streets? Cool, can I have it? Then you participate in this hugely important program." Yes. As a project team we also manually edited and entered information there. It's a lot of work, but it's also fun to do, to check out historical sources: this street was called that before 1649, or something.
And you can also enter the separate geometries that may have existed for the same street name. You can find it online, on adamlink.nl. Yeah, actually. And the second question: the name of the ontology is Histograph. It's somewhat hidden; it was created in another project somewhere in the Netherlands, but you can download it from rdf.histograph.io. If you put that in your browser, you download the TTL file with the definitions. It could be interesting to make that a more international standard. Does that answer your question? Wonderful. Thank you again.