
Swissbib Goes Linked Data


Formal Metadata

Title
Swissbib Goes Linked Data
Number of Parts
16
License
CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Abstract
The project linked.swissbib.ch aims to integrate Swiss library metadata into the semantic web. A linked data infrastructure has been created to provide, on the one hand, a data service for other applications and, on the other hand, an improved interface for the end user (e.g. a searcher). The workflow for the development of this infrastructure involves five basic steps: (1) data modelling and transformation into RDF, (2) data indexing, (3) data interlinking and enrichment, (4) creation of a user interface and (5) creation of a RESTful API. The project team would like to highlight some challenges faced during these stages and the means found to solve them. This includes, for example, the conception of various use cases for innovative semantic search functionalities, which provide specifications for data modelling, data enrichment and the design of the search index. Data processing operations such as transformation and interlinking must be highly scalable, with the aim of integrating them into the workflow of the already existing system. Wireframes were created to allow early usability evaluations. Finally, negotiations have been undertaken with the various Swiss library networks to adopt a common open license for bibliographic data.
Transcript: English(auto-generated)
The next talk is by Nicolas Prongué, who is working at the University of Applied Sciences in Geneva, and by Felix Bensmann from the GESIS Leibniz Institute for the Social Sciences. They will introduce the Linked Swissbib project.
Well, the floor is yours. Okay. Hi, everybody.
Hi, again. It's a pleasure for us to be here in Bonn for SWIB 2016 to present the project linked.swissbib.ch. I am Nicolas from Geneva, and I am here with Felix from Cologne, and we're going to talk about the transition of Swissbib into a linked data infrastructure. So you may already know Swissbib.
At least I hope you do, because it's cool. Swissbib is the meta-catalog of all the academic libraries and library networks in Switzerland. It includes about 15 institutions, basically the library networks in Switzerland, and more than 20 million bibliographic records. The particularity of Swissbib is that it doesn't own any data. It just harvests the data from other institutions and deduplicates them.
So if Swissbib wants to perform any data operations, those operations have to be done on a daily basis, and they have to be fully automated. Swissbib is an evolving interface with new content and new functionalities.
For example, new content comes aboard through the integration of new digital repositories, and new functionalities come with projects like Linked Swissbib.
Linked.swissbib.ch had the objective of making Swissbib linked data compatible, and it has two concrete objectives. The first one is to create a RESTful API so that the data of Swissbib are made openly available for computer clients. And the second objective is to build on this linked data a new, improved interface with added value for the end users. It is a small project which will last two and a half years, but we still have
a few partners. The main one is the Basel University Library, which maintains the Swissbib Classic project. We usually say Swissbib Classic for the traditional interface and Linked Swissbib for the new linked data project. We also have two universities of applied sciences in Switzerland, one in the French part, in Geneva, and we collaborate with GESIS, which works on the question of interlinking.
And this project is financed with the support of swissuniversities. Okay, so after two years of work, what do we have now? We have a great new interface with a home page, and on this home page we have this
search field with two tabs, and this Persons tab here, for authors, is the new tab of the search. In this search field, we created an auto-suggest function, which not only suggests books, but also authors and subjects. And this auto-suggest function is fed with linked data.
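To make that idea concrete, here is a minimal sketch of how such an auto-suggest over books, authors and subjects could be backed by an Elasticsearch completion suggester. The index name, fields and sample documents are invented for illustration; the talk does not describe the actual implementation in this detail.

```python
# Hypothetical sketch: backing the search-field auto-suggest with an
# Elasticsearch completion suggester over books, authors and subjects.
# Index and field names are illustrative, not the project's actual schema.
import requests

ES = "http://localhost:9200"
INDEX = "suggestions"

# One suggest field shared by all entity types; "type" tells the UI
# whether a hit is a book, an author (person) or a subject.
requests.put(f"{ES}/{INDEX}", json={
    "mappings": {
        "properties": {
            "suggest": {"type": "completion"},
            "type": {"type": "keyword"},
        }
    }
})

docs = [
    {"suggest": ["Robert Walser", "Walser, Robert"], "type": "person"},
    {"suggest": ["Der Spaziergang"], "type": "book"},
    {"suggest": ["Swiss literature"], "type": "subject"},
]
for doc in docs:
    requests.post(f"{ES}/{INDEX}/_doc?refresh=true", json=doc)

# What the UI would call while the user is typing "rob..."
resp = requests.post(f"{ES}/{INDEX}/_search", json={
    "suggest": {"typeahead": {"prefix": "rob", "completion": {"field": "suggest"}}}
}).json()
for option in resp["suggest"]["typeahead"][0]["options"]:
    print(option["_source"]["type"], "->", option["text"])
```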
We also created web pages for persons. Here you see the page about Robert Walser, where we find additional information about this person. This information comes from the linked open data cloud, and it passes through our interlinking processes and workflow in Linked Swissbib.
The same information can also be displayed in the form of a knowledge card. This is a kind of pop-up window you can open from a search result list, and it gives a brief overview of the main information about a person or a subject.
Okay, so I will hand over to Felix, who is going to explain to you how we achieved it. Hi, also from me. I'm here for the technical part, and I want to give you a small introduction
into our architecture. Nicolas already presented the user interface and the RESTful interface, and both are served by an Elasticsearch index that you can see here. With this architecture, we want to connect the classic Swissbib with the new Linked Swissbib. On the side of classic Swissbib, we have the CBS, the Central Bibliographic System, which collects all the metadata from the connected libraries and serves it as MARC data, and in order to index it, we use Metafacture here. In this Metafacture pipeline, we take the MARC data, convert it to RDF according to a schema that we developed for this, and thereby we split the monolithic records into various bibliographic concepts: a bibliographic resource that represents the literature, and also a kind of document that stores metadata about how these records
came into the system and what happened to them. Then we have data about the items, the physical representations of our resources, and then we have organizations that are authors, and additionally persons that are authors.
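As an illustration of that splitting step only (the project does this with Metafacture and its own schema, not with the code below), a rough Python/rdflib sketch with invented field names and a placeholder namespace could look like this:

```python
# Illustrative only: splitting one monolithic catalogue record into linked
# RDF resources (bibliographic resource, item, person). The real project
# does this with a Metafacture pipeline and its own schema; names below
# are invented for the sketch.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, DCTERMS, FOAF

BASE = Namespace("http://example.org/resource/")   # placeholder namespace
BIBO = Namespace("http://purl.org/ontology/bibo/")

record = {  # a flattened record as it might come out of MARC
    "id": "123",
    "title": "Der Spaziergang",
    "author": "Walser, Robert",
    "holding": "Basel UB, shelf mark ABC 123",
}

g = Graph()
work = BASE[f"bib/{record['id']}"]
item = BASE[f"item/{record['id']}-1"]
person = BASE["person/" + record["author"].replace(", ", "-")]

# bibliographic resource describing the literature itself
g.add((work, RDF.type, BIBO.Document))
g.add((work, DCTERMS.title, Literal(record["title"])))
g.add((work, DCTERMS.contributor, person))

# physical item pointing back to the resource it belongs to
g.add((item, RDF.type, BASE.Item))  # placeholder class for a physical item
g.add((item, DCTERMS.isPartOf, work))
g.add((item, DCTERMS.description, Literal(record["holding"])))

# person split out as its own resource, ready for interlinking
g.add((person, RDF.type, FOAF.Person))
g.add((person, FOAF.name, Literal(record["author"])))

print(g.serialize(format="nt"))
```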
For the person authors, we took them as the first step for interlinking and enrichment. So we take these persons and interlink them with DBpedia and VIAF, and then we use the links that we get to extract additional information from these big corpora. We pack it all together and then index it
into our Elasticsearch index as well and make it available for search for our users.
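A minimal sketch of that final indexing step, assuming a hypothetical lsb_person index and invented field names (the real document structure is defined by the project's own Elasticsearch mapping):

```python
# Sketch (hypothetical index and field names): pushing enriched person
# resources into Elasticsearch with the _bulk API.
import json
import requests

ES = "http://localhost:9200"

persons = [
    {
        "id": "person-1",
        "name": "Walser, Robert",
        "birthDate": "1878",
        "sameAs": [
            "http://dbpedia.org/resource/Robert_Walser",
            "http://viaf.org/viaf/XXXXXXX",   # placeholder, not a real identifier
        ],
    },
]

# The bulk body alternates an action line and a document line (NDJSON).
lines = []
for p in persons:
    lines.append(json.dumps({"index": {"_index": "lsb_person", "_id": p["id"]}}))
    lines.append(json.dumps(p))
body = "\n".join(lines) + "\n"

resp = requests.post(f"{ES}/_bulk", data=body,
                     headers={"Content-Type": "application/x-ndjson"})
print(resp.json()["errors"])  # False if every document was indexed
```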
So far for the architecture. For the next steps, I picked three challenges that I want to introduce in a bit more detail, starting with how to present linked data to our users, then how to do the linking and enrichment while dealing with really large files, and, for the third part, a bit about license negotiations. That's the part where Nicolas will take over again. Okay, the linked data user interface. This work was actually carried out by our partners from the HTW Chur. We are not the first project offering our content as linked data.
There are a few examples that we had the chance to take for orientation. Here's a small list, starting with data.bnf.fr and many, many others. We went through a prototyping phase, we had some user tests, and then we came up with three concepts
that we want to expose as linked data for our users. All of this also works for subjects here. Subjects are taken directly from the GND; this has nothing to do with the pipeline that you saw on the previous slide.
We created a search with an auto-suggest implementation to help the users search in the data. For the search, we also use, for example, the enriched data, let's say pseudonyms that are not in the original Swissbib data.
We also offer alternative names here, and when we display the data, we also show information about spouses, partners, and the movement the respective author was in.
Yeah, those are things that we display. I want to say a few more words about why we chose Elasticsearch. Usually, you would expect us to take a triple store with a SPARQL endpoint in order to provide our data to our users, but we think our users are used to Swissbib Classic, which is already very elaborate, and we want to adopt these principles and use them to give our users the known user experience. So we focused on how to provide good performance, and that's why we used Elasticsearch and applied a lot of background operations, like loading content in the background.
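To illustrate that design choice, here is the kind of query the interface can send to the index instead of a SPARQL query; the index and field names are hypothetical, and the boost values are just an example.

```python
# Sketch of the kind of query the interface can run against the index
# (hypothetical index/field names): a plain full-text search over names,
# including enriched alternative names and pseudonyms, with no SPARQL needed.
import requests

ES = "http://localhost:9200"

query = {
    "query": {
        "multi_match": {
            "query": "robert walser",
            "fields": ["name^2", "alternativeName", "pseudonym"],
        }
    },
    "size": 10,
}
hits = requests.post(f"{ES}/lsb_person/_search", json=query).json()["hits"]["hits"]
for h in hits:
    print(h["_score"], h["_source"]["name"])
```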
Good. For the next challenge: the interlinking and enrichment. At this point, we have the Swissbib corpus. It has an update frequency of once a day, and at this point it mainly stores information about persons. It's about seven gigabytes large. We also have DBpedia; the corpus that we took, a subset of what is available from DBpedia, is about 35 gigabytes large. And we have VIAF, which is about 80 gigabytes large. In order to interlink, you would have to compare every resource that is available in Swissbib
to every resource that is available in DBpedia and VIAF, but since this is so much data, we can't do this in one run, and we have to think about how to manage this. So we use LIMES, a tool made for interlinking that focuses on doing it in a very performant way, but still we would have a really large memory footprint, and it would take ages, if it came to an end at all. So what we do here is some kind of preparation.
The idea is: since DBpedia and VIAF update, let's say, once a month, we have at least 29 days where we don't have to preprocess them but still want to link. So we only have to do the pre-processing for DBpedia and VIAF once, but for Swissbib every day, and that suits us well because Swissbib is the smaller corpus here. For this we rely on two concepts. One is sorting. We get our RDF content as a list of statements, in this case as N-Triples, and when we sort them alphabetically, all the statements of a resource are kept together and in a certain order. That helps us with shaping: we can identify duplicate entries and remove them, or we can extract certain kinds of resources. It also helps us with the alignment at the end. When we want to extract the enrichment, we can take our links one by one and search in the reference corpus, and we don't have to scan the whole corpus for every link. We can take the first link, search it, find it, take the second link, and go on from the point in the file where we have been, which reduces the effort we have to make here.
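The following is a small sketch of that sorted-scan idea, assuming the linked target URIs and the reference corpus are both already sorted the same way; the file names are placeholders, and real N-Triples handling is of course more involved.

```python
# Sketch of the sorted-scan idea: with the linked target URIs and the (huge)
# reference corpus both sorted, one sequential pass over the reference file
# is enough to pick up the enrichment statements for every link.
def subject(line: str) -> str:
    # an N-Triples statement starts with its subject URI
    return line.split(" ", 1)[0]

def extract_enrichment(sorted_targets_path: str, sorted_corpus_path: str):
    """Yield every statement of the reference corpus whose subject is one of
    the linked target URIs. Both files must be sorted consistently."""
    with open(sorted_targets_path) as t, open(sorted_corpus_path) as corpus:
        targets = [line.strip() for line in t if line.strip()]
        stmt = corpus.readline()
        for uri in targets:                       # links in ascending order
            while stmt and subject(stmt) < uri:   # skip forward, never back
                stmt = corpus.readline()
            while stmt and subject(stmt) == uri:  # collect this resource
                yield stmt.rstrip("\n")
                stmt = corpus.readline()

# for s in extract_enrichment("links_targets_sorted.txt", "dbpedia_sorted.nt"):
#     print(s)
```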
For the second part, we use blocking. This is not a new concept; it's already a feature in Silk, for example. The idea is that we split our data into small packages so that we don't have to compare these big corpora at once. We just compare the packages, but we still have to compare them crosswise, as you can see here on the left-hand side. But we can take this packaging a step further and use it as blocking. Since we know we want to interlink by comparing first names, last names and birth dates, we can help our algorithm by doing some kind of pre-grouping. In this case, we group all the persons whose last name starts with a certain letter into one block, and we do this on both sides, in both corpora, and then we only have to compare the corresponding blocks, and that helps us with the complexity.
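Purely as an illustration of that grouping idea (the actual link discovery is done with LIMES; the fields and the matching rule below are toy placeholders), a sketch could look like this:

```python
# Illustration of the blocking idea: group person records by the first
# letter of the last name on both sides, then compare only matching blocks.
from collections import defaultdict

def block(persons):
    blocks = defaultdict(list)
    for p in persons:
        key = p["lastName"][:1].lower() or "?"
        blocks[key].append(p)
    return blocks

def match(a, b):
    # toy rule: identical last name, first name and birth year
    return (a["lastName"].lower() == b["lastName"].lower()
            and a["firstName"].lower() == b["firstName"].lower()
            and a["birthYear"] == b["birthYear"])

def interlink(swissbib_persons, reference_persons):
    left, right = block(swissbib_persons), block(reference_persons)
    for key in left.keys() & right.keys():        # only corresponding blocks
        for a in left[key]:
            for b in right[key]:
                if match(a, b):
                    yield a["uri"], b["uri"]      # candidate owl:sameAs link

links = list(interlink(
    [{"uri": "sb:1", "lastName": "Walser", "firstName": "Robert", "birthYear": "1878"}],
    [{"uri": "dbp:Robert_Walser", "lastName": "Walser", "firstName": "Robert", "birthYear": "1878"}],
))
print(links)
```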
Okay, so when we have done this, we have this workflow here, starting with the import: when we get our data in, let's say, JSON-LD, we have to transform it into N-Triples or whatever. Then we sort the data, as I described, we shape away all unused data, we block, and then we can start with the linking. We link against DBpedia and then, again, against VIAF, and then comes the enrichment part: we take our links to identify the resources in DBpedia and VIAF, extract the data, and then merge it all together. For the merge part, the sorting helps again, because when both corpora are sorted, everything merges nicely together. Okay, and if we do that and assume that DBpedia and VIAF are already preprocessed, because we are not at the beginning of the month, then we have a total processing time of around three and a half hours, and that is suitable for daily updates.
Okay, that's it so far for the technical part, and I hope it fit within the timeframe. Just a word about license negotiations. At the beginning of the project, we focused on results, so we didn't want to start with license negotiations with the library networks,
but we soon realized that it would take time. So we made an open data call, where we gathered 69% positive responses and decisions in principle for a CC0 license. Then we held an open bibliographic data workshop with those institutions, and after that, so today, we are at about 85% of decisions for CC0. The objective of this workshop was to create declarations for those data that are available online,
that is, an online web page that says our data are available under these terms of use. A small outlook, because there is still work to do in this area: we will have to assign persistent identifiers for the meta-catalog, which is not an easy task, because the records of Swissbib change every day. We will have to move this infrastructure into a productive service within the Swissbib environment, and we will maybe have to integrate other data, other authority files from Swiss institutions, and there is also work to do on the interlinking processes and their optimization. To learn more about the project, I encourage you to visit us on GitHub
and to read our blog. We wrote a series of articles about the project, in French and German, and you can also try typing linked.swissbib.ch into your web browser, but it won't work very well yet, because it's still a work in progress. We are currently working on the index, so you may be disappointed by the results, but we will make an announcement when it is ready to visit. Thank you for your attention.
We have time for one question. Thank you, very inspiring presentation. Just one question.
Regarding the data model that you use to publish these works: do you have any mechanism for grouping together different editions of the same, let's say, the same book, or different translations, like, for example, the French and the Spanish national libraries do? So you mean FRBR, or...?
Yeah, something like FRBR or BIBFRAME, where you have a... We didn't focus on that, but this is a task which is performed by the CBS part of Swissbib, so Swissbib Classic, which groups the works together.
So we tried to transform these works and the manifestations of these works into linked data, but then we decided to focus on the enrichment of persons, so for a first step we abandoned the work and manifestation levels. Maybe in the future we will address this problem again, but for now we focus on persons. Okay, thanks.
Okay, so thank you very much again. Thank you. And now it's coffee break. We should be back at 10.45 in this room for the next session.