
Finnish National Bibliography Fennica as Linked Data


Formal Metadata

Title
Finnish National Bibliography Fennica as Linked Data
Title of Series
Number of Parts
15
Author
License
CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
The National Library of Finland is making our national bibliography Fennica available as Linked Open Data. We are converting the data from 1 million MARC bibliographic records first to BIBFRAME 2.0 and then further to the Schema.org data model. In the process, we are clustering works extracted from the bibliographic records, reconciling entities against internal and external authorities, cleaning up many aspects of the data and linking it to further resources. The Linked Data set is CC0 licensed and served using HDT technology. The publishing of Linked Data supports other aspects of metadata development at the National Library. For some aspects of the Linked Data, we are relying on the RDA conversion of MARC records that was completed in early 2016. The work clustering methods, and their limitations, inform the discussions about potentially establishing a work authority, which is a prerequisite for real RDA cataloguing. This presentation will discuss lessons learned during the publishing process, including the selection and design of the data model, the construction of the conversion pipeline using pre-existing tools, the methods used for work clustering, reconciliation and linking as well as the infrastructure for publishing the data and keeping it up to date.
Transcript: English (auto-generated)
Yes, I'm Osma Suominen. Hello again. I'm going to talk about the Finnish national bibliography Fennica as linked data. Basically, this is the mission that I'm trying to accomplish here: I'm trying to make our national bibliography a bit more web-ish,
but not too web-ish. I'm still not sure how long it will take, but there's been some progress since last year, so I'm going to present that. So, why are we doing this? Well, first of all, we want to make our data, our metadata, more visible also internationally.
Because right now it's locked down in silos, basically. And in order for libraries to stay relevant in the web-based world, I think it's important for them to share what they have. And of course, also, this is a great exercise for finding out problems with the quality of data.
And an opportunity to make it more interoperable with the rest of the world. And we're also, as an institution, trying to prepare for the future. As was pointed out yesterday, it's 15 years since "MARC must die" was declared,
and yeah, MARC still hasn't died, but eventually there will be something else, and I think this is an exercise in finding out what the issues are with the current data and how to move it to some other, more modern model.
And of course, why not do this? Because, yeah, we're here. So, this is basically what we started with. This is our snowflake. I don't know if you can see it there in the background. In the National Bibliography, which is mostly about books, serials, e-books and maps,
we have 1 million bibliographic records. So, this is our main bibliographic database, but we have several others. For example, we have a separate database for music, so that's not included here. And we also have some authority records: we have person names and corporate names, and then we have subject records.
And these are all MARC records, but we want to turn this into a graph, something like this, to find out the main entities there: the works, places, organizations, subjects, persons. But in order to do this, it's not a straightforward mapping, as you probably know.
So, you basically need to blow up all the MARC records and then reassemble all the bits and pieces into something that looks a bit like this marshmallow-and-toothpick structure. So, you build a graph out of the pieces that you get from the MARC records.
But how to do this? What's the best approach? To find out, we did some research, and that culminated in my last year's talk at SWIB. Some of you might remember this diagram and the picture with the silos.
I gave a talk called "From MARC silos to Linked Data silos", and then did a follow-up webinar with DCMI, and then finally a journal article. But basically, the point was just to review all the different approaches that had been used in different libraries to turn their bibliographic
metadata into linked data. And it turned out that everybody is basically doing it differently. But since we're not exactly the first ones to do this, it was possible to learn from the others. And
basically, we found out that for our purposes, Schema.org would be a good model, because it allows us to describe our resources from a common-sense web user's perspective. But we still wanted to do something a bit more advanced than just
converting one record at a time, so we wanted to separate out the works from the instances, in the sense of BIBFRAME. It's possible to do that with the bibliographic extension of Schema.org. But it means that we're not converting every detail; in a sense,
we get a metadata haircut for free, because at least at the moment we're trying to find out what the main entities are, but some of their attributes or fields are still not converted. And we got some help from Richard Wallis in planning this. But how to get from MARC to the linked data?
So we needed something like a black box where you can put in MARC and get out linked data. I looked at the alternatives: there are various tools for doing this, but none of them seemed just right for our purposes. There were some interesting approaches and good tools that got you
part of the way, but not all the way. So, we decided to build a conversion pipeline that stitches together existing tools
as much as possible, glues them together, and then adds the missing parts. We started from a dump of our database. It's an Aleph system, so the easiest way was to start with an Aleph
sequential file, which is basically a textual format for MARC records. So, one million records in Aleph sequential format. We then convert them using a tool called Catmandu, which is like a Swiss Army knife for metadata, into MARCXML, and also do some fixes at the same time.
And then from MARCXML, we could use the Library of Congress marc2bibframe2 converter. We started with the earlier one, marc2bibframe, but then switched to this one as soon as it was released. So in basically two steps we already get to BIBFRAME RDF, but it's
pretty verbose, and it's not really linked anywhere, even within the data set. There's a lot of internal duplication in there, and it doesn't have the URIs for the things that are referenced that we would like to have there. But anyway, from there
we could start using RDF technologies. So I made this monster SPARQL query to turn the BIBFRAME into Schema.org, basic Schema.org without any external links yet. And so we have it in a basic Schema.org
format. And then another SPARQL query connects that with external vocabularies and linked resources. Some of them are our own, like the subjects and the corporate names, but we also linked to the
RDA content, media and carrier vocabularies, because we are already using RDA in our MARC records, so it was just a matter of getting the right URIs into the RDF version.
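[Editor's note: as an illustration of the SPARQL CONSTRUCT step described here, a minimal sketch in Python with rdflib, turning BIBFRAME works and instances into Schema.org resources. This is not the actual "monster" conversion query; the file names and the small property selection are assumptions.]

```python
# Minimal sketch of a BIBFRAME-to-Schema.org mapping with a SPARQL CONSTRUCT.
from rdflib import Graph

bf_graph = Graph()
bf_graph.parse("fennica-bibframe.ttl", format="turtle")  # hypothetical input file

MAPPING_QUERY = """
PREFIX bf:     <http://id.loc.gov/ontologies/bibframe/>
PREFIX schema: <http://schema.org/>
CONSTRUCT {
  ?work a schema:CreativeWork ;
        schema:name ?title ;
        schema:workExample ?instance .
  ?instance a schema:CreativeWork .
}
WHERE {
  ?work a bf:Work ;
        bf:title/bf:mainTitle ?title .
  OPTIONAL { ?work bf:hasInstance ?instance . }
}
"""

# For CONSTRUCT queries, rdflib returns a result whose .graph holds the new triples.
schema_graph = bf_graph.query(MAPPING_QUERY).graph
schema_graph.serialize("fennica-schema.ttl", format="turtle")
```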
But the problem is that within the result there is still a lot of internal duplication. For example, one record becomes at least one work, and that doesn't seem right, because often you have many editions of the same work, or you have a print version and an e-book or something. To bring these together, I had to create work keys that try to connect the work entities that actually represent the same
intellectual work, and then merge them into the same entity, again mostly using SPARQL.
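[Editor's note: a hedged, simplified sketch of the work-key idea described above: derive a normalized key from author and title so that records describing the same intellectual work fall into the same cluster. The real pipeline does this with SPARQL over the BIBFRAME data; the field names and sample records below are purely illustrative.]

```python
import re
import unicodedata
from collections import defaultdict

def normalize(text: str) -> str:
    """Lowercase, strip accents and punctuation for crude key matching."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def work_key(author: str, title: str) -> str:
    return f"{normalize(author)}/{normalize(title)}"

records = [
    {"id": "rec1", "author": "Hawking, Stephen", "title": "Ajan lyhyt historia"},
    {"id": "rec2", "author": "Hawking, Stephen", "title": "Ajan lyhyt historia."},
    {"id": "rec3", "author": "Twain, Mark", "title": "Huckleberry Finn"},
]

clusters = defaultdict(list)
for rec in records:
    clusters[work_key(rec["author"], rec["title"])].append(rec["id"])

# rec1 and rec2 end up in the same cluster and would be merged into one work.
print(dict(clusters))
```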
But then we still have a problem with the persons, because many of the important authors are listed in the person name authority, and we could use their identifiers. But some are not, and there's a long tail of people mentioned in the MARC records that we don't have identifiers for. And
when you merge the works, you also get many duplicates of the same person. To merge these, we also created keys for them and tried to de-duplicate some of them. But it's difficult, because if you just have a name, and the same name is mentioned in many MARC records
without any authority control, you can't be sure whether it's the same person or not. So we have to be a bit careful there and err on the safe side, which means that we still have some duplicates. Anyway, we can then put the result in an RDF store, and from there we can publish it.
So we have a whole set of systems working together to make this available for humans and machines. First of all, we have a triple store
based on Fuseki, which provides a SPARQL endpoint, and then a small application that makes this available also as web pages, as linked data, and through an OpenSearch API. We also provide the same data as downloads, and one of the download formats is HDT, which is very convenient for
publishing this sort of static data set as a compressed RDF file. And we can also serve that using a Linked Data Fragments server. It's an alternative to the SPARQL endpoint when you want to do complex queries that might otherwise bring down the endpoint: you can
move some of the processing over to the client side using Linked Data Fragments.
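[Editor's note: a small hedged sketch of querying a SPARQL endpoint like the one described, using the SPARQLWrapper library; the endpoint URL is a placeholder, not the service's confirmed address.]

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://example.org/sparql")  # placeholder endpoint URL
sparql.setQuery("""
PREFIX schema: <http://schema.org/>
SELECT ?work ?name WHERE {
  ?work a schema:CreativeWork ;
        schema:name ?name .
} LIMIT 10
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["work"]["value"], binding["name"]["value"])
```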
Okay, so this is the publishing side. I'm going to show a little demo of this. Bear with me while I make it a little bigger. So this is how it looks on the web. Of course, the main point is the data, not the user interface; this is just for getting familiar with the data.
This is Ajan lyhyt historia, the Finnish translation of Stephen Hawking's work A Brief History of Time. Up here we have the work attributes: what it's about, who the author is, and so on. And then down here we have several instances. The first
edition in Finnish was published in 1988, but then we have several other editions published later, and the last one is an e-book from 2012. And these have all been brought together, so that they are instances of the same work.
Which of course wasn't the case in MARC, because each one of them was a separate MARC record. But this is not very user-friendly, and it doesn't have any holdings information. So for somebody using this
who wants to know more, we can also go to Finna, which is the main discovery interface for all kinds of cultural heritage and which has more information. Finna is not linked data, but it provides access to the holdings information, for example.
So it's linked to the more user-friendly, but less linked, UI. Okay. Or we can keep browsing here: all of these are links, because this is linked data. So we can go and look at Stephen Hawking and see what he has authored, what he has contributed to, and works about
Stephen Hawking. So everything here is linked together and browsable. It's very different from Finna, because Finna, the main interface, is just a search box. I don't know if this qualifies as a spelunking UI; maybe not, because it doesn't have all those nice
visualizations, but anyway, the point is not to search, even though there is a search box to get started. And also, this work is a translation, so it's linked to the original work, A Brief History of Time, and if there were other translations, they would also be shown here.
We can also look at the subjects here. So this one is about, for example, black holes. Black holes are cool. There is some information about the subject itself, this is the SKOS information, but when we scroll down, we find all the works about black holes. So this way you can also browse by subject. This is also different:
in Finna you would have to search for the subject. Maybe not so different, but anyway. And we can take any of these, like this one, Aspects of Quantum Fields and Strings. This is maybe a thesis or something. And again, we get the instances.
But this is all linked data, so we get links to other representations. For example, we can look at this in JSON-LD; you probably can't see it, but the URL ends with .json. In the same way, we can look at RDF/XML here. So it's all
provided. And the data is also embedded within the page itself, in a script tag, so a search engine could pick it up from there as well.
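[Editor's note: a hedged sketch of fetching one resource as machine-readable JSON-LD, either via the .json URL mentioned above or via content negotiation. The resource URI is an invented placeholder, and the shape of the returned JSON-LD depends on the service's framing.]

```python
import requests

resource = "http://example.org/bib/me/W00000000"  # placeholder work URI

response = requests.get(resource, headers={"Accept": "application/ld+json"})
response.raise_for_status()

data = response.json()
# Top-level keys vary with the JSON-LD framing; this just peeks at two of them.
print(data.get("@type"), data.get("name"))
```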
There is also a basic search box, which is provided as OpenSearch as well. So we can search for Mark Twain and find that we have a few duplicates of the person Mark Twain, and then we have some works about him. We can look at what we have about him in this database. Okay. Getting back to my presentation.
This is also available as downloadable dumps. So these are our hairballs; if you want to do something more interesting, just go pick them up from there. The MARC records are also provided there, so if you want to look at those instead, it's possible.
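[Editor's note: one way to work with the HDT download locally is to open it as a read-only rdflib store. The package name (rdflib-hdt) and the dump file name are assumptions, not confirmed by the talk.]

```python
from rdflib import Graph
from rdflib_hdt import HDTStore

store = HDTStore("fennica.hdt")   # hypothetical local copy of the HDT dump
graph = Graph(store=store)

# Count the resources typed as schema.org CreativeWork in the dump.
query = """
PREFIX schema: <http://schema.org/>
SELECT (COUNT(?work) AS ?n) WHERE { ?work a schema:CreativeWork }
"""
for row in graph.query(query):
    print(row.n)
```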
And yeah, I already mentioned we have a Linked Data Fragments server; it's still very experimental, but it was easy to set up, so it's possible to use that. This is the full data model currently. You can see there are not that many attributes in here; the main point is to get the entities right.
The main classes here are Work and Instance, and there are some relationships between them. This is basically the same division as in BIBFRAME. And then we have person, organization,
publication event, place, and then we have the concepts that are the subjects. And then we have some series. There's full documentation, in the form of tables, if you're interested.
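[Editor's note: a tiny sketch of the Work/Instance split in Schema.org terms, using the bibliographic extension properties workExample / exampleOfWork and the translation relationship shown in the demo. All URIs here are invented placeholders.]

```python
from rdflib import Graph, Literal, Namespace, URIRef

SCHEMA = Namespace("http://schema.org/")
g = Graph()
g.bind("schema", SCHEMA)

work = URIRef("http://example.org/work/ajan-lyhyt-historia")
instance_1988 = URIRef("http://example.org/instance/1988")
original = URIRef("http://example.org/work/a-brief-history-of-time")

g.add((work, SCHEMA.name, Literal("Ajan lyhyt historia", lang="fi")))
g.add((work, SCHEMA.workExample, instance_1988))     # work -> instance
g.add((instance_1988, SCHEMA.exampleOfWork, work))   # instance -> work
g.add((work, SCHEMA.translationOfWork, original))    # link to the original work

print(g.serialize(format="turtle"))
```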
Some of the challenges during this: first of all, the work extraction is really hard to do properly. In principle it works like this: you extract the works from the MARC records using some kind of work keys. And this will never be quite right, because the metadata is always a little messy; there are problems with the records, for example missing information about
the original works. So you would probably have to eventually create a work authority to actually have stable identities for those works, and then start using it and maintaining it for the purposes of cataloguing.
This is what I would hope to do eventually, but currently we're at step one, and it's hard to find the resources and the motivation to do this. Although some libraries, like the Swedish National Library and also in Norway and Germany,
have projects around this, I know. So yeah, I hope eventually we would get some kind of use out of this, but it seems like a big investment to get started, so it's not so easy to do in practice. Then the linking. The blue blob there in the middle is basically this bibliographic data set, and the yellow boxes
are the things that we currently link to. The subjects are linked to the Library of Congress Subject Headings. Places are linked to the Finnish Place Name Registry and to Wikidata; we're still working on the Wikidata mapping, but it's
pretty far along. In terms of the 5-star Linked Data scheme, maybe we're at four stars now, because there are no real links yet for the works and the instances, which are the core entities. But we could potentially link more:
for example, the persons could be linked to ISNI and VIAF, and the works could be linked to WorldCat Works or, for example, the Libris XL system, which is coming in Sweden very soon. They used the BIBFRAME approach for works, so we could link to their works. Or we could link
the instances to WorldCat or to other national libraries, for example using ISBNs as keys for the linking. So these are just some potential ideas for linking.
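[Editor's note: a hedged sketch of the ISBN-based matching idea mentioned above: normalize ISBNs to ISBN-13 so that instances in two datasets can be joined on a shared key. The sample ISBNs are illustrative, not real identifiers from the dataset.]

```python
def to_isbn13(isbn: str) -> str:
    """Normalize an ISBN-10 or ISBN-13 string to bare ISBN-13 digits."""
    digits = isbn.replace("-", "").replace(" ", "").upper()
    if len(digits) == 13:
        return digits
    if len(digits) == 10:
        core = "978" + digits[:9]
        # ISBN-13 check digit: weights alternate 1, 3 over the first 12 digits.
        total = sum((1 if i % 2 == 0 else 3) * int(d) for i, d in enumerate(core))
        return core + str((10 - total % 10) % 10)
    raise ValueError(f"not an ISBN: {isbn}")

# The same edition recorded as ISBN-10 in one dataset and ISBN-13 in another.
assert to_isbn13("951-0-15316-9") == to_isbn13("978-951-0-15316-1")
```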
Then there's the problem with the persistence of identifiers, because the data is still being maintained as MARC records, and the conversion is done in a way that we can do it all over again tomorrow. It's stable as long as the data is unchanged, but there are lots of things that have to be given identifiers, not just the records themselves
but also the entities extracted from the records. If the records change, for example two records get merged, or somebody adds new contributor information or something, then often the identifiers in the result also change. It's like trying to build a castle on
sand: it keeps moving, and it's hard to maintain persistence. This is something that I don't really have a good solution for. Of course, one way would be to push those identifiers back into the MARC records themselves, so that they would be maintained in the same place as the data
itself. This would help, but I don't think you can do that in all cases. There are some facilities for doing this, like the $0 and $1 subfields, but it doesn't work in all cases.
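[Editor's note: a hedged sketch, using pymarc, of what pushing an identifier back into a MARC record could look like: adding a $0 subfield with the minted entity URI to an author field. The input file and the URI are invented placeholders, not the library's actual workflow.]

```python
from pymarc import MARCReader

with open("fennica-sample.mrc", "rb") as fh:   # hypothetical MARC input file
    for record in MARCReader(fh):
        author_fields = record.get_fields("100")
        if author_fields:
            # Attach the minted person URI so it lives alongside the source data.
            author_fields[0].add_subfield("0", "http://example.org/person/000123")
        # ...write the modified record back out here, e.g. with record.as_marc()
```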
This is open data, so we can look at it from a FAIR perspective. Is it findable? Yes; we use URIs and we have rich metadata. Is it accessible? We provide several ways of accessing it. Is it interoperable? Well, we use RDF, which is a standard, and Schema.org, which is a standard, plus a little bit of RDA. It's CC0 licensed, and we use
entities that are also used in other databases, so we hope that it's at least somewhat reusable. What to do next? Well, first of all, we want to continue enriching and cleaning the RDF data.
For example, maps are now just CreativeWorks, but they could be schema:Map. We want to add more links to other linked data sets, like I already said, and then to expand the same idea to other data sets, for example the music discography Viola and the article database
Arto. It happens to be the 100-year anniversary of Finland today, so this is also my birthday present. We have this logo everywhere right now. When you buy milk or something,
it's everywhere. Okay. But this was my present. Thank you.
Thank you, Osma. Any questions? Thank you for this talk, it was very nice to see. For me, although it's maybe not very important for you, what stood out is the user interface. I found it quite
nice, especially that you can see all these links. For people like us it's clear that this is linked data, but I think many users don't know what linked data is. And I think basically
they get something out of the fact that this is linked data. But do you tell the users that this is linked data, or do you ask them anything? Well, I guess first of all, we have to know
who is the user here. And it's a new service, so right now we don't have any users. I mean, you could be one, but it says "linked data service" right at the top. So, yes, in that way it's telling you that this is linked data. But I don't know what else we could do to make the point clear,
if you have ideas, then I'm very happy. Yeah. One question which came to my mind is: you have to do a lot of cleanup work with the works
and the deduplication. Do you consider opening up an opportunity for users to take part in this,
a kind of crowdsourcing approach? So, for example, letting them check whether these three Mark Twains are the same person or not. Is there any consideration of opening this kind of work up?
That's a good question, and we haven't considered that approach. I think the challenge is relating the data here, which is the end result of a pretty long chain of operations, back to the original data. Well, in the case of Mark Twain,
it would actually be fairly simple. What it would mean is that the original MARC records should be enhanced with the person identifier, which was missing, and that's why those duplicates were created. But this is also something that we are already
doing internally. Having those identifiers in the MARC records is actually a very new thing; they have only been there for maybe two months. So, before opening this up
to crowdsourcing, I think we first have to see what we can do internally. We still haven't picked all the low-hanging fruit in that area; we're just getting started. But that's a good idea, and we will think about it.
If you were to run the conversion again tomorrow, what kind of hurdle would that be? How are these different pieces connected to each other, or is there one script that you can use, or how does it work? I can easily run it any time. The full conversion takes about five hours,
and then the loading to the triple store and generating all the downloads and stuff, that takes one more hour, so maybe about six hours, and the result should be the same if the records have not changed. So, all the links will be generated the same way,
and all the URIs should be the same. Did that answer the question? Okay. So, just a Webby question. With BIBFRAME, there's the activity pattern, and
in your schema, you use shortcut properties. Was that a deliberate decision when you could have used Schema.org Actions, or could you say a little bit more about that decision? I would say we just took the easiest approach, and in this case, this was the
direct property. And also it's because in the data, those roles, who is a translator and who is an illustrator, are only available in some of the records
and not all of them. I mean, it doesn't prevent you from using the more advanced pattern, but it just makes it less useful, I think. I think we could do some of that with sub-properties: we could say, for example, that this person is not only a contributor but also an illustrator or translator, and assert both of those properties.
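[Editor's note: a small sketch of the sub-property idea in the answer: asserting both the generic schema:contributor and the more specific schema:translator for the same person. URIs are placeholders.]

```python
from rdflib import Graph, Namespace, URIRef

SCHEMA = Namespace("http://schema.org/")
g = Graph()

work = URIRef("http://example.org/work/ajan-lyhyt-historia")
person = URIRef("http://example.org/person/translator-1")

g.add((work, SCHEMA.contributor, person))   # generic relation, always present
g.add((work, SCHEMA.translator, person))    # specific role, asserted when known
```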
Okay. Hello, Osma. Thanks for the presentation. When you would be linking the MARC records, or the blown-up MARC records, to other databases, let's say to other national libraries, to find similar items, would you inject the links back into the original MARC records?
Yeah, I think that would make sense. I haven't really thought that far, but right now basically all the data is coming from somewhere. This system itself,
well, it has a few triples about this data set, but other than that, everything comes from a data source, and I think for those links we would have to decide where to put them, and it would probably make sense to put them in the MARC records rather than somewhere else.
Thank you, Osma. It's very nice to follow your work. I've got a question about whether you're planning to integrate more kinds of relations between your works, like aggregations or works that are based on other works, that kind of link in your MARC records.
Yes, I would hope to do that, but I haven't really looked very deeply at what would be possible based on the records we have, because of course we have to get the information from somewhere. In terms of work-to-work relations, I
mainly concentrated on the translation relationship, because it seemed to be by far the most common one in the records, and it's also interesting from a user perspective, being a national library of a small country where
a lot of the literature is a translation of something else. Thank you so much, Osma, for these insights. I think it's very inspiring how you use the BIBFRAME core classes to group the
things, and then use Schema.org to cater to the search engines. I think this should really increase the interoperability of the data, considering that data models are harder to overcome than individual properties that can be mapped. Was that something you did deliberately, to
be able to align with other data sets likely to come up in BIBFRAME, or was it just because it fitted the structure you wanted to achieve?
I wouldn't say those two options are contradictory, but I think it's more that we saw the problems of duplication when you have a large number of MARC records
which are basically about the same work, especially when you start to have lots of electronic or digitized versions of the same work. So we have more and more of these
MARC records that are disconnected, even though they are about the same thing. It seemed useful to do something in this area, and rather than choosing just a flat model that maps one MARC record to one entity in the RDF, to do something a little bit more
advanced with works. And the available tools of course also affected this decision: having the BIBFRAME converter available from the Library of Congress, and seeing how it seems to give you at least a starting point for doing this grouping or clustering by work, it
sort of seemed like a good idea to try to apply here. Okay, one last question before we go into the coffee break. Thank you for your talk, and I'm wondering, it's a very nice data set, I think,
do you have any idea who will be your users, or are you planning to do any kind of marketing for special user groups to take advantage of that data? Well yes, this is actually part of a larger opening of our data, so the data,
in addition to this linked data set, we have created a data catalog, because we already provide lots of open APIs, usually more traditional APIs,
but they were all not very well documented and sort of hidden in specific systems. We were also encouraged by the ministries and so on, which want to see our data put out in the open, so we created this data catalog that tries to
document all the data sets and all the APIs we have, and this is one of them, or actually several of them, because there are many APIs available for the same data set. We haven't really promoted it much yet; I mean, this was the official launch
in a way for the linked data set, and we will announce the open data service sometime next week and try to organize also maybe some event around it to make it more known to developers, so yes we are trying to get some publicity for this.
Okay, thanks again Osma.