
From MARC silos to Linked Data silos?


Formal Metadata

Title
From MARC silos to Linked Data silos?
Author
Osma Suominen (National Library of Finland)
License
CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Abstract
Many libraries are experimenting with publishing their metadata as Linked Data in order to open up bibliographic silos, usually based on MARC records, and make them more interoperable, accessible and understandable to developers who are not intimately familiar with library data. The libraries who have published Linked Data have all used different data models for structuring their bibliographic data. Some are using a FRBR-based model where Works, Expressions and Manifestations are represented separately. Others have chosen basic Dublin Core, dumbing down their data into a lowest common denominator format. The proliferation of data models limits the reusability of bibliographic data. In effect, libraries have moved from MARC silos to Linked Data silos of incompatible data models. Data sets can be difficult to combine, for example when one data set is modelled around Works while another mixes Work-level metadata such as author and subject with Manifestation-level metadata such as publisher and physical form. Small modelling differences may be overcome by schema mappings, but it is not clear that interoperability has improved overall. We present a survey of published bibliographic Linked Data, the data models proposed for representing bibliographic data as RDF, and tools used for conversion from MARC. We also present efforts at the National Library of Finland to open up metadata, including the national bibliography Fennica, the national discography Viola and the article database Arto, as Linked Data while trying to learn from the examples of others.
Transcript: English (auto-generated)
Hello? Ah, perfect. So welcome back everyone. I hope you all enjoyed your lunch. We are now about to begin the final session of this conference, and a good one. We have four presentations this afternoon. My name is MJ Suhonos. I'm from Toronto, Canada,
Ryerson University, and part of the program committee here. To begin this afternoon, we have Osma Suominen from the National Library of Finland. I will leave it to you to take it away. Thank you. Yeah. Hello everybody. I am Osma, and the topic of my presentation
is From MARC Silos to Linked Data Silos, with a question mark. This is work that I did together with my colleague Nina Hyvönen. So the starting point for bibliographic data in most places is something like this. So everybody has a silo of MARC records.
It could be MARC 21 or maybe a local variant, but anyway, they all have pretty much the same structure. You have bibliographic records, you have some authority records, and they are not generally available on the web, so they are sort of siloed, but they are similar inside. So what I tried to do is give an overview of what the more
modern alternatives for representing this data are, and let's see if I succeed. So I wanted to do sort of a family tree of data models, but it turns out that there
is no common ancestor, so it became more of a family forest. And I also tried to put in the same picture all the various tools that are available for converting between these, and then also some datasets and application profiles. So let's see. So the one dimension that I wanted to look at is whether these models are basically
flat or record-based, where you combine everything about the bibliographic entity into a single record, or whether it splits between different kinds of entities. So here on this picture, the flat ones will be on the top and the entity-based will be on the bottom.
You can guess which one has more in it. And there is also a fine line there, trying to divide whether the models have an explicit representation of works as a separate thing, outside the record or manifestation level, or not.
Okay. And there is a legend there, to make sense of the colors. So first of all, we have some flat data models. So obviously there is MARC, and there is not really a good RDF representation of MARC. There have been a few attempts, but it's
kind of difficult. The thinking is so different. But there is MODS, which is basically modeling pretty much the same things as MARC, and there is an RDF representation of that, and there is also a conversion tool that can go from MARC to MODS and then to MODS RDF, called MARC-MODS to RDF. Okay. So that's the first line. Then there is Dublin Core,
and of course DC Terms, all in the same bubble, which has an RDF representation, and Catmandu is one of the tools that can be used to convert, for example, from MARC to DC. It can be used for other things as well, but that's the most common use,
I think, in all the examples. And then there is BIBO, the Bibliographic Ontology, which is also quite a flat model, and it's mostly oriented around scientific publishing. Okay. Then there is a special case, which is the schema.org model, which is, of course, trying to model quite a lot of everything, but including creative works
and other bibliographic things. And you can use it either as a flat model, or you can use some of the bibliographic extensions to sort of separate out the works and instances. So it's sort of in the middle. Okay. And then there is the BIBFRAME family of data
models. So on the left we have BIBFRAME 1, which is already a little bit old. And yeah, it turned into sort of a different variety. So Zepheira went their own way and
did their own model, which they also called BIBFRAME, but yeah. And then there's the Linked Data for Libraries ontology, which took BIBFRAME as a starting point but cut out some of the bad parts and replaced them with better parts, including some parts from schema.org. So that's also sort of a different data model.
And then we have the MARC to BIBFRAME conversion tool, which can go from MARC to BIBFRAME 1, made by the Library of Congress. And then Zepheira made their own conversion tool called pybibframe that can also go from MARC to their BIBFRAME. Then there's BIBFRAME 2,
which came out in the spring and which is still fairly new, so no tools that I know of exist yet. Then there's the Linked Data for Production ontology, which is again taking, this time, BIBFRAME 2 as a starting point and trying to tweak some things that they're not happy about. Okay. Then we get the FRBR family. So FRBR itself is a pretty abstract
model. But it does have an RDF representation called FRBR Core. Then there's the FRBR-aligned Bibliographic Ontology, FaBiO, which is similar in scope to BIBO but based on the FRBR model of works and expressions and manifestations. Then there's the FRBRer
model, which is not RDF based. And FRBRoo, which is also not RDF based. But then there's the EFRBRoo ontology, which is RDF based. So you get this sort of
chain of different things building on top of others. And then there's the RDA vocabulary, which is also pretty much FRBR based, and which is sort of an appendix to RDA itself, which is more of a set of cataloging rules. But yeah, it's part of the same package. So it can actually be used to represent bibliographic data. And then the Spanish National
Library wanted to apply this when they published their linked data. But instead of using it directly, they made their own ontology sort of expanding on that. And finally, the
conversion tools. There's MARiMbA, which the Spanish National Library is using to convert theirs. So it's sort of coupled with the BNE ontology. And then there's ALIADA, which has also been presented at SWIB before, and which is a sort of package for converting, linking and publishing
bibliographic data. Okay. Finally, we get to the datasets. So not everything is represented here, I'm sure, but these are sort of some of the major ones. So among the flat data models, the Japanese National Diet Library has made their own dataset,
and they have defined their application profile mainly around Dublin Core. The British National Bibliography have also defined an application profile which combines, among other things, DC and BIBO, and they published their own dataset. The German National Library
has also made an application profile combining DC, BIBO, and the RDA vocabulary and published their dataset, and they used Metafacture for the conversion. Similarly, the SWIB that we heard about today is defining sort of an application profile and putting out their data. So these are all pretty much flat models that don't represent works explicitly
as far as I know. Okay. Then among the ones that do have works, there's WorldCat, which is available as schema.org data, and they have WorldCat Works, which is sort of an additional layer on top. And there's Linked Data for Libraries, some of those datasets,
and the French National Library has an application profile and a dataset, and this one does have a representation for works. And the Spanish National Library has a very nice one which also sort of draws not only the works but also the expressions as separate entities.
There's Libhub, which I don't know much about, but they are using the Zepheira version of BIBFRAME, and then there are the ALIADA datasets (Artium is just one of them), which have been published using those tools and EFRBRoo. Okay. So that's the big picture. Then another perspective on these data models is, I'm not sure, this is a bit fuzzier, but
let me try. So when looking at this, I found that there is some sort of contrast between different sorts of use cases. So on the left, we have the library-ish use case, which is mainly for when you want to produce or maintain your metadata as RDF. So then
you need to be sure that, for example, it's lossless, that you don't lose anything important when going from MARC into something new, and often that means that you will be modelling abstractions like records and authorities instead of the things themselves, and you need
housekeeping metadata. And the web-ish use case is more for publishing data for others to reuse. So there you want to be interoperable with other data models, and you model real-world objects. So we can look at this separately for sort of bibliographic data, I mean the
things that are normally in a MARC bib record, and authority data, which is more about maybe people, organizations, subjects. So I tried to place some of these models in this kind of setting. So on the left we have BIBFRAME and MODS, which are very library-ish
in that they try to represent all the little details in a MARC record accurately, so that you would be able to use this as your sort of primary format. And the Linked Data for Libraries and Linked Data for Production ontologies take this a little bit towards the web-ish use case by, for example, dropping some of the awkward
constructs. And then on the right side we have, for example, BIBO and FaBiO and schema.org, and then on the authority side, FOAF, which is about modeling people. Okay, I'm not going
to go through all this. So this is basically what has happened, that we have had a number of standards and then people come up with new use cases, and they think that they should be able to cover anything, and then we have more standards. So the end result
is that everybody is building their own silos using slightly different data models. So in principle it's good since it's all linked data, but in practice it's very difficult to combine these because they're all different in various ways.
Okay, so why does it have to be like this? First of all, well, different use cases require different kinds of models. Another reason is that converting from existing data is difficult, so people end up with different kinds of solutions. And especially going to
FRBR models is very difficult, because FRBRization in general is difficult. BIBFRAME is a little bit easier. A third reason is that libraries want to be in control of their data and their data models, and sometimes they have local requirements, which are good reasons for sort of making your own. And finally, once you've made something and used
a specific data model, you're unlikely to change to something else. So if you want to choose a data model for bibliographic data sets, you have to think about at least whether you want to model works explicitly or just keep it flat, and then whether it's for maintaining or for publishing. What can we do about this? First of all, don't
create more models, and I think we need projects like Linked Data for Production that try to sort of consider this point of view of maintaining things as modern entity-based models, and we would like to share and reuse each other's data. It's possible that
Google or somebody else will sort of force us to use a specific model in the future. If there was a compelling use case for sharing your data with a specific entity that sets the rules, then that would maybe help sort this out, but I'm not counting on this happening. Finally, a few words about what we are
doing with our bibliographic data. We also are trying to model this as linked open data and trying to learn from the others. So we have some databases. We have the National Bibliography, which I'm going to concentrate on, but it's part of the union catalogue Melinda, which is much bigger. We have other databases for articles and music. These
are all MARC-based. So my assignment basically is to put this on the web. I'm not sure how long it will take. It's not very linked at the moment because it's pretty heavily
siloed. Some of it is in WorldCat, but not all, and we don't know the OCLC numbers. We don't have good links even between our own records, and we're not well represented in VIAF, for example, nor in ISNI. But our subject headings, we use YSA, and they
are linked to YSO, which is the more modern ontology version, and they are linked to Library of Congress Subject Headings. So we're good there. We're targeting schema.org currently because it's very good: you can do surprisingly rich descriptions, and as I said, you can model works as a sort of separate layer.
It's not as detailed as BIBFRAME, for example, but you get the advantages of a full model. You can model other things than just bibliographic data. And it forces you to think from a user's point of view. So instead of saying we have this
1 million bibliographic records, like we have, you have to put it in a different way: we have this collection of works, and we have these editions of those works, and they are available either from this building, where you can tell what the opening hours are, or as electronic versions, of course, when they exist.
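(A rough sketch of what this kind of layered schema.org description could look like as RDF, built here with Python and rdflib. The identifiers, dates and exact property choices are made up for illustration; they are not the actual Fennica data.)

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

SCHEMA = Namespace("http://schema.org/")
EX = Namespace("http://example.org/")  # hypothetical identifiers for the example

g = Graph()
g.bind("schema", SCHEMA)

work = EX["work/original"]            # the original English-language work
translation = EX["work/translation"]  # the Finnish translation, itself a work
edition = EX["edition/illustrated"]   # one concrete published edition

# The original work, with its author
g.add((work, RDF.type, SCHEMA.CreativeWork))
g.add((work, SCHEMA.name, Literal("A Brief History of Time", lang="en")))
g.add((work, SCHEMA.author, EX["person/hawking"]))

# The translation is a separate work that points back to the original
g.add((translation, RDF.type, SCHEMA.CreativeWork))
g.add((translation, SCHEMA.name, Literal("Ajan lyhyt historia", lang="fi")))
g.add((translation, SCHEMA.translationOfWork, work))  # schema.org bib extension

# A concrete edition (roughly the manifestation level) of the translation
g.add((edition, RDF.type, SCHEMA.Book))
g.add((edition, SCHEMA.exampleOfWork, translation))
g.add((edition, SCHEMA.datePublished, Literal("1988")))  # illustrative value
g.add((edition, SCHEMA.publisher, Literal("WSOY")))      # illustrative value

print(g.serialize(format="turtle"))
```

The point of the layering is that the edition links to the work it embodies, and the translation links back to the original work, instead of everything being packed into one flat record.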
So it forces you to think about what the consumer would like to see instead of just looking at your own data. And here's an example of some data. This is the illustrated edition of A Brief History
of Time by Stephen Hawking. So here is the original English-language work represented using schema.org. This is the Finnish translation of that, and this is sort of the manifestation, so the specific edition that was published in a certain year by a certain company, and
then we have the author and the translation. Thanks, Richard, for helping with this. We converted using a pipeline. This is still a draft, but it pretty much works already. It's a batch process. We start with a dump from the Aleph database, one million records,
and we split it into smaller batches so they can be processed in parallel, and first we convert to MARCXML and do some fixes, use the Library of Congress converter to convert it to BIBFRAME, and then go from there to schema.org using SPARQL, and then we create
some work keys and mapping rules based on those. Again using SPARQL, we merge them and consolidate everything into a nice N-Triples file, and then make HDT out of that.
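(A much simplified sketch of the SPARQL CONSTRUCT step in a pipeline like this, again in Python with rdflib. The BIBFRAME input below is a toy stand-in for the Library of Congress converter's output, the mapping only covers titles, and all identifiers are made up; the real conversion is a considerably larger set of mapping rules.)

```python
from rdflib import Graph

# Toy BIBFRAME-style input (old bibframe.org vocabulary), standing in for
# the output of the MARC-to-BIBFRAME conversion. Identifiers are made up.
bibframe_data = """
@prefix bf: <http://bibframe.org/vocab/> .
@prefix ex: <http://example.org/> .

ex:work1 a bf:Work ;
    bf:title "A Brief History of Time" .

ex:instance1 a bf:Instance ;
    bf:instanceOf ex:work1 ;
    bf:title "A Brief History of Time (illustrated edition)" .
"""

# A single CONSTRUCT query that reshapes BIBFRAME entities into schema.org ones.
construct_query = """
PREFIX bf: <http://bibframe.org/vocab/>
PREFIX schema: <http://schema.org/>

CONSTRUCT {
    ?work a schema:CreativeWork ;
          schema:name ?workTitle .
    ?instance a schema:Book ;
              schema:exampleOfWork ?work ;
              schema:name ?instanceTitle .
}
WHERE {
    ?work a bf:Work ;
          bf:title ?workTitle .
    OPTIONAL {
        ?instance a bf:Instance ;
                  bf:instanceOf ?work ;
                  bf:title ?instanceTitle .
    }
}
"""

source = Graph().parse(data=bibframe_data, format="turtle")

# Collect the constructed triples into a new graph and serialize as N-Triples,
# which could then be merged with other batches and compressed into HDT.
target = Graph()
for triple in source.query(construct_query):
    target.add(triple)

print(target.serialize(format="nt"))
```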
Here are some challenges. I'm running short of time, but we're not very far. We still have some work to do, for example, with linking, and we would like to publish it directly from the HDT files using, for example, Fuseki and a Linked Data Fragments server, and then
to be able to provide a web interface, a REST API, a SPARQL endpoint, an LDF endpoint, for example, both for humans and machines, but it's still, like, early days. Hope to have something to present next year about this. Okay. Thanks.
Thank you. So we have a little bit of time for some questions if there's a lot to go through, so does anyone have any questions they'd like to ask about what we just saw?
What is this Bats testing framework you mentioned? Yeah, it's a framework for writing unit tests in shell scripts. It seems to work
fine for this use case. So I can make sure that the conversion process, which is pretty much based on files being converted into other kinds of files and so on, does what it's supposed to do. More fine-grained data tests could check whether something conforms with your expectation of what
the final result should be. I guess you could use SHACL or RDFUnit. That's true. Thank you. Anyone else? We have time for one more question.
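(The tests mentioned here are Bats shell scripts. As a rough illustration of the same idea in Python rather than Bats: run one conversion step on a small sample file and check that the output parses as RDF and contains what you expect. The script and file names below are hypothetical.)

```python
import subprocess
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

SCHEMA = Namespace("http://schema.org/")

def test_sample_batch_converts_to_schema_org():
    # Run one step of a hypothetical conversion script on a small sample batch.
    subprocess.run(
        ["./convert-batch.sh", "sample-records.xml", "sample-output.nt"],
        check=True,
    )

    # The output should parse as RDF and contain at least one schema.org work.
    g = Graph().parse("sample-output.nt", format="nt")
    assert (None, RDF.type, SCHEMA.CreativeWork) in g
```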
Thank you for a brilliant talk. It was really interesting. I'd like to hear if you would elaborate a bit more on the schema.org conversion. Did you miss out any data? What were your expectations and what was the result?
Yes, so the conversion from BIBFRAME to schema.org is basically a big SPARQL query. I'm sort of picking up the things I'm interested in from the BIBFRAME data. Most of the things are fairly straightforward. So the underlying idea of schema.org when you want to model
works is very similar to BIBFRAME. So sometimes you have to move things over from one entity to another, but it's still just one CONSTRUCT query. I don't know how to elaborate on that without showing the code, but that's maybe not appropriate here, so we can come back
to that maybe offline if you want to see. Well, thank you. Join me again in thanking Osma.