
Leveraging Linked Data Fragments for enhanced data publication: the Share-VDE case study


Formal Metadata

Title
Leveraging Linked Data Fragments for enhanced data publication: the Share-VDE case study
Number of Parts
15
License
CC Attribution - ShareAlike 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Abstract
Share-Virtual Discovery Environment (SVDE) is a library-driven initiative that brings together in a shared, BIBFRAME-based discovery environment the bibliographic catalogs and authority files of a growing number of leading academic and national libraries from across North America, Europe and globally. In (big) data-driven environments, accessing, querying, and processing vast datasets efficiently and flexibly is challenging. Linked Data Fragments (LDF) have emerged as a promising paradigm to address these challenges by providing a distributed and scalable approach for publishing and serving Linked Data. As part of the SVDE Labs activities, we developed a set of web APIs that adopt that approach and provide several benefits: real-time RDF generation and publication, on-demand ontology mapping, and multi-provenance management. The Linked Data Fragments paradigm also addresses one challenging point in the infrastructure: it no longer needs a dedicated RDF store. The API set provided by the system (GraphQL, REST, and SPARQL) uses a centralized knowledge base implemented using a hybrid approach composed of a relational database and an inverted-index-based search engine. A working prototype will be used during the presentation. However, the overall work is under development as we follow two parallel “investigation paths”: the first using a plain RDBMS and the second using a NoSQL storage. As part of the presentation, we will discuss technical details and the architecture/infrastructure by examining the lessons learned and the challenges. The ongoing development will be applied to the linked data underlying the SVDE discovery portal. It will benefit the other interconnected initiatives that are part of the broader Share Family linked data ecosystem.
Transcript: English (auto-generated)
So, we are going to describe a recent effort we are focusing on in the Share-VDE project,
and specifically the usage of Linked Data Fragments for our RDF API layer. Before going deep into that topic, we need to describe Share-VDE, its domain model, and so on.
So, my name is Andrea Gazzarini, I'm a software engineer based in Italy, and in this context I'm representing the Share-VDE initiative as its lead architect.
So, what is Share-VDE? Share-VDE stands for Share-Virtual Discovery Environment, and it is a library-driven initiative that brings together, in a common discovery environment,
the bibliographic catalogs and the authority files of a large number of academic and national libraries from North America and Europe. Behind the end-user system we have what we call the knowledge base,
Sapientia, the Share-VDE knowledge base, where the data is stored. So, it's useful to understand how it is created.
So, we have several participant libraries and universities that contribute their data. At this stage the data is heterogeneous, because each library provides its data,
potentially, in a different format and in a different way: through APIs, through batches, and so on. We receive that data, and a pre-processing phase executes several processes,
such as deduplication, enrichment, and clusterization. We informally call this part "clustering"; it's not actual clustering in the strict sense, but, to summarize, it is our clustering pre-processing,
and the output of this set of components is what we call Sapientia, the Share-VDE knowledge base.
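As a purely illustrative sketch (the function and stage names below are my own shorthand, not the actual Share-VDE components), the pre-processing flow just described could be thought of like this:

```python
# Hypothetical sketch of the pre-processing flow described above: heterogeneous
# records from the participant libraries pass through deduplication, enrichment
# and clusterization, and the result is what gets stored in Sapientia.
from typing import Callable, Iterable

Record = dict  # a bibliographic record, already parsed from its source format


def run_preprocessing(records: Iterable[Record],
                      stages: list) -> list:
    data = list(records)
    for stage in stages:        # e.g. [deduplicate, enrich, clusterize]
        data = stage(data)
    return data                 # ready to be loaded into the knowledge base
```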
Let's see how the data is organized within Sapientia, the Share-VDE knowledge base. The Share-VDE domain model is basically inspired by BIBFRAME,
the well-known Bibliographic Framework from the Library of Congress, so you will find a lot of similarities; at the same time, there are differences. In the picture we have the core hierarchy, where, for example, the BIBFRAME Work entity has been split in Share-VDE into two entities: Opus
and Work, where Work basically corresponds to the BIBFRAME Work. The reason for the differences is that we wanted a domain model that is more cohesive and
tied to our context. Apart from the core hierarchy there are, as I said, a lot of similarities: for example, the agent part is very similar to the BIBFRAME model,
and we have subjects and what we call non-core entities, such as languages, genres, types, formats, places, availability, and so on.
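As a rough, hypothetical rendering of the core hierarchy mentioned above (the real Share-VDE model is far richer; these class and field names are illustrative only):

```python
# Illustrative only: the Opus -> Work -> Instance -> Item core hierarchy,
# where Opus is the Share-VDE-specific split of the BIBFRAME Work.
from dataclasses import dataclass, field


@dataclass
class Item:
    uri: str
    barcode: str = ""           # hypothetical attribute, reused in the SPARQL example later


@dataclass
class Instance:                 # roughly the BIBFRAME Instance
    uri: str
    items: list = field(default_factory=list)


@dataclass
class Work:                     # roughly the BIBFRAME Work
    uri: str
    instances: list = field(default_factory=list)


@dataclass
class Opus:                     # Share-VDE-specific entity above Work
    uri: str
    works: list = field(default_factory=list)
```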
All of these are managed as entities. So, we have described the big picture: how the Share-VDE knowledge base is created
and how the data is organized in terms of entities within the knowledge base. But what about a single entity? We call it a prism, and I'm going to explain why we use that term.
Let's assume that we have a library that uses an ILS for cataloging,
so it creates a bibliographic record, and this is the very first bibliographic record sent to Share-VDE. A bibliographic record corresponds to a main entity, the instance, which basically corresponds to the BIBFRAME Instance,
but besides that specific entity, a bibliographic record contains a lot of other entities: names, authors, publishers, places, dates, events, and so on.
So, the pre-processing components that I was talking about separate those entities. In the picture, the main triangle is the instance, the entity we create in Share-VDE
that will contain the instance attributes taken from the bibliographic record; but besides that, we also have a lot of other entities: authors, publishers, dates, events, and so on.
These are the small triangles. So, what happens if another participant provides a different bibliographic record for the same resource, the same book, for example?
Probably there will be a lot of overlap. Do we discard that data? No, absolutely not, because in the same way the new bibliographic record will create other triangles.
So, for example, here is the instance triangle of the first library, Stanford, for example; this is the instance triangle for the University of Pennsylvania; and so on for Alberta and the others. The pre-processing components understand that all of those triangles belong to the same entity, an instance in this example,
so everything is summed up, the group is assigned a URI, and that forms what we call the prism.
We like this analogy because a Share-VDE entity is an aggregation of several contributions, but at any moment we are able to separate the faces: we always retain the provenance of each attribute in the knowledge base.
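A minimal sketch of that idea, under my own assumption (for illustration only) that each attribute value is stored together with the identifier of the library that contributed it:

```python
# Illustrative sketch of a "prism": one entity URI aggregating attribute values
# contributed by several libraries, keeping per-value provenance so that the
# contributions ("faces") can always be separated again. Not the real SVDE model.
from collections import defaultdict


class Prism:
    def __init__(self, uri):
        self.uri = uri
        self.attributes = defaultdict(list)   # attribute -> [(value, provenance), ...]

    def add(self, attribute, value, provenance):
        self.attributes[attribute].append((value, provenance))

    def view(self, provenances=None):
        """Return attribute values, optionally restricted to selected libraries."""
        return {
            name: [v for v, p in pairs if provenances is None or p in provenances]
            for name, pairs in self.attributes.items()
        }


# Usage: two libraries contribute to the same agent; we can still isolate each face.
agent = Prism("https://svde.example.org/agents/201")   # hypothetical URI and values
agent.add("label", "Eco, Umberto", "Stanford")
agent.add("label", "Umberto Eco", "UPenn")
print(agent.view({"UPenn"}))                           # {'label': ['Umberto Eco']}
```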
So, what about Linked Data Fragments? This is the big picture of Share-VDE. Starting from the left, we have the several participants that contribute their data in the way we saw
and create the Share-VDE knowledge base. Let's forget the part on the right, because it's about another topic.
Basically, the Share-VDE knowledge base internally uses three kinds of storage: an RDBMS, that is, a relational database; an inverted-index-based search engine; and an RDF store. The first two are used by the API layer to provide GraphQL and REST access to our data.
The third one, the RDF store, is what provides the RDF API,
that is, URI dereferencing and the SPARQL interface over HTTP. In this talk we are focusing on this part. SPARQL is the de facto standard language for querying an RDF store.
So, this is a very simple SPARQL query that uses the core hierarchy: opuses, works, instances, items.
A SPARQL query is composed of a SELECT part, where the requester indicates what he or she wants to get back; in this case, a variable called ?barcode, which is the barcode.
Then we have a WHERE clause that can be divided into several clauses called triple patterns. Why triple patterns? Because each is composed of a subject, a predicate, and an object,
and each of those can be an explicit value, a URI or a literal, or a variable. Variables are used for connecting one triple pattern to another. So, basically, the WHERE condition that controls the selection, from which I extract the information requested in the SELECT,
can be considered as composed of a set of multiple triple patterns that can potentially be executed independently.
Potentially, of course; it's not always possible. The execution of a single pattern returns a partial view of the data, of the entire result set.
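For illustration, a query over that hierarchy might look like the sketch below; the namespace and predicate names are hypothetical stand-ins, and only the overall shape follows what is described in the talk.

```python
# The SPARQL query (embedded as a string) and the triple patterns it decomposes
# into. Prefix and predicates are made up; only the Opus/Work/Instance/Item
# shape mirrors the core hierarchy of the talk.
QUERY = """
PREFIX svde: <https://svde.example.org/ontology/>   # hypothetical namespace
SELECT ?barcode WHERE {
  ?opus     svde:hasWork     ?work .
  ?work     svde:hasInstance ?instance .
  ?instance svde:hasItem     ?item .
  ?item     svde:barcode     ?barcode .
}
"""

# Each WHERE line is one triple pattern: (subject, predicate, object). Shared
# variables (?work, ?instance, ?item) connect the patterns to each other, and
# each pattern can potentially be evaluated on its own.
TRIPLE_PATTERNS = [
    ("?opus", "svde:hasWork", "?work"),
    ("?work", "svde:hasInstance", "?instance"),
    ("?instance", "svde:hasItem", "?item"),
    ("?item", "svde:barcode", "?barcode"),
]
```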
Such a partial view is called a fragment, and since we are talking about linked data, a Linked Data Fragment. So, who are the participants in a Linked Data Fragments architecture?
We have clients, and we have a layer that offers SPARQL over HTTP. But this layer is not actually executing any SPARQL at all,
because its purpose is to destructure the incoming SPARQL into triple patterns, and the execution of each triple pattern is delegated to a Linked Data Fragment resolver, or triple pattern resolver:
a server that is in charge of resolving a single, simple request, a triple pattern request. And, of course, this Linked Data Fragment resolver needs to fetch the data from somewhere.
In Share-VDE, specifically, it fetches from the RDBMS and from the search engine. So, what is the advantage of such an architecture? Everything happens on the fly.
As you may have noticed, we do not have any RDF store. That means the SPARQL layer needs to destructure the SPARQL queries,
apply some potential optimizations to the query, and delegate the execution of each triple pattern, or group of triple patterns, to a specific server, a Linked Data Fragment or triple pattern server.
Then, besides delegating, its job is also to receive the fragments from the several servers it contacted;
those fragments are merged, and the response is returned to the requester, the client. This is a very lightweight architecture, because the servers are very, very simple.
They are supposed to receive a lot of requests, probably, but very simple ones. For that reason they are completely stateless, so we can scale those Linked Data Fragment resolvers out as much as we want.
And, of course, we also need something scalable behind the scenes, because otherwise the storage would become a bottleneck. For that reason, within this prototype, we are migrating the RDBMS to a NoSQL database.
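A highly simplified sketch of that decompose/delegate/merge flow; the resolver URL, query parameters, and response shape are assumptions made up for this example, not the actual Share-VDE API.

```python
# Illustrative only: the SPARQL-over-HTTP layer does not evaluate SPARQL itself;
# it sends each triple pattern to a stateless triple pattern resolver and merges
# the returned fragments. Endpoint, parameters and JSON layout are hypothetical.
import requests

RESOLVER_URL = "https://svde.example.org/ldf"   # hypothetical resolver endpoint


def resolve_pattern(subject=None, predicate=None, obj=None):
    """Fetch one fragment for a triple pattern; None means wildcard."""
    params = {"s": subject, "p": predicate, "o": obj}
    response = requests.get(RESOLVER_URL, params={k: v for k, v in params.items() if v})
    response.raise_for_status()
    return response.json()["triples"]           # assumed response shape


def execute(triple_patterns):
    """Resolve each pattern independently and merge the fragments."""
    fragments = [resolve_pattern(*pattern) for pattern in triple_patterns]
    # A real engine would join the fragments on their shared variables and
    # optimize the pattern order; here we simply concatenate them.
    return [triple for fragment in fragments for triple in fragment]
```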
Coming back to the storage: in the end, the underlying storage will be composed of a NoSQL database and an inverted-index-based search engine. So, returning to the previous high-level architecture,
with the introduction of Linked Data Fragments we have basically removed that part: we no longer need any RDF storage. That's the first point, no RDF storage.
We have distributed computation, because we have a lot of replicated, simple triple pattern servers that are supposed to receive a considerable number of requests, but very, very simple ones, because a triple pattern request is a request where the service accepts a subject, a predicate, and an object,
and potentially each of them can be a wildcard. The very good thing about this architecture is that everything happens in real time, at query time.
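On the server side, a triple pattern resolver could be as small as the sketch below; the route, parameters, and storage lookup are placeholders, since the talk only says that resolvers fetch from the relational database and the search engine on the fly.

```python
# Illustrative server-side counterpart: a stateless resolver that accepts a
# subject, predicate and object (any of them may be a wildcard) and returns the
# matching fragment. The lookup is a placeholder for the RDBMS / search-engine
# queries that generate RDF on demand.
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.get("/ldf")
def triple_pattern_fragment():
    s = request.args.get("s")   # missing parameter = wildcard
    p = request.args.get("p")
    o = request.args.get("o")
    return jsonify({"triples": lookup(s, p, o)})


def lookup(s, p, o):
    # Placeholder: in the prototype this would query the relational database
    # and the inverted-index-based search engine and build the triples on the fly.
    return []
```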
So, in the next couple of slides I will show you, concretely, two benefits that we implemented.
We said that each entity is composed of the contributions of different libraries, right? So, consider a client that wants to get back the RDF representation of the resource associated with a given URI;
it's the same if I run a SPARQL query, of course, but in this example I'm just dereferencing a URI: I want to get back the RDF representation of /agents/201.
Using an HTTP header (but it could be anything included in the request, also a query parameter), I can select the provenances that I want to get back in my results.
So, for example, here I want the RDF representation of agent 201, but only the contributions of the University of Pennsylvania, Alberta, and Chicago. And that's very easy, because we know that a Share-VDE entity is a prism,
and we are always able, at any moment, to destructure it. So, it's just a matter of selecting the faces I'm interested in and returning to the client the RDF created from those contributions.
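As a sketch of what such a request could look like from the client side (the header name, its value syntax, and the host are assumptions; the talk does not spell out the actual header):

```python
# Hypothetical client call: dereference an agent URI and ask, via a custom
# header, for RDF built only from selected provenances.
import requests

response = requests.get(
    "https://svde.example.org/agents/201",               # hypothetical host
    headers={
        "Accept": "text/turtle",
        "X-SVDE-Provenance": "UPenn,Alberta,Chicago",    # hypothetical header
    },
)
print(response.text)   # RDF containing only those libraries' contributions
```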
Another interesting thing: what about the ontologies used for representing the data? They are not fixed, because, again, we do not store RDF anywhere.
We store our data in an RDBMS now, in NoSQL later, and in our search engine, which basically has nothing to do with RDF.
That means that, for a given client, we can transform, translate, and create the RDF as we want. So, in our prototype we created a couple of mappings: one that uses BIBFRAME plus the Wikidata ontology,
and another that uses codes we assigned ourselves, which mean nothing in particular, they could be whatever, together with Schema.org and Dublin Core.
So, it's the same request, basically, for /agents/201, but with an HTTP header that says: give me the representation of agent 201 using the BIBFRAME plus Wikidata mapping.
The RDF API would then return something like this; this is an extract, of course. The same request with a different value for that header
provides a different representation of the same resource; different, because here we are using the other two ontologies. The cool thing is that we can combine this example with the example above:
we could say, give me agent 201 using the BIBFRAME plus Wikidata mapping, but I want in the result only the contributions of libraries B, C, P, and F.
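Continuing the same hypothetical client sketch (again, header names and values are assumptions, not the documented API), combining the two selections could look like this:

```python
# Hypothetical: combine the ontology-mapping selection with the provenance
# selection in a single dereferencing request.
import requests

response = requests.get(
    "https://svde.example.org/agents/201",
    headers={
        "Accept": "text/turtle",
        "X-SVDE-Mapping": "bibframe-wikidata",   # hypothetical mapping identifier
        "X-SVDE-Provenance": "B,C,P,F",          # hypothetical provenance codes
    },
)
print(response.text)   # same entity, different ontology, filtered contributions
```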
So, this is very powerful in our opinion, because, again, everything happens in real time and we no longer need an RDF store. And, of course, there is a "but": the "but" is related to network latency,
because, as you saw, the architecture is very, very distributed. So, it depends basically on the kind of data access patterns
that your system is supposed to serve, right? I'm not saying, of course, that this is the solution for everything, but in our context it seems to be very helpful, useful, and interesting.
And so, that was the last slide. Yuda, I am supposed to end at ten to four, right? Yeah, yeah. We're just about at time. We could maybe do one question, but I did want to say that the prism analogy was really interesting.
And I think when you got to the slide where you said we don't need RDF anymore, I thought: for some people that's really blasphemous, and for other people that's really practical. So, I think you demonstrated why removing the RDF store is actually helping. But do folks have any questions?
We have time maybe for one before we go on to the next speaker. None so far. Yeah, and I think we're at time. That was a really in-depth explanation, and if people have questions, please follow up in the forum chat. So, thank you so much, Andrea. That was really enlightening.
You're welcome.