
VIVO-DataConnect: Towards an Architectural Model for Interconnecting Heterogeneous Data Sources to Populate the VIVO Triplestore


Formal Metadata

Title
VIVO-DataConnect: Towards an Architectural Model for Interconnecting Heterogeneous Data Sources to Populate the VIVO Triplestore
Title of Series
Number of Parts
22
Author
License
CC Attribution - NonCommercial 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
In a large organization, corporate data is rarely stored in a single data source. Data is most often stored sparsely in distributed systems that communicate more or less well with each other. In this context, the integration of a new data source such as VIVO is sometimes perceived as added complexity in an infrastructure already in production, making it difficult or impossible to exchange data between the VIVO instance and the databases in use. Organizations encounter significant and recurring obstacles with each new integration. A first problem is the conversion of data from the tabular format of relational databases to the RDF graph of the triplestore; another is keeping data updates (additions, modifications, deletions) synchronized across the different data sources. In our work currently in progress, we plan to build a generalizable solution adaptable to different organizational contexts. In this presentation we describe the architectural solution that we have designed and that we wish to implement in our institution. It is an architecture based on message-oriented processing of the data to be transferred. The architecture should make it possible to standardize the data transformation process and the synchronization of this data across the different databases. The target architecture considers the VIVO instance as a node in a network of data servers, rather than a star architecture built on the principle that VIVO is the centre of the data sources. In addition to presenting this distributed architecture based on Apache Kafka, the presentation will discuss the advantages and disadvantages of the solution.
Transcript: English (auto-generated)
Welcome everyone to this presentation on the VIVO interconnection architecture with other data sources in an organization.
The objectives of this presentation are to present the problem of data exchange architecture with VIVO; to discuss an enterprise vision of a decentralized data architecture in which VIVO integrates into an existing data ecosystem; and to describe the architectural solution, based on the Apache Kafka messaging framework, and its benefits for data exchange.
The architectural description is then completed with scenarios illustrating data exchange between VIVO and its surrounding context of data sources using Kafka.
Through this presentation we hope to generate a debate around an architectural proposal that introduces new concepts in the way data is ingested into VIVO. When the choice is made to integrate VIVO within an institution, the question that quickly arises is how data can be exchanged between VIVO and the data sources within the organization. A series of questions emerges in the architect's mind, for example: How can I ensure the integrity and security of the data sources? How can VIVO be allowed to act as a data source for other databases?
Is there a solution that is extensible, scalable, modular, and reusable? How can data from a relational database be translated into an RDF graph? Can data be synchronized in real time with other data sources?
VIVO's current feeding process consists of harvesting data from various external data sources, whether a relational database whose content becomes a graph, a CSV file, another triplestore, or some other data source. This architecture puts VIVO at the centre of the data transfer mechanism. An enterprise architecture, by contrast, emphasizes the exchange mechanisms between data sources: it aims to make the data sources, and their exchange mechanisms, equivalent to each other. Each has the possibility of being harvested, or of harvesting, in an equivalent way. There is therefore a kind of interoperability of connections between data sources that is independent of each source's internal architecture.
Introducing a messaging framework makes it easier to achieve this enterprise architecture vision, along with the integration of a real-time data stream synchronization mechanism.
Kafka is a real-time data streaming system. It was created by LinkedIn and is now an open source project of the Apache Foundation. Kafka leverages the architectural principles of microservices development, which are highly appreciated in agile development methodologies.
Apache Kafka is a publish-subscribe messaging system. The message is Kafka's atomic data unit. It does not, by definition, have a predefined data structure; it is a set of bytes that could, for example, carry a JSON-LD document.
The message is transmitted by a producer and received by a consumer. The topic is the channel through which the message is communicated. A note about the message structure: the use of JSON-LD elevates the enterprise architecture to an ontological data exchange, which would notably allow the design of intelligent producers and consumers able to adapt their behaviour to the semantics of the exchanged data.
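To make the publish side concrete, here is a minimal sketch in Python using the kafka-python client; the broker address, the topic name vivo-researcher-profile, and the JSON-LD payload are illustrative assumptions, not part of the presented system.

```python
# Minimal sketch: publishing a JSON-LD message to a Kafka topic with kafka-python.
# The broker address, topic name, and payload are illustrative assumptions.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    # Kafka only sees bytes; the JSON-LD document is serialized by the producer.
    value_serializer=lambda doc: json.dumps(doc).encode("utf-8"),
)

profile_update = {
    "@context": {"vivo": "http://vivoweb.org/ontology/core#",
                 "foaf": "http://xmlns.com/foaf/0.1/"},
    "@id": "http://example.org/individual/prof-123",   # hypothetical IRI
    "@type": "foaf:Person",
    "vivo:researchAreas": ["Semantic Web", "Enterprise Architecture"],
}

# The topic acts as the channel; any consumer subscribed to it receives the message.
producer.send("vivo-researcher-profile", value=profile_update)
producer.flush()
```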
The topic is like a table in a relational database: it is the reference by which data is stored. Each topic can be divided into partitions to distribute the message communication.
Partitioning is the mechanism by which Kafka ensures its scalability. An important challenge of data exchange between data sources with heterogeneous architectures is the transition from an entity-relationship representational mode to a subject-predicate-object representational mode.
Several questions arise. How do we extract the semantics of the relational data? How do we process the content of a junction table? How do we identify what is a class, a property, a resource? How do we ensure a good mapping between all of these elements?
What are the transition rules for translating an attribute value into its representation by an IRI? All of these questions must be answered in order to carry out an adequate data mapping from one system to another. There are several solutions proposed by the W3C.
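The W3C's R2RML recommendation is the standard way to declare such relational-to-RDF mappings; purely as a hand-written illustration of the underlying idea (not the mapping approach used in the project), a relational row can be lifted into subject-predicate-object triples along these lines, with the table, columns, and IRIs all being assumptions:

```python
# Rough illustration of lifting a relational row into RDF triples with rdflib.
# Table/column names and IRIs are hypothetical; real deployments would rather use
# a declarative W3C R2RML mapping or a tool such as Stardog Virtual Graphs.
from rdflib import Graph, Literal, Namespace, RDF, URIRef
from rdflib.namespace import FOAF

VIVO = Namespace("http://vivoweb.org/ontology/core#")
BASE = "http://example.org/individual/"

def row_to_triples(row: dict) -> Graph:
    g = Graph()
    person = URIRef(BASE + f"prof-{row['professor_id']}")    # primary key -> IRI
    g.add((person, RDF.type, FOAF.Person))                   # table -> class
    g.add((person, FOAF.name, Literal(row["full_name"])))    # column -> property
    g.add((person, VIVO.overview, Literal(row["research_summary"])))
    return g

g = row_to_triples({"professor_id": 123,
                    "full_name": "Jane Doe",
                    "research_summary": "Knowledge graphs for research data."})
print(g.serialize(format="turtle"))
```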
I would like to draw your attention to the concept of Stardog Virtual Graphs, which is, in my opinion, the most advanced solution in terms of relational-data-to-graph mapping. This diagram presents a hypothetical scenario of implementing communication mechanisms between several heterogeneous data sources using the Kafka framework. In the scenario, three data sources are linked: a VIVO RDF triplestore, ORCID, and the professors' relational database (prof.ucam).
A VIVO researcher-profile topic is associated with each data source, to which the other data sources can subscribe through the publish-subscribe mechanism. Each data source has its own producer and consumer, shown on the diagram for VIVO, for ORCID, and for prof.ucam.
Let's now analyze the mechanism we would implement in the case where a professor wishes to update his or her research profile via prof.ucam.
First, the update operation is serialized into a message by a producer. Second, the producer sends the message to the researcher-profile topic. Third, each consumer that subscribes to the topic receives the message to be processed.
The message is thus received by the VIVO consumer and the ORCID consumer. The received message is deserialized and stored in the host data source in its native format. At the end of the process, all data sources have a synchronized image of the update performed by the professor.
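A minimal consumer-side sketch of the third step, assuming the same hypothetical vivo-researcher-profile topic as above and a placeholder store_in_native_format() routine standing in for each host database's own write path:

```python
# Minimal sketch of a subscribing consumer, assuming the hypothetical
# "vivo-researcher-profile" topic used in the producer sketch above.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "vivo-researcher-profile",
    bootstrap_servers="localhost:9092",
    group_id="vivo-triplestore-sync",            # one consumer group per data source
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

def store_in_native_format(profile: dict) -> None:
    """Placeholder: write the update into the host data source
    (SPARQL UPDATE for VIVO, SQL for prof.ucam, REST for ORCID)."""
    print("storing", profile["@id"])

# Each data source runs its own consumer; at the end of the process
# all of them hold a synchronized image of the professor's update.
for record in consumer:
    store_in_native_format(record.value)
```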
Here is a second example scenario, in which we want to standardize against the CRDC the expertise of a professor recorded in ORCID. The scenario starts with the VIVO-ORCID application receiving a standardization message.
ORCID sends the expertise-translation message to the CRDC via the Kafka topic of the VIVO-CRDC vocabulary; you have the full Kafka chain here: producer, topic, consumer, and the application that receives the message.
The VIVO-CRDC vocabulary application translates the expertise contained in the message into the CRDC standardized expertise format, using semantic AI to perform the matching. It then transmits an expertise-conversion message to the ORCID-CRDC expertise consumer via the VIVO-CRDC vocabulary Kafka topic.
So once the matching is done, the application passes the result to the producer; the producer serializes it into a message and sends it to the topic, and the consumer receives the message.
When the consumer receives the message, it transmits it to the application that updates the ORCID database via its REST API. Other expertise consumers that subscribe to the CRDC topic are also updated with this new standardization; if another consumer connected to another database subscribes, it too will receive the message and update its database.
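As a rough sketch of that last step, assuming a hypothetical vivo-crdc-vocabulary topic and a placeholder REST endpoint rather than the real ORCID API:

```python
# Rough sketch of the final step: a consumer forwards the CRDC-standardized
# expertise to an application that updates the profile over REST.
# Topic name and endpoint URL are placeholders, not the real ORCID API.
import json
import requests
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "vivo-crdc-vocabulary",
    bootstrap_servers="localhost:9092",
    group_id="orcid-crdc-expertise",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for record in consumer:
    conversion = record.value   # e.g. {"researcher": "...", "crdc_code": "..."}
    # Hand the converted expertise to the updating application via REST.
    requests.post(
        "https://example.org/api/expertise-update",   # hypothetical endpoint
        json=conversion,
        timeout=10,
    )
```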
VIVO-DataConnect: here is a summary of the main features of VIVO-DataConnect.
The first is modularity: publish-subscribe enables secure control of the message flow between data sources. Interoperability: serialization and deserialization by the producer and consumer ensure data transport regardless of the data notation used by the data source.
Reusability: the producer and consumer implementations can be reused for other applications. Security and data protection: to subscribe to a topic, the consumer and/or producer must have the necessary authorizations (see the sketch after this summary). Scalability: the architecture facilitates the integration, addition and interconnection of new data sources while providing load-balancing capabilities through the physical architecture. Innovation: the use of ontology-based messages, along with the design of semantic AI services, is certainly an innovation to be highlighted.
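To illustrate the security point, a consumer might authenticate to the broker roughly as follows; the protocol, credentials, and broker address are placeholders, and the broker would additionally need to grant this principal read access to the topic (for example via Kafka ACLs) for the subscription to succeed.

```python
# Illustrative sketch of the "necessary authorizations" point: a consumer
# authenticating to the broker over SASL/SSL. All values are placeholders.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "vivo-researcher-profile",
    bootstrap_servers="broker.example.org:9093",
    security_protocol="SASL_SSL",
    sasl_mechanism="PLAIN",
    sasl_plain_username="vivo-consumer",     # hypothetical principal
    sasl_plain_password="change-me",
    group_id="vivo-triplestore-sync",
)
```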
In conclusion, the transition from a decentralized to a distributed architecture is facilitated by the integration of a messaging framework such as Kafka. The ontology-based message structure broadens the scope of use of semantic technology beyond the VIVO triplestore.
In particular, it introduces the ability to include semantic AI in producers and consumers. The publish-subscribe mechanism and the consumer-producer architecture provide a deployment and component architecture that is secure, evolutive, extensible, scalable, modular, interoperable and reusable.
Finally, VIVO-DataConnect does not aim to replace VIVO's current population mechanisms. Rather, the vision is to enrich those mechanisms with an add-on that elevates VIVO to an enterprise-level data source. Thank you for your attention.