A RESTful JSON-LD Architecture for Unraveling Hidden References to Research Data
Formal Metadata
Number of Parts: 16
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifier: 10.5446/47540 (DOI)
Transcript: English (auto-generated)
00:07
Thank you, and good afternoon everybody. A quick overview on the first slide: the talk has four parts. The first two are more general,
00:21
then there is a part on the technical architecture, and at the end a demonstration of what we can do at the moment and how you can use our tools. We will talk about the InfoLiS project. It is a joint project with GESIS,
00:41
the GESIS Leibniz Institute for the Social Sciences, with the University of Mannheim, respectively the Hochschule der Medien Stuttgart, and with us, the Mannheim University Library. It is funded by the DFG, the German Research Foundation,
01:00
and we are now in the second funding phase. So that we are all on the same level, let's start with some easy stuff: research data. That is just some raw data. Imagine some numbers you have measured, say in some experiments, and that will often be an intermediate step
01:22
in your research process; you are then heading for the publication, where you do some nice analysis. That is the first possibility, but you can also take research data from a data provider or from official statistics (for countries, a lot of data are available,
01:42
and for other organizations as well), or you can simply take the research data from a colleague, given that he or she shares it with you or has even published it. Now, if you are using other scholarly works, then you have to cite them,
02:02
and a citation is just a formal, structured reference to another scholarly work; data citation is the case where that scholarly work is research data. So when did data citation actually start? I have a timeline here,
02:21
and I think we have already seen at least some of these dates. The first question I would like to ask: when was the first structured data citation used in a publication? I claim maybe around the year 2000. If you have any proof or further hints,
02:41
just send them to us; we are interested in finding a more accurate date. Next question: when was the first unstructured reference to research data used in a publication? Here we say 1609 or before, and the proof follows:
03:02
here is one of the first unstructured data citations. It is a book by Kepler, so he is the author, and the whole text above is the title. If you translate it into English, it says: new astronomy, based upon causes,
03:21
or celestial physics, treated by means of commentaries on the motions of the star Mars, from the observations of Tycho Brahe. So he mentions these observations, this research data of Tycho Brahe; he cites it, a kind of data citation. How does he do that? Well, just with the phrase a little above:
03:46
"from the observations of Tycho Brahe". That sentence, part of the title, is his data citation, so to speak. Well, that was a long time ago, and by now we have actual data citation principles.
04:04
These are the eight data citation principles. The first one is importance: data citation is just as important as other citations, so you should treat it the same way as your other citations. You should make it easy, or facilitate,
04:23
giving credit and attribution to the authors or contributors of the research data. Evidence: whenever you are using some research data, you should cite it. Unique identification: some global identifier;
04:42
access: how you can get to the research data; persistence; and two more. Currently there are exactly 100 institutional supporters. This means that if your institution wants to become the 101st, you may have to hurry up a little.
05:01
Among them are data centers, publishers, and societies (some library societies as well), and some other supporters. So much for the principles; how does a data citation look in practice? Here is one format suggested by DataCite.
05:22
It is just one format you can use: you start with the creator, then the publication year in parentheses, a colon, the title, a full stop, the version, the publisher, the resource type, and the identifier. Nothing too fancy. There is an example. You can move things around a little, put the publication year at the end,
05:43
or use other rules for the separators, things like that. That is what citation styles normally force you to do anyway. And there are also other well-known citation styles, like APA, which already includes data citation guidelines.
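The DataCite-suggested order just described (creator, publication year in parentheses, colon, then title, version, publisher, resource type, and identifier, separated by full stops) can be sketched as a tiny formatter. The function name and the sample record below are invented for illustration; they are not from DataCite or from the talk:

```python
def format_data_citation(creator, year, title, version, publisher,
                         resource_type, identifier):
    """Assemble a data citation in the order suggested by DataCite."""
    return (f"{creator} ({year}): {title}. {version}. "
            f"{publisher}. {resource_type}. {identifier}")

# A purely hypothetical record, for illustration only.
citation = format_data_citation(
    "Example Institute", 2014, "Example Survey 2012", "Version 1.0",
    "Example Data Archive", "Dataset", "doi:10.1234/example")
print(citation)
```

Swapping the year to the end, or changing the separators, is then just a matter of editing one format string, which is essentially what the different citation styles do.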
06:02
Some other examples I have here are from the NLM and the Chicago Manual of Style, which at least talk about databases, and some journal styles are listed here as well. But in practice, people are still doing the same thing
06:21
as 400 years ago: they cite or reference the research data in the running text, just by mentioning some words. For example, the first one is the caption of a table, and somewhere in there is a reference to the research data. The second one just mentions the IGLU study,
06:43
and if you read around a little, you see there should be some connection to reading literacy; maybe not the first thing that comes to mind when you see "IGLU". And in this third example, the reference can be scattered around in the text, as here, where the words and the years
07:00
are not in the same place, with other words in between. So how do you process this? How can you find the research data now? There are different steps you have to perform. The first one we have actually just done, namely the detection of the data citation in the running full text.
07:22
The second one is resolving and normalizing the data citations. For example, IGLU stands for the German "Internationale Grundschul-Lese-Untersuchung", and SOEP is the abbreviation for the Socio-Economic Panel,
07:41
in German the "Sozio-oekonomisches Panel", and you can even write that differently, so there are different variants here. The next step is to uniquely identify the data citation. There was an IGLU study in 2001, there was another one in 2006,
08:00
and there was one in 2011. Which one was referenced in the paper before? And the last step is to actually find the cited research data, so in the end you are after a URL, or at least a location. These steps are annoying: you don't want to spend a lot of time on them,
08:20
and maybe you have no time for them at all; it would be nice to see it right at the beginning. So the question is whether we can automate some of these steps, with some tools, some algorithms, that can help us here. And that is exactly the goal, or one of the goals, of the InfoLiS project.
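The four steps just described (detection, normalization, disambiguation, and locating the data) can be caricatured as a toy pipeline. Everything here, the alias table, the registry, and the function names, is invented for illustration and is not the project's actual code:

```python
import re

# Toy alias table mapping mention variants to a canonical dataset name
# (invented for illustration; not the project's real resolution data).
ALIASES = {
    "Socio-Economic Panel": "SOEP",
    "SOEP": "SOEP",
    "IGLU": "IGLU",
}

def detect(text):
    """Step 1: detect candidate dataset mentions in the running full text."""
    pattern = "|".join(re.escape(a) for a in sorted(ALIASES, key=len, reverse=True))
    return re.findall(pattern, text)

def normalize(mention):
    """Step 2: resolve spelling variants to one canonical name."""
    return ALIASES[mention]

def disambiguate(canonical, text):
    """Step 3: attach a concrete study year, if one appears in the text."""
    year = re.search(r"\b(?:19|20)\d{2}\b", text)
    return (canonical, year.group() if year else None)

def locate(resolved):
    """Step 4: look up a location in a hypothetical registry."""
    registry = {("SOEP", "2011"): "https://example.org/soep-2011"}
    return registry.get(resolved)

text = "Our analysis uses the Socio-Economic Panel, wave 2011."
resolved = disambiguate(normalize(detect(text)[0]), text)
print(locate(resolved))  # -> https://example.org/soep-2011
```

The real problem is of course much harder at every step (free-text variation, missing years, many candidate registries), which is exactly why the project exists.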
08:43
Automating these processing steps: this means automatically unraveling hidden references to research data in the running text into structured data citations with URIs. And all of this should happen in a flexible,
09:00
long-term, sustainable infrastructure. Here is an overall view of the project. As always, we need some data, like the full texts and metadata of research data and publications, and then our algorithms can work on that. They rely on a store.
09:22
There are data mining algorithms, and they use bootstrapping strategies, among other things. We will not say much more about the algorithms, but focus on the technical architecture they actually run in. The technical architecture relies on linked open data
09:42
and will provide RESTful APIs. In between there is some abstract modeling, some structured semantics, which connects the algorithms and the technical architecture, and out of which you can get things. And there is an integration: we are trying to integrate it in as much as possible.
10:04
Maybe you can see that better here. As the end user, you can, for example, search in a discovery system. You then receive some publications according to your search, and it would be nice to see
10:22
what data these publications were relying on. Or, if you go the other way, searching in a data repository and finding some research data, it would be nice to see which publications were built on top of this research data. So we need some linking between them.
10:45
However, users are not only searching in discovery systems and data repositories; they are also searching in other systems, which I will mention in a minute. First, there is also the question of how to best incorporate data connections
11:01
into library catalogs. That question comes from the Horizon Report 2014, library edition. And users also search somewhere else, for example in Google Scholar, or maybe on the journal website, or anywhere on the web.
11:22
So it is also a good question to ask: where and how is the integration of data citations most useful for our users? We see that there is a whole range of different systems we would like to cover in this integration,
11:43
and therefore we need a really flexible infrastructure that allows us to do that. And that is what we want to show you next. All right, so we have just seen that there are various agents involved, or possibly involved, with the results of our project.
12:05
This is a 10,000-feet view of our architecture. We have an internal API, which does all the heavy lifting, all the text extraction and so on. It is written in Java, and it should be a mostly self-contained service.
12:23
On the other hand, we have our public API, which should be as flexible as possible, should support as many different serializations and data formats as possible, and should allow our data model to be as complex as needed, but still be really fast. Speed is of the essence for us,
12:43
and that is why we laid down some principles when we started designing this whole architecture. The main one is that API usability is more important than the expressivity of all parts of the model. So we want to support expressivity at the right places,
13:01
but in general the API should be easy to maintain, easy to consume for developers, and it should be possible to understand the data model. So we tried to postpone the part where the data model gets extremely complicated, and to start with something simple.
13:22
Of course, it should be RESTful-ish; not all aspects of a RESTful architecture are followed closely or orthodoxly. But it is still protocol independent, so we can reproduce everything on a local client without HTTP. That is really important, because it has to be fast,
13:42
as we said. And we decided to use a JSON store rather than a triple store, because it is really fast, and it has native ordered lists, or arrays, which, as everyone who has developed RDF software knows, are a real pain with RDF.
14:02
And it has a deterministic structure, which RDF has not, and that makes it easy to use for closed-world validation, which is really important for us. So in general, we started out by keeping it simple. That is also understandable
14:21
if you look at the main operations in InfoLiS at the moment. There is the bootstrapping part, where we try to learn, from a simple seed word, new patterns to find dataset references, as Philipp showed before. There are multiple levels of recursion involved; it is an iterative process, and it is really tough on CPU and on RAM.
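The bootstrapping loop can be sketched in a few lines: start from one seed term, collect the contexts it appears in, promote frequent contexts to extraction patterns, and use those to harvest new terms. The corpus, the one-word context window, and the frequency threshold are all simplifications invented for this sketch; the real InfoLiS algorithms are considerably more involved:

```python
import re
from collections import Counter

def bootstrap(corpus, seed, rounds=2, min_freq=2):
    """Learn context patterns from a seed term and harvest new terms."""
    known = {seed}
    for _ in range(rounds):
        # Collect the one-word-left / one-word-right contexts of known terms.
        contexts = Counter()
        for doc in corpus:
            for term in known:
                for m in re.finditer(r"(\w+)\s+" + re.escape(term) + r"\s+(\w+)", doc):
                    contexts[(m.group(1), m.group(2))] += 1
        # Promote frequent contexts to extraction patterns (plain regexes).
        patterns = [re.compile(re.escape(left) + r"\s+(\w+)\s+" + re.escape(right))
                    for (left, right), n in contexts.items() if n >= min_freq]
        # Apply the patterns to the corpus to harvest new candidate terms.
        for doc in corpus:
            for p in patterns:
                known.update(p.findall(doc))
    return known

corpus = [
    "we analyzed the ALLBUS survey data",
    "results from the ALLBUS survey in 2012",
    "we analyzed the SOEP survey data",
]
print(bootstrap(corpus, "ALLBUS"))  # contains both "ALLBUS" and "SOEP"
```

Even this toy version shows why the operation is expensive: every round re-scans the whole corpus for every known term, so speed matters far more than expressive data structures here.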
14:45
So here, speed is much more important than expressivity. The same goes for text extraction, that is, extracting text from PDFs, which we do a lot, and for applying the patterns found by this bootstrapping process to text files:
15:00
these must be really fast, and no time should be lost on serialization or on complicated data structure problems. Dataset resolution, on the other hand, is the part where we ask: if we have some string like SOEP, what does it refer to?
15:22
Which databases must we search? How do we rank these results? How can we automate the intuition that people put into resolving these dataset references? And here, expressivity is much more important than speed, because we want perfect results.
15:41
So I still think that deep modeling has its merit and that it is important for us. Take dataset granularity, for example: if someone refers to SOEP, do they mean the whole panel, the whole survey in every year, or just a single year? That is one aspect.
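The granularity aspect can be made concrete with a tiny series/round hierarchy. This reuses the IGLU rounds (2001, 2006, 2011) mentioned earlier in the talk, but the data structure and function are invented for illustration:

```python
# Toy model of the granularity problem: does a mention of "IGLU" refer
# to the whole study series or to one specific round?
series = {"name": "IGLU", "rounds": [2001, 2006, 2011]}

def resolve_granularity(year=None):
    """Resolve to one round when a year is known, else to the whole series."""
    if year is None:
        return series["name"]              # the entire series of studies
    if year not in series["rounds"]:
        raise ValueError(f"no {series['name']} round in {year}")
    return f"{series['name']} {year}"      # one specific round

print(resolve_granularity())      # -> IGLU
print(resolve_granularity(2006))  # -> IGLU 2006
```

A data model that cannot distinguish these two levels would silently conflate a citation of one wave with a citation of the whole panel.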
16:01
Then there are dataset references that cannot be automatically resolved without context, like when people write "as the results of our study show": we have to know who those people are, what the context is, and where we got this from. Or something like "page 15 of the DERP panel",
16:24
which we don't know what it is, because we cannot find it anywhere, but we still want to state that we found, somewhere, someone referencing something called the DERP panel. Also, for doing bibliometric analyses, or for graphing the relationships of the entities
16:42
in our data store, we want the possibility to model this deeply. And for mining the provenance of the things in our data store, it would be really helpful to understand it as a set of statements instead of a set of documents.
17:02
So the question is how we get the best of both these worlds, of deep modeling and of keeping it simple. To show you that, I will briefly explain our architecture. We have an HTTP server, which handles the API calls
17:22
and has an RDF/JSON-LD content-negotiation middleware, and we have MongoDB storage, because that is really easy to set up, easy to deploy, and fast. We are using the Mongoose document mapper
17:42
to make it a bit easier to work with in code. And then we have a mapping tool that maps between the Mongoose schema and the incoming data. So for one thing this handles RDF requests,
18:02
it handles requests for our schema, but it also handles the RESTful API requests and exposes our data model in different serializations. And all of that is controlled by something we call TSON.
18:22
And if you are asking yourself what this spidery thing in the middle is, with the arrows to everything else, I will explain that now. TSON is our self-developed format. It is just JSON-LD with a bit of a different syntax, more oriented towards Turtle, because that is easier to read and easier to write.
18:44
And in this we keep all the different aspects: we keep the descriptive part, we keep the database schema part, and also the presentation part. So these are the parts that describe the RDF semantics
19:01
of our data model. We have a class Execution with two properties, log and algorithm, which are described in a context; we happily stole that from JSON-LD. Then we have the database schema part, so there is a collection Execution and a property algorithm, and the algorithm may be required
19:22
or should be indexed, and so on. And lastly, this is just an example: this one should not be displayed in the API front end. So yes, we are mixing different levels, but we keep them all in one place, which makes it really easy to adapt and to fix things.
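A TSON-style entry for the Execution class might then look roughly like the following, with the JSON-LD context (RDF semantics), the database schema part, and a presentation hint kept side by side. This is a guess at the shape, written as a Python dict purely to illustrate the idea of mixing the levels in one place; the field names are invented and this is not the project's actual schema file:

```python
# A guessed, illustrative TSON-style entry: the JSON-LD context (RDF
# semantics), the database schema, and a presentation hint live side by
# side for one class. All field names here are invented.
tson_entry = {
    "Execution": {
        "@context": {                 # descriptive / RDF part
            "algorithm": {"@id": "infolis:algorithm", "@type": "@id"},
            "log": {"@id": "infolis:log"},
        },
        "schema": {                   # database schema part
            "algorithm": {"required": True, "index": True},
            "log": {"type": "array"},
        },
        "presentation": {             # front-end part
            "log": {"hidden": True},  # don't display in the API front end
        },
    },
}
```

One generator can then walk such a structure and emit the ontology, the database schema with its indexes, and the API documentation from the same source.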
19:42
So, one schema to rule them all: that is our general idea. From this one file, we generate our ontology, we generate our REST API endpoints and the documentation for them, we generate our database schema and the indexes that make this database fast,
20:01
and a data model explorer, which allows us to get a better understanding of how all this works. Now, let's hope that it works. Enjoy your live demos. All right, so what we see here are the aspects
20:21
of our data model. I won't dive into that much detail; I just want to say that Execution is the most important thing, because we are doing heavily algorithmic stuff here, but we also have links between entities, for example, and patterns. And if I open one of these, I see that the context aspect is always the RDF part,
20:43
and I can just open all the RDF descriptions. Right, so here we are on the RDF level, but we could also check out the database level, for example to find out why some query
21:03
is really slow: maybe some field isn't indexed. And we can always jump into the real RDF description of some class, for example here a search query, described as Turtle in this case.
21:20
It could also be something else; I always find it helpful to look at it in JSON-LD, because that is really terse. Or I could even go crazy and look at some visualization. Right, so that is the data model part.
21:41
Let's look at what we can do with that. So, I just showed that the demo works; that's nice. Right, so I'll jump right into it. We have our API exposed using the Swagger interface,
22:01
so all these things are generated from that file. We can use all the HTTP verbs that are relevant for REST: GET, POST, and so on. But what I want to show you is our simplified API call to execute something, and what I want to execute is a short version of that learning algorithm.
22:25
I just copy it and paste it. So, really quickly, what I am doing here is executing this frequency-based bootstrapping algorithm for all the files
22:42
that were tagged with this particular tag. Oops. And I start with the seed "ALLBUS", so I know that ALLBUS is a dataset reference, and I want to find out what can be learned from just this information across a lot of files. So let's try it out.
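As a rough reconstruction of what such a simplified execute call could contain, here is a hypothetical request body. The endpoint, the field names, and the tag are all invented from the description above, not copied from the real InfoLiS API:

```python
import json

# Hypothetical body for starting a bootstrapping execution: which
# algorithm to run, the seed term, and which tagged files to run it on.
# All field names are invented for illustration.
request_body = {
    "algorithm": "FrequencyBasedBootstrapping",
    "seeds": ["ALLBUS"],
    "inputFiles": {"tag": "demo-corpus"},
}

# The client would POST this to something like /api/executions; the
# server starts the execution asynchronously and returns its URI in the
# Location header of the response.
payload = json.dumps(request_body)
print(payload)
```

The asynchronous design matters because, as described earlier, bootstrapping is heavy on CPU and RAM, so the client polls the returned execution resource for progress instead of blocking on the request.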
23:02
Okay, the thing is posted, and I get a response, also in the Location header. And now it has started an execution that is running asynchronously on the server. I open that up, and I get, again, the triple view. So I could look at this, but we have a slightly nicer interface.
23:23
So we see the algorithm is at 50%; I just have to hit F5 a few times while it runs. I can show you: these are all the knobs and dials that you can configure for an algorithm. And let's see if it has finished.
23:41
Yes, it finished. So it has generated a lot of patterns, and all these patterns are, of course, referenceable resources. It has also generated a lot of textual references, so these are the things. Let's open one,
24:01
and maybe in Turtle. These are the extracted elements of the text, so the words left of the thing we found, and so on. Check out the patterns as well, which are just a fancy
24:20
word for regular expressions. And everything can be tagged; that is how we organize our stuff, because that proved to be really fast and simple. And now that we've learned something, let's apply this to a PDF file. For that, we've written a small JavaScript library,
24:40
which is really thin and just does what I showed you. I will choose a file, actually two files, and we'll try to analyze them. And what happens now is that it uploads those files to our store, it extracts the text from the PDF files,
25:00
and now it tries to apply patterns, or rather patterns with a certain tag that I just created, to those text files. And yeah, there is a kind of funny bug where it just jumps around, but let's watch it, again, in the triple view, and jump into the monitor view,
25:22
where it does the same thing. But it has already found a lot of patterns, right? And it has found some links. So let's open one of those. We see this is a link; again, these links are dereferenceable, and the entities from which they link
25:42
and to which they link, in this case from a publication to a dataset, are dereferenceable as well. There is just one small thing: they are not turned into hyperlinks in this interface, but they should still be dereferenceable.
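Schematically, such a link could be stored as a small entity of its own, pointing from the publication to the dataset and back to the textual reference it was derived from. All field names and URIs here are invented for illustration, not real InfoLiS identifiers:

```python
# Invented shape of a publication-to-dataset link entity; the URIs are
# placeholders, not real InfoLiS identifiers.
link = {
    "uri": "http://example.org/link/42",
    "fromEntity": "http://example.org/publication/7",      # the citing publication
    "toEntity": "http://example.org/dataset/example",      # the cited dataset
    "evidence": "http://example.org/textualReference/13",  # where it was found
    "confidence": 0.87,                                    # the algorithm's certainty
}

def is_dereferenceable(entity_uri):
    """In this sketch, 'dereferenceable' just means it is an HTTP URI."""
    return entity_uri.startswith("http")

print(all(is_dereferenceable(link[k]) for k in ("fromEntity", "toEntity")))  # -> True
```

Because both endpoints of the link are URIs, the same record can serve the discovery-system direction (publication to data) and the repository direction (data to publication).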
26:00
Okay, and we see that it has found a reference from this publication, which is referenced by this entity, to something called "Anomie ALLBUS". I don't really know what that is, but that is the DOI of the thing. So we've gone the full way, and yeah.
26:21
So that's it. If you have any questions, feel free to ask them and get in touch with us. And if you have any data that you want to run through this, try it out; it is kind of not that stable, or in rapid development, depending on how you look at it.
26:49
Thank you very much.