RDF by Example: rdfpuml for True RDF Diagrams, rdf2rml for R2RML Generation
Formal Metadata
Number of Parts | 16 | |
License | CC Attribution - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this | |
Identifiers | 10.5446/47577 (DOI) | |
Transcript: English(auto-generated)
00:23
I hope everybody had a good lunch break, so welcome to this session.
00:49
My name is Osma Suominen, I'm from the National Library of Finland, and we have four speakers today. But first of all, an announcement concerning the lightning talks. The lightning talks will be held after the coffee break, after this session.
01:04
We have eight lightning talks registered, so every speaker will get four minutes to talk on their subject, so you know how much you can fit into that. Okay, the first speaker in this session is Vladimir Alexiev, and he leads the
01:23
data and ontology management group at Ontotext Corporation from Bulgaria, which is one of the leading semantic technology companies and has been doing this for a long time. They have about 70 people, and one of their products is the GraphDB database, triple
01:42
store. He has a PhD in computer science from the University of Alberta, and he has done some projects, including the ResearchSpace project with the British Museum and the Yale Center for British Art, and he's also been working on publishing the Getty Trust vocabularies,
02:04
including the Art and Architecture Thesaurus, for example, as linked open data, and he has also done several projects for Europeana. His talk today is titled RDF by Example: rdfpuml for true RDF diagrams and rdf2rml
02:21
for R2RML generation. What a mouthful. So, welcome Vladimir. Anyone who attempts to pronounce these abbreviations is quite a brave man. You'll see a lot of diagrams in my presentation. You won't be able to read most of them, but
02:44
in addition to this version, which is a presentation, there is also a continuous HTML version where the diagrams are much bigger and you can read it at your leisure. I'm sure the organizers will put up the URL for this. It's on GitHub. The link for the continuous HTML is here.
03:03
There is no PDF yet; I'll probably also make a PDF. In my daily work, I do a lot of data modeling for all kinds of domains, you'll see examples further on, and I've always wanted a good visualization tool. I've tried several.
03:24
And I think that this is very important for RDF modeling because that's a graph data model, right? And I think that the people who hold the data, the subject matter experts, the cultural heritage professionals, the librarians, they have to be able to understand it and say
03:41
whether the mapping is right or not. So I made a tool that uses PlantUML, which itself uses Graphviz. Both are languages where you can define a diagram in textual form. PlantUML is widely used in the software industry for UML diagrams, just describing them in text.
04:05
And the benefit of generating diagrams directly from RDF is that they are true, they are exactly what you mean in your model. It's not just hand waving and you don't need to update them or tweak them, they're
04:20
all laid out from the RDF. So here's a very simple example. I don't know how many of you are familiar with Turtle, but on top is the Turtle code. The last line there is a little instruction; in this case it just says that one of the nodes should be displayed inside the node where it is referenced instead of
04:43
being displayed as a separate thing. And you see that the graph is easy to read; it corresponds to the Turtle basically one to one. Everything that I can put in the node is put in the node to save space and reduce clutter. These are called inlines, so the types and RDF literals are inlined, but you can
05:02
also inline more nodes. This is the generated PlantUML. You see that it's not a very complex language: you have arrows there, node names and so on, but it can quickly get tricky when you have more features. Because readability is a very important concern for this sort of diagram, I've
05:26
done several features. For example, if you have parallel arrows, then I only show one arrow with several labels. If you have several values for the same predicate, then they are collected inside the node with parentheses, basically shortcuts that you can also see in Turtle.
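The inlining idea can be illustrated with a minimal Python sketch. This is not rdfpuml's code; the function name `to_plantuml`, the tuple-based triple format, and the convention that literals start with a double quote are all assumptions made for illustration. It puts types in the node title, collects literal values per predicate inside the node, and draws arrows only between nodes.

```python
from collections import defaultdict

def to_plantuml(triples):
    """Render (subject, predicate, object) string triples as PlantUML text.

    Hypothetical conventions: predicate "a" means rdf:type, and an object
    starting with a double quote is a literal.
    """
    types = defaultdict(list)                         # node -> types for the title
    fields = defaultdict(lambda: defaultdict(list))   # node -> predicate -> literals
    edges = []                                        # node-to-node arrows
    nodes = []                                        # nodes in first-seen order

    def see(node):
        if node not in nodes:
            nodes.append(node)

    for s, p, o in triples:
        see(s)
        if p == "a":
            types[s].append(o)          # inline the type into the node title
        elif o.startswith('"'):
            fields[s][p].append(o)      # inline the literal into the node body
        else:
            see(o)
            edges.append((s, p, o))

    alias = {n: f"n{i}" for i, n in enumerate(nodes, 1)}
    lines = ["@startuml"]
    for n in nodes:
        title = n + (" a " + ", ".join(types[n]) if types[n] else "")
        lines.append(f'class "{title}" as {alias[n]} {{')
        for p, values in fields[n].items():
            # several values of one predicate share a single line
            lines.append(f"  {p} {', '.join(values)}")
        lines.append("}")
    for s, p, o in edges:
        lines.append(f"{alias[s]} --> {alias[o]} : {p}")
    lines.append("@enduml")
    return "\n".join(lines)
```

Feeding the result to PlantUML should give one box per resource, with literals listed inside the box rather than drawn as separate nodes.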
05:44
I also handle reification and other specialized kinds of things; we'll see them later. So this is an example of arrows that collect the property names. In this case we have several properties connecting two nodes, and to save space we display them this way.
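Collecting property names on one arrow amounts to grouping edges by their (source, target) pair. The following Python sketch shows that grouping under assumed names (`merge_parallel_arrows` and the tuple edge format are hypothetical, not taken from the tool); labels are joined with PlantUML's `\n` escape so they stack on a single arrow.

```python
def merge_parallel_arrows(edges):
    """Merge (source, property, target) edges that connect the same two nodes.

    Returns one PlantUML arrow line per (source, target) pair, with all
    property names stacked as a multi-line label.
    """
    labels_by_pair = {}
    for s, p, o in edges:
        labels_by_pair.setdefault((s, o), []).append(p)
    newline = "\\n"  # PlantUML's escape for a line break inside a label
    return [
        f"{s} --> {o} : {newline.join(labels)}"
        for (s, o), labels in labels_by_pair.items()
    ]
```

Because Python dictionaries keep insertion order, the arrows come out in the order in which each node pair was first seen.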
06:04
This is from CIDOC CRM, modeling the Getty CONA aggregation of cultural objects. Now we can do a bit with the arrows, for example, change the direction. By default the direction goes down, but in this case, because supposedly this thing on the left,
06:26
which was the motivation for doing the thing on the right happened earlier in time, I just want to put it on the left to emphasize the chronological order. We can also change the shape of the arrow, dashed versus solid and that kind of stuff.
06:42
We can also put in what are called stereotypes; these are the colored circles plus the italicized names in guillemets on top. This is from the Getty thesauri. They have different kinds of nodes that are implemented using, on one hand, SKOS, and on the other hand, this on the bottom comes from ISO 25964, the latest standard on thesauri.
07:07
So here we're just showing G is guide term, A is thesaurus array; some particular constructs in that mapping we're emphasizing with these circles and stuff. I mentioned reification. The thing here is, if you want to say more about the
07:24
relation, how do you do it? Let's say confidence or provenance of that relation, some date or who created it, things like that. And there is something called the Property Reification Vocabulary, which basically allows you to describe which properties are
07:43
used in the reification to address the relation and so on. So the tool recognizes RDF reification and CIDOC CRM reification constructs and displays them like this. So basically the idea is, see this node on the bottom, rather than connecting it with
08:05
two arrows to the source and to the subject and the object, we just attach it to the relation and the addressing properties are here on the bottom and they have a little arrow on the left, on the right and, you know, this colon showing, just showing the
08:20
reader where to look to figure out how it relates to the arrow. Now we're getting to some more complex diagrams. So this is from the Getty CONA, modeling sources and contributors. You see a rather complex diagram, luckily it's just
08:40
a tree. We also see here collecting of values, so for example over here we have more than one value, and also comments: if you want to describe to the reader what this code corresponds to, you can put an rdfs:label, and for an inline node it will be shown as a comment after a hash sign. This is another part of the
09:08
Getty CONA, the Iconography Authority. It is similar in scope to Iconclass, which was described in the presentation before the break, and here again we have reification, and we have a custom arrow
09:21
because "has spouse" is a bidirectional, symmetric property, so we show it to the left and without any arrowhead. I also did some work for the American Art Collaborative, which is 12 museums in the States trying to map their data to CIDOC CRM and establish a
09:41
demo service. This is one alternative for modeling the concept of "cast after", where one sculpture is cast after another sculpture from the same mold. It could be interpreted as a different sort of network. This is from the Europeana task force on FRBRoo, and
10:05
FRBRoo is a CIDOC CRM extension for bibliographic data. If you have four classes in FRBR, you have I think about 35 in FRBRoo. So here are some works after Don Quixote and, you know, the various connections between them and so on,
10:24
rather complex. This example is from the European Holocaust Research Infrastructure project. One of the research problems they're trying to tackle there is to investigate Jewish social networks and how they influenced people's chance of survival. And this is just the model here, showing that it uses CIDOC CRM
10:46
and the German National Library's AgRelOn, which stands for Agent Relationship Ontology, to model the social network. This is an example from a European project called MULTISENSOR. It dealt with video
11:01
annotation, news annotation, social network stuff. So it used a bunch of related ontologies for media fragments, open annotation and also, to a large extent, NIF, the NLP Interchange Format for describing NLP results over text. This is a bit of a model for
11:23
social network analysis showing influence and centrality for a person in a social network. This is again modeling open annotation and confidence, which comes from another ontology, the Stanbol FISE. Here is an example of again using these
11:43
stereotypes. So on the left you have an original article in Spanish, these letters SSS, and on the right a translation in English. And you see that with just a bit of control, just saying this translation should go to the left rather than down.
12:00
We see very well the parallelism in the two parts of the network. Now, because one of the partners in MULTISENSOR also does FrameNet analysis, we devised a way to embed FrameNet into NIF. This is again generated with PlantUML but not from RDF.
12:21
It uses PlantUML packages just to show the grouping of nodes. Now this here is an actual example of a FrameNet annotation. And it's not a model, it's generated from actual triples about one sentence. Even though it's totally unreadable to anyone who cannot zoom very deeply into it.
12:43
And this is only half of that diagram, this is the other half. But it was important for us to see the connectivity of this network and to make sure that the triples we were making were right. I redid one of the examples in the open annotation specification. So you have here blank nodes, lists, and interesting kinds of stuff. I proposed to the PCDM people, this is
13:09
the Portland Common Data Model for a common metadata model for institutional repositories. And over there the idea is that you use the circles to designate
13:22
different types of metadata. It's quite easy to write. This is a handmade diagram from LinkedTV. We discussed this project before the break. It's about video annotation. And pretty much an equivalent thing is in this diagram, which is generated from RDF. And of course it took a lot less
13:43
effort to create this one. This is the model of, if you have heard of the International Consortium of Investigative Journalists, the Panama Papers, and now the Bahamas Leaks. So we did an RDF rendition of this, and this is sort of the data model that we used. It turns out that on GitHub they
14:06
can show a diff of two images. So as the model was evolving, on the left you see the old version and on the right is the new version. Okay, and now in the last several minutes I want to talk a
14:21
little bit about whether we can use these models to generate conversion to RDF. I mean, everybody's data is in different systems, a lot of the data is in relational systems, and the W3C standard for conversion from relational to RDF is called R2RML. And yeah, it turns out that if instead of
14:43
sample values we use field names, I made another tool that can generate R2RML conversions out of that. So here is a model of exhibitions from the Getty Museum, and just this node in the middle, it describes a particular sub-exhibition, if you will.
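To make the idea concrete, here is a small Python sketch of turning a model that names fields into an R2RML triples map. It is not rdf2rml itself: the function `model_to_r2rml`, its parameters, and the table and column names in the usage note are hypothetical, and prefix declarations are left out. Only the `rr:` terms come from the W3C R2RML vocabulary.

```python
def model_to_r2rml(name, table, subject_template, rdf_class, columns):
    """Build one R2RML triples map in Turtle syntax.

    name             -- local name of the triples map (hypothetical)
    table            -- relational table to read from
    subject_template -- IRI template with {COLUMN} placeholders
    rdf_class        -- prefixed class name for the subject
    columns          -- list of (predicate, column_name) pairs
    """
    parts = [
        f'  rr:logicalTable [ rr:tableName "{table}" ]',
        f'  rr:subjectMap [ rr:template "{subject_template}" ; rr:class {rdf_class} ]',
    ]
    for predicate, column in columns:
        # each field named in the model becomes one predicate-object map
        parts.append(
            f"  rr:predicateObjectMap [ rr:predicate {predicate} ; "
            f'rr:objectMap [ rr:column "{column}" ] ]'
        )
    return f"<#{name}> a rr:TriplesMap ;\n" + " ;\n".join(parts) + " ."
```

For instance, `model_to_r2rml("Exhibition", "EXHIBITION", "http://example.org/exhibition/{EXHIBITION_ID}", "ex:Exhibition", [("rdfs:label", "TITLE")])` yields a map with one predicate-object map reading the assumed `TITLE` column.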
15:04
An exhibition being at a particular site, in case it's a travelling exhibition. Out of this we generate R2RML and one node generates about 15, because in R2RML you have to be very specific about the subject, the property, the object, every object you have
15:22
to describe with a separate node, and so on. And so what this generator does, it saves you a lot of work, and also allows a subject matter expert to inspect the model, and guarantees that the transformation will be consistent with the model. After we feed it this relational data,
15:41
it produces this actual RDF. The shape of this RDF is pretty much the same as in the model, but because you have two exhibitions over three venues, there are more nodes in it, right? This is a more involved example, the central node of the Getty Museum RDF, which would
16:02
be the museum object and nodes around it. And so this R2RML generation is working well for converting relational sources, but we're also having to deal with XML, with JSON sources. And then the question is can we extend this, and we're currently working to extend it
16:23
for other types of input. So there is RML, which is an extension of R2RML, to deal with JSON and XML. We're currently experimenting with it. There is XSPARQL, which is a melding of XQuery and SPARQL, and I think that we might be able to generate XSPARQL, or at least
16:44
a subset of it. For tabular stuff, there is Tarql. And here are just a few models to finish off. So up to now I've been showing stuff from cultural heritage, but here are things from clinicaltrials.gov. We have rather extensive experience with
17:06
life sciences, things that are important to pharmaceutical companies and so on. And this is a model for just one part of clinical study results, which basically describes the statistical outcome. Lately I have been working a lot with company data, so this is
17:25
Dun & Bradstreet data that is mapped to the Financial Industry Business Ontology, FIBO. Or this is the Legal Entity Identifier. This is a global initiative to make a sort of
17:41
global trade register, basically to make all of the US funds that created the crash of a few years ago at least register and disclose their shareholding and control structures. And again, this model maps the GLEIF data to FIBO. The difference is that over here we have XML XPaths, and inside the nodes are XML fields rather than relational
18:04
fields. In the further future, we hope to extend this towards RDF shapes. What you have been seeing here are RDF shapes, but there is a standard for that called SHACL. And I think it's a more modern approach compared to ontologies to describe your semantic data
18:22
model. And first of all, to be able to visualize RDF shapes, and secondly, to be able to generate them from a more succinct representation, I think can be quite useful. Thanks.
18:44
I'll generate some myself. Any questions from the audience? Thank you for your presentation. Out of curiosity, what visualization did you try first? Just
19:05
how scalable are the visualizations? To what degree can you actually fit stuff on your screen and still make sense of it? Well, a good example is VOWL. VOWL can visualize an ontology, and it's integrated in several toolkits for working with ontologies. But
19:28
in order to really be able to read the VOWL diagram, because you have overlap of the nodes and of the labels, you need to drag them around to review stuff. Before that, there were visualizations that rely on Graphviz, but because they put every node out, and
19:47
because, for example, they don't use prefixes, don't shorten the node URLs or the property URLs, they're very hard to read. As for your second question, it's very important. I think what you've seen here is kind of the maximum you can cram on a screen, but
20:03
it's not a problem. You don't try to describe a complete mapping of, let's say, 200 fields on one screen. You split it up into four or five screens. Then you can just run the generated R2RML files in succession, and they will spit out whatever is needed. I think maybe the strongest visualization tool I've seen is by AllegroGraph. I've played with it just
20:28
a bit. I cannot make it work outside of AllegroGraph. And, being a competitor to AllegroGraph, it doesn't do a good enough job for me.
20:43
Any other questions? Maybe I'll ask one. It wasn't obvious to me, so is rdfpuml available somewhere so that I could play with it, or is it still in the works? It's still not clean enough to put it out, and we're still wondering whether this is
21:05
such a smart tool that we can make some money out of it or whether we want to open source it. It is being used by others in the American Art Collaborative, but yeah, we will decide this later on. Okay. Anyone else? If there are no further questions, then we will thank the speaker.