Metadata Records & RDF: Validation, Record Scope, State, and the Statement-centric Model
Formal Metadata

Title: Metadata Records & RDF: Validation, Record Scope, State, and the Statement-centric Model
Number of Parts: 16
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/47534 (DOI)
Transcript: English (auto-generated)
00:08
Is there a way to put it full screen? So wow, what a talk to follow, huh?
00:23
So I guess this is the second to last talk of the day, and it's going to be a bit of a rant, too. So I guess my hope for the rest of SWIB is that I can say something useful and meaningful as a follow-up to Karen, and that Valentin can bring us down into something that's good to part on.
00:41
Hi. So as was mentioned in the introduction, I work a lot on open source software. And a lot of what I'm about to talk about is coming out of an effort to write a framework for representing RDF graph data in a way
01:02
that developers can use in application development, in the ways they would normally expect to work. So that's a little bit of context for where this is coming from. So to begin with, I want to talk to you about graphs.
01:23
Do you know about graphs? We talk about them a lot. We don't always talk with a lot of clarity about what graphs are. So let's take a step back and look at a graph model for a second. Graphs are, well, let's see if I can get the slide to change.
01:48
Graphs are mathematical objects. They have nodes, and they have edges that connect the nodes. And this will be the only math slide with the embarrassing kind of handwritten chalk motif.
02:01
But these are mathematical things, and they're defined in RDF as mathematical things. And the graphs we talk about in linked data look like this. They have subject, predicate, and object, and your nodes are defined by the subjects and objects, and your edges are defined by the predicates, and off you go.
02:26
And they also look like this. And if we do that, that is to say, if we remove a statement, what we've done isn't to change the graph so much as to create a new, slightly different one, with one fewer edge and as many as two fewer nodes.
02:43
Because the graph is defined in terms of its member triples, that is to say it's a set of its member triples, these changes aren't edits. And that's a good thing, because if we want to find out what the difference is between two
03:00
graphs or what the difference in meaning is semantically, what we're claiming in each of them, we have a simple model for that. And the model is statements in, statements out. This is one reason why we as a community have been so eager to advance graphs as a way of expressing data, because we can capture meaning at this level.
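A minimal sketch of that statements-in, statements-out model, here in Python with the rdflib library (my choice for illustration; the talk itself prescribes no tooling):

```python
# Graphs are sets of triples, so "removing" a statement yields a new
# graph, and the difference between two graphs is plain set difference.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import FOAF

tim = URIRef("https://www.w3.org/People/Berners-Lee/card#i")

g1 = Graph()
g1.add((tim, FOAF.name, Literal("Timothy Berners-Lee")))
g1.add((tim, FOAF.nick, Literal("TimBL")))

# Copy g1's statements, then drop one: the result is a different graph.
g2 = Graph()
for triple in g1:
    g2.add(triple)
g2.remove((tim, FOAF.nick, Literal("TimBL")))

# Statements out, statements in: that's the whole diff model.
print(set(g1) - set(g2))  # the foaf:nick statement
print(set(g2) - set(g1))  # empty set
```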
03:23
Adding and removing and comparing is easy, and it's an attractive prospect. And I'm not here to tell you that we shouldn't do that. Another reason is that if I have this, these are triples that I know about, and you have this, or another one of my systems has this,
03:42
putting them together is as easy as merging the graphs, simply appending the one graph to the other. So combining this idea with the web and Follow Your Nose conventions that let me find your graph, we have linked data, and it's taken us quite far.
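And merging really is that cheap. A quick sketch, again assuming rdflib:

```python
# Graph merge as simple set union. (Strictly, an RDF merge also keeps
# the two graphs' blank nodes apart; plain union like this is safe
# when no blank nodes are shared.)
from rdflib import Graph

mine = Graph().parse(data="<urn:a> <urn:p> <urn:b> .", format="turtle")
yours = Graph().parse(data="<urn:b> <urn:p> <urn:c> .", format="turtle")

merged = mine + yours  # a new graph holding the union of both triple sets
assert len(merged) == 2
```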
04:00
It's taken us here to all of the things that we've heard about during this conference. But one thing that I've noticed is that we're talking a lot about read-write data workflows and information management workflows. And while the community has gotten really good at publishing and consuming linked data,
04:20
questions about how we manage it are mostly open. We've seen implementations here that serialize JSON-LD into NoSQL data stores. We've seen implementations that serve REST APIs from Elasticsearch. We have object graph modelers represented here. We have linked data platform and ad hoc patch formats.
04:42
This looks to me like the community grappling with how to manage graphs through stateful lenses. So this is coming from a section of the RDF Concepts and Abstract Syntax document, which is an interesting read if you haven't read it.
05:04
The RDF data model is atemporal. Graphs are static snapshots of information. However, RDF graphs can express information about events and temporal aspects of other entities
05:22
given appropriate vocabulary terms. Since graphs are defined as mathematical sets, adding or removing triples from an RDF graph yields a different graph. So this problem is called out directly in RDF 1.1
05:41
in this section called RDF and change over time. And my claim here is that we need a robust account of a thing that's at once as simple and complex as change in the resources and the descriptions of them that we manage. Again, this atemporal model is a good thing.
06:02
And besides, it's unavoidable: once you publish some linked open data, retraction isn't part of the ecosystem. It's on the web. Other people have pulled it down, presumably. Otherwise, you might be wasting your time with the publishing thing. And those assertions are out there.
06:20
And there's no system for you to say that assertion is invalidated. So a 2009 paper that has influenced my thinking about this pretty significantly makes a similar observation about XML documents. Since documents, including in data contexts,
06:43
are defined as strings, which are in turn defined as sequences of characters, how can we account for workflows that rely on those documents to change? So document modification seems to be routine and widespread. Editing is a familiar practice to almost everyone.
07:02
And revision is a fundamental feature of publishing workflows. Yet document modification would appear to be an illusion. Common accounts of what documents are seem to imply that documents cannot undergo genuine modification. So the point of this isn't to say
07:23
that those workflows don't exist. That's not the point of the paper. And it's not my point here. They clearly do exist. The point is simply to prod us into thinking about some important things which we might have missed by failing to understand how this problem
07:42
interacts with our data model. These stateless things interact with our data model. So possibly we're missing both things that we should understand about how we tackle information life cycles
08:01
and information management use cases like the sorts of things that Karen is pointing to with users that we may be overlooking amidst our bad faith rhetoric about the death of the record. So let's talk about metadata records for a minute. I use this term records deliberately
08:21
in part because I know that it will raise hackles among some of the people listening to this. And in part because I think that it stands a chance of inviting a group of people and practices that have been alienated by the linked data community back into the fold. So let me explain what I mean by a record in web friendly terms.
08:43
So this is the conference of HTTP methods. Here our happy client sends a request to a nondescript web server saying, give me one Tim Berners-Lee please. The nondescript web server responds 200 okay and sends back some data in a response body.
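Sketched in Python with the requests and rdflib libraries (the URL is Berners-Lee's actual W3C profile document, which has historically been served in Turtle):

```python
# "Give me one Tim Berners-Lee, please": an HTTP GET asking for an RDF
# representation, then parsing the 200 OK response body into a graph.
import requests
from rdflib import Graph

resp = requests.get(
    "https://www.w3.org/People/Berners-Lee/card",
    headers={"Accept": "text/turtle"},
)
resp.raise_for_status()  # we expect 200 OK

g = Graph().parse(data=resp.text, format="turtle")
print(len(g), "statements came back")
```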
09:04
When the response is an RDF representation, as in this case, this is Tim Berners-Lee's FOAF file. It's his profile description for the W3C. It's customary to return some statements about the requested resource.
09:22
So we get a nice description of Tim BL's personal profile document, his FOAF file. And we're told that the primary topic of the document is Tim BL. When constructing such a response, one common pattern is to return all of the triples
09:41
that the server knows about where the requested resource is the subject. So that means you return a graph generated on the fly, and it includes only the edges and nodes that are adjacent to the requested resource. Usually this is extended to include connected resources
10:03
that don't have requestable URIs of their own. So blank nodes or hash URIs, resources that are identified by hash URIs that have a base that's shared with the other resource. And anything adjacent to them also gets included, possibly many triples out,
10:21
because otherwise there's no way to go collect that data from another follow-your-nose style request. And this is also a common implementation for the loosely specified SPARQL DESCRIBE. This word "describe" is an interesting one and we'll come back to it.
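That pattern is roughly a "concise bounded description". A sketch of it, again with rdflib (recent rdflib versions also ship something like this built in, as Graph.cbd()):

```python
# Collect every triple whose subject is the requested resource, then
# recursively pull in triples about blank nodes reachable from it,
# since those can't be fetched by a follow-your-nose request of their own.
from rdflib import BNode, Graph, URIRef

def describe(dataset: Graph, resource: URIRef) -> Graph:
    result = Graph()
    frontier = [resource]
    seen = set()
    while frontier:
        node = frontier.pop()
        if node in seen:
            continue
        seen.add(node)
        for s, p, o in dataset.triples((node, None, None)):
            result.add((s, p, o))
            if isinstance(o, BNode):  # keep walking through blank nodes
                frontier.append(o)
    return result
```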
10:40
Another common pattern, an alternate pattern, is to serve out a graph with many subjects deemed to be primarily about, quote unquote, a given subject. This is what we see in the Tim Berners-Lee example. So while there are many unique subjects in the graph that's returned, Berners-Lee himself has an amorphously special status here.
11:03
But in either case, what is retained when we send the next request and get a different graph is the aboutness. The graph returned is in some meaningful sense a representation of the resource it describes.
11:20
So this starts to get at what I mean by record in RDF. RDF source is an interesting concept that's not terribly well known. I know of three references to this concept, so I'm just gonna jump in with the first one, which follows immediately from the previous reference
11:42
to RDF and change over time, this section, in 1.1 Concepts. And this is kind of a funny passage. We informally use the term RDF source to refer to a persistent yet mutable source or container of RDF graphs.
12:00
A source is a resource that may be said to have a state that can change over time. A snapshot of the state can be expressed as an RDF graph. For example, any web document that has an RDF-bearing representation may be considered an RDF source. Like all resources, RDF sources may be named with IRIs
12:23
and therefore described in other RDF graphs. So in some sense here, what's being said is that an RDF source is a time sequence of graphs or maybe it's easier to think of it as a container, that at any given time during its life cycle,
12:43
you can ask for its current state and it will give that back to you. And this is an interesting concept when we juxtapose it against the assumption that what we're dealing with in linked data is a giant global graph.
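A toy sketch of that reading of RDF source, in Python (the class and method names here are invented for illustration):

```python
# A persistent yet mutable container for RDF graphs: at any moment you
# can ask for its current state, and what you get back is an ordinary
# static graph snapshot.
from rdflib import Graph

class RDFSource:
    """A mutable resource whose value at any given time is an RDF graph."""

    def __init__(self):
        self._state = Graph()

    def add(self, triple):
        self._state.add(triple)

    def remove(self, triple):
        self._state.remove(triple)

    def current_state(self) -> Graph:
        # Hand back a copy: graphs already handed out don't change
        # when the source's state moves on.
        snapshot = Graph()
        for triple in self._state:
            snapshot.add(triple)
        return snapshot
```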
13:05
So these are stateful resources. Linked Data Platform, a recent W3C specification for handling RESTful interactions, that is to say stateful interactions with RDF documents, RDF descriptions,
13:24
uses this concept pretty heavily. So an LDPR in this context is just any resource that conforms to a base set of HTTP interaction patterns described in LDP, so you send a certain kind of request
13:41
and it gives a certain kind of response and there's a set of assumptions built in to LDP that makes it possible for you to classify things as LDPRs, LDP resources. LDP RDF sources are a subset of this. Basically, they're a subset that contain RDF documents
14:01
as opposed to say binary blobs or videos or images or what have you. So an LDPR whose state, LDP RDF source is an LDPR whose state is fully represented in RDF corresponding to an RDF graph. So what this means is if you send that get request,
14:23
you get back the current state of the resource, and because of the way that LDP's interaction patterns are defined otherwise, assuming that the server supports update, you have a set of methods also for updating that resource.
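A sketch of that interaction pattern with the Python requests library; the URL is hypothetical, and the If-Match/ETag pair is the usual LDP guard against clobbering a concurrent update:

```python
# GET the current state of an LDP RDF Source, then replace that state
# with PUT, conditional on the state we read still being current.
import requests

url = "https://example.org/ldp/resource1"  # hypothetical LDP-RS

resp = requests.get(url, headers={"Accept": "text/turtle"})
etag = resp.headers.get("ETag")  # assuming the server supplies one
turtle = resp.text  # the resource's state right now, as an RDF graph

new_state = turtle + '\n<> <http://purl.org/dc/terms/title> "Revised" .\n'
requests.put(
    url,
    data=new_state.encode("utf-8"),
    headers={"Content-Type": "text/turtle", "If-Match": etag},
)
```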
14:44
This is the only other reference that I know of to RDF sources in any kind of formal writing anywhere. If you know of any others, I would like to know. "The entire semantics," it says, this is from the RDF 1.1 Semantics, which is a less pleasant read
15:02
unless you have a certain bent. So the entire semantics applies to RDF graphs, not to RDF sources. An RDF source has semantic meaning only through the graph that is its value at a given time or in a given state.
15:21
So it has semantics for previous values as well. But graphs cannot change their semantics with time. So the thing to take away from this is that this RDF source concept is really talking about different graphs for any given instant
15:40
that you might access them. So what are examples of things that might look like RDF sources? Any web resource that returns an RDF description in an RDF serialization is going to look an awful lot like an RDF source.
16:01
This includes HTML documents with embedded microdata. Named graphs, this is an interesting one. But it kind of suggests a way that we might move forward with stateful management of our resources.
16:20
SPARQL DESCRIBE and SPARQL CONSTRUCT both return RDF graphs and have stable identifiers. And assuming that you're running these queries against a data set, you can expect those queries to give some kind of representation of something reflected in the data set.
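For instance, a fixed CONSTRUCT query is a stable identifier whose "value" at any moment is whatever graph it yields against the dataset's current state. A sketch with rdflib:

```python
# The query text stays constant; rerunning it as the dataset changes
# gives you a time sequence of graphs, much like an RDF source.
from rdflib import Graph

dataset = Graph().parse(data="""
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    <urn:tim> foaf:name "Timothy Berners-Lee" .
""", format="turtle")

query = """
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    CONSTRUCT { ?person foaf:name ?name }
    WHERE     { ?person foaf:name ?name }
"""
view = dataset.query(query).graph  # an ordinary RDF graph snapshot
```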
16:44
Also API resources, like some of the ones we've seen in lobid and the like. MVC model objects, MVC is model-view-controller for the non-developers. This is like an object that you use to marshal data towards your user interfaces
17:04
and update it, et cetera. Also I think, and Ruben can correct me if he disagrees, but I think triple pattern fragments fall into this category as well and probably lots of other things. So some of those things don't really make a lot of sense
17:22
to think of as records. So what I'm saying here isn't necessarily that an RDF source aligns closely with a record. CONSTRUCT in particular is an odd example, since it's possible to induce the server to include in the response whole bunches of data that's totally new and originates in your request,
17:44
contingent on the contents of the data set. You can do interesting things with CONSTRUCT. But let me suggest some qualities that we might want to use to separate sources that are not very record-like from those that might be.
18:01
So records we might think of as having consistent aboutness and some of those source types might fit that bill. Records we would hopefully want to think of as having stable scope. So if we were to pull down a record,
18:21
we would sort of expect the same kind of range of descriptions to fit in. And we would expect records to be amenable to managed life cycles. So what I mean by this is if we have information
18:42
processes that we're trying to support with our applications, information processes in our libraries, we want to be able to interact with the records in a consistent way. We want their life cycle to represent
19:01
a workflow in a meaningful way. So in some sense you might think of these records as subsets of a larger graph and we can do that juxtaposition thing and still have
19:22
our lovely global graph structure, but also have recourse to managing them as part of a process. Your mileage may vary on this. In thinking about record use cases, I found it useful to look back at the Dublin Core Abstract Model
19:42
and description set profiles. There are some interesting things here. The abstract model of DC metadata description sets is as follows, and you can kind of draw a nice parallel: description sets align closely with the RDF source concept,
20:04
descriptions with that first option that we dealt with earlier, where the subject is always the same. Descriptions in DCAM look like key-value pairs that are all about the same subject, that's the one-to-one principle.
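In rough code terms (a sketch, again assuming rdflib), a DCAM-style description set is what you get by splitting a graph up per subject:

```python
# One "description" per subject (the one-to-one principle), and the
# collection of them as a description set.
from collections import defaultdict
from rdflib import Graph

g = Graph().parse(data="""
    @prefix dct: <http://purl.org/dc/terms/> .
    <urn:book> dct:title "A Title" ; dct:creator <urn:author> .
    <urn:author> dct:description "Wrote the book." .
""", format="turtle")

description_set = defaultdict(Graph)
for s, p, o in g:
    description_set[s].add((s, p, o))

assert len(description_set) == 2  # two subjects, two descriptions
```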
20:21
And also there's this concept of statement. Below this, DCAM starts to get a little bit fussy with what I think are simply out-of-date formalisms about what values look like, and there's "value surrogate" referenced here,
20:44
and you can probably safely stop reading past this point. But there's also this interesting reference where description sets are treated as separate from records, where records are documents that represent a snapshot
21:02
of a description set, which I found very interesting. So yeah, that's my description set takeaway, and with application profiles, I think we have an opportunity to dig a little deeper
21:23
on how we can scope records. So I think there's, oh, that's just the DPLA MAP, we'll skip that. So I think there's an important distinction
21:42
that we need to make between human-actionable application profiles and machine-actionable ones. Description set profiles is an interesting start towards some of this, towards the validation side,
22:01
but I actually want to call on us to be looking at ways to actively identify records, as information management resources, from larger graphs, not just in terms of looking at them and validating whether they fit a certain shape,
22:20
but also picking the aboutness of resources out of our larger graphs using our application profiles. So that's, yeah, I think that's it. So I'm interested in talking about next steps for this, and I think we have time for questions, yes?
22:44
Of course we have. Thank you.