Crossref Event Data - Transparency First
Formal Metadata

Title: Crossref Event Data - Transparency First
Number of Parts: 9
License: CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/46266 (DOI)
Transcript: English (auto-generated)
00:00
I'm the tech lead on Crossref Event Data. I was supposed to be up first, and it was kind of cool that I wasn't, because I had the chance to hear some really interesting stuff this morning in the discussions and in the first presentation. So it was really good to hear that stuff, and I'm going to address some of it. So I'm here to talk about Crossref Event Data and the transparency-first approach that we took.
00:23
So what is Crossref Event Data? Well, first of all, who's Crossref? Who are we? Well, we're an association for scholarly content. Our members are publishers; there are about 5,000 of them. And the main thing we do is assign persistent identifiers in the form of DOIs. I'm sure most of you have seen our DOIs. DataCite also does DOIs, for datasets.
00:43
And we have metadata about the DOIs. Maybe some of you have used our metadata API. Most of our content items are articles, but there are also things like book chapters. And we have APIs for sharing the data, getting it out, all kinds of stuff. And Event Data will be one more of these APIs.
01:03
And the reason Crossref was formed was that there were many publishers, and in order to link to each other they would have had to make bilateral agreements. And 5,000 publishers all making individual agreements is crazy. So we're the central linking hub. And the stuff that we do is all stuff that all the publishers, all our members, are interested in doing.
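[Note: to make the scale concrete, full pairwise linking among roughly 5,000 publishers would mean 5,000 × 4,999 / 2, or about 12.5 million, bilateral agreements, versus about 5,000 agreements with one central hub.]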
01:22
Crossref Event Data is under development. It's not ready; we've not launched. But I'm here to talk about it, and some of this stuff is available now. So Crossref Event Data collects events from sources, and I'll talk about what an event is a bit later. It's a pipeline that other people can contribute data to, for example DataCite and maybe others.
01:43
And the data in Event Data is a record of all the stuff that happens around items like articles: for instance, when they're mentioned on Twitter, on the web, et cetera. And it's going to be a set of APIs for you to get the data out. Crossref Event Data will not be a metrics platform. We don't count anything. We give you the data; we're not making any judgments.
02:02
It's also not an end-user service. There won't be a nice, neat little web page that an end user can use. It's going to be data. And it's people like you, the people in this room, that we hope will be using this stuff. So why are we doing events? Well, we participate in the NISO Code of Conduct, and that was a really interesting process. And it describes various points along the chain.
02:24
And one of the points is the data provider, another is the aggregator, another is the user. And it was pretty clear from the discussions we were having in the working group that transparency through the whole process was really, really important. We plan to be part of loads of different pipelines, maybe
02:40
unpredictable ones, with people using our data in ways that we wouldn't have predicted. So we want to make sure that, for our part in the pipeline, we're as transparent as possible. So we came up with this kind of transparency-first design. And with every design decision that we made, we made sure that we were as transparent as possible.
03:01
So we asked: how do we design a system that's transparent end to end? When you make a metric, you're making some kind of judgment. Are you counting tweets? Are you counting retweets or originals? Whatever. Every kind of metric is a kind of judgment. And we don't do that. Everybody will have different kinds of metrics that they want to use, but we're not going to make metrics. We are just going to provide events.
03:23
Also, an individual event can be linked to evidence very clearly. A metric might combine lots of different data, and it's more complicated. We want to make the data as simple as possible. And for all these reasons, we can also provide events that are all in a familiar format.
03:40
So for all these reasons, we're doing events, not metrics. It's not available yet but, as I say, some of it is. And some of the data I'm going to be talking about is from Twitter, Wikipedia, Reddit, and newsfeeds, that is, blogs. So, unfortunately, I thought I'd have a bit of a bigger screen than this. But this is an example of one of the events I'm talking about.
04:02
It's from Twitter. And all of our events are in the format subject-verb-object. So the subject is going to be something like a tweet or a web page. The verb is a relation like cites, mentions, whatever. The object is usually a DOI. So I don't know if you guys can see.
04:21
You have subject ID at the top, which is this tweet. You have object ID, which is this DOI. And at the bottom, you have relation type discusses. And it's from the Twitter source. We also have metadata about the tweet. For example, the title is the actual text of the tweet, the publication date, the URL of the tweet, and the
04:42
author URL. Every event also has an ID, which you can see here. And then you have a Wikipedia example.
05:01
Sorry you guys can't really read it. But the crucial thing here is the message action is delete. In Wikipedia, citations can come and go over time. And this event records the fact that the article Aloavirus did have a reference to this DOI, but it was deleted on this date. I'm going to skim over Reddit, because it's a similar thing.
05:24
And you can't really see the stuff. But you can see that this Reddit comment mentioned this DOI. Newsfeeds again: the blog post at this URL mentioned this DOI here.
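[Note: the example slides are not legible in the recording. The following Python sketch is reconstructed from the speaker's description; the field names and values are illustrative assumptions, not the final Crossref Event Data schema.]

    # A Twitter event in subject-verb-object form, as described above.
    tweet_event = {
        "id": "00000000-0000-0000-0000-000000000000",  # every event has an ID
        "subj_id": "http://twitter.com/someuser/status/123456789",  # the tweet
        "relation_type": "discusses",                  # the verb
        "obj_id": "https://doi.org/10.5555/12345678",  # the DOI
        "source": "twitter",
        # Metadata about the subject: the tweet text, date, and URLs.
        "subj": {
            "title": "The actual text of the tweet",
            "issued": "2016-09-27T10:15:00Z",
            "URL": "http://twitter.com/someuser/status/123456789",
            "author": {"URL": "http://twitter.com/someuser"},
        },
    }

    # A Wikipedia event recording that a citation was deleted on a given date.
    wikipedia_event = {
        "subj_id": "https://en.wikipedia.org/wiki/Some_article",
        "relation_type": "references",
        "obj_id": "https://doi.org/10.5555/12345678",
        "source": "wikipedia",
        "action": "delete",  # citations can come and go over time
        "occurred_at": "2016-09-20T08:00:00Z",
    }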
05:40
So there are a few thoughts that come out of this, out of the kinds of events that we've seen here. In the case of Twitter, there are a lot of questions. Was the tweet an original, or was it a retweet? For example, some people might see more value in original tweets than retweets. Some people may be interested in what happens with retweets, how they vary
06:03
across subject areas. Who tweeted it? It may be interesting to see who the actual author of the tweet was. Was it a bot that you know about? Was it an individual? Do you want to go and talk to them? In the case of Wikipedia, was the reference added or removed? Why was it added or removed? Are there comments attached to the edit?
06:22
Who actually did it? Was it an admin? Was it a user? Did they have any history? Was this part of an edit war? We see on Wikipedia that pages get fought over. And you sometimes see a long stream of references being added or removed over time. And that's a bit of context around Wikipedia that might not happen on other platforms.
06:41
Was this just an act of random vandalism? I've seen a few examples where somebody just vandalizes a page and puts some random content on there, and as a result a citation was removed. We shouldn't really read anything into the fact that the citation was removed, because it went along with all this other content. It was just kind of collateral in this event.
07:02
Was it added and removed repeatedly as part of an edit war? If you look at the context of the edit, was the same edit reverted and put back a number of times? In the case of Reddit, was this a submission, i.e. a whole discussion, or was it just a comment in a discussion? The purpose of a comment is different to the purpose of a
07:23
whole discussion. Who submitted it? Was it a Reddit bot? Was it a bot that harvests newsfeeds and puts articles on there, or was it an individual? Which subreddit was it on? If it's on a special-purpose science subreddit, that makes sense; if it was on a general-purpose news subreddit, that
07:42
might mean something. It might mean an article has finally found fame. Was this a genuine discussion, or was it a PDF request? We're not making any judgments about that, but they mean different things. So in the case of all sources, there are questions like: did the author use a DOI?
08:02
Are they an academic who knows the purpose of persistent identifiers? Or, more likely, did they use a URL for the landing page, because that's the most common thing? And the point is, there's a load of different facets, and they all matter differently to different people. And there's a load of people in this room who will use these different things differently.
08:22
So we want to make sure we don't throw any of this data away. And I've heard a few more things. Two minutes, my god. I've heard a few more things this morning, but before today, I came to the conclusion there were three types of context that we were interested in. There's the context of the platform's technology.
08:42
For example, Wikipedia encourages you to use DOIs for citations, whereas Twitter doesn't encourage you to do that. With Wikipedia pages, they can change over time. And you can see the history of that. With blog pages, they can also change over time. But they only have one author, and you can't see that edit history.
09:01
And then you have the context of the community that is built around the platform. For example, there are edit wars on Wikipedia. Do they happen elsewhere? A blog might have similar arguments, but they might happen in the comments. And finally, you have the context of the individual event. For example, the tweet might be a reply to something else or part of a conversation.
09:21
In the case of Wikipedia, you might look at the context of that edit and see that there were a load of different ones. So if we weren't very careful when we were designing this service, we'd start with data from the source, we'd produce events, and there would be this evidence gap between the two. And we're very keen to avoid that. So instead, we have this evidence-first approach.
09:43
Event Data gets data from sources, and it produces evidence. And from the evidence, it builds events. And the point of the evidence is to capture all of the context that we could find. And then the event data and the evidence data are made available to end users on an equal basis.
10:01
So you can get the event, but you can also say: tell me more about how you came to this conclusion. So an evidence record exists for every event that we collect. There may be other data sources that we don't collect, but everything that we do has an evidence record first. And the idea of the evidence record is to bridge the gap between the input data and the
10:21
event that you get. And it captures everything that we know. So it captures the data that came in from the original data source, but also captures, for example, the list of landing page domains that we knew about at the time. That's useful for Twitter, so you know what we were filtering. It can also contain the list of DOI prefixes, if it's
10:42
relevant, that we knew at the time. It also represents the processing that we do. For example, if we took an article landing page and we mapped it to a DOI, which landing page mapped to which DOI? If we weren't able to do it, we show that we weren't able to do it. Which version of the service was running that did that? Which version of the agent?
11:00
Because the quality of the landing-page-to-DOI reversal, we hope, will change over time. We hope it'll improve. So we're capturing as much environmental context as possible in this evidence record. So you absolutely can't read this. I'm running over time anyway. But come and talk to me. I'd love to talk to you guys about what's in here and
11:22
about the kind of context you can capture from this. So Event Data is under development. It's a work in progress. The service is a work in progress. But the query API service, where you can get the data, is available in a kind of pre-release form, and you can get your hands on it. So is the evidence service that I described.
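[Note: the evidence-record slide is likewise illegible in the recording. Based purely on the speaker's description, an evidence record might look roughly like the following Python sketch; the structure and field names are assumptions, not a published format.]

    # An evidence record bridges the gap between the input data and the event.
    evidence_record = {
        # The data exactly as it came in from the original source.
        "input": {"payload": "...the raw tweet, edit, or feed entry..."},
        # Supporting artifacts that were current at collection time.
        "artifacts": {
            "landing-page-domains": "domain-list-version-at-the-time",
            "doi-prefixes": "prefix-list-version-at-the-time",
        },
        # The processing that was performed, including failures.
        "processing": {
            "agent-version": "0.1.0",    # hypothetical version numbers
            "service-version": "0.1.0",
            "matches": [
                # Landing-page-to-DOI mapping attempts, successful or not.
                {"landing-page": "https://publisher.example/article/1",
                 "doi": "https://doi.org/10.5555/12345678",
                 "matched": True},
            ],
        },
        # The events that were built from this evidence.
        "events": ["00000000-0000-0000-0000-000000000000"],
    }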
11:41
Please go and try it out. I'm here to get your feedback. There's quite a long user guide I've written. That's also under development, but it's up to a few tens of thousands of words already, and it has, I hope, all of the information that I've described here. And we're also going to do a presentation at 3 AM in a couple of days.
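[Note: for those who do want to try it out, here is a minimal sketch of querying the pre-release query API from Python with the requests library. The endpoint URL, parameter names, and response shape are assumptions for illustration; the user guide mentioned above has the real details.]

    import requests

    # Ask the query API for events whose object is a particular DOI
    # (hypothetical endpoint and parameter names).
    resp = requests.get(
        "https://query.eventdata.crossref.org/events",
        params={"filter": "obj-id:10.5555/12345678"},
    )
    resp.raise_for_status()

    # Each event is a subject-verb-object triple plus its metadata.
    for event in resp.json().get("events", []):
        print(event["subj_id"], event["relation_type"], event["obj_id"])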
12:00
My colleague, Madeleine, is the product manager. Please come and talk to us. Thank you very much. Do we have some burning questions? Yeah, just a quick one. You showed with the Twitter example that you had a
12:23
relationship type. I want to know how many different types there are and how you actually define those, and whether it's the same for Wikipedia, Reddit, whatever. So one thing I didn't mention is that one of the components of Event Data is the Lagotto system, which was originally
12:41
developed at PLOS, and which Martin Fenner, who you all know, I'm sure, has put a large amount of work into. And the list of relation types is currently what's defined in the Lagotto software. And that comes from somewhere, Martin. Can you say where that comes from? So it's the DataCite relation types plus a few
13:11
extra that are needed, such as discusses and bookmarks. So I think there are about four that are needed for social media. But that's a big problem, because we want to have a
13:21
common relation type vocabulary that's consistent. And I think that is an area where we still need to do some work. Well, if you have more questions, maybe during the break or something?
13:42
Thank you very much. Our next speaker is Mike Taylor.