Keynote: Unlocking Citations from Tens of Millions of Scholarly Papers
Formal Metadata
Number of Parts | 15 | |
License | CC Attribution - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this | |
Identifiers | 10.5446/47598 (DOI) | |
Transcript: English(auto-generated)
00:01
So we're excited to be here, so thank you, Adrian, for the invitation. Having a blast at the conference so far, looking forward to the rest of the discussions today. So yeah, like I just said, today I want to talk about the Initiative for Open Citations, and I want to give you a sense of why we did this, how it came together, and what the rationale is for investing into
00:26
efforts to unlock references and citation data. I want to start by giving you a few reasons why I believe open citations matter, and I think that a good starting point is to look at Wikipedia itself.
00:42
So just to set things straight: you're really not supposed to cite Wikipedia. No matter what your friends do or what your teacher says, you're really, really not supposed to cite Wikipedia. And the reason for this is actually fairly simple.
01:01
Wikipedia is not about the truth. It's about verifiability. So Wikipedia only works as long as it acts as a gateway to external sources, as a starting point for discovering external information. It's not a destination in itself, and
01:23
open knowledge, you might say, fulfills its function as long as it is backed by carefully vetted, reliable secondary sources that everyone can look up and check for themselves. So this is what makes citations and references so critical for Wikipedia, and
01:43
you might think that if this is true for Wikipedia, this should be even more true for scholarly knowledge at large. After all, science is, at scale, the largest example of collaboratively created knowledge production, right? So in principle, this should work for science and scholarship as well.
02:04
So let me give you three reasons why this also applies to scholarship. First off, the citation graph underpins our collective understanding of where scholarship comes from. It allows us to understand knowledge provenance;
02:23
it allows us to understand how we know what we know in science and to understand the evolution of scientific debates. Scholarly citations are really the foundation of knowledge in science and scholarship, and they represent the main transmission mechanism
02:41
that allows us to reconstruct the genesis of knowledge. Second, as I'm sure you all know, pretty much the entire assessment system for science depends on the ability to count citations. So pick your favorite metric: whether you like it or not, no matter how good or bad it is at measuring the impact of a paper, of a journal, of a venue,
03:06
this metric will almost always depend on some way of counting citations. So again, the way in which we collectively assess scholarship is by using some notion of citation-based metrics. And
03:22
finally, because of that, the prioritization of how we invest in research that's largely paid for by the taxpayer ultimately depends on citations. So the availability of citations, and how we measure science using citations, affects what science we fund and how we allocate
03:43
taxpayer money to pursuing research. So given how critical citations are, for these reasons, for the functioning and the vetting of science, you would expect the citation graph to be a shared resource that belongs to everyone and that everyone can use. And
04:03
the reality is not quite so. As it turns out, as of today, the primary source of data about scholarly citations that the entire planet depends on comes from proprietary databases from two companies. One is Scopus, a product by Elsevier, and the other one is Web of Science by
04:25
formerly Thomson Reuters, now Clarivate. So what does it mean in practice to have citations locked into these two databases? Well, let's think again about the three factors that I mentioned before. First off, to understand the provenance of information, you need to get access to these systems;
04:44
these systems only allow you to access data if your institution pays a subscription to Scopus or Web of Science. So as a regular citizen, you don't have the ability to access them unless your school pays for them. Second, the assessment and the impact evaluation that I mentioned before
05:05
cannot be reproduced and vetted by the public, right? So again, only people with access to these systems can be in charge of vetting impact. And finally, and most importantly, public funding of research
05:22
relying on this data depends on whatever data curation policies these two companies have put in place. So for this reason, I think it's very important that we think about, on the one hand, the immense value that these companies have
05:43
provided by creating these curated data sets. But we also need to think about the fact that the underlying data doesn't belong to these two companies. The underlying data is non-copyrightable and as such should belong to the public. David Shotton has been one of the
06:02
long-standing advocates for open citations, and he called out this fact by saying it's a scandal that as of today we still don't have access to a good large-scale source of citation data, and that this data is locked into repositories that are by and large proprietary. So the question becomes: how do we even get started? How do we
06:28
think about creating at least the beginning of a corpus of citation data and making it available to anyone without any copyright restriction? So this is a low-hanging fruit.
06:42
And sometimes ideas are, you know, beautiful and ripe and ready to be harvested and consumed. And in this case, surely, if this data took so much effort to be curated, you would imagine that
07:01
it's readily available somewhere, right? It must have cost these companies some effort to produce it. So this is artificial ripening gas, and sometimes ideas need a little bit of help to get to a point where they can come to fruition. And this is, in a nutshell, the story of I4OC. So it's a story of using
07:25
ripening gas for public goods, and it's a story of how what I believe is a big success story for openness started with a little help from a group of stubborn individuals and like-minded organizations.
07:40
So let me tell you about the Initiative for Open Citations. This is the website of the initiative, whose stated goal is to promote the unrestricted availability of scholarly citation data. What we mean by open citation data is threefold. It is data that is, one, machine readable:
08:03
it's important that we make this data available not just for humans, but also for machines. Two, it is separable, meaning that it is separate from the underlying bibliographic source that it represents and belongs to. And third, the data is freely accessible,
08:22
reusable and subject to no copyright restriction whatsoever. This is important; I might come back later to the licensing aspects of this data. So how did this thing come together? Well, it all started with a realization back in September 2016 at COASP,
08:44
the annual conference of the Open Access Scholarly Publishers Association, that this data actually existed. It was just not exposed by default. As it turns out, most publishers deposit to Crossref not just a bibliographic record of
09:02
publications that have a DOI, a digital object identifier; in most cases, they also deposit the full reference record. It just turns out that this data is closed by default. So we suddenly realized that this data already exists. It doesn't require extra effort to be produced and generated.
09:21
It's already released by publishers to Crossref, only in a closed form. So the challenge became: how do we persuade a group of influential publishers to flip and make this data publicly available?
09:41
We started making the case and talking to people using this guerrilla team of instigators of openness. And we started telling the story that, one, this is not something that's going to cost anything. It is already there. It literally requires a single email written to Crossref asking to release this data publicly.
10:07
Second, insisting that this is not a goal that can be achieved alone by one or two publishers. For this to be effective, it needs to reach critical mass, so we can make a statement and hopefully other publishers will follow. So it was really critical to get a large group of influential publishers, crucial players,
10:29
in the room agreeing on this goal. And the way we started was to focus on the publishers that held the largest amount of data. So we know,
10:43
there's public data about the volume of publications that each publisher deposits to Crossref, and so we started targeting the top 20 publishers by volume of DOIs deposited to Crossref. We agreed on a deadline, we asked everybody to
11:01
prepare their communications plan and to hold off any announcement, just to make a big splash, and we started doing this with the idea that once we had the main players on board, we'd be able to get traction and also bring on the long tail of publishers that may want to follow the example of the largest ones.
11:23
So this is the progress so far of the initiative. Prior to the launch of I4OC, the percentage of DOIs deposited to Crossref with open references was 1%: 1% out of about 38 million documents that Crossref knows about. Again, this is a tiny fraction of scholarship.
11:46
I'll come back to this later. This is not the universe of citations, but it's still a pretty sizable corpus of the literature. So 1% out of 38 million articles with references deposited to Crossref at the time of the launch. Six months after we started,
12:02
the fraction of publications went to... well, actually, right now, this is the most current statistic: the fraction of documents with open references in the Crossref database went from 1% to more than 40%. What does that mean in practice? We currently have 18 million
12:24
articles with open reference data, and that amounts to about half a billion individual citation links that are now freely available to everyone as machine-readable data with no copyright restriction whatsoever. In the process of doing this, we also realized it's really important to
12:49
get the word out and amplify the message. And so what we realized was really important for this was not just to talk to publishers, but also to build a coalition of allies,
13:01
including major funders, scholarly platforms, open data organizations, and publishers as well, supporting the notion of unrestricted availability of scholarly citation data. So this is, as of today, the list of organizations that have agreed to lend their name and their authority to support this initiative. And
13:26
the availability of data per se is great, but as George was saying yesterday, it is really important that we don't just make data available. This data needs to get into action and produce impact, right? And so this is how the data is currently being reused.
13:43
One of the organizations behind the initiative is OpenCitations, co-founded by David Shotton and Silvio Peroni, and OpenCitations has been producing a corpus that basically collects, cleans up and republishes citation data from Crossref and many other sources as, basically, RDF dumps.
14:04
You can access this data via a SPARQL endpoint, or you can access it statically as a set of dumps, and it's a growing corpus of, again, fully open linked open data that you can reuse for whatever purposes you want.
14:24
Second, the community of scientometricians for a very long time could only use proprietary data licensed from the two databases I mentioned before. As of the launch of the initiative, they started adapting their tools so that they can now analyze and visualize these data by
14:44
pointing them to the Crossref API. So this is an example of a tool called VOSviewer. It allows you to perform graph analysis and visualization on the citation network based on data that is now coming from Crossref. And
15:07
the example that I'm most excited about is the reuse of these data in the context of Wikimedia projects. Wikidata, which we talked about yesterday, is an open knowledge base. At the moment,
15:21
one out of four items in Wikidata represents a source. So there's a massive coverage of creative works and bibliographic sources in Wikidata. And as part of creating a record of sources in Wikidata and cross-linking all these sources with all the other entities that exist in Wikidata,
15:40
the community also started ingesting the citation graph. So as of today, we have 36 million citation links that represent the connection between one paper and another paper, each of these represented as individual items in the knowledge base. And
16:01
what I think is really cool is that people are now leveraging this data to create custom applications that are built on top of it. Scholia is one such example, and I think I might try and give you a live demo, if I manage not to break everything here.
16:21
Alright, okay
16:42
So Scholia is basically a very simple front end to all data of scholarly relevance, broadly defined, that currently exists in Wikidata. So you can think of it as a way of exploring information about authors, about works, about outlets, about
17:03
organizations, about scholarly awards and whatnot: basically the entire corpus of knowledge existing in Wikidata as structured data. And this is an example. So let's see the entry of an author,
17:20
for example, say, Uta Frith, a neuroscientist based in London. And this is all data that is currently in Wikidata and is generated on the fly by querying the Wikidata SPARQL endpoint. So all that you see here is just the result of several SPARQL queries stitched together.
17:46
And you'll see that there's a list of publications, publications per year, and statistics about Uta Frith's publication record.
18:04
Venue statistics: these are the places where she mostly publishes her work. Her co-author graph,
18:21
topics. And of course we can also link all this information to freely licensed media that exists in Wikimedia Commons that can illustrate this work, a timeline of education,
18:42
etc., etc. But I want to take one example of one of our papers and see the information that we have. Live demos, okay.
19:01
Alright, let me find... So this is an example of a record for an individual publication, and you see it's a very
19:23
sparse data source. We haven't ingested the entire citation network. But what you can see here is basically the fact that for this work we can reconstruct the citation graph. And of course, we have some query errors, but you get a sense of it, right? So all of this is basically generated entirely
19:43
from CC0-licensed data that has been ingested into Wikidata. And if you go here, you can basically see how this works, right? So this is just a query that you can run and modify to get the information you want about your favorite author.
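A minimal sketch of the kind of query behind such a page, assuming the public Wikidata query service at `https://query.wikidata.org/sparql` and the Wikidata property P2860 ("cites work"). The helper names and the item ID `Q12345` are placeholders for illustration, and this is not Scholia's actual code. The sketch only builds the query and the request URL; sending the request is left to the reader.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata query service.
WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def build_cited_works_query(work_qid: str, limit: int = 20) -> str:
    """Return a SPARQL query listing the works cited by one Wikidata item.

    P2860 is the Wikidata property "cites work". The function name and the
    shape of the query are illustrative assumptions.
    """
    if not (work_qid.startswith("Q") and work_qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item id: {work_qid!r}")
    return (
        "SELECT ?cited ?citedLabel WHERE {\n"
        f"  wd:{work_qid} wdt:P2860 ?cited .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def build_request_url(query: str) -> str:
    """Return a GET URL that asks the endpoint for JSON results."""
    return WIKIDATA_SPARQL_ENDPOINT + "?" + urlencode(
        {"query": query, "format": "json"}
    )


# "Q12345" is a placeholder ID, not a specific paper.
example_query = build_cited_works_query("Q12345", limit=5)
example_url = build_request_url(example_query)
```

Swapping the triple pattern to `?citing wdt:P2860 wd:Q12345` would list the incoming citations instead, which is how both directions of the graph can be read from the same data.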
20:06
All right, sorry. Live demos, okay.
20:50
So, the road ahead. This is really just the beginning, right? So we're far from being
21:02
anywhere close to where we want to be with this project, and there's still a lot of work to be done. There are a few... whoops.
21:28
Help. Okay, I think I'm lost here. Yes.
22:06
Okay. Yeah, so there's a few lessons learned from trying to put together such an initiative, right? And the first thing I want to say is that it really helped to have a single measurable goal and
22:23
try and drive a very large group of organizations towards getting that done, focusing on that single metric that I really want to push to get to a hundred percent. Second, low cost. As I mentioned in the beginning, this would not have been possible without the data being already there in some capacity. We didn't ask people to start producing linked data.
22:46
That's very costly and that's not something we can expect people to do overnight. Third, and I want to say most importantly, this is an initiative that was never pitched as an open access initiative. So we really, truly believe that it is bipartisan. We believe that regardless of your business model,
23:05
you should support the notion of opening up the citation layer, because this will benefit you as a publisher regardless of whether you believe in open access or traditional subscription-based publishing.
23:21
And last but not least, focus on amplification. Make sure you build a large coalition of people who can help you amplify the message and get the word out to people who are not there yet. So where would we like to see this going?
23:40
What we really think is going to be the ultimate goal for this initiative is the creation of a comprehensive graph for scholarship, right? This is something that can only be built by a large number of organizations. There's no single player that can do this. But in order to get there and to represent how
24:03
sources and institutions and authors and pieces of knowledge all relate to each other, it is really critical that these data be made available at scale. So this is an example, again coming from VOSviewer, and a quote from the author saying that this is finally possible using data that used to be
24:25
privately licensed in the past. People often ask about the benefits. Is this only something for publishers or researchers? I think the answer is that this benefits a large set of stakeholders. It benefits
24:44
authors, who will be able to have access to a record of their own citations for their works without having to retrieve this data from a specific database. I think the data will belong to them. It benefits researchers, who will be able to perform analysis of the citation graph
25:02
in a way that is reproducible and open. It will benefit funders, who will be able to reconstruct basically the impact of what they're spending, how they're investing their money, basically vetting the
25:21
impact metrics that are used for determining where this money goes. We believe it will also benefit publishers, like I said, regardless of their business model, because making these data publicly available will just result in the creation of applications that will enhance the discoverability of scholarly objects.
25:41
And finally, the public. Again, I think it's really important that the public itself has access to this data. There are challenges, and again, I want to clarify the scope of what this is about. There was actually a lot of discussion online about
26:03
what 100% means. Are we talking about the universe of citations or a tiny fraction of it? Crossref data is limited both temporally and in terms of the scope of this data, right? So we're talking about the bibliographic record of
26:20
papers that have a DOI assigned, and this is only a tiny fraction of the entire scholarship, and that also doesn't cover books and other types of publications and decades and centuries of works and their citations. So this is just a beginning and it's not the end of the story. And these are statistics about the current coverage of
26:43
records in Crossref with reference data and the fraction of these that are open. And obviously, aside from coverage, there's also a question about data quality. So we currently have 1 billion references in Crossref in total.
27:01
About half of these references, so half a billion, are open. Of these open citations, only 50% have DOIs or some kind of identifier. So getting from raw, messy data to a clean citation graph
27:21
that is deduplicated, that uses identifiers, that resolves authors, etc., is going to be a lot of work, and this is currently the main distinction between this project and the highly curated databases that I mentioned before. They're really providing this value in the form of curation.
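Taken at face value, the figures just quoted (about 1 billion references, about half of them open, about half of those carrying an identifier) imply roughly a quarter of a billion open references that could be turned into links directly. A minimal sketch of that first cleaning step follows, assuming references arrive as (citing DOI, cited DOI or nothing) pairs. The function names and the `10.1234/...` DOIs are made up for illustration, and author resolution is not attempted.

```python
# Prefixes under which a DOI is commonly written; stripped before comparing.
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(doi: str) -> str:
    """Lower-case a DOI and strip a resolver or 'doi:' prefix.

    DOIs are case-insensitive, so lower-casing lets duplicates collapse.
    """
    doi = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi


def build_citation_edges(raw_references):
    """Turn (citing_doi, cited_doi_or_None) pairs into a deduplicated edge set.

    Returns the set of (citing, cited) edges and the number of references
    skipped because they carried no identifier.
    """
    edges = set()
    unresolved = 0
    for citing, cited in raw_references:
        if not cited:
            unresolved += 1
            continue
        edges.add((normalize_doi(citing), normalize_doi(cited)))
    return edges, unresolved


# Hypothetical raw records: two spellings of one link, one reference without
# an identifier, and one further link.
raw = [
    ("10.1234/A", "https://doi.org/10.1234/B"),
    ("10.1234/a", "doi:10.1234/b"),
    ("10.1234/a", None),
    ("10.1234/c", "10.1234/a"),
]
edges, unresolved = build_citation_edges(raw)
```

The references counted as unresolved are where the curation effort described above would go: matching free-text references to identifiers.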
27:42
So the question is: how do we reach our goal of a hundred percent coverage, again with these limitations that I just mentioned? So as of today, we have the vast majority of the top 20 publishers that we contacted. This is the list of the publishers that produce the largest
28:03
volume of DOIs and that are currently part of the initiative. As of today, we've definitely made significant progress, but there are still some major exceptions. There are six publishers that are
28:20
responsible for a very large amount of citations that haven't joined the initiative yet. So if you happen to be an editor or an author, if you work with any of these publishers, please help us get the word out. It is really important that we get them on board and persuade them to join the initiative. A month ago,
28:42
Crossref announced this tool, which you can look up yourselves. It allows you to check, for every individual publisher, statistics and the status of the references: the coverage of DOIs, the coverage of DOIs with open references, and their default policy when it comes to reference distribution. And
29:05
this just got in yesterday. We've received a lot of support for the initiative from multiple stakeholders, and the International Society for Scientometrics and Informetrics posted a letter
29:20
basically calling on all the major publishers who haven't yet joined the initiative to do so, signed by basically the leading voices in the field. So it's really humbling to have all these voices in the field of scientometrics, because these people care about defining impact and impact metrics, be supportive of this notion of open citation data.
29:48
So like I said, there's still a very long way to go. We published an open call to action to our stakeholders a couple of months ago, and what that means is that if you're a journal editor, a
30:01
researcher, a librarian, if you work for an organization that produces or consumes scholarly metadata and citation data, it is really critical that you help us, and we hope you'll join the initiative to help further the goal of open citation data. The citation graph belongs to the public, and
30:23
we need to be able to build upon it as a common good. With that, I'd like to thank you, and I'd be happy to take any questions. Thank you. Now, yes, we have lots of time for questions, so
30:46
anybody? Dario, this is a fantastic initiative and it's exciting to see.
31:02
I guess I've got one sort of crazy idea and one question. The crazy idea being: at our local institution, we just suffered a case of plagiarism of an article, actually from a thesis that was turned into an article. But it seems to me that comparing citation graphs may be another sort of way of flagging
31:25
possible plagiarism: if things are too similar, that's another possible way of flagging it, as a defense. So having that openly available makes that yet another possible outcome. It needs to be researched to see whether it's feasible, but that'd be good.
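One simple way to make "too similar" concrete is the Jaccard overlap between two reference lists: the number of shared references divided by the number of distinct references across both works. The sketch below is only an illustration of that idea, with made-up identifiers and a hypothetical function name. A high score would be a prompt for a human to look closer, not evidence of plagiarism in itself.

```python
def reference_overlap(refs_a, refs_b) -> float:
    """Jaccard similarity of two reference lists (0.0 = disjoint, 1.0 = identical).

    Each argument is an iterable of identifiers (for example DOIs) of the
    works cited by one document.
    """
    a, b = set(refs_a), set(refs_b)
    if not a and not b:
        return 0.0  # two empty lists carry no signal
    return len(a & b) / len(a | b)


# Hypothetical reference lists for a thesis and a suspicious article.
thesis_refs = ["10.1234/x", "10.1234/y", "10.1234/z"]
article_refs = ["10.1234/y", "10.1234/z", "10.1234/w"]
score = reference_overlap(thesis_refs, article_refs)  # 2 shared / 4 distinct
```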
31:43
And I'm sure Sarven would want to ask this at some point, but the "cites" property in Wikidata is, you know, a standard citation. I don't see any sub-properties that allow for some of the more fine-grained description of the nature of the citation. Is there any
32:02
hope of sort of going back and adding extra data about the nature of the citation, whether it's positive or negative or something along those lines? Yeah, excellent questions. On the first comment, yeah, I think that plagiarism is a great example and a great use case for this. We've also talked a lot about
32:23
tracking retractions more effectively, right? So we know that retracted papers still keep accumulating citations, and I believe that having the citation graph as a public good should help us basically raise a warning whenever someone is citing a paper that has been retracted.
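As a rough sketch of what such a warning could look like once the graph is open: given citation links as (citing, cited) pairs and a list of retracted works, the links that point at a retracted work can be picked out in one pass. The function name, the `10.1234/...` DOIs and the source of the retraction list are all assumptions for illustration.

```python
def citations_to_retracted(edges, retracted):
    """Return the (citing, cited) links whose cited work is in `retracted`.

    `edges` is an iterable of (citing, cited) identifier pairs and
    `retracted` an iterable of identifiers of retracted works.
    """
    retracted = set(retracted)
    return sorted(
        (citing, cited) for citing, cited in edges if cited in retracted
    )


# Hypothetical data: one of the two cited works has been retracted.
example_edges = [
    ("10.1234/a", "10.1234/r"),
    ("10.1234/b", "10.1234/c"),
]
warnings = citations_to_retracted(example_edges, {"10.1234/r"})
```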
32:44
And in general, annotating citations with whatever metadata we think is important. So plagiarism is a great example of something that hopefully these data will help us better understand. On the second point: yes, so Wikidata now has a property
33:01
to represent a citation between two items representing creative works. We haven't yet talked about how to further qualify the citation, although there are many people who are excited about, for example, implementing a version of CiTO, the Citation Typing Ontology, or somehow adding additional qualifiers to the citations.
33:26
The one type of qualifier that's currently being used in Wikidata is about provenance. So pretty much the entire data model of Wikidata allows you to specify a reference for every single statement, and
33:40
given that Wikidata currently ingests citation data from a variety of sources (Crossref directly, the OpenCitations Corpus, PubMed in some cases), it is really important that we keep track of where we get the citation information from. So we have the ability to specify more information, more qualifiers for citations; we haven't gotten there yet.
34:11
Any other questions? Yeah, my name is Jaromus Oracle
34:27
Do you know of any other citation corpuses where, let's say, national peer-reviewed papers, non-English scientific papers, could post their citation data? Because I find it an interesting problem that you have scientific publications
34:45
not in English, and they also are often referring to publications in English and so on. So how do you interlink all these multi-language publications? Yep.
35:02
Excellent question: how to get proper global coverage with this graph. So a couple of things. First off, Crossref data really covers publishers across many, many different countries, so as long as a publisher assigns DOIs, it doesn't matter whether the publication is in English or any other language.
35:24
But this is a requirement for this data to be accessible. In other words, unless there is a bibliographic record and unless the publisher has opted into this program, it's called Cited-by, you will not get this data for free from the Crossref APIs.
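For a work whose references are open, the Crossref REST API returns them as a `reference` list inside the `message` object of `https://api.crossref.org/works/{DOI}`. A minimal sketch of reading that list follows. The sample response is hand-written for illustration with placeholder DOIs, not real API output, and the helper names are assumptions. When references are not open, the list is simply absent.

```python
import json
from urllib.parse import quote

CROSSREF_WORKS_URL = "https://api.crossref.org/works/"


def works_url(doi: str) -> str:
    """Return the REST API URL for one work, percent-encoding unusual characters."""
    return CROSSREF_WORKS_URL + quote(doi, safe="/")


def extract_references(response: dict) -> list:
    """Pull the reference list out of a parsed /works/{DOI} response.

    Each reference may carry a resolved "DOI", an "unstructured" text
    string, or both; missing fields come back as None.
    """
    refs = response.get("message", {}).get("reference", [])
    return [
        {"key": r.get("key"), "doi": r.get("DOI"), "text": r.get("unstructured")}
        for r in refs
    ]


# Hand-written sample shaped like an API response; the DOIs are placeholders.
SAMPLE_RESPONSE = json.loads("""
{
  "status": "ok",
  "message": {
    "DOI": "10.5555/example.citing",
    "reference": [
      {"key": "ref1", "DOI": "10.5555/example.cited"},
      {"key": "ref2", "unstructured": "Author, A. (1999). An example title."}
    ]
  }
}
""")

references = extract_references(SAMPLE_RESPONSE)
```

Only the first of the two sample references can be turned into a citation link directly, because the second has no identifier yet.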
35:42
For journals and publishers that are not, for whatever reason, interested in participating in Crossref, there are projects like the OpenCitations Corpus that are trying to aggregate, transform and republish citations from a much broader set of sources. So I think the hope is that down the line, no matter whether you're in Crossref or not,
36:06
that corpus will be able to grow and also include other references that are not otherwise available. I
36:22
think I didn't get the point: if the OpenCitations Corpus is being integrated into Wikidata, is it to exist as a separate corpus? Yeah, so the OpenCitations Corpus exists as its own standalone
36:41
corpus. It's actually a fairly complex linked open data project that publishes and cleans up all this data, adding provenance and additional information about all the different entities. Wikidata is really ingesting some of this data in an opportunistic way.
37:02
It's not ingesting and representing the entire RDF structure of the corpus, for people who are interested in proper RDF. Again, personally, if you ask me, I tend to side with the pragmatic approach to, you know, useful data without making it overly complex when it comes to
37:21
structure. I think that Wikidata currently delivers good value for data consumers, but there are other use cases for which an RDF record is actually what you want, and that's what the OpenCitations Corpus provides.
37:41
Anybody else? Okay, and we move on. Next.