
Keynote: Unlocking Citations from Tens of Millions of Scholarly Papers

Formal Metadata

Title
Keynote: Unlocking Citations from Tens of Millions of Scholarly Papers
Number of Parts
15
License
CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this

Content Metadata

Abstract
Citations are the foundation for how we know what we know. Until recently, the idea of creating a freely accessible repository of citation data – representing how scholarly works cite each other – has been hampered by restrictive and inconsistent licenses and by the lack of comprehensive, machine-readable data sources: for decades, references have been locked inside PDFs or proprietary databases. Launched in April 2017, the Initiative for Open Citations (I4OC) has made nearly half of all indexed scholarly references freely available to everyone with no copyright restrictions. The percentage of indexed scholarly works with open reference data was 1% before the launch of the I4OC: as of July 2017, over 16 million scholarly works have open references available as machine-readable public domain data. There’s now momentum and a growing number of organizations, scholarly societies, funders, and publishers in support of the unconstrained availability of scholarly citation data. However, this is just the beginning of a journey to build high-quality scientific commons. In this talk, I’ll present how the I4OC was created, its current vision and challenges. I'll showcase examples of real-world applications demonstrating how data unlocked by the initiative can be reused to accelerate scientific discovery and the broader impact of scholarship knowledge.
Transcript: English (auto-generated)
So we're excited to be here, so thank you, Adrian, for the invitation. I'm having a blast at the conference so far and looking forward to the rest of the discussions today. So yeah, like I just said, today I want to talk about the Initiative for Open Citations, and I want to give you a sense of why we did this, how it came together, and what the rationale is for investing into efforts to unlock references and citation data.
I want to start by giving you a few reasons why I believe open citations matter, and I think that a good starting point is to look at Wikipedia itself.
So just to set things straight: you're really not supposed to cite Wikipedia. No matter what your friends do or what your teacher says, you're really, really not supposed to cite Wikipedia. And the reason for this is actually fairly simple.
Wikipedia is not about the truth. It's about verifiability. So Wikipedia only works as long as it acts as a gateway to external sources, as a starting point for discovering external information. It's not a destination in itself.
Open knowledge, you might say, fulfills its function as long as it is backed by carefully vetted, reliable secondary sources that everyone can look up and check for themselves. So this is what makes citations and references so critical for Wikipedia.
You might think that if this is true for Wikipedia, it should be even more true for scholarly knowledge at large. After all, science is, at scale, the largest example of collaboratively created knowledge production, right? So in principle, this should work for science and scholarship as well.
So let me give you three reasons as to why this also applies to scholarship. First off, the citation graph underpins our collective understanding of where scholarship comes from. It allows us to understand knowledge provenance.
It allows us to understand how we know what we know in science and to understand the evolution of scientific debates. Scholarly citations are really the foundation of knowledge in science and scholarship, and they represent the main transmission mechanism that allows us to reconstruct the genesis of knowledge.
Second, as I'm sure you all know, pretty much the entire assessment system for science depends on the ability to count citations. So pick your favorite metric: whether you like it or not, no matter how good or bad it is at measuring the impact of a paper, of a journal, or of a venue, this metric will almost always depend on some way of counting citations.
So again, the way in which we collectively assess scholarship is by using some notion of citation-based metrics.
And finally, because of that, the prioritization of how we invest into research that's largely paid for by the taxpayer ultimately depends on citations. So the availability of citations, and how we measure science using citations, affects what science we fund and how we allocate taxpayer money to pursuing research.
So given how critical citations are, for these reasons, for the functioning and the vetting of science, you would expect the citation graph to be a shared resource that belongs to everyone and that everyone can use.
The reality is not quite so. As it turns out, as of today, the primary source of data about scholarly citations that the entire planet depends on comes from proprietary databases from two companies. One is Scopus, a product by Elsevier, and the other one is Web of Science, by formerly Thomson Reuters, now Clarivate.
So what does it mean in practice to have citations locked into these two databases? Well, let's think again about the three factors that I mentioned before. First off, to understand the provenance of information, you need to get access to these systems.
These systems only allow you to access data if your institution pays a subscription to Scopus or Web of Science. So as a regular citizen, you don't have the ability to access them unless your school pays for them.
Second, the assessment and the impact evaluation that I mentioned before cannot be reproduced and vetted by the public, right? So again, only people with access to these systems can be in charge of the vetting of impact.
And finally, and most importantly, public funding of research relying on this data depends on whatever data curation policies these two companies have put in place.
So for this reason, I think it's very important that we think about, on the one hand, the immense value that these companies have provided by creating these curated data sets. But we also need to think about the fact that the underlying data doesn't belong to these two companies. The underlying data is non-copyrightable and as such should belong to the public.
David Shotton has been one of the long-standing advocates for open citations, and he called out this fact by saying it's a scandal that, as of today, we still don't have access to a good large-scale source of citation data and that this data is locked into repositories that are by and large proprietary.
So the question becomes: how do we even get started? How do we think about creating at least the beginning of a corpus of citation data and make it available to anyone without any copyright restriction? So this is a low-hanging fruit.
And sometimes ideas are, you know, beautiful and ripe and ready to be harvested and consumed. And in this case, surely, if this data took so much effort to be curated, you would imagine that it's readily available somewhere, right? It must have cost some effort for these companies to produce it.
So this is artificial ripening gas, and sometimes ideas need a little bit of help to get to a point where they can come to fruition. And this is, in a nutshell, the story of the I4OC.
So it's a story of using ripening gas for public goods, and it's the story of how, I believe, a big success story for openness started with a little help from a group of stubborn individuals and like-minded organizations.
So let me tell you about the Initiative for Open Citations. This is the website of the initiative, whose stated goal is to promote the unrestricted availability of scholarly citation data. What we mean by open citation data is threefold.
It is data that is, one, machine readable. It's important that we make this data available not just for humans, but also for machines. Two, it is separable, meaning that it is separate from the underlying bibliographic source that it represents and belongs to. And third, the data is freely accessible, reusable, and subject to no copyright restriction whatsoever.
This is important; I might come back later to discuss the licensing aspects of this data. So how did this thing come together? Well, it all started with a realization back in September 2016, at COASP, the annual conference of the Open Access Scholarly Publishers Association, that this data actually existed. It was just not exposed by default.
As it turns out, most publishers deposit to Crossref not just a bibliographic record of publications that have a DOI, a digital object identifier. In most cases, they also deposit the full reference record. It just turns out that this data is closed by default.
So we suddenly realized that this data already exists. It doesn't require extra effort to be produced and generated. It's already released by publishers to Crossref, only in a closed form.
So the challenge became: how do we persuade a group of influential publishers to flip and to make this data publicly available?
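To make the "machine readable" point concrete, here is a minimal Python sketch of how the deposited references of a single work might be read from the public Crossref REST API. It assumes the work's publisher has opened its references, in which case the `/works/{doi}` response carries a `reference` list whose entries may include a `DOI` key. The helper names and the sample DOIs are illustrative and are not part of the talk.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works/"


def fetch_work(doi, timeout=30):
    """Download the Crossref metadata record for one DOI (needs network access)."""
    url = CROSSREF_WORKS + urllib.parse.quote(doi)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def cited_dois(work_json):
    """Return the DOIs found in a work's reference list, lowercased.

    References without a DOI (for example unstructured strings) are skipped,
    and a record with no open references yields an empty list.
    """
    refs = work_json.get("message", {}).get("reference", [])
    return [ref["DOI"].lower() for ref in refs if "DOI" in ref]


# Offline usage with a hand-made record shaped like a Crossref response
# (the DOIs below are placeholders, not real publications).
sample = {
    "message": {
        "reference": [
            {"key": "ref1", "DOI": "10.1000/ABC"},
            {"key": "ref2", "unstructured": "Some book, 1999"},
        ]
    }
}
print(cited_dois(sample))
```

Calling `cited_dois(fetch_work(some_doi))` would do the same against live data. Each returned DOI is one citation link from the work to another identified publication.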
We started making the case and talking to people, using this guerrilla team of instigators of openness. And we started telling the story that, one, this is not something that's going to cost anything. It is already there. It literally requires a single email written to Crossref to ask to release this data publicly.
Second, insisting that this is not a goal that can be achieved alone by one or two publishers. For this to be effective, it needs to reach critical mass, so we can make a statement and hopefully other publishers will follow. So it was really critical to get a large group of influential publishers, crucial players, in the room agreeing on this goal.
And the way we started was to focus on the publishers that held the largest amount of data.
There's public data about the volume of publications that each publisher deposits to Crossref, and so we started targeting the top 20 publishers by volume of DOIs deposited to Crossref.
We agreed on a deadline, and we asked everybody to prepare their communications plan and to hold off on any announcement, just to make a big splash. And we started doing this with the idea that once we had the main players on board, we would be able to get traction and also bring on the long tail of publishers that may want to follow the example of the largest ones.
So this is the progress of the initiative so far. Prior to the launch of the I4OC, the percentage of DOIs deposited to Crossref with open references was 1%: 1% out of about 38 million documents that Crossref knows about. Again, this is a tiny fraction of scholarship.
I'll come back to this later. This is not the universe of citations, but it's still a pretty sizable corpus of the literature. So 1% out of 38 million articles with references deposited to Crossref at the time of the launch.
Six months after we started, and actually this is the most current statistic, the fraction of documents with open references in the Crossref database went from 1% to more than 40%. What does that mean in practice?
We currently have 18 million articles with open reference data, and that amounts to about half a billion individual citation links that are now freely available to everyone as machine-readable data with no copyright restriction whatsoever.
In the process of doing this, we also realized it's really important to get the word out and amplify the message. And so what we realized was really important for this was not just to talk to publishers but also to build a coalition of allies, including major funders, scholarly platforms, open data organizations, and publishers as well, supporting the notion of unrestricted availability of scholarly citation data.
So this is, as of today, the list of organizations that have agreed to lend their name and their authority to supporting this initiative.
The availability of data per se is great. But as George was saying yesterday, it is really important that we don't just make data available. This data needs to get into action and produce impact, right? And so this is how the data is currently being reused.
One of the organizations behind the initiative is OpenCitations, co-founded by David Shotton and Silvio Peroni, and OpenCitations has been producing a corpus that basically collects, cleans up, and republishes citation data from Crossref and many other sources as, basically, RDF dumps.
You can access this data via a SPARQL endpoint. You can access it statically as a set of dumps, and it's a growing corpus of, again, fully open linked open data that you can reuse for whatever purposes you want.
Second, the community of scientometricians for a very long time could only use proprietary data licensed from the two databases I mentioned before. As of the launch of the initiative, they started adapting their tools so that they can now analyze and visualize these data by pointing them to the Crossref API.
So this is an example of a tool called VOSviewer. It allows you to perform graph analysis and visualization on the citation network based on data that now is coming from Crossref.
The example that I'm most excited about is the reuse of these data in the context of Wikimedia projects. Wikidata, which we talked about yesterday, is an open knowledge base.
At the moment, one out of four items in Wikidata represents a source. So there's a massive coverage of creative works and bibliographic sources in Wikidata. And as part of creating a record of sources in Wikidata and cross-linking all these sources with all the other entities that exist in Wikidata, the community also started ingesting the citation graph.
So as of today, we have 36 million citation links that represent the connection between one paper and another paper, each of these represented as individual items in the knowledge base.
And what I think is really cool is that people are now leveraging this data to create custom applications that are built on top of it. Scholia is such an example, and I think I might try and give you a live demo, if I manage not to break everything here.
Alright, okay
So Scholia is basically a very simple front end to all data of scholarly relevance, broadly defined, that currently exists in Wikidata. So you can think of it as a way of exploring information about authors, about works, about outlets, about organizations, about scholarly awards and whatnot: basically the entire corpus of knowledge existing in Wikidata as structured data.
And this is an example. So let's see the entry of an author, for example Uta Frith, a neuroscientist based in London.
And this is all data that is currently in Wikidata and is generated on the fly by querying the Wikidata SPARQL endpoint. So all that you see here is just the result of several SPARQL queries stitched together.
And you'll see that there's a list of publications, publications per year, and statistics about Uta Frith's publication record.
Venue statistics: these are the places where she mostly publishes her work. Her co-author graph,
topics. And of course we can also link all this information to freely licensed media that exists in Wikimedia Commons that can illustrate this work, a timeline of education,
et cetera, et cetera. But I want to take one example of one of her papers and see the information that we have. Live demos, okay.
All right, let me find it. So this is an example of a record for an individual publication, and you see it's a very sparse data source. We haven't ingested the entire citation network.
But what you can see here is basically the fact that, for this work, we can reconstruct the citation graph. And of course, we have some query errors, but you get a sense of it, right?
So all of this is basically generated entirely from CC0-licensed data that has been ingested into Wikidata. And if you go here, you can basically see how this works, right? So this is just a query that you can run and modify to get the information you want about your favorite author.
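The queries shown in the demo are not reproduced in this transcript. As a rough sketch of what such a query could look like, the Python below builds a SPARQL query that lists works citing a given Wikidata item through the "cites work" property (P2860) and sends it to the public Wikidata query service. The function names, the User-Agent string, and the `Q42` placeholder item are assumptions made for illustration. This is not Scholia's actual code.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def citing_works_query(work_qid, limit=20):
    """Build a SPARQL query for items that cite the given Wikidata item."""
    if not (work_qid.startswith("Q") and work_qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item identifier: {work_qid!r}")
    return f"""
SELECT ?citing ?citingLabel WHERE {{
  ?citing wdt:P2860 wd:{work_qid} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT {int(limit)}
""".strip()


def run_query(query, timeout=60):
    """Send a query to the Wikidata endpoint and return the result rows (needs network)."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        WDQS_ENDPOINT + "?" + params,
        headers={"User-Agent": "citation-graph-example/0.1 (illustrative script)"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["results"]["bindings"]


# Q42 is only a syntactically valid placeholder; substitute the item of a paper.
print(citing_works_query("Q42", limit=5))
```

Swapping the triple pattern to `wd:{work_qid} wdt:P2860 ?cited` would list the works that the item cites instead, which is the other direction of the same citation graph.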
All right. Sorry, live demos, okay.
So, the road ahead. This is really just the beginning, right? We're far from being anywhere close to where we want to be with this project, and there's still a lot of work to be done.
There are a few... whoops.
Help, okay. Okay, I think I'm lost here. Yes.
Okay. Yeah, so there are a few lessons learned from trying to put together such an initiative, right? And the first thing I want to say is that it really helped to have a single measurable goal and to try and drive a very large group of organizations towards getting that done, focusing on that single metric that we really want to push to get to a hundred percent.
Second, low cost. As I mentioned in the beginning, this would not have been possible without the data being already there in some capacity. We didn't ask people to start producing linked data. That's very costly, and that's not something we can expect people to do overnight.
Third, and I want to say most importantly, this is an initiative that was never pitched as an open access initiative. So we really, truly believe that it is bipartisan. We believe that regardless of your business model, you should support the notion of opening up the citation layer, because this will benefit you as a publisher regardless of whether you believe in open access or traditional subscription-based publishing.
And last but not least, focus on amplification. Make sure you build a large coalition of people who can help you amplify the message and get the word out to people who are not there yet.
So, where we would like to see this going.
What we really think is going to be the ultimate goal for this initiative is the creation of a comprehensive graph for scholarship, right? This is something that can only be built by a large number of organizations. There's no single player that can do this.
But in order to get there and to represent how sources and institutions and authors and pieces of knowledge all relate to each other, it is really critical that these data be made available at scale. So this is an example, again coming from VOSviewer, and a quote from the author saying that this is finally possible using data that used to be privately licensed in the past.
People often ask about the benefits. Is this only something for publishers or researchers? I think the answer is that this benefits a large set of stakeholders.
It benefits authors, who will be able to have access to a record of their own citations for their works without having to retrieve this data from a specific database. I think the data will belong to them. It benefits researchers, who will be able to perform analysis of the citation graph in a way that is reproducible and open.
It will benefit funders, who will be able to reconstruct, basically, the impact of what they're spending, how they're investing their money, and basically vet the impact metrics that are used for determining where this money goes.
And we believe it will also benefit publishers, like I said, regardless of their business model, because making these data publicly available will just result in the creation of applications that will enhance the discoverability of scholarly objects.
And finally, the public. Again, I think it's really important that the public itself has access to this data. There are challenges, and again, I want to clarify the scope of what this is about.
There was actually a lot of discussion online about what 100% means. Are we talking about the universe of citations or a tiny fraction of it? Crossref data is limited both temporally and in terms of its scope, right?
So we're talking about the bibliographic record of papers that have a DOI assigned, and this is only a tiny fraction of the entire scholarship. It also doesn't cover books and other types of publications, and decades and centuries of works and their citations. So this is just the beginning, and it's not the end of the story.
And these are statistics about the current coverage of records in Crossref with reference data and the fraction of these that are open. And obviously, aside from coverage, there's also a question about data quality. So we currently have 1 billion references in Crossref in total.
About half of these references, so half a billion, are open. Of these open citations, only 50% have DOIs or some kind of identifier.
So getting from raw, messy data to a clean citation graph that is deduplicated, that uses identifiers, that resolves authors, et cetera, is going to be a lot of work. And this is currently the main distinction between this project and the highly curated databases that I mentioned before: they're really providing this value in the form of curation.
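By the figures just quoted, half of a billion references are open and half of those carry an identifier, which leaves roughly 250 million references that can be linked directly. As a small illustration of the kind of cleanup that follows, the Python sketch below normalizes DOI strings and deduplicates citing/cited pairs. The prefix list, the validity pattern, and the function names are simplifying assumptions, and the sketch does not describe how OpenCitations or any curated database actually processes its data.

```python
import re

# Common ways a DOI is written in raw reference data (an assumed, incomplete list).
_DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)
# Loose shape check: "10.", a registrant code, a slash, and a non-empty suffix.
_DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def normalize_doi(raw):
    """Return a lowercased bare DOI, or None if the value does not look like one."""
    if not raw:
        return None
    doi = raw.strip().lower()
    for prefix in _DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    return doi if _DOI_PATTERN.match(doi) else None


def clean_edges(raw_edges):
    """Keep each (citing, cited) pair once, dropping pairs without usable DOIs."""
    seen = set()
    cleaned = []
    for citing, cited in raw_edges:
        edge = (normalize_doi(citing), normalize_doi(cited))
        if None in edge or edge in seen:
            continue
        seen.add(edge)
        cleaned.append(edge)
    return cleaned


# Placeholder DOIs: the first two rows are the same link written two ways.
raw = [
    ("10.1000/A", "https://doi.org/10.1000/B"),
    ("doi:10.1000/a", "10.1000/b"),
    ("10.1000/a", None),
    ("10.1000/a", "Smith 1999"),
]
print(clean_edges(raw))
```

References that have no identifier at all, like the last two rows, are simply dropped here. Matching them to a known work is the harder part of the curation effort described above and is outside the scope of this sketch.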
So the question is: how do we reach our goal of a hundred percent coverage, again with these limitations that I just mentioned? So as of today, we have the vast majority of the top 20 publishers that we contacted.
This is the list of the publishers that produce the largest volume of DOIs and that are currently part of the initiative. As of today, we've definitely made significant progress, but there are still some major exceptions.
There are six publishers that are responsible for a very large amount of citations that haven't joined the initiative yet. So if you happen to be an editor or an author, if you work with any of these publishers, please help us get the word out. It is really important that we get them on board and persuade them to join the initiative.
A month ago, Crossref announced this tool, which you can look up yourselves. It allows you to check, for every individual publisher, statistics and the status of the references: the coverage of DOIs, the coverage of DOIs with open references, and their default policy when it comes to reference distribution.
And this just got in yesterday. We've received a lot of support for the initiative from multiple stakeholders, and the International Society for Scientometrics and Informetrics posted a letter basically calling on all the major publishers who haven't joined the initiative yet to do so, signed by basically the leading voices in the field.
So it's really humbling to have all these voices in the field of scientometrics, because these people care about defining impact and impact metrics, be supportive of this notion of open citation data.
So like I said, there's still a very long way to go. We published an open call to action to our stakeholders a couple of months ago, and what that means is that if you're a journal editor, a researcher, a librarian, if you work for an organization that produces or consumes scholarly metadata and citation data, it is really critical that you help us, and we hope you'll join the initiative to help further the goal of open citation data.
The citation graph belongs to the public, and we need to be able to build upon it as a common good. With that, I'd like to thank you, and I'd be happy to take any questions.
Thank you. Now, yes, we have lots of time for questions, so, anybody?
Dario, this is a fantastic initiative and it's exciting to see.
I guess I've got one sort of crazy idea and one question. The crazy idea being: at our local institution, we just suffered a case of plagiarism of an article, actually from a thesis that was converted into an article. But it seems to me that comparing citation graphs may be another sort of way of flagging possible plagiarism.
If things are too similar, that's another possible way of flagging it, as a defense. So having that openly available makes that yet another possible outcome. It needs to be researched to see whether it's feasible, but that'd be good.
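One simple way to picture the idea raised in this question is to compare the reference lists of two documents with a set-overlap measure. The Python sketch below uses Jaccard similarity and flags pairs above a threshold. The measure, the 0.8 threshold, and the sample identifiers are assumptions chosen for illustration. A high overlap is at most a hint that a pair deserves a closer look, not evidence of plagiarism.

```python
from itertools import combinations


def jaccard(a, b):
    """Share of references two documents have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def flag_similar_reference_lists(references, threshold=0.8):
    """Return (doc1, doc2, score) for every pair whose overlap reaches the threshold.

    `references` maps a document identifier to the identifiers it cites.
    """
    flagged = []
    for (doc1, refs1), (doc2, refs2) in combinations(sorted(references.items()), 2):
        score = jaccard(refs1, refs2)
        if score >= threshold:
            flagged.append((doc1, doc2, score))
    return flagged


# Made-up example: the article cites everything the thesis cites plus one more work.
references = {
    "thesis": {"a", "b", "c", "d"},
    "article": {"a", "b", "c", "d", "e"},
    "other": {"x", "y"},
}
print(flag_similar_reference_lists(references))
```

Comparing every pair of documents is quadratic in the number of documents, so at the scale of millions of papers a real system would need some form of candidate filtering first.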
And I'm sure Sarvin would want to ask this at some point, but the "cites" property in Wikidata is, you know, a standard citation. I don't see any sub-properties that allow some of the more fine-grained sort of description of the nature of the citation. Is there any hope of sort of going back and adding extra data about the nature of the citation, whether it's positive or negative or something along those lines?
Yeah, excellent questions. On the first comment, yeah, I think that plagiarism is a great example and a great use case for this. We've also talked a lot about tracking retractions more effectively, right?
So we know that retracted papers still keep accumulating citations, and I believe that having the citation graph as a public good should help us basically raise a warning whenever someone is citing a paper that has been retracted.
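The retraction warning described here amounts to a lookup over citation links. The minimal Python sketch below assumes two inputs that the talk does not specify: a list of (citing, cited) pairs and a separately obtained set of retracted identifiers. All names and sample values are hypothetical.

```python
def citations_to_retracted(edges, retracted):
    """Map each citing paper to the retracted papers it cites.

    `edges` is an iterable of (citing, cited) identifier pairs and
    `retracted` is a set of identifiers known to be retracted.
    """
    warnings = {}
    for citing, cited in edges:
        if cited in retracted:
            warnings.setdefault(citing, []).append(cited)
    return warnings


# Made-up identifiers: r1 and r2 stand for retracted papers.
edges = [("p1", "r1"), ("p1", "p2"), ("p3", "r1"), ("p3", "r2")]
retracted = {"r1", "r2"}
print(citations_to_retracted(edges, retracted))
```

A practical version would also compare publication and retraction dates, since a citation made before the retraction notice is not the kind of case a warning is meant to catch.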
And in general, annotating citations with whatever metadata we think is important. So plagiarism is a great example of something that hopefully these data will help us better understand.
On the second point: yes, Wikidata now has a property to represent a citation between two items representing creative works. We haven't yet talked about how to further qualify the citation, although there are many people who are excited about, for example, implementing a version of CiTO, the Citation Typing Ontology, or somehow adding additional qualifiers to the citations.
The one type of qualifier that's currently being used in Wikidata is about provenance. So pretty much the entire data model of Wikidata allows you to specify a reference for every single statement.
And given that Wikidata currently ingests citation data from a variety of sources (Crossref directly, the OpenCitations Corpus, PubMed in some cases), it is really important that we keep track of where we get the citation information from. So we have the ability to specify more information, more qualifiers, for citations. We haven't gotten there yet.
Any other questions? Yeah, my name is Jaromus Oracle.
Do you know any other citation corpora where, let's say, national peer-reviewed papers, non-English scientific papers, could post their citation data? Because I find that an interesting problem: you have scientific publications not in English, and they are often referring to publications in English and so on.
So how to interlink all these multi-language publications?
Excellent question: how to get, you know, proper global coverage with this graph. So a couple of things. First off, Crossref data really covers publishers across many, many different countries, so as long as a publisher assigns DOIs, it doesn't matter whether the publication is in English or any other language.
But this is a requirement for this data to be accessible. So in other words, unless there is a bibliographic record and unless the publisher has opted into this program (it's called Cited-by), you will not get this data for free from the Crossref APIs.
For journals and publishers that are not, for whatever reason, interested in participating in Crossref, there are projects like the OpenCitations Corpus that are trying to aggregate and transform and republish citations from a much broader set of sources. So I think the hope is that down the line, no matter whether you're in Crossref or not, that corpus will be able to grow and also include other references that are not otherwise available.
I think I didn't get the point: if the OpenCitations Corpus is being integrated into Wikidata, is it to exist as a separate corpus?
Yeah, so the OpenCitations Corpus exists as its own standalone corpus. It's actually a fairly complex linked open data project that publishes and cleans up all this data, adding provenance and additional information about all the different entities. Wikidata is really ingesting some of this data in an opportunistic way.
It's not ingesting and representing the entire RDF structure of the corpus, for people who are interested in proper RDF. Again, personally, if you ask me, I tend to side with the pragmatic approach to, you know, useful data without making it overly complex when it comes to structure.
I think that Wikidata currently delivers good value for data consumers. But there are other use cases for which an RDF record is actually what you want, and that's what the OpenCitations Corpus provides.
Anybody else? Okay, then we move on. Next.