
Using Cita to support reference extraction workflows from Zotero


Formal Metadata

Title
Using Cita to support reference extraction workflows from Zotero
Title of Series
Number of Parts
7
Author
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language
Producer
Production Year
2023
Production Place
Frankfurt am Main

Content Metadata

Subject Area
Genre
Abstract
In this presentation we will introduce Cita and discuss how its current features can support reference extraction workflows by simplifying the reconciliation, publication and reuse of extracted data. Specifically, we will show how extracted references may be added to Zotero items and reconciled with bibliographic records on Wikidata. We will also show how extracted reference metadata can be published to Wikidata, further contributing to this collaborative database. We will then demonstrate how these data can be reused by other Cita and Scholia users to better understand how works in their fields build upon one another. In addition, we will discuss how proposed features may further support reconciliation with and publication to additional data sources, including OpenCitations' Crowdsourced Open Citations Index (CROCI). Finally, we will propose plans to have Cita integrate reference extraction engines right into Zotero. We will discuss how this may enable wider use of these tools and the promotion of crowd-sourced projects.
Computer animation
Meeting/Interview
Transcript: English (auto-generated)
Okay, so thank you all for being here, and thank you, Andreas and Christian, for the invitation. Today I am with Dominic, who is attending remotely. We are going to introduce ourselves in a minute.
We will talk about Cita, a plugin for the reference management software Zotero that we have developed, and about ideas we have for how it could be integrated into reference extraction workflows.
So, to get started, a brief introduction. I will start myself. I'm Diego de la Hera. I'm from Argentina. I have a PhD in psychology and I'm a researcher, although I'm not actually doing research right now, but I always try to keep a foot in academia. I'm also a Wikimedian, that is, I'm part of the Wikimedia community. And Dominic, if you want to introduce yourself, please.
Hi, so I'm Dominic. I'm originally from Australia, now in Switzerland, studying neuroinformatics at the moment. I'm coming more from the research side, where making citation data more accessible would make my life a lot easier.
And that's how I ended up here. So, nice to meet you all. Thank you, Dom. Okay, so I guess we all know what Zotero is, but just in case someone in the audience doesn't: Zotero is reference management software, that is, software that users can use to organize their bibliographic records.
In particular, if you have a work that you're interested in, you can input the work's metadata, that is, the title, names of the authors, publication date, etc. Zotero also supports adding PDF attachments to these records. That will be important for what we're going to see later on.
And more importantly, Zotero is, I would say, among the top three best-known and most used reference managers, and maybe some would not agree here, the others being EndNote and Mendeley, although Mendeley I think is fading out a little bit.
And Zotero is the only one which is open source and libre software. It can also be expanded via community add-ons. So not only can people collaborate on coding the software, but they can also create plugins that expand Zotero's features. And there are lots of very popular plugins in the Zotero ecosystem.
As part of this metadata, along with the title, names of authors, etc., some people also argue that the references cited within a work should be part of the metadata of that work. And thanks to the efforts of different people, including the people at the Initiative for
Open Citations and the OpenCitations team, at least for journal articles, this reference metadata is now quite widely available. For papers from different journals, you can download not only the title and the names of the authors, but also the articles that those works reference.
However, when we started this project, it surprised us that although this metadata is widely available and is used by other tools, reference management software did not seem to support it. If, within the
reference management software, I wanted to know what works a given work cites, that information was not supported. So that's what motivated the development of Cita. Cita started with a grant from the Wikimedia Foundation through the WikiCite community.
The idea of Cita is that it brings support for a citation graph to Zotero. We're going to show this in detail, but the idea is that you have an item in Zotero and you can add references to that item based on what the item cites.
And these references, you can either add them manually, inputting the metadata of each reference item one by one by hand; that's an option. But you can also import the references from a list of references. We will see how this is useful in a reference extraction workflow. You can also import them from online sources, for example Wikidata, which we are going to talk about in a minute,
and also Crossref, which is a feature that is currently being worked on. And not only can you add these references as metadata in your personal library, but you can then also use Cita to publish this information somewhere.
We're going to talk about this in a minute and about how it integrates into reference extraction workflows. And also, and I'm going to show this in a minute too, if you have your items in Zotero and you have reference metadata for these items, then you can see how the items in your library connect to one another in a citation graph.
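To make the idea of a local citation graph concrete, here is a minimal sketch in Python. It is not Cita's actual implementation; the item keys and the `local_citation_graph` function are hypothetical, and the only assumption is that each library item carries a list of the works it cites. It separates citations between items already in the library from citations to works that are not in the library yet.

```python
from collections import Counter

# Hypothetical library: item key -> keys of the works that item cites.
library_citations = {
    "paper_a": ["paper_b", "work_x"],
    "paper_b": ["work_x", "work_y"],
    "paper_c": ["paper_a", "work_x"],
}


def local_citation_graph(citations):
    """Return in-library citation edges and counts of cited works missing from the library."""
    in_library = set(citations)
    edges = []           # (citing item, cited item), both in the library
    missing = Counter()  # cited work not in the library -> number of citing items
    for source, targets in citations.items():
        for target in targets:
            if target in in_library:
                edges.append((source, target))
            else:
                missing[target] += 1
    return edges, missing


edges, missing = local_citation_graph(library_citations)
print(edges)                   # edges between library items
print(missing.most_common(1))  # the most cited work that is not in the library
```

In this toy library, `work_x` is cited by all three items but is not itself in the library, so it comes out on top of the `missing` counter.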
And this citation graph is useful not only to understand how works connect to one another, not in the field generally, but specifically the works that you have in your library. It could also be helpful to discover new works. For example, if I see
in the citation graph that 10 of the papers in my library are citing a work that I don't have in my library yet, this could mean that I should download the paper and maybe read it. And it gives you an option to do that. So, I mentioned Wikidata. What is Wikidata, in case not everyone knows what it is?
Wikidata is a collaborative knowledge graph. It is a sister project of Wikipedia. And the idea of a knowledge graph is that you have nodes in this knowledge graph.
Each of these nodes represents a concept. For example, Argentina has a node in Wikidata. And these nodes are related to one another via properties. So, one property of the Argentina node is official language, and that property links to another node in the graph, which is Spanish.
And at the same time, Spanish is also a node, so it also has properties, for example the "subclass of" property, which links to West Iberian languages. And because of this, Wikidata is also working as a bibliographic repository, because in Wikidata you can also find elements like this one here,
that I'm showing here in the middle, which is a journal article. This node also has properties describing it, like the name of the author and where it was published. And it also has "cites work" properties, which are reference links.
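The node-and-property structure just described can be sketched as a list of subject, property, value statements. This is only an illustrative model in Python, not how Wikidata stores or serves its data; the `Statement` class, the `values_of` helper and the article names are hypothetical. On Wikidata itself, items and properties have identifiers, and the "cites work" property is P2860.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """One edge of the knowledge graph: subject --property--> value."""
    subject: str
    prop: str
    value: str


statements = [
    Statement("Argentina", "official language", "Spanish"),
    Statement("Spanish", "subclass of", "West Iberian languages"),
    # Hypothetical journal articles linked by "cites work" (P2860 on Wikidata).
    Statement("Article A", "cites work", "Article B"),
    Statement("Article A", "cites work", "Article C"),
]


def values_of(subject, prop):
    """Return every value that the given property links the subject to."""
    return [s.value for s in statements if s.subject == subject and s.prop == prop]


print(values_of("Argentina", "official language"))
print(values_of("Article A", "cites work"))
```

Seen this way, the reference list of an article is simply the set of its "cites work" statements.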
You can declare in Wikidata how elements are connected to one another via citations. And this repository, Wikidata, is the online repository from which you can get information and ingest it into Cita.
And it is also a place where you can publish information from Cita: you have a work, you added some references to it that you got from the list of references, and you can upload them to Wikidata. Why is it important to contribute this information to Wikidata?
Well, because Wikidata is used by other tools to show this information. One of these tools, which is actually really cool, and if you don't know it I would suggest that you try it, is called Scholia. Scholia is something like Scopus, but based on data collaboratively maintained on Wikidata.
I'm not going to show it right now, but what it shows here, the table of contents, is a list of things that you can check for a given article in Scholia. So for a given article, you can get a list of authors, but you can also get a list of topics that are treated in the article,
and you can also check related works and co-citation networks, and also see the institutions that financed it. I'm not sure about that last one, but you have profiles not only for works, but also for institutions and for authors.
If you want to see what an author has published during their career, you can check their profile in Scholia, and as long as their information is complete in Wikidata, you can find it here. So uploading information to Wikidata would populate the graphs that Scholia shows.
And also, because Cita is using information from Wikidata, if we upload our information to Wikidata, it is going to be available to other users of Cita. However, I would like to give a warning here, just for you to know:
although Wikidata is currently a bibliographic repository and it's growing as a bibliographic repository, it is not clear what the future of this is going to be. There are ongoing discussions, because the WikiCite community has been very active, and right now almost half of Wikidata is scholarly articles, and that is leading to some technical issues, apparently.
Some things are becoming slow in Wikidata because you have so many scholarly articles there. So right now, there are ongoing discussions about what the future of this will be, whether this is going to be moved out of Wikidata. If you're interested in this,
I would suggest that you stay tuned. You can follow what's happening at the link. You also have links to other places from there. But for now, it's okay for individual contributions in general. It's just that right now they wouldn't welcome a bot, for example, that uploads thousands of records all at once.
So, I can just continue and leave questions for the end. With Dominic, we were thinking of ways Cita could integrate into reference extraction workflows.
And although I'm not going to show this in detail, I do encourage you, if you're interested, to watch one of our recorded workshops, where you can see how Cita can be used. I'm going to very briefly show some examples of the idea that we have for integrating Cita into reference extraction workflows.
I'm going to switch to my Zotero here. I hope it's large enough; probably not, I'm sorry, I cannot zoom in any more. So I have here a sample collection. Each of these is a bibliographic record, and say that we have PDFs attached to them. So this is where we start.
If I open this citations tab here, I see the list of references for this article. I could add references manually, as I said. I can also sync them from Wikidata in case this information is in Wikidata.
But when it comes to integration into reference extraction workflows, I can use an external tool to extract references from a PDF. When I prepared this, I was thinking of the Scholarcy API, but then I realized just two days ago that it was no longer really available.
But you have, for example, AnyStyle, a command line tool, which is quite cool. So I'm going to just show you this. I have this PDF here. Again, this is not meant to be 100% clear; we can discuss more.
This is just an example. I'm not going to do it step by step because we won't have enough time. So I have the file here. If I open my command line and run this AnyStyle tool, oh, sorry,
I'm going to send the output to a file here. So this will parse the PDF, find references, and send them to this file in BibTeX format. I can use other formats if I want. It's here. So if I copy the references that it extracted,
and in Zotero, in Cita, I tell it to import citations, and I paste the text that has been extracted here, then these references are going to be added to Cita as metadata here, right?
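As an illustration of this import step, here is a minimal Python sketch that turns pasted BibTeX text into one record per reference. It is not how Cita or Zotero parse BibTeX; the `parse_bibtex` function, the regular expressions and the sample entries are hypothetical, and the parser assumes simple, well-formed entries with one `name = {value}` field per line and no nested braces.

```python
import re

# Hypothetical sample of the kind of text an extraction tool might produce.
SAMPLE_BIBTEX = """
@article{doe2020,
  author = {Doe, Jane},
  title = {An Example Article},
  year = {2020}
}
@book{roe2018,
  author = {Roe, Richard},
  title = {An Example Book},
  year = {2018}
}
"""

# Assumption: each entry ends with a closing brace on its own line.
ENTRY_RE = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,(.*?)\n\}", re.DOTALL)
# Assumption: one field per line, value wrapped in a single pair of braces.
FIELD_RE = re.compile(r"(\w+)\s*=\s*\{(.*?)\}\s*,?\s*$", re.MULTILINE)


def parse_bibtex(text):
    """Parse simple BibTeX text into a list of dicts, one per entry."""
    records = []
    for entry_type, key, body in ENTRY_RE.findall(text):
        record = {"type": entry_type.lower(), "key": key}
        for name, value in FIELD_RE.findall(body):
            record[name.lower()] = value.strip()
        records.append(record)
    return records


records = parse_bibtex(SAMPLE_BIBTEX)
for record in records:
    print(record["key"], "-", record["title"])
```

A real workflow would rely on a full BibTeX parser rather than regular expressions; the sketch only shows that each extracted reference becomes a small metadata record that can then be attached to the citing item.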
And then if I click here where it says automatically link citations with Zotero items, it's going to check if any of these references is in my Zotero library already, which is indicated by this strong red Z. And now if I select all of these works, which by the way do have references already,
which I imported from Wikidata, I just didn't show how I did it, I'm sorry, you can check the workshops. If I say show the local citation network, then for those references, I get this. So apparently, in this sample library that I have here,
this paper by Vrandečić, who is the creator of Wikidata, is the most cited. But then, for example, these triangles here represent articles that are frequently cited in my library, but which I don't have yet. So I can go ahead and download one and add it to Zotero, for example. And once I have this here, I mean, again, where was it?
I think it's this one up here. If I click here and I click on sync citations with Wikidata, I could potentially upload all these references to Wikidata. That is, if I already have in Wikidata an item for this paper, then I would add these references as properties to that existing item.
This is a little more complex than I've just explained, because you have to make sure you're not creating duplicates. Cita does support some of this, but again, we won't have enough time in this short presentation.
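The duplicate check mentioned here can be sketched as a simple comparison between what was extracted and what the Wikidata item already has. This is an assumption about the general logic, not Cita's actual code; the `new_citation_targets` function and the identifiers are hypothetical placeholders, and references that could not be reconciled with a Wikidata item are represented as `None`.

```python
def new_citation_targets(extracted_ids, existing_ids):
    """Return the cited-item identifiers that still need a "cites work" statement.

    extracted_ids: identifiers of the cited works found for the references,
                   in reference-list order; None marks an unreconciled reference.
    existing_ids:  identifiers already present as "cites work" values on the item.
    """
    seen = set(existing_ids)
    to_add = []
    for item_id in extracted_ids:
        if item_id is None:
            continue  # nothing to link to yet; would need reconciling first
        if item_id not in seen:
            seen.add(item_id)  # also guards against repeats within the list
            to_add.append(item_id)
    return to_add


# Placeholder identifiers, not real Wikidata items.
extracted = ["ITEM_1", None, "ITEM_2", "ITEM_1", "ITEM_3"]
existing = ["ITEM_2"]
to_upload = new_citation_targets(extracted, existing)
print(to_upload)
```

Only `ITEM_1` and `ITEM_3` would be uploaded here: `ITEM_2` is already on the item, the repeated `ITEM_1` is added once, and the unreconciled reference is skipped.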
So yeah, you can check the Cita homepage, where you will find a quick start guide and also some recorded presentations and workshops. So where are we with Cita right now? The project grant that funded the initial development of Cita ended,
and it ended more than a year ago. I was the developer back then, and I could no longer spend so much time developing this, actually almost none in the last year and a half, I guess. But then eventually Dominic, who's going to speak next, joined as a user,
and he started sending contributions. At the beginning, I was trying to review them, but then I realized that I couldn't keep up with reviewing them, and he became the co-maintainer, the new maintainer of the project. He's doing this on a volunteer basis. We currently have some bug fixes to address
and lots of feature proposals and feature requests. Some of these were ideas that were in the project from the beginning. Some are ideas from the community of Cita users. As Dominic has written here, in the last two months there were 2,000 downloads of the plugin, and these downloads include automatic downloads from Zotero users
who already use the Cita plugin. I think that's quite large, considering that it's a plugin for Zotero; I mean, it's not an Instagram plugin. Among the features that we would like to add
are integrating reference extraction right into Cita, so that you don't have to rely on an external tool; importing from Crossref instead of only Wikidata; and maybe integrating with OpenAlex, either to download this reference information
or eventually also to upload to OpenAlex. Also exporting to the format of CROCI, which is a database within the OpenCitations project. And we have to make sure that Cita is going to be supported by Zotero 7, which is going to be released very soon, among other things.
And also, when we started this project, we said that there was no software and no plugin supporting reference metadata in reference managers. Since Cita was released, actually in December last year, a new plugin appeared, by this person, MuiseDestiny.
The plugin has been very actively developed in the last months. It has something like 50 releases in less than half a year, and it has lots of GitHub stars. I don't know how many users it has, but probably many. It would be nice to try it as well.
I haven't tried it that much. It has some differences from Cita. I don't think it would integrate as well with reference extraction workflows, because it seems to only display the information. It can extract references from PDFs; I haven't seen yet how it does it. It can also download from Crossref, I think.
But then that information is not saved as metadata of the item. It's just shown there. So you cannot upload that information anywhere later. I think you cannot export it either. I mean, it's something worth trying, to see how one thing could complement the other and eventually what common paths we might find for the future.
And talking about the future of Cita, Dominic, if you can please continue. Yeah, thanks a lot, Diego. So yeah, like Diego mentioned, I initially joined the project because I had been looking for something exactly like Cita for a long time.
Now, there seems to be a push, particularly from Zotero, to integrate as much of the research process into one piece of software as possible. Previously, it was just a tool for storing papers, essentially. Recently, an integrated PDF reader and annotation tool was added to the software.
And the one thing that was still missing was the process of discovering new papers and relevant research, which Cita is now able to provide by having references for each paper in your library.
So I found this super useful, and while I was experimenting with the software I thought, oh, it would be cool if this feature was also possible. And I found Diego quite welcoming to new ideas. So I contributed a bit, and in the end ended up as a maintainer of the plugin.
But it's kind of the same problem that Diego has, and I guess the same as for a lot of open source software: this is not my main research focus or anything like that. It's just something I'm volunteering some time for on the side, because I enjoy it, I find the tool useful, and I think it's useful for other people as well. So I find contributing rewarding, but it's something that I only have limited time for.
And at the moment, that is in some way limiting the new developments we can make. It's possible to keep the software working, I guess, but Zotero is always evolving, and we're relying on some APIs which are also not super stable,
from Wikidata, from Crossref, all these things that are being developed and changing as well. So we will always need some maintenance to keep the plugin running. I'd say we can't just forget it and expect it to still work in 10 years' time. So this is okay. It doesn't take so much time to maintain it with small bug fixes and things,
but bigger features, such as integrating with bigger citation databases or integrating reference extraction workflows, would I think be super useful, and they will take some time to develop. And at the moment, I in particular am a bit limited in how much time I can dedicate to this,
which I find a bit sad. And there are always a lot of requests from new users who say, oh, it would be really nice if we could pull data from this other database, or if we could support the language of different citations, or different intentions for citations, and different things like this.
So that brings me to the question: what could the future of Cita be? Will we just stay as we are now, with some maintenance and things progressing slowly into the future, or could we be part of something bigger? Are we a valuable contribution to, for example, Zotero?
Is this a feature that's useful for most users of the software? Or is this something that Wikidata is interested in, as a way of allowing users to easily access a lot of the data in its repository? Or is Cita useful as a reference extraction interface in some way, for users that might not be so technically inclined,
if they have a tool already present in their reference management software that allows them to quickly extract references from their PDFs? Is that something that gives more publicity and more users to reference extraction workflows, brings in more data, and enables them to be run more easily?
Yeah, this is a bit of an open question that I hope we can have a discussion about today. If you've seen what we've presented with Cita and you think this is useful, that it could be integrated into your workflow in some way, or you think there are parts missing or anything like this, or you have any ideas, it would be fantastic to hear from you.
So, thanks. Well, thank you for listening. These are the ways to contact us: emails, and Twitter and Mastodon handles. So, thank you.