Software citation: a cornerstone of software-enabled research

Video in TIB AV-Portal: Software citation: a cornerstone of software-enabled research

Formal Metadata

CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Software is a critical part of modern research, and yet there is little support across the scholarly ecosystem for its citation. Inspired by the activities of the FORCE11 working group focused on data citation, the FORCE11 Software Citation Working Group published a set of Software Citation Principles in September 2016. These have the goal of encouraging broad adoption of a consistent policy for software citation across disciplines and venues. This presentation will discuss the principles (in brief: importance, credit and attribution, unique identification, persistence, accessibility, and specificity), how they will impact the practice of research, and how they can be implemented by researchers, publishers, librarians and others who build and maintain repositories, scholars of science, university administrators, and research funders.

So yes, thank you for the invitation to come and talk. This is related to the Software Citation Working Group that was mentioned, and I just wanted to talk a bit about where software citation is
and why it's important. Some of the things I want to mention are going to repeat earlier talks and highlight points that came up in earlier talks as well. The first thing is, as I think has been discussed and as we probably all know, that software, including services, is essential for the bulk of science today. About four years ago I did a small study of a number of issues of Science magazine and discovered that about half of the papers were for software-intensive
projects, and almost all the papers had software involved at some point. Research is dependent on advances in software, and in some sense I would say that software is the thing that sits between the computing infrastructure and the science and actually enables that infrastructure to be used for science. Software is not something that you just develop; you have to sustain it over time. The development, production and maintenance of software are people-intensive, and software lifetimes are long compared with hardware lifetimes, so software often has underappreciated value, although probably not in this room.
If we think about computational science research, we can think about this cycle: creating a hypothesis; getting some resources to investigate the hypothesis, where those resources are funding or software or data or other things; actually doing the research, so building software or data; publishing results, potentially as a paper or a book, or the software itself or the data might be the result; gaining recognition for having done that; and then continuing around the cycle another time. The knowledge really is in this last piece: the paper, the software, the data, the thing that's been gained from the research. If we think about where software fits in, there are two different types of software, right? There is the software that's developed as part of the research, and then the software that you get at the beginning and that you end up with
at the end. That's not the software that you're just using in the research, but the software that you're getting from somebody else or giving to somebody else. That's what I would call software as infrastructure, and that's ideally what is published and cited, so that you're giving credit to the people whose software you're using, and as people use your software they're giving credit to you. I think we can certainly say that scientific research is becoming
more open and more digital. It's more open in that people want to share more, and need to share more in some cases because of funding agencies or for other community reasons, and it's more digital, which makes it easier to share. There is significant time that's spent developing software and data, and these efforts, as we've already
discussed, are not always recognized or rewarded. If we think about how we evaluate people, we have a system that builds on citations and builds metrics from them, again as the previous speakers said, and we don't have that for software and data in general. So the hypothesis that I made, probably four years ago, was that if we could better measure the contributions of people to software, measured in terms of citations or impact or other metrics, that would lead us to have better rewards and better incentives for people to do this. That would lead to better career paths and more
willingness for people who are, let's say, faculty members to join software development communities, and overall this would lead to more sustainable software by having more contributors. But if we think about how we can measure software usage, we have the fact that the
citation system, again, was created for papers and books, and we really have two options to try to put software into that, right? We can either jam software into a system that wasn't built for it, or we can try to rework the citation system. We're focusing on the first one because the second one is just too hard, although I would say honestly there are a lot of reasons that we really should try to do the second one. The challenge that we're trying to address with software citation is not just how to identify software in a paper, because the citations are not just in papers, but how to identify software that is used within
the research process as a whole, so that for software that you're producing you can see what software that software is citing, and you can see other things like that. So if we look at how software is cited
today, software and other digital resources appear in a bunch of very inconsistent ways. James Howison had this nice study from 2015 that looked at 90 biology articles and tried to see how software is mentioned in those articles. He found that in the 90 articles there were 286 mentions of software, and
software is mentioned in seven different ways: citations to a publication about the software, which is the most common; citations to the user's manual for the software; citations to the name or the website of the software; something that's like an instrument mention, where you know something about it but it's not exactly a citation; the URL of the software in the text; the name of the software in the text, where that's all that you have; and cases where you don't even have a name. I was a little bit confused by the last one, actually, and I asked James, how do you know that it's software if it's not even named? He said that the text would say something like "we used software to do this". Some of this is acceptable. But when you look at data and facility citations, you end up with similar results: data also was cited in very inconsistent ways, and facilities are actually cited probably the least of all, but also in strange ways. So this software citation principles work started with
the FORCE11 group. FORCE, I should say for the people who don't know, is the Future of Research Communications and e-Scholarship. This is a group that is kind of led in San Diego at this point; I don't actually know where it started, but it started with a couple of different conferences called Beyond the PDF 1 and 2, and then it expanded to start doing things that were about even more than just papers. This working group started in July of 2015, and WSSSPE, which was
mentioned before and which is something else I've been co-leading for quite a while, is the Workshop on Sustainable Software for Science: Practice and Experiences. At our third meeting we had a working group that was looking at credit and citation, and this group decided to join the FORCE11 group rather than doing something separate, so the two groups merged in September of 2015. We ended up at that point with about 55 members: researchers, developers, publishers, repository people and librarians. We did all of our work on GitHub primarily, and had a FORCE11
web page that pointed to that. What we did was to review existing community practices and develop use cases for what software citation should be, where it would be used, how it would be used, and things like that. We then started drafting a principles document. We started with the data citation principles that had been around for probably a year or so before that, and we updated them based on the software use cases and related work, and then on a number of working group discussions that we had, I don't know, maybe every two months. We also got community feedback and review of a draft: we had a half-day workshop at the FORCE2016 meeting in April 2016, and one of the things that came out of that was the fact that software and data are different, which probably isn't too surprising to
anybody here, but it's surprising to a lot of people. A lot of our reviewers in fact said, well, software is data, why do you need something different? Because we got that feedback, we ended up creating what turned out to be a preprint in PeerJ, which was originally just a GitHub page where we tried to get community people to talk about why software and data are different. So we've documented a number of differences. As one of the speakers this morning said, one of them is licensing and copyright, which
is one area in which software and data are different: software is a creative work and data is not, so they have different legal requirements as well. In any case, we again did everything through GitHub, so there are a bunch of issues. You can go to our page and see the old versions of this document, why it changed, who asked for it to change, and the discussion about why it didn't change in some cases, things like that. So we tried to do everything extremely openly.
What we ended up doing then was submitting the final paper to PeerJ Computer Science. We went through a number of review cycles and eventually published the final paper. One of the things that was actually interesting with PeerJ was that when we were done, the editor said to us, your paper is now accepted; would you like us to also publish the reviews and your responses to the reviews as an appendix? And we said yes. So again, this is extremely open: you can see what the reviewers said, what we responded, and how we got to the final paper.
The principles paper itself has the six principles, which I'm going to talk about in the next few slides, the motivation, a summary of the use cases, the related work, and a discussion including recommendations, and I'll talk about some of that as well.
The first principle is importance, and what we say is that software should be considered a legitimate and citable product of research, so software citations should have the same importance as citations of other products, whether papers or data or something else. The second principle is credit and attribution, which says that software citations should facilitate giving credit and attribution to all contributors. Unique identification says that a software citation should include a method for identification that's machine-actionable, globally unique, interoperable, and recognized. Persistence says that the unique identifiers and the metadata that describe the software and its disposition should persist, and they should persist even if the software itself is not available, which is really what an identifier does in most other cases as well. Accessibility says that the citation should make it possible to access the software itself and the metadata and any other material that is necessary to be able to use the software. And finally, specificity says that the citation should facilitate identification of, and access to, the specific version
of the software used. I think, again, this is basically consistent with a lot of the other points that have been made; we tried to formalize it. I don't actually know whether the other speakers had seen this or not, but I think it's promising that we seem to all be going in the same direction and thinking about the same things at this point. Good, so let me talk about a few other things. I should say that basically we have the six principles that we agreed on, and then we have a lot of other discussion, and that discussion is a much longer section of the paper than the principles. It's there because the principles are
intended to be, and we believe are, uniformly applicable, but how they're actually applied differs depending on a lot of different things. So the discussion really talks about how you apply the principles in some cases, or the different ways in which you could apply them. We're not sure exactly what the right thing is yet, but we think that over time we'll figure out something that becomes the standard. One of these points of discussion is just what to cite. The importance principle says that authors should
cite the appropriate set of software products, just as they cite the appropriate set of papers, but exactly what that means, again, is not completely clear. What we discussed was that the software that is cited should be decided by the author of the product in the context of community norms and practices. In particular, it's worth mentioning something from the Purdue Online Writing Lab, which says: do not cite standard office software such as Word or Excel, or programming languages; provide references only for specialized software. We tend to agree with this, and the point that we came up with was that if using different software would produce different data or results, that's a sign that the software
should be cited, and if the software is something that you're using along the way but it doesn't actually make much of a difference in the results, then maybe it's not important to cite. But again, this is discussion; this is not a strong principle. In terms of provenance and reproducibility, which actually came up in discussion yesterday, provenance and reproducibility requirements are much greater than citation requirements. Citation is thinking about what software is important to the research outcome; provenance is all the steps, including the software, that go into the research. Just one example of this is that for citation you might say we used library X version 1.5, but for provenance you might say we compiled it with the flag -O3 and ran it on this specific hardware using this specific operating system, right? So there's a lot more that goes into provenance than goes into citation, and we recognize that.
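To make the citation versus provenance distinction concrete, here is a minimal Python sketch. It is only an illustration: the record types, field names, and the hardware and operating system values are hypothetical, and the compiler flag is read here as `-O3`. Only "library X" and version 1.5 are taken directly from the example in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoftwareCitation:
    """What a reference list needs: which software mattered to the result."""
    name: str
    version: str


@dataclass(frozen=True)
class ProvenanceRecord:
    """Everything needed to retrace one run; it contains the citation and more."""
    software: SoftwareCitation
    compiler_flags: tuple[str, ...] = ()
    hardware: str = "unspecified"
    operating_system: str = "unspecified"

    def to_citation(self) -> SoftwareCitation:
        # In this sketch the citation can be recovered from the provenance
        # record, but the provenance cannot be recovered from the citation.
        return self.software


# Hypothetical values, for illustration only.
run = ProvenanceRecord(
    software=SoftwareCitation(name="library X", version="1.5"),
    compiler_flags=("-O3",),
    hardware="example cluster node",
    operating_system="example Linux release",
)

print(run.to_citation())
```

The design choice is that the provenance record wraps the citation rather than duplicating its fields, which mirrors the point that provenance asks for strictly more information than citation does.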
But we were not so concerned with the provenance case; we were concerned with the citation case. We decided the provenance case was more than we wanted to worry about, and we would like somebody else to look at that at a later time. Software papers
also came up. Our group's goal is that software should be cited, and that means the software itself, not papers about software. Papers about software are
published and cited today, and our principle says that the software itself should be cited. We take this to mean that it's OK to cite the software paper also, but you shouldn't cite just the software paper; you should cite the software itself. The reason is that the software paper might contain something about the performance or the validation or something else that's important, which might mean that you really should cite the paper also, or the author of the software may say, please cite my software paper, in which case it's probably a good idea to do it. The only thing that our principles are saying is to cite the software and not to cite just the software paper. In terms of unique identifiers, the principles again are intended to be general, and they're intended to apply to all identifier types,
but we recommend that, when possible, people citing software use DOIs that identify specific versions of source code. By this I mean a DOI pointing to source code in a repository somewhere, something like Zenodo or figshare or some institutional repository, and not a URL that points to a GitHub repository. There is also some discussion about RRIDs. Are people here familiar with RRIDs? OK, so I'm going to skip that because it's not worth
going into. It's also worth thinking about what the identifier resolves to. One option, as I said before, is that identifiers could point directly to software, like a URL to a GitHub repository, and that would satisfy
three of the six principles, but it doesn't actually satisfy the persistence part, because a GitHub repository may not be persistent. It may change; we may move away from GitHub at some point to, I don't know, a different version control system that has an understandable API, potentially, or something else. So we recommend that identifiers resolve to persistent landing pages that contain metadata and link to the software, rather than directly to the source code itself. Because landing pages are persistent, this is intended to ensure the longevity of the software metadata even beyond the software's lifespan. In the paper there are a lot of other
points, and if this were an hour-long talk I would talk about most of them, but for a half-hour talk I'm talking about a few of them. If you're interested, please read it and see what the others are, and you can agree or disagree. The paper has as its last section a discussion of how we would update the paper. We recognize that the points that we made are not necessarily the last points; they are the best points we could come up with at the time. So it's perfectly conceivable there may be another version of this in a couple of years, and if you have feedback that leads into the next version, that would be great. So, an example of how you make your software
citable: first, you publish it. If it's on GitHub, it's actually extremely easy to do this; you have to do a very small number of things to put your software from GitHub into Zenodo. If it's not on GitHub, you can put it in Zenodo or figshare or something else pretty easily as well, with a few more steps if you do it manually. That gets you a DOI for your software. You then create a CITATION file, which is a file named CITATION at the top of your repository that says how it should be cited, and you also put that information into the README, because a bunch of people are not looking for the CITATION file and the README is the thing that pops up at the beginning. You can also write a software paper and ask people to cite it, but again, that's a secondary thing from our point of view. So if you want to cite
somebody else's software, the first thing to do would be to check for a CITATION file or a README that says how to cite the software. If it does, then you do that. If it doesn't, you do your best to follow the principles. Try to include all the contributors to the software; if there are too many and you can't figure it out, maybe you can name the project and have that be the author of the software. Try to include a method for identification that is machine-actionable, globally unique and interoperable; maybe it's the URL to a release, maybe it's a company product number. This is maybe the first time that I've mentioned it, but this doesn't apply just to open source. This applies to software in general, so there is commercial software, and I want to be able to cite that commercial software as well as open source software. If there is a landing page that includes metadata, point to that and not directly to the software. And include specific version or release information. OK, so at this point you may think, hopefully, that this all makes sense, this is really good, and this is the way the community is going to move forward.
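The steps described above (look for a CITATION file or README first, otherwise assemble a citation that covers contributors, an identifier, a landing page, and a version) can be sketched in Python. This is a rough illustration under stated assumptions, not a reference implementation: the function names, the metadata field names, the output format, and the prefix-based file matching are all hypothetical, and the `example.org` URLs are placeholders.

```python
from pathlib import Path
from typing import Optional


def find_citation_hint(repo_dir: str) -> Optional[Path]:
    """Return a top-level CITATION file if present, else a README, else None."""
    root = Path(repo_dir)
    for pattern in ("CITATION*", "README*"):  # assumed naming patterns
        matches = sorted(root.glob(pattern))
        if matches:
            return matches[0]
    return None


def missing_items(metadata: dict) -> list[str]:
    """List the checklist items that the given metadata does not cover."""
    gaps = []
    if not (metadata.get("authors") or metadata.get("project")):
        gaps.append("contributors or project name")
    if not metadata.get("identifier"):
        gaps.append("identifier")
    if not metadata.get("landing_page"):
        gaps.append("landing page")
    if not metadata.get("version"):
        gaps.append("version")
    return gaps


def format_software_citation(
    name: str,
    version: str,
    authors: Optional[list[str]] = None,
    project: Optional[str] = None,
    identifier: Optional[str] = None,
    landing_page: Optional[str] = None,
) -> str:
    """Build a plain-text citation from the hypothetical metadata fields."""
    # Credit: name the contributors, or fall back to the project as author.
    if authors:
        credit = ", ".join(authors)
    elif project:
        credit = project
    else:
        raise ValueError("need authors or a project name to give credit")
    # Prefer a landing page with metadata over a direct link to the software.
    locator = landing_page or identifier
    if locator is None:
        raise ValueError("need an identifier or a landing page")
    # Always state the specific version or release that was used.
    return f"{credit}. {name}, version {version}. {locator}"


# Placeholder metadata, for illustration only.
example = {
    "name": "ExampleTool",
    "version": "2.0.1",
    "project": "The ExampleTool Project",
    "identifier": "https://example.org/exampletool/releases/2.0.1",
    "landing_page": "https://example.org/exampletool/2.0.1",
}
citation = format_software_citation(**example)
print(citation)
```

A caller would typically try `find_citation_hint` first and only fall back to `format_software_citation` when the repository gives no citation guidance, using `missing_items` to see what still has to be looked up.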
Or you may not. I thought this was OK for a few months after we did it, and then I realized that there are some problems. I just want to mention one of the problems that I've been thinking about. If we think about paper
citation, there are really three steps that happen. The creator, whom we call the author, submits the paper to a publisher. The publisher potentially has the paper reviewed in some way, and then publishes it and assigns an identifier. And once it has been published and has an identifier, people can actually cite it, and they do that by URL or by DOI, usually. These three things happen for papers
in a very fixed order, and they're three discrete steps; you can't do one before the other. For software today, creators often develop software on GitHub and release it at different stages or versions during its development. Users likely won't cite it, but if they do, they cite the repository. So effectively there is no step two that happens in this process, and the software citation principles really are trying to put in step two; they're trying to force this extra step in so that we can use the existing system. The problem with this is that it might not work, because it adds a step to the software developer's workflow, right? They have to actually publish their software, and a lot of developers may not care enough to actually do that. If they're not working in a university, if they're working in a company, that may not be the way they get credit, and this may just be an extra task that doesn't add anything for them. So even if we do get to some future time in which developers do
routinely publish their software releases, we don't really know what happens until then, or what happens for existing software. So the real problem here is that the steps that we have, create, publish, cite, don't actually match how open software is developed and used. The fact is that software is more fine-grained and more iterative than papers, and this leads to differences. Also, open-source development happens in the open,
unlike papers, which are usually not written in the open, although certainly sometimes they are. For software there's no natural publish step other than for marketing or credit, which again aren't primary concerns for everybody. So it's interesting to think about how this works for papers, and the fact that sometimes in papers people want to cite something that hasn't been published. Students are initially taught not to cite things that haven't been published, but
then eventually they realize that sometimes there's a reason that you just have to, and they are often told that they should call those things personal communications. The APA, which I believe is the American Psychological Association, has a manual that distinguishes between recoverable and nonrecoverable data, and I think these ideas are actually kind of important here in the context of software. Recoverable data is something that can be accessed by the reader via the citation information, and that's something that the APA says should be cited. Nonrecoverable data is something that can't be accessed by the reader, and it's referred to within the text by the author of the communication, with "personal communication" as the keyword, right? So this distinction between recoverable, or published, and nonrecoverable, or not available, doesn't work for
software, and this is really the problem: for software in general there is no unrecoverable state, but there is an unpublished state. So there's kind of a difference of definitions, and a difference in that all the versions of software that are on GitHub, even if never published, are recoverable by default. And so we go back to the principles document and the thing that I said towards the beginning, which is that it's not that academic software needs a separate credit system from that of academic papers, but the need for credit for research software underscores the need to overhaul the system of credit for all research products. I think this is just another reason that what we're trying to do is the best we can, but it's not really
right either. Good. So the working group itself, again, published the principles document. That working group, which had Arfon Smith, myself, and Kyle Niemeyer as chairs, has ended, and this is where we are now. We've just started a new
Software Citation Implementation Working Group that has Martin Fenner, Neil Chue Hong, and myself as co-chairs, and we will be planning what we're going to do, hopefully tomorrow, since we're in the same place. The main thing I think we need to do is to work with institutions, publishers, funders, and potentially researchers to try to get people to endorse the principles and say that this is something they want to do, and then work with us to figure out exactly how they can do it. If you're interested in this, please
talk to me or talk to either of the other co-chairs, who are here as well. We also may think about writing a fuller implementation examples paper where we talk about exactly how the principles would be applied to a bunch of different use cases, in a way that, again, publishers and others can work through and say, this is what I need to do to make this happen. And if you're interested, there's a new FORCE11 group page, and you can sign up for this group and join and participate in all the discussions. So the last thing I
want to mention is that there is also, in the meantime, the Journal of Open Source Software, and this is one of the journals that was mentioned earlier. It's a developer-friendly journal for research software packages. What we say on the website is that if you've written code, you've already licensed it, and you have good documentation, then it should take less than an hour to prepare and submit your paper to this journal. Everything in the journal is open: the published paper is open, the code itself that is being published is open, the review process is done through GitHub issues and so it's open, and the code behind the journal itself is also open, so you can create your own journal if you'd like to work on this and the other archives, papers, and issues that it utilizes. The first paper was submitted in May of last year, so about a year ago, and about a week ago we had 100 papers accepted. I think we're
at 103 now, if I remember right; I'm not exactly sure. There are probably about 35 papers under review currently as well. And if you're interested in this, you can also volunteer to be a reviewer by going to the journal page and opening an issue saying that you'd like to be a reviewer, and we would be very happy to have more. Actually, reviewers aren't the problem; editors are the problem, because every
paper needs an editor. It's an interesting job, but it takes time. So just to wrap up, then: I hope I've made the point that software is important today and essential tomorrow,
which actually probably was made by everybody else. For software citation, we kind of know what we want to do, more or less, but we actually need to start doing it now, and that's the reason for this Software Citation Implementation Group: to take these principles and turn them into actions that happen automatically. As for the things that you can do: if you're a developer, you can cite the software that you use, and you can make it easy for other people to cite the software you write. And there is a manifesto led by Carole Goble called "I Solemnly Pledge" that talks about these two things plus a number of other things, like: if you're reviewing a paper
and you see it mentions software but it doesn't cite the software, then you can say, you should cite that software. Or if you're in a faculty meeting and you're voting on the next person to hire, and some of the other people on your committee say, well, he's developed this software but he hasn't written many papers, right, you could point out that, well, maybe this software actually is an interesting product as well, one that may be as valuable as papers. So there's a number of different things you can do. So with that, thank you