
Software citation: a cornerstone of software-enabled research


Formal Metadata

Title
Software citation: a cornerstone of software-enabled research
Title of Series
Part Number
5
Number of Parts
13
Author
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
2017
Language
English

Content Metadata

Subject Area
Abstract
Software is a critical part of modern research and yet there is little support across the scholarly ecosystem for its citation. Inspired by the activities of the FORCE11 working group focused on data citation, the FORCE11 Software Citation Working Group published a set of Software Citation Principles (https://doi.org/10.7717/peerj-cs.86) in September 2016. The principles aim to encourage broad adoption of a consistent policy for software citation across disciplines and venues. This presentation will discuss the principles (in brief: importance, credit and attribution, unique identification, persistence, accessibility, and specificity), how they will impact the practice of research, and how they can be implemented by researchers, publishers, librarians and others who build and maintain repositories, scholars of science, university administrators, and research funders.

Related Material

The following resource is accompanying material for the video
So, yes, thank
you for the invitation to come and talk. This is related to the Software Citation Working Group that the organizers mentioned, so I just wanted to talk a bit about where software citation is
and why it's important. Some of the things I want to mention are going to repeat earlier talks and highlight points that came up in earlier talks as well. The first thing, as I think has been discussed and as we probably all know, is that software, including services, is essential for the bulk of science today. About four years ago I did a small study of a number of issues of Science magazine and discovered that about half of the papers were software-intensive
projects, and almost all the papers had software involved at some point. Research is dependent on advances in software, and in some sense I would say that software is the thing that sits between the computing infrastructure and the science and actually enables the infrastructure to be used for science. Software is not something that you just develop; you have to sustain it over time. The development, production, and maintenance of software are people-intensive, software lifetimes are long compared with hardware lifetimes, and software often has underappreciated value, although probably not in this room. If we think about computational science research, we can think about a cycle: creating a hypothesis; getting some resources to investigate the hypothesis, where those resources might be funding, software, data, or other things; actually doing the research, so building software and data; publishing results, which potentially is a paper or a book, or the software itself or the data might be the result; gaining recognition for having done that; and then continuing around the cycle another time. The knowledge really is in this last piece: the paper, the software, the data, the thing that's been gained from the research. If we think about where software fits in, there are really two different types of software: the software that's developed as part of the research, and then the software that you get at the beginning and that you end up with
at the end. That's not the software that you're just using in the research, but the software that you're getting from somebody else or giving to somebody else, and that's what I would call software as infrastructure. That's ideally what is published and cited, so that you're giving credit to the people whose software you're using, and as people use your software they're giving you credit. I think we can certainly say that scientific research is becoming
more open and more digital. It's more open in that people want to share more, and need to share more in some cases, because of funding agencies or because of other community reasons. And it's more digital, which makes it easier to share. There is significant time that's spent developing software and data, and these efforts, as we've already
discussed, are not always recognized or rewarded. If we think about how we evaluate people, we have a system for papers that builds citations and from those builds metrics, again as the previous speakers said, and we don't have one for software and data in general. So the hypothesis that I made, probably four years ago, was that if we could better measure the contributions of people to software, measured in terms of citations or impact or metrics, that would lead us to have better rewards and better incentives for people to do this. That would lead to better career paths and more
willingness for people who are, let's say, faculty members to join software development communities, and overall this would lead to more sustainable software by having more contributors. But if we think about how we can measure software usage, we have the fact that the
citation system, again, was created for papers and books, and we really have two options to try to fit software into it: we can either jam software into a system that wasn't built for it, or we can try to rework the citation system. We're focusing on the first one because the second one is just too hard, although I would say honestly there are a lot of reasons that we really should try to do the second one. The challenge that we're trying to address with software citation is not just how to identify software in a paper, because citations are not just in papers, but how to identify software that is used within
the research process, so that within the research process as a whole, if it's software that you're producing, you can see what software that software is citing, and other things like that. So if we look at how software is cited
today, software and other digital resources appear in a bunch of very, very inconsistent ways. James Howison published a nice study in 2015 that looked at 90 biology articles and tried to see how software is mentioned in those articles. He found that in the 90 articles there were 286 mentions of software, and
software is mentioned in seven different ways, including: citations to a publication about the software, which is the most common; citations to the user's manual for the software; citations to the name or the website of the software; something that's like an instrument mention, where you know something about it but it is not exactly a citation; a URL for the software in the text; the name of the software in the text, where that's all that you have; and mentions where you don't even have a name. I was a little bit confused by the last one, actually, and I asked James Howison, 'How do you know that there's software if it's not even named?' He said that the text would still say something like 'we used software to do this', and that was, at the time, somewhat acceptable. When you look at data and facility citations you end up with similar results: data is also cited in very inconsistent ways, and facilities are actually cited probably the least of all, and also in strange ways. So the software citation principles work started with
the FORCE11 group. FORCE11, I should say for the people who don't know, is the Future of Research Communications and e-Scholarship, and this is a group that is kind of led from San Diego at this point. I don't actually know where it started, but it started with a couple of different conferences called Beyond the PDF 1 and 2, and then it expanded to start doing things that were about even more than just papers. This working group started in July of 2015, and WSSSPE, which was
mentioned before and which is something else I've been co-leading for quite a while, is the Workshop on Sustainable Software for Science: Practice and Experiences. At our third meeting we had a working group that was looking at credit and citation, and this group decided to join the FORCE11 group rather than doing something separate, so the two groups merged in September of 2015. We ended up at that point with about 55 members, who were researchers, developers, publishers, repository people, and librarians. We did all of our work on GitHub, primarily, and had a FORCE11
web page that pointed to that. What we did was to review existing community practices and develop use cases for what software citation should be, where it would be used, how it would be used, and things like that. We then started drafting a principles document. We started with the data citation principles, which had been around for probably a year or so before that, and updated them based on the software use cases and related work, then on a number of working group discussions that we had, I don't know, maybe every two months, and on community review of a draft. We had a half-day workshop at the FORCE2016 meeting in April 2016, and one of the things that came out of that was the fact that software and data are different, which probably isn't too surprising to
anybody here, but it's surprising to a lot of people. A lot of our reviewers in fact said, 'Software is data; why do you need something different?' Because we got that feedback, we ended up creating what turned out to be a preprint in PeerJ, which was originally just a GitHub page where we tried to get community people to talk about why software and data are different. So we've documented a number of differences. As one of the speakers this morning said, one of them is licensing and copyright, which
is one area in which software and data are different: software is a creative work and data is not, and so they have different legal requirements as well. In any case, we again did everything through GitHub, with GitHub issues. You can go to our page, you can see the old versions of this document, you can see why it changed, you can see who asked for it to change, and you can see the discussion about why it didn't change in some cases, things like that. So we tried to do everything extremely openly.
What we ended up doing then was submitting the final paper to PeerJ Computer Science. We went through a number of review cycles and eventually published the final paper. One of the things that was actually interesting with PeerJ was that when we were done with this, the editor said to us, 'Your paper is now accepted; would you like us to also publish the reviews and your responses to the reviews as an appendix?', and we said yes, we want to. So again this is extremely open: you can see what the reviewers said, you can see how we responded, and you can see how we got to the final paper. The principles paper itself has the six principles, which I'll talk about in the next few slides, the motivation, a summary of the use cases, the related work, and the discussion, including recommendations. The first principle is importance, and what we say is that software should be considered a legitimate and citable product of research, so software citations should have the same importance as citations of other products, whether papers or data or something else. The second principle is credit and attribution, which says that software citations should facilitate giving credit and attribution to all contributors. Unique identification says that a software citation should include a method for identification that's machine-actionable, globally unique, interoperable, and recognized. Persistence says that the unique identifiers and the metadata that describe the software and its disposition should persist, and they should persist even if the software itself is not available, which is really what an identifier does in most other cases as well. Accessibility says the citation should make it possible to access the software itself, the metadata, and any other material that is necessary to be able to use the software. And finally, specificity says that the citation should facilitate identification of, and access to, the specific version
of the software used. I think again this is basically consistent with a lot of the other points that have been made. We tried to formalize this, and I don't know actually whether the other speakers had seen this or not, but I think it's promising that we seem to all be going in the same direction and thinking about the same things at this point. So let me talk about a few other things. I should say that basically we have the six principles that we agreed on, and then we have a lot of other discussion, and that discussion is a much, much longer section of the paper than the principles. It's there because the principles are
intended to be, and we believe are, uniformly applicable, but how they're actually applied differs depending on a lot of different things. So the discussion really talks about how you apply the principles in some cases, or the different ways in which you could apply the principles. We're not sure exactly what the right thing is yet, but we think that over time we'll figure out something that becomes the standard. One of these points of discussion is just what to cite. The importance principle says that authors should
cite the appropriate set of software products, just as they cite the appropriate set of papers, but exactly what that means, again, is not completely clear. What we discussed was that which software is cited should be decided by the author of the product, in the context of community norms and practices. In particular, it's worth mentioning something from the Purdue Online Writing Lab, which says: do not cite standard office software such as Word or Excel, or programming languages; provide references only for specialized software. We tend to agree with this, and the point that we came up with was that if using different software would produce different data or results, that's a sign that the software
should be cited, and if the software is something that you're using along the way but it doesn't actually make so much of a difference in the results, then maybe it's not important to cite. But again, this is discussion; this is not a strong principle. In terms of provenance and reproducibility, a discussion that actually came up yesterday, provenance and reproducibility requirements are much greater than citation requirements. Citation is thinking about what software is important to the research outcome; provenance is all the steps, including the software, that go into the research. Just one example of this is that for citation you might say 'we used library X, version 1.5', but for provenance you might say 'we compiled it with flag -O3, and we ran it on this specific hardware using this specific operating system'. So there's a lot more that goes into provenance than goes into citation, and we recognize
that, but we were not so concerned with the provenance case; we're concerned with the citation case. We decided the provenance case was more than we wanted to worry about, and we would like somebody else to look at that at a later time. Software papers
also come up, and our group's goal is that software should be cited, and that means the software itself, not papers about software. Papers about software are
published and cited today, and our principle says that the software itself should be cited. We take this to mean that it's OK to cite the software paper also, but you shouldn't cite just the software paper; you should cite the software itself. There can be reasons to cite both: if the software paper contains something about the performance, the validation, or something else that's important, that might say that you really should cite the paper also, or the author of the software may say 'please cite my software paper', in which case it's probably a good idea to do it. The only thing that our principles are saying is to cite the software, not to cite only the software paper. In terms of unique identifiers, the principles again are intended to be general, and they're intended to apply to all identifier types,
but we recommend that, when possible, people citing software use DOIs that identify specific versions of source code. By this I mean a DOI pointing to source code in a repository somewhere, something like Zenodo, figshare, or some institutional repository, and not a URL that points to a GitHub repository. There is also some discussion about RRIDs. Are people here familiar with RRIDs? OK, so I'm going to leave that out and skip it, because it's not worth
going into. It's also worth thinking about what the identifier resolves to. One option, as I said before, is that identifiers could point directly to software, like a URL to a GitHub repository, and that would satisfy
several of the six principles, but it doesn't actually satisfy the persistence part, because the GitHub repository may not be persistent. It may change; we may go away from GitHub at some point and move to, I don't know, a different version control system that has an understandable API, potentially, or something else. So we recommend that identifiers resolve to persistent landing pages that contain metadata and link to the software, rather than directly to the source code itself. Because the landing pages are persistent, this is intended to ensure the longevity of the software metadata, even beyond the software's lifespan. In the paper there are a lot of other
points, and if this were an hour-long talk I would talk about most of them, but for a half-hour talk I'm talking about a few of them. If you're interested, please read it and see what the others are, and you can agree or disagree. The paper has as its last section a discussion of how we would update the paper. We recognize that the points that we made are not necessarily the last points; they're the best points we could come up with at the time. So it's perfectly conceivable there may be another version of this in a couple of years, and if you have feedback that leads into the next version, that would be great. So, an example of how you make your software
citable: first, publish it. If it's on GitHub it's actually extremely easy to do this; you have to do a very small number of things to put your software from GitHub into Zenodo. If it's not on GitHub, you can still put it in Zenodo or figshare or something else pretty easily as well, with a few more steps if you do it manually. You get a DOI then for your software. Then you create a citation file, which is a file named CITATION at the top of your repository that says how people should cite your software, and you also put that information into the README, because a bunch of people are not looking for the CITATION file and the README is the thing that pops up at the beginning. You can also write a software paper and ask people to cite it, but again that's a secondary thing from our point of view. If you want to cite
somebody else's software, the first thing to do would be to check for a CITATION file or a README that says how the author wants it cited. If there is one, then you do that, and if there isn't, you do your best to try to follow the principles. Try to include all the contributors to the software; if there are too many and you can't figure it out, then maybe you can name the project and have that be the author of the software. Try to include a method for identification that's machine-actionable, globally unique, and interoperable. Maybe it's a URL to a release; maybe it's a company product number. This is maybe the first time that I've mentioned that this doesn't apply just to open source; this applies to all software. There is commercial software, and I want to be able to cite that commercial software as well as open source software. If there's a landing page that includes metadata, point to that and not directly to the software, and include specific version or release information. OK, so at this point you may think, hopefully, that this all makes sense, this is really good, and this is the way the community is going to move
forward, or you may not. I kind of thought this was OK for a few months after we did it, and then I realized that there are some problems. I just want to mention one of the problems that I've been thinking about, which is that if we think about paper
citation, there are really three steps that happen. The creator, whom we call the author, submits the paper to a publisher. The publisher potentially has the paper reviewed in some way, and then publishes it and assigns an identifier. Then, once it's been published and has an identifier, people can actually cite it, and they do that by URL or by the DOI, usually. So these three things happen for papers
in a very fixed order, and they're three discrete steps; you can't do one before the other. For software today, we have the fact that creators often develop software on GitHub and release it at different stages or versions during its development. Someone then uses it, likely won't cite it, but if they do, they cite the repository. So effectively there is no step two that happens in this process, and the software citation principles really are trying to put in step two; they're trying to kind of force this extra step in so that we can use the existing system. The problem with this is that it might not work, because it adds a step to the software developers' workflow: they have to actually publish their software, and a lot of developers may not care enough to actually do that. If they're not working in a university, if they're working in a company, that may not be the way they get credit, and this may just be an extra task that doesn't add anything for them. Even if we do get to some future time in which developers do
routinely publish their software releases, we don't really know what happens until then, or what happens for existing software. So the real problem here is that the steps that we have, create, publish, cite, don't actually match how open source software is developed and used. The fact is that software is more fine-grained and more iterative than papers, and this leads to differences. But also, open source development happens in the open,
unlike papers, which are usually not written in the open, although certainly sometimes they are. For open source software there's no natural publish step, other than for marketing or credit, which again aren't primary concerns for everybody. It's interesting to think about how this works for papers, and the fact that sometimes in papers people want to cite something that hasn't been published. Students are initially taught not to cite things that haven't been published, but
then eventually they realize that sometimes there's a reason that you just have to, and they are often told that they should call those things personal communications. The APA, which I believe is the American Psychological Association, has a manual that distinguishes between recoverable and unrecoverable data, and I think these ideas are actually kind of important here in the context of software. Recoverable data is something that can be accessed by the reader via the citation information, and that's something that the APA says should be cited. Unrecoverable data is something that's referred to within the text but can't be accessed by the reader, so it's referred to in the text by naming the author of the communication along with 'personal communication' as a keyword. This distinction between recoverable, or published, and unrecoverable, or not available, doesn't work for
software, and this is really the problem: for software in general there is no unrecoverable state, but there is an unpublished state. So there's kind of a difference in definitions and a difference in terms. All the versions of software that are on GitHub, even if they were never published, are recoverable by default. So we go back to the principles document and the thing that I said towards the beginning, which is that it's not that academic software needs a separate credit system from that of academic papers, but that the need for credit for research software underscores the need to overhaul the system of credit for all research products. I think this is just another reason that what we're trying to do is the best we can, but it's not really
right either. So the working group itself, again, published the principles document. That working group, which had Arfon Smith, myself, and Kyle Niemeyer as co-chairs, has ended, and this is where we are now: we've just started a new
Software Citation Implementation Working Group that has Martin Fenner, Neil Chue Hong, and myself as co-chairs. We will be planning what we're going to do, hopefully tomorrow, since we're in the same place. The thing I think we need to do is to work with institutions, publishers, funders, and researchers, potentially, to try to get people to endorse the principles and say that this is something they want to do, and then work with us to figure out exactly how they can do it. If you're interested in this, please
talk to me or talk to either of the other co-chairs, who are here as well. We also may think about writing a full implementation examples paper, where we talk about exactly how the principles would be applied to a bunch of different use cases, in a way that, again, publishers and others can work through and say, 'this is what I need to do to make this happen'. If you're interested, there's a new FORCE11 group page, and you can sign up for this group, join, and participate in all the discussions. The last thing I
want to mention is that there is also, in the meantime, the Journal of Open Source Software, and this is one of the journals that was mentioned earlier. It's a developer-friendly journal for research software packages. What we say on the website is that if you've written code, you've already licensed it, and you have good documentation, then it should take less than an hour to prepare and submit your paper to this journal. Everything in the journal is open: the published paper is open; the code itself that is being published is open; the review process is done through GitHub issues, and so that is open; and the code behind the journal itself is also open, so you can create your own journal if you'd like to build on this. The first paper was submitted in May of last year, so about a year ago, and about a week ago we had our 100th paper accepted. I think we're
at 103 now, if I remember right; I'm not exactly sure, it could be 102. There are probably 35 papers under review currently as well. If you're interested in this, you can also volunteer to be a reviewer by going to the journal page and opening an issue saying that you'd like to be a reviewer, and we would be very happy to have more. Actually, reviewers aren't the problem; editors are the problem, because every
paper needs an editor. It's an interesting job. So just to wrap up, I hope I've made the point that software is important today and essential tomorrow,
which actually probably was made by everybody else. For software citation, we kind of know what we want to do, more or less, and we actually need to start doing it now. That's the reason for this Software Citation Implementation Group: to take these principles and turn them into actions that happen automatically. There are things that you can do if you're a developer: you can cite the software that you use, and you can make it easy for other people to cite the software that you write. There is a manifesto, led by Carole Goble, called 'I solemnly pledge', that talks about these two things plus a number of other things. For example, if you're reviewing a paper
and you see that it mentions software but doesn't cite the software, then you can say, 'you should cite that software'. Or if you're in a faculty meeting and you're voting on the next person to hire, and some of the other people on your committee say, 'this person has developed software but hasn't written many papers', you could point out that, well, maybe this software actually is an interesting product as well, one that may be as valuable as papers. So there are a number of different things you can do. So with that, thank you.
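
Editorial illustration (not part of the talk): the workflow the speaker describes for citing someone else's software, namely look for a CITATION file or README at the top of the repository and otherwise assemble a citation with contributors, a specific version, and an identifier that ideally points to a landing page, might be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions: the names `find_citation_hint` and `SoftwareCitation`, the candidate file list, and the output layout are all hypothetical and are not prescribed by the Software Citation Principles.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import List, Optional

# Assumption: the talk names only a CITATION file and the README;
# the exact spellings checked here are illustrative.
CITATION_CANDIDATES = ("CITATION", "README", "README.md")


def find_citation_hint(repo_root: Path) -> Optional[Path]:
    """Return the first candidate file found at the top of the repository, or None."""
    for name in CITATION_CANDIDATES:
        candidate = repo_root / name
        if candidate.is_file():
            return candidate
    return None


@dataclass
class SoftwareCitation:
    """Hypothetical record holding the pieces the principles ask a citation to carry."""
    name: str
    version: str      # specificity: the exact version that was used
    identifier: str   # unique identification: ideally resolves to a landing page
    authors: List[str] = field(default_factory=list)  # credit and attribution
    year: Optional[int] = None

    def format(self) -> str:
        # If the contributors cannot be determined, name the project as the author,
        # as suggested in the talk.
        creators = ", ".join(self.authors) if self.authors else f"{self.name} project"
        year = f" ({self.year})" if self.year is not None else ""
        return f"{creators}{year}. {self.name}, version {self.version}. {self.identifier}"
```

A caller would first check `find_citation_hint(Path("."))` and follow the author's stated preference if a file is found; only otherwise would it fall back to building a `SoftwareCitation` by hand. Any identifier passed in would be a placeholder until a real DOI or landing-page URL is known.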