Software sustainability - guidelines for the selfish scientist
Formal Metadata

Title: Software sustainability - guidelines for the selfish scientist
Part Number: 9
Number of Parts: 13
Author: 0000-0002-8876-7606 (ORCID)
License: CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/31022 (DOI)
Transcript: English (auto-generated)
00:04
Thank you very much to the organizers for inviting me here, and particularly for the wonderful dinner that we had last night. I thoroughly enjoyed myself. Before I start, this talk is called Software Sustainability Guidelines for the Selfish Scientist.
00:22
But I realized as we've gone through the days of this conference that perhaps this isn't quite the right talk for this audience. I want to have a quick check. How many people here consider themselves to be researchers or scientists? And how many people consider themselves to work with researchers and scientists
00:44
to provide information on software? Okay, so maybe this is the right talk. If it isn't, I also have another talk I can give, which is more on the work we do directly and the way that the Software Sustainability Institute is organized and run.
01:02
And I'm happy to take questions on that as well. But effectively, this talk is meant to be a motivation and a guide for how you can talk to other scientists that you collaborate with and work with to help them see why doing things which are very much just for themselves
01:24
overall improves software sustainability for everyone. And the sort of background to this is based a lot on where we've been working and where in particular I have been working and the sorts of people I've been working with. And some of you I talked to at dinner,
01:41
I mentioned the fact that I started off in high energy physics, which obviously has a very large background in developing computer software. It has many software engineers working for it, and it has a large amount of practice and a large amount of kind of process.
02:03
So it knows how to develop software in some respect. But for me, it's also very, very difficult because you are one person in a huge consortium and you don't really get to do very much exciting stuff. So I ended up going in a very different direction. I ended up working with what we call the long tail of research.
02:24
So how many people have heard of the long tail? Yeah. So there is this idea that you can apply the long tail to everything. So I'm going to try and apply the long tail to something which it probably doesn't fit with, which is this idea that in science, in research, we've effectively got some people who are the superstars of the research world.
02:46
These are the Nobel Prize winners. These are the people who basically produce a large number of cited papers. So these are the people who have an H index, which is in the hundreds. These are the people who are rightly seen as the people who have a lot of impact in research.
03:07
But the thing is, we also have this long tail of people. We have a large number of researchers who are doing work that is also being cited and is also valuable. The difference is their productivity is slightly different.
03:20
They may not produce as many papers. They may not be cited as many times. But when you look at it, actually the total number of citations from this long tail, or as I will call it, most scientists, the mainstream of scientists, is important too. And is anyone a paleontologist or someone who works with dinosaurs?
03:44
No? Okay. So I have heard that actually this picture is completely wrong for dinosaurs because they didn't have tails like this. New research means that this is actually how dinosaurs looked. They held their tails up high. And if we look at this with what we see in the research world,
04:02
this is what I think is the really interesting area. So over on that end there, this is where my colleagues at the High Performance Computing Center work. They look at the people who are the top 1% and they help them make their work even better. I'm interested in this bit here, looking to see how we work with all of the other people
04:22
who might become the top 1% and make their work slightly better. So this is where the improvements in practice can have the most effect. It's in the long tail of researchers. And the question really is what can we do there and how can we persuade people that things like producing better software are useful.
04:46
And the first question I normally get asked is what's software got to do with my research, because a lot of the people in that area do not consider themselves software developers. It's kind of interesting actually. Yesterday even in this workshop when we had the first keynote presentation
05:04
and people were asked how many of them were R users, quite a few people put up their hands, and then when people were asked how many of them were R developers, there was nobody. Yet if you're an R user, it's very difficult not to be an R developer.
05:23
R is a platform that means that you are writing scripts, you are doing computational work, you are extending it. There are very few people who are R users who simply blindly run something that they have downloaded without changing anything. So there is this disconnect where most people do not think of themselves
05:44
as software developers when actually they are software developers. They're just not software developers who are developing code for other people to create a product. They're not people who are engineers, they're not people who are seeking to do this
06:01
because they want to produce something that they can sell effectively. And one of the things that has happened recently is that we have had a blurring of the lines of the different paradigms of science. So people will have seen this kind of idea of the four paradigms of science
06:21
from the empirical and theoretical that we've known for many years through to more recently computational and lately data science as well. And it's really great that I had the talk just before to basically lay out a lot of the tooling for data science. But these are no longer distinct.
06:41
In almost all disciplines, what happens is that you need to know how to do all of these different techniques for science. And therefore, we see not just in the computational and the data exploration paradigms, the use of software, but we see it everywhere.
07:01
I mean, even some people who are working in the areas of theoretical mathematics are now starting to at least accept the idea that there may be computational proofs. So modern research is impossible without software. There are so many different places where you can see software being used.
07:21
And I think the important thing for me is that mostly we sort of think of software for science being in this area, the sort of particle physics and Large Hadron Collider or climate science, the very large models. But most software in science is at a completely different end of the scale. It is the scripts that are written in Excel.
07:41
It is the models that are defined in something like MATLAB. So many people are using software and developing software, but it's not necessarily the software we think of as scientific software.
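To make that concrete, here is a minimal sketch of the kind of long-tail research software the talk is describing: a few lines that read a data file and summarise one column. The file and column names are hypothetical, and the talk itself does not prescribe any particular language; Python is used here purely for illustration.

```python
# A minimal "long-tail" research script: not a packaged product, just a
# few lines that answer one researcher's question.
# The file name and column name are hypothetical.
import csv
import statistics

with open("samples.csv", newline="") as f:
    readings = [float(row["value"]) for row in csv.DictReader(f)]

print("n     =", len(readings))
print("mean  =", statistics.mean(readings))
print("stdev =", statistics.stdev(readings))
```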
08:02
So we did some work to try and understand quite what effect that would have. So we did a survey in 2014 trying to understand how people regarded software. And I think the big takeaway here is that 68% of researchers that we interviewed across the leading research-intensive universities in the UK
08:25
said their work would be impossible without software. So it's not just that it would be hard, it's that it would be, you know, they would not be able to do any of their research without this specialist research software in different ways.
08:41
The other thing here, and this is echoing something that Konrad spoke about, is that our survey results kind of chime with many of the other surveys that have been done to say that whilst there are a lot of people who are developing software, so over half of them are developing software,
09:01
most of them have had no formal software training. So what we end up with is a whole set of researchers who don't think they're developing software or who know they're developing software but who have had no training, all of whom feel that software is vitally important to their work. So this is a problem.
09:22
And it's a problem for many different reasons. So as well as this inbuilt knowledge that software is important, what we see now over the last 10 years is an increasing number of articles coming out which are casting doubt on the truth of science.
09:41
It's basically things like the reproducibility crisis which you have heard of, things where we are looking back and looking at the published research and going, can we have any trust in this? So it happens in bioinformatics and genetics.
10:00
Here we have a study which shows that it's very hard to repeat some of the top analyses in microarray gene expression. It's the same thing in computer science. So this is a study which is looking at whether you can get to the software that is mentioned in papers and the overall summary of that is,
10:23
mostly you can't. And even when you get to the point where magazines like The Economist, which are nothing to do with science, are taking an interest in the reproducibility crisis, you know that overall we maybe do have a crisis here for researchers.
10:43
So we were set up, the Software Sustainability Institute, as a way that the research funders in the UK could start moving some of these challenges outwards from being top-down challenges
11:00
that were being kind of put out as guidelines by the research funders to being more bottom-up initiatives that looked at trying to change the way that people worked across research in the UK and collaborating with people across the world. And our logo, I realize I'm not wearing the t-shirt.
11:21
We actually have t-shirts. You can buy the t-shirt. Someone asked about sustainability of platforms. Every time you buy a t-shirt, we get the equivalent of an extra 1 euro 50. So if you buy enough t-shirts, we'll be able to hire a new member of staff. But we have the slogan, better software, better research,
11:42
because we think that by developing software better, actually you become more efficient and more effective at research. And the rest of my talk is really about the very simple steps that we try and tell people to do to make them more effective researchers.
12:01
And I think this is something that's changed a lot. The types of skills you need as a researcher in the modern world are different from the ones 20 years ago and definitely different from the ones 40 years ago. And we need to keep up with this. And there are a whole load of skills, mostly around data analysis and data management,
12:21
that we do not teach people. So if software is so important, why is it so hard to reuse? Because the other question that we get asked is, well, we think this is a good idea to do this, but no one else seems to be doing it. Why should we spend the time when no one else is? And there are lots of different reasons.
12:42
Victoria Stodden in the US did a survey of the machine learning community for both code and data sharing. And a lot of the things that you see there are probably the ones that you know yourself are the ones that you kind of worry about. So it's things like the time it takes to document and clean up your software.
13:03
Or this one I find really great. The second one is dealing with questions from users. And you can always flip questions like that around. So dealing with questions from users could be re-expressed as starting collaborations with great new collaborators.
13:22
So there is this thing of a lack of incentive for sharing code and making code reusable. And all of my slides are online as well. I put up a Figshare link, and there's a tweet that's gone out which has the link to the slides as well. The other thing that is a problem
13:41
is something that's happening more recently and I think is a real issue for science. And that is expressed in this kind of statement here. And I'm just going to highlight the last bit. So this is the example of someone who has shared their code and then had their peers criticize them for doing it
14:01
because the code was not necessarily great code. It was fine code. But the problem is that nowadays, because it's all out in the open, potential employers can see this as well. And they might not ever look at your code base. They'll only look at the comments that are on that code. And this is a problem because basically we are now
14:23
in a kind of culture which is all about competition. There are not enough jobs, so people compete. This also means, though, that the competition can get very vicious for different reasons. There are possibly other reasons for things like this. There is a lot of work that is needed
14:42
to understand diversity and inclusion. So the other thing, which I've deliberately not put on this slide, is that this is the experience of a female coder, not a male coder. But we have this problem. So there is no incentive to share code because even if you do share your code, you might end up with a whole load of people
15:01
criticizing you for very minor things. So those are the problems. We've ended up with a research culture where you don't share your code because there's a fear of being found out for poor software engineering skills. There's no reward for publishing code. As has been mentioned in many of the other talks,
15:22
there's no incentive to actually get a good software project out there because if it comes to a promotion committee, they'll just go, well, how many papers have you published? What is your impact factor? And a lot of people fear being scooped, of someone stealing their work
15:40
and getting the publications that they think they should have got. And the other thing is, and it was great to see a talk on copyright and licensing, many organizations do not understand how to exploit open source licenses. I know that my university is only just starting to understand how to use open source licenses effectively
16:03
to exploit their intellectual property. So this is the main meat of the talk. So what can a selfish scientist do to get ahead instead? And I'm going to give five basic guidelines and I hope that you're basically just nodding along and going, yeah, I do this.
16:22
Because I don't think any of them are particularly spectacular. None of them are new. All of them are hopefully very simple. But the problem is that quite often we don't think we have the time even to do some of these simple things. And the first one is just to improve your skills.
16:42
So we've heard about Software Carpentry. I'll also mention Data Carpentry, which is the equivalent for data analysis and data management skills. But the point is that these are jumping off points. The idea is to continue to learn. I now point people at this.
17:01
So a few years ago we wrote a paper called Best Practices for Scientific Computing, which is a great paper, I think, for this audience because it brings together a lot of references to software engineering studies that show what practices actually work. So it's all based on evidence-driven research.
17:21
The problem is that almost all scientists will not be able to apply the best practices because they are either slightly too time-consuming or they do not have the experience to do this. So a subset of people from that paper wrote this thing called Good Enough Practices in Scientific Computing.
17:40
And that is the paper that I would suggest you give to all your colleagues. I'll give an example of one of the things that they talk about to do with revision control on the next page, but it is important to remember that not all of the people you may work with necessarily have a good background in this area.
18:01
So whilst you will be very happy with best practices, most people might be happier with good enough practices. And the whole point of this is, as I mentioned, that there's this kind of skills difference. Really what we're trying to do is improve the efficiency and effectiveness of your research. And so by continuing to learn new skills,
18:21
whether it be in new technologies or new techniques or in things like information visualization, all of these help you get across your research better and help you do your research quicker. I myself am trying to learn how to do better data analysis using pandas and data frames,
18:42
which I am completely failing to do just now, which means that I too am someone who would benefit from this training. And all of this is best practice for scientific computing. So the second tip is keeping things tidy. And here's a good example of the difference between what, for instance, we tell people
19:02
to do in good enough practices versus best practices. So in best practices, it talks about using revision control systems like Git or Mercurial or SVN. In good enough practices, it mentions the fact that you should just be understanding what things are different versions,
19:21
even if that is simply by having a good naming scheme for your files. Because some people find revision control systems really hard, particularly Git. How many people find Git hard? Yeah. So what we try and persuade people to do is use version control of any sort,
19:41
because any sort of version control is better than nothing. Most people we talk to nowadays will be using something like Dropbox or Google Drive as their first attempt at version control. And then as they go a bit further on, they'll start using things like GitHub because of the other things that infrastructure
20:01
and platforms like GitHub provide them. And we're trying to persuade them to do it for everything. So not just for their software but their data, their papers, their talks. And the other thing is making sure that that's backed up because one of the great sellers of version control and revision control systems is getting back to previous versions of your work.
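As one way of picturing that "good enough" alternative to a full revision control system, here is a hedged sketch of a consistent, sortable file-naming scheme; the date-plus-version pattern is an illustrative choice, not something the talk or the paper mandates.

```python
# "Good enough" version control: keep dated, numbered copies of a file
# under a consistent, sortable naming scheme. The pattern is illustrative.
import datetime
import pathlib
import shutil

def save_version(path):
    """Copy a file to name_YYYY-MM-DD_vN.ext, incrementing N as needed."""
    src = pathlib.Path(path)
    today = datetime.date.today().isoformat()
    n = 1
    while True:
        dest = src.with_name(f"{src.stem}_{today}_v{n}{src.suffix}")
        if not dest.exists():
            shutil.copy2(src, dest)  # keep the original; file away a copy
            return dest
        n += 1

# save_version("analysis.py") might create analysis_2016-05-30_v1.py
```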
20:22
And one of the things we would like to persuade people to do is check that their backups work for all of the stuff that they put into version control. The other thing is to get into a state where they work with things like tidy data and tidy code. So this is the idea that
20:42
you're trying to get your data and your code into a form which makes them reusable. So it makes them much easier to use with different tools, and much easier for you yourself to reuse. And here's where the selfish part comes in. Really what we're talking about is a whole set of practices which are useful for that particular researcher.
21:04
We don't really care if they're not sharing with other people at this point. All we're trying to do is make sure that they are not losing out and as a byproduct, as a secondary symptom almost, it makes it easier for them to share and conduct research with other people as well.
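To show what tidying buys you in practice, here is a small sketch using pandas (the library mentioned earlier in the talk); the table itself is invented, and "tidy" here simply means one observation per row.

```python
# Tidying an invented "wide" table with pandas: afterwards each row holds
# exactly one observation, which groupby, plotting and statistical tools
# can consume directly.
import pandas as pd

wide = pd.DataFrame({
    "site": ["A", "B"],
    "2014": [3.1, 2.7],   # one column per year: awkward to reuse
    "2015": [3.4, 2.9],
})

tidy = wide.melt(id_vars="site", var_name="year", value_name="reading")
print(tidy)
```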
21:24
And an example is the LIGO notebook. So if you want to see an example of how to tidy things up and share it well, I would like you to go and see that. I won't explain that in much detail. Third thing, release early and release often.
21:41
So research is an iterative process. A lot of people see it as just being the publishing of papers, but if we see it actually as a career, what you're doing is publishing many papers as an ongoing set of, I guess, outputs in a research theme. And releasing your software and data forces you to check and clean them.
22:02
It kind of makes sure that your tidying does the right thing. And one of the things we say here is even if you're not wanting to go completely open from the start, make sure you're releasing early to a trusted colleague so that it's not just you looking at your software. So even if you are not making it available outside,
22:22
make sure someone else is looking at it. But of course, open has many benefits. And here, really, we're looking at persuading people to make their research more reproducible, but mostly by their own team. So one of the biggest problems we have is with, for instance, PhD students who don't show their code
22:43
to their supervisors or to other members of their team until the point where they get a review. And that's too late, really. We want them to share their code earlier. And there are many reasons why you should do this that sort of say, actually,
23:01
that makes you have a higher scientific impact. Fourth one is to get credit for everything. We've already had a lot of talks on this, so I'm not going to dwell on this. But the main thing is make sure that your work is easy to cite and reference. And the important thing here is to provide good enough metadata so others can actually find your work
23:21
and credit you for it. Because if you want to get credit for everything and further your career, it's no use if no one understands what your work is or where it is or how to cite it. So that increases your visibility and your reputation.
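As a hedged illustration of what "good enough metadata" might look like, the sketch below writes a small machine-readable record alongside the code; the field names are illustrative rather than any particular standard, and the values are borrowed from this talk's own record.

```python
# Write a minimal machine-readable metadata record next to your code so
# others can find and cite the work. Field names are illustrative, not a
# specific standard; the values are this talk's own metadata.
import json

metadata = {
    "title": "Software sustainability - guidelines for the selfish scientist",
    "author": [{"orcid": "0000-0002-8876-7606"}],
    "identifier": {"doi": "10.5446/31022"},
    "license": "CC Attribution 3.0 Germany",
}

with open("metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```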
23:42
I won't talk about it in detail here, but I'm the editor-in-chief of the Journal of Open Research Software. There are now many, many places where you can publish software or software papers or anything to do with software. And there's a large list at that link there, bit.ly slash software journals. And it's gone up a lot.
24:01
So when I started curating this list, there were about seven places you could publish software, and there are now over 80. So there's no excuse for not getting credit for your work. And then the last tip. Be your own user first. So try and find software that already exists and extend and modify it.
24:20
But most importantly, develop software that solves your own research questions. So don't try and create a product for someone else. Try and use your software with yourself as the main user. And use the different kinds of infrastructure to try and split apart and record your different roles
24:42
so that you can both be a user and a developer and a manager. And once it's working for you, find others like yourself to collaborate with, because this is basically the golden rule of startups. Come up with an idea that you would like to see happen. Work to make it work for a very small
25:00
and very specific audience, and then go global, because the only way, really, of making something that is ultimately reusable is by reusing it yourself. And here's a great example of creating a whole set of reusable resources that can be used by someone in a high school by bringing together open source software,
25:22
open data, and open computational resources to solve new problems. Okay, so I need to wrap up now. So I've kind of given five different tips for how selfish scientists can make their own work better.
25:41
But really, these are all tips that help drive software sustainability. So they're all about making things more available, more reusable, and more maintainable in the future so that software can be used to meet new needs on new platforms. And a lot of this is just driven by limited research resources.
26:02
So really what we're talking about is this quote of necessity being the mother of invention. So in some sense, what we're telling people is you don't need a lot of resources to do all of this. In fact, the fewer resources you have, the easier it is to follow these guidelines and have results.
26:23
I'll finish there with that set of guidelines, and I'm happy to take questions. Thank you. Thanks a lot.