Transforming scattered analyses into a documented, reproducible and shareable workflow
Formal Metadata
Number of Parts | 490
License | CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/46922 (DOI)
Transcript: English (auto-generated)
00:14
So, hi, I am Sébastien Rochette, and I will talk to you about Transforming Scattered Analyses
00:21
into a Documented, Reproducible, and Shareable Workflow. This is feedback from a conference that I did a few months ago, so I will explain what we did together during this hackathon to do that. So, yeah, I'm working at ThinkR.
00:42
I am a data scientist, an R expert and an R trainer, and you can follow us if you want; everything is there. Also, what I didn't say is that this presentation is already on my GitHub, so if you want to follow along with the PDF, you can find it on my GitHub account.
01:03
So, this collaboration fest was about the R software. I'm not sure that everybody knows R, but it's software for doing data science. This was in Concarneau, in France, in Brittany, this nice place, and it was with different researchers, in
01:24
ecology mainly. The aim of this collaboration fest was to speak about R, so we were all there because of this software. We had three days together, and the first part was some presentations about good practices with R: how to build packages, how to use
01:46
Git to work together, and how to put all your packages and work inside a Docker container to share it in a reproducible way. At the end, you could also learn how to work with the Galaxy user interface, which is a graphical interface to share
02:05
your research work with the planet, if you want. So we had all the steps, from writing scripts in a reproducible way to sharing
02:21
this package, through Docker and through a UI like Galaxy. And what we learned around that was how to share your work efficiently. So the FAIR principles were made for data; Mateus already spoke a few hours ago
02:44
about the FAIR principles, which are findable, accessible, interoperable, and reusable. They are applied to data sets. Findable means that you can find where the data is, or a computer can find it. Accessible means that when you find the data, you know how you can access it,
03:06
and if there is a password on it, that should be written somewhere. Interoperable means that you can use the data inside different workflows, so the format of the data can be used in other software or somewhere else.
03:28
And reusable is more linked to reproducible: you can combine the data with other data sets, but you can also reproduce the data because you know how it was
03:40
built before. And if you want to add some new data to do a meta-analysis, for instance, you have the clues on how to go into the field, add new data to this and continue to work with it. So these are the FAIR principles for sharing data, and I think that to share the code
04:01
and the software that you have, we can use the same ideas around FAIR. So I divided that into five categories, not only four: accessibility, the same as here, reproducibility, documentation, readability, and communication.
04:23
So I will explain in the different slides what these five steps are, but we tried to apply them inside our group in the hackathon to go from just coding to sharing the work
04:45
that you do. But for researchers who work alone or with two people in the research lab and only do some coding, you have plenty of code stored on your computer, and
05:01
how do you share all this code with the community, and how can you do it in a way that others can use it again? So this is a big step, and for that, I would recommend not staying alone in front of this big step and trying to go with some friends, because alone it is quite difficult.
05:24
So during this hackathon, we decided to work together on a common project. One researcher was there with a project called Vigie-Chiro. It's a project that looks at the distribution of bats in France.
05:42
So we have some data about the distribution and the observation of bats. We have some environmental data: temperature, weather, whatever, in France. And with this data, they do some modeling to try to model the distribution of bats in France.
06:03
So they have a workflow for this work that they wanted to share and continue to work on. And for that, they have a lot of R scripts. Of course, I said it was about R, so this is about R. So we decided to work together with the different developers who were there to try to figure out how to
06:28
share this kind of work and how to apply the good practices that we presented the day before. Going back to these five steps, starting with accessibility: the work was already available on GitHub.
06:46
So the researcher put R scripts on GitHub. Here you have only a part of the R scripts that were available at the root of the repository. The biggest part is hidden inside a folder that we chose to leave aside because it was too much for us.
07:04
But at least we had something to work on, and everybody could clone the GitHub repository to have access to these R scripts and be able to work together on that project. So the first point was already okay. The main part of this work, I mean, the difficult part of this work, was maybe the
07:24
management of the project, which means how to teach people who have never used this kind of tools together, I mean, to work in collaboration, to go from these scripts to a well usable piece of software.
07:44
So first, we had to dissect the work. So maybe you don't see it, but here we took a whiteboard to write all the functions, all the different parts of the project, what this script does, what this script does, and what this one. Can we link them together?
08:02
Is there something that comes first, something that comes at the end? And what do we have between these different steps? So here we identified that between the different steps, we could have some input data and output data that can be used for the next step. So the researcher had to show us all his scripts and also be able to give a picture like
08:33
this about his work. And it seemed to be easy for him to do it. But at the end, when we asked him how was it for him, he said, I felt totally naked
08:45
to present all my code to some strangers, some people I don't know. I mean, it was easy for me to share the code on GitHub, because you don't know if somebody will see it or if they see it on the other side of the planet, so you don't care. But here you had 10 people looking at you saying, OK, tell me what you did, and why
09:03
do you put that in there? So it's a really difficult part. But I think that the people who were there were really welcoming and friendly, which is kind of a particularity of the R community.
09:22
In the R community, I mean, a lot of work is done to welcome anybody, whatever your origin, whatever your gender. Maybe you have heard about R-Ladies, which inspired PyLadies.
09:40
These are groups that help put in front of the screen people who are not usually seen there, unlike white men with a beard. We are not only white men writing code; there are other people. So the R community is very inclusive in that way. And I think that having this in mind also helped us to be friendly with
10:07
this researcher and to share the work and work together. So we separated the work into small pieces, and we opened issues on the GitHub repository to say this part of the code does that, and I would like you or somebody to work
10:24
on it to make it nicer, or to work on this part. We dispatched the different issues to one or two developers each so that everyone could work on their little part, because here the workflow was quite easy.
10:42
As soon as we had some data sets that were available to do the different parts, it was quite easy to say, yeah, you can work on this small part because you have some input data, and you have some output data, and you know how to go from this point to the other point. So, dissecting into parts.
11:01
And then, of course, you have to manage a repository and explain how you collaborate inside a Git repository, how you deal with pull requests and everything like this. But I took care of this part this time. So the recipe, I mean, for this work and how to share this work, is first to carefully
11:27
peel back the code. So we had to identify inside the code whether there were some user-specific pieces. I mean, if inside your code you say, yes, this data is on my computer at C:, dot, dot,
11:42
Documents and Settings and my name, nobody will be able to use it again. So you have to go through it and find all the parts that could be removed, or at least be parameterized and put at the top of your script, so that you let the user define where the data is on their own computer.
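As a rough illustration of the parameterization step described here, the sketch below moves a user-specific path out of the analysis code and into a single setting at the top of the script. It is written in base R; the function name `read_observations`, the file name `observations.csv`, the column names and the environment variable `ANALYSIS_DATA_DIR` are all hypothetical and do not come from the project discussed in the talk.

```r
# Hypothetical example: none of these names come from the project in the talk.

# Before (not portable): the path only exists on the author's machine.
# observations <- read.csv("C:/path/on/my/computer/observations.csv")

# After: the data location is defined once, at the top of the script.
# Users edit this line or set the (assumed) environment variable.
data_dir <- Sys.getenv("ANALYSIS_DATA_DIR", unset = tempdir())

# The rest of the code only receives the directory as a parameter.
read_observations <- function(data_dir, file_name = "observations.csv") {
  path <- file.path(data_dir, file_name)
  if (!file.exists(path)) {
    stop("Data file not found: ", path, call. = FALSE)
  }
  read.csv(path, stringsAsFactors = FALSE)
}

# Small demonstration with a throwaway file so the sketch runs anywhere.
demo_dir <- tempfile("demo_data_")
dir.create(demo_dir)
write.csv(
  data.frame(site = c("A", "B"), n_contacts = c(3, 5)),
  file.path(demo_dir, "observations.csv"),
  row.names = FALSE
)
observations <- read_observations(demo_dir)
```

The point of the design is that a collaborator only has to change one value, instead of searching every script for hard-coded paths.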
12:01
And you can also cut the big pieces of code into small pieces so that it's easier to maintain or to see the goal of the different parts. With the R software, everything is about functions. So a function is: you take something as input and you have one thing as output.
12:20
So the parameterization is like the parameters of the function. So we had to turn the different scripts into functions, or smaller functions. And we use what we call a reprex, a reproducible example. This is the word that we use in R: you have one function, so you
12:42
have to show a small data set that can go into it and show what the output is. Anyone who uses the small data set with the same function should get the same output, without needing the big data set that you use for the entire analysis. And if you are comfortable enough, of course, you can add some unit tests on these
13:05
functions, but maybe that can come later if you are just starting to learn how to write a function. So that was the first part of this recipe. The second part is: document generously. We already spoke about notebooks. So in R, we have these notebooks that we call vignettes.
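The reproducible-example idea from the first part of the recipe might look like the following minimal sketch: one small function, one data set small enough to paste into an issue, and a first check on the output. The function `count_contacts_per_site` and its column names are hypothetical. The check uses base R `stopifnot()` so the sketch runs without extra packages; inside a package, a testing package such as testthat would be the more usual choice.

```r
# Hypothetical function and column names, chosen only for illustration.
count_contacts_per_site <- function(observations) {
  stopifnot(all(c("site", "n_contacts") %in% names(observations)))
  # Sum the contacts for each site.
  totals <- aggregate(n_contacts ~ site, data = observations, FUN = sum)
  totals[order(totals$site), , drop = FALSE]
}

# The reproducible example: a tiny data set instead of the full one.
small_obs <- data.frame(
  site = c("A", "A", "B"),
  n_contacts = c(2, 3, 4)
)

result <- count_contacts_per_site(small_obs)
print(result)

# A first unit-test-like check: same small input, same expected output.
stopifnot(
  identical(as.character(result$site), c("A", "B")),
  isTRUE(all.equal(result$n_contacts, c(5, 4)))
)
```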
13:24
They are done with Markdown too. It's like the Jupyter notebook, but for R, so R Markdown. It lets you mix plain-text documentation with some R code that will be executed while building the package, so that it's also reproducible.
13:49
If you give the same notebook to somebody else to compile, it should be self-contained, and anyone can reproduce the different examples in it.
14:00
So in this notebook, you also put some reproducible examples, and because you already wrote reproducible examples for the functions, you can reuse them to show how these functions work and why they work this way. And in R, as soon as you build a function, you have to document the function too. So you say what this function does, what the parameters are, what you put in this
14:21
parameter, whether it is numeric, text, whatever, and what dependencies are needed to use this function. So indeed, when we do that, we transform the many scripts into an R package,
14:42
because this structure of having vignettes, having documentation for functions, and putting the functions in a specific directory inside your big directory is called a package, and the R community forces you to document this package completely.
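To make the function documentation described here concrete, the sketch below documents one small function: what it does, what its parameter is, what it returns, which dependency it needs, and an example. It assumes the roxygen2 comment convention (`#'` lines), which the talk does not name explicitly, and the function `standardise` is hypothetical. The comments are plain R comments, so the block runs as ordinary R code.

```r
#' Standardise a numeric vector
#'
#' Hypothetical helper used only to illustrate the documentation layout.
#'
#' @param x Numeric vector, for example temperatures in degrees Celsius.
#'
#' @return Numeric vector of the same length as `x`, centred on 0 and
#'   scaled to a standard deviation of 1.
#'
#' @importFrom stats sd
#'
#' @examples
#' standardise(c(10, 12, 14))
#'
#' @export
standardise <- function(x) {
  if (!is.numeric(x)) {
    stop("`x` must be numeric", call. = FALSE)
  }
  (x - mean(x)) / sd(x)
}

standardised <- standardise(c(10, 12, 14))
```

The `@examples` section can reuse the same small reproducible example that was written when the function was created.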
15:03
So from there, we had built a small package, and of course each of us had to continue to develop the different parts. After that, you can add the right amount of readability. For me, R is not a minified language, so it's not like JavaScript:
15:25
you can put a lot of air inside, and when you put some air, you can breathe when you read your code. You have to think about future you, who in six months will have totally forgotten what you put inside your code, and if you can read your code like you read a book, it's easier to go back into it.
15:44
In the R community there are some packages, like those in the tidyverse, that provide functions that are readable, so the code can be read by anyone. Even if you don't understand R code, even if you never developed in R,
16:06
you can understand the code because the names of the functions and the way you write it are readable by anyone, like an English text; of course you need to speak English. So think about future you, and think also about the other developers who will help you,
16:23
because if you want to share with the community, of course people will give you some feedback on it. And the last part is: communicate abundantly about your work. There are many different ways to communicate about it: you know, social media, I'm sure all of you, or almost, have a smartphone;
16:40
you can write a blog post about what you did, and indeed at the end of this hackathon I wrote a blog post. This one is in French, but the blog post presented what I present here today, so what we did to go from these small scripts to a shareable package,
17:03
and it was also a way for me to add a little more information for the developers, because at the end of these days they are left alone with their code, and nobody is there to help them anymore, so at least I added some clues inside this blog post to help them continue the work alone.
17:24
And you can also build a website. Something that is interesting with R and R packages is this package, this plugin, which is called pkgdown, and with pkgdown, with one line of code, you can build a website from your R package.
17:41
It uses all the different documentation parts you have put inside your package to show them as web pages, and if you combine your GitHub repository with CI, continuous integration, this website is rebuilt each time you push a commit.
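The "one line of code" mentioned here is most likely a call to `pkgdown::build_site()`, run from the root of the package. The sketch below wraps that call in a small guard so it can be run safely anywhere; the wrapper `build_docs_site` is hypothetical, and the code assumes that pkgdown is installed and that `pkg_dir` points to a package directory containing a `DESCRIPTION` file.

```r
# Hypothetical wrapper around the real pkgdown::build_site() call.
build_docs_site <- function(pkg_dir = ".") {
  if (!requireNamespace("pkgdown", quietly = TRUE)) {
    message("pkgdown is not installed; skipping the site build")
    return(invisible(FALSE))
  }
  if (!file.exists(file.path(pkg_dir, "DESCRIPTION"))) {
    message("No DESCRIPTION file found; not an R package directory")
    return(invisible(FALSE))
  }
  # The actual one-liner: builds the README page, the function
  # reference and the vignettes (articles) into a static website.
  pkgdown::build_site(pkg = pkg_dir)
  invisible(TRUE)
}

# Called on a directory that is not a package, nothing is built.
built <- build_docs_site(tempfile("not_a_package_"))
```

In a continuous integration setup, the same call would typically run after each push so that the published site stays in sync with the package.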
18:00
So this is the website that is built from the repository that we built. We have the first page, which is the README that you have in the GitHub repository; a second page, the reference, which is the list of all functions available inside the package, with the documentation of each function and how to use it, with examples inside; and you have the articles part,
18:27
which is the part where you have the vignettes, so all the notebooks. So the feedback I have from this experience is that for this kind of researcher who only
18:42
worked alone in the lab, the mentoring on this project was a good start because they would never go into such a project alone. I mean, at first when I presented what they could do with the scripts, going to the package and showing the website, they said: okay, but where do I start? I will never do that alone. That's why we decided to choose one of
19:05
their projects to say, okay, let's do it together and let's see how it works and how you could do it. As a researcher who presents the work, you have to accept the exposure. I mean, as I said, it's not easy to present your work to people you don't know, and what is really
19:25
inside your code, and to have to explain in front of them: yes, here I did this because I thought that this was interesting, but yes, you have another way of doing it, so let's go. You also need a welcoming community, because you cannot say, yeah, I do not code the
19:42
same way that you code, so let's break everything that you did and I will recode it my way, because then the researcher will be alone in front of scripts that he doesn't understand and cannot continue to maintain. So as the helper you also have to be welcoming and accept that you
20:04
help in a way the other researcher understands, so it is important to be in such a kind of group. At the end, the researcher will change his practices. This is for good, because as soon as you know the entire process and you did it together, you can apply it
20:27
every time you write new code. You say, okay, I have to think when I write my code that I will do it this way because the documentation is important and at the end I would like this nice website to share with everybody. So this is important, but the thing that was almost missing is the
20:44
follow-up, because I only wrote this small blog post and then I just went back home. I continue to speak with them by email, but it's not the same as being there and helping them on their own code, so maybe we can find some other way to collaborate.
21:01
So transforming scattered analyses into a shareable workflow is accessible to people with a little help to start, and as soon as you know how it works, you can do it alone and continue as you wish. And I would recommend to anyone, whether it's for R or any other language: start
21:21
with the documentation, because you have the thing in mind, you know what this function or this part will do, so write it down, don't keep it in your mind. As soon as it's written, you can say, oh yes, my function is written and I can share it directly because I already wrote the documentation, so it will be easier for you in the future. Thank you very much.
21:51
So two questions that are maybe linked. Can you say a few more words about reprex, which I don't know? And the following question is: you don't mention in your guidelines
22:05
for reproducibility test-driven development, basically, or ... Yeah, so the first question, about what a reprex is: it's a contraction of reproducible example.
22:20
Okay, there is a package in R which is called reprex, which helps you make reproducible examples. A reproducible example is like this: you have a data set and you fit, I don't know, a model on it, and you have some outputs, but the data set is very big, and you need to spend two hours to get the outputs. If you want somebody to help you debug this function,
22:43
you will not give the big data set, so the reproducible example is the smallest example you can prepare so that people can reproduce the output from the small input, and can also reproduce the errors and the warnings that you have in your function, so that they can help you and say,
23:01
yeah, I can reproduce it in a few seconds, give me this small part of code, and then I can help you debug this. This is a reproducible example. I didn't talk about test-driven development. I think test-driven development can be problematic. You have to be very careful
23:23
about that because you can think, okay, before writing my function, I write what the output should look like. This part is interesting, but if you write a lot of tests, you will write your
23:40
function so that the tests pass, but maybe not, I mean, I'm not sure I can really explain it in a few words, but it's quite dangerous to do that. It's like you go to school and say, I would like my students to be very good at addition because
24:01
they will be tested on addition, so you don't teach multiplication and you don't teach division. You spend a lot of time on additions. In the end, yes, they have a 100 percent success rate, but what about the rest? Was what you didn't think about also important? Test-driven development can be correlated. You say that test-driven development
25:05
is an iterative process. When I imagine test-driven development, it's like you write the tests you want to have, and then you write the function so that it succeeds in these tests. But the iterative process, of course, exists because you use the function and at some point
25:24
you will have some new data sets and the function will not work. So you say, okay, this should work. So you write the test like, I would like the output of this process to work, and then you add a new test in your code so that the next time you see this
25:41
function, so you correct the code, of course, and the next time you use this function, you make sure that it will work. So the test is also here to verify that when you change the versions of your code, when you modify different parts in other places of your software, the tests still pass, because you verify that all the different kinds of data you used
26:05
continue to work. This is the iterative part, but for me it's not what I call test-driven development, but maybe it's just a semantic thing. Thank you.
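The iterative use of tests described in this answer can be sketched as follows: a function fails on a new kind of data, the code is corrected, and a new check is kept so that later changes cannot silently break that case again. The function `mean_contacts` and the data are hypothetical, and base R `stopifnot()` stands in for a testing package such as testthat.

```r
# Hypothetical function: mean number of contacts per night.
mean_contacts <- function(n_contacts) {
  # Correction added after a new data set with missing values
  # made the original version (without na.rm) return NA.
  mean(n_contacts, na.rm = TRUE)
}

# Check written when the function was first created.
stopifnot(isTRUE(all.equal(mean_contacts(c(2, 4, 6)), 4)))

# Check added when the new data set exposed the problem. It stays in
# the test suite and runs again on every later change to the code.
stopifnot(isTRUE(all.equal(mean_contacts(c(2, NA, 6)), 4)))
```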