Guix, toward practical transparent, verifiable and long-term reproducible research
Formal Metadata
Number of Parts | 542
License | CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/61907 (DOI)
Transcript: English (auto-generated)
00:06
So, sorry for the mess. It's a bit impressive, all these people and so on. I'm Simon, I work as a research engineer at the University of Paris, and I will be here
00:23
to present Guix, GNU Guix, as a way to do reproducible research. And there is a group, Guix-HPC, which tries to apply Guix tooling in scientific contexts.
00:44
So currently we are in a replication and reproducibility crisis: more than 70% of researchers are unable to reproduce the results of their peers, and more than half are unable to reproduce their
01:03
own results. So we have a big issue. There are many causes of this replication crisis, and maybe one solution is open science. So what does open science mean? And what does science mean? Science is a transparent and collective activity.
01:25
And what is a scientific result? A scientific result is some experiment, so producing experimental data, and then some numerical processing. To do that, today we have different ways, because we need to communicate. So we need
01:42
to write up results, so we need open articles to be able to read the results. We need to share the data, so we have open data. We need to share the source code. But there is something that we never discuss: all of that needs to be glued together, because there is numerical processing. So we need to glue everything together, so we need another one.
02:02
We need a computational environment, and this is really one of the issues: if this is not open, the rest of the stack fails. So that's the topic of today.
02:20
How do we manage this computational environment? So again, a result is a paper, some data, and an analysis. And there are some parts which are possible to audit. For example, a paper, you can read it. Data, you can read the protocol that generated the data.
02:42
An analysis, you can read the script. But there are some parts that are opaque. For example, the instrument, a telescope, a microscope: this is opaque. We don't know how it works. But there is something that depends on our collective practice as researchers, and this is something that we can act on to do better research.
03:04
So the question is how to eliminate at least this dependence and turn it into an auditable task, to be really transparent. So, yeah, from my point of view, computing is just similar to an instrument.
03:28
So we should apply the same strategy that experimentalists apply to any instrument. And computing is just an experiment, in fact.
03:40
So the challenges of reproducible science, from my point of view, are of two kinds. The first one is controlling the sources of variation. What is different between this and that? So between this computational environment and that computational environment. Because as with a telescope, for example, we want to know what is the difference
04:05
between this telescope and that telescope, to be sure that what we are observing is correct. So from the scientific method viewpoint, we need the computational environment to be transparent. And from the scientific knowledge viewpoint, what we are building together needs to be independent.
04:22
So what I'm observing, you should observe the same. And this observation should be sustainable as time passes. We should be able to observe the same thing. Otherwise, it means that maybe we missed something. So the big question today, in this kind of context, is:
04:41
how do we redo it later and elsewhere? So I did something on my machine and you have to do this thing on your machine, for example six months, one year or five years later, with a computer. And this is a big issue, and it is part of the reproducibility crisis in science, from my point of view.
05:03
So what is a computational environment? A computational environment involves various points. For example, what is the source code? Say I use Python and this script: okay, we have the source code of Python, which is in C,
05:22
and we have the source code of this Python script, okay? But the Python interpreter requires a C compiler, so we need tools for building. And my script, for example, needs some Python libraries, so we also need tools at run time.
05:41
And each tool raises the same questions. What is the source code? What are the tools for building? And so this is recursive. So this is a big issue. And answering all these questions is controlling the sources of variation. So the question is: how do we capture the answers to all these questions?
06:04
So the question is not new. We already have tools: package managers, modules, containers. So for example, with a package manager like APT for Debian or YUM for Red Hat, you can control this computational environment. But there are some issues. For example, how do you have several versions of OpenBLAS on the same machine?
06:23
It doesn't work really easily with Debian, or with YUM and so on. So there are fixes, but in practice it's sometimes difficult. So to fix this issue, you have environment managers like Conda, pip, Modules and so on.
06:42
But this is really difficult because, for example, in Conda, how do you know how it is built? What is inside what you install? So this is about transparency in science. Modules: who uses Modules on their laptop?
07:01
I think no one. And containers, Docker, Singularity or whatever, are a strategy which is generally based on the previous solutions. So in fact you have exactly the same problems as the previous solutions. It just helps to move stuff from one place to another, but it doesn't help to have the correct thing in the first place.
07:24
Guix, in fact, is all these three solutions glued together, so it tries to fix the annoyances of each to have something that works. So Guix is a package manager like APT, YUM, etc.
07:43
It's transactional and declarative. It means that you can roll back, you can have concurrent versions and so on. You can produce packs, which are Docker images, for example. You can produce virtual machines. Like Ansible, you can deploy on some machine.
08:01
You can build a complete distribution, and it's also a Scheme library, so you can extend Guix. So, okay, the talk is 25 minutes, so it's just kind of an aperitif before lunch. So I won't speak about all that because it's too much. I will just speak about how Guix helps in open research, from my point of view.
08:26
So as I said, it's really easy to try. You just have a script, and do take a look at it before installing. It's just a Bash script, but check it. And you can install Guix on any recent distribution, so it's really easy to try.
08:41
If you are running Debian, you can try Guix without installing the complete distribution. You can use Guix on top of any distribution, and it is really easy to try. Give it a try. So now, Guix is just another package manager.
09:01
So you have the same commands that you have in any package manager for searching packages, showing packages, installing packages, removing packages and so on. It's exactly the same as any package manager. But you have some more functionality, like transactions, so you can do two actions at the same time,
09:23
for example removing and installing in the same transaction, or you can roll back. So for example, you install something and you want to roll back, to uninstall this thing without breaking anything. So okay, this is another package manager, but is it really just another package manager?
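To make the idea of transactions and roll-back concrete, here is a minimal Python sketch. It is only a toy model under stated assumptions: the `Profile` class, its method names and the package names are hypothetical, and this is not how Guix is implemented. It only shows that a remove and an install can be applied as one step, and that every previous state stays available to roll back to.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    """Toy model of a transactional package profile (hypothetical, not Guix's code)."""

    # Each generation is an immutable snapshot of the installed package names.
    generations: list = field(default_factory=lambda: [frozenset()])

    @property
    def current(self):
        return self.generations[-1]

    def transaction(self, install=(), remove=()):
        """Apply removals and installs as one step, creating a new generation."""
        missing = set(remove) - self.current
        if missing:
            # Nothing has been modified yet, so the profile is never half-updated.
            raise ValueError(f"not installed: {sorted(missing)}")
        self.generations.append((self.current - set(remove)) | set(install))

    def roll_back(self):
        """Return to the previous generation, if there is one."""
        if len(self.generations) > 1:
            self.generations.pop()


profile = Profile()
profile.transaction(install={"python", "gcc"})
profile.transaction(install={"clang"}, remove={"gcc"})  # two actions, one transaction
print(sorted(profile.current))  # clang and python are installed
profile.roll_back()
print(sorted(profile.current))  # back to gcc and python
```

The design choice worth noticing is that a generation is never modified in place: a failed transaction leaves the current snapshot untouched, and rolling back is just switching to an older snapshot.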
09:41
So yeah, it's a command line. We install and remove without special privileges, so this is nice. It's transactional, so there is no broken state. We have binary substitutes, so we don't have to wait hours and hours to have our binaries. This is nice.
10:00
But what is really, really nice is declarative management. It means that everything is a configuration file written in Scheme, so you can declare everything, and you can produce an isolated environment on the fly. This is something that is really helpful, and you can also see Guix as a factory
10:22
for Docker images, for example. So okay, these are all interesting features, but why is Guix reproducible? Or what does it mean that it's reproducible? For reproducibility, we need to talk about what a version is.
10:41
So what is a version? Say for example I use GCC at version 11. Okay, nice. But what does it mean, concretely, that I use GCC at version 11? It means that you need GCC, the compiler, but you also need ld, which is the linker, from Binutils for example,
11:00
and the glibc library. But the compiler GCC needs, for example, MPC, which is a package that does I don't know what exactly, anyway. And you also need MPFR and so on. And you have this kind of graph. And we can ask the question: is it the same GCC at version 11
11:22
if we replace this MPFR at version 4.1 by MPFR at version 4.0? Is it the same GCC or not? Maybe not. And if it is not the same, maybe we will see a difference. How can we be sure that we are using the exact same GCC?
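One way to make this question precise is to give each node of the graph an identity that depends on its own version and, recursively, on the identities of its dependencies. The following Python sketch illustrates that idea only. The graph is heavily simplified, the version numbers other than GCC 11 and MPFR 4.1 / 4.0 are made up, and the `identity` function is a hypothetical helper, not the hashing scheme Guix actually uses.

```python
import hashlib

# Hypothetical, heavily simplified graph: name -> (version, direct dependencies).
GRAPH_A = {
    "gcc": ("11", ["binutils", "glibc", "mpc", "mpfr"]),
    "binutils": ("2.38", ["glibc"]),
    "glibc": ("2.35", []),
    "mpc": ("1.2", ["mpfr"]),
    "mpfr": ("4.1", []),
}
# Same graph, except that MPFR 4.1 is replaced by MPFR 4.0.
GRAPH_B = {**GRAPH_A, "mpfr": ("4.0", [])}


def identity(graph, name):
    """Hash a node's name and version together with its dependencies' identities."""
    version, deps = graph[name]
    digest = hashlib.sha256(f"{name}@{version}".encode())
    for dep in sorted(deps):
        digest.update(identity(graph, dep).encode())
    return digest.hexdigest()


# Both are "GCC at version 11", yet their identities differ.
print(identity(GRAPH_A, "gcc") == identity(GRAPH_B, "gcc"))
# glibc does not depend on MPFR, so its identity is unchanged.
print(identity(GRAPH_A, "glibc") == identity(GRAPH_B, "glibc"))
```

With such an identity, the version label "11" is no longer what names the compiler: a change anywhere below a node changes the identity of everything that depends on it.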
11:43
So this is just an extract of the graph, because the graph has roots and it can be really large, and maybe we could also talk about what the roots of this graph are. But this is another talk. So when you say that,
12:00
I need to have a version. So what is my version in Guix? So guix describe shows the state of Guix. So in fact guix describe shows a version of Guix. And what it does, in fact: it pins the complete collection of all the packages
12:21
and Guix itself. And because of that we are able to freeze the complete graph. We can move this graph from one place to another. So the nodes
12:42
in this graph each specify a recipe, and this recipe defines the source code, the build-time options and the dependencies. And this graph can be really, really large. For example, for SciPy, which is a scientific Python library, there are more than 1000 nodes,
13:00
so it can be really large. When I say GCC at version 11, it means one fixed graph, and providing the state captures this complete graph,
13:21
and I can reproduce this complete graph on another machine. So this is collaboration in action. Alice describes the list of tools in a manifest, in a declarative way. She generates the environment with guix shell, providing the tools. So this creates an environment
13:41
containing the tools that are listed in the manifest file. Okay, this is nice. But now she describes the revision of Guix, so she runs guix describe, and this fixes the state of Alice. So okay, Alice is working on her laptop,
14:02
but collaboration means sharing this computational environment. So it's about sharing the state. To share the state, you need to share one specific graph. To share this graph, you only need to share these two files, and if, sorry, if Blake
14:22
has these two files, Blake can create the exact same computational environment as Alice. So you have guix time-machine: you specify the state of Alice, then shell, and you specify the tools that Alice used. Blake and Alice are
14:40
running the exact same computational environment. And for example, if you have Carol, who has these two files, she also can reproduce the exact same environment as Alice and Blake. So in fact, you only need two files, and with these two files you can reproduce everything from one place to another.
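The claim that two files are enough can be pictured as a pure function: the environment depends only on a pinned revision of the package collection and on a list of tools. The Python sketch below is an illustration under assumptions. The `CATALOG` history, the revision names and the version numbers are all invented, and `environment` is a hypothetical stand-in for what the real tooling resolves.

```python
# Hypothetical history of a package collection: revision -> {package: version}.
CATALOG = {
    "rev-2022": {"python": "3.9", "scipy": "1.7"},
    "rev-2023": {"python": "3.10", "scipy": "1.10"},
}


def environment(revision, manifest):
    """Resolve the listed tools against one pinned revision of the collection."""
    packages = CATALOG[revision]
    return {name: packages[name] for name in sorted(manifest)}


manifest = ["python", "scipy"]  # file 1: the tools Alice listed
alice_revision = "rev-2022"     # file 2: the state Alice recorded

alice = environment(alice_revision, manifest)
# Blake's machine is at a newer state, so the manifest alone is not enough.
blake_default = environment("rev-2023", manifest)
# With both files, Blake temporarily goes back to Alice's state.
blake_pinned = environment(alice_revision, manifest)

print(blake_default == alice)
print(blake_pinned == alice)
```

The manifest alone says which tools are wanted, and the recorded state says which graph those names refer to. Sharing only one of the two leaves the environment underdetermined.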
15:03
So in fact you have this kind of picture: Alice, Blake and Carol are in different time frames, but they can jump from their own time frame, virtually, to the same place. Their machines are in different states, but they can temporarily
15:21
go to another state to create the computational environment. To make this work as time passes, you need to preserve all the source code. And this is not straightforward; it's not trivial to preserve all the source code. And you also need some backward compatibility of the Linux kernel and some
15:41
compatibility of the hardware. And when these three conditions are satisfied, you have reproducibility. But what is the size of the time window where the three conditions are satisfied? This is, from my point of view, unknown. And Guix is, to my knowledge, a quasi-unique experimental tool
16:01
for this, because we have the tooling to do all that, and now we can find out over what span we are able to reproduce the past in the future. So what is Software Heritage? Software Heritage is an archive. It collects and preserves software in source code form
16:21
for the very long term. And Guix is able to save the source code of a package and the recipe of the package itself. And Guix itself is also saved in Software Heritage. And Guix is able to use the Software Heritage archive as a fallback if upstream disappears.
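The fallback idea can be sketched in a few lines of Python. This is a hedged illustration, not the real mechanism: the in-memory `upstream` and `archive` dictionaries, the example URL and the `fetch_source` function are hypothetical, and looking content up by a SHA-256 checksum is a simplifying assumption rather than the actual Software Heritage interface. What it shows is that when the recipe records a checksum of the source, the source can be accepted from any location that serves matching content.

```python
import hashlib


def sha256(data):
    return hashlib.sha256(data).hexdigest()


def fetch_source(url, expected_sha256, upstream, archive):
    """Try the upstream URL first, then fall back to an archive keyed by checksum.

    `upstream` maps URL -> bytes and `archive` maps checksum -> bytes; both are
    hypothetical in-memory stand-ins for network services.
    """
    data = upstream.get(url)
    if data is not None and sha256(data) == expected_sha256:
        return data, "upstream"
    data = archive.get(expected_sha256)
    if data is not None and sha256(data) == expected_sha256:
        return data, "archive"
    raise LookupError(f"no source matching {expected_sha256} for {url}")


tarball = b"source code of the tool used in the paper"
checksum = sha256(tarball)
url = "https://gitlab.example.org/postdoc/tool.tar.gz"  # made-up URL

archive = {checksum: tarball}  # the archive saved a copy earlier
upstream = {}                  # the account was closed, the URL is dead

data, origin = fetch_source(url, checksum, upstream, archive)
print(origin)
```

Because the checksum identifies the content itself, the caller does not need to trust where the bytes came from, only that they hash to the recorded value.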
16:41
So you have a postdoc working on some GitLab instance, and the account is closed because the postdoc is moving to another place and so on. And now you have this paper with this URL to the GitLab project and, oh no, it doesn't work because the account is closed. If you
17:01
were using Guix, transparently it can check whether the source code is in Software Heritage. And this really raises a good question about how to cite software: do you cite only the source, and what about the dependencies and the build-time options and so on? How do you cite software?
17:21
And how do you cite it? Do you cite it with an intrinsic identifier like a checksum, or with an extrinsic identifier like a version label? This is not easy. So in summary there are three commands. I'm almost done, right? So in summary you have three commands.
17:41
And these three commands, which are guix shell, guix time-machine and guix describe, help you to have a computational environment that you can inspect and collectively share. So if you have this and two files, the manifest
18:02
and channels files, you are reproducible over time. So okay, for offline, because I convinced you that it's cool: here are some resources to read offline.
18:22
Guix-HPC is a group of people trying to apply this Guix tooling to scientific research. And we are organizing Café Guix, where we drink coffee and speak about Guix. There is
18:41
an article trying to explain this kind of vision of what Guix could provide for open research. And for French speakers there is a one-hour tutorial, so yeah. And
19:00
now Guix is ten years old, so it's kind of ready. So we organized a ten-year event, where there is some really nice material about Guix. And Guix is not new at FOSDEM. So here all the numbers are linked to the previous
19:20
presentations. As you see, there have been 31 presentations about Guix at FOSDEM. So you have a lot of material about what Guix can do for your job, your task. It runs in production on big clusters, but also on a lot of laptops and desktops. And here, for example, are
19:40
papers in medical and biomedical research using Guix as tooling, as I presented, with guix shell, time-machine and so on. So open science means being able to trace and being transparent, because it's about being able to
20:02
collectively study, bug for bug, what is different from one thing to another. And this is the scientific method, and we have to apply the scientific method to the computational environment. This is my opinion and the message that I would like you to bring back home. And if we have Guix, we can do that by controlling the environment and comparing
20:22
two different environments to know what is different. So okay, this is the kind of thing we are trying to do with the Guix project. So thank you, and I'm ready for questions.
20:44
Yeah. So we have five minutes for questions and switching speakers. Please take questions and do repeat them for the stream. Thanks. I will try to do my best.
21:09
Lobbying? So the question is: okay, I don't have the root privilege to install Guix on the cluster. Because once Guix is installed on any
21:21
cluster, you can run it without privilege. But the first time, to install Guix, you need root privilege. And the system administrator of my cluster doesn't let me. Yeah, I need to convince him. So maybe the answer is to say other people are already doing that, so it's not,
21:41
I mean, to reduce the fear of providing a new tool. This is what I would try: to say, okay, these people are doing it, so maybe it's not so scary. I think it was after, so yeah. You mentioned that you are not sure how big the time window is,
22:03
five years, ten years. So the question is: okay, what is the size of the window, and can we go back five years from now into the past? The issue is that the mechanism
22:20
for going back in time, for traveling in time, in Guix was introduced in 2019. So in fact with Guix we don't have the tooling to go back earlier. So now the zero for Guix is version one, so it's
22:40
2019. Is it possible to use all this stuff even though Guix can't really run on macOS? So Guix cannot run on macOS, but we can ask the question: is it transparent if we are running on macOS?
23:02
So are we applying the scientific method if we are running on macOS? I mean, I don't have the answer; it's a collective decision. Yeah. Hello, my name is Ivan
23:23
the same approach as Guix. Yeah. So I've never used Guix before, but I have some experience with Nix. Is there any crucial difference between them? So from my point of view,
23:41
sorry, in the slides there is an appendix, so there are extra slides, and there is one extra slide trying to explain what, from my point of view, the difference with Nix is. So the question is: what is the difference between Nix and Guix? Because Guix uses
24:01
exactly the same functional package management strategy. So what is the difference? From my point of view, the difference is that you have a continuum in Guix, in the language. The packages are written in Scheme, and the code of Guix itself is also written in Scheme,
24:21
the configuration files are written in Scheme, so you have one big continuum across everything, and because of that you can extend Guix for your own stuff. So for example you can write a package transformation on the fly, using Guix as a library. You cannot do that with Nix, because
24:41
you have a lot of different tooling in C++ and so on. From my point of view, it is this unity, the continuum of the language. A general-purpose programming language, yeah, yeah, but Scheme allows you
25:00
to write kind of domain-specific languages. It's a good language for writing domain-specific languages. So in fact you have the best of both worlds, from my point of view. Oh yeah, sorry, last question, yeah.
25:25
No, yeah, so this is when you are running Guix, for example, on top of Debian: how do we manage the graph, and can we cut the graph to reuse a part from Debian,
25:41
I mean, a part of the graph from Debian? So the question is whether it could be helpful for some packages. But when you do that, you are not able to manage the computational environment. Because, for example, if I cut the graph at Debian,
26:01
so I have a state in Debian with some packages, I cut the graph at some place to use these Debian packages. If I do that, how can my collaborator cut the graph in the same place with the same Debian packages? So this is a kind of reproducibility issue. So from a practical point of view
26:21
it could be nice, because for example Debian has some machine learning packages that are not yet in Guix, so maybe we can reuse some part. But from a reproducibility point of view, you lose the property of being able to move from one place to another. Thank you. Thank you.