Sharing Reproducible Python Environments with Binder - TIB AV-Portal

Formal Metadata

Title

Sharing Reproducible Python Environments with Binder

Title of Series

EuroPython 2020

Number of Parts

130

License

CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this

Identifiers

10.5446/50082 (DOI)

Content Metadata

Subject Area

Computer Science

Genre

Conference/Talk

Abstract

As reproducibility gains traction in the data science and research communities, the need to package code, data and the computational environment is growing. There are many tools that address different aspects of this type of packaging, such as Jupyter Notebooks for literate programming, Docker for containerising and porting computational environments, and so on. But they represent barriers to reproducibility as each one requires time and effort to learn. Project Binder integrates Notebooks and Docker for generating reproducible computational analyses and combines them with a web-based interface and cloud orchestration engines. This means that analysts do not have to worry about all the moving parts so long as they have followed basic software best practices: their code is version controlled and they've captured the dependencies the analysis needs to run. Binder then hosts the compute in the cloud and makes it easily shareable by providing a unique URL to the code repository, without imposing additional overheads on the analyst. During this talk, Sarah will introduce Binder (the service), BinderHub (the technological infrastructure) and mybinder.org (a public instance of a Binder service, free for anyone to use) and demonstrate how it can be used to share Python environments and analyses.

Transcript: English (auto-generated)

00:06

And our next speaker is Sarah Gibson from the Alan Turing Institute. She is a research software engineer there, and she solves real-world problems using cutting-edge academic research.

00:25

And she will talk about Binder, reusable Python environments with Binder. Welcome, Sarah. Hi, thank you very much for that introduction. I'm Sarah Gibson. If you don't know who I am, I am a research software engineer at the Alan Turing Institute in London.

00:45

I've been using Python now for about five years since I started my PhD in astrophysics. I'm also a maintainer and an operator for Project Binder, which we're going to talk about today. And the primary use of Binder is that it makes it really super easy to share reproducible computational environments.

01:08

And I'm going to show you what that means for Python today. But I just very quickly want to start by defining what I mean by reproducible. In pretty much whatever field you go into, whether it's data science or a specific domain,

01:28

reproducibility has a different meaning. So I'm just going to quickly run through this slide so it's absolutely clear what I mean when I say reproducible. So I'm looking at this top right quadrant of reproducible,

01:41

which is if you gave me the same data and the same analysis pipeline that you were running, I should be able to get the same answer as you. This is like the lowest bar of reproducibility. But if we can conquer that, we can open up these really interesting other dimensions of reproducibility. So if I then run a different data set through the same analysis pipeline and it tells me qualitatively the same answer,

02:07

this would be a replicable analysis. If I run the same data through a different analysis pipeline and that qualitatively gave me the same insights, this would be a robust analysis. It's a really interesting dimension because it's looking for methods that should do the same thing.

02:25

And if we combine all of that together, we get a generalizable analysis. And this is not generalized because it doesn't apply to all of the people doing the analysis. But it is two steps towards being able to make inferences about the data to the broader population.

02:40

And the results of this type of analysis are not specific to one data set or one particular methodology. But we can't achieve any of that unless we can get the same answer out from the same data in the same analysis pipeline. But I'm actually being even more specific than that.
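As a rough illustration of the four quadrants described above, the sketch below maps "same or different data" and "same or different analysis" onto the terms used in the talk. The function name `classify_analysis` is made up for this example and is not part of any Binder tooling.

```python
def classify_analysis(same_data: bool, same_analysis: bool) -> str:
    """Return the talk's term for re-running work under the given conditions."""
    quadrants = {
        (True, True): "reproducible",     # same data, same pipeline
        (False, True): "replicable",      # different data, same pipeline
        (True, False): "robust",          # same data, different pipeline
        (False, False): "generalizable",  # different data, different pipeline
    }
    return quadrants[(same_data, same_analysis)]


if __name__ == "__main__":
    # A different data set pushed through the same pipeline.
    print(classify_analysis(same_data=False, same_analysis=True))
```

The lookup only names the quadrant. Whether the answers actually agree, exactly or qualitatively, still has to be checked by running the analysis.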

03:01

And I'm going to talk about what is actually called a repeatable analysis. And this is literally getting the exact same answer from the same data and the same analysis pipeline. And this is not the same as reproducible because repeatable brings in the concept of the computational environment,

03:21

whereas reproducible only relates to the data and the analysis steps. And this is why I'm talking about Binder because Binder captures this repeatable aspect by adding in the dimension of what's going on in the background when you're running your analysis on a data set.
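To make "what's going on in the background" a little more concrete, here is a minimal sketch that records a few facts about the computational environment using only the Python standard library. The function name `describe_environment` and the choice of fields are assumptions for illustration; a real environment description, such as the one Binder builds, covers much more, including every installed package.

```python
import platform
import sys


def describe_environment() -> dict:
    """Collect a few facts about the machine and interpreter running the code."""
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "operating_system": platform.system(),
        "machine": platform.machine(),
        "executable": sys.executable,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Two people running the same notebook can compare this output to spot differences that might explain why their answers are not exactly the same.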

03:40

So I just want to acknowledge that there are many reasons and barriers why reproducible research isn't happening all of the time across all of the domains. And I'm sure many people on this call will relate to at least one reason on this slide right now. The two I like to focus on are the fact that it takes time and it requires additional skills.

04:03

You need time to learn these skills and implement them. But if you can get over that initial barrier, research in general can be much more efficient and successful in the long run because you've set up your workflows in such a way that you can repeat things much more easily.

04:22

I like to do a little bit of market research during my talks, but I'm aware that I can't see any of you. So hopefully you'll still at least be able to relate to the questions I'm about to ask. I want you to think about scenarios when you've been collaborating on software.

04:40

I'm sure all of us have heard the complaints "oh, but it worked on my computer" or "oh, but it worked yesterday". And these are super frustrating things when you're trying to collaborate and work on software and you can't get the same environment going or something different is happening on your machine compared to someone else's machine.

05:03

And this is really frustrating because we actually have tools that solve these problems. For the "it worked on my computer" scenario, we have Docker. If you're not familiar with that, what Docker does is it creates something called containers. These are portable environments that capture not just code and dependencies such as software or data

05:25

but it can actually capture high-level architecture such as the operating system. And these then become very portable from your Mac laptop to your Linux desktop to even implementing on the cloud and it becomes very easy to share those kind of environments.

05:44

And for the "oh, it worked yesterday" scenario, we can use version control. Version control is the magic time machine I wish I'd had during my PhD, when I could just rewind everything back to last Tuesday when everything was last making sense. And if you combine that with continuous integration and testing,

06:03

you can actually catch these kinds of bugs that might create discrepancies in your code before they actually reach your production, research-ready code base. But these are exactly the additional skills that I just said form a barrier to reproducible research, and we have to sit down and learn them.

06:24

So this is where Binder comes in. Project Binder is a global community of research, software engineers and data scientists who are dedicated to the open and transparent communication of reproducible research.

06:42

And the mybinder.org service allows anyone to launch a complete and interactive computing environment from their browser. And here what is complete is the code, the programming language, any software dependencies and packages, any assets you have like prose or media. You can very easily share a full research environment through a browser

07:05

without actually needing the person you're sharing with to install anything. So I am going to now attempt to demonstrate this. So please bear with me. I am just clicking on a link here.

07:23

And this will take us to the Gravitational Wave Open Science Centre website. So what this is, this is a website hosted by the LIGO and Virgo collaborations. These are astrophysical research collaborations that are looking into the detection of gravitational waves.

07:46

And they have example notebooks here and you'll see this third one here, Binary Black Hole Events, has this mybinder link next to it. And if we click that link, we get redirected to the Binder website where we have the Binder spinner of reproducibility.

08:04

And if I click on build logs, it says "Found built image, launching..." and we're launching a server. And if it hadn't found a built image at this point, it would be building one for us. And this server that is launching, it's not actually on my local machine. It's in the cloud.

08:21

And what's going to happen is my browser will be redirected to a Jupyter notebook environment, as we can see here. It's being hosted in the cloud, all running through my browser. And if I go down and show you the requirements.txt file,

08:41

we'll see that we require h5py, Matplotlib and SciPy. I haven't had to install any of those. And then if I open up the notebook, we will be able to see this tutorial notebook that the LIGO and Virgo collaborations have written up.

09:00

That basically goes through all of the analysis steps that they do when they are detecting gravitational waves. And I can just click run all cells and this will just run. And I can then begin to work my way through this notebook and learn all about signal processing and such.

09:21

And we can see that various different things are running. We can create plots and we can see we have the power spectrum of the gravitational waves. And we can even see the chirp signal as the two binary black holes merge into one another. And this graph shows the ring down

09:42

as the resulting object comes to rest. And that's great. I can now learn all about black holes and gravitational waves. And I haven't had to install it. I haven't had to make sure I have the right version of Python and the packages. It's just launched in a browser there for me to use.

10:01

So that's the power of Binder. So I'll just go back to my slides. So what did they have to do to create that? So this is a comic that kind of shows the workflow a researcher would go through in order to publish their work on Binder. So we have Jane here. She's written a paper based on her experiments and she would like anyone anywhere

10:21

to be able to reproduce, check and improve her calculations. So the first step she takes is to describe the experiment as a Jupyter Notebook. And the reason we like Jupyter Notebooks is that you can mix together prose, code and visualisation. And much like the example we just saw you can walk people through the steps you are taking.

10:41

Although you do not have to use Jupyter Notebook in order to use Binder, it also supports JupyterLab, Terminal, a text editor. You can get it to work with VS Code and it even works with RStudio as well. It's not just for Python environments. So once she's written her notebook

11:00

she publishes this on a hosted repository such as GitHub but many other public repository websites are supported such as GitLab, Bitbucket, Zenodo. And then she makes this repository binder ready by just describing it with mybinder.org.

11:44

You just glitched. There was maybe 30 seconds lost in the video so can you please come back again? Oh yes, sorry. So I'm just talking through this comic here. Jane has written a notebook based on her experiments.

12:00

She's publishing it on GitHub and now she needs to make that repository binder ready. So what does that mean for Python developers? How do we make sure that that's ready? So if you are a Python person used to installing your packages using pip

12:21

then you will have a requirements.txt file and this is just a plain text file that lists your packages and your package versions and this is a valid configuration file compatible with mybinder.org. But you might use Conda and therefore have environment.yml files

12:40

which is again just a list of all of your dependent packages and the channels from which you download them. And this too is a compatible configuration file for working with mybinder.org. And by providing those kind of configuration files mybinder.org will allow Jane to share her notebook with everyone

13:03

so that they can run it and reproduce her computations by providing compute power from the cloud. So all mybinder.org needs to work is a version control repository on a public server and a description of the software dependencies. And importantly we've configured it such that it recognises

13:22

the typical configuration files that those communities already are using to define their software dependencies. So this is a little bit of background into mybinder.org or the Binder project.
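As a small, hedged sketch of what a pinned requirements.txt might contain, the script below looks up the versions installed in the current environment and prints them in pip's `name==version` form. The helper names `installed_versions` and `format_requirements` are invented for this example. The package names are the ones read out in the demo, and the versions will be whatever happens to be installed locally.

```python
from importlib import metadata


def installed_versions(packages):
    """Look up the installed version of each named package, skipping missing ones."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            continue  # not installed in this environment
    return found


def format_requirements(versions):
    """Render a {package: version} mapping as pinned requirements.txt lines."""
    return "".join(
        f"{name}=={version}\n" for name, version in sorted(versions.items())
    )


if __name__ == "__main__":
    # Package names from the demo's requirements file; versions are local.
    wanted = ["h5py", "matplotlib", "scipy"]
    print(format_requirements(installed_versions(wanted)), end="")
```

Saving that output as requirements.txt at the top of the repository gives a Binder service the description of software dependencies the talk refers to. Running `pip freeze` is the more common way to get the same kind of list.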

13:40

It was originally launched by Jeremy Freeman back in 2015 and at the beginning of 2017 Project Jupyter started having meetings with Binder as we began to take over stewardship and bring Binder into the Jupyter ecosphere. The first half of 2017 was spent redeveloping the backend of Binder into what is now called BinderHub

14:01

and that's all the technology that powers this service. And in September of that year Binder was awarded a Moore Foundation grant to run its operations. So since these humble beginnings we've now grown to hosting over 140,000 user sessions per week.

14:21

So in roughly three years Project Jupyter has brought this tool from a prototype to a staple of the open science community. And Binder itself is open source and is built modularly using other open source tools which means anyone can deploy their own service for example in an institution

14:40

and it's completely configurable as well so you could configure it to only allow sharing between specific teams which I've represented here by changing the globe that meant everyone to just be a little house that might be your institution or your team. And that's totally fine. So what is the technology, what's this BinderHub

15:01

that we use to provide this service. So here is a little screen grab of the form you see on the mybinder.org webpage and all you do is you paste in the URL of your version controlled repository. So we have an example from GitHub here

15:20

and then you just click launch and everything that happens afterwards is happening in the background automatically. So the first thing that happens is that your repository is cloned. Then we have a tool called repo2docker and if you were at Tania Allard's talk about Docker and Python yesterday you will have heard a little bit about this

15:41

but what repo2docker does is it basically reads the repository and looks for configuration files and it builds a Docker image without the need for a Dockerfile. It copies across all of the code and any data, it installs all of your software dependencies and it makes that image compatible with Jupyter

16:02

by installing all of the necessary Jupyter servers and such. And these configuration files are your requirements.txt or your environment.yml that I showed you earlier. This image is then executed using Docker and we then host the running Docker container on a JupyterHub.
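The form on the mybinder.org page ultimately produces a link that can be shared. As a sketch, assuming a GitHub-hosted repository and the `/v2/gh/<org>/<repo>/<ref>` pattern that mybinder.org launch links follow at the time of writing, such a link could be assembled like this. The function name `binder_launch_url` and the organisation and repository names are placeholders, not a real project.

```python
from urllib.parse import quote


def binder_launch_url(org: str, repo: str, ref: str = "HEAD",
                      host: str = "https://mybinder.org") -> str:
    """Build a launch link for a GitHub repository on a Binder service."""
    # Percent-encode each part so that, for example, a branch name
    # containing "/" stays a single path segment.
    parts = [quote(part, safe="") for part in (org, repo, ref)]
    return f"{host}/v2/gh/" + "/".join(parts)


if __name__ == "__main__":
    # Placeholder names; swap in a real public repository to try it.
    print(binder_launch_url("example-org", "example-repo", "main"))
```

The `host` parameter is there because, as the talk notes, anyone can deploy their own BinderHub, so the same pattern could point at an institutional deployment instead of the public service.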

16:22

If you're not familiar with what a JupyterHub is JupyterHub is one solution to the problem where you have access to some computers and you would like to share them or give access to some humans and Jupyter notebooks are a good user interface for this scenario.

16:40

And here computational resources and humans can take many forms such as researchers on HPC, anonymous users on the cloud, students at a university, etc. By combining different custom authenticators and spawners you can create any kind of mapping of any collection of humans onto any kind of computational resources you like

17:02

and that is JupyterHub's job. It's just providing computational resources to your people. So that's what it's doing. It's giving computational resources to our running Docker container and in the case of mybinder.org these computational resources are a Kubernetes cluster running in the cloud. JupyterHub then makes the running container accessible at some URL

17:24

and then Binder is this thin layer running across the top of all of these tools that handles URL redirection and it just redirects a user's browser to that running container. And lo and behold, as we saw in the example you get your Jupyter notebook with the environment already installed

17:42

and you're ready to roll. So how did we manage to scale this up to 140,000 users per week? And the answer is we created a federation. So mybinder.org is supported by four clusters around the world including one that I myself manage at the Turing Institute

18:03

and federating the service in this way means we can be resilient against both cluster outages and funding streams. We can be sustainable by sharing the workload and knowledge like this. The service can persist in spite of changes amongst resources and people

18:21

and we become robust as well. We can reliably support users present and future. And because BinderHub and mybinder.org are built using modular open source tooling we're actually in a really super cool position in that to run mybinder.org you do not require cloud vendor lock-in

18:42

and we're in a really cool position of hosting one website with four different flavors of Kubernetes and our user base are for the large part unaffected by the redirection to our different clusters. And in fact we're not even completely cloud based. The GESIS cluster, which is the second largest cluster in the federation,

19:02

is actually an on-premise facility hosted at the Leibniz Institute. So another thing that the Binder team like to do is to run user surveys and we do this to try and gauge how our users are using mybinder.org and what they like or dislike about it.

19:21

So the good news is around 80% of our user base or at least 80% of the 346 responses we got to the survey would recommend the service to a friend. So that's good. And we found that there's actually a wide range of use cases for using mybinder.org.

19:42

But like the top three sort of equally distributed use cases are university teaching, hosting documentation and examples or running workshops and training courses. And the inference you can make from that is Binder really shines in scenarios where installing an environment would be distracting

20:00

or a waste of time. Such as you're running a workshop at a conference, you don't want to spend 20 minutes making sure everybody has the correct version of Python and all of the packages installed and then downloading all of the necessary materials. Instead you could just send them a URL to click and they get the environment and everything is there

20:20

and they can interactively explore that data set. Thank you. But one thing that we could do better on is speed. So this is a word cloud that was generated from free-form responses, and it shows that one area we could improve is the time it takes to launch Binder.

20:42

Tackling this problem, however, isn't as simple as it seems. Part of this is that we're reliant on upstream contributions to speed up our launches. For example, pulling the Docker images onto our computational nodes or allocating computational resources with Kubernetes. We are always, always going to be limited

21:00

by how quickly Kubernetes can requisition a user session. But one way we've tried to tackle this is to write up some community guidance. And this explains what's happening during the launch process, where the time is spent, which steps could be taking longer and why. And we've then provided pathways our users can follow

21:20

to speed up their launch times. And we found that once we've explained what's going on, many people actually go, yeah, actually 10 to 15 seconds to launch a Binder is actually reasonable considering how much is going on in the background. So we've probably been too good at streamlining this at this moment. And this is ultimately what this project is all about.

21:43

We are about community. And we exist in a larger ecosphere of languages as well. As I mentioned earlier, we don't just support Python. We include Julia and R. And we want to meet those communities where they are to ensure we're providing the best and most useful service we can. That being said, Binder is a Python project, and it is run by Python devs.

22:02

And this is problematic because when it comes to developing the service for Julia and R, we just don't have the knowledge and the expertise from those communities. However, this year, I've been awarded a fellowship by the Software Sustainability Institute. And my goal for this fellowship is to help diversify the skill set of the teams maintaining and operating Binder and MyBinder.org.

22:22

And this is so that we can ensure that these communities outside Python are represented when we are innovating Binder for the future. So this is all from me. Here is a whole bunch of links if you're at all interested in this project, getting involved, asking questions.

22:42

Thank you very much. Thank you very much, Sarah. I see there are many, many questions. I doubt we have time to answer them all, but let's start. There's a question from Francesco.

23:00

What are the pros and cons of Binder with respect to Google Colab? For instance, do you also allow users to access GPUs, which are essential for several machine learning applications? So the pros and cons. So Google Colab is a very different kind of project to Binder.

23:24

Google Colab is about developing quickly with other people. So it kind of installs a kitchen sink environment that has everything you could possibly dream of, but you might not actually be using them. Whereas MyBinder.org is providing a bespoke environment that is specifically the requirements you need to run an analysis.

23:45

So they're kind of different beasts targeting different use cases. MyBinder.org does not offer GPU services. This is because we are providing this service completely free of charge and it's expensive enough to run without adding GPUs to it.

24:04

We did run a GPU service for the NeurIPS conference a few years ago, and it's a very obvious spike in our billing graph. But because it's open source, you can deploy BinderHub onto your own infrastructure that includes GPUs

24:22

and you can do really clever things, like if you authenticate a user and they're on an allowed list to use GPUs, you can then redirect them to a different node pool, for example. What I would say about the analyses that rely on GPUs,

24:40

MyBinder.org is not a tool to do your analysis on. It's a tool to communicate the result of your analyses. So you probably wouldn't be running your simulation on a GPU that might take X amount of hours. There's no reason to run that in a browser. But the tidy up scripts that create your plots,

25:02

that is something you are likely to share and that is where Binder becomes powerful. Okay, there are many more questions. One hot question is, who is paying for the compute resources?

25:20

So the Google cluster, which is the largest cluster, was originally paid for by the Moore Foundation, but we get a donation of credits direct from the Google Cloud platform. Another cluster is run by OVH, which is a cloud provider in Europe. The GESIS cluster is paid for by the Leibniz Institute who host it

25:44

and the Azure cluster that I manage is paid for via a donation that the Turing receives from Microsoft as well. So we're like this community of BinderHub managers that are getting our funding through our own streams,

26:02

but we just get the traffic from one point and then it's kind of like under the hood distributed across the network. It's quite a cool system. So another very important question for the project. This is awesome. How would one contribute to this?

26:24

I've left the links up here. The BinderHub repository is at the top. My Twitter is in the corner if you'd like to send me a direct message. One good way to do that is to jump onto our Discourse at discourse.jupyter.org and introduce yourself

26:43

and just be like, hey, I think this is awesome, I want to contribute. Then we can get into a conversation. I always think the best place to start is by reading the documentation and seeing how you can improve it. A fresh pair of eyes that doesn't suffer from expert blindness is so useful for these kinds of things. Then we can just build up into more cool technical ideas.

27:07

Okay, and one more technical question from Alexander. Could Binder be configured with PEP 518? I looked it up: it specifies minimal build system requirements for Python projects.

27:23

If not, will it be supported in the future? I have no idea what that means. I'll be honest. I think this is something you can discuss on Discord later. I had no clue either, so I looked it up quickly.

27:41

If we can drop the PEP in discourse, I will try and read it and make sense of it. Okay, I think we've unfortunately run out of time. So there are three more questions. Please take those to the Discord chat. It's a talk, Binder, I think, the channel name.

28:03

So yeah, please join us there in the Binder channel on Discord, and I'm sure Sarah will answer all your hot questions. So thanks again, Sarah. Thank you. Bye-bye.