
Jupyter and IPython facilitating open access and reproducible research


Formal Metadata

Title
Jupyter and IPython facilitating open access and reproducible research
Title of Series
Part Number
11
Number of Parts
13
Author
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Jupyter notebooks provide a document-based interactive environment for performing and recording computation. Notebook documents contain not only the code that is run, but also prose and mathematics describing the analysis, along with the recorded outputs of the computation, from plain text to rich interactive media such as HTML and JavaScript or images and video. Jupyter notebooks are being deployed widely as computational companions for publications, facilitating reproduction of results and interactive exploration and modification of analyses, including for prominent scientific discoveries such as those of the LIGO experiment. Being freely available, open source software, Jupyter and IPython aim to improve the accessibility of reproducible practices in computational science.
Transcript: English (auto-generated)
I go by Min. My full name is Benjamin, but I go by Min. It's fine. Yeah, all right, so I am Min. I work on the IPython and Jupyter projects. I'm currently based in Oslo, Norway at Simula Research Lab, which is a small research laboratory
that does work in biomedical computing and computer systems, and so we work on, specifically my group works on improving tools for science in general, in particular, these Jupyter and IPython projects.
Just a quick check. Who here knows what Jupyter is? Who here uses Jupyter on a regular basis? And who here knows about the relationship of Jupyter and IPython? Hopefully everybody will know all those things by the end.
So I'm gonna start with just a bit of context about how we do computational research. So my background is in computational plasma physics, so doing computer simulations of plasmas and mathematical models of that and coming up with the physical scaling laws and things,
and so this is from our experience. Most of the software developers on the IPython and Jupyter projects are scientists by training, physicists and biologists, and the code over the years sometimes reflects the fact that we haven't been software engineers,
although we've learned a lot over the years. So what's our process of doing computational research? You start by, okay, I write some code to describe whatever system I'm interested in, and then I run that code, I get some results, and then, okay, now I've got some great results.
I'm gonna communicate those results to people. I'll make some figures to illustrate those results that I've come up with, and then it's like, okay, these figures are cool and interesting. I wanna share these with the world, so I write a paper and then publish the paper, and then people see it. Okay, yeah, that's interesting. I wanna reproduce that and then produce derivative works.
It's a very nice, clean, linear process. But in reality, it's a much more iterative process where you write some code, you run it, it never works the first time, then you go back, you change it, then, okay, you get far enough that you've got an actual result, you make some figures, you do your plotting,
and then you realize those figures don't make any sense. I've clearly done something wrong. Then you go back and change it and rerun it, and then, okay, now you think you've actually finally got something together, you write the paper, and then the reviewers say, well, no, I think maybe you need to rerun the simulation. With these parameters, I wanna see the sensitivity
to these things, or I don't trust your results, please back them up more substantially in some way or another. And then eventually, you get to publishing your results, and then even after the publication process, errors can get through peer review and you can end up going back through the iteration even after the final product,
the allegedly final product of a paper, is produced. So, where does our work fit into this? IPython started as a way to solve this one little piece: we spend a lot of our time in this process of, all right, I wanna figure out
what code I wanna write, I wanna run it, and then based on running it, how do I figure out what it actually should have been? How do I figure out what the next bit of code I should write? And so, IPython was developed by Fernando Perez as a PhD student to help the process of running the code for his research.
So, what is IPython? The I is for interactive. It's just an environment for better interactive Python with a focus on the kind of work that scientists do, plotting and interrogating code. So, the first thing it does is it gives you things
like tab completion, so you can say, I know I wanna use this package, and I wanna do something related to arrays or plotting, and then tab completion lets you more easily find that without having to refer to the documentation as much. And it has introspection, so this is in addition to the Python language where you can say, okay, I know I wanna use this,
I wanna interrogate that object to say how can I use this? And that's another way to figure things out more quickly. If people are familiar with libraries like Matplotlib and pandas, a lot of these tools have very, very complicated signatures, and you either spend a lot of time in the documentation
or you can quickly peek at the docstrings of these objects to say, okay, how am I supposed to use this, what's the argument order and things? And you can really quickly look up that information in an interactive session with IPython. And the last thing we have is these magics, which are another extension to the Python language, a shell-like syntax to more efficiently do things
like running timers, changing the working directory, simple things to make that interactive shell experience a bit more pleasant than plain Python. So that's where IPython sits. So what about Jupyter? Jupyter is a bit more ambitious,
where, as we develop as scientists and we expand the project, we wanna work on, at least to some degree, every piece of this problem. So what is Jupyter? There are a lot of answers to this question. I'm gonna give a few of them.
At the first level, and this is an answer that most people maybe aren't aware of, Jupyter is a protocol. So we talked about how IPython is a shell where you're running code interactively. The first thing we built that ultimately became Jupyter was built in the IPython project originally
as just a technical feature of IPython. We wanted where your code actually executes and the user interface that's presented to you to be separate. And one of the reasons for this is that in physics code, often you're actually running code that's in C or Fortran
that's called from Python, and that code can fail in more dramatic ways than Python code can fail. So it can tear down the process and you can lose everything. It would be nice if you didn't lose the application, the interface, as well. You might lose the runtime state of the code,
but we'd like to keep the UI persistent. And so for that and some other reasons we built this protocol. We took what we do in IPython, this process of reading code, executing it and producing output, and we defined a protocol for that so that we could separate the user interface from the execution.
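A minimal sketch of why that separation helps, and not how Jupyter itself is implemented: the code runs in a child process, so a hard failure there leaves the parent, standing in for the interface, alive. `run_in_child` is a hypothetical helper, and `os._exit` is used only as a stand-in for compiled code tearing the process down.

```python
import subprocess
import sys


def run_in_child(code: str, timeout: float = 30.0):
    """Run `code` in a separate Python process; return (exit status, stdout)."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout


# Normal execution: the child prints a result and exits cleanly.
ok_status, ok_output = run_in_child("print(1 + 1)")

# Simulated hard failure: os._exit ends the child immediately, skipping
# cleanup, as a stand-in for a crash in C or Fortran code.
bad_status, bad_output = run_in_child("import os; os._exit(3)")

# The parent process is still running and can report what happened.
print("first run:", ok_status, ok_output.strip())
print("second run exited with status", bad_status)
```

The runtime state inside the child is gone after the second run, but the calling process keeps going, which is the property described above.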
But Jupyter also defines a document format. So this is a Jupyter notebook which contains mathematics and prose and code and then the output that's the result of running that code. And I'll go into more detail on what Jupyter notebooks are in a bit but we have these two really basic technologies of a Jupyter notebook as a document format
and the Jupyter protocol that we use for execution. So a bit of detail about the protocol. The protocol encapsulates a paradigm we call a read-eval-print loop, which is how shells in the terminal work. And the first step is read, which is
get some input for what code to run, and this produces a message. So the front end sends a message called an execute request with just the code that it wants to run as text. And then the eval is, so the process that we call the kernel receives that request and says,
okay, I'm gonna run that code and do whatever that code says to do. And then the bit where we depart from terminals is the print step, because we're sending these messages, so we can send outputs that aren't just text. Since we're not confined to a terminal we can produce outputs that can be a variety of types.
And we identify these types by MIME type, which is the way the web identifies what kind of document something is. So if you view a webpage, it's gonna be an HTML document, but you can also view other types of files on the web, images and things. And we use the same concept to identify what kind of outputs are produced.
And then we hand that same information to the browser, which ultimately displays it. And then the loop is just, you do this over and over again. In an active session you're just taking bits of code, you're executing them and then you're looking at the results. So a bit about that extended print in the REPL.
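The loop just described can be sketched as a toy in plain Python. This is a simplified illustration, not the real wire format: actual Jupyter messages carry headers and further fields that are left out here, and `ToyKernel` and `make_execute_request` are hypothetical names. The message type names (`execute_request`, `stream`, `execute_result`) and the MIME-keyed `data` dictionary follow the shape of the protocol.

```python
import io
from contextlib import redirect_stdout


def make_execute_request(code: str) -> dict:
    """Read step: the front end wraps the source text in a message."""
    return {"msg_type": "execute_request", "content": {"code": code}}


class ToyKernel:
    """Hypothetical stand-in for a kernel; state persists between requests."""

    def __init__(self):
        self.namespace = {}
        self.execution_count = 0

    def handle(self, request: dict) -> list:
        """Eval step: run the code, then build the output messages."""
        self.execution_count += 1
        code = request["content"]["code"]
        try:
            # A bare expression yields a value worth displaying.
            compiled, is_expression = compile(code, "<cell>", "eval"), True
        except SyntaxError:
            compiled, is_expression = compile(code, "<cell>", "exec"), False

        buffer = io.StringIO()
        value = None
        with redirect_stdout(buffer):
            if is_expression:
                value = eval(compiled, self.namespace)
            else:
                exec(compiled, self.namespace)

        replies = []
        if buffer.getvalue():
            replies.append({
                "msg_type": "stream",
                "content": {"name": "stdout", "text": buffer.getvalue()},
            })
        if value is not None:
            # Print step: the result travels as data keyed by MIME type.
            replies.append({
                "msg_type": "execute_result",
                "content": {
                    "execution_count": self.execution_count,
                    "data": {"text/plain": repr(value)},
                },
            })
        return replies


# Loop step: the front end keeps sending requests and showing the replies.
kernel = ToyKernel()
session = [
    kernel.handle(make_execute_request(source))
    for source in ["x = 21", "x * 2", "print('hello')"]
]
for replies in session:
    print(replies)
```

Nothing in the messages is specific to Python: the request is source text and the replies are MIME-keyed data, so only the `handle` method would change for another language.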
We use MIME types for output which means that we can, any format of a file or of a content we can use in Jupyter. So the basic output is text, that's simple.
You've just got some text outputs, so you just wanna see that text, and you can display it. So here's a bit of code that produces some simple text. We can also have images. This is commonly used for plotting and visualization. You produce a simple raster or vector image in PNG or SVG.
You can also have output that is itself LaTeX mathematics. So in the notebook, for example, we render mathematics in the browser using a library called MathJax. It's the same thing that runs on arXiv.org and various websites that render math.
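As a rough sketch of how one result can carry several representations at once: IPython looks for optional methods such as `_repr_latex_`, `_repr_html_` and `_repr_svg_` on an object and keys each representation by MIME type. The `Ratio` class and the `mime_bundle` helper below are hypothetical and written in plain Python so the idea can be run without IPython; they imitate the convention and are not IPython's own code.

```python
# Method names follow IPython's display convention; the mapping itself is
# a simplified, illustrative subset.
REPR_METHODS = {
    "_repr_latex_": "text/latex",
    "_repr_html_": "text/html",
    "_repr_svg_": "image/svg+xml",
}


class Ratio:
    """Hypothetical example type with a plain-text and a LaTeX form."""

    def __init__(self, num: int, den: int):
        self.num = num
        self.den = den

    def __repr__(self) -> str:
        return f"Ratio({self.num}, {self.den})"

    def _repr_latex_(self) -> str:
        # The output is the LaTeX source; rendering is left to the consumer.
        return f"$\\frac{{{self.num}}}{{{self.den}}}$"


def mime_bundle(obj) -> dict:
    """Collect every representation the object offers, keyed by MIME type."""
    bundle = {"text/plain": repr(obj)}
    for method_name, mime_type in REPR_METHODS.items():
        method = getattr(obj, method_name, None)
        if callable(method):
            bundle[mime_type] = method()
    return bundle


bundle = mime_bundle(Ratio(1, 3))
print(bundle)
```

A terminal front end could fall back to the `text/plain` entry, while a browser could hand the `text/latex` entry to a math renderer.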
And the output of this expression is not the rendered math, it's not an image. The output is actually the LaTeX expression and then the browser is responsible for rendering it. And the nice thing about that is that different consumers of this notebook can render that LaTeX as appropriate in different environments
and you're not stuck with some rendering that you have to include as an image when embedding it in something else. And you can also have output that is itself HTML and JavaScript. And this is where a lot of the power of the web-based notebook lies: you can have fully interactive HTML
and JavaScript outputs. This is a simple table, but anything you can do on a webpage, you can do as an output in a notebook. And then lastly, you can have an output that is itself actually an input. So this is an interactive widget whereby the output of this execution
is actually a control that, when you move a slider, in this case, re-executes the cell and produces new output. So this is interacting with the factoring of an equation. This is factoring x to the n minus one, and as you move the slider,
it changes n and then recomputes the result. And then this is key to the protocol. We talked about pulling the execution apart from the interface, and the way we did that,
with the outputs just keyed by MIME type and the input just the code as text, there's nothing Python-specific about that. We did it this way for IPython because we wanted to use Python in this way, but kind of by accident, we made this in such a way
that it would be useful to any interpreted language. And so we documented the protocol, this is how it works, and other language communities have adopted it, so there are now, I believe, just over 80 kernels for Jupyter in languages from C++
to JavaScript and Erlang and various kernels for Spark and R so the Jupyter notebook model can be used in a variety of ways with a variety of languages. You're not stuck with Python if that's not your language of choice.
So a bit more detail about what a notebook is. At the most basic level, a notebook is a sequence of cells. There are two kinds of cells. A text cell contains Markdown and LaTeX mathematics, and this is all rendered in the browser. A code cell is one of these read-eval-print loops, so it's the code you execute
and then it also includes the output that's produced, so the cell contains the output of an execution. And then there's metadata everywhere for marking these things up that can be used by consumers of notebooks. The file format is a plain text JSON format with a publicly documented schema,
so if all of the Jupyter software goes away, you can still take your notebooks and easily get your code and output out. It's machine readable because it's a very simple JSON data structure that's easy to understand, so poking around in a notebook
if you have a JSON parser is pretty simple. And because of the structure of the document, so the notebook document is not this HTML environment, the document is this simple data structure, you can take that data structure and transform it to a variety of other formats. So you can turn it into an HTML document,
you can turn it into a LaTeX document, you can turn it into Markdown for some blogging engine and then you can also provide custom converters to take a notebook and convert it to any format. And because of this simple schema of a notebook, it's not too difficult to write transformers
for notebooks to integrate into an existing publication pipeline. So that's the notebook as a document. The notebook is also the interactive environment. So this web application where you actually create notebooks
is an interactive computing environment for kind of data exploration and poking around with your data and learning and exploring and producing the notebooks. And once you produce a notebook, it can also be an input format to a pipeline. So you can say, okay, I have this notebook, it's got some code and some results,
I want to plug this notebook into some computation pipeline and produce either a report or a new notebook or something and do offline execution of notebooks. And it can also be an output format. So you can have some big offline simulation that automatically programmatically produces
a notebook document to enable interactive exploration of your results. And so the way this fits into the open science, reproducible research area is,
so you start out in the interactive model where you're just exploring, using the notebook as kind of an interactive scratch environment to poke around, learn, figure out what you want to do, find something interesting. And then from a sustainable software perspective,
so you're learning in the notebook and then you're producing software as you normally would. You're not writing all your software in the notebook, you're writing your software the same way you produce any good library and then you use that library again in the notebook. And then once you've got something complete that works,
you perform your analysis, you run your simulations, you do some exploratory work, actually producing something collaboratively with your peers in new notebooks that actually illustrate what you're interested in.
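The earlier point about a simulation programmatically producing a notebook can be sketched with nothing but the standard `json` module, since a notebook is a plain JSON document. The field names below follow version 4 of the notebook format as I understand it; `build_report` and the numbers passed to it are made up for illustration and stand in for real simulation results.

```python
import json


def markdown_cell(text: str) -> dict:
    """A text cell: Markdown source plus (empty) metadata."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(source: str, stdout: str = "") -> dict:
    """A code cell: the source and, optionally, the text it printed."""
    outputs = []
    if stdout:
        outputs.append({"output_type": "stream", "name": "stdout", "text": stdout})
    return {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {},
        "outputs": outputs,
        "source": source,
    }


def build_report(results: dict) -> dict:
    """Hypothetical helper: one heading and one code cell per result."""
    cells = [markdown_cell("# Simulation report")]
    for name, value in results.items():
        cells.append(markdown_cell(f"## {name}"))
        cells.append(code_cell(f"print({value!r})", stdout=f"{value!r}\n"))
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


# Made-up numbers standing in for the output of a real simulation.
notebook = build_report({"peak_density": 1.5, "steps": 200})
notebook_text = json.dumps(notebook, indent=1)
print(notebook_text[:200])
```

Saving `notebook_text` to a file with the `.ipynb` extension should give a document that can be opened for interactive exploration; in practice a library such as `nbformat` would also validate it against the schema.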
And then, so now you've actually done your work, you've got interesting results and it's time to communicate with the outside world. And notebooks are a really nice way for communicating computational ideas of saying, okay, here's in my markdown, my math, this is what I'm doing, this is my mathematical model
and then right below that you have the code implementing that mathematical model and then you have the figures illustrating your results and then you can share that both as an HTML file that people can view but also because it is the document that is loaded in this interactive environment,
people can download that notebook and run it and change it and actually use that to explore and interact with your results without having to start from scratch. And then in more traditional publication contexts, people have picked up the term computational companions
for reproducible papers, so there have been a number of academic publications using notebooks as appendices or kind of companion things, like this is a notebook that generates every figure in this paper, or this is a notebook that actually goes through the analysis that implements this entire model.
So people can say, okay, I'm interested in this traditional paper, how do I start interacting with it? They say, here's a notebook, you can go right ahead and interact with everything you've produced. So a couple of the applications of Jupyter.
JupyterLab is a new web application for interacting with Jupyter notebooks and other things. nbconvert is our tool for taking notebooks and converting them to HTML, Markdown, LaTeX and PDF.
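Because the notebook is a simple JSON structure, a small custom converter is not much code. The following is a minimal sketch of a notebook-to-Markdown transformation using only the standard library; it is not how nbconvert is implemented, it only handles Markdown cells, code cells and plain `stream` outputs, and `notebook_to_markdown` and the sample notebook are hypothetical.

```python
import json

# A tiny notebook in the version 4 JSON layout, used as sample input.
SAMPLE_NOTEBOOK = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": "# Demo"},
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["print(6 * 7)"],
            "outputs": [{"output_type": "stream", "name": "stdout", "text": "42\n"}],
        },
    ],
})

FENCE = "`" * 3  # Markdown code fence, built here to keep this listing readable


def join_source(source) -> str:
    """Cell text may be stored as one string or as a list of lines."""
    return source if isinstance(source, str) else "".join(source)


def notebook_to_markdown(notebook_json: str) -> str:
    """Turn Markdown cells into text and code cells into fenced blocks."""
    notebook = json.loads(notebook_json)
    parts = []
    for cell in notebook["cells"]:
        text = join_source(cell["source"])
        if cell["cell_type"] == "markdown":
            parts.append(text)
        elif cell["cell_type"] == "code":
            parts.append(f"{FENCE}python\n{text}\n{FENCE}")
            for output in cell.get("outputs", []):
                # Only plain text streams are handled in this sketch.
                if output.get("output_type") == "stream":
                    printed = join_source(output["text"]).rstrip("\n")
                    parts.append(f"{FENCE}\n{printed}\n{FENCE}")
    return "\n\n".join(parts) + "\n"


markdown = notebook_to_markdown(SAMPLE_NOTEBOOK)
print(markdown)
```

A converter for another target format would walk the same list of cells and emit different markup, which is what makes it feasible to plug notebooks into an existing publication pipeline.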
nbviewer, which I'll demonstrate shortly, is a service that's basically nbconvert on the web: you point it at any public notebook and it renders that to HTML so you can share it. JupyterHub is a tool for hosting computational resources;
this is focused on the interactive side, giving people access to computational resources through notebooks. nbgrader is a tool that works with JupyterHub for automatic grading and assignment dissemination and turning in assignments and automatic validation
and auto-grading of notebooks as academic assignments. tmpnb is a service for demonstrating Jupyter, running small-scale, free, hosted notebooks using Docker, and Thebe. So we talked about the protocol and the document format,
and you absolutely don't have to use the entire Jupyter stack all at once. You can just use the protocol for remote execution and get all of the power of execution and display of Jupyter without using notebooks. And Thebe is an example of that, of turning any little code blocks on a page
into code that you can execute without there being any notebooks involved. So you can just say, okay, I wanna run this little block of code that's on this webpage, on a blog or something. This is something from O'Reilly. So O'Reilly uses it on their web pages, and you can execute this code and see the results without there actually being any notebooks involved.
And there are a variety of other examples of tools using different pieces and different combinations of the Jupyter stack to build different applications. Some are purely document focused, some are purely execution focused and some use the whole thing together. So this was mentioned earlier and I'll go a little bit more in depth.
So one good example we have of Jupyter in science is the LIGO experiment. So this is a direct observation of gravitational waves with the big laser interferometers. And what they did was the,
so LIGO which is a massive collaborative physics experiment has something called the LIGO Open Science Center. And I can just go here. And the LIGO Open Science Center is a whole piece of the LIGO experiment that's devoted to making the scientific results
that they produce at LIGO available to the wider community. And one of the ways they do that is when they have a result, they produce these Jupyter notebooks that explain how to go through. So I can just open the,
I can just open a quick viewer of a notebook. And they publish these notebooks that are tutorials for how to actually download the LIGO data, go through the analysis and download.
So for this direct observation of a gravitational wave, you can download this notebook and run it and it will actually get the actual data that they use, go through the actual analysis that they ran and show how they came to the conclusions
that they came to. And one of the coolest pieces of this is these services. So the Microsoft Azure cloud service allows running notebooks on their cloud systems with an account, and there's also a service called Binder, a free service where, for any notebooks you put on GitHub
with some declaration of their requirements, you can visit a URL and you'll get a temporary little container with the dependencies, and you can actually run these notebooks. There are some issues with the Binder service right now that people are working on,
so I'm not gonna click on it and try my thing, there's some stability issues there right now. But Binder is extremely useful for these kinds of open science applications, tying it all together in public: you make your data available
and you make your notebooks available. Binder's currently tied to GitHub, but it doesn't have to be. And then people can just click a link and they've got an interactive execution environment with a document explaining what's interesting, which they can actually execute and follow along and modify, and produce derived things from your scientific results and actually interact
with them in a full executable environment. Yeah, and I think that's about all I wanted to cover. Yeah. Okay, thanks a lot.