
Notebooks in (geo)datascience


Formal Metadata

Title
Notebooks in (geo)datascience
Title of Series
Number of Parts
266
Author
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
In the FOSS4G 2021 programme, the word 'notebook' appeared ten times and the word 'jupyter' ten times too, in the abstracts of four workshops and four presentations. In 2022, 'jupyter' and 'notebook' appear in the abstracts of two workshops and two presentations. More discreetly, at least three workshops and one scientific paper used notebooks without mentioning them. As we can see, notebooks are becoming increasingly common in data science and the geospatial world. But what is a notebook? What is it useful for? What are its limitations? Are there platforms other than Jupyter? Can we use anything other than Python? What about geospatial? Are these tools FOSS? These are some of the questions that this presentation will try to answer. (TL;DR: yes!) If you have never heard of Quarto, Observable or Org-mode, this presentation is for you.
Transcript: English (auto-generated)
Thanks, Ian. Welcome to this presentation. I will obviously be speaking about notebooks in geodata science, because I'm a GIS engineer. So let me introduce myself to introduce the subject.
I'm Nicolas Roelandt. I'm a GIS engineer at Gustave Eiffel University, and I work for researchers who have trouble spatializing their data or who are not cartographers, so I make their spatial analysis problems mine. To do that, I use a lot of notebooks; pretty much all of my work is done in notebooks.
I'm also part of a notebook working group in France where we investigate the notebook as an object: what it is, what you can do with it, and what its limitations are. I'm also part of the OSGeo community, and in previous editions, and even in this one, people use notebooks, but nobody speaks about the notebook itself. So I said, I'm jumping in: that's my subject.
So, to define what a notebook is, I'll quote Stephen Wolfram. He said that the idea of a notebook is to have an interactive document that freely mixes code, results, graphics, text, and everything else. I think the important things in that quote are that it's an interactive document, so you can interact freely with it, and that it mixes code and text, and after that you get results.
It might be more text, it may be a graph, or another output. The core idea comes from literate programming. It's something that was thought up by Donald Knuth, you might know him.
It tells us that instead of just instructing the computer what to do, we should tell other people, other human beings, what we want the computer to do. Not just saying "do that" to a computer, but "I ask the computer to do that because I want that."
And I get this result. So that's the core concept of literate programming: you give instructions to the computer, and you give information to the reader. It might be yourself, or it might be another reader. So a notebook is a computational document that is interactive, that mixes code and text, and that should be readable by both a computer and a human.
So let's go back quickly, many, many years ago. In 1978, Donald Knuth created TeX to create documents with a computer. Later on came WEB, a mix of TeX and Pascal, so you can have a computational document. There was also CWEB, with TeX and the C language.
The first actual notebook was created in 1988, so it's pretty old, actually, by Stephen Wolfram. It's called Mathematica, so obviously it was a notebook for math, with its own math language. In 1992, Knuth formalized the literate programming idea.
In 1992, there was also the creation of noweb, which was not a fork, but something similar to WEB that was not limited to Pascal. It also had HTML output in addition to LaTeX output, so you start to have multiple languages and multiple outputs from a single document.
A little bit closer to us, in 2001, the IPython interactive shell was released, so you could write Python interactively on your command line.
In 2002, there was Sweave. It derived from the noweb idea, but there you can mix R code and LaTeX and get HTML or LaTeX output. In 2005 came SageMath; again, it quickly became a notebook for math.
In 2011, the IPython tool became browser-based and interactive. It was the first one to do that, and they drew on knowledge from SageMath: the IPython community and the SageMath community were pretty close, and they benefited from each other.
In 2012, there was the knitr engine, which, coming from the R world, improved on what Sweave did and takes care of the whole compilation of the document with LaTeX. You press a button, and it compiles the document, taking into account cross-references and citations, and at the end you have your PDF or your HTML page.
Sweave didn't do that; you had to compile things yourself beforehand. So knitr is an all-in-one engine. In 2014, R Markdown and the Jupyter notebook were released. I will speak more about them later.
In 2015, for the people who like Java, there was actually a notebook written in Java, Apache Zeppelin. I've never used it, so I won't comment on it. In 2018, the Observable notebook was released, and in 2020, Quarto.
I will speak only about those four tools, because they are trending in data science. I won't speak about Org Mode, because it would be hard to explain: I don't have a full session, I just have 20 minutes. There is a lot more to say on notebooks, but those four are the ones I think are the trendiest and, for some of them, fun to use. And I tested them for geospatial, so I can speak about that.
So let's talk about Jupyter notebooks. They are very popular. Has anybody here ever used Jupyter notebooks? Yeah?
Yeah, pretty much everyone. So they are very well known. At first, it was only Python, R, and Julia, but they quickly added new kernels to connect to other languages. So now you can connect to more than 100 languages, and you can do a lot of stuff in Jupyter notebooks.
A notebook is made of Markdown cells and code cells. Each cell is distinct, and the file format is JSON. For me, I think that's the big issue.
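As a rough illustration of what "the file format is JSON" means in practice, here is a minimal sketch in Python that builds a two-cell notebook as a dictionary and serializes it the way an `.ipynb` file is stored. The cell contents and ids are made up for the example; the field names follow the public nbformat 4 layout.

```python
import json

# Minimal notebook in the nbformat 4 layout: format version numbers,
# top-level metadata, and an ordered list of cells.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "id": "intro",  # hypothetical cell id
            "metadata": {},
            "source": ["# A tiny notebook\n", "Some explanatory text."],
        },
        {
            "cell_type": "code",
            "id": "compute",  # hypothetical cell id
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["print(1 + 1)"],
        },
    ],
}


def cell_types(nb):
    """Return the type of each cell, in document order."""
    return [cell["cell_type"] for cell in nb["cells"]]


# An .ipynb file on disk is just this JSON text.
text = json.dumps(notebook, indent=1)
roundtrip = json.loads(text)
print(cell_types(roundtrip))  # ['markdown', 'code']
```

Because prose, code, and outputs all live inside one JSON structure, a small change to a cell shows up as a change to nested JSON, which is part of why plain-text diffs of notebooks are hard to read.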
There is good integration with IDEs, especially in the Python world, such as VS Code or PyCharm. That matters because Jupyter notebooks are not really good IDEs themselves. They try to help you write Python code, but they are not as good as VS Code or PyCharm might be.
So what does it look like? You have Markdown, which I think everybody already knows. This is a notebook I stole from the GRASS community. You have cells: Markdown cells, and then code cells that you can run. You can run each cell individually, or you can run the whole document if you want.
You can access Python or other geospatial libraries: the notebook itself doesn't provide any geospatial capability, that is provided by the language you use. The GRASS community has created several notebooks in Jupyter, so you might want to look at them.
But you need a Jupyter server to run the notebook if you want to edit it. So forget about using just Notepad++ or VS Code. You have to run Jupyter.
And I think, it's my personal opinion, Markdown editing in Jupyter notebooks is sometimes a bit tedious.
Let's speak about R Markdown. It's not a JSON file. From the beginning, it's a Markdown file with code cells in it. So you can edit it with any text editor you want. It might be vi or Emacs or whatever, or even Notepad on Windows; it should be okay.
It has great integration with IDEs, especially RStudio, because it's the same company that builds R Markdown and RStudio, so they make sure everything works well. It's mostly used in the R world, but not only, and it's not limited to R. As you can see, there are more than 150 engines in knitr that let you connect to other languages, like C, Python, Fortran, or SQL. So you're not limited to R. R Markdown is pretty much just a syntax.
So let's have a look, and we can compare it a bit with Jupyter notebooks. You have a YAML header, where you can provide metadata, and you have your code blocks, which are fenced by backticks: a block starts with backticks and ends with backticks.
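To make that structure concrete, here is a small sketch in Python that takes an R Markdown-style text and separates the metadata header from the fenced code blocks. The sample document, the chunk label, and the two regular expressions are assumptions made for this example; this is not how knitr itself parses files.

```python
import re

FENCE = "`" * 3  # three backticks open and close a code block

# Hypothetical R Markdown source, kept in a string for illustration.
RMD = "\n".join([
    "---",
    'title: "Demo"',
    "output: html_document",
    "---",
    "",
    "Some prose about the data.",
    "",
    FENCE + "{r load-data}",
    "x <- c(1, 2, 3)",
    FENCE,
    "",
    "More prose.",
    "",
    FENCE + "{python}",
    "print('hello')",
    FENCE,
    "",
])


def split_header(text):
    """Return (header, body); the header sits between the first two '---' lines."""
    match = re.match(r"^---\n(.*?)\n---\n(.*)$", text, flags=re.DOTALL)
    if match is None:
        return "", text
    return match.group(1), match.group(2)


# A code block: an opening fence with {engine ...}, the code, a closing fence.
CHUNK = re.compile(r"^`{3}\{(\w+)[^}]*\}\n(.*?)\n`{3}$", flags=re.DOTALL | re.MULTILINE)


def code_chunks(body):
    """Return a list of (engine, code) pairs found in the body."""
    return [(m.group(1), m.group(2)) for m in CHUNK.finditer(body)]


header, body = split_header(RMD)
print(header)
print(code_chunks(body))
```

The point of the sketch is only that the whole document is ordinary text: a few lines of string handling are enough to find the header and the code, with no dedicated server or file format library involved.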
But everything is text. It's plain text. So you can write your R Markdown text in RStudio. When you render your document, you get the whole document with the outputs inside it, but you can also run specific cells if you want. You also have the option to run all the cells above.
Now the fun one: Observable. It's a JavaScript notebook made for data visualization.
It was created by the creator of D3.js. The current Observable platform is closed source, but the runtime is open source, and the libraries are also open source.
Actually, libraries are often notebooks too. Public notebooks are free to use and re-share, so you can load a notebook to access functions from that notebook. It's pretty fun to use. You can easily integrate it into a website. For example, you create a map with a notebook, put just the iframe in your website, and add a link back to the notebook so people can see how you created the map.
Let's talk about the geospatial ecosystem. It's quite young, maybe one or two years old for some libraries. You can make graphs and maps with Plot, but you also have access to Bertin.js for thematic mapping. It was created by a French cartographer, and it's an opinionated mapping tool dedicated to thematic mapping.
You can also access spatial operations with geotoolbox, and you can do some projections. You can also access data formats other than JSON with GDAL, through a port of GDAL to WebAssembly. In geotoolbox, for example, you can do buffers and centroids, you can clip, and you can compute a bounding box. It's not a full-featured GIS like we are used to, but there are still some tools for basic operations.
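To give an idea of what two of these basic operations involve, here is a minimal sketch in plain Python that computes the bounding box and the centroid of a GeoJSON polygon. It is not the geotoolbox API (which is JavaScript); the polygon is made up, the function names are hypothetical, and the math assumes planar coordinates and a closed exterior ring without holes.

```python
def bbox(ring):
    """Bounding box (min_x, min_y, max_x, max_y) of a list of [x, y] points."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return (min(xs), min(ys), max(xs), max(ys))


def centroid(ring):
    """Area-weighted centroid of a closed ring (first point == last point).

    Uses the shoelace formula on planar coordinates.
    """
    area2 = 0.0  # twice the signed area
    cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:]):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0:
        raise ValueError("degenerate ring: zero area")
    return (cx / (3 * area2), cy / (3 * area2))


# Hypothetical GeoJSON geometry: a 4 x 2 rectangle.
polygon = {
    "type": "Polygon",
    "coordinates": [[[0, 0], [4, 0], [4, 2], [0, 2], [0, 0]]],
}

exterior = polygon["coordinates"][0]
print(bbox(exterior))      # (0, 0, 4, 2)
print(centroid(exterior))  # (2.0, 1.0)
```

On real longitude and latitude data, a library would normally handle projections and holes; the sketch only shows the kind of computation hidden behind "centroid" and "bbox".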
So this is actually not live, but I will show you the rendered code afterward. In this block of code, for example, I get the world GeoJSON from, I think it's Natural Earth, for Africa. Actually it's not my map, it's Nicolas Lambert's map, I just stole the code. We all do that, copy and paste; it's how we work, obviously. And then we draw a map. I won't go through the parameters; if you want to see more, you can access the notebook.
And so, yeah, this is what the code rendered. I might have done the same for Python and R. Actually, while I was there, that's code that was running: I just put in the line of code, and it ran. Okay, so that's working.
So the last tool is Quarto. It's pretty new, from last year. It's a tool made for scientific and technical publication.
It benefits from all the experience gained with R Markdown. Actually, I think the syntax is a little bit better for some things than in plain, old R Markdown. And you can have various outputs: this presentation is a Quarto document, actually.
But you can also have reports in Word or PDF. You can also have websites and wikis. It works with Pandoc, so you can get all the outputs that Pandoc offers. But it's limited to four languages: Python, R, Julia, and Observable JS. That's why I was able to create a map within this Quarto document. With Quarto, you rely on your own Python, R, and Julia installations, but the Observable runtime is shipped with Quarto.
So let's talk about the limitations. Cell execution is not necessarily linear. If you run the whole document, it will be linear. But if you run individual cells, move around, change things, and get back to another cell, at some point you might lose track of the current state of all the variables, because the variable state might be hidden. So the tip is to run the whole document from time to time to get a clean state of everything, and it will be okay.
Notebooks can become very messy. Actually, I had an intern who wrote 6,000 lines of code in R Markdown. I opened the document and said, oh, I won't run it. Never. And I told him, okay, you have to split it into maybe five or six parts, because it was actually several analyses. So it's a source of bad habits, actually.
And a notebook doesn't care about the software environment. You have to care about the software environment.
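The hidden-state problem mentioned above can be reproduced in a few lines. The sketch below, in Python, simulates a notebook kernel with a shared dictionary and runs cell sources with `exec`; the two cells and the `run_cell` helper are invented for the example and are not part of any notebook tool.

```python
def run_cell(source, namespace):
    """Execute one cell's source in the shared kernel namespace."""
    exec(source, namespace)


# Interactive session: cell 1, then cell 2, then cell 1 is edited and
# re-run on its own, without re-running cell 2.
kernel = {}
run_cell("x = 1", kernel)        # cell 1, first version
run_cell("y = x * 10", kernel)   # cell 2
run_cell("x = 5", kernel)        # cell 1 edited; cell 2 is now stale
stale = (kernel["x"], kernel["y"])

# Clean run: restart and execute the document top to bottom as it now reads.
fresh_kernel = {}
for source in ["x = 5", "y = x * 10"]:
    run_cell(source, fresh_kernel)
fresh = (fresh_kernel["x"], fresh_kernel["y"])

print(stale)  # (5, 10)
print(fresh)  # (5, 50)
```

The document on screen is the same in both cases, yet `y` differs, which is why re-running the whole document from a clean state from time to time is a useful habit.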
The notebook won't do that for you. So if you are, like me, into reproducible science, you have to take care of that yourself and think about it. A notebook helps to understand what people are doing, what they want, and how they process the data to get their results. So it's good for reproducibility, but it's not perfect: you have to take care of the software environment.
There is a presentation at JupyterCon by Joel Grus called "I Don't Like Notebooks". You should look at it. There are lots of memes and actually very good points on these issues. It's more about Jupyter notebooks, so not all of those issues are necessarily present in other software.
Let me conclude. The trending notebooks that I showed you today can do geospatial stuff. Most are multi-language, except Observable, which is only JavaScript.
The capabilities are the same as those of the language you use. It's a good way to interact and play with things, but there are some caveats. A notebook can become a messy document, and it doesn't encourage modularity. So you have to take that into account and maybe rework things afterwards: you are in draft mode, and then you export your functions and create modules from them. The reproducibility is quite good, but you have to take care of the software environment.
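One possible way to go from draft mode to a module is sketched below in Python: keep only the imports and function definitions from the code cells, write them to a file, and import that file like any other module. The cell contents, the `geo_helpers` module name, and both helper functions are hypothetical; `ast.unparse` needs Python 3.9 or later and drops comments.

```python
import ast
import importlib.util
import tempfile
from pathlib import Path

# Hypothetical draft notebook: the source of each code cell, in order.
draft_cells = [
    "import math",
    "def circle_area(radius):\n    return math.pi * radius ** 2",
    "print(circle_area(2))  # exploratory call, stays in the notebook",
    "def km_to_m(km):\n    return km * 1000",
]


def extract_definitions(cells):
    """Keep only imports and function definitions from the cell sources."""
    kept = []
    for source in cells:
        for node in ast.parse(source).body:
            if isinstance(node, (ast.Import, ast.ImportFrom, ast.FunctionDef)):
                kept.append(ast.unparse(node))
    return "\n\n".join(kept) + "\n"


def load_module(path, name):
    """Import a Python file from an explicit path."""
    spec = importlib.util.spec_from_file_location(name, str(path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


module_source = extract_definitions(draft_cells)

with tempfile.TemporaryDirectory() as tmp:
    module_path = Path(tmp) / "geo_helpers.py"
    module_path.write_text(module_source)
    helpers = load_module(module_path, "geo_helpers")

print(helpers.km_to_m(3))  # 3000
```

The exploratory `print` call is left behind in the notebook, while the reusable functions end up in a file that can be tested and shared on its own.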
If you have any questions, I think I'm on time.