Reproducible and shareable notebooks across a data science team
Formal Metadata
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/67160 (DOI)
Berlin Buzzwords 2022, talk 44 of 56
Transcript: English (auto-generated)
00:07
Okay, so thank you everyone for attending this talk. Today we are going to talk about notebooks, one of the main tools of data scientists, and we are going to show you how our team makes them shareable and reproducible.
00:24
First, a few words about us. I'm Pascal, a data scientist at CybelAngel, and I'm focusing on building robust and efficient machine learning models to identify all kinds of digital threats. Hi, I'm Mike, a machine learning engineer at CybelAngel, and I'm focusing on leading the
00:45
development of machine learning products from idea to production. We also have Julia; she contributed a lot to the work presented today but unfortunately couldn't attend the presentation. So first, let me ask you a question.
01:03
In your opinion, which pack of candies is better for sharing among, let's say, three people: left or right? Anyone? Let's say that in the left picture you would have just one flavor but more candies,
01:20
and on the right you would have fewer candies but more flavors. Anyone, left or right? For me, I like the right one, because I have Tagada, Croco, everything. Me, I would say left: I love the red ones. Okay, you may be wondering what on earth he is talking about.
01:40
Actually, we are going to show you how we switched from left to right in the context of scaling up a data science team. Just keep in mind that the candies can be seen as notebooks, the candy pack can be seen as a data science project, and the packaging itself can be seen as the virtual
02:01
machine hosting the project. Here is today's agenda. First, some context about our company, our team, and how they evolved. Then we will take you through our notebook journey, followed by a quick demo of our final solution. After that we will have some key takeaways, and if you have any questions, please keep
02:24
them for the Q&A session. Okay, first some context: what kind of company are we working in, what kind of evolution did we face, and what challenges did those evolutions bring?
02:42
CybelAngel is a cybersecurity scale-up whose job is to protect companies from external cyber threats. Basically, we scan every layer of the internet in order to find sensitive data leaks belonging to our clients, and the solution can be seen as four main blocks.
03:02
The first one is comprehensive scanning, which is about detecting potential data belonging to our clients; here we detect billions of samples. So we need the second block, the machine learning block, in order to filter this noise and to send our analysts only the potentially sensitive data.
03:25
The job of our analysts is to identify the real sensitive data in order to alert our clients via a SaaS interface. Here is an example of sensitive data we can find on the internet.
03:43
So here, for example, we have the blueprint of a data center, found on an open server, and you can see the CCTV camera locations on the blueprint. There are obviously big risks for the company, such as the physical security of the site, but
04:02
also customer data leaks if anyone enters the site. With several fundraising events, the company has lately switched from a startup to a scale-up, with new challenges; and with more clients in the pipe, we also need more robust
04:23
and efficient machine learning models. In this context, the data team has also evolved. Over the years there have been departures and arrivals, we have gained more senior members, and over the last year we saw the creation
04:40
of three official sub-data teams and the arrival of machine learning engineers like Mike in the data science team. All this context brought new challenges, which can mainly be seen along two axes. The first axis is quality and robustness.
05:01
More teamwork, but also more clients in our pipe, implies a need for more quality and more robustness; here we try to answer how to better support growth. The second axis is traceability and reproducibility. Here we want to be able to better trace decisions back between exploration and production,
05:25
and we want, for example, a new member of the team to be able to easily retrain an existing machine learning model. So here, we want to better support knowledge transfer. In this context, in our data science team, each of us contributes to projects using
05:47
notebooks before code industrialization. Let's see what kind of issues we faced and how we overcame them. That's our notebook journey. So at CybelAngel, we use Jupyter notebooks. Jupyter is an open-source server-client application that allows editing and running
06:06
notebook documents via a web browser. So notebooks are a must-have for data exploration and model development. As you can see in the right picture, they are structured as cells, so you
06:20
can easily load and explore your data, visualize it in plots, interact with those plots, and write up your analysis in Markdown. So notebooks are great for storytelling and allow flexible experimentation. And if I show you an example here, you would see that a lot of valuable work and
06:45
interesting things can be done in notebooks which, as I said, are structured as cells. But if I show you the plain-text version of these notebooks, you would see that this format is tricky to share, hardly reproducible, and simply not built for
07:06
production purposes, because here you have a JSON format with outputs, such as images, embedded in it, and it's simply not built for versioning like you have with scripts. Indeed, if you look at a common machine learning workflow, you can see it as two
07:27
main parts. The first part is exploration and experimentation: basically the training, with exploratory data analysis, feature engineering, and model evaluation, and here you mainly deal with notebooks.
07:41
You can experiment, you can prototype, and sometimes it's dirty. In the second part of this workflow, the production part, you only deal with scripts, and you have to have something clear and reproducible. And our main challenge is to facilitate the transition between notebooks and scripts
08:04
in order to avoid late transfer of code from notebooks to scripts, and to reduce double work between training and production. Focusing on notebook versioning can help a lot.
08:22
Let's see how we started and where we arrived. So, remember the first picture I showed you? Back in 2019, we were rather in this kind of sharing: one data scientist per project, working on virtual machines hosted on a server.
08:40
So, based on the picture, you had one data scientist working on just one candy pack, so one flavor, and the code was sometimes remote on those virtual machines, sometimes local on our personal laptops, but in the end there were no common guidelines.
09:02
It means that apart from the models and the datasets, the notebooks and scripts were often left aside, either on those virtual machines or, worse, on our personal laptops, so there was no traceability. There are obviously some reasons we started like that.
09:22
First, at that time, managed notebooks were not widespread. If you are not familiar with managed notebooks, they are just a notebook solution hosted on a managed virtual machine from a cloud provider. Also, at that time, we needed a lot of computational power to train our models, so
09:43
this solution was quite effective for that. And finally, it was a startup choice because we wanted to go fast, and even more, we wanted to fail fast. However, even if we had computational power and we could go fast, there were obviously
10:01
some drawbacks. First, maintenance was not easy, because we had to upgrade and update those machines ourselves, and you need some skill to do it. Plus, there was a lack of flexibility, because those machines had fixed resources,
10:23
and most of the time we needed less; sometimes we needed more, but in the end we paid too much for it. Furthermore, we had a lack of quality and robustness, because with such a solution
10:40
it was very hard for several people to work on the same project. And finally, there was a lack of traceability and reproducibility because, as I said, in the end, apart from models and datasets, not much was stored in a specific place, especially notebooks. And entering the scale-up phase of the company and the data team led to a paradigm shift.
11:07
We wanted to go far instead of going fast. And as the proverb says: if you want to go fast, go alone; if you want to go far, go together. That's why, starting in 2020, we decided to be at least two data scientists on the same project.
11:25
This implies a focus on quality and robustness, and it highlighted our need for better sharing and traceability. That's why we started to work with managed notebooks, and Mike is going to tell
11:41
you more about it. Yes, in 2020, we started to migrate to GCP and decided to use a managed solution, AI Notebooks. AI Notebooks is a platform that allows you to use Jupyter notebooks in the cloud
12:02
with the specification of the machine you need. In this figure, you have one example of AI Notebooks. You can see several notebooks that have already been created, and you have the machine type, which is the configuration of your virtual machine. You also have the environment, which is the machine learning packages, and you have another option,
12:25
which is GPU: you can choose whether or not to use a GPU in your virtual machine. Using GCP AI Notebooks, each team member can easily spawn new instances based on the project requirements.
12:41
Compared to the initial solution, we have several advantages. The first one is lower cost: now that we use a virtual machine, we can turn it off or delete it at the end of the project. We can also store all the notebooks in a GitHub repository and the data in cloud storage,
13:04
for example. We also have a flexibility enhancement: we don't need to install a lot of packages, and we can update our virtual machine, for example to increase
13:26
the memory size or the CPU. We also have better sharing: each team member has easy access to the data and
13:41
also to the notebooks. Finally, it's a managed solution: the environments and machine types already exist, and we don't need to maintain the virtual machine; everything is done automatically. Here is an example of one project on AI Notebooks.
14:06
You can see our project, and then in the notebooks folder we have several folders. In one folder you can find these files: you can see a lot of notebooks,
14:20
as well as CSV or pickle files. But you can also notice that we have a lot of manual versioning of a single file: you have V1, V2, etc. This is because there are many users on the virtual machine, all working at the same time.
14:42
Not everything works the way you want: we have a lot of manual versioning and also repeated code, so it's very tricky to work like that. By switching to the cloud, we solved a lot of problems, but we also realized that there was still a lot of room for improvement.
15:02
So I'm going to show you some limitations we had. The first limitation is user management: it's impossible to know the latest version of a piece of code. Also, in AI Notebooks, you cannot use different credentials to log in.
15:23
So anyone on the GCP project can log into the virtual machine as the default user, and then it's very difficult or impossible to know who modified a file. The second limitation is traceability and reproducibility.
15:42
So today, as Pascal said, we have two parts. We only version the code that is in production, but we don't version the code of the first part, exploration and experimentation, because we only use scripts in the second part,
16:02
while we use many notebooks, and also some scripts, in the first part. And versioning notebooks is very complicated, because in our team we use Git for versioning.
16:20
And Git is not designed to work on notebooks, because notebooks are JSON documents. And this is a problem. I'm going to show you an example of a git diff on a notebook file: the difference between two versions of a notebook.
16:41
Yeah, you can see it's practically impossible to read; you cannot tell what the modification was. And the third limitation is double work. Today, the way we work, we have two parts: we write some functions in notebooks,
17:03
and then for production we just copy the code and push it into production. This is not very good; this is why we have a lot of duplicated code. Those are the three drawbacks we have today, but as data scientists we also have some constraints.
17:26
One is to use both notebooks and scripts: we need notebooks for data exploration and visualization, but we also need scripts in production, because we want a portable and scalable pipeline.
17:48
So now we have three drawbacks, which are user management, traceability and reproducibility, and double work, plus this additional constraint.
18:02
So, to address these drawbacks, we set up several tools. The first tool we set up is JupyterHub, to solve the user-management limitation. What is JupyterHub? JupyterHub is a way to enable the simultaneous use of Jupyter notebooks by multiple users.
18:25
So compared to classic AI Notebooks, with JupyterHub each user has his own session. How does it work in our team today? We set up JupyterHub on the virtual machine, and we also have some default kernels.
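For readers who want to try this, a multi-user setup like the one described can be sketched in a `jupyterhub_config.py`; the hostnames and credentials below are placeholders, not the speakers' actual configuration:

```python
# jupyterhub_config.py -- minimal sketch of a GitLab-authenticated JupyterHub
# running on a single shared virtual machine. All values are placeholders.
c = get_config()  # noqa: F821 -- injected by JupyterHub when it loads this file

# Authenticate users against GitLab (requires the oauthenticator package)
c.JupyterHub.authenticator_class = "gitlab"
c.GitLabOAuthenticator.oauth_callback_url = (
    "https://notebooks.example.com/hub/oauth_callback"
)
c.GitLabOAuthenticator.client_id = "<gitlab-application-id>"
c.GitLabOAuthenticator.client_secret = "<gitlab-application-secret>"

# Each authenticated user gets their own single-user server (session)
c.Spawner.default_url = "/lab"
```

This is a configuration fragment, not a standalone script: JupyterHub provides `get_config()` when it loads the file.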
18:44
And when someone wants to work, he uses his GitLab credentials, connects to JupyterHub, and enters his session. And you can work with these kernels,
19:02
or you can create your own custom kernel; you do what you want in your session. To solve reproducibility and traceability, we use Jupytext. Jupytext is a Jupyter plugin that can save Jupyter notebooks in various text formats,
19:22
such as Markdown files, or programming languages like Python or R; in our team we use Python. How does Jupytext work? Jupytext pairs your notebook with a .py file, and then every modification you make in your notebook
19:43
automatically updates the .py file. But Jupytext only writes the notebook's cell information to the .py file, not the outputs. So finally, we set up JupyterHub to solve the user-management limitation,
20:06
and Jupytext to solve reproducibility and traceability, and by combining these tools we solve the double work. So now, how do we work compared to the initial solution, where we had one machine per project?
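To make the pairing concrete: a notebook saved by Jupytext in its `py:percent` format is plain Python with `# %%` cell markers, so Git can diff it line by line. The content below is a hypothetical example, not taken from the talk:

```python
# %% [markdown]
# # Quick analysis
# A Markdown cell becomes a commented block in the paired script.

# %%
# A code cell is delimited by "# %%"; outputs are never written to this file.
import statistics

values = [3, 1, 4, 1, 5, 9]
mean_value = statistics.mean(values)
print(mean_value)
```

Opening such a .py file in Jupyter with Jupytext installed regenerates the .ipynb view, cells and all, while the text file stays the single source of truth for version control.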
20:25
Today we have one machine, each user has his own session, and in your session you can have one or several projects. So instead of three separate candy packs, we now have one pack with several projects.
20:43
Now I'm going to show you a quick demo of how we work. This is the connection page: with JupyterHub you have a link
21:09
you can set up; you connect with one URL, authenticate with your GitLab or Google credentials, and then enter your session.
21:27
Now we are going to clone an existing project, demo-jupyterhub, and show you the structure of our project.
21:43
In our team we use a custom cookiecutter. You can see the structure: you have data, models, notebooks, notebook scripts, and so on.
22:02
I will show you the content of the .gitignore here: you can see we have the notebooks folder in our .gitignore, because we don't want to version the notebooks as .ipynb files.
22:21
And here we have the notebook scripts folder with the .py files; I'm going to explain why. You also have unit tests, integration tests, and some configuration files,
22:43
pyproject.toml and poetry.lock. In the pyproject.toml you have some information about your project, and you also have the packages and dependencies,
23:02
and then the Jupytext configuration. Here you can see it: you have the notebooks folder and also the notebook scripts folder. This configuration means that for every .ipynb file you create in the notebooks folder,
23:24
you have an equivalent in the notebook scripts folder. Now, we use Poetry for the virtualenv, and because we use Poetry we can
23:45
install packages in our virtualenv. And here you have the .py file, and you can see that the notebooks folder is empty.
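For reference, the directory pairing described here could be declared in `pyproject.toml` roughly like this; the folder names come from the demo, but treat the exact syntax as an assumption to check against the Jupytext documentation:

```toml
[tool.jupytext]
# Pair every notebook under notebooks/ with a py:percent script
# under notebook_scripts/ (and vice versa).
formats = "notebooks///ipynb,notebook_scripts///py:percent"
```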
24:05
I'm going to show you the content of this file: you can see we have the Jupytext metadata at the top. And now we are going to generate the .ipynb:
24:29
you just open your .py file as a notebook. When you do this action, it generates the .ipynb. Also, we already installed a
24:48
custom kernel; you have the possibility to install a custom kernel with Poetry. You can install a lot of kernels, which you can see here; we use the kernel for this project,
25:05
and you can see it automatically creates the .ipynb file. Then you can open your notebook, use the custom kernel, and execute the cells;
25:20
you can see the output too. Here we use a public dataset; you have the code in the cell and also the output in your notebook. You can see
25:41
here we just executed the code; we didn't change anything. Now I'm going to use git status to check whether I made a modification:
26:07
no modification, no update. Now I'm going to add a new piece of code in a new cell. I execute it, and you can see the output,
26:26
and then, going back to the .py, you can see that our .py was automatically updated. Now we run the same command, git status,
26:46
and you can see that quick_analysis.py has been updated. Now we are going to show you the difference between the two versions: with git diff you can see that we added this piece of code. And now we are going to show you the content of the .ipynb file:
27:07
we remove the notebooks folder from the .gitignore and run git status,
27:26
and, just to show you the content of this file: the .ipynb in Git is very tricky; it's basically impossible to read this file. And that's it.
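The contrast shown in the demo can be reproduced offline with Python's standard difflib; the file contents here are invented for illustration, not taken from the speakers' project:

```python
# Toy illustration: a diff over the paired py:percent script stays readable,
# whereas a diff over the raw .ipynb JSON (metadata, outputs, cell ids) does not.
import difflib

old_script = """\
# %%
df = load_data()
df.head()
"""

new_script = """\
# %%
df = load_data()
df = df.dropna()
df.head()
"""

diff = list(difflib.unified_diff(
    old_script.splitlines(), new_script.splitlines(), lineterm=""
))
print("\n".join(diff))
```

The added line shows up as a single `+df = df.dropna()` entry, which is exactly what a reviewer wants to see in a merge request.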
27:46
Okay, so let's go to the key takeaways of this talk. The main takeaway is that notebook versioning is possible. The main solution we used is Jupytext, and you can combine it with
28:02
JupyterHub if you work with other teammates. First, it will probably allow your company to deploy projects to production faster. Secondly, it will ease your workflow as a data scientist or machine learning engineer: you will have a structured workflow with reproducible and
28:25
shareable notebooks, and in the end you will have more quality, more robustness, more flexibility, and so better teamwork. So thank you, everyone, if you have any questions, please feel free. Thank you. Any questions? Yeah, sure. Hello, thank you for the talk, first of all.
28:56
You mentioned that Jupyter notebooks were not suitable for production, but then in your
29:05
architecture you are running Jupyter notebooks in production; is that because of JupyterHub, or is it a different trick? I don't know if the question is clear. You mean, why can't we run Jupyter notebooks in production? No: whether it is suitable or not, the format
29:25
or the idea of Jupyter notebooks, for carrying out an actual task in production. I think I don't understand; can you repeat, please?
29:40
Sorry. Do you consider that a Jupyter notebook is the right way to run a batch process, or would you rather write a binary, let's say, write a script, for carrying out real tasks in production? No, we don't consider that. Actually, we wanted to version
30:03
notebooks because we want to ease our workflow in the experimentation part, but in the end we write scripts: the feature engineering and so on is done in scripts, in packages, in microservices. No, we don't use notebooks for that. Okay, so it's for the exploratory
30:25
phase, and then there should be a translation to another format. Yes; and actually our solution does not allow notebooks in the production part, it just eases the
30:41
process of exploration and experimentation. Thank you. Any more questions? Yes.
31:02
Well, first of all, thank you. I think we deal with a lot of the same issues, and I was wondering how you deal with the testability of your code. For example, in your exploration phase, do you already write tests for your functions, or is that something you only do when you transition to production code? Do you want to take it? Yes, we have unit tests for some
31:27
functions, because with a notebook we can work both in the notebook and in the script. When we have the same function, we create just one piece of code in a new file, and
31:42
then when we want to version it, we test it before production. And furthermore, when we do feature engineering, we try to create classes that can be scripted into a .py file, and to avoid
32:04
things like pandas.apply, in order to have something clean that can be tested directly. We do that directly in the notebook, to transfer it to the production step more easily and avoid losing time later. Yep, thank you. I think we'll have to try that out.
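A minimal sketch of that idea, with hypothetical names rather than the speakers' code: write the transformation as a plain importable function, so the notebook, the production package, and the unit tests all call the same code.

```python
# Hypothetical feature-engineering step written as a plain function instead of
# an inline lambda passed to pandas.apply, so it can be unit-tested and reused.
from urllib.parse import urlparse


def domain_from_url(url: str) -> str:
    """Return the hostname of a URL, or an empty string if none is present."""
    return urlparse(url).netloc


# In the notebook, something like: df["domain"] = df["url"].map(domain_from_url)
# In the test suite, plain assertions cover the same logic:
assert domain_from_url("https://example.com/leak.pdf") == "example.com"
assert domain_from_url("not a url") == ""
```

Keeping the logic in a named, side-effect-free function is what makes the later move from notebook to production package a copy-free import instead of a rewrite.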
32:32
So, yeah, first of all, thank you. JupyterHub looks really interesting for us as well. I have one question regarding Jupytext. Did you try some other kinds of tools to remove
32:47
the parts of Jupyter notebooks that are not great for looking at diffs? Because I find that it's usually mostly the outputs and graphs that make it hard to look at differences. So,
33:02
we usually clean the notebooks: we just remove all the output and all the kernel meta-information, like all the timing information and so on, and then it's kind of okay. But, yeah, the question was really: do you have any other problems with Jupytext, or does
33:22
it solve everything for you now? For now, we'll say that it's okay. Maybe it's not a perfect solution, but for now we are able to version notebooks, because Jupytext is bidirectional, which means that you can generate the .py from your
34:01
notebook and vice versa. Thanks. Yeah, because I'm not sure you can do that with other solutions, like just cleaning the outputs; that's not bidirectional, so maybe it's an advantage that Jupytext has. Yeah, if you're just cleaning the outputs, you still have the notebook file, right? Yeah, yeah, of course. But I mean, if you do a modification in your
34:25
script, for example, you won't have it in your .ipynb notebook, so maybe that's the advantage. I'll definitely try out your solution, thanks. Thank you.
34:40
Maybe a quick question regarding infrastructure. JupyterHub: is that compatible with AI Notebooks, or did you switch to a self-hosted solution now? It's a virtual machine, actually. Yes, you just install JupyterHub on a virtual machine, and then you already have notebooks, because it's a package.
35:04
It's a package. So you're still using AI Notebooks in the background? No, it's not an AI Notebooks solution. I think you can also do it with AWS: you install it on a specific machine, then we have a DNS, we go there, and it's very easy because you
35:21
connect in just one click. Sorry, we seem to have a lot of pain with notebooks. And how is it with scalability now with JupyterHub? Now you have a fixed machine, right, basically? Yes.
35:40
And you can make modifications? Yeah, you can modify your machine: you can decide to increase the memory size and also the CPU, as you want. You can decide to turn it off every night, so you have more flexibility with your resources. For example, you can also increase the size of the disk. You can do everything you want
36:03
because you can update the machine type. And a problem we had with our old virtual machines was that if anyone used up all the resources, you just couldn't do anything; with the solution
36:23
we implemented, you can also fix the resources per user. I think you could maybe do it with another virtual machine, but this was easier. Cool, thanks. Thank you. Any more questions? Okay, thank you very much, gentlemen; that was fun.