Docker and Python: making them play nicely and securely for Data Science and ML
Formal Metadata

Title of Series: EuroPython 2020 (Part 51 of 130)
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/49952 (DOI)
Transcript: English (auto-generated)
00:07
All right, welcome everyone to the Parrot track of EuroPython 2020. I hope you all had a good time in the initial part of the conference, where Mark introduced you to all the platforms, and this is the first time we are having this online.
00:23
So this is gonna be the first talk of the data science session, which is exciting, in the Parrot track. So without further ado, I'm gonna introduce you to Tania, who's joining us from the UK. Okay, oh, you can unmute yourself.
00:48
Of course, hello everyone. Of course, my dog decided that it's the perfect time to start barking at whatever she's barking now, because that's how it works. But thank you very much everyone for joining my talk.
01:04
And I'm gonna share my slides if that's okay. Yeah, I'm gonna let you go. Okay, let me share it now. Whoa, fabulous.
01:21
So for those of you that don't know me, my name is Tania Allard, and I'm a senior developer advocate at Microsoft. This is where you can find me. You can find me on Twitter, on GitHub, and on my personal site. I run a lot of projects, like side projects,
01:42
things like Mentored Sprints. I'm also starting a podcast called Python 101, and I do a lot, lot, lot, lot of stuff here and there. After the talk, you're gonna be able to find these slides at this URL.
02:00
They're not yet available, but they're gonna be made available soon after I finish the presentation. If you have any questions, feel free to put them on Discord. And I, well, I'll answer some questions right after my talk, and some others, I'll just jump onto the chat myself.
02:21
And I'm also gonna be available later on at the Microsoft sponsor room so we can carry on the conversation or talk about anything else that you're interested in. And just to set expectations on what this presentation is gonna cover,
02:40
I'm just gonna work out, well, explain why you want to use Docker, especially in a data science and machine learning context, because it can be a bit different to when you're developing web apps or other sorts of applications with Python. Gonna give you some tips on improving your security and performance when you're working with Docker,
03:01
and some ways in which you can actually automate your workflow so you don't have to start doing everything from scratch. And I'm gonna finish up with a summary of tips and tricks on how to use Docker so you can start jumping straight into it. So let's start with why do you want to use Docker?
03:22
I am sure a few of you have been in this situation where you're developing an application, it can be anything, a model or a web app or an API that returns a prediction or something. So when you actually try to share this with somebody else
03:42
without sharing your environment, as well as the specifications, you're gonna see that your folks or your friends actually encounter this problem. Either modules are not found, or data or environment variables are not set.
04:00
So if you're not sharing all of the required things that folks need to actually rerun or reproduce your analysis or your app, then that's a massive blocker for them, really. So Docker is actually an amazing tool that helps you to create, deploy and run your applications using containers.
04:23
And this gets rid of the problem of, on one hand, your laptop not being a production environment: it allows you to mimic environments where you're gonna actually be deploying your apps or the products that your customers use. And this is how, throughout my presentation,
04:41
I'm gonna pour things into a container. This is just a mental exercise, just a mental representation, but it will help us to understand where in the workflow a container fits. So as I said, again, containers are a great way of solving the problem of your laptop
05:01
is not a production environment. So it really, really allows you to get your software or your application and your model from one computing environment to another. So it can be your laptop, your test environment, your staging environment and your production environments seamlessly and with all dependencies and requirements already bundled
05:21
so that basically you can run it out of the box or out of the container. So then when you're working with containers, not only you are developing your application or your model, but you're also bundling together all the libraries, dependencies, the runtime environment and configuration file within a container
05:43
so then folks can reproduce and reuse and build on your application in a seamless manner. So if you're familiar with virtual machines, if you've ever used virtual machines to develop on a different platform or a different operating system,
06:01
this abstraction might sound a bit familiar, but the nice thing about Docker and containers is that the abstraction actually happens at the app level. So you have your infrastructure, whatever your infrastructure is, if it's cloud, if it's local, if it's an HPC,
06:21
and you have your host operating system, Docker sits on top of that and you can have multiple apps containerized. So it takes very little overhead and you can have multiple runtime environments, multiple containers running on the same infrastructure as isolated processes. The difference with the virtual machine
06:41
is that all the abstraction happens at the hardware level. So instead you have your infrastructure, a hypervisor, and then you have full guest operating systems. So you can imagine that you have a full Windows operating system, a full Linux system, for example a full Ubuntu or a Red Hat,
07:03
but it's very, very bloated because it includes all of the native packages and all of the native dependencies that said operating system needs to operate. Oh, I'm going backwards. Bear with me.
07:21
So that makes it much, much more bloated than an actual container, because you'll also have the binaries. Now, if you're not very familiar with the lingo of containers and images, sometimes you're gonna go read tutorials
07:41
and this happened to me at the very beginning when I was just getting started, and there are so many words that are thrown in. So for example, let's start with image, which is an archive with all the data needed to run the app. So basically that's just a snapshot of all the libraries, dependencies and the code, for example, that you need.
08:03
You then have a tag as you would normally do with your software when it's in version control and you make a new release. Every time you update, for example, binaries or the libraries, you make a tag and then you can pull that image from repositories like Docker Hub, for example,
08:21
and you then run it, you run that image, and every time you run that image it's gonna create a container, and that's actually where you can do your development work, where you can do your work and you can mount volumes, you can process some data, but it all spins from the Docker image.
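Roughly, and just as a hedged illustration of that image, tag and container vocabulary (the official python image, tag and paths here are only placeholders), the commands look something like this:

    # Pull a specific tagged image from a registry such as Docker Hub
    docker pull python:3.8-slim-buster
    # Every `docker run` spins up a new container from that same image
    docker run -it --rm python:3.8-slim-buster python
    # A second container from the same image, this time with a local directory mounted
    docker run -it --rm -v "$(pwd)/data:/data" python:3.8-slim-buster bash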
08:42
You can actually have multiple containers spin out from the same image. Now, again, if you've tried to learn how to use Docker, more than likely you've gone to your web search engine and you've typed Python and Docker tutorial
09:00
and you're gonna find out that most folks have actually developed tutorials based on web data or, sorry, web applications, but things are not exactly the same when we're doing data science and machine learning. We can have a lot of complex setups and dependencies
09:21
especially if you work with things like Arrow or TensorFlow or Kubeflow or things of that sort, those dependencies can be very complex and getting things right can sometimes take a long time. We also have high reliance on data. We have data ingress and egress all the time
09:42
and we even work with data that is not publicly accessible or that is sensitive. So making sure that our development environments are secure is critical. And we also work, especially when you're in an iterative research and development process of developing either a new model
10:01
or an algorithm or an application, that process is very, very fast evolving. So sometimes you're installing dependencies just to try something out. It doesn't work. Then you try another dependency and so on, and getting things right in Docker and orchestrating containers can take a lot of time.
10:20
It can be a bit complex when we're working with them. So how is it different from web apps, for example? We all know that our Python packaging ecosystem is a bit complex, to say the least.
10:42
So sometimes we want to understand or we want to find where it is good enough to start packaging. We have very, very complex dependencies. For example, if you have a team where some folks are working with Python
11:00
and some others are working with R and some others with Julia, where do you draw the line? We need BLAS, we need R, we need Julia. We need all of these dependencies. We need to optimize builds for this. It is very, very hard sometimes to reconcile all of that and get robust and also lightweight containers
11:22
for data science work. And also, in our case, not every deliverable is an app. Not all of the things that you do in machine learning, or all of the products that you're building in machine learning, are gonna end up as an app or as an API. We have multiple types of deliverables.
11:43
And also there's a lot of emphasis out there saying that machine learning equals model. And that is not true either. Not every deliverable is a model. As I said before also, we rely on data. Data is our gold, our primary material.
12:04
So we deal with this in many, many ways. And because of, again, how our scientific Python ecosystem works, we're gonna have a mixture of wheels and a mixture of compiled packages that we're gonna need.
12:23
Probably we're gonna have conda. Some other folks are gonna be using things like Poetry and Pipenv. And we have a lot of different channels as well if you're using conda. And as well, if you're working in a multidisciplinary team
12:42
or you're working across multiple projects in machine learning, people are gonna have and gonna need different security access levels for both data and software. For example, if you're working with highly confidential data, you might just want to block internet access from your container,
13:01
but other folks are probably working with public datasets and they don't need such high security access levels. So again, how do you reconcile all of these apparently conflicting requirements? It can be very, very tricky. And finally, especially when you're working
13:21
to create products based on machine learning, you have a lot of folks that use Python and use data in very different ways. You have folks in the data science team, software engineers, some other teams might have machine learning engineers that take care of taking your machine learning
13:42
deliverables into production or out there in the world. So again, when I was learning to use Docker, I experienced a lot of frustration because I would go and say, hey, well, go and search, how do I do a Docker or build a Docker image for Python?
14:04
And this is like what I would find everywhere, which is a bad example. And I'm gonna be telling you why this is a bad example. But every time, you start with a Dockerfile, and it's a specification file where you're providing basically a set of instructions
14:23
on what software to install and how to configure your image. So you start, normally, from a base image that is gonna define what operating system or what kernel you're gonna be working on and, in this case, we normally want to use something
14:43
starting with Python, then you provide the main instructions and if you're familiar with Bash, it's very similar, it's the same syntax, the same kind of instructions that you would go and enter as a command.
15:01
Now you have to be very careful because everything, like, stacks on top of each other. So Docker, like, all Docker images are created in layers. So you can imagine that if you follow traditional workflows, you're gonna end up with a very, very big, bloated image.
15:20
So how do you even choose a base image? Well, it depends a lot on the requirements. A lot of folks and a lot of tutorials out there tell you to use Alpine because it's very lightweight, it doesn't have a lot of unnecessary binaries. Don't do it. It's an absolute pain to get anything working.
15:43
So if you're gonna need to build from scratch, use the official Python images and I recommend using the slim version, which is, as the name says, slimmer, thinner. And I would also recommend using the Buster variant for 3.7 and 3.8.
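As a hedged sketch, the difference in a Dockerfile is just the tag you put after FROM; 3.8-slim-buster here is only one example of a pinned variant:

    # Vague: you don't control which Python or which Debian you will get
    # FROM python
    # Better: an explicit, pinned slim Buster variant
    FROM python:3.8-slim-buster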
16:03
I can talk more about that in detail, but if you need to build everything from scratch, use those. If you don't need to build anything from scratch, and I absolutely recommend this for most of the cases, just go to the Jupyter Docker stacks. The folks there have done an amazing job
16:22
trying to understand what are the traditional or most common requirements that data scientists have. And they've already pre-baked a lot of Docker images for you. If you've ever used things like Binder or repo2docker, again, this is a very similar stack
16:41
and that saves you a lot of time, a lot of fiddling and a lot of headaches. So next, we've identified our base image, what do we do next? You always have to know what you're expecting. If you remember, in the first example it just says Python 3. We need to know, again, the specific tag,
17:00
so we always know what we're pulling; avoid using things like latest; provide context with labels, especially if you're sharing this. And something that a lot of folks forget and a lot of tutorials don't tell you about is adding a security context.
17:20
Because this will allow your Docker container to be much more secure. And you can start using tools like Snyk for vulnerability assessments. If you need to run very complex statements, split them and sort them.
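To make those last few points concrete, here is a hedged sketch of a pinned base image, some context labels, and a split, sorted RUN instruction; the label keys and the package names are only illustrative:

    FROM python:3.8-slim-buster

    # Labels give whoever pulls the image some context about where it comes from
    LABEL maintainer="you@example.com" \
          org.opencontainers.image.source="https://github.com/your-org/your-repo"

    # One complex instruction, split across lines and sorted so it is easy to review,
    # with the apt cache cleaned up in the same layer
    RUN apt-get update \
        && apt-get install -y --no-install-recommends \
           build-essential \
           git \
        && rm -rf /var/lib/apt/lists/*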
17:42
And as a general rule of thumb, you have COPY statements and ADD statements. I always prefer COPY because it's a much better way to do it. Again, also make sure to use the cache, because whenever you're building your images
18:02
and you change something within your software, depending on where you put that, like the COPY or the RUN statements, everything is gonna rebuild. So try to leverage this build cache. Normally, install the requirements first
18:20
unless you're updating your libraries. Do a clean of your installs, either using conda or pip, so your image is not bloated. And then separate your instructions by scope. This ensures that your cache is hit
18:40
as appropriately as possible. Again, only install what you need, and concrete versions of everything. And something that we forget a lot when working with data is to explicitly ignore files. If you're familiar with GitHub or, sorry, with Git or version control,
19:00
you might remember a .gitignore file. We can have something similar for Docker, called .dockerignore, and it follows the same process, where you can exclude certain files or certain directories so that they are not passed directly into your build context or used in your container.
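A hedged sketch of what such a .dockerignore might contain for a data science project; the entries are only examples:

    # .dockerignore
    .git
    .env                          # local secrets and environment variables
    data/                         # large local datasets you will bind mount instead
    notebooks/.ipynb_checkpoints
    __pycache__/
    *.pyc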
19:22
This is especially good for when you have settings or environment variables or super secret keys that you don't want to go out there. To access data, there are lots of ways to do it. And it depends on where your data lives.
19:42
If you are using local data, use bind mounts, create mounts to directories instead of moving the data over, because you always want to have your data up to date instead of baking it into the container. And also, create a non-root user.
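As a minimal sketch of both points, assuming a jovyan-style non-root user inside the image, a local data/ directory, and a made-up image tag:

    # Mount local data read-only instead of baking it into the image,
    # and run as a non-root user inside the container
    docker run -it --rm \
        --user 1000:100 \
        -v "$(pwd)/data:/home/jovyan/data:ro" \
        my-ds-image:2020.07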
20:04
Because, and this takes us directly to security and performance. By default, Docker allows you to do everything that a root user does, but you don't want that. You don't want to introduce vulnerabilities. You want to privilege,
20:20
to favor the least privileged user. So if you go, for example, to any of the data science stacks from the Jupyter Docker stacks or the repo2docker images, you're gonna see that they create a non-privileged user called jovyan, for example, and that's where all of the work
20:43
is gonna take place. And that allows your container to be locked down. That means that folks are not gonna have access to the kernel, are not gonna be able to do different potentially damaging actions
21:01
and you're minimizing capabilities. Also, it's gonna prevent a lot of issues. I don't know if you've ever been working in a container or something, and then when you're trying to work, you get an error saying that you don't have access to a certain directory or you can't mount volumes.
21:24
Normally it is because there is an incompatibility between root user privileges and whichever user you are trying to work as on top of the container. And having this tightened down is very, very essential, especially if you have security access level
21:43
restrictions if you are working with confidential data. Again, I said that all of the Docker containers are like onions, so everything is contained in different layers. Sometimes you think, oh well, if I copy this key
22:02
in a layer and then just delete it later and clear my cache or something, it's not gonna be there. But everything stays there. Everything stays in an intermediate layer. They might not be visible in the outermost layer, but there are tools in which you can actually see how your whole Docker image was built.
22:23
You can inspect the layers and folks can extract all of your sensitive information. So again, keep them out of your Dockerfile. There are different ways in which you can handle all of the sensitive information. And something that I really use a lot
22:42
and that I recommend, and which is probably a very, very robust way of doing it, is using multi-stage builds. Where basically you'll have a base image; for example, in this massive Dockerfile, I'm using slim-buster and that's my compile image.
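The slide itself is not reproduced in the transcript, so here is a hedged sketch of what a two-stage Dockerfile along these lines might look like; the package names, paths and stage names are only illustrative, not the exact file shown in the talk:

    # --- Stage 1: compile image ---
    FROM python:3.8-slim-buster AS compile-image
    RUN apt-get update \
        && apt-get install -y --no-install-recommends build-essential gfortran \
        && rm -rf /var/lib/apt/lists/*
    # Build everything into an isolated virtual environment
    RUN python -m venv /opt/venv
    ENV PATH="/opt/venv/bin:$PATH"
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt

    # --- Stage 2: runtime image ---
    FROM python:3.8-slim-buster AS runtime-image
    # Only the pre-built virtual environment is carried over; compilers and any
    # build-time secrets stay behind in the first stage
    COPY --from=compile-image /opt/venv /opt/venv
    ENV PATH="/opt/venv/bin:$PATH"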
23:01
So you can fetch and manage secrets there. And not everything always needs to be compiled. Sorry, not everything comes as wheels. So if you also need to compile packages, for example, if you need GCC or gfortran for something,
23:21
you can do the compilation also in that first stage and then carry it over in a single layer. And also, using this approach gives you much smaller images overall. So again, I'm gonna just go over how you would do it. For example, you have this Dockerfile
23:41
and you would use the same command, docker build. You specify your Dockerfile, well, and the context, and you provide a name and a tag. So it's gonna start first by creating this Docker image that is the compile image. And in this particular example, I am using,
24:03
I am compiling some packages, sorry, I'm providing options for my compiler and installing some requirements. Then in the second image, which is the actual runtime image, the actual one where I'm gonna be doing my development work, I carry over
24:22
all of this compiled packages or precompiled packages and install them directly in a virtual environment that I'm creating. So this virtual environment is what is gonna have all of my final compiled installed dependencies. And also if I were to have secrets in the first image,
24:44
let's say special compile flags or special access flags, those are not passed over to the runtime image. So that also makes it much more secure for me and whoever is using it. So my final image has that tag,
25:03
has that name that I provided as part of, sorry, as part of my docker build command. And it contains everything that I need, but it tends to be much, much smaller. I don't need to carry over GCC and gfortran,
25:22
which are now unnecessary. And I know that probably at this point, this all sounds like a lot, because it is a lot and it can be very, very overwhelming, especially when you are not an experienced Docker user
25:42
or you're getting started. So my best advice is always automate, try not to reinvent the wheel. Most of the times you don't need to build everything from scratch, unless you need very, very specific setups, permissions, libraries, or access levels.
26:04
So again, to start, something that I always recommend to anyone working in data science, for reproducibility, portability, and reusability, is to always know what you're expecting, everywhere.
26:20
So the best way for you to automate and optimize your Docker builds is also having a consistent project structure and template. I like using the Cookiecutter Data Science template, and there's like a Docker-ready version, which is Cookiecutter Docker Science. And it already gives you a very good baseline
26:42
on how your project should look or should be built. And this makes it much easier when you're building your Dockerfiles and mounting and carrying over software or carrying over files, because it's easier to know where things live, where things are living, sorry, and to debug stuff.
27:06
Unless again, unless you have very specific requirements, leverage the use of tools like repo2docker, because it already gives you configured and optimized Docker images.
27:23
All the folks working on Jupyter, Binder and repo2docker in general have put a lot of work and thought there. And you can install repo2docker through conda. So do a conda install, and then if you already have a repository with a YAML file,
27:45
with an environment.yml or a requirements.txt, you run jupyter-repo2docker, you don't have to create any Dockerfiles. And it's gonna create all of your Docker, well, it's gonna create your Docker image ready to use.
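A minimal sketch of that workflow, assuming the repository already has an environment.yml or requirements.txt; the conda-forge package name and the repository URL are assumptions, not taken from the talk:

    # Install repo2docker (the PyPI package is jupyter-repo2docker)
    conda install -c conda-forge jupyter-repo2docker    # or: pip install jupyter-repo2docker
    # Build and launch an image straight from a repository, no Dockerfile needed
    jupyter-repo2docker https://github.com/your-user/your-repo
    # Or from a local checkout
    jupyter-repo2docker .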
28:01
And also, if you want to use something like Binder, this is the same Docker image that would be created by Binder. So you ensure that your project is ready for usage. And instead of having to write a massive Dockerfile, everything is done for you.
28:20
And I absolutely love repo2docker and I'm a massive fan. And it works with whatever you like; the beautiful thing of repo2docker is that it works with pretty much anything or any package specification that you're already using, whether it is an environment YAML file or requirements,
28:44
if you're using Julia as well, you can use Julia specification environments and install R for R users. You have a lot, a lot of options; you can even use Nix package manager specification files. And if you are already a Docker heavy user
29:04
and need very specific environments, you can also provide your own Dockerfile. Also, if you're using containers, a Docker container, to do your dev work,
29:21
you probably do it daily or every time you work in a project, that's good. But also make sure that your image is built frequently. If there's a Docker container that I use every day, I probably want to rebuild my Docker image every week or every other week,
29:41
because that ensures not only that all my dependencies are up to date, but also the binaries for the operating system that I'm using. And if you're using things like the Python 3.7 Slim Buster, you also get the latest security patches and security updates. So it also ensures that your Docker container
30:02
is continuously receiving and being updated to the latest security patches. But you don't have to do this manually. If you already have version control and are using things like Travis or GitHub Actions to build, well, to test
30:20
and to do continuous integration and continuous delivery, you can delegate all of this building of your image to these tools as well. Something that I do, for example, now in GitHub Actions, not only can you create your image when there is a pull request or when there is a release,
30:42
but you can set scheduled builds. So you can say, for example, in this example I have weekly builds, on Sundays at two o'clock in the morning. I don't know, it's a totally arbitrary time, but it can be every Monday at five o'clock or every Friday at five o'clock when the week is finished.
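A hedged sketch of what such a scheduled build could look like as a GitHub Actions workflow; the image name, registry and secret names are placeholders, and this is not the exact workflow from the talk:

    # .github/workflows/build-image.yml
    name: Build Docker image
    on:
      release:
        types: [published]
      schedule:
        - cron: "0 2 * * 0"        # every Sunday at 02:00 UTC
    jobs:
      build-and-push:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v2
          - name: Log in to Docker Hub
            run: echo "${{ secrets.DOCKERHUB_TOKEN }}" | docker login -u "${{ secrets.DOCKERHUB_USERNAME }}" --password-stdin
          - name: Build and push a dated tag
            run: |
              docker build -t your-user/your-image:$(date +%Y.%m.%d) .
              docker push your-user/your-image:$(date +%Y.%m.%d)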
31:01
And then you have the concrete tags and it can be pushed directly to whatever container registry you use, whether that's Docker Hub or Azure Container Registry or Google Cloud or whatever, or your own in-house. So this makes your workflow much, much easier
31:22
because you have your code in version control, whatever you're using to build your images, whether it's a Dockerfile or repo2docker; you can have your triggers on tags, your scheduled triggers, build your image, push, and then you can use that. Oh, sorry, and everybody can use this
31:43
readily from your container, from your repository. So just to summarize, and I know I am giving you a lot, lot of information, and probably at this time, your brain is full of a lot of stuff,
32:02
I'm gonna give you the top tips. And these are at least the minimal or baseline requirements that you should try to get into your Docker and data science workflow. First, rebuild your images frequently. Make sure that you're getting the security updates for system packages.
32:23
This is especially important for avoiding vulnerabilities or problems with any of the images. Second, never work as root. Minimize the privileges.
32:41
If you're building your own images, make sure that always, right before your entrypoint, after you've built all of the binaries and all of the system specifications, you are switching to a non-privileged user with access to whatever the working directory is.
33:02
Don't use Alpine Linux. It's very good for a lot of stuff, but for data science and machine learning, it is much more trouble than it's worth. Yes, it is a very small image, but you're paying the price for that small size.
33:24
My advice, go for Buster. That's probably the best distribution at the moment. It has long-term support. Use Stretch or the Jupyter stack. If you can't use a Jupyter stack, always know what you're expecting. Pin all the versions, pin everything.
33:44
Try to use, instead of using traditional pip, and this is very opinionated, instead of just doing pip install requirements blah, use pip-tools for dependency resolution, or conda or Poetry or Pipenv.
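For the pip-tools option, a minimal sketch of how that pinning usually works, assuming a requirements.in file that lists only your direct dependencies:

    pip install pip-tools
    pip-compile requirements.in     # resolves and writes a fully pinned requirements.txt
    pip-sync requirements.txt       # makes the current environment match it exactly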
34:01
Choose whatever tool you prefer, stick to it, and make sure that you always know what you're expecting from your base image, from all your dependencies, and even from your databases. Leverage the build cache, be very smart. Separate all of your RUN commands based on the context.
34:25
This is gonna ensure that your image doesn't get rebuilt every time there is a minimal change in your code. So make sure that everything is making the most of the build cache.
34:40
Use one Dockerfile per project. Sometimes folks have a single kitchen sink container or Dockerfile, and they have all of the 70 dependencies that they need for every single project they could be working on or the company works on.
35:03
It is very, very, very, very troublesome to do it this way. So one project, one Dockerfile, one image. And use multi-stage builds. If you need to compile code, if you need to reduce your image size, if there is no way that you can use build flags
35:21
or environment variables when you're orchestrating your code, orchestrating your containers, use multi-stage builds and make your images identifiable. Sometimes you might need to provide different environment flags or different build flags
35:42
to differentiate between test, production, and research and development environments. Because sometimes you need access to different databases. You sometimes need different ingress or different egress rules.
36:04
So make sure that all your images are identifiable. Make sure that you are providing the right variables and do not reinvent the wheel. Use repo2docker. If none of these advanced requirements apply to you
36:22
or your project, use repo2docker. It is amazing and I love it and I use it all the time. And finally, automate. There is no need to build your image yourself every week and push it manually.
36:42
Delegate as much as possible of these tasks, like building, tagging and pushing to whatever platform you are using for your continuous integration or continuous delivery. I demonstrated an example with GitHub Actions
37:01
because it allows me to do scheduled runs or scheduled tasks and that works for me. But choose again whatever works for you and for your team, but don't do it manually because it's boring. It is boring and you don't want to be rebuilding
37:21
your image manually and pushing it every week. And use a linter. I didn't mention this before, but I use, well, my editor of choice is VS Code. I've been using it for a very long time.
37:40
And there is a Docker extension and I absolutely love it because it provides linting capabilities. I can write my Dockerfile and make sure that everything, well, that I'm using the correct mounts, that everything is written accordingly.
38:01
And it also helps me with a lot of my tasks in my Docker development workflow. So especially when you're starting to use Docker, I highly, highly, highly recommend using a linter, just for you to make sure that your syntax is correct,
38:23
your construction is correct. And also, if you're working with multi-stage builds, sometimes it can be quite hard if you have everything in one Dockerfile. I sometimes split them into separate Dockerfiles, but just generally use a linter
38:41
and that will make your life so much easier, in a similar way as a linter does for Python work. So I hope you find these tips and the content in this presentation valuable, and that it has convinced you to try and optimize
39:02
or improve your Docker and data science workflow. As I said, I'm gonna be taking some questions now. I have probably like five minutes or so. And I'm also gonna be later on in the Microsoft and VS Code room.
39:21
So you can come chat with me about Docker, machine learning, VS Code, I love VS Code, Jupyter Notebooks, whatever it is that you want to talk about. And thank you everyone. Thank you very much.
39:41
And I think I'm gonna be, stop sharing my screen. All right, thank you for that amazing talk. I think all of us learned something new out of it. So yeah, this applause is for you and also for your dog.
40:10
Oh, my dog is the worst. No, it was awesome. We were having fun. So we have two questions. I think you have five minutes left and we can take them right now.
40:22
So the first question is from Ignacy. Why not use environment variables or volume mounting for sensitive information instead of increasing Docker file complexity with multi-stage builds? Why not use environment variables?
40:40
Especially, well, it depends. If you are just setting your environment variables and then providing them through however you're orchestrating your container, for example, if you are using Azure or AWS or Kubernetes, you can provide those as you're running your container.
41:01
But a lot of folks actually use them as build environment variables. And those are persisted in your final image. Those are the cases where you should avoid providing those directly. If you can provide them at runtime, that's fine.
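In other words, a hedged sketch of the runtime approach; the variable names, file and image tag are placeholders:

    # Passed at run time, so the value is not baked into any image layer
    docker run --rm -e DB_PASSWORD="$DB_PASSWORD" my-ds-image:2020.07
    # Or load several variables from a local file that is also listed in .dockerignore
    docker run --rm --env-file .env my-ds-image:2020.07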
41:23
Yeah, okay. I hope that solved your question. You can also type in the questions if you have any. We actually have time left for more questions. So the next question is from Diego, about users and mounted volumes. I want to share a Docker image with multiple users, and the containers mount
41:42
a volume in RW mode. The process will use the UID/GID of the Docker user and not the host user. This potentially can lead to all sorts of errors in terms of permissions, because the Docker user and the host users are different. How do you solve this issue
42:01
without rebuilding the image with the right user? Oh, so I normally, to avoid these issues, these permission issues, that's why I set the non-privileged user within the Dockerfile. That's the easiest way, because that way you can set your UID, your GID.
42:24
And I just forgot that command. So you can create your directories. I forgot the command if someone can help me. So you can actually ensure that the permissions are correct.
42:41
Oh, you can post that in the break room later. Yeah, I'll do that later. But I can provide an example of how I do it in my Dockerfiles normally to prevent this. It is very, very hard if you don't do, oh, chowning. You have to do a chown on whatever directory
43:03
is relevant for the image, with the user ID and GID. Otherwise, you're always gonna have problems between the Docker host and Docker user permissions. And that's normally because of the default behavior of Docker always running as root.
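For reference, one common way of doing this in a Dockerfile is sketched below; the user name and the IDs are placeholders, and they can be passed as build arguments so they match the host user:

    # Allow the UID/GID to be overridden at build time to match the host user,
    # e.g. docker build --build-arg NB_UID=$(id -u) --build-arg NB_GID=$(id -g) .
    ARG NB_USER=jovyan
    ARG NB_UID=1000
    ARG NB_GID=1000
    RUN groupadd --gid "${NB_GID}" "${NB_USER}" \
        && useradd --create-home --uid "${NB_UID}" --gid "${NB_GID}" "${NB_USER}"
    WORKDIR /home/${NB_USER}/work
    # chown the working directory so the non-root user can actually write to it
    RUN chown -R "${NB_UID}:${NB_GID}" "/home/${NB_USER}"
    USER ${NB_USER}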
43:23
Okay, so the next question is from Johannes. Could you say something about env vars, build vars, and especially about DB access and what one should avoid? I think I missed that. So can you repeat the question?
43:43
Yeah, yeah, yeah, yeah, got it. Could you say something about the environment variables and build variables, especially something about database access and what one should avoid? Oh, yes. So for example, for database, an example is sometimes you have a production database
44:02
and sometimes you have another R&D database, because that's how sometimes companies or projects are designed to work. So when you're using the command docker run, you can actually pass a build variable. You can have an environment variable that takes one value when it's production
44:22
and one value when it's R&D, for example. And when you build your Docker image, you can set those, and you can imagine that if it's production, the environment variable would be pointing to your production database, and to the R&D one when you're in R&D.
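A hedged sketch of that pattern, with made-up variable names and image tags:

    # In the Dockerfile: a build argument that ends up as an environment variable
    ARG ENVIRONMENT=rnd
    ENV APP_ENVIRONMENT=${ENVIRONMENT}

    # At build time, produce clearly identifiable images for each environment
    docker build --build-arg ENVIRONMENT=production -t my-ds-image:prod .
    docker build --build-arg ENVIRONMENT=rnd -t my-ds-image:rnd .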
44:44
This is very, very helpful because that way you ensure that folks are working within the domain that they need. And also because in many cases, when you're working in, I've seen cases where folks are working in an R&D environment and they share, for example,
45:04
the same password or the same user to access the database. And that's okay if you can't, like, completely wipe out your production database. But you want to be very, very careful
45:21
and avoid any destructive operations when you're using your production database. And that's when setting these test, development, and production environment variables through a build flag can be very, very useful. All right, that's awesome. I think we are out of time.
45:40
I think we can take the rest of the questions to the break room or you can reach out to Tania later. Awesome, thank you for your talk. Thank you everyone for coming. I think we have a short coffee break, after which we will be coming back to the Parrot track again, awesome. Thank you very much, bye.