Containerize your Python apps like it's 2024
Formal Metadata

Title: Containerize your Python apps like it's 2024
Series: EuroPython 2024, talk 120 of 131
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
DOI: 10.5446/69414
Transcript: English (auto-generated)
00:04
OK. Thank you very much. My name is Jan Smitka, and I am a co-founder of a marketing agency called Lens Services, based here in the Czech Republic. So we will continue our Docker journey. The previous talk was about using Docker containers
00:21
during development. I would like to focus more on creating production Docker images for your Python applications. I will explain how to use Docker features introduced in recent years to improve your images and their build process. So OK.
00:40
Let's dive in. So as my colleague said, Docker images are a great way to distribute your application and its entire runtime environment. They usually include an image of the entire operating system, a Python interpreter, and all dependencies. This makes them robust and easy to deploy.
01:00
I mean no more fiddling with system dependencies, no more fiddling with Python versions, et cetera. But bundling everything has a price. The development images are usually stored only on your dev machines, so their size and complexity are usually not an issue. However, if you want to build a good production image,
01:22
you should consider a few things. First of all, the image should build fast. You don't want to spend your Friday afternoons waiting for the build to get your fix on productions. I mean, everybody deploys on Friday afternoons. That's the best time, right? And it's not only about the production.
01:40
Staging and preview environments count too. The image should be small in file size and in number of layers. Smaller images are faster to push and to pull on the production machines, resulting in faster deployments overall. And last but not least, the image should be secure.
02:04
As it was said by my colleague, many security aspects come from the isolation of Docker containers from the rest of the system. But you want to make your app secure inside the container. It contains a lot of interesting data, like database credentials, personal information, et cetera. So you want to prevent attackers
02:21
from doing evil things to your containers as much as possible. However, remember that no single solution fits all projects, because projects come in various shapes and colors. So my goal is not to give you a definitive example. My goal is to show what's possible
02:41
and when it makes sense. So you should always test what works for you or for your application and make your own decisions. The first thing you must consider when writing a Docker image is choosing a base image. You can use several options as a starting point for your application.
03:01
You can start with the base image of your preferred Linux distribution and install or even compile Python yourself. However, building the image will take longer, so I usually don't recommend it unless you have very special requirements. For instance, if you want to use GPUs or stuff like that, then you might be better off with special pre-built images.
03:23
For most applications, official Python images are a better starting point. They provide all supported Python versions, even pre-releases, and new versions are usually added within hours of the official release. The images come in two different flavors, one based on Debian Linux
03:41
and the second based on Alpine Linux. So which one should you choose? Well, let's look at Alpine first. It is very small, which is a good thing for distributing the image. However, it's based on Musl, a different implementation of the C standard library, which might not be a good fit for some applications.
04:01
Musl has a slower memory allocator than glibc, which is the common library used by other Linux distributions. It's not a huge issue for Python itself, because Python 3.6, which was released quite a long time ago, added an extra layer of abstraction that makes allocations from Python faster.
04:22
However, it might affect other libraries you use, for instance, your database driver. Another reason to avoid Alpine is that Musl has worse support from package maintainers. If your application uses packages with binary modules, some of them might not provide pre-built wheels for Musl.
04:41
In that case, the package will have to be compiled, and that slows down the entire build process. This comes from historical reasons, and support for Musl is definitely getting better, but it's still lagging behind glibc.
05:01
Okay, let's look at Debian-based images. They have a code name of the Debian release in their image tag, currently bookworm, and they have two variants, regular and slim. The regular variant comes with many tools and libraries. This allows you to use the same base image for multiple images on your machine without installing the same packages over and over
05:21
for each image you use. This can save both disk space and build time, but when the only thing running on the server is your application, it's pretty useless. So for production images, I recommend using the slim variant. The base system contains
05:41
the bare minimum required to run Python, and you can install all of the libraries and tools you actually need for your application. You should always pin your images to the exact image variant. Avoid tags without the operating system name and version, like 3.10 or 3.10-slim, because they might point to a different operating system version in the future.
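For illustration, a pinned tag looks like this (the exact versions are just an example; pick the current ones for your project):

```dockerfile
# Pinned: both the Python version and the Debian release are in the tag.
FROM python:3.12-slim-bookworm

# Avoid tags like these; they can silently switch to a newer Debian release:
# FROM python:3.10
# FROM python:3.10-slim
```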
06:02
For example, in May 2023, the Python 3.10 image used Debian bullseye, but now it uses Debian bookworm. Changes like that can easily break your application. All right, now that we have a base image, let's move on. The key to achieving fast builds is using the Docker cache effectively.
06:22
Docker images are made up of layers, and each layer can be cached. Let's start with a very simple Dockerfile. It copies our application and installs all dependencies. Each instruction that touches the file system creates a new layer in the resulting image. So when we build the image, we will end up with all the layers from the base image,
06:43
one layer for the copy directive with our files, and one layer with the result of pip install. If we rebuild the image without making any changes, it's no problem. Both layers can be reused from cache. The copy directive uses file hashes to detect if there are any changes,
07:02
and the run directive uses only the text of the command. However, if we change any file in our application and rebuild the image, a change in the copy layer will trigger a rerun of pip install. This is because each layer depends on the previous one. So if there is any change in the previous layers,
07:22
the cache cannot be used for that layer. So to utilize the cache, we should order the layers by frequency of change. Least frequently changed at the beginning, at the top of the Dockerfile, and most frequently changed at the end. The application's code usually changes more frequently than its dependencies, so it makes sense to split the operations.
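A minimal sketch of that ordering, assuming the dependencies are listed in a requirements.txt:

```dockerfile
FROM python:3.12-slim-bookworm
WORKDIR /app

# Dependencies change rarely: copy only the requirements and install them first.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# The application code changes often, so it comes last.
COPY . .
```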
07:42
Copy the requirements first, then install the dependencies, and then copy the application. Fixing a typo in an HTML template should not trigger a pip install. Things get more complex if one of your dependencies needs to be compiled. This is the case with the MySQL client package,
08:02
which is, for instance, a requirement of Django when used with MySQL or MariaDB. This package does not provide any binary wheels for Linux, so you must install the MySQL client development libraries, a C compiler, and other build tools. So in Debian, in the current versions,
08:22
the MySQL client package requires 58 new packages, totaling over 300 megabytes of data when installed. In the end, only the MySQL library is required at runtime, so the rest of the 300 megabytes is useless.
08:43
So to solve this issue, there is a feature called multi-stage builds. Docker allows us to split the image into multiple stages. Each stage starts with a FROM instruction and can have a name, and each stage is built independently when you build the entire image.
09:00
You can copy files from one stage to another by adding a from argument to a copy directive. You can use one stage to build a virtual environment with all the dependencies of your app and then copy it to the final image. When building the image, you can set the stage you would like to build using the target argument.
09:20
Only layers from this stage will be included in the final image, and other layers will stay in your local build cache. Therefore, those 300 megabytes of build dependencies will not get deployed with your application. And the good thing is that only stages required by the target will be built, and when possible, they will be built in parallel,
09:43
which speeds up the entire build process. To use this feature, you will need a new build tool called BuildKit, and luckily, it is included in Docker. But when I say new, I don't mean exactly new. It was first included in Docker in November 2018, a few years ago,
10:04
and has been the default since Docker 23, which was released last year. Since Docker 23, you don't have to do anything to use it. In older versions, you have to set the DOCKER_BUILDKIT environment variable. Docker also comes with a plugin called BuildX, which uses BuildKit and adds additional features,
10:23
for instance, more control over the cache, or use of remote builders. To use BuildX, just run Docker BuildX build, and this is the preferred way to build your Docker images. All right, back to the example with MySQL client. Let's optimize the build using multistage builds.
10:42
The build is divided into two stages. The first stage, the builder, installs all build dependencies, creates a virtual environment, and installs the packages. The second stage, called production, installs only the MariaDB client library. This is much better. Only three new packages and less than one megabyte.
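A sketch of those two stages (the package names are what Debian bookworm uses for the MariaDB client library; verify them for your base image):

```dockerfile
# Builder stage: compilers and development headers live only here.
FROM python:3.12-slim-bookworm AS builder
RUN apt-get update && \
    apt-get install -y --no-install-recommends build-essential libmariadb-dev && \
    rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN python -m venv /venv && \
    /venv/bin/pip install --no-cache-dir -r requirements.txt

# Production stage: only the runtime library plus the finished virtual environment.
FROM python:3.12-slim-bookworm AS production
RUN apt-get update && \
    apt-get install -y --no-install-recommends libmariadb3 && \
    rm -rf /var/lib/apt/lists/*
COPY --from=builder /venv /venv
```

Built with docker buildx build --target production, only the production stage ends up in the final image.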
11:01
That's actually what's required at runtime. The virtual environment is then copied from the builder. At this point, you might be wondering, do I need to create a virtual environment in a Docker image? Can't I install packages globally, since containers are isolated anyway? Well, creating a virtual environment
11:20
makes copying installed packages between the stages much easier. The virtual environment contains everything, so just copy the virtual environment and you are done. And even if you don't need a multi-stage build and don't need to copy the installed packages, installing anything into the global interpreter requires superuser privileges.
11:42
pip can execute arbitrary scripts from a package. When run as root, these scripts can do horrible things to your image and add hard-to-find vulnerabilities and various backdoors. And supply-chain attacks are on the rise, so I would avoid installing anything as root. Is there anyone who switches user
12:00
before installing packages in their Docker containers? Okay, activating the virtual environment is fairly easy. You need to set two environment variables, PATH and VIRTUAL_ENV.
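In a Dockerfile that boils down to two ENV lines (assuming the environment was created in /venv):

```dockerfile
ENV VIRTUAL_ENV=/venv
ENV PATH="/venv/bin:$PATH"
```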
12:20
It's basically the same as what the activate script does: add the virtual environment's bin directory to PATH, and set VIRTUAL_ENV to the path of the environment. And that's it. Back to multi-stage builds. There is another trick. Stages can be used as the base image of another stage.
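A sketch of a shared base stage (the stage names are arbitrary):

```dockerfile
FROM python:3.12-slim-bookworm AS base
# Configuration shared by every stage goes here.
ENV PYTHONUNBUFFERED=1

# Both stages below inherit everything from base.
FROM base AS builder
# ...install build tools, create the virtual environment...

FROM base AS production
# ...install runtime libraries, copy the virtual environment...
```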
12:41
Simply use the name of the stage in the FROM directive. It allows you to create a base stage that installs dependencies required by all stages, sets the common environment variables, or performs other system configuration. You can speed up your build by splitting time-consuming tasks into multiple stages. For instance, one of our applications requires a browser
13:03
to get the data from external sites. Browsers are large and they have a lot of dependencies and it takes quite a while to install. Everything could be done in a single stage, but we saved a lot of time on rebuild by splitting installs into two stages. One stage installs the browser
13:22
and is a part of the final image. And the other one just creates the virtual environment and installs Python packages. And the virtual environment is then copied to the final image. Both stages run in parallel because there is no dependency between them. You can also add an extra stage
13:41
with development dependencies. Use your production image as a base and install your dev dependencies and debugging tools. Use the development image locally as a dev container and then build and push only the production image. One of the challenges related to multi-stage builds is caching. The layers from previous stages
14:01
will not end up in the image. So if you build your image on a different machine every time, like GitHub Actions does, it's not enough to just pull the original image. The easiest way is to use an inline cache. Enabling it adds cache metadata to the image itself. The metadata allows the builder to resolve the exact dependencies
14:22
of each layer and decide if a layer from the cache can be reused. For instance, it knows that your virtual environment comes from the builder stage and that stage depends on a requirements.txt file. If this file changes, the builder stage will be rerun from scratch and a new virtual environment will be built.
14:41
When there is no change, the layer can be reused. This is called min cache mode. Only layers from the final stage are cached, and metadata helps to decide which layers can be reused. To use an inline cache, add the cache-to argument to buildx build with type set to inline, and the cache-from argument with type registry
15:01
pointing to a previous version of your image. Another option is a registry cache. It allows you to push the cache as another image and use the max cache mode. In max mode, all layers from all the stages will be included in the cache. If you change your requirements,
15:21
the builder can reuse previous layers, skip installing the build dependencies, and only install the new Python packages. The max cache is always larger, and downloading and uploading the cache during the build will take more time. You should test which mode works best for your app. If your dependencies do not change frequently, it might be generally faster to use the min mode
15:42
and let the builder stage run on each dependency change. To use the registry cache, set both cache-from and cache-to. Use type registry and set the full name and tag of the image you would like to use as a cache. You can supply additional arguments to cache-to,
16:00
for example, to set the cache mode. Additionally, you can set the compression algorithm. The default is gzip, but I recommend switching to zstd (Zstandard), which is faster and provides better compression. You can also increase the compression level, sacrificing compression speed for a smaller cache.
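As a sketch, the two variants look like this (the registry and image names are placeholders):

```sh
# Inline cache: metadata is embedded into the pushed image itself (min mode).
docker buildx build --target production \
    --cache-to type=inline \
    --cache-from type=registry,ref=registry.example.com/app:latest \
    --tag registry.example.com/app:latest --push .

# Registry cache: the cache is pushed as a separate image,
# here in max mode with Zstandard compression.
docker buildx build --target production \
    --cache-to type=registry,ref=registry.example.com/app:cache,mode=max,compression=zstd \
    --cache-from type=registry,ref=registry.example.com/app:cache \
    --tag registry.example.com/app:latest --push .
```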
16:22
This might be useful if you have high-performance build machines and slower connectivity to your registry. If you use GitHub Actions, there's also a built-in option for using a cache at GitHub directly. You can use both min and max modes, but you cannot control the compression, et cetera.
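That backend is selected with type=gha, roughly like this:

```sh
# GitHub Actions cache backend; min and max modes work, compression is fixed.
docker buildx build --target production \
    --cache-to type=gha,mode=max \
    --cache-from type=gha .
```

In practice this is usually driven through the docker/build-push-action, which sets up the runner environment that the gha backend needs.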
16:40
Another option to improve builds is by using mounts in run commands. This feature allows you to temporarily mount files from the build host to the container for a single run command in your Dockerfile. The most straightforward kind of mount is a bind mount. It just mounts the given file or directory to a target location in the container.
17:00
Docker also computes a checksum of the files and invalidates the cache if there is any change. This is useful when you don't need files to be present in the final container, because binding files is generally faster than copying them. You can use it for requirements.txt or pyproject.toml with lock files, meaning you can merge the copy and run into a single layer.
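For example, a sketch of the merged layer:

```dockerfile
# The requirements file is mounted only for this RUN command
# and never becomes part of any image layer.
RUN --mount=type=bind,source=requirements.txt,target=requirements.txt \
    pip install --no-cache-dir -r requirements.txt
```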
17:21
It's not a big deal, but it makes the Dockerfile a bit more coherent. It's also very useful for large source files that are not required in the final image. For instance, if your application uses Webpack or something similar to bundle and minify JavaScript, you can just bind your JavaScript sources, run the build,
17:42
and copy only the generated files to the final image. Using the from argument, you can also mount files from other stages or even different images. This is great for tools. For example, you can create a virtual environment with PDM or Poetry or your preferred package manager in one stage and then use it in other stages
18:02
by binding it. It will not take up any space in the final image. Another kind of mount is a cache mount. This mounts an initially empty volume that can be used as a cache. The volume is persisted between builds and reused, and cache mounts do not affect the layer cache.
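A sketch with pip's download cache:

```dockerfile
# pip's cache directory persists between builds on this machine,
# but the directory itself never ends up in the image.
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements.txt
```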
18:20
You could use it for the pip cache to speed up your builds when dependencies change. However, there's a big issue with them. They stay on a single machine, and there is currently no built-in support for sharing them between machines. So if you use an isolated environment for each build, like GitHub Actions does, they will be deleted after every build.
18:41
For GitHub Actions, there is a workaround called BuildKit Cache Dance. It saves the cache mounts in the GitHub cache, but it imports and exports them every single time, even when they are not required. Overall, it will probably just slow down your builds, and until sharing is resolved, I would stick with layer caching only,
19:02
unless you have dedicated build machines that can persist the cache volumes between builds. The secret mount is the last kind of mount I would like to introduce. It allows you to define a named secret that will be mounted as a file in the container during build. The mount also allows you to set file permissions,
19:21
because it is often required that the file containing secrets is not accessible to other users. When building the container, provide a secret argument with the secret name and the source file to be mounted. The secret mount also does not invalidate the layer cache, so it can be used with short-lived tokens. Secrets can be used to pass credentials for a private package index, AWS credentials, and so on.
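A sketch of both sides (the secret id, index URL, and file paths are placeholders):

```dockerfile
# The token is readable only during this RUN command and is never
# written into an image layer, and the layer cache stays valid.
RUN --mount=type=secret,id=pip_token,target=/run/secrets/pip_token,mode=0400 \
    PIP_INDEX_URL="https://user:$(cat /run/secrets/pip_token)@pypi.example.com/simple" \
    pip install -r requirements.txt
```

And on the command line:

```sh
docker buildx build --secret id=pip_token,src=./pip_token.txt .
```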
19:41
Let's look into the security of your images. As the image creator, you should do two things. First, run your application and package installs as a non-privileged user. This is essential because it limits the potential damage that can be done when an attacker compromises your application.
20:02
In most base images, you start as root to modify system settings, and it's your responsibility to switch to a non-privileged user. This can be done using the user directive, and all commands after this directive will be run as this user. But before you can do it, your user must exist,
20:23
created with the useradd command. I recommend using explicit IDs for both the user and the group, because secret mounts require IDs; they don't accept names. For an explicit group ID, you must first create the group and then create the user. A home directory for the user will be created,
20:41
and you can use it as your working directory and copy your application files there. It's also good to know that you can set the owner and permissions of files when copying them with the copy directive.
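A sketch of that setup (the user name and IDs are illustrative):

```dockerfile
# Create a group and a user with explicit, stable IDs.
RUN groupadd --gid 1000 app && \
    useradd --uid 1000 --gid 1000 --create-home app

WORKDIR /home/app
# Files can be handed to the new user right at copy time.
COPY --chown=app:app . .

# Everything after this directive runs unprivileged.
USER app
```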
21:02
Okay, the second thing you should do is install security updates when building your container. People often forget that there is another OS in their images that also needs updating. The base images are updated regularly, but there might be delays between system and image updates. And if there is a security update for a system package that is not present in the base image, but you installed it for your application, the base image will not get updated.
21:21
Therefore, you should check for updates every time you build your image. In Debian-based images, you can update the system using apt-get upgrade. It installs all updates, including security ones. You can run the commands as the first or last step in your Dockerfile.
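A sketch of the update step:

```dockerfile
# Install all pending updates, including security fixes, and clean
# the package lists so they don't bloat the layer.
RUN apt-get update && \
    apt-get upgrade -y && \
    rm -rf /var/lib/apt/lists/*
```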
21:41
If you decide to update as the last step, it will install fresh system updates every time you deploy a new version of your application. This will also slow down every build, but only by two or three seconds, so I think that's perfectly acceptable. You might need to switch back to root in order to install the updates, so don't forget to switch back to your user afterwards.
22:04
If you update as the first step, the packages will be cached and updated only during a full image rebuild. In this case, you should regularly rebuild your image from scratch, for example as part of a scheduled job. This can be done by passing the no-cache flag
22:20
to the build command. GitHub Actions has an option called no-cache, and other CI/CD pipelines will have something similar. You can also check whether there are any security updates to apply to an existing image. This can be done by running apt-get upgrade in a temporary container from your image. If security updates can be installed,
22:40
this command will finish with exit code zero and will have a non-empty output. Otherwise, it will exit with exit code one. I will share the slides, so don't worry about the command. You can perform this check in your build automation and rebuild your image only if there are updates available.
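One possible shape of such a check, not necessarily the exact command from the slides (it relies on grep exiting with 0 only when a simulated upgrade would install something):

```sh
# Simulate an upgrade in a throwaway container; "Inst" lines mark
# packages that would be upgraded.
docker run --rm --user root registry.example.com/app:latest sh -c \
    'apt-get update -qq && apt-get upgrade --simulate | grep "^Inst"'
```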
23:02
Okay, time to wrap up. I might have overwhelmed you with a lot of information. To help you get started, I prepared a GitHub repository with example Dockerfiles using multi-stage builds. Each example uses a different tool to install dependencies. There are examples for PDM, pip, Poetry, and uv.
23:22
These examples also follow the recommendations I mentioned during the talk. You can just take an example and modify it for your app. I will also share the link to the repository in Discord. After writing the Dockerfile, you should consider caching. What mode should you use, min or max? What cache provider is the best in your case?
23:41
Inline, registry cache, GitHub Actions? Answering these questions might require some testing. You should also schedule regular rebuilds to get the latest security updates and bug fixes. These rebuilds can be periodic, for instance daily, or you can trigger them by checking for security updates with the command I showed you.
24:03
Remember, every project is different. I can't give you a definitive example that would fit every application. I showed you what's possible and gave you some pointers on how to decide. I hope it will help you prepare production images for your projects and improve your development experience when deploying new versions.
24:22
Thank you for your attention. Thank you. And thank you, Jan. That was amazing. So, do we have any questions? We have five minutes, almost five minutes. So, if you have any questions,
24:41
please line up behind the mic. In the meantime, I might have come up with a question. So, personally, have you ever had any catastrophic situation from running pip as root, or is it more like learning from the mistakes of others? No, I didn't have any catastrophic situation,
25:02
except one time I broke system packages on a production server because I overrode some of them, but it's the kind of experience every developer should have. And I have been careful ever since. But in the context of security, I didn't have any security incidents related to it.
25:23
But I think that it's only a matter of time until somebody does something really bad with the packages. Great, thank you. We have a question from the audience. Hi, thanks for the talk. It was nice. And I would like to know your opinion on Alpine images since you were putting so much emphasis
25:41
on making them small. Well, as I said, the Alpine images are not as well supported in the Python ecosystem. So the big packages like Pandas or NumPy and similar will always have binary wheels.
26:01
But the possibility to have binary wheels for Musl was only added in recent years. So there are some packages that do not provide them. And generally, if you have an application that uses multiple packages with binary wheels, there's a chance that one of them will not provide a binary for Musl, which will slow down your builds.
26:21
So I personally don't use them. Also, the memory allocator slows things down a little bit. I've seen around a 5% performance decrease in memory-heavy or database-heavy applications that pull a lot of information from the database. But it might not be a big deal for some applications.
26:45
Okay, thanks. Thank you. What would you advise to someone who needs to have CUDA drivers inside the Docker image and wants to make the image smaller than, like, several gigabytes? I was never in a similar situation,
27:01
so I don't really have any recommendation. Maybe the best solution would be to start with the official images provided by NVIDIA and use the commands that are present in the official images to install your own version of Python
27:23
and to use that as a base across your organization. That would be what I would do, but I'm not sure if there is anything better. Thanks. Thank you very much. We've got time for one more question. Hi, thanks. I'm just wondering if you have any tips for sort of structuring the Dockerfiles
27:41
to keep readability when there are multiple layers and lots of apt-gets in different images in those different tiers? I've just found I really struggle to read and parse what's going on. Well, I don't have any concrete tips because it really depends on the kind of application you're building,
28:02
how many lines there are in the Dockerfile. I just separate the layers for big commands with at least two empty lines between them, so when you scan the Dockerfile, you can find them easily. And I also use a trick for structuring multi-line run commands. I run set -ex, which prints each command
28:24
in the shell, and you can separate the commands with semicolons, because when there is any failure, it will fail immediately. And if you open the example Dockerfiles, I think they are structured quite well, so you can use them as inspiration.
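A sketch of that trick (the package is illustrative):

```dockerfile
# set -e aborts on the first failing command, set -x echoes each
# command as it runs, so semicolon-separated steps stay safe and readable.
RUN set -ex; \
    apt-get update; \
    apt-get install -y --no-install-recommends libmariadb3; \
    rm -rf /var/lib/apt/lists/*
```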
28:41
Cool, thank you. Okay, we can try a really quick one, last one. Yeah, it's a good one. Would you suggest pinning system-level packages, like the versions of the system-level packages in the Dockerfile, or not? I think it depends on how much time you have to spend on updates and things like that.
29:03
If you really need reproducible containers and want to reduce the risk of anything getting broken, I would recommend pinning the versions, but it also makes everything harder regarding updates, security updates, and so on. So usually in my projects, I don't pin them.
29:21
I just install security updates and let the Debian maintainers do their job of not breaking anything. But if you're doing something really mission-critical, then pin them, yes. Perfect, thank you very much, everyone, for being here, and a big round of applause for Jan.
29:41
Thank you very much.