
Optimizing Docker builds for Python applications


Formal Metadata

Title: Optimizing Docker builds for Python applications
Number of Parts: 118
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor, and the work or content is shared, also in adapted form, only under the conditions of this license.

Content Metadata

Abstract: Do you deploy Python applications in Docker? Then this session is for you! We will start by reviewing a simple Dockerfile to package a Python application and move to more complex examples which speed up the build process and reduce the size of the resulting Docker image, for both development and production builds.
Transcript: English (auto-generated)
I had a great time, too, and by the way, this is my first EuroPython. So I'm going to talk about Docker and Python. Docker is a popular way to package and run applications; however, when you're packaging Python applications in
Docker, there are some caveats, so I'm going to share with you my lessons learned when I was trying to optimise Docker builds. Now, I hope you will find something useful, something valuable, something that you can take away and apply in your environment. My name is
Dmitry and I'm a system engineer at Cisco. I create Python applications for internal use, and I focus mostly on network automation. You can find slides here if you would like to follow along. Before we proceed, let's do a quick show of hands. Who has been using
Docker already? Wow, okay. Keep your hand raised if you're using Python and Docker together. Okay, about the same amount. That's awesome. In this talk, I'm not going to talk to you about why Docker, what the benefits are, and whether you should use it.
If you haven't already, you should check it out. I started using it around two years ago, and it completely changed the way I deploy applications. So, to make sure that everyone is on the same page, let's start with some Docker terminology, okay?
So first, a container. It's a lightweight way to package your application with its dependencies; different containers have some isolation, they have separate user spaces, but they share the kernel of the host. The next one is the Docker image. A Docker image is a template
to create Docker containers. It's built using a Dockerfile, and it consists of read-only layers; we are going to talk about layers later. We can upload an image to a registry and share it with others. Now, a Dockerfile is a set of instructions to build an image.
You start with the base image you're going to inherit from, and then every instruction does something and creates a new layer, which is cached for future builds. In this case, I have a silly example where I inherit from the Debian image, I copy a file from my host system to the image, then I run some command inside of the image, and then I say that, okay, this is my default command when I start a container from it.
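Written out, that silly example would look roughly like this (a minimal sketch; the copied file and the commands are placeholders, not from the slide):

```dockerfile
FROM debian:stretch

# COPY a file from the host into the image: one new layer.
COPY myfile.txt /opt/myfile.txt

# RUN a command inside the image: another layer.
RUN apt-get update && apt-get install -y curl

# The default command executed when a container starts from this image.
CMD ["cat", "/opt/myfile.txt"]
```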
So, what is a Docker container? It's a container created from a Docker image: we add a writable layer on top, we allocate resources, and when we start the container, we execute the ENTRYPOINT and CMD commands. Okay, and the last
one is the registry. It's a place where we store and share tagged images. Here is a very simple diagram, just to summarise: we have a Dockerfile; we use the docker build command to build an image from it; we can change the tag using docker tag; we can
push and pull the image to and from the registry; and when we want to run a container, we do docker run on the image tag, ENTRYPOINT and CMD are applied, and here we go, we have a container. Now, in this talk, we are going to focus mostly on the left part, on
the build process. Okay, so, Python and Docker. On the left-hand side, you can see my sample project. I have my_project, which is a directory, a Python
module, and then I have main.py, which I run when I want to run this application, and it invokes some function from my_project. The details of the Python code itself are not really important here; the structure is more important. Now, another thing that I have
is the requirements.txt file, which contains the Python dependencies. In this case, I have only two: requests and cryptography. And you will understand why I chose something like cryptography for this talk. Now, in the middle, you can see a sample Dockerfile, where I inherit from the Python 3.7 image. I create a directory called app, and then
I copy everything from the host system, from the current directory, to the image, I install my dependencies, and then I define that I'm going to run python main.py.
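As a sketch, the project layout and this naive Dockerfile look roughly like the following (the names my_project and main.py are from the talk; everything else is assumed):

```
.
├── my_project/        # the Python module
│   └── __init__.py
├── main.py            # entry point, imports from my_project
└── requirements.txt   # requests, cryptography
```

```dockerfile
FROM python:3.7

# Create the /app directory and copy everything from the build context.
WORKDIR /app
COPY . .

# Install the Python dependencies.
RUN pip install -r requirements.txt

CMD ["python", "main.py"]
```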
Now, it's all good, it's very simple. However, the size of this image is around one gigabyte, and in this talk, we are going to see how we can improve on that. So, before we go further, let's define our optimization objectives, what we are going to optimize. There are two things: image size, but also build time, and that can be the initial
build time as well as subsequent build time. Now, let's also define priorities, and those are the ones that I defined for myself for my projects. During development, I would like to have fast builds, okay? I care less about image size during development
because whenever I change my code, I like to see results much faster. But, for production, I prefer small image size. In your case, the priorities could be different. Okay, so the first and the most important one is selecting the base image. So, here is
the comparison table. I have python:3.7, which corresponds to python:3.7-stretch, where the base image is Debian stretch. Its size is around 900 megs. We have slim stretch, which is a much smaller version, but it still uses Debian as the base image.
It's 150 megs; you can see it's already a six-fold difference. We also have Alpine. Alpine Linux is a very popular base image, especially in the container world, because of its small size. You can see it's almost half the size of slim stretch.
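To summarise the comparison in one place (approximate sizes as quoted in this talk):

    Base image                Libc     Approx. size
    python:3.7 (stretch)      glibc    ~900 MB
    python:3.7-slim-stretch   glibc    ~150 MB
    python:3.7-alpine         musl     roughly half of slim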
What are the differences between these base images in terms of Python applications? Well, Debian uses glibc, which allows it to support manylinux wheels. Now, when we are talking about manylinux wheels, we have to talk briefly
about Python native extensions. We usually have to compile native extensions to make them work with Python, and some libraries that some of you may
know, like cryptography, lxml, and some others, use native extensions. However, there is this manylinux wheel format, where we don't have to compile anything; it's precompiled for us, and we just download the wheel and extract it. However, Alpine
uses musl, and it doesn't support manylinux wheels, so the consequence is that native extensions must be compiled, and if you have ever tried installing something like lxml on Alpine, it takes around 15 minutes just for that one dependency. Now, if you
want to know more about manylinux wheels and Python native extensions, there was an amazing talk this year at PyCon US, The Black Magic of Python Wheels. I strongly recommend watching it. Also, what I noticed is that, on Alpine, some well-known packages take much less space. For example, for Git, I think there was a three-fold difference
in size. So, in general, the footprint of Alpine images is much smaller. So, here is my recommendation. When you care about the build time, I would select slim stretch
as the base. Whenever you care about image size, I would recommend selecting Alpine. The main reason is that, with slim stretch, you can use manylinux wheels. Okay, so let's do that. I changed my previous Dockerfile, and now I'm using
slim stretch. You can see that the size went from one gig to around 200 megs. With Alpine, we have that cryptography dependency, and it needs to be compiled, because there is no manylinux wheel, well, no prebuilt wheel of cryptography for Alpine. So, in this case, I have to install tools like GCC, but also some packages with the headers, like OpenSSL, and when I do all of that, you can see that the size is 300 megs, which is actually bigger than slim stretch.
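A sketch of what that Alpine Dockerfile might look like (the exact package list is my assumption; these are typical build requirements for cryptography):

```dockerfile
FROM python:3.7-alpine

WORKDIR /app
COPY . .

# Compiler and headers needed to build native extensions against musl
# (package names are assumptions, typical for cryptography).
RUN apk add --update gcc musl-dev libffi-dev openssl-dev
RUN pip install -r requirements.txt

CMD ["python", "main.py"]
```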
So, you may wonder: I just told you to use Alpine if you care about image size, right? So, what's going on here? Well, let's first define the problem. The problem is that the build dependencies which contribute to the image size here are needed only for compilation, not at runtime. This is the main issue. So, the general solution
is to include only the files necessary at runtime in your image. So, how do we achieve that? Well, first, let's take a look at copying the source code into the image. The recommendation here is to use more specific COPY statements instead
of a broad COPY . ., and you can also use a .dockerignore file to exclude some of the files when you are doing that copy. So, here is an example of the .dockerignore file I use; I just copy it from project to project. Things like .pyc files and a few others, I always ignore, to make sure they never end up in my image. Okay?
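For illustration, a .dockerignore along those lines (the .pyc entries are from the talk; the others are typical guesses):

```
__pycache__/
*.pyc
.git/
.venv/
venv/
```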
So, let's apply this technique: instead of having the broad COPY . . statement, convert it to more specific COPY statements. In this case, I copy my Python module, my_project, and main.py, and you can see that the size decreased. The reason is that I had a venv directory on my host, and it previously got copied into the image; now, it's not there.
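In Dockerfile terms, the change is simply (paths follow the sample project):

```dockerfile
# Before: COPY . .
# After: copy only what the application actually needs.
COPY my_project ./my_project
COPY main.py requirements.txt ./
```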
And the same with Alpine: we also save around 20 megs by doing that. Okay. Now, this next one is very important: removing unnecessary files.
And it's not as easy as it may sound. So, let's try using that Alpine Dockerfile. It's exactly the same; however, at the very end I added an additional RUN instruction, where I try to delete GCC, OpenSSL, and some other
packages, because I don't need them at runtime. And if you do that, you can see that the size of the image hasn't changed at all. So, what's going on here? Well, to answer this question, we have to understand how Docker layers work. Every instruction creates a layer, and
a new layer can never make the image smaller than the previous layers; it can only add to the size. Those layers are cached and can be reused for subsequent builds, and layers themselves introduce some overhead, but the first two points are the most important ones. So, again, a new layer cannot reduce the size left behind by previous layers.
So, what's the consequence of this? What is the takeaway here? The first is combining multiple RUN statements into a single one, so that they form the same layer. If you need to delete files, you have to make sure that you
delete them in the same layer where they were added, because if you do it later, it has no effect whatsoever on the image size. If you want to benefit from caching, you have to arrange your statements in order from the least changing to the most changing. Usually, the order will be system-level
dependencies and tools, then Python dependencies, and then the source code. Another tip would be not to save anything to the package managers' caches. For example, with pip, you can use --no-cache-dir so it doesn't save downloaded packages. With apk, you can
use the --no-cache option as well. So let's try to apply these principles to our problem. Now, in this case, for slim stretch, the only thing I added was --no-cache-dir, which saved a few more megs. And in the case of Alpine, I defined my build dependencies and my runtime dependencies, and then I combined all of my RUN statements into a single one: first I install my build dependencies, then I install my Python dependencies, then I delete the build dependencies, and then I install the runtime dependencies. All of that is in a single RUN statement.
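A sketch of that single combined RUN statement on Alpine (package names are assumptions; apk's --virtual flag groups the build packages under one label so they can be deleted together):

```dockerfile
FROM python:3.7-alpine

WORKDIR /app
COPY requirements.txt .

# Build deps, pip install, deleting the build deps, and the runtime deps,
# all in ONE layer, so the compilers never persist in the image.
# Package names are assumptions (typical for cryptography on Alpine).
RUN apk add --no-cache --virtual .build-deps \
        gcc musl-dev libffi-dev openssl-dev \
    && pip install --no-cache-dir -r requirements.txt \
    && apk del .build-deps \
    && apk add --no-cache libffi openssl

COPY my_project ./my_project
COPY main.py .

CMD ["python", "main.py"]
```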
And the result: you can see it's already three times smaller, around 100 megs. Now, from this point on, I will no longer consider slim stretch, because we can
already see that the Alpine image is much smaller, so we are going to continue optimising that. But slim stretch is already good; I personally use slim stretch for my local development builds, where I don't care about those 20 or 30 extra megs. But for my production image size,
sometimes I do, so we will keep decreasing that size. So, here is an optional thing you can do: delete the .pyc files and tests from your dependencies. If you do, the Dockerfile becomes
even more complex: you have to find those .pyc files and test files under /usr/local and delete them. In this case, you save an additional 10 megs; sometimes you get around 30 megs from it. It really depends. Okay.
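A sketch of that cleanup (to actually shrink the image, it has to run in the same layer where the files appeared, for example appended to the install RUN; the paths are my assumption for a python:3.7 image):

```dockerfile
# Remove bytecode and bundled test suites from the installed dependencies.
RUN find /usr/local/lib/python3.7 -name '*.pyc' -delete \
    && find /usr/local/lib/python3.7 -depth -type d -name tests \
        -exec rm -rf '{}' +
```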
What are the disadvantages of the approach I just showed? Well, the Dockerfile becomes really complex. You always have to remember to install the build dependencies, then install everything else that
you need, and then delete everything that you don't need, all in a single statement. The consequence is not only complexity, but also that you can no longer benefit from caching. You can't cache your build dependencies in this case; you will always have to rebuild the
container. Okay. So, Docker multi-stage builds. The idea behind multi-stage builds is that you build an intermediary image where you have all of your build dependencies, and you install your application. Then
you copy the result, for example a binary if it's Golang, or whatever the artifact of your programming language is, to a fresh image, and then you label it as your final image. So, you have these two
separate images, if you will: in one image you do the whole build process, and in the second one you actually package your application for future use. So, why would you want to do that? The resulting image size
is smaller, because you have no build dependencies in it. It can also be faster, because you can now cache all of those build dependencies; you no longer have to delete them anywhere. However, Python is an interpreted language, and the question is: are multi-stage builds
relevant to Python apps? My answer is somewhat. The main thing is that even though Python is interpreted, you still may have those dependencies which use native Python extensions. Not only that, but you may also have
some other tooling that you need as part of your build. For example, I am a big fan of a tool called Poetry, which allows you to manage Python dependencies. If you think about it, you only need that tool to install
your dependencies from the lock file, but you don't really need the tool to run your app. So, all of that can go into the build stage, and then you copy only the result to your final image. Okay. So, here is the idea, and thanks to Hynek for sharing it
on Twitter. In order to simplify copying from one stage to another, the easiest solution would be to use virtual environments. So, the idea is
that you have your application code, you create the virtual environment in the same folder, you install all the dependencies that you need, and then between the build stage and your final stage, you just copy the whole project directory, including
your source code and your virtual environment. And it works out pretty well. So, let's take a look. This is the example of a Python Docker multi-stage build. It may seem a little bit complex, but it really
isn't, compared to our previous examples. On the left-hand side, I have my builder stage, where I still have my build dependencies. I install them, I create a virtual environment, and I also upgrade pip in this case. I copy my requirements.txt, and then I install my dependencies, and in this case, I also delete the .pyc files and tests, but you don't really have to do that step. The result from the left-hand side is that in /app/.venv we have our virtual environment, and in /app/my_project we have our module. So, in the second stage, I inherit from Python 3.7 Alpine again, I install my runtime dependencies only, and then I copy the /app directory, and that's pretty much it. You no longer have to care about how you delete those build dependencies. The size is a little bit bigger, because you use a virtual environment, but I think that's fine.
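A sketch of the whole multi-stage Dockerfile as described (the /app/.venv and /app/my_project paths are from the talk; package names and the rest are assumptions):

```dockerfile
# --- Stage 1: builder, with all of the build dependencies ---
FROM python:3.7-alpine AS builder

# Compilers and headers (package names are assumptions).
RUN apk add --no-cache gcc musl-dev libffi-dev openssl-dev

WORKDIR /app
# Create the virtual environment next to the code and upgrade pip in it.
RUN python -m venv .venv && .venv/bin/pip install --upgrade pip

COPY requirements.txt .
# Install dependencies into the venv; the cleanup step is optional.
RUN .venv/bin/pip install --no-cache-dir -r requirements.txt \
    && find /app/.venv -name '*.pyc' -delete

COPY my_project ./my_project
COPY main.py .

# --- Stage 2: final image, runtime dependencies only ---
FROM python:3.7-alpine

# Runtime shared libraries (assumed; depends on your dependencies).
RUN apk add --no-cache libffi openssl

WORKDIR /app
# Copy the whole project directory, virtual environment included.
COPY --from=builder /app /app

CMD ["/app/.venv/bin/python", "main.py"]
```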
So, one additional thing that you get in this case: your build dependencies are now cached. Depending on whether you change your Python dependencies often or not, you can cache up to line 14 of the Dockerfile on the slide, or maybe even further, maybe even the whole build stage; it really depends, because you don't have to delete anything. In case you don't change your Python dependencies, you cache the whole layer; in case you change them, but your system-level build dependencies are still the same, you can cache up until line 12.
So that's pretty nice, because previously we couldn't do any caching at all. Okay. Now that we have that, you can also create a custom image with the common build dependencies across your multiple projects. For example, as I told you, I like using Poetry; in some cases I need curl to download something; sometimes I need Git; and I may also need a bunch of build dependencies. So I just build a custom image with all of that, and I store it in a registry. Then the multi-stage Dockerfile is simplified even further: the builder stage just inherits from the custom image, and everything else is the same. Okay.
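Sketched out (the registry and image names are placeholders):

```dockerfile
# Shared builder image: built once, pushed to your registry.
FROM python:3.7-alpine
RUN apk add --no-cache gcc musl-dev libffi-dev openssl-dev git curl \
    && pip install --no-cache-dir poetry

# Every project's builder stage then just starts with:
# FROM my-registry.example.com/python-builder:3.7 AS builder
```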
A couple of quick suggestions here. What I found for my local dev, where I use slim stretch: sometimes bind-mounting your source code instead of copying it really pays off, especially if you have web apps with some reload capability. That's pretty nice: you just change the code, and you don't have to rebuild the container. It really depends, but it may be useful for you.
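For example, something along these lines (my-dev-image is a placeholder):

```sh
# Mount the current source directory over /app instead of baking it in;
# code changes are then picked up without rebuilding the image.
docker run -v "$(pwd)":/app my-dev-image
```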
Now, another tip is adding the environment variables: PYTHONUNBUFFERED, so everything is printed to stdout without buffering, and PYTHONDONTWRITEBYTECODE, if you don't want Python to generate .pyc files, which I think are not really needed in your Docker image. Okay.
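In the Dockerfile, that's simply:

```dockerfile
ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1
```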
Now, this is my full example; I'm not going to go into the details. You can download the slides later. I'm using Poetry there, so it's a little bit more complicated to build; if you're interested, you can check it out in the slides. So, let's do a summary. First and foremost, you have to select
your base image carefully: Alpine for smaller image size, slim stretch when you need faster builds. You have to take layer caching into account: combine multiple statements into one; if you want to delete something, you have to make sure you delete it in the same statement where
you added it. You have to order statements from the least to the most changing to benefit from caching. And the last one: Docker multi-stage builds can help you avoid some complex removal procedures and benefit from caching.
And, if you go down this path, I recommend using a Python virtual environment. It's really nice in this case, even though in general I'm not really in favour of using virtual environments in Docker containers. And that's all I have for you today. Thank you very much. Thank you, Dmitry. We have a few minutes.
Four minutes for questions. One, two, three. Okay. Hi. Thank you for this great talk. I have a very simple question. Have you evaluated
using other base images, like Clear Linux or minideb? Because they both have glibc and may be much smaller than slim stretch. Thank you for the question. I haven't. Now that you mention it, I probably should check them out.
Yeah, you should check it out. Thank you. If you write unit tests, how do you run them? Okay. That's an amazing question. It really depends. In my case, I built a development Docker container for that, and I run that. So it's not the same as my production container. So, yeah, I just have a separate
container to run that. And I include my development dependencies there, and I run them there. Thanks. Thank you. Any more questions? Then I'll go ahead and ask
one question. Do you think, if in the build stage, instead of installing into a virtualenv and copying it in the second stage, you would actually build the wheels and then use those wheels to install in the second stage, do you think that would lower the size a lot? Or have you tried using this technique instead of building the virtualenv? I haven't tried it, even though it was one of the suggestions and one of the things that I wanted to explore. I don't have the data to confirm it, but I don't really see any benefit from doing that; it is just my
personal opinion, because, as we saw here, by adding a virtual environment I only added about five megs to the image, which was acceptable for my case. Okay. Thank you for your talk. And just out of curiosity, how does your local development
environment look? What kind of tools do you use, and do you use Docker when you are developing? Well, my local development machine is my Mac. Sometimes I use Docker, sometimes I don't; it really
depends on how complex the application is. If there are a lot of things outside of Python, for example if it is a web app with a database, a front end and stuff like that, then I do use Docker to make sure that everything is working, and I like to see and touch the result. If it is only Python and nothing else, then I usually don't use Docker locally. But most of the time I do. And adding on to that,
I do use tools like Poetry to manage dependencies. And I think that is pretty much it. We have time for one more question. Also concerning testing: you said you have another environment for local development that you also use for testing, but what about integration tests? Do you run them in that same
container? Because I would be a bit afraid of running them in a completely different container than production. So, there are different approaches to this. I actually do it in a separate container, but that's just me; according to my requirements,
that's okay. However, I do understand your concern; it may make sense, if you have integration tests, to run them on your production container. Thank you. Unfortunately, we don't have time for any more questions. You can find Dmitry around the conference. Thank you.
Thank you.