Analyzing Data with Python & Docker
Formal Metadata
Title: Analyzing Data with Python & Docker
Series: EuroPython 2016 (Part 92 of 169)
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/21095 (DOI)
Language: English
Transcript: English (auto-generated)
00:02
Hi everyone. Please welcome Andreas. So, good morning everyone. I'm Andreas and today I'm going to talk about analyzing data with Docker. Before I start I want to thank the organizers again for inviting me to the conference.
00:21
It's really great to see a lot of people here for the second or third time and I'm really excited to speak to you today about this. To say something about my own background: it is in science. I've been working in physics and I've been using Python since about 2009 for my own work. And in the last five years I've been mostly working on data science problems.
00:42
Also of course using Python as my main tool of choice. So, we're going to attack this problem as follows. First I'm going to give you a small introduction to data analysis and explain the different scales and the different types of analysis that we can do and why sometimes that might be difficult.
01:03
Afterwards I'm going to talk briefly about Docker so that we all understand what it is and how we can possibly use it. And then I want to give you some examples of how we can containerize our data analysis using this technology. Finally I want to talk about some other possible approaches. I want to show you some relevant technologies that you can use.
01:25
And I want to give you some outlook into the future of containerized data analysis. Okay, so let's get started. Data analysis is a pretty large field. As a data analyst I like graphs, so here you have a graph of the different scales and the different types of data analysis.
01:49
So I tried to segment this a bit from small scale to large scale and from interactive to automated methods. And if you look for example in the upper left quadrant here you would have automated small scale data analysis tasks.
02:07
So this would typically be some scripts or Python code that interacts with your data, for example a local database, and does some analysis on it in a non-interactive way. In the lower left quadrant here you have things that are interactive and possibly user-interface based.
02:24
So a good example for this would be the IPython notebook where you can analyze your data in an interactive and straightforward way. And you can do so very easily using graphical methods and using various types of data sources as well. If you go to the large scale data analysis we have things like Apache Hadoop which is mostly non
02:46
interactive technology that allows us to perform data analysis tasks at very very large scales in a batch way. On the lower right quadrant on the other hand you have tools which are also helping us to
03:02
deal with very large data sets but which are more interactive than traditional for example MapReduce based approaches. Examples for this would be for example Apache Spark or Google BigQuery. So what kind of tools am I going to talk about today? Everything of course.
03:23
I want to show you that using containers can help us in all of these areas. So if we have lots of tools for data analysis you might ask yourself what is actually so difficult about this. Well in my own experience and maybe from your experience several things.
03:42
First, sharing your data and tools is not exactly easy. As a scientist I experienced this myself. I started my PhD in 2009 and I used Python for a lot of things, and back then my analysis
04:01
workflow would basically involve a few hacked-together scripts in Python and some data files that I would keep in directories. So sharing those files, the data and the tools that were used was possible of course, but it was not easy and it was certainly not straightforward to give other people access to these kinds of things.
04:24
This of course leads to problems in reproducing results. So here we see a cell in the process of reproducing, and it can do that because it has all the necessary information available to it. But if we try to reproduce our results in science or in other contexts it's not that
04:44
easy, because often we're lacking the context and several critical parts of the data analysis process. Another thing that is difficult in data analysis is scaling. As you probably know, at the small scale you have a lot of tools available that you can use to analyze your data.
05:06
I mentioned IPython and the IPython notebook earlier and there are a lot of different ways to handle for example the plotting and the processing of data at this scale. But if you go to larger scales you normally need a totally different set of tools.
05:21
So the normal tool set that you use for your small data sets doesn't apply anymore in this world. You need technologies like MapReduce, like Hadoop, and that means you need to rewrite a lot of your data analysis tools when your data gets bigger. So how can Docker help us to overcome some of these problems?
05:45
Well, let's first try to understand what Docker is actually about. Docker is basically a tool that helps us to deploy applications inside of software containers. And if I say software containers you're probably thinking of virtual machines.
06:03
But that's not quite the right picture, because Docker containers work on a process level and they isolate different aspects of the operating system. For example processes, resources and the files that an application sees. This means that some aspects, for example the kernel that your containers are running on, are shared between them.
06:25
And this is exactly what makes Docker very interesting, because it provides a more lightweight way to isolate applications from each other. So this is the basic idea, and of course we need a lot of tooling to make this idea convenient.
06:42
So Docker provides a high-level API that helps you manage, version-control, deploy and network your containers. If you look at the core concepts of Docker, at the basis we have the image, which you can imagine
07:01
as a frozen version of a given system that contains the whole file system that we need to launch a given container. And as you can see here, images are versioned and some images are based on other images and we have also images that are not based on anything else which we call a base image.
07:21
And we'll see later why version-controlling images and building them on top of each other is a great idea. So you can keep your images on your local computer of course, but what makes it convenient to use them is to put them into a registry. Docker has its own registry on Docker Hub, but it's also possible to run your own private registry server.
07:45
Now a container in this sense is nothing but a running instance of an image. So each of these containers here has a given image associated with it and philosophically or like conceptually containers are ephemeral.
08:01
That means that the state of a given container is not saved when it stops working. So that means that in order for containers to be useful for any data processing, we usually want to attach some resources to a container. This is shown here. So containers can run on any number of hosts, and each host that containers run on runs the
08:26
so-called Docker engine, which is responsible for managing, starting, stopping and monitoring the containers on a given host. Now one of the really great things about Docker, which I like a lot, is the ability to network containers together.
08:42
Which is a quite recent feature and which basically abstracts away the networking of different hosts. So we can completely ignore the physical constraints of our network and can construct virtual networks that connect different containers to each other. Which of course is very useful if we have applications that rely on multiple
09:01
containers and multiple services that need to talk to each other over the network. To orchestrate all that there are a couple of tools. For example there is Docker Swarm which makes it easy to deploy Docker containers on a cluster of machines. And there are also, if you ask yourself how do you manage all this, it's through the Docker API which provides a
09:26
REST interface that allows you to create containers, manage them, monitor them and do everything that is possible in the Docker ecosystem. The command line interface, which you will mostly use to interact with Docker on your machine, is nothing else but a client to this Docker API.
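As a minimal illustration of that last point, talking to the same API from Python can look like the sketch below (assuming the Docker SDK for Python is installed; the docker-py version from around the time of the talk exposed a slightly different client class):

```python
# Minimal sketch: talking to the local Docker engine through its API from Python.
# Assumes the "docker" SDK for Python is installed (pip install docker).
import docker

client = docker.from_env()          # connect to the local Docker engine
print(client.version()["Version"])  # same information `docker version` shows
for image in client.images.list():  # roughly equivalent to `docker images`
    print(image.tags)
```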
09:45
Good. So what do I like about Docker? Well, one thing that I really think is great is that images are space efficient. And they are space efficient because they are based on a so-called layered file system which you can imagine
10:05
somewhat like an onion, where you have different layers and you can just add layers on top of an existing layer. And here I have an example of the image that we're going to use later in our data analysis. You can see that in the beginning, when we created this image, we downloaded a lot
10:21
of data, about 124 megabytes, which corresponds to the Ubuntu base image that we used. Then we did a few things: we ran some shell commands, installed some packages and, for example, updated the package lists to get the newest repositories, which added about 38 megabytes to our image size.
10:41
Then we installed Python 3 on the image and afterwards we added our analysis script. And you can see that the last steps, where we add the script, consume only very little space, in this case a few kilobytes. And this is really great because it means that if you make small changes to your images,
11:03
the size of your images on disk will not grow linearly with the number of those images. That means you can build a lot of different versions of your software without worrying about filling up your disk with all the different image files.
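If you want to look at those layers yourself from Python, a small sketch like the following works (the image tag used here, python3-analyze, is a hypothetical name, not necessarily the one used in the talk):

```python
# Sketch: inspecting the layers of an image from Python, equivalent to `docker history`.
# "python3-analyze" is a hypothetical tag for the analysis image described in the talk.
import docker

client = docker.from_env()
image = client.images.get("python3-analyze")
for layer in image.history():                        # newest layer first
    print(layer.get("Size", 0), layer.get("CreatedBy", ""))
```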
11:22
Another thing which is really great is that containers have very little overhead. What I mean by this you can see here: these are two graphs that I took from a paper by IBM from late 2014, where the authors compared the performance of native Linux with various virtualization technologies, in this case Docker and KVM.
11:44
And we're seeing two things. One is the write latency of disk operations; this is the cumulative distribution function, where we want to be on the left side if we want to be fast. The other thing that we're seeing is the input/output operations per second for different use cases.
12:03
And you can see that Docker here imposes actually very little or almost no overhead compared to the native solution, whereas with another virtualization technology, KVM, you can see that there's a significant performance drop. And I don't want to talk these virtualization technologies down, because they're doing something that is very different from Docker.
12:23
You know, they're providing things that are impossible to do with Docker, but you can also see that by doing this, they're incurring a performance penalty. With Docker we don't have that, so we can operate our applications at the same speed as if we ran them on a native system.
12:45
Another thing which is great of course is that containers are self-sufficient. This means that as soon as we have an image that we can run with Docker, we have everything that we need to run our application. So we don't need to install any dependencies on the host system, except Docker of course. And we can rely on the fact that the application bundles everything that it
13:03
needs inside the container, or inside a set of containers so to say. And this makes things like sharing tools for data analysis, or sharing data itself, much easier than relying on a workflow where we would need our users to install a lot of different dependencies on the system.
13:21
Which might be problematic because versions change, systems change and it's always difficult to manage all these different dependencies. And if we can bundle them into an image and run it as a container, then all of these problems disappear. So in that sense, containers can be seen as Lego blocks for data analysis.
13:44
Or if you want to regard that in a more functional context, you could see them as a unit of computation where you have certain inputs, for example configuration data, your data files and possibly other networked containers. You perform some computation on that and you produce an output.
14:03
And this is a very powerful idea because it allows us to construct data analysis workflows that are reproducible and can easily scale to large systems. So here for example, we would have a use case where we would take log files from different sources, for example Apache logs, Nginx logs and use two containers to map out interesting information in those logs.
14:28
Then use another container to aggregate those results, use a container finally to filter those results for things that are interesting to us and pass that on to other containers that for example put that information into a business intelligence system, into a monitoring system or into an archive.
14:49
Okay, so now we talked a lot about the theory. Now I want to show you some very simple example on how to do this actually in practice. And the thing that we're going to look at is the log file analysis.
15:01
So we're going to download some data from the GitHub archive, and we're going to process it and extract some interesting information, and then we're going to perform a reduce step to get a summary of that information over all the log files that we're interested in. The code for this is available on GitHub if you're interested, and as you can see the basic workflow is very simple.
15:26
We have our analysis script that takes some log files from GitHub, launches an analysis process and then produces some output. Okay, and now please keep your fingers crossed because we're going to do a live demo.
15:43
Good, so you can see we have several files in this directory here. If you look at the analyze file, you can see that we're importing a bunch of standard libraries here. We're defining our data directory so I can show you that the data directory
16:02
actually contains a bunch of gzipped JSON files that we're going to analyze. And I mean, the first question that you probably have now is who is pushing commits to GitHub on the first of January? Well, obviously a lot of people. So to analyze those files, we have several functions here in our script.
16:24
We have one function that lists all the files in the directory that have a .json.gz ending. Then we have the analyze file function, which takes a file name, initializes a dictionary of word frequencies, and then opens the file using gzip.
16:43
It then goes through each line of this file, decodes it using the json module, and checks whether the data contained in a given line is a push event. If that's true, there's a commits entry in that event that we can use to extract the words from the commit messages.
17:01
So here we just split each message on non-alphanumeric characters. And for each of the words that we obtain like this, we increase the count in our word frequencies. Finally we return that and that's it. And then we have the reduce function, which takes the results as produced by this analyze file function and just adds
17:23
the counts in those results together, producing a global dictionary of all the different words and their frequencies. And the main block of our script does nothing else than use the get files function to list all the files in the directory,
17:41
analyze each of these files, reduce the results and then print out the statistics. So if we run that, it will take some time to do that, going through each file and calling the analyze and the reduce function at the end.
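Reconstructed as code, the script just described boils down to roughly the following sketch (the actual script is in the speaker's GitHub repository, so the names and field handling here are assumptions):

```python
# Minimal sketch of the local analysis described above; treat names and details as assumptions.
import gzip
import json
import os
import re
from collections import Counter

DATA_DIR = "data"  # directory containing the *.json.gz GitHub archive files


def get_files(data_dir):
    """List all gzipped JSON files in the data directory."""
    return [os.path.join(data_dir, f) for f in os.listdir(data_dir)
            if f.endswith(".json.gz")]


def analyze_file(filename):
    """Count word frequencies in the commit messages of all push events in one file."""
    frequencies = Counter()
    with gzip.open(filename, "rt", encoding="utf-8") as f:
        for line in f:
            event = json.loads(line)
            if event.get("type") != "PushEvent":
                continue
            for commit in event["payload"].get("commits", []):
                for word in re.split(r"\W+", commit.get("message", "")):
                    if word:
                        frequencies[word.lower()] += 1
    return frequencies


def reduce_results(results):
    """Add the per-file word counts together into one global dictionary."""
    total = Counter()
    for result in results:
        total.update(result)
    return total


if __name__ == "__main__":
    counts = reduce_results(analyze_file(fn) for fn in get_files(DATA_DIR))
    for word, count in counts.most_common(20):
        print(word, count)
```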
18:03
And you can see we got a pretty straightforward result. And if you ask yourself who is pushing all those commits to GitHub, well it's apparently JavaScript developers. And you can see that the good Python developers, they seem to be taking a day off on New Year's Day.
18:26
So very simple, very straightforward way to analyze this data. So now let's have a look at how we can take this data analysis and containerize it. To do that we're going to make some changes to our workflow. Instead of having our analysis script work directly with the data, we use it to first create a Docker image and
18:47
then we're going to use a supervisor script, also written in Python, to create a bunch of containers based on this image. Each of them takes a chunk of the data and analyzes it, and finally produces an output that we can then reduce with the supervisor and convert into the result that we are interested in.
19:08
So let's go back to our directory and first have a look at how we create the Docker image. You see here we have a so-called Dockerfile in our directory, which is a file that specifies how the image that we want to create is built.
19:26
And you can see here that we are basing our image on the Ubuntu 16.04 base image. We're saying that I'm the maintainer of that image. And then we're doing a bunch of simple steps.
19:41
First we update the apt cache so we get an up-to-date view of all the packages available. Then we install the python3 package in our system. And finally we copy the docker analyze script, which is in the same directory as the Dockerfile, into the container at this location here.
20:00
And the final line specifies the command that is run when the container starts up. In this case it's the python3 interpreter that runs the file we just put there. So we can use Docker to build that image. We just call docker build and then tag the resulting image with the name that we want to use.
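A Dockerfile along the lines just described might look roughly like this; the file name and the image tag are assumptions rather than the exact contents of the talk's repository:

```dockerfile
# Sketch of a Dockerfile as described above; file name and image tag are assumptions.
FROM ubuntu:16.04
MAINTAINER Andreas

# Refresh the package lists, then install Python 3
RUN apt-get update && apt-get install -y python3

# Copy the analysis script into the image
COPY docker-analyze.py /docker-analyze.py

# Command run when a container starts from this image
CMD ["python3", "/docker-analyze.py"]

# Build it with, for example:
#   docker build -t python3-analyze .
```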
20:23
And as you can see here, we basically did nothing because the image already existed before. But you can see that Docker went through all of the steps, checked whether it already had an image corresponding to the version that we want to have, and then successfully created a new image with the given name. Now we could run that image manually using the run command, which is a bit complicated.
20:45
So let's go through that here. Basically we're saying docker run. We're saying that we want to run that with a given user ID and a given group ID. We want to expose all the ports of the Docker container. We then specify certain environment variables, which I will explain a bit later.
21:02
And we just say that we want to mount this directory here as the data directory and this directory as the output directory. Finally, we specify the name of the image that we want to run. And so if we do that, we just receive the output of the container that is being run, and as you can see it already finished.
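Expressed through the Python Docker SDK instead of the command line, such a manual run looks roughly like this; the image name, host paths, IDs and the environment variable are placeholders, not the exact values from the demo:

```python
# Sketch of the manual `docker run` described above, via the Python Docker SDK.
# Image name, host paths, user/group IDs and the environment variable are placeholders.
import docker

client = docker.from_env()
output = client.containers.run(
    "python3-analyze",                        # image built from the Dockerfile above
    user="1000:1000",                         # run as a specific user and group ID
    environment={"INPUT_FILE_NAMES": "2016-01-01-0.json.gz"},
    volumes={
        "/host/data":   {"bind": "/data",   "mode": "ro"},  # data directory, read-only
        "/host/output": {"bind": "/output", "mode": "rw"},  # output directory
    },
    publish_all_ports=True,                   # expose all ports of the container
    remove=True,                              # clean up the container afterwards
)
print(output.decode())                        # whatever the analysis script printed
```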
21:23
And now let's have a look at our analysis script actually. So like before we have a Python script that operates on a data directory and that produces output in an output directory.
21:40
And we have one function called analyze file that takes a file name and does the same kind of map operation that we saw before in our traditional analysis script. Now we don't have a reduce function, as I will explain later. Instead we only have a main block that takes the input file names from an environment variable, then goes through each one of them, calling the analyze file function and
22:05
writing the result into the output directory that is mounted into the Docker container. And as I said, we need an orchestrator, or some way to start these containers. For this we wrote a simple Python script. Again we specify our container name, the data directory, the output directory and the number of containers that we want to launch.
22:27
That is the parallelization degree of this problem, if you want. And the first thing that we do here is use the Docker Python API to create a Docker client connected to our local Docker engine. Then we retrieve the files from the data directory, analyze each file in a container and reduce the resulting output files.
22:53
So maybe we can step through this in a bit more detail. The analyze file in container function takes a number of files and then creates a
23:01
so-called host config, which specifies the different directories that we want to mount into the container. In this case we want to mount the data directory in read-only mode and an output directory in read-write mode. This host configuration we can then pass to the create container function, where you also pass in the
23:21
container name, the user ID that we want to use, the host configuration that we just created and the environment variables, which just contain a list of the files that were given as a parameter to the function. And now the main function looks like this. So we first retrieve all the files that we want to analyze here.
23:41
We then chunk those files up into pieces of four or five, depending on our parameter N. Then we create, for each of those chunked file lists, a container that performs the map step for each of these files. We append those containers to a list so that we can use them later, and then we wait until all the containers have finished their work of mapping the files.
24:07
As soon as this is done we can call the reduce output files function, which takes all the files that have been created by the containers in the output directory, reduces them and then produces the result that we're interested in. So if we run this now, we have to do that with Python 2 because I
24:24
have only installed the Docker API for that version, but it also works with Python 3 of course. So we call Python 2 with the docker parallelize script. This will launch the containers for the individual files. It will wait for the results, and as you can see it's even a bit faster than before, and in the end we get exactly the same result as before.
24:45
You can see the files that have been created in the output directory by the containers are here. So pretty straightforward to actually go from a workflow where we use normal Python to a containerized workflow where we also use Python but based on a Docker workflow.
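Condensed into code, such a supervisor might look roughly like the sketch below, using the low-level client of the Python Docker SDK, whose calls correspond to the create host config and create container steps described above. The image name, directories, environment variable and chunking scheme are assumptions, not taken from the talk's repository.

```python
# Sketch of the supervisor described above, using the low-level Docker SDK client.
# Image name, directories, the INPUT_FILE_NAMES variable and the chunking are assumptions.
import os
import docker

IMAGE = "python3-analyze"
DATA_DIR = os.path.abspath("data")
OUTPUT_DIR = os.path.abspath("output")
N = 4  # number of containers, i.e. the parallelization degree

client = docker.APIClient(base_url="unix://var/run/docker.sock")  # local Docker engine


def analyze_files_in_container(files):
    """Start one container that maps over the given chunk of files."""
    host_config = client.create_host_config(binds={
        DATA_DIR:   {"bind": "/data",   "mode": "ro"},   # input data, read-only
        OUTPUT_DIR: {"bind": "/output", "mode": "rw"},   # per-file results, read-write
    })
    container = client.create_container(
        IMAGE,
        environment={"INPUT_FILE_NAMES": ",".join(files)},
        host_config=host_config,
    )
    client.start(container)
    return container


files = [f for f in os.listdir(DATA_DIR) if f.endswith(".json.gz")]
chunks = [files[i::N] for i in range(N)]                  # split the work into N chunks
containers = [analyze_files_in_container(chunk) for chunk in chunks if chunk]

for container in containers:                              # wait for all map steps to finish
    client.wait(container)

# The per-file results in OUTPUT_DIR can now be reduced into the final word counts,
# just like the reduce step in the non-containerized script.
```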
25:05
So this was of course a very simple example and I wanted to show you the basics of this approach. In real life the complexity would be higher of course for any real data analysis application, and there are certain advantages and disadvantages associated with this approach.
25:26
So one advantage is of course that it's, as I said, easy to share your data analysis workflows, because now that we have an image with our scripts we can just push that to Docker Hub for example, and anybody can download that image and use it locally on his or her machine.
25:41
Each analysis step is self-sufficient in the way that the container doesn't care about its environment. As you've seen we only specified the input files and the output directory for the container and everything else was inside the container. So there are no dependencies that we need to run this analysis except from the input and the output data.
26:02
As I also showed you, the containerization makes it pretty easy to parallelize our analysis process. For this example we ran everything on a single host, but as I said, with Docker Swarm it's also possible to run this kind of analysis on a multi-machine cluster, so we can easily parallelize our workloads to hundreds or even thousands of Docker containers.
26:24
And the nice thing is also that, with the image-based approach, we have versioning of our data analytics scripts included for free. There are also a few disadvantages. It's of course a bit more complex because we have to prepare our containers for the analysis.
26:41
We need to install Docker on each machine that should perform the data analysis, obviously, and we also lose a bit of interactivity and flexibility in doing our analysis. So which parts are actually missing from this workflow? For me, three things. First, as we've seen, we need a lot of orchestration to make sure that we have all the containers running as they should.
27:07
And for the simple case that I showed here it was not that important but for any real world data analysis you probably need databases, you need maybe task queues. So you have a lot of different things that you need to put together and launch in the right order.
27:21
And so you need a lot of orchestration capabilities to do this in a straightforward and effective way. Another thing is of course dependency management because in most real world data analysis context you want to not only perform the steps of your data analysis that you really need to perform. So for example if you have several types of data and they depend on each other in for example this way, we
27:44
do not want to perform all of the data analysis again if, for example, only this part here or this part here changes. We want to redo only those things that are really necessary for the changed data sets. And finally, we also need a way to manage the resources.
28:00
So in our example we already produce a lot of output files, and in real-world data analysis you will produce many more of those files, and it's also important to manage and version-control those things, for which Docker unfortunately does not provide any good means right now. So I was tinkering with Docker a bit in my own time and I ran into these problems, so I
28:25
decided to start writing a small tool called Rooster, which is built on top of the Docker API. If you were to summarize it in one sentence, you could say that it's make for Docker. So it provides basically the three functionalities that I talked about before.
28:43
So resource management, container orchestration and dependency management. I have to say it's still an early prototype, but I want to show you a bit how it works. So the basic concept of Rooster is a so-called recipe, which specifies three things.
29:01
We have first the resources that we want to use in our data analysis. Then we have the services that we need to run, for example databases. And then we have a sequence of actions that we want to perform in order to carry out the analysis. The resources layer here includes things like versioning, dependency calculation of the different resources, backing
29:20
them up, copying them and distributing them to the machines where we want to perform the analysis. The services section deals with things like starting up the services including the right order to do that, provisioning the resources to those services and networking them together. The action section then is concerned with scheduling the different actions that we need in
29:42
our data analysis, monitoring them, performing exception handling and finally doing some logging for us. Okay, again I want to show you a small live demo here. So what we are going to look at is again really a very simple example where we want
30:05
to load a CSV file into a Postgres database. So if you look at the recipe for this data analysis, we can see we have a
30:21
resources section, where we specify all the resources that we need for this kind of analysis. So first of course we have our CSV file, which comes from the user resources, which we want to mount as read-only, and which has the URL electricity.csv in this case.
30:42
Then we have the Postgres data, which is the database where we want to put the data, and here we tell Rooster that the state of this database depends both on the CSV file and on the converter script that we are using to create the database.
31:01
And that we should create the resource if it doesn't exist, that the URL is Postgres and that it's also a user resource and that we want to mount it in write mode. So finally we have the converter script that performs the conversion between CSV and the Postgres
31:21
database and this comes directly from the recipe and is contained in the converter URL. So much for the resources. The services are listed here; in this case it's only a single service, namely a Postgres database, which uses the Postgres image, exposes this port here to the outside world and makes use of the Postgres data resource that we have defined up here.
31:52
And here you can see that we mount this resource at this location where Postgres will be able to find that and to use that to initialize or work with the database.
32:02
So finally we have the actions section, which in this case also contains only a single entry, which uses the Python 3 image that we created before and executes this convert.py script that takes the data from the CSV file and loads it into the database.
32:21
And this container obviously needs access to both the converter script and the CSV file. So now we can launch this recipe by just saying rooster run and then the recipe name, CSV to Postgres, and you can see that several things are happening now.
32:48
So what Rooster did now is to first check that all the images which we require are available on the system, and then initialize the resources, in this case copy or initialize the Postgres data, make sure that
33:02
the input data is there and also check that the script which we need is present in the recipe. Then mount those resources, create the Postgres service and finally launch the analysis steps or the action phase and give the action access to the Postgres database through a virtual network.
33:21
And this took a while to run, and you can see the output here of both the Postgres container, which created our database, and the Python container, which ran the script that inserted those rows into the database. You can see that we inserted about 35,000 lines of CSV into the Postgres database, and the resulting data is now put here.
33:47
You can see that Rooster also takes care of versioning your data by using a UID-based approach, where we always copy the previous version of the data and provide a link to the parent, so that we can go back in time and, for example, revert to a good state of our database in case anything goes wrong in our analysis.
34:10
Alright, now this is again a pretty simple case. It also works for more complex problems where we have different services and more action steps that depend on each other and of course there are still some open questions here.
34:26
In the example that we looked at earlier we used files to communicate the results of our analysis between containers but there are also different approaches so we could for example use the network or even use the Docker API to communicate that and right now there is no canonical way to do this so to say.
34:43
Also, an open question, especially for distributed systems, is of course how to make the data available to the containers, and there Docker doesn't provide a good solution. We can probably rely on some technologies from things like MapReduce, for example the Hadoop distributed file system, but it's also not clear what the optimal way is to do this kind of thing here.
35:07
Of course there are some other technologies that are interesting in this space. I wanted to just briefly show you two of them here. One of them is Pachyderm which is a US based startup that provides an open source tool for data analysis using
35:24
Docker, and the great thing about their solution is that they provide a version-controlled view on top of your data. So they basically have version control for large data sets and they make it very easy to build a dependency-graph-based analysis workflow
35:40
and I talked yesterday to one of the founders and it's a really great product, so compared to Rooster it also works reliably already. So if you want something that also works at a large scale, you should definitely check it out. Another thing that I wanted to mention here, which is not directly related to Docker but which also helps you with
36:01
managing your dependencies in data analysis, is Luigi, which is a library that was built by Spotify and that can help you build complex data analysis pipelines where you have a lot of interdependencies between your individual data analysis steps. Luigi figures out how to run your analysis and how to only run those steps of the analysis that are really required.
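To give a feel for how Luigi expresses such dependencies, here is a minimal sketch using its public API; the task names and file paths are made up for illustration:

```python
# Minimal Luigi sketch: two dependent tasks; Luigi only re-runs what is missing.
# Task names and file paths are made up for illustration.
import luigi


class AnalyzeLogs(luigi.Task):
    """Map step: produce per-day results."""
    day = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget("output/counts-%s.txt" % self.day)

    def run(self):
        with self.output().open("w") as f:
            f.write("...per-day counts for %s...\n" % self.day)


class Summarize(luigi.Task):
    """Reduce step: depends on the per-day analyses."""
    def requires(self):
        return [AnalyzeLogs(day=d) for d in ("2016-01-01", "2016-01-02")]

    def output(self):
        return luigi.LocalTarget("output/summary.txt")

    def run(self):
        with self.output().open("w") as f:
            for target in self.input():
                with target.open("r") as day_file:
                    f.write(day_file.read())


if __name__ == "__main__":
    luigi.build([Summarize()], local_scheduler=True)
```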
36:27
So to summarize, containers are by now a pretty mature technology and they are probably here to stay. They are very useful in a variety of data analysis contexts. They don't solve all of our problems with data analysis though.
36:43
And that means that we need additional tools to handle them effectively. Some of them I showed you and I also showed you how you can use Python in conjunction with Docker to use this kind of approach to data analysis. Okay, so with that I'm at the end. If you're interested in the tool in Rooster you can find it here on GitHub.
37:04
Contributions are highly welcome and I think we have time for some questions so thank you. Thank you. This is useful, exciting and I have a question about running this on the cluster.
37:31
How does Docker Swarm handle this? Say you have a powerful single machine, or you have several of those machines, but they are powerful, multi-CPU machines.
37:41
How does it scale? Will it use all the cores on that powerful machine? Any other bottlenecks? Okay, I didn't do any performance evaluation of that, but Swarm basically transparently handles distributing your containers to the different systems.
38:01
And the great thing about Swarm is that it has almost the same API as the Docker core engine. So you can, for example, use it from Python exactly like you would use Docker on a single machine. And as I said, the containers are completely isolated from each other, so each container runs in its own process. Hence, if you have a multi-core machine you can of course make use of all
38:21
the cores, and the operating system will take care of allocating resources to each of these containers. In that sense a container is not much different from a process running on the operating system. Is that answering your question? Okay.
38:44
Maybe that would be too much overhead but did you consider Dockerizing Apache Spark for this MapReduce thing like just putting Spark workers in Docker containers?
39:02
So I think in general Docker provides a great way to build a local setup where you can test out technologies like MapReduce and Spark in an environment on your own machine. So I think it's definitely possible to have a setup for example running Spark if that's your question.
39:21
And on the other way around it's also of course possible to use for example Docker containers from inside the Spark ecosystem or inside Hadoop. So I know that Hadoop for example has a runner that can make use of Docker containers to perform the map steps. So both of these technologies can be kind of used in conjunction with each other.
39:42
Yeah, but I think your main purpose is to make it as small as possible, self-contained. So I thought maybe Hadoop is a very big thing. Apache Spark is more like lightweight Hadoop for MapReduce.
40:02
So I just thought that may solve your problems with distributing work, serializing results, etc. Okay, that's an interesting point. I didn't look into this, but it's possible that it's a good fit. Any more questions? There?
40:29
You talked a lot about dependencies, but I think there are two kinds of dependencies and we should not confuse them. At least we should focus on the possible and evident differences.
40:44
One is code dependencies, dependencies between software packages, versions and so on. And the other is data dependencies, like models that are built on data, which is built on other data, and so on. Maybe it's kind of a theoretical question, but how do you see these two different concepts of dependencies interacting?
41:15
Is there going to be a single tool or instrument that can solve both or we are going to build completely different tools to solve them?
41:29
I think that is the question. As I said, I think images are a great way of solving the dependency problem with software. So we can use images to make a reproducible environment for analyzing the data that we have,
41:46
where we are sure that all the dependencies and all the software code, for example, is at a given state. For managing the dependency of the data, we need a different tool because Docker is, in my opinion, not the right choice for doing that.
42:03
For example, Pachyderm and other technologies have some support for these kind of things, where you have large datasets that you want to version control and that you want to manage in that sense. Personally, I think that code can also be treated as data in that sense.
42:22
If you would look at the different inputs of your container, as I showed them before, you could also take the software and the scripts that are used for analyzing the other data as data themselves. So in that sense, you can treat those two things under the same paradigm, I think.
42:40
It's, of course, always a question, what is the best practical way of handling these things, because the scale is very different, because code is usually quite small and manageable, whereas data can be very large and cannot be managed effectively using, for example, source code version control systems. Does that answer your question? Okay, good.
43:02
Any other questions? No? Okay, so thanks again.