
Analyzing Data with Python & Docker



So, hi everyone, please welcome Andreas. [Applause] Good morning everyone. I'm Andreas, and I'm going to talk about analyzing data with Python and Docker. First I want to thank the organizers for inviting me to the conference. It's really great to be here for the second time, and I'm really excited to speak to you today. About my own background: I come from science, I worked in physics and used Python there for my data analysis, and in the last five years I've been working mostly on this problem, of course also using Python as my main tool of choice. So here's what to expect from this talk: first I'm going to give you a small introduction to data analysis, the different scales and the different types of analysis that we can do, and why it sometimes might be difficult. Afterwards I'm going to talk briefly about Docker, so that we all understand what it is and how we can possibly use it. Then I want to give you some examples of how we can containerize our analyses using this technology, and finally I want to talk about some other possible approaches, show you some relevant technologies that you can use, and discuss where containerized data analysis might go in the future. OK, so let's get
started. Data analysis is a pretty large field, and since data analysts like graphs, here we have a graph of the different scales and the different types of data analysis. I've tried to segment the field from small-scale to large-scale, and from interactive to automated methods. If you look, for example, at the upper left quadrant, we have automated, small-scale data analysis tasks. A typical example would be a Python script that interacts with data, for example a local database, and performs some analysis on it, non-interactively. In the lower left quadrant you have things that are interactive and possibly user-interface based. A good example of this is the IPython notebook: you can analyze data in an interactive and straightforward way, very easily using graphical methods and various types of data sources as well. If you go to large-scale data analysis, we have things like Hadoop, which is a mostly non-interactive technology that allows us to perform data analysis at very, very large scales. In the lower right quadrant, on the other hand, you have tools that also help us to deal with very large data sets but which are more interactive than traditional MapReduce-based approaches; an example of this would be Apache Spark. What I'm going to talk about today concerns all of these, of course, and I want to show you that using containers can help us in all of these areas. So if we have a lot of
tools for data analysis, you might ask yourself: what is actually so difficult about this? Well, in my own experience, and maybe in yours as well, several things. First of all, sharing your data and tools is not exactly easy. As a scientist I experienced this myself: during my PhD, which I started in 2009, I used Python for a lot of things, and my analysis workflow basically involved a few hacked-together scripts in Python and some data files that I would keep in a directory. Sharing those files, the data and the tools that I used, was possible of course, but it was not easy, and it wasn't straightforward to give other people access to this kind of thing. This leads to problems with reproducing results: you can only reproduce a result if all the necessary information that you need for it is available to you, and if you try to reproduce results, in science or in other contexts, it's often not that easy, because we are lacking the context and several critical parts of the data analysis process. Another thing that is difficult in data analysis is scaling. At the small scale you have a lot of tools available that you can use to analyze your data; I mentioned Python and the IPython notebook earlier, and there are different ways to handle, for example, plotting and the processing of data at this scale. But if you go to larger scales, you normally need a totally different set of tools: the normal toolset you use for your small data sets doesn't apply anymore, and you have to switch to technologies like MapReduce. That means you need to rewrite a lot of your data analysis tools when you grow beyond the small scale. So how can containers help to solve these problems? Let's first try to understand what Docker is all about. Docker
is basically a tool that helps us to deploy applications inside software containers. If you hear "containers" you're probably thinking of virtual machines, but that's not the right picture, because Docker containers work on the process level: they are isolated from different aspects of the operating system, for example processes, resources, and the file system, but some aspects, for example the kernel that the containers are running on, are shared between them. This is exactly what makes Docker very interesting, because it provides a more lightweight way to isolate applications from each other than full virtualization. So this is the basic idea, and of course we need a lot of tooling to make this idea convenient: Docker provides a high-level API that helps to manage, version-control, deploy, and network containers. If we look at the core concepts of Docker, first we have the image, which you can imagine as a frozen version of a given system: it contains the whole file system that we need to launch a given container. As you can see here, images are all versioned, and some images are based on other images; there are also images that are not based on anything else, which we call base images. We'll see later why version-controlling images and layering them on top of each other is a great idea. You can keep your images on a local computer, of course, but what makes them convenient to use is to put them into a registry. Docker has its own registry, but it's also possible to run your own private registry. A container, in this sense, is nothing but a running instance of an image: each of the containers here has a given image associated with it. Conceptually, containers are ephemeral, which means that the state of a given container is not saved when it stops running. So, for containers to be useful for any data processing, we usually want to attach some resources to them. This is shown here: containers can run on any number of hosts, and on each host runs the so-called Docker daemon, which is responsible for starting, managing, and monitoring the containers on that host. Now, one of the really great things about Docker, a fairly recent feature, is the ability to network containers together, which basically abstracts away the networking between different hosts. We can completely ignore the physical constraints of our network and construct virtual networks that connect different containers to each other, which of course is very useful if we have applications that rely on multiple containers, multiple services that need to talk to each other over the network. For orchestration there are a couple of tools, for example Docker Swarm, which makes it easy to deploy Docker containers on a cluster of machines. And if you ask how to manage all this: there's the Docker API, which provides a REST interface that allows you to create containers, manage and monitor them, and do everything that is possible with the command-line interface, which we will mostly use to interact with Docker on our machine; the CLI is nothing but a client to this API. Good. So what
do I like about Docker? One thing that I really think is great is that the images are space-efficient. They are space-efficient because they are based on a so-called layered union file system, which you can imagine somewhat like an onion: we have different layers, and you can just add layers on top of an existing layer. Here you see an example with the image that we're going to use later in our data analysis. In the beginning, when we created this image, we downloaded a lot of data, about 124 megabytes, which corresponds to the Ubuntu base image that we use. Then we did some things on top of that: we updated the apt package repositories, which added about 40 megabytes to our image size; then we installed Python 3 on the image; and afterwards we installed our analysis script. As you can see in the last steps, the added script consumes only very little space, in this case a few kilobytes. This is really great, because it means that if you make small changes to your containers, the size of the images on disk will not grow linearly with the number of image versions. That means you can keep a lot of different versions of your software without worrying about filling up your disk with all those images. Another thing which is really great is that containers have very little overhead. What I mean by this you can see here: these are results that I took from a paper by IBM from 2014, where they compared the performance of various virtualization technologies against a native machine, in this case Docker and KVM. We're seeing two things: one is the read latency of disk operations, shown as a cumulative distribution function, where we want to be on the left side, meaning fast; and the other is the input/output operations per second for different use cases. You can see that Docker has very little, almost no overhead compared to the native solution, whereas for KVM, a full virtualization technology, there is a significant performance drop. I don't want to bash virtualization technologies here, because they're doing something that is very different from what Docker does, and they provide things that are impossible to do with containers; but you can see that this comes with a performance penalty, and with Docker we don't have that: we can run our applications at basically the same speed as on a native system. Another thing which is
great is that containers are self-sufficient. This means that as soon as we have an image that we can run with Docker, we have everything that we need to run our application: we don't need to install any dependencies on the host system, except Docker itself of course, and we can rely on the fact that the application bundles everything it needs inside the container, or inside a set of containers. This makes things like sharing tools for data analysis, or sharing data, so much easier than relying on a workflow where we would need all users to install a lot of different dependencies on their systems, which can be problematic, because libraries change, systems change, and it's always difficult to manage all these different dependencies. If we combine them into an image and run them as a container, then all of these problems disappear. So in that sense,
containers can be seen as Lego blocks for data analysis, or, if you want to regard them in a more functional context, you could see them as units of computation: we have certain inputs, for example configuration data, data files, and possibly other networked containers; we perform some computation on them; and we produce an output. This is a very powerful idea, because it allows us to construct data workflows that are reproducible and can easily scale to large systems. Here, for example, is a use case where we take log files from different sources, for example web server logs, and use one set of containers to extract all the interesting information out of those logs; another container then aggregates the results, filters them for the things that are interesting to us, and passes them on to containers that, for example, put that information into a business intelligence system, into a monitoring system, or somewhere else. OK, so now we've talked a
lot about the theory; now I want to show you a very simple example of how to do this in practice. The thing we're going to look at is log file analysis: we're going to take some data from the GitHub Archive, process it to extract some interesting information, and then perform a reduce step to get a summary of that information over all the log files that we're interested in. The code for this is available on GitHub, if you're interested. As you can see, the basic workflow is very simple: we have our analysis script that takes some input files, performs the analysis, and then produces a summary of the results. OK,
and now keep your fingers crossed, because we're going to do a live demo. OK,
so you can see several files in this directory, and if we look at analyze.py, we're importing a bunch of standard libraries and we define our data directory. I can show you that the data directory actually contains a bunch of gzipped JSON files, ending in .json.gz, that we're going to analyze. The first question you might have is: who is pushing commits to GitHub on the first of January? Obviously a lot of people. To analyze those files we have a few functions. First, a function that lists all the files in the directory that have the .json.gz ending. Then we have the analyze-file function, which takes a filename, initializes a dictionary of word frequencies, opens the file using gzip, and goes through each line of the file. Using the json module, it checks whether the data contained in a given line is a push event, and if that's true and there is a commits entry in that event, it extracts the words from the commit messages: we just split each message on non-alphanumeric characters, and for each word that we obtain like this, we increase the count in our word-frequency dictionary. Finally, we return the result, and that's it. Then we have the reduce function, which takes the results that have been produced by the analyze function and adds the counts from those results together, producing a global dictionary of all the different words and their frequencies. The main block of our script then does nothing but use the functions we defined: it lists all the files in the directory, analyzes each of those files, reduces the results, and prints out some statistics. So if we run this, it will take some time, because it goes through each file and calls the reduce function at the end, and you can see we get a pretty straightforward result. And if you ask who is pushing commits to GitHub: apparently quite a few dedicated developers don't take a break, even on New Year's Day. So that's a very simple, very straightforward way to analyze this data.
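In code, the map and reduce steps just described might look roughly like the following self-contained sketch. The event fields follow the GitHub Archive format; function names are illustrative and may differ from the exact script shown in the talk:

```python
import gzip
import json
import re
from collections import defaultdict

def analyze_file(filename):
    """Map step: count word frequencies in the commit messages of all
    PushEvents found in a gzipped, line-delimited JSON event file."""
    frequencies = defaultdict(int)
    with gzip.open(filename, "rt", encoding="utf-8") as f:
        for line in f:
            event = json.loads(line)
            if event.get("type") != "PushEvent":
                continue
            for commit in event.get("payload", {}).get("commits", []):
                # split each commit message on non-alphanumeric characters
                for word in re.split(r"\W+", commit.get("message", "").lower()):
                    if word:
                        frequencies[word] += 1
    return frequencies

def reduce_results(results):
    """Reduce step: merge per-file word counts into one global dictionary."""
    total = defaultdict(int)
    for result in results:
        for word, count in result.items():
            total[word] += count
    return dict(total)
```

The main block would then just list the .json.gz files, call `analyze_file` on each, and pass the list of dictionaries to `reduce_results`.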
Now let's look at how we can take this analysis and put it inside a container. To do that, we make some changes to our workflow: instead of having our analysis script read the data directory directly, we first create a Docker image, and then we use a supervisor script, also written in Python, to create a whole bunch of containers based on this image, give each of them a chunk of the data to analyze, and finally produce outputs that the supervisor can again reduce and convert into the result that we are interested in. OK, so let's
go back to our directory, and let's first look at how we create the image. You see here that we have a so-called Dockerfile in our directory, which specifies the image we want to create. You can see that we are basing our image on the Ubuntu 16.04 base image, with a maintainer label, and then we're doing a bunch of simple steps: first we update the apt package index, so that we get an up-to-date version of all the available packages; then we install the Python 3 package; and finally we copy docker-analyze.py, which is in the same directory as the Dockerfile, into the container at this location here. The final line specifies the command that is run when the container starts up, in this case Python 3 with the path of our analysis script. To build the image we can just call `docker build` in the directory containing the Dockerfile and give the resulting image the name that we want to use. As you can see here, it did basically nothing, because the image already existed: it went through all of the steps, checked that it already has a layer corresponding to each one, and then successfully created an image with the given name. Now we could run that image manually using the `docker run` command, which is a bit complicated, so let's go through it: we say that we want to run the image with a given user ID and group ID, we want to expose the ports of the container, we then specify some environment variables, which I will explain later, and we say that we want to mount this directory here as the data directory and this other directory as the output directory. Finally we specify the name of the image that we want to run. If we do that, we see the output of the container being run; as you can see, it has already finished.
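A minimal Dockerfile along the lines of the one walked through here might look like this; the script name and paths are illustrative, not necessarily the exact ones from the demo:

```dockerfile
# Base the image on Ubuntu 16.04
FROM ubuntu:16.04

# Refresh the package index and install Python 3
RUN apt-get update && apt-get install -y python3

# Copy the analysis script into the image
COPY docker-analyze.py /analyze/docker-analyze.py

# Command that runs when a container starts from this image
CMD ["python3", "/analyze/docker-analyze.py"]
```

Building and running would then look roughly like `docker build -t analyze .` followed by `docker run -e INPUT_FILES=... -v $(pwd)/data:/data:ro -v $(pwd)/output:/output analyze` (again, names are illustrative).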
Now let's have a look at the analysis script for the container. It's a slightly modified version of the script from before: it operates on the data directory and produces output in another directory. We have one function, analyze-file, that takes a filename and performs the same kind of map operation that we saw before in the local version, but now we don't have any reduce function, as I will explain later. Instead, we only have a main block that takes the input filenames from an environment variable, goes through each one of them, calls the analyze-file function, and writes the result into the output directory that is mounted into the Docker container.
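The container-side main block, driven by environment variables, can be sketched like this. The variable names `INPUT_FILES` and `OUTPUT_DIR` are illustrative, and the map function is stubbed so the sketch stands alone:

```python
import json
import os

def analyze_file(filename):
    """Stub for the map function; the real version would be the same
    word-frequency analysis as in the local script."""
    return {"analyzed": filename}

def main():
    # The supervisor passes this container's chunk of work via
    # environment variables (illustrative names).
    input_files = os.environ["INPUT_FILES"].split(",")
    output_dir = os.environ.get("OUTPUT_DIR", "/output")
    for filename in input_files:
        result = analyze_file(filename)
        # one result file per input file, picked up later by the reducer
        out_name = os.path.basename(filename) + ".result.json"
        with open(os.path.join(output_dir, out_name), "w") as f:
            json.dump(result, f)

if __name__ == "__main__":
    main()
```

The supervisor on the host then only has to collect and reduce the `*.result.json` files from the mounted output directory.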
Now we need an orchestrator that launches these containers for us, and for this we use a supervisor script. Again we specify our container name, the data directory, the output directory, and the number of containers that we want to launch, the degree of parallelization of this problem, if you want. The first thing that we do here is use the Docker Python API to create a Docker client that talks to our local daemon; then we retrieve the files from the data directory, analyze each chunk of files in a container, and reduce the resulting output. Let's look at each function in more detail. The analyze-in-container function takes a list of filenames and creates a so-called host config, which specifies the different directories we want to mount into the container: in this case we want to mount our data directory in read-only mode and an output directory in read-write mode. This host configuration we can pass to the create-container function; we also give the container the image name and the user ID that we want to use, plus the environment variables, which just contain the list of files that was given as a parameter to the function. The main function then works like this: we first retrieve all the files that we want to analyze, chunk the list of files up into pieces of, say, four or five, depending on our parallelization parameter, and then, for each of those chunked file lists, we create a container that performs the map step for that chunk. We add each of those containers to a list so that we can use it later, and then we wait until all containers have finished their part of the mapping work. As soon as this is done, we call the reduce-output-files function, which takes all the files that have been created by the containers in the output directory, reduces them, and produces the result that we are interested in.
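A supervisor along these lines can be sketched with the docker-py client. This is a sketch, not the talk's exact script; image name, mount points, and the `INPUT_FILES` variable are illustrative, and in older docker-py versions the client class was `docker.Client` rather than `docker.APIClient`:

```python
def chunk(items, size):
    """Split the list of input files into work chunks of the given size."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def run_analysis(files, image, data_dir, output_dir, chunk_size=5):
    """Launch one container per chunk of files, then wait for all of them."""
    import docker  # imported lazily so the pure chunking logic has no dependency

    client = docker.APIClient()  # talks to the local Docker daemon
    # mount the data directory read-only and the output directory read-write
    host_config = client.create_host_config(binds={
        data_dir: {"bind": "/data", "mode": "ro"},
        output_dir: {"bind": "/output", "mode": "rw"},
    })
    containers = []
    for files_chunk in chunk(files, chunk_size):
        container = client.create_container(
            image=image,
            environment={"INPUT_FILES": ",".join(files_chunk)},
            host_config=host_config,
        )
        client.start(container)
        containers.append(container)
    for container in containers:  # block until the map phase has finished
        client.wait(container)
```

After `run_analysis` returns, the supervisor can read all result files from `output_dir` and merge them, exactly like the local reduce function.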
Now we call our supervisor script to run the parallelized analysis; this works with Python 3 as well, of course. It launches a number of containers for the individual file chunks, we wait for the results, and as you can see it's faster than before, and in the end we get exactly the same result as before. You can also see the output directory with the intermediate results that the containers created here. So it's pretty
straightforward to actually go from a workflow using normal Python to a containerized workflow, where we also use Python for the orchestration. Now, this was a very simple example, and I wanted to show you the basics of this approach; in real life the complexity will of course be higher for any real data analysis application. There are certain advantages and disadvantages associated with this approach. One advantage, of course, is that it's easy to share your data analysis workflows: since we now have an image with all our scripts, we can just push that to the Docker Hub, for example, and anybody can download the image and use it locally on his or her machine. Another advantage is that the analysis is self-sufficient, in the sense that the container doesn't care about its environment: as you've seen, we only specify the input files and the output directory for the container, and everything else lives inside the container, so there are no dependencies needed to run this analysis apart from the input and the output. As I also showed you, containerization makes it pretty easy to parallelize the analysis process. In this example we ran everything on a single host, but with a tool like Docker Swarm it's also possible to run this kind of analysis in a multi-host cluster environment, so we can easily parallelize our workloads to hundreds or even thousands of containers. And the nice thing is also that, with the image-based approach, we get versioning of our code included for free. There are also disadvantages, although fewer: it's of course a bit more complex, because we have to prepare the containers for the analysis, and we obviously need to install Docker on each machine that should perform the data analysis. And we have lost a bit of interactivity and flexibility in doing this. Two things are actually missing from this workflow for me. First, as
we've seen, we need a lot of orchestration to make sure that we have all the containers running as shown. For the simple case that I showed you this was not that important, but for any real-world data analysis problem, with databases and many tasks, you have a lot of different things that you need to put together and launch in the right order, so you need a lot of orchestration capabilities to do this in a straightforward and efficient way. Another thing is dependency management, because in most real-world data analysis contexts you want to perform only those steps of the analysis that you really need to perform. For example, if we have several kinds of data that depend on each other, we do not want to run the whole analysis again if only one part of the data changed; we want to perform only those steps that are really necessary because of that change. And finally, we also need a way to manage our resources: in our example we already produced a lot of output files, and in real-world data analysis you will produce many more of those files, so it's important to manage and version-control them, for which Docker unfortunately does not provide any good means. Right. So, since I've been working with Docker a bit in my own time and kept running into these problems, I decided to start writing a small tool on top of the Docker engine. If I had to summarize it in one sentence, you could say it is something like Make for data. It provides basically the three functionalities that I talked about before: resource management, container orchestration, and dependency management. It's a very early prototype, but I want to
show you a bit how it works. The basic concept of the tool is a so-called recipe, which specifies three things: first, the resources that we want to use in our analysis; then the services that we need to run, for example a database; and then a sequence of actions that we want to perform in order to carry out the analysis. For the resources, the tool handles things like versioning, dependency calculation between the different resources, and copying and distributing them to the machines where we want to perform the analysis. In the services section it handles things like starting up the services in the right order, provisioning resources to those services, and networking them together. The actions section is concerned with scheduling the different actions that we need in our data analysis, monitoring them, performing exception handling, and finally doing some logging. OK,
and again I want to show you a small example here. What we're going to look at this time is again very simple: we want to convert a CSV file into a database, that is, load this CSV file into a PostgreSQL database. If you look at the recipe for this data analysis, you can see we have the resources section, where we specify all the resources that we need for this kind of analysis. First, of course, we have the CSV file, which we want to mount as read-only and which has a URI, an external location in this case. Then we have the Postgres data, which is the database state we want to produce, and here we have rules saying that it depends both on the CSV file and on the script that we are using to create the database, and that we should create this resource if it doesn't exist; it's also a resource that we want to mount in read-write mode. Finally, we have the convert script, which performs the conversion between CSV and Postgres; this one comes directly with the recipe. So much for the resources. In the services section we have, in this case, a single service: a PostgreSQL database that uses the Postgres image, which exposes its port to the outside world and makes use of the Postgres data resource that we defined up here. You can see that we mount this resource at the location where Postgres expects to find it, so it can use it to initialize the data of the database. Finally, we have the actions section, which contains only a single entry: it uses the Python 3 image that we created before and executes the convert script, which takes the data from the CSV file and loads it into the database; this container gets access to both the convert script and the CSV file. So now we can launch this recipe.
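Since the tool is an early prototype, a recipe of the kind just described might look roughly like the following sketch. Every key name below is an illustrative guess at the structure (resources, services, actions), not the tool's actual syntax:

```yaml
resources:
  csv-file:
    uri: http://example.com/data.csv    # external location (illustrative)
    mode: ro
  postgres-data:
    depends-on: [csv-file, convert-script]
    create-if-missing: true
    mode: rw
  convert-script:
    path: ./convert.py                  # ships with the recipe
    mode: ro

services:
  database:
    image: postgres
    ports: [5432]
    mounts:
      postgres-data: /var/lib/postgresql/data

actions:
  convert:
    image: python:3
    command: python3 /scripts/convert.py
    mounts:
      convert-script: /scripts/convert.py
      csv-file: /data/data.csv
```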
The tool now processes the recipe, and you can see the list of things that are happening. It first checks that all the images we require are available on the system, then initializes the resources: in this case it initializes the Postgres data, makes sure that the input data is there, and also checks the script, which in this case is present in the recipe. Then it mounts those resources, creates the Postgres service, and finally launches the analysis step from the actions section, giving the action access to the Postgres database. This just took a while to run, and you can see the logs of both the Postgres container, which created our database, and the Python container, which ran our script and inserted the rows into the database; as you can see, it inserted about 45,000 lines of CSV into the database. The resulting data is put here, and as you can see, the tool also takes care of versioning the data, using a UUID-based approach: we always copy the previous version of the data and provide a link to the current one, so that we can go back in time and, for example, revert to a good state of our database in case anything goes wrong in our analysis. Now this
is again a very simple case, but it also works for more complex problems where we have different services and more analysis steps that depend on each other. Of course, there are still some open questions here. In the example that we looked at earlier, we used files to communicate the results of our analysis between containers, but there are also different approaches: we could, for example, communicate over the network, or even use a queue, and right now there's no canonical way to do this, I'd say. Another open question, especially for distributed systems, is of course how to make the data available to the containers; Docker itself doesn't provide a good solution for this. We can probably rely on technologies from the MapReduce world, for example the Hadoop Distributed File System, but it's also not clear what the optimal way is to do this kind of thing. And of course there are some
other technologies that are interesting in this space, and I want to briefly show two of them here. One of them is Pachyderm, a US-based startup that provides an open-source tool for containerized data analysis. The great thing about this solution is that they provide version control for your data, basically version control for large data sets, and they make it very easy to build dependency-graph-based analysis workflows. I think it's a really great product, and compared to the tool I showed you it also already works reliably, at large scale as well, so if you want something that works, you should definitely check it out. Another thing that I want to mention here, which is not directly related to Docker but which also helps you with managing your dependencies in data analysis, is Luigi, a library open-sourced by Spotify that can help you build complex data analysis pipelines with a lot of interdependencies between individual data analysis steps. It figures out how to run your analysis and only runs those steps of the analysis that are really required.
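The core idea behind such dependency-driven pipelines, rerunning a step only when its output is missing or older than its inputs, fits in a few lines of plain Python. This is a simplified make-style sketch, not Luigi's actual API:

```python
import os

def is_stale(output_path, input_paths):
    """A step must be (re)run if its output is missing or any input is newer."""
    if not os.path.exists(output_path):
        return True
    out_mtime = os.path.getmtime(output_path)
    return any(os.path.getmtime(p) > out_mtime for p in input_paths)

def run_step(output_path, input_paths, action):
    """Run `action` only if the target is stale; return whether it ran."""
    if is_stale(output_path, input_paths):
        action()
        return True
    return False
```

A pipeline built from such steps only redoes the work that a changed input actually invalidates, which is exactly the behavior described above.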
So, to summarize: containers are by now a pretty mature technology, and they're probably here to stay. They're very useful in a variety of data analysis contexts, but they don't solve all of our problems with data analysis, and that means we need additional tools to handle it effectively, some of which I showed you. I also showed you how you can use Python in conjunction with Docker to apply this kind of approach to data analysis. OK, that brings me to the end. If you're interested in the tool, you can find it here, contributions are highly welcome, and I think we have time for some questions. Thank you.

[Applause]

Q: Thank you, this was very useful and exciting. I have a question about running this on a cluster. With Docker Swarm, does it treat the cluster like one powerful single machine? If you have several hosts, and they're powerful multi-CPU machines, how does it scale? Would it use all the cores on a powerful machine, and are there any other bottlenecks?

A: OK, I didn't do any performance evaluation of that, but Swarm basically handles distributing the containers over the different systems transparently, and the great thing about it is that it has almost the same API as the Docker core engine, so you can, for example, use it from Python exactly like you would use Docker on a single machine. As I said, containers are completely isolated from each other: each container runs as a process, so if you have a multi-core machine you can of course make use of all the cores, and the operating system will take care of allocating resources to each of these containers. In that sense a container is not much different from a process running on the operating system.

Q: A second question. Maybe it would add too much overhead, but did you consider Dockerizing Apache Spark or a MapReduce framework,
Question: Maybe it would be too much overhead, but did you consider Dockerizing Apache Spark or this MapReduce approach, that is, just putting Spark or Hadoop into Docker containers?

Answer: I think in general Docker provides a great way to set up technologies like MapReduce and Spark in a local environment on your own machine so that you can test them there, so I think it is definitely possible to have a setup running Spark, for example, inside containers. The other way around, it is of course also possible to use Docker containers from inside the Hadoop ecosystem; to my knowledge Hadoop can, for example, make use of Docker containers to perform the individual processing steps. So both of these technologies can be used in conjunction with each other.

Question: My point was that your tool's main purpose is to make things as small as possible and self-contained, and Hadoop is very big. So I was thinking of something more like a lightweight Hadoop or MapReduce, where we just put the pieces into containers; that would solve some of the problems of distributed work, such as serializing results between different steps.

Answer: That is an interesting point. I did not look into it, but it is possible, and it could be good for that. Any more questions?
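The "lightweight MapReduce in containers" idea from the question can be sketched in plain Python: independent map steps whose results are serialized (here as JSON strings, standing in for data exchanged between containers) before a reduce step merges them. This is a toy illustration, not Hadoop's or Spark's API:

```python
import json
from collections import Counter

def map_step(text):
    """Count words in one input shard and serialize the partial result."""
    return json.dumps(Counter(text.split()))

def reduce_step(serialized_counts):
    """Deserialize the partial counts and merge them into one result."""
    total = Counter()
    for payload in serialized_counts:
        total.update(json.loads(payload))
    return dict(total)

shards = ["docker docker python", "python data analysis", "docker data"]
partials = [map_step(s) for s in shards]   # each could run in its own container
result = reduce_step(partials)
print(result)   # {'docker': 3, 'python': 2, 'data': 2, 'analysis': 1}
```

Because every intermediate result is plain serialized text, the map steps could run in separate, self-contained containers and only exchange their outputs, which is the serialization concern the questioner raises.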
Question: You told us a lot about dependencies, but I think there are two kinds of dependencies that we should not confuse. One is code dependencies, that is, dependencies between software packages, their versions and so on; the other is data dependencies, like models that are built on top of other data sources. Maybe it is a somewhat theoretical question, but how do you see these two concepts of dependencies interacting? Is there going to be a single tool that can solve both, or are we going to have completely different tools?

Answer: I think Docker images are a great way of solving the dependency problem for software: we can use images to build a reproducible environment for analyzing the data we have, where we are sure that all the dependencies and all of the software and code are at a given, known state. For managing the dependencies of the data we need different tools, because Docker is, in my opinion, not the right choice for that; Pachyderm, on the other hand, has some support for this kind of thing, for large datasets that you want to version-control and manage in that sense. That said, code can also be treated as data: if you look at the different inputs of a container, as I showed before, you can treat the software, like the scripts used for the analysis, as data themselves, so in that sense you can handle both under the same paradigm. It is of course always a question what the most practical way of handling them is, because the scales are very different: code is usually quite small and manageable, whereas data can be very large and cannot be managed effectively with, for example, source code version control systems. Does that answer your question? Any other questions? OK, then let's thank the speaker.
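The idea of treating scripts and datasets under the same versioning paradigm can be sketched by content-addressing both: hash every input of an analysis, code and data alike, to get a reproducible version identifier. This is a minimal sketch of the concept, not Pachyderm's or Docker's actual mechanism, and the file names are made up:

```python
import hashlib

def content_id(data: bytes) -> str:
    """Short version identifier for any input, code or data alike."""
    return hashlib.sha256(data).hexdigest()[:12]

# Hypothetical inputs of one analysis container: a script and a dataset.
inputs = {
    "analyze.py": b"print(sum(range(10)))\n",     # 'code' input
    "measurements.csv": b"t,v\n0,1.0\n1,1.1\n",   # 'data' input
}

# Both kinds of input get the same treatment: one id per content state.
versions = {name: content_id(blob) for name, blob in inputs.items()}

# The analysis itself is identified by the combined ids of all its inputs,
# so rerunning with identical code *and* data yields the same analysis id.
analysis_id = content_id("".join(sorted(versions.values())).encode())
print(versions, analysis_id)
```

If either the script or the dataset changes, its hash and therefore the combined analysis id change, which captures the "code is also data" point from the answer above.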


Formal Metadata

Title Analyzing Data with Python & Docker
Series Title EuroPython 2016
Part 92
Number of Parts 169
Author Dewes, Andreas
License CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You may use, modify, and reproduce the work or content in unaltered or altered form, and distribute and make it publicly available, for any legal and non-commercial purpose, provided that you credit the author/rights holder in the manner specified by them and pass on the work or this content, even in altered form, only under the terms of this license
DOI 10.5446/21095
Publisher EuroPython
Publication Year 2016
Language English

Content Metadata

Subject Area Computer Science
Abstract Andreas Dewes - Analyzing Data with Python & Docker Docker is a powerful tool for packaging software and services in containers and running them on a virtual infrastructure. Python is a very powerful language for data analysis. What happens if we combine the two? We get a very versatile and robust system for analyzing data at small and large scale! I will show how we can make use of Python and Docker to build repeatable, robust data analysis workflows that can be used in many different contexts. I will explain the core ideas behind Docker and show how they can be useful in data analysis. I will then discuss an open-source Python library (Rouster) which uses the Python Docker-API to analyze data in containers and show several interesting use cases (possibly even a live-demo). Outline: 1. Why data analysis can be frustrating: Managing software, dependencies, data versions, workflows 2. How Docker can help us to make data analysis easier & more reproducible 3. Introducing Rouster: Building data analysis workflows with Python and Docker 4. Examples of data analysis workflows: Business Intelligence, Scientific Data Analysis, Interactive Exploration of Data 5. Future Directions & Outlook
