AV-Portal 3.23.3 (4dfb8a34932102951b25870966c61d06d6b97156)

The State of Containers in Scientific Computing


Formal Metadata

The State of Containers in Scientific Computing
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

All right, good morning everybody. I'm Georg, and I'm going to talk about the state of containers in scientific computing, and why we do everything differently from everybody else. I work for NERSC, which is the primary scientific computing facility of the Office of Science of the US Department of Energy. We have two supercomputers, named Cori and Edison, both of them Cray machines, and three smaller clusters, so all in all nearly a million cores and around 50 petabytes of storage at varying speeds. We serve more than 6,000 scientists, all the science that happens at NERSC is open, and the fields cover nearly everything from astrophysics to high energy physics to genomics, with jobs that are 8,000 nodes, or one. Can I get a quick show of hands: who of you knows what HPC is? Well, that is more than I thought. Who knows what Docker is? So HPC in a nutshell: you have a large number of compute nodes connected by a high-speed network, usually something like InfiniBand or something highly proprietary like Aries. You access data stored in a parallel file system, you run the whole thing as a shared resource, and all of that is orchestrated by a workload manager, in our case Slurm.

So one question is: what is the hardest problem in scientific computing? Surprisingly, the answer might be installing software. The way it usually works is that you have a software stack provided by the center, and you use a modules tool that can load different versions of that software; the selection there is Lmod, the classic Environment Modules, and Modules 4. This process is usually error-prone, because scientific software is surprisingly hard to build. It is slow, in the sense that when you open a ticket with your center to, for example, build TensorFlow, that is going to take weeks to months. It makes the stacks unique: if you go to one HPC center, the software stack there will look very different from the one at an HPC center an hour away, and of course that is not portable. What users do instead is rebuild the software in their home directories, which depends on the module system, which makes it even harder to port.

And along came Docker. Docker was an interesting solution to the problem for us, because it made everything simple, portable and relatively reproducible, and it leveraged relatively stable kernel APIs like namespaces and cgroups. I do say relatively stable, because if you used Docker on CentOS 7 you might have run into a few bugs. We of course wanted to use it, and then we cried for a bit, because it was not a good fit for HPC. We really would have liked to use Docker at the time, and that is not to shame Docker; we are just a different user base. On a shared machine it is very hard to install the Docker daemon, because it is basically a root equivalent, and as HPC people are quite puristic, you are almost never going to be able to run a daemon on a compute node.
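To make "basically a root equivalent" concrete: anyone who can talk to the Docker daemon can bind-mount the host's file system into a container and become root on it. A minimal illustration (generic Docker usage, not a NERSC recipe; do not try this on a shared machine):

```shell
# The daemon runs as root, and it will happily bind-mount any host path
# for any user who can reach its socket (i.e. anyone in the docker group).
docker run --rm -it -v /:/host alpine chroot /host /bin/sh
# The shell that opens is effectively a root shell on the host's root
# file system, which is why HPC centers cannot hand the daemon to
# ordinary users of a shared machine.
```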
If you have a tightly coupled application and there is jitter on your compute nodes, that affects performance negatively, and we don't do that. So we thought: if you take Docker as a whole, or containers as a whole, what features do we want out of those systems? We wanted a way to run Docker containers on Cori and Edison, which are very special systems. We didn't need the fancy stuff like overlay networks, because, to be frank, if you run a proprietary interconnect the chances of that ever working are very slim. We didn't need plug-ins like storage provisioning using volume plugins, because we do have a parallel file system, so that is handled. Versus Kubernetes: the equivalent for us would be, well, Slurm. Optimally we do not want to run a daemon on the compute nodes. We want it secure, which depends on your definition of secure, right, and scalable, which means that if you take a container, put it on 10,000 nodes and spin them all up at the same time, that should work. And as a bonus it should work on old kernels, like 2.6 for example.
And apparently the people at NERSC were not the only ones with that idea: in early 2015 Shifter popped up, at the end of 2015 Singularity, and a bit later Charliecloud. Some of you might say, surprised, that udocker is not on the slide; that is correct, because until yesterday I didn't know it existed, and I am sorry for that. As for governance: Shifter is developed at NERSC; Charliecloud at Los Alamos National Lab, which has even more stringent security requirements than we have; and Singularity started at Lawrence Berkeley National Lab as well (NERSC is part of Lawrence Berkeley National Lab), but it is now run by Sylabs, who do commercial support for it. The mechanisms those tools use to construct the container and drop you into it differ: Shifter uses setuid; Charliecloud, because setuid is considered a security risk, exclusively uses user namespaces; and Singularity uses a mix of both. The way the three tools work: Shifter as well as Charliecloud leverage Docker and the Dockerfile to build the container; then you pull it into your system, convert it to a native format better suited to this massively parallel use case, and run it. Singularity works nearly the same way, except that while it can import Docker images, it has its own recipe format, which resembles an RPM spec a bit, to satisfy the needs of scientific users a bit more. The image formats are interesting: Shifter and recent versions of Singularity use SquashFS. You take a Docker image, flatten it, and put it into a SquashFS; when you start the container, you loop-mount that SquashFS. That has a very nice side effect: if you have a large parallel file system and you start 10,000 jobs at the same time, you are going to hit the metadata server for every single file access in every invocation of the container, but if you loop-mount one large image, all the metadata operations are local to the node, which makes it scale.
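The flatten-and-loop-mount step might look roughly like this (a sketch assuming squashfs-tools, root privileges, and made-up paths; not the actual Shifter implementation):

```shell
# Flatten an exported container root file system into one SquashFS image.
mksquashfs ./rootfs image.squashfs -comp gzip

# Loop-mount it read-only. From now on every stat()/open() inside the
# image is resolved locally on the node, instead of hitting the parallel
# file system's metadata server once per file, per node, per job start.
mkdir -p /mnt/container
mount -o loop,ro image.squashfs /mnt/container
```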
So that is why we do it this way. Charliecloud takes a bit of a different approach: it uses tar files, unpacks them onto a RAM disk on the local node, and runs from there. The reason for that is that Charliecloud is pretty lightweight. If you compare those tools by size, just code size, which is a very dubious metric, I know: Shifter is about 20,000 lines of code, Singularity is also about 20,000, and Charliecloud weighs in at about a thousand.
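To see why a thousand lines can be enough: the core of such a tool is a screenful of standard shell. A hypothetical sketch (not any of the three tools' actual code; assumes debootstrap is installed and you have root):

```shell
#!/bin/sh
# Minimal "container" built from standard tools. Needs root.
set -eu
FS=container_fs

# 1. Bootstrap a Debian user land into a plain directory.
debootstrap stable "$FS" http://deb.debian.org/debian

# 2. Enter a new mount namespace, so our mounts do not leak to the host.
unshare --mount /bin/sh -c "
  mount --bind $FS $FS            # make the directory a mount point
  mount --make-private $FS        # stop mount events propagating out
  mount -t proc proc $FS/proc     # the usual API file systems
  mount --bind /sys  $FS/sys
  mount --bind /tmp  $FS/tmp
  chroot $FS /bin/bash            # enter the 'container'
"
```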
The reason you can build a container with that little code is that it is not rocket science, really: the 13 lines of shell script on the slide construct a container that you can run on a recent Linux distribution. The way that works: you bootstrap, in this case Debian, but you can use anything you like, into a folder called container_fs; you set up namespaces; you bind-mount the folder to itself and make the mount private, which means that what you do in the container does not propagate to your host operating system. For example, we mount /proc, /sys, /tmp and /run, and if you unmount them inside the container, you do not unmount them on the host. Then you pivot_root (or simply chroot) into the container and exec bash. That, in a nutshell, is what a container is, and I think this is important to understand: you are of course still using the host kernel, and it is relatively uncontained, which might be a bad thing in general. In this case it is actually quite beneficial, because if you have a tightly coupled application, with several ranks of it running on a single node, they share memory, so we actually do not need the containment. HPC is a field where you usually get libraries provided by the vendor: if you have a proprietary high-speed interconnect and you use some kind of message-passing library, you are going to get it from your vendor, and you should use it, because it is highly optimized and it does change the performance of the things you run. Another use case for accessing hardware directly is GPUs. That is a problem that I guess many of you have: if you want to do machine learning, for example, which is a popular topic at the moment, you somehow need to access the GPU, which violates this whole containment thought. You are basically breaking the wall and mapping stuff from the host into your container, which could be a problem, depending on what you are looking for. What you usually do to use GPUs, as an example: you bind the GPU device, like /dev/nvidia0, into the container, and then you inject the host libraries. You take libcuda.so, which is part of the driver, and inject it into the container. Either you do that manually, which is relatively hard to get right, or you use NVIDIA's container tooling (libnvidia-container), which conveniently automates the process: you tell it which namespace you want to make GPU-ready, and it does that for you. A considerable downside in this case is that, because you are propagating whatever you have on the host into the container, you need to be ABI-compatible. If, for whatever reason, NVIDIA in three years breaks this compatibility, by accident or for a good reason, who knows, your container is not going to be able to run anymore. So that is a bit of a downside. Another thing, and many of you will ask who does static linking anymore: HPC does, because if you run a program, and the program loads, let's say, a hundred shared libraries, and you do that on ten thousand nodes in parallel, that is a quite expensive operation, because you rely on the loader to pull those libraries into your program from the parallel file system. Also a thing that not many will run into: if you take a modern Linux distribution, let's say SUSE or Ubuntu, and you try to run it on a 2.6 kernel, it is not going to work by default, because the glibc in modern Linux needs at least a 3.0 kernel. Of course you can argue who is going to use a 2.6 kernel anymore, but the other argument you can make is: in five to ten years, if you want to reproduce your science, are you still going to be able to run your container? It is something to keep in mind. Then there are other problems. For example, here is a Dockerfile that installs TensorFlow from an ubuntu:18.04 base image, installs Python and TensorFlow, and then runs an AI benchmark. There is a very intricate problem with this: speed. On a very broad level, if you produce a container that gets used by many people, you are not going to build an optimized container, in the sense that you are not going to enable machine-specific optimizations, in order to be portable.
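A reconstruction of the kind of naive, portable Dockerfile being described (not the actual slide; the benchmark script name is hypothetical):

```dockerfile
FROM ubuntu:18.04
RUN apt-get update && apt-get install -y python3-pip
# Generic x86-64 wheel: portable across machines, but built without
# machine-specific optimizations such as AVX/FMA.
RUN pip3 install tensorflow
COPY benchmark.py /benchmark.py        # hypothetical benchmark script
CMD ["python3", "/benchmark.py"]
```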
The problem is that modern processors need vectorization to perform. As a minor side note, it is actually quite hard to say what the theoretical performance of a modern processor even is. There are relatively good numbers for the Intel Haswell architecture: in scalar mode it gives you something like 130 gigaflops of floating-point operations, and AVX gives you nearly a four-fold improvement on that. The downside is that you need to use machine-specific optimizations to get that performance. So let's fix that, and suddenly we are back to the hard problems, right? The resulting Dockerfile does not fit on the slide; it is from the official TensorFlow repositories, and again, I am not picking on TensorFlow, it is just that building software is not easy. If you were at the talk yesterday, "How to make package managers cry", that explains why. The Dockerfile is like 85 lines and a fair bit more sophisticated: you need to be intimately familiar with the build system, the loader, and how they play together. In this case it sets the library path for the loader, invokes Bazel specifically for the Haswell architecture, builds a wheel, then installs the wheel, and so on and so forth. If you want to try that at home, I do recommend that you use a tool that automates the process, like EasyBuild or Spack, because doing it repeatedly by hand is going to be pretty bad for your mental health. So does it pay off? The Dockerfile is eight times the size, and we get a seven-fold speed-up, so that is a pretty good bang for the buck, line for line. I have to thank Kenneth Hoste, the lead developer of EasyBuild, for the benchmark. So it does pay off, but it is a considerable investment of time, and it opens up a whole other set of issues: you basically need to cross-compile. Suddenly the thing that was portable, which you wanted to make fast, is not portable anymore, which I think is a common problem.
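The near-four-fold AVX figure above can be sanity-checked with back-of-envelope peak-FLOPs arithmetic (illustrative Haswell per-core numbers, not the slide's exact figures):

```shell
# Peak double-precision FLOPs per cycle, per Haswell core:
#   scalar FMA:           1 result x 2 ops (multiply+add) = 2
#   AVX2 FMA (256-bit):   4 doubles x 2 ops               = 8
# (Both double again with the second FMA port, which cancels
# out in the ratio, so it is left out here.)
scalar_flops_per_cycle=2
avx2_flops_per_cycle=8
speedup=$((avx2_flops_per_cycle / scalar_flops_per_cycle))
echo "vector/scalar peak ratio: ${speedup}x"   # prints: vector/scalar peak ratio: 4x
```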
If you use EasyBuild or Spack in conjunction with containers, you could do portable builds, but then you have the problem: how do you share that? Do you, for example, tell the users "use the version that is AVX-optimized, maybe with FMA or without FMA"? Maybe the cache size of their processor is different. That is going to be a quite interesting problem in the future. Or maybe you want to share with a researcher who runs on another machine where the architecture is totally different. One way to solve that problem: Docker has a thing called fat manifests, because of course Docker containers run on z Series mainframes, which makes me a bit happy. If you do a docker pull on an IBM mainframe or on a POWER machine and the Docker registry has a version for that architecture, it is going to pull it down. Unfortunately that is not integrated with the HPC solutions yet, and, a minor point, I did not find any information on whether it works for CPU features as well, and not only for architectures. So, to conclude this talk: I do think that containers are a valuable tool for scientific computing, because they enable a user-defined software stack. If you want to say it less fancy, it is basically RPM in userland, kind of.
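The fat-manifest mechanism described above can be inspected from the command line (docker manifest inspect is a real subcommand; the image tag is just an example):

```shell
# A multi-arch image is one tag pointing at several per-platform images;
# the client picks the matching one at pull time.
docker manifest inspect ubuntu:18.04 | grep -A3 '"platform"'
# Typical entries: "architecture": "amd64", "arm64", "ppc64le", "s390x"
# (s390x being the z Series mainframes mentioned above). The selector
# keys on OS and architecture only; there is no field for CPU features
# such as AVX, which matches the open question in the talk.
```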
I do have to say, as the advertisement goes, they are not a panacea: they are not going to automatically solve all of your problems. In particular, performance requires work, and if you imagine a seven-fold performance difference, if you need to buy seven times the servers, or pay seven times your AWS bill, you are going to think about it very hard. And as containers are quite a new paradigm, it is also quite beneficial to use what is already there in conjunction with that mechanism, to use them to the fullest extent. And with that, I am going to go into questions.

[Audience member] I work for a major European company that does HPC, and we still run 2.6 kernels on some of our machines for applications. But that is not my point. I was very interested to hear what you said about optimizing and installing the optimized TensorFlow. I think HPC people have a great deal to learn from the web generation: you have scaling out to the cloud, you have configuration management, cattle not pets, and you get applications that are designed to fail, unlike HPC applications, which just do not tolerate failures. Often you see the recipe for an installation is "apt-get install tensorflow", "apt-get install openmpi", and then you see on the Open MPI list: "I did an apt-get install openmpi and I got a really outdated version, it does not perform well, I have got problems", and the answer from the Open MPI guys is: "the 2.x series is great, it has got all these bugs fixed". I think people do not realize that the maintainers of these scientific packages in Red Hat and Ubuntu depend on some guy maintaining them who is not really going to keep them up to date and who is not going to optimize them for the processor. Can you comment on that?

[Speaker] I am not quite sure what the question was, you very nicely elaborated, but commenting on the issue of the tooling not being updated due to lack of time: do containers, to say it in a blunt way, shift the load of maintaining the software stack to the user? That is what you are doing, and for the stuff that needs to be highly optimized you still have the center's people, but it makes the turnaround times quicker. Maintaining software is a very hard problem, and that is not going to go away in the future, even with cloud technologies. Anybody else?

[Audience member] How do you integrate it with the batch system?

[Speaker] For Shifter there is a Slurm SPANK plugin, for Singularity as well, but you do not need to use one: if you submit a job and you do "shifter --image=<name>" and run it, it is just going to run, and the same applies to Singularity. You do not need to integrate it with the workload manager, but it does get better performance with the integration, for several reasons.

[Applause]
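For reference, the batch-system integration described in the last answer typically looks like this in a Slurm job script (a sketch; the image name and resource numbers are made up, and the --image directive is what the Shifter SPANK plugin provides):

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --time=00:30:00
#SBATCH --image=docker:myuser/myapp:latest   # pulled and converted ahead of time

# One container environment per rank; Shifter does the loop-mount and
# chroot setup, so the application sees the flattened image as its root.
srun -n 64 shifter python3 ./run_simulation.py
```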