Multi-host containerised HPC cluster
Formal Metadata
Title: Multi-host containerised HPC cluster
Part Number: 29 of 110
License: CC Attribution 2.0 Belgium. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/30951 (DOI)
Transcript: English (auto-generated)
00:05
All right, next up is Christian, who's going to be talking about Docker on HPC clusters. HPC within Docker and... Yeah, my name is Christian. As was said, I have worked on HPC systems for eight years or so.
00:25
Now I'm working for Gaikai, which is a Sony company. But anyway, it's still nice to tinker around with HPC; that's still my thing. So I have a big data problem myself, because this topic could fill a four-hour talk,
00:40
but I only have 20 minutes, so we better get going. I will just introduce the bits and pieces that I use. First, Docker. As maybe a couple of you are familiar with, Docker containers are a little bit different from traditional virtualization, so just to give a brief heads-up: in traditional virtualization, we have a hypervisor,
01:00
and on top of this hypervisor, we have the same stack that we have beneath the hypervisor, so a kernel and a userland, and within the virtual machine, hence the name, you set up a complete virtual environment. You have your own virtual network card, and so on. There are ways to improve this with paravirtualization, and some nice tricks to make them more performant,
01:26
but basically that's what it is, right? On the right-hand side, we have a little diagram of Linux containers. So in a Linux containers environment, you only have one kernel: the host kernel is used even by the so-called machines, the containers,
01:44
and they are only communicating with the host kernel. This has the nice advantage that the scheduling is done by only one kernel, and not by a guest kernel, which has no way of knowing about the other machines that are running next to it. So that's kind of the thing.
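As a minimal illustration of that shared-kernel point (a hedged sketch; the CentOS images are just examples), a container with a completely different userland still reports the host's kernel:

    # On a CentOS 7 host: the host kernel version.
    uname -r                          # e.g. 3.10.0-327.el7.x86_64

    # A CentOS 6 userland in a container still sees the *host* kernel,
    # because there is no guest kernel at all.
    docker run --rm centos:6 uname -r # prints the same 3.10.0-327...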
02:01
And when you want to use some cool technologies like InfiniBand in a virtual machine, there are also some tricks, but basically what you can do is hand over the PCI device of the InfiniBand card and attach it to a certain guest, and then the guest can use it as if it were in a physical slot in its machine.
02:22
With Linux containers, since you talk to the same kernel, you just have to load the InfiniBand module in the host kernel and put the userland stuff in the container, and then you can use InfiniBand from containers, which is surprisingly easy to set up. I tried it, and after five minutes I was surprised that I was already done.
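A hedged sketch of that setup (the device paths and image contents are assumptions, and the module names depend on the actual HCA):

    # Host side: load the InfiniBand modules once into the host kernel.
    modprobe -a mlx4_ib ib_uverbs rdma_ucm

    # Container side: pass the verbs device nodes through; the container's
    # userland just needs the usual libs/tools (e.g. libibverbs-utils).
    docker run --rm \
      --device=/dev/infiniband/uverbs0 \
      --device=/dev/infiniband/rdma_cm \
      centos:6 ibv_devinfo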
02:42
So that was kind of nice. And since... Oh, no, there's supposed to be another... Anyway, sorry. Yeah, there's a slide missing, I guess. So the userlands are totally independent. You could have a userland on the host which is tiny, for instance Tiny Core Linux,
03:02
while the userlands of the containers can be completely different. So you could have CentOS 6, CentOS 7, whatever you like. So that's pretty nice. And you can trim down the userland pretty neatly. I had a presentation one year ago or so where I benchmarked CentOS 7 as a host installation
03:20
and a couple of other distributions as containers, and it turned out that CentOS 6 was beating the crap out of CentOS 7 in certain benchmarks using InfiniBand. Because I had used a CentOS 7 alpha, it was a bit unfair, but the container beat the bare-metal installation.
03:42
That was kind of nice. So you can tweak and tune the userland of the container, and you don't have to care about the userland of the host. Anyway, as I said, quick bits and pieces, so I have to go on. If you look at the Docker engine: the Docker engine is the container runtime that is provided by Docker Inc. There are other container runtimes, but here we will focus on the Docker engine.
04:05
It creates, starts, stops, and manipulates the containers, all through a RESTful API, and it also handles the namespacing and the cgroups and all that. And a little bit about the networking part here: by default, the Docker engine will instantiate the docker0 bridge on your host,
04:24
and when you spin up a container, by default it will get an IP address from this Docker bridge, which is a private, host-only network, so to speak, in VirtualBox terms, and all the containers running on the system have connectivity through this bridge. You could also use the network namespace of the host,
04:42
so the container gets the same IP address, the same hostname, and so on, and you could also start a container without any network at all. What was introduced in Docker 1.9 was Docker networking, which is kind of nice. It's the same Docker engine (from 1.9 on it was called the Docker engine; before that it was the Docker runtime, anyway).
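As a quick sketch of the three modes just described (busybox is used only as a handy example image):

    docker run --rm --net=bridge busybox ip addr  # default: an IP from docker0
    docker run --rm --net=host   busybox ip addr  # host's IP, host's hostname
    docker run --rm --net=none   busybox ip addr  # loopback only, no network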
05:02
You can have multiple servers, and they all need a key-value store to synchronize the different engines, so now we have these Docker engines hooked up to Consul in my example. It could be ZooKeeper as well. And they use VXLAN to create networks
05:23
that span multiple hosts. So for instance, I could create a global network that spans all nodes, and then I start containers on different nodes, and they will have the same networking address space, the same subnet, across all nodes. And I don't care where
05:41
the containers are spun up. That's pretty neat. And I could have multiple of those, and they are distinct, so I could have a global one, an internal one, what have you, and start different containers, connect them to different networks, and even connect them to multiple networks. So I could create nice environments for development, where, for instance, every developer has his own network and they don't have to fight over IP addresses. That's kind of nice. So that's Docker networking.
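A hedged sketch of what that looks like with the Docker 1.9 CLI (the network and container names are made up):

    # On any engine that is hooked up to the shared key-value store:
    docker network create -d overlay global
    docker network create -d overlay internal

    # Containers on *different* hosts land in the same subnet:
    docker run -d --net=global   --name=dev-alice centos:7 sleep infinity
    docker run -d --net=internal --name=dev-bob   centos:7 sleep infinity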
06:01
Docker Swarm is also a nice product to use in this stack. When you have multiple nodes and you run a Docker engine on each of them, if you want to spin up a container,
06:20
you have to connect to that Docker engine and then spin up the container. If you want to start a container on a different node, you have to change your environment variable again and connect to the Docker engine of that other host, which is kind of a bummer because it's overhead that you have to add. With Docker Swarm, you create a swarm cluster, as it's called.
06:40
You start swarm clients on each of the nodes, which are themselves containers. They connect to the Docker engine by mounting the Docker engine's socket, so they can start and stop containers by themselves from within the container, which sounds scary, but it's not. And then you spin up a swarm master instance,
07:02
and this swarm master instance just sits as a proxy in front of all the Docker engines and distributes the containers as it pleases. There are multiple distribution strategies. There is binpack, which tries to put as many containers on one node as possible, and then you have spread, which distributes them evenly,
07:20
and so on, and you can create your own if you like to write code. I think it's not that hard. Anyway, what it brings you: for instance, I query one node and do docker info, which shows me the amount of containers running, the memory available, and so on. If I do this against the Docker engine directly,
07:43
then I get this output. If I query the Docker swarm master instead, I get more CPUs, namely all the CPUs in my cluster, and I get information about the different nodes. And if I start a container with docker run, it will be magically placed on some node, which I can also pin.
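A hedged sketch of the classic (standalone) Swarm workflow described here; the addresses are placeholders:

    # On every node: the swarm agent, itself a container, registers the
    # local engine in Consul.
    docker run -d swarm join --advertise=<node-ip>:2375 consul://<consul-ip>:8500

    # On one node: the swarm master, which proxies all engines.
    docker run -d -p 4000:4000 swarm manage -H :4000 consul://<consul-ip>:8500

    # Talk to the master instead of a single engine:
    docker -H tcp://<master-ip>:4000 info   # aggregated CPUs, memory, nodes
    docker -H tcp://<master-ip>:4000 run -d \
      -e constraint:node==vinos-007 centos:7 sleep infinity  # pin to one node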
08:02
I can say: put this container on, for instance, vinos-007, and then it will be spun up on that node. So that's pretty neat. Okay, so this new technology is all nice and shiny, and if I do something new, I Dockerize it first.
08:21
That's the first thing I do. I even Dockerized Scalia a couple of weeks ago. I haven't used it yet, but the image is there. And yeah, so I do the docker pull, and when I do this myself, I think, yeah, this is how it goes. I'm like, yeah, there are ponies and rainbows and so on. But when I talk to others,
08:40
or to the ops guys of our systems, then it's like: oh, this guy again with his flamethrower that burns everything. What's also not helping is that there are so many buzzwords around Docker: solutions and problems, production-ready and enterprise-ready, whatever. And there are a lot of solutions for a lot of problems.
09:02
I mean, Docker spans so much of the IT infrastructure that there are 50 answers for five questions, because it depends on where you look at it from: development, production, testing, or what have you. So that's kind of a broad field. And as I described,
09:20
comparing traditional virtualization with containerization: if you talk to someone and say Docker, he says, oh yeah, virtual machines, okay, I got it. So you have to bring across that it's not; that's very hard at the beginning. Nowadays maybe a lot of people have heard about these containers, but one year ago, it was almost impossible to talk to someone who knows virtual machines and convince them
09:42
that Docker is something different. And as I said, it spans a lot of environments. So what I try to do is spin everything up on my little laptop, but at the same time be able to spin it up on a workstation with more memory and more CPUs, or even on clusters, or even on big production units. And I don't want to be a unicorn,
10:02
or I don't need to be a unicorn. I don't want special distributions like CoreOS or RancherOS or what have you, because in my opinion that's something that might be helpful for your deployment, and might be helpful if you want a very agile scaling approach.
10:21
But I want to rather leverage existing technologies and existing workflows, the existing installation, monitoring, and logging of the infrastructure I want to put my Docker on. And the same for the security infrastructure. So I want to reuse stuff that's already there
10:40
and not put on something completely new. And I want to keep up with the upstream of the Docker ecosystem. I think that's very important because, for instance, 1.9 brought us networking and volumes, and 1.10 will give us user namespaces and some IP configuration, where you can tell
11:00
the docker run command to just use this IP within the container, and then it will have this IP, which is pretty neat as well. That's why I want to cover some of the cool new features, and I don't want to rely on a vendor to provide a newer Docker version. I think Atomic is at 1.8 currently; they have not even updated to 1.9, so no networking.
11:20
But anyway, I don't want to rant about it too much. So what I did is reduce it to the max. I put Docker on an existing installation; I will show the configuration in a bit. I used Kickstart, or rather I did not even install it myself, it was Kickstarted for me, and I had SSH access,
11:42
and I used Ansible to install all the bits and pieces I need. And I don't want to focus on corner cases so much. I think that's the Docker Inc approach anyway; they don't care about corner cases at first. For instance, if you want to have multi-tenant IB usage, so InfiniBand usage in different containers on your host,
12:02
then you are screwed. Well, there are tricks to work around it, there are a lot of snowflake containers you can create, but I think first I would like to get it going and then think about the corner cases later. And the same with user namespaces; I think there is also a need for something like this.
12:23
But we will get there eventually, with a new version maybe. We have to play around with it first, and to play around with it we have to be very flexible in using different tools, and that's what I am trying to achieve here. And the HPC environment assumptions, I mean, those are a given, I guess.
12:43
It's single-tenant for me, and I focus on performance, so I don't care so much about security yet, because, as I said, Docker is moving forward so fast. If I think hard about it for half a year, what I thought about might be irrelevant anyway, so I have not thought about it too much yet.
13:04
So the setup: I had eight nodes from the HPC Advisory Council, XS2, a fairly old system: eight nodes with Xeons, 32 gigs, and QDR InfiniBand.
13:21
I used CentOS 7.2, which was updated from the previously mentioned 7 alpha, and to install the dependencies in my stack I used some Ansible. So I first install Consul as a service and start it. After that service is started, the Docker engine can start,
13:41
because it has to hook into the Consul cluster, as mentioned before, and so on. Okay, so what does it look like? I have these eight nodes, as I said; the Docker engines are hooked up to Consul as the key-value store to synchronize the networks that are created.
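As a hedged sketch of the per-node bring-up order (Docker 1.9-era flags; the addresses and Consul bootstrap settings are placeholders):

    # 1. Consul agent first, so the key-value store is reachable.
    consul agent -server -bootstrap-expect=3 \
      -data-dir=/var/lib/consul -bind=<node-ip> &

    # 2. Then the Docker engine, pointed at Consul for multi-host networking.
    docker daemon \
      --cluster-store=consul://localhost:8500 \
      --cluster-advertise=eth0:2376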
14:00
I put Docker Swarm on top of it, so it looks like this: there are the swarm clients I talked about, and one master on one of the nodes. They also use the key-value store from Consul to synchronize. And then I just put a Slurm cluster on it. I will show a link to the code, the playbooks,
14:22
and the compose files that are used, but since my time is short I will skip this. So I put the Slurm cluster on the eight nodes, and since the Slurm daemon itself is pretty boring, I put some little benchmarking tools on top of the slurmd container
14:42
to really run an application, an HPC application. And I could also pre-stage multiple containers of this sort, right? This is what I did a year ago: I had an OpenFOAM container and an HPCG container and so on, and since they are just sitting there doing nothing, they don't consume many resources.
15:01
So that's why I pre-staged them. What could also be done is to have a Slurm daemon on the real hardware and just spin up the containers when I need to run the computation. That might be a little bit less resource-hungry. But I think that since the bash process is just idling, it doesn't consume many resources,
15:22
so I think it's no problem to have them all spun up at the same time. So this is basically the hello world of Slurm. Slurm is a resource scheduler; it was mentioned a couple of times already. You have different partitions; I just have the simple partitions all, odd, and even, which turns out to look like this. And since I use Consul, this is auto-generated.
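A hedged, illustrative slurm.conf fragment for those partitions (node names and sizes are made up; the real file is rendered from Consul data via templates):

    NodeName=vinos-00[0-7] CPUs=8 RealMemory=32000
    PartitionName=all  Nodes=vinos-00[0-7]     Default=YES State=UP
    PartitionName=even Nodes=vinos-00[0,2,4,6] State=UP
    PartitionName=odd  Nodes=vinos-00[1,3,5,7] State=UP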
15:43
So if I were to spin up eight more containers, the Slurm configuration would be auto-generated out of templates and the daemons restarted. And then I run a simple job: I use srun -N to ask for eight nodes and just let them all execute at the same time.
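The hello world itself, as a minimal sketch:

    sinfo               # shows the all/odd/even partitions and node states
    srun -N 8 hostname  # one task per node; each just prints its hostname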
16:03
I execute it on one of the nodes, and then I get the output of hostname, basically, for all nodes. That's the Slurm cluster. And since we are not only an HPC devroom but big data as well, I created a Samza container and put Samza on my cluster as well.
16:25
Samza, for all of you who are not familiar with it, is a stream processor, a distributed streaming pipeline, which uses Kafka for communication. You create a Samza job that reads from a Kafka topic and writes to a Kafka topic.
16:41
So I had to stage a ZooKeeper and a Kafka cluster, which are containers, on top of all the different nodes. Then I put a Samza instance next to it, and then I can run my jobs through the Samza nodes. I could set it up, or I will set it up, with YARN, so that I have a distributed YARN scheduler on top of it.
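A heavily hedged sketch of that wiring (image names and the environment variable are placeholders, not the images from the talk):

    # ZooKeeper first, then Kafka pointing at it, all on the overlay network:
    docker run -d --net=global --name=zookeeper my/zookeeper
    docker run -d --net=global --name=kafka \
      -e ZOOKEEPER=zookeeper:2181 my/kafka

    # A Samza job then consumes from one Kafka topic and produces to another.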
17:03
And maybe I can even set it up with Slurm. Three minutes? So Slurm, maybe? Yeah, out of time, so as you said, three minutes. I would also like to have shown Sensu, which is kind of a nice monitoring tool, which...
17:20
I don't have time for; it's nice, look at it. Metrics: through Sensu I collect metrics, put them into Graphite, and use Grafana dashboards to show cool stuff. Yeah, so, recap. In my opinion, as I said, I will go with vanilla Docker tech,
17:41
and I can go with it on top of any distribution, because I would like to keep up with the ecosystem and prevent a vendor or ecosystem lock-in. I think that's a pretty dangerous thing currently, but maybe that's because I'm not earning money with it. So if you want to earn... anyway, no. And as I said, I don't want to focus on corner cases.
18:02
If I can achieve 80% with 20% of the time, then I will go for it, and corner cases are a thing for next year or so. And I don't want to scare away stakeholders; Guido is an ops guy at Gaikai. It should be simple, but not because he's a simple guy.
18:21
It should reuse all our workflows and infrastructure, our HPC workflows and infrastructure, and it should be a solution to a problem, not just a buzzword. Yeah, if you have questions: the upper link is a link to all the playbooks (and hopefully no passwords), the compose files and so on, and the background files.
18:45
On my blog I have put a couple of posts about it, and I will continue to write about it there. And, a little bit prematurely: with the University of Pia we might have a little conference about this kind of HPC cluster setup.
19:02
So we scheduled it for April. Maybe it will shift, but hopefully it won't. So if you'd like to learn more about it, then ping me by email and talk to me. That's basically it.
19:56
(Audience question, largely inaudible: something about testing at the kernel or device level, latency, and whether overlay networks should be included.)
20:30
The question is... can I answer this one?
20:48
No, no. The question was: why not use a native installation or native virtual machines instead of Docker?
21:08
First, because I like Docker and I don't like OpenStack. I installed OpenStack seven times and finished about half the time, but I don't want to rant about OpenStack.
21:21
But I think, since with Docker images you have clear identifiers for the userland that you have packaged, this is a very nice way to reproduce your work. So if you have a packaged image and the ID of the Docker container or Docker image you used, then you can run the same workload
21:41
with the same userland yourself. So I think it's about reproducibility. Maybe I didn't answer your question. No, no, no: (partly inaudible) it's more about who actually runs the thing, like installing OpenStack at night, and installing this,
22:00
and so on and so forth. Just following on from that: you said that when you preload these things the footprint is very small. It would be worth doing some work on that and actually saying what that footprint is, and how many you could just have ready in your magazine; I mean, how many could you have ready?
22:20
I think there was a limit of 1,000 containers, because of something. So you could have a cluster with 1,000 applications ready to go? Yeah, I mean, they are just barely running, and if you use AUFS the binary is even loaded to memory only once. So anyway. How much is the footprint? Has anyone analyzed it?
22:40
It swaps out anyway, so... I don't know. I think it's tiny, but I cannot say how much; it's just a group of processes. (Inaudible exchange.) Do you have a question? Ah, thank you. Very kind.
23:01
(Off-mic chatter.) That guy knows what he's doing, he's really good. He seems to see a lot more prevalence of Docker.
23:21
But he did an OpenFOAM Docker container. Yeah. But I mean, I don't know the OpenFOAM Docker image, but it's crazy: you have to have the source code for GCC 4.7, you have to compile GCC before you can compile the application. It's like, what are you guys doing? That's funny. I mean, we maintain OpenFOAM in our center,
23:42
and we just said: we're not going to build a GCC, we're going to build it with our compiler, you know. But I think those people have probably been burned so many times by not knowing what the compilation environment is. I used to do something with one of these specialization things, one of these commercial pipelines, and they actually ship their own Python.
24:01
It's a really out-of-date version of Python in there.