OpenStack Magnum at CERN
Formal Metadata

Title: OpenStack Magnum at CERN
Number of Parts: 611
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/42238 (DOI)
Production Year: 2017
Transcript: English (auto-generated)
00:20
I'll go ahead and start. Hi, everyone. I'm Spiros, a software engineer at CERN and a core developer in OpenStack Magnum. I will talk about how we use OpenStack Magnum at CERN, the container use cases we have, and some scalability tests we did with our service and with Kubernetes.
00:40
So, as I said, I'm a core developer in Magnum; we are the OpenStack containers team. We offer an API service that provides Kubernetes, Docker Swarm, Mesos, and (experimentally) DC/OS as a service. So with two clicks you can have a cluster running, and you can talk directly to the API of whichever one you chose:
01:02
Docker Swarm, Kubernetes, or Mesos. What Magnum does is orchestrate compute instances, which can be either VMs or bare metal. It creates networks, like tenant networks or public networks, and load balancers. It also configures storage, either for container storage
01:22
or for persistent storage. It also deploys the certificates you need for a secure service, like TLS credentials for etcd and TLS for the Kubernetes API server. And of course, you have the container-native API.
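As a minimal sketch of how these pieces (image, external network, container storage volume, choice of engine) come together, defining a cluster template with the Ocata-era magnum client looks roughly like this; the image, keypair, and network names are placeholders, not values from the talk:

    # Sketch: define a reusable cluster template (Ocata-era magnum CLI).
    # Image, keypair, and network names below are placeholders.
    magnum cluster-template-create --name k8s-template \
        --coe kubernetes \
        --image fedora-atomic-latest \
        --keypair mykey \
        --external-network public \
        --flavor m1.small \
        --network-driver flannel \
        --docker-volume-size 5   # Cinder volume used for container storage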
01:40
So if you use Docker, you do docker run or docker ps or whatever; if you use Kubernetes, you use kubectl; with Mesos you use Marathon; and DC/OS has its own UI and its own API. Magnum mostly focuses on lifecycle operations. The ones currently available are create, delete, and scaling the cluster up and down, with more in progress; a sketch of these follows below.
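A minimal sketch of the lifecycle operations, again assuming the Ocata-era magnum client and placeholder names:

    # Create, scale, and delete a cluster; "mycluster" is a placeholder.
    magnum cluster-create --name mycluster \
        --cluster-template k8s-template --node-count 2
    magnum cluster-update mycluster replace node_count=4   # scale up
    magnum cluster-update mycluster replace node_count=1   # scale down
    magnum cluster-delete mycluster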
02:03
So this is the architecture of the service. On the right side of the screen, we have the Magnum user who creates a cluster, which uses a specific cluster driver. If you are an operator, you can customize your driver and modify how Kubernetes or Docker is
02:22
deployed. The orchestration service, which is Heat in OpenStack, creates the cluster. Then we pass our scripts with cloud-init into the nodes, both master nodes and worker nodes, and we deploy the service. On the left side of the screen, we have the native API of Kubernetes or Docker.
02:41
So we can use the tools or talk directly to the REST API. OpenStack has two releases per year; the next release is in two weeks. These are the plans that we have for Pike, the release after that, which will be in August. We want to manage upgrades of clusters.
03:01
That's either upgrading Kubernetes only, or upgrading Docker underneath. We plan to do it as rolling upgrades with node replacement. We also want to support heterogeneous clusters, that is, to create, always within OpenStack, clusters spanning different availability zones,
03:21
or with different hardware, or with different flavors; for example, a bunch of big nodes and a bunch of small nodes. Also, very soon we will release support for Docker swarm mode, which is not available yet; we use the legacy Swarm for now. We're also working on providing a solution for container monitoring at deployment time
03:42
with Prometheus. I just saw the talk about operators, so we might do it with an operator, maybe; but the goal is to monitor Kubernetes itself rather than something else. And we're going to improve the support for cluster drivers, so we can allow different companies with different use cases to modify their drivers and customize them to their needs.
04:01
And we want to extend our bare metal support a bit, which is limited to Kubernetes for now. About our infrastructure: this is a screenshot taken this week. We run, at the moment, 60 Magnum clusters, as you can see, but we have a very big infrastructure, so we can create many more. So, the use cases at CERN.
04:22
For those of you who don't know what CERN is: we have a particle accelerator that accelerates particles to nearly the speed of light, we smash them together, we take pictures of them, and we store them as events. So the first use case is batch processing, that is, a distributed system that tries to do event reconstruction
04:41
from the data that the sensors recorded. We also have end-user analysis with Jupyter notebooks: physicists want to analyze the data, and to make that easier we have these notebooks, so they can do their analysis from the browser. There are also use cases for machine learning with TensorFlow and deep learning;
05:02
physicists are more into that, we just provide the infrastructure. We also have infrastructure services and infrastructure management, like moving data across the various data centers used by CERN users, and then web servers, platform as a service, continuous integration like GitLab CI, and many others.
05:21
So this is the history of Magnum at CERN. We started looking into it in 2015; in the beginning of 2016 we had the first pilot service, and later last year we opened it to all users. As I said, with cluster drivers we modify the upstream Magnum a bit
05:41
to support CERN services such as CVMFS and EOS, which mount data from the LHC, and we investigated how to do that with system containers on Atomic, if you want to have a look. So this is what it looks like for a CERN user to use Magnum. We have these public cluster templates
06:00
for Swarm or Swarm high availability and Kubernetes high availability, and this is the workflow. You do cluster create and specify the node count; you wait a bit, depending on how many nodes you want; then you do cluster list and you see that it's CREATE_COMPLETE. You run one command, cluster config, that fetches all the TLS credentials you need, and then you talk to Docker or Kubernetes like you do normally in any deployment, as sketched below.
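A minimal sketch of that workflow, assuming the Ocata-era magnum client and a placeholder template name:

    # Create the cluster and wait for it to come up.
    magnum cluster-create --name mycluster \
        --cluster-template kubernetes-ha --node-count 2
    magnum cluster-list                       # wait for CREATE_COMPLETE
    # Fetch the TLS credentials and point kubectl at the cluster; the exact
    # output of cluster-config may differ between releases.
    eval $(magnum cluster-config mycluster)
    kubectl get nodes                         # from here on, plain Kubernetes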
06:21
So how good is this service that we offer? We did two benchmarks. The first benchmarks the service itself: is Magnum able to serve many users? How does it scale? Can it create all these clusters?
06:40
And the second one: are these resources good to use? Is the performance good or poor? We used the Kubernetes benchmark that the Google Cloud team released, which creates some loadbots and some HTTP servers serving a static file, and you scale them up and down.
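The pattern is roughly the following (a sketch only, not the exact Google harness; the loadbot image name is a placeholder):

    # NGINX pods serve a static file; "loadbot" pods hammer them.
    kubectl run nginx --image=nginx --replicas=100 --port=80
    kubectl expose deployment nginx --port=80
    kubectl run loadbot --image=example/loadbot --replicas=400
    # Scale both sides up and down, watching requests/sec and latency.
    kubectl scale deployment nginx --replicas=500
    kubectl scale deployment loadbot --replicas=9500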
07:03
We did the test in two data centers, one at CERN and one at the CNCF cluster in Las Vegas. Our deployment has 240 hypervisors with 32 cores each at CERN, and 100 hypervisors at CNCF. We used a similar configuration for Magnum and Heat.
07:20
Heat is the orchestration service. At CERN we used our production service, so we have more controllers for RabbitMQ, which is heavily used when you create clusters. At CNCF we used the upstream Ansible scripts, so you can replicate what we did there using those scripts.
07:41
And there is a small difference between CERN and CNCF: at CERN we have a flat network, so all the VMs are in the same network, but the CNCF cluster has tenant networks. So these are the results at CERN for both tests. On the left is the benchmark of the service,
08:01
how fast you can create VMs, and thus clusters. For two-node clusters, which are essentially three VMs, one master and two workers, you can have one in 2.5 minutes, and we created 50 clusters at the same time. As you can see, the creation time stays fairly stable up to 100 nodes, at about five minutes,
08:23
but then we start to see that the time scales linearly, and for 1,000 nodes it took about 25 minutes, which is still pretty good. But as we noticed, there is still room for improvement. And on the right is the Kubernetes benchmark.
08:41
In this example we managed to serve seven million requests per second with 500 services, that is, 500 NGINX servers serving a static file, and 9,500 loadbots hammering these servers.
09:00
And we had, not great, but reasonable latency, about 15 milliseconds. About the Kubernetes test we did at CNCF: we managed to get similar numbers, but we didn't scale very much, just reaching one million requests per second with 180 web servers and 1,000 loadbots.
09:21
And the deployment of clusters is very similar: for a small cluster we needed three minutes, where at CERN we need 2.5. But we must refine our Ansible scripts to do a better deployment of RabbitMQ, so we didn't get exact measurements of
09:41
how it performed when we created many, many clusters. We were only able to verify that we successfully created 200 clusters; then our benchmark tooling literally broke. So that's it, I hope you liked the presentation, and thank you.