Automated Deadline-Based Scaling of Experiments in the Cloud with MiCADO
Formal Metadata

Number of Parts: 60
License: CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/42564 (DOI)
Transcript: English (auto-generated)
00:00
Today, I'm going to be talking about auto-scaling deadline-constrained workloads in containers, in the cloud. But in order to get to that point, I'm going to start by talking about a project I've been working on at the University of Westminster, where I'm a researcher. Project COLA is pretty cool. It stands for Cloud Orchestration at the Level of Application.
00:23
And it's a Horizon 2020, so EU-funded project. It's been going now for about two years. The completion date is coming up on us fast. It's the end of September of this year. And we've got 14 really interesting partners across six different countries. And there's a mix of small and medium enterprise, some public sector, and four higher education and research institutes.
00:45
So there's ourselves, there's Brunel University, which is also in London. There's the National Research Institute in Sweden, and the National Research Institute in Hungary. So we've got a big, nice consortium that are all working together. I'll go through a couple of conceptual things that I'd like most people to believe in for at least the remainder of this talk.
01:06
I don't think I'll spend too much time here because Bradley made this point three times over in his talk at the keynote yesterday. But this talk assumes that one day everybody, or not everybody, people who want to kind of run software in the way we do,
01:21
will eventually move away from the on-prem model, which we're all really really familiar with, to this off-premise model where we kind of have compute as a utility. And yeah, I won't labor the point. Another kind of conceptual thing which is important for this is the idea of an application container,
01:40
better known by its commercial name, or most popular commercial name, of a Docker container. You can think of it as a lightweight VM which provides the ability to package your application in with all of your libraries and dependencies, and makes for very reusable portable software, which is really important in the cloud because who knows what infrastructure you're going to be running on tomorrow, and we really want everything to run, just run.
02:04
So that's it for kind of the conceptual stuff. The cloud isn't perfect, and this was about three years ago the problem statement that Project Cola identified. And we found that you'd have all these beautiful applications that are running up in the cloud, either a monolith or a number of microservices, and they have some sort of base requirement from the cloud.
02:23
So they need a certain amount of CPU, they need a certain amount of memory, and that need is ever-changing. So as people start to use your application to do computations, or people start accessing your web server, that demand on the baseline changes, can change radically.
02:42
And we found that, at least three years ago, there was no good way to get that supply to scale automatically. Now fast forward three years, and we're all very familiar with the term autoscaling, and all of the big three cloud providers have very mature frameworks for it.
03:01
However, that means that you basically lock yourself into one of those big three. And second, think about this scenario: we work with a couple of smaller European commercial clouds, and they don't have these really mature autoscaling frameworks; they may not have autoscaling frameworks at all. Our private cloud at the university certainly doesn't have an autoscaling framework built into it.
03:22
So COLA recognized that this was an issue. We wanted not only dynamic supply, but we also wanted it to be vendor-neutral, and we set out some other requirements: it should be a modular framework, it should be flexible, it should be secure, and obviously the whole thing should be open source so that people can join in. As a solution out of Project COLA, we developed the framework which we call MiCADO.
03:48
And this is MiCADO in one big, slightly messy slide, but it's quite easy to understand, I think. On the left-hand side you have your user interface, which is a TOSCA template; TOSCA is now an OASIS standard for describing applications in the cloud.
04:04
In this template you describe your application, usually in containers; you describe the virtual machines that are going to be supporting those containers; and you can also define nice things like scaling policies, security policies, and the links between your different containers and virtual machines. That gets passed into the main core of MiCADO,
04:24
where there's a submitter component to chunk it all up, and pass information to other components. So there's Occopus, which is an infrastructure as code, open source project, which orchestrates VMs on a number of given cloud service providers. There's Kubernetes, which orchestrates containers on top of those VMs.
04:42
Prometheus does the monitoring, so it keeps track of your resource usage in containers and in VMs. And then we've got a couple of components: one which does totally reactive scaling, and then what we call a machine-learning-based optimizer, to provide more proactive scaling. As the master node is doing its thing, it's spawning up worker nodes, which are actually running your application in the cloud,
05:03
and they scale up and down as needed, and pass information back to the master so that it can make more scaling decisions in the future. The first scaling use case comes from that problem statement I mentioned a couple of slides ago, and that was pretty easy. We were talking about resource-intensive services, so these are typically CPU- or memory-bound services
05:23
that we wanted to scale when there was kind of this big influx of load. And here's a graph of a very simple kind of example. We have just a couple of VMs that are running a number of services quite happily at the beginning,
05:41
and then those services experience some load, they spike, and in order to manage the load, MiCADO starts adding new virtual machines to the infrastructure, Kubernetes spreads the containers across these new virtual machines, and the overall load on the infrastructure, which is the graph you can see here, is pushed down to something more manageable.
06:00
The second use case came from one of our project partners, and is the title of the talk today, which is hopefully going to be something that some people in this room would one day like to try, and will start bugging me about. It was Brunel University, and they came to us and said: we want to run these multi-job experiments. So they had these parameter-sweep-style jobs
06:22
that they wanted to run, and they expected that MiCADO could scale the underlying containers and VMs in order to finish all of the jobs before a deadline that they set. So one researcher in particular wanted to start these jobs on a Friday afternoon, leave her desk, and come back on Monday,
06:42
and MiCADO would have scaled the infrastructure perfectly, so that the lowest amount of cloud resources was used but all of the jobs were still finished, with the results sitting there for her when she arrived on Monday. So we had a bit of a problem, because MiCADO didn't have any way
07:00
to execute potentially hundreds or thousands of jobs, and we didn't have any way to execute them in the containers that we were scaling up and down. So you can see we're missing this queue over here. So we went back to the drawing board and designed a queue, which you can see here in color, and we called it JQueuer. Very, very simple. There's a master component which runs external to MiCADO, and that's the queue itself; it provides some monitoring functionality as well.
07:23
And then there's an agent component which runs on each and every single MiCADO worker node and fetches the jobs and executes them on the containers that are local to it. And with that, we had the system that we thought would work. So I'll show you next the experiment.
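The master-queue-plus-agents pattern described here can be sketched in a few lines of Python. This is an illustrative stand-in, not the actual JQueuer implementation: the in-process queue replaces the external master component, and threads stand in for worker nodes.

```python
import queue
import threading

def agent(job_queue, run_job, results):
    """Worker-node agent: pull jobs from the shared queue until it is drained."""
    while True:
        try:
            job = job_queue.get_nowait()
        except queue.Empty:
            return  # no work left; the agent (and its VM) can go away
        # execute the job locally and record the result
        results.append(run_job(job))

# usage: three "worker nodes" draining a queue of ten jobs
jobs = queue.Queue()
for i in range(10):
    jobs.put(i)

results = []
workers = [threading.Thread(target=agent, args=(jobs, lambda j: j * 2, results))
           for _ in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

The point of the design is that nothing is pre-assigned: a fast worker simply pulls more jobs, which is what later makes the dynamic allocation beat the manual split.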
07:40
Oh, sorry. Here's the communication between MiCADO and JQueuer. So this is JQueuer. It keeps track of all these metrics and sends them across to MiCADO, so MiCADO can determine: okay, how do I scale here appropriately so that I can finish all the jobs I have, using the lowest amount of cloud resource, by whatever deadline is set.
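Conceptually, exposing those queue statistics to Prometheus means rendering them in its plain-text exposition format. A minimal sketch using only the standard library follows; the `jqueuer_*` metric names are invented for illustration and are not the real exporter's names.

```python
def render_metrics(jobs_queued, jobs_completed, avg_job_seconds):
    """Render queue statistics in the Prometheus text exposition format.

    Metric names here are illustrative, not the actual JQueuer names.
    """
    lines = [
        "# TYPE jqueuer_jobs_queued gauge",
        f"jqueuer_jobs_queued {jobs_queued}",
        "# TYPE jqueuer_jobs_completed gauge",
        f"jqueuer_jobs_completed {jobs_completed}",
        "# TYPE jqueuer_avg_job_seconds gauge",
        f"jqueuer_avg_job_seconds {avg_job_seconds}",
    ]
    # Prometheus expects one sample per line, newline-terminated
    return "\n".join(lines) + "\n"

print(render_metrics(150, 50, 18.2))
```

In practice this text would be served over HTTP (for example with the `prometheus_client` library) so that Prometheus can scrape it on a schedule.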
08:00
So that's a simple Prometheus exporter that we've attached there to export those. And now the experiment itself. So this is what Brunel came to us with. They said, hey, we want to determine the impact of small changes in behavior on the spread of a disease across a population. And we don't really know anything about this, because we're just cloud programmers.
08:21
But we started working together, and they came up with this experiment. It's an agent-based simulation using Repast Simphony, which is an open-source suite for agent-based modeling and simulation. They defined three agent groups: an infected group, a susceptible group and a recovered group. And they designed some models which would simulate the movement of these three groups
08:42
and their interaction in some sort of environment, while sweeping through a number of parameters, so that they could determine the best-case and worst-case scenarios for disease spread. This was the first real experiment we were going to be running through MiCADO, so we wanted to have some sort of baseline test,
09:00
so we determined a baseline with this very simple mini experiment here, a pre-experiment. And we said: no MiCADO, no JQueuer. We spawned up five VMs for ourselves in the cloud and equally distributed all of the jobs they gave us, I think it was 200 for a first sample, across these five VMs.
09:21
And then we hit play and we started watching. It took one hour for five VMs to complete these jobs, and we thought, okay, that forms a pretty good baseline for our comparisons with MiCADO. Next we booted up MiCADO and JQueuer and submitted these same 200 jobs with the same deadline that we'd gotten
09:42
in our manual allocation. And we were hoping that MiCADO was going to beat the manual allocation, not in terms of speed, because that's not what we were trying to do; we were hoping that somehow it would find a way to allocate jobs differently so that we could use less cloud resource over time. And you can see here the jobs go to JQueuer.
10:01
The deadline gets stamped on there, off to MiCADO, and MiCADO spawns its worker nodes, which run the Repast simulation software and the JQueuer agent software. And, well, here's what happened. It was pretty neat, at least we think it was. So this is the same experiment; we ran it about five times. Here are two graphs showing slightly different results,
10:21
but more or less the same. The red box is the manual allocation that I showed you earlier. Very, very boring: five VMs the whole time, an hour and three minutes, something like that. And then we let MiCADO run, and that's in blue there. You can see there's a bit of a lag time, and that's because MiCADO takes a little while: you hit start on MiCADO
10:41
and it takes a little while to boot up its first VM and throw it into the infrastructure. And you can see at the beginning, MiCADO is overcompensating a bit. It takes a look at how many jobs it has, calculates on the fly the time it takes for each job to complete, and says: okay, I think we're going to need six.
11:01
And then MiCADO realizes we're going way too fast: I can do all of this before the deadline without this many resources. So MiCADO scales down, and it goes to four, and it goes to three, and it goes to two, and it's a little bit different each time, because we're dealing with cloud resources on the fly here. But the big takeaway, and the big nice win for us, is that while the manual allocation took five VMs over an hour,
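The on-the-fly calculation described here can be approximated as: estimate the outstanding work from the running average job time, divide by the capacity one VM can deliver before the deadline, and round up. A minimal sketch, not MiCADO's actual policy; the `concurrent_jobs_per_vm` parameter is a hypothetical simplification.

```python
import math

def vms_needed(jobs_remaining, avg_job_seconds, seconds_to_deadline,
               concurrent_jobs_per_vm=1):
    """Estimate how many VMs are needed to drain the queue by the deadline."""
    # total compute-seconds of work still outstanding
    outstanding = jobs_remaining * avg_job_seconds
    # compute-seconds a single VM can contribute before the deadline
    per_vm_capacity = seconds_to_deadline * concurrent_jobs_per_vm
    # always keep at least one VM while jobs remain
    return max(1, math.ceil(outstanding / per_vm_capacity))

# 200 jobs averaging 90 s each, with one hour to the deadline
print(vms_needed(200, 90, 3600))
```

Re-running this as jobs complete and the average job time is refined is what produces the scale-down from six to four to three to two seen in the graphs.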
11:23
MiCADO did it with an average of just 3.86 VMs, which is really cool, because that's a saving of more than one VM-hour. And if you start scaling that up, I think you start to really see some savings. Not to mention that we can then take this entire experiment and, after having run it on, say, our private cloud for a little while,
11:41
we can go and use our credits with AWS and run it there for 200 hours and when we've run through those credits, we can then take it to our small EU commercial cloud that we like using and we can run it there for 200 hours. So yeah, that's all I really had to show you today. The homework for this is to come and kind of check us out.
12:03
So I've got a GitHub link up there, which is for the entire MiCADO project. I've also got an article on JQueuer and MiCADO that was just published online a couple of days ago, and it's a much deeper dive. So if you want to see all the nuts and bolts of what's going on, give it a read.
12:21
I think it's up there for everyone. And then of course our project website which is up there too. So we have time for some questions. Please, Robert.
12:47
You mean the lag time at the very beginning? Oh sorry, do you want me to repeat? So the question was about that lag time at the beginning of MiCADO booting up: does it scale as you have more and more VMs? And in fact, it doesn't, because all of those
13:00
virtual machines get started asynchronously. So it'll take five minutes to start up four virtual machines but in theory, it should take about five minutes to start up 500 virtual machines. Further questions?
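The asynchronous start-up behavior can be illustrated with a toy simulation: because every boot request is issued concurrently, wall-clock time stays close to the boot time of a single VM regardless of the count. The `boot_vm` coroutine below is a stand-in for a real cloud API call, not MiCADO code.

```python
import asyncio
import time

async def boot_vm(name, boot_seconds=0.05):
    # stand-in for a cloud API call that returns once the VM is up
    await asyncio.sleep(boot_seconds)
    return name

async def boot_all(n):
    # issue all boot requests concurrently; total wall time stays
    # close to the boot time of a single VM
    return await asyncio.gather(*(boot_vm(f"worker-{i}") for i in range(n)))

start = time.monotonic()
vms = asyncio.run(boot_all(50))
elapsed = time.monotonic() - start
# 50 concurrent boots take roughly one boot time, not 50 boot times
```

Booting the VMs one after another would take 50 × 0.05 s = 2.5 s in this toy model; concurrently it stays near 0.05 s, which is the effect described in the answer.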
13:20
May I? Did other people already work with the tool, or is it currently tested only in your hands? Could other people from the community also work on it and bring their perspectives to it? Yeah, we would love for that to happen. We're lucky in that we have a pretty big consortium; you saw we have 14 partners there, plus a couple of smaller side partners.
13:41
But in fact, the only people that have really tested this so far, with the exception of a couple of PhD students at the University of Westminster, are people within the consortium. So we've had very, very little interaction with the community yet, which is probably a good thing, because the software is really just coming into its own now. We had our first release of it, kind of an immature release of it
14:01
if you want to call it that, about six months ago or something like that. And we're putting out a new release this week which is the most stable thing to date, and hopefully the most understandable. So that's what we're looking for. We'd like for people to come and talk to us, and to get more use out of both our RSE communities
14:22
here and in the UK. That would be really great. Great, thank you. Another question? So, when you talk about overheads... sorry, the question was: what are the overheads that MiCADO can save
14:43
compared with manual allocation? And in fact, the overheads are worse, because you're running JQueuer on your machines and you're running MiCADO on your machines, so the resource overhead is worse.
15:06
So here's the question: why does the dynamic allocation beat the manual allocation? Because these job times are different; they're all different. We'd done a spread of, I think, 40 jobs per VM
15:25
in the manual allocation, because we hadn't looked at the job times and seen how we could stack these things to get the best use out of each VM, whereas MiCADO does it automatically. As soon as one job finishes, it immediately assigns another job to whatever container was running it.
15:41
So in MiCADO, each machine isn't necessarily doing 40 jobs. You might have a couple that are doing 50, and then your last VM can shut down for the last half hour and save you some resources there.
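The difference between the two strategies is classic list scheduling. A small simulation (not MiCADO's actual scheduler) shows how pulling jobs from a shared queue beats a fixed round-robin split when job durations vary:

```python
import heapq

def static_makespan(durations, n_vms):
    """Split the jobs round-robin up front; each VM runs only its own share."""
    shares = [durations[i::n_vms] for i in range(n_vms)]
    return max(sum(share) for share in shares)

def dynamic_makespan(durations, n_vms):
    """Each VM pulls the next job from a shared queue the moment it is free."""
    finish_times = [0.0] * n_vms  # min-heap of when each VM becomes free
    heapq.heapify(finish_times)
    for d in durations:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + d)
    return max(finish_times)

# one long job plus several short ones on two VMs: the static split
# strands the short jobs behind the long one on a single VM
jobs = [5, 1, 1, 1, 1, 1]
print(static_makespan(jobs, 2), dynamic_makespan(jobs, 2))
```

In this toy instance the static split finishes in 7 time units while the shared queue finishes in 5, because the idle VM keeps pulling short jobs while the other works on the long one, which is exactly the stacking effect described in the answer.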