Self-service Kubernetes Platforms with RDMA on OpenStack
Formal Metadata

Title: Self-service Kubernetes Platforms with RDMA on OpenStack
Number of Parts: 542
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/62011 (DOI)
Transcript: English (auto-generated)
00:05
Next speaker is John Garbutt from StackHPC, who's going to talk about self-service Kubernetes with RDMA on OpenStack. Thank you. Hello, everyone. I pressed the button. Excellent. I'm green.
00:20
Hello, everyone. I'm John Garbutt. I'm here to talk to you about OpenStack, RDMA and Kubernetes. And are they oil and water mixing, or are they bread, oil, and vinegar? Hopefully, I'll convince you of something nice. So I'll start with some thank yous to my sponsors.
00:41
So I work at StackHPC. We're about 20-something people now. We've got people across the UK and across Europe. So I'm based out of Cambridge, but the head office has a lot of people around Bristol, people in Poland, and people in France as well. And we work on helping people create OpenStack clouds, train them up on how
01:07
to look after them, and support them through that journey and everything that's happening there. For this particular topic today, I want to say a big thank you to all of these organizations. These are all in the UK.
01:20
Firstly, JASMIN. So I'm going to talk today about how we package up these solutions and stamp them out for people as reusable pieces. And this is a project that's come out of the JASMIN facility. And that got taken on by IRIS, which is an STFC community cloud project. So they're trying to find ways in which more STFC-funded activities in the UK can share the same sets of infrastructure.
01:47
How do we get one pool of infrastructure and share that between all of these different research use cases? And in particular, there's lots of organizations we've been working with and getting feedback from. So we've been working a lot with the SKA community in the UK, particularly the SRC community at the moment.
02:06
And they've been giving us great feedback on some early versions of all of this and how to improve things. And that's actually been funded partly by the DiRAC project as well, which is the HPC centre, a sort of group of HPC systems.
02:23
Also note the small i: DiRAC, not the capital-I DIRAC, just to confuse everything. If you look for the small-i DiRAC, that's the group of HPC centres, as opposed to the job submission system. And we've been working very closely with the research computing services at the University of Cambridge, sort of tying this together.
02:43
They're one of the IRIS sites and one of the Dirac sites. And we're starting to reuse the things coming out of Jasmine. Anyway, a big thank you to all those folks. So I want to start with why on earth would you use OpenStack and Kubernetes and not just have one big batch scheduler?
03:06
And really it's about getting the most value out of the infrastructure investment you've made. And today it's also worth saying that part of what I mean by that is that investment in your infrastructure is also an investment in carbon cost.
03:21
How do you get the best out of that investment in carbon to manufacture these machines and run these machines? And what do I mean by value? Well, that's different things to different people. I mean, how do we reduce time to science? How do we get more science out of that particular investment that a community has made?
03:41
So firstly, it's a bit about sharing diverse infrastructure. Hopefully people aren't hungry. Apologies. I've spent far too much time on Unsplash, so thank you to Unsplash. So there's increasing diversity, as in different flavors on the pizza here, in lots of the user requirements.
04:03
So in terms of the IRIS community, they're currently working actually a lot more with large international collaborations. And often those users come with a system that they want to run on your infrastructure, regardless of everything else that's happening. And so one of the problems that's been happening is you sort of silo your infrastructure into, well, this was bought for purpose A, this was bought for purpose B.
04:25
But actually those infrastructures are getting more diverse. There's only so many GPUs any one person can afford in a particular institution, and everyone wants to use them. How do we share that out? How do we share out the accelerators and all the special bits of kit between these different use cases that day to day might be different people wanting to use those bits of infrastructure?
04:45
So that's kind of: how do we slice it up? And also one physical server, particularly when you're doing test and development, is getting bigger and bigger in terms of consuming it. So giving people one whole server can be a problem.
05:02
The other thing, and I'm speaking as a developer here before I bash developers: we love breaking things. So if you give people access to the kernel and they go crazy and crash the kernel, if it's just your little kernel, then it's only yourself you've crashed. That's a bit of an extreme example, to be fair.
05:21
I don't really mean crashing the kernel; more likely I mean crashing the thing that you put in the kernel, to be precise. Anyway, how do we separate this up? And actually, probably a better analogy than pizza is a reconfigurable conference room. So if you plan ahead, you can make this kind of change. Sometimes you want to use all of the room for a really big meeting, like this one, and sometimes you want to divide it up.
05:48
And when you divide it up, you want a certain amount of isolation, because you can accidentally get the noisy neighbour problem in these setups. So you have to be careful about how you're actually doing that dividing.
06:05
And so one of the things that's also changed most recently is how we get these reusable bits of infrastructure, so I said we've got, well, reusable platforms on top of the infrastructure. So one of the things I said about the IRIS project is that it's working a lot with international communities coming with a thing to run.
06:23
Quite often these days that thing to run is packaged in Kubernetes. Sometimes people are developing on Kubernetes on their laptops and they need a bigger Kubernetes. But this is certainly becoming a thing now. People just say, you know, this is how I'm wanting to deploy. How do I carve out the Kubernetes infrastructure and have Kubernetes on top of it to do what I need to do?
06:48
And actually it's been very helpful in terms of giving us a higher level of abstraction that we're working with to kind of, you know, to package up web applications and interactive applications and a whole manner of things.
07:05
Okay. So the next piece in the topic was why RDMA networking, or why remote direct memory access. So, to try and prove my point, I thought I'd show a pretty graph.
07:21
This is OpenFOAM. At the bottom here, there's a link to the tool that we use to actually run these benchmarks and to make it nice and repeatable. Essentially you can describe in a Kubernetes CRD the kind of benchmark you want to run, and then it basically submits a job to Volcano, monitors the output and just tells you what the output of that was.
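To make that concrete, here's a minimal sketch of what such a benchmark CRD might look like. The API group, kind and field names here are illustrative assumptions, not the tool's actual schema; check the linked repo for the real thing.

```yaml
# Hypothetical benchmark CRD, sketched from the description above.
# The apiVersion, kind and field names are assumptions for illustration.
apiVersion: perftest.example.org/v1alpha1
kind: OpenFOAMBenchmark
metadata:
  name: motorbike-rdma
spec:
  nodes: 4                  # how far to scale out the MPI job
  slotsPerNode: 32
  network: rdma             # which network wiring to exercise (e.g. tcp, roce)
  # The controller turns this into a Volcano Job, watches the pods,
  # and reports the wall clock time of the simulation.
```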
07:45
It's just a way of making it nice and quickly reproducible. So if you look at this graph, it's showing basically wall clock time for the simulation. And on these lines, you've got lots of different networking technologies that were being tested out.
08:03
And, not surprisingly, the ones that were performing the best have all got the lowest wall clock time, so the best result in this particular benchmark. As you can see, this was probably an interesting configuration in the sense that, as you were
08:21
scaling out the compute, there was actually no benefit at all in terms of the simulation time. Interestingly, because of this slightly wackadoodle configuration, or because the job was essentially too small, you can actually see that the TCP ones above gradually get worse as they get more cross-communication, as you would expect.
08:44
There's MPI underneath here. So if we dive down into MPI, on the left hand side, we've got the latencies. And these bottom two latencies, for people at the back of the room, there's two at about five microseconds and one that's about half of that.
09:02
These are interesting. These are the RDMA ones. Actually, I'm saying RDMA here; these are actually all RoCE, using Ethernet, as you probably guessed, because I just said what the latencies were, if you're interested in that kind of thing.
09:21
So there's no such thing as a free coffee unless you're at FOSDEM, I guess. Let's just compare very briefly those three technologies. If we have a look at the bandwidth, there's something interesting happening here. It would be slightly more interesting if we'd actually had the hardware for long enough and run the rest of the points.
09:44
But you can see that the one with the lowest latency actually caps out about 100 gigabits a second. And the ones with a slightly higher latency, or double if you're being mean, actually go all the way up to the 200 gigabits a second. And actually there's a difference in the way in which that's been wired up, which I'll go into in a bit more detail later.
10:06
But essentially one of them can use the whole bond and one of them can only use one side of the bond. So these were on servers with bonded 100 gig Ethernet. If you pay a latency penalty, you can use both sides of the bond in an interesting way.
10:21
If you want the ultimate lowest latency, you kind of have to dedicate and just use one side of the bond. Anyway, so RDMA makes a big difference to these kinds of workloads. I'm referencing a talk here that was at KubeCon, "Five Ways with a CNI".
10:40
If you look at the FOSDEM session information for this talk, one of the links on there is to a blog that we wrote about this kind of thing. There's a video from KubeCon you can watch for more detail, and that particular set of benchmarks covers all these different ways of wiring up the networks.
11:03
So that all sounded a bit complicated, right? How do we actually stamp this out in a kind of useful way for users and get this all tied together? So how do we manage that operational complexity?
11:20
So the first side of this is, in terms of deploying at the OpenStack layer and configuring all of that, we've got tools from the OpenStack community, from the Kolla community in particular, Kayobe and Kolla Ansible, and we use those with Ansible playbooks so that, once you've got a working configuration, you repeatably use that every time.
11:43
It involves, you know, ensuring you can reimage the machines easily and make sure that you apply the Ansible on there and get the same thing each time. So sort of package that up and that is all open for people to reuse. And then the next stage is the users need to actually consume this infrastructure.
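Before moving on to that, a hedged sketch of the kind of Kolla Ansible globals this deployment step implies. The values are illustrative only, and a real site configuration is far larger than this.

```yaml
# Fragment of a Kolla Ansible globals.yml; illustrative values only.
kolla_base_distro: "ubuntu"
neutron_plugin_agent: "openvswitch"
enable_neutron_sriov: "yes"   # expose SR-IOV virtual functions via Neutron
enable_magnum: "yes"          # Kubernetes-as-a-service API, discussed later
```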
12:04
So if we give people OpenStack directly, they can get very confused. The people that are trying to just create a platform are typically not experts in using cloud infrastructure. So how do we make that easier? So I want to talk about Azimuth.
12:22
This is the project that I mentioned at the beginning, coming from the JASMIN team. And the idea here is for the people creating platforms. So for the platform creators, people who want to create a JupyterHub or a DaskHub or a Slurm cluster that's isolated and dedicated to their own needs, maybe for a development use case or otherwise, or create a Kubernetes cluster.
12:44
How do we just package up those good practices and make that really easy to deploy? So calling this platform as a service. If you've seen me talk about this before, one of the changes here is that you get all of the platforms in one view now.
13:01
So you can log in using your OpenStack credentials. So there's the cloud operator and then there's the platform operator logs into Azimuth, creates the platform. Then on top of the platform, you can choose which users can log into that just to make all of that much easier to do.
13:23
So I'll quickly go through the types of things that are going on here and the different types of platforms. So firstly, there's Ansible-based platforms. So things like "give me a bigger laptop", which is a particular case, so give me a Linux workstation that I can just Guacamole into, and "give me a Slurm cluster".
13:46
Those aren't Kubernetes-based. We use Terraform to stamp out virtual machines, and then there's Ansible running that Terraform to stamp out the machines and do any final configuration that might be required.
14:03
So when you click the button, all of that happens in the background and it sets up the infrastructure and you can get straight in. The other type is give me a Kubernetes cluster. I'll go into this in a bit more detail in a sec. You choose your Kubernetes cluster, set that up and it stamps that out.
14:21
And the third type, which is relatively new now, is, well, I just want a Jupyter notebook, a Jupyter hub or a Dask hub. And so for those kind of situations, we're deploying those on the Kubernetes cluster. So you can go through that. So let's go into a bit more detail. This is more just a bit of an eye chart, particularly because it's not rendering at all.
14:44
The idea is you just ask some basic questions about creating a Kubernetes cluster, what size nodes you want, what name it is. If you're creating your Kubernetes application and you've pressed go into the Kubernetes application, you give it a name and the basic constraints for the notebooks.
15:02
It's sort of pre-configured and you tell it which Kubernetes cluster to put it on or create one if you haven't got one yet. And then finally, when you've stamped out all of these, these bits of infrastructure, you can see there's a nice single sign on to go and dig in.
15:23
So if you've got Dask hub, you can click on the link to sort of open your notebook and it gets you straight in. One of the issues we've got at the moment is that there's the cost of IPv4 addresses or the shortage of IPv4 addresses. This is a big deal. So we're actually using a Zenith proxy here, a tunneling proxy called Zenith.
15:44
So essentially when we create the infrastructure, there's an SSH session poking out, doing a port forward essentially, out into the proxy. Then the proxy secures that. It does all the authentication and authorization and then punches that through.
16:01
So essentially it means that these are inside, you've got a VM inside your private network and it goes out through the NAT, not consuming floating IPs for each of these bits of infrastructure that you're stamping out. And there's lots of, I'm not going to go into too much detail on all these things.
16:23
If you create a Kubernetes cluster, it's easy to get the kubeconfig out. It's got monitoring included. Slurm similarly comes with monitoring and Open OnDemand dashboards. So in this case you can get in and out through Open OnDemand, although this one does require a public IP so that you can do SSH.
16:43
I said about bigger desktop. So if you just want a VM you can get into without worrying about SSH, without having to configure all that, you can go in through Guacamole, get a web terminal and otherwise. Again, you can stamp out all of these without consuming a floating IP.
17:03
Another mode, which is a bit like BinderHub but just inside a single VM, is you just specify your repo to repo2docker. Same kind of idea: it spins up the Jupyter notebook, punches it out with Zenith, so it's all nice and simple to just get that up and running. Okay, so let's do a little bit of a shortish technical dive into how you actually get RDMA in LOKI.
17:26
What the heck is LOKI, you may ask. So if you've been in some of the OpenInfra talks, Thierry described this quite well. This is the idea of Linux, OpenStack and Kubernetes giving you dynamic infrastructure.
17:42
So how do we get RDMA into this stack? Well there's three main steps. First of all you do need RDMA in the OpenStack servers that you're creating. Second step is, if you want Kubernetes, you need the Kubernetes clusters on those OpenStack servers. The third step is you need RDMA inside the Kubernetes pods, executing within the Kubernetes clusters.
18:05
So let's just drill down into each of those. So how do we do RDMA inside the OpenStack servers? Well there's two main routes here. The first route is if it's a bare metal server, you've got the NIC there, RDMA is generally available in the way it's normally available.
18:26
There's not a lot special to do there. I should stop there for a moment. What I've said is you're using the standard OpenStack APIs and all the Terraform tooling and you're stamping out bare metal machines. That's totally possible. When you select the flavor drop-down, it might give me a box with eight A100s on it.
18:43
If I want the whole thing, that's perfectly possible. So I referenced Cambridge as helping us out with this. Cambridge's HPC clusters are actually deployed on OpenStack using the bare metal orchestration. So it doesn't get in the way of anything in terms of RDMA or InfiniBand or whatever.
19:03
You get the bare metal machine. On the VM side, it's a little bit more complicated. Essentially, the easiest way to get RDMA working in there is that we pass in an actual NIC using PCI passthrough, i.e. SR-IOV. So the VM itself has to have drivers appropriate for the NIC that you've passed through.
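In OpenStack terms, asking for that virtual function is a property on the Neutron port. A minimal illustrative Heat fragment, where the network, flavor and image names are placeholders:

```yaml
# Requesting an SR-IOV virtual function for a VM via Neutron.
resources:
  rdma_port:
    type: OS::Neutron::Port
    properties:
      network: high-speed-net        # placeholder network name
      binding:vnic_type: direct      # 'direct' = PCI passthrough of a VF
  server:
    type: OS::Nova::Server
    properties:
      flavor: vm.rdma                # placeholder flavor
      image: ubuntu-22.04
      networks:
        - port: { get_resource: rdma_port }
```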
19:23
Now there's a whole bunch of different strategies for doing that. I wanted to quickly go through this one, which applies specifically to some Mellanox cards, and there are other ways of doing this. Essentially, you do OVS offload onto your virtual function.
19:41
So if you do SR-IOV into the VM, that virtual function can actually get attached into OVS. Now that sounds insane, because that's a really slow path and you've just put a nice fast thing into a slow path. Well, what happens is OVS gets told to look for hardware-offloadable flows.
20:02
So when you actually start getting connections going into your different machines, it notices the MAC and IP address pairs and those flows in OVS get put into the hardware and then it goes onto a fast path. The other part of this is that you connect the OVS directly to your bond on the host.
20:24
And the VFs are actually getting connected to the bond. So in that earlier graph where I was showing 200 gigabits a second and basically getting line rate, that's using this setup where essentially your VM with its virtual function is going through the bond rather than through one of the individual interfaces.
20:42
And this is actually quite a nice setup in terms of wiring. So if you've got a server that's got dual 100 gig Ethernet going in, or, you know, dual 25 gig Ethernet, you don't have to dedicate one of those ports to SR-IOV. You have the host bond on there and you can connect the virtual functions into the host bond.
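For reference, the knob that enables this offload behaviour in OVS is a single daemon setting. Here's a hedged Ansible sketch; the real deployments drive this through Kayobe/Kolla configuration rather than a hand-written task, and the handler name is an assumption.

```yaml
# Enable OVS hardware offload on a hypervisor; illustrative task only.
- name: Tell Open vSwitch to offload flows into the NIC where possible
  ansible.builtin.command: >-
    ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
  become: true
  notify: Restart openvswitch   # assumed handler; a restart is required
```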
21:08
Okay, so the next bit: create Kubernetes. I'm not going to go into that in too much detail. Essentially we're using Cluster API, and I really like its logo. Basically, you create a management cluster.
21:24
In CRDs you describe what you want your HA cluster, or your other cluster, to be, and it stamps that out for you using an operator. This has proved to be a really quite stable and reliable way of creating Kubernetes. One part of this is that we're actually hoping to, well, while I'm in the room I'm actually trying to fix the unit tests on it.
21:48
But we're developing a Magnum driver for OpenStack Magnum to actually consume cluster API and just stamp them out. And to make this repeatable it's all been packaged up in Helm charts which are here.
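As a rough illustration, driving those charts looks something like the values below. The chart layout and value names here are a sketch, not the charts' exact schema, and all the names are placeholders.

```yaml
# Illustrative values for an 'openstack-cluster' style Cluster API Helm chart.
kubernetesVersion: "1.26.3"
machineSSHKeyName: deploy-key      # placeholder key name
controlPlane:
  machineFlavor: vm.small
nodeGroups:
  - name: rdma-workers
    machineFlavor: vm.rdma         # flavor whose VMs carry the SR-IOV VF
    machineCount: 3
```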
22:04
So now we've got OpenStack machines that have got RDMA in. We can do that. We've got the Kubernetes cluster that's using those OpenStack machines that have the virtual function in that's doing RDMA at line rate. Now how on earth do we get the Kubernetes pods to actually make use of RDMA?
22:26
Now, if this was a bare metal machine, there are actually quite a lot of standard patterns that seem to be quite well documented for passing virtual functions into the pod. If we're inside a VM, we've already done the PF-to-VF translation, so you can't go again.
22:40
You can't have a VF of a VF yet, although VDPA and other things might change this. So what we're actually doing is using Multus and something called the macvlan CNI. So essentially when you create your Kubernetes pod, you give it two interfaces: your regular CNI interface, which has all the usual smarts, and an additional MAC/IP address pair on the VM's virtual function.
23:08
Now, at the moment you have to turn off port security to ensure that those extra MACs that are auto-generated inside Kubernetes punch out correctly and aren't restricted on the virtual function. And there's a plan to try and orchestrate that so that you can use allowed address pairs to explicitly decide which ones are allowed.
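That planned approach would look roughly like this on the Neutron port. An illustrative fragment only, since the orchestration for it isn't written yet; the network name and CIDR are placeholders.

```yaml
# Instead of disabling port security entirely, allow the pod addresses
# through explicitly on the VF's port.
resources:
  vf_port:
    type: OS::Neutron::Port
    properties:
      network: high-speed-net
      binding:vnic_type: direct
      allowed_address_pairs:
        - ip_address: 10.0.0.0/24   # in practice each pod's MAC/IP pair
                                    # would be added by the orchestration
```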
23:26
So basically you use Multus to say "give me two network connections" and use macvlan to get that connection for your RDMA. And there's also some permissions stuff, which is actually quite a simple decorator on the pod.
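A hedged sketch of that wiring. The interface name, IPAM choice, image and capability details are illustrative assumptions, not the exact manifests from the talk.

```yaml
# Second network via Multus: a macvlan interface on top of the VM's VF.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: rdma-net
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "ens4",
      "mode": "bridge",
      "ipam": { "type": "whereabouts", "range": "10.0.0.0/24" }
    }
---
apiVersion: v1
kind: Pod
metadata:
  name: mpi-worker
  annotations:
    k8s.v1.cni.cncf.io/networks: rdma-net   # ask Multus for the extra interface
spec:
  containers:
    - name: worker
      image: ghcr.io/example/mpi:latest     # placeholder image
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]                 # RDMA needs to pin memory
```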
23:43
Essentially extra pod YAML to opt in and get this all wired together. Okay, so it would be really great, if people have these problems and this is interesting, for you to get involved. There's a whole load of links. But yeah, thank you very much.
24:06
And we've probably got time for half a question.
24:29
Yeah, I thought you mentioned that when you were doing the bond on the network interface, you were getting the full bandwidth of the bond. So whenever I do LACP bonding, for any particular connection I only get half the interface. So I'm just wondering how you're doing that.
24:42
It depends on your bonding mode. So with LACP bonding I only seem to get half the interface. Well, no, so there's a hashing mode on your bond. So what you need to make sure is that you do something like layer 3 plus layer 4 hashing, so that from a single client, it depends, basically each of your traffic flows gets hashed onto a different bit of the bond.
25:05
So you need drivers that are respecting that hashing function. But yeah, if you get enough different flows, then it will actually hash across the bond, okay. It's all about the hashing modes. Not all switches support all hashing modes, which is the gotcha in that.
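Concretely, on a Linux host that hashing policy is part of the bond configuration. A minimal netplan sketch with placeholder interface names:

```yaml
# LACP bond with layer 3+4 transmit hashing, so distinct flows can land
# on different members and, in aggregate, use both links.
network:
  version: 2
  ethernets:
    enp65s0f0: {}
    enp65s0f1: {}
  bonds:
    bond0:
      interfaces: [enp65s0f0, enp65s0f1]
      parameters:
        mode: 802.3ad                  # LACP
        transmit-hash-policy: layer3+4
        lacp-rate: fast
```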
25:25
Yeah, the other question is, I don't understand the connection between macvlan and RDMA. Sorry, what's that? The connection between macvlan and RDMA. Oh, the connection between the macvlan and RDMA. Yeah, why do you need macvlan to do the RDMA into your VMs? So you could just do host networking.
25:42
So if you did host networking on the pod, you would just have access to all of those host interfaces. But if you want to have multiple different RDMA flows with different MAC and IP address pairs, then macvlan allows you to have multiple pods, each with their own identity on the VLAN that's doing RDMA.
26:02
Anyway, emails for the next questions, I think. So I should let the next person set up. Any other questions for John? Oh. Yeah, last one. I saw that you also support creating Slurm clusters.
26:24
Yes. So how do Kubernetes and Slurm play together for the network topology and placement of... Well, I have lots of ideas for that after your talk. Right. At the moment, not really. So the pods just get placed wherever and then...
26:41
At the moment, they're totally isolated environments. You stamp out a Slurm cluster and it's your own to do what you need with. And then, super briefly, the pink line, that was legacy RDMA, SR-IOV virtualization? Yes. Is that bare metal or is that also virtualized? That was...
27:01
Is it running on... I should hand the mic over. We can catch up later. Okay. So for that specific scenario, I definitely recommend watching Stig's talk, the "Five Ways with a CNI" one. I think that particular setup was actually bare metal with a virtual function.
27:22
So it was actually Kubernetes on bare metal with the virtual function passed into the container. Right. I believe we got similar results doing that legacy path into the VM as well. The extra cost, I believe, is on the VF LAG piece, because it has to...
27:40
There's an extra bit in routing inside the silicon, I believe. But I'm not certain on that, so I'd have to check. Thank you. Pleasure. Thank you very much, John. Thank you.