
Running MPI applications on Toro unikernel


Formal Metadata

Title
Running MPI applications on Toro unikernel
Title of Series
Number of Parts
542
Author
Contributors
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Unikernels aim at improving the way single-purpose systems are built, with minimalist kernels that the user's application compiles within. This results in deployments that require less memory, less disk, less CPU and less time to be up and running. Also, the whole system spends most of its time in the user application or doing IO for that single application, so CPU time is used more efficiently. In this presentation, we talk about the use of unikernels for High Performance Computing. We present a work in progress that aims at implementing the MPI standard on top of Toro, an open-source non-POSIX unikernel. In this work, we implement a library that conforms to the Open MPI implementation. This library relies on the Toro API to implement the MPI functions. In particular, the library leverages Toro's features like per-CPU memory allocation, a cooperative scheduler, thread migration and inter-core communication based on Virtio. During initialization, Toro creates one instance of the MPI application per core. Each instance is a thread that is migrated to the corresponding core and then executes without any interference. When applications need to allocate memory, each core has its own memory pool from which memory is allocated. This keeps memory allocation local, thus improving the way the cache is used. Also, primitives like MPI_Gather() or MPI_Scatter() that require communication between instances are implemented by relying on a new Virtio device named virtio-bus that allows core-to-core communication without locking. At the moment, we have implemented the following APIs: MPI_Gather(), MPI_Scatter(), MPI_Reduce() and MPI_Barrier(). The goal of this PoC is to port benchmarks from the OSU micro-benchmarks (http://mvapich.cse.ohio-state.edu/benchmarks/) to compare with existing implementations. During the presentation, we present how this is implemented and we demonstrate the use of the current implementation by executing different MPI applications on top of Toro.
Transcript: English (auto-generated)
All right, we'll get started. Now we have another talk on MPI, but I think a very different one, running MPI applications on the Toro unikernel. Exactly, yeah. So hello everyone, I'm Matias. Here I'm going to talk about
running MPI applications on the Toro unikernel. Roughly speaking, a unikernel is a way to deploy a user application closer to the hardware, by trying to reduce the operating system's interference. So overall, it should perform better than deploying the user application on a general purpose operating system.
First, I would like to introduce myself. Well, I am passionate about operating system development and virtualization technologies. I have worked at Citrix and Huawei, and I'm currently at Vates. And here you have my email and my GitHub profile if you want to know more about what I'm doing.
So I'm going to start by presenting what exactly a unikernel is. Then I'm going to go into the details of what makes Toro special. Then I will show the current implementation of the MPI standard on top of Toro. And I will finish with a benchmark
that I'm trying to do to see if the current implementation is working as expected or if there are things that could be improved. So maybe you are already familiar with this picture. This is more or less how a user application is deployed, either using a virtual machine or bare metal.
So what we have is the operating system, the user application, and the two different modes, Ring 3 and Ring 0, which are the different modes in the x86 processor. So in general, when a user application requires something, I mean, wants to open a file, send a packet or whatever, it's going to trigger a syscall. And then there is going to be a switch in which the processor goes from user space to kernel space. So the request is going to be processed in kernel space and come back, right? In general, when we look at what we have inside the kernel, we see different components, right? For example, we have the scheduler,
the file system, different drivers, and so on. So in particular, what we have is a scheduler. The scheduler is going to choose the next process to be executed. One of these processes, or several of them, is going to be your MPI application, for example. So if you deploy your MPI application by using a general purpose operating system,
your application is going to compete with the other processes in the system, for sure. And also what you have in the scheduler is some policy which is going to decide which is the next process to be scheduled. Also, we have a component like the file system, and since we have a general purpose operating system,
we're going to have several drivers for different file systems, different device drivers, and so on. So what some people observed was that there is too much generality in using a general purpose operating system for a single-purpose application, like an MPI application.
So some people came up with a new architecture, which they called a unikernel. You have some projects there like OSv, MirageOS, Unikraft, or NanoVMs. What they do is just take the user application and compile it within the kernel itself. So at the end, what you have is a single binary
that is going to be deployed, either by using a virtual machine or bare metal, right? So instead of, for example, the syscalls that we have in the case of a general purpose operating system, and the different modes of execution, in the case of a unikernel we have simple function calls,
which are cheaper than syscalls, for example. In general, the projects that I presented before all conform to the POSIX standard. It means that if you have any application written in C that conforms to POSIX, you can theoretically compile it with the unikernel
without any modification of the user application. In reality, this does not happen, and most of the time the POSIX that the unikernel implements is not complete, so you have to do some work; you cannot just take your application, compile it and generate something. It doesn't work out of the box in most cases.
So in this context, what is Toro? Toro is also a unikernel. It's an application-oriented unikernel. The idea of Toro is to offer an API which is dedicated, I mean, to efficiently deploying parallel applications. In the case of Toro, it's not POSIX compliant.
It means that even if the names of the functions, like open, close, and so on, are more or less the same, the semantics of these functions are slightly different, so I would not say that it's POSIX compliant in that sense, and I will explain that later. So let's say that the three building blocks
of the Toro unikernel are memory per core, a cooperative scheduler, and core-to-core communication based on virtio. Here I'm talking about the architecture of the unikernel. I'm not yet talking about the application, how we are going to build an application to conform to Toro, right? And I'm going to explain these three points.
So first, what happens in the Toro unikernel is that we have memory dedicated per core. At the beginning, what we do is split the whole memory into regions, and we assign one region per core; for the moment, the size of the regions is just proportional to the number of cores that we have. That makes the memory allocator quite simple: it doesn't require any communication, because we have one allocator per core, which means that we do not require any synchronization in the kernel to allocate memory for one core. We call it per-CPU data, let's say. So for example, if you have a thread on core one and it wants to allocate memory, it is always going to get it from region one, and the same happens on core two: we are going to use region two. And the idea is that by doing this, we can then leverage technologies like HyperTransport or Intel QuickPath Interconnect, in which we can say, well, this core is going to access this region of memory, and if it accesses other regions, it's going to pay a penalty to do it, right?
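To make the per-core allocation idea concrete, here is a minimal sketch in C of a bump allocator per core; the helper get_current_core() and the pool layout are hypothetical illustrations, not Toro's actual API (the Toro kernel itself is written in Free Pascal).

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_CORES 32

/* Hypothetical per-core memory pool: each core owns a region and a bump
 * pointer into it, so allocations never touch another core's region and
 * need no locking. */
typedef struct {
    uint8_t *base;   /* start of the region assigned to this core */
    size_t   size;   /* size of the region */
    size_t   used;   /* bytes already handed out */
} core_pool_t;

static core_pool_t pools[MAX_CORES];

/* Stub for illustration: a real kernel would read the current core id,
 * for example from the local APIC. */
static int get_current_core(void) { return 0; }

void *core_alloc(size_t bytes)
{
    core_pool_t *p = &pools[get_current_core()];
    if (p->used + bytes > p->size)
        return NULL;              /* local region exhausted */
    void *mem = p->base + p->used;
    p->used += bytes;             /* no lock: only this core touches p */
    return mem;
}
```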
So, talking about the scheduler, what happens in Toro is that we only have threads. We don't have processes, which means that all threads share the same view of the memory, and we have mainly one API to create a thread, called begin thread, and it has a parameter that says on which core the thread is going to run. The scheduler is cooperative, which means that it is the thread that calls the scheduler to then choose another thread, by relying on the API called thread switch. Most of the time this is invoked because we are going to be idle for a while, so we just call the scheduler, or because, for example, we're going to do some IO. So the scheduler is also very simple. We have, again, per-CPU data, so we have one queue per core, and the scheduler simply chooses the next thread that is ready and then schedules it. This also means that we don't require any synchronization at the level of the kernel to schedule a thread, so each core runs independently from the others.
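As an illustration of this model, here is a hedged C sketch of creating one thread per core and yielding cooperatively; the names BeginThread and ThreadSwitch follow the API names mentioned in the talk, but the prototypes here are assumptions, not Toro's real declarations.

```c
/* Hypothetical prototypes following the API names mentioned in the talk;
 * the real Toro declarations (in Free Pascal) may differ. */
typedef void (*thread_fn)(void *arg);
void BeginThread(int core, thread_fn fn, void *arg); /* create a thread pinned to a core */
void ThreadSwitch(void);                             /* cooperatively yield to the scheduler */

static void worker(void *arg)
{
    int core = *(int *)arg;
    (void)core;
    for (;;) {
        /* ... do some work for this core ... */

        /* Nothing left to do for now: yield so the per-core scheduler
         * can pick the next ready thread from this core's queue. */
        ThreadSwitch();
    }
}

static int core_ids[4] = {0, 1, 2, 3};

void start_workers(void)
{
    /* One worker per core; each thread only ever runs on the core it was
     * created on, so no cross-core locking is needed to schedule it. */
    for (int i = 0; i < 4; i++)
        BeginThread(i, worker, &core_ids[i]);
}
```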
Finally, I'm going to talk a bit about how we communicate between cores, and basically what we have is one dedicated
reception queue per core for any other core in the system, so we have one-to-one communication. It's basically realized with two primitives, send-to and receive-from, which just take the destination core,
or the core from which we want to get a packet, for example. These two primitives are the ingredients to then build more complicated APIs, like MPI_Gather, MPI_Bcast, and MPI_Scatter, so these are the building blocks for those APIs, for example.
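To give an idea of how a collective could be layered on top of these two primitives, here is a rough C sketch of a gather; SendTo and RecvFrom are placeholders for the send-to and receive-from primitives described above, with assumed signatures, not Toro's actual functions.

```c
#include <stddef.h>
#include <string.h>

/* Placeholders for the core-to-core primitives described in the talk;
 * the real names and signatures in Toro may differ. */
void SendTo(int dest_core, const void *buf, size_t len);
void RecvFrom(int src_core, void *buf, size_t len);

/* Sketch of a gather collective built on the two primitives: every core
 * sends its block to the root core, which concatenates them in core order. */
void gather(const void *sendbuf, size_t len, void *recvbuf,
            int root, int my_core, int ncores)
{
    if (my_core != root) {
        SendTo(root, sendbuf, len);
        return;
    }
    for (int c = 0; c < ncores; c++) {
        char *slot = (char *)recvbuf + (size_t)c * len;
        if (c == root)
            memcpy(slot, sendbuf, len);   /* root contributes its own block */
        else
            RecvFrom(c, slot, len);       /* one dedicated reception queue per peer */
    }
}
```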
So to implement this core-to-core communication, I was using virtio, so I was just following the specification. I will talk a little bit about this; I don't want to go too much into the details, just to understand roughly how communication between cores is done. As I said before,
we have one reception queue in each core for any other core in the system. It means that, for example, if core one wants to get packets from core two, we have a reception queue, and also, if core one wants to send a packet to core two, it's going to have a transmission queue,
and the number of queues is going to be different if you have three cores, for example, because the virtqueues are dedicated. So basically, a virtqueue is made of three ring buffers. The first ring buffer is the descriptor ring, which only contains descriptors pointing to chunks of memory.
The second ring is the available ring, and the third ring is the used ring. Basically, the available ring contains the buffers that core one is exposing to core two, so if core two wants to send a packet to core one, it's going to get a buffer from the available ring,
put the data in it, and then put it back in the used ring. This is basically how virtio works. If you are familiar with virtio, in this case, for example, the consumer of the available ring is core two, but if, for example, you are in a hypervisor
and you're implementing some virtio device, the consumer is not going to be core two, but the device model, QEMU, for example. I don't know if you're familiar with that, but it's the same scheme. And this means that, since we have one producer and one consumer, we can access the virtqueue without any synchronization, I mean, at least if we have only one consumer, right? So you don't require any lock, for example, to access the virtqueue.
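For reference, this is roughly what the three rings of a split virtqueue look like in C, following the virtio specification; it is a simplified illustration, not Toro's actual data structures.

```c
#include <stdint.h>

#define QUEUE_SIZE 256

/* Simplified split-virtqueue layout from the virtio spec. One such queue
 * exists per (sender core, receiver core) pair, so each ring has exactly
 * one producer and one consumer and can be used without locks. */
struct vring_desc {           /* descriptor ring: points to chunks of memory */
    uint64_t addr;            /* physical address of the buffer */
    uint32_t len;             /* length of the buffer */
    uint16_t flags;
    uint16_t next;            /* chaining of descriptors */
};

struct vring_avail {          /* available ring: buffers exposed by the receiver */
    uint16_t flags;
    uint16_t idx;             /* producer index */
    uint16_t ring[QUEUE_SIZE];
};

struct vring_used_elem {
    uint32_t id;              /* index of the completed descriptor */
    uint32_t len;             /* bytes written into the buffer */
};

struct vring_used {           /* used ring: buffers filled in by the sender */
    uint16_t flags;
    uint16_t idx;             /* producer index */
    struct vring_used_elem ring[QUEUE_SIZE];
};
```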
So yeah, I've already talked too much, I don't know how much time I have left, but I wanted to show some examples of the implementation,
maybe it's more fun than all these slides. So how do we deploy an application by using Toro? We have the MPI application, these are C applications for the moment, and you compile it with a unit that is just going to link the application
with some functions that are the implementation of the MPI interface, so for example MPI_Bcast, MPI_Gather and so on are implemented at this level, in the MPI interface. And this unit is going to use the API from the unikernel. So at the end, what you're going to get
is an ELF binary that can be used to deploy your application, either as a virtual machine or on bare metal. So you don't have any operating system underneath, so to speak. You have only your application, the threads and so on, and nothing else.
So if you want to see how it is deployed: if you take the MPI application, the .c file, what is going to happen is that we take the main and then instantiate it once for every core in the system as a thread.
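A minimal MPI program of the kind being described might look like the following; this is an ordinary MPI hello-world sketch, not the exact demo file from the talk.

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal MPI program. On Toro, main() is instantiated once per core as a
 * thread, so each rank effectively corresponds to one core. */
int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("hello from rank %d of %d\n", rank, size);

    /* Collective implemented on top of the core-to-core queues. */
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```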
So, to benchmark the current implementation: I'm not very familiar with the MPI world, I'm coming from another domain, so I'm not really sure how I should benchmark such an implementation. So I chose the OSU micro-benchmarks, maybe you know them, maybe not.
And I just picked one of them, for example the MPI barrier one, and I tried to implement it. The benchmark itself is quite simple, so I decided to re-implement it. I could not take the benchmark as it is, I had to do some rework to make it work.
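For context, the OSU barrier benchmark essentially times a large number of MPI_Barrier calls and reports the average latency per call; the following is a simplified sketch of that measurement loop, not the exact OSU code.

```c
#include <mpi.h>
#include <stdio.h>

#define SKIP       10     /* warm-up iterations, not timed */
#define ITERATIONS 1000   /* timed iterations */

/* Sketch of an osu_barrier-style measurement: time many MPI_Barrier calls
 * and report the average latency per call. */
int main(int argc, char **argv)
{
    int rank;
    double start = 0.0, end;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < SKIP + ITERATIONS; i++) {
        if (i == SKIP)
            start = MPI_Wtime();   /* start timing after warm-up */
        MPI_Barrier(MPI_COMM_WORLD);
    }
    end = MPI_Wtime();

    if (rank == 0)
        printf("avg barrier latency: %.2f us\n",
               (end - start) * 1e6 / ITERATIONS);

    MPI_Finalize();
    return 0;
}
```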
And then my idea was to see how this behaves when I deploy it as a single VM with many cores. The hardware that I used is this one. Since I'm not familiar with the high performance computing world, I'm not really sure if this is hardware that you often use.
It's a quite recent Intel machine. You can get it on Equinix. The price is four euros per hour. So I launched the test and I tried to measure things, so I was just measuring the latency of the barrier, and I was taking into account the max latency over four, eight, 16, and 32 cores. I'm getting values in the order of microseconds, and then I found this paper, which was also using
this benchmark to measure the platform, and in this paper they were reporting around 20 to 30 microseconds, not nanoseconds, sorry, on this platform. In any case, I would be very cautious about this graph, because I was getting a lot of variation in the numbers. Most of the time, for example,
I was using a machine with 32 cores, and the VM already has 32 vCPUs, and you should not test on that sort of machine, because one of the threads is going to compete with the main thread of the host, so you should always test with fewer vCPUs than physical cores, and yeah.
The idea is to continue doing this, I mean, improving the way I'm measuring this, and also trying maybe different hardware. And at the same time, I found a lot of bugs in the unikernel by doing this; for example, at the beginning it only supported more or less four cores, so I went from four to 32. Well, it was just a number in a constant,
but anyway, I found many bugs when I was doing this. So this is all just a proof of concept and a work in progress. Don't take it too seriously, is what I'm trying to say. I don't want to jump to any conclusions from this. And yeah, it was fun to do, anyway.
So that's all, I don't know if you have any questions. So you said this runs on bare metal? Sorry?
The unikernel runs on bare metal? Yeah, there are some. How do you even install it? I mean, operating systems are kind of complicated, right? Sorry? How do you even install it on bare metal? Can you say that again? How do you install it on bare metal? How do you install it? Yeah, like if I had this, how would I install it on bare metal?
There's an installer? An installer, you mean? No, you can just use some device to boot from, for example. So it's bootable? Yeah, that's it, yeah. Well, yeah. You have many ways to do that, for example. You don't have to install it, for example. You can boot it from a device that is removable, for example, and you don't need to install it, I mean, whatever. Any other questions?