
Building a Linux-compatible Unikernel

Formal Metadata

Title
Building a Linux-compatible Unikernel
Subtitle
How your application runs with Unikraft
Number of Parts
542
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Running your own custom applications is one of the most important features that make unikernels fit for the cloud. As related work has shown, unikernels can achieve this by compiling or linking them (native) or by providing a binary-compatible interface (e.g., the Linux system call ABI). Both modes have their pros and cons, and because specialization is our key concept for the Unikraft OSS project, we support both. In this talk, we will present our implementation design, the challenges that we solved, and the lessons that we learned. Additionally, we will show a demonstration with nginx running in both modes. Unikraft is an open source Xen Project incubator under the auspices of the Linux Foundation. The Unikraft open source project is the basis for Unikraft GmbH, a company that aims to build the next generation of cloud with unikernels for production and enterprise use.
Transcript: English(auto-generated)
Okay, thank you. Thank you, Rasan. Actually, after hearing your talk, I'm kind of concerned I should join the Unikraft community. Sounds like fun there. Yeah, I see. Okay, so.
My name is Simon Kuenzer, as you just heard. I'm the lead maintainer and the original person who started the Unikraft project while still being a researcher at NEC Labs Europe. In the meantime, we spun off. We now have a startup, also called Unikraft, so it's the Unikraft GmbH,
and I'm its CTO and co-founder. And yeah, we're building a community and a startup at the same time. So, first question to the room: who has used Unikraft before, I would like to know? Okay. Who maybe has more of a theoretical background,
knowing what our key concepts in Unikraft are? Okay. So then, yeah, I have some background slides to bring everybody onto the same page, and then we jump directly into the binary compatibility topic, but I won't spend too much time here.
Okay. So, I usually start with this picture. You see on the left side the traditional setup where you have virtual machines and your applications running on them, so, I mean, stuff that you have known for 20 years now.
Then there is a setup which is more recent and more popular, using containers, where you basically run a host OS on your hardware and then use the isolation primitives of your host kernel to separate your containers from each other.
And then there are unikernels. I don't know if, is this interrupted somewhere? Okay. Seems to be okay. So we think this could be a different execution environment, especially for the container setup, kind of marrying what you had before with virtual machines, with strong isolation,
with really minimal hypervisors underneath that are much more secure as well, and don't need a shared host OS base, which can become an attack surface. And then you want the flexibility of containers, and this is where we think a unikernel can come in,
where you build a kernel per application. The thing is, since you know the application that you run, you can also give up a lot of principles you had in standard operating systems and do simplifications, which is totally okay,
because it's not hitting your attack vector, actually. So if you say one application, you can go for a flat, single address space, because the kernel that you have underneath is just for your application, for nothing else. In Unikraft, we build a single monolithic binary,
usually, so it's your application plus the kernel layers, and everything ends up as function calls down into drivers. And you then get further benefits, first from this simple setup, but also, since you know your environment,
you know where you run and what you run, so you can specialize the kernel layers that you need underneath. So you include only the drivers that you need to run on your target hypervisor. You build a separate image if you run that application on a different hypervisor, right? So floppy drivers: forget it, you don't need them.
Virtio only for KVM guests, Xen netfront, for instance, only for Xen guests, right? And since you know the application, you know which features of the OS are needed, and that way you can also, from the top down, specialize the operating system to provide just what you need.
This also makes us slightly different from the other unikernel projects that you may have heard of; we are for sure not the first ones. But we claim we are the ones that follow this principle most strongly,
because we built it from the beginning with that in mind, which is specialization. So nothing that we implement should ever dictate any design principles. The concept is: you know what you need for your application, you know what you need to run your unikernel.
So we want to give you a highly customizable base where you pick, choose, and configure components and specialize the kernel for you. That led us to this principle,
everything is a micro-library, which means for us even OS primitives are micro-libraries: a scheduler is a micro-library, a specific scheduler implementation is a micro-library, so a cooperative scheduler and a preemptive scheduler live in different libraries; memory allocators, also things like the VFS, network stacks,
the architectures, the platform support, and the drivers are all libraries, and, because we're also going up the stack, the application interfaces too. So everything that has to do with POSIX, even that is split into multiple subsystems, POSIX subsystem libraries, the Linux system call ABI,
which you will see in this talk now, and even language runtimes, right? If you, let's say, run a JavaScript unikernel, you can build it with a JS engine, right? The project basically consists of a Kconfig-based configuration system
and a build system, also make-based so as not to come up with yet another build system, right, to actually make entry easy for people who were familiar with Linux before, and our library pool.
To give you a rough idea of how this library pool is organized, I find this diagram nice, so let's see if this works at this point. Yeah. You don't find it that way in the repos, but we roughly divide the libraries
into these different categories. At the bottom you have the platform layer, which basically includes drivers and platform support for where you run. Then we have this OS-primitives layer; these are libraries that implement, say, a TCP/IP stack or file systems or something regarding scheduling,
memory allocation, et cetera. And always keep in mind, there is first the opportunity for you to replace components in here, and also that we provide alternatives. So you don't need to stick with lwIP if you don't like it; you can provide your own network stack here as well and reuse the rest of the stack too.
Then we have this POSIX compatibility layer. This is basically things like posix-fdtab, which is, for instance, file descriptor handling as you know it; posix-process then covers aspects like process IDs, process handling, et cetera; the pthread API, of course.
Then we have a libc layer where we currently have actually three libcs: musl, which is becoming our main one now, and newlib, which we had in the past, to provide all the libc functionality to the application, but actually also to the other layers.
It also provides things like memcpy, which is used all over the place. Okay. Then, Linux application compatibility. That was a big topic for this release. Why do we do application compatibility? For us it's actually about adoption, to drive adoption.
Most cloud software is developed for Linux, and people are used to their software, so we don't feel comfortable asking them to use something new or rewrite stuff from scratch. And if you provide something like Linux compatibility,
you also remove obstacles to people starting to use Unikraft, because they can run their application with Unikraft. Our vision behind the project is to give seamless application support. So when users tell you, I use this and that web server, it should be possible with the push of a button, including some tooling that we provide, to run that on Unikraft just as they ran it before on Linux, and they benefit from all these nice unikernel properties, which are lower boot times, less memory consumption, and also improved performance. Okay, so now, speaking about which possibilities you have for supporting Linux compatibility, we actually divide compatibility into two main tracks.
One track is so-called native, which means that we have the application sources and we compile and link them together with the Unikraft build system. And then on the other side we have the binary compatibility mode,
where the story is that the application is built externally and we just get binary artifacts, or even the final image. And then you can actually subdivide these two tracks. On the native side, we have this Unikraft-driven compilation, which we actually did quite a lot until recently, and which basically means that when you have your application, you have to port or mimic the application's original build system with the Unikraft build system, and then you compile all the sources with Unikraft.
This has the benefit that you then stay in one universe and don't have potential conflicts with compiler flags or things that influence your calling conventions between objects. And then there is the way that you have probably also seen, for instance, with rump kernels. They did it a lot using an instrumented approach, where you actually utilize the build system of an application with its cross-compile feature, and then you hook in, and that's your entry point into replacing the compile calls and making them fit for your unikernel. And then on the binary compatibility side, we have, so let's start here, because that's easier. So of course, externally built, and this means basically you have ELF files,
like a shared library or an ELF application. What you need here is basically just to support loading that, getting it into your address space, and then running it. And then there's also this flavor of, let's say, build-time linking, which means that you take some build artifacts from the original application build system, like the intermediate object files before the final link to the application image, and you link those together with the Unikraft build system. And I call this binary compatible here because you interface on an ABI, and not on the API level as in the native cases. And here, this is just a little remark that in the Unikraft project you will mostly find these three modes
in the projects that people are working on. This one here we never tried with Unikraft, in fact. But I mean, there's some tooling and this should work too, actually. So, as you may have noticed, native is about API compatibility,
so really the programming interface, and binary compatibility is about the application binary interface, so really the compiled artifacts and the calling conventions here, et cetera: which register your arguments are in, how your stack is laid out, et cetera, right? And this here is on the programming language level, right?
So, the requirement for providing, let's say, a native experience is POSIX, POSIX, and POSIX, right? Most applications are written for POSIX, so we have to do POSIX, no excuse, right?
Okay, so libcs will mostly cover that, but yeah, it's all about POSIX. The second point is that you also need to port the libraries that your application additionally uses. Let's take nginx as a web server: you then have tons of library dependencies, for instance for cryptographic things like setting up HTTPS connections, so those libraries you also need to port and add here so that you have all the application sources available during the build, right? On the binary compatibility side, the requirements are that you need to understand the ELF format, shared libraries or binaries, depending on which level you are driving it,
and then, since this stuff got built for Linux, you must be aware that the binary may do system calls directly, because it got built together with a libc or something like that that issues the syscall assembly instruction, which means on our side we need to be able to handle those system calls as well. And if we speak about shared library support, we need to support all this library function or library symbol linking, right? And additionally, of course,
all data that is exchanged needs to be in the same representation, because this is ABI, right? Now imagine you have a C struct. Here, on the native side, it's fine to move some fields around, because if you use the same definition for your compilation, it's all fine. You can reorder the fields in the struct and it will all work.
Here, you can't, because for your application that got built externally, the layout of that struct, that binary layout, must fit. Otherwise, you will read the wrong fields, obviously. And then for both modes, which is important for us as an operating system, there are of course also some things that we need to provide to the application, things that the application simply requires because that's how it is on Linux, meaning providing procfs or sysfs entries or files in /etc or something like that, right? Because applications sometimes do silly things just to figure out which time zone they are in, so they go to /etc and figure out what is configured, and also the locales, et cetera. So I'm closing up this high-level view so that we have the full picture. Let's speak a bit about the pros and cons
between these two modes. On the native side, what is really nice here, a really interesting pro, is that once you've got everything put together, you have quite a natural way to change code in the application and in the kernel, to maybe make shortcuts in the application-kernel interaction, and you can use that to drive your specialization even further, right, and performance-tune your unikernel for that application. The disadvantage is that you always need the source code,
because we are compiling everything here together. What is also a bit difficult, let's say for newcomers, is that when they say, okay, I have the source code, I just run make and then it compiles, you either need to instrument the build system of the application, as we just saw with the instrumented build, or you actually must say: sorry, you can't use that build system, you need to mimic it and write a Unikraft Makefile equivalent to build your application.
So this is why binary compatibility is actually really interesting for, let's say, newcomers, because you don't need the source code. They can compile the application that they're interested in, if they need to compile it at all, the way they usually do; they don't need to care about Unikraft at all, and normally no modifications to the application are needed either. Obviously, you can still do things here, but it's not a requirement. The risk that we saw while doing this work,
at least on the unikernel side, is that you run the risk of having to implement things the way Linux does them. One really stupid example, I go a bit nuts about that one, is providing an implementation of netlink sockets. If you have a web application, or any application that does some networking, and that application wants to figure out which network interfaces are configured and what the IP addresses are, it will likely use the libc function getifaddrs, and that is implemented with a netlink socket.
So this goes back here, right? Here, on the native side, I can just provide a getifaddrs which is highly optimized in that sense, right, which just returns all the interfaces in that struct. But if I go binary compatible, and do it to the extreme, then that libc, which is part of your binary here, opens a socket with the AF_NETLINK address family and starts communicating over that socket with the kernel to figure out the interface addresses, which can be a really silly thing for a unikernel to do, right? And then there are also fewer opportunities, and it is a bit harder, to specialize and tune the kernel-application interaction, right? Because assuming you don't have access to the source code of the application, there's nothing you can do on the application side. So, to give you a rough idea of what that means in performance, because at Unikraft,
the second important thing for us is always performance, performance, performance. Here we just show you nginx, once compiled as the native version, meaning it uses the Unikraft build system to build the nginx sources, versus running nginx on what we call the ELF loader, which is actually our Unikraft application for loading ELF binaries, and then a comparison with standard Linux, where this is the same binary. What that means in performance: in this quick test we serve just the index page, the standard default of any nginx installation, and these are the performance numbers. The takeaway here is that if you don't go into any special performance tuning yet and just get the thing compiled and running, you will end up with similar performance whether you go native or just take the ELF loader to run that application in binary compatibility mode. That is interesting, because you don't necessarily see huge performance drops. The only thing you lose, if you go for the binary compatibility mode, is the potential to optimize further; but the nice thing is you can still see benefits, right, running your application on Unikraft. And just to give you an impression,
this here is a Go HTTP application where we went a bit crazy optimizing and specializing the interaction between the Go application and Unikraft; yeah, we can get more out of this, we can really performance-tune and squeeze stuff out of it.
Okay, so in the next slides I go over how we implement these modes with Unikraft, because, as I said, we don't want to target just one mode, we want to target multiple modes, and that brings some implementation challenges, because as an engineer you also want to reuse code as much as possible; so we'll talk about the structure here. Okay, to give you an overview, this doesn't mean that these applications run at the same time, although that could also be possible; it's just to show you how the components get involved in our ecosystem. If you take just the left part, the native port of an application: we have settled now on musl to provide
all the libc functionality that the application needs, and we have a library called syscall-shim, which is actually the heart of our application compatibility. You can imagine it as a bit of a registry that knows in which sub-library a system call handler is implemented, and it can then forward the musl calls to those places. On the binary compatibility side, you have the library called ELF loader, which loads an ELF binary into memory, and then the syscall-shim takes care of handling binary system calls. I will go into the individual items to show you a more zoomed-in view of what's happening there, and we of course start with the heart, the core: the syscall-shim.
So here we have some macros. When you develop, say, vfscore, which is our VFS library, actually ported from OSv, or posix-process, where you have some process functionality like getpid or something like that, we have macros that help you define a system call handler. And it's really a system call handler: it's just a function that is defined at that point, and you register it with the syscall-shim. The shim then provides two options for how that system call handler can be reached. One is at compile time, with macros and the preprocessor, which allows, when a native application, or actually the musl side, calls a system call, those calls to be replaced, resolving at compile time to the function of the library that implements that system call. It also holds a runtime handler, which does the typical syscall trap and then runs that function behind the scenes.
Yeah, and our aim, as I mentioned, is to reuse code as much as possible, so the idea is that we implement the function for each system call just once, and the syscall-shim helps us, depending on the mode, either to link it directly or to provide it for binary compatibility. So let's go back to the overview, and then you will see it a bit more concretely with musl, though I have probably said everything already. So we have musl natively compiled with the Unikraft build system.
Now imagine your application does a write. It goes to musl, and musl then does a uk_syscall_r_write, which is actually the symbol provided by the library that implements it. The rewriting happens, as I said, with the macros at compile time in libmusl: what we did there is replace musl's internal syscall function with our syscall macro, which then kicks in the whole machinery to map a system call request to a direct function call. The thing is that in musl, not all but most of the system call requests have a static argument with the system call number first. So this write, let's say, is a libc wrapper, and internally they prepare the arguments, maybe do some checks before going to the kernel, and then they call this syscall function with the number of the system call and hand over the arguments. As soon as that number is a static constant, you know, literally written down in the code, we can do a direct mapping so that that write directly becomes a function call to uk_syscall_r_write.
If it's not static, which really happens in only two or three places, if I remember correctly, then of course we provide an intermediate function that does a switch-case and then jumps to the actual system call handler. And since everything is configurable, meaning I can have a build where vfscore is not part of the build, or posix-process is not part of the build, the syscall-shim automatically, with all this macro magic that we do, replaces calls to non-existing system call handlers with an ENOSYS stub, so that to the application it looks like: oh, function not implemented. Yeah, exactly. So at runtime, the syscall-shim is out of the game for this path; everything happens at compile time. For the binary compatibility side, it's unfortunately a runtime thing,
and we actually have two components here. As I was mentioning, the ELF loader itself, which loads the ELF application. What we support today are static PIEs, so if you have a statically linked position-independent executable, you can run that. What also works is using the dynamic linker that ships with your libc, meaning if you use glibc with the application, you can use that dynamic linker, ld.so, and also run dynamically linked applications with it.
What it needs is posix-mmap as a library, which implements all these mmap, munmap, mprotect functions behind the system call. System calls are then trapped here in the syscall-shim, and yeah, I think I said that. If a library is not selected, the call is replaced with ENOSYS, so the syscall-shim knows which system calls are available and which are not. Then there's a bit of a specialty in handling a system call, the system call trap handler, which we provide with the syscall-shim: we don't need to do a domain switch, so we still have a single address space, a single, what's it called, I forgot the word. So it's all kernel privilege. Yeah. It's the same privilege domain, exactly, so we don't have a privilege domain switch either. Now we have it. Good, good, good, you learned something. But we are in a slightly different environment; I will show you later in the slides exactly what this means. We have some different assumptions
than you have with the Linux system call API, which requires us to do some extra steps, unfortunately. The first thing is: Linux does not use extended registers, or if they use them, they guard it. Extended registers are the floating-point units, vector units, MMX, SSE, you know.
We do, unfortunately, so we need to save that state, because it's unexpected for an application that was compiled for Linux that these units could get clobbered when coming back from a system call. And the second thing is, you don't have a TLS,
you know, in the Linux kernel, but on Unikraft we do, and we even use, unfortunately, the same TLS register. So we also need to save and restore that, so that the application keeps its TLS and all the Unikraft functions operate on the Unikraft TLS.
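A rough model of that save/restore dance, with a plain global variable standing in for the architectural TLS register (fsbase on x86_64). All names here are made up for the sketch; they are not Unikraft's actual API:

```c
/* Model of the TLS swap on a binary-compat syscall entry. In reality
 * the value lives in a CPU register (e.g. fsbase on x86_64); a global
 * variable stands in for it here. Names are illustrative. */
static void *tls_register;           /* stand-in for the TLS register */

static int app_tls      = 1;         /* the Linux application's TLS block */
static int unikraft_tls = 2;         /* Unikraft's own TLS block */

/* Runs inside the syscall handler; Unikraft code expects its own TLS. */
static long example_handler(void)
{
    return tls_register == (void *)&unikraft_tls ? 0 : -1;
}

/* Trap entry: save the app's TLS pointer, install Unikraft's,
 * run the handler, and restore before returning to the application. */
static long handle_bincompat_syscall(long (*handler)(void))
{
    void *saved = tls_register;
    tls_register = &unikraft_tls;
    long ret = handler();
    tls_register = saved;
    return ret;
}
```

The same reasoning applies to the extended register state (SSE, AVX, and so on): a Linux-built binary assumes a system call won't clobber it, so it has to be saved on entry and restored on exit as well.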
Good. Before I continue and give you some, let's say, lessons learned while implementing all these things, I would like to give you a short demo. Then we'll speak a bit about what was tricky during the implementation,
and what special considerations we had to make. So then, let's hope that this works. This is a super fresh demo; don't touch it, you will burn your fingers. My colleagues... so thank you, Mark, for getting that to work
just, you know, half an hour before the talk. Well, let's go. Well, he's the person that no one sees, but it's all his work. Yeah, he's amazing, yeah. Okay, so in this demo I actually have Nginx. It's a web server with a standard file system, and I'll show you the files around it. I have it once compiled natively, and once compiled as a Linux application
that we'll run with the ELF loader. And you will see that the result is the same, right? So let's start with the native one. So I'm actually already... probably I need to increase the font size a bit, right, so that you can read it in the background. Is that good? Yeah. Yeah.
Let's do it here, too. So I hope you can read it, also in the last row. Perfect. So, yeah. You have here the Nginx app checked out.
So we have menuconfig here, so you can... oh, the window's somehow wider now. Just one second. No, it's better. Okay, so you see the application is here as a library, the Nginx,
and then you have here the configuration of all these HTTP modules that Nginx provides, and you can select and choose. This is really the Unikraft way to do things. Because it builds for a while,
and my laptop is not the fastest, I built it already. So you see here the result in the build directory. You see each individual library that came in because of dependencies and was compiled, like, for instance, posix-futex,
posix-socket, ramfs, which is an in-memory file system, and... where is it now? The application here.
That's the application image, uncompressed. So wait, I can show you how big it is. It's here, 1.1 megabytes. So this is a full image of Nginx,
including musl, including all the kernel code and drivers to run on a QEMU KVM-accelerated machine. Yeah, and then let's run it to see what happens. So exactly, it's already up and running to show you.
These are roughly the arguments that we have; in the meantime, because I found QEMU sometimes a bit brutal with command-line arguments, there's a wrapper script that shortens a few things. But in the end, I mean, this is running QEMU.
And then, you know, it's attaching to this virtual bridge, taking that kernel image, loading the initrd file system, because we serve a file from that ramfs. And here are also some parameters to set the IP address and netmask for that guest.
And here, down there, we can actually check. Let's see here: set IPv4, that's the address where the unikernel is up. And yeah, you see here with this wget line that, yeah, I get the page served.
And to prove that this is real, let us kill this. Now the guest is gone, and this is dead. So no response anymore, good. So now let's go to the ELF loader,
which is also treated as an application: an application that can run other applications. Also here in the build directory, let's do the same thing. It has similar dependencies, of course; it's prepared to run Nginx.
So posix-socket is there, et cetera, et cetera. Where is the... here. So here is the image. It's a bit smaller, it's now 526 kilobytes, which provides your environment to run a Linux ELF. Of course, the Nginx image is not included here anymore, right? That is part of the root file system.
And if I run this now... so on purpose I enabled some debug output so that you see the proof that it does system calls. If you scroll up, the initialization phase looks a bit different; it also sets the IP address. Here it's extracting the initrd,
and here it's starting to load the Nginx binary, the Linux binary, from the initrd. And then from that point on, the ELF loader jumps into the application and you see every system call that the application is doing. And you can even see that, you know, some stuff,
probably this first part is libc initialization. Here, for instance, /etc/localtime: it's trying to open and find some configuration. Of course, we don't have it. We could provide one, but it's still fine. It continues booting. Affinity we don't have, so,
but whatever, it continues booting. It's quite optimistic actually, but it works. A lot of files, if you look into it, getpwnam, all those items: it works, it works. Yeah, yeah, exactly. And then there are tons of mmaps and, you know, /etc/passwd, et cetera. Those files we had provided, so you get a file descriptor returned back,
otherwise it would have stopped, et cetera. And then, you know, configuration and so forth. And now you should see that some system calls happened when I accessed the page. And you saw it happened: the index was opened, file descriptor seven,
and here there should be a write to the socket. Yeah, probably here; this is probably the socket, number four. Yeah, I mean, you get the impression of what's going on, right? So it's working the same way. Okay, how much time do I have left? Five minutes. Five minutes, okay.
Then? Actually, three minutes, just to leave some room for questions. Yeah, yeah, exactly. Okay, so let's get quickly back. So, we had some lessons learned. For the native mode, I mean,
the thing is we have this model where we want to use just one libc in our build, right? Meaning, for all the kernel implementation and everything that the application needs, there is one libc. We provide multiple implementations of libcs because musl might be, for some use
cases, still too big. So we have an alternative like nolibc, and originally we had newlib. And what we want as well in our project is to keep the libc as vanilla as possible, as close to upstream as possible, because we want to keep the maintenance effort for updating libc versions low.
Well, this then causes... I mean, let's just list them, some things that you stumble on; I'll speak about just one of these items. One was quite interesting: this getdents64 issue that cost us some headache. It was mainly Ruslan fixing it, which required actually a patch.
I'm always fixing it. Yeah, yeah, it required a patch to musl. The thing that happened here is that in this dirent.h, musl provides an alias, right, to use the non-64 version for getdents: if it finds code using getdents64,
because of this large file support thing that was happening, it maps it to getdents, right? On the other side, on the vfscore side (this is the VFS implementation where we provide the system call), we need to provide both, obviously: the non-64 version and the 64 version.
And guess what? We include dirent.h, because we need a struct definition here. And then you can imagine, if you're familiar with C and the preprocessor, there's a little hint with this #define: of course, it gets replaced, and then you have two times the same symbol, and you're like, what the hell is going on here?
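A minimal reproduction of the pitfall. The #define below is a simplified stand-in for musl's dirent.h alias under large file support, and the function body is a dummy:

```c
/* Simplified stand-in for musl's dirent.h LFS64 alias: */
#define getdents64 getdents

/* vfscore-style definition of the non-64 variant (dummy body). */
long getdents(int fd)
{
    (void)fd;
    return 42;
}

/* If we now also wrote "long getdents64(int fd) { ... }" below the
 * include, the preprocessor would rewrite it into a SECOND definition
 * of getdents(): two times the same symbol. And even a plain call to
 * getdents64() is silently rerouted to getdents(). */
```

So both the duplicate-symbol error and silent calls to the wrong variant fall out of the same alias, which is why a patch to the musl header was needed.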
All right. Yeah, so let's skip this because of time. Upcoming features: Ruslan was telling you a bit already, especially for this topic of application compatibility, we will further improve it.
So this will now be our first release to officially ship the ELF loader and an updated musl version. We want to make that more seamless, which requires a bit more under-the-hood library support. You should also watch out for upcoming features for a seamless integration of Unikraft
into your Kubernetes deployment. No pressure, Alex. Running Unikraft on your infrastructure provider, for instance AWS, Google Cloud, et cetera, and automatic packaging of your applications, right? And I would love, or actually all of us,
everyone within Unikraft, would love to hear your feedback and what you think about taking the cloud with unikernels to the next level. Yeah, any feedback to me, please send to Simon. Right, and these are, again, the project resources. If you're interested, you can just scan the QR code.
I think that's it. Okay, thank you, Simon. Right. We can take a couple of questions. You can also address them to me; I mean, it's a joint talk. So any... yeah, please, first here and then in the back.
Yeah, thanks a lot, both of you, for your talks. I have a question regarding dynamically linked applications in Linux. As far as I can see, you only use musl, and how does this work out if my application is linked against glibc and I want to run it with the ELF loader? What do I have to do,
because in the Linux world, when I link against glibc and I only have musl, nothing works? Right, right, so I'm assuming you're speaking now about the binary compatibility mode. In the end, what you just need to do is provide the musl loader, if you have compiled your application with musl, or the glibc loader, and then both work.
The thing is, in that setup, there are actually two libcs in memory: there's the libc on the Unikraft side, and there's the libc that comes with your application. So that's why it works seamlessly, actually. Okay, thank you. Just to add to that: when you build your unikernel
for binary compatibility, you don't use musl. You can if you want, I mean, but the ELF loader doesn't use musl, because the entire libc is provided by the application, either by the application as a static binary, or by the application plus its libc inside the root file system, loaded from there. There's no need to have anything like that.
Yeah, please. Yeah? Yeah, so the question is about the API. You spoke about the POSIX API. You also had a diagram showing a direct link to the unikernel.
So the question is, is it a viable... next diagram perhaps, one of the next diagrams. Okay. Is it a viable use case? Yes, this one. There is a link directly from the native application to the unikernel.
Yeah, yeah, what it shows you is how the calls are going. It can happen because some system calls don't have a libc wrapper provided; for completeness, this arrow is here. For instance, the futex call: if you use futex directly from your application, there is no wrapper function in libc.
You need to do the system call directly, and you can do that by also using the syscall macro, or actually, I mean, the syscall shim will replace that with a direct function call, to posix-futex in our case. Is it viable to have a kind of application that you develop specially for a unikernel
and its native API? Yes, yes, that's for sure, that's for sure. So this talk is just about how we get application compatibility in case you have your application already, but if you write it from scratch anyway, I recommend: forget everything about POSIX
and speak the native APIs. You get much more performance, being more directly connected to your driver layers and APIs. You know, POSIX has some implications, right? A lot of things, like read and write, imply there's a memcpy happening. And with these lower-level APIs you can do way quicker transfers,
just because you can do zero-copy, for instance. Yeah, sure, of course, of course, yeah. Have you looked into patching the binary to remove the syscall overhead? Patching the binary to remove...?
For a couple of... now with the syscalls, you have to emulate the syscalls. Have you looked into patching the binary itself, instead of handling the syscalls at runtime? Yeah, let's say, at least we thought about that, but we didn't do it. I mean, the HermiTux guys, that is the other project... exactly, he's sitting in front of me.
They were doing some experiments with that. That works too, so you can patch it. But yeah, we just didn't do it. Okay. In regards to memory usage: obviously, a unikernel lowers it,
but what if I run multiple unikernels in multiple VMs? Do you support memory ballooning or something like that, or is it just overprovisioning? Yeah, I mean, the idea is to have memory ballooning, but it's not upstream yet, of course. There's also a really interesting research project,
maybe I should mention, that works on memory deduplication. So if you run the same unikernel 100 times, you can share VM memory pages on the hypervisor side, but you need hypervisor support for that. Okay, thank you so much, Simon. Let's end it here. We're going to ask, yeah.
Ah, yeah, and get some stickers. Anastasios and Babis for the next talk on vAccel. So, please. So please get some stickers. Yeah, stickers. They are free; you don't have to pay for them. For now. Next year, 100 euros each.