
Rethinking the OS for Isolation Flexibility with FlexOS


Formal Metadata

Title
Rethinking the OS for Isolation Flexibility with FlexOS
Title of Series
Number of Parts
287
Author
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Operating Systems (OSes) have historically been classified according to their isolation properties: monolithic OSes, microkernels, single-address-space OSes, or unikernels... Decades of experience in research and industry showed that there is no silver bullet and that different use-cases might demand different approaches to optimize safety and performance. What if we tried to design an operating system able to be easily reconfigured into any of these points in the OS design space? What if the OS could be a microkernel, a unikernel, or a monolithic OS, at will, and using a wide range of hardware- and software-backed isolation mechanisms? In this talk, we will present FlexOS, the result of our recent research work in trying to answer this question. FlexOS is an OS allowing users to easily specialize the safety and isolation strategy of an OS at compilation/deployment time, instead of design time. Depending on the configuration, the same FlexOS code can mimic a microkernel with multiple address-spaces, a single-address-space OS with Intel MPK compartments, or many other OS isolation approaches. We have implemented a prototype of FlexOS on top of Unikraft, a popular library OS framework.
Transcript: English (auto-generated)
Hello, first of all, I'm Hugo Lefeuvre, PhD candidate at the University of Manchester, and Debian developer. I'm going to present the result of my research on OS design over the past two years, how I
tried to rethink the operating system for isolation flexibility. This project is called FlexOS. On the academic side, it was published at HotOS early last year, and more recently at ASPLOS. This is the result of a cooperation between the University of Manchester, University Politehnica of Bucharest, and NEC Laboratories Europe.
Let's get started. So if we take a look at current OS designs, it is quite clear that security or isolation strategies are typically fixed at design time. Taking the example of a typical monolithic operating system, such as Linux, there is a user kernel separation
and the process abstraction using the page table. This is quite set in stone. You can make similar observations for other OS designs, such as microkernels or single-address-space operating systems. And the result of this is an understanding of the operating system that does not work very well with modern specialization or hardware
heterogeneity trends. For example, if you want to remove the user/kernel separation in Linux, you might want to do that for performance reasons, as part of the unikernel trend. But it would require major rewrites. You would have similar issues if you
try to use hardware capabilities, such as CHERI, or memory protection keys, instead of the page table. That's also very much in the spirit of the times. And most would agree with me that this is no simple task, with major rewrites and maintenance ahead. With FlexOS, we propose to take a different approach
and decouple security or isolation mechanisms from the design of the operating system in the first place. The goal is to be able to achieve a range of trade-offs instead of a single point in the design space. We aim to support a range of isolation mechanisms, hardware
and software, and granularities. In essence, what we want to do is to specialize the operating system for security for given use cases. Now, other use cases that we envision for flexible isolation include, besides pure specialization,
the deployment, for example, to heterogeneous hardware. Here, we envision making use of each machine's or architecture's mechanisms optimally with minimal code changes. Another one would be to quickly isolate vulnerable libraries. Let's say a vulnerability is found in one of the libraries you're using.
You can quickly and easily apply fine-grained protections to at least mitigate it while a patch is being developed. Finally, we envision using isolation flexibility to guarantee that the properties of certain verified or safe components in a system hold while being mixed and matched with unsafe components.
So now that I have motivated isolation flexibility a little bit, let me try to summarize the spirit and the scope of this project in four simple points. So first, we focus on single-purpose appliances
like cloud microservices. So no desktop operating system, no general-purpose operating system. And the point here is that we are advocating for specialization of the operating system towards the application it's running. That starts to make much less sense when several of them are running.
Second, we advocate for a holistic approach to compartmentalization. Instead of considering only the application or only the kernel, we want to put everything in the balance to make the best choices performance- and security-wise. Once again, we want to specialize.
Fundamentally, what we do is embrace the library operating system philosophy. And so all system components are libraries, and all libraries should be equal before the hardware. Third, we want to abstract away the details of isolation mechanisms.
You know, the page table, Intel memory protection keys, TEEs. They surely don't have the same guarantees, but conceptually they have a lot in common, and we can unify them behind a common interface, the isolation mechanism structure. And finally, we think that flexibility must not get in the way of performance.
And the question is, can we achieve a new layer of abstraction at nearly zero cost? Now, all of this is theory. So let's take a look at the concrete stuff we built. And let's start with a 10,000-foot overview of FlexOS. So first, users would come
with a simple configuration file describing the system that they want to build. There, they would select an application, libraries, boundaries, and safety mechanisms. Following this, the toolchain would select an isolation backend that is going to provide safety primitives for the system.
So isolation backends can leverage software as well as hardware-based technologies such as memory protection keys or even VMs. Then the toolchain would automatically rewrite the application and libraries to match the requested properties. Following this, with this backend
and rewritten libraries in hand, the toolchain will be able to generate an appropriate image. So these points here might still be somewhat confusing for you. And that's where the new abstraction comes into play. So let me speak a little bit more about it, and I hope that it will become clearer.
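To make the overview above a bit more tangible, here is a minimal Python sketch of what the first toolchain steps could look like. It is an illustration under assumptions, not the actual FlexOS toolchain: the configuration schema, the `plan_build` function, and the backend names `mpk`, `vm`, and `none` are all hypothetical. The sketch takes a description of compartments and library placement, picks a backend, and lists the library pairs that end up on opposite sides of a boundary.

```python
from itertools import combinations

# Hypothetical schema: these keys are NOT taken from the real FlexOS config format.
CONFIG = {
    "application": "redis",
    "compartments": {
        "comp1": {"mechanism": "mpk"},
        "comp2": {"mechanism": "mpk"},
    },
    "libraries": {
        "redis": "comp1",
        "libc": "comp1",
        "scheduler": "comp1",
        "netstack": "comp2",
    },
}

SUPPORTED_MECHANISMS = {"mpk", "vm"}  # assumed names, for illustration only


def plan_build(config):
    """Validate a configuration and derive the backend and the boundaries."""
    compartments = config["compartments"]
    placement = config["libraries"]

    unknown = {comp for comp in placement.values() if comp not in compartments}
    if unknown:
        raise ValueError(f"unknown compartments: {sorted(unknown)}")

    used = set(placement.values())
    mechanisms = {compartments[comp]["mechanism"] for comp in used}
    if not mechanisms <= SUPPORTED_MECHANISMS:
        raise ValueError(f"unsupported mechanisms: {sorted(mechanisms)}")
    if len(mechanisms) > 1:
        raise ValueError("this sketch assumes one mechanism per image")

    # With a single compartment there is nothing to isolate.
    backend = mechanisms.pop() if len(used) > 1 else "none"

    # Library pairs in different compartments are where gates would be needed.
    boundaries = [
        (a, b)
        for a, b in combinations(sorted(placement), 2)
        if placement[a] != placement[b]
    ]
    return {"backend": backend, "boundaries": boundaries}


if __name__ == "__main__":
    print(plan_build(CONFIG))
```

Putting every library into the same compartment makes the sketch fall back to the `none` backend with an empty boundary list, which mirrors the idea that isolation is a build-time choice rather than a property of the code.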
So as I was mentioning earlier, FlexOS embraces the philosophy of library operating systems. And so it's quite natural that we decided to base it on Unikraft. And the gist of library operating systems like Unikraft is really that they are composed of a set of fine-grained, independent libraries assembled together at compile time.
In FlexOS, we take that idea and start considering these libraries as the finest unit of compartmentalization. They're great for that because they naturally have well-defined interfaces with a clear cut on shared data. What we fundamentally rethink, however,
is how these libraries communicate. As part of the porting phase, we are going to pre-compartmentalize them. Fundamentally, what we do is modify them to replace cross-library procedure calls and shared data with abstract constructs that we call gates and data sharing primitives.
And we define them as part of the FlexOS API. So we will see how to do that concretely in the next slide. Finally, at build time and based on the user's input that we saw previously, these abstract constructs are going to be replaced by the toolchain with a concrete implementation.
For example, you could take a simple function call for no isolation boundary. And you could take MPK or TEE domain transitions, or even different VMs if you want to go crazy, to add isolation boundaries. So now let's become even more concrete
and let's take a closer look at the API. In this example, we're going to consider a simple call to receive from the application, where receive is implemented in the network stack. As part of the porting process, we would annotate shared data first
and add gate placeholders. Then at build time, if the application and the network stack are in two different compartments and we're using MPK to isolate them, then we would replace the annotated shared data with a shared heap allocation
and the gate placeholder with an MPK gate. That would be very different if we decided to put the application and the network stack in the same compartment. In that case, we would replace the annotated shared data with a normal stack allocation. No need to do anything fancy here in the same compartment.
And we would replace the gate placeholder with a simple function call. Once again, we're in the same compartment, so no need to do anything particular. We won't go too much into backend details here for time reasons, but there are more in the paper, so you're welcome to take a look at it. So we built a prototype of FlexOS.
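The receive example above can be mimicked with a small source-to-source rewriting sketch in Python. Everything in it is a stand-in chosen for illustration: the `SHARED(...)` and `GATE(...)` annotations, the `shared_heap_alloc` helper, the `<mechanism>_gate` naming, and the use of `recv` for the receive call are assumptions, not the real FlexOS API. The sketch only shows the idea that one annotated source can be instantiated differently per configuration.

```python
import re

# Hypothetical annotations standing in for the real porting API.
PORTED_SOURCE = """\
SHARED(char, buf, 64);
GATE(netstack, recv, sock, buf, 64);
"""


def instantiate(source, caller, placement, mechanism):
    """Replace the abstract annotations with concrete code for one configuration."""
    # Shared data only needs special treatment if there is more than one compartment.
    isolated = len(set(placement.values())) > 1

    def shared(match):
        ctype, name, size = match.groups()
        if isolated:
            # The buffer must be reachable from both compartments.
            return f"{ctype} *{name} = shared_heap_alloc({size});"
        # Same compartment: a normal stack allocation is enough.
        return f"{ctype} {name}[{size}];"

    def gate(match):
        callee, func, args = match.groups()
        if placement[caller] != placement[callee]:
            # Crossing a boundary: use the gate of the selected mechanism.
            return f"{mechanism}_gate({callee}, {func}, {args});"
        # Same compartment: a plain function call.
        return f"{func}({args});"

    source = re.sub(r"SHARED\((\w+), (\w+), (\d+)\);", shared, source)
    return re.sub(r"GATE\((\w+), (\w+), (.*)\);", gate, source)


if __name__ == "__main__":
    print(instantiate(PORTED_SOURCE, "app", {"app": 1, "netstack": 2}, "mpk"))
    print(instantiate(PORTED_SOURCE, "app", {"app": 1, "netstack": 1}, "mpk"))
```

The first call places the application and the network stack in different compartments and yields a shared heap allocation plus a gate; the second places them together and yields a stack buffer plus a direct call.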
The implementation, as mentioned earlier, is based on Unikraft. We have developed backend implementations for Intel memory protection keys and different VMs. In the VM case, we basically put each compartment in its own VM and they communicate via inter-VM shared memory.
So we have also ported libraries to run as isolated libraries. So the network stack, the scheduler, the file system, the time subsystem, they're good examples. And we also have application support for Redis, Nginx, SQLite, and the iPerf server.
In this talk, I am going to focus on demonstrating flexibility and performance, but there are more measurements in our papers. You're welcome to take a look at them. The first plot we'll take a look at now is quite complex.
So I will try to slowly introduce it. What we want to do here is to demonstrate the flexibility of FlexOS, or the design space that we can achieve. We're going to measure the runtime performance with Redis in requests per second. At the bottom,
you have the different configurations that we're going to evaluate. The four items here are flexible libraries used in the Redis image. So only a subset for readability. At the top, the Redis application, then the C standard library, the scheduler and the network stack.
Each bar represents one configuration and its associated performance. We have 80 configurations in total. The way you can read the configurations at the bottom is the color indicates the compartment. So white for compartment one,
red for compartment two, blue for three. If a bar only has white, there is only one compartment, so no isolation. And the dot tells you whether hardening is enabled, that is, the address sanitizer, safe stack, and UBSan. So the first observation that we can make here
is that we have a pretty large gap between the most expensive and the least expensive configuration. And the configurations at the extremes are roughly what we expect them to be, full on and full off. We can also see that the slope is pretty smooth.
Basically, if you give me a performance price that you're ready to pay, I can give you a configuration that's pretty much exactly what you're asking for. The most interesting thing, however, is that you sometimes can get very different properties at pretty much the same cost.
And most of the time, that's due to the way libraries communicate. Take the example on the right. Security-wise, they're very different. On the left, you have three compartments, the network stack and the scheduler in different compartments. And on the right, you have two compartments,
the network stack and the scheduler in the same compartment. So security-wise, they're different. But performance-wise, they're very, very close or identical. That's because the way the network stack communicates with the scheduler goes through another compartment, another library that is always
in the third or the second compartment. And so in both cases, you have two domain crossings. And so you have the same cost. So clearly you can get some safety for free by exploring intelligently. And that's something we explore in greater detail in the paper.
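The reasoning about equal cost can be checked with a tiny Python sketch that counts domain crossings along a call path. The library name `intermediate` and the compartment numbers are hypothetical placeholders; the talk only says that the network stack reaches the scheduler through another library that always sits in a different compartment.

```python
def domain_crossings(call_path, placement):
    """Count compartment switches along a chain of cross-library calls."""
    return sum(
        placement[caller] != placement[callee]
        for caller, callee in zip(call_path, call_path[1:])
    )


# The network stack reaches the scheduler through an intermediate library.
CALL_PATH = ["netstack", "intermediate", "scheduler"]

# Three compartments: network stack and scheduler are isolated from each other.
THREE_COMPARTMENTS = {"netstack": 1, "intermediate": 2, "scheduler": 3}

# Two compartments: network stack and scheduler share a compartment.
TWO_COMPARTMENTS = {"netstack": 1, "intermediate": 2, "scheduler": 1}

if __name__ == "__main__":
    print(domain_crossings(CALL_PATH, THREE_COMPARTMENTS))
    print(domain_crossings(CALL_PATH, TWO_COMPARTMENTS))
```

Both placements cross a boundary twice on this path, which is why the stronger three-compartment configuration costs about the same as the two-compartment one in this model.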
So there would be more to tell, but I hope that I could give you at least an idea in this limited time of the interesting things that we can do with isolation flexibility. The second set of results that we're going to take a look at is focused on performance and comparison with the baseline. Does flexibility get in the way of performance?
And here, we're going to take a look at numbers for SQLite, namely the time to perform 10,000 SQLite insert queries in seconds. So at the bottom, you have the number of compartments and the mechanism that's used. For example, PT2 for two compartments with the page table,
EPT2 for two compartments with VMs, MPK3 for three compartments with MPK and none for no isolation whatsoever. And at the top, you have the VMM or the environment we're running the benchmark in.
The first observation that we can make is that there is no overhead when disabling isolation in FlexOS. Basically, you only pay for what you get. So that's the first good thing. The second observation that you can make is that the MPK backend compares favorably to competing solutions.
The comparison with CubicleOS is a bit tricky because they're using the Linux userland debug platform of Unikraft. But if you take relative numbers, it's still quite good for FlexOS. Finally, the EPT backend compares very favorably to competing solutions.
Here, we would make a comparison with Linux. It's interesting to see that we get almost the same number as Linux. And the fact is, and it's clearer in the paper, that the domain transition cost between userland and kernel is actually very similar to the communication cost
between different VMs in the EPT case. So as a conclusion, I highlighted the need for more isolation flexibility, whether it is for specialization purposes, to adapt to new hardware, or simply to quickly react to newly published vulnerabilities.
Current approaches tend to commit to one approach at design time, and that makes them effectively represent a single point in the design space. In this work, I presented FlexOS, an OS that decouples isolation from the OS design, effectively making isolation decisions at build time.
I presented some of our results that motivate more automatic exploration of the new trade-off space, something that we do in greater detail in the paper. So if you're interested in this work, you're welcome to come have a chat. Good starting points would be our website,
the preprint of the paper, and of course, the source code, which is released under the BSD 3-Clause license. We're proud to say that this work is reproducible research. All results presented in this presentation have been artifact-evaluated by the ASPLOS artifact evaluation committee.
And finally, I want to stress how much of a team effort FlexOS is. Nothing that I presented in this presentation would be there without my colleagues from Manchester, UPB, and NEC. And I'm really happy to take questions and have a discussion.
Thank you. Informed if he tries to compile an API to an architecture that just doesn't support all the features that are required.
So basically- I think you're muted. I'm not supposed- I cannot hear you. Huh.
That is annoying. Does this work? I hear you on the main channel. So can you actually hear me or not?
Because I don't know what is going on, supposed to be working. Test, test. Can you hear me? No? Okay, so should I-
I hear you with a lot of delay. There's like one minute delay. Oh, wow. Okay, so I'm maybe just gonna try to answer the question and there will be delay unfortunately, but- Test, test. Now I hear you. Okay.
So what should I mute? Because I only have this open, so. You can hear me. Okay. So is it on John's side, the issue?
Probably so. Okay. I'm just gonna try and answer the question. So basically what we try to do is to create a layer of abstraction that is going to sit in front of the isolation mechanism.
We call it the backend. So we have this compartmentalization API that you use to port the code. And on the other side, developers are going to write backends that implement support for a specific isolation primitive. If I take the example of what we implemented,
MPK, in this case, the developers are going to implement the gates by changing the value of the PKRU register. They are going to map the shared variables to areas of shared memory.
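As a rough illustration of the MPK case just described, here is a toy Python model of a gate that switches a simulated PKRU value around a call. The two-bits-per-key layout (access-disable and write-disable) follows the Intel architecture; everything else, including `SimulatedCpu`, `mpk_gate`, and the key assignments, is hypothetical. A real backend would write the register with the WRPKRU instruction and has to handle details that this model ignores.

```python
NUM_KEYS = 16          # Intel MPK provides 16 protection keys
ACCESS_DISABLE = 0b01  # AD bit in a key's 2-bit PKRU field
WRITE_DISABLE = 0b10   # WD bit in a key's 2-bit PKRU field

# Hypothetical key assignment for this sketch.
APP_KEY, NET_KEY, SHARED_KEY = 1, 2, 3


def pkru_allowing(keys):
    """Build a PKRU value that permits access only to the given keys."""
    value = 0
    for key in range(NUM_KEYS):
        if key not in keys:
            value |= (ACCESS_DISABLE | WRITE_DISABLE) << (2 * key)
    return value


class SimulatedCpu:
    """Stand-in for a hardware thread; only models the PKRU register."""

    def __init__(self, pkru):
        self.pkru = pkru

    def can_access(self, key):
        return ((self.pkru >> (2 * key)) & ACCESS_DISABLE) == 0


def mpk_gate(cpu, callee_keys, func, *args):
    """Call func inside the callee's protection domain, then switch back."""
    saved = cpu.pkru
    cpu.pkru = pkru_allowing(callee_keys)  # enter the callee's domain
    try:
        return func(*args)
    finally:
        cpu.pkru = saved                   # return to the caller's domain


if __name__ == "__main__":
    cpu = SimulatedCpu(pkru_allowing({APP_KEY, SHARED_KEY}))
    inside = mpk_gate(
        cpu,
        {NET_KEY, SHARED_KEY},
        lambda: (cpu.can_access(NET_KEY), cpu.can_access(APP_KEY)),
    )
    print(inside, cpu.can_access(APP_KEY))
```

The shared key stays accessible on both sides of the gate, which models shared variables living in an area of memory that both compartments can reach.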
In the case of VMs, developers are going to map gates to inter-VM communication, like remote procedure calls. They're going to use shared memory as well. And you can actually map this quite well
to CHERI as well. In this case, you would have gates that, if you take the sentry type of... yeah, okay, maybe I shouldn't go too much into that. Is this with pure capability code
or with native capability code? Yeah, exactly. You can have this set of capabilities that take care of doing this transition, I guess. Which of them did you use? Did you use native or hybrid or pure? So basically at this stage, we didn't implement support for a CHERI backend.
Because, well, you can't just simply pass a pointer from one place to another, because if you pass it via a capability, you need to use the capability instructions, which actually means that you need to rewrite the entire code. Well, this is the problem. There is a question in the chat. So you proposed an abstraction that abstracts different isolation,
inter-process isolation technologies like MPK and, well, CHERI to some extent. And definitely these technologies have different granularities: MPK has page granularity, CHERI ideally can be byte-addressable. So why do you think it's a good abstraction?
So at some point you have to decide on a base granularity, a minimal granularity, if you want to be able to have this abstraction that can be applied to a wide range of mechanisms. The basic granularity that we decided to go for
is the library. Speaking about arguments of a call, so when you pass an argument of a call, for example, a pointer to a buffer, this often should be aligned and should be alone inside the page, because you use MPK pages to remap this page from one protection domain to another one. While in the case of CHERI,
well, you don't need to remap the entire page because you can deal with bytes. So the abstraction you use is actually quite different in the case of those two different technologies. Yeah, well, I'm pretty sure that you could actually write the backend in such a way that you take advantage
of the hardware's abilities. For sure, if you want to isolate at a granularity that's finer than your library, you will need to split the library manually. Okay, so another question, how much porting effort is required to use your system?
Okay, so in the case of Redis, we actually put that in the paper. In the case of Redis, that was about two days, I think. In terms of lines of code, it's a few hundred.