
The Composite Component-Based OS


Formal Metadata

Title
The Composite Component-Based OS
Number of Parts
287
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
The Composite CBOS is in many ways a traditional microkernel. Services and policies are implemented at user level, the kernel focuses on fast IPC, and it uses a strong capability-based access-control mechanism. It has historically focused on being a research laboratory for strange features, including a thread-migration-based IPC, user-level scheduling of system-level threads, user-level definition of capability policies, a wait-free kernel that scales linearly with increasing cores, and temporal capabilities to coordinate between untrusting schedulers. It also scales down and supports paravirtualized RTOSes on microcontrollers (with on the order of 64KiB SRAM, 16–200 MHz, and MPUs). Composite represents a design that deviates from the L4 lineage in some interesting ways. In this talk, we'll discuss the design with a focus on how the system provides the challenging combination of predictability, performance, and scalable parallelism.
Transcript: English (auto-generated)
Hello, this is Gabriel Parmer, and today I'm going to talk about the Composite component-based operating system, an OS that we've been working on for about 15 years. I'll be covering a lot of the research we've done at GW, and I've done much of this work with the amazing researchers listed here.

A lot of our work on Composite has been motivated by the idea that there's a convergence in requirements between systems that have historically been quite disparate. On one hand, we have embedded systems that have focused almost entirely on predictability, simplicity, and size, weight, power, and cost. On the complete other end of the spectrum, we have massive data centers hosting multi-tenant clouds, which focus inherently on performance (pushing as many client requests through as possible), on isolation (especially multi-tenant isolation), and on elasticity (the ability to scale up and down).

What we've been noticing is an interesting convergence between these two worlds. The predictability requirements of embedded systems are being ported over to the cloud: people now care a great deal about things like tail latency. The performance constraints of the cloud are percolating down to embedded systems as we consolidate more and more functionality onto our small systems; autonomous vehicles are a very good example of this. And we care increasingly about isolation and security on embedded systems, in some cases full-on multi-tenant isolation.

The problem is that when we want all three of these properties in the same system, be it embedded or in the cloud, we face a very difficult system-design optimization. When implementing our operating system, we might say: I need strong isolation, but I also want performance. These are contradictory in many cases, because isolation requires protection domains, while performance requires efficiency, which means not switching between protection domains frequently. Predictability and performance are often at odds as well, simply because predictable performance means being able to put bounds on how long it takes to respond to different impulses, and doing so often requires forgoing optimizations. Isolation and predictability are frequently at odds for similar reasons, which is why all three are not often seen in the same system.

So we have a really interesting design space, and a lot of what we've been doing is researching how to design an operating system that works very well for each of these domains and optimizations, but can also satisfy the requirements of all of them together. As you can see here, we're essentially trying to provide a foundation offering performance, isolation, and predictability for a number of interesting domains.

How do we do this? From the very start, Composite has been focused on the idea of being a component-based operating system. The idea is that it is very difficult to implement one version of a piece of functionality that will meet all of these requirements, so what we really want instead is the ability to customize the functionality of the system for the requirements and goals of the overall system.

So what is a component? I think most people at this conference understand what one is, but it's also one of those terms that means different things to different people, so I want to be explicit. When we think about components, we think about some code and its associated data executing at user level; we call this the functionality. That functionality is typically an implementation of very specific APIs, where each API exported for other components to use consists of a number of functions with some sort of contract: required functionality provided by the component itself. And of course a component cannot necessarily implement everything it needs from the system by itself, so it also has a number of explicit dependencies on APIs that should be provided by other components.
When we look at components this way, they are a unit of reuse in the system, and because they're implemented at user level (and, in our system, in separate protection domains), they're also a unit of isolation.

A goal we typically have is to minimize the functionality provided within each component to what's necessary for implementing the APIs it exports. This is just another way of saying that you want to implement your components in accordance with the principle of least privilege: each should require only the implementation minimally necessary for what it's trying to provide. This also minimizes the number of dependencies it has on other parts of the system. If you do this, you end up with a fair number of components, which provides stronger, finer-grained isolation than you see in a lot of other systems.

One of the benefits of this is that we can start thinking about the implementation of a system as just the composition of components. Here you see a system with a number of system services, each exporting (in this very trivial example) an interface; some of them require interfaces from other components, and applications at the top rely on some of this functionality. So we have composed a system out of components, design has moved to that level, and the faults within some of these service components, and the unpredictability within the system, are limited by design.

For instance, consider app one and app three. It might be the case that app three is relatively trusted and relatively simple: all it's doing is talking to some device over I2C and using a channel to pass data to app one, which talks over the network. App one might be relatively complicated; app three, maybe not so much. We'd really like the facets of the system required by app three to be somewhat isolated from the potential compromises or faults of app one, and we can simply look at the intersection of the two applications' sets of dependencies to see which components need to be in the shared trusted computing base. This is probably second nature to a lot of people here, and relatively obvious, but I think it's worth saying that this ability to design the system out of components is, in some sense, a bit of a superpower.
The problem is: how do you actually design a system in which you can define all of the policies that matter for these goals (performance, predictability, and so on) in components at user level? This becomes very difficult, and it harkens back to the advice Liedtke gave for microkernel design in '95: a concept can be tolerated inside the microkernel only if moving it outside the kernel, i.e., permitting competing implementations (for example, in different components), would prevent the implementation of the system's required functionality. In other words, we should only put something in the kernel if it has to be there to implement the functionality of the system.

So we want to implement all of these things in components, which begs the question: what needs to be in the kernel? Most microkernels try to adhere to this advice, but they all need to make trade-offs in different places. Composite, in some sense, is a research lab that allows us to push this principle quite far, to the extreme. We want component-defined, user-level-defined scheduling, whereby there is no scheduler within the kernel; all of that is defined in a user-level component. The parallel-scalability properties of the system should also be defined, and if limited, delimited, by implementations at user level, not by the implementation of the kernel itself. Similarly for concurrency: the second you define blocking APIs within the kernel, you're defining concurrency and synchronization policies that, in our opinion, should be at user level, or at least Composite tries to put them there and see how far we can push it. We also do very strange things like moving capability delegation and revocation policies to user level, so we can export them out of the kernel and redefine them as appropriate.

So there are a lot of ways in which Composite tries to push the bar. To understand how, you need to dive into the abstractions Composite provides. I'm going to go through these relatively quickly; a lot of them will feel comfortable to people who know seL4, and will sound very similar in how I initially present them. I'll go into the differences later.

So what does the kernel provide, if the kernel is meant to be minimal and we're supposed to implement most of the required policies and functionality in components? We have capability-table nodes and page-table nodes, all directly accessible resources from user level, implemented in the kernel, and we can piece these together to create protection domains in the system. If we combine a capability table with a page table, all of a sudden we have a component. A component is a protection domain that includes all the memory resources indexed by the page tables and all the kernel objects referenced by the capability tables, which define the access rights for anything executing within that component. What executes in a component? Threads, of course; we don't do anything to buck the historical trend, going back to the '70s, of defining threads as our abstraction for execution. A thread is a set of registers that executes in a component. We'll see that the notion of threads gets relatively complicated later.

When implementing a system like this, we also need to understand how memory allocation works. We do not have memory allocation within the kernel. Instead, we looked at what seL4 did with untyped memory, decided it was a fantastic, amazing idea, and ported it over; we made some significant changes to the details, but the high-level idea is the same. Components can have access to untyped memory. They cannot do loads and stores to that memory, but they can retype it into either kernel objects or user-level virtual-address-space pages. If we retype, for instance, into a thread and a page-table node, those are now referenced from within the capability table, because they are actual kernel objects that we can program. As I said, we can also retype some of this memory into virtual memory and map it into components, making it present and accessible to loads and stores for those components.
The retyping facilities of the kernel keep the kernel and components safe by providing one guarantee: no piece of memory can be accessed as multiple types at the same time. Any piece of memory is accessed as one type and only one type within the system. There are complicated rules for enforcing that, but that's how it maintains safety.

We also have IPC in the system. Of course we have multiple components, and we want them to talk to and coordinate with each other; as I said, we want to compose them into these component hierarchies. We support both synchronous and asynchronous IPC, and I'm going to go into both in detail, because they are the core of what makes Composite able to do things like user-level scheduling.

Before that, we need a little background on how you traditionally think about IPC within microkernels, or within operating systems in general. Option one is asynchronous IPC, or bounded asynchronous IPC, and things like pipes: multiple threads, perhaps with different priorities (here x, y, and z), coordinate by passing messages between them. If we want to assess how long it takes for a message to percolate through processing across the system, from the thread on the left to the thread on the right, we need some sort of end-to-end timing analysis that understands the dependencies throughout the system, especially when these threads are spread across various cores. This can actually lead to a fair amount of pessimism, and there's a lot of active research on making it better and faster.

We've also noticed over the years that most microkernel IPC ends up being synchronous, and for good reasons: it's a lot easier to coordinate that way in some cases, and it can be a lot faster. It's not that synchronous IPC is inherently better, but a lot of microkernels have gone in that direction for a lot of functionality. What does that look like? We typically do synchronous IPC between different threads. Again we have three threads, and now, instead of sending asynchronous messages, we rendezvous on some sort of endpoint between threads. The thread in the center might await a call from the thread on the left; once the thread on the left makes a call, saying "yes, I need your service, I want to effectively invoke some function within you," they rendezvous, the middle thread wakes up, and so on, blocking the thread on the left. This bakes a lot of concurrency and synchronization logic into the kernel, and adds a lot of assumptions about ordering and priority into the system, so it's hard to remove those policies from the system. If you want a predictable system, it also requires priority inheritance, considerations around priority ceilings, and budget management. These are all tractable problems, but they require a lot of engineering, and they all inherently bake policy into the kernel in some sense.

What's interesting is that when we looked at this history, we saw, on the far left, the notion of synchronous rendezvous between threads as the primary and only way to coordinate within the system; that was back in '93, '95, with a lot of Liedtke's original L4. Since then, systems have been moving more and more to the right, toward a notion called thread migration: the idea that you want some notion of the execution that started on the client, on the left, to go along with the IPC, with the invocations, as they percolate through the components. Fiasco has Credo support; Nova has priority-inheritance support; these allow some notion of the thread, the execution context, to go along with IPC within the system. seL4 with MCS support doesn't allow priority to flow through the system, but it does allow budget to go along with invocations. So a lot of these systems are moving toward thread migration.

Composite represents a system that goes all-in on thread migration, and always has. Thread migration is simply the idea that, with multiple components in the system, if a thread on the far left makes an invocation into the middle component, and then into the component on the right, it logically flows between them. As it makes function calls, the executable thread continues execution into the middle and then the right component; when the right component finishes its execution, it returns to the middle, the middle returns to the left, and so on. Of course, done naively, this is very dangerous: we can't have thread context just accessible across all of the components; we still want to maintain isolation between them. So this requires execution context (things like stacks and register contents) for that thread in each component, and it still requires separate protection domains for each component, but the logical executable entity persists in its execution between different components.
There's only one priority for that thread, only one budget, and so on. This does mean that, as multiple threads invoke that middle component, we might need multiple execution stacks, and managing how many execution stacks to have and how to allocate them at user level is relatively difficult; you can see the publication at the bottom for how we dealt with it.

One of the key insights from thread migration is that, because we are not switching between threads as we do IPC throughout the system, we never need to involve a scheduler on IPC, and that enables us to move the scheduling logic entirely to user level. Within the kernel, we provide dispatch, and schedulers implemented in components use that facility to switch between threads and define their own scheduling policy. Indeed, the kernel barely knows what a priority is: schedulers define priorities in their user-level data structures, and run queues live in the user-level schedulers. If a scheduler is executing in one thread and wants to switch to another, it simply needs access to that thread within its capability table, which gives it the ability to dispatch to it. Note that this is not user-level scheduling as defined by a library in traditional systems like scheduler activations, because the threads you're dispatching between might be implemented in any protection domain. The scheduler doesn't know or care; its job is just to decide which thread to run at any point in time, and to dispatch between them accordingly. Because the scheduler defines all of its own logic for priorities and budgets, switching between threads means it naturally switches not just the executable context, using dispatch, but also its own notion of which thread is running, accounting for and prioritizing the threads separately. Now, that's only part of what a scheduler does, of course.
Yes, the scheduler decides which thread to run at any point in time, and we've now seen that it can do that in a really interesting way, but we don't define blocking semantics in the kernel in doing this. Instead, when a thread wants to cooperatively block, it invokes the scheduler: we use IPC to call into the scheduler, and the scheduler performs the dispatch.

That's not all you need to implement user-level scheduling; additionally, you need a source of time. Composite provides the ability to vector timer ticks up to these schedulers in a controlled way. The schedulers can define their own granularity for timer ticks, on a cycle-accurate granularity on x86, depending on what the hardware can provide. This is actually sufficient for implementing a system, but we also need to answer the question of what happens with the other interrupts in the system, not just timer interrupts. One option would be for all interrupts to be vectored to the scheduler, and that would solve a lot of problems, but it has performance repercussions, because the scheduler would need to be involved in the processing of every interrupt. That's possible, and we did it in the first version of the kernel, but to avoid that overhead we came up with a way to vector these interrupt sources up to threads executing in the different components. The kernel can be told by the scheduler, using a mechanism that we call temporal capabilities, whether, when a networking interrupt arrives, for instance, the thread destined for the interrupt processing should be run immediately, thereby preempting the currently executing thread.

So we make these decisions using something we call temporal capabilities, and I'm going to brush over them quickly, but they are a really cool abstraction, one of the abstractions I'm proudest that we implemented in the system; you can see the paper for details. A temporal capability (t-cap) is a slice of time that has been given for the execution of threads on that temporal capability, plus an integer, or a set of integers, interpreted as a set of priorities to decide whether a preemption should be made. The way the scheduler maintains control in the system is that it is in complete control of programming the time slices that are allowed for execution, and the priorities.
But t-caps are also interesting because they allow us to implement multiple schedulers in the system. Because schedulers are defined at user level, we might want to have multiple of them, so that we can specialize different parts of the system for different goals: performance, predictability, and so on. T-caps allow different schedulers to delegate slices of time to each other, thereby passing temporal resources throughout the system so that execution can happen in these different schedulers using that time. This is interesting because it allows us to create a hierarchy of schedulers, or even a full graph of schedulers, all coordinating by passing time. T-caps are a very difficult abstraction because they want to allow these schedulers to coordinate while necessarily constraining their impact on each other. A scheduler wants to say: hey, you other scheduler, I'm going to give you some time,
but I want to guarantee that you will not use that time to interfere with this set of threads that I want to be able to execute at the highest priorities. So this is one of the resources I didn't go over before. If we think now about the kernel objects, we have: threads; temporal capabilities; synchronous endpoints, which let us connect these resources and allow thread migration between components; asynchronous IPC, which is effectively that notion of interrupts being able to trigger execution, generalized to asynchronous communication between threads; and, sorry, the threads that you see at the bottom here. The one thing I want to point out is that every thread maintains an invocation stack that tracks the sequence of components that thread has executed through as it invokes the interfaces of each of those components. When we return from a component, it pops an entry off, just like a normal C execution stack, but because this is a kernel resource, it is trusted and cannot be corrupted in any way, shape, or form. Now this raises a number of questions. It's nice that we have this facility for defining scheduling policy at user level, and it has allowed us to do a lot of research into providing these different goals within the Composite infrastructure, but there's an intuitive notion that user-level scheduling is probably too slow, right? So we have to answer the question: is it fast enough to do what we need in the system?
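The per-thread invocation stack described above can be modeled roughly as follows. The structure and names (`inv_frame`, `INV_DEPTH`) are hypothetical stand-ins for the kernel's real bookkeeping; the point is the push-on-invoke, pop-on-return behavior that the kernel enforces on the thread's behalf:

```c
#include <stddef.h>

#define INV_DEPTH 32           /* illustrative maximum nesting depth */

struct inv_frame {
    unsigned component_id;     /* component the thread invoked into */
    void    *return_sp;        /* stack pointer to restore on return */
};

/* In the real system this lives in the kernel, so user level
 * cannot corrupt it; here it is just an in-memory model. */
struct thread {
    struct inv_frame inv[INV_DEPTH];
    size_t           top;      /* number of live frames */
};

/* synchronous invocation: push the component we are entering */
static int
invoke(struct thread *t, unsigned component_id, void *sp)
{
    if (t->top == INV_DEPTH) return -1;   /* stack exhausted */
    t->inv[t->top++] = (struct inv_frame){ component_id, sp };
    return 0;
}

/* return from an invocation: pop and report which component we leave */
static int
sret(struct thread *t, unsigned *component_id)
{
    if (t->top == 0) return -1;
    *component_id = t->inv[--t->top].component_id;
    return 0;
}
```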
Right: does it slow down scheduling to the point where it's useless, or limits the system in a significant way? Well, we've gone through about five iterations of these user-level scheduling infrastructures across three versions of the kernel, and we've ended up with a system, Slite, which Phani did a couple of years ago, that allows user-level dispatching of threads on the order of 41 cycles. So when we're doing cooperative switches, we have dispatch at the level of thread-based user-level scheduling libraries; when we're doing preemptive scheduling, we need to dispatch from the kernel, but that's on the order of 300 cycles. In the end, the notion of the scheduler being defined at user level is not that prohibitive: the policy ends up looking relatively similar to what it would be in the kernel (a little simpler, because it's at user level), and the dispatching latency is not an inhibitor. But there is another source of concern: if we want to do cooperative switches, we don't have a blocking abstraction within the kernel, so a thread needs to invoke the scheduler, asking for itself to be blocked. So: is IPC too slow if it's on the path of many of these scheduling decisions?
You know, it's hard to look at cycle counts on their own and get much from them, so I'll add a comparison to seL4 here, because seL4 has one of the faster IPC paths in the world; NOVA is maybe the only other system I know of that is highly competitive with it. If we look at x86-32 and Cortex-A9, you see that for a round-trip IPC from client to server and back, on seL4 and Composite, Composite is competitive if not better. So IPC is not free, but in our observations it ends up not being prohibitive, and our scheduling latencies usually end up significantly faster than those in Linux. So in systems where Linux is sufficient, you'd expect this to be sufficient as well; in systems where Linux isn't sufficient, like some real-time domains, we have to go into design mode and make sure that everything is as efficient as it can be.
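Since the kernel has no blocking abstraction, blocking itself is one of these scheduler invocations: a thread synchronously invokes its scheduler component and asks to be blocked. The sketch below illustrates that flow; all names and the pick-lowest-runnable policy are my illustrative assumptions, not the real scheduler interface:

```c
#include <stdbool.h>

#define MAX_THDS 8

/* Model of a user-level scheduler component's state. */
struct sched {
    bool runnable[MAX_THDS];
    int  current;              /* -1 means idle */
};

/* stand-in dispatch policy: pick the lowest-numbered runnable thread */
static int
sched_next(struct sched *s)
{
    for (int i = 0; i < MAX_THDS; i++)
        if (s->runnable[i]) return i;
    return -1;
}

/* invoked (via synchronous IPC) by the thread that wants to block */
static int
sched_block(struct sched *s, int thd)
{
    s->runnable[thd] = false;
    s->current = sched_next(s);   /* user-level dispatch of the next thread */
    return s->current;
}

/* invoked when an event makes a blocked thread runnable again */
static int
sched_wakeup(struct sched *s, int thd)
{
    s->runnable[thd] = true;
    if (s->current < 0 || thd < s->current) s->current = thd;
    return s->current;
}
```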
So this is one core aspect of how we think about system design in Composite. We want to enable this notion of user-level policies defined by components; that means removing policies for things like concurrency from the kernel, and that means using things like thread migration. We've made it practical in many different ways and have demonstrated the ability of user-level scheduling to have pretty significant benefits. An additional area where we have tried to remove implicit policy from the kernel is around scalability bottlenecks as we increase the number of cores in the system. The big question here is: can the kernel's APIs and implementation limit the scalability of components, or cause interference between components? If the kernel's implementation of its own primitives limits the scalability of the system in some way, then it is imposing performance policy on those components. Our goal is essentially to say: if your user-level code executing in a component is scalable, it should do well as you increase the number of cores; the kernel gets out of its way.
So we take a very strong version of this: we really want to move the kernel out of the way and define policies at user level, and that includes not imposing prohibitive overheads on things like scalability. This is difficult. On the bottom left, you see a component coordinating with itself across different cores, and as it makes system calls (the arrows going down and up), if we have, for example, a big kernel lock that interposes on kernel operations to prevent race conditions, then we're no longer achieving this goal: that one lock on its own means the kernel will inhibit the scalability of that user-level process. In the center diagram, one component is trying to coordinate through the kernel, say using message-passing APIs, across cores to another component. Again, making those calls while, for instance, requiring a big kernel lock means that another component executing on the destination core might feel the negative performance and predictability impacts of that message passing. The worst case is on the right, where two components are completely isolated from each other, executing on different cores, but they inhibit each other's execution simply because they make system calls and there's a big kernel lock: one core has a performance impact on the component on the other core. These are a few of the cases I motivate here, and the exact same thing applies to IPIs (inter-processor interrupts) when we think about coordinating between different cores.
Shared memory is one way to coordinate, with locks to mediate it; IPIs are another. If you have a multikernel-based system where you share no memory in the kernel, then this locking concern goes away; however, you often still use IPIs to coordinate between cores in an event-triggered way. I want to focus on the middle case here, where a component sending a message to a component on another core causes IPIs on that other core, but those IPIs might cause execution that interferes with another component executing on that core. So again, this is policy imposed by the kernel's mechanisms, placing limits on what you can actually define at user level in terms of performance. The same exists on the far right, in an even worse scenario, where for instance we need to do TLB shootdowns or scheduler run-queue rebalancing: two components that don't interact with each other and should otherwise be isolated can feel each other's performance impact through these IPIs. We really want to avoid all of these things, and Composite takes a very aggressive stance here and essentially says that,
to solve these problems, the kernel must be entirely wait-free, and components at user level should be able to entirely control cache-line contention within the kernel. Which is to say, policies should be definable whereby two components are completely guaranteed to never write to a shared cache line within the kernel, and therefore should never cause a performance impact on each other as the number of cores increases. This means the kernel can have no locks; it means the kernel, by and large, can't really have reference counts; it means we need to take a very aggressive stance in the implementation of the kernel. We do this by integrating a scalable memory reclamation facility based on time, in a library we've called ParSec, and you can read about it in the ParSec paper. I'm not going to go into it, but it, and a few other facilities, are what's required to actually implement this.
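A minimal model of the time-based reclamation idea: each core publishes the timestamp of its last quiescent point, and memory retired at time t may only be freed once every core's published time has passed t. This is an illustration of the approach under my own simplifying assumptions, not the actual ParSec code (which uses atomics and per-core cache-line-aligned state):

```c
#include <stdint.h>
#include <stdbool.h>

#define NCORES 4

/* Per-core quiescence timestamps; each entry is written only by its
 * own core, so cores never contend on a shared cache line for writes. */
static uint64_t quiesce_time[NCORES];

/* A core calls this at a quiescent point (e.g. on kernel entry). */
static void
parsec_quiesce(int core, uint64_t now)
{
    quiesce_time[core] = now;   /* real code would use an atomic store */
}

/* May memory retired at 'retire_time' be reclaimed? Only if every
 * core has quiesced strictly after the retirement. */
static bool
parsec_reclaimable(uint64_t retire_time)
{
    for (int i = 0; i < NCORES; i++)
        if (quiesce_time[i] <= retire_time) return false;
    return true;
}
```

No locks and no reference counts are needed; reclamation is delayed until time proves no core can still hold a reference.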
This has allowed us to scale up across many different cores pretty successfully. We additionally provide facilities for components to define effectively every single time IPIs are used throughout the system: the kernel itself does not use IPIs for any of its mechanisms, and instead user-level policies control when IPIs are actually used. Again, we want the kernel to get out of the way so that user level has maximum flexibility. So, we've looked at using Composite in a lot of different domains, and I just want to go through a few very quick examples: embedded systems on the left, small microcontrollers in the middle, and edge data centers on the right. On the far left, we have a lot of work, and I'm just going to overview some of it, mainly from an engineering standpoint.
We've implemented a principle-of-least-privilege-focused RTOS on the system. We've also implemented a personality for that same RTOS on seL4 and compared the two. We've implemented NetBSD rump kernels on the system, Xen-like driver domains and time coordination between them, and we've ported NASA's Core Flight System to do flight control within the system. And mixed-criticality systems: we've implemented ways in which you can have different subsystems on the same embedded system, some mission-critical and some just there for convenience, with ways to coordinate between them so that the less important software can't interfere with the more important software. I'm not going to go into that in much detail, but I do want to dive into one interesting facet: microcontrollers. Microcontrollers are not a place where you'd typically expect a microkernel or a component-based system, because they have on the order of 16 to 128 kilobytes of SRAM, and they don't have MMUs; they don't have virtual memory. Instead they have memory protection units (MPUs) that just let you open windows onto physical memory that are either accessible or not. So we've actually
implemented a paravirtualization infrastructure, a microcontroller virtual machine infrastructure, on the system, in which we can run multiple FreeRTOSes, or generally multiple RTOSes. We had to overcome a lot of challenges, which required us both to modify the kernel and to modify a lot of those user-level components, specialized for this kind of low-memory, performance-sensitive system. As a very high-level overview: we looked at MPU-based isolation versus page-table-based software isolation and how we could span the gap between them, and we use path-compressed radix trees to flatten a system of protection into the limited MPU regions. Because the MPU regions are limited, we came up with different techniques, some static, solving for memory layouts so that we could fit all of the required protection into the limited MPU regions. But we also treated these MPU regions as a cache, essentially a software-managed TLB, so that we can have an unbounded number of protected regions for a component even though we have a finite number of MPU regions. More recent work enables efficient coordination with devices by doing kernel bypass for interrupts; I'm not going to go into that, but it's pretty cool. So this has allowed us to run up to eight virtual machines in 128 kilobytes of SRAM. I'm not going to go into the details here; you can look at the papers, or you can pause.
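The MPU-regions-as-cache idea above can be sketched as follows. The lookup uses a linear scan as a stand-in for the path-compressed radix tree, and all sizes and names are illustrative assumptions; on a protection fault, the missing region is installed into a hardware slot, evicting round-robin:

```c
#include <stdint.h>

#define N_MPU     4          /* hardware MPU slots (illustrative) */
#define N_REGIONS 16         /* component's logical protection regions */

struct region { uintptr_t base, size; };

static struct region regions[N_REGIONS];  /* component's full region set */
static int mpu_slot[N_MPU];               /* which region each slot holds */
static int next_evict;                    /* round-robin eviction cursor */

/* Called on a protection fault at addr. Returns the MPU slot now
 * covering addr, or -1 if no region covers it (a genuine fault). */
static int
mpu_fault(uintptr_t addr)
{
    /* already loaded in a hardware slot? */
    for (int s = 0; s < N_MPU; s++) {
        struct region *r = &regions[mpu_slot[s]];
        if (r->size && addr >= r->base && addr < r->base + r->size)
            return s;
    }
    /* look up in the full region set (a radix tree in the real system) */
    for (int i = 0; i < N_REGIONS; i++) {
        struct region *r = &regions[i];
        if (r->size && addr >= r->base && addr < r->base + r->size) {
            int s = next_evict;
            next_evict = (next_evict + 1) % N_MPU;
            mpu_slot[s] = i;     /* "program the MPU region register" */
            return s;
        }
    }
    return -1;
}
```

This is what makes the finite MPU look like an unbounded set of protected regions: misses cost a fault and a reload, exactly as with a software-managed TLB.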
But the whole idea here is that there are some performance impacts for doing this, but by and large the microkernel VM is not that much slower than raw FreeRTOS executing with no isolation whatsoever. So this means you can have strong isolation on these small microcontrollers without giving up too much in terms of performance, and we've actually taken it down to the point where interrupts can have bare-metal interrupt latencies, in work that uses TrustZone-M. On the complete other end of the spectrum, we've also looked at relatively large systems that are focused on
the edge. We see these as essentially our traditional data centers, requiring multi-tenancy, requiring performance, all of that, but shrunk down to much smaller facilities that can fit into a network edge: that might be a set of racks, a closet's worth of equipment, or just a couple of servers. So density becomes very important: we have many tenants, and they all need to fit onto the system. Additionally, things like 5G provide one-millisecond round-trip latency, so we really want the software infrastructure of the system, in spite of this density, to have very strong latency properties.
So we looked at what we might do to implement this type of system. If we use processes: well, processes don't provide that much isolation because of the full UNIX API, but they are relatively scalable, we can get good density with them, and they provide decent startup time. The thing is, we want to provide isolation for each client in the system, so we require the creation of a new protection domain for every client coming in, and startup time becomes really important; you see these constraints a lot in serverless computing. And for high-performance networking, you can't really use DPDK and scale up to high density because of the isolation requirements, so you're stuck with what Linux gives you, which is very fast, but maybe not as fast as kernel bypass. Containers and VMs, on the other hand, provide stronger isolation but are not as scalable and have bad startup times. So we've implemented a system called EdgeOS in Composite that provides an abstraction of featherweight processes (FWPs) that attempts to provide all of these properties. The goal is to process packets at 10 to 100 gigabits per second without sacrificing isolation: every client that connects to the system should be completely isolated in separate protection domains.
We want the startup of these client computations to be very, very fast, and we want to scale up to thousands of computations per host. This is a high-level diagram of what we've done in Composite. We've implemented components that provide things like DPDK for network access; a manager that controls the FWPs, creating and recycling them very quickly; and the FWPs can be composed into chains of computations (if you're familiar with network function virtualization, that might sound familiar). Our scheduler is optimized to deal with the high density, fast activation, and controlled latency, and all of this is implemented on top of Composite. What we can see here is the startup time of these different technologies; look even at just fork and exec within Linux, right?
EdgeOS allows FWPs, because we've so optimized how we create them, to start about 20 times faster, and with our checkpoint-and-restore support we can activate them even faster in response to new client requests coming in. Responding to a client request in 6.2 microseconds is around the level we want to be at for these high-density systems. And because these are so specialized, we support higher throughput than Linux for things like TLS endpoints or proxies, and all of this despite EdgeOS using message passing throughout: we provide no shared memory between the different FWPs, so there is very strong security in the system. So, I've talked about the system scaling all the way down to microcontrollers, and now about a relatively large system on the edge focused on security, density, multi-tenancy, and performance. You can start to see how this component-based design
can actually span the gamut between all these different systems and hopefully satisfy a lot of future research, and hopefully serve as a foundation for future systems. If you have any questions or comments, please ask. This is all open-source software, so you can find it all, and you can find all the papers on my web page. I will be looking for new PhD students probably in about a year, so if you're interested in this work, send me an email, let me know that you saw this presentation, and we can chat. Thank you so much; I really appreciate it.
Thanks for asking great questions and trying to identify where the key aspects are here. We definitely have a complicated policy for managing stacks at user level, and that's something we can redefine in components; we do complicated analyses to decide, dynamically, how many stacks are needed. So definitely do read the papers we referred to, but that is a key policy that we've moved up to user level. Yeah, thank you for all those questions. Am I missing the current questions? I could check. Oh, just thanks. Cool. I do want to say we were highly inspired by a lot of what NOVA has done, and we've been trying to chase NOVA's IPC performance for a long time, and I need to do the measurements again to see if we've