
NOVA Microhypervisor Feature Update


Formal Metadata

Title
NOVA Microhypervisor Feature Update
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
NOVA is a modern open-source microhypervisor that can host unmodified guest operating systems next to critical host applications. After adding support for multiple instruction sets (ARMv8-A and x86_64), NOVA's code base has been restructured to share as much code between architectures as possible. I will give an overview of the new abstractions that make NOVA fit for the next decade and discuss how advanced features, such as boot-time relocation and code patching, multiple resource spaces, support for suspend/resume, cache allocation technology, control-flow protection and multi-key total memory encryption have increased NOVA's flexibility, security and performance even further.
Transcript: English(auto-generated)
All right, so we move on to our next talk. We have Udo here with the Nova micro-hypervisor updates. Udo, please. Thank you, Arson. Good morning, everybody. Welcome to my talk at FOSDEM. It's good to be back here after three years.
The last time I presented at FOSDEM, I gave a talk about the Nova micro-hypervisor on ARMv8. And this talk will cover the things that happened in our ecosystem since then. So just a brief overview of the agenda. For all of those who might not be familiar with Nova,
I'll give a very brief architecture overview and explain the Nova building blocks. Then we look at all the recent innovations that happened in the last three years. I'll talk a bit about the code unification between ARM and x86, the two architectures that we support at this point. And then I'll spend the majority of the talk going into details into all the advanced security features,
particular in x86, that we added to Nova recently. And towards the end, I'll talk a little bit about performance and hopefully we'll have some time for questions. So the architecture in Nova is similar to the microkernel-based systems that you've seen before.
At the bottom, we have a kernel, which is not just a microkernel, it's actually a micro-hypervisor called the Nova micro-hypervisor. And on top of it, we have this component-based multi-server user mode environment. Genode would be one instantiation of it, and Martin has explained that most microkernel-based systems have this structure.
In our case, the host OS consists of all these colorful boxes. We have a master controller, which is sort of the init process, which manages all the resources that the micro-hypervisor does not need for itself. We have a bunch of drivers; all the device drivers run in user mode, which is deprivileged.
We have a platform manager, which primarily deals with resource enumeration and power management. You can run arbitrary host applications, many of them. And there's a bunch of multiplexers, like a UART multiplexer, so that everybody can get a serial console and you have a single interface to it,
or a network multiplexer, which acts as some sort of virtual switch. And virtualization is provided by virtual machine monitors, which are also user mode applications. And we have this special configuration, or this special design principle that every virtual machine uses its own instance of a virtual machine monitor.
They don't all have to be the same. For example, if you run a unikernel in a VM, as shown to the far right, the virtual machine monitor could be much smaller, because it doesn't need to deal with all the complexity that you would find in an OS like Linux or Windows. So the entire host OS,
consisting of the Nova micro-hypervisor and the user mode portion on top of it, is what BedRock calls the ultravisor, which is a product that we ship. And once you have a virtualization layer that is very small, very secure, and basically sits outside the guest operating system,
you can build interesting features, like virtual machine introspection, or virtualization assisted security, which uses features like nested paging, breakpoints, and page table overwrites to harden the security of the guest operating systems, like protecting critical data structures,
introspecting memory, and also features in the virtual switch for doing access control between the different virtual machines and the outside world, as to who can send what types of packets. And all of that is another product which is called ultra security.
The whole stack, not just the kernel, the whole stack is undergoing rigorous formal verification. And one of the properties that this formal verification effort is proving is what we call the bare metal property. And the bare metal property basically says that combining all these virtual machines
on a single hypervisor has the same behavior as if you were running these as separate physical machines connected by a real ethernet switch, so that whatever happens in a virtual machine could have happened on a real physical machine that was not virtualized. That's what the bare metal property says.
So the building blocks of Nova are those that you would find in an ordinary microkernel. It's basically address spaces, threads, and IPC. And in Nova, address spaces are called protection domains, or PD. And threads, or virtual CPUs,
are called execution contexts, short EC. And for those of you who don't know Nova very well, I've just given a very brief introductory slide for how all these mechanisms interact. So let's say you have two protection domains, PD A and PD B. Each of them has one or more threads inside. And obviously, at some point,
you want to intentionally cross these protection domain boundaries, because these components somehow need to communicate. And that's what IPC is for. So assume that this client thread wants to send a message to the server thread. It has a user thread control block, a UTCB, which is like a message box. It puts the message in there and invokes a call, an IPC call, to the hypervisor.
It vectors through a portal, which routes that IPC to the server protection domain, and then the server receives the message in its UTCB. As part of this control and data transfer, the scheduling context, which is a time slice coupled with a priority, is donated to the other side. And as you can see on the right,
that's the situation after the IPC call has gone through. So now the server is executing on the scheduling context of the client. The server computes a reply, puts it in its UTCB, issues a hypercall called IPC reply, and the data goes back, the reply goes back to the client. The scheduling context donation is reverted,
and the client gets its time slice back. So what you get with that is very fast synchronous IPC with time donation and priority inheritance. And it's very fast because there's no scheduling decision on that path. Also, Nova is a capability-based microkernel or hypervisor, which means all operations
that user components do with the kernel have capabilities as parameters. And capabilities have the nice property that they both name a resource and at the same time have to convey what access you have on that resource. So it's a very powerful access control primitive.
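To make the call path concrete, here is a minimal client-side sketch. It is illustrative only: the selector type, the UTCB layout and the sys_ipc_call stub are hypothetical stand-ins, not the literal NOVA ABI.

```cpp
#include <cstdint>

using Sel = uint64_t;          // capability selector: names a resource and the rights on it

struct Utcb {
    uint64_t msg[64];          // message words exchanged during IPC
};

// Assumed syscall stub: traps into the hypervisor, donating our scheduling
// context to the server thread behind 'portal' until it replies.
extern "C" void sys_ipc_call(Sel portal, unsigned words);

uint64_t query_server(Utcb &utcb, Sel portal) {
    utcb.msg[0] = 42;              // put the request into the message box (UTCB)
    sys_ipc_call(portal, 1);       // vector through the portal; no scheduling decision
    return utcb.msg[0];            // the server's reply was copied back into our UTCB
}
```

The portal selector both names the endpoint and conveys the permission to call it; without that capability, the hypercall fails.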
So that said, let's look at all the things that happened in Nova over the last two and a half or so years. And we are now on a release cadence where we put out a new release of Nova approximately every two months. So it's always the year and the week of the year where we do releases.
And this shows what we added in Nova in 21, 22, and what we'll add to the first release of this year at the end of this month. So we started out at the beginning of 21 by unifying the code base between x86 and ARM, making the load address flexible, adding power management like suspend resume,
then extended that support to ARM. And later in 22, when that unification was complete, we started adding a lot of, let's say advanced security features in x86, like control flow enforcement, code patching, cache allocation technology, multiple spaces, multi-key total memory encryption,
and recently we've added some APIC virtualization. So the difference between the things that are listed in bold here and those that are not listed in bold, everything in bold I'll try to cover in this talk, which is a lot, so hopefully you'll have enough time to go through all this.
First of all, the design goals that we have in Nova, and Martin already mentioned that not all microkernels have the same design goals. Our design goal is that we want to provide the same or at least similar functionality across all architectures, which means the API is designed in such a way that it abstracts from architectural differences
as much as possible, that you get a uniform experience, whether you're on x86 and ARM, you can create a thread, and you don't have to worry about details of instruction set, register set, page table format. Nova tries to abstract all of that away. You wanna have really simple build infrastructure,
and we'll see in a moment what the directory layout looks like, but suffice it to say that you can build Nova with a very simple make command, where you say make architecture equals x86 or ARM, and in some cases, board equals, I don't know, Raspberry Pi or NXP i.MX8, whatever,
and it runs for maybe five seconds, and then you get a binary. We use standardized processes, like the standardized boot process and standardized resource enumeration as much as possible, because that allows for a great reuse of code, so we use multi-boot version two or one,
and UEFI for booting, we use ACPI for resource enumeration. We can also use the FDT, but that's more of a fallback, and for ARM, there's this interface called PSCI for power state coordination that's also abstracting this functionality across many different ARM boards,
so we try to use these interfaces as much as possible. The code is designed in such a way that it is formally verifiable, and in our particular case, that means formally verifying highly concurrent C++ code, not C code, not assembler code, but C++ code, and even weakly-ordered memory,
because ARMv8 is weak memory, and obviously, we want Nova to be modern, small, and fast, best-in-class security and performance, and we'll see how we did on that. So first, let me talk about the code structure, and Martin mentioned in his talk this morning
that using directories to your advantage can really help, so on the right, you see the directory structure that we have in the unified Nova code base. We have a generic include directory and a generic source directory. Those are the ones listed in green, and then we have architecture-specific subdirectories
for aarch64 and x86_64, and we have architecture-specific build directories. There's also a doc directory in which you will find the Nova interface specification, and there's a single unified makefile. And when we looked at the source code and discussed it with our formal methods engineers, we recognized that basically all the functions
can be categorized into three different buckets. The first one is what we call the same API and same implementation. This is totally generic code. All the system calls are totally generic code. All the memory allocators are totally generic code. Surprisingly, even page tables can be totally generic code.
So these can all share the source files, the header files, and the spec files, which basically describe the interface pre and post conditions. The second bucket is functions that have the same API, but maybe a different implementation, and an example of that would be a timer, where the API could be set a deadline for when a timer interrupt should fire,
so the API for all callers is the same, so you can potentially share the header or the spec file, but the implementation might be different on each architecture, or it's very likely different. And the final bucket is those functions that have a different API and implementation and you can't share anything. So the code structure is such
that architecture specific code lives in the architecture specific subdirectories, and generic code lives in the sort of parent directories of that, and whenever you have an architecture specific file with the same name as a generic file, the architecture specific file takes precedence and basically overwrites or shadows the generic file,
and that makes it very easy to move files from architecture specific to generic and back. So the unified code base that we ended up with, and these are the numbers from the very recent upcoming release, 2308, which will come out at the end of this month,
shows sort of what we ended up with in terms of architecture-specific versus generic code. So in the middle, the green part is the generic code that's shared between all architectures, and it's 4,300 lines today. x86 adds 7,000 and some lines of specific code, and ARM to the right adds some 5,600 lines.
So if you sum that up, for x86 it's roughly 11,500 lines, and for ARM it's less than 10,000 lines of code, so it's very small, and if you look at it, ballpark, 40% of the code for each architecture is generic and shareable,
and that's really great, not just from a maintainability perspective, but also from a verifiability perspective because you have to specify and verify those generic portions only once. If you compile that into binaries, then the resulting binaries are also very small, like a little less than 70K in code size,
and obviously, if you use a different compiler version or different Nova version, these numbers will slightly differ, but it gives you an idea of how small the code base and how small the binaries will be. So let's look at some interesting aspects of the architecture,
because assume you've downloaded Nova, you've built such a small binary from source code, and now you want to boot it, and typical boot procedure, both on x86 and ARM, which are converging towards using UEFI as firmware, will basically have this structure where UEFI firmware runs first and then invokes some bootloader,
passing some information, like an image handle and a system table, and then the bootloader runs and invokes a Nova micro-hypervisor, passing also the image handle and the system table, maybe adding multi-boot information, and at some point, there will have to be a platform handover
of all the hardware from firmware to the operating system, in our case, Nova, and this handover point is called exit boot services. It's basically the very last function that you call as either a bootloader or a kernel in firmware, and that's the point where firmware stops accessing all the hardware, and the ownership of the hardware
basically transitions over to the kernel, and the unfortunate situation is that as you call exit boot services, firmware, which may have enabled the IOMMU or SMMU at boot time to protect against DMA attacks, drops that protection at this point, which sounds kind of silly, but that's what happens, and the reason, if you ask those who are familiar with UEFI,
is for legacy OS support, because UEFI assumes that maybe the next stage is a legacy OS which can't deal with DMA protection, so it gets turned off, which is really unfortunate, because between the point where you call exit boot services to take over the platform hardware,
and the point where Nova can actually enable the IOMMU, there's this window of opportunity, shown in red here, where there are no DMA protections. It's very small, maybe a few nanoseconds or microseconds, but during that window an attacker could perform a DMA attack, and for that reason,
Nova takes complete control of the exit boot services flow, so it's not the boot loader who calls exit boot services. Nova actually drives the UEFI infrastructure, and it disables all bus master activity before calling exit boot services so that we eliminate this window of opportunity.
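The flow might look roughly like this; a minimal sketch, assuming hypothetical pci_read16/pci_write16 config-space accessors and helper wrappers (ExitBootServices and its memory-map-key requirement are standard UEFI; EFI_HANDLE, EFI_BOOT_SERVICES and UINTN come from the UEFI headers, the rest is illustrative):

```cpp
#include <cstdint>

// Assumed helpers (illustrative): raw PCI config space accessors, a wrapper
// that fetches the current UEFI memory map key, and the IOMMU bring-up.
uint16_t pci_read16(unsigned bdf, unsigned reg);
void     pci_write16(unsigned bdf, unsigned reg, uint16_t val);
UINTN    current_memory_map_key(EFI_BOOT_SERVICES *bs);
void     enable_iommu();

constexpr unsigned PCI_COMMAND = 0x04;       // command register offset
constexpr uint16_t PCI_CMD_BME = 1u << 2;    // Bus Master Enable bit

// Quiesce all PCI bus masters so no device can issue DMA anymore.
void quiesce_bus_masters() {
    for (unsigned bdf = 0; bdf < 0x10000; bdf++) {          // every bus/device/function
        uint16_t cmd = pci_read16(bdf, PCI_COMMAND);
        if (cmd & PCI_CMD_BME)
            pci_write16(bdf, PCI_COMMAND, cmd & ~PCI_CMD_BME);
    }
}

void take_over_platform(EFI_HANDLE image, EFI_BOOT_SERVICES *bs) {
    quiesce_bus_masters();                     // close the DMA window first
    UINTN key = current_memory_map_key(bs);    // assumed wrapper around GetMemoryMap()
    bs->ExitBootServices(image, key);          // firmware hands the hardware over
    enable_iommu();                            // assumed: DMA protection back on
}
```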
So that was a very aggressive change in Nova because it means Nova has to comprehend UEFI. The next thing that we added was a flexible load address, so when the boot loader wants to put a binary into physical memory, it invokes it with paging being disabled, which means you have to load it at some physical address,
and you can define an arbitrary physical address, but it would be good if whatever physical address you define worked on all the boards, and that is simply impossible, especially in the ARM ecosystem. So on ARM, some platforms have the DRAM starting at physical address zero, some have MMIO starting at address zero, so you will not find a single physical address range
that works across all ARM platforms where you can say always load Nova at two megabytes, one gigabyte, whatever. So we made the load address flexible. Also, the boot loader might want to move Nova to a dedicated point in memory like at the very top so that the bottom portion can be given one-to-one
to a VM. So the load address is now flexible for Nova, not fully flexible, but you can move Nova up or down by arbitrary multiples of two megabytes, so at superpage boundaries. And the interesting insight into this is that for pulling this off, there's no relocation complexity required.
Nova consists of two sections, a very small init section, which is identity-mapped, which means virtual addresses equal physical addresses, and that's the code that initializes the platform up to the point where you can enable paging. And then there's a runtime section which runs paged, so it has virtual to physical memory mappings.
And for those virtual to physical memory mappings, if you run this paging enabled, the physical addresses that back these virtual memory ranges simply don't matter. So paging is basically some form of relocation. You only need to deal with relocation for the init section, and you can solve that by making the init section be position independent code.
And it's assembler anyway, so making that position independent is not hard. We actually didn't make the code just position independent. It is also mode independent, which means no matter if UEFI starts you in 32-bit mode or 64-bit mode, that code is dealing with all these situations.
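A minimal sketch of the "paging is relocation" idea (hypothetical symbol names, not NOVA's actual linker script): the runtime section is linked at a fixed virtual address, and only the mapping to the 2 MiB-aligned physical load address changes.

```cpp
#include <cstddef>
#include <cstdint>

constexpr uintptr_t LINK_ADDR = 0xffffffff80000000;   // hypothetical link-time virtual address
constexpr size_t    SUPERPAGE = 2 * 1024 * 1024;      // 2 MiB superpage

void map_superpage(uintptr_t virt, uintptr_t phys);   // assumed page table helper

// Map the paged runtime section wherever the bootloader placed it. Because
// all code in this section runs with paging enabled, the physical base is
// irrelevant to it; this loop is the entire "relocation".
void map_runtime_section(uintptr_t phys_base, size_t size) {
    for (size_t off = 0; off < size; off += SUPERPAGE)
        map_superpage(LINK_ADDR + off, phys_base + off);
}
```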
There's an artificial limit: you still have to load Nova below four gigabytes, because multi-boot has been defined in such a way that you can't express addresses above four gigabytes, since some of these structures are still 32-bit, and that little emoticon expresses what we think of that.
So then after we had figured this out, we wanted to do some power management, and this is an overview of all the power management that ACPI defines. So ACPI defines a few global states like working, sleeping, and off. Those aren't all that interesting.
The really interesting states are the sleep states, and the thing that has this black bold border around it is the state in which the system is when it's fully up and running, no idling, no sleeping, no nothing. It's called the S0 working state, and then there are some sleep states. You might know suspend to RAM, suspend to disk, and soft off,
and when you're in the S0 working state, you can have a bunch of idle states, and in the C0 idle state, you can have a bunch of performance states, which roughly correspond to voltage and frequency scaling, so ramping the clock speed up and down.
but I wanna still say a few words about this. We implemented suspend resume on both x86 and ARM, and there's two ways you can go about it. One which is, I would say, a brute force approach, and the other which is the smart approach,
and the brute force approach basically goes like you look at all the devices that lose their state during a suspend resume transition, and you save their entire register state, and that's a significant amount of state that you have to manage, and it may even be impossible to manage it because if you have devices with hidden internal state, you may not be able to get at it,
or if the device has a hidden internal state machine, you may not know what the internal state of that device is at that point. So it may be suitable for some generic devices like if you wanted to save the configuration space of every PCI device, that's generic enough that you could do that, but for some interrupt controllers or SMMUs
with internal state, that's not smart. So for that, you can actually use the second approach, which Nova uses: you save a high-level configuration, and you initialize the device based on that. So as an example, say you had an interrupt routed to core zero in edge-triggered mode.
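As a sketch (hypothetical types and helper, not NOVA's actual code), the saved high-level state for that example could be as small as this, and resume just replays the normal initialization path:

```cpp
// A few bits of high-level configuration per interrupt; the entire register
// state of the interrupt controller can be reconstructed from this.
struct IntrConfig {
    unsigned gsi;        // global system interrupt number
    unsigned core;       // CPU the interrupt is routed to
    bool     edge;       // edge- vs. level-triggered
    bool     masked;
};

// Assumed helper: the same routine used at boot to program the controller.
void ioapic_configure(unsigned gsi, unsigned core, bool edge, bool masked);

void resume_interrupts(IntrConfig const *cfg, unsigned count) {
    for (unsigned i = 0; i < count; i++)
        // rebuilds redirection entries, trigger modes and routing
        // from the high-level description alone
        ioapic_configure(cfg[i].gsi, cfg[i].core, cfg[i].edge, cfg[i].masked);
}
```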
You would save that as high-level information, and that's sufficient to reinitialize all the interrupt controllers, all the redirection entries, all the trigger modes based on just this bit of information. So there's a lot less information to maintain. Saving becomes basically a no-op. Restoring can actually use the same code path
that you used to initially bring up that particular device. And that's the approach for all the interrupt controllers, all the SMMUs, all the devices managed by Nova. The next thing I wanna briefly talk about is P states, performance states, which are these gears for ramping up the clock speed on x86.
And Nova can now deal with all these P states. The interesting aspect is that most modern x86 processors have something called turbo mode. And turbo mode allows one or more processors to exceed the nominal clock speed to actually turbo up higher
if other cores are idle. So if other cores are not using their thermal or power headroom, a selected set of cores, maybe just one core, maybe a few other cores can actually turbo up many bins. And this is shown here on active core zero, which basically gets the thermal headroom of core one, core two, and core three to clock up higher.
So Nova will exploit that feature when it's available. But there are situations where you want predictable performance, where you want every core to run at its guaranteed high frequency mode. And there's a command line parameter that you can set that basically clamps the maximum speed to the guaranteed frequency.
You could also lower the frequency to something less than the guaranteed frequency. There's a point, an operating point, it's called maximum efficiency. And there's even points below that where you can clock really high, but then it's actually less efficient than this point. So all of that is also supported.
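On Intel hardware these operating points are enumerated, for example, in the IA32_HWP_CAPABILITIES MSR; a sketch of reading them, with rdmsr() as an assumed helper:

```cpp
#include <cstdint>

uint64_t rdmsr(unsigned msr);   // assumed MSR read helper

constexpr unsigned IA32_HWP_CAPABILITIES = 0x771;

struct PerfLevels {
    uint8_t highest;          // bits  7:0  highest (turbo) performance
    uint8_t guaranteed;       // bits 15:8  guaranteed performance
    uint8_t most_efficient;   // bits 23:16 most efficient performance
    uint8_t lowest;           // bits 31:24 lowest performance
};

PerfLevels read_perf_levels() {
    uint64_t v = rdmsr(IA32_HWP_CAPABILITIES);
    return { uint8_t(v), uint8_t(v >> 8), uint8_t(v >> 16), uint8_t(v >> 24) };
}
// Clamping the maximum to 'guaranteed', as the command line parameter does,
// trades peak turbo speed for predictable performance.
```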
So as an overview, from a feature comparison perspective, ARM versus x86, we support P states on x86, not on ARM, because there's no generic interface on ARM yet. We support all the S states on x86, like stop clock, suspend resume, hibernation, power off, platform reset.
On ARM, there's no such concept, but we also support suspend resume and suspend to disk if it's supported. And what does that mean, if it's supported? It means platform firmware like PSCI implements it, and there are some features that are mandatory
and some features that are optional. So suspend resume, for example, works great on the NXP i.MX8M that Stefan had for his demo. It doesn't work so great on the Raspberry Pi, because the firmware simply has no support for jumping back to the operating system after a suspend. So it's not a Nova limitation. There's a new suspend feature called low power idle,
which we don't support yet because it requires way more support than just Nova. It basically requires powering down the GPU, powering down all the devices, powering down all the links. So this is a concerted platform effort. But from a hyper-call perspective, the hyper-call that you would invoke
to transition the platform to a sleep state is called control hardware. And whenever you try to invoke it with something that's not supported, it returns bad feature. And for the hyper-calls that assign devices or interrupts, the state that the system had when you assign devices or interrupts to particular domains
will completely be preserved across the suspend resume calls using this save-the-high-level-state approach. So next I'll talk about some radical API change that we made, and being a microkernel and not being Linux, we don't have to remain backward compatible. So that's one of these major API changes
that took quite a lot of time to implement. What we had in the past was basically an interface with five kernel objects: protection domains, execution contexts, scheduling contexts, portals and semaphores. And every protection domain looked as shown on this slide,
it actually had six resource spaces built into it. An object space which hosts capabilities to all the kernel objects that you have access to, a host space which represents the stage one page table, a guest space which represents the stage two guest page table, DMA space for memory transactions
that are remapped by the IOMMU, a port I/O space and an MSR space. So all of these existed in one single instance in every protection domain, and when you created a host EC, a guest EC like a virtual CPU, or a device, it was automatically bound to the PD, picking up the spaces that it needed.
And that is, that worked great for us for more than 10 years but it turned out to be suboptimal for some more advanced use cases like nested virtualization. If you run a hypervisor inside a virtual machine and that hypervisor creates multiple guests itself, then you suddenly need more than one guest space. You need one guest space per subguest.
So you need multiple of these yellow guest spaces. Or when you virtualize the SMMU and the SMMU has multiple contexts and every context has its own page table, then you suddenly need more than one DMA space. So you need more of these blue boxes and the same can be said for PortIO and MSR spaces.
So how do we get more than one if the protection domain has all these singletons? So what we did, and it was quite a major API and internal reshuffling, is we separated these spaces from the protection domain. They are now new first-class objects. So Nova just got six new kernel objects
that you get, when you create them, you get individual capabilities for them and you can manage them independently from the protection domain. So the way that this works is first you create a protection domain with create_pd, then you create one or more of these spaces,
again with create_pd, so that's a sub-function of create_pd, and then you create an EC, like a host EC, and it binds to those spaces that are relevant for a host EC. So a host EC, that is, an ordinary host thread, needs capabilities, so it needs an object space, it binds to that, it needs a stage one page table, so it binds to that, and it needs access to ports, so it binds to that
on x86 only, because on ARM there's no such thing. So for host thread, all these assignments are static. We could make them flexible, but we have not found a need. Gets more interesting for guest EC, which is a virtual CPU that runs in a guest. So again, the sequence is the same. You first create a protection domain,
then you create one or more of these spaces, and when you create the virtual CPU, it binds to those spaces that it urgently needs, which is the object space and the host space. It does not yet bind to any of the flexible spaces shown to the right, and that binding is established on the startup IPC during IPC reply.
You pass selectors, capability selectors, to these spaces that you wanna attach to, and then you flexibly bind to those spaces as denoted by these dashed lines. And that assignment can be changed on every event. So every time you take a VM exit, Nova synthesizes an exception IPC or architecture IPC,
sends it to the VMM for handling, and when the VMM replies, it can set a bit in the message transfer descriptor to say, I wanna change the space assignment. It passes new selectors, and then you can flexibly switch between those spaces. And that allows us to implement, for example, nested virtualization.
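Put together, the sequence might look like this; the call names and signatures are illustrative stand-ins for the actual hypercalls, not the literal NOVA API:

```cpp
#include <cstdint>

using Sel = uint64_t;   // capability selector

// Assumed wrappers around the create_pd hypercall family (illustrative only).
Sel create_pd(Sel parent);
Sel create_obj_space(Sel pd);
Sel create_hst_space(Sel pd);
Sel create_gst_space(Sel pd);
Sel create_vcpu(Sel pd, Sel obj, Sel hst);
void reply_with_spaces(Sel vcpu, Sel gst);

void setup_nested_vm(Sel root) {
    Sel pd   = create_pd(root);           // the protection domain itself
    Sel obj  = create_obj_space(pd);      // capability space
    Sel hst  = create_hst_space(pd);      // stage one (host) page table
    Sel gst0 = create_gst_space(pd);      // stage two for the outer guest
    Sel gst1 = create_gst_space(pd);      // stage two for a nested sub-guest
    (void)gst0;

    Sel vcpu = create_vcpu(pd, obj, hst); // binds only the mandatory spaces

    // On the startup (or any VM exit) reply, the VMM passes new selectors and
    // sets a bit in the message transfer descriptor; handing over gst1 here
    // switches the vCPU into the nested guest's address space.
    reply_with_spaces(vcpu, gst1);
}
```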
The same for a device, which on x86 is represented by a bus/device/function, or on ARM is represented by a stream ID: the assign_dev hypercall can flexibly rebind the device to a DMA space at any time. So that took quite a while to implement,
but it gives us so much more flexibility, and I heard that some of the Nova forks have come across the same problem, so maybe that's something that could work for you too. So let's talk about page tables, and I mentioned earlier that page tables are actually generic code, which is somewhat surprising.
Nova manages three page tables per architecture, the stage one, which is the host page table, the stage two, which is the guest page table, and a DMA page table, which is used by the IOMMU, and these correspond to the three memory spaces that I showed in the previous slide. And the way we made this page table code architecture independent is by using a template base class,
which is completely lock-less, so it's very scalable, and the reason why it can be lock-less is because the MMU doesn't honor any software locks anyway, so if you put a lock around your page table infrastructure, the MMU wouldn't know anything about those locks. So it has to be written in a way that it does atomic transformations anyway,
so that the MMU never sees an inconsistent state. And once you have this, there's also no need to put a lock around it for any software updates, so that's completely lock-free. And that architecture independent base class deals with all the complexities of allocating and deallocating page tables, splitting super pages into page tables,
or overmapping page tables with super pages, and you can derive architecture-specific subclasses from it, and the subclasses themselves inject themselves as a parameter to the base class that's called the Curiously Recurring Template pattern. And the subclasses then do the transformation
between the high-level attributes, like this page is readable, writable, user-accessible, whatever, into the individual bits and encodings of the page table entries as that architecture needs them. And also there are some coherency requirements on ARM and some coherency requirements for SMMUs that don't snoop the caches,
so these architecture-specific subclasses deal with all that complexity. But it allows us to share the page table class and to specify and verify it only once. So let's look at page tables in a little bit more detail because there's some interesting stuff you need to do on ARM. So most of you who've been in an OS class
or who've programmed a microcontroller will have come across this page table format, where an input address, like a host virtual or guest physical address, is split up into an offset portion into the final page, 12 bits, and then you have nine bits indexing into the individual levels of the page table.
So when a virtual address is transformed by the MMU into a physical address, the MMU first uses bits 30 to 38 to index into the level two page table to find the level one, and then to find the level zero. And the walk can terminate early. You can have a leaf page at any level,
so it gives you one-gigabyte or two-megabyte superpages, or 4K pages. And with that, a page table structure like this, three levels, you can create an address space of 512 gigabytes in size. And that should be good enough. But it turns out we came across several ARM platforms which have an address space size of one terabyte.
So twice that: they need one extra bit, which you can't represent with 39 bits, so you have a 40-bit address space. So what would you do if you were designing a chip? You would expect that it would just open up a new level here and that you'd get a four-level page table.
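For reference, here's a sketch of that conventional indexing, where each level simply consumes another nine bits; keep it in mind for the ARM twist that follows:

```cpp
#include <cstdint>

// 4K pages: 12 offset bits, then 9 index bits per level.
unsigned pt_index(uint64_t addr, unsigned level) {
    return (addr >> (12 + 9 * level)) & 0x1ff;
}
// pt_index(addr, 2) -> bits 38:30, the level two table
// pt_index(addr, 1) -> bits 29:21, the level one table
// pt_index(addr, 0) -> bits 20:12, the level zero table
// Three levels cover 12 + 3 * 9 = 39 bits, i.e. 512 GiB.
```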
But ARM decided differently, because they said if I just add one bit, the level three page table would have just two entries. And that's not worth building basically another level for. So what they did is they came up with a concept called a concatenated page table, and it makes the level two page table twice as large
by adding another bit at the top. So now suddenly the level two page table has 10 bits of indexing, and the backing page table has 1,024 entries and is 8K in size. And this concept was extended, so if you go to a 41-bit address space, again, you get one additional bit and the page table gets larger,
and this keeps going on. It can extend up to four extra bits, at which point the level two page table is 64K in size. And there's no way around it. The only time at which you can actually open up level three is when you exceed 44 bits. And then, once you exceed 44 bits,
you can go to a four-level page table, and it looks like this. So the functionality that we also had to add to Nova is to comprehend this concatenated page table format so that we can deal with arbitrary address space sizes on ARM. And we actually had a device, I think it was a Xilinx ZCU102,
which had something mapped above 512 gigabytes and just below one terabyte, and you can't pass that through to a guest if you don't have concatenated page tables. So the generic page table class we have right now is so flexible that it can basically do what's shown on this slide. And the simple case is x86.
You have three-level, four-level, or five-level page tables with a uniform structure of nine bits per level and 12 offset bits. Three levels, 39 bits, isn't used by the MMU, but might be used by the IOMMU. And the MMU typically uses four levels, and in high-end boxes like servers, five levels for 57 bits. On ARM, depending on what type of SoC you have,
it has something between 32 and up to 52 physical address bits. And the table shows the page table level split, the indexing split that Nova has to do. And all these colored boxes are basically instances
of concatenated page tables. So 42 would require three bits to be concatenated. Here we have four, here we have one, here we have two. So we really have to exercise all of those, and we support all of those. And unlike the past, where Nova set up page tables as so many levels of so many bits each,
we have now turned this around by saying the page table covers so many bits, and we can compute the number of bits per level and the concatenation at the top level automatically in the code. So that was another fairly invasive change. While we were re-architecting all the page tables,
we took advantage of a new feature that Intel added to Ice Lake servers and to all the desktop platforms, which is called total memory encryption with multiple keys. And what Intel did there is they repurposed certain bits of the physical address in the page table entry,
the top bits shown here as key ID bits. And so it's stealing some bits from the physical address and the key ID bits index into a key programming table shown here that basically select a slot. And let's say you have four key bits
that gives you 16 keys, two to the power of four. So your key indexing or your key programming table would have the opportunity to program 16 different keys. We've also come across platforms that have six bits. It's basically flexible how many bits are stolen from the physical address, can vary per platform
depending on how many keys are supported. And those keys are used by a component called the memory encryption engine. The memory encryption engine sits at the perimeter of the package or the socket, basically at the boundary where data leaves the chip that you plug in the socket
and enters the interconnect and enters RAM. So inside this green area, which is inside the SOC, everything is unencrypted in the cores, in the caches, in the internal data structure, but as it leaves the die and moves out to the interconnect, it gets encrypted automatically by the memory encryption engine with the key.
And this example shows a separate key being used for each virtual machine, which is a typical use case, but it's actually much more flexible than that. You can select the key on a per-page basis. So you could even say, if there was a need for these two VMs to share some memory, that some blue pages would appear here and some yellow pages would appear here.
That's possible. So we added support in the page tables for encoding these key ID bits. We added support for using the pconfig instruction for programming keys into the memory encryption engine. And the keys can come in two forms. You can either randomly generate them, in which case Nova will also drive
the digital random number generator to generate entropy, or you can program tenant keys. So you can say, I want to use this particular AES key for encrypting the memory. And that's useful for things like VM migration, where you want to take an encrypted VM and move it from one machine to another.
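A sketch of how the key ID rides along in a page table entry, assuming a hypothetical platform with 46 physical address bits of which 6 are repurposed as key ID bits:

```cpp
#include <cstdint>

constexpr unsigned PA_BITS     = 46;                   // platform physical address width
constexpr unsigned KEYID_BITS  = 6;                    // stolen from the top of the PA
constexpr unsigned KEYID_SHIFT = PA_BITS - KEYID_BITS; // where the key ID field starts

// Encode a mapping that selects MKTME key 'keyid' for this page.
uint64_t encode_pte(uint64_t phys, unsigned keyid, uint64_t attr) {
    return phys | (uint64_t(keyid) << KEYID_SHIFT) | attr;
}
// With one key per VM, all of a VM's stage two entries carry its key ID, and
// the memory encryption engine en-/decrypts transparently on the way to DRAM.
```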
And the reason why Intel introduced this feature is for confidential computing, but also because DRAM is slowly moving towards non-volatile RAM, and an offline evil maid attack, where somebody unplugs your RAM or takes your non-volatile RAM and then looks at it in another computer,
is a big problem. And they can still unplug your RAM, but they would only see ciphertext. So this was more of a confidentiality improvement. The next thing we looked at is improving the availability. And we added some support
for dealing with noisy neighbor domains. So what are noisy neighbor domains? Let's say you have a quad core system as shown on this slide. And you have a bunch of virtual machines as shown at the top. On some cores, you may over provision the cores, run more than one VM, like on core zero and core one.
For some use cases, you might want to run a single VM on a core only, like a real-time VM, which is exclusively assigned to core two. But then on some cores, like shown on the far right, you may have a VM that's somewhat misbehaving. And somewhat misbehaving means it uses excessive amounts of memory
and basically evicts everybody else out of the cache. So if you look at the last level of cache portion here, the amount of cache that is assigned to the noisy VM is very disproportionate to the amount of cache given to the other VM, simply because this is trampling all over memory. And this is very undesirable
from a predictability perspective, especially if you have a VM like the green one that's real time, which may want to have most of its working set in the cache. So is there something we can do about it? And yes, there is. It's called CAT. CAT is Intel's acronym for Cache Allocation Technology.
And what they added in the hardware is a concept called class of service. And you can think of class of service as a number. And again, like the key ID, there's a limited number of classes of service available, like four or 16. And you can assign this class of service number to each entity that shares the cache.
So you could make it a property of a protection domain or a property of a thread. And for each of the classes of service, you can program a capacity bit mask, which says what proportion of the cache can this class of service use? Can it use 20%, 50% and even which portion?
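Concretely, on Intel hardware the capacity bitmasks and the active class of service live in MSRs; a minimal sketch with wrmsr() as an assumed helper:

```cpp
#include <cstdint>

void wrmsr(unsigned msr, uint64_t val);         // assumed MSR write helper

constexpr unsigned IA32_L3_MASK_BASE = 0xc90;   // IA32_L3_QOS_MASK_0, one MSR per class
constexpr unsigned IA32_PQR_ASSOC    = 0xc8f;   // active class of service, bits 63:32

void set_capacity_mask(unsigned clos, uint32_t ways) {
    wrmsr(IA32_L3_MASK_BASE + clos, ways);      // must be a contiguous run of 1 bits
}

void set_active_clos(unsigned clos) {
    wrmsr(IA32_PQR_ASSOC, uint64_t(clos) << 32); // not cheap: avoid on hot paths
}
```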
There are some limitations, like the bit mask must be contiguous, but they can overlap for sharing. And there's a model specific register, which is not cheap to program, where you can say this is the active class of service on this core right now. So this is something you would have to context switch to say I'm now using something else. And when you use this, it improves the predictability,
like the worst case execution time quite nicely. And that's what it was originally designed for. But it turns out it also helps tremendously with dealing with cache side channel attacks, because if you can partition your cache in such a way that your attacker doesn't allocate into the same ways
as the VM you're trying to protect, then all the flush and reload attacks simply don't work. So here's an example for how this works. And to the right, I've shown an example number of six classes of service. And a cache which has 20 ways.
And you can program, and this is again, just an example, you can program the capacity bit mask for each class of service, for example, to create full isolation. So you could say class of service zero gets 40% of the cache, ways zero to seven, and class of service one gets 20%,
and everybody else gets 10%. And these capacity bit masks don't overlap at all, which means you get zero interference through the level three cache. You could also program them to overlap. There's another mode which is called CDP, code and data prioritization,
which splits the number of classes of service in half and basically redefines the meaning of these bit masks to say those with an even number are for data, and those with an odd number are for code. So you can even discriminate how the cache is being used between code and data. It gives you more fine grain control. And the Nova API forces users to declare upfront
whether they want to use CAT or CDP to partition their cache, and only after you've made that decision can you actually configure the capacity bit masks. So with CDP, it would look like this. You get three classes of service instead of six, distinguished between D and C, data and code.
And you could, for example, say class of service one as shown on the right gets 20% of the cache for data, 30% of cache for the code, so 50% of the capacity in total, exclusively assigned to anybody who's class of service one, and the rest shares capacity bit masks.
And here you see an example of how the bit masks can overlap, and wherever they overlap, the cache capacity is being competitively shared. So that's also a new feature that we support right now. Now the question is, class of service is something you need to assign to cache sharing entities. To what type of object do you assign that?
And you could assign it to a protection domain. You could say every box on the architecture slide gets assigned a certain class of service, and the question is then what do you assign to a server that has multiple clients? It's really unfortunate, and what it also means is if you have a protection domain that spans multiple cores,
and you say I want this protection domain to use 40% of the cache, you have to program the class of service settings on all cores the same way. So it's really a loss of flexibility. So that wasn't our favorite choice. And we said, well, maybe we should assign class of service to execution contexts instead. And again, the question is what class of service
do you assign to a server execution context that does work on behalf of clients? And the actual killer argument was that you would need to set the class of service in this model-specific register again during each context switch, which is really bad for performance. So even option two is not what we went for. Instead, we made the class of service
a property of the scheduling context. And that has very nice properties. We only need to context switch it during scheduling decisions. So the cost of reprogramming that MSR is really not relevant anymore, and it extends the already existing model of time and priority donation with class of service donation.
So a server does not need to have a class of service assigned to it at all. It uses the class of service of its client. So if, let's say, your server implements some file system or so, then the amount of cache that it can use depends on whether your client can use a lot of cache or whether your client cannot use a lot of cache.
So it's a nice extension of an existing feature. And the additional benefit is that the classes of service can be programmed differently per core. So eight cores times six classes of service gives you 48 classes of service in total instead of six. So that was a feature for availability.
We also added some features for integrity. And if you look at the history, there's a long history of features being added to paging that improve the integrity of code against injection attacks. And it all started out many years ago with the 64-bit architectures,
where you could mark pages non-executable. And you could basically enforce that pages are either writable or executable, but never both. So there's no confusion between data and code. And then over the years, more features were added like supervisor mode execution prevention where if you use that feature,
kernel code can never jump into a user page and be confused as executing some user code. And then there's another feature called supervisor mode access prevention, which even says, kernel code can never, without explicitly declaring that it wants to do that, read some user data page. So all of these tighten the security
and naturally Nova supports them. There's a new one called mode-based execution control, which is only relevant for guest page tables, i.e. stage 2, and which gives you two separate execution bits. So there's no longer a single X bit; there's now executable-for-user and executable-for-supervisor.
And that is a feature the host can use for extra security: even if the guest screws up its own stage 1 page tables, the stage 2 page tables can still ensure that the Linux kernel never executes Linux user application code, because those user pages carry only the user-execute permission, not XS, in the stage 2 page tables. So it's again a feature that can tighten the security of guest operating systems from the host.
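For illustration, a minimal sketch of the two execute permissions in a stage 2 (EPT) entry; the bit positions follow the Intel SDM, but the names and the policy shown are assumptions, not NOVA's page table code:

```cpp
#include <cstdint>

// Selected EPT permission bits (Intel SDM): with mode-based execute
// control enabled, bit 2 means "executable in supervisor mode" and
// bit 10 means "executable in user mode".
enum : uint64_t {
    EPT_R  = 1ull << 0,   // read
    EPT_W  = 1ull << 1,   // write
    EPT_XS = 1ull << 2,   // supervisor-mode execute (plain X without MBEC)
    EPT_XU = 1ull << 10,  // user-mode execute (only with MBEC)
};

// Guest kernel text may only execute in supervisor mode; guest user
// text may only execute in user mode. Even if the guest's own stage 1
// page tables are compromised, the kernel can never run user pages.
constexpr uint64_t guest_kernel_text = EPT_R | EPT_XS;
constexpr uint64_t guest_user_text   = EPT_R | EPT_XU;
```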
But even with all of that, there are still opportunities for attack: these classes of attacks reuse existing code snippets and chain them together in interesting ways
using control flow hijacking; ROP attacks are the classic example. I'm not sure who's familiar with ROP attacks: basically, you craft a call stack with lots of return addresses that chain together simple code snippets, like "add this register; return", "multiply this register; return", "jump to this function; return". And by chaining them all together,
you can build programs out of existing code snippets that do what the attacker wants. You don't have to inject any code. You simply find snippets in existing code that do what you want. And this doesn't work so well on ARM. It still works on ARM, but on ARM the instruction length is fixed to four bytes. So you can't jump into the middle of instructions.
But on x86, with its variable instruction length, you can even jump into the middle of instructions and completely reinterpret what the existing code looks like. And that's quite unfortunate. So there's a feature that tightens the security around that, and it's called Control Flow Enforcement Technology, or CET.
And that feature adds integrity to the control flow graph, both to the forward edge and to the backward edge. And forward edge basically means you protect jumps or calls that jump from one location forward to somewhere else. And the way that this works
is that the legitimate jump destination, the landing pad where you want to land, must have a specific ENDBRANCH instruction placed there. And if you try to jump to a place which doesn't have such a landing pad, then you get a control-protection exception. So you need the help of the compiler to put that landing pad
at the beginning of every legitimate function. And luckily, GCC and other compilers have had that support for quite a while: GCC since version 8, and we are now at 12. So that works for forward edges. For backward edges, there's another feature called shadow stacks, and that protects the return addresses on your stack.
We'll have an example later. Basically, there is a shadow call stack which you can't write to; it's protected by paging, and if a page is writable, then it won't be usable as a shadow stack. And you can independently compile Nova with branch protection,
with return address protection, or with both; in GCC terms, that corresponds to -fcf-protection=branch, =return, or =full. So let's look at indirect branch tracking. I tried to come up with a good example, and I actually found a function in Nova which is suitable for explaining how this works: Nova has a buddy allocator that can allocate contiguous chunks of memory.
And that buddy allocator has a free function, where you basically pass an address and say: free this block. And the function is really as simple as shown there; it just consists of these few instructions, because it's a tail call that jumps to some coalescing function later. You don't have to understand all the complicated assembler.
But suffice it to say that there's a little test here, these two instructions, which performs a meaningful check: you can't free a null pointer, so this test checks whether the address passed as the first parameter is a null pointer, and if so, it jumps out right here. In that case the function does nothing and does no harm; it's effectively a no-op. Now let's say an attacker wanted to compromise memory and, instead of jumping to the beginning of this function, wanted to jump past that check, to this red instruction, to bypass the check and then corrupt memory. Without control flow enforcement, that would be possible if the attacker could gain execution. But with control flow enforcement, it wouldn't work,
because when you do an indirect call or jump, you have to land on an ENDBRANCH instruction, and the compiler has put that instruction only at the function entry. So if an attacker managed to get control and tried to jump via a vtable or some indirect pointer to this address, you would immediately crash. So this is how indirect branch tracking works.
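At the source level, the shape of that example looks roughly like the sketch below (hypothetical names, not NOVA's actual allocator); with GCC's -fcf-protection=branch, the function entry gets an endbr64 landing pad, and that entry is the only address an indirect branch may target:

```cpp
// With -fcf-protection=branch, GCC emits endbr64 as the very first
// instruction of this function, because its address can escape (it may
// be reached through a function pointer or vtable).
void buddy_free(void *ptr)
{
    if (!ptr)       // freeing a null pointer is a harmless no-op
        return;
    // ...otherwise coalesce the block and put it back on a free list
}

// Indirect call site: the CPU requires the branch target to begin with
// endbr64. Jumping *past* the null check (into the middle of the
// function) raises a control-protection exception instead.
void (*free_fn)(void *) = buddy_free;
```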
Shadow stacks work like this. With a normal data stack, you have your local variables on the stack, and you put the parameters for the next function on the stack; say the green function wants to call the blue function. Then, when you execute the call instruction, the return address gets put on the stack. Then the blue function puts its local variables on the stack,
wants to call the yellow function, puts the parameters for the yellow function on the stack, and calls it, so the return address for the blue function gets put on the stack. The stack grows downward, and you can see that the return address always lives above the local variables. So if you allocate an array on the stack and you don't have proper bounds checking,
it's possible to overwrite the return address by writing past the array. And this is a popular attack technique: the buffer overflow exploits that you find in the wild. So if you have code that is potentially susceptible to this kind of return address overwrite, then you can benefit from shadow stacks.
And the way that this works is that there's a separate stack, the shadow stack, which is protected by paging. You can't write to it with any ordinary memory instructions; it's basically invisible. The only instructions that can write to it are call and return instructions and some shadow stack management instructions. And when the green function calls the blue function,
the return address will not just be put on the ordinary data stack, but will additionally be put on a shadow stack. And likewise with the blue and the yellow return address. And whenever you execute a return instruction, the hardware will compare the two return addresses that it pops off the two stacks. And if they don't match, you again get a control flow violation.
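Here is a deliberately vulnerable sketch (purely illustrative; nothing like this is in Nova) of the kind of overwrite that shadow stacks catch:

```cpp
#include <cstring>

void parse(char const *input)
{
    char buf[16];           // local array, sits below the return address
    strcpy(buf, input);     // no bounds check: a long input writes past
                            // buf and clobbers the on-stack return address
}

// With CET shadow stacks, the ret in parse() compares the (possibly
// clobbered) return address popped from the data stack with the copy
// popped from the hardware-protected shadow stack; a mismatch raises a
// control-protection exception instead of jumping to the attacker's
// chosen address.
```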
So that way you can protect the backward edge of the control flow graph as well, using shadow stacks. And that's a feature that Nova uses on Tiger Lake and on all platforms beyond that which have this feature. But there's a problem. The problem is that using shadow stack instructions
is only possible on newer CPUs that actually have this ISA extension. If you have a binary containing those instructions, it would crash on older CPUs that don't understand them. Luckily, Intel defined the ENDBRANCH instruction to be a NOP on older CPUs, but some shadow stack instructions are not NOPs.
So if you tried to execute a CET-enabled Nova binary on something older without further effort, it might crash, and obviously we don't want that. So what Nova does instead is detect at runtime whether CET is supported.
And if CET is not supported, it patches out all these CET instructions in the existing binary and turns them into NOPs. And obviously, being a microkernel, we generalized that mechanism so that it can rewrite arbitrary assembler snippets from one version to another.
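A minimal sketch of such a boot-time patching mechanism, with assumed structures and names (NOVA's actual implementation differs; the idea is similar in spirit to the "alternatives" mechanism in Linux):

```cpp
#include <cstdint>
#include <cstring>

// One patchable site in the text segment: the preferred bytes are
// already in place; 'alt' holds the fallback used on older CPUs.
struct Patch {
    uint8_t       *addr;      // location of the preferred snippet
    uint8_t const *alt;       // fallback bytes (e.g. XSAVE for XSAVES)
    uint8_t        len;       // length of the preferred snippet
    uint8_t        alt_len;   // length of the fallback (<= len)
    unsigned       feature;   // CPUID feature the preferred bytes need
};

bool has_feature(unsigned feature);   // hypothetical CPUID wrapper

// Runs once at boot, while the text segment is still writable.
void apply_patches(Patch const *begin, Patch const *end)
{
    for (Patch const *p = begin; p != end; ++p) {
        if (has_feature(p->feature))
            continue;                           // CPU supports the original
        memcpy(p->addr, p->alt, p->alt_len);    // install the fallback
        memset(p->addr + p->alt_len, 0x90,      // pad the rest with NOPs
               p->len - p->alt_len);
    }
}
```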
And there are other examples of newer instructions that do a better job than older instructions, like the XSAVE feature set, which can save supervisor state, or save floating point state in a compact format. The binary, as you originally build it, always uses the most sophisticated version.
So it uses the most advanced instruction you can find. And if you run that on some CPU which doesn't support that instruction, or which only supports some older instruction, then we use code patching to rewrite the newer instruction into the older one. So the binary automatically adjusts itself to the feature set of the underlying hardware. The newer your CPU, the less patching occurs,
but it works quite well. And the reason we chose this approach is that the alternatives aren't actually great. One alternative would have been to put some #ifdefs in your code and say: #ifdef CET, use the CET instructions, and otherwise don't. But then you force your customers or your community to always compile the binary the right way,
and that doesn't scale. The other option would have been a runtime if-then-else: if CET is supported, do this, otherwise do that. But that would be a runtime check every single time, and such a check is prohibitive in certain code paths, like entry paths, where you simply don't have any register free
for doing this check, because you first have to save them all; but in order to save them, you already need to know whether shadow stacks are supported or not. So doing this feature check once at boot time and rewriting the binary to use the suitable instructions is what we do, and that works great. The way it works is that you declare some assembler snippets:
XSAVES, say, is the preferred version, and if XSAVES is not supported, the snippet gets rewritten to XSAVE; or a shadow stack instruction gets rewritten to a NOP. We don't need to patch any high-level C++ functions, because they never compile
to those complicated instructions. And yeah, we basically have a binary that automatically adjusts. So finally, let's take a look at performance because IPC performance is still a relevant metric if you want to be not just small but also fast.
And the blue bars here in the slide show Nova's baseline performance on modern Intel platforms like the NUC 12 with Alder Lake and the NUC 11 with Tiger Lake. And you can see that if you do an IPC between two threads in the same address space, it's really in the low nanosecond range, some 200-odd cycles.
If you cross address spaces, you have to switch page tables, and maybe switch the class of service, and then it takes 536 cycles; that's comparable on other microarchitectures. But the interesting thing that I want to show with this slide is that there's overhead for control flow protection.
If you just enable indirect branch tracking, the performance overhead is some 13 to 15%. If you enable shadow stacks, the performance overhead increases some more. And if you enable the full control flow protection, the performance overhead in the relevant case,
which is the cross-address-space case, is up to 30%. So users can freely choose, through these compile-time options, what level of control flow protection they are willing to trade for what decrease in performance. The numbers are basically just rough ballpark figures to give people a feeling for: if I use this feature,
how much IPC performance do I lose? So with that, I'm at the end of my talk. There are some links here where you can download releases and where you can find more information. And now I'll open it up for questions.
Thank you so much, Udo. So we have time for some questions.
Thank you, it was a really nice talk. It's nice to see how many new things are in Nova. One thing I would like to ask: you mentioned that the page table code is formally verified
and that it's also lock-free. What tools did you use, especially in regard to memory model formal verification? Thank you. So I must say that I'm not a formal verification expert, but I obviously have regular meetings and discussions with our formal methods people. The tools that we are using are the Coq theorem prover, for basically doing the proofs,
but for concurrent verification, there's a tool called Iris that implements separation logic. But the memory model that we verified depends on whether you're talking about x86 or ARM.
For ARM, we're using a multicopy-atomic memory model. Also thanks for the talk; it's great to see such nice progress. Just a quick question: at the beginning of the talk, you said that you have this command line option to clamp the CPU frequency and to disable turbo boosting.
Why can't you do that at runtime? Why can't you configure it at runtime? We could configure it at runtime too, but we haven't added an API yet, because the code that would have to do that simply doesn't exist yet. But there's no technical reason why userland couldn't control the CPU frequency
at arbitrary points in time. Okay, wonderful, thanks; I was going to ask about the verification aspect of this. Okay, gotcha. Any other questions? Yeah, just very quickly, on the point of the DMA attack: were you talking about protecting the guests or the host from this DMA attack? So the question was about the DMA attack that I showed in the slide here, and you'll find the slides online after the talk.
This is not a DMA attack of guest versus host; this is a boot-time DMA attack. So you can really think of this as a timeline: the firmware starts, the bootloader starts, Nova starts. And at the time that Nova turns on the IOMMU, both guests and host will be DMA-protected. But Nova itself could be susceptible to a DMA attack
if we didn't disable bus mastering, simply because the firmware does these legacy backward-compatibility shenanigans that we don't like. And I bet a lot of other microkernels are susceptible to problems like this too, and the fix would work for them as well. Thanks, Udo, for the talk.
I would like to know: can you approximate what percentage of the architecture-specific code was added because of the security measures? So most of the security measures that I talked about
are x86-specific, and ARM has similar features, like the guarded control stack specified in ARMv9, but I don't think you can buy any hardware with that yet. You can take the difference between the x86 and AArch64 code as a rough ballpark figure, but it's really not all that much.
For example, the multi-key total memory encryption is just a few lines of code added to the x86-specific page table class, because the support was already built into the generic class to begin with. Control flow enforcement is probably 400 lines of assembler code in the entry paths and the switching code.
I did a quick test as to how many ENDBRANCH instructions the compiler would actually inject into the code. It's like 500 or so, because you get one for every interrupt entry and then one for every function. It also inflates the size of the binary a bit, but not much. And the performance decrease for indirect branch tracking
comes, among other things, from the fact that the code gets inflated and is not as dense anymore. Yeah, final question, please. You were saying that you were able to achieve an ELF binary without relocations.
Can you elaborate a little bit on how you did that? Which linker did you use? So it's the normal GNU ld, but you could also use gold or mold or any of the normal linkers. The reason why no relocation is needed is that, for the paged code, as long as you put the right physical address
in your page table, the virtual address is always the same. So virtual memory is some form of relocation, where you say: no matter where I run in physical memory, the virtual address is always the same. But the unpaged code, which doesn't know at which physical address it was actually launched, has to use position-independent code,
basically saying: I don't care at which physical address I run, I can run at an arbitrary address, because all my data structures are addressed RIP-relative or something like that. And at some point you need to know what the offset is between where you were meant to run and where you actually run, but that's simple: you call your next instruction, you pop the return address off the stack, you compute the difference, and then you know.
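A minimal sketch of that call/pop trick for x86_64 (illustrative, not NOVA's actual startup code; the local label stands in for whatever symbol the real code uses):

```cpp
#include <cstdint>

uintptr_t load_offset()
{
    uintptr_t run, link;
    asm ("call 1f        \n\t"   // pushes the runtime address of 1:
         "1: pop %0      \n\t"   // run  = where we actually execute
         "movabs $1b, %1 \n\t"   // link = where the linker placed 1:
         : "=r" (run), "=r" (link));
    return run - link;           // difference = boot-time load offset
}
```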
Okay, thank you so much, Udo. Thank you. So the slides are online, the recording as well.