Address Space Isolation in the Linux Kernel

Formal Metadata

Title
Address Space Isolation in the Linux Kernel
Number of Parts
490
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Abstract
Security is a big problem, especially for cloud and container workloads. This presentation investigates improving security in the Linux kernel itself. The first target is securing sensitive application data, for instance private keys. Address space isolation has been used to protect the kernel and userspace programs from each other since the invention of virtual memory. Assuming that kernel bugs, and therefore exploits, are inevitable, it may be worth isolating parts of the kernel to minimize the damage these exploits can cause. Moreover, restricted mappings in kernel mode may improve mitigation of hardware speculation vulnerabilities. There are several ongoing efforts to use restricted address spaces in the Linux kernel for various use cases:

* mitigation of speculation vulnerabilities in KVM
* support for memory areas visible only in a single owning context
* hardening of Linux containers

We are going to present an approach for implementing restricted mappings in the Linux kernel and show how this implementation would be used with various use cases. We will also take a closer look at the possibility of assigning an address space to Linux namespaces, so that tasks running in namespace A have a different view of kernel memory mappings than tasks running in namespace B. For instance, by keeping all the objects in a network namespace private, we can achieve levels of isolation equivalent to running a separate network stack.
Transcript: English (auto-generated)
Welcome, everybody, to the next talk. This talk is from Mike and James, who work at IBM, and they will give a talk about address space isolation. When the talk finishes and the Q&A starts, please remain seated.
It's not going to take long, and we can hear the questions; afterwards we can all leave without disturbing the Q&A. So, a warm welcome to our next speakers. Please give them a warm applause; the stage is yours. Thanks.
I'm Mike. I work on memory management in the Linux kernel. I happen to maintain the boot time memory manager called memblock, and I'm an employee of IBM Research. We are going to talk about our research into how to use memory management techniques to make containers even more secure than they are today.
Okay. And I'm James. My job in all of this was really just to persuade Mike that it was worth doing, and since I gave a talk this morning my voice isn't doing too well, so he's going to be doing all the talking, telling you what it's about. I'm going to be the demo monkey, because we have a demo right in the middle explaining how this all works in practice. With that, I'll hand over to Mike to do the slides. Thank you.
So, it took quite a few years to get from chroot to cloud native. Containers, as you probably know, can be described as a kind of chroot on steroids, and thanks to technologies like Docker and Kubernetes, containers are now everywhere. It's probably the most popular form of application deployment, and you will find container deployments, in one form or another, both in private data centers and in public clouds. If you have used container services in public clouds, you may have noticed that they all run Kubernetes clusters on top of virtual machines, which creates a kind of unnecessary extra level of virtualization, and that obviously costs additional money, cycles, performance, and so on. One of the claims used to justify running container installations on virtual machines is that containers are less secure than virtual machines, and proponents of this claim usually say: come on, virtual machines have hardware that ensures their security. But as we all know after Meltdown, L1TF, and everything else, hardware is probably not that good at ensuring security for anything; with L1TF in particular, VMs are much more vulnerable than containers or simple processes. Nevertheless, as researchers, we were looking at interesting problems, and we said: okay, we can also use some hardware to ensure isolation of containers.
And what we have is the MMU, so we will try to use the MMU to protect Linux containers with page tables. Our goal is to make containers less vulnerable, and besides, we can presume that every system will be vulnerable in one way or another. So once an attacker has gained some control of the system, we are trying to make sure it will be harder for him or her to penetrate the containers of other tenants sharing the same system. For that, we are proposing to use restricted address spaces to allow better isolation of the privileged contexts of different tenants in the system.
The containers' attack surface is the entire system call interface of the Linux kernel, which is 400-plus system calls. So the first question we asked ourselves was: what can we do to make a system call less vulnerable, or at least make it expose less of the rest of the system to an attacker? The other thing we have been thinking about is that in Linux, containers are isolated mostly using Linux namespaces. So what we are trying to do is to give namespaces their own means of hardware isolation; in other words, we are trying to extend namespaces with their own page tables. We'll explain in a bit more detail what we are trying to achieve.
There is similar work that has been done, and some of it is already there. As you all probably know, as a result of the Meltdown vulnerability the Linux kernel started using restricted address spaces for the first time, in the form of page table isolation (PTI). There is ongoing work in the KVM area to protect virtual machines from the host and from each other, which also tries to implement address space isolation in KVM. And another mechanism that was proposed is process-local memory, which ensures that secrets are visible only to the owning VM and are not visible in the host or in other virtual machines. So what we tried first was to create a restricted address space for the execution of a system call.
It builds on the PTI technology introduced into the Linux kernel, where the kernel mappings are very much restricted for the userspace part of the application: the only kernel things the userspace page tables contain are the bits of code necessary to jump into a system call or an interrupt handler. We thought it would probably make sense to extend this a bit and make system call execution inside the Linux kernel also use a very minimalistic page table, and then bring in required pages on demand. So it would be something like this: here is the page table of the kernel part of a process;
this is the page table of the user part of the process. The privileged code and data are not mapped in the userspace page table, except for the small part required for entering the kernel. We introduced yet another page table that adds the code and data that allow selection of a particular system call for execution. Then, as the system call continues its execution, we demand-page whatever code or data is necessary. The idea was that whenever we enter a system call, we switch the address space; we then remain in a restricted address space, every access to an unmapped area causes a page fault, and the page fault handler can decide whether the access is safe or not. If the access is not safe, we kill the offender; if the access is considered safe, we map the page and continue the execution. We actually implemented this (the patches are here), and we found it is really slow, several times slower than normal system call execution; the context switches are really costly. It also has some security weaknesses: we couldn't validate return targets well enough to actually prevent a ROP attack properly.
We were also competing with the upcoming CET technology that will probably become available eventually. Do you know anything about CET? Intel has been promising it for several years; it's safely in their next chip, but nobody's seen it. Intel CET is going to do the same thing but in hardware, so if those chips become available there is no sense in implementing our approach. But we also thought about another possibility, which we haven't tried to implement yet. One can use ftrace to create a shadow stack of the execution, and then, upon return from any routine, there is a possibility to insert return thunks using GCC or LLVM. That's what retpolines do, for instance, with call; it's possible to do the same thing with ret. In the thunks there is a way to check whether the return address actually matches the shadow stack created with ftrace. This should be faster than using page faults for that. We don't know yet if it will fly at all. The next thing we were trying to do
came actually from an idea some of the KVM developers proposed a while ago on the mailing list; they call it process-local memory. What we are suggesting is to hide a piece of a user process's memory from the kernel, and obviously it won't be mapped in other processes either. So this memory can be used to store secrets, for instance keys, and maybe some other sensitive information. Another possible use case is to hide virtual machine memory from the entire kernel and from the entire host. For storing a secret, for instance, this may be used in the way described here: we create a mapping with particular flags, open a secret file, and then read its contents into that mapping. The patches are here
if anybody is interested. There was a long discussion about this approach, about using a memory mapping with a special flag; the outcome is more or less summarized here. The pros were that it is relatively simple (our submission was about 200 lines of diff), it can be easily plugged into existing userspace allocators, and it can be easily plugged into existing applications with madvise, mprotect and such. But on the downside, the implementation has to take into account various places in the memory management code to handle this kind of mapping, and to check, for instance, whether it is possible to do madvise on such an area, or whether it is possible to pipe into it, splice it, and so on, et cetera.
And the most significant disadvantage of this was the necessity to fragment the direct map the kernel uses to map physical memory. Whenever we create such a special mapping, in order to make it invisible to the kernel and to privileged code, we drop this memory from the direct map, and that requires splitting the large and very large pages that usually constitute the direct map.
Okay. So one of the pieces of feedback we got on the mmap MAP_EXCLUSIVE suggestion was that it is probably better to use a file descriptor, or a character device, to create such secret mappings. And we came up with another version of the patch that actually extends the memfd_create system call. To create a secret area, one first creates a file descriptor with memfd_create with a secret flag, and then calls ioctl to specify exactly how the kernel should treat this memory: it could be exclusive, uncached, and maybe some other properties. Then you continue with mmap and use the memory in a secure way.
This has the advantage of requiring fewer modifications to the core memory management. We don't mark the allocated area with anything except marking the VMA as special, so we wouldn't need to insert as many checks into the core mm. It is possible to pre-allocate memory at boot and then use it as the backing memory for such file-descriptor-based secret areas. We would still need to audit all the memory management code to make sure that nothing tries to access the secret memory and that safety is preserved. And since this is file-based memory management, it would use the page cache mechanism in one way or another, and the final implementation may introduce some complexity into page cache management. And we still haven't addressed the big problem of the fragmentation of the direct map, which may cause some pain in the future. I've prepared this slide just in case,
but now, James, it's all yours. My job is to be the demo monkey; can I actually get this demo up and running? Let's just stop the presentation and get a terminal so I can see what I'm doing. How big does it need to be for you lot? So everybody can see that? Yeah, of course, every tab you start also has to be resized, great. So, Mike actually sent the patch that does this to the mailing list, what, three days ago. I have built a 5.5 kernel with this patch integrated, and that's actually what I'm going to demo to you. Ignore the fact that this is a UEFI secure boot; this is where I've got all my demo kernels. I'm just going to boot this kernel in KVM,
and then once it has booted I will log into it, and we're going to try twice to poke at the memory of what should be a container. But in order to convince you that this works generally, I'm just going to use an ordinary process and prove that we can actually extract its secrets. So if I just log into this system:
I actually have a very simple program that uses OpenSSL. One of the great things about using OpenSSL is that, for reasons best known to the OpenSSL developers, they insisted on rewriting the memory allocation interface on top of glibc malloc, because they claimed it would make programs more secure. So OPENSSL_malloc, if you use it in an OpenSSL program (and OpenSSL does use it for all of the private keys), should actually give you more security, according to OpenSSL. Realistically, as we'll demonstrate, it doesn't. But one of the great things is that I can just use a preload library to override OPENSSL_malloc, insert our special allocation scheme into their allocator, and then OPENSSL_malloc is really allocating all of your private keys in secure memory. And the purpose of this demonstration is a very, very simple program that actually does that.
So this is it. Basically, I allocate a secure pointer using OPENSSL_malloc, I allocate an insecure pointer using the standard malloc, I copy two strings into each of these pointers, and I print them out again. This is obviously highly insecure if you have access to the console, but it serves as a demonstration; the reason for printing them out again is to prove that the process itself can still get access to memory we have designated secure. Usually, when you're dealing with secrets, the trick is to get the secret in, use it, and shred it as fast as possible. In order to demonstrate the actual program working, I put a pause in here that allows me to go in and actually try to extract the secret. So if I run this program, it's going to print out the two pointers. I haven't done an ld.so override yet, so this is only OpenSSL's protection.
And what I'm going to do is use the biggest hammer I possibly can, which is logging into the virtual machine as root and seeing if I can extract the secret. So all I have to do is find the process; there it is. And then I can just ask gdb to attach to it, which it does with no problem; it's stuck at the pause. Sorry. Okay. So I go up a stack frame, and now I can actually print out the pointers. Here's the insecure pointer; I'm running as root, so I can easily poke about in anybody's memory. And here's the secure pointer. So, as root, I can just extract all secrets from the system.
So if I managed to compromise the system sufficiently, I could also do this secret extraction. Now we're going to leave this: I'll come out here, kill this, and now I'm just going to add a preload that overrides OPENSSL_malloc. Let me actually show you roughly what the preload looks like. It's basically the same program Mike showed you: all it's doing is getting a secret memory area, mapping a single page, and putting that page in a secure pool; when you call OPENSSL_malloc, it just returns the page. This obviously is not a buddy allocator, but for the purpose of a demonstration where we only have a single allocation, it demonstrates all of the principles. Obviously, if I were going to do this in practice, I would have actually written a buddy allocator, but I did this on the fly last night and couldn't be bothered. So, let's apply the preload,
and again the program runs. If you looked closely, I actually put a debugging print in the OPENSSL_malloc override, so we know that the secure pointer is now actually in Mike's secure memory. So what we're going to do is find it again, attach to it with gdb, go up, and obviously the insecure pointer was only in ordinary malloc'd memory, so I should still be able to get access to it. But now let's see what happens if I actually try to poke about in here and get access to the secure pointer. My program is killed, and if I have a look at what happened in the kernel, that is a page fault in the direct map. So even root on this machine cannot get access to the secrets that the process deposited into its memory.
And so this affords us a lot of useful secrecy for things like OpenSSL keys and HTTPS, for establishing secure channels in containers, in the cloud. Can you keep the terminal for a moment? Yeah, sure. So, as we can easily see, the page is not present here. Back to the presentation, okay. Let me just... yeah, I know you want it bigger. Thanks. Now, another thing we are trying to do:
Protect namespaces with address spaces. Namespaces in Linux create their objects in a way that is isolated from the rest of the system, so there is no actual need for kernel code running in one namespace to access objects in another namespace. That's why we think it would be possible to give each and every namespace its own page table, and then just take care of some rare cases of namespace transitions for certain objects. We have started with the network namespace. A network namespace creates its own copy of the entire network stack: TCP caches, UDP caches, sockets, everything; they are all private to that network namespace. There is no need for any other namespace to touch the data in the caches of the network stack of another namespace. The only thing that behaves differently is sk_buffs, which represent packets and usually traverse several namespaces on their way to other services, out of the machine or into the machine.
We started working on this more or less at the same time the KVM developers submitted their work on what they call KVM ASI. One of the comments on their submission, from Thomas Gleixner, one of the x86 maintainers, was that there are actually four parts to the creation of restricted address spaces: there needs to be a way to create a restricted mapping; there needs to be a way to switch into it and to switch back out of it; and there should be some machinery to track the state, so that the code can understand which address space it is actually using at the moment. Together with the KVM guys, we started to work on some generic APIs
that will allow the use of restricted address spaces in the kernel. First of all, an API for the creation of the page tables. What we thought is that we need a first-class abstraction for a kernel page table, which is non-existent as of today: the kernel presumes that every address space has an mm_struct, which is actually used to represent user process memory and carries a lot of information that is not necessary for representing a kernel page table. So what we are trying to do is extract the page table information proper from the mm_struct and create a first-class abstraction; we call it pgtable for now. Then we need an API to populate this pgtable, and we need an API that will be able to tear down this pgtable.
The context creation varies between different use cases. KVM guys have it explicit on the VM enter, VM exit. With the network namespace, with the address spaces of processes, it becomes implicit as a part
of the context which is just the page table of particular context is reduced because it is already there inside the namespace or because of some other reason.
And freeing a restricted page table is a real pain, because the current kernel code that frees page tables is very tightly bound to the mm_struct and to the assumption that kernel page tables are never freed. A lot of care must be taken when freeing page tables to interact properly with the TLBs and to avoid TLB shootdowns as much as possible, so there is a lot of work ahead of us in that area.
So, on top of these page table management primitives that we are going to implement someday, we are trying to implement private memory allocations: alloc_pages() and kmalloc() will accept a flag that says, "I want this page visible only in my page table, absent from all other page tables, and dropped out of the direct map."
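The intended semantics of such a flag can be modeled in a few lines. Everything here, the GFP_EXCLUSIVE name, the bitmap standing in for the direct map, the integer context ids, is an illustrative assumption rather than a real kernel interface.

```c
#include <assert.h>

/* Toy model of the proposed exclusive-allocation semantics. The flag
 * name GFP_EXCLUSIVE and the arrays standing in for the direct map and
 * the per-context page tables are assumptions for illustration only. */
#define NPAGES        16
#define GFP_EXCLUSIVE 0x1

static int direct_map[NPAGES];   /* 1 = page present in the shared direct map */
static int owner[NPAGES];        /* context id owning an exclusive page, 0 = none */

/* Allocate page 'idx' for context 'ctx'; with GFP_EXCLUSIVE the page is
 * removed from the direct map and becomes visible only to its owner. */
void alloc_page_for(int idx, int ctx, int gfp)
{
    if (gfp & GFP_EXCLUSIVE) {
        direct_map[idx] = 0;     /* cf. clearing the direct-map mapping */
        owner[idx] = ctx;
    } else {
        direct_map[idx] = 1;
        owner[idx] = 0;
    }
}

/* Can context 'ctx' reach page 'idx'? Either through the shared direct
 * map, or through its own restricted page table if it owns the page. */
int visible_to(int idx, int ctx)
{
    return direct_map[idx] || owner[idx] == ctx;
}
```

An ordinary allocation stays reachable from every context; an exclusive one is reachable only from the context that asked for it.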
So the idea was to add some page flags to struct page and some flags to struct slab. As we got quite a bit of pushback on using new page flags in our first submission of MAP_EXCLUSIVE, we will need to think about a different way, probably using page extensions (page_ext) for this mechanism. Then we can use the existing interfaces for tweaking the direct map, like the set_memory family, which makes pages present or not present in the direct map. But despite the availability of that interface, it is really not good to use it naively, because it fragments the direct map; some groundwork is required to properly implement direct map manipulations, maybe including something THP-like for the direct map as well. And for private allocations using kmalloc() and its family, like kmem_cache_alloc(), we are proposing a mechanism similar to what memory cgroups currently do: we create another level in the hierarchy of kmalloc caches, and every context gets its own set of caches, just like cgroups create their own.
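A rough sketch of that cache hierarchy idea follows, with all names invented for illustration: each context lazily gets its own child copy of every cache, the way memcg clones kmalloc caches per cgroup.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model of per-context slab caches, analogous to how memory cgroups
 * clone kmalloc caches per cgroup. All names are illustrative only. */
#define MAX_CTX 8

struct kmem_cache {
    const char *name;
    size_t object_size;
    struct kmem_cache *per_ctx[MAX_CTX];  /* lazily created child caches */
};

/* Find (or create on first use) the child cache for context 'ctx'. */
struct kmem_cache *cache_for_context(struct kmem_cache *root, int ctx)
{
    if (ctx == 0)                      /* context 0 = the shared root cache */
        return root;
    if (!root->per_ctx[ctx]) {
        struct kmem_cache *c = calloc(1, sizeof(*c));
        c->name = root->name;          /* real code would append a ctx suffix */
        c->object_size = root->object_size;
        root->per_ctx[ctx] = c;
    }
    return root->per_ctx[ctx];
}

/* An allocation in a context goes to that context's private cache, so
 * its objects can live on pages only that context's page table maps. */
void *ctx_cache_alloc(struct kmem_cache *root, int ctx)
{
    return malloc(cache_for_context(root, ctx)->object_size);
}
```

Because each context's objects come out of its own cache, they never share a slab page with another context's objects, which is what makes unmapping them from everyone else feasible.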
We will also create our own caches for address space one, address space two, et cetera. And if we look again at address spaces for network namespaces: we add the page table to struct net, which is the kernel representation of the network namespace. Then, whenever a process joins the network namespace, using clone() or setns() or something like that, its kernel page table gets replaced with the page table common to all the processes in that namespace. And every memory allocation inside the kernel makes sure that pages are allocated privately to that namespace and are not visible in the direct map of other namespaces. We had a proof of concept, mostly working, where socket objects and sk_buffs were allocated with a GFP exclusive flag and used this exclusive memory. I actually planned another demo, but I couldn't get Wi-Fi on my laptop, so sorry about that.
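The namespace wiring described above can be sketched like this. The struct layouts are simplified and the asi_pgtable field is a hypothetical addition, purely to illustrate the "join the namespace, inherit its kernel page table" step.

```c
#include <assert.h>

/* Simplified model of attaching a restricted kernel page table to a
 * network namespace: every task that joins the namespace starts using
 * the page table shared by all tasks in that namespace. Layouts are
 * illustrative, not the real kernel definitions. */
struct pg_table { int dummy; };

struct net {                          /* cf. the kernel's struct net */
    struct pg_table *asi_pgtable;     /* hypothetical field added for ASI */
};

struct task {
    struct net *netns;
    struct pg_table *kernel_pgtable;  /* page table used in kernel mode */
};

/* What setns()/clone() would do on the kernel side in this scheme. */
void join_netns(struct task *t, struct net *ns)
{
    t->netns = ns;
    t->kernel_pgtable = ns->asi_pgtable;  /* inherit the namespace's table */
}
```

Two tasks in the same namespace end up sharing one kernel page table; tasks in different namespaces see different kernel mappings, which is exactly the isolation property described in the talk.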
And this is our current vision of how it's going to happen; it may evolve over time. We are going to implement a page table management API for managing kernel page tables. The page allocator and the slab allocator will use these APIs, and then they will be available to namespace isolation as well as to the KVM isolation work the KVM people are doing. For the exclusive memory mappings, we are currently looking at extending the page cache functionality and using that
to implement exclusive memory mappings. So, to conclude: first of all, it would be nice to make all of this work; it will take a couple of years or so, I presume. We can presume that using restricted address spaces does reduce the attack surface, but we have yet to evaluate the security benefits against the added complexity and probable performance degradation. To evaluate the pros and cons we need an implementation, so it will take some time. As we've seen with system call isolation, some approaches are a no-go: the syscall isolation we tried was too slow. We hope that exclusive memory will be fast enough to be useful in production, and we hope that address spaces for namespaces will be fast enough as well.
And reworking address space management in the kernel is really difficult, because we have to break a lot of assumptions baked into kernel memory management. The major assumption is that there is a single kernel page table and we don't need anything to actually manage it; it's always there. So that's all I have to say. Do you want to add something? And if you have any questions.
Thank you. Please raise your hand if you have a question. We finished quite early, so we have plenty of time to move to the other room after questions. Hello, thank you for the talk. I'm afraid terminating processes this way will break the workflow of tracers and debuggers. I mean, they usually expect some kind of error when they try to access the memory. That's the idea, right? Yeah, but in this demo there was just a killed process. Right, but the premise of the demo is that somebody has already broken into your machine
and tried something nefarious. You don't get killed in normal operation, and because it's a kernel page fault, we can actually choose the signal we use to terminate the process. That signal can be picked up by the container orchestration software, and in addition the kernel log contains a fairly verbose trace of what went on. That can also be passed up through Kafka in the container world and analyzed to show that you have a problem. I mean, if I actually saw a log message like this on one of my containers, what it shows me is that somebody is trying to break into the system and they already have enough privilege to be trying to steal secrets. So this machine needs to be shut down as fast as possible.
Right, next question over there. Thank you, great talk. In the memfd case, how does it play with SCM_RIGHTS? Could you theoretically adapt this to be passed between processes? Yeah, I'd say. How does it play with cgroups? No, no, SCM_RIGHTS, like passing the memfd to another process. Could you theoretically use something like SCM_RIGHTS to then securely donate that memory to another process? Oh, cool. It's a normal file descriptor, so you can do SCM_RIGHTS, you can do LSM checks on it, and everything else you can do with a file descriptor. Pretty much like memfd today, the usual memfd; we just thought that instead of introducing a new system call we would extend the existing one. Makes sense, so you would then update the other process's page tables and so on. Yes. Very cool.
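For reference, the ordinary SCM_RIGHTS dance the answer alludes to looks like this in userspace. Since the proposed secret memory is a plain file descriptor (stood in for here by a regular memfd), the same passing mechanism would apply to it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Pass a file descriptor to a peer over a Unix socket with SCM_RIGHTS.
 * Because the proposed secret memory is "just a file descriptor", this
 * existing mechanism would carry it between processes unchanged. */
int send_fd(int sock, int fd)
{
    char dummy = 'x';
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    char buf[CMSG_SPACE(sizeof(int))];
    memset(buf, 0, sizeof(buf));
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = buf, .msg_controllen = sizeof(buf),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type  = SCM_RIGHTS;          /* ancillary payload is an fd */
    cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    char buf[CMSG_SPACE(sizeof(int))];
    memset(buf, 0, sizeof(buf));
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = buf, .msg_controllen = sizeof(buf),
    };
    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_type != SCM_RIGHTS)
        return -1;
    int fd;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;                              /* kernel installed a new fd */
}
```

The receiver gets its own descriptor referring to the same underlying memory, which is what "securely donating" the memory would amount to.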
Okay, next question. Would it be possible to use this work to further lock down file system implementations in the Linux kernel, to isolate them? Well, in theory it's possible. If you look at the patches we are planning for network isolation, we're planning to give the network namespace its own allocation of sk_buffs, and if you terminate it at the edge with a virtual function, that means we have an isolated network stack. It is not impossible to isolate the file system in the same way using the mount namespace. So we have a namespace to do this; it's just that neither of us has looked at the complexity of actually adding private allocations to the mount namespace. So I can answer theoretically yes, but I have no idea because we've not tried it. Okay, I see the next question over there.
Thank you all for being so quiet, it's really special. Have you explored using this with keyrings? Being able to allocate secure memory for keyrings seems like an obvious choice. So there is a specific problem with the kernel keyring mechanism in that it has no cgroup or namespace isolation currently. Right at the moment, keyrings are shared among all processes. It is theoretically possible that we could use secret memory for, say, the user part of the keyring, but the protection it affords may not be as great as you think. In the current model, only the children of the process can get access to the memory, and in fact we created the memfd with close-on-exec, which means even the children don't get access; so this is really restricted. The keyring has a much more general use case within the kernel, and it's much longer-lived. So I think we won't get keyrings in secret memory until we get the keyring namespace, which is actually necessary in order to consume keyrings in containers anyway. Without the keyring namespace, a key you put into a keyring, even inside a container, is shared by all of the containers. Thank you. Any other questions?
No? Oh, we've got one here. One more question. This architecture, especially in the KVM context, looks much like a first step towards microkernels. Do you think we are heading in that direction? Okay, so this work does have contact points with microkernels. But think about the architecture of a microkernel. And actually, although Mike and I are standing up here, there was a third guy, Joel Nider, who also worked on this with us at IBM Haifa. He was a microkernel guy, so his job was to bring microkernel techniques to what we were doing. In a standard microkernel, it's all of the internal servers within the microkernel that run in their own address spaces. The problem is that if a tenant can exploit one of those servers, you can limit the exploit to that server, but that tenant can still compromise any other tenant also using that server. So it doesn't provide a lot of protection in the microkernel against exploits exercised by tenants. Whereas if you look at what we're doing, we're actually trying to bring up an entire address space that belongs to the tenant alone, so no other tenant running in the kernel can get access to it. Instead of trying to isolate the servers within the kernel, we're trying to isolate the access from the tenant at the top. That is conceptually very different from the way microkernels operate. So it's true to say that we definitely got ideas from looking at microkernel work, because Joel was very, very fanatical about it, but the ultimate implementation we have is very dissimilar from what a microkernel would do.
Thank you. We have two more questions. Still plenty of time. So with your demo you stopped ptrace from accessing the secret memory, but what stops someone from injecting code into the context of the process and basically executing it? So the question is really about the mechanisms we can use for protection. Thanks to the no-execute bit in modern processors, it's actually very difficult to inject code into processes and force them to execute it.
It is definitely not impossible, and a root attacker has many other ways of compromising a process than trying to pull the secret straight out of memory, if they know we've deployed this protection. I mean, security is basically a turtles game. We've gone down a couple of layers of the turtle, but to get perfect security you have to go down infinitely many layers. What we're hoping is that this is a building block for enhanced security: coupled with a few other techniques containers will use, like no-execute memory, enhanced protections for the namespace, and various other things, we might be able to block most of the standard attack channels. And obviously when you do this, the black hats just tend to come up with new attack channels; we look forward to seeing what they are, and we end up in an arms race to see whether we can block those with the same technology or whether we need new technology. So this is definitely not an endpoint for security in containers; this is just the beginning.
Okay, thank you. We have another question there. Hey, have you discussed your design with the potential consumers of those patches, say the container orchestration community, or the TLS libraries, or something like that? And if so, what was their reaction, and will they adopt the features you just presented?
Okay, so as you probably know, there is a bit of a bifurcation between the container orchestration community, Docker and so on, and the actual mechanisms in Linux that implement containers. Most Docker people can get their heads around namespaces and cgroups, but if you look at what Docker does, it still can't take advantage of a lot of the security mechanisms we have in Linux, the user namespace being the classic example. And so the kernel developers' view of the Docker community is that in the rare case they can actually formulate the question correctly, they usually don't understand the answer. So I would agree that what we need to be doing is evangelizing our features, but because the complexity of what we've done in the kernel is almost incomprehensible to people who are managing orchestration systems, it's very difficult to have a sensible conversation about how you would make use of it. So I think the business end of the conversation goes to that demo I showed you. This is a way of using a preload library in a container, which is very easy to do, to get security: just drop this ld.so preload in, attach it, and your container is more secure. The Docker community will be perfectly happy with that. Trying to explain to them the mechanics of an address space separation mechanism that pushes pages out of the direct map is probably going to cause their eyes to roll back in their heads. Thank you. Let's leave it at these questions; it's been a good session already. Thank the speakers. Thank you for the clapping already.