ToroV, a kernel in user-space, or sort of
Formal Metadata
Title: ToroV, a kernel in user-space, or sort of
Number of Parts: 287
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/57044 (DOI)
Transcript: English (auto-generated)
00:02
Hello everyone, and thanks for joining me. I'm Matias Vara. I'm going to talk about ToroV, which is a kernel in user space, or sort of. Before starting, I would like to thank all the people who made FOSDEM possible this year. First, I would like to present myself.
00:23
I enjoy working on operating systems and on virtualization. I have worked at Citrix, TTTech and Huawei, and you can find some of my projects on my GitHub page.
00:42
Before going deeper into the presentation, I would like to picture the use case that motivates this project. I want to talk a bit about microservices and serverless applications. One way to decompose a monolithic application is by using services, in which each service provides one functionality.
01:02
For example, in this picture, the application on the left is implemented as three independent microservices. And, in the future, the idea is to implement these microservices as serverless applications, so it is the cloud provider who deals with the deployment. This presentation
01:26
talks about how we can improve the deployment of serverless applications. In such a context, performance and isolation are key. There are basically two mechanisms to deploy
01:46
serverless applications: either by using containers or by using virtual machines. These mechanisms should be able to limit host resources, prevent interference between applications and with the host, and be efficient in terms of CPU and memory. And in general,
02:08
these mechanisms are chosen based on a trade-off between performance and security. Let's analyze these two solutions in terms of those two parameters.
02:24
When using containers, the application runs as a process on top of a general-purpose operating system. This process has a limited view of the host resources. This is implemented by relying on cgroups and namespaces, which are features of the host operating system. The application
02:48
still communicates with the host kernel by using syscalls. This deployment is very efficient. However, the host kernel is exposed, so the attack surface is large. For example,
03:02
a bug in the host kernel could be used to compromise all the containers. In this sense, containers do not offer complete isolation. Virtual machines, instead, provide stronger
03:20
isolation. To create virtual machines, we require a hypervisor that handles the virtualization features of the hardware. In this case, applications run in a virtual machine on top of a guest operating system, so only the kernel in the virtual machine can be
03:47
compromised. However, a virtual machine requires a larger footprint because guests consume resources like memory, vCPUs, and an on-disk image. Also, the time to be up and running is longer than by
04:07
using containers. These issues end up limiting the number of instances that a server can host.
04:22
Also, this deployment requires a device model, like for example QEMU, to provide hardware to the virtual machine. This represents a new component that is exposed to the guest and that can also be compromised, together with the hypervisor. The device model can be reduced by
04:45
using approaches like Firecracker or QEMU microvm, in which only a few virtual devices are exposed to the guest. To reduce the complexity of a general-purpose OS and the number of
05:07
resources that a VM requires for a dedicated task, some approaches propose the use of a unikernel to host microservices. A unikernel is a kernel that is compiled together with the user application. Currently, there are different unikernels which are used in different scenarios,
05:25
like Toro, OSv, MirageOS, Unikraft, NanoVMs, and so on. In this context, we wonder if we can just offload the guest kernel to the host. This is called a kernel in user space because
05:46
it is just a process in the host serving requests from the guest. But what do I call a kernel in user space? Well, it is a way to offload the guest kernel to the host.
06:07
In this design, the VM does not need a device model. Instead of exposing a whole device model, we expose some high-level services to the guest. In the following slides,
06:22
I'm going to first present User Mode Linux and gVisor and show the difference with ToroV. In User Mode Linux, we have basically two processes in the host: one process that hosts the application and another that intercepts the syscalls.
06:45
The interception mechanism is based on ptrace. The process that handles the syscalls emulates the whole Linux operating system. So basically, we have split the user space and the kernel space into two processes.
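To make this two-process idea concrete, here is a minimal sketch of ptrace-based syscall interception on x86-64 Linux. It is a toy, hypothetical example: it only prints the intercepted syscall numbers instead of emulating the kernel, and it is not UML's actual code.

```c
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t child = fork();
    if (child == 0) {
        /* Child: ask to be traced, then run the "user-space" payload. */
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        execlp("echo", "echo", "hello", NULL);
        return 1;
    }
    /* Parent: plays the role of the "kernel" process. */
    int status;
    waitpid(child, &status, 0);
    while (!WIFEXITED(status)) {
        ptrace(PTRACE_SYSCALL, child, NULL, NULL);   /* run until the next syscall stop */
        waitpid(child, &status, 0);
        if (WIFSTOPPED(status)) {
            struct user_regs_struct regs;
            ptrace(PTRACE_GETREGS, child, NULL, &regs);
            /* A real implementation would emulate the syscall here.
             * Note: this stops at both syscall entry and exit, so each
             * syscall is reported twice. */
            printf("intercepted syscall %llu\n", regs.orig_rax);
        }
    }
    return 0;
}
```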
07:09
In approaches like gVisor or ToroV, the application runs as a VM on top of a simple kernel and still communicates with the guest kernel by using syscalls.
07:21
However, this kernel may decide to forward some syscalls to a process that is running on the host. This is triggered by a VM exit from the guest operating system. In the host, this process emulates the services that a kernel would offer, like the file system or networking. Such services are provided by using a minimal interaction
07:49
with the host kernel. The reason to do this is to reduce the attack surface of the host kernel as much as possible. This is the case of gVisor, which is a project from Google. In gVisor,
08:07
the serverless application runs in the context of a VM, in ring 3, and communicates with the guest OS by using syscalls. These syscalls are trapped and then forwarded to a host process
08:21
named Sentry. This component handles part of the syscalls and delegates file-system syscalls to a service named Gofer. These two components interact with the host kernel through a reduced number of syscalls, which shrinks the host kernel attack surface.
08:43
This approach increases the cost of a syscall, so applications that use syscalls heavily may not perform so well. Well, in this context, what exactly is ToroV then? Well, ToroV is
09:02
a minimalistic kernel in user space in which syscalls from the guest are forwarded to the host. It allows the user to configure which syscalls are allowed per application, and it provides a modified runtime library that the user application must be compiled with. It exposes a proxy API
09:26
based on hypercalls to the guest. It runs in the host as a containerized process to reduce the host attack surface, and it allows the user to debug the application
09:41
by simply using GDB. In ToroV, the application runs in the context of a virtual machine, and when the application requires a service from the host, it calls the corresponding syscall,
10:01
which triggers a hypercall. This hypercall is trapped by a component named the virtual machine monitor, which processes the syscall.
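As a rough illustration of what trapping the hypercall can look like in the virtual machine monitor, here is a minimal KVM run loop sketched in C, assuming the out instruction is used as the hypercall trigger (mentioned later in the talk). The setup ioctls (KVM_CREATE_VM, KVM_CREATE_VCPU, mmap of the kvm_run area) are omitted, and the port number and handle_syscall() helper are hypothetical, not ToroV's actual code.

```c
#include <linux/kvm.h>
#include <sys/ioctl.h>

#define HYPERCALL_PORT 0xE9          /* hypothetical I/O port used for hypercalls */

void handle_syscall(int vcpu_fd);    /* hypothetical: forwards the guest syscall */

/* Run the guest until it halts; service hypercalls on each KVM_EXIT_IO. */
void run_loop(int vcpu_fd, struct kvm_run *run)
{
    for (;;) {
        ioctl(vcpu_fd, KVM_RUN, 0);              /* enter the guest */
        switch (run->exit_reason) {
        case KVM_EXIT_IO:
            if (run->io.port == HYPERCALL_PORT)
                handle_syscall(vcpu_fd);         /* the VMM processes the syscall */
            break;
        case KVM_EXIT_HLT:
            return;                              /* the guest finished */
        default:
            break;                               /* other exits not handled here */
        }
    }
}
```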
10:29
For each application, the user configures which syscalls are allowed. This is configured by using a JSON file. For example, the JSON on the slide corresponds to the configuration for the hello world example, where only the ioctl and write syscalls are allowed.
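As an illustration, an allow-list for the hello world example could look like the following JSON; the field name is a placeholder, not necessarily ToroV's exact schema.

```json
{
  "AllowedSyscalls": ["ioctl", "write"]
}
```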
10:41
The user can also configure whether the process that holds the virtual machine monitor runs in its own PID namespace and root directory. So, to build an application
11:06
that runs in ToroV, we rely on a modified libc or the Free Pascal RTL, depending on the language that you are using. The application is compiled into an ELF64 binary and, from this file,
11:22
we generate a flat binary which is going to be loaded into the VM. When the application starts, the VM is in 64-bit long mode, in ring 0. So, in this sense, the virtual machine monitor is just a wrapper for the KVM API. The following picture shows the memory layout
11:56
when the application starts. This is built by the virtual machine monitor.
12:03
As we read the picture from left to right, we have first the page directory table, then the application binary, the heap, and the stack. Currently, we are limited to two megabytes
12:24
for the application binary and less than two megabytes for the heap. In the current version, the heap is pre-allocated. This means that syscalls like mmap just increment a pointer
12:47
and return it. The amount of memory that we allocate is two megabytes per application.
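A minimal sketch of that pre-allocated heap idea: mmap does not create a real mapping, it just hands out consecutive chunks of a fixed region and advances a pointer. Sizes and names are illustrative, not ToroV's actual code.

```c
#include <stddef.h>
#include <stdint.h>

#define HEAP_SIZE (2 * 1024 * 1024)        /* pre-allocated heap region */

static uint8_t heap[HEAP_SIZE];
static size_t  heap_next;                  /* bump pointer into the heap */

void *guest_mmap(size_t len)
{
    if (heap_next + len > HEAP_SIZE)
        return NULL;                       /* out of pre-allocated memory */
    void *p = &heap[heap_next];
    heap_next += len;                      /* just increment and return */
    return p;
}
```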
13:00
This memory is accessible from the virtual machine monitor. So, when a syscall has to be forwarded to the host, the virtual machine monitor has to translate the addresses of the parameters to point to the corresponding addresses in the host process.
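A minimal sketch of that translation, assuming guest memory is one flat region registered with KVM_SET_USER_MEMORY_REGION and identity-mapped by the guest; under that assumption, a guest address is just an offset into the host mapping.

```c
#include <stdint.h>

void *guest_to_host(uint8_t *host_mem_base, uint64_t guest_addr)
{
    /* host_mem_base is the pointer returned by mmap() for the guest RAM */
    return host_mem_base + guest_addr;
}
```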
13:37
Let's see this in an example: what happens when we launch the hello world example using strace. In the first line, you have the command line that you can use to
13:44
get the same output. This is a partial output in which I just highlight the most important things. We can see, in the first line in green, that the virtual machine monitor creates the vCPU and then creates a child process by using clone. This is the process that will intercept
14:07
the syscalls from the guest. This process has a limited view of the system, obtained by setting the corresponding flags in the clone syscall. Then the child process changes its root directory and runs the vCPU.
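A sketch of that child-process setup, roughly matching what the strace output shows: clone with namespace flags for a limited view of the host, then chroot before running the vCPU. Flags and paths are illustrative, not ToroV's actual code.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)
static char child_stack[STACK_SIZE];

static int child_main(void *arg)
{
    (void)arg;
    chroot("/var/empty");          /* change the root directory (needs privileges) */
    chdir("/");
    /* ... only now enter the KVM run loop and execute the vCPU ... */
    return 0;
}

int spawn_vmm_child(void)
{
    /* New PID and mount namespaces give the child a limited view of the host */
    return clone(child_main, child_stack + STACK_SIZE,
                 CLONE_NEWPID | CLONE_NEWNS | SIGCHLD, NULL);
}
```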
14:22
The first interception corresponds to the write syscall, which we can see in red in the RAX register. Then the virtual machine monitor executes the write syscall, but on the host, with the corresponding parameters.
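Here is a sketch of forwarding that intercepted write to the host, assuming the guest follows the usual Linux convention (syscall number in RAX, arguments in RDI/RSI/RDX) and that guest memory is a flat region mapped at host_mem_base, as in the translation sketch above. Illustrative only, not ToroV's actual code.

```c
#include <linux/kvm.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

void forward_write(int vcpu_fd, uint8_t *host_mem_base)
{
    struct kvm_regs regs;
    ioctl(vcpu_fd, KVM_GET_REGS, &regs);      /* read the guest registers */

    /* rdi = fd, rsi = guest buffer address, rdx = count */
    void *buf = host_mem_base + regs.rsi;     /* translate guest -> host */
    long ret = syscall(SYS_write, (int)regs.rdi, buf, (size_t)regs.rdx);

    regs.rax = ret;                           /* return value goes back in RAX */
    ioctl(vcpu_fd, KVM_SET_REGS, &regs);
}
```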
14:42
These are some numbers for the hello world application. It consumes about one and a half megabytes of memory, and it takes about
15:05
seven milliseconds to execute, on average. The write syscall is about 10 times slower than on the host, and these are the different steps to perform a syscall
15:23
from the guest application to the host. So, as we can see, ToroV is still a proof of concept, so there is a lot of work to do. These are some of the tasks that I have in mind
15:44
for the future. If someone is interested in helping me, feel free to contact me. For example, I'm currently working on porting libc, and I would also like to port other languages like Go or Rust to ToroV, so we could generate binaries that, instead of using a syscall, do a hypercall
16:08
and interact with the host. I'm also researching how ToroV compares with seccomp. There is also the possibility to run binaries without recompiling them, by replacing
16:27
the syscall instruction with some instruction that triggers a VM exit, like the out instruction, as I'm doing now.
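As a sketch of the guest side under this scheme, a recompiled binary could replace each syscall with a small stub that issues an out to a magic port, causing the VM exit that the VMM services (as in the run-loop sketch earlier). The port number and register convention here are assumptions, not ToroV's actual ABI; x86-64 GCC/Clang inline assembly.

```c
/* Hypothetical guest-side hypercall stub: the VMM reads RAX/RDI/RSI/RDX,
 * performs the syscall on the host, and puts the result back in RAX. */
static inline long hypercall3(long nr, long a1, long a2, long a3)
{
    long ret;
    __asm__ volatile(
        "mov %1, %%rax\n\t"
        "mov %2, %%rdi\n\t"
        "mov %3, %%rsi\n\t"
        "mov %4, %%rdx\n\t"
        "outb %%al, $0xe9\n\t"     /* write to the magic port -> VM exit */
        "mov %%rax, %0"
        : "=r"(ret)
        : "r"(nr), "r"(a1), "r"(a2), "r"(a3)
        : "rax", "rdi", "rsi", "rdx", "memory");
    return ret;
}
```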
16:40
There is also the idea of maybe porting the whole project to Rust, and of working on the syscall bottleneck, which is where we spend most of the time. One idea is to replace the synchronous mechanism
17:01
of the syscall with something asynchronous, like, for example, a virtual device for syscalls. In that case, the guest application would send packets through virtual queues containing all the information needed to perform the syscall in the host, instead of executing an instruction that blocks the guest.
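A rough sketch of what such an asynchronous "syscall device" could look like: the guest fills request descriptors in a shared ring and keeps running, while the VMM consumes them and writes the results back. This is a design idea from the talk, not existing ToroV code; all names and fields are hypothetical.

```c
#include <stdint.h>

struct syscall_req {
    uint64_t nr;        /* syscall number */
    uint64_t args[6];   /* arguments (guest addresses still to be translated) */
    int64_t  ret;       /* filled in by the VMM when the request completes */
    uint8_t  done;      /* completion flag the guest can poll or wait on */
};

#define RING_SIZE 64

struct syscall_ring {
    struct syscall_req req[RING_SIZE];
    uint32_t head;      /* producer index, written by the guest */
    uint32_t tail;      /* consumer index, written by the VMM */
};
```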
17:27
Also, regarding how it is implemented in the host, I would like it to be more modular, so that, to support binaries coming from different operating systems, we could
17:47
handle the same syscall in different ways. To do that, maybe we can split the virtual machine monitor into components that handle the syscalls differently. This is
18:05
still a proof-of-concept idea. If you want to see more interesting examples, here you have a couple of them. I think one of the most interesting examples I have is the echo server, in which you can see the use of the socket syscalls from the guest application.
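For reference, this is the skeleton of such an echo server in plain C: everything it does goes through the socket syscalls (socket, bind, listen, accept, read, write) that ToroV has to forward. This is a generic sketch, not the actual ToroV example; the port number is arbitrary.

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int srv = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);

    bind(srv, (struct sockaddr *)&addr, sizeof(addr));
    listen(srv, 1);

    for (;;) {
        int cli = accept(srv, NULL, NULL);
        char buf[256];
        ssize_t n;
        while ((n = read(cli, buf, sizeof(buf))) > 0)
            write(cli, buf, n);          /* echo back whatever was received */
        close(cli);
    }
}
```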
18:28
In that example, I also show how I use GDB to debug the application, and the reason why I can do that is because ToroV includes a GDB stub, so any application that you run on ToroV can be debugged by using a GDB client.
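Assuming the stub listens on a TCP port (the port and binary name below are placeholders, not necessarily what ToroV uses), debugging would look like a standard remote GDB session:

```
$ gdb myapp.elf
(gdb) target remote localhost:1234
(gdb) break main
(gdb) continue
```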
18:47
Well, that's all folks. Thank you very much for listening,
19:09
and I will be online for questions. Thank you again very much. I have a question in which
19:22
Leo asked why Pascal, and I don't have a good answer for that. I just like the language, but one item of future work is to replace it with Rust. I have another question from Leo too.
19:47
He asked: do you think it will scale at some point? Well, I'm not sure, and if you compare with at least some papers, you have a factor of 100
20:03
instead of 10 times, so I think that, depending on your use case, you can tolerate that, but I'm not sure yet. Well, Stefan is asking if I have implemented something like the Linux vDSO, where the syscall is actually handled inside the guest as a performance optimization.
20:24
No, I didn't, and the reason why I didn't do that is because I wanted to have a minimal guest.