Evolution of OSv: Towards Greater Modularity and Composability
Formal Metadata
Title: Evolution of OSv: Towards Greater Modularity and Composability
Series: FOSDEM 2023 (talk 294 of 542)
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
DOI: 10.5446/61619
Transcript: English(auto-generated)
00:05
So once again, hello everybody. Welcome to my talk. This talk is going to be about OSV, the evolution of OSV towards greater modularity and composability. Thanks, Osvaldo, for introducing me. So I've been contributing to OSV since 2016.
00:27
A year earlier, in 2015, I heard about OSV at one of the conferences. And then, a couple of years later, I was nominated to be one of its committers. My greatest contributions to OSV include making OSV run on Firecracker and significantly improving the AArch64 port, among other things.
00:53
So I'm not sure if you can tell, but OSV is actually my hobby. So I'm not a real kernel developer, like many of the previous speakers are.
01:04
So it's actually, you know, I work on it at night when I feel like it. And I have a day job, so I don't represent the company that I work for. This is all my personal contribution to the project.
01:25
So in today's presentation, I will talk about enhancements introduced by the latest release of OSV, 0.57, with a focus on greater modularity and composability. But I will also discuss other interesting enhancements, like lazy stack, novel ways to build ZFS images, and improvements to the ARM port.
01:54
Finally, I will also cover an interesting use case of OSV: SeaweedFS, a distributed file system, running on OSV.
02:07
So as you can see in this talk, besides the title topic, modularity, I will actually try to give you the state of the art of where OSV is, how it has changed recently, and a little bit of where it's going, hopefully.
02:25
So I know there are probably many definitions of unikernels, and each of them is a little bit different, right? And I'm sure most of you understand what unikernels are, but just a quick recap with emphasis on how OSV is a little bit different.
02:47
So OSV is a unikernel that was designed to run a single, unmodified Linux application on top of a hypervisor, whereas traditional operating systems were originally designed to run on a vast range of physical machines.
03:01
But simply speaking, OSV is an OS designed to run a single application without isolation between the application and the kernel. Or it can be thought of as a way to run a highly isolated process without the ability to make system calls to the host OS.
03:23
Finally, OSV can run on both 64-bit x86 and ARMv8 architectures. Now, a little bit of history. So OSV, for those that don't know, was started in late 2012 by a company called Cloudius Systems.
03:47
And they built a pretty strong team of 10 to 20 developers, I think. I wasn't one of them, but they pretty much wrote most of OSV.
04:02
But at some point, they basically, I guess, realized they had to make money. So they moved on and started working on this product you may have heard of called ScyllaDB, which is a high-performance database. But I think they took some learnings with them.
04:21
And after that, basically, I think OSV did receive a grant from the European Union, so there was some project around that. And I think there may have been some companies also using OSV, but honestly, since then, it's really been maintained by volunteers. So, like me, there are still some people from ScyllaDB, Nadav Har'El, and others that contribute to the project.
04:53
I will just single out Fotis Xenakis, who was the one that implemented virtio-fs, a very interesting contribution to OSV.
05:04
And obviously, I would like to take this opportunity to invite more people to become part of our community, because honestly, you may not realize it, but our community is very small. So it's just really me, Nadav, and a couple of other people that contribute to the project.
05:28
So I hope we're going to grow as a community after this talk. Now, a quick recap of how OSV looks, what the design is.
05:41
So in this slide, you can see the major components of OSV across layers, starting with the glibc layer at the top, which is actually largely based on musl, then the core layer in the middle, comprised of the ELF dynamic linker, the VFS (virtual file system), the networking stack, the thread scheduler, the page cache,
06:07
RCU (read-copy-update), page table management, and the L1/L2 pools to manage memory. And then you have a layer of device drivers, where OSV implements VirtIO devices
06:25
over both PCI and MMIO transports, and then Xen and VMware drivers, among others. And one more thing: OSV can run on KVM-based hypervisors like QEMU and Firecracker.
06:46
I also tested OSV on Cloud Hypervisor, which is, I think, Intel's hypervisor written in Rust. I personally didn't really run OSV on Xen, so I know that the Xen support is probably a little bit dated,
07:05
and I'm not sure how much it has been tested. I did test on VMware and VirtualBox, and I think on HyperKit at some point. I won't go into more detail about this diagram, but I will leave it with you just as a reference for later.
07:25
So in the first part of this presentation, about modularity and composability, I will focus on new experimental modes to hide the non-glibc symbols and the standard C++ library.
07:46
I will also discuss how ZFS code was extracted out of the kernel in the form of a dynamically linked library. And finally, I will also explain another new build option to tailor the kernel to a set of specific drivers.
08:04
I call them driver profiles. And another new mechanism to allow building a version of the kernel with only the subset of glibc symbols needed to support a specific application, which I think is quite interesting.
08:23
So by design, OSV has always been a fat unikernel, which has drawn some criticism. By default it has provided a large subset of glibc functionality, has included the full standard
08:40
C++ library and the ZFS implementation, drivers for many devices, and has supported many hypervisors. So on one hand, this makes running an arbitrary application on any hypervisor very easy using a single universal kernel.
09:01
But on the other hand, such universality comes at the price of a bloated kernel, with many symbols, drivers, and possibly ZFS left unused, thus causing inefficient memory usage, longer boot times, and potential security vulnerabilities.
09:23
In addition, a C++ application linked against one version of libstdc++, different from the version the kernel was linked against, may simply not work. For example, that happened to me when I was testing OSV with .NET. The only way to make it
09:46
work was to hide the C++ standard library and use the one that was part of the .NET app.
10:03
So one way to lower the memory utilization of the guest is to minimize the kernel size. By default, OSV comes with a universal kernel that provides quite a large spectrum of the glibc library and the full standard C++ library.
10:21
It exposes a total of over 17,000 symbols, and most of those are very long, as you know, C++ symbols that make up the symbol table. So the question may be posed: why not have a mechanism where we can build a kernel with all non-glibc symbols hidden, and all unneeded, unused code garbage-collected?
10:50
The extra benefit of fewer exported symbols is increased security, which stems from the fact that there is simply less potentially harmful code left.
11:05
And also, that way we can achieve better compatibility, as potential symbol collisions and the mismatched standard C++ library problem, which I mentioned, can be avoided.
11:26
So the release 0.57 added a new build option called conf_hide_symbols to hide those non-glibc symbols and the standard C++ library symbols.
11:43
If enabled, in essence, most files in the OSV source tree, except the ones under the libc and musl directories, are compiled with the -fvisibility=hidden flag.
12:02
On the other hand, the symbols to be exposed as public, like the glibc ones, are annotated with OSV_*_API macros that basically translate to __attribute__((visibility("default"))). And the standard C++ library is no longer linked with the --whole-archive flag.
12:23
Those OSV_*_API macros are, for example, OSV_LIBC_API, OSV_PTHREAD_API, OSV_LIBM_API, and so on, basically matching all of the, I think, around 10 libraries that the OSV dynamic linker exposes.
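To give a rough idea of what such an export macro boils down to, here is a minimal sketch, assuming hypothetical macro and function placement (the real OSv headers and macro spellings may differ): when the whole tree is compiled with -fvisibility=hidden, only declarations carrying the visibility("default") attribute stay in the dynamic symbol table.

    // Hedged sketch of an export macro used together with -fvisibility=hidden.
    // The macro name mirrors what was described in the talk; the exact OSv
    // definitions may differ.
    #define OSV_LIBC_API __attribute__((visibility("default")))

    // Stays exported: part of the public glibc-compatible ABI.
    extern "C" OSV_LIBC_API int getpid();

    // Not annotated, so it inherits the hidden visibility from the compiler
    // flag; it is invisible to applications and can be garbage-collected by
    // --gc-sections if nothing in the kernel uses it.
    namespace osv_internal {
    void some_bookkeeping();
    }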
12:43
Finally, the list of public symbols exported by the kernel is enforced during the build process based on the symbol list files for each advertised library, like, for example, libc.so.6, and is maintained under the exported_symbols directory.
13:08
These files are basically lists of symbols that are concatenated, using a script called generate_version_script, into a version script file, which is then fed to the linker as an argument to the --version-script flag.
13:28
So, in order to now remove all unneeded code, basically garbage, all files are compiled with the -ffunction-sections and -fdata-sections flags.
13:41
And then they are linked with the --gc-sections flag. Now, any code that needs to stay, like, for example, the bootstrap entry points or dynamically enabled code like the optimal memcpy implementation or tracepoint patch sites, is retained by putting the relevant KEEP directives and sections in the linker script.
14:11
The kernel ELF file built with most symbols hidden is roughly 4.3 megabytes in size compared to 6.7, which is a reduction of around 40%.
14:27
This great reduction stems from the fact that the standard C++ library is no longer linked with --whole-archive, the symbol table is way smaller, and unused code is garbage-collected.
14:44
Please note that the resulting kernel is still universal, as it exports all glibc symbols and includes all the device drivers. And as a result of this size reduction, the kernel also boots a little bit faster.
15:03
Well, this all sounds great. So one may ask, why not hide most symbols and the standard C++ library by default, right? The problem is that there are around 35 unit tests, and also some applications written in the past, that rely on those C++ symbols.
15:30
They basically would not run if we hide all of those symbols. They were implemented in the past,
15:42
sometimes out of convenience, sometimes out of necessity. So to address this specific problem, we will need to expose some of those OSV C++ symbols as an API expressed in C.
16:03
So we will basically define very simple C wrapper functions that call into that C++ code.
16:21
A good example of the modularity improvements made in the release 0.57 is extracting the ZFS code out of the kernel as a dynamically linked library, libsolaris.so, which effectively is a new module. To accomplish that, we changed the main OSV makefile to build a new artifact, libsolaris.so, out
16:45
of the ZFS and Solaris file sets in the makefile, which basically used to be linked into the kernel. The new library has to be linked with the BIND_NOW flag and an OSV-specific mlock
17:01
note to force the OSV dynamic linker to resolve symbols eagerly and populate the mappings eagerly as well. This is basically done to prevent page faults that could lead to potential deadlocks as the library is loaded and initialized.
17:22
The init function zfs_initialize(), called once the library is loaded, creates the necessary thread pools and registers various callbacks so that the page cache, the ARC (the adaptive replacement cache from ZFS), and the ZFS dev driver can interact with the relevant code in the ZFS library.
17:50
On the other hand, the OSV kernel needs to expose around 100 symbols that provide some internal, FreeBSD-originating functionality that libsolaris.so depends on.
18:07
OSV borrowed some code from FreeBSD, and a good chunk of that code was the implementation of ZFS, which is now outside of the kernel. Finally, the virtual file system bootstrap code has to dynamically load libsolaris.so
18:27
from bootfs or ROFS (read-only filesystem) using dlopen() before mounting the ZFS filesystem. There are at least three advantages of moving ZFS to a separate library.
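Before going through those advantages, here is a minimal sketch of what that bootstrap step amounts to, assuming an illustrative library path and init symbol name (not necessarily the exact ones OSv uses):

    #include <dlfcn.h>
    #include <cstdio>

    // Hedged sketch: load the ZFS library eagerly before mounting ZFS.
    // The path and the init function name are assumptions for illustration.
    static bool load_zfs_library()
    {
        void* handle = dlopen("/usr/lib/fs/libsolaris.so", RTLD_NOW);
        if (!handle) {
            fprintf(stderr, "failed to load libsolaris.so: %s\n", dlerror());
            return false; // stay on a non-ZFS root filesystem
        }
        // The library's init entry point creates the ZFS thread pools and
        // registers the callbacks the page cache and dev driver rely on.
        if (auto init = reinterpret_cast<void (*)()>(dlsym(handle, "zfs_initialize"))) {
            init();
        }
        return true;
    }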
18:43
First off, ZFS can be optionally loaded from another filesystem like bootfs or ROFS, from a partition on the same disk or on another disk. I will actually discuss that in more detail in one of the upcoming slides.
19:00
Also, the kernel gets smaller by around 800 kilobytes and effectively becomes 3.6 megabytes in size. Finally, at least 10 fewer threads are needed to run a non-ZFS image. For example, when you run an ROFS image on OSV with one CPU, it only requires 25 threads.
19:40
Regular Linux glibc apps should run fine on a kernel with most symbols and the standard C++ library hidden.
19:47
Unfortunately, many unit tests, which I mentioned, and various internal OSV apps, which are written mostly in C++, so-called modules, do not. As they have been coded in the past to use those internal C++ symbols from the kernel, we have to do something to deal with that problem.
20:11
So in the release 0.57, we introduced some C wrapper APIs, which basically follow a C-style convention.
20:27
And then we changed those modules to use those C wrapper functions instead of the C++ code. And the benefit is that, down the road, we might have some newer apps or newer modules that would use those C wrapper functions.
20:49
And it also may make OSV more modular. As you can see, one of those is, for example, osv_get_all_threads, which is basically
21:03
a function that gives a caller a thread-safe way to iterate over threads, and which, for example, is used in the HTTP monitoring module to list all the threads.
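To give a flavor of what such a C wrapper looks like, here is a minimal sketch with an assumed callback-style signature (the actual OSv declaration may differ):

    // Hedged sketch of a C-callable wrapper over internal C++ kernel code.
    // The struct layout and signature are illustrative assumptions.
    extern "C" {

    struct osv_thread_info {
        unsigned long id;
        const char*   name;
    };

    typedef void (*osv_thread_visitor)(const struct osv_thread_info* info, void* arg);

    // Iterates over all threads in a thread-safe way and invokes the visitor
    // for each one; internally this would delegate to the C++ scheduler API.
    void osv_get_all_threads(osv_thread_visitor visit, void* arg);

    } // extern "C"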
21:24
A good example of an OSV-specific module that uses some internal C++ symbols is the HTTP server monitoring module. We modified it to stop using the internal kernel C++ API.
21:41
We do it by replacing some of the calls to internal C++ symbols with these new C-style API symbols, which you saw on the slide before; for example, sched::with_all_threads is replaced with this new osv_get_all_threads function. In other scenarios, we fall back to the standard glibc API.
22:05
For example, the monitoring app used to call an internal OSV current-mounts function, and right now it basically uses getmntent and related functions.
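Since getmntent and friends are standard glibc API, a small, self-contained example of enumerating mounts with them might look like this (illustrative only, not the monitoring module's actual code; the mtab path is an assumption):

    #include <mntent.h>
    #include <cstdio>

    // List mounted filesystems via the portable glibc getmntent API
    // instead of an OSv-internal C++ call.
    int main()
    {
        FILE* f = setmntent("/etc/mtab", "r");
        if (!f) {
            perror("setmntent");
            return 1;
        }
        for (struct mntent* m = getmntent(f); m != nullptr; m = getmntent(f)) {
            printf("%s on %s type %s\n", m->mnt_fsname, m->mnt_dir, m->mnt_type);
        }
        endmntent(f);
        return 0;
    }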
22:28
The release 0.57 introduced another build mechanism that allows creating a custom kernel with a specific list of drivers intended to target a given hypervisor.
22:43
Obviously, such a kernel benefits from an even smaller size and better security, as all unneeded drivers are basically excluded during the build process. In essence, we introduced a new build script and makefile parameter, drivers_profile.
23:02
This new parameter is intended to specify a driver profile, which is simply a list of device drivers to be linked into the kernel, plus some extra functionality, like PCI or ACPI, that these drivers depend on.
23:21
Each profile is specified in a tiny include file with the .mk extension under the conf/profiles/<arch> directory, and included by the main makefile as requested by the drivers_profile parameter. The main makefile has a number of basically ifeq expressions and conditionally adds a given driver object to the linked
23:50
object list, depending on the value (0 or 1) of the given conf_drivers_* parameter specified in that include file.
24:03
The benefits of using driver profiles are most profound when you build the kernel and hide most of the symbols, as I talked about in one of the previous slides. It's also possible to enable or disable individual drivers on top of profiles.
24:24
Profiles are basically lists of drivers, but there are a number of configuration parameters where you can, for example, explicitly include a specific driver.
24:41
One may ask the question: why not use something more standard like menuconfig, like, for example, what Unikraft does? Well, actually, OSV has its own specific build system, and I didn't want to introduce yet another way of doing things.
25:01
So that's why the build script basically uses various parameters to, for example, hide symbols, specify a specific driver profile, or set a list of other parameters.
25:20
So as you can see in the first example, we built the default kernel with all symbols hidden, and the resulting kernel is around 3.6 megabytes.
25:46
In the next example, we actually built the kernel with the VirtIO-over-PCI profile, which is around 300 kilobytes smaller.
26:01
And then in the third one, we built a kernel which is intended, for example, for Firecracker, where we include only the VirtIO block device and networking drivers over the MMIO transport.
26:23
And then in the fourth one, just to see how large the drivers code in OSV is, when you use the base driver profile, which is basically nothing, no drivers, you can see that the drivers code is roughly 600 kilobytes in size.
26:47
And then the last option is where you use the base drivers profile and then explicitly say which specific drivers or driver-related capabilities, like in this case ACPI, virtio-fs, virtio-net, and the pvpanic device, you want to use.
27:14
Actually, with the new release of OSV 0.57, we started publishing new variations,
27:23
effectively, of the OSV kernel that correspond to the interesting build configurations that I just mentioned. And in this example, the "loader hidden" artifacts are effectively the versions of the OSV kernel built with most symbols hidden.
27:44
Those would be at the top, for both ARM and x86. And then, for example, right here, the second, third, and fourth artifacts are basically versions of the kernel built for the
28:07
microVM profile, which is effectively something that you would use to run OSV on Firecracker, which only has VirtIO over the MMIO transport.
28:24
Now, the release 0.57 introduced yet another build mechanism that allows creating a custom kernel by exporting only the symbols required by a specific application. Such a kernel benefits from the fact that, again, it's a little bit smaller and thus
28:45
offers better security as, in essence, all code unneeded by that specific application is removed. This new mechanism relies on two scripts that analyze the build manifest, detect the application ELF files, identify the
29:03
symbols required from the OSV kernel, and finally produce the application-specific version script, app_version_script. The generate_app_version_script script iterates over the manifest files produced by list_manifest_files.py, identifies undefined symbols in
29:26
the ELF files, using objdump, that are also exported by the OSV kernel, and finally generates the app_version_script. Please note that this functionality only works when you build the kernel with most symbols hidden.
29:43
I think what is worth noting in that approach is that you basically run the build script against a given application twice: the first time to identify all symbols that the application needs from the OSV kernel,
30:03
and then, the second time, we build the kernel for that specific app. In this example, we actually generate a kernel specific to running a simple Golang app on OSV.
30:23
When you actually build the kernel with only the, I think, around 30 symbols needed by the golang-pie example, the kernel is effectively around half a megabyte smaller, at around 3.2 megabytes.
30:46
So this approach has obviously some limitations. Some applications obviously use, for example,
31:04
dlsym() to dynamically resolve symbols, and those would be missed by this technique. So in this scenario, basically, for now, you have to manually find those symbols and add them to the app_version_script file. Conversely, a lot of glibc functionality is still referenced in OSV, in linux.cc, where all the system calls are actually implemented, which still
31:31
basically references the code in some parts of the libc implementation, so that obviously also would not be removed.
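As a small illustration of that dlsym limitation (a hypothetical example, not from OSv itself), a program like the following leaves no undefined reference for objdump to discover, even though the kernel must still export the symbol at runtime:

    #include <dlfcn.h>
    #include <ctime>
    #include <cstdio>

    // "clock_gettime" is resolved only at runtime, so it never shows up as an
    // undefined symbol in this binary's ELF tables and a static scan misses it.
    int main()
    {
        void* self = dlopen(nullptr, RTLD_NOW);
        auto fn = reinterpret_cast<int (*)(int, struct timespec*)>(
            dlsym(self, "clock_gettime"));
        if (fn) {
            struct timespec ts {};
            fn(0 /* CLOCK_REALTIME on Linux */, &ts);
            printf("seconds since epoch: %lld\n", (long long)ts.tv_sec);
        }
        return 0;
    }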
31:43
So obviously, we could think of some kind of build mechanism that could, for example, find all the usages of the SYSCALL instruction, or SVC on ARM, and analyze and keep only the code that is actually needed.
32:06
In the future, we may componentize other functional elements of the kernel. For example, the DHCP lookup code could be either loaded from a separate library or compiled out, depending on some build option.
32:21
To improve compatibility, we're also planning to add support for statically linked executables, which would require implementing at least the clone, brk, and arch_prctl syscalls. We may also introduce the ability to swap built-in versions of the glibc libraries with third-party ones. For example, the subset of libm that is provided by the OSV kernel could
32:46
possibly be hidden with the mechanisms I discussed, and we could use a different implementation of that library. Finally, we are considering expanding the standard procfs and sysfs, and the OSV-specific parts of sysfs,
33:05
which would better support statically linked executables, but also allow regular apps to interact with OSV. A good example of that could be the implementation of a netstat-like application that could better expose the networking internals of OSV at runtime.
33:31
In the next part of the presentation, I will discuss the other interesting enhancements introduced as part of the latest 0.57 release. More specifically, I will talk about the lazy stack, new ways to build ZFS images, and finally the improvements to the AArch64 port.
33:51
The lazy stack, which by the way is actually an idea that was thought of by Nadav Har'El, who maybe is listening
34:02
to this presentation, effectively allows saving a substantial amount of memory if an application spawns many pthreads with large stacks, by letting the stack grow dynamically as needed instead of getting pre-populated ahead of time, which is normally the case right now with OSV. On OSV right now, all kernel threads and all application threads have
34:25
stacks that are automatically pre-populated, which is obviously not very memory efficient. The crux of the solution is based on the observation that the OSV page fault handler requires that both interrupts and preemption be enabled when a fault is triggered.
34:44
Therefore, if the stack is dynamically mapped, we need to make sure that a stack page fault never happens in those relatively few places where the kernel code executes with either interrupts or preemption disabled. We basically satisfied this requirement by pre-faulting the stack, by reading one
35:05
byte of the page pointed to by the stack pointer, just before preemption or interrupts are disabled. A good example of such code would be in the scheduler. The OSV scheduler is trying to figure out which thread to
35:23
switch to next, and obviously that code runs with preemption and interrupts disabled, and we obviously wouldn't want a page fault to happen at that moment. So there are relatively few places where that happens, and the idea is to basically pre-fault the stack there.
35:48
To achieve that, we basically analyzed the OSV code to find all the places where IRQ-disable and preemption-disable are called, directly or sometimes indirectly, and pre-fault the stack there if necessary.
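A minimal sketch of that pre-faulting step, assuming x86-64 and a hypothetical helper name (OSv's real code may touch a page further down the stack and differs per architecture):

    // Hedged sketch: touch one byte at the current stack pointer so the page
    // is populated before interrupts or preemption get disabled.
    inline void prefault_current_stack_page()
    {
        char* sp;
        asm volatile("movq %%rsp, %0" : "=r"(sp));   // read the stack pointer
        (void)*(volatile char*)sp;                   // fault the page in now, if needed
    }

    void critical_section_example()
    {
        prefault_current_stack_page(); // make sure the stack page is mapped
        // irq_disable();              // only then disable interrupts/preemption
        // ... code that must not take a stack page fault ...
        // irq_enable();
    }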
36:04
As we analyzed all call sites, we needed to follow basically five rules. The first one: do nothing if the call in question always executes on a kernel thread, because it has a pre-populated stack, so there is no chance that a page fault is going to happen. The second one: do nothing if the call site executes on another type of pre-populated stack. A
36:27
good example of that would be the interrupt and exception stacks or the syscall stack, which are all pre-populated. Rule number three: do nothing if the call site executes when we
36:43
know that either interrupts or preemption are already disabled, because somebody must have already pre-faulted the stack.
37:03
The last rules boil down to checking dynamically, basically by calling the preemptable() and irq_enabled() functions, whether preemption and interrupts are enabled. If we followed only rule number five, which is actually what I tried to do
37:22
in the very beginning, in the first attempt to implement the lazy stack, it would actually be pretty inefficient. I saw pretty significant degradation of, for example, context switches and other parts of OSV when I dynamically checked whether preemption and interrupts were disabled.
37:48
It actually was pretty painful to analyze the code like this, but I think it was worth it. As you remember from the modularity slides, the ZFS filesystem has been extracted from the kernel as a separate
38:02
shared library called libsolaris.so, which can be loaded from a different filesystem before the ZFS filesystem is mounted. This allows for three ways ZFS can be mounted by OSV. The first and original way assumes that ZFS is mounted at the root from the first partition of the first disk.
38:22
The second one involves mounting ZFS from the second partition of the first disk at an arbitrary non-root mount point, for example /data. Similarly, the third way involves mounting ZFS from the first partition of the second or a higher disk, also at an arbitrary non-root mount point.
38:45
Please note that the second and third options assume that the root filesystem is non-ZFS, obviously, which could be, for example, ROFS or bootfs. This slide shows you the build command and how OSV runs when we follow the original and
39:09
default method of building and mounting ZFS. For those that have done it, there's nothing really interesting here.
39:22
This is a new method, actually the first of the two new ones, where we allow ZFS to be mounted at a non-root mount point like /data, for example, and mixed with another filesystem on the same disk. Please note that libsolaris.so is placed on the root filesystem, typically ROFS, under usr/lib/fs, and loaded from it automatically.
39:50
The build script will automatically add the relevant mount point entry to /etc/fstab. The last method is basically similar to the one before, but this time we
40:04
allow ZFS to be mounted from a partition on the second disk or another one. Interestingly, with this option I noticed that OSV would actually mount the ZFS filesystem around 30 to 40 milliseconds faster.
40:27
Now, there's another new feature. In order to build ZFS images and the filesystem on them, we used to use OSV itself to do it.
40:43
With this new release, there's a specialized version of the kernel called zfs_loader, which basically delegates to utilities like zpool.so, zfs.so, and so on, to build and mount the OSV ZFS disk.
41:13
It's actually quite nice because you can mount an OSV disk and introspect it. You can
41:22
also modify it using standard Linux tools, unmount it, and use it on OSV again. Here's some help on how this new script can be used. I don't have much time left, but I will try. I will focus a little bit
41:45
on the AArch64 improvements. I will focus on three things that I think are worth mentioning: the changes to dynamically map the kernel during boot, moving it from the second gigabyte of virtual memory to the
42:06
63rd gigabyte, additional enhancements to handle system calls, and then also handling exceptions on a dedicated stack. As far as moving the kernel's virtual memory to the 63rd gigabyte, I'm not sure if you realize that the OSV kernel is actually position dependent.
42:31
Obviously, the kernel itself may be loaded into different parts of physical memory. Before this release, you would have to build different versions for Firecracker or for QEMU.
42:44
So basically, in this release, we changed the logic in the assembly in the bootloader, where OSV detects by itself where it is in physical memory and, in essence, dynamically builds the early mapping tables to then eventually bootstrap to the right place in the position-dependent code.
43:13
So now basically you can use the same version of the kernel on any hypervisor.
43:23
To add system calls on ARM, we had to handle the SVC instruction. There's not really much interesting there if you know how that works. What is maybe a little bit more interesting was the change that I made to make all exceptions, including system calls, work on a dedicated stack.
43:49
Before that change, all exceptions would be handled on the same stack as the application, which caused all kinds of problems.
44:02
For example, that would effectively prevent the implementation of the lazy stack. To support that, in OSV, which runs in EL1, in kernel mode, we basically take
44:22
advantage of the stack pointer selector register (SPSel) and use both stack pointer registers, SP_EL0 and SP_EL1. Normally, OSV uses the SP_EL1 register to point to the stack of each thread.
44:43
So with the new implementation, what we basically do is that, before the exception is taken, we switch the stack pointer selector to SP_EL0. Once the exception has been handled, it basically goes back to normal, which is SP_EL1.
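Just to illustrate the selector switch itself (not the exact OSv exception entry/exit assembly), the AArch64 SPSel system register can be written directly:

    // Hedged sketch of switching the AArch64 stack pointer selector at EL1.
    // With SPSel = 0 the sp register refers to SP_EL0; with SPSel = 1 it
    // refers to SP_EL1. The real OSv entry/exit code is more involved.
    inline void select_sp_el0()
    {
        asm volatile("msr spsel, #0" ::: "memory");
    }

    inline void select_sp_el1()
    {
        asm volatile("msr spsel, #1" ::: "memory");
    }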
45:01
I think I am going to skip SeaweedFS, because we only have a little time left, but you can read about it on the slide.
45:28
We've also added Netlink support, and we've made quite a few improvements to the VFS layer. Both the Netlink and VFS improvements were done to support SeaweedFS.
45:48
There were basically more gaps that got filled by trying to run this new use case. Just briefly, as we are pretty much at the end of the presentation: I think in the next releases of OSV,
46:03
whenever they're going to happen, I would like us to focus on supporting statically linked executables and adding proper support for spinlocks. The OSV mutex, for example, is right now lockless, but under high contention it would actually make sense to use spinlocks.
46:22
We actually have a prototype of that on the mailing list. Then supporting ASLR, and refreshing Capstan, which is a build tool that hasn't really been improved for a long time because we don't have enough volunteers. Then even the website, and there are many other interesting items.
46:41
As a last slide, I would like to use this as an occasion to thank the organizer, Razvan, for inviting me, and everybody else from the unikernel community. I would also like to thank ScyllaDB for supporting me, and Dor Laor and Nadav Har'El for reviewing all the patches and for their other improvements.
47:11
I also want to thank all other contributors to OSV. I also would like to invite you to join us because there are not many of us.
47:22
If you want to keep OSV alive, we definitely need you. There are some resources about OSV. There's my P99 CONF presentation here as well. If you guys have any questions, I'm happy to answer them. Thank you. Thank you, Waldemar. Thank you. Any questions for Waldemar?
47:46
Please, Mark, just ask. He's going to be happy to get the mic. I have two questions. First, when you spoke about the symbols, about the glibc symbols and the C++ symbols, do I understand it correctly that the problem is that the kernel might be using
48:05
some glibc functions and the application might be linked to its own glibc and some symbols? Well, not really. They would use the same version. There's no problem with, for example, malloc.
48:22
We do want to expose malloc, but there is a good chunk of OSV implemented in C++, and all of those symbols don't need to be exposed, because they inflate the symbol table a lot and they shouldn't really be visible to others.
48:44
Now, I think OSV exposes, if you build with that option, around 1,600 symbols instead of 17,000. So it's really about binary size and, in the case of the C++ library, avoiding a
49:05
collision where you build OSV with a different version of the C++ library than the application. Have you thought about maybe renaming the symbols in the kernel image during link time, maybe adding
49:23
some prefixes to all the symbols, so that you can have them visible but they would not clash? That's an interesting idea. I haven't thought about it yet. And then the second question, yeah.
49:40
So when you spoke about the lazy stack, you said that you pre-fault the stack to avoid the problematic case when interrupts and preemption are disabled. So basically, when I'm thinking about it, do you still need to have some kind of upper bound on the size of the stack, so that you know that you pre-fault enough of it to not get into the issue?
50:05
So my question is, why not then have the kernel stacks at a fixed size? Because if you already need to have some upper bound, then why not have a global upper bound for the whole kernel? Wouldn't it just be easier? Well, I mean, this technique is for application threads only. So
50:24
it's for application stacks; the kernel threads still have pre-populated, fixed-size stacks. Because there are many applications, a good example being Java, that would start like 200 threads, and all of them right
50:40
now are pre-populated at like one megabyte each, and you would all of a sudden need like 200 megabytes. So this is just for application stacks. Well, no, it's in the same, you know, virtual memory. But yeah, I mean, when
51:07
I say kernel stack, I mean, in OSV, basically, there are two types of threads: there are kernel threads and there are application threads. So basically, application threads use application stacks and kernel threads use kernel stacks. And when I say like
51:32
some kernel code, well, obviously, because it's a unikernel, as the code executes in an application,
51:41
it runs on the application stack, but it may execute some kernel code as well, which, yeah. Thank you. Any other questions? Okay, thank you so much. Let's move on.