Is it time to replace mmap()

Berkeley Software Distribution (BSD)

Formal Metadata

Title

Is it time to replace mmap()

Title of Series

The Technical BSD Conference 2018

Number of Parts

45

Author

License

CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Identifiers

10.5446/45138 (DOI)

Publisher

Berkeley Software Distribution (BSD)

Release Date

Language

Content Metadata

Subject Area

Computer Science

Genre

Conference/Talk

Abstract

Is it time to replace mmap()? A history of userspace virtual address management interfaces and a case for a new interface.

Transcript: English (auto-generated)

00:11

It's 1:30, so I'm going to go ahead and get started. My title is "Is It Time to Replace mmap()?", but my talk is really

00:20

a history of virtual memory management, or rather virtual address space management. And I give a proposal for the future that solves some of the problems we have today and some of the problems that the project I'm working on is creating. So I did turn it on. It is indeed on.

00:41

OK. Let's see if that's any better. There, OK, there's some pick up. Let's go open the wire then.

01:04

OK, is that picking up? No, it does not seem to be. Audio available. OK. OK, is that picking up at all? No.

01:20

It's on, yeah. OK. I think, well, I can, let's try that then.

01:40

OK, is that more useful? OK. I'm going to talk today about what we usually call memory. But in fact, what I'm talking about is virtual address space management. And that's a process of both allocating virtual address space and putting something behind it, so that it doesn't give you a page fault because there's nothing there when you use it.

02:02

But first, we're going to start with some history. Here we see a timeline of a little bit of the future and the entirety of computing history. You know, you can quibble a bit with the start. There's some arguments you could make. But we'll start there. This is the conventional wisdom of where computing started with ENIAC.

02:23

Manchester Baby was in 48. I couldn't find any good pictures, so it doesn't get any pictures. And then EDSAC was in 49. And then we're going to skip ahead a bit to the PDP-11, which is what's sort of interesting for us.

02:41

Because the PDP-11 is where we really started developing Unix. The actual first version was on the PDP-7. But this is where we really started to have Unix as we know it. And that's what I'm going to focus on. So one little point in this computer architecture history that's worth noting is in 85, we got the i386. And this was the point where consumer hardware

03:02

at reasonable prices could conceivably run a Unix system without some serious compromises. So now I'm going to give you a little refresher on how process address space works. Because that's what we're going to be talking about. In an old Unix program, traditional Unix program,

03:21

you have a big process address space. And the kernel maps the program into memory. You have some code. You have some data. You have BSS, which is a bunch of things that start out as zeros. So all those global variables that were uninitialized, they start as zeros. And then you have a bit of unmapped stuff

03:41

at the beginning, at least you hope it's unmapped, because it's the null pointer. So when you dereference the null pointer, we want it to fault and not have interesting things happen. You have a stack. Conventionally, you put the stack up at the top of the process. The stack grows down and goes that way.

04:00

Then you have the heap. That's the stuff above BSS. And the heap grows up. Now, ideally, they don't meet in the middle. But the heap is all of your dynamic memory allocation that isn't your program stack.
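
The layout described above can be poked at from a small C program. This is only a sketch: the function and variable names are made up for illustration, and on a modern system the heap address may come from mmap rather than the break, so the ordering of the printed addresses is not guaranteed to match the traditional picture.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int initialized_global = 42; /* data section */
int uninitialized_global;    /* BSS: starts out as zero */

/* Print one address from each region and return the BSS variable's value. */
int show_layout(void)
{
    int on_stack = 0;                       /* stack */
    int *on_heap = malloc(sizeof *on_heap); /* heap */

    if (on_heap == NULL)
        return -1;
    printf("code  %p\n", (void *)(uintptr_t)&show_layout);
    printf("data  %p\n", (void *)&initialized_global);
    printf("bss   %p\n", (void *)&uninitialized_global);
    printf("heap  %p\n", (void *)on_heap);
    printf("stack %p\n", (void *)&on_stack);
    free(on_heap);
    return uninitialized_global;
}
```

Calling show_layout() from main prints five addresses and returns the value of the uninitialized global, which the loader has zeroed.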

04:20

Now, we are going to take a look at how that maps into the physical address space. So these different sections get mapped into the physical address space in different places. Doesn't really matter where they go. We have paging mechanisms, so we can use physical address space efficiently. Some of these things, like the code and data, are backed by disk.

04:41

And then some of those are mapped copy-on-write, so that when you write to them in your program, say you change your global variables, you don't write that back to disk. Amusingly, we are now heading into a world where we do this on purpose, because that's persistent memory. So we are heading back into what

05:02

happens if you screw up and don't implement copy-on-write. Now, in the context of this address space, we have two things that we track. As I said, you don't want the heap and the stack to meet in the middle and run into each other. So we manage the break, which is the top of the heap,

05:21

traditionally. And then we have the stack pointer, and we manage that. So now, let's get back to Unix history. The very, very first bit of Unix was circa 1970 on the PDP-7. We don't have public documentation of any of that.

05:43

We start with v1. The sys break syscall first appears there, or at least is first documented. It's there. However, the v1 dump doesn't have any documentation. So you can read the code and see what it does, but there's no explanation as to how you ought to use it or whatnot.

06:02

v2 is about the same, except it was renamed to break. At that time, it was the break function, so you might notice we're probably not running C yet, because break is a keyword in C. v3, in 1973, we finally get some documentation.

06:24

So we get a man page, which says that break sets the system's idea of the highest location used by the program to addr, the argument. Locations greater than addr and below the stack pointer are not swapped and are thus liable

06:42

for unexpected modification. You notice at this point that we're not using virtual memory for protection at all. In fact, as far as I can tell, what we're using it for is to make process swapping and context switching cheaper by not having to change the page mappings for the rest of the pages. At least, that's what I infer.

07:01

And in an extremely minimal system, this is the sensible approach, assuming you don't want fault tolerance or anything like that. So back to the history. v4 Unix, the system call has now been renamed to sbreak. And the sbrk library call is introduced.

07:22

So again, break sets the system's idea of the lowest location not used by the program to addr. So it says the same thing as before except backwards. It's rounded up to the next multiple of 64 bytes, which I don't believe sbrk does in modern Unix.

07:44

And it's interesting to me that it's 64 bytes because words were smaller then. And so I'm wondering what the motivation for that particular rounding choice was on the PDP-11. But I haven't found out. That was the granularity of the segmentation registers.

08:03

Ah, that makes sense. And this is why I'm glad that I have old-timers in the audience who can correct me on my talk. That's quite pleasing. So locations not less than addr and below the stack pointer

08:21

are not in the address space. And thus will cause a memory violation if accessed. So now we're seeing actual protection from the virtual memory system. So continuing on through the manual, we see the manual for sbrk, which is the C interface to the break command.

08:42

The sbrk command takes an increment, which is positive or negative, and moves the break. And then it returns a pointer to that location. So incr adds more bytes to the program's data space and a pointer is returned to the start of the new area.
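
As a rough illustration of the interface just described, here is a minimal sketch using the sbrk that still exists on Linux and several BSDs (it is deprecated or absent on some platforms). The helper names grow_heap and break_delta are invented for this example.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes sbrk() on glibc */
#endif
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Grow the data segment by incr bytes. Returns the old break, which is
 * the start of the newly usable area, or NULL on failure. */
void *grow_heap(intptr_t incr)
{
    void *old = sbrk(incr);

    return old == (void *)-1 ? NULL : old;
}

/* Grow by incr bytes and report how far the break actually moved. */
intptr_t break_delta(intptr_t incr)
{
    char *before = grow_heap(incr);
    char *after;

    if (before == NULL)
        return -1;
    after = sbrk(0); /* an increment of 0 just reads the current break */
    return after - before;
}
```

A negative increment moves the break back down, which is the only way this interface can return memory, and only from the very top of the heap.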

09:01

Oh, another snippet I found that I thought was kind of interesting: when a program begins execution via exec, the break is set to the highest location defined by the program and the data storage area. So that's the end of BSS. Ordinarily, therefore, only programs

09:23

with growing data areas need to use break. I read this and I sort of giggled to myself. I was like, programs that don't use dynamic memory allocation, that's really weird. And I said this when I gave this talk before at BSDTW. And one of my co-presenters pointed out he works in high performance computing.

09:41

And in high performance computing, that's the norm. You statically define the size of all your giant arrays. You recompile when you want a new array. And that's the world that was there and that's the world today. In fact, there's a bunch of good reasons for that in HPC because that means you have static sized loops and you can unroll them without any limit.

10:08

Absolutely. So I probably had too many Fs in my address space diagram. Even with the dots.

10:22

OK. Here's where sbrk was introduced again. So in 75, v4 introduced sbrk, which is another interface to the same thing. It's just another software interface. And then in 1983

10:41

4.2 BSD documents mmap and the whole family of interfaces, but does not implement it. We'll get to what those interfaces are, but first let's look at some problems with the break interface, things that don't work real well. One problem is heap

11:01

fragmentation. With break you keep incrementing, incrementing, incrementing as you allocate memory. Then, say, there's this huge thing that you allocated previously and you don't want it anymore. You can't give it back to the kernel, so now you have to do bookkeeping to remember that you can reuse it. So that's not great. On a

11:20

16-bit machine you're probably very carefully doing that bookkeeping, possibly hard-coded. But once you start to see 32-bit and 64-bit machines, this becomes problematic. You have to do a bunch of work. Another issue is that you'd like to have memory sharing. So let's consider an ls program. This is a statically linked ls program

11:42

mapped into physical address space. If we start another one, we should be able to share the code, and we should be able to share the read-only part of the data. And then we can't share the BSS, we can't share the heap, and we can't share the stack,

12:03

which makes sense. We can do this with a statically linked program with sbrk, because the kernel does these mappings, so it can make sure that it shares the pages. But when we get to dynamic linking, we've got some new problems. Now, in a dynamically linked program, the kernel initially maps the stack,

12:21

the program, and the runtime linker. The runtime linker is then started and it maps libc. Of course, the proportions here are probably exactly opposite for your average program: libc is a bloated beast, and, you know, cat has far more code than it should,

12:42

but nonetheless is small, and ls even is pretty small. So you'd like to have sharing there. But if you were using sbrk to allocate some space, opening libc, reading it, and then splatting it into memory, you wouldn't have any sharing. You know, you could do ridiculous things,

13:02

but you're not going to do ridiculous memory deduplication here, so it's another problem. So you've got your heap here again. Now let's consider multi-threaded programs. We have a new way to create heap fragmentation: you spawn a thread with sbrk, you give it a stack.

13:20

Oh, now we need some more heap, you know, some more heap. That's not what you want, especially because when your threads exit you've got these little stack bits left in there that you would need to reclaim. Instead, you'd really like your stack to be off somewhere else with some nice guard pages around it, so when you walk off the end, things go boom instead of something else.

13:44

So those are some of the problems that I feel probably motivated the mmap interface. Those are all back-formed reasons why you don't just want sbrk. So let's talk about the mmap interface

14:00

as documented in 4.2 BSD. So we have mmap, which of course allocates address space and alters backing mappings. We have munmap, which removes backings. mremap was documented; of the BSDs, as far as I know, none have ever implemented it. Linux does implement it.

14:20

It has all the perils of realloc, except with page management. It's interesting; it's not the worst idea, but it's really easy to screw up realloc. So, cross-DSO CFI support

14:40

in LLVM. It would be good to study how they're using it. So there are some usage models that are very useful and that are safe and manageable, and there are other ones that I find

15:00

unlikely people would get right. And in particular, once you add CHERI, which I'll talk about in a bit, to the mix, it gets quite complicated. I think it is all doable. The question is, are you adding so much complexity, are you adding interfaces that are easy to get wrong? And mmap is easy enough to screw up. We have

15:20

mprotect. mprotect alters the permissions on pages, so you can, for instance, create a writable page and then make it executable later if you're using a JIT compiler, things like that. madvise provides usage hints to the kernel. And also, for reasons that I have not quite figured out, there's MADV_FREE,

15:40

which is effectively an unmap and remap operation and probably shouldn't have been in there, but that arrived at some point. mincore gives you a one-character-per-page status of what the state of a given page is

16:00

for a range. And then sbrk, of course, was documented, and sstk, which is like sbrk but for moving the stack down. Now, in 4.2 BSD sbrk was implemented. Interestingly, sstk, as far as I can tell, is another one of those things that was never implemented.

16:22

I think I did just recently shoot it in FreeBSD, the empty implementation. Hmm? So yeah, that was a fun one. Well, there was always the stub implementation.

16:40

The stub implementation changed quite a lot over time, including getting an MPSAFE marker. But, you know, that's one of those things about code debt. So here we are, back to the first references. And then in 1990,

17:02

4.3BSD-Reno had mmap as implemented from Mach. I think I looked up when the actual commit was, but didn't write it down here. But one of the things that I found interesting when I was doing all this spelunking, which was mostly in the Unix history repo,

17:23

was that the wisdom I had gotten from just talking to people was that mmap came from Mach. But the documentation is from before any Mach release, and the Mach interfaces are totally

17:41

different and kind of weird. So I'm curious what the actual history was there. Yeah. Yeah, because I knew we had taken their stuff, but I had somehow absorbed this idea that we got mmap from Mach. But in fact

18:02

we got the implementation from Mach but not the actual interface. So, a few more points along the scale before I go and start tearing down mmap a bit.

18:20

In 2003 OpenBSD implemented W^X (write XOR execute), the idea that no page should ever be both writable and executable. This is a good thing: it prevents you from splatting shell code straight onto your stack and then running it, ideally. All that good stuff.

18:42

So there are some issues with W^X, some things you have to solve. One of them is that with a just-in-time compiler you need a writable page, but you can't have it both writable and executable. So generally what you do is you map something writable

19:02

and then you remove write and add exec. This leads to an interesting thing in the OpenBSD implementation: because there is no way to express what the maximum permissions you should allow on a page are, every page is allowed to become

19:22

executable. You have to make a system call to do it and call mprotect, but exec is always added to the maximum protections, which is a bit suboptimal. But it's because mmap can't express this concept.

19:52

So how do you do JITs?

20:05

So you turn that feature off in something that JITs. That makes sense. So there's that. And then one more that's relevant to the rest

20:20

of my discussion is that in 2012 the project that I've been working on since then kicked off. I think other people wrote proposals in 2010. But CHERI changes the way we do pointers. In CHERI, pointers have bounds and permissions associated

20:42

with them. They're a bit bigger: architecturally they look like they contain 256 bits of data, more or less; in practice they're 128 bits in memory. So we double the size of pointers, but for that we get strong bounds and we get permissions on pointers. So you could have load and store

21:01

permissions, and you have execute permissions. And then we have strong monotonicity guarantees on manipulating pointers, which is to say that if I have a pointer that has just store permission on it, I cannot add load permission to it. I cannot manipulate it to create a

21:21

capability pointer with load permission. So you need to make your decisions early. We'd like to do W^X on pointers, but mmap returns a pointer, and there's no way to get a new pointer when you decide to toggle

21:41

from read-write to execute-only, or read-execute more likely. So some API change is required to handle this situation if you want to do this. One option you could take here is that you could let mprotect return a pointer, which could then

22:01

give you a pointer with the new permissions you created. That would be one way to go, or you might need some other mechanism. So that's a place where CHERI requires that we make some changes. There's also some mmap functionality issues. mmap just manages the virtual address space as this open expanse.

22:21

It takes integer arguments, including that pointer, which is just an integer, and it conflates address reservation and mapping of stuff behind it. The lack of boundaries on reservations is what gives us things like Stack Clash, because you create a stack and from

22:41

that reference to your stack you can just keep going and you can jump into another stack. You can do math on those things. So CHERI would help you, but you can also make math errors and think you're manipulating inside that reservation you made and really be off somewhere else in the address space doing stuff

23:01

and making a mess of things. There's also some lack of expressiveness in the interface. For instance, there's no portable way to express alignment of things. You get page alignment no matter what, but if you, say, wanted superpage alignment, FreeBSD has a flag, but there's no

23:21

portable way to say I want something superpage aligned, or I want something, I don't know, megabyte aligned, if for some hardware reason you need that. There's also no way, as I alluded to before, to express maximum permissions on a mapping, which you would like to do.

23:42

So another problem with mmap: too many damn arguments. You know, how many people can just sit down and write them all down in order, right? Many of the best programmers I know are like, nope, I read the manual. If I haven't been writing mmap calls all day,

24:01

I'm reading the manual. And then many calls don't use all of them, so you have to get them all right and not screw it up. And then we have way too many failure modes. FreeBSD 11.1 had 19 documented errors, 14 or 15 of which were EINVAL. So if the kernel says no, then what did I do wrong?

24:21

What's going on? And then, you know, we made it worse. We introduced MAP_GUARD, which added I think another four documented EINVALs. And I think it's actually considerably worse than that, because there is one giant if statement of: if this and this and this and this but MAP_GUARD, no, blah,

24:41

but not this and that. It's four lines long, and it's all just, yeah, if any of this happens, EINVAL. In CHERI I got so fed up with this I added a new ktrace probe, which is sys error cause, and inserted code and

25:01

broke up all the if statements. So I splat ASCII strings into the log that say it happened this way, because your pointer permissions were this and this was expected. So there's more ways to fail, but I wrote them out. I have not fully updated to the latest one and I've not tried to upstream that, but

25:21

you know, that's a problem we have here. Some other mmap issues: there's no support for mapping more pages than requested. What I mean by that is if you were using, for instance, non-transparent superpages and you wanted to map something that's like one page short of the superpage size, and it's a file, you know, you might

25:41

want to just map a superpage-sized thing and fill it in. There's good reasons for our design, but you might want to do things like that. Also, in CHERI, as I alluded, we have 256 bits of data in 128 bits, and the way we do that is we compress our bounds. So at larger sizes

26:00

the granularity of your bounds goes up from byte granularity, you know, a little bit at a time. And so if you allocate more than a gigabyte of memory in one go in the current design, something in between one and two gigabytes of memory, your granularity is

26:21

two pages now. And so the kernel either would have to do all sorts of shenanigans to reserve that extra page and not give it up to somebody else, or demand that you do it, or it could round it up if we had a way to do that. But there's no sensible interface to say, actually, I did this. Ironically, this is something that,

26:40

for all of the horror that is System V shared memory, they got right. There's a handle to your allocation, you free the whole thing at once, and so it can be whatever size it wants so long as it's as big as you asked for. And as I alluded to previously,

27:01

there's no concept of ownership in the address space. So if you make a reservation and you get your math wrong, you could be stomping on some other stack. So that's not good. I would like to say: I want to reserve some space, I want to put some things in it, and if I try to put some things outside of it, then I clearly didn't mean to do that. So

27:22

you know, let's not support that. So there are some non-portable solutions to some things. John Baldwin added these, so we got the MAP_ALIGNED_SUPER flag, which is: give me something superpage aligned, because I know it's going to be bigger than superpages. Or in the case of RTLD, we just figure, whatever,

27:40

a little fragmentation doesn't hurt, and we actually superpage align all libraries. And also MAP_ALIGNED allows arbitrary power-of-two alignment. Those are flags you OR into the flags, so that's nice. It works well. It's non-portable, though. And then coming soon,

28:00

we're planning to bring in an additional set of protection flags. There's a PROT_MAX macro, inside of which you put the protections you want to be the maximum protections, and you OR that with your other protections, and we'll be able to set the maximum protection. So these are non-portable ways we can deal with some of these problems. And we want to do

28:20

this to help us bring W^X into the tree, because in most cases it's a one-line change to a program. You say, oh, bhyve, for instance, maps the address space of the VM PROT_NONE and then loads things into it with protections. Well, if we just make

28:42

a little tweak there and say, let's map that with a PROT_MAX of RWX, because that's what it needs, we can still map it PROT_NONE, and we don't have to worry about increasing the permissions on the pointer if the pointer were a CHERI pointer.

29:00

So now I have sort of an RFC for a new interface which solves a bunch of the problems we have, including some of the ones I've alluded to. The first part of the new interface is cmreserve. The "c" started as "capability"; now it's "cookie", because cookies are good.

29:23

My goal here is to have an interface which provides benefit even without capability memory protection or other pointer-integrity schemes, which are the reason I designed it. The goal is to have benefits for everyone. So cmreserve reserves

29:42

some address space. It takes a reservation object, which I'll get into in a bit; a length, because obviously you're reserving some address space; a hint, which is potentially the fixed address that you want to map at; protection, including maximum protection;

30:01

and it takes a request object. Once you get a successful reservation, you can then get a pointer to it. The reason we separate getting a pointer from making the reservation is that in the CHERI world

30:22

I want to be able to get a read-write pointer and a write-execute pointer or a read-execute pointer separately, and I don't ever want user space to have both. We have discussed, though I think no one has enough guts to do it, a hardware implementation

30:42

where there are no read-write-execute pointers. That is something you could do: you could literally mean there is never one, and you can't manufacture one no matter what you do. So we'd like to support that use case. Next there is cmmap, because one of the common things you do with mmap is map in part of a library: you map in

31:01

a library and then you map in the BSS at the end as a bunch of anonymous pages, so you need to be able to take your reservation and manipulate the mappings underneath it. We have a close, which says get rid of the handle we've created, which means you can't manipulate it anymore. So if you wanted immutable page mappings, this will give you immutable

31:22

page mappings; you have to exit if you want to get rid of them. We have a concept of a cmrestrict: there are a bunch of operations you could perform, and we might want to leave the handle open but only let you unmap, so that you can't monkey around with permissions anymore, for instance.

31:41

There's the idea of a stat function. The idea here is to allow introspection at a more interesting level than that provided by mincore. The information is there, for instance, to tell you

32:01

what objects are mapped into a reservation, so we'd like to be able to query that. There are some interesting challenges here, because obviously if you do this wrong you can break ASLR by enumerating through the handles. You don't want that, but we have ideas on how to fix it. And then the others, madvise,

32:22

mincore, minherit, msync, munmap, all of those would be just the same as before, except that they only operate on a single reservation or on regions within it. So, a little bit more on map requests. My idea is to have map requests follow the model of pthread_attr_t, so you have a

32:41

request object. It's an opaque thing: you initialize it, you make some manipulations to it using accessors, you set which file you want on it and whatnot, and then you pass it on to cmreserve or to

33:01

cmmap. The goal is to have useful defaults, so ideally you do as little as possible: if you just want some empty pages, like malloc does, it should be no more than about two calls. It's a little un-Unixy, but POSIX does it, and my hope is that it will be less error-prone

33:21

than the mess that is mmap. Now, a couple of CHERI extensions. In addition to the get-pointer call that we had before, I've added cmgetcap, so I can retrieve a capability pointer, not just a virtual address, and also

33:42

cmandperms, which allows you to restrict the set of permissions on future pointers returned by cmgetcap, because there's a default set of permissions that any mapping gets, which would be dictated by the PROT_MAX, but there are other permissions that normally we would give to capabilities that are created by the kernel for user space,

34:01

and you might say, eh, I don't want those, for whatever reason. So that's my proposal. Independent of my proposal, what do people think: is it time for something new? Nope?

34:37

Yes, I guess.

34:50

Yeah, that was not something I... I guess I don't think I have access to that. Is the 286 stuff in the Unix history repo?

35:01

Yeah, I mean, I glossed over it, in that I knew there were things, but there were more compromises there: less of an MMU, or things without an MMU, or more. Yeah, so...

35:37

My current proposal is about

35:40

address space management, so it doesn't do that. You could conceive of a proposal where you say, I want another mapping to the stuff behind this mapping, so a new handle, and do that. But the goal of the proposal is this: in CHERI, pointers have permissions fundamentally, so you would do the thing

36:00

where you make writable pages, write to them, and then turn them executable. But because we don't want you to ever have a pointer that's capable of both writing and executing, you need a way to get two pointers. So that's the reason for cmgetcap. Shaun?

37:06

Absolutely. Well, so part of the goal here is

37:21

you should be able to implement mmap on top of this, and in fact there's nothing stopping you from doing a poor implementation of this on top of mmap. That's the goal, and probably something we would do before we made a proposal, so that there is an adoption path that works on shitty Android phones that never

37:41

get updates. But we are talking to people who maintain other operating systems, so...

38:15

So yeah, I turn that off in CHERI. In my adaptation

38:21

of jemalloc to CHERI, I just make all the MAP_FIXED requests fail, because I know that's what they're trying to do, and you can't glue two capabilities together; we don't provide a facility to do that. We could, but that's very significant. I mean, there's no reason that you couldn't

38:41

have a syscall that says, I have these two capabilities, please glue them together and give me a new one, but that definitely is non-monotonic behavior, so you have to be very careful. That's something we've not explored, because it's turned out we haven't really needed to do it. I think it does have some impacts

39:01

in jemalloc in particular, because jemalloc has made some decisions about small buckets that mean there are really bizarre performance cliffs: slightly increasing the size of a data structure

39:21

can cause you to go into the next bucket and then allocate seven times as many pages, or do seven times as many mmaps, for the same number of small allocations, because you go from a seven-page bucket to a one-page bucket. That minimizes absolute fragmentation, at considerable performance cost in some cases.

39:41

Eric?

40:04

Yeah, some manner of call interface to be able to say, okay,

40:21

kick me back here if this page is not present, or tell me when pages are about to be kicked out. The benefits, to garbage collection particularly, are substantial. Speaking of garbage collection, have you looked into

40:40

Linux's userfaultfd interface? No. userfaultfd is for manipulating mapped areas so that you can be notified when a page is written. Traditionally,

41:01

people, especially garbage collectors, did that by installing a segmentation fault handler and then dealing with it in the signal handler, which is extremely expensive. For the garbage collector and language people that didn't matter that much, but now the virtual machine live migration people also

41:21

want the same functionality, so they made this interface, which is kind of, whatever you want to call it, and your API would actually absorb what they are trying to do a lot better. Interesting, we'll have to look at that. We do have somebody actually looking actively at garbage collection

41:42

and looking at a non-reuse model.

42:08

And if you've seen the Dune papers out of Stanford, they have a way to do the garbage collection tricks that you want at a much lower cost than you could with some other mechanisms. But my question was, I guess, I think

42:21

I really like the direction of this work in general, but I was wondering also what thoughts you had for [unclear] and things like that, especially given that we're hitting the limit on the number of cores per socket. Well, it's not something I've really thought about much, but the request object format should, I think, help us,

42:41

give us a place to have more than just our current 32 flags, so more things we can ask for, and I think that would be something to explore before anyone tried to standardize something. I definitely agree. It doesn't work on Intel.

43:17

We have a little time,

43:22

so I'll give you a little bit more here. Another point in history: in 2016 we decided, as an experiment, to ship arm64 and RISC-V without sbrk.

43:41

Mostly, it turns out, it's used for incorrect attempts to measure heap use. I fixed the documentation so it says those don't work, but it's still there; some of them require some force to disable. There are also some internal allocators. The funnest one, I think, was probably all the GNU binutils crap:

44:03

it tries really hard to find sbrk. If the symbol is there it uses it, but it's fine if there's no symbol. Actually, I don't think it will make the system call itself, but it's fine if you don't declare it; it'll just look for it and use it.

44:21

It doesn't need it, but it wants it anyway; it's very important. And then some Lisp interpreters, mostly unpopular ones. But we did win the editor war, because we broke Emacs. So,

44:42

oops. It turns out Emacs did finally, this year, implement a non-sbrk implementation; they decided to join the 80s, so it's very good. Andy? Yes, they did call us mad, but, you know,

45:01

yes, we said, you know, it's been documented since '83, guys, and sbrk has been deprecated since the early 90s. So now Emacs works without sbrk. I'm planning to finish killing these silly things off

45:20

in FreeBSD 13, because it's time. If you really need it, there are plenty of ways you could do a pure user-space implementation; really all you need to do is allocate more BSS and have a big symbol, and for most uses that'll just work. So that's all I've got. Thank you for coming. I'm happy to take any more questions if you have them.
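The "allocate more BSS and have a big symbol" suggestion for programs that still insist on sbrk can be sketched like this. It is a minimal sketch with hypothetical names and an arbitrary 1 MiB arena, not the replacement any particular port adopted.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Arbitrary size chosen for illustration. The zero-initialised static array
 * lands in BSS, so it costs nothing on disk and pages are touched lazily. */
#define TOY_HEAP_BYTES ((size_t)1 << 20)

static char toy_heap[TOY_HEAP_BYTES];  /* the "big symbol" */
static size_t toy_brk;                 /* current break, as an offset */

/* Hypothetical drop-in with sbrk()-like semantics: move the break by incr
 * bytes and return the previous break, or (void *)-1 with errno set. */
void *toy_sbrk(intptr_t incr)
{
    size_t old = toy_brk;

    if (incr >= 0) {
        if ((size_t)incr > TOY_HEAP_BYTES - old) {
            errno = ENOMEM;
            return (void *)-1;
        }
        toy_brk = old + (size_t)incr;
    } else {
        /* Compute |incr| without overflowing when incr is INTPTR_MIN. */
        size_t dec = (size_t)(-(incr + 1)) + 1;
        if (dec > old) {
            errno = EINVAL;
            return (void *)-1;
        }
        toy_brk = old - dec;
    }
    return toy_heap + old;
}
```

A program that only grows and shrinks a private heap through this function sees the same contiguous-break behaviour it expects, bounded by the size of the static array.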

45:58

It's really annoying that

46:02

you currently use segmentation faults for this kind of thing, because just building the signal context is so expensive, and the use of [unclear] would make it an order of magnitude cheaper, and you have

46:20

a much easier time, not just by the page but by the word. That would be a really good thing. Yes, so we have people who are actively working on garbage collection; I will send them this and say they should implement it in FreeBSD so they can test it. And they're good, so they should do a good job.

46:43

I think we should do that regardless of what happens with an mmap alternative. For sbrk, we could... I'm a little worried that if I put the linker warnings in

47:01

I will break ports too much. I mean, you can always ignore the deprecation warnings. I could do that, if you think that would be appropriate. Oh yeah, Ed was pushing me to do an

47:21

exp-run, so we'll get that done soon. I know a new one popped up recently, which is why the documentation was updated, because I broke something; it was again some old interpreted language.

47:42

Thank you, everyone.