pidfds: Process file descriptors on Linux
Formal Metadata
Number of Parts | 44
License | CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/46127 (DOI)
All Systems Go! 2019, 14 / 44
Transcript: English (auto-generated)
00:05
Next up, Mr. Christian Brauner. Hey, so I just came back from Paris yesterday evening, and the first thing I saw on Twitter about this conference is that Lennart is going to steal our home directories. It was great news.
00:21
No, joking aside, he's been talking about this for the last couple of weeks and months already. Right, I'm Christian. I write code. I work for Canonical right now. I mostly do upstream kernel work. I also maintain LXC and LXD, container runtimes you might know about.
00:41
And we do a lot of work in the upstream kernel for new container features and security, crossing a bunch of subsystems, and also maintaining a few bits and pieces. And this is work we've been doing over the last couple of kernel releases. I spoke about this in depth for the first time, I think, at Linux Plumbers,
01:03
a couple of weeks ago. And this is a concept which we call pidfds. Who has heard of that, by the way? Ah, some people read LWN.
01:20
Okay, so in short, what is a pidfd? The idea is that it's a file descriptor referring to a process. Don't worry, I'm going to go into detail about which other systems had that before and so on. It's not a super novel idea. So it's a file descriptor referring to a process, and specifically, right now,
01:40
it is a file descriptor that refers to a thread group leader. If you have any questions about this or other stuff, you can just yell right away. Right, so it refers to a thread group leader. Right now it's not possible to make a file descriptor refer to a single thread, but we don't exclude the possibility of doing this in the future.
02:03
The reason for this is that at the time when we started doing this work, there was no real reason to do it. Nobody really wanted it or yelled that they wanted it. And also it would have made the whole code a lot more complicated. Thread group leaders in the kernel are really horrible in general.
02:23
So it's a stable private handle: a PID file descriptor guarantees that you maintain a reference to the same process as long as you hold that FD open. And it works in a very specific way. I will go into this in a little more detail in a bit.
02:42
And pidfds use a pre-existing stable process handle that the kernel already knows about, which is struct pid. It's already used in /proc to pin a process: all the /proc/<pid> directories stash away a reference to struct pid. So the first question, if you know a little bit about the kernel,
03:03
you might ask is: why are we not using a task_struct? Ideas? Takers? So struct pid, if you read the comment in the code, is the kernel's way of maintaining a stable process handle,
03:23
but without having to pin task_struct. Why is that a problem? If you look at task_struct in the kernel, it's like this big chunk of code. There are lists in there and pointers and probably arrays and all that kind of stuff. And so struct pid is a cheap way of getting around the problem
03:43
of having to pin a lot of memory in the kernel for a long time. There are various code paths in the kernel that take references on struct pid to keep it alive, because they need to look at some information, or need to look up who's the thread group leader or who's the session leader and so on.
04:00
So struct pid has a bunch of members. You already see something interesting in there which we added, which is the wait queue for pidfd notifications. We will get to this in a bit. And the idea is that from a struct pid, you can get to all of the interesting task_structs which reference a type of PID that you're interested in. So the kernel differentiates between a thread group leader,
04:22
a process group leader, and a session leader. There's an array in there which you can get at from a struct pid. Right, with struct hlist_head tasks[PIDTYPE_MAX], you can get to all the PIDs that are used.
04:44
Or rather, you can get to all the task_structs that use this struct pid for a specific PID type. You can get at the process group leader, the session leader, or the thread group leader, depending on what information you need.
05:01
So that's a pidfd. A pidfd stashes a reference to struct pid. It doesn't go away. And why do this in the first place? I mean, this is usually the first question. I would have been happy to just write it for my own entertainment, but we actually had a bunch of use cases. The first one is pretty obvious and it comes up a lot, even though it's always heavily debated
05:22
whether or not this is a real issue and whether there are other ways to fix it: PID recycling. So, avoiding the pitfalls of PID recycling on high-pressure systems. PID allocation, PID number allocation, I should be precise, works cyclically in the kernel. That means the kernel keeps on ramping up the PID number
05:40
until it hits the maximum number of PIDs on the system, and then it wraps around and takes the next free PID number. So if you have a lot of processes exiting and so on, then you can get into a situation where a process's PID has been recycled while you're still operating under the assumption that it's actually the prior process that you're operating on.
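The wrap-around behaviour described here can be sketched with a toy allocator. This is a simplified model for illustration only, not kernel code: the class name `CyclicPidAllocator` and the tiny `pid_max` are assumptions, and details of the real allocator (such as reserved low PID numbers) are ignored.

```python
class CyclicPidAllocator:
    """Toy model of cyclic PID number allocation. Not kernel code."""

    def __init__(self, pid_max):
        self.pid_max = pid_max   # PID numbers run from 1 to pid_max - 1
        self.in_use = set()
        self.last = 0            # most recently allocated PID number

    def alloc(self):
        # Walk upwards from the last allocated number, wrapping around
        # at pid_max, and take the first number that is free.
        span = self.pid_max - 1
        for step in range(span):
            candidate = (self.last + step) % span + 1
            if candidate not in self.in_use:
                self.in_use.add(candidate)
                self.last = candidate
                return candidate
        raise RuntimeError("no free PID numbers")

    def free(self, pid):
        self.in_use.discard(pid)


def demo():
    pids = CyclicPidAllocator(pid_max=5)
    first = pids.alloc()   # the process we think we are tracking
    pids.alloc()
    pids.free(first)       # that process exits
    # Later allocations keep counting up, then wrap onto the freed number.
    later = [pids.alloc() for _ in range(3)]
    return first, later


print(demo())  # (1, [3, 4, 1])
```

The freed number is not reused immediately, but once the counter wraps, a new process gets the old number. Anyone still holding only the number 1 would now be addressing a different process.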
06:02
This is basically something you can use for timing-based attacks, of which we actually had quite a few. There is one which Jann found against Android's getpidcon. So you could trick it into operating on the wrong security context,
06:22
if I remember correctly. There are actually two bugs related to this, or two CVEs related to this, the first two. So you can have a look at those if you're interested. There are a bunch of PID-based macOS exploits, which I didn't know about. macOS, as far as I know, doesn't have the concept of a process file descriptor.
06:41
There is a bunch of stuff in Qt that had problems with this. There are a bunch of, yeah. The last ones are actually hardware service manager arbitrary, yeah. The last ones are the GetPid exploits. I think the first one might actually be for polkit,
07:00
which you could attack with this as well. So this is really an issue. One thing you can do to make it at least less likely to run into this issue is to bump the maximum number of PIDs to four million, I think, which systemd started doing at some point. You could also probably get around this problem by using UUIDs and not file descriptors.
07:23
There was a lot of discussion going on about how exactly we should solve these problems. We went with pidfds, and I'm going to explain why I think there is a good reason for that. So PID recycling was one of the issues we really, really wanted to get around. And why do this in the first place?
07:43
Again, there were a bunch of other reasons. One thing that came up repeatedly was shared libraries: I want to spawn off invisible helper processes. What does this mean? So exit notification on Linux works in the following way.
08:00
Oh, by the way, if anyone knows more than me, please yell as well. So you usually get a SIGCHLD signal on process exit, right? That's how you get informed. But, for example, say you have a main loop running, like a large main loop where you have a lot of callbacks, and one of your callbacks is there to reap helper processes.
08:22
Now, this callback calls waitpid(-1) or waitid(P_ALL), which means wait for all my children. So it wants to wait for all children, which is specifically a use case that init systems probably want to have or want to support.
08:41
But now you can end up in a situation where any other callback in your main loop could have spawned off a helper process and also relies on SIGCHLD and exit signals being received. So now the wrong helper wakes up in your loop, gets a SIGCHLD signal, calls waitid(P_ALL),
09:00
and is like, yes, I'm going to reap all of my children now, and then accidentally reaps someone else's child. The other one now gets confused as to where the hell its child is. So this is really not a nice situation. So that's forking off invisible helper processes. With some work you can do it, but then you run into issues with thread safety and signal handlers.
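The reaping problem can be reproduced in a few lines. This is a minimal sketch assuming a POSIX system and a process with no other children; the function names `spawn_helper` and `greedy_reaper` are made up for the illustration.

```python
import os


def spawn_helper(exit_code):
    """Fork a helper process that exits immediately with exit_code."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)
    return pid


def greedy_reaper():
    """A callback that reaps any child at all, as waitpid(-1) does."""
    pid, status = os.waitpid(-1, 0)
    return pid, os.WEXITSTATUS(status)


def demo():
    helper = spawn_helper(7)             # owned by some library callback
    reaped_pid, code = greedy_reaper()   # an unrelated callback runs first
    try:
        os.waitpid(helper, 0)            # the owner now tries to collect it
        owner_result = "reaped"
    except ChildProcessError:            # ECHILD: the child is already gone
        owner_result = "ECHILD"
    return helper == reaped_pid, code, owner_result


print(demo())
```

The owner of the helper never sees the exit status, because the generic reaper consumed it first.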
09:22
There is a long blog post out there, I think from Thiago, who is one of the Qt project maintainers and wants to make use of this feature. So pidfds, as we'll see, hopefully make it possible to spawn off invisible helper processes pretty nicely. They also allow you to get notifications for process exit
09:43
as a non-parent process in a clean way. And process management delegation in general: a handle to a non-parent process for a bunch of operations that you want to perform, which you cannot safely do right now. I mean, you can pass a PID, but if you're the parent, you usually know that this is your child,
10:02
and you can be sure that this is still your child. If you're a non-parent, you have more trouble figuring this out: you need to parse through /proc, look at start times, all kinds of hacks. CRIU has had issues with this for a long time, for example. So yes, handing off a handle to a non-parent process for waiting, signaling, whatever:
10:21
if you have an FD as a stable private handle on a process, that problem should go away. Another reason why we went with FDs is the ubiquity of FDs, which sounds like a trivial argument, but I think it's actually not. There are common patterns everywhere in code bases that make use of FDs.
10:42
Most people have an epoll loop to listen for events on file descriptors. Most people have parsing logic to parse out FD information from /proc/self/fd/<fd number>, and then fdinfo or something, and have logic for sending around file descriptors via SCM_RIGHTS and so on.
11:00
So there is not a lot of adapting that you have to do. If you had to build UUIDs on top of everything now, it would have been annoying for the kernel to generate them and hand them out to user space. I know Lennart still wants them very much. You're not going to get them. And so FDs seemed to be quite the obvious solution.
11:24
And also, and here's where we get into this part a little bit, there was prior art for this as well, or similar art before that. So the ubiquity of FDs, I think, is a pretty good argument for doing it this way. And last but not least, does user space really care about this feature?
11:41
Nowadays, when you try to bring something into the kernel, it's usually not that you just get to do it. You usually have to say, this is a problem, and people really care about this. So we needed a justification for why we wanted to do this, even though it seems obvious that it makes a problem go away
12:02
that is a pretty big deal. Yes, user space really cares about this. Some of these projects talked to me before, some of them wrote mails afterwards, and some of them I just found out about by pure chance, by people pointing out, hey, they're making use of this feature.
12:20
And this list keeps growing. One is D-Bus. D-Bus has an issue open where they want to switch from PID-based authentication to pidfd-based authentication, because they have issues with PID recycling as well, or at least they're afraid of running into issues with this. Qt wants to use it for subprocess management,
12:41
so that's forking off invisible helper processes, which I mentioned before. For systemd, the only open issue that I currently know about is using it to reliably kill processes per cgroup. But they probably have other use cases for this as well in the future. CRIU uses it to detect PID reuse.
13:02
It has a function that is called detect PID reuse, which is a hack, correct me if I'm wrong. And we can switch out this function for pidfds, using them to reliably track processes. Android's low memory killer daemon is using pidfds as well.
13:22
They were actually one of the first that got really excited about this. It all derived from a debate. So, in part, I had an argument, or a discussion, with Kees Cook and, I think, David Howells a while back at Linux Security Summit, somewhere in Edinburgh or something,
13:40
where we talked about various things that we should do or could do, and how we should do it, which is where the pidfd stuff started. Then there was also a discussion in parallel that started on the mailing list, where people started to hack around in /proc to make it at least possible to send signals via files and so on. So there was a lot happening at the same time, and this is ultimately the approach that came to fruition.
14:04
So for the Android side, I have to give it to Joel, who works for Google as a kernel engineer, who helped with this work, and who also did a lot of the polling work that we'll see in a bit. So the low memory killer daemon is using pidfds to reliably track processes and kill them.
14:25
This basically runs on Android 10 already. Oh, no, not pidfds, sorry. They will be in Android 11. They backported it to all of the LTS kernels, and they're definitely going to use it to get around issues where they have to make sure that the process that they are killing
14:41
is really the process that they want to kill and so on. And bpftrace wants to switch to pidfds as well. They have an issue open as well. Don't just trust me, click on the links to verify. Maybe they're all wrong. And there's prior art. This goes back to an earlier slide.
15:06
Why do this in this specific way? Well, there is precedent in other systems, and I usually think it's not a good idea to deviate too much between different operating system implementations, because it makes it horrible for user space that at least tries to be compatible across different operating systems
15:25
to write working code. I have to admit I did something which is not very smart from the perspective of a kernel guy: I didn't look at other kernels before actually starting this work.
15:40
I did it later. And we were kind of lucky that we didn't run into a lot of the issues they originally ran into, but that was just by pure chance and by having a lot of smart people yell at our work. But it also means I was falsely under the impression, for example, that Solaris had pidfds, which is wrong.
16:01
They actually don't have it. At least illumos, the open source implementation of Solaris, only has a pure user space emulation of stable process handles: PROC OPEN, PROC RUN, PROC CLOSE, and PROC FREE. But they have the same problem, essentially. OpenBSD and NetBSD don't have it.
16:21
FreeBSD is the only system other than Linux that has it. They have a concept called procdesc, or process descriptors. And they have three system calls, pdfork, pdgetpid, and pdkill, and they have taken slightly different decisions than we have on Linux, parts of which are implementation-based,
16:42
or most of which are implementation-based. If you have questions about this, I can go into detail, but there's probably not going to be enough time. pdfork gives you back a PID file descriptor, essentially, a process descriptor. pdgetpid allows you to translate it to a PID, and pdkill is used to send a signal through one of those file descriptors.
17:02
And on Linux, there were multiple attempts to get this into the kernel. There was forkfd, which I think was originally done by Thiago from Qt as well, or at least was one of his suggestions. And then there was another patch set which is called CLONE_FD. None of those made it.
17:22
And I think one of the reasons... The patches were fine, they were interesting and had interesting new concepts, but for example, CLONE_FD tried to do a lot of things at the same time. They mixed auto-reaping semantics with file descriptors for processes and so on, so there was a lot of contention about how to actually do this correctly, and I think ultimately it didn't go in
17:44
because it tried to do too many things at once. Maybe there was another reason that I didn't see from the thread, but there was a lot of stuff in that patch set. So, okay. We started building a new API around process management, and I want to start with this right away, which I also did at LPC.
18:04
My intention has never been to say we have pidfds and we have PIDs, they are totally separate worlds, and you either use the one or the other. I think that is the wrong way to think about this. Pidfds get around a very specific problem that you have in multiple but rather specific situations.
18:26
So you probably want a way to cross between PIDs and pidfds, and to use both at the same time. It's not like we're deprecating the PID API and only going with pidfds in the future. That's probably not what's going to happen.
18:41
It may be the case, and this is what I expect, that a lot of new interesting features that people care about can be built upon pidfds, just because they are a stable process handle, which you couldn't build upon PIDs. So the first thing that we did in 5.1
19:00
was implement a new syscall, which is called pidfd_send_signal, which allows you to send a signal through a pidfd. This was the really obvious piece, because PID recycling is usually a concern when sending signals: you're operating on something that is not yours. So it's pretty obvious to make the argument,
19:21
look, this lets us cleanly solve an issue. It's clearly something that user space has run into. There are a bunch of people who pointed out that this is an issue for them. So we should do this. And it's actually not a lot of code, as you can see here. Oh, there's a bit more to it down below, but the whole FD handling part
19:40
is encapsulated in what you see here at the top. The first controversy that we had about this was what exactly a pidfd is going to be. And people had very strong opinions about this, which also derives from the fact that it's an obvious problem, in the sense of: oh, it solves something really obvious, here's my opinion, here's how we should do it, and I'm not going to back down.
20:01
So we kept yelling at each other for a long while, which is usually what happens. And the first idea was to use /proc/<pid> directory FDs as pidfds, because they already pin struct pid, they're pretty easy to come by, and it's a nice shortcut.
20:21
You call open on /proc/<pid> for the process you're interested in, ignoring for now that this also has PID recycling issues, then you have an FD that can't be stolen from you, and then you stuff it into pidfd_send_signal and you send a signal to it. And if the process has exited behind your back,
20:41
and it's not around anymore, and you send a signal, you get ESRCH, which is kernel speak for no such process. So this brings me to another point. We don't pin PIDs. We don't pin PID numbers. Holding a pidfd doesn't mean that your PID is not going to be recycled. Your PID is going to be recycled. We don't care about this. The FD is your stable handle.
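The /proc/<pid> shortcut described above can be tried from Python as a rough sketch. It assumes Python 3.9 or later (which exposes `signal.pidfd_send_signal`), a kernel that provides the syscall, and a mounted /proc. The helper names and the "unsupported" fallback label are assumptions made for this illustration, so that the script degrades gracefully on systems without the feature.

```python
import errno
import os
import signal


def send_through_fd(fd, sig):
    """Send sig through a process FD; return 'sent', 'ESRCH' or 'unsupported'."""
    send = getattr(signal, "pidfd_send_signal", None)
    if send is None:
        return "unsupported"
    try:
        send(fd, sig)
        return "sent"
    except ProcessLookupError:           # ESRCH: no such process
        return "ESRCH"
    except OSError as exc:
        if exc.errno in (errno.ENOSYS, errno.EBADF, errno.EINVAL, errno.EPERM):
            return "unsupported"
        raise


def demo():
    child = os.fork()
    if child == 0:
        signal.pause()                   # wait until we are killed
        os._exit(0)
    try:
        fd = os.open(f"/proc/{child}", os.O_RDONLY | os.O_DIRECTORY)
    except OSError:                      # no usable /proc on this system
        os.kill(child, signal.SIGKILL)
        os.waitpid(child, 0)
        return "unsupported", "unsupported"
    try:
        first = send_through_fd(fd, signal.SIGKILL)
        if first != "sent":
            os.kill(child, signal.SIGKILL)   # fall back so the child does not linger
        os.waitpid(child, 0)                 # the child is now reaped
        second = send_through_fd(fd, 0)      # signal 0 only checks for existence
    finally:
        os.close(fd)
    return first, second


print(demo())
```

On a system with the syscall available, the first call should deliver the signal and the second should report ESRCH, even if the PID number has meanwhile been handed to another process: the FD still refers to the original, now dead, process.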
21:01
The PID can be recycled. We're not stopping the kernel from doing PID allocation or something. So, right, we use /proc/<pid> FDs as sort of a shortcut that is really handy for user space. But then we started, or rather we had already thought about this,
21:21
we were faced with implementing the part where you return a file descriptor from one of those nice fork or clone functions that we have on Linux. And here is where we ran into really interesting problems. So Jann and I started discussing this, because he had good input on it.
21:41
And some people had the opinion that clone should just return file descriptors from /proc/<pid>. Sounds straightforward if you ignore all of the security issues. For example, there is a net directory in /proc/<pid> which allows you to snoop on the traffic
22:00
of another process. It's also really horrible in terms of how file systems, and especially the proc file system, work in the kernel. If you want to return a /proc/<pid> file descriptor, you have to pre-allocate a dentry, well, you have to pre-allocate something that the kernel uses internally to refer to a file, and then later on splice it into proc.
22:21
Believe me, it was really nasty code. So what we did was write both implementations, our preferred implementation and the implementation that some other people preferred, which was a lot of work. The one implementation showed /proc/<pid> file descriptors used as pidfds,
22:40
and the other one showed our implementation, and I was like, really, if you compare them, this one is this much code and that one is this much code. And so people were like, yes, let's go with the implementation that you wanted, which was lucky for us. I think it saved us a lot of headache, and I would have been very unhappy otherwise. Actually, I considered abandoning this if we went with /proc/<pid> directories, but we didn't,
23:02
so we got lucky. So 5.2 is where we landed support for returning file descriptors from clone. We were always under the impression that all of the clone flags were gone. Ha, no, there was one left, which no one knew. I mean, we only saw this because Linus pointed it out.
23:23
No, there is one flag bit left. I always assumed we were out, but okay. So we added a new flag called CLONE_PIDFD, which creates pidfds at process creation time to let you completely get rid of the race where your PID can be recycled. They have a bunch of interesting properties.
23:41
I'm going to talk about those a little bit. There are a bunch more places that we had to touch, or that I had to touch, but overall this is more or less the code that you would see in the kernel-internal clone function, or fork function in this case. Ignore the comment for now, but if the flag is set,
24:01
you allocate a new file descriptor and return it to user space. It stashes away a reference to a struct pid, which is the kernel-internal notion of a stable reference to a process. And you see these are anon inode file descriptors. If you use a timerfd, if you use an eventfd,
24:20
if you use a signalfd, if you use an FD for the new mount API, and there are probably a bunch more I'm forgetting. Ah, the seccomp notifier FD. They're all anonymous-inode based, which is just a single inode in the kernel that gets allocated when the kernel boots up, and then you can get a new file from this, so the inode number is the same for all of them.
24:42
It basically means you don't need to allocate an inode, and that's why it's very cheap. It doesn't waste memory, it doesn't waste allocation time, and so on. So this is ideal, more or less. You stash away a bunch of operations you want to allow in that file descriptor. So this is the pidfd_fops stuff.
25:01
pid is the struct pid that I talked about. So it's pretty simple code overall. And one of the things that we also did is we made pidfds close-on-exec by default. So if you get a pidfd back, you really want to be sure that it doesn't leak into the child process when it execs, for example.
25:22
And yeah, so this is what we did, and I think if you are bringing any new file descriptor type into the kernel, please make it close-on-exec by default. I tried to convince people to do this for the mount API. Didn't fly well. But it really helps user space.
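The close-on-exec default described above can be observed from user space. The following is a minimal sketch, assuming Linux 5.3 or later and Python 3.9 or later, where `os.pidfd_open` wraps the `pidfd_open` system call that comes up a little later in the talk. The helper names `is_cloexec` and `pidfd_for_self_is_cloexec` are made up for illustration.

```python
import fcntl
import os


def is_cloexec(fd: int) -> bool:
    """Return True if the close-on-exec flag is set on fd."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)


def pidfd_for_self_is_cloexec() -> bool:
    """Open a pidfd for the current process and report its close-on-exec state.

    Assumes a kernel and Python that provide pidfd_open; raises OSError or
    AttributeError otherwise.
    """
    pidfd = os.pidfd_open(os.getpid())
    try:
        return is_cloexec(pidfd)
    finally:
        os.close(pidfd)
```

On a system that supports it, `pidfd_for_self_is_cloexec()` is expected to return `True` without the caller having asked for the flag, which is the point being made about leaking descriptors across exec.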
25:41
It's one of the major pain points, actually, that you get file descriptors that stay open after you exec. And close-on-exec is really easy to set. Like, this is all it takes: O_RDWR | O_CLOEXEC, and then you get an O_CLOEXEC file descriptor back. Really makes life easier in user space. And another property is that we added an fdinfo file.
26:01
So the fdinfo file will contain the PID of the process as seen from the PID namespace with which your procfs mount was mounted. So any procfs mount, especially in containers, is attached to the container's PID namespace, if you remount it, at least. And in a new PID namespace, the PID that the process has will be different
26:20
from the one that it has in one of the ancestor PID namespaces. And so we write it in there. So if you parse it, you will get the PID of that process in your procfs instance, which is, for example, helpful if you send around a pidfd that was created in another PID namespace, and this is how you get the PID.
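Parsing that fdinfo file could look roughly like this. It is a sketch under the assumption that the entry is exposed as a `Pid:` line in `/proc/self/fdinfo/<fd>`; the function names are hypothetical and the sample text in the usage note is illustrative, not captured output.

```python
from typing import Optional


def parse_pidfd_fdinfo(text: str) -> Optional[int]:
    """Extract the value of the 'Pid:' field from fdinfo text, or None if absent."""
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "Pid":
            return int(value.strip())
    return None


def pid_from_pidfd(pidfd: int) -> Optional[int]:
    """Return the PID behind pidfd as seen from the procfs instance mounted at /proc."""
    with open(f"/proc/self/fdinfo/{pidfd}") as f:
        return parse_pidfd_fdinfo(f.read())
```

For example, `parse_pidfd_fdinfo("pos:\t0\nflags:\t02000002\nPid:\t4242\n")` yields `4242`. A receiver that got the pidfd from a process in a different PID namespace would call `pid_from_pidfd` to learn the PID that is valid in its own procfs.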
26:41
But we also made it such, and this was Oleg's idea, Oleg Nesterov's idea. Originally, we had it implemented in a way that when you set CLONE_PIDFD, you got a file descriptor instead of a PID, which is problematic in multiple ways because file descriptors start at zero, and zero is obviously used to differentiate between the child and the parent,
27:02
so you cannot really return zero as a valid file descriptor. So pidfds would have started at one, which is not nice. So for legacy clone, we made it such that we abused one of those return arguments it has, where usually the PID of the parent process,
27:23
the PID of the child process is placed, to return a pidfd. So if you set CLONE_PIDFD, you get the PID back, and you get the pidfd back at the same time, so you have no disconnect. You know both at the same time, which I think is pretty nice. It's even a little bit nicer, I think,
27:40
than FreeBSD's pdfork. And then in 5.3, I think this is really exciting. We added, Joel added, polling support. This is something that they really wanted for the low memory killer daemon, so that you can get exit notification for non-parents. And in a more complex sense,
28:01
it allows you to turn off the exit signal, which means that when a process exits, you can tell the kernel already today that I don't want a SIGCHLD. I don't want a SIGCHLD signal. I'm not going to explicitly ignore it, because then you would auto-reap it. I just don't want a signal when the process exits, which was a bit problematic,
28:20
because then how do you know when the child has exited and so on. And the polling support will allow you to do this, because what it essentially does is, as soon as the process exits, so the thread group leader exits, and the thread group is empty, then you get an exit notification saying, I'm ready, I have exited.
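From user space, the exit notification described above amounts to polling the pidfd for readability. A minimal sketch, assuming Linux 5.3 or later and Python 3.9 or later; `wait_for_exit` and `spawn_and_watch` are hypothetical helper names, and the five second timeout is arbitrary.

```python
import os
import select


def wait_for_exit(pidfd: int, timeout_ms: int = 5000) -> bool:
    """Poll pidfd until the process exits; return True if it exited within the timeout."""
    poller = select.poll()
    poller.register(pidfd, select.POLLIN)
    # The pidfd becomes readable once the whole thread group has exited.
    return bool(poller.poll(timeout_ms))


def spawn_and_watch() -> bool:
    """Fork a short-lived child and wait for its exit notification through a pidfd."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately
    # Safe against PID reuse here: we are the parent and have not reaped the child yet.
    pidfd = os.pidfd_open(pid)
    try:
        exited = wait_for_exit(pidfd)
    finally:
        os.close(pidfd)
    os.waitpid(pid, 0)  # the parent still has to reap the child
    return exited
```

Because the notification arrives on an ordinary file descriptor, the same `pidfd` could be registered in an existing event loop next to sockets and timers instead of being handled through a signal handler.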
28:41
So if you hand off one of those pidfds to a non-parent process, you get reliable exit notification, and you don't need to rely on SIGCHLD signals and so on, which is pretty nifty. Also pretty small code, it's not a lot to do. This is actually in two different files, doesn't matter, you can grep for it if you're really interested in it. It's the polling implementation,
29:01
so yeah, the one caveat that we have: poll only notifies when the whole thread group exits. If the thread group leader exits before the other threads in the group, then poll should block, similar to the wait family. That's actually a problem you can run into, and that's why thread group leaders are not really nice. But yeah, polling support is pretty exciting for process management. And in 5.3, we added another syscall,
29:25
right, another syscall, to get pidfds without CLONE_PIDFD: pidfd_open, the idea being that when you have already forked the process, you sometimes still want to create a pidfd, especially if you want to watch a bunch of other processes, PIDs,
29:42
and you can't rely on them using CLONE_PIDFD. pidfd_open will allow you to do just that, it gets you a new pidfd. It also verifies that you give it a thread group leader, so you can't get a pidfd for a thread. Yeah, that's 5.3 and 5.4.
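As a rough illustration of working with an already forked process through `pidfd_open`, the sketch below opens a pidfd for a child, signals it through the descriptor, and then reaps it with `waitid` and `P_PIDFD`, the 5.4 addition. It assumes Linux 5.4 or later and Python 3.9 or later, where `os.pidfd_open`, `signal.pidfd_send_signal` and `os.P_PIDFD` are available; `terminate_via_pidfd` is a hypothetical name.

```python
import os
import signal
import time


def terminate_via_pidfd() -> int:
    """Fork a sleeping child, kill it through a pidfd, and return the signal that ended it."""
    pid = os.fork()
    if pid == 0:
        time.sleep(60)  # child: idle until signalled
        os._exit(0)
    pidfd = os.pidfd_open(pid)  # fails if pid is not a thread group leader
    try:
        # SIGKILL cannot be caught, so the outcome does not depend on inherited handlers.
        signal.pidfd_send_signal(pidfd, signal.SIGKILL)
        info = os.waitid(os.P_PIDFD, pidfd, os.WEXITED)
    finally:
        os.close(pidfd)
    # For a child killed by a signal, si_status holds the signal number.
    return info.si_status
```

Both the signal and the wait go through the descriptor rather than the numeric PID, so neither can hit an unrelated process that happens to have reused the PID.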
30:01
Excellent. That was proposed, but that's actually no longer true: Linus pulled it from me, so we have that in. You can now wait on processes through pidfds, so waitid has gained a new flag, P_PIDFD, you pass it a pidfd instead of a PID, and then it retrieves it,
30:21
and then it waits on the pidfd, which I think is pretty neat. And we have a bunch of work planned. There's a lot of work or ideas that I have, I'm just going to speak about two of them. This at least came up as one of the original ideas,
30:43
or the first ideas that we had was to make it possible, similar to what FreeBSD has. On FreeBSD, if you have a process file descriptor, it is kill on close by default, which means if you close the last file descriptor, which has a reference to the corresponding struct file
31:01
in the kernel, then it will kill the process, which is pretty neat. I wanted to do it the other way around, so that's not the default: if you close the last pidfd, then things are still fine, but we could add a flag that is set at process creation time,
31:23
you can't take it away afterwards, so no prctls or that nonsense, and then you kill the process when the file descriptor is closed. There is a problem though that has been bothering me and that I've been thinking about, which is: on FreeBSD, closing a file descriptor is synchronous,
31:42
so that means when close returns and that FD was the last reference to the struct file inside of the kernel, then by the time close returns, you can be guaranteed that all of the cleanup operations have finished. Linux is smarter in some ways, or let's say complicated maybe,
32:01
in the sense that close can return without the corresponding release or cleanup method that belongs to the struct file that the last FD referred to having been run. It adds it to a work queue or a kthread, and then at some point when the kernel thinks, ah, it's fine, I have some time to run this right now, then it cleans up everything.
32:22
So that means if you close the last FD, close returns, you are not guaranteed that the process is dead, which is not ideal, but usually the kernel cleans it up really quickly or calls the release method for the corresponding file fairly quickly. It should only delay it
32:40
when there's a lot of memory pressure, for example, in which case you're screwed anyway. And another thing, which is for the shared library case, is exclusive waiting. So right now, anyone can still wait on the process that you forked off, either via a PID or via a pidfd.
33:01
There is no differentiation there right now, and I think that just makes sense because then you have this connection between the pidfd and the PID API, but there should be a way to say, I'm now going to separate this connection. And exclusive waiting would allow you to do this, so you have something like CLONE_WAIT_PID (flag name open for discussion), which hides the process from generic wait requests.
33:21
This is similar to what FreeBSD has as well, but I actually would like to make it stronger, which goes back to a discussion I had with Eric a while back. So you would only be able to wait on a process through a pidfd, as long as there is a pidfd referring to it,
33:42
and when the last pidfd is gone, the process gets auto-reaped. Oh, do you know auto-reaping semantics? Thank you. So auto-reaping semantics really means you ignore SIGCHLD explicitly. You say, I don't want SIGCHLD, and I explicitly tell you that I don't want SIGCHLD,
34:00
at which point the kernel says, okay, then I'm going to clean up that process for you. If it exits, then it exits, and it's gone, which is really neat. It's different, for example, on FreeBSD, which is why they have chosen to implement process file descriptors a little bit differently. If you explicitly ignore SIGCHLD on FreeBSD, then the process doesn't get automatically cleaned up by the kernel.
34:23
It gets re-parented to PID 1, and PID 1 then gets a signal for that process, which is basically saying, I'm done, you take care of it. Whereas Linux really cleans it up. So it would be really nice if you have CLONE_WAIT_PID. As long as you have a pidfd, you can wait on the process explicitly.
34:40
If you close the last pidfd, you're telling the kernel, I don't care about this process anymore. If this process exits, just clean it up. The problem is the implementation, usually. It's pretty tricky to get right, I think, but I might put a patch set up there soonish.
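The existing Linux auto-reaping semantics mentioned above (explicitly ignoring SIGCHLD) can be demonstrated without any pidfd at all. A minimal sketch; `child_is_auto_reaped` is a hypothetical name, and the function has to run in the main thread because Python only allows changing signal dispositions there.

```python
import os
import signal


def child_is_auto_reaped() -> bool:
    """Ignore SIGCHLD explicitly, fork a child, and check that nothing is left to reap."""
    previous = signal.signal(signal.SIGCHLD, signal.SIG_IGN)
    try:
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child: exit immediately
        try:
            # With SIGCHLD ignored, this blocks until the child is gone and
            # then fails with ECHILD because the kernel already cleaned it up.
            os.waitpid(pid, 0)
        except ChildProcessError:
            return True
        return False
    finally:
        # Restore the earlier disposition (None means it was not set from Python).
        signal.signal(signal.SIGCHLD, previous if previous is not None else signal.SIG_DFL)
```

The proposed exclusive-waiting flag would tie this cleanup to closing the last pidfd instead of to a process-wide signal disposition; that part is only an idea in the talk and is not shown here.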
35:01
And there's a bunch of other ideas, and I could keep talking, but I probably shouldn't. So yeah, this is what we built over the last couple of kernel releases. We're at 5.4. We've obviously also been fixing a lot of bugs along the way, so this wasn't a wholly bug-free process. But overall, I think it was pretty good.
35:22
It was also a pretty good collaboration effort. A lot of people took part in the discussions, brought in really good ideas, gave reviews and helped with this. So I could probably give a shout-out to a lot of people here. But yeah, that's about it. So if you have questions, go at it.
35:43
Yes? So what about integration with SCM_CREDENTIALS
36:01
and other places where you send a PID to someone else? What do you mean? Could we have a flag where SCM_CREDENTIALS contains a pidfd instead of a PID? You can just send it as a regular FD.
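Sending a pidfd "as a regular FD" means passing it over a Unix domain socket with SCM_RIGHTS, like any other descriptor. A minimal sketch, assuming Python 3.9 or later for `socket.send_fds` and `socket.recv_fds`; `pass_fd_over_socket` is a hypothetical helper that uses a socketpair inside one process purely for illustration, where a real sender and receiver would be different processes.

```python
import socket


def pass_fd_over_socket(fd: int) -> int:
    """Send fd across a Unix socketpair via SCM_RIGHTS and return the received duplicate."""
    left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        # At least one byte of ordinary data has to accompany the descriptor.
        socket.send_fds(left, [b"pidfd"], [fd])
        _data, fds, _flags, _addr = socket.recv_fds(right, 1024, 1)
        return fds[0]
    finally:
        left.close()
        right.close()
```

The received descriptor refers to the same process as the one that was sent, regardless of what happens to the numeric PID in the meantime. What the question asks for is different: having the kernel attach such a descriptor implicitly, the way it attaches credentials today.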
36:21
Why do you want a special flag? Because it gets sent implicitly. You set some flags on the socket, and then the kernel does the job for you. And it is possible that by the time the consumer looks at this data, the PID is no longer valid.
36:42
Oh, so you're saying the PID is sent implicitly, but instead of a PID, you now want a pidfd. Off the top of my head, I don't necessarily see a problem with this. Yes? I guess the problem is, for instance, a process sends a log message to journald and dies. So at that time, journald wants to look at the process
37:00
to figure out which cgroup it is running in, which service it is. I remember this. Okay. Right. I'm trying to think if... Yeah, we should probably talk about this. It shouldn't be something that is... I wouldn't take this off the table. It sounds useful if there is a really good use case for this
37:21
and it doesn't really complicate in-kernel code too much, so that I have a good cause to justify this. It should be fine. I have a similar question, which is: we now have pidfds and namespace FDs, but for the namespace FDs, we have to go through the file system, procfs, to get them.
37:40
Yes. Is there a way to derive a namespace FD from a pidfd? Maybe I have plans for that. So it's an official request. Can you do that? One of the things that has always bothered me is, if you do a setns into a namespace,
38:02
you have to do it iteratively, right? So it's actually iterative in two stages. You have to call open like seven times nowadays, and then you have to call setns seven times, and often in the correct order. And that's obviously a problem.
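The first of those two stages can be sketched as follows. The namespace names are the entries usually found under `/proc/<pid>/ns`; the exact set depends on the kernel, and `open_namespace_fds` is a hypothetical helper. The second stage is left as a comment because it needs privileges and changes the calling process.

```python
import os
from typing import Dict, Iterable

# Typical entries under /proc/<pid>/ns; treat this list as an assumption.
NAMESPACES = ("user", "mnt", "pid", "net", "ipc", "uts", "cgroup")


def open_namespace_fds(pid: int, names: Iterable[str] = NAMESPACES) -> Dict[str, int]:
    """Stage one: open one file descriptor per namespace of the target process."""
    fds: Dict[str, int] = {}
    for name in names:
        path = f"/proc/{pid}/ns/{name}"
        if os.path.exists(path):
            fds[name] = os.open(path, os.O_RDONLY | os.O_CLOEXEC)
    return fds


# Stage two (not executed here): call setns() once per descriptor, in an order
# that works for the namespaces involved, e.g. os.setns(fd, 0) on Python >= 3.12.
```

Each `/proc/<pid>/ns/<name>` lookup goes through the numeric PID, so the target could in principle change between the individual open calls, which is one reason deriving everything from a single pidfd is attractive.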
38:21
Well, I see it as a problem. Maybe some people don't see it as a problem. I think this is the wrong approach. Ideally, we could change... I once had the idea, or played with the idea, I may have mentioned it on the mailing list at some point, which is: what if setns would take a pidfd and interpret the type argument that it has right now as a flag argument, so you could specify the namespaces that you want to attach to,
38:41
and then it derives it from the PID FD in kernel and gets you into all of those namespaces, which would make attaching for containers and so on way nicer. Yes, that's definitely something which has been on my mind. Now it becomes a battle of the maintainers, as I would like to call it, because then we need to agree with,
39:01
does Eric think this is a good idea? How exactly should the API look, and so on? But overall, yes, that's something I definitely have thought about. I also have thought about, just recently, forking into namespaces. Actually, I shouldn't claim this completely for me.
39:21
David Howells suggested this once in a discussion. So ideally, at clone time, you say, not just create me a set of new namespaces, but create me a process in this set of namespaces. But there needs to be strong justification for why we would need this. Is there some security issue or something
39:41
when you create a process and then you do all of the setns stuff on it and so on? Yeah, that's definitely something we can think about. Yes? Hello. About the kill-on-close feature: is it dangerous if the child process execs a setuid program and becomes more privileged
40:03
and then the parent can kill it? Can the parent then kill it? Yes. In the naive implementation that I prototyped, yes, because it's an internal signal, right? So there is no security disconnect.
40:22
Yes, so the answer is yes. It would just kill the setuid program that you spawned, even if it runs with more privileges, for now. I have even thought about this, but I haven't worked on it in much more detail. If you have specific concerns about this, we should probably talk.
40:42
So one of the things that I really want, and this is important I think, is: we have this tendency on Linux, and in some situations it's a good thing, in some situations it's a bad thing, where, especially with processes, we create a process with a specific property and then later on we add a prctl,
41:00
I'll be done in a second, and then we do a prctl and that prctl takes that property away from the process, which is horribly annoying. Like if I, as the parent, say, I'm creating a new process with these properties set, then this property needs to stick to this process. It can't be taken away anymore after the fact, which is the thing I want to do
41:20
with CLONE_WAIT_PID, for example, or the kill-on-close flag. It's a property that sticks to the process as long as it's alive, and if it's gone, it's gone. I don't want to end up in a scenario where suddenly you can change all bits and pieces and flags, again, on processes. So I want sticky properties, essentially. I would like that, if it's useful at least.
41:46
So, Christian Brauner, thank you very much. Okay, so if you create a child process and then you send a pidfd for that process to some other process,
42:02
does the parent then get a SIGCHLD notification if the other process wants the exit notification that you just added in 5.3? So do both processes get notified when the process exits, one over a pidfd and the parent via normal SIGCHLD?
42:20
If you have set SIGCHLD, yes, if you want that, but you can explicitly turn it off. Yeah, yeah, I want it in the parent. Okay, thank you. Anyone? Yeah, so these are just a few remarks while doing this work. I forgot that slide.
42:40
Okay. Thank you very much.