The Tales of the Cursed Operating Systems Textbook!
Formal metadata
Number of parts | 34
License | CC Attribution - ShareAlike 3.0 Unported: You may use and modify the work or its content for any legal and non-commercial purpose, and reproduce, distribute, and make it publicly available in unchanged or modified form, provided that you credit the author/rights holder in the manner they specify and pass on the work or content, including in modified form, only under the terms of this license.
Identifiers | 10.5446/38565 (DOI)
Bangbangcon (!!CON 2016), part 10 of 34
Transcript: English (automatically generated)
00:20
I'm Kiran Bhattaram and I'm here to tell you about the Tales of Woe, or the Tales of the Cursed Operating Systems Textbook.
00:26
So, I have a cursed operating systems textbook. Each chapter I read unearths a new bug in the systems I work on. Lest you think this is another case of the frequency illusion, where once you've heard about something, it keeps popping up again and again, let me assure you that it's not.
00:41
The problems this book of spells uncovers are massive and inescapable. They're, like, system-ending problems. So, for context: while I've been using Unix systems for most of my life (my dad used to hand out a bunch of CDs at dinner parties; he was a huge hit) and I know theoretically how operating systems work, I haven't spent that much time looking into the depths of more serious and more involved kernels.
01:03
My understanding of operating systems is kind of that of the story of the blind man examining an elephant. I know what the APIs were and how it worked by observation, but there's probably a trunk here and a snake here. I've never really seen the whole system or dug into its internals. Basically, I wanted to be Lex Murphy. This is a Unix system. I know this.
01:24
I firmly believe that having end-to-end familiarity with the systems I work on will make me a better programmer. I want to abstract things away in a way where I can say that I'm aware of a system's guts but I'm not going to focus on them right now, instead of throwing black boxes around things and not really knowing how they work on the inside. So I guess if you're thinking about it on a more positive note, it's less a cursed textbook and more a grimoire.
01:44
It's not a Necronomicon. It's a big book of spells and incantations about these systems of magic, sorry, engineering, that I work with. And some of these spells might cause harm, but that's okay because you learn things along the way. So the first chapter, the memory leak that's coming from inside the kernel.
02:03
So credit goes to my co-worker Nelson for debugging this issue. So a while back, Stripe started seeing intermittent sadness with our internal DNS servers. DNS queries would fail out and our servers would periodically OOM-kill processes, despite no application really using all that much memory. So as a hint, this is a little bit of a teaser for Kamal's talk later this afternoon.
02:24
So a side note about the OOM killer. I wrote this 30 seconds ago so my French is probably wrong. But, qu'est-ce que c'est? It's the job of the Linux out of memory killer to sacrifice one or more processes to free up memory in the system when everything else fails. Your system's out of memory and you have nowhere to go.
02:42
So looking over the OOM killer's logs, Nelson noted that a huge amount of memory was being used by the slab. In the Linux kernel, slab refers to the kernel's slab allocator, which is used for internal allocations by the kernel itself. So essentially all the box's memory was being used by the kernel, not user applications, which explained why the box
03:00
was swapping itself to death and OOM killing everything around, even though no application was using all that much memory. So our current state right now, something is taking up all of our kernel memory and we're not really sure what. It's taking up enough memory that the OOM killer pops in every now and then and shuts off our NSD process. How do we gather more information?
03:21
So let's take a detour and talk about /proc. It's a pseudo-file system about process information. So it doesn't contain real files, but a bunch of runtime system information about things like what devices you have mounted, your hardware config, and, importantly here, system memory. So a lot of the system utilities that you might be using are actually just reading from files in this directory.
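To make the "utilities just read files" point concrete, here is a minimal sketch, assuming a Linux machine, of pulling memory figures out of `/proc/meminfo`. The function names `parse_meminfo` and `read_meminfo` are made up for this illustration and are not from the talk.

```python
from pathlib import Path


def parse_meminfo(text):
    """Parse the contents of /proc/meminfo into a dict of integer values.

    Most fields are reported in kB; a few (such as HugePages_Total)
    are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep or not rest.strip():
            continue  # skip anything that is not "Name: value [unit]"
        fields[name.strip()] = int(rest.split()[0])
    return fields


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live file (Linux only)."""
    return parse_meminfo(Path(path).read_text())
```

On a Linux box, `read_meminfo().get("Slab")` should give the amount of memory currently held by the kernel's slab allocator, in kB.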
03:42
So relevant to our case here is /proc/slabinfo. We later found out that slabtop is a utility that reads from this and presents it more prettily, but that's what we had at the time. So looking at slabtop, we found that there was something called anon_vma taking up a huge amount of memory.
04:00
Some Googling turned up a bug in the Linux kernel's implementation of garbage collecting these. Basically, the way we reloaded our NSD process was by forking it and then killing off the parent. But for each fork, it retained an anon_vma object, so having each child become the new parent and the previous parent exit resulted in an infinite stack of these objects.
04:26
So it wasn't garbage collected. Armed with that knowledge, we were able to confirm that doing a complete restart of these processes instead of doing the graceful reload we were doing caused all that memory to be released. And that doing a thousand graceful reloads caused the memory to grow rapidly.
04:43
So as an overview, we talked a little bit about kernel memory and user memory, how to debug where your memory is going, and the /proc file system, and along the way, got our service discovery back. So the second chapter, this is something that Julia worked on. I'm sorry for the pun, I'm dedicating
05:01
this to my coworker Andreas here in the audience, who I've just started blaming for all my bad puns. So the issue behind this section is that, after I read the networking chapter of my book, Julia started debugging some slow networking issues with a system. The gist was that we were publishing messages to this daemon on localhost, and it took 40 milliseconds each time.
05:24
This daemon lives on localhost. It's on the same machine. You're just talking between processes. There's no reason publishing to localhost should take 40 milliseconds. That is silly. Computers are fast. So the HTTP library we used sends POST requests in two small packets, one for the headers,
05:42
and it expects an ACK, and then one for the body, and then it expects an ACK. So for efficiency, you want to send full-sized TCP packets, and there's an algorithm called Nagle's algorithm that says if you have a few bytes to send, but not a full packet's worth, and you have some unacknowledged data in flight, then you wait until you have a full packet, or until you time out and you get an ACK of all outstanding data.
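The rule just described can be written down as a tiny decision function. This is a simplified sketch of the classic rule only: it ignores timers, window limits, and everything else a real TCP stack tracks, and the name `nagle_can_send` and its parameters are made up for illustration.

```python
def nagle_can_send(buffered_bytes, mss, unacked_bytes):
    """Decide whether a sender following Nagle's rule may transmit now.

    buffered_bytes: data the application has written but not yet sent
    mss:            maximum segment size (a "full packet" of payload)
    unacked_bytes:  data already sent but not yet acknowledged
    """
    if buffered_bytes == 0:
        return False  # nothing to send
    if buffered_bytes >= mss:
        return True   # a full-sized segment is ready
    # A small segment goes out only when nothing is in flight.
    return unacked_bytes == 0
```

With the headers already in flight and unacknowledged, a small request body falls into the last case and has to wait for the ACK.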
06:03
So usually this is a good idea. It's there to protect the network from stupid apps, or like, naive apps, where a... Oh, I'm missing a slide. Anyway, it's there to prevent something where you might be sending 10,000 one-byte packets.
06:21
So you delay sending, and combine multiple small packets into a single larger one. Linux, by default, waits 40 milliseconds. Oh, I... slides. So yeah. The server we were using, on the other hand, had delayed ACKs on.
06:41
So the assumption is that this is... Julia has a great write-up of all of this, which is what I've been taking from. So the assumption is that the server usually generates a response to a packet sent. So if you send a hi, the server responds with a hello, so you don't have to do a received followed by a hello.
07:00
So our client sent the application, then the server sits in silence waiting for another packet. It's like, okay, I'll ACK eventually, but maybe there's more to say. And then the client's waiting in silence. It's like, well, I'm waiting for an ACK, but maybe there's network congestion? So this passive-aggressive period where you're waiting on both sides was where our 40 milliseconds was.
07:24
And eventually, we're done. So when Julia set the server to ACK immediately, and not delay and do the "well, maybe there's more to say" thing, we found out that everything sped up incredibly.
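The talk does not say how the server was reconfigured. As one possible way to ask for this behavior on a Linux socket, here is a hedged sketch using the standard `TCP_NODELAY` and `TCP_QUICKACK` options; the helper name `prefer_low_latency` is invented for this example.

```python
import socket


def prefer_low_latency(sock):
    """Turn off send-side batching and, where available, delayed ACKs."""
    # TCP_NODELAY disables Nagle's algorithm on the sending side.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    # TCP_QUICKACK is Linux-specific and is not permanent: the kernel may
    # switch back to delayed ACKs, so callers often re-set it after recv().
    if hasattr(socket, "TCP_QUICKACK"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 1)
```

Either side can break the standoff: the sender by not waiting for an ACK before sending small segments, or the receiver by not holding its ACKs back.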
07:45
So we've talked a little bit about networking stacks, why knowing the abstractions there is important, and how my book keeps setting me up for failure, speaking of writing operating systems in your own blood. So chapter three was something that we call oints.
08:02
So a note about our backups. We store a small portion of our data in Mongo, and we snapshot the disk to take backups of it every now and then. And then we clean those backups and restore them to test out that pipeline. So when we restored one of our specific things, we saw a pretty cryptic log message.
08:25
There was something that says, caused by MongoDB exception, the BSON object size is negative, which is an invalid size. That looks like a bit mask. And the first element was oints.
08:40
We weren't really sure where to go from this. So there's some data corruption on indexes in this thing. So the error we got was sort of the index is pointed at a deleted doc, which is confusing, because as a note about the way you write to disk, the kernel maintains a bunch of write buffers,
09:01
so it'll return from write calls immediately, and then later it'll flush that data to disk asynchronously. This means that at any given point in time, you might have half-written data in various buffers or states as your database is in the middle of a write. Before it confirms a write, it issues an fsync system call that clears out the buffers and writes everything to disk. So before we took our disk snapshot,
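A minimal sketch of that write path, assuming a POSIX-like system: the data passes through a user-space buffer and then the kernel's write buffers, and only `os.fsync` waits for it to reach the device. The function name `durable_write` is made up for illustration.

```python
import os


def durable_write(path, data):
    """Write bytes to path and wait until the kernel has pushed them to disk."""
    with open(path, "wb") as f:
        f.write(data)         # lands in Python's user-space buffer
        f.flush()             # hands the bytes to the kernel's write buffers
        os.fsync(f.fileno())  # blocks until the kernel flushes them to the device
```

Without the last two calls, `write` can return while the bytes still live only in memory.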
09:23
we wanted to make sure that everything that the database or kernel was holding on to has been completely flushed out and written to disk, so you don't end up with oints. When we saw this bug arise a while ago, we realized that we did do the fsync, but we didn't actually lock the replica against more writes.
09:42
So snapshots had a state where all of the data was flushed to disk, but we weren't preventing more writes from coming in in the meantime, so we just scribbled on a data file while the snapshot was ongoing. This meant that when we cleaned and attempted to remount the database, it had half-written data and barfed. So this is a lesson in non-atomic writes and buffers, friends.
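The missing step is holding off writers for the whole duration of the snapshot, not just flushing before it. Here is a toy sketch of that idea with an in-process lock. It is not how MongoDB or a disk snapshot works internally, and the `DataFile` class and its methods are hypothetical.

```python
import os
import shutil
import threading


class DataFile:
    """Toy data file whose writers and snapshots share one lock."""

    def __init__(self, path):
        self.path = path
        self._lock = threading.Lock()
        open(path, "ab").close()  # make sure the file exists

    def append(self, record):
        with self._lock:  # writers wait while a snapshot is in progress
            with open(self.path, "ab") as f:
                f.write(record)
                f.flush()
                os.fsync(f.fileno())

    def snapshot(self, dest):
        with self._lock:  # block new writes for the whole copy
            shutil.copyfile(self.path, dest)
        return dest
```

Because `snapshot` holds the same lock as `append`, the copy can never observe a half-appended record.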
10:05
Coming up next is the concurrency chapter of my textbook. I've been told not to read this, especially while on a plane. I'm sure it's fine. So instead, I was reading about hard drives and flash memory on planes.
10:24
Also exciting. I'm looking forward to the next bit flip that causes something to fall over. There's also scheduling and how your operating system ensures that processes have their fair share of the processor, and so many more exciting things. Operating systems are cool. There's a lot going on in there, and they present a fairly simple interface.
10:42
That's all I have.