Backpressure in FreeBSD I/O Stack
Formal Metadata
Number of Parts: 31
License: CC Attribution 3.0 Unported. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/45261 (DOI)
Transcript: English (auto-generated)
00:00
I work at Netflix, and we've noticed several problems in the FreeBSD I/O stack dealing with back pressure due to the way we use FreeBSD in our OCAs, so I thought I'd talk about that. If you want to grab a copy of my slides and follow along at home, they're at the
00:23
URL that's on the screen. I'll give you a second to grab that while I get some tea for my throat. So there's three parts to my talk today.
00:41
In the first part I'll talk a little bit about how Netflix uses FreeBSD, as is relevant to the rest of the talk. This could be an entire talk all on its own, so it'll be abbreviated. If you have questions about what I don't cover, feel free to see me or anybody else that's working at Netflix. I'll then give an overview of the upper layers of the FreeBSD I/O stack.
01:05
That will tell you, hopefully, what you need to know, and I see we have several experts in that in the room. Hopefully, I'll be getting it right. It's always embarrassing when Kirk raises his hand and goes, no, that's not quite right. And then I'll talk about my work on back pressure.
01:23
I had hoped to present some extensive results about that in this talk, but due to an emergency at work, the results will be a little bit abbreviated. So the FreeBSD stack overview will be a little bit longer than I had anticipated. So I work for Netflix.
01:41
I assume everybody here has heard of Netflix and knows what we do. We deliver... We mail out DVDs, right? Yeah, we mail out DVDs. And here we have streaming soon, whatever that is. So we have internet video that streams out from hundreds of different servers.
02:05
As we grew up, we overwhelmed all the content distribution networks out there, so we built our own. So the reason we built our own is we have between 30% and 40% of the peak internet traffic according to Sandvine, depending on which year you look at, which report.
02:26
The picture here is of some of our storage servers at one of our colos from a couple years ago, slightly older model. But the newer model looks just the same. I think they paint the cases red now for the storage to kind of brand them as Netflix.
02:43
We stream multiple terabits a second, every second, every day. And the individual machines, our fastest one currently is about 190 or 100 gigabits
03:01
at saturation. The way Netflix works, briefly, is you're interacting with the Netflix app, whether it's your phone, your tablet, your browser, your TV. It's talking to an instance in the Amazon cloud to do all our front-end information.
03:26
Do you have an account? You know, what movies are good for you? All of that stuff, all of the browsing and recommendations is all handled through that. So you hit Play. The cloud tells your browser or your app, get the movie from this URL.
03:43
And then it contacts the Open Connect device, which is just a glorified web server. All we do is serve a lot of content through HTTPS. And we do this through a number of different OCA types.
04:05
We have a fairly large catalog that won't even fit in one of our appliances. It's tens of terabytes in size. The storage appliance has a bunch of spinning disks.
04:23
It stores most of our content. And it stores it very well and has a small number of SSDs to enhance the streaming for the most popular content. But when something's really popular, we go to the next tier in our OCA type,
04:43
which is a flash appliance, which is purely flash-based. It's either 40 gig with SSDs or 100 gig with NVMe cards. And we put the most popular content on here and stream it out as quickly as possible. We replicate the content as many times as necessary to get the bandwidth
05:05
that our customers need. We do this, and we place these boxes close to our customers so we don't overwhelm the rest of the internet. Plus, the fewer hops a video stream has to traverse, the higher the quality and the better the experience you get from a video stream.
05:22
We also have kind of a box in the middle right now. It's our global appliance, where it has some hard disks, some SSDs, so that the most popular content from smaller catalogs in other parts of the world is able to stream from that, and the full catalog comes off of the hard disks.
05:44
And we're also looking at mixing and matching these in any of the combinations possible. Our hardware guy that creates the servers likes to buy NAND very cheaply, and where the sweet spot is on the NAND varies, and also what NAND is available varies.
06:02
I could do an entire talk on NAND logistics, but it's not why you guys came here, and I'd probably fall asleep in the middle of it too. So the important thing to note here is that we're looking at having boxes that have three different classes of storage, sorry, performance classes.
06:30
Hard disks operate at about 10 to 100 milliseconds in terms of the time it takes them to process events, depending on how loaded they are, how big the transfers are.
06:41
And the fastest NVMe drives can be microseconds. So that's a few orders of magnitude, depending on how you count four, five, six orders of magnitude in speed differential of storage all in the same box.
07:01
In addition, we found that our SSDs have widely varying performance. If we're riding along, it can be fine, and then all of a sudden, the performance drops by a tenth, I mean, to a tenth. It'll go from being three, four, 500 megabytes a second to 30, 40, 50 megabytes a second.
07:22
And there's a number of reasons for this. I talked about some of them in a prior BSDCan talk. But generally, what happens today is the drives are based on what's called TLC NAND, which gets great density by storing three bits per cell, but is super slow.
07:42
They can operate the NAND in multiple modes, including an SLC mode, which is really fast. So they devote a few percent of their drive capacity to SLC pages. As you're writing out, it buffers them in this SLC cache. And then after you're done writing,
08:02
they'll move it to another part of the drive that has higher capacity. And if you're writing in bursts, this makes the drive look like it performs really, really well. But we write bursts that are larger than this cache when we're filling our content, putting new content on the machines.
08:21
So we overwhelm it, so we'll see this performance change. Everything's fine, 300, 400 megabytes a second. And then boom, everything's not fine, and performance really is terrible. And any of the tuning that we had done for 300 megabytes a second doesn't work so well at 30 megabytes a second.
08:43
Also, the nature of NAND flash is such that, as you're writing to it, you erase it and then write once. Well, if you're rewriting pages, that creates holes earlier in what's called the device log, which is append-only, with the data and a little tag saying what the logical mapping of that data is.
09:06
So when you rewrite the data further on in the log, prior in the log, it creates holes. And if you're writing too much, you can overwhelm any pool of free NAND blocks that it has, and it has to find them somewhere.
09:21
And the only way it can find them is by reading the old NAND pages that have holes, writing them at the end of the log to collapse out the holes, and then it has pages. But it's doing this activity instead of your workload, which impacts performance. And then once it does this, performance might be fine again. We've also found that the performance varies, even when the drive isn't misbehaving,
09:45
we find that the performance varies significantly with the workload. If we're doing 100 percent or 99 percent read workload, we can get maybe 500 megabytes a second out of the drives. But even a 10 percent write workload drops that down to about 400.
10:01
And if you do a 50-50 workload, it drops it down to 100. So it's not even a linear progression. And as the workload changes, what you can expect out of the drive changes. And since we have basically two periods for our servers, we have a full period where we're writing data and sometimes serving traffic.
10:24
You can do that on our storage machines, but not on our flash machines. Once you are doing that, the capacity you have to serve customers drops a lot
10:41
because your read rates drop a lot. In fact, on the flash, it drops so much, we just have to turn off serving customers while we're filling in content. Also, with having hard disks and NVMe drives in the same system, or hard disks and SSDs in the same system,
11:03
they have very different performance characteristics. Hard disks tend to be dominated by the number of IOs you want to do to them. Because between each IO, it has to move the head, and that time dominates the IO time that you're going to have. So as you are moving the head around,
11:22
the number of things you want to get tends to dominate. If you read 100k versus reading a megabyte or two megabytes or eight megabytes,
11:42
you'll get about the same number of IOPS. Your bandwidth will vary considerably. But with flash, since there's no moving parts, unless you're doing really, really tiny transactions, after the transaction size gets above about 32 or 64k, bandwidth dominates. So for different systems, you would want to tune it for bandwidth or for IOPS,
12:03
but if you have a system that has both types of drives in it, it's hard to tune it for both at the same time. So this is some of the motivation that led me to do the back pressure work. Some of the problems that we found in FreeBSD,
12:24
the buffer cache winds up scheduling all of the IO. I'm going to put a big asterisk to it, except for the parts it doesn't. But generally speaking, all the IO goes through the buffer cache. And the buffer cache, particularly in writing, tries to be very nice to the lower layers.
12:40
If you're doing a synchronous write, it'll let that through because you're going to wait and not schedule very many writes. But if you're doing asynchronous writes, it limits the number of those asynchronous writes that you can do. And if you have a single disk system, this works fairly well. It's a global limit, and a global limit with one disk, you don't run into any problems.
13:02
But if you've got a global limit and you've got hard disks that are slow and SSD drives that are fast, one can starve the other. If you get a bunch of requests out to a hard disk, then your SSDs won't write for a while. If you get a bunch of requests on the SSDs and they're turning through and cranking through quickly,
13:22
you can starve the hard disks. It just depends on the order of arrival. So that's one of the problems that we've seen. Another problem is that as you're getting ready to write,
13:42
the system limits the number of buffers that can be dirty at any given time. And if you want to write and there's too many dirty buffers, you wait for the buffers to be cleaned. And they're cleaned by the buffer daemon. And although not the topic of this talk, Isilon and some other people have been working to make that more efficient.
14:01
One of the problems is it has a high and low watermark to do the work, which makes it bursty to the lower layers. Burstiness isn't very good to the lower layers. It tends to overwhelm them. You tend to tie up more resources. Anytime you create a situation where IO is delayed at the lower levels because too much is queued up or because you can't do very much IO,
14:25
you've got all the pages that are locked. You've got other resources that are consumed by that until that blockage or that backup can clear. And we found that. A couple of years ago, I presented on the IO scheduler
14:40
that I had written for CAM, which lets us throttle writes to hard disks or to any disks, really. We use it for, or we're trying to use it for SSDs. We thought if we could limit the amount of data we would write at any time to the SSD, we would get consistent, but not very good, but consistent and predictable performance out of the SSD on writes
15:03
and bursts of writes wouldn't affect our read speed. So we would be able to service customers while we're filling. And that didn't work as well as we'd hoped. Some of that is due to the tooling that we have in our infrastructure that looks at the latency and the percent busy metrics.
15:24
If you schedule something and send it down to the lower layers, it's considered to be pending, so devstat says, oh, the device is busy while you're doing that. Even if it's delaying to allow other reads or whatever.
15:43
The other thing is it increases latency, because as you're measuring these things, the stuff that goes down and then delayed, it's going to necessarily increase latency. If you're throttling to a particular rate and you have more than that coming in, the latency is going to go way up.
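To put rough, made-up numbers on that last point: if writes are being offered at 200 megabytes a second but the scheduler throttles them to 100, the backlog grows by 100 megabytes every second, so a write that arrives ten seconds into the burst is sitting behind roughly a gigabyte of queued data and sees on the order of ten extra seconds of measured latency, even though the device itself is healthy.

    offered load    = 200 MB/s
    throttle rate   = 100 MB/s
    backlog growth  = 200 - 100 = 100 MB/s
    after 10 s      -> ~1000 MB queued -> ~1000 MB / 100 MB/s ≈ 10 s of added latency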
16:03
So all of these things have kind of led me to the point where I thought, well, we need to communicate some sort of back pressure up the stack. The global limits, we're kind of reaching the end of the usefulness of the global limits. So the work that I've done adds that communication to the stack.
16:24
So before I talk about that, though, I'm going to talk about the FreeBSD I/O stack so that where I put this stuff in is more apparent and more understandable.
16:40
So here's a picture of the FreeBSD I/O stack conceptually. It's a simplified version from Kirk's and George's and Robert's design and implementation book. So it's a simplified version that takes some liberties, but basically from top to bottom you have system calls that ingest the data
17:06
and those map to file entries that map to vnodes that go through the file system, that go through the page cache that then goes down to geom and the layers below. I've talked in prior years about the lower layers,
17:22
and this talk will be more about the page cache layer. I wanted to include file system parts on that, but that made the talk too long. I'll be talking a little bit about the page cache or the buffer cache as well.
17:45
These are the same things, describing what these different layers do. There's the upper-layer buffer stuff; the lower layers then transform the data to do RAID and striping and partitioning,
18:03
and then CAM at the lowest level knows how to get the stuff back and forth to devices. I've talked about that in a prior talk, so I won't be hitting on it any more than that in this talk. So the center of the buffer cache is struct buf.
18:23
Basically, it maps a vnode to memory. In fact, it maps a particular part of the vnode. It has an offset and a length in it, and the pages are generally backed by VM objects,
18:42
or they're not backed by VM objects, and when the system touches them, the VM objects, the pages are read in. The VM objects are always there. They're just not populated, I guess. There might be a better way of saying that. So it has that.
19:02
It has a number of fields for bookkeeping. Is this a delayed write? Is this an asynchronous request? What kind of thing is going on here? It also has list members. There are active and inactive and wired pages
19:24
that I won't be getting into in this talk. But the bufobj keeps track of that. During the lifecycle of the buf, it gets moved from list to list,
19:40
and that's how that's managed. Then there's the bufobj that the buf uses to tie the different pieces of the system together. And I'll talk a little bit about that a little later.
20:00
There's a done routine, so when the IO completes, the requester can get a callback and either schedule more IO or return results to the user. And there's also credentials that are based on the process initiating the IO.
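As a rough sketch of what's being described, a buf looks something like this. This is a simplified paraphrase of struct buf, not the actual layout in sys/buf.h, and the comments are abbreviated:

    struct buf {
            uint32_t          b_flags;      /* bookkeeping: delayed write, async, ... */
            struct vnode     *b_vp;         /* the vnode this buffer maps */
            struct bufobj    *b_bufobj;     /* ties the buf to the rest of the system */
            off_t             b_offset;     /* offset of this range within the vnode */
            long              b_bcount;     /* length of the mapped range */
            void            (*b_iodone)(struct buf *); /* completion callback */
            struct ucred     *b_rcred;      /* read credentials of initiating process */
            struct ucred     *b_wcred;      /* write credentials of initiating process */
            TAILQ_ENTRY(buf)  b_freelist;   /* which clean/dirty queue it is on */
            int               b_npages;     /* number of backing pages */
            vm_page_t         b_pages[];    /* pages from the backing VM object */
    };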
20:24
So how are the struct bufs used? This array of buffers, which is statically allocated at boot, schedules IO between the layers. When it's time to populate the buf
20:41
or to write out dirty bufs, they are used to communicate through all the translation layers. So if it's a file in a file system, it will go through the file system, look up where that file was on the disk, and fill those fields of the buf in
21:02
and then schedule that IO to the lower layers through the bufobj. It also is used to track read-ahead and write-behind, delayed writes. So even though the user is writing or reading a few bytes,
21:23
the system might say, oh, well, it's time to read a cluster of pages. This is particularly useful when you're streaming large video files out. If you read ahead a particular amount, that can create larger IO transactions down the stack, which is beneficial on, say, hard disks.
21:43
It also caches the most frequent blocks. This can be both good and bad. If you're a general purpose system and people are running a bunch of executables, having LS cached in memory is a good thing because when somebody types LS, you don't have to hit the disk. It gets the pages out of memory.
22:01
If you're streaming a bunch of video data, this might not be so good because there's poor reuse in that case. Bufs are also used when it comes time if you've created anonymous pages through malloc
22:23
or through data sections in your executable, and there's memory pressure. These get written out to the swap file and can be potentially used for other things. And all the dirty buffers are managed
22:41
by the buffer daemon. The buffer daemon manages the lists that these, excuse me, the buffer daemon manages the dirty buffers that accumulate in the system and tries to keep those numbers low.
23:04
It runs from time to time, and it goes through all the queues that it has. It schedules the buffers to write. It checks the low and high water marks and says, oh, I need to do this much work this time and goes off and does the work.
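A minimal paraphrase of that loop, eliding all the locking and special cases (flush_some_dirty_bufs() and the sleep call are stand-ins, not the real helpers; numdirtybuffers, lodirtybuffers and hidirtybuffers are the real counters and watermarks):

    static void
    buf_daemon_sketch(void)
    {
            for (;;) {
                    /* Flush until we are back under the low-water mark. */
                    while (numdirtybuffers > lodirtybuffers) {
                            if (flush_some_dirty_bufs() == 0)
                                    break;          /* nothing flushable right now */
                    }
                    /*
                     * Sleep until a writer notices numdirtybuffers has crossed
                     * hidirtybuffers and wakes us, or a timeout fires.  The gap
                     * between the two watermarks is what makes the resulting
                     * writes bursty from the point of view of the lower layers.
                     */
                    sleep_until_wakeup_or_timeout();
            }
    }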
23:27
And depending on what the limit is, some of them will block the buffer daemon, particularly if there's a suspended file system. Some of them are ignored. One of the problems with the static limits
23:43
is that sometimes you really, really need to write something no matter what has been scheduled. So the buffer daemon has special tricks in it to get around the limits imposed by bufwrite,
24:01
which I'll talk about here in a second. And that's one of the problems with that hack as well. So interacting with the buffer cache, there are basically four classes of functions. One is to get a new buffer,
24:21
where you can look up a buffer by a vnode and offset, or get an anonymous buffer if you don't have a vnode associated with it. Most of the file systems use bread and bwrite to bring data in and out. bstrategy is the routine that's called
24:43
to actually schedule the IO at the next layer down. And usually the strategy routine queues up the request and returns, or calls something that queues up the request and returns. It's a non-blocking request. And then there's a number of routines
25:01
that wait for the buffer cache to do its thing and return. So the bufobj. The bufobj ties together the vnodes and the bufs.
25:20
And there's a number of routines that it has to manage this. It's a very small number, so I thought I could go through them. Again, there's the strategy routine, which calls the next thing down the chain. There's a sync routine, which synchronizes all of the bufs associated with this vnode.
25:44
So it's like fsync, an fsync system call. There's BO_WRITE. Even though BO_STRATEGY handles reads and writes, there's a separate BO_WRITE, which is the thing that calls bufwrite, which does the runningbuf accounting,
26:02
makes sure there's not too many going on at any given time. And then there's also a flush routine, which is a lot like sync, except it does all the buffers for that buffer object.
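To make the shape of that interface concrete, here's a hedged sketch of how a filesystem typically drives it. bread, brelse, bqrelse, bdwrite, bawrite and bwrite are the real buffer-cache entry points; the function, its arguments, and the error handling are made up for the example, and kernel headers are omitted:

    /* Illustrative only: read a block through the buffer cache, dirty it, release it. */
    static int
    example_touch_block(struct vnode *vp, daddr_t lblkno, int size)
    {
            struct buf *bp;
            int error;

            /* Look the buffer up by (vnode, logical block); does the read if not cached. */
            error = bread(vp, lblkno, size, NOCRED, &bp);
            if (error != 0)
                    return (error);         /* error handling abbreviated */

            /* ... inspect or modify bp->b_data here ... */

            /*
             * One of several ways to let go of the buf:
             *   brelse(bp) / bqrelse(bp) -- release it unchanged
             *   bdwrite(bp)              -- mark it dirty; the buf daemon writes it later
             *   bawrite(bp)              -- start an asynchronous write now
             *   bwrite(bp)               -- synchronous write; sleeps until it's done
             * All of the writes eventually go through the bufobj's BO_WRITE/BO_STRATEGY.
             */
            bdwrite(bp);
            return (0);
    }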
26:25
So another part of the buffer cache is pagers. And when I say goes to the next layer, the next layer typically is one of these pagers, where the data is written out and brought in.
26:46
So the vnode pager is what handles writing and reading pages that are associated with vnodes. It's the thing that brings false pages in or writes them out
27:01
when you're trying to do I/O to files, primarily. The swap pager is used for managing working sets, like I was talking about earlier. And then the other pagers are, the default pager does nothing,
27:22
and the other pagers are not used all that much. So trying to pull all this together, judging from the fact I'm seeing some heads nod, maybe I should have put this first in my talk. But the current write path down in the system
27:42
is a system call happens. Somebody decides they want to write. So before dirtying a buffer, they call bwillwrite, which checks the dirty count and, if there are too many dirty buffers, puts the process to sleep; it also wakes up the buffer daemon and waits for the buffers to be available.
28:03
Some systems have hacked that to do an inline reclamation of a few dirty buffers or to start that off immediately rather than waiting for the buffer daemon to wake up. And then once it knows it can proceed, it goes ahead and dirties the buffer by putting the data in there by whatever means,
28:21
and there are so many different paths for how it does that, whether it copies it from user space or swaps pages, that I won't go into it. It calls the bufobj write routine to do the write.
28:40
And the reason it goes through the bufobj write routine is so that the hacks for the runningbuf accounting can be made. That then winds up calling the strategy routine, which schedules the I/O. And that'll bounce around through a few layers of the upper system that would make this slide way too hard to read.
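Pulling those steps together, here's a compressed, hedged sketch of the path as described; the calls are the real entry points, but this isn't the literal call chain of any one filesystem, and the inputs are assumed:

    /* Rough shape of the write path; vp, lblkno, size and data are assumed inputs. */
    static void
    write_path_sketch(struct vnode *vp, daddr_t lblkno, int size, const void *data)
    {
            struct buf *bp;

            bwillwrite();           /* sleep if there are too many dirty buffers;
                                       wakes the buf daemon as needed */
            bp = getblk(vp, lblkno, size, 0, 0, 0); /* find or create the buf */
            bcopy(data, bp->b_data, size);          /* dirty it with the new data */
            bawrite(bp);            /* asynchronous write: goes through the bufobj's
                                       BO_WRITE (bufwrite), which does the runningbuf
                                       accounting, then BO_STRATEGY hands it down */
    }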
29:03
But if you have file systems involved, there's translating to the lower device and so forth that aren't on this, but eventually it'll get down to geom. g_vfs_strategy is the entry point to geom. That's where it converts the struct buf
29:21
to the struct bio and does all the stuff. And then once everything's done, we call biodone, and the caller gets a notification that, hey, this is done as well. I'm sorry, we call bufdone. So that's kind of the backdrop
29:45
of the FreeBSD I.O. system. So the back pressure design, which is what you guys came here for, I imagine, conceptually, the high level description is that at the lowest level,
30:01
each device publishes how much it thinks it can do and the time scale it thinks it can do it on. I've started adding this to the CAM I.O. scheduler so it can try to estimate how much is too much I.O.
30:20
for a particular disk. I also have one that just is based on static limits for testing. And so right now when you submit an I.O. from the upper layers, it's fire and forget. You submit it, and something happens, and either it's submitted or it's not.
30:43
And the something that happens might include sleeping, might include delays, might include other things, but basically it's submitted. No information flows back up the stack to tell the upper layers anything at all about what happened. What I've changed is that the lower levels
31:02
fill in these completion records, or these submission records. There's one of each that tell the upper layers, okay, after you've submitted this I.O., we have this much space available, and also what the time scale of this device is
31:21
and whether this device is IOP limited or bandwidth limited or both so that the upper layers can make decisions about either how many things to schedule or how much work to schedule. It's envisioned that this is a cooperative scheme.
31:41
The upper layers know how urgent their I.O. is, and they know whether or not they can cheat and use more bandwidth than might be available in the next few milliseconds because the need is urgent, or say, oh, I'm only going to use about half of that
32:01
because the need isn't very urgent right now. They can make decisions like that based on what they know about their needs for I.O. So I've also created a few interfaces that allow the upper layers to do this, to give them the flexibility to do this.
32:27
So like I was talking about, the completion layer has a couple of things in it. It has the time scale. I'm calling it an I.O. quantum here, but that's the time scale that the device works best on.
32:41
For a memory device, that would be nanoseconds. For a hard disk, that might be 50 or 100 milliseconds. It has a bit mask that tells whether or not this record was even filled in, as well as what's the limiting factor for this I.O., or both.
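Concretely, I imagine the record looking something like this. To be clear, this is an illustrative reconstruction of what's being described, not the actual patch; every name here is made up:

    /* Hypothetical sketch of a submission/completion back-pressure record. */
    #define IOBP_VALID      0x01    /* the lower layer actually filled this in */
    #define IOBP_IOPS_LIM   0x02    /* the device is IOPS limited */
    #define IOBP_BW_LIM     0x04    /* the device is bandwidth limited */

    struct io_backpressure {
            uint32_t        flags;          /* IOBP_* bits above */
            sbintime_t      quantum;        /* time scale the device works best on:
                                               ~ns for memory, ~50-100 ms for a hard disk */
            uint64_t        iops_avail;     /* I/Os it thinks it can absorb per quantum */
            uint64_t        bytes_avail;    /* bytes it thinks it can absorb per quantum */
    };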
33:03
And then it tells what the limiting factors are. And in this way, it communicates up the stack an estimate of what the drives are able to do. For a simple direct dispatch system, this is fairly simple,
33:21
for a system that has a gmirror or graid, where the work has to aggregate across different disks, or there's networks involved where the latency might be changing or different, I'm leaving the hard problem of how to fill this in to those layers.
33:45
For gmirror, is it additive for the reads and writes, or is it just what the worst one of the lower disks can do? I'm leaving that to gmirror. That's kind of a hard problem. Another command that's needed is an I.O. cap command.
34:02
And that gets the capacity of the drive down the stack. And the reason for this is, on the previous slide I said in a direct dispatch system this is all easy. Well, not all device drivers support direct dispatch. A lot of them are queued to g_down and then return.
34:22
Well, it doesn't know what the capacity is. So there's a mechanism for getting that capacity should the upper layers need it on their submission record.
34:46
And it returns the data in the same format as a submission or completion record is the important takeaway. Yes, Alan? It's based on whatever the device wants to estimate.
35:03
And this could be a rolling average, this could be some more sophisticated, smooth, filtered average. At this layer, this is the devices, or somebody that's managing the devices,
35:22
best guess of what won't overload the device in whatever time frame is present. So I know that's kind of a squishy, you know, ah, just tell us what you can do. But it allows also for different,
35:40
it allows for innovation with the different schedulers. It turns out the question of how do you know that the disk is saturated is a hard question. And I didn't want to dictate what that, the answer to that was in doing this. Yes, yes, I have a simple static scheduler that says
36:06
this device does this rate. And that's all it does. It doesn't try to estimate or guess. And I don't have the results I was hoping to have. I spent the last month working on a high priority item for work
36:24
rather than generating results for this talk, which I'm kind of embarrassed about, but you know, we have what we have. So I can't tell you how well that would perform. My intuition tells me it probably wouldn't suck
36:40
because it's not that different than the static limits we have in the system today. There are a couple of flags I added to bio, but they also apply to bufs, that tell whether or not this transaction is participating in the back pressure during the transition.
37:07
And this code I have now, one of the problems I talk about in a few minutes is the code is green. How to do the automatic back pressuring for things that haven't been converted is very green right now. But there's a flag that says don't try to do that.
37:22
I know what I'm doing. Schedule this and tell me the results and then I'll manage my workload appropriately. And then there's another flag that says normally when you would sleep when I do an asynchronous request, don't sleep, just return EAGAIN and I'll cope.
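As a hypothetical illustration of how a converted caller might use those two flags (the flag names are invented for the example; only EAGAIN and the general shape come from the talk):

    struct buf *bp;         /* assume: a buf already set up for an asynchronous write */
    int error;

    bp->b_flags |= B_BP_AWARE | B_BP_NOSLEEP;   /* invented names: "I participate in
                                                   back pressure" and "don't sleep,
                                                   return EAGAIN instead" */
    error = BO_WRITE(bp->b_bufobj, bp);
    if (error == EAGAIN) {
            /* The device said "not now": back off, shrink the next burst, or
               retry this buf after the quantum the completion record reported. */
    }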
37:41
Right now, the different sleeps in the system are designed so that we don't have to put in a bunch of tests and retries. It can be centralized to sleep in one place. But that's also inconvenient if you want to try to be innovative in what you're doing. You don't want the thing to go to sleep. If you want to try to fire, do it or not,
38:02
I want to know, and that's fine. Yeah, exactly. Or I can tell sendfile to slow down or have a smaller read-ahead or something sophisticated like that. We haven't talked about that at Netflix, but getting this back pressure,
38:23
you can apply the back pressure not only up to the layer I've done, but also further, possibly in the future. I haven't done any syscalls to get what the back pressure is or anything like that. I don't know what the right interface for that is. Probably something close to our AIL interface
38:40
that just adds that information. I have a default IO scheduler that you set limits on, enforces the old limits and is boring, but it's there for testing. Oops. Okay. I also have a little bit of work
39:02
in the CAM I/O scheduler. It kind of knows it has a... The adaptive CAM I/O scheduler has a quota system in it, and it will just return not how much the device can do, but how much of the quota is left.
39:20
And so I've used this to start to implement the runningbuf throttling. It currently works, but it tends to schedule too many buffers still, so the code needs some enhancement to not do that.
39:43
I'm not exactly sure why it's doing that, so that's just a bug in my code, though. The adaptive IO scheduler, its estimate is the limiter right now, but it could also be a dynamic limit from what it estimates, and that's, again, the hard problem.
40:01
If I'm doing IO to the disk, how do I know too much is too much? So there's some problems with this work. Normally I like my problem slide to be shorter, but I haven't worked through a lot of these problems. Like I've said several times,
40:21
recognizing what saturated is and backing off is a hard problem. The network guys have been trying to do congestion control for 30 or 40 years, and they keep changing based on how the network changes and based on what their notion of good or bad is.
40:43
So a similar body of work doesn't seem to have been published for disks. A lot of people are doing the work. You see press releases talking about
41:00
innovative algorithms in different vendors' disk systems, but nobody seems to be publishing, or I've not been able to access that in the literature. If people know of things in the literature that aren't simulations in some Java thing that a bunch of grad students have done, I would love to hear about that.
41:22
I've also not done analysis for starvation and other unfair behavior. Oh, yes. Yes?
41:41
Yeah, you can tell when you're totally saturated, but that's not the most useful thing to do to get the best performance. What you want to do is you want to say, at 100 megabytes a second, I'm totally saturated, and I get crappy performance, but I know at 95, I'm great. So I'll never do more than 95, and I'll be good.
42:00
But how do you know that and figure that out dynamically, particularly when you have large swings in performance for some of the crappier consumer-grade SSDs that we use? So that is the basis of all saturation is what happens to the latency. And what do you mean by getting longer? Does that mean the average gets longer?
42:21
Does that mean the 99th percentile gets longer? Does that mean the max over a period of time gets longer? There's a number of ways you can approach it, and I'd hope to explore those and present different numbers on how well each of those worked, but I ran out of time. A failing device?
42:42
Oh, I can't characterize that in a single way because devices fail in so many different ways. Well, yeah, and it'll depend. If it's a flash drive and starting to fail, it might be doing a bunch of retries internally to try to sift the data out, which has one profile.
43:01
And a hard disk failing, it's going to try a few times and then return an error. So it'll vary. Yes? Yes. Yes.
43:57
I haven't noticed that in the data I've worked with,
44:01
different failure modes. However, I've tried to design things to be flexible so that as different things drive, I've got IOPS versus bandwidth. But with networks, you say, I have a 20-gig link or a 10-gig link, or you measure a bandwidth delay product to the remote host.
44:22
And similar things kind of exist for disks but don't. I have a 6-gigabit link to the disk, but I can only use 3-gigabits of that for this particular operation, unless I'm doing too much, and then it drops to 1-gigabit maybe. So the things that trigger the saturation
44:43
I think are going to be different with storage, but I think the concepts will be very similar. I have a disk I can be wide open with, maybe limited by some known characteristics versus something's really slowing this down, and I don't know what, but I know what the number is. I've been able to measure it to be this particular thing.
45:02
And it's similar, talking with our network guys, Randall Stewart and such, at Netflix. It's a similar thing that's going on in the congestion control stuff. And I'm looking for more papers in that area as well to see which, you know,
45:20
what are the techniques that are applicable to storage that, you know... Right, and in other discussions,
45:43
we went off on a half-hour tangent about quality of service. Well, this could be used to implement quality of service. You just put the layer above geom that looks up the process and goes, oh, you get this quality or that process, you get that quality, and you return those results up, and then the upper layers adjust accordingly. I thought it was a little hand-wavy
46:01
because the buffer daemon is aggregating things from a number of different sources, and it would have to keep going through the list and say, schedule this. Oh, no, I can't? Okay, that doesn't mean I can't schedule anybody. Let's try the next one because it has different credentials and might be allowed through if we did quality of service or tried to do rate limiting at this layer.
46:23
I don't know if this is the right layer to do that, but, you know, that quality of service is another way you can look at this. The CAM IO scheduler says, the adaptive part says, I want to slow these write IOs down because I measured this read IO taking too long.
46:42
It's a simple-minded notion of congestion. It reduces the quality that the writes get in order to preserve the quality of the reads on a global basis. It doesn't care who's doing them.
47:02
Any other questions on this? Yeah, go ahead.
47:51
That's correct. The reassignment of backing store from device to device,
48:01
depending on the capacity and what's dirty, I hadn't thought of. That's a very interesting notion. I had thought I'd keep the interaction with the buffer daemon minimal. Because if the buffer daemon needs to free a lot of dirty pages, is a good strategy for that to look for the devices that have the most capacity
48:21
and the shortest time quantum and schedule as many as possible to those and some others for fairness and to prevent starvation. And that's an open question. It seems like it would be a very good idea that we could get rid of memory pressure faster
48:42
by writing to the fastest devices. But then does that swap the devices and make them slower? I don't know. That's a very good question.
49:06
That's right. And speaking of ZFS, it uses the ARC, which is a completely parallel buffer cache to everything I've talked about today. But that, huh?
49:21
And yeah. But it too could benefit from these hints. It schedules one to all of them and then it can look for one with the most capacity to do next or something. I can queue five or ten to this one and two to this one
49:41
and six to this one and eight to this one and then wait for everything to return. Okay, I should have turned off my network, but, you know. I'll do the question slide anyway. Nobody calc-sayed me this year. Martin, we would love such devices,
50:19
particularly if the copy from the storage device
50:22
to the network device could happen with TLS encryption in between without going through host memory. We would love that. In fact, if anybody has vendors that are trying to sell them to them, email me. They might be able to sell it to us. Maybe not. We're kind of picky,
50:41
but we would love to have devices like that is the short answer. And work that I have seen people talk about in bars, and I was drinking at the time, so I don't remember who was talking about it,
51:01
was doing the same thing on the network side. They were doing really well, and they were just trying to do that on the storage side as they were replicating data from a failing drive to a non-failing drive or doing low-level operations like that.
51:22
Currently, the only way on FreeBSD to do that is to have a RAID card that does it internally to the RAID card, as far as I know. If there are innovative ways to do that, I would love to find out. Other questions?
51:46
Okay, I guess that's it. Thank you very much for joining.
52:07
Okay, there's food outside. You have six minutes to grab it.