I/O Scheduling in CAM

Video in TIB AV-Portal: I/O Scheduling in CAM

Formal Metadata

CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Release Date

Content Metadata

Subject Area
SSDs have many unique characteristics not present in spinning drives. Applications have different access patterns and desire different performance trade-offs. GEOM offers some scheduling facilities, but they are hampered by having no visibility into the underlying device's characteristics. Scheduling I/O in CAM allows peripheral drivers to use their detailed knowledge of a drive to schedule I/Os that are optimal for the application's needs (with hints from the application). Netflix operates a small fleet of video servers for its video streaming service. There are two main kinds of server used in our operations. We have a storage appliance, which is used for long-tail access and filling other servers. We have a flash appliance for serving popular titles. Our service has a certain amount of change each day, as titles change in popularity, contracts expire or come online, etc. While our workload is read-mostly, we also need to write to and trim the drive from time to time. With flash drives, we found that any sustained write activity above a certain level led to a sudden decrease in read performance, reducing our effective capacity when this happens. By clever scheduling, one can reduce these effects to keep read performance good, but write performance will suffer. The traditional scheduler didn't allow any efficient way to do this, short of write throttling in the application. While this does help mitigate things, when there are many threads or processes acting in parallel it can be hard for the application to coordinate everything, and the many layers between the application and the disk can interfere with even perfect coordination. Moving the throttling to the lowest layer in the system helps smooth out the bumps, as well as adapt dynamically to changing workloads (you can write more if you need to read less, for example).
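The idea of throttling writes at the lowest layer while adapting to the read load can be illustrated with a small model. This is only a sketch under stated assumptions, not the scheduler described in the talk: the class name, the halve-or-increment policy, the smoothing factor and the 5 ms target are all hypothetical choices made for illustration.

```python
class AdaptiveWriteThrottle:
    """Hypothetical sketch: cap in-flight writes using smoothed read latency."""

    def __init__(self, target_read_ms=5.0, min_writes=1, max_writes=32, alpha=0.2):
        self.target_read_ms = target_read_ms  # assumed latency goal
        self.min_writes = min_writes
        self.max_writes = max_writes
        self.alpha = alpha                    # smoothing factor for the moving average
        self.write_limit = max_writes         # writes allowed in flight right now
        self.read_ema_ms = 0.0
        self.writes_in_flight = 0

    def on_read_complete(self, latency_ms):
        # Exponential moving average of observed read latency.
        self.read_ema_ms += self.alpha * (latency_ms - self.read_ema_ms)
        if self.read_ema_ms > self.target_read_ms:
            # Reads are suffering: halve the write allowance.
            self.write_limit = max(self.min_writes, self.write_limit // 2)
        else:
            # Reads are healthy: let writes creep back up.
            self.write_limit = min(self.max_writes, self.write_limit + 1)

    def try_start_write(self):
        # Issue a write only if we are under the current allowance.
        if self.writes_in_flight < self.write_limit:
            self.writes_in_flight += 1
            return True
        return False

    def on_write_complete(self):
        self.writes_in_flight -= 1


throttle = AdaptiveWriteThrottle()
for _ in range(10):
    throttle.on_read_complete(20.0)   # slow reads while a fill is running
limit_under_load = throttle.write_limit
for _ in range(50):
    throttle.on_read_complete(0.5)    # reads recover
limit_after_recovery = throttle.write_limit
```

Because the allowance is driven by what reads are actually experiencing, writes speed up on their own whenever the read load is light, which is the "write more if you need to read less" behaviour the abstract mentions.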
All right, I guess we can get started; it's 4:30. Today I'll be talking about an I/O scheduler I wrote while working at Netflix. My name is Warner Losh. I have the slides on the web server, and if you access the conference schedule and click through to my talk, the slides are there as well. If you could give feedback on the talk, that would be great; it helps me give better talks next year.
So this is an outline of what I'm going to talk about. I'll talk a little bit about why we even bothered to do this: at Netflix we had some problems, and we solved them a little bit with this. But to understand what I did, I need to give some background on what we do at Netflix and why we needed to solve these problems. These involve SSDs, which is something I like to talk about, so I put that in, and it'll help the understanding. I'll also talk a little bit about FreeBSD's I/O stack and a little bit about CAM. Once I get through all that, I'll talk about what I did to the scheduler for Netflix and what some of the results are. I'll also talk a little bit about some stuff I have done more recently: I put this talk together for an earlier conference originally, and I've updated it a little bit at the end with some of that newer work.
Motivation. Before I started at Netflix, we would see the servers pumping out video every day, and then when we would fill video onto a server there were problems: a big drop in throughput, and the disks reporting 100% busy. That much we can handle, because our system is dynamic. The most worrying part was that a lot of problems were showing up in the quality-of-experience metrics, the QoE metrics on the slide. Those are the things that we use to measure how good the quality of your Netflix video stream is. I'm going to assume everybody knows what Netflix's streaming service is; I'll talk about how it's implemented a little bit. So, digging into this, we found that it happened during content fill. We have a bunch of servers with video on them that stream the video out, and the flash servers serve the most popular video. Every day we change that content a little bit, and in the process of changing it there would be these problems. Some days the problems were small and other days they were big, and nobody knew what it was. So it was, 'Warner, go figure this out,' because the operations people just saw graphs like this.
This is a graph from an older server that's doing, at peak, 16 gigabits a second, but there are these weird dips here, and here, and here, at the same time every day: the time that we were filling the videos. It turns out that filling the videos made the SSDs perform poorly for read loads, and we would see these dips. The operations people don't like to see dips in the network like this; they like nice smooth graphs, nice pretty graphs.
So I'm digging in more. The top graph is a graph of the writes per second on all of the individual disks. You don't need to see all the colors and which ones they are, but they show big spikes: normally there's almost no load, and then boom, boom, boom, big spike. And here, lined up with them, you see read spikes: a read spike here, you can see one here, and for the rest it's kind of hard to tell. That was the first clue that we were having some kind of problem that we needed to look into.
Drilling down into kind of a micro view: this shows how much read our system is doing in megabytes per second. We're at 120 megabytes, 110 megabytes, which is normal for what systems were doing at the time. Then here we start to fill, and you see little w's pop up. As soon as the w pops up and stays, you see two things. One, after a few minutes the read service time goes from happy, happy, happy to not so happy to really, really bad. Two, the amount that we read off of the disk just drops by half. Then we stop writing and the read rate goes right back up. Now, it's not instantaneously back up, because our control system, which controls what load is placed on the servers, noticed that we had a period of about 20 minutes where things kind of sucked and gave us less load, so it takes a while to recover. But as soon as we stop writing, the read times are back to normal. This dotted line here represents about 5 ms, which was the benchmark that I had established for an acceptable read time, because typical SSD read times are half a millisecond to a millisecond, so 10 times that seemed like a good place to start.
So today I'm talking about the scheduler. To understand the scheduler you need to understand a little bit about CAM, and to understand CAM you need to understand our I/O stack. But in addition, to see why we're doing this, you need to understand how SSDs work. They're these cool devices made from NAND flash that everybody thinks are wonderful, but NAND flash has some issues that the SSDs try to paper over. So to understand why I did what I did, I have to dig down below the surface a little bit into what the SSDs are doing, to really get a feel for why I would even bother, why I would even think that I could explain this big dip, and why I thought I could solve it. You also need to understand the Netflix workload, so that's where I'm going to start, with an overview of our Open Connect Appliance.
We have racks of these things in different data centers. They have different Netflix videos on them that we stream to our customers. Each of these systems (this one is slightly newer than the system shown on the graph earlier) can do on the order of 35 to 38 gigabits a second of video streaming. We have stacks like this in multiple data centers that we use to offload the video streams, so the video you're watching isn't coming from Netflix corporate headquarters; it comes from someplace close to your ISP, at least from a network perspective.
So what happens when you have a little set-top box or whatever at home and you say, 'I want to watch this video,' is that the box sends a request to our control point, basically a server in the cloud. It looks to see where all the videos are, sees which one is close to you, and says, 'Here's a URL for the video that will be best for you.' It does this by looking at how loaded the servers are, where the servers are, what capacity we have, all of these things, and it assigns a server. Once it gets the server, the set-top box goes, 'Give me this video.' There are a number of different algorithms that say, 'I'm streaming OK, so I'm going to ask for better quality,' or, 'The network conditions are bad, so I need degraded quality.' But ultimately all of those data streams come off of the appliance, which is nothing but an x86 box in a rack, like that shown in the previous picture.
The other thing that goes on inside these Open Connect Appliances is that every so often we register with the control point to say, 'This is my health, we're still here, and this is the content I have.' During the day we ask what today's new content is, and what content we're going to update: we ask the main control plane, and it gives us the list of content that we should have but don't have yet. The server goes through its list and goes, 'I need to delete these things, I need to download these things, and this other stuff I don't need to worry about.' We also do some housekeeping, where we move some things around: we have some slower disks and faster disks in some systems, and we try to put the less hot content on the slower disks and the more popular content on the faster ones. So that gives you a flavor of the setting: most of the time we're streaming video, and sometimes we have to fill videos. These systems are nothing more than glorified web servers running nginx with a whole bunch of files. We have 14 SSDs in our latest flash box, which is what this work is primarily focused on. These SSDs each have their own file system, so if we lose one SSD we don't lose all the data; we just lose 1/14 of the data. It's a simple UFS on each SSD. We've tried ZFS and it doesn't perform as well as UFS; if you want to find me later I can talk about why, but we go with what performs best, and right now that's UFS.
So, a typical day. To give a rough order of magnitude of the difference between the reads and the writes: the red in this graph is the reads. As you can see, there's quite a bit of disk read; it's almost the entire graph. There's a teeny tiny bit of write here, where we do the swapping of the content, and if you squint just right you can see a little bit of blue. The green on this graph is write and the blue is trim. With UFS, when you delete files it frees the blocks, which sends trims down to the drive, which is this traffic here. If you squint really, really hard you can see that this area here is green, and that once we start into the write we have kind of an exponential backoff on our trims. What's going on here is that this is a fill window: this is when we put new content on the device. When we didn't have these windows, we were noticing the problems that I talked about at the beginning of the talk. During this time this box can't serve nearly as much video, so we take it offline, drain all the clients off, put the new video on, and bring it back online. You can see that it quickly ramps up and is off to the races again. So we had the fill window to avoid these problems. Now, it would be better from an operations point of view if we could write more slowly, or do something different in how we write, so that it doesn't affect reads. We would get back two or three hours of serving time on these servers, and when you have thousands of servers in your network, that can add up to a lot of excess capacity, which turns into some real dollars if you can find a way to otherwise utilize it.
Now, SSDs. SSDs are built out of NAND flash, and NAND flash is very difficult to work with if you have to work with it directly. But the SSD vendors have put a nice face on it that pretends everything is cool and fine and tells all kinds of horrible lies to you, so you think, 'Oh, everything's cool, there are no problems here, I don't have to worry about it.' And that's mostly true, except when it's not, and we were hitting one of the except-when-it's-not cases.
To understand how these SSDs are made up: generally there's a whole bunch of NAND parts on the SSD. Each one of these parts has a number of chips, and it's hierarchical: each chip has so many erase units. The erase unit is the smallest unit you can erase. You can write individual pages into an erase unit, but you have to write them sequentially. Each erase block has a number of pages, and you can read those all day long and it doesn't really affect things. So that's the hierarchy of what's going on. The key takeaway here is that erase blocks are what go bad, and they are the unit that the SSD operates on when it needs to garbage collect. So if it needs space for something new, it has to find it from somewhere else.
It also means that we've effectively got a log-structured file system on here: as you write new blocks, the old blocks create holes, and that's why we need the garbage collector. Also, NAND is single duplex: there's no queuing, and while it's erasing or writing you can't do any reads. Erases take a long time, writes take a long time, and reads are very fast. Reads take a small number of tens of microseconds, writes take as long as a millisecond, and erases maybe ten milliseconds. So you can see that each of these things has a bigger and bigger impact. The media is also unreliable, so sometimes you get garbage collection activity when you wouldn't expect it, because the disk is like, 'Hey, I just read this data and the bit error rate on it is really bad. So that people don't notice, I'm just going to copy this data to a new block and erase this block, and maybe I'll retire it, or maybe just erasing it is enough to make it good again.' So that activity goes on behind the scenes. In a talk last year I gave a lot more detail about this, but that gives you the flavor of why we're doing this.
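To make the "single duplex" point concrete, here is a minimal sketch of one NAND die that can only do one operation at a time. The timings are assumptions chosen to match the rough orders of magnitude quoted in the talk (tens of microseconds for a read, about a millisecond for a program, about ten milliseconds for an erase); the function and table names are hypothetical.

```python
# Assumed per-operation service times in microseconds (illustrative only).
OP_TIME_US = {"read": 50, "program": 1000, "erase": 10000}


def simulate_die(ops):
    """Serve ops strictly one at a time, in arrival order, on a single die.

    ops: list of (arrival_us, kind) tuples sorted by arrival time.
    Returns a list of (kind, latency_us), where latency includes queueing.
    """
    busy_until = 0
    results = []
    for arrival, kind in ops:
        start = max(arrival, busy_until)       # wait if the die is still busy
        busy_until = start + OP_TIME_US[kind]
        results.append((kind, busy_until - arrival))
    return results


# A read on an idle die versus the same read stuck behind a program or erase.
idle_read = simulate_die([(0, "read")])
read_behind_program = simulate_die([(0, "program"), (100, "read")])
read_behind_erase = simulate_die([(0, "erase"), (10, "read")])
```

In this model a read that would take 50 microseconds on an idle die takes 950 microseconds when it arrives just after a program starts, and about 10 milliseconds when it lands behind an erase, which is the kind of read latency jump seen during fills.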
This is just a block diagram of an SSD. You've got some kind of host connection (this is also the block diagram for an NVMe drive). The host interface connects to a processor. The processor has RAM buffers for the data being read and written, and it schedules access to the NAND chips through some kind of NAND controller. The reason you have two layers for that is that the processor is good at sending transactions down to the NAND controller, but the controller really needs to wiggle the bits on the pins to the chips in a very fast manner. It's also good to separate the logical from the physical. So that's why you have this kind of arrangement.
The flash translation layer, like I said, tends to implement a log-based file system, or a log-based device driver, so when you write a particular logical block, it doesn't wind up at that logical block on the physical media. There's a logical-to-physical translation, and there's also a little bit of extra data stored alongside the user data to keep track of all of this, plus a bunch of data that may be proprietary to that particular controller. The FTL also takes care of wear leveling: if you have a DOS file system pounding on the master block, or you're pounding on the superblock in Unix, that one block doesn't get worn out, because each time you write it, it winds up in a different place. This also comes in when you've got all these erase blocks: there's a small pool of free erased blocks, which is a little bit of excess capacity. They do that so that when you write, there's a block ready right away and you don't have to wait for it, usually. Also, over time blocks wear out and have to be retired, and you wouldn't want the first retirement to reduce the capacity of the drive. That's another reason why, when the drive tells you, 'I'm 1 terabyte,' there might be 1.2 or 1.3 terabytes of NAND underneath that. But you can't use all of that, because as you wear it out it goes from 1.3 to 1.2 to 1.1 of capacity underneath the covers, while it still tells the same 1-terabyte lie to the OS.
I mentioned a little bit about the reliability aspect as well: data that's too old is moved, data that's read at a high rate gets moved, and when the drive tries to use a new block and there are errors, it has to do something to put that data somewhere that makes sense, because once you've written a block it can't say, 'No, I didn't really get that, just ignore it.' So there are all these things going on behind the scenes, and it generally boils down to garbage collection. You write a block, and it really takes two blocks, or one and a half blocks, of writes to the disk, and these extra writes are going on for garbage collection. Since you can only do one operation at a time on each of these dies, you can't get good performance from reads while write activity is going on, because different banks are tied up. Sometimes some banks will be tied up and sometimes they won't; it depends on the architecture of the drive, but basically that's the problem if you're doing a lot of writes.
One of the reasons for having trim is to say, 'Throw all of these blocks away at once.' If the drive needs to copy a piece of data that has been thrown away, it's really easy: it doesn't have to read it and it doesn't have to write it. Which blocks have been trimmed is part of the metadata that the drive keeps around, so that it doesn't have to keep track of them anymore. But all this garbage collection that's going on taxes performance, as you might expect. If you're writing to something that has a particular bandwidth and you're reducing that bandwidth, you're not going to get good write performance; or if you're expecting to use some of that bandwidth for reads and it's busy, well, that's not going to work. So that's what's going on in these drives.
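The logical-to-physical translation, the spare pool of erased blocks, and the extra writes caused by garbage collection can be shown with a toy flash translation layer. This is a deliberately tiny sketch, not how any real drive's firmware works: the class `ToyFTL`, its greedy "fewest valid pages" victim choice, and the 8-block by 4-page geometry are all assumptions for illustration.

```python
class ToyFTL:
    """Toy log-structured FTL: append-only writes, greedy garbage collection."""

    def __init__(self, num_blocks=8, pages_per_block=4, logical_pages=20):
        # Keep spare (over-provisioned) blocks so GC always has room to work.
        assert num_blocks >= 4
        assert logical_pages <= (num_blocks - 3) * pages_per_block
        self.ppb = pages_per_block
        self.logical_pages = logical_pages
        # Each block is a list of slots: the logical page stored there, or None if stale.
        self.blocks = [[] for _ in range(num_blocks)]
        self.free = list(range(1, num_blocks))  # erased blocks ready for use
        self.active = 0                         # block currently being filled
        self.l2p = {}                           # logical page -> (block, slot)
        self.host_writes = 0
        self.nand_writes = 0

    def _append(self, lpn):
        if len(self.blocks[self.active]) == self.ppb:
            self.active = self.free.pop()       # active block is full: open a new one
        blk = self.blocks[self.active]
        blk.append(lpn)
        self.l2p[lpn] = (self.active, len(blk) - 1)
        self.nand_writes += 1

    def _invalidate(self, lpn):
        loc = self.l2p.pop(lpn, None)
        if loc is not None:
            block, slot = loc
            self.blocks[block][slot] = None     # old copy becomes a hole

    def _gc(self):
        # Victim: the full block with the fewest still-valid pages.
        candidates = [b for b in range(len(self.blocks))
                      if b != self.active and b not in self.free]
        victim = min(candidates,
                     key=lambda b: sum(p is not None for p in self.blocks[b]))
        for lpn in [p for p in self.blocks[victim] if p is not None]:
            self._append(lpn)                   # copy live data: extra NAND writes
        self.blocks[victim] = []                # erase the block
        self.free.append(victim)

    def write(self, lpn):
        assert 0 <= lpn < self.logical_pages
        self._invalidate(lpn)
        if len(self.free) < 2:                  # running low on erased blocks
            self._gc()
        self._append(lpn)
        self.host_writes += 1

    def trim(self, lpn):
        self._invalidate(lpn)                   # trimmed data never needs copying

    def write_amplification(self):
        return self.nand_writes / self.host_writes


ftl = ToyFTL()
for lpn in range(20):                           # initial sequential fill
    ftl.write(lpn)
wa_after_fill = ftl.write_amplification()
for lpn in (0, 4, 8, 12, 16, 1):                # overwrites punch holes in old blocks
    ftl.write(lpn)
wa_after_overwrites = ftl.write_amplification()
```

In this run the initial fill needs no garbage collection, so every host write is exactly one NAND write. Once overwrites have used up the spare blocks, the collector has to copy still-valid pages before it can erase a block, and NAND writes start to exceed host writes. Calling `ftl.trim(lpn)` on data that is no longer needed marks it stale up front, so the collector never has to copy it.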
that's that's what was causing the the the spike in latency is that the particular driver that we have we talk to the vendor and he told they told us that will garbage collection of hard drives and background of great 22 that 0 when the interface has been busy for 500 ms or more and if you recall the rather large amount of bread that you saw on on that what prior graph for reads we don't have 500 milliseconds of of of down time we barely have some you know there's there's very rarely a time limit we go 5 a 10 milliseconds without talking to these drives so what would happen is what we were writing to the film it's like 0 I gotta catch up don't have room for all of this I have to do all this grooming all the garbage collection to to move everything forward so it was doing that and we were in a writing as fast as we could we were basically killing any read bandwidth ability or in the bandwidth available for reads that we would have and the and so that found a
couple pictures online and if you formatted drive answer writing to it I'm doing random right everything's cool for a little while and then you fill up the drive and then you know after you felt to drive performance drops quite a bit to this big long line on you know after you've exhausted the initial fill in the the performance drops as its of having to behind the scenes of the collapse in do the garbage collection to get blocks to to satisfy the rights so that's that's kind of what kind of physics of what's going on in the drive and so switching gears a little bit this is I stack which again you'd understand to understand what the schedule or would you the top where you got system calls and hand going down to the file system with page catch on that's the thing that generates the requests to John which calls the device driver the data kind of a quick condensed version of all what's going on from George and Kirk's book is well on a diagram came from and so there's a number of places that you could do it scheduled because inserted John module that would mitigate the there would on base the data going down schedule the data going down you could go on and do something in the CAM where to deal with the data on my community maybe only some stuff in the file system as well and who was the 1st it's or performance benefit by grouping and pacing rights just appropriately so that there's not too much and not too little that they make and then we use the band from the scheduler I wrote and I'll talk about why in a few minutes lives in the pressure or the 1st driver makes use of the schedule or to figure out what to schedule on any higher in the tree and we wouldn't know what's going on we wouldn't have the capacity to direct capacities of the device you will run the tree and we had to be changing it although the HPA driver so HCI that LSI drivers on and all that that's that's too much for and frankly that's not the right way so to talking a little bit more
of about some of the stuff going on each each of these levels on numerous of the operator for the on driver or the stack generates the request John does a good job of filtering data so if you need to have partitions so you need a translation of blocks if you wanted to compression it's a good place to do compression if you're doing multiplexing many-to-one one-to-many and TDD striping and mirroring or did I get that backwards but either way and you know that does really well we don't even have a schedule or in the geometry of that was tuned to 50 400 rpm disks that a student of Luigi did but but it's kind of limited right now I can handle all the cuing can have has a fundamental notion of cues that exist in the devices underneath the like can where it knows how many requests individual HPH can do it also knows how many requests we can put down to the individual devices on the hidden forces of different rules of tagged verses non-tagged which is important for a 88 arises the similar and and when necessary deals with multiplexing summary so if you've got 20 disks that can handle 32 or 64 tags each but only have 200 transactions in a particular HPA it can and multiplex between all the drives to to keep the more busy than they would otherwise there are too many device devices like that anymore but that's 1 of the things that the candidates so this is a little
bit of an eye chart. I go into this in more detail in my paper than I'm going to in the talk, but this is the flow that the data takes through CAM. Each of the periph drivers has a strategy routine that's called from GEOM, via a little bit of indirection, with the request. The request is run through disksort to enqueue it, or inserted directly at the end if it's a flash device, since we don't need to do the disksort there. It then calls daschedule, which calls xpt_schedule, which checks to see whether this device has a slot, both in terms of the HBA that I talked about and in terms of the disk: can I give this disk more data? If I can, it calls dastart to tell the driver to go ahead; if I can't, it just returns. dastart calls bioq_first, which takes the data off of the bio queue and submits it to the drive.

For both da and ada we have two queues: one for normal reads and writes, and one for the trim operations (BIO_DELETE operations translate to trim operations). The reason for that is that the data for normal I/O goes out tagged, so you can put a whole bunch out and get them back, and figure out which ones completed by looking at the tags from the NCQ requests. But the trim commands for SATA are not queued commands, and those are serialized: you give the drive one, you can't have any other request in the drive, the drive gets it, does it, and comes back. So we have a separate queue for that, so that we do one from that, and then we do a bunch of other requests, and then we do another one from that; we do this little dance in the queue. SATA 3.1 does add an NCQ version of trim, and I've implemented it, but you can only do it for certain HBAs: you can do it for AHCI, but you can't do it for the LSI HBA. The reason for that is this command is unique
among all the commands that we use, in that there's an auxiliary register that needs to be set. The auxiliary register is just a couple of extra bytes in the FIS that you send to the drive. In the AHCI driver, no problem: those bytes are zero right now, so you just fill those bytes in. There are some interface issues about how to communicate that you need this, but in the implementation I did, you set a flag and the driver sets the bits. But for LSI, the command goes down in a SCSI ATA pass-through CDB, and that's a fixed size; there's no room for those bits in there. So you can set a bit to say do it, and it doesn't do it. So even with modern disks we can't always do it. Now, the implementation I did: that's like fifth on my list, and I have time to do three things, so I'm not tracking it down. But that is something that would help the latency in the system.

So, getting back to the flow: dastart passes the request down to the SIM's action routine, the HBA's entry in the table. That takes the command and ships it off to hardware. Time passes, an interrupt happens, and we wind up going back up the chain, calling xpt_done. xpt_done goes, well, this peripheral wants to know that this thing is done, so it calls dadone. dadone is basically the done routine that is part of the CCB, and it looks at what was completed and says, I need to call biodone to tell the upper layers that it's done. Then the
periph driver goes, OK, I'll call daschedule again, and that basically starts the whole thing over if I have more work. So that's the last bit of the flow of the I/O as it goes through CAM. It's kind of a bit of a slog to go through, but it took me a while to figure this out and to be able to draw this diagram, and I thought it might be useful for other people trying to approach CAM. I go into it in more detail in my paper.
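As a rough illustration of the loop just described (strategy queues the request, schedule checks for a free slot, start hands the request to the device, done frees the slot and schedules again), here is a minimal Python sketch. It is a toy model only: the class and method names are hypothetical stand-ins chosen to mirror the spoken description, not the FreeBSD functions, and a single slot counter stands in for both the HBA and the device limits.

```python
from collections import deque


class ToyPeriph:
    """Toy model of the strategy -> schedule -> start -> done loop.

    All names are illustrative assumptions, not FreeBSD kernel APIs.
    """

    def __init__(self, device_slots):
        self.free_slots = device_slots  # openings the device/HBA will accept
        self.bio_queue = deque()        # requests waiting in the periph driver
        self.in_flight = []             # requests handed to the "hardware"
        self.completed = []             # requests reported back to upper layers

    def strategy(self, bio):
        # Entry point from the upper layers: queue the request, then try to run it.
        self.bio_queue.append(bio)
        self.schedule()

    def schedule(self):
        # Start work only while there is queued work and a free slot.
        while self.bio_queue and self.free_slots > 0:
            self.start()

    def start(self):
        # Take the first queued request and ship it to the device.
        bio = self.bio_queue.popleft()
        self.free_slots -= 1
        self.in_flight.append(bio)

    def done(self, bio):
        # Completion path: release the slot, notify upward, look for more work.
        self.in_flight.remove(bio)
        self.free_slots += 1
        self.completed.append(bio)
        self.schedule()


periph = ToyPeriph(device_slots=2)
for name in ["r1", "r2", "r3"]:
    periph.strategy(name)
# Only two slots exist, so r3 waits in the queue until a completion arrives.
periph.done("r1")
```

After the completion of `r1`, the waiting request `r3` is started without any further call from above, which is the point of calling the schedule step from the done path.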
The scheduler, then. First, the default one: I'll talk about what we do in the default a little bit, and then I'll talk about what we do in the Netflix scheduler. Right now there's no differentiation of I/O; all I/O is treated equal, except for BIO_DELETE, for trims, because they're unequal at the lower levels. This is the default CAM I/O scheduler in FreeBSD (FreeBSD doesn't really have a default scheduler at the GEOM level; it's just whatever the block devices do). Right now the peripherals implement the order, and we use disksort to do the venerable elevator algorithm, which tries to line things up: if the request is ahead of where we are, it can be scheduled next; otherwise it gets put past a marker so that the elevator can go back around and pick it up. For spinning disks this still makes a moderate amount of sense, even though LBA is not a strict ordering; there's kind of a strong partial ordering. For NAND it makes no sense at all; it just wastes cycles. So we also have a policy of just doing the I/O in the order we get it; not that that's guaranteed, but that's just the most convenient and better-performing one.

This is implemented in the CAM periph drivers, in the da driver and the ada driver. It's expressed inline in both of those drivers; all the code is maybe 100 to 200 lines scattered in a bunch of places. So it's not obvious, but it's not too bad, and all the scheduling is done there. Well, except for a few things in the SIM. For ATA NCQ devices, when there are NCQ requests and non-NCQ requests, the SIM has to know about it, because it's an error to send a non-NCQ request with NCQ requests pending: it cancels all of them with unpredictable results. So the SIM has to know that there's a non-NCQ request pending, drain the NCQ requests, send the non-NCQ request, and then go back to NCQ requests. So it's not entirely in the CAM periph, but it mostly is. This generally
performs well for most workloads, unless your workload is pushing the envelope of what the device can do. It works great if your device is well behaved and has symmetric behavior between reads and writes; even if writes are always twice as slow as reads, or vice versa, it's well behaved. But SSDs break all these rules. If you stress them, you can kind of peer beneath the veneer that they've put on themselves and see some of the issues. You can see very asymmetric performance. Most of the time with disks, when you're writing to them, it'll be the same if you write all day. With SSDs it's good, good, good, bad, bad, bad, bad, bad, maybe eventually good again, depending on how you're cycling through the cycle. So, based on workload, you see much higher variability of underlying performance, and so the default doesn't work as well for that, at least not as well as Netflix would like.
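To make the difference between the two default orderings concrete, here is a small Python sketch of a one-way elevator compared with plain arrival order. It is an assumption-laden simplification, not the kernel's disksort code: requests are reduced to bare LBAs, the head position is a single number, and `elevator_order`, `fifo_order`, and `head_travel` are hypothetical helper names.

```python
def elevator_order(head_lba, pending_lbas):
    """One-way elevator: serve requests at or past the head in ascending
    order, then wrap around and serve the ones behind it."""
    ahead = sorted(lba for lba in pending_lbas if lba >= head_lba)
    behind = sorted(lba for lba in pending_lbas if lba < head_lba)
    return ahead + behind


def fifo_order(pending_lbas):
    """Arrival order, which is all a flash device needs."""
    return list(pending_lbas)


def head_travel(head_lba, order):
    """Total distance a mechanical head would move serving `order`."""
    total = 0
    for lba in order:
        total += abs(lba - head_lba)
        head_lba = lba
    return total


pending = [700, 100, 900, 300, 500]
elevator = elevator_order(400, pending)   # [500, 700, 900, 100, 300]
fifo = fifo_order(pending)                # [700, 100, 900, 300, 500]
elevator_travel = head_travel(400, elevator)  # 1500
fifo_travel = head_travel(400, fifo)          # 2500
```

For a spinning disk the shorter head travel is the payoff. A flash device has no head to move, so the sorting work buys nothing, which is why arrival order is the better choice there.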
So this is again the graph; this is the kind of asymmetry we see: we start writing, and then read performance drops. The current scheduler doesn't really take that into account at all. It's just like, oh, you've got I/O? I want to be the most efficient pipe to the lower layers I can be. That's what it's focused on: following the rules for the pipe to the lower levels. Generally that's what you want, except when you do have weird behavior like this. So for the
Netflix scheduler, one of the things we want to try to do is reduce the write amplification. If we can reduce the rate at which write amplification happens, that gives us more read bandwidth from the drives. So if we need to write a gigabyte of data, rather than writing it all at once and not having any read bandwidth, we space it out. We also schedule fewer concurrent writes to the drive. Scheduling fewer concurrent writes will mean that fewer banks are busy with writes and more banks are available for reads, depending on the drive. This one is more dependent on assumptions about the drive; sometimes it works out and sometimes it doesn't. It works for the drives we have.

We're willing to trade a little bit of incremental read latency, and we don't care about the write performance, to a point. So we're able to make some trade-offs, like: if we write this fast, we have this latency; well, we can tolerate some more, so we can write a little faster; or maybe we have to write slower, based on what the latency winds up being. As I talked about earlier, GEOM is too high in the stack: it doesn't know what's available in the lower levels, and it can't make individual measurements of how long each I/O takes. It's not really in a position to have the knowledge that it needs to schedule things effectively.
So I made some changes to the scheduler. I created an abstract interface, so that we abstracted the different I/O scheduler bits into a set of calls. This is a layer of indirection in a driver that has lots of indirection. So on the one hand you could argue, well, you've made it more like the rest of the driver, good; on the other hand you could go, wow, you made another place where I have to dig to find out what's going on. So there are pros and cons depending on your point of view, and people tend to polarize to one extreme or the other when we're talking about it. I'm kind of in the camp that duplicated code is bad, and I also didn't want to implement this twice, because there's a lot of sophistication I wanted to put in, so I decided to just create a library. It's a library you call into, and you can replace the library to get different sorts of behavior or implement different scheduling things.

I converted da and ada to the new interface, because those are all the device drivers that we have in the tree that are periphs of this kind, and I stopped there and did a bunch of testing to make sure there was no regression. The last thing I wanted was to wonder whether a problem was introduced when I abstracted the default scheduler or by my new algorithms; I just wanted to constrain what was going on. It also makes it easier to commit upstream if I can say: if you don't take any of the Netflix-specific scheduler stuff, there's no change. It behaves the same, it tests the same; it's a low-risk thing even if it does move some code. For the Netflix
scheduler, then, I made sure that we had separate queues for reads, writes, and trims. And then I collect a lot of statistics: how long does the device hold an I/O, how long do reads spend in the SIM or lower, basically in the device. I keep a running exponential average of that for reads, writes, and trims, plus the number of reads, the number of writes, all that stuff. All the statistics are kept so that, in trying to figure out what to do next, I can look at the statistics and say, read latency is fine, I don't need to worry about writes; or, read latency is terrible, I do need to worry about it. I can make this decision, or I don't have to make that decision, but I have the data I need to make the decision.

In the first round of the work I added the ability to limit the number of I/Os in the device. In the second round of the work I also added bandwidth limiting and IOPS limiting, and you can pick one of the three. You can say, I don't want you to send more than two writes at a time, or, I don't want you to schedule more than 20 megabytes a second of writes. These are upper bounds you can put in, and among them you get either IOPS or bandwidth or depth, not all of them at once. That is for simplicity of implementation, but it might turn out we need something like that. There are also some adjustments to the code that I didn't do right on the first iteration.

So we have this diagram that I had up here before; the changes are in red. Rather than calling disksort directly, the driver calls the scheduler to queue the work, and the queued work is part of the library, which is responsible for keeping track of that work. I broke it up into multiple queues in the library, for the Netflix scheduler; if you have your own scheduler, you're free to do whatever you want there. Then I added a next-bio call that gets what the next bio is, so the da driver is no longer deciding what the next I/O to do is; it's the CAM I/O scheduler library deciding. And I also had to, because
I'm limiting I/O, add calls to daschedule from dadone so that we wouldn't freeze the queue. I've artificially frozen the queue, and the rest of CAM doesn't know that. So when I remove the constraint under which the queue was frozen, I need to call xpt_schedule, because CAM doesn't know to do that for me automatically. So I added that call from dadone to daschedule. For all these diagrams, da and ada are the same. So I didn't really change a lot in the flow, I just abstracted it; most of the changes are basically in this column, where we decide what to do, and how we do it is largely unchanged. So here's another run
where we do a fill. We start out at a lower rate, but that's just an accident of when I did the test. Netflix video load varies from day to day: the day before House of Cards is released is much lower than the day after the new season of House of Cards is released, and this was a slow one. But you'll see that instead of having the big spikes up in this range for writes, we have one big spike and then it comes down, and it's generally 120 megabytes. We have a couple of pops, a couple of things going up above the tolerance level set for the read latency, but generally limiting the queue depth to one, which is what this shows, kept the write amplification effects low, so we don't really see a change in the bandwidth that we're doing. And again, this was at a different time of day, and the first graph was for an hour and this graph is for three hours, so you see more variation. Putting all these together, we get something like this.
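The pieces behind this result (separate read and write queues, a cap on the number of writes in flight, and a running exponential average of read latency) can be sketched in a few lines of Python. This is a hedged illustration, not the scheduler from the talk: the class name, the smoothing factor `alpha`, and the choice to always serve queued reads before writes are assumptions made for the example.

```python
from collections import deque


class DepthLimitedScheduler:
    """Toy scheduler: reads flow freely, writes are capped in flight.

    Names, the read-first policy, and alpha are illustrative assumptions.
    """

    def __init__(self, max_writes_in_flight=1, alpha=0.125):
        self.reads = deque()
        self.writes = deque()
        self.max_writes_in_flight = max_writes_in_flight
        self.writes_in_flight = 0
        self.alpha = alpha              # weight given to the newest sample
        self.read_latency_ema = None    # running exponential average, in ms

    def queue(self, kind, bio):
        (self.reads if kind == "read" else self.writes).append(bio)

    def next_bio(self):
        # Assumed policy: reads are never held back.
        if self.reads:
            return ("read", self.reads.popleft())
        # Writes go out only while under the depth limit.
        if self.writes and self.writes_in_flight < self.max_writes_in_flight:
            self.writes_in_flight += 1
            return ("write", self.writes.popleft())
        return None  # held writes wait for a completion

    def complete(self, kind, latency_ms):
        if kind == "write":
            self.writes_in_flight -= 1
        elif self.read_latency_ema is None:
            self.read_latency_ema = latency_ms
        else:
            self.read_latency_ema += self.alpha * (latency_ms - self.read_latency_ema)


sched = DepthLimitedScheduler(max_writes_in_flight=1)
sched.queue("write", "w1")
sched.queue("write", "w2")
sched.queue("read", "r1")
first = sched.next_bio()    # ("read", "r1")
second = sched.next_bio()   # ("write", "w1")
third = sched.next_bio()    # None: w2 is held by the depth limit
sched.complete("write", 4.0)
fourth = sched.next_bio()   # ("write", "w2"), released by the completion
sched.complete("read", 1.0)
sched.complete("read", 2.0)
```

Note that `w2` only moves after the caller asks again following a completion. Because the hold is the scheduler's own doing, nothing else will restart the queue, which is why the completion path has to call back into the schedule step.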
So, after I got these results, I kept tinkering with this, setting the bandwidth limits like I talked about earlier. If you're just limiting queue depth, say I'm only going to put three items in the queue at a time, you don't need the time domain: you keep a count, and when you send one down you decrement the count of what's available. Very simple. But when you're trying to say, I'm not going to do more than 20 megabytes a second, you can't do that anymore. You have to have a clock ticking to figure out how much you've sent, or you have to have a quota of how much can be sent: you deduct from the quota as you send the data down, and when it reaches zero you don't send any more down. Then every so often you have a timer go off that replenishes the quota, sees if there's anything stuck in the queue, and kicks off the I/O.

I added both static and dynamic steering of limits. Static steering is: you can have 20 megabytes a second. Dynamic steering of limits tries to say: well, if read latency is a millisecond, it's low, so bump up how much write bandwidth you can have; if the read latency is above 5 ms (these are just arbitrary numbers), then bump down the amount of writes you're allowed to send.
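The quota-plus-timer arrangement described above can be sketched as follows. This is a minimal Python model under stated assumptions: `QuotaLimiter` and its methods are hypothetical names, the timer is a method called by hand rather than a real periodic callout, and a request is allowed to overdraw the quota so that one larger than a single tick's allowance can still be sent and is paid back on later ticks.

```python
from collections import deque

MIB = 1024 * 1024


class QuotaLimiter:
    """Toy bandwidth limiter: a byte quota replenished on every tick.

    Names and the overdraw behaviour are assumptions for illustration.
    """

    def __init__(self, bytes_per_second, ticks_per_second):
        self.per_tick = bytes_per_second // ticks_per_second
        self.remaining = self.per_tick
        self.held = deque()   # request sizes waiting for quota
        self.sent = []        # request sizes already passed to the device

    def submit(self, nbytes):
        self.held.append(nbytes)
        self._drain()

    def tick(self):
        # Replenish, but never bank more than one tick's worth of quota.
        self.remaining = min(self.remaining + self.per_tick, self.per_tick)
        # Kick anything that was stuck waiting for quota.
        self._drain()

    def _drain(self):
        # "Do I have any left?" If so, send and deduct the request size.
        while self.held and self.remaining > 0:
            nbytes = self.held.popleft()
            self.remaining -= nbytes
            self.sent.append(nbytes)


# 20 MiB/s at 20 ticks per second gives 1 MiB of quota per tick.
limiter = QuotaLimiter(bytes_per_second=20 * MIB, ticks_per_second=20)
for _ in range(3):
    limiter.submit(MIB)
sent_before_tick = len(limiter.sent)   # 1: the other two are held
limiter.tick()
sent_after_tick = len(limiter.sent)    # 2: the tick released one more
```

Capping the replenished quota at one tick's worth keeps an idle period from turning into a later burst of writes, which would defeat the purpose of pacing them.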
Basically, to do that I just had to change two places: I added a ticker to replenish the quota, and then I had to change the schedule-next step to look at what to do. I got this done last week, and I wanted to put it in the talk, but I don't have fancy graphs to show what's going on. Generally the results show that the static limits work predictably. For the dynamic limits, it turns out the statistics I use are averaged over several hundred seconds, and as you saw in the earlier graph, the dynamics of the device change second to second. So if what I'm gathering is the average of the last thousand seconds, I'm not going to steer quickly enough to effect the change. So I steer too slowly, and I wind up being stuck at the rails and having sucky performance. So there's still some more work that I need to do on this, but that's where things are.

As for my paper: I've been working on code and I've been working on my paper. I worked on a lot of it for AsiaBSDCon, and the additions since then I added in an appendix in the paper; I haven't gone back and reintegrated them, so I need to do more there.

Some issues have popped up. If I'm rate-limiting, then given how gstat looks at I/O completions, if I keep something in the queue for a long time, that counts as being busy when it's not. So I could be writing to the disk, rate-limited to 10 megabytes a second, and have plenty of read bandwidth, but gstat is reporting 100% busy. That's no good. For our application, we take that number and say we want 90 percent or less; anything above that and the device is saturated, and that's no good. (90 percent or less is kind of an arbitrary decision.) That was great if you're not artificially constraining the device, but if you are artificially constraining it, it doesn't work. So we need to come
up with a new metric to do that. But because gstat operates at the GEOM level, well above where we are in the tree, either we need to somehow tell GEOM we're doing this, or do something else. So those are the issues that I noticed.

The other issue is with UFS: it wakes up every so often and says, here are all the dirty pages. It would be nice if I could tell it, hey, go slower; don't give me these pages faster than this rate, because that way the write queues don't get clogged and the latency for individual writes isn't large. And perhaps you could also do some more intelligent scheduling of writes: all these writes overwrite each other, so they don't all need to be written, whatever. That's an idea at this point, a good research area if anybody's interested, but not something I've actually implemented.
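The dynamic steering problem mentioned earlier, where an average taken over too long a window reacts too slowly, can be shown with a short Python sketch. Everything here is an illustrative assumption: the 1 ms and 5 ms thresholds are the arbitrary numbers used in the talk, while the step size, the limits, the smoothing factors, and the function names are invented for the example.

```python
def ema_update(average, sample, alpha):
    """Running exponential average; a small alpha means a long memory."""
    return average + alpha * (sample - average)


def steer_write_limit(limit_mb, read_latency_ms, low_ms=1.0, high_ms=5.0,
                      step_mb=1, floor_mb=1, ceiling_mb=40):
    """Nudge the write bandwidth limit based on observed read latency."""
    if read_latency_ms > high_ms:
        return max(floor_mb, limit_mb - step_mb)    # reads hurting: write less
    if read_latency_ms < low_ms:
        return min(ceiling_mb, limit_mb + step_mb)  # reads fine: write more
    return limit_mb


def seconds_until_ema_crosses(alpha, start_ms=0.5, spike_ms=20.0,
                              threshold_ms=5.0):
    """Count one-second samples until the average notices a latency spike."""
    average, seconds = start_ms, 0
    while average <= threshold_ms:
        average = ema_update(average, spike_ms, alpha)
        seconds += 1
    return seconds


raised = steer_write_limit(20, read_latency_ms=0.5)   # 21
lowered = steer_write_limit(20, read_latency_ms=6.0)  # 19
held = steer_write_limit(20, read_latency_ms=3.0)     # 20

fast = seconds_until_ema_crosses(alpha=0.5)     # reacts after one sample
slow = seconds_until_ema_crosses(alpha=0.001)   # takes hundreds of samples
```

With the long-memory average, the steering function keeps seeing a "fine" latency for minutes after reads have already degraded, so the limit keeps drifting the wrong way until it hits its ceiling or floor.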
That brings me to the end of my time. I've got a few minutes for questions, if anybody has some questions.

You asked a very good question that I was going to put in my talk, which is: if you can only have one I/O scheduler, wouldn't you want different ones for different things? My answer to that is, well, yeah, that's kind of bad. The way I've structured it, you really have one scheduler, but it can implement different policies, and the scheduler implements the different policies, which you can set individually per device. So you can implement one policy for this disk, which is read-only, and one policy for this disk, which is read-write and mixed, maybe even with different file systems. So you would have one I/O scheduler, but it would implement the range of policies that you would want for a diversity of workloads. Other questions?

We did consider that; that is one way to deal with this. One problem with that is that our steering loop from the control plane operates minute by minute, much longer than half a second. The other is that as clients are streaming data from the devices, they don't actually go back to the control plane; they go, I've got a good server, I'll get the next one, and if it's not there they would see a half-second latency, which would give you a nice little rebuffering dial. We wouldn't want that. These devices have a very limited amount of memory, and they're buffering just a few frames' worth of data, so introducing a half-second delay would affect the quality-of-experience metrics that we have. Right, exactly, yes. So if the clients could tolerate it, that would be great. If it was like 50 ms or 100 ms, we would run the experiment. But it was so long: it's 500 ms before it starts the garbage collection, and we asked the vendor how long the garbage collection takes. Oh, we don't know. Well, how long? We
don't know. Hundreds of milliseconds was the answer we finally got back after a lot of back and forth. So is it half a second, is it a second, a second and a half? It was nondeterministic; we couldn't find the end of it. So that was another reason that we went down this path rather than trying to hold off the I/O for a while. But that is a good question. Otherwise that would be the easiest solution, yes.

So the bulk of my experience has been with Micron SSDs; I think that was disclosed in another talk, so hopefully I'm not sharing company secrets. They perform adequately well, but we buy the consumer grade, not the enterprise grade, and so we have issues like we're seeing here. Other vendors that we haven't bought heavily from behave about the same way, and some of them hide the grooming, the garbage collection activity, a little better than others. But I don't really have personal recommendations, because this work was driven by what we have in our infrastructure, which seems to be a little bit of a monoculture, rather than by getting all these SSDs and finding the best one. I'm sorry I can't help you with that. So, it's over here, or go back and forth. Thank
you very much.

It would absolutely change the parameters, because the less asymmetric the performance you get out of the drive, the less you can do with scheduling, to your point. Enterprise devices tend to have more buffer, and the processor in them is more capable, so they're able to run more sophisticated algorithms and are able to buffer data longer. In addition to the NAND over-provisioning, they also usually give you better NAND, so it doesn't go bad as fast, so there's less background garbage collection going on. Almost every enterprise SSD has RAM with a supercap or a battery to back it up, because otherwise the write guarantees just can't happen. There's another question over here.

Micron listened to us and basically told us, thank you for the suggestion, but it turns out that the processor in the devices that we bought was not capable enough for them to do the background work differently. They could bring in the timers a little bit, but not enough to matter. So certainly for new devices that we're buying, we're absolutely talking to vendors about: can we control the garbage collection? Can we tell you this is a good time to do more, this is a bad time? Or can you tell us that this is a good time to write or a bad time to write? And can we peel back the veneer of the FTL a little bit to expose some of this? Some vendors have been very forthcoming, and some have not; we've had the entire range of responses. So, you in the back, and then over here.
variables would saying you usually will this is the truth of use of the system so here are the 2nd person here basically the guest actually have this much of what only have is 1 of the material in the world we don't know what so this is
You can do that. Even if the drive doesn't support the set maximum LBA command, you can simply never use more than 80% and make sure the rest is trimmed. It's not as good; you have to make sure the rest is trimmed. Yes, you have to do a secure erase, and then you can shrink it, and then you can write and it gets faster. But it takes a secure erase, and so operationally, for us, it's impossible to go down that path.

I've implemented NCQ trim. It works great until the drive becomes corrupted, which tells me that my implementation is flawed in a way that I have not detected or solved. Normally, for AHCI we see about a 3 millisecond latency per request, and on LSI we see about a one and a half millisecond latency per request. Implementing NCQ trim on the drives that we were writing to all the time (they had both video and log files on them), I saw that the 3 ms went to one and a half to 2 milliseconds. So it improved it, it was useful, but if it's corrupting the drive, it's not worth it for
production.

That's true in both ATA and SCSI: trim is an advisory request. It is not a "get rid of this data" command, although the drive might make the data inaccessible. When you read a trimmed block it may return zeros, and there are little bits in the identify data that say whether you get zeros, ones, or who knows what. But it is a request. For some drives, the reason it takes so long for the trim to come back is that they do something.
The drive always does what it's going to do; you can't ask it to do anything different.

Short answer: yes. Long answer: yes, subject to the usual reviews and so forth. In my paper I have a pointer to the code. I've committed it already to the tree on a project branch, which also has the flawed NCQ trim implementation, if somebody wants to tell me my bug; not that I expect them to do that, but some people take great pleasure in that, and I would like to give them the opportunity. But that branch only has the queue depth limit; it doesn't have the bandwidth and the other stuff. I was planning on committing that, but I ran out of time. So I'd like to get it into FreeBSD, maybe even get the Netflix one in as an example, but I haven't climbed that hill yet, because I'd like to get it done with a good working example that has a lot of miles in production. Even though I was kind of done with the scheduler back in March when I presented at AsiaBSDCon, it was only in mid-May that we started using it on hundreds of machines, some small percentage of our machines in production. Now that we've thrown the switch and started using it, I have more confidence and I'm more interested in committing it. I just haven't; it takes time to do that.

OK, so the ticker does two things. It is a callout tick: it's called and re-arms the callout, so it runs 20 or 50 times a second, depending on how big a quantum you want. It takes the bandwidth that you have, divided by that number of ticks; that is the amount of I/O you're allowed to do per tick. Then each I/O says, oh, this is a 1 megabyte request; do I have any left? Yes: do the request and deduct that from the
quota.

No, not necessarily. Right, I didn't do it that way. The first reason, which is kind of a weak reason, is that I didn't want to. The slightly stronger reason is that I didn't want to put those quota computations inline with the fast path, and the done path is part of the fast path. I was a little scared of that. I didn't measure it to see if it would be a problem, but I was afraid, so I didn't. The other thing I wanted to do was more sophisticated analysis: every second, for the dynamic steering, we look at a bunch of different things and do some math on that, and eventually it might be pretty heavy math, to say, oh, in the next second I can only do 20 megabytes a second instead of the 25 I'm doing now, or I can do more. With those decisions being more heavyweight, I don't want to make them in the path of an individual I/O; I want to do that kind of offline, asynchronous to that process. So it probably could be done that way; I just didn't do it.

Any other questions? All right, thank you.

