Optimizing Software Defined Storage for the Age of Flash

Formal Metadata

Title
Optimizing Software Defined Storage for the Age of Flash
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Transcript: English (auto-generated)
Welcome, everybody, to the talk Optimizing Software-Defined Storage for the Age of Flash. The presenters are Krutika, Manoj and Raghavendra.
Give them a big hand. Thank you very much. So we are the last talk keeping you, I guess, from your dinner. That's always an enviable position to be in. So the talk is Optimizing SDS for the Age of Flash.
And basically, you have a lot of fast SSD devices coming in to general use. And we work at Red Hat with a software-defined storage called Gluster. It's there at the bottom of the slide. And the talk is about optimizing the software stack
to get the most out of the performance that the hardware devices are capable of. So since there are multiple speakers, I'll just give you a brief overview of what to expect over the course. I'll start off with an introduction. And as the performance engineer, I
will present sort of a performance-centric view of the problem that we are trying to solve. And for this part, you don't really need a whole lot of background about Gluster, the details, the architecture, and things like that. So after that, Krutika is going to come in with the Gluster architecture, the flow of a request through the stack on the client and the server side, and start describing some of the enhancements that we had to do in order to improve the performance of Gluster for this particular use case. Raghavendra is going to continue with a focus on the RPC layer improvements that we had to drive in order to get better performance out of Gluster.
And we will conclude with some lessons learned and some of the work that is in progress in this area. So Gluster has been around for a while, and many of you probably know about it. And its traditional strength has been for use cases that are more large-file, sequential-IO oriented.
So you have storage for CCTVs, or crash test videos, or IPTV, or backup. A lot of those use cases. And typically, you would back the storage with spinning drives with a good cost
per gigabyte characteristic. So you have a lot of data to store. And the performance is good for these kind of workloads. And that's kind of been the strong point. But there are some trends coming up. On the device side, you have SSDs
becoming commonplace in data center deployments. And some of these, especially the NVMe SSDs, can be really fast in terms of the IOPS that they can deliver. And this is one area that hard disks have typically been not so good. So if you have a small random IO workload,
the seek overheads keep your spinning drives from delivering any kind of good performance. And that's exactly the area where these flash devices are really good. And so in terms of IOPS capabilities, some of these NVMe SSDs can do 300,000 IOPS.
So whereas a typical hard drive would be something around 200 or so. Not 200k, it's just 200 versus 300k or more. The other thing is that as Gluster, some of you may know, is integrating with things like Kubernetes, solutions like Kubernetes, container platforms. So the role of the solution changes a little bit.
It goes from being sort of a scale-out solution targeted towards a particular use case. Like somebody has an IPTV workload, they want to back up with Gluster, so they use that. So instead of that kind of a role, the role of Gluster is shifting a little bit
towards providing a storage infrastructure layer to a larger cluster. So there you are not targeting any specific workload. I mean, you have whatever workload is running in the cluster, you have to support any and all of those.
So you cannot be picky about what you support well and what you don't. So it's important for Gluster in this environment to be able to support a larger variety of workloads, and some of them happen to be databases, things that are IOPS-centric, workloads that Gluster previously did not have to worry about too much, because people were not so interested in putting those workloads on top of Gluster. And once we started looking at this combination of how Gluster behaves for these kinds of workloads on these kinds of devices, we did notice some deficiencies.
And the path that we took to optimize the solution, that's what the talk is really about. And this is very much a work in progress, but at around the time that the submission for FOSDEM was due, we figured out that we had enough material that it would be a good time to go out and tell people about some of our experiences and observations,
and maybe get some feedback from you guys as well. So just for the sake of comparison, what we're doing here is just taking a basic local file system, a Linux XFS file system.
It is backed by a local NVMe SSD drive. And we run a small random IO test on it. And this is what we get. So what you see is, on the x-axis, what we are varying is the number of concurrent requests.
So the number of concurrent jobs that are issuing IO requests, and that we are increasing. And when that number is low, if you just have a single-threaded application doing small IOs, there is a limit to what you can get in terms of IOPS. It's basically dictated by the latency of the stack.
And even for a local file system, you will not be able to push something like an NVMe SSD to the limit. But as you increase that number, as you increase the IO depth, or I'll use that term interchangeably for the number of concurrent requests, you will see that the IOPS will steadily increase up to the point where the device saturates.
So you steadily increase to the point of device saturation. And at the peak, you can get something like 300k plus IOPS. So notice that the y-axis here is IOPS in k, which is 300,000 in this case. So this is what we would expect a good solution
to deliver. And obviously, we don't expect Gluster as a scale-out distributed solution to match the performance of a local file system, because it provides so many more features. But this is kind of from just a performance point of view what you would expect a solution to look like, at least in terms of the shape of the curve.
And just for reference, I'm going to put up the test that the previous slide was based on. And the important thing here is there are a couple of parameters here. This is an FIO job file for read and write. A couple of parameters here that we are varying to change the IO depth and the number of concurrent requests
that are in flight. And one is the number of jobs. There's a number of FIO jobs. And for each FIO job, there is an IO depth parameter, which controls how many outstanding requests it can have at any time. And that is possible because we are using an IO engine, the LibAIO, which allows asynchronous IO.
A couple of other things: the block size is just 4k, which is the small random IO I mentioned. And we are using direct IO, because a lot of these kinds of applications tend to use direct IO. So I'll move ahead.
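The job file itself was shown on a slide and is not captured in this transcript; an equivalent FIO job, sketched here with hypothetical sizes and paths (use rw=randwrite for the write test), would look roughly like this:

    ; illustrative sketch only -- not the exact job file from the talk
    [global]
    ; asynchronous IO engine, allows multiple outstanding IOs per job
    ioengine=libaio
    ; O_DIRECT, bypassing the page cache
    direct=1
    ; 4k block size for small random IO
    bs=4k
    rw=randread
    size=4g
    runtime=60
    time_based=1
    ; hypothetical target directory (the XFS or the Gluster mount point)
    directory=/mnt/test

    [randread-job]
    ; numjobs and iodepth are the two knobs varied to change the number of in-flight requests
    numjobs=4
    iodepth=16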
And the configuration that we are evaluating here, that the numbers are based on, is a fairly high-end Supermicro system. It's just that there is not a whole lot of storage. We are particularly interested in performance on NVMe SSDs, the fastest SSDs, and so it just has a single NVMe drive per system.
And we are looking at a fairly recent release of Gluster, plus some of the patches that we used to try out the performance enhancements. They will eventually make it in, but they have not done so yet.
And Gluster, as Krutika will explain, has a number of optional modules, what we call translators, that you can plug in to add functionality to the SDS. So you have translators which do read-ahead and write-behind and caching and things like that. In this particular case, the volume has been tuned for random IO, which means that there is an option called strict-o-direct, which makes sure that O_DIRECT is respected properly on the client side, and also remote-dio is disabled, which means that direct IO goes all the way to the brick.
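Roughly, that tuning corresponds to volume-set commands along these lines (option names as remembered; they may differ slightly across Gluster versions, and the volume name is hypothetical):

    # honor O_DIRECT on the client and let it reach the brick
    gluster volume set testvol performance.strict-o-direct on
    gluster volume set testvol network.remote-dio off
    # read-ahead / caching translators turned off for the random-IO tests
    gluster volume set testvol performance.read-ahead off
    gluster volume set testvol performance.io-cache off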
So O_DIRECT is respected all the way to the brick. Most of the read-ahead and caching translators and things like that are turned off in our tests. So this is back when we started, or rather the baseline performance before most of the enhancements that we are talking about were in place. And you see that the top line is the XFS read performance that we talked about earlier, that graph. And on the bottom, we have the same test done with Gluster. And you see that that doesn't look too good. So one thing is that as you increase the IO depth,
you are not able to scale up to what the device is capable of in terms of IOPS. So you are flattening out at something like 16 to 32 IO depth. And at that point, the IOPS that you're delivering is just a small fraction of what the device is capable of.
And so we'll use this as a well-defined problem for the rest of the presentation, to see what we can do here and what changes we have to make in order to do better. A couple of things to mention here. One is that, like I said, this is a specific workload, small random IO.
For sequential IO, the request path is somewhat different, and you will see that Gluster performs well even with NVMe devices; typically you are able to deliver what the device is capable of, what the network is capable of. The other thing is if you run this on hard drives,
because the device capability is so low in terms of IOPS, you would not probably notice that Gluster performance is not up to the mark compared to even if you compare it to a local file system, because both of these will be able to saturate the device quite quickly
and deliver the IOPS at that level. What else? So a couple of other mitigating factors. Like I said, there are new workloads that people are wanting to run on solutions like Gluster. But right now, what we are seeing is that most of them are not really interested in pushing the performance envelope.
So new things are coming in, but those are mostly applications where performance is not that sensitive, but that's probably not going to remain that way for a long time. Once people start migrating, the more performance-sensitive applications come in as well. So we have some time to get our act together and fix some of these problems. The other thing is in environments
like the Kubernetes environments, you might know about Heketi and how that serves out Gluster storage to Kubernetes clusters. Sometimes these devices are carved up into multiple Gluster volumes, each volume mounted in a different place.
So when you split things up like that, the aggregate IOPS that you can get might still be quite high. Whereas the numbers that we are looking at here are basically a single client trying to push a single server as much as possible.
So there are some mitigating factors. But even so, this is an important problem for us to solve. And we will look at what worked and some of the lessons that we learned here. And I will hand it off to Krutika to continue with the Gluster architecture and some of the enhancements that we made.
Clarifications, I'll be happy to clarify. This is a single client connected to a single server.
And so what does the network look like? What is a network? This is the client-server configuration. Right. So in this particular case, we have 10-gig ethernet in between the client and the server.
No, so this is about, so it's a 4K IO, right? So with 10-gig, you can go up to about 1 gigabyte per second in terms of throughput. And at 4K per request, the IOPS we are seeing here still works out to less than 200 megabytes per second or so, I think.
So the network is not the bottleneck in this test. Yeah. Yeah, but I should have put that on the slide, what the network is. My apologies, maybe it's too late. But can you define I-O depth once again?
What is the IO depth? IO depth is a term from FIO, which is a popular tool for benchmarking. So for an FIO job, if you are using it with libaio, which is the asynchronous IO engine,
the iodepth setting dictates how many outstanding IOs you can have at a time. So if you have a job with an IO depth of 16, FIO will allow you to have 16 outstanding IOs before blocking further submissions. So it's a measure of the number of concurrent requests
that you can have that the system will experience at any given point in time.
Right, so I'd like to take some minutes talking about what is Gluster and the Gluster architecture, as I believe that's relevant to understanding the rest of the talk. So what is Gluster? So Gluster is a scale-out distributed storage system.
It aggregates storage across multiple servers to provide a unified namespace. So for example, if you have five servers, each of which has around, say, one TB of storage attached to it, what Gluster does is to aggregate all of this
into one single namespace of one TB times five, which is five TB. And your application's data gets distributed across these five disks in a way that is transparent to the user. So Gluster has a modular and extensible architecture. We achieve this using what we call translators; I'll talk about translators more in the subsequent slides. So Gluster is layered on disk file systems that
support extended attributes. So what this means is that it can be layered on most disk file systems that are available today. And Gluster has a client-server architecture where the client reads requests from the application,
does some processing, and sends the request to the appropriate server. And the server is the process that actually executes the actual system call that the application requested and then returns the response.
So let's talk about some Gluster terminology. First, we have the brick. A brick is the basic unit of storage; it is nothing but a file system that is exported from a server. And therefore it has two parts to it: the machine it is exported from, and the path to the file system that is exported. Then we have the server. Servers export your bricks, and a group of servers is called a trusted storage pool in Gluster parlance. Then we have the volume. A volume is a namespace represented as a POSIX mount point; it is basically nothing but a collection of bricks. It is the volume that is mounted on the client to access the data on the Gluster cluster.
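To make the terminology concrete, here is a rough sketch of how bricks from several servers become a volume and get mounted on a client; the server names and brick paths are hypothetical:

    # each server exports a local file system path as a brick
    gluster volume create demovol server1:/bricks/brick1 server2:/bricks/brick1 server3:/bricks/brick1
    gluster volume start demovol

    # on the client, the volume is mounted as a single POSIX namespace
    mount -t glusterfs server1:/demovol /mnt/demovol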
So then we have the translator. A translator is a stackable module with a very specific purpose. In the Gluster architecture, all the different translators are stacked one on top of the other to form a graph.
So each translator receives requests from its parent translator, does what it needs to do, and then sends the request down to its child. So I'll explain a bit more about that with a picture. So this is what a very simple Gluster translator stack
looks like. So on the left, we have the client-side stack. And on the right, we have the server-side translator stack. And the two are connected by a network, which is represented by the dotted arrow in the center. So at the top, we have the FUSE bridge translator, which is responsible for reading file system requests from /dev/fuse. It then transforms the request and sends it down to its child translator. So the operation flows from FUSE bridge
all the way down to the rest of the translators in the middle. And then finally, it reaches client translator, which is at the bottom of the client stack. This is the translator that is responsible for sending the file operation over the network.
And on the brick side, the request is received by the server translator, which again, dissects the operation into all the different parameters for the file operation and then sends it down to its child translator. And this way, the file operation flows all the way down to POSIX translator,
which is the module that actually executes your file system call. So if your application requested a create, then it's POSIX at the bottom which executes the create system call to create your file. And this is executed on the on-disk file system that Gluster is layered on top of. And then it returns the response. And the response flows all the way back up from POSIX to server, and then over the network to the client translator. And then it reaches FUSE bridge again, which writes the response back to /dev/fuse, and it is returned to the application.
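For reference, a heavily trimmed sketch of what such a client-side translator graph looks like in a vol file; the names and the exact set of translators here are illustrative, not the volume used in the talk:

    volume demovol-client-0
        type protocol/client
        option remote-host server1
        option remote-subvolume /bricks/brick1
    end-volume

    volume demovol-write-behind
        type performance/write-behind
        subvolumes demovol-client-0
    end-volume

    volume demovol-io-threads
        type performance/io-threads
        subvolumes demovol-write-behind
    end-volume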
So I'll talk a bit about Gluster threads and their roles. First we have the FUSE reader thread. The FUSE reader thread is actually part of the FUSE bridge module that I talked about at the top. So it serves as a bridge between the FUSE kernel module and the GlusterFS stack. It translates your IO requests from /dev/fuse into actual Gluster file operations, which we call FOPs, and it sits at the top of the Gluster stack. And at the moment we have just one thread that does all this.
Then we have IO threads. So IO threads is a thread pool implementation in Gluster. So the threads process the file operations sent by the translator above it. So what it essentially does is to maintain a queue
where all the requests get enqueued. And then we have a bunch of worker threads that pick up these requests in parallel and start winding them down. So the number of threads are scalable according to the kind of load that is there.
So if there's a lot of load, then the full default number of threads will be active. Otherwise it has the capability to scale down to as few threads as needed. So in the case of parallel requests, IO threads help in winding all of them down in parallel.
Contrast that with what would happen if, for example, client-io-threads were not there in the client stack: the single FUSE reader thread would have to pick up a request and send it all the way down to the client translator
and then come back up and then pick up the next request. So with client IO threads, that long round trip gets avoided in the sense that the request just gets queued till the client IO threads layer and then it just goes back to reading the next request. So we have IO threads on both the client and the server stack.
So on the server it is located quite close to the actual server translator. It's labeled as server IO threads. So we can have a maximum of 64 IO threads. This is a configurable option in Gluster.
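These thread counts are exposed as volume options; a rough sketch of how they might be tuned (option names as remembered and may vary by version; the event threads described next are adjustable the same way):

    # io-threads worker count used on the bricks
    gluster volume set testvol performance.io-thread-count 64
    # event (epoll) threads on the client and the server side
    gluster volume set testvol client.event-threads 4
    gluster volume set testvol server.event-threads 4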
Then we have event threads. So event threads is also a thread pool implementation but at the socket layer. So it is responsible for reading requests from the socket between the client and the server. So this thread count is again configurable and the default is two and it again exists
on both the client and server. So it's executed at the client translator and the server translator level. So if you put all these threads into one single picture, you will see that at the top we have a single thread that reads all the requests
and then parallel requests get distributed at the level of client IO threads and then when the request flows to the server side, again when we have parallel requests, the server IO threads can pick up multiple requests in parallel and wind them down and the response processing again can happen
in parallel at the protocol server and protocol client layer where we have multiple threads receiving parallel requests over the network. So it might seem like with all these threads, we should be getting good performance but unfortunately that's not the case.
So but what we found was that all this multithreading was sufficient to saturate the hardware when we were using spinning disks in the bricks but with NVMe drives, we found that the hardware was far from saturated.
So we set out to understand why. So as part of this, we had to figure out whether the bottleneck was on the client side or the brick side. In order to do this, the same FIO job was executed from two different Gluster clients
on a single brick volume to see whether the number of IOPS increased. So we saw that from 30,000 IOPS, we were able to get 60,000 IOPS which meant that the brick was able to deliver all these requests but it was the client that was not able to send enough requests.
So then we decided to concentrate on fixing the bottleneck on the client side. So one thing that was very clear from this experiment was that with multiple threads and a lot of global data structures, there is bound to be lock contention which can slow down performance.
So to debug the lock contention issue, we used a tool called mutrace. mutrace is a mutex profiler that is used to track down lock contention. It provides a breakdown of the most contended mutexes. So it gives some useful information like what are the top 10 mutex locks that were most contended, how often every lock was locked, how long over the entire run time each mutex was held, and so on.
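mutrace works by preloading a wrapper around the pthread mutex calls and printing a contention summary when the process exits. A rough sketch of running it against the Gluster FUSE client; the flags and names here are illustrative, not the exact command used in this work:

    # run the FUSE client in the foreground under mutrace instead of mounting normally
    mutrace glusterfs --no-daemon --volfile-server=server1 --volfile-id=testvol /mnt/testvol
    # on exit, mutrace lists the most contended mutexes: how often each was locked,
    # how often ownership changed, and how long it was held in total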
So we ran mutrace on the client side. Okay, I'll probably talk about that a bit later. So I'd also like to talk about the debugging tools that already exist in Gluster today for performance. There's a volume profile command which provides per-brick IO statistics for each file operation. This includes information like the number of calls and the minimum, maximum and average latency for every file operation.
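Driving that profiling from the CLI looks roughly like this (volume name hypothetical):

    # turn on per-brick FOP statistics for the volume
    gluster volume profile testvol start
    # dump the stats: call counts plus min/max/avg latency per file operation
    gluster volume profile testvol info
    # turn it off again when done
    gluster volume profile testvol stop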
This is implemented in a translator called io-stats, and we loaded this io-stats translator at multiple places on the stack by hacking the vol file, to see the latency between two translators. So this experiment indicated to us that the highest latency was coming
from the bottommost translator on the client stack and the topmost translator on the server stack which is the protocol client and protocol server translators which operate at the network level. So I'll talk a bit about the enhancements that we made in the process.
So first we have FUSE event history. This is something that was detected by the mutrace tool; it appeared at the top of the list of most contended locks. FUSE bridge maintains a history of the most recent 1K operations that it has performed, in a circular buffer. So it tracks every FOP in both the request and response path. The problem with this data structure is that it is protected by a single mutex lock, and this was causing a lot of contention
between the FUSE reader thread which was operating in the request path and the client event threads which were operating in the response path. So we fixed this by disabling this feature by default since the feature itself is not much useful unless you have to debug some errors
or some file system issue. So we disabled event history and found that this was the impact of disabling FUSE event history. So the random read IOPS improved by around 10K and the random write IOPS improved by around 15,000.
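For illustration only, a minimal C sketch of the kind of structure being described: a fixed-size ring of recent operations guarded by one mutex and touched from both the request and the response path. This is not Gluster's actual code.

    #include <pthread.h>
    #include <stdio.h>

    #define HISTORY_SIZE 1024   /* "most recent 1K operations" */

    struct event_history {
        char            entries[HISTORY_SIZE][128];
        unsigned long   next;   /* index of the slot to overwrite next */
        pthread_mutex_t lock;   /* single lock shared by all threads */
    };

    static struct event_history history = {
        .lock = PTHREAD_MUTEX_INITIALIZER,
    };

    /* Called for every FOP, on both the request path (FUSE reader thread)
     * and the response path (client event threads), so every IO serializes
     * on the same mutex just to record a debug trace. */
    void history_record(const char *msg)
    {
        pthread_mutex_lock(&history.lock);
        snprintf(history.entries[history.next % HISTORY_SIZE],
                 sizeof(history.entries[0]), "%s", msg);
        history.next++;
        pthread_mutex_unlock(&history.lock);
    }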
So then the next thing: there was another RPC layer fix that we added in the middle, which Raghavendra will be talking about a bit later. So when we tested with a combination of that work and the FUSE event history patch, we found that at one point the FUSE reader thread started consuming almost 100% of the CPU. This meant that we could not proceed further unless we fixed this. So to work around it, we added more reader threads to process requests from /dev/fuse in parallel. The impact of this fix was that we got around an 8,000 IOPS increase with four reader threads.
So then it was time to use mutrace again to see where the bottleneck was next. With an IO depth of 16, we saw that mutrace was reporting the iobuf pool at the top of the list. We then increased the IO depth to 32 to see whether the contention indeed increases, and it did: the contention time almost doubled. So this meant that we had to fix the iobuf bottleneck.
So what exactly is an iobuf? An iobuf is a data structure that is used to pass read and write buffers between the client and the server. It is implemented as a pre-allocated pool of iobufs, as with most data structures in GlusterFS, to avoid the cost of doing a malloc or a free every time. This, again, is a single global iobuf pool which is protected by a mutex lock. This meant that both the FUSE reader thread and the client event threads again contended on the same lock in order to allocate or deallocate an iobuf. So to fix this, we came up with a quick fix to see what the impact of decreasing this contention would be. We created multiple iobuf pools in the code, and changed the code so that each iobuf allocation request would select one pool at random or using a round-robin policy. So the lock got kind of striped: instead of all threads contending on the same lock, we now had the contention distributed across multiple pools.
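A minimal C sketch of that striping idea: several independently locked pools, with each allocation picking a pool round-robin so threads spread across locks. This is illustrative only, not the actual GlusterFS iobuf code.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdlib.h>

    #define NUM_POOLS 8

    struct buf_pool {
        pthread_mutex_t lock;       /* one lock per pool instead of one global lock */
        void           *free_list;  /* pre-allocated buffers would hang off here */
    };

    static struct buf_pool pools[NUM_POOLS];
    static atomic_uint next_pool;

    void pools_init(void)
    {
        for (int i = 0; i < NUM_POOLS; i++)
            pthread_mutex_init(&pools[i].lock, NULL);
    }

    /* Round-robin choice stripes the contention: the FUSE reader thread and the
     * event threads mostly end up locking different pools. */
    static struct buf_pool *pool_pick(void)
    {
        unsigned idx = atomic_fetch_add(&next_pool, 1) % NUM_POOLS;
        return &pools[idx];
    }

    void *buf_get(size_t size)
    {
        struct buf_pool *p = pool_pick();

        pthread_mutex_lock(&p->lock);
        /* a real pool would pop a pre-allocated buffer from p->free_list here;
         * malloc stands in to keep the sketch self-contained */
        void *buf = malloc(size);
        pthread_mutex_unlock(&p->lock);
        return buf;
    }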
So more pools means less contention. With this fix, we ran the same FIO test again and saw that the random read IOPS improved by about 4K and the random write IOPS by around 10K. This might not seem like a lot of improvement, but it was vital for making further progress.
Okay, so I'll now let Raghavendra talk about some of the RPC layer improvements.
Good evening, everyone. So the RPC layer is basically Gluster's own custom implementation of the Sun RPC RFC. For every connection, a socket connection between the client and server, this RPC library is loaded on the client as well as on the brick. And since the sockets are non-blocking, we use an epoll-based polling mechanism. And there, as Krutika mentioned, we have a thread pool, which we call event threads, which is global for the entire process, in the sense that if there is a DHT client which is speaking to five bricks, the thread pool reads the responses from all five bricks and sends requests to all five bricks.
So as Krutika mentioned earlier, the Gluster volume profile information showed high latencies between protocol client and protocol server, which basically means that something is not right in the RPC layer. That can also mean that there might be some issues with our implementation of sockets, or rather, with how we are using the sockets for communication. So the first thing which occurred to us: we went and saw that in the request submission path we acquire this lock, and in the reply submission path we also acquire the same lock. So requests and responses might be contending on each other, thereby slowing down the performance. But unfortunately, when we made the fixes and ran the tests again, there were no gains. So this is basically one of the key things we faced in debugging the performance issue: we spent quite a lot of time on this, but it turned out to be not much of an improvement.
So while thinking about this, in the meantime there was an unrelated problem, a performance problem in the erasure coding translator. Basically there were bugs complaining that the performance was not good with the erasure coding implementation. And as you all might be aware, there's a good lot of processing happening in the write code path for erasure coding: you have to compute all those matrices and a lot of other stuff. So there, we had made a fix and it had given a significant boost in performance. If I quote Manoj, it was probably a three to four X performance improvement back at that time. And that enhancement was already present in the 3.13.1 baseline, which we are evaluating for this exercise. So to recap how this polling model works,
basically we'll get an event that there is an incoming message. Since the epoll-based mechanism we use is a one-shot mechanism, until we add the socket back for polling by calling the epoll_ctl system call, we won't be receiving any more events from that socket, which means that any messages coming in on that socket will be waiting until we add the socket back for polling. So the fix which I just spoke about, what it did was try to reduce the time window during which a particular socket was out of polling for incoming messages. And since it had given a significant boost, that was one hint: maybe the bottleneck is in reading the messages from the socket.
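A condensed C sketch of the one-shot epoll pattern being described here, with error handling and the actual RPC decoding left out; this shows the general mechanism, not Gluster's socket code.

    #include <sys/epoll.h>
    #include <unistd.h>

    /* Register a socket for one-shot notification: after an event fires, the fd
     * is disarmed until we explicitly re-arm it. */
    void watch_socket(int epfd, int sock)
    {
        struct epoll_event ev = {
            .events = EPOLLIN | EPOLLONESHOT,
            .data.fd = sock,
        };
        epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &ev);
    }

    void event_loop(int epfd)
    {
        struct epoll_event ev;
        char buf[4096];

        while (epoll_wait(epfd, &ev, 1, -1) > 0) {
            int sock = ev.data.fd;

            /* Read and hand off the incoming message. While we are doing this,
             * the socket is out of the poll set: further messages just queue up. */
            ssize_t n = read(sock, buf, sizeof(buf));
            (void)n; /* a real implementation would decode the RPC message here */

            /* Re-arm the fd. Shrinking the window between the wakeup and this
             * call is what gave the erasure-coding fix its boost. */
            struct epoll_event rearm = {
                .events = EPOLLIN | EPOLLONESHOT,
                .data.fd = sock,
            };
            epoll_ctl(epfd, EPOLL_CTL_MOD, sock, &rearm);
        }
    }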
So while we were discussing that problem, so we thought of basically validating this hypothesis using a test, basically. So all the while we were testing on single-brick and a single-client model. Now we thought of scaling to three-brick distributed model
so that you'll have a larger number of connections. And luckily for us, it gave performance benefits, which kind of pointed out that maybe the single connection is the bottleneck and we need to add more concurrency there. So one of our colleagues, Milind, suggested that we could probably have multiple connections
between the same client and the brick. So till now we had only one connection. And that multiple connection effort gave us the same benefits as the three-brick distributed improvement.
So to recap, these are the improvements we made. One is the event history buffer: we disabled it. The second was adding concurrency while reading requests from the FUSE kernel module, by scaling up the FUSE reader threads.
The third enhancement was reducing the lock contention on the iobuf pools. And the fourth enhancement was adding more connections between the client and the brick. So after all these enhancements,
we got the random read IOPS to peak at around 70K, as compared to approximately 30K IOPS when we started. And the random writes saturated around 80K IOPS, as compared with the 40K earlier. So one point to note is that
this is very much a work in progress. So we are expecting further improvements. And post this conference also we are going to concentrate and carry forward this work. So one of the main things is
me being a developer for most of my life, there were some insights into how people debug performance problems. One thing is, when Manoj and Krutika showed me the lock contention output from mutrace, I could see there are lots of locks which are highly contended. Now the question is, should I fix all the locks? In a typical piece of software like GlusterFS, you would expect at least tens of, maybe hundreds of locks in different code paths. So should I go after all the contended locks? How do I differentiate the locks which are actually bringing down the performance from other highly contended locks which are not affecting the performance, at least at that particular point in time? So thanks to Manoj,
so he again pointed out that maybe we should gather multiple data sets by altering the parallelism. So this is what the graph looks like. We are considering two locks here: one is the iobuf pool lock, and the other one is the memory pool lock. The memory pool is, again, a pool of very commonly allocated objects. So as we can see, the iobuf lock is the most contended lock, and in second place was the memory pool lock. So there are two parameters here. One is the thread counts there, six plus six: one number represents the client event threads, and the other represents the FUSE reader threads. So when we scaled that from six plus six to 12 plus 12, we saw the iobuf lock contention growing, but not the memory pool lock; we didn't see that contention growing even though we had increased the scale. Which points out that, at this point in time, the iobuf lock is the one which is bringing down the performance; the memory pool lock is not. So that's how we arrived at,
let's fix the iobuf pool and go ahead with the test. So the next thing is, when you have highly concurrent loads, multiple threads are necessary even for a very lightweight task. If you observe, the work done by the FUSE reader thread is very small: it is basically reading the request from /dev/fuse and queuing it to the IO threads. So initially the assumption was that since it's not doing that much, it could keep up with that concurrency. But that turned out to be wrong, and we got performance improvements after scaling up the FUSE reader threads. The next drawback we faced was that, while trying to collect the information about the lock contention, mutrace itself added some overhead, and that potentially skewed the bottleneck information.
So over the course of time, we found that there are multiple bottlenecks, and sometimes you might fix one bottleneck but not see the results, because there are other bottlenecks which need to be fixed before that one becomes visible. So for every fix we make, we have to go back and take all the enhancements we had discarded and see whether they become effective at this point in time. One good example is the three-brick distributed test that we had done. We had done that even before disabling the event history, but unfortunately it didn't give any performance improvement before the event history was disabled. Once we disabled the event history, it turned out that the same test gave better results, because earlier the bottleneck was the event history lock; once we removed that, we could see the other enhancement also giving results. So you could have seen that the gains were small. It's not like there was one big fix which gave us 25K IOPS or something like that; it's more small gains adding up to a significant number. We also had our share of micro-optimization along the way. And simple tools, system utilities like top, gave good insights: you can observe that the CPU utilization of a thread is peaking, reaching 100%, which probably means it's time to add more threads.
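For example, per-thread CPU usage of the client process can be watched with something like this (the PID is a placeholder):

    # -H shows individual threads of the process; a thread pinned near 100% CPU
    # is a hint that its work needs to be spread across more threads
    top -H -p <pid of the glusterfs client process>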
One of those micro-optimizations was the effort to add more concurrency between request submission and reply reading in RPC. So again, stepping back, having a high-level view of the architecture, and coming up with a model like the three-brick distributed test helped us to validate the hypothesis even before actually trying out the fix. The point is, you don't have to come up with fixes up front if you have the right models. So, the future work. As we have been pointing out, this is very much a work in progress.
Bottleneck analysis on the client and the brick stacks is not completed work yet; the work till now has concentrated on the client. Another interesting thing we observed was that when we scaled up the FUSE reader threads, there were significant CPU cycles being consumed in a spin lock. This spin lock is acquired in the read code path of /dev/fuse. So we need to figure out what that is and why it is consuming that many CPU cycles. And the lock contention work, especially on the mutexes, is not complete either. The next lock we see is the inode table lock, and it is contended quite heavily. And of course we need more lightweight tracing tools, and probably we need to do some work over there, whether improving the existing mutrace or building a new tool; we don't know that yet.
Since we encountered some inefficiencies in our RPC library, we can probably evaluate other RPC libraries, like gRPC, and see whether they do better. And there is this zero-copy idea, which is basically to use splice so that you don't copy the read and write payloads between the kernel buffers and the user buffers. Since GlusterFS is a user-space file system, when you read a request from the kernel and write the same data to the network to transfer it to the brick, there is this back-and-forth copying between user space and kernel space; splice helps us to avoid that. So there's work in progress there, and we need to see how much it will help. And of course all this work has to be merged into master and shipped, which in itself will take considerable effort.
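A rough C sketch of the splice idea: moving data between file descriptors through a pipe so the payload never has to be copied into a user-space buffer. This is a generic illustration, not the planned GlusterFS implementation.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* Move 'len' bytes from in_fd to out_fd without copying them through a
     * user-space buffer. splice() requires one side of each call to be a pipe,
     * so the pipe acts as the in-kernel staging area. */
    ssize_t zero_copy_forward(int in_fd, int out_fd, size_t len)
    {
        int pipefd[2];
        ssize_t total = 0;

        if (pipe(pipefd) < 0)
            return -1;

        while (total < (ssize_t)len) {
            ssize_t n = splice(in_fd, NULL, pipefd[1], NULL,
                               len - total, SPLICE_F_MOVE);
            if (n <= 0)
                break;
            /* push the same bytes from the pipe out to the destination fd */
            ssize_t m = splice(pipefd[0], NULL, out_fd, NULL,
                               (size_t)n, SPLICE_F_MOVE);
            if (m <= 0)
                break;
            total += m;
        }

        close(pipefd[0]);
        close(pipefd[1]);
        return total;
    }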
And the efforts can be traced via the Bugzilla link given below. So any questions?
Yeah, okay. Second mic is also ready. I have a question. Thank you for holding this microphone. So thank you all three of you. I enjoyed the talk and also your work on GlusterFS. I have a more generic question about GlusterFS. I was wondering how elastic it is these days. So how well does it handle bricks and nodes coming and going?
I mean, do you have dips in your graphs, or is it worse, or is it no issue? So if you lose bricks, how highly available is GlusterFS? Can it deal well with bricks and nodes failing and that kind of stuff? Can you give me an update on this? Yeah, so basically we have two high-availability related translators. One is AFR, which we call automatic file replication. It's a synchronous replication module. So if bricks go down, as long as a quorum of bricks is still up, it can still service IO. So that's one thing. And of course one common complaint
with automatic file replication is that it's a mirroring solution. So if you have a three-way replica, the data is mirrored across all three nodes, and naturally people complained about wastage of storage space. So the next solution we had was erasure coding, and that again brings high availability to the stack.
So that's how GlusterFS handles the node shutdown scenarios. So I was wondering have you tried, this is all running FIO in a single mount point. Have you tried mounting the same export multiple times
and running FIO across multiple instances because you would get multiple FUSE instances in that sense. And does it scale better that way with multiple directories? So the same volume mounted multiple times and multiple mount points, yeah. So that's what I was saying. I was, I don't know if you caught it.
I was talking about some of the mitigating factors, right? So in environments, things like that happen, right? In Kubernetes environments where multiple volumes get mounted on multiple mount points. The same volume getting mounted in multiple mount points on the same client that tends to sometimes distort the picture
in terms of the free space that the client has available and things like that. So I mean those might be workarounds, but I think it's important to solve the basic problems to the extent that you can, right? So that's important as well. But those might be workarounds that might work for some specific users. So that's good, yeah.
We should not stop people who need to catch a train or bus from leaving. So if you still have questions, come and ask them from the speakers here at the podium and let's give them a big hand for the presentation.